Optimize your dataset's memory footprint by using the smallest suitable dtype for each column.

count_errors

count_errors(ser:Series, new_dtype)

After converting ser to new_dtype, count the items whose converted value no longer matches the original according to isclose().
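The check described above can be sketched as follows. This is a minimal illustration, not the library's actual code: the helper name `count_mismatches` is hypothetical, and skipping null values before the comparison is an assumption.

```python
import numpy as np
import pandas as pd


def count_mismatches(ser: pd.Series, new_dtype) -> int:
    """Hypothetical helper: count non-null items that change value after a cast."""
    # Assumption: nulls are ignored, since they cannot be compared meaningfully.
    not_null = ser[ser.notnull()]
    # Silence NumPy overflow warnings raised by lossy float casts.
    with np.errstate(over="ignore", invalid="ignore"):
        converted = not_null.astype(new_dtype)
    # Compare both versions as float64 so the comparison itself is lossless here.
    close = np.isclose(
        not_null.to_numpy(dtype="float64"),
        converted.to_numpy(dtype="float64"),
    )
    return int((~close).sum())


ser = pd.Series([0, 127, 256])
errors_int8 = count_mismatches(ser, "int8")    # 256 does not fit in int8
errors_int16 = count_mismatches(ser, "int16")  # every value fits in int16
print(errors_int8, errors_int16)
```

A conversion with zero mismatches keeps every value intact, so it is a candidate for a smaller dtype.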

map_dtypes_to_choices

map_dtypes_to_choices(ser:Series, optimize:str)

get_smallest_valid_conversion

get_smallest_valid_conversion(ser:Series, optimize:str)

get_improvement

get_improvement(as_type:AsType, current_nbytes:int)

report_on_dataframe

report_on_dataframe(df:DataFrame, unit:str='MB', optimize:str='memory')

Report on columns that might be converted.

Args:
    df (DataFrame): The dataframe to report on.
    unit (str, optional): One of byte, MB, GB. Defaults to "MB".
    optimize (str, optional): One of memory, computation. Defaults to "memory".
        memory: The lowest memory dtype for float is fp16.
        computation: The lowest memory dtype for float is fp32.
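To make the fp16 versus fp32 trade-off concrete, here is a small NumPy sketch. It is only an illustration of the two float widths, not part of the library: fp16 uses half the bytes of fp32 but has a much smaller range.

```python
import numpy as np

# Illustrative values; two of them also appear in the demo dataframe on this page.
values = np.array([1_100.0, 100_101.0], dtype=np.float64)

# Casting past the float16 maximum overflows to inf, so silence the warning.
with np.errstate(over="ignore"):
    as_fp16 = values.astype(np.float16)
as_fp32 = values.astype(np.float32)

# Bytes per item: float16 < float32 < float64.
print(as_fp16.itemsize, as_fp32.itemsize, values.itemsize)

# float16 cannot represent anything above its maximum finite value.
fp16_max = float(np.finfo(np.float16).max)
print(fp16_max)

# 1_100.0 survives the cast to float16, 100_101.0 does not.
print(as_fp16, as_fp32)
```

A column such as `e` in the demo dataframe would therefore not be a valid fp16 candidate, even with optimize="memory".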

# sell_prices.csv.zip
# Source data: https://www.kaggle.com/c/m5-forecasting-uncertainty/
import pandas as pd

# Load the data, then ask for a per-column report of possible conversions.
df = pd.read_csv('data/sell_prices.csv')
report_on_dataframe(df)
            Current dtype  Proposed dtype  Current Memory (MB)  Proposed Memory (MB)  Ram Usage Improvement (MB)  Ram Usage Improvement (%)
Column
store_id           object        category        203763.920410           3340.907715               200423.012695                  98.360403
item_id            object        category        233039.977539           6824.677734               226215.299805                  97.071456
wm_yr_wk            int64           int16         26723.191406           6680.844727                20042.346680                  74.999825
sell_price        float64            None         26723.191406                   NaN                         NaN                        NaN

report_on_dataframe shows the possible dtype conversion for each column and the memory it would save. Note that the library chooses dtypes based on the values currently in the data, so you should still watch for overflow in later transformations.
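The overflow risk can be shown with a short sketch. The week numbers below are made-up values chosen to fit in int16; the point is only that arithmetic on a downcast column keeps the small dtype and can wrap around silently.

```python
import pandas as pd

# Hypothetical week identifiers that fit comfortably in int16 (max 32767).
weeks = pd.Series([11101, 11102, 11103], dtype="int16")

# The result stays int16, so 11101 * 10 = 111010 cannot be represented
# and the stored value wraps around instead of raising an error.
scaled = weeks * 10
print(scaled.dtype, scaled.tolist())

# Widening before the transformation keeps the values correct.
safe = weeks.astype("int32") * 10
print(safe.dtype, safe.tolist())
```

If a later step can push values outside the proposed dtype's range, cast back to a wider dtype first or keep the original dtype for that column.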

optimize_dtypes

optimize_dtypes(df:DataFrame, proposed_df:DataFrame)
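Judging by its signature, optimize_dtypes takes a dataframe together with a report such as the one returned by report_on_dataframe. The sketch below shows one way such a step could work. It is an assumption, not the library's implementation: `apply_proposed_dtypes` is a hypothetical stand-in, and the report is built by hand with only the "Proposed dtype" column shown in the table above.

```python
import pandas as pd


def apply_proposed_dtypes(df: pd.DataFrame, proposed_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: cast each column to its proposed dtype, if it has one."""
    # Columns without a proposal (None/NaN) are dropped and keep their current dtype.
    proposals = proposed_df["Proposed dtype"].dropna().to_dict()
    return df.astype(proposals)


# Made-up sample rows using two column names from the report above.
df = pd.DataFrame({"wm_yr_wk": [11101, 11102], "sell_price": [9.58, 8.26]})

# Hand-built report: one row per column, indexed by column name.
report = pd.DataFrame(
    {"Proposed dtype": ["int16", None]},
    index=pd.Index(["wm_yr_wk", "sell_price"], name="Column"),
)

slim = apply_proposed_dtypes(df, report)
print(slim.dtypes)
```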

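import pandas as pd

if __name__ == "__main__":
    print("Given a dataframe, check for lowest possible conversions:")

    # Build a small frame whose columns need progressively wider dtypes.
    nbr_rows = 100
    df = pd.DataFrame()
    df["a"] = [0] * nbr_rows  # fits in int8
    df["b"] = [256] * nbr_rows  # too large for int8 and uint8
    df["c"] = [65_536] * nbr_rows  # too large for int16 and uint16
    df["d"] = [1_100.0] * nbr_rows  # whole-number float
    df["e"] = [100_101.0] * nbr_rows  # whole-number float above the float16 maximum
    df["str_a"] = ["hello"] * nbr_rows  # a single repeated string
    df["str_b"] = [str(n) for n in range(nbr_rows)]  # all-unique strings
    # The returned report is not printed here, so it does not appear in the output below.
    report_on_dataframe(df)

    print("convert_dtypes does a slightly different job:")
    print(df.convert_dtypes())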
Given a dataframe, check for lowest possible conversions:
convert_dtypes does a slightly different job:
    a    b      c     d       e  str_a str_b
0   0  256  65536  1100  100101  hello     0
1   0  256  65536  1100  100101  hello     1
2   0  256  65536  1100  100101  hello     2
3   0  256  65536  1100  100101  hello     3
4   0  256  65536  1100  100101  hello     4
.. ..  ...    ...   ...     ...    ...   ...
95  0  256  65536  1100  100101  hello    95
96  0  256  65536  1100  100101  hello    96
97  0  256  65536  1100  100101  hello    97
98  0  256  65536  1100  100101  hello    98
99  0  256  65536  1100  100101  hello    99

[100 rows x 7 columns]
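The difference from convert_dtypes is easier to see in the resulting dtypes than in the printed values. The sketch below uses only pandas. `pd.to_numeric(..., downcast="integer")` stands in for "smallest dtype that fits" and is not the function this library uses.

```python
import pandas as pd

# Three columns from the demo above, shortened to three rows.
df = pd.DataFrame({"a": [0] * 3, "b": [256] * 3, "d": [1_100.0] * 3})

# convert_dtypes picks nullable extension dtypes, not the smallest ones.
# Whole-number floats such as column "d" become nullable integers.
nullable = df.convert_dtypes()
print(nullable.dtypes)

# Downcasting picks the narrowest integer dtype that holds the values.
downcast = df[["a", "b"]].apply(pd.to_numeric, downcast="integer")
print(downcast.dtypes)
```

This is why the float columns `d` and `e` print as integers in the output above: convert_dtypes changed their type for null support, not to save memory.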