Attempt to shrink Pandas `dtypes` without losing data so you have more RAM (and maybe more speed)


Install

pip install dtype_diet

How to use

This is a fork of https://github.com/ianozsvald/dtype_diet, maintained to continue support and development of the library with the approval of the original author @ianozsvald.

This tool checks each column to see whether a larger dtype (e.g. the 8 byte float64 and int64) can be shrunk to a smaller dtype without causing any data loss. Dropping an 8 byte type to a 4 (or 2 or 1) byte type keeps halving the RAM requirement for that column. Categoricals are proposed for object columns, which can bring significant speed and RAM benefits.
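As a rough illustration of the idea (a simplified sketch, not dtype_diet's actual implementation): an integer column can be downcast safely when every value fits inside the smaller type's range.

import numpy as np
import pandas as pd

# Simplified illustration, not dtype_diet's own code: propose the smallest
# integer dtype whose range still covers every value in the column.
def smallest_lossless_int(s: pd.Series) -> str:
    for candidate in ("int8", "int16", "int32"):
        info = np.iinfo(candidate)
        if s.min() >= info.min and s.max() <= info.max:
            return candidate
    return str(s.dtype)

s = pd.Series([1, 5000, 30000], dtype="int64")
print(smallest_lossless_int(s))  # int16 - every value fits, so nothing is lost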

Here's a minimal example with 3 lines of code, run on a Kaggle dataset, showing a reduction from 957 MB to 85 MB; you can find the notebook in the repository:

# sell_prices.csv.zip 
# Source data: https://www.kaggle.com/c/m5-forecasting-uncertainty/
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes
df = pd.read_csv('data/sell_prices.csv')
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)
print(f'Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
Original df memory: 957.5197134017944 MB
Proposed df memory: 85.09655094146729 MB
proposed_df
            Current dtype  Proposed dtype  Current Memory (MB)  Proposed Memory (MB)  Ram Usage Improvement (MB)  Ram Usage Improvement (%)
Column
store_id           object        category        203763.920410          3340.907715               200423.012695                   98.360403
item_id            object        category        233039.977539          6824.677734               226215.299805                   97.071456
wm_yr_wk            int64           int16         26723.191406          6680.844727                20042.346680                   74.999825
sell_price        float64            None         26723.191406                  NaN                         NaN                         NaN
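If you want reassurance that nothing was lost, a simple sanity check (hypothetical, not part of the dtype_diet API) is to compare the two frames value by value:

# Hypothetical sanity check, not part of the dtype_diet API: the optimized
# frame should hold exactly the same values as the original, column by column.
for col in df.columns:
    assert (df[col].astype(str) == new_df[col].astype(str)).all(), f'{col} changed'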

Recommendations:

  • Run report_on_dataframe(your_df) to get recommendations
  • Run optimize_dtypes(df, proposed_df) to convert to the recommended dtypes.
  • Consider if Categoricals will save you RAM (see Caveats below)
  • Consider if f32 or f16 will be useful (see Caveats - f32 is probably a reasonable choice unless you have huge ranges of floats)
  • Consider if int32, int16, int8 will be useful (see Caveats - overflow may be an issue)
  • Look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html which recommends Pandas' nullable dtype alternatives (e.g. instead of an int64 column with NaN items being promoted to float64, you get a nullable Int64 that keeps the NaNs with no data loss - see the sketch after this list)
  • Look at Extension arrays like https://github.com/JDASoftwareGroup/rle-array (thanks @repererum for the tweet)
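To illustrate the nullable-dtype recommendation above (a small sketch, not dtype_diet-specific): a missing value normally promotes an integer column to float64, while convert_dtypes keeps a nullable Int64 with the integers intact.

import pandas as pd

# A missing value promotes a numpy-backed integer column to float64...
s = pd.Series([1, 2, None])
print(s.dtype)                   # float64

# ...whereas convert_dtypes proposes the nullable Int64, so the integers survive.
print(s.convert_dtypes().dtype)  # Int64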

Note that report_on_dataframe(your_df) only generates a report - no changes are made to your dataframe.

Caveats

  • reduced numeric ranges might lead to overflow (see the example after this list)
  • the category dtype can have unexpected effects, e.g. the need for observed=True in groupby (see the example after this list)
  • f16 is likely to be emulated in software on modern hardware, so calculations will be 2-3x slower than on f32 or f64
  • we could do with a link that explains the binary representation of floats and ints for those wanting to learn more
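Two of these caveats in miniature (illustrative snippets, assuming a recent pandas version):

import pandas as pd

# Overflow: int8 only holds -128..127, so arithmetic on a downcast column can wrap silently.
s = pd.Series([100, 120], dtype="int8")
print(s + s)  # 200 and 240 do not fit in int8, so the values wrap around

# Categoricals and groupby: unused categories can appear as empty groups unless
# you pass observed=True (the default is changing across pandas versions, so be explicit).
sales_df = pd.DataFrame({
    'city': pd.Categorical(['a', 'a'], categories=['a', 'b']),
    'sales': [1, 2],
})
print(sales_df.groupby('city', observed=False)['sales'].sum())  # includes an empty group for 'b'
print(sales_df.groupby('city', observed=True)['sales'].sum())   # only groups that actually occur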

Development

Contributors

Local Setup

$ conda create -n dtype_diet python=3.8 pandas jupyter pyarrow pytest
$ conda activate dtype_diet

Release

make release

Contributing

The repository is developed with nbdev, a system for developing libraries with notebooks.

Make sure you run the following command if you want to contribute to the library. For details, please refer to the nbdev documentation (https://github.com/fastai/nbdev):

nbdev_install_git_hooks

Some other useful commands

nbdev_build_docs
nbdev_build_lib
nbdev_test_nbs