deepcopy, LGBM and pickle

At first sight, these three things may not sound related at all. I am writing this article to share a bug in lightgbm that I encountered, which eventually led to a deeper understanding of what pickle really is.
Tags: python, pickle, deepcopy
Author: noklam
Published: March 19, 2021

To start with, let's look at some code to get some context.

deepcopy or no copy?

import pandas as pd
import numpy as np
import lightgbm as lgb
from copy import deepcopy

params = {
'objective': 'regression',
'verbose': -1,
'num_leaves': 3
}

X = np.random.rand(100,2)
Y = np.ravel(np.random.rand(100,1))
lgbm = lgb.train(params, lgb.Dataset(X,label=Y),num_boost_round=1)
print("Parameters of the model: ", lgbm.params)
Parameters of the model:  {'objective': 'regression', 'verbose': -1, 'num_leaves': 3, 'num_iterations': 1, 'early_stopping_round': None}
# Deep copy will lose the params
new_model = deepcopy(lgbm)
Finished loading model, total used 1 iterations

You would expect new_model.params to return the same dict, right? Not quite.

print("Parameters of the copied model: ", new_model.params)
Parameters of the copied model:  {}

Surprise, surprise. It's an empty dict. Where did the parameters go? To dig into the issue, let's have a look at the source code of deepcopy to understand how it works.

reference: https://github.com/python/cpython/blob/e8e341993e3f80a3c456fb8e0219530c93c13151/Lib/copy.py#L128

def deepcopy(x, memo=None, _nil=[]):
    """Deep copy operation on arbitrary Python objects.
    See the module's __doc__ string for more info.
    """

    ... # skip some irrelevant code  

    cls = type(x)

    copier = _deepcopy_dispatch.get(cls)
    if copier is not None:
        y = copier(x, memo)
    else:
        if issubclass(cls, type):
            y = _deepcopy_atomic(x, memo)
        else:
            copier = getattr(x, "__deepcopy__", None)
            if copier is not None:
                y = copier(memo)
            else:
                ... # skip irrelevant code

    # If is its own copy, don't memoize.
    if y is not x:
        memo[d] = y
        _keep_alive(x, memo) # Make sure x lives at least as long as d
    return y

In particular, line 17 of the excerpt is the one we care about.
copier = getattr(x, "__deepcopy__", None)

If a class implements the __deepcopy__ method, deepcopy invokes that method instead of the standard copy logic. The following dummy class illustrates this.

class DummyClass():
    def __deepcopy__(self, _):
        print('Just hanging around and not copying.')
o = DummyClass()
deepcopy(o)
Just hanging around and not copying.
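
For contrast, a __deepcopy__ that really copies usually creates a new instance, records it in memo, and deep-copies each attribute. The sketch below shows that pattern on a hypothetical Node class (an illustration only, not taken from lightgbm).

```python
from copy import deepcopy


class Node:
    """Hypothetical class, used only to illustrate the __deepcopy__ hook."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children if children is not None else []

    def __deepcopy__(self, memo):
        # Create an empty instance without calling __init__.
        new = type(self).__new__(type(self))
        # Register the copy first so reference cycles resolve to it.
        memo[id(self)] = new
        # Deep-copy every attribute, passing the memo along.
        for key, value in self.__dict__.items():
            setattr(new, key, deepcopy(value, memo))
        return new


root = Node("root")
child = Node("child", [root])  # child points back to root: a cycle
root.children.append(child)

copied = deepcopy(root)
print(copied is root)                             # False
print(copied.children[0].children[0] is copied)   # True: cycle preserved
```

Registering the new object in memo before copying the attributes is what lets the cycle resolve to the copy instead of recursing forever.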

A lightgbm model is actually a Booster object, which implements its own __deepcopy__. It copies only the model string and nothing else, which explains why deepcopy(lgbm).params is an empty dictionary.

 def __deepcopy__(self, _): 
     model_str = self.model_to_string(num_iteration=-1) 
     booster = Booster(model_str=model_str) 
     return booster 

Reference: https://github.com/microsoft/LightGBM/blob/d6ebd063fff7ff9ed557c3f2bcacc8f9456583e6/python-package/lightgbm/basic.py#L2279-L2282
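
The effect can be reproduced without lightgbm. The sketch below uses a hypothetical ToyBooster class (the name, the params handling and the "tree-data" string are all made up for illustration) whose __deepcopy__ rebuilds the object from its model string only, in the same spirit as the snippet above.

```python
from copy import deepcopy


class ToyBooster:
    """Hypothetical stand-in for a model object; not lightgbm's Booster."""

    def __init__(self, params=None, model_str=None):
        self.params = dict(params) if params else {}
        self.model_str = model_str if model_str is not None else "tree-data"

    def __deepcopy__(self, memo):
        # Rebuild from the serialized model only.
        # params is never passed along, so the copy starts with {}.
        return ToyBooster(model_str=self.model_str)


original = ToyBooster(params={"num_leaves": 3})
clone = deepcopy(original)

print(original.params)                        # {'num_leaves': 3}
print(clone.params)                           # {}
print(clone.model_str == original.model_str)  # True
```

Anything that is not encoded in the string used to rebuild the object is silently dropped by such a hook.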

Okay, so why does lightgbm need a custom implementation? I thought this was a bug, but it turns out there is a deeper reason behind it. I created an issue on GitHub.

https://github.com/microsoft/LightGBM/issues/4085

Their response was: "Custom deepcopy is needed to make Booster class picklable."

🥖 Italian BMT, 🥬 lettuce, 🍅 tomato and some 🥒 pickles please

What really is pickle? And what makes an object picklable?

Python's pickle module is used to serialize and deserialize a Python object structure. Many Python objects (though not all, as the list further down shows) can be pickled so that they can be saved on disk.

Serialization roughly means translating data in memory into a format that can be stored on disk or sent over a network. It's like ordering a chair from Ikea: they will send you a box, not a chair.

The process of taking the chair apart and putting it into a box is serialization, while putting it back together is deserialization. In pickle terms, we call these pickling and unpickling.

deserialize and serialize

What is Pickle

Pickle is a serialization protocol for Python. You can pickle a Python object either to memory or to a file.

import pickle
d = {'a': 1}
pickle_d = pickle.dumps(d)
pickle_d
b'\x80\x04\x95\n\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x01a\x94K\x01s.'
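
The bytes printed above are not meant to be read by humans, but the standard library's pickletools module can disassemble them into named opcodes. A small sketch, pinning protocol 4 because that is the version encoded in the leading \x80\x04 bytes:

```python
import io
import pickle
import pickletools

# Pin protocol 4, the version announced by the leading b'\x80\x04'.
payload = pickle.dumps({'a': 1}, protocol=4)

# Write the disassembly into a string buffer instead of stdout.
buffer = io.StringIO()
pickletools.dis(payload, out=buffer)
listing = buffer.getvalue()

print(listing)  # one opcode per line, e.g. EMPTY_DICT ... SETITEM ... STOP
```

The listing reads like a tiny program: create an empty dict, push the key and the value, store the item, stop.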

The Python dict has now been transformed into a byte string, which can only be understood by Python. We can also deserialize a byte string back into a Python dict.

binary_str = b'\x80\x04\x95\n\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x01a\x94K\x01s.'
pickle.loads(binary_str)
{'a': 1}
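
Pickling to a file works the same way with pickle.dump and pickle.load. A minimal sketch, using a temporary directory and a made-up file name (d.pkl) so that nothing is left behind:

```python
import os
import pickle
import tempfile

d = {'a': 1}

# Hypothetical file name inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "d.pkl")

    # Pickle files must be opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(d, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == d)  # True: same content
print(restored is d)  # False: a new object was built
```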

Reference: https://www.python.org/dev/peps/pep-0574/#:~:text=The%20pickle%20protocol%20was%20originally%20designed%20in%201995,copying%20temporary%20data%20before%20writing%20it%20to%20disk.

What makes something picklable

Finally, we come back to our initial questions: what makes something picklable, and why does lightgbm need a custom deepcopy to make the Booster class picklable?

What can be pickled and unpickled? The following types can be pickled:
* None, True, and False
* integers, floating point numbers, complex numbers
* strings, bytes, bytearrays
* tuples, lists, sets, and dictionaries containing only picklable objects
* functions defined at the top level of a module (using def, not lambda)
* built-in functions defined at the top level of a module
* classes that are defined at the top level of a module
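
The list above can be checked quickly. The sketch below defines a hypothetical helper, is_picklable, that simply tries pickle.dumps and reports whether it succeeded. A lambda and a thread lock serve as examples of objects that fall outside the list.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


print(is_picklable({'a': [1, 2.0, None]}))  # True: containers of picklable items
print(is_picklable(len))                    # True: built-in function
print(is_picklable(lambda x: x))            # False: lambdas cannot be looked up by name
print(is_picklable(threading.Lock()))       # False: locks wrap OS-level state
```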

So pretty much all common data types, functions and classes are picklable. Let's see whether, without __deepcopy__, the Booster class is really not serializable, as claimed.

import lightgbm
from lightgbm import Booster
del Booster.__deepcopy__

params = {
'objective': 'regression',
'verbose': -1,
'num_leaves': 3
}

X = np.random.rand(100,2)
Y = np.ravel(np.random.rand(100,1))
lgbm = lgb.train(params, lgb.Dataset(X,label=Y),num_boost_round=1)


deepcopy_lgbm = deepcopy(lgbm)
lgbm.params, deepcopy_lgbm.params
({'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None},
 {'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None})
pickle.dumps(deepcopy_lgbm) == pickle.dumps(lgbm)
True
unpickle_deepcopy_model = pickle.loads(pickle.dumps(deepcopy_lgbm))
unpickle_model = pickle.loads(pickle.dumps(lgbm))
unpickle_model.params, unpickle_deepcopy_model.params
({'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None},
 {'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None})
unpickle_model.model_to_string() == unpickle_deepcopy_model.model_to_string()
True
unpickle_deepcopy_model.predict(X)
array([0.48439803, 0.48439803, 0.50141491, 0.48439803, 0.48439803,
       0.48439803, 0.50141491, 0.48439803, 0.48439803, 0.48439803,
       0.49029787, 0.49029787, 0.48439803, 0.48439803, 0.48439803,
       0.49029787, 0.48439803, 0.50141491, 0.50141491, 0.50141491,
       0.48439803, 0.50141491, 0.48439803, 0.49029787, 0.50141491,
       0.50141491, 0.48439803, 0.49029787, 0.49029787, 0.49029787,
       0.49029787, 0.50141491, 0.48439803, 0.50141491, 0.48439803,
       0.49029787, 0.50141491, 0.48439803, 0.48439803, 0.48439803,
       0.48439803, 0.50141491, 0.50141491, 0.48439803, 0.49029787,
       0.48439803, 0.48439803, 0.50141491, 0.48439803, 0.48439803,
       0.48439803, 0.48439803, 0.48439803, 0.48439803, 0.50141491,
       0.49029787, 0.48439803, 0.50141491, 0.49029787, 0.49029787,
       0.50141491, 0.50141491, 0.48439803, 0.50141491, 0.48439803,
       0.48439803, 0.48439803, 0.48439803, 0.50141491, 0.48439803,
       0.48439803, 0.50141491, 0.50141491, 0.49029787, 0.50141491,
       0.48439803, 0.49029787, 0.48439803, 0.48439803, 0.50141491,
       0.50141491, 0.48439803, 0.49029787, 0.48439803, 0.48439803,
       0.50141491, 0.49029787, 0.50141491, 0.50141491, 0.49029787,
       0.48439803, 0.49029787, 0.48439803, 0.48439803, 0.48439803,
       0.48439803, 0.48439803, 0.48439803, 0.50141491, 0.49029787])

Last Word

Well… it actually seems to be picklable? I may need to investigate the issue a bit more. For now, the __deepcopy__ does not seem to be necessary.
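
One possible piece of the puzzle, offered as a guess rather than a confirmed reading of the lightgbm source: when a class has no __deepcopy__, deepcopy falls back to __reduce_ex__, the same hook that pickle uses. A class that defines __getstate__ and __setstate__ therefore controls both copying and pickling. The sketch below uses a hypothetical ToyModel class, with a thread lock standing in for an unpicklable native handle.

```python
import threading
from copy import deepcopy


class ToyModel:
    """Hypothetical class holding a resource that cannot be pickled."""

    def __init__(self, params, model_str):
        self.params = params
        self.model_str = model_str
        self.handle = threading.Lock()  # stands in for a native handle

    def __getstate__(self):
        # Drop the unpicklable handle; keep everything else.
        state = self.__dict__.copy()
        del state["handle"]
        return state

    def __setstate__(self, state):
        # Restore the attributes and recreate the handle.
        self.__dict__.update(state)
        self.handle = threading.Lock()


model = ToyModel({"num_leaves": 3}, "tree-data")

# The third element of the reduce value is the state from __getstate__.
state = model.__reduce_ex__(4)[2]
print("handle" in state)                # False

# deepcopy goes through the same hook, so params survive the copy.
copied = deepcopy(model)
print(copied.params)                    # {'num_leaves': 3}
print(copied.params is model.params)    # False
print(copied.handle is model.handle)    # False
```

With hooks like these in place, a separate __deepcopy__ is not needed just to get a working copy.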

I tried to dig into the lightgbm source code and found this potentially related issue: https://github.com/microsoft/LightGBM/blame/dc1bc23adf1137ef78722176e2da69f8411b1feb/python-package/lightgbm/basic.py#L2298