deepcopy, LGBM and pickle

At first sight, these three things may not sound related at all. I am writing this article to share a bug I encountered with lightgbm, which eventually led me to a deeper understanding of what pickle really is.
Tags: python, pickle, deepcopy

Author: noklam

Published: March 19, 2021

To start with, let’s look at some code to get some context.

deepcopy or no copy?

import pandas as pd
import numpy as np
import lightgbm as lgb
from copy import deepcopy

params = {
'objective': 'regression',
'verbose': -1,
'num_leaves': 3
}

X = np.random.rand(100,2)
Y = np.ravel(np.random.rand(100,1))
lgbm = lgb.train(params, lgb.Dataset(X,label=Y),num_boost_round=1)
print("Parameters of the model: ", lgbm.params)
Parameters of the model:  {'objective': 'regression', 'verbose': -1, 'num_leaves': 3, 'num_iterations': 1, 'early_stopping_round': None}
## deepcopy drops the params
new_model = deepcopy(lgbm)
Finished loading model, total used 1 iterations

You would expect new_model.params to return the same dict, right? Not quite.

print("Parameters of the copied model: ", new_model.params)
Parameters of the copied model:  {}

Surprise, surprise. It’s an empty dict. Where did the parameters go? To dig into the issue, let’s have a look at the source code of deepcopy to understand how it works.

reference: https://github.com/python/cpython/blob/e8e341993e3f80a3c456fb8e0219530c93c13151/Lib/copy.py#L128

def deepcopy(x, memo=None, _nil=[]):
    """Deep copy operation on arbitrary Python objects.
    See the module's __doc__ string for more info.
    """

    ... # skip some irrelevant code  

    cls = type(x)

    copier = _deepcopy_dispatch.get(cls)
    if copier is not None:
        y = copier(x, memo)
    else:
        if issubclass(cls, type):
            y = _deepcopy_atomic(x, memo)
        else:
            copier = getattr(x, "__deepcopy__", None)
            if copier is not None:
                y = copier(memo)
            else:
                ... # skip irrelevant code

    # If is its own copy, don't memoize.
    if y is not x:
        memo[d] = y
        _keep_alive(x, memo) # Make sure x lives at least as long as d
    return y

In particular, line 17 of the snippet is the one we care about:
copier = getattr(x, "__deepcopy__", None)

If a class implements the __deepcopy__ method, deepcopy invokes that method instead of the standard copy logic. The following dummy class illustrates this clearly.

class DummyClass():
    def __deepcopy__(self, _):
        print('Just hanging around and not copying.')
o = DummyClass()
deepcopy(o)
Just hanging around and not copying.
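To make the consequence concrete, here is a minimal sketch of a class whose __deepcopy__ rebuilds the object from only part of its state. ModelLike, weights and params are hypothetical names chosen for illustration. This is not LightGBM code, only the same pattern in miniature.

```python
from copy import deepcopy


class ModelLike:
    """Hypothetical stand-in for an object with a custom __deepcopy__."""

    def __init__(self, weights, params=None):
        self.weights = weights
        self.params = params or {}

    def __deepcopy__(self, memo):
        # Rebuild the object from the weights only.
        # The params are deliberately NOT carried over.
        return ModelLike(deepcopy(self.weights, memo))


original = ModelLike([1.0, 2.0], params={"num_leaves": 3})
clone = deepcopy(original)

print(clone.weights)  # [1.0, 2.0]
print(clone.params)   # {}
```

The copy is a perfectly valid object, but anything the custom method does not explicitly copy is silently lost.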

A lightgbm model is actually a Booster object, and Booster implements its own __deepcopy__. It copies only the model string and nothing else, which explains why deepcopy(lgbm).params is an empty dictionary.

 def __deepcopy__(self, _): 
     model_str = self.model_to_string(num_iteration=-1) 
     booster = Booster(model_str=model_str) 
     return booster 

Reference: https://github.com/microsoft/LightGBM/blob/d6ebd063fff7ff9ed557c3f2bcacc8f9456583e6/python-package/lightgbm/basic.py#L2279-L2282

Okay, so why does lightgbm need a custom implementation? I thought this was a bug, but it turns out there is a deeper reason behind it. I created an issue on GitHub.

https://github.com/microsoft/LightGBM/issues/4085

Their response was:

> Custom deepcopy is needed to make Booster class picklable.

🥖 Italian BMT, 🥬 lettuce, 🍅 tomato and some 🥒 pickles please

What is pickle, really? And what makes an object picklable?

Python’s pickle module is used to serialize and deserialize a Python object structure. Most Python objects can be pickled so that they can be saved to disk.

Serialization roughly means translating data in memory into a format that can be stored on disk or sent over a network. It’s like ordering a chair from Ikea: they send you a box, not a chair.

Taking the chair apart and putting it into a box is serialization, while putting it back together is deserialization. In pickle terms, we call these pickling and unpickling.

[Figure: deserialize and serialize]

What is Pickle

Pickle is a serialization protocol for Python. You can pickle a Python object either to memory or to a file.

import pickle
d = {'a': 1}
pickle_d = pickle.dumps(d)
pickle_d
b'\x80\x04\x95\n\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x01a\x94K\x01s.'

The Python dict has now been transformed into a byte string that only Python can understand. We can also deserialize the byte string back into a Python dict.

binary_str = b'\x80\x04\x95\n\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x01a\x94K\x01s.'
pickle.loads(binary_str)
{'a': 1}

Reference: https://www.python.org/dev/peps/pep-0574/#:~:text=The%20pickle%20protocol%20was%20originally%20designed%20in%201995,copying%20temporary%20data%20before%20writing%20it%20to%20disk.
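The example above pickles to memory with dumps and loads. For completeness, here is a small sketch of the file-based counterpart using pickle.dump and pickle.load. The temporary directory and the file name d.pkl are arbitrary choices for illustration.

```python
import os
import pickle
import tempfile

d = {'a': 1}

# Write the pickle to a file in a temporary directory (the path is arbitrary).
path = os.path.join(tempfile.mkdtemp(), 'd.pkl')
with open(path, 'wb') as f:  # pickle data is bytes, so open in binary mode
    pickle.dump(d, f)

# Read it back into a new object.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == d)  # True: same content
print(restored is d)  # False: a new object was built
```

Note the binary modes 'wb' and 'rb'. Opening the file in text mode is a common source of errors when pickling to disk.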

What makes something picklable

Finally, we come back to our initial questions:

> What makes something picklable? Why does lightgbm need a custom deepcopy to make the Booster class picklable?

What can be pickled and unpickled? The following types can be pickled:
* None, True, and False
* integers, floating point numbers, complex numbers
* strings, bytes, bytearrays
* tuples, lists, sets, and dictionaries containing only picklable objects
* functions defined at the top level of a module (using def, not lambda)
* built-in functions defined at the top level of a module
* classes that are defined at the top level of a module
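A quick way to get a feel for the list above is to try both sides of it. The sketch below pickles a built-in function defined at the top level of a module (math.sqrt) and then tries a lambda, which the list excludes. Which exception type is raised for the lambda can vary with the Python version and where the code runs, so the sketch catches both likely candidates.

```python
import math
import pickle

# A built-in function defined at the top level of a module is pickled
# by reference (module name + function name), so it round-trips to the
# very same object.
sqrt_again = pickle.loads(pickle.dumps(math.sqrt))
print(sqrt_again is math.sqrt)  # True

# A lambda has no importable name, so pickle refuses it.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_picklable = False
    print("Cannot pickle lambda:", type(exc).__name__)
```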

So most common data types, functions and classes are picklable. Let’s check whether the Booster class really is not serializable without __deepcopy__, as claimed.

import lightgbm
from lightgbm import Booster
del Booster.__deepcopy__

params = {
'objective': 'regression',
'verbose': -1,
'num_leaves': 3
}

X = np.random.rand(100,2)
Y = np.ravel(np.random.rand(100,1))
lgbm = lgb.train(params, lgb.Dataset(X,label=Y),num_boost_round=1)


deepcopy_lgbm = deepcopy(lgbm)
lgbm.params, deepcopy_lgbm.params
({'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None},
 {'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None})
pickle.dumps(deepcopy_lgbm) == pickle.dumps(lgbm)
True
unpickle_model = pickle.loads(pickle.dumps(lgbm))
unpickle_deepcopy_model = pickle.loads(pickle.dumps(deepcopy_lgbm))
unpickle_model.params, unpickle_deepcopy_model.params
({'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None},
 {'objective': 'regression',
  'verbose': -1,
  'num_leaves': 3,
  'num_iterations': 1,
  'early_stopping_round': None})
unpickle_model.model_to_string() == unpickle_deepcopy_model.model_to_string()
True
unpickle_deepcopy_model.predict(X)
array([0.48439803, 0.48439803, 0.50141491, 0.48439803, 0.48439803,
       0.48439803, 0.50141491, 0.48439803, 0.48439803, 0.48439803,
       0.49029787, 0.49029787, 0.48439803, 0.48439803, 0.48439803,
       0.49029787, 0.48439803, 0.50141491, 0.50141491, 0.50141491,
       0.48439803, 0.50141491, 0.48439803, 0.49029787, 0.50141491,
       0.50141491, 0.48439803, 0.49029787, 0.49029787, 0.49029787,
       0.49029787, 0.50141491, 0.48439803, 0.50141491, 0.48439803,
       0.49029787, 0.50141491, 0.48439803, 0.48439803, 0.48439803,
       0.48439803, 0.50141491, 0.50141491, 0.48439803, 0.49029787,
       0.48439803, 0.48439803, 0.50141491, 0.48439803, 0.48439803,
       0.48439803, 0.48439803, 0.48439803, 0.48439803, 0.50141491,
       0.49029787, 0.48439803, 0.50141491, 0.49029787, 0.49029787,
       0.50141491, 0.50141491, 0.48439803, 0.50141491, 0.48439803,
       0.48439803, 0.48439803, 0.48439803, 0.50141491, 0.48439803,
       0.48439803, 0.50141491, 0.50141491, 0.49029787, 0.50141491,
       0.48439803, 0.49029787, 0.48439803, 0.48439803, 0.50141491,
       0.50141491, 0.48439803, 0.49029787, 0.48439803, 0.48439803,
       0.50141491, 0.49029787, 0.50141491, 0.50141491, 0.49029787,
       0.48439803, 0.49029787, 0.48439803, 0.48439803, 0.48439803,
       0.48439803, 0.48439803, 0.48439803, 0.50141491, 0.49029787])

Last Word

Well… it actually seems to be picklable? I may need to investigate the issue a bit more. For now, the custom __deepcopy__ does not seem to be necessary.
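One general property of the copy module may be relevant to this observation: when a class defines no __deepcopy__, deepcopy falls back on __reduce_ex__, the same machinery pickle uses, so any __getstate__ and __setstate__ hooks on the class also shape its deep copies. The sketch below shows this with a hypothetical Handle class. It is an illustration of the standard library behaviour, not a statement about how Booster is implemented.

```python
from copy import deepcopy

calls = []  # records which hooks deepcopy ends up invoking


class Handle:
    """Hypothetical class with pickle hooks but no __deepcopy__."""

    def __init__(self, text, params):
        self.text = text
        self.params = params
        self.resource = object()  # pretend this is a raw, uncopyable handle

    def __getstate__(self):
        calls.append("getstate")
        state = self.__dict__.copy()
        del state["resource"]  # leave the raw handle out of the state
        return state

    def __setstate__(self, state):
        calls.append("setstate")
        self.__dict__.update(state)
        self.resource = object()  # rebuild a fresh handle


h = Handle("tree", {"num_leaves": 3})
clone = deepcopy(h)

print(calls)         # ['getstate', 'setstate']
print(clone.params)  # {'num_leaves': 3}
```

In this sketch the params survive because the state dict carries them. Whether that matches what happens inside lightgbm would need to be checked against its source.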

I dug into the lightgbm source code and found something potentially related: https://github.com/microsoft/LightGBM/blame/dc1bc23adf1137ef78722176e2da69f8411b1feb/python-package/lightgbm/basic.py#L2298