Lessons learnt from Kaggle - Bengali Image Classification Competition

python
kaggle
My first silver medal in a Kaggle competition - the Bengali Image Classification Competition. 10 lessons learnt!
Published

March 21, 2020

I teamed up with a friend to participate in the Bengali Image Classification Competition. We struggled to get a high rank on the public leaderboard throughout the competition. In the end, the result was a big surprise to everyone, as the leaderboard shook up a lot.

Public Leaderboard

The final private scores were much lower than the public scores, which suggests that most participants were over-fitting the public leaderboard.

The Classification Task

This is an image classification competition. We need to predict 3 components of handwritten Bengali graphemes: the root, the consonant diacritic and the vowel diacritic. It is a typical classification task, much like the MNIST dataset.

Grapheme example

Evaluation Metrics

The competition uses macro-averaged recall as the evaluation metric. In general, people get >96% recall in training; the top teams even get >99% recall.

```python
import numpy as np
import sklearn.metrics

# `solution` and `submission` are assumed to be long-format DataFrames with a
# 'component' column (grapheme_root / consonant_diacritic / vowel_diacritic)
# and a 'target' column holding the predicted / true class labels.
scores = []
for component in ['grapheme_root', 'consonant_diacritic', 'vowel_diacritic']:
    y_true_subset = solution[solution['component'] == component]['target'].values
    y_pred_subset = submission[submission['component'] == component]['target'].values
    scores.append(sklearn.metrics.recall_score(
        y_true_subset, y_pred_subset, average='macro'))
final_score = np.average(scores, weights=[2, 1, 1])
```

Model (Bigger is still better)

We started with xresnet50, which is a relatively small model. Our assumption was that this classification task is fairly standard, so the choice of model would not be the most important factor. We therefore picked xresnet50, as it has good accuracy and trains relatively fast.

Near the end of the competition, we switched to a larger model, se-resnext101. It requires triple the training time, and we had to scale down the batch size as it did not fit into GPU memory. Surprisingly (maybe not surprising to everyone), the bigger model boosted performance more than I expected, by ~0.3-0.5% recall. That is a big improvement when recall is already very high (~0.97); in other words, it removed ~10% of the remaining error just by using a better model. Not bad!
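Swapping in the bigger backbone is only a few lines of code. A minimal sketch, assuming the Cadene `pretrainedmodels` package (the post does not say which implementation we actually used) and separate classification heads added on top:

```python
import torch.nn as nn
import pretrainedmodels

# Load an ImageNet-pretrained SE-ResNeXt101 backbone (assumption: Cadene's package).
backbone = pretrainedmodels.__dict__['se_resnext101_32x4d'](
    num_classes=1000, pretrained='imagenet')
n_features = backbone.last_linear.in_features   # 2048 for this architecture
backbone.last_linear = nn.Identity()            # strip the ImageNet classifier
# ...attach the root / vowel / consonant heads on top of `n_features` features.
```

The trade-off we observed: roughly triple the training time and a smaller batch size, in exchange for the ~0.3-0.5% recall gain mentioned above.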

Augmentation

There is never “enough” data for deep learning, so we always try our best to collect more. Since we could not collect more data here, we relied on data augmentation. We started with rotation + scaling. We also found MixUp and CutMix very effective for boosting performance: they gave us roughly a 10% error reduction initially, from 0.96 -> 0.964 recall.

CutMix & MixUp

Example of Augmentation

MixUp is simple: if you know about photography, it is similar to a double exposure of your photos. It overlays two images (cat + dog in this case) with a sampled mixing weight. So instead of the target being P(dog) = 1, the new target could become P(dog) = 0.8 and P(cat) = 0.2.

CutMix shares a similar idea: instead of overlaying 2 images, it crops out a certain ratio of the image and replaces it with a patch from another one, mixing the labels in proportion to the patch area.
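A minimal PyTorch-style sketch of both techniques (this mirrors the common reference implementations, not our exact training code):

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup(x, y, alpha=0.4):
    """Blend a batch with a shuffled copy of itself; return both label sets."""
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam

def cutmix(x, y, alpha=1.0):
    """Paste a random rectangle from shuffled images into the batch."""
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0), device=x.device)
    h, w = x.shape[-2:]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    x = x.clone()
    x[..., y1:y2, x1:x2] = x[index][..., y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)   # correct lam for the actual box area
    return x, y, y[index], lam

def mixed_loss(logits, y_a, y_b, lam):
    """Cross-entropy against both targets, weighted by the mixing ratio."""
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
```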

It always surprises me that this augmented data does not make much sense to a human, yet empirically it is very effective at improving model accuracy and reducing overfitting.

Logging of Experiment

I normally just log my experiments with a simple CSV and some print messages. This starts to get tedious when more than one person is working on the project, and it is important to communicate the results of experiments. I explored Hydra and wandb in this competition and both are very useful.

Hydra

Hydra for configuration composition

It is often a good idea to make your experiments configurable. We used Hydra for this purpose, and it is useful for composing different configuration groups. By making your hyper-parameters configurable, you can define an experiment entirely by configuration files and run multiple experiments. By logging the configuration along with the training statistics, it is easy to do cross-model comparison and find out which configuration works for your model.

I have written a short example of how to use Hydra.
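Here is a minimal sketch of what a Hydra entry point looks like (the config group names below are made up for illustration):

```python
# train.py -- the YAML files live under conf/, e.g.
#   conf/config.yaml defines: defaults: [{model: xresnet50}, {optimizer: adam}]
#   conf/model/xresnet50.yaml, conf/model/se_resnext101.yaml, conf/optimizer/adam.yaml
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))   # the fully composed configuration for this run
    # train_model(cfg)              # hypothetical training entry point

if __name__ == "__main__":
    main()
```

Switching an experiment is then just a command-line override, e.g. `python train.py model=se_resnext101 optimizer.lr=1e-4`, and the composed config can be logged next to the training statistics.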

Wandb

wandb (Weights & Biases) does a few things. It provides built-in functions that automatically log your model statistics, and you can also log custom metrics with simple function calls. For example, you can:

  • Compare the configurations of different experiments to find out which model performs best.
  • Use built-in functions for logging model weights and gradients for debugging purposes.
  • Log any metrics that you want.

All of these combine to make the collaboration experience better. It is really important to sync progress frequently, and having everyone's results on a single platform makes these conversations easier.
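A rough sketch of how this looks in code (the project name and the training helpers are hypothetical placeholders):

```python
import wandb

wandb.init(project="bengali-grapheme",
           config={"arch": "se_resnext101", "image_size": 128, "lr": 1e-3})
wandb.watch(model, log="gradients")   # built-in weight/gradient logging for debugging

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)   # hypothetical helper
    macro_recall = evaluate(model, valid_loader)         # hypothetical helper
    wandb.log({"epoch": epoch, "train_loss": train_loss, "macro_recall": macro_recall})
```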


Stochastic Weight Averaging

This is a simple yet effective technique which gave about a 0.3-0.4% boost to my model. In simple terms, it takes snapshots of the model weights during training and averages them at the end. It provides a cheap way to get a model ensemble while only training 1 model. This was important for this competition, as it kept the training time short enough to get feedback within hours while also reducing over-fitting.
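PyTorch (>=1.6) ships utilities for this in `torch.optim.swa_utils`; a minimal sketch (not necessarily the exact implementation we used in our training framework):

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

swa_model = AveragedModel(model)                 # keeps a running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=1e-4)    # learning rate for the SWA phase
swa_start = 20                                   # start averaging near the end of training

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_loader)      # hypothetical helper
    if epoch >= swa_start:
        swa_model.update_parameters(model)               # take a snapshot into the average
        swa_scheduler.step()

update_bn(train_loader, swa_model)   # recompute BatchNorm statistics for the averaged weights
# evaluate / predict with swa_model instead of model
```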


Larger is better (image size)

We downsampled our images to 128x128 throughout the competition, as this makes the model train faster and we believed most techniques would transfer to a larger image size. It is important to keep your feedback loop short (hours, not days): you want your training data to be as small as possible while still being representative of the full dataset. Once we scaled our images to full size, it took almost 20 hours to train a single model, and we only had a small window to tune the hyper-parameters before the competition ended.
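The downsampling itself is trivial; a sketch, assuming the roughly 137x236 grayscale images from the competition data:

```python
import cv2
import numpy as np

def preprocess(img: np.ndarray, size: int = 128) -> np.ndarray:
    """Resize a grayscale grapheme image (roughly 137x236) down to size x size."""
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
    return img.astype(np.float32) / 255.0   # scale pixels to [0, 1]
```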

Debug & Checkpoint

There was a time when we were developing our models separately and didn't sync our code for a while. We refactored the code during that period, and it was a huge mistake: it turned out our pre-refactor code trained a much better model, and we had introduced some unknown bug. It was almost impossible to find because we had changed multiple things at once. Neural networks are hard to debug, so testing them thoroughly is important. Injecting a large amount of code at once may let you run an experiment earlier, but you may pay much more time debugging it afterwards.

I think this is applicable even if you are working alone:

  • Keep your changes small.
  • Establish a baseline early, and always do a regression test after introducing a new feature (especially after code refactoring); see the sketch below.
  • Create checkpoints so you can roll back anytime, especially if you are not working on it every day.
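One cheap form of regression test (a sketch of the idea, not our actual test suite) is to fix a seed, overfit a single small batch, and assert that the loss still drops after every change:

```python
import torch
import torch.nn.functional as F

def check_can_overfit_one_batch(build_model, batch, steps=50):
    """Sanity/regression check: the training loop should still drive the loss down."""
    torch.manual_seed(0)
    model = build_model()                     # hypothetical factory for your current model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = batch                              # one small, fixed batch of (images, labels)
    first_loss = None
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        first_loss = first_loss if first_loss is not None else loss.item()
    assert loss.item() < 0.5 * first_loss, "training loop may have regressed"
```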

Implementation is the key to a Kaggle competition (and in real life too). It does not matter how great your model is; a tiny little bug could silently damage it.

Use auxiliary labels

As mentioned earlier, this competition requires predicting the root, vowel and consonant parts. The training data actually provides the whole grapheme as well. Lots of people said that if you train with the grapheme as an extra (auxiliary) label, it improves the model greatly and gets recall >98% easily.

This is something we could not reproduce throughout the competition; we tried it at the very last minute but it did not seem to improve our model. It turned out lots of people were overfitting the training data, as the test set contains many more unseen characters.

But it is still a great takeaway that training with labels that are not your final desired output can still be very useful.
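In practice the auxiliary label is just one more classification head on the shared features. A sketch (the root/vowel/consonant class counts come from the training data; the backbone and the grapheme count are placeholders):

```python
import torch.nn as nn

class GraphemeModel(nn.Module):
    def __init__(self, backbone, feat_dim, n_grapheme,
                 n_root=168, n_vowel=11, n_consonant=7):
        super().__init__()
        self.backbone = backbone                         # any CNN producing feat_dim features
        self.root = nn.Linear(feat_dim, n_root)
        self.vowel = nn.Linear(feat_dim, n_vowel)
        self.consonant = nn.Linear(feat_dim, n_consonant)
        self.grapheme = nn.Linear(feat_dim, n_grapheme)  # auxiliary head, only used in training

    def forward(self, x):
        feat = self.backbone(x)
        return self.root(feat), self.vowel(feat), self.consonant(feat), self.grapheme(feat)
```

The total loss is then a (weighted) sum of the cross-entropy of each head; at inference time the grapheme head is simply ignored.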

Weighted loss

The class distribution of the training dataset is very imbalanced, but to get a good result we need to predict every single class accurately (macro recall). To deal with this issue, we chose to use class weights, where a higher weight is applied to rare classes in the loss. We don't have an ablation study for this, but it seemed to help close the gap between accuracy and recall and allowed us to train the model slightly better.
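A sketch of the idea with inverse-frequency class weights (our exact weighting scheme may have differed):

```python
import torch
import torch.nn as nn

# `train_df` is assumed to be the training labels DataFrame with a 'grapheme_root' column.
counts = train_df['grapheme_root'].value_counts().sort_index().values
weights = counts.sum() / (len(counts) * counts)          # rare classes get larger weights
root_criterion = nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float32))   # same idea for vowel / consonant heads
```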

Find a teammate!

Lastly, please go and find a teammate if you can. It is very common to start a Kaggle competition, but not so easy to finish one. I stopped for a month during the competition due to my job, and it is really hard to get back into a competition after stopping for so long. Having a teammate helps keep you motivated, and in the end it was a great learning experience for both of us.

Pretrained Model

We also tried using a pretrained model, as it allows shorter training and gives better performance through transfer learning (using weights learned from a large dataset as the initial weights). It gave our model a bit of improvement. The usual recipe is:

  • Finetune the model head while keeping the other layers frozen (except the BatchNorm layers).
  • Unfreeze the model and train all the layers together.

I also tried training the model directly with discriminative learning rates while not freezing any layers at all. It performed similarly to the freeze-then-unfreeze fine-tuning, so I ended up just training the entire model from the beginning.
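For completeness, here is a plain-PyTorch sketch of the freeze-then-unfreeze recipe described above (the `train()` helper and the `backbone` attribute are placeholders for whatever your training loop and model expose):

```python
import torch

def set_backbone_trainable(model, trainable: bool):
    """Freeze/unfreeze the backbone, always leaving BatchNorm layers trainable."""
    for module in model.backbone.modules():
        is_bn = isinstance(module, torch.nn.modules.batchnorm._BatchNorm)
        for p in module.parameters(recurse=False):
            p.requires_grad = trainable or is_bn

# Stage 1: train only the heads (and BatchNorm) for a few epochs.
set_backbone_trainable(model, False)
train(model, epochs=3, lr=1e-3)       # hypothetical training helper

# Stage 2: unfreeze everything and continue with a lower learning rate.
set_backbone_trainable(model, True)
train(model, epochs=30, lr=1e-4)
```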

If the code works, don’t touch it

This is probably not a good habit in general, but in a competition I suggest leaving working code alone. We spent a lot of time debugging our code after a refactor and ended up just rolling back to an older commit and cherry-picking the new features. In a competition, you don't have enough time to test everything. You do not need a nice abstract class for every feature; some refactoring to keep your functions/classes clean is probably needed, but do not overspend your time on it. It is even common to jump between frameworks (you may find others' Kernels useful), so it is not possible to structure your code perfectly.

  • If someone has created a working submission script, use it!
  • If someone has created a working pre-processing function, use it!

Don’t spend time trying to optimize this code unless it is necessary; it is often not worth it in a competition context. Focus instead on adding new features, trying out new models and testing new augmentation techniques.

Summary

This was a great learning experience and it refreshed some of my outdated computer vision knowledge. If you have never joined a competition, find a friend and get started. If you have just finished one, try writing it up and sharing your experience. 😉