How to save or load data with Kedro?

DataCatalog is the first concept you learn in Kedro. Although it is important, users almost never have to interact with it directly. This post explains the concept with a minimal example of how to save and load data with Kedro.
Tags: python, kedro
Published: March 26, 2024

The first answer you may get is "use the data catalog", but what really is the data catalog? Some may say it is catalog.yml; others may mention the DataCatalog class. Both answers are true, but each lacks a bit of context.

Let’s focus on “how” to load or save data in a Kedro Project.

Create a Kedro Node

First, we need a Python function that takes some inputs and produces some outputs:

import pandas as pd

def dummy_func():
    df = pd.DataFrame([{"foo": "bar"}])
    # df.to_csv("my_data.csv") # You don't need to save it explicitly
    return df

This function takes no input but produces a DataFrame. How does Kedro know how to save this data? Kedro works with a Node rather than a bare function. A Node is a Python function + inputs + outputs, where the inputs and outputs are merely names for the data rather than the actual objects.

from kedro.pipeline import node, pipeline

dummy_node = node(func=dummy_func, inputs=None, outputs="my_data")

You can call the node directly, but it's usually not necessary because the Kedro Runner and Kedro Pipeline handle this for you.

dummy_func()
#    foo
# 0  bar

result = dummy_node()
result
# {'my_data':    foo
#  0  bar}

Calling the node returns the DataFrame inside a dictionary, keyed by "my_data" (the outputs name defined on the node).

result["my_data"]
foo
0 bar
pd.testing.assert_frame_equal(result["my_data"], dummy_func()) # assertion pass
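
The same naming rule applies to inputs. As a quick sketch (add_greeting and greeted_data are made-up names for illustration, not part of the original example), a node that consumes "my_data" declares it by name, and a direct call supplies the actual object as a keyword argument:

# Hypothetical second node: its input "my_data" is a dataset name, not a DataFrame
def add_greeting(df):
    df["greeting"] = "hello"
    return df

greeting_node = node(func=add_greeting, inputs="my_data", outputs="greeted_data")

# When called directly, inputs are passed as keyword arguments keyed by name
greeting_node(my_data=dummy_func())
# {'greeted_data':    foo greeting
#  0  bar    hello}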

Using the DataCatalog class in a Python file

The last step is to save the DataFrame as a file, which is where the DataCatalog and catalog.yml come into the picture.

from kedro.io import DataCatalog

catalog_config = {"my_data":
                        {"type": "pandas.CSVDataset",
                         "filepath": "my_csv.csv"}
                        }
catalog = DataCatalog.from_config(catalog_config)
catalog.save( "my_data", result["my_data"])

We can check whether the data was saved correctly:

pd.read_csv("my_csv.csv")
#    foo
# 0  bar

# Or use the DataCatalog
catalog.load("my_data")
#    foo
# 0  bar

Construct the DataCatalog class with catalog.yml

Going back to the configuration we used earlier, catalog.yml is merely catalog_config written in YAML.

catalog_config = {"my_data":
                        {"type": "pandas.CSVDataset",
                         "filepath": "my_csv.csv"}
                        }
catalog = DataCatalog.from_config(catalog_config)

We can replace the dictionary with catalog.yml.

import yaml

# Usually defined in a catalog.yml file
catalog_yml = """
my_data:
  type: pandas.CSVDataset
  filepath: my_csv.csv
"""

catalog_config = yaml.safe_load(catalog_yml)
new_catalog = DataCatalog.from_config(catalog_config)
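
As a quick sanity check (reusing the my_csv.csv file written earlier; nothing new is created here), the YAML-built catalog behaves exactly like the one built from the dictionary:

new_catalog.load("my_data")
#    foo
# 0  bar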

Summary

These abstractions are usually hidden from the end user. You do not need to use the DataCatalog directly if you are working inside a Kedro Project. Behind the scenes, this is what happens:

  1. Function arguments and return values are mapped to dataset names according to the node definition.
  2. The DataCatalog loads and saves data by name, looking each name up in catalog.yml to figure out whether it should be loaded from a CSV file, a Parquet file, or something else (see the sketch below).
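
To make this concrete, here is a minimal sketch of that name lookup, reusing the dummy_node and catalog objects defined above:

# The output name declared on the node is the key used for the catalog lookup
dataset_name = dummy_node.outputs[0]  # "my_data"

# The catalog resolves that name through its config (catalog.yml) to a
# pandas.CSVDataset and loads my_csv.csv for us
catalog.load(dataset_name)
#    foo
# 0  bar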

Bonus - Kedro Runner

There is one important component missing from this article: the Kedro Runner. During a kedro run, the Runner decides the order in which nodes execute, requests input data from the DataCatalog, and saves output data with the DataCatalog. The pseudocode of a kedro run looks roughly like this:

def run_node(node, catalog):
    # Prepare data: load each input by name
    inputs = {}
    for name in node.inputs:
        inputs[name] = catalog.load(name)

    # Execute the node's function with the loaded inputs
    outputs = node.run(inputs)

    # Save each output by name
    for name in node.outputs:
        catalog.save(name, outputs[name])

for node in nodes:
    run_node(node, catalog)
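
In a real project you do not write this loop yourself. As a rough sketch (assuming the dummy_node and catalog objects from earlier; SequentialRunner is Kedro's default runner), the single-node equivalent of a kedro run is:

from kedro.pipeline import pipeline
from kedro.runner import SequentialRunner

# The runner performs the load / run / save loop above for every node
SequentialRunner().run(pipeline([dummy_node]), catalog)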