The first answer you may get is to use the data catalog, but what really is the data catalog? Some may say it is the `catalog.yml` file; others may mention the `DataCatalog` class. Both answers are true, but each lacks a bit of context.

Let’s focus on “how” to load or save data in a Kedro Project.
Create a Kedro Node
First, we need a Python function that takes some inputs and returns some outputs:

```python
import pandas as pd


def dummy_func():
    df = pd.DataFrame([{"foo": "bar"}])
    # df.to_csv("my_data.csv")  # You don't need to save it explicitly
    return df
```
This function takes no input but produces a DataFrame, so how does Kedro know how to save this data? Kedro works with a `Node` instead of a plain function. A `Node` is a Python function + inputs + outputs, where the inputs and outputs are merely names of the data rather than the actual objects.
```python
from kedro.pipeline import node, pipeline

dummy_node = node(func=dummy_func, inputs=None, outputs="my_data")
```
You can call the node directly, though it is usually not necessary because execution is handled by the Kedro Runner and Kedro Pipeline.
```python
dummy_func()
```

|   | foo |
|---|---|
| 0 | bar |

```python
result = dummy_node()
result
```

```
{'my_data':   foo
0  bar}
```
Calling the node returns the DataFrame inside a dictionary, under the key “my_data” (the outputs name defined in the node).
"my_data"] result[
foo | |
---|---|
0 | bar |
"my_data"], dummy_func()) # assertion pass pd.testing.assert_frame_equal(result[
Using the `DataCatalog` class in a Python file
The last step is to save it as a file, which is where the `DataCatalog` class or `catalog.yml` comes into the picture.
```python
from kedro.io import DataCatalog

catalog_config = {
    "my_data": {
        "type": "pandas.CSVDataset",
        "filepath": "my_csv.csv",
    }
}
catalog = DataCatalog.from_config(catalog_config)
catalog.save("my_data", result["my_data"])
```
We can check if the data is saved correctly:
"my_csv.csv") pd.read_csv(
foo | |
---|---|
0 | bar |
# Or use DataCatalog
"my_data") catalog.load(
foo | |
---|---|
0 | bar |
Construct the `DataCatalog` class with `catalog.yml`
Going back to this, `catalog.yml` is merely `catalog_config` but written in YAML.
= {"my_data":
catalog_config "type": "pandas.CSVDataset",
{"filepath": "my_csv.csv"}
}= DataCatalog.from_config(catalog_config) catalog
We can replace the dictionary with `catalog.yml`.
```python
# Usually defined in a catalog.yml file
catalog_yml = """
my_data:
  type: pandas.CSVDataset
  filepath: my_csv.csv
"""

import yaml

catalog_config = yaml.safe_load(catalog_yml)
new_catalog = DataCatalog.from_config(catalog_config)
```
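To confirm the two approaches are equivalent, we can load the data back through the YAML-built catalog. This is a small check added for illustration; it reuses the `result` dictionary from earlier:

```python
# The YAML-built catalog reads the same file the dict-built one wrote
df_from_yaml = new_catalog.load("my_data")
pd.testing.assert_frame_equal(df_from_yaml, result["my_data"])  # assertion passes
```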
Summary
These abstractions are usually hidden from end users. You do not need to use the `DataCatalog` directly if you are working with a Kedro Project. Behind the scenes, this is what happens:

- The function's arguments and return values are mapped to dataset names according to the node definition.
- The `DataCatalog` looks up each dataset name in `catalog.yml` to figure out whether the data should be loaded from a CSV file, a Parquet file, or something else.
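For example, changing the on-disk format is just a config change. The sketch below is illustrative: `my_data.parquet` is a hypothetical file path, and writing Parquet with pandas also requires a library such as pyarrow to be installed:

```python
# A sketch: the same dataset name, now backed by Parquet instead of CSV.
# "my_data.parquet" is a hypothetical file path used for illustration.
catalog_config = {
    "my_data": {
        "type": "pandas.ParquetDataset",
        "filepath": "my_data.parquet",
    }
}
parquet_catalog = DataCatalog.from_config(catalog_config)
parquet_catalog.save("my_data", result["my_data"])  # writes Parquet, not CSV
```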
Bonus - Kedro Runner
There is one important component missing from this article: the Kedro Runner. During a `kedro run`, the Runner decides the order in which nodes are executed, requests input data from the `DataCatalog`, and saves output data with the `DataCatalog`. The pseudocode of a `kedro run` looks roughly like this:
```python
def run_node(node, catalog):
    # Prepare data: load each input by name from the catalog
    inputs = {}
    for name in node.inputs:
        inputs[name] = catalog.load(name)

    # Execute the node's function with the loaded inputs
    outputs = node.run(inputs)

    # Save each output back to the catalog by name
    for name in node.outputs:
        catalog.save(name, outputs[name])


for node in nodes:
    run_node(node, catalog)
```
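To see the real thing in action, here is a minimal sketch that wires the node into a pipeline and hands it to an actual Kedro runner. It assumes the `dummy_node` and `catalog` defined above, and note that some Kedro versions also require a `hook_manager` argument to `run`:

```python
from kedro.pipeline import pipeline
from kedro.runner import SequentialRunner

# Wrap the node in a pipeline and let the runner handle
# loading and saving through the catalog
dummy_pipeline = pipeline([dummy_node])
SequentialRunner().run(dummy_pipeline, catalog)
```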