import pandas as pd

def dummy_func():
    df = pd.DataFrame([{"foo": "bar"}])
    # df.to_csv("my_data.csv")  # You don't need to save it explicitly
    return df

The first answer that you may get is "use the data catalog", but what really is the data catalog? Some may say it is catalog.yml, others may mention the DataCatalog class. Both are true, but each answer lacks a bit of context.
Let’s focus on “how” to load or save data in a Kedro Project.
Create a Kedro Node
First, we need a Python function that takes some inputs and produces some outputs, like dummy_func defined above.
This function takes no inputs but produces a DataFrame, so how does Kedro know how to save this data? Kedro works with a Node rather than a plain function. A Node is a Python function + inputs + outputs, where the inputs and outputs are merely names for the data rather than the actual objects.
from kedro.pipeline import node, pipeline

dummy_node = node(func=dummy_func, inputs=None, outputs="my_data")

You can call the node directly, but it's not necessary because it is handled by the Kedro Runner and Kedro Pipeline.
dummy_func()

|   | foo |
|---|-----|
| 0 | bar |
result = dummy_node()
result

{'my_data':    foo
 0  bar}
It returns the DataFrame inside a dictionary with the key "my_data" (the outputs name defined in the node).
result["my_data"]| foo | |
|---|---|
| 0 | bar |
pd.testing.assert_frame_equal(result["my_data"], dummy_func())  # assertion passes

Using the DataCatalog class in a Python file
The last step is to save it to a file, which is where the DataCatalog or catalog.yml comes into the picture.
from kedro.io import DataCatalog

catalog_config = {
    "my_data": {
        "type": "pandas.CSVDataset",
        "filepath": "my_csv.csv",
    }
}
catalog = DataCatalog.from_config(catalog_config)
catalog.save("my_data", result["my_data"])

We can check if the data is saved correctly:
pd.read_csv("my_csv.csv")| foo | |
|---|---|
| 0 | bar |
# Or use DataCatalog
catalog.load("my_data")| foo | |
|---|---|
| 0 | bar |
Construct the DataCatalog class with catalog.yml
Going back to the catalog_config dictionary above, catalog.yml is merely catalog_config written in YAML.
catalog_config = {"my_data":
{"type": "pandas.CSVDataset",
"filepath": "my_csv.csv"}
}
catalog = DataCatalog.from_config(catalog_config)We can replace the dictionary with catalog.yml.
import yaml

# Usually defined in a catalog.yml file
catalog_yml = """
my_data:
  type: pandas.CSVDataset
  filepath: my_csv.csv
"""

catalog_config = yaml.safe_load(catalog_yml)
new_catalog = DataCatalog.from_config(catalog_config)

Summary
These abstractions are usually hidden from end users. You do not need to use the DataCatalog directly if you are working with a Kedro Project. Behind the scenes, this is what happens:
- The function signature (inputs and outputs) is mapped according to the node definition.
- DataCatalog loads and saves the data by name, looking that name up in catalog.yml to figure out whether it should be read from a CSV or a Parquet file (see the sketch after this list).
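As a small illustration of that name-based mapping, here is a hedged sketch with a hypothetical second node; process_func and "processed_data" are made-up names, not part of the example above. A downstream node only needs to refer to "my_data" by name:

from kedro.pipeline import node, pipeline

# Hypothetical downstream function: it receives whatever object the
# catalog provides for the name "my_data" (here, a pandas DataFrame).
def process_func(df):
    return df.assign(processed=True)

process_node = node(func=process_func, inputs="my_data", outputs="processed_data")

# The two nodes are wired purely by names: dummy_node produces "my_data",
# which is exactly what process_node declares as its input.
dummy_pipeline = pipeline([dummy_node, process_node])

If "processed_data" has no entry in catalog.yml, Kedro keeps it in memory by default instead of writing it to disk.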
Bonus - Kedro Runner
There is one important component missing from this article: the Kedro Runner. During a kedro run, the Runner decides the order in which nodes execute, requests data from the DataCatalog, and saves data with the DataCatalog. The pseudocode of a kedro run looks roughly like this:
for node in nodes:
    run_node(node, catalog)

def run_node(node, catalog):
    # Prepare data
    inputs = {}
    for name in node.inputs:
        inputs[name] = catalog.load(name)

    # Execute the node
    outputs = node.run(inputs)

    # Save data
    for name in node.outputs:
        catalog.save(name, outputs[name])
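To make this concrete, here is a minimal sketch of driving the node through a Runner yourself; the exact run() signature differs between Kedro versions, so treat this as an assumption rather than the canonical invocation:

from kedro.io import DataCatalog
from kedro.pipeline import pipeline
from kedro.runner import SequentialRunner

# Reuse the earlier configuration: "my_data" maps to my_csv.csv.
catalog = DataCatalog.from_config(
    {"my_data": {"type": "pandas.CSVDataset", "filepath": "my_csv.csv"}}
)

# The Runner resolves the execution order, loads inputs from the catalog,
# runs each node, and saves the outputs back through the catalog.
runner = SequentialRunner()
runner.run(pipeline([dummy_node]), catalog)

Inside a Kedro Project you would not write this yourself; kedro run builds the catalog from catalog.yml, assembles the registered pipelines, and hands them to a Runner for you.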