Investigation of the Kedro default node names

https://github.com/kedro-org/kedro/issues/3575
python
kedro
Published

February 8, 2024

Default node names are problematic #3575

!kedro -V
kedro, version 0.18.14
%load_ext kedro.ipython
%reload_kedro default-node-name/
[02/08/24 15:59:00] WARNING  Kedro extension was registered but couldn't find a Kedro project. Make  __init__.py:40
                             sure you run '%reload_kedro <project_root>'.                                          
[02/08/24 15:59:00] INFO     Kedro project default-node-name                                        __init__.py:108
                    INFO     Defined global variable 'context', 'session', 'catalog' and            __init__.py:109
                             'pipelines'                                                                           
[02/08/24 15:59:08] INFO     Registered line magic 'run_viz'                                        __init__.py:115

Node name(s)

  • node.name (with namespace)
  • node.short_name
  • node._name
  • node._unique_key (hashable)
  • node._func_name

node.name (with namespace)

  • Needed by node, pipeline and runner (expected to be the public interface)
  • It falls back to str(self) when no name is provided
@property
def name(self) -> str:
    """Node's name.

    Returns:
        Node's name if provided or the name of its function.
    """
    node_name = self._name or str(self)
    if self.namespace:
        return f"{self.namespace}.{node_name}"
    return node_name

node.short_name

  • Not needed for kedro run
  • No reference in the entire codebase
  • Deprecating it would (technically) be a breaking change

node._name

  • Only used in node.py, not used outside

node._unique_key (hashable)

In [8]: n._unique_key
Out[8]: ('preprocess_companies_node', 'companies', 'preprocessed_companies')

  • Returns a tuple of three things: (node name, sorted inputs, sorted outputs)
  • Used for node comparison and for checking that nodes are unique
  • hash(node) = hash(node._unique_key)
  • __eq__ makes sense, but what are less-than and greater-than for?
  • __lt__ - Private Kedro PR
    • Doesn’t seem to be needed until I added the sorted(nodes) to ensure SequentialRunner has deterministic output

node._func_name

Usage:

  • __str__
  • __repr__
  • short_name
  • __str__ and __repr__ will call node._func_name
n = pipelines["__default__"].nodes[0] # Getting the first node
def __str__(self) -> str:
    def _set_to_str(xset: set | list[str]) -> str:
        return f"[{';'.join(xset)}]"

    out_str = _set_to_str(self.outputs) if self._outputs else "None"
    in_str = _set_to_str(self.inputs) if self._inputs else "None"

    prefix = self._name + ": " if self._name else ""
    return prefix + f"{self._func_name}({in_str}) -> {out_str}"

def _set_to_str(xset: set | list[str]) -> str:
    return f"[{';'.join(xset)}]"

self = n
out_str = _set_to_str(self.outputs) if self._outputs else "None"
in_str = _set_to_str(self.inputs) if self._inputs else "None"

prefix = self._name + ": " if self._name else ""
prefix + f"{self._func_name}({in_str}) -> {out_str}"

'split: split_data([example_iris_data;parameters]) -> [X_train;X_test;y_train;y_test]'
n.__str__??
n.__str__()

'split: split_data([example_iris_data,parameters]) -> [X_train,X_test,y_train,y_test]'
str(n)

'split: split_data([example_iris_data,parameters]) -> [X_train,X_test,y_train,y_test]'
repr(n)

"Node(split_data, ['example_iris_data', 'parameters'], ['X_train', 'X_test', 'y_train', 'y_test'], 'split')"
Notes:
    - Duplicate nodes are checked with `node.name`, not `node._unique_key`
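
To make the note above concrete, here is a minimal sketch of why the distinction matters. FakeNode, unique_key and duplicate_names are hypothetical stand-ins written for illustration, not Kedro's classes or functions; the only thing borrowed from Kedro is the (name, inputs, outputs) shape of the key shown earlier.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for a Kedro node; not the real class."""

    name: str
    inputs: tuple
    outputs: tuple

    @property
    def unique_key(self) -> tuple:
        # Same (name, inputs, outputs) shape as the n._unique_key output above.
        return (self.name, self.inputs, self.outputs)


def duplicate_names(nodes):
    """Return every name used by more than one node, sorted."""
    counts = Counter(n.name for n in nodes)
    return sorted(name for name, count in counts.items() if count > 1)


a = FakeNode("split", ("raw",), ("train",))
b = FakeNode("split", ("other",), ("test",))

print(a.unique_key == b.unique_key)  # False: the keys differ
print(duplicate_names([a, b]))       # ['split']: the names still clash
```

Two nodes can therefore be distinct by key and still be rejected as duplicates by a name-based check, which is why the name has to be unique on its own.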

Observation 1

def dummy_func(x):
    return "dummy"
from kedro.pipeline import node
def format_name(node):
    print(f"{str(node)=}")
    print(f"{repr(node)=}")
    print(f"{node.name=}")
    print(f"{node.short_name=}")
    print()
nameless_node = node(dummy_func, inputs=["a"], outputs=["b"])
nameless_namespace_node = node(dummy_func, inputs=["a"], outputs=["b"], namespace="nok")
name_node = node(dummy_func, inputs=["a"], outputs=["b"], name="dummy_name")

format_name(nameless_node)
format_name(nameless_namespace_node)
format_name(name_node)
str(node)='dummy_func([a]) -> [b]'
repr(node)="Node(dummy_func, ['a'], ['b'], None)"
node.name='dummy_func([a]) -> [b]'
node.short_name='Dummy Func'

str(node)='dummy_func([a]) -> [b]'
repr(node)="Node(dummy_func, ['a'], ['b'], None)"
node.name='nok.dummy_func([a]) -> [b]'
node.short_name='Dummy Func'

str(node)='dummy_name: dummy_func([a]) -> [b]'
repr(node)="Node(dummy_func, ['a'], ['b'], 'dummy_name')"
node.name='dummy_name'
node.short_name='dummy_name'
  1. The repr for a namespaced node is wrong: namespace isn’t included in __repr__ at all, so it will not reconstruct the same node.
  2. short_name feels very kedro-viz coupled, and it seems unnecessary to keep it in kedro. Kedro does not use this property.
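
As a rough illustration of the first point, a repr that carries the namespace could look like the sketch below. SketchNode is a hypothetical, simplified stand-in and not Kedro's Node class; the keyword-style output format is an assumption chosen for readability, not a proposal taken from the issue.

```python
class SketchNode:
    """Hypothetical, simplified stand-in for kedro's Node (not the real class)."""

    def __init__(self, func, inputs, outputs, name=None, namespace=None):
        self._func = func
        self._inputs = inputs
        self._outputs = outputs
        self._name = name
        self.namespace = namespace

    def __repr__(self) -> str:
        # Include name and namespace as keywords so no constructor argument is lost.
        return (
            f"SketchNode({self._func.__name__}, {self._inputs!r}, "
            f"{self._outputs!r}, name={self._name!r}, "
            f"namespace={self.namespace!r})"
        )


def dummy_func(x):
    return "dummy"


plain = SketchNode(dummy_func, ["a"], ["b"])
namespaced = SketchNode(dummy_func, ["a"], ["b"], namespace="nok")

print(repr(plain))
print(repr(namespaced))
```

With the namespace in the output, the two nodes no longer share the same repr, unlike the first two results above.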

Observation 2

https://github.com/kedro-org/kedro/pull/568/files - can use self._func_name instead of _get_readable_func_name

Observation 3 - node.name

    def __str__(self) -> str:
        def _set_to_str(xset: set | list[str]) -> str:
            return f"[{';'.join(xset)}]"

        out_str = _set_to_str(self.outputs) if self._outputs else "None"
        in_str = _set_to_str(self.inputs) if self._inputs else "None"

        prefix = self._name + ": " if self._name else ""
        return prefix + f"{self._func_name}({in_str}) -> {out_str}"

    @property
    def name(self) -> str:
        """Node's name.

        Returns:
            Node's name if provided or the name of its function.
        """
        node_name = self._name or str(self)
        if self.namespace:
            return f"{self.namespace}.{node_name}"
        return node_name    

This is an important property and must be kept unique, because it’s used for filtering.

However, the implementation uses __str__, which is meant for “printing”, and this creates an obscure dependency. In any case, the dependency should be reversed so that __str__ relies on self.name instead.
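
One possible shape for that reversal is sketched below. SketchNode and its _signature helper are hypothetical and only mimic the snippet quoted in this section; this is not Kedro's implementation, and a real change would have to consider backwards compatibility of the printed output.

```python
class SketchNode:
    """Hypothetical, simplified stand-in for kedro's Node (not the real class)."""

    def __init__(self, func, inputs, outputs, name=None, namespace=None):
        self._func_name = func.__name__
        self._inputs = list(inputs)
        self._outputs = list(outputs)
        self._name = name
        self.namespace = namespace

    def _signature(self) -> str:
        # Hypothetical helper: builds the "func([in]) -> [out]" text directly.
        in_str = f"[{';'.join(self._inputs)}]" if self._inputs else "None"
        out_str = f"[{';'.join(self._outputs)}]" if self._outputs else "None"
        return f"{self._func_name}({in_str}) -> {out_str}"

    @property
    def name(self) -> str:
        # The name no longer calls str(self).
        node_name = self._name or self._signature()
        if self.namespace:
            return f"{self.namespace}.{node_name}"
        return node_name

    def __str__(self) -> str:
        # Printing is derived from the name, not the other way round.
        if self._name:
            return f"{self.name}: {self._signature()}"
        return self.name


def dummy_func(x):
    return "dummy"


nameless = SketchNode(dummy_func, ["a"], ["b"])
named = SketchNode(dummy_func, ["a"], ["b"], name="dummy_name")

print(nameless.name)  # dummy_func([a]) -> [b]
print(str(named))     # dummy_name: dummy_func([a]) -> [b]
```

One side effect of this sketch is that str() of a namespaced node would include the namespace prefix, which differs from the current output shown in Observation 1.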