Demo of debugging a Kedro pipeline with a notebook
python
kedro
Published

November 8, 2022

Steps to debug a Kedro pipeline in a notebook

  1. Read the stack trace - find the line of code that produces the error
  2. Find which node this function belongs to
  3. Try to rerun the pipeline up to just before this node
  4. If the node's inputs are not persisted datasets, change them in catalog.yml and re-run the pipeline; the error is thrown again
  5. The session has already been used once, so calling it again will throw an error (so he had a wrapper function that recreates the session and does something similar to session.run - see the sketch after this list)
  6. Create a new session or %reload_kedro?
  7. Now catalog.load that persisted dataset, i.e. func(catalog.load("some_data"))
  8. Copy the source code of func to the notebook. This works if the function itself is the node function, but if it is some function buried deep down, that means a lot more copy-pasting and possibly changing imports.
  9. Change the source code and make it work in the notebook
  10. Rerun the pipeline to ensure everything works
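
A minimal sketch of the wrapper mentioned in step 5, assuming a Kedro 0.18-style API: recreate a fresh session programmatically and resume the run from the failing node instead of re-running everything. The helper name rerun_from is made up for illustration.

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def rerun_from(node_name: str, project_path: Path = Path.cwd()):
    """Recreate a session and resume the pipeline from the given node."""
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as new_session:
        return new_session.run(from_nodes=[node_name])

# e.g. rerun_from("report_accuracy")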

Running Session as Usual

%reload_kedro
[11/08/22 16:44:22] INFO     Resolved project path as:                                              __init__.py:132
                             /Users/Nok_Lam_Chan/dev/kedro_gallery/jupyter-debug-demo.                             
                             To set a different path, run '%reload_kedro <project_root>'                           
[11/08/22 16:44:24] INFO     Kedro project jupyter_debug_demo                                       __init__.py:101
                    INFO     Defined global variable 'context', 'session', 'catalog' and            __init__.py:102
                             'pipelines'                                                                           
                    INFO     Registered line magic 'run_viz'                                        __init__.py:108
session
<kedro.framework.session.session.KedroSession object at 0x7fc47a1a0be0>
pipelines
{'__default__': Pipeline([
Node(split_data, ['example_iris_data', 'parameters'], ['X_train', 'X_test', 'y_train', 'y_test'], 'split'),
Node(make_predictions, ['X_train', 'X_test', 'y_train'], 'y_pred', 'make_predictions'),
Node(report_accuracy, ['y_pred', 'y_test'], None, 'report_accuracy')
])}
session.run()
                    INFO     Kedro project jupyter-debug-demo                                        session.py:340
[11/08/22 16:44:25] INFO     Loading data from 'example_iris_data' (CSVDataSet)...              data_catalog.py:343
                    INFO     Loading data from 'parameters' (MemoryDataSet)...                  data_catalog.py:343
                    INFO     Running node: split: split_data([example_iris_data,parameters]) ->         node.py:327
                             [X_train,X_test,y_train,y_test]                                                       
                    INFO     Saving data to 'X_train' (MemoryDataSet)...                        data_catalog.py:382
                    INFO     Saving data to 'X_test' (MemoryDataSet)...                         data_catalog.py:382
                    INFO     Saving data to 'y_train' (MemoryDataSet)...                        data_catalog.py:382
                    INFO     Saving data to 'y_test' (PickleDataSet)...                         data_catalog.py:382
                    INFO     Completed 1 out of 3 tasks                                     sequential_runner.py:85
                    INFO     Loading data from 'X_train' (MemoryDataSet)...                     data_catalog.py:343
                    INFO     Loading data from 'X_test' (MemoryDataSet)...                      data_catalog.py:343
                    INFO     Loading data from 'y_train' (MemoryDataSet)...                     data_catalog.py:343
                    INFO     Running node: make_predictions: make_predictions([X_train,X_test,y_train]) node.py:327
                             -> [y_pred]                                                                           
1
                    INFO     Saving data to 'y_pred' (PickleDataSet)...                         data_catalog.py:382
                    INFO     Completed 2 out of 3 tasks                                     sequential_runner.py:85
                    INFO     Loading data from 'y_pred' (PickleDataSet)...                      data_catalog.py:343
                    INFO     Loading data from 'y_test' (PickleDataSet)...                      data_catalog.py:343
                    INFO     Running node: report_accuracy: report_accuracy([y_pred,y_test]) -> None    node.py:327
                    ERROR    Node 'report_accuracy: report_accuracy([y_pred,y_test]) -> None' failed    node.py:352
                             with error:                                                                           
                             Simulate some bug here                                                                
                    WARNING  There are 1 nodes that have not run.                                     runner.py:202
                             You can resume the pipeline run from the nearest nodes with persisted                 
                             inputs by adding the following argument to your previous command:                     
                               --from-nodes "report_accuracy"                                                      
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
 /var/folders/dv/bz0yz1dn71d2hygq110k3xhw0000gp/T/ipykernel_7863/833844929.py:1 in <cell line: 1> 
                                                                                                  
 [Errno 2] No such file or directory:                                                             
 '/var/folders/dv/bz0yz1dn71d2hygq110k3xhw0000gp/T/ipykernel_7863/833844929.py'                   
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/framework/session/session.py:404 in run                   
                                                                                                  
   401 │   │   )                                                                                  
   402 │   │                                                                                      
   403 │   │   try:                                                                               
 404 │   │   │   run_result = runner.run(                                                       
   405 │   │   │   │   filtered_pipeline, catalog, hook_manager, session_id                       
   406 │   │   │   )                                                                              
   407 │   │   │   self._run_called = True                                                        
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/runner/runner.py:88 in run                                
                                                                                                  
    85 │   │   │   self._logger.info(                                                             
    86 │   │   │   │   "Asynchronous mode is enabled for loading and saving data"                 
    87 │   │   │   )                                                                              
  88 │   │   self._run(pipeline, catalog, hook_manager, session_id)                             
    89 │   │                                                                                      
    90 │   │   self._logger.info("Pipeline execution completed successfully.")                    
    91                                                                                            
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/runner/sequential_runner.py:70 in _run                    
                                                                                                  
   67 │   │                                                                                       
   68 │   │   for exec_index, node in enumerate(nodes):                                           
   69 │   │   │   try:                                                                            
 70 │   │   │   │   run_node(node, catalog, hook_manager, self._is_async, session_id)           
   71 │   │   │   │   done_nodes.add(node)                                                        
   72 │   │   │   except Exception:                                                               
   73 │   │   │   │   self._suggest_resume_scenario(pipeline, done_nodes, catalog)                
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/runner/runner.py:304 in run_node                          
                                                                                                  
   301 │   if is_async:                                                                           
   302 │   │   node = _run_node_async(node, catalog, hook_manager, session_id)                    
   303 │   else:                                                                                  
 304 │   │   node = _run_node_sequential(node, catalog, hook_manager, session_id)               
   305 │                                                                                          
   306 │   for name in node.confirms:                                                             
   307 │   │   catalog.confirm(name)                                                              
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/runner/runner.py:398 in _run_node_sequential              
                                                                                                  
   395 │   )                                                                                      
   396 │   inputs.update(additional_inputs)                                                       
   397 │                                                                                          
 398 outputs = _call_node_run(                                                              
   399 │   │   node, catalog, inputs, is_async, hook_manager, session_id=session_id               
   400 │   )                                                                                      
   401                                                                                            
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/runner/runner.py:366 in _call_node_run                    
                                                                                                  
   363 │   │   │   is_async=is_async,                                                             
   364 │   │   │   session_id=session_id,                                                         
   365 │   │   )                                                                                  
 366 │   │   raise exc                                                                          
   367 │   hook_manager.hook.after_node_run(                                                      
   368 │   │   node=node,                                                                         
   369 │   │   catalog=catalog,                                                                   
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/runner/runner.py:356 in _call_node_run                    
                                                                                                  
   353 ) -> Dict[str, Any]:                                                                       
   354 │   # pylint: disable=too-many-arguments                                                   
   355 │   try:                                                                                   
 356 │   │   outputs = node.run(inputs)                                                         
   357 │   except Exception as exc:                                                               
   358 │   │   hook_manager.hook.on_node_error(                                                   
   359 │   │   │   error=exc,                                                                     
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/pipeline/node.py:353 in run                               
                                                                                                  
   350 │   │   # purposely catch all exceptions                                                   
   351 │   │   except Exception as exc:                                                           
   352 │   │   │   self._logger.error("Node '%s' failed with error: \n%s", str(self), str(exc))   
 353 │   │   │   raise exc                                                                      
   354 │                                                                                          
   355 │   def _run_with_no_inputs(self, inputs: Dict[str, Any]):                                 
   356 │   │   if inputs:                                                                         
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/pipeline/node.py:344 in run                               
                                                                                                  
   341 │   │   │   elif isinstance(self._inputs, str):                                            
   342 │   │   │   │   outputs = self._run_with_one_input(inputs, self._inputs)                   
   343 │   │   │   elif isinstance(self._inputs, list):                                           
 344 │   │   │   │   outputs = self._run_with_list(inputs, self._inputs)                        
   345 │   │   │   elif isinstance(self._inputs, dict):                                           
   346 │   │   │   │   outputs = self._run_with_dict(inputs, self._inputs)                        
   347                                                                                            
                                                                                                  
 /Users/Nok_Lam_Chan/GitHub/kedro/kedro/pipeline/node.py:384 in _run_with_list                    
                                                                                                  
   381 │   │   │   │   f"{sorted(inputs.keys())}."                                                
   382 │   │   │   )                                                                              
   383 │   │   # Ensure the function gets the inputs in the correct order                         
 384 │   │   return self._func(*(inputs[item] for item in node_inputs))                         
   385 │                                                                                          
   386 │   def _run_with_dict(self, inputs: Dict[str, Any], node_inputs: Dict[str, str]):         
   387 │   │   # Node inputs and provided run inputs should completely overlap                    
                                                                                                  
 /Users/Nok_Lam_Chan/dev/kedro_gallery/jupyter-debug-demo/src/jupyter_debug_demo/nodes.py:74 in   
 report_accuracy                                                                                  
                                                                                                  
   71 │   │   y_pred: Predicted target.                                                           
   72 │   │   y_test: True target.                                                                
   73 """                                                                                     
 74 raise ValueError("Simulate some bug here")                                              
   75 │   accuracy = (y_pred == y_test).sum() / len(y_test)                                       
   76 │   logger = logging.getLogger(__name__)                                                    
   77 │   logger.info("Model has accuracy of %.3f on test data.", accuracy)                       
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Simulate some bug here
  1. Read the stack trace - find the line of code that produces the error

  2. Find which node this function belongs to

  3. Try to rerun the pipeline up to just before this node

  4. If the node's inputs are not persisted datasets, change them in catalog.yml and re-run the pipeline; the error is thrown again

  5. The session has already been used once, so calling it again will throw an error (so he had a wrapper function that recreates the session and does something similar to session.run)

  6. Create a new session or %reload_kedro and re-run?

This is not efficient because, in an interactive workflow, these intermediate variables are likely stored in the catalog already.

%reload_kedro
[11/08/22 16:46:49] INFO     Resolved project path as:                                              __init__.py:132
                             /Users/Nok_Lam_Chan/dev/kedro_gallery/jupyter-debug-demo.                             
                             To set a different path, run '%reload_kedro <project_root>'                           
[11/08/22 16:46:50] INFO     Kedro project jupyter_debug_demo                                       __init__.py:101
                    INFO     Defined global variable 'context', 'session', 'catalog' and            __init__.py:102
                             'pipelines'                                                                           
                    INFO     Registered line magic 'run_viz'                                        __init__.py:108
session.run()
[11/08/22 16:46:53] INFO     Kedro project jupyter-debug-demo                                        session.py:340
[11/08/22 16:46:54] INFO     Loading data from 'example_iris_data' (CSVDataSet)...              data_catalog.py:343
                    INFO     Loading data from 'parameters' (MemoryDataSet)...                  data_catalog.py:343
                    INFO     Running node: split: split_data([example_iris_data,parameters]) ->         node.py:327
                             [X_train,X_test,y_train,y_test]                                                       
                    INFO     Saving data to 'X_train' (MemoryDataSet)...                        data_catalog.py:382
                    INFO     Saving data to 'X_test' (MemoryDataSet)...                         data_catalog.py:382
                    INFO     Saving data to 'y_train' (MemoryDataSet)...                        data_catalog.py:382
                    INFO     Saving data to 'y_test' (PickleDataSet)...                         data_catalog.py:382
                    INFO     Completed 1 out of 3 tasks                                     sequential_runner.py:85
                    INFO     Loading data from 'X_train' (MemoryDataSet)...                     data_catalog.py:343
                    INFO     Loading data from 'X_test' (MemoryDataSet)...                      data_catalog.py:343
                    INFO     Loading data from 'y_train' (MemoryDataSet)...                     data_catalog.py:343
                    INFO     Running node: make_predictions: make_predictions([X_train,X_test,y_train]) node.py:327
                             -> [y_pred]                                                                           
1
                    INFO     Saving data to 'y_pred' (PickleDataSet)...                         data_catalog.py:382
                    INFO     Completed 2 out of 3 tasks                                     sequential_runner.py:85
                    INFO     Loading data from 'y_pred' (PickleDataSet)...                      data_catalog.py:343
                    INFO     Loading data from 'y_test' (PickleDataSet)...                      data_catalog.py:343
                    INFO     Running node: report_accuracy: report_accuracy([y_pred,y_test]) -> None    node.py:327
                    ERROR    Node 'report_accuracy: report_accuracy([y_pred,y_test]) -> None' failed    node.py:352
                             with error:                                                                           
                             Simulate some bug here                                                                
                    WARNING  There are 1 nodes that have not run.                                     runner.py:202
                             You can resume the pipeline run from the nearest nodes with persisted                 
                             inputs by adding the following argument to your previous command:                     
                               --from-nodes "report_accuracy"                                                      
(Traceback identical to the first run above, ending in:)
ValueError: Simulate some bug here
  7. Now catalog.load that persisted dataset, i.e. func(catalog.load("some_data"))
y_pred = catalog.load("y_pred")
y_test = catalog.load("y_test")
[11/08/22 16:47:19] INFO     Loading data from 'y_pred' (PickleDataSet)...                      data_catalog.py:343
                    INFO     Loading data from 'y_test' (PickleDataSet)...                      data_catalog.py:343
catalog.datasets.y_pred.load().head()  # Alternative way, using dataset auto-discovery, which could be improved
0     setosa
2     setosa
7     setosa
20    setosa
21    setosa
Name: species, dtype: object
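
For illustration, the func(catalog.load(...)) pattern from step 7 amounts to calling the node function directly on the persisted datasets; in this demo it simply re-raises the simulated bug inside the notebook, where it can be inspected:

from jupyter_debug_demo.nodes import report_accuracy

# Reproduces the node failure interactively with the persisted inputs
report_accuracy(catalog.load("y_pred"), catalog.load("y_test"))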
  8. Copy the source code of func to the notebook. This works if the function itself is the node function, but if it is some function buried deep down, that means a lot more copy-pasting and possibly changing imports.
def report_accuracy(y_pred: pd.Series, y_test: pd.Series):
    """Calculates and logs the accuracy.

    Args:
        y_pred: Predicted target.
        y_test: True target.
    """
    raise ValueError("Simulate some bug here")
    accuracy = (y_pred == y_test).sum() / len(y_test)
    logger = logging.getLogger(__name__)
    logger.info("Model has accuracy of %.3f on test data.", accuracy)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
 /var/folders/dv/bz0yz1dn71d2hygq110k3xhw0000gp/T/ipykernel_7863/1415042900.py:1 in <cell line:   
 1>                                                                                               
                                                                                                  
 [Errno 2] No such file or directory:                                                             
 '/var/folders/dv/bz0yz1dn71d2hygq110k3xhw0000gp/T/ipykernel_7863/1415042900.py'                  
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NameError: name 'pd' is not defined

This won’t work immediately; a couple of copy-and-paste steps are needed:

  • Manually copy the imports
  • Remove the function definition - copy the source code into a cell instead
import pandas as pd
import logging
raise ValueError("Simulate some bug here")
accuracy = (y_pred == y_test).sum() / len(y_test)
logger = logging.getLogger(__name__)
logger.info("Model has accuracy of %.3f on test data.", accuracy)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
 /var/folders/dv/bz0yz1dn71d2hygq110k3xhw0000gp/T/ipykernel_7863/2816569123.py:1 in <cell line:   
 1>                                                                                               
                                                                                                  
 [Errno 2] No such file or directory:                                                             
 '/var/folders/dv/bz0yz1dn71d2hygq110k3xhw0000gp/T/ipykernel_7863/2816569123.py'                  
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Simulate some bug here

Assume we know that the first line is buggy; let’s remove it:

# raise ValueError("Simulate some bug here")
accuracy = (y_pred == y_test).sum() / len(y_test)
logger = logging.getLogger(__name__)
logger.info("Model has accuracy of %.3f on test data.", accuracy)
# It now works - let's copy this block back into the function and rerun
  9. Change the source code and make it work in the notebook
  10. Rerun the pipeline to ensure everything works
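
After editing, the node function in src/jupyter_debug_demo/nodes.py would look roughly like this (the simulated bug removed):

import logging

import pandas as pd


def report_accuracy(y_pred: pd.Series, y_test: pd.Series):
    """Calculates and logs the accuracy.

    Args:
        y_pred: Predicted target.
        y_test: True target.
    """
    accuracy = (y_pred == y_test).sum() / len(y_test)
    logger = logging.getLogger(__name__)
    logger.info("Model has accuracy of %.3f on test data.", accuracy)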
%reload_kedro
session.run()
[11/08/22 16:50:48] INFO     Resolved project path as:                                              __init__.py:132
                             /Users/Nok_Lam_Chan/dev/kedro_gallery/jupyter-debug-demo.                             
                             To set a different path, run '%reload_kedro <project_root>'                           
[11/08/22 16:50:49] INFO     Kedro project jupyter_debug_demo                                       __init__.py:101
                    INFO     Defined global variable 'context', 'session', 'catalog' and            __init__.py:102
                             'pipelines'                                                                           
                    INFO     Registered line magic 'run_viz'                                        __init__.py:108
                    INFO     Kedro project jupyter-debug-demo                                        session.py:340
[11/08/22 16:50:50] INFO     Loading data from 'example_iris_data' (CSVDataSet)...              data_catalog.py:343
                    INFO     Loading data from 'parameters' (MemoryDataSet)...                  data_catalog.py:343
                    INFO     Running node: split: split_data([example_iris_data,parameters]) ->         node.py:327
                             [X_train,X_test,y_train,y_test]                                                       
                    INFO     Saving data to 'X_train' (MemoryDataSet)...                        data_catalog.py:382
                    INFO     Saving data to 'X_test' (MemoryDataSet)...                         data_catalog.py:382
                    INFO     Saving data to 'y_train' (MemoryDataSet)...                        data_catalog.py:382
                    INFO     Saving data to 'y_test' (PickleDataSet)...                         data_catalog.py:382
                    INFO     Completed 1 out of 3 tasks                                     sequential_runner.py:85
                    INFO     Loading data from 'X_train' (MemoryDataSet)...                     data_catalog.py:343
                    INFO     Loading data from 'X_test' (MemoryDataSet)...                      data_catalog.py:343
                    INFO     Loading data from 'y_train' (MemoryDataSet)...                     data_catalog.py:343
                    INFO     Running node: make_predictions: make_predictions([X_train,X_test,y_train]) node.py:327
                             -> [y_pred]                                                                           
1
                    INFO     Saving data to 'y_pred' (PickleDataSet)...                         data_catalog.py:382
                    INFO     Completed 2 out of 3 tasks                                     sequential_runner.py:85
                    INFO     Loading data from 'y_pred' (PickleDataSet)...                      data_catalog.py:343
                    INFO     Loading data from 'y_test' (PickleDataSet)...                      data_catalog.py:343
                    INFO     Running node: report_accuracy: report_accuracy([y_pred,y_test]) -> None    node.py:327
                    INFO     Model has accuracy of 0.933 on test data.                                  nodes.py:77
                    INFO     Completed 3 out of 3 tasks                                     sequential_runner.py:85
                    INFO     Pipeline execution completed successfully.                                runner.py:90
{}

It works now!

Debugging in an interactive session is not uncommon - compared to an IDE/breakpoint:

  • You can make plots and see the data
  • You can intercept the variables and continue with the program - especially useful when the run is computation-intensive
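
For instance, a small hypothetical example (assuming matplotlib is installed) of plotting the predictions loaded from the catalog earlier:

import matplotlib.pyplot as plt

# y_pred was loaded from the catalog above; plot the class counts
y_pred.value_counts().plot(kind="bar", title="Predicted class counts")
plt.show()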

See more comments from Antony

More to optimize for the 1st PoC:

  • %load_node - populate all necessary data where the node throws an error
  • When a pipeline fails - suggest something like %load_node debug=True; the traceback should have information about which node the error is coming from
  • Is there anything we can do with viz? Sometimes I get questions from people asking whether kedro-viz can help with debugging too.

More to optimize:

  • What if the error is not in the node function but somewhere deeper in the call stack?
  • Handle the case when the inputs are not in the catalog - how to recompute the necessary inputs? Potentially we can use backtracking to do it in a more efficient way, as in the sketch below.
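
One possible sketch of that backtracking idea, using the existing pipeline-filtering API; the helper and the choice of missing input are assumptions for illustration, not a settled design:

# Hypothetical: given inputs we need but cannot catalog.load,
# build the smallest sub-pipeline that recomputes them.
missing_inputs = ["y_pred"]  # assumed example of a non-persisted input
sub_pipeline = pipelines["__default__"].to_outputs(*missing_inputs)
print(sub_pipeline.describe())

# Equivalently, a fresh session could run only that part:
# session.run(to_outputs=missing_inputs)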