Python-based transformations 🧪
🚧 This feature is under development, and the interface may change in future releases. Interested in becoming an early tester? Join dlt+ early access.
dlt+ allows you to define Arrow-based transformations that operate on a cache. The actual transformation code is located in the ./transformations folder.
In this section, you will learn how to define Arrow-based transformations in Python.
Generate template​
Since this feature is still under development and documentation is limited, we recommend starting with a template. You can generate one using the following command:
Make sure you have configured your cache and transformation in the dlt.yml file before running the command below.
```sh
dlt transformation <transformation-name> render-t-layer
```
Running this command will create a new set of transformations inside the ./transformations folder. The generated template includes:
- Transformation functions that manage incremental loading state based on `dlt_load_id`.
- Two transformation functions that implement user-defined transformations.
- A staging view, which pre-selects only rows eligible for the current transformation run.
- A main output table, which initially just forwards all incoming rows unchanged.
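Conceptually, the staging view and the initial pass-through output table behave as sketched below. This is a plain-Python illustration of the idea, not the dlt+ API: rows are modeled as dicts where dlt+ would use Arrow tables, and the function names are assumptions chosen for clarity.

```python
# Plain-Python sketch of what the generated template does conceptually.
# In dlt+, these would operate on Arrow tables inside the cache.

def staging_view(rows, pending_load_ids):
    """Pre-select only rows eligible for the current transformation run."""
    return [r for r in rows if r["dlt_load_id"] in pending_load_ids]

def main_output_table(staged_rows):
    """The generated template initially forwards all incoming rows unchanged."""
    return list(staged_rows)

rows = [
    {"id": 1, "dlt_load_id": "load_1"},
    {"id": 2, "dlt_load_id": "load_2"},
]
staged = staging_view(rows, pending_load_ids={"load_2"})
print(main_output_table(staged))  # [{'id': 2, 'dlt_load_id': 'load_2'}]
```

Your user-defined transformations then replace the pass-through body of the output table while the staging view keeps handling eligibility.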
If you run the generated transformations without modifying them, execution will fail. This happens because your cache expects an aggregated table corresponding to `<transformation-name>`, but the newly created transformations do not include it. To resolve this, you can either:
- Update your cache settings to match the new transformation.
- Implement a transformation that aligns with the expected table structure.
Understanding incremental transformations​
The default transformations generated by the scaffolding command work incrementally using the `dlt_load_id` from the incoming dataset. Here's how it works:
- The `dlt_loads` table is automatically available in the cache.
- The transformation layer identifies which `load_id`s exist in the incoming dataset.
- It selects only those `load_id`s that have not yet been processed (i.e., missing from the `processed_load_ids` table).
- Once all transformations are complete, the `processed_load_ids` table is updated with the processed `load_id`s.
- The cache saves the `processed_load_ids` table to the output dataset after each run.
- When syncing the input dataset, the cache reloads the `processed_load_ids` table from the output dataset (if available).
This mechanism allows incremental transformations to function seamlessly, even on ephemeral machines, where the cache is not retained between runs.
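The bookkeeping described above can be sketched in plain Python. Sets stand in for the `processed_load_ids` cache table here; this illustrates the mechanism only and is not dlt+ internals.

```python
def select_pending_load_ids(incoming_load_ids, processed_load_ids):
    """Pick only load_ids that have not been processed yet."""
    return set(incoming_load_ids) - set(processed_load_ids)

def run_incremental(rows, processed_load_ids):
    """One transformation run: filter new rows, then record their load_ids."""
    pending = select_pending_load_ids(
        (r["dlt_load_id"] for r in rows), processed_load_ids
    )
    new_rows = [r for r in rows if r["dlt_load_id"] in pending]
    # After all transformations complete, the processed set is updated and
    # persisted to the output dataset, so a fresh machine can resume later.
    processed_load_ids = processed_load_ids | pending
    return new_rows, processed_load_ids

processed = {"load_1"}  # reloaded from the output dataset when syncing
rows = [
    {"id": 1, "dlt_load_id": "load_1"},  # already processed, skipped
    {"id": 2, "dlt_load_id": "load_2"},  # new, picked up this run
]
new_rows, processed = run_incremental(rows, processed)
print([r["id"] for r in new_rows])  # [2]
print(sorted(processed))            # ['load_1', 'load_2']
```

Because the processed set travels with the output dataset rather than the cache, the same run on an ephemeral machine skips exactly the loads that were already handled.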