Python-based transformations 🧪
🚧 This feature is under development, and the interface may change in future releases. Interested in becoming an early tester? Join dlt+ early access.
dlt+ allows you to define Arrow-based transformations that operate on a cache. The actual transformation code is located in the ./transformations folder.
In this section, you will learn how to define Arrow-based transformations in Python.
Generate template​
Since this feature is still under development and documentation is limited, we recommend starting with a template. You can generate one using the following command:
Make sure you have configured your cache and transformation in the dlt.yml file before running the command below.
```sh
dlt transformation <transformation-name> render-t-layer
```
Running this command will create a new set of transformations inside the ./transformations folder. The generated template includes:
- Transformation functions that manage incremental loading state based on `dlt_load_id`.
- Two transformation functions that implement user-defined transformations.
- A staging view, which pre-selects only rows eligible for the current transformation run.
- A main output table, which initially just forwards all incoming rows unchanged.
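Conceptually, the staging view and the initial pass-through output table behave as sketched below. This is a plain-Python illustration of the idea, not the dlt+ API: rows are modeled as dicts where dlt+ would use Arrow tables, and the function names are assumptions chosen for clarity.

```python
# Plain-Python sketch of what the generated template does conceptually.
# In dlt+, these would operate on Arrow tables inside the cache.

def staging_view(rows, pending_load_ids):
    """Pre-select only rows eligible for the current transformation run."""
    return [r for r in rows if r["dlt_load_id"] in pending_load_ids]

def main_output_table(staged_rows):
    """The generated template initially forwards all incoming rows unchanged."""
    return list(staged_rows)

rows = [
    {"id": 1, "dlt_load_id": "load_1"},
    {"id": 2, "dlt_load_id": "load_2"},
]
staged = staging_view(rows, pending_load_ids={"load_2"})
print(main_output_table(staged))  # [{'id': 2, 'dlt_load_id': 'load_2'}]
```

Your user-defined transformations then replace the pass-through body of the output table while the staging view keeps handling eligibility.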
If you run the generated transformations without modifying them, execution will fail. This happens because your cache expects an aggregated table corresponding to `<transformation-name>`, but the newly created transformations do not include it. To resolve this, you can either:
- Update your cache settings to match the new transformation.
- Implement a transformation that aligns with the expected table structure.
Understanding incremental transformations​
The default transformations generated by the scaffolding command work incrementally using the `dlt_load_id` from the incoming dataset. Here's how it works:
- The `dlt_loads` table is automatically available in the cache.
- The transformation layer identifies which `load_id`s exist in the incoming dataset.
- It selects only those `load_id`s that have not yet been processed (i.e., missing from the `processed_load_ids` table).
- Once all transformations are complete, the `processed_load_ids` table is updated with the processed `load_id`s.
- The cache saves the `processed_load_ids` table to the output dataset after each run.
- When syncing the input dataset, the cache reloads the `processed_load_ids` table from the output dataset (if available).
This mechanism allows incremental transformations to function seamlessly, even on ephemeral machines, where the cache is not retained between runs.
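The bookkeeping described above can be sketched in plain Python. Sets stand in for the `processed_load_ids` cache table here; this illustrates the mechanism only and is not dlt+ internals.

```python
def select_pending_load_ids(incoming_load_ids, processed_load_ids):
    """Pick only load_ids that have not been processed yet."""
    return set(incoming_load_ids) - set(processed_load_ids)

def run_incremental(rows, processed_load_ids):
    """One transformation run: filter new rows, then record their load_ids."""
    pending = select_pending_load_ids(
        (r["dlt_load_id"] for r in rows), processed_load_ids
    )
    new_rows = [r for r in rows if r["dlt_load_id"] in pending]
    # After all transformations complete, the processed set is updated and
    # persisted to the output dataset, so a fresh machine can resume later.
    processed_load_ids = processed_load_ids | pending
    return new_rows, processed_load_ids

processed = {"load_1"}  # reloaded from the output dataset when syncing
rows = [
    {"id": 1, "dlt_load_id": "load_1"},  # already processed, skipped
    {"id": 2, "dlt_load_id": "load_2"},  # new, picked up this run
]
new_rows, processed = run_incremental(rows, processed)
print([r["id"] for r in new_rows])  # [2]
print(sorted(processed))            # ['load_1', 'load_2']
```

Because the processed set travels with the output dataset rather than the cache, the same run on an ephemeral machine skips exactly the loads that were already handled.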