# dltflow

`dltflow` is a Python package that provides authoring utilities and CD patterns for Databricks' DLT product. It intends to make writing and deploying DLT code and pipelines to Databricks as easy as possible.

> **Note:** This project is in early development. APIs and features are subject to change. Use with caution.
## Why dltflow?

Valid question. Here are a few reasons why you might want to use `dltflow`:
- **DLT Pipelines in Python Modules:** DLT pipelines are a newer Databricks feature that brings data quality, lineage, and observability to your data pipelines. DLT, as documented, can only be instrumented via notebooks. `dltflow` provides a way to author DLT pipelines in Python modules that leverage metaprogramming patterns (via configuration) and to deploy them to Databricks (see the sketch after this list).
- **Deployment of DLT Pipelines:** `dltflow` provides a way to deploy Python modules as DLT pipelines to Databricks. It builds on the shoulders of the `dbx` and `dlt-meta` projects to provide a seamless deployment experience.
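For context, here is the kind of code `dltflow` lets you keep in a plain Python module: standard definitions written against Databricks' `dlt` Python API, which ordinarily have to live in a notebook. The table and column names below are illustrative only.

```python
# Illustrative DLT definitions using Databricks' `dlt` Python API.
# In a notebook, `spark` is the ambient session provided by the DLT
# runtime; table and column names here are made up for the sketch.
import dlt
from pyspark.sql import functions as F


@dlt.table(name="orders_clean", comment="Orders with basic quality checks.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_clean():
    return (
        spark.readStream.table("raw.orders")
        .withColumn("ingested_at", F.current_timestamp())
    )
```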
This project is heavily inspired by the `dbx` and `dlt-meta` projects. The reason for a separate project is:

- Generally, DLT pipelines can only be written in SQL or Python, and have to live in notebooks.
- `dbx` is a great tool for deploying Python modules to Databricks, but it doesn't support DLT pipelines for Python modules.
- `dlt-meta` has some deployment features that are adopted into this repo.
- `dab` is a new deployment tool by Databricks, but it suffers from the same problem as `dbx`.
## Getting Started with dltflow

### Installation

```shell
pip install dltflow
```
### Initialization

`dltflow`'s audience is developers who are familiar with Databricks, write PySpark, and want to instrument their data assets via DLT pipelines.

**Project Initialization and Templating:** `dltflow` provides a CLI command to initialize a project. This command will create a `dltflow.yml` file in the root of your project. Optionally, you can start your project with a template.
```shell
dltflow init --help
>>> Usage: dltflow init [OPTIONS]

  Initialize the project with a dltflow.yml file.
  Optionally start your project with a template.

Options:
  -p, --profile TEXT         Databricks profile to use
  -n, --project-name TEXT    Name of the project
  -c, --config-path TEXT     Path to configuration directory
  -w, --workflows-path TEXT  Path to workflows directory
  -t, --build-template       Create a templated project?
  -o, --overwrite            Overwrite existing config file  [default: True]
  -d, --shared               DBFS location to store the project  [default: True]
  --help                     Show this message and exit.
```
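If you prefer not to answer prompts interactively, the same options can be passed directly on the command line (values below are placeholders; whether every prompt is suppressed depends on the CLI's behavior):

```shell
dltflow init -p DEFAULT -n my_project -c conf -w workflows -t
```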
Simply running `dltflow init` will bring up a set of prompts to help you fill out the options listed above. As a final question in the prompts, you will be asked if you want to start your project with a template. If you answer yes, a template project will be created in the current working directory.

The structure will be as follows:
```
git-root/
    my_project/        # code goes here.
    conf/              # configuration to drive your pipelines.
    workflows/         # json or yml definitions for workflows in databricks.
    dltflow.yml        # dltflow config file.
    setup.py           # setup file for python packages.
    pyproject.toml     # pyproject file for python packages.
```
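The generated `dltflow.yml` records the answers you gave during `init`. As a rough sketch of what it might hold (the key names below are assumptions inferred from the prompts above, not the authoritative schema; inspect the file `init` creates for the real layout):

```yaml
# Hypothetical sketch only -- the actual keys may differ from these.
project:
  name: my_project
  config_path: conf          # directory with pipeline configuration
  workflows_path: workflows  # directory with workflow/job specs
  shared: true               # store artifacts in a shared DBFS location
environments:
  dev:
    profile: DEFAULT         # Databricks CLI profile to authenticate with
```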
## Deployment

Now that we have our pipeline, our pipeline configuration, and our DLT pipeline job spec, we can deploy the pipeline to Databricks. To do so, we use `dltflow`'s `deploy` command from the CLI.

Generally, `dltflow` tries to follow the same pattern as `dbx` for deployment. The `deploy` command will look for a `dltflow.yml` file in the root of the project. This file should contain the necessary configurations for deployment. See the Initialization docs for more information on the topic.
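The deployment file referenced below (`conf/dlt/test.json` in this example) carries the DLT pipeline job spec. A minimal sketch, assuming it resembles the JSON spec of the Databricks pipelines API (the exact schema `dltflow` expects may differ; check the files generated for your project):

```json
{
  "name": "my_project-test-pipeline",
  "development": true,
  "continuous": false,
  "clusters": [{"label": "default", "num_workers": 1}],
  "target": "my_schema"
}
```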
```shell
dltflow deploy-py-dlt --help
>>> Usage: dltflow deploy-py-dlt [OPTIONS]

  Deploy a DLT pipeline.

Options:
  --deployment-file TEXT  [required]
  --environment TEXT      [required]
  --as-individual         Overrides project settings. Useful for developers
                          as they're experimenting with getting their code
                          fully functional. The impact of this flag is that
                          any derived DLT pipelines will have a name prefix
                          of [{profile}_{user_name}] -- this is to not
                          overwrite any existing pipelines with logic that
                          is not yet fully baked.
  --help                  Show this message and exit.
```
And to tie together the full example, here's how we can deploy our example pipeline:

```shell
dltflow deploy-py-dlt --deployment-file conf/dlt/test.json --environment dev
```
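While you're still iterating, add the `--as-individual` flag so the deployed pipeline gets a `[{profile}_{user_name}]` name prefix and doesn't overwrite the shared pipeline:

```shell
dltflow deploy-py-dlt --deployment-file conf/dlt/test.json --environment dev --as-individual
```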