Databricks and Azure DevOps CI/CD Pipeline for DBFS
DevOps CI/CD Pipeline for Databricks
This refers to this Git repo: https://dev.azure.com/mortimer-xyz/mortie23/_git/dbfs-cicd-pieline
Much credit to Adam Paternostro
This is a simplified version of Adam Paternostro's repo. Read his for a much more feature-rich version:
Adam Paternostro, Azure-Databricks-Dev-Ops
I only required a small subset of what Adam was doing, so I created my own how-to in the process. The following is specifically about syncing Databricks init scripts from a Git repo to DBFS.
There are quite a few things you need to get this to work. Most of it is documented elsewhere, so instead of rewriting it, I'll reference it.
Create a repo in Azure DevOps
Add the content of this repo.
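For example, a minimal sketch of getting the content in (using the repo URL from above; adjust for your own clone):

```bash
# Sketch: clone the new (empty) Azure DevOps repo, copy the content in, push
git clone https://dev.azure.com/mortimer-xyz/mortie23/_git/dbfs-cicd-pieline
cd dbfs-cicd-pieline
# ... copy in the init scripts and the pipeline definition, then:
git add .
git commit -m 'initial content'
git push
```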
Create a DevOps Azure Resource Manager service connection
This will create its own service principal, which we will be using later.
Azure Resource Manager service connection
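The portal flow above creates the service principal for you. If you'd rather script it, the azure-devops CLI extension has an equivalent, though note it expects an existing service principal; every id and name below is a placeholder:

```bash
# Sketch: create the ARM service connection via the azure-devops CLI extension.
# Unlike the portal flow, this expects an existing service principal.
# The service principal's secret is passed via this environment variable:
export AZURE_DEVOPS_EXT_AZURE_RM_SERVICE_PRINCIPAL_KEY=<sp-secret>

az devops service-endpoint azurerm create \
  --name my-arm-connection \
  --azure-rm-service-principal-id <appId> \
  --azure-rm-tenant-id <tenantId> \
  --azure-rm-subscription-id <subscriptionId> \
  --azure-rm-subscription-name '<subscriptionName>' \
  --organization https://dev.azure.com/mortimer-xyz \
  --project mortie23
```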
Create a Key Vault
Add the secrets to the Key Vault
You’ll need to generate a secret for the service principal.
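As a sketch with the Azure CLI (vault, resource group, location and secret names are all placeholders):

```bash
# Sketch: create the Key Vault (all names are placeholders)
az keyvault create --name my-keyvault --resource-group my-rg --location australiaeast

# Generate a secret (credential) for the service principal
az ad sp credential reset --id <appId>

# Store whatever secrets the pipeline needs, e.g. a Databricks token
az keyvault secret set --vault-name my-keyvault --name databricks-token --value <token>
```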
Create an Environment in Azure DevOps pipelines
Environment in Azure DevOps pipelines
Add an Access Policy for the Key Vault
Give the service principal associated with the Azure Resource Manager service connection (created previously) secret management permissions on the Key Vault.
Access Policy for the Key Vault
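With the Azure CLI this is roughly (vault name and appId are placeholders):

```bash
# Sketch: grant the service connection's service principal
# secret permissions on the Key Vault (names are placeholders)
az keyvault set-policy \
  --name my-keyvault \
  --spn <appId> \
  --secret-permissions get list set delete
```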
Create a Databricks secret scope backed by Key Vault
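This can be done in the Databricks UI at https://<databricks-instance>#secrets/createScope, or as a sketch with the (legacy) Databricks CLI; the scope name, resource id and DNS name below are placeholders:

```bash
# Sketch: create a Key Vault-backed secret scope with the legacy Databricks CLI
# (scope name, resource id and DNS name are placeholders)
databricks secrets create-scope \
  --scope my-scope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id /subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.KeyVault/vaults/my-keyvault \
  --dns-name https://my-keyvault.vault.azure.net/
```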
Create and then run the pipeline
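In essence, the pipeline just needs a script step that authenticates the Databricks CLI with a token from the Key Vault and copies the init scripts from the checked-out repo to DBFS. A minimal sketch of that step; the host, token and paths are placeholders, not the repo's exact names:

```bash
# Sketch of the pipeline's copy step: authenticate the Databricks CLI
# with environment variables, then push the init scripts to DBFS
# (host, token and paths are placeholders)
export DATABRICKS_HOST=https://<databricks-instance>
export DATABRICKS_TOKEN=<token-from-key-vault>

databricks fs cp --recursive --overwrite ./init-scripts dbfs:/init-scripts
```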
DevOps pipelines results
Databricks DBFS success
Now, checking our Databricks DBFS, we see the folder and the scripts we copied.
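You can verify the same thing from the CLI (the path is a placeholder matching the sketch above):

```bash
# List the copied init scripts in DBFS (path is a placeholder)
databricks fs ls dbfs:/init-scripts
```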
If we modify the scripts and then:

```bash
git add .
git commit -m 'message'
git push
```

we can configure the pipeline to run on completion of a pull request to master.