Databricks and Azure DevOps CICD Pipeline for DBFS

DevOps CICD Pipeline for Databricks

This post refers to the Git repo at https://dev.azure.com/mortimer-xyz/mortie23/_git/dbfs-cicd-pieline

Much credit to Adam Paternostro

This is a simplified version of Adam Paternostro's repo. Read his repo for a much more feature-rich version.

Adam Paternostro, Azure-Databricks-Dev-Ops

I only required a small subset of what Adam was doing, so I created my own how-to in the process. The following is specifically about syncing Databricks init scripts from a Git repo to DBFS.
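To make the idea of syncing scripts to DBFS concrete, here is a minimal sketch of one way it could be done against the DBFS REST API (`/api/2.0/dbfs/put`). It is not taken from the repo above, and the pipeline there may well use a different mechanism. The function names, the local folder and the DBFS target folder are all hypothetical. The sketch only builds the requests and does not send them.

```python
import base64
import json
import urllib.request
from pathlib import Path, PurePosixPath


def build_dbfs_put_payloads(local_dir, dbfs_dir):
    """Return one JSON body for /api/2.0/dbfs/put per file under local_dir.

    local_dir and dbfs_dir are hypothetical names chosen for this sketch.
    Note: the single-call 'contents' upload is limited to 1 MB per file,
    which is normally plenty for init scripts.
    """
    root = Path(local_dir)
    payloads = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        payloads.append({
            "path": str(PurePosixPath(dbfs_dir) / relative),
            # The API expects the file contents base64 encoded.
            "contents": base64.b64encode(path.read_bytes()).decode("ascii"),
            "overwrite": True,
        })
    return payloads


def build_dbfs_put_request(workspace_url, token, payload):
    """Build (but do not send) the HTTP request for one payload."""
    return urllib.request.Request(
        workspace_url.rstrip("/") + "/api/2.0/dbfs/put",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Each request could then be sent with `urllib.request.urlopen(request)` from a pipeline step, with the workspace URL and token supplied from pipeline variables or secrets rather than hard coded.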

There are quite a few things you need to set up to get this working. Most of it is documented elsewhere, so instead of rewriting it, I’ll reference it.

Create a repo in Azure DevOps

Add the content of this repo.

cicd scripts

Create a DevOps Azure Resource Manager service connection

This will create its own service principal, which we will be using later.

Azure Resource Manager service connection

cicd service principal params

Create a Key Vault

Quick Create in Azure Portal

Add the secrets to the Key Vault

You’ll need to generate a secret for the service principal.

cicd secrets keyvault add
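The screenshot shows the portal route. As an alternative illustration, here is a rough sketch of how a secret could be added through the Key Vault REST API (the Set Secret operation, a `PUT` on `/secrets/{name}`). The vault name, secret name and API version below are assumptions for the example, not values from this setup, and the function only builds the request without sending it.

```python
import json
import re
import urllib.request

# Key Vault secret names may only contain letters, digits and dashes.
SECRET_NAME_PATTERN = re.compile(r"[0-9a-zA-Z-]{1,127}")


def build_set_secret_request(vault_name, secret_name, secret_value,
                             access_token, api_version="7.4"):
    """Build (but do not send) a Key Vault 'Set Secret' request.

    vault_name, secret_name and api_version are illustrative assumptions.
    access_token must be an Azure AD token issued for Key Vault.
    """
    if not SECRET_NAME_PATTERN.fullmatch(secret_name):
        raise ValueError(f"invalid Key Vault secret name: {secret_name!r}")
    url = (f"https://{vault_name}.vault.azure.net/secrets/"
           f"{secret_name}?api-version={api_version}")
    return urllib.request.Request(
        url,
        data=json.dumps({"value": secret_value}).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

The name check is there because underscores and dots, which are common in variable names, are rejected by Key Vault.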

Create an Environment in Azure DevOps pipelines

Environment in Azure DevOps pipelines

Add an Access Policy for the Key Vault

Give the service principal associated with the Azure Resource Manager service connection (created previously) Secret Management permissions on the Key Vault.

Access Policy for the KeyVault

Create a Databricks secret scope backed by Key Vault

cicd secrets keyvault backed
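For reference, the same scope can in principle be created through the Databricks Secrets API (`POST /api/2.0/secrets/scopes/create`) instead of the UI. The sketch below only assembles the request body. The scope name, subscription ID, resource group and vault name are placeholders invented for the example, and as far as I know this call needs an Azure AD token rather than a Databricks personal access token.

```python
import json


def build_keyvault_scope_payload(scope_name, subscription_id,
                                 resource_group, vault_name):
    """Return the JSON body for creating a Key Vault backed secret scope.

    All four arguments are placeholders for this sketch. The resource ID
    and DNS name are the two values the Databricks UI also asks for.
    """
    resource_id = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.KeyVault/vaults/{vault_name}"
    )
    return {
        "scope": scope_name,
        "scope_backend_type": "AZURE_KEYVAULT",
        "backend_azure_keyvault": {
            "resource_id": resource_id,
            "dns_name": f"https://{vault_name}.vault.azure.net/",
        },
    }


if __name__ == "__main__":
    # Example with made-up values, printed as the JSON that would be posted.
    body = build_keyvault_scope_payload(
        "cicd-scope", "00000000-0000-0000-0000-000000000000",
        "my-resource-group", "myvault")
    print(json.dumps(body, indent=2))
```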

Create and then run the pipeline

cicd devops pipelines new

cicd devops pipelines run

DevOps pipelines results

cicd devops pipelines

cicd devops pipelines success

Databricks DBFS success

Now, checking our Databricks DBFS, we see the folder and the scripts we copied.

cicd databricks dbfs success
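The same check could be automated rather than done by eye. Below is a small sketch, assuming the DBFS list endpoint (`GET /api/2.0/dbfs/list`) and its `files` response field. The helper names and the example paths are hypothetical. Note that the list call is not recursive, so nested folders would need one call each.

```python
import urllib.parse


def build_dbfs_list_url(workspace_url, dbfs_dir):
    """Return the URL that lists the contents of one DBFS folder."""
    query = urllib.parse.urlencode({"path": dbfs_dir})
    return workspace_url.rstrip("/") + "/api/2.0/dbfs/list?" + query


def missing_from_dbfs(expected_paths, list_response):
    """Return the expected DBFS file paths absent from a list response.

    list_response is the parsed JSON returned by /api/2.0/dbfs/list,
    e.g. {"files": [{"path": "/init/a.sh", "is_dir": false, ...}]}.
    """
    present = {
        entry["path"]
        for entry in list_response.get("files", [])
        if not entry.get("is_dir", False)
    }
    return sorted(set(expected_paths) - present)
```

A pipeline step could fail the run whenever `missing_from_dbfs` returns a non-empty list.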

After modifying the scripts, we commit and push as usual (git add . then git commit -m 'message' then git push). We can configure the pipeline to run when a pull request into master is completed.