
Databricks security model architecture

Overview

Let’s say we have a brand new Databricks resource that we have just created in Azure. It comes as an empty shell ready for use. You add yourself as the system administrator, and you need to onboard your organisation in such a way that teams within the organisation can collaborate. To illustrate this, let’s take the example of a large group of law firms.

Organisations can have a structure where multiple “business units” are supported by a common set of enabling services. In this example the People (HR), Finance and IT teams are central and support the different groups of legal teams, Family, Criminal and Consumer law.

databricks security model org

One of the firms has put its hand up to provide the enabling service of IT.

databricks security model law firm members

Databricks will be used by both the business units and the IT team (for bespoke analysis of IT systems, such as cost monitoring of cloud resources and monitoring usage of systems). The teams may need to keep their work separated and inaccessible to each other, while also needing to share at times.

Bi-modal

Gartner coined the term Bimodal. See this article: Gartner Glossary, Bimodal.

Bimodal is the practice of managing two separate but coherent styles of work: one focused on predictability; the other on exploration. Mode 1 is optimized for areas that are more predictable and well-understood. It focuses on exploiting what is known, while renovating the legacy environment into a state that is fit for a digital world. Mode 2 is exploratory, experimenting to solve new problems and optimized for areas of uncertainty.

There are many words used to distinguish these two modes; here are a few I have seen:

  1. Mode 1
    • Operational
  2. Mode 2
    • Discovery, Laboratory, Exploratory

Here we are focusing on mode 2 (with some mode 1 thrown in too) and will denote mode 2 as Discovery.

Components

There are many components of a Databricks workspace that need to be considered in the security model. We will try to cover most of what is included and take a default position on each component. Deciding this upfront helps avoid business users having to ask down the track for a component that hasn’t been considered.

  1. Workspace Settings
  2. Users and Groups
  3. Workspace and Repos
  4. Compute
  5. Storage
  6. Secrets
  7. Machine Learning

databricks security model

Workspace Settings

There are many workspace settings that can be changed depending on your requirements for the platform. It is difficult to prescribe a default set of modifications to make; however, one recommendation would be to enable the Web Terminal. Users are able to submit shell commands from a notebook regardless of this setting, so enabling it simply makes life easier for those comfortable using a command line.

databricks security model workspace settings
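
To illustrate why the web terminal is a convenience rather than a security boundary, a notebook cell can already run shell commands on the cluster driver; a minimal sketch in Python:

# Shell commands can be run from a notebook cell on the cluster driver,
# whether or not the web terminal is enabled.
import subprocess

result = subprocess.run(["uname", "-a"], capture_output=True, text=True)
print(result.stdout)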

Users and Groups

Azure Active Directory

Our Azure administrators have some nice rules and naming conventions which will help us down the track. Let’s create the organization structure in Azure Active Directory.

LawFirms-<Environment>-Databricks-<BusinessUnit>-<Role>
Component      Values                           Description
Organisation   LawFirms                         The name of the organisation
Environment    Dev, PreProd, Prod
Resource       Databricks                       The name of the resource
Business unit  Family, Consumer, Criminal etc   The name of the business unit
Role           DataScientist, DataEngineer etc  The role within the business unit
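
As a sketch, the convention can be captured in a small helper (the function name is just for illustration) so that group names stay consistent wherever they are created:

# Hypothetical helper that builds an AAD group name from the naming convention
def aad_group_name(environment: str, business_unit: str, role: str) -> str:
    return f"LawFirms-{environment}-Databricks-{business_unit}-{role}"

print(aad_group_name("Prod", "Family", "DataScientist"))
# LawFirms-Prod-Databricks-Family-DataScientist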

First, we create a series of groups that allow us to set permissions at the group level (not the user level).

databricks security model aad groups

Now we add the users to the groups.

databricks security model aad group member

This is repeated for all the users and groups in the New York law firms organizational structure.

Databricks users and groups

We will sync these users and groups from Azure Active Directory to our Databricks workspace using the process described in Sync users and groups from Azure Active Directory. If that is too much overhead (which it may well be), then syncing the users manually and creating Databricks local groups manually may suffice.

Note: if we are going to do this manually, we should put careful thought into how we might migrate to the SCIM method later down the track. Following a naming convention that will make the migration easy would be beneficial. The manual method will also become cumbersome as more law firms in New York start to see the benefits of the platform and want to jump on, or as lawyers move between firms (which we know they do) and we need to change their membership in multiple places (Azure AD and Databricks).

databricks security model databricks groups
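
If you do go manual first, a quick way to audit what already exists in the workspace (useful when planning the later SCIM migration) is the SCIM Groups API; a sketch assuming a workspace URL and a personal access token (both placeholders):

# List the groups currently in the workspace via the SCIM API.
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"

resp = requests.get(
    f"{host}/api/2.0/preview/scim/v2/Groups",
    headers={"Authorization": f"Bearer {token}"},
)
for group in resp.json().get("Resources", []):
    print(group["displayName"])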

By default, all users are in the users group and cannot be removed from it. The users group may have entitlements that you want to control separately, so let’s remove all default entitlements from the users group.

databricks security model users group defaults

Now, let’s add at least Workspace access back for each of our custom groups from Azure Active Directory.

databricks security model workspace access

Now that we have a Databricks workspace with the users and groups from our organisation, we need to set up some things to enable them to work both siloed and collaboratively.

Workspace and Repos

Workspace

Using source control is the recommended way to manage notebooks (code). But in some instances your business units’ data scientists may not be comfortable using Git, or the IT team may not have provided them with access to a remote Git service (such as GitHub, GitLab, Bitbucket or Azure DevOps Repos).

databricks security model workspace

We create a top level folder for each business unit at both the Workspace level and the Shared level. The Shared folders are optional, and it should be communicated to the business that they are use-at-your-own-risk, as other teams can read and write content in them.

databricks security model workspace permissions family law
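
Setting these folder permissions can also be scripted; a sketch using the Permissions API, where the host, token and folder path are illustrative:

# Grant the Family group Can Manage on the Family workspace folder.
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"
headers = {"Authorization": f"Bearer {token}"}

# Look up the folder's object id from its workspace path
status = requests.get(
    f"{host}/api/2.0/workspace/get-status",
    headers=headers,
    params={"path": "/Family"},
).json()

# Add the business unit group with Can Manage on the folder
requests.patch(
    f"{host}/api/2.0/permissions/directories/{status['object_id']}",
    headers=headers,
    json={
        "access_control_list": [
            {
                "group_name": "LawFirms-Prod-Databricks-Family-DataScientist",
                "permission_level": "CAN_MANAGE",
            }
        ]
    },
)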

So now, logging in as Robert Zane from the Family Law business unit, we will only see the Family workspace folder.

databricks security model workspace robert zane

Repos

What is a distributed version control system?

A distributed version control system (DVCS) brings a local copy of the complete repository to every team member’s computer, so they can commit, branch, and merge locally.

When it comes to Repos, we do not need a folder for each business unit for feature development and collaboration. Git is a DVCS, which means each user has a folder of their own.

However, for running scheduled jobs or CI/CD, a location that always has the production code is required. This article discusses the topic well: CI/CD workflows with Git integration and Databricks Repos. To enable this, we will still create a top level folder for each business unit so they can operationalise their pipelines.

databricks security model repos

We have set the same Can Manage permissions on the top level Repo folder (Family), and Robert was able to log in and clone his team’s repo twice:

  1. once checked out to the preprod branch
  2. once checked out to the master branch

This way, his team can create jobs that source code from either of these clones and ensure the job runs against the correct version of the code base.
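
Creating and pinning those two clones can be scripted as well; a sketch using the Repos API, where the repository URL, paths, host and token are illustrative:

# Create a preprod and a prod clone of the team repository under /Repos/Family,
# each checked out to a fixed branch for jobs to run against.
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"
headers = {"Authorization": f"Bearer {token}"}

for branch, path in [
    ("preprod", "/Repos/Family/family-law-preprod"),
    ("master", "/Repos/Family/family-law-prod"),
]:
    repo = requests.post(
        f"{host}/api/2.0/repos",
        headers=headers,
        json={
            "url": "https://github.com/lawfirms/family-law.git",  # hypothetical repo
            "provider": "gitHub",
            "path": path,
        },
    ).json()
    # Check the clone out to the branch the jobs will use
    requests.patch(
        f"{host}/api/2.0/repos/{repo['id']}",
        headers=headers,
        json={"branch": branch},
    )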

He also has a personal folder where he can clone his team’s repository and check out the feature branches he creates for analysis and development.

Compute

We may/may not want to allow business units to create any cluster they want. This example assumes we do not.

Note: Going down this path will create more administration burden down the track as business units’ requirements change. Some restrictions appear to have more advantages than not, such as limiting the machine spec (and therefore limiting cost); however, limiting the Databricks runtime version, for example, does not appear to have many advantages.

This can be done using cluster policies.

Cluster policies

A cluster policy definition is a JSON string that defines restrictions or sets parameters on cluster creation.

Cluster policies themselves require a name. The name will contain the core components required to delineate between uses of the policies, and these line up with the components used to name the Azure Active Directory groups.

<BusinessUnit>-<Role>
Component      Values                           Description
Business unit  Family, Consumer, Criminal etc   The name of the business unit
Role           DataScientist, DataEngineer etc  The role within the business unit

We will create a very light touch cluster policy for our IT team members and a more restrictive policy for the members of the Family Law business unit (see Restrictive cluster policy).

The light touch cluster policy is below:

{
  "cluster_name": {
    "type": "regex",
    "pattern": "(Dev|PreProd|Prod)-(Discovery|Operational)-IT-[0-9.]*?ML?-[a-zA-Z]*-?[0-9]*?",
    "hidden": false
  },
  "custom_tags.Team": {
    "type": "fixed",
    "value": "IT"
  },
  "autotermination_minutes": {
    "type": "unlimited",
    "isOptional": true,
    "defaultValue": 60
  }
}

Our naming convention for the clusters is:

(Dev|PreProd|Prod)-(Discovery|Operational)-<BusinessUnit>-[0-9.]*?ML?-[a-zA-Z]*-?[0-9]*?

Component           Values                          Description
Environment         Dev, PreProd, Prod              The environment of the cluster’s use
Mode                Discovery, Operational          What mode the cluster is used for
Business unit       Family, Consumer, Criminal etc  The name of the business unit
Databricks runtime  e.g. 11.3ML                     The Databricks runtime version
Name                Such as a project               A name to denote the use of the cluster
Cluster number      If more than one per project    A number (optional)

Example:

Prod-Discovery-Family-11.3ML-project-01
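
A quick way to sanity check a proposed name against the convention before hitting the policy error in the UI (the helper below is just for illustration):

# Test a proposed cluster name against the naming convention regex.
import re

def cluster_name_pattern(business_unit: str) -> str:
    return (
        r"(Dev|PreProd|Prod)-(Discovery|Operational)-"
        + business_unit
        + r"-[0-9.]*?ML?-[a-zA-Z]*-?[0-9]*?"
    )

print(bool(re.fullmatch(cluster_name_pattern("Family"), "Prod-Discovery-Family-11.3ML-project-01")))  # True
print(bool(re.fullmatch(cluster_name_pattern("Family"), "my-cluster")))                               # False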

After creation of the policy, we need to assign the policy permission to the correct group.

databricks security model cluster policy

Testing the policy

When Robert creates a cluster, he only has the option to use the Family-DataScientist cluster policy.

databricks security model family create cluster

When he attempts to create a cluster that breaches the naming convention, he is notified at the bottom of the page of the naming convention (as a regex) he needs to follow.

databricks security model cluster naming

Cluster use

By default, permissions for using a cluster are exclusive to the user that created it. This is not ideal, since the cluster might need to be used (in the case of a shared cluster), monitored, stopped or deleted by someone else within the team. We will communicate to the business units that they can self-manage the permissions of their clusters and should add their team as Can Manage after creating any new cluster.

databricks security model cluster default permissions
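
For teams that prefer to script it, the same Can Manage grant can be applied with the Permissions API; a sketch where the host, token and cluster id are placeholders:

# Add a team's group with Can Manage on an existing cluster.
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"
cluster_id = "<cluster-id>"

requests.patch(
    f"{host}/api/2.0/permissions/clusters/{cluster_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "access_control_list": [
            {
                "group_name": "LawFirms-Prod-Databricks-Family-DataScientist",
                "permission_level": "CAN_MANAGE",
            }
        ]
    },
)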

Package management

There are a few ways to install required packages on clusters. Databricks provides a GUI for installing packages, but this can also be done:

  1. within notebook code using, for example, %pip install <package>
  2. within an init script run while the cluster is starting up (see Sample init_script framework in the appendix)

Storage

Data lake

All storage will be kept in an Azure Data Lake Storage (ADLS) Gen 2 storage account. A single container will be created for Discovery use. A sub folder will be created for each business unit and permissions set accordingly.

databricks security model adls discovery

To enable users from Azure Active Directory to read from and write to ADLS, we need a couple of things first. We will use the group to manage the permissions:

  1. The group will need the role of Reader at the storage account level
  2. The ADLS folder’s ACLs will need to have the group set appropriately.

databricks security model adls role

databricks security model adls folder permissions

Make sure you set the default permissions too so that child items (files) in the folders inherit the permissions of the parent folder.

databricks security model adls folder permissions default
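
With the Reader role and folder ACLs in place, a Family Law user can read and write within their folder from a notebook. A minimal sketch, assuming the cluster authenticates to ADLS as the user (for example via credential passthrough) and using illustrative file and folder names:

# Read a raw CSV from the business unit folder and write it back as Delta.
# spark is the SparkSession already available in a Databricks notebook.
base = "abfss://discovery@lawfirm.dfs.core.windows.net/Family"

df = spark.read.option("header", True).csv(f"{base}/raw/matters.csv")

df.write.format("delta").mode("overwrite").save(f"{base}/curated/matters")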

This is a very siloed security model and does not allow for sharing of data between business units. It is suggested to also create a shared folder within the discovery container and give each business unit access to it for the occasions when they need to share data.

Hive metastore

In Databricks, data can be registered in the Hive metastore (think of it as an information schema or data dictionary, not the data itself). This enables SQL queries on the data (schema on read).

By default there is no security across the Hive metastore, which means that all users have permissions to create, read and drop registrations (not the data) from the metastore.

We will create a Hive metastore schema (the terms schema and database are interchangeable in a Databricks metastore) for each business unit.

CREATE SCHEMA admin;
CREATE SCHEMA family;
CREATE SCHEMA consumer;
CREATE SCHEMA criminal;
CREATE SCHEMA it;

Note: Hive metastore objects (schemas, tables etc) can only be lowercase.

databricks security model hive
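
Once the schemas exist, a business unit can register tables in its own schema over its ADLS folder; a sketch with an illustrative table name and path:

# Register an external Delta table in the family schema over the ADLS folder.
spark.sql("""
  CREATE TABLE IF NOT EXISTS family.matters
  USING DELTA
  LOCATION 'abfss://discovery@lawfirm.dfs.core.windows.net/Family/curated/matters'
""")
spark.sql("SHOW TABLES IN family").show()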

These schemas are only a suggested convention for business users; there is nothing restricting them from creating more schemas or dropping existing ones.

To make a more secure metastore layer without going down the path of Table Access Controls we could implement multiple metastores using Databricks Unity Catalog (not for this discussion).

Table Access Controls

Table access control lets you programmatically grant and revoke access to your data from Python and SQL.

By default, all users have access to all data stored in a cluster’s managed tables unless table access control is enabled for that cluster. Once table access control is enabled, users can set permissions for data objects on that cluster.

Table ACLs

This means that unless Table ACLs are enforced on all clusters and applied to all objects in the Hive metastore, they are not really an option. Since we have architected permissions at the storage level, we will not go down this path.

A sample of how this may work:
GRANT USAGE ON SCHEMA family TO `LawFirms-Prod-Databricks-FamilyLaw-DataScientist`;

Implementing a data governance model using Table ACLs requires clusters with Table ACLs enabled. If we attempt to run any GRANT statements with clusters that do not have Table ACLs enabled we will get the following error.

Error in SQL statement: SparkException: Trying to perform permission action on Hive Metastore /CATALOG/`hive_metastore`/DATABASE/`family` but Table Access Control is not enabled on this cluster.

After adding the following Spark config to the cluster, the Table ACL was successfully set.

spark.databricks.acl.sqlOnly true

Michael Ross in the IT team logs in, tries to select from a table in the family schema, and gets the following error.

Error in SQL statement: SecurityException: User does not have permission SELECT on table `family`.`game`.
User does not have permission USAGE on database `family`.

DBFS

The Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters.

Databricks recommends against storing any production data or sensitive information in DBFS.

In some cases it may be required, so the important thing to do is to set some standards and expectations for business units on the usage of DBFS.

/dbfs/FileStore/            Top level DBFS folder for artifacts
├── Admin/                  Admin directory
│   ├── init_scripts/       For admin init scripts
│   └── lib/                For file based libraries or packages
│       ├── debs/           Debian packages
│       ├── python/         Python packages
│       └── R/              R packages
└── Family/                 Business unit directory
    └── init_scripts/       For business unit init scripts

As a convention, we use the top level DBFS FileStore folder. At this level we create a folder for each business unit and provide them guidance on placing anything that is required there. There is nothing stopping them from creating more top level folders or deleting existing folders or files.

For this reason it is suggested that any files stored in DBFS are not the source of truth. A suggested way to manage scripts that are required in DBFS is to use a process such as the one documented in a previous post: Databricks and Azure DevOps CICD Pipeline for DBFS.

Secrets

Users may or may not require secrets. However, it is best to provide an Azure Key Vault-backed secret scope upfront for each business unit, along with a user guide on how to use it in case they do.

We are going to manage the secrets using the Databricks CLI in a PowerShell terminal, following Create an Azure Key Vault-backed secret scope using the Databricks CLI.

# Create each secret scope
databricks secrets create-scope --scope <scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <azure-keyvault-resource-id> --dns-name <azure-keyvault-dns-name>
# List all scopes
databricks secrets list-scopes
Scope        Backend         KeyVault URL
-----------  --------------  --------------------------------------------
admin        AZURE_KEYVAULT  https://<keyvault>.vault.azure.net/
consumer     AZURE_KEYVAULT  https://<keyvault>.vault.azure.net/
criminal     AZURE_KEYVAULT  https://<keyvault>.vault.azure.net/
family       AZURE_KEYVAULT  https://<keyvault>.vault.azure.net/
it           AZURE_KEYVAULT  https://<keyvault>.vault.azure.net/
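
Once a scope exists, business unit users can read their secrets from a notebook with dbutils (values are redacted in notebook output); the key name below is illustrative:

# List the keys visible in the scope and fetch one secret value.
print(dbutils.secrets.list("family"))
password = dbutils.secrets.get(scope="family", key="sql-service-account-password")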

Machine Learning

Our users are going to want to register models to the built-in model registry in Databricks (an instance of MLflow).

At this stage, the recommendation in Discovery is for users to self-manage the permissions of their model artifacts; a user guide that walks through an example is the best way to enable this.

To demonstrate how this could work we are going to log in as one of our IT Data Scientists (Michael Ross), and he is going to run through our Databricks NFL MLFlow example.

First he creates an appropriate cluster using the cluster policy we created for Data scientists in our IT team. He names it:

Prod-Discovery-IT-11.3ML-nfl-01

He knows the best practices thanks to the great user guides provided by the administrators, so he goes into the cluster permissions and gives the rest of his team permissions on the shared cluster.

databricks security model cluster permissions

He clones the Git repo to his personal repo folder using the following HTTPS clone URL https://github.com/mortie23/databricks-nfl-data.git.

databricks security model mlflow repo

First, Michael loads the CSV files to the ADLS folder that the administrators have set up for him at abfss://discovery@lawfirm.dfs.core.windows.net/IT/ and then runs the db.sql notebook to load all the NFL data from ADLS and register it in the Hive metastore.

Then he opens up the nfl.model notebook and trains a model. Since we are using MLflow, Michael ends up with an experiment tracked in MLflow.

with mlflow.start_run():

databricks security model mlflow experiment
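
For context, here is a minimal sketch (not the actual notebook, and using a toy dataset) of the shape of such a training cell; everything logged inside the run becomes part of the experiment:

# Train a small model and log parameters, metrics and the model itself to MLflow.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

with mlflow.start_run():
    model = DecisionTreeClassifier(max_depth=5).fit(X, y)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")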

Notebooks and experiments in a folder inherit all permissions settings of that folder. So only Michael will be able to see the experiment he ran. This is fine in Discovery, but when moving into Mode 1 (Operational) we will want the experiments run from the top level PreProd and Prod clones.

Michael now registers the model from the experiment run so it can be used by others in the team. He also needs to remember to set the permissions on the registered model.

databricks security model model permissions
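
Registering the model from a run can also be done in code; a sketch where the run id is a placeholder and the model name matches the one used later in this walkthrough:

# Register the model logged in the run under a named registered model.
import mlflow

result = mlflow.register_model(
    model_uri="runs:/<run-id>/model",
    name="nfl-decisiontree-players",
)
print(result.version)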

Michael Ross now decides to hand off the remainder of the work (integrating the model predictions into an operational pipeline) to the Data Engineer in his team.

Harvey Specter now logs in, and since Michael has set the permissions correctly he can pick up the same notebook. Since Michael set it up as a shared cluster, Harvey can use the cluster that Michael created.

Note: If Michael did not set the model permissions, Harvey would get the following error: RestException: PERMISSION_DENIED: User does not have any permission level assigned to the registered model.

Harvey is able to use the following code to load the latest trained model that Michael prepared for him.

# Load the latest registered version of the model as a Spark UDF
import mlflow.pyfunc

model_name = "nfl-decisiontree-players"
model_udf = mlflow.pyfunc.spark_udf(
  spark,
  f"models:/{model_name}/latest"
)
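
He can then score a Spark DataFrame with the UDF; assuming a hypothetical features_df containing only the model’s input columns:

# features_df is a hypothetical DataFrame of model inputs; apply the model to each row.
scored = features_df.withColumn("prediction", model_udf(*features_df.columns))
display(scored)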

Appendix

Restrictive cluster policy

{
  "spark_conf.spark.databricks.cluster.profile": {
    "type": "fixed",
    "value": "singleNode",
    "hidden": true
  },
  "num_workers": {
    "type": "fixed",
    "value": 0,
    "hidden": true
  },
  "runtime_engine": {
    "type": "fixed",
    "value": "STANDARD",
    "hidden": true
  },
  "cluster_type": {
    "type": "fixed",
    "value": "all-purpose",
    "hidden": false
  },
  "cluster_name": {
    "type": "regex",
    "pattern": "(Dev|PreProd|Prod)-(Discovery|Operational)-Family-[0-9.]*?ML?-[a-z]*-?[0-9]*?",
    "hidden": false
  },
  "data_security_mode": {
    "type": "fixed",
    "value": "LEGACY_SINGLE_USER"
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 60
  },
  "custom_tags.Team": {
    "type": "fixed",
    "value": "Family"
  },
  "spark_version": {
    "type": "allowlist",
    "values": ["11.3.x-cpu-ml-scala2.12", "11.3.x-scala2.12"],
    "defaultValue": "11.3.x-scala2.12"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_F4"],
    "defaultValue": "Standard_F4"
  }
}

Sample init_script framework

This sample framework has a parent script that calls subsequent scripts to install packages for:

  1. apt
  2. python
  3. R

clusterConfigs.sh

#!/bin/bash
## location of the admin init scripts in DBFS (assumed path; adjust to your layout)
script_dir=/dbfs/FileStore/Admin/init_scripts
## add custom prompt to bashrc
cat ${script_dir}/bashPrompt.sh >>~/.bashrc
## add custom vimrc since we use vim so much on the terminal
cat ${script_dir}/.vimrc >>~/.vimrc
## install OS, python and R packages
source ${script_dir}/packageInstallTest.sh

packageInstallTest.sh

#!/bin/bash
## install python packages from a requirements file
/databricks/python/bin/pip install -r ${script_dir}/packageInstallTest.txt
## install jq using apt, and libsodium-dev (which is a dependency for our next R package)
apt-get -y install jq
apt-get -y install libsodium-dev
## install libraries from the R package list
cat ${script_dir}/packageInstallTest.R | R --no-save

packageInstallTest.txt

nltk
spacy
file:/dbfs/FileStore/Admin/lib/python-package/en_core_web_md-3.1.0-py3-none-any.whl
imblearn

packageInstallTest.R

## set the correct library path for Databricks spark
.libPaths('/databricks/spark/R/lib')
databricksLibPath<-.libPaths()[1]
## install two packages to this library path
install.packages('plumber', repos='https://cran.csiro.au/', lib=databricksLibPath)