Data Engineering,  Data Science

Databricks sample NFL dataset

Databricks sample NFL dataset

Create a Databricks database with sample NFL data

The code this post is relating to is here:

This is using:

  • Azure Databricks
  • Azure Data Lake Storage (ADLS) Gen2

Create your storage account and container

In this demo I have created a storage account called nfl and a container within the account called nfldata dbnfl adls containers I have added my Azure Active Directory account as a Storage Blob Data Contributor. dbnfl adls container roles

Load data to ADLS

I am using Azure Storage Explorer to do this. You could also use the Azure portal, the azcopy cli or many other methods.

dbnfl adls ase upload

Clone this repo to your Databricks

dbnfl databricks clone repo

Open the notebook called

Run the Notebook

  1. First we will create the Hive database for the tables.
create database nfl
  1. The notebook has two functions.
    • Reading a CSV file and creating a Delta table
    • Reading all files in an ADLS container and running the aforementioned function for each.

dbnfl hive nfl tables

Test the results

The notebook has a sample query to test the results. This will get all players from the SF 49ers and sort them by the number rushing yards they had in each game.

  nfl.player p
  inner join nfl.team_lookup
on p.team_id = t.team_id
  and t.team_short = 'SF'
order by p.RUSHING_YARDS desc

For the 2014 NFL season, for the 49ers, Frank Gore and Colin Kaepernick were racking up the rushing yards.

Top 20

1Frank GoreRBSF1580
2Colin KaepernickQBSF151114
3Frank GoreRBSF1440
4Frank GoreRBSF1190
5Frank GoreRBSF1070
6Frank GoreRBSF950
7Frank GoreRBSF810
8Frank GoreRBSF660
9Colin KaepernickQBSF64248
10Frank GoreRBSF630
11Colin KaepernickQBSF63204
12Frank GoreRBSF630
13Colin KaepernickQBSF58218
14Carlos HydeRBSF550
15Colin KaepernickQBSF54245
16Carlos HydeRBSF500
17Frank GoreRBSF490
18Colin KaepernickQBSF46141
19Carlos HydeRBSF430
20Frank GoreRBSF380