Cloud Infrastructure, Data Engineering, GCP, Machine Learning, Python, Software Engineering

Making Machine Learning predictions from BigQuery Dataform

NFL Touchdown Prediction with Dataform and BigQuery ML

This blog post demonstrates how to build an end-to-end ML pipeline using Google Cloud Platform (GCP) services to predict NFL game touchdowns.

There is a lot more that is set up and explained in the previous blog posts in this series.

Architecture

The solution uses BigQuery for storage, Dataform for orchestrating the SQL pipeline, and a Cloud Run prediction service exposed to BigQuery as a remote function.

Project Structure

ml-df/
├── 📁definitions/         # Dataform SQL definitions
│   ├── 📁t0_mdl/          # Model declarations
│   ├── 📁t1_raw/          # Raw data loading
│   └── 📁t2_cur/          # Curated/transformed/predicted data
├── 📁docs/                # Documentation and screenshots
│   └── 📁img/             # Screenshots
└── workflow_settings.yaml # Dataform configuration

Setup and Usage

  1. Set up a GCP project with BigQuery, Dataform, and the other required GCP infrastructure components enabled (covered in the previously linked blog posts)
  2. Fork this repository and set up a workspace in Dataform
  3. Configure workflow_settings.yaml with your project details (a sketch is shown just after this list)
  4. Run the Dataform pipeline
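
For reference, a minimal workflow_settings.yaml might look something like the sketch below. The project, location, and dataset values are placeholders you would replace with your own, and the vars block is an assumption based on the layer folders in this repo (the prediction SQLX later reads dataform.projectConfig.vars.layer_cur for its target schema):

defaultProject: your-gcp-project            # your GCP project id (placeholder)
defaultLocation: US                         # BigQuery location
defaultDataset: dataform                    # default dataset for compiled objects
defaultAssertionDataset: dataform_assertions
dataformCoreVersion: 3.0.0
vars:
  layer_mdl: t0_mdl                         # model / remote function declarations
  layer_raw: t1_raw                         # raw data
  layer_cur: t2_cur                         # curated data and predictions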

Pipeline Flow

[Image: GCP BigQuery Dataform pipeline graph]

  1. Raw NFL game statistics are loaded into BigQuery
  2. Data is transformed and prepared for prediction
  3. Each game’s features are sent to the ML model endpoint, a Cloud Run service that is set up and discussed in the previously linked blog posts (a sketch of the remote function declaration follows this list)
  4. Predictions are stored back in BigQuery
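
The ${ref("nfltouchdown_cloudrun")} function used in the SQLX below is a BigQuery remote function backed by that Cloud Run service. Its creation is covered in the earlier posts, but roughly it is a CREATE FUNCTION ... REMOTE WITH CONNECTION statement along these lines, where the project, dataset, connection, and endpoint values are all placeholders:

create or replace function `your-gcp-project.t0_mdl.nfltouchdown_cloudrun`(batch JSON)
returns JSON
remote with connection `your-gcp-project.us.cloudrun-conn`
options (
  -- URL of the Cloud Run predict service (placeholder)
  endpoint = 'https://nfltouchdown-predict-<hash>-uc.a.run.app'
);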

The main part of interest is the SQLX for the prediction table:

config {
  type: "incremental",
  description: "NFL game touchdown predictions",
  uniqueKey: ["game_team_id"],
  columns: {
    game_team_id: 'Unique identifier for game and team combination',
    predicted_touchdowns: 'Predicted number of touchdowns',
    created_timestamp: 'Created timestamp'
  },
  schema: dataform.projectConfig.vars.layer_cur
}

with batched_games as (
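    -- Gather all new games into a single JSON array so the remote function is called only once per run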
    select
      ifnull(
        array_agg(
          json_object(
            'game_team_id', game_team_id,
            'total_first_downs', total_first_downs,
            'total_yards', total_yards,
            'interceptions', interceptions,
            'punts', punts
          )
          ignore nulls
        ), [json_object()]
      ) as batch
    from
      ${ref("game_stats")}
      ${when(incremental(), `where game_team_id not in (select distinct game_team_id from ${self()})` )}
)
, predictions as (
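    -- One call per run to the Cloud Run model endpoint via the BigQuery remote function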
    select
      ${ref("nfltouchdown_cloudrun")}(parse_json(to_json_string(batch))) as predictions_json
    from
      batched_games
)
, unnested_predictions as (
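    -- Split the JSON array of predictions back out into one row per game/team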
    select
        json_value(prediction, '$.game_team_id') as game_team_id
      , json_value(prediction, '$.prediction') as predicted_touchdowns
    from
      predictions
    , unnest(json_query_array(predictions_json, '$')) as prediction
)
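-- Cast the predictions to their final column types for the incremental target table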
select
    cast(up.game_team_id as INT64) as game_team_id
  , cast(up.predicted_touchdowns as FLOAT64) as predicted_touchdowns
  , current_timestamp() as created_timestamp
from
  unnested_predictions up

This method incrementally picks up new game stats (for argument’s sake, let’s say it is run weekly) and then makes only one request to the Cloud Run predict service for all the games that week.

It does this by using array_agg and json_object to build an array of JSON objects and POST all of that data through the BigQuery remote function connection in a single call. It then uses unnest to split the responses back out into individual rows.
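
Stripped of the remote function and incremental logic, the batch-and-unbatch round trip looks like the minimal sketch below (inline dummy rows stand in for the game_stats table):

with game_stats as (
    -- stand-in for the real game_stats table
    select 1 as game_team_id, 22 as total_first_downs union all
    select 2 as game_team_id, 18 as total_first_downs
)
, batched as (
    -- one row holding an array of JSON objects, one object per game/team
    select
      array_agg(
        json_object('game_team_id', game_team_id, 'total_first_downs', total_first_downs)
      ) as batch
    from game_stats
)
-- unnest splits the array back out into one row per original game/team
select
    json_value(item, '$.game_team_id') as game_team_id
from batched, unnest(batch) as item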

Example BigQuery objects

This shows what the BigQuery objects look like when viewed in the BigQuery web UI.

[Image: BigQuery objects in the web UI]

Example execution

A successful execution of the complete pipeline in Dataform looks like this:

[Image: Dataform pipeline execution]

Example results

Once the pipeline has run, the predicted touchdowns are available in the game_touchdown_predict table.

[Image: BigQuery touchdown predictions]
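
For example, assuming the curated dataset configured in layer_cur is named t2_cur (a placeholder, as is the project id), the predictions can be queried directly:

select
    game_team_id
  , predicted_touchdowns
  , created_timestamp
from `your-gcp-project.t2_cur.game_touchdown_predict`
order by predicted_touchdowns desc
limit 20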