Making Machine Learning predictions from BigQuery Dataform

NFL Touchdown Prediction with Dataform and BigQuery ML
This blog post demonstrates how to build an end-to-end ML pipeline using Google Cloud Platform (GCP) services to predict NFL game touchdowns.
There is a lot more that is set up and explained in previous blog posts:
- GCP Serverless prediction service using Cloud Run from a Vertex AI model
- The Tale of Two Routes: A Love Story Between BigQuery and Vertex AI for Machine Learning predictions
Architecture
The solution uses:
- Dataform for data transformation and pipeline orchestration
- BigQuery for data storage and SQL operations
- Cloud Run for hosting the ML model endpoint (Terraform definition: https://github.com/mortie23/ml/blob/master/infra/tf/gcp/environment/cloudrun.tf), exposed to SQL as a BigQuery remote function (see the sketch after this list)
- Python for ML model training (stored separately)
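The Cloud Run endpoint is exposed to SQL as a BigQuery remote function over a connection, which is what the prediction SQLX further down calls. The real definition is managed in the Terraform linked above; purely as an illustration, a hand-written version (the project, dataset, connection name and endpoint URL are all placeholders) would look roughly like this:

```sql
-- Illustrative only: project, dataset, connection and endpoint are placeholders.
-- The function takes the batched JSON array of game features and returns the
-- JSON array of predictions produced by the Cloud Run service.
create or replace function `my-gcp-project.t0_mdl.nfltouchdown_cloudrun`(batch json)
returns json
remote with connection `my-gcp-project.us.cloudrun-connection`
options (
  endpoint = 'https://nfltouchdown-predict-placeholder-uc.a.run.app/predict'
);
```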
Project Structure
```
ml-df/
├── 📁definitions/           # Dataform SQL definitions
│   ├── 📁t0_mdl/            # Model declarations
│   ├── 📁t1_raw/            # Raw data loading
│   └── 📁t2_cur/            # Curated/transformed/predicted data
├── 📁docs/                  # Documentation and screenshots
│   └── 📁img/               # Screenshots
└── workflow_settings.yaml   # Dataform configuration
```
Setup and Usage
- Set up a GCP project with BigQuery, Dataform and the other required GCP infrastructure components (such as the Cloud Run prediction service) enabled
- Fork this repository and set up a workspace in Dataform
- Configure workflow_settings.yaml with your project details (see the sketch after this list)
- Run the Dataform pipeline
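What goes in workflow_settings.yaml depends on your project. A minimal sketch follows; the project, location and dataset names are placeholders, and only the layer_cur variable is confirmed by the prediction SQLX below (layer_mdl and layer_raw are assumed names):

```yaml
# Placeholder values: swap in your own project, location and dataset names.
defaultProject: my-gcp-project
defaultLocation: us-central1
defaultDataset: dataform
dataformCoreVersion: 3.0.0
vars:
  layer_mdl: t0_mdl # model / remote function layer (assumed variable name)
  layer_raw: t1_raw # raw data layer (assumed variable name)
  layer_cur: t2_cur # curated data and predictions layer
```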
Pipeline Flow
- Raw NFL game statistics are loaded into BigQuery
- Data is transformed and prepared for prediction (a sketch of the resulting feature table follows this list)
- Each game’s features are sent to the ML model endpoint (the Cloud Run service is required and is covered in the previously linked blog posts)
- Predictions are stored back in BigQuery
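The prediction SQLX below reads from a game_stats relation with one row of features per game and team combination. That transformation isn't reproduced here, but a minimal sketch of the shape it needs (the raw source name and the schema variable are assumptions) could be:

```sqlx
config {
  type: "table",
  description: "Per-game, per-team features used for touchdown prediction",
  schema: dataform.projectConfig.vars.layer_cur
}

select
  game_team_id
  , total_first_downs
  , total_yards
  , interceptions
  , punts
from
  -- hypothetical raw declaration living in t1_raw
  ${ref("nfl_game_stats_raw")}
```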
The main part of interest is the SQLX for the prediction:
```sqlx
config {
  type: "incremental",
  description: "NFL game touchdown predictions",
  uniqueKey: ["game_team_id"],
  columns: {
    game_team_id: 'Unique identifier for game and team combination',
    predicted_touchdowns: 'Predicted number of touchdowns',
    created_timestamp: 'Created timestamp'
  },
  schema: dataform.projectConfig.vars.layer_cur
}

with batched_games as (
  select
    ifnull(
      array_agg(
        json_object(
          'game_team_id', game_team_id,
          'total_first_downs', total_first_downs,
          'total_yards', total_yards,
          'interceptions', interceptions,
          'punts', punts
        )
        ignore nulls
      ), [json_object()]
    ) as batch
  from
    ${ref("game_stats")}
  ${when(incremental(), `where game_team_id not in (select distinct game_team_id from ${self()})`)}
)
, predictions as (
  select
    ${ref("nfltouchdown_cloudrun")}(parse_json(to_json_string(batch))) as predictions_json
  from
    batched_games
)
, unnested_predictions as (
  select
    json_value(prediction, '$.game_team_id') as game_team_id
    , json_value(prediction, '$.prediction') as predicted_touchdowns
  from
    predictions
    , unnest(json_query_array(predictions_json, '$')) as prediction
)
select
  cast(up.game_team_id as INT64) as game_team_id
  , cast(up.predicted_touchdowns as FLOAT64) as predicted_touchdowns
  , current_timestamp() as created_timestamp
from
  unnested_predictions up
```
This method incrementally picks up new game stats (for argument's sake, say the pipeline runs weekly) and makes a single request to the Cloud Run predict service for all of that week's games. It does this by using array_agg and json_object to build an array of JSON objects, POSTing that batch through the BigQuery remote function connection, and then using unnest to split the responses back out into rows.
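Behind the scenes, BigQuery wraps each invocation of the remote function in its standard request envelope, where calls holds one entry per row; here that is a single row whose only argument is the batched array of game features. The values below are illustrative, and the exact encoding of JSON arguments over HTTP may differ slightly:

```json
{
  "requestId": "example-request-id",
  "calls": [
    [
      [
        { "game_team_id": 101, "total_first_downs": 22, "total_yards": 386, "interceptions": 1, "punts": 3 },
        { "game_team_id": 102, "total_first_downs": 18, "total_yards": 301, "interceptions": 2, "punts": 5 }
      ]
    ]
  ]
}
```

The Cloud Run service answers with one reply per call; here the reply is the JSON array of predictions that the unnest step splits back into rows:

```json
{
  "replies": [
    [
      { "game_team_id": 101, "prediction": 2.4 },
      { "game_team_id": 102, "prediction": 1.1 }
    ]
  ]
}
```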
Example BigQuery objects
This shows what the BigQuery objects look like when viewed in the BigQuery web UI.
Example execution
A successful execution of the complete pipeline in Dataform looks like this:
Example results
When the pipeline is run, the predicted touchdowns are then available in the game_touchdown_predict table.
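A quick way to check the output is a simple query against that table (the project and dataset names are placeholders):

```sql
select
  game_team_id
  , predicted_touchdowns
  , created_timestamp
from
  `my-gcp-project.t2_cur.game_touchdown_predict`
order by
  created_timestamp desc
limit 10
```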