Making Machine Learning predictions from BigQuery Dataform

NFL Touchdown Prediction with Dataform and BigQuery ML
This blog post demonstrates how to build an end-to-end ML pipeline using Google Cloud Platform (GCP) services to predict NFL game touchdowns.
There is a lot more that is set up and explained in previous blog posts:
- GCP Serverless prediction service using Cloud Run from a Vertex AI model
- The Tale of Two Routes: A Love Story Between BigQuery and Vertex AI for Machine Learning predictions
Architecture
The solution uses:
- Dataform for data transformation and pipeline orchestration
- BigQuery for data storage and SQL operations
- Cloud Run for hosting the ML model endpoint (Terraform definition: https://github.com/mortie23/ml/blob/master/infra/tf/gcp/environment/cloudrun.tf), exposed to SQL as a BigQuery remote function (see the sketch after this list)
- Python for ML model training (stored separately)
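The Cloud Run endpoint is exposed to SQL as a BigQuery remote function over a connection, which is what the prediction SQLX further down calls. The real definition is managed in the Terraform linked above; purely as an illustration, a hand-written version (the project, dataset, connection name and endpoint URL are all placeholders) would look roughly like this:

```sql
-- Illustrative only: project, dataset, connection and endpoint are placeholders.
-- The function takes the batched JSON array of game features and returns the
-- JSON array of predictions produced by the Cloud Run service.
create or replace function `my-gcp-project.t0_mdl.nfltouchdown_cloudrun`(batch json)
returns json
remote with connection `my-gcp-project.us.cloudrun-connection`
options (
  endpoint = 'https://nfltouchdown-predict-placeholder-uc.a.run.app/predict'
);
```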
Project Structure
```
ml-df/
├── 📁definitions/           # Dataform SQL definitions
│   ├── 📁t0_mdl/            # Model declarations
│   ├── 📁t1_raw/            # Raw data loading
│   └── 📁t2_cur/            # Curated/transformed/predicted data
├── 📁docs/                  # Documentation and screenshots
│   └── 📁img/               # Screenshots
└── workflow_settings.yaml   # Dataform configuration
```
Setup and Usage
- Set up a GCP project with BigQuery, Dataform and the other required GCP infrastructure components (such as the Cloud Run prediction service) enabled
- Fork this repository and set up a workspace in Dataform
- Configure workflow_settings.yaml with your project details (see the sketch after this list)
- Run the Dataform pipeline
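What goes in workflow_settings.yaml depends on your project. A minimal sketch follows; the project, location and dataset names are placeholders, and only the layer_cur variable is confirmed by the prediction SQLX below (layer_mdl and layer_raw are assumed names):

```yaml
# Placeholder values: swap in your own project, location and dataset names.
defaultProject: my-gcp-project
defaultLocation: us-central1
defaultDataset: dataform
dataformCoreVersion: 3.0.0
vars:
  layer_mdl: t0_mdl # model / remote function layer (assumed variable name)
  layer_raw: t1_raw # raw data layer (assumed variable name)
  layer_cur: t2_cur # curated data and predictions layer
```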
Pipeline Flow
- Raw NFL game statistics are loaded into BigQuery
- Data is transformed and prepared for prediction (a sketch of the resulting feature table follows this list)
- Each game’s features are sent to the ML model endpoint (the Cloud Run service is required and is covered in the previously linked blog posts)
- Predictions are stored back in BigQuery
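The prediction SQLX below reads from a game_stats relation with one row of features per game and team combination. That transformation isn't reproduced here, but a minimal sketch of the shape it needs (the raw source name and the schema variable are assumptions) could be:

```sqlx
config {
  type: "table",
  description: "Per-game, per-team features used for touchdown prediction",
  schema: dataform.projectConfig.vars.layer_cur
}

select
  game_team_id
  , total_first_downs
  , total_yards
  , interceptions
  , punts
from
  -- hypothetical raw declaration living in t1_raw
  ${ref("nfl_game_stats_raw")}
```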
The main part of interest is the SQLX for the prediction:
```sqlx
config {
  type: "incremental",
  description: "NFL game touchdown predictions",
  uniqueKey: ["game_team_id"],
  columns: {
    game_team_id: 'Unique identifier for game and team combination',
    predicted_touchdowns: 'Predicted number of touchdowns',
    created_timestamp: 'Created timestamp'
  },
  schema: dataform.projectConfig.vars.layer_cur
}

with batched_games as (
  select
    ifnull(
      array_agg(
        json_object(
          'game_team_id', game_team_id,
          'total_first_downs', total_first_downs,
          'total_yards', total_yards,
          'interceptions', interceptions,
          'punts', punts
        )
        ignore nulls
      ), [json_object()]
    ) as batch
  from
    ${ref("game_stats")}
  ${when(incremental(), `where game_team_id not in (select distinct game_team_id from ${self()})`)}
)
, predictions as (
  select
    ${ref("nfltouchdown_cloudrun")}(parse_json(to_json_string(batch))) as predictions_json
  from
    batched_games
)
, unnested_predictions as (
  select
    json_value(prediction, '$.game_team_id') as game_team_id
    , json_value(prediction, '$.prediction') as predicted_touchdowns
  from
    predictions
    , unnest(json_query_array(predictions_json, '$')) as prediction
)
select
  cast(up.game_team_id as INT64) as game_team_id
  , cast(up.predicted_touchdowns as FLOAT64) as predicted_touchdowns
  , current_timestamp() as created_timestamp
from
  unnested_predictions up
```
This method incrementally picks up new game stats (for argument's sake, say the pipeline runs weekly) and makes a single request to the Cloud Run predict service for all of that week's games. It does this by using array_agg and json_object to build an array of JSON objects, POSTing that batch through the BigQuery remote function connection, and then using unnest to split the responses back out into rows.
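Behind the scenes, BigQuery wraps each invocation of the remote function in its standard request envelope, where calls holds one entry per row; here that is a single row whose only argument is the batched array of game features. The values below are illustrative, and the exact encoding of JSON arguments over HTTP may differ slightly:

```json
{
  "requestId": "example-request-id",
  "calls": [
    [
      [
        { "game_team_id": 101, "total_first_downs": 22, "total_yards": 386, "interceptions": 1, "punts": 3 },
        { "game_team_id": 102, "total_first_downs": 18, "total_yards": 301, "interceptions": 2, "punts": 5 }
      ]
    ]
  ]
}
```

The Cloud Run service answers with one reply per call; here the reply is the JSON array of predictions that the unnest step splits back into rows:

```json
{
  "replies": [
    [
      { "game_team_id": 101, "prediction": 2.4 },
      { "game_team_id": 102, "prediction": 1.1 }
    ]
  ]
}
```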
Example BigQuery objects
This shows what the BigQuery objects look like when viewed in the BigQuery web UI.
Example execution
A successful execution of the complete pipeline in Dataform looks like this:
Example results
When the pipeline is run, the predicted touchdowns are then available in the game_touchdown_predict table.
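A quick way to check the output is a simple query against that table (the project and dataset names are placeholders):

```sql
select
  game_team_id
  , predicted_touchdowns
  , created_timestamp
from
  `my-gcp-project.t2_cur.game_touchdown_predict`
order by
  created_timestamp desc
limit 10
```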