GCP Cloud Functions for Web Scraping AFL data

Scraping web data using GCP Cloud Functions
The main reason for this blog post is to give a practical example of writing a GCP Cloud Function that comes with the following features out of the box:
- Tests
- Configurations
- Environment management for deployment
- The best sport in the world (AFL)
The source code can be found in the following repository https://github.com/mortie23/ml/tree/master/eng/afl/fn-afl-teams.
Directory structure
We will start with a standard directory structure for each of our Cloud Functions like below.
📁 fn-afl-teams/
├── 📁 src/
│ │ 📄 __init__.py
│ │ 📄 afl.yaml
│ │ 📄 main.py
│ │ 📄 requirements.txt
├── 📁 tests/
│ │ 📄 test_fn-afl-teams.py
│ 📄 deploy.sh
│ 📄 poetry.lock
│ 📄 pyproject.toml
│ 📄 pytest.ini
📄 .env
Development process and tooling
In this example I am running on a Windows machine with WSL (Ubuntu). The first step is to install all packages into a local Python virtual environment (venv).
# create the venv (assumes a venv directory in your home directory)
python3 -m venv ~/venv/fn_afl_teams
source ~/venv/fn_afl_teams/bin/activate
# install poetry to manage the remaining packages
pip install poetry
Now we use poetry to install everything else from the pyproject.toml file.
# install all packages to the venv
poetry lock && poetry install --no-root
# register the ipykernel for the user
python3 -m ipykernel install --user --name fn_afl_teams
Given this Cloud Function writes to BigQuery, we need to authenticate the gcloud SDK on our WSL installation.
gcloud auth login
gcloud auth application-default login
Open the ./src/main.py Python script in VS Code and ensure the active kernel is the venv. Then run the script in interactive mode.
When in interactive mode in VS Code, another good way to explore any resulting dataframes is the Data Wrangler extension.
If you have to add any new packages while you are developing, you add them using poetry:
poetry add pandas
Deploy
If you have added or updated any packages in pyproject.toml, export them to the requirements file before you deploy. The Cloud Build step that runs when deploying a Python-based Cloud Function depends on a requirements.txt file in the source directory.
poetry export --without-hashes --format=requirements.txt > ./src/requirements.txt
To deploy the function, run the deploy script, passing the target environment.
./deploy.sh --env dev
Preparing function...done.
✓ Deploying function...
✓ [Build] Logs are available at [https://console.cloud.google.com/cloud-build/builds;region=australia-southeast1/xyz]
✓ [Service]
✓ [ArtifactRegistry]
✓ [Healthcheck]
✓ [Triggercheck]
Done.
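The contents of deploy.sh are not reproduced in this post, so the following is only a rough sketch, in Python, of how an --env argument could be mapped to a gcloud functions deploy call. The project, region, function name and service account are the ones used elsewhere in this post. The ENVIRONMENTS mapping, the "main" entry point, the python310 runtime and the choice of flags are assumptions for illustration, not what the real script necessarily does.

```python
import argparse
import shlex

# Values copied from the examples in this post; the mapping itself is an assumption.
ENVIRONMENTS = {
    "dev": {
        "project": "prj-xyz-dev-fruit-0",
        "service_account": (
            "sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com"
        ),
    },
}

REGION = "australia-southeast1"
FUNCTION_NAME = "fn-afl-teams-0"


def build_deploy_command(env: str, entry_point: str = "main") -> list[str]:
    """Return the gcloud argument list for deploying to the given environment.

    The entry point name "main" is hypothetical.
    """
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    settings = ENVIRONMENTS[env]
    return [
        "gcloud", "functions", "deploy", FUNCTION_NAME,
        "--gen2",
        f"--region={REGION}",
        "--runtime=python310",  # assumed from the Python 3.10 used locally
        "--source=./src",
        f"--entry-point={entry_point}",
        "--trigger-http",
        f"--project={settings['project']}",
        f"--service-account={settings['service_account']}",
    ]


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the --env flag, rejecting environments that are not configured."""
    parser = argparse.ArgumentParser(description="Deploy sketch")
    parser.add_argument("--env", required=True, choices=sorted(ENVIRONMENTS))
    return parser.parse_args(argv)


# Print the command instead of running it, so the sketch has no side effects.
args = parse_args(["--env", "dev"])
print(shlex.join(build_deploy_command(args.env)))
```

Keeping the per-environment values in one mapping means adding a test or prod target is a data change rather than a change to the deploy logic.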
Calling
To test calling the Cloud Function we can use a basic curl command. Ensure you are logged in as a user that has invoke permissions.
curl https://australia-southeast1-prj-xyz-dev-fruit-0.cloudfunctions.net/fn-afl-teams-0 -H "Authorization: Bearer $(gcloud auth print-identity-token)"
If you check the target table, it should now include data from the scraped API.
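If you would rather make the same call from Python (for example from a notebook), here is a minimal sketch using only the standard library. It shells out to the same gcloud auth print-identity-token command that the curl example uses. The helper names (get_identity_token, build_request, call_function) are my own for illustration and are not part of the repository.

```python
import subprocess
import urllib.request

# URL taken from the curl example above
FUNCTION_URL = (
    "https://australia-southeast1-prj-xyz-dev-fruit-0"
    ".cloudfunctions.net/fn-afl-teams-0"
)


def get_identity_token() -> str:
    """Ask the gcloud CLI for an identity token for the logged-in user."""
    result = subprocess.run(
        ["gcloud", "auth", "print-identity-token"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def build_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the token as a Bearer credential."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


def call_function(url: str = FUNCTION_URL) -> str:
    """Invoke the Cloud Function and return the response body as text."""
    request = build_request(url, get_identity_token())
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling call_function() needs the gcloud CLI on your PATH and a user with invoke permissions, exactly like the curl version.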
Testing
To run the unit tests during development we can run this:
pytest ./tests/test_fn-afl-teams.py::test_webscrape_teams
============================ test session starts =============================
platform linux -- Python 3.10.7, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/c/git/github/mortie23/ml/eng/afl/fn-afl-teams
configfile: pytest.ini
plugins: env-1.1.3, hydra-core-1.3.2
collected 1 item
tests/test_fn-afl-teams.py . [100%]
============================= 1 passed in 1.24s ==============================
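The body of test_webscrape_teams lives in the repository rather than in this post. As a rough sketch of the pattern, a test like this can feed a canned payload to a pure transformation function so that it never touches the network or BigQuery. The teams_to_rows function, the payload shape and the sample ids below are all hypothetical and only there to illustrate the idea.

```python
def teams_to_rows(payload: dict) -> list[dict]:
    """Flatten a (hypothetical) teams payload into one dict per team."""
    rows = []
    for team in payload.get("teams", []):
        rows.append(
            {
                "team_id": int(team["id"]),
                "team_name": team["name"].strip(),
            }
        )
    return rows


def test_teams_to_rows():
    # Canned payload with made-up ids, so the test is fast and repeatable
    payload = {
        "teams": [
            {"id": "1", "name": " Adelaide Crows "},
            {"id": "2", "name": "Brisbane Lions"},
        ]
    }
    assert teams_to_rows(payload) == [
        {"team_id": 1, "team_name": "Adelaide Crows"},
        {"team_id": 2, "team_name": "Brisbane Lions"},
    ]


def test_teams_to_rows_empty_payload():
    # A payload without a "teams" key should yield no rows rather than fail
    assert teams_to_rows({}) == []
```

Separating the fetch from the transformation like this keeps most of the logic testable without credentials.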
IAM and other infrastructure
The following are some of the prerequisites for BigQuery, Storage Buckets and IAM permissions for developers (members of the gcp-developers group) and the service account (serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com).
Creation of the BigQuery dataset for the AFL data
bq mk --location=australia-southeast1 --project_id=prj-xyz-dev-fruit-0 afl
Setup of the Cloud Build things
Create a custom bucket for the source code
gcloud storage buckets create --location=australia-southeast1 --project=prj-xyz-dev-fruit-0 gs://bkt-xyz-dev-afl-fn-0
Create service accounts for Cloud Functions and Cloud Build
# Create a service account
gcloud iam service-accounts create sa-xyz-dev-fruit-cf --display-name "Cloud Function Service Account"
# Allow gcp-developers to act as the Cloud Function service account for deployment
gcloud iam service-accounts add-iam-policy-binding sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com --member='group:gcp-developers@tremendousdomain.xyz' --role='roles/iam.serviceAccountUser'
# Allow developers to invoke functions
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member='group:gcp-developers@tremendousdomain.xyz' --role='roles/run.invoker'
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="group:gcp-developers@tremendousdomain.xyz" --role="roles/cloudfunctions.invoker"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="group:gcp-developers@tremendousdomain.xyz" --role="roles/cloudfunctions.developer"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="group:gcp-developers@tremendousdomain.xyz" --role="roles/cloudfunctions.viewer"
# Give service account permissions to all the things needed (might be too many permissions)
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/cloudfunctions.invoker"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/cloudfunctions.developer"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/cloudfunctions.viewer"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/logging.logWriter"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/artifactregistry.writer"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/storage.objectAdmin"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/cloudbuild.builds.builder"
# Cloud Functions use Cloud Run
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/run.admin"
# Needed for the BigQuery part (editing table data and running jobs)
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"
gcloud projects add-iam-policy-binding prj-xyz-dev-fruit-0 --member="serviceAccount:sa-xyz-dev-fruit-cf@prj-xyz-dev-fruit-0.iam.gserviceaccount.com" --role="roles/bigquery.jobUser"
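The service account bindings above differ only in the role, so they can be generated from a list. This is a small sketch of that idea in Python. It prints the commands rather than running them, the role list is copied from the commands above, and the binding_command helper is my own naming.

```python
import shlex

PROJECT = "prj-xyz-dev-fruit-0"
SERVICE_ACCOUNT = (
    f"serviceAccount:sa-xyz-dev-fruit-cf@{PROJECT}.iam.gserviceaccount.com"
)

# Roles copied from the individual commands above
SERVICE_ACCOUNT_ROLES = [
    "roles/cloudfunctions.invoker",
    "roles/cloudfunctions.developer",
    "roles/cloudfunctions.viewer",
    "roles/logging.logWriter",
    "roles/artifactregistry.writer",
    "roles/storage.objectAdmin",
    "roles/cloudbuild.builds.builder",
    "roles/run.admin",
    "roles/bigquery.dataEditor",
    "roles/bigquery.jobUser",
]


def binding_command(project: str, member: str, role: str) -> list[str]:
    """Return the gcloud arguments that grant one role to one member."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project,
        f"--member={member}",
        f"--role={role}",
    ]


commands = [
    binding_command(PROJECT, SERVICE_ACCOUNT, role)
    for role in SERVICE_ACCOUNT_ROLES
]

# Print for review; pipe the output to a shell once you are happy with it.
for command in commands:
    print(shlex.join(command))
```

Having the roles in one list also makes it easier to review them later and trim any that turn out to be unnecessary.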