KMS encrypted credentials with Dataflow on GCP

Encrypting credentials

Google KMS

Cloud Key Management Service (Cloud KMS) lets you create and manage encryption keys for use in compatible Google Cloud services and in your own applications. So first things first, we have to create a Google managed KMS keyring and associated key.
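Both resources are addressed by hierarchical resource names, and the same layout appears in the name field of the encrypt response shown later in this post. A minimal sketch of building those names follows. The function names are hypothetical, and the values are the placeholder project, key ring and key names used in this post.

```python
def key_ring_name(project_id: str, location_id: str, key_ring_id: str) -> str:
    """Build the resource name of a Cloud KMS key ring."""
    return f"projects/{project_id}/locations/{location_id}/keyRings/{key_ring_id}"


def crypto_key_name(
    project_id: str, location_id: str, key_ring_id: str, key_id: str
) -> str:
    """Build the resource name of a key inside that key ring."""
    parent = key_ring_name(project_id, location_id, key_ring_id)
    return f"{parent}/cryptoKeys/{key_id}"


if __name__ == "__main__":
    # Placeholder identifiers taken from the example output in this post.
    print(
        crypto_key_name(
            "prj-xyz-prd-fruit", "australia-southeast1", "keyring-fruit", "key-fruit"
        )
    )
```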

gcp kms dataflow key

Google-provided Dataflow templates

Some background on why we are doing this. Google provides open source Dataflow templates that you can use instead of writing pipeline code.

If we were to use the PostgreSQL to BigQuery template, for example, we would find that the only secure way to provide the credentials is to encrypt them with a Google managed KMS key.

https://cloud.google.com/dataflow/docs/guides/templates/provided/postgresql-to-bigquery#template-parameters

Parameter    Description
username     Optional: The username to use for the JDBC connection. You can pass in this value encrypted by a Cloud KMS key as a Base64-encoded string.
password     Optional: The password to use for the JDBC connection. You can pass in this value encrypted by a Cloud KMS key as a Base64-encoded string.

We have created a template shell script that runs the gcloud CLI command for launching a Dataflow job.

https://github.com/mortie23/beam-poetry-mono/blob/master/eng/dataflow-run/nfl-postgres-tandl/dtf-gcloud-template.sh

In this shell script, we have the credentials as templated parameters:

gcloud dataflow flex-template run dtf-postgres-nfl-tandl-<<job_name>> \
    ...
    --parameters connectionURL="<<connectionURL>>" \
    --parameters username="<<username>>" \
    --parameters password="<<password>>" \

But how do we get the encrypted versions of these into the template?

KMS helper function

We have written a series of helper functions to make this easier. They live in the KMS module of a custom package called Beam Me Up.

https://github.com/mortie23/beam-poetry-mono/blob/master/eng/lib/beammeup/beammeup/kms.py

The function within the module that is used to create the Base64 encoded encrypted credentials is the encode_dataflow_parameter function. Let’s look at what it does step by step.

# Uses another Beam Me Up function to encrypt any string value
encrypted_parameter = encrypt_symmetric(
    project_id=project_id,
    location_id=location_id,
    key_ring_id=key_ring_id,
    key_id=key_id,
    plaintext=parameter,
)
# Base64 encodes the returned ciphertext (which is in bytes) and then decodes the bytes to a string
base64_encrypted_parameter = base64.b64encode(
    encrypted_parameter.ciphertext
).decode("utf-8")

Let’s look at each part of this to figure out what is happening. The Dataflow template expects each credential to be encrypted with a KMS key and passed in as a Base64-encoded string.

Encrypted parameter

The result from the call to encrypt_symmetric is an object with multiple parts. We just need the ciphertext.

name: "projects/prj-xyz-prd-fruit/locations/australia-southeast1/keyRings/keyring-fruit/cryptoKeys/key-fruit/cryptoKeyVersions/1"
ciphertext: "\n$\000\304\227|\301\324\343z\227\227\206I\214^\310\326\301\234O`\243|A\n\342\275v\332|\314|(>\252;\243\022J\000\254\004\343\003\375[\3760\333\027>(\311|k\"\302\01426\334\207/\361\\\226-\001MP\375R0\017\270v\310\323\351\331\2140\217v2P9\242\333Y\262\307\225G\200\345a\373;)\257\260\263\344]1\323\322W\241\270J\t"
ciphertext_crc32c {
  value: 2931620773
}
verified_plaintext_crc32c: true
protection_level: SOFTWARE
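For context, a function like encrypt_symmetric can be a thin wrapper around the Cloud KMS client library. The following is a minimal sketch, assuming the google-cloud-kms package. It is not the Beam Me Up implementation. The optional client argument is an addition for illustration, so the function can be exercised without a GCP connection, and the CRC32C integrity check (which the response above reports as verified) is left out for brevity.

```python
from typing import Any, Optional


def encrypt_symmetric(
    project_id: str,
    location_id: str,
    key_ring_id: str,
    key_id: str,
    plaintext: str,
    client: Optional[Any] = None,
) -> Any:
    """Encrypt a string with a symmetric Cloud KMS key and return the response."""
    if client is None:
        # Imported lazily so this module loads without google-cloud-kms installed.
        from google.cloud import kms

        client = kms.KeyManagementServiceClient()

    # Resource name: projects/.../locations/.../keyRings/.../cryptoKeys/...
    key_name = client.crypto_key_path(project_id, location_id, key_ring_id, key_id)

    # The API works on bytes, so the string is encoded first.
    return client.encrypt(
        request={"name": key_name, "plaintext": plaintext.encode("utf-8")}
    )
```

The returned object exposes the ciphertext attribute, which is the only part needed for the next step.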

Ciphertext

The ciphertext itself is in bytes:

b'\n$\x00\xc4\x97|\xc1\xd4\xe3z\x97\x97\x86I\x8c^\xc8\xd6\xc1\x9cO`\xa3|A\n\xe2\xbdv\xda|\xcc|(>\xaa;\xa3\x12J\x00\xac\x04\xe3\x03\xfd[\xfe0\xdb\x17>(\xc9|k"\xc2\x0c26\xdc\x87/\xf1\\\x96-\x01MP\xfdR0\x0f\xb8v\xc8\xd3\xe9\xd9\x8c0\x8fv2P9\xa2\xdbY\xb2\xc7\x95G\x80\xe5a\xfb;)\xaf\xb0\xb3\xe4]1\xd3\xd2W\xa1\xb8J\t'

Base64 encoded

The result of Base64 encoding these bytes is bytes again.

b'CiQAxJd8wdTjepeXhkmMXsjWwZxPYKN8QQrivXbafMx8KD6qO6MSSgCsBOMD/Vv+MNsXPijJfGsiwgwyNtyHL/Fcli0BTVD9UjAPuHbI0+nZjDCPdjJQOaLbWbLHlUeA5WH7OymvsLPkXTHT0lehuEoJ'

Decoded

Decoding these bytes as UTF-8 leaves us with a string we can pass to the gcloud CLI as a Dataflow template parameter.

'CiQAxJd8wdTjepeXhkmMXsjWwZxPYKN8QQrivXbafMx8KD6qO6MSSgCsBOMD/Vv+MNsXPijJfGsiwgwyNtyHL/Fcli0BTVD9UjAPuHbI0+nZjDCPdjJQOaLbWbLHlUeA5WH7OymvsLPkXTHT0lehuEoJ'
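The two steps can be checked locally with the standard library alone. This sketch uses only the first six bytes of the ciphertext shown earlier, so the result is the first eight characters of the full encoded string. The variable names are illustrative.

```python
import base64

# First six bytes of the ciphertext from the encrypt response.
ciphertext_prefix = b"\n$\x00\xc4\x97|"

# Step 1: Base64 encoding bytes gives bytes.
encoded_bytes = base64.b64encode(ciphertext_prefix)
print(encoded_bytes)  # b'CiQAxJd8'

# Step 2: decoding those bytes gives a plain str usable as a parameter value.
encoded_str = encoded_bytes.decode("utf-8")
print(encoded_str)  # CiQAxJd8

# The encoding is reversible: Base64 decoding recovers the original bytes.
assert base64.b64decode(encoded_str) == ciphertext_prefix
```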

Filling in the parameters

A Python script imports the Beam Me Up package and uses our helper functions to encrypt a given credential read from a .env file, and then replaces the placeholder values in the shell script.

https://github.com/mortie23/beam-poetry-mono/blob/master/eng/dataflow-run/nfl-postgres-tandl/dataflow-generate-run.py#L174

encoded_username = encode_dataflow_parameter(
    project_id=cfg.project_id,
    location_id=cfg.location,
    key_ring_id=cfg.key_ring_id,
    key_id=cfg.key_id,
    parameter=username,
)
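Once the values are encoded, the remaining step is to substitute them into the <<...>> placeholders of the shell template. Here is a minimal sketch of that substitution, assuming plain string replacement. The function name fill_template and the sample values are hypothetical, and the linked script may do this differently.

```python
from typing import Mapping


def fill_template(template: str, values: Mapping[str, str]) -> str:
    """Replace each <<name>> placeholder in the template with its value."""
    for name, value in values.items():
        template = template.replace(f"<<{name}>>", value)
    return template


# A shortened stand-in for the shell template shown earlier.
template = (
    "gcloud dataflow flex-template run dtf-postgres-nfl-tandl-<<job_name>> \\\n"
    '    --parameters username="<<username>>" \\\n'
    '    --parameters password="<<password>>"\n'
)

filled = fill_template(
    template,
    {
        "job_name": "example",            # hypothetical job name
        "username": "ENCODED_USERNAME",   # stand-in for the encoded username
        "password": "ENCODED_PASSWORD",   # stand-in for the encoded password
    },
)
print(filled)
```

Plain replacement is enough here because Base64 output never contains the < or > characters, so an inserted value cannot be mistaken for another placeholder.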