Data Engineering, Python

Connecting to Teradata from Python

Reading and Writing data between Python and Teradata

This blog post is paired with the associated GitHub repository: https://github.com/mortie23/python-teradata

Local Python environment setup

For this run-through I am using:

  • Python 3.8.10
  • installed on Windows Subsystem for Linux (Ubuntu 20.04)

Create a new virtual environment and install the requirements.

# Create a new virtual environment
python3 -m venv tdsql
# Activate the environment
source tdsql/bin/activate
# Install the packages required
pip install -r requirements.txt
# Register the IPython Kernel for use (with VScode)
python3 -m ipykernel install --user --name tdsql --display-name tdsql
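The requirements.txt in the repository lists the packages the scripts depend on. As a guide (the pinned versions in the repo may differ), it needs at least the Teradata SQL driver, pandas, and the IPython kernel:

teradatasql
pandas
ipykernel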

Teradata

I am using a Teradata Express 16.20 virtual machine running in VMware Workstation Player 16.2.5. My hosts file maps the virtual machine’s IP address to the hostname tdvm.

cat /etc/hosts
# This file was automatically generated by WSL. To stop automatic generation of this file, add the following entry to /etc/wsl.conf:
# [network]
# generateHosts = false
127.0.0.1       localhost
192.168.190.128 tdvm

teradata vm
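Before moving on, it is worth a quick check from WSL that the tdvm hostname resolves and that the database port is reachable. A small sketch, assuming Teradata is listening on its default port 1025:

# Quick reachability check for the Teradata VM (default database port is 1025)
import socket

with socket.create_connection(("tdvm", 1025), timeout=5) as sock:
    print("Connected to", sock.getpeername())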

Set up database structures

This step is not strictly required, but it is included for completeness' sake. Run the teradata-setup.sql script to create the required databases and grant access to the appropriate user.

-- Create Database
-- Assumes you have a production analytical datastore (ADS)
CREATE DATABASE PRD_ADS_PYTHON FROM PRD_ADS AS
   PERM = 0
   NO FALLBACK
   NO BEFORE JOURNAL
   NO AFTER JOURNAL
;
COMMENT ON PRD_ADS_PYTHON AS 'Python development' ;
GRANT ALL ON PRD_ADS_PYTHON TO PRDDBA WITH GRANT OPTION ;

CREATE DATABASE PRD_ADS_PYTHON_NFL_DB FROM PRD_ADS_PYTHON AS
   PERM = 0
   NO FALLBACK
   NO BEFORE JOURNAL
   NO AFTER JOURNAL
;
COMMENT ON PRD_ADS_PYTHON_NFL_DB AS 'NFL Data loaded using Python' ;
GRANT ALL ON PRD_ADS_PYTHON_NFL_DB TO PRDDBA WITH GRANT OPTION ;

-- Give Space
-- Teradata transfers perm space with the GIVE/DROP pattern: create a temporary database
-- from the donor, GIVE it to the target, then DROP it so its perm space is absorbed
-- by its new owner.
-- Python
CREATE DATABASE GS_TEMP FROM PRD_ADS AS
PERM = 1E9
;
GIVE GS_TEMP TO PRD_ADS_PYTHON
;
DROP DATABASE GS_TEMP
;
-- Python NFL
CREATE DATABASE GS_TEMP FROM PRD_ADS_PYTHON AS
PERM = 5E6
;
GIVE GS_TEMP TO PRD_ADS_PYTHON_NFL_DB
;
DROP DATABASE GS_TEMP
;

-- Users
CREATE USER MORTCH FROM PRD_USERS AS
  PASSWORD = MORTCH
  PERM = 0
  NO Fallback
  NO Before Journal NO After Journal
  DEFAULT ROLE = ALL
;

-- Roles
CREATE ROLE R_PYTHON_PRD;
GRANT R_PYTHON_PRD TO MORTCH;

-- Python
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE TABLE, DROP TABLE ON PRD_ADS_PYTHON_NFL_DB TO R_PYTHON_PRD ;
GRANT SELECT, CREATE VIEW, DROP VIEW ON PRD_ADS_PYTHON_NFL_DB TO R_PYTHON_PRD ;

-- Test
CREATE TABLE PRD_ADS_PYTHON_NFL_DB.TEST (
  COL1 INTEGER
)
PRIMARY INDEX(COL1)
;
INSERT INTO PRD_ADS_PYTHON_NFL_DB.TEST VALUES(1);

Set up the table structures

First, test connecting to the database from the Teradata Studio IDE:

teradata studio connection

Then run the python-nfl-ddl.sql script in your favourite SQL IDE (for example, Teradata Studio) to create the table structures.

teradata tables

Python execution in VSCode

I’m going to use the Python Interactive window in VSCode to run this. In this mode we can run the “cells” (blocks of code separated by # %%) individually and view tabular results in a more visual way.
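The first cells of the script set up the connection and cursor (con and cur) that the later snippets rely on. A minimal sketch of that step, assuming the teradatasql driver from requirements.txt and the host, user, and database created above (the credentials are the ones from the setup script and are for illustration only):

# %%
import teradatasql
import pandas as pd

# Connect to the Teradata Express VM using the user and database created earlier
database = "PRD_ADS_PYTHON_NFL_DB"
con = teradatasql.connect(
    host="tdvm",
    user="mortch",
    password="mortch",
    database=database,
)
cur = con.cursor()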

For example, run the python-teradata.py script up to this point, then type tables in the interactive window and hit Shift+Enter.

# %%
query = f"""
select
  TableName
  , count(*) as ColumnCount
from
  dbc.columnsv
where
  databasename='{database}'
group by
  TableName
order by
  TableName
"""
tables = pd.read_sql(query, con)
# %%
tables

python nfl tables

Loading data from CSV

The key function that loads data into a table from a CSV file is shown here:

def insert_rows_from_csv(
    tablename: str,
    columncount: int,
):
    """Insert rows from a CSV file into a Teradata table

    Uses the teradatasql driver's {fn teradata_read_csv(...)} escape function,
    which reads the bound parameter values for the insert from the CSV file.
    Relies on the module-level cursor `cur`.

    Args:
        tablename (str): the name of the table
        columncount (int): the number of columns in the table
    """
    # Build a "?, ?, ..., ?" parameter marker list matching the column count
    columncount_questionmark = ", ?" * (columncount - 1)
    exec_string = f"{{fn teradata_read_csv(./data/{tablename.lower()}.csv)}}insert into {tablename} (?{columncount_questionmark})"
    cur.execute(exec_string)

The script then has a function that loops over all the tables and calls insert_rows_from_csv for each one. After the loads have completed, a select using the pandas read_sql method demonstrates that the tables have been successfully loaded from the CSV files.
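The loop is not reproduced in full here; the repository version also tracks an explicit load order (the table_load DataFrame used again in the cleanup section). A minimal sketch, assuming the tables DataFrame from the earlier cell (TableName and ColumnCount columns), a matching CSV file under ./data for each table, and the player table from the screenshot below for the verification select:

# %%
# Load every table from its matching CSV file
for _, row in tables.iterrows():
    print(f"Loading {row['TableName']}")
    insert_rows_from_csv(row["TableName"], row["ColumnCount"])

# %%
# Verify one of the loads by reading a table back into a DataFrame
pd.read_sql(f"select * from {database}.player", con)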

python player table

Cleaning up

We also have a script that deletes all the rows from the tables, working through them in reverse load order.

# Delete all rows from every table, working in reverse load order
table_load = table_load.sort_values(by="order", ascending=False)
for index, row in table_load.iterrows():
    print(row["TableName"])
    cur.execute(f"delete from {row['TableName']}")