Chapter 12: Data Integration and ETL Pipelines (The Emerald City Clinic)#
Database and SQL Through Pop Culture | Brendan Shea, PhD#
Every database project so far has started with the data already inside a tidy SQLite file. Real projects almost never start that way. The data is scattered — a CSV from one department, a JSON API from another, an XML feed from a legacy system, a streaming pharmacy queue, even a web page that has to be scraped. The work of pulling that data together, cleaning it up, and getting it into a working warehouse is data integration.
This chapter walks through that work using a fictional pop-culture setting: Emerald City Clinic, a Wizard of Oz–themed medical facility on the outskirts of Oz. By the end, you will know how to acquire data from many sources, transform it with Pandas, load it into SQLite, troubleshoot the failures you will actually hit, and use AI tools wisely in the process.
Note for SQL-first readers. This chapter introduces some Python and a library called Pandas. If your only programming experience so far is SQL, that’s fine — the next section is a short Python & Pandas survival kit that maps every new idea back to SQL you already know. Take it slowly; the rest of the chapter will make a lot more sense after it.
Learning Outcomes#
By the end of this chapter, you will be able to:
Read a small amount of Python and Pandas code and translate it into the SQL you already know.
Acquire data from CSV, JSON, XML, REST and OData, ODBC/JDBC, FTP, NFS, SSH, CIFS, RPC, SOAP, streaming sources, and scraped HTML.
Use Pandas to read source data, clean it, merge it, and load it into SQLite — and use the
%%sqlmagic to query the result.Distinguish ETL from ELT and decide which pattern fits a scenario.
Diagnose and fix the most common data acquisition failures: schema mismatch, encoding, permissions, driver/version issues, encryption key handling, timeouts, corruption, and syntax/runtime errors.
Explain where AI and modern tooling — prompt engineering, RAG, virtual data warehouses, ML libraries, RPA — help and where they introduce risk.
Keywords: data integration, ETL, ELT, Pandas, DataFrame, %%sql magic, CSV, JSON, XML, REST, OData, SOAP, ODBC, JDBC, streaming, scraping, SQLite, prompt engineering, RAG, RPA
Brendan’s Lecture#
from IPython.display import YouTubeVideo
# Lecture video to be linked here.
Introduction to Emerald City Clinic#
The clinic’s data team has six members, and the chapter’s case studies stay with this cast:
Dorothy — newly hired data engineer. She is leading the consolidation project that pulls every department’s data into one SQLite warehouse.
The Scarecrow — runs the records team. He delivers patient demographics as CSV exports.
The Tin Man — owns lab systems. His department exposes lab results through a JSON REST API.
The Cowardly Lion — handles security and compliance. He cares about access control, encryption keys, and certificates.
The Wizard — the CIO. He sets strategic direction and signs off on the project.
Glinda — the analytics lead. She consumes the consolidated warehouse for reporting.
The goal is one SQLite database that combines the Scarecrow’s patients, the Tin Man’s labs, the Wizard’s legacy charts, and a steady stream of pharmacy refills.
A Python and Pandas Survival Kit (for SQL Students)#
If you have only written SQL, this section gives you the minimum Python and Pandas vocabulary the rest of the chapter assumes. Read it once carefully; you can come back to it whenever a later cell looks unfamiliar.
The good news: almost every Pandas operation in this chapter has a direct SQL equivalent. If you can read SQL, you can read Pandas — you just need a small dictionary between the two.
Three Python ideas you need before you read any code#
1. Variables hold values. In SQL you write SELECT * FROM patients. In Python you give the result a name:
patients = something
The word patients is now a label that points at whatever value the right-hand side produced. You can re-use it later.
2. import loads a library. Python ships small. To use Pandas you write import pandas as pd, which means: load the Pandas library, and from now on let me call it pd for short.
3. Methods are called with a dot. df.head() means “on the thing called df, call the method named head.” The empty parentheses mean “no extra arguments.” SQL has functions too — COUNT(*), UPPER(name) — Python just puts the value first and the function after the dot.
That’s it. Three ideas. Everything else in this section builds on them.
What is a DataFrame? (It’s a table.)#
The central object in Pandas is a DataFrame. A DataFrame is, for all practical purposes, a database table that lives in memory instead of on disk. It has:
Columns with names and types.
Rows you can filter, sort, and group.
A row index (a hidden
ROWID-like column).
You will see variables like patients_df (a DataFrame holding patient rows) or labs_df (a DataFrame of lab results). The _df suffix is a convention, not a rule.
Below is a tiny DataFrame built by hand — no file, no internet — so you can see the shape of one.
import pandas as pd
# Build a tiny DataFrame from a dictionary. Each key is a column name.
toy = pd.DataFrame({
'patient_id': ['ECC0001', 'ECC0002', 'ECC0003'],
'town': ['Munchkinland', 'Emerald City', 'Munchkinland'],
'last_visit': ['2026-05-01', '2026-05-02', '2026-05-10'],
})
toy
| patient_id | town | last_visit | |
|---|---|---|---|
| 0 | ECC0001 | Munchkinland | 2026-05-01 |
| 1 | ECC0002 | Emerald City | 2026-05-02 |
| 2 | ECC0003 | Munchkinland | 2026-05-10 |
From SQL to Pandas — a translation table#
Keep this table handy. Most of the chapter’s Pandas code is one of these patterns.
What you want to do |
SQL |
Pandas |
|---|---|---|
See the first few rows |
|
|
Count the rows |
|
|
List the column names |
|
|
Filter rows |
|
|
Pick certain columns |
|
|
Rename a column |
|
|
Join two tables |
|
|
Group + aggregate |
|
|
Sort |
|
|
Write to a database |
|
|
Read from a database |
|
|
If you can map a Pandas line to one of these rows, you can read it. If you can’t, slow down and ask: what would this be in SQL?
Filtering: WHERE in Pandas#
The strangest-looking SQL-to-Pandas translation is filtering, because Pandas reuses square brackets for two things: picking columns and filtering rows. The pattern below is the one to memorize.
df[df['col'] == 'value']
Read it out loud: “From df, take the rows where df['col'] equals 'value'.” The inner df['col'] == 'value' produces a column of True/False values; the outer df[...] keeps only the True rows. It’s WHERE col = 'value' with a different costume.
# Same toy DataFrame as before. Filter for Munchkinland patients.
# SQL equivalent: SELECT * FROM toy WHERE town = 'Munchkinland';
toy[toy['town'] == 'Munchkinland']
| patient_id | town | last_visit | |
|---|---|---|---|
| 0 | ECC0001 | Munchkinland | 2026-05-01 |
| 2 | ECC0003 | Munchkinland | 2026-05-10 |
Joining and grouping#
merge is JOIN. groupby is GROUP BY. The example below joins a tiny visits table to the patient table and counts visits per town — exactly what you would write as JOIN ... GROUP BY town.
visits = pd.DataFrame({
'patient_id': ['ECC0001', 'ECC0001', 'ECC0002', 'ECC0003'],
'visited_on': ['2026-05-01', '2026-05-15', '2026-05-02', '2026-05-10'],
})
# JOIN: combine each visit row with its patient row.
joined = visits.merge(toy, on='patient_id', how='left')
joined
| patient_id | visited_on | town | last_visit | |
|---|---|---|---|---|
| 0 | ECC0001 | 2026-05-01 | Munchkinland | 2026-05-01 |
| 1 | ECC0001 | 2026-05-15 | Munchkinland | 2026-05-01 |
| 2 | ECC0002 | 2026-05-02 | Emerald City | 2026-05-02 |
| 3 | ECC0003 | 2026-05-10 | Munchkinland | 2026-05-10 |
# GROUP BY town, COUNT(*).
joined.groupby('town').size().reset_index(name='visit_count')
| town | visit_count | |
|---|---|---|
| 0 | Emerald City | 1 |
| 1 | Munchkinland | 3 |
A note on %%sql#
You already used the %%sql magic (from jupysql / ipython-sql) throughout the semester, so we won’t re-introduce it here. Just remember the rhythm:
%sql sqlite:///some_file.db— point the magic at a database.%%sqlat the top of a cell — every line in the cell is SQL.
The rest of the chapter mixes Pandas and %%sql deliberately: Pandas when you need Python logic alongside the data, %%sql when the work is just a query.
%pip install --quiet jupysql
%load_ext sql
Note: you may need to restart the kernel to use updated packages.
Check yourself before moving on#
Try to answer these in your head. The answers are at the bottom of the chapter glossary.
In SQL:
SELECT first_name FROM patients WHERE town = 'Emerald City';What is the equivalent Pandas line?What does
df.head()correspond to in SQL?If
combined.merge(other, on='patient_id', how='left')were SQL, which kind of JOIN would it be?When would you reach for
%%sqlinstead of Pandas?
If any of these felt shaky, re-read the translation table above before continuing. The rest of the chapter is much smoother once these click.
Generating the source data#
Real source files come from real systems. For this chapter, a Python script produces a deterministic set of source files — CSV, JSON, XML, a streaming feed, and a web page — so the notebook is self-contained.
You do not need to read or understand the code in the next cell. It is plumbing: it writes the files we will use later. Run it once, then move on. If you are curious, the same code lives in data/emerald_city/generator.py.
Wrote source files to /home/runner/work/database_sql/database_sql/emerald_city_data
25 patients, 43 lab results
Verify the files#
Before moving on, list the files the generator just wrote. This is a habit worth building: always confirm your inputs exist before you try to load them.
from pathlib import Path
for f in sorted(Path('emerald_city_data').iterdir()):
print(f'{f.name:<28} {f.stat().st_size:>6} bytes')
charts.xml 1878 bytes
labs.json 8053 bytes
partner_directory.html 535 bytes
patients.csv 1598 bytes
pharmacy_stream.txt 4950 bytes
soap_response.xml 340 bytes
Acquisition techniques and methods#
This section covers the formats and protocols you will encounter when pulling data into a database. Most projects mix several at once, and Emerald City Clinic is no exception. For each format we will: (1) look at what the source looks like, (2) read it into a DataFrame, and (3) translate what we did back into SQL terms.
File-based: CSV — the Scarecrow’s records#
CSV (comma-separated values) is the oldest, simplest format for tabular data. Every line is a row; every field is separated by a comma. The first line is usually a header that names the columns. The Scarecrow’s records team exports a fresh patients.csv every morning.
Before we read it, take a quick look at the raw file. (Recognizing what raw data looks like is half the battle when something goes wrong.)
# Show the first 3 lines of the raw CSV file. No Pandas yet — just file reading.
with open('emerald_city_data/patients.csv') as f:
for i, line in enumerate(f):
if i >= 3:
break
print(line.rstrip())
patient_id,first_name,last_name,date_of_birth,town,primary_provider
ECC0001,Toto,Munchkin,1987-06-07,Quadling Country,Dr. Pingaree
ECC0002,Boq,Gale,1986-09-11,Gillikin Country,Dr. Pumperdink
Now read it with Pandas. pd.read_csv turns the file into a DataFrame. Think of it as SELECT * FROM patients.csv — except Pandas does the parsing instead of the database.
import pandas as pd
patients_df = pd.read_csv('emerald_city_data/patients.csv')
patients_df.head()
| patient_id | first_name | last_name | date_of_birth | town | primary_provider | |
|---|---|---|---|---|---|---|
| 0 | ECC0001 | Toto | Munchkin | 1987-06-07 | Quadling Country | Dr. Pingaree |
| 1 | ECC0002 | Boq | Gale | 1986-09-11 | Gillikin Country | Dr. Pumperdink |
| 2 | ECC0003 | Fiyero | Munchkin | 1989-09-17 | Winkie Country | Dr. Gale |
| 3 | ECC0004 | Tip | Gale | 2020-12-05 | Gillikin Country | Dr. Emerald |
| 4 | ECC0005 | Glinda | Emerald | 1966-08-13 | Winkie Country | Dr. Emerald |
Try it. Before you run the next cell, predict: how many rows does patients_df have? Which columns do you see? Then run the cell to check.
print('rows: ', len(patients_df))
print('columns:', list(patients_df.columns))
rows: 25
columns: ['patient_id', 'first_name', 'last_name', 'date_of_birth', 'town', 'primary_provider']
File-based: JSON — the Tin Man’s lab results#
JSON (JavaScript Object Notation) is the format you reach for when the data has a hierarchy or when fields vary across records. Two shapes show up constantly:
A list of objects, each with the same keys — this is the “flat” shape, and Pandas can load it directly into a DataFrame.
A nested object with sub-lists and sub-objects — common in API responses; you usually have to drill into the structure before it becomes tabular.
The Tin Man’s daily dump is the first shape. Peek at the first record before parsing it — it’s a habit worth keeping for every new JSON source you meet.
# Show just the first record of labs.json so we can see its shape.
with open('emerald_city_data/labs.json') as f:
raw = f.read()
print(raw[:300], '...')
[
{
"lab_id": "LAB00001",
"patient_id": "ECC0002",
"panel": "Lipid Panel",
"collected_on": "2026-03-21",
"result_flag": "normal",
"ordering_provider": "Dr. Pumperdink"
},
{
"lab_id": "LAB00002",
"patient_id": "ECC0002",
"panel": "HbA1c",
"collected_on": ...
labs_df = pd.read_json('emerald_city_data/labs.json')
labs_df.head()
| lab_id | patient_id | panel | collected_on | result_flag | ordering_provider | |
|---|---|---|---|---|---|---|
| 0 | LAB00001 | ECC0002 | Lipid Panel | 2026-03-21 | normal | Dr. Pumperdink |
| 1 | LAB00002 | ECC0002 | HbA1c | 2026-04-09 | high | Dr. Pumperdink |
| 2 | LAB00003 | ECC0003 | HbA1c | 2026-01-09 | normal | Dr. Gale |
| 3 | LAB00004 | ECC0004 | CMP | 2026-01-04 | normal | Dr. Emerald |
| 4 | LAB00005 | ECC0004 | BMP | 2026-04-27 | low | Dr. Emerald |
File-based: XML — the Wizard’s legacy charts#
XML is older than JSON but still common in healthcare, finance, and government systems. Where JSON uses {} and [], XML uses tags: <chart>...</chart>. The structure is a tree — every element can have attributes (the patient_id="ECC0001" part) and child elements (the <opened_on>, <department>, <note> inside).
The Wizard’s chart system, set up decades ago in Oz, exports XML like the snippet below. As with JSON, eyeball the raw shape first.
# Show the first ~12 lines of the XML so we can see the tag structure.
with open('emerald_city_data/charts.xml') as f:
for i, line in enumerate(f):
if i >= 12:
break
print(line.rstrip())
<charts>
<chart patient_id="ECC0001">
<opened_on>2018-01-17</opened_on>
<department>Endocrinology</department>
<note>Legacy chart imported from the Wizard system.</note>
</chart>
<chart patient_id="ECC0002">
<opened_on>2020-03-16</opened_on>
<department>Endocrinology</department>
<note>Legacy chart imported from the Wizard system.</note>
</chart>
<chart patient_id="ECC0003">
charts_df = pd.read_xml('emerald_city_data/charts.xml', parser='etree')
charts_df.head()
| patient_id | opened_on | department | note | |
|---|---|---|---|---|
| 0 | ECC0001 | 2018-01-17 | Endocrinology | Legacy chart imported from the Wizard system. |
| 1 | ECC0002 | 2020-03-16 | Endocrinology | Legacy chart imported from the Wizard system. |
| 2 | ECC0003 | 2012-03-15 | Pulmonology | Legacy chart imported from the Wizard system. |
| 3 | ECC0004 | 2018-03-17 | Family Medicine | Legacy chart imported from the Wizard system. |
| 4 | ECC0005 | 2022-03-17 | Family Medicine | Legacy chart imported from the Wizard system. |
Check yourself. The XML file holds 10 charts. How many rows do you expect in charts_df? Run len(charts_df) to confirm.
len(charts_df)
10
API-based: REST and OData#
A REST API exposes resources at URLs and returns JSON or XML. Tin Man Diagnostics runs one at (imagined) https://api.tinmandx.example.com/v1/labs. In production you would hit it with HTTP GET:
import requests
response = requests.get('https://api.tinmandx.example.com/v1/labs')
labs_df = pd.DataFrame(response.json())
For this chapter we read the Tin Man’s daily JSON dump as if it came from the API — the data flow is identical, but it doesn’t require a network call.
OData is a convention layered on top of REST that adds a standard query syntax — $filter, $select, $top — so clients can do server-side filtering and pagination without learning each API’s custom parameters. Many enterprise systems (Microsoft Dynamics, SAP) speak OData. An OData URL like ?$filter=result_flag eq 'high' is just WHERE result_flag = 'high' in disguise.
# Read the JSON dump as if it were an API response.
labs_df = pd.read_json('emerald_city_data/labs.json')
# OData $filter=result_flag eq 'high' --> SQL WHERE result_flag = 'high'
# Pandas equivalent:
high_labs = labs_df[labs_df['result_flag'] == 'high']
high_labs.head()
| lab_id | patient_id | panel | collected_on | result_flag | ordering_provider | |
|---|---|---|---|---|---|---|
| 1 | LAB00002 | ECC0002 | HbA1c | 2026-04-09 | high | Dr. Pumperdink |
| 15 | LAB00016 | ECC0009 | BMP | 2026-05-24 | high | Dr. Emerald |
| 16 | LAB00017 | ECC0010 | CBC | 2026-04-03 | high | Dr. Lurline |
| 23 | LAB00024 | ECC0014 | TSH | 2026-05-13 | high | Dr. Quadling |
| 31 | LAB00032 | ECC0021 | TSH | 2026-05-20 | high | Dr. Winkie |
Driver-based: ODBC and JDBC#
When the source is another database, you do not parse files — you connect through a driver. Two driver standards matter:
ODBC (Open Database Connectivity) — a Microsoft-led standard, widely supported on Windows and many other platforms. C, Python, and R all have ODBC clients.
JDBC (Java Database Connectivity) — the same idea, but for Java applications.
In Python, the most common path is SQLAlchemy, which sits above ODBC/JDBC drivers and offers a single API for many databases. The line create_engine('sqlite:///legacy_flow.db') says: make me a connection to the SQLite file legacy_flow.db.
Emerald City has an older patient-flow database living in a separate SQLite file. We build a tiny one below to keep the demo self-contained, then read from it through SQLAlchemy.
import sqlite3
# Step 1: build a tiny legacy database file with 3 visit rows.
# (In real life this database already exists — you'd skip straight to step 2.)
with sqlite3.connect('legacy_flow.db') as con:
con.execute('DROP TABLE IF EXISTS visits')
con.execute('CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, patient_id TEXT, visited_on TEXT)')
con.executemany('INSERT INTO visits VALUES (?, ?, ?)',
[(1, 'ECC0001', '2026-05-01'),
(2, 'ECC0002', '2026-05-02'),
(3, 'ECC0001', '2026-05-10')])
print('legacy_flow.db built')
legacy_flow.db built
from sqlalchemy import create_engine, text
# Step 2: open a connection through SQLAlchemy and read from it.
engine = create_engine('sqlite:///legacy_flow.db')
visits_df = pd.read_sql(text('SELECT * FROM visits'), engine)
visits_df
| visit_id | patient_id | visited_on | |
|---|---|---|---|
| 0 | 1 | ECC0001 | 2026-05-01 |
| 1 | 2 | ECC0002 | 2026-05-02 |
| 2 | 3 | ECC0001 | 2026-05-10 |
We can also query the same database with the %%sql magic. This is the cleaner way when the work is just SQL with no Python in between — and it shows that the SQLAlchemy code above and a plain SQL cell are doing the same thing.
%sql sqlite:///legacy_flow.db
%%sql
SELECT patient_id, COUNT(*) AS visit_count
FROM visits
GROUP BY patient_id;
| patient_id | visit_count |
|---|---|
| ECC0001 | 2 |
| ECC0002 | 1 |
Network protocols: FTP, NFS, SSH, CIFS, RPC, SOAP#
Beyond HTTP, several older protocols still move data between systems. You will see them in firewall rules, vendor agreements, and audit logs.
Protocol |
What it does |
Typical use at Emerald City |
|---|---|---|
FTP / SFTP |
File transfer over the network |
The Scarecrow’s nightly CSV dropped on a shared server |
NFS |
Network file system (Unix) |
A shared mount where the Tin Man writes lab PDFs |
CIFS / SMB |
Network file system (Windows) |
A clinic-wide drive for scanned documents |
SSH |
Secure remote shell and file copy |
Dorothy logs in to the records server to retrieve files |
RPC |
Remote procedure call |
Older inter-system messages used by the Wizard’s billing engine |
SOAP |
XML-based web services |
The hours-of-operation feed that a partner uses |
For most of these, your code does not change — once the file lands in a directory you can read, pd.read_csv (or its cousins) just works. The protocol is an ops concern, not a Python concern.
The exception is SOAP, which gives you XML wrapped in an Envelope and a Body. You don’t have to write SOAP code in this course, but you should recognize the shape if you see it.
print(open('emerald_city_data/soap_response.xml').read())
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body><GetClinicHoursResponse>
<Hours><Day>Monday</Day><Open>08:00</Open><Close>17:00</Close></Hours>
<Hours><Day>Saturday</Day><Open>09:00</Open><Close>13:00</Close></Hours>
</GetClinicHoursResponse></soap:Body>
</soap:Envelope>
Discussion. Which of the protocols above could deliver the Scarecrow’s daily CSV? Which would not make sense for it? Why? Talk it through with a classmate — there are several reasonable answers.
Streaming vs. non-streaming#
A non-streaming source is a fixed file or query result — you read it all at once. The CSV and JSON examples above are non-streaming.
A streaming source delivers data continuously, one record at a time, often forever. The Tin Man’s pharmacy partner pushes refill events to a queue, and Dorothy’s loader reads them one at a time.
Python supports the streaming pattern naturally with file iteration: each line of pharmacy_stream.txt is one event. The code below counts how many refills there are per drug, reading one line at a time so memory stays small.
Read the code slowly. The four key lines are commented.
import json
count_by_drug = {} # an empty dictionary
with open('emerald_city_data/pharmacy_stream.txt') as f: # open the file
for line in f: # one event per line
event = json.loads(line) # parse JSON -> dict
drug = event['drug']
count_by_drug[drug] = count_by_drug.get(drug, 0) + 1 # increment counter
# Sort by count, descending.
for drug, n in sorted(count_by_drug.items(), key=lambda kv: -kv[1]):
print(f'{n:>3d} {drug}')
11 Levothyroxine 50mcg
10 Lisinopril 10mg
10 Atorvastatin 20mg
6 Metformin 500mg
3 Amoxicillin 500mg
Predict before you run. The file has 40 pharmacy events. The five drugs in the catalog are Lisinopril, Metformin, Atorvastatin, Levothyroxine, and Amoxicillin. Will every drug appear at least once? If the seed changed, could one of them appear zero times? Discuss.
Web scraping#
When data lives in a web page but no API exposes it, scraping is the last resort. BeautifulSoup is the standard Python library for parsing HTML — it turns the page into a tree you can search with CSS-like selectors.
A few ethics rules before you ever scrape a real site:
Check the site’s terms of service.
Respect
robots.txt.Identify your scraper with a
User-Agentheader.Throttle requests so you do not overwhelm the server.
For practice, we scrape a local page the generator wrote — no live traffic involved.
from bs4 import BeautifulSoup
# Load the local HTML file into BeautifulSoup.
with open('emerald_city_data/partner_directory.html') as f:
soup = BeautifulSoup(f, 'html.parser')
# Walk each row inside the table with id="partners".
for row in soup.select('table#partners tbody tr'):
cells_ = [c.get_text(strip=True) for c in row.find_all('td')]
print(' | '.join(cells_))
Tin Man Diagnostics | Lab | 555-0101
Scarecrow Records | Records | 555-0102
Lion Security | Compliance | 555-0103
Glinda Analytics | Analytics | 555-0104
Transforming the data with Pandas#
Once you have the data in memory, you usually need to clean it up before loading it into a database. The vocabulary you need is small — and you’ve already seen most of it in the survival kit:
df.rename(columns={...})— rename columns to match the target schema.df['col'] = df['col'].astype(...)— convert a column to the right type.pd.to_datetime(df['col'])— parse strings into proper dates.df.merge(other, on='key')— join two dataframes by a shared column.df.to_sql('table', engine, if_exists='replace')— write the dataframe to a database.
The Emerald City team needs to combine the Scarecrow’s patient list with the Tin Man’s labs so analysts can ask “which towns have the most high lab results?” That’s a JOIN in SQL and a .merge() in Pandas.
# Re-read both sources fresh (so this section runs independently).
patients_df = pd.read_csv('emerald_city_data/patients.csv')
labs_df = pd.read_json('emerald_city_data/labs.json')
# Parse the date columns so we can filter on them later.
patients_df['date_of_birth'] = pd.to_datetime(patients_df['date_of_birth'])
labs_df['collected_on'] = pd.to_datetime(labs_df['collected_on'])
# Merge into a wider analytical view.
# SQL equivalent: SELECT * FROM labs LEFT JOIN patients USING (patient_id);
combined = labs_df.merge(patients_df, on='patient_id', how='left')
combined.head()
| lab_id | patient_id | panel | collected_on | result_flag | ordering_provider | first_name | last_name | date_of_birth | town | primary_provider | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LAB00001 | ECC0002 | Lipid Panel | 2026-03-21 | normal | Dr. Pumperdink | Boq | Gale | 1986-09-11 | Gillikin Country | Dr. Pumperdink |
| 1 | LAB00002 | ECC0002 | HbA1c | 2026-04-09 | high | Dr. Pumperdink | Boq | Gale | 1986-09-11 | Gillikin Country | Dr. Pumperdink |
| 2 | LAB00003 | ECC0003 | HbA1c | 2026-01-09 | normal | Dr. Gale | Fiyero | Munchkin | 1989-09-17 | Winkie Country | Dr. Gale |
| 3 | LAB00004 | ECC0004 | CMP | 2026-01-04 | normal | Dr. Emerald | Tip | Gale | 2020-12-05 | Gillikin Country | Dr. Emerald |
| 4 | LAB00005 | ECC0004 | BMP | 2026-04-27 | low | Dr. Emerald | Tip | Gale | 2020-12-05 | Gillikin Country | Dr. Emerald |
Read it as SQL. The merge above is a LEFT JOIN: every lab keeps its row, even if (somehow) the patient is missing. how='inner' would drop labs with no matching patient; how='right' would do the opposite. The default how='inner' is the same as a plain JOIN.
Loading into SQLite: ETL vs. ELT#
Two patterns describe how the cleaning work is split with the loading work. They are worth understanding as ideas before you see the code.
ETL (Extract, Transform, Load). Pull the data out of the source. Transform it in memory (Pandas, a script, a workflow tool). Load the cleaned result into the warehouse. The database only sees the polished version.
ELT (Extract, Load, Transform). Pull the data out and load it into the warehouse in its raw shape. Transform it later, in SQL, using views or staging tables. The database sees both the raw and the cleaned versions.
ETL is older. It made sense when warehouses were expensive and small. ELT is newer and dominant in cloud warehouses, where storage is cheap and SQL is fast.
The Emerald City project uses ETL for the structured patient/lab combination (the data is already small and already analyzed) and ELT for the streaming pharmacy data (the analytics team prefers to inspect raw events and shape the views themselves).
ETL example: clean in Pandas, then load#
We rename two columns, pick only the fields the analytics team wants, and write the result to a new SQLite database called emerald_city.db.
from sqlalchemy import create_engine, text
warehouse = create_engine('sqlite:///emerald_city.db')
# Transform in pandas...
clean = combined.rename(columns={
'first_name': 'patient_first',
'last_name': 'patient_last',
})
analytics = clean[['lab_id', 'patient_id', 'patient_first', 'patient_last',
'town', 'panel', 'result_flag', 'collected_on']]
# ...then load.
analytics.to_sql('patient_lab_results', warehouse,
if_exists='replace', index=False)
print(f'Loaded {len(analytics)} rows into patient_lab_results')
Loaded 43 rows into patient_lab_results
ELT example: load raw, then transform in SQL#
For the pharmacy stream, we load every raw event into a pharmacy_events_raw table, then build a SQL view that does the aggregation. The view is the “transform” step — it lives in the database, not in Python.
import json
# Load the raw stream into a list of dicts and dump it into the warehouse.
raw_pharmacy = []
with open('emerald_city_data/pharmacy_stream.txt') as f:
for line in f:
raw_pharmacy.append(json.loads(line))
pd.DataFrame(raw_pharmacy).to_sql(
'pharmacy_events_raw', warehouse, if_exists='replace', index=False
)
print(f'Loaded {len(raw_pharmacy)} raw pharmacy events')
Loaded 40 raw pharmacy events
# Create the view that transforms the raw rows. This is the "T" of ELT.
with warehouse.connect() as con:
con.execute(text('DROP VIEW IF EXISTS pharmacy_by_drug'))
con.execute(text('''
CREATE VIEW pharmacy_by_drug AS
SELECT drug, COUNT(*) AS refill_count, SUM(days_supply) AS total_days
FROM pharmacy_events_raw
GROUP BY drug
ORDER BY refill_count DESC
'''))
con.commit()
print('view pharmacy_by_drug created')
view pharmacy_by_drug created
Now switch the %%sql connection to the warehouse and query both tables — the ETL output and the ELT view — to see the result of all that work.
%sql sqlite:///emerald_city.db
%%sql
SELECT town, result_flag, COUNT(*) AS n
FROM patient_lab_results
GROUP BY town, result_flag
ORDER BY town, result_flag;
| town | result_flag | n |
|---|---|---|
| Emerald City | high | 2 |
| Emerald City | low | 3 |
| Emerald City | normal | 5 |
| Gillikin Country | high | 1 |
| Gillikin Country | low | 1 |
| Gillikin Country | normal | 2 |
| Munchkinland | low | 1 |
| Munchkinland | normal | 2 |
| Quadling Country | high | 1 |
| Quadling Country | low | 2 |
%%sql
SELECT * FROM pharmacy_by_drug;
| drug | refill_count | total_days |
|---|---|---|
| Levothyroxine 50mcg | 11 | 330 |
| Lisinopril 10mg | 10 | 300 |
| Atorvastatin 20mg | 10 | 300 |
| Metformin 500mg | 6 | 360 |
| Amoxicillin 500mg | 3 | 63 |
Discussion. For each of these scenarios, would you choose ETL or ELT? Why?
The Tin Man sends a 30-row CSV every morning with the day’s flagged labs.
The pharmacy partner streams 2 million events per day; analysts want to ask new questions of the raw data every week.
A vendor sends an XML feed once a month that must be validated against a strict schema before it touches the warehouse.
The compliance team wants to keep a perfect copy of every raw record from every source for audit.
Troubleshooting common data acquisition issues#
Every integration project hits the same handful of failures. Knowing them by sight cuts hours off the debugging.
Schema mismatch#
The most common failure: the source file changes its column order or renames a column, and the loader silently shifts every value into the wrong field. Patient phone numbers end up in the date-of-birth column.
# Loader assumes column order. The CSV gets re-ordered by mistake.
df = pd.read_csv('emerald_city_data/patients.csv', header=None,
names=['id', 'dob', 'first', 'last', 'town', 'provider']) # WRONG order
How to recognize it. Values look like other values shifted one column over: ages of “Munchkinland,” provider names that look like dates.
Fix. Never rely on column order. Always pass header=0 (the default) and validate that the headers in the file match what you expected:
df = pd.read_csv('emerald_city_data/patients.csv')
expected = {'patient_id', 'first_name', 'last_name', 'date_of_birth', 'town', 'primary_provider'}
missing = expected - set(df.columns)
assert not missing, f'Missing columns: {missing}'
Encoding#
A file written on a Windows machine in Western Europe arrives as Windows-1252. Python defaults to UTF-8 and rejects the bytes:
# Throws: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 137
df = pd.read_csv('partner_export.csv') # source written in cp1252
How to recognize it. A UnicodeDecodeError, often with a specific byte and position.
Fix. Tell Pandas the encoding:
df = pd.read_csv('partner_export.csv', encoding='cp1252')
If you do not know the encoding, the chardet library can guess. When you control the upstream system, ask them to write UTF-8 and never look back.
Permissions#
The loader cannot read the source file:
PermissionError: [Errno 13] Permission denied: '/mnt/scarecrow_share/patients.csv'
How to recognize it. PermissionError with a specific path.
Fix. Check four things, in order:
Does the file path exist? (
os.path.existswill tell you.)Does the current user have read access? (
ls -lon Linux; Properties → Security on Windows.)Is the share mounted with the right credentials?
Has someone rotated the service account password?
This is usually an ops problem, not a code problem. Loop in whoever owns the share before changing the code.
Driver and version issues#
Pandas refuses to write to SQLite:
ImportError: SQLAlchemy required for SQL support
or
DatabaseError: ... module 'sqlite3' has no attribute 'XYZ'
How to recognize it. Import errors or attribute errors that mention the driver itself.
Fix. Pin versions explicitly:
pip install --upgrade pandas sqlalchemy
Most “it works on my machine” failures come down to a driver version skew between development and production. Treat the driver as part of your application’s dependencies, not part of the environment.
Encryption keys in acquisition#
The Tin Man’s API responses are encrypted at rest. The decryption key is stored in a separate file on the loader’s machine. If the key file is missing, the loader fails:
FileNotFoundError: [Errno 2] No such file or directory: '/etc/ecc/tinman.key'
or worse:
ValueError: MAC check failed
The first message is a missing key file. The second is the wrong key file. Both are common when promoting code from dev to prod and forgetting to deploy the keys.
Fix.
Store keys in a secrets manager (cloud provider secret store, HashiCorp Vault), not on disk.
Document key rotation as a step in the deployment checklist.
Cross-reference Chapter 10’s discussion of BYOK and KYOK — those choices decide who can recover from this kind of failure.
Timeouts and corruption#
A long file transfer gets cut off. The loader thinks the file is complete but it ends mid-record:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 487 column 12 (char 14322)
How to recognize it. Parse errors that point to a position far into the file.
Fix.
Compare the file’s size to the upstream size, or use a checksum (
md5sum,sha256sum).If they do not match, retry the transfer.
If they do match, the upstream file itself may be corrupt — escalate.
A robust integration pipeline always validates the transfer before parsing. The cheapest validation is a checksum check.
Syntax and runtime errors#
Two kinds of errors show up most:
Syntax errors in SQL strings constructed with f-strings, especially when a value contains a quote. The fix is to use parameterized queries instead of string concatenation. Chapter 10’s SQL-injection discussion is the security version of the same lesson.
Runtime errors like
KeyErrorwhen a column the loader expects is missing, orValueErrorwhen a date cannot be parsed.
The best defense is a thin validation step between extract and transform: confirm the expected columns are there, sample-check a few values, and raise a clear error if something is off. A loud failure at validation is much cheaper than a silent failure that corrupts the warehouse.
Spot the bug#
Each of the snippets below has one of the failures above. Identify it before reading the answer.
A.
df = pd.read_csv('partner_export.csv')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 42
B.
df = pd.read_csv('patients.csv', header=None,
names=['id', 'town', 'first', 'last', 'dob', 'provider'])
print(df['dob'].head()) # prints things like 'Munchkinland', 'Emerald City'
C.
with open('labs.json') as f:
data = json.load(f)
# JSONDecodeError: Expecting ',' delimiter: line 2031 column 5 (char 92481)
Answers: A is an encoding mismatch (the source isn’t UTF-8). B is a schema mismatch (the column order in names is wrong). C is a corrupted or truncated file.
Emerging tech and AI in data integration#
AI tooling has become a routine part of data integration work. It is not magic, but it is genuinely useful when applied carefully.
Where AI helps in this workflow#
The honest answer is “drafting.” AI coding assistants are good at:
Writing the first pass of a parser for an unfamiliar format.
Generating a regex that matches a description.
Drafting a schema migration from a description of the new fields.
Summarizing the API documentation for a system you have never used.
Dorothy uses an assistant on roughly half her work. The other half — deciding what to build, what to validate, and what to discard — is hers.
Prompt engineering and human-in-the-loop#
Prompt engineering is the skill of writing AI requests that produce useful results. A weak prompt: “write a parser for the Wizard’s XML.” A stronger prompt: “Here is a sample XML file with three records. Write a Python function parse_charts(path) that returns a list of dicts with keys patient_id, opened_on, department. Use xml.etree.ElementTree.”
Human-in-the-loop is the matching discipline on the human side. The AI drafts; the engineer reads, runs tests, and only then commits. Dorothy never trusts AI-generated SQL without running it on test data first.
Hallucinations and verification#
AI assistants confidently make things up. They will reference functions that do not exist, columns that were never in the schema, and behavior the underlying library does not actually have.
A realistic example: Dorothy asks the assistant for a query to summarize labs by patient. The assistant returns:
SELECT patient_id, lab_summary(panel, result_flag) AS summary
FROM labs
GROUP BY patient_id;
lab_summary is not a real function. The assistant invented it from the shape of the question. Without verification, the query would have hit the database and failed — or worse, succeeded against a stub function and returned nonsense.
The rule: nothing AI-generated goes into the pipeline until a human has run it on real data and validated the result.
Retrieval-augmented generation (RAG)#
Retrieval-augmented generation combines a search system with a language model. Instead of asking the model to recall facts from training, the system first searches a trusted corpus, then asks the model to answer using only the retrieved passages.
For Emerald City, RAG lets clinic staff ask questions like “what are the most common high-flag panels this month?” The system retrieves the relevant rows from the warehouse, passes them to the language model, and the model writes a plain-English answer grounded in the retrieved data.
RAG does not eliminate hallucinations, but it makes them much rarer — the model has the actual data in front of it.
ML libraries#
A few libraries come up so often they are worth recognizing by name:
pandas — the standard tool for cleaning and reshaping tabular data, used throughout this chapter.
scikit-learn — classical machine learning models: regression, classification, clustering. The most common library for “predict X from these columns.”
PyTorch and TensorFlow — deep-learning frameworks for neural networks. Used heavily in image, audio, and language work.
Pandas is data prep; the others are model training. The two halves run on the same data but solve different problems.
Virtual data warehouses#
A virtual data warehouse presents many underlying sources as a single queryable warehouse — but it does not actually copy the data. Queries are translated and pushed down to the underlying systems.
Compare with what Dorothy built: a materialized warehouse where data is physically copied into SQLite. Virtual warehouses are great when the source systems are already fast and copying the data is expensive or non-compliant; materialized warehouses are simpler and faster for analytics.
Most large organizations end up with both — some data physically consolidated, some left in place behind a virtual layer.
Robotic process automation (RPA)#
RPA automates the routine click-and-type work that runs around a database — pulling a daily report out of a vendor portal, renaming files, emailing dashboards. It is less glamorous than ML but often more useful.
Emerald City uses an RPA bot to grab a partner’s nightly CSV from a portal that has no API, drop the file onto the shared NFS mount, and trigger the loader. The bot has saved Dorothy hours every week and replaced a fragile manual process.
Lab activity: extend the pipeline#
Now it is your turn. Add a pharmacy CSV to the Emerald City pipeline.
Write a short Python snippet (you can copy from the generator above) that produces
emerald_city_data/pharmacy_partner.csvwith columnspatient_id,drug,days_supply,filled_on.Read your CSV with
pd.read_csv.Validate the columns. Raise a clear error if any expected column is missing — use the schema-mismatch pattern from the troubleshooting section.
Convert
filled_onto a real date withpd.to_datetime.Merge it with
patients_dfonpatient_id.Load the merged result into a new SQLite table called
pharmacy_partner_loadedinemerald_city.db.Using
%%sql, write a query against the warehouse that returns the number of refills per town.
Bonus. Deliberately corrupt your CSV (delete a column header) and verify that your validation step catches it before the bad data reaches the warehouse.
Stretch. Decide whether your pipeline above is ETL or ELT. Could it be the other one? What would change?
Chapter Summary#
Use this list as a self-check. After working through the chapter, you should be able to do each of the following:
You can read short Pandas snippets and translate them into SQL (and vice versa).
You can pull data from CSV, JSON, XML, REST and OData APIs, ODBC/JDBC drivers, network protocols, streaming sources, and scraped HTML.
You can use Pandas to read, clean, merge, and load data into SQLite — and use the
%%sqlmagic to query it back.You can pick between ETL and ELT for a given scenario.
You can diagnose and recover from schema mismatch, encoding, permissions, driver/version, encryption key, timeout/corruption, and syntax/runtime failures.
You can describe how AI tools help with integration work and what their limits are.
Practice with the Loop of the Recursive Dragon#
Sharpen what you just learned with a chapter-matched review set in the Loop of the Recursive Dragon — an adaptive review game with multiple question types and RPG-style mechanics, built for this book.
Glossary#
Use this list as a quick review sheet for the chapter.
Python and Pandas vocabulary#
DataFrame — Pandas’s in-memory table; conceptually the same as a database table, with named columns and rows.
pd.read_csv/pd.read_json/pd.read_xml— Functions that load a file directly into a DataFrame.df.head()— Shows the first few rows; the Pandas equivalent ofLIMIT 5.df.merge(other, on='k')— Joins two DataFrames on a shared key; the Pandas equivalent ofJOIN ... USING (k).df.groupby('g').size()— Groups rows and counts them; equivalent toGROUP BY g, COUNT(*).df.to_sql/pd.read_sql— Writes a DataFrame to a database table and reads a query back into a DataFrame.%%sqlmagic — A Jupyter feature (fromipython-sql) that lets you write a full SQL cell against a connected database, with no Python wrapper.
Acquisition formats and protocols#
ETL (Extract, Transform, Load) — A data-integration pattern in which data is cleaned in memory before it is loaded into the warehouse.
ELT (Extract, Load, Transform) — A data-integration pattern in which data is loaded into the warehouse in its raw shape and transformed later with SQL.
CSV — A plain-text tabular format with one row per line and fields separated by commas.
XML — A self-describing markup format that uses tags to represent structured data.
REST — An architectural style for web APIs that exposes resources at URLs over HTTP.
OData — A convention layered on REST that adds standard query syntax such as
$filter,$select, and$top.SOAP — An older XML-based web-service protocol that wraps every message in an
Envelopeand aBody.RPC (Remote Procedure Call) — A pattern in which a client invokes a procedure on a remote system as if it were local.
ODBC — A driver standard that lets applications connect to many database systems through one API.
JDBC — The same idea as ODBC, but for Java applications.
FTP / SFTP — A protocol for transferring files between systems; SFTP adds encryption.
NFS — A network file system common in Unix environments.
CIFS / SMB — A network file system common in Windows environments.
SSH — A protocol for secure remote shell access and file copying.
Streaming source — A data source that delivers records continuously, one at a time.
Non-streaming source — A data source that delivers a fixed, finite result, such as a file.
Web scraping — Extracting data from a web page by parsing its HTML.
Troubleshooting#
Schema mismatch — A failure in which the source file’s columns no longer match what the loader expects.
Character encoding — The mapping between bytes and characters; a mismatch causes decoding errors.
Driver version mismatch — A failure caused by incompatible versions of a database driver between systems.
Emerging tech and AI#
Prompt engineering — The skill of phrasing AI requests so they produce useful results.
Retrieval-augmented generation (RAG) — A pattern that combines a search step with a language model so answers are grounded in a trusted corpus.
Hallucination — An AI response that confidently states something untrue, such as a non-existent function or column.
Human-in-the-loop — A workflow in which a human reviews and approves AI-generated work before it takes effect.
Virtual data warehouse — A warehouse that exposes many underlying sources as one queryable system without physically copying their data.
Robotic process automation (RPA) — Software that automates routine click-and-type work around existing systems.
Data poisoning — Cross-reference to Chapter 10: an attack that corrupts training or pipeline data to skew downstream behavior.
Answers to the survival-kit check#
patients_df[patients_df['town'] == 'Emerald City']['first_name']LIMIT 5(with the default of 5 rows; passnto change it:df.head(10)).A
LEFT JOIN. Every row of the left table is kept; matches from the right table are filled in where they exist.When the work is just SQL — querying an existing database, building a view, exploring a table. Reach for Pandas when you need to combine SQL with Python logic (parsing files, calling APIs, branching on values).