Open In Colab

Chapter 4: Advanced SQL & Business Continuity (Mario Brothers Plumbing)#

Database and SQL Through Pop Culture | Brendan Shea, PhD#

This chapter works through advanced retrieval against the Mario Brothers Plumbing database: GROUP BY and HAVING, subqueries, query performance in Big-O terms, and JSON inside SQL. A closing case study at the Toad Medical Center connects these techniques to business continuity and disaster recovery planning.

Learning Outcomes#

By the end of this chapter, you will be able to:

  1. Interpret complex database schemas featuring decimal fields, JSON columns, and subtype relationships.

  2. Summarize and filter grouped datasets using GROUP BY, HAVING, and SQL aggregate functions.

  3. Formulate subqueries within SELECT, WHERE, and HAVING clauses to solve multi-step data retrieval problems.

  4. Optimize query outputs using formatting functions and conditional CASE expressions for business reporting.

  5. Query semi-structured JSON data natively within a relational database.

  6. Analyze query efficiency in terms of Big O growth rates as data scales.

  7. Define recovery objectives (RPO/RTO) and explain the systems used to test and achieve them.

  8. Design fault-tolerant architectures that eliminate single points of failure (SPOFs) via replication and geographic redundancy.

  9. Differentiate backup mechanisms—including full, incremental, and differential backups, snapshots, and restore points.

  10. Analyze system log data and reports using advanced SQL queries to monitor backup and recovery integrity.

Keywords: SQL, GROUP BY, HAVING, subqueries, CASE, JSON, DECIMAL, Big O notation, business continuity, disaster recovery, backups

Brendan’s Lecture#

from IPython.display import YouTubeVideo
YouTubeVideo('6VrkMdTZVf8', width=800, height=500)

Mario Brothers Plumbing: Database Schema#

In this lesson, we’ll be working with a database that represents the plumbing business of the famous “Mario Brothers”. Let’s get started.

First, we’ll load the database and display the basic schema.

%%sql
SELECT * FROM sqlite_master WHERE type='table';
Running query in 'sqlite:///mario_bros_plumbing.db'
type name tbl_name rootpage sql
0 table Customers Customers 2 CREATE TABLE Customers (\n  customer_id INTEGE...
1 table Employees Employees 3 CREATE TABLE Employees (\n  employee_id INTEGE...
2 table ServiceTypes ServiceTypes 4 CREATE TABLE ServiceTypes (\n  service_type_id...
3 table Services Services 5 CREATE TABLE Services (\n  service_id INTEGER ...
4 table Orders Orders 6 CREATE TABLE Orders (\n  -- Keeps track of a c...
5 table Order_Items Order_Items 7 CREATE TABLE Order_Items (\n  -- Keeps track o...

Database Overview#

The “Mario Brothers Plumbing” database consists of six interconnected tables designed to manage a plumbing business:

  1. Customers: Stores customer information, including a JSON field for address.

  2. Employees: Stores employee information, including job title and hire date.

  3. ServiceTypes: Stores service type information, including name and description.

  4. Services: Stores service information, including name, description, and price (DECIMAL).

  5. Orders: Stores order information, including customer, employee, date, and total amount (DECIMAL).

  6. Order_Items: Stores order item information, including order, service, and quantity.

Data Types: JSON and DECIMAL#

Two notable data types used in this database are JSON and DECIMAL.

JSON (JavaScript Object Notation) is a lightweight data interchange format that allows for flexible and structured data representation. It can store complex data types like objects and arrays. In this database, JSON is used to store customer addresses, as it provides a convenient way to store and retrieve structured address data without the need for separate address-related tables. Later in this chapter, we’ll see how to use SQLite to query this data.

DECIMAL is a data type used to store precise numeric values, with a specified precision and scale. It is suitable for storing monetary values, such as prices and total amounts, where exactness is crucial. In this database, DECIMAL(10,2) is used, allowing for prices and total amounts up to 99,999,999.99.

Relationships#

The tables in this database are related through one-to-many relationships, established using foreign key constraints:

  • Customers and Orders: A customer can have multiple orders, but an order belongs to only one customer. This is a one-to-many relationship, with the customer_id foreign key in the Orders table referencing the customer_id primary key in the Customers table.

  • Employees and Orders: An employee can handle multiple orders, but an order is handled by only one employee. This is a one-to-many relationship, with the employee_id foreign key in the Orders table referencing the employee_id primary key in the Employees table.

  • Orders and Order_Items: An order can have multiple order items, but an order item belongs to only one order. This is a one-to-many relationship, with the order_id foreign key in the Order_Items table referencing the order_id primary key in the Orders table.

  • Services and Order_Items: A service can be included in multiple order items, but an order item includes only one service. This is a one-to-many relationship, with the service_id foreign key in the Order_Items table referencing the service_id primary key in the Services table.

Sub-type Relationship#

In this database, Services are a sub-type of ServiceTypes. This means that each service belongs to a specific service type, and the service type provides a way to categorize and group related services.

The sub-type relationship is encoded in the relational database using a one-to-many relationship between the ServiceTypes and Services tables. The service_type_id foreign key in the Services table references the service_type_id primary key in the ServiceTypes table. This relationship ensures that each service is associated with a valid service type and allows for efficient querying and data integrity maintenance.

By using a sub-type relationship, the database can store common attributes of service types in the ServiceTypes table, while specific details of individual services are stored in the Services table. This design promotes data normalization, reduces data redundancy, and allows for easier management and extension of the service catalog.

AN ERD for Mario Brothers Plumbing#

Now, let’s take a look a the entity-relationship diagram for this database.

places

handles

contains

included in

categorizes

CUSTOMERS

integer

customer_id

PK

varchar

first_name

varchar

last_name

json

address

varchar

phone_number

ORDERS

integer

order_id

PK

integer

customer_id

FK

integer

employee_id

FK

date

order_date

decimal

total_amount

EMPLOYEES

integer

employee_id

PK

varchar

first_name

varchar

last_name

varchar

job_title

date

hire_date

ORDER_ITEMS

integer

order_item_id

PK

integer

order_id

FK

integer

service_id

FK

integer

quantity

SERVICES

integer

service_id

PK

integer

service_type_id

FK

varchar

service_name

varchar

description

decimal

price

SERVICETYPES

integer

service_type_id

PK

varchar

service_type_name

varchar

description

If you look closely, here’s what this ERD shows us:

  1. Each customer can place multiple orders, but each order is associated with only one customer. This is shown by the one-to-many relationship between CUSTOMERS and ORDERS. The diagram indicates that a customer record contains basic information like name, address, and phone number, while an order includes details such as the order date and total amount.

  2. Employees are responsible for handling orders. The diagram shows a one-to-many relationship between EMPLOYEES and ORDERS, meaning one employee can handle multiple orders, but each order is handled by only one employee. Employee information includes their name, job title, and hire date.

  3. An order consists of multiple order items, represented by the one-to-many relationship between ORDERS and ORDER_ITEMS. Each order item is linked to a specific service and includes the quantity of that service ordered. This structure allows for flexibility in creating orders with various services.

  4. The diagram illustrates a two-level subtype-supertype hierarchy for services. SERVICETYPES categorize the SERVICES offered. Each service belongs to one service type, but a service type can have multiple services. This allows for grouping similar services (e.g., “exterior wash” or “interior cleaning”) while maintaining detailed information about each specific service, including its name, description, and price.

  5. The use of primary keys (PK) and foreign keys (FK) in the diagram ensures data integrity and establishes clear relationships between entities. For example, the order_id in ORDER_ITEMS is a foreign key referencing the primary key in ORDERS, ensuring that every order item is associated with a valid order. Similarly, the service_id in SERVICES is linked to SERVICE_TYPES, maintaining the hierarchical relationship between service types and individual services.

A Quick Look at the Data#

Now, let’s take a quick look at the data in each table.

%%sql
SELECT * FROM employees LIMIT 5;
Running query in 'sqlite:///mario_bros_plumbing.db'
employee_id first_name last_name job_title hire_date
0 1 Super Mario Master Plumber 2000-09-13
1 2 Super Luigi Journeyman Plumber 2003-02-20
2 3 Princess Peach Project Manager 2005-06-10
3 4 Cat Peach Apprentice Plumber 2014-11-05
4 5 Tanuki Mario Plumbing Technician 2011-04-28
%%sql
SELECT * FROM customers LIMIT 5;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id first_name last_name address phone_number
0 1 Peach Toadstool {"street": "Mushroom Castle", "city": "Toad To... (555) 123-4567
1 2 Yoshi Dino {"street": "24 Egg Island", "city": "Dinosaur ... (555) 987-6543
2 3 Daisy Sarasa {"street": "10 Sarasaland Way", "city": "Chai ... (555) 456-7890
3 4 Toadette Toadstool {"street": "15 Mushroom St", "city": "Toad Tow... (555) 789-0123
4 5 Bowser Koopa {"street": "1 Bowser Castle", "city": "Dark La... (555) 654-3210
%%sql
SELECT * FROM serviceTypes LIMIT 5;
Running query in 'sqlite:///mario_bros_plumbing.db'
service_type_id service_type_name description
0 1 Repair Services related to fixing and repairing plumb...
1 2 Installation Services related to installing new plumbing fi...
2 3 Inspection Services related to inspecting and assessing p...
%%sql
SELECT * FROM services LIMIT 5;
Running query in 'sqlite:///mario_bros_plumbing.db'
service_id service_type_id service_name description price
0 1 1 Pipe Repair Fix leaky or broken pipes 50
1 2 1 Drain Cleaning Clear clogged drains and pipes 75
2 3 2 Toilet Installation Install a new toilet 150
3 4 2 Sink Replacement Replace an old or damaged sink 200
4 5 1 Water Heater Repair Fix issues with water heaters 120
%%sql
SELECT * FROM orders LIMIT 5;
Running query in 'sqlite:///mario_bros_plumbing.db'
order_id customer_id employee_id order_date total_amount
0 1 3 1 2001-10-14 925
1 2 2 1 2010-05-29 825
2 3 1 3 2008-03-14 1025
3 4 6 3 2008-04-16 1140
4 5 2 1 2016-03-03 750
%%sql
SELECT * FROM order_items LIMIT 5;
Running query in 'sqlite:///mario_bros_plumbing.db'
order_item_id order_id service_id quantity
0 1 1 1 1
1 2 1 4 2
2 3 1 1 3
3 4 1 2 3
4 5 1 1 2

Using GROUP BY in SQL#

The GROUP BY clause in SQL is used to group rows in a result set based on one or more columns. It is often used in combination with aggregate functions like COUNT(), SUM(), AVG(), MIN(), and MAX() to perform calculations on grouped data.

The basic syntax of GROUP BY is as follows:

SELECT
 column1,
 column2,
 ...,
 aggregate_function(column) -- Ex: SUM(), COUNT(), AVG()
FROM
 table_name
GROUP BY column1, column2, ...;

When using GROUP BY, the SELECT statement should only include columns that are either listed in the GROUP BY clause or used with an aggregate function. The GROUP BY clause comes after the FROM and WHERE clauses but before the ORDER BY clause.

Let’s explore some examples using the “Mario Brothers Plumbing” database to understand how GROUP BY can be used in practice.

Example: Counting Orders per Customer (with Table Aliases)#

Suppose we want to count the number of orders placed by each customer. We can use GROUP BY with the COUNT() aggregate function to achieve this.

%%sql
--Number of orders by each customer
SELECT
  c.customer_id AS "customer_id",
  c.first_name,
  c.last_name,
  COUNT(o.order_id) AS order_count
FROM
  Customers c -- We use a Table alias "c" for "Customers"
  -- Table alias "o" for orders
  JOIN Orders o ON c.customer_id = o.customer_id
-- We group by all columns in the select clause, but NOT the COUNT
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY order_count DESC
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id first_name last_name order_count
0 3 Daisy Sarasa 67
1 1 Peach Toadstool 63
2 2 Yoshi Dino 57
3 6 Wario Wario 32
4 4 Toadette Toadstool 27
5 5 Bowser Koopa 27
6 29 Birdo Egg 4
7 14 Zelda Hyrule 3
8 17 Pikachu Pokemon 3
9 23 Toad Toadstool 3

A few things happen in this query:

  1. First, table aliases are used to give a table, or a subquery in a FROM clause, a temporary name. They can make queries easier to write and to read by shortening the names of tables. In this query, ‘c’ is an alias for the ‘Customers’ table, and ‘o’ is an alias for the ‘Orders’ table. This allows us to refer to these tables using these shorter names throughout the query.

  2. The GROUP BY statement in SQL is used to group rows that have the same values in specified columns. In this case, we’re grouping by ‘c.customer_id’, which means that the result set will have one row for each unique customer_id in the ‘Customers’ table.

  3. The query is also using a aggregate function, COUNT(), to count the number of orders for each customer. The COUNT() function returns the number of rows that matches a specified criteria. In this case, it’s counting the number of ‘o.order_id’ for each group of ‘c.customer_id’.

So, the overall result of this query will be a list of customers (with their customer_id, first_name, and last_name), along with the number of orders that each customer has made.

Example: Calculating Total Order Sales per Employee (with PRINTF)#

Let’s say we want to calculate the total amount of sales handled by each employee. We can use GROUP BY with the SUM() aggregate function.

%%sql
SELECT
  e.employee_id,
  e.first_name,
  e.last_name,
  -- We can use printf to format the way currency appears
  PRINTF("$%.2f", SUM(o.total_amount)) AS total_sales_usd
FROM
  Employees e
  JOIN Orders o ON e.employee_id = o.employee_id
-- One row per employee id
GROUP BY e.employee_id;
Running query in 'sqlite:///mario_bros_plumbing.db'
employee_id first_name last_name total_sales_usd
0 1 Super Mario $68485.00
1 2 Super Luigi $38815.00
2 3 Princess Peach $33755.00
3 4 Cat Peach $11635.00
4 5 Tanuki Mario $25855.00
5 6 Fire Luigi $17430.00
6 7 Toad Toadstool $7185.00

Here, the GROUP BY clause is used to group the result set by one or more columns. In this query, we group the rows by employee_id. This means that the query will produce one row per unique employee_id value. When GROUP BY is used, any column in the SELECT list that is not an aggregate function (like SUM()) must be included in the GROUP BY clause. In this case, employee_id, first_name, and last_name are not aggregate functions, so they must (technically) be listed in the GROUP BY clause. However, since employee_id uniquely identifies each employee, we only need to include employee_id in the GROUP BY clause.

The PRINTF() function is used to format the total sales amount as a currency string. It takes two arguments: a format string and a value.

  • The format string "$%.2f" specifies that the output should start with a dollar sign ($), followed by the value with two decimal places (%.2f).

  • The value passed to PRINTF() is the result of SUM(o.total_amount), which calculates the sum of total_amount for each employee. Since we are grouping by employee_id, the SUM() function will calculate the total sales for each employee.

Using HAVING in SQL#

The HAVING clause in SQL is used to filter the results of an aggregate function based on a specified condition. It is similar to the WHERE clause, but while WHERE filters individual rows before grouping, HAVING filters grouped rows after the GROUP BY clause has been applied.

The basic syntax of HAVING is as follows:

SELECT
  column1,
  column2,
  ...,
  aggregate_function(column)
FROM table_name
(WHERE condition) -- A WHERE comes before
GROUP BY column1, column2, ...
-- HAVING comes after a group by
HAVING condition;

The HAVING clause comes after the GROUP BY clause and before the ORDER BY clause. The condition in the HAVING clause typically involves an aggregate function and can use comparison operators like =, >, <, >=, <=, and <>.

Example: Filtering Employees by Total Sales Amount#

Suppose we want to find employees who have achieved total sales greater than $20,000. We can use HAVING with the SUM() aggregate function to filter the grouped results.

%%sql
SELECT
  e.employee_id,
  e.first_name,
  e.last_name,
  PRINTF("$%.2f", SUM(o.total_amount)) AS total_sales_usd
FROM
  Employees e
  JOIN Orders o ON e.employee_id = o.employee_id
GROUP BY e.employee_id
HAVING SUM(o.total_amount) > 20000;
Running query in 'sqlite:///mario_bros_plumbing.db'
employee_id first_name last_name total_sales_usd
0 1 Super Mario $68485.00
1 2 Super Luigi $38815.00
2 3 Princess Peach $33755.00
3 5 Tanuki Mario $25855.00

In this example, we join the Employees and Orders tables, group the results by employee_id, and calculate the total sales for each employee using SUM(). The HAVING clause then filters the grouped results to include only employees whose total sales exceed $1000.

Example: Filtering Service Types by Average Price#

Let’s say we want to find service types whose average price is greater than $100. We can use HAVING with the AVG() aggregate function to filter the grouped results.

%%sql
SELECT
  st.service_type_name,
  PRINTF("$%.2f", AVG(s.price)) AS average_price
FROM
  ServiceTypes st
  JOIN Services s ON st.service_type_id = s.service_type_id
GROUP BY st.service_type_name
HAVING AVG(s.price) > 100;
Running query in 'sqlite:///mario_bros_plumbing.db'
service_type_name average_price
0 Installation $175.00
1 Repair $123.75

Here, we join the ServiceTypes and Services tables, group the results by service_type_name, and calculate the average price for each service type using AVG(). The HAVING clause then filters the grouped results to include only service types whose average price is greater than $100.

Subqueries in SQL#

A subquery, also known as a nested query or inner query, is a query within another query. It allows you to use the results of one query as input for another query. Subqueries can be used in various parts of an SQL statement, such as SELECT, FROM, WHERE, and HAVING clauses.

The basic syntax of a subquery is as follows:

SELECT ... -- Start of "outer query"
FROM ...
WHERE column_name operator (
  -- Start of subquery ("inner query")
    SELECT ...
    FROM ...
    WHERE ...
);

The subquery is enclosed in parentheses and placed within the outer query. The outer query uses the results of the subquery to perform further operations or filtering.

Let’s take a look at a few examples.

Subquery in the SELECT Clause#

Suppose we want to retrieve the customer details along with the difference between their total order amount and the average order amount of all customers. We can use a subquery in the SELECT clause to calculate the average order amount.

%%sql
SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  SUM(o.total_amount) AS total_order_amount,
  (SUM(o.total_amount) -
    -- We subtract the results of this subquery
    (SELECT AVG(total_amount) FROM Orders)
  ) AS difference_from_average
FROM
  Customers c
  JOIN  Orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id first_name last_name total_order_amount difference_from_average
0 1 Peach Toadstool 37110 36450.38961
1 2 Yoshi Dino 35635 34975.38961
2 3 Daisy Sarasa 46210 45550.38961
3 4 Toadette Toadstool 20450 19790.38961
4 5 Bowser Koopa 18275 17615.38961
5 6 Wario Wario 21465 20805.38961
6 7 Waluigi Wario 2295 1635.38961
7 8 Donkey Kong 75 -584.61039
8 9 Diddy Kong 1065 405.38961
9 11 Cappy Bonneter 570 -89.61039

In this example, the subquery (SELECT AVG(total_amount) FROM Orders) calculates the average order amount of all customers. The result of the subquery is then used in the outer query to calculate the difference between each customer’s total order amount and the average order amount. This calculation would be difficult to achieve without a subquery.

We can use PRINTF to clean up the presentation of this data as follows:

%%sql
SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  PRINTF("$%.2f", SUM(o.total_amount)) AS total_order_amount,
  PRINTF("$%.2f",
    (
    SUM(o.total_amount) -
    -- We subtract the results of this subquery
    (SELECT AVG(total_amount) FROM Orders)
    )
  )AS difference_from_average
FROM
  Customers c
  JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id first_name last_name total_order_amount difference_from_average
0 1 Peach Toadstool $37110.00 $36450.39
1 2 Yoshi Dino $35635.00 $34975.39
2 3 Daisy Sarasa $46210.00 $45550.39
3 4 Toadette Toadstool $20450.00 $19790.39
4 5 Bowser Koopa $18275.00 $17615.39
5 6 Wario Wario $21465.00 $20805.39
6 7 Waluigi Wario $2295.00 $1635.39
7 8 Donkey Kong $75.00 $-584.61
8 9 Diddy Kong $1065.00 $405.39
9 11 Cappy Bonneter $570.00 $-89.61

Subquery in the WHERE or HAVING Clauses#

Let’s say we want to find the employees who have greater than average sales per order.

%%sql
SELECT
  e.employee_id,
  e.first_name,
  e.last_name,
  PRINTF("$%.2f", SUM(o.total_amount) / COUNT(o.total_amount)) AS avg_sales_order
FROM
  Employees e
  JOIN Orders o ON e.employee_id = o.employee_id
GROUP BY e.employee_id
HAVING (SUM(o.total_amount) / COUNT(o.total_amount)) >
  (SELECT AVG(total_amount) FROM Orders);
Running query in 'sqlite:///mario_bros_plumbing.db'
employee_id first_name last_name avg_sales_order
0 1 Super Mario $678.00
1 3 Princess Peach $675.00
2 5 Tanuki Mario $698.00
3 7 Toad Toadstool $718.00

The query retrieves employee details and their average sales per order, but only for employees whose average sales per order is greater than the overall average order amount. It joins the Employees and Orders tables, groups the results by employee_id, calculates the average sales per order using SUM and COUNT, and filters the results using a HAVING clause that compares each employee’s average sales with the overall average calculated by a subquery.

The CASE statement in SQL is a powerful tool for introducing conditional logic into your queries. It allows you to evaluate conditions and return specific values depending on whether those conditions are met. This is particularly useful when you need to categorize data, perform calculations based on certain thresholds, or otherwise transform your dataset without altering the underlying table structure.

General Syntax of CASE#

The basic syntax for a CASE statement looks like this:

SELECT
    CASE
        WHEN condition1 THEN result1
        WHEN condition2 THEN result2
        ...
        ELSE default_result
    END AS new_column_name
FROM table_name;

Components:#

  • WHEN condition THEN result: Each WHEN clause is a condition, and THEN specifies the value that will be returned if the condition is true.

  • ELSE default_result: (Optional) If none of the WHEN conditions are met, the ELSE clause specifies the default value to return.

  • END: This keyword closes the CASE statement.

  • AS new_column_name: This labels the resulting value as a new column in the output.

Motivation for Using CASE#

The primary motivation for using a CASE statement is that it allows for flexible data manipulation directly in SQL queries. Rather than having to pre-process data in another language or tool, you can transform your data in real time as it’s queried. This is especially helpful when categorizing data or creating dynamic outputs without altering the actual table contents.

Example: Using CASE in the Mario Brothers Plumbing Database#

To illustrate how CASE works, let’s consider an example from the Mario Brothers plumbing database. Suppose we have a table called Services, which lists various plumbing services along with their prices. We want to categorize each service as either “Expensive” or “Affordable” based on its price, where any service costing more than 100 coins is labeled as “Expensive” and anything else as “Affordable.”

Here’s the query that accomplishes this:

%%sql
SELECT service_name,
       CASE
           WHEN price > 100 THEN 'Expensive'
           ELSE 'Affordable'
       END AS price_category
FROM Services;
Running query in 'sqlite:///mario_bros_plumbing.db'
service_name price_category
0 Pipe Repair Affordable
1 Drain Cleaning Affordable
2 Toilet Installation Expensive
3 Sink Replacement Expensive
4 Water Heater Repair Expensive
5 Plumbing Inspection Affordable
6 Emergency Service Expensive

JSON and SQL#

In chapter 1, we discussed different logical “data models”, or ways of organizing information. The two most prominent data models in use today are:

  1. The relational model that organizes data into related “tables”. SQL is designed for databases organized on this model. SQLite follows this model, as do most other leading databases (MySQL, Oracle, SQL Server, Postgres).

  2. Databases in key-value structures, such as JSON. These are sometimes called document databases. MongoDB is the most widely used type of this database.

As it turns out, most modern relational databases have added the capacity to deal natively with JSON data. (And so, they are technically “hybrid” databases). We’ll briefly take a look at how this works, using the Address column of the Customer table in our Mario Bros plumbing database.

The JSON (JavaScript Object Notation) format is a lightweight, text-based way to represent structured data. In the following example, the entire snippet is an object—an unordered collection of key-value pairs. A key-value pair consists of a key—always a string—and a value, which may be a string, number, array, another object, boolean, or null. The simplicity and clarity of this structure make JSON ideal for data interchange between systems.

Consider a simple example:

// A simple JSON object representing an address
{
  "street": "Mushroom Castle",
  "city": "Toad Town",
  "zip_code": "12345",
  "country": "Mushroom Kingdom"
}

Each key (such as "street") is associated with a value ("Mushroom Castle"), where both are represented as strings. The following table details each element of this JSON object:

Key

Value

Explanation

street

“Mushroom Castle”

A string representing the street name or location of the building.

city

“Toad Town”

A string denoting the name of the city.

zip_code

“12345”

A string (or sometimes a number) used as the postal code.

country

“Mushroom Kingdom”

A string that identifies the country or region.

To illustrate JSON’s versatility, consider an expanded example that includes nested objects and arrays. This is typical in applications where data has a hierarchical structure:

{
  "person": {
    "name": "Mario",
    "contact": {
      "email": "mario@example.com",
      "phone": "555-0199"
    },
    "addresses": [
      {
        "type": "home",
        "street": "Mushroom Castle",
        "city": "Toad Town",
        "zip_code": "12345",
        "country": "Mushroom Kingdom"
      },
      {
        "type": "office",
        "street": "Peach Blvd",
        "city": "Royal City",
        "zip_code": "67890",
        "country": "Mushroom Kingdom"
      }
    ]
  }
}

In this extended example, the object assigned to "person" encapsulates several key-value pairs. The "contact" key holds another object for email and phone, and the "addresses" key is associated with an array of objects. An array is an ordered list of values, and here each element of the array is a separate address record. This demonstrates how JSON can represent complex, nested data structures in a clear and organized manner.

The use of JSON spans from simple configurations to transmitting detailed data between servers and clients, making it an essential tool in modern data exchange and application development.

Storing JSON in SQLite#

SQLite provides support for storing JSON data directly in database columns. You can store JSON objects as text in a column of type TEXT. In our database, we have a table called Customers with a column named address that stores JSON data representing customer addresses. Let’s see what this looks like:

%%sql
SELECT address FROM Customers LIMIT 15;
Running query in 'sqlite:///mario_bros_plumbing.db'
address
0 {"street": "Mushroom Castle", "city": "Toad To...
1 {"street": "24 Egg Island", "city": "Dinosaur ...
2 {"street": "10 Sarasaland Way", "city": "Chai ...
3 {"street": "15 Mushroom St", "city": "Toad Tow...
4 {"street": "1 Bowser Castle", "city": "Dark La...
5 {"street": "100 Gold Coin Blvd", "city": "Diam...
6 {"street": "101 Silver Coin Ave", "city": "Dia...
7 {"street": "50 Banana Jungle", "city": "DK Isl...
8 {"street": "51 Banana Jungle", "city": "DK Isl...
9 {"street": "Comet Observatory", "city": "Space"}
10 {"street": "1 Hat Kingdom Way", "city": "Cap K...
11 {"street": "1 New Donk City Plaza", "city": "M...
12 {"street": "1 Hyrule Castle", "city": "Hyrule ...
13 {"street": "2 Hyrule Castle", "city": "Hyrule ...
14 {"street": "10 Bounty Hunter Ave", "city": "Sp...

Querying JSON Data in SQLite#

To extract a specific value from a JSON column, you can use the json_extract() function in the SELECT statement. The json_extract() function takes the JSON column and a path expression as arguments. For example, to retrieve the street address for each customer:

%%sql
SELECT
  customer_id,
  -- address is the (JSON) column, and street is a "key"
  json_extract(address, '$.street') AS street
FROM Customers
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id street
0 1 Mushroom Castle
1 2 24 Egg Island
2 3 10 Sarasaland Way
3 4 15 Mushroom St
4 5 1 Bowser Castle
5 6 100 Gold Coin Blvd
6 7 101 Silver Coin Ave
7 8 50 Banana Jungle
8 9 51 Banana Jungle
9 10 Comet Observatory

This query will return the customer_id and the value of the “street” key from the JSON object stored in the address column. The $ symbol represents the root of the JSON object, and .street specifies the path to the “street” key.

You can extract multiple values from a JSON column by specifying multiple path expressions in the json_extract() function. For example, to retrieve the street and city for each customer:

%%sql
SELECT
  customer_id,
  last_name,
  json_extract(address, '$.street') AS street,
  json_extract(address, '$.city') AS city
FROM Customers
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id last_name street city
0 1 Toadstool Mushroom Castle Toad Town
1 2 Dino 24 Egg Island Dinosaur Land
2 3 Sarasa 10 Sarasaland Way Chai Kingdom
3 4 Toadstool 15 Mushroom St Toad Town
4 5 Koopa 1 Bowser Castle Dark Land
5 6 Wario 100 Gold Coin Blvd Diamond City
6 7 Wario 101 Silver Coin Ave Diamond City
7 8 Kong 50 Banana Jungle DK Island
8 9 Kong 51 Banana Jungle DK Island
9 10 Cosmic Comet Observatory Space

You can use JSON values in the WHERE clause to filter rows based on specific conditions. The json_extract() function can be used to extract the desired JSON value for comparison. For example, to find customers who live in the city “Toad Town”:

%%sql
SELECT
  customer_id,
  first_name,
  last_name,
  json_extract(address, '$.city') AS city
FROM Customers
WHERE json_extract(address, '$.city') = 'Toad Town'
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id first_name last_name city
0 1 Peach Toadstool Toad Town
1 4 Toadette Toadstool Toad Town
2 23 Toad Toadstool Toad Town
3 28 Toadsworth Toadstool Toad Town

To check if a specific key exists in a JSON column, you can use the json_type() function in the WHERE clause. The json_type() function returns the data type of the value at the specified path. If the key exists, it will return the data type; otherwise, it will return NULL. For example, to find customers who have a “apartment” key in their address:

%%sql
SELECT
  customer_id,
  first_name,
  last_name,
  json_extract(address, '$.apartment') AS apartment
FROM Customers
WHERE json_type(address, '$.apartment') IS NOT NULL
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
customer_id first_name last_name apartment
0 2 Yoshi Dino A
1 4 Toadette Toadstool 2B
2 27 Lakitu Cloud Skyview

We can also combine JSON with what we’ve been learning about GROUP BY in this chapter. For example, here is a query that gives a count of how many residents live in each city.

%%sql
-- Count of residents in each city
SELECT
  json_extract(address, '$.city') AS city,
  COUNT(*) AS resident_count
FROM Customers
GROUP BY city
LIMIT 10;
Running query in 'sqlite:///mario_bros_plumbing.db'
city resident_count
0 Acorn Plains 1
1 Cap Kingdom 1
2 Chai Kingdom 1
3 DK Island 4
4 Dark Land 4
5 Diamond City 2
6 Dinosaur Land 1
7 Eagleland 1
8 Hyrule Kingdom 2
9 Kanto Region 1

Or, a bit more ambitiously, we could find out some summary statistics about the orders in each city:

%%sql
SELECT
  json_extract(c.address, '$.city') AS city,
  COUNT(*) AS order_count,
  MAX(total_amount) AS max_order_amount,
  MIN(total_amount) AS min_order_amount,
  AVG(total_amount) AS avg_order_amount
FROM
  Customers c
  JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY city
Running query in 'sqlite:///mario_bros_plumbing.db'
city order_count max_order_amount min_order_amount avg_order_amount
0 Acorn Plains 3 1575 225 800.000000
1 Cap Kingdom 1 570 570 570.000000
2 Chai Kingdom 67 1485 75 689.701493
3 DK Island 4 1065 75 553.750000
4 Dark Land 31 1700 75 677.258065
5 Diamond City 34 1620 50 698.823529
6 Dinosaur Land 57 2210 50 625.175439
7 Hyrule Kingdom 4 1025 150 640.000000
8 Kanto Region 3 850 300 483.333333
9 Lycia 1 775 775 775.000000
10 Lylat System 1 300 300 300.000000
11 Metro Kingdom 1 600 600 600.000000
12 Skyworld 3 760 300 603.333333
13 Subcon 4 1375 450 983.750000
14 Toad Town 94 1600 75 637.712766

Putting it Altogether#

Finally, let’s put everything (well, not everything, but a bunch of thigs) we’ve learned into a sigle query.

%%sql
SELECT
  json_extract(c.address, '$.city') AS city,
  COUNT(*) AS order_count,
  PRINTF('$%.2f', MAX(total_amount)) AS max_order_amount,
  PRINTF('$%.2f', MIN(total_amount)) AS min_order_amount,
  PRINTF('$%.2f', AVG(total_amount)) AS avg_order_amount
FROM
  Customers c
  JOIN Orders o ON c.customer_id = o.customer_id
WHERE c.first_name NOT IN ("Wario", "Bowser")
GROUP BY city
HAVING order_count > 10
ORDER BY order_count DESC;
Running query in 'sqlite:///mario_bros_plumbing.db'
city order_count max_order_amount min_order_amount avg_order_amount
0 Toad Town 94 $1600.00 $75.00 $637.71
1 Chai Kingdom 67 $1485.00 $75.00 $689.70
2 Dinosaur Land 57 $2210.00 $50.00 $625.18

This query demonstrates various SQL concepts and techniques that we have learned in the course so far. Let’s go through each part:

  1. SELECT clause:

    • json_extract(address, '$.city') AS city: Extracts the value of the “city” key from the JSON object stored in the address column of the Customers table and aliases it as city.

    • COUNT(*) AS order_count: Counts the number of rows in each group and aliases the result as order_count.

    • PRINTF('$%.2f', MAX(total_amount)) AS max_order_amount: Calculates the maximum value of total_amount for each group, formats it as a currency string using PRINTF, and aliases the result as max_order_amount.

    • PRINTF('$%.2f', MIN(total_amount)) AS min_order_amount: Calculates the minimum value of total_amount for each group, formats it as a currency string using PRINTF, and aliases the result as min_order_amount.

    • PRINTF('$%.2f', AVG(total_amount)) AS avg_order_amount: Calculates the average value of total_amount for each group, formats it as a currency string using PRINTF, and aliases the result as avg_order_amount.

  2. FROM clause:

    • Customers c: Specifies the Customers table as the main table for the query and assigns it the alias c.

  3. JOIN clause:

    • JOIN Orders o ON c.customer_id = o.customer_id: Performs an inner join between the Customers table (aliased as c) and the Orders table (aliased as o) based on the matching customer_id column. This join retrieves rows from both tables where the customer_id values match.

  4. WHERE clause:

    • Customers.first_name NOT IN ("Wario", "Bowser"): Filters the rows from the Customers table where the first_name is not “Wario” or “Bowser”. This condition excludes customers with those specific first names from the result set.

  5. GROUP BY clause:

    • GROUP BY city: Groups the rows based on the city column extracted from the JSON address object. This clause ensures that the aggregate functions (COUNT, MAX, MIN, AVG) are applied to each group of rows with the same city value.

  6. HAVING clause:

    • HAVING order_count > 10: Filters the grouped rows based on the condition that the order_count (calculated by COUNT(*)) is greater than 10. This clause removes cities with 10 or fewer orders from the result set.

  7. ORDER BY clause:

    • ORDER BY order_count DESC: Sorts the result set in descending order based on the order_count column. Cities with the highest number of orders will appear first in the result set.

From Advanced Queries to Business Continuity#

Now that you can write complex subqueries, aggregate data with GROUP BY and HAVING, and extract values from nested JSON documents, you possess the power to extract critical insights from databases. But what happens if the database itself goes offline? What if a server fails, a data center is flooded, or a bad schema update corrupts customer records?

As a database professional, writing high-performance queries is only half of the job. The other half is ensuring that the data is always available and resilient to disasters.

In the following case study, we will visit the Toad Medical Center to see how database administrators apply business continuity planning, design fault-tolerant systems, and manage backups. As you read, think about how the advanced query techniques we just covered—such as extracting JSON logs or analyzing system metrics—help administrators monitor these backup and recovery systems.

Case Study: Business Continuity at the Toad Medical Center#

Just across town from Mario Bros Plumbing, the Toad Town Medical Center stands as the primary healthcare facility serving the Mushroom Kingdom’s diverse population. In a medical facility, database uptime is literally a matter of life and death. If clinicians cannot access electronic health records (EHRs) during a Bowser invasion, a sudden plumbing overflow, or a catastrophic system failure, patient safety is immediately compromised.

Recognizing the critical importance of operational resilience, the Toad Town Medical Center has developed a comprehensive business continuity and disaster recovery (DR) strategy. The goal is simple: ensure that critical patient systems and data remain secure, available, and fully recoverable under any circumstances.

Disaster Recovery Metrics: Defining RPO and RTO#

Disaster recovery (DR) planning is the process of preparing for, responding to, and recovering from major disruptions. When designing a recovery strategy, administrators do not guess; they rely on two critical, time-based metrics to define their success:

  • Recovery Point Objective (RPO) defines the maximum acceptable amount of data loss, measured in time. If the Toad Medical Center has an RPO of one hour for its EHR system, the backup systems must save data at least every hour so that no more than 60 minutes of patient records are ever lost in a crash.

  • Recovery Time Objective (RTO) defines the maximum acceptable downtime before a system must be fully restored. If the pharmacy inventory system has an RTO of two hours, administrators must have the tools and standby systems ready to bring it back online within 120 minutes of a failure.

Regular DR testing via tabletop exercises and full-scale simulations is the only way to verify that a team can meet these metrics. If last night’s simulated failover took three hours when the target RTO was two, the recovery procedures or infrastructure must be optimized. Below is the continuous lifecycle of disaster recovery planning:

Informs

Guides

Prepares for

Identifies issues

Initiates next cycle

Risk Assessment
*Identify potential risks
*Assess vulnerabilities
*Prioritize critical assets

Plan Development
*Create detailed procedures
*Define roles and responsibilities
*Document communication protocols

Implementation
*Install redundant systems
*Set up backup power and data storage
*Establish failover connections

Testing & Training
*Conduct tabletop exercises
*Perform full-scale drills
Train staff on procedures

Evaluation & Update
*Analyze test results
Identify weaknesses
*Revise and improve plan

Architecting Resilience: Fault Tolerance & High Availability#

To avoid disasters altogether, database systems are built for high availability—the ability to remain operational despite individual hardware or network failures. This starts with identifying and eliminating single points of failure (SPOFs), which are any single components whose failure takes down the entire system.

Administrators eliminate SPOFs by building redundant pathways and standby systems:

  • Database Redundancy: Achieved using replication (copying data to another system), mirroring (maintaining an identical real-time copy), or log shipping (periodically feeding database transaction logs to a standby engine). If the primary database fails, a hot standby immediately takes over.

  • Network & Power Redundancy: Uses multiple independent power grids, uninterruptible power supplies (UPS), and load-balanced network interfaces. If a cord or power grid fails, the system switches feeds without dropping a single user.

  • Geographic Redundancy: Moves resilience beyond a single location. A multizone region deployment places replicas in separate data centers within the same cloud region to survive localized power outages. For catastrophic events (like a major volcanic eruption in the Mushroom Kingdom), geo-replication copies data to separate regions thousands of miles away, enabling a complete regional failover.

Data Protection: Backups, Snapshots, and Restore Points#

If systems go down or files are corrupted, the last line of defense is data recovery from a backup. Database administrators must choose the right backup strategy to balance storage space, backup speed, and recovery time:

  • Full Backups create a complete copy of the entire database. They are highly secure and easy to restore from, but they are slow and consume massive storage. These are typically scheduled weekly.

  • Incremental Backups capture only the data that changed since the last backup of any type. They are extremely fast and small, but restoring requires applying the last full backup followed by every subsequent incremental backup in sequence.

  • Differential Backups capture all data changed since the last full backup. They take more storage than incremental backups but are faster to restore, requiring only the full backup and the latest differential file.

  • Snapshots & Restore Points are specialized, point-in-time states. A snapshot is a rapid, storage-level copy of a database volume that tracks block-level changes. A restore point is a named logical marker set before making a risky change (like a database schema migration) so a DBA can instantly roll back if things go wrong.

Backups are useless if they are corrupted. Administrators use a cryptographic hash function—which generates a unique signature for a dataset—to compare the backup file’s signature with the original database. If the signatures match, the backup’s integrity is validated. Finally, a clear retention policy defines how long backups are saved (such as keeping daily incremental files for 30 days before purging them, and archiving monthly snapshots for seven years) to optimize storage costs and meet compliance guidelines.

Preparation & Planning

Backup and Verification

Failure Simulation & Recovery Execution

Validation & Testing

Review & Continuous Improvement

Define RPO & RTO

RPO: Max Data Loss

RTO: Max Downtime

Compliance & Operations: Documentation and Reporting#

A resilient database is not just built; it must be documented and continuously monitored to prove its health to auditors and leadership. Administrators use several standard documents to organize their operations:

  • System Security Plan (SSP) is the central blueprint documenting what sensitive data the database handles, who has access, and what security controls are in place.

  • Continuity of Operations Plan (COOP) describes how the hospital keeps treating patients and running essential services if the database is temporarily offline.

  • System Design Document (SDD) outlines the technical architecture, detailing server locations, replication paths, and software versions needed to rebuild the system from scratch.

  • Failback Plan provides a step-by-step checklist to safely shift operational traffic back to the primary database server after it has recovered, ensuring zero data is lost in transition.

Finally, every backup job must produce a backup report detailing the start time, completion time, file size, and success status. These reports feed into automated dashboards. At Toad Medical, a failed backup triggers an immediate page to the on-call engineer.

Remember the advanced queries we covered earlier in this chapter—such as aggregate functions with GROUP BY and parsing nested JSON logs? Database administrators write those exact queries against system logs to generate these reports and monitor backup success across hundreds of database instances.

Interactive activity: Open this standalone page to use the activity in the Jupyter Book or from Colab.

Open the business continuity quiz

Lab: Practice Your SQL#

Chapter Summary#

Use this checklist as a self-check. After working through this chapter, you should be able to:

  • Group and filter aggregated datasets using the GROUP BY and HAVING clauses, applying functions like COUNT(), AVG(), and PRINTF().

  • Formulate subqueries to serve as inline data sources, filter criteria, or dynamic values within larger SELECT statements.

  • Natively query JSON data stored inside relational columns using functions like json_extract().

  • Analyze query complexity using Big O notation to reason about how execution time scales with input size.

  • Establish recovery goals by defining appropriate Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) for critical systems.

  • Design fault-tolerant architectures that eliminate single points of failure (SPOFs) using replication, high availability, and geographic redundancy.

  • Develop a comprehensive backup strategy that balances full, incremental, and differential backups with snapshots and restore points.

  • Validate data integrity using cryptographic hash functions to confirm backup safety.

  • Maintain operational compliance through System Security Plans (SSP), Continuity of Operations Plans (COOP), and detailed backup reports.

Practice with the Loop of the Recursive Dragon#

Sharpen what you just learned with a chapter-matched review set in the Loop of the Recursive Dragon — an adaptive review game with multiple question types and RPG-style mechanics, built for this book.

Launch the Chapter 4 review set →

Glossary#

Use this reference sheet to quickly review key terms from this chapter.

Advanced SQL & Query Performance#

  • GROUP BY is a SQL clause used to group rows that have the same values in specified columns, allowing aggregate functions to be performed on each group.

  • HAVING is a SQL clause used to filter groups returned by a GROUP BY clause based on aggregate conditions, acting like a WHERE clause for groups.

  • Aggregate Function is a SQL function (such as COUNT, SUM, AVG, MIN, MAX) that combines values from multiple rows to return a single summary value.

  • Subquery is a nested query written inside another SQL statement (such as within a SELECT, WHERE, or HAVING clause) to perform multi-step calculations.

  • Table Alias is a temporary, shortened name assigned to a table within a query to make the SQL code cleaner and easier to write.

  • JSON (JavaScript Object Notation) is a lightweight, human-readable text format used to store and exchange semi-structured data.

  • json_extract() is a built-in SQLite function used to parse a JSON string and retrieve a specific value or object.

  • JSON Path Expression is a string syntax (like $.city or $.address.zip) passed to JSON functions to navigate to a specific key or property within a JSON object.

  • PRINTF() is a formatting function in SQLite used to format text and numeric outputs, such as converting a decimal number to a currency string.

  • Hybrid Database is a database management system that supports multiple data models, such as querying both traditional relational tables and semi-structured JSON documents in the same engine.

  • Big O Notation is a mathematical notation used to describe how the execution time or space requirements of an algorithm grow relative to the size of the input data.

  • O(1) (Constant Complexity) represents a growth rate where the execution time remains exactly the same, regardless of how large the dataset grows.

  • O(n) (Linear Complexity) represents a growth rate where the execution time grows in direct proportion to the size of the dataset.

  • O(n log n) (Linearithmic Complexity) represents a growth rate typical of highly efficient sorting algorithms, where execution time grows slightly faster than linear but far slower than quadratic.

  • O(n^2) (Quadratic Complexity) represents a growth rate where execution time grows proportionally to the square of the dataset size, often caused by nested loops over the data.

Business Continuity & Disaster Recovery#

  • Business Continuity is the strategic planning and operational discipline that ensures critical business operations can continue during and after a major disruption.

  • Disaster Recovery (DR) Planning is the specific process of preparing for, responding to, and recovering from catastrophic events that threaten an organization’s IT infrastructure and data.

  • Recovery Point Objective (RPO) is the maximum acceptable amount of data loss measured in time, dictating how frequently backups must be performed.

  • Recovery Time Objective (RTO) is the maximum acceptable amount of time that a system can remain offline after a disaster before being fully restored.

  • Single Point of Failure (SPOF) is any individual component in an architecture whose failure takes the entire system or service offline.

  • High Availability is a system design quality that ensures a high level of operational uptime through redundancy of hardware, power, network, and database layers.

  • Replication is the process of copying data from a primary database to one or more secondary database systems to ensure availability and load distribution.

  • Mirroring is a high-availability technique that maintains a real-time, identical copy of a database on a secondary server, reflecting changes instantly.

  • Log Shipping is a backup technique that automatically ships transaction logs from a primary database to a secondary standby database, which applies them in sequence.

  • Multizone Region is a cloud deployment architecture that places database replicas across separate physical data centers (zones) within the same geographic region to protect against localized outages.

  • Geo-replication is a redundancy pattern that copies database data to separate data centers located in different regions (often hundreds of miles apart) to survive catastrophic regional disasters.

  • Full Backup is a complete, point-in-time copy of the entire database, providing a full baseline for restoration but consuming the most storage and time.

  • Incremental Backup is a fast, compact backup that copies only the data changed since the last backup of any type (full or incremental).

  • Differential Backup is a backup that copies all data changed since the last full backup, balancing storage efficiency with faster restoration times.

  • Database Dump is a text or binary file containing SQL statements (or data blocks) representing the structure and contents of a database, used for backups and migrations.

  • Snapshot is a rapid, storage-level, point-in-time copy of a storage volume that tracks only changed data blocks rather than copy the entire volume.

  • Restore Point is a named logical marker set manually by an administrator before performing risky operations (like a schema migration) to allow an instant rollback if needed.

  • Hash Function is a mathematical algorithm that generates a unique, fixed-size signature for a file or dataset, used to validate backup integrity and ensure no corruption has occurred.

  • Retention Policy is a set of rules defining how long backups must be kept before being permanently deleted (purged) or moved to long-term storage (archived).

  • Backup Report is a diagnostic summary generated after a backup job finishes, documenting the start time, completion time, size, status, and any errors or warnings.

  • System Security Plan (SSP) is a comprehensive document describing a system’s security architecture, user permissions, and compliance controls.

  • Continuity of Operations Plan (COOP) is a non-technical operational plan detailing how a business or organization keeps essential functions running when key IT systems are offline.

  • System Design Document (SDD) is a technical architecture blueprint outlining the physical and logical layout of a database system.

  • Failback is the formal, structured process of returning operational database traffic to the primary server once it has recovered from a failover.