Chapter 3: Joins and Set Operations (The IMDb Cinema Database)#
Brendan Shea, PhD#
In this chapter, we dive into the powerful world of SQL joins and set operations. We’ll explore how these tools allow us to combine data from multiple tables in a database, enabling us to extract meaningful insights and answer complex questions. Using a database based on data from the Internet Movie Database (IMDB), we’ll learn how to connect related pieces of information, such as movies, actors, directors, and awards. By mastering joins and set operations, you’ll be able to navigate and manipulate relational databases with ease.
Learning Outcomes#
By the end of this chapter, you will be able to:
Interpret a database schema and crow’s foot ERD, including entities, keys, relationships, and join tables.
Write INNER JOIN queries to combine data from related tables.
Join three or more tables to answer questions about movies, people, roles, and awards.
Apply join conditions and filters to limit results correctly.
Use LEFT JOIN to preserve unmatched rows and explain when CROSS JOIN is appropriate or dangerous.
Combine result sets with UNION, INTERSECT, and EXCEPT.
Use joins and set operations to solve real-world reporting and analysis problems.
Keywords: SQL, INNER JOIN, LEFT JOIN, CROSS JOIN, UNION, INTERSECT, EXCEPT, schema, crow’s foot ERD, join tables
Brendan’s Lecture#
Run the following cell to see my lecture.
##Click here to launch my lecture
from IPython.display import YouTubeVideo
YouTubeVideo('jCOYyfoc7ZY', width=800, height=500)
Introduction to the Movie Database#
For this chapter, we’ll be using a database based on data from the Internet Movie Database (IMDB). The full dataset is available here: https://developer.imdb.com/non-commercial-datasets/. We’ll just be using a small part of this data, based on movies that have (at one point or another) been in the “top 100” in terms of box office returns.
To start off with, we need to download a copy of the database and connect to it:
Database Schema: Movies#
Let’s begin by exploring the database schem for movies.db. This will give us:
A list of all the tables in the datase.
The columns (attributes) of each table.
The primary and foreign keys that link the tables together.
We can do this by typing the following command:
%%sql
SELECT *
FROM sqlite_schema
WHERE type = 'table';-- Show the database schema
| type | name | tbl_name | rootpage | sql | |
|---|---|---|---|---|---|
| 0 | table | Movie | Movie | 2 | CREATE TABLE Movie (id CHAR(7) PRIMARY KEY, na... |
| 1 | table | Person | Person | 4 | CREATE TABLE Person (id CHAR(7) PRIMARY KEY, n... |
| 2 | table | Actor | Actor | 11 | CREATE TABLE Actor (actor_id CHAR(7), movie_id... |
| 3 | table | Director | Director | 15 | CREATE TABLE Director(director_id CHAR(7), mov... |
| 4 | table | Oscar | Oscar | 18 | CREATE TABLE Oscar(movie_id CHAR(7), person_id... |
Our IMDb movie database consists of five tables: Movie, Person, Actor, Director, and Oscar. These tables store data related to movies, directors, actors, and Academy Awards. Let us explore each of these tables to understand their attributes, data types, and primary keys.
Movie#
The Movie table contains information about individual movies.
The id column stores a unique integer value for each movie, which acts as the table’s primary key.
The name column stores the text title of the movie.
The year column records the integer year the movie was released.
The rating column stores a text value representing the movie’s content rating, such as “PG” or “R”.
The runtime column records the length of the movie in minutes as an integer.
The genre column stores the primary category of the movie as text.
The earnings_rank column records the movie’s box office ranking as an integer.
Person#
The Person table stores personal profiles of individuals involved in the film industry.
The id column is a unique integer serving as the primary key.
The name column records the full name of the person as text.
The dob column stores the person’s date of birth.
The pob column records their place of birth as a text string.
Actor#
The Actor table is a junction table that establishes the many-to-many relationship between movies and actors.
The actor_id column is a foreign key pointing to the person table, which forms a composite primary key alongside the movie ID.
The movie_id column is a foreign key pointing to the movie table.
Director#
The Director table is a junction table that maps the many-to-many relationship between movies and their directors.
The director_id column is a foreign key pointing to the person table, forming a composite primary key with the movie ID.
The movie_id column is a foreign key pointing to the movie table.
Oscar#
The Oscar table records Academy Award wins.
The movie_id column is a foreign key pointing to the movie table.
The person_id column is a foreign key pointing to the person table who won the award.
The type column records the category of the award as text.
The year column stores the integer year the award was won, which forms a composite primary key with the movie ID, person ID, and award type.
A “Crow’s Foot” Entity-Relationship Diagram of the Movie Data#
To help us better understand the stucture of the movie data, let’s create an entity-relationship diagram. Here, I am using Crow’s Foot style (more on this later). I’m using the Mermaid language to create the diagram.
Interpreting a Crow’s Foot ERD#
Crow’s foot diagrams, also known as Entity-Relationship Diagrams (ERDs), are visual layouts used to represent the logical structure and connections of a database. Let us break down how to read the visual notations in the diagram:
Entities#
We represent individual tables as entities using colored rectangles, such as showing MOVIE, PERSON, ACTOR, DIRECTOR, and OSCAR.
The upper section of the rectangle lists the entity name, while the columns below list the table’s attributes.
We mark the table’s unique identifiers as primary keys (PK) and columns that link to other tables as foreign keys (FK).
Relationships#
We draw lines between the rectangles to represent relationships showing how tables connect to one another. The shapes at the ends of these lines indicate the cardinality of the connection, telling us the minimum and maximum number of matching rows allowed:
The symbol of two parallel vertical lines represents a one-and-only-one relationship, indicating that a row must match exactly one row in the connected table.
The symbol of a circle next to a vertical line represents a zero-or-one relationship, meaning a row can connect to either one row or none at all.
The symbol of a vertical line next to a three-pronged “crow’s foot” represents a one-to-many relationship, indicating that a row connects to at least one or many rows in the other table.
The symbol of a circle next to a “crow’s foot” represents a zero-to-many relationship, meaning a row can connect to any number of rows or none at all.
Reading relationships in our database#
Using these rules, we can read the connections in our movie database:
The connection between
PersonandActorshows that a person can act in zero-to-many movies, but each actor record must connect to one-and-only-one person.The connection between
MovieandActorshows that a movie can feature one-to-many actors, but each actor row links to one-and-only-one movie.The connection between
PersonandDirectorshows that a person can direct zero-to-many movies, while each director row maps to one-and-only-one person.The connection between
MovieandOscarshows that a movie can win zero-to-many Oscars, but each Oscar record links to one-and-only-one movie.
Previewing the Data#
Finally, let’s use some simple SQL SELECT statements to take a look at the data.
The basic SELECT * statement retrieves all columns from a table. The * is a wildcard that means “all columns.” The LIMIT clause restricts the number of rows returned - in this case, we’ll see only the first 10 rows of the Movie table.
%%sql
SELECT *
FROM Movie
LIMIT 10;
| id | name | year | rating | runtime | genre | earnings_rank | |
|---|---|---|---|---|---|---|---|
| 0 | 2488496 | Star Wars: The Force Awakens | 2015 | PG-13 | 138 | A | 1 |
| 1 | 4154796 | Avengers: Endgame | 2019 | PG-13 | 181 | AVS | 2 |
| 2 | 1087260 | Spider-Man: No Way Home | 2021 | PG-13 | 148 | AVFS | 3 |
| 3 | 0499549 | Avatar | 2009 | PG-13 | 162 | AVYS | 4 |
| 4 | 1745960 | Top Gun: Maverick | 2022 | PG-13 | 130 | AD | 5 |
| 5 | 1825683 | Black Panther | 2018 | PG-13 | 134 | AVS | 6 |
| 6 | 4154756 | Avengers: Infinity War | 2018 | PG-13 | 149 | AVYS | 7 |
| 7 | 0120338 | Titanic | 1997 | PG-13 | 194 | DR | 8 |
| 8 | 0369610 | Jurassic World | 2015 | PG-13 | 124 | A | 9 |
| 9 | 0848228 | The Avengers | 2012 | PG-13 | 143 | A | 10 |
This query introduces several aggregate functions:
COUNT(DISTINCT id)counts unique movie IDs, eliminating duplicatesMIN()andMAX()find the smallest and largest values in a columnAVG()calculates the mean valueROUND(x, n)rounds number x to n decimal places Each column is given an alias using theASkeyword to make the output more readable.
%%sql
-- Statics about Movies
SELECT
COUNT(DISTINCT id) AS "num_movies",
MIN(year) AS "earliest_year",
MAX(year) AS "latest_year",
MIN(runtime) AS "shortest_movie",
MAX(runtime) AS "longest_movie",
ROUND(AVG(runtime),2) AS "avg_movie_runtime"
FROM Movie;
| num_movies | earliest_year | latest_year | shortest_movie | longest_movie | avg_movie_runtime | |
|---|---|---|---|---|---|---|
| 0 | 745 | 1927 | 2022 | 66 | 242 | 121.88 |
Another SELECT * query, showing us all columns from the Person table. Like before, LIMIT 10 ensures we only see the first 10 rows, making the output manageable.
%%sql
SELECT *
FROM Person
LIMIT 10;
| id | name | dob | pob | |
|---|---|---|---|---|
| 0 | 0000002 | Lauren Bacall | 1924-09-16 | New York, New York, USA |
| 1 | 0000004 | John Belushi | 1949-01-24 | Chicago, Illinois, USA |
| 2 | 0000006 | Ingrid Bergman | 1915-08-29 | Stockholm, Sweden |
| 3 | 0000007 | Humphrey Bogart | 1899-12-25 | New York, New York, USA |
| 4 | 0000008 | Marlon Brando | 1924-04-03 | Omaha, Nebraska, USA |
| 5 | 0000009 | Richard Burton | 1925-11-10 | Pontrhydyfen, Wales, UK |
| 6 | 0000010 | James Cagney | 1899-07-17 | Yonkers, New York, USA |
| 7 | 0000011 | Gary Cooper | 1901-05-07 | Helena, Montana, USA |
| 8 | 0000012 | Bette Davis | 1908-04-05 | Lowell, Massachusetts, USA |
| 9 | 0000014 | Olivia de Havilland | 1916-07-01 | Tokyo, Japan |
Here we use COUNT(DISTINCT id) to count unique person IDs. The DISTINCT keyword ensures we don’t count any person more than once, even if they appear multiple times in the table.
%%sql
-- Number of disticnt people
SELECT
COUNT(DISTINCT id) AS "num_people"
FROM Person
| num_people | |
|---|---|
| 0 | 2672 |
Looking at the Actor junction table with SELECT *. This shows us how actors are connected to movies through their IDs. The LIMIT 5 clause shows just the first 5 rows.
%%sql
SELECT *
FROM Actor
LIMIT 5;
| actor_id | movie_id | |
|---|---|---|
| 0 | 0000138 | 0120338 |
| 1 | 0000701 | 0120338 |
| 2 | 0000708 | 0120338 |
| 3 | 0000870 | 0120338 |
| 4 | 0000200 | 0120338 |
Using COUNT(DISTINCT actor_id) to count unique actors. Remember that DISTINCT eliminates duplicates, so actors who appear in multiple movies are only counted once.
%%sql
--Number of actors
SELECT COUNT(DISTINCT actor_id) AS "num_actors"
FROM Actor
| num_actors | |
|---|---|
| 0 | 2239 |
Another look at a junction table, this time for directors. The SELECT * shows all columns, while LIMIT 5 keeps the output brief.
%%sql
SELECT *
FROM Director
LIMIT 5;
| director_id | movie_id | |
|---|---|---|
| 0 | 0000116 | 0120338 |
| 1 | 0000184 | 0076759 |
| 2 | 0011470 | 0298148 |
| 3 | 0000229 | 0083866 |
| 4 | 0000184 | 0120915 |
Similar to our actor count, COUNT(DISTINCT director_id) gives us the number of unique directors in our database.
%%sql
--Number of directors
SELECT COUNT(DISTINCT director_id) AS "num_directors"
FROM Director
| num_directors | |
|---|---|
| 0 | 433 |
Examining the Oscar table with SELECT *. The LIMIT 5 clause shows us a sample of the award records.
%%sql
SELECT *
FROM Oscar
LIMIT 5;
| movie_id | person_id | type | year | |
|---|---|---|---|---|
| 0 | 1036646 | NaN | BEST-PICTURE | 2022 |
| 1 | 9620288 | 0000226 | BEST-ACTOR | 2022 |
| 2 | 9115530 | 1567113 | BEST-ACTRESS | 2022 |
| 3 | 3581652 | 3663196 | BEST-SUPPORTING-ACTRESS | 2022 |
| 4 | 1036646 | 1319274 | BEST-SUPPORTING-ACTOR | 2022 |
The SELECT DISTINCT type query shows all unique values in the type column, with no duplicates. This tells us all the different categories of Oscars in our database.
%%sql
--Types of Oscars
SELECT DISTINCT type
FROM Oscar
| type | |
|---|---|
| 0 | BEST-PICTURE |
| 1 | BEST-ACTRESS |
| 2 | BEST-SUPPORTING-ACTRESS |
| 3 | BEST-ACTOR |
| 4 | BEST-SUPPORTING-ACTOR |
| 5 | BEST-DIRECTOR |
A Gentle Introduction to JOINs#
Our Movie table contains details about individual films, such as their titles, ratings, and earnings. Our Person table stores profiles of actors and directors, including their names and places of birth.
However, if you want to find which actors starred in a specific movie, this combined information is not stored in either table alone. To answer these cross-table questions, SQL uses joins to combine rows from multiple tables based on a related column.
In our database, the Actor and Director tables serve as junction tables (or join tables) that establish many-to-many relationships:
The
Actortable links a person ID to a movie ID, indicating who acted in which film.The
Directortable links a person ID to a movie ID, indicating who directed which film.
To find which actors starred in the movie Inception, SQL must link all three tables:
The query starts at the
Movietable to locate the row for Inception.The database joins the
Movietable to theActortable by matching the unique movie IDs.The database joins the
Actortable to thePersontable by matching the person IDs, retrieving the actors’ names.
By writing join queries, you can easily traverse these database links to combine records and answer complex questions.
Example: JOINing Movie and Actor#
In the Movie table, the primary key is id. Both the Person and the Actor table have an attribute called movie_id that is a foreign that links to the is. So, we could write joins like this:
%%sql
SELECT
-- We can select columns from multiple tables
Movie.name,
Actor.actor_id
FROM
-- Here we join the Movie and Actor tables
Movie
JOIN Actor ON Movie.id = Actor.movie_id
LIMIT 5;
| name | actor_id | |
|---|---|---|
| 0 | Titanic | 0000138 |
| 1 | Titanic | 0000701 |
| 2 | Titanic | 0000708 |
| 3 | Titanic | 0000870 |
| 4 | Titanic | 0000200 |
You’ll notice this isn’t very informative (yet), because the actor table actually doesn’t contain the actor’s name. Instead, it just has their actor_id, which is a reference to the Person table. In just a bit, we’ll see how we can fix this (by joining three tables).
Example: JOINing Person and Actor#
We can do the same thing to join the Person (with primary key id) and Actor tables (with foreign key actor_id). A
%%sql
SELECT
-- We can select columns from multiple tables
Person.name,
Actor.actor_id
FROM
-- Here we join the Person and Actor tables
Person
JOIN Actor ON Person.id = Actor.actor_id
LIMIT 5;
| name | actor_id | |
|---|---|---|
| 0 | Leonardo DiCaprio | 0000138 |
| 1 | Kate Winslet | 0000701 |
| 2 | Billy Zane | 0000708 |
| 3 | Kathy Bates | 0000870 |
| 4 | Bill Paxton | 0000200 |
Example: JOINing Movie, Actor, and Person#
In order to get a more useful result, let’s join THREE tables–the Person, Movie and Actor table. Here, the Actor table serves as junction table between Person and Movie
%%sql
SELECT
-- We can select columns from multiple tables
-- We use AS to give them sensible names
Person.name AS "actor_name",
Person.dob AS "person_dob",
Person.pob AS "person_pob",
Movie.name AS "movie_name",
Movie.year AS "movie_year"
FROM
-- Here we join the Person, Movie and Actor tables
Person
JOIN Actor ON Person.id = Actor.actor_id
JOIN Movie ON Actor.movie_id = Movie.id
LIMIT 10;
| actor_name | person_dob | person_pob | movie_name | movie_year | |
|---|---|---|---|---|---|
| 0 | Leonardo DiCaprio | 1974-11-11 | Hollywood, California, USA | Titanic | 1997 |
| 1 | Kate Winslet | 1975-10-05 | Reading, Berkshire, England, UK | Titanic | 1997 |
| 2 | Billy Zane | 1966-02-24 | Chicago, Illinois, USA | Titanic | 1997 |
| 3 | Kathy Bates | 1948-06-28 | Memphis, Tennessee, USA | Titanic | 1997 |
| 4 | Bill Paxton | 1955-05-17 | Fort Worth, Texas, USA | Titanic | 1997 |
| 5 | Mark Hamill | 1951-09-25 | Oakland, California, USA | Star Wars: Episode IV - A New Hope | 1977 |
| 6 | Harrison Ford | 1942-07-13 | Chicago, Illinois, USA | Star Wars: Episode IV - A New Hope | 1977 |
| 7 | Carrie Fisher | 1956-10-21 | Beverly Hills, Los Angeles, California, USA | Star Wars: Episode IV - A New Hope | 1977 |
| 8 | Peter Cushing | 1913-05-26 | Kenley, Surrey, England, UK | Star Wars: Episode IV - A New Hope | 1977 |
| 9 | Alec Guinness | 1914-04-02 | Marylebone, London, England, UK | Star Wars: Episode IV - A New Hope | 1977 |
Example: JOINing Person, Movie, and Director#
We can do the same thing for the Person, Movie, and Director tables.
%%sql
SELECT
Person.name AS "director_name",
Person.dob AS "person_dob",
Person.pob AS "person_pob",
Movie.name AS "movie_name",
Movie.year AS "movie_year"
FROM
-- Here we join the Person, Movie and Director tables
Person
JOIN Director ON Person.id = Director.director_id
JOIN Movie ON Director.movie_id = Movie.id
LIMIT 10;
| director_name | person_dob | person_pob | movie_name | movie_year | |
|---|---|---|---|---|---|
| 0 | James Cameron | 1954-08-14 | Kapuskasing, Ontario, Canada | Titanic | 1997 |
| 1 | George Lucas | 1944-05-14 | Modesto, California, USA | Star Wars: Episode IV - A New Hope | 1977 |
| 2 | Andrew Adamson | 1966-12-01 | Auckland, New Zealand | Shrek 2 | 2004 |
| 3 | Steven Spielberg | 1946-12-18 | Cincinnati, Ohio, USA | E.T. the Extra-Terrestrial | 1982 |
| 4 | George Lucas | 1944-05-14 | Modesto, California, USA | Star Wars: Episode I - The Phantom Menace | 1999 |
| 5 | Sam Raimi | 1959-10-23 | Franklin, Michigan, USA | Spider-Man | 2002 |
| 6 | George Lucas | 1944-05-14 | Modesto, California, USA | Star Wars: Episode III - Revenge of the Sith | 2005 |
| 7 | Peter Jackson | 1961-10-31 | Pukerua Bay, North Island, New Zealand | Lord of the Rings: The Return of the King, The | 2003 |
| 8 | Sam Raimi | 1959-10-23 | Franklin, Michigan, USA | Spider-Man 2 | 2004 |
| 9 | Mel Gibson | 1956-01-03 | Peekskill, New York, USA | Passion of the Christ, The | 2004 |
Example: JOINing with Oscars#
Finally, let’s try joining tables with Oscars. First, the Movie table:
%%sql
SELECT
Movie.name AS "movie_name",
Movie.year AS "movie_year",
Oscar.type AS "oscar_type",
Oscar.year AS "oscar_year"
FROM
-- Here we join the Movie and Oscar tables
Movie
JOIN Oscar ON Movie.id = Oscar.movie_id
LIMIT 10;
| movie_name | movie_year | oscar_type | oscar_year | |
|---|---|---|---|---|
| 0 | CODA | 2021 | BEST-PICTURE | 2022 |
| 1 | King Richard | 2021 | BEST-ACTOR | 2022 |
| 2 | The Eyes of Tammy Faye | 2021 | BEST-ACTRESS | 2022 |
| 3 | West Side Story | 2021 | BEST-SUPPORTING-ACTRESS | 2022 |
| 4 | CODA | 2021 | BEST-SUPPORTING-ACTOR | 2022 |
| 5 | The Power of the Dog | 2021 | BEST-DIRECTOR | 2022 |
| 6 | Nomadland | 2020 | BEST-PICTURE | 2021 |
| 7 | The Father | 2020 | BEST-ACTOR | 2021 |
| 8 | Nomadland | 2020 | BEST-ACTRESS | 2021 |
| 9 | Judas and the Black Messiah | 2021 | BEST-SUPPORTING-ACTOR | 2021 |
And then, the Person and Oscar table:
%%sql
SELECT
Person.name AS "person_name",
Person.dob AS "person_dob",
Person.pob AS "person_pob",
Oscar.type AS "oscar_type",
Oscar.year AS "oscar_year"
FROM
-- Here we join the Person and Oscar tables
Person
JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
| person_name | person_dob | person_pob | oscar_type | oscar_year | |
|---|---|---|---|---|---|
| 0 | Ingrid Bergman | 1915-08-29 | Stockholm, Sweden | BEST-ACTRESS | 1945 |
| 1 | Ingrid Bergman | 1915-08-29 | Stockholm, Sweden | BEST-ACTRESS | 1957 |
| 2 | Ingrid Bergman | 1915-08-29 | Stockholm, Sweden | BEST-SUPPORTING-ACTRESS | 1975 |
| 3 | Humphrey Bogart | 1899-12-25 | New York, New York, USA | BEST-ACTOR | 1952 |
| 4 | Marlon Brando | 1924-04-03 | Omaha, Nebraska, USA | BEST-ACTOR | 1955 |
| 5 | Marlon Brando | 1924-04-03 | Omaha, Nebraska, USA | BEST-ACTOR | 1973 |
| 6 | James Cagney | 1899-07-17 | Yonkers, New York, USA | BEST-ACTOR | 1943 |
| 7 | Gary Cooper | 1901-05-07 | Helena, Montana, USA | BEST-ACTOR | 1942 |
| 8 | Gary Cooper | 1901-05-07 | Helena, Montana, USA | BEST-ACTOR | 1953 |
| 9 | Bette Davis | 1908-04-05 | Lowell, Massachusetts, USA | BEST-ACTRESS | 1936 |
JOINs with Conditions#
Now that we understand at a basic level what joins “are”, let’s put this knowledge to work by writing a few more advanced queries using WHERE clauses.
Example: Actors in ‘PG-13’ Movies#
To accomplish this, we’ll INNER JOIN the Movie and Actor tables, linking them through the id and movie_id columns, respectively. We’ll then filter the results to only include movies with a ‘PG-13’ rating.
%%sql
SELECT
Person.name as "actor",
Movie.name AS "movie_title",
Rating AS "movie_rating"
FROM
Person
JOIN Actor ON Person.id = Actor.actor_id
JOIN Movie ON Actor.movie_id = Movie.id
WHERE
-- Limit our results to PG-13 Moives
Movie.rating = 'PG-13'
LIMIT 10;
| actor | movie_title | movie_rating | |
|---|---|---|---|
| 0 | Leonardo DiCaprio | Titanic | PG-13 |
| 1 | Kate Winslet | Titanic | PG-13 |
| 2 | Billy Zane | Titanic | PG-13 |
| 3 | Kathy Bates | Titanic | PG-13 |
| 4 | Bill Paxton | Titanic | PG-13 |
| 5 | Tobey Maguire | Spider-Man | PG-13 |
| 6 | Willem Dafoe | Spider-Man | PG-13 |
| 7 | Kirsten Dunst | Spider-Man | PG-13 |
| 8 | James Franco | Spider-Man | PG-13 |
| 9 | Cliff Robertson | Spider-Man | PG-13 |
We could alter this query to just get a COUNT of such actors.
%%sql
SELECT
COUNT(DISTINCT Person.name) AS "number_of_actors"
FROM
Person
JOIN Actor ON Person.id = Actor.actor_id
JOIN Movie ON Actor.movie_id = Movie.id
WHERE
Movie.rating = 'PG-13';
| number_of_actors | |
|---|---|
| 0 | 758 |
Example: Text Filtering with LIKE - Directors Born in Minnesota#
What if you’re curious to know which directors were born in Minnesota and have directed a movie? We can INNER JOIN the Person and Director tables and use the SQL LIKE keyword to filter text data.
%%sql
SELECT
Person.name,
Person.pob,
Person.dob
FROM
Person
JOIN Director ON Person.id = Director.director_id
WHERE Person.pob LIKE '%Minnesota%'
LIMIT 10;
| name | pob | dob | |
|---|---|---|---|
| 0 | Pete Docter | Bloomington, Minnesota, USA | 1968-08-10 |
| 1 | George Roy Hill | Minneapolis, Minnesotaa, USA | 1921-12-20 |
| 2 | Joel Coen | Minneapolis, Minnesota, USA | 1954-11-29 |
| 3 | Terry Gilliam | Minneapolis, Minnesota, USA | 1940-11-22 |
| 4 | Ethan Coen | Minneapolis, Minnesota, USA | 1957-09-21 |
| 5 | Pete Docter | Bloomington, Minnesota, USA | 1968-08-10 |
| 6 | Ethan Coen | Minneapolis, Minnesota, USA | 1957-09-21 |
| 7 | Joel Coen | Minneapolis, Minnesota, USA | 1954-11-29 |
| 8 | Pete Docter | Bloomington, Minnesota, USA | 1968-08-10 |
| 9 | Joel Coen | Minneapolis, Minnesota, USA | 1954-11-29 |
Now, let’s alter this query to see how old these directors currenly are.
%%sql
SELECT
Person.name,
Person.pob,
Person.dob,
-- Calculate difference in years
(strftime('%Y', 'now') - strftime('%Y', Person.dob)) AS "current_age"
FROM
Person
JOIN Director ON Person.id = Director.director_id
WHERE Person.pob LIKE '%Minnesota%'
LIMIT 10;
| name | pob | dob | current_age | |
|---|---|---|---|---|
| 0 | Pete Docter | Bloomington, Minnesota, USA | 1968-08-10 | 58 |
| 1 | George Roy Hill | Minneapolis, Minnesotaa, USA | 1921-12-20 | 105 |
| 2 | Joel Coen | Minneapolis, Minnesota, USA | 1954-11-29 | 72 |
| 3 | Terry Gilliam | Minneapolis, Minnesota, USA | 1940-11-22 | 86 |
| 4 | Ethan Coen | Minneapolis, Minnesota, USA | 1957-09-21 | 69 |
| 5 | Pete Docter | Bloomington, Minnesota, USA | 1968-08-10 | 58 |
| 6 | Ethan Coen | Minneapolis, Minnesota, USA | 1957-09-21 | 69 |
| 7 | Joel Coen | Minneapolis, Minnesota, USA | 1954-11-29 | 72 |
| 8 | Pete Docter | Bloomington, Minnesota, USA | 1968-08-10 | 58 |
| 9 | Joel Coen | Minneapolis, Minnesota, USA | 1954-11-29 | 72 |
You might notice there’s a new function in this line:
strftime('%Y', 'now') - strftime('%Y', Person.dob)) AS "current_age"
Let’s break this down:
strftimeis a SQLite function that extracts parts of a date according to a specified format. The format'%Y'extracts the year from a date.'now'is a special string in SQLite that represents the current date and time.Person.dobis the date of birth column in the Person table.strftime('%Y', 'now')extracts the current year.strftime('%Y', Person.dob)extracts the year of birth for each person.By subtracting the year of birth from the current year, we get the current age of each person.
The
AS "current_age"part aliases the result of this calculation as “current_age”, which will be the column name in the output.
So, in plain English, this query calculates the current age of each person by subtracting their year of birth (extracted from the dob column) from the current year (obtained using 'now').
Example: Sorting with ORDER BY - Top Earning PG-13 Movies and Their Directors#
Perhaps you’re interested in knowing which ‘PG-13’ movies earned the most at the box office and who directed them. In this example, we use an INNER JOIN between the Movie and Director tables and sort the results using the ORDER BY keyword.
%%sql
SELECT
Movie.name AS "movie",
Person.name AS "director",
Movie.earnings_rank
FROM
Movie
JOIN Director ON Movie.id = Director.movie_id
JOIN Person ON Director.director_id = Person.id
WHERE
Movie.rating = 'PG-13'
AND Movie.earnings_rank IS NOT NULL
ORDER BY Movie.earnings_rank ASC
LIMIT 10;
| movie | director | earnings_rank | |
|---|---|---|---|
| 0 | Star Wars: The Force Awakens | J.J. Abrams | 1 |
| 1 | Avengers: Endgame | Anthony Russo | 2 |
| 2 | Avengers: Endgame | Joe Russo | 2 |
| 3 | Spider-Man: No Way Home | Jon Watts | 3 |
| 4 | Avatar | James Cameron | 4 |
| 5 | Top Gun: Maverick | Joseph Kosinski | 5 |
| 6 | Black Panther | Ryan Coogler | 6 |
| 7 | Avengers: Infinity War | Anthony Russo | 7 |
| 8 | Avengers: Infinity War | Joe Russo | 7 |
| 9 | Titanic | James Cameron | 8 |
While this query may look a bit scary, there’s actually nothing new here!
We select three columns:
The name of the movie (aliased as “movie”)
The name of the director (aliased as “director”)
The earnings rank of the movie-
We join” three tables:
Starts with the Movie table
Joins the Director table where the movie IDs match
Joins the Person table where the director IDs match
We filter the results:
Includes only movies with a ‘PG-13’ rating
Excludes movies with a null earnings rank-
We orders the results:
Sorts by the earnings rank in ascending order (lowest rank first)-
LEFT JOIN#
A LEFT JOIN, also known as a LEFT OUTER JOIN, is a type of JOIN that returns all the rows from the left table (the first table mentioned in the query), and the matched rows from the right table (the second table mentioned). If there is no match for a row in the left table, the result will still include that row, but with NULL values in the columns from the right table.
Let’s consider an example using our movie database. Suppose we want to get a list of all movies and the Oscars they have won (if any):
%%sql
SELECT
Movie.name AS "movie",
Oscar.type AS "oscar_type",
Oscar.year AS "oscar_year"
FROM
Movie
-- Left join allows us to inlude movies that have not won an Oscar
LEFT JOIN Oscar ON Movie.id = Oscar.movie_id
LIMIT 10;
| movie | oscar_type | oscar_year | |
|---|---|---|---|
| 0 | Star Wars: The Force Awakens | NaN | NaN |
| 1 | Avengers: Endgame | NaN | NaN |
| 2 | Spider-Man: No Way Home | NaN | NaN |
| 3 | Avatar | NaN | NaN |
| 4 | Top Gun: Maverick | NaN | NaN |
| 5 | Black Panther | NaN | NaN |
| 6 | Avengers: Infinity War | NaN | NaN |
| 7 | Titanic | BEST-DIRECTOR | 1998.0 |
| 8 | Titanic | BEST-PICTURE | 1998.0 |
| 9 | Jurassic World | NaN | NaN |
Here’s what the query does:
We select the name of the movie from the
Movietable, and the type and year of the Oscar from theOscartable.We LEFT JOIN the
Oscartable. For each movie, if there’s a corresponding entry in theOscartable (i.e., if the movie has won an Oscar), themovie_idwill match. If a movie hasn’t won an Oscar, there will be no match.For movies that have won an Oscar, the type and year of the Oscar will be included in the result. For movies that haven’t won an Oscar, the result will still include the movie, but the Oscar type and year will be NULL.
You can contrast this with standard (INNER) joins.
%%sql
SELECT
Movie.name AS "movie",
Oscar.type AS "oscar_type",
Oscar.year AS "oscar_year"
FROM
Movie
-- THis time, we ONLY include Movies that won on Oscar
JOIN Oscar ON Movie.id = Oscar.movie_id
LIMIT 10;
| movie | oscar_type | oscar_year | |
|---|---|---|---|
| 0 | CODA | BEST-PICTURE | 2022 |
| 1 | King Richard | BEST-ACTOR | 2022 |
| 2 | The Eyes of Tammy Faye | BEST-ACTRESS | 2022 |
| 3 | West Side Story | BEST-SUPPORTING-ACTRESS | 2022 |
| 4 | CODA | BEST-SUPPORTING-ACTOR | 2022 |
| 5 | The Power of the Dog | BEST-DIRECTOR | 2022 |
| 6 | Nomadland | BEST-PICTURE | 2021 |
| 7 | The Father | BEST-ACTOR | 2021 |
| 8 | Nomadland | BEST-ACTRESS | 2021 |
| 9 | Judas and the Black Messiah | BEST-SUPPORTING-ACTOR | 2021 |
CROSS JOIN: Both Fundamental and Dangerous#
A CROSS JOIN, also known as a Cartesian product, is the most basic type of join operation in relational databases. It combines every single row from the first table with every single row from the second table without checking any matching conditions.
If table A contains 10 rows and table B contains 5 rows, the resulting CROSS JOIN set will contain exactly 50 rows.
Why CROSS JOIN is fundamental#
The database engine uses the Cartesian product as the conceptual foundation for all other join types, running this full pairing before applying filter conditions.
We use a CROSS JOIN in specialized applications to generate combinations, such as pairing a table of colors with a table of sizes to create a complete catalog of product inventory options.
However, because a CROSS JOIN generates a multiplicative combination of rows, running it on large tables can easily produce millions of unwanted rows and freeze your database. In everyday queries, you should always use specific joins (like INNER JOIN) to keep your queries safe and efficient.
%%sql
-- This will generate every possible movie-actor combination
SELECT
Movie.name AS movie_title,
Person.name AS actor_name
FROM
Movie
CROSS JOIN Person
JOIN Actor ON Person.id = Actor.actor_id
LIMIT 20;
| movie_title | actor_name | |
|---|---|---|
| 0 | Star Wars: The Force Awakens | Leonardo DiCaprio |
| 1 | Star Wars: The Force Awakens | Kate Winslet |
| 2 | Star Wars: The Force Awakens | Billy Zane |
| 3 | Star Wars: The Force Awakens | Kathy Bates |
| 4 | Star Wars: The Force Awakens | Bill Paxton |
| 5 | Star Wars: The Force Awakens | Mark Hamill |
| 6 | Star Wars: The Force Awakens | Harrison Ford |
| 7 | Star Wars: The Force Awakens | Carrie Fisher |
| 8 | Star Wars: The Force Awakens | Peter Cushing |
| 9 | Star Wars: The Force Awakens | Alec Guinness |
| 10 | Star Wars: The Force Awakens | Mike Myers |
| 11 | Star Wars: The Force Awakens | Eddie Murphy |
| 12 | Star Wars: The Force Awakens | Cameron Diaz |
| 13 | Star Wars: The Force Awakens | Julie Andrews |
| 14 | Star Wars: The Force Awakens | Antonio Banderas |
| 15 | Star Wars: The Force Awakens | Henry Thomas |
| 16 | Star Wars: The Force Awakens | Dee Wallace-Stone |
| 17 | Star Wars: The Force Awakens | Robert MacNaughton |
| 18 | Star Wars: The Force Awakens | Drew Barrymore |
| 19 | Star Wars: The Force Awakens | Peter Coyote |
Most of these actors never acted in this movie!
Let’s see a comparison of the rows returned by left join vs cross join:
%%sql
-- Compare row counts
SELECT COUNT(*) AS "Total Combinations" FROM Movie CROSS JOIN Person
| Total Combinations | |
|---|---|
| 0 | 1990640 |
%%sql
SELECT COUNT(*) AS "Actual Relationships" FROM Movie JOIN Actor ON Movie.id = Actor.movie_id;
| Actual Relationships | |
|---|---|
| 0 | 3740 |
Legitimate Uses of CROSS JOIN#
Despite its potential pitfalls, CROSS JOIN has several valuable use cases:
For example, sometimes we really do need to generate ALL POSSIBLE combinations of values (for example, every possible movie with every possible rating).
%%sql
-- Generate all possible movie-rating combinations
SELECT
m.name,
r.rating
FROM
Movie m
CROSS JOIN (SELECT DISTINCT rating FROM Movie) r;
| name | rating | |
|---|---|---|
| 0 | Star Wars: The Force Awakens | PG-13 |
| 1 | Star Wars: The Force Awakens | PG |
| 2 | Star Wars: The Force Awakens | G |
| 3 | Star Wars: The Force Awakens | R |
| 4 | Star Wars: The Force Awakens | NC-17 |
| ... | ... | ... |
| 5955 | West Side Story | R |
| 5956 | West Side Story | NC-17 |
| 5957 | West Side Story | M |
| 5958 | West Side Story | GP |
| 5959 | West Side Story | NaN |
5960 rows × 2 columns
Table: Types of SQL Joins#
Join Type |
Description |
Use Case |
|---|---|---|
Inner Join |
This join returns only the rows where there is a matching key value in both tables. |
Use this standard join to find records that have corresponding entries in both tables. |
Left Join |
This join returns all rows from the left table, along with matching rows from the right table, filling unmatched right columns with null values. |
Use this join when you want to preserve all records from your primary table even if they have no matches. |
Right Join |
This join returns all rows from the right table, along with matching rows from the left table, filling unmatched left columns with null values. |
Use this join to preserve all records from the secondary table, which is rarely used since it can be rewritten as a Left Join. |
Full Join |
This join returns all records when there is a match in either the left or the right table, filling unmatched columns on either side with null values. |
Use this join when you need a complete overview of all records across both tables, showing where they match and where they do not. |
Set Operations#
In SQL, set operations allow you to combine the results of two or more SELECT statements. The main set operations are UNION, INTERSECT, and EXCEPT. These operations are based on the mathematical concept of sets.
Let’s explore each of these operations using our movie database.
UNION#
The UNION operation combines the result-set of two or more SELECT statements and removes duplicates. Each SELECT statement within UNION must have the same number of columns, and the columns must also have similar data types.
For example, let’s say we want to get a list of all people who have either directed a movie or won an Oscar:
%%sql
SELECT
Person.name AS "director_or_oscar"
FROM
Person
JOIN Director ON Person.id = Director.director_id
UNION
SELECT
Person.name
FROM
Person
JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
| director_or_oscar | |
|---|---|
| 0 | Adam McKay |
| 1 | Adam Shankman |
| 2 | Adrian Lyne |
| 3 | Adrian Molina |
| 4 | Adrien Brody |
| 5 | Al Pacino |
| 6 | Alan Arkin |
| 7 | Alan J. Pakula |
| 8 | Alan Taylor |
| 9 | Alec Guinness |
This query:
Selects the names of all directors by joining the
PersonandDirectortables.Uses UNION to combine this result with another SELECT statement.
The second SELECT statement selects the names of all Oscar winners by joining the
PersonandOscartables.The UNION operation combines these two result sets and removes any duplicates.
So, the final result will be a list of all people who have either directed a movie or won an Oscar, with no duplicates.
INTERSECT#
The INTERSECT operation returns only the rows that are common between the result-sets of two or more SELECT statements.
Let’s find all people who have both directed a movie and won an Oscar:
%%sql
SELECT
Person.name AS "director_and_oscar"
FROM
Person
JOIN Director ON Person.id = Director.director_id
INTERSECT
SELECT
Person.name
FROM
Person
JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
| director_and_oscar | |
|---|---|
| 0 | Alejandro Gonzalez Inarritu |
| 1 | Alfonso Cuaron |
| 2 | Ang Lee |
| 3 | Anthony Minghella |
| 4 | Barry Levinson |
| 5 | Bernardo Bertolucci |
| 6 | Billy Wilder |
| 7 | Bob Fosse |
| 8 | Bong Joon Ho |
| 9 | Carol Reed |
This query is similar to the UNION example, but uses INTERSECT instead. The result will include only those people who are in both result sets - that is, people who have both directed a movie and won an Oscar.
EXCEPT#
The EXCEPT operation returns all the rows from the first SELECT statement that are not returned by the second SELECT statement.
For example, let’s find all directors who have not won an Oscar:
%%sql
SELECT
Person.name AS "director_no_oscar"
FROM
Person
JOIN Director ON Person.id = Director.director_id
EXCEPT
SELECT
Person.name
FROM
Person
JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
| director_no_oscar | |
|---|---|
| 0 | Adam McKay |
| 1 | Adam Shankman |
| 2 | Adrian Lyne |
| 3 | Adrian Molina |
| 4 | Alan J. Pakula |
| 5 | Alan Taylor |
| 6 | Alex Proyas |
| 7 | Alexander Korda |
| 8 | Alfred E. Green |
| 9 | Alfred Hitchcock |
Putting it Altogether#
We’ll close the section by looking at some slightly more involved examples.
Example: All Star Wars or Star Trek Actors Born in the US#
First, let’s get the names and places of birth of actors who have appeared in a Star Wars OR a Star Trek film who are American (born in the US).
%%sql
SELECT
DISTINCT(Person.name) AS "actor",
Person.pob AS "place_of_birth"
FROM
Person
JOIN Actor ON Person.id = Actor.actor_id
JOIN Movie ON Actor.movie_id = Movie.id
WHERE
(Movie.name LIKE '%Star Wars%'
OR Movie.name LIKE '%Star Trek%')
AND Person.pob LIKE '%USA%'
ORDER BY actor
LIMIT 10;
| actor | place_of_birth | |
|---|---|---|
| 0 | Adam Driver | San Diego, California, USA |
| 1 | Alan Tudyk | El Paso, Texas, USA |
| 2 | Alden Ehrenreich | Los Angeles, California, USA |
| 3 | Billy Dee Williams | New York, New York, USA |
| 4 | Carrie Fisher | Beverly Hills, Los Angeles, California, USA |
| 5 | Chris Pine | Los Angeles, California, USA |
| 6 | Donald Glover | Edwards Air Force Base, California, USA |
| 7 | Harrison Ford | Chicago, Illinois, USA |
| 8 | Jake Lloyd | Fort Collins, Colorado, USA |
| 9 | Leonard Nimoy | Boston, Massachusetts, USA |
Here is how the database engine executes this multi-table join query:
The
SELECT DISTINCTstatement retrieves only unique actor names and their places of birth, using aliases to rename the columns in the final output.The
FROMclause specifies that the query starts at thePersontable.The first
JOINstatement combines thePersontable with theActortable by matching the person IDs.The second
JOINstatement combines the resulting rows with theMovietable by matching the movie IDs.The
WHEREclause filters the combined dataset to include only actors who starred in movies released after the year 2010.The
ORDER BYclause sorts the final list of actors alphabetically by their name.
Example: Summary Statitics About Science Fiction Movies#
Now, let’s construct a query to get some summary statistics (min, max, average) of science fiction movies.
%%sql
SELECT
MIN(Movie.year) AS "earliest_year",
MIN(Movie.runtime) AS "min_runtime",
MAX(Movie.runtime) AS "max_runtime",
ROUND(AVG(Movie.runtime),2) AS "avg_runtime",
COUNT(Movie.id) AS "num_movies",
COUNT(Oscar.type) AS "num_oscars"
FROM
Movie
-- Include sci-movies that didn't win oscars
LEFT JOIN Oscar ON Movie.id = Oscar.movie_id
WHERE
-- 'S' stands for "science fiction"
Movie.genre LIKE '%S%';
| earliest_year | min_runtime | max_runtime | avg_runtime | num_movies | num_oscars | |
|---|---|---|---|---|---|---|
| 0 | 1968 | 81 | 242 | 125.72 | 86 | 3 |
Here is how the database engine executes this aggregate query:
The
SELECTstatement calculates summary values using three aggregate functions:MINfinds the earliest release year,MINfinds the shortest runtime, andMAXfinds the highest earnings rank.The
FROMclause starts the query at the primaryMovietable.The first
JOINstatement links theMovietable to theActortable by matching their movie IDs.The second
JOINstatement links the resulting rows to thePersontable by matching their actor IDs.The
WHEREclause filters the rows so that only movies starring the actor “Tom Hanks” are included in the calculations.
Conclusion: Mastering SQL Joins and Set Operations#
Throughout this chapter, we have explored how SQL combines and manipulates data across multiple tables. Understanding how these operations work together is key to writing powerful queries.
The power of joins#
Joins allow us to rebuild relationships between tables that we separated during database design:
An
INNER JOINfilters out unmatched records, returning only the rows where there is a matching key value in both tables.A
LEFT JOINpreserves every row from your primary table, filling unmatched secondary columns with null markers.A
CROSS JOINgenerates every possible combination between two tables, which serves as the theoretical base for database queries.
The logic of set operations#
While joins combine columns from different tables side-by-side, set operations combine rows from different queries vertically:
The
UNIONoperator merges the results of two queries into a single list, automatically removing any duplicate rows.The
INTERSECToperator returns only the common rows that appear in the results of both queries.The
EXCEPToperator subtracts the results of the second query from the first, returning only the unique rows from the first set.
By mastering joins and set operations, you can navigate complex database relationships to extract, combine, and analyze data efficiently.
Chapter Summary#
Use this list as a self-check. After working through the chapter, you should be able to do each of the following:
You can explain what a join is and how it combines data across tables.
You can choose among INNER, LEFT, RIGHT, and FULL OUTER joins.
You can combine result sets with UNION, INTERSECT, and EXCEPT.
You can write multi-table queries that mix joins and set operations.
You can read a database schema and sketch its entity-relationship diagram.
You can apply joins and sets to answer real questions about your data.
Lab Activity: Practice Your SQL (Highly Recommended!)#
Learning SQL requires practice! You can run the following cell to practice your SQL queries. Let me (Brendan) know if there’s any problem with this app, as it’s still under development.
Practice with the Loop of the Recursive Dragon#
Sharpen what you just learned with a chapter-matched review set in the Loop of the Recursive Dragon — an adaptive review game with multiple question types and RPG-style mechanics, built for this book.
Glossary#
Use this reference sheet to quickly review key database relationship terms.
ER modeling terms#
An entity is a real-world object, person, or concept that we store data about in a table.
An attribute is a specific column in a table that represents a characteristic of an entity.
A relationship describes how different database entities connect to one another.
The cardinality refers to a rule defining the minimum and maximum number of matching rows allowed between two tables.
A crow’s foot diagram is a popular visual layout used to represent database tables and their relationships.
A primary key is a unique column value used to identify each row in a table.
A foreign key is a column that references another table’s primary key to link them together.
A junction table is a helper table that stores foreign keys from two different tables to resolve a many-to-many relationship.
Join keywords#
A join is a SQL operation that combines rows from multiple tables side-by-side based on a related column.
An
INNER JOINreturns only the rows where there is a matching key value in both tables.A
LEFT JOINreturns all rows from the primary table along with matching rows from the secondary table.A
CROSS JOINreturns the Cartesian product of two tables by pairing every row of the first table with every row of the second.
Set keywords#
A set operation is a SQL query that combines the row results of two separate queries vertically.
A
UNIONoperation merges the results of two queries, automatically removing duplicate rows.An
INTERSECToperation returns only the common rows that appear in the results of both queries.An
EXCEPToperation subtracts the results of the second query from the first, returning only the unique rows from the first query.
Types of Relationships: Entity-Relationship Model#
Notation |
Definition |
|---|---|
Customer |
|
Movie |
|
Person |
|
Person |
o–o{ Actor |
Actor }o– |