Open In Colab

Chapter 3: Joins and Set Operations (The IMDb Cinema Database)#

Brendan Shea, PhD#

In this chapter, we dive into the powerful world of SQL joins and set operations. We’ll explore how these tools allow us to combine data from multiple tables in a database, enabling us to extract meaningful insights and answer complex questions. Using a database based on data from the Internet Movie Database (IMDB), we’ll learn how to connect related pieces of information, such as movies, actors, directors, and awards. By mastering joins and set operations, you’ll be able to navigate and manipulate relational databases with ease.

Learning Outcomes#

By the end of this chapter, you will be able to:

  1. Interpret a database schema and crow’s foot ERD, including entities, keys, relationships, and join tables.

  2. Write INNER JOIN queries to combine data from related tables.

  3. Join three or more tables to answer questions about movies, people, roles, and awards.

  4. Apply join conditions and filters to limit results correctly.

  5. Use LEFT JOIN to preserve unmatched rows and explain when CROSS JOIN is appropriate or dangerous.

  6. Combine result sets with UNION, INTERSECT, and EXCEPT.

  7. Use joins and set operations to solve real-world reporting and analysis problems.

Keywords: SQL, INNER JOIN, LEFT JOIN, CROSS JOIN, UNION, INTERSECT, EXCEPT, schema, crow’s foot ERD, join tables

Brendan’s Lecture#

Run the following cell to see my lecture.

##Click here to launch my lecture
from IPython.display import YouTubeVideo
YouTubeVideo('jCOYyfoc7ZY', width=800, height=500)

Introduction to the Movie Database#

For this chapter, we’ll be using a database based on data from the Internet Movie Database (IMDB). The full dataset is available here: https://developer.imdb.com/non-commercial-datasets/. We’ll just be using a small part of this data, based on movies that have (at one point or another) been in the “top 100” in terms of box office returns.

To start off with, we need to download a copy of the database and connect to it:

Database Schema: Movies#

Let’s begin by exploring the database schem for movies.db. This will give us:

  1. A list of all the tables in the datase.

  2. The columns (attributes) of each table.

  3. The primary and foreign keys that link the tables together.

We can do this by typing the following command:

%%sql
SELECT *
FROM sqlite_schema
WHERE type = 'table';-- Show the database schema
Running query in 'sqlite:///imdb_movie.sqlite'
type name tbl_name rootpage sql
0 table Movie Movie 2 CREATE TABLE Movie (id CHAR(7) PRIMARY KEY, na...
1 table Person Person 4 CREATE TABLE Person (id CHAR(7) PRIMARY KEY, n...
2 table Actor Actor 11 CREATE TABLE Actor (actor_id CHAR(7), movie_id...
3 table Director Director 15 CREATE TABLE Director(director_id CHAR(7), mov...
4 table Oscar Oscar 18 CREATE TABLE Oscar(movie_id CHAR(7), person_id...

Our IMDb movie database consists of five tables: Movie, Person, Actor, Director, and Oscar. These tables store data related to movies, directors, actors, and Academy Awards. Let us explore each of these tables to understand their attributes, data types, and primary keys.

Movie#

The Movie table contains information about individual movies.

  • The id column stores a unique integer value for each movie, which acts as the table’s primary key.

  • The name column stores the text title of the movie.

  • The year column records the integer year the movie was released.

  • The rating column stores a text value representing the movie’s content rating, such as “PG” or “R”.

  • The runtime column records the length of the movie in minutes as an integer.

  • The genre column stores the primary category of the movie as text.

  • The earnings_rank column records the movie’s box office ranking as an integer.

Person#

The Person table stores personal profiles of individuals involved in the film industry.

  • The id column is a unique integer serving as the primary key.

  • The name column records the full name of the person as text.

  • The dob column stores the person’s date of birth.

  • The pob column records their place of birth as a text string.

Actor#

The Actor table is a junction table that establishes the many-to-many relationship between movies and actors.

  • The actor_id column is a foreign key pointing to the person table, which forms a composite primary key alongside the movie ID.

  • The movie_id column is a foreign key pointing to the movie table.

Director#

The Director table is a junction table that maps the many-to-many relationship between movies and their directors.

  • The director_id column is a foreign key pointing to the person table, forming a composite primary key with the movie ID.

  • The movie_id column is a foreign key pointing to the movie table.

Oscar#

The Oscar table records Academy Award wins.

  • The movie_id column is a foreign key pointing to the movie table.

  • The person_id column is a foreign key pointing to the person table who won the award.

  • The type column records the category of the award as text.

  • The year column stores the integer year the award was won, which forms a composite primary key with the movie ID, person ID, and award type.

A “Crow’s Foot” Entity-Relationship Diagram of the Movie Data#

To help us better understand the stucture of the movie data, let’s create an entity-relationship diagram. Here, I am using Crow’s Foot style (more on this later). I’m using the Mermaid language to create the diagram.

_images/b01eaf7b28641524c071997b237f41047e1317f71c2ead4481bb484c7cff784c.svg

Interpreting a Crow’s Foot ERD#

Crow’s foot diagrams, also known as Entity-Relationship Diagrams (ERDs), are visual layouts used to represent the logical structure and connections of a database. Let us break down how to read the visual notations in the diagram:

Entities#

  • We represent individual tables as entities using colored rectangles, such as showing MOVIE, PERSON, ACTOR, DIRECTOR, and OSCAR.

  • The upper section of the rectangle lists the entity name, while the columns below list the table’s attributes.

  • We mark the table’s unique identifiers as primary keys (PK) and columns that link to other tables as foreign keys (FK).

Relationships#

We draw lines between the rectangles to represent relationships showing how tables connect to one another. The shapes at the ends of these lines indicate the cardinality of the connection, telling us the minimum and maximum number of matching rows allowed:

  • The symbol of two parallel vertical lines represents a one-and-only-one relationship, indicating that a row must match exactly one row in the connected table.

  • The symbol of a circle next to a vertical line represents a zero-or-one relationship, meaning a row can connect to either one row or none at all.

  • The symbol of a vertical line next to a three-pronged “crow’s foot” represents a one-to-many relationship, indicating that a row connects to at least one or many rows in the other table.

  • The symbol of a circle next to a “crow’s foot” represents a zero-to-many relationship, meaning a row can connect to any number of rows or none at all.

Reading relationships in our database#

Using these rules, we can read the connections in our movie database:

  • The connection between Person and Actor shows that a person can act in zero-to-many movies, but each actor record must connect to one-and-only-one person.

  • The connection between Movie and Actor shows that a movie can feature one-to-many actors, but each actor row links to one-and-only-one movie.

  • The connection between Person and Director shows that a person can direct zero-to-many movies, while each director row maps to one-and-only-one person.

  • The connection between Movie and Oscar shows that a movie can win zero-to-many Oscars, but each Oscar record links to one-and-only-one movie.

Previewing the Data#

Finally, let’s use some simple SQL SELECT statements to take a look at the data.

The basic SELECT * statement retrieves all columns from a table. The * is a wildcard that means “all columns.” The LIMIT clause restricts the number of rows returned - in this case, we’ll see only the first 10 rows of the Movie table.

%%sql
SELECT *
FROM Movie
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
id name year rating runtime genre earnings_rank
0 2488496 Star Wars: The Force Awakens 2015 PG-13 138 A 1
1 4154796 Avengers: Endgame 2019 PG-13 181 AVS 2
2 1087260 Spider-Man: No Way Home 2021 PG-13 148 AVFS 3
3 0499549 Avatar 2009 PG-13 162 AVYS 4
4 1745960 Top Gun: Maverick 2022 PG-13 130 AD 5
5 1825683 Black Panther 2018 PG-13 134 AVS 6
6 4154756 Avengers: Infinity War 2018 PG-13 149 AVYS 7
7 0120338 Titanic 1997 PG-13 194 DR 8
8 0369610 Jurassic World 2015 PG-13 124 A 9
9 0848228 The Avengers 2012 PG-13 143 A 10

This query introduces several aggregate functions:

  • COUNT(DISTINCT id) counts unique movie IDs, eliminating duplicates

  • MIN() and MAX() find the smallest and largest values in a column

  • AVG() calculates the mean value

  • ROUND(x, n) rounds number x to n decimal places Each column is given an alias using the AS keyword to make the output more readable.

%%sql
-- Statics about Movies
SELECT
  COUNT(DISTINCT id) AS "num_movies",
  MIN(year) AS "earliest_year",
  MAX(year) AS "latest_year",
  MIN(runtime) AS "shortest_movie",
  MAX(runtime) AS "longest_movie",
  ROUND(AVG(runtime),2) AS "avg_movie_runtime"
FROM Movie;
Running query in 'sqlite:///imdb_movie.sqlite'
num_movies earliest_year latest_year shortest_movie longest_movie avg_movie_runtime
0 745 1927 2022 66 242 121.88

Another SELECT * query, showing us all columns from the Person table. Like before, LIMIT 10 ensures we only see the first 10 rows, making the output manageable.

%%sql
SELECT *
FROM Person
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
id name dob pob
0 0000002 Lauren Bacall 1924-09-16 New York, New York, USA
1 0000004 John Belushi 1949-01-24 Chicago, Illinois, USA
2 0000006 Ingrid Bergman 1915-08-29 Stockholm, Sweden
3 0000007 Humphrey Bogart 1899-12-25 New York, New York, USA
4 0000008 Marlon Brando 1924-04-03 Omaha, Nebraska, USA
5 0000009 Richard Burton 1925-11-10 Pontrhydyfen, Wales, UK
6 0000010 James Cagney 1899-07-17 Yonkers, New York, USA
7 0000011 Gary Cooper 1901-05-07 Helena, Montana, USA
8 0000012 Bette Davis 1908-04-05 Lowell, Massachusetts, USA
9 0000014 Olivia de Havilland 1916-07-01 Tokyo, Japan

Here we use COUNT(DISTINCT id) to count unique person IDs. The DISTINCT keyword ensures we don’t count any person more than once, even if they appear multiple times in the table.

%%sql
-- Number of disticnt people
SELECT
  COUNT(DISTINCT id) AS "num_people"
FROM Person
Running query in 'sqlite:///imdb_movie.sqlite'
num_people
0 2672

Looking at the Actor junction table with SELECT *. This shows us how actors are connected to movies through their IDs. The LIMIT 5 clause shows just the first 5 rows.

%%sql
SELECT *
FROM Actor
LIMIT 5;
Running query in 'sqlite:///imdb_movie.sqlite'
actor_id movie_id
0 0000138 0120338
1 0000701 0120338
2 0000708 0120338
3 0000870 0120338
4 0000200 0120338

Using COUNT(DISTINCT actor_id) to count unique actors. Remember that DISTINCT eliminates duplicates, so actors who appear in multiple movies are only counted once.

%%sql
--Number of actors
SELECT COUNT(DISTINCT actor_id) AS "num_actors"
FROM Actor
Running query in 'sqlite:///imdb_movie.sqlite'
num_actors
0 2239

Another look at a junction table, this time for directors. The SELECT * shows all columns, while LIMIT 5 keeps the output brief.

%%sql
SELECT *
FROM Director
LIMIT 5;
Running query in 'sqlite:///imdb_movie.sqlite'
director_id movie_id
0 0000116 0120338
1 0000184 0076759
2 0011470 0298148
3 0000229 0083866
4 0000184 0120915

Similar to our actor count, COUNT(DISTINCT director_id) gives us the number of unique directors in our database.

%%sql
--Number of directors
SELECT COUNT(DISTINCT director_id) AS "num_directors"
FROM Director
Running query in 'sqlite:///imdb_movie.sqlite'
num_directors
0 433

Examining the Oscar table with SELECT *. The LIMIT 5 clause shows us a sample of the award records.

%%sql
SELECT *
FROM Oscar
LIMIT 5;
Running query in 'sqlite:///imdb_movie.sqlite'
movie_id person_id type year
0 1036646 NaN BEST-PICTURE 2022
1 9620288 0000226 BEST-ACTOR 2022
2 9115530 1567113 BEST-ACTRESS 2022
3 3581652 3663196 BEST-SUPPORTING-ACTRESS 2022
4 1036646 1319274 BEST-SUPPORTING-ACTOR 2022

The SELECT DISTINCT type query shows all unique values in the type column, with no duplicates. This tells us all the different categories of Oscars in our database.

%%sql
--Types of Oscars
SELECT DISTINCT type
FROM Oscar
Running query in 'sqlite:///imdb_movie.sqlite'
type
0 BEST-PICTURE
1 BEST-ACTRESS
2 BEST-SUPPORTING-ACTRESS
3 BEST-ACTOR
4 BEST-SUPPORTING-ACTOR
5 BEST-DIRECTOR

A Gentle Introduction to JOINs#

Our Movie table contains details about individual films, such as their titles, ratings, and earnings. Our Person table stores profiles of actors and directors, including their names and places of birth.

However, if you want to find which actors starred in a specific movie, this combined information is not stored in either table alone. To answer these cross-table questions, SQL uses joins to combine rows from multiple tables based on a related column.

In our database, the Actor and Director tables serve as junction tables (or join tables) that establish many-to-many relationships:

  • The Actor table links a person ID to a movie ID, indicating who acted in which film.

  • The Director table links a person ID to a movie ID, indicating who directed which film.

To find which actors starred in the movie Inception, SQL must link all three tables:

  1. The query starts at the Movie table to locate the row for Inception.

  2. The database joins the Movie table to the Actor table by matching the unique movie IDs.

  3. The database joins the Actor table to the Person table by matching the person IDs, retrieving the actors’ names.

By writing join queries, you can easily traverse these database links to combine records and answer complex questions.

Example: JOINing Movie and Actor#

In the Movie table, the primary key is id. Both the Person and the Actor table have an attribute called movie_id that is a foreign that links to the is. So, we could write joins like this:

%%sql
SELECT
  -- We can select columns from multiple tables
  Movie.name,
  Actor.actor_id
FROM
  -- Here we join the Movie and Actor tables
  Movie
  JOIN Actor ON Movie.id = Actor.movie_id
LIMIT 5;
Running query in 'sqlite:///imdb_movie.sqlite'
name actor_id
0 Titanic 0000138
1 Titanic 0000701
2 Titanic 0000708
3 Titanic 0000870
4 Titanic 0000200

You’ll notice this isn’t very informative (yet), because the actor table actually doesn’t contain the actor’s name. Instead, it just has their actor_id, which is a reference to the Person table. In just a bit, we’ll see how we can fix this (by joining three tables).

Example: JOINing Person and Actor#

We can do the same thing to join the Person (with primary key id) and Actor tables (with foreign key actor_id). A

%%sql
SELECT
  -- We can select columns from multiple tables
  Person.name,
  Actor.actor_id
FROM
  -- Here we join the Person and Actor tables
  Person
  JOIN Actor ON Person.id = Actor.actor_id
LIMIT 5;
Running query in 'sqlite:///imdb_movie.sqlite'
name actor_id
0 Leonardo DiCaprio 0000138
1 Kate Winslet 0000701
2 Billy Zane 0000708
3 Kathy Bates 0000870
4 Bill Paxton 0000200

Example: JOINing Movie, Actor, and Person#

In order to get a more useful result, let’s join THREE tables–the Person, Movie and Actor table. Here, the Actor table serves as junction table between Person and Movie

%%sql
SELECT
  -- We can select columns from multiple tables
  -- We use AS to give them sensible names
  Person.name AS "actor_name",
  Person.dob AS "person_dob",
  Person.pob AS "person_pob",
  Movie.name AS "movie_name",
  Movie.year AS "movie_year"
FROM
  -- Here we join the Person, Movie and Actor tables
  Person
  JOIN Actor ON Person.id = Actor.actor_id
  JOIN Movie ON Actor.movie_id = Movie.id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
actor_name person_dob person_pob movie_name movie_year
0 Leonardo DiCaprio 1974-11-11 Hollywood, California, USA Titanic 1997
1 Kate Winslet 1975-10-05 Reading, Berkshire, England, UK Titanic 1997
2 Billy Zane 1966-02-24 Chicago, Illinois, USA Titanic 1997
3 Kathy Bates 1948-06-28 Memphis, Tennessee, USA Titanic 1997
4 Bill Paxton 1955-05-17 Fort Worth, Texas, USA Titanic 1997
5 Mark Hamill 1951-09-25 Oakland, California, USA Star Wars: Episode IV - A New Hope 1977
6 Harrison Ford 1942-07-13 Chicago, Illinois, USA Star Wars: Episode IV - A New Hope 1977
7 Carrie Fisher 1956-10-21 Beverly Hills, Los Angeles, California, USA Star Wars: Episode IV - A New Hope 1977
8 Peter Cushing 1913-05-26 Kenley, Surrey, England, UK Star Wars: Episode IV - A New Hope 1977
9 Alec Guinness 1914-04-02 Marylebone, London, England, UK Star Wars: Episode IV - A New Hope 1977

Example: JOINing Person, Movie, and Director#

We can do the same thing for the Person, Movie, and Director tables.

%%sql
SELECT
  Person.name AS "director_name",
  Person.dob AS "person_dob",
  Person.pob AS "person_pob",
  Movie.name AS "movie_name",
  Movie.year AS "movie_year"
FROM
  -- Here we join the Person, Movie and Director tables
  Person
  JOIN Director ON Person.id = Director.director_id
  JOIN Movie ON Director.movie_id = Movie.id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
director_name person_dob person_pob movie_name movie_year
0 James Cameron 1954-08-14 Kapuskasing, Ontario, Canada Titanic 1997
1 George Lucas 1944-05-14 Modesto, California, USA Star Wars: Episode IV - A New Hope 1977
2 Andrew Adamson 1966-12-01 Auckland, New Zealand Shrek 2 2004
3 Steven Spielberg 1946-12-18 Cincinnati, Ohio, USA E.T. the Extra-Terrestrial 1982
4 George Lucas 1944-05-14 Modesto, California, USA Star Wars: Episode I - The Phantom Menace 1999
5 Sam Raimi 1959-10-23 Franklin, Michigan, USA Spider-Man 2002
6 George Lucas 1944-05-14 Modesto, California, USA Star Wars: Episode III - Revenge of the Sith 2005
7 Peter Jackson 1961-10-31 Pukerua Bay, North Island, New Zealand Lord of the Rings: The Return of the King, The 2003
8 Sam Raimi 1959-10-23 Franklin, Michigan, USA Spider-Man 2 2004
9 Mel Gibson 1956-01-03 Peekskill, New York, USA Passion of the Christ, The 2004

Example: JOINing with Oscars#

Finally, let’s try joining tables with Oscars. First, the Movie table:

%%sql
SELECT
  Movie.name AS "movie_name",
  Movie.year AS "movie_year",
  Oscar.type AS "oscar_type",
  Oscar.year AS "oscar_year"
FROM
  -- Here we join the Movie and Oscar tables
  Movie
  JOIN Oscar ON Movie.id = Oscar.movie_id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
movie_name movie_year oscar_type oscar_year
0 CODA 2021 BEST-PICTURE 2022
1 King Richard 2021 BEST-ACTOR 2022
2 The Eyes of Tammy Faye 2021 BEST-ACTRESS 2022
3 West Side Story 2021 BEST-SUPPORTING-ACTRESS 2022
4 CODA 2021 BEST-SUPPORTING-ACTOR 2022
5 The Power of the Dog 2021 BEST-DIRECTOR 2022
6 Nomadland 2020 BEST-PICTURE 2021
7 The Father 2020 BEST-ACTOR 2021
8 Nomadland 2020 BEST-ACTRESS 2021
9 Judas and the Black Messiah 2021 BEST-SUPPORTING-ACTOR 2021

And then, the Person and Oscar table:

%%sql
SELECT
  Person.name AS "person_name",
  Person.dob AS "person_dob",
  Person.pob AS "person_pob",
  Oscar.type AS "oscar_type",
  Oscar.year AS "oscar_year"
FROM
  -- Here we join the Person and Oscar tables
  Person
  JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
person_name person_dob person_pob oscar_type oscar_year
0 Ingrid Bergman 1915-08-29 Stockholm, Sweden BEST-ACTRESS 1945
1 Ingrid Bergman 1915-08-29 Stockholm, Sweden BEST-ACTRESS 1957
2 Ingrid Bergman 1915-08-29 Stockholm, Sweden BEST-SUPPORTING-ACTRESS 1975
3 Humphrey Bogart 1899-12-25 New York, New York, USA BEST-ACTOR 1952
4 Marlon Brando 1924-04-03 Omaha, Nebraska, USA BEST-ACTOR 1955
5 Marlon Brando 1924-04-03 Omaha, Nebraska, USA BEST-ACTOR 1973
6 James Cagney 1899-07-17 Yonkers, New York, USA BEST-ACTOR 1943
7 Gary Cooper 1901-05-07 Helena, Montana, USA BEST-ACTOR 1942
8 Gary Cooper 1901-05-07 Helena, Montana, USA BEST-ACTOR 1953
9 Bette Davis 1908-04-05 Lowell, Massachusetts, USA BEST-ACTRESS 1936

JOINs with Conditions#

Now that we understand at a basic level what joins “are”, let’s put this knowledge to work by writing a few more advanced queries using WHERE clauses.

Example: Actors in ‘PG-13’ Movies#

To accomplish this, we’ll INNER JOIN the Movie and Actor tables, linking them through the id and movie_id columns, respectively. We’ll then filter the results to only include movies with a ‘PG-13’ rating.

%%sql
SELECT
  Person.name as "actor",
  Movie.name AS "movie_title",
  Rating AS "movie_rating"
FROM
  Person
  JOIN Actor ON Person.id = Actor.actor_id
  JOIN Movie ON Actor.movie_id = Movie.id
WHERE
  -- Limit our results to PG-13 Moives
  Movie.rating = 'PG-13'
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
actor movie_title movie_rating
0 Leonardo DiCaprio Titanic PG-13
1 Kate Winslet Titanic PG-13
2 Billy Zane Titanic PG-13
3 Kathy Bates Titanic PG-13
4 Bill Paxton Titanic PG-13
5 Tobey Maguire Spider-Man PG-13
6 Willem Dafoe Spider-Man PG-13
7 Kirsten Dunst Spider-Man PG-13
8 James Franco Spider-Man PG-13
9 Cliff Robertson Spider-Man PG-13

We could alter this query to just get a COUNT of such actors.

%%sql
SELECT
  COUNT(DISTINCT Person.name) AS "number_of_actors"
FROM
  Person
  JOIN Actor ON Person.id = Actor.actor_id
  JOIN Movie ON Actor.movie_id = Movie.id
WHERE
  Movie.rating = 'PG-13';
Running query in 'sqlite:///imdb_movie.sqlite'
number_of_actors
0 758

Example: Text Filtering with LIKE - Directors Born in Minnesota#

What if you’re curious to know which directors were born in Minnesota and have directed a movie? We can INNER JOIN the Person and Director tables and use the SQL LIKE keyword to filter text data.

%%sql
SELECT
  Person.name,
  Person.pob,
  Person.dob
FROM
  Person
  JOIN Director ON Person.id = Director.director_id
WHERE Person.pob LIKE '%Minnesota%'
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
name pob dob
0 Pete Docter Bloomington, Minnesota, USA 1968-08-10
1 George Roy Hill Minneapolis, Minnesotaa, USA 1921-12-20
2 Joel Coen Minneapolis, Minnesota, USA 1954-11-29
3 Terry Gilliam Minneapolis, Minnesota, USA 1940-11-22
4 Ethan Coen Minneapolis, Minnesota, USA 1957-09-21
5 Pete Docter Bloomington, Minnesota, USA 1968-08-10
6 Ethan Coen Minneapolis, Minnesota, USA 1957-09-21
7 Joel Coen Minneapolis, Minnesota, USA 1954-11-29
8 Pete Docter Bloomington, Minnesota, USA 1968-08-10
9 Joel Coen Minneapolis, Minnesota, USA 1954-11-29

Now, let’s alter this query to see how old these directors currenly are.

%%sql
SELECT
  Person.name,
  Person.pob,
  Person.dob,
  -- Calculate difference in years
  (strftime('%Y', 'now') - strftime('%Y', Person.dob)) AS "current_age"
FROM
  Person
  JOIN Director ON Person.id = Director.director_id
WHERE Person.pob LIKE '%Minnesota%'
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
name pob dob current_age
0 Pete Docter Bloomington, Minnesota, USA 1968-08-10 58
1 George Roy Hill Minneapolis, Minnesotaa, USA 1921-12-20 105
2 Joel Coen Minneapolis, Minnesota, USA 1954-11-29 72
3 Terry Gilliam Minneapolis, Minnesota, USA 1940-11-22 86
4 Ethan Coen Minneapolis, Minnesota, USA 1957-09-21 69
5 Pete Docter Bloomington, Minnesota, USA 1968-08-10 58
6 Ethan Coen Minneapolis, Minnesota, USA 1957-09-21 69
7 Joel Coen Minneapolis, Minnesota, USA 1954-11-29 72
8 Pete Docter Bloomington, Minnesota, USA 1968-08-10 58
9 Joel Coen Minneapolis, Minnesota, USA 1954-11-29 72

You might notice there’s a new function in this line:

strftime('%Y', 'now') - strftime('%Y', Person.dob)) AS "current_age"

Let’s break this down:

  1. strftime is a SQLite function that extracts parts of a date according to a specified format. The format '%Y' extracts the year from a date.

  2. 'now' is a special string in SQLite that represents the current date and time.

  3. Person.dob is the date of birth column in the Person table.

  4. strftime('%Y', 'now') extracts the current year.

  5. strftime('%Y', Person.dob) extracts the year of birth for each person.

  6. By subtracting the year of birth from the current year, we get the current age of each person.

  7. The AS "current_age" part aliases the result of this calculation as “current_age”, which will be the column name in the output.

So, in plain English, this query calculates the current age of each person by subtracting their year of birth (extracted from the dob column) from the current year (obtained using 'now').

Example: Sorting with ORDER BY - Top Earning PG-13 Movies and Their Directors#

Perhaps you’re interested in knowing which ‘PG-13’ movies earned the most at the box office and who directed them. In this example, we use an INNER JOIN between the Movie and Director tables and sort the results using the ORDER BY keyword.

%%sql
SELECT
  Movie.name AS "movie",
  Person.name AS "director",
  Movie.earnings_rank
FROM
  Movie
  JOIN Director ON Movie.id = Director.movie_id
  JOIN Person ON Director.director_id = Person.id
WHERE
  Movie.rating = 'PG-13'
  AND Movie.earnings_rank IS NOT NULL
ORDER BY Movie.earnings_rank ASC
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
movie director earnings_rank
0 Star Wars: The Force Awakens J.J. Abrams 1
1 Avengers: Endgame Anthony Russo 2
2 Avengers: Endgame Joe Russo 2
3 Spider-Man: No Way Home Jon Watts 3
4 Avatar James Cameron 4
5 Top Gun: Maverick Joseph Kosinski 5
6 Black Panther Ryan Coogler 6
7 Avengers: Infinity War Anthony Russo 7
8 Avengers: Infinity War Joe Russo 7
9 Titanic James Cameron 8

While this query may look a bit scary, there’s actually nothing new here!

  1. We select three columns:

    • The name of the movie (aliased as “movie”)

    • The name of the director (aliased as “director”)

    • The earnings rank of the movie-

  2. We join” three tables:

    • Starts with the Movie table

    • Joins the Director table where the movie IDs match

    • Joins the Person table where the director IDs match

  3. We filter the results:

    • Includes only movies with a ‘PG-13’ rating

    • Excludes movies with a null earnings rank-

  4. We orders the results:

    • Sorts by the earnings rank in ascending order (lowest rank first)-

LEFT JOIN#

A LEFT JOIN, also known as a LEFT OUTER JOIN, is a type of JOIN that returns all the rows from the left table (the first table mentioned in the query), and the matched rows from the right table (the second table mentioned). If there is no match for a row in the left table, the result will still include that row, but with NULL values in the columns from the right table.

Let’s consider an example using our movie database. Suppose we want to get a list of all movies and the Oscars they have won (if any):

%%sql
SELECT
  Movie.name AS "movie",
  Oscar.type AS "oscar_type",
  Oscar.year AS "oscar_year"
FROM
  Movie
  -- Left join allows us to inlude movies that have not won an Oscar
  LEFT JOIN Oscar ON Movie.id = Oscar.movie_id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
movie oscar_type oscar_year
0 Star Wars: The Force Awakens NaN NaN
1 Avengers: Endgame NaN NaN
2 Spider-Man: No Way Home NaN NaN
3 Avatar NaN NaN
4 Top Gun: Maverick NaN NaN
5 Black Panther NaN NaN
6 Avengers: Infinity War NaN NaN
7 Titanic BEST-DIRECTOR 1998.0
8 Titanic BEST-PICTURE 1998.0
9 Jurassic World NaN NaN

Here’s what the query does:

  • We select the name of the movie from the Movie table, and the type and year of the Oscar from the Oscar table.

  • We LEFT JOIN the Oscar table. For each movie, if there’s a corresponding entry in the Oscar table (i.e., if the movie has won an Oscar), the movie_id will match. If a movie hasn’t won an Oscar, there will be no match.

  • For movies that have won an Oscar, the type and year of the Oscar will be included in the result. For movies that haven’t won an Oscar, the result will still include the movie, but the Oscar type and year will be NULL.

You can contrast this with standard (INNER) joins.

%%sql
SELECT
  Movie.name AS "movie",
  Oscar.type AS "oscar_type",
  Oscar.year AS "oscar_year"
FROM
  Movie
  -- THis time, we ONLY include Movies that won on Oscar
  JOIN Oscar ON Movie.id = Oscar.movie_id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
movie oscar_type oscar_year
0 CODA BEST-PICTURE 2022
1 King Richard BEST-ACTOR 2022
2 The Eyes of Tammy Faye BEST-ACTRESS 2022
3 West Side Story BEST-SUPPORTING-ACTRESS 2022
4 CODA BEST-SUPPORTING-ACTOR 2022
5 The Power of the Dog BEST-DIRECTOR 2022
6 Nomadland BEST-PICTURE 2021
7 The Father BEST-ACTOR 2021
8 Nomadland BEST-ACTRESS 2021
9 Judas and the Black Messiah BEST-SUPPORTING-ACTOR 2021

CROSS JOIN: Both Fundamental and Dangerous#

A CROSS JOIN, also known as a Cartesian product, is the most basic type of join operation in relational databases. It combines every single row from the first table with every single row from the second table without checking any matching conditions.

If table A contains 10 rows and table B contains 5 rows, the resulting CROSS JOIN set will contain exactly 50 rows.

Why CROSS JOIN is fundamental#

  • The database engine uses the Cartesian product as the conceptual foundation for all other join types, running this full pairing before applying filter conditions.

  • We use a CROSS JOIN in specialized applications to generate combinations, such as pairing a table of colors with a table of sizes to create a complete catalog of product inventory options.

However, because a CROSS JOIN generates a multiplicative combination of rows, running it on large tables can easily produce millions of unwanted rows and freeze your database. In everyday queries, you should always use specific joins (like INNER JOIN) to keep your queries safe and efficient.

%%sql
-- This will generate every possible movie-actor combination
SELECT
    Movie.name AS movie_title,
    Person.name AS actor_name
FROM
    Movie
    CROSS JOIN Person
    JOIN Actor ON Person.id = Actor.actor_id
LIMIT 20;
Running query in 'sqlite:///imdb_movie.sqlite'
movie_title actor_name
0 Star Wars: The Force Awakens Leonardo DiCaprio
1 Star Wars: The Force Awakens Kate Winslet
2 Star Wars: The Force Awakens Billy Zane
3 Star Wars: The Force Awakens Kathy Bates
4 Star Wars: The Force Awakens Bill Paxton
5 Star Wars: The Force Awakens Mark Hamill
6 Star Wars: The Force Awakens Harrison Ford
7 Star Wars: The Force Awakens Carrie Fisher
8 Star Wars: The Force Awakens Peter Cushing
9 Star Wars: The Force Awakens Alec Guinness
10 Star Wars: The Force Awakens Mike Myers
11 Star Wars: The Force Awakens Eddie Murphy
12 Star Wars: The Force Awakens Cameron Diaz
13 Star Wars: The Force Awakens Julie Andrews
14 Star Wars: The Force Awakens Antonio Banderas
15 Star Wars: The Force Awakens Henry Thomas
16 Star Wars: The Force Awakens Dee Wallace-Stone
17 Star Wars: The Force Awakens Robert MacNaughton
18 Star Wars: The Force Awakens Drew Barrymore
19 Star Wars: The Force Awakens Peter Coyote

Most of these actors never acted in this movie!

Let’s see a comparison of the rows returned by left join vs cross join:

%%sql
-- Compare row counts
SELECT COUNT(*) AS "Total Combinations" FROM Movie CROSS JOIN Person
Running query in 'sqlite:///imdb_movie.sqlite'
Total Combinations
0 1990640
%%sql
SELECT COUNT(*) AS "Actual Relationships" FROM Movie JOIN Actor ON Movie.id = Actor.movie_id;
Running query in 'sqlite:///imdb_movie.sqlite'
Actual Relationships
0 3740

Legitimate Uses of CROSS JOIN#

Despite its potential pitfalls, CROSS JOIN has several valuable use cases:

For example, sometimes we really do need to generate ALL POSSIBLE combinations of values (for example, every possible movie with every possible rating).

%%sql
-- Generate all possible movie-rating combinations
SELECT
    m.name,
    r.rating
FROM
    Movie m
    CROSS JOIN (SELECT DISTINCT rating FROM Movie) r;
Running query in 'sqlite:///imdb_movie.sqlite'
name rating
0 Star Wars: The Force Awakens PG-13
1 Star Wars: The Force Awakens PG
2 Star Wars: The Force Awakens G
3 Star Wars: The Force Awakens R
4 Star Wars: The Force Awakens NC-17
... ... ...
5955 West Side Story R
5956 West Side Story NC-17
5957 West Side Story M
5958 West Side Story GP
5959 West Side Story NaN

5960 rows × 2 columns

Table: Types of SQL Joins#

Join Type

Description

Use Case

Inner Join

This join returns only the rows where there is a matching key value in both tables.

Use this standard join to find records that have corresponding entries in both tables.

Left Join

This join returns all rows from the left table, along with matching rows from the right table, filling unmatched right columns with null values.

Use this join when you want to preserve all records from your primary table even if they have no matches.

Right Join

This join returns all rows from the right table, along with matching rows from the left table, filling unmatched left columns with null values.

Use this join to preserve all records from the secondary table, which is rarely used since it can be rewritten as a Left Join.

Full Join

This join returns all records when there is a match in either the left or the right table, filling unmatched columns on either side with null values.

Use this join when you need a complete overview of all records across both tables, showing where they match and where they do not.

Set Operations#

In SQL, set operations allow you to combine the results of two or more SELECT statements. The main set operations are UNION, INTERSECT, and EXCEPT. These operations are based on the mathematical concept of sets.

Let’s explore each of these operations using our movie database.

UNION#

The UNION operation combines the result-set of two or more SELECT statements and removes duplicates. Each SELECT statement within UNION must have the same number of columns, and the columns must also have similar data types.

For example, let’s say we want to get a list of all people who have either directed a movie or won an Oscar:

%%sql
SELECT
  Person.name AS "director_or_oscar"
FROM
  Person
  JOIN Director ON Person.id = Director.director_id

UNION

SELECT
  Person.name
FROM
  Person
  JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
director_or_oscar
0 Adam McKay
1 Adam Shankman
2 Adrian Lyne
3 Adrian Molina
4 Adrien Brody
5 Al Pacino
6 Alan Arkin
7 Alan J. Pakula
8 Alan Taylor
9 Alec Guinness

This query:

  1. Selects the names of all directors by joining the Person and Director tables.

  2. Uses UNION to combine this result with another SELECT statement.

  3. The second SELECT statement selects the names of all Oscar winners by joining the Person and Oscar tables.

  4. The UNION operation combines these two result sets and removes any duplicates.

So, the final result will be a list of all people who have either directed a movie or won an Oscar, with no duplicates.

INTERSECT#

The INTERSECT operation returns only the rows that are common between the result-sets of two or more SELECT statements.

Let’s find all people who have both directed a movie and won an Oscar:

%%sql
SELECT
  Person.name AS "director_and_oscar"
FROM
  Person
  JOIN Director ON Person.id = Director.director_id

INTERSECT

SELECT
  Person.name
FROM
  Person
  JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
director_and_oscar
0 Alejandro Gonzalez Inarritu
1 Alfonso Cuaron
2 Ang Lee
3 Anthony Minghella
4 Barry Levinson
5 Bernardo Bertolucci
6 Billy Wilder
7 Bob Fosse
8 Bong Joon Ho
9 Carol Reed

This query is similar to the UNION example, but uses INTERSECT instead. The result will include only those people who are in both result sets - that is, people who have both directed a movie and won an Oscar.

EXCEPT#

The EXCEPT operation returns all the rows from the first SELECT statement that are not returned by the second SELECT statement.

For example, let’s find all directors who have not won an Oscar:

%%sql
SELECT
  Person.name AS "director_no_oscar"
FROM
  Person
  JOIN Director ON Person.id = Director.director_id

EXCEPT

SELECT
  Person.name
FROM
  Person
  JOIN Oscar ON Person.id = Oscar.person_id
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
director_no_oscar
0 Adam McKay
1 Adam Shankman
2 Adrian Lyne
3 Adrian Molina
4 Alan J. Pakula
5 Alan Taylor
6 Alex Proyas
7 Alexander Korda
8 Alfred E. Green
9 Alfred Hitchcock

Putting it Altogether#

We’ll close the section by looking at some slightly more involved examples.

Example: All Star Wars or Star Trek Actors Born in the US#

First, let’s get the names and places of birth of actors who have appeared in a Star Wars OR a Star Trek film who are American (born in the US).

%%sql
SELECT
  DISTINCT(Person.name) AS "actor",
  Person.pob AS "place_of_birth"
FROM
  Person
  JOIN Actor ON Person.id = Actor.actor_id
  JOIN Movie ON Actor.movie_id = Movie.id
WHERE
  (Movie.name LIKE '%Star Wars%'
  OR Movie.name LIKE '%Star Trek%')
  AND Person.pob LIKE '%USA%'
ORDER BY actor
LIMIT 10;
Running query in 'sqlite:///imdb_movie.sqlite'
actor place_of_birth
0 Adam Driver San Diego, California, USA
1 Alan Tudyk El Paso, Texas, USA
2 Alden Ehrenreich Los Angeles, California, USA
3 Billy Dee Williams New York, New York, USA
4 Carrie Fisher Beverly Hills, Los Angeles, California, USA
5 Chris Pine Los Angeles, California, USA
6 Donald Glover Edwards Air Force Base, California, USA
7 Harrison Ford Chicago, Illinois, USA
8 Jake Lloyd Fort Collins, Colorado, USA
9 Leonard Nimoy Boston, Massachusetts, USA

Here is how the database engine executes this multi-table join query:

  1. The SELECT DISTINCT statement retrieves only unique actor names and their places of birth, using aliases to rename the columns in the final output.

  2. The FROM clause specifies that the query starts at the Person table.

  3. The first JOIN statement combines the Person table with the Actor table by matching the person IDs.

  4. The second JOIN statement combines the resulting rows with the Movie table by matching the movie IDs.

  5. The WHERE clause filters the combined dataset to include only actors who starred in movies released after the year 2010.

  6. The ORDER BY clause sorts the final list of actors alphabetically by their name.

Example: Summary Statitics About Science Fiction Movies#

Now, let’s construct a query to get some summary statistics (min, max, average) of science fiction movies.

%%sql
SELECT
  MIN(Movie.year) AS "earliest_year",
  MIN(Movie.runtime) AS "min_runtime",
  MAX(Movie.runtime) AS "max_runtime",
  ROUND(AVG(Movie.runtime),2) AS "avg_runtime",
  COUNT(Movie.id) AS "num_movies",
  COUNT(Oscar.type) AS "num_oscars"
FROM
  Movie
  -- Include sci-movies that didn't win oscars
  LEFT JOIN Oscar ON Movie.id = Oscar.movie_id
WHERE
  -- 'S' stands for "science fiction"
  Movie.genre LIKE '%S%';
Running query in 'sqlite:///imdb_movie.sqlite'
earliest_year min_runtime max_runtime avg_runtime num_movies num_oscars
0 1968 81 242 125.72 86 3

Here is how the database engine executes this aggregate query:

  1. The SELECT statement calculates summary values using three aggregate functions: MIN finds the earliest release year, MIN finds the shortest runtime, and MAX finds the highest earnings rank.

  2. The FROM clause starts the query at the primary Movie table.

  3. The first JOIN statement links the Movie table to the Actor table by matching their movie IDs.

  4. The second JOIN statement links the resulting rows to the Person table by matching their actor IDs.

  5. The WHERE clause filters the rows so that only movies starring the actor “Tom Hanks” are included in the calculations.

Conclusion: Mastering SQL Joins and Set Operations#

Throughout this chapter, we have explored how SQL combines and manipulates data across multiple tables. Understanding how these operations work together is key to writing powerful queries.

The power of joins#

Joins allow us to rebuild relationships between tables that we separated during database design:

  • An INNER JOIN filters out unmatched records, returning only the rows where there is a matching key value in both tables.

  • A LEFT JOIN preserves every row from your primary table, filling unmatched secondary columns with null markers.

  • A CROSS JOIN generates every possible combination between two tables, which serves as the theoretical base for database queries.

The logic of set operations#

While joins combine columns from different tables side-by-side, set operations combine rows from different queries vertically:

  • The UNION operator merges the results of two queries into a single list, automatically removing any duplicate rows.

  • The INTERSECT operator returns only the common rows that appear in the results of both queries.

  • The EXCEPT operator subtracts the results of the second query from the first, returning only the unique rows from the first set.

By mastering joins and set operations, you can navigate complex database relationships to extract, combine, and analyze data efficiently.

Chapter Summary#

Use this list as a self-check. After working through the chapter, you should be able to do each of the following:

  • You can explain what a join is and how it combines data across tables.

  • You can choose among INNER, LEFT, RIGHT, and FULL OUTER joins.

  • You can combine result sets with UNION, INTERSECT, and EXCEPT.

  • You can write multi-table queries that mix joins and set operations.

  • You can read a database schema and sketch its entity-relationship diagram.

  • You can apply joins and sets to answer real questions about your data.

Practice with the Loop of the Recursive Dragon#

Sharpen what you just learned with a chapter-matched review set in the Loop of the Recursive Dragon — an adaptive review game with multiple question types and RPG-style mechanics, built for this book.

Launch the Chapter 3 review set →

Glossary#

Use this reference sheet to quickly review key database relationship terms.

ER modeling terms#

  • An entity is a real-world object, person, or concept that we store data about in a table.

  • An attribute is a specific column in a table that represents a characteristic of an entity.

  • A relationship describes how different database entities connect to one another.

  • The cardinality refers to a rule defining the minimum and maximum number of matching rows allowed between two tables.

  • A crow’s foot diagram is a popular visual layout used to represent database tables and their relationships.

  • A primary key is a unique column value used to identify each row in a table.

  • A foreign key is a column that references another table’s primary key to link them together.

  • A junction table is a helper table that stores foreign keys from two different tables to resolve a many-to-many relationship.

Join keywords#

  • A join is a SQL operation that combines rows from multiple tables side-by-side based on a related column.

  • An INNER JOIN returns only the rows where there is a matching key value in both tables.

  • A LEFT JOIN returns all rows from the primary table along with matching rows from the secondary table.

  • A CROSS JOIN returns the Cartesian product of two tables by pairing every row of the first table with every row of the second.

Set keywords#

  • A set operation is a SQL query that combines the row results of two separate queries vertically.

  • A UNION operation merges the results of two queries, automatically removing duplicate rows.

  • An INTERSECT operation returns only the common rows that appear in the results of both queries.

  • An EXCEPT operation subtracts the results of the second query from the first, returning only the unique rows from the first query.

Types of Relationships: Entity-Relationship Model#

Notation

Definition

Customer

Movie

Person

Person

o–o{ Actor

Actor }o–