Chapter 1: Database Foundations (Star Trek and Starfleet Command)

Chapter 1: Database Foundations (Star Trek and Starfleet Command)#

Databases Through Pop Culture: Brendan Shea, PhD#

This chapter introduces databases using Star Trek as an extended analogy. You will explore the distinctions between data, information, and knowledge; compare relational, document, and graph data models; walk through conceptual, logical, and physical modeling; and consider how DBMS choice, storage location, and analytical structures like data lakes, warehouses, and marts fit together.

Learning Outcomes#

By the end of this chapter, you will be able to:

Distinguish among data, information, and knowledge, and explain how databases help turn raw data into useful information.
Compare flat files and databases in terms of organization, consistency, and retrieval.
Differentiate conceptual, logical, and physical data models and identify entities and relationships in simple ERDs.
Compare relational, document, graph, and column-oriented data models and match them to appropriate use cases.
Evaluate factors involved in choosing a DBMS, including deployment choices such as local versus cloud systems.
Explain how databases relate to data lakes, data warehouses, and data marts.
Use a notebook environment to create a simple table, insert rows, and run basic retrieval, filtering, sorting, and counting queries with SQL.

Video Lecture#

You can click below to launch my lecture.

from IPython.display import YouTubeVideo
YouTubeVideo('ppa-ZFNQJH0', width=800, height=500)

Introduction to the Case Study: Starship SQL#

Welcome aboard the Starship Enterprise. Across this chapter, we will explore database concepts using examples from the adventures of the Enterprise and its crew.

Why a fictional spaceship? Because the same database ideas apply almost everywhere. Whether you are running a starship, a small business, or a global enterprise, the basics of storing, retrieving, and updating data look the same.

Setting the lessons aboard the Enterprise also makes abstract ideas easier to remember. We will see how databases help with concrete problems — organizing crew schedules, analyzing sensor data from a strange planet, tracking encounters with new species — and how the same patterns show up in everyday work.

What’s the Difference Between Data, Information, and Knowledge?#

To understand databases, we must first distinguish three closely related terms: data, information, and knowledge. Consider the Enterprise’s away team beaming down to an unexplored planet:

We define data as raw, unorganized facts and figures. For instance, the ship’s sensors might record isolated numbers like 295.2, 1.013, and 0.04, which carry no meaning on their own.
We define information as raw data that has been organized and given context so that it carries a clear meaning. By applying labels, the crew discovers that 295.2 is the temperature in Kelvin, 1.013 is the atmospheric pressure in bars, and 0.04 is the percentage of toxic chlorine gas in the air.
We define knowledge as the practical understanding and actionable insights we gain by analyzing that information. Using the threat report, Spock determines that the planet is safe for human exploration without protective environmental suits, but warns the crew to avoid drinking the chlorinated surface water.

Without a way to store and organize data, we cannot easily convert raw sensor readings into useful knowledge. This is where databases come in.

What is a Database? How Can They Help Turn Data Into Knowledge?#

A database is a structured collection of data organized for efficient storage, retrieval, and updating. Databases are built to handle large volumes of information reliably and securely.

By contrast, a flat file is a simple, linear data file such as a single text table or spreadsheet. Flat files are easy to create, but they struggle to handle large datasets and complex relationships because searching for a record requires reading through rows one by one.

To understand the difference, imagine Starfleet Academy’s library.

In a flat-file library, every book is printed on a single, giant scroll. Finding a specific science officer’s graduation record requires unrolling the entire scroll from the beginning.
In a relational database library, records are split into separate tables (like dedicated catalog cards) and linked by unique identifiers, allowing the computer to jump directly to the exact record you need.

A relational database also enforces rules called constraints to ensure that data remains consistent and accurate, preventing duplicate records or orphaned entries when files are updated.

Table: Database vs Flat Files#

Criterion	Databases	Flat Files
Data Structure	Structured with tables, rows, and columns, allowing for complex relationships and queries.	Simple, often plain text or CSV format, with a linear structure and no inherent relationships.
Data Integrity	Enforced through constraints (e.g., primary keys, foreign keys), ensuring consistency and accuracy.	No built-in mechanisms for enforcing data integrity, leading to potential inconsistencies.
Performance	Optimized for large-scale data handling with indexing, caching, and query optimization features.	Limited performance, especially with large datasets, as each read/write requires file access.
Concurrency	Supports multiple concurrent users with transaction management, locking, and isolation levels.	Generally limited to single-user access, or requires complex mechanisms to manage concurrency.
Scalability	Designed to handle vast amounts of data and can scale horizontally (across servers) or vertically.	Limited scalability, as performance degrades significantly with larger datasets.

Interactive activity: Open this standalone page to use the activity in the Jupyter Book or from Colab.

Open the Data, Information, and Knowledge quiz

What is “Data Modeling”?#

The first step in building any database is creating a data model. A data model is a standard blueprint that describes your data’s structure, relationships, and business rules.

Creating a data model requires analyzing how your organization operates:

We identify entities, which are the real-world objects, people, or events we need to store data about. For example, Starfleet might track officers, starships, and missions.
We identify attributes, which are the specific characteristics that describe each entity, such as an officer’s rank or a starship’s registry number.
We identify relationships, which describe how those entities connect to one another, such as recording which officer is assigned to serve on which starship.

By mapping out these connections, we build a data blueprint that ensures the physical database reflects real-world operations. We group data modeling into three progressive stages:

A conceptual data model is a high-level business overview that lists the main entities and how they connect, written in plain language.
A logical data model refines that overview into a concrete database structure (like tables and columns) without committing to a specific database software.
A physical data model is the actual implementation of that structure inside a specific database engine, specifying exact data types and physical storage settings.

The Entity-Relationship Model#

The Entity-Relationship (ER) Model is a visual way to describe data. It is used mainly in the conceptual and logical stages. ERDs are easy enough that technical and non-technical stakeholders can both validate them.

The key components are:

Entities — distinct objects or concepts (Mission, Starship, Crew Member).
Attributes — properties of an entity (mission start date, starship name, crew member rank).
Relationships — how entities connect (a Mission is assigned to a Starship).
Cardinality — how many of one entity relate to the other (one Starship can have many Missions, but each Mission belongs to one Starship).

ERDs fit into the modeling process in three stages:

Conceptual. A high-level ER diagram shows the main entities and relationships, without all the attributes. Stakeholders use it to confirm the overall shape of the data.
Logical. The diagram is refined with attributes, data types, and cardinality. This more detailed ER model bridges concept and implementation.
Physical. The diagram becomes the actual schema in a chosen database. Entities become tables, attributes become columns, and relationships become primary and foreign keys.

Using the ER Model throughout keeps the data structure well-defined and easy for everyone to understand.

Example: Conceptual ERD#

Example: Logical ERD#

Example: Physical ERD#

Logical Data Models: The Relational Model#

The relational model is the most widely used logical data model in modern applications. It organizes data into tables (often called relations), which consist of rows and columns.

Relational terminology#

Relational database theory uses formal terms to describe tables:

A relation is a structured table containing rows and columns.
An attribute is a named column in a table that represents a specific characteristic, such as a starship’s name.
A tuple is a single row in a table representing a unique record, such as a specific starship’s specifications.
A primary key is a unique identifier assigned to a column to ensure that no two rows in a table are identical, such as a starship’s unique registry number.
A foreign key is a column that references the primary key of another table to establish a clean relationship between them.

For example, Starfleet’s starships table might look like this:

registry (PK)	name	class	commissioning_date
NCC-1701	USS Enterprise	Constitution	2245-12-31
NCC-1864	USS Reliant	Miranda	2264-03-15

To track assignments, we can use a separate personnel table:

officer_id (PK)	name	rank	registry_assignment (FK)
1	James T. Kirk	Captain	NCC-1701
2	Spock	Commander	NCC-1701

The foreign key column links Spock and Kirk directly to the USS Enterprise record in the starships table. By separating data into dedicated tables and linking them with keys, the relational model keeps data clean, organized, and free of redundant duplicates.

Logical Data Models: JSON and Document Databases#

JSON (JavaScript Object Notation) is a lightweight, text-based format for representing data. It is widely used in NoSQL databases, especially document databases. JSON is well-suited to semi-structured and hierarchical data.

In JSON, data is stored as key-value pairs and arrays. Keys are strings, and values can be strings, numbers, booleans, objects, or arrays. Because objects and arrays can nest, JSON supports complex hierarchical structures inside a single document.

// Example of a JSON file
// Form is key : value
{
  "crew_id": "001",
  "name": "James Kirk",
  "rank": "Captain",
  "ship": "Enterprise",
  
  // Missions is an example of a "nested" data structure
  "missions": [
    {
      "mission_id": "M001",
      "planet": "Vulcan",
      "objective": "Diplomatic meeting",
      "start_date": "2258-01-15",
      "end_date": "2258-01-18"
    },
    {
      "mission_id": "M002",
      "planet": "Andoria",
      "objective": "Scientific research",
      "start_date": "2258-02-03",
      "end_date": "2258-02-07"
    }
  ],
  "skills": ["Leadership", "Tactics", "Diplomacy"],
  "performance_reviews": [
    {
      "date": "2258-12-31",
      "reviewer": "Admiral Pike",
      "rating": 9,
      "comments": "Exceptional leadership and decision-making skills."
    }
  ]
}

Document databases store data as semi-structured documents, typically using formats like JSON or BSON. Unlike relational tables, different documents in the same collection can have entirely different fields, which provides excellent schema flexibility.

Key features of document databases#

The database uses a schema-less design where documents can add or change fields instantly without requiring expensive table migrations.
The system supports nested structures to allow complex, hierarchical data (like a user’s address list) to be stored inside a single document.
The database is designed to scale horizontally by distributing documents across multiple servers, making it ideal for big-data applications.

Logical Model: Graphs for Networks of Connected Information#

Relational databases organize data into tables, while document databases nest related details inside single documents. Graph databases take a different approach: they store data as a network of connected entities. They shine when the relationships between things matter just as much as the things themselves.

Understanding graph models#

A graph database represents data as a visual network:

We represent individual entities as nodes (or vertices), such as creating a node for Kirk, Spock, and the USS Enterprise.
We connect these nodes using edges (or relationships) to describe how they relate, such as drawing a “reports_to” edge pointing from Spock to Kirk.
We add extra details to nodes and edges by storing them as properties, such as adding rank = "Captain" to Kirk’s node, or started_date = "2265" to Spock’s reporting edge.

Real-world applications#

Graphs are excellent for analyzing complex, highly connected networks. Consider the diplomatic relations of the United Federation of Planets:

Each member planet is a node, connected by trade agreements, military alliances, and cultural exchanges.
Graph databases allow us to run relationship-heavy queries easily, such as finding the shortest diplomatic chain between two empires or identifying trade partners.
We write graph queries using specialized languages like Cypher in Neo4j:

MATCH (officer:Officer)-[:SERVES_ON]->(ship:Starship)
WHERE ship.name = "USS Enterprise"
RETURN officer.name, officer.rank

Alternatively, we can represent graph data as RDF triples, which describe relationships in a simple subject-predicate-object format:

crew:kirk starfleet:servesOn ship:enterprise .
ship:enterprise starfleet:name "USS Enterprise" .

By focusing on paths and connections, graph databases handle relationship analysis far better than traditional tables.

Other database types you should know#

“Captain, four more data models deserve mention before we choose hardware,” Spock notes. The relational, document, graph, and column-oriented models we have already met cover most production workloads, but specialty stores have grown up around specific problems.

Key-value databases (Redis, Amazon DynamoDB) treat the entire store as one giant lookup table. The Enterprise’s mission-status board, which has to read and write a small handful of values thousands of times a second, fits this model perfectly.
Vector databases (pgvector, Pinecone) store high-dimensional numeric vectors and find the ones most similar to a query vector. Spock’s similarity searches across alien language samples — “find the five recordings most like this one” — are a natural use case.
Time-series databases (InfluxDB, TimescaleDB) are tuned for measurements that arrive in order, timestamped. The warp core’s continuous telemetry — temperature, pressure, plasma flow — is a classic time-series workload.
Object-oriented databases (db4o is the historical example) store entire programming objects, with inheritance and references preserved. The holodeck’s serialized character library is the kind of nested object graph these were built for.

Most modern relational systems can simulate these patterns — Postgres supports key-value columns, JSON, and vector extensions — but specialized stores trade flexibility for speed in their niche. Knowing the categories exist is enough at this stage; you will not see them again until much later in the book.

Logical Models: Columns for Analysis#

Relational databases handle structured tables, while document databases store flexible JSON records. A third major approach, the column-oriented model, is designed specifically for high-speed analysis of massive datasets.

The core idea: thinking in columns#

Most relational databases use row-oriented storage, meaning that all columns of a single row are stored together on disk. This is excellent for operational tasks, such as looking up an officer’s profile by ID.

However, analytical queries often need to calculate averages or sums across billions of rows but only look at a single column (such as calculating the average fuel consumption of an entire starfleet). In a row-oriented database, the computer must read every single column of every row, wasting time and resources.

A column-oriented database stores each column’s data together on disk:

The database only reads the specific columns requested by the query, which speeds up analysis of massive datasets.
The database compresses data incredibly well because all values in a single column share the same data type.

By focusing on columns, these databases allow analysts to run high-speed reports across billions of records.

Interactive activity: Open this standalone page to use the activity in the Jupyter Book or from Colab.

Open the Starship data model comparison

Structured, semistructured, and unstructured data#

Before choosing storage, it helps to know what shape your data is in. Three broad categories cover almost everything:

Structured data has a fixed schema: every record has the same fields in the same order. The Enterprise’s crew roster — name, rank, department, hire date — is structured. Structured data lives naturally in a relational database.
Semistructured data has tags or keys but no strict schema. A mission log might have a timestamp and a summary on every entry, but only some entries include a casualty_count or a sensor_attachment. Semistructured data fits document databases like MongoDB or JSON columns in Postgres.
Unstructured data has no field structure at all — long text, audio, video, or images. The Enterprise’s raw sensor video and free-text away-team reports are unstructured. Unstructured data typically lives in object storage with metadata in a database alongside.

Shape	Example	Typical storage	Typical access pattern
Structured	Crew roster CSV	Relational tables	SQL queries with `WHERE`/`JOIN`
Semistructured	Mission log JSON	Document database or JSON column	Path expressions (`$.summary`)
Unstructured	Sensor video	Object storage + metadata table	Full file fetch + metadata lookup

The choice of storage is mostly the choice of shape. As your projects grow, you will often combine all three.

Physical Models: Choosing a DBMS#

To store and manage data, you need a Database Management System (DBMS), which is the software program that lets you create, organize, and interact with a database.

DBMS options vary based on the data models they support:

A Relational DBMS (RDBMS) uses tables and SQL, which is the standard choice for financial records and structured operational data.
A Document DBMS stores data in JSON documents, providing flexibility for web applications.
A Graph DBMS maps out connections between nodes, making it ideal for network analysis and recommendation engines.
A Columnar DBMS stores data in columns rather than rows, which is optimized for big-data analytics and warehousing.

Many modern database systems use a multi-model approach, allowing developers to query both relational tables and JSON documents inside the same database engine.

Introduction to Databases: Starship Data Exploration Lab#

Now that we have explored relational, document, and graph models conceptually, let’s get our hands dirty by exploring these models inside a live notebook!

In this lab, we will load a pre-built Starfleet database themed around the USS Enterprise. Instead of writing complex code to build tables or insert records (which you will learn to do in later chapters), you will use simple SELECT statements to query and explore the data. We will also contrast how the same information is represented in relational, document, and graph formats.

Installing kuzu into active kernel...

Successfully connected to Starfleet Database! (starfleet.db and starfleet_graph_db)

1. Querying with SQL (Relational Model)#

In a relational database, data is organized into tables of rows and columns. We use SQL (Structured Query Language) to retrieve and filter this data.

To run a SQL query in a Jupyter Notebook, we use the %%sql prefix at the top of a code cell. Let’s start by retrieving all records from the crew table. Run the cell below:

%%sql
SELECT * FROM crew;

 * sqlite:///starfleet.db
Done.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 get_ipython().run_cell_magic('sql', '', 'SELECT * FROM crew;\n')

File /opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sql/magic.py:219, in SqlMagic.execute(self, line, cell, local_ns)
    216     return
    218 try:
--> 219     result = sql.run.run(conn, parsed["sql"], self, user_ns)
    221     if (
    222         result is not None
    223         and not isinstance(result, str)
   (...)    226         # Instead of returning values, set variables directly in the
    227         # user's namespace. Variable names given by column names
    229         if self.autopandas:

File /opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sql/run.py:374, in run(conn, sql, config, user_namespace)
    372     if result and config.feedback:
    373         print(interpret_rowcount(result.rowcount))
--> 374 resultset = ResultSet(result, config)
    375 if config.autopandas:
    376     return resultset.DataFrame()

File /opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sql/run.py:116, in ResultSet.__init__(self, sqlaproxy, config)
    114         list.__init__(self, sqlaproxy.fetchall())
    115     self.field_names = unduplicate_field_names(self.keys)
--> 116     self.pretty = PrettyTable(self.field_names, style=prettytable.__dict__[config.style.upper()])
    117 else:
    118     list.__init__(self, [])

KeyError: 'DEFAULT'

Filtering Your Data#

Relational databases excel at filtering large tables. Suppose we only want to see officers assigned to the starship with registry NCC-1701 (the USS Enterprise). We can use a WHERE clause to filter the rows:

%%sql
SELECT name, rank, role 
FROM crew 
WHERE registry_assignment = 'NCC-1701';

Running query in 'sqlite:///starfleet.db'

	name	rank	role
0	James T. Kirk	Captain	Commanding Officer
1	Spock	Commander	First Officer / Science Officer
2	Leonard McCoy	Lieutenant Commander	Chief Medical Officer
3	Nyota Uhura	Lieutenant	Communications Officer

Counting Records#

We can also perform quick calculations, such as counting how many officers are currently assigned to the fleet. Run the cell below to count all crew members in our table:

%%sql
SELECT COUNT(*) AS total_crew 
FROM crew;

Running query in 'sqlite:///starfleet.db'

	total_crew
0	5

2. Comparing with the Document Model (JSON in SQLite)#

In a document database (like MongoDB), data is stored as nested, flexible documents (often in JSON format) rather than split across rigid tables.

Many modern relational databases—including SQLite—actually support a hybrid/multi-model approach. We can store a complete JSON document inside a single text column and query its nested keys directly using built-in JSON functions like json_extract().

Let’s look at how we can query Spock’s profile directly from a JSON column. Run the cell below:

%%sql
SELECT officer_id,
       json_extract(profile_json, '$.name') AS officer_name,
       json_extract(profile_json, '$.home_planet') AS home_planet,
       json_extract(profile_json, '$.skills') AS all_skills
FROM officer_profiles;

Running query in 'sqlite:///starfleet.db'

	officer_id	officer_name	home_planet	all_skills
0	2	Spock	Vulcan	["Logic","Vulcan Harp","Starship Operations"]

Key Takeaways: Relational vs. Document#

Relational Tables: Crew member facts are split into columns and linked via foreign keys (e.g., registry_assignment referencing the starships table). This enforces strict consistency and avoids duplicates, but requires joining tables.
JSON Documents: Everything about a crew member is stored inside a single nested document. This is highly flexible—different officers can have completely different attributes (like "home_planet": "Vulcan" for Spock, while other crew records might omit this field entirely) without altering the table structure.

3. Comparing with the Graph Model (Cypher in the Notebook)#

If we were using a graph database (like Neo4j or Kùzu), the system would focus heavily on the connections (relationships) between Spock, the crew, and the ship.

Instead of tables or nested documents, the data is mapped as a visual network:

Nodes: Represent individual entities (like Kirk, Spock, or the USS Enterprise).
Edges: Represent the relationships between nodes (like Spock REPORTS_TO Kirk, or Spock SERVES_ON the Enterprise).

Executing Graph Queries Locally: `%%cypher`#

Our local Starfleet graph database is running directly inside this notebook session using Kùzu—a serverless, in-process graph database management system.

Just like we used the %%sql cell magic to query our relational tables, we have loaded a custom %%cypher cell magic that allows us to run Cypher queries against our graph database.

Let’s query our graph to find all officers who report to Captain Kirk. Run the cell below:

%%cypher
MATCH (officer:Officer)-[:REPORTS_TO]->(captain:Officer)
WHERE captain.name = 'James T. Kirk'
RETURN officer.name AS OfficerName, officer.rank AS Rank

	OfficerName	Rank
0	Spock	Commander
1	Leonard McCoy	Lieutenant Commander

Choosing Your DBMS: Leading Systems by Model#

To store and manage data, we use a Database Management System (DBMS). Instead of a random list of names, it is helpful to categorize leading database systems by the logical data model they are optimized for.

Use this guide to understand your choices:

Data Model	Leading DBMS	Best Use Case	Beginner Friendliness & Trade-offs
Relational (SQL)	SQLite	Local development, mobile apps, embedded software.	Perfect for Beginners: Zero setup, file-based, lightweight. Trade-off: Not designed for high-concurrency multi-user websites.
	PostgreSQL	Enterprise apps, high-integrity financial records, robust web backends.	Highly Recommended for Production: Extremely reliable, open-source, supports advanced types (like JSON and Vector).
	MySQL / MariaDB	Standard web application backends (e.g., WordPress).	Beginner-Friendly: Simple to use, massive community support.
	Microsoft SQL Server / Oracle / IBM Db2	Large-scale corporate or legacy enterprise environments.	Advanced: High licensing costs, optimized for massive Microsoft or corporate ecosystems.
Document (JSON / NoSQL)	MongoDB	Web applications requiring rapid development and flexible, changing schemas.	Great for Beginners: Natural translation between JSON and programming objects. Trade-off: Lacks strict relational constraints out of the box.
	Amazon DynamoDB / Redis	High-performance caching, real-time key-value lookups at cloud scale.	Intermediate: Extremely fast key-value lookups, fully managed in the cloud.
Graph (Network / NoSQL)	Neo4j	Social networks, recommendation engines, fraud detection, and network analysis.	Visual & Elegant: Intuitive nodes-and-edges querying with Cypher. Trade-off: Niche focus, not suited for general transactional records.
Columnar (Analytical / NoSQL)	Apache Cassandra	High-speed analytical writes and big-data storage across distributed servers.	Advanced: Harder to set up, built for massive horizontal scaling across cloud nodes.

Note: Many modern databases are multi-model. For instance, PostgreSQL has excellent support for JSON document querying and vector search, allowing you to combine paradigms inside one robust engine!

Advanced Data Storage Concepts: Scaling Beyond a Single Database#

As organizations grow, they need to store and analyze data at a scale that a single operational database cannot handle. Depending on the speed, clean-up level, and analytical needs of the data, organizations utilize four primary storage architectures: Operational Databases, Data Lakes, Data Warehouses, and Data Marts.

To understand how these systems work together, let’s explore their technical definitions, real-world examples, and a “Federation Logistics” metaphor from the Starship Enterprise:

1. Operational Database (OLTP)#

To keep a starship or business running day-to-day, you need a system that can handle immediate, real-time actions. This is the domain of operational databases.

An operational database is optimized for Online Transactional Processing (OLTP), which means it handles high-speed, real-time read and write operations to record everyday transactions and keep them accurate.
In the real world, this is the technology behind ATM cash withdrawals or e-commerce shopping carts updating your balance or inventory instantly.
In our Starfleet analogy, this is the Enterprise bridge console recording active warp core telemetry, shields status, and real-time crew coordinates to keep the ship safe.

2. Data Lake#

Not all data is structured or cleaned up immediately, especially when it is arriving in massive, continuous streams. Organizations need a cheap place to store this raw material until they are ready to use it.

A data lake is a massive storage repository that ingests and retains structured, semi-structured, and unstructured data in its native, uncleaned format without a predefined schema.
In the real world, companies use data lakes to capture raw factory sensor logs or social media video feeds cheaply, postponing the sorting and cleaning process until analysts actually need the data.
In our Starfleet analogy, this is like the unfiltered cargo bay where the crew dumps raw sensor tapes, audio logs, and video files from a new planet to be analyzed later by HQ.

3. Data Warehouse (OLAP)#

When analysts want to spot long-term historical trends or compile big reports, looking at individual operational databases is too slow and fragmented. They need a consolidated, cleaned-up view of the entire organization.

A data warehouse is a centralized repository optimized for Online Analytical Processing (OLAP), designed to pull historical data from multiple sources, clean it, and structure it for complex reporting.
In the real world, a retail corporation might consolidate 10 years of cash register receipts, website traffic, and shipping logs into a data warehouse to analyze long-term holiday shopping trends.
In our Starfleet analogy, this represents the Starfleet HQ Historical Archives, where analysts take the raw cargo bay tapes from all ships in the fleet, filter out sensor noise, correct spelling errors, and index them so historians can check 50-year fleet trends.

4. Data Mart#

A central data warehouse is massive, and searching through it can be incredibly slow if you only care about a tiny fraction of the business. Organizations create targeted slices of the warehouse to keep specific teams fast and secure.

A data mart is a smaller, curated subset of a data warehouse focused on a single department or line of business to speed up queries.
In the real world, the finance department will query a dedicated finance data mart optimized for budgeting, while the marketing team uses a separate marketing data mart to track advertising campaigns.
In our Starfleet analogy, this is like a warp drive engineering library—a specialized database slice containing only engine metrics and historical plasma logs, allowing Geordi La Forge to run diagnostics without sorting through medical or diplomatic files.

The Data Pipeline: What is ETL?#

To move data from raw databases or unstructured lakes into a polished warehouse, engineers build a data pipeline.

The term ETL stands for Extract, Transform, Load, which is the three-step pipeline used to consolidate and clean business data.
During the extract phase, the pipeline pulls raw records out of individual operational databases or files.
During the transform phase, the pipeline cleans the data by removing duplicate records, correcting formatting errors (like converting all dates to YYYY-MM-DD), and structuring it.
During the load phase, the pipeline writes the polished, structured data into the destination warehouse or data mart.

Interactive activity: Open this standalone page to use the activity in the Jupyter Book or from Colab.

Open the data storage systems quiz

Chapter Summary#

Use this list as a self-check. After working through the chapter, you should be able to do each of the following:

You can distinguish data, information, and knowledge, and explain how databases turn data into knowledge.
You can name the advantages databases offer over flat files for organization, retrieval, and consistency.
You can describe the conceptual, logical, and physical levels of data modeling.
You can compare relational, document (JSON), and graph data models and pick one for a given scenario.
You can weigh DBMS choices and the trade-offs between cloud and local storage.
You can explain how databases, data lakes, data warehouses, and data marts relate to one another.
You can recognize key-value, vector, time-series, and object-oriented databases and name a use case for each.
You can sort data into structured, semistructured, and unstructured categories and explain why the category matters.

Practice with the Loop of the Recursive Dragon#

Sharpen what you just learned with a chapter-matched review set in the Loop of the Recursive Dragon — an adaptive review game with multiple question types and RPG-style mechanics, built for this book.

Launch the Chapter 1 review set →

Glossary#

Use this reference sheet to quickly review key terms from this chapter.

Data is raw, unorganized facts and figures that carry no meaning on their own.
Information is data that has been organized, structured, and labeled so that it carries a clear meaning.
Knowledge is the practical understanding and actionable insights gained by analyzing information.
A database is a structured collection of data organized for efficient storage, retrieval, and updating.
A flat file is a simple, linear data file where searching requires scanning rows one by one.
A relational database is a database that organizes data into tables of rows and columns linked by unique keys.
A data model is a standard blueprint that describes a database’s structure, relationships, and rules.
An entity is a real-world object, person, or event that we store data about in a database.
An attribute is a specific column in a table that represents a characteristic of an entity.
A relationship describes how different database entities connect to one another.
A conceptual data model is a high-level business overview of entities and relationships in plain language.
A logical data model is a concrete database layout (like tables and columns) independent of specific software.
A physical data model is the actual implementation of a database layout inside a specific database engine.
A relation is a formal database term for a structured table of rows and columns.
A tuple is a formal database term for a single row in a table.
A primary key is a unique column value used to identify each row in a table.
A foreign key is a column that references another table’s primary key to link them together.
A document database is a NoSQL database that stores data in flexible, semi-structured JSON documents.
A graph database is a NoSQL database that stores data as a network of nodes, edges, and properties.
A node is a point in a graph database representing a specific entity.
An edge is a link in a graph database representing a relationship between two nodes.
A property is a key-value detail stored on a node or an edge in a graph database.
A column-oriented database is an analytical database that stores each column’s data together on disk.
A Database Management System (DBMS) is the software program used to manage and interact with a database.
A static IP is a permanent, unchanging network address manually assigned to a database server.
A dynamic IP is a temporary network address assigned automatically that can change over time.
SQL is the standard Structured Query Language used to query and modify relational databases.
A data lake is a massive repository that stores raw, unorganized data in its original format.
A data warehouse is a central analytical repository that stores integrated historical data from multiple sources.
A data mart is a smaller, targeted slice of a data warehouse focused on a single business department.

Chapter 1: Database Foundations (Star Trek and Starfleet Command)

Contents

Chapter 1: Database Foundations (Star Trek and Starfleet Command)#

Databases Through Pop Culture: Brendan Shea, PhD#

Learning Outcomes#

Video Lecture#

Introduction to the Case Study: Starship SQL#

What’s the Difference Between Data, Information, and Knowledge?#

What is a Database? How Can They Help Turn Data Into Knowledge?#

Table: Database vs Flat Files#

What is “Data Modeling”?#

The Entity-Relationship Model#

Example: Conceptual ERD#

Example: Logical ERD#

Example: Physical ERD#

Logical Data Models: The Relational Model#

Relational terminology#

Logical Data Models: JSON and Document Databases#

Key features of document databases#

Logical Model: Graphs for Networks of Connected Information#

Understanding graph models#

Real-world applications#

Other database types you should know#

Logical Models: Columns for Analysis#

The core idea: thinking in columns#

Structured, semistructured, and unstructured data#

Physical Models: Choosing a DBMS#

Introduction to Databases: Starship Data Exploration Lab#

1. Querying with SQL (Relational Model)#

Filtering Your Data#

Counting Records#

2. Comparing with the Document Model (JSON in SQLite)#

Key Takeaways: Relational vs. Document#

3. Comparing with the Graph Model (Cypher in the Notebook)#

Executing Graph Queries Locally: %%cypher#

Choosing Your DBMS: Leading Systems by Model#

Advanced Data Storage Concepts: Scaling Beyond a Single Database#

1. Operational Database (OLTP)#

2. Data Lake#

3. Data Warehouse (OLAP)#

4. Data Mart#

The Data Pipeline: What is ETL?#

Chapter Summary#

Practice with the Loop of the Recursive Dragon#

Glossary#

Executing Graph Queries Locally: `%%cypher`#