13.9. Database Normalization#
13.9.1. What is database normalization?#
Normalization is the process of structuring a relational database to reduce redundancy and improve data integrity.
Goals:
Eliminate redundant data
Reduce anomalies during insert, update, and delete
Ensure attributes depend on the key, the whole key, and nothing but the key
How is normalization achieved?
By applying a series of normal forms. Each normal form imposes stricter rules; in practice, achieving 1NF, 2NF, and 3NF is usually sufficient for OLTP systems.
References:
13.9.2. Normal Forms (concise)#
First Normal Form (1NF)
No repeating groups or arrays
All values are atomic (single, indivisible values)
Second Normal Form (2NF)
The table is in 1NF
No partial dependency: each non-key attribute depends on the entire primary key (relevant when the key is composite)
Third Normal Form (3NF)
The table is in 2NF
No transitive dependencies: no non-key attribute depends on another non-key attribute
13.9.3. Summary#
All values must be atomic (1NF)
No partial dependency on part of a composite key (2NF)
No transitive dependency among non-key attributes (3NF)
13.9.4. Bad Design Examples#
13.9.4.1. Example 1#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example1.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
| EmployeeID | Name | Project | Time | |
|---|---|---|---|---|
| 0 | EN1-26 | Sean O'Brien | 30-452-T3, 30-457-T3, 32-244-T3 | 0.25, 0.40, 0.30 |
| 1 | EN1-33 | Amy Guya | 30-452-T3, 30-382-TC, 32-244-T3 | 0.05, 0.35, 0.60 |
| 2 | EN1-35 | Steven Baranco | 30-452-T3, 31-238-TC | 0.15, 0.80 |
| 3 | EN1-36 | Elizabeth Roslyn | 35-152-TC | 0.90 |
| 4 | EN1-38 | Carol Schaaf | 36-272-TC | 0.75 |
| 5 | EN1-40 | Alexandra Wing | 31-238-TC, 31-241-TC | 0.20, 0.70 |
Problems with Example 1
Repeating groups of fields (e.g.,
project1,project2, …)Project and time fields are not atomic
Can’t sort by last name (e.g., if stored together as full name without split)
Can’t sort by time because the field is stored as text; store durations as numeric (e.g., minutes) or a standard format
Implicit pairing between project and time (not explicitly modeled)
13.9.4.2. Example 2#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example2.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
| EmployeeID | Last_Name | First_Name | Project1 | Time1 | Project2 | Time2 | Project3 | Time3 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | EN1-26 | O'Brien | Sean | 30-452-T3 | 0.250000 | 30-457-T3 | 0.400000 | 32-244-T3 | 0.300000 |
| 1 | EN1-33 | Guya | Amy | 30-452-T3 | 0.050000 | 30-382-TC | 0.350000 | 32-244-T3 | 0.600000 |
| 2 | EN1-35 | Baranco | Steven | 30-452-T3 | 0.150000 | 31-238-TC | 0.800000 | None | nan |
| 3 | EN1-36 | Roslyn | Elizabeth | 35-152-TC | 0.900000 | None | nan | None | nan |
| 4 | EN1-38 | Schaaf | Carol | 36-272-TC | 0.750000 | None | nan | None | nan |
| 5 | EN1-40 | Wing | Alexandra | 31-238-TC | 0.200000 | 31-241-TC | 0.700000 | None | nan |
Analysis of Example 2
Sorting is possible if data types are corrected
Not scalable: how do you add another project without altering the schema?
13.9.4.3. Example 3 (Move to 1NF)#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example3.db')
sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))
sql_statement = "SELECT * FROM PROJECTS_EMPLOYEES_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
| EmployeeID | Last_Name | First_Name | |
|---|---|---|---|
| 0 | EN1-26 | O'Brien | Sean |
| 1 | EN1-33 | Guya | Amy |
| 2 | EN1-35 | Baranco | Steven |
| 3 | EN1-36 | Roslyn | Elizabeth |
| 4 | EN1-38 | Schaaf | Carol |
| 5 | EN1-40 | Wing | Alexandra |
| ProjectNum | EmployeeID | Time | |
|---|---|---|---|
| 0 | 30-328-TC | EN1-33 | 0.350000 |
| 1 | 30-452-T3 | EN1-26 | 0.250000 |
| 2 | 30-452-T3 | EN1-33 | 0.050000 |
| 3 | 30-452-T3 | EN1-35 | 0.150000 |
| 4 | 31-238-TC | EN1-35 | 0.800000 |
| 5 | 30-457-T3 | EN1-26 | 0.400000 |
| 6 | 31-238-TC | EN1-40 | 0.200000 |
| 7 | 31-241-TC | EN1-40 | 0.700000 |
| 8 | 32-244-T3 | EN1-33 | 0.600000 |
| 9 | 35-152-TC | EN1-36 | 0.900000 |
| 10 | 36-272-TC | EN1-38 | 0.750000 |
Analysis of Example 3 — First Normal Form
One row per (employee, project, time) → atomic values
Can group by
employee_idorproject_idCan sort by time and by name
13.9.4.4. Example 4 (Pre-2NF anti-pattern)#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example4.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
| EmployeeID | Last_Name | First_Name | ProjectNumber | ProjectTitle | |
|---|---|---|---|---|---|
| 0 | EN1-26 | O'Brien | Sean | 30-452-T3 | STAR manual |
| 1 | EN1-26 | O'Brien | Sean | 30-457-T3 | ISO procedures |
| 2 | EN1-26 | O'Brien | Sean | 31-124-T3 | Employee handbook |
| 3 | EN1-33 | Guya | Amy | 30-452-T3 | STAR manual |
| 4 | EN1-33 | Guya | Amy | 30-482-TC | Web Site |
| 5 | EN1-33 | Guya | Amy | 31-241-TC | New catalog |
| 6 | EN1-35 | Baranco | Steven | 30-452-T3 | STAR manual |
| 7 | EN1-35 | Baranco | Steven | 31-238-TC | STAR prototype |
| 8 | EN1-36 | Roslyn | Elizabeth | 35-152-TC | STAR pricing |
| 9 | EN1-38 | Schaaf | Carol | 36-272-TC | Order system |
| 10 | EN1-40 | Wing | Alexandra | 31-238-TC | STAR prototype |
| 11 | EN1-40 | Wing | Alexandra | 31-241-TC | New catalog |
Analysis of Example 4
How would you update a project title? You’d have to edit many rows (update anomaly)
Can you add a project without an
employee_id? Not cleanlyHow can you delete a project without losing its details? Risk of delete anomaly
13.9.4.5. Example 5 (2NF achieved)#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example5.db')
sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
| EmployeeID | Last_Name | First_Name | |
|---|---|---|---|
| 0 | EN1-26 | O'Brien | Sean |
| 1 | EN1-33 | Guya | Amy |
| 2 | EN1-35 | Baranco | Steven |
| 3 | EN1-36 | Roslyn | Elizabeth |
| 4 | EN1-38 | Schaaf | Carol |
| 5 | EN1-40 | Wing | Alexandra |
| EmployeeID | ProjectNum | |
|---|---|---|
| 0 | EN1-26 | 30-452-T3 |
| 1 | EN1-26 | 30-457-T3 |
| 2 | EN1-26 | 31-124-T3 |
| 3 | EN1-33 | 30-328-TC |
| 4 | EN1-33 | 30-452-T3 |
| 5 | EN1-33 | 32-244-T3 |
| 6 | EN1-35 | 30-452-T3 |
| 7 | EN1-35 | 31-238-TC |
| 8 | EN1-36 | 35-152-TC |
| 9 | EN1-38 | 36-272-TC |
| 10 | EN1-40 | 31-238-TC2 |
| 11 | EN1-40 | 31-241-TC |
| ProjectNum | ProjectTitle | |
|---|---|---|
| 0 | 30-452-T3 | STAR manual |
| 1 | 30-457-T3 | ISO procedures |
| 2 | 30-482-TC | Web site |
| 3 | 31-124-T3 | Employee handbook |
| 4 | 31-238-TC | STAR prototype |
| 5 | 31-238-TC2 | New catalog |
| 6 | 35-152-TC | STAR pricing |
| 7 | 36-272-TC | Order system |
Analysis of Example 5
Project attributes live in
PROJECTS(no duplication in the junction table)Junction table holds only FKs and relationship-specific attributes
No partial dependencies
13.9.4.6. Example 6 (3NF problem: transitive dependency)#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example6.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
| ProjectNum | ProjectTitle | ProjectMgr | Phone | |
|---|---|---|---|---|
| 0 | 30-452-T3 | STAR manual | Garrison | 2756 |
| 1 | 30-457-T3 | ISO procedures | Jacanda | 2954 |
| 2 | 30-482-TC | Web site | Friedman | 2846 |
| 3 | 31-124-T3 | Employee handbook | Jones | 3102 |
| 4 | 31-238-TC | STAR prototype | Garrison | 2756 |
| 5 | 31-241-TC | New catalog | Jones | 3102 |
| 6 | 35-152-TC | STAR pricing | Vance | 3022 |
| 7 | 36-272-TC | Order system | Jacanda | 2954 |
Analysis of Example 6
Example:
PROJECTS(project_id PK, manager_name, manager_phone, ...)manager_phonedepends onmanager_name(non-key), not directly onproject_idThis is a transitive dependency:
project_id → manager_name → manager_phone
13.9.4.7. Example 7 (3NF achieved: remove transitive dependency)#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM MANAGERS"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
| ProjectMgr | Phone | |
|---|---|---|
| 0 | Friedman | 2846 |
| 1 | Garrison | 2756 |
| 2 | Jacanda | 2954 |
| 3 | Jones | 3102 |
| 4 | Vance | 3022 |
| ProjectNum | ProjectTitle | ProjectMgr | |
|---|---|---|---|
| 0 | 30-452-T3 | STAR manual | Garrison |
| 1 | 30-457-T3 | ISO procedures | Jacanda |
| 2 | 30-482-TC | Web site | Friedman |
| 3 | 31-124-T3 | Employee handbook | Jones |
| 4 | 31-238-TC | STAR prototype | Garrison |
| 5 | 31-241-TC | New catalog | Jones |
| 6 | 35-152-TC | STAR pricing | Vance |
| 7 | 36-272-TC | Order system | Jacanda |
Analysis of Example 7
MANAGERS(manager_id PK, manager_name, manager_phone, ...)PROJECTSreferencesmanager_id(FK); manager details live inMANAGERSRemoves the transitive dependency and update anomalies