13.9. Database Normalization#

13.9.1. What is database normalization?#

Normalization is the process of structuring a relational database to reduce redundancy and improve data integrity.

  • Goals:

    • Eliminate redundant data

    • Reduce anomalies during insert, update, and delete

    • Ensure attributes depend on the key, the whole key, and nothing but the key

  • How is normalization achieved?

    • By applying a series of normal forms. Each normal form imposes stricter rules; in practice, achieving 1NF, 2NF, and 3NF is usually sufficient for OLTP systems.

References:

13.9.2. Normal Forms (concise)#

  • First Normal Form (1NF)

    1. No repeating groups or arrays

    2. All values are atomic (single, indivisible values)

  • Second Normal Form (2NF)

    1. The table is in 1NF

    2. No partial dependency: each non-key attribute depends on the entire primary key (relevant when the key is composite)

  • Third Normal Form (3NF)

    1. The table is in 2NF

    2. No transitive dependencies: no non-key attribute depends on another non-key attribute

13.9.3. Summary#

  1. All values must be atomic (1NF)

  2. No partial dependency on part of a composite key (2NF)

  3. No transitive dependency among non-key attributes (3NF)

13.9.4. Bad Design Examples#

13.9.4.1. Example 1#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example1.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Name Project Time
0 EN1-26 Sean O'Brien 30-452-T3, 30-457-T3, 32-244-T3 0.25, 0.40, 0.30
1 EN1-33 Amy Guya 30-452-T3, 30-382-TC, 32-244-T3 0.05, 0.35, 0.60
2 EN1-35 Steven Baranco 30-452-T3, 31-238-TC 0.15, 0.80
3 EN1-36 Elizabeth Roslyn 35-152-TC 0.90
4 EN1-38 Carol Schaaf 36-272-TC 0.75
5 EN1-40 Alexandra Wing 31-238-TC, 31-241-TC 0.20, 0.70
  • Problems with Example 1

    • Repeating groups of fields (e.g., project1, project2, …)

    • Project and time fields are not atomic

    • Can’t sort by last name (e.g., if stored together as full name without split)

    • Can’t sort by time because the field is stored as text; store durations as numeric (e.g., minutes) or a standard format

    • Implicit pairing between project and time (not explicitly modeled)

13.9.4.2. Example 2#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example2.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Last_Name First_Name Project1 Time1 Project2 Time2 Project3 Time3
0 EN1-26 O'Brien Sean 30-452-T3 0.250000 30-457-T3 0.400000 32-244-T3 0.300000
1 EN1-33 Guya Amy 30-452-T3 0.050000 30-382-TC 0.350000 32-244-T3 0.600000
2 EN1-35 Baranco Steven 30-452-T3 0.150000 31-238-TC 0.800000 None nan
3 EN1-36 Roslyn Elizabeth 35-152-TC 0.900000 None nan None nan
4 EN1-38 Schaaf Carol 36-272-TC 0.750000 None nan None nan
5 EN1-40 Wing Alexandra 31-238-TC 0.200000 31-241-TC 0.700000 None nan
  • Analysis of Example 2

    • Sorting is possible if data types are corrected

    • Not scalable: how do you add another project without altering the schema?

13.9.4.3. Example 3 (Move to 1NF)#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example3.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))

sql_statement = "SELECT * FROM PROJECTS_EMPLOYEES_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Last_Name First_Name
0 EN1-26 O'Brien Sean
1 EN1-33 Guya Amy
2 EN1-35 Baranco Steven
3 EN1-36 Roslyn Elizabeth
4 EN1-38 Schaaf Carol
5 EN1-40 Wing Alexandra
  ProjectNum EmployeeID Time
0 30-328-TC EN1-33 0.350000
1 30-452-T3 EN1-26 0.250000
2 30-452-T3 EN1-33 0.050000
3 30-452-T3 EN1-35 0.150000
4 31-238-TC EN1-35 0.800000
5 30-457-T3 EN1-26 0.400000
6 31-238-TC EN1-40 0.200000
7 31-241-TC EN1-40 0.700000
8 32-244-T3 EN1-33 0.600000
9 35-152-TC EN1-36 0.900000
10 36-272-TC EN1-38 0.750000
  • Analysis of Example 3 — First Normal Form

    • One row per (employee, project, time) → atomic values

    • Can group by employee_id or project_id

    • Can sort by time and by name

13.9.4.4. Example 4 (Pre-2NF anti-pattern)#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example4.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Last_Name First_Name ProjectNumber ProjectTitle
0 EN1-26 O'Brien Sean 30-452-T3 STAR manual
1 EN1-26 O'Brien Sean 30-457-T3 ISO procedures
2 EN1-26 O'Brien Sean 31-124-T3 Employee handbook
3 EN1-33 Guya Amy 30-452-T3 STAR manual
4 EN1-33 Guya Amy 30-482-TC Web Site
5 EN1-33 Guya Amy 31-241-TC New catalog
6 EN1-35 Baranco Steven 30-452-T3 STAR manual
7 EN1-35 Baranco Steven 31-238-TC STAR prototype
8 EN1-36 Roslyn Elizabeth 35-152-TC STAR pricing
9 EN1-38 Schaaf Carol 36-272-TC Order system
10 EN1-40 Wing Alexandra 31-238-TC STAR prototype
11 EN1-40 Wing Alexandra 31-241-TC New catalog
  • Analysis of Example 4

    • How would you update a project title? You’d have to edit many rows (update anomaly)

    • Can you add a project without an employee_id? Not cleanly

    • How can you delete a project without losing its details? Risk of delete anomaly

13.9.4.5. Example 5 (2NF achieved)#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example5.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))

sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Last_Name First_Name
0 EN1-26 O'Brien Sean
1 EN1-33 Guya Amy
2 EN1-35 Baranco Steven
3 EN1-36 Roslyn Elizabeth
4 EN1-38 Schaaf Carol
5 EN1-40 Wing Alexandra
  EmployeeID ProjectNum
0 EN1-26 30-452-T3
1 EN1-26 30-457-T3
2 EN1-26 31-124-T3
3 EN1-33 30-328-TC
4 EN1-33 30-452-T3
5 EN1-33 32-244-T3
6 EN1-35 30-452-T3
7 EN1-35 31-238-TC
8 EN1-36 35-152-TC
9 EN1-38 36-272-TC
10 EN1-40 31-238-TC2
11 EN1-40 31-241-TC
  ProjectNum ProjectTitle
0 30-452-T3 STAR manual
1 30-457-T3 ISO procedures
2 30-482-TC Web site
3 31-124-T3 Employee handbook
4 31-238-TC STAR prototype
5 31-238-TC2 New catalog
6 35-152-TC STAR pricing
7 36-272-TC Order system
  • Analysis of Example 5

    • Project attributes live in PROJECTS (no duplication in the junction table)

    • Junction table holds only FKs and relationship-specific attributes

    • No partial dependencies

13.9.4.6. Example 6 (3NF problem: transitive dependency)#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example6.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  ProjectNum ProjectTitle ProjectMgr Phone
0 30-452-T3 STAR manual Garrison 2756
1 30-457-T3 ISO procedures Jacanda 2954
2 30-482-TC Web site Friedman 2846
3 31-124-T3 Employee handbook Jones 3102
4 31-238-TC STAR prototype Garrison 2756
5 31-241-TC New catalog Jones 3102
6 35-152-TC STAR pricing Vance 3022
7 36-272-TC Order system Jacanda 2954
  • Analysis of Example 6

    • Example: PROJECTS(project_id PK, manager_name, manager_phone, ...)

    • manager_phone depends on manager_name (non-key), not directly on project_id

    • This is a transitive dependency: project_id manager_name manager_phone

13.9.4.7. Example 7 (3NF achieved: remove transitive dependency)#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM MANAGERS"
df = pd.read_sql_query(sql_statement, conn)
display(df.style.set_table_attributes('style="font-size: 12px"'))

sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  ProjectMgr Phone
0 Friedman 2846
1 Garrison 2756
2 Jacanda 2954
3 Jones 3102
4 Vance 3022
  ProjectNum ProjectTitle ProjectMgr
0 30-452-T3 STAR manual Garrison
1 30-457-T3 ISO procedures Jacanda
2 30-482-TC Web site Friedman
3 31-124-T3 Employee handbook Jones
4 31-238-TC STAR prototype Garrison
5 31-241-TC New catalog Jones
6 35-152-TC STAR pricing Vance
7 36-272-TC Order system Jacanda
  • Analysis of Example 7

    • MANAGERS(manager_id PK, manager_name, manager_phone, ...)

    • PROJECTS references manager_id (FK); manager details live in MANAGERS

    • Removes the transitive dependency and update anomalies