Normalization#

What is database normalization?

  • Ref: https://www.complexsql.com/database-normalization/

  • Ref: http://www.databasedev.co.uk/1norm_form.html

  • The purpose of database normalization is to:

  • eliminate redundant data

  • reduce complexity of data, making it easier to manage the data and make change

  • ensure logical data dependencies

  • How is database normalization achieved?

  • First Normal Form (1NF):

    1. if there are no repeating groups.

    2. all values are atomic, meaning they are the smallest meaningful value

  • Second Normal Form (2NF):

    1. the table is in first normal form

    2. each non-key field is functionally dependent on the entire primary key

  • Third Normal Form (3NF):

    1. the table is in second normal form

    2. there are no transitive dependencies

  • Ref: https://arctype.com/blog/2nf-3nf-normalization-example/

  • Summary

    1. All values must be atomic

    2. No redundancy

    3. No implicit relationship/dependency

    4. No transitive relationship/dependency

Bad Design Examples#

Example 1#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example1.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Name Project Time
0 EN1-26 Sean O'Brien 30-452-T3, 30-457-T3, 32-244-T3 0.25, 0.40, 0.30
1 EN1-33 Amy Guya 30-452-T3, 30-382-TC, 32-244-T3 0.05, 0.35, 0.60
2 EN1-35 Steven Baranco 30-452-T3, 31-238-TC 0.15, 0.80
3 EN1-36 Elizabeth Roslyn 35-152-TC 0.90
4 EN1-38 Carol Schaaf 36-272-TC 0.75
5 EN1-40 Alexandra Wing 31-238-TC, 31-241-TC 0.20, 0.70
  • Problems with example1

    • Repeating group of fields

    • The project and time fields are not made up of atomic values

    • Can’t sort by last name

    • Can’t sort by time because field is type text

    • Assumed relationship between project and time

Example 2#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example2.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Last_Name First_Name Project1 Time1 Project2 Time2 Project3 Time3
0 EN1-26 O'Brien Sean 30-452-T3 0.250000 30-457-T3 0.400000 32-244-T3 0.300000
1 EN1-33 Guya Amy 30-452-T3 0.050000 30-382-TC 0.350000 32-244-T3 0.600000
2 EN1-35 Baranco Steven 30-452-T3 0.150000 31-238-TC 0.800000 None nan
3 EN1-36 Roslyn Elizabeth 35-152-TC 0.900000 None nan None nan
4 EN1-38 Schaaf Carol 36-272-TC 0.750000 None nan None nan
5 EN1-40 Wing Alexandra 31-238-TC 0.200000 31-241-TC 0.700000 None nan
  • Analysis of example2

    • Can sort now!

    • How can you add another project?

Example 3#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example3.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM PROJECTS_EMPLOYEES_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
EmployeeID Last_Name First_Name
0 EN1-26 O'Brien Sean
1 EN1-33 Guya Amy
2 EN1-35 Baranco Steven
3 EN1-36 Roslyn Elizabeth
4 EN1-38 Schaaf Carol
5 EN1-40 Wing Alexandra
  ProjectNum EmployeeID Time
0 30-328-TC EN1-33 0.350000
1 30-452-T3 EN1-26 0.250000
2 30-452-T3 EN1-33 0.050000
3 30-452-T3 EN1-35 0.150000
4 31-238-TC EN1-35 0.800000
5 30-457-T3 EN1-26 0.400000
6 31-238-TC EN1-40 0.200000
7 31-241-TC EN1-40 0.700000
8 32-244-T3 EN1-33 0.600000
9 35-152-TC EN1-36 0.900000
10 36-272-TC EN1-38 0.750000
  • Analysis of example3 – first normal form

    • Can do groups by employeeid or projectnum

    • Can sort by time

    • Can sort by name

Example 4#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example4.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  EmployeeID Last_Name First_Name ProjectNumber ProjectTitle
0 EN1-26 O'Brien Sean 30-452-T3 STAR manual
1 EN1-26 O'Brien Sean 30-457-T3 ISO procedures
2 EN1-26 O'Brien Sean 31-124-T3 Employee handbook
3 EN1-33 Guya Amy 30-452-T3 STAR manual
4 EN1-33 Guya Amy 30-482-TC Web Site
5 EN1-33 Guya Amy 31-241-TC New catalog
6 EN1-35 Baranco Steven 30-452-T3 STAR manual
7 EN1-35 Baranco Steven 31-238-TC STAR prototype
8 EN1-36 Roslyn Elizabeth 35-152-TC STAR pricing
9 EN1-38 Schaaf Carol 36-272-TC Order system
10 EN1-40 Wing Alexandra 31-238-TC STAR prototype
11 EN1-40 Wing Alexandra 31-241-TC New catalog
  • Analysis of example4

    • How would you update the project title for a given project? Have to edit in many places

    • Can you add a project without an employeeid?

    • How can you delete a project?

Example 5#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example5.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
EmployeeID Last_Name First_Name
0 EN1-26 O'Brien Sean
1 EN1-33 Guya Amy
2 EN1-35 Baranco Steven
3 EN1-36 Roslyn Elizabeth
4 EN1-38 Schaaf Carol
5 EN1-40 Wing Alexandra
EmployeeID ProjectNum
0 EN1-26 30-452-T3
1 EN1-26 30-457-T3
2 EN1-26 31-124-T3
3 EN1-33 30-328-TC
4 EN1-33 30-452-T3
5 EN1-33 32-244-T3
6 EN1-35 30-452-T3
7 EN1-35 31-238-TC
8 EN1-36 35-152-TC
9 EN1-38 36-272-TC
10 EN1-40 31-238-TC2
11 EN1-40 31-241-TC
  ProjectNum ProjectTitle
0 30-452-T3 STAR manual
1 30-457-T3 ISO procedures
2 30-482-TC Web site
3 31-124-T3 Employee handbook
4 31-238-TC STAR prototype
5 31-238-TC2 New catalog
6 35-152-TC STAR pricing
7 36-272-TC Order system
  • Analysis of example5

    • second normal form

Example 6#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example6.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
  ProjectNum ProjectTitle ProjectMgr Phone
0 30-452-T3 STAR manual Garrison 2756
1 30-457-T3 ISO procedures Jacanda 2954
2 30-482-TC Web site Friedman 2846
3 31-124-T3 Employee handbook Jones 3102
4 31-238-TC STAR prototype Garrison 2756
5 31-241-TC New catalog Jones 3102
6 35-152-TC STAR pricing Vance 3022
7 36-272-TC Order system Jacanda 2954
  • Analysis of example 6

    • Phone number, which is a non-key field, has transitive dependency on another non-key field.

Example 6#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM MANAGERS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
ProjectMgr Phone
0 Friedman 2846
1 Garrison 2756
2 Jacanda 2954
3 Jones 3102
4 Vance 3022
  ProjectNum ProjectTitle ProjectMgr
0 30-452-T3 STAR manual Garrison
1 30-457-T3 ISO procedures Jacanda
2 30-482-TC Web site Friedman
3 31-124-T3 Employee handbook Jones
4 31-238-TC STAR prototype Garrison
5 31-241-TC New catalog Jones
6 35-152-TC STAR pricing Vance
7 36-272-TC Order system Jacanda
  • Analysis of example7

    • Removed transitive dependency