Normalization#

What is database normalization?

Ref: https://www.complexsql.com/database-normalization/
Ref: http://www.databasedev.co.uk/1norm_form.html
The purpose of database normalization is to:
eliminate redundant data
reduce complexity of data, making it easier to manage the data and make change
ensure logical data dependencies
How is database normalization achieved?
- By fulfilling five normal forms. Each normal form represents an increasingly stringent set of rules. Usually fulfilling the first three normal forms is sufficient.
- Ref: https://www.1keydata.com/database-normalization/first-normal-form-1nf.php
First Normal Form (1NF):
1. if there are no repeating groups.
2. all values are atomic, meaning they are the smallest meaningful value
Second Normal Form (2NF):
1. the table is in first normal form
2. each non-key field is functionally dependent on the entire primary key
Third Normal Form (3NF):
1. the table is in second normal form
2. there are no transitive dependencies
Ref: https://arctype.com/blog/2nf-3nf-normalization-example/
Summary
1. All values must be atomic
2. No redundancy
3. No implicit relationship/dependency
4. No transitive relationship/dependency

Bad Design Examples#

Example 1#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example1.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')

	EmployeeID	Name	Project	Time
0	EN1-26	Sean O'Brien	30-452-T3, 30-457-T3, 32-244-T3	0.25, 0.40, 0.30
1	EN1-33	Amy Guya	30-452-T3, 30-382-TC, 32-244-T3	0.05, 0.35, 0.60
2	EN1-35	Steven Baranco	30-452-T3, 31-238-TC	0.15, 0.80
3	EN1-36	Elizabeth Roslyn	35-152-TC	0.90
4	EN1-38	Carol Schaaf	36-272-TC	0.75
5	EN1-40	Alexandra Wing	31-238-TC, 31-241-TC	0.20, 0.70

Problems with example1
- Repeating group of fields
- The project and time fields are not made up of atomic values
- Can’t sort by last name
- Can’t sort by time because field is type text
- Assumed relationship between project and time

Example 2#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example2.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')

	EmployeeID	Last_Name	First_Name	Project1	Time1	Project2	Time2	Project3	Time3
0	EN1-26	O'Brien	Sean	30-452-T3	0.250000	30-457-T3	0.400000	32-244-T3	0.300000
1	EN1-33	Guya	Amy	30-452-T3	0.050000	30-382-TC	0.350000	32-244-T3	0.600000
2	EN1-35	Baranco	Steven	30-452-T3	0.150000	31-238-TC	0.800000	None	nan
3	EN1-36	Roslyn	Elizabeth	35-152-TC	0.900000	None	nan	None	nan
4	EN1-38	Schaaf	Carol	36-272-TC	0.750000	None	nan	None	nan
5	EN1-40	Wing	Alexandra	31-238-TC	0.200000	31-241-TC	0.700000	None	nan

Analysis of example2
- Can sort now!
- How can you add another project?

Example 3#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example3.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM PROJECTS_EMPLOYEES_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')

	EmployeeID	Last_Name	First_Name
0	EN1-26	O'Brien	Sean
1	EN1-33	Guya	Amy
2	EN1-35	Baranco	Steven
3	EN1-36	Roslyn	Elizabeth
4	EN1-38	Schaaf	Carol
5	EN1-40	Wing	Alexandra

	ProjectNum	EmployeeID	Time
0	30-328-TC	EN1-33	0.350000
1	30-452-T3	EN1-26	0.250000
2	30-452-T3	EN1-33	0.050000
3	30-452-T3	EN1-35	0.150000
4	31-238-TC	EN1-35	0.800000
5	30-457-T3	EN1-26	0.400000
6	31-238-TC	EN1-40	0.200000
7	31-241-TC	EN1-40	0.700000
8	32-244-T3	EN1-33	0.600000
9	35-152-TC	EN1-36	0.900000
10	36-272-TC	EN1-38	0.750000

Analysis of example3 – first normal form
- Can do groups by employeeid or projectnum
- Can sort by time
- Can sort by name

Example 4#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example4.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')

	EmployeeID	Last_Name	First_Name	ProjectNumber	ProjectTitle
0	EN1-26	O'Brien	Sean	30-452-T3	STAR manual
1	EN1-26	O'Brien	Sean	30-457-T3	ISO procedures
2	EN1-26	O'Brien	Sean	31-124-T3	Employee handbook
3	EN1-33	Guya	Amy	30-452-T3	STAR manual
4	EN1-33	Guya	Amy	30-482-TC	Web Site
5	EN1-33	Guya	Amy	31-241-TC	New catalog
6	EN1-35	Baranco	Steven	30-452-T3	STAR manual
7	EN1-35	Baranco	Steven	31-238-TC	STAR prototype
8	EN1-36	Roslyn	Elizabeth	35-152-TC	STAR pricing
9	EN1-38	Schaaf	Carol	36-272-TC	Order system
10	EN1-40	Wing	Alexandra	31-238-TC	STAR prototype
11	EN1-40	Wing	Alexandra	31-241-TC	New catalog

Analysis of example4
- How would you update the project title for a given project? Have to edit in many places
- Can you add a project without an employeeid?
- How can you delete a project?

Example 5#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example5.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')

	EmployeeID	Last_Name	First_Name
0	EN1-26	O'Brien	Sean
1	EN1-33	Guya	Amy
2	EN1-35	Baranco	Steven
3	EN1-36	Roslyn	Elizabeth
4	EN1-38	Schaaf	Carol
5	EN1-40	Wing	Alexandra

	EmployeeID	ProjectNum
0	EN1-26	30-452-T3
1	EN1-26	30-457-T3
2	EN1-26	31-124-T3
3	EN1-33	30-328-TC
4	EN1-33	30-452-T3
5	EN1-33	32-244-T3
6	EN1-35	30-452-T3
7	EN1-35	31-238-TC
8	EN1-36	35-152-TC
9	EN1-38	36-272-TC
10	EN1-40	31-238-TC2
11	EN1-40	31-241-TC

	ProjectNum	ProjectTitle
0	30-452-T3	STAR manual
1	30-457-T3	ISO procedures
2	30-482-TC	Web site
3	31-124-T3	Employee handbook
4	31-238-TC	STAR prototype
5	31-238-TC2	New catalog
6	35-152-TC	STAR pricing
7	36-272-TC	Order system

Analysis of example5
- second normal form

Example 6#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example6.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')

	ProjectNum	ProjectTitle	ProjectMgr	Phone
0	30-452-T3	STAR manual	Garrison	2756
1	30-457-T3	ISO procedures	Jacanda	2954
2	30-482-TC	Web site	Friedman	2846
3	31-124-T3	Employee handbook	Jones	3102
4	31-238-TC	STAR prototype	Garrison	2756
5	31-241-TC	New catalog	Jones	3102
6	35-152-TC	STAR pricing	Vance	3022
7	36-272-TC	Order system	Jacanda	2954

Analysis of example 6
- Phone number, which is a non-key field, has transitive dependency on another non-key field.

Example 6#

import sqlite3
import pandas as pd

conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM MANAGERS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')

	ProjectMgr	Phone
0	Friedman	2846
1	Garrison	2756
2	Jacanda	2954
3	Jones	3102
4	Vance	3022

	ProjectNum	ProjectTitle	ProjectMgr
0	30-452-T3	STAR manual	Garrison
1	30-457-T3	ISO procedures	Jacanda
2	30-482-TC	Web site	Friedman
3	31-124-T3	Employee handbook	Jones
4	31-238-TC	STAR prototype	Garrison
5	31-241-TC	New catalog	Jones
6	35-152-TC	STAR pricing	Vance
7	36-272-TC	Order system	Jacanda

Analysis of example7
- Removed transitive dependency

EAS503

Normalization

Contents

Normalization#

Bad Design Examples#

Example 1#

Example 2#

Example 3#

Example 4#

Example 5#

Example 6#

Example 6#