13.10. Normalization Old#
What is database normalization?
The purpose of database normalization is to:
eliminate redundant data
reduce complexity of data, making it easier to manage the data and make change
ensure logical data dependencies
How is database normalization achieved?
By fulfilling five normal forms. Each normal form represents an increasingly stringent set of rules. Usually fulfilling the first three normal forms is sufficient.
Ref: https://www.1keydata.com/database-normalization/first-normal-form-1nf.php
First Normal Form (1NF):
if there are no repeating groups.
all values are atomic, meaning they are the smallest meaningful value
Second Normal Form (2NF):
the table is in first normal form
each non-key field is functionally dependent on the entire primary key
Third Normal Form (3NF):
the table is in second normal form
there are no transitive dependencies
Ref: https://arctype.com/blog/2nf-3nf-normalization-example/
Summary
All values must be atomic
No redundancy
No implicit relationship/dependency
No transitive relationship/dependency
13.10.1. Bad Design Examples#
13.10.1.1. Example 1#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example1.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 2
1 import sqlite3
----> 2 import pandas as pd
4 conn = sqlite3.connect('example1.db')
6 sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
ModuleNotFoundError: No module named 'pandas'
Problems with example1
Repeating group of fields
The project and time fields are not made up of atomic values
Can’t sort by last name
Can’t sort by time because field is type text
Assumed relationship between project and time
13.10.1.2. Example 2#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example2.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[2], line 2
1 import sqlite3
----> 2 import pandas as pd
4 conn = sqlite3.connect('example2.db')
6 sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
ModuleNotFoundError: No module named 'pandas'
Analysis of example2
Can sort now!
How can you add another project?
13.10.1.3. Example 3#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example3.db')
sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)
sql_statement = "SELECT * FROM PROJECTS_EMPLOYEES_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[3], line 2
1 import sqlite3
----> 2 import pandas as pd
4 conn = sqlite3.connect('example3.db')
6 sql_statement = "SELECT * FROM EMPLOYEES"
ModuleNotFoundError: No module named 'pandas'
Analysis of example3 – first normal form
Can do groups by employeeid or projectnum
Can sort by time
Can sort by name
13.10.1.4. Example 4#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example4.db')
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[4], line 2
1 import sqlite3
----> 2 import pandas as pd
4 conn = sqlite3.connect('example4.db')
6 sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
ModuleNotFoundError: No module named 'pandas'
Analysis of example4
How would you update the project title for a given project? Have to edit in many places
Can you add a project without an employeeid?
How can you delete a project?
13.10.1.5. Example 5#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example5.db')
sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)
sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[5], line 2
1 import sqlite3
----> 2 import pandas as pd
4 conn = sqlite3.connect('example5.db')
6 sql_statement = "SELECT * FROM EMPLOYEES"
ModuleNotFoundError: No module named 'pandas'
Analysis of example5
second normal form
13.10.1.6. Example 6#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example6.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[6], line 2
1 import sqlite3
----> 2 import pandas as pd
4 conn = sqlite3.connect('example6.db')
5 sql_statement = "SELECT * FROM PROJECTS"
ModuleNotFoundError: No module named 'pandas'
Analysis of example 6
Phone number, which is a non-key field, has transitive dependency on another non-key field.
13.10.1.7. Example 6#
Show code cell source
import sqlite3
import pandas as pd
conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM MANAGERS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)
conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[7], line 2
1 import sqlite3
----> 2 import pandas as pd
4 conn = sqlite3.connect('example7.db')
5 sql_statement = "SELECT * FROM MANAGERS"
ModuleNotFoundError: No module named 'pandas'
Analysis of example7
Removed transitive dependency