13.10. Normalization Old#

What is database normalization?

  • Ref: https://www.complexsql.com/database-normalization/

  • Ref: http://www.databasedev.co.uk/1norm_form.html

  • The purpose of database normalization is to:

  • eliminate redundant data

  • reduce complexity of data, making it easier to manage the data and make change

  • ensure logical data dependencies

  • How is database normalization achieved?

  • First Normal Form (1NF):

    1. if there are no repeating groups.

    2. all values are atomic, meaning they are the smallest meaningful value

  • Second Normal Form (2NF):

    1. the table is in first normal form

    2. each non-key field is functionally dependent on the entire primary key

  • Third Normal Form (3NF):

    1. the table is in second normal form

    2. there are no transitive dependencies

  • Ref: https://arctype.com/blog/2nf-3nf-normalization-example/

  • Summary

    1. All values must be atomic

    2. No redundancy

    3. No implicit relationship/dependency

    4. No transitive relationship/dependency

13.10.1. Bad Design Examples#

13.10.1.1. Example 1#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example1.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('example1.db')
      6 sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"

ModuleNotFoundError: No module named 'pandas'
  • Problems with example1

    • Repeating group of fields

    • The project and time fields are not made up of atomic values

    • Can’t sort by last name

    • Can’t sort by time because field is type text

    • Assumed relationship between project and time

13.10.1.2. Example 2#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example2.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('example2.db')
      6 sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS_TIME"

ModuleNotFoundError: No module named 'pandas'
  • Analysis of example2

    • Can sort now!

    • How can you add another project?

13.10.1.3. Example 3#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example3.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM PROJECTS_EMPLOYEES_TIME"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('example3.db')
      6 sql_statement = "SELECT * FROM EMPLOYEES"

ModuleNotFoundError: No module named 'pandas'
  • Analysis of example3 – first normal form

    • Can do groups by employeeid or projectnum

    • Can sort by time

    • Can sort by name

13.10.1.4. Example 4#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example4.db')

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[4], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('example4.db')
      6 sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"

ModuleNotFoundError: No module named 'pandas'
  • Analysis of example4

    • How would you update the project title for a given project? Have to edit in many places

    • Can you add a project without an employeeid?

    • How can you delete a project?

13.10.1.5. Example 5#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example5.db')

sql_statement = "SELECT * FROM EMPLOYEES"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM EMPLOYEES_PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[5], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('example5.db')
      6 sql_statement = "SELECT * FROM EMPLOYEES"

ModuleNotFoundError: No module named 'pandas'
  • Analysis of example5

    • second normal form

13.10.1.6. Example 6#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example6.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[6], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('example6.db')
      5 sql_statement = "SELECT * FROM PROJECTS"

ModuleNotFoundError: No module named 'pandas'
  • Analysis of example 6

    • Phone number, which is a non-key field, has transitive dependency on another non-key field.

13.10.1.7. Example 6#

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM MANAGERS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
display(df)

conn = sqlite3.connect('example7.db')
sql_statement = "SELECT * FROM PROJECTS"
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[7], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('example7.db')
      5 sql_statement = "SELECT * FROM MANAGERS"

ModuleNotFoundError: No module named 'pandas'
  • Analysis of example7

    • Removed transitive dependency