{ "cells": [ { "cell_type": "markdown", "id": "61f5c915", "metadata": {}, "source": [ "# Normalization\n", "\n", "What is database normalization?\n", "- Ref: https://www.complexsql.com/database-normalization/\n", "- Ref: http://www.databasedev.co.uk/1norm_form.html\n", "- The purpose of database normalization is to:\n", "- eliminate redundant data\n", "- reduce complexity of data, making it easier to manage the data and make change\n", "- ensure logical data dependencies\n", "- How is database normalization achieved?\n", " - By fulfilling five normal forms. Each normal form represents an increasingly stringent set of rules. Usually fulfilling the first three normal forms is sufficient.\n", " - Ref: https://www.1keydata.com/database-normalization/first-normal-form-1nf.php\n", "- First Normal Form (1NF): \n", " 1. if there are no repeating groups.\n", " 2. all values are atomic, meaning they are the smallest meaningful value\n", "- Second Normal Form (2NF): \n", " 1. the table is in first normal form\n", " 2. each non-key field is functionally dependent on the entire primary key\n", "- Third Normal Form (3NF):\n", " 1. the table is in second normal form\n", " 2. there are no transitive dependencies\n", "- Ref: https://arctype.com/blog/2nf-3nf-normalization-example/\n", "- Summary\n", " 1. All values must be atomic\n", " 2. No redundancy\n", " 3. No implicit relationship/dependency\n", " 4. No transitive relationship/dependency\n", "\n", "## Bad Design Examples\n", "### Example 1" ] }, { "cell_type": "code", "execution_count": 1, "id": "ddb8d268", "metadata": { "tags": [ "hide-input", "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 EmployeeIDNameProjectTime
0EN1-26Sean O'Brien30-452-T3, 30-457-T3, 32-244-T30.25, 0.40, 0.30
1EN1-33Amy Guya30-452-T3, 30-382-TC, 32-244-T30.05, 0.35, 0.60
2EN1-35Steven Baranco30-452-T3, 31-238-TC0.15, 0.80
3EN1-36Elizabeth Roslyn35-152-TC0.90
4EN1-38Carol Schaaf36-272-TC0.75
5EN1-40Alexandra Wing31-238-TC, 31-241-TC0.20, 0.70
\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "import pandas as pd\n", "\n", "conn = sqlite3.connect('example1.db')\n", "\n", "sql_statement = \"SELECT * FROM EMPLOYEES_PROJECTS_TIME\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')" ] }, { "cell_type": "markdown", "id": "155d11ea", "metadata": {}, "source": [ "- Problems with example1\n", " - Repeating group of fields\n", " - The project and time fields are not made up of atomic values\n", " - Can't sort by last name\n", " - Can't sort by time because field is type text\n", " - Assumed relationship between project and time\n", "\n", "### Example 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "63b03eca", "metadata": { "tags": [ "hide-input", "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 EmployeeIDLast_NameFirst_NameProject1Time1Project2Time2Project3Time3
0EN1-26O'BrienSean30-452-T30.25000030-457-T30.40000032-244-T30.300000
1EN1-33GuyaAmy30-452-T30.05000030-382-TC0.35000032-244-T30.600000
2EN1-35BarancoSteven30-452-T30.15000031-238-TC0.800000Nonenan
3EN1-36RoslynElizabeth35-152-TC0.900000NonenanNonenan
4EN1-38SchaafCarol36-272-TC0.750000NonenanNonenan
5EN1-40WingAlexandra31-238-TC0.20000031-241-TC0.700000Nonenan
\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "import pandas as pd\n", "\n", "conn = sqlite3.connect('example2.db')\n", "\n", "sql_statement = \"SELECT * FROM EMPLOYEES_PROJECTS_TIME\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')" ] }, { "cell_type": "markdown", "id": "340746c6", "metadata": {}, "source": [ "- Analysis of example2\n", " - Can sort now!\n", " - How can you add another project?\n", "\n", "### Example 3" ] }, { "cell_type": "code", "execution_count": 3, "id": "c88ceb44", "metadata": { "tags": [ "hide-input", "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeIDLast_NameFirst_Name
0EN1-26O'BrienSean
1EN1-33GuyaAmy
2EN1-35BarancoSteven
3EN1-36RoslynElizabeth
4EN1-38SchaafCarol
5EN1-40WingAlexandra
\n", "
" ], "text/plain": [ " EmployeeID Last_Name First_Name\n", "0 EN1-26 O'Brien Sean\n", "1 EN1-33 Guya Amy\n", "2 EN1-35 Baranco Steven\n", "3 EN1-36 Roslyn Elizabeth\n", "4 EN1-38 Schaaf Carol\n", "5 EN1-40 Wing Alexandra" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ProjectNumEmployeeIDTime
030-328-TCEN1-330.350000
130-452-T3EN1-260.250000
230-452-T3EN1-330.050000
330-452-T3EN1-350.150000
431-238-TCEN1-350.800000
530-457-T3EN1-260.400000
631-238-TCEN1-400.200000
731-241-TCEN1-400.700000
832-244-T3EN1-330.600000
935-152-TCEN1-360.900000
1036-272-TCEN1-380.750000
\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "import pandas as pd\n", "\n", "conn = sqlite3.connect('example3.db')\n", "\n", "sql_statement = \"SELECT * FROM EMPLOYEES\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')\n", "display(df)\n", "\n", "sql_statement = \"SELECT * FROM PROJECTS_EMPLOYEES_TIME\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')" ] }, { "cell_type": "markdown", "id": "a6555301", "metadata": {}, "source": [ "- Analysis of example3 -- first normal form\n", " - Can do groups by employeeid or projectnum\n", " - Can sort by time\n", " - Can sort by name\n", "\n", "\n", "### Example 4" ] }, { "cell_type": "code", "execution_count": 4, "id": "9c889eae", "metadata": { "tags": [ "hide-input", "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 EmployeeIDLast_NameFirst_NameProjectNumberProjectTitle
0EN1-26O'BrienSean30-452-T3STAR manual
1EN1-26O'BrienSean30-457-T3ISO procedures
2EN1-26O'BrienSean31-124-T3Employee handbook
3EN1-33GuyaAmy30-452-T3STAR manual
4EN1-33GuyaAmy30-482-TCWeb Site
5EN1-33GuyaAmy31-241-TCNew catalog
6EN1-35BarancoSteven30-452-T3STAR manual
7EN1-35BarancoSteven31-238-TCSTAR prototype
8EN1-36RoslynElizabeth35-152-TCSTAR pricing
9EN1-38SchaafCarol36-272-TCOrder system
10EN1-40WingAlexandra31-238-TCSTAR prototype
11EN1-40WingAlexandra31-241-TCNew catalog
\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "import pandas as pd\n", "\n", "conn = sqlite3.connect('example4.db')\n", "\n", "sql_statement = \"SELECT * FROM EMPLOYEES_PROJECTS\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')" ] }, { "cell_type": "markdown", "id": "0db83086", "metadata": {}, "source": [ "- Analysis of example4\n", " - How would you update the project title for a given project? Have to edit in many places\n", " - Can you add a project without an employeeid?\n", " - How can you delete a project?\n", "\n", "### Example 5" ] }, { "cell_type": "code", "execution_count": 5, "id": "8413f020", "metadata": { "tags": [ "hide-input", "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeIDLast_NameFirst_Name
0EN1-26O'BrienSean
1EN1-33GuyaAmy
2EN1-35BarancoSteven
3EN1-36RoslynElizabeth
4EN1-38SchaafCarol
5EN1-40WingAlexandra
\n", "
" ], "text/plain": [ " EmployeeID Last_Name First_Name\n", "0 EN1-26 O'Brien Sean\n", "1 EN1-33 Guya Amy\n", "2 EN1-35 Baranco Steven\n", "3 EN1-36 Roslyn Elizabeth\n", "4 EN1-38 Schaaf Carol\n", "5 EN1-40 Wing Alexandra" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeIDProjectNum
0EN1-2630-452-T3
1EN1-2630-457-T3
2EN1-2631-124-T3
3EN1-3330-328-TC
4EN1-3330-452-T3
5EN1-3332-244-T3
6EN1-3530-452-T3
7EN1-3531-238-TC
8EN1-3635-152-TC
9EN1-3836-272-TC
10EN1-4031-238-TC2
11EN1-4031-241-TC
\n", "
" ], "text/plain": [ " EmployeeID ProjectNum\n", "0 EN1-26 30-452-T3\n", "1 EN1-26 30-457-T3\n", "2 EN1-26 31-124-T3\n", "3 EN1-33 30-328-TC\n", "4 EN1-33 30-452-T3\n", "5 EN1-33 32-244-T3\n", "6 EN1-35 30-452-T3\n", "7 EN1-35 31-238-TC\n", "8 EN1-36 35-152-TC\n", "9 EN1-38 36-272-TC\n", "10 EN1-40 31-238-TC2\n", "11 EN1-40 31-241-TC" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ProjectNumProjectTitle
030-452-T3STAR manual
130-457-T3ISO procedures
230-482-TCWeb site
331-124-T3Employee handbook
431-238-TCSTAR prototype
531-238-TC2New catalog
635-152-TCSTAR pricing
736-272-TCOrder system
\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "import pandas as pd\n", "\n", "conn = sqlite3.connect('example5.db')\n", "\n", "sql_statement = \"SELECT * FROM EMPLOYEES\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')\n", "display(df)\n", "\n", "sql_statement = \"SELECT * FROM EMPLOYEES_PROJECTS\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')\n", "display(df)\n", "\n", "sql_statement = \"SELECT * FROM PROJECTS\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')" ] }, { "cell_type": "markdown", "id": "61b0ee55", "metadata": {}, "source": [ "- Analysis of example5\n", " - second normal form\n", "\n", "### Example 6" ] }, { "cell_type": "code", "execution_count": 6, "id": "7b756662", "metadata": { "tags": [ "hide-input", "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ProjectNumProjectTitleProjectMgrPhone
030-452-T3STAR manualGarrison2756
130-457-T3ISO proceduresJacanda2954
230-482-TCWeb siteFriedman2846
331-124-T3Employee handbookJones3102
431-238-TCSTAR prototypeGarrison2756
531-241-TCNew catalogJones3102
635-152-TCSTAR pricingVance3022
736-272-TCOrder systemJacanda2954
\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "import pandas as pd\n", "\n", "conn = sqlite3.connect('example6.db')\n", "sql_statement = \"SELECT * FROM PROJECTS\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')" ] }, { "cell_type": "markdown", "id": "86d692d4", "metadata": {}, "source": [ "- Analysis of example 6\n", " - Phone number, which is a non-key field, has transitive dependency on another non-key field. \n", "\n", "### Example 6" ] }, { "cell_type": "code", "execution_count": 7, "id": "e6222202", "metadata": { "tags": [ "hide-input", "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ProjectMgrPhone
0Friedman2846
1Garrison2756
2Jacanda2954
3Jones3102
4Vance3022
\n", "
" ], "text/plain": [ " ProjectMgr Phone\n", "0 Friedman 2846\n", "1 Garrison 2756\n", "2 Jacanda 2954\n", "3 Jones 3102\n", "4 Vance 3022" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ProjectNumProjectTitleProjectMgr
030-452-T3STAR manualGarrison
130-457-T3ISO proceduresJacanda
230-482-TCWeb siteFriedman
331-124-T3Employee handbookJones
431-238-TCSTAR prototypeGarrison
531-241-TCNew catalogJones
635-152-TCSTAR pricingVance
736-272-TCOrder systemJacanda
\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlite3\n", "import pandas as pd\n", "\n", "conn = sqlite3.connect('example7.db')\n", "sql_statement = \"SELECT * FROM MANAGERS\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')\n", "display(df)\n", "\n", "conn = sqlite3.connect('example7.db')\n", "sql_statement = \"SELECT * FROM PROJECTS\"\n", "df = pd.read_sql_query(sql_statement, conn)\n", "df.style.set_table_attributes('style=\"font-size: 12px\"')" ] }, { "cell_type": "markdown", "id": "e5c659a0", "metadata": {}, "source": [ "- Analysis of example7\n", " - Removed transitive dependency" ] } ], "metadata": { "jupytext": { "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "source_map": [ 11, 43, 53, 64, 74, 82, 97, 107, 117, 126, 146, 153, 162, 169, 185 ] }, "nbformat": 4, "nbformat_minor": 5 }