13.2. SQL Query Primer#

13.2.1. Query Clauses#

SQL offers six clauses (Query Clauses) to query your data. All SQL queries will use at least two clauses, i.e., SELECT and FROM.

Table 13.1 Query Clauses#

Clause Name

Purpose

SELECT

Determines which columns to include in the query’s result set

FROM

Identifies the tables from which to retrieve data and how the tables should be joined

WHERE

Filters out unwanted data

GROUP BY

Used to group rows together by common column values

HAVING

Filters out unwanted groups

ORDER BY

Sorts the rows of the final result set by one or more columns

13.2.1.1. Basic query#

The most basic SQL query will have a SELECT and FROM clause. Select lets you choose the columns you want. In case you want all the columns, you can use *, which indicates to SQL you want all the columns. The FROM clause lets you specify the table you want to query. The following is the most basic SQL query: You can select all the columns by using * after the SELECT. Note all SQL queries are terminated by a semicolon (;). You can format your SQL with as many spaces and tabs as you like. To indicate to SQL that your query statement is complete, terminate it with a semicolon.

Query: Select all columns and all rows from the PatientCorePopulatedTable table.

SELECT
    *
FROM
    PatientCorePopulatedTable;

By default all the rows are selected since no filtering is applied.

Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT
    *
FROM
    PatientCorePopulatedTable;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT
      8     *
      9 FROM
     10     PatientCorePopulatedTable;
     11 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.2. Limit rows#

You can limit the number of rows to 10 by adding LIMIT 10 after the FROM clause.

Query: Select all columns and from the PatientCorePopulatedTable table and limit to 10 rows.

SELECT
    *
FROM
    PatientCorePopulatedTable
LIMIT 10;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT
    *
FROM
    PatientCorePopulatedTable
LIMIT 10;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT
      8     *
   (...)     11 LIMIT 10;
     12 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.3. Select some columns#

You can select columns from a table by specifying them after the SELECT clause. Multiple columns are separated by a comma (,).

Query: Select the PatientID and PatientDateOfBirth columns and limit to 10 rows.

SELECT
    PatientID,
    PatientDateOfBirth
FROM
    PatientCorePopulatedTable
LIMIT 10;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT
    PatientID,
    PatientDateOfBirth
FROM
    PatientCorePopulatedTable
LIMIT 10;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT
      8     PatientID,
   (...)     12 LIMIT 10;
     13 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.4. Using column alias#

Notice that all the column names are prefixed by Patient. You can give columns a different name, meaning an alias. There are two ways to do alias. You can put the alias name right after the actual column name, e.g., PatientID PTID or you can use the AS keyword to indicate explicitly that you are aliasing a column name, e.g., PatientID AS PTID. Note that if the alias has a space, then it should be in quotes.

Query: Select the PatientID and PatientDateOfBirth columns, but alias PatientID to PTID and PatientDateOfBirth to “Date of Birth” and limit to 10 rows.

SELECT
    PatientID PTID, 
    PatientDateOfBirth AS "Date of Birth"
FROM
    PatientCorePopulatedTable
LIMIT 10;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT
    PatientID PTID, 
    PatientDateOfBirth AS "Date of Birth"
FROM
    PatientCorePopulatedTable
LIMIT 10;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[4], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT
      8     PatientID PTID, 
   (...)     12 LIMIT 10;
     13 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.5. Adding columns not from the table#

Besides selecting columns in the table, you can also add the following columns to your query:

  1. literals such as numbers or strings

  2. Math expressions such as PatientPopulationPercentageBelowPoverty + 1, or PatientPopulationPercentageBelowPoverty * 100

Query: Select the PatientID, Hospital, and PatientPopulationPercentageBelowPoverty columns, alias PatientID to PTID, make the Hospital column ‘Buffalo General’, multiply PatientPopulationPercentageBelowPoverty by 10 and alias it to Poverty Level, and limit to 10 rows. Note that if the alias name has a space, then the name needs to inside of quotes.

SELECT
    PatientID PTID, 
    'Buffalo General' Hospital,
    PatientPopulationPercentageBelowPoverty * 10 "Poverty Level"
FROM
    PatientCorePopulatedTable
LIMIT 10;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT
    PatientID PTID, 
    'Buffalo General' Hospital,
    PatientPopulationPercentageBelowPoverty * 10 "Poverty Level"
FROM
    PatientCorePopulatedTable
LIMIT 10;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[5], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT
      8     PatientID PTID, 
   (...)     13 LIMIT 10;
     14 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.6. Removing duplicates#

In some cases you might get duplicate rows. You remove these duplicate rows by putting the DISTINCT keyword after the SELECT keyword. One use of this is to get distinct values of a given column.

Query: Select only the PatientMaritalStatus column from the PatientCorePopulatedTable table.

SELECT
    PatientMaritalStatus
FROM
    PatientCorePopulatedTable;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT
    PatientMaritalStatus
FROM
    PatientCorePopulatedTable;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[6], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT
      8     PatientMaritalStatus
      9 FROM
     10     PatientCorePopulatedTable;
     11 """

ModuleNotFoundError: No module named 'pandas'

Query: Select the PatientMaritalStatus column from the PatientCorePopulatedTable table but only select distinct values.

SELECT DISTINCT
    PatientMaritalStatus
FROM
    PatientCorePopulatedTable;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT DISTINCT
    PatientMaritalStatus
FROM
    PatientCorePopulatedTable;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[7], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT DISTINCT
      8     PatientMaritalStatus
      9 FROM
     10     PatientCorePopulatedTable;
     11 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.7. Removing duplicates with multiple columns#

The DISTINCT keyword can also be used to find distinct combination of columns. It is also used sometimes with joins to remove duplicate rows.

Query: Select PatientRace and PatientMaritalStatus columns from the PatientCorePopulatedTable table but only select distinct values. This query finds the distinct combinations of race and martial status.

SELECT DISTINCT
    PatientRace,
    PatientMaritalStatus
FROM
    PatientCorePopulatedTable
ORDER BY PatientRace, PatientMaritalStatus
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT DISTINCT
    PatientRace,
    PatientMaritalStatus
FROM
    PatientCorePopulatedTable
ORDER BY PatientRace, PatientMaritalStatus
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[8], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT DISTINCT
      8     PatientRace,
   (...)     12 ORDER BY PatientRace, PatientMaritalStatus
     13 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.8. FROM multiple tables#

The power of SQL lies in the fact that you can combine tables together based on some shared column between tables. The FROM clause allows you to select from multiple tables.

You should now that there are four types of tables in SQL:

  1. Permanent tables (i.e., created using the CREATE TABLE statement)

  2. Derived tables (i.e., rows returned by a subquery and held in memory)

  3. Temporary tables (i.e., volatile data held in memory)

  4. Virtual tables (i.e, created using the CREATE VIEW statement)

We have been using permanent tables so far. Temporary and virtual tables will be covered later. The following is an example a derived table.

13.2.1.8.1. Derived table#

A derived query is a query held in memory. You surround it a pair of parenthesis and give it a name.

Query: Create a subquery called dx_codes which selects the PrimaryDiagnosisCode and PrimaryDiagnosisDescription columns from the AdmissionsDiagnosesCorePopulatedTable where the AdmissionID is equal to 1. Then use this derived query in another query that concatenates the diagnosis code and diagnosis description, e.g., “(M01.X) Direct infection of joint in infectious and parasitic diseases classified elsewhere”. Fields and string literals can be concatenated in SQLite using ||.

SELECT '(' || dx_codes.code || ') ' || dx_codes.description AS CodeWDescription
FROM
   (
     SELECT
       PrimaryDiagnosisCode code,
       PrimaryDiagnosisDescription description 
     FROM
       AdmissionsDiagnosesCorePopulatedTable
      WHERE AdmissionID = 1
   ) dx_codes
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT '(' || dx_codes.code || ') ' || dx_codes.description AS CodeWDescription
FROM
   (
     SELECT
       PrimaryDiagnosisCode code,
       PrimaryDiagnosisDescription description 
     FROM
       AdmissionsDiagnosesCorePopulatedTable
      WHERE AdmissionID = 1
   ) dx_codes
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[9], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT '(' || dx_codes.code || ') ' || dx_codes.description AS CodeWDescription
      8 FROM
   (...)     16    ) dx_codes
     17 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.9. The WHERE Clause#

The WHERE clause allows you to filter out unwanted rows. For string fields, you can use the equality operator (=) or the LIKE operator. For numerical and date fields, you can use all the usually operators such as greater than, less than, etc. WHERE clauses can be combined using AND and OR. Parenthesis can be used to clarify grouping of the clauses. The WHERE clauses are put after the FROM clause.

Query: Select all patients from PatientCorePopulatedTable table that are either married and african american or married and white and the PatientPopulationPercentageBelowPoverty is above 15. Select the following columns: PatientID, PatientRace, PatientMaritalStatus and PatientPopulationPercentageBelowPoverty.

SELECT 
  PatientID, 
  PatientRace, 
  PatientMaritalStatus,
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
WHERE ((PatientRace = 'White' AND PatientMaritalStatus = 'Married') OR (PatientRace = 'African American' AND PatientMaritalStatus = 'Married'))
  AND PatientPopulationPercentageBelowPoverty > 15
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT 
  PatientID, 
  PatientRace, 
  PatientMaritalStatus,
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
WHERE (
 (
  PatientRace = 'White' AND PatientMaritalStatus = 'Married') 
  OR (PatientRace = 'African American' AND PatientMaritalStatus = 'Married')
 )
  AND PatientPopulationPercentageBelowPoverty > 15
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[10], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT 
      8   PatientID, 
   (...)     19   AND PatientPopulationPercentageBelowPoverty > 15
     20 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.10. The ORDER BY Clauses#

You can order the the rows by column(s) using the ORDER BY clause. This clause is put after the WHERE clause. You can specify multiple columns separated by comma. You can also specify ascending order using the ASC keyword after the column name and descending order by using the DESC keyword. The default sorting order is ascending. A shortcut for descending is putting - before the column name. Finally you can sort the columns by its numerical position.

13.2.1.10.1. Sort by columns#

Query: Select the PatientMaritalStatus and PatientPopulationPercentageBelowPoverty columns from the PatientCorePopulatedTable table and sort by PatientPopulationPercentageBelowPoverty.

SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientPopulationPercentageBelowPoverty;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientPopulationPercentageBelowPoverty;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[11], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT 
      8   PatientMaritalStatus, 
   (...)     12 ORDER BY PatientPopulationPercentageBelowPoverty;
     13 """

ModuleNotFoundError: No module named 'pandas'

Query: Select the PatientMaritalStatus and PatientPopulationPercentageBelowPoverty columns from the PatientCorePopulatedTable table and sort by PatientMaritalStatus and PatientPopulationPercentageBelowPoverty.

SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, PatientPopulationPercentageBelowPoverty;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, PatientPopulationPercentageBelowPoverty;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[12], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT 
      8   PatientMaritalStatus, 
   (...)     12 ORDER BY PatientMaritalStatus, PatientPopulationPercentageBelowPoverty;
     13 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.10.2. Ascending versus Descending Sort Order#

Query: Select the PatientMaritalStatus and PatientPopulationPercentageBelowPoverty columns from the PatientCorePopulatedTable table and sort by PatientMaritalStatus and PatientPopulationPercentageBelowPoverty by descending order using the keyword DESC.

SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, PatientPopulationPercentageBelowPoverty DESC;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, PatientPopulationPercentageBelowPoverty DESC;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[13], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT 
      8   PatientMaritalStatus, 
   (...)     12 ORDER BY PatientMaritalStatus, PatientPopulationPercentageBelowPoverty DESC;
     13 """

ModuleNotFoundError: No module named 'pandas'

Query: Select the PatientMaritalStatus and PatientPopulationPercentageBelowPoverty columns from the PatientCorePopulatedTable table and sort by PatientMaritalStatus and PatientPopulationPercentageBelowPoverty by descending order using the -.

SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, -PatientPopulationPercentageBelowPoverty;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, -PatientPopulationPercentageBelowPoverty;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[14], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT 
      8   PatientMaritalStatus, 
   (...)     12 ORDER BY PatientMaritalStatus, -PatientPopulationPercentageBelowPoverty;
     13 """

ModuleNotFoundError: No module named 'pandas'

13.2.1.10.3. Sorting using numerical position#

When sorting by numerical position, you cannot use -. You must use the keyword DESC. Column names make your code more explicit, which you should prefer.

Query: Select the PatientMaritalStatus and PatientPopulationPercentageBelowPoverty columns from the PatientCorePopulatedTable table and sort by PatientMaritalStatus and PatientPopulationPercentageBelowPoverty by descending order using the keyword DESC, but using the numerical position of PatientPopulationPercentageBelowPoverty.

SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, 2 DESC;
Hide code cell source
import sqlite3
import pandas as pd

conn = sqlite3.connect('100_patients.db')

sql_statement = """
SELECT 
  PatientMaritalStatus, 
  PatientPopulationPercentageBelowPoverty
FROM
  PatientCorePopulatedTable
ORDER BY PatientMaritalStatus, 2 DESC;
"""
df = pd.read_sql_query(sql_statement, conn)
df.style.set_table_attributes('style="font-size: 12px"')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[15], line 2
      1 import sqlite3
----> 2 import pandas as pd
      4 conn = sqlite3.connect('100_patients.db')
      6 sql_statement = """
      7 SELECT 
      8   PatientMaritalStatus, 
   (...)     12 ORDER BY PatientMaritalStatus, 2 DESC;
     13 """

ModuleNotFoundError: No module named 'pandas'