15.1. Numpy and Pandas#
15.1.1. NumPy#
15.1.1.1. Description#
Ref: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
NumPy provides a multidimensional array object.
Each object comes with an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Basically, you can use NumPy to create ndarray (N-dimensional array) and easily manipulate the array. It is super fast because it is written in C.
15.1.1.2. How is it different from Python arrays?#
Python lists can be modified – you can add and remove elements. NumPy arrays have a fixed size at creation.
Python lists can contain different data types. NumPy arrays can only have one data type. If you put in mixed types, they become a string.
NumPy arrays come prepackaged with advanced mathematical operations. The operations are super fast even on large numbers of data and they use less memory.
15.1.1.3. Why use NumPy#
Most data analysis programs use NumPy to manipulate data. They might take in data as standard Python lists, but they convert it to a NumPy array and manipulate the data using NumPy routines and output the transformed data as a NumPy array.
NumPy data array is the main data type used in most scientific and mathematical Python-based packages.
15.1.1.4. Simple example#
# Square a list using Python
squared_values = []
for number in range(10):
squared_values.append(number*number)
print(squared_values)
# Square a list using NumPy
import numpy as np
vector = np.array(range(10))
scalar = 5
print(vector * scalar)
print(vector * vector)
15.1.1.5. Numpy Basics#
NumPy arrays can be a 1-D array, called a vector, or a 2-D array, called a matrix
15.1.1.5.1. NumPy casting – covert Python list to a NumPy array#
my_list = [1, 2, 3]
print(my_list)
import numpy as np
my_vector = np.array(my_list)
print(my_vector)
my_matrix = np.array([my_list, my_list])
np.hstack([my_list, my_list])
my_nested_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
my_matrix = np.array(my_nested_list)
print(my_matrix)
my_list1 = [[1,2], [3,4]]
my_list2 = [[5,6], [7,8]]
np.hstack([my_list1, my_list2])
np.vstack([my_list1, my_list2])
15.1.1.5.2. NumPy creating arrays#
my_list = range(10)
## Create array using arange
np.arange(10)
np.arange(0, 10)
np.arange(0, 10, 2)
## Create array of zeros
np.zeros(3)
np.zeros((3,3))
## Create array of ones
np.ones(3)
np.ones((3,3))
np.ones(3)*4
## Create evenly spaced vector
### Example use case: when you have Y values for a plot but need to generate X values
### *** Includes both start an end
# np.arange(start, end(not included), step size)
# np.linspace(start, end(included), number_of_points)
np.linspace(0, 10, 5)
np.linspace(1900, 2000, 11)
15.1.2. Create an identify matrix#
np.eye(3)
15.1.3. Creating an empty array#
np.empty((2,3))
15.1.4. Creating Random Numbers#
# Uniform distribution
np.random.rand(3)
np.random.rand(3,3)
# Normal distribution
np.random.randn(3)
np.random.randn(3,3)
# Random integers
# np.random.randint(start, end(not_included), size)
np.random.randint(1,101)
np.random.randint(1,101,5)
15.1.5. Reshaping arrays#
vector = np.arange(1,10)
print(vector.reshape(3,3))
vector = np.arange(1,13)
print(vector.reshape(3,4))
15.1.6. Basic array operations#
vector = np.random.randint(1,50,25)
# Min
vector.min()
# Max
vector.max()
# get location of min value
index = vector.argmin()
# get location of max value
index = vector.argmax()
# get shape
vector.shape
my_matrix = vector.reshape(5, 5)
my_matrix.shape
15.1.7. Indexing a 1-D array – vector#
vector = np.array(range(10))
# vector[index]
# vector [start:end]
# vector [:end]
# vector [start:]
# vector [start, end, step]
vector[3]
vector[3:8]
vector[:5]
vector[5:]
vector[3:9:2]
vector[-1]
15.1.8. Setting multiple values at once – Broadcasting#
There are two main features of NumPy arrays
Broadcasting – set multiple values at once
Vectorization – no need for explicit looping – example, vector multiplication or squaring
vector[3:6] = 12
15.1.9. BE CAREFUL#
If you store a slice of an array in a new variable, changes in the new variable will be reflected in the original array.
vector = np.array(range(10))
my_slice = vector[3:7]
my_slice[:] = 20
print(vector)
Copy the array if you need a copy
vector = np.array(range(10))
my_slice_copy = vector[3:7].copy()
print(vector)
15.1.10. Indexing a 2-D array – Matrix#
Remember – Python is zero-indexed
matrix = np.array(range(1,10)).reshape((3,3))
matrix[0,0]
matrix[0][0]
matrix[2,2]
matrix[2][2]
matrix[:,2] # Grab the third column
matrix[1,:] # Grab the second row
matrix[:2] # grab the first two rows, all columns
matrix[:2,:] # grab the first two rows, all columns
matrix[:,1:] # grab all the rows, but columns starting from 1
15.1.11. Conditional selection#
vector = np.arange(10)
gt2 = vector > 2 # create condition
lt8 = vector < 8 # create condition
selected_gt2 = vector[gt2] # apply condition to select
selected_lt8 = vector[lt8] # apply condition to select
vector[vector>2]
vector[vector<8]
cond = (vector>2) & (vector<7)
vector[cond]
cond = (vector>=2) & (vector<=7)
vector[cond]
15.1.12. Array operations – Basic#
vector = np.arange(10)
vector + vector
vector - vector
vector * vector
vector / vector # problem!!! return `nan`
vector + 10
vector - 10
vector * 10
vector / 10
15.1.13. Array operations – Advanced#
vector = np.arange(10)
np.max(vector)
np.min(vector)
np.sqrt(vector)
np.log(vector)
sum(vector<5)
import math
vector = np.arange(1,11) * math.pi
np.sin(vector)
vector = np.arange(0,math.pi+math.pi/4,math.pi/4)
np.sin(vector)
matrix = np.random.rand(5,5)
np.floor(matrix*1000)/1000
np.round(matrix*1000)/1000
np.ceil(matrix*1000)/1000
matrix = np.arange(1,10).reshape(3,3)
matrix.sum(axis=1)
matrix.sum(axis=0)
matrix.cumsum()
matrix.cumprod()
matrix.min(axis=1)
matrix.min(axis=0)
matrix.max(axis=1)
matrix.max(axis=0)
matrix = np.array([1,2,3]*3).reshape(3,3)
np.unique(matrix.reshape(3,3))
15.1.14. Pandas#
Built-on top of NumPy – meaning the underlying data structure used is ndarray
Pandas provides series which are like NumPy arrays but with associated index labels – meaning are like column labels or row labels. Element data type can be different
Pandas also provides dataframes which are like Excel sheets or database tables
15.1.14.1. Basic examples#
import numpy as np
import pandas as pd
header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
vector = np.array(data)
data_dict = {'chrom': 4, 'pos': 12345, 'filter': 38.4}
s1 = pd.Series(data=data) # Notice the data type is float
s2 = pd.Series(data=data, index=header)
# can also do
s1 = pd.Series(data)
s2 = pd.Series(data, header)
# can hold different data types
data = [1, '2s', 34]
pd.Series(data)
# can use a dictionary to initialize a panda series
pd.Series(data_dict)
## Using index labels to fetch element
header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
series = pd.Series(data=data, index=header)
series['chrom']
series['filter']
series = pd.Series(data)
series[0]
## Basic operations
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = ['3', 4899, 234, 89.9]
s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2
data1 = [4, 12345, 38.4]
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1)
s2 = pd.Series(data2)
## IMPORTANT - with index labels -- operations are based on label
header1 = ['pos', 'filter', 'chrom']
data1 = [12345, 38.4, 4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2
data1 = [12345, 38.4, 4]
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1)
s2 = pd.Series(data2)
15.1.15. Dataframes –#
Dataframe is composed of series
import numpy as np
import pandas as pd
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df = pd.DataFrame(data=data, index=students, columns=header)
df['exam1']
df.exam1 # not a good way to do this
df['average'] = (df['exam1'] + df['exam2'] + df['exam3'])/3
df.drop('average') # does not work because default for drop is to work on row labels
df.drop('average', axis=1) # works on column labels
## STILL NOT DROPPED from df
df
df.drop('average', axis=1, inplace=True)
## drop a student
df.drop('student3')
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student3']
df = pd.DataFrame(data=data, index=students, columns=header)
df.drop('student3')
df.drop('student3', inplace=True)
## Row is referred to as axis=0
## Column is referred to as axis=1
## (R,C) == (axis=0, axis=1) df.shape
15.1.16. Select Dataframe rows#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df.loc['student1']
df.iloc[0] ## remember that column names do not count as rows
15.2. Three Methods for Selecting Data#
15.2.1. Method 1 – Square brackets#
15.2.2. Method 1.1 – index/text – #column wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df['StudentName'])
0 student1
1 student2
2 student3
3 student4
4 student5
5 student6
6 student7
7 student8
8 student9
9 student10
Name: StudentName, dtype: object
15.2.3. Method 1.2 – list, NOT TUPLE # column wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[['StudentName', 'E1']])
| StudentName | E1 | |
|---|---|---|
| 0 | student1 | 92 |
| 1 | student2 | 74 |
| 2 | student3 | 83 |
| 3 | student4 | 100 |
| 4 | student5 | 77 |
| 5 | student6 | 100 |
| 6 | student7 | 66 |
| 7 | student8 | 97 |
| 8 | student9 | 95 |
| 9 | student10 | 78 |
15.2.4. Method 1.3 – a slice # row wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| 1 | student2 | 74 | 93 | 88 | 67 | 85 |
| 2 | student3 | 83 | 96 | 74 | 79 | 92 |
| 3 | student4 | 100 | 72 | 83 | 85 | 66 |
| 4 | student5 | 77 | 96 | 66 | 79 | 92 |
15.2.5. Method 1.4?? – slice and list?#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, ['E1', 'E2']])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:173, in pandas._libs.index.IndexEngine.get_loc()
TypeError: '(slice(1, 5, None), ['E1', 'E2'])' is an invalid key
During handling of the above exception, another exception occurred:
InvalidIndexError Traceback (most recent call last)
Cell In[4], line 4
2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, ['E1', 'E2']])
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3817, in Index.get_loc(self, key)
3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
-> 3817 self._check_indexing_error(key)
3818 raise
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:6059, in Index._check_indexing_error(self, key)
6055 def _check_indexing_error(self, key):
6056 if not is_scalar(key):
6057 # if key is not a scalar, directly raise an error (the code below
6058 # would convert to numpy arrays and raise later any way) - GH29926
-> 6059 raise InvalidIndexError(key)
InvalidIndexError: (slice(1, 5, None), ['E1', 'E2'])
15.2.6. Method 1.5?? – slice and slice?#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, 1:5])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: (slice(1, 5, None), slice(1, 5, None))
During handling of the above exception, another exception occurred:
InvalidIndexError Traceback (most recent call last)
Cell In[5], line 4
2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, 1:5])
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3811, in Index.get_loc(self, key)
3806 except KeyError as err:
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
-> 3811 raise InvalidIndexError(key)
3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
InvalidIndexError: (slice(1, 5, None), slice(1, 5, None))
15.2.7. Method 2 – .iloc#
It uses square braces [] and not ()
It uses numerical index
15.2.7.1. Method 2.1 – index only!!! Cannot index by location index with a non-integer key#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0])
StudentName student1
E1 92
E2 77
E3 87
E4 77
E5 94
Name: 0, dtype: object
15.2.7.2. Method 2.2 – list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2]])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| 0 | student1 | 92 | 77 | 87 | 77 | 94 |
| 1 | student2 | 74 | 93 | 88 | 67 | 85 |
| 2 | student3 | 83 | 96 | 74 | 79 | 92 |
15.2.7.3. Method 2.3 – list and slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], :])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| 0 | student1 | 92 | 77 | 87 | 77 | 94 |
| 1 | student2 | 74 | 93 | 88 | 67 | 85 |
| 2 | student3 | 83 | 96 | 74 | 79 | 92 |
15.2.7.4. Method 2.4 – list and list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], [0, 1]])
| StudentName | E1 | |
|---|---|---|
| 0 | student1 | 92 |
| 1 | student2 | 74 |
| 2 | student3 | 83 |
15.2.7.5. Method 2.5 – slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| 0 | student1 | 92 | 77 | 87 | 77 | 94 |
| 1 | student2 | 74 | 93 | 88 | 67 | 85 |
| 2 | student3 | 83 | 96 | 74 | 79 | 92 |
15.2.7.6. Method 2.6 – slice and list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[:, [1, 2]])
| E1 | E2 | |
|---|---|---|
| 0 | 92 | 77 |
| 1 | 74 | 93 |
| 2 | 83 | 96 |
| 3 | 100 | 72 |
| 4 | 77 | 96 |
| 5 | 100 | 86 |
| 6 | 66 | 91 |
| 7 | 97 | 86 |
| 8 | 95 | 98 |
| 9 | 78 | 76 |
15.2.8. Method 2.7 – sliceS#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3, 1:3])
| E1 | E2 | |
|---|---|---|
| 0 | 92 | 77 |
| 1 | 74 | 93 |
| 2 | 83 | 96 |
15.2.9. Method 3 – .loc#
If you did not specify the index, then numbers are used!
Warning
When you use .loc for indexing, it includes the end point. This is unlike .iloc, which does not include the end point.
15.2.10. Method 3.1 – index # row-wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.loc[3])
StudentName student4
E1 100
E2 72
E3 83
E4 85
E5 66
Name: 3, dtype: object
15.2.11. Method 3.2 – txt # row-wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student4'])
StudentName student4
E1 100
E2 72
E3 83
E4 85
E5 66
Name: student4, dtype: object
15.2.12. Method 3.3 – list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1']])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| StudentName | ||||||
| student4 | student4 | 100 | 72 | 83 | 85 | 66 |
| student1 | student1 | 92 | 77 | 87 | 77 | 94 |
15.2.13. Method 3.4 – listS#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1'], ['E1', 'E2']])
| E1 | E2 | |
|---|---|---|
| StudentName | ||
| student4 | 100 | 72 |
| student1 | 92 | 77 |
15.2.14. Method 3.5 – slice and list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, ['E1', 'E2']])
| E1 | E2 | |
|---|---|---|
| StudentName | ||
| student1 | 92 | 77 |
| student2 | 74 | 93 |
| student3 | 83 | 96 |
| student4 | 100 | 72 |
| student5 | 77 | 96 |
| student6 | 100 | 86 |
| student7 | 66 | 91 |
| student8 | 97 | 86 |
| student9 | 95 | 98 |
| student10 | 78 | 76 |
15.2.15. Method 3.6 – list and Slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student1', 'student2'], :])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| StudentName | ||||||
| student1 | student1 | 92 | 77 | 87 | 77 | 94 |
| student2 | student2 | 74 | 93 | 88 | 67 | 85 |
15.2.16. Method 3.7 – slice and Slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5', :])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| StudentName | ||||||
| student1 | student1 | 92 | 77 | 87 | 77 | 94 |
| student2 | student2 | 74 | 93 | 88 | 67 | 85 |
| student3 | student3 | 83 | 96 | 74 | 79 | 92 |
| student4 | student4 | 100 | 72 | 83 | 85 | 66 |
| student5 | student5 | 77 | 96 | 66 | 79 | 92 |
15.2.17. Method 3.8 – slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5'])
| StudentName | E1 | E2 | E3 | E4 | E5 | |
|---|---|---|---|---|---|---|
| StudentName | ||||||
| student1 | student1 | 92 | 77 | 87 | 77 | 94 |
| student2 | student2 | 74 | 93 | 88 | 67 | 85 |
| student3 | student3 | 83 | 96 | 74 | 79 | 92 |
| student4 | student4 | 100 | 72 | 83 | 85 | 66 |
| student5 | student5 | 77 | 96 | 66 | 79 | 92 |
15.2.18. Method 3.9 – slice and slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, 'E1':'E4'])
| E1 | E2 | E3 | E4 | |
|---|---|---|---|---|
| StudentName | ||||
| student1 | 92 | 77 | 87 | 77 |
| student2 | 74 | 93 | 88 | 67 |
| student3 | 83 | 96 | 74 | 79 |
| student4 | 100 | 72 | 83 | 85 |
| student5 | 77 | 96 | 66 | 79 |
| student6 | 100 | 86 | 84 | 70 |
| student7 | 66 | 91 | 94 | 97 |
| student8 | 97 | 86 | 75 | 69 |
| student9 | 95 | 98 | 99 | 85 |
| student10 | 78 | 76 | 73 | 88 |
15.2.19. iterrows – iterate over rows#
iterate over the data row-wise. The first element is the index and second element is the row values
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for idx, row in df.iterrows():
print(idx, row)
0 StudentName student1
E1 92
E2 77
E3 87
E4 77
E5 94
Name: 0, dtype: object
1 StudentName student2
E1 74
E2 93
E3 88
E4 67
E5 85
Name: 1, dtype: object
2 StudentName student3
E1 83
E2 96
E3 74
E4 79
E5 92
Name: 2, dtype: object
3 StudentName student4
E1 100
E2 72
E3 83
E4 85
E5 66
Name: 3, dtype: object
4 StudentName student5
E1 77
E2 96
E3 66
E4 79
E5 92
Name: 4, dtype: object
5 StudentName student6
E1 100
E2 86
E3 84
E4 70
E5 71
Name: 5, dtype: object
6 StudentName student7
E1 66
E2 91
E3 94
E4 97
E5 80
Name: 6, dtype: object
7 StudentName student8
E1 97
E2 86
E3 75
E4 69
E5 88
Name: 7, dtype: object
8 StudentName student9
E1 95
E2 98
E3 99
E4 85
E5 86
Name: 8, dtype: object
9 StudentName student10
E1 78
E2 76
E3 73
E4 88
E5 86
Name: 9, dtype: object
15.2.20. items – iterate over columns#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for column_name, column in df.items():
print(column_name, column)
StudentName 0 student1
1 student2
2 student3
3 student4
4 student5
5 student6
6 student7
7 student8
8 student9
9 student10
Name: StudentName, dtype: object
E1 0 92
1 74
2 83
3 100
4 77
5 100
6 66
7 97
8 95
9 78
Name: E1, dtype: object
E2 0 77
1 93
2 96
3 72
4 96
5 86
6 91
7 86
8 98
9 76
Name: E2, dtype: object
E3 0 87
1 88
2 74
3 83
4 66
5 84
6 94
7 75
8 99
9 73
Name: E3, dtype: object
E4 0 77
1 67
2 79
3 85
4 79
5 70
6 97
7 69
8 85
9 88
Name: E4, dtype: object
E5 0 94
1 85
2 92
3 66
4 92
5 71
6 80
7 88
8 86
9 86
Name: E5, dtype: object
15.2.21. Select subset of data#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df.loc['student1', 'exam1']
df.loc[['student1', 'student3'], ['exam1', 'exam3']]
15.2.22. Use conditions to select#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df>=90
df[df>=90]
df['exam1']>=85 #
df[df['exam1']>=85] # gives all columns where exam1 is greater than 85
df[df['exam1']>=85]['exam3']
df[df['exam1']>=85][['exam2', 'exam3']]
df[(df['exam1']>=85) & (df['exam2']>=85)]
df[(df['exam1']>=85) & (df['exam2']>=85)]['exam3']
df[(df['exam1']>=85) | (df['exam2']>=85)]
df[(df['exam1']>=85) | (df['exam2']>=85)]['exam3']
15.2.23. Adding student index#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df['name'] = students
df.set_index('name', inplace=True)
df.loc['student1']
df.reset_index(inplace=True)
15.2.24. Multi-index data#
students = 'student1 student1 student1 student2 student2 student2 student3 student3 student3'
exams = 'exam1 exam2 exam3'.split()*3
classes = 'class1 class2'
index = list(zip(students.split(), exams))
index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(np.random.randint(65, 101, 3*3*2).reshape(9,2) , index, classes.split())
df.loc['student1'].loc['exam1']['class1']
df.index.names
df.index.names = ['Students', 'Exams']
## cross-section
df.xs('student1')
df.xs('exam1', level='Exams')
15.2.25. Dealing with missing data#
my_dict = {'student1': [90, 84, np.nan], 'student2': [77, np.nan, np.nan], 'student3': [88, 65, 93]}
df = pd.DataFrame(my_dict)
df.dropna()
df.dropna(axis=0)
df.dropna(axis=1)
df.dropna(thresh=2)
df.fillna(value=55)
df.drop(axis=0, labels=[1,2])
df.drop(axis=1, columns=['student1'])
15.2.26. Groupby#
Note
In previous versions of pandas, non-numeric columns were silently ignored. Now, however, we need to either explicitly choose numeric columns or ask mean to do it for us with numeric_only=True keyword argument.
import pandas as pd
import numpy as np
my_dict = {
'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
'Students': 'student1 student2 student3'.split()*3,
'Scores': np.random.randint(65,101,9)
}
df = pd.DataFrame(my_dict)
df.groupby('Students').mean(numeric_only=True)
df.groupby('Students').mean(numeric_only=True).loc['student1']
df.groupby('Exams').max(numeric_only=True)['Scores']
df.groupby('Exams').describe()
df.groupby('Students').describe().transpose()
import pandas as pd
import numpy as np
my_dict = {
'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
'Students': 'student1 student2 student3'.split() * 3,
'Scores': np.random.randint(65, 101, 9)
}
df = pd.DataFrame(my_dict)
# Select only numeric columns
new_df = df.set_index("Students")
new_df.select_dtypes(include=[np.number]).groupby("Students").mean()
import pandas as pd
import numpy as np
# Create a sample DataFrame with a multi-level index
data = {
'Scores': np.random.randint(65, 101, 9),
}
index = pd.MultiIndex.from_tuples(
[('student1', 'exam1'), ('student1', 'exam2'), ('student1', 'exam3'),
('student2', 'exam1'), ('student2', 'exam2'), ('student2', 'exam3'),
('student3', 'exam1'), ('student3', 'exam2'), ('student3', 'exam3')],
names=['Students', 'Exams']
)
df = pd.DataFrame(data, index=index)
# Group by the first level of the index ('Students')
grouped_df = df.groupby(level='Students').mean()
display(grouped_df)
15.2.27. Merging – SQL JOIN#
departments = {
'DepartmentId': [1, 2, 3, 4],
'DepartmentName': ['IT', 'Physics', 'Arts', 'Math']
}
df1 = pd.DataFrame(departments)
students = {
'StudentId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'StudentName': ['Michael', 'John', 'Jack', 'Sara', 'Sally', 'Jena', 'Nancy', 'Adam', 'Stevens', 'George'],
'DepartmentId': [1, 1, 1, 2, 2, np.nan, 2, 3, 3, np.nan]
}
df2 = pd.DataFrame(students)
marks = {
'MarkId': [1, 2, 3, 4, 5, 6, 7, 8],
'StudentId': [1, 2, 3, 4, 5, 6, 7, 8],
'Mark': [18, 20, 16, 19, 14, 20, 20, 20]
}
df3 = pd.DataFrame(marks)
pd.merge(df2, df1, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='outer', on='DepartmentId')
pd.merge(df2, df1, how='right', on='DepartmentId')
pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data = pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data[['StudentName', 'Mark', 'DepartmentName']]
15.2.28. Concatenation#
d1 = {
'C0': ['C0R0', 'C0R1', 'C0R2'],
'C1': ['C1R0', 'C1R1', 'C2R2'],
'C2': ['C2R0', 'C2R1', 'C2R2'],
}
df1 = pd.DataFrame(d1)
d2 = {
'C0': ['C0R3', 'C0R4', 'C0R5'],
'C1': ['C1R3', 'C1R4', 'C1R5'],
'C2': ['C2R3', 'C2R4', 'C2R5'],
}
df2 = pd.DataFrame(d2)
d3 = {
'C0': ['C0R6', 'C0R7', 'C0R8'],
'C1': ['C1R6', 'C1R7', 'C1R8'],
'C2': ['C2R6', 'C2R7', 'C2R8'],
}
df3 = pd.DataFrame(d3)
pd.concat([df1, df2, df3])
## Concatenation -- Fix index
d1 = {
'C0': ['COR0', 'COR1', 'COR2'],
'C1': ['C1R0', 'C1R1', 'C2R2'],
'C2': ['C2R0', 'C2R1', 'C2R2'],
}
df1 = pd.DataFrame(d1, index=[1, 2, 3])
d2 = {
'C0': ['C0R3', 'C0R4', 'C0R5'],
'C1': ['C1R3', 'C1R4', 'C1R5'],
'C2': ['C2R3', 'C2R4', 'C2R5'],
}
df2 = pd.DataFrame(d2, index=[4, 5, 6])
d3 = {
'C0': ['C0R6', 'C0R7', 'C0R8'],
'C1': ['C1R6', 'C1R7', 'C1R8'],
'C2': ['C2R6', 'C2R7', 'C2R8'],
}
df3 = pd.DataFrame(d3, index=[7, 8, 9])
pd.concat([df1, df2, df3])
# Better way
d1 = {
'C0': ['C0R0', 'C0R1', 'C0R2'],
'C1': ['C1R0', 'C1R1', 'C2R2'],
'C2': ['C2R0', 'C2R1', 'C2R2'],
}
df1 = pd.DataFrame(d1)
d2 = {
'C0': ['C0R3', 'C0R4', 'C0R5'],
'C1': ['C1R3', 'C1R4', 'C1R5'],
'C2': ['C2R3', 'C2R4', 'C2R5'],
}
df2 = pd.DataFrame(d2)
d3 = {
'C0': ['C0R6', 'C0R7', 'C0R8'],
'C1': ['C1R6', 'C1R7', 'C1R8'],
'C2': ['C2R6', 'C2R7', 'C2R8'],
}
df3 = pd.DataFrame(d3)
concat = pd.concat([df1, df2, df3])
concat = concat.reset_index().drop("index", axis=1)
concat
15.2.29. More Pandas Operations#
data['DepartmentName'].unique()
data['DepartmentName'].nunique()
data['DepartmentName'].value_counts()
data[data['Mark']>17]
15.2.30. Lambda with Pandas#
Scale marks by 5
def times5(val):
return val * 5
data['Mark'].apply(times5)
data['Mark'].apply(lambda val: val*5)
Upper all department names
def upper(string):
return string.upper()
data['DepartmentName'].apply(upper)
data['DepartmentName'].apply(lambda string: string.upper())
mapping = {18: 'B', 14: 'C', 19: 'A-', 20: 'A+'}
df3['Mark'].map(mapping)
15.2.31. Dropping columns#
data.columns
data.drop(['StudentId', 'MarkId' , 'DepartmentId'], axis=1)
15.2.32. Sorting#
data.sort_values('Mark')
data.sort_values('Mark', ascending=False)
15.2.33. Importing CSV, TSV#
data = pd.read_csv('students.tsv', sep='\t', names=['lastname', 'firstname', 'username', 'exam1', 'exam2', 'exam3'])
data.sort_values('exam1', ascending=False)
data[['exam1', 'exam2', 'exam3']].mean()
data['average']= np.mean(data[['exam1', 'exam2', 'exam3']], axis=1)
data.sort_values('average', ascending=False)
data.to_csv('output.tsv', sep='\t', index=False, header=False)
15.2.34. Other methods#
data.head()
data.head(2)
data.tail
data.tail(3)
data.shape
data.iloc[3]
data.columns
data.dtypes
data.info()
data.value_counts()
data.describe()
15.2.35. More data manipulation#
data[data['exam1'].between(75, 85)]
data[data['exam1'].isin([75, 85, 95])]
data['exam1'].unique()
data['exam1'].nunique()
np.sort(data['exam1'].unique())
15.2.36. Misc#
Iterate over the data frame
Where: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
Plotting: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
to_csv
to_sql: https://datacarpentry.org/python-ecology-lesson/09-working-with-sql/index.html
15.2.37. A Previous Example in Pandas#
import numpy as np
import pandas as pd
employees = {
'EmployeeID': ['EN1-26', 'EN1-33', 'EN1-35', 'EN1-36', 'EN1-38', 'EN1-40'],
'Last_Name': ["O'Brien", "Guya", "Baranco", "Roslyn", "Schaaf", "Wing"],
'First_Name': ["Sean", "Amy", "Steven", "Elizabeth", "Carol", "Alexandra" ]
}
df_employees = pd.DataFrame(employees)
projects = {
'ProjectNum': [
"30-452-T3",
"30-457-T3",
"30-482-TC",
"31-124-T3",
"31-238-TC",
"31-238-TC2",
"35-152-TC",
"36-272-TC"
],
'ProjectTitle': [
"STAR manual",
"ISO procedures",
"Web site",
"Employee handbook",
"STAR prototype",
"New catalog",
"STAR pricing",
"Order system"
]
}
df_projects = pd.DataFrame(projects)
print(df_projects)
employees_projects = {
'EmployeeID': [
"EN1-26",
"EN1-26",
"EN1-26",
"EN1-33",
"EN1-33",
"EN1-33",
"EN1-35",
"EN1-35",
"EN1-36",
"EN1-38",
"EN1-40",
"EN1-40",
],
'ProjectNum' : [
"30-452-T3",
"30-457-T3",
"31-124-T3",
"30-328-TC",
"30-452-T3",
"32-244-T3",
"30-452-T3",
"31-238-TC",
"35-152-TC",
"36-272-TC",
"31-238-TC2",
"31-241-TC",
]
}
df_employees_projects = pd.DataFrame(employees_projects)
data = pd.merge(pd.merge(df_employees_projects, df_employees, how='left', on='EmployeeID'), df_projects, how='left', on='ProjectNum')
print(data)
15.2.38. References#
Ref: https://www.kaggle.com/shovitbhari/pandas-75-exercises-with-solutions Ref: https://pythonspeed.com/memory/