15.1. Numpy and Pandas#
15.1.1. NumPy#
15.1.1.1. Description#
Ref: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
NumPy provides a multidimensional array object.
Each object comes with an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Basically, you can use NumPy to create ndarray (N-dimensional array) and easily manipulate the array. It is super fast because it is written in C.
15.1.1.2. How is it different from Python arrays?#
Python lists can be modified – you can add and remove elements. NumPy arrays have a fixed size at creation.
Python lists can contain different data types. NumPy arrays can only have one data type. If you put in mixed types, they become a string.
NumPy arrays come prepackaged with advanced mathematical operations. The operations are super fast even on large numbers of data and they use less memory.
15.1.1.3. Why use NumPy#
Most data analysis programs use NumPy to manipulate data. They might take in data as standard Python lists, but they convert it to a NumPy array and manipulate the data using NumPy routines and output the transformed data as a NumPy array.
NumPy data array is the main data type used in most scientific and mathematical Python-based packages.
15.1.1.4. Simple example#
# Square a list using Python
squared_values = []
for number in range(10):
squared_values.append(number*number)
print(squared_values)
# Square a list using NumPy
import numpy as np
vector = np.array(range(10))
scalar = 5
print(vector * scalar)
print(vector * vector)
15.1.1.5. Numpy Basics#
NumPy arrays can be a 1-D array, called a vector, or a 2-D array, called a matrix
15.1.1.5.1. NumPy casting – covert Python list to a NumPy array#
my_list = [1, 2, 3]
print(my_list)
import numpy as np
my_vector = np.array(my_list)
print(my_vector)
my_matrix = np.array([my_list, my_list])
np.hstack([my_list, my_list])
my_nested_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
my_matrix = np.array(my_nested_list)
print(my_matrix)
my_list1 = [[1,2], [3,4]]
my_list2 = [[5,6], [7,8]]
np.hstack([my_list1, my_list2])
np.vstack([my_list1, my_list2])
15.1.1.5.2. NumPy creating arrays#
my_list = range(10)
## Create array using arange
np.arange(10)
np.arange(0, 10)
np.arange(0, 10, 2)
## Create array of zeros
np.zeros(3)
np.zeros((3,3))
## Create array of ones
np.ones(3)
np.ones((3,3))
np.ones(3)*4
## Create evenly spaced vector
### Example use case: when you have Y values for a plot but need to generate X values
### *** Includes both start an end
# np.arange(start, end(not included), step size)
# np.linspace(start, end(included), number_of_points)
np.linspace(0, 10, 5)
np.linspace(1900, 2000, 11)
15.1.2. Create an identify matrix#
np.eye(3)
15.1.3. Creating an empty array#
np.empty((2,3))
15.1.4. Creating Random Numbers#
# Uniform distribution
np.random.rand(3)
np.random.rand(3,3)
# Normal distribution
np.random.randn(3)
np.random.randn(3,3)
# Random integers
# np.random.randint(start, end(not_included), size)
np.random.randint(1,101)
np.random.randint(1,101,5)
15.1.5. Reshaping arrays#
vector = np.arange(1,10)
print(vector.reshape(3,3))
vector = np.arange(1,13)
print(vector.reshape(3,4))
15.1.6. Basic array operations#
vector = np.random.randint(1,50,25)
# Min
vector.min()
# Max
vector.max()
# get location of min value
index = vector.argmin()
# get location of max value
index = vector.argmax()
# get shape
vector.shape
my_matrix = vector.reshape(5, 5)
my_matrix.shape
15.1.7. Indexing a 1-D array – vector#
vector = np.array(range(10))
# vector[index]
# vector [start:end]
# vector [:end]
# vector [start:]
# vector [start, end, step]
vector[3]
vector[3:8]
vector[:5]
vector[5:]
vector[3:9:2]
vector[-1]
15.1.8. Setting multiple values at once – Broadcasting#
There are two main features of NumPy arrays
Broadcasting – set multiple values at once
Vectorization – no need for explicit looping – example, vector multiplication or squaring
vector[3:6] = 12
15.1.9. BE CAREFUL#
If you store a slice of an array in a new variable, changes in the new variable will be reflected in the original array.
vector = np.array(range(10))
my_slice = vector[3:7]
my_slice[:] = 20
print(vector)
Copy the array if you need a copy
vector = np.array(range(10))
my_slice_copy = vector[3:7].copy()
print(vector)
15.1.10. Indexing a 2-D array – Matrix#
Remember – Python is zero-indexed
matrix = np.array(range(1,10)).reshape((3,3))
matrix[0,0]
matrix[0][0]
matrix[2,2]
matrix[2][2]
matrix[:,2] # Grab the third column
matrix[1,:] # Grab the second row
matrix[:2] # grab the first two rows, all columns
matrix[:2,:] # grab the first two rows, all columns
matrix[:,1:] # grab all the rows, but columns starting from 1
15.1.11. Conditional selection#
vector = np.arange(10)
gt2 = vector > 2 # create condition
lt8 = vector < 8 # create condition
selected_gt2 = vector[gt2] # apply condition to select
selected_lt8 = vector[lt8] # apply condition to select
vector[vector>2]
vector[vector<8]
cond = (vector>2) & (vector<7)
vector[cond]
cond = (vector>=2) & (vector<=7)
vector[cond]
15.1.12. Array operations – Basic#
vector = np.arange(10)
vector + vector
vector - vector
vector * vector
vector / vector # problem!!! return `nan`
vector + 10
vector - 10
vector * 10
vector / 10
15.1.13. Array operations – Advanced#
vector = np.arange(10)
np.max(vector)
np.min(vector)
np.sqrt(vector)
np.log(vector)
sum(vector<5)
import math
vector = np.arange(1,11) * math.pi
np.sin(vector)
vector = np.arange(0,math.pi+math.pi/4,math.pi/4)
np.sin(vector)
matrix = np.random.rand(5,5)
np.floor(matrix*1000)/1000
np.round(matrix*1000)/1000
np.ceil(matrix*1000)/1000
matrix = np.arange(1,10).reshape(3,3)
matrix.sum(axis=1)
matrix.sum(axis=0)
matrix.cumsum()
matrix.cumprod()
matrix.min(axis=1)
matrix.min(axis=0)
matrix.max(axis=1)
matrix.max(axis=0)
matrix = np.array([1,2,3]*3).reshape(3,3)
np.unique(matrix.reshape(3,3))
15.1.14. Pandas#
Built-on top of NumPy – meaning the underlying data structure used is ndarray
Pandas provides series which are like NumPy arrays but with associated index labels – meaning are like column labels or row labels. Element data type can be different
Pandas also provides dataframes which are like Excel sheets or database tables
15.1.14.1. Basic examples#
import numpy as np
import pandas as pd
header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
vector = np.array(data)
data_dict = {'chrom': 4, 'pos': 12345, 'filter': 38.4}
s1 = pd.Series(data=data) # Notice the data type is float
s2 = pd.Series(data=data, index=header)
# can also do
s1 = pd.Series(data)
s2 = pd.Series(data, header)
# can hold different data types
data = [1, '2s', 34]
pd.Series(data)
# can use a dictionary to initialize a panda series
pd.Series(data_dict)
## Using index labels to fetch element
header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
series = pd.Series(data=data, index=header)
series['chrom']
series['filter']
series = pd.Series(data)
series[0]
## Basic operations
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = ['3', 4899, 234, 89.9]
s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2
data1 = [4, 12345, 38.4]
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1)
s2 = pd.Series(data2)
## IMPORTANT - with index labels -- operations are based on label
header1 = ['pos', 'filter', 'chrom']
data1 = [12345, 38.4, 4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2
data1 = [12345, 38.4, 4]
data2 = [3, 4899, 234, 89.9]
s1 = pd.Series(data1)
s2 = pd.Series(data2)
15.1.15. Dataframes –#
Dataframe is composed of series
import numpy as np
import pandas as pd
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df = pd.DataFrame(data=data, index=students, columns=header)
df['exam1']
df.exam1 # not a good way to do this
df['average'] = (df['exam1'] + df['exam2'] + df['exam3'])/3
df.drop('average') # does not work because default for drop is to work on row labels
df.drop('average', axis=1) # works on column labels
## STILL NOT DROPPED from df
df
df.drop('average', axis=1, inplace=True)
## drop a student
df.drop('student3')
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student3']
df = pd.DataFrame(data=data, index=students, columns=header)
df.drop('student3')
df.drop('student3', inplace=True)
## Row is referred to as axis=0
## Column is referred to as axis=1
## (R,C) == (axis=0, axis=1) df.shape
15.1.16. Select Dataframe rows#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df.loc['student1']
df.iloc[0] ## remember that column names do not count as rows
15.2. Three Methods for Selecting Data#
15.2.1. Method 1 – Square brackets#
15.2.2. Method 1.1 – index/text – #column wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df['StudentName'])
0 student1
1 student2
2 student3
3 student4
4 student5
5 student6
6 student7
7 student8
8 student9
9 student10
Name: StudentName, dtype: object
15.2.3. Method 1.2 – list, NOT TUPLE # column wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[['StudentName', 'E1']])
StudentName | E1 | |
---|---|---|
0 | student1 | 92 |
1 | student2 | 74 |
2 | student3 | 83 |
3 | student4 | 100 |
4 | student5 | 77 |
5 | student6 | 100 |
6 | student7 | 66 |
7 | student8 | 97 |
8 | student9 | 95 |
9 | student10 | 78 |
15.2.4. Method 1.3 – a slice # row wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
1 | student2 | 74 | 93 | 88 | 67 | 85 |
2 | student3 | 83 | 96 | 74 | 79 | 92 |
3 | student4 | 100 | 72 | 83 | 85 | 66 |
4 | student5 | 77 | 96 | 66 | 79 | 92 |
15.2.5. Method 1.4?? – slice and list?#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, ['E1', 'E2']])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:173, in pandas._libs.index.IndexEngine.get_loc()
TypeError: '(slice(1, 5, None), ['E1', 'E2'])' is an invalid key
During handling of the above exception, another exception occurred:
InvalidIndexError Traceback (most recent call last)
Cell In[4], line 4
2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, ['E1', 'E2']])
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3817, in Index.get_loc(self, key)
3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
-> 3817 self._check_indexing_error(key)
3818 raise
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:6059, in Index._check_indexing_error(self, key)
6055 def _check_indexing_error(self, key):
6056 if not is_scalar(key):
6057 # if key is not a scalar, directly raise an error (the code below
6058 # would convert to numpy arrays and raise later any way) - GH29926
-> 6059 raise InvalidIndexError(key)
InvalidIndexError: (slice(1, 5, None), ['E1', 'E2'])
15.2.6. Method 1.5?? – slice and slice?#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, 1:5])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: (slice(1, 5, None), slice(1, 5, None))
During handling of the above exception, another exception occurred:
InvalidIndexError Traceback (most recent call last)
Cell In[5], line 4
2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, 1:5])
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3811, in Index.get_loc(self, key)
3806 except KeyError as err:
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
-> 3811 raise InvalidIndexError(key)
3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
InvalidIndexError: (slice(1, 5, None), slice(1, 5, None))
15.2.7. Method 2 – .iloc#
It uses square braces [] and not ()
It uses numerical index
15.2.7.1. Method 2.1 – index only!!! Cannot index by location index with a non-integer key#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0])
StudentName student1
E1 92
E2 77
E3 87
E4 77
E5 94
Name: 0, dtype: object
15.2.7.2. Method 2.2 – list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2]])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
0 | student1 | 92 | 77 | 87 | 77 | 94 |
1 | student2 | 74 | 93 | 88 | 67 | 85 |
2 | student3 | 83 | 96 | 74 | 79 | 92 |
15.2.7.3. Method 2.3 – list and slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], :])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
0 | student1 | 92 | 77 | 87 | 77 | 94 |
1 | student2 | 74 | 93 | 88 | 67 | 85 |
2 | student3 | 83 | 96 | 74 | 79 | 92 |
15.2.7.4. Method 2.4 – list and list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], [0, 1]])
StudentName | E1 | |
---|---|---|
0 | student1 | 92 |
1 | student2 | 74 |
2 | student3 | 83 |
15.2.7.5. Method 2.5 – slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
0 | student1 | 92 | 77 | 87 | 77 | 94 |
1 | student2 | 74 | 93 | 88 | 67 | 85 |
2 | student3 | 83 | 96 | 74 | 79 | 92 |
15.2.7.6. Method 2.6 – slice and list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[:, [1, 2]])
E1 | E2 | |
---|---|---|
0 | 92 | 77 |
1 | 74 | 93 |
2 | 83 | 96 |
3 | 100 | 72 |
4 | 77 | 96 |
5 | 100 | 86 |
6 | 66 | 91 |
7 | 97 | 86 |
8 | 95 | 98 |
9 | 78 | 76 |
15.2.8. Method 2.7 – sliceS#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3, 1:3])
E1 | E2 | |
---|---|---|
0 | 92 | 77 |
1 | 74 | 93 |
2 | 83 | 96 |
15.2.9. Method 3 – .loc#
If you did not specify the index, then numbers are used!
Warning
When you use .loc
for indexing, it includes the end point. This is unlike .iloc
, which does not include the end point.
15.2.10. Method 3.1 – index # row-wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.loc[3])
StudentName student4
E1 100
E2 72
E3 83
E4 85
E5 66
Name: 3, dtype: object
15.2.11. Method 3.2 – txt # row-wise#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student4'])
StudentName student4
E1 100
E2 72
E3 83
E4 85
E5 66
Name: student4, dtype: object
15.2.12. Method 3.3 – list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1']])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
StudentName | ||||||
student4 | student4 | 100 | 72 | 83 | 85 | 66 |
student1 | student1 | 92 | 77 | 87 | 77 | 94 |
15.2.13. Method 3.4 – listS#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1'], ['E1', 'E2']])
E1 | E2 | |
---|---|---|
StudentName | ||
student4 | 100 | 72 |
student1 | 92 | 77 |
15.2.14. Method 3.5 – slice and list#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, ['E1', 'E2']])
E1 | E2 | |
---|---|---|
StudentName | ||
student1 | 92 | 77 |
student2 | 74 | 93 |
student3 | 83 | 96 |
student4 | 100 | 72 |
student5 | 77 | 96 |
student6 | 100 | 86 |
student7 | 66 | 91 |
student8 | 97 | 86 |
student9 | 95 | 98 |
student10 | 78 | 76 |
15.2.15. Method 3.6 – list and Slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student1', 'student2'], :])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
StudentName | ||||||
student1 | student1 | 92 | 77 | 87 | 77 | 94 |
student2 | student2 | 74 | 93 | 88 | 67 | 85 |
15.2.16. Method 3.7 – slice and Slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5', :])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
StudentName | ||||||
student1 | student1 | 92 | 77 | 87 | 77 | 94 |
student2 | student2 | 74 | 93 | 88 | 67 | 85 |
student3 | student3 | 83 | 96 | 74 | 79 | 92 |
student4 | student4 | 100 | 72 | 83 | 85 | 66 |
student5 | student5 | 77 | 96 | 66 | 79 | 92 |
15.2.17. Method 3.8 – slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5'])
StudentName | E1 | E2 | E3 | E4 | E5 | |
---|---|---|---|---|---|---|
StudentName | ||||||
student1 | student1 | 92 | 77 | 87 | 77 | 94 |
student2 | student2 | 74 | 93 | 88 | 67 | 85 |
student3 | student3 | 83 | 96 | 74 | 79 | 92 |
student4 | student4 | 100 | 72 | 83 | 85 | 66 |
student5 | student5 | 77 | 96 | 66 | 79 | 92 |
15.2.18. Method 3.9 – slice and slice#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, 'E1':'E4'])
E1 | E2 | E3 | E4 | |
---|---|---|---|---|
StudentName | ||||
student1 | 92 | 77 | 87 | 77 |
student2 | 74 | 93 | 88 | 67 |
student3 | 83 | 96 | 74 | 79 |
student4 | 100 | 72 | 83 | 85 |
student5 | 77 | 96 | 66 | 79 |
student6 | 100 | 86 | 84 | 70 |
student7 | 66 | 91 | 94 | 97 |
student8 | 97 | 86 | 75 | 69 |
student9 | 95 | 98 | 99 | 85 |
student10 | 78 | 76 | 73 | 88 |
15.2.19. iterrows – iterate over rows#
iterate over the data row-wise. The first element is the index and second element is the row values
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for idx, row in df.iterrows():
print(idx, row)
0 StudentName student1
E1 92
E2 77
E3 87
E4 77
E5 94
Name: 0, dtype: object
1 StudentName student2
E1 74
E2 93
E3 88
E4 67
E5 85
Name: 1, dtype: object
2 StudentName student3
E1 83
E2 96
E3 74
E4 79
E5 92
Name: 2, dtype: object
3 StudentName student4
E1 100
E2 72
E3 83
E4 85
E5 66
Name: 3, dtype: object
4 StudentName student5
E1 77
E2 96
E3 66
E4 79
E5 92
Name: 4, dtype: object
5 StudentName student6
E1 100
E2 86
E3 84
E4 70
E5 71
Name: 5, dtype: object
6 StudentName student7
E1 66
E2 91
E3 94
E4 97
E5 80
Name: 6, dtype: object
7 StudentName student8
E1 97
E2 86
E3 75
E4 69
E5 88
Name: 7, dtype: object
8 StudentName student9
E1 95
E2 98
E3 99
E4 85
E5 86
Name: 8, dtype: object
9 StudentName student10
E1 78
E2 76
E3 73
E4 88
E5 86
Name: 9, dtype: object
15.2.20. items – iterate over columns#
import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for column_name, column in df.items():
print(column_name, column)
StudentName 0 student1
1 student2
2 student3
3 student4
4 student5
5 student6
6 student7
7 student8
8 student9
9 student10
Name: StudentName, dtype: object
E1 0 92
1 74
2 83
3 100
4 77
5 100
6 66
7 97
8 95
9 78
Name: E1, dtype: object
E2 0 77
1 93
2 96
3 72
4 96
5 86
6 91
7 86
8 98
9 76
Name: E2, dtype: object
E3 0 87
1 88
2 74
3 83
4 66
5 84
6 94
7 75
8 99
9 73
Name: E3, dtype: object
E4 0 77
1 67
2 79
3 85
4 79
5 70
6 97
7 69
8 85
9 88
Name: E4, dtype: object
E5 0 94
1 85
2 92
3 66
4 92
5 71
6 80
7 88
8 86
9 86
Name: E5, dtype: object
15.2.21. Select subset of data#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df.loc['student1', 'exam1']
df.loc[['student1', 'student3'], ['exam1', 'exam3']]
15.2.22. Use conditions to select#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df>=90
df[df>=90]
df['exam1']>=85 #
df[df['exam1']>=85] # gives all columns where exam1 is greater than 85
df[df['exam1']>=85]['exam3']
df[df['exam1']>=85][['exam2', 'exam3']]
df[(df['exam1']>=85) & (df['exam2']>=85)]
df[(df['exam1']>=85) & (df['exam2']>=85)]['exam3']
df[(df['exam1']>=85) | (df['exam2']>=85)]
df[(df['exam1']>=85) | (df['exam2']>=85)]['exam3']
15.2.23. Adding student index#
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df['name'] = students
df.set_index('name', inplace=True)
df.loc['student1']
df.reset_index(inplace=True)
15.2.24. Multi-index data#
students = 'student1 student1 student1 student2 student2 student2 student3 student3 student3'
exams = 'exam1 exam2 exam3'.split()*3
classes = 'class1 class2'
index = list(zip(students.split(), exams))
index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(np.random.randint(65, 101, 3*3*2).reshape(9,2) , index, classes.split())
df.loc['student1'].loc['exam1']['class1']
df.index.names
df.index.names = ['Students', 'Exams']
## cross-section
df.xs('student1')
df.xs('exam1', level='Exams')
15.2.25. Dealing with missing data#
my_dict = {'student1': [90, 84, np.nan], 'student2': [77, np.nan, np.nan], 'student3': [88, 65, 93]}
df = pd.DataFrame(my_dict)
df.dropna()
df.dropna(axis=0)
df.dropna(axis=1)
df.dropna(thresh=2)
df.fillna(value=55)
df.drop(axis=0, labels=[1,2])
df.drop(axis=1, columns=['student1'])
15.2.26. Groupby#
Note
In previous versions of pandas, non-numeric columns were silently ignored. Now, however, we need to either explicitly choose numeric columns or ask mean to do it for us with numeric_only=True
keyword argument.
import pandas as pd
import numpy as np
my_dict = {
'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
'Students': 'student1 student2 student3'.split()*3,
'Scores': np.random.randint(65,101,9)
}
df = pd.DataFrame(my_dict)
df.groupby('Students').mean(numeric_only=True)
df.groupby('Students').mean(numeric_only=True).loc['student1']
df.groupby('Exams').max(numeric_only=True)['Scores']
df.groupby('Exams').describe()
df.groupby('Students').describe().transpose()
import pandas as pd
import numpy as np
my_dict = {
'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
'Students': 'student1 student2 student3'.split() * 3,
'Scores': np.random.randint(65, 101, 9)
}
df = pd.DataFrame(my_dict)
# Select only numeric columns
new_df = df.set_index("Students")
new_df.select_dtypes(include=[np.number]).groupby("Students").mean()
import pandas as pd
import numpy as np
# Create a sample DataFrame with a multi-level index
data = {
'Scores': np.random.randint(65, 101, 9),
}
index = pd.MultiIndex.from_tuples(
[('student1', 'exam1'), ('student1', 'exam2'), ('student1', 'exam3'),
('student2', 'exam1'), ('student2', 'exam2'), ('student2', 'exam3'),
('student3', 'exam1'), ('student3', 'exam2'), ('student3', 'exam3')],
names=['Students', 'Exams']
)
df = pd.DataFrame(data, index=index)
# Group by the first level of the index ('Students')
grouped_df = df.groupby(level='Students').mean()
display(grouped_df)
15.2.27. Merging – SQL JOIN#
departments = {
'DepartmentId': [1, 2, 3, 4],
'DepartmentName': ['IT', 'Physics', 'Arts', 'Math']
}
df1 = pd.DataFrame(departments)
students = {
'StudentId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'StudentName': ['Michael', 'John', 'Jack', 'Sara', 'Sally', 'Jena', 'Nancy', 'Adam', 'Stevens', 'George'],
'DepartmentId': [1, 1, 1, 2, 2, np.nan, 2, 3, 3, np.nan]
}
df2 = pd.DataFrame(students)
marks = {
'MarkId': [1, 2, 3, 4, 5, 6, 7, 8],
'StudentId': [1, 2, 3, 4, 5, 6, 7, 8],
'Mark': [18, 20, 16, 19, 14, 20, 20, 20]
}
df3 = pd.DataFrame(marks)
pd.merge(df2, df1, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='outer', on='DepartmentId')
pd.merge(df2, df1, how='right', on='DepartmentId')
pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data = pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data[['StudentName', 'Mark', 'DepartmentName']]
15.2.28. Concatenation#
d1 = {
'C0': ['C0R0', 'C0R1', 'C0R2'],
'C1': ['C1R0', 'C1R1', 'C2R2'],
'C2': ['C2R0', 'C2R1', 'C2R2'],
}
df1 = pd.DataFrame(d1)
d2 = {
'C0': ['C0R3', 'C0R4', 'C0R5'],
'C1': ['C1R3', 'C1R4', 'C1R5'],
'C2': ['C2R3', 'C2R4', 'C2R5'],
}
df2 = pd.DataFrame(d2)
d3 = {
'C0': ['C0R6', 'C0R7', 'C0R8'],
'C1': ['C1R6', 'C1R7', 'C1R8'],
'C2': ['C2R6', 'C2R7', 'C2R8'],
}
df3 = pd.DataFrame(d3)
pd.concat([df1, df2, df3])
## Concatenation -- Fix index
d1 = {
'C0': ['COR0', 'COR1', 'COR2'],
'C1': ['C1R0', 'C1R1', 'C2R2'],
'C2': ['C2R0', 'C2R1', 'C2R2'],
}
df1 = pd.DataFrame(d1, index=[1, 2, 3])
d2 = {
'C0': ['C0R3', 'C0R4', 'C0R5'],
'C1': ['C1R3', 'C1R4', 'C1R5'],
'C2': ['C2R3', 'C2R4', 'C2R5'],
}
df2 = pd.DataFrame(d2, index=[4, 5, 6])
d3 = {
'C0': ['C0R6', 'C0R7', 'C0R8'],
'C1': ['C1R6', 'C1R7', 'C1R8'],
'C2': ['C2R6', 'C2R7', 'C2R8'],
}
df3 = pd.DataFrame(d3, index=[7, 8, 9])
pd.concat([df1, df2, df3])
# Better way
d1 = {
'C0': ['C0R0', 'C0R1', 'C0R2'],
'C1': ['C1R0', 'C1R1', 'C2R2'],
'C2': ['C2R0', 'C2R1', 'C2R2'],
}
df1 = pd.DataFrame(d1)
d2 = {
'C0': ['C0R3', 'C0R4', 'C0R5'],
'C1': ['C1R3', 'C1R4', 'C1R5'],
'C2': ['C2R3', 'C2R4', 'C2R5'],
}
df2 = pd.DataFrame(d2)
d3 = {
'C0': ['C0R6', 'C0R7', 'C0R8'],
'C1': ['C1R6', 'C1R7', 'C1R8'],
'C2': ['C2R6', 'C2R7', 'C2R8'],
}
df3 = pd.DataFrame(d3)
concat = pd.concat([df1, df2, df3])
concat = concat.reset_index().drop("index", axis=1)
concat
15.2.29. More Pandas Operations#
data['DepartmentName'].unique()
data['DepartmentName'].nunique()
data['DepartmentName'].value_counts()
data[data['Mark']>17]
15.2.30. Lambda with Pandas#
Scale marks by 5
def times5(val):
return val * 5
data['Mark'].apply(times5)
data['Mark'].apply(lambda val: val*5)
Upper all department names
def upper(string):
return string.upper()
data['DepartmentName'].apply(upper)
data['DepartmentName'].apply(lambda string: string.upper())
mapping = {18: 'B', 14: 'C', 19: 'A-', 20: 'A+'}
df3['Mark'].map(mapping)
15.2.31. Dropping columns#
data.columns
data.drop(['StudentId', 'MarkId' , 'DepartmentId'], axis=1)
15.2.32. Sorting#
data.sort_values('Mark')
data.sort_values('Mark', ascending=False)
15.2.33. Importing CSV, TSV#
data = pd.read_csv('students.tsv', sep='\t', names=['lastname', 'firstname', 'username', 'exam1', 'exam2', 'exam3'])
data.sort_values('exam1', ascending=False)
data[['exam1', 'exam2', 'exam3']].mean()
data['average']= np.mean(data[['exam1', 'exam2', 'exam3']], axis=1)
data.sort_values('average', ascending=False)
data.to_csv('output.tsv', sep='\t', index=False, header=False)
15.2.34. Other methods#
data.head()
data.head(2)
data.tail
data.tail(3)
data.shape
data.iloc[3]
data.columns
data.dtypes
data.info()
data.value_counts()
data.describe()
15.2.35. More data manipulation#
data[data['exam1'].between(75, 85)]
data[data['exam1'].isin([75, 85, 95])]
data['exam1'].unique()
data['exam1'].nunique()
np.sort(data['exam1'].unique())
15.2.36. Misc#
Iterate over the data frame
Where: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
Plotting: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
to_csv
to_sql: https://datacarpentry.org/python-ecology-lesson/09-working-with-sql/index.html
15.2.37. A Previous Example in Pandas#
import numpy as np
import pandas as pd
employees = {
'EmployeeID': ['EN1-26', 'EN1-33', 'EN1-35', 'EN1-36', 'EN1-38', 'EN1-40'],
'Last_Name': ["O'Brien", "Guya", "Baranco", "Roslyn", "Schaaf", "Wing"],
'First_Name': ["Sean", "Amy", "Steven", "Elizabeth", "Carol", "Alexandra" ]
}
df_employees = pd.DataFrame(employees)
projects = {
'ProjectNum': [
"30-452-T3",
"30-457-T3",
"30-482-TC",
"31-124-T3",
"31-238-TC",
"31-238-TC2",
"35-152-TC",
"36-272-TC"
],
'ProjectTitle': [
"STAR manual",
"ISO procedures",
"Web site",
"Employee handbook",
"STAR prototype",
"New catalog",
"STAR pricing",
"Order system"
]
}
df_projects = pd.DataFrame(projects)
print(df_projects)
employees_projects = {
'EmployeeID': [
"EN1-26",
"EN1-26",
"EN1-26",
"EN1-33",
"EN1-33",
"EN1-33",
"EN1-35",
"EN1-35",
"EN1-36",
"EN1-38",
"EN1-40",
"EN1-40",
],
'ProjectNum' : [
"30-452-T3",
"30-457-T3",
"31-124-T3",
"30-328-TC",
"30-452-T3",
"32-244-T3",
"30-452-T3",
"31-238-TC",
"35-152-TC",
"36-272-TC",
"31-238-TC2",
"31-241-TC",
]
}
df_employees_projects = pd.DataFrame(employees_projects)
data = pd.merge(pd.merge(df_employees_projects, df_employees, how='left', on='EmployeeID'), df_projects, how='left', on='ProjectNum')
print(data)
15.2.38. References#
Ref: https://www.kaggle.com/shovitbhari/pandas-75-exercises-with-solutions Ref: https://pythonspeed.com/memory/