Numpy and Pandas

Contents

15.1. Numpy and Pandas#

15.1.1. NumPy#

15.1.1.1. Description#

15.1.1.2. How is it different from Python arrays?#

  • Python lists can be modified – you can add and remove elements. NumPy arrays have a fixed size at creation.

  • Python lists can contain different data types. NumPy arrays can only have one data type. If you put in mixed types, they become a string.

  • NumPy arrays come prepackaged with advanced mathematical operations. The operations are super fast even on large numbers of data and they use less memory.

15.1.1.3. Why use NumPy#

  • Most data analysis programs use NumPy to manipulate data. They might take in data as standard Python lists, but they convert it to a NumPy array and manipulate the data using NumPy routines and output the transformed data as a NumPy array.

  • NumPy data array is the main data type used in most scientific and mathematical Python-based packages.

15.1.1.4. Simple example#

# Square a list using Python
squared_values = []
for number in range(10):
    squared_values.append(number*number)

print(squared_values)

# Square a list using NumPy
import numpy as np

vector = np.array(range(10))
scalar = 5
print(vector * scalar)
print(vector * vector)

15.1.1.5. Numpy Basics#

  • NumPy arrays can be a 1-D array, called a vector, or a 2-D array, called a matrix

15.1.1.5.1. NumPy casting – covert Python list to a NumPy array#

my_list = [1, 2, 3]
print(my_list)

import numpy as np

my_vector = np.array(my_list)
print(my_vector)

my_matrix = np.array([my_list, my_list])
np.hstack([my_list, my_list])

my_nested_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
my_matrix = np.array(my_nested_list)
print(my_matrix)

my_list1 = [[1,2], [3,4]]
my_list2 = [[5,6], [7,8]]
np.hstack([my_list1, my_list2])
np.vstack([my_list1, my_list2])

15.1.1.5.2. NumPy creating arrays#

my_list = range(10)

## Create array using arange
np.arange(10)
np.arange(0, 10)
np.arange(0, 10, 2)

## Create array of zeros
np.zeros(3)
np.zeros((3,3))

## Create array of ones
np.ones(3)
np.ones((3,3))
np.ones(3)*4

## Create evenly spaced vector
### Example use case: when you have Y values for a plot but need to generate X values
### *** Includes both start an end
# np.arange(start, end(not included), step size)
# np.linspace(start, end(included), number_of_points)
np.linspace(0, 10, 5)
np.linspace(1900, 2000, 11)

15.1.2. Create an identify matrix#

np.eye(3)

15.1.3. Creating an empty array#

np.empty((2,3))

15.1.4. Creating Random Numbers#

# Uniform distribution
np.random.rand(3)
np.random.rand(3,3)
# Normal distribution
np.random.randn(3)
np.random.randn(3,3)
# Random integers
# np.random.randint(start, end(not_included), size)
np.random.randint(1,101)
np.random.randint(1,101,5)

15.1.5. Reshaping arrays#


vector = np.arange(1,10)
print(vector.reshape(3,3))

vector = np.arange(1,13)
print(vector.reshape(3,4))

15.1.6. Basic array operations#

vector = np.random.randint(1,50,25)
# Min
vector.min()
# Max
vector.max()
# get location of min value
index = vector.argmin()
# get location of max value
index = vector.argmax()
# get shape
vector.shape

my_matrix = vector.reshape(5, 5)
my_matrix.shape

15.1.7. Indexing a 1-D array – vector#

vector = np.array(range(10))
# vector[index]
# vector [start:end]
# vector [:end]
# vector [start:]
# vector [start, end, step]

vector[3]
vector[3:8]
vector[:5]
vector[5:]
vector[3:9:2]
vector[-1]

15.1.8. Setting multiple values at once – Broadcasting#

  • There are two main features of NumPy arrays

    • Broadcasting – set multiple values at once

    • Vectorization – no need for explicit looping – example, vector multiplication or squaring

vector[3:6] = 12

15.1.9. BE CAREFUL#

  • If you store a slice of an array in a new variable, changes in the new variable will be reflected in the original array.

vector = np.array(range(10))
my_slice = vector[3:7]
my_slice[:] = 20
print(vector)
  • Copy the array if you need a copy

vector = np.array(range(10))
my_slice_copy = vector[3:7].copy()
print(vector)

15.1.10. Indexing a 2-D array – Matrix#

  • Remember – Python is zero-indexed

matrix = np.array(range(1,10)).reshape((3,3))
matrix[0,0]
matrix[0][0]
matrix[2,2]
matrix[2][2]


matrix[:,2] # Grab the third column
matrix[1,:] # Grab the second row
matrix[:2] # grab the first two rows, all columns
matrix[:2,:] # grab the first two rows, all columns
matrix[:,1:] # grab all the rows, but columns starting from 1

15.1.11. Conditional selection#

vector = np.arange(10)
gt2 = vector > 2 # create condition
lt8 = vector < 8 # create condition

selected_gt2 = vector[gt2] # apply condition to select
selected_lt8 = vector[lt8] # apply condition to select

vector[vector>2]
vector[vector<8]

cond = (vector>2) & (vector<7)
vector[cond]

cond = (vector>=2) & (vector<=7)
vector[cond]

15.1.12. Array operations – Basic#

vector = np.arange(10)

vector + vector
vector - vector
vector * vector
vector / vector # problem!!! return `nan`
vector + 10
vector - 10
vector * 10
vector / 10

15.1.13. Array operations – Advanced#

vector = np.arange(10)
np.max(vector)
np.min(vector)
np.sqrt(vector)
np.log(vector)

sum(vector<5)
import math
vector = np.arange(1,11) * math.pi
np.sin(vector)
vector = np.arange(0,math.pi+math.pi/4,math.pi/4)
np.sin(vector)

matrix = np.random.rand(5,5)
np.floor(matrix*1000)/1000
np.round(matrix*1000)/1000
np.ceil(matrix*1000)/1000

matrix = np.arange(1,10).reshape(3,3)
matrix.sum(axis=1)
matrix.sum(axis=0)
matrix.cumsum()
matrix.cumprod()

matrix.min(axis=1)
matrix.min(axis=0)

matrix.max(axis=1)
matrix.max(axis=0)

matrix = np.array([1,2,3]*3).reshape(3,3)
np.unique(matrix.reshape(3,3))

15.1.14. Pandas#

  • Built-on top of NumPy – meaning the underlying data structure used is ndarray

  • Pandas provides series which are like NumPy arrays but with associated index labels – meaning are like column labels or row labels. Element data type can be different

  • Pandas also provides dataframes which are like Excel sheets or database tables

15.1.14.1. Basic examples#

import numpy as np
import pandas as pd

header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]

vector = np.array(data)
data_dict = {'chrom': 4, 'pos': 12345, 'filter': 38.4}

s1 = pd.Series(data=data) # Notice the data type is float
s2 = pd.Series(data=data, index=header)

# can also do
s1 = pd.Series(data)
s2 = pd.Series(data, header)

# can hold different data types

data = [1, '2s', 34]
pd.Series(data)

# can use a dictionary to initialize a panda series
pd.Series(data_dict)

## Using index labels to fetch element
header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
series = pd.Series(data=data, index=header)
series['chrom']
series['filter']

series = pd.Series(data)
series[0]


## Basic operations
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2

header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = ['3', 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2

data1 = [4, 12345, 38.4]
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1)
s2 = pd.Series(data2)


## IMPORTANT - with index labels -- operations are based on label
header1 = ['pos', 'filter', 'chrom']
data1 = [12345, 38.4, 4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2


data1 = [12345, 38.4, 4]
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1)
s2 = pd.Series(data2)

15.1.15. Dataframes –#

import numpy as np
import pandas as pd

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']

df = pd.DataFrame(data=data, columns=header)

df = pd.DataFrame(data=data, index=students, columns=header)
df['exam1']
df.exam1 # not a good way to do this


df['average'] = (df['exam1'] + df['exam2'] + df['exam3'])/3
df.drop('average') # does not work because default for drop is to work on row labels
df.drop('average', axis=1) # works on column labels

## STILL NOT DROPPED from df
df
df.drop('average', axis=1, inplace=True)

## drop a student
df.drop('student3')

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student3']
df = pd.DataFrame(data=data, index=students, columns=header)

df.drop('student3')
df.drop('student3', inplace=True)
## Row is referred to as axis=0
## Column is referred to as axis=1
## (R,C) == (axis=0, axis=1) df.shape

15.1.16. Select Dataframe rows#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df.loc['student1']
df.iloc[0] ## remember that column names do not count as rows

15.2. Three Methods for Selecting Data#

15.2.1. Method 1 – Square brackets#

15.2.2. Method 1.1 – index/text – #column wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df['StudentName'])
0     student1
1     student2
2     student3
3     student4
4     student5
5     student6
6     student7
7     student8
8     student9
9    student10
Name: StudentName, dtype: object

15.2.3. Method 1.2 – list, NOT TUPLE # column wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[['StudentName', 'E1']])
StudentName E1
0 student1 92
1 student2 74
2 student3 83
3 student4 100
4 student5 77
5 student6 100
6 student7 66
7 student8 97
8 student9 95
9 student10 78

15.2.4. Method 1.3 – a slice # row wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5])
StudentName E1 E2 E3 E4 E5
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92
3 student4 100 72 83 85 66
4 student5 77 96 66 79 92

15.2.5. Method 1.4?? – slice and list?#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, ['E1', 'E2']])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:173, in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(1, 5, None), ['E1', 'E2'])' is an invalid key

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[4], line 4
      2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
      3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, ['E1', 'E2']])

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3817, in Index.get_loc(self, key)
   3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
-> 3817     self._check_indexing_error(key)
   3818     raise

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:6059, in Index._check_indexing_error(self, key)
   6055 def _check_indexing_error(self, key):
   6056     if not is_scalar(key):
   6057         # if key is not a scalar, directly raise an error (the code below
   6058         # would convert to numpy arrays and raise later any way) - GH29926
-> 6059         raise InvalidIndexError(key)

InvalidIndexError: (slice(1, 5, None), ['E1', 'E2'])

15.2.6. Method 1.5?? – slice and slice?#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, 1:5])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: (slice(1, 5, None), slice(1, 5, None))

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[5], line 4
      2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
      3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, 1:5])

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3811, in Index.get_loc(self, key)
   3806 except KeyError as err:
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
-> 3811         raise InvalidIndexError(key)
   3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.

InvalidIndexError: (slice(1, 5, None), slice(1, 5, None))

15.2.7. Method 2 – .iloc#

  • It uses square braces [] and not ()

  • It uses numerical index

15.2.7.1. Method 2.1 – index only!!! Cannot index by location index with a non-integer key#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0])
StudentName    student1
E1                   92
E2                   77
E3                   87
E4                   77
E5                   94
Name: 0, dtype: object

15.2.7.2. Method 2.2 – list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2]])
StudentName E1 E2 E3 E4 E5
0 student1 92 77 87 77 94
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92

15.2.7.3. Method 2.3 – list and slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], :])
StudentName E1 E2 E3 E4 E5
0 student1 92 77 87 77 94
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92

15.2.7.4. Method 2.4 – list and list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], [0, 1]])
StudentName E1
0 student1 92
1 student2 74
2 student3 83

15.2.7.5. Method 2.5 – slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3])
StudentName E1 E2 E3 E4 E5
0 student1 92 77 87 77 94
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92

15.2.7.6. Method 2.6 – slice and list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[:, [1, 2]])
E1 E2
0 92 77
1 74 93
2 83 96
3 100 72
4 77 96
5 100 86
6 66 91
7 97 86
8 95 98
9 78 76

15.2.8. Method 2.7 – sliceS#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3, 1:3])
E1 E2
0 92 77
1 74 93
2 83 96

15.2.9. Method 3 – .loc#

  • If you did not specify the index, then numbers are used!

Warning

When you use .loc for indexing, it includes the end point. This is unlike .iloc, which does not include the end point.

15.2.10. Method 3.1 – index # row-wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.loc[3])
StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: 3, dtype: object

15.2.11. Method 3.2 – txt # row-wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student4'])
StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: student4, dtype: object

15.2.12. Method 3.3 – list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1']])
StudentName E1 E2 E3 E4 E5
StudentName
student4 student4 100 72 83 85 66
student1 student1 92 77 87 77 94

15.2.13. Method 3.4 – listS#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1'], ['E1', 'E2']])
E1 E2
StudentName
student4 100 72
student1 92 77

15.2.14. Method 3.5 – slice and list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, ['E1', 'E2']])
E1 E2
StudentName
student1 92 77
student2 74 93
student3 83 96
student4 100 72
student5 77 96
student6 100 86
student7 66 91
student8 97 86
student9 95 98
student10 78 76

15.2.15. Method 3.6 – list and Slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student1', 'student2'], :])
StudentName E1 E2 E3 E4 E5
StudentName
student1 student1 92 77 87 77 94
student2 student2 74 93 88 67 85

15.2.16. Method 3.7 – slice and Slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5', :])
StudentName E1 E2 E3 E4 E5
StudentName
student1 student1 92 77 87 77 94
student2 student2 74 93 88 67 85
student3 student3 83 96 74 79 92
student4 student4 100 72 83 85 66
student5 student5 77 96 66 79 92

15.2.17. Method 3.8 – slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5'])
StudentName E1 E2 E3 E4 E5
StudentName
student1 student1 92 77 87 77 94
student2 student2 74 93 88 67 85
student3 student3 83 96 74 79 92
student4 student4 100 72 83 85 66
student5 student5 77 96 66 79 92

15.2.18. Method 3.9 – slice and slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, 'E1':'E4'])
E1 E2 E3 E4
StudentName
student1 92 77 87 77
student2 74 93 88 67
student3 83 96 74 79
student4 100 72 83 85
student5 77 96 66 79
student6 100 86 84 70
student7 66 91 94 97
student8 97 86 75 69
student9 95 98 99 85
student10 78 76 73 88

15.2.19. iterrows – iterate over rows#

  • iterate over the data row-wise. The first element is the index and second element is the row values

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for idx, row in df.iterrows():
    print(idx, row)
0 StudentName    student1
E1                   92
E2                   77
E3                   87
E4                   77
E5                   94
Name: 0, dtype: object
1 StudentName    student2
E1                   74
E2                   93
E3                   88
E4                   67
E5                   85
Name: 1, dtype: object
2 StudentName    student3
E1                   83
E2                   96
E3                   74
E4                   79
E5                   92
Name: 2, dtype: object
3 StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: 3, dtype: object
4 StudentName    student5
E1                   77
E2                   96
E3                   66
E4                   79
E5                   92
Name: 4, dtype: object
5 StudentName    student6
E1                  100
E2                   86
E3                   84
E4                   70
E5                   71
Name: 5, dtype: object
6 StudentName    student7
E1                   66
E2                   91
E3                   94
E4                   97
E5                   80
Name: 6, dtype: object
7 StudentName    student8
E1                   97
E2                   86
E3                   75
E4                   69
E5                   88
Name: 7, dtype: object
8 StudentName    student9
E1                   95
E2                   98
E3                   99
E4                   85
E5                   86
Name: 8, dtype: object
9 StudentName    student10
E1                    78
E2                    76
E3                    73
E4                    88
E5                    86
Name: 9, dtype: object

15.2.20. items – iterate over columns#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for column_name, column in df.items():
    print(column_name, column)
StudentName 0     student1
1     student2
2     student3
3     student4
4     student5
5     student6
6     student7
7     student8
8     student9
9    student10
Name: StudentName, dtype: object
E1 0     92
1     74
2     83
3    100
4     77
5    100
6     66
7     97
8     95
9     78
Name: E1, dtype: object
E2 0    77
1    93
2    96
3    72
4    96
5    86
6    91
7    86
8    98
9    76
Name: E2, dtype: object
E3 0    87
1    88
2    74
3    83
4    66
5    84
6    94
7    75
8    99
9    73
Name: E3, dtype: object
E4 0    77
1    67
2    79
3    85
4    79
5    70
6    97
7    69
8    85
9    88
Name: E4, dtype: object
E5 0    94
1    85
2    92
3    66
4    92
5    71
6    80
7    88
8    86
9    86
Name: E5, dtype: object

15.2.21. Select subset of data#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df.loc['student1', 'exam1']
df.loc[['student1', 'student3'], ['exam1', 'exam3']]

15.2.22. Use conditions to select#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df>=90
df[df>=90]
df['exam1']>=85 #
df[df['exam1']>=85] # gives all columns where exam1 is greater than 85
df[df['exam1']>=85]['exam3']
df[df['exam1']>=85][['exam2', 'exam3']]


df[(df['exam1']>=85) & (df['exam2']>=85)]
df[(df['exam1']>=85) & (df['exam2']>=85)]['exam3']

df[(df['exam1']>=85) | (df['exam2']>=85)]
df[(df['exam1']>=85) | (df['exam2']>=85)]['exam3']

15.2.23. Adding student index#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df['name'] = students

df.set_index('name', inplace=True)
df.loc['student1']
df.reset_index(inplace=True)

15.2.24. Multi-index data#

students = 'student1 student1 student1 student2 student2 student2 student3 student3 student3'
exams = 'exam1 exam2 exam3'.split()*3
classes = 'class1 class2'
index = list(zip(students.split(), exams))
index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(np.random.randint(65, 101, 3*3*2).reshape(9,2) , index, classes.split())
df.loc['student1'].loc['exam1']['class1']
df.index.names
df.index.names = ['Students', 'Exams']

## cross-section
df.xs('student1')
df.xs('exam1', level='Exams')

15.2.25. Dealing with missing data#

my_dict = {'student1': [90, 84, np.nan], 'student2': [77, np.nan, np.nan], 'student3': [88, 65, 93]}
df = pd.DataFrame(my_dict)
df.dropna()
df.dropna(axis=0)
df.dropna(axis=1)

df.dropna(thresh=2)
df.fillna(value=55)

df.drop(axis=0, labels=[1,2])
df.drop(axis=1, columns=['student1'])

15.2.26. Groupby#

Note

In previous versions of pandas, non-numeric columns were silently ignored. Now, however, we need to either explicitly choose numeric columns or ask mean to do it for us with numeric_only=True keyword argument.

import pandas as pd
import numpy as np

my_dict = {
    'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
    'Students': 'student1 student2 student3'.split()*3,
    'Scores': np.random.randint(65,101,9)
}
df = pd.DataFrame(my_dict)
df.groupby('Students').mean(numeric_only=True)
df.groupby('Students').mean(numeric_only=True).loc['student1']
df.groupby('Exams').max(numeric_only=True)['Scores']
df.groupby('Exams').describe()
df.groupby('Students').describe().transpose()
import pandas as pd
import numpy as np
my_dict = {
  'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
  'Students': 'student1 student2 student3'.split() * 3,
  'Scores': np.random.randint(65, 101, 9)
}
df = pd.DataFrame(my_dict)

# Select only numeric columns
new_df = df.set_index("Students")
new_df.select_dtypes(include=[np.number]).groupby("Students").mean()
import pandas as pd
import numpy as np

# Create a sample DataFrame with a multi-level index
data = {
  'Scores': np.random.randint(65, 101, 9),
}
index = pd.MultiIndex.from_tuples(
  [('student1', 'exam1'), ('student1', 'exam2'), ('student1', 'exam3'),
   ('student2', 'exam1'), ('student2', 'exam2'), ('student2', 'exam3'),
   ('student3', 'exam1'), ('student3', 'exam2'), ('student3', 'exam3')],
  names=['Students', 'Exams']
)

df = pd.DataFrame(data, index=index)
# Group by the first level of the index ('Students')
grouped_df = df.groupby(level='Students').mean()

display(grouped_df)

15.2.27. Merging – SQL JOIN#

departments = {
    'DepartmentId': [1, 2, 3, 4],
    'DepartmentName': ['IT', 'Physics', 'Arts', 'Math']
}

df1 = pd.DataFrame(departments)

students = {
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'StudentName': ['Michael', 'John', 'Jack', 'Sara', 'Sally', 'Jena', 'Nancy', 'Adam', 'Stevens', 'George'],
    'DepartmentId': [1, 1, 1, 2, 2, np.nan, 2, 3, 3, np.nan]
}

df2 = pd.DataFrame(students)

marks = {
    'MarkId': [1, 2, 3, 4, 5, 6, 7, 8],
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8],
    'Mark': [18, 20, 16, 19, 14, 20, 20, 20]
}

df3 = pd.DataFrame(marks)

pd.merge(df2, df1, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='outer', on='DepartmentId')
pd.merge(df2, df1, how='right', on='DepartmentId')

pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data = pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')

data[['StudentName', 'Mark', 'DepartmentName']]

15.2.28. Concatenation#

d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1)

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2)

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3)

pd.concat([df1, df2, df3])

## Concatenation -- Fix index

d1 = {
    'C0': ['COR0', 'COR1', 'COR2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1, index=[1, 2, 3])

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2, index=[4, 5, 6])

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3, index=[7, 8, 9])

pd.concat([df1, df2, df3])

# Better way

d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1)

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2)

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3)

concat = pd.concat([df1, df2, df3])
concat = concat.reset_index().drop("index", axis=1)
concat

15.2.29. More Pandas Operations#

data['DepartmentName'].unique()
data['DepartmentName'].nunique()
data['DepartmentName'].value_counts()
data[data['Mark']>17]

15.2.30. Lambda with Pandas#

  • Scale marks by 5


def times5(val):
    return val * 5

data['Mark'].apply(times5)

data['Mark'].apply(lambda val: val*5)

  • Upper all department names

def upper(string):
    return string.upper()

data['DepartmentName'].apply(upper)

data['DepartmentName'].apply(lambda string: string.upper())

mapping = {18: 'B', 14: 'C', 19: 'A-', 20: 'A+'}
df3['Mark'].map(mapping)


15.2.31. Dropping columns#

data.columns
data.drop(['StudentId', 'MarkId' , 'DepartmentId'], axis=1)

15.2.32. Sorting#

data.sort_values('Mark')
data.sort_values('Mark', ascending=False)

15.2.33. Importing CSV, TSV#


data = pd.read_csv('students.tsv', sep='\t', names=['lastname', 'firstname', 'username', 'exam1', 'exam2', 'exam3'])

data.sort_values('exam1', ascending=False)

data[['exam1', 'exam2', 'exam3']].mean()
data['average']= np.mean(data[['exam1', 'exam2', 'exam3']], axis=1)
data.sort_values('average', ascending=False)

data.to_csv('output.tsv', sep='\t', index=False, header=False)

15.2.34. Other methods#

data.head()
data.head(2)

data.tail
data.tail(3)

data.shape
data.iloc[3]
data.columns
data.dtypes
data.info()
data.value_counts()
data.describe()

15.2.35. More data manipulation#

data[data['exam1'].between(75, 85)]
data[data['exam1'].isin([75, 85, 95])]
data['exam1'].unique()
data['exam1'].nunique()
np.sort(data['exam1'].unique())

15.2.36. Misc#

15.2.37. A Previous Example in Pandas#

import numpy as np
import pandas as pd


employees = {
    'EmployeeID': ['EN1-26', 'EN1-33', 'EN1-35', 'EN1-36', 'EN1-38', 'EN1-40'],
    'Last_Name': ["O'Brien", "Guya", "Baranco", "Roslyn", "Schaaf", "Wing"],
    'First_Name': ["Sean", "Amy", "Steven", "Elizabeth", "Carol", "Alexandra" ]
}

df_employees = pd.DataFrame(employees)

projects = {
    'ProjectNum': [
        "30-452-T3",
        "30-457-T3",
        "30-482-TC",
        "31-124-T3",
        "31-238-TC",
        "31-238-TC2",
        "35-152-TC",
        "36-272-TC"
    ],
    'ProjectTitle': [
        "STAR manual",
        "ISO procedures",
        "Web site",
        "Employee handbook",
        "STAR prototype",
        "New catalog",
        "STAR pricing",
        "Order system"
    ]
}


df_projects = pd.DataFrame(projects)

print(df_projects)

employees_projects = {
    'EmployeeID': [
        "EN1-26",
        "EN1-26",
        "EN1-26",
        "EN1-33",
        "EN1-33",
        "EN1-33",
        "EN1-35",
        "EN1-35",
        "EN1-36",
        "EN1-38",
        "EN1-40",
        "EN1-40",
    ],
    'ProjectNum' : [
        "30-452-T3",
        "30-457-T3",
        "31-124-T3",
        "30-328-TC",
        "30-452-T3",
        "32-244-T3",
        "30-452-T3",
        "31-238-TC",
        "35-152-TC",
        "36-272-TC",
        "31-238-TC2",
        "31-241-TC",
    ]


}

df_employees_projects = pd.DataFrame(employees_projects)

data = pd.merge(pd.merge(df_employees_projects, df_employees, how='left', on='EmployeeID'), df_projects,  how='left', on='ProjectNum')
print(data)

15.2.38. References#

Ref: https://www.kaggle.com/shovitbhari/pandas-75-exercises-with-solutions Ref: https://pythonspeed.com/memory/