Numpy and Pandas

15.1. Numpy and Pandas#

15.1.1. NumPy#

15.1.1.1. Description#

Ref: https://docs.scipy.org/doc/numpy/user/whatisnumpy.html
Ref: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
NumPy provides a multidimensional array object.
Each object comes with an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Basically, you can use NumPy to create ndarray (N-dimensional array) and easily manipulate the array. It is super fast because it is written in C.

15.1.1.2. How is it different from Python arrays?#

Python lists can be modified – you can add and remove elements. NumPy arrays have a fixed size at creation.
Python lists can contain different data types. NumPy arrays can only have one data type. If you put in mixed types, they become a string.
NumPy arrays come prepackaged with advanced mathematical operations. The operations are super fast even on large numbers of data and they use less memory.

15.1.1.3. Why use NumPy#

Most data analysis programs use NumPy to manipulate data. They might take in data as standard Python lists, but they convert it to a NumPy array and manipulate the data using NumPy routines and output the transformed data as a NumPy array.
NumPy data array is the main data type used in most scientific and mathematical Python-based packages.

15.1.1.4. Simple example#

# Square a list using Python
squared_values = []
for number in range(10):
    squared_values.append(number*number)

print(squared_values)

# Square a list using NumPy
import numpy as np

vector = np.array(range(10))
scalar = 5
print(vector * scalar)
print(vector * vector)

15.1.1.5. Numpy Basics#

NumPy arrays can be a 1-D array, called a vector, or a 2-D array, called a matrix

15.1.1.5.1. NumPy casting – covert Python list to a NumPy array#

my_list = [1, 2, 3]
print(my_list)

import numpy as np

my_vector = np.array(my_list)
print(my_vector)

my_matrix = np.array([my_list, my_list])
np.hstack([my_list, my_list])

my_nested_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
my_matrix = np.array(my_nested_list)
print(my_matrix)

my_list1 = [[1,2], [3,4]]
my_list2 = [[5,6], [7,8]]
np.hstack([my_list1, my_list2])
np.vstack([my_list1, my_list2])

15.1.1.5.2. NumPy creating arrays#

my_list = range(10)

## Create array using arange
np.arange(10)
np.arange(0, 10)
np.arange(0, 10, 2)

## Create array of zeros
np.zeros(3)
np.zeros((3,3))

## Create array of ones
np.ones(3)
np.ones((3,3))
np.ones(3)*4

## Create evenly spaced vector
### Example use case: when you have Y values for a plot but need to generate X values
### *** Includes both start an end
# np.arange(start, end(not included), step size)
# np.linspace(start, end(included), number_of_points)
np.linspace(0, 10, 5)
np.linspace(1900, 2000, 11)

15.1.2. Create an identify matrix#

np.eye(3)

15.1.3. Creating an empty array#

np.empty((2,3))

15.1.4. Creating Random Numbers#

Ref: https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html

# Uniform distribution
np.random.rand(3)
np.random.rand(3,3)
# Normal distribution
np.random.randn(3)
np.random.randn(3,3)
# Random integers
# np.random.randint(start, end(not_included), size)
np.random.randint(1,101)
np.random.randint(1,101,5)

15.1.5. Reshaping arrays#

vector = np.arange(1,10)
print(vector.reshape(3,3))

vector = np.arange(1,13)
print(vector.reshape(3,4))

15.1.6. Basic array operations#

vector = np.random.randint(1,50,25)
# Min
vector.min()
# Max
vector.max()
# get location of min value
index = vector.argmin()
# get location of max value
index = vector.argmax()
# get shape
vector.shape

my_matrix = vector.reshape(5, 5)
my_matrix.shape

15.1.7. Indexing a 1-D array – vector#

vector = np.array(range(10))
# vector[index]
# vector [start:end]
# vector [:end]
# vector [start:]
# vector [start, end, step]

vector[3]
vector[3:8]
vector[:5]
vector[5:]
vector[3:9:2]
vector[-1]

15.1.8. Setting multiple values at once – Broadcasting#

There are two main features of NumPy arrays
- Broadcasting – set multiple values at once
- Vectorization – no need for explicit looping – example, vector multiplication or squaring

vector[3:6] = 12

15.1.9. BE CAREFUL#

If you store a slice of an array in a new variable, changes in the new variable will be reflected in the original array.

vector = np.array(range(10))
my_slice = vector[3:7]
my_slice[:] = 20
print(vector)

Copy the array if you need a copy

vector = np.array(range(10))
my_slice_copy = vector[3:7].copy()
print(vector)

15.1.10. Indexing a 2-D array – Matrix#

Remember – Python is zero-indexed

matrix = np.array(range(1,10)).reshape((3,3))
matrix[0,0]
matrix[0][0]
matrix[2,2]
matrix[2][2]


matrix[:,2] # Grab the third column
matrix[1,:] # Grab the second row
matrix[:2] # grab the first two rows, all columns
matrix[:2,:] # grab the first two rows, all columns
matrix[:,1:] # grab all the rows, but columns starting from 1

15.1.11. Conditional selection#

vector = np.arange(10)
gt2 = vector > 2 # create condition
lt8 = vector < 8 # create condition

selected_gt2 = vector[gt2] # apply condition to select
selected_lt8 = vector[lt8] # apply condition to select

vector[vector>2]
vector[vector<8]

cond = (vector>2) & (vector<7)
vector[cond]

cond = (vector>=2) & (vector<=7)
vector[cond]

15.1.12. Array operations – Basic#

vector = np.arange(10)

vector + vector
vector - vector
vector * vector
vector / vector # problem!!! return `nan`
vector + 10
vector - 10
vector * 10
vector / 10

15.1.13. Array operations – Advanced#

vector = np.arange(10)
np.max(vector)
np.min(vector)
np.sqrt(vector)
np.log(vector)

sum(vector<5)
import math
vector = np.arange(1,11) * math.pi
np.sin(vector)
vector = np.arange(0,math.pi+math.pi/4,math.pi/4)
np.sin(vector)

matrix = np.random.rand(5,5)
np.floor(matrix*1000)/1000
np.round(matrix*1000)/1000
np.ceil(matrix*1000)/1000

matrix = np.arange(1,10).reshape(3,3)
matrix.sum(axis=1)
matrix.sum(axis=0)
matrix.cumsum()
matrix.cumprod()

matrix.min(axis=1)
matrix.min(axis=0)

matrix.max(axis=1)
matrix.max(axis=0)

matrix = np.array([1,2,3]*3).reshape(3,3)
np.unique(matrix.reshape(3,3))

15.1.14. Pandas#

Built-on top of NumPy – meaning the underlying data structure used is ndarray
Pandas provides series which are like NumPy arrays but with associated index labels – meaning are like column labels or row labels. Element data type can be different
Pandas also provides dataframes which are like Excel sheets or database tables

15.1.14.1. Basic examples#

import numpy as np
import pandas as pd

header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]

vector = np.array(data)
data_dict = {'chrom': 4, 'pos': 12345, 'filter': 38.4}

s1 = pd.Series(data=data) # Notice the data type is float
s2 = pd.Series(data=data, index=header)

# can also do
s1 = pd.Series(data)
s2 = pd.Series(data, header)

# can hold different data types

data = [1, '2s', 34]
pd.Series(data)

# can use a dictionary to initialize a panda series
pd.Series(data_dict)

## Using index labels to fetch element
header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
series = pd.Series(data=data, index=header)
series['chrom']
series['filter']

series = pd.Series(data)
series[0]


## Basic operations
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2

header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = ['3', 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2

data1 = [4, 12345, 38.4]
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1)
s2 = pd.Series(data2)


## IMPORTANT - with index labels -- operations are based on label
header1 = ['pos', 'filter', 'chrom']
data1 = [12345, 38.4, 4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2


data1 = [12345, 38.4, 4]
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1)
s2 = pd.Series(data2)

15.1.15. Dataframes –#

Dataframe is composed of series
Ref: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.html#pandas.io.formats.style.Styler

import numpy as np
import pandas as pd

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']

df = pd.DataFrame(data=data, columns=header)

df = pd.DataFrame(data=data, index=students, columns=header)
df['exam1']
df.exam1 # not a good way to do this


df['average'] = (df['exam1'] + df['exam2'] + df['exam3'])/3
df.drop('average') # does not work because default for drop is to work on row labels
df.drop('average', axis=1) # works on column labels

## STILL NOT DROPPED from df
df
df.drop('average', axis=1, inplace=True)

## drop a student
df.drop('student3')

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student3']
df = pd.DataFrame(data=data, index=students, columns=header)

df.drop('student3')
df.drop('student3', inplace=True)
## Row is referred to as axis=0
## Column is referred to as axis=1
## (R,C) == (axis=0, axis=1) df.shape

15.1.16. Select Dataframe rows#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df.loc['student1']
df.iloc[0] ## remember that column names do not count as rows

15.2. Three Methods for Selecting Data#

15.2.1. Method 1 – Square brackets#

15.2.2. Method 1.1 – index/text – #column wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df['StudentName'])

   student1
   student2
   student3
   student4
   student5
   student6
   student7
   student8
   student9
  student10
Name: StudentName, dtype: object

15.2.3. Method 1.2 – list, NOT TUPLE # column wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[['StudentName', 'E1']])

	StudentName	E1
0	student1	92
1	student2	74
2	student3	83
3	student4	100
4	student5	77
5	student6	100
6	student7	66
7	student8	97
8	student9	95
9	student10	78

15.2.4. Method 1.3 – a slice # row wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5])

	StudentName	E1	E2	E3	E4	E5
1	student2	74	93	88	67	85
2	student3	83	96	74	79	92
3	student4	100	72	83	85	66
4	student5	77	96	66	79	92

15.2.5. Method 1.4?? – slice and list?#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, ['E1', 'E2']])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:173, in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(1, 5, None), ['E1', 'E2'])' is an invalid key

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[4], line 4
      2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
      3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, ['E1', 'E2']])

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3817, in Index.get_loc(self, key)
   3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
-> 3817     self._check_indexing_error(key)
   3818     raise

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:6059, in Index._check_indexing_error(self, key)
   6055 def _check_indexing_error(self, key):
   6056     if not is_scalar(key):
   6057         # if key is not a scalar, directly raise an error (the code below
   6058         # would convert to numpy arrays and raise later any way) - GH29926
-> 6059         raise InvalidIndexError(key)

InvalidIndexError: (slice(1, 5, None), ['E1', 'E2'])

15.2.6. Method 1.5?? – slice and slice?#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df[1:5, 1:5])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: (slice(1, 5, None), slice(1, 5, None))

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[5], line 4
      2 columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
      3 df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
----> 4 display(df[1:5, 1:5])

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/anaconda3/envs/eas503book/lib/python3.12/site-packages/pandas/core/indexes/base.py:3811, in Index.get_loc(self, key)
   3806 except KeyError as err:
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
-> 3811         raise InvalidIndexError(key)
   3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.

InvalidIndexError: (slice(1, 5, None), slice(1, 5, None))

15.2.7. Method 2 – .iloc#

It uses square braces [] and not ()
It uses numerical index

15.2.7.1. Method 2.1 – index only!!! Cannot index by location index with a non-integer key#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0])

StudentName    student1
E1                   92
E2                   77
E3                   87
E4                   77
E5                   94
Name: 0, dtype: object

15.2.7.2. Method 2.2 – list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2]])

	StudentName	E1	E2	E3	E4	E5
0	student1	92	77	87	77	94
1	student2	74	93	88	67	85
2	student3	83	96	74	79	92

15.2.7.3. Method 2.3 – list and slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], :])

	StudentName	E1	E2	E3	E4	E5
0	student1	92	77	87	77	94
1	student2	74	93	88	67	85
2	student3	83	96	74	79	92

15.2.7.4. Method 2.4 – list and list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[[0, 1, 2], [0, 1]])

	StudentName	E1
0	student1	92
1	student2	74
2	student3	83

15.2.7.5. Method 2.5 – slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3])

	StudentName	E1	E2	E3	E4	E5
0	student1	92	77	87	77	94
1	student2	74	93	88	67	85
2	student3	83	96	74	79	92

15.2.7.6. Method 2.6 – slice and list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[:, [1, 2]])

	E1	E2
0	92	77
1	74	93
2	83	96
3	100	72
4	77	96
5	100	86
6	66	91
7	97	86
8	95	98
9	78	76

15.2.8. Method 2.7 – sliceS#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.iloc[0:3, 1:3])

	E1	E2
0	92	77
1	74	93
2	83	96

15.2.9. Method 3 – .loc#

If you did not specify the index, then numbers are used!

Warning

When you use .loc for indexing, it includes the end point. This is unlike .iloc, which does not include the end point.

15.2.10. Method 3.1 – index # row-wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
display(df.loc[3])

StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: 3, dtype: object

15.2.11. Method 3.2 – txt # row-wise#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student4'])

StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: student4, dtype: object

15.2.12. Method 3.3 – list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1']])

	StudentName	E1	E2	E3	E4	E5
StudentName
student4	student4	100	72	83	85	66
student1	student1	92	77	87	77	94

15.2.13. Method 3.4 – listS#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student4', 'student1'], ['E1', 'E2']])

	E1	E2
StudentName
student4	100	72
student1	92	77

15.2.14. Method 3.5 – slice and list#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, ['E1', 'E2']])

	E1	E2
StudentName
student1	92	77
student2	74	93
student3	83	96
student4	100	72
student5	77	96
student6	100	86
student7	66	91
student8	97	86
student9	95	98
student10	78	76

15.2.15. Method 3.6 – list and Slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[['student1', 'student2'], :])

	StudentName	E1	E2	E3	E4	E5
StudentName
student1	student1	92	77	87	77	94
student2	student2	74	93	88	67	85

15.2.16. Method 3.7 – slice and Slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5', :])

	StudentName	E1	E2	E3	E4	E5
StudentName
student1	student1	92	77	87	77	94
student2	student2	74	93	88	67	85
student3	student3	83	96	74	79	92
student4	student4	100	72	83	85	66
student5	student5	77	96	66	79	92

15.2.17. Method 3.8 – slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc['student1':'student5'])

	StudentName	E1	E2	E3	E4	E5
StudentName
student1	student1	92	77	87	77	94
student2	student2	74	93	88	67	85
student3	student3	83	96	74	79	92
student4	student4	100	72	83	85	66
student5	student5	77	96	66	79	92

15.2.18. Method 3.9 – slice and slice#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
df.set_index(df['StudentName'], inplace=True)
display(df.loc[:, 'E1':'E4'])

	E1	E2	E3	E4
StudentName
student1	92	77	87	77
student2	74	93	88	67
student3	83	96	74	79
student4	100	72	83	85
student5	77	96	66	79
student6	100	86	84	70
student7	66	91	94	97
student8	97	86	75	69
student9	95	98	99	85
student10	78	76	73	88

15.2.19. iterrows – iterate over rows#

iterate over the data row-wise. The first element is the index and second element is the row values

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for idx, row in df.iterrows():
    print(idx, row)

0 StudentName    student1
E1                   92
E2                   77
E3                   87
E4                   77
E5                   94
Name: 0, dtype: object
1 StudentName    student2
E1                   74
E2                   93
E3                   88
E4                   67
E5                   85
Name: 1, dtype: object
2 StudentName    student3
E1                   83
E2                   96
E3                   74
E4                   79
E5                   92
Name: 2, dtype: object
3 StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: 3, dtype: object
4 StudentName    student5
E1                   77
E2                   96
E3                   66
E4                   79
E5                   92
Name: 4, dtype: object
5 StudentName    student6
E1                  100
E2                   86
E3                   84
E4                   70
E5                   71
Name: 5, dtype: object
6 StudentName    student7
E1                   66
E2                   91
E3                   94
E4                   97
E5                   80
Name: 6, dtype: object
7 StudentName    student8
E1                   97
E2                   86
E3                   75
E4                   69
E5                   88
Name: 7, dtype: object
8 StudentName    student9
E1                   95
E2                   98
E3                   99
E4                   85
E5                   86
Name: 8, dtype: object
9 StudentName    student10
E1                    78
E2                    76
E3                    73
E4                    88
E5                    86
Name: 9, dtype: object

15.2.20. items – iterate over columns#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for column_name, column in df.items():
    print(column_name, column)

StudentName 0     student1
   student2
   student3
   student4
   student5
   student6
   student7
   student8
   student9
  student10
Name: StudentName, dtype: object
E1 0     92
   74
   83
  100
   77
  100
   66
   97
   95
   78
Name: E1, dtype: object
E2 0    77
  93
  96
  72
  96
  86
  91
  86
  98
  76
Name: E2, dtype: object
E3 0    87
  88
  74
  83
  66
  84
  94
  75
  99
  73
Name: E3, dtype: object
E4 0    77
  67
  79
  85
  79
  70
  97
  69
  85
  88
Name: E4, dtype: object
E5 0    94
  85
  92
  66
  92
  71
  80
  88
  86
  86
Name: E5, dtype: object

15.2.21. Select subset of data#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df.loc['student1', 'exam1']
df.loc[['student1', 'student3'], ['exam1', 'exam3']]

15.2.22. Use conditions to select#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df>=90
df[df>=90]
df['exam1']>=85 #
df[df['exam1']>=85] # gives all columns where exam1 is greater than 85
df[df['exam1']>=85]['exam3']
df[df['exam1']>=85][['exam2', 'exam3']]


df[(df['exam1']>=85) & (df['exam2']>=85)]
df[(df['exam1']>=85) & (df['exam2']>=85)]['exam3']

df[(df['exam1']>=85) | (df['exam2']>=85)]
df[(df['exam1']>=85) | (df['exam2']>=85)]['exam3']

15.2.23. Adding student index#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df['name'] = students

df.set_index('name', inplace=True)
df.loc['student1']
df.reset_index(inplace=True)

15.2.24. Multi-index data#

students = 'student1 student1 student1 student2 student2 student2 student3 student3 student3'
exams = 'exam1 exam2 exam3'.split()*3
classes = 'class1 class2'
index = list(zip(students.split(), exams))
index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(np.random.randint(65, 101, 3*3*2).reshape(9,2) , index, classes.split())
df.loc['student1'].loc['exam1']['class1']
df.index.names
df.index.names = ['Students', 'Exams']

## cross-section
df.xs('student1')
df.xs('exam1', level='Exams')

15.2.25. Dealing with missing data#

my_dict = {'student1': [90, 84, np.nan], 'student2': [77, np.nan, np.nan], 'student3': [88, 65, 93]}
df = pd.DataFrame(my_dict)
df.dropna()
df.dropna(axis=0)
df.dropna(axis=1)

df.dropna(thresh=2)
df.fillna(value=55)

df.drop(axis=0, labels=[1,2])
df.drop(axis=1, columns=['student1'])

15.2.26. Groupby#

Note

In previous versions of pandas, non-numeric columns were silently ignored. Now, however, we need to either explicitly choose numeric columns or ask mean to do it for us with numeric_only=True keyword argument.

import pandas as pd
import numpy as np

my_dict = {
    'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
    'Students': 'student1 student2 student3'.split()*3,
    'Scores': np.random.randint(65,101,9)
}
df = pd.DataFrame(my_dict)
df.groupby('Students').mean(numeric_only=True)
df.groupby('Students').mean(numeric_only=True).loc['student1']
df.groupby('Exams').max(numeric_only=True)['Scores']
df.groupby('Exams').describe()
df.groupby('Students').describe().transpose()

import pandas as pd
import numpy as np
my_dict = {
  'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
  'Students': 'student1 student2 student3'.split() * 3,
  'Scores': np.random.randint(65, 101, 9)
}
df = pd.DataFrame(my_dict)

# Select only numeric columns
new_df = df.set_index("Students")
new_df.select_dtypes(include=[np.number]).groupby("Students").mean()

import pandas as pd
import numpy as np

# Create a sample DataFrame with a multi-level index
data = {
  'Scores': np.random.randint(65, 101, 9),
}
index = pd.MultiIndex.from_tuples(
  [('student1', 'exam1'), ('student1', 'exam2'), ('student1', 'exam3'),
   ('student2', 'exam1'), ('student2', 'exam2'), ('student2', 'exam3'),
   ('student3', 'exam1'), ('student3', 'exam2'), ('student3', 'exam3')],
  names=['Students', 'Exams']
)

df = pd.DataFrame(data, index=index)
# Group by the first level of the index ('Students')
grouped_df = df.groupby(level='Students').mean()

display(grouped_df)

15.2.27. Merging – SQL JOIN#

departments = {
    'DepartmentId': [1, 2, 3, 4],
    'DepartmentName': ['IT', 'Physics', 'Arts', 'Math']
}

df1 = pd.DataFrame(departments)

students = {
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'StudentName': ['Michael', 'John', 'Jack', 'Sara', 'Sally', 'Jena', 'Nancy', 'Adam', 'Stevens', 'George'],
    'DepartmentId': [1, 1, 1, 2, 2, np.nan, 2, 3, 3, np.nan]
}

df2 = pd.DataFrame(students)

marks = {
    'MarkId': [1, 2, 3, 4, 5, 6, 7, 8],
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8],
    'Mark': [18, 20, 16, 19, 14, 20, 20, 20]
}

df3 = pd.DataFrame(marks)

pd.merge(df2, df1, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='outer', on='DepartmentId')
pd.merge(df2, df1, how='right', on='DepartmentId')

pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data = pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')

data[['StudentName', 'Mark', 'DepartmentName']]

ref: https://stackoverflow.com/a/48411543

15.2.28. Concatenation#

d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1)

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2)

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3)

pd.concat([df1, df2, df3])

## Concatenation -- Fix index

d1 = {
    'C0': ['COR0', 'COR1', 'COR2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1, index=[1, 2, 3])

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2, index=[4, 5, 6])

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3, index=[7, 8, 9])

pd.concat([df1, df2, df3])

# Better way

d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1)

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2)

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3)

concat = pd.concat([df1, df2, df3])
concat = concat.reset_index().drop("index", axis=1)
concat

15.2.29. More Pandas Operations#

data['DepartmentName'].unique()
data['DepartmentName'].nunique()
data['DepartmentName'].value_counts()
data[data['Mark']>17]

15.2.30. Lambda with Pandas#

Scale marks by 5

def times5(val):
    return val * 5

data['Mark'].apply(times5)

data['Mark'].apply(lambda val: val*5)

Upper all department names

def upper(string):
    return string.upper()

data['DepartmentName'].apply(upper)

data['DepartmentName'].apply(lambda string: string.upper())

mapping = {18: 'B', 14: 'C', 19: 'A-', 20: 'A+'}
df3['Mark'].map(mapping)

15.2.31. Dropping columns#

data.columns
data.drop(['StudentId', 'MarkId' , 'DepartmentId'], axis=1)

15.2.32. Sorting#

data.sort_values('Mark')
data.sort_values('Mark', ascending=False)

15.2.33. Importing CSV, TSV#

data = pd.read_csv('students.tsv', sep='\t', names=['lastname', 'firstname', 'username', 'exam1', 'exam2', 'exam3'])

data.sort_values('exam1', ascending=False)

data[['exam1', 'exam2', 'exam3']].mean()
data['average']= np.mean(data[['exam1', 'exam2', 'exam3']], axis=1)
data.sort_values('average', ascending=False)

data.to_csv('output.tsv', sep='\t', index=False, header=False)

15.2.34. Other methods#

data.head()
data.head(2)

data.tail
data.tail(3)

data.shape
data.iloc[3]
data.columns
data.dtypes
data.info()
data.value_counts()
data.describe()

15.2.35. More data manipulation#

data[data['exam1'].between(75, 85)]
data[data['exam1'].isin([75, 85, 95])]
data['exam1'].unique()
data['exam1'].nunique()
np.sort(data['exam1'].unique())

15.2.36. Misc#

Iterate over the data frame
- items: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.items.html
- iteritems: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iteritems.html
Where: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
Plotting: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
to_csv
to_sql: https://datacarpentry.org/python-ecology-lesson/09-working-with-sql/index.html

15.2.37. A Previous Example in Pandas#

import numpy as np
import pandas as pd


employees = {
    'EmployeeID': ['EN1-26', 'EN1-33', 'EN1-35', 'EN1-36', 'EN1-38', 'EN1-40'],
    'Last_Name': ["O'Brien", "Guya", "Baranco", "Roslyn", "Schaaf", "Wing"],
    'First_Name': ["Sean", "Amy", "Steven", "Elizabeth", "Carol", "Alexandra" ]
}

df_employees = pd.DataFrame(employees)

projects = {
    'ProjectNum': [
        "30-452-T3",
        "30-457-T3",
        "30-482-TC",
        "31-124-T3",
        "31-238-TC",
        "31-238-TC2",
        "35-152-TC",
        "36-272-TC"
    ],
    'ProjectTitle': [
        "STAR manual",
        "ISO procedures",
        "Web site",
        "Employee handbook",
        "STAR prototype",
        "New catalog",
        "STAR pricing",
        "Order system"
    ]
}


df_projects = pd.DataFrame(projects)

print(df_projects)

employees_projects = {
    'EmployeeID': [
        "EN1-26",
        "EN1-26",
        "EN1-26",
        "EN1-33",
        "EN1-33",
        "EN1-33",
        "EN1-35",
        "EN1-35",
        "EN1-36",
        "EN1-38",
        "EN1-40",
        "EN1-40",
    ],
    'ProjectNum' : [
        "30-452-T3",
        "30-457-T3",
        "31-124-T3",
        "30-328-TC",
        "30-452-T3",
        "32-244-T3",
        "30-452-T3",
        "31-238-TC",
        "35-152-TC",
        "36-272-TC",
        "31-238-TC2",
        "31-241-TC",
    ]


}

df_employees_projects = pd.DataFrame(employees_projects)

data = pd.merge(pd.merge(df_employees_projects, df_employees, how='left', on='EmployeeID'), df_projects,  how='left', on='ProjectNum')
print(data)

15.2.38. References#

Ref: https://www.kaggle.com/shovitbhari/pandas-75-exercises-with-solutions Ref: https://pythonspeed.com/memory/