Numpy and Pandas

Contents

15.1. Numpy and Pandas#

15.1.1. NumPy#

15.1.1.1. Description#

15.1.1.2. How is it different from Python arrays?#

  • Python lists can be modified – you can add and remove elements. NumPy arrays have a fixed size at creation.

  • Python lists can contain different data types. NumPy arrays can only have one data type. If you put in mixed types, they become a string.

  • NumPy arrays come prepackaged with advanced mathematical operations. The operations are super fast even on large numbers of data and they use less memory.

15.1.1.3. Why use NumPy#

  • Most data analysis programs use NumPy to manipulate data. They might take in data as standard Python lists, but they convert it to a NumPy array and manipulate the data using NumPy routines and output the transformed data as a NumPy array.

  • NumPy data array is the main data type used in most scientific and mathematical Python-based packages.

15.1.1.4. Simple example#

# Square a list using Python
squared_values = []
for number in range(10):
    squared_values.append(number*number)

print(squared_values)

# Square a list using NumPy
import numpy as np

vector = np.array(range(10))
scalar = 5
print(vector * scalar)
print(vector * vector)

15.1.1.5. Numpy Basics#

  • NumPy arrays can be a 1-D array, called a vector, or a 2-D array, called a matrix

15.1.1.5.1. NumPy casting – covert Python list to a NumPy array#

my_list = [1, 2, 3]
print(my_list)

import numpy as np

my_vector = np.array(my_list)
print(my_vector)

my_matrix = np.array([my_list, my_list])
np.hstack([my_list, my_list])

my_nested_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
my_matrix = np.array(my_nested_list)
print(my_matrix)

my_list1 = [[1,2], [3,4]]
my_list2 = [[5,6], [7,8]]
np.hstack([my_list1, my_list2])
np.vstack([my_list1, my_list2])

15.1.1.5.2. NumPy creating arrays#

my_list = range(10)

## Create array using arange
np.arange(10)
np.arange(0, 10)
np.arange(0, 10, 2)

## Create array of zeros
np.zeros(3)
np.zeros((3,3))

## Create array of ones
np.ones(3)
np.ones((3,3))
np.ones(3)*4

## Create evenly spaced vector
### Example use case: when you have Y values for a plot but need to generate X values
### *** Includes both start an end
# np.arange(start, end(not included), step size)
# np.linspace(start, end(included), number_of_points)
np.linspace(0, 10, 5)
np.linspace(1900, 2000, 11)

15.1.1.6. Create an identify matrix#

np.eye(3)

15.1.1.7. Creating an empty array#

np.empty((2,3))

15.1.1.8. Creating Random Numbers#

# Uniform distribution
np.random.rand(3)
np.random.rand(3,3)
# Normal distribution
np.random.randn(3)
np.random.randn(3,3)
# Random integers
# np.random.randint(start, end(not_included), size)
np.random.randint(1,101)
np.random.randint(1,101,5)

15.1.1.9. Reshaping arrays#


vector = np.arange(1,10)
print(vector.reshape(3,3))

vector = np.arange(1,13)
print(vector.reshape(3,4))

15.1.1.10. Basic array operations#

vector = np.random.randint(1,50,25)
# Min
vector.min()
# Max
vector.max()
# get location of min value
index = vector.argmin()
# get location of max value
index = vector.argmax()
# get shape
vector.shape

my_matrix = vector.reshape(5, 5)
my_matrix.shape

15.1.1.11. Indexing a 1-D array – vector#

vector = np.array(range(10))
# vector[index]
# vector [start:end]
# vector [:end]
# vector [start:]
# vector [start, end, step]

vector[3]
vector[3:8]
vector[:5]
vector[5:]
vector[3:9:2]
vector[-1]

15.1.1.12. Setting multiple values at once – Broadcasting#

  • There are two main features of NumPy arrays

    • Broadcasting – set multiple values at once

    • Vectorization – no need for explicit looping – example, vector multiplication or squaring

vector[3:6] = 12

15.1.1.13. BE CAREFUL#

  • If you store a slice of an array in a new variable, changes in the new variable will be reflected in the original array.

vector = np.array(range(10))
my_slice = vector[3:7]
my_slice[:] = 20
print(vector)
  • Copy the array if you need a copy

vector = np.array(range(10))
my_slice_copy = vector[3:7].copy()
print(vector)

15.1.1.14. Indexing a 2-D array – Matrix#

  • Remember – Python is zero-indexed

matrix = np.array(range(1,10)).reshape((3,3))
matrix[0,0]
matrix[0][0]
matrix[2,2]
matrix[2][2]


matrix[:,2] # Grab the third column
matrix[1,:] # Grab the second row
matrix[:2] # grab the first two rows, all columns
matrix[:2,:] # grab the first two rows, all columns
matrix[:,1:] # grab all the rows, but columns starting from 1

15.1.1.15. Conditional selection#

vector = np.arange(10)
gt2 = vector > 2 # create condition
lt8 = vector < 8 # create condition

selected_gt2 = vector[gt2] # apply condition to select
selected_lt8 = vector[lt8] # apply condition to select

vector[vector>2]
vector[vector<8]

cond = (vector>2) & (vector<7)
vector[cond]

cond = (vector>=2) & (vector<=7)
vector[cond]

15.1.1.16. Array operations – Basic#

vector = np.arange(10)

vector + vector
vector - vector
vector * vector
vector / vector # problem!!! return `nan`
vector + 10
vector - 10
vector * 10
vector / 10

15.1.1.17. Array operations – Advanced#

vector = np.arange(10)
np.max(vector)
np.min(vector)
np.sqrt(vector)
np.log(vector)

sum(vector<5)
import math
vector = np.arange(1,11) * math.pi
np.sin(vector)
vector = np.arange(0,math.pi+math.pi/4,math.pi/4)
np.sin(vector)

matrix = np.random.rand(5,5)
np.floor(matrix*1000)/1000
np.round(matrix*1000)/1000
np.ceil(matrix*1000)/1000

matrix = np.arange(1,10).reshape(3,3)
matrix.sum(axis=1)
matrix.sum(axis=0)
matrix.cumsum()
matrix.cumprod()

matrix.min(axis=1)
matrix.min(axis=0)

matrix.max(axis=1)
matrix.max(axis=0)

matrix = np.array([1,2,3]*3).reshape(3,3)
np.unique(matrix.reshape(3,3))

15.1.2. Pandas#

  • Built-on top of NumPy – meaning the underlying data structure used is ndarray

  • Pandas provides series which are like NumPy arrays but with associated index labels – meaning are like column labels or row labels. Element data type can be different

  • Pandas also provides dataframes which are like Excel sheets or database tables

15.1.2.1. Basic examples#

import numpy as np
import pandas as pd

header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]

vector = np.array(data)
data_dict = {'chrom': 4, 'pos': 12345, 'filter': 38.4}

s1 = pd.Series(data=data) # Notice the data type is float
s2 = pd.Series(data=data, index=header)

# can also do
s1 = pd.Series(data)
s2 = pd.Series(data, header)

# can hold different data types

data = [1, '2s', 34]
pd.Series(data)

# can use a dictionary to initialize a panda series
pd.Series(data_dict)

## Using index labels to fetch element
header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
series = pd.Series(data=data, index=header)
series['chrom']
series['filter']

series = pd.Series(data)
series[0]


## Basic operations
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2

header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = ['3', 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2

data1 = [4, 12345, 38.4]
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1)
s2 = pd.Series(data2)


## IMPORTANT - with index labels -- operations are based on label
header1 = ['pos', 'filter', 'chrom']
data1 = [12345, 38.4, 4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
s1+s2


data1 = [12345, 38.4, 4]
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1)
s2 = pd.Series(data2)

15.1.2.2. Dataframes –#

import numpy as np
import pandas as pd

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']

df = pd.DataFrame(data=data, columns=header)

df = pd.DataFrame(data=data, index=students, columns=header)
df['exam1']
df.exam1 # not a good way to do this


df['average'] = (df['exam1'] + df['exam2'] + df['exam3'])/3
df.drop('average') # does not work because default for drop is to work on row labels
df.drop('average', axis=1) # works on column labels

## STILL NOT DROPPED from df
df
df.drop('average', axis=1, inplace=True)

## drop a student
df.drop('student3')

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student3']
df = pd.DataFrame(data=data, index=students, columns=header)

df.drop('student3')
df.drop('student3', inplace=True)
## Row is referred to as axis=0
## Column is referred to as axis=1
## (R,C) == (axis=0, axis=1) df.shape

15.1.2.3. Select Dataframe rows#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df.loc['student1']
df.iloc[0] ## remember that column names do not count as rows

15.1.3. Three Methods for Selecting Data in Pandas#

This guide demonstrates three primary methods for selecting data from pandas DataFrames: square brackets [], .iloc, and .loc.

15.1.3.1. Setup: Loading the Data#

import pandas as pd

# Load data once for reuse
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.read_csv("testdata_1.txt", header=None)
df.columns = columns
display(df)
StudentName E1 E2 E3 E4 E5
0 student1 92 77 87 77 94
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92
3 student4 100 72 83 85 66
4 student5 77 96 66 79 92
5 student6 100 86 84 70 71
6 student7 66 91 94 97 80
7 student8 97 86 75 69 88
8 student9 95 98 99 85 86
9 student10 78 76 73 88 86

15.1.3.2. Method 1: Square Brackets []#

Square brackets provide basic indexing with limited functionality. They work differently depending on the input type.

Note

  • Single column name or list of columns → selects columns

  • Slice → selects rows

  • Cannot combine row and column selection (will raise an error)

15.1.3.2.1. 1.1 Single Column Selection (returns Series)#

# Select one column by name
display(df['StudentName'])
0     student1
1     student2
2     student3
3     student4
4     student5
5     student6
6     student7
7     student8
8     student9
9    student10
Name: StudentName, dtype: object

15.1.3.2.2. 1.2 Multiple Column Selection (returns DataFrame)#

# Select multiple columns using a list (NOT a tuple)
display(df[['StudentName', 'E1']])
StudentName E1
0 student1 92
1 student2 74
2 student3 83
3 student4 100
4 student5 77
5 student6 100
6 student7 66
7 student8 97
8 student9 95
9 student10 78

15.1.3.2.3. 1.3 Row Selection with Slice#

# Select rows 1 through 4 (excludes row 5)
display(df[1:5])
StudentName E1 E2 E3 E4 E5
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92
3 student4 100 72 83 85 66
4 student5 77 96 66 79 92

15.1.3.2.4. ❌ 1.4 Invalid: Slice and List (Will Fail)#

# This will raise a TypeError
# display(df[1:5, ['E1', 'E2']])  # NOT SUPPORTED

15.1.3.2.5. ❌ 1.5 Invalid: Two Slices (Will Fail)#

# This will raise a TypeError
# display(df[1:5, 1:5])  # NOT SUPPORTED

Warning

Square brackets cannot select rows and columns simultaneously. Use .iloc or .loc instead.


15.1.3.3. Method 2: Position-Based Indexing with .iloc#

.iloc uses integer positions (0-based indexing) to select data.

Note

Key Features:

  • Uses square brackets [], not parentheses

  • Accepts integers, lists of integers, and slices

  • Syntax: df.iloc[rows, columns]

  • Slices exclude the end point (like Python lists)

15.1.3.3.1. 2.1 Single Row Selection (returns Series)#

# Select first row (index 0)
display(df.iloc[0])
StudentName    student1
E1                   92
E2                   77
E3                   87
E4                   77
E5                   94
Name: 0, dtype: object

15.1.3.3.2. 2.2 Multiple Rows with List#

# Select rows at positions 0, 1, 2
display(df.iloc[[0, 1, 2]])
StudentName E1 E2 E3 E4 E5
0 student1 92 77 87 77 94
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92

15.1.3.3.3. 2.3 Row Slice#

# Select rows 0 through 2 (excludes row 3)
display(df.iloc[0:3])
StudentName E1 E2 E3 E4 E5
0 student1 92 77 87 77 94
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92

15.1.3.3.4. 2.4 Rows and All Columns#

# Select specific rows, all columns
display(df.iloc[[0, 1, 2], :])
StudentName E1 E2 E3 E4 E5
0 student1 92 77 87 77 94
1 student2 74 93 88 67 85
2 student3 83 96 74 79 92

15.1.3.3.5. 2.5 Rows and Specific Columns (List)#

# Select specific rows and columns by position
display(df.iloc[[0, 1, 2], [0, 1]])
StudentName E1
0 student1 92
1 student2 74
2 student3 83

15.1.3.3.6. 2.6 All Rows and Specific Columns#

# Select all rows, columns at positions 1 and 2
display(df.iloc[:, [1, 2]])
E1 E2
0 92 77
1 74 93
2 83 96
3 100 72
4 77 96
5 100 86
6 66 91
7 97 86
8 95 98
9 78 76

15.1.3.3.7. 2.7 Row and Column Slices#

# Select rows 0-2 and columns 1-2 (excludes endpoints)
display(df.iloc[0:3, 1:3])
E1 E2
0 92 77
1 74 93
2 83 96

15.1.3.4. Method 3: Label-Based Indexing with .loc#

.loc uses labels/names to select data (can also use integer labels if they exist).

Warning

Critical Difference: .loc slices include the end point, unlike .iloc and Python slices.

Note

Key Features:

  • Uses row and column labels (names)

  • If no custom index is set, uses default integer labels (0, 1, 2, …)

  • Syntax: df.loc[rows, columns]

  • Slices include both start and end points

15.1.3.4.1. 3.1 Single Row by Integer Label#

# Select row with label 3 (default integer index)
display(df.loc[3])
StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: 3, dtype: object

15.1.3.4.2. 3.2 Single Row by Custom Index#

# Set StudentName as index
df_indexed = df.set_index('StudentName')
display(df_indexed.loc['student4'])
E1    100
E2     72
E3     83
E4     85
E5     66
Name: student4, dtype: int64

15.1.3.4.3. 3.3 Multiple Rows with List#

df_indexed = df.set_index('StudentName')
# Select specific rows by label
display(df_indexed.loc[['student4', 'student1']])
E1 E2 E3 E4 E5
StudentName
student4 100 72 83 85 66
student1 92 77 87 77 94

15.1.3.4.4. 3.4 Rows and Columns with Lists#

df_indexed = df.set_index('StudentName')
# Select specific rows and columns
display(df_indexed.loc[['student4', 'student1'], ['E1', 'E2']])
E1 E2
StudentName
student4 100 72
student1 92 77

15.1.3.4.5. 3.5 All Rows and Specific Columns#

df_indexed = df.set_index('StudentName')
# Select all rows, specific columns
display(df_indexed.loc[:, ['E1', 'E2']])
E1 E2
StudentName
student1 92 77
student2 74 93
student3 83 96
student4 100 72
student5 77 96
student6 100 86
student7 66 91
student8 97 86
student9 95 98
student10 78 76

15.1.3.4.6. 3.6 Specific Rows and All Columns#

df_indexed = df.set_index('StudentName')
# Select specific rows, all columns
display(df_indexed.loc[['student1', 'student2'], :])
E1 E2 E3 E4 E5
StudentName
student1 92 77 87 77 94
student2 74 93 88 67 85

15.1.3.4.7. 3.7 Row Slice (Inclusive)#

df_indexed = df.set_index('StudentName')
# Select from student1 to student5 (INCLUDES student5)
display(df_indexed.loc['student1':'student5'])
E1 E2 E3 E4 E5
StudentName
student1 92 77 87 77 94
student2 74 93 88 67 85
student3 83 96 74 79 92
student4 100 72 83 85 66
student5 77 96 66 79 92

15.1.3.4.8. 3.8 Row and Column Slices#

df_indexed = df.set_index('StudentName')
# Both slices are inclusive
display(df_indexed.loc['student1':'student5', :])
E1 E2 E3 E4 E5
StudentName
student1 92 77 87 77 94
student2 74 93 88 67 85
student3 83 96 74 79 92
student4 100 72 83 85 66
student5 77 96 66 79 92

15.1.3.4.9. 3.9 Column Slice Only#

df_indexed = df.set_index('StudentName')
# Select columns E1 through E4 (INCLUDES E4)
display(df_indexed.loc[:, 'E1':'E4'])
E1 E2 E3 E4
StudentName
student1 92 77 87 77
student2 74 93 88 67
student3 83 96 74 79
student4 100 72 83 85
student5 77 96 66 79
student6 100 86 84 70
student7 66 91 94 97
student8 97 86 75 69
student9 95 98 99 85
student10 78 76 73 88

15.1.3.5. Quick Reference Summary#

Method

Indexing Type

Slice Behavior

Row & Column Selection

[]

Column names or row positions

Excludes end

❌ Limited

.iloc

Integer positions (0-based)

Excludes end

✅ Yes

.loc

Labels/names

Includes end

✅ Yes

15.1.3.6. iterrows – iterate over rows#

  • iterate over the data row-wise. The first element is the index and second element is the row values

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for idx, row in df.iterrows():
    print(idx, row)
0 StudentName    student1
E1                   92
E2                   77
E3                   87
E4                   77
E5                   94
Name: 0, dtype: object
1 StudentName    student2
E1                   74
E2                   93
E3                   88
E4                   67
E5                   85
Name: 1, dtype: object
2 StudentName    student3
E1                   83
E2                   96
E3                   74
E4                   79
E5                   92
Name: 2, dtype: object
3 StudentName    student4
E1                  100
E2                   72
E3                   83
E4                   85
E5                   66
Name: 3, dtype: object
4 StudentName    student5
E1                   77
E2                   96
E3                   66
E4                   79
E5                   92
Name: 4, dtype: object
5 StudentName    student6
E1                  100
E2                   86
E3                   84
E4                   70
E5                   71
Name: 5, dtype: object
6 StudentName    student7
E1                   66
E2                   91
E3                   94
E4                   97
E5                   80
Name: 6, dtype: object
7 StudentName    student8
E1                   97
E2                   86
E3                   75
E4                   69
E5                   88
Name: 7, dtype: object
8 StudentName    student9
E1                   95
E2                   98
E3                   99
E4                   85
E5                   86
Name: 8, dtype: object
9 StudentName    student10
E1                    78
E2                    76
E3                    73
E4                    88
E5                    86
Name: 9, dtype: object

15.1.3.7. items – iterate over columns#

import pandas as pd
columns = ('StudentName', 'E1', 'E2', 'E3', 'E4','E5')
df = pd.DataFrame(data=[ele.strip().split(',') for ele in open('testdata_1.txt')], columns=columns)
for column_name, column in df.items():
    print(column_name, column)
StudentName 0     student1
1     student2
2     student3
3     student4
4     student5
5     student6
6     student7
7     student8
8     student9
9    student10
Name: StudentName, dtype: object
E1 0     92
1     74
2     83
3    100
4     77
5    100
6     66
7     97
8     95
9     78
Name: E1, dtype: object
E2 0    77
1    93
2    96
3    72
4    96
5    86
6    91
7    86
8    98
9    76
Name: E2, dtype: object
E3 0    87
1    88
2    74
3    83
4    66
5    84
6    94
7    75
8    99
9    73
Name: E3, dtype: object
E4 0    77
1    67
2    79
3    85
4    79
5    70
6    97
7    69
8    85
9    88
Name: E4, dtype: object
E5 0    94
1    85
2    92
3    66
4    92
5    71
6    80
7    88
8    86
9    86
Name: E5, dtype: object

15.1.4. Select subset of data#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df.loc['student1', 'exam1']
df.loc[['student1', 'student3'], ['exam1', 'exam3']]

15.1.5. Use conditions to select#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)

df>=90
df[df>=90]
df['exam1']>=85 #
df[df['exam1']>=85] # gives all columns where exam1 is greater than 85
df[df['exam1']>=85]['exam3']
df[df['exam1']>=85][['exam2', 'exam3']]


df[(df['exam1']>=85) & (df['exam2']>=85)]
df[(df['exam1']>=85) & (df['exam2']>=85)]['exam3']

df[(df['exam1']>=85) | (df['exam2']>=85)]
df[(df['exam1']>=85) | (df['exam2']>=85)]['exam3']

15.1.6. Adding student index#

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df['name'] = students

df.set_index('name', inplace=True)
df.loc['student1']
df.reset_index(inplace=True)

15.1.7. Multi-index data#

students = 'student1 student1 student1 student2 student2 student2 student3 student3 student3'
exams = 'exam1 exam2 exam3'.split()*3
classes = 'class1 class2'
index = list(zip(students.split(), exams))
index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(np.random.randint(65, 101, 3*3*2).reshape(9,2) , index, classes.split())
df.loc['student1'].loc['exam1']['class1']
df.index.names
df.index.names = ['Students', 'Exams']

## cross-section
df.xs('student1')
df.xs('exam1', level='Exams')

15.1.8. Dealing with missing data#

my_dict = {'student1': [90, 84, np.nan], 'student2': [77, np.nan, np.nan], 'student3': [88, 65, 93]}
df = pd.DataFrame(my_dict)
df.dropna()
df.dropna(axis=0)
df.dropna(axis=1)

df.dropna(thresh=2)
df.fillna(value=55)

df.drop(axis=0, labels=[1,2])
df.drop(axis=1, columns=['student1'])

15.1.9. Groupby#

Note

In previous versions of pandas, non-numeric columns were silently ignored. Now, however, we need to either explicitly choose numeric columns or ask mean to do it for us with numeric_only=True keyword argument.

import pandas as pd
import numpy as np

my_dict = {
    'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
    'Students': 'student1 student2 student3'.split()*3,
    'Scores': np.random.randint(65,101,9)
}
df = pd.DataFrame(my_dict)
df.groupby('Students').mean(numeric_only=True)
df.groupby('Students').mean(numeric_only=True).loc['student1']
df.groupby('Exams').max(numeric_only=True)['Scores']
df.groupby('Exams').describe()
df.groupby('Students').describe().transpose()
import pandas as pd
import numpy as np
my_dict = {
  'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
  'Students': 'student1 student2 student3'.split() * 3,
  'Scores': np.random.randint(65, 101, 9)
}
df = pd.DataFrame(my_dict)

# Select only numeric columns
new_df = df.set_index("Students")
new_df.select_dtypes(include=[np.number]).groupby("Students").mean()
import pandas as pd
import numpy as np

# Create a sample DataFrame with a multi-level index
data = {
  'Scores': np.random.randint(65, 101, 9),
}
index = pd.MultiIndex.from_tuples(
  [('student1', 'exam1'), ('student1', 'exam2'), ('student1', 'exam3'),
   ('student2', 'exam1'), ('student2', 'exam2'), ('student2', 'exam3'),
   ('student3', 'exam1'), ('student3', 'exam2'), ('student3', 'exam3')],
  names=['Students', 'Exams']
)

df = pd.DataFrame(data, index=index)
# Group by the first level of the index ('Students')
grouped_df = df.groupby(level='Students').mean()

display(grouped_df)

15.1.10. Merging – SQL JOIN#

departments = {
    'DepartmentId': [1, 2, 3, 4],
    'DepartmentName': ['IT', 'Physics', 'Arts', 'Math']
}

df1 = pd.DataFrame(departments)

students = {
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'StudentName': ['Michael', 'John', 'Jack', 'Sara', 'Sally', 'Jena', 'Nancy', 'Adam', 'Stevens', 'George'],
    'DepartmentId': [1, 1, 1, 2, 2, np.nan, 2, 3, 3, np.nan]
}

df2 = pd.DataFrame(students)

marks = {
    'MarkId': [1, 2, 3, 4, 5, 6, 7, 8],
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8],
    'Mark': [18, 20, 16, 19, 14, 20, 20, 20]
}

df3 = pd.DataFrame(marks)

pd.merge(df2, df1, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='inner', on='DepartmentId')
pd.merge(df1, df2, how='outer', on='DepartmentId')
pd.merge(df2, df1, how='right', on='DepartmentId')

pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data = pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')

data[['StudentName', 'Mark', 'DepartmentName']]

15.1.11. Concatenation#

d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1)

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2)

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3)

pd.concat([df1, df2, df3])

## Concatenation -- Fix index

d1 = {
    'C0': ['COR0', 'COR1', 'COR2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1, index=[1, 2, 3])

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2, index=[4, 5, 6])

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3, index=[7, 8, 9])

pd.concat([df1, df2, df3])

# Better way

d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C2R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1)

d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2)

d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3)

concat = pd.concat([df1, df2, df3])
concat = concat.reset_index().drop("index", axis=1)
concat

15.1.12. More Pandas Operations#

data['DepartmentName'].unique()
data['DepartmentName'].nunique()
data['DepartmentName'].value_counts()
data[data['Mark']>17]

15.1.13. Lambda with Pandas#

  • Scale marks by 5


def times5(val):
    return val * 5

data['Mark'].apply(times5)

data['Mark'].apply(lambda val: val*5)

  • Upper all department names

def upper(string):
    return string.upper()

data['DepartmentName'].apply(upper)

data['DepartmentName'].apply(lambda string: string.upper())

mapping = {18: 'B', 14: 'C', 19: 'A-', 20: 'A+'}
df3['Mark'].map(mapping)


15.1.14. Dropping columns#

data.columns
data.drop(['StudentId', 'MarkId' , 'DepartmentId'], axis=1)

15.1.15. Sorting#

data.sort_values('Mark')
data.sort_values('Mark', ascending=False)

15.1.16. Importing CSV, TSV#


data = pd.read_csv('students.tsv', sep='\t', names=['lastname', 'firstname', 'username', 'exam1', 'exam2', 'exam3'])

data.sort_values('exam1', ascending=False)

data[['exam1', 'exam2', 'exam3']].mean()
data['average']= np.mean(data[['exam1', 'exam2', 'exam3']], axis=1)
data.sort_values('average', ascending=False)

data.to_csv('output.tsv', sep='\t', index=False, header=False)

15.1.17. Other methods#

data.head()
data.head(2)

data.tail
data.tail(3)

data.shape
data.iloc[3]
data.columns
data.dtypes
data.info()
data.value_counts()
data.describe()

15.1.18. More data manipulation#

data[data['exam1'].between(75, 85)]
data[data['exam1'].isin([75, 85, 95])]
data['exam1'].unique()
data['exam1'].nunique()
np.sort(data['exam1'].unique())

15.1.19. Misc#

15.1.20. A Previous Example in Pandas#

import numpy as np
import pandas as pd


employees = {
    'EmployeeID': ['EN1-26', 'EN1-33', 'EN1-35', 'EN1-36', 'EN1-38', 'EN1-40'],
    'Last_Name': ["O'Brien", "Guya", "Baranco", "Roslyn", "Schaaf", "Wing"],
    'First_Name': ["Sean", "Amy", "Steven", "Elizabeth", "Carol", "Alexandra" ]
}

df_employees = pd.DataFrame(employees)

projects = {
    'ProjectNum': [
        "30-452-T3",
        "30-457-T3",
        "30-482-TC",
        "31-124-T3",
        "31-238-TC",
        "31-238-TC2",
        "35-152-TC",
        "36-272-TC"
    ],
    'ProjectTitle': [
        "STAR manual",
        "ISO procedures",
        "Web site",
        "Employee handbook",
        "STAR prototype",
        "New catalog",
        "STAR pricing",
        "Order system"
    ]
}


df_projects = pd.DataFrame(projects)

print(df_projects)

employees_projects = {
    'EmployeeID': [
        "EN1-26",
        "EN1-26",
        "EN1-26",
        "EN1-33",
        "EN1-33",
        "EN1-33",
        "EN1-35",
        "EN1-35",
        "EN1-36",
        "EN1-38",
        "EN1-40",
        "EN1-40",
    ],
    'ProjectNum' : [
        "30-452-T3",
        "30-457-T3",
        "31-124-T3",
        "30-328-TC",
        "30-452-T3",
        "32-244-T3",
        "30-452-T3",
        "31-238-TC",
        "35-152-TC",
        "36-272-TC",
        "31-238-TC2",
        "31-241-TC",
    ]


}

df_employees_projects = pd.DataFrame(employees_projects)

data = pd.merge(pd.merge(df_employees_projects, df_employees, how='left', on='EmployeeID'), df_projects,  how='left', on='ProjectNum')
print(data)

15.1.21. References#

Ref: https://www.kaggle.com/shovitbhari/pandas-75-exercises-with-solutions Ref: https://pythonspeed.com/memory/