Preprocessing data

16.3. Preprocessing data#

16.3.1. Standardization, or mean removal and variance scaling#

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

from IPython.display import display
from sklearn.datasets import load_iris
iris_dataset = load_iris()
from sklearn import preprocessing
import numpy as np
import pandas as pd
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

df = pd.DataFrame(X_train)
display(df)

X_scaled = preprocessing.scale(df)

X_scaled      

#rows are different patients/subjects/flowers
# columns are various features of the patients/subjects/flowers

	0	1	2
0	1.0	-1.0	2.0
1	2.0	0.0	0.0
2	0.0	1.0	-1.0

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

df.describe()

	0	1	2
count	3.0	3.0	3.000000
mean	1.0	0.0	0.333333
std	1.0	1.0	1.527525
min	0.0	-1.0	-1.000000
25%	0.5	-0.5	-0.500000
50%	1.0	0.0	0.000000
75%	1.5	0.5	1.000000
max	2.0	1.0	2.000000

idata = iris_dataset.data
id_scaled = preprocessing.scale(idata)
idata.mean(axis=0)
id_scaled.mean(axis=0)
id_scaled.var(axis=0)

array([1., 1., 1., 1.])

X_train.mean(axis=0)
X_scaled.mean(axis=0)

array([0., 0., 0.])

X_scaled.std(axis=0)
X_scaled.mean(axis=1)
X_scaled.std(axis=1)

array([1.04587533, 0.64957343, 1.11980724])

16.3.2. Can save scaling and apply to testing data#

Never ever apply any preprocessing or any other step on all your data and then train. This will lead to Data Leakage!

scaler = preprocessing.StandardScaler().fit(X_train)
scaler                         
scaler.transform(X_train)  

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

Note

Be careful about data leakage https://www.atoti.io/articles/what-is-data-leakage-and-how-to-mitigate-it/

X_test = [[-1., 1., 0.]]
scaler.transform(X_test)  

array([[-2.44948974,  1.22474487, -0.26726124]])

16.3.3. Scaling features to a range#

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

X_test = np.array([[-3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

16.3.4. Pre-processing data - Non-linear transformation#

This is a very silly example, but just demonstrating usage.

import pandas as pd
%matplotlib inline
df = pd.read_csv('international-airline-passengers.csv')
display(df)

	date	passengers
0	1949-01	112.0
1	1949-02	118.0
2	1949-03	132.0
3	1949-04	129.0
4	1949-05	121.0
...	...	...
140	1960-09	508.0
141	1960-10	461.0
142	1960-11	390.0
143	1960-12	432.0
144	International airline passengers: monthly tota...	NaN

145 rows × 2 columns

df['passengers'].hist(bins=20)

<Axes: >

../../_images/c4289d6c4495f9f05eba1ff51b9c48e070570f410936e83074f5ecfbe0d4e066.png

import numpy as np
df['passengers'] = np.log(df['passengers'])
df['passengers'].hist(bins=20)

<Axes: >

../../_images/a504d4f93134725f54ab3130cb159ca8ec3a760c176e2a5fce0658dd7eaface8.png

16.3.5. Normalization#

from sklearn import preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l1')

X_normalized   
# http://www.chioka.in/differences-between-the-l1-norm-and-the-l2-norm-least-absolute-deviations-and-least-squares/

array([[ 0.25, -0.25,  0.5 ],
       [ 1.  ,  0.  ,  0.  ],
       [ 0.  ,  0.5 , -0.5 ]])

16.3.5.1. Can save the normalization for future use#

normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer.transform(X)    

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

normalizer.transform([[2.,  1., 0.]]) 

array([[0.89442719, 0.4472136 , 0.        ]])

16.3.6. Preprocessing data - Encoding#

from sklearn import preprocessing
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], 
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)  

OrdinalEncoder()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

enc.transform([['female', 'from US', 'uses Safari']])

array([[0., 1., 1.]])

enc.transform([['male', 'from Europe', 'uses Safari']])

array([[1., 0., 1.]])

enc.transform([['female', 'from Europe', 'uses Firefox']])

array([[0., 0., 0.]])

16.3.6.1. One Hot Encoder#

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X) 
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Africa', 'from Asia', 'from Europe', 'from US'], dtype=object),
 array(['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari'],
       dtype=object)]

enc.transform([['male', 'from US', 'uses Safari']])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3 stored elements and shape (1, 10)>

tmp = enc.transform([['female', 'from Asia', 'uses Chrome'],
                    ['male', 'from Europe', 'uses Safari']]).toarray()
tmp

array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0., 0., 1.]])

16.3.7. Discretization#

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.

X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [  6., 3., 11 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 4], encode='ordinal').fit(X)
est.transform(X)

array([[0., 1., 3.],
       [1., 1., 2.],
       [2., 0., 0.]])

16.3.8. Univariate feature imputation#

Impute missing values

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
orig_data = [[1, 2],
         [np.nan, 3], 
         [7, 6]]
imp.fit(orig_data)  
imp.transform(orig_data)

array([[1., 2.],
       [4., 3.],
       [7., 6.]])

X = [[np.nan, 2], 
     [6, np.nan], 
     [7, 6]]
print(imp.transform(X))  

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]

import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "w"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df)) 

[['a' 'x']
 ['a' 'w']
 ['a' 'w']
 ['b' 'y']]

import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["c", np.nan],
                   ["b", "y"]], dtype="category")

imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df)) 

[['a' 'x']
 ['a' 'y']
 ['c' 'y']
 ['b' 'y']]

16.3.9. Review Book Examples#

https://en.wikipedia.org/wiki/Correlation#/media/File:Correlation_examples2.svg

https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html

https://colab.research.google.com/github/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb