16.3. Preprocessing data

16.3.1. Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
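
The centering and scaling described above can be sketched by hand with NumPy. This is a minimal illustration, not scikit-learn's implementation; the names `std_safe` and `X_manual` are assumptions introduced here, and the standard deviation is the population one (`ddof=0`).

```python
import numpy as np

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Per-feature (column) statistics.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

# Leave constant features unscaled to avoid dividing by zero.
std_safe = np.where(std == 0, 1.0, std)

# Center each column, then scale it to unit variance.
X_manual = (X_train - mean) / std_safe

print(X_manual.mean(axis=0))
print(X_manual.std(axis=0))
```

Note that pandas' `describe()` reports the sample standard deviation (`ddof=1`), so it will not show exactly 1 for columns standardized this way.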

import numpy as np
import pandas as pd
from IPython.display import display
from sklearn import preprocessing
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Rows are samples (patients/subjects/flowers);
# columns are the features measured for each sample.
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

df = pd.DataFrame(X_train)
display(df)

# Standardize each column to zero mean and unit variance.
X_scaled = preprocessing.scale(df)

X_scaled
df.describe()
idata = iris_dataset.data
id_scaled = preprocessing.scale(idata)
idata.mean(axis=0)
id_scaled.mean(axis=0)
id_scaled.var(axis=0)
X_train.mean(axis=0)
X_scaled.mean(axis=0)
X_scaled.std(axis=0)
X_scaled.mean(axis=1)
X_scaled.std(axis=1)

16.3.2. Saving the scaling and applying it to test data

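  • Never fit a preprocessing step (or any other learned transformation) on all of your data before training. Fit it on the training data only, then reuse the fitted transformer on the test data; otherwise information from the test set leaks into training (data leakage).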
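
A minimal sketch of the leak-free order of operations follows: split first, fit the scaler on the training split only, then apply the same fitted scaler to the test split. The synthetic data and the variable names (`X_tr`, `X_te`, and so on) are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 1. Split before any preprocessing is fitted.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 2. Learn the scaling parameters from the training split only.
scaler = StandardScaler().fit(X_tr)

# 3. Apply the same fitted parameters to both splits.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # essentially zero
print(X_te_scaled.mean(axis=0))  # close to zero, but not exactly
```

The test split is not exactly centered because its statistics were never used, which is the intended behavior.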

scaler = preprocessing.StandardScaler().fit(X_train)
scaler                         
scaler.transform(X_train)  
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)  

16.3.3. Scaling features to a range

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
X_test = np.array([[-3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
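
The arithmetic behind min-max scaling to the range [0, 1] can be sketched with NumPy alone. This is an illustrative reimplementation under that default range, not the library code; `col_min`, `col_range`, and `min_max` are names introduced here.

```python
import numpy as np

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# Per-feature minimum and range, learned from the training data only.
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

def min_max(X):
    """Map each feature so the training min goes to 0 and the training max to 1."""
    return (X - col_min) / col_range

print(min_max(X_train))
print(min_max(X_test))  # → [[-1.5, 0.0, 1.6667]] (rounded)
```

Test values that lie outside the range seen during training map outside [0, 1], as the first and third features of `X_test` show.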

16.3.4. Preprocessing data - Non-linear transformation

  • This is a deliberately simple example, meant only to demonstrate usage.

import pandas as pd
%matplotlib inline
df = pd.read_csv('international-airline-passengers.csv')
display(df)
df['passengers'].hist(bins=20)
import numpy as np
df['passengers'] = np.log(df['passengers'])
df['passengers'].hist(bins=20)
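
To see what a log transform does without the CSV file, here is a small sketch on synthetic right-skewed data. The log-normal sample and the `skewness` helper are assumptions made for illustration and are not part of the airline dataset.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic, strictly positive, right-skewed values.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# The log compresses the long right tail.
log_values = np.log(values)

print("skewness before:", skewness(values))
print("skewness after: ", skewness(log_values))
```

The log is only defined for positive values, so data containing zeros or negatives would need a different transform or a shift first.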

16.3.5. Normalization

from sklearn import preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l1')

X_normalized   
# http://www.chioka.in/differences-between-the-l1-norm-and-the-l2-norm-least-absolute-deviations-and-least-squares/
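
Unlike the scalers above, normalization works on each row (sample) rather than each column. The following NumPy sketch shows the L1 and L2 cases by hand; it is an illustration of the arithmetic, and the names `X_l1` and `X_l2` are introduced here.

```python
import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# L1: divide each row by the sum of its absolute values.
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
X_l1 = X / l1_norms

# L2: divide each row by its Euclidean length.
l2_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
X_l2 = X / l2_norms

# Note: an all-zero row would cause a division by zero in this sketch.
print(X_l1)  # first row → [0.25, -0.25, 0.5]
print(X_l2)
```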

16.3.5.1. Saving the normalizer for future use

normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer.transform(X)    
normalizer.transform([[2.,  1., 0.]]) 

16.3.6. Preprocessing data - Encoding

from sklearn import preprocessing
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], 
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)  
enc.transform([['female', 'from US', 'uses Safari']])
enc.transform([['male', 'from Europe', 'uses Safari']])
enc.transform([['female', 'from Europe', 'uses Firefox']])
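
To make the integer codes easier to interpret, the sketch below inspects the categories the encoder learned and maps a row to codes and back. It assumes a scikit-learn version that provides `OrdinalEncoder`; the variable `codes` is introduced here.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc = OrdinalEncoder().fit(X)

# One sorted array of categories per feature; the position is the code.
for cats in enc.categories_:
    print(list(cats))

# 'female' -> 0, 'from US' -> 1, 'uses Safari' -> 1
codes = enc.transform([['female', 'from US', 'uses Safari']])
print(codes)  # → [[0. 1. 1.]]

# The codes can be mapped back to the original labels.
print(enc.inverse_transform(codes))
```

The codes imply an ordering (0 < 1 < ...) that many models will treat as meaningful, which is one reason to prefer one-hot encoding for unordered categories.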

16.3.6.1. One Hot Encoder

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X) 
enc.categories_
enc.transform([['male', 'from US', 'uses Safari']])
tmp = enc.transform([['female', 'from Asia', 'uses Chrome'],
                    ['male', 'from Europe', 'uses Safari']]).toarray()
tmp
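
The layout of a one-hot vector can be sketched by hand: one block of indicator values per feature, concatenated in feature order. `one_hot_row` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
categories = [genders, locations, browsers]

def one_hot_row(row, categories):
    """Concatenate one indicator vector per feature (hypothetical helper)."""
    parts = []
    for value, cats in zip(row, categories):
        indicator = np.zeros(len(cats))
        indicator[cats.index(value)] = 1.0  # raises ValueError for an unknown value
        parts.append(indicator)
    return np.concatenate(parts)

encoded = one_hot_row(['male', 'from US', 'uses Safari'], categories)
print(encoded)  # → [0. 1. 0. 0. 0. 1. 0. 0. 0. 1.]
```

The result has 2 + 4 + 4 = 10 columns, with exactly one 1 in each feature's block.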

16.3.7. Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.

X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [  6., 3., 11 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 4], encode='ordinal').fit(X)
est.transform(X)
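
As a rough illustration of binning, the sketch below assigns values to equal-width bins with NumPy. This is an assumption chosen for simplicity: `KBinsDiscretizer` uses quantile-based bins by default, so its edges can differ. `uniform_bin` is a hypothetical helper.

```python
import numpy as np

def uniform_bin(x, n_bins):
    """Assign each value in x to one of n_bins equal-width bins (hypothetical helper)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Use interior edges only, so the maximum lands in the last bin (n_bins - 1).
    return np.digitize(x, edges[1:-1])

# First feature of the example above, split into 3 bins over [-3, 6].
x = np.array([-3., 0., 6.])
bins = uniform_bin(x, n_bins=3)
print(bins)  # → [0 1 2]
```

The bin edges here are -3, 0, 3, and 6, so each value falls into a different bin.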

16.3.8. Univariate feature imputation

  • Impute missing values

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
orig_data = [[1, 2],
             [np.nan, 3],
             [7, 6]]
imp.fit(orig_data)  
imp.transform(orig_data)
X = [[np.nan, 2], 
     [6, np.nan], 
     [7, 6]]
print(imp.transform(X))  
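
Mean imputation can be sketched with NumPy to show where the filled-in values come from: the column means are learned from the data used for fitting and then substituted into new data. `impute_mean` is a hypothetical helper introduced for this illustration.

```python
import numpy as np

# Data used for "fitting": the column means ignore missing entries.
train = np.array([[1., 2.],
                  [np.nan, 3.],
                  [7., 6.]])
col_means = np.nanmean(train, axis=0)  # [4.0, 3.666...]

def impute_mean(X, means):
    """Replace each NaN with the stored mean of its column (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), means, X)

X_new = np.array([[np.nan, 2.],
                  [6., np.nan],
                  [7., 6.]])
filled = impute_mean(X_new, col_means)
print(filled)
```

The means come from `train`, not from `X_new`, which mirrors fitting an imputer once and reusing it on later data.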
import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "w"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df)) 
# Tie case: when several values are equally frequent, the smallest one is used.
import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["c", np.nan],
                   ["b", "y"]], dtype="category")

imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df)) 

16.3.9. Review Book Examples

https://en.wikipedia.org/wiki/Correlation#/media/File:Correlation_examples2.svg

https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html

https://colab.research.google.com/github/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb
