16.3. Preprocessing data#
16.3.1. Standardization, or mean removal and variance scaling#
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
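As a rough illustration of the centering and scaling just described, the sketch below standardizes each column by hand with NumPy. The helper name `standardize` and the extra constant column are assumptions made for this example; it mirrors the idea of the transformation but is not the library's implementation.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance (population std, ddof=0)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have std == 0; leave them unscaled to avoid dividing by zero.
    safe_std = np.where(std == 0, 1.0, std)
    return (X - mean) / safe_std

# The last column is constant, to show how non-constant and constant features differ.
X = np.array([[1., -1.,  2., 5.],
              [2.,  0.,  0., 5.],
              [0.,  1., -1., 5.]])
Z = standardize(X)
print(Z.mean(axis=0))  # every column is centered at (numerically) zero
print(Z.std(axis=0))   # 1 for the non-constant columns, 0 for the constant one
```

Only the non-constant columns end up with unit variance; the constant column is centered to zero and otherwise left alone.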
from IPython.display import display
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Rows are samples (patients/subjects/flowers);
# columns are the features measured for each sample.
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
df = pd.DataFrame(X_train)
display(df)

# Standardize each column to zero mean and unit variance
X_scaled = preprocessing.scale(df)
X_scaled
df.describe()
idata = iris_dataset.data
id_scaled = preprocessing.scale(idata)
idata.mean(axis=0)
id_scaled.mean(axis=0)
id_scaled.var(axis=0)
X_train.mean(axis=0)
X_scaled.mean(axis=0)
X_scaled.std(axis=0)
X_scaled.mean(axis=1)
X_scaled.std(axis=1)
16.3.2. Can save scaling and apply to testing data#
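Never fit a preprocessing step, or any other step that learns from data, on the full dataset before training. Fit it on the training data only and reuse the fitted transformer on the test data; otherwise information from the test set leaks into training (data leakage).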
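A minimal sketch of the split-first workflow, assuming synthetic data and the short variable names used here (`X_tr`, `X_te`, and so on are made up for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))  # synthetic data, for illustration only
y = rng.integers(0, 2, size=100)

# Split first, so the test rows never influence the fitted statistics.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_tr)    # mean and std come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # reuse the training mean and std

print(X_tr_scaled.mean(axis=0))  # close to 0 by construction
print(X_te_scaled.mean(axis=0))  # generally not exactly 0, which is expected
```

The test set is scaled with statistics it did not contribute to, which is what a model would face on genuinely new data.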
scaler = preprocessing.StandardScaler().fit(X_train)
scaler
scaler.transform(X_train)
Note
Be careful about data leakage; see https://www.atoti.io/articles/what-is-data-leakage-and-how-to-mitigate-it/
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)
16.3.3. Scaling features to a range#
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
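The arithmetic behind min-max scaling can be sketched directly in NumPy. The helper `min_max` is a name made up for this example, and the sketch assumes no column is constant (a constant column would give a zero range).

```python
import numpy as np

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# "Fit": remember the per-column minimum and range of the training data.
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

def min_max(X):
    """Rescale values relative to the training minimum and range."""
    return (X - col_min) / col_range

print(min_max(X_train))  # every column now spans exactly [0, 1]
print(min_max(X_test))   # approximately [[-1.5, 0., 1.667]]
```

Because the minimum and range come from the training data, test values that lie outside the training range map outside [0, 1], as the first and last entries of the test row do here.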
16.3.4. Pre-processing data - Non-linear transformation#
This is a deliberately simple example, intended only to demonstrate usage.
import pandas as pd
%matplotlib inline
df = pd.read_csv('international-airline-passengers.csv')
display(df)
df['passengers'].hist(bins=20)
import numpy as np
df['passengers'] = np.log(df['passengers'])
df['passengers'].hist(bins=20)
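To see numerically what the log transform does to a right-skewed column, the sketch below uses synthetic log-normal values as a stand-in (it does not read the airline CSV). The `skewness` helper is an assumption written for this example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of cubed z-scores (about 0 for a symmetric distribution)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for a column of positive values.
values = rng.lognormal(mean=5.0, sigma=0.8, size=1000)
logged = np.log(values)   # requires strictly positive inputs

print(skewness(values))   # clearly positive: long right tail
print(skewness(logged))   # much closer to 0
```

A histogram of `logged` would look far more symmetric than one of `values`, which is the effect the two histograms above are meant to show.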
16.3.5. Normalization#
from sklearn import preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
# Rescale each row (sample) so that its absolute values sum to 1 (L1 norm)
X_normalized = preprocessing.normalize(X, norm='l1')
X_normalized
# L1 vs. L2 norms: http://www.chioka.in/differences-between-the-l1-norm-and-the-l2-norm-least-absolute-deviations-and-least-squares/
16.3.5.1. Can save the normalization for future use#
# Normalizer is stateless, so fit() learns nothing; its default norm is 'l2'
normalizer = preprocessing.Normalizer().fit(X)
normalizer.transform(X)
normalizer.transform([[2., 1., 0.]])
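Unlike the scalers earlier in this section, normalization works row by row: each sample is rescaled to unit norm independently of the others. A minimal NumPy sketch of the L1 and L2 cases, assuming no row is all zeros:

```python
import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# L1: divide each row by the sum of its absolute values.
l1 = X / np.abs(X).sum(axis=1, keepdims=True)
# L2: divide each row by its Euclidean length.
l2 = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

print(l1)                               # first row becomes [0.25, -0.25, 0.5]
print(np.abs(l1).sum(axis=1))           # each row sums to 1 in absolute value
print(np.sqrt((l2 ** 2).sum(axis=1)))   # each row has unit length
```

Since nothing is learned from the data, a new row is normalized the same way whether or not the normalizer has seen any training data.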
16.3.6. Preprocessing data - Encoding#
from sklearn import preprocessing
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from US', 'uses Safari']])
enc.transform([['male', 'from Europe', 'uses Safari']])
enc.transform([['female', 'from Europe', 'uses Firefox']])
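To make the integer codes easier to read, the sketch below refits the same encoder and inspects its `categories_` attribute, then maps codes back to strings with `inverse_transform`. It is a self-contained illustration of the example above.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc = OrdinalEncoder().fit(X)

# One sorted array of categories per column; a value's code is its index there.
for column_categories in enc.categories_:
    print(list(column_categories))

codes = enc.transform([['female', 'from US', 'uses Safari']])
print(codes)                          # [[0. 1. 1.]]
print(enc.inverse_transform(codes))   # back to the original strings
```

Note that these codes imply an order between categories (for example 'from Europe' < 'from US') that many estimators would treat as meaningful, which is one reason to prefer one-hot encoding for unordered categories.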
16.3.6.1. One Hot Encoder#
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.categories_
enc.transform([['male', 'from US', 'uses Safari']])
tmp = enc.transform([['female', 'from Asia', 'uses Chrome'],
                     ['male', 'from Europe', 'uses Safari']]).toarray()
tmp
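The example above lists every possible category up front. When that is not practical, one alternative is to let the encoder infer the categories from the data and pass `handle_unknown='ignore'`, so that categories unseen during fitting are encoded as all zeros rather than raising an error. A small sketch under that assumption:

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# Categories are inferred from X; unseen ones are encoded as all zeros.
enc = OneHotEncoder(handle_unknown='ignore').fit(X)

known = enc.transform([['female', 'from US', 'uses Safari']]).toarray()
unknown = enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
print(known)     # [[1. 0. 0. 1. 0. 1.]]
print(unknown)   # [[1. 0. 0. 0. 0. 0.]]
```

In the second row only the gender columns are set, because 'from Asia' and 'uses Chrome' were not present when the encoder was fit.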
16.3.7. Discretization#
Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.
One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.
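The idea can be sketched by hand for a single feature: cut its range into bins, replace each value by its bin index, and optionally one-hot encode that index. The values below are made up, and the sketch uses equal-width bins for simplicity, which is only one possible binning strategy.

```python
import numpy as np

x = np.array([-3., 0., 1., 4., 6.])   # one continuous feature (made-up values)
n_bins = 3

# Equal-width bin edges between the minimum and maximum of the feature.
edges = np.linspace(x.min(), x.max(), n_bins + 1)   # -3, 0, 3, 6
# Compare against the interior edges only, so codes run from 0 to n_bins - 1.
codes = np.digitize(x, edges[1:-1])
print(codes)                          # [0 1 1 2 2]

# One-hot encode the bin codes: one indicator column per bin.
one_hot = np.eye(n_bins)[codes]
print(one_hot)
```

A linear model fit on `one_hot` learns one weight per bin, so its prediction becomes a step function of `x` rather than a straight line.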
X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 4], encode='ordinal').fit(X)
est.transform(X)
16.3.8. Univariate feature imputation#
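SimpleImputer replaces missing values with a statistic (here the mean) computed for each column of the data it was fit on.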
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
orig_data = [[1, 2],
             [np.nan, 3],
             [7, 6]]
imp.fit(orig_data)
imp.transform(orig_data)
X = [[np.nan, 2],
     [6, np.nan],
     [7, 6]]
print(imp.transform(X))
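What mean imputation computes can be reproduced in a few lines of NumPy. The arrays repeat the values used above; the names `train`, `new`, and `col_means` are made up for this sketch.

```python
import numpy as np

train = np.array([[1., 2.],
                  [np.nan, 3.],
                  [7., 6.]])
new = np.array([[np.nan, 2.],
                [6., np.nan],
                [7., 6.]])

# "Fit": per-column mean of the training data, ignoring missing entries.
col_means = np.nanmean(train, axis=0)   # 4.0 and about 3.667

# "Transform": fill each missing entry with the mean of its column.
filled = np.where(np.isnan(new), col_means, new)
print(filled)
```

The fill values come from `train`, not from `new`, which is the same fit-on-train, apply-to-test pattern used for the scalers.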
import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "w"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
import pandas as pd
# Column 0 has a three-way tie; "most_frequent" then returns the smallest value
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["c", np.nan],
                   ["b", "y"]], dtype="category")
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
16.3.9. Review Book Examples#
https://en.wikipedia.org/wiki/Correlation#/media/File:Correlation_examples2.svg
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html