16.2. Loading Data

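16.2.1. IPython Setting

from IPython.core.interactiveshell import InteractiveShell
# Display the value of every top-level expression in a cell, not only the last one.
# The setting applies to cells run after this one, so this cell still shows only y.
InteractiveShell.ast_node_interactivity = "all"
x = 1
y = 2
x
y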
2

16.2.2. Loading Sample Dataset

  • Iris data

from sklearn.datasets import load_iris
iris_dataset = load_iris()
iris_dataset
print(iris_dataset.DESCR)

16.2.2.1. Access the actual data

  • .data contains the actual data: the sepal and petal measurements (length and width, in cm) for each flower

iris_dataset.data

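As a quick check of what .data holds, the sketch below prints the array's shape, the column names from .feature_names, and the first row paired with those names. It assumes scikit-learn is installed; the variable name first_flower is only illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# One row per flower, one column per measurement
print(iris_dataset.data.shape)

# Column order of the measurements
print(iris_dataset.feature_names)

# Pair each measurement of the first flower with its column name
first_flower = dict(zip(iris_dataset.feature_names, iris_dataset.data[0]))
print(first_flower)
```

Reading the shape first is a cheap way to confirm that the rows are samples and the columns are features before passing the array to a model.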
16.2.2.2. Access the data target or label

  • .target contains the label associated with each row of .data

  • labels are encoded as the integers 0, 1, and 2

iris_dataset.target

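To see how the labels are encoded, one option is to list the distinct values in .target and count how often each occurs. This is a minimal sketch that assumes NumPy and scikit-learn are installed; the names codes and counts are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Distinct label codes and how many flowers carry each one
codes, counts = np.unique(iris_dataset.target, return_counts=True)
print(codes)
print(counts)
```

Equal counts per code indicate a balanced dataset, which is worth knowing before choosing an evaluation metric.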
16.2.2.3. Access mapping of data labels

  • .target_names gives you the species names associated with the encoded labels 0, 1, 2

iris_dataset.target_names

Note

What is the difference between .target and .target_names?

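One way to explore the question is to decode the integer codes in .target back into names by using them as indices into .target_names. The sketch below assumes scikit-learn is installed; the variable species is an illustrative name.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Index the array of names with the array of codes to decode every label
species = iris_dataset.target_names[iris_dataset.target]

# Compare codes and decoded names for the first and last three flowers
print(iris_dataset.target[:3], species[:3])
print(iris_dataset.target[-3:], species[-3:])
```

Here .target has one entry per flower, while .target_names has one entry per class.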
16.2.3. Loading Another Sample Dataset

  • Diabetes data

from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(diabetes.DESCR)
print(diabetes.data.shape)

16.2.3.1. Loading the Data into Pandas

import pandas as pd
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

Note

What is a feature? A feature is an attribute (a measurement or characteristic) of the data that can be used as an input to a model. In the DataFrame above, each column is one feature.

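To make the distinction between features and the target concrete, the following sketch builds a DataFrame with one column per feature and then appends the target as an extra column. It assumes pandas and scikit-learn are installed; the column name "target" is a choice made here, not something the dataset defines.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# Features: one column per input variable
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Target: the value a model would be trained to predict
df["target"] = diabetes.target

print(df.shape)
print(df.head())
```

Keeping features and target in one DataFrame is convenient for exploration; for model fitting they are usually split again into an input matrix and an output vector.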
16.2.4. Loading data from the web

import sklearn
from sklearn.datasets import fetch_california_housing
houses = fetch_california_housing()
print(houses.DESCR)

16.2.4.1. Getting Basic Information

houses.data.shape
houses.feature_names
pd.DataFrame(houses.data, columns=houses.feature_names)

16.2.5. Generate dataset

from sklearn.datasets import make_regression
x, y = make_regression(n_samples=100, n_features=5, n_targets=1, noise=0.005)
pd.DataFrame(y)
pd.DataFrame(x)

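Because make_regression draws random numbers, each call returns different data unless a seed is fixed. The sketch below is one possible variant, assuming NumPy and scikit-learn are installed: it passes random_state for reproducibility and coef=True to also get the coefficients used to generate the target. The seed value 0 and the names X, coef, and residual are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100,
    n_features=5,
    n_targets=1,
    noise=0.005,
    coef=True,       # also return the coefficients of the underlying linear model
    random_state=0,  # fixed seed so repeated runs give the same data
)

print(X.shape, y.shape, coef.shape)

# The target is a linear combination of the features plus a small amount of noise
residual = y - X @ coef
print(np.abs(residual).max())
```

With such a small noise value the residuals stay close to zero, which makes the generated data useful for checking that a regression model recovers known coefficients.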
16.2.5.1. Plotting the data

import seaborn as sns
sns.set(color_codes=True)
# regplot needs 1-D inputs, so plot the target against one feature at a time
sns.regplot(x=x[:, 0], y=y);

16.2.6. Load data from openml.org

from sklearn.datasets import fetch_openml
mice = fetch_openml(name='miceprotein')
print(mice.target)

16.2.7. Review Book Example