16.2. Loading Data
16.2.1. IPython Setting
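# Display the value of every expression in a cell, not only the last one.
# The setting takes effect starting with the next cell that runs,
# so this cell still shows only the value of y.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"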
x = 1
y = 2
x
y
2
16.2.2. Loading Sample Dataset
Iris data
from sklearn.datasets import load_iris
iris_dataset = load_iris()
iris_dataset
print(iris_dataset.DESCR)
16.2.2.1. Access the actual data
.data
contains the actual data, that is, the sepal and petal measurements
iris_dataset.data
16.2.2.2. Access the data target or label
.target
contains the associated labels; the labels are encoded as integers
iris_dataset.target
16.2.2.3. Access mapping of data labels
.target_names
gives you the names associated with the integer labels 0, 1, 2
iris_dataset.target_names
Note
What is the difference between .target and .target_names?
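One way to see the difference is to index one attribute with the other. The following is a minimal sketch, assuming scikit-learn is installed; the variable name species is introduced here only for illustration.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# .target holds one integer code per sample (0, 1, or 2).
print(iris_dataset.target[:5])

# .target_names holds the species name for each code, indexed by that code.
print(iris_dataset.target_names)

# Indexing target_names with the codes gives a readable label per sample.
# "species" is a hypothetical name used only in this sketch.
species = iris_dataset.target_names[iris_dataset.target]
print(species[0], species[50], species[100])
```

In short, .target is what a model is trained on, while .target_names is the lookup table for turning its integer codes back into words.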
16.2.3. Loading Another Sample Dataset
Diabetes data
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(diabetes.DESCR)
print(diabetes.data.shape)
16.2.3.1. Loading the Data into Pandas
import pandas as pd
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
Note
What is a feature? A feature is an attribute, measurement, value, or characteristic of the data that can be used as an input to a model.
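To keep the features and the value to predict side by side, the loader can also return pandas objects directly. This is a minimal sketch, assuming scikit-learn 0.23 or later and pandas are installed; the names diabetes_df, features, and target are illustrative only.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks scikit-learn to return pandas objects instead of arrays.
diabetes = load_diabetes(as_frame=True)

# .frame contains one column per feature plus a "target" column.
diabetes_df = diabetes.frame
print(diabetes_df.shape)

# The features are the model inputs; the target is the value to predict.
features = diabetes_df[diabetes.feature_names]
target = diabetes_df["target"]
print(features.columns.tolist())
```

Each column of features is one feature in the sense of the note above, and each row is one patient.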
16.2.4. Loading data from the web
import sklearn
from sklearn.datasets import fetch_california_housing
houses = fetch_california_housing()
print(houses.DESCR)
16.2.4.1. Getting Basic Information
houses.data.shape
houses.feature_names
pd.DataFrame(houses.data, columns=houses.feature_names)
16.2.5. Generate dataset
from sklearn.datasets import make_regression
x,y = make_regression(n_samples=100, n_features=5, n_targets=1, noise=0.005)
pd.DataFrame(y)
pd.DataFrame(x)
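Because make_regression draws random numbers, every call returns different data unless a seed is fixed. The sketch below assumes scikit-learn and NumPy are installed and uses the random_state parameter to make the generated dataset repeatable; the seed value 42 and the names x1, y1, x2, y2 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same arguments as above, plus a fixed seed so the data can be regenerated.
x1, y1 = make_regression(n_samples=100, n_features=5, n_targets=1,
                         noise=0.005, random_state=42)
x2, y2 = make_regression(n_samples=100, n_features=5, n_targets=1,
                         noise=0.005, random_state=42)

# x has one row per sample and one column per feature; y has one value per sample.
print(x1.shape, y1.shape)

# Two calls with the same seed produce identical arrays.
print(np.array_equal(x1, x2) and np.array_equal(y1, y2))
```

Fixing the seed is useful when results need to be compared across runs or shared with others.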
16.2.5.1. Plotting the data
import seaborn as sns
sns.set(color_codes=True)
# regplot needs 1-D inputs, so plot the first of the five features against y
sns.regplot(x=x[:, 0], y=y);
16.2.6. Load data from openml.org
from sklearn.datasets import fetch_openml
mice = fetch_openml(name='miceprotein')
print(mice.target)
16.2.7. Review Book Example
Work through the book example up to the section "Prepare the Data for Machine Learning Algorithms": https://colab.research.google.com/github/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb