Introduction

16.1. Introduction#

Note

What is the difference between supervised and unsupervised learning?

There are three main paradigms for machine learning: supervised learning, unsupervised learning, and reinforcement learning. We will cover the first two learning paradigms.
In supervised learning, you are training a machine to predict categories of new data, called classification, or to predict numerical values of a new data, called regression.
- Example of classification: You have a dataset that has sepal and petal measurements for three different types of Iris flowers. For each of flower type, the correct label or sub-species is given. You encounter a new set of flower measurements, but this time you only know the measurements of the sepal and petal. You can train classification algorithms to use knowledge of the labeled dataset to predict the label of the new set of flower measurements, namely, predict the flower sub-specie.
- Example of regression: You have a dataset with years of experience and associated salary. Someone asks you what the salary should be for a year not in the original dataset. You can fit the original data in a model to predict values that are not in the original dataset.
In unsupervised learning, the dataset has no labels. Instead, you can use an algorithm to find patterns or structures in the dataset, effectively, generate labels for the dataset. Usually clustering algorithms are used for unsupervised learning. The idea is to to cluster similar data points and label them as one.
- Example of clustering: You have a dataset that has sepal and petal measurements for three different types of Iris flowers, but you do not know which sub-species each measurement belongs to. You can use a clustering algorithm to find three centroids based on the measurements for each of the three sub-species, effectively generating labels for your unlabeled data.

16.1.1. Main Challenges of Machine Learning#

Insufficient quantity of training data
Nonrepresentative training data
Poor quality data
Irrelevant features
Overfitting the training data
Underfitting the training data
Testing and validation
Hyperparameter tuning and model selection
Data mismatch

16.1.2. Outline of an End-to-End Machine Learning Project#

What problem are you solving and why does it matter?
1. The purpose is to motivate why you are doing machine learning. Machine learning models themselves are not the end goal. Rather, what benefit can you get from applying that model? So you must be able to motivate the what and why of your problem. Finally, you should know how whatever you are doing is currently done.
2. Example from the housing data set: we want to build a model to predict median house pricing, which will then be combined with other data to determine if it is worth investing in a given area. Currently, this analysis is done manually by experts.
3. Example of medical diagnosis: We want to build an image classification model that can predict if a given image is cancerous or not. Currently, a radiologist has to swift through 100s of image to determine if a given is cancerous.
Getting/Loading the data
1. Data might have to parsed from some unstructured or semi-structured file and then loaded into a database
Explore and Visualize the data to gain insight
1. Visualize the geographical data
2. Look for correlations
3. Experiment with attribute combination
Prepare the data for machine learning algorithms
1. Clean the data
2. Handle text and categorical attributes
3. Perform feature scaling and transformation
4. Apply custom transformation, if needed
5. Create transformation pipeline
Select and train a model
1. Feature selection
2. Train and evaluate on the training set
3. Better evaluation using cross-validation
Fine-Tune your model
1. Grid search
2. Randomized search
3. Ensemble methods
4. Analyze the best models and their errors
5. Evaluate your system on the test set
Launch, monitor, and maintain your system

16.1.3. General Outline of Machine Learning#

Loading Data
- Load toy data included in sklearn
- Download published/annotated data from online
- Generate data with specific statistics to learn how algorithms work
Preprocessing Data
- Make data zero mean
- Make data unit variance
- Fix range of values
- Deal with missing values
- Map text labels to integer labels (if applicable) – ML algorithms require the data be numeric
Dimensionality Reduction of data
- If you use too many features and do not have enough samples, you could over fit.
- So you have to choose the most discriminating few features
Applying algorithms
- Labeled Data - Supervised
- Non-labeled Data - Unsupervised
Evaluation
- Receiver Operator Curve
  - Sensitivity
  - Specificity
- Imbalanced Data
- Example: 95 % one class, 5% another class

Introduction

Contents

16.1. Introduction#

16.1.1. Main Challenges of Machine Learning#

16.1.2. Outline of an End-to-End Machine Learning Project#

16.1.3. General Outline of Machine Learning#