Machine Learning#

General Machine Learning Concept#

  • Simplified: in machine learning, a computer is trained to classify new data. Think of it as an input-output device that takes in a number of inputs and, based on the pattern of these inputs, determines the most likely class associated with that data. There are two main types of learning strategies: 1) supervised learning, where you train the machine using data for which the correct class is known, and 2) unsupervised learning, where the classifier tries to find patterns within the input data itself. (Biosignal and Medical Image Processing, John Semmlow)

  • Classification (determine a label) vs. regression (predict a numerical value)

General Outline of Machine Learning#

  1. Loading Data

    • Load toy data included in sklearn

    • Download published/annotated data from online

    • Generate data with specific statistics to learn how algorithms work

  2. Preprocessing Data

    • Make data zero mean

    • Make data unit variance

    • Fix range of values

    • Deal with missing values

    • Map text labels to integer labels (if applicable) – ML algorithms require the data to be numeric

  3. Dimensionality Reduction of data

    • If you use too many features and do not have enough samples, you could overfit.

    • So you have to choose the few most discriminating features

  4. Applying algorithms

    • Labeled Data - Supervised

    • Non-labeled Data - Unsupervised

  5. Evaluation

    • Receiver Operating Characteristic (ROC) curve (see the sketch after this list)

      • Sensitivity

      • Specificity

    • Imbalanced Data

    • Example: 95% one class, 5% the other class

  • REF: Scikit-learn Essentials: Mastering the scikit-learn Machine Learning Library for Python by Dhiraj Kumar
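
A minimal sketch of the evaluation ideas above, assuming a logistic-regression classifier and an artificially imbalanced synthetic dataset (both are illustrative choices, not part of the outline):

```python
# Sketch: ROC curve, sensitivity/specificity, and an imbalanced (95%/5%) dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 95% of one class, 5% of the other.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points on the ROC curve
print("AUC:", roc_auc_score(y_test, scores))

# Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP)
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
```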

Various Python Libraries: When to use what#

  • ref:https://www.quora.com/What-is-the-relationship-among-NumPy-SciPy-Pandas-and-Scikit-learn-and-when-should-I-use-each-one-of-them

  • NumPy – provides efficient array computation via vectorization and broadcasting (see the sketch after this list)

    • Vectorization – no need for explicit looping; for example, multiplying or squaring an entire vector at once

    • Broadcasting – operations on arrays of different shapes, with the smaller array stretched to match the larger one

  • Pandas – uses NumPy arrays as the underlying structure; good for analyzing tabular data

  • SciPy (scientific Python) – provides libraries for scientific computing, including integration, interpolation, signal processing, linear algebra, and statistics. Also built on the NumPy infrastructure

  • Scikit-learn – provides a collection of machine-learning algorithms that use NumPy and SciPy

    • The most widely used Python library for machine learning, covering:

      • regression

      • classification

      • clustering
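
A small sketch of NumPy vectorization and broadcasting (the arrays here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Vectorization: square every element without an explicit Python loop.
squares = x ** 2

# Broadcasting: the scalar 10 is "stretched" to match the shape of x.
shifted = x + 10

# Broadcasting between shapes (3, 1) and (1, 4) produces a (3, 4) grid.
grid = np.arange(3).reshape(3, 1) + np.arange(4).reshape(1, 4)
print(squares, shifted, grid, sep="\n")
```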

Topics Covered#

  • Introduction to Scikit-learn

  • Loading Dataset using Scikit-learn

  • Preprocessing data using scikit-learn

  • Train Test split using scikit-learn

  • Linear regression using scikit-learn

  • Naive Bayes using scikit-learn

  • SVM using scikit-learn

  • k-Nearest neighbor using Scikit-learn

Introduction to Scikit-learn#

  • Choose the right estimator – the right algorithm for doing ML

  • Consistency – all objects share a common interface

  • Inspection – all parameter values are exposed as public attributes

  • Limited object hierarchy – algorithms are represented as Python classes, datasets mainly as NumPy arrays, and parameters as standard Python strings

  • Composition – ML workflows are expressed as sequences of fundamental algorithms

  • Defaults – provides good default values
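
A brief sketch illustrating these design principles – the common fit() interface, composition via Pipeline, and parameters exposed for inspection (the scaler plus logistic-regression pipeline is an assumed example):

```python
# Sketch: consistent interface, composition, and inspectable parameters.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Composition: chain fundamental steps into one estimator.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)                        # every estimator exposes the same fit() interface

# Inspection: parameter values are public attributes / readable via get_params().
print(pipe.get_params()["clf__C"])
```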

High Level Steps#

  • Choose the class of model to be coded

  • Choose the hyperparameters of the model

  • Arrange data into target and features

  • Write the model-fitting code using the fit() method (see the sketch below).
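
A minimal sketch of these four steps, assuming linear regression as the model class and synthetic data (both are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression   # 1. choose the class of model

model = LinearRegression(fit_intercept=True)        # 2. choose the hyperparameters

# 3. arrange data into a feature matrix X and a target vector y (toy data here)
rng = np.random.RandomState(0)
X = rng.rand(50, 1)
y = 3 * X.ravel() + rng.randn(50) * 0.1

model.fit(X, y)                                      # 4. fit the model to the data
print(model.coef_, model.intercept_)
```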

General Steps#

  • Load Data

    • Three ways to load data (see the sketch under Loading data below)

      • Dataset loaders (toy datasets that come with scikit-learn)

        • Good for illustrating how the various algorithms work

      • Dataset fetchers (real world datasets)

        • Built-in functions to load large datasets

        • Pull from openml.org

      • Dataset generation functions

        • Artificial datasets – can create labeled datasets

  • Pre-process Data

    • Remove the mean

    • Scale to unit variance

    • Non-linear transformation

    • Normalization

    • Encoding categorical features

    • Discretization

    • Imputation of missing values

    • Optionally remove outliers if it can be justified; always document this in the research paper

Loading data#
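
A minimal sketch of the three loading approaches described above (the specific datasets are illustrative choices; the fetcher downloads data from openml.org over the network):

```python
from sklearn.datasets import load_iris, fetch_openml, make_classification

# Dataset loader: small toy dataset bundled with scikit-learn.
iris = load_iris()
print(iris.data.shape, iris.target.shape)

# Dataset fetcher: real-world dataset pulled from openml.org (needs a network connection).
mnist = fetch_openml("mnist_784", version=1, as_frame=False)

# Dataset generator: artificial, labeled data with chosen statistics.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)
```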

Preprocessing#

Mean and Variance#

  • Many algorithms require that all features have variances of similar magnitude. If one feature's variance is orders of magnitude larger than the others', it may dominate the objective function.

    • Whatever mean and standard deviation you use to standardize the training set, you must apply the same values to the test set.

  • Many algorithms assume that the input data is Gaussian distributed with zero mean and unit variance.

  • Power transformers aim to map data from any distribution to as close to a Gaussian distribution as possible (see the sketch below).
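
A minimal sketch, assuming a small made-up feature matrix: standardize using statistics learned from the training set only, and optionally apply a power transform:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, PowerTransformer

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test  = np.array([[1.5, 300.0]])

scaler = StandardScaler().fit(X_train)      # learns mean and std from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)   # the same mean/std applied to the test set

# Power transform: map data toward a more Gaussian-looking distribution.
pt = PowerTransformer()                     # Yeo-Johnson by default
X_train_gauss = pt.fit_transform(X_train)
```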

Encoding#

  • Map text values to integer codes. For example, instead of using text for city names, use integers, or Male = 0, Female = 1. This lets models work with numeric inputs (see the sketch below).
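
A minimal sketch, assuming a made-up list of city names as the categorical feature:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = ["Boston", "Chicago", "Boston", "Denver"]

# Integer codes: each city name mapped to an integer.
le = LabelEncoder()
city_codes = le.fit_transform(cities)                 # e.g. array([0, 1, 0, 2])

# One-hot encoding: one column per category, avoiding an implied order.
ohe = OneHotEncoder()
onehot = ohe.fit_transform([[c] for c in cities]).toarray()
print(city_codes, onehot, sep="\n")
```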

Discretization#

  • Turn continuous values into discrete values by grouping them into bins. Linear models can effectively become non-linear due to discretization (see the sketch below).
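
A minimal sketch, assuming a single made-up continuous feature and three uniform bins:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.1], [0.4], [0.5], [0.9], [2.5], [3.0]])

# Each continuous value is replaced by the index of the bin it falls into.
kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = kbd.fit_transform(X)
print(X_binned.ravel())
```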

Imputation#

  • Many real-world datasets contain missing values; two common strategies (see the sketch below):

    • Discard the entire rows (or columns) containing them

    • Or fill in the missing data – usually by inferring it from the available data
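
A minimal sketch, assuming a small made-up array with missing entries and mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Fill each missing value with the mean of its column.
imp = SimpleImputer(strategy="mean")    # other options: "median", "most_frequent", "constant"
X_filled = imp.fit_transform(X)
print(X_filled)
```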

Splitting Data#

  • It is common to split data into training and testing samples.

  • Usually you do a 90/10 or 80/20 train/test split.

  • The splitting has to be random

  • K-fold cross-validation – split the data into K folds, training on K−1 folds and testing on the remaining one, rotating K times (see the sketch below)
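
A minimal sketch, assuming the iris toy dataset and a k-nearest-neighbors classifier (both illustrative choices): a random 80/20 split plus 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Random 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation: the data is split 5 times, each fold used once as the test set.
clf = KNeighborsClassifier()
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```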

Linear Regression#

  • Linear regression models the relationship between a dependent variable and one or more independent variables: as an independent variable increases or decreases, the dependent variable increases or decreases proportionally (see the sketch below).
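
A minimal sketch, assuming synthetic data with one independent variable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10                     # independent variable
y = 2.5 * X.ravel() + 1.0 + rng.randn(100)    # dependent variable with noise

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)              # should be close to 2.5 and 1.0
y_pred = reg.predict([[4.0]])                 # predict for a new value
```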

Naive Bayes#

  • Simple supervised machine learning classifier

  • Assumes the features are independent

    • Example – an apple is red, round, and 4 cm in diameter; each of these features is treated as contributing independently to the probability that the fruit is an apple
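
A minimal sketch, assuming Gaussian Naive Bayes on the iris toy dataset (the dataset is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB treats each feature as independent given the class.
nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))
```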

Support Vector Machine (SVM)#

  • Another supervised machine learning classifier

    • Used for both classification and regression

  • Can do non-linear classification (via the kernel trick)

  • Hyperplane – the decision boundary chosen to maximize the margin between the two classes

  • Support vectors – the data points closest to the hyperplane that define it

  • Margin width – to define the optimal hyperplane, we maximize the width of the margin (see the sketch below)
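
A minimal sketch, assuming an RBF-kernel SVC on the iris toy dataset with standardization (the dataset and parameter values are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for SVMs; the pipeline applies the same scaling to train and test.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))

# The support vectors (points that define the hyperplane) are available after fitting.
print(svm.named_steps["svc"].support_vectors_.shape)
```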