Machine Learning and Statistical Learning

Machine Learning is a field of study focused on the analysis and development of algorithms that enable computers to make decisions and predictions based on previously observed data, rather than on rules explicitly programmed by a human. It is a rapidly growing field that has applications in many industries.

Statistical Learning refers to a subfield of machine learning that deals with algorithms based on statistics and probability theory. Unlike the results of non-statistical machine learning algorithms, the results of a statistical learning algorithm can typically be evaluated using statistical methods. A statistical learning algorithm not only produces a prediction, but can also account for the amount of uncertainty in that prediction.
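As a rough illustration of a prediction that comes with a measure of uncertainty, the sketch below predicts the mean of a small sample and attaches an approximate 95% interval to it. The function name and the data are invented for this example, and the 1.96 multiplier assumes a normal approximation, which is a simplification for a sample this small.

```python
import math
import statistics


def predict_with_uncertainty(observations):
    """Return a point prediction and a rough 95% interval for the mean."""
    n = len(observations)
    mean = statistics.mean(observations)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(observations) / math.sqrt(n)
    # 1.96 is the normal-approximation multiplier for a 95% interval
    # (an assumption of this sketch, not a general recommendation).
    margin = 1.96 * std_err
    return mean, (mean - margin, mean + margin)


# Hypothetical measurements, made up for illustration only.
data = [4.8, 5.1, 5.0, 4.9, 5.2]
prediction, interval = predict_with_uncertainty(data)
print(prediction, interval)
```

The interval narrows as more observations are added, which is one simple way a statistical method can express how much confidence a prediction deserves.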

Applications of Machine Learning

Machine learning is a rapidly growing field that has applications in many disciplines. A (non-exhaustive) list of applications includes spam detection, fraud detection, generating search results, recommender systems, image classification, voice recognition, character recognition, programming autonomous vehicles, predicting credit scores, predicting housing costs, supply chain management, and medical diagnoses.

Types of Machine Learning Algorithms

There are three main categories of machine learning problems: supervised learning, unsupervised learning, and reinforcement learning.

- The goal of Supervised Learning is to create models to generate predictions.
- The goal of Unsupervised Learning is to identify structure within the data.
- The goal of Reinforcement Learning is to train an algorithm to make decisions within an environment in order to complete a certain task.

In this class, we will focus primarily on supervised learning, but will also see a bit of unsupervised learning.

Supervised Learning

In a supervised learning task, the goal is to generate a model or prediction function that can predict an output based on one or more input values.

- The output of the model is typically called a label, target value, or response variable.
- The inputs into the model are typically called features or predictors.

The algorithm “learns” the optimal model by studying a training set of data that consists of several observations (or instances, or samples), each of which contains values for both its features and its label. You can imagine a supervised learning algorithm as a function that takes training data as input, and that produces a model, or prediction function, as its output.
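The idea of a learning algorithm as a function from training data to a prediction function can be sketched in a few lines. The example below uses a 1-nearest-neighbor rule purely as an illustration; the function name and the tiny training set are hypothetical and are not taken from the course materials.

```python
def learn_nearest_neighbor(training_set):
    """A very simple learning algorithm.

    training_set: list of (features, label) pairs, where features is a tuple.
    Returns a prediction function (the "model").
    """

    def predict(features):
        # Squared Euclidean distance from the query to one training observation.
        def squared_distance(observation):
            observed_features, _ = observation
            return sum((a - b) ** 2 for a, b in zip(observed_features, features))

        # Predict the label of the closest training observation.
        _, label = min(training_set, key=squared_distance)
        return label

    return predict


# Each observation holds values for its features and its label.
training_set = [
    ((1.0, 1.0), "Blue"),
    ((5.0, 5.0), "Orange"),
    ((1.5, 0.5), "Blue"),
]

model = learn_nearest_neighbor(training_set)  # training data in, model out
print(model((4.0, 4.5)))  # prints "Orange"
```

Here `learn_nearest_neighbor` plays the role of the learning algorithm, and the returned `predict` function plays the role of the model.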

Regression and Classification

Supervised learning tasks can be further grouped into regression and classification problems, based on whether the labels are continuous and real-valued or categorical.

Regression. A regression task is one for which the target values are continuous, real numbers. Examples of regression tasks include:
- Predicting a credit score.
- Predicting the sale price of a home.
- Predicting blood pressure.
- Predicting dimensions of an organism.

Classification. A classification task is one for which the target values are discrete classes. Examples of classification tasks include:
- Predicting whether a loan is “high risk” or “low risk”.
- Predicting the species of an organism based on measurements.
- Classifying objects within an image.
- Recognizing images of hand-written letters or digits.
- Voice recognition.

Example of a Regression Task

The figure below shows a small data set that could be used for a regression task. The horizontal axis represents the single feature/predictor, while the vertical axis represents the continuous label/response.

Also displayed are the plots of 5 prediction functions (or models) that were generated using different regression algorithms.
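The text does not name the regression algorithms used for the figure. As one common example of how such a prediction function can be produced, the sketch below fits a straight line to a single feature by ordinary least squares. The data values are invented so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature: y is modeled as intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(x):
        return intercept + slope * x

    return predict


# Hypothetical training data that lies exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print(model(5.0))  # prints 11.0
```

Other regression algorithms would produce differently shaped prediction functions from the same training set, which is the kind of comparison the figure describes.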

Example of a Classification Task

The figure below shows the plot of a synthetic data set created for use in a classification task. The two axes represent two different features/predictors. Each point represents a single observation. Each observation is assigned one of two classes, “Orange” or “Blue”, as indicated by the color of the point.

The prediction function generated by a classification algorithm would assign one of the two classes to each point in the plane. The figure below shows the regions designated as being in each class according to 5 different classification algorithms.
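To make the idea of assigning a class to every point in the plane concrete, here is a minimal sketch using a nearest-centroid rule. This is only one possible classifier, chosen for brevity; the text does not say which algorithms produced the figure, and the six training points are made up.

```python
def fit_nearest_centroid(points, labels):
    """Learn one centroid per class; predict the class with the closest centroid."""
    sums = {}
    for (x, y), label in zip(points, labels):
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    centroids = {label: (sx / c, sy / c) for label, (sx, sy, c) in sums.items()}

    def predict(point):
        px, py = point

        def squared_distance(label):
            cx, cy = centroids[label]
            return (cx - px) ** 2 + (cy - py) ** 2

        return min(centroids, key=squared_distance)

    return predict


# Hypothetical two-feature training set with two classes.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = ["Blue", "Blue", "Blue", "Orange", "Orange", "Orange"]
classify = fit_nearest_centroid(points, labels)

# Print the predicted class (first letter) on a coarse grid; the top row is y = 5.
for y in range(5, -1, -1):
    print("".join(classify((float(x), float(y)))[0] for x in range(6)))
```

The printed grid is a crude text version of a decision-region plot: every location receives a class, whether or not a training point sits there.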

Unsupervised Learning

In unsupervised learning, we work with data sets in which features are provided, but no labels. Not only are the labels “missing”, but we typically do not start an unsupervised learning task with a preconception of what the possible labels might be. The goal of unsupervised learning is to identify structure within the data that might not be readily perceptible to humans.

Some common unsupervised learning tasks are:

- Clustering
- Outlier detection
- Feature extraction

The figure below shows the results of applying a clustering algorithm to an unlabeled dataset. The algorithm used in this example is referred to as K-means clustering.
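A bare-bones version of K-means can be sketched as follows. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. Using the first k points as the starting centroids and a fixed number of iterations are simplifications made for this example; practical implementations usually pick starting centroids at random and stop once the assignments no longer change.

```python
def nearest_index(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
        + (point[1] - centroids[i][1]) ** 2,
    )


def k_means(points, k, iterations=10):
    # Simplification: start from the first k points rather than a random choice.
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[i] = (
                    sum(x for x, _ in cluster) / len(cluster),
                    sum(y for _, y in cluster) / len(cluster),
                )
    assignments = [nearest_index(p, centroids) for p in points]
    return centroids, assignments


# Hypothetical unlabeled data with two visually obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # prints [0, 1, 0, 0, 1, 1]
```

Note that no labels are supplied; the cluster indices 0 and 1 are created by the algorithm and carry no meaning beyond group membership.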

Reinforcement Learning

Reinforcement learning is a branch of machine learning concerned with training an AI to interact with a virtual environment to perform a specific task. Examples of reinforcement learning tasks include training an algorithm to play chess, to play a video game, or to navigate a self-driving car.

In reinforcement learning, a software agent is provided with rules governing how it can interact with its environment. A goal is set for the agent, and a method is provided to score how successful the agent is at accomplishing that task. The RL algorithm then constructs a training set through trial-and-error. With each new attempt, the agent receives feedback regarding its performance, and then makes adjustments as needed.
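The trial-and-error loop described above can be illustrated with a very small sketch. It uses the simplest possible setting: an environment with a single state and two allowed actions, often called a multi-armed bandit. The environment, the reward values, and the epsilon-greedy exploration rule are all assumptions made for this example, and real reinforcement learning tasks involve many states and far more elaborate algorithms.

```python
import random


def run_agent(reward_for, actions, attempts=100, explore_rate=0.1, seed=0):
    """Learn by trial and error which action earns the highest reward."""
    rng = random.Random(seed)
    estimates = {action: 0.0 for action in actions}  # estimated reward per action
    counts = {action: 0 for action in actions}
    for attempt in range(attempts):
        if attempt < len(actions):
            action = actions[attempt]          # try every action once first
        elif rng.random() < explore_rate:
            action = rng.choice(actions)       # occasionally explore
        else:
            action = max(actions, key=lambda a: estimates[a])  # otherwise exploit
        reward = reward_for(action)            # feedback (score) from the environment
        counts[action] += 1
        # Adjust the running average of rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


def toy_environment(action):
    """Hypothetical scoring rule: moving right reaches the goal, moving left does not."""
    return 1.0 if action == "right" else 0.0


estimates = run_agent(toy_environment, ["left", "right"])
best_action = max(estimates, key=estimates.get)
print(best_action)  # prints "right"
```

The agent is never told which action is correct. It only receives a score after each attempt and shifts its behavior toward the actions that have scored well so far.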

Some links to YouTube videos demonstrating interesting applications of reinforcement learning are provided below.

Lecture 1.1 - Introduction to Statistical Learning