Supervised Learning

The goal in a supervised learning task is to use observed data to create a model that can be used to predict values of some target variable \(Y\) given observed values for a set of features \(X_1, X_2, ..., X_p\).

Terminology

There are a variety of terms that are used to distinguish between the inputs and the output of a supervised learning model.

The output of the model is typically called a response, label, or dependent variable.
The inputs of the model are typically called predictors, features, or independent variables.

Classification and Regression

We can separate supervised learning tasks further into the categories of regression or classification, based on the nature of the target variable \(Y\).

We refer to a supervised learning task in which the target variable is continuous and quantitative (numerical) as a regression task.
We refer to a supervised learning task in which the target variable is qualitative (categorical) as a classification task.

Types of Relationships

Assume that \(X\) and \(Y\) are two variables that are in some way related, and we wish to build a model that explains the way in which the value of \(Y\) depends on the value of \(X\). We can classify such a model as either stochastic or deterministic, depending on whether or not it allows for randomness in the relationship.

Deterministic Relationships

A relationship between two variables \(X\) and \(Y\) is said to be deterministic if there is no randomness in the relationship. In a deterministic relationship, knowing the value of \(X\) allows you to determine the exact value of \(Y\). A deterministic relationship might be described by an equation of the form:

\[Y = f(X)\]

When building machine learning models, we generally work with variables that display imperfect, or noisy, relationships that do not behave deterministically.

Stochastic Relationships

A relationship between two variables \(X\) and \(Y\) is said to be stochastic if there is some randomness in the relationship. In a stochastic relationship, knowing the value of one variable might (or might not) allow you to come up with a reasonable estimate for the other variable, but you can never hope to know the exact value of one variable based on the value of another.

A stochastic relationship might be represented using an equation of the following form:

\[Y = f(X) + \varepsilon\]

In the previous equation, we assume that \(\varepsilon\) is a random variable whose value we do not know ahead of time. It represents uncertainty in the model (from measurement error, effects of unknown variables, or truly random effects).

To have a complete stochastic model, we need to know the function \(f\), as well as the distribution of the error term \(\varepsilon\). A common assumption for a stochastic model of this form is that \(\varepsilon\) is normally distributed with a mean of 0 and some standard deviation, \(\sigma\). This assumption is often written as \(\varepsilon \sim N(0, \sigma^2)\). The size of \(\sigma\) represents the amount of uncertainty in our model.
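As a rough illustration of how \(\sigma\) controls the amount of noise, the following R sketch simulates the same relationship with two different values of \(\sigma\). The function `f`, the grid of `x` values, and the two values of `sd` are assumptions chosen only for this illustration.

```r
# Hypothetical f, chosen only for illustration
f <- function(x) 4 - 2 * x + x^2

set.seed(1)                        # fix the seed so the run is repeatable
x <- seq(0, 4, length.out = 200)   # assumed grid of x values

# Same f, two different error standard deviations (assumed values)
y_low  <- f(x) + rnorm(length(x), mean = 0, sd = 0.5)
y_high <- f(x) + rnorm(length(x), mean = 0, sd = 3)

# Spread of the observations around f(x) in each case
spread_low  <- sd(y_low  - f(x))
spread_high <- sd(y_high - f(x))
print(c(low = spread_low, high = spread_high))
```

The spread around \(f(x)\) should come out close to the value of `sd` used in each case, so a larger \(\sigma\) means that observations tend to fall farther from the curve.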

Population Models

Assume that the true relationship between the variables \(X\) and \(Y\) is described by a stochastic model of the form \(Y = f(X) + \varepsilon\). We call such a model a population model, as it describes the relationship for the entire population of pairs of values \((x,y)\), albeit with some amount of noise.

Sampling from a Population Model

Assume that we have a population model of the form \(Y = f(X) + \varepsilon\), where \(\varepsilon \sim N(0, \sigma^2)\). Suppose that we know \(f\) and \(\sigma\). We can generate a sample of observations \((x_i, y_i)\) from this population model as follows:

We start by selecting \(n\) values \(x_1, x_2, ..., x_n\).
We then randomly generate \(n\) error terms \(\varepsilon_1, \varepsilon_2,..., \varepsilon_n\) according to the distribution \(N(0, \sigma^2)\).
Finally, we determine values \(y_1, y_2, ..., y_n\) by setting \(y_i = f(x_i) + \varepsilon_i\).
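The three steps above can be sketched as a small R helper. The function name `sample_from_model`, the linear \(f\), the grid of \(x\) values, and the value of \(\sigma\) are all hypothetical and used only to demonstrate the procedure.

```r
# Hypothetical helper: draw a sample from Y = f(X) + eps, eps ~ N(0, sigma^2)
sample_from_model <- function(f, x, sigma) {
  eps <- rnorm(length(x), mean = 0, sd = sigma)  # step 2: random error terms
  y <- f(x) + eps                                # step 3: y_i = f(x_i) + eps_i
  data.frame(x = x, y = y)
}

# Step 1: choose the x values (an arbitrary grid, assumed for illustration)
x_vals <- seq(0, 1, by = 0.25)

# An assumed linear f and sigma, used only to demonstrate the helper
obs <- sample_from_model(function(x) 1 + 3 * x, x_vals, sigma = 0.5)
print(obs)
```

Passing `sigma = 0` removes the noise entirely, in which case the helper returns the deterministic values \(f(x_i)\).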

Example in R

Suppose that we have a population model of the form \(Y = 4 - 2 X + X^2 + \varepsilon\), where \(\varepsilon \sim N(0, \sigma^2 = 4)\). The following R code generates a sample of 10 observations of the form \((x_i, y_i)\) according to this model.

Note that since the error terms are randomly generated, the sample will be a bit different each time.

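# Chosen values of x
x <- c(0.5, 0.75, 1.5, 2.25, 2.5, 2.5, 3, 3.25, 3.5, 4)
# Random error terms from N(0, sigma^2 = 4), so sd = 2
e <- rnorm(n=length(x), mean=0, sd=2)
# Responses generated from the population model
y <- 4 - 2*x + x^2 + e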
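If a repeatable sample is wanted, one option is to fix the seed of the random number generator with `set.seed()` before calling `rnorm()`. The sketch below uses an arbitrary seed value of 42, which is an assumption and has no special meaning.

```r
# Resetting the same seed reproduces the same error terms
set.seed(42)
e1 <- rnorm(n = 10, mean = 0, sd = 2)

set.seed(42)
e2 <- rnorm(n = 10, mean = 0, sd = 2)

# Drawing again without resetting the seed gives new error terms
e3 <- rnorm(n = 10, mean = 0, sd = 2)

print(identical(e1, e2))  # same seed, same draws
print(identical(e1, e3))  # generator has advanced, different draws
```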

Goal of Supervised Learning

In a real-world supervised learning problem, we will not have access to a population model. Instead, we will have a sample of observations that we assume to have been generated by some (hypothetical) population model. Our goal is to use the observed data to find a function \(\hat f\) that approximates the relationship defined by \(f\). We will typically also wish to approximate the standard deviation of the error term with a quantity \(s = \hat \sigma\).
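As a minimal sketch of what estimating \(\hat f\) and \(s\) can look like, the R code below simulates data from an assumed quadratic population model and then fits a quadratic by least squares with `lm()`. Both the simulated model and the choice of a quadratic form for \(\hat f\) are assumptions made for illustration. In a real problem the form of \(f\) is not known, and least squares is only one possible way to choose \(\hat f\).

```r
set.seed(1)
n <- 100

# Simulated training data from an assumed population model (sigma = 2)
x <- runif(n, min = 0, max = 4)
y <- 4 - 2 * x + x^2 + rnorm(n, mean = 0, sd = 2)

# Fit a quadratic by least squares; the quadratic form is an assumption
fit <- lm(y ~ x + I(x^2))

# f_hat: hypothetical wrapper that predicts y for new x values
f_hat <- function(x_new) predict(fit, newdata = data.frame(x = x_new))

# s: residual standard error, an estimate of sigma
s <- summary(fit)$sigma

print(coef(fit))          # estimated coefficients of f_hat
print(s)                  # estimate of sigma
print(f_hat(c(1, 2, 3)))  # predictions at a few new x values
```

With a sample of this size, the estimated coefficients and \(s\) should land in the neighborhood of the values used in the simulation, though they will not match them exactly.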

Learning from Data

In a supervised learning task, we have a collection of observed data, called the training data, that we use to try to find a fitted function \(Y = \hat f(X)\) that best describes the relationship between the variables. The question of what function provides the “best” fit is a complicated one, which we will discuss in detail in the coming lectures.

We close this lesson with a plot that shows a training set along with three proposed models generated from this training set. I encourage you to consider which model you believe is the “best”, and to think about why you selected the model that you did.

Lesson 2.1 - Introduction to Supervised Learning