Robbie Beane
The goal in a supervised learning task is to use observed data to create a model that can be used to predict values of some target variable \(Y\) given observed values for a set of features \(X_1, X_2, ..., X_p\).
There are a variety of terms that are used to distinguish between the inputs and the output of a supervised learning model.
We can separate supervised learning tasks further into the categories of regression or classification, based on the nature of the target variable \(Y\).
Assume that \(X\) and \(Y\) are two variables that are in some way related, and we wish to build a model that explains the way in which the value of \(Y\) depends on the value of \(X\). We can classify such a model as either stochastic or deterministic, depending on whether or not it allows for randomness in the relationship.
A relationship between two variables \(X\) and \(Y\) is said to be deterministic if there is no randomness in the relationship. In a deterministic relationship, knowing the value of \(X\) allows you to determine the exact value of \(Y\). A deterministic relationship might be described by an equation of the form:
\[Y = f(X)\]
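As a small illustration, the sketch below uses the Celsius-to-Fahrenheit conversion as a deterministic relationship. This example and the names `f`, `celsius`, and `fahrenheit` are our own choices for illustration, not part of the lesson's data.

```r
# A deterministic relationship (illustrative choice): Fahrenheit = 1.8 * Celsius + 32
f <- function(x) 1.8 * x + 32

celsius <- c(0, 25, 100)
fahrenheit <- f(celsius)
print(fahrenheit)

# Evaluating f a second time on the same inputs reproduces the same outputs,
# because there is no random component in the relationship
print(identical(f(celsius), fahrenheit))
```

Every time the code is run, the same inputs produce the same outputs.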
When building machine learning models, we generally work with variables that display imperfect, or noisy, relationships that do not behave deterministically.
A relationship between two variables \(X\) and \(Y\) is said to be stochastic if there is some randomness in the relationship. In a stochastic relationship, knowing the value of one variable might (or might not) allow you to come up with a reasonable estimate for the other variable, but you can never hope to know the exact value of one variable based on the value of another.
A stochastic relationship might be represented using an equation of the following form:
\[Y = f(X) + \varepsilon\]
In the previous equation, we assume that \(\varepsilon\) is a random variable whose value we do not know ahead of time. It represents uncertainty in the model (from measurement error, effects of unknown variables, or truly random effects).
To have a complete stochastic model, we need to know the function \(f\), as well as the distribution of the error term \(\varepsilon\). A common assumption for a stochastic model of this form is that \(\varepsilon\) is normally distributed with a mean of 0 and some standard deviation, \(\sigma\). This assumption is often written as \(\varepsilon \sim N(0, \sigma^2)\). The size of \(\sigma\) represents the amount of uncertainty in our model.
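To see how \(\sigma\) controls the amount of noise, the following sketch simulates the same stochastic model with a small and a large value of \(\sigma\). The function \(f(x) = 3 + 0.5x\), the two values of \(\sigma\), the helper `simulate_y`, and the fixed seed are all assumptions made only for this illustration.

```r
set.seed(1)  # fixed seed so the sketch is reproducible (our assumption)

# Hypothetical systematic part of the model
f <- function(x) 3 + 0.5 * x

x <- seq(0, 4, length.out = 200)

# Hypothetical helper: returns f(x) plus normally distributed noise
simulate_y <- function(x, sigma) {
  f(x) + rnorm(n = length(x), mean = 0, sd = sigma)
}

y_low  <- simulate_y(x, sigma = 0.5)
y_high <- simulate_y(x, sigma = 2)

# Spread of the observed values around f(x) under each setting
print(sd(y_low - f(x)))
print(sd(y_high - f(x)))
```

Note that the notation \(N(0, \sigma^2)\) is written in terms of the variance, while the `sd` argument of `rnorm()` expects the standard deviation \(\sigma\).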
Assume that the true relationship between the variables \(X\) and \(Y\) is described by a stochastic model of the form \(Y = f(X) + \varepsilon\). We call such a model a population model, as it describes the relationship for the entire population of pairs of values \((x,y)\), albeit with some amount of noise.
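Assume that we have a population model of the form \(Y = f(X) + \varepsilon\), where \(\varepsilon \sim N(0, \sigma^2)\). Suppose that we know \(f\) and \(\sigma\). We can generate a sample of observations \((x_i, y_i)\) from this population model as follows: select values \(x_1, \ldots, x_n\) for the feature, draw independent errors \(\varepsilon_i\) from \(N(0, \sigma^2)\), and set \(y_i = f(x_i) + \varepsilon_i\).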
Suppose that we have a population model of the form \(Y = 4 - 2 X + X^2 + \varepsilon\), where \(\varepsilon \sim N(0, \sigma^2 = 4)\). The following R code generates a sample of 10 observations of the form \((x_i, y_i)\) according to this model.
Note that since the error terms are randomly generated, the sample will be a bit different each time.
# Fixed feature values for the 10 observations
x <- c(0.5, 0.75, 1.5, 2.25, 2.5, 2.5, 3, 3.25, 3.5, 4)
# Random errors drawn from N(0, sigma^2 = 4), so sd = 2
e <- rnorm(n=10, mean=0, sd=2)
# Responses: y = f(x) + error
y <- 4 - 2*x + x^2 + e
In a real-world supervised learning problem, we will not have access to a population model. Instead, we will have a sample of observations that we assume to have been generated by some (hypothetical) population model. Our goal is to use the observed data to find a function \(\hat f\) that approximates the relationship defined by \(f\). We will typically also wish to approximate the standard deviation of the error term with a quantity \(s = \hat \sigma\).
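As a rough sketch of what estimating \(\hat f\) and \(s\) can look like, the code below fits a quadratic to a simulated sample by least squares using `lm()`. It assumes that we already know the relationship is quadratic, which would not generally be the case in practice. The fixed seed and the helper `f_hat` are our own additions, and least squares is only one possible way to choose \(\hat f\).

```r
set.seed(42)  # fixed seed for reproducibility (our assumption)

# Simulate a sample from the population model Y = 4 - 2X + X^2 + error, sd = 2
x <- c(0.5, 0.75, 1.5, 2.25, 2.5, 2.5, 3, 3.25, 3.5, 4)
y <- 4 - 2*x + x^2 + rnorm(n = 10, mean = 0, sd = 2)

# Least-squares fit of a quadratic; this assumes the form of f is known
fit <- lm(y ~ x + I(x^2))

# Estimated coefficients, to be compared with the true values 4, -2, and 1
print(coef(fit))

# Residual standard error, an estimate s of sigma (true value 2)
s <- sigma(fit)
print(s)

# Hypothetical helper that evaluates the fitted function at new x values
f_hat <- function(x_new) predict(fit, newdata = data.frame(x = x_new))
print(f_hat(2))
```

With only 10 observations, the estimated coefficients and \(s\) may differ noticeably from the population values.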
In a supervised learning task, we have a collection of observed data, called the training data, that we use to try to find a fitted function \(Y = \hat f(X)\) that best describes the relationship between the variables. The question of what function provides the “best” fit is a complicated one, which we will discuss in detail in the coming lectures.
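To make the idea of competing fitted functions concrete, the sketch below fits three candidate models to one simulated training set and reports how closely each one follows the training data. The three candidates (polynomials of degree 1, 2, and 5), the simulated data, and the name `train_rmse` are assumptions chosen for illustration only.

```r
set.seed(7)  # fixed seed for reproducibility (our assumption)

# Simulated training data from Y = 4 - 2X + X^2 + error, sd = 2
x <- c(0.5, 0.75, 1.5, 2.25, 2.5, 2.5, 3, 3.25, 3.5, 4)
y <- 4 - 2*x + x^2 + rnorm(n = 10, mean = 0, sd = 2)

# Three hypothetical candidate models of increasing flexibility
candidates <- list(
  linear    = lm(y ~ x),
  quadratic = lm(y ~ poly(x, 2)),
  degree5   = lm(y ~ poly(x, 5))
)

# Root mean squared error of each candidate on the training data
train_rmse <- sapply(candidates, function(m) sqrt(mean(residuals(m)^2)))
print(round(train_rmse, 3))
```

Because these candidates are nested, the training error cannot increase as the degree grows, so a small training error on its own does not settle which fit is "best".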
We close this lesson with a plot that shows a training set along with three proposed models generated from this training set. I encourage you to consider which model you believe is the “best”, and to think about why you selected the model that you did.