Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each. Can we predict Sales using these three? Perhaps we can do better using a model: \[Sales = f(TV,Radio,Newspaper)\]
Here Sales is a response or target that we wish to predict. We generically refer to the response as \(Y\).
TV is a feature, or input, or predictor; we name it \(X_1\).
Likewise we name Radio \(X_2\), and so on.
We can refer to the input vector collectively as
\[X = \left( \begin{array}{c} X_1 \\ X_2 \\ X_3 \end{array} \right)\]
Now we write our model as \[Y = f(X) + \epsilon\] where \(\epsilon\) captures measurement errors and other discrepancies.
With a good \(f\) we can make predictions of \(Y\) at new points \(X = x\).
We can understand which components of \(X = (X_1, X_2, ..., X_p)\) are important in explaining \(Y\), and which are irrelevant, e.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not.
Depending on the complexity of \(f\), we may be able to understand how each component \(X_j\) of \(X\) affects \(Y\).
Is there an ideal \(f(X)\)? In particular, what is a good value for \(f(X)\) at any selected value of \(X\), say \(X = 4\)? There can be many \(Y\) values at \(X = 4\). A good value is \[f(4) = E(Y|X = 4)\] \(E(Y|X = 4)\) means the expected value (average) of \(Y\) given \(X = 4\).
This ideal \(f(x) = E(Y|X = x)\) is called the regression function.
It is also defined for a vector \(X\); e.g. \[f(x) = f(x_1,x_2,x_3) = E(Y|X_1 = x_1, X_2 = x_2, X_3 = x_3)\]
It is the ideal or optimal predictor of \(Y\) with regard to mean-squared prediction error: \(f(x) = E(Y|X = x)\) is the function that minimizes \(E[(Y-g(X))^2|X = x]\) over all functions \(g\) at all points \(X = x\).
\(\epsilon = Y - f(x)\) is the irreducible error – i.e. even if we knew \(f(x)\), we would still make errors in prediction, since at each \(X = x\) there is typically a distribution of possible \(Y\) values.
For any estimate \(\hat{f}(x)\) of \(f(x)\), we have \[E[(Y - \hat{f}(X))^2|X = x] = [f(x) - \hat{f}(x)]^2 + Var(\epsilon)\]
Typically we have few if any data points with \(X = 4\) exactly.
So we cannot compute \(E(Y|X = x)\)!
Relax the definition and let \[\hat{f}(x) = Ave(Y|X \in N(x))\] where \(N(x)\) is some neighborhood of \(x\).
Nearest-neighbor averaging can be pretty good for small \(p\) – i.e. \(p \leq 4\) – and large-ish \(N\).
We will discuss smoother versions, such as kernel and spline smoothing later.
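As a concrete illustration, here is a minimal sketch of neighborhood averaging in base R. The simulated data, the window half-width h, and the function name fhat are all invented for illustration; this is not code from the text.

# Estimate f(x0) by the average of the y_i whose x_i fall within h of x0.
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
fhat <- function(x0, h = 0.5) {
  in.nbhd <- abs(x - x0) <= h   # N(x0): training points within h of x0
  mean(y[in.nbhd])              # Ave(Y | X in N(x0))
}
grid <- seq(0, 10, by = 0.1)
plot(x, y)
lines(grid, sapply(grid, fhat), col = "blue", lwd = 2)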
The linear model is an important example of a parametric model: \[f_L(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.\]
A linear model is specified in terms of \(p + 1\) parameters \(\beta_0, \beta_1, \ldots, \beta_p\).
We estimate the parameters by fitting the model to training data.
Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function \(f(X)\).
A linear model \(\hat{f}_L(X) = \hat{\beta}_0 + \hat{\beta}_1X\) gives a reasonable fit here.
A quadratic model \(\hat{f}_Q(X) = \hat{\beta}_0 + \hat{\beta}_1X + \hat{\beta}_2X^2\) fits slightly better.
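Both fits can be obtained with lm(); the following is a small sketch on simulated data (the variables and coefficients are invented for illustration).

set.seed(2)
x <- runif(100, -2, 2)
y <- 1 + 2*x - x^2 + rnorm(100, sd = 0.5)
fit.L <- lm(y ~ x)            # linear model: beta0 + beta1*x
fit.Q <- lm(y ~ poly(x, 2))   # quadratic model: adds a squared term
c(linear = deviance(fit.L), quadratic = deviance(fit.Q))  # residual sums of squares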
Simulated example: Red points are simulated values for income from the model
\[income = f(education,seniority) + \epsilon\] \(f\) is the blue surface.
Linear regression model fit to the simulated data.
\[\hat{f}_L(education, seniority) = \hat{\beta}_0 + \hat{\beta}_1 \times education + \hat{\beta}_2 \times seniority\]
More flexible regression model \(\hat{f}_S(education, seniority)\) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit.
Even more flexible regression model \(\hat{f}_S(education, seniority)\) fit to the simulated data. Here the fitted model makes no errors on the training data! This is also known as overfitting.
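In R, one way to fit such a surface is a thin-plate regression spline via the mgcv package; the package choice and the simulated data below are assumptions made for illustration, not details given in the text.

library(mgcv)   # gam() with bs = "tp" fits a thin-plate regression spline
set.seed(3)
education <- runif(300, 10, 22)
seniority <- runif(300, 0, 40)
income <- 20 + 3*education + 0.5*seniority + 2*sin(education/2) + rnorm(300, sd = 2)
fit.S <- gam(income ~ s(education, seniority, bs = "tp"), method = "REML")
# The smoothing parameter (or the basis dimension k) controls the roughness of the fit.
vis.gam(fit.S, view = c("education", "seniority"), theta = 30)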
Suppose we fit a model \(\hat{f}(x)\) to some training data \(Tr = \{x_i, y_i\}_1^N\), and we wish to see how well it performs. We could compute the average squared prediction error over \(Tr\): \[MSE_{Tr} = Ave_{i \in Tr}[y_i - \hat{f}(x_i)]^2\]
This may be biased toward more overfit models. Instead we should, if possible, compute it using fresh test data \(Te = \{x_i, y_i\}_1^M\): \[MSE_{Te} = Ave_{i \in Te}[y_i - \hat{f}(x_i)]^2\]
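A small simulation makes the contrast concrete: on made-up data, training MSE keeps falling as flexibility (here, polynomial degree) grows, while test MSE typically starts to rise. Everything below is illustrative.

set.seed(4)
f <- function(x) sin(2*x)
x.tr <- runif(100, 0, 3); y.tr <- f(x.tr) + rnorm(100, sd = 0.3)   # training data Tr
x.te <- runif(100, 0, 3); y.te <- f(x.te) + rnorm(100, sd = 0.3)   # test data Te
mse <- function(y, yhat) mean((y - yhat)^2)
for (d in c(1, 3, 10)) {
  fit <- lm(y.tr ~ poly(x.tr, d))
  cat("degree", d,
      " MSE_Tr =", round(mse(y.tr, fitted(fit)), 3),
      " MSE_Te =", round(mse(y.te, predict(fit, data.frame(x.tr = x.te))), 3), "\n")
}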
Black curve is truth. Red curve on right is \(MSE_{Te}\), grey curve is \(MSE_{Tr}\). Orange, blue and green curves/squares correspond to fits of different flexibility.
Here the truth is smoother, so the smoother fit and linear model do really well.
Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
Suppose we have fit a model \(\hat{f}(x)\) to some training data \(Tr\), and let \((x_0, y_0)\) be a test observation drawn from the population. If the true model is \(Y = f(X) + \epsilon\) (with \(f(x) = E(Y|X = x)\)), then \[E\big[(y_0 - \hat{f}(x_0))^2\big] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon).\]
The expectation averages over the variability of \(y_0\) as well as the variability in \(Tr\). Note that \(Bias(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)\).
Typically, as the flexibility of \(\hat{f}\) increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
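The decomposition can be checked by simulation: fix a point x0, repeatedly draw training sets, refit, and estimate the bias and variance of \(\hat{f}(x_0)\). The setup below (true function, noise level, sample size, quadratic fit) is invented purely for illustration.

set.seed(5)
f <- function(x) sin(2*x)
x0 <- 1.5; sigma <- 0.3; R <- 500
fhat0 <- replicate(R, {                      # R independent training sets
  x <- runif(50, 0, 3)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ poly(x, 2)), data.frame(x = x0))
})
c(bias.sq = (mean(fhat0) - f(x0))^2,         # [Bias(fhat(x0))]^2
  variance = var(fhat0),                     # Var(fhat(x0))
  irreducible = sigma^2)                     # Var(epsilon)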
Here the response variable \(Y\) is qualitative – e.g. email is one of \(C = \{spam, ham\}\) (ham = good email), and digit class is one of \(C = \{0, 1, \ldots, 9\}\). Our goals are to:

* Build a classifier \(C(X)\) that assigns a class label from \(C\) to a future unlabeled observation \(X\).
* Assess the uncertainty in each classification.
* Understand the roles of the different predictors among \(X = (X_1, X_2, ..., X_p)\).
Is there an ideal \(C(X)\)? Suppose the \(K\) elements in \(C\) are numbered \(1, 2, \ldots, K\). Let \[p_k(x) = Pr(Y = k|X = x), \quad k = 1, 2, \ldots, K.\]
These are the conditional class probabilities at \(x\); e.g. see the little barplot at \(x = 5\). Then the Bayes optimal classifier at \(x\) is
\[C(x) = j \text{ if } p_j(x) = \max\{p_1(x), p_2(x), \ldots, p_K(x)\}\]
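In code, the Bayes rule is just an arg-max over the class probabilities at \(x\); the probabilities below are made up to show the mechanics.

p.hat <- c(0.2, 0.5, 0.3)    # p_1(x), p_2(x), p_3(x) at one value of x
which.max(p.hat)             # C(x): the class with the largest probability
# For several observations at once (rows = points, columns = classes):
P <- rbind(c(0.7, 0.2, 0.1), c(0.1, 0.3, 0.6))
apply(P, 1, which.max)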
Nearest-neighbor averaging can be used as before. It also breaks down as the dimension grows; however, the impact on \(\hat{C}(x)\) is less than on \(\hat{p}_k(x)\), \(k = 1, \ldots, K\).
Typically we measure the performance of \(\hat{C}(x)\) using the misclassification error rate on test data \(Te\): \[Err_{Te} = Ave_{i \in Te} I[y_i \ne \hat{C}(x_i)]\]
The Bayes classifier (using the true \(p_k(x)\)) has the smallest error (in the population).
Support-vector machines build structured models for \(C(x)\).
We will also build structured models for representing the \(p_k(x)\), e.g. logistic regression and generalized additive models.
# Vectors, Data, Matrices, Subsetting
x=c(2,7,5)
x
## [1] 2 7 5
y=seq(from=4,length=3,by=3)
?seq
y
## [1] 4 7 10
x+y
## [1] 6 14 15
x/y
## [1] 0.5 1.0 0.5
x^y
## [1] 16 823543 9765625
x[2]
## [1] 7
x[2:3]
## [1] 7 5
x[-2]
## [1] 2 5
x[-c(1,2)]
## [1] 5
z=matrix(seq(1,12),4,3)
z
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
z[3:4,2:3]
## [,1] [,2]
## [1,] 7 11
## [2,] 8 12
z[,2:3]
## [,1] [,2]
## [1,] 5 9
## [2,] 6 10
## [3,] 7 11
## [4,] 8 12
z[,1]
## [1] 1 2 3 4
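# drop=FALSE keeps the single column as a 4 x 1 matrix instead of dropping to a vector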
z[,1,drop=FALSE]
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
dim(z)
## [1] 4 3
ls()
## [1] "x" "y" "z"
rm(y)
ls()
## [1] "x" "z"

# Generating random data, graphics
x=runif(50)
y=rnorm(50)
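# x holds 50 Uniform(0,1) draws and y holds 50 standard-normal draws;
# par(mfrow=c(2,1)) below stacks two plots in a single column.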
plot(x,y)
plot(x,y,xlab="Random Uniform",ylab="Random Normal",pch="*",col="blue")
par(mfrow=c(2,1))
plot(x,y)
hist(y)
par(mfrow=c(1,1))