Pros: can help with the bias/variance trade-off and with model selection.
Cons: may be computationally demanding on large data sets, and does not perform as well as random forests and boosting.
Suppose a regression model with two covariates \(X_1\) and \(X_2\) is considered:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon, \] where \(X_1\) and \(X_2\) are nearly perfectly correlated (collinear). You can approximate this model by \[ Y = \beta_0 + \left( \beta_1 + \beta_2 \right) X_1 + \epsilon. \] The result is a biased estimate of the individual coefficients, but the prediction of \(Y\) can still be good, and its variance may be reduced.
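The following simulation is a minimal sketch of this effect (the data, coefficient values, and noise levels are hypothetical, chosen only for illustration): under near-perfect collinearity, the full model's estimate of \(\beta_1\) is extremely unstable, while the reduced model stably estimates the combined coefficient \(\beta_1 + \beta_2 = 5\).
set.seed(1)
n <- 100; sims <- 500
b_full <- b_red <- numeric(sims)
for (s in 1:sims) {
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.01)            # X2 nearly identical to X1
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)      # true beta1 = 2, beta2 = 3
  b_full[s] <- coef(lm(y ~ x1 + x2))["x1"]  # full model: estimate of beta1
  b_red[s]  <- coef(lm(y ~ x1))["x1"]       # reduced model: estimates beta1 + beta2
}
c(sd_full = sd(b_full), sd_red = sd(b_red), mean_red = mean(b_red))
As a running example for these and related issues, consider the prostate cancer data set: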
library(ElemStatLearn)
data(prostate)
str(prostate)
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
No method is better than sample splitting when data and computation time permit it.
Approach: split the data, select among competing models on held-out data, and assess the chosen model on a separate test set.
Two common problems: limited data and computational complexity.
Assume a test set formed by observations \((x_i, y_i)\), \(i = 1, \ldots, n\), generated by the additive error model
\[ y_i = f(x_i) + \epsilon_i, \quad i = 1, 2, \ldots, n, \] with i.i.d. errors \(\epsilon_i\) satisfying \(\mathbb{E}[\epsilon_i] = 0\) and \(\mathbb{V}[\epsilon_i] = \sigma^2\).
For a (previously unseen) test observation \((x_0, y_0)\), not used to train the statistical learning method, and an estimate \(\widehat{f}\) of \(f\) obtained from the training set, the expected test MSE (expected prediction error in the MSE sense) is
\[
\mathbb{E} \left[ \left( y_0 - \widehat{f} (x_0) \right)^2 \right],
\]
where the expected value notation \(\mathbb{E}\) accounts for the randomness of \(y_0\) as well as the randomness of the function estimate \(\widehat{f}\).
The expected test MSE can be decomposed into:
the irreducible error: \(\sigma^2\),
the bias: \(\mathbf{Bias} \left( \widehat{f}(x_0) \right) = f(x_0) - \mathbb{E} \left[ \widehat{f} (x_0) \right]\), corresponding to the error induced by approximating the true \(f\) by the estimate \(\widehat{f}\),
the variance: \(\mathbb{V} \left[ \widehat{f} (x_0) \right]\), corresponding to the amount by which \(\widehat{f}\) would change if estimated from a different training set (different training sets result in different \(\widehat{f}\)). Ideally, the estimate of \(f\) should not vary too much between training sets.
Expanding the square, the expected test MSE writes
\[ \mathbb{E} \left[ \left( y_0 - \widehat{f} (x_0) \right)^2 \right] = \mathbb{E} \left[ y_0^2 \right] - 2 \mathbb{E} \left[ y_0 \widehat{f} (x_0) \right] + \mathbb{E} \left[ \widehat{f}^2 (x_0) \right] \]
Using the additive error model, the first term of the expected test MSE writes
\[ \mathbb{E} \left[ y_0^2 \right] = \mathbb{E} \left[ \left( f(x_0) + \epsilon_0 \right)^2 \right] = f^2(x_0) + 2 f(x_0) \mathbb{E} \left[\epsilon_0 \right] + \mathbb{E} \left[ \epsilon_0^2 \right] = f^2(x_0) + \sigma^2 , \] since \(f(x_0)\) is deterministic, \(\mathbb{E} \left[\epsilon_0 \right] = 0\) by assumption, and therefore \(\mathbb{E} \left[ \epsilon_0^2 \right] = \mathbb{V} \left[ \epsilon_0 \right] = \sigma^2\).
The second term of the expected test MSE writes
\[ \mathbb{E} \left[ y_0 \widehat{f} (x_0) \right] = \mathbb{E} \left[ \left( f(x_0) + \epsilon_0 \right) \widehat{f} (x_0) \right] = \mathbb{E} \left[ f(x_0) \widehat{f}(x_0) \right] + \mathbb{E} \left[ \epsilon_0 \widehat{f}(x_0) \right] = f(x_0) \mathbb{E} \left[ \widehat{f} (x_0) \right], \] using the fact that \(\epsilon_0\) and \(\widehat{f}(x_0)\) are independent (\(\widehat{f}\) depends only on the training data), hence \(\mathbb{E} \left[\epsilon_0 \widehat{f}(x_0) \right] = \mathbb{E} \left[\epsilon_0 \right] \mathbb{E} \left[ \widehat{f}(x_0) \right] = 0\). Lastly, the third term of the expected test MSE writes
\[ \mathbb{E} \left[ \widehat{f}^2 (x_0)\right] = \mathbb{V} \left[ \widehat{f} (x_0) \right] + \mathbb{E}^2 \left[ \widehat{f} (x_0) \right] \]
Plugging these three terms back, the expected test MSE writes
\[ \begin{align} \mathbb{E} \left[ \left( y_0 - \widehat{f} (x_0) \right)^2 \right] & = \mathbb{E} \left[ y_0^2 \right] - 2 \mathbb{E} \left[ y_0 \widehat{f} (x_0) \right] + \mathbb{E} \left[ \widehat{f}^2 (x_0) \right] \\ & = f^2(x_0) + \sigma^2 - 2 f(x_0) \mathbb{E} \left[ \widehat{f} (x_0) \right] + \mathbb{V} \left[ \widehat{f} (x_0) \right] + \mathbb{E}^2 \left[ \widehat{f} (x_0) \right] \\ & = \sigma^2 + \left(f(x_0) - \mathbb{E} \left[ \widehat{f} (x_0) \right] \right)^2 + \mathbb{V} \left[ \widehat{f} (x_0) \right] \\ & = \sigma^2 + \mathbf{Bias}^2 \left( \widehat{f}(x_0) \right) + \mathbb{V} \left[ \widehat{f} (x_0) \right] \\ & = \text{irreducible error} + \text{squared bias} + \text{variance} \end{align} \] The overall expected test MSE (overall expected prediction error in the MSE sense) can be computed by averaging the expected test MSE at one test point over all possible values of \(x_0\) in the test set.
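A small simulation can verify the decomposition numerically (a sketch with a hypothetical setup: the true \(f\), the noise level, and the deliberately too-simple linear fit are illustrative assumptions). The same model is re-estimated on many training sets, and the Monte Carlo estimate of the expected test MSE at \(x_0\) is compared with \(\sigma^2 + \text{squared bias} + \text{variance}\):
set.seed(2)
f <- function(x) sin(2 * x)                    # hypothetical true regression function
sigma <- 0.3; n <- 50; x0 <- 0.5; sims <- 2000
fhat0 <- numeric(sims)
for (s in 1:sims) {
  x <- runif(n)
  y <- f(x) + rnorm(n, sd = sigma)
  fhat0[s] <- predict(lm(y ~ x), data.frame(x = x0))  # biased (linear) fit at x0
}
bias2 <- (f(x0) - mean(fhat0))^2               # squared bias at x0
vars  <- var(fhat0)                            # variance at x0
y0    <- f(x0) + rnorm(sims, sd = sigma)       # fresh test responses at x0
c(mse = mean((y0 - fhat0)^2), decomposition = sigma^2 + bias2 + vars)
The two numbers should agree up to Monte Carlo error. The prostate data loaded above also illustrate a related problem: fitting a model with more predictors than data points.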
smallset <- prostate[1:5,]
lm(lpsa ~ ., data = smallset)
##
## Call:
## lm(formula = lpsa ~ ., data = smallset)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph svi
## 9.60615 0.13901 -0.79142 0.09516 NA NA
## lcp gleason pgg45 trainTRUE
## NA -2.08710 NA NA
When the number of predictors is greater than the number of observations (here, 9 predictors but only 5 data points), some of the coefficients are NA: the design matrix cannot be inverted, so those coefficients are not estimable.
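The rank deficiency can be checked directly (a quick sketch reusing smallset from above):
X <- model.matrix(lpsa ~ ., data = smallset)
dim(X)      # 5 rows but 10 columns
qr(X)$rank  # rank at most 5, so t(X) %*% X is singular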
Model: \(y_i = f(x_i) + \epsilon_i\)
Assume a linear predictor \(\widehat{f}(x) = \beta^T x\).
Constrain only \(\lambda\) of the coefficients to be nonzero (hard thresholding).
The selection problem is then, after choosing \(\lambda\), to figure out which \(\lambda\) coefficients should be nonzero (this can be computationally demanding).
If the \(\beta_i\)'s are unconstrained, they can explode and are hence susceptible to very high variance.
To control variance, we might regularize/shrink the coefficients using a penalized form of the sum of squares
\[ \sum_{j = 1}^{n} \left( y_j - \sum_{i = 1}^{m} \beta_{i} x_{ij} \right)^2 + P(\lambda; \boldsymbol{\beta}). \] A good penalty typically reduces complexity, reduces variance, and respects the structure of the problem.
Ridge regression solves
\[ \|\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 + \lambda \| \boldsymbol{\beta} \|_2^2 = \sum_{j = 1}^{n} \left( y_j - \sum_{i = 1}^{m} \beta_{i} x_{ij} \right)^2 + \lambda \sum_{i=1}^{m} \beta_i^2. \] The penalty term is the squared \(\ell_2\) norm of the coefficients.
It can be shown that this estimator is the solution of the least squares problem subject to an \(\ell_2\) constraint on the coefficients:
\[ \min_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 \quad \text{subject to} \quad \| \boldsymbol{\beta} \|_2^2 \leq c. \] Inclusion of \(\lambda\) also makes the problem non-singular even when \(\boldsymbol{X}^T \boldsymbol{X}\) is not invertible.
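This can be seen directly with a few lines of linear algebra (a sketch: \(\lambda = 1\) is arbitrary, and centering/scaling of the predictors is ignored for brevity). The design matrix from the smallset example above is rank deficient, yet adding \(\lambda I\) makes the normal equations solvable:
X <- model.matrix(lpsa ~ ., data = smallset)  # singular case from above
y <- smallset$lpsa
lambda <- 1
betaRidge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
betaRidge  # every coefficient is defined, unlike the lm() fit above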
The objective of LASSO is to solve
\[ \min_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 \quad \text{subject to} \quad \sum_{i = 1}^{p} | \beta_i | \leq s, \]
with the corresponding Lagrangian form
\[ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} | \beta_j |, \] which has an analytical (soft-thresholding) solution for orthonormal design matrices \(\boldsymbol{X}\).
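Both penalties can be fit in practice with the glmnet package (a sketch, assuming glmnet is installed; alpha = 0 gives ridge and alpha = 1 gives the LASSO), here on the prostate data loaded earlier:
library(glmnet)
x <- as.matrix(prostate[, 1:8])       # the eight predictors
y <- prostate$lpsa
fitRidge <- glmnet(x, y, alpha = 0)   # ell_2 penalty
fitLasso <- glmnet(x, y, alpha = 1)   # ell_1 penalty
# Choose lambda by cross-validation and inspect the (sparse) LASSO coefficients
cvLasso <- cv.glmnet(x, y, alpha = 1)
coef(cvLasso, s = "lambda.min")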
For \(n\) independent classifiers \(c_i\), \(i = 1, \ldots, n\), each with accuracy
\[ \mathbb{P}(c_i \text{ correct}) = p, \quad i = 1, \ldots, n, \]
the majority vote accuracy, with \(k\) denoting the number of correct classifiers, is
\[ \mathbb{P} \left( k \geq \lceil n/2 \rceil \right) = 1 - \mathbb{P} \left( k \leq \lfloor n/2 \rfloor \right) = \sum_{j = 0}^{\lfloor n/2 \rfloor} {n \choose j} p^{n-j} (1-p)^{j} = 1 - \text{pbinom}(\lfloor n/2 \rfloor, n, p), \] where \(j\) counts the incorrect classifiers. For 5 completely independent classifiers (\(n = 5\)) with accuracy \(70\)% (\(p = 0.7\)), the majority vote accuracy is \(83.7\)%: \[ \mathbb{P} \left( k \geq 3 \right) = 1 - \text{pbinom}(\text{floor}(n/2), n, p) = \text{pbinom}(\text{floor}(n/2), n, p, \text{FALSE}) = 0.83692 , \]
p = 0.7
n = 5
pbinom(floor(n/2), n, p, FALSE)
## [1] 0.83692
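The same number can be checked by a quick Monte Carlo simulation of the independent classifiers (a sketch; the simulation size is arbitrary):
set.seed(3)
n <- 5; p <- 0.7; sims <- 1e5
votes <- matrix(rbinom(sims * n, 1, p), ncol = n)  # 1 = classifier is correct
mean(rowSums(votes) >= ceiling(n / 2))             # should be close to 0.83692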
For 101 completely independent classifiers (\(n = 101\)) with the same accuracy of \(70\)% (\(p = 0.7\)), the majority vote accuracy is \(99.999\)%:
p = 0.7
n = 101
pbinom(floor(n/2), n, p, FALSE)
## [1] 0.9999871
library(ISLR)
data(Wage)
library(ggplot2)
library(caret)
# Remove the logwage variable (a transform of the outcome wage)
Wage <- subset(Wage, select = -c(logwage))
# Create a building data set and a validation set
inBuild <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
# Create the validation set
validation <- Wage[-inBuild, ]
build <- Wage[inBuild, ]
inTrain <- createDataPartition(y = build$wage, p = 0.7, list = FALSE)
# Create the testing set
test <- build[-inTrain, ]
# Create the training set
train <- build[inTrain, ]
# The validation, testing and training sets and their dimensions
head(validation); dim(validation)
## year age maritl race education region
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## 377954 2008 30 1. Never Married 3. Asian 3. Some College 2. Middle Atlantic
## 305706 2007 34 2. Married 1. White 2. HS Grad 2. Middle Atlantic
## 153561 2003 39 2. Married 1. White 4. College Grad 2. Middle Atlantic
## 449654 2009 54 2. Married 1. White 2. HS Grad 2. Middle Atlantic
## jobclass health health_ins wage
## 86582 2. Information 2. >=Very Good 2. No 70.47602
## 376662 2. Information 2. >=Very Good 1. Yes 127.11574
## 377954 2. Information 1. <=Good 1. Yes 111.72085
## 305706 1. Industrial 2. >=Very Good 2. No 81.28325
## 153561 1. Industrial 2. >=Very Good 1. Yes 134.70538
## 449654 2. Information 2. >=Very Good 1. Yes 134.70538
## [1] 898 10
head(train); dim(train)
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 81404 2004 52 2. Married 1. White 2. HS Grad 2. Middle Atlantic
## 160191 2003 37 1. Never Married 3. Asian 4. College Grad 2. Middle Atlantic
## 301585 2007 56 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins wage
## 231655 1. Industrial 1. <=Good 2. No 75.04315
## 161300 1. Industrial 1. <=Good 1. Yes 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 154.68529
## 81404 2. Information 2. >=Very Good 1. Yes 128.68049
## 160191 1. Industrial 2. >=Very Good 2. No 82.67964
## 301585 1. Industrial 1. <=Good 1. Yes 129.15669
## [1] 1474 10
head(test); dim(test)
## year age maritl race education region
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 450601 2009 44 2. Married 4. Other 3. Some College 2. Middle Atlantic
## 228963 2006 41 1. Never Married 2. Black 3. Some College 2. Middle Atlantic
## 302778 2007 45 4. Divorced 1. White 3. Some College 2. Middle Atlantic
## 8690 2005 35 1. Never Married 1. White 2. HS Grad 2. Middle Atlantic
## 447660 2009 51 2. Married 1. White 3. Some College 2. Middle Atlantic
## jobclass health health_ins wage
## 11443 2. Information 1. <=Good 1. Yes 75.04315
## 450601 1. Industrial 2. >=Very Good 1. Yes 169.52854
## 228963 2. Information 2. >=Very Good 1. Yes 118.88436
## 302778 2. Information 1. <=Good 1. Yes 117.14682
## 8690 2. Information 2. >=Very Good 1. Yes 89.49248
## 447660 1. Industrial 2. >=Very Good 1. Yes 90.48191
## [1] 628 10
# GLM model
fitGLM <- train(wage ~., method = 'glm', data = train)
predGLM <- predict(fitGLM, test)
errGLM <- predGLM - test$wage
#sqrt(sum(errGLM^2))
#length(predGLM)
qplot(test$wage, predGLM)
plot(errGLM, type = "l", pch = 17, col = 'blue', xlab = 'testing', ylab = 'error')
# RF model
fitRF <- train(wage ~., method = 'rf', data = train,
               trControl = trainControl(method = 'cv', number = 3))
predRF <- predict(fitRF, test)
errRF <- predRF - test$wage
#sqrt(sum(errRF^2))
#length(predRF)
qplot(test$wage, predRF)
plot(errRF, type = "l", pch = 17, col = 'cyan', xlab = 'testing', ylab = 'error')
qplot(predGLM, predRF, color = wage, data = test)
# Create the data frame of the two predictions and the wages (from the testing set)
predDF <- data.frame(predGLM, predRF, wage = test$wage)
head(predDF); dim(predDF)
## predGLM predRF wage
## 11443 92.02262 98.01184 75.04315
## 450601 107.58982 134.11919 169.52854
## 228963 97.76183 96.47581 118.88436
## 302778 100.82265 106.52115 117.14682
## 8690 94.03700 96.49075 89.49248
## 447660 124.79846 130.23127 90.48191
## [1] 628 3
# Combined model: a GAM stacked on the two sets of predictions (trained using the testing set)
fitCOMB <- train(wage ~., method = 'gam', data = predDF)
predCOMB <- predict(fitCOMB, predDF)
errCOMB <- predCOMB - test$wage
#sqrt(sum(errCOMB^2))
#length(predCOMB)
qplot(test$wage, predCOMB)
plot(errCOMB, type = "l", pch = 17, col = 'pink', xlab = 'testing', ylab = 'error')
# Plot comparing the errors corresponding to the three models (GLM,RF,COMB)
plot(errGLM, type = 'l', pch = 17, col = 'blue', xlab = "testing", ylab = "error")
lines(errRF, type = 'l', pch = 17, col = 'cyan')
lines(errCOMB, type = 'l', pch = 17, col = 'pink')
legend('topright',
       legend = c('GLM', 'RF', 'COMB'),
       col = c('blue', 'cyan', 'pink'),
       lty = 1, cex = 0.8)
sqrt(sum(errGLM^2))
## [1] 923.8519
sqrt(sum(errRF^2))
## [1] 952.6409
sqrt(sum(errCOMB^2))
## [1] 901.1065
# Prediction over the validation set using the GLM (trained over training)
predGLMVal <- predict(fitGLM, validation)
errGLMVal <- predGLMVal - validation$wage
#sqrt(sum(errGLMVal^2))
#length(predGLMVal)
qplot(validation$wage, predGLMVal)
plot(errGLMVal, type = "l", pch = 17, col = 'blue', xlab = 'validation', ylab = 'error')
# Prediction over the validation set using the RF (trained over training)
predRFVal <- predict(fitRF, validation)
errRFVal <- predRFVal - validation$wage
#sqrt(sum(errRFVal^2))
#length(predRFVal)
qplot(validation$wage, predRFVal)
plot(errRFVal, type = "l", pch = 17, col = 'cyan', xlab = 'validation', ylab = 'error')
# Create the data frame of the two predictions corresponding to the two models.
predDFVal <- data.frame(predGLM = predGLMVal,
predRF = predRFVal)
head(predDFVal, 10); dim(predDFVal)
## predGLM predRF
## 86582 95.17083 81.60250
## 376662 143.20212 156.71460
## 377954 91.59363 105.01108
## 305706 92.32045 96.49526
## 153561 130.33402 141.23856
## 449654 121.14431 106.38376
## 230312 141.88892 133.49794
## 229379 104.82888 100.16875
## 86064 84.80882 95.76331
## 7690 123.51298 115.36876
## [1] 898 2
# Prediction over the validation set using the combined model (trained over the testing-set predictions)
predCOMBVal <- predict(fitCOMB, predDFVal)
errCOMBVal <- predCOMBVal - validation$wage
#sqrt(sum(errCOMBVal^2))
#length(predCOMBVal)
qplot(validation$wage, predCOMBVal)
plot(errCOMBVal, type = "l", pch = 17, col = 'pink', xlab = 'validation', ylab = 'error')
plot(errGLMVal, type = 'l', pch = 17, col = 'blue', xlab = "validation", ylab = "error")
lines(errRFVal, type = 'l', pch = 17, col = 'cyan')
lines(errCOMBVal, type = 'l', pch = 17, col = 'pink')
legend('topright',
       legend = c('GLM', 'RF', 'COMB'),
       col = c('blue', 'cyan', 'pink'),
       lty = 1, cex = 0.8)
sqrt(sum(errGLMVal^2))
## [1] 1023.509
sqrt(sum(errRFVal^2))
## [1] 1058.78
sqrt(sum(errCOMBVal^2))
## [1] 1039.941
Further resources: the caret and caretEnsemble packages.
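A sketch of the same stacking workflow using caretEnsemble (assuming its caretList/caretStack interface; the resampling settings below are illustrative) on the Wage data from above:
library(caretEnsemble)
ctrl <- trainControl(method = 'cv', number = 3, savePredictions = 'final')
models <- caretList(wage ~ ., data = train,
                    trControl = ctrl, methodList = c('glm', 'rf'))
stacked <- caretStack(models, method = 'gam')  # stack the base learners with a GAM
predCOMB2 <- predict(stacked, newdata = validation)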
Forecasting is typically applied to time series data, which introduces some very specific types of dependence structure and some additional challenges that should be taken into account when making predictions.
Data are dependent over time.
Specific pattern types: trends, seasonal patterns, cycles.
Subsampling into training/test sets is more complicated, since random subsampling would destroy the temporal structure.
Similar issues arise in spatial data.
Typically, the goal is to predict one or more observations into the future.
All standard prediction algorithms can be used, with caution.
Things that should be considered:
“spurious correlations”: random variables can be correlated without being causally related (coincidence, or the presence of other unseen factors); these are also common in geographic analysis;
extrapolation: predicting outside the range of the observed data is risky.
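A minimal forecasting sketch using the forecast package (assuming it is installed), on a classic built-in monthly series:
library(forecast)
data(AirPassengers)                # monthly airline passengers, 1949-1960
fit <- auto.arima(AirPassengers)   # automatically select an ARIMA model
plot(forecast(fit, h = 12))        # predict 12 months into the future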
Sometimes you don’t know the labels for prediction.
To build a predictor: create clusters from the unlabeled data, name the clusters, and build a predictor for the cluster labels.
In a new data set: predict the clusters.
data("iris")
library(ggplot2)
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train <- iris[inTrain,]
test <- iris[-inTrain,]
dim(train)
## [1] 105 5
dim(test)
## [1] 45 5
kMeans1 <- kmeans(subset(train, select = -c(Species)), centers = 3)
train$clusters <- as.factor(kMeans1$cluster)
qplot(Petal.Width, Petal.Length, colour = clusters, data = train)
# Note: this tabulates the cluster assignments against themselves (hence the
# diagonal); to compare clusters with the true labels, use train$Species instead
table(kMeans1$cluster, train$clusters)
##
## 1 2 3
## 1 35 0 0
## 2 0 49 0
## 3 0 0 21
fitRPART <- train(clusters ~., data = subset(train, select = -c(Species)), method = 'rpart')
table(predict(fitRPART, train), train$Species)
##
## setosa versicolor virginica
## 1 35 0 0
## 2 0 35 14
## 3 0 0 21
predRPART <- predict(fitRPART, test)
table(predRPART, test$Species)
##
## predRPART setosa versicolor virginica
## 1 15 0 0
## 2 0 15 6
## 3 0 0 9
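As an alternative sketch for the last step (an illustration, not part of the original analysis), new observations can be assigned directly to the nearest k-means centroid instead of going through a classifier:
# Assign each new observation to the closest centroid (Euclidean distance)
closest <- function(newdata, centers) {
  apply(newdata, 1, function(z) which.min(colSums((t(centers) - z)^2)))
}
table(closest(as.matrix(subset(test, select = -c(Species))), kMeans1$centers),
      test$Species)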