k-fold cross validation is a resampling technique that randomly divides the data set into k groups (folds) of roughly equal size. The first fold is treated as the validation set and the model is fit on the remaining k-1 folds of data. The mean squared error, \(MSE_1\), is computed on the observations in the held-out fold, and the procedure is repeated k times, each time with a different fold as the validation set. This generates k estimates of the test error, \(MSE_1, MSE_2, ..., MSE_k\). The k-fold CV (cross validation) estimate is then obtained by averaging these values:
\[CV_{(k)}=\frac{1}{k}\sum_{i=1}^k MSE_i\]
There is a bias-variance trade-off associated with the choice of k in k-fold cross validation. Typically k = 5 or k = 10 is used, as these values have been shown empirically to yield test error estimates that suffer from neither excessively high bias nor very high variance.
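As a minimal sketch of the procedure (assuming a data frame dat with a numeric response y and a predictor x, both hypothetical here, and a linear model as the method):
set.seed(1)
k <- 10
# randomly assign each observation to one of k folds of roughly equal size
fold <- sample(rep(1:k, length.out = nrow(dat)))
mse <- sapply(1:k, function(j) {
  fit <- lm(y ~ x, data = dat[fold != j, ])          # fit on the other k-1 folds
  pred <- predict(fit, newdata = dat[fold == j, ])   # predict the held-out fold
  mean((dat$y[fold == j] - pred)^2)                  # MSE_j
})
mean(mse)  # the k-fold CV estimate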
Note that when the response variable is qualitative we are more likely to build an effective model using a classifier than by running a regression on dummy variables coding the different levels.
Use the set.seed(1) function so that results involving random sampling are reproducible. Note that sample(1250, 1250) draws a random permutation of all 1250 row indices; the train/test split for each fold is constructed inside the cross validation loops below.
set.seed(1)
train <- sample(1250, 1250)
The Smarket data set (from the ISLR package) contains 1,250 daily percentage returns for the S&P 500 stock index from 2001 through 2005. The columns Lag1 through Lag5 hold the percentage returns for each of the five previous trading days, Today holds the current day's percentage return, Volume is the number of shares traded that day (in billions), and Direction is a factor recording whether the market moved Up or Down on that day.
An examination of the data set using the head(), tail(), and str() functions gives a flavor of the data.
head(Smarket)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
tail(Smarket)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1245 2005 0.252 -0.024 -0.584 -0.285 -0.141 2.06517 0.422 Up
## 1246 2005 0.422 0.252 -0.024 -0.584 -0.285 1.88850 0.043 Up
## 1247 2005 0.043 0.422 0.252 -0.024 -0.584 1.28581 -0.955 Down
## 1248 2005 -0.955 0.043 0.422 0.252 -0.024 1.54047 0.130 Up
## 1249 2005 0.130 -0.955 0.043 0.422 0.252 1.42236 -0.298 Down
## 1250 2005 -0.298 0.130 -0.955 0.043 0.422 1.38254 -0.489 Down
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
Logistic regression models the probability that the response variable belongs to a particular category: \(p(X)=Pr(Y=1\vert X=x)\) is the conditional probability of the response \(Y\) given the predictor(s) \(X\). Because this is a probability, we need outputs bounded between \(0\) and \(1\) for all values of \(X\). In simple logistic regression we use the logistic function for this purpose, where
\[p(X)=\frac{e^{\beta_0+\beta_1X_1}}{1+e^{\beta_0+\beta_1X_1}}\] To fit this model we use maximum likelihood: we seek the coefficient estimates that maximize the likelihood function
\[\ell(\beta_0, \beta_1)=\prod_{i:y_i=1}p(x_i)\prod_{i^\prime:y_{i^\prime}=0} (1-p(x_{i^\prime}))\] where the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are chosen to maximize this likelihood.
We find the odds by manipulating \(p(X)\) to get
\[\frac{p(X)}{1-p(X)} = e^{\beta_0+\beta_1X_1}\]
and taking the logarithm of both sides gives
\[\log\Big(\frac{p(X)}{1-p(X)}\Big) = \beta_0+\beta_1X_1\]
where the left-hand side is called the logit, or log-odds, function.
Multiple logistic regression expands this model in the same way that multiple linear regression expands simple linear regression:
\[p(X)=\frac{e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}\] and the logit function generalizes to
\[\log\Big(\frac{p(X)}{1-p(X)}\Big) = \beta_0+\beta_1X_1+...+\beta_pX_p\]
And, as with simple logistic regression, we use maximum likelihood to estimate \(\beta_0, \beta_1,...,\beta_p\).
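As a quick sanity check of this formula, the fitted probabilities returned by glm() can be reproduced by plugging the coefficient estimates into the logistic function by hand (a sketch using the Smarket model fit below):
fit <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial)
# linear predictor: beta_0 + beta_1 * Lag1 + beta_2 * Lag2
eta <- coef(fit)[1] + coef(fit)[2] * Smarket$Lag1 + coef(fit)[3] * Smarket$Lag2
p_hand <- exp(eta) / (1 + exp(eta))  # the logistic function above
all.equal(unname(p_hand), unname(predict(fit, type = "response")))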
The cv.glm() function from the boot package can be used to implement k-fold cross validation; below we instead code the folds by hand with k = 10, a common choice for k, on the Smarket data set with the S&P 500 stock market index data. We set a random seed for reproducibility above and initialize a vector, MAE, in which we will store the cross validation error for each model we fit.
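For reference, a minimal sketch of the cv.glm() route would look like the following; it recodes Direction to 0/1 so that the misclassification cost function can compare fitted probabilities to a numeric response:
library(boot)
smkt <- Smarket
smkt$Up <- as.integer(smkt$Direction == "Up")  # 0/1 recoding of the response
fit <- glm(Up ~ Lag1 + Lag2, data = smkt, family = binomial)
# cost: fraction of fitted probabilities on the wrong side of 0.5
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
cv.glm(smkt, fit, cost, K = 10)$delta[1]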
Using the glm() function with the family = binomial argument (which specifies the logit link) we fit a logistic regression model where the response is Direction and the predictors are Lag1 and Lag2. For each fold we record the mean absolute error (MAE) between the 0/1-coded predicted and observed directions; with this coding the MAE is simply the misclassification rate.
MAE <- c()  # collects the CV error for each model
mae <- c()  # collects the per-fold errors for the current model
for (i in 1:10){
# fold i is the i-th block of 125 consecutive observations
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
m_glm <- glm(Direction ~ Lag1 + Lag2, data = train_data, family = binomial)
m_pred <- predict(m_glm, test_data, type = "response")
glm.pred <- rep("Down", nrow(test_data))
glm.pred[m_pred > 0.5] <- "Up"
mae <- c(mae, mean(abs(as.integer(glm.pred == "Up") - as.integer(test_data$Direction == "Up"))))
}
MAE=c(MAE, mean(mae))
MAE
## [1] 0.4872
There are several reasons why LDA can be a better option than logistic regression: when the classes are well separated, the parameter estimates for logistic regression are surprisingly unstable; when \(n\) is small and the predictors are approximately normally distributed within each class, LDA is again more stable; and LDA is popular when there are more than two response classes.
LDA uses Bayes' theorem for classification, which we can explain as follows. Suppose we wish to classify a qualitative response variable \(Y\) that can take \(K\) possible distinct and unordered values.
Let \(\pi_k\) be the prior probability that a given randomly chosen observation comes from the \(k\)th class. Let \(f_k(x) \equiv Pr(X=x\vert Y=k)\) be the density function of \(X\) for an observation from the \(k\)th class. \(f_k(x)\) is relatively large if there is a high probability that an observation in the \(k\)th class has \(X \approx x\) and \(f_k(x)\) is relatively small if it is very unlikely that an observation in the \(k\)th class has \(X\approx x\).
Bayes Theorem states that
\[Pr(Y=k\vert X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K\pi_l f_l(x)}\]
Letting \(p_k(x) = Pr(Y=k\vert X=x)\), we can simply plug estimates of \(\pi_k\) and \(f_k(x)\) into this formula; the software then takes care of the rest. We refer to \(p_k(x)\) as the posterior probability that an observation belongs to the \(k\)th class, given the predictor value \(X=x\) for that observation.
Estimating \(\pi_k\) is easy if we have a random sample of \(Y\)'s from the population, but estimating \(f_k(x)\) is more difficult. However, if we have an estimate of \(f_k(x)\), we can build a classifier that approximates the Bayes classifier.
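As a quick numerical illustration with made-up numbers: suppose \(K=2\), the priors are \(\pi_1=0.7\) and \(\pi_2=0.3\), and at some point \(x\) the densities are \(f_1(x)=0.2\) and \(f_2(x)=0.6\). Then
\[Pr(Y=1\vert X=x)=\frac{0.7\times 0.2}{0.7\times 0.2+0.3\times 0.6}=\frac{0.14}{0.32}\approx 0.44\]
so even though class 1 is more likely a priori, the density evidence at \(x\) tips the posterior toward class 2.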
LDA assumes that \(X=(X_1, X_2, ...,X_p)\) is drawn from a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix, which we write as \(X\sim N(\mu, \Sigma)\) to indicate that the \(p\)-dimensional random variable \(X\) has a multivariate Gaussian distribution. \(E(X)=\mu\) is the mean of \(X\) (a vector with \(p\) components) and \(Cov(X)=\Sigma\) is the \(p\times p\) covariance matrix of \(X\).
Formally, the multivariate Gaussian density is
\[f(x)=\frac{1}{(2\pi)^{p/2}\vert \Sigma\vert^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Big)\]
Plugging the density function for the \(k\)th class, \(f_k(x)\), into
\[Pr(Y=k\vert X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K\pi_l f_l(x)}\] and applying some algebra we see that the Bayes classifier assigns \(X=x\) to the class for which
\[\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k\]
is largest. The Bayes decision boundaries are the set of values \(x\) for which \(\delta_k(x)=\delta_l(x)\); in other words, the values for which
\[x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k=x^T\Sigma^{-1}\mu_l-\frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l, \text{ for } k\neq l\]
The \(\log\pi_k\) term drops out here because it is assumed that each class has the same number of training observations, so that \(\pi_k\) is the same for every class. To estimate \(\mu_1,...,\mu_K, \pi_1,...,\pi_K\) and \(\Sigma\) we use formulas analogous to those in the one-dimensional (\(p=1\)) case:
\[\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i\] \[\hat{\Sigma}=\frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T\] \[\hat{\pi}_k=\frac{n_k}{n}\]
The estimates are plugged into \[\hat{\delta}_k(x)=x^T\hat{\Sigma}^{-1}\hat{\mu}_k-\frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k+\log\hat{\pi}_k\] in order to assign a new observation \(X=x\) to the class for which \(\hat{\delta}_k(x)\) is largest. This is a linear function of \(x\), so the LDA decision rule depends on \(x\) only through a linear combination of its elements.
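To make these formulas concrete, here is a hand-rolled sketch of the LDA rule for Direction ~ Lag1 + Lag2 on the full data set; it is an illustration of the estimates above, not a substitute for the lda() function used later:
# pooled covariance and per-class discriminant scores, computed by hand
X <- as.matrix(Smarket[, c("Lag1", "Lag2")])
y <- Smarket$Direction
Sigma_hat <- Reduce(`+`, lapply(levels(y), function(k) {
  Xk <- X[y == k, , drop = FALSE]
  crossprod(scale(Xk, center = TRUE, scale = FALSE))  # sum of (x - mu_k)(x - mu_k)^T
})) / (nrow(X) - nlevels(y))
delta <- sapply(levels(y), function(k) {
  mu_k <- colMeans(X[y == k, , drop = FALSE])
  pi_k <- mean(y == k)
  drop(X %*% solve(Sigma_hat, mu_k)) - 0.5 * sum(mu_k * solve(Sigma_hat, mu_k)) + log(pi_k)
})
head(levels(y)[max.col(delta)])  # assign each observation to the class with the largest score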
The output for LDA often uses a confusion matrix to display the true status versus the predicted status of the qualitative response variable. Elements on the diagonal represent correct predictions and off-diagonal elements represent misclassifications.
This introduces the terms sensitivity and specificity, which characterize the performance of a classifier: the true positive rate is the sensitivity, and the false positive rate is \(1 -\) specificity.
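For example, given (hypothetical) vectors pred and truth holding the predicted and true classes with levels Down/Up, the confusion matrix and both rates can be computed directly:
cm <- table(Predicted = pred, True = truth)
sensitivity <- cm["Up", "Up"] / sum(cm[, "Up"])        # true positive rate
specificity <- cm["Down", "Down"] / sum(cm[, "Down"])  # true negative rate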
The Bayes classifier works by assigning an observation to the class for which the posterior probability \(p_k(X)\) is the largest. If we have two classes, say “wrong” and “right” we assign the observation to the “wrong” class if
\[Pr(\text{ wrong = Yes }\vert X=x)>0.5\]
This sets a threshold of \(50\%\) on the posterior probability of being in the “wrong” class in order to assign an observation to that class. If we are concerned about failing to flag observations that truly belong to the “wrong” class, we can lower this threshold. We could, for example, assign to the “wrong” class any observation whose posterior probability of being in that class is above \(20\%\):
\[Pr(\text{ wrong = Yes }\vert X=x)>0.2\] We use the receiver operating characteristic (ROC) curve to simultaneously display the two types of errors for all possible thresholds. The overall performance of the classifier is given by the area under the ROC curve (the AUC): the larger the area, the better the classifier.
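As a sketch of how such a curve can be traced by hand (assuming a vector probs of fitted probabilities and a logical vector truth marking the positive class, both hypothetical names):
# sweep the threshold and record the (FPR, TPR) pair at each value
roc_points <- sapply(seq(0, 1, by = 0.01), function(th) {
  pred <- probs > th
  c(fpr = mean(pred[!truth]), tpr = mean(pred[truth]))
})
plot(roc_points["fpr", ], roc_points["tpr", ], type = "l",
     xlab = "False positive rate", ylab = "True positive rate")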
Run the k-fold cross validation with k = 10 again, this time using LDA via the lda() function from the MASS package.
#MAE=c()
mae=c()
for (i in 1:10){
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
m_lda <- lda(Direction ~ Lag1 + Lag2, data = train_data)  # lda() from the MASS package
m_pred <- predict(m_lda, test_data)
mae <- c(mae, mean(abs(as.integer(m_pred$class == 'Up') - as.integer(test_data$Direction == 'Up'))))
}
MAE=c(MAE, mean(mae))
MAE
## [1] 0.4872000 0.4826667
Like LDA, QDA assumes that the observations within each class are drawn from a Gaussian distribution, but QDA allows each class its own covariance matrix: an observation from the \(k\)th class is of the form \(X\sim N(\mu_k, \Sigma_k)\), where \(\Sigma_k\) is the covariance matrix for the \(k\)th class.
Under this assumption, the Bayes classifier assigns an observation \(X=x\) to the class for which
\[\begin{aligned} \delta_k(x)&=-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}\log \vert\Sigma_k\vert + \log \pi_k\\ &=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\log \vert\Sigma_k\vert + \log \pi_k\\ \end{aligned}\]
is largest. The QDA classifier plugs estimates for \(\Sigma_k\), \(\mu_k\), and \(\pi_k\) into the equation above and assigns \(X=x\) to the class for which the quantity is largest. Since \(x\) appears as a quadratic function in \(\delta_k(x)\), we call this classifier quadratic discriminant analysis.
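Mirroring the hand-rolled LDA sketch above, only the covariance estimate changes. A minimal sketch of the quadratic score for a single class (qda_score is a hypothetical helper; it uses the sample covariance of the class):
# quadratic discriminant score for one class at a point x
qda_score <- function(x, Xk, pi_k) {
  mu_k <- colMeans(Xk)  # class mean vector
  S_k  <- cov(Xk)       # class-specific covariance matrix
  -0.5 * drop(t(x - mu_k) %*% solve(S_k, x - mu_k)) -
    0.5 * log(det(S_k)) + log(pi_k)
}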
The reasons for choosing LDA over QDA, or vice versa, come down to the bias-variance tradeoff. When there are \(p\) predictors, estimating a covariance matrix requires estimating \(p(p+1)/2\) parameters, and QDA estimates a separate covariance matrix for each class, for a total of \(Kp(p+1)/2\) parameters. With \(50\) predictors that is some multiple of \(50\times 51/2=1275\) parameters, a significant jump. The LDA model instead assumes the \(K\) classes share a common covariance matrix, so that \(\delta_k(x)\) is linear in \(x\) and there are only \(Kp\) linear coefficients to estimate. Thus LDA is a much less flexible classifier than QDA and has substantially lower variance. The tradeoff is that if the LDA assumption of a common covariance matrix is badly wrong, LDA can suffer from high bias.
Run the k-fold cross validation with k = 10 once more, this time using qda() (also from the MASS package).
#MAE <- c()
mae <- c()
for (i in 1:10){
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
m_qda <- qda(Direction ~ Lag1 + Lag2, data = train_data)
m_pred <- predict(m_qda, test_data)
mae <- c(mae, mean(abs(as.integer(m_pred$class == 'Up') - as.integer(test_data$Direction == 'Up'))))
}
MAE=c(MAE, mean(mae))
MAE
## [1] 0.4872000 0.4826667 0.4868444
Next, run the same 10-fold cross validation using the K-nearest neighbors classifier with 3 neighbors, via the knn() function from the class package.
#MAE <- c()
mae <- c()
for (i in 1:10){
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
train.X <- cbind(train_data$Lag1, train_data$Lag2)
test.X <- cbind(test_data$Lag1, test_data$Lag2)
train.Direction <- train_data$Direction
test.Direction <- test_data$Direction
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
mae <- c(mae, mean(knn.pred != test.Direction))  # KNN error rate on the held-out fold
}
MAE=c(MAE, mean(mae))
MAE
## [1] 0.4872000 0.4826667 0.4868444 0.5152000
Repeat with 4 nearest neighbors.
#MAE <- c()
mae <- c()
for (i in 1:10){
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
train.X <- cbind(train_data$Lag1, train_data$Lag2)
test.X <- cbind(test_data$Lag1, test_data$Lag2)
train.Direction <- train_data$Direction
test.Direction <- test_data$Direction
knn.pred <- knn(train.X, test.X, train.Direction, k = 4)
mae <- c(mae, mean(knn.pred != test.Direction))
}
MAE <- c(MAE, mean(mae))
MAE_knn4 <- mean(mae)
MAE_knn4
## [1] 0.504
Now fit a generalized additive model (GAM) with smooth terms s(Lag1) and s(Lag2), using the gam() function from the mgcv package, and run the same 10-fold cross validation.
#MAE <- c()
mae=c()
for (i in 1:10){
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
b <- gam(Direction ~ s(Lag1) + s(Lag2),
data = train_data, family = binomial)  # fit on the training folds only
pred_b <- predict(b, newdata = test_data, type = "response")
pred_b_ud <- rep("Down", nrow(test_data))
pred_b_ud[pred_b > 0.5] <- "Up"
mae <- c(mae, mean(abs(as.integer(pred_b_ud=="Up")
-as.integer(test_data$Direction == "Up"))))
}
MAE=c(MAE, mean(mae))
MAE
## [1] 0.4872000 0.4826667 0.4868444 0.5152000 0.5040000 0.4600000
Next, add a tensor-product interaction ti(Lag1, Lag2) to the smooth main effects and rerun the cross validation.
#MAE <- c()
mae=c()
for (i in 1:10){
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
b2 <- gam(Direction ~ ti(Lag1) + ti(Lag2) + ti(Lag1, Lag2),
data = train_data, family = binomial)
pred_b <- predict(b2, newdata = test_data, type = "response")
pred_b_ud <- rep("Down", nrow(test_data))
pred_b_ud[pred_b > 0.5] <- "Up"
mae <- c(mae, mean(abs(as.integer(pred_b_ud=="Up")
-as.integer(test_data$Direction == "Up"))))
}
MAE = c(MAE, mean(mae))
MAE
## [1] 0.4872000 0.4826667 0.4868444 0.5152000 0.5040000 0.4600000 0.4616000
Finally, replace the separate smooths with a single full tensor-product smooth te(Lag1, Lag2) and rerun the cross validation.
#MAE <- c()
mae=c()
for (i in 1:10){
train_data <- Smarket[-((i*125-124):(i*125)), ]
test_data <- Smarket[((i*125-124):(i*125)), ]
b3 <- gam(Direction ~ te(Lag1, Lag2),
data = train_data, family = binomial)
pred_b <- predict(b3, newdata = test_data, type = "response")
pred_b_ud <- rep("Down", nrow(test_data))
pred_b_ud[pred_b > 0.5] <- "Up"
mae <- c(mae, mean(abs(as.integer(pred_b_ud=="Up")
-as.integer(test_data$Direction == "Up"))))
}
MAE=c(MAE, mean(mae))
MAE
## [1] 0.4872000 0.4826667 0.4868444 0.5152000 0.5040000 0.4600000 0.4616000
## [8] 0.4528000
plot(MAE, ylim = c(.45, .52), col = 'green', type = 'p',
pch = 19, main = 'cross validation for the Smarket data',
ylab = 'mean absolute error')
# label each point with its model, in the order the MAE values were added above
text(seq_along(MAE), MAE, labels = c('glm', 'lda', 'qda', 'knn3', 'knn4',
'b', 'b2', 'b3'), pos = 3, cex = 1.25, col = "purple")