This is the lab from Chapter 4 of “Introduction to Statistical Learning using R”. The data set to be used is Smarket. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, the percentage returns for each of the five previous trading days were recorded, Lag1 through Lag5. Also recorded is the Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).
# Load the ISLR package
library(ISLR)
# Upload the Smarket data
data(Smarket)
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
# Summary statistics
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922 Min. :-4.922 Min. :-4.922
## 1st Qu.:2002 1st Qu.:-0.640 1st Qu.:-0.640 1st Qu.:-0.640
## Median :2003 Median : 0.039 Median : 0.039 Median : 0.038
## Mean :2003 Mean : 0.004 Mean : 0.004 Mean : 0.002
## 3rd Qu.:2004 3rd Qu.: 0.597 3rd Qu.: 0.597 3rd Qu.: 0.597
## Max. :2005 Max. : 5.733 Max. : 5.733 Max. : 5.733
## Lag4 Lag5 Volume Today
## Min. :-4.922 Min. :-4.922 Min. :0.356 Min. :-4.922
## 1st Qu.:-0.640 1st Qu.:-0.640 1st Qu.:1.257 1st Qu.:-0.640
## Median : 0.038 Median : 0.038 Median :1.423 Median : 0.038
## Mean : 0.002 Mean : 0.006 Mean :1.478 Mean : 0.003
## 3rd Qu.: 0.597 3rd Qu.: 0.597 3rd Qu.:1.642 3rd Qu.: 0.597
## Max. : 5.733 Max. : 5.733 Max. :3.152 Max. : 5.733
## Direction
## Down:602
## Up :648
##
##
##
##
# To know more info about the data:
`?`(Smarket)
# Attach column names
attach(Smarket)
# Correlation matrix
cor(Smarket[, -9])
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## Year 1.00000 0.029700 0.030596 0.033195 0.035689 0.029788 0.53901
## Lag1 0.02970 1.000000 -0.026294 -0.010803 -0.002986 -0.005675 0.04091
## Lag2 0.03060 -0.026294 1.000000 -0.025897 -0.010854 -0.003558 -0.04338
## Lag3 0.03319 -0.010803 -0.025897 1.000000 -0.024051 -0.018808 -0.04182
## Lag4 0.03569 -0.002986 -0.010854 -0.024051 1.000000 -0.027084 -0.04841
## Lag5 0.02979 -0.005675 -0.003558 -0.018808 -0.027084 1.000000 -0.02200
## Volume 0.53901 0.040910 -0.043383 -0.041824 -0.048414 -0.022002 1.00000
## Today 0.03010 -0.026155 -0.010250 -0.002448 -0.006900 -0.034860 0.01459
## Today
## Year 0.030095
## Lag1 -0.026155
## Lag2 -0.010250
## Lag3 -0.002448
## Lag4 -0.006900
## Lag5 -0.034860
## Volume 0.014592
## Today 1.000000
As one would expect, the correlations between the lag variables and to- day’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005
we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket,
family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.45 -1.20 1.07 1.15 1.33
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.12600 0.24074 -0.52 0.60
## Lag1 -0.07307 0.05017 -1.46 0.15
## Lag2 -0.04230 0.05009 -0.84 0.40
## Lag3 0.01109 0.04994 0.22 0.82
## Lag4 0.00936 0.04997 0.19 0.85
## Lag5 0.01031 0.04951 0.21 0.83
## Volume 0.13544 0.15836 0.86 0.39
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1742
##
## Number of Fisher Scoring iterations: 3
The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type=“response” option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
glm.probs = predict(glm.fit, type = "response")
length(glm.probs)
## [1] 1250
glm.probs[1:10]
## 1 2 3 4 5 6 7 8 9 10
## 0.5071 0.4815 0.4811 0.5152 0.5108 0.5070 0.4927 0.5092 0.5176 0.4888
contrasts(Direction)
## Up
## Down 0
## Up 1
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.
glm.pred = rep("Down", 1250)
glm.pred[glm.probs > 0.5] = "Up"
# Produce a confusion matrix
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507
# Compute the fraction of days for which the prediction was correct
mean(glm.pred == Direction)
## [1] 0.5216
# Create a vector corresponding to the observations from 2001 through 2004
train = (Year < 2005)
# Test data set. 2005
Smarket.2005 = Smarket[!train, ]
Direction.2005 = Direction[!train]
# Fit the logistic regression using only the train data
glm.fit <- glm(Direction ~ lag1 + lag2 + lag3 + lag4 + lag5 + Volume, data = Smarket,
family = binomial, subset = train)
## Error: object 'lag1' not found
# Test the model by fitting the test data set into the fitted model
glm.probs <- predict(glm.fit, Smarket.2005, type = "response")
# compute the predictions for 2005 and compare them to the actual
# movements of the market over that time period.
glm.pred = rep("Down", nrow(Smarket.2005))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 4 1
## Up 107 140
# Test error rate
mean(glm.pred == Direction.2005)
## [1] 0.5714
# Training error rate
mean(glm.pred != Direction.2005)
## [1] 0.4286
# Refit the model using only 2 predictors lag1 and lag2
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probs = predict(glm.fit, Smarket.2005, type = "response")
glm.pred = rep("Down", nrow(Smarket.2005))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 35 35
## Up 76 106
# Test error rate: predicted correctly / total observations of the test
# data set
mean(glm.pred == Direction.2005)
## [1] 0.5595
# Predict
predict(glm.fit, newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)),
type = "response")
## 1 2
## 0.4791 0.4961
# Scatter plot matrix
pairs(Smarket)
plot(Volume)
library(MASS)
lda.fit = lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.492 0.508
##
## Group means:
## Lag1 Lag2
## Down 0.04279 0.03389
## Up -0.03955 -0.03133
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420
## Lag2 -0.5135
The predict() function returns a list with three elements. The first element, class, contains LDA’s predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class
lda.pred = predict(lda.fit, Smarket.2005)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class = lda.pred$class
# Confusion Matrix
table(lda.class, Direction.2005)
## Direction.2005
## lda.class Down Up
## Down 35 35
## Up 76 106
# Test error rate
mean(lda.class == Direction.2005)
## [1] 0.5595
sum(lda.pred$posterior[, 1] >= 0.5)
## [1] 70
sum(lda.pred$posterior[, 1] < 0.5)
## [1] 182
lda.pred$posterior[1:20, 1]
## 999 1000 1001 1002 1003 1004 1005 1006 1007 1008
## 0.4902 0.4792 0.4668 0.4740 0.4928 0.4939 0.4951 0.4873 0.4907 0.4844
## 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018
## 0.4907 0.5120 0.4895 0.4707 0.4745 0.4800 0.4936 0.5031 0.4979 0.4886
lda.class[1:20]
## [1] Up Up Up Up Up Up Up Up Up Up Up Down Up Up
## [15] Up Up Up Down Up Up
## Levels: Down Up
plot(lda.fit)
We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs. 1. A matrix containing the predictors associated with the training data,labeled train.X below. 2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below. 3. A vector containing the class labels for the training observations, labeled train.Direction below. 4. A value for K, the number of nearest neighbors to be used by the classifier.
library(class)
# Train Matrix
train.X = cbind(Lag1, Lag2)[train, ]
# Test Matrix
test.X = cbind(Lag1, Lag2)[!train, ]
train.Direction = Direction[train]
Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
## Direction.2005
## knn.pred Down Up
## Down 48 55
## Up 63 86
# standardize the data so that all standardizevariables are given a mean
# of zero and a standard deviation of one
standardized.X = scale(Caravan[, -86])
# split the observations into a test set, containing the first 1,000
# observations, and a training set, containing the remaining observations.
# We fit a KNN model on the training data using K = 1, and evaluate its
# performance on the test data.
data(Caravan)
attach(Caravan)
test = 1:1000
train.X = standardized.X[-test, ]
test.X = standardized.X[test, ]
train.Y = Purchase[-test]
test.Y = Purchase[test]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred)
## [1] 0.118
mean(test.Y != "No")
## [1] 0.059
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 873 50
## Yes 68 9
knn.pred = knn(train.X, test.X, train.Y, k = 3)
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 920 54
## Yes 21 5
knn.pred = knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, test.Y)
## test.Y
## knn.pred No Yes
## No 930 55
## Yes 11 4