Chapter 4 Lab, Introduction to Statical Learning using R

Logistic Regression, LDA, QDA and KNN

Tarek Dib

Introduction

This is the lab from Chapter 4 of “Introduction to Statistical Learning using R”. The data set to be used is Smarket. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, the percentage returns for each of the five previous trading days were recorded, Lag1 through Lag5. Also recorded is the Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).

# Load the ISLR package
library(ISLR)
# Upload the Smarket data
data(Smarket)
names(Smarket)

## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"

# Summary statistics
summary(Smarket)

##       Year           Lag1             Lag2             Lag3       
##  Min.   :2001   Min.   :-4.922   Min.   :-4.922   Min.   :-4.922  
##  1st Qu.:2002   1st Qu.:-0.640   1st Qu.:-0.640   1st Qu.:-0.640  
##  Median :2003   Median : 0.039   Median : 0.039   Median : 0.038  
##  Mean   :2003   Mean   : 0.004   Mean   : 0.004   Mean   : 0.002  
##  3rd Qu.:2004   3rd Qu.: 0.597   3rd Qu.: 0.597   3rd Qu.: 0.597  
##  Max.   :2005   Max.   : 5.733   Max.   : 5.733   Max.   : 5.733  
##       Lag4             Lag5            Volume          Today       
##  Min.   :-4.922   Min.   :-4.922   Min.   :0.356   Min.   :-4.922  
##  1st Qu.:-0.640   1st Qu.:-0.640   1st Qu.:1.257   1st Qu.:-0.640  
##  Median : 0.038   Median : 0.038   Median :1.423   Median : 0.038  
##  Mean   : 0.002   Mean   : 0.006   Mean   :1.478   Mean   : 0.003  
##  3rd Qu.: 0.597   3rd Qu.: 0.597   3rd Qu.:1.642   3rd Qu.: 0.597  
##  Max.   : 5.733   Max.   : 5.733   Max.   :3.152   Max.   : 5.733  
##  Direction 
##  Down:602  
##  Up  :648  
##            
##            
##            
##

# To know more info about the data:
`?`(Smarket)
# Attach column names
attach(Smarket)
# Correlation matrix
cor(Smarket[, -9])

##           Year      Lag1      Lag2      Lag3      Lag4      Lag5   Volume
## Year   1.00000  0.029700  0.030596  0.033195  0.035689  0.029788  0.53901
## Lag1   0.02970  1.000000 -0.026294 -0.010803 -0.002986 -0.005675  0.04091
## Lag2   0.03060 -0.026294  1.000000 -0.025897 -0.010854 -0.003558 -0.04338
## Lag3   0.03319 -0.010803 -0.025897  1.000000 -0.024051 -0.018808 -0.04182
## Lag4   0.03569 -0.002986 -0.010854 -0.024051  1.000000 -0.027084 -0.04841
## Lag5   0.02979 -0.005675 -0.003558 -0.018808 -0.027084  1.000000 -0.02200
## Volume 0.53901  0.040910 -0.043383 -0.041824 -0.048414 -0.022002  1.00000
## Today  0.03010 -0.026155 -0.010250 -0.002448 -0.006900 -0.034860  0.01459
##            Today
## Year    0.030095
## Lag1   -0.026155
## Lag2   -0.010250
## Lag3   -0.002448
## Lag4   -0.006900
## Lag5   -0.034860
## Volume  0.014592
## Today   1.000000

As one would expect, the correlations between the lag variables and to- day’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005

Logistic Regression

we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, 
    family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -1.45   -1.20    1.07    1.15    1.33  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.12600    0.24074   -0.52     0.60
## Lag1        -0.07307    0.05017   -1.46     0.15
## Lag2        -0.04230    0.05009   -0.84     0.40
## Lag3         0.01109    0.04994    0.22     0.82
## Lag4         0.00936    0.04997    0.19     0.85
## Lag5         0.01031    0.04951    0.21     0.83
## Volume       0.13544    0.15836    0.86     0.39
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1742
## 
## Number of Fisher Scoring iterations: 3

Predict Function

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type=“response” option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

glm.probs = predict(glm.fit, type = "response")
length(glm.probs)

## [1] 1250

glm.probs[1:10]

##      1      2      3      4      5      6      7      8      9     10 
## 0.5071 0.4815 0.4811 0.5152 0.5108 0.5070 0.4927 0.5092 0.5176 0.4888

contrasts(Direction)

##      Up
## Down  0
## Up    1

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.

glm.pred = rep("Down", 1250)
glm.pred[glm.probs > 0.5] = "Up"
# Produce a confusion matrix
table(glm.pred, Direction)

##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507

# Compute the fraction of days for which the prediction was correct
mean(glm.pred == Direction)

## [1] 0.5216

Splitting the data into training and test sets

# Create a vector corresponding to the observations from 2001 through 2004
train = (Year < 2005)
# Test data set. 2005
Smarket.2005 = Smarket[!train, ]
Direction.2005 = Direction[!train]
# Fit the logistic regression using only the train data
glm.fit <- glm(Direction ~ lag1 + lag2 + lag3 + lag4 + lag5 + Volume, data = Smarket, 
    family = binomial, subset = train)

## Error: object 'lag1' not found

# Test the model by fitting the test data set into the fitted model
glm.probs <- predict(glm.fit, Smarket.2005, type = "response")
# compute the predictions for 2005 and compare them to the actual
# movements of the market over that time period.
glm.pred = rep("Down", nrow(Smarket.2005))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)

##         Direction.2005
## glm.pred Down  Up
##     Down    4   1
##     Up    107 140

# Test error rate
mean(glm.pred == Direction.2005)

## [1] 0.5714

# Training error rate
mean(glm.pred != Direction.2005)

## [1] 0.4286

# Refit the model using only 2 predictors lag1 and lag2
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probs = predict(glm.fit, Smarket.2005, type = "response")
glm.pred = rep("Down", nrow(Smarket.2005))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)

##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106

# Test error rate: predicted correctly / total observations of the test
# data set
mean(glm.pred == Direction.2005)

## [1] 0.5595

# Predict
predict(glm.fit, newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)), 
    type = "response")

##      1      2 
## 0.4791 0.4961

# Scatter plot matrix
pairs(Smarket)

plot of chunk unnamed-chunk-6

plot(Volume)

plot of chunk unnamed-chunk-6

LDA

library(MASS)
lda.fit = lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit

## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##  Down    Up 
## 0.492 0.508 
## 
## Group means:
##          Lag1     Lag2
## Down  0.04279  0.03389
## Up   -0.03955 -0.03133
## 
## Coefficients of linear discriminants:
##          LD1
## Lag1 -0.6420
## Lag2 -0.5135

The predict() function returns a list with three elements. The first element, class, contains LDA’s predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class

lda.pred = predict(lda.fit, Smarket.2005)
names(lda.pred)

## [1] "class"     "posterior" "x"

lda.class = lda.pred$class
# Confusion Matrix
table(lda.class, Direction.2005)

##          Direction.2005
## lda.class Down  Up
##      Down   35  35
##      Up     76 106

# Test error rate
mean(lda.class == Direction.2005)

## [1] 0.5595

sum(lda.pred$posterior[, 1] >= 0.5)

## [1] 70

sum(lda.pred$posterior[, 1] < 0.5)

## [1] 182

lda.pred$posterior[1:20, 1]

##    999   1000   1001   1002   1003   1004   1005   1006   1007   1008 
## 0.4902 0.4792 0.4668 0.4740 0.4928 0.4939 0.4951 0.4873 0.4907 0.4844 
##   1009   1010   1011   1012   1013   1014   1015   1016   1017   1018 
## 0.4907 0.5120 0.4895 0.4707 0.4745 0.4800 0.4936 0.5031 0.4979 0.4886

lda.class[1:20]

##  [1] Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Down Up   Up  
## [15] Up   Up   Up   Down Up   Up  
## Levels: Down Up

plot(lda.fit)

plot of chunk unnamed-chunk-9

K-Nearest Neighbor

We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs. 1. A matrix containing the predictors associated with the training data,labeled train.X below. 2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below. 3. A vector containing the class labels for the training observations, labeled train.Direction below. 4. A value for K, the number of nearest neighbors to be used by the classifier.

library(class)
# Train Matrix
train.X = cbind(Lag1, Lag2)[train, ]
# Test Matrix
test.X = cbind(Lag1, Lag2)[!train, ]
train.Direction = Direction[train]

Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.

set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)

##         Direction.2005
## knn.pred Down Up
##     Down   48 55
##     Up     63 86

KNN for Caravan Insurance data set

# standardize the data so that all standardizevariables are given a mean
# of zero and a standard deviation of one
standardized.X = scale(Caravan[, -86])
# split the observations into a test set, containing the first 1,000
# observations, and a training set, containing the remaining observations.
# We fit a KNN model on the training data using K = 1, and evaluate its
# performance on the test data.
data(Caravan)
attach(Caravan)
test = 1:1000
train.X = standardized.X[-test, ]
test.X = standardized.X[test, ]
train.Y = Purchase[-test]
test.Y = Purchase[test]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred)

## [1] 0.118

mean(test.Y != "No")

## [1] 0.059

table(knn.pred, test.Y)

##         test.Y
## knn.pred  No Yes
##      No  873  50
##      Yes  68   9

knn.pred = knn(train.X, test.X, train.Y, k = 3)
table(knn.pred, test.Y)

##         test.Y
## knn.pred  No Yes
##      No  920  54
##      Yes  21   5

knn.pred = knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, test.Y)

##         test.Y
## knn.pred  No Yes
##      No  930  55
##      Yes  11   4