Lab 4

1a. We will be using the Smarket data set from the ISLR package.

The variable names that are included in the data set are below.

library(ISLR)
data("Smarket")
names(Smarket)

## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"

Each of the variables is defined as follows:

Year: The year that the observation was recorded
Lag1: Percentage return for previous day
Lag2: Percentage return for 2 days previous
Lag3: Percentage return for 3 days previous
Lag4: Percentage return for 4 days previous
Lag5: Percentage return for 5 days previous
Volume: Volume of shares traded (number of daily shares traded in billions)
Today: Percentage return for today
Direction: A categorical variable (a factor with levels “Down” and “Up”) indicating whether the market moved up or down on the current day. When the model is fitted, R codes “Down” as 0 and “Up” as 1.

1b.

Let's examine the structure of the data set:

str(Smarket)

## 'data.frame':    1250 obs. of  9 variables:
##  $ Year     : num  2001 2001 2001 2001 2001 ...
##  $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
##  $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
##  $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
##  $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
##  $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
##  $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
##  $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...

Using “str()” we can see that every variable is numeric (double) except the last one, Direction, which is a factor with 2 levels.

1c.

Applying the “summary()” function to the data set shows the five-number summary plus the mean for each numeric variable, and the count of each level for the factor Direction.

summary(Smarket)

##       Year           Lag1                Lag2                Lag3          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
##       Lag4                Lag5              Volume           Today          
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
##  Direction 
##  Down:602  
##  Up  :648  
##            
##            
##            
##

1d.

Not all of the “Smarket” data set is numeric, so in order to run the “cor()” function on it I will drop the ninth column (Direction) and use only the first 8 variables. “cor()” then returns the correlation coefficient for each pair of these variables.

numeric_data = Smarket[,-9]
cor(numeric_data)

##              Year         Lag1         Lag2         Lag3         Lag4
## Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
## Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
## Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
## Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
## Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
##                Lag5      Volume        Today
## Year    0.029787995  0.53900647  0.030095229
## Lag1   -0.005674606  0.04090991 -0.026155045
## Lag2   -0.003557949 -0.04338321 -0.010250033
## Lag3   -0.018808338 -0.04182369 -0.002447647
## Lag4   -0.027083641 -0.04841425 -0.006899527
## Lag5    1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315  1.00000000  0.014591823
## Today  -0.034860083  0.01459182  1.000000000
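Dropping column 9 by position works here because Direction is the last column. A slightly more general sketch selects the numeric columns by type instead. It uses R's built-in iris data (an assumption made only so the snippet runs without the ISLR package), and the names is_num and numeric_iris are illustrative; the same idiom should apply to Smarket.

```r
# Built-in example data: four numeric columns and one factor (Species)
is_num <- sapply(iris, is.numeric)   # logical vector, one entry per column
numeric_iris <- iris[, is_num]       # keep only the numeric columns
cor_mat <- cor(numeric_iris)         # pairwise correlation matrix
round(cor_mat, 2)
```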

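The only noteworthy correlation in this matrix is between Year and Volume (about 0.54); every other pair is close to zero. The plot below shows Volume against the observation index. Because the rows are in chronological order, it shows how volume changes over time.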

plot(Smarket$Volume, xlab = "Index", ylab = "Volume",
     main = "Volume")

2a. Logistic Regression

I used the “glm()” function with family = binomial to fit a logistic regression model that predicts the direction of the market from Lag1 through Lag5 and Volume. The fitted model is stored in “glm.fit”.

glm.fit=glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3
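As a rough check on the summary above, the z value and p-value for a coefficient can be reproduced from its estimate and standard error, and the drop in deviance can be compared against a chi-squared distribution. The numbers below are copied from the printed output for Lag1; the variable names are illustrative, not part of the lab.

```r
# Values copied from the summary(glm.fit) output for Lag1
estimate  <- -0.073074
std_error <-  0.050167

z_value <- estimate / std_error          # Wald z statistic
p_value <- 2 * pnorm(-abs(z_value))      # two-sided p-value

# Check of the whole model: null deviance minus residual deviance,
# on 1249 - 1243 = 6 degrees of freedom
dev_drop  <- 1731.2 - 1727.6
df_drop   <- 1249 - 1243
p_overall <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

round(c(z = z_value, p = p_value, p_overall = p_overall), 3)
```

None of the individual p-values in the table is small, and the deviance drop is small relative to 6 degrees of freedom, so the summary gives little evidence that these predictors are associated with Direction.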

2b.

Below are the coefficients produced by the model:

coef(glm.fit)

##  (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5 
## -0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938  0.010313068 
##       Volume 
##  0.135440659

2c.

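Below I used the “predict()” function with type = "response" to compute, for each observation in the training data, the predicted probability that the market goes up. The “contrasts()” output confirms that R codes Up as 1, so these are probabilities of Up rather than Down.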

glm.probs = predict(glm.fit, type = "response")
glm.probs[1:10]

##         1         2         3         4         5         6         7         8 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 
##         9        10 
## 0.5176135 0.4888378

contrasts(Smarket$Direction)

##      Up
## Down  0
## Up    1
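To see where these probabilities come from, the first one can be approximately reproduced by hand. This is a minimal sketch that copies the coefficients from 2b and the rounded first-row values printed by “str()”; the names beta, x1, log_odds and p_up are illustrative.

```r
# Coefficients copied from coef(glm.fit) in 2b
beta <- c(-0.126000257, -0.073073746, -0.042301344,
           0.011085108,  0.009358938,  0.010313068, 0.135440659)

# First observation, using the rounded values printed by str(Smarket);
# the leading 1 multiplies the intercept
x1 <- c(1, 0.381, -0.192, -2.624, -1.055, 5.01, 1.19)

log_odds <- sum(beta * x1)     # linear predictor (log-odds of Up)
p_up     <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
p_up
```

The result should land close to the first fitted probability printed above, with small differences caused by the rounded inputs.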

2d.

I will now convert these probabilities into class labels, Up or Down.

glm.pred = rep("Down",1250) #Make a vector in which all predictions are Down
glm.pred[glm.probs>0.5]="Up" #Change appropriate values to Up

cm = table(glm.pred, Smarket$Direction)
print(cm)

##         
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507

The diagonal elements of the confusion matrix indicate correct predictions. That is, the market was down on a day the model predicted Down, or the market was up on a day the model predicted Up.

correct = (cm[1,1] + cm[2,2])/sum(cm)
print(correct)

## [1] 0.5216

The model was correct 52.16% of the time.

error = 1 - correct
print(error)

## [1] 0.4784

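The model was incorrect 47.84% of the time.

This is the training error rate: the model is evaluated on the same data that was used to fit it, so it tends to be optimistic about performance on new data. We want a low error rate, but it should be measured on data the model has not seen.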
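The accuracy and error rate can be put in context by comparing them with a trivial baseline. The sketch below rebuilds the confusion matrix from the printed output (so it runs on its own) and computes the accuracy of always predicting Up; the name baseline_up is an illustrative choice.

```r
# Confusion matrix copied from the output above (rows = predicted, cols = actual)
cm <- matrix(c(145, 457, 141, 507), nrow = 2,
             dimnames = list(pred = c("Down", "Up"),
                             actual = c("Down", "Up")))

accuracy   <- sum(diag(cm)) / sum(cm)   # same as (cm[1,1] + cm[2,2]) / sum(cm)
error_rate <- 1 - accuracy

# Baseline: always predict "Up", correct whenever the market actually went up
baseline_up <- sum(cm[, "Up"]) / sum(cm)

c(accuracy = accuracy, error = error_rate, always_up = baseline_up)
```

Since the market went up on 648 of the 1250 days, always predicting Up would be right about 51.8% of the time, so the model's 52.16% training accuracy is only marginally better than that baseline.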

3a. Training and Testing Sets

We will now split our data to create a training set and a testing set.

Data from 2001-2004 will compose the training set.

Data from 2005 will compose the testing set.

Both of these data sets will have the same 9 variables but fewer rows than the full data set.

train = (Smarket$Year < 2005)

Smarket.2005 = Smarket[!train,]
dim(Smarket.2005)

## [1] 252   9

NOTE: “train” is not the training set. It is a logical vector that holds the result of the test Smarket$Year < 2005, with one TRUE/FALSE value for each row of Smarket.

#train
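The way a logical vector selects rows can be seen on a small example. The data frame toy below is hypothetical and stands in for Smarket so the snippet runs without the ISLR package.

```r
# Hypothetical toy data frame standing in for Smarket
toy <- data.frame(Year = c(2003, 2004, 2005, 2005),
                  Lag1 = c(0.4, -0.2, 0.1, -0.6))

is_train <- toy$Year < 2005   # logical vector, one TRUE/FALSE per row
is_train

toy[is_train, ]               # rows that would be used for fitting
toy[!is_train, ]              # held-out rows (the 2005 part)
sum(is_train)                 # number of training rows
```

Negating the vector with ! flips every value, which is how Smarket.2005 picks out exactly the rows that the training subset leaves out.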

Rather than making a separate, potentially (very) large training set, we can use the Boolean values in “train” to help the model just use data from the Year < 2005 subset. In the code below, we are fitting a model only with data from 2001 - 2004. We are then testing the model with data from 2005.

# glm.fit stores the fitted logistic regression model (trained on 2001-2004 only)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)
glm.probs=predict(glm.fit, Smarket.2005, type = "response")

Next, the predicted probabilities for 2005 are converted to class labels and compared to the actual movement of the market.

glm.pred =rep("Down", 252)
glm.pred[glm.probs > 0.5] = "Up"
cm = table(glm.pred, Smarket.2005$Direction)
cm

##         
## glm.pred Down Up
##     Down   77 97
##     Up     34 44

Accuracy with the test set

correct = mean(glm.pred == Smarket.2005$Direction)
print(correct)

## [1] 0.4801587

The model was correct on 48.02% of the days in the test set.

Error with the test set

error = 1 - correct
print(error)

## [1] 0.5198413

The model was wrong on 51.98% of the days in the test set.

Conclusion:

Questions:

Compare the model accuracy for both scenarios: same data for train and test vs. separate test and train. Which model had the better accuracy?

The model was more accurate when the same data was used for both training and testing (52.16% vs. 48.02%).

Since the separate train/test data scenario should produce the more valid model, what does the accuracy say about this model?

It says that the model is not good at predicting the correct outcome. On new data that it was not fitted to, its accuracy dropped from 52.16% to 48.02%, so continuing to use this model for predictions is unlikely to work well.

Would it be better to use this model to predict the market, or just guess the direction? Which would be better at making a direction prediction, using this model or a coin toss?

I don’t think it would matter much, since both the coin toss and the model are roughly 50% accurate. If anything, the coin toss would be a slight improvement, because on the separate test set the model was about 48% accurate, which is below 50%.
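One hedged way to back up the coin-toss comparison is an exact binomial test of whether the test accuracy differs from 0.5, alongside the always-Up baseline. The sketch rebuilds the test-set confusion matrix from the printed output; cm_test, coin_test and always_up are illustrative names.

```r
# Test-set confusion matrix copied from 3a (rows = predicted, cols = actual)
cm_test <- matrix(c(77, 34, 97, 44), nrow = 2,
                  dimnames = list(pred = c("Down", "Up"),
                                  actual = c("Down", "Up")))

n_test    <- sum(cm_test)            # 252 days in 2005
n_correct <- sum(diag(cm_test))      # 77 + 44 correct predictions

# Exact binomial test of H0: accuracy = 0.5 (a fair coin)
coin_test <- binom.test(n_correct, n_test, p = 0.5)
coin_test$p.value

# Baseline: always predict "Up"
always_up <- sum(cm_test[, "Up"]) / n_test
always_up
```

The p-value comes out well above 0.05, so 121 correct days out of 252 cannot be distinguished from coin tossing. Always predicting Up would have been right on 141 of the 252 days (about 56%), which beats both the model and the coin on this test set.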