The variable names in the Smarket data set are listed below.
library(ISLR)
data("Smarket")
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
Each variable is defined as follows:
Year: The year that the observation was recorded
Lag1: Percentage return for previous day
Lag2: Percentage return for 2 days previous
Lag3: Percentage return for 3 days previous
Lag4: Percentage return for 4 days previous
Lag5: Percentage return for 5 days previous
Volume: Volume of shares traded (number of daily shares traded in billions)
Today: Percentage return for today
Direction: A categorical variable (a factor) indicating whether the market moved up or down on the current day. It is binary: in the models below, R codes “Up” as 1 and “Down” as 0.
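The lag variables are shifted copies of the daily return: Lag1 on a given row equals Today from the previous row, Lag2 equals Today from two rows back, and so on. A minimal sketch of that relationship, using the first five Today values that appear in the str() output of this report; the helper `lag_by` is a hypothetical name introduced only for this illustration and is not part of ISLR:

```r
# First five Today values, copied from the str(Smarket) output in this report
today <- c(0.959, 1.032, -0.623, 0.614, 0.213)

# Hypothetical helper: shift a vector down by k positions, padding the
# start with NA because no earlier return is available in this short vector
lag_by <- function(x, k) c(rep(NA, k), head(x, -k))

lag1 <- lag_by(today, 1)
lag2 <- lag_by(today, 2)
print(data.frame(today, lag1, lag2))
```

In the real data the first rows hold actual earlier returns rather than NA, but from the second row onward the shifted values line up with the Lag1 and Lag2 columns shown by str().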
Let’s examine the structure of the data set:
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
The “str()” output shows that every variable is numeric (double) except the last one, Direction, which is a factor with 2 levels.
Calling “summary()” on the data set shows the five-number summary and the mean of each numeric variable, along with the number of observations at each level of Direction.
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
The “cor()” function requires numeric input, and Direction is a factor, so I drop the ninth column and keep only the first 8 variables. “cor()” then returns the correlation coefficient for each pair of these variables.
numeric_data = Smarket[,-9]
cor(numeric_data)
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
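The matrix contains only one noteworthy correlation: the one between Year and Volume (about 0.54). All other pairwise correlations are close to zero. Because the rows are in chronological order, plotting Volume against the row index shows how trading volume changes over time.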
plot(Smarket$Volume, xlab = "Index", ylab = "Volume",
main = "Volume")
I used the “glm()” function with family = binomial to fit a logistic regression model that predicts the direction of the market from Lag1 through Lag5 and Volume. The fitted model is stored in an object named glm.fit.
glm.fit=glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
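None of the predictors is statistically significant: the smallest p-value, 0.145, belongs to Lag1. Below are the coefficients produced by the model.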
coef(glm.fit)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
## Volume
## 0.135440659
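Below I use the “predict()” function with type = "response" to get, for each training observation, the predicted probability that the market goes up. The “contrasts()” output confirms that R codes Up as 1, so these are probabilities of Up rather than Down.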
glm.probs = predict(glm.fit, type = "response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
## 9 10
## 0.5176135 0.4888378
contrasts(Smarket$Direction)
## Up
## Down 0
## Up 1
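As a check on what “predict()” is doing, the first fitted probability can be reproduced by hand. This is a sketch under one assumption: the predictor values for observation 1 are read from the rounded str() output earlier in this report, so the result will agree with the printed value only to about three decimal places.

```r
# Coefficients copied from the coef(glm.fit) output above
beta <- c(intercept = -0.126000257, Lag1 = -0.073073746, Lag2 = -0.042301344,
          Lag3 = 0.011085108, Lag4 = 0.009358938, Lag5 = 0.010313068,
          Volume = 0.135440659)

# Predictors for observation 1, read from the rounded str(Smarket) output;
# the leading 1 multiplies the intercept
x1 <- c(1, 0.381, -0.192, -2.624, -1.055, 5.01, 1.19)

eta <- sum(beta * x1)   # linear predictor: the log-odds of Up
p_up <- plogis(eta)     # inverse logit, 1 / (1 + exp(-eta))
print(p_up)
```

The result should land close to the first entry of glm.probs, with any small difference coming from the rounding of the inputs.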
I will now convert these probabilities into class labels, Up or Down.
glm.pred = rep("Down", nrow(Smarket)) # Start with every prediction set to Down
glm.pred[glm.probs > 0.5] = "Up" # Switch to Up where the predicted probability of Up exceeds 0.5
cm = table(glm.pred, Smarket$Direction)
print(cm)
##
## glm.pred Down Up
## Down 145 141
## Up 457 507
The diagonal elements of the confusion matrix indicate correct predictions. That is, the market was down on a day the model predicted Down, or the market was up on a day the model predicted Up.
correct = (cm[1,1] + cm[2,2])/sum(cm)
print(correct)
## [1] 0.5216
The model was correct 52.16% of the time.
error = 1 - correct
print(error)
## [1] 0.4784
The model was incorrect 47.84% of the time.
This is the training error rate. We want it to be low, but because it is computed on the same data used to fit the model, it tends to underestimate the error rate on new data.
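The same accuracy and error rate can be recovered directly from the printed counts. The sketch below rebuilds the training confusion matrix by hand (the object name `cm_train` is introduced here for illustration) and also reports how often the model predicted Up, which helps in reading the matrix:

```r
# Training confusion matrix, counts copied from the table above
# (rows: predicted class, columns: actual class)
cm_train <- matrix(c(145, 457, 141, 507), nrow = 2,
                   dimnames = list(predicted = c("Down", "Up"),
                                   actual = c("Down", "Up")))

accuracy <- sum(diag(cm_train)) / sum(cm_train)  # correct / all predictions
error_rate <- 1 - accuracy

# Share of days on which the model predicted Up
prop_up <- rowSums(cm_train)["Up"] / sum(cm_train)

print(c(accuracy = accuracy, error = error_rate))
print(prop_up)
```

Using diag() avoids indexing each cell by hand and works unchanged if more classes are added.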
We will now split the data into a training set and a testing set.
Data from 2001-2004 will compose the training set.
Data from 2005 will compose the testing set.
Both of these data sets will have the same 9 variables but fewer rows than the full data set.
train = (Smarket$Year < 2005)
Smarket.2005 = Smarket[!train,]
dim(Smarket.2005)
## [1] 252 9
#train
Rather than creating a separate, potentially very large training data frame, we can pass the logical vector “train” to the subset argument so that the model is fit only on observations with Year < 2005. In the code below, we fit the model on data from 2001-2004 and then test it on data from 2005.
# glm.fit stores the fitted logistic regression model (trained on 2001-2004 only)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)
glm.probs=predict(glm.fit, Smarket.2005, type = "response")
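To illustrate why the subset argument is enough, the sketch below fits the same model two ways and compares the coefficients. It runs on simulated toy data, not on Smarket: the data frame `toy`, its column `x`, and the vector `train_toy` are all hypothetical names invented for this example.

```r
set.seed(1)  # toy data only; none of these values come from Smarket

n <- 200
toy <- data.frame(Year = rep(2001:2005, each = 40), x = rnorm(n))
toy$Direction <- factor(ifelse(runif(n) < plogis(0.5 * toy$x), "Up", "Down"))

train_toy <- toy$Year < 2005  # logical vector: TRUE for training rows

# Option 1: keep one data frame and let glm() select rows via subset
fit_subset <- glm(Direction ~ x, data = toy, family = binomial,
                  subset = train_toy)

# Option 2: materialize a separate training data frame first
fit_copy <- glm(Direction ~ x, data = toy[train_toy, ], family = binomial)

# Both approaches should give the same coefficients
print(all.equal(coef(fit_subset), coef(fit_copy)))
```

The two fits agree because subset simply restricts which rows enter the model frame, so no second copy of the data has to be kept around.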
Next, the predicted probabilities for 2005 are converted to class labels and compared with the actual movement of the market.
glm.pred = rep("Down", nrow(Smarket.2005))
glm.pred[glm.probs > 0.5] = "Up"
cm = table(glm.pred, Smarket.2005$Direction)
cm
##
## glm.pred Down Up
## Down 77 97
## Up 34 44
Accuracy with the test set
correct = mean(glm.pred == Smarket.2005$Direction)
print(correct)
## [1] 0.4801587
The model was correct on 48.02% of the test set observations.
Error with the test set
error = 1 - correct
print(error)
## [1] 0.5198413
The model was wrong on 51.98% of the test set observations.
- Compare the model accuracy for both scenarios: same data for train and test vs. separate test and train. Which model had the better accuracy?
The model evaluated on the same data it was trained on was more accurate (52.16% versus 48.02% on the separate test set).
- Since the separate train/test data scenario should produce the more valid model, what does the accuracy say about this model?
It suggests that the model is not good at predicting the direction of the market. The test accuracy reflects how the model performs on data it has not seen, and at about 48% it is slightly worse than chance, so the model should not be relied on for future predictions.
- Would it be better to use this model to predict the market, or just guess the direction? Which would be better at making a direction prediction, using this model or a coin toss?
I don’t think it would matter much, since both the coin toss and the model are roughly 50% accurate. If anything, the coin toss would be a slight improvement, because the model was only about 48% accurate on the separate test set.
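One way to make the coin-toss comparison more concrete is an exact binomial test of the test-set accuracy against 0.5. This is a rough sketch rather than part of the original analysis, and it assumes that each trading day in 2005 can be treated as an independent trial, which may not hold for market data.

```r
# Test-set results copied from the confusion matrix above:
# 77 + 44 correct predictions out of 252 trading days in 2005
n_correct <- 77 + 44
n_total <- 252

# Exact binomial test of H0: true accuracy = 0.5 (a fair coin toss)
bt <- binom.test(n_correct, n_total, p = 0.5)

print(bt$estimate)  # observed accuracy
print(bt$conf.int)  # 95% confidence interval for the true accuracy
print(bt$p.value)
```

If the confidence interval contains 0.5, the 2005 data give no evidence that the model is either better or worse than a coin toss.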