The first exercise consist in predicting the returns. The dataset is the one provided by R, called Weekly, with 1,089 weekly returns, for 21 years 1990-2010. Second dataset is the Carseat dataset which will help in predicting the Car sale, which is the second exercise. Alongside of codes I will attempt to briefly explain the functions.
knitr::opts_chunk$set(echo = TRUE)
# Predicting the market
library(ISLR)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
# Checking for Corelations. cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. Direction is excluded since it is a qualitative variable.
cor(Weekly [,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
It appears that volume and year are highly correlated as it can be observed in the table as well as in the plot, next page. All the variables are not correlated, but the Lag 2 is highly correlated with Today variable.
# Calculating the mean, median, etc for every year and every variable
attach(Weekly)
plot(Volume)
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? Is so, which ones?
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The smallest p value is associated with Lag2. The positive coefficient for this predictor suggests that if the market had a positive return yesterday, then it is likely to go up today. At a value of 0.03, the p values, is relatively small, thus revealing there is clear evidence of a real association between Lag2 and Direction.
Now I Compute the confusion matrix and overall fraction of correct predictions. I will explain what the confusion matrix is telling you about the type of mistakes made by logistic regression.
##But before that, in order to make a prediction as the whether the market will go up or down on a particular day, I must convert these predicted probabilities into class labels, Up or Down. The following commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5
probs <- predict(glm.fit, type = "response")
glm.pred <- rep("Down", length(probs))
glm.pred[probs > 0.5] <- "Up"
###The first command creates a vector of 1,089 ‘Down’ elements. The second line transforms to “Up” all of the elements for which the predicted probability of a market increase exceeds 0.5. Given the predictions, the table () function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 54 48
## Up 430 557
This model correctly predicted that the market will go up on 557 days and that it would go down on 54, for a total of 557 + 54 = 611 correct predictions.
However, the mean function has been used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 56% of the time.
mean(glm.pred==Direction)
## [1] 0.5610652
Although logistic regression appears to work better than random guessing, this result is misleading because we trained and tested the model on the same set of 1,089 observations. In other words, 100- 56 = 44% is the training error rate. Thus, the confusion matrix does not properly fit the data.
Now I fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. I compute the confusion matrix and the overall fraction of the correct predictions for the held out data (that is, the 2009 and 2010)
train <- (Year < 2009)
Weekly.20092010 <- Weekly[!train, ]
Direction.20092010 <- Direction[!train]
fit.glm2 <- glm(Direction ~ Lag2, data = Weekly,family=binomial, subset = train)
summary(fit.glm2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
probs2 <- predict(fit.glm2, Weekly.20092010, type = "response")
glm.pred2 <- rep("Down", length(probs2))
glm.pred2[probs2 > 0.5] <- "Up"
table(glm.pred2, Direction.20092010)
## Direction.20092010
## glm.pred2 Down Up
## Down 9 5
## Up 34 56
mean(glm.pred2==Direction.20092010)
## [1] 0.625
The results improved significantly when removing the variables that are not statistically significant. Thus leading to a better model 62 % of the weekly movements have been correctly predicted. The confusion matrix suggests that on days when logistic regression predicts that the market will decline it is correct 62 % of the time. It is the same when it predicts an increase in the market that has a 62% accuracy rate.
Applying an LDA
library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
The LDA output indicates that r=0.44 and r=0.55; in other words, 44% of the training observations correspond to days during which the market went down. The group means suggest that there is a tendency for the previous day return to be negative on days when the market decreases, and a tendency for the previous days returns to be positive when the market increases.
pred.lda <- predict(lda.fit, Weekly.20092010)
table(pred.lda$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 9 5
## Up 34 56
mean(pred.lda$class==Direction.20092010)
## [1] 0.625
The LDA and logistic regression predictions are identical. Thus, logistic regression and LDA works at the same rate with no difference between both.
I repeat by using QDA. I found the same patterns found as in LDA, of course excluding the coefficients of the linear discriminants since the QDA does not provide it.
qda.fit = qda(Direction ~ Lag2 , data = Weekly, subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
Confusion matrix and Mean for qda
qda.class = predict(qda.fit, Weekly.20092010)$class
table(qda.class, Direction.20092010)
## Direction.20092010
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class==Direction.20092010)
## [1] 0.5865385
The numbers suggest that the quadratic form assumed by QDA may not capture the true relationship. QDA seems to over fit the data.
Repeat useing KNN with K=1. The numbers suggest that the quadratic form assumed by KNN may not capture the true relationship. KNN seems to over fit the data.
#Creating an input with a matrix containing the predictors associated with the training dat
library(class)
train = (Weekly$Year<2009)
train.X = Weekly$Lag2[train]
#Creating an input with a matrix containing the predictors associated with the data for which we wish to make predictions.
test.X = Weekly$Lag2[!train]
#Creating an input with a vector containing the class labels for the training observations
train.Direction = Weekly$Direction[train]
#Predicting the market's movement for the dates in 2009 and 2010, however, set a random seed prior to this. This is done because if several observations are tied as nearest neighbors, then R will randomly break the tie. Thus, a seed must be set in order to ensure reproducibility of results
set.seed(1)
knn.pred = knn(data.frame(train.X), data.frame(test.X), train.Direction, k=1)
table(knn.pred, Direction.20092010)
## Direction.20092010
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred==Direction.20092010)
## [1] 0.5
It appears that for this data, Linear Regression and LDA provides the best results of the methods that we have examined so far. The QDA and KNN do not capture the true relationship, over fitting the data.
Now I experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
As it can be observed bellow the p-values associated with both variables, Lag 1 and Lag2 are high, and the confusion matrix and the mean do not offer a better model compared to the regression model analyzed in part d of this exercise.
glm.fit3=glm(Direction ~ Lag1 + Lag2, data=Weekly, family= binomial, subset=train)
summary(glm.fit3)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6149 -1.2565 0.9989 1.0875 1.5330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21109 0.06456 3.269 0.00108 **
## Lag1 -0.05421 0.02886 -1.878 0.06034 .
## Lag2 0.05384 0.02905 1.854 0.06379 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1347.0 on 982 degrees of freedom
## AIC: 1353
##
## Number of Fisher Scoring iterations: 4
Confusion matrix and mean for glm pred
glm.probs3=predict(glm.fit3, Weekly.20092010, type= "response")
glm.pred3=rep("Down", 104)
glm.pred3[glm.probs3>.5]= "Up"
table(glm.pred3, Direction.20092010)
## Direction.20092010
## glm.pred3 Down Up
## Down 7 8
## Up 36 53
mean(glm.pred3==Direction.20092010)
## [1] 0.5769231
Repeating the same process applying LDA. We can see that LDA performs the same as with the previous regression analysis done at A1. The confusion matrix reveals the same patter as the Logistic Regression, with the same mean. Nevertheless, they both perform worse because the previous model has Lag 2, which is the only
lda.fit2= lda(Direction ~ Lag1 + Lag2, data=Weekly, subset=train)
lda.fit2
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2
## Down 0.289444444 -0.03568254
## Up -0.009213235 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.3013148
## Lag2 0.2982579
Confusion Matrix and mean for LDA
#######
lda.pred2=predict(lda.fit2, Weekly.20092010)
names(lda.pred2)
## [1] "class" "posterior" "x"
lda.class1=lda.pred2$class
table(lda.class1, Direction.20092010)
## Direction.20092010
## lda.class1 Down Up
## Down 7 8
## Up 36 53
####
mean(lda.class1==Direction.20092010)
## [1] 0.5769231
Applying QDA.Interestingly is to see that QDA reveals a slightly better model that LDA, though not a huge difference. Of course, this outcome shall not be taken as legitimized, and application of a bigger data set is needed in order to verify and embrace the result.
qda.fit2=qda(Direction~ Lag1 + Lag2, data = Weekly, subset = train)
qda.fit2
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2
## Down 0.289444444 -0.03568254
## Up -0.009213235 0.26036581
Confusion matrix and calculation of the mean for the qda.
qda.class1=predict (qda.fit2, Weekly.20092010)$class
table(qda.class1, Direction.20092010)
## Direction.20092010
## qda.class1 Down Up
## Down 7 10
## Up 36 51
#####
mean(qda.class1==Direction.20092010)
## [1] 0.5576923
Applying KNN. K- Nearest Neighbor does not offer a better model as it can be seen in the confusion matrix and the mean, which is 50% compared to QDA with 58% and LDA and Regression analysis, both with 57 %.
train.X1_1= cbind(Lag1, Lag2) [train,]
test.X1= cbind (Lag1, Lag2) [!train,]
train.Direction1 =Direction[train]
set.seed(1)
knn.pred1 = knn(train.X1_1, test.X1, train.Direction1 , k=4)
table(knn.pred1, Direction.20092010)
## Direction.20092010
## knn.pred1 Down Up
## Down 23 32
## Up 20 29
# Call in the carseats librar
library (MASS)
library (ISLR)
attach(Carseats)
dim(Carseats) # it can be observed there are 400 cases and 11 variables
## [1] 400 11
names(Carseats) # the variable names
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
I Split the data set into a training set and test set.
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats)/2)
library(tree)
Carseats.train= Carseats[train,]
Carseats.test = Carseats[-train,]
Bellow I fit a regression tree to the training set, plot the tree, and interpret the results. What test error rate do you obtain?
The tree function is used to fit a classification tree in order to predict Sales using all variables of the data set. The summary() function lists the variables that are used as internal nodes in the tree, thenumber of terminal nodes and the training error rate. The training error rate is 2,36 %. A small deviance indicates a tree that provides. The residual mean deviance reported in this case is 182 – 18 = 164. a good fit to the training data
tree.carseats = tree(Sales~. , Carseats.train)
summary(tree.carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "Income"
## [6] "CompPrice"
## Number of terminal nodes: 18
## Residual mean deviance: 2.36 = 429.5 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.2570 -1.0360 0.1024 0.0000 0.9301 3.9130
The test error associated with the regression tree obtained is 4.14. The squared root of the MSE is therefore around 2.034, indicating that this model leads to the test predictions that are within around $2,034 of the true median car value of the sales.
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test error rate?
cv.tree () function is used to see whether pruning the tree will improve the performance. The most complex tree is selected by cross-validation.
cv.carseats = cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type='b')
However, if we wish to prune the tree, we used the tree.prune () tree function as bellow. In keeping with with the cross-validation results, we use the unpruned tree to make predictions on the test set.The MSE increased dramatically, thus pruning is proved over fitting the data. Above, we got a 4,14 MSE yet, here, when pruning undertaken, we get a higher MSE.
prune.carseats = prune.tree(tree.carseats,best=5)
yhat= predict(tree.carseats, Carseats.test)
mean((yhat-Carseats.train$Sales)^2) # increases the MSE
## [1] 15.14909
Use the bagging approach in order to analyze this data. What test error rate do you obtain? Use the importance () function to determine which variables are most important.
Before I apply bagging I describe what it is, firstly. Bagging is a special case of a random forest with m=p. Random foret can be used to perform both random forests and bagging. But bellow bagging is performed. The argument mtry=11 indicates that all 13 predictors should be considered for each split of the tree, or bagging should be done. The test set MSE associated with the bagged regression tree is 12.39, few errors down with respect to the pruning, yet, the regression tree has offered the best model.
Additionally, the two measures are given when the importance function () is used. The former is based upon the mean decrease of accuracy in predictions on the out of the bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. As observed in Price and Shelveloc are the most important variables in predicting the sales. In other words, the Price and Shelve location of the car matters in sales, yet the most important in sales is the price of the car.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
bag.carseats = randomForest(Sales~., data= Carseats, subset=train, mtry=11, importance=TRUE)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within
## valid range
bag.carseats
##
## Call:
## randomForest(formula = Sales ~ ., data = Carseats, mtry = 11, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.814248
## % Var explained: 63.13
yhat.bag = predict (bag.carseats, data= carseats.test)
mean((yhat.bag-Carseats.train$Sales)^2)
## [1] 2.814248
Price and ShelveLoc appears to be the most important variables
importance(bag.carseats) # Price and ShelveLoc appears to be the most important variables
## %IncMSE IncNodePurity
## CompPrice 14.4124562 133.731797
## Income 6.5147532 74.346961
## Advertising 15.7607104 117.822651
## Population 0.6031237 60.227867
## Price 57.8206926 514.802084
## ShelveLoc 43.0486065 319.117972
## Age 19.8789659 192.880596
## Education 2.9319161 39.490093
## Urban -3.1300102 8.695529
## US 7.6298722 15.723975
Use random forests to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important.
set.seed(1)
rf.carseats= randomForest(Sales ~., data= Carseats, subset= train, mtry=2, importance=TRUE)
yhat.rf= predict(rf.carseats, Carseats.test)
mean((yhat.rf-Carseats.train$Sales)^2) # improved a lot
## [1] 9.745073
# Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
Use the importance() function to determine which variables are most important. Two measures of variable importance are reported. The former is based upon the mean decrease of accurracy in predictions on the out of bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.
importance (rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 6.1956449 131.14542
## Income 3.0956643 140.14986
## Advertising 9.7249923 133.42279
## Population 0.6180575 115.91127
## Price 31.6481939 334.12007
## ShelveLoc 27.8697892 208.67709
## Age 15.2480039 186.88482
## Education 2.5989601 76.12780
## Urban -2.4479750 20.51276
## US 6.3148583 36.91249
varImpPlot(rf.carseats)