The first exercise consist in predicting the returns. The dataset is the one provided by R, called Weekly, with 1,089 weekly returns, for 21 years 1990-2010. Second dataset is the Carseat dataset which will help in predicting the Car sale, which is the second exercise. Alongside of codes I will attempt to briefly explain the functions.

I. First exercise - Predicting the returns

  1. Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
knitr::opts_chunk$set(echo = TRUE)
# Predicting the market
library(ISLR)
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260
# Checking for Corelations. cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. Direction is excluded since it is a qualitative variable.
cor(Weekly [,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

It appears that volume and year are highly correlated as it can be observed in the table as well as in the plot, next page. All the variables are not correlated, but the Lag 2 is highly correlated with Today variable.

# Calculating the mean, median, etc for every year and every variable
attach(Weekly)
plot(Volume)

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? Is so, which ones?

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)

summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

The smallest p value is associated with Lag2. The positive coefficient for this predictor suggests that if the market had a positive return yesterday, then it is likely to go up today. At a value of 0.03, the p values, is relatively small, thus revealing there is clear evidence of a real association between Lag2 and Direction.

Now I Compute the confusion matrix and overall fraction of correct predictions. I will explain what the confusion matrix is telling you about the type of mistakes made by logistic regression.

##But before that, in order to make a prediction as the whether the market will go up or down on a particular day, I must convert these predicted probabilities into class labels, Up or Down. The following commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5
probs <- predict(glm.fit, type = "response")
glm.pred <- rep("Down", length(probs))
glm.pred[probs > 0.5] <- "Up"


###The first command creates a vector of 1,089 ‘Down’ elements. The second line transforms to “Up” all of the elements for which the predicted probability of a market increase exceeds 0.5. Given the predictions, the table () function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

table(glm.pred, Direction)
##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

This model correctly predicted that the market will go up on 557 days and that it would go down on 54, for a total of 557 + 54 = 611 correct predictions.

However, the mean function has been used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 56% of the time.

mean(glm.pred==Direction)
## [1] 0.5610652

Although logistic regression appears to work better than random guessing, this result is misleading because we trained and tested the model on the same set of 1,089 observations. In other words, 100- 56 = 44% is the training error rate. Thus, the confusion matrix does not properly fit the data.

Now I fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. I compute the confusion matrix and the overall fraction of the correct predictions for the held out data (that is, the 2009 and 2010)

train <- (Year < 2009)
Weekly.20092010 <- Weekly[!train, ]
Direction.20092010 <- Direction[!train]
fit.glm2 <- glm(Direction ~ Lag2, data = Weekly,family=binomial, subset = train)
summary(fit.glm2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
probs2 <- predict(fit.glm2, Weekly.20092010, type = "response")
glm.pred2 <- rep("Down", length(probs2))
glm.pred2[probs2 > 0.5] <- "Up"
table(glm.pred2, Direction.20092010)
##          Direction.20092010
## glm.pred2 Down Up
##      Down    9  5
##      Up     34 56
mean(glm.pred2==Direction.20092010)
## [1] 0.625

The results improved significantly when removing the variables that are not statistically significant. Thus leading to a better model 62 % of the weekly movements have been correctly predicted. The confusion matrix suggests that on days when logistic regression predicts that the market will decline it is correct 62 % of the time. It is the same when it predicts an increase in the market that has a 62% accuracy rate.

Applying an LDA

library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

The LDA output indicates that r=0.44 and r=0.55; in other words, 44% of the training observations correspond to days during which the market went down. The group means suggest that there is a tendency for the previous day return to be negative on days when the market decreases, and a tendency for the previous days returns to be positive when the market increases.

pred.lda <- predict(lda.fit, Weekly.20092010)
table(pred.lda$class, Direction.20092010)
##       Direction.20092010
##        Down Up
##   Down    9  5
##   Up     34 56
mean(pred.lda$class==Direction.20092010)
## [1] 0.625

The LDA and logistic regression predictions are identical. Thus, logistic regression and LDA works at the same rate with no difference between both.

I repeat by using QDA. I found the same patterns found as in LDA, of course excluding the coefficients of the linear discriminants since the QDA does not provide it.

qda.fit = qda(Direction ~ Lag2 , data = Weekly, subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

Confusion matrix and Mean for qda

qda.class = predict(qda.fit, Weekly.20092010)$class
table(qda.class, Direction.20092010)
##          Direction.20092010
## qda.class Down Up
##      Down    0  0
##      Up     43 61
mean(qda.class==Direction.20092010)
## [1] 0.5865385

The numbers suggest that the quadratic form assumed by QDA may not capture the true relationship. QDA seems to over fit the data.

Repeat useing KNN with K=1. The numbers suggest that the quadratic form assumed by KNN may not capture the true relationship. KNN seems to over fit the data.

#Creating an input with a matrix containing the predictors associated with the training dat
library(class)
train = (Weekly$Year<2009)
train.X = Weekly$Lag2[train]

#Creating an input with a matrix containing the predictors associated with the data for which we wish to make predictions.

test.X = Weekly$Lag2[!train]

#Creating an input with a vector containing the class labels for the training observations

train.Direction = Weekly$Direction[train]

#Predicting the market's movement for the dates in 2009 and 2010, however, set a random seed prior to this. This is done because if several observations are tied as nearest neighbors, then R will randomly break the tie. Thus, a seed must be set in order to ensure reproducibility of results

set.seed(1)
knn.pred = knn(data.frame(train.X), data.frame(test.X), train.Direction, k=1)

table(knn.pred, Direction.20092010)
##         Direction.20092010
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred==Direction.20092010)
## [1] 0.5

It appears that for this data, Linear Regression and LDA provides the best results of the methods that we have examined so far. The QDA and KNN do not capture the true relationship, over fitting the data.

Now I experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

As it can be observed bellow the p-values associated with both variables, Lag 1 and Lag2 are high, and the confusion matrix and the mean do not offer a better model compared to the regression model analyzed in part d of this exercise.

glm.fit3=glm(Direction ~ Lag1 + Lag2, data=Weekly, family= binomial, subset=train)
summary(glm.fit3)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6149  -1.2565   0.9989   1.0875   1.5330  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.21109    0.06456   3.269  0.00108 **
## Lag1        -0.05421    0.02886  -1.878  0.06034 . 
## Lag2         0.05384    0.02905   1.854  0.06379 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1347.0  on 982  degrees of freedom
## AIC: 1353
## 
## Number of Fisher Scoring iterations: 4

Confusion matrix and mean for glm pred

glm.probs3=predict(glm.fit3, Weekly.20092010, type= "response")
glm.pred3=rep("Down", 104)
glm.pred3[glm.probs3>.5]= "Up"
table(glm.pred3, Direction.20092010)
##          Direction.20092010
## glm.pred3 Down Up
##      Down    7  8
##      Up     36 53
mean(glm.pred3==Direction.20092010)
## [1] 0.5769231

Repeating the same process applying LDA. We can see that LDA performs the same as with the previous regression analysis done at A1. The confusion matrix reveals the same patter as the Logistic Regression, with the same mean. Nevertheless, they both perform worse because the previous model has Lag 2, which is the only

lda.fit2= lda(Direction ~ Lag1 + Lag2, data=Weekly, subset=train)
lda.fit2
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2
## Down  0.289444444 -0.03568254
## Up   -0.009213235  0.26036581
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.3013148
## Lag2  0.2982579

Confusion Matrix and mean for LDA

#######

lda.pred2=predict(lda.fit2, Weekly.20092010)
names(lda.pred2)
## [1] "class"     "posterior" "x"
lda.class1=lda.pred2$class
table(lda.class1, Direction.20092010)
##           Direction.20092010
## lda.class1 Down Up
##       Down    7  8
##       Up     36 53
####
mean(lda.class1==Direction.20092010)
## [1] 0.5769231

Applying QDA.Interestingly is to see that QDA reveals a slightly better model that LDA, though not a huge difference. Of course, this outcome shall not be taken as legitimized, and application of a bigger data set is needed in order to verify and embrace the result.

qda.fit2=qda(Direction~ Lag1 + Lag2, data = Weekly, subset = train)
qda.fit2
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2
## Down  0.289444444 -0.03568254
## Up   -0.009213235  0.26036581

Confusion matrix and calculation of the mean for the qda.

qda.class1=predict (qda.fit2, Weekly.20092010)$class
table(qda.class1, Direction.20092010)
##           Direction.20092010
## qda.class1 Down Up
##       Down    7 10
##       Up     36 51
#####
mean(qda.class1==Direction.20092010)
## [1] 0.5576923

Applying KNN. K- Nearest Neighbor does not offer a better model as it can be seen in the confusion matrix and the mean, which is 50% compared to QDA with 58% and LDA and Regression analysis, both with 57 %.

train.X1_1= cbind(Lag1, Lag2) [train,]
test.X1= cbind (Lag1, Lag2) [!train,]
train.Direction1 =Direction[train]
set.seed(1)
knn.pred1 = knn(train.X1_1, test.X1, train.Direction1 , k=4)

table(knn.pred1, Direction.20092010)
##          Direction.20092010
## knn.pred1 Down Up
##      Down   23 32
##      Up     20 29

The Second exercise - Predicting the care sale applying regression trees

# Call in the carseats librar
library (MASS)
library (ISLR)
attach(Carseats)

dim(Carseats) # it can be observed there are 400 cases and 11 variables
## [1] 400  11
names(Carseats) # the variable names
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

I Split the data set into a training set and test set.

set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats)/2)
library(tree)
Carseats.train= Carseats[train,]
Carseats.test = Carseats[-train,]

Bellow I fit a regression tree to the training set, plot the tree, and interpret the results. What test error rate do you obtain?

The tree function is used to fit a classification tree in order to predict Sales using all variables of the data set. The summary() function lists the variables that are used as internal nodes in the tree, thenumber of terminal nodes and the training error rate. The training error rate is 2,36 %. A small deviance indicates a tree that provides. The residual mean deviance reported in this case is 182 – 18 = 164. a good fit to the training data

tree.carseats = tree(Sales~. , Carseats.train)
summary(tree.carseats)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "Income"     
## [6] "CompPrice"  
## Number of terminal nodes:  18 
## Residual mean deviance:  2.36 = 429.5 / 182 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.2570 -1.0360  0.1024  0.0000  0.9301  3.9130

The test error associated with the regression tree obtained is 4.14. The squared root of the MSE is therefore around 2.034, indicating that this model leads to the test predictions that are within around $2,034 of the true median car value of the sales.

Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test error rate?

cv.tree () function is used to see whether pruning the tree will improve the performance. The most complex tree is selected by cross-validation.

cv.carseats = cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type='b')

However, if we wish to prune the tree, we used the tree.prune () tree function as bellow. In keeping with with the cross-validation results, we use the unpruned tree to make predictions on the test set.The MSE increased dramatically, thus pruning is proved over fitting the data. Above, we got a 4,14 MSE yet, here, when pruning undertaken, we get a higher MSE.

prune.carseats = prune.tree(tree.carseats,best=5)

yhat= predict(tree.carseats, Carseats.test)
mean((yhat-Carseats.train$Sales)^2)  # increases the MSE 
## [1] 15.14909

Use the bagging approach in order to analyze this data. What test error rate do you obtain? Use the importance () function to determine which variables are most important.

Before I apply bagging I describe what it is, firstly. Bagging is a special case of a random forest with m=p. Random foret can be used to perform both random forests and bagging. But bellow bagging is performed. The argument mtry=11 indicates that all 13 predictors should be considered for each split of the tree, or bagging should be done. The test set MSE associated with the bagged regression tree is 12.39, few errors down with respect to the pruning, yet, the regression tree has offered the best model.

Additionally, the two measures are given when the importance function () is used. The former is based upon the mean decrease of accuracy in predictions on the out of the bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. As observed in Price and Shelveloc are the most important variables in predicting the sales. In other words, the Price and Shelve location of the car matters in sales, yet the most important in sales is the price of the car.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
bag.carseats = randomForest(Sales~., data= Carseats, subset=train, mtry=11, importance=TRUE)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within
## valid range
bag.carseats
## 
## Call:
##  randomForest(formula = Sales ~ ., data = Carseats, mtry = 11,      importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.814248
##                     % Var explained: 63.13
yhat.bag = predict (bag.carseats, data= carseats.test)
mean((yhat.bag-Carseats.train$Sales)^2)
## [1] 2.814248

Price and ShelveLoc appears to be the most important variables

importance(bag.carseats) # Price and ShelveLoc appears to be the most important variables
##                %IncMSE IncNodePurity
## CompPrice   14.4124562    133.731797
## Income       6.5147532     74.346961
## Advertising 15.7607104    117.822651
## Population   0.6031237     60.227867
## Price       57.8206926    514.802084
## ShelveLoc   43.0486065    319.117972
## Age         19.8789659    192.880596
## Education    2.9319161     39.490093
## Urban       -3.1300102      8.695529
## US           7.6298722     15.723975

Use random forests to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important.

set.seed(1)
rf.carseats= randomForest(Sales ~., data= Carseats, subset= train, mtry=2, importance=TRUE)
yhat.rf= predict(rf.carseats, Carseats.test)
mean((yhat.rf-Carseats.train$Sales)^2) # improved a lot
## [1] 9.745073
# Describe the effect of m, the number of variables considered at each split,  on the error rate obtained. 

Use the importance() function to determine which variables are most important. Two measures of variable importance are reported. The former is based upon the mean decrease of accurracy in predictions on the out of bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.

importance (rf.carseats)
##                %IncMSE IncNodePurity
## CompPrice    6.1956449     131.14542
## Income       3.0956643     140.14986
## Advertising  9.7249923     133.42279
## Population   0.6180575     115.91127
## Price       31.6481939     334.12007
## ShelveLoc   27.8697892     208.67709
## Age         15.2480039     186.88482
## Education    2.5989601      76.12780
## Urban       -2.4479750      20.51276
## US           6.3148583      36.91249
varImpPlot(rf.carseats)