Data Analysis Assignment

Question 1. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Regression and Classification use modelling to estimate how a set of inputs can provide information about an output for the purposes of inference and prediction. Models build a function to predict the output based upon the inputs provided. In most cases, models are built using training data, and the accuracy of the model is determined by calculating how it performs on a separate test data.

Different models will utilize the provided inputs differently. Less flexible models will use fewer inputs to create a model. In the case of simple linear regression, the response variable is predicted on the basis of a single predictor variable. The advantage of a less flexible model is that is easy to interpret. Using the example of simple linear regression, the output (response) variable can be written in a simple linear equation in terms of the input (predictor) variable. Requiring fewer inputs means less flexible models work with less data which is easier to collect and faster to process. However, the number of variables not captured by the model will result in error in the prediction. The model may need to be recalculated to accommodate for changes in input variables not captured by the model.

A more flexible model will use a greater number of the inputs and will train the model function to produce a prediction output to fit closely to the input data provided. The advantage of this flexibility is that better accuracy of predictions can be achieved compared with less flexible models. Using more flexible models means larger amounts of data and also more data processing is required since the more flexible models are also more complex. The flexibility of the model may mean that it fits very closely to the training data, however does not fit as well any other data set, which is referred to as over fitting. Over fitting means the accuracy of more flexible models can be poor. The complexity of more flexible models means the model can be difficult to interpret, with less understanding about how the input variables are used calculated the resulting output.

A less flexible model would be preferred when the number of predictors is very large but the number of observations is small. In this situation a more flexible model would over fit the data in the small number of observations and result in poor results with other observations. A less flexible model would also be preferred in cases where the variance of the error associated with the observations is extremely high, since a more flexible model will fit to the noise in the data and provide worse results. A less flexible model is preferred when the aim is to understand how the response is effect by the predictors (inference) and when the interpretability of the results is important.

More flexible models would be preferred where the data consists of a large number of observations with a small number of predictors, since the accuracy of the results will be better. More flexible methods include more inputs, which means there is less chance of bias which occurs when the model does not include information that is affecting the outcome. More flexible models have the advantage when prediction accuracy is important above the interpretability of the results.

Question 2. Use R to create some simulation data and fit with simple linear regression models to it. Make sure to use set.seed (1) prior to starting part (a) to ensure consistent results.

(a) Using the rnorm() function, create a vector, x containing 100 observations drawn from a N(0,1) distribution. This represents a variable, X.

set.seed(1)
#Assign 100 observations from norm dist(0, 1) to vector x
x <- rnorm(100, mean = 0, sd = 1)

(b) Using the rnorm() function, create a vector, epsilon, containing 100 observations drawn from a N(0,0.25) distribution. This represents a random variable $\epsilon$.

#Assign 100 observations from norm dist(0, 0.25) to vector epsilon
epsilon <- rnorm(100, mean = 0, sd = 0.25)

(c) Using x and epsilon, generate a vector y according to the model $Y= -1+0.5X+\epsilon$. What is the length of vector y. What are the values of $\beta_0$ and $\beta_1$ in this linear model?

y = -1 + 0.5*x + epsilon
#Print Length of y vector
length(y)

## [1] 100

The length of vector y is 100. The values of $\beta_0$ and $\beta_1$ are -1 and 0.5 respectively.

(d) Create a scatterplot of x and y. Comment on what you observed.

plot(x,y,col="blue",main='Plot of Function y vs x')

The scatterplot shows there is a positive linear relationship between x and y, with some variability in the distribution which has been introduced by the epsilon distribution.

(e) Fit a least squares linear model to predict y using x. What are the values of $\widehat{\beta_0}$ and $\widehat{\beta_1}$? How do $\widehat{\beta_0}$ and $\widehat{\beta_1}$ compare to $\beta_0$ and $\beta_1$?

lm.fit=lm(y~x)
summary(lm.fit)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46921 -0.15344 -0.03487  0.13485  0.58654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
## x            0.49973    0.02693   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

The summary of the least squares linear model of y regressed on x has values for $\widehat{\beta_0}$ and $\widehat{\beta_1}$ of -1.00942 and 0.49973 which are very close to the original values of -1 and 0.5.

(f) Display the least squares line on the scatterplot obtained in (d). Draw the true population regression line in (c) as well, using a different color. Using the legend() command to create an appropriate legend.

par(mar=c(5,4,4,2))
plot(x,y,xlim=c(-3,3),col='blue',main='Regression Plot',cex.main=0.8)
abline(lm.fit,col='red',lty=3, lwd=2)
abline(-1, 0.5, pch=20,col="black",lwd=1)
legend("topleft", c("y","Least Squares","Pop.Regression"), cex=0.8, pch=c(1,-32,-32), lty=c(0,1,2), lwd=c(0,2,1), col=c("blue","red","black"));

(g) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. This can be achieved by decreasing the variance of the normal distribution used to generate epsilon in (b). Describe your results.

set.seed(1)
#Assign 100 observations from norm dist(0, 1) to vector x
x <- rnorm(100, mean = 0, sd = 1)
#Create vector epsilon2 with less noise variance
#Assign 100 observations from norm dist(0, 0.1) to vector epsilon2
epsilon2 <- rnorm(100, mean = 0, sd = 0.1)
#Create vector y2 using epsilon2
y2 = -1 + 0.5*x + epsilon2

The scatterplot of the two different distributions of y, with a larger noise variance, and y2, with a smaller noise variance shows that y has a wider distribution while y2 is more densely distributed about the true regression population line.

#Scatterplot of x and Y
plot(x,y,col="blue",main='Functions with Different variance',cex.main=0.8)
points(x,y2,col="red")
legend("topleft", c("y","y2"), pch=c(1,1),cex=0.8, col=c("blue","red"));

The Least squared regression model for the function y with a smaller noise component has coefficients $\widehat{\beta_0}$ = -1.003769 and $\widehat{\beta_1}$ = 0.499894 which are closer to the values of the original function than the previous linear regression model.

#Fit least squares model 
lm2.fit=lm(y2~x)
summary(lm2.fit)

## 
## Call:
## lm(formula = y2 ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18768 -0.06138 -0.01395  0.05394  0.23462 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.003769   0.009699  -103.5   <2e-16 ***
## x            0.499894   0.010773    46.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared:  0.9565, Adjusted R-squared:  0.956 
## F-statistic:  2153 on 1 and 98 DF,  p-value: < 2.2e-16

#Plot results
par(mar=c(5,4,4,2))
plot(x,y2,xlim=c(-3,3),col='blue',main='Regression Plot with Reduced Noise Variance',cex.main=0.8)
abline(lm2.fit,col='red',lty=3, lwd=2)
abline(-1, 0.5, lwd=1,col="black")
legend("topleft", c("y2","Least Squares y2","Pop.Regression"), cex=0.8, pch=c(1,-32,-32), lty=c(0,1,2), lwd=c(0,2,1), col=c("blue","red","black"))

The plot of the Least squared regression model for the function y with a smaller noise component shows it lies much closer true regression population line than that of the original data set.

(h) What are the confidence intervals for $\beta_0$ and $\beta_1$ based on the original data set and the less noisy data set? Comment on your results.

The confidence intervals for the original data set are larger, or wider than those for the less noisy data set. This is because the standard error, used to calculate the confidence intervals, is smaller in the less noisy data set.

#Print Confidence intervals for the original data set
confint (lm.fit)

##                  2.5 %     97.5 %
## (Intercept) -1.0575402 -0.9613061
## x            0.4462897  0.5531801

#Print Confidence intervals for the less noisy data set
confint (lm2.fit)

##                  2.5 %     97.5 %
## (Intercept) -1.0230161 -0.9845224
## x            0.4785159  0.5212720

Question 3. The question involves the Boston data set, which can be found from the MASS packages in R. We now try to predict capita crime rate using the other variables in this data set.

# Load the leaps and MASS libraries and investigate the Boston dataset
library(leaps)
library(MASS)
attach(Boston)
names(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

(a) Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis : H₀ : $\beta_j$ = 0 ? How do you interpret the coefficient of the variable chas?

#Fit a multiple regression model using all predictors
lm.fit =lm(crim~.,data=Boston)
summary(lm.fit)

## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

coef(lm.fit)

##   (Intercept)            zn         indus          chas           nox 
##  17.033227523   0.044855215  -0.063854824  -0.749133611 -10.313534912 
##            rm           age           dis           rad           tax 
##   0.430130506   0.001451643  -0.987175726   0.588208591  -0.003780016 
##       ptratio         black         lstat          medv 
##  -0.271080558  -0.007537505   0.126211376  -0.198886821

The results from the multiple regression model on all predictors show an F-statistic of 31.47 with a p-value of <2.2e-16.

The p-value for this F-statistic indicates that the Null hypothesis that there is no relationship between the response and the predictors does not hold and should be rejected. In other words that there is a relationship between the predictors and the response.

The predictors which have the greatest significance, and for which we can reject the Null Hypothesis are those with a p-values less than 0.05. These are dis (weighted mean of distances to five Boston employment centres), rad (index of accessibility to radial highways), medv (median value of owner-occupied homes in $1000s), zn (average number of rooms per dwelling) and black (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town)

The predictors with the largest p-values, and therefore the smallest significance to the regression model are tax (full-value property-tax rate per $10,000) and indus (proportion of non-retail business acres per town).

The coefficient of the variable chas is -0.749134 which indicates that if a tract bounds the Charles river the crime rate will decrease by 0.749134.

(b) Use the regsubsets() function to perform best subset selection in order to choose the best model. What is the best model obtained according to Cp, BIC and adjusted R2 respectively? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.

#Use regsubsets() function to perform subset selection
regfit.full <- regsubsets(crim~.,data=Boston,nvmax=13)

#Print summary of subset selections
regfit.summary <- summary(regfit.full)
regfit.summary

## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"

Investigate the values of Cp, BIC and adjusted R2.
The best value of Cp is the lowest value.
The best model according to Cp is Model 8

which.min(regfit.summary$cp)

## [1] 8

The best value of BIC is the lowest value. The best model according to BIC is Model 3.

which.min(regfit.summary$bic)

## [1] 3

The best value of adjusted R2 is the highest value. The best model according to adjusted R2 is Model 9.

which.max(regfit.summary$adjr2)

## [1] 9

Plot each value against the number of Predictors.

par(mfrow=c(1,3),mar=c(5,4,4,2))
plot(regfit.summary$cp, xlab = "Number of predictors", ylab = "C_p", type = "l")
points(which.min(regfit.summary$cp), regfit.summary$cp[which.min(regfit.summary$cp)], col = "red", cex = 2, pch = 20)
plot(regfit.summary$bic, xlab = "Number of predictors", ylab = "BIC", type = "l")
points(which.min(regfit.summary$bic), regfit.summary$bic[which.min(regfit.summary$bic)], col = "red", cex = 2, pch = 20)
plot(regfit.summary$adjr2, xlab = "Number of predictors", ylab = "Adjusted R^2", type = "l")
points(which.max(regfit.summary$adjr2), regfit.summary$adjr2[which.max(regfit.summary$adjr2)], col = "red", cex = 2, pch = 20)

Using the results from the printouts above, the coefficients for each of these models are printed below.

#Print Coefficients for Model 8
coef(regfit.full,8)

##   (Intercept)            zn           nox           dis           rad 
##  19.683127801   0.043293393 -12.753707757  -0.918318253   0.532616533 
##       ptratio         black         lstat          medv 
##  -0.310540942  -0.007922426   0.110173124  -0.174207166

#Print Coefficients for Model 3
coef(regfit.full,3)

##  (Intercept)          rad        black        lstat 
## -0.372585457  0.488172386 -0.009471639  0.213595700

#Print Coefficients for Model 9
coef(regfit.full,9)

##   (Intercept)            zn         indus           nox           dis 
##  19.124636156   0.042788127  -0.099385948 -10.466490364  -1.002597606 
##           rad       ptratio         black         lstat          medv 
##   0.539503547  -0.270835584  -0.008003761   0.117805932  -0.180593877

(c) Use the validation set approach to choose the best model by randomly allocated 50% of data set for training and 50% for test. What is the best model obtained according to the validation set approach? How does it compare to the best model obtained in (b)? Report the coefficients of the best model obtained.

The split of the dataset into training and test data is done by using the sample function to randomly order the dataset, before allocating 50 % to the training set and 50% to the test set.

#Validation by splitting the data into 50% trial and 50% Test sets
set.seed(123)  #Set random number - with this result is the same
train <- sample(c(TRUE ,FALSE),rep=TRUE,nrow(Boston),prob=c(0.5,0.5))
test <- (!train )

We apply regsubsets() to the training set in order to perform best subset selection.

regfit.best=regsubsets(crim~.,data=Boston[train,],nvmax=13)

A model matrix is created from the test data. The coefficients from regfit.best are extracted for the best model of that size, and multiplied into the appropriate columns of the test model matrix to form predictions from which the test MSE is computed.

test.mat=model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA ,13)
for(i in 1:13){
  coefi=coef(regfit.best,id=i)
  pred=test.mat[,names(coefi)]%*% coefi
  val.errors[i]= mean((Boston$crim[test]-pred)^2)
}

The best model is the one which results in the lowest MSE. This is the model with 13 predictor variables, the coefficients of which are printed below.

which.min(val.errors )

## [1] 2

coef(regfit.best,13)

##   (Intercept)            zn         indus          chas           nox 
## -10.279494564   0.041595943   0.011003048  -0.009149835  -2.626514842 
##            rm           age           dis           rad           tax 
##   2.649489541  -0.012157935  -0.717483829   0.570787450  -0.004562162 
##       ptratio         black         lstat          medv 
##  -0.126752345   0.014632390   0.115528123  -0.297323645

The model chosen by validation set approach does not correspond with any of the models chosen by using the C_p, BIC or Adjusted R² values

(d) Repeat part (c) with the 5-fold cross validation approach. What is the best model obtained according to this cross validation approach? How does it compare to the best models obtained in (b) and (c)?

For 5-fold cross validation a vector is created that allocates each observation to one of k=5 folds.The cv.errors vector is created to store the results.

k=5
set.seed(123)
folds=sample(1:k,nrow(Boston),replace=TRUE)
cv.errors=matrix (NA,k,13,dimnames=list(NULL,paste (1:13)))

A function predict() is required to predict values for a given model

predict.regsubsets =function (object,newdata,id ,...){
  form=as.formula(object$call[[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  xvars=names(coefi)
  mat[,xvars ]%*%coefi
  }

Cross-validation is performed for each of folds 1 through to 5. The elements of each fold are in the test set and the remainder are in the training set. The predictions are made for each model size and the test errors are stored in the matrix cv.errors.

for(j in 1:k){
  best.fit =regsubsets (crim~.,data=Boston[folds!=j,],nvmax =13)
  for(i in 1:13) {
    pred=predict(best.fit,Boston[folds==j,],id=i)
    cv.errors[j,i]=mean((Boston$crim[folds ==j]-pred)^2)
  }
}

The test error results are averaged across the 5 folds. The model with 9 predictor variables is shown to have the lowest test error.

mean.cv.errors=apply(cv.errors,2,mean)
which.min(mean.cv.errors)

## 9 
## 9

par(mfrow=c(1,1))
plot(mean.cv.errors,pch=20,type="b",col="blue",main="Average Test Error vs Number of Predictors",ylab="Average Test error",cex.main=0.8,cex=0.8)

coef(regfit.best,9)

##   (Intercept)            zn            rm           age           dis 
## -14.175189129   0.045404980   2.695840851  -0.015348953  -0.711516982 
##           rad           tax         black         lstat          medv 
##   0.558622399  -0.004917531   0.014603789   0.120665602  -0.283851506

This model corresponds with the model chosen by using the maximum adjusted R2 vlaue, but not with the models chosen by the validation approach, the C_p or BIC.

Question 4. This question should be answered using the Weekly data set, which is part of the ISLR package. It contains 1089 weekly return for 21 years, from the beginning of 1990 to the end 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

The Weekly data set provides information about the weekly percentage returns for the S&P 500 stock index between 1990 and 2010. A plot of the Volume of shares traded shows the volume increases with time and with increasing variability until the end of the dataset when a decrease in volume with high variability is observed.

library(ISLR)
attach(Weekly)
summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260

plot(Volume,type='l',col='blue',main="Volume of Shares traded between 1990 and 2010",cex.main=0.8)

The correlations between the variables shows there is only significant correlation between the Year and the Volume.

cor(Weekly[,-9])

##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

The box plots show the direction of the market in the current week plotted against weekly percentage returns. The left-most panel displays the current week which shows very similar percentage of returns in the positive and negative direction.The middle panel displays the direction of the market in the current week vs the percentage returns in the previous week. The two plots are almost identical suggesting the previous weeks data cannot be used to predict the current week’s returns. This is also the case for the data from two weeks previous which is shown in the right-most panel.

par(mfrow=c(1,3),mar=c(5,4,4,2))
boxplot(Today~Direction,data=Weekly,col=c('blue','red'),main="This Week",ylab='% change in S&P',xlab="This week's Direction")
boxplot(Lag1~Direction,data=Weekly,col=c('blue','red'),main="One Week ago",ylab='% change in S&P',xlab="This week's Direction")
boxplot(Lag2~Direction,data=Weekly,col=c('blue','red'),main="Two Weeks ago",ylab='% change in S&P',xlab="This week's Direction")

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Weekly,family=binomial)
summary (glm.fit )

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

The smallest p-value 0.0296 is associated with Lag2. This value is larger than 0.05 which typically used as a cut-off for rejecting the Null hypotheses and so there is statistically significant association between Lag2 and Direction. The other p-values are all larger and the associated predictors are not statistically significant.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

The predict() function is used to predict the probability that the market will go up, given values of the predictors provided in the Weekly dataset. The constrasts function indicates that R has created a dummy variable with 1 for Up, meaning that the probabilities are for the market going up. The probability vector is converted to a vector of class labels “Up” and “Down” depending on whether the predicted probability is greater or less than or equal to 0.5.

contrasts(Direction)

##      Up
## Down  0
## Up    1

glm.probs=predict(glm.fit,type="response")
glm.pred=rep("Down",1089)
glm.pred[glm.probs >.5]="Up"

The table() function is then used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
The number of correct predictions of “Down” that correspond to an actual Direction value “Down” is 54.
The number of incorrect predictions of “Down” that correspond to an actual Direction value “Up” is 48.
The number of incorrect predictions of “Up” that corresponds to an actual Direction value “Down” is 430.
The number of correct predictions of “Up” that corresponds to an actual Direction value “Up” is 557.
The number of correct predictions is therefore 557 + 54 = 611.
The number of incorrect predictions similarly is 430 + 48 = 478

table(glm.pred,Direction)

##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

The mean() function is used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 56.10% of the time.

mean(glm.pred==Direction)

## [1] 0.5610652

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

The training data is selected from the period 1990 to 2008 ie less than 2009, while the test data is the period from 2009 to 2010.
The logistic regression model is run on the training data using only Lag2 as predictor, and the predictions made for the test data. The resulting confusion matrix shows that the model correctly predicts the direction of the market 9+56=65 times out of 104, which is 62.50% of the time.

train=(Year<2009)
Weekly.2009<-Weekly[!train,]
dim(Weekly.2009)

## [1] 104   9

Direction.2009<-Direction[!train]
glm.fit=glm(Direction~Lag2,data=Weekly,family=binomial,subset=train )
glm.probs=predict(glm.fit,Weekly.2009,type="response")
glm.pred=rep("Down",104)
glm.pred[glm.probs >.5]="Up"
table(glm.pred,Direction.2009)

##         Direction.2009
## glm.pred Down Up
##     Down    9  5
##     Up     34 56

mean(glm.pred==Direction.2009)

## [1] 0.625

Data Analysis Assignment

D. Ormandy