library(HSAUR3)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ISLR)
library(boot)
library("PerformanceAnalytics")

Question 1.

Use the bladdercancer data from the HSAUR3 library to answer the following questions.

A

Construct graphical and numerical summaries that will show the relationship between tumor size and the number of recurrent tumors. Discuss your discovery. (Hint: mosaic plot may be the best way to assess this)

Part 1 - Mosaic Plot

Judging by the below mosaic plot the observed frequency for 1 or 2 tumors greater than 3 cm is lower than expected and the observed frequency for 3 or 4 tumors less than or equal to 3 cm is also lower than expected.

data(bladdercancer)
#build a mosaic plot with base r for the bladder cancer data
mosaicplot(xtabs(~ number + tumorsize, data = bladdercancer),
           main='Number of Tumors to Tumor Size in Bladder Cancer Patients',
           xlab='Numbr of Tumors',
           ylab='Size of Tumor in cm',
           shade=T)

Part 2 - Histogram

The histogram shows a non-normal distrobution for both tumor sizes greater than 3 cm and less than or equal to 3 cm. The histogram is centered around one tumor and the frequency of more tumors is substantially less than a patient having just one tumor.

#use ggplot to build a histogram and label the axis values
ggplot(bladdercancer, aes(number, fill = tumorsize)) +
  geom_histogram(binwidth=.5, position="dodge") +
  labs(title='Tumor Size in Bladder Patients',
       x='Number of Tumors',
       y='Count')

Part 3 - Frequency Table

The frequency table justifies the distribution of the histogram. The highest counts for both tumor size factors come at having just one tumor and significantly decrease as the number of tumors increase.

  #create a frequency table in data.frame format by grouping tumorsise and number; summarizing the groups with a count, and then spreading the number variable into columns
  bladdercancer %>%
  group_by(tumorsize,number) %>%
  summarize(freq = n()) %>%
  spread(number,freq,sep='_of_tumors_')
## # A tibble: 2 x 5
## # Groups:   tumorsize [2]
##   tumorsize number_of_tumors_1 number_of_tumors_2 number_of_tumors_3
## *    <fctr>              <int>              <int>              <int>
## 1     <=3cm                 15                  5                  1
## 2      >3cm                  5                  2                  1
## # ... with 1 more variables: number_of_tumors_4 <int>

B

Build a Poisson regression that estimates the effect of size of tumor on the number of recurrent tumors. Discuss your results.

Part 1 - Build Regression Models

#build three separate models with the glm function
mod1_p1 <- glm(number ~time + tumorsize + tumorsize*time,data=bladdercancer,family=poisson(link=log))
mod2_p1 <- glm(number ~ time + tumorsize,data=bladdercancer,family=poisson())
mod3_p1 <- glm(number ~ tumorsize,data=bladdercancer,family=poisson())

Part 2 - Model 1 Summary Statistics

It is evident from the first model containing time, tumor size, and the interaction between tumor size and time that none of the variables are significant. However, the residual deviance and null deviance are quite low compared to the degrees of freedom

summary(mod1_p1)
## 
## Call:
## glm(formula = number ~ time + tumorsize + tumorsize * time, family = poisson(link = log), 
##     data = bladdercancer)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6943  -0.5581  -0.2413   0.2932   1.4644  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)         0.03957    0.43088   0.092    0.927
## time                0.02138    0.02418   0.884    0.377
## tumorsize>3cm       0.46717    0.66713   0.700    0.484
## time:tumorsize>3cm -0.01676    0.03821  -0.439    0.661
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12.800  on 30  degrees of freedom
## Residual deviance: 11.566  on 27  degrees of freedom
## AIC: 90.377
## 
## Number of Fisher Scoring iterations: 4

Part 3 - Model 2 Summary Statistics

Interestingly, if we take out the interaction between tumor size and tim our AIC actually drops. Our resiual and null deviance are approximately the same but there is still no significance with the time or tumor size.

summary(mod2_p1)
## 
## Call:
## glm(formula = number ~ time + tumorsize, family = poisson(), 
##     data = bladdercancer)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8183  -0.4753  -0.2923   0.3319   1.5446  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.14568    0.34766   0.419    0.675
## time           0.01478    0.01883   0.785    0.433
## tumorsize>3cm  0.20511    0.30620   0.670    0.503
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12.800  on 30  degrees of freedom
## Residual deviance: 11.757  on 28  degrees of freedom
## AIC: 88.568
## 
## Number of Fisher Scoring iterations: 4

Part 4 - Model 3 Summary Statistics

With model 3 we drop the time variable and notice that our intercept is now significant at the 5% level however, tumor size is still not significant. Our residual and null deviance values are still low compared to the degrees of freedom and our AIC drops further.

summary(mod3_p1)
## 
## Call:
## glm(formula = number ~ tumorsize, family = poisson(), data = bladdercancer)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6363  -0.3996  -0.3996   0.4277   1.7326  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     0.3747     0.1768   2.120    0.034 *
## tumorsize>3cm   0.2007     0.3062   0.655    0.512  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12.80  on 30  degrees of freedom
## Residual deviance: 12.38  on 29  degrees of freedom
## AIC: 87.191
## 
## Number of Fisher Scoring iterations: 4

Part 5 - ANOVA

#model1 vs model2 vs model 3
anova(mod1_p1,mod2_p1,mod3_p1,test='Chisq')
## Analysis of Deviance Table
## 
## Model 1: number ~ time + tumorsize + tumorsize * time
## Model 2: number ~ time + tumorsize
## Model 3: number ~ tumorsize
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        27     11.566                     
## 2        28     11.757 -1 -0.19095   0.6621
## 3        29     12.380 -1 -0.62363   0.4297

Part 6 - Comparing Deviance on Model 1

The output indicates that we can reject the null hypothesis and that a more complicated model may work to explain the data. However, my assumption is that we are simply trying to explain noise and are overfitting by adding additional variables and interaction terms. Additionally, there is no statistical significance of the variables explaining the affect on th number of tumors in bladder cancer patients.

1-pchisq((mod1_p1$null.deviance-mod1_p1$deviance),(mod1_p1$df.null-mod1_p1$df.residual))
## [1] 0.7448414

Comparing Deviance on Model 2

Comparing deviance on model 2 further justifies my assumption from part 5. We removed the interaction term and the deviance test is reduced to 0.59 which signfifies the model is getting closer to the null hypothesis with no statistical significance between time or tumor size and the number of tumors in bladder cancer patients.

1-pchisq((mod2_p1$null.deviance-mod2_p1$deviance),(mod2_p1$df.null-mod2_p1$df.residual))
## [1] 0.5935891

Comparing Deviance on Model 3

After we compare our deviance on the third model we notice that this model is no different than the null model and we accept the null hpothesis. There is no interaction between tumor size and the number of tumors with this data set.

1-pchisq((mod3_p1$null.deviancee-mod3_p1$deviance),(mod3_p1$df.null-mod3_p1$df.residual))
## numeric(0)

Final Answer

From the above analysis I accept the null hypothesis that there is nothing within the data to explain an increase in the number of tumors. Neither time nor the tumor size have any affect on increasing the number of tumors. This can be further proved through part one and interpreting the distribution of the data.

Question 2

The following data is the number of new AIDS cases in Belgium between the years 1981-1993. Let t denote time y < c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240) t <- 1:13

A.

Plot the relationship between #AIDS cases against time. Comment on the plot

The scatter plot indicates a clear relationship. As time increases so does the number of AIDs cases. The relationship peaks at year 11 and then begins to decline.

y <- c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240) 
t <- 1:13

data <- data.frame(number=y,time=t)
#build a scatterplot with ggplot2 and label the axis
ggplot(data,aes(x=time,y=number)) +
  geom_point() + 
  labs(title='Scatterplot of AIDs Cases From 1981 to 1993',
       x = 'Time in Years From 1981',
       y='Number of AIDs Cases')

B.

Fit a Poisson regression model \(log(\mu_i) = \beta_0 + \beta_1 t_i\). Comment on the model parameters and residuals (deviance) vs Fitted plot.

model_p2a <- glm(number ~ time, data=data,family=poisson())

Part 1 - Model 1 Summary Statisitics

After reviewing the residual deviance of the model we can determine that the model is overdispersed by a factor of about 7. 80.686 on 11 degrees of freedom.

summary(model_p2a)
## 
## Call:
## glm(formula = number ~ time, family = poisson(), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.140590   0.078247   40.14   <2e-16 ***
## time        0.202121   0.007771   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4

Part 2 - Model 1 Residual vs Fitted Plot

The residuals vs fitted plot further proves that the data is overdispersed. At time 1, 2, and 13 the residual values are far from zero which idicates they are outliers. Additionally, there is a clear pattern to the residual plot which indicates that mean does not increase as the variance increase because there is not a constant spread in the residuals.

#call the plot funtion on the model selecting only the residual vs fitted plot
plot(model_p2a,which=1)

C.

Now add a quadratic term in time (i.e., \(log(\mu_i) = \beta_0 + \beta_1 t_i + \beta_2 t^2_i\) ) and fit the model. Comment on the model parameters and assess the residual plots.

model_p2b <- glm(number ~ time + poly(time,degree=2),data=data,family=poisson()) 

Part 1 - Model 2 Summary Statistics

After adding the 2nd order polynomial covariate to the model the residual deviance to the degrees of freedom now indicate there is no overdispersion.

summary(model_p2b)
## 
## Call:
## glm(formula = number ~ time + poly(time, degree = 2), family = poisson(), 
##     data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.45903  -0.64491   0.08927   0.67117   1.54596  
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              2.64858    0.11323  23.390  < 2e-16 ***
## time                     0.25716    0.01167  22.038  < 2e-16 ***
## poly(time, degree = 2)1       NA         NA      NA       NA    
## poly(time, degree = 2)2 -0.95511    0.11896  -8.029 9.82e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.2058  on 12  degrees of freedom
## Residual deviance:   9.2402  on 10  degrees of freedom
## AIC: 96.924
## 
## Number of Fisher Scoring iterations: 4

Part 2 - Model 2 Residual vs Fitted Plot

The plot further justifies my thought that the model is not overdispersed and explains the data due to the residuals being evenly spread on the plot.

#call the plot funtion on the model selecting only the residual vs fitted plot
plot(model_p2b,which=1)

D.

Compare the two models using AIC, which is better? Answer - Model 2’s AIC is clearly lower, 96.23 vs 166.37, which further tells us that the second model with the 2nd order polynomial covariate better explains the data.

Model 1 AIC

model_p2a$aic
## [1] 166.3698

Model 2 AIC

model_p2b$aic
## [1] 96.92358

E.

Use anova()-function to perform \(\chi^2\) test for model selection. Did adding the quadratic term improve model? Answer - The second order polynomial improved the model. The p value shows statistical significance in the second model thus we reject the null hypothesis that there is no difference in models.

anova(model_p2a,model_p2b,test='Chisq')
## Analysis of Deviance Table
## 
## Model 1: number ~ time
## Model 2: number ~ time + poly(time, degree = 2)
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        11     80.686                          
## 2        10      9.240  1   71.446 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 3

Load the Default dataset from ISLR library. The dataset contains information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt. It is a 4 dimensional dataset with 10000 observations. You had developed a logistic regression model on HW #2. Now consider the following two models

Model1 \(\rightarrow\) Default = Student + balance

Model2 \(\rightarrow\) Default = Balance

For the two competing models do the following

A

With the whole data compare the two models (Use AIC and/or error rate). Answer - Model 1 has a lower AIC value than model 2. Model 1 also has a lower mean square error than model 2, thus model 1 with the added \(student\) covariate is slightly better than the second model without \(student\) as a covariate.

data(Default)
#change the default variable to binary format
Default <- Default %>%
  mutate(default=ifelse(default=='Yes',1,0))
#split the data set into 70% training and 30% test
smp_size <- floor(0.7 * nrow(Default))

#set the seed for reproducibility
set.seed(123)
#create an index for the training set
train_ind <- sample(seq_len(nrow(Default)), size = smp_size)
#subset the default data set by the rows in the trainin data set
train <- Default[train_ind, ]
test <- Default[-train_ind, ]
#build the two models
model_p3a <- glm(default ~ student + balance,data=Default,family=binomial())
model_p3b <- glm(default ~ balance,data=Default,family=binomial())

Part 1 - Mean Square Error & AIC of Model 1

MSE

mean((predict(model_p3a,Default,type='response')-Default$default)^2)
## [1] 0.02130176

AIC

model_p3a$aic
## [1] 1577.682

Part 2 - Mean Square Error & AIC of Model 2

MSE

mean((predict(model_p3b,Default,type='response')-Default$default)^2)
## [1] 0.02170579

AIC

model_p3b$aic
## [1] 1600.452

B

Use validation set approach and choose the best model. Be aware that we have few people who defaulted in the data. Answer - There is little difference between the two models even after adding the covariate \(student\). This suggests that \(student\) status is not a good predicting covariate of defaulting on a credit card. Nevertheless, there is less error and a lower AIC by adding student. An interesting differnce between these statistics and the statistics described without a validation approach is that there is a much lower AIC and means square error.

#use the training data set to rebuild both models
model_p3c <- glm(default ~ student + balance,data=train,family=binomial())
model_p3d <- glm(default ~ balance,data=train,family=binomial())

Part 1 - Summary Statistics, Mean Square Error, and AIC of Model 1

Summary Statistics

summary(model_p3c)
## 
## Call:
## glm(formula = default ~ student + balance, family = binomial(), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5220  -0.1382  -0.0528  -0.0188   3.7802  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.100e+01  4.492e-01  -24.49  < 2e-16 ***
## studentYes  -7.169e-01  1.753e-01   -4.09 4.32e-05 ***
## balance      5.914e-03  2.807e-04   21.07  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2130.6  on 6999  degrees of freedom
## Residual deviance: 1109.0  on 6997  degrees of freedom
## AIC: 1115
## 
## Number of Fisher Scoring iterations: 8

MSE

mean((predict(model_p3c,test,type='response')-test$default)^2)
## [1] 0.02071291

AIC

model_p3c$aic
## [1] 1115.033

Part 2 - Summary Statistics, Mean Square Error, and AIC of Model 2

Summary Statistics

summary(model_p3d)
## 
## Call:
## glm(formula = default ~ balance, family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3391  -0.1439  -0.0555  -0.0206   3.7937  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.090e+01  4.393e-01  -24.80   <2e-16 ***
## balance      5.673e-03  2.672e-04   21.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2130.6  on 6999  degrees of freedom
## Residual deviance: 1126.7  on 6998  degrees of freedom
## AIC: 1130.7
## 
## Number of Fisher Scoring iterations: 8

MSE

mean((predict(model_p3d,test,type='response')-test$default)^2)
## [1] 0.02098329

AIC

model_p3d$aic
## [1] 1130.69

C.

Use LOOCV approach and choose the best model. Answer - Model 1 is the bes model using Leave One Out Cross Validation as indicated by the below two outputs.

Part 1 - Leave One Out Cross Validation of Model 1

###LOOCV

##one
#set n to 100
#this method was usd because my computer crashed every time I used the method outlined in the power point and lecture
n <- 100
#build an empty matrix to loop through
loocv_tmp1 <- matrix(NA, nrow = n, ncol = 1)

for (k in 1:n) {
  #create a loocv training set for k in 1:100
  train_loocv <- Default[-k, ]
  test_loocv <- Default[k, ]
  #fit the model at each iteration
  fitted_models <- glm(default ~ balance + student,data=train_loocv,family='binomial')
  #calculation the mse at each iteration
  MSE1 <- mean((predict(fitted_models,test_loocv,type='response')-test_loocv$default)^2)
  #set the kth row to the mse from the above calculation
  loocv_tmp1[k, ] <- MSE1
}
#calculation the mean of the cross validation
 mean(loocv_tmp1)
## [1] 0.0008405851

Part 2 - Leave One Out Cross Validation of Model 2

##two

n <- 100

loocv_tmp2 <- matrix(NA, nrow = n, ncol = 1)

for (k in 1:n) {
  train_loocv <- Default[-k, ]
  test_loocv <- Default[k, ]
  fitted_models <- glm(default ~ balance,data=train_loocv,family='binomial')
  MSE1 <- mean((predict(fitted_models,test_loocv,type='response')-test_loocv$default)^2)
  loocv_tmp2[k, ] <- MSE1
}
mean(loocv_tmp2)
## [1] 0.001071851

D.

Use 10-fold cross-validation approach and choose the best model. Answer - Model 1 has a lower error rate using 10 fold cross validation thus model 1 is the best model in 10-fold cross validation.

Part 1 - 10-fold Cross Validation of Model 1

cv.glm(train, model_p3c,K=10)$delta[1]
## [1] 0.02162689

Part 2 - 10-fold Cross Validation of Model 2

cv.glm(train, model_p3d,K=10)$delta[1]
## [1] 0.0221047

Final Answer

It is evident from all three cross validation techniques that adding student as a coviariate reduces the error of predictions slightly, but enough for it to be used in model selection. Thus model 1, with studen as a covariate, is the best model under all three cross validation techniques.

Question 4

In the ISLR library load the Smarket dataset. This contains Daily percentage returns for the S&P 500 stock index between 2001 and 2005. There are 1250 observations and 9 variables. The variable of interest is Direction which is a factor with levels Down and Up indicating whether the market had a positive or negative return on a given day. Since the goal is to predict the direction of the stock market in the future, here it would make sense to use the data from years 2001 - 2004 as training and 2005 as validation. According to this, create a training set and testing set. Perform logistic regression and assess the error rate.

data(Smarket)
#convert Direction to binary
Smarket <- Smarket %>%
  mutate(Direction = ifelse(Direction == 'Up',1,0))
#split the training and test sets by year
train <- Smarket[Smarket$Year != 2005,]
test <- Smarket[Smarket$Year == 2005,]

#build the 7 models
model1 <- glm(Direction ~ Lag1,train, family=binomial())
model2 <- glm(Direction ~ Lag1 + Lag2,train,family=binomial())
model3 <- glm(Direction ~ Lag1 + Lag2+Lag3,train,family=binomial())
model4 <- glm(Direction ~ Lag1 + Lag2+Lag3+Lag4,train,family=binomial())
model5 <- glm(Direction ~ Lag1 + Lag2+Lag3+Lag4+Lag5,train,family=binomial())
model6 <- glm(Direction ~ Lag1 + Lag2+Lag3+Lag4+Lag5+Volume,train,family=binomial())
model7 <- glm(Direction ~ Lag1 + Lag2  + Lag1*Lag3 +Lag1*Lag5 +Lag3*Lag4 + Lag3*Lag5,data = train, family=binomial())
#run an anovva on the 7 different models to find the best model
anov <- anova(model1,model2,model3,model4,model5,model6,model7,test='Chisq')

#calculate the MSE of model 7
MSE <- mean((predict(model7,test,type='response')-test$Direction)^2)

#follow the same steps as problem 3 for LOOCV
n <- 100

loocv_tmp1 <- matrix(NA, nrow = n, ncol = 1)

for (k in 1:n) {
  train_loocv <- Smarket[-k, ]
  test_loocv <- Smarket[k, ]
  fitted_models <- glm(Direction ~ Lag1 + Lag2  + Lag1*Lag3 +Lag1*Lag5 +Lag3*Lag4 + Lag3*Lag5,data = train_loocv, family='binomial')
  loocv_tmp1[k, ] <- mean((predict(fitted_models,test_loocv,type='response')-test_loocv$Direction)^2)
}
loocv1 <- mean(loocv_tmp1)

#calculate 10 fold cross validation using the in class function
tenfold <-cv.glm(train, model7,K=10)$delta[1]

Part 1 - Model Selection

I have built 7 different models, all with different covariates, and the 7th model with interaction terms and only selected covariates. Below is the output of the analysis of variance of the models. Model7 is slighltly better due to its near significance, thus we will use it for our final model.

anov
## Analysis of Deviance Table
## 
## Model 1: Direction ~ Lag1
## Model 2: Direction ~ Lag1 + Lag2
## Model 3: Direction ~ Lag1 + Lag2 + Lag3
## Model 4: Direction ~ Lag1 + Lag2 + Lag3 + Lag4
## Model 5: Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5
## Model 6: Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
## Model 7: Direction ~ Lag1 + Lag2 + Lag1 * Lag3 + Lag1 * Lag5 + Lag3 * 
##     Lag4 + Lag3 * Lag5
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       996     1382.1                       
## 2       995     1381.4  1   0.7428  0.38877  
## 3       994     1381.4  1   0.0293  0.86409  
## 4       993     1381.3  1   0.0281  0.86694  
## 5       992     1381.3  1   0.0040  0.94939  
## 6       991     1381.1  1   0.2355  0.62748  
## 7       988     1373.8  3   7.2708  0.06375 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part 2 - Summary Statistics

The summary statistics indicate a slight statistical significance of the the percentage return for the previous day and for 5 days previous. The AIC of the model is relatively high.

summary(model7)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag1 * Lag3 + Lag1 * 
##     Lag5 + Lag3 * Lag4 + Lag3 * Lag5, family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7186  -1.1874   0.9387   1.1588   1.6609  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.032001   0.063624   0.503   0.6150  
## Lag1        -0.052534   0.053664  -0.979   0.3276  
## Lag2        -0.054475   0.052344  -1.041   0.2980  
## Lag3         0.002665   0.053421   0.050   0.9602  
## Lag5        -0.002196   0.051943  -0.042   0.9663  
## Lag4        -0.001858   0.052108  -0.036   0.9716  
## Lag1:Lag3    0.039926   0.031925   1.251   0.2111  
## Lag1:Lag5   -0.067380   0.034159  -1.973   0.0486 *
## Lag3:Lag4   -0.039895   0.034931  -1.142   0.2534  
## Lag3:Lag5    0.033801   0.031114   1.086   0.2773  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1373.8  on 988  degrees of freedom
## AIC: 1393.8
## 
## Number of Fisher Scoring iterations: 4

Part 3 - Mean Square Error

The mean square error of our final model is 0.248, which means there is a relatively low error rate for predicting the outcome of the stock market given the previous 5 days worth of returns.

MSE
## [1] 0.248777

Part 4 - Leave One Out Cross validation

loocv1
## [1] 0.2595494

Part 5 - 10-fold Cross Validation

tenfold
## [1] 0.2520341

Final Answer - All three cross validation techniques sugest there is between a 24-26 error rate when attempting to predict the market movement (up or down) on a given day with previous 1 through 5 days percentage returns and the market volume for a given day. One thing that was learned is that there is some interaction between yesterday’s percentage return and the percentage return from 5 days ago. This was significant at the 5% level in our final selected model. All other covariates were not significant thus the model overfits the data.