Problem 1

  1. Explain how k-fold cross-validation is implemented. k-fold cross-validation is an intermediate between the single validation set approach and the leave-one-out approach: the observations are split at random into k roughly equal-sized folds. The model is trained on k-1 of the folds and validated on the one fold that was held out, and the process is repeated k times so that each fold serves once as the validation set; the k validation errors are then averaged. The rule of thumb is k = 5 or 10. (A minimal R sketch follows this list.)

  2. What are the advantages and disadvantages of k-fold cross-validation relative to:
  1. Single validation approach - In comparison, k-fold cross-validation trains on a larger share of the data (the k-1 folds rather than a single training split), so its estimate of the test error is less biased, and averaging over the k folds also makes the estimate less variable than one tied to a single random split. The disadvantage is computational: the model must be refit k times.

  1. LOOCV - In comparison, k-fold cross-validation trains on a smaller set ((k-1)/k of the observations rather than n-1), so its test-error estimate is somewhat more biased. However, its k training sets overlap far less than the n nearly identical LOOCV training sets, so the k-fold estimate has lower variance, and it requires only k model fits instead of n.
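
As a concrete illustration of the mechanics described in part 1 above, here is a minimal sketch of 5-fold cross-validation done by hand in R, compared against cv.glm() from the boot package. The toy data set and the names toy, folds, and cv.mse are illustrative choices, not part of the assignment.

# Minimal sketch: 5-fold cross-validation "by hand" on a toy data set
library(boot)
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- toy$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(toy)))   # randomly assign each row to one of k folds
cv.mse <- numeric(k)
for (j in 1:k) {
  train <- toy[folds != j, ]                        # fit on the other k - 1 folds
  test  <- toy[folds == j, ]                        # validate on the held-out fold
  fit <- lm(y ~ x, data = train)
  cv.mse[j] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(cv.mse)                                        # k-fold CV estimate of the test MSE

# cv.glm() computes the analogous estimate (with its own random fold assignment)
cv.glm(toy, glm(y ~ x, data = toy), K = k)$delta[1]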

Problem 2

  1. Generate a simulated data set as follows:
set.seed(1)
x=rnorm(100)
y=x-2*x^2+rnorm(100)

n (number of observations) = 100; p (number of explanatory variables) = 2, since the model generating the data is y = x - 2x^2 + ε (predictors x and x^2).

  1. Create a scatterplot of X against Y. Comment on what you find.
plot(x,y)

Based on the scatterplot, it is clear that there is curvature in the data and that a least squares regression line (a first-degree polynomial) would not be the best fit for the data; a quadratic relationship looks more plausible.
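
To make the curvature concrete, one could overlay the degree-1 and degree-2 least squares fits on the scatterplot. This is an optional sketch, not part of the assigned answer; ord is just a helper for drawing the fitted curves in order of x.

# Optional: overlay linear and quadratic least squares fits on the scatterplot
plot(x, y)
ord <- order(x)                                                       # sort by x so the curves draw cleanly
lines(x[ord], fitted(lm(y ~ x))[ord], col = "red", lwd = 2)           # straight line misses the bend
lines(x[ord], fitted(lm(y ~ poly(x, 2)))[ord], col = "blue", lwd = 2) # quadratic tracks the curvature
legend("topright", legend = c("degree 1", "degree 2"), col = c("red", "blue"), lwd = 2)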

  1. Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
set.seed(44)
DATA<-data.frame(x, y)

library(boot)
# Degree-1 (linear) fit; cv.glm() without a K argument performs LOOCV,
# and cv.err$delta returns the raw and bias-corrected CV error estimates
glm.fit<-glm(y~x, data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
glm.fit<-glm(y~poly(x, 2), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9374236 0.9371789
glm.fit<-glm(y~poly(x, 3), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9566218 0.9562538
glm.fit<-glm(y~poly(x, 4), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9539049 0.9534453
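
Since the four fits above follow the same pattern, the LOOCV errors for degrees 1 through 4 can also be collected with a single loop. This is an equivalent sketch rather than a required step; loocv.err is an illustrative name, and because LOOCV is deterministic the values should reproduce the first delta entries reported above.

# Sketch: the same LOOCV errors for polynomial degrees 1-4, collected in one call
loocv.err <- sapply(1:4, function(d) {
  fit <- glm(y ~ poly(x, d), data = DATA)
  cv.glm(DATA, fit)$delta[1]                 # raw LOOCV estimate for degree d
})
loocv.err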
  1. Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
set.seed(13)
DATA<-data.frame(x, y)

glm.fit<-glm(y~x, data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
glm.fit<-glm(y~poly(x, 2), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9374236 0.9371789
glm.fit<-glm(y~poly(x, 3), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9566218 0.9562538
glm.fit<-glm(y~poly(x, 4), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9539049 0.9534453
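
The results are identical to those in (c). LOOCV involves no random splitting: each observation is left out exactly once, so the procedure is deterministic and the choice of seed has no effect. By contrast, k-fold cross-validation with k < n assigns observations to folds at random, so its estimate does depend on the seed; the optional sketch below (using K = 10, an illustrative choice) shows where the seed would matter.

# Optional contrast: 10-fold CV assigns folds at random, so its estimate depends on the seed
set.seed(44)
cv.glm(DATA, glm(y ~ poly(x, 2), data = DATA), K = 10)$delta[1]
set.seed(13)
cv.glm(DATA, glm(y ~ poly(x, 2), data = DATA), K = 10)$delta[1]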
  1. Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.

The second-degree polynomial had the smallest LOOCV error. Based on the shape of the original scatterplot it seemed probable that a second-degree polynomial would be the correct fit; however, I also expected the third- and fourth-degree polynomials to have low errors, as they did. The next question to consider is whether it is “worth it” to increase the degree (and risk overfitting) when doing so gives no improvement - the LOOCV errors of the higher-degree fits are in fact slightly larger than that of the quadratic.

  1. Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
mod4<-lm(y~poly(x, 4), data=DATA)
summary(mod4)
## 
## Call:
## lm(formula = y ~ poly(x, 4), data = DATA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0550 -0.6212 -0.1567  0.5952  2.2267 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.55002    0.09591 -16.162  < 2e-16 ***
## poly(x, 4)1   6.18883    0.95905   6.453 4.59e-09 ***
## poly(x, 4)2 -23.94830    0.95905 -24.971  < 2e-16 ***
## poly(x, 4)3   0.26411    0.95905   0.275    0.784    
## poly(x, 4)4   1.25710    0.95905   1.311    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9591 on 95 degrees of freedom
## Multiple R-squared:  0.8753, Adjusted R-squared:  0.8701 
## F-statistic: 166.7 on 4 and 95 DF,  p-value: < 2.2e-16

In the least squares fits, the first- and second-degree coefficient estimates are both highly statistically significant (with the quadratic term having the smaller p-value), while the third- and fourth-degree estimates are not. These results agree with the cross-validation conclusion: it is not worth fitting a third- or fourth-degree polynomial, as either would be an example of overfitting. The correct model for these data is a second-degree polynomial.

mod2<-lm(y~poly(x, 2), data=DATA)
summary(mod2)
## 
## Call:
## lm(formula = y ~ poly(x, 2), data = DATA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9650 -0.6254 -0.1288  0.5803  2.2700 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.5500     0.0958  -16.18  < 2e-16 ***
## poly(x, 2)1   6.1888     0.9580    6.46 4.18e-09 ***
## poly(x, 2)2 -23.9483     0.9580  -25.00  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.958 on 97 degrees of freedom
## Multiple R-squared:  0.873,  Adjusted R-squared:  0.8704 
## F-statistic: 333.3 on 2 and 97 DF,  p-value: < 2.2e-16
plot(x, y)   # scatterplot of the data

plot(mod2)   # diagnostic plots for the quadratic fit
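
As an optional cross-check on the significance-based conclusion, sequential F-tests for the nested polynomial fits can be run with anova(). Here mod1 and mod3 are new names introduced only for this sketch, while mod2 and mod4 come from the summaries above; the terms beyond degree 2 should come out non-significant, consistent with the coefficient p-values.

# Optional cross-check: nested-model F-tests for degrees 1 through 4
mod1 <- lm(y ~ x, data = DATA)
mod3 <- lm(y ~ poly(x, 3), data = DATA)
anova(mod1, mod2, mod3, mod4)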