Problem 2: We will now perform cross-validation on a simulated data set:

  1. Generate a simulated data set as follows:
set.seed(1)
x=rnorm(100)
y=x-2*x^2+rnorm(100)

# In this data set, what is n (the number of observations) and what is p (the number of explanatory variables)? Write out the MLR model used to generate the data in equation form.
#N is 100, and p is 2.

model <- lm(x~y)
summary(model)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4456 -0.6280 -0.1526  0.4327  2.8577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.23121    0.10171   2.273   0.0252 *
## y            0.07892    0.03316   2.380   0.0192 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8778 on 98 degrees of freedom
## Multiple R-squared:  0.05465,    Adjusted R-squared:  0.045 
## F-statistic: 5.665 on 1 and 98 DF,  p-value: 0.01924
  1. Create a scatterplot of X against Y. Comment on what you find.
plot(x,y)

#Parabolic dataset, with the parabola opening downward.
  1. Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
  2. Y=β0+β1X+ε
  1. Y=β0+β1X+β2X^2+ε
  2. Y=β0+β1X+β2X2+β3X3+ε
  3. Y=β0+β1X+β2X2+β3X3+β4X^4+ε
train<-sample(100, 50)

degreePoly<-4
polyMSE<-matrix(nrow=degreePoly, ncol=2)
colnames(polyMSE)<-c("degree", "MSE")
for(i in 1:degreePoly){ 
  polyMSE[i,1]<-i
  this.fit<-lm(x~poly(y,i), subset=train)
  polyMSE[i,2]<-mean((x-predict(this.fit))[-train]^2)
}

polyDF<-as.data.frame(polyMSE)
head(polyDF)
##   degree       MSE
## 1      1 0.9454519
## 2      2 0.9717068
## 3      3 0.9850743
## 4      4 0.9961960
  1. Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
set.seed(2)
x2=rnorm(100)
y2=x-2*x^2+rnorm(100)

train2<-sample(100, 50)

degreePoly2<-4
polyMSE<-matrix(nrow=degreePoly2, ncol=2)
colnames(polyMSE)<-c("degree", "MSE")
for(i in 1:degreePoly2){ 
  polyMSE[i,1]<-i
  this.fit<-lm(x2~poly(y2,i), subset=train2)
  polyMSE[i,2]<-mean((x2-predict(this.fit))[-train2]^2)
}

polyDF2<-as.data.frame(polyMSE)
head(polyDF2)
##   degree      MSE
## 1      1 1.301014
## 2      2 1.306334
## 3      3 1.302949
## 4      4 1.357522
#These results are not the same as last time because it is a different randomization.
  1. Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
# The first model (degree 1) had the smallest LOOCV error, which is not what I would expect. I would expect a U shaped curve with a plateau in it, instead of a continual rise.
  1. Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on cross-validation results?
# The cross examination showed a parabolic dataset that opened downward, which could agree with the results from the coefficient estimates because those estimates are continually rising, and as such at a high degree may start to decrease again (altough that seems unlikely).