Problem 1

  (a) In this data set, what is n and what is p? Write out the model used to generate the data in equation form.
  set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)

The sample size n (the number of observations of y) is

n <- 100

The number of predictors in the model, p, is

p <- 2

Since rnorm(100) in the expression for y supplies an error term with mean 0 and standard deviation 1, the model used to generate the data is Y = X - 2X^2 + ε, where ε ~ N(0, 1). There are two predictors: X and X^2.

  (b) Create a scatterplot of X against Y. Comment on what you find.
plot(x, y, pch = 19)

There is a clear non-linear relationship between X and Y: the points follow a concave, quadratic-looking curve.
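To make the curvature explicit, one can overlay the true mean function f(x) = x - 2x^2 on the scatterplot; a minimal sketch (the red curve is an addition, not part of the original plot):

plot(x, y, pch = 19)
curve(x - 2 * x^2, add = TRUE, col = "red", lwd = 2)  # true mean function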

  (c) Set a random seed, and then compute the LOOCV errors that result from fitting the following three models using least squares. Center the predictor before forming the second-, third-, and fourth-order terms.
library(boot)
set.seed(100)
X <- rnorm(100)
X.cent <- scale(X, scale = FALSE)  # center X at its mean
Y <- X - 2 * X^2 + rnorm(100)
dat <- data.frame(X.cent, Y)

# LOOCV error for the quadratic, cubic, and quartic models (centered predictor)
cv.err.c <- rep(NA, 3)
for (i in 1:3) {
  glm.fit <- glm(Y ~ poly(X.cent, i + 1), data = dat)  # polynomial of degree i + 1
  cv.err.c[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.c
## [1] 0.6511909 0.6665339 0.6671261
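For intuition about what cv.glm is doing: with its default K = n it performs ordinary leave-one-out CV, refitting the model n times and holding out one observation each time. A minimal hand-rolled sketch for the quadratic model, assuming the dat data frame built above:

n.obs <- nrow(dat)
sq.err <- rep(NA, n.obs)
for (j in 1:n.obs) {
  fit.j <- glm(Y ~ poly(X.cent, 2), data = dat[-j, ])             # fit without observation j
  sq.err[j] <- (dat$Y[j] - predict(fit.j, newdata = dat[j, ]))^2  # held-out squared error
}
mean(sq.err)  # should reproduce cv.err.c[1]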
  (d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
set.seed(50)
X <- rnorm(100)
X.cent <- scale(X, scale = FALSE)  # center X at its mean
Y <- X - 2 * X^2 + rnorm(100)
dat <- data.frame(X.cent, Y)

# LOOCV error for the quadratic, cubic, and quartic models (centered predictor)
cv.err.d <- rep(NA, 3)
for (i in 1:3) {
  glm.fit <- glm(Y ~ poly(X.cent, i + 1), data = dat)  # polynomial of degree i + 1
  cv.err.d[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.d
## [1] 0.990204 1.000547 1.013875

No, the LOOCV errors differ from those in (c). A different seed generates a different realization of X and Y, so the three models are fit to a different data set, and the error estimates change accordingly. The difference comes entirely from the new data: for a fixed data set, LOOCV itself involves no randomness, because every observation is left out exactly once.
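A quick check of that last point (a small sketch, not part of the original solution): rerunning cv.glm on the same data frame under different seeds reproduces the same leave-one-out error.

set.seed(1)
err1 <- cv.glm(dat, glm(Y ~ poly(X.cent, 2), data = dat))$delta[1]
set.seed(2)
err2 <- cv.glm(dat, glm(Y ~ poly(X.cent, 2), data = dat))$delta[1]
all.equal(err1, err2)  # TRUE: LOOCV is deterministic given the data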
  (e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
min.err.fold <- which.min(cv.err.c)
min.err <- cv.err.c[min.err.fold]
min.err.fold     # index of the model with the smallest LOOCV error
## [1] 1
round(min.err, 4) # LOOCV error (MSE) of that model
## [1] 0.6512

The first model, i.e. the second-order fit, has the smallest LOOCV error. This is what I expected, since the true y is quadratic in x to begin with.
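
Mapping the loop index back to the polynomial degree makes this easier to read; a small optional sketch:

data.frame(degree = 2:4, loocv.error = round(cv.err.c, 4))  # degree 2 has the smallest error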

  (f) Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
glm.fit.2 <- glm(Y ~ poly(X.cent, 2), data = dat)
summary(glm.fit.2)
## 
## Call:
## glm(formula = Y ~ poly(X.cent, 2), data = dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.55901  -0.67948   0.02914   0.59356   2.65331  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.19133    0.09814  -22.33   <2e-16 ***
## poly(X.cent, 2)1  15.43428    0.98140   15.73   <2e-16 ***
## poly(X.cent, 2)2 -27.28047    0.98140  -27.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9631416)
## 
##     Null deviance: 1075.866  on 99  degrees of freedom
## Residual deviance:   93.425  on 97  degrees of freedom
## AIC: 284.99
## 
## Number of Fisher Scoring iterations: 2
glm.fit.3 <- glm(Y ~ poly(X.cent, 3), data = dat)
summary(glm.fit.3)
glm.fit.4 <- glm(Y ~ poly(X.cent, 4), data = dat)
summary(glm.fit.4)
## 
## Call:
## glm(formula = Y ~ poly(X.cent, 4), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6894  -0.6778   0.1201   0.5571   2.7050  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.19133    0.09866 -22.212   <2e-16 ***
## poly(X.cent, 4)1  15.43428    0.98656  15.645   <2e-16 ***
## poly(X.cent, 4)2 -27.28047    0.98656 -27.652   <2e-16 ***
## poly(X.cent, 4)3   0.56089    0.98656   0.569    0.571    
## poly(X.cent, 4)4  -0.80433    0.98656  -0.815    0.417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9732968)
## 
##     Null deviance: 1075.866  on 99  degrees of freedom
## Residual deviance:   92.463  on 95  degrees of freedom
## AIC: 287.95
## 
## Number of Fisher Scoring iterations: 2

Since the true model is quadratic, it is no surprise that the first- and second-order coefficients are highly significant in every fit, while the third- and fourth-order coefficients are not (p = 0.571 and p = 0.417 in the quartic model). Because poly() uses orthogonal polynomials, the lower-order coefficient estimates are identical across the fits. These results agree with the cross-validation conclusion: the quadratic model is the best of the three.
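
As a supplementary check (an addition, not part of the original solution), a nested-model F test comparing the quadratic and quartic fits should likewise find no significant improvement from the cubic and quartic terms:

anova(glm.fit.2, glm.fit.4, test = "F")  # expect a large p-value for the extra terms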