Problem 1
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
The sample size of y, n, is
n <- 100
The number of predictors in the model, p, is
p <- 2
Since the rnorm(100) term in y plays the role of an error term with mean 0 and sd 1, the model has two predictors: x and x^2.
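In equation form, the data-generating model is

y = x - 2 x^2 + e,   e ~ N(0, 1),   n = 100,

so x enters both linearly and quadratically.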
plot(x,y, pch = 19)
The scatterplot shows a clearly non-linear, inverted-parabola relationship between x and y.
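To make the curvature explicit, a minimal sketch (my addition, reusing the x and y generated above) overlays the true mean curve E[y | x] = x - 2x^2 on the scatterplot:

plot(x, y, pch = 19)
ord <- order(x)  # sort x so the curve draws left to right
lines(x[ord], x[ord] - 2 * x[ord]^2, col = "red", lwd = 2)  # true mean curve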
library(boot)
set.seed(100)
X <- rnorm(100)
X.cent <- scale(X, scale = FALSE)  # center X at its sample mean
Y <- X - 2 * X^2 + rnorm(100)
dat <- data.frame(X.cent, Y)
# LOOCV estimates of test MSE for the centered-predictor models of
# second, third, and fourth order
cv.err.c <- rep(NA, 3)
for (i in 1:3) {
  glm.fit <- glm(Y ~ poly(X.cent, i + 1), data = dat)
  cv.err.c[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.c
## [1] 0.6511909 0.6665339 0.6671261
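As a sanity check (my addition, not part of the original assignment): for least-squares fits, LOOCV has the closed form CV = (1/n) * sum(((y_i - yhat_i) / (1 - h_i))^2), where the h_i are the leverage (hat) values, so the quadratic model's error above (0.6512) can be reproduced without refitting. The names fit2 and h are mine:

fit2 <- glm(Y ~ poly(X.cent, 2), data = dat)  # quadratic fit on the same data
h <- hatvalues(fit2)                          # leverage values
mean((residuals(fit2, type = "response") / (1 - h))^2)  # should match delta[1] above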
set.seed(50)
X <- rnorm(100)
X.cent <- scale(X, scale = FALSE)
Y <- X - 2 * X^2 + rnorm(100)
dat <- data.frame(X.cent,Y)
# LOOCV estimates of test MSE for the same three models on the new sample
cv.err.d <- rep(NA, 3)
for (i in 1:3) {
  glm.fit <- glm(Y ~ poly(X.cent, i + 1), data = dat)
  cv.err.d[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.d
## [1] 0.990204 1.000547 1.013875
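The errors differ from the seed-100 run because each seed generates a fresh sample, so the LOOCV estimates are themselves random. A small experiment (my addition; the seeds 1, 10, and 25 and the names Xs, Ys, d are arbitrary choices of mine) repeats the comparison under the same data-generating model and records the winning polynomial degree for each seed:

seeds <- c(1, 10, 25)
degrees.won <- rep(NA, 3)
for (k in 1:3) {
  set.seed(seeds[k])
  Xs <- rnorm(100)
  Ys <- Xs - 2 * Xs^2 + rnorm(100)
  d <- data.frame(Xs.cent = as.numeric(scale(Xs, scale = FALSE)), Ys = Ys)
  errs <- rep(NA, 3)
  for (i in 1:3) {
    fit <- glm(Ys ~ poly(Xs.cent, i + 1), data = d)
    errs[i] <- cv.glm(d, fit)$delta[1]
  }
  degrees.won[k] <- which.min(errs) + 1  # winning degree (index 1 = quadratic)
}
degrees.won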
min.err.fold <- which.min(cv.err.c)
min.err <- cv.err.c[min.err.fold]
min.err.fold # index of the model with the smallest LOOCV error (1 = second order)
## [1] 1
round(min.err, 4) # MSE corresponding to the model with the smallest LOOCV error
## [1] 0.6512
The second-order model has the smallest LOOCV error. This is what I expected, since the true y is quadratic to begin with.
glm.fit.2 <- glm(Y ~ poly(X.cent, 2), data = dat)
summary(glm.fit.2)
##
## Call:
## glm(formula = Y ~ poly(X.cent, 2), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.55901 -0.67948 0.02914 0.59356 2.65331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.19133 0.09814 -22.33 <2e-16 ***
## poly(X.cent, 2)1 15.43428 0.98140 15.73 <2e-16 ***
## poly(X.cent, 2)2 -27.28047 0.98140 -27.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9631416)
##
## Null deviance: 1075.866 on 99 degrees of freedom
## Residual deviance: 93.425 on 97 degrees of freedom
## AIC: 284.99
##
## Number of Fisher Scoring iterations: 2
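Note that poly() uses orthogonal polynomials by default, so the coefficients above are not on the scale of the data-generating model. As a quick sketch (my addition; glm.raw is my name), refitting with raw = TRUE should give a quadratic coefficient near the true -2 and a linear coefficient near the true 1, the latter only approximately since centering X shifts the linear term slightly:

glm.raw <- glm(Y ~ poly(X.cent, 2, raw = TRUE), data = dat)  # raw polynomial basis
coef(glm.raw)  # quadratic coefficient should be close to the true -2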
glm.fit.4 <- glm(Y ~ poly(X.cent, 4), data = dat)
summary(glm.fit.4)
##
## Call:
## glm(formula = Y ~ poly(X.cent, 4), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6894 -0.6778 0.1201 0.5571 2.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.19133 0.09866 -22.212 <2e-16 ***
## poly(X.cent, 4)1 15.43428 0.98656 15.645 <2e-16 ***
## poly(X.cent, 4)2 -27.28047 0.98656 -27.652 <2e-16 ***
## poly(X.cent, 4)3 0.56089 0.98656 0.569 0.571
## poly(X.cent, 4)4 -0.80433 0.98656 -0.815 0.417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9732968)
##
## Null deviance: 1075.866 on 99 degrees of freedom
## Residual deviance: 92.463 on 95 degrees of freedom
## AIC: 287.95
##
## Number of Fisher Scoring iterations: 2
Since the true y is quadratic to begin with, it is expected that the first- and second-order terms are highly significant in both fits, while the third- and fourth-order terms in the fourth-order model are not.
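As one more cross-check (my addition), a nested-model F-test comparing the second- and fourth-order fits should agree with the t-tests and the LOOCV result, finding no evidence for the higher-order terms:

anova(glm.fit.2, glm.fit.4, test = "F")  # tests the cubic and quartic terms jointly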