Explain how k-fold cross-validation is implemented. k-fold cross-validation is an intermediate between the single validation and leave-one-out approaches: the data are split into k folds of approximately equal size. The model is then trained on k-1 of the folds and validated on the remaining fold, and this process is repeated k times so that each fold is used once for validation; the k error estimates are averaged to give the cross-validation estimate. The rule of thumb is to use k = 5 or 10.
Single validation approach - In comparison to the single validation approach, the k-fold cross-validation estimate of the test error is less variable, because it averages over k splits rather than depending on a single random split, and each fit uses a larger share of the data for training, so the estimate is also less biased. (Relative to LOOCV, by contrast, the smaller training sets give k-fold somewhat more bias, but the less correlated fits give it a smaller variance.)
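As a rough sketch of the mechanics (an addition for illustration, not part of the original analysis; the data frame df, the simple linear model, and k = 5 are all placeholders), k-fold cross-validation can be coded by hand. In practice, cv.glm() from the boot package, which is used below for LOOCV, performs k-fold cross-validation when its K argument is supplied.
set.seed(99)
df <- data.frame(x = rnorm(50))                      # placeholder data set
df$y <- df$x + rnorm(50)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))     # randomly assign each row to one of k folds
cv.errors <- rep(NA, k)
for (j in 1:k) {
  fit <- lm(y ~ x, data = df[folds != j, ])          # train on the other k-1 folds
  pred <- predict(fit, newdata = df[folds == j, ])   # predict the held-out fold
  cv.errors[j] <- mean((df$y[folds == j] - pred)^2)  # test MSE on fold j
}
mean(cv.errors)                                      # k-fold CV estimate of the test error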
set.seed(1)
x=rnorm(100)
y=x-2*x^2+rnorm(100)
n (number of observations) = 100; p (number of explanatory variables) = 2, since the model used to generate the data is Y = X - 2X^2 + ε with ε ~ N(0, 1).
plot(x,y)
Based on the scatterplot, it is clear that there is curvature in the data and that a least squares regression line (a first-degree polynomial) would not be the best fit for the data.
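To make the curvature concrete (this overlay is an added illustration, not part of the original output), the least squares line can be drawn on top of the scatterplot:
plot(x, y)
abline(lm(y ~ x), col = "red")   # the single straight line clearly misses the bend in the data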
set.seed(44)
DATA<-data.frame(x, y)
library(boot)
glm.fit<-glm(y~x, data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
glm.fit<-glm(y~poly(x, 2), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9374236 0.9371789
glm.fit<-glm(y~poly(x, 3), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9566218 0.9562538
glm.fit<-glm(y~poly(x, 4), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9539049 0.9534453
set.seed(13)
DATA<-data.frame(x, y)
glm.fit<-glm(y~x, data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
glm.fit<-glm(y~poly(x, 2), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9374236 0.9371789
glm.fit<-glm(y~poly(x, 3), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9566218 0.9562538
glm.fit<-glm(y~poly(x, 4), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9539049 0.9534453
Rerunning the models with a different seed (set.seed(13)) gives exactly the same LOOCV errors, as expected: LOOCV involves no random splitting (each observation is left out exactly once), so the estimate does not depend on the seed. The second degree polynomial had the smallest LOOCV error. Based on the shape of the original scatterplot it seemed probable that a second degree polynomial would be the correct fit; however, I also expected the third and fourth degree polynomials to have low errors as well. The next question to consider is whether it is "worth it" to increase the degree (and risk overfitting) when doing so gives essentially no improvement in the error; the third and fourth degree errors are in fact slightly larger than the second degree error.
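As a side note (a sketch rather than the original workflow; loocv.errors is just an illustrative name), the four LOOCV fits above can be collected in one loop:
library(boot)
loocv.errors <- rep(NA, 4)
for (d in 1:4) {
  glm.fit <- glm(y ~ poly(x, d), data = DATA)          # polynomial of degree d
  loocv.errors[d] <- cv.glm(DATA, glm.fit)$delta[1]    # raw LOOCV estimate
}
loocv.errors                                           # errors for degrees 1 through 4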
mod4<-lm(y~poly(x, 4), data=DATA)
summary(mod4)
##
## Call:
## lm(formula = y ~ poly(x, 4), data = DATA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0550 -0.6212 -0.1567 0.5952 2.2267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.55002 0.09591 -16.162 < 2e-16 ***
## poly(x, 4)1 6.18883 0.95905 6.453 4.59e-09 ***
## poly(x, 4)2 -23.94830 0.95905 -24.971 < 2e-16 ***
## poly(x, 4)3 0.26411 0.95905 0.275 0.784
## poly(x, 4)4 1.25710 0.95905 1.311 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9591 on 95 degrees of freedom
## Multiple R-squared: 0.8753, Adjusted R-squared: 0.8701
## F-statistic: 166.7 on 4 and 95 DF, p-value: < 2.2e-16
In the fourth degree fit, the first and second degree (linear and quadratic) terms have statistically significant coefficient estimates (with the quadratic term's p-value even smaller than the linear term's), while the third and fourth degree terms do not. These results support the conclusion that fitting a third or fourth degree polynomial is not worth it, as both would be examples of overfitting. The correct model for the data is a second degree polynomial.
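As an additional check (a sketch beyond the original write-up; fit2 is an illustrative name that duplicates the mod2 fit below), the quadratic and quartic models can be compared with a nested-model F-test:
fit2 <- lm(y ~ poly(x, 2), data = DATA)
anova(fit2, mod4)   # tests whether the third and fourth degree terms improve the fit
A large p-value for this comparison would point to the same conclusion as the t-tests above.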
mod2<-lm(y~poly(x, 2), data=DATA)
summary(mod2)
##
## Call:
## lm(formula = y ~ poly(x, 2), data = DATA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9650 -0.6254 -0.1288 0.5803 2.2700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5500 0.0958 -16.18 < 2e-16 ***
## poly(x, 2)1 6.1888 0.9580 6.46 4.18e-09 ***
## poly(x, 2)2 -23.9483 0.9580 -25.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.958 on 97 degrees of freedom
## Multiple R-squared: 0.873, Adjusted R-squared: 0.8704
## F-statistic: 333.3 on 2 and 97 DF, p-value: < 2.2e-16
plot(x,y)
plot(mod2)