1a) The set of n observations is randomly split into k groups (folds) of roughly equal size. One fold is held out as the validation set while the model is fit on the remaining k-1 folds, and the error on the held-out fold is computed. This is repeated so that each fold is held out exactly once, and the k resulting error estimates are averaged to give the k-fold CV estimate of the test error.
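As a concrete illustration, here is a minimal by-hand sketch of k-fold CV on simulated data (the names k, fold_id, and fold_mse are my own, not part of the exercise):

set.seed(1)
n <- 100
k <- 5
dat <- data.frame(x = rnorm(n))
dat$y <- dat$x - 2*dat$x^2 + rnorm(n)

fold_id <- sample(rep(1:k, length.out = n))  # randomly assign each observation to a fold
fold_mse <- numeric(k)
for(j in 1:k){
  fit <- lm(y ~ x, data = dat[fold_id != j, ])         # fit on the other k-1 folds
  pred <- predict(fit, newdata = dat[fold_id == j, ])  # predict the held-out fold
  fold_mse[j] <- mean((dat$y[fold_id == j] - pred)^2)
}
mean(fold_mse)  # the k-fold CV estimate of the test MSE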

bi) It has less variability than the single validation-set approach, since the estimate does not depend on one particular random split of the data.
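To see that variability directly, one can repeat the validation-set estimate with different random splits and watch the MSE move around (a sketch on simulated data; mses is my own name):

set.seed(1)
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)
dat <- data.frame(x, y)

mses <- sapply(1:5, function(s){
  set.seed(s)
  train <- sample(100, 50)  # a fresh random half/half split each time
  fit <- lm(y ~ x, data = dat, subset = train)
  mean((dat$y[-train] - predict(fit, dat[-train, ]))^2)
})
mses  # five different estimates of the same test error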

bii) Relative to LOOCV, k-fold CV is less computationally expensive (k model fits instead of n) and often gives a more accurate estimate of the test error rate: the k-fold estimate has less variance than LOOCV, at the cost of somewhat more bias.
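The computational difference can be seen with boot::cv.glm, whose K argument sets the number of folds (the default K = n gives LOOCV); a short sketch on simulated data:

library(boot)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- dat$x - 2*dat$x^2 + rnorm(100)
fit <- glm(y ~ x, data = dat)
cv.glm(dat, fit)$delta[1]           # LOOCV: refits the model n = 100 times
cv.glm(dat, fit, K = 10)$delta[1]   # 10-fold: only 10 refits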

2a)

set.seed(1)
x=rnorm(100)            # predictor: 100 standard normal draws
y=x-2*x^2+rnorm(100)    # quadratic mean function plus noise

There are n = 100 observations and p = 2 explanatory variables (x and x^2). The model is y = x - 2x^2 + e, so the betas are beta_0 = 0, beta_1 = 1, and beta_2 = -2.

2b)

plot(x,y)

There is a clear curved (inverted-U) relationship between x and y, as expected from the quadratic mean function.
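The quadratic shape can be confirmed by overlaying the true mean function on the scatterplot (run immediately after plot(x,y)):

curve(x - 2*x^2, add = TRUE, col = "red")  # true mean function y = x - 2x^2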

2c)

set.seed(2)
x=rnorm(100)
y=x-2*x^2+rnorm(100)
data=data.frame(x,y)

library(boot)  # provides cv.glm()

# LOOCV error (the cv.glm default) for polynomial fits of degree 1 through 4
cv.error<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  cv.error[i]<-cv.glm(data, glm.fit)$delta[1]
}

cvDF<-data.frame(degree=1:4, cv.error)

library(tidyverse)
ggplot(data=cvDF, aes(x=degree, y=cv.error))+
  geom_point()+
  geom_line()

2d)

set.seed(1337)
x=rnorm(100)
y=x-2*x^2+rnorm(100)
data=data.frame(x,y)

# repeat the LOOCV computation for degrees 1 through 4
cv.error<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  cv.error[i]<-cv.glm(data, glm.fit)$delta[1]
}

cvDF<-data.frame(degree=1:4, cv.error)


ggplot(data=cvDF, aes(x=degree, y=cv.error))+
  geom_point()+
  geom_line()

The shape of the curve stays the same, but the error values differ: each seed generates a different random sample of x and y, so the LOOCV estimates differ from run to run.

2e) The degree-2 model had the smallest LOOCV error, which is expected since the original model is quadratic: the true mean function includes x^2 and nothing of higher degree.
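This can be read off programmatically from the cv.error vector computed in part c:

which.min(cv.error)  # index (degree) with the smallest LOOCV error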

2f) Yes, the significance of the coefficient estimates agrees with the CV results: the linear and quadratic terms come out statistically significant while the higher-degree terms do not, matching the conclusion that degree 2 fits best.
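One way to check is to inspect the coefficient p-values of the degree-4 fit (a sketch reusing the data frame from part c; glm.fit4 is my own name):

glm.fit4 <- glm(y ~ poly(x, 4), data = data)
summary(glm.fit4)  # expect the linear and quadratic terms to be significant, the cubic and quartic not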