Using the stats and boot libraries in R, perform a cross-validation experiment to observe the bias-variance tradeoff. You'll use the auto-mpg data set from previous assignments, which has 392 observations across 5 variables. We want to fit polynomial models of various degrees using the glm function in R and then measure the cross-validation error with the cv.glm function.

library(ggplot2)
library(stats)
library(boot)


# Read the whitespace-delimited auto-mpg file into one numeric vector
datapath <- "C:/CUNY/Courses/IS605/Assignments/Assignment12/auto-mpg.DATA"
autompg_data <- scan(datapath)

#str(autompg_data)

# The file stores 5 values per observation, so fill a 5-row matrix
# column-wise and transpose it to get one observation per row
autompg_data_mtrx <- t(matrix(autompg_data, nrow = 5))
#head(autompg_data_mtrx)



auto.data.df <- data.frame( displacement = autompg_data_mtrx[,1],
                            horsepower = autompg_data_mtrx[,2],
                            weight = autompg_data_mtrx[,3],
                            acceleration = autompg_data_mtrx[,4],
                            mpg = autompg_data_mtrx[,5]
                )
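
As a quick sanity check (not part of the original script), we can confirm the reshaped data frame has the 392 x 5 shape described above:

# Stops with an error if the reshape did not produce 392 rows and 5 columns
stopifnot(nrow(auto.data.df) == 392, ncol(auto.data.df) == 5)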





cv.err.est <- numeric(9)
cv.bias <- numeric(9)

for (i in 1:9) {

  # Fit a degree-i polynomial in the combined predictor
  # displacement + horsepower + weight + acceleration
  glm.fit <- glm(mpg ~ poly(displacement + horsepower +
                            weight + acceleration, i),
                 data = auto.data.df)

  # Run 5-fold CV once per degree so both components of delta
  # come from the same random fold assignment
  cv.res <- cv.glm(auto.data.df, glm.fit, K = 5)

  # The first component of delta is the raw cross-validation estimate
  # of prediction error
  cv.err.est[i] <- cv.res$delta[1]

  # The second component is the adjusted cross-validation estimate;
  # the adjustment compensates for the bias introduced by using K-fold
  # rather than leave-one-out cross-validation
  cv.bias[i] <- cv.res$delta[2]

}
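
For intuition, this is essentially what cv.glm does for each model: assign observations to K random folds, hold each fold out in turn, fit on the remaining data, and average the held-out mean squared errors. The helper manual_cv below is an illustrative sketch, not part of the original script; cv.glm's raw estimate also weights folds by their size, so its numbers will differ slightly.

# Illustrative 5-fold CV by hand (manual_cv is a hypothetical helper)
manual_cv <- function(df, degree, K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(df)))   # random fold labels
  errs <- sapply(1:K, function(k) {
    fit <- glm(mpg ~ poly(displacement + horsepower +
                          weight + acceleration, degree),
               data = df[folds != k, ])              # fit on the other K-1 folds
    pred <- predict(fit, newdata = df[folds == k, ]) # predict the held-out fold
    mean((df$mpg[folds == k] - pred)^2)              # held-out MSE for fold k
  })
  mean(errs)
}

manual_cv(auto.data.df, degree = 2)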

glm.fit 
## 
## Call:  glm(formula = mpg ~ poly(displacement + horsepower + weight + 
##     acceleration, i), data = auto.data.df)
## 
## Coefficients:
##                                                 (Intercept)  
##                                                     23.4459  
## poly(displacement + horsepower + weight + acceleration, i)1  
##                                                   -129.0809  
## poly(displacement + horsepower + weight + acceleration, i)2  
##                                                     24.5509  
## poly(displacement + horsepower + weight + acceleration, i)3  
##                                                     -0.6179  
## poly(displacement + horsepower + weight + acceleration, i)4  
##                                                     -2.3957  
## poly(displacement + horsepower + weight + acceleration, i)5  
##                                                      2.8573  
## poly(displacement + horsepower + weight + acceleration, i)6  
##                                                     -3.7582  
## poly(displacement + horsepower + weight + acceleration, i)7  
##                                                      6.9607  
## poly(displacement + horsepower + weight + acceleration, i)8  
##                                                     -1.3778  
## poly(displacement + horsepower + weight + acceleration, i)9  
##                                                     -2.6602  
## 
## Degrees of Freedom: 391 Total (i.e. Null);  382 Residual
## Null Deviance:       23820 
## Residual Deviance: 6469  AIC: 2233
# The raw cross-validation estimate of prediction error values
cv.err.est
## [1] 18.47932 16.83684 17.00961 17.00794 17.08485 16.90837 16.84573 17.17995
## [9] 18.05378
#The adjusted cross-validation estimate
cv.bias
## [1] 18.45266 16.93459 16.77402 16.82087 16.92188 17.11592 16.93129 17.15363
## [9] 17.35992
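For reference, the degree minimizing the raw CV estimate can be read off directly; in the run above it is degree 2.

which.min(cv.err.est)
## [1] 2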

Perform the polynomial fit and then plot the resulting 5-fold cross-validation curve. Your output should show the characteristic U-shape illustrating the tradeoff between bias and variance.
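
The plot itself is not reproduced in this write-up; a minimal sketch of the plotting code, using the ggplot2 library loaded above, might look like the following. The data frame cv.df and its column names are illustrative, not part of the original script.

# Combine both delta components into long format for plotting
cv.df <- data.frame(degree = rep(1:9, times = 2),
                    error  = c(cv.err.est, cv.bias),
                    type   = rep(c("raw", "adjusted"), each = 9))

# Draw both CV estimates against the polynomial degree
ggplot(cv.df, aes(x = degree, y = error, colour = type)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:9) +
  labs(x = "Polynomial degree",
       y = "5-fold CV estimate of prediction error",
       colour = "Estimate")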

The output shows the characteristic U-shape illustrating the tradeoff between bias and variance.
At low polynomial degrees the model underfits: it is too simple to capture the structure in the data, so it shows high bias but low variance, and the CV error is high. This is expected, since lower degrees produce simpler models.
As we increase the degree, the bias decreases while the variance stays roughly unchanged, so the CV error falls. As the degree keeps growing, however, overfitting sets in: the variance rises while the bias stays low. In other words, the model starts fitting the training data too closely and becomes excessively complicated, so the CV error climbs again.

Please note that it took me a few attempts to get the U-shape output; I ran the plot code at least four times. This is because cv.glm assigns observations to folds at random, so the error estimates vary from run to run. Calling set.seed() before the loop makes the results reproducible.