Illustration of overfitting with the mtcars dataset

Jorge Bretones Santamarina

4th of September 2019

Goals and motivation

The goals of this assignment are:

1. Build a Shiny application and host it on a server.

2. Create a presentation in R Presenter or Slidify to pitch the application.

The topic chosen is overfitting, a phenomenon that occurs when we fit functions so complex that they capture some of the randomness inherent in the training data. An overfitted model has high variance and a larger test error, even though its training error is small.

To illustrate it we will employ the Motor Trend Car Road Tests dataset (“mtcars”).
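The gap between training error and test error can be estimated directly on this dataset. The sketch below is only an illustration, under stated assumptions: it uses leave-one-out cross-validation as a stand-in for test error (nothing in this presentation says the application does this), it fits mpg as a polynomial of disp with poly(), and the names loocv_rmse, train_rmse and cv_rmse are hypothetical helpers introduced here.

```r
# Leave-one-out cross-validation (LOOCV): refit the model 32 times, each time
# holding out one car, and measure the error on the car that was held out.
loocv_rmse <- function(degree, data = mtcars) {
  errors <- vapply(seq_len(nrow(data)), function(i) {
    fit <- lm(mpg ~ poly(disp, degree), data = data[-i, ])
    data$mpg[i] - predict(fit, newdata = data[i, , drop = FALSE])
  }, numeric(1))
  sqrt(mean(errors^2))
}

degrees <- 1:6

# Error on the same data used for fitting (training error).
train_rmse <- vapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(disp, d), data = mtcars)
  sqrt(mean(residuals(fit)^2))
}, numeric(1))

# Error on held-out observations (an estimate of test error).
cv_rmse <- vapply(degrees, loocv_rmse, numeric(1))

print(data.frame(degree = degrees, train_rmse = train_rmse, loocv_rmse = cv_rmse))
```

The training error can only go down as the degree grows, whereas the held-out error has no such guarantee. A widening gap between the two columns is the usual symptom of overfitting.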

The dataset: “mtcars”

The Motor Trend Car Road Tests dataset was extracted from the 1974 Motor Trend US magazine and it comprises 32 observations of 11 variables. One of them is fuel consumption and the other 10 relate to different aspects of automobile design and performance.

  • mpg Miles/(US) gallon

  • cyl Number of cylinders

  • disp Displacement (cu.in.)

  • hp Gross horsepower

  • drat Rear axle ratio

  • wt Weight (1000 lbs)

  • qsec 1/4 mile time

  • vs Engine (0 = V-shaped, 1 = straight)

  • am Transmission (0 = automatic, 1 = manual)

  • gear Number of forward gears

  • carb Number of carburetors
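The description above can be checked directly in R. This is a minimal sketch that assumes only base R, where mtcars ships with the built-in datasets package.

```r
# mtcars ships with base R (datasets package), so no download is needed.
data(mtcars)

# Dimensions: 32 observations (rows) of 11 variables (columns).
dim(mtcars)

# Variable names and types.
str(mtcars)

# Summary of fuel consumption (mpg) and engine displacement (disp).
summary(mtcars[, c("mpg", "disp")])
```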

Relationship under study: mpg vs disp

In our application we explore the relationship between fuel consumption (in miles per US gallon) and engine displacement (in cubic inches). The plot below suggests that this relationship is non-linear.

library(ggplot2)
# Scatter plot of fuel consumption against engine displacement
g <- ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point() +
    theme_bw() +
    labs(x = "Engine displacement (cubic inches)",
         y = "Fuel consumption (miles/gallon)",
         title = "Fuel consumption vs engine displacement") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"),
          axis.title.x = element_text(face = "bold"),
          axis.title.y = element_text(face = "bold"))
print(g)

Application documentation

The application is hosted at the following server:

https://jorgebs94.shinyapps.io/Shiny-App-Week-4-Project/

The whole documentation and the associated files can be found at this link: https://github.com/JorgeBS94/Shiny-App-Week-4-Project.git

To use the application, the user clicks on the desired model (linear, quadratic or polynomial of degree 6). The application then displays the fitted values and the residual standard error (RSE) of the resulting model. As the complexity of the fitted function increases, the RSE decreases. However, the degree-6 polynomial is clearly overfitting the training data, which is a high-variance problem.
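The three fits offered by the application can be approximated with lm(). This is a sketch under assumptions: the presentation does not show the application's code, so the use of poly() for the polynomial terms and the names fits and comparison are illustrative choices, not necessarily what the application does.

```r
# Fit the three model families named above: mpg as a function of disp.
fits <- list(
  linear      = lm(mpg ~ disp, data = mtcars),
  quadratic   = lm(mpg ~ poly(disp, 2), data = mtcars),
  polynomial6 = lm(mpg ~ poly(disp, 6), data = mtcars)
)

# Compare the fits on the training data:
#   rse = residual standard error, as reported by summary(lm)
#   rss = residual sum of squares
comparison <- data.frame(
  model = names(fits),
  rse   = vapply(fits, sigma, numeric(1)),
  rss   = vapply(fits, function(f) sum(residuals(f)^2), numeric(1)),
  row.names = NULL
)
print(comparison)
```

Because the three models are nested, the residual sum of squares on the training data cannot increase with the degree. This is why a low training error alone says little about how well a model generalizes.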