Investigating model accuracy with different sizes of training and test sets

Tom Withey
30/11/19

Summary

R ships with the faithful data set, which records the eruption duration and the waiting time between eruptions (both in minutes) for the Old Faithful geyser.

I have created a Shiny app which uses the Old Faithful data to do the following:

  • Uses a slider to set the proportion of the data used for training, and fits a linear model to that training set
  • Plots the training data with the fitted model in one graph, and the test data with the same model in another
  • Reports the linear model parameters and the root mean squared error

The server and UI files for building the app may be found on my GitHub page: https://github.com/tw81/shiny_assignment

Split the data and build a model

Based on the slider input value, the following code (which is made reactive in the server code) creates the training and test sets, fits a linear model and calculates the values predicted by that model.

library(caret)
data(faithful)
set.seed(333)
train_prop <- 0.5 # in the server code this is input$train_prop, set by the slider
# row indices assigned to the training set
inTrain <- createDataPartition(y=faithful$waiting,p=train_prop,list=FALSE)
trainFaith <- faithful[inTrain,]
testFaith <- faithful[-inTrain,]
# fit on the training set only, then predict for both sets
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
trainFaith$preds <- predict(lm1)
testFaith$preds <- predict(lm1,newdata=testFaith)
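
The remark that this code is made reactive in the server can be illustrated with a minimal sketch. It is not taken from the repository: the helper name split_and_fit and the output id coefs are hypothetical, and the only thing assumed about the real app is that the slider is exposed as input$train_prop.

```r
library(shiny)
library(caret)
data(faithful)

# Hypothetical helper: split the data and fit the model for a given proportion
split_and_fit <- function(train_prop) {
  set.seed(333)  # fixed seed, so a given slider value always gives the same split
  inTrain <- createDataPartition(y = faithful$waiting, p = train_prop, list = FALSE)
  trainFaith <- faithful[inTrain, ]
  testFaith <- faithful[-inTrain, ]
  lm1 <- lm(eruptions ~ waiting, data = trainFaith)
  trainFaith$preds <- predict(lm1)                      # fitted values on the training set
  testFaith$preds <- predict(lm1, newdata = testFaith)  # predictions on unseen data
  list(train = trainFaith, test = testFaith, model = lm1)
}

# Sketch of a server function: the reactive re-runs whenever the slider moves
server <- function(input, output) {
  fit <- reactive(split_and_fit(input$train_prop))
  output$coefs <- renderPrint(coef(fit()$model))  # "coefs" is an assumed output id
}
```

Keeping the split and the fit in one plain function means the same result can feed both plots and the error report, and the function can be checked at the console without starting the app.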

Plot the data

The following code plots the training data together with the fitted model (similar code plots the test data).

library(plotly)
p1 <- plot_ly(trainFaith,x=~waiting) %>%
  add_trace(y=~eruptions, type = 'scatter',mode='markers',name="Training data") %>%
  # set the line colour through line=list(); a bare color='orange' is treated as a data value
  add_trace(y=~preds,type = 'scatter',mode='lines',name="Fitted model",line=list(color='orange'))
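
For the test data, the "similar code" might look like the sketch below. It is an assumption rather than the app's actual code: the names testFaith and p2 are chosen for illustration, and the split is repeated so that the block runs on its own. The model is still the one fitted on the training set; only the points being plotted change.

```r
library(caret)
library(plotly)
data(faithful)

# Recreate the split and the model so the example is self-contained
set.seed(333)
inTrain <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
trainFaith <- faithful[inTrain, ]
testFaith <- faithful[-inTrain, ]
lm1 <- lm(eruptions ~ waiting, data = trainFaith)

# Predict for the held-out rows; newdata is needed because they were not used in the fit
testFaith$preds <- predict(lm1, newdata = testFaith)
testFaith <- testFaith[order(testFaith$waiting), ]  # sort so the line is drawn left to right

p2 <- plot_ly(testFaith, x = ~waiting) %>%
  add_trace(y = ~eruptions, type = "scatter", mode = "markers", name = "Test data") %>%
  add_trace(y = ~preds, type = "scatter", mode = "lines", name = "Fitted model",
            line = list(color = "orange"))
```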

Model output

Finally, the following code reports the linear model parameters and the root mean squared error on the training set (similar code reports the error on the test set).

modvals <- lm1$coefficients
paste0("Eruption time = waiting time * ",format(round(modvals[2],4),nsmall=4)," ",format(round(modvals[1],2),nsmall=2))
[1] "Eruption time = waiting time * 0.0722 -1.65"
sqrt(mean((trainFaith$preds-trainFaith$eruptions)^2))
[1] 0.4904734
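
The test-set error can be computed in the same way. The sketch below is an illustration under stated assumptions: the helper rmse and the names train_rmse and test_rmse are hypothetical, and the split is repeated so that the block runs on its own. No output is shown for it here.

```r
library(caret)
data(faithful)

# Hypothetical helper: root mean squared error between predictions and observations
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

set.seed(333)
inTrain <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
trainFaith <- faithful[inTrain, ]
testFaith <- faithful[-inTrain, ]
lm1 <- lm(eruptions ~ waiting, data = trainFaith)

train_rmse <- rmse(predict(lm1), trainFaith$eruptions)
test_rmse <- rmse(predict(lm1, newdata = testFaith), testFaith$eruptions)
c(train = train_rmse, test = test_rmse)
```

Comparing the two values for different slider settings is the point of the app: a test error well above the training error would suggest the model does not generalise from the chosen training proportion.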