Investigating model accuracy with different sizes of training and test sets

Tom Withey
30/11/19

Summary

R includes the faithful data set, which records the eruption time and the waiting time between eruptions for the Old Faithful geyser.

I have created a Shiny app that uses the Old Faithful data. The app:

  • Lets the user specify the proportion of data used for training a model
  • Builds a linear model on the training data
  • Plots the training data and the linear model
  • Plots the remaining (test) data and the same linear model
  • Reports the linear model parameters
  • Reports the root mean squared error on the training set and the test set

The proportion of data used for training is set with an interactive slider, and all plots and outputs are updated when the 'submit' button is pressed.
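The slider-and-submit arrangement described above could be wired up roughly as follows. This is only a minimal sketch, not the app's actual code: the input id train_prop matches the input$train_prop used in the server code, but the labels, the slider range, the n_train output and the use of submitButton are assumptions made for illustration.

```r
library(shiny)

# Hypothetical UI: only the input id "train_prop" comes from the app itself;
# the labels, slider range and the "n_train" output id are assumptions.
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      sliderInput("train_prop", "Proportion of data used for training",
                  min = 0.1, max = 0.9, value = 0.5, step = 0.05),
      submitButton("Submit")  # inputs are only sent to the server when pressed
    ),
    mainPanel(textOutput("n_train"))
  )
)

server <- function(input, output) {
  # Reactive expression: re-evaluated when the submitted slider value changes
  n_train <- reactive({
    round(input$train_prop * nrow(faithful))
  })
  output$n_train <- renderText({
    paste("Approximate number of training rows:", n_train())
  })
}

# shinyApp(ui = ui, server = server)  # uncomment to launch the app
```

In the real app the reactive expression would hold the data split and the fitted model rather than a row count.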

Split the data and build a model

Based on the slider input value, the following code (which is made reactive in the server code) creates the training set (the rows left out form the test set), fits a linear model and calculates the model's predictions for the training data.

library(caret)
data(faithful)
train_prop <- 0.5 # fixed here; in the server code this is input$train_prop, set by the slider
set.seed(333)
inTrain <- createDataPartition(y=faithful$waiting,p=train_prop,list=FALSE)
trainFaith <- faithful[inTrain,]
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
trainFaith$preds <- predict(lm1)
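The test set is made up of the rows that were not selected for training. A minimal sketch of how it might be built and scored with the same model is shown below; the name testFaith is an assumption, chosen to mirror trainFaith.

```r
library(caret)
data(faithful)

train_prop <- 0.5  # stands in for input$train_prop
set.seed(333)
inTrain <- createDataPartition(y = faithful$waiting, p = train_prop, list = FALSE)

trainFaith <- faithful[inTrain, ]
testFaith  <- faithful[-inTrain, ]  # assumed name: rows not used for training

# Fit on the training rows only
lm1 <- lm(eruptions ~ waiting, data = trainFaith)

# Apply the fitted model to the unseen test rows
testFaith$preds <- predict(lm1, newdata = testFaith)
```

Passing newdata to predict() is what makes the model score rows it was not fitted on; without it, predict(lm1) returns fitted values for the training rows.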

Plot the data

The following code plots the training data; similar code plots the test data.

library(plotly)
p1 <- plot_ly(trainFaith,x=~waiting) %>%
  add_trace(y=~eruptions, type = 'scatter',mode='markers',name="Training data") %>%
  add_trace(y=~preds,type = 'scatter',mode='lines',name="Fitted model",line=list(color='orange')) # a bare color='orange' is treated as a data value to map, not a literal colour
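The "similar code" for the test data might look like the sketch below. The names testFaith and p2 are assumptions, and the split is repeated so that the block runs on its own.

```r
library(caret)
library(plotly)
data(faithful)

set.seed(333)
inTrain <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
trainFaith <- faithful[inTrain, ]
testFaith  <- faithful[-inTrain, ]  # assumed name for the held-out rows

# Model fitted on the training rows, then applied to the test rows
lm1 <- lm(eruptions ~ waiting, data = trainFaith)
testFaith$preds <- predict(lm1, newdata = testFaith)

# Same layout as the training plot: points for the data, a line for the model
p2 <- plot_ly(testFaith, x = ~waiting) %>%
  add_trace(y = ~eruptions, type = 'scatter', mode = 'markers', name = "Test data") %>%
  add_trace(y = ~preds, type = 'scatter', mode = 'lines', name = "Fitted model",
            line = list(color = 'orange'))
```

Because the line shown is the model fitted on the training data, any systematic gap between it and the test points is visible directly in this plot.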

Model output

Finally, the following code reports the linear model parameters and the root mean squared error on the training set; similar code reports the error on the test set.

modvals <- lm1$coefficients
paste0("Intercept: ", format(round(modvals[1],2),nsmall=2), " Slope: ",format(round(modvals[2],4),nsmall=4))
[1] "Intercept: -1.65 Slope: 0.0722"
paste0("Eruption time = waiting time * ",format(round(modvals[2],4),nsmall=4)," ",format(round(modvals[1],2),nsmall=2))
[1] "Eruption time = waiting time * 0.0722 -1.65"
sqrt(mean((trainFaith$preds-trainFaith$eruptions)^2))
[1] 0.4904734
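To compare the training and test errors across several training proportions outside the app, something like the following sketch could be used. The helpers rmse and evaluate_split and the chosen proportions are hypothetical and are not part of the app's code.

```r
library(caret)
data(faithful)

# Root mean squared error between predictions and observed values
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Hypothetical helper: fit on a training split of proportion p, score both splits
evaluate_split <- function(p, seed = 333) {
  set.seed(seed)
  inTrain <- createDataPartition(y = faithful$waiting, p = p, list = FALSE)
  trainFaith <- faithful[inTrain, ]
  testFaith  <- faithful[-inTrain, ]
  lm1 <- lm(eruptions ~ waiting, data = trainFaith)
  c(train_prop = p,
    train_rmse = rmse(predict(lm1), trainFaith$eruptions),
    test_rmse  = rmse(predict(lm1, newdata = testFaith), testFaith$eruptions))
}

# One row per training proportion (the proportions are arbitrary examples)
results <- as.data.frame(t(sapply(c(0.2, 0.5, 0.8), evaluate_split)))
results
```

Each row of results corresponds to one slider setting, which is the comparison the app makes interactively.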