Cross Validation

I’d like to do cross validation in two ways. I am first going to do the traditional split, 66%. I’ll fit the model to the training data and then check out the results on the remaining 34% testing data.

sports = c('Swimming', 'Tennis', 'Rowing', 'Gymnastics', 'Golf', 'Athletics', 'Bobsleigh')

dataLessSports <- data[which(data$Sport %in% sports),]

trainingSamples <- createDataPartition(dataLessSports$ID,p=.66,list = FALSE)
trainData <- dataLessSports[trainingSamples,]
testData <- dataLessSports[-trainingSamples,]

model <- lm(Age ~ Sport + Height + Weight, data = trainData, na.action = na.omit)
summary(model)
## 
## Call:
## lm(formula = Age ~ Sport + Height + Weight, data = trainData, 
##     na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.444  -3.011  -0.482   2.459  35.015 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     17.832787   0.452436   39.415  < 2e-16 ***
## SportBobsleigh   2.179273   0.118045   18.461  < 2e-16 ***
## SportGolf        4.532871   0.510624    8.877  < 2e-16 ***
## SportGymnastics -2.693966   0.054536  -49.398  < 2e-16 ***
## SportRowing     -0.637091   0.066896   -9.524  < 2e-16 ***
## SportSwimming   -4.817304   0.047395 -101.641  < 2e-16 ***
## SportTennis     -0.378523   0.117937   -3.210  0.00133 ** 
## Height           0.021148   0.003226    6.556 5.56e-11 ***
## Weight           0.054242   0.002303   23.548  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.171 on 53750 degrees of freedom
##   (15732 observations deleted due to missingness)
## Multiple R-squared:  0.2541, Adjusted R-squared:  0.2539 
## F-statistic:  2288 on 8 and 53750 DF,  p-value: < 2.2e-16

I specifically picked a silly model where I first restricted to a handful of sports. Does anybody think we can predict the age of an athlete by their sport, height or weight? You should notice though all of the coefficients are statistically significant. Let’s check out how this does on the testing data.

predictions = predict(model,testData)

tes <- testData$Age[(!is.na(predictions)) & (!is.na(testData$Age))]
pre <- predictions[(!is.na(predictions)) & (!is.na(testData$Age))]

data.frame(R2 = R2(pre, tes ),
           RMSE = RMSE(pre, tes ))
##          R2     RMSE
## 1 0.2514814 4.180316

I had to do some fancy work there to make this work because of all the na values. Notice that I sliced both the prediction and the actual data cutting out the NA before I proceeded. I’ve left my NA in my data to show you that while most stuff will work, you sometimes have to do heroics to get around it! I’ll also point out that this model does not do that poorly on testing data so maybe there is some reason to expect the sport to correlate to the age of the athlete although I will note that there is not really that big of swing except maybe between swimmers and golfers. Athletics is the intercept so it does not have a coefficient.

Let’s repeat this but do the \(k\)-fold cross validation. I’ll use \(k=10\) and run the exact same model ten times. This time we’ll train on 90% of the data and test on a slice of 10%. Then we will reintroduce that 10% and hold back a different 10% of the data. We repeat this ten times until all slices have been held back as the training data.

dataNARemoved <- na.omit(dataLessSports)

This piece of code removed all the rows that had an NA values. Now I think I am golden to run the cross validation

train.control <- trainControl(method = "cv", number = 10)

model <- train(Age ~ Sport + Height + Weight, data = dataNARemoved, 
               method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 9859 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 8873, 8873, 8874, 8873, 8874, 8871, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.027149  0.2466658  3.133356
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.313  -2.865  -0.394   2.339  33.079 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     23.908406   1.045856  22.860  < 2e-16 ***
## SportBobsleigh   3.240269   0.258410  12.539  < 2e-16 ***
## SportGolf        7.418533   2.014670   3.682 0.000232 ***
## SportGymnastics -2.612577   0.153729 -16.995  < 2e-16 ***
## SportRowing      0.357669   0.114473   3.124 0.001786 ** 
## SportSwimming   -4.102334   0.106486 -38.525  < 2e-16 ***
## SportTennis      0.840088   0.313859   2.677 0.007449 ** 
## Height          -0.016450   0.007442  -2.210 0.027100 *  
## Weight           0.057515   0.005166  11.132  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.027 on 9850 degrees of freedom
## Multiple R-squared:  0.2472, Adjusted R-squared:  0.2466 
## F-statistic: 404.4 on 8 and 9850 DF,  p-value: < 2.2e-16

Okay I have done it! I doubt I am going to make a very acuarate prediction using this model but I will not that it is clear this model is not overfit. I am confident that the model is not overfit because the cross validation gave very similar metrics to the original model. There are no hard and fast rules here but if you see large swings, you have some evidence that the model is overfit (too dependent on a few values for the fit).

Let’s examine the plots.

reg <- lm(Age~Height, data = dataNARemoved)
plot(dataNARemoved$Height,dataNARemoved$Age)
abline(reg,col = 'Red')
abline(24,.0575) 

I struggled to figure out how to visualize this. Maybe you will have some better ideas. Clearly the one factor and the multi-factor regressions should not be compared like this. I also struggled to get the data directly out of the 10-fold cross validation so I copied the estimates by hand. This is what I get for creating too complicated of a model!

Keep it Simple Stupid

While there is nothing wrong with what I did above, I have made everything super complicated. Rather that delete the above, I thought I’d just add another section and repeat the exercise on a simpler model.

I am going to look at Height as a predictor of Weight We have seen this earlier as an excellent model.

HWfit <- lm(Weight ~ Height, data = data, na.action = na.omit)

with(data,plot(Height, Weight))
abline(HWfit)

#ggplot(data = data, aes(Height,Weight)) +
#  geom_smooth(method = 'lm') +
#  geom_point()
confint.lm(HWfit)
##                   2.5 %      97.5 %
## (Intercept) -119.743732 -118.498456
## Height         1.078797    1.085886

Okay so now I’ll repeat the withholding of a third of the data and do the fitting and make predictions on the testing data.

dataHW <- data[c('ID','Height','Weight')]
dataHW <-na.omit(dataHW)



trainingSamples <- createDataPartition(dataHW$ID,p=.66,list = FALSE)
trainData <- dataHW[trainingSamples,]
testData <- dataHW[-trainingSamples,]

model <- lm(Weight ~ Height, data = trainData)
summary(model)
## 
## Call:
## lm(formula = Weight ~ Height, data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.935  -5.280  -0.853   3.899 135.065 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.192e+02  3.925e-01  -303.8   <2e-16 ***
## Height       1.083e+00  2.234e-03   484.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.692 on 136522 degrees of freedom
## Multiple R-squared:  0.6325, Adjusted R-squared:  0.6325 
## F-statistic: 2.349e+05 on 1 and 136522 DF,  p-value: < 2.2e-16

I am going to add to the data frame a variable that ask whether the data was in the testing or the training data. I’ll use that to graph the linear regression

dataHW[trainingSamples,'Test_Train']='training'
dataHW[-trainingSamples,'Test_Train']='testing'

ggplot(data = dataHW, mapping = aes(Height,Weight,color = Test_Train))+
  geom_jitter() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

Here I’ll add the confidence intervals from the model.

confint.lm(model)
##                   2.5 %      97.5 %
## (Intercept) -119.981525 -118.443119
## Height         1.078396    1.087153

These confidence intervals are also super close. Everything suggests that my model is not overfit.

Cool, now I’ll make some predictions on the testing data

pre = predict(model,testData)

tes <- testData$Weight

data.frame(R2 = R2(pre, tes ),
           RMSE = RMSE(pre, tes ))
##          R2     RMSE
## 1 0.6368357 8.626697

Not a giant jump! Let’s check out the prediction on a famous Olympian, Micheal Phelps

data[which(data$Name == 'Michael Fred Phelps, II'),]
##           ID                    Name Sex Age Height Weight          Team NOC
## 187888 94406 Michael Fred Phelps, II   M  15    193     91 United States USA
## 187889 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187890 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187891 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187892 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187893 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187894 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187895 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187896 94406 Michael Fred Phelps, II   M  19    193     91 United States USA
## 187897 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187898 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187899 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187900 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187901 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187902 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187903 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187904 94406 Michael Fred Phelps, II   M  23    193     91 United States USA
## 187905 94406 Michael Fred Phelps, II   M  27    193     91 United States USA
## 187906 94406 Michael Fred Phelps, II   M  27    193     91 United States USA
## 187907 94406 Michael Fred Phelps, II   M  27    193     91 United States USA
## 187908 94406 Michael Fred Phelps, II   M  27    193     91 United States USA
## 187909 94406 Michael Fred Phelps, II   M  27    193     91 United States USA
## 187910 94406 Michael Fred Phelps, II   M  27    193     91 United States USA
## 187911 94406 Michael Fred Phelps, II   M  27    193     91 United States USA
## 187912 94406 Michael Fred Phelps, II   M  31    193     91 United States USA
## 187913 94406 Michael Fred Phelps, II   M  31    193     91 United States USA
## 187914 94406 Michael Fred Phelps, II   M  31    193     91 United States USA
## 187915 94406 Michael Fred Phelps, II   M  31    193     91 United States USA
## 187916 94406 Michael Fred Phelps, II   M  31    193     91 United States USA
## 187917 94406 Michael Fred Phelps, II   M  31    193     91 United States USA
##              Games Year Season           City    Sport
## 187888 2000 Summer 2000 Summer         Sydney Swimming
## 187889 2004 Summer 2004 Summer         Athina Swimming
## 187890 2004 Summer 2004 Summer         Athina Swimming
## 187891 2004 Summer 2004 Summer         Athina Swimming
## 187892 2004 Summer 2004 Summer         Athina Swimming
## 187893 2004 Summer 2004 Summer         Athina Swimming
## 187894 2004 Summer 2004 Summer         Athina Swimming
## 187895 2004 Summer 2004 Summer         Athina Swimming
## 187896 2004 Summer 2004 Summer         Athina Swimming
## 187897 2008 Summer 2008 Summer        Beijing Swimming
## 187898 2008 Summer 2008 Summer        Beijing Swimming
## 187899 2008 Summer 2008 Summer        Beijing Swimming
## 187900 2008 Summer 2008 Summer        Beijing Swimming
## 187901 2008 Summer 2008 Summer        Beijing Swimming
## 187902 2008 Summer 2008 Summer        Beijing Swimming
## 187903 2008 Summer 2008 Summer        Beijing Swimming
## 187904 2008 Summer 2008 Summer        Beijing Swimming
## 187905 2012 Summer 2012 Summer         London Swimming
## 187906 2012 Summer 2012 Summer         London Swimming
## 187907 2012 Summer 2012 Summer         London Swimming
## 187908 2012 Summer 2012 Summer         London Swimming
## 187909 2012 Summer 2012 Summer         London Swimming
## 187910 2012 Summer 2012 Summer         London Swimming
## 187911 2012 Summer 2012 Summer         London Swimming
## 187912 2016 Summer 2016 Summer Rio de Janeiro Swimming
## 187913 2016 Summer 2016 Summer Rio de Janeiro Swimming
## 187914 2016 Summer 2016 Summer Rio de Janeiro Swimming
## 187915 2016 Summer 2016 Summer Rio de Janeiro Swimming
## 187916 2016 Summer 2016 Summer Rio de Janeiro Swimming
## 187917 2016 Summer 2016 Summer Rio de Janeiro Swimming
##                                                Event  Medal
## 187888           Swimming Men's 200 metres Butterfly   <NA>
## 187889           Swimming Men's 200 metres Freestyle Bronze
## 187890 Swimming Men's 4 x 100 metres Freestyle Relay Bronze
## 187891 Swimming Men's 4 x 200 metres Freestyle Relay   Gold
## 187892           Swimming Men's 100 metres Butterfly   Gold
## 187893           Swimming Men's 200 metres Butterfly   Gold
## 187894   Swimming Men's 200 metres Individual Medley   Gold
## 187895   Swimming Men's 400 metres Individual Medley   Gold
## 187896    Swimming Men's 4 x 100 metres Medley Relay   Gold
## 187897           Swimming Men's 200 metres Freestyle   Gold
## 187898 Swimming Men's 4 x 100 metres Freestyle Relay   Gold
## 187899 Swimming Men's 4 x 200 metres Freestyle Relay   Gold
## 187900           Swimming Men's 100 metres Butterfly   Gold
## 187901           Swimming Men's 200 metres Butterfly   Gold
## 187902   Swimming Men's 200 metres Individual Medley   Gold
## 187903   Swimming Men's 400 metres Individual Medley   Gold
## 187904    Swimming Men's 4 x 100 metres Medley Relay   Gold
## 187905 Swimming Men's 4 x 100 metres Freestyle Relay Silver
## 187906 Swimming Men's 4 x 200 metres Freestyle Relay   Gold
## 187907           Swimming Men's 100 metres Butterfly   Gold
## 187908           Swimming Men's 200 metres Butterfly Silver
## 187909   Swimming Men's 200 metres Individual Medley   Gold
## 187910   Swimming Men's 400 metres Individual Medley   <NA>
## 187911    Swimming Men's 4 x 100 metres Medley Relay   Gold
## 187912 Swimming Men's 4 x 100 metres Freestyle Relay   Gold
## 187913 Swimming Men's 4 x 200 metres Freestyle Relay   Gold
## 187914           Swimming Men's 100 metres Butterfly Silver
## 187915           Swimming Men's 200 metres Butterfly   Gold
## 187916   Swimming Men's 200 metres Individual Medley   Gold
## 187917    Swimming Men's 4 x 100 metres Medley Relay   Gold
mpPrediction <- predict(model,data[which(data$Name == 'Michael Fred Phelps, II'),])

We see that we under predict his weight.

mpPrediction[1]-91
##    187888 
## -1.236811

Okay I am going to leave it there. The linear model clearly works. Again when we run cross validation, we are really asking to check if the model is overfit. We should not see that in a robust linear model but it may be possible if you have a few outliers that really mess things up! Recall we are supposed to look for outliers before we fit a regression by looking at the QQ-Plots.