Train/test Split and Cross-Validation on the Boston Housing Dataset

Jose Vilardy

Overview

Overfitting is one of the biggest challenges in machine learning. An overfitted model has been trained "too well": it learns the noise present in the training data as if it were a reliable pattern. Overfitting harms the model's ability to perform well on unseen data, which is known as generalisation.

Two well known strategies to overcome the problem of overfitting are the train/validation split and cross-validation.

This vignette compares the two strategies on the housing dataset from the Kaggle competition "House Prices: Advanced Regression Techniques".

The purpose of this vignette is to evaluate the performance of the two strategies on the data set.

The data is first divided into a training set and a test set: 70% of the observations are randomly assigned to the training set and the remaining 30% to the test set.

The performance of the two strategies will be assessed on this new data using the RMSE (root mean squared error). (See Code 1)
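For reference, the RMSE is the square root of the average squared difference between the predicted and the observed values. The short sketch below illustrates the calculation on two small hypothetical vectors; the rmse() function from the Metrics package gives the same result.

observed  <- c(208500, 181500, 223500)    # hypothetical observed sale prices
predicted <- c(200000, 185000, 230000)    # hypothetical model predictions
sqrt(mean((predicted - observed)^2))      # RMSE computed by hand
Metrics::rmse(observed, predicted)        # same value via the Metrics package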

Code 1

library(readxl)
library(dplyr)
library(caret)
library(ggplot2)
library(lattice)
library(rpart)
library(Metrics)
library(glmnet)
library(Matrix)
library(kableExtra)
data <- read.csv("train.csv")     # Kaggle House Prices training data
data <- data %>% select(-Id)      # drop the identifier column

set.seed(42)
# 70% of the rows form the training data; the remaining 30% are held out as the test set
partition <- createDataPartition(y = data$SalePrice, p = 0.7, list = FALSE)
trainingdata <- data[partition, ]
test <- data[-partition, ]

Train/Validation Split Approach

For this methodology, the training data is further split into a training set (80%) and a validation set (20%).

The training set contains the known outcome (SalePrice), and the model learns from it so that it can generalise to other data. The fitted model is then used to predict values for the 20% of observations held out as the validation set. By calculating the RMSE on these predictions, the prediction accuracy of the model can be estimated. (See Code 2)

Code 2

set.seed(42)
# 80% of the training data is used to fit the model; the remaining 20% is held out for validation
partitiontraining <- createDataPartition(y = trainingdata$SalePrice, p = 0.8, list = FALSE)
training <- trainingdata[partitiontraining, ]
validation <- trainingdata[-partitiontraining, ]

After splitting the data into training and validation sets, it is necessary to identify the variables with the greatest predictive power. One option is to review the p-values of a full regression model, but with 79 predictor variables this would take considerable time to do manually. Instead, lasso regularization is applied to identify the variables that matter most in the model. (See Code 3)

Code 3

# Build the predictor matrix (dropping the intercept column) and the response vector
x <- model.matrix(SalePrice ~ ., trainingdata)[, -1]
y <- trainingdata$SalePrice
# Cross-validated lasso (alpha = 1) to choose the penalty
cv.out <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")
lambda_min <- cv.out$lambda.min    # lambda with the lowest cross-validated error
lambda_1se <- cv.out$lambda.1se    # largest lambda within one standard error of the minimum
# Keep only the variables with non-zero coefficients at lambda.1se
coef <- as.matrix(coef(cv.out, s = lambda_1se))
coef2 <- as.data.frame(as.table(coef))
coef2 <- coef2 %>% select(-Var2) %>% filter(Freq != 0) %>% rename(Variable = Var1, Coefficients = Freq)
Variable              Coefficients
(Intercept)          -1.518042e+05
NeighborhoodNridgHt   7.461987e+01
OverallQual           2.222257e+04
YearBuilt             3.178752e+01
YearRemodAdd          2.142772e+01
MasVnrArea            8.365345e+00
ExterQualTA          -2.791157e+03
BsmtFinType1GLQ       1.690751e+03
BsmtFinSF1            4.027561e+00
TotalBsmtSF           1.138829e+01
X1stFlrSF             5.953603e+00
GrLivArea             3.703252e+01
KitchenQualTA        -1.074059e+03
GarageCars            9.060121e+03

The output shows that the lasso has shrunk the coefficients of most variables to exactly zero, leaving non-zero coefficients only for the variables with the greatest predictive power.
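As a quick check, the non-zero coefficients can be counted directly from the fitted cv.glmnet object; a minimal sketch reusing the objects created in Code 3 (the names coef_1se and selected are introduced here for illustration):

coef_1se <- coef(cv.out, s = lambda_1se)              # sparse coefficient matrix at lambda.1se
selected <- rownames(coef_1se)[coef_1se[, 1] != 0]    # variables the lasso kept
length(selected)                                      # number of non-zero coefficients (including the intercept)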

Using these selected variables, a linear regression model is fitted on the training set; this model is then used to predict the sale price on the validation data set. (See Code 4)

Code 4

# Linear regression on the training set using the variables selected by the lasso
model <- lm(SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + ExterQual + BsmtFinType1
            + BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + GrLivArea + KitchenQual + GarageCars, data = training)

The RMSE on the validation set is calculated to assess the accuracy of the model. The same model is then applied to the test data (the new data) to obtain a second RMSE. (See Code 5)

Code 5

p <- predict(model, validation)            # predictions on the validation set
error <- p - validation$SalePrice
RMSE_Model <- sqrt(mean(error^2))          # validation RMSE

ptest <- predict(model, test)              # predictions on the held-out test set (new data)
error1 <- ptest - test$SalePrice
RMSE_NewData <- sqrt(mean(error1^2))       # test RMSE
Method <- c("Train/Test Split")
ModelRMSE <- c(RMSE_Model)
RMSENewData <- c(RMSE_NewData)

table1 <- data.frame(Method, ModelRMSE, RMSENewData)

kable(table1) %>% kable_styling(c("striped", "bordered")) %>% column_spec(2:3, border_left = T)
Method             ModelRMSE   RMSENewData
Train/Test Split    32344.19      34616.48

These RMSE values will be compared with the results obtained using the cross-validation methodology.

K-fold Cross Validation

In k-fold cross-validation, the dataset is divided into k separate parts and the training process is repeated k times. Each time, one part is used as validation data and the remaining parts are used to train the model. The errors are then averaged to evaluate the model. Note that k-fold cross-validation increases the computational cost of training by roughly a factor of k.
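To make the mechanism concrete, the sketch below runs a manual 5-fold cross-validation with caret's createFolds(); it is purely illustrative and uses a simplified two-variable formula as a placeholder rather than the full model, which is fitted with train() in Code 6.

set.seed(42)
folds <- createFolds(trainingdata$SalePrice, k = 5)   # 5 disjoint hold-out folds
cv_rmse <- sapply(folds, function(idx) {
  # train on the other 4 folds, predict the held-out fold
  fit  <- lm(SalePrice ~ OverallQual + GrLivArea, data = trainingdata[-idx, ])
  pred <- predict(fit, trainingdata[idx, ])
  sqrt(mean((pred - trainingdata$SalePrice[idx])^2))  # RMSE on this fold
})
mean(cv_rmse)                                         # average error over the k folds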

In this approach, the entire training set (trainingdata) is used to fit the model, which then predicts the sale price on the test set. The number of folds k selected for cross-validation is 10. (See Code 6)

Code 6


# 10-fold cross-validation of the linear regression on the full training data
modelcv <- train(
  SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + ExterQual + BsmtFinType1
  + BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + GrLivArea + KitchenQual + GarageCars,
  data = trainingdata,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

RMSE_Modelcv <- modelcv$results$RMSE        # average RMSE across the 10 folds

pcv <- predict(modelcv, test)               # predictions on the held-out test set (new data)
errorcv <- pcv - test$SalePrice
RMSE_NewDatacv <- sqrt(mean(errorcv^2))     # test RMSE
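The row shown below can be assembled and styled in the same way as in Code 5; a minimal sketch (the name table2 is introduced here for illustration):

table2 <- data.frame(Method = "Cross Validation",
                     ModelRMSE = RMSE_Modelcv,
                     RMSENewData = RMSE_NewDatacv)
kable(table2) %>% kable_styling(c("striped", "bordered")) %>% column_spec(2:3, border_left = T)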
Method             ModelRMSE   RMSENewData
Cross Validation    37222.72      34175.83

Evaluation

The table below shows the results of the two approaches in terms of RMSE.

Method             ModelRMSE   RMSENewData
Train/Test Split    32344.19      34616.48
Cross Validation    37222.72      34175.83

Conclusion

The cross-validation strategy produces an RMSE on the new data that is lower than the model's average cross-validated RMSE, which suggests that its error estimate is conservative and that the model generalises well to new data. With the train/test split approach, by contrast, the RMSE on the validation set is noticeably lower than the RMSE on the test set (new data), so the single split gives an optimistic estimate of how the model will perform on unseen data.