The “BostonHousing” database contains hundreds of observations about houses in the city of Boston. Among the variables, we have the median value of houses and variables related to the region and the houses themselves. We will use a random forest model to try to estimate the value of each house based on the other variables.

Package installation

packages <- c('tidyverse','rpart','rpart.plot','gtools','Rmisc','scales','viridis','caret','AMR','randomForest','fastDummies','rattle','xgboost','ggpubr','reshape2','mlbench')

if(sum(as.numeric(!packages %in% installed.packages())) != 0){
  instalador <- packages[!packages %in% installed.packages()]
  for(i in 1:length(instalador)) {
    install.packages(instalador, dependencies = T)
    break()}
  sapply(packages, require, character = T) 
} else {
  sapply(packages, require, character = T) }

Loading the dataset

data(BostonHousing)

Viewing the first rows of the dataset

head(BostonHousing)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Data Dictionary:

CRIM: Per capita crime rate by town;

ZN: Proportion of residential land zoned for large lots (over 25,000 sq. ft. or about 2,322 square meters);

INDUS: Proportion of non-retail business acres per town;

CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise);

NOX: Nitrogen oxide concentration (parts per 10 million);

RM: Average number of rooms per dwelling;

AGE: Proportion of owner-occupied units built before 1940;

DIS: Weighted distance to five Boston employment centers;

RAD: Index of accessibility to radial highways;

TAX: Property tax rate (full value in dollars per $10,000);

PTRATIO: Pupil-teacher ratio by town;

B: 1000(Bk - 0.63)^2, where Bk is the proportion of residents of African American descent by town;

LSTAT: Percentage of lower-status population;

MEDV: Median value of owner-occupied homes in thousands of dollars.

Splitting training and test samples

n <- sample(1:2,
            size = nrow(BostonHousing),
            replace = TRUE,
            prob=c(0.8, 0.2))

train <- BostonHousing[n==1,]
test <- BostonHousing[n==2,]

Training the first Random Forest

forest <- randomForest::randomForest(
  medv ~ .,
  data = train,
  ntree = 500
)

predict(forest, train) %>% head
##        1        2        3        4        5        7 
## 25.72872 22.25164 34.81491 34.16414 35.13523 22.27903
predict(forest, test) %>% head
##        6       12       17       36       45       50 
## 27.10891 21.06751 21.01731 21.63991 22.12806 18.96417

Ealuating the model

# Training data
p_train <- predict(forest, train)

# Test data
p_test <- predict(forest, test)

# Evaluation data frame (Training)
eval_train <- data.frame(obs=train$medv,
                         pred=p_train)

# Evaluation data frame (Test)
eval_test <- data.frame(obs=test$medv,
                        pred=p_test)
head(eval_test)
##     obs     pred
## 6  28.7 27.10891
## 12 18.9 21.06751
## 17 23.1 21.01731
## 36 18.9 21.63991
## 45 21.2 22.12806
## 50 19.4 18.96417
# Evaluation function
evaluate <- function(pred, obs) {
  mse <- mean((pred - obs)^2)
  rmse <- sqrt(mse)
  mae <- mean(abs(pred - obs))
  r_squared <- 1 - (sum((obs - pred)^2) / sum((obs - mean(obs))^2))

  cat("MSE:", mse, "\n")
  cat("RMSE:", rmse, "\n")
  cat("MAE:", mae, "\n")
  cat("R-squared:", r_squared, "\n")
}

# Using the evaluation function
evaluate(p_train, train$medv)
## MSE: 2.285244 
## RMSE: 1.511702 
## MAE: 0.9791821 
## R-squared: 0.9731332
evaluate(p_test, test$medv)
## MSE: 12.63301 
## RMSE: 3.554295 
## MAE: 2.263197 
## R-squared: 0.8447405

After evaluating our model, let’s try to improve the accuracy of our random forest through cross-validation with grid-search.

Cross-validation is a technique used to estimate the quality of a model and its generalization ability. It involves splitting the data into training and test sets, where the model is trained on the training data and evaluated on the test data. This validation is then repeated n times with variations in the training and test sets.

On the other hand, grid-search is a technique used to tune the hyperparameters of a machine learning model. Hyperparameters are values that are not learned by the model but affect its performance, such as the number of trees in a forest.

Grid-search e Cross-validation

# Cross-validation
control <- caret::trainControl(
  method = 'repeatedcv', 
  number = 4,
  repeats = 2,
  search = 'grid',
  summaryFunction = defaultSummary, 
  classProbs = FALSE )
  
grid <- base::expand.grid(.mtry = 1:10)
# Modeling the forest with cross-validation
forest_grid <- caret::train(medv ~ .,  
                                data = train,
                                method = 'rf', 
                                metric = 'RMSE', # Choose the best model based on RMSE
                                trControl = control,
                                ntree = 500,
                                tuneGrid = grid)
  
print(forest_grid)
## Random Forest 
## 
## 418 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 2 times) 
## Summary of sample sizes: 315, 313, 314, 312, 314, 312, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    1    4.533748  0.7997436  3.039232
##    2    3.768986  0.8492966  2.498007
##    3    3.591269  0.8568176  2.362850
##    4    3.520769  0.8585640  2.311007
##    5    3.520879  0.8566178  2.291031
##    6    3.527130  0.8540260  2.288432
##    7    3.547068  0.8508113  2.292945
##    8    3.571300  0.8480804  2.298113
##    9    3.623769  0.8422440  2.324433
##   10    3.632956  0.8415436  2.329627
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
plot(forest_grid)

Evaluating the improved model

# Predictions with tuned model
p_train_grid <- predict(forest_grid, train)
p_test_grid <- predict(forest_grid, test)

# Evaluate the tuned model on training and test data
evaluate(p_train_grid, train$medv)
## MSE: 2.236085 
## RMSE: 1.495355 
## MAE: 0.9657139 
## R-squared: 0.9737111
evaluate(p_test_grid, test$medv)
## MSE: 12.3971 
## RMSE: 3.520951 
## MAE: 2.29935 
## R-squared: 0.8476399
# Evaluate the original model on training and test data
evaluate(p_train, train$medv)
## MSE: 2.285244 
## RMSE: 1.511702 
## MAE: 0.9791821 
## R-squared: 0.9731332
evaluate(p_test, test$medv)
## MSE: 12.63301 
## RMSE: 3.554295 
## MAE: 2.263197 
## R-squared: 0.8447405
# Create data frames for evaluation (Tuned model)
eval_train_grid <- data.frame(obs=train$medv, 
                               pred=p_train_grid)
eval_test_grid <- data.frame(obs=test$medv, 
                             pred=p_test_grid)

head(eval_test_grid)
##     obs     pred
## 6  28.7 27.02430
## 12 18.9 20.82097
## 17 23.1 20.89341
## 36 18.9 21.44280
## 45 21.2 21.99959
## 50 19.4 18.90027

Conclusion

We can conclude that the “improved” forest achieved through cross-validation with grid search has shown a slight improvement in predictive accuracy when compared to the initially created model. This improvement suggests that the hyperparameters adjusted through grid search contributed to enhancing the model’s performance.

It’s essential to emphasize that the random factor involved in selecting data for each tree in a random forest prevents the results from being consistently the same. Therefore, even if we repeat the entire algorithm described in this project, it’s likely that we will obtain slightly different metrics for the resulting models.

As a result, it is considered good practice to run the algorithm multiple times to gain a more realistic understanding of its predictive capability.