The goal of this project is to predict how physical and environmental factors impact housing values in Boston using various machine learning algorithms. The Boston Housing dataset includes 13 qualitative and quantitative features that can be used to predict the median housing value (feature name: medv). I explore the impact of these factors by engineering features from the dataset and comparing the performance (RMSE) of LASSO regression, Random Forest, and ensemble models.
Censoring in MEDV: the median value of owner-occupied homes (MEDV) appears to be capped at $50,000. This is suggested by 16 cases with a value of exactly $50,000, compared with only 15 cases between $40,000 and $50,000.
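A quick way to verify this cap once the CSV is available (a minimal sketch; medv_check is just a throwaway helper and assumes the same data file that is loaded later in this report):
medv_check <- read.csv("ML-Capstone-Housing-Data-BostonHousing.csv")
sum(medv_check$medv == 50)                         # homes recorded at exactly $50k (the apparent cap)
sum(medv_check$medv >= 40 & medv_check$medv < 50)  # homes between $40k and $50k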
This dataset is a classic benchmark for regression algorithms and is widely used in housing economics and environmental studies.
Goal: to use regression methods to predict the median value of owner-occupied homes (MEDV) from social, economic, and environmental features.
Key steps: load and split the data, explore the features, fit a baseline LASSO model, engineer transformed features for a second LASSO model, tune a Random Forest model, combine the models in an ensemble, and compare performance with RMSE and R-squared.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(glmnet)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.3
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.92 loaded
housing_raw <- read.csv("ML-Capstone-Housing-Data-BostonHousing.csv")
housing_raw$index <- seq_along(housing_raw[,1])
set.seed(2023) # for reproducibility
train_index <- createDataPartition(housing_raw$index, times = 1, p = 0.8, list = FALSE)
train_data <- housing_raw[train_index, ]
test_data <- housing_raw[-train_index, ]
Overview of the data set
summary(train_data)
## crim zn indus chas
## Min. : 0.00906 Min. : 0.00 Min. : 0.460 Min. :0.00000
## 1st Qu.: 0.07910 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000
## Median : 0.22948 Median : 0.00 Median : 9.125 Median :0.00000
## Mean : 3.62858 Mean : 12.02 Mean :10.915 Mean :0.06158
## 3rd Qu.: 3.64742 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.740 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.137
## 1st Qu.:0.4480 1st Qu.:5.887 1st Qu.: 42.65 1st Qu.: 2.122
## Median :0.5320 Median :6.210 Median : 75.50 Median : 3.346
## Mean :0.5511 Mean :6.294 Mean : 67.73 Mean : 3.883
## 3rd Qu.:0.6240 3rd Qu.:6.629 3rd Qu.: 93.38 3rd Qu.: 5.372
## Max. :0.8710 Max. :8.725 Max. :100.00 Max. :12.127
## rad tax ptratio b
## Min. : 1.00 Min. :187.0 Min. :12.60 Min. : 2.52
## 1st Qu.: 4.00 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:374.83
## Median : 5.00 Median :330.0 Median :19.00 Median :391.31
## Mean : 9.51 Mean :406.8 Mean :18.47 Mean :356.63
## 3rd Qu.:24.00 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.21
## Max. :24.00 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv index
## Min. : 1.730 Min. : 5.00 Min. : 2.0
## 1st Qu.: 6.905 1st Qu.:16.73 1st Qu.:127.8
## Median :11.250 Median :21.20 Median :253.5
## Mean :12.524 Mean :22.51 Mean :254.7
## 3rd Qu.:16.703 3rd Qu.:24.95 3rd Qu.:379.8
## Max. :37.970 Max. :50.00 Max. :506.0
colnames(train_data)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "b" "lstat" "medv"
## [15] "index"
nrow(train_data)
## [1] 406
plotting distribution of median values of houses
ggplot(train_data, aes(x = medv)) +
geom_histogram(binwidth = 0.5, color = "black", fill = "grey") +
labs(title = "Histogram of Median House Values", x = "Median House Values", y = "Frequency") +
theme_minimal()
plotting scatter plot matrix for all features to view relationships
options(repr.plot.width = 60, repr.plot.height = 60)
train_data |>
#select(-c('chas', 'b')) |>
ggpairs(
mapping = ggplot2::aes(size = 0.05),
lower = list(continuous = wrap("points", alpha = 0.3, size=0.01, color = 'blue')),
diag = list(continuous = wrap("barDiag", alpha = 0.3)),
upper = list(continuous = wrap("cor", size = 1.9)),
font.label = list(size = 0.1, face = "bold"),
ggtheme = theme_minimal()
)+
theme(
axis.text.x = element_text(angle = 90, hjust = 1, size = 3), # Rotate x-axis labels by 90 degrees and reduce font size
axis.text.y = element_text(size = 3) # Reduce y-axis label font size
)
plotting correlation matrix for all features
# correlation matrix plotting
cor_plot_matrix <-
cor(train_data[,c("crim","zn","indus", "chas", "nox","rm","age","dis","rad","tax",
"ptratio","lstat","medv")])
corrplot(cor_plot_matrix, method = 'color', type='upper',
order = 'hclust',tl.col = "black", tl.srt = 45)
#view the correlation values
cor_plot_matrix
## crim zn indus chas nox rm
## crim 1.00000000 -0.19728625 0.41272516 -0.05346437 0.4089370 -0.21640561
## zn -0.19728625 1.00000000 -0.54280468 -0.02980824 -0.5209844 0.31421177
## indus 0.41272516 -0.54280468 1.00000000 0.06768869 0.7846162 -0.41913978
## chas -0.05346437 -0.02980824 0.06768869 1.00000000 0.1203778 0.07120022
## nox 0.40893704 -0.52098440 0.78461618 0.12037778 1.0000000 -0.32988672
## rm -0.21640561 0.31421177 -0.41913978 0.07120022 -0.3298867 1.00000000
## age 0.35537384 -0.56959076 0.66514395 0.07703603 0.7339173 -0.25394150
## dis -0.37536960 0.66336359 -0.71695002 -0.09331896 -0.7678040 0.20891865
## rad 0.60679259 -0.31827343 0.62394237 -0.02567697 0.6060486 -0.23872869
## tax 0.56984326 -0.31861785 0.73735481 -0.04594003 0.6676820 -0.32894481
## ptratio 0.28158909 -0.38868901 0.35117883 -0.14927301 0.1856434 -0.37347459
## lstat 0.44976413 -0.41667183 0.63303810 -0.05117013 0.6140633 -0.62231820
## medv -0.38997005 0.37291770 -0.52103445 0.13155823 -0.4516859 0.72998473
## age dis rad tax ptratio lstat
## crim 0.35537384 -0.37536960 0.60679259 0.56984326 0.2815891 0.44976413
## zn -0.56959076 0.66336359 -0.31827343 -0.31861785 -0.3886890 -0.41667183
## indus 0.66514395 -0.71695002 0.62394237 0.73735481 0.3511788 0.63303810
## chas 0.07703603 -0.09331896 -0.02567697 -0.04594003 -0.1492730 -0.05117013
## nox 0.73391733 -0.76780403 0.60604865 0.66768196 0.1856434 0.61406326
## rm -0.25394150 0.20891865 -0.23872869 -0.32894481 -0.3734746 -0.62231820
## age 1.00000000 -0.75112439 0.46884925 0.51613062 0.2590609 0.61375808
## dis -0.75112439 1.00000000 -0.49763701 -0.53667711 -0.2224566 -0.51093880
## rad 0.46884925 -0.49763701 1.00000000 0.91630699 0.4683878 0.50205415
## tax 0.51613062 -0.53667711 0.91630699 1.00000000 0.4475885 0.56650032
## ptratio 0.25906091 -0.22245659 0.46838783 0.44758846 1.0000000 0.37529424
## lstat 0.61375808 -0.51093880 0.50205415 0.56650032 0.3752942 1.00000000
## medv -0.40482320 0.27327762 -0.41793510 -0.51025090 -0.5342755 -0.75353341
## medv
## crim -0.3899700
## zn 0.3729177
## indus -0.5210344
## chas 0.1315582
## nox -0.4516859
## rm 0.7299847
## age -0.4048232
## dis 0.2732776
## rad -0.4179351
## tax -0.5102509
## ptratio -0.5342755
## lstat -0.7535334
## medv 1.0000000
Here are the key insights from the scatter plot matrix and correlation matrix, which show the relationships among all of the selected features.
Notably, the features "rm" (average number of rooms per dwelling) and "lstat" (% lower status of the population) showed the strongest correlations with "medv" (median value of owner-occupied homes), at 0.72998473 and -0.75353341 respectively.
Intuitively, this result makes sense. Housing value depends on the size of the unit, and more rooms usually means a larger living area. A higher share of lower-status population can also indicate that housing in the area is more affordable, i.e. cheaper than comparable units elsewhere.
RMSE function
RMSE <- function(true, predict) {
differences <- true - predict
differences <- differences[!is.na(differences)]
sqrt(mean(differences^2))
}
R^2 function: the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model.
R2 <- function(true, predict){
#Total Sum of Squares (TSS)
tss <- sum((true - mean(true))^2)
#Residual Sum of Squares (RSS)
rss <- sum((true - predict)^2)
1 - (rss / tss)
}
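A quick sanity check of both helpers on toy numbers (illustrative values only):
RMSE(c(3, 5), c(2.5, 5.5))                 # errors of +/-0.5, so RMSE = 0.5
R2(c(1, 2, 3, 4), c(1.1, 1.9, 3.2, 3.8))   # near-perfect predictions, R^2 = 0.98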
#create model "cv_model_base" with all original features
train_data_base <- data.frame(scale(train_data[c("crim","zn","indus", "chas", "nox","rm",
"age","dis","rad","tax","ptratio","lstat")]))
train_data_base$medv <- train_data$medv
#prepare a matrix for glmnet
x_train_base <- model.matrix(medv ~ crim + zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + lstat - 1, train_data_base)
y_train_base <- train_data$medv
#use cross validation to find the optimal lambda
cv_model_base <- cv.glmnet(x_train_base, y_train_base, alpha=1)
lambda_base <- cv_model_base$lambda.min
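To see which coefficients LASSO retains (and which it shrinks to zero) at the selected lambda, they can be inspected directly (a quick optional check; output not shown):
coef(cv_model_base, s = lambda_base)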
testing base model on the test set
test_data_base <- data.frame(scale(test_data[c("crim","zn","indus", "chas", "nox","rm",
"age","dis","rad","tax","ptratio","lstat")]))
test_data_base$medv <- test_data$medv
x_test_base <- model.matrix(medv ~ crim + zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + lstat -1, test_data_base)
predict_base <- predict(cv_model_base, s=lambda_base, newx = x_test_base)
y_test <- test_data$medv
RMSE(y_test, predict_base)
## [1] 5.818608
R2(y_test, predict_base)
## [1] 0.5712238
First, look at which variables are most closely correlated with medv, and transform the features in that order.
library(dplyr)
cor_plot_ranked <- data.frame(cor_plot_matrix)
cor_plot_ranked['medv']|>
arrange(desc(abs(medv)))
transformations <- list(
og = function(x) x,
log = function(x) {ifelse(x > 0, log(x+1), x)},
sqrt = function(x) sqrt(x),
square = function(x) x^2
)
library(glmnet)
test_transformation <- function(data, feature, transformations, response = "medv") {
results <- list()
for (trans_name in names(transformations)) {
# apply transformation for feature
data[[paste0(feature, "_", trans_name)]] <-
transformations[[trans_name]](data[[feature]])
# prepare data for LASSO
x <- model.matrix(reformulate(termlabels = paste0(feature, "_", trans_name),
response = response), data)
y <- data[[response]]
# fit LASSO model
cv_model <- cv.glmnet(x, y, alpha = 1)
# store results
results[[trans_name]] <- min(cv_model$cvm) # store min. cross-validation error
}
return(results)
}
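For example, the helper can be run on a single feature before looping over all of them below (illustrative; output not shown):
test_transformation(train_data, "lstat", transformations)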
features <- c("crim", "zn", "indus", "nox", "rm", "age", "dis", "rad", "tax", "ptratio",
"lstat") #"chas" is removed
all_results <- list()
for (feature in features) {
all_results[[feature]] <- test_transformation(train_data, feature, transformations)
}
all_results
## $crim
## $crim$og
## [1] 73.94864
##
## $crim$log
## [1] 65.07591
##
## $crim$sqrt
## [1] 66.02352
##
## $crim$square
## [1] 83.44711
##
##
## $zn
## $zn$og
## [1] 74.97155
##
## $zn$log
## [1] 73.32584
##
## $zn$sqrt
## [1] 72.94894
##
## $zn$square
## [1] 76.76166
##
##
## $indus
## $indus$og
## [1] 63.32572
##
## $indus$log
## [1] 60.55059
##
## $indus$sqrt
## [1] 61.20971
##
## $indus$square
## [1] 67.96439
##
##
## $nox
## $nox$og
## [1] 68.6618
##
## $nox$log
## [1] 68.88375
##
## $nox$sqrt
## [1] 68.57813
##
## $nox$square
## [1] 69.84305
##
##
## $rm
## $rm$og
## [1] 40.81376
##
## $rm$log
## [1] 45.55186
##
## $rm$sqrt
## [1] 43.66835
##
## $rm$square
## [1] 36.86515
##
##
## $age
## $age$og
## [1] 72.27145
##
## $age$log
## [1] 76.14704
##
## $age$sqrt
## [1] 73.88289
##
## $age$square
## [1] 71.21946
##
##
## $dis
## $dis$og
## [1] 79.69296
##
## $dis$log
## [1] 77.75665
##
## $dis$sqrt
## [1] 78.61167
##
## $dis$square
## [1] 82.45873
##
##
## $rad
## $rad$og
## [1] 71.2702
##
## $rad$log
## [1] 73.61563
##
## $rad$sqrt
## [1] 72.28383
##
## $rad$square
## [1] 70.49924
##
##
## $tax
## $tax$og
## [1] 63.95556
##
## $tax$log
## [1] 63.41417
##
## $tax$sqrt
## [1] 64.49838
##
## $tax$square
## [1] 65.14206
##
##
## $ptratio
## $ptratio$og
## [1] 62.26636
##
## $ptratio$log
## [1] 62.0783
##
## $ptratio$sqrt
## [1] 62.20056
##
## $ptratio$square
## [1] 61.58558
##
##
## $lstat
## $lstat$og
## [1] 37.38989
##
## $lstat$log
## [1] 25.74093
##
## $lstat$sqrt
## [1] 30.35599
##
## $lstat$square
## [1] 53.28978
From the results above, the transformation with the lowest cross-validation error for each feature is:
* crim -> log
* zn -> log
* indus -> log
* nox -> sqrt (difference is too small; may ignore)
* rm -> square
* age -> square
* dis -> log
* rad -> square (difference is too small; may ignore)
* tax -> log (difference is too small; may ignore)
* ptratio -> square (difference is too small; may ignore)
* lstat -> log
transforming_train <- train_data |>
mutate(log_crim = log(crim + 1),
log_zn = log(zn + 1),
log_indus = log(indus + 1),
sq_rm = rm^2,
sq_age = age^2,
log_dis = log(dis+1),
log_lstat = log(lstat + 1))
transformed_train_data <- transforming_train[c('log_crim','log_zn','log_indus','sq_rm',
'sq_age','log_dis','log_lstat','chas',
'tax','ptratio','nox','rad')] |>
scale() |>
data.frame()
transformed_train_data$medv = train_data$medv
head(transformed_train_data)
testing transformed model on the test set
#prepare a matrix for glmnet
x_train_transformed <- model.matrix(medv ~ log_crim + log_zn + log_indus + sq_rm +
sq_age + log_dis + log_lstat + rad + chas +
tax + ptratio + nox- 1,
transformed_train_data)
y_train <- train_data$medv
#use cross validation to find the optimal lambda
cv_model_transformed <- cv.glmnet(x_train_transformed, y_train, alpha=1)
lambda_transformed <- cv_model_transformed$lambda.min
#prepare test set for prediction
transforming_test <- test_data |>
mutate(log_crim = log(crim + 1),
log_zn = log(zn + 1),
log_indus = log(indus + 1),
sq_rm = rm^2,
sq_age = age^2,
log_dis = log(dis+1),
log_lstat = log(lstat + 1))
transformed_test_data <- transforming_test[c('log_crim','log_zn','log_indus','sq_rm','sq_age','log_dis','log_lstat','chas','tax',
'ptratio','nox','rad')] |>
scale() |>
data.frame()
transformed_test_data$medv <- test_data$medv
x_test_transformed <- model.matrix(medv ~ log_crim + log_zn + log_indus + sq_rm + sq_age + log_dis +
log_lstat + rad + chas + tax + ptratio + nox - 1,
transformed_test_data)
#note: without an explicit s argument, predict() on a cv.glmnet object defaults to s = "lambda.1se"
#(the base model above was evaluated at s = lambda.min)
predict_LASSO_transformed <- predict(cv_model_transformed,
newx = x_test_transformed)
y_test <- test_data$medv
RMSE(y_test, predict_LASSO_transformed)
## [1] 5.968521
R2(y_test, predict_LASSO_transformed)
## [1] 0.5488448
base model: RMSE ~5.8, R-squared ~0.57
transformed model (evaluated at the default lambda.1se): RMSE ~6.0, R-squared ~0.55, so the transformations did not improve test-set performance in this run
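For a like-for-like comparison with the base model (which was scored at s = lambda.min), the transformed model can also be evaluated at lambda.min; a minimal sketch using the objects defined above (predict_transformed_min is just an illustrative name; results not shown here):
predict_transformed_min <- predict(cv_model_transformed, s = lambda_transformed,
newx = x_test_transformed)
RMSE(y_test, predict_transformed_min)
R2(y_test, predict_transformed_min)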
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
select_features_rf <- c("crim","zn","indus", "chas", "nox","rm","age","dis","rad","tax",
"ptratio","lstat", "medv")
train_data_rf <- train_data[select_features_rf]
# create cross validation parameter / train control
train_control <- trainControl(method = 'cv', number = 5)
# create tuning grid
tune_grid <- expand.grid(mtry = c(2, 3, 4, 5, 6))
# train Random Forest model with tuned parameters
rf_model_tuned <- train(medv ~ .,
data = train_data_rf,
method = 'rf',
trControl = train_control,
tuneGrid = tune_grid)
rf_model_tuned$bestTune$mtry
## [1] 5
mtry <- rf_model_tuned$bestTune$mtry
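The full cross-validation results for each candidate mtry can also be inspected (a quick optional check; output not shown):
rf_model_tuned$results   # cross-validated RMSE, R-squared, and MAE for each mtry in the grid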
# build Random Forest model with original data
rf_model <- randomForest(medv ~ .,
data = train_data_rf,
ntree = 500,
mtry = mtry)
# view model information
print(rf_model)
##
## Call:
## randomForest(formula = medv ~ ., data = train_data_rf, ntree = 500, mtry = mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 8.599402
## % Var explained: 89.97
importance(rf_model)
## IncNodePurity
## crim 1774.53142
## zn 148.45776
## indus 1684.98412
## chas 59.13755
## nox 1240.12170
## rm 12668.15217
## age 706.98150
## dis 1426.18937
## rad 184.90533
## tax 928.32232
## ptratio 2207.30600
## lstat 11464.53382
# plotting feature importance
varImpPlot(rf_model)
# test Random Forest model by predicting test set
test_data_rf <- test_data[select_features_rf]
predictions_rf <- predict(rf_model, test_data_rf)
# get RMSE of the Random Forest model
rf_result <- data.frame(medv_actual = test_data_rf$medv,
medv_predicted = predictions_rf
)
RMSE(rf_result$medv_actual, rf_result$medv_predicted)
## [1] 5.043901
R2(rf_result$medv_actual, rf_result$medv_predicted)
## [1] 0.6778
The RMSE is around 5.0 with the tuned Random Forest model.
I also tried standardizing all the features except medv with scale() before training the Random Forest model; the resulting RMSE was ~5.3, so I kept the original, unscaled features.
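A minimal sketch of that standardization experiment as described (predictors standardized with scale(), medv left untouched; the helper objects such as predictor_cols and rf_model_scaled are illustrative, and the exact RMSE will vary between runs):
# standardize the predictors only; medv is left untouched
predictor_cols <- setdiff(select_features_rf, "medv")
train_scaled <- train_data_rf
train_scaled[predictor_cols] <- data.frame(scale(train_scaled[predictor_cols]))
test_scaled <- test_data_rf
test_scaled[predictor_cols] <- data.frame(scale(test_scaled[predictor_cols]))
rf_model_scaled <- randomForest(medv ~ ., data = train_scaled, ntree = 500, mtry = mtry)
RMSE(test_scaled$medv, predict(rf_model_scaled, test_scaled))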
Ensemble
ensemble_predictions <- (predictions_rf + predict_LASSO_transformed) / 2
RMSE(test_data_rf$medv, ensemble_predictions)
## [1] 5.350186
R2(test_data_rf$medv, ensemble_predictions)
## [1] 0.6374814
The RMSE for the ensemble model is around 5.35, higher than that of the Random Forest model alone.
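A simple variant (not run here) is to weight the stronger Random Forest model more heavily instead of averaging equally; a sketch with an assumed 0.7/0.3 weighting:
ensemble_weighted <- 0.7 * predictions_rf + 0.3 * predict_LASSO_transformed
RMSE(test_data_rf$medv, ensemble_weighted)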
Modeling results and model performance
Base model: the base LASSO model uses the original features without any transformations; its RMSE is approximately 5.8 with an R-squared of about 0.57.
Transformed model: after applying the feature transformations, the test-set RMSE was approximately 6.0 with an R-squared of about 0.55, so in this run the transformations did not improve the LASSO model's performance on the held-out test set.
The Random Forest model, with mtry tuned by cross-validation, yielded an RMSE of around 5.0 and an R-squared of around 0.68, a much better prediction than either LASSO model.
Feature importance analysis indicated that 'rm', 'lstat', and 'ptratio' were the most influential features, consistent with the EDA findings.
When testing the Random Forest model with standardized features, I found that the original features gave better results, so that is the model kept in this report.
The ensemble that averages the LASSO and Random Forest predictions produced an RMSE of approximately 5.35 and an R-squared of around 0.64, so it did not outperform the Random Forest model alone.
The aim of this project is to predict median housing values in Boston using various machine learning algorithms.
With the Boston Housing dataset, I selected the following features to predict MEDV (median value of owner-occupied homes in $1000s):
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. LSTAT: % lower status of the population
After building two LASSO models, one Random Forest model, and one ensemble model, the Random Forest model stood out as having the most predictive power in this project, with 'rm', 'lstat', and 'ptratio' as the most influential features. Although the transformed LASSO model did not beat the base model on the held-out test set in this run, the per-feature transformation search illustrates how feature engineering can be built into the modeling workflow.
Predicting housing prices is complex. Many other features could have been included in the training data, such as unit floor area, floor number, most recent renovation date, transportation access, number of parks nearby, noise levels, proximity to schools, neighborhood characteristics, and environmental attributes of the land. With those features, the predictive power of the model would likely increase. The dataset is also relatively old (the underlying data come from the 1970 U.S. census, first analyzed by Harrison and Rubinfeld in 1978) and small in size, so the results may be less relevant to the current housing market.
source: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html