Overview

In this report I detail the machine learning (ML) models I implemented to predict housing prices in Boston suburbs. The dataset for this experiment was obtained from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Housing. The report is organized to walk through the entire process: getting and cleaning the data, exploratory analysis of the dataset to understand the distribution of the features and their influence on the outcome, forming a hypothesis, training ML models, and evaluating them.

Introduction

The Boston Housing Price dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University, and is freely available for download from the UCI Machine Learning Repository. The dataset consists of 506 observations of 14 attributes. The median home value in $1000s, denoted MEDV, is the outcome (dependent) variable in our models. Below is a brief description of each feature and the outcome in our dataset:

  1. CRIM – per capita crime rate by town
  2. ZN – proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS – proportion of non-retail business acres per town
  4. CHAS – Charles River dummy variable (1 if tract bounds river; else 0)
  5. NOX – nitric oxides concentration (parts per 10 million)
  6. RM – average number of rooms per dwelling
  7. AGE – proportion of owner-occupied units built prior to 1940
  8. DIS – weighted distances to five Boston employment centres
  9. RAD – index of accessibility to radial highways
  10. TAX – full-value property-tax rate per $10,000
  11. PTRATIO – pupil-teacher ratio by town
  12. B – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT – % lower status of the population
  14. MEDV – median value of owner-occupied homes in $1000s

Getting and Cleaning the Data

The first step is to read the data into R, clean it, and transform it into a tidy, usable R data frame (or equivalent). The Boston Housing Price dataset consists of two files, housing.data and housing.names. The housing.data file contains the actual data, while housing.names contains the column names and a description of each feature, along with other relevant information. The actual data can be read into an R data frame with the read.table() function, but a little programming is required to extract the column names from the housing.names file. The R function read.fwf() reads data from fixed-width files. Below, I use subsetting and gsub() substitution to extract just the column names from this file.

# Read the dataset from housing.data file using read.table
# Read the description file containing the column names using fixed width file command
housing.df <- read.table("./data/housing.data", header = FALSE, sep = "")
names1 <- read.fwf("./data/housing.names", skip = 30, n = 17, 
               widths = c(-7,8,-60))

# Extract only the column names from the relevant lines and remove spaces
names2 <- as.character(names1[-c(3,6,15),])
names2 <- gsub(" ", "", names2)

# Assign column names to housing.df
names(housing.df) <- names2

Now, let us check and explore the cleaned data frame containing the housing data. The housing.df object is of class ‘data.frame’, which is easy to work with in R scripts. The str() function displays the structure of an R data frame. Below, the output of str() compactly provides the relevant information about our data frame: the number of observations, the number of variables, the name and class of each column, and sample values from each column.

# Display the class of the R object housing.df
class(housing.df)
## [1] "data.frame"
# Display the structure of the housing.df data frame
str(housing.df)
## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B      : num  397 397 393 395 397 ...
##  $ LSTAT  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Data Exploration

To get more detailed statistical information about each column, the summary() function can be used. It displays summary statistics such as the minimum, maximum, median, mean, and the 1st and 3rd quartile values for each column in our dataset. summary() also reports missing values, if any are present. We see that there are no missing values in any of our variables.

summary(housing.df)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX           PTRATIO            B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT            MEDV      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
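
One could also confirm directly that no values are missing by counting the NA entries in each column; a minimal sketch:

# Count missing values per column (all should be zero)
colSums(is.na(housing.df))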

Let us visualize the distribution and density of the outcome, MEDV. The black curve represents the density, and a boxplot is also plotted to provide an additional perspective. We see that the distribution of the median housing price is skewed to the right, with a number of outliers on the high end. It may therefore be useful to transform the MEDV column, for example with the natural logarithm, when modeling the hypothesis for regression analysis.
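
A minimal sketch of base-R code that could produce such a figure (a histogram of MEDV with its density curve, alongside a boxplot) is:

# Sketch: histogram of MEDV with density curve, plus a boxplot, side by side
par(mfrow = c(1, 2))
hist(housing.df$MEDV, freq = FALSE, breaks = 20,
     main = "Distribution of MEDV", xlab = "MEDV ($1000s)")
lines(density(housing.df$MEDV), lwd = 2)   # density curve in black
boxplot(housing.df$MEDV, main = "Boxplot of MEDV", ylab = "MEDV ($1000s)")
par(mfrow = c(1, 1))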

Now, let us produce scatterplots of some of the important variables (chosen based on intuition) against the outcome variable MEDV. We see that there is a strong positive or negative correlation between these variables and the outcome. It is also evident that INDUS and NOX are strongly positively correlated with one another, as nitric oxide levels tend to rise as industry increases.

# Scatterplot matrix of INDUS, NOX, RM, PTRATIO, LSTAT and MEDV
plot(housing.df[, c(3, 5, 6, 11, 13, 14)], pch = 3)

Correlation and near zero variance

Two important properties to check now are the correlation of each input feature with the dependent variable, and whether any feature has near-zero variance (i.e., values that barely vary within the column).

suppressMessages(library(caret))
# Correlation of each independent variable with the dependent variable
cor(housing.df,housing.df$MEDV)
##               [,1]
## CRIM    -0.3883046
## ZN       0.3604453
## INDUS   -0.4837252
## CHAS     0.1752602
## NOX     -0.4273208
## RM       0.6953599
## AGE     -0.3769546
## DIS      0.2499287
## RAD     -0.3816262
## TAX     -0.4685359
## PTRATIO -0.5077867
## B        0.3334608
## LSTAT   -0.7376627
## MEDV     1.0000000

We see that the number of rooms, RM, has the strongest positive correlation with the median housing price, while the percentage of lower-status population, LSTAT, and the pupil-teacher ratio, PTRATIO, have strong negative correlations. The feature with the weakest correlation to MEDV is proximity to the Charles River, CHAS.
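
Pairwise correlations among the predictors themselves are also worth a look, since multicollinearity will matter for the linear models later. As one possible check, caret's findCorrelation() flags columns whose absolute pairwise correlation exceeds a cutoff (the 0.75 used here is an arbitrary choice):

# Flag predictors with absolute pairwise correlation above 0.75
highCor <- findCorrelation(cor(housing.df[, 1:13]), cutoff = 0.75)
names(housing.df)[highCor]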

# Calculate near-zero variance
nzv <- nearZeroVar(housing.df, saveMetrics = TRUE)
sum(nzv$nzv)
## [1] 0

The output shows that there are no variables with zero or near-zero variance.

Feature Engineering and Data Partitioning

Next, we center and scale the input features (leaving the outcome MEDV untouched). We then partition the data into training and test sets in a 70/30 ratio.

# Centering/scaling of input features
house.scale <- cbind(scale(housing.df[1:13]), housing.df[14])

set.seed(12345)
#Do data partitioning
inTrain <- createDataPartition(y = house.scale$MEDV, p = 0.70, list = FALSE)
training <- house.scale[inTrain,]
testing <- house.scale[-inTrain,]

Regression Models

Linear models

Linear model 1

First, let us fit a linear regression model with MEDV as the dependent variable and all remaining variables as independent variables, training it on the training dataset. The coefficients of all the features, together with the intercept, are shown below. Next, we use the trained model to predict the outcome (MEDV) for the test dataset. A good metric for the accuracy of the model is the root-mean-squared error, given by \[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_{\mathrm{pred},i} - y_{\mathrm{act},i})^2}{n}}. \]
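
Since this RMSE calculation recurs for every model below, it could equally be wrapped in a small helper function; a minimal sketch:

# Helper: root-mean-squared error between predicted and actual values
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}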

set.seed(12345)
#Try linear model using all features
fit.lm <- lm(MEDV~.,data = training)

# Check coefficients
data.frame(coef = round(fit.lm$coefficients,2))
##              coef
## (Intercept) 22.68
## CRIM        -0.83
## ZN           1.09
## INDUS       -0.01
## CHAS         0.64
## NOX         -2.32
## RM           2.26
## AGE          0.24
## DIS         -3.35
## RAD          2.89
## TAX         -2.08
## PTRATIO     -2.25
## B            0.87
## LSTAT       -3.74
set.seed(12345)
#predict on test set
pred.lm <- predict(fit.lm, newdata = testing)

# Root-mean squared error
rmse.lm <- sqrt(sum((pred.lm - testing$MEDV)^2)/
                   length(testing$MEDV))
                   
c(RMSE = rmse.lm, R2 = summary(fit.lm)$r.squared)
##      RMSE        R2 
## 4.3819920 0.7239054

We see that the RMSE is 4.381992 and the \(R^2\) value is 0.7239 for this model.

Linear model 2

We also saw that the outcome variable MEDV is skewed to the right. A log transformation should bring its distribution closer to normal. Let us therefore fit a linear model with log(MEDV) as the outcome and all remaining features as inputs. We see that the RMSE has decreased for this model.

set.seed(12345)
#Try linear model using all features
fit.lm1 <- lm(log(MEDV)~.,data = training)

set.seed(12345)
#predict on test set
pred.lm1 <- predict(fit.lm1, newdata = testing)

# Root-mean squared error
rmse.lm1 <- sqrt(sum((exp(pred.lm1) - testing$MEDV)^2)/
                   length(testing$MEDV))
                   
c(RMSE = rmse.lm1, R2 = summary(fit.lm1)$r.squared)
##     RMSE       R2 
## 4.146613 0.770014

Let us examine the p-value calculated for each feature in the linear model. Any feature that is not significant (p > 0.05) is not contributing significantly to the model, probably due to multicollinearity with other features. We see that the features ZN, INDUS, and AGE are not significant.

library(car)
summary(fit.lm1)
## 
## Call:
## lm(formula = log(MEDV) ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73926 -0.10694 -0.01578  0.10220  0.78240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.04073    0.01056 287.977  < 2e-16 ***
## CRIM        -0.09153    0.01569  -5.835 1.25e-08 ***
## ZN           0.02578    0.01644   1.568  0.11778    
## INDUS        0.01234    0.02045   0.603  0.54658    
## CHAS         0.02527    0.01042   2.426  0.01577 *  
## NOX         -0.11288    0.02174  -5.193 3.56e-07 ***
## RM           0.04609    0.01443   3.195  0.00153 ** 
## AGE          0.01366    0.01830   0.746  0.45607    
## DIS         -0.11399    0.02120  -5.377 1.41e-07 ***
## RAD          0.12903    0.02904   4.443 1.20e-05 ***
## TAX         -0.09860    0.03153  -3.127  0.00192 ** 
## PTRATIO     -0.09482    0.01398  -6.784 5.17e-11 ***
## B            0.03204    0.01226   2.614  0.00934 ** 
## LSTAT       -0.19844    0.01687 -11.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1979 on 342 degrees of freedom
## Multiple R-squared:   0.77,  Adjusted R-squared:  0.7613 
## F-statistic: 88.08 on 13 and 342 DF,  p-value: < 2.2e-16
vif(fit.lm1)
##     CRIM       ZN    INDUS     CHAS      NOX       RM      AGE      DIS      RAD      TAX 
## 1.992300 2.436839 3.887833 1.071277 4.184944 1.983740 3.004529 3.935778 7.419761 8.755175 
##  PTRATIO        B    LSTAT 
## 1.901685 1.453877 2.643124

Variance inflation factors, computed with vif(), quantify how much multicollinearity inflates the standard errors of the linear model's coefficient estimates. A common rule of thumb is that the VIF should be below 5 for every feature. We see that the VIF is greater than 5 for RAD and TAX.
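
The offending features can also be picked out programmatically; a minimal sketch:

# List features whose variance inflation factor exceeds 5
vifs <- vif(fit.lm1)
names(vifs)[vifs > 5]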

Linear model 3

Based on all these observations, we now construct a new linear model as below:

log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + RAD + B + LSTAT

set.seed(12345)
#Try simple linear model using selected features
fit.lm2 <- lm(formula = log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + 
            RAD + B + LSTAT, data = training)

set.seed(12345)
#predict on test set
pred.lm2 <- predict(fit.lm2, newdata = testing)

# Root-mean squared error
rmse.lm2 <- sqrt(sum((exp(pred.lm2) - testing$MEDV)^2)/
                   length(testing$MEDV))
                   
c(RMSE = rmse.lm2, R2 = summary(fit.lm2)$r.squared)
##      RMSE        R2 
## 4.1786651 0.7624071

This model is marginally less accurate than linear model 2, as reflected in the slight increase in RMSE and slight decrease in the \(R^2\) value. Let us plot the predicted vs. actual values of the outcome MEDV.

# Plot of predicted price vs actual price
plot(exp(pred.lm2),testing$MEDV, xlab = "Predicted Price", ylab = "Actual Price")

# Diagnostics plots
layout(matrix(c(1,2,3,4),2,2))
plot(fit.lm2)

Random Forest Model

For the random forest implementation, we can reuse the same model formula, MEDV ~ . (MEDV as the outcome with all other features as inputs). Inspecting the results, we see that the random forest model gives the best accuracy so far.

suppressMessages(library(randomForest))
set.seed(12345)
fit.rf <- randomForest(formula = MEDV ~ ., data = training)

set.seed(12345)
pred.rf <- predict(fit.rf, testing)

rmse.rf <- sqrt(sum(((pred.rf) - testing$MEDV)^2)/
                   length(testing$MEDV))
c(RMSE = rmse.rf, pseudoR2 = mean(fit.rf$rsq))
##      RMSE  pseudoR2 
## 2.9295953 0.8541698
plot(pred.rf,testing$MEDV, xlab = "Predicted Price", ylab = "Actual Price", pch = 3)
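
It can also be instructive to see which predictors the forest relies on most; the randomForest package provides importance() and varImpPlot() for this:

# Variable importance from the fitted random forest
importance(fit.rf)
varImpPlot(fit.rf, main = "Variable importance")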

Conclusion

We experimented with several linear regression models and a random forest model to predict housing prices in Boston suburbs. The random forest model, using all input features to predict the outcome, yielded the lowest RMSE and hence the best predictions.