Introduction

Source - The information in this dataset was collected by the U.S. Census Service and concerns housing in the Boston, Massachusetts area. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms. The dataset is small, with only 506 cases. The data were originally published in Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, pp. 81-102, 1978.

Goal - The goal of the project is to compare the performance of several machine learning algorithms on the task of predicting median house prices.

Problem Statement - The dataset has 14 variables, which are listed later in the report. Among these, medv (the median value of owner-occupied homes) is the target variable. Using the remaining feature variables and several machine learning algorithms, we will predict the medv value.

Data Preparation

Libraries used

library(corrr)
library(kableExtra)
library(rattle)
library(corrplot)
library(adabag)
library(dplyr)
library(ggplot2)
library(GGally)
library(readxl)
library(ggthemes)
library(glmnet)
library(tidyr)
library(caTools)
library(DT)
library(randomForest)
library(gbm)
library(knitr)
library(ROCR)
library(leaps)
library(PRROC)
library(boot)
library(naniar)
library(psych)
library(grid)
library(lattice)
library(caret) # Use cross-validation
library(class)
library(rpart) # Decision Tree
library(rpart.plot)
library(caretEnsemble)

Data Dictionary

data_dictionary <- read_excel("C:/Users/Radhika Sood/Desktop/R datasets/boston-house-prices/Boston_Data_Dictionary.xlsx")
data_dictionary
## # A tibble: 14 x 2
##    Abbr.   Description                                                          
##    <chr>   <chr>                                                                
##  1 crim    per capita crime rate by town                                        
##  2 zn      proportion of residential land zoned for lots over 25,000 sq.ft.     
##  3 indus   proportion of non-retail business acres per town                     
##  4 chas    Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
##  5 nox     nitric oxides concentration (parts per 10 million)                   
##  6 rm      average number of rooms per dwelling                                 
##  7 age     proportion of owner-occupied units built prior to 1940               
##  8 dis     weighted distances to five Boston employment centres                 
##  9 rad     index of accessibility to radial highways                            
## 10 tax     full-value property-tax rate per $10,000                             
## 11 ptratio pupil-teacher ratio by town                                          
## 12 black   1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town       
## 13 lstat   % lower status of the population
## 14 medv    Median value of owner-occupied homes in $1000's

Import

Importing the dataset

column_names <- c('crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv')
housing <-read.csv("C:/Users/Radhika Sood/Desktop/R datasets/boston-house-prices/BostonHousing.csv",  stringsAsFactors = FALSE, header = TRUE)

Dimensions of the data

dim(housing)
## [1] 506  14

The dataset has 506 rows and 14 columns.

The various column names in the dataset are:

colnames(housing)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "b"       "Istat"   "medv"

Missing values

naniar::gg_miss_var(housing) +
theme_minimal()+
labs(y = "Missing Values in the dataset")

The graph shows that there are no missing values in the dataset.
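
As a quick numeric check alongside the plot, the per-column missing counts can be computed directly (a minimal sketch, assuming housing has been read in as above):

colSums(is.na(housing))   # all zeros confirms there are no missing values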

Glimpse of the data:

head(housing)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

A summary of the dataset:

summary(housing)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio            b         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Data Analysis

It is important to visualize the data and explore the relationships between the variables.

Histogram

housing %>%
  gather(key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histograms, we observe strongly skewed distributions for the age, dis and lstat variables, and an approximately normal distribution for rm.

Scatterplot

housing %>% gather(key, val,-medv) %>%
  ggplot(aes(x=val, y = medv)) + geom_point() + stat_smooth(method = "lm", se = TRUE, col = "yellow") +
  facet_wrap(~key, scales = "free") + ggtitle("Scatter plot of variables vs Median Value (medv)") 

A positive linear relationship is observed between medv and rm. Negative linear relationships are observed between medv and age, crim, indus, lstat, nox, tax, ptratio and rad.

Boxplot

housing %>%
  gather(key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = '',y = value)) +
  geom_boxplot(outlier.colour = "red") +
  facet_wrap(~ var, scales = "free")

Correlation

Let's check the correlation between the variables.
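
A minimal sketch of a correlation plot using the corrplot package loaded earlier (this assumes housing is still the unscaled data frame at this point; the styling options are arbitrary):

housing_cor <- cor(housing)   # pairwise Pearson correlations between all 14 variables
corrplot(housing_cor, method = "color", type = "upper", tl.cex = 0.8)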

Observation: There is a high correlation between medv and rm.

Lasso Regression

Introduction - Lasso regression performs L1 regularization: it adds a penalty term to the loss function equal to the sum of the absolute values of the regression coefficients. Minimizing this penalized loss shrinks some coefficients exactly to zero, which is why lasso is useful when we have a large number of predictor variables.
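
For reference, a sketch of the objective that glmnet minimizes for a Gaussian response with alpha = 1 (standard notation, not taken from the report):

$$\hat{\beta} = \underset{\beta_0,\;\beta}{\arg\min}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

Larger values of lambda put more weight on the penalty and therefore shrink more coefficients to exactly zero.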

Standardizing the dataset before applying the machine learning algorithms:

housing<- scale(housing)
summary(housing)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio              b          
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      Istat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Split the data into training (80%) and test (20%) sets before applying lasso regression. Note that no random seed is set, so the exact split (and therefore the reported errors) will vary between runs.

split_data = sample(nrow(housing),nrow(housing)*0.80)
train = housing[split_data,]
test =  housing[-split_data,]

Model Building

lasso_reg <- glmnet(x=as.matrix(train[,-14]), y=train[,14],alpha=1)

Plotting lasso regression

matplot(lasso_reg$lambda,t(lasso_reg$beta),type="l",ylab="coefficient",xlab="lambda")
abline(h=0)

plot(lasso_reg)

As the value of lambda increases, the coefficients shrink towards zero.
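
To see the shrinkage numerically, the coefficients can be extracted at a few specific lambda values (a sketch; these particular values of s are arbitrary):

coef(lasso_reg, s = c(0.5, 0.1, 0.01))   # larger s (lambda) leaves fewer non-zero coefficients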

Cross validation

We can apply cross-validation to find the best value of lambda.

Step 1: Plot the cross-validated MSE against lambda

cv_fit <- cv.glmnet(x=as.matrix(train[,-14]), y=train[,14],alpha=1, nfolds=6)
plot(cv_fit)

Step 2: The optimal lambda is:

opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 0.002508806

The coefficients from the lasso regression at the optimal lambda are:

coef(cv_fit,s = cv_fit$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  0.00412836
## crim        -0.10764005
## zn           0.10650537
## indus        .         
## chas         0.07756374
## nox         -0.23635556
## rm           0.31586886
## age         -0.03212979
## dis         -0.33664120
## rad          0.24142497
## tax         -0.16210340
## ptratio     -0.21758798
## b            0.08751487
## lstat       -0.33576624

Lasso has shrunk the coefficient of indus to zero, effectively removing the unnecessary variable from the model.

Computing the predictions for the training and test sets

old_x <- as.matrix(train[,-14])
pred_train <- predict(cv_fit, s = opt_lambda, newx = old_x)
pred_test <- predict(cv_fit, s = opt_lambda, newx = as.matrix(test[,-14]))

In sample MSE is:

value <- c(train[,14])
train_mse <- mean((pred_train - value)^2)
train_mse
## [1] 0.2619976

Out of sample MSE is:

value <- c(test[,14])
test_mse <- mean((pred_test-value)^2)
test_mse
## [1] 0.2602199

The training and test MSE are:

train_mse
## [1] 0.2619976
test_mse
## [1] 0.2602199

Regression Tree

cp = 0.001 is the cost-complexity parameter: a split is kept only if it decreases the overall lack of fit by at least a factor of 0.001.
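
More concretely, rpart prunes using the cost-complexity criterion described in its documentation (shown here for reference; notation follows the rpart vignette):

$$R_{cp}(T) = R(T) + cp \cdot \lvert T\rvert \cdot R(T_1)$$

where R(T) is the risk (here, the residual sum of squares) of tree T, |T| is the number of splits, and T_1 is the tree with no splits.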

train <- data.frame(as.matrix(train))
boston.rpart <- rpart(medv ~ ., data = train, cp = .001)

Plotting the tree:

prp(boston.rpart, digits = 4)

To get the optimal value of cp

plotcp(boston.rpart)

printcp(boston.rpart)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = train, cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] age     b       crim    dis     lstat   nox     ptratio rm      tax    
## 
## Root node error: 404.45/404 = 1.0011
## 
## n= 404 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4575256      0   1.00000 1.00505 0.091934
## 2  0.1591507      1   0.54247 0.64384 0.066882
## 3  0.0845833      2   0.38332 0.46031 0.054464
## 4  0.0324079      3   0.29874 0.38539 0.051892
## 5  0.0284978      4   0.26633 0.34963 0.048936
## 6  0.0260543      5   0.23783 0.35204 0.050129
## 7  0.0168399      6   0.21178 0.32388 0.048491
## 8  0.0113182      7   0.19494 0.30803 0.045369
## 9  0.0090342      8   0.18362 0.28237 0.042104
## 10 0.0071370      9   0.17459 0.28049 0.041115
## 11 0.0069507     10   0.16745 0.27062 0.038187
## 12 0.0046193     11   0.16050 0.26025 0.037823
## 13 0.0046176     12   0.15588 0.25686 0.037095
## 14 0.0043573     13   0.15126 0.25512 0.037106
## 15 0.0041213     14   0.14691 0.25586 0.037146
## 16 0.0030942     15   0.14278 0.25830 0.039309
## 17 0.0022499     16   0.13969 0.24906 0.038394
## 18 0.0019258     17   0.13744 0.25370 0.040152
## 19 0.0019191     18   0.13551 0.25511 0.040159
## 20 0.0015807     19   0.13360 0.25400 0.040135
## 21 0.0010750     20   0.13201 0.25337 0.040126
## 22 0.0010000     22   0.12987 0.25530 0.040121

A cp value of 0.0042164 gives the chosen tree size of 11 splits.

Building a pruned regression tree of size 11 using cp = 0.0042164 (a cp of 0.0039381 would give a tree of size 12):

regression_tree_model <- prune(boston.rpart, cp = 0.0042164)
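
Alternatively, instead of reading the value off the printed cp table by eye, the cp with the lowest cross-validated error can be selected programmatically (a sketch using the cptable stored in the fitted object; best_cp and pruned_tree are hypothetical alternatives to the manual choice above):

best_cp <- boston.rpart$cptable[which.min(boston.rpart$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(boston.rpart, cp = best_cp)   # tree with the lowest cross-validated error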

In-sample prediction

boston.train.pred.tree = predict(boston.rpart)

Out-of-sample prediction

test <- data.frame(as.matrix(test))
boston.test.pred.tree = predict(boston.rpart,test)

MSE for the in-sample and out-of-sample predictions

tree_train_mse <- mean((boston.train.pred.tree-train$medv)^2)
tree_test_mse <- mean((boston.test.pred.tree-test$medv)^2)

MSE values for train and test samples are:

tree_train_mse
## [1] 0.1300101
tree_test_mse
## [1] 0.1930564

Random Forest

A random forest reduces the correlation between individual trees (by growing each tree on a bootstrap sample and considering only a random subset of variables at each split), which further reduces the variance of the ensemble.

We need to choose two parameters for the random forest: the number of trees (we grow up to a maximum of 500) and the number of variables tried at each split (mtry). We start with mtry = 6; the default for regression would be about p/3 ≈ 4.

library(randomForest)
rf_model<- randomForest(medv~., data = train,mtry = 6, subset = TRUE, importance=TRUE)
rf_model
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, mtry = 6, importance = TRUE,      subset = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.1359126
##                     % Var explained: 86.42

Check the importance of each variable

rf_model$importance
##             %IncMSE IncNodePurity
## crim    0.123962933     24.255000
## zn      0.002410730      1.238039
## indus   0.050032338     16.945818
## chas    0.003047726      1.524485
## nox     0.100015169     21.473598
## rm      0.457646758    138.263616
## age     0.044067398      9.170771
## dis     0.085173870     22.746860
## rad     0.010704356      2.417908
## tax     0.027678493      7.508277
## ptratio 0.047275352     14.713202
## b       0.013753102      5.869851
## lstat   0.780933038    134.040338
varImpPlot(rf_model)

Plot out of bag error vs the number of trees to get the optimal value for the number of trees.

plot(rf_model$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")

Based on the plot, the OOB error has flattened out by around 300 trees, so ntree = 300 is used for this case.
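
The number of trees with the smallest OOB error can also be read off directly (a sketch):

which.min(rf_model$mse)   # index of the tree count with the lowest OOB MSE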

Plotting the OOB error against the number of predictor variables tried at each split (mtry) to select the optimal value.

oob.err<- rep(0, 13)
for(i in 1:13){
  fit<- randomForest(medv~., data = train, mtry=i, ntree=300)
  oob.err[i]<- fit$mse[200] #oob error for ntree=200
}

matplot(oob.err, pch=15, col = "red", type = "b", ylab = "MSE", xlab = "mtry")
legend("topright", legend = c("OOB Error"), pch = 15, col = c("red"))

According to this graph, mtry should be 5.
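
Equivalently, the best mtry can be picked programmatically (a sketch):

which.min(oob.err)   # mtry value with the smallest OOB error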

Creating the random forest

Pre <- as.formula("medv ~ .")
rf <- randomForest(Pre,data=train, mtry=5, ntree=300, importance =TRUE)

The random forest has a test error of:

rf_train_pred = predict(rf)
rf_test_pred = predict(rf, test)
rf_test_mse <- mean((rf_test_pred - test[,14])^2)
rf_test_mse
## [1] 0.08796767

The random forest has a training (out-of-bag) error of:

rf_train_mse <- mean((rf_train_pred - train[,14])^2)
rf_train_mse
## [1] 0.1347059

Boosting

In boosting, we grow many shallow trees sequentially (sometimes with only a single split each); while any given tree has fairly low predictive power, as we add more trees and continue to learn slowly, the result can be a model with high predictive ability.

Parameters used: number of trees = 10000, shrinkage = 0.01, interaction depth = 8. As the summary below shows, rm and lstat are the most influential variables.

df<-data.frame(as.matrix(train))
test<-data.frame(as.matrix(test))

boosting<- gbm(medv ~., data = df,  distribution = "gaussian", n.trees = 10000,interaction.depth = 8, shrinkage = .01 )
summary(boosting)

##             var    rel.inf
## lstat     lstat 35.5386136
## rm           rm 32.6927530
## dis         dis  8.4420603
## crim       crim  5.4243482
## nox         nox  4.6823222
## age         age  3.6025463
## b             b  3.0618062
## ptratio ptratio  2.3396186
## tax         tax  1.5882420
## indus     indus  1.1091361
## chas       chas  0.7195026
## rad         rad  0.6512331
## zn           zn  0.1478179

Prediction on the test set using the boosting model, for a grid of tree counts:

ntree <- seq(100, 1000, 100)
pred_test <- predict(boosting, newdata = test, n.trees = ntree)   # one column of predictions per tree count
boosting_mse <- colMeans((pred_test - test[,14])^2)               # test MSE for each tree count

Use cross-validation to find the optimal number of trees:

model <- gbm(medv~., data = df, distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.01, verbose=F, cv.folds=5)
bestTreeForPrediction <- gbm.perf(model)

pred_tree = predict(model, newdata = test,n.trees = bestTreeForPrediction)
round(mean((pred_tree-test[,14])^2),2)
## [1] 0.08
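
Since the stated goal is to compare the algorithms, the out-of-sample errors computed above can be collected into a single table (a sketch, assuming the objects from the previous chunks are still in the workspace; boost_test_mse is computed here from the cross-validated gbm predictions):

boost_test_mse <- mean((pred_tree - test[,14])^2)   # unrounded version of the value printed above
results <- data.frame(
  model    = c("Lasso regression", "Regression tree", "Random forest", "Boosting (CV)"),
  test_mse = c(test_mse, tree_test_mse, rf_test_mse, boost_test_mse)
)
results[order(results$test_mse), ]   # lower test MSE is better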