Source - The data were collected by the U.S. Census Service and concern housing in the Boston, Massachusetts area. The dataset was obtained from the StatLib archive and has been used extensively in the literature to benchmark algorithms. It is small, with only 506 cases. The data were originally published in Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
Goal - The goal of the project is to compare the performance of several machine learning algorithms in predicting house prices.
Problem Statement - The dataset has 14 variables, which are listed later in the report. The target variable is medv, the median value of a house. Using the remaining feature variables and several machine learning algorithms, we will predict medv.
library(corrr)
library(kableExtra)
library(rattle)
library(corrplot)
library(adabag)
library(dplyr)
library(ggplot2)
library(GGally)
library(readxl)
library(ggthemes)
library(glmnet)
library(tidyr)
library(caTools)
library(DT)
library(randomForest)
library(gbm)
library(knitr)
library(ROCR)
library(leaps)
library(PRROC)
library(boot)
library(naniar)
library(psych)
library(grid)
library(lattice)
library(caret) # Use cross-validation
library(class)
library(rpart) # Decision Tree
library(rpart.plot)
library(caretEnsemble)
data_dictionary <- read_excel("C:/Users/Radhika Sood/Desktop/R datasets/boston-house-prices/Boston_Data_Dictionary.xlsx")
data_dictionary
## # A tibble: 14 x 2
## Abbr. Description
## <chr> <chr>
## 1 crim per capita crime rate by town
## 2 zn proportion of residential land zoned for lots over 25,000 sq.ft.
## 3 indus proportion of non-retail business acres per town
## 4 chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
## 5 nox nitric oxides concentration (parts per 10 million)
## 6 rm average number of rooms per dwelling
## 7 age proportion of owner-occupied units built prior to 1940
## 8 dis weighted distances to five Boston employment centres
## 9 rad index of accessibility to radial highways
## 10 tax full-value property-tax rate per $10,000
## 11 ptratio pupil-teacher ratio by town
## 12 black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
## 13 lstat % lower status of the population
## 14 medv Median value of owner-occupied homes in $1000's
Importing the dataset
column_names <- c('crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv') # reference list of the 14 expected column names
housing <-read.csv("C:/Users/Radhika Sood/Desktop/R datasets/boston-house-prices/BostonHousing.csv", stringsAsFactors = FALSE, header = TRUE)
Dimensions of the data
dim(housing)
## [1] 506 14
The dataset has 506 rows and 14 columns.
The various column names in the dataset are:
colnames(housing)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "b" "Istat" "medv"
naniar::gg_miss_var(housing) +
theme_minimal()+
labs(y = "Missing Values in the dataset")
The graph shows that there are no missing values in the dataset.
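As a quick numeric cross-check (a minimal sketch, not part of the original analysis), the missing counts can also be computed directly:
colSums(is.na(housing))  # NA count per column; all zeros confirms the plot
sum(is.na(housing))      # total number of missing cells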
Glimpse of the data:
head(housing)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
A summary of the dataset:
summary(housing)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio b
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
It is important to visualize the distributions of the variables and the relationships between them.
housing %>%
gather(key = "var", value = "value") %>%
filter(var != "chas") %>%
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~ var, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histograms, we can observe roughly exponential (skewed) distributions for age, dis and lstat, and an approximately normal distribution for rm.
housing %>% gather(key, val,-medv) %>%
ggplot(aes(x=val, y = medv)) + geom_point() + stat_smooth(method = "lm", se = TRUE, col = "yellow") +
facet_wrap(~key, scales = "free") + ggtitle("Scatter plot of variables vs Median Value (medv)")
A positive linear relationship is observed between medv and rm. Negative linear relationships are observed between medv and age, crim, indus, lstat, nox, tax, ptratio and rad.
housing %>%
gather(key = "var", value = "value") %>%
filter(var != "chas") %>%
ggplot(aes(x = '',y = value)) +
geom_boxplot(outlier.colour = "red") +
facet_wrap(~ var, scales = "free")
Let's check the correlations between the variables.
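One way to compute and plot these correlations (a sketch using the corrplot package loaded above; not the report's original code) is:
cor_matrix <- cor(housing)                                          # correlation matrix of all numeric variables
corrplot::corrplot(cor_matrix, method = "circle", type = "upper")   # visualize pairwise correlations
round(cor_matrix[, "medv"], 2)                                      # correlations of each variable with the target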
Observation: There is a high correlation between medv and rm.
Introduction - Lasso regression performs L1 regularization, which adds a penalty equal to the sum of the absolute values (magnitudes) of the regression coefficients; the model is fit by minimizing the residual sum of squares plus this penalty. Lasso is useful when we have a large number of predictor variables, because it can shrink some coefficients exactly to zero.
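For reference, the standard lasso objective being minimized (a textbook formulation, not taken from this report) is:

$$ \hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| $$

where lambda controls the strength of the L1 penalty: the larger lambda is, the more coefficients are shrunk exactly to zero.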
Standardizing the dataset before applying the machine learning algorithms.
housing<- scale(housing)
summary(housing)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio b
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Split the data and apply the lasso regression
split_data = sample(nrow(housing),nrow(housing)*0.80) # row indices for an 80/20 train-test split
train = housing[split_data,]
test = housing[-split_data,]
Model Building
lasso_reg <- glmnet(x=as.matrix(train[,-14]), y=train[,14],alpha=1)
Plotting lasso regression
matplot(lasso_reg$lambda,t(lasso_reg$beta),type="l",ylab="coefficient",xlab="lambda")
abline(h=0)
plot(lasso_reg)
As the value of lambda increases, the coefficients shrink toward zero and are eventually set exactly to zero.
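To inspect the coefficients at particular points on the path (illustrative lambda values, not ones used in the report):
coef(lasso_reg, s = 0.1)    # larger penalty: several coefficients are shrunk exactly to zero
coef(lasso_reg, s = 0.001)  # small penalty: close to the ordinary least-squares fit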
Cross validation
We can apply cross-validation to find the best value of lambda.
Step 1: Plot mse vs lambda
cv_fit <- cv.glmnet(x=as.matrix(train[,-14]), y=train[,14],alpha=1, nfolds=6)
plot(cv_fit)
Step 2: The optimal lambda is:
opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 0.002508806
The coefficients from the lasso regression are
coef(cv_fit,s = cv_fit$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.00412836
## crim -0.10764005
## zn 0.10650537
## indus .
## chas 0.07756374
## nox -0.23635556
## rm 0.31586886
## age -0.03212979
## dis -0.33664120
## rad 0.24142497
## tax -0.16210340
## ptratio -0.21758798
## b 0.08751487
## lstat -0.33576624
Lasso removes unnecessary variables from the model; here the coefficient for indus has been shrunk to exactly zero.
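The dropped variables can also be listed programmatically (a small sketch, not from the original report):
lasso_coefs <- as.matrix(coef(cv_fit, s = opt_lambda))  # dense matrix of coefficients at the optimal lambda
rownames(lasso_coefs)[lasso_coefs[, 1] == 0]            # names of the variables shrunk to exactly zero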
Computing predictions for the training and test sets
old_x <- as.matrix(train[,-14])
pred_train <- predict(cv_fit, s = opt_lambda, newx = old_x)
pred_test <- predict(cv_fit, s = opt_lambda, newx = as.matrix(test[,-14]))
The in-sample (training) MSE is:
value <- c(train[,14])
train_mse <- mean((pred_train - value)^2)
train_mse
## [1] 0.2619976
The out-of-sample (test) MSE is:
value <- c(test[,14])
test_mse <- mean((pred_test-value)^2)
test_mse
## [1] 0.2602199
The training and test MSE are:
train_mse
## [1] 0.2619976
test_mse
## [1] 0.2602199
Regression tree: cp = 0.001 is the complexity parameter (cost-complexity factor); a split is kept only if it decreases the overall lack of fit by at least a factor of 0.001.
train <- data.frame(as.matrix(train))
boston.rpart <- rpart(medv ~ ., data = train, cp = .001)
Plotting the tree
prp(boston.rpart, digit= 4)
To get the optimal value of cp
plotcp(boston.rpart)
printcp(boston.rpart)
##
## Regression tree:
## rpart(formula = medv ~ ., data = train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] age b crim dis lstat nox ptratio rm tax
##
## Root node error: 404.45/404 = 1.0011
##
## n= 404
##
## CP nsplit rel error xerror xstd
## 1 0.4575256 0 1.00000 1.00505 0.091934
## 2 0.1591507 1 0.54247 0.64384 0.066882
## 3 0.0845833 2 0.38332 0.46031 0.054464
## 4 0.0324079 3 0.29874 0.38539 0.051892
## 5 0.0284978 4 0.26633 0.34963 0.048936
## 6 0.0260543 5 0.23783 0.35204 0.050129
## 7 0.0168399 6 0.21178 0.32388 0.048491
## 8 0.0113182 7 0.19494 0.30803 0.045369
## 9 0.0090342 8 0.18362 0.28237 0.042104
## 10 0.0071370 9 0.17459 0.28049 0.041115
## 11 0.0069507 10 0.16745 0.27062 0.038187
## 12 0.0046193 11 0.16050 0.26025 0.037823
## 13 0.0046176 12 0.15588 0.25686 0.037095
## 14 0.0043573 13 0.15126 0.25512 0.037106
## 15 0.0041213 14 0.14691 0.25586 0.037146
## 16 0.0030942 15 0.14278 0.25830 0.039309
## 17 0.0022499 16 0.13969 0.24906 0.038394
## 18 0.0019258 17 0.13744 0.25370 0.040152
## 19 0.0019191 18 0.13551 0.25511 0.040159
## 20 0.0015807 19 0.13360 0.25400 0.040135
## 21 0.0010750 20 0.13201 0.25337 0.040126
## 22 0.0010000 22 0.12987 0.25530 0.040121
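Instead of reading the value off the table by hand, the cp with the lowest cross-validated error can be picked programmatically (a sketch; the report selects cp manually):
best_row <- which.min(boston.rpart$cptable[, "xerror"])  # row with the minimum cross-validated error
best_cp <- boston.rpart$cptable[best_row, "CP"]          # corresponding complexity parameter
best_cp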
A cp of 0.0042164 gives a tree of the chosen size (11 splits); a cp of 0.0039381 would give 12 splits. Building the pruned regression tree with cp = 0.0042164:
regression_tree_model <- prune(boston.rpart, cp = 0.0042164)
In-sample prediction
boston.train.pred.tree = predict(boston.rpart) # predictions from the full tree (the pruned regression_tree_model could be used here instead)
Out-of-sample prediction
test <- data.frame(as.matrix(test))
boston.test.pred.tree = predict(boston.rpart,test)
MSE for in-sample and out of sample prediction
tree_train_mse <- mean((boston.train.pred.tree-train$medv)^2)
tree_test_mse <- mean((boston.test.pred.tree-test$medv)^2)
MSE values for train and test samples are:
tree_train_mse
## [1] 0.1300101
tree_test_mse
## [1] 0.1930564
Random forest decorrelates the individual trees by considering only a random subset of predictors at each split, which further reduces the variance of the ensemble.
We need to choose two parameters for the random forest. Number of trees: we grow up to a maximum of 500 trees. Number of variables tried at each split (mtry): the default for regression is roughly p/3, about 4 here (see the quick check below); the first model uses mtry = 6.
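A quick check of the default mtry for a regression forest (a sketch, not from the original report):
p <- ncol(train) - 1   # 13 predictors, excluding medv
floor(p / 3)           # default mtry for regression, about 4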
library(randomForest)
rf_model<- randomForest(medv~., data = train,mtry = 6, subset = TRUE, importance=TRUE)
rf_model
##
## Call:
## randomForest(formula = medv ~ ., data = train, mtry = 6, importance = TRUE, subset = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.1359126
## % Var explained: 86.42
Check the importance of each variable
rf_model$importance
## %IncMSE IncNodePurity
## crim 0.123962933 24.255000
## zn 0.002410730 1.238039
## indus 0.050032338 16.945818
## chas 0.003047726 1.524485
## nox 0.100015169 21.473598
## rm 0.457646758 138.263616
## age 0.044067398 9.170771
## dis 0.085173870 22.746860
## rad 0.010704356 2.417908
## tax 0.027678493 7.508277
## ptratio 0.047275352 14.713202
## b 0.013753102 5.869851
## lstat 0.780933038 134.040338
varImpPlot(rf_model)
Plot the out-of-bag (OOB) error against the number of trees to find the optimal value of ntree.
plot(rf_model$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")
The optimal value of ntree appears to be around 300 in this case.
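The tree count at which the OOB error bottoms out can also be read off directly (a sketch; the exact value varies between runs because no random seed is set):
which.min(rf_model$mse)   # number of trees with the lowest OOB MSE
min(rf_model$mse)         # the corresponding OOB MSE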
Plotting the OOB error against the number of predictor variables tried at each split (mtry) to select its optimal value.
oob.err<- rep(0, 13)
for(i in 1:13){
fit<- randomForest(medv~., data = train, mtry=i, ntree=300)
oob.err[i]<- fit$mse[200] # OOB error after 200 of the 300 trees
}
matplot(oob.err, pch=15, col = "red", type = "b", ylab = "MSE", xlab = "mtry")
legend("topright", legend = c("OOB Error"), pch = 15, col = c("red"))
According to this graph, mtry should be 5.
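The randomForest package also provides tuneRF(), which searches over mtry automatically (an alternative sketch; the arguments shown are illustrative):
set.seed(1)   # hypothetical seed, only for reproducibility of this sketch
tuned <- tuneRF(x = train[, -14], y = train[, 14],
                ntreeTry = 300, stepFactor = 1.5, improve = 0.01)
tuned   # OOB error for each mtry tried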
Creating the random forest
Pre <- as.formula("medv ~ .")
rf <- randomForest(Pre,data=train, mtry=5, ntree=300, importance =TRUE)
The random forest test MSE is:
rf_train_pred = predict(rf)
rf_test_pred = predict(rf, test)
rf_test_mse <- mean((rf_test_pred - test[,14])^2)
rf_test_mse
## [1] 0.08796767
The random forest training (out-of-bag) MSE is:
rf_train_mse <- mean((rf_train_pred - train[,14])^2)
rf_train_mse
## [1] 0.1347059
In boosting we may grow many trees with only a single split (stumps), and while any given tree has fairly low predictive power, as we add more trees and continue to learn slowly, the result can be a model with high predictive ability.
Boosting parameters: number of trees = 10000, shrinkage = 0.01, interaction depth = 8. rm and lstat are the most influential variables.
df<-data.frame(as.matrix(train))
test<-data.frame(as.matrix(test))
boosting<- gbm(medv ~., data = df, distribution = "gaussian", n.trees = 10000,interaction.depth = 8, shrinkage = .01 )
summary(boosting)
## var rel.inf
## lstat lstat 35.5386136
## rm rm 32.6927530
## dis dis 8.4420603
## crim crim 5.4243482
## nox nox 4.6823222
## age age 3.6025463
## b b 3.0618062
## ptratio ptratio 2.3396186
## tax tax 1.5882420
## indus indus 1.1091361
## chas chas 0.7195026
## rad rad 0.6512331
## zn zn 0.1478179
Prediction using the boosting algorithm:
ntree <- seq(100, 1000, 100)                                    # evaluate the boosted model at several numbers of trees
pred_test = predict(boosting, newdata=test, n.trees=ntree)      # one column of predictions per value of n.trees
boosting_mse <- colMeans((pred_test-test[,14])^2)               # test MSE for each value of n.trees
Use cross validation to find the optimal number of trees
model <- gbm(medv~., data = df, distribution = "gaussian", n.trees=5000, interaction.depth=4, shrinkage = 0.01, verbose=F, cv.folds=5)
bestTreeForPrediction <- gbm.perf(model)
pred_tree = predict(model, newdata = test,n.trees = bestTreeForPrediction)
round(mean((pred_tree-test[,14])^2),2)
## [1] 0.08
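Since the stated goal is to compare the algorithms, the out-of-sample MSEs computed above can be collected into one table (a sketch reusing objects created earlier in the report):
results <- data.frame(
  Model    = c("Lasso regression", "Regression tree", "Random forest", "Boosting (CV-tuned)"),
  Test_MSE = c(test_mse, tree_test_mse, rf_test_mse, mean((pred_tree - test[, 14])^2))
)
results[order(results$Test_MSE), ]   # lowest test MSE first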