B.I.E.T. Jhansi
This project is based on the boston dataset. The data to be analyzed were collected by Harrison and Rubinfeld in 1978 for the purpose of discovering whether or not clean air influenced the value of houses in Boston. Their results are documented in a paper titled Hedonic prices and the demand for clean air, published in J. Environ. Economics and Management 5, 81-102.
This report seeks to examine the influence of several neighborhood attributes on the prices of housing, in an attempt to discover the most suitable explanatory variables. The specific neighborhood attributes to be considered are proximity to the Charles River, distance to the main employment centers, pupil-teacher ratio in schools, and levels of crime. Whereas the original study focused on air pollution using nitrogen oxide concentrations as an explanatory variable, this report examines whether or not there are other, better explanatory variables for the median value of houses in Boston. The R programming language will be used to conduct this analysis.
The dataset (Boston Housing Price) was taken from the StatLib library which is maintained at Carnegie Mellon University and is freely available for download from the UCI Machine Learning Repository. The dataset consists of 506 observations of 14 attributes. The median value of house price in $1000s, denoted by MEDV, is the outcome or the dependent variable in our model. Below is a brief description of each feature and the outcome in our dataset:
The prime objective of this project is to construct a working model which has the capability of predicting the value of houses, we will need to separate the dataset into features and the target variable. The features, ‘RM’, ‘LSTAT’, and ‘PTRATIO’, give us quantitative information about each data point. The target variable, ‘MEDV’, will be the variable we seek to predict. These are stored in features and prices, respectively.
setwd("~/Desktop/Data Analytics Internship/Boston")
library(readxl)
boston <- read_excel("boston.xls")
View(boston)
dim(boston)
## [1] 506 14
summary(boston)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PT B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
library(psych)
describe(boston)
## vars n mean sd median trimmed mad min max range
## CRIM 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97
## ZN 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00
## INDUS 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28
## CHAS 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00
## NOX 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49
## RM 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22
## AGE 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10
## DIS 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00
## RAD 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00
## TAX 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00
## PT 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40
## B 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58
## LSTAT 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24
## MV 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00
## skew kurtosis se
## CRIM 5.19 36.60 0.38
## ZN 2.21 3.95 1.04
## INDUS 0.29 -1.24 0.30
## CHAS 3.39 9.48 0.01
## NOX 0.72 -0.09 0.01
## RM 0.40 1.84 0.03
## AGE -0.60 -0.98 1.25
## DIS 1.01 0.46 0.09
## RAD 1.00 -0.88 0.39
## TAX 0.67 -1.15 7.49
## PT -0.80 -0.30 0.10
## B -2.87 7.10 4.06
## LSTAT 0.90 0.46 0.32
## MV 1.10 1.45 0.41
This table consists of our 2 categorical variables ‘CHAS’ and ‘RAD’.
ftable(CHAS ~ RAD, data=boston)
## CHAS 0 1
## RAD
## 1 19 1
## 2 24 0
## 3 36 2
## 4 102 8
## 5 104 11
## 6 26 0
## 7 17 0
## 8 19 5
## 24 124 8
We boxplot all the variables that effect our study of the target variable ’MEDV’00
plot(boston)
plot(boston[,c(3,5,6,11,13,14)],pch=3)
A few important properties to check now are the correlation of input features with the dependent variable, and to check if any feature has near zero variance (values not varying much within the column).
cor(boston,boston$MV)
## [,1]
## CRIM -0.3883046
## ZN 0.3604453
## INDUS -0.4837252
## CHAS 0.1752602
## NOX -0.4273208
## RM 0.6953599
## AGE -0.3769546
## DIS 0.2499287
## RAD -0.3816262
## TAX -0.4685359
## PT -0.5077867
## B 0.3334608
## LSTAT -0.7376627
## MV 1.0000000
We see that the number of rooms RM has the strongest positive correlation with the median value of the housing price, while the percentage of lower status population, LSTAT and the pupil-teacher ratio, PTRATIO, have strong negative correlation. The feature with the least correlation to MV is the proximity to Charles River, CHAS.
hist(boston$MV,xlab="Median Value", main="House Pricing", col="grey")
The right skewed distribution suggests that a log transformation would be appropriat. Similarly, the variables crim, dis, nox, zn are found to be right skewed, making log transformations appropriate. The left skewed distribution of ptratio suggests that squaring it could make for a better fit.
We partition the data on a 7/3 ratio as training/test datasets.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Warning: replacing previous import by 'plyr::ddply' when loading 'caret'
## Warning: replacing previous import by 'tidyr::%>%' when loading 'broom'
## Warning: replacing previous import by 'tidyr::gather' when loading 'broom'
## Warning: replacing previous import by 'tidyr::spread' when loading 'broom'
## Warning: replacing previous import by 'rlang::!!' when loading 'recipes'
## Warning: replacing previous import by 'rlang::expr' when loading 'recipes'
## Warning: replacing previous import by 'rlang::f_lhs' when loading 'recipes'
## Warning: replacing previous import by 'rlang::f_rhs' when loading 'recipes'
## Warning: replacing previous import by 'rlang::is_empty' when loading
## 'recipes'
## Warning: replacing previous import by 'rlang::lang' when loading 'recipes'
## Warning: replacing previous import by 'rlang::na_dbl' when loading
## 'recipes'
## Warning: replacing previous import by 'rlang::names2' when loading
## 'recipes'
## Warning: replacing previous import by 'rlang::quos' when loading 'recipes'
## Warning: replacing previous import by 'rlang::sym' when loading 'recipes'
## Warning: replacing previous import by 'rlang::syms' when loading 'recipes'
set.seed(12345)
#Do data partitioning
inTrain <- createDataPartition(y = boston$MV, p = 0.70, list = FALSE)
training <- boston[inTrain,]
testing <- boston[-inTrain,]
First, let us try generalized linear regression model with MEDV as the dependent variable and all the remaining variables as independent variables. We train the model with the training dataset. For this linear model, below is the coefficients of all the features, and the intercept. Next, we use the trained model to predict the outcome (MV) for the testing dataset. A good metric to test the accuracy of the model is to calculate the root-mean squared error.
fit.lm <- lm(MV~.,data = training)
#Check Coefficients
data.frame(coef = round(fit.lm$coefficients,2))
## coef
## (Intercept) 42.96
## CRIM -0.10
## ZN 0.05
## INDUS 0.00
## CHAS 2.50
## NOX -19.98
## RM 3.21
## AGE 0.01
## DIS -1.59
## RAD 0.33
## TAX -0.01
## PT -1.04
## B 0.01
## LSTAT -0.52
set.seed(12345)
#predict on test set
pred.lm <- predict(fit.lm, newdata = testing)
# Root-mean squared error
rmse.lm <- sqrt(sum((pred.lm - testing$MV)^2)/
length(testing$MV))
c(RMSE = rmse.lm, R2 = summary(fit.lm)$r.squared)
## RMSE R2
## 4.3819921 0.7239054
We see that the RMSE is 4.381992 and the R2 value is 0.7239 for this model.
For the purposes of fitting a regression model for the first time, the log transformation will only be applied to mv, log(MV) to determine whether it does indeed provide a better fit.
set.seed(12345)
#Try linear model using all features
fit.lm1 <- lm(log(MV)~.,data = training)
set.seed(12345)
#predict on test set
pred.lm1 <- predict(fit.lm1, newdata = testing)
# Root-mean squared error
rmse.lm1 <- sqrt(sum((exp(pred.lm1) - testing$MV)^2)/
length(testing$MV))
c(RMSE = rmse.lm1, R2 = summary(fit.lm1)$r.squared)
## RMSE R2
## 4.146613 0.770014
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
summary(fit.lm1)
##
## Call:
## lm(formula = log(MV) ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73926 -0.10694 -0.01578 0.10220 0.78240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4720938 0.2550150 17.537 < 2e-16 ***
## CRIM -0.0106413 0.0018236 -5.835 1.25e-08 ***
## ZN 0.0011054 0.0007049 1.568 0.11778
## INDUS 0.0017993 0.0029814 0.603 0.54658
## CHAS 0.0995050 0.0410116 2.426 0.01577 *
## NOX -0.9741578 0.1875861 -5.193 3.56e-07 ***
## RM 0.0655990 0.0205322 3.195 0.00153 **
## AGE 0.0004851 0.0006502 0.746 0.45607
## DIS -0.0541356 0.0100680 -5.377 1.41e-07 ***
## RAD 0.0148185 0.0033352 4.443 1.20e-05 ***
## TAX -0.0005851 0.0001871 -3.127 0.00192 **
## PT -0.0437995 0.0064565 -6.784 5.17e-11 ***
## B 0.0003509 0.0001342 2.614 0.00934 **
## LSTAT -0.0277881 0.0023630 -11.760 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1979 on 342 degrees of freedom
## Multiple R-squared: 0.77, Adjusted R-squared: 0.7613
## F-statistic: 88.08 on 13 and 342 DF, p-value: < 2.2e-16
# Plot of predicted price vs actual price
plot(exp(pred.lm2),testing$MV, xlab = "Predicted Price", ylab = "Actual Price", col="blue")
layout(matrix(c(1,2,3,4),2,2))
plot(fit.lm2)
The goal of this report was to determine the neighborhood attributes that best explained variation in house pricing. Various statistical techniques were used to eliminate predictors and extraneous observations. In examining the final model, one finds – quite reasonably – that house prices are higher in areas with lower crime and lower pupil-teacher ratios. House prices also tend to be higher closer to the Charles River, and houses with more rooms are pricier. This report is interested in the neighborhood attributes of houses, so the number of rooms is not an important predictor. The most interesting factors to consider are nitrogen oxide levels and distance to the main employment centers. On the one hand, people would want to live close to their place of employment. Yet it is reasonable to suggest that pollution levels are higher as one moves closer to these main employment centers. Most importantly, when talking of pollution, it is not just nitrogen oxide levels that are higher, but also noise pollution levels. The regression model that was fitted shows that higher levels of pollution decrease house prices to a greater extent than distance to employment centers. This suggests that people would prefer to live further away from their place of employment if it meant lower levels of pollution, which is an interesting point to consider. On a concluding note, it is important to note that the data for this report was collected several decades ago. In the years since, there is no doubt that pollution levels have risen and it would be interesting to examine the ways in which that affects house pricing in Boston today.