OVERVIEW
In this report, I develop a method to predict housing prices in the Boston suburbs using the Boston Housing Price dataset. The report is organized to demonstrate the entire process: getting and cleaning the data, exploratory analysis to understand the distribution and importance of the features that influence the model, forming a hypothesis, and fitting a linear regression model.
INTRODUCTION
The dataset (Boston Housing Price) consists of 506 observations of 14 attributes. The median house value in $1000s, denoted MV, is the outcome (the dependent variable) in our model. Below is a brief description of each feature and the outcome in our dataset:
CRIM – per capita crime rate by town
ZN – proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS – proportion of non-retail business acres per town
CHAS – Charles River dummy variable (1 if tract bounds river; else 0)
NOX – nitric oxides concentration (parts per 10 million)
RM – average number of rooms per dwelling
AGE – proportion of owner-occupied units built prior to 1940
DIS – weighted distances to five Boston employment centres
RAD – index of accessibility to radial highways
TAX – full-value property-tax rate per $10,000
PT – pupil-teacher ratio by town
B – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT – % lower status of the population
MV – median value of owner-occupied homes in $1000s
PROJECT DESCRIPTION
You want to be the best real estate agent out there. To compete with other agents in your area, you decide to use data analysis. You are going to use various statistical analysis tools to build the best model to predict the value of a given house. Your task is to find the best price at which your client can sell their house. The best guess from a model is the one that generalizes best to unseen data. In this first section of the project, we make a cursory investigation of the Boston housing data and provide our observations. Familiarizing ourselves with the data through an exploratory process is a fundamental practice that helps us better understand and justify our results.
Since the main goal of this project is to construct a working model capable of predicting the value of houses, we need to separate the dataset into features and the target variable. The features, for example 'RM', 'LSTAT', and 'PT' (named 'PTRATIO' in some versions of the dataset), give us quantitative information about each data point. The target variable, 'MV' (named 'MEDV' in some versions), is the variable we seek to predict. In the code below, features and target are kept together as columns of the data frame Boston.df.
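If separate objects for the features and the target are preferred, the split can be sketched as follows. This is a minimal illustration, not part of the original analysis: it assumes the Boston data bundled with the MASS package matches Boston.csv, renames its columns to the names used in this report, and the object names features and prices are chosen here for illustration.

```r
# Assumption: MASS::Boston stands in for Boston.csv (same 506 x 14 data,
# but with lowercase column names), so it is renamed to match this report.
Boston.df <- MASS::Boston
names(Boston.df) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                      "DIS", "RAD", "TAX", "PT", "B", "LSTAT", "MV")

# Three example features and the target, as separate objects
features <- Boston.df[, c("RM", "LSTAT", "PT")]
prices <- Boston.df$MV

str(features)    # structure of the feature data frame
summary(prices)  # five-number summary and mean of the target
```

Keeping everything in one data frame, as the rest of this report does, works equally well with lm(), because the formula interface selects columns by name.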
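# Load the Boston housing data from a CSV file in the working directory
Boston.df <- read.csv("Boston.csv")
# Open the data in the spreadsheet-style viewer (interactive sessions only)
View(Boston.df)
# Report the number of rows and columns
dim(Boston.df)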
## [1] 506 14
library(psych)
describe(Boston.df)
## vars n mean sd median trimmed mad min max range
## CRIM 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97
## ZN 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00
## INDUS 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28
## CHAS 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00
## NOX 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49
## RM 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22
## AGE 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10
## DIS 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00
## RAD 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00
## TAX 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00
## PT 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40
## B 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58
## LSTAT 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24
## MV 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00
## skew kurtosis se
## CRIM 5.19 36.60 0.38
## ZN 2.21 3.95 1.04
## INDUS 0.29 -1.24 0.30
## CHAS 3.39 9.48 0.01
## NOX 0.72 -0.09 0.01
## RM 0.40 1.84 0.03
## AGE -0.60 -0.98 1.25
## DIS 1.01 0.46 0.09
## RAD 1.00 -0.88 0.39
## TAX 0.67 -1.15 7.49
## PT -0.80 -0.30 0.10
## B -2.87 7.10 4.06
## LSTAT 0.90 0.46 0.32
## MV 1.10 1.45 0.41
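As a complement to the summary statistics above, one simple way to gauge how strongly each feature is related to MV is the pairwise Pearson correlation. The sketch below is an illustration under assumptions: it uses the Boston data from the MASS package as a stand-in for Boston.csv, and the names feature_names, cors and cors_sorted are introduced here only for the example.

```r
# Assumption: MASS::Boston stands in for Boston.csv; columns are renamed
# to the names used in this report.
Boston.df <- MASS::Boston
names(Boston.df) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                      "DIS", "RAD", "TAX", "PT", "B", "LSTAT", "MV")

# Pearson correlation of every feature with the target MV
feature_names <- setdiff(names(Boston.df), "MV")
cors <- sapply(Boston.df[feature_names], function(x) cor(x, Boston.df$MV))

# Sort by absolute strength so the most related features come first
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 2))
```

Correlation only captures linear, one-at-a-time relationships, so it is a screening aid rather than a substitute for the regression fitted later.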
t.test(Boston.df$NOX,Boston.df$MV)
##
## Welch Two Sample t-test
##
## data: Boston.df$NOX and Boston.df$MV
## t = -53.75, df = 505.16, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -22.78145 -21.17477
## sample estimates:
## mean of x mean of y
## 0.5546951 22.5328064
t.test(Boston.df$PT,Boston.df$MV)
##
## Welch Two Sample t-test
##
## data: Boston.df$PT and Boston.df$MV
## t = -9.707, df = 560.79, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.902309 -3.252236
## sample estimates:
## mean of x mean of y
## 18.45553 22.53281
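The Welch tests above compare the mean of NOX (or PT) with the mean of MV. Because these variables are measured in different units, a significant difference in means mainly reflects their scales and does not by itself show that one variable affects the other. A possible alternative, sketched here under the assumption that the MASS package's Boston data matches Boston.csv, is to test the association between each feature and MV with cor.test(). The names nox_test and pt_test are hypothetical.

```r
# Assumption: MASS::Boston stands in for Boston.csv; columns are renamed
# to the names used in this report.
Boston.df <- MASS::Boston
names(Boston.df) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                      "DIS", "RAD", "TAX", "PT", "B", "LSTAT", "MV")

# Test H0: the Pearson correlation between the feature and MV is zero
nox_test <- cor.test(Boston.df$NOX, Boston.df$MV)
pt_test <- cor.test(Boston.df$PT, Boston.df$MV)

# Estimated correlations and their p-values
print(c(NOX = unname(nox_test$estimate), PT = unname(pt_test$estimate)))
print(c(NOX = nox_test$p.value, PT = pt_test$p.value))
```

A negative estimate with a small p-value would support the hypothesis that higher NOX or PT goes with lower home values, which the regression below then examines while controlling for the other features.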
# Full model: MV regressed on all 13 features
Model1 <- MV ~
CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PT + B + LSTAT
fit1 <- lm(Model1, data = Boston.df)
summary(fit1)
##
## Call:
## lm(formula = Model1, data = Boston.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
## ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
## INDUS 2.056e-02 6.150e-02 0.334 0.738287
## CHAS 2.687e+00 8.616e-01 3.118 0.001925 **
## NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## AGE 6.922e-04 1.321e-02 0.052 0.958229
## DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## RAD 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
## PT -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## B 9.312e-03 2.686e-03 3.467 0.000573 ***
## LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
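Since the stated goal is to predict the value of a given house, a fitted model like the one summarized above can be used with predict(). The following is a minimal sketch, not the author's procedure: it assumes MASS::Boston matches Boston.csv, and the "typical" tract built from column medians is a made-up input used only to show the call.

```r
# Assumption: MASS::Boston stands in for Boston.csv; columns are renamed
# to the names used in this report.
Boston.df <- MASS::Boston
names(Boston.df) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                      "DIS", "RAD", "TAX", "PT", "B", "LSTAT", "MV")

# Full model on all 13 features (MV ~ . means "all other columns")
fit1 <- lm(MV ~ ., data = Boston.df)

# Hypothetical input: a tract whose features all equal the column medians
typical <- as.data.frame(lapply(Boston.df[names(Boston.df) != "MV"], median))

# Point prediction with a 95% prediction interval, in $1000s
pred <- predict(fit1, newdata = typical, interval = "prediction", level = 0.95)
print(pred)
```

The prediction interval is usually more useful to a seller than the point estimate alone, because it reflects the residual standard error of the model.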
library(leaps)
# Exhaustive best-subset search over the features in Model1 (sizes 1 to 8 by default)
leap <- regsubsets(Model1, data = Boston.df, nbest=1)
summary(leap)
## Subset selection object
## Call: regsubsets.formula(Model1, data = Boston.df, nbest = 1)
## 13 Variables (and intercept)
## Forced in Forced out
## CRIM FALSE FALSE
## ZN FALSE FALSE
## INDUS FALSE FALSE
## CHAS FALSE FALSE
## NOX FALSE FALSE
## RM FALSE FALSE
## AGE FALSE FALSE
## DIS FALSE FALSE
## RAD FALSE FALSE
## TAX FALSE FALSE
## PT FALSE FALSE
## B FALSE FALSE
## LSTAT FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT B LSTAT
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 7 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 8 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
plot(leap, scale="adjr2")
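The subset table and plot can also be read programmatically. The sketch below is an illustration under assumptions: it uses MASS::Boston as a stand-in for Boston.csv and requires the leaps package; leap_summary and best_size are names introduced here. It pulls the adjusted R-squared for each subset size and lists the variables of the subset with the highest value.

```r
library(leaps)

# Assumption: MASS::Boston stands in for Boston.csv; columns are renamed
# to the names used in this report.
Boston.df <- MASS::Boston
names(Boston.df) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                      "DIS", "RAD", "TAX", "PT", "B", "LSTAT", "MV")

# Best subset of each size; regsubsets() stops at 8 variables by default
leap <- regsubsets(MV ~ ., data = Boston.df, nbest = 1)
leap_summary <- summary(leap)

# Adjusted R-squared for sizes 1..8 and the size that maximizes it
print(round(leap_summary$adjr2, 4))
best_size <- which.max(leap_summary$adjr2)

# Coefficient names (intercept plus selected features) of that subset
print(names(coef(leap, best_size)))
```

Passing nvmax = 13 to regsubsets() would extend the search to all 13 features instead of stopping at 8.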
Model2 <- MV ~ RM + RAD + B + ZN + CHAS
fit2 <- lm(Model2, data = Boston.df)
summary(fit2)
##
## Call:
## lm(formula = Model2, data = Boston.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.373 -3.239 -0.676 2.197 40.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.094063 2.835567 -11.318 < 2e-16 ***
## RM 7.857518 0.401563 19.567 < 2e-16 ***
## RAD -0.157055 0.035333 -4.445 1.08e-05 ***
## B 0.016812 0.003241 5.188 3.09e-07 ***
## ZN 0.040391 0.012416 3.253 0.00122 **
## CHAS 4.186574 1.049381 3.990 7.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.94 on 500 degrees of freedom
## Multiple R-squared: 0.587, Adjusted R-squared: 0.5829
## F-statistic: 142.1 on 5 and 500 DF, p-value: < 2.2e-16
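Both summaries so far report in-sample fit. Because the project description asks for the model that generalizes best, one way to compare the full and reduced models is on held-out data. The sketch below is a hedged illustration rather than part of the original analysis: it assumes MASS::Boston matches Boston.csv, and the 80/20 split, the seed, and the helper rmse() are choices made here for the example.

```r
# Assumption: MASS::Boston stands in for Boston.csv; columns are renamed
# to the names used in this report.
Boston.df <- MASS::Boston
names(Boston.df) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                      "DIS", "RAD", "TAX", "PT", "B", "LSTAT", "MV")

# Hold out roughly 20% of the rows for testing (seed chosen arbitrarily)
set.seed(42)
n <- nrow(Boston.df)
test_idx <- sample(n, size = round(0.2 * n))
train <- Boston.df[-test_idx, ]
test <- Boston.df[test_idx, ]

# Fit both models on the training rows only
full <- lm(MV ~ ., data = train)
reduced <- lm(MV ~ RM + RAD + B + ZN + CHAS, data = train)

# Hypothetical helper: root mean squared error on new data, in $1000s
rmse <- function(model, newdata) {
  sqrt(mean((newdata$MV - predict(model, newdata))^2))
}

rmse_full <- rmse(full, test)
rmse_reduced <- rmse(reduced, test)
print(c(full = rmse_full, reduced = rmse_reduced))
```

A single split is noisy, so repeating it with different seeds or using k-fold cross-validation would give a steadier comparison.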
library(leaps)
leap <- regsubsets(Model2, data = Boston.df, nbest=1)
summary(leap)
## Subset selection object
## Call: regsubsets.formula(Model2, data = Boston.df, nbest = 1)
## 5 Variables (and intercept)
## Forced in Forced out
## RM FALSE FALSE
## RAD FALSE FALSE
## B FALSE FALSE
## ZN FALSE FALSE
## CHAS FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## RM RAD B ZN CHAS
## 1 ( 1 ) "*" " " " " " " " "
## 2 ( 1 ) "*" " " "*" " " " "
## 3 ( 1 ) "*" "*" "*" " " " "
## 4 ( 1 ) "*" "*" "*" " " "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
plot(leap, scale="adjr2")
CONCLUSION
The analysis of this dataset shows that nitric oxides concentration (NOX) and pupil-teacher ratio (PT) both have a strongly adverse effect on the median value of owner-occupied homes in Boston: in the full model (fit1) their coefficients are negative and highly significant (-17.77 and -0.95, respectively). These coefficients are on different scales, so they do not by themselves rank the features; by t value, LSTAT and RM are the strongest predictors in fit1. The data were collected in 1978 and may already be out of date. Even if inflation is taken into account, the other features may also have changed over time. The PT feature may not be very important in today's market: transportation is much more convenient now than in 1978, so students have more opportunity to attend schools far from their neighborhood. Finally, data collected in an urban area like Boston would not necessarily apply to a rural one.