Boston housing price

OVERVIEW

In this report, I develop a method to predict housing prices in Boston suburbs using the Boston Housing Price dataset. The report walks through the entire process: loading and cleaning the data, exploring the dataset to understand the distribution of the features and their influence on the outcome, forming a hypothesis, and fitting a linear regression model.

INTRODUCTION

The dataset (Boston Housing Price) consists of 506 observations of 14 attributes. The median value of house price in $1000s, denoted by MV, is the outcome or the dependent variable in our model. Below is a brief description of each feature and the outcome in our dataset:

CRIM – per capita crime rate by town
ZN – proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS – proportion of non-retail business acres per town
CHAS – Charles River dummy variable (1 if tract bounds river; else 0)
NOX – nitric oxides concentration (parts per 10 million)
RM – average number of rooms per dwelling
AGE – proportion of owner-occupied units built prior to 1940
DIS – weighted distances to five Boston employment centres
RAD – index of accessibility to radial highways
TAX – full-value property-tax rate per $10,000
PT – pupil-teacher ratio by town
B – 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT – % lower status of the population
MV – median value of owner-occupied homes in $1000s

PROJECT DESCRIPTION

You want to be the best real estate agent out there. To compete with other agents in your area, you decide to use data analysis. You are going to use statistical analysis tools to build the best model you can for predicting the value of a given house. Your task is to find the best price at which your client can sell their house. The best model is one that generalizes well to data it has not seen. In the first part of this project, we make a cursory investigation of the Boston housing data and record our observations. Familiarizing ourselves with the data through exploration is a fundamental practice that helps us better understand and justify our results.

Since the main goal of this project is to construct a working model that can predict the value of houses, we need to distinguish the features from the target variable. The features, such as RM, LSTAT, and PT (the pupil-teacher ratio), give us quantitative information about each data point. The target variable, MV, is the variable we seek to predict. Both are columns of the Boston.df data frame loaded below.
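
If the predictors and the outcome are wanted as separate objects, the separation could be sketched as follows. This is a minimal illustration, not part of the original analysis: the helper name split_features_target, the object names features and prices, and the three-row toy data frame (with made-up values) are all assumptions. With the real data the call would be split_features_target(Boston.df, "MV").

```r
# Hypothetical helper: split a data frame into predictors and outcome.
split_features_target <- function(df, target) {
  stopifnot(target %in% names(df))
  list(
    features = df[, setdiff(names(df), target), drop = FALSE],  # all other columns
    prices   = df[[target]]                                     # outcome vector
  )
}

# Toy data standing in for Boston.df (values are made up for illustration).
toy <- data.frame(RM    = c(6.5, 5.9, 7.1),
                  LSTAT = c(5.0, 14.2, 3.1),
                  PT    = c(15.3, 19.0, 17.8),
                  MV    = c(24.0, 18.5, 34.7))

parts <- split_features_target(toy, "MV")
names(parts$features)   # predictor column names
length(parts$prices)    # number of outcome values
```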

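# Load the data (assumes Boston.csv is in the working directory)
Boston.df <- read.csv("Boston.csv")
View(Boston.df)   # open the data viewer (interactive sessions only)
dim(Boston.df)    # number of rows and columns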

## [1] 506  14

library(psych)
describe(Boston.df)

##       vars   n   mean     sd median trimmed    mad    min    max  range
## CRIM     1 506   3.61   8.60   0.26    1.68   0.33   0.01  88.98  88.97
## ZN       2 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00
## INDUS    3 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28
## CHAS     4 506   0.07   0.25   0.00    0.00   0.00   0.00   1.00   1.00
## NOX      5 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49
## RM       6 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22
## AGE      7 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10
## DIS      8 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00
## RAD      9 506   9.55   8.71   5.00    8.73   2.97   1.00  24.00  23.00
## TAX     10 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00
## PT      11 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40
## B       12 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58
## LSTAT   13 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24
## MV      14 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00
##        skew kurtosis   se
## CRIM   5.19    36.60 0.38
## ZN     2.21     3.95 1.04
## INDUS  0.29    -1.24 0.30
## CHAS   3.39     9.48 0.01
## NOX    0.72    -0.09 0.01
## RM     0.40     1.84 0.03
## AGE   -0.60    -0.98 1.25
## DIS    1.01     0.46 0.09
## RAD    1.00    -0.88 0.39
## TAX    0.67    -1.15 7.49
## PT    -0.80    -0.30 0.10
## B     -2.87     7.10 4.06
## LSTAT  0.90     0.46 0.32
## MV     1.10     1.45 0.41

t.test(Boston.df$NOX,Boston.df$MV)

## 
##  Welch Two Sample t-test
## 
## data:  Boston.df$NOX and Boston.df$MV
## t = -53.75, df = 505.16, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -22.78145 -21.17477
## sample estimates:
##  mean of x  mean of y 
##  0.5546951 22.5328064

t.test(Boston.df$PT,Boston.df$MV)

## 
##  Welch Two Sample t-test
## 
## data:  Boston.df$PT and Boston.df$MV
## t = -9.707, df = 560.79, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.902309 -3.252236
## sample estimates:
## mean of x mean of y 
##  18.45553  22.53281
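
A two-sample t-test compares the means of two columns, so when the columns are measured in different units (as NOX and MV are) a small p-value largely reflects that difference in scale. One common way to test whether a predictor is linearly associated with the outcome is a Pearson correlation test. The sketch below is only an illustration on simulated data: the variable names, the seed, and the coefficients are assumptions, and the numbers it prints say nothing about the Boston data. With the real data the call would be cor.test(Boston.df$NOX, Boston.df$MV).

```r
# Simulated stand-ins for a predictor and an outcome (made-up relationship).
set.seed(1)
n   <- 200
nox <- runif(n, min = 0.38, max = 0.87)
mv  <- 40 - 30 * nox + rnorm(n, sd = 4)

# Pearson correlation test: H0 is "true correlation equals 0".
res <- cor.test(nox, mv)

res$estimate   # sample correlation coefficient
res$p.value    # p-value for the test of zero correlation
```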

# Full model: MV regressed on all 13 predictors
Model1 <- MV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PT + B + LSTAT

fit1 <- lm(Model1, data = Boston.df)
summary(fit1)

## 
## Call:
## lm(formula = Model1, data = Boston.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## CRIM        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## ZN           4.642e-02  1.373e-02   3.382 0.000778 ***
## INDUS        2.056e-02  6.150e-02   0.334 0.738287    
## CHAS         2.687e+00  8.616e-01   3.118 0.001925 ** 
## NOX         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## RM           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## AGE          6.922e-04  1.321e-02   0.052 0.958229    
## DIS         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## RAD          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## TAX         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## PT          -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## B            9.312e-03  2.686e-03   3.467 0.000573 ***
## LSTAT       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

library(leaps)
# Exhaustive best-subset search; the default nvmax = 8 caps the subset size at 8
leap <- regsubsets(Model1, data = Boston.df, nbest=1)

summary(leap)

## Subset selection object
## Call: regsubsets.formula(Model1, data = Boston.df, nbest = 1)
## 13 Variables  (and intercept)
##       Forced in Forced out
## CRIM      FALSE      FALSE
## ZN        FALSE      FALSE
## INDUS     FALSE      FALSE
## CHAS      FALSE      FALSE
## NOX       FALSE      FALSE
## RM        FALSE      FALSE
## AGE       FALSE      FALSE
## DIS       FALSE      FALSE
## RAD       FALSE      FALSE
## TAX       FALSE      FALSE
## PT        FALSE      FALSE
## B         FALSE      FALSE
## LSTAT     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          CRIM ZN  INDUS CHAS NOX RM  AGE DIS RAD TAX PT  B   LSTAT
## 1  ( 1 ) " "  " " " "   " "  " " " " " " " " " " " " " " " " "*"  
## 2  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " " " " " "*"  
## 3  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " "*" " " "*"  
## 4  ( 1 ) " "  " " " "   " "  " " "*" " " "*" " " " " "*" " " "*"  
## 5  ( 1 ) " "  " " " "   " "  "*" "*" " " "*" " " " " "*" " " "*"  
## 6  ( 1 ) " "  " " " "   "*"  "*" "*" " " "*" " " " " "*" " " "*"  
## 7  ( 1 ) " "  " " " "   "*"  "*" "*" " " "*" " " " " "*" "*" "*"  
## 8  ( 1 ) " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*" "*" "*"

plot(leap, scale="adjr2")
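
The adjusted R-squared values behind that plot can also be read off numerically. The sketch below shows one way to do so, run on simulated data so that it is self-contained; the data frame sim, its column names, and the seed are assumptions for illustration only. With the objects in this report the equivalent calls would be summary(leap)$adjr2 and coef(leap, id = ...).

```r
library(leaps)

# Simulated data: y depends on x1 and x2; x3 is pure noise.
set.seed(42)
n   <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Exhaustive search, keeping the single best subset of each size.
search <- regsubsets(y ~ x1 + x2 + x3, data = sim, nbest = 1)

adjr2 <- summary(search)$adjr2   # adjusted R^2, one value per subset size
best  <- which.max(adjr2)        # subset size with the highest adjusted R^2
best_coefs <- coef(search, id = best)
best_coefs                       # intercept and coefficients of that subset
```

Because regsubsets stops at nvmax = 8 predictors by default, passing nvmax = 13 would be needed to let the search consider all 13 predictors at once.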

# Reduced model with five predictors
Model2 <- MV ~ RM + RAD + B + ZN + CHAS

fit2 <- lm(Model2, data = Boston.df)
summary(fit2)

## 
## Call:
## lm(formula = Model2, data = Boston.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.373  -3.239  -0.676   2.197  40.498 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -32.094063   2.835567 -11.318  < 2e-16 ***
## RM            7.857518   0.401563  19.567  < 2e-16 ***
## RAD          -0.157055   0.035333  -4.445 1.08e-05 ***
## B             0.016812   0.003241   5.188 3.09e-07 ***
## ZN            0.040391   0.012416   3.253  0.00122 ** 
## CHAS          4.186574   1.049381   3.990 7.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.94 on 500 degrees of freedom
## Multiple R-squared:  0.587,  Adjusted R-squared:  0.5829 
## F-statistic: 142.1 on 5 and 500 DF,  p-value: < 2.2e-16
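
The reduced model's adjusted R-squared (0.5829) is lower than the full model's (0.7338). Because every predictor in Model2 also appears in Model1, the two fits are nested and can be compared with a partial F-test. The sketch below illustrates the idea on simulated data; the data frame, seed, and coefficients are assumptions, and full and reduced merely play the roles of fit1 and fit2. With the objects in this report the call would be anova(fit2, fit1).

```r
# Simulated data in which all three predictors matter.
set.seed(7)
n   <- 120
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 1.5 * sim$x2 + 1 * sim$x3 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3, data = sim)  # plays the role of fit1
reduced <- lm(y ~ x1, data = sim)            # plays the role of fit2

# Adjusted R^2 of each fit
adj <- c(full    = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)
adj

# Partial F-test: do the extra predictors in `full` explain additional variance?
cmp <- anova(reduced, full)
cmp
```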

# leaps is already loaded; rerun the subset search on the reduced model
leap <- regsubsets(Model2, data = Boston.df, nbest=1)
summary(leap)

## Subset selection object
## Call: regsubsets.formula(Model2, data = Boston.df, nbest = 1)
## 5 Variables  (and intercept)
##      Forced in Forced out
## RM       FALSE      FALSE
## RAD      FALSE      FALSE
## B        FALSE      FALSE
## ZN       FALSE      FALSE
## CHAS     FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          RM  RAD B   ZN  CHAS
## 1  ( 1 ) "*" " " " " " " " " 
## 2  ( 1 ) "*" " " "*" " " " " 
## 3  ( 1 ) "*" "*" "*" " " " " 
## 4  ( 1 ) "*" "*" "*" " " "*" 
## 5  ( 1 ) "*" "*" "*" "*" "*"

plot(leap, scale="adjr2")
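
The fits above are scored on the same rows they were trained on, which does not show how well a model generalizes. A hold-out check is one simple way to estimate that. The sketch below is a minimal version on simulated data; the 80/20 split, the seed, the column names, and the coefficients are assumptions and do not come from the original analysis. With the real data, sim would be replaced by Boston.df and the formula by Model1 or Model2.

```r
# Simulated housing-like data (made-up relationship).
set.seed(123)
n   <- 200
sim <- data.frame(rm = rnorm(n, mean = 6.3, sd = 0.7),
                  lstat = runif(n, min = 2, max = 35))
sim$mv <- -5 + 5 * sim$rm - 0.5 * sim$lstat + rnorm(n, sd = 3)

# Random 80/20 split into training and test rows.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit  <- lm(mv ~ rm + lstat, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows (same units as mv).
rmse <- sqrt(mean((test$mv - pred)^2))
rmse
```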

CONCLUSION

The analysis of this dataset shows that nitric oxides concentration (NOX) and pupil-teacher ratio (PT) are both negatively associated with the median value of owner-occupied homes in Boston: in the full regression model their coefficients are -17.77 and -0.95, each with p < 0.001. They are not the strongest predictors, however; LSTAT and RM have the largest t-statistics in that model (-10.35 and 9.12). The data were collected in 1978 and may already be out of date. Even after adjusting prices for inflation, the other features may have changed over time. The pupil-teacher ratio may matter less in today's market: transportation is far more convenient than in 1978, so students have more opportunities to attend schools outside their neighborhood. Finally, data collected in an urban area like Boston would not necessarily apply to a rural area.

Punith Kumar

1/3/2018