PROJECT REPORT: The Influence of Various Factors on Housing Prices

1. Introduction

Residential real estate consists of single-family or multifamily structures that are available for occupation or for other non-business purposes. Residences can be classified by whether and how they are connected to neighbouring residences and land. The same physical type of housing can be held under different types of tenure.

2. Overview

The residential real estate market of any city, state or nation is affected by changes in various socio-economic factors, and buyers often face escalating prices, bidding wars and prolonged search periods as they enter an increasingly competitive market. Several micro-factors also play a role, such as property location, updates and upgrades, comparable neighbourhood properties, unemployment, and proximity to local business centers and education centers. Hence, using this dataset, we try to understand the micro and macro socio-economic factors that influence housing prices.

3. An Empirical Field Study of the Influence of Various Factors on Housing Prices

3.1 Overview

The aim is to analyse a dataset from a US city in order to estimate how housing values change under the influence of social and economic factors such as crime rate, full-value property-tax rate and accessibility to radial highways. Two hypotheses are tested:

Hypothesis 1: If the house’s tract bounds the river, then the median value of the house is affected.

Hypothesis 2: Accessibility to radial highways affects the median value of houses.

3.2 Data

Source of the data: https://www.kaggle.com/apratim87/housingdata/data

Number of data columns: 14

Number of rows: 506

There are 14 attributes in each case of the dataset. They are:

CRIM - per capita crime rate by town

ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS - proportion of non-retail business acres per town.

CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX - nitric oxides concentration (parts per 10 million)

RM - average number of rooms per dwelling

AGE - proportion of owner-occupied units built prior to 1940

DIS - weighted distances to five employment centers

RAD - index of accessibility to radial highways

TAX - full-value property-tax rate per $10,000

PTRATIO - pupil-teacher ratio by town

B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT - % lower status of the population

MEDV - Median value of owner-occupied homes in $1000’s

3.3 Model

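For Hypotheses 1 and 2, a simple linear regression model has been used. In both fits below, MEDV appears as the predictor rather than the response. In a simple linear regression the t-test on the slope gives the same result whichever of the two variables is treated as the response, so the reported p-values still test the association stated in each hypothesis.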

For Hypothesis 1, we have used:

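# Load the housing data (the CSV file is expected in the working directory)
house.df <- read.csv("housingdata.csv")
View(house.df)  # opens the interactive data viewer
# Simple linear regression of the river dummy (CHAS) on the median value (MEDV)
fit1 <- lm(CHAS ~ MEDV, data = house.df)
summary(fit1)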
## 
## Call:
## lm(formula = CHAS ~ MEDV, data = house.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20211 -0.07869 -0.05860 -0.03223  0.97503 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.039891   0.029471  -1.354    0.176    
## MEDV         0.004840   0.001211   3.996 7.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2503 on 504 degrees of freedom
## Multiple R-squared:  0.03072,    Adjusted R-squared:  0.02879 
## F-statistic: 15.97 on 1 and 504 DF,  p-value: 7.391e-05
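
The fit above has CHAS on the left-hand side. A possible alternative formulation, sketched here, puts MEDV on the left so that the slope can be read directly as the difference in mean MEDV between tracts that bound the river and tracts that do not. The sketch assumes that the `Boston` data frame from the MASS package contains the same 506 records as housingdata.csv (with lower-case column names); `fit1.rev` and the other new names are illustrative only.

```r
library(MASS)  # provides the Boston data frame

# Assumption: MASS::Boston matches housingdata.csv apart from lower-case names.
house.df <- setNames(Boston, toupper(names(Boston)))

# Hypothesis 1 with MEDV as the response and CHAS as the predictor.
fit1.rev <- lm(MEDV ~ CHAS, data = house.df)

# With a 0/1 predictor, the slope equals the difference in mean MEDV
# between river-bounding tracts (CHAS = 1) and all other tracts (CHAS = 0).
group.means <- tapply(house.df$MEDV, house.df$CHAS, mean)
slope <- unname(coef(fit1.rev)["CHAS"])
mean.diff <- unname(group.means["1"] - group.means["0"])

# The slope t statistic does not depend on which variable is the response.
t.rev <- summary(fit1.rev)$coefficients["CHAS", "t value"]
t.orig <- summary(lm(CHAS ~ MEDV, data = house.df))$coefficients["MEDV", "t value"]

print(c(slope = slope, mean.diff = mean.diff, t.rev = t.rev, t.orig = t.orig))
```

Because the two t statistics coincide, the significance conclusion is unchanged; only the interpretation of the estimate becomes more direct.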

For Hypothesis 2, we have used:

# Simple linear regression of the highway accessibility index (RAD) on MEDV
fit2 <- lm(RAD ~ MEDV, data = house.df)
summary(fit2)
## 
## Call:
## lm(formula = RAD ~ MEDV, data = house.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.391  -5.862  -3.658   8.893  24.375 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.69052    0.94853  18.650   <2e-16 ***
## MEDV        -0.36130    0.03898  -9.269   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.056 on 504 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1439 
## F-statistic: 85.91 on 1 and 504 DF,  p-value: < 2.2e-16
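
RAD is an index that takes only nine distinct values in this dataset (1 to 8 and 24, see the contingency table in the Appendix), so treating it as a continuous number is a modelling choice. The following sketch, which is not part of the original analysis, compares a fit that is linear in RAD with one that treats RAD as a categorical factor. It assumes that `Boston` from the MASS package holds the same data as housingdata.csv; `fit2.lin`, `fit2.cat` and `rad.anova` are illustrative names.

```r
library(MASS)  # provides the Boston data frame

# Assumption: MASS::Boston matches housingdata.csv apart from lower-case names.
house.df <- setNames(Boston, toupper(names(Boston)))

# MEDV as the response, RAD treated as a numeric index.
fit2.lin <- lm(MEDV ~ RAD, data = house.df)

# MEDV as the response, RAD treated as an unordered category.
fit2.cat <- lm(MEDV ~ factor(RAD), data = house.df)

# The linear model is nested in the factor model, so an F-test can check
# whether the extra flexibility of the factor model is supported by the data.
rad.anova <- anova(fit2.lin, fit2.cat)
print(rad.anova)
```

A small p-value in the second row of the table would suggest that the relationship between RAD and MEDV is not well described by a single straight line.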

Linear Regression:

To study the influence of the various factors on MEDV (median value of owner-occupied homes in $1000s), MEDV is regressed on all attributes except B:

fit3<- lm(MEDV~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+
     LSTAT,data=house.df)
summary(fit3)
## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + 
##     DIS + RAD + TAX + PTRATIO + LSTAT, data = house.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1304  -2.7673  -0.5814   1.9414  26.2526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
## CRIM         -0.121389   0.033000  -3.678 0.000261 ***
## ZN            0.046963   0.013879   3.384 0.000772 ***
## INDUS         0.013468   0.062145   0.217 0.828520    
## CHAS          2.839993   0.870007   3.264 0.001173 ** 
## NOX         -18.758022   3.851355  -4.870 1.50e-06 ***
## RM            3.658119   0.420246   8.705  < 2e-16 ***
## AGE           0.003611   0.013329   0.271 0.786595    
## DIS          -1.490754   0.201623  -7.394 6.17e-13 ***
## RAD           0.289405   0.066908   4.325 1.84e-05 ***
## TAX          -0.012682   0.003801  -3.337 0.000912 ***
## PTRATIO      -0.937533   0.132206  -7.091 4.63e-12 ***
## LSTAT        -0.552019   0.050659 -10.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
## F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16

3.4 Results

For Hypothesis 1, the p-value (7.39e-05) is less than 0.05, indicating a statistically significant association between CHAS and MEDV. The positive estimate shows that the median value of owner-occupied homes tends to be higher when the house’s tract bounds the river.

For Hypothesis 2, the p-value (< 2e-16) is less than 0.05, indicating a statistically significant association between RAD and MEDV. The negative estimate shows that the median value of owner-occupied homes tends to be lower where the index of accessibility to radial highways is higher.

The multiple linear regression conducted to study the influence of the relevant factors on MEDV gives us the following insights (at the 0.05 significance level):

  1. CRIM (per capita crime rate by town), NOX (nitric oxides concentration), DIS (weighted distances to five employment centers), TAX (full-value property-tax rate per $10,000), PTRATIO (pupil-teacher ratio by town) and LSTAT (% lower status of the population) have a significant negative effect on MEDV.

  2. INDUS (proportion of non-retail business acres per town) and AGE (proportion of owner-occupied units built prior to 1940) do not have a statistically significant effect on MEDV.

  3. ZN (proportion of residential land zoned for lots over 25,000 sq.ft.), CHAS (Charles River dummy variable), RM (average number of rooms per dwelling) and RAD (index of accessibility to radial highways) have a significant positive effect on MEDV. Note that the coefficient of RAD is positive here, whereas the simple regression for Hypothesis 2 showed a negative association; the sign changes once the other factors are held constant.
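
The grouping of the predictors into negative, positive and non-significant effects can also be derived programmatically from the coefficient table. The sketch below is one way to do this; it assumes that `Boston` from the MASS package holds the same data as housingdata.csv and uses 0.05 as the significance threshold. The names `alpha`, `effect`, `groups`, `rad.simple` and `rad.multiple` are illustrative.

```r
library(MASS)  # provides the Boston data frame

# Assumption: MASS::Boston matches housingdata.csv apart from lower-case names.
house.df <- setNames(Boston, toupper(names(Boston)))

# Same formula as fit3 in the report (all attributes except B).
fit3 <- lm(MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD +
             TAX + PTRATIO + LSTAT, data = house.df)

# Coefficient table without the intercept row.
coefs <- summary(fit3)$coefficients
coefs <- coefs[rownames(coefs) != "(Intercept)", ]

# Label each predictor by significance and sign of its estimate.
alpha <- 0.05  # assumed significance threshold
significant <- coefs[, "Pr(>|t|)"] < alpha
effect <- ifelse(!significant, "not significant",
                 ifelse(coefs[, "Estimate"] < 0, "negative", "positive"))
groups <- split(rownames(coefs), effect)
print(groups)

# Slope of RAD on its own versus alongside the other predictors.
rad.simple <- unname(coef(lm(MEDV ~ RAD, data = house.df))["RAD"])
rad.multiple <- unname(coef(fit3)["RAD"])
print(c(rad.simple = rad.simple, rad.multiple = rad.multiple))
```

Comparing `rad.simple` with `rad.multiple` makes it easy to see how the estimated effect of RAD depends on which other factors are included in the model.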

4. Conclusion

The motivation behind this paper was to research how socio-economic structures influence housing prices within a city. The contribution of this paper is an investigation of the influence of various socio-economic factors on the median value of residential properties. We found that factors such as crime, distance to employment centres, taxes, the pupil-teacher ratio in neighbourhood public schools and property location affect property value, which helps us to understand the psychology of the general public in choosing a place of residence.

5. References:

  1. https://www.kaggle.com/apratim87/housingdata/data

  2. https://en.wikipedia.org/wiki/Real_estate#Residential_real_estate

  3. http://resources.point.com/8-biggest-factors-affect-real-estate-prices/

Appendix

STATISTICS

Dimensions of the dataset (rows and columns)

dim(house.df)
## [1] 506  14

Structure of the dataframe

str(house.df)
## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B      : num  397 397 393 395 397 ...
##  $ LSTAT  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Summary and description of the dataset

summary(house.df)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX           PTRATIO            B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT            MEDV      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
library(psych)
describe(house.df)
##         vars   n   mean     sd median trimmed    mad    min    max  range
## CRIM       1 506   3.61   8.60   0.26    1.68   0.33   0.01  88.98  88.97
## ZN         2 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00
## INDUS      3 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28
## CHAS       4 506   0.07   0.25   0.00    0.00   0.00   0.00   1.00   1.00
## NOX        5 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49
## RM         6 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22
## AGE        7 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10
## DIS        8 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00
## RAD        9 506   9.55   8.71   5.00    8.73   2.97   1.00  24.00  23.00
## TAX       10 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00
## PTRATIO   11 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40
## B         12 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58
## LSTAT     13 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24
## MEDV      14 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00
##          skew kurtosis   se
## CRIM     5.19    36.60 0.38
## ZN       2.21     3.95 1.04
## INDUS    0.29    -1.24 0.30
## CHAS     3.39     9.48 0.01
## NOX      0.72    -0.09 0.01
## RM       0.40     1.84 0.03
## AGE     -0.60    -0.98 1.25
## DIS      1.01     0.46 0.09
## RAD      1.00    -0.88 0.39
## TAX      0.67    -1.15 7.49
## PTRATIO -0.80    -0.30 0.10
## B       -2.87     7.10 4.06
## LSTAT    0.90     0.46 0.32
## MEDV     1.10     1.45 0.41

Contingency Tables

table(house.df$CHAS)
## 
##   0   1 
## 471  35
table(house.df$RAD)
## 
##   1   2   3   4   5   6   7   8  24 
##  20  24  38 110 115  26  17  24 132

VISUAL REPRESENTATIONS

plot(house.df$MEDV, house.df$CRIM, main = "Plot of CRIM vs. MEDV",
     ylab = "per capita crime rate by town",
     xlab = "Median value of owner-occupied homes in $1000's")

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(house.df$CHAS, house.df$MEDV,
            ylab = "Median value of owner-occupied homes in $1000's",
            xlab = "Charles River dummy variable")

scatterplot(house.df$RAD, house.df$MEDV,
            ylab = "Median value of owner-occupied homes in $1000's",
            xlab = "index of accessibility to radial highways")

library(corrgram)
corrgram(house.df, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, 
         text.panel=panel.txt, main="Corrgram of housing dataset")

# Keep every column except B, which is not used in the multiple regression
lr.df <- house.df[, names(house.df) != "B"]
corrgram(lr.df, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, 
         text.panel=panel.txt, main="Corrgram of housing dataset (linear regression)")
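
The corrgram shows the pairwise correlations graphically. As a complement, the numeric correlations of each predictor with MEDV can be listed with base R alone. This sketch assumes that `Boston` from the MASS package holds the same data as housingdata.csv, with its twelfth column corresponding to B; `medv.cor` is an illustrative name.

```r
library(MASS)  # provides the Boston data frame

# Assumption: MASS::Boston matches housingdata.csv apart from lower-case names.
house.df <- setNames(Boston, toupper(names(Boston)))

# Drop the twelfth column (B in the CSV), mirroring lr.df in the report.
lr.df <- house.df[, -12]

# Pearson correlation of every remaining column with MEDV, sorted ascending.
medv.cor <- sort(cor(lr.df)[, "MEDV"])
medv.cor <- medv.cor[names(medv.cor) != "MEDV"]  # remove the self-correlation
print(round(medv.cor, 2))
```

The first and last entries of `medv.cor` are the predictors with the strongest negative and strongest positive linear association with MEDV.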

RELEVANT TESTS

t.test(house.df$CHAS,house.df$MEDV)
## 
##  Welch Two Sample t-test
## 
## data:  house.df$CHAS and house.df$MEDV
## t = -54.921, df = 505.77, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.26722 -21.66005
## sample estimates:
##   mean of x   mean of y 
##  0.06916996 22.53280632
t.test(house.df$RAD, house.df$MEDV)
## 
##  Welch Two Sample t-test
## 
## data:  house.df$RAD and house.df$MEDV
## t = -23.06, df = 1007, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.08824 -11.87855
## sample estimates:
## mean of x mean of y 
##  9.549407 22.532806
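
The two tests above compare the mean of CHAS, and of RAD, with the mean of MEDV, that is, they compare quantities measured on different scales. A test that addresses the hypotheses more directly would compare MEDV between the two CHAS groups and measure the association between RAD and MEDV. The following is a sketch of that approach, not the analysis carried out in this report. It assumes that `Boston` from the MASS package holds the same data as housingdata.csv; the choice of a Spearman rank correlation for the RAD index is also an assumption, and `chas.test` and `rad.test` are illustrative names.

```r
library(MASS)  # provides the Boston data frame

# Assumption: MASS::Boston matches housingdata.csv apart from lower-case names.
house.df <- setNames(Boston, toupper(names(Boston)))

# Hypothesis 1: Welch two-sample t-test of MEDV between tracts that bound
# the river (CHAS = 1) and tracts that do not (CHAS = 0).
chas.test <- t.test(MEDV ~ CHAS, data = house.df)
print(chas.test)

# Hypothesis 2: rank correlation between the RAD index and MEDV.
# exact = FALSE requests the asymptotic p-value, since RAD has many ties.
rad.test <- cor.test(house.df$RAD, house.df$MEDV,
                     method = "spearman", exact = FALSE)
print(rad.test)
```

The `estimate` component of `chas.test` holds the two group means of MEDV, and the `estimate` component of `rad.test` holds the rank correlation coefficient.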