Residential real estate may contain either a single-family or a multifamily structure that is available for occupation or for non-business purposes. Residences can be classified by whether and how they are connected to neighbouring residences and land. Different types of housing tenure can apply to the same physical type.
The residential real estate market of any city, state or nation is affected by changes in various socio-economic factors, and buyers often face escalating prices, bidding wars and prolonged search periods as they enter an increasingly competitive market. Several micro-factors play a role, such as property location, updates and upgrades, comparable neighbourhood properties, unemployment, and proximity to local business and education centers. Using this dataset, we therefore try to understand the micro and macro socio-economic factors that influence housing prices.
The aim is to analyse a dataset from a US city in order to estimate how housing values change under the influence of social and economic factors such as crime rate, full-value property-tax rate and accessibility to radial highways. Two hypotheses are considered:
Hypothesis 1: If the house’s tract bounds the river, then the median value of the house is affected.
Hypothesis 2: Accessibility to radial highways affects the median value of houses.
Source of the data: https://www.kaggle.com/apratim87/housingdata/data
Number of data columns: 14
Number of rows: 506
There are 14 attributes in each case of the dataset. They are:
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five employment centers
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000’s
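Before modelling, it can help to confirm that a loaded file really contains the 14 attributes listed above. The following is a minimal sketch of such a check; the helper name `check_housing_columns` and the toy data frame are illustrative assumptions and are not part of the original analysis.

```r
# Expected column names, taken from the attribute list above.
expected_cols <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                   "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")

# Hypothetical helper: returns the expected columns that are missing from
# `df`, and stops if CHAS takes values other than 0/1.
check_housing_columns <- function(df) {
  missing_cols <- setdiff(expected_cols, names(df))
  if ("CHAS" %in% names(df) && !all(df$CHAS %in% c(0, 1))) {
    stop("CHAS should be a 0/1 dummy variable")
  }
  missing_cols
}

# Toy data frame (made-up values) that lacks the B column.
toy.df <- as.data.frame(matrix(1, nrow = 3, ncol = 13))
names(toy.df) <- setdiff(expected_cols, "B")
missing_in_toy <- check_housing_columns(toy.df)
print(missing_in_toy)
```

On the real data the same call would be `check_housing_columns(house.df)`, where an empty result means every expected column is present.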
Linear regression models were used for both Hypothesis 1 and Hypothesis 2.
For Hypothesis 1, we have used:
# Load the housing data (the file is read from the working directory)
house.df <- read.csv("housingdata.csv")
View(house.df)
fit1<- lm(CHAS~MEDV, data=house.df)
summary(fit1)
##
## Call:
## lm(formula = CHAS ~ MEDV, data = house.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20211 -0.07869 -0.05860 -0.03223 0.97503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.039891 0.029471 -1.354 0.176
## MEDV 0.004840 0.001211 3.996 7.39e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2503 on 504 degrees of freedom
## Multiple R-squared: 0.03072, Adjusted R-squared: 0.02879
## F-statistic: 15.97 on 1 and 504 DF, p-value: 7.391e-05
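Hypothesis 1 treats MEDV as the outcome, whereas the model above has CHAS as the response. In a simple linear regression with one predictor, the t statistic of the slope is the same in either direction, so the p-value carries over, but the slope itself has a different meaning. The sketch below illustrates this on simulated data; the data frame `sim.df` and the object names are assumptions made for illustration, and the numbers are not results from the housing dataset.

```r
set.seed(1)
# Simulated stand-in for the housing data (values are made up).
n <- 200
sim.df <- data.frame(CHAS = rbinom(n, 1, 0.1))
sim.df$MEDV <- 22 + 6 * sim.df$CHAS + rnorm(n, sd = 9)

# Response = MEDV, predictor = CHAS: the slope is the difference in
# mean MEDV between tracts that bound the river and those that do not.
fit_medv_on_chas <- lm(MEDV ~ CHAS, data = sim.df)
# Reversed direction, as in fit1.
fit_chas_on_medv <- lm(CHAS ~ MEDV, data = sim.df)

t_forward  <- coef(summary(fit_medv_on_chas))["CHAS", "t value"]
t_reversed <- coef(summary(fit_chas_on_medv))["MEDV", "t value"]

# The two t statistics agree up to floating-point error.
print(c(forward = t_forward, reversed = t_reversed))

# The slope of MEDV ~ CHAS equals the difference in group means.
group_means <- tapply(sim.df$MEDV, sim.df$CHAS, mean)
mean_difference <- unname(group_means["1"] - group_means["0"])
slope_forward <- unname(coef(fit_medv_on_chas)["CHAS"])
print(c(mean_difference = mean_difference, slope = slope_forward))
```

With the real data, `lm(MEDV ~ CHAS, data = house.df)` would express the effect directly in units of MEDV ($1000's).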
For Hypothesis 2, we have used:
fit2<-lm(RAD~MEDV, data=house.df)
summary(fit2)
##
## Call:
## lm(formula = RAD ~ MEDV, data = house.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.391 -5.862 -3.658 8.893 24.375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.69052 0.94853 18.650 <2e-16 ***
## MEDV -0.36130 0.03898 -9.269 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.056 on 504 degrees of freedom
## Multiple R-squared: 0.1456, Adjusted R-squared: 0.1439
## F-statistic: 85.91 on 1 and 504 DF, p-value: < 2.2e-16
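RAD is an index rather than a measured quantity, and the contingency table in the STATISTICS section shows that it takes only nine distinct values, with a jump from 8 to 24. One way to check whether a straight line in RAD is adequate is to compare it with a model that treats RAD as a categorical variable. This is a sketch on simulated data under that assumption; `sim.df`, `fit_linear` and `fit_factor` are illustrative names, and the relationship built into the data is made up.

```r
set.seed(2)
# Simulated stand-in (made-up values); RAD takes the nine levels seen in the data.
rad_levels <- c(1:8, 24)
sim.df <- data.frame(RAD = sample(rad_levels, 300, replace = TRUE))
# Made-up relationship: only the RAD = 24 group has a lower median value.
sim.df$MEDV <- 24 - 7 * (sim.df$RAD == 24) + rnorm(300, sd = 5)

fit_linear <- lm(MEDV ~ RAD, data = sim.df)          # RAD as a number
fit_factor <- lm(MEDV ~ factor(RAD), data = sim.df)  # one mean per RAD level

# F-test of the nested models: does the factor coding fit better
# than a single straight line in RAD?
cmp <- anova(fit_linear, fit_factor)
print(cmp)
```

A small p-value in the second row of the comparison would suggest that the effect of RAD is not well described by a single slope.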
To study the influence of the various factors on MEDV (median value of owner-occupied homes in $1000's), a multiple linear regression was fitted:
fit3<- lm(MEDV~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+
LSTAT,data=house.df)
summary(fit3)
##
## Call:
## lm(formula = MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE +
## DIS + RAD + TAX + PTRATIO + LSTAT, data = house.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1304 -2.7673 -0.5814 1.9414 26.2526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
## CRIM -0.121389 0.033000 -3.678 0.000261 ***
## ZN 0.046963 0.013879 3.384 0.000772 ***
## INDUS 0.013468 0.062145 0.217 0.828520
## CHAS 2.839993 0.870007 3.264 0.001173 **
## NOX -18.758022 3.851355 -4.870 1.50e-06 ***
## RM 3.658119 0.420246 8.705 < 2e-16 ***
## AGE 0.003611 0.013329 0.271 0.786595
## DIS -1.490754 0.201623 -7.394 6.17e-13 ***
## RAD 0.289405 0.066908 4.325 1.84e-05 ***
## TAX -0.012682 0.003801 -3.337 0.000912 ***
## PTRATIO -0.937533 0.132206 -7.091 4.63e-12 ***
## LSTAT -0.552019 0.050659 -10.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
## F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
For Hypothesis 1, the p-value (7.39e-05) is less than 0.05, indicating that CHAS and MEDV are associated. The positive estimate shows that the median value of owner-occupied homes tends to be higher where the tract bounds the river, although the R-squared of 0.03 indicates that the association is weak.
For Hypothesis 2, the p-value (< 2e-16) is less than 0.05, indicating that RAD and MEDV are associated. The negative estimate shows that, in this simple regression, the median value of owner-occupied homes is lower where the index of accessibility to radial highways is higher.
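A predictor can show one sign in a simple regression and the opposite sign in a multiple regression when it is correlated with other predictors. This is one possible explanation for RAD being negatively associated with MEDV in fit2 while having a positive coefficient in fit3; whether it applies to this dataset is an assumption that would need checking. The sketch below reproduces the effect on made-up data, where `z` plays the role of a hypothetical confounder.

```r
set.seed(3)
n <- 500
# Made-up data: z is a confounder that raises x and lowers y.
z <- rnorm(n)
x <- 2 * z + rnorm(n, sd = 0.5)
y <- 1 * x - 5 * z + rnorm(n)

slope_simple   <- unname(coef(lm(y ~ x))["x"])      # ignores z
slope_adjusted <- unname(coef(lm(y ~ x + z))["x"])  # holds z fixed

# The simple slope is negative although the effect of x at fixed z is positive.
print(c(simple = slope_simple, adjusted = slope_adjusted))
```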
The linear regression conducted to study the influence of the various factors on MEDV gives the following insights:
CRIM (per capita crime rate by town), NOX (nitric oxides concentration), DIS (weighted distances to five employment centers), TAX (full-value property-tax rate per $10,000), PTRATIO (pupil-teacher ratio by town) and LSTAT (% lower status of the population) have statistically significant negative coefficients, i.e. a negative effect on MEDV.
INDUS (proportion of non-retail business acres per town) and AGE (proportion of owner-occupied units built prior to 1940) are not statistically significant (p = 0.83 and p = 0.79), so the model gives no evidence that they influence MEDV.
ZN (proportion of residential land zoned for lots over 25,000 sq.ft.), CHAS (Charles River dummy variable), RM (average number of rooms per dwelling) and RAD (index of accessibility to radial highways) influence MEDV positively. Note that RAD has a positive coefficient here, with the other predictors held fixed, whereas its simple association with MEDV under Hypothesis 2 is negative.
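The grouping of predictors into negative, positive and non-significant can also be read off a fitted model programmatically. The helper below is a hypothetical sketch (the name `classify_coefficients` and the 0.05 threshold are assumptions), demonstrated on simulated data rather than on the housing dataset.

```r
# Hypothetical helper: groups the predictors of a fitted lm by the sign of
# their estimate, keeping only those with a p-value below `alpha`.
classify_coefficients <- function(fit, alpha = 0.05) {
  tab <- coef(summary(fit))
  tab <- tab[rownames(tab) != "(Intercept)", , drop = FALSE]
  significant <- tab[, "Pr(>|t|)"] < alpha
  list(
    negative        = rownames(tab)[significant & tab[, "Estimate"] < 0],
    positive        = rownames(tab)[significant & tab[, "Estimate"] > 0],
    not_significant = rownames(tab)[!significant]
  )
}

# Demonstration on simulated data (made-up values).
set.seed(4)
n <- 300
demo.df <- data.frame(up = rnorm(n), down = rnorm(n), noise = rnorm(n))
demo.df$y <- 3 * demo.df$up - 3 * demo.df$down + rnorm(n)  # noise has no effect
groups <- classify_coefficients(lm(y ~ up + down + noise, data = demo.df))
print(groups)
```

Applied to the model above, the call would be `classify_coefficients(fit3)`.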
The motivation behind this paper is to research how socio-economic structures influence housing prices in cities. Its contribution is an investigation of the influence of various socio-economic factors on the median value of residential properties. We found that factors such as crime, distance to employment centres, taxes, the pupil-teacher ratio in neighbourhood public schools and property location affect property value, which helps us understand the psychology of the general public in choosing a place of residence.
STATISTICS
Dimensions of the dataset (rows and columns)
dim(house.df)
## [1] 506 14
Structure of the dataframe
str(house.df)
## 'data.frame': 506 obs. of 14 variables:
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ B : num 397 397 393 395 397 ...
## $ LSTAT : num 4.98 9.14 4.03 2.94 5.33 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Summary and description of the dataset
summary(house.df)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PTRATIO B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
library(psych)
describe(house.df)
## vars n mean sd median trimmed mad min max range
## CRIM 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97
## ZN 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00
## INDUS 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28
## CHAS 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00
## NOX 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49
## RM 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22
## AGE 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10
## DIS 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00
## RAD 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00
## TAX 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00
## PTRATIO 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40
## B 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58
## LSTAT 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24
## MEDV 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00
## skew kurtosis se
## CRIM 5.19 36.60 0.38
## ZN 2.21 3.95 1.04
## INDUS 0.29 -1.24 0.30
## CHAS 3.39 9.48 0.01
## NOX 0.72 -0.09 0.01
## RM 0.40 1.84 0.03
## AGE -0.60 -0.98 1.25
## DIS 1.01 0.46 0.09
## RAD 1.00 -0.88 0.39
## TAX 0.67 -1.15 7.49
## PTRATIO -0.80 -0.30 0.10
## B -2.87 7.10 4.06
## LSTAT 0.90 0.46 0.32
## MEDV 1.10 1.45 0.41
Contingency Tables
table(house.df$CHAS)
##
##   0   1
## 471  35
table(house.df$RAD)
##
##   1   2   3   4   5   6   7   8  24
##  20  24  38 110 115  26  17  24 132
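The counts alone do not show how MEDV differs between the groups. A per-group summary can be added with `aggregate`, as in the sketch below; it uses simulated data (made-up values) and the name `by_chas` is an assumption for illustration.

```r
set.seed(5)
# Simulated stand-in: 40 tracts away from the river, 10 bounding it.
sim.df <- data.frame(CHAS = rep(c(0, 1), times = c(40, 10)))
sim.df$MEDV <- 22 + 6 * sim.df$CHAS + rnorm(50, sd = 5)

# Count, mean and median of MEDV within each CHAS group.
by_chas <- aggregate(MEDV ~ CHAS, data = sim.df,
                     FUN = function(v) c(n = length(v),
                                         mean = mean(v),
                                         median = median(v)))
print(by_chas)
```

The same call with `data = house.df`, or with `MEDV ~ RAD` as the formula, would give the corresponding summaries for the real data.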
VISUAL REPRESENTATIONS
plot(house.df$MEDV, house.df$CRIM, main="Plot of CRIM v/s MEDV",
     ylab="per capita crime rate by town",
     xlab="Median value of owner-occupied homes in $1000's")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(house.df$CHAS, house.df$MEDV,
            ylab="Median value of owner-occupied homes in $1000's",
            xlab="Charles River dummy variable")
scatterplot(house.df$RAD, house.df$MEDV,
            ylab="Median value of owner-occupied homes in $1000's",
            xlab="index of accessibility to radial highways")
library(corrgram)
corrgram(house.df, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie,
text.panel=panel.txt, main="Corrgram of housing dataset")
lr.df <- house.df[, -12]  # drop column 12 (B), which is not used in fit3
corrgram(lr.df, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie,
text.panel=panel.txt, main="Corrgram of housing dataset (linear regression)")
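A corrgram shows the strength of each correlation by shading; the underlying numbers can be listed with `cor`. The following sketch ranks the predictors by the absolute size of their Pearson correlation with MEDV. It runs on simulated data with made-up values, and `cors_sorted` is an illustrative name.

```r
set.seed(6)
n <- 200
# Simulated stand-in with two predictors (made-up values).
sim.df <- data.frame(RM = rnorm(n, 6.3, 0.7), LSTAT = rnorm(n, 12.6, 7))
sim.df$MEDV <- 5 * sim.df$RM - 0.5 * sim.df$LSTAT + rnorm(n, sd = 3)

# Pearson correlation of every column with MEDV, sorted by absolute size.
cors <- cor(sim.df)[, "MEDV"]
cors <- cors[names(cors) != "MEDV"]
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 2))
```

For the real data, replacing `sim.df` with `lr.df` would list the correlations behind the second corrgram.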
RELEVANT TESTS
t.test(house.df$CHAS,house.df$MEDV)
##
## Welch Two Sample t-test
##
## data: house.df$CHAS and house.df$MEDV
## t = -54.921, df = 505.77, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -23.26722 -21.66005
## sample estimates:
## mean of x mean of y
## 0.06916996 22.53280632
t.test(house.df$RAD, house.df$MEDV)
##
## Welch Two Sample t-test
##
## data: house.df$RAD and house.df$MEDV
## t = -23.06, df = 1007, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.08824 -11.87855
## sample estimates:
## mean of x mean of y
## 9.549407 22.532806
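The two-sample tests above compare the mean of the CHAS (or RAD) column with the mean of MEDV, which are quantities on different scales. Tests that correspond more directly to the two hypotheses would compare MEDV between the CHAS groups and measure the association between RAD and MEDV. The sketch below shows one way to set these up; it uses simulated data with made-up values, and the choice of a Spearman rank correlation for RAD is an assumption based on RAD being an ordinal index.

```r
set.seed(7)
n <- 200
# Simulated stand-in for the housing data (values are made up).
sim.df <- data.frame(CHAS = rbinom(n, 1, 0.2),
                     RAD  = sample(c(1:8, 24), n, replace = TRUE))
sim.df$MEDV <- 22 + 5 * sim.df$CHAS - 0.2 * sim.df$RAD + rnorm(n, sd = 6)

# Hypothesis 1: Welch t-test of mean MEDV between CHAS = 0 and CHAS = 1.
test_chas <- t.test(MEDV ~ CHAS, data = sim.df)

# Hypothesis 2: rank correlation between RAD and MEDV.
# exact = FALSE avoids a warning about ties in the ranks.
test_rad <- cor.test(sim.df$RAD, sim.df$MEDV,
                     method = "spearman", exact = FALSE)

print(test_chas$estimate)  # mean MEDV in each CHAS group
print(test_chas$p.value)
print(test_rad$estimate)   # Spearman's rho
print(test_rad$p.value)
```

With the real data, the corresponding calls would be `t.test(MEDV ~ CHAS, data = house.df)` and `cor.test(house.df$RAD, house.df$MEDV, method = "spearman", exact = FALSE)`.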