The Influence of Various Factors on Housing Prices
This project relates to real estate. I propose to study the median value of owner-occupied homes (MEDV) and to investigate how factors such as the per capita crime rate, the full-value property-tax rate per $10,000, and the index of accessibility to radial highways affect that median value.
Number of data columns: 14
Number of rows: 506
Each case in the dataset has 14 attributes:
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000’s
house.df <- read.csv("housingdata.csv")
View(house.df)
dim(house.df)
## [1] 506 14
str(house.df)
## 'data.frame': 506 obs. of 14 variables:
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ B : num 397 397 393 395 397 ...
## $ LSTAT : num 4.98 9.14 4.03 2.94 5.33 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
attach(house.df)
summary(house.df)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PTRATIO B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
library(psych)
describe(house.df)
## vars n mean sd median trimmed mad min max range
## CRIM 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97
## ZN 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00
## INDUS 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28
## CHAS 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00
## NOX 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49
## RM 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22
## AGE 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10
## DIS 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00
## RAD 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00
## TAX 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00
## PTRATIO 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40
## B 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58
## LSTAT 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24
## MEDV 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00
## skew kurtosis se
## CRIM 5.19 36.60 0.38
## ZN 2.21 3.95 1.04
## INDUS 0.29 -1.24 0.30
## CHAS 3.39 9.48 0.01
## NOX 0.72 -0.09 0.01
## RM 0.40 1.84 0.03
## AGE -0.60 -0.98 1.25
## DIS 1.01 0.46 0.09
## RAD 1.00 -0.88 0.39
## TAX 0.67 -1.15 7.49
## PTRATIO -0.80 -0.30 0.10
## B -2.87 7.10 4.06
## LSTAT 0.90 0.46 0.32
## MEDV 1.10 1.45 0.41
table(house.df$CHAS)
##
## 0 1
## 471 35
table(house.df$RAD)
##
## 1 2 3 4 5 6 7 8 24
## 20 24 38 110 115 26 17 24 132
table(house.df$ZN)
##
## 0 12.5 17.5 18 20 21 22 25 28 30 33 34 35 40 45
## 372 10 1 1 21 4 10 10 3 6 4 3 3 7 6
## 52.5 55 60 70 75 80 82.5 85 90 95 100
## 3 3 4 3 3 15 2 2 5 4 1
table(house.df$TAX)
##
## 187 188 193 198 216 222 223 224 226 233 241 242 243 244 245 247 252 254
## 1 7 8 1 5 7 5 10 1 9 1 2 4 1 3 4 2 5
## 255 256 264 265 270 273 276 277 279 280 281 284 285 287 289 293 296 300
## 1 1 12 2 7 5 9 11 4 1 4 7 1 8 5 3 8 7
## 304 305 307 311 313 315 329 330 334 335 337 345 348 351 352 358 370 384
## 14 4 40 7 1 2 6 10 2 2 2 3 2 1 2 3 2 11
## 391 398 402 403 411 422 430 432 437 469 666 711
## 8 12 2 30 2 1 3 9 15 1 132 5
table(house.df$CHAS,house.df$RAD)
##
## 1 2 3 4 5 6 7 8 24
## 0 19 24 36 102 104 26 17 19 124
## 1 1 0 2 8 11 0 0 5 8
table(house.df$CHAS,house.df$ZN)
##
## 0 12.5 17.5 18 20 21 22 25 28 30 33 34 35 40 45 52.5 55
## 0 344 10 1 1 18 4 10 10 3 6 4 3 3 4 6 3 3
## 1 28 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0
##
## 60 70 75 80 82.5 85 90 95 100
## 0 4 3 3 15 2 2 4 4 1
## 1 0 0 0 0 0 0 1 0 0
table(house.df$RAD, house.df$ZN)
##
## 0 12.5 17.5 18 20 21 22 25 28 30 33 34 35 40 45 52.5
## 1 6 0 0 1 0 0 0 0 0 0 0 0 3 2 0 0
## 2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 26 0 1 0 5 0 0 0 0 0 0 0 0 0 0 0
## 4 77 3 0 0 0 4 0 4 3 0 0 0 0 5 0 0
## 5 78 7 0 0 16 0 0 0 0 0 0 0 0 0 6 0
## 6 17 0 0 0 0 0 0 0 0 6 0 0 0 0 0 3
## 7 0 0 0 0 0 0 10 0 0 0 4 3 0 0 0 0
## 8 18 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0
## 24 132 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 55 60 70 75 80 82.5 85 90 95 100
## 1 1 2 0 0 3 0 0 2 0 0
## 2 0 0 0 0 3 2 1 0 0 0
## 3 0 0 0 3 0 0 0 1 2 0
## 4 0 2 0 0 9 0 1 0 2 0
## 5 2 0 3 0 0 0 0 2 0 1
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 24 0 0 0 0 0 0 0 0 0 0
boxplot(house.df$CRIM,main="per capita crime rate by town",xlab="CRIM"
, horizontal = T, col="light blue")
boxplot(house.df$INDUS,main="proportion of non-retail business acres per town",
xlab="INDUS", col="light blue")
boxplot(house.df$NOX, main=" nitric oxides concentration (parts per 10 million)",
xlab="NOX", col="light blue")
boxplot(house.df$RM, main="average number of rooms per dwelling", xlab="RM",
col="light blue")
boxplot(house.df$DIS, main="weighted distances to five employment centres",
xlab="DIS", col="light blue")
boxplot(house.df$TAX, main="full-value property-tax rate per $10,000",
xlab="TAX", col="light blue")
boxplot(house.df$PTRATIO, main="pupil-teacher ratio by town ",
xlab="PTRATIO", col="light blue")
boxplot(house.df$LSTAT, main="lower status of the population(%)",
xlab="LSTAT", col="light blue")
boxplot(house.df$MEDV, main="Median value of owner-occupied homes in $1000's",
xlab="MEDV", col="light blue")
library(lattice)
histogram(house.df$ZN, main="proportion of residential land zoned
for lots over 25,000 sq.ft", xlab="ZN", col="maroon")
charles=factor(house.df$CHAS, levels=c(1,0), labels=c("tract bounds river","Otherwise"))
histogram(charles,col="maroon", main="Charles River dummy variable")
histogram(house.df$AGE, main="proportion of owner-occupied units built prior to 1940",
xlab="AGE", col="maroon")
histogram(house.df$RAD, main="index of accessibility to radial highways",
xlab="RAD", col="maroon")
plot(house.df$MEDV,house.df$CRIM, main="plot of CRIM v/s MEDV",
ylab = "per capita crime rate by town",
xlab="Median value of owner-occupied homes in $1000's")
plot(house.df$MEDV,house.df$INDUS, main="plot of INDUS v/s MEDV",
xlab = "Median value of owner-occupied homes in $1000's",
ylab="proportion of non-retail business acres per town")
plot(house.df$MEDV,house.df$TAX, main="plot of TAX v/s MEDV",
xlab = "Median value of owner-occupied homes in $1000's",
ylab="full-value property-tax rate per $10,000")
plot(house.df$MEDV,house.df$RAD, main="plot of RAD v/s MEDV",
xlab = "Median value of owner-occupied homes in $1000's",
ylab="Index of accessibility to radial highways")
hnum<-house.df[,c(1,2,3,5,6,7,8,11,12,13,14)]
cor(hnum)
## CRIM ZN INDUS NOX RM AGE
## CRIM 1.0000000 -0.2004692 0.4065834 0.4209717 -0.2192467 0.3527343
## ZN -0.2004692 1.0000000 -0.5338282 -0.5166037 0.3119906 -0.5695373
## INDUS 0.4065834 -0.5338282 1.0000000 0.7636514 -0.3916759 0.6447785
## NOX 0.4209717 -0.5166037 0.7636514 1.0000000 -0.3021882 0.7314701
## RM -0.2192467 0.3119906 -0.3916759 -0.3021882 1.0000000 -0.2402649
## AGE 0.3527343 -0.5695373 0.6447785 0.7314701 -0.2402649 1.0000000
## DIS -0.3796701 0.6644082 -0.7080270 -0.7692301 0.2052462 -0.7478805
## PTRATIO 0.2899456 -0.3916785 0.3832476 0.1889327 -0.3555015 0.2615150
## B -0.3850639 0.1755203 -0.3569765 -0.3800506 0.1280686 -0.2735340
## LSTAT 0.4556215 -0.4129946 0.6037997 0.5908789 -0.6138083 0.6023385
## MEDV -0.3883046 0.3604453 -0.4837252 -0.4273208 0.6953599 -0.3769546
## DIS PTRATIO B LSTAT MEDV
## CRIM -0.3796701 0.2899456 -0.3850639 0.4556215 -0.3883046
## ZN 0.6644082 -0.3916785 0.1755203 -0.4129946 0.3604453
## INDUS -0.7080270 0.3832476 -0.3569765 0.6037997 -0.4837252
## NOX -0.7692301 0.1889327 -0.3800506 0.5908789 -0.4273208
## RM 0.2052462 -0.3555015 0.1280686 -0.6138083 0.6953599
## AGE -0.7478805 0.2615150 -0.2735340 0.6023385 -0.3769546
## DIS 1.0000000 -0.2324705 0.2915117 -0.4969958 0.2499287
## PTRATIO -0.2324705 1.0000000 -0.1773833 0.3740443 -0.5077867
## B 0.2915117 -0.1773833 1.0000000 -0.3660869 0.3334608
## LSTAT -0.4969958 0.3740443 -0.3660869 1.0000000 -0.7376627
## MEDV 0.2499287 -0.5077867 0.3334608 -0.7376627 1.0000000
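The MEDV column of the matrix above can be sorted to see which predictors have the strongest linear association with the median home value. The following is a minimal sketch that copies those correlations by hand into a vector; the names `medv_cor` and `ranked` are illustrative and not part of the original analysis. CHAS, RAD and TAX do not appear because they were left out of `hnum`.

```r
# Correlations with MEDV, copied from the MEDV column of the matrix above.
medv_cor <- c(CRIM = -0.3883046, ZN = 0.3604453, INDUS = -0.4837252,
              NOX = -0.4273208, RM = 0.6953599, AGE = -0.3769546,
              DIS = 0.2499287, PTRATIO = -0.5077867, B = 0.3334608,
              LSTAT = -0.7376627)

# Order predictors from strongest to weakest linear association with MEDV,
# ignoring the sign of the correlation.
ranked <- medv_cor[order(abs(medv_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

In this ordering LSTAT and RM come first and DIS comes last, which suggests that the variables chosen for the hypotheses below are not necessarily the ones most strongly correlated with MEDV.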
library(corrgram)
corrgram(house.df, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie,
text.panel=panel.txt, main="Corrgram of housing dataset")
library(corrgram)
corrgram(hnum, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie,
text.panel=panel.txt, main="Corrgram of housing dataset (numeric type)")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(~MEDV+CRIM+INDUS+RAD+TAX, data=house.df, main="Scatterplot matrix
Median value of owner-occupied homes in $1000's v/s other factors")
Null Hypothesis 1: There is no relationship between CRIM (per capita crime rate by town) and MEDV (median value of owner-occupied homes in $1000’s).
cor.test(house.df$CRIM,house.df$MEDV)
##
## Pearson's product-moment correlation
##
## data: house.df$CRIM and house.df$MEDV
## t = -9.4597, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4599064 -0.3116859
## sample estimates:
## cor
## -0.3883046
The p-value is less than 0.05; hence, we reject the null hypothesis and conclude that there is a statistically significant negative correlation (r = -0.39) between CRIM and MEDV.
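To make the cor.test output less of a black box, the t statistic and the confidence interval can be reconstructed from just the sample correlation and the sample size. This is a sketch under the usual assumptions (the standard t statistic for a Pearson correlation, and Fisher's z transformation for the interval); the variable names are illustrative, and `r` and `n` are copied from the output above.

```r
r <- -0.3883046  # sample correlation reported by cor.test above
n <- 506         # number of rows in house.df

# t statistic for H0: rho = 0, on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 4))
print(round(ci, 4))
```

Up to rounding, the results should agree with the t value and the 95 percent interval printed by cor.test for CRIM and MEDV.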
Null Hypothesis 2: There is no relationship between CHAS (Charles River dummy variable) and MEDV (median value of owner-occupied homes in $1000’s).
cor.test(house.df$CHAS, house.df$MEDV)
##
## Pearson's product-moment correlation
##
## data: house.df$CHAS and house.df$MEDV
## t = 3.9964, df = 504, p-value = 7.391e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08945816 0.25848001
## sample estimates:
## cor
## 0.1752602
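The p-value (7.391e-05) is less than 0.05; hence, we reject the null hypothesis and conclude that there is a statistically significant, though weak, positive correlation (r = 0.18) between CHAS and MEDV.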
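Because CHAS is a 0/1 indicator, the Pearson correlation used here is a point-biserial correlation, and its significance test is equivalent to a pooled-variance two-sample t-test comparing the outcome between the two groups. The sketch below illustrates that equivalence on simulated data only: `g` and `y` are hypothetical stand-ins for CHAS and MEDV, and the means, standard deviation and seed are arbitrary assumptions, not estimates from the housing data.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical stand-ins for CHAS (0/1 indicator) and MEDV (numeric outcome)
g <- rep(c(0, 1), times = c(471, 35))
y <- rnorm(length(g), mean = 22 + 6 * g, sd = 9)

ct <- cor.test(g, y)                   # Pearson test on the indicator
tt <- t.test(y ~ g, var.equal = TRUE)  # pooled-variance two-sample t-test

# The same test in two guises: equal |t| and equal p-value
print(c(cor_t = unname(ct$statistic), ttest_t = unname(tt$statistic)))
print(c(cor_p = ct$p.value, ttest_p = tt$p.value))
```

The two t statistics differ only in sign, since t.test subtracts the mean of group 1 from the mean of group 0. Note that `var.equal = TRUE` is required for the equivalence; the default Welch test generally gives a slightly different result.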
Null Hypothesis 3: There is no relationship between TAX (full-value property-tax rate per $10,000) and MEDV (median value of owner-occupied homes in $1000’s).
cor.test(house.df$TAX, house.df$MEDV)
##
## Pearson's product-moment correlation
##
## data: house.df$TAX and house.df$MEDV
## t = -11.906, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5338993 -0.3976061
## sample estimates:
## cor
## -0.4685359
The p-value is less than 0.05, hence, we reject the null hypothesis and establish that there is a significant relationship between TAX and MEDV.
Null Hypothesis 4: There is no relationship between CRIM (per capita crime rate by town) and RAD (index of accessibility to radial highways).
cor.test(house.df$CRIM,house.df$RAD)
##
## Pearson's product-moment correlation
##
## data: house.df$CRIM and house.df$RAD
## t = 17.998, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5693817 0.6758248
## sample estimates:
## cor
## 0.6255051
The p-value is less than 0.05, hence, we reject the null hypothesis and establish that there is a significant relationship between CRIM and RAD.
Null Hypothesis 5: There is no relationship between AGE (proportion of owner-occupied units built prior to 1940) and TAX (full-value property-tax rate per $10,000).
t.test(house.df$AGE, house.df$TAX)
##
## Welch Two Sample t-test
##
## data: house.df$AGE and house.df$TAX
## t = -44.715, df = 533.15, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -354.5843 -324.7402
## sample estimates:
## mean of x mean of y
## 68.5749 408.2372
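The p-value is less than 0.05, but the Welch two-sample t-test compares the means of AGE and TAX, not their association. Rejecting its null hypothesis only shows that the two means differ (68.57 versus 408.24), which is expected for variables measured on entirely different scales; it does not establish a relationship between AGE and TAX. A correlation test such as cor.test would be needed to address Null Hypothesis 5.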
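The distinction between a difference in means and an association can be illustrated with simulated data. In this sketch, `x` and `y` are hypothetical variables that sit on AGE-like and TAX-like scales but are constructed to be exactly uncorrelated; the means, standard deviations and seed are arbitrary assumptions. The t-test still rejects its null hypothesis, while the correlation test does not.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 506

# Hypothetical variables on very different scales
x <- rnorm(n, mean = 68, sd = 28)
# Residuals from regressing noise on x are uncorrelated with x by construction
y <- 408 + resid(lm(rnorm(n, sd = 168) ~ x))

tt <- t.test(x, y)    # compares the two means
ct <- cor.test(x, y)  # tests for linear association

print(tt$p.value)     # very small: the means differ
print(ct$estimate)    # essentially zero: no linear association
print(ct$p.value)     # large: no evidence of a relationship
```

A significant two-sample t-test between two different variables therefore cannot, on its own, support a claim that the variables are related.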
Null Hypothesis 6: There is no relationship between DIS (weighted distances to five employment centres) and MEDV (median value of owner-occupied homes in $1000’s).
t.test(house.df$DIS, house.df$MEDV)
##
## Welch Two Sample t-test
##
## data: house.df$DIS and house.df$MEDV
## t = -44.673, df = 557.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.56164 -17.91389
## sample estimates:
## mean of x mean of y
## 3.795043 22.532806
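The p-value is less than 0.05, so the t-test rejects equality of the means of DIS and MEDV (3.80 versus 22.53). This difference in means says nothing about whether DIS and MEDV are related, because the two variables are measured in different units. The correlation matrix above reports r = 0.25 between DIS and MEDV, a weak positive association; a cor.test on these two variables would be needed to assess Null Hypothesis 6.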