home.df <- read.csv("home_data.csv", sep = ",", row.names = NULL)
dim(home.df)
## [1] 21613 21
str(home.df)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
## $ price : int 221900 538000 180000 604000 510000 1225000 257500 291850 229500 323000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
summary(home.df)
## id date price
## Min. :1.000e+06 20140623T000000: 142 Min. : 75000
## 1st Qu.:2.123e+09 20140625T000000: 131 1st Qu.: 321950
## Median :3.905e+09 20140626T000000: 131 Median : 450000
## Mean :4.580e+09 20140708T000000: 127 Mean : 540088
## 3rd Qu.:7.309e+09 20150427T000000: 126 3rd Qu.: 645000
## Max. :9.900e+09 20150325T000000: 123 Max. :7700000
## (Other) :20833
## bedrooms bathrooms sqft_living sqft_lot
## Min. : 0.000 Min. :0.000 Min. : 290 Min. : 520
## 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040
## Median : 3.000 Median :2.250 Median : 1910 Median : 7618
## Mean : 3.371 Mean :2.115 Mean : 2080 Mean : 15107
## 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688
## Max. :33.000 Max. :8.000 Max. :13540 Max. :1651359
##
## floors waterfront view condition
## Min. :1.000 Min. :0.000000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000
## Median :1.500 Median :0.000000 Median :0.0000 Median :3.000
## Mean :1.494 Mean :0.007542 Mean :0.2343 Mean :3.409
## 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :3.500 Max. :1.000000 Max. :4.0000 Max. :5.000
##
## grade sqft_above sqft_basement yr_built
## Min. : 1.000 Min. : 290 Min. : 0.0 Min. :1900
## 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median : 7.000 Median :1560 Median : 0.0 Median :1975
## Mean : 7.657 Mean :1788 Mean : 291.5 Mean :1971
## 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997
## Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015
##
## yr_renovated zipcode lat long
## Min. : 0.0 Min. :98001 Min. :47.16 Min. :-122.5
## 1st Qu.: 0.0 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3
## Median : 0.0 Median :98065 Median :47.57 Median :-122.2
## Mean : 84.4 Mean :98078 Mean :47.56 Mean :-122.2
## 3rd Qu.: 0.0 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1
## Max. :2015.0 Max. :98199 Max. :47.78 Max. :-121.3
##
## sqft_living15 sqft_lot15
## Min. : 399 Min. : 651
## 1st Qu.:1490 1st Qu.: 5100
## Median :1840 Median : 7620
## Mean :1987 Mean : 12768
## 3rd Qu.:2360 3rd Qu.: 10083
## Max. :6210 Max. :871200
##
library(psych)
describe(home.df)
table(home.df$bedrooms)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 33
## 13 199 2760 9824 6882 1601 272 38 13 6 3 1 1
table(home.df$bathrooms)
##
## 0 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75
## 10 4 72 3852 9 1446 3048 1930 2047 5380 1185 753 589 731 155
## 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5 6.75 7.5 7.75 8
## 136 79 100 23 21 13 10 4 6 2 2 2 1 1 2
table(home.df$floors)
##
## 1 1.5 2 2.5 3 3.5
## 10680 1910 8241 161 613 8
table(home.df$waterfront)
##
## 0 1
## 21450 163
table(home.df$view)
##
## 0 1 2 3 4
## 19489 332 963 510 319
table(home.df$condition)
##
## 1 2 3 4 5
## 30 172 14031 5679 1701
table(home.df$grade)
##
## 1 3 4 5 6 7 8 9 10 11 12 13
## 1 3 29 242 2038 8981 6068 2615 1134 399 90 13
xtabs(~bedrooms+bathrooms,data=home.df)
## bathrooms
## bedrooms 0 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25
## 0 7 0 1 1 0 1 0 0 0 3 0 0 0
## 1 3 1 27 138 2 12 4 6 4 2 0 0 0
## 2 0 2 26 1558 3 294 304 216 118 197 20 13 8
## 3 0 0 16 1780 4 829 1870 1048 1082 2357 275 197 184
## 4 0 1 2 325 0 254 719 525 709 2502 639 326 254
## 5 0 0 0 43 0 48 134 110 116 287 214 163 129
## 6 0 0 0 6 0 6 16 24 15 29 31 45 12
## 7 0 0 0 1 0 2 0 0 3 2 3 3 1
## 8 0 0 0 0 0 0 0 0 0 1 3 2 1
## 9 0 0 0 0 0 0 0 0 0 0 0 2 0
## 10 0 0 0 0 0 0 0 1 0 0 0 1 0
## 11 0 0 0 0 0 0 0 0 0 0 0 1 0
## 33 0 0 0 0 0 0 1 0 0 0 0 0 0
## bathrooms
## bedrooms 3.5 3.75 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0 0 0 0 0
## 3 143 17 11 6 5 0 0 0 0 0 0 0 0
## 4 395 78 58 38 32 7 7 5 5 1 0 0 0
## 5 169 44 48 25 35 11 7 4 4 2 4 2 1
## 6 17 13 11 8 23 3 6 3 0 0 1 0 1
## 7 5 2 5 2 3 2 0 0 1 1 0 0 0
## 8 1 1 2 0 0 0 1 0 0 0 1 0 0
## 9 0 0 1 0 2 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 1 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0 0 0 0 0 0
## 33 0 0 0 0 0 0 0 0 0 0 0 0 0
## bathrooms
## bedrooms 6.75 7.5 7.75 8
## 0 0 0 0 0
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 1 0 0 0
## 6 0 0 1 1
## 7 1 0 0 1
## 8 0 0 0 0
## 9 0 1 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## 33 0 0 0 0
xtabs(~floors+bedrooms,data=home.df)
## bedrooms
## floors 0 1 2 3 4 5 6 7 8 9 10 11 33
## 1 4 162 1951 5455 2383 605 104 9 5 0 1 0 1
## 1.5 0 21 182 786 698 185 30 7 1 0 0 0 0
## 2 6 12 497 3118 3682 775 119 19 6 4 2 1 0
## 2.5 0 1 5 56 58 23 14 2 0 2 0 0 0
## 3 2 3 123 405 61 13 5 1 0 0 0 0 0
## 3.5 1 0 2 4 0 0 0 0 1 0 0 0 0
xtabs(~floors+bathrooms,data=home.df)
## bathrooms
## floors 0 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25
## 1 5 3 64 3115 2 904 2494 1283 883 1013 500 228 72
## 1.5 0 0 5 624 2 205 282 282 95 155 120 76 25
## 2 3 1 3 106 3 268 244 304 940 3966 542 391 430
## 2.5 0 0 0 3 0 5 8 12 15 39 7 23 12
## 3 1 0 0 4 2 64 20 49 114 204 15 33 50
## 3.5 1 0 0 0 0 0 0 0 0 3 1 2 0
## bathrooms
## floors 3.5 3.75 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5
## 1 58 18 18 3 5 3 4 1 3 0 0 0 0
## 1.5 17 11 4 2 3 2 0 0 0 0 0 0 0
## 2 614 113 99 67 81 14 16 12 7 4 6 2 2
## 2.5 13 7 5 4 3 3 1 0 0 0 0 0 0
## 3 29 6 9 3 8 1 0 0 0 0 0 0 0
## 3.5 0 0 1 0 0 0 0 0 0 0 0 0 0
## bathrooms
## floors 6.75 7.5 7.75 8
## 1 1 0 0 0
## 1.5 0 0 0 0
## 2 1 1 1 0
## 2.5 0 0 0 1
## 3 0 0 0 1
## 3.5 0 0 0 0
xtabs(~waterfront+floors,data=home.df)
## floors
## waterfront 1 1.5 2 2.5 3 3.5
## 0 10623 1889 8166 159 605 8
## 1 57 21 75 2 8 0
xtabs(~grade+condition,data=home.df)
## condition
## grade 1 2 3 4 5
## 1 1 0 0 0 0
## 3 0 1 1 0 1
## 4 1 5 13 10 0
## 5 9 15 100 84 34
## 6 11 59 1035 685 248
## 7 6 75 5234 2833 833
## 8 2 13 4269 1394 390
## 9 0 2 2041 446 126
## 10 0 2 921 156 55
## 11 0 0 332 56 11
## 12 0 0 74 13 3
## 13 0 0 11 2 0
xtabs(~grade+floors,data=home.df)
## floors
## grade 1 1.5 2 2.5 3 3.5
## 1 1 0 0 0 0 0
## 3 3 0 0 0 0 0
## 4 27 2 0 0 0 0
## 5 202 38 2 0 0 0
## 6 1662 311 63 2 0 0
## 7 5916 1006 1943 15 100 1
## 8 2233 402 2989 53 385 6
## 9 447 105 1935 46 82 0
## 10 142 35 906 26 25 0
## 11 34 11 323 14 17 0
## 12 11 0 72 2 4 1
## 13 2 0 8 3 0 0
boxplot(home.df$bedrooms, horizontal=TRUE, main="Distribution of bedrooms", xlab="Bedroom count")
boxplot(home.df$bathrooms, horizontal=TRUE, main="Distribution of bathrooms", xlab="Bathroom count")
boxplot(home.df$grade, horizontal=TRUE, main="Construction quality", xlab="grade")
boxplot(home.df$yr_built, horizontal=TRUE, main="Construction year", xlab="build year")
boxplot(home.df$price, horizontal=TRUE, main="Price distribution", xlab="house prices")
hist(home.df$bathrooms, xlab="number of bathroom" , ylim = c(0,9500), xlim = c(0,5), ylab="number of houses")
hist(home.df$sqft_living, xlim = c(0,6500), ylim = c(0,12000),xlab = "living area (sqft)", ylab = "number of houses")
hist(home.df$floors,xlab = "number of floors", ylab = "number of houses" )
hist(home.df$condition, ylim=c(0,15000), xlim=c(1.5,5),xlab = "house condition", ylab = "number of houses")
hist(home.df$sqft_above, xlim = c(0,6000), ylim = c(0,8000),xlab = "area above ground (sqft)", ylab = "number of houses")
hist(home.df$sqft_basement, xlim = c(0,2500), ylim = c(0,17000),xlab = "basement area(sqft)", ylab = "number of houses")
hist(home.df$sqft_living15, xlim = c(0,5500), ylim = c(0,8000),xlab = "15 neighbours avg living area (sqft)", ylab = "number of houses")
plot(home.df$bedrooms,home.df$price, main = "Price variation with bedrooms count", col="blue")
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$bedrooms),col="green", lty="dashed")
abline(lm(home.df$price~home.df$bedrooms))
plot(home.df$bathrooms,home.df$price, main = "Price variation with bathroom count", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$bathrooms),col="green", lty="dashed")
abline(lm(home.df$price~home.df$bathrooms))
plot(home.df$sqft_living,home.df$price, main = "Price variation with living area", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_living),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_living))
plot(home.df$sqft_lot,home.df$price, main = "Price variation with sqft_lot", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_lot),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_lot))
plot(home.df$floors,home.df$price, main = "Price variation with number of floors", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$floors),col="green", lty="dashed")
abline(lm(home.df$price~home.df$floors))
plot(home.df$zipcode,home.df$price, main = "Price variation with zipcode", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$zipcode),col="green", lty="dashed")
abline(lm(home.df$price~home.df$zipcode))
plot(home.df$condition,home.df$price, main = "Price variation with home condition ", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$condition),col="green", lty="dashed")
abline(lm(home.df$price~home.df$condition))
plot(home.df$grade,home.df$price, main = "Price variation with construction quality", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$grade),col="green", lty="dashed")
abline(lm(home.df$price~home.df$grade))
plot(home.df$waterfront,home.df$price, main = "Price variation with waterfront", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$waterfront),col="green", lty="dashed")
abline(lm(home.df$price~home.df$waterfront))
plot(home.df$view,home.df$price, main = "Price variation with view type", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$view),col="green", lty="dashed")
abline(lm(home.df$price~home.df$view))
plot(home.df$sqft_above,home.df$price, main = "Price variation with sqft_above", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_above),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_above))
plot(home.df$sqft_basement,home.df$price, main = "Price variation with basement area", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_basement),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_basement))
plot(home.df$yr_built,home.df$price, main = "Price variation with built year ", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$yr_built),col="green", lty="dashed")
abline(lm(home.df$price~home.df$yr_built))
plot(home.df$yr_renovated,home.df$price, main = "Price variation with renovation year", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$yr_renovated),col="green", lty="dashed")
abline(lm(home.df$price~home.df$yr_renovated))
plot(home.df$sqft_living15,home.df$price, main = "Price variation with living area of 15 neighbours", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_living15),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_living15))
plot(home.df$sqft_lot15,home.df$price, main = "Price variation with lot area of 15 neighbours", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_lot15),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_lot15))
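The sixteen plotting chunks above repeat the same four calls with a different column each time. A minimal sketch of how that pattern could be wrapped in a loop follows. `plot_price_against` is a hypothetical helper name, not part of the original analysis, and `toy` holds only the first ten rows as printed by `str(home.df)` so that the block does not depend on `home_data.csv`.

```r
# First ten rows of three columns, copied from the str(home.df) output above
toy <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000,
                  1225000, 257500, 291850, 229500, 323000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890),
  bedrooms    = c(3, 3, 2, 4, 3, 4, 3, 3, 3, 3)
)

# Hypothetical helper: one scatter plot per predictor, with the same overlays
# as the chunks above (two mean lines plus a least-squares line).
plot_price_against <- function(df, predictors, target = "price") {
  slopes <- numeric(0)
  for (p in predictors) {
    plot(df[[p]], df[[target]], col = "blue", cex = 0.6,
         xlab = p, ylab = target, main = paste("Price variation with", p))
    abline(h = mean(df[[target]]), col = "red", lty = "dashed")
    abline(v = mean(df[[p]]), col = "green", lty = "dashed")
    fit <- lm(reformulate(p, response = target), data = df)
    abline(fit)
    slopes[p] <- coef(fit)[[2]]  # keep the fitted slope for inspection
  }
  invisible(slopes)
}
```

Calling `plot_price_against(toy, c("sqft_living", "bedrooms"))` draws one panel per predictor; on the full data the second argument would be the vector of column names plotted above.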
cor.test(home.df$bedrooms,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$bedrooms and home.df$price
## t = 47.651, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2962354 0.3203646
## sample estimates:
## cor
## 0.3083496
cor.test(home.df$bathrooms,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$bathrooms and home.df$price
## t = 90.714, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5154140 0.5347258
## sample estimates:
## cor
## 0.5251375
cor.test(home.df$sqft_living,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$sqft_living and home.df$price
## t = 144.92, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6952099 0.7087336
## sample estimates:
## cor
## 0.7020351
cor.test(home.df$sqft_lot,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$sqft_lot and home.df$price
## t = 13.234, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07642022 0.10286988
## sample estimates:
## cor
## 0.08966086
cor.test(home.df$sqft_above,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$sqft_above and home.df$price
## t = 111.87, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5970556 0.6139427
## sample estimates:
## cor
## 0.6055673
cor.test(home.df$sqft_basement,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$sqft_basement and home.df$price
## t = 50.314, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3118302 0.3356988
## sample estimates:
## cor
## 0.323816
cor.test(home.df$floors,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$floors and home.df$price
## t = 39.06, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2442983 0.2692042
## sample estimates:
## cor
## 0.2567939
cor.test(home.df$zipcode,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$zipcode and home.df$price
## t = -7.8323, df = 21611, p-value = 5.011e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06648768 -0.03989916
## sample estimates:
## cor
## -0.05320285
cor.test(home.df$condition,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$condition and home.df$price
## t = 5.349, df = 21611, p-value = 8.936e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02304097 0.04966970
## sample estimates:
## cor
## 0.03636179
cor.test(home.df$grade,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$grade and home.df$price
## t = 131.76, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6599749 0.6747621
## sample estimates:
## cor
## 0.6674343
cor.test(home.df$waterfront,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$waterfront and home.df$price
## t = 40.626, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2539392 0.2787117
## sample estimates:
## cor
## 0.2663694
cor.test(home.df$view,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$view and home.df$price
## t = 63.643, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3860061 0.4084620
## sample estimates:
## cor
## 0.3972935
cor.test(home.df$yr_built,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$yr_built and home.df$price
## t = 7.9517, df = 21611, p-value = 1.93e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04070885 0.06729506
## sample estimates:
## cor
## 0.05401153
cor.test(home.df$yr_renovated,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$yr_renovated and home.df$price
## t = 18.737, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1132928 0.1395306
## sample estimates:
## cor
## 0.1264338
cor.test(home.df$sqft_living15,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$sqft_living15 and home.df$price
## t = 106.14, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5765464 0.5940746
## sample estimates:
## cor
## 0.5853789
cor.test(home.df$sqft_lot15,home.df$price)
##
## Pearson's product-moment correlation
##
## data: home.df$sqft_lot15 and home.df$price
## t = 12.162, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06919122 0.09567398
## sample estimates:
## cor
## 0.08244715
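From the correlation tests above, every variable is significantly correlated with price (all p-values are below 0.05). Ordered by decreasing absolute correlation, the variables are: sqft_living (0.70), grade (0.67), sqft_above (0.61), sqft_living15 (0.59), bathrooms (0.53), view (0.40), sqft_basement (0.32), bedrooms (0.31), waterfront (0.27), floors (0.26), yr_renovated (0.13), sqft_lot (0.09), sqft_lot15 (0.08), yr_built (0.05), zipcode (-0.05) and condition (0.04).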
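The ordering can also be computed rather than read off the individual `cor.test` outputs. The sketch below is one way to do that, assuming a hypothetical helper `rank_correlations`; `toy` contains only the first ten rows printed by `str(home.df)`, so its correlations will differ from the full-data values reported above.

```r
# First ten rows of five columns, copied from the str(home.df) output above
toy <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000,
                  1225000, 257500, 291850, 229500, 323000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890),
  grade       = c(7, 7, 6, 7, 8, 11, 7, 7, 7, 7),
  bedrooms    = c(3, 3, 2, 4, 3, 4, 3, 3, 3, 3),
  condition   = c(3, 3, 3, 5, 3, 3, 3, 3, 3, 3)
)

# Hypothetical helper: Pearson correlation of every numeric column with
# `target`, sorted by absolute value with the strongest first.
rank_correlations <- function(df, target = "price") {
  predictors <- setdiff(names(df)[sapply(df, is.numeric)], target)
  r <- sapply(predictors, function(p) cor(df[[p]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

ranked <- rank_correlations(toy)
print(round(ranked, 3))
```

Sorting on the absolute value keeps negatively correlated variables such as zipcode in the right place in the ranking.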
library(corrplot)
## corrplot 0.84 loaded
par(mfrow=c(1,1))
corrplot(corr=cor(home.df[,c(3:17,20:21)]),col = c(50, "red", "gray", "blue"))
library(corrgram)
corrgram(home.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="house prices analysis")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
par(mar=c(1,1,1,1))
scatterplotMatrix(formula = ~price+bedrooms+bathrooms+floors, data=home.df)
scatterplotMatrix(~price+grade+waterfront+view+condition, data=home.df)
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
scatterplotMatrix(~price+sqft_living+sqft_lot+sqft_above+sqft_basement+sqft_living15+sqft_lot15, data=home.df)
library(vcd)
## Loading required package: grid
chisq.test(home.df$price,home.df$bedrooms)
## Warning in chisq.test(home.df$price, home.df$bedrooms): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$bedrooms
## X-squared = 53991, df = 48372, p-value < 2.2e-16
t.test(home.df$price,home.df$bedrooms)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$bedrooms
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535190.0 544979.5
## sample estimates:
## mean of x mean of y
## 5.400881e+05 3.370842e+00
library(vcd)
chisq.test(home.df$price,home.df$bathrooms)
## Warning in chisq.test(home.df$price, home.df$bathrooms): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$bathrooms
## X-squared = 248400, df = 116900, p-value < 2.2e-16
t.test(home.df$price,home.df$bathrooms)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$bathrooms
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535191.3 544980.8
## sample estimates:
## mean of x mean of y
## 5.400881e+05 2.114757e+00
library(vcd)
chisq.test(home.df$price,home.df$sqft_lot)
## Warning in chisq.test(home.df$price, home.df$sqft_lot): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$sqft_lot
## X-squared = 40144000, df = 39427000, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_lot)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$sqft_lot
## t = 208.9, df = 22162, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 520055.4 529907.0
## sample estimates:
## mean of x mean of y
## 540088.14 15106.97
library(vcd)
chisq.test(home.df$price,home.df$floors)
## Warning in chisq.test(home.df$price, home.df$floors): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$floors
## X-squared = 22676, df = 20155, p-value < 2.2e-16
t.test(home.df$price,home.df$floors)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$floors
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535191.9 544981.4
## sample estimates:
## mean of x mean of y
## 5.400881e+05 1.494309e+00
library(vcd)
chisq.test(home.df$price,home.df$zipcode)
## Warning in chisq.test(home.df$price, home.df$zipcode): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$zipcode
## X-squared = 300110, df = 278140, p-value < 2.2e-16
t.test(home.df$price,home.df$zipcode)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$zipcode
## t = 177, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 437115.4 446905.0
## sample estimates:
## mean of x mean of y
## 540088.14 98077.94
library(vcd)
chisq.test(home.df$price,home.df$condition)
## Warning in chisq.test(home.df$price, home.df$condition): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$condition
## X-squared = 18023, df = 16124, p-value < 2.2e-16
t.test(home.df$price,home.df$condition)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$condition
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535190.0 544979.5
## sample estimates:
## mean of x mean of y
## 540088.14191 3.40943
library(vcd)
chisq.test(home.df$price,home.df$grade)
## Warning in chisq.test(home.df$price, home.df$grade): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$grade
## X-squared = 110070, df = 44341, p-value < 2.2e-16
t.test(home.df$price,home.df$grade)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$grade
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535185.7 544975.2
## sample estimates:
## mean of x mean of y
## 5.400881e+05 7.656873e+00
library(vcd)
chisq.test(home.df$price,home.df$waterfront)
## Warning in chisq.test(home.df$price, home.df$waterfront): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$waterfront
## X-squared = 8246.8, df = 4031, p-value < 2.2e-16
t.test(home.df$price,home.df$waterfront)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$waterfront
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535193.4 544982.9
## sample estimates:
## mean of x mean of y
## 5.400881e+05 7.541757e-03
library(vcd)
chisq.test(home.df$price,home.df$view)
## Warning in chisq.test(home.df$price, home.df$view): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$view
## X-squared = 24370, df = 16124, p-value < 2.2e-16
t.test(home.df$price,home.df$view)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$view
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535193.1 544982.7
## sample estimates:
## mean of x mean of y
## 5.400881e+05 2.343034e-01
library(vcd)
chisq.test(home.df$price,home.df$sqft_above)
## Warning in chisq.test(home.df$price, home.df$sqft_above): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$sqft_above
## X-squared = 5542500, df = 3809300, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_above)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$sqft_above
## t = 215.56, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 533405.0 543194.5
## sample estimates:
## mean of x mean of y
## 540088.142 1788.391
library(vcd)
chisq.test(home.df$price,home.df$sqft_basement)
## Warning in chisq.test(home.df$price, home.df$sqft_basement): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$sqft_basement
## X-squared = 1643800, df = 1229500, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_basement)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$sqft_basement
## t = 216.16, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 534901.9 544691.4
## sample estimates:
## mean of x mean of y
## 540088.142 291.509
library(vcd)
chisq.test(home.df$price,home.df$yr_renovated)
## Warning in chisq.test(home.df$price, home.df$yr_renovated): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$yr_renovated
## X-squared = 313880, df = 278140, p-value < 2.2e-16
t.test(home.df$price,home.df$yr_renovated)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$yr_renovated
## t = 216.24, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 535109.0 544898.5
## sample estimates:
## mean of x mean of y
## 540088.14191 84.40226
library(vcd)
chisq.test(home.df$price,home.df$yr_built)
## Warning in chisq.test(home.df$price, home.df$yr_built): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$yr_built
## X-squared = 453260, df = 463560, p-value = 1
t.test(home.df$price,home.df$yr_built)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$yr_built
## t = 215.49, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 533222.4 543011.9
## sample estimates:
## mean of x mean of y
## 540088.142 1971.005
library(vcd)
chisq.test(home.df$price,home.df$sqft_living15)
## Warning in chisq.test(home.df$price, home.df$sqft_living15): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$sqft_living15
## X-squared = 4110100, df = 3128100, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_living15)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$sqft_living15
## t = 215.48, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 533206.8 542996.4
## sample estimates:
## mean of x mean of y
## 540088.142 1986.552
library(vcd)
chisq.test(home.df$price,home.df$sqft_lot15)
## Warning in chisq.test(home.df$price, home.df$sqft_lot15): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: home.df$price and home.df$sqft_lot15
## X-squared = 36291000, df = 35021000, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_lot15)
##
## Welch Two Sample t-test
##
## data: home.df$price and home.df$sqft_lot15
## t = 210.58, df = 21851, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 522411.4 532228.0
## sample estimates:
## mean of x mean of y
## 540088.14 12768.46
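Each `t.test(home.df$price, home.df$x)` call above is a Welch two-sample test of whether the mean of `price` equals the mean of `x`, so the large t statistics mostly reflect the difference in units. As a hedged alternative (a sketch, not the method used in this report), the formula interface of `t.test` compares mean price between two groups of houses. The `toy` data are the first ten rows printed by `str(home.df)`, and the `storeys` column is introduced here for illustration only.

```r
# First ten rows of price and floors, copied from the str(home.df) output above
toy <- data.frame(
  price  = c(221900, 538000, 180000, 604000, 510000,
             1225000, 257500, 291850, 229500, 323000),
  floors = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2)
)

# Grouping variable: houses with one floor versus houses with two floors
toy$storeys <- factor(toy$floors, labels = c("one", "two"))

# Welch two-sample t-test of mean price between the two groups
res <- t.test(price ~ storeys, data = toy)
print(res$estimate)  # group means
print(res$p.value)
```

On the full data the same call would need a two-level grouping column, for example `waterfront`.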
fm<-lm(price~bedrooms+bathrooms+sqft_living+zipcode+condition+grade+waterfront+view+sqft_above+yr_built+yr_renovated+sqft_living15+sqft_lot15, data=home.df)
summary(fm)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + zipcode +
## condition + grade + waterfront + view + sqft_above + yr_built +
## yr_renovated + sqft_living15 + sqft_lot15, data = home.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1344316 -110707 -10109 90445 4282499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.114e+06 3.084e+06 1.010 0.31258
## bedrooms -3.974e+04 2.031e+03 -19.570 < 2e-16 ***
## bathrooms 5.227e+04 3.371e+03 15.504 < 2e-16 ***
## sqft_living 1.572e+02 4.479e+00 35.101 < 2e-16 ***
## zipcode 2.965e+01 3.098e+01 0.957 0.33850
## condition 1.835e+04 2.519e+03 7.284 3.36e-13 ***
## grade 1.216e+05 2.242e+03 54.250 < 2e-16 ***
## waterfront 5.797e+05 1.866e+04 31.073 < 2e-16 ***
## view 4.360e+04 2.286e+03 19.070 < 2e-16 ***
## sqft_above 7.522e+00 4.138e+00 1.818 0.06912 .
## yr_built -3.473e+03 7.324e+01 -47.414 < 2e-16 ***
## yr_renovated 1.143e+01 3.920e+00 2.916 0.00355 **
## sqft_living15 2.165e+01 3.624e+00 5.974 2.35e-09 ***
## sqft_lot15 -5.892e-01 5.589e-02 -10.541 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216300 on 21599 degrees of freedom
## Multiple R-squared: 0.6529, Adjusted R-squared: 0.6527
## F-statistic: 3126 on 13 and 21599 DF, p-value: < 2.2e-16
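From the linear regression model above, most of the variables have a significant relation to the price of houses at the 0.05 level; the exceptions are zipcode (p = 0.339) and sqft_above (p = 0.069). The chi-squared tests and t-tests also give p < 2.2e-16 for every variable except the chi-squared test on yr_built (p = 1), but R warns that the chi-squared approximation may be incorrect, and the t-tests compare the mean of price with the mean of each variable rather than measuring association, so these results should be read with caution.
The residual standard error of the model is 216,300 on 21,599 degrees of freedom, and the model explains about 65% of the variance in price (R-squared = 0.6529).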
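The figures printed by `summary(fm)` can be pulled out programmatically instead of being copied by hand. The sketch below shows one way to do that. It fits a deliberately reduced formula on `toy`, the first ten rows printed by `str(home.df)`, so its numbers are illustrative and are not the results of the full model.

```r
# First ten rows of three columns, copied from the str(home.df) output above
toy <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000,
                  1225000, 257500, 291850, 229500, 323000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890),
  grade       = c(7, 7, 6, 7, 8, 11, 7, 7, 7, 7)
)

fit <- lm(price ~ sqft_living + grade, data = toy)
s <- summary(fit)

# Coefficient p-values, and the terms that miss the 0.05 threshold
p_values <- coef(s)[, "Pr(>|t|)"]
not_significant <- names(p_values)[p_values >= 0.05]

# Residual standard error, its degrees of freedom, and R-squared
rse <- s$sigma
resid_df <- fit$df.residual
r2 <- s$r.squared

print(round(p_values, 4))
print(not_significant)
cat(sprintf("RSE = %.0f on %d df, R-squared = %.4f\n", rse, resid_df, r2))
```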
fm<-lm(price~sqft_lot+floors+sqft_basement, data=home.df)
summary(fm)
##
## Call:
## lm(formula = price ~ sqft_lot + floors + sqft_basement, data = home.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1040484 -185443 -52984 102827 5958032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.560e+04 7.155e+03 9.168 <2e-16 ***
## sqft_lot 7.556e-01 5.283e-02 14.303 <2e-16 ***
## floors 2.435e+05 4.180e+03 58.246 <2e-16 ***
## sqft_basement 3.405e+02 5.100e+00 66.762 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 321600 on 21609 degrees of freedom
## Multiple R-squared: 0.2325, Adjusted R-squared: 0.2324
## F-statistic: 2182 on 3 and 21609 DF, p-value: < 2.2e-16
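In the chi-squared tests, the t-tests and this second linear model, sqft_lot, floors and sqft_basement all have p-values below 0.05. We can therefore conclude that each of these factors is significantly associated with house prices, although this model explains only about 23% of the variance in price (R-squared = 0.2325).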
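Because p-values alone do not show how well a model fits, the two regressions can be put side by side on adjusted R-squared and residual standard error. The following is a minimal sketch under stated assumptions: `toy` is the first ten rows printed by `str(home.df)`, and the first formula is shortened to two predictors so that ten rows are enough. The printed numbers are therefore illustrative only.

```r
# First ten rows of six columns, copied from the str(home.df) output above
toy <- data.frame(
  price         = c(221900, 538000, 180000, 604000, 510000,
                    1225000, 257500, 291850, 229500, 323000),
  sqft_living   = c(1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890),
  grade         = c(7, 7, 6, 7, 8, 11, 7, 7, 7, 7),
  sqft_lot      = c(5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6560),
  floors        = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2),
  sqft_basement = c(0, 400, 0, 910, 0, 1530, 0, 0, 730, 0)
)

fit_a <- lm(price ~ sqft_living + grade, data = toy)             # shortened first model
fit_b <- lm(price ~ sqft_lot + floors + sqft_basement, data = toy)  # second model

# One row per model: adjusted R-squared and residual standard error
comparison <- data.frame(
  model  = c("sqft_living + grade", "sqft_lot + floors + sqft_basement"),
  adj_r2 = c(summary(fit_a)$adj.r.squared, summary(fit_b)$adj.r.squared),
  rse    = c(summary(fit_a)$sigma, summary(fit_b)$sigma)
)
print(comparison)
```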