R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

home.df <- read.csv("home_data.csv", sep = ",", row.names = NULL)
dim(home.df)
## [1] 21613    21
str(home.df)
## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
##  $ price        : int  221900 538000 180000 604000 510000 1225000 257500 291850 229500 323000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
summary(home.df)
##        id                         date           price        
##  Min.   :1.000e+06   20140623T000000:  142   Min.   :  75000  
##  1st Qu.:2.123e+09   20140625T000000:  131   1st Qu.: 321950  
##  Median :3.905e+09   20140626T000000:  131   Median : 450000  
##  Mean   :4.580e+09   20140708T000000:  127   Mean   : 540088  
##  3rd Qu.:7.309e+09   20150427T000000:  126   3rd Qu.: 645000  
##  Max.   :9.900e+09   20150325T000000:  123   Max.   :7700000  
##                      (Other)        :20833                    
##     bedrooms        bathrooms      sqft_living       sqft_lot      
##  Min.   : 0.000   Min.   :0.000   Min.   :  290   Min.   :    520  
##  1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040  
##  Median : 3.000   Median :2.250   Median : 1910   Median :   7618  
##  Mean   : 3.371   Mean   :2.115   Mean   : 2080   Mean   :  15107  
##  3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688  
##  Max.   :33.000   Max.   :8.000   Max.   :13540   Max.   :1651359  
##                                                                    
##      floors        waterfront            view          condition    
##  Min.   :1.000   Min.   :0.000000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :1.500   Median :0.000000   Median :0.0000   Median :3.000  
##  Mean   :1.494   Mean   :0.007542   Mean   :0.2343   Mean   :3.409  
##  3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000  
##  Max.   :3.500   Max.   :1.000000   Max.   :4.0000   Max.   :5.000  
##                                                                     
##      grade          sqft_above   sqft_basement       yr_built   
##  Min.   : 1.000   Min.   : 290   Min.   :   0.0   Min.   :1900  
##  1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951  
##  Median : 7.000   Median :1560   Median :   0.0   Median :1975  
##  Mean   : 7.657   Mean   :1788   Mean   : 291.5   Mean   :1971  
##  3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997  
##  Max.   :13.000   Max.   :9410   Max.   :4820.0   Max.   :2015  
##                                                                 
##   yr_renovated       zipcode           lat             long       
##  Min.   :   0.0   Min.   :98001   Min.   :47.16   Min.   :-122.5  
##  1st Qu.:   0.0   1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3  
##  Median :   0.0   Median :98065   Median :47.57   Median :-122.2  
##  Mean   :  84.4   Mean   :98078   Mean   :47.56   Mean   :-122.2  
##  3rd Qu.:   0.0   3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1  
##  Max.   :2015.0   Max.   :98199   Max.   :47.78   Max.   :-121.3  
##                                                                   
##  sqft_living15    sqft_lot15    
##  Min.   : 399   Min.   :   651  
##  1st Qu.:1490   1st Qu.:  5100  
##  Median :1840   Median :  7620  
##  Mean   :1987   Mean   : 12768  
##  3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :6210   Max.   :871200  
## 
library(psych)
describe(home.df)
table(home.df$bedrooms)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   33 
##   13  199 2760 9824 6882 1601  272   38   13    6    3    1    1
table(home.df$bathrooms)
## 
##    0  0.5 0.75    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75 
##   10    4   72 3852    9 1446 3048 1930 2047 5380 1185  753  589  731  155 
##    4 4.25  4.5 4.75    5 5.25  5.5 5.75    6 6.25  6.5 6.75  7.5 7.75    8 
##  136   79  100   23   21   13   10    4    6    2    2    2    1    1    2
table(home.df$floors)
## 
##     1   1.5     2   2.5     3   3.5 
## 10680  1910  8241   161   613     8
table(home.df$waterfront)
## 
##     0     1 
## 21450   163
table(home.df$view)
## 
##     0     1     2     3     4 
## 19489   332   963   510   319
table(home.df$condition)
## 
##     1     2     3     4     5 
##    30   172 14031  5679  1701
table(home.df$grade)
## 
##    1    3    4    5    6    7    8    9   10   11   12   13 
##    1    3   29  242 2038 8981 6068 2615 1134  399   90   13
xtabs(~bedrooms+bathrooms,data=home.df)
##         bathrooms
## bedrooms    0  0.5 0.75    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25
##       0     7    0    1    1    0    1    0    0    0    3    0    0    0
##       1     3    1   27  138    2   12    4    6    4    2    0    0    0
##       2     0    2   26 1558    3  294  304  216  118  197   20   13    8
##       3     0    0   16 1780    4  829 1870 1048 1082 2357  275  197  184
##       4     0    1    2  325    0  254  719  525  709 2502  639  326  254
##       5     0    0    0   43    0   48  134  110  116  287  214  163  129
##       6     0    0    0    6    0    6   16   24   15   29   31   45   12
##       7     0    0    0    1    0    2    0    0    3    2    3    3    1
##       8     0    0    0    0    0    0    0    0    0    1    3    2    1
##       9     0    0    0    0    0    0    0    0    0    0    0    2    0
##       10    0    0    0    0    0    0    0    1    0    0    0    1    0
##       11    0    0    0    0    0    0    0    0    0    0    0    1    0
##       33    0    0    0    0    0    0    1    0    0    0    0    0    0
##         bathrooms
## bedrooms  3.5 3.75    4 4.25  4.5 4.75    5 5.25  5.5 5.75    6 6.25  6.5
##       0     0    0    0    0    0    0    0    0    0    0    0    0    0
##       1     0    0    0    0    0    0    0    0    0    0    0    0    0
##       2     1    0    0    0    0    0    0    0    0    0    0    0    0
##       3   143   17   11    6    5    0    0    0    0    0    0    0    0
##       4   395   78   58   38   32    7    7    5    5    1    0    0    0
##       5   169   44   48   25   35   11    7    4    4    2    4    2    1
##       6    17   13   11    8   23    3    6    3    0    0    1    0    1
##       7     5    2    5    2    3    2    0    0    1    1    0    0    0
##       8     1    1    2    0    0    0    1    0    0    0    1    0    0
##       9     0    0    1    0    2    0    0    0    0    0    0    0    0
##       10    0    0    0    0    0    0    0    1    0    0    0    0    0
##       11    0    0    0    0    0    0    0    0    0    0    0    0    0
##       33    0    0    0    0    0    0    0    0    0    0    0    0    0
##         bathrooms
## bedrooms 6.75  7.5 7.75    8
##       0     0    0    0    0
##       1     0    0    0    0
##       2     0    0    0    0
##       3     0    0    0    0
##       4     0    0    0    0
##       5     1    0    0    0
##       6     0    0    1    1
##       7     1    0    0    1
##       8     0    0    0    0
##       9     0    1    0    0
##       10    0    0    0    0
##       11    0    0    0    0
##       33    0    0    0    0
xtabs(~floors+bedrooms,data=home.df)
##       bedrooms
## floors    0    1    2    3    4    5    6    7    8    9   10   11   33
##    1      4  162 1951 5455 2383  605  104    9    5    0    1    0    1
##    1.5    0   21  182  786  698  185   30    7    1    0    0    0    0
##    2      6   12  497 3118 3682  775  119   19    6    4    2    1    0
##    2.5    0    1    5   56   58   23   14    2    0    2    0    0    0
##    3      2    3  123  405   61   13    5    1    0    0    0    0    0
##    3.5    1    0    2    4    0    0    0    0    1    0    0    0    0
xtabs(~floors+bathrooms,data=home.df)
##       bathrooms
## floors    0  0.5 0.75    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25
##    1      5    3   64 3115    2  904 2494 1283  883 1013  500  228   72
##    1.5    0    0    5  624    2  205  282  282   95  155  120   76   25
##    2      3    1    3  106    3  268  244  304  940 3966  542  391  430
##    2.5    0    0    0    3    0    5    8   12   15   39    7   23   12
##    3      1    0    0    4    2   64   20   49  114  204   15   33   50
##    3.5    1    0    0    0    0    0    0    0    0    3    1    2    0
##       bathrooms
## floors  3.5 3.75    4 4.25  4.5 4.75    5 5.25  5.5 5.75    6 6.25  6.5
##    1     58   18   18    3    5    3    4    1    3    0    0    0    0
##    1.5   17   11    4    2    3    2    0    0    0    0    0    0    0
##    2    614  113   99   67   81   14   16   12    7    4    6    2    2
##    2.5   13    7    5    4    3    3    1    0    0    0    0    0    0
##    3     29    6    9    3    8    1    0    0    0    0    0    0    0
##    3.5    0    0    1    0    0    0    0    0    0    0    0    0    0
##       bathrooms
## floors 6.75  7.5 7.75    8
##    1      1    0    0    0
##    1.5    0    0    0    0
##    2      1    1    1    0
##    2.5    0    0    0    1
##    3      0    0    0    1
##    3.5    0    0    0    0
xtabs(~waterfront+floors,data=home.df)
##           floors
## waterfront     1   1.5     2   2.5     3   3.5
##          0 10623  1889  8166   159   605     8
##          1    57    21    75     2     8     0
xtabs(~grade+condition,data=home.df)
##      condition
## grade    1    2    3    4    5
##    1     1    0    0    0    0
##    3     0    1    1    0    1
##    4     1    5   13   10    0
##    5     9   15  100   84   34
##    6    11   59 1035  685  248
##    7     6   75 5234 2833  833
##    8     2   13 4269 1394  390
##    9     0    2 2041  446  126
##    10    0    2  921  156   55
##    11    0    0  332   56   11
##    12    0    0   74   13    3
##    13    0    0   11    2    0
xtabs(~grade+floors,data=home.df)
##      floors
## grade    1  1.5    2  2.5    3  3.5
##    1     1    0    0    0    0    0
##    3     3    0    0    0    0    0
##    4    27    2    0    0    0    0
##    5   202   38    2    0    0    0
##    6  1662  311   63    2    0    0
##    7  5916 1006 1943   15  100    1
##    8  2233  402 2989   53  385    6
##    9   447  105 1935   46   82    0
##    10  142   35  906   26   25    0
##    11   34   11  323   14   17    0
##    12   11    0   72    2    4    1
##    13    2    0    8    3    0    0
boxplot(home.df$bedrooms, horizontal=TRUE, main="Distribution of bedrooms", xlab="Bedroom count")

boxplot(home.df$bathrooms, horizontal=TRUE, main="Distribution of bathrooms", xlab="Bathroom count")

boxplot(home.df$grade, horizontal=TRUE, main="Construction quality", xlab="Grade")

boxplot(home.df$yr_built, horizontal=TRUE, main="Construction year", xlab="Year built")

boxplot(home.df$price, horizontal=TRUE, main="Price distribution", xlab="House price")

hist(home.df$bathrooms, xlab="number of bathrooms", ylim = c(0,9500), xlim = c(0,5), ylab="number of houses")

hist(home.df$sqft_living, xlim = c(0,6500), ylim = c(0,12000),xlab = "living area (sqft)", ylab = "number of houses")

hist(home.df$floors,xlab = "number of floors", ylab = "number of houses" )

hist(home.df$condition, ylim=c(0,15000), xlim=c(1.5,5),xlab = "house condition", ylab = "number of houses")

hist(home.df$sqft_above, xlim = c(0,6000), ylim = c(0,8000),xlab = "area above ground (sqft)", ylab = "number of houses")

hist(home.df$sqft_basement, xlim = c(0,2500), ylim = c(0,17000),xlab = "basement area (sqft)", ylab = "number of houses")

hist(home.df$sqft_living15, xlim = c(0,5500), ylim = c(0,8000),xlab = "15 neighbours avg living area (sqft)", ylab = "number of houses")

plot(home.df$bedrooms,home.df$price, main = "Price variation with bedrooms count", col="blue")
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$bedrooms),col="green", lty="dashed")
abline(lm(home.df$price~home.df$bedrooms))

plot(home.df$bathrooms,home.df$price,  main = "Price variation with bathroom count", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$bathrooms),col="green", lty="dashed")
abline(lm(home.df$price~home.df$bathrooms))

plot(home.df$sqft_living,home.df$price,  main = "Price variation with living area", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_living),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_living))

plot(home.df$sqft_lot,home.df$price,  main = "Price variation with sqft_lot", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_lot),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_lot))

plot(home.df$floors,home.df$price,  main = "Price variation with number of floors", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$floors),col="green", lty="dashed")
abline(lm(home.df$price~home.df$floors))

plot(home.df$zipcode,home.df$price,  main = "Price variation with zipcode", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$zipcode),col="green", lty="dashed")
abline(lm(home.df$price~home.df$zipcode))

plot(home.df$condition,home.df$price,  main = "Price variation with home condition", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$condition),col="green", lty="dashed")
abline(lm(home.df$price~home.df$condition))

plot(home.df$grade,home.df$price,  main = "Price variation with construction quality", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$grade),col="green", lty="dashed")
abline(lm(home.df$price~home.df$grade))

plot(home.df$waterfront,home.df$price,  main = "Price variation with waterfront", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$waterfront),col="green", lty="dashed")
abline(lm(home.df$price~home.df$waterfront))

plot(home.df$view,home.df$price,  main = "Price variation with view type", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$view),col="green", lty="dashed")
abline(lm(home.df$price~home.df$view))

plot(home.df$sqft_above,home.df$price,  main = "Price variation with sqft_above", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_above),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_above))

plot(home.df$sqft_basement,home.df$price,  main = "Price variation with basement area", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_basement),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_basement))

plot(home.df$yr_built,home.df$price,  main = "Price variation with year built", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$yr_built),col="green", lty="dashed")
abline(lm(home.df$price~home.df$yr_built))

plot(home.df$yr_renovated,home.df$price,  main = "Price variation with renovation year", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$yr_renovated),col="green", lty="dashed")
abline(lm(home.df$price~home.df$yr_renovated))

plot(home.df$sqft_living15,home.df$price,  main = "Price variation with living area of 15 neighbours", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_living15),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_living15))

plot(home.df$sqft_lot15,home.df$price,  main = "Price variation with lot area of 15 neighbours", col="blue",cex=0.6)
abline(h=mean(home.df$price),col="red",lty="dashed")
abline(v=mean(home.df$sqft_lot15),col="green", lty="dashed")
abline(lm(home.df$price~home.df$sqft_lot15))
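
The scatter plots above all repeat the same four steps: plot the variable against price, mark both means with dashed lines, and overlay a simple linear fit. As a minimal sketch, that pattern could be wrapped in a helper. The function name `plot_price_against` and the toy data frame are hypothetical and only stand in for `home.df`; the values are made up for illustration.

```r
# Hypothetical helper reproducing the scatter-plus-reference-lines pattern.
plot_price_against <- function(df, var, target = "price") {
  x <- df[[var]]
  y <- df[[target]]
  plot(x, y, main = paste("Price variation with", var),
       xlab = var, ylab = target, col = "blue", cex = 0.6)
  abline(h = mean(y), col = "red", lty = "dashed")    # mean of the target
  abline(v = mean(x), col = "green", lty = "dashed")  # mean of the predictor
  fit <- lm(y ~ x)                                    # simple linear fit
  abline(fit)
  invisible(fit)                                      # return the fit for inspection
}

# Toy data (assumption: made-up values, not taken from home_data.csv).
toy.df <- data.frame(
  price       = c(200, 340, 410, 520, 610, 750),
  sqft_living = c(900, 1400, 1600, 2100, 2300, 3000),
  bedrooms    = c(2, 3, 2, 4, 3, 4)
)

pdf(NULL)  # discard graphics so the sketch also runs without a display
fits <- lapply(c("sqft_living", "bedrooms"),
               function(v) plot_price_against(toy.df, v))
invisible(dev.off())
```

With the real data, the same call could be looped over the column names of interest, for example `lapply(c("bedrooms", "bathrooms", "grade"), function(v) plot_price_against(home.df, v))`.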

cor.test(home.df$bedrooms,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$bedrooms and home.df$price
## t = 47.651, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2962354 0.3203646
## sample estimates:
##       cor 
## 0.3083496
cor.test(home.df$bathrooms,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$bathrooms and home.df$price
## t = 90.714, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5154140 0.5347258
## sample estimates:
##       cor 
## 0.5251375
cor.test(home.df$sqft_living,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$sqft_living and home.df$price
## t = 144.92, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6952099 0.7087336
## sample estimates:
##       cor 
## 0.7020351
cor.test(home.df$sqft_lot,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$sqft_lot and home.df$price
## t = 13.234, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07642022 0.10286988
## sample estimates:
##        cor 
## 0.08966086
cor.test(home.df$sqft_above,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$sqft_above and home.df$price
## t = 111.87, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5970556 0.6139427
## sample estimates:
##       cor 
## 0.6055673
cor.test(home.df$sqft_basement,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$sqft_basement and home.df$price
## t = 50.314, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3118302 0.3356988
## sample estimates:
##      cor 
## 0.323816
cor.test(home.df$floors,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$floors and home.df$price
## t = 39.06, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2442983 0.2692042
## sample estimates:
##       cor 
## 0.2567939
cor.test(home.df$zipcode,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$zipcode and home.df$price
## t = -7.8323, df = 21611, p-value = 5.011e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06648768 -0.03989916
## sample estimates:
##         cor 
## -0.05320285
cor.test(home.df$condition,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$condition and home.df$price
## t = 5.349, df = 21611, p-value = 8.936e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02304097 0.04966970
## sample estimates:
##        cor 
## 0.03636179
cor.test(home.df$grade,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$grade and home.df$price
## t = 131.76, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6599749 0.6747621
## sample estimates:
##       cor 
## 0.6674343
cor.test(home.df$waterfront,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$waterfront and home.df$price
## t = 40.626, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2539392 0.2787117
## sample estimates:
##       cor 
## 0.2663694
cor.test(home.df$view,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$view and home.df$price
## t = 63.643, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3860061 0.4084620
## sample estimates:
##       cor 
## 0.3972935
cor.test(home.df$yr_built,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$yr_built and home.df$price
## t = 7.9517, df = 21611, p-value = 1.93e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04070885 0.06729506
## sample estimates:
##        cor 
## 0.05401153
cor.test(home.df$yr_renovated,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$yr_renovated and home.df$price
## t = 18.737, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1132928 0.1395306
## sample estimates:
##       cor 
## 0.1264338
cor.test(home.df$sqft_living15,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$sqft_living15 and home.df$price
## t = 106.14, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5765464 0.5940746
## sample estimates:
##       cor 
## 0.5853789
cor.test(home.df$sqft_lot15,home.df$price)
## 
##  Pearson's product-moment correlation
## 
## data:  home.df$sqft_lot15 and home.df$price
## t = 12.162, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06919122 0.09567398
## sample estimates:
##        cor 
## 0.08244715

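From the correlation tests above, we can conclude that house price is linearly associated with every variable tested, although the strength varies widely. In decreasing order of absolute Pearson correlation with price, the variables are: sqft_living, grade, sqft_above, sqft_living15, bathrooms, view, sqft_basement, bedrooms, waterfront, floors, yr_renovated, sqft_lot, sqft_lot15, yr_built, zipcode, condition. All correlations are statistically significant at this sample size, but from sqft_lot onward the absolute correlation is below 0.1, and the zipcode correlation is negative.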
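
The ranking can also be computed rather than read off the individual test outputs. The following is a minimal sketch, assuming every numeric column other than the target is a candidate predictor. The helper name `rank_by_correlation` and the toy data frame are hypothetical, and the toy values are made up for illustration.

```r
# Hypothetical helper: rank numeric predictors by |Pearson r| with a target column.
rank_by_correlation <- function(df, target = "price") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]  # strongest association first, sign kept
}

# Toy data (assumption: made-up values, not taken from home_data.csv).
toy.df <- data.frame(
  price       = c(200, 340, 410, 520, 610, 750),
  sqft_living = c(900, 1400, 1600, 2100, 2300, 3000),
  bedrooms    = c(2, 3, 2, 4, 3, 4),
  zipcode     = c(98178, 98003, 98125, 98028, 98198, 98074)
)

ranked <- rank_by_correlation(toy.df)
print(round(ranked, 3))
```

On the real data, a call such as `rank_by_correlation(home.df[, -1])` would drop the `id` column first. Non-numeric columns such as `date` are skipped automatically.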

library(corrplot)
## corrplot 0.84 loaded
par(mfrow=c(1,1))
corrplot(corr=cor(home.df[,c(3:17,20:21)]),col = c(50, "red", "gray", "blue"))

library(corrgram)
corrgram(home.df, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="house prices analysis")

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
par(mar=c(1,1,1,1))
scatterplotMatrix(formula = ~price+bedrooms+bathrooms+floors, data=home.df)

scatterplotMatrix(~price+grade+waterfront+view+condition, data=home.df)
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

scatterplotMatrix(~price+sqft_living+sqft_lot+sqft_above+sqft_basement+sqft_living15+sqft_lot15, data=home.df)

library(vcd)
## Loading required package: grid
chisq.test(home.df$price,home.df$bedrooms)
## Warning in chisq.test(home.df$price, home.df$bedrooms): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$bedrooms
## X-squared = 53991, df = 48372, p-value < 2.2e-16
t.test(home.df$price,home.df$bedrooms)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$bedrooms
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535190.0 544979.5
## sample estimates:
##    mean of x    mean of y 
## 5.400881e+05 3.370842e+00
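
When `chisq.test()` is given two vectors, it cross-tabulates them, so raw prices produce one row per distinct price. The 48372 degrees of freedom above correspond to a 4032 x 13 table, which is why the approximation warning appears. A minimal sketch of one alternative, assuming the aim is to test association between price level and bedroom count, is to bin price into a few bands first. The data below is simulated, and the three-band split is an arbitrary choice for illustration.

```r
# Simulated stand-in for home.df (assumption: not the real data).
set.seed(1)
n <- 200
toy.df <- data.frame(bedrooms = sample(2:4, n, replace = TRUE))
toy.df$price <- 100000 * toy.df$bedrooms + rnorm(n, sd = 30000)

# Bin price into three bands at the tertiles, then cross-tabulate.
breaks <- quantile(toy.df$price, probs = c(0, 1/3, 2/3, 1))
toy.df$price_band <- cut(toy.df$price, breaks = breaks,
                         labels = c("low", "mid", "high"),
                         include.lowest = TRUE)
tab <- table(toy.df$price_band, toy.df$bedrooms)

# Chi-squared test of independence on the 3 x 3 contingency table.
res <- chisq.test(tab)
print(tab)
print(res)
```

With few, well-populated cells, the expected counts are large enough for the chi-squared approximation to be reasonable.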
library(vcd)
chisq.test(home.df$price,home.df$bathrooms)
## Warning in chisq.test(home.df$price, home.df$bathrooms): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$bathrooms
## X-squared = 248400, df = 116900, p-value < 2.2e-16
t.test(home.df$price,home.df$bathrooms)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$bathrooms
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535191.3 544980.8
## sample estimates:
##    mean of x    mean of y 
## 5.400881e+05 2.114757e+00
library(vcd)
chisq.test(home.df$price,home.df$sqft_lot)
## Warning in chisq.test(home.df$price, home.df$sqft_lot): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$sqft_lot
## X-squared = 40144000, df = 39427000, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_lot)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$sqft_lot
## t = 208.9, df = 22162, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  520055.4 529907.0
## sample estimates:
## mean of x mean of y 
## 540088.14  15106.97
library(vcd)
chisq.test(home.df$price,home.df$floors)
## Warning in chisq.test(home.df$price, home.df$floors): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$floors
## X-squared = 22676, df = 20155, p-value < 2.2e-16
t.test(home.df$price,home.df$floors)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$floors
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535191.9 544981.4
## sample estimates:
##    mean of x    mean of y 
## 5.400881e+05 1.494309e+00
library(vcd)
chisq.test(home.df$price,home.df$zipcode)
## Warning in chisq.test(home.df$price, home.df$zipcode): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$zipcode
## X-squared = 300110, df = 278140, p-value < 2.2e-16
t.test(home.df$price,home.df$zipcode)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$zipcode
## t = 177, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  437115.4 446905.0
## sample estimates:
## mean of x mean of y 
## 540088.14  98077.94
library(vcd)
chisq.test(home.df$price,home.df$condition)
## Warning in chisq.test(home.df$price, home.df$condition): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$condition
## X-squared = 18023, df = 16124, p-value < 2.2e-16
t.test(home.df$price,home.df$condition)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$condition
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535190.0 544979.5
## sample estimates:
##    mean of x    mean of y 
## 540088.14191      3.40943
library(vcd)
chisq.test(home.df$price,home.df$grade)
## Warning in chisq.test(home.df$price, home.df$grade): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$grade
## X-squared = 110070, df = 44341, p-value < 2.2e-16
t.test(home.df$price,home.df$grade)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$grade
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535185.7 544975.2
## sample estimates:
##    mean of x    mean of y 
## 5.400881e+05 7.656873e+00
library(vcd)
chisq.test(home.df$price,home.df$waterfront)
## Warning in chisq.test(home.df$price, home.df$waterfront): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$waterfront
## X-squared = 8246.8, df = 4031, p-value < 2.2e-16
t.test(home.df$price,home.df$waterfront)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$waterfront
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535193.4 544982.9
## sample estimates:
##    mean of x    mean of y 
## 5.400881e+05 7.541757e-03
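
Called with two vectors, `t.test()` compares their means, so the test above compares mean price with the mean of the 0/1 waterfront indicator (about 0.0075). If the question is whether waterfront houses sell for more, the formula interface compares price between the two groups instead. This is a minimal sketch on made-up toy values, not the real data.

```r
# Toy data (assumption: made-up values, not taken from home_data.csv).
toy.df <- data.frame(
  price      = c(300, 320, 350, 400, 420, 450, 900, 1100, 1250, 1400),
  waterfront = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Welch two-sample t-test of price between waterfront = 0 and waterfront = 1.
res <- t.test(price ~ waterfront, data = toy.df)
print(res)
```

On the real data the equivalent call would be `t.test(price ~ waterfront, data = home.df)`. For predictors with more than two levels, such as grade or bedrooms, a two-sample t-test does not apply directly.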
library(vcd)
chisq.test(home.df$price,home.df$view)
## Warning in chisq.test(home.df$price, home.df$view): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$view
## X-squared = 24370, df = 16124, p-value < 2.2e-16
t.test(home.df$price,home.df$view)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$view
## t = 216.27, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535193.1 544982.7
## sample estimates:
##    mean of x    mean of y 
## 5.400881e+05 2.343034e-01
library(vcd)
chisq.test(home.df$price,home.df$sqft_above)
## Warning in chisq.test(home.df$price, home.df$sqft_above): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$sqft_above
## X-squared = 5542500, df = 3809300, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_above)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$sqft_above
## t = 215.56, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  533405.0 543194.5
## sample estimates:
##  mean of x  mean of y 
## 540088.142   1788.391
library(vcd)
chisq.test(home.df$price,home.df$sqft_basement)
## Warning in chisq.test(home.df$price, home.df$sqft_basement): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$sqft_basement
## X-squared = 1643800, df = 1229500, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_basement)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$sqft_basement
## t = 216.16, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  534901.9 544691.4
## sample estimates:
##  mean of x  mean of y 
## 540088.142    291.509
library(vcd)
chisq.test(home.df$price,home.df$yr_renovated)
## Warning in chisq.test(home.df$price, home.df$yr_renovated): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$yr_renovated
## X-squared = 313880, df = 278140, p-value < 2.2e-16
t.test(home.df$price,home.df$yr_renovated)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$yr_renovated
## t = 216.24, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  535109.0 544898.5
## sample estimates:
##    mean of x    mean of y 
## 540088.14191     84.40226
library(vcd)
chisq.test(home.df$price,home.df$yr_built)
## Warning in chisq.test(home.df$price, home.df$yr_built): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$yr_built
## X-squared = 453260, df = 463560, p-value = 1
t.test(home.df$price,home.df$yr_built)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$yr_built
## t = 215.49, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  533222.4 543011.9
## sample estimates:
##  mean of x  mean of y 
## 540088.142   1971.005
library(vcd)
chisq.test(home.df$price,home.df$sqft_living15)
## Warning in chisq.test(home.df$price, home.df$sqft_living15): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$sqft_living15
## X-squared = 4110100, df = 3128100, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_living15)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$sqft_living15
## t = 215.48, df = 21612, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  533206.8 542996.4
## sample estimates:
##  mean of x  mean of y 
## 540088.142   1986.552
library(vcd)
chisq.test(home.df$price,home.df$sqft_lot15)
## Warning in chisq.test(home.df$price, home.df$sqft_lot15): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  home.df$price and home.df$sqft_lot15
## X-squared = 36291000, df = 35021000, p-value < 2.2e-16
t.test(home.df$price,home.df$sqft_lot15)
## 
##  Welch Two Sample t-test
## 
## data:  home.df$price and home.df$sqft_lot15
## t = 210.58, df = 21851, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  522411.4 532228.0
## sample estimates:
## mean of x mean of y 
## 540088.14  12768.46
fm<-lm(price~bedrooms+bathrooms+sqft_living+zipcode+condition+grade+waterfront+view+sqft_above+yr_built+yr_renovated+sqft_living15+sqft_lot15, data=home.df)
summary(fm)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + zipcode + 
##     condition + grade + waterfront + view + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15, data = home.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1344316  -110707   -10109    90445  4282499 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.114e+06  3.084e+06   1.010  0.31258    
## bedrooms      -3.974e+04  2.031e+03 -19.570  < 2e-16 ***
## bathrooms      5.227e+04  3.371e+03  15.504  < 2e-16 ***
## sqft_living    1.572e+02  4.479e+00  35.101  < 2e-16 ***
## zipcode        2.965e+01  3.098e+01   0.957  0.33850    
## condition      1.835e+04  2.519e+03   7.284 3.36e-13 ***
## grade          1.216e+05  2.242e+03  54.250  < 2e-16 ***
## waterfront     5.797e+05  1.866e+04  31.073  < 2e-16 ***
## view           4.360e+04  2.286e+03  19.070  < 2e-16 ***
## sqft_above     7.522e+00  4.138e+00   1.818  0.06912 .  
## yr_built      -3.473e+03  7.324e+01 -47.414  < 2e-16 ***
## yr_renovated   1.143e+01  3.920e+00   2.916  0.00355 ** 
## sqft_living15  2.165e+01  3.624e+00   5.974 2.35e-09 ***
## sqft_lot15    -5.892e-01  5.589e-02 -10.541  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216300 on 21599 degrees of freedom
## Multiple R-squared:  0.6529, Adjusted R-squared:  0.6527 
## F-statistic:  3126 on 13 and 21599 DF,  p-value: < 2.2e-16

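From the linear regression model above, most of the variables have a significant relation to the price of houses (p < 0.05). The exceptions are zipcode (p = 0.339) and sqft_above (p = 0.069), which are not significant at the 5% level once the other predictors are in the model. The chi-squared tests above give p < 2.2e-16 for every variable except yr_built (p = 1), although R warns that the chi-squared approximation may be incorrect for these tables. The t-tests compare the mean of price with the mean of each other variable, so they show that the two means differ rather than that the variable is related to price.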
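
Since price and most of these variables are numeric, a correlation test is one way to check for a relation to price directly. The following is a minimal sketch, not the analysis above: `toy.df` is a hypothetical, simulated stand-in for home.df, and its numbers are made up for illustration.

```r
# Simulated stand-in for home.df (hypothetical data, not the real file)
set.seed(1)
toy.df <- data.frame(sqft_living = runif(200, 500, 4000))
toy.df$price <- 100000 + 150 * toy.df$sqft_living + rnorm(200, sd = 50000)

# Pearson correlation test: the null hypothesis is a true correlation of zero
ct <- cor.test(toy.df$price, toy.df$sqft_living)
ct$estimate   # sample correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation
```

On the real data the same call would take home.df$price and one numeric column at a time.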

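The residual standard error of this model is 216300 on 21599 degrees of freedom, so a typical prediction misses the actual price by about 216000. The model explains about 65% of the variance in price (Multiple R-squared: 0.6529).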
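
The error figures can also be read from the fitted model object instead of the printed summary. This is a sketch on simulated data: `toy.df` and `toy.fm` are hypothetical names, and the coefficients used to generate the data are invented. `sigma()` returns the residual standard error and `df.residual()` the residual degrees of freedom.

```r
# Simulated stand-in for home.df (hypothetical data)
set.seed(2)
toy.df <- data.frame(grade = sample(3:12, 300, replace = TRUE),
                     sqft_living = runif(300, 500, 4000))
toy.df$price <- 50000 * toy.df$grade + 150 * toy.df$sqft_living +
  rnorm(300, sd = 80000)

toy.fm <- lm(price ~ grade + sqft_living, data = toy.df)

rse  <- sigma(toy.fm)                     # residual standard error
dfr  <- df.residual(toy.fm)               # observations minus estimated coefficients
rmse <- sqrt(mean(residuals(toy.fm)^2))   # root mean squared error on the fitted data
```

The residual standard error divides by the residual degrees of freedom rather than the number of rows, so it is slightly larger than the root mean squared error.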

fm<-lm(price~sqft_lot+floors+sqft_basement, data=home.df)
summary(fm)
## 
## Call:
## lm(formula = price ~ sqft_lot + floors + sqft_basement, data = home.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1040484  -185443   -52984   102827  5958032 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.560e+04  7.155e+03   9.168   <2e-16 ***
## sqft_lot      7.556e-01  5.283e-02  14.303   <2e-16 ***
## floors        2.435e+05  4.180e+03  58.246   <2e-16 ***
## sqft_basement 3.405e+02  5.100e+00  66.762   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 321600 on 21609 degrees of freedom
## Multiple R-squared:  0.2325, Adjusted R-squared:  0.2324 
## F-statistic:  2182 on 3 and 21609 DF,  p-value: < 2.2e-16

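In this second model the p-values for sqft_lot, floors and sqft_basement are all < 0.05. Hence it can be concluded that each of these factors has a significant effect on the pricing of the houses. The fit is weaker than that of the first model, however: the R-squared is 0.2325 and the residual standard error is 321600.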
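
To check which terms fall below the 0.05 threshold without reading the printed table, the p-values can be taken from the coefficient matrix of `summary()`. This is a sketch on simulated data: `toy.df`, `toy.fm` and the predictor `noise` (generated independently of price) are hypothetical.

```r
# Simulated stand-in for home.df (hypothetical data)
set.seed(3)
toy.df <- data.frame(floors = sample(1:3, 300, replace = TRUE),
                     noise  = rnorm(300))
toy.df$price <- 200000 + 100000 * toy.df$floors + rnorm(300, sd = 50000)

toy.fm <- lm(price ~ floors + noise, data = toy.df)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(toy.fm)$coefficients
pvals <- coefs[, "Pr(>|t|)"]

names(pvals)[pvals < 0.05]       # terms significant at the 5% level
summary(toy.fm)$adj.r.squared    # adjusted R-squared of the fit
```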