Examining housing list and sale prices against a myriad of factors.

library(ggplot2)
library(data.table)
library(dplyr)
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.0.5
library(skimr)
homeprice <- read.csv("homeprice.csv")
skim(homeprice)
Data summary
Name                homeprice
Number of rows      29
Number of columns   7
_______________________
Column type frequency:
  numeric           7
________________________
Group variables     None

Variable type: numeric

skim_variable  n_missing  complete_rate    mean      sd  p0  p25    p50  p75  p100  hist
list                   0              1  274.85  120.43  43  189  275.0  339   599  ▂▆▇▁▁
sale                   0              1  273.52  117.34  48  185  272.5  340   613  ▃▇▇▂▁
full                   0              1    1.72    0.75   1    1    2.0    2     3  ▇▁▇▁▃
half                   0              1    0.66    0.67   0    0    1.0    1     2  ▇▁▇▁▂
bedrooms               0              1    3.17    0.80   1    3    3.0    4     5  ▁▂▇▃▁
rooms                  0              1    7.21    1.52   3    7    7.0    8    11  ▁▂▇▆▁
neighborhood           0              1    2.90    1.01   1    2    3.0    3     5  ▁▅▇▃▁
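# Distribution of total rooms within each neighborhood index.
# A continuous fill (sale) cannot be mapped onto histogram bars, so it is left out.
ggplot(homeprice, aes(x = rooms)) + geom_histogram(binwidth = 0.5, color = "blue") + 
  facet_wrap(~ neighborhood)

# Distribution of list price, with bars filled by neighborhood (treated as a category).
ggplot(homeprice, aes(x = list, fill = factor(neighborhood))) + geom_histogram(binwidth = 20, color = "red")

# Sale price by number of rooms (one box per room count), split by neighborhood.
ggplot(homeprice, aes(x = factor(rooms), y = sale)) + geom_boxplot() +
  facet_wrap(~ neighborhood)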

Sale Price

  • After creating the graphs, I run several multiple linear regressions of the sale price of a house on the other factors. For each model I fit the regression, examine it with the summary and anova functions, and plot the residuals.
sprice_bn <- lm(sale~bedrooms + neighborhood, data = homeprice)
sprice_bn
## 
## Call:
## lm(formula = sale ~ bedrooms + neighborhood, data = homeprice)
## 
## Coefficients:
##  (Intercept)      bedrooms  neighborhood  
##      -132.06         42.48         93.49
summary(sprice_bn)
## 
## Call:
## lm(formula = sale ~ bedrooms + neighborhood, data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.871 -39.861   0.636  28.815 107.660 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -132.057     40.341  -3.273 0.003001 ** 
## bedrooms       42.483     11.446   3.712 0.000987 ***
## neighborhood   93.493      9.101  10.273 1.21e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.3 on 26 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8375 
## F-statistic: 73.16 on 2 and 26 DF,  p-value: 2.1e-11
anova(sprice_bn)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## bedrooms      1  91233   91233  40.782 9.163e-07 ***
## neighborhood  1 236105  236105 105.542 1.206e-10 ***
## Residuals    26  58164    2237                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residuals(sprice_bn)
##           1           2           3           4           5           6 
##  28.8149071  19.6317990 -18.3542652 -43.3542652 -29.3641892  19.1457348 
##           7           8           9          10          11          12 
##  32.6218750  56.6218750  69.1288429 -49.4119088  80.1218750 107.6596706 
##          13          14          15          16          17          18 
## -63.3781250  29.1288429  15.6119510 -12.3781250  13.1049831  15.1288429 
##          19          20          21          22          23          24 
## -12.8572213   0.6358108 -30.8711571 -43.3542652  39.1119510   4.6695946 
##          25          26          27          28          29 
## -45.8711571  -9.3641892 -41.8472973 -39.8612331 -90.8711571
plot(sprice_bn, which = 1)

sprice_bba <- lm(sale ~ bedrooms + full, data = homeprice)
summary(sprice_bba)
## 
## Call:
## lm(formula = sale ~ bedrooms + full, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -169.151  -59.716    2.849   51.849  196.159 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -15.80      69.50  -0.227  0.82198   
## bedrooms       46.57      21.64   2.152  0.04087 * 
## full           82.12      23.19   3.541  0.00153 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.38 on 26 degrees of freedom
## Multiple R-squared:  0.4851, Adjusted R-squared:  0.4454 
## F-statistic: 12.25 on 2 and 26 DF,  p-value: 0.000179
anova(sprice_bba)
## Analysis of Variance Table
## 
## Response: sale
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## bedrooms   1  91233   91233  11.949 0.001893 **
## full       1  95756   95756  12.541 0.001527 **
## Residuals 26 198514    7635                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sprice_bba, which = 1)

sprice_rn <- lm(sale~rooms + neighborhood + bedrooms, data = homeprice)
summary(sprice_rn)
## 
## Call:
## lm(formula = sale ~ rooms + neighborhood + bedrooms, data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.166 -35.314  -2.229  33.494  93.133 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -157.35      43.35  -3.630  0.00127 ** 
## rooms           16.75      11.73   1.427  0.16589    
## neighborhood    88.03       9.71   9.066 2.23e-09 ***
## bedrooms        17.40      20.85   0.835  0.41181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.38 on 25 degrees of freedom
## Multiple R-squared:  0.8605, Adjusted R-squared:  0.8437 
## F-statistic:  51.4 on 3 and 25 DF,  p-value: 7.783e-11
anova(sprice_rn)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## rooms         1 152218  152218 70.7572 9.298e-09 ***
## neighborhood  1 178003  178003 82.7432 2.092e-09 ***
## bedrooms      1   1499    1499  0.6967    0.4118    
## Residuals    25  53782    2151                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sprice_rn, which = 1)

sprice_all <- lm(sale ~ half + full + bedrooms + rooms + neighborhood, data = homeprice)
summary(sprice_all)
## 
## Call:
## lm(formula = sale ~ half + full + bedrooms + rooms + neighborhood, 
##     data = homeprice)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.31 -34.06   7.20  21.32  55.93 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -135.263     37.283  -3.628  0.00141 ** 
## half           43.242     12.830   3.370  0.00264 ** 
## full           26.225     13.896   1.887  0.07181 .  
## bedrooms       20.409     17.798   1.147  0.26329    
## rooms           6.488     10.383   0.625  0.53823    
## neighborhood   77.243     10.077   7.665 8.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.29 on 23 degrees of freedom
## Multiple R-squared:  0.9079, Adjusted R-squared:  0.8879 
## F-statistic: 45.34 on 5 and 23 DF,  p-value: 3.686e-11
anova(sprice_all)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## half          1  59893   59893  38.7920 2.354e-06 ***
## full          1 179168  179168 116.0452 1.844e-10 ***
## bedrooms      1  10581   10581   6.8530   0.01538 *  
## rooms         1   9632    9632   6.2387   0.02009 *  
## neighborhood  1  90717   90717  58.7562 8.859e-08 ***
## Residuals    23  35511    1544                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sprice_all, which = 1)

  • That is a lot of output to process, so I will break it down. The first linear model relates the sale price to the number of bedrooms and the neighborhood the house is in. Note that neighborhood is an index score from 1 to 5, where 1 is the lowest-income area and 5 is the highest-income. The second linear model relates the sale price to the number of bedrooms and full bathrooms. The third linear model relates the sale price to the total number of rooms, the neighborhood, and the number of bedrooms. The final linear model relates the sale price to every house characteristic in the data (half baths, full baths, bedrooms, rooms) plus the neighborhood.
  • Now let's examine the output of each model one by one.

First Analysis

sprice_bn <- lm(sale~bedrooms + neighborhood, data = homeprice)
sprice_bn
## 
## Call:
## lm(formula = sale ~ bedrooms + neighborhood, data = homeprice)
## 
## Coefficients:
##  (Intercept)      bedrooms  neighborhood  
##      -132.06         42.48         93.49
summary(sprice_bn)
## 
## Call:
## lm(formula = sale ~ bedrooms + neighborhood, data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.871 -39.861   0.636  28.815 107.660 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -132.057     40.341  -3.273 0.003001 ** 
## bedrooms       42.483     11.446   3.712 0.000987 ***
## neighborhood   93.493      9.101  10.273 1.21e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.3 on 26 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8375 
## F-statistic: 73.16 on 2 and 26 DF,  p-value: 2.1e-11
anova(sprice_bn)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## bedrooms      1  91233   91233  40.782 9.163e-07 ***
## neighborhood  1 236105  236105 105.542 1.206e-10 ***
## Residuals    26  58164    2237                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sprice_bn, which = 1)

  • The main thing to look at here is the output of the summary function. Holding neighborhood fixed, each additional bedroom is associated with a sale price about 42.5 higher (in the data's price units), and holding bedrooms fixed, each one-point increase in the neighborhood index is associated with a sale price about 93.5 higher. Both effects are statistically significant. The Adjusted R-squared of 0.8375 is relatively close to one, meaning the model explains most of the variation in sale price. The graph is a plot of the residuals against the fitted values. The residuals appear to follow a pattern, which is not ideal: it suggests the model may be biased, for example because it omits a relevant variable.
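To make the coefficients concrete, here is a minimal sketch that plugs the estimates from summary(sprice_bn) into the fitted equation by hand. The helper name predict_sale and the example house (3 bedrooms, neighborhood index 3) are my own assumptions for illustration; they are not part of the original analysis.

```r
# Coefficients copied from summary(sprice_bn) above.
coef_bn <- c(intercept = -132.057, bedrooms = 42.483, neighborhood = 93.493)

# Hypothetical helper: fitted sale price for a given house.
predict_sale <- function(bedrooms, neighborhood, coefs = coef_bn) {
  unname(coefs["intercept"] +
           coefs["bedrooms"] * bedrooms +
           coefs["neighborhood"] * neighborhood)
}

base_house    <- predict_sale(bedrooms = 3, neighborhood = 3)
extra_bedroom <- predict_sale(bedrooms = 4, neighborhood = 3)
better_area   <- predict_sale(bedrooms = 3, neighborhood = 4)

# One more bedroom or one more index point shifts the fitted price
# by exactly the matching coefficient.
print(c(base = base_house,
        bedroom_effect = extra_bedroom - base_house,
        neighborhood_effect = better_area - base_house))
```

With the fitted object in memory, predict(sprice_bn, newdata = ...) would give the same fitted values without copying coefficients by hand.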

Second Analysis

sprice_bba <- lm(sale ~ bedrooms + full, data = homeprice)
summary(sprice_bba)
## 
## Call:
## lm(formula = sale ~ bedrooms + full, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -169.151  -59.716    2.849   51.849  196.159 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -15.80      69.50  -0.227  0.82198   
## bedrooms       46.57      21.64   2.152  0.04087 * 
## full           82.12      23.19   3.541  0.00153 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.38 on 26 degrees of freedom
## Multiple R-squared:  0.4851, Adjusted R-squared:  0.4454 
## F-statistic: 12.25 on 2 and 26 DF,  p-value: 0.000179
anova(sprice_bba)
## Analysis of Variance Table
## 
## Response: sale
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## bedrooms   1  91233   91233  11.949 0.001893 **
## full       1  95756   95756  12.541 0.001527 **
## Residuals 26 198514    7635                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sprice_bba, which = 1)

  • For the second analysis, both bedrooms and full bathrooms are significant predictors of the sale price at the 5% level (p = 0.041 and p = 0.0015), but the evidence is weaker than in the first model. The residual plot again shows a pattern, and the Adjusted R-squared is relatively low at 0.4454, so this model explains less than half of the variation in sale price.
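As a rough check on where the R-squared figures come from, the sketch below rebuilds them from the sums of squares in anova(sprice_bba). The numbers are copied from the output above; the variable names are my own. The results should land close to the reported 0.4851 and 0.4454, up to rounding in the printed table.

```r
# Sums of squares copied from anova(sprice_bba) above.
ss_bedrooms <- 91233
ss_full     <- 95756
ss_resid    <- 198514

n <- 29  # rows in homeprice, per skim()
p <- 2   # predictors in the model (bedrooms, full)

ss_total <- ss_bedrooms + ss_full + ss_resid

# Share of variation explained, then the version penalized for model size.
r2     <- 1 - ss_resid / ss_total
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```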

Third Analysis

sprice_rn <- lm(sale~rooms + neighborhood + bedrooms, data = homeprice)
summary(sprice_rn)
## 
## Call:
## lm(formula = sale ~ rooms + neighborhood + bedrooms, data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.166 -35.314  -2.229  33.494  93.133 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -157.35      43.35  -3.630  0.00127 ** 
## rooms           16.75      11.73   1.427  0.16589    
## neighborhood    88.03       9.71   9.066 2.23e-09 ***
## bedrooms        17.40      20.85   0.835  0.41181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.38 on 25 degrees of freedom
## Multiple R-squared:  0.8605, Adjusted R-squared:  0.8437 
## F-statistic:  51.4 on 3 and 25 DF,  p-value: 7.783e-11
anova(sprice_rn)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## rooms         1 152218  152218 70.7572 9.298e-09 ***
## neighborhood  1 178003  178003 82.7432 2.092e-09 ***
## bedrooms      1   1499    1499  0.6967    0.4118    
## Residuals    25  53782    2151                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sprice_rn, which = 1)

  • Things get more interesting once total rooms, bedrooms, and neighborhood are all in the model. In the summary, only neighborhood has a significant coefficient. In the anova table, however, both total rooms and neighborhood are significantly associated with the sale price. The two outputs differ because anova tests terms sequentially, in the order they appear in the formula, while summary tests each term with all the others already in the model. The Adjusted R-squared of 0.8437 is similar to the first analysis and fairly close to one. The residual plot no longer shows the pattern seen in the last two graphs. To me this means that the key factors in the sale price of a house are the number of rooms (beyond bedrooms) and, above all, the neighborhood the property is in.
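The order dependence of anova shows up in the output above: bedrooms has a sum of squares of 91233 when it enters first (first model) but only 1499 when it enters last (third model). The sketch below reproduces that behavior on made-up data, since it has to run without homeprice.csv. The toy data frame and its coefficients are assumptions for illustration only.

```r
set.seed(42)  # hypothetical data, not homeprice.csv

n <- 29
bedrooms <- sample(1:5, n, replace = TRUE)
rooms <- bedrooms + sample(2:6, n, replace = TRUE)  # rooms overlaps with bedrooms
sale <- 50 + 30 * rooms + rnorm(n, sd = 40)
toy <- data.frame(sale, rooms, bedrooms)

# Same two predictors, entered in opposite orders.
rooms_first    <- anova(lm(sale ~ rooms + bedrooms, data = toy))
bedrooms_first <- anova(lm(sale ~ bedrooms + rooms, data = toy))

# Same model, same residuals; only the credit assigned to each term changes.
print(rooms_first[, c("Df", "Sum Sq")])
print(bedrooms_first[, c("Df", "Sum Sq")])
```

The residual row is identical in both tables, while the rows for rooms and bedrooms change with the order. That is why a term can look significant in anova and not in summary.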

Final analysis for sale price

sprice_all <- lm(sale ~ half + full + bedrooms + rooms + neighborhood, data = homeprice)
summary(sprice_all)
## 
## Call:
## lm(formula = sale ~ half + full + bedrooms + rooms + neighborhood, 
##     data = homeprice)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.31 -34.06   7.20  21.32  55.93 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -135.263     37.283  -3.628  0.00141 ** 
## half           43.242     12.830   3.370  0.00264 ** 
## full           26.225     13.896   1.887  0.07181 .  
## bedrooms       20.409     17.798   1.147  0.26329    
## rooms           6.488     10.383   0.625  0.53823    
## neighborhood   77.243     10.077   7.665 8.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.29 on 23 degrees of freedom
## Multiple R-squared:  0.9079, Adjusted R-squared:  0.8879 
## F-statistic: 45.34 on 5 and 23 DF,  p-value: 3.686e-11
anova(sprice_all)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## half          1  59893   59893  38.7920 2.354e-06 ***
## full          1 179168  179168 116.0452 1.844e-10 ***
## bedrooms      1  10581   10581   6.8530   0.01538 *  
## rooms         1   9632    9632   6.2387   0.02009 *  
## neighborhood  1  90717   90717  58.7562 8.859e-08 ***
## Residuals    23  35511    1544                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sprice_all, which = 1)

  • For the final analysis, I included all the variables (besides list price) as predictors of the sale price. The neighborhood the property is in again appears to be the most important factor. The number of half bathrooms is also significant, and full bathrooms marginally so (p = 0.072). Together with the previous analyses, this suggests that homes in higher-income neighborhoods tend to have a higher sale price.
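One way to ask whether the extra variables in the full model earn their place is a nested F-test against the first model (sale ~ bedrooms + neighborhood), which is a special case of the full one. This is a sketch under the usual linear-model assumptions, computed only from the residual rows of the two anova tables above; it was not part of the original analysis.

```r
# Residual sums of squares and degrees of freedom copied from the anova tables above.
rss_small <- 58164; df_small <- 26  # sale ~ bedrooms + neighborhood
rss_full  <- 35511; df_full  <- 23  # sale ~ half + full + bedrooms + rooms + neighborhood

extra_terms <- df_small - df_full   # half, full, rooms

# Drop in residual variation per added term, relative to the full model's error variance.
f_stat  <- ((rss_small - rss_full) / extra_terms) / (rss_full / df_full)
p_value <- pf(f_stat, extra_terms, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

With both fitted objects in memory, anova(sprice_bn, sprice_all) performs the same comparison directly.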

List Price

  • For the list price, I repeated the analysis above, this time using the price at which the property was listed as the response.
lprice_bn <- lm(list~ bedrooms + neighborhood, data = homeprice)
summary(lprice_bn)
## 
## Call:
## lm(formula = list ~ bedrooms + neighborhood, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.443  -34.765   -0.783   21.009   98.122 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -140.914     40.794  -3.454   0.0019 ** 
## bedrooms       42.887     11.574   3.705   0.0010 ** 
## neighborhood   96.565      9.203  10.493 7.71e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.83 on 26 degrees of freedom
## Multiple R-squared:  0.8535, Adjusted R-squared:  0.8423 
## F-statistic: 75.75 on 2 and 26 DF,  p-value: 1.428e-11
anova(lprice_bn)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## bedrooms      1  94709   94709  41.402 8.105e-07 ***
## neighborhood  1 251878  251878 110.108 7.707e-11 ***
## Residuals    26  59477    2288                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_bn, which = 1)

lprice_bba <- lm(list ~ bedrooms + full, data = homeprice)
summary(lprice_bba)
## 
## Call:
## lm(formula = list ~ bedrooms + full, data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -172.13  -60.43   -2.13   47.87  173.78 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -23.24      70.05  -0.332 0.742762    
## bedrooms       46.19      21.81   2.118 0.043900 *  
## full           87.89      23.37   3.760 0.000871 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.07 on 26 degrees of freedom
## Multiple R-squared:  0.5033, Adjusted R-squared:  0.4651 
## F-statistic: 13.17 on 2 and 26 DF,  p-value: 0.0001119
anova(lprice_bba)
## Analysis of Variance Table
## 
## Response: list
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## bedrooms   1  94709   94709   12.21 0.0017211 ** 
## full       1 109679  109679   14.14 0.0008708 ***
## Residuals 26 201675    7757                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_bba, which = 1)

lprice_rn <- lm(list ~ rooms + neighborhood + bedrooms, data = homeprice)
summary(lprice_rn)
## 
## Call:
## lm(formula = list ~ rooms + neighborhood + bedrooms, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.761  -29.449    1.635   31.158   73.909 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -168.136     43.598  -3.856 0.000716 ***
## rooms          18.019     11.800   1.527 0.139299    
## neighborhood   90.688      9.766   9.286  1.4e-09 ***
## bedrooms       15.899     20.971   0.758 0.455452    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.65 on 25 degrees of freedom
## Multiple R-squared:  0.866,  Adjusted R-squared:  0.8499 
## F-statistic: 53.87 on 3 and 25 DF,  p-value: 4.706e-11
anova(lprice_rn)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## rooms         1 160647  160647 73.8239 6.239e-09 ***
## neighborhood  1 189763  189763 87.2042 1.253e-09 ***
## bedrooms      1   1251    1251  0.5748    0.4555    
## Residuals    25  54402    2176                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_rn, which = 1)

lprice_all <- lm(list ~ half + full + bedrooms + rooms + neighborhood, data = homeprice)
summary(lprice_all)
## 
## Call:
## lm(formula = list ~ half + full + bedrooms + rooms + neighborhood, 
##     data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.788 -28.776   4.351  23.859  62.720 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -144.544     36.026  -4.012 0.000546 ***
## half           45.556     12.397   3.675 0.001257 ** 
## full           32.125     13.427   2.392 0.025293 *  
## bedrooms       18.446     17.197   1.073 0.294572    
## rooms           7.126     10.033   0.710 0.484661    
## neighborhood   77.430      9.737   7.952 4.75e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.97 on 23 degrees of freedom
## Multiple R-squared:  0.9183, Adjusted R-squared:  0.9006 
## F-statistic: 51.74 on 5 and 23 DF,  p-value: 9.358e-12
anova(lprice_all)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## half          1  62454   62454  43.3236 1.025e-06 ***
## full          1 199389  199389 138.3143 3.300e-11 ***
## bedrooms      1   9745    9745   6.7597   0.01601 *  
## rooms         1  10162   10162   7.0494   0.01415 *  
## neighborhood  1  91158   91158  63.2352 4.754e-08 ***
## Residuals    23  33156    1442                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_all, which = 1)

  • Once again, I will break these down and examine them one by one.
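Since the same three steps (fit, summary, anova) are repeated for every formula, they could be wrapped in a small helper. The sketch below is one possible shape for it. The function name fit_and_report, its return structure, and the toy data are all hypothetical; the toy data only stands in for homeprice.csv so the example can run on its own.

```r
# Hypothetical helper; the name and return structure are illustrative only.
fit_and_report <- function(formula, data, show_plot = FALSE) {
  fit <- lm(formula, data = data)
  if (show_plot) plot(fit, which = 1)  # residuals vs fitted
  list(fit = fit,
       summary = summary(fit),
       anova = anova(fit))
}

# Toy data standing in for homeprice.csv (values are made up).
set.seed(1)
toy <- data.frame(bedrooms = sample(1:5, 29, replace = TRUE),
                  neighborhood = sample(1:5, 29, replace = TRUE))
toy$list <- -140 + 43 * toy$bedrooms + 97 * toy$neighborhood + rnorm(29, sd = 45)

# One entry per model; each gets the same treatment.
formulas <- list(bn = list ~ bedrooms + neighborhood,
                 n  = list ~ neighborhood)
reports <- lapply(formulas, fit_and_report, data = toy)

print(sapply(reports, function(r) r$summary$adj.r.squared))
```

Keeping the formulas in a named list makes it easy to line up the Adjusted R-squared values side by side instead of scrolling between outputs.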

First Analysis

lprice_bn <- lm(list~ bedrooms + neighborhood, data = homeprice)
summary(lprice_bn)
## 
## Call:
## lm(formula = list ~ bedrooms + neighborhood, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.443  -34.765   -0.783   21.009   98.122 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -140.914     40.794  -3.454   0.0019 ** 
## bedrooms       42.887     11.574   3.705   0.0010 ** 
## neighborhood   96.565      9.203  10.493 7.71e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.83 on 26 degrees of freedom
## Multiple R-squared:  0.8535, Adjusted R-squared:  0.8423 
## F-statistic: 75.75 on 2 and 26 DF,  p-value: 1.428e-11
anova(lprice_bn)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## bedrooms      1  94709   94709  41.402 8.105e-07 ***
## neighborhood  1 251878  251878 110.108 7.707e-11 ***
## Residuals    26  59477    2288                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_bn, which = 1)

  • This is the same comparison as the first analysis for sale price, but with list price as the response. Neighborhood again appears to be the primary factor in the list price of a property. However, the size of the residuals and the residual plot suggest the model may be missing relevant variables.

Second Analysis

lprice_bba <- lm(list ~ bedrooms + full, data = homeprice)
summary(lprice_bba)
## 
## Call:
## lm(formula = list ~ bedrooms + full, data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -172.13  -60.43   -2.13   47.87  173.78 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -23.24      70.05  -0.332 0.742762    
## bedrooms       46.19      21.81   2.118 0.043900 *  
## full           87.89      23.37   3.760 0.000871 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.07 on 26 degrees of freedom
## Multiple R-squared:  0.5033, Adjusted R-squared:  0.4651 
## F-statistic: 13.17 on 2 and 26 DF,  p-value: 0.0001119
anova(lprice_bba)
## Analysis of Variance Table
## 
## Response: list
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## bedrooms   1  94709   94709   12.21 0.0017211 ** 
## full       1 109679  109679   14.14 0.0008708 ***
## Residuals 26 201675    7757                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_bba, which = 1)

  • Once again, full baths appear to play an important role in the list price of a property. The effect is somewhat more significant here than in the sale price analysis (p = 0.00087 versus p = 0.0015). However, the Adjusted R-squared is still relatively low at 0.4651, and the residual plot shows some bias.
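The difference in significance for full baths between the two responses can be traced back to the estimates and standard errors. The sketch below recomputes the t values and two-sided p-values from the numbers printed in the two summaries; the data frame layout is my own. Small differences from the printed output are expected because the inputs are rounded.

```r
# Estimates and standard errors for `full`, copied from the two summaries.
full_coef <- data.frame(
  response  = c("sale", "list"),
  estimate  = c(82.12, 87.89),
  std_error = c(23.19, 23.37)
)
df_resid <- 26  # residual degrees of freedom in both models

# t value is the estimate in units of its standard error;
# the p-value is the two-sided tail area of the t distribution.
full_coef$t_value <- full_coef$estimate / full_coef$std_error
full_coef$p_value <- 2 * pt(-abs(full_coef$t_value), df = df_resid)

print(full_coef)
```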

Third Analysis

lprice_rn <- lm(list ~ rooms + neighborhood + bedrooms, data = homeprice)
summary(lprice_rn)
## 
## Call:
## lm(formula = list ~ rooms + neighborhood + bedrooms, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.761  -29.449    1.635   31.158   73.909 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -168.136     43.598  -3.856 0.000716 ***
## rooms          18.019     11.800   1.527 0.139299    
## neighborhood   90.688      9.766   9.286  1.4e-09 ***
## bedrooms       15.899     20.971   0.758 0.455452    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.65 on 25 degrees of freedom
## Multiple R-squared:  0.866,  Adjusted R-squared:  0.8499 
## F-statistic: 53.87 on 3 and 25 DF,  p-value: 4.706e-11
anova(lprice_rn)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## rooms         1 160647  160647 73.8239 6.239e-09 ***
## neighborhood  1 189763  189763 87.2042 1.253e-09 ***
## bedrooms      1   1251    1251  0.5748    0.4555    
## Residuals    25  54402    2176                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_rn, which = 1)

  • The third analysis shows similar trends to the third analysis for the sale price. In the anova table, the list price is significantly associated with the number of rooms and with the neighborhood the property is in, while in the summary only neighborhood has a significant coefficient.

Final Analysis

lprice_all <- lm(list ~ half + full + bedrooms + rooms + neighborhood, data = homeprice)
summary(lprice_all)
## 
## Call:
## lm(formula = list ~ half + full + bedrooms + rooms + neighborhood, 
##     data = homeprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.788 -28.776   4.351  23.859  62.720 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -144.544     36.026  -4.012 0.000546 ***
## half           45.556     12.397   3.675 0.001257 ** 
## full           32.125     13.427   2.392 0.025293 *  
## bedrooms       18.446     17.197   1.073 0.294572    
## rooms           7.126     10.033   0.710 0.484661    
## neighborhood   77.430      9.737   7.952 4.75e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.97 on 23 degrees of freedom
## Multiple R-squared:  0.9183, Adjusted R-squared:  0.9006 
## F-statistic: 51.74 on 5 and 23 DF,  p-value: 9.358e-12
anova(lprice_all)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## half          1  62454   62454  43.3236 1.025e-06 ***
## full          1 199389  199389 138.3143 3.300e-11 ***
## bedrooms      1   9745    9745   6.7597   0.01601 *  
## rooms         1  10162   10162   7.0494   0.01415 *  
## neighborhood  1  91158   91158  63.2352 4.754e-08 ***
## Residuals    23  33156    1442                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lprice_all, which = 1)

  • Once again, the final analysis shows that the list price is significantly affected by the neighborhood the property is in. The list price is also significantly associated with the number of full and half bathrooms. The Adjusted R-squared is very high at 0.9006, and the residual plot shows no obvious sign of bias.
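Another way to read the coefficient table is through 95% confidence intervals, which show how precisely each effect is pinned down with only 29 houses. This is a sketch built from the estimates and standard errors in summary(lprice_all); the variable names are my own.

```r
# Estimates and standard errors copied from summary(lprice_all) above.
est <- c(half = 45.556, full = 32.125, bedrooms = 18.446,
         rooms = 7.126, neighborhood = 77.430)
se  <- c(half = 12.397, full = 13.427, bedrooms = 17.197,
         rooms = 10.033, neighborhood = 9.737)

t_crit <- qt(0.975, df = 23)  # 23 residual degrees of freedom

# Estimate plus or minus the critical value times the standard error.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0

print(round(ci, 1))
print(excludes_zero)
```

The intervals for half, full, and neighborhood stay above zero, while those for bedrooms and rooms straddle it, which matches the significance stars in the summary. With the fitted object in memory, confint(lprice_all) returns these intervals directly.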

Comparison

– After examining both analyses, I wanted to look more closely at the most important factor affecting the list and sale price of a home: the neighborhood it is in.

lprice_n <- lm(list ~ neighborhood, data = homeprice)
summary(lprice_n)
## 
## Call:
## lm(formula = list ~ neighborhood, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.878  -31.504   -2.878   47.822  103.683 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -28.75      33.17  -0.867    0.394    
## neighborhood   104.81      10.83   9.676 2.86e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.02 on 27 degrees of freedom
## Multiple R-squared:  0.7762, Adjusted R-squared:  0.7679 
## F-statistic: 93.63 on 1 and 27 DF,  p-value: 2.863e-10
sprice_n <- lm(sale ~ neighborhood, data = homeprice)
summary(sprice_n)
## 
## Call:
## lm(formula = sale ~ neighborhood, data = homeprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -134.378  -35.041   -9.041   36.985  125.633 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -20.95      32.82  -0.638    0.529    
## neighborhood   101.66      10.72   9.485 4.36e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.41 on 27 degrees of freedom
## Multiple R-squared:  0.7692, Adjusted R-squared:  0.7606 
## F-statistic: 89.97 on 1 and 27 DF,  p-value: 4.357e-10
  • For both list and sale price, where the home is located is highly significant, as both summaries show. Each one-point increase in the neighborhood index (that is, a richer area) is associated with a list price about 104.8 thousand dollars higher and a sale price about 101.7 thousand dollars higher, roughly 3 thousand dollars less than the list. To me this matches what can be seen in the real world every day: houses that are significantly smaller and in worse condition sell for more than large new houses simply because they are in better neighborhoods. This could mean that a house listed in a rich neighborhood might sell for more than its list price, although these two models alone do not show that.
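To see how the two fitted lines compare across the index, the sketch below evaluates both equations at each neighborhood value from 1 to 5. The coefficients are copied from summary(lprice_n) and summary(sprice_n); the comparison table is my own addition. Both lines come from separate fits on 29 houses with sizeable standard errors, so the gap between them should be read as a rough indication only.

```r
# Coefficients copied from summary(lprice_n) and summary(sprice_n) above.
index <- 1:5
fitted_list <- -28.75 + 104.81 * index
fitted_sale <- -20.95 + 101.66 * index

comparison <- data.frame(neighborhood = index,
                         list = fitted_list,
                         sale = fitted_sale,
                         list_minus_sale = fitted_list - fitted_sale)

# The gap grows by the difference in slopes for each step up the index.
print(comparison)
```

The gap between fitted list and fitted sale price widens by about 3.15 per index point, so on these fits the list price pulls ahead of the sale price as the neighborhood index rises.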