Project Report

Author

Maxwell Martin

First, I am going to read in the data about home prices.

homeprice = read.csv("homeprice.csv")
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.3.3
head(homeprice)
   list  sale full half bedrooms rooms neighborhood
1  80.0 117.7    1    0        3     6            1
2 151.4 151.0    1    0        4     7            1
3 310.0 300.0    2    1        4     9            3
4 295.0 275.0    2    1        4     8            3
5 339.0 340.0    2    0        3     7            4
6 337.5 337.5    1    1        4     8            3
str(homeprice)
'data.frame':   29 obs. of  7 variables:
 $ list        : num  80 151 310 295 339 ...
 $ sale        : num  118 151 300 275 340 ...
 $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
 $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
 $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
 $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
 $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...

After a quick look at how the data is organized, I will make some plots that show the data in a way that is easier to understand.

hist(homeprice$sale, main = "Sale Price of Homes in New Jersey", xlab = "Sale Price", col = "purple")

ggplot(homeprice, aes(x = sale)) + geom_histogram(fill = "pink", col = "black") + facet_wrap( ~ neighborhood) + labs(title = "Sale Price by Neighborhood", x = "Sale Price", y = "Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
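
The message above appears because no bin width was chosen. A minimal sketch of setting one explicitly is shown here. It uses only the six rows printed by head() so that it runs on its own, and the bin width of 50 (in the same units as sale) is an assumption, not a tuned value. The name homeprice_head is also just for illustration.

```r
library(ggplot2)

# First six rows of homeprice, copied from the head() output above.
# The report itself would use the full 29-row data frame instead.
homeprice_head <- data.frame(
  sale = c(117.7, 151.0, 300.0, 275.0, 340.0, 337.5),
  neighborhood = c(1, 1, 3, 3, 4, 3)
)

# binwidth = 50 is an assumed value; boundary = 0 lines the bins up on multiples of 50.
p <- ggplot(homeprice_head, aes(x = sale)) +
  geom_histogram(binwidth = 50, boundary = 0, fill = "pink", color = "black") +
  facet_wrap(~ neighborhood) +
  labs(title = "Sale Price by Neighborhood", x = "Sale Price", y = "Frequency")

print(p)
```

With only 29 houses split across four panels, a wider bin like this is likely to be easier to read than the default 30 bins, but the right width is a judgment call.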

Now I am going to make a model that can give us insight into the effect other variables have on sale price, and evaluate the variables' effects.

model_saleprice <- lm(sale ~ full + half + bedrooms + rooms + neighborhood, data = homeprice)
summary(model_saleprice)

Call:
lm(formula = sale ~ full + half + bedrooms + rooms + neighborhood, 
    data = homeprice)

Residuals:
   Min     1Q Median     3Q    Max 
-59.31 -34.06   7.20  21.32  55.93 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -135.263     37.283  -3.628  0.00141 ** 
full           26.225     13.896   1.887  0.07181 .  
half           43.242     12.830   3.370  0.00264 ** 
bedrooms       20.409     17.798   1.147  0.26329    
rooms           6.488     10.383   0.625  0.53823    
neighborhood   77.243     10.077   7.665 8.86e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.29 on 23 degrees of freedom
Multiple R-squared:  0.9079,    Adjusted R-squared:  0.8879 
F-statistic: 45.34 on 5 and 23 DF,  p-value: 3.686e-11
anova(model_saleprice)
Analysis of Variance Table

Response: sale
             Df Sum Sq Mean Sq F value    Pr(>F)    
full          1 151632  151632 98.2101 9.062e-10 ***
half          1  87430   87430 56.6271 1.206e-07 ***
bedrooms      1  10581   10581  6.8530   0.01538 *  
rooms         1   9632    9632  6.2387   0.02009 *  
neighborhood  1  90717   90717 58.7562 8.859e-08 ***
Residuals    23  35511    1544                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, I am going to make a model representing the effects of the same variables on list price and evaluate the variables' effects.

model_listprice <- lm(list ~ full + half + bedrooms + rooms + neighborhood, data = homeprice)
summary(model_listprice)

Call:
lm(formula = list ~ full + half + bedrooms + rooms + neighborhood, 
    data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.788 -28.776   4.351  23.859  62.720 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -144.544     36.026  -4.012 0.000546 ***
full           32.125     13.427   2.392 0.025293 *  
half           45.556     12.397   3.675 0.001257 ** 
bedrooms       18.446     17.197   1.073 0.294572    
rooms           7.126     10.033   0.710 0.484661    
neighborhood   77.430      9.737   7.952 4.75e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.97 on 23 degrees of freedom
Multiple R-squared:  0.9183,    Adjusted R-squared:  0.9006 
F-statistic: 51.74 on 5 and 23 DF,  p-value: 9.358e-12
anova(model_listprice)
Analysis of Variance Table

Response: list
             Df Sum Sq Mean Sq  F value    Pr(>F)    
full          1 169594  169594 117.6457 1.615e-10 ***
half          1  92249   92249  63.9922 4.294e-08 ***
bedrooms      1   9745    9745   6.7597   0.01601 *  
rooms         1  10162   10162   7.0494   0.01415 *  
neighborhood  1  91158   91158  63.2352 4.754e-08 ***
Residuals    23  33156    1442                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

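After evaluating the summary and ANOVA output, I would recommend that real estate agents focus on bathrooms as a feature of houses. Half bathrooms are statistically significant in both models (p = 0.003 for sale price, p = 0.001 for list price). Full bathrooms are significant at the 5% level for list price (p = 0.025) but only marginally so for sale price (p = 0.072). Neighborhood has the strongest effect in both models, but unlike neighborhood, bathrooms are a changeable feature that can be added to a house, which seems important to me.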
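
The p-values from summary() and anova() differ because anova() tests each term in the order it was entered, while summary() tests each term after all the others. The sketch below illustrates that difference. It is not the model from this report: it uses only the six rows printed by head() and a reduced two-predictor model (both assumptions made so the example runs on its own), so its numbers should not be read as results.

```r
# First six rows of homeprice, copied from the head() output in this report.
# Six rows are far too few for real inference; they only make the sketch runnable.
homeprice_head <- data.frame(
  sale = c(117.7, 151.0, 300.0, 275.0, 340.0, 337.5),
  full = c(1, 1, 2, 2, 2, 1),
  half = c(0, 0, 1, 1, 0, 1)
)

# Reduced two-predictor model (an assumption for this sketch).
fit <- lm(sale ~ full + half, data = homeprice_head)

# Sequential tests: each term is tested after the terms listed before it.
seq_tab <- anova(fit)

# Marginal tests: each term is tested after all other terms in the model.
marg_tab <- drop1(fit, test = "F")

# Coefficient t-tests from summary(); for one-df terms, t^2 equals the marginal F.
t_vals <- summary(fit)$coefficients[, "t value"]

print(seq_tab)
print(marg_tab)
print(t_vals^2)
```

Only the last term entered gets the same F value in both tables. For the earlier terms the sequential test can look much stronger, which is why bedrooms and rooms are significant in the anova() tables above but not in the summary() tables.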

Lastly, I can fit a new model for the difference between sale and list price directly.

homeprice$diff <- homeprice$sale - homeprice$list
model_diff <- lm(diff ~ neighborhood, data = homeprice)
summary(model_diff)

Call:
lm(formula = diff ~ neighborhood, data = homeprice)

Residuals:
   Min     1Q Median     3Q    Max 
-30.05  -7.50  -0.85   5.80  33.05 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     7.800      7.435   1.049    0.303
neighborhood   -3.150      2.428  -1.298    0.205

Residual standard error: 13 on 27 degrees of freedom
Multiple R-squared:  0.0587,    Adjusted R-squared:  0.02383 
F-statistic: 1.684 on 1 and 27 DF,  p-value: 0.2054

After all is evaluated, I think that richer neighborhoods do not tend to have houses that sell over list price more than poorer neighborhoods. The neighborhood coefficient in the difference model is not statistically significant (p = 0.205), and the model explains only about 6% of the variation in the difference. I suspect this is because the list price is based mostly on comparable houses in the local area that have similar bathrooms. If so, houses in either kind of neighborhood are affected more by how their number of bathrooms compares to other houses in their area.
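
The difference model treats neighborhood as a numeric code, so it assumes the difference changes by the same amount from one code to the next. A simple check that does not rely on that assumption is to look at the mean difference within each neighborhood. The sketch below does this on the six rows printed by head() only, so that it runs on its own. The names homeprice_head, mean_diff, and n_houses are illustrative, and with so few rows the output is not a finding.

```r
# First six rows of homeprice, copied from the head() output in this report.
homeprice_head <- data.frame(
  list = c(80.0, 151.4, 310.0, 295.0, 339.0, 337.5),
  sale = c(117.7, 151.0, 300.0, 275.0, 340.0, 337.5),
  neighborhood = c(1, 1, 3, 3, 4, 3)
)

# Same difference as in the report: positive means the house sold above list price.
homeprice_head$diff <- homeprice_head$sale - homeprice_head$list

# Mean sale-minus-list difference within each neighborhood code.
mean_diff <- tapply(homeprice_head$diff, homeprice_head$neighborhood, mean)

# Number of houses behind each mean, to show how little data each one rests on.
n_houses <- table(homeprice_head$neighborhood)

print(mean_diff)
print(n_houses)
```

Running the same two summaries on the full homeprice data frame would show whether any one neighborhood stands out, which a single slope can hide.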