Final Project House Prices

Author

Finley Robbins

Final Project

The goal of the document is to:

  • Explore the relationship between sale price and other variables.

  • Build a multiple linear regression model to explain the sale price.

  • Build a multiple linear regression model to explain the list price.

  • Find the effect of neighborhoods on sale and list price.

Exploring The Relationship

I loaded the packages needed for the analysis.

library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(knitr)
Warning: package 'knitr' was built under R version 4.3.3

Then I created a variable home to hold the contents of the provided CSV and checked its structure.

home <- read.csv("homeprice.csv")
str(home)
'data.frame':   29 obs. of  7 variables:
 $ list        : num  80 151 310 295 339 ...
 $ sale        : num  118 151 300 275 340 ...
 $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
 $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
 $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
 $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
 $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...

I then made my first plot which was a histogram of the sale price distribution.

ggplot(home, aes(x = sale)) + geom_histogram() + ggtitle("Sale Price Histogram")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I then made five scatter plots of List Price, Number of Bedrooms, Number of Non-Bedrooms, Bathrooms and Half Bathrooms against the sale price. In each ggplot call I set the dataset to home, the x variable to sale and the y variable to the one being tested. I set the plot type to scatter using “geom_point()” and added a smoothed trend line using “geom_smooth()”, which defaults to a loess curve here rather than a straight regression line.

ggplot(home, aes(x = sale, y = list)) + geom_point() + geom_smooth() + ggtitle("Sale vs. List Price")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(home, aes(x = sale, y = bedrooms)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Number of Bedrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(home, aes(x = sale, y = rooms)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Number of Non-Bedrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(home, aes(x = sale, y = full)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Bathrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(home, aes(x = sale, y = half)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Half Bathrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
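The messages above show that geom_smooth() fell back to a loess smoother. If a straight least-squares line is wanted instead, method = "lm" can be passed explicitly. The following is a minimal sketch on simulated data; the toy data frame and its values are assumptions for illustration, not the contents of homeprice.csv.

```r
library(ggplot2)

# Simulated stand-in for the home data frame (not the real homeprice.csv values).
set.seed(2)
toy <- data.frame(sale = runif(29, 100, 500))
toy$list <- toy$sale + rnorm(29, sd = 13)

# method = "lm" fits a straight least-squares line; se = TRUE keeps the confidence band.
p <- ggplot(toy, aes(x = sale, y = list)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  ggtitle("Sale vs. List Price (linear fit)")
p
```

Swapping toy for home would apply the same layer to the real data.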

For the neighborhood variable I made a box plot instead, which required converting it to a factor.

ggplot(home, aes(x = factor(neighborhood), y = sale)) + geom_boxplot() + ggtitle("Sale and Neighborhood Wealth")

Based on the plots, sale and list price are the most strongly correlated, which is expected; in a real situation there would be no need to test for a relationship between those two variables. The other plots all showed a positive association with sale price, but half bathrooms was the least linear, and full bathrooms appeared slightly weaker than bedrooms and non-bedroom rooms, which were themselves weakened by outliers in the mid-range sale prices. The box plot of sale price by neighborhood seemed stronger than the room plots, since the boxes holding the bulk of the data showed little overlap. Overall, the relationships between the variables and sale price were positive and roughly linear.
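The visual ranking of the plots could be backed up with correlation coefficients. Below is a minimal sketch using cor() on simulated data; the toy data frame only borrows the column names from home, and its values are made up, so the numbers it prints say nothing about the real dataset.

```r
# Toy data standing in for homeprice.csv (values are simulated, not the real data).
set.seed(1)
n <- 29
toy <- data.frame(
  full     = sample(1:3, n, replace = TRUE),
  half     = sample(0:2, n, replace = TRUE),
  bedrooms = sample(1:5, n, replace = TRUE)
)
toy$rooms <- toy$bedrooms + sample(2:5, n, replace = TRUE)
toy$list  <- 40 + 60 * toy$full + 20 * toy$rooms + rnorm(n, sd = 20)
toy$sale  <- toy$list + rnorm(n, sd = 13)

# Pearson correlation of every predictor with sale, sorted strongest first.
predictors <- setdiff(names(toy), "sale")
cors <- sapply(toy[predictors], function(x) cor(x, toy$sale))
sort(cors, decreasing = TRUE)
```

Running the same two lines on home (with predictors taken from its column names) would give the coefficients for the actual plots.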

Multiple Linear Regression for Sale Price

modelforsale <- lm(sale ~ full + half + rooms + bedrooms + neighborhood + list, data = home)

Using the linear model function lm() in R, I created a variable for the sale model and set the formula so that sale is explained by all the other variables.

summary(modelforsale)

Call:
lm(formula = sale ~ full + half + rooms + bedrooms + neighborhood + 
    list, data = home)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.807  -6.626  -0.270   5.580  32.933 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.13359   17.15496   0.299    0.768    
full         -4.97759    5.48033  -0.908    0.374    
half         -1.00644    5.70418  -0.176    0.862    
rooms        -0.43411    3.70424  -0.117    0.908    
bedrooms      2.49224    6.43616   0.387    0.702    
neighborhood  2.03434    6.88609   0.295    0.770    
list          0.97131    0.07616  12.754 1.22e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.87 on 22 degrees of freedom
Multiple R-squared:  0.989, Adjusted R-squared:  0.986 
F-statistic: 330.5 on 6 and 22 DF,  p-value: < 2.2e-16

The summary function returns the coefficients, a summary of the residuals and the significance of each coefficient.
None of the variables besides list price were statistically significant; all had p values above 0.1. Of the other variables, full bathrooms had the lowest p value (0.374), the second lowest overall. The R-squared of 0.989 was close enough to one to justify the model as an explanation of the sale price.
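One thing worth noting is that str(home) lists neighborhood as an integer, so lm() fits it as a single numeric slope. If the codes are better read as separate groups than as evenly spaced steps, wrapping the variable in factor() gives one coefficient per level. The sketch below compares the two codings on simulated data; the toy values are assumptions, and whether factor coding suits the real data depends on how neighborhood was recorded.

```r
# Simulated stand-in for homeprice.csv (values are made up for illustration).
set.seed(3)
toy <- data.frame(neighborhood = rep(1:5, each = 12))
n <- nrow(toy)
toy$list <- 50 + 70 * toy$neighborhood + rnorm(n, sd = 25)
toy$sale <- toy$list + rnorm(n, sd = 13)

# Numeric coding: one slope, assumes equal steps between neighborhood ranks.
fit_num <- lm(sale ~ list + neighborhood, data = toy)

# Factor coding: one coefficient per neighborhood level (level 1 is the baseline).
fit_fac <- lm(sale ~ list + factor(neighborhood), data = toy)

length(coef(fit_num))  # intercept + list + 1 slope
length(coef(fit_fac))  # intercept + list + 4 level contrasts

# F test of whether the extra flexibility improves the fit.
anova(fit_num, fit_fac)
```

With only 29 houses, the factor version uses several extra degrees of freedom, so the simpler numeric coding may still be the more practical choice.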

anova(modelforsale)
Analysis of Variance Table

Response: sale
             Df Sum Sq Mean Sq  F value    Pr(>F)    
full          1 151632  151632 788.5513 < 2.2e-16 ***
half          1  87430   87430 454.6719 3.479e-16 ***
rooms         1  19851   19851 103.2341 9.030e-10 ***
bedrooms      1    362     362   1.8827    0.1839    
neighborhood  1  90717   90717 471.7666 2.359e-16 ***
list          1  31280   31280 162.6723 1.222e-11 ***
Residuals    22   4230     192                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

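The anova() function returns a sequential analysis of variance table for the model: each predictor is tested in the order it appears in the formula, after the terms listed above it have been accounted for. The p value is used to interpret the significance.

The results showed that, entered in this order, all variables besides bedrooms were significant predictors.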
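Because anova() on an lm object tests terms sequentially, the p values depend on the order of the formula, which helps explain why they differ so much from the summary() output. The sketch below shows this on simulated data (the toy data frame is an assumption, not the real file) and uses drop1() to test each term given all the others.

```r
# Simulated stand-in for homeprice.csv (values are made up for illustration).
set.seed(4)
n <- 29
toy <- data.frame(full = sample(1:3, n, replace = TRUE))
toy$list <- 60 + 90 * toy$full + rnorm(n, sd = 30)
toy$sale <- toy$list + rnorm(n, sd = 13)

# Same model, two term orders: coefficients match, sequential F tests do not.
fit_a <- lm(sale ~ full + list, data = toy)
fit_b <- lm(sale ~ list + full, data = toy)

anova(fit_a)  # full is tested first, before list is in the model
anova(fit_b)  # full is tested last, after list has been accounted for

# drop1() tests each term given all the others, matching the summary() t tests.
drop1(fit_a, test = "F")
```

In the toy data, full only matters through list, so it looks highly significant when entered first and much less so when entered after list.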

I used ggplot to plot the residuals and assess the validity of the model. The residuals should be scattered evenly around zero with no clear pattern, which is what approximately normal errors with constant variance would produce.

To do this the residuals were plotted against the fitted values using a scatter plot.

ggplot(modelforsale, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0) + ggtitle("Residual for Sale Model")

Based on their distribution around zero, the residuals suggest that the model is valid.
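A residuals-versus-fitted plot mainly reveals patterns and uneven spread. Normality itself is usually checked more directly with a Q-Q plot or a Shapiro-Wilk test. The following is a minimal sketch on simulated data; the toy data frame and the single-predictor model are assumptions for illustration, and the same calls could be applied to resid(modelforsale).

```r
# Simulated stand-in for homeprice.csv (values are made up for illustration).
set.seed(5)
n <- 29
toy <- data.frame(list = runif(n, 80, 600))
toy$sale <- 5 + 0.97 * toy$list + rnorm(n, sd = 14)

fit <- lm(sale ~ list, data = toy)
res <- resid(fit)

# Q-Q plot: points close to the line suggest roughly normal residuals.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p value is evidence against normality.
sw <- shapiro.test(res)
sw$p.value
```

With only 29 observations the test has limited power, so the Q-Q plot is probably the more informative of the two checks.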

Multiple Linear Regression for List Price

The same process was repeated to create the model for the list price.

modelforlist <- lm(list ~ full + half + rooms + bedrooms + neighborhood + sale, data = home)
summary(modelforlist)

Call:
lm(formula = list ~ full + half + rooms + bedrooms + neighborhood + 
    sale, data = home)

Residuals:
     Min       1Q   Median       3Q      Max 
-27.8544  -6.7013  -0.7265   6.7894  31.3427 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -21.8752    15.9419  -1.372    0.184    
full           8.3411     5.0923   1.638    0.116    
half           6.3398     5.3475   1.186    0.248    
rooms          1.2426     3.5706   0.348    0.731    
bedrooms      -0.0627     6.2402  -0.010    0.992    
neighborhood   7.3793     6.4787   1.139    0.267    
sale           0.9069     0.0711  12.754 1.22e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 22 degrees of freedom
Multiple R-squared:  0.9903,    Adjusted R-squared:  0.9876 
F-statistic: 373.3 on 6 and 22 DF,  p-value: < 2.2e-16

The summary of the multiple regression for list price returned lower p values for most predictors than the sale price model did. Once again the only significant variable was the other price, here the sale price. Bedrooms had the highest p value (0.992). The adjusted R-squared was 0.9876, which is close enough to 1 to justify the model.

anova(modelforlist)
Analysis of Variance Table

Response: list
             Df Sum Sq Mean Sq  F value    Pr(>F)    
full          1 169594  169594 944.6042 < 2.2e-16 ***
half          1  92249   92249 513.8081 < 2.2e-16 ***
rooms         1  19349   19349 107.7693 6.085e-10 ***
bedrooms      1    558     558   3.1071   0.09184 .  
neighborhood  1  91158   91158 507.7299 < 2.2e-16 ***
sale          1  29206   29206 162.6723 1.222e-11 ***
Residuals    22   3950     180                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(modelforlist, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0) + ggtitle("Residual for List Model")

The Anova table showed the same pattern as the one for the sale price model: all variables except the number of bedrooms were significant predictors at the 0.05 level. The residuals also appeared to be normally distributed.

Overall, comparing the results for both models, the list price being explained more accurately by the predictors makes practical sense from a human perspective on the housing market. A house with more rooms leads the seller to expect a higher price, but the combination of quality, size and other factors determines what the house is actually worth to a buyer. One might expect bedrooms to be a main predictor of price as well. However, if a family's household size is fixed, more bedrooms are not necessary, and the family may prefer to pay extra for a different type of room. Full bathrooms had the lowest p value of the non-price predictors in both models, so the advice for a real estate agent would be that, outside of the list or sale price, they should focus on full bathrooms first and then half bathrooms.

Neighborhoods on Sale vs. List Price

To explore the connection between neighborhood wealth and the gap between sale and list price, a regression model was fitted with the difference between sale and list price as the response.

home$differnce <- home$sale - home$list
n_diff <- lm(differnce ~ neighborhood, data = home)
summary(n_diff)

Call:
lm(formula = differnce ~ neighborhood, data = home)

Residuals:
   Min     1Q Median     3Q    Max 
-30.05  -7.50  -0.85   5.80  33.05 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     7.800      7.435   1.049    0.303
neighborhood   -3.150      2.428  -1.298    0.205

Residual standard error: 13 on 27 degrees of freedom
Multiple R-squared:  0.0587,    Adjusted R-squared:  0.02383 
F-statistic: 1.684 on 1 and 27 DF,  p-value: 0.2054

Then another box plot was created to visualize the difference between sale and list price across the neighborhood wealth ranks, as well as a plot of the residuals to test the validity of the model.

ggplot(home, aes(factor(neighborhood), y = differnce)) + geom_boxplot() + geom_hline(yintercept = 0) + ggtitle("Sale - List Price for Neighborhoods")

ggplot(n_diff, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0) + ggtitle("Residual for Neighborhood")

While the p value of the regression model (0.205) was not low enough to suggest a relationship between neighborhood and the difference between sale and list price, it was lower than the p values obtained earlier for the bedroom and non-bedroom predictors.

Looking at the residuals, there was not enough data to support the model, as the richest and poorest neighborhoods had only four data points in total.
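Group sizes like these can be checked directly with table(). The sketch below uses a short hypothetical vector of neighborhood codes, not the real column; the threshold of 3 is also an arbitrary choice for illustration.

```r
# Hypothetical neighborhood codes; replace with home$neighborhood for the real data.
neighborhood <- c(1, 2, 2, 3, 3, 3, 3, 4, 4, 5)

# Number of houses in each neighborhood level.
counts <- table(neighborhood)
counts

# Flag levels with fewer than a chosen minimum (3 here is an arbitrary threshold).
names(counts)[counts < 3]
```

Levels flagged this way are the ones where a box plot or a per-level estimate rests on very few houses.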

I had predicted a positive relationship between neighborhood wealth and the difference between sale and list price, but the visual trend of the box plot is the opposite. From a human perspective, property values are rising faster than incomes, so there is more interest in houses that are affordable for more people. These houses are more likely to be in poorer neighborhoods, which creates more demand and therefore more buyers offering above the list price. A house with little demand, by contrast, might sell for less than its list price.