In this project we attempt to better understand the relationship between home characteristics and sale price, using a dataset of homes sold in a New Jersey town in 2001.
To begin, we load our libraries and read in the data.
library(ggplot2)
library(dplyr)
library(tidyr)
homes <- read.csv("../data/homeprice.csv")
Question 1
Using this file, explore the relationship between the sale price and the other variables using scatterplots, histograms and/or boxplots. Identify those variables that appear to have the strongest relationship with sale price.
Histogram
Let’s begin with a simple histogram showing the spread of home prices in our dataset
ggplot(homes, aes(x = sale)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 20) +
  theme_minimal() +
  labs(title = "Distribution of Home Sale Prices",
       x = "Sale Price",
       y = "Number of Homes")
Boxplot
Now let’s see a boxplot showing sale price by its neighborhood rank
ggplot(homes, aes(x = as.factor(neighborhood), y = sale)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  theme_minimal() +
  labs(title = "Home Sale Prices Across Different Neighborhoods",
       x = "Neighborhood Rank",
       y = "Sale Price")
Scatterplot
Let’s now use scatterplots to take a look at how all the different variables relate to sale price
# Reshape to long format: one row per home per predictor
homes_long <- homes %>%
  pivot_longer(cols = c(list, full, half, bedrooms, rooms, neighborhood),
               names_to = "Predictor",
               values_to = "Value")

ggplot(homes_long, aes(x = Value, y = sale)) +
  geom_point(alpha = 0.4, color = "darkblue") +             # Alpha adds transparency so overlapping points are visible
  geom_smooth(method = "lm", color = "red", se = FALSE) +   # Adds a trendline to each plot
  facet_wrap(~ Predictor, scales = "free_x") +              # This creates the 6-panel grid!
  theme_minimal() +
  labs(title = "Sale Price vs. House Characteristics",
       subtitle = "Comparing all different variables against final sale price",
       x = "Value of Predictor",
       y = "Sale Price") +
  theme(strip.text = element_text(face = "bold", size = 12))  # Makes the panel titles pop
`geom_smooth()` using formula = 'y ~ x'
On visual inspection, every variable appears to have a fairly strong relationship with sale price. List price looks strongest, followed by neighborhood, rooms, and bedrooms. Full and half baths also show a relationship, though a weaker one. This is only a visual impression; modeling is needed to determine whether these relationships are statistically significant.
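One way to put numbers on a visual ranking like this is to compute the correlation of each variable with sale price. The sketch below is only an illustration: `toy_homes` is a hypothetical data frame with made-up values that borrows the column names used in this report, so its output says nothing about the real data.

```r
# Hypothetical stand-in for the homes data; these values are made up
# purely so the example runs on its own.
toy_homes <- data.frame(
  sale         = c(120, 150, 180, 210, 260, 300),
  list         = c(125, 155, 178, 215, 265, 310),
  full         = c(1, 1, 2, 2, 2, 3),
  half         = c(0, 1, 0, 1, 1, 1),
  bedrooms     = c(2, 3, 3, 3, 4, 4),
  rooms        = c(5, 6, 7, 7, 8, 9),
  neighborhood = c(1, 2, 2, 3, 4, 5)
)

# Pearson correlation of every other column with sale price
predictors <- setdiff(names(toy_homes), "sale")
sale_cor <- sapply(toy_homes[predictors], cor, y = toy_homes$sale)

# Sort so the strongest absolute correlation comes first
sale_cor <- sale_cor[order(abs(sale_cor), decreasing = TRUE)]
print(round(sale_cor, 3))
```

Replacing `toy_homes` with the real `homes` data frame would give a ranking to compare against the scatterplots. Correlations only capture linear, one-variable-at-a-time relationships, so they complement the regression rather than replace it.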
Question 2
Now use these variables to build a multiple linear regression model to explain the sale price. Use the summary() function to find the coefficients and goodness-of-fit of the model. Use the anova() function to identify which variable appears to have the greatest effect on sale price. Remember to look at the distribution of residuals.
Let’s start with making the linear model
sale_model <-lm(sale ~ full + half + bedrooms + rooms + neighborhood, data = homes)
Now we can use the summary() function to find the coefficients and goodness-of-fit of the model
summary(sale_model)
Call:
lm(formula = sale ~ full + half + bedrooms + rooms + neighborhood,
data = homes)
Residuals:
   Min     1Q Median     3Q    Max
-59.31 -34.06   7.20  21.32  55.93
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -135.263     37.283  -3.628  0.00141 **
full           26.225     13.896   1.887  0.07181 .
half           43.242     12.830   3.370  0.00264 **
bedrooms       20.409     17.798   1.147  0.26329
rooms           6.488     10.383   0.625  0.53823
neighborhood   77.243     10.077   7.665 8.86e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.29 on 23 degrees of freedom
Multiple R-squared: 0.9079, Adjusted R-squared: 0.8879
F-statistic: 45.34 on 5 and 23 DF, p-value: 3.686e-11
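The linear model shows that neighborhood has the largest coefficient: each one-step increase in neighborhood rank is associated with an increase of about 77.2 in sale price (in the dataset's price units), holding the other variables fixed. Interestingly, the next two largest coefficients are half baths (43.2) and full baths (26.2), although the full-bath coefficient is only marginally significant (p = 0.072). Personally, I would have expected rooms and bedrooms to have a larger effect, but neither coefficient is significantly different from zero in this model.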
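The adjusted R-squared is 0.8879 (multiple R-squared 0.9079), meaning the model explains roughly 89% of the variance in sale price after adjusting for the number of predictors. This indicates a strong fit, though with only 29 homes it should be read with some caution.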
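As a quick check on how the two R-squared values in the summary relate, the adjusted figure can be reproduced by hand from the multiple R-squared. The sketch below uses only numbers printed in the output above; the sample size of 29 is inferred from the 23 residual degrees of freedom plus 6 estimated coefficients.

```r
r_squared <- 0.9079   # Multiple R-squared from summary(sale_model)
n <- 23 + 6           # residual df + estimated coefficients (incl. intercept)
p <- 5                # number of predictors

# Adjusted R-squared penalises the fit for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))
```

The result matches the 0.8879 reported by summary(), which confirms that the gap between the two values comes entirely from the penalty for using five predictors on a small sample.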
With that established we can move on to using anova to identify which variables have the greatest effect on sale price.
anova(sale_model)
Analysis of Variance Table
Response: sale
             Df Sum Sq Mean Sq F value    Pr(>F)
full          1 151632  151632 98.2101 9.062e-10 ***
half          1  87430   87430 56.6271 1.206e-07 ***
bedrooms      1  10581   10581  6.8530   0.01538 *
rooms         1   9632    9632  6.2387   0.02009 *
neighborhood  1  90717   90717 58.7562 8.859e-08 ***
Residuals    23  35511    1544
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
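The anova table shows that full bathrooms account for the largest sum of squares (151,632), followed by neighborhood (90,717) and half baths (87,430). Again, I am surprised that rooms and bedrooms aren’t higher on the list. One caveat: anova() on an lm object reports sequential sums of squares, so each term is credited only with the variation not already explained by the terms before it. Full baths are entered first and may be absorbing variation shared with the other predictors, which means this ranking depends partly on the order of terms in the formula.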
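Because anova() on a fitted lm uses sequential sums of squares, the credit a predictor receives can change with its position in the formula. The sketch below illustrates this on simulated, hypothetical data (`toy`, `x1`, and `x2` are made up and unrelated to the homes dataset), and shows drop1() as one order-independent alternative.

```r
set.seed(1)

# Simulated, hypothetical data: x2 is deliberately correlated with x1
x1 <- rnorm(50)
x2 <- 0.8 * x1 + rnorm(50, sd = 0.5)
y  <- 2 * x1 + 2 * x2 + rnorm(50)
toy <- data.frame(y, x1, x2)

# Sum of squares credited to x1 when it is entered first vs. last
ss_x1_first <- anova(lm(y ~ x1 + x2, data = toy))["x1", "Sum Sq"]
ss_x1_last  <- anova(lm(y ~ x2 + x1, data = toy))["x1", "Sum Sq"]
print(c(first = ss_x1_first, last = ss_x1_last))

# Order-independent view: F test for dropping each term from the full model
print(drop1(lm(y ~ x1 + x2, data = toy), test = "F"))
```

The same pattern is visible in the output above: neighborhood is entered last, so its anova F value (58.76) is approximately the square of its t value from summary() (7.665). Running drop1(sale_model, test = "F") on the real model would give each variable's contribution after accounting for all the others.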
With all this said, let’s look at the residuals and see how well the model performed.
sale_residuals <- residuals(sale_model)
hist(sale_residuals, breaks = 15, col = "lightblue", border = "black",
     main = "Histogram of Sale Model Residuals",
     xlab = "Residuals (Actual Price - Predicted Price)")
plot(sale_model, which = 2, main = "Normal Q-Q Plot of Residuals")
plot(sale_model, which = 1, main = "Residuals vs Fitted Values")
The small sample size makes it difficult to learn much from the histogram. However, the Q-Q plot shows that the residuals follow the normal line reasonably well, with a few exceptions at the extremes. The residuals vs. fitted plot is consistent with this: the model appears to overpredict for the cheapest and most expensive homes.
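A formal test can supplement the visual check of the Q-Q plot. The sketch below applies the Shapiro-Wilk test to the residuals of a model fitted on simulated, hypothetical data (`toy` and `toy_model` are stand-ins, not the project's objects). With a sample of about 29 the test has limited power, so a large p-value should not be read as proof of normality.

```r
set.seed(42)

# Hypothetical data standing in for the homes data frame
toy <- data.frame(x = runif(29, min = 1, max = 5))
toy$y <- 50 + 75 * toy$x + rnorm(29, sd = 40)

toy_model <- lm(y ~ x, data = toy)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(toy_model))
print(sw)
```

For the real analysis, the equivalent call would be shapiro.test(residuals(sale_model)).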
Question 3
Build a second model using the same variables to explain the list price. Use the anova() function to identify which variable appears to have the greatest effect on list price. Are there differences from the sale price? Could you use this information to recommend which characteristic of a house a real estate agent should concentrate on?
Now we will run the same analysis as before but switch sale price for list price
list_model <- lm(list ~ full + half + bedrooms + rooms + neighborhood, data = homes)
summary(list_model)
Call:
lm(formula = list ~ full + half + bedrooms + rooms + neighborhood,
data = homes)
Residuals:
    Min      1Q  Median      3Q     Max
-60.788 -28.776   4.351  23.859  62.720
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -144.544     36.026  -4.012 0.000546 ***
full           32.125     13.427   2.392 0.025293 *
half           45.556     12.397   3.675 0.001257 **
bedrooms       18.446     17.197   1.073 0.294572
rooms           7.126     10.033   0.710 0.484661
neighborhood   77.430      9.737   7.952 4.75e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37.97 on 23 degrees of freedom
Multiple R-squared: 0.9183, Adjusted R-squared: 0.9006
F-statistic: 51.74 on 5 and 23 DF, p-value: 9.358e-12
anova(list_model)
Analysis of Variance Table
Response: list
             Df Sum Sq Mean Sq  F value    Pr(>F)
full          1 169594  169594 117.6457 1.615e-10 ***
half          1  92249   92249  63.9922 4.294e-08 ***
bedrooms      1   9745    9745   6.7597   0.01601 *
rooms         1  10162   10162   7.0494   0.01415 *
neighborhood  1  91158   91158  63.2352 4.754e-08 ***
Residuals    23  33156    1442
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
list_residuals <- residuals(list_model)
hist(list_residuals, breaks = 15, col = "lightblue", border = "black",
     main = "Histogram of list Model Residuals",
     xlab = "Residuals (Actual Price - Predicted Price)")
plot(list_model, which = 2, main = "Normal Q-Q Plot of Residuals")
plot(list_model, which = 1, main = "Residuals vs Fitted Values")
Comparing the sale and list models, both agree that neighborhood and bathrooms (half and full) matter most. The differences show up in the coefficient estimates from summary(). Relative to the sale model, the list model puts more weight on bathrooms (full: 32.1 vs. 26.2; half: 45.6 vs. 43.2) and less on bedrooms (18.4 vs. 20.4), which suggests that listings slightly overvalue bathrooms and undervalue bedrooms. This fits my earlier surprise at how much value the model assigns to bathrooms. These gaps are small compared with the standard errors, so they are suggestive rather than conclusive. I would recommend that a real estate agent concentrate on neighborhood and half baths when setting prices.
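To make the comparison between the two models concrete, the estimates can be placed side by side. The sketch below copies the coefficients from the two summary() outputs above into a small data frame; the names `coefs` and `list_minus_sale` are introduced here only for illustration.

```r
# Coefficient estimates copied from summary(sale_model) and summary(list_model)
coefs <- data.frame(
  term = c("full", "half", "bedrooms", "rooms", "neighborhood"),
  sale = c(26.225, 43.242, 20.409, 6.488, 77.243),
  list = c(32.125, 45.556, 18.446, 7.126, 77.430)
)

# Positive values: the characteristic carries more weight in the list price
# than in the eventual sale price
coefs$list_minus_sale <- coefs$list - coefs$sale
print(coefs)
```

Bedrooms is the only characteristic with a negative difference, and the neighborhood coefficients are nearly identical in the two models. With the fitted models available, the same table could be built directly from coef(sale_model) and coef(list_model).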
Question 4
Finally, what is the effect of neighborhood on the difference between sale price and list price? Do richer neighborhoods mean it is more likely to have a house go over the asking price?
To answer this, we need two new variables: the difference between sale price and list price, and a logical (TRUE/FALSE) flag showing whether the sale price was over the list price.
homes <- homes %>%
  mutate(price_diff = sale - list,
         over_asking = price_diff > 0)
Let’s see the result
neighborhood_summary <- homes %>%
  group_by(neighborhood) %>%
  summarize(avg_diff = mean(price_diff),
            percent_over_asking = mean(over_asking) * 100)
print(neighborhood_summary)
ggplot(homes, aes(x = as.factor(neighborhood), y = price_diff)) +
  geom_boxplot(fill = "lightblue") +
  # linewidth replaces the size argument for lines, deprecated since ggplot2 3.4.0
  geom_hline(yintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  theme_minimal() +
  labs(title = "Sale vs. List Price Difference by Neighborhood",
       x = "Neighborhood Rank (1: Poor -> 5: Rich)",
       y = "Price Difference (Sale - List)")
Interestingly, a richer neighborhood does not make a house more likely to go over the asking price. In fact, the effect is the opposite: homes in poorer neighborhoods are more likely to sell over asking.
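A grouped summary and a boxplot describe the pattern, but a simple regression of the price difference on neighborhood rank would indicate whether the trend is distinguishable from noise. The sketch below shows the idea on a hypothetical data frame with made-up values (`toy_homes` and `diff_model` are illustrative names), so its coefficients say nothing about the real data.

```r
# Made-up values for illustration only; they are not the project's data
toy_homes <- data.frame(
  neighborhood = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  sale = c(101, 98, 152, 149, 199, 203, 248, 251, 296, 305),
  list = c(100, 100, 150, 150, 200, 200, 250, 250, 300, 300)
)
toy_homes$price_diff <- toy_homes$sale - toy_homes$list

# Slope on neighborhood: average change in (sale - list) per rank step
diff_model <- lm(price_diff ~ neighborhood, data = toy_homes)
print(summary(diff_model)$coefficients)
```

Applied to the real `homes` data, a negative and significant slope would support the conclusion above, while a slope indistinguishable from zero would suggest the difference between neighborhoods could be due to chance in a sample this small.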