GEOG 6680 Final Project: House Prices
Introduction
This project examines the relationship between home characteristics and sale price in a New Jersey housing dataset from 2001. The goal is to identify which variables are most strongly associated with sale price, compare those patterns with list price, and determine whether neighborhood wealth affects the difference between sale price and asking price.
Neighborhoods are labeled by wealth ranking from 1 to 5, with 1 being the least wealthy and 5 the wealthiest.
Exploratory analysis
The analysis began with a series of scatterplots comparing sale price to each explanatory variable. These plots provide an initial overview of potential relationships within the data.
Initial exploration revealed strong positive relationships between sale price and list price, and between sale price and neighborhood rank. However, scatterplots of sale price against bathrooms, bedrooms, and rooms were less informative because these variables take only a small number of discrete values. As a result, many observations overlap at the same x-values, making it difficult to evaluate the distribution of sale prices within each category.
For this reason, boxplots were used to better visualize the relationships between sale price and the discrete explanatory variables.
Home price is in thousands of US dollars.
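A minimal sketch of the boxplot approach described above, using base R. The data frame `home_demo` and its values are made up for illustration and stand in for the project's `home` data; only the column names `sale` and `full` come from this report.

```r
# Hypothetical stand-in for `home`; prices are invented, in thousands of USD.
home_demo <- data.frame(
  sale = c(120, 135, 150, 210, 240, 260, 330, 360),
  full = c(1, 1, 1, 2, 2, 2, 3, 3)
)

# Treat the bathroom count as a category so each value gets its own box.
home_demo$full <- factor(home_demo$full)

bp <- boxplot(sale ~ full, data = home_demo,
              xlab = "Full bathrooms",
              ylab = "Sale price (thousands of USD)",
              main = "Sale price by number of full bathrooms")

# The third row of the returned statistics holds the median of each box.
medians <- as.numeric(bp$stats[3, ])
print(medians)
```

Converting the count to a factor is what separates the overlapping points into one box per category.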
Sale Price and House Characteristics
The boxplots indicate that homes with more bathrooms generally sell for higher prices. A similar pattern is visible for bedrooms, although there is substantial overlap among bedroom categories. The neighborhood boxplot shows the clearest separation among groups, suggesting that neighborhood may be one of the strongest determinants of home value.
Mean sale price
Mean sale price increased consistently from the lowest-ranked neighborhood to the highest-ranked neighborhood, indicating a strong positive relationship between neighborhood wealth and house value.
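One way to compute and check such group means is sketched below. The data frame `home_demo` is hypothetical (invented prices), and `tapply` is only one of several ways to summarize by group; it is not necessarily the method used in this project.

```r
# Hypothetical stand-in for `home`; sale prices are invented, in thousands of USD.
home_demo <- data.frame(
  neighborhood = factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)),
  sale = c(90, 110, 150, 170, 240, 260, 330, 350, 480, 520)
)

# Mean sale price within each neighborhood rank.
mean_sale <- tapply(home_demo$sale, home_demo$neighborhood, mean)
print(round(mean_sale, 1))

# TRUE when every neighborhood's mean exceeds that of the rank below it.
increasing <- all(diff(as.numeric(mean_sale)) > 0)
print(increasing)
```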
Neighborhood and House Size
The strong neighborhood effect raises an important question: are homes in wealthier neighborhoods simply larger, or does neighborhood itself contribute additional value?
Cross-tabulations of bathrooms by neighborhood show that homes in higher-ranked neighborhoods generally contain more bathrooms.
Neighborhood (rows) by number of full bathrooms (columns):
table(home$neighborhood, home$full)
     1  2  3
  1  2  0  0
  2  5  2  1
  3  6  6  0
  4  0  3  2
  5  0  0  2
Neighborhood (rows) by number of half bathrooms (columns):
table(home$neighborhood, home$half)
     0  1  2
  1  2  0  0
  2  4  3  1
  3  3  8  1
  4  3  2  0
  5  1  0  1
To examine house size more directly, a total room count was calculated as the sum of bedrooms, non-bedroom rooms, full bathrooms, and half bathrooms.
Neighborhood (rows) by total room count (columns):
     5  8 10 11 12 13 14 15 16 19 20
  1  0  0  1  0  1  0  0  0  0  0  0
  2  1  0  1  0  3  1  2  0  0  0  0
  3  0  1  0  3  1  3  2  1  1  0  0
  4  0  0  0  0  1  2  0  1  0  0  1
  5  0  0  0  0  0  1  0  0  0  1  0
The table shows that homes in wealthier neighborhoods tend to have more rooms.
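The total room count and its cross-tabulation could be produced along the following lines. This is a sketch under assumptions: the rows in `home_demo` are invented, and the column name `total` is hypothetical. The names `bedrooms`, `rooms`, `full`, and `half` follow the model formulas used later in this report.

```r
# Hypothetical rows standing in for `home`.
home_demo <- data.frame(
  neighborhood = factor(c(1, 2, 3, 4, 5)),
  bedrooms = c(3, 3, 4, 4, 5),
  rooms    = c(5, 6, 6, 7, 9),
  full     = c(1, 1, 2, 2, 3),
  half     = c(1, 1, 1, 2, 2)
)

# Total room count: bedrooms + other rooms + full baths + half baths.
# `total` is an assumed name, not confirmed by the source.
home_demo$total <- with(home_demo, bedrooms + rooms + full + half)

# Cross-tabulate neighborhood (rows) against total rooms (columns).
room_table <- table(home_demo$neighborhood, home_demo$total)
print(room_table)
```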
The bubble plot illustrates the relationship among sale price, list price, neighborhood rank, and total room count. Larger homes are generally associated with higher prices and are concentrated in the higher-ranked neighborhoods.
Because square footage was not available in the dataset, a price-per-room metric was calculated by dividing sale price by total rooms. Although this measure should not be interpreted as a substitute for price per square foot, it provides a useful proxy of value relative to house size.
# A tibble: 5 × 3
  neighborhood MeanTotal MeanPricePerRoom
  <fct>            <dbl>            <dbl>
1 1                 11               12.2
2 2                 11.5             14.7
3 3                 12.6             22.4
4 4                 14.6             26.4
5 5                 16               33.4
[1] 45.45455
[1] 173.7705
Average total room count increased from 11.0 rooms in Neighborhood 1 to 16.0 rooms in Neighborhood 5, an increase of approximately 45%. Over the same range, average price per room increased from 12.2 to 33.4 thousand dollars per room, an increase of approximately 174%.
The increase in price per room was considerably larger than the increase in total room count, suggesting that the higher value of homes in wealthier neighborhoods cannot be explained solely by larger house size. The pattern is consistent with buyers paying a neighborhood premium for homes in wealthier neighborhoods.
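The two percentage changes can be reproduced from the neighborhood 1 and neighborhood 5 rows of the summary table. The helper `pct_change` is a hypothetical name introduced only for this sketch.

```r
# Values copied from the summary table (neighborhoods 1 and 5).
mean_total <- c(n1 = 11, n5 = 16)      # mean total rooms
mean_ppr   <- c(n1 = 12.2, n5 = 33.4)  # mean price per room, thousands of USD

# Percent change from the first element to the second (hypothetical helper).
pct_change <- function(x) unname((x[2] - x[1]) / x[1] * 100)

room_growth  <- pct_change(mean_total)  # 45.45455
price_growth <- pct_change(mean_ppr)    # 173.7705

# How many times faster price per room grew than room count.
growth_ratio <- price_growth / room_growth
print(c(rooms = room_growth, price_per_room = price_growth, ratio = growth_ratio))
```

Price per room grows several times faster than room count across the neighborhood ranking, which is the basis for the size-alone argument above.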
Modeling Sale Price
The sale price model explained approximately 94% of the variation in home sale prices (R² = 0.935). The overall model was highly significant (p < 0.001), indicating that the included variables collectively provide a strong explanation of sale price.
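The ANOVA table shows that neighborhood was the most important predictor of sale price, accounting for more variation than any other variable. Because anova() reports sequential sums of squares, each term is assessed after the terms listed before it, so these contributions depend on the order in which variables enter the model. Among the housing characteristics, the number of half bathrooms had the strongest effect. Bedrooms contributed a small but statistically significant amount of explanatory power in the sequential table (p = 0.048), although its coefficient in the full model is not significant (p = 0.777). The number of other rooms and full bathrooms contributed relatively little after accounting for neighborhood effects.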
fit_sale = lm(sale ~ neighborhood + full + half + bedrooms + rooms, data = home)
summary(fit_sale)
Call:
lm(formula = sale ~ neighborhood + full + half + bedrooms + rooms,
    data = home)
Residuals:
    Min      1Q  Median      3Q     Max
-41.876 -29.560  -4.928  16.586  70.893
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     18.545     43.498   0.426 0.674408
neighborhood2   -3.077     34.470  -0.089 0.929749
neighborhood3   81.044     32.578   2.488 0.021798 *
neighborhood4  168.570     38.935   4.330 0.000326 ***
neighborhood5  270.906     51.763   5.234 4.03e-05 ***
full            26.766     13.675   1.957 0.064405 .
half            50.931     12.615   4.037 0.000645 ***
bedrooms         5.215     18.134   0.288 0.776615
rooms           10.890      9.862   1.104 0.282616
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 35.34 on 20 degrees of freedom
Multiple R-squared: 0.9352, Adjusted R-squared: 0.9093
F-statistic: 36.09 on 8 and 20 DF, p-value: 3.106e-10
anova(fit_sale)
Analysis of Variance Table
Response: sale
             Df Sum Sq Mean Sq F value    Pr(>F)
neighborhood  4 307719   76930 61.6043 5.826e-11 ***
full          1   2974    2974  2.3812   0.13848
half          1  42775   42775 34.2538 1.002e-05 ***
bedrooms      1   5536    5536  4.4334   0.04808 *
rooms         1   1523    1523  1.2193   0.28262
Residuals    20  24975    1249
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient estimates indicate that homes located in neighborhoods 3, 4, and 5 sold for significantly higher prices than homes in neighborhood 1, even after controlling for house characteristics. This further supports neighborhood as the primary characteristic in determining house price.
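As a check on how the reported R² relates to the ANOVA table, the sums of squares can be combined directly. The sketch below uses only the numbers printed in the anova(fit_sale) output. The shares are sequential, so they describe each term as entered in this particular order.

```r
# Sums of squares copied from the anova(fit_sale) output.
ss <- c(neighborhood = 307719, full = 2974, half = 42775,
        bedrooms = 5536, rooms = 1523, Residuals = 24975)

# Total variation is the sum of all model terms plus the residual.
ss_total <- sum(ss)

# R-squared: share of total variation not left in the residuals.
r_squared <- 1 - ss[["Residuals"]] / ss_total

# Percentage of total variation attributed to each line of the table.
share <- round(100 * ss / ss_total, 1)
print(round(r_squared, 4))
print(share)
```

The result agrees with the Multiple R-squared of 0.9352 in the model summary, and neighborhood alone accounts for roughly 80% of the total sum of squares.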
The residuals versus fitted values plot shows no strong pattern, indicating that the model captures the major trends in the data.
Modeling List Price
A second regression model was developed using list price.
The results were very similar to those obtained for sale price. Neighborhood remained the strongest predictor, followed by the bathroom variables. This similarity suggests that sellers and buyers value many of the same housing characteristics, and it is consistent with an efficiently operating market.
fit_list = lm(list ~ neighborhood + full + half + bedrooms + rooms, data = home)
summary(fit_list)
Call:
lm(formula = list ~ neighborhood + full + half + bedrooms + rooms,
    data = home)
Residuals:
    Min      1Q  Median      3Q     Max
-50.506 -24.097  -0.703  22.762  54.684
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -10.294     44.424  -0.232 0.819113
neighborhood2   25.163     35.203   0.715 0.483005
neighborhood3  100.995     33.271   3.036 0.006531 **
neighborhood4  188.939     39.763   4.752 0.000122 ***
neighborhood5  300.699     52.865   5.688 1.44e-05 ***
full            29.280     13.966   2.097 0.048950 *
half            50.090     12.884   3.888 0.000914 ***
bedrooms        10.500     18.520   0.567 0.577065
rooms            9.225     10.072   0.916 0.370624
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.09 on 20 degrees of freedom
Multiple R-squared: 0.9358, Adjusted R-squared: 0.9102
F-statistic: 36.47 on 8 and 20 DF, p-value: 2.82e-10
anova(fit_list)
Analysis of Variance Table
Response: list
             Df Sum Sq Mean Sq F value    Pr(>F)
neighborhood  4 324678   81169 62.3184 5.241e-11 ***
full          1   4287    4287  3.2916   0.08466 .
half          1  42918   42918 32.9503 1.286e-05 ***
bedrooms      1   7038    7038  5.4034   0.03074 *
rooms         1   1093    1093  0.8389   0.37062
Residuals    20  26050    1302
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
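To make the similarity between the two models concrete, the coefficient estimates can be placed side by side. This is an informal comparison built from the printed summaries, not a formal test. The two models are fitted to the same homes, so their estimates are not independent.

```r
# Estimates and sale-model standard errors copied from the two summaries.
coefs <- data.frame(
  term = c("neighborhood2", "neighborhood3", "neighborhood4", "neighborhood5",
           "full", "half", "bedrooms", "rooms"),
  sale = c(-3.077, 81.044, 168.570, 270.906, 26.766, 50.931, 5.215, 10.890),
  list = c(25.163, 100.995, 188.939, 300.699, 29.280, 50.090, 10.500, 9.225),
  se_sale = c(34.470, 32.578, 38.935, 51.763, 13.675, 12.615, 18.134, 9.862)
)

# Gap between the list-price and sale-price estimates for each term.
coefs$list_minus_sale <- coefs$list - coefs$sale

# TRUE when every gap is smaller than the sale model's standard error.
within_one_se <- all(abs(coefs$list_minus_sale) < coefs$se_sale)
print(coefs)
print(within_one_se)
```

Every gap is smaller than one standard error of the corresponding sale-price estimate, which supports reading the two sets of coefficients as broadly alike.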
Residual checks.
plot(fit_list, which = 1)
Difference between sale price and list price
The final analysis examined whether neighborhood influences the difference between sale price and list price. The plot of sale price minus list price by neighborhood shows no strong relationship between neighborhood and whether homes sell for more than the asking price.
Some variation was observed among neighborhoods, and several individual coefficients in the regression summary are significant at the 0.05 level. However, those coefficients compare each neighborhood with neighborhood 1, which contains only two homes, and the overall test of the neighborhood effect is not statistically significant (F = 2.148 on 4 and 24 degrees of freedom, p = 0.106). Therefore, the data do not provide strong evidence that homes in wealthier neighborhoods are more likely to sell above asking price.
fit_diff = lm(diff ~ neighborhood, data = home)
summary(fit_diff)
Call:
lm(formula = diff ~ neighborhood, data = home)
Residuals:
    Min      1Q  Median      3Q     Max
-26.000  -4.875   0.150   5.125  26.000
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     18.650      8.626   2.162   0.0408 *
neighborhood2  -24.800      9.644  -2.572   0.0167 *
neighborhood3  -17.775      9.317  -1.908   0.0684 .
neighborhood4  -21.250     10.206  -2.082   0.0482 *
neighborhood5  -30.650     12.199  -2.513   0.0191 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.2 on 24 degrees of freedom
Multiple R-squared: 0.2636, Adjusted R-squared: 0.1409
F-statistic: 2.148 on 4 and 24 DF, p-value: 0.1059
anova(fit_diff)
Analysis of Variance Table
Response: diff
             Df Sum Sq Mean Sq F value Pr(>F)
neighborhood  4 1278.4  319.59  2.1477 0.1059
Residuals    24 3571.4  148.81
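The overall F-test and the fitted mean difference for each neighborhood can be recovered from the printed output. The sketch below uses only numbers from the summary(fit_diff) and anova(fit_diff) results. The labels `n1` to `n5` are shorthand introduced here.

```r
# Mean squares copied from the anova(fit_diff) table.
ms_neighborhood <- 319.59
ms_residual <- 148.81

# F statistic and its upper-tail p-value on 4 and 24 degrees of freedom.
f_stat <- ms_neighborhood / ms_residual
p_value <- pf(f_stat, df1 = 4, df2 = 24, lower.tail = FALSE)
print(round(c(F = f_stat, p = p_value), 4))

# Fitted mean of (sale - list) per neighborhood: the intercept is the
# neighborhood 1 mean, and each coefficient is an offset from it.
intercept <- 18.650
offsets <- c(n1 = 0, n2 = -24.800, n3 = -17.775, n4 = -21.250, n5 = -30.650)
mean_diff <- intercept + offsets
print(mean_diff)
```

Only neighborhood 1 has a fitted mean well above zero, and it contains just two homes, so the individually significant coefficients mostly reflect contrasts with that small baseline group.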
Conclusions
The analysis suggests four main findings. First, list price and sale price are very strongly related. Second, neighborhood is the strongest predictor of both sale price and list price. Third, richer neighborhoods tend to have overall higher home value. Fourth, neighborhood does not appear to be a strong predictor of whether a house sells above or below asking price.
Real estate agents would do well to represent sellers in the wealthier neighborhoods and buyers who want to move to those neighborhoods. They can potentially sell at a higher price by highlighting the number of half bathrooms and bedrooms. If reasonable, they might also improve their sales by advertising proximity to a wealthier neighborhood. Real estate agents often use this tactic to create the perception of value or a good deal for homes that are just outside the boundaries of expensive markets.
Notes
I used ChatGPT to figure out how to plot each difference with a reference line at 0. It recommended geom_hline. While the boxplots convey more information, the point plot is more aesthetically pleasing to me. The summary and anova compensate for that choice.
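library(ggplot2)
# Difference between sale price and list price, in thousands of US dollars
home$diff = home$sale - home$list
# Boxplot of the difference within each neighborhood
ggplot(home,
       aes(x = neighborhood,
           y = diff)) +
  geom_boxplot() +
  xlab("Neighborhood") +
  ylab("Sale Price - List Price") +
  ggtitle("Difference Between Sale Price and List Price by Neighborhood")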
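The point-plot variant mentioned in the notes, with a reference line at zero drawn by geom_hline, might look like the following sketch. The data frame `home_demo` is hypothetical (invented prices), and the styling choices are assumptions rather than the exact plot used in the project.

```r
library(ggplot2)

# Hypothetical stand-in for `home`; prices are invented, in thousands of USD.
home_demo <- data.frame(
  neighborhood = factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)),
  sale = c(100, 120, 150, 160, 250, 255, 340, 350, 480, 500),
  list = c( 90, 105, 155, 170, 250, 260, 345, 350, 495, 510)
)
home_demo$diff <- home_demo$sale - home_demo$list

p <- ggplot(home_demo, aes(x = neighborhood, y = diff)) +
  geom_point() +
  # Dashed reference line at zero: points above it sold over asking price.
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Neighborhood") +
  ylab("Sale Price - List Price")
print(p)
```

The zero line makes it easy to see at a glance which homes sold above or below their list price.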