GEOG 6680 Final Project: House Prices

Author

Bryce Nelson

Introduction

This project examines the relationship between home characteristics and sale price in a New Jersey housing dataset from 2001. The goal is to identify which variables are most strongly associated with sale price, compare those patterns with list price, and determine whether neighborhood wealth affects the difference between sale price and asking price.

Neighborhoods are ranked by wealth from 1 to 5, with 1 being the least wealthy and 5 the wealthiest.

Exploratory analysis

The analysis began with a series of scatterplots comparing sale price to each explanatory variable. These plots provide an initial overview of potential relationships within the data.

Initial exploration revealed strong positive relationships between sale price and list price, and between sale price and neighborhood rank. However, scatterplots of sale price against bathrooms, bedrooms, and rooms were less informative because these variables take only a small number of discrete values. As a result, many observations overlap at the same x-values, making it difficult to evaluate the distribution of sale prices within each category.

For this reason, boxplots were used to better visualize the relationships between sale price and the discrete explanatory variables.
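A minimal sketch of such a boxplot, assuming ggplot2 and a data frame with `sale` and `bedrooms` columns. The `toy` data frame and its values are made up for illustration and are not the project data. Converting the discrete count to a factor gives one box per category instead of a column of overlapping points.

```r
library(ggplot2)

# Toy data for illustration only (not the project dataset)
toy <- data.frame(
  sale     = c(120, 135, 150, 180, 210, 260, 240, 310),
  bedrooms = c(2, 2, 3, 3, 3, 4, 4, 5)
)

# Treat the discrete count as a factor so each value gets its own box
p <- ggplot(toy, aes(x = factor(bedrooms), y = sale)) +
  geom_boxplot() +
  xlab("Bedrooms") +
  ylab("Sale price ($1,000s)")

print(p)
```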

Home price is in thousands of US dollars.

*Exploratory scatterplots of sale price ($1,000s) against list price ($1,000s), bathrooms, half baths, bedrooms, other rooms, and neighborhoods.*

Sale Price and House Characteristics

The boxplots indicate that homes with more bathrooms generally sell for higher prices. A similar pattern is visible for bedrooms, although there is substantial overlap among bedroom categories. The neighborhood boxplot shows the clearest separation among groups, suggesting that neighborhood may be one of the strongest determinants of home value.

Mean sale price

Looking more closely at the mean sale price, we see it increased consistently from the lowest-ranked neighborhood to the highest-ranked neighborhood, indicating a strong positive relationship between neighborhood wealth and house value.

Neighborhood and House Size

The strong neighborhood effect raises an important question: are homes in wealthier neighborhoods simply larger, or does neighborhood itself contribute additional value?

Cross-tabulations of bathrooms by neighborhood show that homes in higher-ranked neighborhoods generally contain more bathrooms.

table(home$neighborhood, home$full)

table(home$neighborhood, home$half)

To examine house size more directly, a total room count was calculated as the sum of bedrooms, non-bedroom rooms, full bathrooms, and half bathrooms. The table below cross-tabulates neighborhood (rows) against total room count (columns).

   
    5 8 10 11 12 13 14 15 16 19 20
  1 0 0  1  0  1  0  0  0  0  0  0
  2 1 0  1  0  3  1  2  0  0  0  0
  3 0 1  0  3  1  3  2  1  1  0  0
  4 0 0  0  0  1  2  0  1  0  0  1
  5 0 0  0  0  0  1  0  0  0  1  0

The table shows that homes in wealthier neighborhoods tend to have more rooms.
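The total room count and its cross-tabulation could be computed along these lines. This is a sketch under assumptions: the column name `total` is hypothetical (the original code for this step is not shown), and the `toy` data frame holds made-up values rather than the project data.

```r
# Toy data for illustration only (not the project dataset)
toy <- data.frame(
  neighborhood = factor(c(1, 1, 2, 2, 3, 3)),
  bedrooms     = c(2, 3, 3, 3, 4, 4),
  rooms        = c(3, 4, 4, 5, 5, 6),
  full         = c(1, 1, 1, 2, 2, 2),
  half         = c(0, 1, 1, 1, 1, 2)
)

# Total room count: bedrooms + other rooms + full baths + half baths
# (`total` is an assumed column name)
toy$total <- toy$bedrooms + toy$rooms + toy$full + toy$half

# Cross-tabulate neighborhood (rows) against total room count (columns)
tab <- table(toy$neighborhood, toy$total)
print(tab)
```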

The bubble plot illustrates the relationship among sale price, list price, neighborhood rank, and total room count. Larger homes are generally associated with higher prices and are concentrated in the higher-ranked neighborhoods.

Sale price versus list price, with point size representing total room count and color representing neighborhood. Some points overlap.

Because square footage was not available in the dataset, a price-per-room metric was calculated by dividing sale price by total rooms. Although this measure should not be interpreted as a substitute for price per square foot, it provides a useful proxy for value relative to house size.

# A tibble: 5 × 3
  neighborhood MeanTotal MeanPricePerRoom
  <fct>            <dbl>            <dbl>
1 1                 11               12.2
2 2                 11.5             14.7
3 3                 12.6             22.4
4 4                 14.6             26.4
5 5                 16               33.4

Average price per room increased substantially with neighborhood rank.
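A summary like the one above can be produced with dplyr. The sketch below is one possible version, not necessarily the original code: it assumes a `total` column, and it assumes price per room is averaged per home (the mean of each home's `sale / total`) rather than computed as a ratio of group means. The `toy` values are made up.

```r
library(dplyr)

# Toy data for illustration only (not the project dataset)
toy <- data.frame(
  neighborhood = factor(c(1, 1, 2, 2)),
  sale         = c(100, 140, 180, 240),  # sale price in $1,000s
  total        = c(10, 10, 12, 12)       # assumed total room count column
)

# Per-home price per room, then averaged within each neighborhood
summary_tbl <- toy |>
  mutate(price_per_room = sale / total) |>
  group_by(neighborhood) |>
  summarise(
    MeanTotal        = mean(total),
    MeanPricePerRoom = mean(price_per_room)
  )

print(summary_tbl)
```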

[1] 45.45455

[1] 173.7705

Average total room count increased from 11.0 rooms in Neighborhood 1 to 16.0 rooms in Neighborhood 5, an increase of approximately 45%. Over the same range, average price per room increased from 12.2 to 33.4 thousand dollars per room, an increase of approximately 174%.
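The two percentages follow from the usual percent-change formula applied to the Neighborhood 1 and Neighborhood 5 rows of the summary table. The helper name `pct_increase` is hypothetical and used here only for illustration.

```r
# Percent change from a starting value to an ending value
# (`pct_increase` is a hypothetical helper name)
pct_increase <- function(from, to) (to - from) / from * 100

rooms_change <- pct_increase(11, 16)      # mean total rooms, Neighborhood 1 vs. 5
price_change <- pct_increase(12.2, 33.4)  # mean price per room, same neighborhoods

print(rooms_change)  # 45.45455
print(price_change)  # 173.7705
```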

This increase was considerably larger than the increase in total room count, suggesting that the higher value of homes in wealthier neighborhoods cannot be explained solely by larger house size. Buyers appear to pay a neighborhood premium for homes in richer neighborhoods, although Neighborhoods 1 and 5 each contain only two homes.

Modeling Sale Price

The sale price model explained approximately 94% of the variation in home sale prices (R² = 0.935). The overall model was highly significant (p < 0.001), indicating that the included variables collectively provide a strong explanation of sale price.

The ANOVA table, which tests terms sequentially in the order they enter the model, shows that neighborhood accounted for more variation in sale price than any other variable. Among the housing characteristics, the number of half bathrooms had the strongest effect. Bedrooms contributed a small but statistically significant amount of explanatory power in the sequential test (p = 0.048), although its coefficient was not significant once all other variables were included (p = 0.777). The number of other rooms and full bathrooms contributed relatively little after accounting for neighborhood effects.

fit_sale = lm(sale ~ neighborhood + full + half + bedrooms + rooms, data = home)
summary(fit_sale)


Call:
lm(formula = sale ~ neighborhood + full + half + bedrooms + rooms, 
    data = home)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.876 -29.560  -4.928  16.586  70.893 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     18.545     43.498   0.426 0.674408    
neighborhood2   -3.077     34.470  -0.089 0.929749    
neighborhood3   81.044     32.578   2.488 0.021798 *  
neighborhood4  168.570     38.935   4.330 0.000326 ***
neighborhood5  270.906     51.763   5.234 4.03e-05 ***
full            26.766     13.675   1.957 0.064405 .  
half            50.931     12.615   4.037 0.000645 ***
bedrooms         5.215     18.134   0.288 0.776615    
rooms           10.890      9.862   1.104 0.282616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.34 on 20 degrees of freedom
Multiple R-squared:  0.9352,    Adjusted R-squared:  0.9093 
F-statistic: 36.09 on 8 and 20 DF,  p-value: 3.106e-10

anova(fit_sale)

Analysis of Variance Table

Response: sale
             Df Sum Sq Mean Sq F value    Pr(>F)    
neighborhood  4 307719   76930 61.6043 5.826e-11 ***
full          1   2974    2974  2.3812   0.13848    
half          1  42775   42775 34.2538 1.002e-05 ***
bedrooms      1   5536    5536  4.4334   0.04808 *  
rooms         1   1523    1523  1.2193   0.28262    
Residuals    20  24975    1249                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
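Because anova() on a single lm fit reports sequential (Type I) sums of squares, each row depends on the terms listed before it. The sketch below illustrates this with R's built-in mtcars data as a stand-in (it is not the housing data): reordering the terms changes the sequential table, while drop1() tests each term after all the others and is unaffected by order.

```r
# Built-in mtcars data used as a stand-in; cyl plays the role of a grouping factor
cars <- transform(mtcars, cyl = factor(cyl))

fit_a <- lm(mpg ~ cyl + wt + hp, data = cars)
fit_b <- lm(mpg ~ hp + wt + cyl, data = cars)  # same terms, different order

# Sequential sums of squares differ between the two orderings
print(anova(fit_a))
print(anova(fit_b))

# drop1() tests each term given all other terms, so order does not matter
marg_a <- drop1(fit_a, test = "F")
marg_b <- drop1(fit_b, test = "F")
print(marg_a)
```

Applied to the sale price model, the same check would be drop1(fit_sale, test = "F").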

The coefficient estimates indicate that homes located in neighborhoods 3, 4, and 5 sold for significantly higher prices than homes in neighborhood 1 (about $81,000, $169,000, and $271,000 more, respectively), even after controlling for house characteristics. Neighborhood 2 did not differ significantly from neighborhood 1 (p = 0.93). This further supports neighborhood as the primary characteristic in determining house price.

The residuals versus fitted values plot shows no strong pattern, indicating that the model captures the major trends in the data.

Modeling List Price

A second regression model was developed using list price.

The results were very similar to those obtained for sale price. Neighborhood remained the strongest predictor, followed by half bathrooms; the full bathroom coefficient was marginally significant (p = 0.049). This similarity suggests that sellers and buyers value many of the same housing characteristics, which is consistent with list prices tracking what the market will pay.

fit_list = lm(list ~ neighborhood + full + half + bedrooms + rooms, data = home)
summary(fit_list)


Call:
lm(formula = list ~ neighborhood + full + half + bedrooms + rooms, 
    data = home)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.506 -24.097  -0.703  22.762  54.684 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -10.294     44.424  -0.232 0.819113    
neighborhood2   25.163     35.203   0.715 0.483005    
neighborhood3  100.995     33.271   3.036 0.006531 ** 
neighborhood4  188.939     39.763   4.752 0.000122 ***
neighborhood5  300.699     52.865   5.688 1.44e-05 ***
full            29.280     13.966   2.097 0.048950 *  
half            50.090     12.884   3.888 0.000914 ***
bedrooms        10.500     18.520   0.567 0.577065    
rooms            9.225     10.072   0.916 0.370624    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.09 on 20 degrees of freedom
Multiple R-squared:  0.9358,    Adjusted R-squared:  0.9102 
F-statistic: 36.47 on 8 and 20 DF,  p-value: 2.82e-10

anova(fit_list)

Analysis of Variance Table

Response: list
             Df Sum Sq Mean Sq F value    Pr(>F)    
neighborhood  4 324678   81169 62.3184 5.241e-11 ***
full          1   4287    4287  3.2916   0.08466 .  
half          1  42918   42918 32.9503 1.286e-05 ***
bedrooms      1   7038    7038  5.4034   0.03074 *  
rooms         1   1093    1093  0.8389   0.37062    
Residuals    20  26050    1302                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual checks.

plot(fit_list, which = 1)

Difference between sale price and list price

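The final analysis examined whether neighborhood influences the difference between sale price and list price (sale price minus list price). The plot of this difference by neighborhood shows no clear pattern: homes in wealthier neighborhoods do not appear more likely to sell for more than asking price.

Although some variation was observed among neighborhoods, the overall F-test for neighborhood was not statistically significant (F = 2.15, p = 0.106), and the model explained only about 26% of the variation in the difference. Several individual coefficients are significant at the 0.05 level, but all of them are negative: relative to neighborhood 1, where homes sold for an average of about $18,650 above list price, homes in the other neighborhoods sold close to or below asking price. Because neighborhood 1 contains only two homes, these contrasts should be read with caution. Therefore, the data do not provide strong evidence that homes in wealthier neighborhoods are more likely to sell above asking price.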

fit_diff = lm(diff ~ neighborhood, data = home)
summary(fit_diff)


Call:
lm(formula = diff ~ neighborhood, data = home)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.000  -4.875   0.150   5.125  26.000 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)     18.650      8.626   2.162   0.0408 *
neighborhood2  -24.800      9.644  -2.572   0.0167 *
neighborhood3  -17.775      9.317  -1.908   0.0684 .
neighborhood4  -21.250     10.206  -2.082   0.0482 *
neighborhood5  -30.650     12.199  -2.513   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.2 on 24 degrees of freedom
Multiple R-squared:  0.2636,    Adjusted R-squared:  0.1409 
F-statistic: 2.148 on 4 and 24 DF,  p-value: 0.1059

anova(fit_diff)

Analysis of Variance Table

Response: diff
             Df Sum Sq Mean Sq F value Pr(>F)
neighborhood  4 1278.4  319.59  2.1477 0.1059
Residuals    24 3571.4  148.81
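With a single factor predictor and R's default treatment contrasts, the intercept is the mean of the reference group (neighborhood 1) and each coefficient is the difference between a group mean and that reference mean. The sketch below shows this on made-up values (not the project data).

```r
# Toy data for illustration only (not the project dataset)
toy <- data.frame(
  neighborhood = factor(c(1, 1, 2, 2, 3, 3)),
  diff         = c(10, 20, -5, -15, 0, -10)
)

fit <- lm(diff ~ neighborhood, data = toy)

# Group means of the response, one per neighborhood
group_means <- tapply(toy$diff, toy$neighborhood, mean)

print(coef(fit))     # intercept = mean of group 1; others = offsets from it
print(group_means)
```

Adding a coefficient to the intercept recovers that neighborhood's mean difference.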

Conclusions

The analysis suggests four main findings. First, list price and sale price are very strongly related. Second, neighborhood is the strongest predictor of both sale price and list price. Third, richer neighborhoods tend to have higher home values overall, beyond what larger house size alone explains. Fourth, neighborhood does not appear to be a strong predictor of whether a house sells above or below asking price.

Real estate agents would do well to represent sellers in the richer neighborhoods and buyers wanting to move to those neighborhoods. They can potentially sell at a higher price by highlighting the number of half bathrooms and bedrooms. If reasonable, they might also improve their sales by advertising proximity to a richer neighborhood, a tactic that can create the perception of value or a good deal for homes just outside the boundaries of expensive markets.

Notes

I used ChatGPT to figure out how to plot each difference against a horizontal reference line at 0. It recommended geom_hline. While the boxplots convey more information, the point plot is more aesthetically pleasing to me. The summary and anova output compensate for that choice.

home$diff = home$sale - home$list

ggplot(home,
       aes(x = neighborhood,
           y = diff)) +
  geom_boxplot() +
  xlab("Neighborhood") +
  ylab("Sale Price - List Price") +
  ggtitle("Difference Between Sale Price and List Price by Neighborhood")
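The notes mention geom_hline and a point plot. A minimal sketch of that variant is given here, assuming ggplot2 and the same `diff` definition as above; the `toy` values are made up and the styling choices (dashed line) are illustrative rather than the original code.

```r
library(ggplot2)

# Toy data for illustration only (not the project dataset)
toy <- data.frame(
  neighborhood = factor(c(1, 1, 2, 2, 3, 3)),
  sale         = c(110, 150, 170, 195, 240, 300),
  list         = c(100, 140, 180, 200, 250, 295)
)
toy$diff <- toy$sale - toy$list

# One point per home, with a dashed reference line at zero:
# points above the line sold over asking, points below sold under
p <- ggplot(toy, aes(x = neighborhood, y = diff)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point() +
  xlab("Neighborhood") +
  ylab("Sale Price - List Price")

print(p)
```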