The dataset homeprice.csv contains information on houses sold in a New Jersey town during 2001. Variables include sale price, list price, number of bathrooms, bedrooms, rooms, and neighborhood rank. The goal of this analysis is to identify which house characteristics are most strongly related to sale price and list price, and to investigate whether neighborhood influences the difference between sale price and list price.
Data Exploration
First, the structure and summary statistics of the dataset were examined.
str(homeprice)
'data.frame': 29 obs. of 7 variables:
 $ list        : num 80 151 310 295 339 ...
 $ sale        : num 118 151 300 275 340 ...
 $ full        : int 1 1 2 2 2 1 3 1 1 1 ...
 $ half        : int 0 0 1 1 0 1 0 1 2 0 ...
 $ bedrooms    : int 3 4 4 4 3 4 3 3 3 1 ...
 $ rooms       : int 6 7 9 8 7 8 7 7 7 3 ...
 $ neighborhood: int 1 1 3 3 4 3 2 2 3 2 ...
summary(homeprice)
      list            sale            full            half
 Min.   : 43.0   Min.   : 48.0   Min.   :1.000   Min.   :0.0000
 1st Qu.:189.0   1st Qu.:185.0   1st Qu.:1.000   1st Qu.:0.0000
 Median :275.0   Median :272.5   Median :2.000   Median :1.0000
 Mean   :274.8   Mean   :273.5   Mean   :1.724   Mean   :0.6552
 3rd Qu.:339.0   3rd Qu.:340.0   3rd Qu.:2.000   3rd Qu.:1.0000
 Max.   :599.0   Max.   :613.0   Max.   :3.000   Max.   :2.0000
    bedrooms         rooms         neighborhood
 Min.   :1.000   Min.   : 3.000   Min.   :1.000
 1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:2.000
 Median :3.000   Median : 7.000   Median :3.000
 Mean   :3.172   Mean   : 7.207   Mean   :2.897
 3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:3.000
 Max.   :5.000   Max.   :11.000   Max.   :5.000
Relationship Between Sale Price and Other Variables
Scatterplots were used to examine the relationship between sale and the numerical variables.
The strongest relationship with sale price was observed for list price. The scatterplot showed a very strong positive linear relationship between sale and list.
Neighborhood also appeared to have a strong relationship with sale price. Houses in higher-ranked neighborhoods generally sold for higher prices than houses in lower-ranked neighborhoods.
The relationships between sale price and the numbers of bathrooms, bedrooms, and rooms were weaker.
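The visual comparison described above can be reproduced with a short sketch like the following. It assumes that homeprice.csv is in the working directory and has the seven columns shown in the structure output. The helper name sale_correlations is hypothetical and is not part of the original analysis; it adds a numeric correlation to go with each scatterplot.

```r
# Hypothetical helper: Pearson correlation of each predictor with sale,
# ordered from strongest to weakest in absolute value.
sale_correlations <- function(df, predictors) {
  cors <- sapply(predictors, function(v) cor(df[[v]], df$sale))
  cors[order(abs(cors), decreasing = TRUE)]
}

csv_path <- "homeprice.csv"  # assumed location of the data file
if (file.exists(csv_path)) {
  homeprice <- read.csv(csv_path)
  predictors <- c("list", "full", "half", "bedrooms", "rooms", "neighborhood")

  # One scatterplot of sale against each predictor, in a 2 x 3 grid.
  old_par <- par(mfrow = c(2, 3))
  for (v in predictors) {
    plot(homeprice[[v]], homeprice$sale, xlab = v, ylab = "sale")
  }
  par(old_par)

  # Correlations with sale, strongest first.
  print(round(sale_correlations(homeprice, predictors), 3))
}
```

The block does nothing if the file is not found, so it can be run safely before the data path is set.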
Multiple Linear Regression for Sale Price
A multiple linear regression model was constructed using the available house characteristics.
homeprice$neighborhood <- factor(homeprice$neighborhood)
m_sale <- lm(sale ~ list + full + half + bedrooms + rooms + neighborhood, data = homeprice)
summary(m_sale)
Call:
lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood,
data = homeprice)
Residuals:
    Min      1Q  Median      3Q     Max
-24.149  -6.679   1.486   4.364  24.149
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    27.96941   15.84528   1.765   0.0936 .
list            0.91554    0.07965  11.494 5.34e-10 ***
full           -0.04073    5.49422  -0.007   0.9942
half            5.07142    6.08110   0.834   0.4147
bedrooms       -4.39758    6.64979  -0.661   0.5164
rooms           2.44383    3.66230   0.667   0.5126
neighborhood2 -26.11476   12.69888  -2.056   0.0537 .
neighborhood3 -11.42126   14.32365  -0.797   0.4351
neighborhood4  -4.41077   20.66629  -0.213   0.8333
neighborhood5  -4.39556   30.46718  -0.144   0.8868
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.86 on 19 degrees of freedom
Multiple R-squared: 0.9919, Adjusted R-squared: 0.988
F-statistic: 257.1 on 9 and 19 DF, p-value: < 2.2e-16
ANOVA
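The ANOVA table was used to assess how much of the variation in sale price each term accounts for. Note that anova() reports sequential sums of squares, so each row is adjusted only for the terms listed above it.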
anova(m_sale)
Analysis of Variance Table
Response: sale
             Df Sum Sq Mean Sq   F value Pr(>F)
list          1 381050  381050 2305.6632 <2e-16 ***
full          1    156     156    0.9443 0.3434
half          1     21      21    0.1271 0.7254
bedrooms      1     25      25    0.1529 0.7001
rooms         1      3       3    0.0164 0.8994
neighborhood  4   1107     277    1.6748 0.1973
Residuals    19   3140     165
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
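To make the table easier to read, the sums of squares above can be converted into shares of the total variation in sale price. The sketch below uses only the rounded values printed in the table, so the results are approximate. The variable names are illustrative.

```r
# Sequential sums of squares copied from the anova(m_sale) table above.
ss <- c(list = 381050, full = 156, half = 21, bedrooms = 25,
        rooms = 3, neighborhood = 1107, Residuals = 3140)

# Percent of the total sum of squares attributed to each row.
share <- round(100 * ss / sum(ss), 2)
print(share)

# R-squared implied by the table: 1 - SS_residual / SS_total.
r_squared <- 1 - ss[["Residuals"]] / sum(ss)
print(round(r_squared, 4))
```

With these figures, list accounts for roughly 98.8% of the total sum of squares and neighborhood for about 0.3%, and the implied R-squared agrees with the 0.9919 reported by summary(m_sale). Because the sums of squares are sequential, the shares depend on the order of the terms in the model formula.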
Residual Diagnostics
Residual plots were examined to evaluate model assumptions.
plot(m_sale)
Interpretation
The model explained approximately 99% of the variation in sale price (R² ≈ 0.99).
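The coefficient for list price was highly significant (estimate 0.916, p = 5.34e-10): holding the other variables fixed, each one-unit increase in list price is associated with an increase of about 0.92 units in sale price. None of the bathroom, bedroom, or room coefficients was significant.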
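According to the ANOVA results, list price accounted for by far the largest share of the variation in sale price (F = 2305.66, p < 2e-16). Once list price and the physical characteristics were in the model, neighborhood did not add a significant contribution (F = 1.67 on 4 and 19 degrees of freedom, p = 0.1973).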
The residual plots suggested some departures from the model assumptions. In the Normal Q-Q and Residuals vs Leverage plots, observations 12, 14, and 19 stood out as potential outliers and influential points.
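One way to back up the visual reading of the diagnostic plots with numbers is sketched below. The helper flag_unusual is hypothetical and not part of the original analysis. The cutoffs of 2 for standardized residuals and 4/n for Cook's distance are common rules of thumb rather than fixed thresholds. The final lines run only if m_sale already exists in the session.

```r
# Hypothetical helper: list observations with a large standardized residual
# or a large Cook's distance (rule-of-thumb cutoffs, see the text above).
flag_unusual <- function(model, resid_cut = 2) {
  r <- rstandard(model)        # standardized residuals
  d <- cooks.distance(model)   # Cook's distance
  n <- length(r)
  out <- data.frame(obs = names(r), std_resid = unname(r), cooks_d = unname(d))
  # which() drops comparisons that are NA (e.g. points with leverage 1).
  keep <- which(abs(out$std_resid) > resid_cut | out$cooks_d > 4 / n)
  out[keep, ]
}

if (exists("m_sale")) {
  # Show the four default diagnostic plots in one 2 x 2 panel.
  old_par <- par(mfrow = c(2, 2))
  plot(m_sale)
  par(old_par)
  print(flag_unusual(m_sale))
}
```

If the flagged rows match the observations that stand out in the plots, refitting the model without them is a simple check on how much the conclusions depend on those points.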
Multiple Linear Regression for List Price
A second model was fitted with list price as the response, using sale price in place of list price among the explanatory variables.
m_list <- lm(list ~ sale + full + half + bedrooms + rooms + neighborhood, data = homeprice)
summary(m_list)
Call:
lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood,
data = homeprice)
Residuals:
     Min       1Q   Median       3Q      Max
-24.7804  -6.5758   0.6545   6.2554  24.7804
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -28.00292   16.23412  -1.725   0.1008
sale            0.95493    0.08308  11.494 5.34e-10 ***
full            3.72018    5.54588   0.671   0.5104
half            1.45482    6.31436   0.230   0.8202
bedrooms        5.51946    6.75132   0.818   0.4238
rooms          -1.17382    3.77423  -0.311   0.7592
neighborhood2  28.10130   12.80917   2.194   0.0409 *
neighborhood3  23.60418   13.85028   1.704   0.1046
neighborhood4  27.96652   20.13387   1.389   0.1809
neighborhood5  42.00318   29.60384   1.419   0.1721
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.13 on 19 degrees of freedom
Multiple R-squared: 0.9919, Adjusted R-squared: 0.9881
F-statistic: 259.6 on 9 and 19 DF, p-value: < 2.2e-16
ANOVA
anova(m_list)
Analysis of Variance Table
Response: list
             Df Sum Sq Mean Sq   F value Pr(>F)
sale          1 401374  401374 2328.4670 <2e-16 ***
full          1    346     346    2.0059 0.1729
half          1    134     134    0.7749 0.3897
bedrooms      1      4       4    0.0217 0.8843
rooms         1     24      24    0.1382 0.7142
neighborhood  4    908     227    1.3164 0.2997
Residuals    19   3275     172
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation
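According to the ANOVA results, sale price accounted for by far the largest share of the variation in list price (F = 2328.47, p < 2e-16). Neighborhood, entered last, was not significant overall (F = 1.32 on 4 and 19 degrees of freedom, p = 0.2997).
In the sale price model, list is the only significant predictor in summary() and also dominates anova(); the neighborhood2 coefficient is only marginal (p = 0.0537). In the list price model, sale is the strongest individual predictor, and neighborhood2 is the only neighborhood level with a significant coefficient (p = 0.0409).
Neighborhood rank may still matter to real estate agents, since houses in higher-ranked neighborhoods generally sold for more, but both models are dominated by the very strong relationship between sale price and list price, which leaves little additional variation for neighborhood to explain.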
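Because neighborhood is entered last in both models, its row in each ANOVA table is the F test for neighborhood adjusted for all other terms. The sketch below re-derives that test from the printed sums of squares. The helper name term_f_test is illustrative, and the inputs are the rounded table values, so the results match the tables only approximately.

```r
# F test for one term: (SS_term / df_term) / (SS_resid / df_resid),
# with the p-value taken from the upper tail of the F distribution.
term_f_test <- function(ss_term, df_term, ss_resid, df_resid) {
  f <- (ss_term / df_term) / (ss_resid / df_resid)
  p <- pf(f, df_term, df_resid, lower.tail = FALSE)
  c(F = f, p = p)
}

# Neighborhood and residual rows copied from the two ANOVA tables.
sale_nb <- term_f_test(ss_term = 1107, df_term = 4, ss_resid = 3140, df_resid = 19)
list_nb <- term_f_test(ss_term = 908,  df_term = 4, ss_resid = 3275, df_resid = 19)

print(round(rbind(sale_model = sale_nb, list_model = list_nb), 4))
```

Both p-values are well above 0.05, in line with the 0.1973 and 0.2997 shown in the tables.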
Difference Between Sale Price and List Price
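To investigate whether neighborhood affects the difference between sale price and list price, a new variable, diff = sale - list, was created and regressed on neighborhood.
homeprice$diff <- homeprice$sale - homeprice$list
m_diff <- lm(diff ~ neighborhood, data = homeprice)
summary(m_diff)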
Call:
lm(formula = diff ~ neighborhood, data = homeprice)
Residuals:
    Min      1Q  Median      3Q     Max
-26.000  -4.875   0.150   5.125  26.000
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     18.650      8.626   2.162   0.0408 *
neighborhood2  -24.800      9.644  -2.572   0.0167 *
neighborhood3  -17.775      9.317  -1.908   0.0684 .
neighborhood4  -21.250     10.206  -2.082   0.0482 *
neighborhood5  -30.650     12.199  -2.513   0.0191 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.2 on 24 degrees of freedom
Multiple R-squared: 0.2636, Adjusted R-squared: 0.1409
F-statistic: 2.148 on 4 and 24 DF, p-value: 0.1059
anova(m_diff)
Analysis of Variance Table
Response: diff
             Df Sum Sq Mean Sq F value Pr(>F)
neighborhood  4 1278.4  319.59  2.1477 0.1059
Residuals    24 3571.4  148.81
Interpretation
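Neighborhood does not show strong evidence of affecting the difference between sale price and list price: the overall F test is not significant (F = 2.148 on 4 and 24 degrees of freedom, p = 0.1059), and neighborhood explains only about 26% of the variation in the difference (R² = 0.2636). The coefficients for neighborhoods 2, 4, and 5 are individually significant at the 5% level, but each compares one neighborhood only with neighborhood 1, so they should be read cautiously given the non-significant overall test.
The average difference between sale price and list price was not larger in richer neighborhoods. In fact, houses in the highest-ranked neighborhood sold for an estimated average of 12.0 below the asking price (18.650 - 30.650), in the data's price units, while houses in the lowest-ranked neighborhood sold for an estimated average of 18.65 above it.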
Therefore, richer neighborhoods do not appear to be more likely to have houses sell above the asking price.
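The estimated average difference for each neighborhood follows directly from the m_diff coefficients, because neighborhood 1 is the baseline level and every other coefficient is an offset from it. The arithmetic is sketched below using the values printed in summary(m_diff). The names n1 to n5 are labels chosen here for illustration.

```r
# Coefficients copied from summary(m_diff); neighborhood 1 is the baseline.
intercept <- 18.650
offsets <- c(n1 = 0, n2 = -24.800, n3 = -17.775, n4 = -21.250, n5 = -30.650)

# Estimated mean of (sale - list) in each neighborhood.
group_means <- intercept + offsets
print(group_means)

# Overall F test of the neighborhood effect, from the reported F statistic.
p_overall <- pf(2.148, df1 = 4, df2 = 24, lower.tail = FALSE)
print(round(p_overall, 4))
```

Only neighborhoods 1 and 3 have a positive estimated mean difference, and the estimates do not rise with neighborhood rank.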
Conclusion
This analysis examined the relationship between housing characteristics and both sale price and list price.
The strongest relationship with sale price was observed for list price. Multiple linear regression indicated that the model explained approximately 99% of the variation in sale price.
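Both models show an excellent fit, largely because list and sale prices are so strongly related to each other. After accounting for that relationship, none of the physical features was significant, and neighborhood was not significant overall in either model (p = 0.1973 for sale price, p = 0.2997 for list price); only the neighborhood2 coefficient in the list price model reached significance (p = 0.0409). Neighborhood rank is visibly related to price in the scatterplots, so location likely matters to overall house value, but this analysis does not show that agents should prioritize location quality over physical features.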
Finally, richer neighborhoods were not more likely to have houses sell above the asking price. The observed differences between sale price and list price were relatively small and did not increase consistently with neighborhood rank.