# Load the dataset
homeprice <- read.csv("homeprice.csv")

# Visualization: Histogram of Sale Price
hist(homeprice$sale, main="Distribution of Sale Prices", xlab="Sale Price", col="lightblue")

# Check all relationships using scatterplots or boxplots
plot(sale ~ list, data=homeprice, main="Sale Price vs List Price")
plot(sale ~ bedrooms, data=homeprice, main="Sale Price vs Bedrooms")
plot(sale ~ full, data=homeprice, main="Sale Price vs Full Bathrooms")
plot(sale ~ half, data=homeprice, main="Sale Price vs Half Bathrooms")
plot(sale ~ rooms, data=homeprice, main="Sale Price vs Number of Rooms")
plot(sale ~ factor(neighborhood), data=homeprice, main="Sale Price by Neighborhood Rank")
In conclusion, list price and neighborhood rank appear to have the strongest and clearest relationships with sale price. The total number of rooms shows a moderate positive association, while bedrooms, full bathrooms, and half bathrooms show much weaker or non-linear associations with the final sale price.
2. First Model: Sale Price Regression
# Build a multiple linear regression model to explain sale price
model_sale <- lm(sale ~ list + full + half + bedrooms + rooms + factor(neighborhood), data=homeprice)

# Check the coefficients and goodness-of-fit
summary(model_sale)
Call:
lm(formula = sale ~ list + full + half + bedrooms + rooms + factor(neighborhood),
    data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max
-24.149  -6.679   1.486   4.364  24.149

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            27.96941   15.84528   1.765   0.0936 .
list                    0.91554    0.07965  11.494 5.34e-10 ***
full                   -0.04073    5.49422  -0.007   0.9942
half                    5.07142    6.08110   0.834   0.4147
bedrooms               -4.39758    6.64979  -0.661   0.5164
rooms                   2.44383    3.66230   0.667   0.5126
factor(neighborhood)2 -26.11476   12.69888  -2.056   0.0537 .
factor(neighborhood)3 -11.42126   14.32365  -0.797   0.4351
factor(neighborhood)4  -4.41077   20.66629  -0.213   0.8333
factor(neighborhood)5  -4.39556   30.46718  -0.144   0.8868
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.86 on 19 degrees of freedom
Multiple R-squared:  0.9919,    Adjusted R-squared:  0.988
F-statistic: 257.1 on 9 and 19 DF,  p-value: < 2.2e-16
# Identify which variable appears to have the greatest effect on sale price
anova(model_sale)
Analysis of Variance Table

Response: sale
                     Df Sum Sq Mean Sq   F value Pr(>F)
list                  1 381050  381050 2305.6632 <2e-16 ***
full                  1    156     156    0.9443 0.3434
half                  1     21      21    0.1271 0.7254
bedrooms              1     25      25    0.1529 0.7001
rooms                 1      3       3    0.0164 0.8994
factor(neighborhood)  4   1107     277    1.6748 0.1973
Residuals            19   3140     165
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check the distribution of residuals
hist(residuals(model_sale), main="Histogram of Residuals (Sale Model)", xlab="Residuals", col="salmon")

# Residuals vs Fitted plot for model diagnostics
plot(model_sale, which=1)
The multiple linear regression model for sale price fits the data very closely (R² = 0.9919, adjusted R² = 0.988). According to both the coefficient estimates and the ANOVA results, list price is the strongest predictor of sale price and the only one that is significant at the 5% level (t = 11.49, p = 5.34e-10). After controlling for list price, the other variables (bathrooms, bedrooms, rooms, and neighborhood rank) contribute relatively little additional explanatory power, and none of their coefficients reaches p < 0.05. Diagnostic plots suggest that the residuals are approximately normally distributed and do not exhibit severe violations of the homoscedasticity assumption, although a few potential outliers and slight departures from linearity are visible at higher fitted values. Overall, the model appears to provide a reliable representation of the relationship between the predictors and sale price, although with only 29 observations and 10 estimated coefficients the smaller effects are estimated imprecisely.
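The statement that the remaining predictors add little once list price is in the model can be checked as one joint test, using only the sums of squares printed in the ANOVA table above. The sketch below is an approximation: it works from the rounded table values rather than the fitted model, and the variable names (`ss_terms`, `f_extra`, and so on) are illustrative.

```r
# Sequential sums of squares copied from the anova(model_sale) table (rounded values).
ss_terms <- c(list = 381050, full = 156, half = 21,
              bedrooms = 25, rooms = 3, neighborhood = 1107)
df_terms <- c(list = 1, full = 1, half = 1,
              bedrooms = 1, rooms = 1, neighborhood = 4)
rss    <- 3140   # residual sum of squares
df_res <- 19     # residual degrees of freedom

# R-squared recovered from the table: 1 - RSS / total SS.
total_ss  <- sum(ss_terms) + rss
r_squared <- 1 - rss / total_ss

# Share of the total variation credited to list price, which is entered first.
share_list <- ss_terms[["list"]] / total_ss

# Joint F-test for every term entered after list price.
extra    <- setdiff(names(ss_terms), "list")
ss_extra <- sum(ss_terms[extra])
df_extra <- sum(df_terms[extra])
f_extra  <- (ss_extra / df_extra) / (rss / df_res)
p_extra  <- pf(f_extra, df_extra, df_res, lower.tail = FALSE)

print(round(c(r_squared = r_squared, share_list = share_list,
              f_extra = f_extra, p_extra = p_extra), 4))
```

With the fitted objects available, `anova(lm(sale ~ list, data=homeprice), model_sale)` performs the same comparison without the rounding error.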
3. Second Model: List Price Regression
# Build a second model to explain list price
model_list <- lm(list ~ full + half + bedrooms + rooms + factor(neighborhood), data=homeprice)

# Check the coefficients and goodness-of-fit for the list price model
summary(model_list)
Call:
lm(formula = list ~ full + half + bedrooms + rooms + factor(neighborhood),
    data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max
-50.506 -24.097  -0.703  22.762  54.684

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            -10.294     44.424  -0.232 0.819113
full                    29.280     13.966   2.097 0.048950 *
half                    50.090     12.884   3.888 0.000914 ***
bedrooms                10.500     18.520   0.567 0.577065
rooms                    9.225     10.072   0.916 0.370624
factor(neighborhood)2   25.163     35.203   0.715 0.483005
factor(neighborhood)3  100.995     33.271   3.036 0.006531 **
factor(neighborhood)4  188.939     39.763   4.752 0.000122 ***
factor(neighborhood)5  300.699     52.865   5.688 1.44e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.09 on 20 degrees of freedom
Multiple R-squared:  0.9358,    Adjusted R-squared:  0.9102
F-statistic: 36.47 on 8 and 20 DF,  p-value: 2.82e-10
# Identify which variable appears to have the greatest effect on list price
anova(model_list)
Analysis of Variance Table

Response: list
                     Df Sum Sq Mean Sq  F value    Pr(>F)
full                  1 169594  169594 130.2069 3.292e-10 ***
half                  1  92249   92249  70.8248 5.283e-08 ***
bedrooms              1   9745    9745   7.4815   0.01275 *
rooms                 1  10162   10162   7.8021   0.01122 *
factor(neighborhood)  4  98264   24566  18.8607 1.454e-06 ***
Residuals            20  26050    1302
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check the distribution of residuals
hist(residuals(model_list), main="Histogram of Residuals (List Model)", xlab="Residuals", col="lightgreen")

# Residuals vs Fitted plot for model diagnostics
plot(model_list, which=1)
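A second multiple regression model was developed to predict list price from the same house characteristics (all predictors of the first model except list price itself). The model provides a strong fit (R² = 0.9358, adjusted R² = 0.9102) and is highly significant overall (F = 36.47, p < 0.001). According to the ANOVA table, the number of full bathrooms has the greatest effect on list price (F = 130.21), followed by half bathrooms (F = 70.82) and neighborhood rank (F = 18.86). This ranking should be read with some care: the table is sequential, so full bathrooms, entered first, is also credited with variation it shares with the later predictors. In the coefficient table, where each predictor is adjusted for all the others, half bathrooms (p = 0.0009) and neighborhood ranks 3 to 5 (all p < 0.01) are the clearest effects, while full bathrooms is only borderline (p = 0.049). Either way, the picture differs from the sale price model, where list price itself was by far the most important predictor. The results suggest that homeowners and real estate agents weigh bathroom count and neighborhood quality most heavily when establishing an initial listing price. A real estate agent would therefore do well to highlight bathrooms and neighborhood prestige when marketing a property and determining its list price.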
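Sequential ANOVA tables depend on the order in which terms enter the formula, which matters when ranking predictors by their F values. The sketch below uses simulated data, not the homeprice data (`x1`, `x2`, and `y` are invented purely for illustration), to show how a correlated predictor's sum of squares shrinks when it is entered last, and how `drop1()` gives an order-independent view.

```r
# Simulated example only: two correlated predictors that both affect y.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x2 is correlated with x1
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

# Sequential (Type I) ANOVA with x1 entered first, then with x1 entered last.
a_x1_first <- anova(lm(y ~ x1 + x2))
a_x1_last  <- anova(lm(y ~ x2 + x1))

ss_x1_first <- a_x1_first["x1", "Sum Sq"]
ss_x1_last  <- a_x1_last["x1", "Sum Sq"]

# drop1() tests each term after adjusting for all the others,
# so its result does not depend on the order of the formula.
d <- drop1(lm(y ~ x1 + x2), test = "F")

print(c(x1_entered_first = ss_x1_first, x1_entered_last = ss_x1_last))
print(d)
```

For the list price model, the analogous check would be `drop1(model_list, test = "F")`, which reports the contribution of each term given all the others.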
4. Neighborhood Effect on Price Difference
# Calculate the difference between sale price and list price
homeprice$price_diff <- homeprice$sale - homeprice$list

# Visualize the effect of neighborhood on the price difference
boxplot(price_diff ~ factor(neighborhood), data=homeprice,
        main="Price Difference (Sale - List) by Neighborhood",
        xlab="Neighborhood Rank (1:Poor, 5:Rich)",
        ylab="Price Difference (Sale - List)", col="lightyellow")

# Check if the effect is statistically significant
model_diff <- lm(price_diff ~ factor(neighborhood), data=homeprice)
summary(model_diff)
Call:
lm(formula = price_diff ~ factor(neighborhood), data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max
-26.000  -4.875   0.150   5.125  26.000

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             18.650      8.626   2.162   0.0408 *
factor(neighborhood)2  -24.800      9.644  -2.572   0.0167 *
factor(neighborhood)3  -17.775      9.317  -1.908   0.0684 .
factor(neighborhood)4  -21.250     10.206  -2.082   0.0482 *
factor(neighborhood)5  -30.650     12.199  -2.513   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.2 on 24 degrees of freedom
Multiple R-squared:  0.2636,    Adjusted R-squared:  0.1409
F-statistic: 2.148 on 4 and 24 DF,  p-value: 0.1059
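The model suggests that higher neighborhood rankings are associated with smaller sale-list price differences. Neighborhood 1 is the baseline, with an average difference of 18.65 (the intercept), and every other neighborhood has a negative coefficient relative to it. The coefficient for Neighborhood 5 is the most negative (-30.65), which puts its average difference at about -12.0 and indicates that homes in the richest neighborhoods tend to sell for less relative to their asking price. The pattern is not strictly monotone, however: the fitted averages for Neighborhoods 2, 3, and 4 are -6.15, 0.875, and -2.60. The boxplot shows a similar pattern, with poorer neighborhoods generally exhibiting positive sale-list differences and wealthier neighborhoods showing negative differences. Although several individual contrasts with Neighborhood 1 have p < 0.05, the overall regression model is not statistically significant (F = 2.148, p = 0.106) and explains only about a quarter of the variation (R² = 0.2636), so the evidence is not strong enough to conclude that neighborhood rank alone reliably predicts the difference between sale price and list price. Therefore, the data do not support the claim that richer neighborhoods are more likely to have homes sell above the asking price.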
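Because `factor(neighborhood)` is dummy-coded with Neighborhood 1 as the baseline, each coefficient is a difference from the Neighborhood 1 average, and the average for each group follows by adding the intercept. The short sketch below works directly from the printed coefficients (the object names are illustrative) and also recomputes the overall p-value from the reported F statistic and degrees of freedom.

```r
# Coefficients copied from summary(model_diff); neighborhood 1 is the baseline.
intercept <- 18.650
offsets   <- c(`2` = -24.800, `3` = -17.775, `4` = -21.250, `5` = -30.650)

# Fitted mean of (sale - list) in each neighborhood.
group_means <- c(`1` = intercept, intercept + offsets)
print(round(group_means, 3))

# Overall F-test p-value recomputed from the reported statistic (F = 2.148 on 4 and 24 DF).
p_overall <- pf(2.148, df1 = 4, df2 = 24, lower.tail = FALSE)
print(round(p_overall, 4))
```

With the data loaded, `tapply(homeprice$price_diff, homeprice$neighborhood, mean)` should give the same group averages directly.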