This analysis uses the Ames Housing dataset to build a regression model predicting the sale price of residential properties. The goal is to identify which home features most significantly affect price.
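library(tidyverse)   # pipes, select(), mutate(), ggplot()
library(janitor)     # clean_names()
library(ggcorrplot)  # ggcorrplot()
library(caret)       # RMSE(), R2()

# Read the raw data and convert column names to snake_case
ames_data <- read.csv("ames.csv") %>%
  clean_names()
# Keep the outcome and the four predictors, dropping incomplete rows
ames_clean <- ames_data %>%
  select(sale_price, first_flr_sf, second_flr_sf, neighborhood, overall_qual) %>%
  drop_na()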
# Correlations among the numeric variables only. overall_qual holds text
# labels (e.g. "Average", "Good"), so as.numeric() would coerce it to NA.
num_vars <- ames_clean %>%
  select(sale_price, first_flr_sf, second_flr_sf)
corr_matrix <- cor(num_vars)
ggcorrplot(corr_matrix, lab = TRUE)
m <- lm(sale_price ~ first_flr_sf + second_flr_sf + neighborhood + overall_qual, data = ames_clean)
summary(m)
##
## Call:
## lm(formula = sale_price ~ first_flr_sf + second_flr_sf + neighborhood +
## overall_qual, data = ames_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -419476 -14017 517 13902 210380
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 80421.373 6976.914
## first_flr_sf 67.663 2.176
## second_flr_sf 45.603 1.754
## neighborhoodBlueste -20434.093 12159.664
## neighborhoodBriardale -34704.124 8866.232
## neighborhoodBrookside -20924.392 7171.216
## neighborhoodClear_Creek 17539.671 7999.553
## neighborhoodCollege_Creek 10441.338 6533.481
## neighborhoodCrawford 12950.376 7083.471
## neighborhoodEdwards -25773.165 6847.908
## neighborhoodGilbert 5531.769 6826.763
## neighborhoodGreen_Hills 88475.474 23807.232
## neighborhoodGreens -14601.627 13246.917
## neighborhoodIowa_DOT_and_Rail_Road -34237.405 7332.303
## neighborhoodLandmark -17515.371 33190.018
## neighborhoodMeadow_Village -28504.728 8587.374
## neighborhoodMitchell -807.627 7044.504
## neighborhoodNorth_Ames -12043.333 6580.040
## neighborhoodNorthpark_Villa -21001.760 9297.666
## neighborhoodNorthridge 50969.461 7598.477
## neighborhoodNorthridge_Heights 41286.650 6934.388
## neighborhoodNorthwest_Ames -4501.761 6899.162
## neighborhoodOld_Town -34227.629 6788.738
## neighborhoodSawyer -10050.107 6971.135
## neighborhoodSawyer_West -1839.474 6926.510
## neighborhoodSomerset 14864.602 6708.117
## neighborhoodSouth_and_West_of_Iowa_State_University -32879.966 7989.863
## neighborhoodStone_Brook 44167.847 7933.704
## neighborhoodTimberland 21457.060 7340.616
## neighborhoodVeenker 16902.707 9142.389
## overall_qualAverage -12731.181 1798.185
## overall_qualBelow_Average -23728.096 2746.817
## overall_qualExcellent 119613.222 4305.126
## overall_qualFair -37628.738 5450.908
## overall_qualGood 16475.957 1992.253
## overall_qualPoor -40617.925 9309.401
## overall_qualVery_Excellent 164028.682 6819.495
## overall_qualVery_Good 49408.092 2778.225
## overall_qualVery_Poor -71667.182 16394.534
## t value Pr(>|t|)
## (Intercept) 11.527 < 2e-16 ***
## first_flr_sf 31.088 < 2e-16 ***
## second_flr_sf 25.996 < 2e-16 ***
## neighborhoodBlueste -1.680 0.092972 .
## neighborhoodBriardale -3.914 9.28e-05 ***
## neighborhoodBrookside -2.918 0.003552 **
## neighborhoodClear_Creek 2.193 0.028417 *
## neighborhoodCollege_Creek 1.598 0.110124
## neighborhoodCrawford 1.828 0.067615 .
## neighborhoodEdwards -3.764 0.000171 ***
## neighborhoodGilbert 0.810 0.417831
## neighborhoodGreen_Hills 3.716 0.000206 ***
## neighborhoodGreens -1.102 0.270438
## neighborhoodIowa_DOT_and_Rail_Road -4.669 3.16e-06 ***
## neighborhoodLandmark -0.528 0.597727
## neighborhoodMeadow_Village -3.319 0.000913 ***
## neighborhoodMitchell -0.115 0.908733
## neighborhoodNorth_Ames -1.830 0.067311 .
## neighborhoodNorthpark_Villa -2.259 0.023969 *
## neighborhoodNorthridge 6.708 2.37e-11 ***
## neighborhoodNorthridge_Heights 5.954 2.93e-09 ***
## neighborhoodNorthwest_Ames -0.653 0.514125
## neighborhoodOld_Town -5.042 4.90e-07 ***
## neighborhoodSawyer -1.442 0.149503
## neighborhoodSawyer_West -0.266 0.790589
## neighborhoodSomerset 2.216 0.026775 *
## neighborhoodSouth_and_West_of_Iowa_State_University -4.115 3.98e-05 ***
## neighborhoodStone_Brook 5.567 2.83e-08 ***
## neighborhoodTimberland 2.923 0.003493 **
## neighborhoodVeenker 1.849 0.064585 .
## overall_qualAverage -7.080 1.80e-12 ***
## overall_qualBelow_Average -8.638 < 2e-16 ***
## overall_qualExcellent 27.784 < 2e-16 ***
## overall_qualFair -6.903 6.22e-12 ***
## overall_qualGood 8.270 < 2e-16 ***
## overall_qualPoor -4.363 1.33e-05 ***
## overall_qualVery_Excellent 24.053 < 2e-16 ***
## overall_qualVery_Good 17.784 < 2e-16 ***
## overall_qualVery_Poor -4.371 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32520 on 2891 degrees of freedom
## Multiple R-squared: 0.8364, Adjusted R-squared: 0.8343
## F-statistic: 389 on 38 and 2891 DF, p-value: < 2.2e-16
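Each `neighborhood` and `overall_qual` row in the table above is a contrast against a reference level that does not itself appear in the table. The following is a minimal sketch of that behavior on simulated data, not the Ames data: the `toy` data frame, its three quality labels, and the `true_means` values are all made up for illustration. By default R sorts factor levels alphabetically and treats the first one as the reference, and `relevel()` changes which level plays that role.

```r
# Simulated data (NOT the Ames data): three quality labels with made-up means
set.seed(42)
levels_used <- c("Above_Average", "Average", "Good")
true_means  <- c(Above_Average = 150000, Average = 140000, Good = 170000)
toy <- data.frame(quality = factor(rep(levels_used, each = 30)))
toy$price <- unname(true_means[as.character(toy$quality)]) +
  rnorm(nrow(toy), sd = 5000)

# The first factor level is the reference (baseline) under default contrasts
levels(toy$quality)[1]

# The reference level gets no coefficient; it is absorbed into the intercept
fit_default <- lm(price ~ quality, data = toy)
names(coef(fit_default))

# Choose a different reference level; the coefficients change meaning,
# but the fitted values are identical
toy$quality_avg <- relevel(toy$quality, ref = "Average")
fit_releveled <- lm(price ~ quality_avg, data = toy)
names(coef(fit_releveled))
```

Changing the reference level only changes what each coefficient is compared against. The model's predictions and R-squared stay the same.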
# Residuals vs. predicted values (in-sample)
ames_clean %>%
  mutate(predicted = predict(m, newdata = ames_clean),
         residuals = sale_price - predicted) %>%
  ggplot(aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Predicted", x = "Predicted Sale Price", y = "Residuals")
# In-sample fit metrics, computed on the same data used to fit the model.
# RMSE() and R2() come from the caret package.
predictions <- predict(m, newdata = ames_clean)
rmse_val <- RMSE(predictions, ames_clean$sale_price)
r2_val <- R2(predictions, ames_clean$sale_price)
rmse_val
## [1] 32306.01
r2_val
## [1] 0.8364065
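To make the two metrics above concrete, here is a minimal base-R sketch of how RMSE and R-squared can be computed by hand. It runs on simulated data rather than the Ames data: the `toy` data frame and the numbers used to generate it are assumptions for illustration only.

```r
# Simulated stand-in for ames_clean (made-up numbers, one predictor)
set.seed(1)
n <- 200
toy <- data.frame(first_flr_sf = runif(n, 500, 2000))
toy$sale_price <- 50000 + 70 * toy$first_flr_sf + rnorm(n, sd = 20000)

fit  <- lm(sale_price ~ first_flr_sf, data = toy)
pred <- predict(fit, newdata = toy)
res  <- toy$sale_price - pred

# RMSE: square root of the mean squared residual (divides by n)
rmse_manual <- sqrt(mean(res^2))

# R-squared: 1 - residual sum of squares / total sum of squares
r2_manual <- 1 - sum(res^2) / sum((toy$sale_price - mean(toy$sale_price))^2)

# Residual standard error: like RMSE, but divides by the residual
# degrees of freedom (n minus the number of estimated coefficients)
rse_manual <- sqrt(sum(res^2) / df.residual(fit))

c(rmse = rmse_manual, rse = rse_manual, r2 = r2_manual)
```

The same distinction explains why the report's RMSE (32306.01) is slightly smaller than the residual standard error in the model summary (32520 on 2891 degrees of freedom). Both metrics here are computed on the training data, so they describe in-sample fit.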
This analysis shows that home size, location, and overall quality are each strongly associated with residential sale prices in Ames, Iowa. The regression model explains about 84% of the variation in sale prices (R-squared = 0.836, in-sample RMSE of about $32,300) using just four variables, making it both effective and interpretable.
First Floor Square Footage
Each additional square foot on the first floor is associated with an
increase of about $67.66 in predicted sale price, holding second-floor
area, neighborhood, and overall quality fixed.
This variable is highly statistically significant (p < 0.001) and
shows a strong positive relationship.
Second Floor Square Footage
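Each additional square foot on the second floor adds about $45.60 to the
predicted sale price, holding the other predictors fixed.
Like the first floor, it is highly significant (p < 0.001) and positively
associated with home value, though at a lower rate per square foot.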
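As a quick worked example of reading the two square-footage slopes, the sketch below plugs the estimates and standard errors from the `summary(m)` table into R. The 100-square-foot comparison is a hypothetical scenario chosen for illustration, and the intervals are approximate because they rely on the rounded values printed in the table.

```r
# Estimates and standard errors copied from the summary(m) output
est <- c(first_flr_sf = 67.663, second_flr_sf = 45.603)
se  <- c(first_flr_sf = 2.176,  second_flr_sf = 1.754)

# Hypothetical comparison: two otherwise identical homes that differ
# by 100 square feet on one floor
extra_sqft <- 100
price_diff <- est * extra_sqft
price_diff

# Approximate 95% confidence intervals for the per-square-foot slopes,
# using the t distribution with the model's 2891 residual degrees of freedom
t_crit <- qt(0.975, df = 2891)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 2)
```

With these numbers, 100 extra square feet on the first floor correspond to roughly $6,766 in predicted price, versus roughly $4,560 on the second floor.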
Overall Quality (Categorical Variable)
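Home quality strongly affects price. Each coefficient is measured against
the baseline level, which is the quality level omitted from the
coefficient table. Relative to that baseline, a Very_Excellent rating adds
about $164,029 to the predicted price, Excellent about $119,613, and
Very_Good about $49,408. At the low end, Fair, Poor, and Very_Poor ratings
lower the prediction by about $37,629, $40,618, and $71,667 respectively.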
Neighborhood (Categorical Variable)
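Neighborhood effects vary widely and are measured against the baseline
neighborhood, which is omitted from the coefficient table. Among the
statistically significant effects, Green_Hills (+$88,475), Northridge
(+$50,969), Stone_Brook (+$44,168), and Northridge_Heights (+$41,287)
carry the largest premiums, while Briardale (-$34,704),
Iowa_DOT_and_Rail_Road (-$34,237), and Old_Town (-$34,228) carry the
largest discounts. Several neighborhoods, such as Mitchell and
Sawyer_West, do not differ significantly from the baseline.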