Introduction

This analysis fits a linear regression model to the Ames Housing dataset to predict residential sale prices from first-floor area, second-floor area, neighborhood, and overall quality. The goal is to identify which of these features are most strongly associated with price.

Data Preparation

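library(tidyverse)   # dplyr verbs, drop_na(), ggplot2, and the %>% pipe
library(janitor)     # clean_names()
library(ggcorrplot)  # ggcorrplot()
library(caret)       # RMSE() and R2()

# Read the raw file and convert column names to snake_case
ames_data <- read.csv("ames.csv") %>%
  clean_names()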

ames_clean <- ames_data %>%
  select(sale_price, first_flr_sf, second_flr_sf, neighborhood, overall_qual) %>%
  drop_na()

Correlation Matrix

# overall_qual holds text labels (e.g. "Good"), so as.numeric() would turn
# every value into NA; keep only the columns that are already numeric.
num_vars <- ames_clean %>%
  select(sale_price, first_flr_sf, second_flr_sf)
corr_matrix <- cor(num_vars)
ggcorrplot(corr_matrix, lab = TRUE)
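
Because overall_qual is stored as text labels such as "Good" rather than numbers, as.numeric() cannot convert it. One possible way to bring it into a correlation is to map the labels to ordered integer scores first. The sketch below is only an illustration: the ten-level ordering is an assumption inferred from the level names in the model output, qual_to_score is a hypothetical helper, and the toy data frame is made up rather than taken from the Ames file.

```r
# Assumed ordering of the quality labels, worst to best.
qual_levels <- c("Very_Poor", "Poor", "Fair", "Below_Average", "Average",
                 "Above_Average", "Good", "Very_Good", "Excellent",
                 "Very_Excellent")

# Hypothetical helper: map each label to its rank (1 = worst, 10 = best).
# Labels that are not in qual_levels become NA.
qual_to_score <- function(x, levels = qual_levels) {
  as.integer(factor(x, levels = levels, ordered = TRUE))
}

# Toy data for illustration only; not drawn from the Ames file.
toy <- data.frame(
  sale_price   = c(120000, 180000, 250000, 410000),
  overall_qual = c("Fair", "Average", "Good", "Excellent")
)
toy$qual_score <- qual_to_score(toy$overall_qual)
toy$qual_score

# Spearman correlation uses ranks, so it respects the ordering without
# assuming the quality levels are equally spaced.
qual_cor <- cor(toy$sale_price, toy$qual_score, method = "spearman")
qual_cor
```

A rank-based correlation is used here because the distance between, say, "Fair" and "Average" need not equal the distance between "Good" and "Very_Good".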

Model

m <- lm(sale_price ~ first_flr_sf + second_flr_sf + neighborhood + overall_qual, data = ames_clean)
summary(m)
## 
## Call:
## lm(formula = sale_price ~ first_flr_sf + second_flr_sf + neighborhood + 
##     overall_qual, data = ames_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -419476  -14017     517   13902  210380 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                          80421.373   6976.914
## first_flr_sf                                            67.663      2.176
## second_flr_sf                                           45.603      1.754
## neighborhoodBlueste                                 -20434.093  12159.664
## neighborhoodBriardale                               -34704.124   8866.232
## neighborhoodBrookside                               -20924.392   7171.216
## neighborhoodClear_Creek                              17539.671   7999.553
## neighborhoodCollege_Creek                            10441.338   6533.481
## neighborhoodCrawford                                 12950.376   7083.471
## neighborhoodEdwards                                 -25773.165   6847.908
## neighborhoodGilbert                                   5531.769   6826.763
## neighborhoodGreen_Hills                              88475.474  23807.232
## neighborhoodGreens                                  -14601.627  13246.917
## neighborhoodIowa_DOT_and_Rail_Road                  -34237.405   7332.303
## neighborhoodLandmark                                -17515.371  33190.018
## neighborhoodMeadow_Village                          -28504.728   8587.374
## neighborhoodMitchell                                  -807.627   7044.504
## neighborhoodNorth_Ames                              -12043.333   6580.040
## neighborhoodNorthpark_Villa                         -21001.760   9297.666
## neighborhoodNorthridge                               50969.461   7598.477
## neighborhoodNorthridge_Heights                       41286.650   6934.388
## neighborhoodNorthwest_Ames                           -4501.761   6899.162
## neighborhoodOld_Town                                -34227.629   6788.738
## neighborhoodSawyer                                  -10050.107   6971.135
## neighborhoodSawyer_West                              -1839.474   6926.510
## neighborhoodSomerset                                 14864.602   6708.117
## neighborhoodSouth_and_West_of_Iowa_State_University -32879.966   7989.863
## neighborhoodStone_Brook                              44167.847   7933.704
## neighborhoodTimberland                               21457.060   7340.616
## neighborhoodVeenker                                  16902.707   9142.389
## overall_qualAverage                                 -12731.181   1798.185
## overall_qualBelow_Average                           -23728.096   2746.817
## overall_qualExcellent                               119613.222   4305.126
## overall_qualFair                                    -37628.738   5450.908
## overall_qualGood                                     16475.957   1992.253
## overall_qualPoor                                    -40617.925   9309.401
## overall_qualVery_Excellent                          164028.682   6819.495
## overall_qualVery_Good                                49408.092   2778.225
## overall_qualVery_Poor                               -71667.182  16394.534
##                                                     t value Pr(>|t|)    
## (Intercept)                                          11.527  < 2e-16 ***
## first_flr_sf                                         31.088  < 2e-16 ***
## second_flr_sf                                        25.996  < 2e-16 ***
## neighborhoodBlueste                                  -1.680 0.092972 .  
## neighborhoodBriardale                                -3.914 9.28e-05 ***
## neighborhoodBrookside                                -2.918 0.003552 ** 
## neighborhoodClear_Creek                               2.193 0.028417 *  
## neighborhoodCollege_Creek                             1.598 0.110124    
## neighborhoodCrawford                                  1.828 0.067615 .  
## neighborhoodEdwards                                  -3.764 0.000171 ***
## neighborhoodGilbert                                   0.810 0.417831    
## neighborhoodGreen_Hills                               3.716 0.000206 ***
## neighborhoodGreens                                   -1.102 0.270438    
## neighborhoodIowa_DOT_and_Rail_Road                   -4.669 3.16e-06 ***
## neighborhoodLandmark                                 -0.528 0.597727    
## neighborhoodMeadow_Village                           -3.319 0.000913 ***
## neighborhoodMitchell                                 -0.115 0.908733    
## neighborhoodNorth_Ames                               -1.830 0.067311 .  
## neighborhoodNorthpark_Villa                          -2.259 0.023969 *  
## neighborhoodNorthridge                                6.708 2.37e-11 ***
## neighborhoodNorthridge_Heights                        5.954 2.93e-09 ***
## neighborhoodNorthwest_Ames                           -0.653 0.514125    
## neighborhoodOld_Town                                 -5.042 4.90e-07 ***
## neighborhoodSawyer                                   -1.442 0.149503    
## neighborhoodSawyer_West                              -0.266 0.790589    
## neighborhoodSomerset                                  2.216 0.026775 *  
## neighborhoodSouth_and_West_of_Iowa_State_University  -4.115 3.98e-05 ***
## neighborhoodStone_Brook                               5.567 2.83e-08 ***
## neighborhoodTimberland                                2.923 0.003493 ** 
## neighborhoodVeenker                                   1.849 0.064585 .  
## overall_qualAverage                                  -7.080 1.80e-12 ***
## overall_qualBelow_Average                            -8.638  < 2e-16 ***
## overall_qualExcellent                                27.784  < 2e-16 ***
## overall_qualFair                                     -6.903 6.22e-12 ***
## overall_qualGood                                      8.270  < 2e-16 ***
## overall_qualPoor                                     -4.363 1.33e-05 ***
## overall_qualVery_Excellent                           24.053  < 2e-16 ***
## overall_qualVery_Good                                17.784  < 2e-16 ***
## overall_qualVery_Poor                                -4.371 1.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32520 on 2891 degrees of freedom
## Multiple R-squared:  0.8364, Adjusted R-squared:  0.8343 
## F-statistic:   389 on 38 and 2891 DF,  p-value: < 2.2e-16
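
The coefficient table tests each neighborhood and quality level separately against an omitted reference level, which makes it hard to judge a factor as a whole. A per-term F-test is one way to get a single p-value for each predictor. The sketch below uses synthetic data (the variables area, zone, and noise are invented for illustration) and is not a rerun of the Ames model.

```r
# Synthetic data for illustration only; not the Ames data.
set.seed(1)
n <- 200
toy <- data.frame(
  area  = runif(n, 500, 2500),
  zone  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  noise = rnorm(n)  # unrelated to price by construction
)
zone_effect <- c(A = 0, B = 20000, C = 50000)
toy$price <- 50000 + 60 * toy$area +
  unname(zone_effect[as.character(toy$zone)]) +
  rnorm(n, sd = 10000)

fit <- lm(price ~ area + zone + noise, data = toy)

# drop1() refits the model without each term in turn and reports one F-test
# per term, so a multi-level factor such as zone gets a single p-value.
term_tests <- drop1(fit, test = "F")
term_tests
```

Applied to the model above, the equivalent call would be drop1(m, test = "F"), which would report one row each for first_flr_sf, second_flr_sf, neighborhood, and overall_qual.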

Residual Plot

ames_clean %>%
  mutate(predicted = predict(m, newdata = ames_clean),
         residuals = sale_price - predicted) %>%
  ggplot(aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Predicted", x = "Predicted Sale Price", y = "Residuals")
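
The residual summary shows a minimum of -419476, far larger in magnitude than the quartiles, which suggests a few sales are fit very poorly. One way to locate such rows is to flag large standardized residuals. The sketch below uses synthetic data with one injected outlier; the cutoff of 3 is a common rule of thumb and an assumption here, not something the analysis specifies.

```r
# Synthetic data for illustration only; not the Ames data.
set.seed(42)
toy <- data.frame(x = seq(1, 50))
toy$y <- 3 + 2 * toy$x + rnorm(50)
toy$y[10] <- toy$y[10] + 20  # inject one gross outlier at row 10

fit <- lm(y ~ x, data = toy)

# rstandard() scales each residual by its estimated standard deviation,
# so values can be compared against a single cutoff.
z <- rstandard(fit)
flagged <- unname(which(abs(z) > 3))
flagged
```

For the model above, the same idea would be which(abs(rstandard(m)) > 3), after which the flagged rows of ames_clean could be inspected by hand.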

Model Performance

predictions <- predict(m, newdata = ames_clean)

# In-sample metrics: computed on the same rows that were used to fit the model
rmse_val <- RMSE(predictions, ames_clean$sale_price)
r2_val <- R2(predictions, ames_clean$sale_price)

rmse_val
## [1] 32306.01
r2_val
## [1] 0.8364065
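
The RMSE of 32306.01 is slightly lower than the residual standard error of 32520 reported by summary(m). Both come from the same residual sum of squares. RMSE divides it by the number of rows, n = 2930, while the residual standard error divides it by the residual degrees of freedom, 2891. Indeed 32306.01 × sqrt(2930 / 2891) ≈ 32523, which matches 32520 at the four significant digits that summary() prints. The sketch below reproduces both quantities in base R on synthetic data, not the Ames data.

```r
# Synthetic data for illustration only; not the Ames data.
set.seed(7)
toy <- data.frame(x = runif(100))
toy$y <- 1 + 2 * toy$x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x, data = toy)

res <- residuals(fit)
n   <- length(res)
rss <- sum(res^2)  # residual sum of squares

rmse <- sqrt(rss / n)                 # divides by the number of rows
rse  <- sqrt(rss / df.residual(fit))  # divides by n minus the coefficient count
r2   <- 1 - rss / sum((toy$y - mean(toy$y))^2)

c(rmse = rmse, rse = rse, r2 = r2)
```

Both metrics here are in-sample, so they describe fit to the training rows rather than accuracy on unseen sales.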

Thesis

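This analysis shows that home size, location, and overall quality are all strongly associated with residential sale prices in Ames, Iowa. The regression model explains about 83.6% of the in-sample variation in sale price (R-squared 0.8364, adjusted 0.8343) using just four predictors: first-floor area, second-floor area, neighborhood, and overall quality. Both floor-area terms and every overall-quality level are significant at the 0.001 level, while several neighborhoods do not differ significantly from the reference neighborhood. The model is therefore both effective and interpretable.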

Variable Impact

Predicted Values and Residuals

Output Summary