PROJECT 3

Which property characteristics significantly predict the sale price of residential homes in Ames between 2006–2010?

Introduction

The purpose of this study is to identify which housing attributes most strongly explain variation in home sale prices in Ames, Iowa. Understanding which features drive market value can help buyers, sellers, real estate professionals, and city planners make more informed decisions. This project uses the Ames Housing dataset. It contains 2,930 residential property sales and over 80 variables, documenting structural details, quality measures, location features, and sale conditions.

For this analysis, I focus on a manageable subset of predictors that are theoretically and practically relevant for explaining home prices. These variables include living area (above-ground square footage), overall quality rating, year built, garage area, basement square footage, neighborhood, number of full bathrooms, and lot size. These characteristics reflect size, condition, and location, three key drivers in real estate valuation.

The dataset is publicly available through OpenIntro and can be accessed here:

https://www.openintro.org/data/index.php?data=ames

Data Analysis

To address the research question, I will perform exploratory data analysis (EDA) to understand the distribution, variability, and potential relationships between sale price and key predictors. Visualizations such as histograms, boxplots, and scatterplots will help identify skewness, outliers, and potential linear relationships. Before modeling, I will clean the data by selecting the relevant variables, handling missing values, and recoding categorical variables as needed. I will use dplyr functions, including select(), mutate(), filter(), group_by(), and summarise(), to manipulate the dataset and prepare it for regression analysis.

Chunk 1 - Load the data, use select() to keep the relevant variables, filter() to drop rows with missing price, area, or Overall.Qual, and mutate() to create a house-age variable (age as of 2020)

library(dplyr)

ames <- read.csv("ames.csv")
df <- ames

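# select: keep only the variables used in the analysis
df <- df %>%
  select(price, area, Overall.Qual, Year.Built,
         Garage.Area, Total.Bsmt.SF, Neighborhood,
         Full.Bath, Lot.Area) %>%
  # filter: drop rows missing the outcome or the two main predictors
  filter(!is.na(price), !is.na(area), !is.na(Overall.Qual)) %>%
  # mutate: house age relative to 2020 (an exact linear function of Year.Built)
  mutate(House.Age = 2020 - Year.Built)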

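# Note: head(df) is piped into group_by(), so this summary covers only the
# first six rows. Start the pipeline from df to summarise every neighborhood.
head(df) %>%
  group_by(Neighborhood) %>%
  summarise(Avg_Price = mean(price, na.rm = TRUE),
            Avg_Area = mean(area, na.rm = TRUE),
            Count = n()) %>%
  arrange(desc(Avg_Price))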

## # A tibble: 2 × 4
##   Neighborhood Avg_Price Avg_Area Count
##   <chr>            <dbl>    <dbl> <int>
## 1 Gilbert         192700    1616.     2
## 2 NAmes           184000    1498.     4
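
The counts in the table above sum to six, so the summary reflects only the rows returned by head(df). As a minimal sketch of the same per-neighborhood summary computed over every row, the block below uses base R and a small made-up data frame (`toy` and its values are assumptions for illustration, not Ames data). In the report's dplyr pipeline, the equivalent change would be to start from df rather than head(df).

```r
# Toy stand-in for the selected Ames columns (values are made up).
toy <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "Gilbert", "Gilbert", "NAmes", "StoneBr"),
  price        = c(150000, 170000, 190000, 200000, 160000, 320000),
  area         = c(1200, 1400, 1600, 1700, 1300, 2400)
)

# Mean price and mean area per neighborhood, computed over ALL rows.
by_nbhd <- aggregate(cbind(price, area) ~ Neighborhood, data = toy, FUN = mean)

# Number of sales per neighborhood, matched by name.
by_nbhd$Count <- as.vector(table(toy$Neighborhood)[by_nbhd$Neighborhood])

# Sort from the most to the least expensive neighborhood.
by_nbhd <- by_nbhd[order(-by_nbhd$price), ]
print(by_nbhd)
```

A quick sanity check on any such summary is that the Count column sums to the number of rows in the data being summarised.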

Chunk 2 - Histogram of sale price

library(ggplot2)
ggplot(df, aes(x = price)) +
  geom_histogram(bins = 40, fill = "steelblue") +
  theme_minimal()

Chunk 3 - Scatterplot of area vs. price

ggplot(df, aes(x = area, y = price)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

Regression Analysis

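To evaluate which home characteristics significantly predict sale price, I estimate a multiple linear regression model with sale price as the dependent variable and eight predictors representing home size, quality, age, and location. The model includes both quantitative and categorical predictors. House.Age is also entered in the formula, but because it is computed as 2020 - Year.Built it carries no information beyond Year.Built, and lm() reports its coefficient as NA.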

Chunk 1 - Model

model <- lm(price ~ area + Overall.Qual + Year.Built + Garage.Area +
              Total.Bsmt.SF + Neighborhood + Full.Bath + Lot.Area + House.Age,
            data = df)

summary(model)

## 
## Call:
## lm(formula = price ~ area + Overall.Qual + Year.Built + Garage.Area + 
##     Total.Bsmt.SF + Neighborhood + Full.Bath + Lot.Area + House.Age, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -469654  -14005    -264   13088  273297 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -7.398e+05  9.278e+04  -7.974 2.19e-15 ***
## area                 4.756e+01  2.008e+00  23.689  < 2e-16 ***
## Overall.Qual         1.713e+04  7.710e+02  22.213  < 2e-16 ***
## Year.Built           3.512e+02  4.692e+01   7.485 9.47e-14 ***
## Garage.Area          3.841e+01  3.931e+00   9.773  < 2e-16 ***
## Total.Bsmt.SF        2.205e+01  1.884e+00  11.708  < 2e-16 ***
## NeighborhoodBlueste -1.268e+04  1.234e+04  -1.028  0.30413    
## NeighborhoodBrDale  -1.911e+04  8.949e+03  -2.135  0.03285 *  
## NeighborhoodBrkSide  1.006e+04  7.776e+03   1.294  0.19590    
## NeighborhoodClearCr  1.619e+04  8.404e+03   1.927  0.05412 .  
## NeighborhoodCollgCr  6.475e+03  6.655e+03   0.973  0.33066    
## NeighborhoodCrawfor  3.290e+04  7.565e+03   4.349 1.42e-05 ***
## NeighborhoodEdwards -1.443e+03  7.130e+03  -0.202  0.83966    
## NeighborhoodGilbert  2.563e+03  6.900e+03   0.371  0.71030    
## NeighborhoodGreens   3.396e+03  1.346e+04   0.252  0.80087    
## NeighborhoodGrnHill  1.050e+05  2.436e+04   4.310 1.69e-05 ***
## NeighborhoodIDOTRR  -8.222e+02  7.970e+03  -0.103  0.91784    
## NeighborhoodLandmrk -1.644e+04  3.383e+04  -0.486  0.62709    
## NeighborhoodMeadowV -2.391e+03  8.642e+03  -0.277  0.78201    
## NeighborhoodMitchel  1.873e+03  7.185e+03   0.261  0.79439    
## NeighborhoodNAmes    2.314e+03  6.858e+03   0.337  0.73588    
## NeighborhoodNoRidge  5.472e+04  7.643e+03   7.160 1.02e-12 ***
## NeighborhoodNPkVill -1.152e+04  9.447e+03  -1.219  0.22293    
## NeighborhoodNridgHt  5.834e+04  6.894e+03   8.463  < 2e-16 ***
## NeighborhoodNWAmes  -2.399e+03  7.082e+03  -0.339  0.73488    
## NeighborhoodOldTown -1.182e+03  7.577e+03  -0.156  0.87604    
## NeighborhoodSawyer   4.568e+03  7.192e+03   0.635  0.52545    
## NeighborhoodSawyerW -1.484e+03  7.046e+03  -0.211  0.83316    
## NeighborhoodSomerst  1.446e+04  6.780e+03   2.133  0.03305 *  
## NeighborhoodStoneBr  6.663e+04  7.902e+03   8.433  < 2e-16 ***
## NeighborhoodSWISU   -2.339e+03  8.583e+03  -0.273  0.78521    
## NeighborhoodTimber   2.182e+04  7.518e+03   2.902  0.00374 ** 
## NeighborhoodVeenker  2.370e+04  9.377e+03   2.528  0.01153 *  
## Full.Bath           -3.556e+03  1.680e+03  -2.116  0.03439 *  
## Lot.Area             6.898e-01  9.014e-02   7.652 2.67e-14 ***
## House.Age                   NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33220 on 2893 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8291, Adjusted R-squared:  0.8271 
## F-statistic: 412.8 on 34 and 2893 DF,  p-value: < 2.2e-16
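
The NA row for House.Age and the note "1 not defined because of singularities" follow from House.Age being defined as 2020 - Year.Built, an exact linear function of another predictor. The sketch below reproduces that behavior on simulated data (the variable names and numbers are assumptions for illustration, not Ames values).

```r
set.seed(1)  # simulated data only; not the Ames values

n <- 200
year_built <- sample(1900:2010, n, replace = TRUE)
house_age  <- 2020 - year_built   # exact linear function of year_built
price      <- 50000 + 800 * (year_built - 1900) + rnorm(n, sd = 10000)

# lm() cannot separate the two effects, so the later term gets NA.
fit <- lm(price ~ year_built + house_age)
print(coef(fit))

# The two columns are perfectly negatively correlated.
print(cor(year_built, house_age))
```

Dropping one of the two terms leaves the fitted values unchanged. An age variable measured at the time of sale would not be perfectly collinear with the construction year, but that requires a sale-year column, which the selected variables in this report do not include.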

Chunk 2 - Confidence Intervals

confint(model)

##                             2.5 %        97.5 %
## (Intercept)         -9.217466e+05 -5.579081e+05
## area                 4.362771e+01  5.150172e+01
## Overall.Qual         1.561529e+04  1.863896e+04
## Year.Built           2.591620e+02  4.431494e+02
## Garage.Area          3.070510e+01  4.611886e+01
## Total.Bsmt.SF        1.836016e+01  2.574678e+01
## NeighborhoodBlueste -3.688124e+04  1.151378e+04
## NeighborhoodBrDale  -3.665398e+04 -1.558584e+03
## NeighborhoodBrkSide -5.187526e+03  2.530485e+04
## NeighborhoodClearCr -2.868696e+02  3.266892e+04
## NeighborhoodCollgCr -6.573865e+03  1.952346e+04
## NeighborhoodCrawfor  1.806276e+04  4.772867e+04
## NeighborhoodEdwards -1.542386e+04  1.253826e+04
## NeighborhoodGilbert -1.096682e+04  1.609362e+04
## NeighborhoodGreens  -2.300327e+04  2.979570e+04
## NeighborhoodGrnHill  5.722483e+04  1.527625e+05
## NeighborhoodIDOTRR  -1.644987e+04  1.480537e+04
## NeighborhoodLandmrk -8.278062e+04  4.990241e+04
## NeighborhoodMeadowV -1.933557e+04  1.455296e+04
## NeighborhoodMitchel -1.221523e+04  1.596051e+04
## NeighborhoodNAmes   -1.113366e+04  1.576085e+04
## NeighborhoodNoRidge  3.973507e+04  6.970665e+04
## NeighborhoodNPkVill -3.003984e+04  7.007245e+03
## NeighborhoodNridgHt  4.482639e+04  7.186351e+04
## NeighborhoodNWAmes  -1.628568e+04  1.148849e+04
## NeighborhoodOldTown -1.603800e+04  1.367404e+04
## NeighborhoodSawyer  -9.535196e+03  1.867021e+04
## NeighborhoodSawyerW -1.530065e+04  1.233178e+04
## NeighborhoodSomerst  1.164439e+03  2.775381e+04
## NeighborhoodStoneBr  5.114033e+04  8.212670e+04
## NeighborhoodSWISU   -1.916891e+04  1.449011e+04
## NeighborhoodTimber   7.074829e+03  3.655534e+04
## NeighborhoodVeenker  5.316723e+03  4.209000e+04
## Full.Bath           -6.849998e+03 -2.615740e+02
## Lot.Area             5.130466e-01  8.665378e-01
## House.Age                      NA            NA

Chunk 3 - visualization

library(broom)
tidy_model <- broom::tidy(model, conf.int = TRUE)
tidy_model

## # A tibble: 36 × 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          -7.40e5  92779.       -7.97 2.19e- 15  -9.22e5 -557908. 
##  2 area                  4.76e1      2.01     23.7  1.55e-113   4.36e1      51.5
##  3 Overall.Qual          1.71e4    771.       22.2  4.55e-101   1.56e4   18639. 
##  4 Year.Built            3.51e2     46.9       7.48 9.47e- 14   2.59e2     443. 
##  5 Garage.Area           3.84e1      3.93      9.77 3.24e- 22   3.07e1      46.1
##  6 Total.Bsmt.SF         2.21e1      1.88     11.7  5.71e- 31   1.84e1      25.7
##  7 NeighborhoodBlueste  -1.27e4  12341.       -1.03 3.04e-  1  -3.69e4   11514. 
##  8 NeighborhoodBrDale   -1.91e4   8949.       -2.13 3.28e-  2  -3.67e4   -1559. 
##  9 NeighborhoodBrkSide   1.01e4   7776.        1.29 1.96e-  1  -5.19e3   25305. 
## 10 NeighborhoodClearCr   1.62e4   8404.        1.93 5.41e-  2  -2.87e2   32669. 
## # ℹ 26 more rows

Interpretation

Based on the R-squared value (0.8291; adjusted 0.8271), about 83% of the variation in sale price is explained by the predictors in this model, which suggests a strong fit. Among the quantitative predictors, area, Overall.Qual, Year.Built, Garage.Area, and Total.Bsmt.SF are all highly significant (p < 0.001). Several neighborhoods (for example GrnHill, StoneBr, NridgHt, NoRidge, and Crawfor) also differ significantly from the reference neighborhood. The individual coefficients are interpreted as follows.

area: Each additional square foot of living area is associated with a $47.56 increase in predicted sale price, holding all other variables constant. The small standard error (2.01) indicates a precise estimate, and the p-value (< 2e-16) indicates this effect is highly statistically significant.

Overall.Qual: A one-unit increase in the quality rating is associated with a $17,130 increase in predicted price, holding other variables constant. The narrow 95% confidence interval (15,615 to 18,639) and extremely small p-value show this is a strong and significant predictor.

Year.Built: Each additional year in the construction date is associated with a $351 increase in predicted price, holding other variables constant. The small standard error (46.9) indicates that this estimate is precise. The extremely low p-value shows that this effect is highly statistically significant, meaning the association is unlikely to be due to random chance. The 95% confidence interval (259 to 443) suggests we can be 95% confident that the true effect of each additional year lies within this range.

Garage.Area: Each additional square foot of garage adds about $38 to predicted price. The small standard error (3.93) indicates a precise estimate. The extremely low p-value confirms this effect is statistically significant, meaning it is very unlikely to occur by chance. The 95% confidence interval (30.71 to 46.12) shows the plausible range for the true effect of garage size on sale price.

Total.Bsmt.SF: Each additional square foot of basement adds about $22 to predicted price. The low standard error (1.88) indicates this estimate is precise. The p-value (< 2e-16) shows strong statistical significance. The 95% confidence interval (18.36 to 25.75) provides the range in which the true effect of basement size on price likely falls.

Full.Bath: The coefficient is negative (-3,556), meaning each additional full bathroom is associated with a lower predicted price once the other predictors are held constant. The p-value (0.034) indicates this is statistically significant at the 5% level, though the counterintuitive sign may reflect correlation with other variables such as living area (multicollinearity).

Lot.Area: Each additional square foot of lot is associated with an increase of about $0.69 in predicted price. The effect is statistically significant (p = 2.67e-14; 95% confidence interval 0.51 to 0.87), but the per-square-foot effect is small compared with living area, garage, or basement space.

House.Age: No estimate is reported (NA) because House.Age is an exact linear function of Year.Built.
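
The confidence intervals and the adjusted R-squared quoted above can be checked by hand from the printed summary. The sketch below copies the rounded values for area from the regression output (estimate 47.56, standard error 2.008, 2893 residual degrees of freedom, 34 estimated slopes); because the inputs are rounded, the interval should agree with the confint() row only to about two decimals.

```r
# Values copied from the summary(model) output for `area`.
est    <- 47.56   # estimate, dollars per sq ft
se     <- 2.008   # standard error
df_res <- 2893    # residual degrees of freedom

# 95% CI = estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_res)
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 2))

# Predicted price difference for 100 extra sq ft, other predictors held fixed.
extra_100_sqft <- 100 * est
print(extra_100_sqft)

# Adjusted R-squared from R-squared, n observations, and p estimated slopes.
r2 <- 0.8291
p  <- 34
n  <- df_res + p + 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The same arithmetic applies to any other row of the coefficient table by swapping in its estimate and standard error.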

Model Assumptions and Diagnostics

Chunk 1 - Linearity, residuals, and Homoscedasticity

plot(model, which = 1)

plot(model, which = 3)

## Warning: not plotting observations with leverage one:
##   2787

Chunk 2 - Normality

plot(model, which = 2)

## Warning: not plotting observations with leverage one:
##   2787

hist(residuals(model), main = "Histogram of Residuals", xlab = "Residuals")

Chunk 3 - Multicollinearity

library(car)

model_no_neighborhood <- lm(price ~ area + Overall.Qual + Year.Built + Garage.Area +
                            Total.Bsmt.SF + Full.Bath + Lot.Area, data = df)
vif(model_no_neighborhood)

##          area  Overall.Qual    Year.Built   Garage.Area Total.Bsmt.SF 
##      2.434107      2.488875      1.975171      1.752586      1.640269 
##     Full.Bath      Lot.Area 
##      2.041900      1.156385

plot(model, which = 5)

## Warning: not plotting observations with leverage one:
##   2787

Interpretation

Linearity/Residuals - The assumption is that the relationship between the predictors and the outcome (price) is linear. In the residuals vs. fitted plot, the residuals are scattered around 0 with no clear pattern, so the assumption appears reasonable.

Independence - The assumption is that each house is independent of the others. Since each row in the dataset is the sale of an individual home, we assume independence.

Homoscedasticity - The assumption is that the residuals have constant variance across all fitted values. There is a mild funnel shape in the Scale-Location plot, which indicates a small departure from constant variance, but the model is still usable.

Normality - Most points fall along the reference line in the Q-Q plot, indicating that the residuals are approximately normal. The deviations in the tails do not strongly affect the model.

Multicollinearity - All VIF values, computed on the model without Neighborhood, are below 2.5 (the largest is 2.49 for Overall.Qual), well under the common threshold of 5, so multicollinearity is not a concern among these predictors.
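
For reference, a variance inflation factor for a numeric predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated predictors (`x1`, `x2`, `x3` and their relationships are assumptions for illustration, not the Ames variables), which is the quantity car::vif() reports for models with only single-degree-of-freedom terms.

```r
set.seed(42)  # simulated predictors; not the Ames data

n  <- 300
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)   # moderately correlated with x1
x3 <- rnorm(n)              # unrelated to the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the rest.
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

A value near 1 means the predictor is almost unrelated to the others, while larger values indicate that its coefficient's variance is inflated by overlap with them.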

Conclusion and Future Directions

The model demonstrates strong explanatory power, with an R-squared of 0.83, meaning approximately 83% of the variance in sale price is explained by the included predictors. Residual diagnostics generally support the model assumptions, although the Scale-Location plot showed a slight funnel shape, suggesting mild heteroscedasticity. Overall, the results provide clear evidence that both structural and locational characteristics play a major role in determining home prices.

Future research could build on this analysis by exploring additional property features, such as the number of bedrooms, type of heating, or presence of a pool, to see if they also influence sale price. Transforming the sale price (e.g., using a log scale) could help address slight heteroscedasticity observed in the residuals. Adding simple interaction terms, like overall quality × living area, may capture combined effects of features. Finally, collecting data from other years or nearby cities could improve the model’s generalizability and provide a broader understanding of housing prices in different areas.
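
As a rough illustration of the log-transformation idea, the sketch below fits a model with log(price) as the outcome on simulated data whose noise is multiplicative, so the spread of price grows with its level. All numbers here are assumptions chosen for the example and are not estimates from the Ames data.

```r
set.seed(123)  # simulated data; coefficients are not Ames estimates

n     <- 500
area  <- runif(n, 800, 3000)
# Multiplicative noise: the spread of price grows with its level.
price <- exp(11 + 0.0005 * area + rnorm(n, sd = 0.1))

fit_log <- lm(log(price) ~ area)

# On the log scale, a slope b corresponds to a change of
# 100 * (exp(b * k) - 1) percent in price for k extra units of the predictor.
b <- coef(fit_log)[["area"]]
pct_per_100_sqft <- 100 * (exp(100 * b) - 1)
print(pct_per_100_sqft)
```

With a log outcome, coefficients are read as approximate percentage changes rather than dollar amounts, so the interpretations given earlier in this report would need to be restated if the model were refit this way.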