Which property characteristics significantly predict the sale price of residential homes in Ames between 2006 and 2010?
The purpose of this study is to identify which housing attributes most strongly explain variation in home sale prices in Ames, Iowa. Understanding which features drive market value can help buyers, sellers, real estate professionals, and city planners make more informed decisions. This project uses the Ames Housing dataset. It contains 2,930 residential property sales and over 80 variables, documenting structural details, quality measures, location features, and sale conditions.
For this analysis, I focus on a manageable subset of predictors that are theoretically and practically relevant for explaining home prices. These variables include living area (above-ground square footage), overall quality rating, year built, garage area, basement square footage, neighborhood, number of full bathrooms, and lot size. These characteristics reflect size, condition, and location, three key drivers in real estate valuation.
The dataset is publicly available through OpenIntro and can be accessed here:
To address the research question, I will perform exploratory data analysis (EDA) to understand the distribution, variability, and potential relationships between sale price and key predictors. Visualizations such as histograms, boxplots, and scatterplots will help identify skewness, outliers, and potential linear relationships. Before modeling, I will clean the data by selecting the relevant variables, handling missing values, and recoding categorical variables as needed. I will use dplyr functions—including select(), mutate(), filter(), group_by(), and summarise()—to manipulate the dataset and prepare it for regression analysis.
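Before handling missing values, it can help to count them per column. The sketch below is only an illustration: it uses a small made-up data frame (`toy`, a hypothetical stand-in for the Ames data with a few of the same column names) and base R rather than dplyr.

```r
# Toy stand-in for the Ames data; column names mirror this report,
# but the values are invented for illustration.
toy <- data.frame(
  price       = c(215000, 105000, 172000, NA),
  area        = c(1656, 896, 1329, 2110),
  Garage.Area = c(528, NA, 312, 522)
)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Keep only the rows with no missing value in any column
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```

Running the same two checks on the selected Ames variables would show which columns still contain missing values after filtering on price, area, and Overall.Qual.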
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ames <- read.csv("ames.csv")
df <- ames
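# select: keep only the variables used in the analysis
df <- df %>%
  select(price, area, Overall.Qual, Year.Built,
         Garage.Area, Total.Bsmt.SF, Neighborhood,
         Full.Bath, Lot.Area) %>%
  # filter: drop rows missing the outcome or the two main predictors
  filter(!is.na(price) & !is.na(area) & !is.na(Overall.Qual)) %>%
  # mutate: house age relative to 2020 (an exact linear function of Year.Built)
  mutate(House.Age = 2020 - Year.Built)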
# Note: head(df) keeps only the first six rows, so the summary below covers six homes
head(df) %>%
  group_by(Neighborhood) %>%
  summarise(Avg_Price = mean(price, na.rm = TRUE),
            Avg_Area = mean(area, na.rm = TRUE),
            Count = n()) %>%
  arrange(desc(Avg_Price))
## # A tibble: 2 × 4
## Neighborhood Avg_Price Avg_Area Count
## <chr> <dbl> <dbl> <int>
## 1 Gilbert 192700 1616. 2
## 2 NAmes 184000 1498. 4
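The table above has only two rows because `head(df)` passes just the first six homes into the summary. To summarise every neighborhood, the same pipeline can start from the full data frame. The sketch below shows the pattern on a made-up data frame (`toy` is hypothetical; its values are not from the Ames data).

```r
library(dplyr)

# Toy stand-in for df; values are invented for illustration.
toy <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "Gilbert", "Gilbert", "StoneBr"),
  price        = c(150000, 170000, 190000, 200000, 320000),
  area         = c(1200, 1400, 1600, 1700, 2400)
)

# Same summary as in the report, but without head(), so every row is used
neighborhood_summary <- toy %>%
  group_by(Neighborhood) %>%
  summarise(Avg_Price = mean(price, na.rm = TRUE),
            Avg_Area  = mean(area, na.rm = TRUE),
            Count     = n()) %>%
  arrange(desc(Avg_Price))

print(neighborhood_summary)
```

Replacing `toy` with `df` would give one row per Ames neighborhood, sorted from highest to lowest average price.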
library(ggplot2)
ggplot(df, aes(x = price)) +
  geom_histogram(bins = 40, fill = "steelblue") +
  theme_minimal()
## Chunk 3 - Scatterplot of area x price
ggplot(df, aes(x = area, y = price)) +
  geom_point(alpha = 0.5) +
  theme_minimal()
## Regression Analysis
To evaluate which home characteristics significantly predict sale price, I will estimate a multiple linear regression model with sale price as the dependent variable and eight predictors representing home size, quality, condition, and location. The model includes both quantitative and categorical predictors.
#Chunk 1 - Model
model <- lm(price ~ area + Overall.Qual + Year.Built + Garage.Area +
              Total.Bsmt.SF + Neighborhood + Full.Bath + Lot.Area + House.Age,
            data = df)
summary(model)
##
## Call:
## lm(formula = price ~ area + Overall.Qual + Year.Built + Garage.Area +
## Total.Bsmt.SF + Neighborhood + Full.Bath + Lot.Area + House.Age,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -469654 -14005 -264 13088 273297
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.398e+05 9.278e+04 -7.974 2.19e-15 ***
## area 4.756e+01 2.008e+00 23.689 < 2e-16 ***
## Overall.Qual 1.713e+04 7.710e+02 22.213 < 2e-16 ***
## Year.Built 3.512e+02 4.692e+01 7.485 9.47e-14 ***
## Garage.Area 3.841e+01 3.931e+00 9.773 < 2e-16 ***
## Total.Bsmt.SF 2.205e+01 1.884e+00 11.708 < 2e-16 ***
## NeighborhoodBlueste -1.268e+04 1.234e+04 -1.028 0.30413
## NeighborhoodBrDale -1.911e+04 8.949e+03 -2.135 0.03285 *
## NeighborhoodBrkSide 1.006e+04 7.776e+03 1.294 0.19590
## NeighborhoodClearCr 1.619e+04 8.404e+03 1.927 0.05412 .
## NeighborhoodCollgCr 6.475e+03 6.655e+03 0.973 0.33066
## NeighborhoodCrawfor 3.290e+04 7.565e+03 4.349 1.42e-05 ***
## NeighborhoodEdwards -1.443e+03 7.130e+03 -0.202 0.83966
## NeighborhoodGilbert 2.563e+03 6.900e+03 0.371 0.71030
## NeighborhoodGreens 3.396e+03 1.346e+04 0.252 0.80087
## NeighborhoodGrnHill 1.050e+05 2.436e+04 4.310 1.69e-05 ***
## NeighborhoodIDOTRR -8.222e+02 7.970e+03 -0.103 0.91784
## NeighborhoodLandmrk -1.644e+04 3.383e+04 -0.486 0.62709
## NeighborhoodMeadowV -2.391e+03 8.642e+03 -0.277 0.78201
## NeighborhoodMitchel 1.873e+03 7.185e+03 0.261 0.79439
## NeighborhoodNAmes 2.314e+03 6.858e+03 0.337 0.73588
## NeighborhoodNoRidge 5.472e+04 7.643e+03 7.160 1.02e-12 ***
## NeighborhoodNPkVill -1.152e+04 9.447e+03 -1.219 0.22293
## NeighborhoodNridgHt 5.834e+04 6.894e+03 8.463 < 2e-16 ***
## NeighborhoodNWAmes -2.399e+03 7.082e+03 -0.339 0.73488
## NeighborhoodOldTown -1.182e+03 7.577e+03 -0.156 0.87604
## NeighborhoodSawyer 4.568e+03 7.192e+03 0.635 0.52545
## NeighborhoodSawyerW -1.484e+03 7.046e+03 -0.211 0.83316
## NeighborhoodSomerst 1.446e+04 6.780e+03 2.133 0.03305 *
## NeighborhoodStoneBr 6.663e+04 7.902e+03 8.433 < 2e-16 ***
## NeighborhoodSWISU -2.339e+03 8.583e+03 -0.273 0.78521
## NeighborhoodTimber 2.182e+04 7.518e+03 2.902 0.00374 **
## NeighborhoodVeenker 2.370e+04 9.377e+03 2.528 0.01153 *
## Full.Bath -3.556e+03 1.680e+03 -2.116 0.03439 *
## Lot.Area 6.898e-01 9.014e-02 7.652 2.67e-14 ***
## House.Age NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33220 on 2893 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8291, Adjusted R-squared: 0.8271
## F-statistic: 412.8 on 34 and 2893 DF, p-value: < 2.2e-16
confint(model)
## 2.5 % 97.5 %
## (Intercept) -9.217466e+05 -5.579081e+05
## area 4.362771e+01 5.150172e+01
## Overall.Qual 1.561529e+04 1.863896e+04
## Year.Built 2.591620e+02 4.431494e+02
## Garage.Area 3.070510e+01 4.611886e+01
## Total.Bsmt.SF 1.836016e+01 2.574678e+01
## NeighborhoodBlueste -3.688124e+04 1.151378e+04
## NeighborhoodBrDale -3.665398e+04 -1.558584e+03
## NeighborhoodBrkSide -5.187526e+03 2.530485e+04
## NeighborhoodClearCr -2.868696e+02 3.266892e+04
## NeighborhoodCollgCr -6.573865e+03 1.952346e+04
## NeighborhoodCrawfor 1.806276e+04 4.772867e+04
## NeighborhoodEdwards -1.542386e+04 1.253826e+04
## NeighborhoodGilbert -1.096682e+04 1.609362e+04
## NeighborhoodGreens -2.300327e+04 2.979570e+04
## NeighborhoodGrnHill 5.722483e+04 1.527625e+05
## NeighborhoodIDOTRR -1.644987e+04 1.480537e+04
## NeighborhoodLandmrk -8.278062e+04 4.990241e+04
## NeighborhoodMeadowV -1.933557e+04 1.455296e+04
## NeighborhoodMitchel -1.221523e+04 1.596051e+04
## NeighborhoodNAmes -1.113366e+04 1.576085e+04
## NeighborhoodNoRidge 3.973507e+04 6.970665e+04
## NeighborhoodNPkVill -3.003984e+04 7.007245e+03
## NeighborhoodNridgHt 4.482639e+04 7.186351e+04
## NeighborhoodNWAmes -1.628568e+04 1.148849e+04
## NeighborhoodOldTown -1.603800e+04 1.367404e+04
## NeighborhoodSawyer -9.535196e+03 1.867021e+04
## NeighborhoodSawyerW -1.530065e+04 1.233178e+04
## NeighborhoodSomerst 1.164439e+03 2.775381e+04
## NeighborhoodStoneBr 5.114033e+04 8.212670e+04
## NeighborhoodSWISU -1.916891e+04 1.449011e+04
## NeighborhoodTimber 7.074829e+03 3.655534e+04
## NeighborhoodVeenker 5.316723e+03 4.209000e+04
## Full.Bath -6.849998e+03 -2.615740e+02
## Lot.Area 5.130466e-01 8.665378e-01
## House.Age NA NA
library(broom)
tidy_model <- broom::tidy(model, conf.int = TRUE)
tidy_model
## # A tibble: 36 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -7.40e5 92779. -7.97 2.19e- 15 -9.22e5 -557908.
## 2 area 4.76e1 2.01 23.7 1.55e-113 4.36e1 51.5
## 3 Overall.Qual 1.71e4 771. 22.2 4.55e-101 1.56e4 18639.
## 4 Year.Built 3.51e2 46.9 7.48 9.47e- 14 2.59e2 443.
## 5 Garage.Area 3.84e1 3.93 9.77 3.24e- 22 3.07e1 46.1
## 6 Total.Bsmt.SF 2.21e1 1.88 11.7 5.71e- 31 1.84e1 25.7
## 7 NeighborhoodBlueste -1.27e4 12341. -1.03 3.04e- 1 -3.69e4 11514.
## 8 NeighborhoodBrDale -1.91e4 8949. -2.13 3.28e- 2 -3.67e4 -1559.
## 9 NeighborhoodBrkSide 1.01e4 7776. 1.29 1.96e- 1 -5.19e3 25305.
## 10 NeighborhoodClearCr 1.62e4 8404. 1.93 5.41e- 2 -2.87e2 32669.
## # ℹ 26 more rows
## Interpretation
Based on the R-squared value (0.8291; adjusted 0.8271), about 83% of the variation in sale price is explained by the predictors in this model, which suggests a strong fit. Judging by their t-statistics and p-values, area, Overall.Qual, Year.Built, Garage.Area, and Total.Bsmt.SF are the strongest numeric predictors of sale price. Each coefficient is interpreted below, holding all other variables constant.
area: For each additional square foot of living area, the predicted sale price increases by $47.56. The small standard error (2.01) indicates a precise estimate, and the p-value (< 2e-16) shows that the effect is highly statistically significant. The 95% confidence interval is 43.63 – 51.50.
Overall.Qual: A one-unit increase in the quality rating increases the predicted price by $17,130. The narrow confidence interval (15,615 – 18,639) and extremely small p-value show that this is a strong and significant predictor.
Year.Built: Each additional year in the construction date increases the predicted price by $351. The small standard error (46.9) indicates that this estimate is precise. The extremely low p-value shows that the effect is highly statistically significant, meaning the association is unlikely to be due to random chance. The 95% confidence interval (259 – 443) suggests we can be 95% confident that the true effect of each additional year lies within this range.
Garage.Area: Each additional square foot of garage adds about $38 to the predicted price. The small standard error (3.93) indicates a precise estimate. The extremely low p-value confirms that the effect is statistically significant, meaning it is very unlikely to occur by chance. The confidence interval (30.71 – 46.12) shows the plausible range for the true effect of garage size on price.
Total.Bsmt.SF: Each additional square foot of basement adds about $22 to the predicted price. The low standard error (1.88) indicates that this estimate is precise. The p-value (< 2e-16) shows strong statistical significance. The 95% confidence interval (18.36 – 25.75) gives the range in which the true effect of basement size on price likely falls.
Full.Bath: The coefficient is negative (about -$3,556 per additional full bathroom; 95% confidence interval -6,850 to -262), meaning that more full bathrooms are associated with a lower predicted price once the other predictors are held fixed. The p-value (0.034) indicates statistical significance at the 5% level, though the sign may reflect correlation with other predictors such as living area (multicollinearity) rather than a true negative effect.
Lot.Area: Each additional square foot of lot adds about $0.69 to the predicted price. The effect is statistically significant (p = 2.67e-14; 95% confidence interval 0.51 – 0.87), but its per-square-foot magnitude is small compared with the other size measures.
House.Age: No estimate is reported (NA) because House.Age is defined as 2020 - Year.Built and is therefore perfectly collinear with Year.Built, so R drops it from the model.
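The confidence intervals quoted in this interpretation can be checked by hand as estimate ± t* × standard error. The sketch below does this for `area`, using the rounded estimate, standard error, and residual degrees of freedom copied from the `summary(model)` output. Because the inputs are rounded, the result should agree with `confint(model)` only up to rounding.

```r
# Values copied (rounded) from the summary(model) output for `area`
estimate  <- 47.56
std_error <- 2.008
resid_df  <- 2893   # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = resid_df)

# Lower and upper bounds: estimate -/+ t_crit * std_error
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 2))
```

With 2,893 degrees of freedom the critical value is very close to the normal value of 1.96, so a quick approximation is estimate ± 1.96 × standard error.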
#Chunk 1 - Linearity, residuals, and Homoscedasticity
plot(model, which = 1)
plot(model, which = 3)
## Warning: not plotting observations with leverage one:
## 2787
#Chunk 2 - Normality
plot(model, which = 2)
## Warning: not plotting observations with leverage one:
## 2787
hist(residuals(model), main = "Histogram of Residuals", xlab = "Residuals")
#Chunk 3 - Multicollinearity
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
model_no_neighborhood <- lm(price ~ area + Overall.Qual + Year.Built + Garage.Area +
                              Total.Bsmt.SF + Full.Bath + Lot.Area, data = df)
vif(model_no_neighborhood)
## area Overall.Qual Year.Built Garage.Area Total.Bsmt.SF
## 2.434107 2.488875 1.975171 1.752586 1.640269
## Full.Bath Lot.Area
## 2.041900 1.156385
# Residuals vs. leverage, to check for influential observations
plot(model, which = 5)
## Warning: not plotting observations with leverage one:
## 2787
Linearity/Residuals - The assumption is that the relationship between the predictors and the outcome (price) is linear. In the residuals vs. fitted plot, the residuals are scattered randomly around 0 with no clear pattern, so the assumption appears reasonable.
Independence - The assumption is that the observations are independent of one another. Since each row in the dataset is an individual home sale, we assume independence.
Homoscedasticity - The assumption is that the residuals have constant variance across all fitted values. There is a mild funnel shape in the Scale-Location plot, which indicates a small deviation from constant variance, but the model is still usable.
Normality - Most points fall along the reference line in the Q-Q plot, indicating that the residuals are approximately normal. The deviations in the tails do not strongly affect the model.
Multicollinearity - All VIF values are below 2.5, well under the common threshold of 5, so multicollinearity among the numeric predictors is not a concern. Note that the VIFs were computed from a model that excludes Neighborhood.
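One collinearity issue that the VIF check does not show is visible in the regression output itself: the summary reports "1 not defined because of singularities", and House.Age has NA for every statistic. House.Age is created as 2020 - Year.Built, so it is an exact linear function of Year.Built. The sketch below reproduces this behaviour on simulated data (all variable names and values here are invented for illustration, not taken from the Ames data).

```r
set.seed(1)
n <- 50

# Simulated construction years and a derived age that is an exact
# linear function of the year, mirroring House.Age = 2020 - Year.Built
year_built <- sample(1900:2010, n, replace = TRUE)
house_age  <- 2020 - year_built
price      <- 100000 + 500 * (year_built - 1900) + rnorm(n, sd = 5000)

# Both variables carry the same information, so lm() cannot estimate
# separate coefficients and reports NA for the second one
fit <- lm(price ~ year_built + house_age)
print(coef(fit))

# The two predictors are perfectly negatively correlated
print(cor(year_built, house_age))
```

Dropping one of the two variables from the formula gives the same fitted values and removes the NA row from the output.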
The model demonstrates strong explanatory power, with an R-squared of 0.83, meaning approximately 83% of the variance in sale price is explained by the included predictors. Residual diagnostics indicate that the model assumptions are generally satisfied, although the Scale-Location plot shows a slight funnel shape, suggesting mild heteroscedasticity. Overall, the results provide clear evidence that both structural and locational characteristics play a major role in determining home prices.
Future research could build on this analysis by exploring additional property features, such as the number of bedrooms, type of heating, or presence of a pool, to see if they also influence sale price. Transforming the sale price (e.g., using a log scale) could help address slight heteroscedasticity observed in the residuals. Adding simple interaction terms, like overall quality × living area, may capture combined effects of features. Finally, collecting data from other years or nearby cities could improve the model’s generalizability and provide a broader understanding of housing prices in different areas.
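As a rough sketch of two of these extensions, the code below fits a model with a log-transformed price and a quality × living area interaction. It runs on simulated data (`toy` is hypothetical and its coefficients are arbitrary), so the output says nothing about the Ames results; it only shows the formula syntax.

```r
set.seed(42)
n <- 200

# Simulated predictors; column names mirror this report, values are invented
toy <- data.frame(
  area         = runif(n, 800, 3000),
  Overall.Qual = sample(1:10, n, replace = TRUE)
)

# Simulated prices with multiplicative noise, so log(price) is roughly linear
toy$price <- exp(10.5 + 0.0003 * toy$area + 0.1 * toy$Overall.Qual +
                   rnorm(n, sd = 0.15))

# area * Overall.Qual expands to both main effects plus their interaction
log_model <- lm(log(price) ~ area * Overall.Qual, data = toy)
print(summary(log_model)$coefficients)
```

In a log-price model, each coefficient is read as an approximate proportional change in price per unit of the predictor rather than a dollar amount.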