The Ames Housing dataset contains information on 1,460 residential home sales in Ames, Iowa. The dataset includes characteristics such as living area, overall quality, year built, basement size, kitchen quality, and final sale price.
The objective of this analysis is to determine which housing characteristics have the greatest impact on home sale prices and to develop a predictive model that can estimate the value of a house based on these characteristics.
This question is useful because homeowners, buyers, real estate professionals, and developers can use these insights to better understand which property features contribute most strongly to housing values.
Dependent Variable: SalePrice
Independent Variables:
(a) GrLivArea: Above grade (ground) living area square feet
(b) OverallQual: Rates the overall material and finish of the house
(c) YearBuilt: Original construction date
(d) KitchenQual: Kitchen quality
(e) TotalBsmtSF: Total square feet of basement area
summary(
housing[, c(
"SalePrice",
"GrLivArea",
"OverallQual",
"YearBuilt",
"KitchenQual",
"TotalBsmtSF"
)]
)
## SalePrice GrLivArea OverallQual YearBuilt
## Min. : 34900 Min. : 334 Min. : 1.000 Min. :1872
## 1st Qu.:129975 1st Qu.:1130 1st Qu.: 5.000 1st Qu.:1954
## Median :163000 Median :1464 Median : 6.000 Median :1973
## Mean :180921 Mean :1515 Mean : 6.099 Mean :1971
## 3rd Qu.:214000 3rd Qu.:1777 3rd Qu.: 7.000 3rd Qu.:2000
## Max. :755000 Max. :5642 Max. :10.000 Max. :2010
## KitchenQual TotalBsmtSF
## Length :1460 Min. : 0.0
## N.unique : 4 1st Qu.: 795.8
## N.blank : 0 Median : 991.5
## Min.nchar: 2 Mean :1057.4
## Max.nchar: 2 3rd Qu.:1298.2
## Max. :6110.0
ggplot(housing, aes(x = SalePrice)) +
geom_histogram(bins = 30) +
labs(
title = "Distribution of Sale Prices",
x = "Sale Price ($)",
y = "Frequency")
ggplot(housing, aes(x = GrLivArea)) +
geom_histogram(bins = 30) +
labs(
title = "Distribution of Above-Ground Living Area (sq ft.)",
x = "Square Feet",
y = "Frequency")
ggplot(housing, aes(x = factor(OverallQual))) +
geom_bar() +
labs(
title = "Distribution of Overall Home Quality",
x = "Quality Rating",
y = "Count")
ggplot(housing, aes(x = YearBuilt)) +
geom_histogram(bins = 30) +
labs(
title = "Distribution of Year Built",
x = "Year Built",
y = "Count")
ggplot(housing, aes(x = KitchenQual)) +
geom_bar() +
labs(
title = "Distribution of Kitchen Quality",
x = "Kitchen Quality",
y = "Count")
ggplot(housing, aes(x = TotalBsmtSF)) +
geom_histogram(bins = 30) +
labs(
title = "Distribution of Total Basement Square Footage",
x = "Basement Area (sq ft)",
y = "Count")
ggplot(housing, aes(x = GrLivArea, y = SalePrice)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Sale Price vs. Above-Ground Living Area",
x = "Above-Ground Living Area (sq ft)",
y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(housing, aes(x = OverallQual, y = SalePrice)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Sale Price vs. Overall Quality Rating",
x = "Overall Quality (1-10)",
y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(housing, aes(x = YearBuilt, y = SalePrice)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Sale Price vs. Year Built",
x = "Year Built",
y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(housing, aes(x = TotalBsmtSF, y = SalePrice)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Sale Price vs. Total Basement Square Footage",
x = "Basement Area (sq ft)",
y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(housing, aes(x = KitchenQual, y = SalePrice)) +
geom_boxplot() +
labs(
title = "Sale Price by Kitchen Quality",
x = "Kitchen Quality Rating",
y = "Sale Price ($)")
model <- lm(
SalePrice ~ GrLivArea +
OverallQual +
YearBuilt +
TotalBsmtSF +
KitchenQual,
data = housing
)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt +
## TotalBsmtSF + KitchenQual, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513187 -18057 -783 15156 256279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.631e+05 8.300e+04 -9.193 < 2e-16 ***
## GrLivArea 5.391e+01 2.428e+00 22.204 < 2e-16 ***
## OverallQual 1.726e+04 1.195e+03 14.441 < 2e-16 ***
## YearBuilt 3.970e+02 4.295e+01 9.242 < 2e-16 ***
## TotalBsmtSF 2.657e+01 2.766e+00 9.608 < 2e-16 ***
## KitchenQualFa -6.251e+04 8.029e+03 -7.786 1.3e-14 ***
## KitchenQualGd -5.102e+04 4.340e+03 -11.755 < 2e-16 ***
## KitchenQualTA -6.254e+04 4.936e+03 -12.670 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37140 on 1452 degrees of freedom
## Multiple R-squared: 0.7824, Adjusted R-squared: 0.7814
## F-statistic: 746 on 7 and 1452 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model)
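The residual summary above shows a minimum residual of -513,187, far larger in magnitude than the quartiles, which hints at one or more unusual sales. As a complement to the four diagnostic plots, the following is a minimal sketch for listing high-influence observations. The helper name `flag_influential` and the 4/n cutoff are assumptions introduced here for illustration (4/n is a common rule of thumb, not a step in the original analysis).

```r
# Hypothetical helper (not part of the original analysis): return the
# observations whose Cook's distance exceeds a rule-of-thumb cutoff.
flag_influential <- function(fit, cutoff = 4 / nobs(fit)) {
  d <- cooks.distance(fit)            # one value per fitted observation
  flagged <- d[which(d > cutoff)]     # which() drops any NA/NaN comparisons
  sort(flagged, decreasing = TRUE)    # most influential first
}

# Usage with the model fitted above (assumes `model` exists):
# head(flag_influential(model))
```

Rows flagged this way are candidates for a closer look, not automatic deletions.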
H1: Homes with larger above-ground living areas will have higher sale prices.
H2: Homes with higher overall quality ratings will have higher sale prices.
H3: Newer homes will have higher sale prices.
H4: Homes with larger basements will have higher sale prices.
H5: Homes with higher kitchen quality ratings will have higher sale prices.
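The regression results support all five hypotheses. GrLivArea, OverallQual, YearBuilt, and TotalBsmtSF each have a positive coefficient, every KitchenQual level below “Excellent” has a negative coefficient relative to the Excellent baseline, and all terms are statistically significant (p < 0.001 for each). Support for H5 comes with one qualification: the “Fair” and “Typical/Average” (TA) kitchen coefficients are nearly identical, so the model does not separate those two levels.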
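Among the numeric predictors, OverallQual has the largest per-unit coefficient (about $17,260 per rating point), indicating that overall construction quality and finishes are strongly associated with housing values. Because the predictors are measured in different units (rating points, years, square feet), the raw coefficients are not directly comparable in size; judged by t value, GrLivArea is the most precisely estimated effect (t = 22.2). YearBuilt also shows a meaningful positive relationship with sale price, suggesting newer homes command a premium. Both living area and basement square footage are positively associated with home values as well.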
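Because the numeric predictors are measured in different units, one rough way to put them on a common footing is to multiply each coefficient by the interquartile range (IQR) of its predictor. The sketch below uses only the coefficients and quartiles already reported in the output above; the names `coefs`, `q1`, `q3`, and `iqr_effect` are introduced here for illustration and are not part of the original analysis.

```r
# Coefficients from summary(model), rounded as printed.
coefs <- c(GrLivArea = 53.91, OverallQual = 17260,
           YearBuilt = 397.0, TotalBsmtSF = 26.57)

# First and third quartiles from summary(housing[, ...]).
q1 <- c(GrLivArea = 1130, OverallQual = 5, YearBuilt = 1954, TotalBsmtSF = 795.8)
q3 <- c(GrLivArea = 1777, OverallQual = 7, YearBuilt = 2000, TotalBsmtSF = 1298.2)

# Estimated price difference for a move from the 1st to the 3rd quartile.
iqr_effect <- coefs * (q3 - q1)
round(sort(iqr_effect, decreasing = TRUE))
```

On this scale, a first-to-third-quartile move in living area (about $34,900) and in overall quality (about $34,500) correspond to similar price differences, while year built (about $18,300) and basement area (about $13,300) are smaller.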
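The model achieved an R-squared value of 0.7824 (adjusted R-squared 0.7814), meaning that approximately 78.2% of the variation in home sale prices in this sample is explained by the variables included in the model. The residual standard error of about $37,140 indicates the typical size of the gap between actual and fitted prices. Both figures are computed on the same data used to fit the model, so they describe in-sample fit; accuracy on new sales was not tested. Even so, the fit suggests that these housing characteristics are important determinants of market value.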
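The adjusted R-squared in the model output can be reproduced from the R-squared, the sample size, and the number of estimated slope terms. The short check below uses only figures from the output above; the variable names are introduced here for illustration.

```r
r2 <- 0.7824   # Multiple R-squared from summary(model)
n  <- 1460     # number of sales in the dataset
p  <- 7        # slope terms: 4 numeric predictors + 3 KitchenQual dummies

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # 0.7814, matching the reported value
```

The denominator n - p - 1 equals 1452, the residual degrees of freedom shown in the output.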
The results suggest that homeowners and developers should prioritize improvements that increase overall home quality. Based on the model, a one-level increase in a property’s quality rating is associated with an increase of approximately $17,260 in sale price, holding the other variables constant.
Likewise, each additional year of construction recency is associated with approximately $397 in additional sale price, each additional square foot of above-ground living area with roughly $54, and each additional square foot of basement space with approximately $27.
Kitchen quality also plays a substantial role: relative to homes with “Excellent” kitchens (the baseline), homes with “Good” kitchens are associated with sale prices approximately $51,020 lower, and homes with “Fair” or “Typical/Average” (TA) kitchens with sale prices approximately $62,500 lower ($62,510 and $62,540, respectively). This suggests kitchen renovations may offer strong returns, although the model does not measure renovation costs.
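Because R orders the levels of a character variable alphabetically, “Ex” becomes the reference level, and every kitchen coefficient is read as a discount from Excellent. If a different baseline reads more naturally, the factor can be releveled before fitting. The following is only a sketch: the helper name `set_kitchen_baseline` and the choice of “TA” as the reference level are assumptions introduced here, not part of the original analysis.

```r
# Hypothetical helper: make `ref` the reference level of KitchenQual so that
# lm() reports the other levels relative to it.
set_kitchen_baseline <- function(data, ref = "TA") {
  data$KitchenQual <- relevel(factor(data$KitchenQual), ref = ref)
  data
}

# Usage (assumes the `housing` data frame from above):
# model_ta <- lm(SalePrice ~ GrLivArea + OverallQual + YearBuilt +
#                  TotalBsmtSF + KitchenQual,
#                data = set_kitchen_baseline(housing))
```

Releveling changes only the level against which the coefficients are measured; fitted values and R-squared are unchanged.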
Developers can use these findings when planning new construction or renovation projects by allocating resources toward quality upgrades, kitchen improvements, and maximizing usable living and basement space.
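To illustrate how the coefficients combine into a single estimate, the sketch below applies the rounded values printed in the model output to one house. The helper `estimate_price` and the example house are made up for illustration and are not a row from the dataset; with the fitted object available, `predict(model, newdata = ...)` would use the unrounded coefficients and give a slightly different figure.

```r
# Coefficients as printed by summary(model), rounded to four significant digits.
b <- c(Intercept = -763100, GrLivArea = 53.91, OverallQual = 17260,
       YearBuilt = 397.0, TotalBsmtSF = 26.57)
kitchen <- c(Ex = 0, Fa = -62510, Gd = -51020, TA = -62540)  # Ex is the baseline

# Hypothetical helper: fitted sale price for one house.
estimate_price <- function(gr_liv_area, overall_qual, year_built,
                           total_bsmt_sf, kitchen_qual) {
  unname(b["Intercept"] +
           b["GrLivArea"] * gr_liv_area +
           b["OverallQual"] * overall_qual +
           b["YearBuilt"] * year_built +
           b["TotalBsmtSF"] * total_bsmt_sf +
           kitchen[kitchen_qual])
}

# Made-up example: 1,500 sq ft, quality 6, built 1975, 1,000 sq ft basement,
# Typical/Average kitchen.
estimate_price(1500, 6, 1975, 1000, "TA")   # about 169430
```

Changing only the kitchen rating from "TA" to "Ex" raises the estimate by the TA coefficient, $62,540.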
One limitation of this study is that important factors affecting housing prices may not be included in the model, such as local economic conditions, interest rates, school quality, and neighborhood-specific characteristics. Neighborhood was deliberately excluded from this model; while it could increase explanatory power, it would add many categories that reduce interpretability and limit how well the model generalizes beyond Ames, Iowa.
Additionally, the analysis identifies associations rather than causal effects. While the model fits the observed sales well, it cannot conclusively determine that changing a specific housing feature will directly cause a corresponding increase in sale price.
Future research could incorporate additional neighborhood-level variables and alternative modeling approaches to improve predictive accuracy.
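Comparing candidate models on predictive accuracy requires scoring them on sales that were not used for fitting. A minimal hold-out sketch follows; the helper name `holdout_rmse`, the 80/20 split, and the seed are assumptions chosen for illustration, and this step was not part of the original analysis.

```r
# Hypothetical helper: fit on a random training split and report the
# root mean squared error (RMSE) on the held-out rows.
holdout_rmse <- function(formula, data, test_frac = 0.2, seed = 1) {
  set.seed(seed)                                    # reproducible split
  test_idx <- sample(nrow(data), size = floor(test_frac * nrow(data)))
  stopifnot(length(test_idx) > 0)                   # need at least one test row
  fit    <- lm(formula, data = data[-test_idx, , drop = FALSE])
  test   <- data[test_idx, , drop = FALSE]
  pred   <- predict(fit, newdata = test)
  actual <- test[[all.vars(formula)[1]]]            # response column
  sqrt(mean((actual - pred)^2))
}

# Usage (assumes the `housing` data frame from above):
# holdout_rmse(SalePrice ~ GrLivArea + OverallQual + YearBuilt +
#                TotalBsmtSF + KitchenQual, housing)
```

A hold-out RMSE close to the in-sample residual standard error would suggest the model generalizes to unseen sales within the dataset; a much larger value would point to overfitting or influential outliers.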