The Ames Housing dataset contains information on 1,460 residential home sales in Ames, Iowa. The dataset includes characteristics such as living area, overall quality, year built, basement size, kitchen quality, and final sale price.

The objective of this analysis is to determine which housing characteristics have the greatest impact on home sale prices and to develop a predictive model that can estimate the value of a house based on these characteristics.

This question is useful because homeowners, buyers, real estate professionals, and developers can use these insights to better understand which property features contribute most strongly to housing values.

Data and Variables

Variables Used

Dependent Variable: SalePrice

Independent Variables:

(a) GrLivArea: Above grade (ground) living area square feet

(b) OverallQual: Rates the overall material and finish of the house

(c) YearBuilt: Original construction date

(d) KitchenQual: Kitchen quality

(e) TotalBsmtSF: Total square feet of basement area

summary(
  housing[, c(
    "SalePrice",
    "GrLivArea",
    "OverallQual",
    "YearBuilt",
    "KitchenQual",
    "TotalBsmtSF"
  )]
)
##    SalePrice        GrLivArea     OverallQual       YearBuilt   
##  Min.   : 34900   Min.   : 334   Min.   : 1.000   Min.   :1872  
##  1st Qu.:129975   1st Qu.:1130   1st Qu.: 5.000   1st Qu.:1954  
##  Median :163000   Median :1464   Median : 6.000   Median :1973  
##  Mean   :180921   Mean   :1515   Mean   : 6.099   Mean   :1971  
##  3rd Qu.:214000   3rd Qu.:1777   3rd Qu.: 7.000   3rd Qu.:2000  
##  Max.   :755000   Max.   :5642   Max.   :10.000   Max.   :2010  
##     KitchenQual    TotalBsmtSF    
##  Length   :1460   Min.   :   0.0  
##  N.unique :   4   1st Qu.: 795.8  
##  N.blank  :   0   Median : 991.5  
##  Min.nchar:   2   Mean   :1057.4  
##  Max.nchar:   2   3rd Qu.:1298.2  
##                   Max.   :6110.0

Descriptive Visualizations

library(ggplot2)

ggplot(housing, aes(x = SalePrice)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Sale Prices",
    x = "Sale Price ($)",
    y = "Frequency")

ggplot(housing, aes(x = GrLivArea)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Above-Ground Living Area (sq ft.)",
    x = "Square Feet",
    y = "Frequency")

ggplot(housing, aes(x = factor(OverallQual))) +
  geom_bar() +
  labs(
    title = "Distribution of Overall Home Quality",
    x = "Quality Rating",
    y = "Count")

ggplot(housing, aes(x = YearBuilt)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Year Built",
    x = "Year Built",
    y = "Count")

ggplot(housing, aes(x = KitchenQual)) +
  geom_bar() +
  labs(
    title = "Distribution of Kitchen Quality",
    x = "Kitchen Quality",
    y = "Count")

ggplot(housing, aes(x = TotalBsmtSF)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Total Basement Square Footage",
    x = "Basement Area (sq ft)",
    y = "Count")

Predictive Visualizations

ggplot(housing, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Sale Price vs. Above-Ground Living Area",
    x = "Above-Ground Living Area (sq ft)",
    y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(housing, aes(x = OverallQual, y = SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Sale Price vs. Overall Quality Rating",
    x = "Overall Quality (1-10)",
    y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(housing, aes(x = YearBuilt, y = SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Sale Price vs. Year Built",
    x = "Year Built",
    y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(housing, aes(x = TotalBsmtSF, y = SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Sale Price vs. Total Basement Square Footage",
    x = "Basement Area (sq ft)",
    y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(housing, aes(x = KitchenQual, y = SalePrice)) +
  geom_boxplot() +
  labs(
    title = "Sale Price by Kitchen Quality",
    x = "Kitchen Quality Rating",
    y = "Sale Price ($)")

model <- lm(
  SalePrice ~ GrLivArea +
    OverallQual +
    YearBuilt +
    TotalBsmtSF +
    KitchenQual,
  data = housing
)

summary(model)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt + 
##     TotalBsmtSF + KitchenQual, data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -513187  -18057    -783   15156  256279 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.631e+05  8.300e+04  -9.193  < 2e-16 ***
## GrLivArea      5.391e+01  2.428e+00  22.204  < 2e-16 ***
## OverallQual    1.726e+04  1.195e+03  14.441  < 2e-16 ***
## YearBuilt      3.970e+02  4.295e+01   9.242  < 2e-16 ***
## TotalBsmtSF    2.657e+01  2.766e+00   9.608  < 2e-16 ***
## KitchenQualFa -6.251e+04  8.029e+03  -7.786  1.3e-14 ***
## KitchenQualGd -5.102e+04  4.340e+03 -11.755  < 2e-16 ***
## KitchenQualTA -6.254e+04  4.936e+03 -12.670  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37140 on 1452 degrees of freedom
## Multiple R-squared:  0.7824, Adjusted R-squared:  0.7814 
## F-statistic:   746 on 7 and 1452 DF,  p-value: < 2.2e-16

Regression Diagnostics

par(mfrow = c(2,2))
plot(model)

Hypotheses

H1: Homes with larger above-ground living areas will have higher sale prices.

H2: Homes with higher overall quality ratings will have higher sale prices.

H3: Newer homes will have higher sale prices.

H4: Homes with larger basements will have higher sale prices.

H5: Homes with higher kitchen quality ratings will have higher sale prices.

Discussion

The regression results are consistent with all five hypotheses. GrLivArea, OverallQual, YearBuilt, and TotalBsmtSF all have positive coefficients, every KitchenQual level below Excellent has a negative coefficient, and each term is statistically significant (p < 0.001).

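Among the numeric predictors, OverallQual has the largest coefficient per unit (about $17,260 per rating point), indicating that overall construction quality and finishes are strongly tied to housing values. Because the predictors are measured in different units, the raw coefficients are not directly comparable; GrLivArea has the largest t-statistic (22.2), followed by OverallQual (14.4). YearBuilt also shows a positive relationship with sale price, suggesting newer homes command a premium. Both living area and basement square footage are positively associated with sale price as well.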
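Raw regression coefficients are measured in different units (dollars per rating point, per square foot, per year), so one way to put them on a common footing is to refit the model with standardized predictors. The sketch below illustrates the idea on synthetic data; the `sim` data frame and the numbers used to generate it are made up for illustration and are not the Ames data.

```r
# Synthetic stand-in for the housing data; every value here is made up.
set.seed(42)
n <- 500
sim <- data.frame(
  GrLivArea   = rnorm(n, mean = 1500, sd = 500),
  OverallQual = sample(1:10, n, replace = TRUE)
)
sim$SalePrice <- 50 * sim$GrLivArea + 17000 * sim$OverallQual +
  rnorm(n, sd = 30000)

# Raw coefficients: dollars per square foot and dollars per rating point.
raw_fit <- lm(SalePrice ~ GrLivArea + OverallQual, data = sim)

# Standardized coefficients: dollars per one standard deviation of each predictor.
std_fit <- lm(SalePrice ~ scale(GrLivArea) + scale(OverallQual), data = sim)

raw_coefs <- coef(raw_fit)[-1]  # drop the intercept
std_coefs <- coef(std_fit)[-1]
print(round(raw_coefs, 2))
print(round(std_coefs, 2))
```

Each standardized coefficient equals the raw coefficient multiplied by that predictor's standard deviation, so the ranking of predictors can change once the scale of each variable is taken into account.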

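The model has a multiple R-squared of 0.7824 (adjusted 0.7814), so approximately 78.2% of the variation in home sale prices is explained by the variables included in the model. The residual standard error of $37,140 gives the typical size of a prediction error on the data the model was fit to. These are in-sample figures, so they indicate a good fit rather than validated predictive accuracy, but they do suggest that these housing characteristics are important determinants of market value.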
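An R-squared computed on the same rows used to fit the model can overstate how well it predicts unseen sales. A minimal sketch of a holdout check follows, assuming an 80/20 random split; the data are synthetic and the names `sim`, `train`, and `test` are illustrative, not part of the original analysis.

```r
# Synthetic stand-in for the housing data; every value here is made up.
set.seed(1)
n <- 400
sim <- data.frame(
  GrLivArea   = runif(n, min = 600, max = 3000),
  OverallQual = sample(3:9, n, replace = TRUE)
)
sim$SalePrice <- 55 * sim$GrLivArea + 17000 * sim$OverallQual +
  rnorm(n, sd = 35000)

# Hold out 20% of the rows; fit on the remaining 80%.
test_idx <- sample(seq_len(n), size = round(0.2 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit  <- lm(SalePrice ~ GrLivArea + OverallQual, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample error metrics on the held-out rows.
rmse    <- sqrt(mean((test$SalePrice - pred)^2))
r2_test <- 1 - sum((test$SalePrice - pred)^2) /
  sum((test$SalePrice - mean(test$SalePrice))^2)
print(c(rmse = rmse, r2_test = r2_test))
```

If the held-out R-squared is close to the in-sample value, the fit is less likely to be an artifact of overfitting.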

Business Value (Prescriptive Recommendations)

The results suggest that homeowners and developers should prioritize improvements that increase overall home quality. Based on the model, increasing a property’s quality rating by one level is associated with an increase of approximately $17,260 in sale price, holding the other predictors constant.

Holding the other predictors constant, each additional year of construction recency is associated with approximately $397 in additional value, each additional square foot of living area with roughly $54, and each additional square foot of basement space with approximately $27.
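To make these per-unit figures concrete, the fitted equation can be evaluated by hand. The sketch below copies the coefficients from the summary(model) output (rounded to four significant figures) and applies them to a hypothetical house; the house's characteristics are invented for illustration, so the result is only approximate.

```r
# Coefficients copied from the regression output, rounded as printed there.
coefs <- c(
  Intercept     = -7.631e+05,
  GrLivArea     =  5.391e+01,
  OverallQual   =  1.726e+04,
  YearBuilt     =  3.970e+02,
  TotalBsmtSF   =  2.657e+01,
  KitchenQualTA = -6.254e+04
)

# Hypothetical house: 1,500 sq ft living area, quality 6, built 1973,
# 1,000 sq ft basement, Typical/Average (TA) kitchen.
house <- c(
  Intercept     = 1,
  GrLivArea     = 1500,
  OverallQual   = 6,
  YearBuilt     = 1973,
  TotalBsmtSF   = 1000,
  KitchenQualTA = 1
)

# Predicted price is the sum of coefficient times value over all terms.
predicted_price <- sum(coefs * house[names(coefs)])
print(predicted_price)  # 168636
```

Raising OverallQual from 6 to 7 in this calculation would increase the prediction by exactly the OverallQual coefficient, $17,260.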

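Kitchen quality also plays a substantial role: homes with kitchen quality below “Excellent” (the baseline) are associated with notably lower sale prices. “Typical/Average” (TA) and “Fair” kitchens show nearly identical gaps (approximately $62,540 and $62,510 less than Excellent, respectively), while “Good” kitchens are associated with about $51,020 less. This suggests kitchen renovations may offer strong returns, although the estimates do not account for renovation cost.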
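"Excellent" is the baseline only because lm() converts a character column to a factor with alphabetically sorted levels, which puts "Ex" first. A sketch of choosing a different reference level, so that coefficients read as differences from a typical kitchen, is shown below on synthetic data; the `premium` values and the `sim` data frame are made up for illustration.

```r
# Synthetic stand-in for the housing data; every value here is made up.
set.seed(7)
n <- 400
quality_levels <- c("Fa", "TA", "Gd", "Ex")  # worst to best
sim <- data.frame(
  KitchenQual = sample(quality_levels, n, replace = TRUE),
  stringsAsFactors = FALSE
)
premium <- c(Fa = 0, TA = 0, Gd = 12000, Ex = 62000)
sim$SalePrice <- 150000 + unname(premium[sim$KitchenQual]) +
  rnorm(n, sd = 20000)

# Default behaviour: alphabetical levels, so "Ex" is the baseline.
default_fit <- lm(SalePrice ~ KitchenQual, data = sim)

# Explicit level order, then make "TA" the reference level.
sim$KitchenQual <- relevel(
  factor(sim$KitchenQual, levels = quality_levels),
  ref = "TA"
)
ta_fit <- lm(SalePrice ~ KitchenQual, data = sim)

print(names(coef(default_fit)))
print(names(coef(ta_fit)))
```

Changing the reference level changes how the coefficients are expressed but not the fitted values, so the choice is purely about which comparison is easiest to read.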

Developers can use these findings when planning new construction or renovation projects by allocating resources toward quality upgrades, kitchen improvements, and maximizing usable living and basement space.

Limitations and Future Research

One limitation of this study is that important factors affecting housing prices may not be included in the model, such as local economic conditions, interest rates, school quality, and neighborhood-specific characteristics. Neighborhood was deliberately excluded from this model; while it could increase explanatory power, it would add many categories that reduce interpretability and limit how well the model generalizes beyond Ames, Iowa.

Additionally, the analysis identifies relationships rather than causal effects. While the model demonstrates strong predictive performance, it cannot conclusively determine that changing a specific housing feature will directly cause a corresponding increase in sale price.

Future research could incorporate additional neighborhood-level variables and alternative modeling approaches to improve predictive accuracy.