1. Introduction

This analysis uses the Ames Housing dataset. The goal of this project is to develop a concise, interpretable linear regression model that predicts SalePrice from at most four variables. The final model includes the following predictors:

PropertyAge: Years since the house was built (derived from YearBuilt)
GrLivArea: Above-grade (above-ground) living area (in square feet)
GarageCars: Garage car capacity
OverallQual_rank: A numeric ranking (0–9) derived from text-based OverallQual descriptions

These variables were selected through exploratory data analysis, including correlation inspection and visualization. The model aims to achieve a strong R² and low RMSE while remaining interpretable and free from multicollinearity.

Key metrics for the model:

R²
RMSE

The remainder of this report outlines the data processing steps, methodology, and findings.

2. Data Description

The Ames Housing dataset contains 2,930 residential property records with 83 variables. It includes detailed information on property size, structure, quality, and sale conditions.

For this analysis, we focused on a subset of relevant variables:

SalePrice (continuous): The target variable representing the property’s sale value in USD.
GrLivArea (continuous): The total above-grade living area (in square feet).
GarageCars (discrete): The number of car spaces in the garage.
YearBuilt (discrete): The original construction year, used to derive PropertyAge.
OverallQual (ordinal/text): Assessor’s overall quality rating, converted to a 0–9 numeric scale (OverallQual_rank) based on average sale price per label.

Data preprocessing included:

Converting categorical quality descriptions to numeric rankings
Deriving PropertyAge from YearBuilt
Inspecting correlation and multicollinearity among predictors
Handling missing values and ensuring data consistency

3. Methods

3.1 Data Cleaning & Overview

An initial summary of the dataset was produced to inspect variable types and understand data distributions. Special attention was given to categorical fields that required transformation for regression modeling.

Key preprocessing steps included:

PropertyAge: Derived by subtracting YearBuilt from 2025 to better represent home age in years.
OverallQual_rank: Transformed from descriptive text (e.g., “Good”, “Excellent”) into a numeric 0–9 scale based on predefined quality levels.
OverallCon_rank: Similarly converted from the OverallCond text values using the same 0–9 quality scale.
MSZoning_rank: Re-encoded from zoning descriptions (e.g., “Residential_Low_Density”) into an ordinal numeric scale (0–6) based on average sale prices per category.

These transformations were essential to ensure categorical variables could be effectively used in the regression model. Grouped summaries (e.g., mean SalePrice by MSZoning) were used to guide level ordering and validate that encodings captured meaningful differences.
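
The two kinds of transformation described above can be sketched as follows. This is a minimal illustration on a made-up six-row data frame, not the code used for the report: the prices and the zoning labels other than "Residential_Low_Density" are assumptions, and the ranking rule (order categories by mean SalePrice, starting at 0) is one plausible reading of the description.

```r
# Toy stand-in for the Ames data; all values below are made up for illustration.
homes <- data.frame(
  YearBuilt = c(1960, 1995, 2005, 1978, 2010, 1950),
  MSZoning  = c("Residential_Medium_Density", "Residential_Low_Density",
                "Floating_Village_Residential", "Residential_Low_Density",
                "Floating_Village_Residential", "Residential_Medium_Density"),
  SalePrice = c(120000, 185000, 230000, 160000, 250000, 110000),
  stringsAsFactors = FALSE
)

# PropertyAge: years elapsed between construction and the reference year 2025.
reference_year <- 2025
homes$PropertyAge <- reference_year - homes$YearBuilt

# Mean SalePrice per zoning label, sorted from lowest to highest.
zoning_means <- sort(tapply(homes$SalePrice, homes$MSZoning, mean))

# Ordinal rank starting at 0: the lowest-priced category gets 0, the next 1, and so on.
zoning_rank <- setNames(seq_along(zoning_means) - 1, names(zoning_means))
homes$MSZoning_rank <- unname(zoning_rank[homes$MSZoning])

homes[, c("YearBuilt", "PropertyAge", "MSZoning", "MSZoning_rank")]
```

Because the ranks are derived from the target itself, an encoding like this should ideally be learned on training data only if a holdout evaluation is added later.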

3.2 Explore Variable Relationships with Sales Price

To identify strong predictors of sale price, I began by exploring numeric variables through visual inspection using pair plots. These plots were used to assess the linear relationships between each variable and SalePrice, helping to guide feature selection.

The numeric variables initially selected for inspection were:

GarageCars: Number of garage spaces
GrLivArea: Above-ground living area (square feet)
PropertyAge: Years since the home was built
LotArea: Lot size in square feet

Multiple pair plots were generated across the dataset’s numeric variables in batches, always including SalePrice for comparison. This visual approach helped identify which variables had strong, interpretable, and linear relationships with the target variable.

Key Observations:

GrLivArea and GarageCars both showed a strong positive linear relationship with SalePrice.
PropertyAge had a mild negative correlation, consistent with market behavior where newer homes generally sell for more.
LotArea had a weaker and more dispersed relationship, making it less useful for this linear model.

These findings guided the final variable selection for the regression model.
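
The batched pair-plot approach can be sketched as below. The helper `make_batches` is hypothetical (it is not part of base R or of the original analysis), the batch size is arbitrary, and the data are simulated purely so the example runs on its own.

```r
# Hypothetical helper: split predictor names into batches and prepend the target
# so that every pair plot includes SalePrice.
make_batches <- function(var_names, target = "SalePrice", batch_size = 4) {
  predictors <- setdiff(var_names, target)
  groups <- split(predictors, ceiling(seq_along(predictors) / batch_size))
  lapply(groups, function(vars) c(target, vars))
}

# Simulated toy data; values are for illustration only.
set.seed(1)
n <- 50
homes <- data.frame(
  SalePrice   = rnorm(n, 180000, 40000),
  GrLivArea   = rnorm(n, 1500, 400),
  GarageCars  = sample(0:3, n, replace = TRUE),
  PropertyAge = sample(5:100, n, replace = TRUE),
  LotArea     = rnorm(n, 10000, 3000)
)

batches <- make_batches(names(homes), batch_size = 2)

# Draw one scatterplot matrix per batch.
for (vars in batches) {
  pairs(homes[, vars], main = paste(vars, collapse = ", "))
}
```

Keeping SalePrice as the first column of every batch puts the target in the top row of each scatterplot matrix, which makes the batches easy to compare.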

3.3 Distributions of Chosen Variables

To better understand the spread and behavior of the selected predictors, I reviewed their distributions five variables at a time.

PropertyAge and GrLivArea showed right-skewed distributions, indicating a concentration of newer and moderately sized homes.
GarageCars, OverallQual_rank, and MSZoning_rank were more discretely distributed, as expected from their categorical origins.
These insights confirmed the need for transformation or careful interpretation in the regression model.
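
Right skew can also be checked numerically rather than only by eye. The sketch below defines a small `skewness` helper (base R does not ship one, so this function is an assumption of the example) and applies it to a made-up vector of living areas. The log transform is shown only as one common option, not as a step taken in this report.

```r
# Sample skewness (third standardized moment); positive values indicate a right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

# Toy values, made up for illustration.
gr_liv_area <- c(900, 1100, 1200, 1300, 1400, 1500, 1700, 2100, 3200, 4500)

skew_raw <- skewness(gr_liv_area)       # positive for this right-skewed toy vector
skew_log <- skewness(log(gr_liv_area))  # smaller here: the log pulls the right tail in

c(raw = skew_raw, log = skew_log)

# Histogram of the raw values for a visual check.
hist(gr_liv_area, main = "GrLivArea (toy data)", xlab = "Square feet")
```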

3.4 Correlation Matrices

A full correlation matrix was first created to identify variables most strongly correlated with SalePrice. This helped narrow down the most impactful numeric predictors.

A second, smaller matrix was then generated to highlight the correlation between the final selected variables and SalePrice:

GrLivArea and OverallQual_rank showed strong positive correlations.
GarageCars had a moderate positive relationship.
PropertyAge was negatively correlated, as older homes generally sell for less.

These insights validated the variable choices used in the final regression model.
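
A minimal sketch of the two-step correlation check is shown below, using `cor` on simulated data. The generating coefficients are arbitrary assumptions, so the printed numbers say nothing about the real Ames results. The last step, looking at correlations among the predictors themselves, is a quick screen for multicollinearity added here for illustration.

```r
# Simulated toy data; the coefficients used to generate it are arbitrary.
set.seed(42)
n <- 200
GrLivArea        <- rnorm(n, 1500, 450)
OverallQual_rank <- sample(0:9, n, replace = TRUE)
GarageCars       <- sample(0:3, n, replace = TRUE)
PropertyAge      <- sample(5:120, n, replace = TRUE)
SalePrice <- 20000 + 60 * GrLivArea + 18000 * OverallQual_rank +
  9000 * GarageCars - 400 * PropertyAge + rnorm(n, 0, 25000)

homes <- data.frame(SalePrice, GrLivArea, OverallQual_rank, GarageCars, PropertyAge)

# Full pairwise Pearson correlation matrix.
cor_matrix <- cor(homes, use = "complete.obs")

# Correlations with SalePrice, strongest first (dropping SalePrice itself).
price_cor <- cor_matrix[, "SalePrice"]
price_cor <- price_cor[names(price_cor) != "SalePrice"]
price_cor[order(abs(price_cor), decreasing = TRUE)]

# Largest absolute correlation between any two predictors.
predictor_cor <- cor_matrix[names(price_cor), names(price_cor)]
max_predictor_cor <- max(abs(predictor_cor[upper.tri(predictor_cor)]))
max_predictor_cor
```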

3.5 Build Model

A linear regression model was built using the selected predictors: GrLivArea, GarageCars, OverallQual_rank, and PropertyAge.

The model summary shows all variables are statistically significant, with coefficients in the expected direction. GrLivArea and OverallQual_rank had the largest positive impact on sale price, while PropertyAge had a negative effect.

This supports the selected variables as strong predictors of housing prices.
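
For reference, fitting a model of this form with `lm` might look like the sketch below. The data are simulated with arbitrary coefficients so the example runs on its own, and the estimates it prints are not the report's results.

```r
# Simulated toy data; generating coefficients are arbitrary assumptions.
set.seed(42)
n <- 200
homes <- data.frame(
  GrLivArea        = rnorm(n, 1500, 450),
  OverallQual_rank = sample(0:9, n, replace = TRUE),
  GarageCars       = sample(0:3, n, replace = TRUE),
  PropertyAge      = sample(5:120, n, replace = TRUE)
)
homes$SalePrice <- 20000 + 60 * homes$GrLivArea + 18000 * homes$OverallQual_rank +
  9000 * homes$GarageCars - 400 * homes$PropertyAge + rnorm(n, 0, 25000)

# Ordinary least squares with the four predictors named in the report.
model <- lm(SalePrice ~ GrLivArea + GarageCars + OverallQual_rank + PropertyAge,
            data = homes)

# Coefficient table: estimate, standard error, t value, and p-value per term.
coef_table <- summary(model)$coefficients
coef_table
```

Each slope is the expected change in SalePrice for a one-unit increase in that predictor with the others held fixed, which is why the sign and significance of each row can be checked against expectations.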

3.6 Test the Model

3.6.1 RMSE

The Root Mean Square Error (RMSE) was calculated to measure the model’s average prediction error in dollar terms.

## [1] "The RSME is "     "38727.7425282107"

The RMSE of the model is approximately $38,728. Because RMSE squares each error before averaging, it weights large misses more heavily than a mean absolute error would, so it is best read as the typical size of a prediction error in dollars rather than a literal average deviation. The residual histogram is roughly normal in shape, suggesting well-behaved model errors.
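
As a small worked example of the calculation, the sketch below computes RMSE on four made-up price pairs and compares it with the mean absolute error. The `rmse` helper name is an assumption of this example. For a fitted `lm` object called `model`, the in-sample equivalent would be `sqrt(mean(residuals(model)^2))`.

```r
# Root mean square error: square the errors, average them, take the square root.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values, made up for illustration.
actual    <- c(200000, 150000, 310000, 175000)
predicted <- c(190000, 165000, 300000, 180000)

rmse_value <- rmse(actual, predicted)
mae_value  <- mean(abs(actual - predicted))

# RMSE is never smaller than MAE; the gap grows when a few errors are much larger than the rest.
c(RMSE = rmse_value, MAE = mae_value)
```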

3.6.2 R squared

The R-squared (R²) value was computed to evaluate how well the model explains the variability in sale prices.

## [1] "r-squared data"    "0.764904815480217"

The R² value is 0.765, meaning the model explains about 76.5% of the total variance in SalePrice. This indicates a moderate fit and supports the effectiveness of the selected predictors.
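
The definition behind this number is one minus the ratio of residual to total sum of squares. The sketch below computes it by hand on a made-up six-point dataset and checks it against the value reported by `summary` for an `lm` fit. The `r_squared` helper is an assumption of this example.

```r
# R-squared as 1 - SS_residual / SS_total.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Toy values, made up for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

fit <- lm(y ~ x)

manual_r2  <- r_squared(y, fitted(fit))
summary_r2 <- summary(fit)$r.squared

# The two values agree for a least-squares fit that includes an intercept.
c(manual = manual_r2, summary = summary_r2)
```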

4. Key Findings

The linear regression model performed moderately well in predicting sale price using four interpretable variables: GrLivArea, GarageCars, OverallQual_rank, and PropertyAge.

OverallQual_rank had the strongest positive impact on sale price, reinforcing the influence of perceived home quality.
GrLivArea was also a significant positive predictor.
GarageCars showed a moderate positive effect.
PropertyAge had a negative relationship with price, consistent with expectations that newer homes sell for more.

All variables were statistically significant with signs that match expectations, and the model achieved a moderate R² (0.765) and an RMSE of about $38,728.

5. Limitations

The model does not account for interaction effects or non-linear relationships.
No train/test split was implemented, so performance metrics may be optimistic.
The analysis assumes all numeric transformations (e.g., ranking) are linear in nature, which may oversimplify things.
Only the four selected variables were used as predictors; other potentially informative variables were not captured.
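
One way the missing train/test split could be addressed is sketched below. This was not done in the present analysis. The 80/20 proportion, the seed, and the simulated data are all assumptions chosen for illustration.

```r
# Simulated toy data; generating coefficients are arbitrary assumptions.
set.seed(123)
n <- 200
homes <- data.frame(
  GrLivArea        = rnorm(n, 1500, 450),
  OverallQual_rank = sample(0:9, n, replace = TRUE),
  GarageCars       = sample(0:3, n, replace = TRUE),
  PropertyAge      = sample(5:120, n, replace = TRUE)
)
homes$SalePrice <- 20000 + 60 * homes$GrLivArea + 18000 * homes$OverallQual_rank +
  9000 * homes$GarageCars - 400 * homes$PropertyAge + rnorm(n, 0, 25000)

# Random 80/20 split into training and holdout rows.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- homes[train_idx, ]
test  <- homes[-train_idx, ]

# Fit on the training rows only.
model <- lm(SalePrice ~ GrLivArea + GarageCars + OverallQual_rank + PropertyAge,
            data = train)

# Compare in-sample error with error on rows the model has not seen.
train_rmse <- sqrt(mean(residuals(model)^2))
test_pred  <- predict(model, newdata = test)
test_rmse  <- sqrt(mean((test$SalePrice - test_pred)^2))

c(train = train_rmse, test = test_rmse)
```

A holdout RMSE noticeably above the training RMSE would indicate that the in-sample metrics reported earlier are optimistic.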

6. References

Ames Housing Dataset
R Documentation:
- lm
- cor
- summary
- hist
ChatGPT (https://chat.openai.com):
Used to generate example code (clearly commented), assist in transforming categorical variables, and draft written explanations for markdown sections.

HW6

Leonardo Cuellar

2025-04-15