This analysis utilizes the Ames Housing dataset. The goal of this project is to develop a concise and interpretable linear regression model using at most four variables to predict SalePrice. The final model includes the following predictors:
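- GrLivArea: above-ground living area (square feet)
- GarageCars: number of garage spaces
- PropertyAge: years since the home was built (derived from YearBuilt)
- OverallQual_rank: numeric ranking of the OverallQual descriptions

These variables were selected through exploratory data analysis, including correlation inspection and visualization. The model aims to achieve a strong R² and low RMSE while remaining interpretable and free from multicollinearity.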
Key metrics for the model:
- R²: 0.765
- RMSE: approximately 38,728 (in dollars)

The remainder of this report outlines the data processing steps, methodology, and findings.
The Ames Housing dataset contains 2,930 residential property records with 83 variables. It includes detailed information on property size, structure, quality, and sale conditions.
For this analysis, we focused on a subset of relevant variables:
Data preprocessing included:
- Converting categorical quality descriptions to numeric rankings
- Deriving PropertyAge from YearBuilt
- Inspecting correlation and multicollinearity among predictors
- Handling missing values and ensuring data consistency

An initial summary of the dataset was produced to inspect variable types and understand data distributions. Special attention was given to categorical fields that required transformation for regression modeling.
Key preprocessing steps included:
- PropertyAge: calculated by subtracting YearBuilt from 2025 to better represent home age in years.
- OverallQual: text descriptions converted to a numeric ranking (OverallQual_rank) on a 0–9 quality scale.
- OverallCond: text values encoded using the same 0–9 quality scale.

These transformations were essential to ensure categorical variables could be effectively used in the regression model. Grouped summaries (e.g., mean SalePrice by MSZoning) were used to guide level ordering and validate that encodings captured meaningful differences.
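The transformations described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used to produce this report. The quality label spellings, the helper names (`property_age`, `quality_rank`, `mean_by_group`), and the sample records are assumptions made for the example; only the 2025 reference year and the 0–9 scale come from the text.

```python
from collections import defaultdict

# Reference year used in the report to compute home age.
REFERENCE_YEAR = 2025

# Assumed label spellings, ordered worst to best; adjust to match the data.
QUALITY_LEVELS = [
    "Very_Poor", "Poor", "Fair", "Below_Average", "Average",
    "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent",
]
# Map each label to its position in the list, giving a 0-9 rank.
QUALITY_RANK = {label: rank for rank, label in enumerate(QUALITY_LEVELS)}


def property_age(year_built, reference_year=REFERENCE_YEAR):
    """Years between construction and the reference year."""
    return reference_year - year_built


def quality_rank(label):
    """Numeric 0-9 rank for a quality label; raises KeyError if unknown."""
    return QUALITY_RANK[label]


def mean_by_group(records, group_key, value_key):
    """Mean of value_key within each level of group_key."""
    totals = defaultdict(lambda: [0.0, 0])  # level -> [running sum, count]
    for rec in records:
        entry = totals[rec[group_key]]
        entry[0] += rec[value_key]
        entry[1] += 1
    return {level: total / count for level, (total, count) in totals.items()}


# Made-up records for illustration only (not taken from the dataset).
records = [
    {"YearBuilt": 1995, "OverallQual": "Good", "MSZoning": "RL", "SalePrice": 200000},
    {"YearBuilt": 1960, "OverallQual": "Average", "MSZoning": "RL", "SalePrice": 100000},
    {"YearBuilt": 1940, "OverallQual": "Fair", "MSZoning": "RM", "SalePrice": 90000},
]
for rec in records:
    rec["PropertyAge"] = property_age(rec["YearBuilt"])
    rec["OverallQual_rank"] = quality_rank(rec["OverallQual"])

zoning_means = mean_by_group(records, "MSZoning", "SalePrice")
print(records[0]["PropertyAge"], records[0]["OverallQual_rank"])
print(zoning_means)
```

A grouped mean like `zoning_means` is one way to check that a proposed level ordering tracks sale price before committing to a numeric encoding.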
To identify strong predictors of sale price, I began by exploring
numeric variables through visual inspection using pair plots. These
plots were used to assess the linear relationships between each variable
and SalePrice, helping to guide feature selection.
The numeric variables initially selected for inspection were:
- GarageCars: Number of garage spaces
- GrLivArea: Above-ground living area (square feet)
- PropertyAge: Years since the home was built
- LotArea: Lot size in square feet

Multiple pair plots were generated across the dataset’s numeric
variables in batches, always including SalePrice for
comparison. This visual approach helped identify which variables had
strong, interpretable, and linear relationships with the target
variable.
Key Observations:
- GrLivArea and GarageCars both showed a strong positive linear relationship with SalePrice.
- PropertyAge had a mild negative correlation, consistent with market behavior where newer homes generally sell for more.
- LotArea had a weaker and more dispersed relationship, making it less useful for this linear model.

These findings guided the final variable selection for the regression
model.
To better understand the spread and behavior of the selected predictors, I reviewed their distributions in batches of five variables at a time.
A full correlation matrix was first created to identify variables
most strongly correlated with SalePrice. This helped narrow
down the most impactful numeric predictors.
A second, smaller matrix was then generated to highlight the
correlation between the final selected variables and
SalePrice:
These insights validated the variable choices used in the final
regression model.
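For readers who want to reproduce this kind of check, the sketch below computes pairwise Pearson correlations in plain Python. It is an illustration under stated assumptions, not the code behind the matrices in this report: the function names `pearson` and `correlation_matrix` and the toy values are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration only (not taken from the dataset).
toy = {
    "GrLivArea": [1000, 1500, 1800, 2400, 3000],
    "PropertyAge": [60, 45, 30, 15, 5],
    "SalePrice": [120000, 160000, 185000, 250000, 310000],
}
matrix = correlation_matrix(toy)
for name in toy:
    print(name, round(matrix[name]["SalePrice"], 3))
```

The same matrix also gives a first look at multicollinearity: a very high correlation between two predictors (off the SalePrice row) suggests they carry overlapping information.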
A linear regression model was built using the selected predictors: GrLivArea, GarageCars, OverallQual_rank, and PropertyAge.
The model summary shows all variables are statistically significant, with coefficients in the expected direction. GrLivArea and OverallQual_rank had the largest positive impact on sale price, while PropertyAge had a negative effect.
This supports the conclusion that the selected variables are strong predictors of housing prices.
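The report does not show its fitting code, so the following is only a sketch of how a four-predictor ordinary least squares fit can be computed. It solves the normal equations with Gaussian elimination in plain Python, which is a swapped-in technique chosen for self-containment and not necessarily what statistical software does internally. The helper names and all data values are made up for the example.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (full-rank design matrix).
    """
    n = len(b)
    # Build the augmented matrix [a | b] so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted value for one row of predictors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Made-up rows for illustration only (not taken from the dataset).
# Column order: GrLivArea, GarageCars, OverallQual_rank, PropertyAge
toy_rows = [
    (1000, 1, 4, 60), (1500, 2, 5, 45), (1800, 2, 6, 30), (2400, 3, 7, 15),
    (3000, 3, 8, 5), (1200, 0, 3, 80), (2000, 2, 4, 25), (1600, 1, 6, 50),
]
toy_prices = [120000, 160000, 185000, 250000, 310000, 100000, 175000, 165000]

toy_coefs = fit_ols(toy_rows, toy_prices)
names = ["(Intercept)", "GrLivArea", "GarageCars", "OverallQual_rank", "PropertyAge"]
for name, coef in zip(names, toy_coefs):
    print(name, round(coef, 2))
```

With only eight made-up rows the printed coefficients carry no meaning; the point is the mechanics. Significance tests, as reported in the model summary, require standard errors that this sketch does not compute.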
The Root Mean Square Error (RMSE) was calculated to
measure the model’s average prediction error in dollar terms.
## [1] "The RSME is " "38727.7425282107"
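The RMSE of the model is approximately $38,728, indicating that the model's predictions typically deviate from the actual sale price by roughly that amount. Because errors are squared before averaging, RMSE weights large misses more heavily than a simple average of absolute errors would. The residuals histogram shows a relatively normal distribution, suggesting well-behaved model errors.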
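As a small worked illustration of the metric, the sketch below computes RMSE alongside the mean absolute error (MAE) on made-up numbers. The function names and values are assumptions for the example. The sketch divides by the number of observations, and the report does not state which denominator its own calculation used.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error, shown for comparison with RMSE."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices for illustration only (not taken from the dataset).
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 340000]

print("RMSE:", round(rmse(actual, predicted), 2))
print("MAE: ", round(mae(actual, predicted), 2))
```

Here the residuals are 10,000, 10,000, and 40,000. The single large miss pulls RMSE above MAE, which is why RMSE is not quite the same as an average error.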
The R-squared (R²) value was computed to evaluate how well the model explains the variability in sale prices.
## [1] "r-squared data" "0.764904815480217"
The R² value is 0.765, meaning the model explains roughly 76.5% of the total variance in SalePrice. This indicates a moderate fit and supports the effectiveness of the selected predictors.
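The "proportion of variance explained" reading comes from the definition of R² as one minus the ratio of residual to total sum of squares. A minimal sketch, with a made-up function name and toy values that are not from the dataset:

```python
def r_squared(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(round(r_squared(actual, predicted), 3))
```

A model that always predicts the mean of the observed values scores 0 under this definition, and a perfect model scores 1, so 0.765 sits about three quarters of the way between those two baselines.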
The linear regression model performed moderately well in predicting sale price using four interpretable variables: GrLivArea, GarageCars, OverallQual_rank, and PropertyAge.
All variables were statistically significant with coefficients in the expected direction, and the model achieved a moderate R² of about 0.765 and an RMSE of roughly $38,728.