Souleymane Doumbia, Data 622 Final Project
2024-12-22
Predicting house prices is critical for real estate stakeholders.
This project aims to build a predictive model to estimate house prices using property features.
Objectives:
Develop a robust predictive model.
Identify the most influential property features.
Simplify the dataset using Principal Component Analysis (PCA).
Business Context:
Optimize pricing strategies.
Guide investments.
Understand factors driving house prices.
Training Data: 1,460 records with 81 variables.
Test Data: Same structure, excluding SalePrice.
Initial Observations:
Numerical Variables: OverallQual, GrLivArea, and GarageCars are likely significant predictors.
Categorical Variables: Neighborhood and HouseStyle need encoding.
Key Steps:
Handle missing values:
Numeric: Imputed with the median.
Categorical: Imputed with the mode.
Align factor levels between training and test datasets.
One-hot encode categorical variables.
Scale all features for consistency.
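A minimal R sketch of these preprocessing steps is shown below. It assumes `train` and `test` are data frames loaded from the project CSVs; the helper name `impute` and the use of dplyr are illustrative, not the project's exact code.

```r
library(dplyr)

# Impute numeric columns with the median and character columns with the mode.
impute <- function(df) {
  df %>%
    mutate(across(where(is.numeric),
                  ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)),
           across(where(is.character),
                  ~ ifelse(is.na(.x), names(which.max(table(.x))), .x)))
}
train <- impute(train)
test  <- impute(test)

# Align factor levels between training and test so one-hot encoding
# produces identical dummy columns in both sets.
for (col in names(test)[sapply(test, is.character)]) {
  lv <- union(train[[col]], test[[col]])
  train[[col]] <- factor(train[[col]], levels = lv)
  test[[col]]  <- factor(test[[col]],  levels = lv)
}

# One-hot encode and scale; the test set has no SalePrice column.
X_train <- scale(model.matrix(SalePrice ~ . - 1, data = train))
X_test  <- scale(model.matrix(~ . - 1, data = test))
y_train <- train$SalePrice
```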
Target Variable: SalePrice
Distribution: Right-skewed, concentrated between $100K–$300K.
A histogram was plotted to visualize the distribution of the SalePrice variable.
The distribution is right-skewed, indicating that the majority of houses are priced in the lower range, with a few high-priced outliers.
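For reference, a minimal ggplot2 sketch that would reproduce this histogram, assuming the training data frame is named `train`:

```r
library(ggplot2)

# Histogram of SalePrice; the right skew shows most homes priced
# between roughly $100K and $300K, with a long upper tail.
ggplot(train, aes(x = SalePrice)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") +
  labs(title = "Distribution of SalePrice",
       x = "Sale Price (USD)", y = "Number of Houses")
```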
Correlation Insights:
Strong positive correlations were observed between SalePrice and features like OverallQual, GrLivArea, and GarageCars.
Other features, such as YearBuilt and TotalBsmtSF, also showed meaningful positive correlations.
No significant negative correlations were identified, though some features exhibited very weak or no relationship with SalePrice.
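A short sketch of how these correlations can be computed in R, assuming the raw (pre-encoding) training data frame `train`:

```r
# Rank numeric features by their correlation with SalePrice,
# computed on the raw training data before encoding and scaling.
num_cols <- sapply(train, is.numeric)
cor_mat  <- cor(train[, num_cols], use = "pairwise.complete.obs")
sort(cor_mat[, "SalePrice"], decreasing = TRUE)[1:6]
```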
The PCA analysis determined that 171 components, out of potentially hundreds of features, are sufficient to explain 95% of the total variance. This represents a significant reduction in dimensionality while retaining the critical information needed for modeling.
A scree plot was generated to visualize the variance explained by each principal component. The rapid decline in variance after the first few components emphasizes the redundancy of many original features.
The scree plot shows the variance explained by each component, with a sharp decline after the first few components. This “elbow” indicates that the majority of the variance is captured by the first few components.
Dimensionality Reduction: Retained 171 components to explain 95% of variance.
Improved Efficiency: Reduced features while retaining critical information.
Overfitting Mitigation: Focused on key components.
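A sketch of the PCA step, assuming the scaled design matrix `X_train` from the preprocessing sketch above; `screeplot` draws the scree plot described earlier.

```r
# PCA on the scaled design matrix; keep enough components for 95% variance.
pca     <- prcomp(X_train)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp  <- which(cum_var >= 0.95)[1]   # 171 components in this project

screeplot(pca, npcs = 30, type = "lines")  # variance explained per component

# Project training and test data onto the retained components.
X_train_pca <- pca$x[, 1:n_comp]
X_test_pca  <- predict(pca, newdata = X_test)[, 1:n_comp]
```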
Both Random Forest and XGBoost models were trained to predict SalePrice. During cross-validation, the Random Forest CV RMSE is approximately 30,000, compared to the XGBoost CV RMSE of 28,391.66.
XGBoost performs slightly better during cross-validation, suggesting stronger generalization to unseen data.
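A sketch of the Random Forest cross-validation using caret; the five-fold setup and the seed are assumptions for illustration, not reported settings.

```r
library(caret)

# 5-fold cross-validation for Random Forest on the PCA features.
set.seed(622)
ctrl     <- trainControl(method = "cv", number = 5)
model_rf <- train(x = X_train_pca, y = y_train, method = "rf",
                  trControl = ctrl, metric = "RMSE")
model_rf$results  # CV RMSE was roughly 30,000 in this project
```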
The hyperparameters for XGBoost, including nrounds = 100, max_depth = 5, eta = 0.1, and subsample = 0.75, contribute to this improved generalization by preventing overfitting.
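A sketch of the XGBoost fit with these hyperparameters, using the xgboost R package; the five-fold CV and the squared-error objective are assumptions.

```r
library(xgboost)

# Training matrix with labels for XGBoost.
dtrain <- xgb.DMatrix(data = X_train_pca, label = y_train)
params <- list(max_depth = 5, eta = 0.1, subsample = 0.75,
               objective = "reg:squarederror")

# Cross-validation with the reported hyperparameters.
cv <- xgb.cv(params = params, data = dtrain, nrounds = 100,
             nfold = 5, metrics = "rmse", verbose = 0)

# Final fit on the full training set.
model_xgb <- xgb.train(params = params, data = dtrain, nrounds = 100)
```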
The density plots of the predicted sale prices for both Random Forest and XGBoost reveal similar distributions, with peaks and tails that align closely.
This suggests that both models are effectively capturing the general trend of the SalePrice variable, even though their error metrics differ.
Minor differences in the prediction distributions could indicate slight variations in how the models handle specific patterns or outliers in the data.
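A sketch of the overlaid density plot, assuming prediction vectors from the two fitted models above:

```r
library(ggplot2)

# Predictions from both models on the PCA-transformed test set.
pred_rf  <- predict(model_rf,  newdata = X_test_pca)
pred_xgb <- predict(model_xgb, newdata = X_test_pca)

# Overlaid density plots to compare the two prediction distributions.
preds <- data.frame(
  price = c(pred_rf, pred_xgb),
  model = rep(c("Random Forest", "XGBoost"), each = nrow(X_test_pca))
)
ggplot(preds, aes(x = price, colour = model)) +
  geom_density() +
  labs(title = "Predicted SalePrice Distributions",
       x = "Predicted Sale Price (USD)", y = "Density")
```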
Random Forest demonstrates superior training performance but may suffer from overfitting, as evidenced by its relatively higher cross-validation RMSE.
XGBoost achieves a better balance between training and validation performance, making it a more reliable model for generalization.
Computational optimizations, including parallel processing, streamlined the evaluation process without sacrificing accuracy.
Overall, XGBoost is recommended as the preferred model for this task due to its stronger cross-validation results, which suggest it will perform better on unseen data.
Key Findings:
PCA reduced dimensions efficiently while retaining 95% variance.
Random Forest showed better training performance.
XGBoost demonstrated stronger generalization during cross-validation.
Business Impact:
Accurate property valuation supports pricing strategies.
Insights into key features guide investments.
Models are scalable for larger datasets.