The real estate market in King County, Washington, USA, is complex and dynamic. Accurate property valuation is crucial for various stakeholders, including homeowners, buyers, real estate agents, and financial institutions. However, determining the fair market value of a house is challenging due to numerous factors that influence property prices.
Key challenges in house price prediction include:
- Multiple influencing factors: House prices are affected by features such as location, size, age, condition, and local market trends.
- Data complexity: The dataset contains 21,613 records with 21 variables, making manual analysis and interpretation impractical.
- Non-linear relationships: Many factors affecting house prices have non-linear relationships that simple valuation methods may not capture accurately.
- Market volatility: Real estate markets can change rapidly due to economic conditions, local development, and other external factors.
- Subjective elements: Some factors, like a property's condition or view, involve subjective assessments that are challenging to quantify.
- Outliers and special cases: Unique properties or unusual market conditions can skew traditional valuation methods.
- Time sensitivity: Property values change over time, requiring models that can adapt to temporal trends.
The goal of this analysis is to develop a robust, data-driven approach to predicting house prices in King County accurately. By leveraging machine learning techniques, we aim to create a model that can process the complex dataset, identify key price determinants, and provide reliable price estimates. This tool would be valuable for real estate professionals, potential buyers and sellers, and financial institutions making informed decisions about property transactions and valuations.
To address the complex challenge of predicting house prices in King County, we developed a comprehensive data science solution that leverages machine learning. Our approach consists of several key components: exploratory analysis and preprocessing of the 21,613-record dataset, screening for missing values and multicollinearity, training and comparing three predictive models (linear regression, Random Forest, and XGBoost) on a held-out test set, and an economic analysis of the value the predictions deliver.
Our solution not only provides accurate house price predictions but also offers insights into the factors driving these prices. The use of multiple models allows for a nuanced understanding of the market, while the economic analysis ensures that our solution is not just technically sound but also financially viable.
This data-driven approach addresses the complexities of the King County real estate market, providing a robust tool for stakeholders to make informed decisions about property valuations and transactions.
# Load required libraries and the dataset (file name assumed)
library(caret)
library(ggplot2)
kc_house_data <- read.csv("kc_house_data.csv")
# Examine the structure of the data
str(kc_house_data)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
# Summary statistics for every variable
summary(kc_house_data)
## id date price bedrooms
## Min. :1.000e+06 Length:21613 Min. : 75000 Min. : 0.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 321950 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540088 Mean : 3.371
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :1560 Median : 0.0 Median :1975 Median : 0.0
## Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12768
## 3rd Qu.: 10083
## Max. :871200
# Count missing values in each column
colSums(is.na(kc_house_data))
## id date price bedrooms bathrooms
## 0 0 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
# Convert date to a more usable format
kc_house_data$date <- as.POSIXct(kc_house_data$date, format = "%Y%m%dT%H%M%S")
kc_house_data$year <- as.numeric(format(kc_house_data$date, "%Y"))
kc_house_data$month <- as.numeric(format(kc_house_data$date, "%m"))
# Select relevant features
features <- c("bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
              "waterfront", "view", "condition", "grade", "sqft_above",
              "sqft_basement", "yr_built", "yr_renovated", "zipcode",
              "lat", "long", "sqft_living15", "sqft_lot15", "year", "month")
target <- "price"
kc_house_subset <- kc_house_data[, c(features, target)]
# Scatter plot of price vs. living space with a linear trend line
ggplot(kc_house_subset, aes(x = sqft_living, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  labs(title = "Price vs. Living Space",
       x = "Square Feet of Living Space",
       y = "Price")

# Box plots of price by construction grade
ggplot(kc_house_subset, aes(x = as.factor(grade), y = price)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Grade",
       x = "Grade",
       y = "Price")
# Split the data into training and testing sets
set.seed(123)
train_index <- createDataPartition(kc_house_subset$price, p = 0.8, list = FALSE)
train_data <- kc_house_subset[train_index, ]
test_data <- kc_house_subset[-train_index, ]
# Check for multicollinearity
cor_matrix <- cor(train_data[, sapply(train_data, is.numeric)])
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
print(colnames(train_data)[high_cor])
## [1] "sqft_living" "sqft_above" "month"
# Remove highly correlated features
train_data_reduced <- train_data[, -high_cor]
test_data_reduced <- test_data[, -high_cor]
# Linear Regression with reduced features
lm_model <- train(price ~ ., data = train_data_reduced, method = "lm")
# Random Forest with reduced features and adjusted parameters
rf_model <- train(price ~ .,
                  data = train_data_reduced,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  ntree = 100,  # Reduce number of trees to limit training time
                  tuneGrid = data.frame(mtry = floor(sqrt(ncol(train_data_reduced) - 1))),
                  nodesize = 5,  # Minimum size of terminal nodes
                  importance = TRUE)
# XGBoost (note: trained on the full feature set, not the reduced one)
xgb_model <- train(price ~ .,
                   data = train_data,
                   method = "xgbTree",
                   verbosity = 0)
# Make predictions
lm_pred <- predict(lm_model, test_data)
rf_pred <- predict(rf_model, test_data)
xgb_pred <- predict(xgb_model, test_data)
# Calculate RMSE
lm_rmse <- RMSE(lm_pred, test_data$price)
rf_rmse <- RMSE(rf_pred, test_data$price)
xgb_rmse <- RMSE(xgb_pred, test_data$price)
# Calculate R-squared
lm_rsq <- R2(lm_pred, test_data$price)
rf_rsq <- R2(rf_pred, test_data$price)
xgb_rsq <- R2(xgb_pred, test_data$price)
# Print results
cat("Linear Regression RMSE:", lm_rmse, "R-squared:", lm_rsq, "\n")## Linear Regression RMSE: 220823.4 R-squared: 0.6692366
## Random Forest RMSE: 154212.6 R-squared: 0.8472822
## XGBoost RMSE: 130593.6 R-squared: 0.8841921
# Variable importance from the Random Forest model
varImp(rf_model)
## rf variable importance
##
## Overall
## lat 100.000
## long 52.453
## grade 40.536
## sqft_living15 31.085
## yr_built 28.350
## sqft_lot 26.668
## zipcode 24.149
## sqft_lot15 24.137
## bathrooms 22.417
## waterfront 17.575
## condition 15.896
## view 14.832
## bedrooms 12.732
## floors 11.595
## sqft_basement 10.158
## yr_renovated 2.584
## year 0.000
# Assume the cost of implementing the model is $50,000
implementation_cost <- 50000
# Calculate the average error reduction for rf (kept for reference, not used below)
# baseline_error_rf <- mean(abs(mean(train_data$price) - test_data$price))
# model_error_rf <- mean(abs(rf_pred - test_data$price))
# error_reduction_rf <- baseline_error_rf - model_error_rf
# Calculate the average error reduction for xgb
baseline_error <- mean(abs(mean(train_data$price) - test_data$price))
model_error <- mean(abs(xgb_pred - test_data$price))
error_reduction <- baseline_error - model_error
# Calculate the potential savings per house, as a percentage of the mean price
savings_per_house <- error_reduction / mean(train_data$price) * 100
# Calculate the number of houses needed to break even
houses_to_breakeven <- ceiling(implementation_cost / error_reduction)
# Calculate ROI after 1000 houses
roi_1000 <- (1000 * error_reduction - implementation_cost) / implementation_cost * 100
cat("Potential savings per house:", sprintf("%.2f%%", savings_per_house), "\n")## Potential savings per house: 30.08%
## Houses needed to break even: 1
## ROI after 1000 houses: 324726.49%
House Price Prediction Model Analysis for King County, USA
This report summarizes the development and evaluation of machine learning models to predict house prices using the King County, USA dataset, which contains over 21,000 records.
Dataset: House Sales in King County, USA (21,613 records, 21 variables)
Key features: bedrooms, bathrooms, square feet of living space, lot size, floors, waterfront, view, condition, grade, year built, year renovated, zipcode, latitude, longitude
Target variable: Price
Three models were developed and compared:
1. Linear Regression
2. Random Forest
3. XGBoost
Best performing model: XGBoost
RMSE: 130593.6
R-squared: 0.8841921
Most important features (per Random Forest importance): latitude, longitude, grade, sqft_living15, year built
Based on these analysis results, here are my recommendations:
1. Model Comparison:
- Linear Regression: RMSE = 220,823.4, R-squared = 0.6692366
- Random Forest: RMSE = 154,212.6, R-squared = 0.8472822
- XGBoost: RMSE = 130,593.6, R-squared = 0.8841921
Both Random Forest and XGBoost substantially outperform linear regression, and XGBoost outperforms Random Forest on both error metrics.
2. Model Selection:
Implement XGBoost as the primary model due to its better test performance. Keep Random Forest as a close secondary model for comparison and robustness, and use linear regression as a baseline.
3. Key Predictors:
Based on Random Forest importance (assuming XGBoost weights features similarly), the top five are latitude, longitude, grade, sqft_living15, and year built. Analyze feature importance in XGBoost to confirm similarities or differences; a quick check follows.
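As a sketch of that check, using the model already fitted (caret's varImp() supports xgbTree models):
# Compare XGBoost's feature ranking with the Random Forest ranking above
xgb_imp <- varImp(xgb_model)
print(xgb_imp)
plot(xgb_imp, top = 10)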
4. Ensemble Approach:
Consider creating an ensemble model that combines predictions from all three models, potentially weighted by their performance; a minimal sketch follows.
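A minimal sketch of such an ensemble, using inverse test RMSE as illustrative (untuned) weights:
# Weight each model by the inverse of its test RMSE (illustrative weights)
rmse_vec <- c(lm = lm_rmse, rf = rf_rmse, xgb = xgb_rmse)
weights <- (1 / rmse_vec) / sum(1 / rmse_vec)
ensemble_pred <- weights["lm"] * lm_pred +
  weights["rf"] * rf_pred +
  weights["xgb"] * xgb_pred
cat("Ensemble RMSE:", RMSE(ensemble_pred, test_data$price), "\n")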
5. Model Interpretability:
Use XGBoost and Random Forest for highest accuracy, but leverage the linear model for its ease of interpretation when explaining predictions to stakeholders (see the example below).
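For example, the fitted linear model's coefficients can be read directly as dollar effects per unit of each feature:
# Inspect the linear model's coefficients for stakeholder-facing explanations
summary(lm_model$finalModel)$coefficients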
6. Feature Engineering:
Given the superior performance of the more complex models, invest in feature engineering to potentially improve all models, especially XGBoost and Random Forest.
7. Hyperparameter Tuning:
Conduct thorough hyperparameter tuning for both XGBoost and Random Forest to improve their performance further; a sketch of a caret tuning grid follows.
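A sketch of what that tuning could look like with caret's xgbTree method; the grid values are common starting points, not tuned results from this analysis:
# Candidate hyperparameter grid for XGBoost (values are illustrative)
xgb_grid <- expand.grid(nrounds = c(100, 300, 500),
                        max_depth = c(4, 6, 8),
                        eta = c(0.05, 0.1, 0.3),
                        gamma = 0,
                        colsample_bytree = 0.8,
                        min_child_weight = c(1, 5),
                        subsample = 0.8)
xgb_tuned <- train(price ~ .,
                   data = train_data,
                   method = "xgbTree",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid = xgb_grid,
                   verbosity = 0)
print(xgb_tuned$bestTune)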
8. Continuous Learning and Adaptation:
Implement a system to regularly retrain all models with new data, allowing them to adapt to changing market conditions (sketched below).
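A sketch of the retraining step, assuming a hypothetical new_sales data frame with the same columns as kc_house_subset:
# Refit on the combined old and new sales (new_sales is hypothetical)
retrain_model <- function(existing_data, new_sales) {
  updated_data <- rbind(existing_data, new_sales)
  train(price ~ ., data = updated_data, method = "xgbTree", verbosity = 0)
}
# Example: xgb_model <- retrain_model(kc_house_subset, new_sales)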
9. Model Monitoring:
Set up a monitoring system to track the performance of all three models over time. This will help identify when retraining is necessary or when the relative performance of the models changes; a drift check is sketched below.
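One possible drift check, assuming a hypothetical new_sales batch with realized prices; the 1.2 tolerance is an illustrative threshold:
# Flag drift when a new batch's RMSE rises well above the test-set baseline
check_model_drift <- function(model, new_sales, baseline_rmse, tolerance = 1.2) {
  batch_rmse <- RMSE(predict(model, new_sales), new_sales$price)
  if (batch_rmse > tolerance * baseline_rmse) {
    warning(sprintf("Batch RMSE %.0f exceeds threshold %.0f; consider retraining",
                    batch_rmse, tolerance * baseline_rmse))
  }
  batch_rmse
}
# Example: check_model_drift(xgb_model, new_sales, xgb_rmse)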
10. User Interface:
Develop an interface that allows users to see predictions from all three models, along with confidence intervals and model-specific feature importance; a sketch of the prediction backend follows.
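A sketch of the backend such an interface could call, returning all three models' estimates for one or more properties:
# Return side-by-side predictions from all three fitted models
predict_all_models <- function(new_houses) {
  data.frame(linear_regression = predict(lm_model, new_houses),
             random_forest = predict(rf_model, new_houses),
             xgboost = predict(xgb_model, new_houses))
}
# Example usage on held-out sales:
predict_all_models(test_data[1:3, ])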
11. Risk Management:
While the complex models show excellent potential, keep the linear model as a fallback option in case of unexpected issues with the more complex models.
12. Ethical and Legal Compliance:
Regularly audit all models for compliance with fair housing laws and to ensure they don't perpetuate biases. The linear model can be particularly useful for these audits due to its transparency; a rough error audit by zip code is sketched below.
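A rough starting point for such an audit, using the test-set predictions already computed: compare mean signed error by zip code to spot systematic over- or under-valuation of specific areas.
# Mean signed prediction error by zip code (large values suggest area bias)
audit_df <- data.frame(zipcode = test_data$zipcode,
                       error = xgb_pred - test_data$price)
zip_bias <- aggregate(error ~ zipcode, data = audit_df, FUN = mean)
head(zip_bias[order(-abs(zip_bias$error)), ])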
13. Educational Component:
Provide training to users on the strengths and weaknesses of each model type, helping them understand when to rely more heavily on one model over the others.
14. Feedback Loop:
Establish a system for users to provide feedback on prediction accuracy, using this information to continually refine and improve all models; a minimal logging sketch follows.
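A minimal logging sketch for collecting that feedback; the file name is illustrative:
# Append each prediction and its realized outcome to a CSV for later audits
log_feedback <- function(house_id, predicted, actual,
                         file = "prediction_feedback.csv") {
  entry <- data.frame(id = house_id, predicted = predicted,
                      actual = actual, logged_at = Sys.time())
  exists_already <- file.exists(file)
  write.table(entry, file, sep = ",", append = exists_already,
              col.names = !exists_already, row.names = FALSE)
}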
By leveraging the strengths of all three models (the accuracy of XGBoost and Random Forest, and the interpretability of linear regression), you can create a robust, high-performing, and transparent house price prediction system.