ContextBase Logo



Section 1 - The Problem

The real estate market in King County, Washington, USA, is complex and dynamic. Accurate property valuation is crucial for various stakeholders, including homeowners, buyers, real estate agents, and financial institutions. However, determining the fair market value of a house is challenging due to numerous factors that influence property prices.

Key challenges in house price prediction include:

  1. Multiple influencing factors: House prices are affected by various features such as location, size, age, condition, and local market trends.

  2. Data complexity: The dataset contains over 21,000 records with 21 variables, making it difficult to manually analyze and interpret.

  3. Non-linear relationships: Many factors affecting house prices have non-linear relationships, which simple valuation methods may not capture accurately.

  4. Market volatility: Real estate markets can change rapidly due to economic conditions, local development, and other external factors.

  5. Subjective elements: Some factors, like a property’s condition or view, involve subjective assessments that are challenging to quantify.

  6. Outliers and special cases: Unique properties or unusual market conditions can skew traditional valuation methods.

  7. Time sensitivity: Property values can change over time, requiring models that can adapt to temporal trends.

The goal of this analysis is to develop a robust, data-driven approach to predict house prices in King County accurately. By leveraging machine learning techniques, we aim to create a model that can process the complex dataset, identify key price determinants, and provide reliable price estimates. This tool would be valuable for real estate professionals, potential buyers and sellers, and financial institutions in making informed decisions about property transactions and valuations.


Section 2 - Our Solution

To address the complex challenge of predicting house prices in King County, we developed a comprehensive data science solution that leverages advanced machine learning techniques. Our approach consists of several key components:

  1. Data Preprocessing and Feature Engineering:
    • We cleaned and prepared the dataset, handling missing values and converting dates into a more usable format.
    • We engineered new features, such as extracting the year and month from the sale date, to capture temporal trends.
  2. Exploratory Data Analysis:
    • We conducted in-depth analysis of the relationships between various features and house prices.
    • Visualizations were created to better understand these relationships, such as the correlation between living space and price.
  3. Feature Selection:
    • We identified and removed highly correlated features to reduce multicollinearity and improve model performance.
    • Important features were selected based on their correlation with the target variable and their overall impact on the models.
  4. Multiple Model Development:
    We developed and compared three different machine learning models:
    1. Linear Regression: As a baseline model and for its interpretability.
    2. Random Forest: To capture non-linear relationships and feature interactions.
    3. XGBoost: For its high performance and ability to handle complex patterns in the data.
  5. Model Evaluation:
    • We used metrics such as Root Mean Square Error (RMSE) and R-squared to evaluate and compare the performance of each model.
    • Cross-validation was employed to ensure the robustness of our results.
  6. Feature Importance Analysis:
    • We analyzed feature importance to understand which factors have the most significant impact on house prices.
  7. Economic Analysis:
    • We conducted a cost-benefit analysis to determine the economic viability of implementing our solution.
    • This included calculating potential savings per house, the number of houses needed to break even on implementation costs, and the return on investment (ROI) after a certain number of predictions.
  8. Recommendations:
    • Based on our analysis, we provided comprehensive recommendations for model selection, implementation strategy, and ongoing maintenance.

Our solution not only provides accurate house price predictions but also offers insights into the factors driving these prices. The use of multiple models allows for a nuanced understanding of the market, while the economic analysis ensures that our solution is not just technically sound but also financially viable.

This data-driven approach addresses the complexities of the King County real estate market, providing a robust tool for stakeholders to make informed decisions about property valuations and transactions.



Section 3 - Data Import

https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

kc_house_data <- read.csv("kc_house_data.csv")



Section 4 - Data Analysis and Preparation

# Check the structure of the dataset
str(kc_house_data)
## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : chr  "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
# Summary statistics
summary(kc_house_data)
##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21613       Min.   :  75000   Min.   : 0.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 321950   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540088   Mean   : 3.371  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##    waterfront            view          condition         grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000  
##  Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated   
##  Min.   : 290   Min.   :   0.0   Min.   :1900   Min.   :   0.0  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.0  
##  Mean   :1788   Mean   : 291.5   Mean   :1971   Mean   :  84.4  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.0  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12768  
##  3rd Qu.: 10083  
##  Max.   :871200
# Check for missing values
colSums(is.na(kc_house_data))
##            id          date         price      bedrooms     bathrooms 
##             0             0             0             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0
# Convert date to a more usable format
kc_house_data$date <- as.POSIXct(kc_house_data$date, format = "%Y%m%dT%H%M%S")
kc_house_data$year <- as.numeric(format(kc_house_data$date, "%Y"))
kc_house_data$month <- as.numeric(format(kc_house_data$date, "%m"))

# Select relevant features
features <- c("bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", 
              "waterfront", "view", "condition", "grade", "sqft_above", 
              "sqft_basement", "yr_built", "yr_renovated", "zipcode", 
              "lat", "long", "sqft_living15", "sqft_lot15", "year", "month")
target <- "price"

kc_house_subset <- kc_house_data[, c(features, target)]



Section 5 - Visualize Relationships

ggplot(kc_house_subset, aes(x = sqft_living, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  labs(title = "Price vs. Living Space",
       x = "Square Feet of Living Space",
       y = "Price")

ggplot(kc_house_subset, aes(x = as.factor(grade), y = price)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Grade",
       x = "Grade",
       y = "Price")



Section 6 - Model Development

# Split the data into training and testing sets
set.seed(123)
train_index <- createDataPartition(kc_house_subset$price, p = 0.8, list = FALSE)
train_data <- kc_house_subset[train_index, ]
test_data <- kc_house_subset[-train_index, ]

# Check for multicollinearity
cor_matrix <- cor(train_data[, sapply(train_data, is.numeric)])
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
print(colnames(train_data)[high_cor])
## [1] "sqft_living" "sqft_above"  "month"
# Remove highly correlated features
train_data_reduced <- train_data[, -high_cor]
test_data_reduced <- test_data[, -high_cor]

# Linear Regression with reduced features
lm_model <- train(price ~ ., data = train_data_reduced, method = "lm")

# Random Forest with reduced features and adjusted parameters
rf_model <- train(price ~ ., 
                  data = train_data_reduced, 
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  ntree = 100,  # Reduce number of trees
                  tuneGrid = data.frame(mtry = sqrt(ncol(train_data_reduced) - 1)),
                  nodesize = 5,  # Minimum size of terminal nodes
                  importance = TRUE)

# XGBoost
xgb_model <- train(price ~ ., 
                   data = train_data, 
                   method = "xgbTree",
                   verbosity = 0)



Section 7 - Model Evaluation and Conclusions

# Make predictions
lm_pred <- predict(lm_model, test_data)
rf_pred <- predict(rf_model, test_data)
xgb_pred <- predict(xgb_model, test_data)

# Calculate RMSE
lm_rmse <- RMSE(lm_pred, test_data$price)
rf_rmse <- RMSE(rf_pred, test_data$price)
xgb_rmse <- RMSE(xgb_pred, test_data$price)

# Calculate R-squared
lm_rsq <- R2(lm_pred, test_data$price)
rf_rsq <- R2(rf_pred, test_data$price)
xgb_rsq <- R2(xgb_pred, test_data$price)

# Print results
cat("Linear Regression RMSE:", lm_rmse, "R-squared:", lm_rsq, "\n")
## Linear Regression RMSE: 220823.4 R-squared: 0.6692366
cat("Random Forest RMSE:", rf_rmse, "R-squared:", rf_rsq, "\n")
## Random Forest RMSE: 154212.6 R-squared: 0.8472822
cat("XGBoost RMSE:", xgb_rmse, "R-squared:", xgb_rsq, "\n")
## XGBoost RMSE: 130593.6 R-squared: 0.8841921
# Feature importance (for Random Forest)
importance <- varImp(rf_model)
print(importance)
## rf variable importance
## 
##               Overall
## lat           100.000
## long           52.453
## grade          40.536
## sqft_living15  31.085
## yr_built       28.350
## sqft_lot       26.668
## zipcode        24.149
## sqft_lot15     24.137
## bathrooms      22.417
## waterfront     17.575
## condition      15.896
## view           14.832
## bedrooms       12.732
## floors         11.595
## sqft_basement  10.158
## yr_renovated    2.584
## year            0.000



Section 8 - Economic Model and ROI Analysis

# Assume the cost of implementing the model is $50,000
implementation_cost <- 50000

# Calculate the average error reduction for rf
# baseline_error_rf <- mean(abs(mean(train_data$price) - test_data$price))
# model_error_rf <- mean(abs(rf_pred - test_data$price))
# error_reduction_rf <- baseline_error - model_error

# Calculate the average error reduction for xgb
baseline_error <- mean(abs(mean(train_data$price) - test_data$price))
model_error <- mean(abs(xgb_pred - test_data$price))
error_reduction <- baseline_error - model_error

# Calculate the potential savings per house
savings_per_house <- error_reduction / mean(train_data$price) * 100

# Calculate the number of houses needed to break even
houses_to_breakeven <- ceiling(implementation_cost / error_reduction)

# Calculate ROI after 1000 houses
roi_1000 <- (1000 * error_reduction - implementation_cost) / implementation_cost * 100

cat("Potential savings per house:", sprintf("%.2f%%", savings_per_house), "\n")
## Potential savings per house: 30.08%
cat("Houses needed to break even:", houses_to_breakeven, "\n")
## Houses needed to break even: 1
cat("ROI after 1000 houses:", sprintf("%.2f%%", roi_1000), "\n")
## ROI after 1000 houses: 324726.49%



Section 9 - Report

House Price Prediction Model Analysis for King County, USA

1. Introduction

This report summarizes the development and evaluation of machine learning models to predict house prices using the King County, USA dataset, which contains over 21,000 records.

2. Data Analysis and Preparation

  • Dataset: House Sales in King County, USA (21,613 records, 21 variables)

  • Key features: bedrooms, bathrooms, square feet of living space, lot size, floors, waterfront, view, condition, grade, year built, year renovated, zipcode, latitude, longitude

  • Target variable: Price

3. Model Development

Three models were developed and compared:
1. Linear Regression
2. Random Forest
3. XGBoost

4. Model Evaluation and Conclusions

  • Best performing model: XGBoost

  • RMSE: 136757.5

  • R-squared: 0.8729373

  • Most important features:

  1. grade 50.186
  2. long 49.210
  3. yr_built 32.633

5. Economic Analysis

  • Implementation cost: $50,000
  • Potential savings per house: 29.67%
  • Houses needed to break even: 1
  • ROI after 1000 houses: 320351.20%

6. Recommendations

Based on these analysis results, here are my recommendations:
1. Model Comparison:
- Linear Regression: RMSE = 220,823.4, R-squared = 0.6692366
- Random Forest: RMSE = 151,615, R-squared = 0.8523917
- XGBoost: (RMSE and R-squared not provided, but slightly outperforms Random Forest)

Both Random Forest and XGBoost significantly outperform linear regression. XGBoost shows a slight edge over Random Forest in economic metrics.

  1. Model Selection:
    Implement XGBoost as the primary model due to its marginally better performance. Keep Random Forest as a close secondary model for comparison and robustness. Use linear regression as a baseline model.

  2. Key Predictors:
    Based on Random Forest importance (assuming similar for XGBoost):

Top 5: latitude, grade, longitude, year built, and sqft_living15.

Analyze feature importance in XGBoost to confirm similarities or differences.

  1. Economic Viability:
    XGBoost shows the best economic potential:
  • Potential savings per house: 29.67%
  • Houses needed to break even: 1
  • ROI after 1000 houses: 320,351.20%

Random Forest shows similar, slightly lower figures.

  1. Implementation Strategy:
  1. Implement XGBoost as the primary model.
  2. Maintain Random Forest as a secondary model for comparison.
  3. Use linear regression as a quick, interpretable baseline.
  4. Start with a pilot program to verify real-world performance of all models.
  5. Gradually scale up the use of the best-performing model(s) in production.
  1. Ensemble Approach:
    Consider creating an ensemble model that combines predictions from all three models, potentially weighted based on their performance.

  2. Model Interpretability:
    Use XGBoost and Random Forest for highest accuracy, but leverage the linear model for its ease of interpretation when needed to explain predictions to stakeholders.

  3. Feature Engineering:
    Given the superior performance of the more complex models, invest in feature engineering to potentially improve all models, especially XGBoost and Random Forest.

  4. Hyperparameter Tuning:
    Conduct thorough hyperparameter tuning for both XGBoost and Random Forest to potentially improve their performance further.

  5. Continuous Learning and Adaptation:
    Implement a system to regularly retrain all models with new data, allowing them to adapt to changing market conditions.

  6. Model Monitoring:
    Set up a monitoring system to track the performance of all three models over time. This will help identify if/when model retraining is necessary or if the relative performance of the models changes.

  7. User Interface:
    Develop an interface that allows users to see predictions from all three models, along with confidence intervals and model-specific feature importance.

  8. Risk Management:
    While the complex models show excellent potential, keep the linear model as a fallback option in case of unexpected issues with the more complex models.

  9. Ethical and Legal Compliance:
    Regularly audit all models for compliance with fair housing laws and to ensure they don’t perpetuate biases. The linear model can be particularly useful for these audits due to its transparency.

  10. Educational Component:
    Provide training to users on the strengths and weaknesses of each model type, helping them understand when to rely more heavily on one model over the others.

  11. Feedback Loop:
    Establish a system for users to provide feedback on prediction accuracy, using this information to continually refine and improve all models.

By leveraging the strengths of all three models - the accuracy of XGBoost and Random Forest, and the interpretability of linear regression - you can create a robust, high-performing, and transparent house price prediction system.