Introduction

The Project

In this project, our primary objective is to build a model that can accurately predict the price of houses in King County, which includes Seattle. Real estate pricing is influenced by a multitude of factors, ranging from the physical attributes of the house to its location, proximity to amenities, and economic conditions. Building a predictive model allows us to quantify these influences and provides insights that can be used by both buyers and sellers in the real estate market.

The Dataset

The dataset we are working with contains house sale prices for King County between May 2014 and May 2015. It comprises 21 variables capturing various aspects of each house sold, including physical characteristics (e.g., number of bedrooms and bathrooms, living area in square feet), condition and grade, year built and year renovated, and whether the property is on the waterfront. Each row represents a single house sale and includes the sale price. This set of features gives us a solid basis for building a predictive model of house prices.

Business Goal

The business goal of this project is to assist various stakeholders in the real estate industry. For real estate agents, the model can help provide a data-driven estimate of a house’s price, aiding in listing houses at appropriate prices. For potential buyers, it can indicate whether a house is overpriced or underpriced, informing their purchase decisions. For sellers, it can help in setting a competitive and fair asking price. The model can also provide valuable insights to real estate investors and policymakers for planning and decision-making.

Data Preparation

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(corrplot)
## corrplot 0.92 loaded
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.4     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ✔ recipes      1.0.6     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(scales)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.1
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:plotly':
## 
##     slice
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(ranger)
## Warning: package 'ranger' was built under R version 4.3.1
## 
## Attaching package: 'ranger'
## 
## The following object is masked from 'package:randomForest':
## 
##     importance
data <- read_csv("inputs/kc_house_data.csv")
## Rows: 21613 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (20): id, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, wa...
## dttm  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(data)
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date          <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 257500,…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated  <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode       <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
# Convert identifiers and categorical columns to factors; convert the sale date to Date
data$id <- as.factor(data$id)
data$date <- as.Date(data$date)  # read_csv already parsed this column as a datetime (<dttm>)
data$waterfront <- as.factor(data$waterfront)
data$view <- as.factor(data$view)
data$condition <- as.factor(data$condition)
data$grade <- as.factor(data$grade)
data$yr_renovated <- as.factor(ifelse(data$yr_renovated == 0, "Not Renovated", "Renovated"))
data$zipcode <- as.factor(data$zipcode)
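
A quick check one might run to confirm the recoding behaved as intended, for example the counts per yr_renovated category and the levels of the grade factor:

# Sanity check on the recoded columns
data %>% count(yr_renovated)
levels(data$grade)
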
sapply(data, function(x) sum(is.na(x)))
##            id          date         price      bedrooms     bathrooms 
##             0         21613             0             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0
data <- data[data$bedrooms < 11, ]
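
The threshold above removes only two listings with implausibly high bedroom counts (the row count drops from 21,613 to 21,611). A quick way to confirm what remains is to tabulate the bedroom counts after filtering:

# Bedroom counts remaining after the filter (the maximum should now be 10)
table(data$bedrooms)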

Exploratory Data Analysis

summary(data)
##          id             date           price            bedrooms     
##  795000620:    3   Min.   :NA      Min.   :  75000   Min.   : 0.000  
##  1000102  :    2   1st Qu.:NA      1st Qu.: 321725   1st Qu.: 3.000  
##  7200179  :    2   Median :NA      Median : 450000   Median : 3.000  
##  109200390:    2   Mean   :NaN     Mean   : 540178   Mean   : 3.369  
##  123039336:    2   3rd Qu.:NA      3rd Qu.: 645000   3rd Qu.: 4.000  
##  251300110:    2   Max.   :NA      Max.   :7700000   Max.   :10.000  
##  (Other)  :21598   NA's   :21611                                     
##    bathrooms      sqft_living       sqft_lot           floors      waterfront
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000   0:21448   
##  1st Qu.:1.750   1st Qu.: 1426   1st Qu.:   5040   1st Qu.:1.000   1:  163   
##  Median :2.250   Median : 1910   Median :   7620   Median :1.500             
##  Mean   :2.115   Mean   : 2080   Mean   :  15108   Mean   :1.494             
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000             
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500             
##                                                                              
##  view      condition     grade        sqft_above   sqft_basement   
##  0:19487   1:   30   7      :8979   Min.   : 290   Min.   :   0.0  
##  1:  332   2:  172   8      :6068   1st Qu.:1190   1st Qu.:   0.0  
##  2:  963   3:14030   9      :2615   Median :1560   Median :   0.0  
##  3:  510   4: 5679   6      :2038   Mean   :1788   Mean   : 291.5  
##  4:  319   5: 1700   10     :1134   3rd Qu.:2210   3rd Qu.: 560.0  
##                      11     : 399   Max.   :9410   Max.   :4820.0  
##                      (Other): 378                                  
##     yr_built           yr_renovated      zipcode           lat       
##  Min.   :1900   Not Renovated:20698   98103  :  601   Min.   :47.16  
##  1st Qu.:1951   Renovated    :  913   98038  :  590   1st Qu.:47.47  
##  Median :1975                         98115  :  583   Median :47.57  
##  Mean   :1971                         98052  :  574   Mean   :47.56  
##  3rd Qu.:1997                         98117  :  553   3rd Qu.:47.68  
##  Max.   :2015                         98042  :  548   Max.   :47.78  
##                                       (Other):18162                  
##       long        sqft_living15    sqft_lot15    
##  Min.   :-122.5   Min.   : 399   Min.   :   651  
##  1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100  
##  Median :-122.2   Median :1840   Median :  7620  
##  Mean   :-122.2   Mean   :1987   Mean   : 12769  
##  3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10084  
##  Max.   :-121.3   Max.   :6210   Max.   :871200  
## 
p <- plot_ly(data, x = ~price, type = "histogram")
p
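
Prices are heavily right-skewed (the mean of about $540,000 sits well above the median of $450,000), so most of the mass in the histogram above is squeezed into the first few bins. A log-scaled static version, using ggplot2 and the scales package loaded earlier, can make the shape easier to read; a minimal sketch:

# Histogram of sale price on a log10 axis; dollar_format() comes from the scales package
ggplot(data, aes(x = price)) +
  geom_histogram(bins = 50) +
  scale_x_log10(labels = dollar_format()) +
  labs(x = "Sale price (log10 scale)", y = "Number of sales")
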
numeric_data <- data[, sapply(data, is.numeric)]
correlation_matrix <- cor(numeric_data)

corrplot(correlation_matrix, method = "circle")
## Warning in check_font_path(italic, "italic"): 'italic' should be a length-one
## vector, using the first element

ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp =
## 2): data in column(s) 'id', 'date', 'waterfront', 'view', 'condition', 'grade',
## 'yr_renovated', 'zipcode' are not numeric and were ignored
## Warning in check_font_path(italic, "italic"): 'italic' should be a length-one
## vector, using the first element
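
Beyond the two correlation plots, it can be useful to rank the numeric predictors by their correlation with price directly from the correlation_matrix computed above; a short sketch:

# Numeric variables ordered by absolute correlation with price
price_cor <- correlation_matrix[, "price"]
sort(abs(price_cor[names(price_cor) != "price"]), decreasing = TRUE)

For this dataset, sqft_living is typically the strongest single numeric correlate of price.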

Modeling

Train-Test Split

set.seed(123)
data_split <- initial_split(numeric_data, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)
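
The split above is purely random. Because the price distribution is skewed, one might prefer a split stratified on price so that both sets cover the full price range; a sketch (shown for reference only, the unstratified split is what the rest of the report uses):

# Sizes of the two splits (roughly 3:1)
nrow(train_data); nrow(test_data)

# Alternative: stratify on (binned) price so train and test cover the same price range
set.seed(123)
data_split_strat <- initial_split(numeric_data, prop = 3/4, strata = price)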

Model 1: Linear Regression with Preprocessing

rec <- recipe(price ~ ., data = train_data) %>% 
  # skip = TRUE only on the NA-removal step: it should not run at prediction time,
  # while centering, scaling and the near-zero-variance filter must be applied by bake()
  step_naomit(all_predictors(), skip = TRUE) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors())

rec_prep <- prep(rec)

# Apply the same trained preprocessing to both splits
data_train_preprocessed <- bake(rec_prep, new_data = train_data)
data_test_preprocessed <- bake(rec_prep, new_data = test_data)

model1 <- lm(price ~ ., data = data_train_preprocessed)
summary(model1)
## 
## Call:
## lm(formula = price ~ ., data = data_train_preprocessed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1405561  -114901   -13450    85762  4195922 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.248e+07  1.975e+06 -26.573  < 2e-16 ***
## bedrooms      -6.053e+04  2.500e+03 -24.212  < 2e-16 ***
## bathrooms      7.171e+04  4.136e+03  17.338  < 2e-16 ***
## sqft_living    2.194e+02  5.494e+00  39.934  < 2e-16 ***
## sqft_lot       1.373e-01  6.558e-02   2.093   0.0363 *  
## floors         1.813e+04  4.587e+03   3.952 7.77e-05 ***
## sqft_above     3.227e+01  5.483e+00   5.886 4.03e-09 ***
## sqft_basement         NA         NA      NA       NA    
## yr_built      -2.556e+03  8.213e+01 -31.117  < 2e-16 ***
## lat            5.753e+05  1.321e+04  43.562  < 2e-16 ***
## long          -2.457e+05  1.499e+04 -16.387  < 2e-16 ***
## sqft_living15  9.106e+01  4.160e+00  21.888  < 2e-16 ***
## sqft_lot15    -3.873e-01  9.657e-02  -4.011 6.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 224300 on 16196 degrees of freedom
## Multiple R-squared:  0.6252, Adjusted R-squared:  0.625 
## F-statistic:  2456 on 11 and 16196 DF,  p-value: < 2.2e-16
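
The NA coefficient for sqft_basement reflects an exact linear dependency among the size variables (in this dataset sqft_living appears to equal sqft_above plus sqft_basement), so lm() drops the redundant term. The remaining multicollinearity can be inspected with car::vif(), which is already loaded; since vif() refuses models containing aliased coefficients, the sketch below refits without the redundant column first:

# Refit without the aliased term, then inspect variance inflation factors
model1_vif <- lm(price ~ . - sqft_basement, data = data_train_preprocessed)
vif(model1_vif)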

Model 2: Polynomial Regression

model2 <- lm(price ~ poly(sqft_living, 2) + ., data = train_data)
summary(model2)
## 
## Call:
## lm(formula = price ~ poly(sqft_living, 2) + ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5055091  -107172   -14531    76227  2645903 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -4.957e+07  1.903e+06 -26.048  < 2e-16 ***
## poly(sqft_living, 2)1  2.324e+07  6.222e+05  37.353  < 2e-16 ***
## poly(sqft_living, 2)2  8.283e+06  2.293e+05  36.119  < 2e-16 ***
## bedrooms              -3.896e+04  2.478e+03 -15.724  < 2e-16 ***
## bathrooms              7.486e+04  3.980e+03  18.807  < 2e-16 ***
## sqft_living                   NA         NA      NA       NA    
## sqft_lot               1.182e-01  6.309e-02   1.874  0.06091 .  
## floors                 3.141e+04  4.428e+03   7.094 1.36e-12 ***
## sqft_above             1.560e+01  5.295e+00   2.945  0.00323 ** 
## sqft_basement                 NA         NA      NA       NA    
## yr_built              -2.357e+03  7.920e+01 -29.759  < 2e-16 ***
## lat                    5.742e+05  1.271e+04  45.195  < 2e-16 ***
## long                  -2.219e+05  1.444e+04 -15.367  < 2e-16 ***
## sqft_living15          1.164e+02  4.063e+00  28.641  < 2e-16 ***
## sqft_lot15            -4.373e-01  9.292e-02  -4.706 2.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 215800 on 16195 degrees of freedom
## Multiple R-squared:  0.6532, Adjusted R-squared:  0.6529 
## F-statistic:  2542 on 12 and 16195 DF,  p-value: < 2.2e-16
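
Before moving on to tree-based models, a quick residual diagnostic is worth running: house-price regressions often show residual variance that grows with price, and the lmtest package loaded earlier provides the Breusch-Pagan test for exactly this. A minimal sketch on Model 2:

# Breusch-Pagan test; a small p-value points to heteroskedastic residuals
bptest(model2)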

Model 3: Random Forest

model3 <- randomForest(price ~ ., data = train_data)
summary(model3)
##                 Length Class  Mode     
## call                3  -none- call     
## type                1  -none- character
## predicted       16208  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times       16208  -none- numeric  
## importance         12  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y               16208  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
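
summary() is not very informative for a random forest; the variable importance stored in the fitted object usually tells us more about which features drive the predictions. A short sketch using the randomForest package's own helpers:

# Importance as the total decrease in node impurity (for regression, residual sum of squares)
importance(model3)
varImpPlot(model3)

If the forest needs to be refit many times (for example during tuning), the ranger package loaded above provides a much faster implementation of the same model.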

Model 4: XGBoost

dtrain <- xgb.DMatrix(data = as.matrix(subset(train_data, select = -price)), label = train_data$price)
dtest <- xgb.DMatrix(data = as.matrix(subset(test_data, select = -price)), label = test_data$price)

params <- list(
  objective = "reg:squarederror",  # "reg:linear" is deprecated in recent xgboost releases
  eta = 0.3,
  max_depth = 6
)

# early_stopping_rounds is not a booster parameter; it is passed to xgb.train()/xgb.cv()
# together with an evaluation set (see the cross-validation sketch after the model summary)
model4 <- xgb.train(params = params, data = dtrain, nrounds = 100)
summary(model4)
##               Length Class              Mode       
## handle             1 xgb.Booster.handle externalptr
## raw           409098 -none-             raw        
## niter              1 -none-             numeric    
## call               4 -none-             call       
## params             5 -none-             list       
## callbacks          1 -none-             list       
## feature_names     12 -none-             character  
## nfeatures          1 -none-             numeric
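
As noted above, early stopping requires an evaluation set rather than an entry in params. One way to pick nrounds with early stopping is cross-validation via xgb.cv(); a sketch using the same params:

# 5-fold CV; stops once the held-out RMSE has not improved for 10 rounds
cv <- xgb.cv(params = params, data = dtrain, nrounds = 200, nfold = 5,
             early_stopping_rounds = 10, verbose = 0)
cv$best_iteration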

Evaluation

Predict and Evaluate Each Model

# Generate predictions
pred1 <- predict(model1, newdata = data_test_preprocessed)  # model1 was fit on the baked data
pred2 <- predict(model2, newdata = test_data)
pred3 <- predict(model3, newdata = test_data)
pred4 <- predict(model4, newdata = dtest)

# Calculate RMSE
RMSE1 <- sqrt(mean((test_data$price - pred1)^2))
RMSE2 <- sqrt(mean((test_data$price - pred2)^2))
RMSE3 <- sqrt(mean((test_data$price - pred3)^2))
RMSE4 <- sqrt(mean((test_data$price - as.vector(pred4))^2))

# Calculate MAE
MAE1 <- mean(abs(test_data$price - pred1))
MAE2 <- mean(abs(test_data$price - pred2))
MAE3 <- mean(abs(test_data$price - pred3))
MAE4 <- mean(abs(test_data$price - as.vector(pred4)))

# Calculate R-squared
R21 <- cor(test_data$price, pred1)^2
R22 <- cor(test_data$price, pred2)^2
R23 <- cor(test_data$price, pred3)^2
R24 <- cor(test_data$price, as.vector(pred4))^2
print(paste("Model 1 RMSE: ", RMSE1, " MAE: ", MAE1, " R2: ", R21))
## [1] "Model 1 RMSE:  228654.272928176  MAE:  144067.974336121  R2:  0.619623479880147"
print(paste("Model 2 RMSE: ", RMSE2, " MAE: ", MAE2, " R2: ", R22))
## [1] "Model 2 RMSE:  217127.312752543  MAE:  136376.940971214  R2:  0.657797969099388"
print(paste("Model 3 RMSE: ", RMSE3, " MAE: ", MAE3, " R2: ", R23))
## [1] "Model 3 RMSE:  151288.766909951  MAE:  79846.5123170156  R2:  0.836632072972019"
print(paste("Model 4 RMSE: ", RMSE4, " MAE: ", MAE4, " R2: ", R24))
## [1] "Model 4 RMSE:  158098.607340805  MAE:  80642.6557496992  R2:  0.818258781633119"

Conclusion

Based on the results of the four models, we can make the following conclusions:

Linear Regression Model (Model 1): This model has an RMSE of 228,654, an MAE of 144,068, and an R^2 of 0.62. This suggests that the model has relatively large prediction errors and explains only about 62% of the variance in house prices.

Polynomial Regression Model (Model 2): This model performs slightly better than the linear regression model, with an RMSE of 217,127, an MAE of 136,377, and an R^2 of 0.66. This indicates that the model explains about 66% of the variance in the house prices.

Random Forest Model (Model 3): This model performs significantly better than the first two models, with an RMSE of 151,289, an MAE of 79,847, and an R^2 of 0.84. This suggests that the model is more accurate and explains about 84% of the variance in the house prices.

XGBoost Model (Model 4): This model also performs well, with an RMSE of 158,099, an MAE of 80,643, and an R^2 of 0.82. This indicates that the model is quite accurate and explains about 82% of the variance in the house prices.

In conclusion, the Random Forest model (Model 3) and the XGBoost model (Model 4) perform the best among the four models, with the Random Forest model slightly outperforming the XGBoost model. Both explain a substantially larger share of the variance in house prices and have lower errors than the linear and polynomial regression models, so they would be the most suitable choices for predicting house prices in King County. They could likely be improved further through hyperparameter tuning and additional feature engineering; a sketch of one possible tuning workflow follows.
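
As a concrete next step, the random forest could be tuned with the tidymodels packages that are already loaded. The sketch below is illustrative rather than a tested configuration: the grid size, fold count, and the choice of the ranger engine are assumptions, not results from this report.

# Tune mtry and min_n for a regression random forest with 5-fold cross-validation
rf_spec <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  add_formula(price ~ .) %>%
  add_model(rf_spec)

set.seed(123)
folds <- vfold_cv(train_data, v = 5)

rf_tuned <- tune_grid(rf_wf, resamples = folds, grid = 10,
                      metrics = metric_set(rmse, mae, rsq))
show_best(rf_tuned, metric = "rmse")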