In this project, our primary objective is to build a model that can accurately predict the price of houses in King County, which includes Seattle. Real estate pricing is influenced by a multitude of factors, ranging from the physical attributes of the house to its location, proximity to amenities, and economic conditions. Building a predictive model allows us to quantify these influences and provides insights that can be used by both buyers and sellers in the real estate market.
The dataset contains house sale prices for King County between May 2014 and May 2015. It comprises 21 variables that capture various aspects of the houses sold, including physical characteristics (e.g., number of bedrooms, bathrooms, living area in square feet), condition and grade, year built and year renovated, and whether the property is on the waterfront. Each row represents a single house sale and includes the sale price. This rich set of features gives us a solid basis for building a predictive model of house prices.
The business goal of this project is to assist various stakeholders in the real estate industry. For real estate agents, the model can provide a data-driven estimate of a house's price, helping them list houses at appropriate prices. For potential buyers, it can indicate whether a house is overpriced or underpriced, informing their purchase decisions. For sellers, it can help set a competitive and fair asking price. It can also offer valuable insights to real estate investors and policymakers for planning and decision-making.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(corrplot)
## corrplot 0.92 loaded
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.4 ✔ rsample 1.1.1
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.0 ✔ yardstick 1.2.0
## ✔ recipes 1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(scales)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.1
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:plotly':
##
## slice
##
## The following object is masked from 'package:dplyr':
##
## slice
library(ranger)
## Warning: package 'ranger' was built under R version 4.3.1
##
## Attaching package: 'ranger'
##
## The following object is masked from 'package:randomForest':
##
## importance
data <- read_csv("inputs/kc_house_data.csv")
## Rows: 21613 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (20): id, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, wa...
## dttm (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(data)
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 257500,…
## $ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15 <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
# Convert identifier and categorical fields to factors, convert the sale date, and recode
# yr_renovated as a renovated / not-renovated indicator
data$id <- as.factor(data$id)
data$date <- as.Date(data$date)  # date is already parsed as a datetime by read_csv, so convert it directly
data$waterfront <- as.factor(data$waterfront)
data$view <- as.factor(data$view)
data$condition <- as.factor(data$condition)
data$grade <- as.factor(data$grade)
data$yr_renovated <- as.factor(ifelse(data$yr_renovated == 0, "Not Renovated", "Renovated"))
data$zipcode <- as.factor(data$zipcode)
sapply(data, function(x) sum(is.na(x)))
## id date price bedrooms bathrooms
## 0 21613 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
# Drop extreme bedroom-count outliers (keep houses with fewer than 11 bedrooms)
data <- data[data$bedrooms < 11, ]
summary(data)
## id date price bedrooms
## 795000620: 3 Min. :NA Min. : 75000 Min. : 0.000
## 1000102 : 2 1st Qu.:NA 1st Qu.: 321725 1st Qu.: 3.000
## 7200179 : 2 Median :NA Median : 450000 Median : 3.000
## 109200390: 2 Mean :NaN Mean : 540178 Mean : 3.369
## 123039336: 2 3rd Qu.:NA 3rd Qu.: 645000 3rd Qu.: 4.000
## 251300110: 2 Max. :NA Max. :7700000 Max. :10.000
## (Other) :21598 NA's :21611
## bathrooms sqft_living sqft_lot floors waterfront
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000 0:21448
## 1st Qu.:1.750 1st Qu.: 1426 1st Qu.: 5040 1st Qu.:1.000 1: 163
## Median :2.250 Median : 1910 Median : 7620 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15108 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
##
## view condition grade sqft_above sqft_basement
## 0:19487 1: 30 7 :8979 Min. : 290 Min. : 0.0
## 1: 332 2: 172 8 :6068 1st Qu.:1190 1st Qu.: 0.0
## 2: 963 3:14030 9 :2615 Median :1560 Median : 0.0
## 3: 510 4: 5679 6 :2038 Mean :1788 Mean : 291.5
## 4: 319 5: 1700 10 :1134 3rd Qu.:2210 3rd Qu.: 560.0
## 11 : 399 Max. :9410 Max. :4820.0
## (Other): 378
## yr_built yr_renovated zipcode lat
## Min. :1900 Not Renovated:20698 98103 : 601 Min. :47.16
## 1st Qu.:1951 Renovated : 913 98038 : 590 1st Qu.:47.47
## Median :1975 98115 : 583 Median :47.57
## Mean :1971 98052 : 574 Mean :47.56
## 3rd Qu.:1997 98117 : 553 3rd Qu.:47.68
## Max. :2015 98042 : 548 Max. :47.78
## (Other):18162
## long sqft_living15 sqft_lot15
## Min. :-122.5 Min. : 399 Min. : 651
## 1st Qu.:-122.3 1st Qu.:1490 1st Qu.: 5100
## Median :-122.2 Median :1840 Median : 7620
## Mean :-122.2 Mean :1987 Mean : 12769
## 3rd Qu.:-122.1 3rd Qu.:2360 3rd Qu.: 10084
## Max. :-121.3 Max. :6210 Max. :871200
##
p <- plot_ly(data, x = ~price, type = "histogram")
p
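The interactive histogram above shows the distribution of sale prices, which has a long right tail. As an optional static companion, here is a small ggplot2 sketch; the log10 axis, bin count, and dollar labels (via the scales package loaded above) are our own choices, not part of the original analysis.
# Static histogram of sale price on a log10 axis (sketch; bin count is arbitrary)
ggplot(data, aes(x = price)) +
  geom_histogram(bins = 50) +
  scale_x_log10(labels = label_dollar()) +
  labs(x = "Sale price (log scale)", y = "Count")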
# Keep only the numeric columns for the correlation analysis and the models below
numeric_data <- data[sapply(data, is.numeric)]
correlation_matrix <- cor(numeric_data)
corrplot(correlation_matrix, method = "circle")
ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp =
## 2): data in column(s) 'id', 'date', 'waterfront', 'view', 'condition', 'grade',
## 'yr_renovated', 'zipcode' are not numeric and were ignored
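The same correlation matrix can also be inspected numerically. A quick sketch (using the correlation_matrix object computed above) lists how strongly each numeric variable correlates with price:
# Correlations of each numeric variable with price, strongest first
sort(correlation_matrix["price", ], decreasing = TRUE)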
set.seed(123)
data_split <- initial_split(numeric_data, prop = 3/4)
train_data <- training(data_split)
test_data <- testing(data_split)
rec <- recipe(price ~ ., data = train_data) %>%
  step_naomit(all_predictors(), skip = TRUE) %>%
  step_center(all_numeric_predictors(), -all_outcomes(), skip = TRUE) %>%
  step_scale(all_numeric_predictors(), -all_outcomes(), skip = TRUE) %>%
  step_nzv(all_predictors(), -all_outcomes(), skip = TRUE)
rec_prep <- prep(rec)
data_train_preprocessed <- bake(rec_prep, new_data = train_data)
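If the centering and scaling were meant to carry over to held-out data, the same prepared recipe would also be applied to the test set with bake(). Note that steps created with skip = TRUE are bypassed whenever bake() receives new_data, so the sketch below currently returns the test set unchanged; it is included only to show the intended pattern.
# Sketch only: apply the prepared recipe to the held-out set (bypassed while skip = TRUE)
data_test_preprocessed <- bake(rec_prep, new_data = test_data)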
model1 <- lm(price ~ ., data = data_train_preprocessed)
summary(model1)
##
## Call:
## lm(formula = price ~ ., data = data_train_preprocessed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1405561 -114901 -13450 85762 4195922
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.248e+07 1.975e+06 -26.573 < 2e-16 ***
## bedrooms -6.053e+04 2.500e+03 -24.212 < 2e-16 ***
## bathrooms 7.171e+04 4.136e+03 17.338 < 2e-16 ***
## sqft_living 2.194e+02 5.494e+00 39.934 < 2e-16 ***
## sqft_lot 1.373e-01 6.558e-02 2.093 0.0363 *
## floors 1.813e+04 4.587e+03 3.952 7.77e-05 ***
## sqft_above 3.227e+01 5.483e+00 5.886 4.03e-09 ***
## sqft_basement NA NA NA NA
## yr_built -2.556e+03 8.213e+01 -31.117 < 2e-16 ***
## lat 5.753e+05 1.321e+04 43.562 < 2e-16 ***
## long -2.457e+05 1.499e+04 -16.387 < 2e-16 ***
## sqft_living15 9.106e+01 4.160e+00 21.888 < 2e-16 ***
## sqft_lot15 -3.873e-01 9.657e-02 -4.011 6.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 224300 on 16196 degrees of freedom
## Multiple R-squared: 0.6252, Adjusted R-squared: 0.625
## F-statistic: 2456 on 11 and 16196 DF, p-value: < 2.2e-16
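Since lmtest and car are loaded above but not otherwise used, a brief optional diagnostic sketch for the linear model follows: a Breusch-Pagan test for heteroskedastic errors and variance inflation factors for multicollinearity (the aliased sqft_basement term is dropped so VIFs can be computed).
# Optional diagnostics for model1 (sketch)
bptest(model1)                              # Breusch-Pagan test for non-constant error variance
vif(update(model1, . ~ . - sqft_basement))  # refit without the aliased term, then compute VIFs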
model2 <- lm(price ~ poly(sqft_living, 2) + ., data = train_data)
summary(model2)
##
## Call:
## lm(formula = price ~ poly(sqft_living, 2) + ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5055091 -107172 -14531 76227 2645903
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.957e+07 1.903e+06 -26.048 < 2e-16 ***
## poly(sqft_living, 2)1 2.324e+07 6.222e+05 37.353 < 2e-16 ***
## poly(sqft_living, 2)2 8.283e+06 2.293e+05 36.119 < 2e-16 ***
## bedrooms -3.896e+04 2.478e+03 -15.724 < 2e-16 ***
## bathrooms 7.486e+04 3.980e+03 18.807 < 2e-16 ***
## sqft_living NA NA NA NA
## sqft_lot 1.182e-01 6.309e-02 1.874 0.06091 .
## floors 3.141e+04 4.428e+03 7.094 1.36e-12 ***
## sqft_above 1.560e+01 5.295e+00 2.945 0.00323 **
## sqft_basement NA NA NA NA
## yr_built -2.357e+03 7.920e+01 -29.759 < 2e-16 ***
## lat 5.742e+05 1.271e+04 45.195 < 2e-16 ***
## long -2.219e+05 1.444e+04 -15.367 < 2e-16 ***
## sqft_living15 1.164e+02 4.063e+00 28.641 < 2e-16 ***
## sqft_lot15 -4.373e-01 9.292e-02 -4.706 2.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 215800 on 16195 degrees of freedom
## Multiple R-squared: 0.6532, Adjusted R-squared: 0.6529
## F-statistic: 2542 on 12 and 16195 DF, p-value: < 2.2e-16
model3 <- randomForest(price ~ ., data = train_data)
summary(model3)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 16208 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 16208 -none- numeric
## importance 12 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 16208 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
dtrain <- xgb.DMatrix(data = as.matrix(subset(train_data, select = -price)), label = train_data$price)
dtest <- xgb.DMatrix(data = as.matrix(subset(test_data, select = -price)), label = test_data$price)
params <- list(
  objective = "reg:squarederror",  # same squared-error objective as the deprecated "reg:linear"
  eta = 0.3,
  max_depth = 6
)
# early_stopping_rounds is omitted here: xgb.train() only uses it when a watchlist is supplied
model4 <- xgb.train(params = params, data = dtrain, nrounds = 100)
summary(model4)
## Length Class Mode
## handle 1 xgb.Booster.handle externalptr
## raw 409098 -none- raw
## niter 1 -none- numeric
## call 4 -none- call
## params 5 -none- list
## callbacks 1 -none- list
## feature_names 12 -none- character
## nfeatures 1 -none- numeric
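Before comparing the models on the test set, it can be informative to see which features the tree-based models rely on. A short sketch using the importance utilities from the randomForest and xgboost packages loaded above:
# Variable importance for the random forest and the top ten XGBoost features (sketch)
varImpPlot(model3, main = "Random forest variable importance")
head(xgb.importance(model = model4), 10)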
# Generate predictions
pred1 <- predict(model1, newdata = test_data)
pred2 <- predict(model2, newdata = test_data)
pred3 <- predict(model3, newdata = test_data)
pred4 <- predict(model4, newdata = dtest)
# Calculate RMSE
RMSE1 <- sqrt(mean((test_data$price - pred1)^2))
RMSE2 <- sqrt(mean((test_data$price - pred2)^2))
RMSE3 <- sqrt(mean((test_data$price - pred3)^2))
RMSE4 <- sqrt(mean((test_data$price - as.vector(pred4))^2))
# Calculate MAE
MAE1 <- mean(abs(test_data$price - pred1))
MAE2 <- mean(abs(test_data$price - pred2))
MAE3 <- mean(abs(test_data$price - pred3))
MAE4 <- mean(abs(test_data$price - as.vector(pred4)))
# Calculate R-squared
R21 <- cor(test_data$price, pred1)^2
R22 <- cor(test_data$price, pred2)^2
R23 <- cor(test_data$price, pred3)^2
R24 <- cor(test_data$price, as.vector(pred4))^2
print(paste("Model 1 RMSE: ", RMSE1, " MAE: ", MAE1, " R2: ", R21))
## [1] "Model 1 RMSE: 228654.272928176 MAE: 144067.974336121 R2: 0.619623479880147"
print(paste("Model 2 RMSE: ", RMSE2, " MAE: ", MAE2, " R2: ", R22))
## [1] "Model 2 RMSE: 217127.312752543 MAE: 136376.940971214 R2: 0.657797969099388"
print(paste("Model 3 RMSE: ", RMSE3, " MAE: ", MAE3, " R2: ", R23))
## [1] "Model 3 RMSE: 151288.766909951 MAE: 79846.5123170156 R2: 0.836632072972019"
print(paste("Model 4 RMSE: ", RMSE4, " MAE: ", MAE4, " R2: ", R24))
## [1] "Model 4 RMSE: 158098.607340805 MAE: 80642.6557496992 R2: 0.818258781633119"
Based on the results of the four models, we can make the following conclusions:
Linear Regression Model (Model 1): This model has an RMSE of 228,654, an MAE of 144,068, and an R^2 of 0.62. Its prediction errors are relatively large, and it explains only about 62% of the variance in house prices.
Polynomial Regression Model (Model 2): This model performs slightly better than the linear regression, with an RMSE of 217,127, an MAE of 136,377, and an R^2 of 0.66; the quadratic term for sqft_living captures some of the non-linear relationship between living area and price, and the model explains about 66% of the variance in house prices.
Random Forest Model (Model 3): This model performs substantially better than the first two, with an RMSE of 151,289, an MAE of 79,847, and an R^2 of 0.84. It is the most accurate of the four and explains about 84% of the variance in house prices.
XGBoost Model (Model 4): This model also performs well, with an RMSE of 158,099, an MAE of 80,643, and an R^2 of 0.82, explaining about 82% of the variance in house prices.
In conclusion, the Random Forest model (Model 3) and the XGBoost model (Model 4) perform best among the four, with the Random Forest model slightly ahead of XGBoost on every metric. Both explain a large share of the variance in house prices and have much lower errors than the linear and polynomial regression models, making them the most suitable choices for predicting house prices in King County. Both could likely be improved further through hyperparameter tuning and additional feature engineering; one possible tuning setup is sketched below.
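As a sketch of that next step, one possible tuning setup for the random forest uses the tidymodels packages already loaded; the 5-fold resampling and the mtry/min_n ranges below are illustrative assumptions, not values from this analysis.
# Hypothetical tuning sketch: tune mtry and min_n for a ranger random forest with 5-fold CV
set.seed(123)
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")
rf_wf <- workflow() %>%
  add_formula(price ~ .) %>%
  add_model(rf_spec)
folds <- vfold_cv(train_data, v = 5)
rf_grid <- grid_regular(mtry(range = c(3, 8)), min_n(range = c(2, 20)), levels = 3)
rf_tuned <- tune_grid(rf_wf, resamples = folds, grid = rf_grid,
                      metrics = metric_set(rmse, mae, rsq))
show_best(rf_tuned, metric = "rmse")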