Recently, there have been wide swings in the growth rate of housing prices across many countries. Housing prices tend to rise over time, but predicting them accurately is a difficult task. Many factors can influence the demand for a property, and it is hard to determine which set of factors explains buyers' behavior, since every buyer has their own preferences regarding house size, location, and so on. In this document, I predict housing prices in King County using a linear regression model and investigate which attributes play the key role in determining the price. The dataset was obtained from Kaggle (https://www.kaggle.com/datasets/harlfoxem/housesalesprediction) and consists of 21,613 observations.
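The analysis below assumes a handful of R packages have already been loaded; the original setup chunk is not shown, so the following is a minimal sketch of the likely library() calls:
library(dplyr)      # data wrangling: select(), mutate(), glimpse()
library(ggplot2)    # plotting backend used by GGally
library(GGally)     # ggcorr() correlation matrix
library(MLmetrics)  # RMSE(y_pred, y_true)
library(lmtest)     # bptest() Breusch-Pagan test
library(car)        # vif() variance inflation factors
library(nortest)    # ad.test() Anderson-Darling normality test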
house <- read.csv("kc_house_data.csv")
head(house)
glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T000000",…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
The column names above are largely self-explanatory (price, living area, lot size, number of bedrooms and bathrooms, construction grade, year built, and so on). We can drop id and date because they are not relevant to predicting house prices. Furthermore, since all observations come from King County, we can also drop zipcode, lat, and long. Inspecting the data further, we can see that waterfront is read in as an integer rather than as a categorical feature, so we will correct that as well.
house <- house %>%
select(-c(id, date, zipcode, lat, long)) %>%
mutate(waterfront = as.factor(waterfront))
Let's check whether the data has any missing or duplicated values.
colSums(is.na(house))
## price bedrooms bathrooms sqft_living sqft_lot
## 0 0 0 0 0
## floors waterfront view condition grade
## 0 0 0 0 0
## sqft_above sqft_basement yr_built yr_renovated sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
sum(duplicated(house))
## [1] 6
The data does not have any missing values, but it does contain 6 duplicated rows. Although some houses in the county might indeed be built exactly the same, it is enough to keep just one of each for modeling, because duplicated rows can systematically bias the regression estimates. So, I will drop the duplicated rows.
house <- house[!duplicated(house), ]
The data is now clean and ready to be used.
First, let’s see our data summary.
summary(house)
## price bedrooms bathrooms sqft_living
## Min. : 75000 Min. : 0.000 Min. :0.000 Min. : 290
## 1st Qu.: 321725 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1428
## Median : 450000 Median : 3.000 Median :2.250 Median : 1910
## Mean : 540107 Mean : 3.371 Mean :2.115 Mean : 2080
## 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550
## Max. :7700000 Max. :33.000 Max. :8.000 Max. :13540
## sqft_lot floors waterfront view condition
## Min. : 520 Min. :1.000 0:21444 Min. :0.0000 Min. :1.000
## 1st Qu.: 5040 1st Qu.:1.000 1: 163 1st Qu.:0.0000 1st Qu.:3.000
## Median : 7620 Median :1.500 Median :0.0000 Median :3.000
## Mean : 15110 Mean :1.494 Mean :0.2342 Mean :3.409
## 3rd Qu.: 10692 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :1651359 Max. :3.500 Max. :4.0000 Max. :5.000
## grade sqft_above sqft_basement yr_built
## Min. : 1.000 Min. : 290 Min. : 0.0 Min. :1900
## 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median : 7.000 Median :1560 Median : 0.0 Median :1975
## Mean : 7.657 Mean :1788 Mean : 291.6 Mean :1971
## 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997
## Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015
## yr_renovated sqft_living15 sqft_lot15
## Min. : 0.00 Min. : 399 Min. : 651
## 1st Qu.: 0.00 1st Qu.:1490 1st Qu.: 5100
## Median : 0.00 Median :1840 Median : 7620
## Mean : 84.33 Mean :1987 Mean : 12771
## 3rd Qu.: 0.00 3rd Qu.:2360 3rd Qu.: 10084
## Max. :2015.00 Max. :6210 Max. :871200
The range of the target variable, price, is quite large, so there may be many outliers in the data. Let's check the distribution of house prices.
hist(house$price,
col = "skyblue",
main = "Distribution of House Prices",
xlab = "House Price",
breaks = 30)
boxplot(house$price,
main = "Boxplot of House Prices",
col = "skyblue4")
We can see that the distribution of house prices resembles a normal distribution with a long right tail that skews it. The boxplot also shows that there are many outliers in the data. I plan to construct a model using Ordinary Least Squares (OLS) linear regression; although OLS does not require the outcome or the independent variables to be normally distributed, it is very sensitive to outliers, which can be considered anomalies in the data. So, I will trim the data to remove some of the outliers.
house_trim <- house %>%
arrange(price) %>%
slice_min(price, prop = 0.97)
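For reference, keeping the lowest 97% of prices with slice_min() is roughly equivalent to filtering on the 97th percentile; a sketch of the same trim written as a filter (house_trim_alt is a hypothetical name):
house_trim_alt <- house %>%
  filter(price <= quantile(price, 0.97))   # keep roughly the lowest 97% of prices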
hist(house_trim$price,
col = "skyblue",
main = "Distribution of House Prices",
xlab = "House Price",
breaks = 20)
The distribution looks better now. Next, let's look at the correlations between the features in this dataset.
ggcorr(house_trim, geom = "blank", label = T, label_size = 3, hjust = 1, size = 3, layout.exp = 2) +
geom_point(size = 8, aes(color = coefficient > 0, alpha = abs(coefficient) >= 0.5)) +
scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
guides(color = F, alpha = F)
## Warning in ggcorr(house_trim, geom = "blank", label = T, label_size = 3, : data
## in column(s) 'waterfront' are not numeric and were ignored
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Looking at the correlation matrix, we can see that all variables are positively correlated with price. Furthermore, sqft_living, grade, sqft_above, and sqft_living15 seem to have strong correlations with price.
Before proceeding to the modeling part, I will split the data into two sets: a train set and a test set. The reason is straightforward: the main purpose of a model is to predict new data, i.e. data the model has not seen. Thus, it is important to evaluate the model's out-of-sample accuracy, rather than only its accuracy on the training set (the data the model learned from). I will split the data into a 70% train set and a 30% test set.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)
idx <- sample(nrow(house_trim), nrow(house_trim) * 0.7)
train <- house_trim[idx, ]
test <- house_trim[-idx, ]
Let's try to construct the model using 4 features that are strongly correlated with house prices: sqft_living, grade, sqft_above, and sqft_living15.
summary(lm(price ~ sqft_living + grade + sqft_above + sqft_living15,
data = house))
##
## Call:
## lm(formula = price ~ sqft_living + grade + sqft_above + sqft_living15,
## data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1073331 -136535 -23410 98540 4832331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.476e+05 1.355e+04 -47.788 < 2e-16 ***
## sqft_living 2.273e+02 4.186e+00 54.313 < 2e-16 ***
## grade 1.057e+05 2.413e+03 43.789 < 2e-16 ***
## sqft_above -8.183e+01 4.465e+00 -18.330 < 2e-16 ***
## sqft_living15 2.619e+01 4.024e+00 6.508 7.77e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 248500 on 21602 degrees of freedom
## Multiple R-squared: 0.542, Adjusted R-squared: 0.5419
## F-statistic: 6391 on 4 and 21602 DF, p-value: < 2.2e-16
We can assess the goodness of fit of the model by looking at the adjusted R2 value, which tells us how well the dependent variable is explained by the independent variables. Using those 4 features, we get an adjusted R2 of 0.5419, which is quite poor.
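For reference, the adjusted R2 penalizes the multiple R2 for the number of predictors; a quick sketch reproducing the value reported above (n = 21,607 rows used in the fit, p = 4 predictors):
r2 <- 0.542    # multiple R-squared from the summary above
n  <- 21607    # observations used in the fit
p  <- 4        # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)   # ~0.5419, matching the adjusted R-squared above
Let's compare this to a model with all the features included.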
model_all <- lm(price ~ ., data = house)
summary(model_all)
##
## Call:
## lm(formula = price ~ ., data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1341541 -109312 -10670 89963 4289643
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.194e+06 1.385e+05 44.733 < 2e-16 ***
## bedrooms -3.928e+04 2.027e+03 -19.380 < 2e-16 ***
## bathrooms 4.567e+04 3.493e+03 13.075 < 2e-16 ***
## sqft_living 1.670e+02 4.668e+00 35.778 < 2e-16 ***
## sqft_lot -1.789e-03 5.129e-02 -0.035 0.97217
## floors 2.694e+04 3.785e+03 7.118 1.13e-12 ***
## waterfront1 5.787e+05 1.864e+04 31.047 < 2e-16 ***
## view 4.332e+04 2.274e+03 19.051 < 2e-16 ***
## condition 1.955e+04 2.497e+03 7.828 5.19e-15 ***
## grade 1.197e+05 2.250e+03 53.220 < 2e-16 ***
## sqft_above -6.258e+00 4.545e+00 -1.377 0.16858
## sqft_basement NA NA NA NA
## yr_built -3.569e+03 7.100e+01 -50.274 < 2e-16 ***
## yr_renovated 1.036e+01 3.916e+00 2.645 0.00818 **
## sqft_living15 2.488e+01 3.601e+00 6.909 5.03e-12 ***
## sqft_lot15 -5.503e-01 7.837e-02 -7.022 2.25e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216100 on 21592 degrees of freedom
## Multiple R-squared: 0.6538, Adjusted R-squared: 0.6536
## F-statistic: 2912 on 14 and 21592 DF, p-value: < 2.2e-16
When we use all the features, the adjusted R2 increases by a sizable amount, from 0.5419 to 0.6536. Let's check the RMSE of the model on the test set to measure the out-of-sample error. RMSE tells us, on average, how far the predictions are from the actual values.
RMSE(y_pred = predict(model_all, newdata = test),
y_true = test$price)
## [1] 169260.4
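For reference, RMSE() here (assumed to come from MLmetrics) is simply the square root of the mean squared prediction error; an equivalent manual computation:
pred_all <- predict(model_all, newdata = test)
sqrt(mean((pred_all - test$price)^2))   # same quantity that RMSE() reports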
The RMSE is quite high; it is still a poor model. Since there are still statistically insignificant variables, and to improve the model further, I will use backward stepwise regression. This method iteratively examines each independent variable in the regression model: it starts with a model containing all candidate variables and, at each step, deletes the variable whose removal decreases the AIC (Akaike Information Criterion, a measure of information loss) the most, stopping when no deletion improves the AIC.
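To illustrate what step() does under the hood, we can compare the AIC of the full model with that of a model dropping a single variable; a sketch using sqft_lot (the least significant predictor above) as the candidate for removal:
extractAIC(model_all)                            # (edf, AIC) of the full model, as used by step()
extractAIC(update(model_all, . ~ . - sqft_lot))  # AIC after dropping sqft_lot
# step() repeats this comparison for every variable and removes the one whose
# deletion lowers the AIC the most, stopping when no deletion helps.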
model_back <- step(model_all, direction = "backward", trace = 0)
summary(model_back)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
## waterfront + view + condition + grade + yr_built + yr_renovated +
## sqft_living15 + sqft_lot15, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1344213 -109623 -10716 89998 4287299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.214e+06 1.376e+05 45.146 < 2e-16 ***
## bedrooms -3.928e+04 2.026e+03 -19.384 < 2e-16 ***
## bathrooms 4.646e+04 3.445e+03 13.486 < 2e-16 ***
## sqft_living 1.629e+02 3.575e+00 45.561 < 2e-16 ***
## floors 2.476e+04 3.436e+03 7.206 5.94e-13 ***
## waterfront1 5.778e+05 1.862e+04 31.023 < 2e-16 ***
## view 4.389e+04 2.235e+03 19.637 < 2e-16 ***
## condition 1.975e+04 2.492e+03 7.925 2.40e-15 ***
## grade 1.194e+05 2.239e+03 53.338 < 2e-16 ***
## yr_built -3.578e+03 7.068e+01 -50.626 < 2e-16 ***
## yr_renovated 1.031e+01 3.916e+00 2.633 0.00846 **
## sqft_living15 2.392e+01 3.528e+00 6.781 1.22e-11 ***
## sqft_lot15 -5.604e-01 5.556e-02 -10.085 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216100 on 21594 degrees of freedom
## Multiple R-squared: 0.6537, Adjusted R-squared: 0.6536
## F-statistic: 3398 on 12 and 21594 DF, p-value: < 2.2e-16
RMSE(y_pred = predict(model_back, newdata = test),
y_true = test$price)
## [1] 169384.2
So, the model with the least information loss keeps everything except sqft_lot, sqft_above, and sqft_basement (the last of which was already redundant because of singularities). We can pretty much conclude that the best multiple linear regression model we can construct from this data is still rather poor, with an adjusted R2 of 0.6536 and a test RMSE of 169,384.2. Let's check whether the model fulfills the multiple linear regression assumptions.
Let’s first check on the assumption of normality.
hist(model_back$residuals,
col = "skyblue",
main = "Histogram of Residuals",
xlab = "Residuals")nortest::ad.test(model_back$residuals)##
## Anderson-Darling normality test
##
## data: model_back$residuals
## A = 470.68, p-value < 2.2e-16
I used the Anderson-Darling normality test instead of the Shapiro-Wilk test because the sample is too large (R's shapiro.test() only accepts up to 5,000 observations). From the histogram and the statistical test, we can see that the errors are not normally distributed. This is likely caused by the outliers remaining in the data, and we should be cautious when using the model for inference, since confidence intervals and p-values rely on approximately normal errors.
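A normal Q-Q plot is another quick visual check of the same assumption (a sketch, not part of the original output):
qqnorm(model_back$residuals, main = "Normal Q-Q Plot of Residuals")  # compare residual quantiles to a normal distribution
qqline(model_back$residuals, col = "red")                            # points far above the line on the right indicate a heavy right tail
Let's now check on the homoscedasticity assumption.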
plot(model_back$fitted.values, model_back$residuals,
xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, col = "red")
bptest(model_back)
##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 2765.5, df = 12, p-value < 2.2e-16
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the model violates the homoscedasticity assumption as well: the spread of the residuals changes with the fitted values, which indicates that there are patterns the model was unable to learn.
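A common remedy when homoscedasticity fails, not applied in this analysis, is to keep the coefficient estimates but report heteroskedasticity-consistent standard errors; a sketch assuming the sandwich package is installed:
lmtest::coeftest(model_back, vcov. = sandwich::vcovHC(model_back, type = "HC1"))
# same point estimates as summary(model_back), but with standard errors that are
# robust to non-constant error variance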
data.frame("VIF" = vif(model_back))Since all VIF are below 10, we can conclude that multicollinearity is not present in our model. Looking at how the model violates 2 assumptions, it means that our model can still be improved upon, or maybe multiple linear regression is not good enough to find all the patterns in the data.
Our target variable is right-skewed to begin with, so the fit is heavily influenced by a handful of extreme observations. I would like to see what happens if we remove all outliers in the target variable and log-transform every numeric variable in the data. This should bring the variables' distributions closer to a bell curve and reduce the influence of extreme values. For the log transformation, since there are quite a few zeros in the data, I will add a constant and use log10(x + 1) instead.
log_tf <- function(x){
return(log10(x+1))
}
house2 <- house %>%
filter(price < quantile(house$price, 0.75) + 1.5 * IQR(house$price)) %>%
mutate_if(is.numeric, log_tf)
house2
Splitting the data into train and test sets.
set.seed(417)
idx_tf <- sample(nrow(house2), nrow(house2) * 0.7)
train_tf <- house2[idx_tf, ]
test_tf <- house2[-idx_tf, ]
Creating a multiple linear regression model using the transformed data.
model_tf <- step(object = lm(price ~ ., data = train_tf),
direction = "backward",
trace = F)
summary(model_tf)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
## waterfront + view + condition + grade + sqft_above + sqft_basement +
## yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = train_tf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50260 -0.08692 0.00982 0.08808 0.79153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.230172 0.800623 44.003 < 2e-16 ***
## bedrooms -0.174142 0.015636 -11.137 < 2e-16 ***
## bathrooms 0.121389 0.018315 6.628 3.52e-11 ***
## sqft_living 0.196043 0.030836 6.358 2.11e-10 ***
## floors 0.247268 0.018228 13.565 < 2e-16 ***
## waterfront1 0.058243 0.021296 2.735 0.00625 **
## view 0.079930 0.008550 9.349 < 2e-16 ***
## condition 0.207228 0.019524 10.614 < 2e-16 ***
## grade 1.541827 0.032827 46.969 < 2e-16 ***
## sqft_above 0.096171 0.030241 3.180 0.00148 **
## sqft_basement 0.016296 0.002045 7.968 1.74e-15 ***
## yr_built -9.983872 0.243796 -40.952 < 2e-16 ***
## yr_renovated 0.003302 0.001797 1.837 0.06623 .
## sqft_living15 0.283654 0.012660 22.406 < 2e-16 ***
## sqft_lot15 -0.051048 0.003767 -13.550 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1282 on 14307 degrees of freedom
## Multiple R-squared: 0.5624, Adjusted R-squared: 0.562
## F-statistic: 1313 on 14 and 14307 DF, p-value: < 2.2e-16
We ended up with a lower adjusted R2 value. Let's look at the RMSE.
predict_tf <- predict(model_tf, newdata = test_tf)
RMSE(y_pred = 10^(predict_tf) - 1,
y_true = 10^(test_tf$price) - 1)
## [1] 136126.8
Although we get a lower adjusted R2 value, we also end up with a lower RMSE. (Note that the adjusted R2 values are not directly comparable, since the two models predict price on different scales.) Let's check whether the new model violates the linear model assumptions.
hist(model_tf$residuals,
col = "skyblue",
main = "Histogram of Residuals",
xlab = "Residuals")nortest::ad.test(model_tf$residuals)##
## Anderson-Darling normality test
##
## data: model_tf$residuals
## A = 21.792, p-value < 2.2e-16
Although the histogram looks only slightly skewed, the test shows that the residuals are still not normally distributed, so the new model also violates the normality assumption.
plot(model_tf$fitted.values, model_tf$residuals,
xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, col = "red")
bptest(model_tf)
##
## studentized Breusch-Pagan test
##
## data: model_tf
## BP = 541.44, df = 14, p-value < 2.2e-16
The homoscedasticity assumption is also violated, which means that there are patterns that the model was unable to learn. Let’s check the last assumption, multicollinearity.
data.frame("VIF" = vif(model_tf))The new model turns out to violate multicollinearity assumption as well. I have tested that removing sqft_above will result in slightly decreased RMSE, but the previous two assumptions will still be violated by the model. It seems that for this data, it is not appropriate to use multiple linear regression model.
For predicting housing prices in King County, USA, multiple linear regression performs poorly; it was unable to capture all of the patterns in the data. The best adjusted R2 obtained with a multiple linear model is 0.6536, with a test RMSE of 169,384.2. Removing the outliers and log-transforming the numeric features did not help much either: it decreased the adjusted R2 to 0.562, though it also decreased the RMSE to 136,126.8 (which is still high). All of the models also violate several linear regression assumptions. It is recommended to try other machine learning methods, such as decision trees, random forests, or deep learning, on this data.
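As a starting point for the suggested alternatives, here is a minimal random forest sketch (assuming the randomForest package; the hyperparameters are illustrative only):
library(randomForest)
set.seed(417)
rf <- randomForest(price ~ ., data = train, ntree = 200, importance = TRUE)
RMSE(y_pred = predict(rf, newdata = test),
     y_true = test$price)          # compare with the linear models above
varImpPlot(rf)                     # shows which attributes the forest relies on most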