Recently, there have been wide swings in the growth rate of housing prices across many countries. Housing prices tend to rise over time; however, accurately predicting them can be a difficult task. Various factors may influence how much demand there is for a property, and it is hard to pin down which set of factors best explains buyers’ behavior, since every buyer has their own preferences regarding house size, location, and so on. In this document, I predict housing prices in King County using a linear regression model and examine which attributes play the key roles in determining the price. The dataset, obtained from Kaggle, consists of 21,613 observations.
The following packages are required in this notebook. Use install.packages() to install any packages that are not already installed, and load them with the library() function. A brief explanation of each package's role is given in the comments next to the library() calls.
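A minimal sketch of that installation step (the package vector below is my own; nortest is included because nortest::ad.test() is called later in this notebook):
# Install any required packages that are not yet available on this machine
pkgs <- c("tidyverse", "GGally", "MLmetrics", "car", "lmtest", "nortest")
missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing) > 0) install.packages(missing)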
library(tidyverse)  # data wrangling with dplyr and the pipe operator, plus glimpse()
library(GGally)     # ggcorr() for the correlation matrix plot
library(MLmetrics)  # RMSE() for model evaluation
library(car)        # vif() for the multicollinearity check
library(lmtest)     # bptest() for the Breusch-Pagan heteroscedasticity test
house <- read.csv("data_input/house.csv")
head(house)
glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005~
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T000000",~
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,~
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,~
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.~
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189~
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,~
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1~
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,~
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,~
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7~
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189~
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, ~
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20~
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, ~
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47~
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0~
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23~
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, ~
Here is some information about the features:
We can drop id and date because they are not relevant for predicting house prices. Furthermore, since our data only covers housing prices in King County, we can also drop zipcode, lat, and long. Inspecting the data further, we can see that waterfront is read as an integer instead of a categorical feature, so we will correct that as well.
house <- house %>%
select(-c(id, date, zipcode, lat, long)) %>%
  mutate(waterfront = as.factor(waterfront))
Let's check whether the data has any missing or duplicated values.
colSums(is.na(house))
## price bedrooms bathrooms sqft_living sqft_lot
## 0 0 0 0 0
## floors waterfront view condition grade
## 0 0 0 0 0
## sqft_above sqft_basement yr_built yr_renovated sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
sum(duplicated(house))
## [1] 6
The data does not have any missing values, but it does have 6 duplicated rows. Although some houses in the county might genuinely be built exactly the same, it is enough to keep just one of each for modeling, because duplicated observations can systematically bias the regression estimates. So, I will drop these duplicated rows.
house <- house[!duplicated(house), ]
The data is now clean and ready to be used.
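Equivalently, since dplyr is already loaded, the deduplication could be written with distinct(); this is just an alternative spelling of the same step, not a change to the result:
house <- distinct(house)   # keeps the first occurrence of every duplicated row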
First, let’s see our data summary.
summary(house)
## price bedrooms bathrooms sqft_living
## Min. : 75000 Min. : 0.000 Min. :0.000 Min. : 290
## 1st Qu.: 321725 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1428
## Median : 450000 Median : 3.000 Median :2.250 Median : 1910
## Mean : 540107 Mean : 3.371 Mean :2.115 Mean : 2080
## 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550
## Max. :7700000 Max. :33.000 Max. :8.000 Max. :13540
## sqft_lot floors waterfront view condition
## Min. : 520 Min. :1.000 0:21444 Min. :0.0000 Min. :1.000
## 1st Qu.: 5040 1st Qu.:1.000 1: 163 1st Qu.:0.0000 1st Qu.:3.000
## Median : 7620 Median :1.500 Median :0.0000 Median :3.000
## Mean : 15110 Mean :1.494 Mean :0.2342 Mean :3.409
## 3rd Qu.: 10692 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :1651359 Max. :3.500 Max. :4.0000 Max. :5.000
## grade sqft_above sqft_basement yr_built
## Min. : 1.000 Min. : 290 Min. : 0.0 Min. :1900
## 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median : 7.000 Median :1560 Median : 0.0 Median :1975
## Mean : 7.657 Mean :1788 Mean : 291.6 Mean :1971
## 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997
## Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015
## yr_renovated sqft_living15 sqft_lot15
## Min. : 0.00 Min. : 399 Min. : 651
## 1st Qu.: 0.00 1st Qu.:1490 1st Qu.: 5100
## Median : 0.00 Median :1840 Median : 7620
## Mean : 84.33 Mean :1987 Mean : 12771
## 3rd Qu.: 0.00 3rd Qu.:2360 3rd Qu.: 10084
## Max. :2015.00 Max. :6210 Max. :871200
The range of our target variable, price, is quite wide, which suggests there may be a lot of outliers in the data. Let's check the distribution of house prices.
hist(house$price,
col = "skyblue4",
main = "Distribution of House Prices",
xlab = "House Price",
     breaks = 30)
boxplot(house$price,
main = "Boxplot of House Prices",
col = "skyblue4")We can see that the distribution of house prices looks like a normal distribution, with some data on the right tail that skewed the distribution. Furthermore, seeing the boxplot, we can see that there are lots of outliers in the data. I have planned to construct a model using Ordinary Least Square Linear Regression, and although it does not need the distribution of neither outcome nor independent variables to be of Gaussian (normal), it is very sensitive to outliers. Besides, outliers are considered to be anomaly in the data. So, I will trim the data as to remove some outliers.
house_trim <- house %>%
  arrange(price) %>%
  slice_min(price, prop = 0.97)   # keep the lowest-priced 97% of houses, i.e. trim the top 3%
hist(house_trim$price,
col = "skyblue4",
main = "Distribution of House Prices",
xlab = "House Price",
     breaks = 20)
The distribution seems to be better. Now, let's see the correlation between features in this dataset.
ggcorr(house_trim, geom = "blank", label = T, label_size = 3, hjust = 1, size = 3, layout.exp = 2) +
geom_point(size = 8, aes(color = coefficient > 0, alpha = abs(coefficient) >= 0.5)) +
scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
  guides(color = F, alpha = F)
Looking at the correlation matrix, we can see that all variables are positively correlated with price. Furthermore, sqft_living, grade, sqft_above, and sqft_living15 seem to have strong correlations with price.
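To read the exact numbers behind this plot, the correlations with price can also be pulled out directly; a small sketch using only the numeric columns of house_trim:
# Correlation of each numeric feature with price, sorted from strongest to weakest
price_cors <- cor(select_if(house_trim, is.numeric))[, "price"]
sort(price_cors, decreasing = TRUE)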
Before proceeding to the modeling part, I will split the data into 2 sets: a train set and a test set. The reason is straightforward: the main purpose of a model is to predict data it has not seen. Thus, it is important to evaluate the model's out-of-sample accuracy, rather than only its accuracy on the training set (the data the model learned from). I will split the data into a 70% train set and a 30% test set.
RNGkind(sample.kind = "Rounding")
set.seed(417)
idx <- sample(nrow(house_trim), nrow(house_trim) * 0.7)
train <- house_trim[idx, ]
test <- house_trim[-idx, ]
Let's try to construct a model using the 4 features that are strongly correlated with house prices: sqft_living, grade, sqft_above, and sqft_living15.
summary(lm(price ~ sqft_living + grade + sqft_above + sqft_living15,
           data = house))
##
## Call:
## lm(formula = price ~ sqft_living + grade + sqft_above + sqft_living15,
## data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1073331 -136535 -23410 98540 4832331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -647608.483 13551.689 -47.788 < 0.0000000000000002 ***
## sqft_living 227.349 4.186 54.313 < 0.0000000000000002 ***
## grade 105681.905 2413.445 43.789 < 0.0000000000000002 ***
## sqft_above -81.835 4.465 -18.330 < 0.0000000000000002 ***
## sqft_living15 26.188 4.024 6.508 0.0000000000777 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 248500 on 21602 degrees of freedom
## Multiple R-squared: 0.542, Adjusted R-squared: 0.5419
## F-statistic: 6391 on 4 and 21602 DF, p-value: < 0.00000000000000022
We can assess the goodness of fit of the model by looking at the adjusted \(R^2\) value, which tells us how much of the variation in the dependent variable is explained by the independent variables, penalized for the number of predictors. Using those 4 features, we get an adjusted \(R^2\) of 0.5419, which is quite poor. Let's compare this with a model that includes all the features.
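As a reminder of where that number comes from, adjusted \(R^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\); a quick sketch reproducing it from the values reported above (n = 21,607 rows after removing duplicates, p = 4 predictors):
r2 <- 0.542    # multiple R-squared from the summary above
n  <- 21607    # observations used in the fit (21,613 minus 6 duplicates)
p  <- 4        # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)   # approximately 0.5419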
model_all <- lm(price ~ ., data = house)
summary(model_all)
##
## Call:
## lm(formula = price ~ ., data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1341541 -109312 -10670 89963 4289643
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6193509.371427 138453.912432 44.733 < 0.0000000000000002 ***
## bedrooms -39284.309664 2027.057595 -19.380 < 0.0000000000000002 ***
## bathrooms 45669.595161 3492.923134 13.075 < 0.0000000000000002 ***
## sqft_living 167.006152 4.667788 35.778 < 0.0000000000000002 ***
## sqft_lot -0.001789 0.051287 -0.035 0.97217
## floors 26939.873429 3785.007716 7.118 0.00000000000113326 ***
## waterfront1 578689.095560 18639.128164 31.047 < 0.0000000000000002 ***
## view 43323.886145 2274.057349 19.051 < 0.0000000000000002 ***
## condition 19546.414241 2497.042029 7.828 0.00000000000000519 ***
## grade 119746.755980 2250.031450 53.220 < 0.0000000000000002 ***
## sqft_above -6.258051 4.545272 -1.377 0.16858
## sqft_basement NA NA NA NA
## yr_built -3569.379079 70.998813 -50.274 < 0.0000000000000002 ***
## yr_renovated 10.356520 3.916051 2.645 0.00818 **
## sqft_living15 24.877319 3.600884 6.909 0.00000000000502826 ***
## sqft_lot15 -0.550282 0.078365 -7.022 0.00000000000225194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216100 on 21592 degrees of freedom
## Multiple R-squared: 0.6538, Adjusted R-squared: 0.6536
## F-statistic: 2912 on 14 and 21592 DF, p-value: < 0.00000000000000022
When we use all the features, the adjusted \(R^2\) increases by quite a significant amount, from 0.5419 to 0.6536. Let's check the RMSE of the model on the test set to gauge the out-of-sample error. RMSE measures, on average, how far our predictions are from the actual values.
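Under the hood, RMSE() from MLmetrics is essentially the square root of the mean squared difference between predictions and actual values; a minimal hand-rolled sketch, assuming pred and actual are numeric vectors of the same length:
rmse_manual <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))   # root mean squared error, in the same units as price
}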
RMSE(y_pred = predict(model_all, newdata = test),
     y_true = test$price)
## [1] 169260.4
The RMSE is quite high, so this is still a poor model. Since there are still statistically insignificant variables, and to improve the model further, I will use backward stepwise regression. This method iteratively examines the contribution of every independent variable in the model: it begins with a model containing all candidate variables, then repeatedly deletes the variable whose removal decreases the AIC (Akaike Information Criterion, a measure of information loss) the most, stopping when no further removal improves the AIC.
model_back <- step(model_all, direction = "backward", trace = 0)
summary(model_back)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
## waterfront + view + condition + grade + yr_built + yr_renovated +
## sqft_living15 + sqft_lot15, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1344213 -109623 -10716 89998 4287299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6213912.92283 137641.37079 45.146 < 0.0000000000000002 ***
## bedrooms -39280.30161 2026.46584 -19.384 < 0.0000000000000002 ***
## bathrooms 46461.91181 3445.24654 13.486 < 0.0000000000000002 ***
## sqft_living 162.86083 3.57459 45.561 < 0.0000000000000002 ***
## floors 24762.49035 3436.29345 7.206 0.0000000000005944 ***
## waterfront1 577778.99746 18624.16685 31.023 < 0.0000000000000002 ***
## view 43891.54426 2235.18375 19.637 < 0.0000000000000002 ***
## condition 19751.88469 2492.46275 7.925 0.0000000000000024 ***
## grade 119445.10087 2239.39923 53.338 < 0.0000000000000002 ***
## yr_built -3578.47197 70.68458 -50.626 < 0.0000000000000002 ***
## yr_renovated 10.31107 3.91577 2.633 0.00846 **
## sqft_living15 23.92156 3.52772 6.781 0.0000000000122405 ***
## sqft_lot15 -0.56035 0.05556 -10.085 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216100 on 21594 degrees of freedom
## Multiple R-squared: 0.6537, Adjusted R-squared: 0.6536
## F-statistic: 3398 on 12 and 21594 DF, p-value: < 0.00000000000000022
RMSE(y_pred = predict(model_back, newdata = test),
     y_true = test$price)
## [1] 169384.2
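As a quick sanity check on the stepwise result, the information loss of the two models can be compared directly with AIC(); this assumes model_all and model_back are both still in the workspace, and the lower value indicates the preferred model:
AIC(model_all, model_back)   # returns a data frame with the df and AIC of each model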
So, the model with the least information loss keeps most of the variables, dropping only sqft_lot, sqft_above, and sqft_basement (the last of which was already redundant, as shown by its NA coefficient in model_all). Its performance is essentially the same as the full model: the best multiple linear regression model we can construct from this data is still fairly poor, with an adjusted \(R^2\) of 0.6536 and an RMSE of 169,384.2. Let's now check whether the model satisfies the multiple linear regression assumptions.
Let’s first check on the assumption of normality.
hist(model_back$residuals,
col = "skyblue4",
main = "Histogram of Residuals",
xlab = "Residuals")nortest::ad.test(model_back$residuals)##
## Anderson-Darling normality test
##
## data: model_back$residuals
## A = 470.68, p-value < 0.00000000000000022
I used the Anderson-Darling normality test instead of the Shapiro-Wilk test because shapiro.test() in R only accepts up to 5,000 observations. From the histogram and the statistical test, we can see that the residuals are not normally distributed. This is likely caused by the remaining outliers in the data, and we should be cautious when using the model, as the errors will not be symmetrically distributed around 0. A normal Q-Q plot of the residuals, sketched below, gives another view of the same issue.
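A minimal version of that extra diagnostic, assuming model_back is still in the workspace; points drifting away from the reference line in the upper tail would confirm the right skew seen in the histogram:
qqnorm(model_back$residuals, main = "Normal Q-Q Plot of Residuals")
qqline(model_back$residuals, col = "red")
Let's now check the homoscedasticity assumption.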
plot(model_back$fitted.values, model_back$residuals,
xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, col = "red")bptest(model_back)##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 2765.5, df = 12, p-value < 0.00000000000000022
Since the p-value is less than 0.05, we reject the null hypothesis of constant error variance and conclude that our model violates the homoscedasticity assumption as well. This indicates that there are patterns in the data that the model was unable to capture. Now for the last assumption: the absence of multicollinearity.
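For reference, vif() from the car package computes, for each predictor, \(VIF_j = 1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the other predictors. A sketch of that definition for a single predictor (illustrative only; it should agree with the vif() output below):
# VIF of sqft_living by hand: regress it on the other predictors of model_back
r2_j <- summary(lm(sqft_living ~ . - price, data = model.frame(model_back)))$r.squared
1 / (1 - r2_j)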
data.frame("VIF" = vif(model_back))Since all VIF are below 10, we can conclude that multicollinearity is not present in our model. Looking at how the model violates 2 assumptions, it means that our model can still be improved upon, or maybe multiple linear regression is not good enough to find all the patterns in the data.
Our data is skewed to begin with, which makes the model prone to fitting the extreme observations too closely. I would like to see what happens if we remove all outliers in the target variable and log-transform all numeric variables in the data. This should reshape the variables' distributions into something closer to a bell curve and reduce the influence of extreme values. For the log transformation, since there are quite a few zeros in the data, I will add a constant and use \(\log_{10}(x+1)\) instead.
log_tf <- function(x){
return(log10(x+1))
}
house2 <- house %>%
  filter(price < quantile(house$price, 0.75) + 1.5 * IQR(house$price)) %>%   # drop price outliers above Q3 + 1.5 * IQR
  mutate_if(is.numeric, log_tf)                                              # apply log10(x + 1) to every numeric column
house2
Splitting the data into train and test sets.
set.seed(417)
idx_tf <- sample(nrow(house2), nrow(house2) * 0.7)
train_tf <- house2[idx_tf, ]
test_tf <- house2[-idx_tf, ]
Creating a multiple linear regression model (again with backward stepwise selection) using the transformed data.
model_tf <- step(object = lm(price ~ ., data = train_tf),
direction = "backward",
trace = F)
summary(model_tf)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
## waterfront + view + condition + grade + sqft_above + sqft_basement +
## yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = train_tf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50260 -0.08692 0.00982 0.08808 0.79153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.230172 0.800623 44.003 < 0.0000000000000002 ***
## bedrooms -0.174142 0.015636 -11.137 < 0.0000000000000002 ***
## bathrooms 0.121389 0.018315 6.628 0.00000000003524278 ***
## sqft_living 0.196043 0.030836 6.358 0.00000000021101201 ***
## floors 0.247268 0.018228 13.565 < 0.0000000000000002 ***
## waterfront1 0.058243 0.021296 2.735 0.00625 **
## view 0.079930 0.008550 9.349 < 0.0000000000000002 ***
## condition 0.207228 0.019524 10.614 < 0.0000000000000002 ***
## grade 1.541827 0.032827 46.969 < 0.0000000000000002 ***
## sqft_above 0.096171 0.030241 3.180 0.00148 **
## sqft_basement 0.016296 0.002045 7.968 0.00000000000000174 ***
## yr_built -9.983872 0.243796 -40.952 < 0.0000000000000002 ***
## yr_renovated 0.003302 0.001797 1.837 0.06623 .
## sqft_living15 0.283654 0.012660 22.406 < 0.0000000000000002 ***
## sqft_lot15 -0.051048 0.003767 -13.550 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1282 on 14307 degrees of freedom
## Multiple R-squared: 0.5624, Adjusted R-squared: 0.562
## F-statistic: 1313 on 14 and 14307 DF, p-value: < 0.00000000000000022
We ended up with a lower adjusted \(R^2\) value (0.562). Let's see the RMSE, back-transforming the predictions to the original price scale first.
predict_tf <- predict(model_tf, newdata = test_tf)
RMSE(y_pred = 10^(predict_tf) - 1,
     y_true = 10^(test_tf$price) - 1)
## [1] 136126.8
Although the adjusted \(R^2\) is lower, the RMSE on the original price scale is lower as well. Let's check whether the new model violates the linear model assumptions.
hist(model_tf$residuals,
col = "skyblue4",
main = "Histogram of Residuals",
xlab = "Residuals")nortest::ad.test(model_tf$residuals)##
## Anderson-Darling normality test
##
## data: model_tf$residuals
## A = 21.792, p-value < 0.00000000000000022
Although the histogram looks only slightly skewed, the test shows that, statistically, the residuals are still not normally distributed. So, the new model still violates the normality assumption.
plot(model_tf$fitted.values, model_tf$residuals,
xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, col = "red")bptest(model_tf)##
## studentized Breusch-Pagan test
##
## data: model_tf
## BP = 541.44, df = 14, p-value < 0.00000000000000022
The homoscedasticity assumption is also violated, which means that there are patterns that the model was unable to learn. Let’s check the last assumption, multicollinearity.
data.frame("VIF" = vif(model_tf))The new model turns out to violate multicollinearity assumption as well. I have tested that removing sqft_above will result in slightly decreased RMSE, but the previous two assumptions will still be violated by the model. It seems that for this data, it is not appropriate to use multiple linear regression model.
For predicting housing prices in King County, USA, multiple linear regression performs poorly: it was unable to capture all the patterns in the data. The best adjusted \(R^2\) obtained with a multiple linear model is 0.6536, with an RMSE of 169,384.2. Removing the outliers and log-transforming the numerical features did not help much either: it lowers the adjusted \(R^2\) to 0.562, although it also lowers the RMSE to 136,126.8 (which is still high). Both models also violate several linear regression assumptions. It is recommended to try other machine learning methods, such as decision trees, random forests, or deep learning, for this data.
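As a possible starting point for that follow-up (not part of the analysis above), here is a minimal sketch of fitting a random forest on the same trimmed train/test split, assuming the randomForest package is installed; its test RMSE could then be compared with the linear models above:
library(randomForest)
set.seed(417)
# Fit a random forest on the untransformed training split and evaluate on the test split
model_rf <- randomForest(price ~ ., data = train, ntree = 500)
RMSE(y_pred = predict(model_rf, newdata = test),
     y_true = test$price)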