1 Background

1.1 Brief

Recently, there have been wide swings in the growth rate of housing prices across many countries. Housing prices tend to rise over time; however, predicting them accurately is a difficult task. Many factors can influence how much demand there is for a property, and it is hard to pin down which set of factors explains buyers’ behavior, since every buyer has their own preferences regarding house size, location, and so on. In this document, I am going to predict housing prices in King County using a linear regression model and identify which attributes play the key role in determining the price. The dataset, obtained from Kaggle, consists of 21,613 observations.

1.2 Libraries and Setup

The following packages are required in this notebook. Use install.packages() to install any packages that are not already installed, and load them with the library() function. A brief explanation of each package's role is given below, and a quick check for missing packages is sketched after the list.

  • tidyverse: data manipulation
  • GGally: correlation matrix
  • MLmetrics: model evaluation
  • car: checking multicollinearity
  • lmtest: testing homoscedasticity assumption
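As a convenience, the short check below (a minimal sketch; the pkgs vector simply mirrors the list above) installs any of the listed packages that are not yet available:

# Hypothetical convenience check: install any required package that is missing
pkgs <- c("tidyverse", "GGally", "MLmetrics", "car", "lmtest")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
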
library(tidyverse)
library(GGally)
library(MLmetrics)
library(car)
library(lmtest)

2 Data Preparation

2.1 Data Inspection

house <- read.csv("data_input/house.csv")
head(house)
glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005~
## $ date          <chr> "20141013T000000", "20141209T000000", "20150225T000000",~
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,~
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,~
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.~
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189~
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,~
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1~
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,~
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,~
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7~
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189~
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, ~
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20~
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, ~
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47~
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0~
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23~
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, ~

Here is some information about the features:

  • id: Unique ID for each home sold
  • date: Date of the home sale
  • price: Price of each home sold
  • bedrooms: Number of bedrooms
  • bathrooms: Number of bathrooms, where .5 accounts for a room with a toilet but no shower
  • sqft_living: Square footage of the house interior living space
  • sqft_lot: Square footage of the land space
  • floors: Number of floors
  • waterfront: Whether the house was overlooking the waterfront or not
  • view: An index of how good the view of the property was (0-4)
  • condition: An index on the condition of the apartment (1-5)
  • grade: An index on the quality of building construction and design (1-13)
  • sqft_above: The square footage of the interior housing space that is above ground level
  • sqft_basement: The square footage of the interior housing space that is below ground level
  • yr_built: The year the house was initially built
  • yr_renovated: The year of the house’s last renovation
  • zipcode: Zipcode area the house is in
  • lat: Latitude
  • long: Longitude
  • sqft_living15: The square footage of interior housing living space for the nearest 15 neighbors
  • sqft_lot15: The square footage of the land lots of the nearest 15 neighbors

We can drop id and date because they are not relevant to predicting house prices. Furthermore, since our data only covers housing prices in King County, we can drop zipcode, lat, and long. Inspecting the data further, we can see that waterfront is stored as an integer rather than a categorical feature, so we will correct that as well.

house <- house %>% 
  select(-c(id, date, zipcode, lat, long)) %>% 
  mutate(waterfront = as.factor(waterfront))

2.2 Missing and Duplicated Values

Checking whether the dataset has any missing or duplicated values.

colSums(is.na(house))
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##             0             0             0             0             0 
##        floors    waterfront          view     condition         grade 
##             0             0             0             0             0 
##    sqft_above sqft_basement      yr_built  yr_renovated sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0
sum(duplicated(house))
## [1] 6

The dataset does not have any missing values, but it has 6 duplicated rows. Although some houses in the county might be built exactly alike, keeping just one of each is enough for modeling, because duplicated observations can systematically bias the estimates of a regression model (Sarracino & Mikucka, 2017). So, I will drop these duplicated rows.

house <- house[!duplicated(house), ]

The data is now clean and ready to be used.

3 Exploratory Data Analysis

First, let’s see our data summary.

summary(house)
##      price            bedrooms        bathrooms      sqft_living   
##  Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :  290  
##  1st Qu.: 321725   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1428  
##  Median : 450000   Median : 3.000   Median :2.250   Median : 1910  
##  Mean   : 540107   Mean   : 3.371   Mean   :2.115   Mean   : 2080  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :13540  
##     sqft_lot           floors      waterfront      view          condition    
##  Min.   :    520   Min.   :1.000   0:21444    Min.   :0.0000   Min.   :1.000  
##  1st Qu.:   5040   1st Qu.:1.000   1:  163    1st Qu.:0.0000   1st Qu.:3.000  
##  Median :   7620   Median :1.500              Median :0.0000   Median :3.000  
##  Mean   :  15110   Mean   :1.494              Mean   :0.2342   Mean   :3.409  
##  3rd Qu.:  10692   3rd Qu.:2.000              3rd Qu.:0.0000   3rd Qu.:4.000  
##  Max.   :1651359   Max.   :3.500              Max.   :4.0000   Max.   :5.000  
##      grade          sqft_above   sqft_basement       yr_built   
##  Min.   : 1.000   Min.   : 290   Min.   :   0.0   Min.   :1900  
##  1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951  
##  Median : 7.000   Median :1560   Median :   0.0   Median :1975  
##  Mean   : 7.657   Mean   :1788   Mean   : 291.6   Mean   :1971  
##  3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997  
##  Max.   :13.000   Max.   :9410   Max.   :4820.0   Max.   :2015  
##   yr_renovated     sqft_living15    sqft_lot15    
##  Min.   :   0.00   Min.   : 399   Min.   :   651  
##  1st Qu.:   0.00   1st Qu.:1490   1st Qu.:  5100  
##  Median :   0.00   Median :1840   Median :  7620  
##  Mean   :  84.33   Mean   :1987   Mean   : 12771  
##  3rd Qu.:   0.00   3rd Qu.:2360   3rd Qu.: 10084  
##  Max.   :2015.00   Max.   :6210   Max.   :871200

The range of our target variable, price, is quite large, so I suspect there will be a lot of outliers in the data. Let’s check the distribution of house prices.

hist(house$price,
     col = "skyblue4",
     main = "Distribution of House Prices",
     xlab = "House Price",
     breaks = 30)

boxplot(house$price,
        main = "Boxplot of House Prices",
        col = "skyblue4")

We can see that the distribution of house prices looks roughly normal, with a long right tail that skews it. The boxplot confirms that there are a lot of outliers. I plan to build the model with Ordinary Least Squares linear regression, and although OLS does not require the outcome or the independent variables to be Gaussian (normal), it is very sensitive to outliers. Besides, outliers can be considered anomalies in the data. So, I will trim the data to remove some of the outliers.

house_trim <- house %>% 
  arrange(price) %>% 
  slice_min(price, prop = 0.97)

hist(house_trim$price,
     col = "skyblue4",
     main = "Distribution of House Prices",
     xlab = "House Price",
     breaks = 20)

The distribution seems to be better. Now, let’s see the correlation between features in this dataset.

ggcorr(house_trim, geom = "blank", label = T, label_size = 3, hjust = 1, size = 3, layout.exp = 2) +
  geom_point(size = 8, aes(color = coefficient > 0, alpha = abs(coefficient) >= 0.5)) +
  scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
  guides(color = F, alpha = F)

By looking at the correlation matrix, we can see that all variables are positively correlated with price. Furthermore, sqft_living, grade, sqft_above, and sqft_living15 seem to have strong correlations with price.
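To double-check these visual impressions, we could also print the correlation of every numeric feature with price, sorted from strongest to weakest (a small base-R sketch; num_cols is just a helper name introduced here):

# Correlation of each numeric column with price, strongest first
num_cols <- sapply(house_trim, is.numeric)
sort(cor(house_trim[, num_cols])[, "price"], decreasing = TRUE)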

3.1 Cross Validation

Before proceeding to the modeling part, I will split the data into 2 sets: a train set and a test set. The reasoning is straightforward: the main purpose of a model is to predict data it has not seen, so it is important to evaluate its out-of-sample accuracy rather than only its accuracy on the training set (the data the model learned from). I will split the data into a 70% train set and a 30% test set.

RNGkind(sample.kind = "Rounding")
set.seed(417)
idx <- sample(nrow(house_trim), nrow(house_trim) * 0.7)
train <- house_trim[idx, ]
test <- house_trim[-idx, ]

4 Modeling

4.1 Multiple Linear Regression

Let’s try to construct a model using the 4 features that are most strongly correlated with house prices: sqft_living, grade, sqft_above, and sqft_living15.

summary(lm(price ~ sqft_living + grade + sqft_above + sqft_living15,
           data = house))
## 
## Call:
## lm(formula = price ~ sqft_living + grade + sqft_above + sqft_living15, 
##     data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1073331  -136535   -23410    98540  4832331 
## 
## Coefficients:
##                  Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)   -647608.483   13551.689 -47.788 < 0.0000000000000002 ***
## sqft_living       227.349       4.186  54.313 < 0.0000000000000002 ***
## grade          105681.905    2413.445  43.789 < 0.0000000000000002 ***
## sqft_above        -81.835       4.465 -18.330 < 0.0000000000000002 ***
## sqft_living15      26.188       4.024   6.508      0.0000000000777 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 248500 on 21602 degrees of freedom
## Multiple R-squared:  0.542,  Adjusted R-squared:  0.5419 
## F-statistic:  6391 on 4 and 21602 DF,  p-value: < 0.00000000000000022

We can assess the goodness of fit of the model by looking at the adjusted \(R^2\) value, which tells us how much of the variation in the dependent variable is explained by the independent variables. Using those 4 features, we get an adjusted \(R^2\) of 0.5419, which is quite poor. Let’s compare this to a model with all the features included.
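For reference, the adjusted \(R^2\) penalizes the plain \(R^2\) for the number of predictors \(p\) relative to the number of observations \(n\):

\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]

so adding a variable only raises it when that variable improves the fit by more than chance alone would.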

model_all <- lm(price ~ ., data = house)
summary(model_all)
## 
## Call:
## lm(formula = price ~ ., data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1341541  -109312   -10670    89963  4289643 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept)   6193509.371427  138453.912432  44.733 < 0.0000000000000002 ***
## bedrooms       -39284.309664    2027.057595 -19.380 < 0.0000000000000002 ***
## bathrooms       45669.595161    3492.923134  13.075 < 0.0000000000000002 ***
## sqft_living       167.006152       4.667788  35.778 < 0.0000000000000002 ***
## sqft_lot           -0.001789       0.051287  -0.035              0.97217    
## floors          26939.873429    3785.007716   7.118  0.00000000000113326 ***
## waterfront1    578689.095560   18639.128164  31.047 < 0.0000000000000002 ***
## view            43323.886145    2274.057349  19.051 < 0.0000000000000002 ***
## condition       19546.414241    2497.042029   7.828  0.00000000000000519 ***
## grade          119746.755980    2250.031450  53.220 < 0.0000000000000002 ***
## sqft_above         -6.258051       4.545272  -1.377              0.16858    
## sqft_basement             NA             NA      NA                   NA    
## yr_built        -3569.379079      70.998813 -50.274 < 0.0000000000000002 ***
## yr_renovated       10.356520       3.916051   2.645              0.00818 ** 
## sqft_living15      24.877319       3.600884   6.909  0.00000000000502826 ***
## sqft_lot15         -0.550282       0.078365  -7.022  0.00000000000225194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216100 on 21592 degrees of freedom
## Multiple R-squared:  0.6538, Adjusted R-squared:  0.6536 
## F-statistic:  2912 on 14 and 21592 DF,  p-value: < 0.00000000000000022

When we use all the features, the adjusted \(R^2\) rises by quite a significant amount, from 0.5419 to 0.6536. Let’s check the RMSE of our model on the test set to measure the out-of-sample error; it tells us, on average, how far our predictions are from the actual values.
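Concretely, for predictions \(\hat{y}_i\) and actual prices \(y_i\) over \(n\) test observations,

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2} \]

which is expressed in the same unit as the target (dollars), making it easy to interpret.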

RMSE(y_pred = predict(model_all, newdata = test),
     y_true = test$price)
## [1] 169260.4

The RMSE is quite high, so this is still a poor model. Since there are still statistically insignificant variables, and to improve the model further, I will use backward stepwise regression. This method iteratively examines each independent variable in the regression model: it starts from the model containing all candidate variables and, at each step, removes the variable whose removal decreases the AIC (Akaike Information Criterion, a measure of information loss) the most, stopping when no removal improves the AIC.
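For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the AIC is

\[ AIC = 2k - 2\ln(\hat{L}) \]

so lower values indicate a better trade-off between fit and complexity; step() keeps dropping variables as long as doing so lowers the AIC.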

model_back <- step(model_all, direction = "backward", trace = 0)
summary(model_back)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + condition + grade + yr_built + yr_renovated + 
##     sqft_living15 + sqft_lot15, data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1344213  -109623   -10716    89998  4287299 
## 
## Coefficients:
##                    Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept)   6213912.92283  137641.37079  45.146 < 0.0000000000000002 ***
## bedrooms       -39280.30161    2026.46584 -19.384 < 0.0000000000000002 ***
## bathrooms       46461.91181    3445.24654  13.486 < 0.0000000000000002 ***
## sqft_living       162.86083       3.57459  45.561 < 0.0000000000000002 ***
## floors          24762.49035    3436.29345   7.206   0.0000000000005944 ***
## waterfront1    577778.99746   18624.16685  31.023 < 0.0000000000000002 ***
## view            43891.54426    2235.18375  19.637 < 0.0000000000000002 ***
## condition       19751.88469    2492.46275   7.925   0.0000000000000024 ***
## grade          119445.10087    2239.39923  53.338 < 0.0000000000000002 ***
## yr_built        -3578.47197      70.68458 -50.626 < 0.0000000000000002 ***
## yr_renovated       10.31107       3.91577   2.633              0.00846 ** 
## sqft_living15      23.92156       3.52772   6.781   0.0000000000122405 ***
## sqft_lot15         -0.56035       0.05556 -10.085 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216100 on 21594 degrees of freedom
## Multiple R-squared:  0.6537, Adjusted R-squared:  0.6536 
## F-statistic:  3398 on 12 and 21594 DF,  p-value: < 0.00000000000000022
RMSE(y_pred = predict(model_back, newdata = test),
     y_true = test$price)
## [1] 169384.2

So, backward elimination settles on a model that drops sqft_lot and sqft_above (sqft_basement was already undefined in the full model, because sqft_living is the sum of sqft_above and sqft_basement, a perfect collinearity), yet its fit is essentially the same as the full model’s. We can pretty much conclude that the best multiple linear regression model we can construct from this data is still poor, with an adjusted \(R^2\) of 0.6536 and an RMSE of 169,384.2. Let’s check whether the model fulfills the multiple linear regression assumptions.

4.2 Assumptions

Let’s first check on the assumption of normality.

hist(model_back$residuals,
     col = "skyblue4",
     main = "Histogram of Residuals",
     xlab = "Residuals")

nortest::ad.test(model_back$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model_back$residuals
## A = 470.68, p-value < 0.00000000000000022

I used the Anderson-Darling normality test instead of the Shapiro-Wilk test because the sample is too large (shapiro.test() in R only accepts up to 5,000 observations). From the histogram and the statistical test, we can see that the residuals are not normally distributed. This is likely caused by the outliers in our data, and we must be cautious when using the model, since the errors will not be symmetrically centered around 0. Let’s now check the homoscedasticity assumption.

plot(model_back$fitted.values, model_back$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

bptest(model_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_back
## BP = 2765.5, df = 12, p-value < 0.00000000000000022

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that our model violates the homoscedasticity assumption as well. This indicates that there are patterns in the data that the model was unable to capture. Now for the last assumption: no multicollinearity.
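The variance inflation factor for predictor \(j\) is computed from the \(R_j^2\) obtained by regressing that predictor on all the other predictors:

\[ VIF_j = \frac{1}{1 - R_j^2} \]

with values above 10 commonly taken as a sign of problematic multicollinearity.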

data.frame("VIF" = vif(model_back))

Since all VIF values are below 10, we can conclude that multicollinearity is not an issue in our model. Given that the model violates 2 of the assumptions, it can still be improved upon, or perhaps multiple linear regression is simply not flexible enough to capture all the patterns in the data.

4.3 Model Improvement

Our target variable is heavily right-skewed, so the fit is dominated by a relatively small number of expensive houses. I would like to see what happens if we remove all outliers in the target variable and log-transform every numeric variable in the data. This should bring the variables’ distributions closer to a bell-shaped curve and make the model less sensitive to extreme values. For the log transformation, since there are quite a few zeros in the data, I will add a constant and use \(\log_{10}(x+1)\) instead.

log_tf <- function(x){
  return(log10(x+1))
}
house2 <- house %>% 
  filter(price < quantile(house$price, 0.75) + 1.5 * IQR(house$price)) %>% 
  mutate_if(is.numeric, log_tf)

house2

Splitting the data into train and test set.

set.seed(417)
idx_tf <- sample(nrow(house2), nrow(house2) * 0.7)
train_tf <- house2[idx_tf, ]
test_tf <- house2[-idx_tf, ]

Creating multiple linear regression model using transformed data.

model_tf <- step(object = lm(price ~ ., data = train_tf),
                 direction = "backward",
                 trace = F)

summary(model_tf)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = train_tf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50260 -0.08692  0.00982  0.08808  0.79153 
## 
## Coefficients:
##                Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   35.230172   0.800623  44.003 < 0.0000000000000002 ***
## bedrooms      -0.174142   0.015636 -11.137 < 0.0000000000000002 ***
## bathrooms      0.121389   0.018315   6.628  0.00000000003524278 ***
## sqft_living    0.196043   0.030836   6.358  0.00000000021101201 ***
## floors         0.247268   0.018228  13.565 < 0.0000000000000002 ***
## waterfront1    0.058243   0.021296   2.735              0.00625 ** 
## view           0.079930   0.008550   9.349 < 0.0000000000000002 ***
## condition      0.207228   0.019524  10.614 < 0.0000000000000002 ***
## grade          1.541827   0.032827  46.969 < 0.0000000000000002 ***
## sqft_above     0.096171   0.030241   3.180              0.00148 ** 
## sqft_basement  0.016296   0.002045   7.968  0.00000000000000174 ***
## yr_built      -9.983872   0.243796 -40.952 < 0.0000000000000002 ***
## yr_renovated   0.003302   0.001797   1.837              0.06623 .  
## sqft_living15  0.283654   0.012660  22.406 < 0.0000000000000002 ***
## sqft_lot15    -0.051048   0.003767 -13.550 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1282 on 14307 degrees of freedom
## Multiple R-squared:  0.5624, Adjusted R-squared:  0.562 
## F-statistic:  1313 on 14 and 14307 DF,  p-value: < 0.00000000000000022

We ended up with a lower adjusted \(R^2\) value. Let’s see the RMSE. Since the target was modeled on the \(\log_{10}(y+1)\) scale, the predictions are back-transformed with \(10^{\hat{y}} - 1\) before computing the RMSE on the original price scale.

predict_tf <- predict(model_tf, newdata = test_tf)
RMSE(y_pred = 10^(predict_tf) - 1,
     y_true = 10^(test_tf$price) - 1)
## [1] 136126.8

So although the transformed model has a lower adjusted \(R^2\), it achieves a lower RMSE on the original price scale. Let’s check whether the new model violates the linear model assumptions or not.

hist(model_tf$residuals,
     col = "skyblue4",
     main = "Histogram of Residuals",
     xlab = "Residuals")

nortest::ad.test(model_tf$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model_tf$residuals
## A = 21.792, p-value < 0.00000000000000022

Although the histogram looks only slightly skewed, the test shows that the residuals are still not normally distributed. So, the new model still violates the normality assumption.

plot(model_tf$fitted.values, model_tf$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

bptest(model_tf)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_tf
## BP = 541.44, df = 14, p-value < 0.00000000000000022

The homoscedasticity assumption is also violated, which means that there are patterns that the model was unable to learn. Let’s check the last assumption, multicollinearity.

data.frame("VIF" = vif(model_tf))

The new model turns out to violate the multicollinearity assumption as well. I tested removing sqft_above, which slightly decreases the RMSE, but the model still violates the previous two assumptions. It seems that multiple linear regression is not well suited to this data.

5 Conclusion

For predicting housing prices in King County, USA, multiple linear regression performs poorly; it was unable to capture all the patterns in the data. The best adjusted \(R^2\) obtained with a multiple linear model is 0.6536, with an RMSE of 169,384.2. Removing the outliers and log-transforming the numerical features did not help much either: it lowered the adjusted \(R^2\) to 0.562, although it also lowered the RMSE to 136,126.8 (which is still high). Both models also violate several linear regression assumptions. It is recommended to try other machine learning methods, such as decision trees, random forests, or deep learning, for this data.

6 References

  1. Sarracino, F., & Mikucka, M. (2017). Bias and efficiency loss in regression estimates due to duplicated observations: A Monte Carlo simulation. Survey Research Methods, 11(1), 17–44. doi:10.18148/srm/2017.v11i1.7149