Introduction

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

The business case for this analysis is to predict house prices with a simple linear regression model.

Libraries

Libraries used in this analysis.

library(tidyverse)
library(GGally)
library(nortest)
library(MLmetrics)
library(lmtest)
library(car)

Data Preparation

Read the data

house <- read.csv("kc_house_data.csv")

Check data structure

glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544...
## $ date          <chr> "20141013T000000", "20141209T000000", "20150225T00000...
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 2575...
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4,...
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00,...
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, ...
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 74...
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0...
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,...
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4,...
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7...
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, ...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, ...
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960,...
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 9819...
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561,...
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -12...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780,...
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 811...

We don’t need to change any data types, but we do need to deselect id and date because they are not relevant to this case.

house <- house %>% 
  select(-c(id, date))

After that, we need to check whether there are any NA values in the dataset.

colSums(is.na(house))
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##             0             0             0             0             0 
##        floors    waterfront          view     condition         grade 
##             0             0             0             0             0 
##    sqft_above sqft_basement      yr_built  yr_renovated       zipcode 
##             0             0             0             0             0 
##           lat          long sqft_living15    sqft_lot15 
##             0             0             0             0

Now we know that the data contain no NA values, so we can move on to exploring the data.

Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the variables and look for patterns that may indicate correlation between them.

Find the Pearson correlation between features.

ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Based on the result, some predictors, such as sqft_living and grade, have a strong correlation with price, while others, such as bedrooms and floors, have only a weak correlation with price.
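
To make the strongest relationship concrete, a quick scatter plot of price against sqft_living (an optional visual check, using the ggplot2 package loaded with tidyverse) could look like the sketch below.

ggplot(house, aes(x = sqft_living, y = price)) +
  geom_point(alpha = 0.3) +                # raw observations
  geom_smooth(method = "lm", se = FALSE)   # fitted linear trend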

Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to fit the linear regression model. The test dataset will be used as a comparison, to check whether the model overfits and cannot predict new data it hasn’t seen during the training phase. We will take 80% of the data as the training data and the rest as the testing data.

set.seed(123)
row_data <- nrow(house)
index <- sample(row_data, row_data*0.8)

data_train <- house[ index, ]
data_test <- house[ -index, ]

Now we will model the linear regression using price as the target variable. For this case, we can use a stepwise method with both directions, starting from no predictors and moving toward the full set of predictors.

set.seed(123)
model_full <- lm(price ~ ., data_train)
model_none <- lm(price ~ 1, data_train)
model_step <- step(model_none, 
                   scope = list(lower = model_none, upper = model_full),
                   direction = "both",
                   trace = 0
                   )

summary(model_step)
## 
## Call:
## lm(formula = price ~ sqft_living + lat + view + grade + yr_built + 
##     waterfront + bedrooms + bathrooms + zipcode + long + condition + 
##     sqft_above + sqft_living15 + sqft_lot15 + yr_renovated + 
##     sqft_lot, data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1255028   -98160   -10112    77473  4356623 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.921e+06  3.215e+06   1.842   0.0655 .  
## sqft_living    1.417e+02  4.668e+00  30.356  < 2e-16 ***
## lat            5.989e+05  1.189e+04  50.361  < 2e-16 ***
## view           5.020e+04  2.364e+03  21.232  < 2e-16 ***
## grade          9.704e+04  2.387e+03  40.648  < 2e-16 ***
## yr_built      -2.650e+03  7.870e+01 -33.674  < 2e-16 ***
## waterfront     5.599e+05  1.902e+04  29.442  < 2e-16 ***
## bedrooms      -3.416e+04  2.110e+03 -16.195  < 2e-16 ***
## bathrooms      4.586e+04  3.500e+03  13.103  < 2e-16 ***
## zipcode       -5.729e+02  3.655e+01 -15.678  < 2e-16 ***
## long          -2.153e+05  1.454e+04 -14.800  < 2e-16 ***
## condition      2.912e+04  2.617e+03  11.129  < 2e-16 ***
## sqft_above     3.592e+01  4.348e+00   8.260  < 2e-16 ***
## sqft_living15  2.358e+01  3.817e+00   6.177 6.70e-10 ***
## sqft_lot15    -4.079e-01  8.117e-02  -5.025 5.09e-07 ***
## yr_renovated   1.832e+01  4.077e+00   4.493 7.06e-06 ***
## sqft_lot       1.200e-01  5.508e-02   2.178   0.0294 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200100 on 17273 degrees of freedom
## Multiple R-squared:  0.7008, Adjusted R-squared:  0.7005 
## F-statistic:  2528 on 16 and 17273 DF,  p-value: < 2.2e-16

The model summary shows a lot of information, but the key column to look at is Pr(>|t|), which shows the significance level of each variable in the model. All of the selected variables have a significant effect. The adjusted R-squared also shows that the model can explain 70.05% of the variance in price.

Evaluation

The performance of our model can be calculated using MAPE and RMSE.

pred_data <- predict(model_step, data_test)
MAPE(pred_data, data_test$price)
## [1] 0.2635855
RMSE(pred_data, data_test$price)
## [1] 206240.4

Based on the result, MAPE shows that the model deviates by around 26.36% from the actual prices, while RMSE shows that the predictions deviate by around 206,240 (in price units) from the actual data. From this, we know the model is not very good and improvements should be made.
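
For reference, the same numbers can be reproduced directly from the metric definitions; a minimal sketch using the pred_data and data_test objects from above:

# MAPE: mean absolute percentage error, as computed by MLmetrics::MAPE
mean(abs((data_test$price - pred_data) / data_test$price))
# RMSE: root mean squared error, as computed by MLmetrics::RMSE
sqrt(mean((pred_data - data_test$price)^2))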

Assumption

Linear regression is a parametric model, meaning that in order to produce a valid model equation, it has to satisfy several classical assumptions. A linear regression that violates these assumptions may be misleading, or may simply have biased estimators.

Normality Test

The first assumption in linear regression is that the residuals follow a normal distribution. We can check this with the Anderson-Darling normality test, which is suitable here because the dataset contains more than 5,000 observations.

ad.test(model_step$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model_step$residuals
## A = 473.69, p-value < 2.2e-16

From the result, the p-value is < 0.05, so we reject the null hypothesis of normality: the residuals do not follow a normal distribution.
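
As an informal complement to the test (not part of the formal assumption check), a histogram of the residuals makes the departure from normality visible:

hist(model_step$residuals, breaks = 50,
     main = "Distribution of residuals", xlab = "Residual")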

Homoscedasticity

Homoscedasticity means that the variance of the residuals is constant and does not form any particular pattern. If the errors form a pattern, such as a linear or cone shape, we call it heteroscedasticity, and it biases the standard errors of the coefficient estimates (making them too narrow or too wide). For this analysis, I use the Breusch-Pagan test.

bptest(model_step)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_step
## BP = 2174, df = 16, p-value < 2.2e-16

Based on the result, the p-value is < 0.05, so the residuals do form a pattern: the model suffers from heteroscedasticity.
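
A residuals-versus-fitted plot is a common way to see this pattern directly; a minimal sketch using base R graphics:

plot(model_step$fitted.values, model_step$residuals,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, col = "red")   # reference line at zero residual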

Multicollinearity

Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(model_step)
##   sqft_living           lat          view         grade      yr_built 
##      7.948993      1.174260      1.439092      3.372771      2.315004 
##    waterfront      bedrooms     bathrooms       zipcode          long 
##      1.210064      1.634173      3.128336      1.651562      1.811724 
##     condition    sqft_above sqft_living15    sqft_lot15  yr_renovated 
##      1.251153      5.606754      2.973275      2.115085      1.147118 
##      sqft_lot 
##      2.097840

Based on the result, most predictors have VIF values below 5, but sqft_living (7.95) and sqft_above (5.61) exceed 5, which suggests some multicollinearity between those two variables.
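
If we wanted to confirm that the inflation comes from the overlap between sqft_living and sqft_above, one option (a sketch, not part of the original workflow) is to refit the stepwise model without sqft_above and recheck the VIF values:

# drop sqft_above from the selected formula and recompute VIF
vif(update(model_step, . ~ . - sqft_above))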

Model Improvement

Tuning

We have already seen that our model does not pass the normality and homoscedasticity tests. Now we will try to fix that. To make the model more linear and simpler, we can transform some of the variables, both the target and the predictors. First, we can drop predictors that have a correlation coefficient above 0.6 with another predictor. Then we can apply a log transformation to the target.

ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

From the result, we know that some predictor pairs, such as sqft_living and sqft_above, bathrooms and sqft_living, sqft_living and grade, and sqft_above and grade, are highly correlated. We can deselect one variable from each pair, but to decide which one to keep we need to know which predictor has the larger effect, so we scale the predictors and compare the standardized coefficients.

scale_house <- house %>% 
  mutate_if(is.numeric, scale)

model_scale <- lm(price ~ ., scale_house)

summary(model_scale)
## 
## Call:
## lm(formula = price ~ ., data = scale_house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5185 -0.2703 -0.0265  0.2113 11.8031 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.703e-14  3.729e-03   0.000  1.00000    
## bedrooms      -9.061e-02  4.793e-03 -18.906  < 2e-16 ***
## bathrooms      8.631e-02  6.826e-03  12.645  < 2e-16 ***
## sqft_living    3.755e-01  1.097e-02  34.227  < 2e-16 ***
## sqft_lot       1.451e-02  5.407e-03   2.683  0.00729 ** 
## floors         9.839e-03  5.289e-03   1.860  0.06285 .  
## waterfront     1.374e-01  4.091e-03  33.580  < 2e-16 ***
## view           1.104e-01  4.467e-03  24.705  < 2e-16 ***
## condition      4.677e-02  4.168e-03  11.221  < 2e-16 ***
## grade          3.070e-01  6.893e-03  44.542  < 2e-16 ***
## sqft_above     7.021e-02  9.835e-03   7.139 9.71e-13 ***
## sqft_basement         NA         NA      NA       NA    
## yr_built      -2.096e-01  5.813e-03 -36.062  < 2e-16 ***
## yr_renovated   2.168e-02  4.000e-03   5.420 6.03e-08 ***
## zipcode       -8.488e-02  4.807e-03 -17.657  < 2e-16 ***
## lat            2.275e-01  4.052e-03  56.149  < 2e-16 ***
## long          -8.237e-02  5.038e-03 -16.349  < 2e-16 ***
## sqft_living15  4.048e-02  6.437e-03   6.289 3.26e-10 ***
## sqft_lot15    -2.846e-02  5.449e-03  -5.222 1.78e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5482 on 21595 degrees of freedom
## Multiple R-squared:  0.6997, Adjusted R-squared:  0.6995 
## F-statistic:  2960 on 17 and 21595 DF,  p-value: < 2.2e-16

Based on the summary results, we can deselect sqft_above, grade, sqft_living15, sqft_lot15, and bathrooms. After that, we apply a log transformation to the target.

house2 <- house %>% 
  select(-c(sqft_above, grade, sqft_living15, sqft_lot15, bathrooms)) %>% 
  mutate(price = log(price))

Performance

set.seed(123)
data_train2 <- house2[index, ]
data_test2 <- house2[-index, ]
set.seed(123)
model_full2 <- lm(price ~ ., data_train2)
model_none2 <- lm(price ~ 1, data_train2)
model_step2 <- step(model_none2, 
                   scope = list(lower = model_none2, upper = model_full2),
                   direction = "both",
                   trace = 0
                   )

summary(model_step2)
## 
## Call:
## lm(formula = price ~ sqft_living + lat + view + condition + floors + 
##     yr_built + zipcode + waterfront + bedrooms + long + yr_renovated + 
##     sqft_basement + sqft_lot, data = data_train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0534 -0.1762  0.0041  0.1781  1.2369 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.977e+00  4.543e+00   0.875    0.381    
## sqft_living    3.727e-04  3.827e-06  97.385  < 2e-16 ***
## lat            1.599e+00  1.658e-02  96.407  < 2e-16 ***
## view           9.336e-02  3.277e-03  28.493  < 2e-16 ***
## condition      7.065e-02  3.699e-03  19.096  < 2e-16 ***
## floors         1.116e-01  5.417e-03  20.612  < 2e-16 ***
## yr_built      -1.475e-03  1.042e-04 -14.160  < 2e-16 ***
## zipcode       -9.550e-04  5.163e-05 -18.497  < 2e-16 ***
## waterfront     3.071e-01  2.691e-02  11.414  < 2e-16 ***
## bedrooms      -2.346e-02  2.913e-03  -8.054 8.54e-16 ***
## long          -2.332e-01  2.023e-02 -11.528  < 2e-16 ***
## yr_renovated   5.080e-05  5.723e-06   8.877  < 2e-16 ***
## sqft_basement -5.295e-05  6.617e-06  -8.002 1.31e-15 ***
## sqft_lot       2.641e-07  5.675e-08   4.654 3.28e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2834 on 17276 degrees of freedom
## Multiple R-squared:  0.7106, Adjusted R-squared:  0.7104 
## F-statistic:  3263 on 13 and 17276 DF,  p-value: < 2.2e-16

Based on the result, we can see that this model explains 71.04% of the variance in log(price), about 1 percentage point higher than the previous model.

pred_data2 <- predict(model_step2, data_test2)
MAPE(pred_data2, data_test2$price)
## [1] 0.0168914
RMSE(pred_data2, data_test2$price)
## [1] 0.2840877

Based on the result, MAPE shows that the model deviates by around 1.69% and RMSE shows that the predictions deviate by around 0.284 points from the actual data. Note that the target is now log(price), so these values are not directly comparable with the first model’s metrics on the raw price scale, but the higher adjusted R-squared suggests this model fits the data better than the previous one.
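
If we wanted errors on the original price scale for a fairer comparison, a sketch would be to back-transform the predictions with exp() before computing the metrics (simple exponentiation ignores re-transformation bias, so treat the result as approximate):

pred_price2 <- exp(pred_data2)            # back-transform predicted log(price)
actual_price2 <- exp(data_test2$price)    # data_test2$price holds log(price)
MAPE(pred_price2, actual_price2)
RMSE(pred_price2, actual_price2)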

Assumption

Let’s test normality and homoscedasticity again.

ad.test(model_step2$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model_step2$residuals
## A = 14.192, p-value < 2.2e-16
bptest(model_step2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_step2
## BP = 994.37, df = 13, p-value < 2.2e-16

Based on the result, the residuals still do not pass the normality and homoscedasticity tests, which means the estimates might still be biased. So let’s check the residual plot.

qqPlot(model_step2$residuals)

## 12778 21051 
## 10082 14846

Based on the result, some of the residuals still fall outside the reference line.

Conclusion

Even after we tweak the model, it still does not pass the normality and homoscedasticity assumption tests, which means the estimators might be biased. Overall, though, the model performs reasonably well: it can explain 71.04% of the variance in log(price), with a MAPE of 1.69% and an RMSE of 0.284 on the log scale.