This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
The business case for this analysis is to predict house prices with a simple linear regression model.
Libraries used in this analysis:
library(tidyverse)
library(GGally)
library(nortest)
library(MLmetrics)
library(lmtest)
library(car)
Read the data
house <- read.csv("kc_house_data.csv")
Check data structure
glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544...
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T00000...
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 2575...
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4,...
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00,...
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, ...
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 74...
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0...
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,...
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4,...
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7...
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, ...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, ...
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960,...
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 9819...
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561,...
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -12...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780,...
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 811...
We don’t need to change any datatypes, but we should deselect id and date because they are not important for this case.
house <- house %>%
select(-c(id, date))
After that, we need to check if there are any NA values in the dataset.
colSums(is.na(house))
## price bedrooms bathrooms sqft_living sqft_lot
## 0 0 0 0 0
## floors waterfront view condition grade
## 0 0 0 0 0
## sqft_above sqft_basement yr_built yr_renovated zipcode
## 0 0 0 0 0
## lat long sqft_living15 sqft_lot15
## 0 0 0 0
Now we know that the data have no NA values, and we can move on to exploring the data.
Exploratory data analysis is a phase where we explore the data variables and see if there are any patterns that indicate some kind of correlation between variables.
Find the Pearson correlation between features.
ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
Based on the result, some predictors have a strong correlation with price, such as sqft_living and grade, while others, such as bedrooms and floors, have a weak correlation with price.
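As a quick numeric companion to the plot, we can also list each variable's correlation with price, sorted from strongest to weakest. A minimal sketch, not part of the original code; it assumes all remaining columns are numeric (true after dropping id and date):
# Pearson correlation of every column with price, strongest first
sort(cor(house)[, "price"], decreasing = TRUE)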
Before we make the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and cannot predict new data it hasn't seen during the training phase. We will take 80% of the data as the training data and the rest as the testing data.
set.seed(123)
row_data <- nrow(house)
index <- sample(row_data, row_data*0.8)
data_train <- house[ index, ]
data_test <- house[ -index, ]
Now we will try to model the linear regression using price as the target variable. For this case, we can use the stepwise method in both directions, from no predictors up to the full set of predictors.
set.seed(123)
model_full <- lm(price ~ ., data_train)
model_none <- lm(price ~ 1, data_train)
model_step <- step(model_none,
scope = list(lower = model_none, upper = model_full),
direction = "both",
trace = 0
)
summary(model_step)
##
## Call:
## lm(formula = price ~ sqft_living + lat + view + grade + yr_built +
## waterfront + bedrooms + bathrooms + zipcode + long + condition +
## sqft_above + sqft_living15 + sqft_lot15 + yr_renovated +
## sqft_lot, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1255028 -98160 -10112 77473 4356623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.921e+06 3.215e+06 1.842 0.0655 .
## sqft_living 1.417e+02 4.668e+00 30.356 < 2e-16 ***
## lat 5.989e+05 1.189e+04 50.361 < 2e-16 ***
## view 5.020e+04 2.364e+03 21.232 < 2e-16 ***
## grade 9.704e+04 2.387e+03 40.648 < 2e-16 ***
## yr_built -2.650e+03 7.870e+01 -33.674 < 2e-16 ***
## waterfront 5.599e+05 1.902e+04 29.442 < 2e-16 ***
## bedrooms -3.416e+04 2.110e+03 -16.195 < 2e-16 ***
## bathrooms 4.586e+04 3.500e+03 13.103 < 2e-16 ***
## zipcode -5.729e+02 3.655e+01 -15.678 < 2e-16 ***
## long -2.153e+05 1.454e+04 -14.800 < 2e-16 ***
## condition 2.912e+04 2.617e+03 11.129 < 2e-16 ***
## sqft_above 3.592e+01 4.348e+00 8.260 < 2e-16 ***
## sqft_living15 2.358e+01 3.817e+00 6.177 6.70e-10 ***
## sqft_lot15 -4.079e-01 8.117e-02 -5.025 5.09e-07 ***
## yr_renovated 1.832e+01 4.077e+00 4.493 7.06e-06 ***
## sqft_lot 1.200e-01 5.508e-02 2.178 0.0294 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200100 on 17273 degrees of freedom
## Multiple R-squared: 0.7008, Adjusted R-squared: 0.7005
## F-statistic: 2528 on 16 and 17273 DF, p-value: < 2.2e-16
The summary of the model shows a lot of information, but we mainly need to look at Pr(>|t|). This column shows the significance level of each variable in the model; here, all variables have a significant effect. The adjusted R-squared also tells us the model can explain 70.05% of the variance in price.
The performance of our model can be calculated using MAPE and RMSE.
pred_data <- predict(model_step, data_test)
MAPE(pred_data, data_test$price)
## [1] 0.2635855
RMSE(pred_data, data_test$price)
## [1] 206240.4
Based on the result, MAPE shows that the predictions deviate around 26.35% from the actual data, while RMSE shows that they deviate around 206,240 dollars from the actual data. From this, we know that the model is not very good and improvements should be made.
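For reference, both metrics reduce to simple formulas. A minimal hand computation, not in the original, which should reproduce the MLmetrics values above:
# MAPE: mean absolute percentage error of the predictions
mean(abs(data_test$price - pred_data) / data_test$price)
# RMSE: square root of the mean squared prediction error
sqrt(mean((pred_data - data_test$price)^2))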
Linear regression is a parametric model, meaning that in order for the model equation to be trustworthy, the model must satisfy some classical assumptions. A linear regression that violates these assumptions may be misleading, or simply have biased estimators.
The first assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Anderson-Darling normality test, which is suitable here because the Shapiro-Wilk test in R only accepts up to 5,000 observations.
ad.test(model_step$residuals)
##
## Anderson-Darling normality test
##
## data: model_step$residuals
## A = 473.69, p-value < 2.2e-16
From the result, the p-value is below 0.05, which means we reject the null hypothesis of normality: our residuals do not follow a normal distribution.
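To see the shape of this departure from normality, a quick visual check with base R graphics; a sketch, not part of the original analysis:
# Histogram of the residuals; the residual quantiles above suggest a heavy right tail
hist(model_step$residuals, breaks = 50, main = "Residuals of model_step")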
Homoscedasticity means that the residual variance is constant and the errors do not form a pattern. If the errors form a pattern such as a line or a cone, we call it 'heteroscedasticity', and it biases the standard errors of the coefficient estimates (making them too narrow or too wide). For this analysis, I use the Breusch-Pagan test.
bptest(model_step)
##
## studentized Breusch-Pagan test
##
## data: model_step
## BP = 2174, df = 16, p-value < 2.2e-16
Based on the result, the p-value is below 0.05, so we reject the null hypothesis of homoscedasticity: the residual errors form a pattern.
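A residuals-versus-fitted plot makes such a pattern visible. A minimal sketch, not from the original:
# Residuals against fitted values; a funnel shape indicates heteroscedasticity
plot(model_step$fitted.values, model_step$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")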
Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
vif(model_step)
## sqft_living lat view grade yr_built
## 7.948993 1.174260 1.439092 3.372771 2.315004
## waterfront bedrooms bathrooms zipcode long
## 1.210064 1.634173 3.128336 1.651562 1.811724
## condition sqft_above sqft_living15 sqft_lot15 yr_renovated
## 1.251153 5.606754 2.973275 2.115085 1.147118
## sqft_lot
## 2.097840
Based on the result, no predictor exceeds the stricter threshold of 10, although sqft_living (7.95) and sqft_above (5.61) are above 5, hinting at some collinearity between them. We address this below.
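Since sqft_living and sqft_above carry the highest VIFs, we can confirm their pairwise relationship directly; a sketch, not in the original:
# Pairwise Pearson correlation of the two high-VIF predictors
cor(house$sqft_living, house$sqft_above)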
We have already seen that our model does not pass the normality and homoscedasticity tests. Now we will try to fix that. To make the model more linear and simple, we can transform some of the variables, both the target and the predictors. First, we can drop one variable from each pair of predictors with a correlation coefficient above 0.6. Then we can log-transform the target.
ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
From the result, we know that some predictor pairs, like sqft_living and sqft_above, bathrooms and sqft_living, sqft_living and grade, and sqft_above and grade, have high correlation. We can deselect one variable from each pair, but to do that we have to know which predictor has the larger effect. So we have to scale the variables first.
scale_house <- house %>%
mutate_if(is.numeric, scale)
model_scale <- lm(price ~ ., data = scale_house)
summary(model_scale)
##
## Call:
## lm(formula = price ~ ., data = scale_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5185 -0.2703 -0.0265 0.2113 11.8031
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.703e-14 3.729e-03 0.000 1.00000
## bedrooms -9.061e-02 4.793e-03 -18.906 < 2e-16 ***
## bathrooms 8.631e-02 6.826e-03 12.645 < 2e-16 ***
## sqft_living 3.755e-01 1.097e-02 34.227 < 2e-16 ***
## sqft_lot 1.451e-02 5.407e-03 2.683 0.00729 **
## floors 9.839e-03 5.289e-03 1.860 0.06285 .
## waterfront 1.374e-01 4.091e-03 33.580 < 2e-16 ***
## view 1.104e-01 4.467e-03 24.705 < 2e-16 ***
## condition 4.677e-02 4.168e-03 11.221 < 2e-16 ***
## grade 3.070e-01 6.893e-03 44.542 < 2e-16 ***
## sqft_above 7.021e-02 9.835e-03 7.139 9.71e-13 ***
## sqft_basement NA NA NA NA
## yr_built -2.096e-01 5.813e-03 -36.062 < 2e-16 ***
## yr_renovated 2.168e-02 4.000e-03 5.420 6.03e-08 ***
## zipcode -8.488e-02 4.807e-03 -17.657 < 2e-16 ***
## lat 2.275e-01 4.052e-03 56.149 < 2e-16 ***
## long -8.237e-02 5.038e-03 -16.349 < 2e-16 ***
## sqft_living15 4.048e-02 6.437e-03 6.289 3.26e-10 ***
## sqft_lot15 -2.846e-02 5.449e-03 -5.222 1.78e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5482 on 21595 degrees of freedom
## Multiple R-squared: 0.6997, Adjusted R-squared: 0.6995
## F-statistic: 2960 on 17 and 21595 DF, p-value: < 2.2e-16
Based on the summary results, sqft_living has the largest standardized coefficient among the highly correlated predictors, so we can deselect sqft_above, grade, sqft_living15, sqft_lot15, and bathrooms. After that, we apply the log transform to the target.
house2 <- house %>%
select(-c(sqft_above, grade, sqft_living15, sqft_lot15, bathrooms)) %>%
mutate(price = log(price))
set.seed(123)
data_train2 <- house2[index, ]
data_test2 <- house2[-index, ]
set.seed(123)
model_full2 <- lm(price ~ ., data_train2)
model_none2 <- lm(price ~ 1, data_train2)
model_step2 <- step(model_none2,
scope = list(lower = model_none2, upper = model_full2),
direction = "both",
trace = 0
)
summary(model_step2)
##
## Call:
## lm(formula = price ~ sqft_living + lat + view + condition + floors +
## yr_built + zipcode + waterfront + bedrooms + long + yr_renovated +
## sqft_basement + sqft_lot, data = data_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0534 -0.1762 0.0041 0.1781 1.2369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.977e+00 4.543e+00 0.875 0.381
## sqft_living 3.727e-04 3.827e-06 97.385 < 2e-16 ***
## lat 1.599e+00 1.658e-02 96.407 < 2e-16 ***
## view 9.336e-02 3.277e-03 28.493 < 2e-16 ***
## condition 7.065e-02 3.699e-03 19.096 < 2e-16 ***
## floors 1.116e-01 5.417e-03 20.612 < 2e-16 ***
## yr_built -1.475e-03 1.042e-04 -14.160 < 2e-16 ***
## zipcode -9.550e-04 5.163e-05 -18.497 < 2e-16 ***
## waterfront 3.071e-01 2.691e-02 11.414 < 2e-16 ***
## bedrooms -2.346e-02 2.913e-03 -8.054 8.54e-16 ***
## long -2.332e-01 2.023e-02 -11.528 < 2e-16 ***
## yr_renovated 5.080e-05 5.723e-06 8.877 < 2e-16 ***
## sqft_basement -5.295e-05 6.617e-06 -8.002 1.31e-15 ***
## sqft_lot 2.641e-07 5.675e-08 4.654 3.28e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2834 on 17276 degrees of freedom
## Multiple R-squared: 0.7106, Adjusted R-squared: 0.7104
## F-statistic: 3263 on 13 and 17276 DF, p-value: < 2.2e-16
Based on the result, we can see that this model can explain 71.04% of the variance, about one percentage point higher than the previous model (though note that R-squared values for a log-transformed target are not strictly comparable to those for the raw target).
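Since the target is now log(price), each coefficient can also be read as an approximate percentage effect: a one-unit increase in a predictor multiplies price by exp(coefficient). An illustrative computation using the fitted model, not in the original:
# e.g. one additional floor: exp(0.1116) - 1 ~ 0.118, about +11.8% in price
exp(coef(model_step2)["floors"]) - 1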
pred_data2 <- predict(model_step2, data_test2)
MAPE(pred_data2, data_test2$price)
## [1] 0.0168914
RMSE(pred_data2, data_test2$price)
## [1] 0.2840877
Based on the result, MAPE shows that the predictions deviate around 1.68% from the actual data, while RMSE shows that they deviate around 0.284 points. Keep in mind, though, that both metrics are now computed on the log scale of price, so they are not directly comparable to the previous model's values; to compare fairly, we should back-transform the predictions to dollars first.
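A fair comparison with the first model can be made on the original dollar scale by back-transforming with exp(). A minimal sketch; this was not run in the original, so no output is shown:
# Back-transform log-scale predictions to dollars before scoring
MAPE(exp(pred_data2), exp(data_test2$price))
RMSE(exp(pred_data2), exp(data_test2$price))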
Let's test normality and homoscedasticity again.
ad.test(model_step2$residuals)
##
## Anderson-Darling normality test
##
## data: model_step2$residuals
## A = 14.192, p-value < 2.2e-16
bptest(model_step2)
##
## studentized Breusch-Pagan test
##
## data: model_step2
## BP = 994.37, df = 13, p-value < 2.2e-16
Based on the result, the data still don't pass the normality and homoscedasticity tests, which means the coefficient standard errors may still be biased. So let's check the residual plot.
qqPlot(model_step2$residuals)
## 12778 21051
## 10082 14846
Based on the result, some of the residuals still fall away from the reference line, especially in the tails.
Even after we tweaked the model, it still does not pass the normality and homoscedasticity assumption tests, which means the estimates may still be biased. But overall the model is decent: it can explain 71.04% of the variance, with a MAPE of 1.68% and an RMSE of 0.2841 on the log scale.