Predicting House Prices in King County, USA
Before Starting
We all know that houses across any state or city vary in many ways, starting from the number of rooms, the square footage (sqft) of the house, the living area, and many other factors that affect house prices. We'll start with the house prices in King County.
What Is King County
King County is located in the U.S. state of Washington. The population was 2,252,782 in the 2019 census estimate, making it the most populous county in Washington, and the 12th-most populous in the United States. The county seat is Seattle, also the state’s most populous city.
King County is one of three Washington counties that are included in the Seattle–Tacoma–Bellevue metropolitan statistical area. (The others are Snohomish County to the north and Pierce County to the south.) About two-thirds of King County’s population lives in Seattle’s suburbs.
Source: https://en.wikipedia.org/wiki/King_County,_Washington
Let's Start
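Before importing the data, load the packages this analysis relies on. (This list is inferred from the functions called later: %>%, select, and mutate from dplyr, ggcorr from GGally, sample.split from caTools, MAE from MLmetrics, bptest from lmtest, and vif from car.)
library(dplyr)     # data manipulation: %>%, select, mutate
library(GGally)    # ggcorr correlation plot
library(caTools)   # sample.split for train/test splitting
library(MLmetrics) # MAE
library(lmtest)    # bptest (studentized Breusch-Pagan test)
library(car)       # vif (variance inflation factors)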
Import the dataset
# Import Dataset
dataset = read.csv("kc_house_data.csv")
Exploratory Data Analysis
We need to check the dataset properly to know the type of every column.
str(dataset)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
colSums(is.na(dataset))
## id date price bedrooms bathrooms
## 0 0 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
We can see the data doesn't have a single missing value.
Explanation of our Goals
With this dataset, we want to predict the price. We should probably delete some columns that we don't need for predicting a house price. Our target is price; before cleaning the data, let's look at the correlations of all the columns in our dataset.
ggcorr(dataset, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(dataset, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'date' are not numeric and were ignored
From the plot, we can identify the columns that have a low correlation with price, that is, a score approaching zero, which makes them poor predictors. So now we know what we should clean.
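As a quick numeric cross-check of the plot, we can rank the columns by their correlation with price directly (a small sketch, not part of the original output):
# Rank the numeric columns by absolute correlation with price
num_cols = sapply(dataset, is.numeric)
cors = cor(dataset[, num_cols])[, "price"]
sort(abs(cors), decreasing = TRUE)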
Cleaning Data
dataset = dataset %>%
select(-c(zipcode, lat, long, id, date, yr_renovated, yr_built))
Splitting the Dataset into Training_set and Test_set
set.seed(123)
split = sample.split(dataset$price, SplitRatio = 0.8) # sample.split expects the outcome vector, not the whole data frame
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Why am I splitting the dataset? Because I want to test my model on new data that the model has never seen before. And why is training_set larger than test_set? Because we want the model to learn from as much data as possible. The split between training_set and test_set is 80 : 20.
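As a quick sanity check (a minimal sketch, not shown in the original output), we can confirm the proportions:
# The split should be roughly 80/20
nrow(training_set) / nrow(dataset)
nrow(test_set) / nrow(dataset)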
Make a Model
Let's make a regression model. We will build a multiple linear regression. Why multiple linear regression? That's easy: we have many predictors, and the condition for using multiple linear regression is having more than one predictor.
regression = lm(price ~ . , training_set)
summary(regression)
##
## Call:
## lm(formula = price ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1185261 -123912 -16199 94688 4586625
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.064e+05 1.993e+04 -35.452 < 2e-16 ***
## bedrooms -3.533e+04 2.460e+03 -14.364 < 2e-16 ***
## bathrooms -1.610e+04 3.966e+03 -4.060 4.93e-05 ***
## sqft_living 2.208e+02 5.562e+00 39.701 < 2e-16 ***
## sqft_lot 3.577e-02 6.054e-02 0.591 0.555
## floors -2.294e+03 4.553e+03 -0.504 0.614
## waterfront 6.132e+05 2.201e+04 27.859 < 2e-16 ***
## view 5.818e+04 2.728e+03 21.329 < 2e-16 ***
## condition 5.526e+04 2.905e+03 19.020 < 2e-16 ***
## grade 1.028e+05 2.706e+03 37.972 < 2e-16 ***
## sqft_above -2.846e+01 5.503e+00 -5.172 2.34e-07 ***
## sqft_basement NA NA NA NA
## sqft_living15 6.036e+00 4.364e+00 1.383 0.167
## sqft_lot15 -7.821e-01 9.321e-02 -8.391 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233700 on 16968 degrees of freedom
## Multiple R-squared: 0.61, Adjusted R-squared: 0.6097
## F-statistic: 2212 on 12 and 16968 DF, p-value: < 2.2e-16
Many predictors are significant for the target variable, but the regression model still has insignificant variables and, worst of all, an NA coefficient. The NA comes from a singularity: sqft_basement is a perfect linear combination of other predictors, so lm() cannot estimate it. In a case like this, stepwise selection can help us build a better regression model.
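We can check the suspected dependence directly; in this dataset, sqft_living appears to be the sum of sqft_above and sqft_basement (a quick check, assuming that relationship holds):
# TRUE if sqft_living == sqft_above + sqft_basement for every row
all(dataset$sqft_living == dataset$sqft_above + dataset$sqft_basement)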
Step-Wise (Backward Elimination)
backward <- step(object = regression, direction = "backward", trace = F)
summary(backward)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + sqft_living15 + sqft_lot15,
## data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1182402 -123879 -16273 94593 4587266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.079e+05 1.967e+04 -35.980 < 2e-16 ***
## bedrooms -3.530e+04 2.457e+03 -14.370 < 2e-16 ***
## bathrooms -1.681e+04 3.724e+03 -4.515 6.37e-06 ***
## sqft_living 2.218e+02 5.256e+00 42.205 < 2e-16 ***
## waterfront 6.127e+05 2.200e+04 27.852 < 2e-16 ***
## view 5.822e+04 2.726e+03 21.357 < 2e-16 ***
## condition 5.545e+04 2.877e+03 19.274 < 2e-16 ***
## grade 1.025e+05 2.670e+03 38.393 < 2e-16 ***
## sqft_above -2.957e+01 4.940e+00 -5.987 2.18e-09 ***
## sqft_living15 6.239e+00 4.311e+00 1.447 0.148
## sqft_lot15 -7.400e-01 6.625e-02 -11.171 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233700 on 16970 degrees of freedom
## Multiple R-squared: 0.61, Adjusted R-squared: 0.6098
## F-statistic: 2654 on 10 and 16970 DF, p-value: < 2.2e-16
And we have a good model. This is better than before, but we can see that sqft_living15 still has a high p-value. What should we do with that column? The answer is nothing. Why? Because step() kept it: removing it did not improve the model's selection criterion, and it also makes sense analytically that the living area of the surrounding houses influences a sale price.
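Since step() selects by AIC, we can compare the two models on that criterion directly (a sketch; the AIC values were not printed in the original output):
# step() minimises AIC, so the backward model should score no worse
AIC(regression)
AIC(backward)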
Predict Model And Error
After building more than one model, we need to compare them to find the best one to use. We're going to do a few things to find the best model, so let's go.
test_set$pred_regression = predict(object = regression, newdata = test_set)
## Warning in predict.lm(object = regression, newdata = test_set): prediction from
## a rank-deficient fit may be misleading
(This warning is caused by the NA coefficient we saw earlier: the full model is rank-deficient because of the collinear sqft_basement column.)
test_set$pred_backward = predict(object = backward, newdata = test_set)
test_set = test_set %>%
mutate(error_regression = round(price - pred_regression, 0),
error_backward = round(price - pred_backward, 0))
test_set[1:10,] %>%
select(price, pred_regression, error_regression, pred_backward, error_backward)
## price pred_regression error_regression pred_backward error_backward
## 1 510000 469330.1 40670 467824.3 42176
## 2 291850 252785.8 39064 251939.1 39911
## 3 662500 872745.2 -210245 872633.3 -210133
## 4 189000 378415.1 -189415 377750.4 -188750
## 5 2000000 1081997.1 918003 1080195.4 919805
## 6 329000 669252.5 -340252 670075.4 -341075
## 7 687500 537559.8 149940 538597.5 148903
## 8 696000 603755.9 92244 603803.8 92196
## 9 240000 209678.8 30321 209280.3 30720
## 10 210490 142710.3 67780 142375.5 68115
sqrt(mean(test_set$error_regression^2)) # RMSE of Regression
## [1] 216972.5
MAE(y_pred = test_set$pred_regression, y_true = test_set$price)
## [1] 150982.1
sqrt(mean(test_set$error_backward^2)) # RMSE of Backward
## [1] 216975.4
MAE(y_pred = test_set$pred_backward, y_true = test_set$price)
## [1] 150946.3
We can see there is very little difference in RMSE between the two models (the backward model's is in fact marginally higher, 216975.4 vs 216972.5), but its MAE is slightly lower, it uses fewer predictors, and it avoids the rank-deficiency warning. So we will use the regression with backward elimination.
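To make the comparison easier to read, we can collect the metrics in one small table (a sketch using the values computed above):
# Gather both models' error metrics side by side
data.frame(
  model = c("regression", "backward"),
  RMSE  = c(sqrt(mean(test_set$error_regression^2)),
            sqrt(mean(test_set$error_backward^2))),
  MAE   = c(MAE(test_set$pred_regression, test_set$price),
            MAE(test_set$pred_backward, test_set$price))
)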
Model Evaluation
And we're on the final step: model evaluation. This section has a lot of things to do, but don't worry, we will go through every evaluation step.
Histogram
hist(backward$residuals, breaks = 5)
shapiro.test(backward$residuals[0:5000]) # shapiro.test accepts at most 5000 values
##
## Shapiro-Wilk normality test
##
## data: backward$residuals[0:5000]
## W = 0.81016, p-value < 2.2e-16
For the backward model, the p-value < 0.05, so we reject H0. This means the residuals are not normally distributed: our model's errors do not cluster tightly around the mean.
Heteroscedasticity
bptest(backward)
##
## studentized Breusch-Pagan test
##
## data: backward
## BP = 2123.2, df = 10, p-value < 2.2e-16
For the backward model, the p-value < 0.05, so we reject H0. This means the residuals have a pattern (heteroscedasticity): not all of the existing patterns in the data are captured by the model.
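We can also inspect this visually (a minimal sketch; a funnel shape in this plot is the classic sign of heteroscedasticity):
# Residuals vs fitted values: a widening spread indicates heteroscedasticity
plot(backward$fitted.values, backward$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")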
Multicollinearity
vif(backward)
## bedrooms bathrooms sqft_living waterfront view
## 1.629428 2.573973 7.342700 1.212574 1.390872
## condition grade sqft_above sqft_living15 sqft_lot15
## 1.081988 3.072872 5.251822 2.705255 1.065496
No value is equal to or greater than 10, so there is no serious multicollinearity among the independent predictor variables.
Based on the analysis results, the backward model has poor criteria as a linear regression model. Still, when comparing the two models' error metrics, the backward model gives a slightly lower MAE with essentially the same RMSE. Therefore, the backward model was chosen as the better model.
Conclusions and Suggestions
After building and evaluating the models, we know the backward model has an R-squared of 0.61 and an RMSE of about 216975. However, the analysis tests showed the model has unfavourable criteria (non-normal residuals and heteroscedasticity).
But I know where I went wrong: the data. Why can I say that? Because, if you look back, the correlations between the target and many predictors have low scores, and you'll remember that when we created the model named regression, it had an NA coefficient, which means some predictors are perfectly collinear and carry no information beyond the others.
My suggestion is to find better data for building a regression model. Besides that, regression has many variations, such as polynomial regression, Support Vector Regression (SVR), decision tree regression, and random forest regression. Maybe next time I will try all of these regressions on one dataset and compare all the models.
I hope I didn't disappoint you and that you enjoyed this article. See you next time!