Predicting House Prices in King County, USA
Before Starting
We all know that houses across any state or city vary in many ways, starting from the number of rooms, the square footage (sqft) of the house, the living area, and many other factors that affect house prices. We'll start with the house prices in King County.
What Is King County
King County is located in the U.S. state of Washington. The population was 2,252,782 in the 2019 census estimate, making it the most populous county in Washington, and the 12th-most populous in the United States. The county seat is Seattle, also the state’s most populous city.
King County is one of three Washington counties that are included in the Seattle–Tacoma–Bellevue metropolitan statistical area. (The others are Snohomish County to the north and Pierce County to the south.) About two-thirds of King County’s population lives in Seattle’s suburbs.
Source: https://en.wikipedia.org/wiki/King_County,_Washington
Let's Start
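Before importing the data, load the packages this analysis relies on. (This list is inferred from the functions called later: %>%, select, and mutate from dplyr, ggcorr from GGally, sample.split from caTools, MAE from MLmetrics, bptest from lmtest, and vif from car.)
library(dplyr)     # data manipulation: %>%, select, mutate
library(GGally)    # ggcorr correlation plot
library(caTools)   # sample.split for train/test splitting
library(MLmetrics) # MAE
library(lmtest)    # bptest (studentized Breusch-Pagan test)
library(car)       # vif (variance inflation factors)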
Import the dataset
# Import Dataset
dataset = read.csv("kc_house_data.csv")
Exploratory Data Analysis
We need to check the dataset properly to know the type of every column.
str(dataset)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
colSums(is.na(dataset))
## id date price bedrooms bathrooms
## 0 0 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
We can see the data doesn't have a single missing value.
Explanation of our Goals
With this dataset, we want to predict the price. We should probably delete some columns that we don't need for predicting a house price. Our target is price; before cleaning the data, let's look at the correlations of all the columns in our dataset.
ggcorr(dataset, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(dataset, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'date' are not numeric and were ignored
From the plot, we can identify the columns that have a low correlation with price, that is, a score approaching zero, which makes them poor predictors. So now we know what we should clean.
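As a quick numeric cross-check of the plot, we can rank the columns by their correlation with price directly (a small sketch, not part of the original output):
# Rank the numeric columns by absolute correlation with price
num_cols = sapply(dataset, is.numeric)
cors = cor(dataset[, num_cols])[, "price"]
sort(abs(cors), decreasing = TRUE)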
Cleaning Data
dataset = dataset %>%
select(-c(zipcode, lat, long, id, date, yr_renovated, yr_built))
Splitting the Dataset into Training_set and Test_set
set.seed(123)
split = sample.split(dataset$price, SplitRatio = 0.8) # sample.split expects the outcome vector, not the whole data frame
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Why am I splitting the dataset? Because I want to test my model on new data that the model has never seen before. And why is training_set larger than test_set? Because we want the model to learn from as much data as possible. The split between training_set and test_set is 80 : 20.
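As a quick sanity check (a minimal sketch, not shown in the original output), we can confirm the proportions:
# The split should be roughly 80/20
nrow(training_set) / nrow(dataset)
nrow(test_set) / nrow(dataset)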
Make a Model
Let's make a regression model. We will build a multiple linear regression. Why multiple linear regression? That's easy: we have many predictors, and the condition for using multiple linear regression is having more than one predictor.
regression = lm(price ~ . , training_set)
summary(regression)
##
## Call:
## lm(formula = price ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1185261 -123912 -16199 94688 4586625
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.064e+05 1.993e+04 -35.452 < 2e-16 ***
## bedrooms -3.533e+04 2.460e+03 -14.364 < 2e-16 ***
## bathrooms -1.610e+04 3.966e+03 -4.060 4.93e-05 ***
## sqft_living 2.208e+02 5.562e+00 39.701 < 2e-16 ***
## sqft_lot 3.577e-02 6.054e-02 0.591 0.555
## floors -2.294e+03 4.553e+03 -0.504 0.614
## waterfront 6.132e+05 2.201e+04 27.859 < 2e-16 ***
## view 5.818e+04 2.728e+03 21.329 < 2e-16 ***
## condition 5.526e+04 2.905e+03 19.020 < 2e-16 ***
## grade 1.028e+05 2.706e+03 37.972 < 2e-16 ***
## sqft_above -2.846e+01 5.503e+00 -5.172 2.34e-07 ***
## sqft_basement NA NA NA NA
## sqft_living15 6.036e+00 4.364e+00 1.383 0.167
## sqft_lot15 -7.821e-01 9.321e-02 -8.391 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233700 on 16968 degrees of freedom
## Multiple R-squared: 0.61, Adjusted R-squared: 0.6097
## F-statistic: 2212 on 12 and 16968 DF, p-value: < 2.2e-16
Many predictors are significant for the target variable, but the regression model still has insignificant variables and, worst of all, an NA coefficient. The NA comes from a singularity: sqft_basement is a perfect linear combination of other predictors, so lm() cannot estimate it. In a case like this, stepwise selection can help us build a better regression model.
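We can check the suspected dependence directly; in this dataset, sqft_living appears to be the sum of sqft_above and sqft_basement (a quick check, assuming that relationship holds):
# TRUE if sqft_living == sqft_above + sqft_basement for every row
all(dataset$sqft_living == dataset$sqft_above + dataset$sqft_basement)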
Step-Wise (Backward Elimination)
backward <- step(object = regression, direction = "backward", trace = F)
summary(backward)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + sqft_living15 + sqft_lot15,
## data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1182402 -123879 -16273 94593 4587266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.079e+05 1.967e+04 -35.980 < 2e-16 ***
## bedrooms -3.530e+04 2.457e+03 -14.370 < 2e-16 ***
## bathrooms -1.681e+04 3.724e+03 -4.515 6.37e-06 ***
## sqft_living 2.218e+02 5.256e+00 42.205 < 2e-16 ***
## waterfront 6.127e+05 2.200e+04 27.852 < 2e-16 ***
## view 5.822e+04 2.726e+03 21.357 < 2e-16 ***
## condition 5.545e+04 2.877e+03 19.274 < 2e-16 ***
## grade 1.025e+05 2.670e+03 38.393 < 2e-16 ***
## sqft_above -2.957e+01 4.940e+00 -5.987 2.18e-09 ***
## sqft_living15 6.239e+00 4.311e+00 1.447 0.148
## sqft_lot15 -7.400e-01 6.625e-02 -11.171 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233700 on 16970 degrees of freedom
## Multiple R-squared: 0.61, Adjusted R-squared: 0.6098
## F-statistic: 2654 on 10 and 16970 DF, p-value: < 2.2e-16
And we have a good model. This is better than before, but we can see that sqft_living15 still has a high p-value. What should we do with that column? The answer is nothing. Why? Because step() kept it: removing it did not improve the model's selection criterion, and it also makes sense analytically that the living area of the surrounding houses influences a sale price.
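Since step() selects by AIC, we can compare the two models on that criterion directly (a sketch; the AIC values were not printed in the original output):
# step() minimises AIC, so the backward model should score no worse
AIC(regression)
AIC(backward)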
Predict Model And Error
After building more than one model, we need to compare them to find the best one to use. We're going to do a few things to find the best model, so let's go.
test_set$pred_regression = predict(object = regression, newdata = test_set)
## Warning in predict.lm(object = regression, newdata = test_set): prediction from
## a rank-deficient fit may be misleading
(This warning is caused by the NA coefficient we saw earlier: the full model is rank-deficient because of the collinear sqft_basement column.)
test_set$pred_backward = predict(object = backward, newdata = test_set)
test_set = test_set %>%
mutate(error_regression = round(price - pred_regression, 0),
error_backward = round(price - pred_backward, 0))
test_set[1:10,] %>%
select(price, pred_regression, error_regression, pred_backward, error_backward)
## price pred_regression error_regression pred_backward error_backward
## 1 510000 469330.1 40670 467824.3 42176
## 2 291850 252785.8 39064 251939.1 39911
## 3 662500 872745.2 -210245 872633.3 -210133
## 4 189000 378415.1 -189415 377750.4 -188750
## 5 2000000 1081997.1 918003 1080195.4 919805
## 6 329000 669252.5 -340252 670075.4 -341075
## 7 687500 537559.8 149940 538597.5 148903
## 8 696000 603755.9 92244 603803.8 92196
## 9 240000 209678.8 30321 209280.3 30720
## 10 210490 142710.3 67780 142375.5 68115
sqrt(mean(test_set$error_regression^2)) # RMSE of Regression
## [1] 216972.5
MAE(y_pred = test_set$pred_regression, y_true = test_set$price)
## [1] 150982.1
sqrt(mean(test_set$error_backward^2)) # RMSE of Backward
## [1] 216975.4
MAE(y_pred = test_set$pred_backward, y_true = test_set$price)
## [1] 150946.3
We can see there is very little difference in RMSE between the two models (the backward model's is in fact marginally higher, 216975.4 vs 216972.5), but its MAE is slightly lower, it uses fewer predictors, and it avoids the rank-deficiency warning. So we will use the regression with backward elimination.
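To make the comparison easier to read, we can collect the metrics in one small table (a sketch using the values computed above):
# Gather both models' error metrics side by side
data.frame(
  model = c("regression", "backward"),
  RMSE  = c(sqrt(mean(test_set$error_regression^2)),
            sqrt(mean(test_set$error_backward^2))),
  MAE   = c(MAE(test_set$pred_regression, test_set$price),
            MAE(test_set$pred_backward, test_set$price))
)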
Model Evaluation
And we're on the final step: model evaluation. This section has a lot of things to do, but don't worry, we will go through every evaluation step.
Histogram
hist(backward$residuals, breaks = 5)
shapiro.test(backward$residuals[0:5000]) # shapiro.test accepts at most 5000 values
##
## Shapiro-Wilk normality test
##
## data: backward$residuals[0:5000]
## W = 0.81016, p-value < 2.2e-16
For the backward model, the p-value < 0.05, so we reject H0. This means the residuals are not normally distributed: our model's errors do not cluster tightly around the mean.
Heteroscedasticity
bptest(backward)
##
## studentized Breusch-Pagan test
##
## data: backward
## BP = 2123.2, df = 10, p-value < 2.2e-16
For the backward model, the p-value < 0.05, so we reject H0. This means the residuals have a pattern (heteroscedasticity): not all of the existing patterns in the data are captured by the model.
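We can also inspect this visually (a minimal sketch; a funnel shape in this plot is the classic sign of heteroscedasticity):
# Residuals vs fitted values: a widening spread indicates heteroscedasticity
plot(backward$fitted.values, backward$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")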
Multicollinearity
vif(backward)
## bedrooms bathrooms sqft_living waterfront view
## 1.629428 2.573973 7.342700 1.212574 1.390872
## condition grade sqft_above sqft_living15 sqft_lot15
## 1.081988 3.072872 5.251822 2.705255 1.065496
No value is equal to or greater than 10, so there is no serious multicollinearity among the independent predictor variables.
Based on the analysis results, the backward model has poor criteria as a linear regression model. Still, when comparing the two models' error metrics, the backward model gives a slightly lower MAE with essentially the same RMSE. Therefore, the backward model was chosen as the better model.
Conclusions and Suggestions
After building and evaluating the models, we know the backward model has an R-squared of 0.61 and an RMSE of about 216975. However, the analysis tests showed the model has unfavourable criteria (non-normal residuals and heteroscedasticity).
But I know where I went wrong: the data. Why can I say that? Because, if you look back, the correlations between the target and many predictors have low scores, and you'll remember that when we created the model named regression, it had an NA coefficient, which means some predictors are perfectly collinear and carry no information beyond the others.
My suggestion is to find better data for building a regression model. Besides that, regression has many variations, such as polynomial regression, Support Vector Regression (SVR), decision tree regression, and random forest regression. Maybe next time I will try all of these regressions on one dataset and compare all the models.
I hope I didn't disappoint you and that you enjoyed this article. See you next time!