In general, when people want to buy a house, they look for one that is affordable and has all the features they want. A house price prediction helps them decide whether the house they want to buy is worth the asking price.
It’s the same for people who want to sell a house. With a house price prediction system, the seller can judge whether the features of the property and its surrounding environment add value to the sale, so that the house can be sold at the best price.
In this assignment on House Price Prediction using machine learning, we try to predict the median house price of a neighborhood from the other available metrics.
#load library
library(tidyverse)
library(GGally) # plot correlation
We use data from the California census to create a machine learning model to predict house prices in the state.
The data includes features such as population, median income, and median house prices for each block group/district in California.
housing <- read.csv("data_input/housing.csv", stringsAsFactors = F)
Let’s check the data types:
glimpse(housing)
## Rows: 20,640
## Columns: 10
## $ longitude <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.2~
## $ latitude <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37~
## $ housing_median_age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52,~
## $ total_rooms <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555,~
## $ total_bedrooms <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, ~
## $ population <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 15~
## $ households <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, ~
## $ median_income <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6~
## $ median_house_value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299~
## $ ocean_proximity <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NE~
There are 20,640 rows and 10 columns:
longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity
Summary of the data
summary(housing)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
Now let’s check whether there are any missing values.
anyNA(housing)
## [1] TRUE
colSums(is.na(housing))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 207 0 0 0
## median_house_value ocean_proximity
## 0 0
There are 207 missing values in the total_bedrooms column.
We fill the missing values and convert the ocean_proximity column to a factor.
housing <- housing %>%
fill(total_bedrooms,.direction = "updown") %>%
mutate(ocean_proximity = as.factor(ocean_proximity))
#re-check any missing value
anyNA(housing)
## [1] FALSE
There are no missing values left; the data is ready for the next step.
We split the dataset into 80% training data and 20% test data.
#split data
RNGkind(sample.kind = "Rounding")
set.seed(417)
idx <- sample(nrow(housing), nrow(housing)* 0.8)
housing_train <- housing[idx,]
housing_test <- housing[-idx,]
Check which predictors are highly correlated with the target median_house_value:
ggcorr(housing_train, label = T, hjust = 1, layout.exp = 3)
boxplot(housing_train$median_house_value)
There are no significant outliers in the target variable.
library(maps)
ca_df <- map_data("state") %>% filter(region =="california")
counties <- map_data("county")
ca_county <- subset(counties, region == "california")
ca_base <- ggplot(data = ca_df, mapping = aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_polygon(color = "black", fill = "gray")
ca_map <- ca_base +
geom_polygon(data = ca_county, fill = NA, color = "white") +
geom_polygon(color = "black", fill = NA) # get the state border back on top
ca_map +
#ggplot (housing, mapping = aes(y = latitude, x = longitude)) +
geom_jitter(data = housing,aes(x = longitude,
y = latitude,size = population,
col = median_house_value,
group = population))+
labs(title = "Population Distribution",
subtitle ="Median Housing Price",
col = "Median Price",
size = "Population",
x = "Longitude",
y = "Latitude"
  )
Based on the distribution in the plot above, most houses with high median prices are concentrated in areas close to the coast. However, rather than relying on a single predictor, this time we use multiple predictors to find out which ones affect the median house value.
Now it’s time to build a linear regression model using all the predictors available in the training data except latitude, longitude, and population, which in my opinion are not necessary here.
model_all <- lm(median_house_value ~ housing_median_age + total_rooms+total_bedrooms +
households + median_income +ocean_proximity, data = housing_train)
summary(model_all)
##
## Call:
## lm(formula = median_house_value ~ housing_median_age + total_rooms +
## total_bedrooms + households + median_income + ocean_proximity,
## data = housing_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -571031 -44688 -12167 29586 493467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15414.8653 2791.7946 5.521 0.000000034120882
## housing_median_age 1162.0133 50.9422 22.810 < 0.0000000000000002
## total_rooms -12.6299 0.8747 -14.438 < 0.0000000000000002
## total_bedrooms 94.3436 6.3876 14.770 < 0.0000000000000002
## households -8.5460 6.5426 -1.306 0.192
## median_income 41764.6039 380.2928 109.822 < 0.0000000000000002
## ocean_proximityINLAND -64496.9541 1440.8497 -44.763 < 0.0000000000000002
## ocean_proximityISLAND 183724.8996 32204.3718 5.705 0.000000011836028
## ocean_proximityNEAR BAY 13769.7548 1925.9292 7.150 0.000000000000906
## ocean_proximityNEAR OCEAN 18733.1146 1774.2163 10.559 < 0.0000000000000002
##
## (Intercept) ***
## housing_median_age ***
## total_rooms ***
## total_bedrooms ***
## households
## median_income ***
## ocean_proximityINLAND ***
## ocean_proximityISLAND ***
## ocean_proximityNEAR BAY ***
## ocean_proximityNEAR OCEAN ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71950 on 16502 degrees of freedom
## Multiple R-squared: 0.6107, Adjusted R-squared: 0.6105
## F-statistic: 2876 on 9 and 16502 DF, p-value: < 0.00000000000000022
The model results show that most predictors are significant, and the adjusted R-squared of 0.61 is reasonably good.
One predictor, households, is not significant, so we will remove it from the model. Can that increase the adjusted R-squared value?
Let’s see
model_new <- lm(median_house_value ~ housing_median_age + total_rooms+total_bedrooms + median_income +ocean_proximity, data = housing_train)
summary(model_new)
##
## Call:
## lm(formula = median_house_value ~ housing_median_age + total_rooms +
## total_bedrooms + median_income + ocean_proximity, data = housing_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -571236 -44698 -12164 29533 493776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14955.8587 2769.6497 5.400 0.000000067596584
## housing_median_age 1160.3372 50.9271 22.784 < 0.0000000000000002
## total_rooms -12.9180 0.8465 -15.261 < 0.0000000000000002
## total_bedrooms 88.1564 4.2855 20.571 < 0.0000000000000002
## median_income 41817.3152 378.1537 110.583 < 0.0000000000000002
## ocean_proximityINLAND -64163.0100 1418.0169 -45.248 < 0.0000000000000002
## ocean_proximityISLAND 184801.3734 32194.5140 5.740 0.000000009625338
## ocean_proximityNEAR BAY 13800.0109 1925.8311 7.166 0.000000000000806
## ocean_proximityNEAR OCEAN 18811.6128 1773.2362 10.609 < 0.0000000000000002
##
## (Intercept) ***
## housing_median_age ***
## total_rooms ***
## total_bedrooms ***
## median_income ***
## ocean_proximityINLAND ***
## ocean_proximityISLAND ***
## ocean_proximityNEAR BAY ***
## ocean_proximityNEAR OCEAN ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71950 on 16503 degrees of freedom
## Multiple R-squared: 0.6107, Adjusted R-squared: 0.6105
## F-statistic: 3236 on 8 and 16503 DF, p-value: < 0.00000000000000022
After fitting the new model, there is no significant change compared to the model that includes the households column, so we discard the new model and keep model_all.
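To compare the two fits side by side, the adjusted R-squared of each model can be extracted directly; a minimal sketch (assumed follow-up, output not shown):
# compare adjusted R-squared of both models
summary(model_all)$adj.r.squared
summary(model_new)$adj.r.squared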
Next, we make predictions on the test data with the model we created, model_all.
housing_test$pred <- predict(model_all, housing_test)
hist(housing_test$pred)
From the histogram above, the mean of the predicted prices is 208,050.55, and the predictions range from 8,368.85 to 713,808.24.
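These figures can be read off the prediction vector directly; a minimal sketch (assumed follow-up, output not shown):
# mean and range of the predicted house prices
mean(housing_test$pred)
range(housing_test$pred)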
plot(model_all)
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far the data points are from the regression line; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
library(MLmetrics)
RMSE(y_pred = housing_test$pred, y_true = housing_test$median_house_value)
## [1] 72193.15
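As a sanity check, the same value can be computed directly from the definition of RMSE; a minimal sketch:
# RMSE by hand: square root of the mean squared prediction error
sqrt(mean((housing_test$pred - housing_test$median_house_value)^2))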
The RMSE is fairly small relative to the range of house values, which suggests the model predicts reasonably well.
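To put the RMSE in context, it can be compared with the spread of the actual target values; a small sketch (assumed follow-up, output not shown):
# spread of the actual house values, for comparison with the RMSE
sd(housing_test$median_house_value)
range(housing_test$median_house_value)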
hist(model_all$residuals)
### Shapiro-Wilk Test
The Shapiro-Wilk Test is more appropriate for small sample sizes (< 50 samples), but can also handle sample sizes as large as 5000.
shapiro.test(x = model_all$residuals[3:5000])
##
## Shapiro-Wilk normality test
##
## data: model_all$residuals[3:5000]
## W = 0.92709, p-value < 0.00000000000000022
Note: the p-value is < 0.05, so we reject the null hypothesis of normality; the residuals are not normally distributed.
One of the key assumptions of linear regression is that the residuals have equal variance at each level of the predictor variables. This assumption is known as homoscedasticity.
One way to visually detect whether heteroscedasticity is present is to create a plot of the residuals against the fitted values of the regression model.
plot(model_all$fitted.values, model_all$residuals)
abline(h = 0, col = "red")
A formal statistical test for detecting heteroscedasticity is the Breusch-Pagan test, which determines whether or not heteroscedasticity is present in a regression model.
library(lmtest)
bptest(model_all)
##
## studentized Breusch-Pagan test
##
## data: model_all
## BP = 753.84, df = 9, p-value < 0.00000000000000022
The p-value is less than 0.05, so we conclude that heteroscedasticity is present in the regression model.
VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. Because we want our prediction to be precise, low variance is the ideal.
library(car)
vif(model_all)
## GVIF Df GVIF^(1/(2*Df))
## housing_median_age 1.307510 1 1.143464
## total_rooms 11.347351 1 3.368583
## total_bedrooms 22.500761 1 4.743497
## households 19.567252 1 4.423489
## median_income 1.666005 1 1.290738
## ocean_proximity 1.339643 4 1.037227
However, total_rooms, total_bedrooms, and households have GVIF values above 10, which indicates multicollinearity among these predictors.
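Since model_new drops the households predictor, it is worth checking whether the collinearity improves there; a quick sketch (assumed follow-up, output not shown):
# re-check VIF after dropping the households predictor
vif(model_new)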
From the results of the tests carried out, the model performs reasonably well, although some of the classical assumptions (normality of residuals, homoscedasticity, and low multicollinearity) are not fully satisfied.
However, to get the best possible model, further improvements are needed, one of which is checking the model for overfitting or underfitting and making model diagnostics a regular part of your machine learning workflow.
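One simple way to check for overfitting is to compare the error on the training and test sets; a minimal sketch, assuming model_all and the split above (output not shown):
# compare RMSE on training and test data; a large gap would suggest overfitting
RMSE(y_pred = predict(model_all, housing_train), y_true = housing_train$median_house_value)
RMSE(y_pred = housing_test$pred, y_true = housing_test$median_house_value)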