1 Overview

In general, when people want to buy a house, they look for a house that is affordable and has all the desired features. Home price predictions will help them to decide whether the house they want to buy is worth that price or not.

It’s the same with people who want to sell a house. By utilizing a house price prediction system, the seller will be able to decide whether all the features inherent in the property and the surrounding environment can add value to the sale so that the house can be sold at the best price.

1.1 Objective

In this assignment on House Price Prediction using machine learning, we try to predict the average house price in any neighborhood, with all other metrics.

1.2 Prepare Data

1.2.1 Load Libary

#load library
library(tidyverse)

library(GGally) ##plot correlation

1.2.2 Load Dataset

We use data from the California census to create a machine learning model to predict house prices in the State.

The data includes features such as population, median income, and median house prices for each block/District group in California.

housing <- read.csv("data_input/housing.csv", stringsAsFactors = F)

Let’s check data type

glimpse(housing)
## Rows: 20,640
## Columns: 10
## $ longitude          <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.2~
## $ latitude           <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37~
## $ housing_median_age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52,~
## $ total_rooms        <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555,~
## $ total_bedrooms     <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, ~
## $ population         <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 15~
## $ households         <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, ~
## $ median_income      <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6~
## $ median_house_value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299~
## $ ocean_proximity    <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NE~

There are any 20.640 rows on 10 columns :

longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity

1.3 Data Wrangling

Summary of the data

summary(housing)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 

Now let’s check missing value is any

anyNA(housing)
## [1] TRUE
colSums(is.na(housing))
##          longitude           latitude housing_median_age        total_rooms 
##                  0                  0                  0                  0 
##     total_bedrooms         population         households      median_income 
##                207                  0                  0                  0 
## median_house_value    ocean_proximity 
##                  0                  0

There is any missing value 207 rows in total_bedrooms column.

Fill the missing value, and change data type to factor for ocean_proximity column.

housing <- housing %>% 
  fill(total_bedrooms,.direction = "updown") %>% 
  mutate(ocean_proximity = as.factor(ocean_proximity)) 

#re-check any missing value
anyNA(housing)
## [1] FALSE

The result is no missing value, data is ready to next process.

1.4 Data Spliting

We are split the dataset with 80% as data train and 20% as data test.

#split data 
RNGkind(sample.kind = "Rounding")
set.seed(417)

idx <- sample(nrow(housing), nrow(housing)* 0.8)

housing_train <- housing[idx,]

housing_test <- housing[ -idx,]

Check high correlation to data target median_housing_value

  ggcorr(housing_train, label = T, hjust = 1, layout.exp = 3)

boxplot(housing_train$median_house_value)

No significant outlier data

library(maps)
ca_df <- map_data("state") %>% filter(region =="california")


counties <- map_data("county")
ca_county <- subset(counties, region == "california")


ca_base <- ggplot(data = ca_df, mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) + 
  geom_polygon(color = "black", fill = "gray")

ca_map <- ca_base +
  geom_polygon(data = ca_county, fill = NA, color = "white") +
  geom_polygon(color = "black", fill = NA)  # get the state border back on top


ca_map +
  #ggplot (housing, mapping = aes(y = latitude, x = longitude)) +
  geom_jitter(data = housing,aes(x = longitude, 
                                 y = latitude,size = population, 
                                 col = median_house_value,
                                 group = population))+
  labs(title = "Population Distribution", 
       subtitle ="Median Housing Price",
       col = "Median Price",
       size = "Population",
       x = "Longitude",
       y = "Latitude"
  )

Based on the average distribution of plots above, most houses with high average prices are concentrated in areas close to the coast. However, this time, we do not only use one predictor but also use multiple predictors to find out affected to the average value of the housing price.

1.5 Model Fitting

Now it’s time to build a Linear Regression model using all the predictors available from the train data, except latitude, longitude and population. In my opinion it’s not necessary

model_all <- lm(median_house_value ~ housing_median_age + total_rooms+total_bedrooms +
                  households + median_income +ocean_proximity, data = housing_train)
summary(model_all)
## 
## Call:
## lm(formula = median_house_value ~ housing_median_age + total_rooms + 
##     total_bedrooms + households + median_income + ocean_proximity, 
##     data = housing_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -571031  -44688  -12167   29586  493467 
## 
## Coefficients:
##                              Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)                15414.8653   2791.7946   5.521    0.000000034120882
## housing_median_age          1162.0133     50.9422  22.810 < 0.0000000000000002
## total_rooms                  -12.6299      0.8747 -14.438 < 0.0000000000000002
## total_bedrooms                94.3436      6.3876  14.770 < 0.0000000000000002
## households                    -8.5460      6.5426  -1.306                0.192
## median_income              41764.6039    380.2928 109.822 < 0.0000000000000002
## ocean_proximityINLAND     -64496.9541   1440.8497 -44.763 < 0.0000000000000002
## ocean_proximityISLAND     183724.8996  32204.3718   5.705    0.000000011836028
## ocean_proximityNEAR BAY    13769.7548   1925.9292   7.150    0.000000000000906
## ocean_proximityNEAR OCEAN  18733.1146   1774.2163  10.559 < 0.0000000000000002
##                              
## (Intercept)               ***
## housing_median_age        ***
## total_rooms               ***
## total_bedrooms            ***
## households                   
## median_income             ***
## ocean_proximityINLAND     ***
## ocean_proximityISLAND     ***
## ocean_proximityNEAR BAY   ***
## ocean_proximityNEAR OCEAN ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71950 on 16502 degrees of freedom
## Multiple R-squared:  0.6107, Adjusted R-squared:  0.6105 
## F-statistic:  2876 on 9 and 16502 DF,  p-value: < 0.00000000000000022

The results of the model show that all predictors are significant and the results are adj. R-squared is quite good at 0.61.

There is one predictor that does not show a significant predictor, the household column and we will remove it from the model, can it increase the Adjusted R-squared value?

Let’s see

model_new <- lm(median_house_value ~ housing_median_age + total_rooms+total_bedrooms  + median_income +ocean_proximity, data = housing_train)
summary(model_new)
## 
## Call:
## lm(formula = median_house_value ~ housing_median_age + total_rooms + 
##     total_bedrooms + median_income + ocean_proximity, data = housing_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -571236  -44698  -12164   29533  493776 
## 
## Coefficients:
##                              Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)                14955.8587   2769.6497   5.400    0.000000067596584
## housing_median_age          1160.3372     50.9271  22.784 < 0.0000000000000002
## total_rooms                  -12.9180      0.8465 -15.261 < 0.0000000000000002
## total_bedrooms                88.1564      4.2855  20.571 < 0.0000000000000002
## median_income              41817.3152    378.1537 110.583 < 0.0000000000000002
## ocean_proximityINLAND     -64163.0100   1418.0169 -45.248 < 0.0000000000000002
## ocean_proximityISLAND     184801.3734  32194.5140   5.740    0.000000009625338
## ocean_proximityNEAR BAY    13800.0109   1925.8311   7.166    0.000000000000806
## ocean_proximityNEAR OCEAN  18811.6128   1773.2362  10.609 < 0.0000000000000002
##                              
## (Intercept)               ***
## housing_median_age        ***
## total_rooms               ***
## total_bedrooms            ***
## median_income             ***
## ocean_proximityINLAND     ***
## ocean_proximityISLAND     ***
## ocean_proximityNEAR BAY   ***
## ocean_proximityNEAR OCEAN ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71950 on 16503 degrees of freedom
## Multiple R-squared:  0.6107, Adjusted R-squared:  0.6105 
## F-statistic:  3236 on 8 and 16503 DF,  p-value: < 0.00000000000000022

After the new modeling is done, there is no significant change compared to the model in which the household column is entered. For that we ignore this new model

Next we will make predictions into the test data with the model we have created model_all.

housing_test$pred <- predict(model_all, housing_test)

hist(housing_test$pred)

From the histogram above, the average price prediction results show a value of 208,050.55, where the average price range is between 8,368.85 and 713,808.24

plot(model_all)

1.6 Model Performance Testing

1.6.1 RMSE Permormance

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.

library(MLmetrics)

RMSE(y_pred = housing_test$pred, y_true = housing_test$median_house_value)
## [1] 72193.15

The RMSE result quit small, it’s mean the model has enough good to predict.

hist(model_all$residuals)

### Shapiro Wilk Performance

The Shapiro-Wilk Test is more appropriate for small sample sizes (< 50 samples), but can also handle sample sizes as large as 5000.

shapiro.test(x = model_all$residuals[3:5000])
## 
##  Shapiro-Wilk normality test
## 
## data:  model_all$residuals[3:5000]
## W = 0.92709, p-value < 0.00000000000000022

Note : Model has normal distribution and p-value < 0.05

1.6.2 Breusch-Pagan test

One of the key assumptions of linear regression is that the residuals are distributed with equal variance at each level of the predictor variable. This assumption is known as homoscedasticity

One way to visually detect whether heteroscedasticity is present is to create a plot of the residuals against the fitted values of the regression model.

plot(model_all$fitted.values, model_all$residuals)
abline(h = 0, col = "red")

A formal statistical test we can use to determine if heteroscedasticity is present is the Breusch-Pagan test. The Breusch-Pagan test is used to determine whether or not heteroscedasticity is present in a regression model.

library(lmtest)

bptest(model_all)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_all
## BP = 753.84, df = 9, p-value < 0.00000000000000022

p-value less then 0.05 conclude that heteroscedasticity is present in the regression model

1.6.3 Multicollinearity

VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. Because we want our prediction to be precise, low variance is the ideal.

library(car)

vif(model_all)
##                         GVIF Df GVIF^(1/(2*Df))
## housing_median_age  1.307510  1        1.143464
## total_rooms        11.347351  1        3.368583
## total_bedrooms     22.500761  1        4.743497
## households         19.567252  1        4.423489
## median_income       1.666005  1        1.290738
## ocean_proximity     1.339643  4        1.037227

There is no multicultural because there is no VIF value more than 10

2 Conclusion

From the results of the tests that have been carried out, the model is declared to have passed the test.

However, to get the maximum model, process improvements need to be made again, one of which is checking the model for the trend of overfitting / underfitting models and making model diagnostics an important part of your machine learning tool.