
First, let’s load the libraries needed for this project.

library(readr)
library(ggplot2)
library(knitr)
library(tidyverse)
library(caret)
library(leaps)
library(car)
library(mice)
library(scales)
library(RColorBrewer)
library(plotly)
library(nortest)
library(lmtest)

Introduction

Predicting Housing Prices

The purpose of this project is to predict the price of houses in California in 1990 based on a number of possible location-based predictors, including latitude, longitude, and information about other houses within a particular block.

While this project focuses on prediction, we are fully aware that housing prices have increased dramatically since 1990, when the data was collected. This model should not be used to predict today’s housing prices in California. This is purely an academic endeavor to explore statistical prediction.

The goal of the project is to create the model that can best predict home prices in California given reasonable train/test splits of the data.

The dataset is taken from https://www.kaggle.com/camnugent/california-housing-prices

California Housing Prices Dataset

We’re using the California Housing Prices dataset (housing.csv) from the Kaggle page linked above. The data pertains to the houses found in a given California district, along with summary statistics about them based on the 1990 census data.

housing_data <- read.csv("housing.csv")
summary(housing_data)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 

The dataset contains 20640 observations and 10 attributes (9 predictors and 1 response). Below is a list of the variables, with descriptions taken from the original Kaggle page given above:

- longitude: a measure of how far west a house is; a higher value is farther west
- latitude: a measure of how far north a house is; a higher value is farther north
- housing_median_age: median age of a house within a block; a lower number is a newer building
- total_rooms: total number of rooms within a block
- total_bedrooms: total number of bedrooms within a block
- population: total number of people residing within a block
- households: total number of households (groups of people residing within a home unit) within a block
- median_income: median income for households within a block (measured in tens of thousands of US dollars)
- median_house_value: median house value for households within a block (measured in US dollars); this is our response
- ocean_proximity: location of the house with respect to the ocean/sea

Looking at the summary above, we can see that the total_bedrooms variable has 207 NA values. We will address this issue in the Data Cleaning section of the Methods.

Below is a visual representation of all data points in the dataset, with longitude on the x-axis, latitude on the y-axis, and median_house_value encoded by color.

# note: hma, tr, tb, hh, and mi below are not standard ggplot2 aesthetics; they
# appear to be carried along for interactive tooltips (e.g., via ggplotly) and
# will trigger an "unknown aesthetics" warning
plot_map = ggplot(housing_data, 
                  aes(x = longitude, y = latitude, color = median_house_value, 
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
              geom_point(aes(size = population), alpha = 0.4) +
              xlab("Longitude") +
              ylab("Latitude") +
              ggtitle("Data Map - Longtitude vs Latitude and Associated Variables") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_color_distiller(palette = "Paired", labels = comma) +
              labs(color = "Median House Value (in $USD)", size = "Population")
plot_map

We see that, on average, the houses nearest to the ocean tend to have higher median house values, whereas those inland have lower median values. This difference is quite substantial and tells us that the variable ocean_proximity will likely play a large role in predicting median house value.

Data Cleaning

Categorical Variables

Initial exploration of the data showed us that there were a few steps we needed to take to make the data more usable. First, we converted the categorical variable ocean_proximity from text to a factor variable.

housing_data$ocean_proximity = as.factor(housing_data$ocean_proximity)
levels(housing_data$ocean_proximity)
## [1] "<1H OCEAN"  "INLAND"     "ISLAND"     "NEAR BAY"   "NEAR OCEAN"

Taking a deeper dive into ocean_proximity, we see that the level ISLAND has a very low count compared to the other levels.

ggplot(housing_data, aes(x = factor(ocean_proximity))) + geom_bar(stat = "count", color = "black", fill = "Red")

We looked at the total count of each level to get a better idea of the skewness in ocean_proximity.

summary(housing_data$ocean_proximity)
##  <1H OCEAN     INLAND     ISLAND   NEAR BAY NEAR OCEAN 
##       9136       6551          5       2290       2658

In fact, ISLAND has only five rows, whereas every other level has over 2,000. Due to possible issues with model fitting, we decided to remove this level from ocean_proximity, accepting the risk that our model will predict the value of houses on islands less accurately.

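The removal itself can be done by subsetting out the ISLAND rows and dropping the now-empty factor level. Since the exact code is not shown above, here is a minimal sketch of that step:

# drop the five ISLAND rows and remove the empty factor level
housing_data = housing_data[housing_data$ocean_proximity != "ISLAND", ]
housing_data$ocean_proximity = droplevels(housing_data$ocean_proximity)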

Missing Data

The other thing we considered was missing data.

sum(is.na(housing_data))
## [1] 207
total_bedrooms = housing_data$total_bedrooms
sum(is.na(total_bedrooms))
## [1] 207

There are 207 observations with missing data for total_bedrooms. One way to solve this issue is to impute the missing values. We first take a look at the distribution of this variable to see which imputation method will work best.

bedroom_mean = mean(housing_data$total_bedrooms, na.rm=TRUE)
bedroom_median = median(housing_data$total_bedrooms, na.rm=TRUE)
ggplot(housing_data, aes(x = total_bedrooms)) +
  geom_histogram(bins = 40, color = "black", fill = "blue") +
  geom_vline(aes(xintercept = bedroom_mean, color = "Mean"), lwd = 1.5) +
  geom_vline(aes(xintercept = bedroom_median, color = "Median"), lwd = 1.5) +
  xlab("Total Bedrooms") +
  ylab("Frequency") +
  ggtitle("Histogram of Total Bedrooms (noncontinuous variable)") +
  scale_color_manual(name = "Summary Stats", labels = c("Mean", "Median"), values = c("red", "green"))
## Warning: Removed 207 rows containing non-finite values (stat_bin).

Looking at the distribution of the data in the histogram above, which is strongly right-skewed (the mean sits well above the median), we decided to use the median of the total_bedrooms variable as our imputation value.
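The imputation step itself is not shown above; a minimal sketch, assuming each NA is simply replaced with the column median (bedroom_median) computed earlier:

# fill the 207 missing total_bedrooms values with the median
housing_data$total_bedrooms[is.na(housing_data$total_bedrooms)] = bedroom_median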

Post-Cleaning

Looking at the structure of the dataset after cleaning, we see that besides the one factor variable ocean_proximity, we have nine numeric variables, three of which are continuous (longitude, latitude, and median_income) and six of which are discrete (housing_median_age, total_rooms, total_bedrooms, population, households, and median_house_value).

str(housing_data)
## 'data.frame':    20640 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...

For a better sense of the distribution of the nine numeric variables, we looked at histograms for each of them.

par(mfrow = c(3, 3))
hist(housing_data$longitude, breaks = 20, main = "longitude", border="darkorange", col="dodgerblue")
hist(housing_data$latitude, breaks = 20, main = "latitude", border="darkorange", col="dodgerblue")
hist(housing_data$housing_median_age, breaks = 20, main = "housing_median_age", border="darkorange", col="dodgerblue")
hist(housing_data$total_rooms, breaks = 20, main = "total_rooms", border="darkorange", col="dodgerblue")
hist(housing_data$total_bedrooms, breaks = 20, main = "total_bedrooms", border="darkorange", col="dodgerblue")
hist(housing_data$population, breaks = 20, main = "population", border="darkorange", col="dodgerblue")
hist(housing_data$households, breaks = 20, main = "households", border="darkorange", col="dodgerblue")
hist(housing_data$median_income, breaks = 20, main = "median_income", border="darkorange", col="dodgerblue")
hist(housing_data$median_house_value, breaks = 20, main = "median_house_value", border="darkorange", col="dodgerblue")

We also looked at a pairs plot to better understand the relationships between all of the variables.

pairs(housing_data, col = "dodgerblue")

It looks like some variables may be redundant due to collinearity. We looked further into the correlations between the numeric variables, shown in the table below:

housing_data_nc = housing_data[, -10]  # drop the factor column, ocean_proximity

corrmatrix = cor(housing_data_nc)  # NAs remaining in total_bedrooms propagate to NA correlations

kable(t(corrmatrix))
|                    | longitude  | latitude   | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|--------------------|-----------:|-----------:|-------------------:|------------:|---------------:|-----------:|-----------:|--------------:|-------------------:|
| longitude          |  1.0000000 | -0.9246644 | -0.1081968         |  0.0445680  | NA             |  0.0997732 |  0.0553101 | -0.0151759    | -0.0459666         |
| latitude           | -0.9246644 |  1.0000000 |  0.0111727         | -0.0360996  | NA             | -0.1087847 | -0.0710354 | -0.0798091    | -0.1441603         |
| housing_median_age | -0.1081968 |  0.0111727 |  1.0000000         | -0.3612622  | NA             | -0.2962442 | -0.3029160 | -0.1190340    |  0.1056234         |
| total_rooms        |  0.0445680 | -0.0360996 | -0.3612622         |  1.0000000  | NA             |  0.8571260 |  0.9184845 |  0.1980496    |  0.1341531         |
| total_bedrooms     | NA         | NA         | NA                 | NA          | 1              | NA         | NA         | NA            | NA                 |
| population         |  0.0997732 | -0.1087847 | -0.2962442         |  0.8571260  | NA             |  1.0000000 |  0.9072223 |  0.0048343    | -0.0246497         |
| households         |  0.0553101 | -0.0710354 | -0.3029160         |  0.9184845  | NA             |  0.9072223 |  1.0000000 |  0.0130331    |  0.0658427         |
| median_income      | -0.0151759 | -0.0798091 | -0.1190340         |  0.1980496  | NA             |  0.0048343 |  0.0130331 |  1.0000000    |  0.6880752         |
| median_house_value | -0.0459666 | -0.1441603 |  0.1056234         |  0.1341531  | NA             | -0.0246497 |  0.0658427 |  0.6880752    |  1.0000000         |

One thing that sticks out is the high collinearity between households and total_rooms (0.92), and between households and population (0.91). Note that the correlations involving total_bedrooms appear as NA because its missing values had not yet been imputed when this matrix was computed. While these variables are highly correlated, we believe it is best to leave them in, as they are typically influential in home pricing. Overall, we do not believe that collinearity will play a substantial role in our models, so we will keep all variables.
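A standard way to quantify this concern, not used in the original analysis, is to compute variance inflation factors with the car package we loaded earlier; a quick sketch:

# variance inflation factors for an additive fit on the numeric variables
# (lm() silently drops the rows where total_bedrooms is still NA)
vif(lm(median_house_value ~ ., data = housing_data_nc))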

Training and Test Data

First, we want to split the data into training and testing sets. We will use a 70/30 split of a randomized sample, with a set seed for repeatability.

Note that we stratified the partition on ocean_proximity, since this is a predictor that we believe will play a large role in predicting housing prices.

set.seed(420)
housing_trn_idx = createDataPartition(housing_data$ocean_proximity, p = .70, list = FALSE)
housing_trn_data = housing_data[housing_trn_idx, ]
housing_tst_data = housing_data[-housing_trn_idx, ]
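As a quick sanity check (not part of the original analysis), we can confirm that the split came out close to 70/30:

# proportion of observations that landed in each split
c(train = nrow(housing_trn_data), test = nrow(housing_tst_data)) / nrow(housing_data)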

Model Selection

Using All Model Subsets

As a starting point, we want to get an idea of which predictors, and which of their interactions, are good candidates for a model. We will create three initial models, fitted on our training data. The first is an additive model that uses all variables. The second is a two-way model that uses all variables and their two-way interactions. The third, and final, initial model is a three-way model that uses all variables along with their two- and three-way interactions. We create these models below.

full_additive_model = lm(median_house_value ~ ., data = housing_trn_data)
full_additive_adjr2 = summary(full_additive_model)$adj.r.squared

full_twoway_model = lm(median_house_value ~ (.)^2, data = housing_trn_data)
full_twoway_adjr2 = summary(full_twoway_model)$adj.r.squared

full_threeway_model = lm(median_house_value ~ (.)^3, data = housing_trn_data)
full_threeway_adjr2 = summary(full_threeway_model)$adj.r.squared

In order to take a look at the initial results of these models, we will use the model selection criteria AIC and adjusted R². We will also include the total number of parameters in each model to get a feel for its complexity.

# note: extractAIC(model)[1] is the equivalent degrees of freedom, i.e., the
# number of fitted parameters in the model (including the intercept)
beginning_mods_results = data.frame(
  "Total Predictors" =
    c("Additive Model" = extractAIC(full_additive_model)[1],
      "Two-Way Int. Model" = extractAIC(full_twoway_model)[1],
      "Three-Way Int. Model" = extractAIC(full_threeway_model)[1]),
  "AIC" =
    c("Additive Model" = extractAIC(full_additive_model)[2],
      "Two-Way Int. Model" = extractAIC(full_twoway_model)[2],
      "Three-Way Int. Model" = extractAIC(full_threeway_model)[2]),
  "Adj R-Squared" =
    c("Additive Model" = full_additive_adjr2,
      "Two-Way Int. Model" = full_twoway_adjr2,
      "Three-Way Int. Model" = full_threeway_adjr2))

kable(beginning_mods_results, align = c("c", "r"))
|                      | Total.Predictors |      AIC | Adj.R.Squared |
|----------------------|-----------------:|---------:|--------------:|
| Additive Model       |               13 | 318440.8 |     0.6522842 |
| Two-Way Int. Model   |               68 | 316040.7 |     0.7071233 |
| Three-Way Int. Model |              208 | 314536.9 |     0.7388959 |

We see that the model with the best (i.e., lowest) AIC is full_threeway_model, with a score of 314536.9 (about 3.15×10^5). However, although this model has the lowest AIC, it is also far more complex than the other two models: 208 parameters, compared to 68 for full_twoway_model and 13 for full_additive_model. This is something to keep in mind as we move forward, as model complexity should be taken into account.

Using AIC and BIC

All three of these models are good candidates for stepwise and backward AIC and BIC searches; however, we will not run these searches on the three-way model, due to the compute power and time the step function would require. Since the three-way model was the best of our initial three, we will leave it as-is and compare it against the results of stepwise and backward searches on the additive and two-way models.

back_additive_mod_finish_aic = step(full_additive_model, direction = "backward", trace = 0)
both_additive_mod_finish_aic = step(full_additive_model, direction = "both", trace = 0)

n = length(resid(full_additive_model))
back_additive_mod_finish_bic = step(full_additive_model, direction = "backward", k = log(n), trace = 0)
both_additive_mod_finish_bic = step(full_additive_model, direction = "both", k = log(n), trace = 0)

back_twoway_mod_finish_aic = step(full_twoway_model, direction = "backward", trace = 0)
both_twoway_mod_finish_aic = step(full_twoway_model, direction = "both", trace = 0)

n = length(resid(full_twoway_model))
back_twoway_mod_finish_bic = step(full_twoway_model, direction = "backward", k = log(n), trace = 0)
both_twoway_mod_finish_bic = step(full_twoway_model, direction = "both", k = log(n), trace = 0)
aic_and_bic_results = data.frame(
  "AIC" =
    c("Backward" =
        c("Additive" = extractAIC(back_additive_mod_finish_aic)[2],
          "Two-Way" = extractAIC(back_twoway_mod_finish_aic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2]),
      "Both" =
        c("Additive" = extractAIC(both_additive_mod_finish_aic)[2],
          "Two-Way" = extractAIC(both_twoway_mod_finish_aic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2])),
  "BIC" =
    c("Backward" =
        c("Additive" = extractAIC(back_additive_mod_finish_bic)[2],
          "Two-Way" = extractAIC(back_twoway_mod_finish_bic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2]),
      "Both" =
        c("Additive" = extractAIC(both_additive_mod_finish_bic)[2],
          "Two-Way" = extractAIC(both_twoway_mod_finish_bic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2])))

kable(aic_and_bic_results)
|                    |      AIC |      BIC |
|--------------------|---------:|---------:|
| Backward.Additive  | 318440.8 | 318440.8 |
| Backward.Two-Way   | 316033.8 | 316040.4 |
| Backward.Three-way | 314536.9 | 314536.9 |
| Both.Additive      | 318440.8 | 318440.8 |
| Both.Two-Way       | 316033.8 | 316040.4 |
| Both.Three-way     | 314536.9 | 314536.9 |

From the table above, we can see that the initial three-way model beats all of the models selected by stepwise and backward search under both the AIC and BIC criteria, which suggests it could be our best model. However, we should note that the three-way model includes a large number of variables, which makes it a very complex model.

With complexity in mind, we will take a look at the AIC of the other models; from now on, we will use AIC as our main selection criterion. The second-lowest AIC is tied between the models selected by backward and stepwise search on the two-way model, which we suspected were the exact same model. Checking with R’s identical() function returns TRUE, so we will use back_twoway_mod_finish_aic as our best model from this chart.
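A minimal sketch of that check (assuming the comparison was made on the fitted coefficients; comparing the raw lm objects directly would also pick up irrelevant differences such as the stored calls):

# do backward and stepwise search arrive at the same two-way model?
identical(coef(back_twoway_mod_finish_aic), coef(both_twoway_mod_finish_aic))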

We now want to run diagnostics on some of the models we created to check the assumptions of normality and constant variance (homoskedasticity), since we cannot rely solely on AIC for model selection and must take assumptions into account. These models are back_additive_mod_finish_aic, back_twoway_mod_finish_aic, and full_threeway_model, each selected for having the lowest AIC within its class of initial models (additive, two-way, and three-way).

diagnostics = function(model, alpha = .05, pointcol = "orange", linecol = "blue", plots = TRUE, tests = TRUE, pointtype = 16) {
    if (plots == TRUE) {
        par(mfrow = c(1, 3))
        plot(
                fitted(model),
                resid(model),
                pch = pointtype,
                xlab = "Fitted Values",
                ylab = "Residuals",
                main = "Fitted vs Residuals",
                col = pointcol
            )
        abline(h = 0, lwd = 2, col = linecol)
        
        qqnorm(
                resid(model),
                pch = pointtype,
                main = "QQNorm Plot",
                col = pointcol
            )
        qqline(
                resid(model),
                lwd = 2,
                col = linecol
                )
        hist(
            resid(model),
            main = "Histogram of Residuals",
            col = pointcol,
            xlab = "Residuals",
            ylab = "Frequency"
            )
    }
    if (tests == TRUE) {
        ad_test = ad.test(resid(model))
        bp_test = bptest(model)
        test_results = data.frame(
          "Anderson-Darling Normality Test" =
            c("Test Statistic" = round(ad_test$statistic, 5),
              "P-Value" = ad_test$p.value,
              "Result" = ifelse(ad_test$p.value < alpha, "Reject", "Fail To Reject")),
          "Breusch-Pagan Test" =
            c("Test Statistic" = round(bp_test$statistic, 5),
              "P-Value" = bp_test$p.value,
              "Result" = ifelse(bp_test$p.value < alpha, "Reject", "Fail To Reject")))

        kable(t(test_results), col.names = c("Test Statistic", "P-Value", "Decision"))
    }
}
diagnostics(back_additive_mod_finish_aic)

|                                 | Test Statistic | P-Value               | Decision |
|---------------------------------|---------------:|----------------------:|----------|
| Anderson.Darling.Normality.Test | 246.14931      | 3.7e-24               | Reject   |
| Breusch.Pagan.Test              | 654.98337      | 1.88647093798867e-132 | Reject   |
diagnostics(back_twoway_mod_finish_aic)

|                                 | Test Statistic | P-Value               | Decision |
|---------------------------------|---------------:|----------------------:|----------|
| Anderson.Darling.Normality.Test | 264.25079      | 3.7e-24               | Reject   |
| Breusch.Pagan.Test              | 1430.97881     | 3.16163157206095e-258 | Reject   |
diagnostics(full_threeway_model)

|                                 | Test Statistic | P-Value              | Decision |
|---------------------------------|---------------:|---------------------:|----------|
| Anderson.Darling.Normality.Test | 281.23411      | 3.7e-24              | Reject   |
| Breusch.Pagan.Test              | 1361.69389     | 6.0452470988803e-169 | Reject   |

For our diagnostics function, we include a scatter plot of fitted values against residuals, a normal Q-Q plot, and a histogram of the residuals in order to visualize the model and how well our assumptions hold. For the tabular results, we run two tests.

The first test is the Anderson-Darling test, which tests our normality assumption; its null hypothesis is that the data are sampled from a normal distribution. Note that we use Anderson-Darling instead of Shapiro-Wilk because our dataset is large (20635 observations), and R’s shapiro.test is limited to at most 5000 observations.

The second test is the Breusch-Pagan Test, which tests whether our model is homoskedastic (or, whether the residuals have a constant variance). The null hypothesis for this test is that our model is homoskedastic.

Looking at the plots, we are concerned that our models do not hold to the assumptions of normality and homoskedasticity. Looking further at the test results, we see that for each model we reject both null hypotheses. Given these rejections, we have decided that these models are better suited for prediction than for inference, since the assumptions do not hold.

Recall that our best model is back_twoway_mod_finish_aic, the model found by a backward search on the full two-way model using AIC as the criterion. To combat the rejections of the null hypotheses above, we will attempt to transform the response variable, median_house_value, to hopefully get closer to satisfying our model assumptions.

back_twoway_mod_finish_aic_log = lm(formula = log(median_house_value) ~ longitude + latitude + housing_median_age + 
    total_rooms + total_bedrooms + population + households + 
    median_income + ocean_proximity + longitude:latitude + longitude:housing_median_age + 
    longitude:total_rooms + longitude:total_bedrooms + longitude:households + 
    longitude:median_income + longitude:ocean_proximity + latitude:housing_median_age + 
    latitude:total_rooms + latitude:total_bedrooms + latitude:median_income + 
    latitude:ocean_proximity + housing_median_age:total_rooms + 
    housing_median_age:population + housing_median_age:households + 
    housing_median_age:median_income + housing_median_age:ocean_proximity + 
    total_rooms:population + total_rooms:households + total_rooms:median_income + 
    total_rooms:ocean_proximity + total_bedrooms:households + 
    total_bedrooms:median_income + total_bedrooms:ocean_proximity + 
    population:households + population:median_income + population:ocean_proximity + 
    households:median_income + households:ocean_proximity + median_income:ocean_proximity, 
    data = housing_trn_data)
diagnostics(back_twoway_mod_finish_aic_log)

|                                 | Test Statistic | P-Value               | Decision |
|---------------------------------|---------------:|----------------------:|----------|
| Anderson.Darling.Normality.Test | 49.66267       | 3.7e-24               | Reject   |
| Breusch.Pagan.Test              | 1236.74046     | 1.33339040467781e-218 | Reject   |

Doing so, we can see that we get closer to a normal distribution when taking the log of median_house_value; however, we still reject both null hypotheses. Since the transformation did not give much improvement on our assumptions, we will stick with the untransformed model and accept that our assumptions do not hold. While this decreases the validity of the model for inference, it can still be useful for prediction, and since we are attempting to predict the values of homes, we are okay with this.

Detect Overfitting

For our last model selection criterion, we will be utilizing leave-one-out cross-validated RMSE (hereafter referred to as LOOCV RMSE).

LOOCV RMSE corrects for the fact that RMSE alone always prefers the larger model by instead assessing how the model fits “left out” data that wasn’t used to perform the regression.
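For a linear model, this quantity can be computed without actually refitting the model n times, using the residuals e_i and leverage values h_i of the full fit:

$$\text{LOOCV RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1 - h_i}\right)^2}$$

This shortcut is exactly what the first helper function below implements via resid() and hatvalues().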

# leave-one-out cross-validated RMSE, computed via the leverage shortcut
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
# root mean squared error between actual and predicted values
calc_rmse = function(actual, predicted) {
  sqrt(sum((actual - predicted)^2) / length(actual)) 
}
# average absolute percent error between actual and predicted values
calc_avg_per_error = function(actual, predicted) {
    inter_abs = abs(predicted - actual)
    100 * (sum(inter_abs / actual)) / length(actual)
}
additive_loocv_rmse = calc_loocv_rmse(back_additive_mod_finish_aic)
twoway_loocv_rmse = calc_loocv_rmse(back_twoway_mod_finish_aic)
threeway_loocv_rmse = calc_loocv_rmse(full_threeway_model)

loocv_rmse_results = data.frame(
  "LOOCV-RMSE" =
    c("Backwards Additive" = additive_loocv_rmse,
      "Backwards Two-Way" = twoway_loocv_rmse,
      "Initial Three-way" = threeway_loocv_rmse))

kable(loocv_rmse_results)
|                    | LOOCV.RMSE |
|--------------------|-----------:|
| Backwards Additive |   68363.54 |
| Backwards Two-Way  |        Inf |
| Initial Three-way  |        Inf |

From the table above, the Backwards Two-Way and Initial Three-way models actually return an infinite LOOCV-RMSE: both fits are rank deficient, which leaves some observations with leverage values of 1 and makes the leave-one-out residual undefined. The Backwards Additive model is the only one with a finite LOOCV-RMSE (68363.54). Since LOOCV-RMSE cannot distinguish between the two interaction models, we fall back on our AIC results above and keep the Backwards Two-Way Search model, back_twoway_mod_finish_aic, as our best model. We will now evaluate this model on the test set that we held back in order to see how it performs on unseen data.
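A quick check (not part of the original analysis) confirms where the infinite values come from:

# count observations whose leverage is numerically 1; for these points the
# LOOCV residual resid / (1 - hatvalue) divides by zero and becomes infinite
sum(hatvalues(back_twoway_mod_finish_aic) > 1 - 1e-10)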

Results

Predicting Housing Prices

# the actual median house values from the test set
test_actual = housing_tst_data$median_house_value
# the predicted house values for the test set
test_predictions = predict(back_twoway_mod_finish_aic, housing_tst_data)
## Warning in predict.lm(back_twoway_mod_finish_aic, housing_tst_data): prediction
## from a rank-deficient fit may be misleading
# the RMSE
test_rmse = calc_rmse(test_actual, test_predictions)
# the percentage error
test_perc_error = calc_avg_per_error(test_actual, test_predictions)

test_rmse
## [1] 65410.62

test_perc_error
## [1] 25.5

We evaluated our final model, the Backwards Two-Way Search model, on unseen test data using RMSE as well as average percent error. We got an RMSE of 65410.62, which is in line with the finite LOOCV-RMSE we obtained on the training data (68363.54 for the backwards additive model). Since these values are close, our model does not appear to be overfitting. For average percent error, we got a result of 25.5%.

Discussion

Our final model, when evaluated on unseen test data, had an RMSE of 65410.62. Roughly speaking, this means that on average our model’s predicted housing price is off by about $65,000 from the actual price, which corresponds to an average percent error of about 25.5%.

While this is higher than we might have hoped for, we believe this error isn’t too bad considering that the majority of the predictors were location-based. As we noted in the Introduction, location, and more specifically ocean_proximity, gives us quite a lot of information about housing prices, but it doesn’t tell the whole story.

Below we show the range of our model’s error by ocean_proximity. (Note that there is no data for the ISLAND level, because we removed it from the dataset due to its very low representation.)

# widen the left margin so the ocean_proximity labels fit
op1 = par("mar")
op1[2] = 7
par(mar = op1)
plot(test_predictions - test_actual, housing_tst_data$ocean_proximity,
     xlab = "Predicted Price Over / Under Actual Price",
     ylab = "",
     main = "Predicted Price Over / Under Actual Price vs Ocean Proximity",
     col = "dodgerblue", yaxt = "n")
axis(2, at = 1:5, labels = levels(housing_tst_data$ocean_proximity), las = 1)

A general trend worth noting is that we tend to underestimate the price a bit more frequently than we overestimate it. Dividing the data by ocean_proximity, we see that for the most part the range of the errors is consistent across the levels. However, there are some extreme outliers when ocean_proximity is INLAND: in particular, our model underestimated the prices of two houses by more than $800,000 each. This is quite a jump from the rest of the errors, and well beyond our RMSE of 65410.62.

In order to see if these outlier data points align with the issues we saw in the Introduction, we plotted each of our model’s errors in the test data.

plot_map = ggplot(housing_tst_data, 
                  aes(x = longitude, y = latitude, 
                      color = test_predictions - test_actual, 
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
              geom_point(aes(size = abs(test_predictions - test_actual)), alpha = 0.4) +
              xlab("Longitude") +
              ylab("Latitude") +
              ggtitle("Predicted Price Over / Under Actual Price") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_color_distiller(palette = "Paired", labels = comma) +
              labs(color = "Predicted Price Over / Under (in $USD)", 
                   size = "Magnitude of Price Difference")
plot_map
## Warning: Removed 60 rows containing missing values (geom_point).

Those two troublesome data points are indeed located in one of the inland areas with more expensive prices - namely, near the Sacramento area. It’s possible that removing these two values could have helped us more accurately estimate the prices of inland houses.

Overall, our model was able to predict the prices of houses in California in 1990 within a ± 25.5% error rate based on the California Housing Prices dataset.