Team Members: Ahuja Nikita (M13480138), Ghumekar Aishwarya (M13262448), Khan Nida (M12962725), Nevaskar Parnika (M13480285)

Outline

  1. Introduction
  2. Data Cleaning
  3. Exploratory Data Analysis
  4. Model Specification
  5. Model Adequacy Check: Linear Model
  6. Model Re-specification
  7. Model Validation
  8. Prediction Insights

Summary

In today’s technologically dependent world there are many different ways in which a real estate agent can use predictive analytics: from lead generation and property generation, through lead nurturing and property analysis, to buyer-to-seller matching and closing deals. Predictive analytics allows real estate agents to conduct business in less time with higher accuracy. In this project we are trying to build a suitable model to most accurately predict housing prices for a given demographic (in our case California). We began with data cleaning techniques and moved on to the EDA part to identify abnormalities in the data and to gauge the data in terms of normality and skewness. Then we created a simple linear regression model to identify the significant covariates and moved further to another model after removing the non-influential variables (including the categorical variables). After performing residual analysis and certain transformations we reached to a consensus that another model must be fitted in order to get a more accurate depiction of the data. Finally, we generated a polynomial regression model with degree as 2, which gave us better p-value and Adjusted R-Squared value than the previous model. Also, upon checking the model adequacy and comparing this model with the previous one using AIC values, this model appeared to be a better choice. Hence, performing the predictions and LOOCV on this model was agreed upon. We also found some interesting insights in terms of the prediction errors which we have discussed in the prediction insights section of our project. As a future scope, better fitted models can also be obtained by performing some advanced statistical techniques on this data like Random Forest. In conclusion, we can say that as of now predictive analytics is barely scratching the surface of the real estate sector. However, the impact of this technology is apparent across the industry. There’s no doubt that predictive analytics will drastically change the landscape of this market in years to come for residential as well as commercial real estate and on both the buying and selling side.

Required Libraries

library(psych)
library(corrplot)
library(Hmisc)
library(data.table)
library(mltools)
library(knitr)
library(caTools)
library(caret)
library(MASS)

1. Introduction

Purpose of the Project

The purpose of this project is to predict the median house value using predictors such as location, distance from the ocean, and information about the other houses in a block.

The data pertains to the houses found in a given California district and some summary stats about them based on the 1990 census data. While it is understood that the prices have increased drastically since 1990, this model is an attempt to explore the data analysis methods discussed in the course and hence can’t be used to predict housing prices in the present.

Housing Prices in California Dataset

We are using the dataset of Housing Prices in California in the year 1990 available on Kaggle (https://www.kaggle.com/camnugent/california-housing-prices).

housing <- read.csv(file = "C:/Users/khann/Desktop/COLLEGE/R/housing.csv",   header=TRUE)

str(housing)
## 'data.frame':    20640 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...

There are 20640 rows in the dataset with 10 attributes. The description of each of these as given on kaggle is as below:

  1. longitude: A measure of how far west a house is; a higher value is farther west
  2. latitude: A measure of how far north a house is; a higher value is farther north
  3. housing_median_age: Median age of a house within a block; a lower number is a newer building
  4. total_rooms: Total number of rooms within a block
  5. total_bedrooms: Total number of bedrooms within a block
  6. population: Total number of people residing within a block
  7. households: Total number of households, a group of people residing within a home unit, for a block
  8. median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
  9. median_house_value: Median house value for households within a block (measured in US Dollars)
  10. ocean_proximity: Location of the house w.r.t ocean/sea

Among these, “median_house_value” is the response variable and others will be the covariates.

Next we check the summary of each variable:

summary(housing)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value   ocean_proximity
##  Min.   : 14999     <1H OCEAN :9136  
##  1st Qu.:119600     INLAND    :6551  
##  Median :179700     ISLAND    :   5  
##  Mean   :206856     NEAR BAY  :2290  
##  3rd Qu.:264725     NEAR OCEAN:2658  
##  Max.   :500001                      
## 

It can be observed that “ocean_proximity” is the only categorical variable in the data with five distinct values. Among these, the number of observations where its value is “ISLAND” is just 5 which is very very less than the others.

Also, there are 207 observations where the variable “total_bedrooms” has NA values.

In the process of data cleaning, we need to take care of these issues.

2. Data Cleaning

Missing Data

As mentioned in the previous section, there are 207 observations with NA values for “total_bedrooms”. We can handle this by imputing median value (435) at these places.

housing$total_bedrooms[is.na(housing$total_bedrooms)] <- median(housing$total_bedrooms, na.rm=TRUE)

sum(is.na(housing))
## [1] 0

Categorical Data

We have the following levels in the categorical variable “ocean_proximity” :

summary(housing$ocean_proximity)
##  <1H OCEAN     INLAND     ISLAND   NEAR BAY NEAR OCEAN 
##       9136       6551          5       2290       2658

As observed, the level “ISLAND” has just 5 records which is very low as compared to other levels which have more than 2000 records. We can eliminate this level from the data as there could be issues with model fitting.

housing <- housing[housing$ocean_proximity != "ISLAND", ]

nrow(housing)
## [1] 20635

3. Exploratory Data Analysis

After we have cleaned the data, we will visualize the data to get some insights into the distribution and skewness of numeric data as well as correlation of the variables with each other.

Histograms for numeric variables:

par(mfrow = c(3, 3))
hist(housing$longitude, main = "longitude", col="slateblue1")
hist(housing$latitude, main = "latitude", col="slateblue1")
hist(housing$housing_median_age, main = "housing_median_age", col="slateblue1")
hist(housing$total_rooms, main = "total_rooms", col="slateblue1")
hist(housing$total_bedrooms, main = "total_bedrooms", col="slateblue1")
hist(housing$population, main = "population", col="slateblue1")
hist(housing$households, main = "households", col="slateblue1")
hist(housing$median_income, main = "median_income", col="slateblue1")
hist(housing$median_house_value, main = "median_house_value", col="slateblue1")

Scatter plots for variables:

par(mfrow = c(1, 1))
pairs(housing, col = "slateblue1")

We further observe the correlation between the numeric variables by checking their correlation coefficients.

corr_matrix <- round(cor(housing[,1:9], ), digits = 2)

kable(t(corr_matrix))
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
longitude 1.00 -0.92 -0.11 0.04 0.07 0.10 0.06 -0.02 -0.05
latitude -0.92 1.00 0.01 -0.04 -0.07 -0.11 -0.07 -0.08 -0.14
housing_median_age -0.11 0.01 1.00 -0.36 -0.32 -0.30 -0.30 -0.12 0.11
total_rooms 0.04 -0.04 -0.36 1.00 0.93 0.86 0.92 0.20 0.13
total_bedrooms 0.07 -0.07 -0.32 0.93 1.00 0.87 0.97 -0.01 0.05
population 0.10 -0.11 -0.30 0.86 0.87 1.00 0.91 0.00 -0.02
households 0.06 -0.07 -0.30 0.92 0.97 0.91 1.00 0.01 0.07
median_income -0.02 -0.08 -0.12 0.20 -0.01 0.00 0.01 1.00 0.69
median_house_value -0.05 -0.14 0.11 0.13 0.05 -0.02 0.07 0.69 1.00

From this correlation matrix, it is observed that there is high correlation between “households” and “total_bedrooms”, as well as “households” and “total_rooms”. This can cause the problem of multicollinearity but since these can be influential in the pricing of homes, we decided to keep the covariates. We can deal with that, if required, using appropriate methods.

4. Model Specification

Partitioning Dataset into Training and Test Data

We decided to separate training data from test data by spliting the original dataset in the ratio of 70:30 using “ocean_proximity”.

set.seed(123)

sample <- sample.split(housing$ocean_proximity, SplitRatio = .70)

housing_training_data <- subset(housing, sample == TRUE)
housing_test_data <- subset(housing, sample == FALSE)

Linear Regression model

model1 = lm(median_house_value ~ ., data = housing_training_data)
summary(model1)
## 
## Call:
## lm(formula = median_house_value ~ ., data = housing_training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374349  -42914  -10376   29344  494284 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.280e+06  1.064e+05 -21.431  < 2e-16 ***
## longitude                 -2.711e+04  1.235e+03 -21.955  < 2e-16 ***
## latitude                  -2.607e+04  1.219e+03 -21.391  < 2e-16 ***
## housing_median_age         1.075e+03  5.238e+01  20.531  < 2e-16 ***
## total_rooms               -2.645e+00  9.165e-01  -2.886  0.00391 ** 
## total_bedrooms             5.917e+01  6.975e+00   8.483  < 2e-16 ***
## population                -4.298e+01  1.341e+00 -32.047  < 2e-16 ***
## households                 8.998e+01  7.902e+00  11.387  < 2e-16 ***
## median_income              3.793e+04  3.930e+02  96.503  < 2e-16 ***
## ocean_proximityINLAND     -3.858e+04  2.081e+03 -18.540  < 2e-16 ***
## ocean_proximityNEAR BAY   -2.503e+03  2.276e+03  -1.100  0.27131    
## ocean_proximityNEAR OCEAN  3.441e+03  1.864e+03   1.846  0.06490 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68540 on 14433 degrees of freedom
## Multiple R-squared:  0.6451, Adjusted R-squared:  0.6448 
## F-statistic:  2385 on 11 and 14433 DF,  p-value: < 2.2e-16

As you can see, two levels of the categorical variable, “ocean_proximityNEAR BAY” and “ocean_proximityNEAR OCEAN”, are insignificant. We use One hot encoding on the data and remove all the insignificant variables to create a new model.

dmy <- dummyVars(" ~ .", data = housing_training_data)
trsf <- data.frame(predict(dmy, newdata = housing_training_data))

model2 = lm(median_house_value ~ . - ocean_proximity.NEAR.OCEAN - ocean_proximity.ISLAND - ocean_proximity..1H.OCEAN - ocean_proximity.NEAR.BAY, data = trsf)
summary(model2)
## 
## Call:
## lm(formula = median_house_value ~ . - ocean_proximity.NEAR.OCEAN - 
##     ocean_proximity.ISLAND - ocean_proximity..1H.OCEAN - ocean_proximity.NEAR.BAY, 
##     data = trsf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374502  -42850  -10515   29467  493747 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.294e+06  1.022e+05 -22.450   <2e-16 ***
## longitude              -2.738e+04  1.194e+03 -22.923   <2e-16 ***
## latitude               -2.655e+04  1.192e+03 -22.275   <2e-16 ***
## housing_median_age      1.057e+03  5.141e+01  20.564   <2e-16 ***
## total_rooms            -2.571e+00  9.136e-01  -2.814   0.0049 ** 
## total_bedrooms          5.917e+01  6.975e+00   8.483   <2e-16 ***
## population             -4.309e+01  1.332e+00 -32.358   <2e-16 ***
## households              8.962e+01  7.902e+00  11.341   <2e-16 ***
## median_income           3.786e+04  3.915e+02  96.722   <2e-16 ***
## ocean_proximity.INLAND -3.832e+04  2.045e+03 -18.738   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68550 on 14435 degrees of freedom
## Multiple R-squared:  0.6449, Adjusted R-squared:  0.6447 
## F-statistic:  2913 on 9 and 14435 DF,  p-value: < 2.2e-16

5. Model Adequacy Check: Linear Model

To check the adequacy of the model, we need to perform the residual analysis.To do this, we need to plot:

  1. QQ-plot of residuals to:
    • check the Normality assumption
    • look for potential outliers
  2. Scatter plot of residuals against fitted values or against regressors to:
    • check for Equal variance
    • check for Linearity
    • look for potential outliers
  3. Scatter plot of residuals against time order to check for Independent errors

Residual VS fitted value plot for linear model

plot(model2, which=c(1,1))

Interpretation: As can be seen, the residual vs fitted value plot is slightly non-linear in nature. The hard capped line might be because one of the covariates has records with a constant value or the response vairable is linearly dependent on one or more covariates.

QQ plot for linear model

par(mfrow = c(1, 2))
#Histogram of residuals
hist(resid(model2))

#QQPlot
plot(model2, which=c(2,2))

Interpretation: As can be seen from the histogram. There is more data located at the extremes compared to normal distribution. The last quantile has more values than last theoretical quantile.

From the above plots, we notice that the LINE assumptions have been violated in this model. We will now perform some transformations to see if we can have a better model.

Residual Vs Leverage plot to Identify Outliers

plot(model2, which=c(5,5))

Interpretation: There are hardly any points outside the Cook’s distance line. This implies that there are no influential outliers in the data.

Box-Cox Transformation

We perform the Box-Cox Transformation on the model to figure out optimal transformation.

par(mfrow = c(1, 1))
boxcox(model2)

The value of λ is close to 0, indicating log transformation as the best suited for this model.

Based on this, we define a new model and do the residual analysis again.

log_housing <- lm(log(median_house_value) ~ . - ocean_proximity.NEAR.OCEAN - ocean_proximity.ISLAND - ocean_proximity..1H.OCEAN - ocean_proximity.NEAR.BAY, data = trsf)

par(mfrow = c(1, 2))
#QQ Plot
qqnorm(log_housing$residuals,main="log_housing")
qqline(log_housing$residuals)

#Scatter Plot of Residuals against fitted values
plot(log_housing$fitted.values,log_housing$residuals,pch=20)
abline(h=0,col="grey")

As can be observed, there is no significant effect of transformation on our model. Hence we decided to re-define the model with higher degree in the next section.

6. Model Re-specification

Polynominal Regression Model with Degree 2

Since the Residual vs Fitted value plot showed that there is non-linearity, we create a polynomial regression model and compare the linear model with polynominal model of degree 2.

modelP1 = lm(median_house_value ~ (.)^2, data = housing_training_data)

summary(modelP1)
## 
## Call:
## lm(formula = median_house_value ~ (.)^2, data = housing_training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -336920  -36901   -8953   24909  650783 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                  -6.456e+06  1.122e+06  -5.755
## longitude                                    -3.989e+04  9.948e+03  -4.010
## latitude                                      3.303e+05  3.324e+04   9.936
## housing_median_age                           -9.755e+04  9.531e+03 -10.235
## total_rooms                                   8.455e+02  2.012e+02   4.203
## total_bedrooms                               -1.446e+03  1.274e+03  -1.135
## population                                   -4.887e+02  2.301e+02  -2.124
## households                                   -1.954e+03  1.364e+03  -1.433
## median_income                                -7.822e+05  7.887e+04  -9.918
## ocean_proximityINLAND                        -2.372e+05  3.083e+05  -0.769
## ocean_proximityNEAR BAY                      -2.498e+07  1.473e+06 -16.961
## ocean_proximityNEAR OCEAN                    -1.178e+06  3.755e+05  -3.137
## longitude:latitude                            2.340e+03  2.581e+02   9.068
## longitude:housing_median_age                 -1.213e+03  1.102e+02 -11.011
## longitude:total_rooms                         1.108e+01  2.369e+00   4.678
## longitude:total_bedrooms                     -2.688e+01  1.476e+01  -1.821
## longitude:population                         -5.912e+00  2.754e+00  -2.146
## longitude:households                         -1.862e+01  1.551e+01  -1.201
## longitude:median_income                      -1.006e+04  9.309e+02 -10.808
## longitude:ocean_proximityINLAND              -5.397e+02  3.656e+03  -0.148
## longitude:ocean_proximityNEAR BAY            -2.463e+05  1.307e+04 -18.845
## longitude:ocean_proximityNEAR OCEAN          -1.145e+04  4.481e+03  -2.554
## latitude:housing_median_age                  -1.341e+03  1.082e+02 -12.401
## latitude:total_rooms                          1.333e+01  2.415e+00   5.520
## latitude:total_bedrooms                      -4.649e+01  1.463e+01  -3.178
## latitude:population                          -5.600e+00  2.977e+00  -1.881
## latitude:households                          -1.292e+01  1.471e+01  -0.878
## latitude:median_income                       -1.106e+04  9.568e+02 -11.562
## latitude:ocean_proximityINLAND                1.657e+03  3.797e+03   0.437
## latitude:ocean_proximityNEAR BAY             -1.353e+05  1.044e+04 -12.963
## latitude:ocean_proximityNEAR OCEAN           -5.862e+03  4.675e+03  -1.254
## housing_median_age:total_rooms               -2.029e-01  8.397e-02  -2.417
## housing_median_age:total_bedrooms             1.151e-01  7.413e-01   0.155
## housing_median_age:population                -2.032e+00  1.220e-01 -16.656
## housing_median_age:households                 7.629e+00  8.548e-01   8.925
## housing_median_age:median_income              5.154e+01  2.882e+01   1.788
## housing_median_age:ocean_proximityINLAND      1.059e+03  1.753e+02   6.038
## housing_median_age:ocean_proximityNEAR BAY   -3.717e+02  1.953e+02  -1.903
## housing_median_age:ocean_proximityNEAR OCEAN  2.114e+02  1.601e+02   1.321
## total_rooms:total_bedrooms                   -1.774e-03  1.886e-03  -0.941
## total_rooms:population                       -2.484e-03  8.045e-04  -3.088
## total_rooms:households                        8.825e-03  3.047e-03   2.896
## total_rooms:median_income                     2.326e+00  3.625e-01   6.417
## total_rooms:ocean_proximityINLAND            -1.109e+01  3.534e+00  -3.139
## total_rooms:ocean_proximityNEAR BAY           4.141e+00  4.013e+00   1.032
## total_rooms:ocean_proximityNEAR OCEAN         3.178e+00  2.960e+00   1.074
## total_bedrooms:population                     5.185e-03  6.556e-03   0.791
## total_bedrooms:households                    -5.562e-02  1.007e-02  -5.524
## total_bedrooms:median_income                 -5.707e+00  4.068e+00  -1.403
## total_bedrooms:ocean_proximityINLAND          5.206e+00  2.910e+01   0.179
## total_bedrooms:ocean_proximityNEAR BAY        2.558e+01  2.967e+01   0.862
## total_bedrooms:ocean_proximityNEAR OCEAN      4.111e+01  2.773e+01   1.483
## population:households                         1.444e-02  5.556e-03   2.598
## population:median_income                     -1.028e+01  1.084e+00  -9.485
## population:ocean_proximityINLAND              2.640e+01  4.660e+00   5.665
## population:ocean_proximityNEAR BAY           -4.312e-01  6.425e+00  -0.067
## population:ocean_proximityNEAR OCEAN          1.412e+01  4.055e+00   3.481
## households:median_income                      3.108e+01  5.326e+00   5.836
## households:ocean_proximityINLAND             -5.480e+00  3.182e+01  -0.172
## households:ocean_proximityNEAR BAY           -5.147e+01  3.446e+01  -1.494
## households:ocean_proximityNEAR OCEAN         -9.796e+01  3.191e+01  -3.070
## median_income:ocean_proximityINLAND           9.086e+03  1.442e+03   6.300
## median_income:ocean_proximityNEAR BAY         9.473e+02  1.398e+03   0.678
## median_income:ocean_proximityNEAR OCEAN       2.595e+03  1.147e+03   2.263
##                                              Pr(>|t|)    
## (Intercept)                                  8.85e-09 ***
## longitude                                    6.10e-05 ***
## latitude                                      < 2e-16 ***
## housing_median_age                            < 2e-16 ***
## total_rooms                                  2.65e-05 ***
## total_bedrooms                               0.256562    
## population                                   0.033677 *  
## households                                   0.151856    
## median_income                                 < 2e-16 ***
## ocean_proximityINLAND                        0.441681    
## ocean_proximityNEAR BAY                       < 2e-16 ***
## ocean_proximityNEAR OCEAN                    0.001711 ** 
## longitude:latitude                            < 2e-16 ***
## longitude:housing_median_age                  < 2e-16 ***
## longitude:total_rooms                        2.92e-06 ***
## longitude:total_bedrooms                     0.068636 .  
## longitude:population                         0.031857 *  
## longitude:households                         0.229731    
## longitude:median_income                       < 2e-16 ***
## longitude:ocean_proximityINLAND              0.882666    
## longitude:ocean_proximityNEAR BAY             < 2e-16 ***
## longitude:ocean_proximityNEAR OCEAN          0.010644 *  
## latitude:housing_median_age                   < 2e-16 ***
## latitude:total_rooms                         3.45e-08 ***
## latitude:total_bedrooms                      0.001485 ** 
## latitude:population                          0.060004 .  
## latitude:households                          0.379720    
## latitude:median_income                        < 2e-16 ***
## latitude:ocean_proximityINLAND               0.662428    
## latitude:ocean_proximityNEAR BAY              < 2e-16 ***
## latitude:ocean_proximityNEAR OCEAN           0.209963    
## housing_median_age:total_rooms               0.015663 *  
## housing_median_age:total_bedrooms            0.876581    
## housing_median_age:population                 < 2e-16 ***
## housing_median_age:households                 < 2e-16 ***
## housing_median_age:median_income             0.073788 .  
## housing_median_age:ocean_proximityINLAND     1.60e-09 ***
## housing_median_age:ocean_proximityNEAR BAY   0.057055 .  
## housing_median_age:ocean_proximityNEAR OCEAN 0.186640    
## total_rooms:total_bedrooms                   0.346837    
## total_rooms:population                       0.002021 ** 
## total_rooms:households                       0.003786 ** 
## total_rooms:median_income                    1.43e-10 ***
## total_rooms:ocean_proximityINLAND            0.001701 ** 
## total_rooms:ocean_proximityNEAR BAY          0.302125    
## total_rooms:ocean_proximityNEAR OCEAN        0.283041    
## total_bedrooms:population                    0.429060    
## total_bedrooms:households                    3.37e-08 ***
## total_bedrooms:median_income                 0.160663    
## total_bedrooms:ocean_proximityINLAND         0.858007    
## total_bedrooms:ocean_proximityNEAR BAY       0.388527    
## total_bedrooms:ocean_proximityNEAR OCEAN     0.138166    
## population:households                        0.009380 ** 
## population:median_income                      < 2e-16 ***
## population:ocean_proximityINLAND             1.50e-08 ***
## population:ocean_proximityNEAR BAY           0.946491    
## population:ocean_proximityNEAR OCEAN         0.000501 ***
## households:median_income                     5.46e-09 ***
## households:ocean_proximityINLAND             0.863274    
## households:ocean_proximityNEAR BAY           0.135309    
## households:ocean_proximityNEAR OCEAN         0.002148 ** 
## median_income:ocean_proximityINLAND          3.06e-10 ***
## median_income:ocean_proximityNEAR BAY        0.497886    
## median_income:ocean_proximityNEAR OCEAN      0.023662 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62990 on 14381 degrees of freedom
## Multiple R-squared:  0.7013, Adjusted R-squared:    0.7 
## F-statistic: 535.9 on 63 and 14381 DF,  p-value: < 2.2e-16
AIC(modelP1)
## [1] 360316.5
summary(model2)$adj.r.squared
## [1] 0.644722
AIC(model2)
## [1] 362704.9

Interpretation: It can be noticed that there is a negative coefficient for “house_median_age”. This implies that older a house is, lesser is its value because of the overall condition of the house.

Another thing that appears counter-intuitive is that there is a negative coefficient for “median_income”. But this covariate depends on the number of earning members in the household. Also, the “median_house_value” also depends on the number of members in the household. So here number of members in the household is acting as a confounder.

For polynominal regression of degree 2, we can see that the adjusted R-Squared value is higher than linear regression and the AIC value is lower than linear regression which is required.

Hence we will be using Polynominal model with degree 2 as our final model.

7. Model Validation

Calculating In-Sample RMSE

predin <- predict(modelP1, housing_training_data)
RMSE(housing_training_data$median_house_value,predin)
## [1] 62854.41

Calculating Out of-Sample RMSE

predout <- predict(modelP1, housing_test_data)
RMSE(housing_test_data$median_house_value,predout)
## [1] 64996.75

Using LOOCV cross validation to find over fitting

set.seed(123)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
modelLOOCV <- train(median_house_value ~(.)^2, data =housing_test_data, method = "lm",trControl = train.control)
print(modelLOOCV)
## Linear Regression 
## 
## 6190 samples
##    9 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 6189, 6189, 6189, 6189, 6189, 6189, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   64153.39  0.6956538  44919.35
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(modelLOOCV)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499638  -36109   -8399   25805  479474 
## 
## Coefficients: (9 not defined because of singularities)
##                                                  Estimate Std. Error
## (Intercept)                                    -7.520e+06  1.686e+06
## longitude                                      -5.923e+04  1.478e+04
## latitude                                        2.436e+05  5.118e+04
## housing_median_age                             -4.320e+04  1.317e+04
## total_rooms                                     1.144e+03  3.132e+02
## total_bedrooms                                 -6.222e+03  1.748e+03
## population                                      1.199e+03  3.217e+02
## households                                     -1.789e+03  1.622e+03
## median_income                                  -9.241e+05  1.239e+05
## ocean_proximityINLAND                           1.652e+06  4.456e+05
## ocean_proximityISLAND                                  NA         NA
## `ocean_proximityNEAR BAY`                      -2.305e+07  2.343e+06
## `ocean_proximityNEAR OCEAN`                     1.851e+04  5.568e+05
## `longitude:latitude`                            1.903e+03  3.956e+02
## `longitude:housing_median_age`                 -5.827e+02  1.519e+02
## `longitude:total_rooms`                         1.416e+01  3.687e+00
## `longitude:total_bedrooms`                     -7.858e+01  2.065e+01
## `longitude:population`                          1.602e+01  3.673e+00
## `longitude:households`                         -2.362e+01  1.805e+01
## `longitude:median_income`                      -1.168e+04  1.471e+03
## `longitude:ocean_proximityINLAND`               2.227e+04  5.275e+03
## `longitude:ocean_proximityISLAND`                      NA         NA
## `longitude:ocean_proximityNEAR BAY`            -2.277e+05  2.071e+04
## `longitude:ocean_proximityNEAR OCEAN`           2.721e+03  6.630e+03
## `latitude:housing_median_age`                  -7.738e+02  1.491e+02
## `latitude:total_rooms`                          1.537e+01  3.764e+00
## `latitude:total_bedrooms`                      -8.485e+01  2.175e+01
## `latitude:population`                           1.895e+01  3.621e+00
## `latitude:households`                          -3.273e+01  1.684e+01
## `latitude:median_income`                       -1.268e+04  1.536e+03
## `latitude:ocean_proximityINLAND`                2.526e+04  5.468e+03
## `latitude:ocean_proximityISLAND`                       NA         NA
## `latitude:ocean_proximityNEAR BAY`             -1.268e+05  1.515e+04
## `latitude:ocean_proximityNEAR OCEAN`            8.444e+03  6.920e+03
## `housing_median_age:total_rooms`               -5.670e-01  1.377e-01
## `housing_median_age:total_bedrooms`             1.385e-01  1.280e+00
## `housing_median_age:population`                -1.399e+00  1.793e-01
## `housing_median_age:households`                 8.280e+00  1.477e+00
## `housing_median_age:median_income`              2.197e+02  4.551e+01
## `housing_median_age:ocean_proximityINLAND`      6.357e+02  2.655e+02
## `housing_median_age:ocean_proximityISLAND`             NA         NA
## `housing_median_age:ocean_proximityNEAR BAY`    2.506e+02  2.847e+02
## `housing_median_age:ocean_proximityNEAR OCEAN`  1.567e+02  2.406e+02
## `total_rooms:total_bedrooms`                    2.394e-03  2.660e-03
## `total_rooms:population`                       -8.522e-03  1.431e-03
## `total_rooms:households`                        1.922e-02  4.390e-03
## `total_rooms:median_income`                     1.629e+00  5.601e-01
## `total_rooms:ocean_proximityINLAND`            -2.365e+01  5.768e+00
## `total_rooms:ocean_proximityISLAND`                    NA         NA
## `total_rooms:ocean_proximityNEAR BAY`           2.008e+01  6.849e+00
## `total_rooms:ocean_proximityNEAR OCEAN`         3.141e+00  4.863e+00
## `total_bedrooms:population`                     1.919e-02  1.091e-02
## `total_bedrooms:households`                    -1.423e-01  1.857e-02
## `total_bedrooms:median_income`                  7.756e-01  7.188e+00
## `total_bedrooms:ocean_proximityINLAND`          8.825e+01  4.406e+01
## `total_bedrooms:ocean_proximityISLAND`                 NA         NA
## `total_bedrooms:ocean_proximityNEAR BAY`       -1.531e+01  6.961e+01
## `total_bedrooms:ocean_proximityNEAR OCEAN`      1.050e+02  5.033e+01
## `population:households`                         3.904e-02  9.556e-03
## `population:median_income`                      1.235e+00  1.200e+00
## `population:ocean_proximityINLAND`              2.954e+01  7.038e+00
## `population:ocean_proximityISLAND`                     NA         NA
## `population:ocean_proximityNEAR BAY`            3.704e+00  9.676e+00
## `population:ocean_proximityNEAR OCEAN`          1.497e+01  6.033e+00
## `households:median_income`                      3.101e+00  8.435e+00
## `households:ocean_proximityINLAND`             -4.857e+01  4.482e+01
## `households:ocean_proximityISLAND`                     NA         NA
## `households:ocean_proximityNEAR BAY`           -1.100e+02  8.366e+01
## `households:ocean_proximityNEAR OCEAN`         -1.632e+02  5.616e+01
## `median_income:ocean_proximityINLAND`           1.382e+04  2.410e+03
## `median_income:ocean_proximityISLAND`                  NA         NA
## `median_income:ocean_proximityNEAR BAY`         2.404e+03  2.238e+03
## `median_income:ocean_proximityNEAR OCEAN`       1.182e+03  1.787e+03
##                                                t value Pr(>|t|)    
## (Intercept)                                     -4.460 8.34e-06 ***
## longitude                                       -4.008 6.20e-05 ***
## latitude                                         4.761 1.97e-06 ***
## housing_median_age                              -3.279 0.001047 ** 
## total_rooms                                      3.653 0.000261 ***
## total_bedrooms                                  -3.559 0.000375 ***
## population                                       3.728 0.000195 ***
## households                                      -1.103 0.270286    
## median_income                                   -7.462 9.72e-14 ***
## ocean_proximityINLAND                            3.707 0.000212 ***
## ocean_proximityISLAND                               NA       NA    
## `ocean_proximityNEAR BAY`                       -9.836  < 2e-16 ***
## `ocean_proximityNEAR OCEAN`                      0.033 0.973479    
## `longitude:latitude`                             4.810 1.54e-06 ***
## `longitude:housing_median_age`                  -3.837 0.000126 ***
## `longitude:total_rooms`                          3.840 0.000124 ***
## `longitude:total_bedrooms`                      -3.805 0.000143 ***
## `longitude:population`                           4.360 1.32e-05 ***
## `longitude:households`                          -1.308 0.190837    
## `longitude:median_income`                       -7.944 2.31e-15 ***
## `longitude:ocean_proximityINLAND`                4.222 2.45e-05 ***
## `longitude:ocean_proximityISLAND`                   NA       NA    
## `longitude:ocean_proximityNEAR BAY`            -10.991  < 2e-16 ***
## `longitude:ocean_proximityNEAR OCEAN`            0.410 0.681487    
## `latitude:housing_median_age`                   -5.191 2.16e-07 ***
## `latitude:total_rooms`                           4.084 4.49e-05 ***
## `latitude:total_bedrooms`                       -3.901 9.70e-05 ***
## `latitude:population`                            5.233 1.72e-07 ***
## `latitude:households`                           -1.943 0.052040 .  
## `latitude:median_income`                        -8.256  < 2e-16 ***
## `latitude:ocean_proximityINLAND`                 4.620 3.92e-06 ***
## `latitude:ocean_proximityISLAND`                    NA       NA    
## `latitude:ocean_proximityNEAR BAY`              -8.366  < 2e-16 ***
## `latitude:ocean_proximityNEAR OCEAN`             1.220 0.222435    
## `housing_median_age:total_rooms`                -4.118 3.87e-05 ***
## `housing_median_age:total_bedrooms`              0.108 0.913806    
## `housing_median_age:population`                 -7.805 6.94e-15 ***
## `housing_median_age:households`                  5.605 2.17e-08 ***
## `housing_median_age:median_income`               4.828 1.41e-06 ***
## `housing_median_age:ocean_proximityINLAND`       2.395 0.016666 *  
## `housing_median_age:ocean_proximityISLAND`          NA       NA    
## `housing_median_age:ocean_proximityNEAR BAY`     0.880 0.378786    
## `housing_median_age:ocean_proximityNEAR OCEAN`   0.652 0.514712    
## `total_rooms:total_bedrooms`                     0.900 0.368303    
## `total_rooms:population`                        -5.955 2.74e-09 ***
## `total_rooms:households`                         4.378 1.22e-05 ***
## `total_rooms:median_income`                      2.908 0.003648 ** 
## `total_rooms:ocean_proximityINLAND`             -4.100 4.19e-05 ***
## `total_rooms:ocean_proximityISLAND`                 NA       NA    
## `total_rooms:ocean_proximityNEAR BAY`            2.931 0.003391 ** 
## `total_rooms:ocean_proximityNEAR OCEAN`          0.646 0.518307    
## `total_bedrooms:population`                      1.759 0.078692 .  
## `total_bedrooms:households`                     -7.665 2.07e-14 ***
## `total_bedrooms:median_income`                   0.108 0.914074    
## `total_bedrooms:ocean_proximityINLAND`           2.003 0.045240 *  
## `total_bedrooms:ocean_proximityISLAND`              NA       NA    
## `total_bedrooms:ocean_proximityNEAR BAY`        -0.220 0.825971    
## `total_bedrooms:ocean_proximityNEAR OCEAN`       2.087 0.036899 *  
## `population:households`                          4.085 4.46e-05 ***
## `population:median_income`                       1.029 0.303488    
## `population:ocean_proximityINLAND`               4.198 2.73e-05 ***
## `population:ocean_proximityISLAND`                  NA       NA    
## `population:ocean_proximityNEAR BAY`             0.383 0.701880    
## `population:ocean_proximityNEAR OCEAN`           2.481 0.013138 *  
## `households:median_income`                       0.368 0.713200    
## `households:ocean_proximityINLAND`              -1.084 0.278615    
## `households:ocean_proximityISLAND`                  NA       NA    
## `households:ocean_proximityNEAR BAY`            -1.315 0.188576    
## `households:ocean_proximityNEAR OCEAN`          -2.906 0.003679 ** 
## `median_income:ocean_proximityINLAND`            5.733 1.03e-08 ***
## `median_income:ocean_proximityISLAND`               NA       NA    
## `median_income:ocean_proximityNEAR BAY`          1.074 0.282735    
## `median_income:ocean_proximityNEAR OCEAN`        0.661 0.508371    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62520 on 6126 degrees of freedom
## Multiple R-squared:  0.7136, Adjusted R-squared:  0.7107 
## F-statistic: 242.3 on 63 and 6126 DF,  p-value: < 2.2e-16

As can be seen in the model, the RSME obtained through LOOCV and through test dataset are close. Thus our model is not overfitting. When the model is evaluated on test data, it had an average error of 64153.39. This means that, on average, our model’s predicted housing price will be ± 64153.39 in comparison to the actual price.

8. Prediction Insights

test_data <- housing_test_data$median_house_value
plot(predout - test_data, housing_test_data$ocean_proximity,
     xlab = "Predicted Price Over / Under Actual Price",
     ylab = "",
     main = "Predicted Price Over / Under Actual Price vs Ocean Proximity",
     col = "dodgerblue", yaxt = "none")
axis(2, at = 1:5, labels = levels(housing_test_data$ocean_proximity), las = 1)

As it can be seen in the plot, the difference between predicted and actual values is very close to zero. A trend that can be observed is that our model tends to underestimate the price than overestimating it. The average error is 64996.75, which can be confirmed from this model. However, when ocean_proximity is INLAND, there are some extreme outliers. Our model has underestimated the value of one house approximately 100,000.

test_data=housing_test_data$median_house_value
plot_map = ggplot(housing_test_data,
                  aes(x = longitude, y = latitude,
                      color = predout - test_data,
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
              geom_point(aes(size = abs(predout - test_data)), alpha = 0.4) +
              xlab("Longitude") +
              ylab("Latitude") +
              ggtitle("Predicted Price Over / Under Actual Price") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_color_distiller(palette = "Paired") +
              labs(color = "Predicted Price Over / Under (in $USD)",
                   size = "Magnitude of Price Difference")
plot_map

Based on longitude and latitude values in the data,the plot shows us the areas where the prices have been underestimated in some regions denoted by yellow and overestimated in some regions denoted by blue.These values help us to understand the outliers in the data.However,in most of the areas of California,the median prices are accurately estimated which are denoted by green area.