(1) Overview

I obtained the housing data from Kaggle: https://www.kaggle.com/camnugent/california-housing-prices. The data comes from the 1990 California census. Each row describes the houses in one California district (identified by lat/long coordinates), along with key summary statistics for that district. Using the data, we explore the following questions:

1. How were housing prices distributed in California in 1990?
2. Which variables appear to affect the median house value of a given district?
3. Can we create a model to predict prices of a given district?

After cleaning the data to remove NA values, the data is ready to be explored. To create and test a predictive model, I set aside a sample of 500 districts as the test set. The remaining districts are used to fit the model that predicts the median house value of a given district. See the summary of the data below.

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value   ocean_proximity
##  Min.   : 14999     <1H OCEAN :9136  
##  1st Qu.:119600     INLAND    :6551  
##  Median :179700     ISLAND    :   5  
##  Mean   :206856     NEAR BAY  :2290  
##  3rd Qu.:264725     NEAR OCEAN:2658  
##  Max.   :500001                      
## 
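The cleaning and hold-out steps described above can be sketched roughly as follows. The exact NA handling is not stated; median imputation of total_bedrooms is an assumption here (the row counts in the model output later, 20,127 residual degrees of freedom plus 13 coefficients, are at least consistent with filling values rather than dropping rows). The data frame below is made-up toy data, and the names housing_raw, housing_clean and test are hypothetical.

```r
# Toy stand-in for the Kaggle data; all values are made up for illustration.
set.seed(42)
housing_raw <- data.frame(
  total_bedrooms     = c(NA, rpois(1999, 500)),
  median_income      = runif(2000, 0.5, 15),
  median_house_value = runif(2000, 15000, 500000)
)

# Assumption: fill missing total_bedrooms with the column median.
bed_median <- median(housing_raw$total_bedrooms, na.rm = TRUE)
housing_clean <- housing_raw
housing_clean$total_bedrooms[is.na(housing_clean$total_bedrooms)] <- bed_median

# Hold out 500 randomly chosen districts as the test set;
# the remaining rows are used to fit the model.
test_idx <- sample(nrow(housing_clean), 500)
test     <- housing_clean[test_idx, ]
housing  <- housing_clean[-test_idx, ]
```

Setting a seed before sampling keeps the 500-district test set the same from run to run.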

(2) Exploring the Housing Data

## Measure of skewness: 0.9783604
## The average house value: 206745.9

Figure 1: Housing prices are right-skewed (skewness of about 0.98). The average house value is approximately $200k. It is interesting to see that a good number of districts have median values at the top of the range, just over $500k (the maximum in the data is $500,001).
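For reference, one common way to measure skewness is the moment-based formula below. Which formula or package was used for the figure above is not stated, so this is only a sketch; the helper name skewness and the toy prices are hypothetical.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy values (made up): a long right tail gives a positive skewness.
prices <- c(100, 120, 150, 180, 200, 260, 500)
skewness(prices)
```

A positive value means the right tail is longer than the left, and a symmetric sample gives a value of 0.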

##   Age Bracket Average House Value
## 1         <10            200164.3
## 2       11-20            191108.2
## 3       21-30            206786.5
## 4       31-40            206436.6
## 5       41-50            205363.7
## 6     Over 50            274145.7

Figure 2: A large number of districts have a median house age of 11-40 years, and the most common bracket is 31-40 years. Districts with a median age over 50 years are on average worth more than districts with newer homes (about $274k versus roughly $191k to $207k). Across the brackets of 50 years or less, the average house value does not fluctuate much. This is an interesting finding, since the most expensive districts are also the oldest.
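A table like the age-bracket summary above can be produced with cut() and aggregate(). This is a sketch on made-up toy data; the bracket edges are an assumption (the original labels do not say which bracket an age of exactly 10 falls into, and here it goes into the first one).

```r
# Toy districts (made-up values).
toy <- data.frame(
  housing_median_age = c(5, 15, 25, 35, 45, 52, 52, 18),
  median_house_value = c(200000, 190000, 205000, 207000,
                         204000, 280000, 270000, 192000)
)

# Assumed bracket edges; intervals are closed on the right,
# so an age of 10 lands in the first bracket.
toy$age_bracket <- cut(
  toy$housing_median_age,
  breaks = c(0, 10, 20, 30, 40, 50, Inf),
  labels = c("<=10", "11-20", "21-30", "31-40", "41-50", "Over 50")
)

# Average median house value per age bracket.
avg_by_age <- aggregate(median_house_value ~ age_bracket, data = toy, FUN = mean)
avg_by_age
```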


Figure 3: Because these homes are in California, a district’s location and its distance from the beach are expected to be a huge factor in determining its price. As summarized above, most districts are not right on the coast: they are either inland or less than an hour from the ocean.

Figure 4: The most expensive houses on average are located on islands, although only 5 districts fall in that category. The least expensive homes are indeed inland. What’s surprising is that houses near the ocean or bay are not significantly more expensive than those less than an hour away.
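The counts and averages behind Figures 3 and 4 could be computed along these lines. This is a sketch on made-up toy data, not the code used for the figures.

```r
# Toy districts (made-up values).
toy <- data.frame(
  ocean_proximity = factor(c("INLAND", "INLAND", "<1H OCEAN", "<1H OCEAN",
                             "<1H OCEAN", "NEAR BAY", "ISLAND")),
  median_house_value = c(120000, 130000, 240000, 250000,
                         230000, 260000, 380000)
)

# Number of districts in each category (the Figure 3 idea).
counts <- table(toy$ocean_proximity)

# Average median house value in each category (the Figure 4 idea).
avg_value <- tapply(toy$median_house_value, toy$ocean_proximity, mean)
sort(avg_value, decreasing = TRUE)
```

Reading the counts next to the averages helps flag categories whose average rests on very few districts.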

Other key factors that help explain the price of a house include the district’s population, the average income of nearby homeowners, etc. Using loops, I explored these factors by grouping districts into quartiles of median house value.

##     0%    25%    50%    75%   100% 
##  14999 119400 179500 264750 500001
##   Quantile Avg. Income Avg. Population Avg. Households
## 1      25%    2.444563        1275.673        425.7174
## 2      50%    3.312460        1575.136        508.3949
## 3      75%    4.050090        1502.953        534.0869
## 4  Top 25%    5.682831        1342.303        528.2036

Average income rises with each quartile, by roughly 0.7 to 1.6 (about $7,000 to $16,000, taking one unit of median_income as $10,000). I also noted that the bottom and top 25% of districts by price have noticeably lower average populations than the two middle quartiles.
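The quartile table above was built with loops; a vectorized alternative is sketched below using quantile(), cut() and aggregate(). The toy data is made up, and this is not necessarily how the original table was computed.

```r
# Toy districts (made-up values).
toy <- data.frame(
  median_house_value = c(50, 60, 70, 80, 90, 100, 110, 120) * 1000,
  median_income      = c(2.0, 2.4, 3.0, 3.2, 4.0, 4.2, 5.5, 6.1),
  population         = c(1200, 1300, 1500, 1600, 1550, 1450, 1300, 1350)
)

# Quartile cut points of median house value (0%, 25%, 50%, 75%, 100%).
cuts <- quantile(toy$median_house_value)

# Assign each district to a quartile; include.lowest keeps the minimum value.
toy$quartile <- cut(
  toy$median_house_value,
  breaks = cuts,
  include.lowest = TRUE,
  labels = c("25%", "50%", "75%", "Top 25%")
)

# Average income and population within each quartile.
by_quartile <- aggregate(cbind(median_income, population) ~ quartile,
                         data = toy, FUN = mean)
by_quartile
```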

(3) Creating the Model

I included all of the variables in the linear regression model because all of them are statistically significant at the .05 level. Using the model, I predicted the median house values of the 500 test districts mentioned earlier and plotted the differences below.

## 
## Call:
## lm(formula = median_house_value ~ longitude + latitude + housing_median_age + 
##     total_rooms + total_bedrooms + population + households + 
##     median_income + ocean_proximity, data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -545843  -42944  -10628   28986  797843 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.216e+06  8.848e+04 -25.041  < 2e-16 ***
## longitude                 -2.626e+04  1.026e+03 -25.599  < 2e-16 ***
## latitude                  -2.504e+04  1.012e+03 -24.738  < 2e-16 ***
## housing_median_age         1.052e+03  4.424e+01  23.778  < 2e-16 ***
## total_rooms               -3.266e+00  7.617e-01  -4.288 1.81e-05 ***
## total_bedrooms             5.036e+01  5.044e+00   9.983  < 2e-16 ***
## population                -3.972e+01  1.066e+00 -37.253  < 2e-16 ***
## households                 9.371e+01  6.130e+00  15.288  < 2e-16 ***
## median_income              3.833e+04  3.317e+02 115.528  < 2e-16 ***
## ocean_proximityINLAND     -4.032e+04  1.756e+03 -22.955  < 2e-16 ***
## ocean_proximityISLAND      1.579e+05  3.076e+04   5.131 2.90e-07 ***
## ocean_proximityNEAR BAY   -3.903e+03  1.928e+03  -2.024 0.042958 *  
## ocean_proximityNEAR OCEAN  5.311e+03  1.582e+03   3.358 0.000788 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68710 on 20127 degrees of freedom
## Multiple R-squared:  0.6457, Adjusted R-squared:  0.6455 
## F-statistic:  3057 on 12 and 20127 DF,  p-value: < 2.2e-16
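The fit-then-predict workflow behind the output above can be sketched as follows. The toy data, its coefficients and the reduced set of predictors are all made up for illustration; only the lm() and predict() pattern mirrors the call shown in the output.

```r
# Toy data with a known linear relationship plus noise (all values made up).
set.seed(7)
n <- 300
toy <- data.frame(
  median_income      = runif(n, 0.5, 15),
  housing_median_age = sample(1:52, n, replace = TRUE),
  ocean_proximity    = factor(sample(c("<1H OCEAN", "INLAND", "NEAR BAY"),
                                     n, replace = TRUE))
)
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age -
  40000 * (toy$ocean_proximity == "INLAND") +
  rnorm(n, sd = 20000)

# Hold out 50 rows for testing and fit on the rest.
test_idx <- sample(n, 50)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit <- lm(median_house_value ~ median_income + housing_median_age + ocean_proximity,
          data = train)

# Predict the held-out districts and take the differences from the actual values.
predicted   <- predict(fit, newdata = test)
differences <- predicted - test$median_house_value
```

Calling summary(fit) prints a coefficient table in the same layout as the one above, and hist(differences) gives a histogram of the prediction errors.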

Figure 5: On average, my predictions were off by $50k. My model’s R-squared is about 0.65, which indicates that the model explains most of the variation in median house value but is not highly accurate. I was expecting to create a model with an R-squared in the .70s, but the histogram above shows notable differences reaching well into the $250k range. I also ran a correlation between the actual and predicted values and got about 0.78. This implies that the actual and predicted values tend to move in the same direction.

##              actual predicted
## actual    1.0000000 0.7788374
## predicted 0.7788374 1.0000000
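The error and correlation figures can be computed as sketched below. It is not stated which measure "off by $50k" refers to, so both the mean absolute error and the root mean squared error are shown; the actual and predicted vectors are made-up toy values.

```r
# Toy actual and predicted values (made up for illustration).
actual    <- c(150000, 220000, 310000, 95000, 480000)
predicted <- c(170000, 200000, 280000, 120000, 400000)

errors <- predicted - actual

# Mean absolute error: the average size of a miss, in dollars.
mae <- mean(abs(errors))

# Root mean squared error: penalizes large misses more heavily.
rmse <- sqrt(mean(errors^2))

# Correlation between actual and predicted values, as a single number
# and as a matrix like the one printed above.
r <- cor(actual, predicted)
cor(cbind(actual, predicted))
```

Squaring the correlation reported above (0.7788) gives about 0.607, which is in the same range as the R-squared of 0.6457 from the model fit.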