I obtained this housing data from Kaggle: https://www.kaggle.com/camnugent/california-housing-prices. The data comes from the 1990 California census. Each row describes one California district (located by lat/long coordinates) and gives key summary statistics about the houses in it. Using the data, we explore the following questions:
1. How were housing prices distributed in California in 1990?
2. Which variables appear to affect the median house value of a given district?
3. Can we create a model to predict prices of a given district?
After cleaning the data to rid it of NA values, the data is ready to be explored. To create and test a predictive model, I set aside a sample of 500 districts to use as the test set. The remaining districts are used to fit the model that predicts the median house value of a given district. See the summary of the data below.
##    longitude         latitude     housing_median_age  total_rooms
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320
##
##  total_bedrooms     population      households     median_income
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001
##  NA's   :207
##  median_house_value   ocean_proximity
##  Min.   : 14999      <1H OCEAN :9136
##  1st Qu.:119600      INLAND    :6551
##  Median :179700      ISLAND    :   5
##  Mean   :206856      NEAR BAY  :2290
##  3rd Qu.:264725      NEAR OCEAN:2658
##  Max.   :500001
##
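The cleaning and hold-out step described above might look roughly like the sketch below. The write-up does not say how the NA values were handled, so filling them with the column median is an assumption, as are the toy data frame, the seed, and the names `bedroom_median`, `test_idx`, `train`, and `test`. The real analysis would start from `read.csv()` on the Kaggle file and hold out 500 rows instead of 3.

```r
# Toy stand-in for the Kaggle file (hypothetical values, not the census data).
set.seed(42)  # arbitrary seed so the split is reproducible
housing <- data.frame(
  total_bedrooms     = c(129, NA, 190, 235, NA, 213, 489, 687, 665, 707),
  median_house_value = c(452600, 358500, 352100, 341300, 342200,
                         269700, 299200, 241400, 226700, 261100)
)

# Assumption: fill missing bedroom counts with the column median.
bedroom_median <- median(housing$total_bedrooms, na.rm = TRUE)
housing$total_bedrooms[is.na(housing$total_bedrooms)] <- bedroom_median

# Hold out a random sample of rows (500 in the write-up, 3 here).
n_test   <- 3
test_idx <- sample(nrow(housing), n_test)
test     <- housing[test_idx, ]
train    <- housing[-test_idx, ]

c(train = nrow(train), test = nrow(test))
```

Sampling row indices once and using `-test_idx` for the remainder guarantees that no district appears in both sets.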
## Measure of skewness: 0.9783604
## The average house value: 206745.9
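Figure 1: Housing prices are right-skewed. The average house price is approximately $200k. It is interesting to see that a good number of districts have median values at the top of the range, just over $500k. The summary shows a maximum of 500001, which suggests the values are capped there.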
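For reference, a skewness figure like the one reported above can be computed from the second and third central moments. The sketch below is one common definition and not necessarily the function used in the original analysis, since packages differ slightly in their small-sample corrections. The helper name `skewness` and the toy vectors are assumptions.

```r
# Moment-based sample skewness: m3 / m2^(3/2), using 1/n central moments.
skewness <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

# Toy values, not the census data.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

skewness(symmetric)     # zero for a symmetric sample
skewness(right_skewed)  # positive: the long tail is on the right
```

A positive value near 1, as reported for the house values, indicates a moderately long right tail.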
## Age Bracket Average House Value
## 1 <10 200164.3
## 2 11-20 191108.2
## 3 21-30 206786.5
## 4 31-40 206436.6
## 5 41-50 205363.7
## 6 Over 50 274145.7
Figure 2: Most districts have a median house age between 11 and 40 years, with the 31-40 bracket being the most common. Districts with a median age over 50 years are on average worth more than districts with newer homes. Across the brackets of 50 years or less, we do not see a large fluctuation in the average price of homes. This is an interesting finding, since it means the most expensive districts are also the oldest.
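A table like the age-bracket summary above can be built by binning `housing_median_age` with `cut()` and averaging within each bin. This is a minimal sketch on made-up rows. The exact bin edges used in the original are not stated, so the right-closed breaks and the `<=10` label here are assumptions.

```r
# Toy districts, not the census data.
districts <- data.frame(
  housing_median_age = c(5, 10, 15, 25, 35, 38, 45, 52, 52),
  median_house_value = c(200000, 180000, 190000, 210000, 205000,
                         215000, 200000, 280000, 260000)
)

# Right-closed bins: (0,10], (10,20], ..., (50,Inf].
districts$age_bracket <- cut(
  districts$housing_median_age,
  breaks = c(0, 10, 20, 30, 40, 50, Inf),
  labels = c("<=10", "11-20", "21-30", "31-40", "41-50", "Over 50")
)

# Number of districts and mean house value per bracket.
bracket_counts <- table(districts$age_bracket)
bracket_means  <- tapply(districts$median_house_value,
                         districts$age_bracket, mean)
bracket_means
```

Looking at `bracket_counts` next to `bracket_means` helps show whether a high average rests on many districts or only a few.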
Figure 3: Since these homes are in California, a district’s location and its distance from the beach are expected to be a huge factor in determining its price. As summarized above, most districts are either less than an hour from the ocean or inland, rather than right by the bay or the ocean.
Figure 4: The most expensive houses on average are located on islands, although only 5 districts fall in that category. The least expensive homes are indeed inland. What’s surprising is that houses near the beach or bay are not significantly more expensive than those that are less than an hour away.
Other key factors that help explain the price of a house include the area’s population and the average income of the surrounding homeowners. Using loops, I explored these factors by grouping districts into quartiles of median house value.
## 0% 25% 50% 75% 100%
## 14999 119400 179500 264750 500001
## Quantile Avg. Income Avg. Population Avg. Households
## 1 25% 2.444563 1275.673 425.7174
## 2 50% 3.312460 1575.136 508.3949
## 3 75% 4.050090 1502.953 534.0869
## 4 Top 25% 5.682831 1342.303 528.2036
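Average income rises from one quartile to the next by roughly 0.7 to 1.6 units, which is about $7,000 to $16,000 because income is expressed in units of $10,000. I also noted that the bottom and top 25% of districts by price have noticeably lower average populations (about 1,276 and 1,342) than the two middle quartiles (about 1,575 and 1,503).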
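The quartile table above could be reproduced along the following lines. The write-up mentions loops. This sketch uses `cut()` with `quantile()` breaks and `aggregate()` instead, which is a swapped-in technique and not the author's code. The toy rows and the quartile labels are assumptions.

```r
# Toy districts, not the census data.
districts <- data.frame(
  median_house_value = c(60000, 90000, 120000, 150000, 180000, 210000,
                         260000, 320000, 400000, 500001, 110000, 240000),
  median_income      = c(1.8, 2.2, 2.6, 3.0, 3.3, 3.6,
                         4.1, 4.8, 5.9, 7.5, 2.4, 4.0),
  population         = c(900, 1200, 1500, 1600, 1700, 1500,
                         1400, 1300, 1200, 1000, 1400, 1500)
)

# Break points at 0%, 25%, 50%, 75%, 100% of median house value.
cuts <- quantile(districts$median_house_value)
districts$quartile <- cut(
  districts$median_house_value,
  breaks = cuts,
  labels = c("Bottom 25%", "25-50%", "50-75%", "Top 25%"),
  include.lowest = TRUE  # keep the minimum value in the first bin
)

# Mean income and population within each price quartile.
quartile_summary <- aggregate(
  cbind(median_income, population) ~ quartile,
  data = districts,
  FUN  = mean
)

# median_income is in units of $10,000, so convert for readability.
quartile_summary$income_dollars <- quartile_summary$median_income * 10000
quartile_summary
```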
I included all of the variables in the linear regression model because every variable is statistically significant at the .05 level. Using the model, I predicted the median house value for the 500 test districts mentioned earlier and plotted the differences below.
##
## Call:
## lm(formula = median_house_value ~ longitude + latitude + housing_median_age +
## total_rooms + total_bedrooms + population + households +
## median_income + ocean_proximity, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -545843 -42944 -10628 28986 797843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.216e+06 8.848e+04 -25.041 < 2e-16 ***
## longitude -2.626e+04 1.026e+03 -25.599 < 2e-16 ***
## latitude -2.504e+04 1.012e+03 -24.738 < 2e-16 ***
## housing_median_age 1.052e+03 4.424e+01 23.778 < 2e-16 ***
## total_rooms -3.266e+00 7.617e-01 -4.288 1.81e-05 ***
## total_bedrooms 5.036e+01 5.044e+00 9.983 < 2e-16 ***
## population -3.972e+01 1.066e+00 -37.253 < 2e-16 ***
## households 9.371e+01 6.130e+00 15.288 < 2e-16 ***
## median_income 3.833e+04 3.317e+02 115.528 < 2e-16 ***
## ocean_proximityINLAND -4.032e+04 1.756e+03 -22.955 < 2e-16 ***
## ocean_proximityISLAND 1.579e+05 3.076e+04 5.131 2.90e-07 ***
## ocean_proximityNEAR BAY -3.903e+03 1.928e+03 -2.024 0.042958 *
## ocean_proximityNEAR OCEAN 5.311e+03 1.582e+03 3.358 0.000788 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68710 on 20127 degrees of freedom
## Multiple R-squared: 0.6457, Adjusted R-squared: 0.6455
## F-statistic: 3057 on 12 and 20127 DF, p-value: < 2.2e-16
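The fit-then-predict workflow behind the output above can be sketched as follows. The data here is synthetic and uses only two predictors, so the coefficients will not match the table. The seed, sample sizes, and the names `housing_all`, `test`, `fit`, `predicted`, and `errors` are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for the synthetic data
n <- 200

# Synthetic districts, not the census data.
housing_all <- data.frame(
  median_income   = runif(n, 0.5, 15),
  ocean_proximity = factor(sample(c("<1H OCEAN", "INLAND", "NEAR BAY"),
                                  n, replace = TRUE))
)
housing_all$median_house_value <-
  50000 + 38000 * housing_all$median_income -
  40000 * (housing_all$ocean_proximity == "INLAND") +
  rnorm(n, sd = 20000)

# Hold out 20 rows for testing (500 in the write-up).
test_idx <- sample(n, 20)
test     <- housing_all[test_idx, ]
housing  <- housing_all[-test_idx, ]

# "~ ." uses every other column as a predictor; factors are dummy-coded,
# with the first level ("<1H OCEAN") as the baseline.
fit <- lm(median_house_value ~ ., data = housing)

# Predict the held-out districts and compute the prediction errors.
predicted <- predict(fit, newdata = test)
errors    <- test$median_house_value - predicted

summary(fit)$r.squared
# hist(errors)  # uncomment to plot the distribution of prediction errors
```

Because `ocean_proximity` is a factor, each coefficient such as `ocean_proximityINLAND` is an offset relative to the baseline category, which is how the corresponding rows in the output above should be read.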
Figure 5: On average, my predictions were off by $50k. My model’s R-squared is 0.65, which indicates that the model explains about 65% of the variation in median house value but is not highly accurate. I was expecting a model with an R-squared in the .70s, but the histogram above shows notable differences, some well into the $250k range. I also computed the correlation between the actual and predicted values, which is about 0.78. This implies that the predicted values tend to move in the same direction as the actual values.
## actual predicted
## actual 1.0000000 0.7788374
## predicted 0.7788374 1.0000000
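Correlation measures only whether predictions move with the actual values, not how large the misses are, so it can help to report error-based metrics next to it. The sketch below uses made-up numbers, not the model's predictions, and the metric names are assumptions. As a point of reference, squaring the reported correlation of 0.7788 gives about 0.61, which is an upper bound on the test-set R-squared as defined in the code.

```r
# Toy values, not the model's actual predictions.
actual    <- c(150000, 220000, 310000, 95000, 410000)
predicted <- c(170000, 200000, 280000, 120000, 350000)

errors <- actual - predicted

mae  <- mean(abs(errors))       # average size of a miss, in dollars
rmse <- sqrt(mean(errors^2))    # penalizes large misses more heavily
r    <- cor(actual, predicted)  # linear association only

# Out-of-sample R-squared: 1 - SSE / SST.
r2_test <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

c(mae = mae, rmse = rmse, cor = r, cor_squared = r^2, r2_test = r2_test)
```

The gap between `cor_squared` and `r2_test` grows when predictions are systematically too high or too low, which correlation alone does not reveal.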