knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(reshape2)
library(ggpubr)
library(ggfortify)
library(psych)

require(scales)

Introduction to Analysis

Can education and wages be used as predictors for home values?
Now that Cleaning & Transformation and Visualization are complete, in this third notebook we’ll complete a regression analysis to see how well home values can be predicted at the county level using education and wages.

About the Data

## Data for this project is 2745 cases and 16 features
head(master,1)
##   state                       county dropout hs_diploma some_college
## 1    AK Fairbanks North Star Borough   3,381     13,785       24,520
##   four_year_degree dropout_percent hs_diploma_percent some_college_percent
## 1           20,152             5.5               22.3                 39.7
##   four_year_degree_percent region region_num resident_count avg_annual_pay
## 1                     32.6   West          4           2275       53107.01
##       metro avg_home_value
## 1 Fairbanks       244178.1
describe(master)
##                          vars    n      mean        sd    median   trimmed
## state*                      1 2745     25.80     14.12     24.00     25.86
## county*                     2 2745    820.89    457.23    823.00    819.14
## dropout*                    3 2745   1159.42    688.04   1159.00   1156.84
## hs_diploma*                 4 2745   1290.96    741.38   1294.00   1291.38
## some_college*               5 2745   1261.15    733.32   1269.00   1260.75
## four_year_degree*           6 2745   1210.68    721.27   1206.00   1209.53
## dropout_percent             7 2745     13.38      6.04     12.20     12.86
## hs_diploma_percent          8 2745     34.36      7.29     34.60     34.62
## some_college_percent        9 2745     30.56      5.01     30.50     30.54
## four_year_degree_percent   10 2745     21.70      9.54     19.20     20.40
## region*                    11 2745      2.43      1.08      3.00      2.41
## region_num                 12 2745      2.52      0.83      2.00      2.50
## resident_count             13 2745   3262.90  13086.08    728.00   1200.12
## avg_annual_pay             14 2745  41527.11   9278.22  39763.01  40322.16
## metro*                     15 2745    257.38    276.68    152.00    225.20
## avg_home_value             16 2745 160108.92 111972.30 130851.33 141978.32
##                               mad      min       max     range  skew kurtosis
## state*                      17.79     1.00      50.0      49.0  0.04    -1.27
## county*                    566.35     1.00    1634.0    1633.0  0.04    -1.11
## dropout*                   882.15     1.00    2356.0    2355.0  0.02    -1.21
## hs_diploma*                938.49     1.00    2583.0    2582.0 -0.01    -1.18
## some_college*              932.56     1.00    2538.0    2537.0  0.00    -1.19
## four_year_degree*          941.45     1.00    2438.0    2437.0  0.01    -1.24
## dropout_percent              5.78     1.40      48.5      47.1  0.90     1.05
## hs_diploma_percent           7.12     8.10      55.6      47.5 -0.32    -0.03
## some_college_percent         5.04    11.40      48.0      36.6  0.02    -0.03
## four_year_degree_percent     7.41     6.90      74.6      67.7  1.39     2.22
## region*                      1.48     1.00       4.0       3.0 -0.24    -1.34
## region_num                   1.48     1.00       4.0       3.0  0.27    -0.58
## resident_count             708.68    25.00  495918.0  495893.0 22.01   747.80
## avg_annual_pay            6241.75     0.01  134664.0  134664.0  2.61    14.38
## metro*                     223.87     1.00     844.0     843.0  0.63    -1.05
## avg_home_value           62020.49 29574.67 1527482.2 1497907.5  4.18    31.44
##                               se
## state*                      0.27
## county*                     8.73
## dropout*                   13.13
## hs_diploma*                14.15
## some_college*              14.00
## four_year_degree*          13.77
## dropout_percent             0.12
## hs_diploma_percent          0.14
## some_college_percent        0.10
## four_year_degree_percent    0.18
## region*                     0.02
## region_num                  0.02
## resident_count            249.77
## avg_annual_pay            177.09
## metro*                      5.28
## avg_home_value           2137.17

Regression Analysis

The correlation heat maps and scatter plots in the previous Visualization notebook make it appear likely that all the features below will contribute to the regression model’s accuracy. Based on this we’ll start with a full regression model for home values that uses all these features…

region = census region (Northeast, South, Midwest, West)
hs = high school diploma
college = four year degree
wages = average annual pay
rc = resident count
hv = home values

Create the Regression Model

rmdata <- master %>%
  select(region_num, hs = hs_diploma_percent, college = four_year_degree_percent, wages = avg_annual_pay, rc = resident_count, hv = avg_home_value)

rmall <- lm(hv ~ region_num + hs + college + wages + rc, data = rmdata)
summary(rmall)
## 
## Call:
## lm(formula = hv ~ region_num + hs + college + wages + rc, data = rmdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291810  -37903   -3312   28484 1177740 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.793e+04  1.906e+04  -3.039  0.00239 ** 
## region_num   2.037e+04  1.796e+03  11.344  < 2e-16 ***
## hs          -1.484e+03  3.338e+02  -4.446 9.08e-06 ***
## college      5.391e+03  2.610e+02  20.658  < 2e-16 ***
## wages        2.339e+00  1.885e-01  12.410  < 2e-16 ***
## rc           1.113e+00  1.199e-01   9.284  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74780 on 2739 degrees of freedom
## Multiple R-squared:  0.5548, Adjusted R-squared:  0.554 
## F-statistic: 682.8 on 5 and 2739 DF,  p-value: < 2.2e-16

Chart the regression model
We used five features in the regression model, but graphing all five would make for poor readability and isn’t very practical. The chart below uses the features with the highest coefficients (wages and college) to visualize the regression line.

ggplot(rmdata,aes(y=hv,x=wages,color=college)) + 
  geom_point() + 
  scale_x_continuous(labels = comma) +
  stat_smooth(method="lm",se=FALSE)


One concern this chart raises is the nonlinear nature of the data as average annual wages start to exceed $60,000. The data starts takes on a non-linear trend after this point and the overall distribution of the data is heteroschedastic, rather than the preferred homoschedastic. We’ll look into this further when we evaluate the regression for reasonableness below.

Check for Reasonable Conditions
This model looks pretty good as all the p-values are a lot less than 0.05 and the adjusted R-squared is at 55.4%. Prior to concluding that this is the final model, we’ll check if the conditions to use Regression are reasonable. To do this we will use the ggfortify library’s autoplot function to create diagnostic plots.

autoplot(rmall)

Analyzing the regression relationships…
Residuals vs Fitted checks the linear relationship assumptions. Our results show that up to values of $500K the relationship is a horizontal line; above $500K the outliers take the data in an upward curve.

Normal Q-Q is used to examine whether the residuals are normally distributed. Similar to the “Residuals vs Fitted” the data skews right among outliers of high home values.

Scale-Location (or Spread-Location) is used to check the homogeneity of variance of the residuals (homoscedasticity). Our “$500K” pattern continues as there is good homoscedasticity up to this amount.

Residuals vs Leverage is used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis. The dashed line in the plot is called “Cook’s line” and values above this line should be considered for removal from the model. The above result aligns with the prior notes that values above $500K appear to be impacting the model’s ability to forecast with regression.

Analysis Conclusions

Conclusions from this Analysis are…
1. at a county level across the United States the explanatory variables of high school diploma, four year degree, wages, resident count, and region produce a useable regression model to forecast home values.

2. the plot of the regression model showed a heteroschedastic trend to the data for wages above $75,000. In a similar manner the residuals plot for home values is homoschedastic up to around $500,000, then tends to be non-linear.

3. to address the non-linear and heteroschedastic nature of this data for counties with higher average annual wages (above $75K) and/or home values above $500K a next step for this project could be to run this notebook excluding counties that meet these conditions. It’s reasonable to assume that the model’s forecasting ability would improve.

References

This article from Keong-Woong Moon was helpful with graphing the linear regression line.