knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(reshape2)
library(ggpubr)
library(ggfortify)
library(psych)
require(scales)
Can education and wages be used as predictors for home values?
Now that Cleaning & Transformation and Visualization are complete, in this third notebook we’ll complete a regression analysis to see how well home values can be predicted at the county level using education and wages.
## Data for this project is 2745 cases and 16 features
head(master,1)
## state county dropout hs_diploma some_college
## 1 AK Fairbanks North Star Borough 3,381 13,785 24,520
## four_year_degree dropout_percent hs_diploma_percent some_college_percent
## 1 20,152 5.5 22.3 39.7
## four_year_degree_percent region region_num resident_count avg_annual_pay
## 1 32.6 West 4 2275 53107.01
## metro avg_home_value
## 1 Fairbanks 244178.1
describe(master)
## vars n mean sd median trimmed
## state* 1 2745 25.80 14.12 24.00 25.86
## county* 2 2745 820.89 457.23 823.00 819.14
## dropout* 3 2745 1159.42 688.04 1159.00 1156.84
## hs_diploma* 4 2745 1290.96 741.38 1294.00 1291.38
## some_college* 5 2745 1261.15 733.32 1269.00 1260.75
## four_year_degree* 6 2745 1210.68 721.27 1206.00 1209.53
## dropout_percent 7 2745 13.38 6.04 12.20 12.86
## hs_diploma_percent 8 2745 34.36 7.29 34.60 34.62
## some_college_percent 9 2745 30.56 5.01 30.50 30.54
## four_year_degree_percent 10 2745 21.70 9.54 19.20 20.40
## region* 11 2745 2.43 1.08 3.00 2.41
## region_num 12 2745 2.52 0.83 2.00 2.50
## resident_count 13 2745 3262.90 13086.08 728.00 1200.12
## avg_annual_pay 14 2745 41527.11 9278.22 39763.01 40322.16
## metro* 15 2745 257.38 276.68 152.00 225.20
## avg_home_value 16 2745 160108.92 111972.30 130851.33 141978.32
## mad min max range skew kurtosis
## state* 17.79 1.00 50.0 49.0 0.04 -1.27
## county* 566.35 1.00 1634.0 1633.0 0.04 -1.11
## dropout* 882.15 1.00 2356.0 2355.0 0.02 -1.21
## hs_diploma* 938.49 1.00 2583.0 2582.0 -0.01 -1.18
## some_college* 932.56 1.00 2538.0 2537.0 0.00 -1.19
## four_year_degree* 941.45 1.00 2438.0 2437.0 0.01 -1.24
## dropout_percent 5.78 1.40 48.5 47.1 0.90 1.05
## hs_diploma_percent 7.12 8.10 55.6 47.5 -0.32 -0.03
## some_college_percent 5.04 11.40 48.0 36.6 0.02 -0.03
## four_year_degree_percent 7.41 6.90 74.6 67.7 1.39 2.22
## region* 1.48 1.00 4.0 3.0 -0.24 -1.34
## region_num 1.48 1.00 4.0 3.0 0.27 -0.58
## resident_count 708.68 25.00 495918.0 495893.0 22.01 747.80
## avg_annual_pay 6241.75 0.01 134664.0 134664.0 2.61 14.38
## metro* 223.87 1.00 844.0 843.0 0.63 -1.05
## avg_home_value 62020.49 29574.67 1527482.2 1497907.5 4.18 31.44
## se
## state* 0.27
## county* 8.73
## dropout* 13.13
## hs_diploma* 14.15
## some_college* 14.00
## four_year_degree* 13.77
## dropout_percent 0.12
## hs_diploma_percent 0.14
## some_college_percent 0.10
## four_year_degree_percent 0.18
## region* 0.02
## region_num 0.02
## resident_count 249.77
## avg_annual_pay 177.09
## metro* 5.28
## avg_home_value 2137.17
The correlation heat maps and scatter plots in the previous Visualization notebook make it appear likely that all the features below will contribute to the regression model’s accuracy. Based on this we’ll start with a full regression model for home values that uses all these features…
region = census region (Northeast, South, Midwest, West)
hs = high school diploma
college = four year degree
wages = average annual pay
rc = resident count
hv = home values
Create the Regression Model
rmdata <- master %>%
select(region_num, hs = hs_diploma_percent, college = four_year_degree_percent, wages = avg_annual_pay, rc = resident_count, hv = avg_home_value)
rmall <- lm(hv ~ region_num + hs + college + wages + rc, data = rmdata)
summary(rmall)
##
## Call:
## lm(formula = hv ~ region_num + hs + college + wages + rc, data = rmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291810 -37903 -3312 28484 1177740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.793e+04 1.906e+04 -3.039 0.00239 **
## region_num 2.037e+04 1.796e+03 11.344 < 2e-16 ***
## hs -1.484e+03 3.338e+02 -4.446 9.08e-06 ***
## college 5.391e+03 2.610e+02 20.658 < 2e-16 ***
## wages 2.339e+00 1.885e-01 12.410 < 2e-16 ***
## rc 1.113e+00 1.199e-01 9.284 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74780 on 2739 degrees of freedom
## Multiple R-squared: 0.5548, Adjusted R-squared: 0.554
## F-statistic: 682.8 on 5 and 2739 DF, p-value: < 2.2e-16
Chart the regression model
We used five features in the regression model, but graphing all five would make for poor readability and isn’t very practical. The chart below uses the features with the highest coefficients (wages and college) to visualize the regression line.
ggplot(rmdata,aes(y=hv,x=wages,color=college)) +
geom_point() +
scale_x_continuous(labels = comma) +
stat_smooth(method="lm",se=FALSE)
One concern this chart raises is the nonlinear nature of the data as average annual wages start to exceed $60,000. The data starts takes on a non-linear trend after this point and the overall distribution of the data is heteroschedastic, rather than the preferred homoschedastic. We’ll look into this further when we evaluate the regression for reasonableness below.
Check for Reasonable Conditions
This model looks pretty good as all the p-values are a lot less than 0.05 and the adjusted R-squared is at 55.4%. Prior to concluding that this is the final model, we’ll check if the conditions to use Regression are reasonable. To do this we will use the ggfortify library’s autoplot function to create diagnostic plots.
autoplot(rmall)
Analyzing the regression relationships…
Residuals vs Fitted checks the linear relationship assumptions. Our results show that up to values of $500K the relationship is a horizontal line; above $500K the outliers take the data in an upward curve.
Normal Q-Q is used to examine whether the residuals are normally distributed. Similar to the “Residuals vs Fitted” the data skews right among outliers of high home values.
Scale-Location (or Spread-Location) is used to check the homogeneity of variance of the residuals (homoscedasticity). Our “$500K” pattern continues as there is good homoscedasticity up to this amount.
Residuals vs Leverage is used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis. The dashed line in the plot is called “Cook’s line” and values above this line should be considered for removal from the model. The above result aligns with the prior notes that values above $500K appear to be impacting the model’s ability to forecast with regression.
Conclusions from this Analysis are…
1. at a county level across the United States the explanatory variables of high school diploma, four year degree, wages, resident count, and region produce a useable regression model to forecast home values.
2. the plot of the regression model showed a heteroschedastic trend to the data for wages above $75,000. In a similar manner the residuals plot for home values is homoschedastic up to around $500,000, then tends to be non-linear.
3. to address the non-linear and heteroschedastic nature of this data for counties with higher average annual wages (above $75K) and/or home values above $500K a next step for this project could be to run this notebook excluding counties that meet these conditions. It’s reasonable to assume that the model’s forecasting ability would improve.
This article from Keong-Woong Moon was helpful with graphing the linear regression line.