Introduction

My question surrounds life expectancy within the United States, and more specifically what socio-economic factors such as income, education level, community, and employment, and how these factors affect life expectancy of Americans. This question arose from looking through the datasets from Opportunity Insights and the Opportunity Atlas. This question also arises from looking at national life expectancy rates themselves, and questioning why they are what they are. I am very interested in exploring the aspect of commuting zones and their characteristics in life expectancy, through discovering what types of communities and characteristics within those communities create longer life expectancies.

In terms of data, I pulled the majority of my data from Opportunity Insights (OI), and organized it by commuting zones, which are geographic areas that are often used when analyzing geography in terms of the economy and population. I used two different datasets, “National by-year life expectancy estimates for men and women, by income percentile” and “Table 9: Neighborhood Characteristics by Census Tract”. To compile this data into a singular dataset, I used functions within Excel that allowed me to match data factors by commuting zone. By combining these two datasets together, I was able to measure life expectancy data alongside the commuting zone factors I was interested in measuring.

Simple Regressions

Health Factors

R Code for scatterplots relating current smokers and life expectancy within each income quartile
# load libraries and data
library(ggplot2)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(corrplot)
## corrplot 0.92 loaded
library(ggpubr)
library(leaps)
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
red = read.csv("~/desktop/finalfinal2.csv")

# Quartile 1
q1 = ggplot(red, aes(x = cur_smoke_q1, y = le_raceadj_q1_F)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q1") +
  ylab("Q1 Life Expectancy") + 
  ggtitle("Q1 Life Expectancy and Smoking")
q1
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).

smoke1 = lm(le_raceadj_q1_F ~ cur_smoke_q1, data = red)
summary(smoke1)
## 
## Call:
## lm(formula = le_raceadj_q1_F ~ cur_smoke_q1, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8578 -1.0701 -0.1626  1.2956  4.5646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    84.554      1.195  70.782   <2e-16 ***
## cur_smoke_q1  -11.879      4.381  -2.712   0.0079 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.908 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.0698, Adjusted R-squared:  0.06031 
## F-statistic: 7.354 on 1 and 98 DF,  p-value: 0.007904
# Quartile 2
q2 = ggplot(red, aes(x = cur_smoke_q2, y = le_raceadj_q2_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q2") +
  ylab("Q2 Life Expectancy") + 
  ggtitle("Q2 Life Expectancy and Smoking")
q2
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

smoke2 = lm(le_raceadj_q2_M ~ cur_smoke_q2, data = red)
summary(smoke2)
## 
## Call:
## lm(formula = le_raceadj_q2_M ~ cur_smoke_q2, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0808 -1.2399 -0.2606  1.3909  5.8880 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    80.141      1.535  52.200   <2e-16 ***
## cur_smoke_q2   -3.180      6.842  -0.465    0.643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.18 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.002199,   Adjusted R-squared:  -0.007982 
## F-statistic: 0.216 on 1 and 98 DF,  p-value: 0.6431
# Quartile 3
q3 = ggplot(red, aes(x = cur_smoke_q3, y = le_raceadj_q3_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q3") +
  ylab("Q3 Life Expectancy") + 
  ggtitle("Q3 Life Expectancy and Smoking")
q3
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

smoke3 = lm(le_raceadj_q3_M ~ cur_smoke_q3, data = red)
summary(smoke3)
## 
## Call:
## lm(formula = le_raceadj_q3_M ~ cur_smoke_q3, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6731 -1.2965  0.1006  1.5783  5.0540 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    84.148      1.349  62.370   <2e-16 ***
## cur_smoke_q3  -11.073      7.478  -1.481    0.142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.036 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.02189,    Adjusted R-squared:  0.01191 
## F-statistic: 2.193 on 1 and 98 DF,  p-value: 0.1419
# Quartile 4
q4 = ggplot(red, aes(x = cur_smoke_q4, y = le_raceadj_q4_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q4") +
  ylab("Q4 Life Expectancy") + 
  ggtitle("Q4 Life Expectancy and Smoking")
q4
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

smoke4 = lm(le_raceadj_q4_M ~ cur_smoke_q4, data = red)
summary(smoke4)
## 
## Call:
## lm(formula = le_raceadj_q4_M ~ cur_smoke_q4, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0203 -1.4971 -0.0728  1.3878  5.5532 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    87.037      1.240  70.167   <2e-16 ***
## cur_smoke_q4  -20.459      9.731  -2.102   0.0381 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.277 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.04316,    Adjusted R-squared:  0.03339 
## F-statistic:  4.42 on 1 and 98 DF,  p-value: 0.03808
Data Explained

In the above data, each point on the scatterplot represents a different commuting zone within the US excluding 12 commuting zones focused in North Dakota and Alaska. These datapoints were deleted because the data being used is pulled from an Opportunity Insights dataet on life expectancy estimates by income quartile that only reports from commuting zones with populations greater than 25,000.

Exploratory Analysis of Simple Linear Regression

The data on smoking and life expectancy is sorted by income quarters, 1st to 4th. This sorting proved for interesting observations. While observing the p-values of linear regression models sorted by income quartile, I found that the first quartile had the most statistically significant results. The p-value for this model (graph to the right) is 0.007904, less than the alpha value of 0.05. This low value indicates that the model is statistically significant and that we shall reject the null hypothesis and accept the alternative hypothesis that says there is a negative correlation of -11.879 between smoking and a shorter life expectancy within the first income quartile. What makes this find interesting is the fact that this trend of statistical significance does not continue as the income quartile increases. The graph to the right is a scatterplot with the same factors, but from the fourth, or highest, income quartile. The correlational coefficient is 0.6231, with a p-value of 0.9539. The large p-value tells us that this model is not statistically significant, and shows evidence for the null hypothesis that smoking has no significant effect on life expectancy within the fourth income quartile.

Check Conditions for Health Model

The first income quartile has the most statistically significant p-value, and to further explore its significance, I am going to assess the conditions of the linear model to ensure its reliability through a series of conditions and plots.

res1 = resid(smoke1)
plot(smoke1)

hist(res1, main = "Histogram of Residuals")

Linearity

The initial scatterplot of Life Expectancy on Smoking shows a negative linear relationship, illustrating that as the prevalence of smoking increases, life expectancy decreases. The Residuals vs. Fitted plot shows equal, random spread above and below the zero line, but we need to be wary of the lack of dispersed data points towards the right of the plot

Zero Mean

The histogram of residuals shows a normal distribution centered at zero, meeting the zero mean condition.

Constant Variance

In the residual plot, the data is spread on both sides of the least squares line, but again, we shall be wary of the right side of the plot with less dispersed data points.

Randomness and Independence

Opportunity Insights used the commuting zone system to organize and separate the data that was collected by the Census, which collects data from every citizen within a geographic area, meaning that the data was collected responsibly. Comparing the data by quartile, the differences in correlational coefficients from quartile to quartile tells us that we should be wary of unifying all of the models into one.

Normality

The histogram of residuals shows a normal distribution centered at zero, and the QQ-plot shows a bit of left skew with data points straying from the line of best fit, but otherwise, the data closely follows the line of best fit, suggesting the model is normal.

Work/Labor Factor Exploration

# Quartile 1
q1 = ggplot(red, aes(x = hours_wk_pooled_pooled_p1, y = le_raceadj_q1_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Mean weekly hours worked") +
  ylab("Q1 Life Expectancy") + 
  ggtitle("Q1 Life Expectancy and Hours Worked")
q1
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).

work1 = lm(le_raceadj_q1_M ~ hours_wk_pooled_pooled_p1, data = red)
summary(work1)
## 
## Call:
## lm(formula = le_raceadj_q1_M ~ hours_wk_pooled_pooled_p1, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4668 -1.2725  0.0375  1.2415  5.8345 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               69.23764    2.06899  33.464  < 2e-16 ***
## hours_wk_pooled_pooled_p1  0.29873    0.09127   3.273  0.00147 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.884 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.09855,    Adjusted R-squared:  0.08935 
## F-statistic: 10.71 on 1 and 98 DF,  p-value: 0.001469
# Quartile 2
q2 = ggplot(red, aes(x = hours_wk_pooled_pooled_p25, y = le_raceadj_q2_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Mean weekly hours worked") +
  ylab("Q2 Life Expectancy") + 
  ggtitle("Q2 Life Expectancy and Hours Worked")
q2
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

work2 = lm(le_raceadj_q2_M ~ hours_wk_pooled_pooled_p25, data = red)
summary(work2)
## 
## Call:
## lm(formula = le_raceadj_q2_M ~ hours_wk_pooled_pooled_p25, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7508 -1.2257 -0.3157  1.4120  5.9084 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 80.3152     4.7481  16.915   <2e-16 ***
## hours_wk_pooled_pooled_p25  -0.0315     0.1696  -0.186    0.853    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.182 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.0003516,  Adjusted R-squared:  -0.009849 
## F-statistic: 0.03447 on 1 and 98 DF,  p-value: 0.8531
# Quartile 3
q3 = ggplot(red, aes(x = hours_wk_pooled_pooled_p75, y = le_raceadj_q3_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Mean weekly hours worked") +
  ylab("Q3 Life Expectancy") + 
  ggtitle("Q3 Life Expectancy and Hours Worked")
q3
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

work3 = lm(le_raceadj_q3_M ~ hours_wk_pooled_pooled_p75, data = red)
summary(work3)
## 
## Call:
## lm(formula = le_raceadj_q3_M ~ hours_wk_pooled_pooled_p75, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2228 -1.2196  0.1853  1.5163  5.2292 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 91.5623     5.0004  18.311   <2e-16 ***
## hours_wk_pooled_pooled_p75  -0.2815     0.1498  -1.879   0.0632 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.022 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.03478,    Adjusted R-squared:  0.02493 
## F-statistic: 3.532 on 1 and 98 DF,  p-value: 0.06318
# Quartile 4
q4 = ggplot(red, aes(x = hours_wk_pooled_pooled_p100, y = le_raceadj_q4_F)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Mean weekly hours worked") +
  ylab("Q4 Life Expectancy") + 
  ggtitle("Q4 Life Expectancy and Hours Worked")
q4
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

work4 = lm(le_raceadj_q4_F ~ hours_wk_pooled_pooled_p100, data = red)
summary(work4)
## 
## Call:
## lm(formula = le_raceadj_q4_F ~ hours_wk_pooled_pooled_p100, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9847 -1.8340 -0.0753  1.5650  6.8493 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  90.4723     5.1783  17.472   <2e-16 ***
## hours_wk_pooled_pooled_p100  -0.1084     0.1424  -0.761    0.449    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.506 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.005872,   Adjusted R-squared:  -0.004272 
## F-statistic: 0.5789 on 1 and 98 DF,  p-value: 0.4486
Check Conditions

To further explore the statistically significant p-value within the first income quartile, I am going to assess the conditions of the data to further ensure the simple linear model is reliable through residual plots, QQ-plots, etc.

# Residual, normal Q-Q plots
plot(work1)

# Histogram of residuals
res1 = resid(work1)
hist(res1, main = "Histogram of Residuals")

Linearity

In the normal scatterplot for the first quartile of work and life expectancy data, the data shows that it meets the linear condition, because as the amount of hours worked weekly increases, life expectancy within the first income quartile increases as well. In the Residuals vs. Fitted plot, there is scatter among the data points and they are scattered randomly above and below the zero line, showing that the linear model created is reasonable for describing the relationship between life expectancy and mean hours worked weekly.

Zero Mean

When performing a least squares regression, the sample mean of the residuals is always zero when estimating the intercept. The histogram of residuals also shows even spread on both sides of zero, a normal distribution.

Constant Variance

Looking at the plot of residuals shows a random constant spread above and below zero, and the histogram of residuals shows an equal spread on the left and right sides of zero.

Randomness and Independence

Opportunity Insights used the commuting zone system to organize and separate the data that was collected by the Census, which collects data from every citizen within a geographic area, meaning that the data was collected responsibly. Comparing the data by quartile, the differences in correlational coefficients from quartile to quartile tells us that we should be wary of unifying all of the models into one.

Normality

The histogram of residuals shows a normal distribution. The normal quantile-quantile plot shows a consistent linear trend, although there is some skew to the right because of the few points in the upper right section.

Overall, this linear model passes all conditions needed to be met for a proper least squares regression equation.

Multiple Regression Exploration

The exploratory data analysis shows that there is strong evidence of statistically significant correlations between work and health factors within the first income quartile, ranging from the 1st to the 25th percentile. To further explore other factors that may affect life expectancy in terms of health and work, I am going to create a multiple regression model that will use other explanatory variables relating to health and work in order to more accurately predict life expectancies within the first income quartile.

Multiple Regression for Health Variables

To more accurately measure the significance of the affect of health factors alone on life expectancy, I created an additional dataset that includes all health-related variables from the main dataset. The download link to the health specific dataset can be found here - https://www.dropbox.com/s/9g1ss8k8umv23p4/healthdata.csv?dl=0. The explanations of the names of the variables can be found here - https://docs.google.com/document/d/1QMvRMuzyy8X4E6_HZ1VtMXFa28SHzdyv2v4YaixRniw/edit?usp=sharing.

Choosing and Creating a Model

Correlation Matrix

To begin, produced a matrix of correlation coefficients to see how the predictor variables effect life expectancy. Looking at the data, the variables that have the strongest associations with life expectancy are

  1. -0.492 Fraction of current smokers - cur_smoke_q1

  2. -0.331 Fraction of obese population - bmi_obese_q1

  3. 0.410 Percent uninsured in 2010 - puninsured2010

  4. -0.270 Percent of medicare enrollees who have had at least one primary care visit within the last 12 months - primcarevis_10

Looking at the data as a whole, the strongest correlations are between:

  1. 0.869 Mean of Z-Scores for Dartmouth Atlas Ambulatory Care Measures and Percent of Females Aged 67-69 who have had Mammograms

  2. 0.728 Percent of Females Aged 67-69 who have had Mammograms and Percent Diabetic with Annual Eye Test

  3. -0.578 Fraction of people who have exercised within the past 30 days and fraction of obese people in the first quartile

  4. -0.669 Medicare $ Per Enrollee and fraction of people who have exercised in the past 30 days

  5. 0.857 30-day Mortality for Heart Attacks and 30-day Hospital Mortality Rate Index

(See below table for correlation matrix visual)

# load data
hlth = read.csv("~/desktop/healthdata.csv")

# choose variables with names(hlth) function to identify the numbers
cordata = cor(hlth[,c(32, 7, 8, 9, 13, 17, 21, 22, 23, 24, 25, 26, 27, 28, 29 )], use = "complete.obs")

# print correlational matrix
cordata
##                         le_raceadj_q1_both puninsured2010 mammogram_10
## le_raceadj_q1_both              1.00000000    0.409552583  -0.01313609
## puninsured2010                  0.40955258    1.000000000  -0.38511649
## mammogram_10                   -0.01313609   -0.385116491   1.00000000
## cur_smoke_q1                   -0.49178572   -0.412569923   0.10884627
## bmi_obese_q1                   -0.33064510    0.038422239  -0.13545923
## exercise_any_q1                 0.16905248   -0.104335824   0.10140325
## mort_30day_hosp_z               0.16244341    0.304858954  -0.16071040
## diab_eyeexam_10                -0.06469811   -0.351894277   0.72788659
## diab_lipids_10                  0.13626869    0.002075106   0.47383752
## primcarevis_10                 -0.27049860   -0.147541927   0.47615494
## reimb_penroll_adj10             0.07963038    0.500973373  -0.31191633
## adjmortmeas_amiall30day         0.14737474    0.362438451  -0.27202722
## adjmortmeas_chfall30day         0.11354249    0.151266932  -0.01102297
## adjmortmeas_pnall30day          0.15380309    0.274266108  -0.14079534
## med_prev_qual_z                -0.03769047   -0.337081052   0.86898970
##                         cur_smoke_q1 bmi_obese_q1 exercise_any_q1
## le_raceadj_q1_both      -0.491785716 -0.330645098      0.16905248
## puninsured2010          -0.412569923  0.038422239     -0.10433582
## mammogram_10             0.108846268 -0.135459231      0.10140325
## cur_smoke_q1             1.000000000  0.273436489     -0.36593813
## bmi_obese_q1             0.273436489  1.000000000     -0.57834977
## exercise_any_q1         -0.365938127 -0.578349771      1.00000000
## mort_30day_hosp_z       -0.051748942 -0.006387185      0.14486944
## diab_eyeexam_10          0.027868374 -0.166524409      0.10303742
## diab_lipids_10          -0.165990413 -0.091820075     -0.09728895
## primcarevis_10           0.366366833  0.345893828     -0.24021412
## reimb_penroll_adj10      0.048788895  0.293027828     -0.66955145
## adjmortmeas_amiall30day -0.075027995  0.106446468      0.02309639
## adjmortmeas_chfall30day -0.070661414 -0.219951072      0.41879551
## adjmortmeas_pnall30day   0.006981048  0.101377944     -0.06919963
## med_prev_qual_z          0.014671918 -0.160097018      0.19376191
##                         mort_30day_hosp_z diab_eyeexam_10 diab_lipids_10
## le_raceadj_q1_both            0.162443415     -0.06469811    0.136268688
## puninsured2010                0.304858954     -0.35189428    0.002075106
## mammogram_10                 -0.160710397      0.72788659    0.473837521
## cur_smoke_q1                 -0.051748942      0.02786837   -0.165990413
## bmi_obese_q1                 -0.006387185     -0.16652441   -0.091820075
## exercise_any_q1               0.144869441      0.10303742   -0.097288955
## mort_30day_hosp_z             1.000000000     -0.22717437   -0.248622569
## diab_eyeexam_10              -0.227174368      1.00000000    0.498365447
## diab_lipids_10               -0.248622569      0.49836545    1.000000000
## primcarevis_10                0.060866168      0.31002021    0.181463162
## reimb_penroll_adj10          -0.260424088     -0.25593691    0.124653227
## adjmortmeas_amiall30day       0.857258490     -0.30638700   -0.220575922
## adjmortmeas_chfall30day       0.848040307     -0.01855114   -0.220977623
## adjmortmeas_pnall30day        0.849187830     -0.26205789   -0.195358678
## med_prev_qual_z              -0.136234229      0.83490965    0.671550225
##                         primcarevis_10 reimb_penroll_adj10
## le_raceadj_q1_both       -0.2704985950          0.07963038
## puninsured2010           -0.1475419273          0.50097337
## mammogram_10              0.4761549447         -0.31191633
## cur_smoke_q1              0.3663668330          0.04878889
## bmi_obese_q1              0.3458938282          0.29302783
## exercise_any_q1          -0.2402141235         -0.66955145
## mort_30day_hosp_z         0.0608661678         -0.26042409
## diab_eyeexam_10           0.3100202117         -0.25593691
## diab_lipids_10            0.1814631619          0.12465323
## primcarevis_10            1.0000000000         -0.03321519
## reimb_penroll_adj10      -0.0332151898          1.00000000
## adjmortmeas_amiall30day  -0.0002811419         -0.03512968
## adjmortmeas_chfall30day   0.0683039573         -0.49432222
## adjmortmeas_pnall30day    0.0802329948         -0.12328907
## med_prev_qual_z           0.5351918376         -0.36554703
##                         adjmortmeas_amiall30day adjmortmeas_chfall30day
## le_raceadj_q1_both                 0.1473747442              0.11354249
## puninsured2010                     0.3624384508              0.15126693
## mammogram_10                      -0.2720272160             -0.01102297
## cur_smoke_q1                      -0.0750279951             -0.07066141
## bmi_obese_q1                       0.1064464677             -0.21995107
## exercise_any_q1                    0.0230963852              0.41879551
## mort_30day_hosp_z                  0.8572584903              0.84804031
## diab_eyeexam_10                   -0.3063870046             -0.01855114
## diab_lipids_10                    -0.2205759219             -0.22097762
## primcarevis_10                    -0.0002811419              0.06830396
## reimb_penroll_adj10               -0.0351296789             -0.49432222
## adjmortmeas_amiall30day            1.0000000000              0.62825004
## adjmortmeas_chfall30day            0.6282500375              1.00000000
## adjmortmeas_pnall30day             0.5972172770              0.54010587
## med_prev_qual_z                   -0.2584231901              0.07282819
##                         adjmortmeas_pnall30day med_prev_qual_z
## le_raceadj_q1_both                 0.153803088     -0.03769047
## puninsured2010                     0.274266108     -0.33708105
## mammogram_10                      -0.140795338      0.86898970
## cur_smoke_q1                       0.006981048      0.01467192
## bmi_obese_q1                       0.101377944     -0.16009702
## exercise_any_q1                   -0.069199628      0.19376191
## mort_30day_hosp_z                  0.849187830     -0.13623423
## diab_eyeexam_10                   -0.262057887      0.83490965
## diab_lipids_10                    -0.195358678      0.67155022
## primcarevis_10                     0.080232995      0.53519184
## reimb_penroll_adj10               -0.123289069     -0.36554703
## adjmortmeas_amiall30day            0.597217277     -0.25842319
## adjmortmeas_chfall30day            0.540105872      0.07282819
## adjmortmeas_pnall30day             1.000000000     -0.17293120
## med_prev_qual_z                   -0.172931205      1.00000000

Create Multiple Regression Model

Looking at the r correlation coefficients for the data, I am going to create a multiple regression model that uses the fraction of current smokers in the first income quartile, fraction of obese population, percent of population uninsured in 2010, and the percentage of medicare enrollees who have had at least one primary care visit within the last 12 months to create a multiple regression model

# create multiple regression model
hlthmr = lm(le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 + primcarevis_10, data = hlth)

# view summary of model
summary(hlthmr)
## 
## Call:
## lm(formula = le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + 
##     puninsured2010 + primcarevis_10, data = hlth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5134 -0.9107 -0.1037  0.7583  4.8818 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     84.70026    2.82123  30.022  < 2e-16 ***
## cur_smoke_q1   -11.09496    3.78916  -2.928  0.00427 ** 
## bmi_obese_q1   -12.12952    4.37222  -2.774  0.00666 ** 
## puninsured2010   0.08822    0.02749   3.209  0.00182 ** 
## primcarevis_10  -0.01420    0.03818  -0.372  0.71078    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.378 on 95 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.3552, Adjusted R-squared:  0.328 
## F-statistic: 13.08 on 4 and 95 DF,  p-value: 1.587e-08

Initial Observations INCLUDE SOMETHING ABOUT T-VALUES. Looking at Pr(>|t|), or the p-value that corresponds to the t-statistic, we see that the smoking, obesity, and uninsured predictor variables are the three with values below the alpha level of 0.05, meaning that these three are statistically significant p-values. The residual standard error is 1.378 on 95 degrees of freedom. The residual standard error is the standard deviation of residuals, meaning that this model predicts the average life expectancy of a person in the first income quartile with an average error of about 1.378 years (this value is valuable when comparing the accuracy of different regression models). The coefficient of determination, or multiple R^2, is 0.3552, and the adjusted R^2 value is 0.328. The coefficient of determination explains how much variation in our life expectancy variable can be explained by our predictor variables, meaning that this model explains only 32.8% of variance. The p-value of 1.587e-08 is less than the significance level of 0.05, meaning the regression model is statistically significant. The purpose of the F-statistic is to see if any of the predictor variables are related to life expectancy, and this goes hand in hand with the p-value, which is statistically significant.

After observing the model, I decided to eliminate the ‘primcarevis_10’ variable because it was not statistically significant in the model.

hlthmr = lm(le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010, data = hlth)
summary(hlthmr)
## 
## Call:
## lm(formula = le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + 
##     puninsured2010, data = hlth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5447 -0.9496 -0.1078  0.7340  4.8681 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     83.79973    1.44151  58.133  < 2e-16 ***
## cur_smoke_q1   -11.45443    3.64734  -3.140  0.00224 ** 
## bmi_obese_q1   -12.58156    4.18100  -3.009  0.00335 ** 
## puninsured2010   0.08869    0.02734   3.244  0.00162 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.372 on 96 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.3542, Adjusted R-squared:  0.3341 
## F-statistic: 17.55 on 3 and 96 DF,  p-value: 3.652e-09

The result of this is a more significant p-value and a greater R^2 value of 0.3341, telling us that the regression equation explains 33.41% of variance in the model.

Check Conditions for Multiple Regression Model

  1. Zero Mean When using multiple regression, the sample mean of the residuals must be zero.This histogram shows the residuals are centered around zero.
hlthmrresid = resid(hlthmr)
hist(hlthmrresid, main = "Histogram of hlth Residuals")

  1. Constant Variance
# plot residuals
plot(hlthmr)

Here, the Residuals vs. Fitted shows a small amount of departure from normality as the residuals do not have as much spread towards the end of the plot as they do in the beginning, but nonetheless, the points are evenly spread above and below the zero line, proving the condition for constant variance is met. The quantile-quantile plot mostly follows the line of best fit with a few departures towards the right, hinting at right skew.

Multiple Regression for Work/Labor Variables

To more accurately see what work related variables have the most significance on life expectancy. To more accurately measure this, I created an additional

Simple Linear Regression Exploration

q1 = ggplot(red, aes(x = hours_wk_pooled_pooled_p1, y = le_raceadj_q1_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Mean weekly hours worked") +
  ylab("Q1 Life Expectancy") + 
  ggtitle("Q1 Life Expectancy and Hours Worked")
q1
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).

The scatterplot above show the correlation between the mean hours worked weekly and the corresponding life expectancy within that income quartile, which above, is the first income quartile ranging from the first to the twenty fifth percentile.

q2 = ggplot(red, aes(x = hours_wk_pooled_pooled_p25, y = le_raceadj_q2_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Mean weekly hours worked") +
  ylab("Q2 Life Expectancy") + 
  ggtitle("Q2 Life Expectancy and Hours Worked")
q2
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).

The scatterplot above shows the correlation between mean weekly hours worked and life expectancy in the second income quartile, from the twenty-fifth to the fiftieth percentile. The p-value within this dataset is 0.85, which is not statistically significant compared to the quarter one income quartile data. Because of this fact, I chose to explore the effect of labor and work predictor variables specifically in the first income quartile.

Choosing and Creating a Model

CREATING A MODEL To begin, I again created a correlation matrix in R to find the correlational coefficients of different predictor variables on life expectancy as well as in relation to other predictor variables.

  1. 0.85 Mean weekly hours worked in the 12 months before being surveyed

  2. 0.309 Fraction of children with positive hours worked in the 12 months before being surveyed

  3. -0.333 Share Working in Manufacturing

  4. 0.311 Median House Value

  5. 0.158 Median Household Income in 2016

Looking at the data as a whole, the strongest correlations are between:

  1. -0.537 Mean Household Income in 2016 and Share of Poor People in 2010

  2. -0.500 Percent Employed in 2000 and Share of Poor People

  3. 0.684 Fraction of children with positive W-2 earnings in 2015 and Mean household income rank for parents of the people surveyed

  4. -0.574 Unemployment Rate and Mean household income rank for parents of the people surveyed

## load data
work = read.csv("~/desktop/workdata.csv")

## Use the names() function to choose variables to be used in the correlation matrix
# names(work)
cordata = cor(work[,c(10, 2, 3, 4, 5, 6, 8, 9, 11, 12, 15, 18, 20, 21, 22, 23, 24, 25, 26, 27)], use = "complete.obs")

# view matrix
print(cordata)
##                                 le_raceadj_q1_both hhinc_mean2000
## le_raceadj_q1_both                     1.000000000     0.09105442
## hhinc_mean2000                         0.091054423     1.00000000
## mean_commutetime2000                  -0.079198601     0.17497527
## med_hhinc2016                          0.158862663     0.94526997
## poor_share2010                        -0.075893493    -0.53723906
## emp2000                               -0.065429217     0.47690266
## jobs_highpay_5mi_2015                  0.191428890     0.16913946
## ann_avg_job_growth_2004_2013           0.053711993    -0.07006619
## par_rank_pooled_pooled_mean            0.015845988     0.36349195
## frac_below_median_pooled_pooled       -0.003483334    -0.31637269
## hours_wk_pooled_pooled_p25             0.850325390     0.05611530
## pos_hours_pooled_pooled_mean           0.309276182     0.21696150
## working_pooled_pooled_mean             0.130705110     0.12706781
## unemp_rate                             0.014556303    -0.25021494
## taxrate                                0.204485917     0.07711122
## frac_middleclass                      -0.175255856    -0.23807430
## cs_elf_ind_man                        -0.333336300    -0.09068853
## lf_d_2000_1980                         0.154836353    -0.02912371
## median_house_value                     0.310704041     0.45535382
## hhinc00                                0.069693096     0.47151125
##                                 mean_commutetime2000 med_hhinc2016
## le_raceadj_q1_both                      -0.079198601    0.15886266
## hhinc_mean2000                           0.174975275    0.94526997
## mean_commutetime2000                     1.000000000    0.20070802
## med_hhinc2016                            0.200708018    1.00000000
## poor_share2010                          -0.193312747   -0.56391518
## emp2000                                  0.315165146    0.52335095
## jobs_highpay_5mi_2015                   -0.341655216    0.23662681
## ann_avg_job_growth_2004_2013             0.086422893   -0.03793591
## par_rank_pooled_pooled_mean              0.082902475    0.36250605
## frac_below_median_pooled_pooled         -0.055087757   -0.32097207
## hours_wk_pooled_pooled_p25              -0.007120154    0.11447319
## pos_hours_pooled_pooled_mean             0.007978061    0.22695449
## working_pooled_pooled_mean               0.004496452    0.15380103
## unemp_rate                              -0.195929242   -0.26672772
## taxrate                                  0.003986853    0.12542530
## frac_middleclass                        -0.165298223   -0.20820457
## cs_elf_ind_man                          -0.091961769   -0.15462678
## lf_d_2000_1980                           0.081845858   -0.02181458
## median_house_value                      -0.026855017    0.43875770
## hhinc00                                  0.126126789    0.40976921
##                                 poor_share2010     emp2000
## le_raceadj_q1_both                -0.075893493 -0.06542922
## hhinc_mean2000                    -0.537239064  0.47690266
## mean_commutetime2000              -0.193312747  0.31516515
## med_hhinc2016                     -0.563915182  0.52335095
## poor_share2010                     1.000000000 -0.50083320
## emp2000                           -0.500833204  1.00000000
## jobs_highpay_5mi_2015              0.020294615 -0.19846073
## ann_avg_job_growth_2004_2013       0.004129212 -0.04411273
## par_rank_pooled_pooled_mean       -0.152067070  0.37978167
## frac_below_median_pooled_pooled    0.122551430 -0.38471352
## hours_wk_pooled_pooled_p25        -0.071901081  0.08447463
## pos_hours_pooled_pooled_mean      -0.126130014  0.35829182
## working_pooled_pooled_mean        -0.047793562  0.35244897
## unemp_rate                         0.304089340 -0.38033938
## taxrate                           -0.082385956  0.08398608
## frac_middleclass                   0.123236230  0.02472701
## cs_elf_ind_man                     0.140922296  0.17494858
## lf_d_2000_1980                    -0.078446488 -0.15447999
## median_house_value                -0.276779948  0.01229535
## hhinc00                           -0.301901615  0.10725281
##                                 jobs_highpay_5mi_2015
## le_raceadj_q1_both                         0.19142889
## hhinc_mean2000                             0.16913946
## mean_commutetime2000                      -0.34165522
## med_hhinc2016                              0.23662681
## poor_share2010                             0.02029461
## emp2000                                   -0.19846073
## jobs_highpay_5mi_2015                      1.00000000
## ann_avg_job_growth_2004_2013              -0.11804827
## par_rank_pooled_pooled_mean                0.10786146
## frac_below_median_pooled_pooled           -0.06016351
## hours_wk_pooled_pooled_p25                 0.18831899
## pos_hours_pooled_pooled_mean               0.12024616
## working_pooled_pooled_mean                 0.06463117
## unemp_rate                                -0.03231251
## taxrate                                    0.34934474
## frac_middleclass                          -0.24037063
## cs_elf_ind_man                            -0.22674484
## lf_d_2000_1980                            -0.12654618
## median_house_value                         0.26041859
## hhinc00                                    0.28121363
##                                 ann_avg_job_growth_2004_2013
## le_raceadj_q1_both                               0.053711993
## hhinc_mean2000                                  -0.070066191
## mean_commutetime2000                             0.086422893
## med_hhinc2016                                   -0.037935912
## poor_share2010                                   0.004129212
## emp2000                                         -0.044112735
## jobs_highpay_5mi_2015                           -0.118048268
## ann_avg_job_growth_2004_2013                     1.000000000
## par_rank_pooled_pooled_mean                     -0.344895795
## frac_below_median_pooled_pooled                  0.338405444
## hours_wk_pooled_pooled_p25                      -0.088304489
## pos_hours_pooled_pooled_mean                    -0.314276140
## working_pooled_pooled_mean                      -0.338528877
## unemp_rate                                       0.222925959
## taxrate                                         -0.157959339
## frac_middleclass                                -0.191076245
## cs_elf_ind_man                                  -0.195576775
## lf_d_2000_1980                                   0.117077193
## median_house_value                              -0.022060949
## hhinc00                                         -0.278537090
##                                 par_rank_pooled_pooled_mean
## le_raceadj_q1_both                               0.01584599
## hhinc_mean2000                                   0.36349195
## mean_commutetime2000                             0.08290248
## med_hhinc2016                                    0.36250605
## poor_share2010                                  -0.15206707
## emp2000                                          0.37978167
## jobs_highpay_5mi_2015                            0.10786146
## ann_avg_job_growth_2004_2013                    -0.34489579
## par_rank_pooled_pooled_mean                      1.00000000
## frac_below_median_pooled_pooled                 -0.99007066
## hours_wk_pooled_pooled_p25                       0.18526103
## pos_hours_pooled_pooled_mean                     0.75815246
## working_pooled_pooled_mean                       0.68458943
## unemp_rate                                      -0.57421094
## taxrate                                          0.33061980
## frac_middleclass                                 0.19918524
## cs_elf_ind_man                                   0.27533374
## lf_d_2000_1980                                  -0.28729442
## median_house_value                               0.38319623
## hhinc00                                          0.62807113
##                                 frac_below_median_pooled_pooled
## le_raceadj_q1_both                                 -0.003483334
## hhinc_mean2000                                     -0.316372689
## mean_commutetime2000                               -0.055087757
## med_hhinc2016                                      -0.320972069
## poor_share2010                                      0.122551430
## emp2000                                            -0.384713525
## jobs_highpay_5mi_2015                              -0.060163510
## ann_avg_job_growth_2004_2013                        0.338405444
## par_rank_pooled_pooled_mean                        -0.990070657
## frac_below_median_pooled_pooled                     1.000000000
## hours_wk_pooled_pooled_p25                         -0.177766587
## pos_hours_pooled_pooled_mean                       -0.765384188
## working_pooled_pooled_mean                         -0.719350140
## unemp_rate                                          0.543040827
## taxrate                                            -0.294810350
## frac_middleclass                                   -0.294452947
## cs_elf_ind_man                                     -0.335377003
## lf_d_2000_1980                                      0.332702913
## median_house_value                                 -0.319005890
## hhinc00                                            -0.541953066
##                                 hours_wk_pooled_pooled_p25
## le_raceadj_q1_both                             0.850325390
## hhinc_mean2000                                 0.056115300
## mean_commutetime2000                          -0.007120154
## med_hhinc2016                                  0.114473188
## poor_share2010                                -0.071901081
## emp2000                                        0.084474628
## jobs_highpay_5mi_2015                          0.188318985
## ann_avg_job_growth_2004_2013                  -0.088304489
## par_rank_pooled_pooled_mean                    0.185261033
## frac_below_median_pooled_pooled               -0.177766587
## hours_wk_pooled_pooled_p25                     1.000000000
## pos_hours_pooled_pooled_mean                   0.641610114
## working_pooled_pooled_mean                     0.507372079
## unemp_rate                                    -0.239434482
## taxrate                                        0.330039711
## frac_middleclass                              -0.108066294
## cs_elf_ind_man                                -0.152914988
## lf_d_2000_1980                                -0.089001699
## median_house_value                             0.109526982
## hhinc00                                        0.142412460
##                                 pos_hours_pooled_pooled_mean
## le_raceadj_q1_both                               0.309276182
## hhinc_mean2000                                   0.216961498
## mean_commutetime2000                             0.007978061
## med_hhinc2016                                    0.226954492
## poor_share2010                                  -0.126130014
## emp2000                                          0.358291823
## jobs_highpay_5mi_2015                            0.120246158
## ann_avg_job_growth_2004_2013                    -0.314276140
## par_rank_pooled_pooled_mean                      0.758152464
## frac_below_median_pooled_pooled                 -0.765384188
## hours_wk_pooled_pooled_p25                       0.641610114
## pos_hours_pooled_pooled_mean                     1.000000000
## working_pooled_pooled_mean                       0.901925174
## unemp_rate                                      -0.532614285
## taxrate                                          0.395086172
## frac_middleclass                                 0.147443859
## cs_elf_ind_man                                   0.222183364
## lf_d_2000_1980                                  -0.384371050
## median_house_value                               0.185931196
## hhinc00                                          0.432336331
##                                 working_pooled_pooled_mean  unemp_rate
## le_raceadj_q1_both                             0.130705110  0.01455630
## hhinc_mean2000                                 0.127067812 -0.25021494
## mean_commutetime2000                           0.004496452 -0.19592924
## med_hhinc2016                                  0.153801029 -0.26672772
## poor_share2010                                -0.047793562  0.30408934
## emp2000                                        0.352448967 -0.38033938
## jobs_highpay_5mi_2015                          0.064631174 -0.03231251
## ann_avg_job_growth_2004_2013                  -0.338528877  0.22292596
## par_rank_pooled_pooled_mean                    0.684589434 -0.57421094
## frac_below_median_pooled_pooled               -0.719350140  0.54304083
## hours_wk_pooled_pooled_p25                     0.507372079 -0.23943448
## pos_hours_pooled_pooled_mean                   0.901925174 -0.53261429
## working_pooled_pooled_mean                     1.000000000 -0.44696368
## unemp_rate                                    -0.446963678  1.00000000
## taxrate                                        0.354511451 -0.14137202
## frac_middleclass                               0.236830082 -0.14830155
## cs_elf_ind_man                                 0.335033068  0.04453594
## lf_d_2000_1980                                -0.496083807  0.05654919
## median_house_value                            -0.008773129 -0.11451578
## hhinc00                                        0.260173094 -0.51007599
##                                      taxrate frac_middleclass cs_elf_ind_man
## le_raceadj_q1_both               0.204485917      -0.17525586    -0.33333630
## hhinc_mean2000                   0.077111216      -0.23807430    -0.09068853
## mean_commutetime2000             0.003986853      -0.16529822    -0.09196177
## med_hhinc2016                    0.125425304      -0.20820457    -0.15462678
## poor_share2010                  -0.082385956       0.12323623     0.14092230
## emp2000                          0.083986083       0.02472701     0.17494858
## jobs_highpay_5mi_2015            0.349344741      -0.24037063    -0.22674484
## ann_avg_job_growth_2004_2013    -0.157959339      -0.19107624    -0.19557677
## par_rank_pooled_pooled_mean      0.330619797       0.19918524     0.27533374
## frac_below_median_pooled_pooled -0.294810350      -0.29445295    -0.33537700
## hours_wk_pooled_pooled_p25       0.330039711      -0.10806629    -0.15291499
## pos_hours_pooled_pooled_mean     0.395086172       0.14744386     0.22218336
## working_pooled_pooled_mean       0.354511451       0.23683008     0.33503307
## unemp_rate                      -0.141372016      -0.14830155     0.04453594
## taxrate                          1.000000000      -0.25114449    -0.10623790
## frac_middleclass                -0.251144488       1.00000000     0.39727464
## cs_elf_ind_man                  -0.106237902       0.39727464     1.00000000
## lf_d_2000_1980                  -0.057408286      -0.15608960    -0.47878651
## median_house_value               0.073634393      -0.27657458    -0.14053407
## hhinc00                          0.315972353      -0.22740444    -0.13242301
##                                 lf_d_2000_1980 median_house_value    hhinc00
## le_raceadj_q1_both                  0.15483635        0.310704041  0.0696931
## hhinc_mean2000                     -0.02912371        0.455353825  0.4715112
## mean_commutetime2000                0.08184586       -0.026855017  0.1261268
## med_hhinc2016                      -0.02181458        0.438757696  0.4097692
## poor_share2010                     -0.07844649       -0.276779948 -0.3019016
## emp2000                            -0.15447999        0.012295346  0.1072528
## jobs_highpay_5mi_2015              -0.12654618        0.260418594  0.2812136
## ann_avg_job_growth_2004_2013        0.11707719       -0.022060949 -0.2785371
## par_rank_pooled_pooled_mean        -0.28729442        0.383196229  0.6280711
## frac_below_median_pooled_pooled     0.33270291       -0.319005890 -0.5419531
## hours_wk_pooled_pooled_p25         -0.08900170        0.109526982  0.1424125
## pos_hours_pooled_pooled_mean       -0.38437105        0.185931196  0.4323363
## working_pooled_pooled_mean         -0.49608381       -0.008773129  0.2601731
## unemp_rate                          0.05654919       -0.114515777 -0.5100760
## taxrate                            -0.05740829        0.073634393  0.3159724
## frac_middleclass                   -0.15608960       -0.276574580 -0.2274044
## cs_elf_ind_man                     -0.47878651       -0.140534066 -0.1324230
## lf_d_2000_1980                      1.00000000        0.014243785  0.1482713
## median_house_value                  0.01424379        1.000000000  0.5704007
## hhinc00                             0.14827132        0.570400741  1.0000000

Create Multiple Regression Model

Looking at the r correlation coefficients for the data, I am going to create a multiple regression model that uses the mean hours worked weekly in the twelve months prior to the survey, the fraction of people with positive working hours in the 12 months prior to the survey, the share working in manufacturing, median house value, and finally, the median household income in 2016 to create a multiple regression model.

workmr = lm(le_raceadj_q1_both ~ hours_wk_pooled_pooled_p25 + pos_hours_pooled_pooled_mean + 
              cs_elf_ind_man + hhinc_mean2000, data = work)
summary(workmr)
## 
## Call:
## lm(formula = le_raceadj_q1_both ~ hours_wk_pooled_pooled_p25 + 
##     pos_hours_pooled_pooled_mean + cs_elf_ind_man + hhinc_mean2000, 
##     data = work)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87066 -0.28540 -0.04589  0.20837  1.26227 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -5.427e-01  1.358e+00  -0.400   0.6903    
## hours_wk_pooled_pooled_p25    8.747e-01  4.821e-02  18.144  < 2e-16 ***
## pos_hours_pooled_pooled_mean -1.489e+01  2.343e+00  -6.354 7.22e-09 ***
## cs_elf_ind_man               -1.269e+00  8.823e-01  -1.439   0.1535    
## hhinc_mean2000                2.790e-06  1.109e-06   2.515   0.0136 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4294 on 95 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.8358, Adjusted R-squared:  0.8289 
## F-statistic: 120.9 on 4 and 95 DF,  p-value: < 2.2e-16

Initial Observations

At first glance, it appears that the variables chosen do not carry much statistical significance within the model, as the correlational coefficients are all below 1, with only two predictor variables having a Pr(>|t|) value below the significance level of 0.05.

This model is simply not statistically significant, so I am going to use another method to choose the predictor variables for my model, stepwise regression. To begin, I created a regression model with all available work related variables. Then, I used a stepwise function within R to find the best model.

# create regression model with all variables
stepwise = lm(le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 +
           med_hhinc2016 + poor_share2010 + emp2000 +
           jobs_highpay_5mi_2015 + ann_avg_job_growth_2004_2013 +
           hours_wk_pooled_pooled_p25 + working_pooled_pooled_mean +
           unemp_rate + taxrate + frac_middleclass +
           cs_elf_ind_man + lf_d_2000_1980 + median_house_value, data = work)

# stepwise regression
stp_s = ols_step_both_p(stepwise, details = TRUE)
## Stepwise Selection Method   
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. hhinc_mean2000 
## 2. mean_commutetime2000 
## 3. med_hhinc2016 
## 4. poor_share2010 
## 5. emp2000 
## 6. jobs_highpay_5mi_2015 
## 7. ann_avg_job_growth_2004_2013 
## 8. hours_wk_pooled_pooled_p25 
## 9. working_pooled_pooled_mean 
## 10. unemp_rate 
## 11. taxrate 
## 12. frac_middleclass 
## 13. cs_elf_ind_man 
## 14. lf_d_2000_1980 
## 15. median_house_value 
## 
## We are selecting variables based on p value...
## 
## 
## Stepwise Selection: Step 1 
## 
## - hours_wk_pooled_pooled_p25 added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.850       RMSE               0.549 
## R-Squared               0.723       Coef. Var          4.862 
## Adj. R-Squared          0.720       MSE                0.302 
## Pred R-Squared          0.713       MAE                0.433 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     77.149         1         77.149    255.859    0.0000 
## Residual       29.550        98          0.302                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                           Parameter Estimates                                            
## --------------------------------------------------------------------------------------------------------
##                      model      Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## --------------------------------------------------------------------------------------------------------
##                (Intercept)    -7.798         1.195                 -6.526    0.000    -10.169    -5.427 
## hours_wk_pooled_pooled_p25     0.683         0.043        0.850    15.996    0.000      0.598     0.768 
## --------------------------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 2 
## 
## - working_pooled_pooled_mean added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.919       RMSE               0.413 
## R-Squared               0.845       Coef. Var          3.658 
## Adj. R-Squared          0.842       MSE                0.171 
## Pred R-Squared          0.836       MAE                0.311 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     90.144         2         45.072    264.082    0.0000 
## Residual       16.555        97          0.171                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                             
## ----------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
## ----------------------------------------------------------------------------------------------------------
##                (Intercept)     -0.208         1.251                 -0.166    0.868     -2.690      2.275 
## hours_wk_pooled_pooled_p25      0.848         0.037        1.056    22.748    0.000      0.774      0.922 
## working_pooled_pooled_mean    -15.711         1.801       -0.405    -8.726    0.000    -19.284    -12.137 
## ----------------------------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.919       RMSE               0.413 
## R-Squared               0.845       Coef. Var          3.658 
## Adj. R-Squared          0.842       MSE                0.171 
## Pred R-Squared          0.836       MAE                0.311 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     90.144         2         45.072    264.082    0.0000 
## Residual       16.555        97          0.171                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                             
## ----------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
## ----------------------------------------------------------------------------------------------------------
##                (Intercept)     -0.208         1.251                 -0.166    0.868     -2.690      2.275 
## hours_wk_pooled_pooled_p25      0.848         0.037        1.056    22.748    0.000      0.774      0.922 
## working_pooled_pooled_mean    -15.711         1.801       -0.405    -8.726    0.000    -19.284    -12.137 
## ----------------------------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 3 
## 
## - median_house_value added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.939       RMSE               0.362 
## R-Squared               0.882       Coef. Var          3.204 
## Adj. R-Squared          0.878       MSE                0.131 
## Pred R-Squared          0.873       MAE                0.271 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     94.127         3         31.376    239.587    0.0000 
## Residual       12.572        96          0.131                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                             
## ----------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
## ----------------------------------------------------------------------------------------------------------
##                (Intercept)     -0.510         1.097                 -0.465    0.643     -2.688      1.667 
## hours_wk_pooled_pooled_p25      0.824         0.033        1.026    25.012    0.000      0.758      0.889 
## working_pooled_pooled_mean    -15.056         1.582       -0.388    -9.519    0.000    -18.195    -11.916 
##         median_house_value      0.000         0.000        0.195     5.515    0.000      0.000      0.000 
## ----------------------------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.939       RMSE               0.362 
## R-Squared               0.882       Coef. Var          3.204 
## Adj. R-Squared          0.878       MSE                0.131 
## Pred R-Squared          0.873       MAE                0.271 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     94.127         3         31.376    239.587    0.0000 
## Residual       12.572        96          0.131                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                             
## ----------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
## ----------------------------------------------------------------------------------------------------------
##                (Intercept)     -0.510         1.097                 -0.465    0.643     -2.688      1.667 
## hours_wk_pooled_pooled_p25      0.824         0.033        1.026    25.012    0.000      0.758      0.889 
## working_pooled_pooled_mean    -15.056         1.582       -0.388    -9.519    0.000    -18.195    -11.916 
##         median_house_value      0.000         0.000        0.195     5.515    0.000      0.000      0.000 
## ----------------------------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 4 
## 
## - unemp_rate added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.947       RMSE               0.340 
## R-Squared               0.897       Coef. Var          3.007 
## Adj. R-Squared          0.893       MSE                0.115 
## Pred R-Squared          0.886       MAE                0.242 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     95.741         4         23.935    207.506    0.0000 
## Residual       10.958        95          0.115                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -2.950         1.219                 -2.420    0.017     -5.369    -0.530 
## hours_wk_pooled_pooled_p25      0.824         0.031        1.026    26.647    0.000      0.762     0.885 
## working_pooled_pooled_mean    -12.642         1.619       -0.326    -7.810    0.000    -15.855    -9.428 
##         median_house_value      0.000         0.000        0.211     6.317    0.000      0.000     0.000 
##                 unemp_rate     11.260         3.010        0.139     3.741    0.000      5.284    17.235 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.947       RMSE               0.340 
## R-Squared               0.897       Coef. Var          3.007 
## Adj. R-Squared          0.893       MSE                0.115 
## Pred R-Squared          0.886       MAE                0.242 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     95.741         4         23.935    207.506    0.0000 
## Residual       10.958        95          0.115                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -2.950         1.219                 -2.420    0.017     -5.369    -0.530 
## hours_wk_pooled_pooled_p25      0.824         0.031        1.026    26.647    0.000      0.762     0.885 
## working_pooled_pooled_mean    -12.642         1.619       -0.326    -7.810    0.000    -15.855    -9.428 
##         median_house_value      0.000         0.000        0.211     6.317    0.000      0.000     0.000 
##                 unemp_rate     11.260         3.010        0.139     3.741    0.000      5.284    17.235 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 5 
## 
## - frac_middleclass added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.953       RMSE               0.324 
## R-Squared               0.908       Coef. Var          2.866 
## Adj. R-Squared          0.903       MSE                0.105 
## Pred R-Squared          0.892       MAE                0.231 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     96.854         5         19.371     184.94    0.0000 
## Residual        9.846        94          0.105                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                             
## ----------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
## ----------------------------------------------------------------------------------------------------------
##                (Intercept)     -3.764         1.188                 -3.168    0.002     -6.123     -1.405 
## hours_wk_pooled_pooled_p25      0.848         0.030        1.056    27.897    0.000      0.788      0.909 
## working_pooled_pooled_mean    -14.091         1.605       -0.363    -8.778    0.000    -17.278    -10.904 
##         median_house_value      0.000         0.000        0.240     7.261    0.000      0.000      0.000 
##                 unemp_rate     12.136         2.881        0.150     4.212    0.000      6.416     17.856 
##           frac_middleclass      2.291         0.703        0.114     3.259    0.002      0.895      3.687 
## ----------------------------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.953       RMSE               0.324 
## R-Squared               0.908       Coef. Var          2.866 
## Adj. R-Squared          0.903       MSE                0.105 
## Pred R-Squared          0.892       MAE                0.231 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     96.854         5         19.371     184.94    0.0000 
## Residual        9.846        94          0.105                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                             
## ----------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
## ----------------------------------------------------------------------------------------------------------
##                (Intercept)     -3.764         1.188                 -3.168    0.002     -6.123     -1.405 
## hours_wk_pooled_pooled_p25      0.848         0.030        1.056    27.897    0.000      0.788      0.909 
## working_pooled_pooled_mean    -14.091         1.605       -0.363    -8.778    0.000    -17.278    -10.904 
##         median_house_value      0.000         0.000        0.240     7.261    0.000      0.000      0.000 
##                 unemp_rate     12.136         2.881        0.150     4.212    0.000      6.416     17.856 
##           frac_middleclass      2.291         0.703        0.114     3.259    0.002      0.895      3.687 
## ----------------------------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 6 
## 
## - lf_d_2000_1980 added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.957       RMSE               0.311 
## R-Squared               0.916       Coef. Var          2.751 
## Adj. R-Squared          0.910       MSE                0.097 
## Pred R-Squared          0.900       MAE                0.222 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     97.720         6         16.287    168.687    0.0000 
## Residual        8.979        93          0.097                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -5.779         1.324                 -4.364    0.000     -8.408    -3.149 
## hours_wk_pooled_pooled_p25      0.829         0.030        1.032    27.740    0.000      0.770     0.888 
## working_pooled_pooled_mean    -11.135         1.830       -0.287    -6.085    0.000    -14.769    -7.502 
##         median_house_value      0.000         0.000        0.245     7.700    0.000      0.000     0.000 
##                 unemp_rate     13.992         2.835        0.172     4.936    0.000      8.363    19.621 
##           frac_middleclass      2.313         0.675        0.115     3.426    0.001      0.972     3.653 
##             lf_d_2000_1980      0.352         0.117        0.109     2.996    0.004      0.119     0.585 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.957       RMSE               0.311 
## R-Squared               0.916       Coef. Var          2.751 
## Adj. R-Squared          0.910       MSE                0.097 
## Pred R-Squared          0.900       MAE                0.222 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     97.720         6         16.287    168.687    0.0000 
## Residual        8.979        93          0.097                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -5.779         1.324                 -4.364    0.000     -8.408    -3.149 
## hours_wk_pooled_pooled_p25      0.829         0.030        1.032    27.740    0.000      0.770     0.888 
## working_pooled_pooled_mean    -11.135         1.830       -0.287    -6.085    0.000    -14.769    -7.502 
##         median_house_value      0.000         0.000        0.245     7.700    0.000      0.000     0.000 
##                 unemp_rate     13.992         2.835        0.172     4.936    0.000      8.363    19.621 
##           frac_middleclass      2.313         0.675        0.115     3.426    0.001      0.972     3.653 
##             lf_d_2000_1980      0.352         0.117        0.109     2.996    0.004      0.119     0.585 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 7 
## 
## - cs_elf_ind_man added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.959       RMSE               0.306 
## R-Squared               0.919       Coef. Var          2.706 
## Adj. R-Squared          0.913       MSE                0.093 
## Pred R-Squared          0.903       MAE                0.221 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     98.107         7         14.015    150.058    0.0000 
## Residual        8.593        92          0.093                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -6.150         1.315                 -4.677    0.000     -8.762    -3.538 
## hours_wk_pooled_pooled_p25      0.811         0.031        1.010    26.416    0.000      0.750     0.872 
## working_pooled_pooled_mean    -10.052         1.877       -0.259    -5.355    0.000    -13.780    -6.324 
##         median_house_value      0.000         0.000        0.244     7.808    0.000      0.000     0.000 
##                 unemp_rate     15.206         2.851        0.187     5.333    0.000      9.543    20.868 
##           frac_middleclass      2.730         0.695        0.135     3.928    0.000      1.350     4.110 
##             lf_d_2000_1980      0.276         0.121        0.085     2.272    0.025      0.035     0.517 
##             cs_elf_ind_man     -1.468         0.721       -0.079    -2.034    0.045     -2.900    -0.035 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.959       RMSE               0.306 
## R-Squared               0.919       Coef. Var          2.706 
## Adj. R-Squared          0.913       MSE                0.093 
## Pred R-Squared          0.903       MAE                0.221 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     98.107         7         14.015    150.058    0.0000 
## Residual        8.593        92          0.093                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -6.150         1.315                 -4.677    0.000     -8.762    -3.538 
## hours_wk_pooled_pooled_p25      0.811         0.031        1.010    26.416    0.000      0.750     0.872 
## working_pooled_pooled_mean    -10.052         1.877       -0.259    -5.355    0.000    -13.780    -6.324 
##         median_house_value      0.000         0.000        0.244     7.808    0.000      0.000     0.000 
##                 unemp_rate     15.206         2.851        0.187     5.333    0.000      9.543    20.868 
##           frac_middleclass      2.730         0.695        0.135     3.928    0.000      1.350     4.110 
##             lf_d_2000_1980      0.276         0.121        0.085     2.272    0.025      0.035     0.517 
##             cs_elf_ind_man     -1.468         0.721       -0.079    -2.034    0.045     -2.900    -0.035 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 8 
## 
## - med_hhinc2016 added 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.960       RMSE               0.302 
## R-Squared               0.922       Coef. Var          2.676 
## Adj. R-Squared          0.915       MSE                0.091 
## Pred R-Squared          0.904       MAE                0.220 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     98.389         8         12.299    134.674    0.0000 
## Residual        8.310        91          0.091                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -6.123         1.300                 -4.709    0.000     -8.706    -3.540 
## hours_wk_pooled_pooled_p25      0.816         0.031        1.017    26.758    0.000      0.756     0.877 
## working_pooled_pooled_mean    -10.542         1.877       -0.272    -5.617    0.000    -14.271    -6.814 
##         median_house_value      0.000         0.000        0.222     6.604    0.000      0.000     0.000 
##                 unemp_rate     16.049         2.860        0.198     5.612    0.000     10.368    21.729 
##           frac_middleclass      2.902         0.694        0.144     4.182    0.000      1.524     4.281 
##             lf_d_2000_1980      0.277         0.120        0.086     2.309    0.023      0.039     0.515 
##             cs_elf_ind_man     -1.323         0.718       -0.071    -1.843    0.069     -2.750     0.103 
##              med_hhinc2016      0.000         0.000        0.061     1.759    0.082      0.000     0.000 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.960       RMSE               0.302 
## R-Squared               0.922       Coef. Var          2.676 
## Adj. R-Squared          0.915       MSE                0.091 
## Pred R-Squared          0.904       MAE                0.220 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     98.389         8         12.299    134.674    0.0000 
## Residual        8.310        91          0.091                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -6.123         1.300                 -4.709    0.000     -8.706    -3.540 
## hours_wk_pooled_pooled_p25      0.816         0.031        1.017    26.758    0.000      0.756     0.877 
## working_pooled_pooled_mean    -10.542         1.877       -0.272    -5.617    0.000    -14.271    -6.814 
##         median_house_value      0.000         0.000        0.222     6.604    0.000      0.000     0.000 
##                 unemp_rate     16.049         2.860        0.198     5.612    0.000     10.368    21.729 
##           frac_middleclass      2.902         0.694        0.144     4.182    0.000      1.524     4.281 
##             lf_d_2000_1980      0.277         0.120        0.086     2.309    0.023      0.039     0.515 
##             cs_elf_ind_man     -1.323         0.718       -0.071    -1.843    0.069     -2.750     0.103 
##              med_hhinc2016      0.000         0.000        0.061     1.759    0.082      0.000     0.000 
## ---------------------------------------------------------------------------------------------------------
## 
## 
## 
## No more variables to be added/removed.
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                         
## -------------------------------------------------------------
## R                       0.960       RMSE               0.302 
## R-Squared               0.922       Coef. Var          2.676 
## Adj. R-Squared          0.915       MSE                0.091 
## Pred R-Squared          0.904       MAE                0.220 
## -------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares        DF    Mean Square       F         Sig. 
## --------------------------------------------------------------------
## Regression     98.389         8         12.299    134.674    0.0000 
## Residual        8.310        91          0.091                      
## Total         106.699        99                                     
## --------------------------------------------------------------------
## 
##                                            Parameter Estimates                                            
## ---------------------------------------------------------------------------------------------------------
##                      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
## ---------------------------------------------------------------------------------------------------------
##                (Intercept)     -6.123         1.300                 -4.709    0.000     -8.706    -3.540 
## hours_wk_pooled_pooled_p25      0.816         0.031        1.017    26.758    0.000      0.756     0.877 
## working_pooled_pooled_mean    -10.542         1.877       -0.272    -5.617    0.000    -14.271    -6.814 
##         median_house_value      0.000         0.000        0.222     6.604    0.000      0.000     0.000 
##                 unemp_rate     16.049         2.860        0.198     5.612    0.000     10.368    21.729 
##           frac_middleclass      2.902         0.694        0.144     4.182    0.000      1.524     4.281 
##             lf_d_2000_1980      0.277         0.120        0.086     2.309    0.023      0.039     0.515 
##             cs_elf_ind_man     -1.323         0.718       -0.071    -1.843    0.069     -2.750     0.103 
##              med_hhinc2016      0.000         0.000        0.061     1.759    0.082      0.000     0.000 
## ---------------------------------------------------------------------------------------------------------
stp_s$model
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Coefficients:
##                (Intercept)  hours_wk_pooled_pooled_p25  
##                 -6.123e+00                   8.164e-01  
## working_pooled_pooled_mean          median_house_value  
##                 -1.054e+01                   3.070e-06  
##                 unemp_rate            frac_middleclass  
##                  1.605e+01                   2.902e+00  
##             lf_d_2000_1980              cs_elf_ind_man  
##                  2.770e-01                  -1.323e+00  
##              med_hhinc2016  
##                  2.048e-06

Doing this gives us a model the computer has determined to be most effective, which is:

Life Expectancy = -10.54(working_pooled_pooled_mean) - 16.05(unemp_rate) + 0.277(lf_d_2000_1980) + 0.000002048(med_hhinc2016) + 0.8164(hours_wk_pooled_pooled_p25) + 0.000003070(median_house_value) + 2.902(frac_middleclass) - 1.323(cs_elf_ind_man)

Looking at the summary of this model, we see:

worktwo = lm(le_raceadj_q1_both ~ working_pooled_pooled_mean +
               unemp_rate + lf_d_2000_1980 + med_hhinc2016 + hours_wk_pooled_pooled_p25 +
               median_house_value + frac_middleclass + cs_elf_ind_man, data = work)
summary(worktwo)
## 
## Call:
## lm(formula = le_raceadj_q1_both ~ working_pooled_pooled_mean + 
##     unemp_rate + lf_d_2000_1980 + med_hhinc2016 + hours_wk_pooled_pooled_p25 + 
##     median_house_value + frac_middleclass + cs_elf_ind_man, data = work)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71731 -0.17284 -0.03969  0.17203  1.36397 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -6.123e+00  1.300e+00  -4.709 8.88e-06 ***
## working_pooled_pooled_mean -1.054e+01  1.877e+00  -5.617 2.10e-07 ***
## unemp_rate                  1.605e+01  2.860e+00   5.612 2.14e-07 ***
## lf_d_2000_1980              2.770e-01  1.200e-01   2.309   0.0232 *  
## med_hhinc2016               2.048e-06  1.164e-06   1.759   0.0820 .  
## hours_wk_pooled_pooled_p25  8.164e-01  3.051e-02  26.758  < 2e-16 ***
## median_house_value          3.070e-06  4.648e-07   6.604 2.63e-09 ***
## frac_middleclass            2.902e+00  6.941e-01   4.182 6.65e-05 ***
## cs_elf_ind_man             -1.323e+00  7.181e-01  -1.843   0.0686 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3022 on 91 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.9221, Adjusted R-squared:  0.9153 
## F-statistic: 134.7 on 8 and 91 DF,  p-value: < 2.2e-16

Overall, this model has p-values that are much more statistically significant than the previous model. Despite this, there are still other ways to chose the best predictors for a multiple regression model, such as through a best subsets regression.

Best Subsets Regression Procedure

The best subsets regression procedure is a procedure that tells the computer to search for all possible subsets of predictors so that we don’t have to.

# best subsets regression
allwork = regsubsets(le_raceadj_q1_both ~ ., data = work, nvmax = 27)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 2 linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.exhaustive(a, really.big): XHAUST returned error code -999
summary(allwork)
## Subset selection object
## Call: regsubsets.formula(le_raceadj_q1_both ~ ., data = work, nvmax = 27)
## 26 Variables  (and intercept)
##                                 Forced in Forced out
## cz                                  FALSE      FALSE
## hhinc_mean2000                      FALSE      FALSE
## mean_commutetime2000                FALSE      FALSE
## med_hhinc2016                       FALSE      FALSE
## poor_share2010                      FALSE      FALSE
## emp2000                             FALSE      FALSE
## jobs_highpay_5mi_2015               FALSE      FALSE
## ann_avg_job_growth_2004_2013        FALSE      FALSE
## par_rank_pooled_pooled_mean         FALSE      FALSE
## frac_below_median_pooled_pooled     FALSE      FALSE
## hours_wk_pooled_pooled_mean         FALSE      FALSE
## hours_wk_pooled_pooled_p1           FALSE      FALSE
## hours_wk_pooled_pooled_p25          FALSE      FALSE
## hours_wk_pooled_pooled_p75          FALSE      FALSE
## hours_wk_pooled_pooled_p100         FALSE      FALSE
## pos_hours_pooled_pooled_mean        FALSE      FALSE
## working_pooled_pooled_mean          FALSE      FALSE
## unemp_rate                          FALSE      FALSE
## taxrate                             FALSE      FALSE
## frac_middleclass                    FALSE      FALSE
## cs_elf_ind_man                      FALSE      FALSE
## lf_d_2000_1980                      FALSE      FALSE
## median_house_value                  FALSE      FALSE
## hhinc00                             FALSE      FALSE
## emp2000.1                           FALSE      FALSE
## pos_hours_pooled_pooled_mean.1      FALSE      FALSE
## 1 subsets of each size up to 24
## Selection Algorithm: exhaustive
##           cz  hhinc_mean2000 mean_commutetime2000 med_hhinc2016 poor_share2010
## 1  ( 1 )  "*" " "            " "                  " "           " "           
## 2  ( 1 )  "*" "*"            " "                  " "           " "           
## 3  ( 1 )  "*" "*"            "*"                  " "           " "           
## 4  ( 1 )  "*" "*"            "*"                  "*"           " "           
## 5  ( 1 )  "*" "*"            "*"                  "*"           "*"           
## 6  ( 1 )  "*" "*"            "*"                  "*"           "*"           
## 7  ( 1 )  "*" "*"            "*"                  "*"           "*"           
## 8  ( 1 )  "*" "*"            "*"                  "*"           "*"           
## 9  ( 1 )  "*" "*"            "*"                  "*"           "*"           
## 10  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 11  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 12  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 13  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 14  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 15  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 16  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 17  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 18  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 19  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 20  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 21  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 22  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 23  ( 1 ) "*" "*"            "*"                  "*"           "*"           
## 24  ( 1 ) "*" "*"            "*"                  "*"           "*"           
##           emp2000 emp2000.1 jobs_highpay_5mi_2015 ann_avg_job_growth_2004_2013
## 1  ( 1 )  " "     " "       " "                   " "                         
## 2  ( 1 )  " "     " "       " "                   " "                         
## 3  ( 1 )  " "     " "       " "                   " "                         
## 4  ( 1 )  " "     " "       " "                   " "                         
## 5  ( 1 )  " "     " "       " "                   " "                         
## 6  ( 1 )  "*"     " "       " "                   " "                         
## 7  ( 1 )  "*"     " "       "*"                   " "                         
## 8  ( 1 )  "*"     " "       "*"                   "*"                         
## 9  ( 1 )  "*"     " "       "*"                   "*"                         
## 10  ( 1 ) "*"     " "       "*"                   "*"                         
## 11  ( 1 ) "*"     " "       "*"                   "*"                         
## 12  ( 1 ) "*"     " "       "*"                   "*"                         
## 13  ( 1 ) "*"     " "       "*"                   "*"                         
## 14  ( 1 ) "*"     " "       "*"                   "*"                         
## 15  ( 1 ) "*"     " "       "*"                   "*"                         
## 16  ( 1 ) "*"     " "       "*"                   "*"                         
## 17  ( 1 ) "*"     " "       "*"                   "*"                         
## 18  ( 1 ) "*"     " "       "*"                   "*"                         
## 19  ( 1 ) "*"     " "       "*"                   "*"                         
## 20  ( 1 ) "*"     " "       "*"                   "*"                         
## 21  ( 1 ) "*"     " "       "*"                   "*"                         
## 22  ( 1 ) "*"     " "       "*"                   "*"                         
## 23  ( 1 ) "*"     " "       "*"                   "*"                         
## 24  ( 1 ) "*"     " "       "*"                   "*"                         
##           par_rank_pooled_pooled_mean frac_below_median_pooled_pooled
## 1  ( 1 )  " "                         " "                            
## 2  ( 1 )  " "                         " "                            
## 3  ( 1 )  " "                         " "                            
## 4  ( 1 )  " "                         " "                            
## 5  ( 1 )  " "                         " "                            
## 6  ( 1 )  " "                         " "                            
## 7  ( 1 )  " "                         " "                            
## 8  ( 1 )  " "                         " "                            
## 9  ( 1 )  "*"                         " "                            
## 10  ( 1 ) "*"                         "*"                            
## 11  ( 1 ) "*"                         "*"                            
## 12  ( 1 ) "*"                         "*"                            
## 13  ( 1 ) "*"                         "*"                            
## 14  ( 1 ) "*"                         "*"                            
## 15  ( 1 ) "*"                         "*"                            
## 16  ( 1 ) "*"                         "*"                            
## 17  ( 1 ) "*"                         "*"                            
## 18  ( 1 ) "*"                         "*"                            
## 19  ( 1 ) "*"                         "*"                            
## 20  ( 1 ) "*"                         "*"                            
## 21  ( 1 ) "*"                         "*"                            
## 22  ( 1 ) "*"                         "*"                            
## 23  ( 1 ) "*"                         "*"                            
## 24  ( 1 ) "*"                         "*"                            
##           hours_wk_pooled_pooled_mean hours_wk_pooled_pooled_p1
## 1  ( 1 )  " "                         " "                      
## 2  ( 1 )  " "                         " "                      
## 3  ( 1 )  " "                         " "                      
## 4  ( 1 )  " "                         " "                      
## 5  ( 1 )  " "                         " "                      
## 6  ( 1 )  " "                         " "                      
## 7  ( 1 )  " "                         " "                      
## 8  ( 1 )  " "                         " "                      
## 9  ( 1 )  " "                         " "                      
## 10  ( 1 ) " "                         " "                      
## 11  ( 1 ) "*"                         " "                      
## 12  ( 1 ) "*"                         "*"                      
## 13  ( 1 ) "*"                         "*"                      
## 14  ( 1 ) "*"                         "*"                      
## 15  ( 1 ) "*"                         "*"                      
## 16  ( 1 ) "*"                         "*"                      
## 17  ( 1 ) "*"                         "*"                      
## 18  ( 1 ) "*"                         "*"                      
## 19  ( 1 ) "*"                         "*"                      
## 20  ( 1 ) "*"                         "*"                      
## 21  ( 1 ) "*"                         "*"                      
## 22  ( 1 ) "*"                         "*"                      
## 23  ( 1 ) "*"                         "*"                      
## 24  ( 1 ) "*"                         "*"                      
##           hours_wk_pooled_pooled_p25 hours_wk_pooled_pooled_p75
## 1  ( 1 )  " "                        " "                       
## 2  ( 1 )  " "                        " "                       
## 3  ( 1 )  " "                        " "                       
## 4  ( 1 )  " "                        " "                       
## 5  ( 1 )  " "                        " "                       
## 6  ( 1 )  " "                        " "                       
## 7  ( 1 )  " "                        " "                       
## 8  ( 1 )  " "                        " "                       
## 9  ( 1 )  " "                        " "                       
## 10  ( 1 ) " "                        " "                       
## 11  ( 1 ) " "                        " "                       
## 12  ( 1 ) " "                        " "                       
## 13  ( 1 ) "*"                        " "                       
## 14  ( 1 ) "*"                        "*"                       
## 15  ( 1 ) "*"                        "*"                       
## 16  ( 1 ) "*"                        "*"                       
## 17  ( 1 ) "*"                        "*"                       
## 18  ( 1 ) "*"                        "*"                       
## 19  ( 1 ) "*"                        "*"                       
## 20  ( 1 ) "*"                        "*"                       
## 21  ( 1 ) "*"                        "*"                       
## 22  ( 1 ) "*"                        "*"                       
## 23  ( 1 ) "*"                        "*"                       
## 24  ( 1 ) "*"                        "*"                       
##           hours_wk_pooled_pooled_p100 pos_hours_pooled_pooled_mean
## 1  ( 1 )  " "                         " "                         
## 2  ( 1 )  " "                         " "                         
## 3  ( 1 )  " "                         " "                         
## 4  ( 1 )  " "                         " "                         
## 5  ( 1 )  " "                         " "                         
## 6  ( 1 )  " "                         " "                         
## 7  ( 1 )  " "                         " "                         
## 8  ( 1 )  " "                         " "                         
## 9  ( 1 )  " "                         " "                         
## 10  ( 1 ) " "                         " "                         
## 11  ( 1 ) " "                         " "                         
## 12  ( 1 ) " "                         " "                         
## 13  ( 1 ) " "                         " "                         
## 14  ( 1 ) " "                         " "                         
## 15  ( 1 ) "*"                         " "                         
## 16  ( 1 ) "*"                         "*"                         
## 17  ( 1 ) "*"                         "*"                         
## 18  ( 1 ) "*"                         "*"                         
## 19  ( 1 ) "*"                         "*"                         
## 20  ( 1 ) "*"                         "*"                         
## 21  ( 1 ) "*"                         "*"                         
## 22  ( 1 ) "*"                         "*"                         
## 23  ( 1 ) "*"                         "*"                         
## 24  ( 1 ) "*"                         "*"                         
##           pos_hours_pooled_pooled_mean.1 working_pooled_pooled_mean unemp_rate
## 1  ( 1 )  " "                            " "                        " "       
## 2  ( 1 )  " "                            " "                        " "       
## 3  ( 1 )  " "                            " "                        " "       
## 4  ( 1 )  " "                            " "                        " "       
## 5  ( 1 )  " "                            " "                        " "       
## 6  ( 1 )  " "                            " "                        " "       
## 7  ( 1 )  " "                            " "                        " "       
## 8  ( 1 )  " "                            " "                        " "       
## 9  ( 1 )  " "                            " "                        " "       
## 10  ( 1 ) " "                            " "                        " "       
## 11  ( 1 ) " "                            " "                        " "       
## 12  ( 1 ) " "                            " "                        " "       
## 13  ( 1 ) " "                            " "                        " "       
## 14  ( 1 ) " "                            " "                        " "       
## 15  ( 1 ) " "                            " "                        " "       
## 16  ( 1 ) " "                            " "                        " "       
## 17  ( 1 ) " "                            "*"                        " "       
## 18  ( 1 ) " "                            "*"                        "*"       
## 19  ( 1 ) " "                            "*"                        "*"       
## 20  ( 1 ) " "                            "*"                        "*"       
## 21  ( 1 ) " "                            "*"                        "*"       
## 22  ( 1 ) " "                            "*"                        "*"       
## 23  ( 1 ) " "                            "*"                        "*"       
## 24  ( 1 ) " "                            "*"                        "*"       
##           taxrate frac_middleclass cs_elf_ind_man lf_d_2000_1980
## 1  ( 1 )  " "     " "              " "            " "           
## 2  ( 1 )  " "     " "              " "            " "           
## 3  ( 1 )  " "     " "              " "            " "           
## 4  ( 1 )  " "     " "              " "            " "           
## 5  ( 1 )  " "     " "              " "            " "           
## 6  ( 1 )  " "     " "              " "            " "           
## 7  ( 1 )  " "     " "              " "            " "           
## 8  ( 1 )  " "     " "              " "            " "           
## 9  ( 1 )  " "     " "              " "            " "           
## 10  ( 1 ) " "     " "              " "            " "           
## 11  ( 1 ) " "     " "              " "            " "           
## 12  ( 1 ) " "     " "              " "            " "           
## 13  ( 1 ) " "     " "              " "            " "           
## 14  ( 1 ) " "     " "              " "            " "           
## 15  ( 1 ) " "     " "              " "            " "           
## 16  ( 1 ) " "     " "              " "            " "           
## 17  ( 1 ) " "     " "              " "            " "           
## 18  ( 1 ) " "     " "              " "            " "           
## 19  ( 1 ) "*"     " "              " "            " "           
## 20  ( 1 ) "*"     "*"              " "            " "           
## 21  ( 1 ) "*"     "*"              "*"            " "           
## 22  ( 1 ) "*"     "*"              "*"            "*"           
## 23  ( 1 ) "*"     "*"              "*"            "*"           
## 24  ( 1 ) "*"     "*"              "*"            "*"           
##           median_house_value hhinc00
## 1  ( 1 )  " "                " "    
## 2  ( 1 )  " "                " "    
## 3  ( 1 )  " "                " "    
## 4  ( 1 )  " "                " "    
## 5  ( 1 )  " "                " "    
## 6  ( 1 )  " "                " "    
## 7  ( 1 )  " "                " "    
## 8  ( 1 )  " "                " "    
## 9  ( 1 )  " "                " "    
## 10  ( 1 ) " "                " "    
## 11  ( 1 ) " "                " "    
## 12  ( 1 ) " "                " "    
## 13  ( 1 ) " "                " "    
## 14  ( 1 ) " "                " "    
## 15  ( 1 ) " "                " "    
## 16  ( 1 ) " "                " "    
## 17  ( 1 ) " "                " "    
## 18  ( 1 ) " "                " "    
## 19  ( 1 ) " "                " "    
## 20  ( 1 ) " "                " "    
## 21  ( 1 ) " "                " "    
## 22  ( 1 ) " "                " "    
## 23  ( 1 ) "*"                " "    
## 24  ( 1 ) "*"                "*"
res.sum = summary(allwork)
data.frame(
  adj.r2 = which.max(res.sum$adjr2),
  cp = which.min(res.sum$cp),
  bic = which.min(res.sum$bic)
)
##   adj.r2 cp bic
## 1     12 12  12

Here, the output illustrates the best models of each size from all of the predictor variables to predict life expectancy. This begs the question of what is the appropriate number of predictor variables to use, which is the purpose of the data.frame() function. The R^2, CP, and BIC criteria all tell us to use a 12 predictor model. To find which variables are most appropriate for this size model, we look at the output above, focusing on the 12th row. Each variable with an asterisk in the row 12 is a variable that is suggested to be used in a 12 predictor model. Using this information, we can create a new multiple regression model that we can compare against our prior two models to find the best option.

# create new model with predictor variables
subsetmodel = lm(le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 + med_hhinc2016 +
                poor_share2010 + emp2000 + jobs_highpay_5mi_2015 + ann_avg_job_growth_2004_2013 + 
                par_rank_pooled_pooled_mean + frac_below_median_pooled_pooled +
                hours_wk_pooled_pooled_mean, data = work)
summary(subsetmodel)
## 
## Call:
## lm(formula = le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 + 
##     med_hhinc2016 + poor_share2010 + emp2000 + jobs_highpay_5mi_2015 + 
##     ann_avg_job_growth_2004_2013 + par_rank_pooled_pooled_mean + 
##     frac_below_median_pooled_pooled + hours_wk_pooled_pooled_mean, 
##     data = work)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91480 -0.63628 -0.02223  0.56264  2.72006 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -3.158e+00  1.050e+01  -0.301  0.76434    
## hhinc_mean2000                  -1.500e-05  7.449e-06  -2.014  0.04698 *  
## mean_commutetime2000            -2.055e-02  1.516e-02  -1.355  0.17884    
## med_hhinc2016                    3.098e-05  1.069e-05   2.899  0.00472 ** 
## poor_share2010                  -3.624e-01  9.055e-01  -0.400  0.68997    
## emp2000                         -2.685e+00  1.008e+00  -2.663  0.00918 ** 
## jobs_highpay_5mi_2015           -1.500e-06  1.725e-06  -0.870  0.38663    
## ann_avg_job_growth_2004_2013     2.230e+00  2.200e+00   1.014  0.31348    
## par_rank_pooled_pooled_mean      9.790e-01  1.279e+01   0.077  0.93915    
## frac_below_median_pooled_pooled  4.007e+00  7.969e+00   0.503  0.61636    
## hours_wk_pooled_pooled_mean      4.305e-01  8.413e-02   5.117 1.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9097 on 89 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.3098, Adjusted R-squared:  0.2322 
## F-statistic: 3.994 on 10 and 89 DF,  p-value: 0.0001571

Looking at the new model, we can see that the Pr(>|t|) values are less statistically significant than those of that in our previous models. To find the best option, we can use Mallows’ Cp, which is a factor that is used to compare regression models to pick the best model.

# first, create linear regression model with all predictor variables
fullmodelone = lm(le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 + med_hhinc2016 +
                    poor_share2010 + emp2000 + jobs_highpay_5mi_2015 +
                    ann_avg_job_growth_2004_2013 + par_rank_pooled_pooled_mean +
                    hours_wk_pooled_pooled_p25 + pos_hours_pooled_pooled_mean +
                    working_pooled_pooled_mean + unemp_rate +
                    taxrate + frac_middleclass + cs_elf_ind_man +
                    lf_d_2000_1980 + median_house_value +
                    hhinc00, data = work)

# compute Mallow's Cp for each statistical model
ols_mallows_cp(workmr, fullmodelone)
## [1] 200.0821
ols_mallows_cp(worktwo, fullmodelone)
## [1] 55.62811
ols_mallows_cp(subsetmodel, fullmodelone)
## [1] 1141.712

Looking at the results, we see that “worktwo” or the linear model made using stepwise regression, has the Cp closest to the number of predictor variables, therefore establishing it as the best of the three regression models.

Check Conditions for Work Model
plot(worktwo)

# histogram of residuals
restwo = resid(worktwo)
hist(restwo)

Linearity

Looking at the residuals vs. fitted plot, we can see that there is equal spread both above and below the zero line and that the points are evenly spread throughout the intervals, suggesting that this model satisfies the condition for linearity

Zero Mean

Looking at the histogram of residuals, we observe a normal distribution centered around zero which meets the zero mean condition.

Constant Variance

In the residual plot, the data is spread on both sides of the least squares line, but there is an outlier, point 96.

Randomness and Independence

Opportunity Insights used the commuting zone system to organize and separate the data that was collected by the Census, which collects data from every citizen within a geographic area, meaning that the data was collected responsibly. Comparing the data by quartile, the differences in correlational coefficients from quartile to quartile tells us that we should be wary of unifying all of the models into one.

Normality

The qq-plot shows that there is a bit of right skew with the presence of the outlier, point 96.