My question surrounds life expectancy within the United States, and more specifically what socio-economic factors such as income, education level, community, and employment, and how these factors affect life expectancy of Americans. This question arose from looking through the datasets from Opportunity Insights and the Opportunity Atlas. This question also arises from looking at national life expectancy rates themselves, and questioning why they are what they are. I am very interested in exploring the aspect of commuting zones and their characteristics in life expectancy, through discovering what types of communities and characteristics within those communities create longer life expectancies.
In terms of data, I pulled the majority of my data from Opportunity Insights (OI), and organized it by commuting zones, which are geographic areas that are often used when analyzing geography in terms of the economy and population. I used two different datasets, “National by-year life expectancy estimates for men and women, by income percentile” and “Table 9: Neighborhood Characteristics by Census Tract”. To compile this data into a singular dataset, I used functions within Excel that allowed me to match data factors by commuting zone. By combining these two datasets together, I was able to measure life expectancy data alongside the commuting zone factors I was interested in measuring.
# load libraries and data
library(ggplot2)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(corrplot)
## corrplot 0.92 loaded
library(ggpubr)
library(leaps)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
red = read.csv("~/desktop/finalfinal2.csv")
# Quartile 1
q1 = ggplot(red, aes(x = cur_smoke_q1, y = le_raceadj_q1_F)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Current Smokers in Q1") +
ylab("Q1 Life Expectancy") +
ggtitle("Q1 Life Expectancy and Smoking")
q1
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
smoke1 = lm(le_raceadj_q1_F ~ cur_smoke_q1, data = red)
summary(smoke1)
##
## Call:
## lm(formula = le_raceadj_q1_F ~ cur_smoke_q1, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8578 -1.0701 -0.1626 1.2956 4.5646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.554 1.195 70.782 <2e-16 ***
## cur_smoke_q1 -11.879 4.381 -2.712 0.0079 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.908 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.0698, Adjusted R-squared: 0.06031
## F-statistic: 7.354 on 1 and 98 DF, p-value: 0.007904
# Quartile 2
q2 = ggplot(red, aes(x = cur_smoke_q2, y = le_raceadj_q2_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Current Smokers in Q2") +
ylab("Q2 Life Expectancy") +
ggtitle("Q2 Life Expectancy and Smoking")
q2
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
smoke2 = lm(le_raceadj_q2_M ~ cur_smoke_q2, data = red)
summary(smoke2)
##
## Call:
## lm(formula = le_raceadj_q2_M ~ cur_smoke_q2, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0808 -1.2399 -0.2606 1.3909 5.8880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.141 1.535 52.200 <2e-16 ***
## cur_smoke_q2 -3.180 6.842 -0.465 0.643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.18 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.002199, Adjusted R-squared: -0.007982
## F-statistic: 0.216 on 1 and 98 DF, p-value: 0.6431
# Quartile 3
q3 = ggplot(red, aes(x = cur_smoke_q3, y = le_raceadj_q3_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Current Smokers in Q3") +
ylab("Q3 Life Expectancy") +
ggtitle("Q3 Life Expectancy and Smoking")
q3
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
smoke3 = lm(le_raceadj_q3_M ~ cur_smoke_q3, data = red)
summary(smoke3)
##
## Call:
## lm(formula = le_raceadj_q3_M ~ cur_smoke_q3, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6731 -1.2965 0.1006 1.5783 5.0540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.148 1.349 62.370 <2e-16 ***
## cur_smoke_q3 -11.073 7.478 -1.481 0.142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.036 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.02189, Adjusted R-squared: 0.01191
## F-statistic: 2.193 on 1 and 98 DF, p-value: 0.1419
# Quartile 4
q4 = ggplot(red, aes(x = cur_smoke_q4, y = le_raceadj_q4_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Current Smokers in Q4") +
ylab("Q4 Life Expectancy") +
ggtitle("Q4 Life Expectancy and Smoking")
q4
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
smoke4 = lm(le_raceadj_q4_M ~ cur_smoke_q4, data = red)
summary(smoke4)
##
## Call:
## lm(formula = le_raceadj_q4_M ~ cur_smoke_q4, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0203 -1.4971 -0.0728 1.3878 5.5532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.037 1.240 70.167 <2e-16 ***
## cur_smoke_q4 -20.459 9.731 -2.102 0.0381 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.277 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.04316, Adjusted R-squared: 0.03339
## F-statistic: 4.42 on 1 and 98 DF, p-value: 0.03808
In the above data, each point on the scatterplot represents a different commuting zone within the US excluding 12 commuting zones focused in North Dakota and Alaska. These datapoints were deleted because the data being used is pulled from an Opportunity Insights dataet on life expectancy estimates by income quartile that only reports from commuting zones with populations greater than 25,000.
The data on smoking and life expectancy is sorted by income quarters, 1st to 4th. This sorting proved for interesting observations. While observing the p-values of linear regression models sorted by income quartile, I found that the first quartile had the most statistically significant results. The p-value for this model (graph to the right) is 0.007904, less than the alpha value of 0.05. This low value indicates that the model is statistically significant and that we shall reject the null hypothesis and accept the alternative hypothesis that says there is a negative correlation of -11.879 between smoking and a shorter life expectancy within the first income quartile. What makes this find interesting is the fact that this trend of statistical significance does not continue as the income quartile increases. The graph to the right is a scatterplot with the same factors, but from the fourth, or highest, income quartile. The correlational coefficient is 0.6231, with a p-value of 0.9539. The large p-value tells us that this model is not statistically significant, and shows evidence for the null hypothesis that smoking has no significant effect on life expectancy within the fourth income quartile.
The first income quartile has the most statistically significant p-value, and to further explore its significance, I am going to assess the conditions of the linear model to ensure its reliability through a series of conditions and plots.
res1 = resid(smoke1)
plot(smoke1)
hist(res1, main = "Histogram of Residuals")
The initial scatterplot of Life Expectancy on Smoking shows a negative linear relationship, illustrating that as the prevalence of smoking increases, life expectancy decreases. The Residuals vs. Fitted plot shows equal, random spread above and below the zero line, but we need to be wary of the lack of dispersed data points towards the right of the plot
The histogram of residuals shows a normal distribution centered at zero, meeting the zero mean condition.
In the residual plot, the data is spread on both sides of the least squares line, but again, we shall be wary of the right side of the plot with less dispersed data points.
Opportunity Insights used the commuting zone system to organize and separate the data that was collected by the Census, which collects data from every citizen within a geographic area, meaning that the data was collected responsibly. Comparing the data by quartile, the differences in correlational coefficients from quartile to quartile tells us that we should be wary of unifying all of the models into one.
The histogram of residuals shows a normal distribution centered at zero, and the QQ-plot shows a bit of left skew with data points straying from the line of best fit, but otherwise, the data closely follows the line of best fit, suggesting the model is normal.
# Quartile 1
q1 = ggplot(red, aes(x = hours_wk_pooled_pooled_p1, y = le_raceadj_q1_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Mean weekly hours worked") +
ylab("Q1 Life Expectancy") +
ggtitle("Q1 Life Expectancy and Hours Worked")
q1
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
work1 = lm(le_raceadj_q1_M ~ hours_wk_pooled_pooled_p1, data = red)
summary(work1)
##
## Call:
## lm(formula = le_raceadj_q1_M ~ hours_wk_pooled_pooled_p1, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4668 -1.2725 0.0375 1.2415 5.8345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.23764 2.06899 33.464 < 2e-16 ***
## hours_wk_pooled_pooled_p1 0.29873 0.09127 3.273 0.00147 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.884 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.09855, Adjusted R-squared: 0.08935
## F-statistic: 10.71 on 1 and 98 DF, p-value: 0.001469
# Quartile 2
q2 = ggplot(red, aes(x = hours_wk_pooled_pooled_p25, y = le_raceadj_q2_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Mean weekly hours worked") +
ylab("Q2 Life Expectancy") +
ggtitle("Q2 Life Expectancy and Hours Worked")
q2
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
work2 = lm(le_raceadj_q2_M ~ hours_wk_pooled_pooled_p25, data = red)
summary(work2)
##
## Call:
## lm(formula = le_raceadj_q2_M ~ hours_wk_pooled_pooled_p25, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7508 -1.2257 -0.3157 1.4120 5.9084
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.3152 4.7481 16.915 <2e-16 ***
## hours_wk_pooled_pooled_p25 -0.0315 0.1696 -0.186 0.853
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.182 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.0003516, Adjusted R-squared: -0.009849
## F-statistic: 0.03447 on 1 and 98 DF, p-value: 0.8531
# Quartile 3
q3 = ggplot(red, aes(x = hours_wk_pooled_pooled_p75, y = le_raceadj_q3_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Mean weekly hours worked") +
ylab("Q3 Life Expectancy") +
ggtitle("Q3 Life Expectancy and Hours Worked")
q3
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
work3 = lm(le_raceadj_q3_M ~ hours_wk_pooled_pooled_p75, data = red)
summary(work3)
##
## Call:
## lm(formula = le_raceadj_q3_M ~ hours_wk_pooled_pooled_p75, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2228 -1.2196 0.1853 1.5163 5.2292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.5623 5.0004 18.311 <2e-16 ***
## hours_wk_pooled_pooled_p75 -0.2815 0.1498 -1.879 0.0632 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.022 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.03478, Adjusted R-squared: 0.02493
## F-statistic: 3.532 on 1 and 98 DF, p-value: 0.06318
# Quartile 4
q4 = ggplot(red, aes(x = hours_wk_pooled_pooled_p100, y = le_raceadj_q4_F)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Mean weekly hours worked") +
ylab("Q4 Life Expectancy") +
ggtitle("Q4 Life Expectancy and Hours Worked")
q4
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
work4 = lm(le_raceadj_q4_F ~ hours_wk_pooled_pooled_p100, data = red)
summary(work4)
##
## Call:
## lm(formula = le_raceadj_q4_F ~ hours_wk_pooled_pooled_p100, data = red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9847 -1.8340 -0.0753 1.5650 6.8493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.4723 5.1783 17.472 <2e-16 ***
## hours_wk_pooled_pooled_p100 -0.1084 0.1424 -0.761 0.449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.506 on 98 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.005872, Adjusted R-squared: -0.004272
## F-statistic: 0.5789 on 1 and 98 DF, p-value: 0.4486
To further explore the statistically significant p-value within the first income quartile, I am going to assess the conditions of the data to further ensure the simple linear model is reliable through residual plots, QQ-plots, etc.
# Residual, normal Q-Q plots
plot(work1)
# Histogram of residuals
res1 = resid(work1)
hist(res1, main = "Histogram of Residuals")
In the normal scatterplot for the first quartile of work and life expectancy data, the data shows that it meets the linear condition, because as the amount of hours worked weekly increases, life expectancy within the first income quartile increases as well. In the Residuals vs. Fitted plot, there is scatter among the data points and they are scattered randomly above and below the zero line, showing that the linear model created is reasonable for describing the relationship between life expectancy and mean hours worked weekly.
When performing a least squares regression, the sample mean of the residuals is always zero when estimating the intercept. The histogram of residuals also shows even spread on both sides of zero, a normal distribution.
Looking at the plot of residuals shows a random constant spread above and below zero, and the histogram of residuals shows an equal spread on the left and right sides of zero.
Opportunity Insights used the commuting zone system to organize and separate the data that was collected by the Census, which collects data from every citizen within a geographic area, meaning that the data was collected responsibly. Comparing the data by quartile, the differences in correlational coefficients from quartile to quartile tells us that we should be wary of unifying all of the models into one.
The histogram of residuals shows a normal distribution. The normal quantile-quantile plot shows a consistent linear trend, although there is some skew to the right because of the few points in the upper right section.
Overall, this linear model passes all conditions needed to be met for a proper least squares regression equation.
The exploratory data analysis shows that there is strong evidence of statistically significant correlations between work and health factors within the first income quartile, ranging from the 1st to the 25th percentile. To further explore other factors that may affect life expectancy in terms of health and work, I am going to create a multiple regression model that will use other explanatory variables relating to health and work in order to more accurately predict life expectancies within the first income quartile.
To more accurately measure the significance of the affect of health factors alone on life expectancy, I created an additional dataset that includes all health-related variables from the main dataset. The download link to the health specific dataset can be found here - https://www.dropbox.com/s/9g1ss8k8umv23p4/healthdata.csv?dl=0. The explanations of the names of the variables can be found here - https://docs.google.com/document/d/1QMvRMuzyy8X4E6_HZ1VtMXFa28SHzdyv2v4YaixRniw/edit?usp=sharing.
Correlation Matrix
To begin, produced a matrix of correlation coefficients to see how the predictor variables effect life expectancy. Looking at the data, the variables that have the strongest associations with life expectancy are
-0.492 Fraction of current smokers - cur_smoke_q1
-0.331 Fraction of obese population - bmi_obese_q1
0.410 Percent uninsured in 2010 - puninsured2010
-0.270 Percent of medicare enrollees who have had at least one primary care visit within the last 12 months - primcarevis_10
Looking at the data as a whole, the strongest correlations are between:
0.869 Mean of Z-Scores for Dartmouth Atlas Ambulatory Care Measures and Percent of Females Aged 67-69 who have had Mammograms
0.728 Percent of Females Aged 67-69 who have had Mammograms and Percent Diabetic with Annual Eye Test
-0.578 Fraction of people who have exercised within the past 30 days and fraction of obese people in the first quartile
-0.669 Medicare $ Per Enrollee and fraction of people who have exercised in the past 30 days
0.857 30-day Mortality for Heart Attacks and 30-day Hospital Mortality Rate Index
(See below table for correlation matrix visual)
# load data
hlth = read.csv("~/desktop/healthdata.csv")
# choose variables with names(hlth) function to identify the numbers
cordata = cor(hlth[,c(32, 7, 8, 9, 13, 17, 21, 22, 23, 24, 25, 26, 27, 28, 29 )], use = "complete.obs")
# print correlational matrix
cordata
## le_raceadj_q1_both puninsured2010 mammogram_10
## le_raceadj_q1_both 1.00000000 0.409552583 -0.01313609
## puninsured2010 0.40955258 1.000000000 -0.38511649
## mammogram_10 -0.01313609 -0.385116491 1.00000000
## cur_smoke_q1 -0.49178572 -0.412569923 0.10884627
## bmi_obese_q1 -0.33064510 0.038422239 -0.13545923
## exercise_any_q1 0.16905248 -0.104335824 0.10140325
## mort_30day_hosp_z 0.16244341 0.304858954 -0.16071040
## diab_eyeexam_10 -0.06469811 -0.351894277 0.72788659
## diab_lipids_10 0.13626869 0.002075106 0.47383752
## primcarevis_10 -0.27049860 -0.147541927 0.47615494
## reimb_penroll_adj10 0.07963038 0.500973373 -0.31191633
## adjmortmeas_amiall30day 0.14737474 0.362438451 -0.27202722
## adjmortmeas_chfall30day 0.11354249 0.151266932 -0.01102297
## adjmortmeas_pnall30day 0.15380309 0.274266108 -0.14079534
## med_prev_qual_z -0.03769047 -0.337081052 0.86898970
## cur_smoke_q1 bmi_obese_q1 exercise_any_q1
## le_raceadj_q1_both -0.491785716 -0.330645098 0.16905248
## puninsured2010 -0.412569923 0.038422239 -0.10433582
## mammogram_10 0.108846268 -0.135459231 0.10140325
## cur_smoke_q1 1.000000000 0.273436489 -0.36593813
## bmi_obese_q1 0.273436489 1.000000000 -0.57834977
## exercise_any_q1 -0.365938127 -0.578349771 1.00000000
## mort_30day_hosp_z -0.051748942 -0.006387185 0.14486944
## diab_eyeexam_10 0.027868374 -0.166524409 0.10303742
## diab_lipids_10 -0.165990413 -0.091820075 -0.09728895
## primcarevis_10 0.366366833 0.345893828 -0.24021412
## reimb_penroll_adj10 0.048788895 0.293027828 -0.66955145
## adjmortmeas_amiall30day -0.075027995 0.106446468 0.02309639
## adjmortmeas_chfall30day -0.070661414 -0.219951072 0.41879551
## adjmortmeas_pnall30day 0.006981048 0.101377944 -0.06919963
## med_prev_qual_z 0.014671918 -0.160097018 0.19376191
## mort_30day_hosp_z diab_eyeexam_10 diab_lipids_10
## le_raceadj_q1_both 0.162443415 -0.06469811 0.136268688
## puninsured2010 0.304858954 -0.35189428 0.002075106
## mammogram_10 -0.160710397 0.72788659 0.473837521
## cur_smoke_q1 -0.051748942 0.02786837 -0.165990413
## bmi_obese_q1 -0.006387185 -0.16652441 -0.091820075
## exercise_any_q1 0.144869441 0.10303742 -0.097288955
## mort_30day_hosp_z 1.000000000 -0.22717437 -0.248622569
## diab_eyeexam_10 -0.227174368 1.00000000 0.498365447
## diab_lipids_10 -0.248622569 0.49836545 1.000000000
## primcarevis_10 0.060866168 0.31002021 0.181463162
## reimb_penroll_adj10 -0.260424088 -0.25593691 0.124653227
## adjmortmeas_amiall30day 0.857258490 -0.30638700 -0.220575922
## adjmortmeas_chfall30day 0.848040307 -0.01855114 -0.220977623
## adjmortmeas_pnall30day 0.849187830 -0.26205789 -0.195358678
## med_prev_qual_z -0.136234229 0.83490965 0.671550225
## primcarevis_10 reimb_penroll_adj10
## le_raceadj_q1_both -0.2704985950 0.07963038
## puninsured2010 -0.1475419273 0.50097337
## mammogram_10 0.4761549447 -0.31191633
## cur_smoke_q1 0.3663668330 0.04878889
## bmi_obese_q1 0.3458938282 0.29302783
## exercise_any_q1 -0.2402141235 -0.66955145
## mort_30day_hosp_z 0.0608661678 -0.26042409
## diab_eyeexam_10 0.3100202117 -0.25593691
## diab_lipids_10 0.1814631619 0.12465323
## primcarevis_10 1.0000000000 -0.03321519
## reimb_penroll_adj10 -0.0332151898 1.00000000
## adjmortmeas_amiall30day -0.0002811419 -0.03512968
## adjmortmeas_chfall30day 0.0683039573 -0.49432222
## adjmortmeas_pnall30day 0.0802329948 -0.12328907
## med_prev_qual_z 0.5351918376 -0.36554703
## adjmortmeas_amiall30day adjmortmeas_chfall30day
## le_raceadj_q1_both 0.1473747442 0.11354249
## puninsured2010 0.3624384508 0.15126693
## mammogram_10 -0.2720272160 -0.01102297
## cur_smoke_q1 -0.0750279951 -0.07066141
## bmi_obese_q1 0.1064464677 -0.21995107
## exercise_any_q1 0.0230963852 0.41879551
## mort_30day_hosp_z 0.8572584903 0.84804031
## diab_eyeexam_10 -0.3063870046 -0.01855114
## diab_lipids_10 -0.2205759219 -0.22097762
## primcarevis_10 -0.0002811419 0.06830396
## reimb_penroll_adj10 -0.0351296789 -0.49432222
## adjmortmeas_amiall30day 1.0000000000 0.62825004
## adjmortmeas_chfall30day 0.6282500375 1.00000000
## adjmortmeas_pnall30day 0.5972172770 0.54010587
## med_prev_qual_z -0.2584231901 0.07282819
## adjmortmeas_pnall30day med_prev_qual_z
## le_raceadj_q1_both 0.153803088 -0.03769047
## puninsured2010 0.274266108 -0.33708105
## mammogram_10 -0.140795338 0.86898970
## cur_smoke_q1 0.006981048 0.01467192
## bmi_obese_q1 0.101377944 -0.16009702
## exercise_any_q1 -0.069199628 0.19376191
## mort_30day_hosp_z 0.849187830 -0.13623423
## diab_eyeexam_10 -0.262057887 0.83490965
## diab_lipids_10 -0.195358678 0.67155022
## primcarevis_10 0.080232995 0.53519184
## reimb_penroll_adj10 -0.123289069 -0.36554703
## adjmortmeas_amiall30day 0.597217277 -0.25842319
## adjmortmeas_chfall30day 0.540105872 0.07282819
## adjmortmeas_pnall30day 1.000000000 -0.17293120
## med_prev_qual_z -0.172931205 1.00000000
Create Multiple Regression Model
Looking at the r correlation coefficients for the data, I am going to create a multiple regression model that uses the fraction of current smokers in the first income quartile, fraction of obese population, percent of population uninsured in 2010, and the percentage of medicare enrollees who have had at least one primary care visit within the last 12 months to create a multiple regression model
# create multiple regression model
hlthmr = lm(le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 + primcarevis_10, data = hlth)
# view summary of model
summary(hlthmr)
##
## Call:
## lm(formula = le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 +
## puninsured2010 + primcarevis_10, data = hlth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5134 -0.9107 -0.1037 0.7583 4.8818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.70026 2.82123 30.022 < 2e-16 ***
## cur_smoke_q1 -11.09496 3.78916 -2.928 0.00427 **
## bmi_obese_q1 -12.12952 4.37222 -2.774 0.00666 **
## puninsured2010 0.08822 0.02749 3.209 0.00182 **
## primcarevis_10 -0.01420 0.03818 -0.372 0.71078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.378 on 95 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.3552, Adjusted R-squared: 0.328
## F-statistic: 13.08 on 4 and 95 DF, p-value: 1.587e-08
Initial Observations INCLUDE SOMETHING ABOUT T-VALUES. Looking at Pr(>|t|), or the p-value that corresponds to the t-statistic, we see that the smoking, obesity, and uninsured predictor variables are the three with values below the alpha level of 0.05, meaning that these three are statistically significant p-values. The residual standard error is 1.378 on 95 degrees of freedom. The residual standard error is the standard deviation of residuals, meaning that this model predicts the average life expectancy of a person in the first income quartile with an average error of about 1.378 years (this value is valuable when comparing the accuracy of different regression models). The coefficient of determination, or multiple R^2, is 0.3552, and the adjusted R^2 value is 0.328. The coefficient of determination explains how much variation in our life expectancy variable can be explained by our predictor variables, meaning that this model explains only 32.8% of variance. The p-value of 1.587e-08 is less than the significance level of 0.05, meaning the regression model is statistically significant. The purpose of the F-statistic is to see if any of the predictor variables are related to life expectancy, and this goes hand in hand with the p-value, which is statistically significant.
After observing the model, I decided to eliminate the ‘primcarevis_10’ variable because it was not statistically significant in the model.
hlthmr = lm(le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010, data = hlth)
summary(hlthmr)
##
## Call:
## lm(formula = le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 +
## puninsured2010, data = hlth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5447 -0.9496 -0.1078 0.7340 4.8681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.79973 1.44151 58.133 < 2e-16 ***
## cur_smoke_q1 -11.45443 3.64734 -3.140 0.00224 **
## bmi_obese_q1 -12.58156 4.18100 -3.009 0.00335 **
## puninsured2010 0.08869 0.02734 3.244 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.372 on 96 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.3542, Adjusted R-squared: 0.3341
## F-statistic: 17.55 on 3 and 96 DF, p-value: 3.652e-09
The result of this is a more significant p-value and a greater R^2 value of 0.3341, telling us that the regression equation explains 33.41% of variance in the model.
hlthmrresid = resid(hlthmr)
hist(hlthmrresid, main = "Histogram of hlth Residuals")
# plot residuals
plot(hlthmr)
Here, the Residuals vs. Fitted shows a small amount of departure from normality as the residuals do not have as much spread towards the end of the plot as they do in the beginning, but nonetheless, the points are evenly spread above and below the zero line, proving the condition for constant variance is met. The quantile-quantile plot mostly follows the line of best fit with a few departures towards the right, hinting at right skew.
To more accurately see what work related variables have the most significance on life expectancy. To more accurately measure this, I created an additional
q1 = ggplot(red, aes(x = hours_wk_pooled_pooled_p1, y = le_raceadj_q1_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Mean weekly hours worked") +
ylab("Q1 Life Expectancy") +
ggtitle("Q1 Life Expectancy and Hours Worked")
q1
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
The scatterplot above show the correlation between the mean hours worked weekly and the corresponding life expectancy within that income quartile, which above, is the first income quartile ranging from the first to the twenty fifth percentile.
q2 = ggplot(red, aes(x = hours_wk_pooled_pooled_p25, y = le_raceadj_q2_M)) +
geom_point() +
stat_cor(method = "pearson") +
stat_smooth(method = 'lm') +
xlab("Mean weekly hours worked") +
ylab("Q2 Life Expectancy") +
ggtitle("Q2 Life Expectancy and Hours Worked")
q2
## Warning: Removed 50611 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 50611 rows containing non-finite values (stat_smooth).
## Warning: Removed 50611 rows containing missing values (geom_point).
The scatterplot above shows the correlation between mean weekly hours worked and life expectancy in the second income quartile, from the twenty-fifth to the fiftieth percentile. The p-value within this dataset is 0.85, which is not statistically significant compared to the quarter one income quartile data. Because of this fact, I chose to explore the effect of labor and work predictor variables specifically in the first income quartile.
CREATING A MODEL To begin, I again created a correlation matrix in R to find the correlational coefficients of different predictor variables on life expectancy as well as in relation to other predictor variables.
0.85 Mean weekly hours worked in the 12 months before being surveyed
0.309 Fraction of children with positive hours worked in the 12 months before being surveyed
-0.333 Share Working in Manufacturing
0.311 Median House Value
0.158 Median Household Income in 2016
Looking at the data as a whole, the strongest correlations are between:
-0.537 Mean Household Income in 2016 and Share of Poor People in 2010
-0.500 Percent Employed in 2000 and Share of Poor People
0.684 Fraction of children with positive W-2 earnings in 2015 and Mean household income rank for parents of the people surveyed
-0.574 Unemployment Rate and Mean household income rank for parents of the people surveyed
## load data
work = read.csv("~/desktop/workdata.csv")
## Use the names() function to choose variables to be used in the correlation matrix
# names(work)
cordata = cor(work[,c(10, 2, 3, 4, 5, 6, 8, 9, 11, 12, 15, 18, 20, 21, 22, 23, 24, 25, 26, 27)], use = "complete.obs")
# view matrix
print(cordata)
## le_raceadj_q1_both hhinc_mean2000
## le_raceadj_q1_both 1.000000000 0.09105442
## hhinc_mean2000 0.091054423 1.00000000
## mean_commutetime2000 -0.079198601 0.17497527
## med_hhinc2016 0.158862663 0.94526997
## poor_share2010 -0.075893493 -0.53723906
## emp2000 -0.065429217 0.47690266
## jobs_highpay_5mi_2015 0.191428890 0.16913946
## ann_avg_job_growth_2004_2013 0.053711993 -0.07006619
## par_rank_pooled_pooled_mean 0.015845988 0.36349195
## frac_below_median_pooled_pooled -0.003483334 -0.31637269
## hours_wk_pooled_pooled_p25 0.850325390 0.05611530
## pos_hours_pooled_pooled_mean 0.309276182 0.21696150
## working_pooled_pooled_mean 0.130705110 0.12706781
## unemp_rate 0.014556303 -0.25021494
## taxrate 0.204485917 0.07711122
## frac_middleclass -0.175255856 -0.23807430
## cs_elf_ind_man -0.333336300 -0.09068853
## lf_d_2000_1980 0.154836353 -0.02912371
## median_house_value 0.310704041 0.45535382
## hhinc00 0.069693096 0.47151125
## mean_commutetime2000 med_hhinc2016
## le_raceadj_q1_both -0.079198601 0.15886266
## hhinc_mean2000 0.174975275 0.94526997
## mean_commutetime2000 1.000000000 0.20070802
## med_hhinc2016 0.200708018 1.00000000
## poor_share2010 -0.193312747 -0.56391518
## emp2000 0.315165146 0.52335095
## jobs_highpay_5mi_2015 -0.341655216 0.23662681
## ann_avg_job_growth_2004_2013 0.086422893 -0.03793591
## par_rank_pooled_pooled_mean 0.082902475 0.36250605
## frac_below_median_pooled_pooled -0.055087757 -0.32097207
## hours_wk_pooled_pooled_p25 -0.007120154 0.11447319
## pos_hours_pooled_pooled_mean 0.007978061 0.22695449
## working_pooled_pooled_mean 0.004496452 0.15380103
## unemp_rate -0.195929242 -0.26672772
## taxrate 0.003986853 0.12542530
## frac_middleclass -0.165298223 -0.20820457
## cs_elf_ind_man -0.091961769 -0.15462678
## lf_d_2000_1980 0.081845858 -0.02181458
## median_house_value -0.026855017 0.43875770
## hhinc00 0.126126789 0.40976921
## poor_share2010 emp2000
## le_raceadj_q1_both -0.075893493 -0.06542922
## hhinc_mean2000 -0.537239064 0.47690266
## mean_commutetime2000 -0.193312747 0.31516515
## med_hhinc2016 -0.563915182 0.52335095
## poor_share2010 1.000000000 -0.50083320
## emp2000 -0.500833204 1.00000000
## jobs_highpay_5mi_2015 0.020294615 -0.19846073
## ann_avg_job_growth_2004_2013 0.004129212 -0.04411273
## par_rank_pooled_pooled_mean -0.152067070 0.37978167
## frac_below_median_pooled_pooled 0.122551430 -0.38471352
## hours_wk_pooled_pooled_p25 -0.071901081 0.08447463
## pos_hours_pooled_pooled_mean -0.126130014 0.35829182
## working_pooled_pooled_mean -0.047793562 0.35244897
## unemp_rate 0.304089340 -0.38033938
## taxrate -0.082385956 0.08398608
## frac_middleclass 0.123236230 0.02472701
## cs_elf_ind_man 0.140922296 0.17494858
## lf_d_2000_1980 -0.078446488 -0.15447999
## median_house_value -0.276779948 0.01229535
## hhinc00 -0.301901615 0.10725281
## jobs_highpay_5mi_2015
## le_raceadj_q1_both 0.19142889
## hhinc_mean2000 0.16913946
## mean_commutetime2000 -0.34165522
## med_hhinc2016 0.23662681
## poor_share2010 0.02029461
## emp2000 -0.19846073
## jobs_highpay_5mi_2015 1.00000000
## ann_avg_job_growth_2004_2013 -0.11804827
## par_rank_pooled_pooled_mean 0.10786146
## frac_below_median_pooled_pooled -0.06016351
## hours_wk_pooled_pooled_p25 0.18831899
## pos_hours_pooled_pooled_mean 0.12024616
## working_pooled_pooled_mean 0.06463117
## unemp_rate -0.03231251
## taxrate 0.34934474
## frac_middleclass -0.24037063
## cs_elf_ind_man -0.22674484
## lf_d_2000_1980 -0.12654618
## median_house_value 0.26041859
## hhinc00 0.28121363
## ann_avg_job_growth_2004_2013
## le_raceadj_q1_both 0.053711993
## hhinc_mean2000 -0.070066191
## mean_commutetime2000 0.086422893
## med_hhinc2016 -0.037935912
## poor_share2010 0.004129212
## emp2000 -0.044112735
## jobs_highpay_5mi_2015 -0.118048268
## ann_avg_job_growth_2004_2013 1.000000000
## par_rank_pooled_pooled_mean -0.344895795
## frac_below_median_pooled_pooled 0.338405444
## hours_wk_pooled_pooled_p25 -0.088304489
## pos_hours_pooled_pooled_mean -0.314276140
## working_pooled_pooled_mean -0.338528877
## unemp_rate 0.222925959
## taxrate -0.157959339
## frac_middleclass -0.191076245
## cs_elf_ind_man -0.195576775
## lf_d_2000_1980 0.117077193
## median_house_value -0.022060949
## hhinc00 -0.278537090
## par_rank_pooled_pooled_mean
## le_raceadj_q1_both 0.01584599
## hhinc_mean2000 0.36349195
## mean_commutetime2000 0.08290248
## med_hhinc2016 0.36250605
## poor_share2010 -0.15206707
## emp2000 0.37978167
## jobs_highpay_5mi_2015 0.10786146
## ann_avg_job_growth_2004_2013 -0.34489579
## par_rank_pooled_pooled_mean 1.00000000
## frac_below_median_pooled_pooled -0.99007066
## hours_wk_pooled_pooled_p25 0.18526103
## pos_hours_pooled_pooled_mean 0.75815246
## working_pooled_pooled_mean 0.68458943
## unemp_rate -0.57421094
## taxrate 0.33061980
## frac_middleclass 0.19918524
## cs_elf_ind_man 0.27533374
## lf_d_2000_1980 -0.28729442
## median_house_value 0.38319623
## hhinc00 0.62807113
## frac_below_median_pooled_pooled
## le_raceadj_q1_both -0.003483334
## hhinc_mean2000 -0.316372689
## mean_commutetime2000 -0.055087757
## med_hhinc2016 -0.320972069
## poor_share2010 0.122551430
## emp2000 -0.384713525
## jobs_highpay_5mi_2015 -0.060163510
## ann_avg_job_growth_2004_2013 0.338405444
## par_rank_pooled_pooled_mean -0.990070657
## frac_below_median_pooled_pooled 1.000000000
## hours_wk_pooled_pooled_p25 -0.177766587
## pos_hours_pooled_pooled_mean -0.765384188
## working_pooled_pooled_mean -0.719350140
## unemp_rate 0.543040827
## taxrate -0.294810350
## frac_middleclass -0.294452947
## cs_elf_ind_man -0.335377003
## lf_d_2000_1980 0.332702913
## median_house_value -0.319005890
## hhinc00 -0.541953066
## hours_wk_pooled_pooled_p25
## le_raceadj_q1_both 0.850325390
## hhinc_mean2000 0.056115300
## mean_commutetime2000 -0.007120154
## med_hhinc2016 0.114473188
## poor_share2010 -0.071901081
## emp2000 0.084474628
## jobs_highpay_5mi_2015 0.188318985
## ann_avg_job_growth_2004_2013 -0.088304489
## par_rank_pooled_pooled_mean 0.185261033
## frac_below_median_pooled_pooled -0.177766587
## hours_wk_pooled_pooled_p25 1.000000000
## pos_hours_pooled_pooled_mean 0.641610114
## working_pooled_pooled_mean 0.507372079
## unemp_rate -0.239434482
## taxrate 0.330039711
## frac_middleclass -0.108066294
## cs_elf_ind_man -0.152914988
## lf_d_2000_1980 -0.089001699
## median_house_value 0.109526982
## hhinc00 0.142412460
## pos_hours_pooled_pooled_mean
## le_raceadj_q1_both 0.309276182
## hhinc_mean2000 0.216961498
## mean_commutetime2000 0.007978061
## med_hhinc2016 0.226954492
## poor_share2010 -0.126130014
## emp2000 0.358291823
## jobs_highpay_5mi_2015 0.120246158
## ann_avg_job_growth_2004_2013 -0.314276140
## par_rank_pooled_pooled_mean 0.758152464
## frac_below_median_pooled_pooled -0.765384188
## hours_wk_pooled_pooled_p25 0.641610114
## pos_hours_pooled_pooled_mean 1.000000000
## working_pooled_pooled_mean 0.901925174
## unemp_rate -0.532614285
## taxrate 0.395086172
## frac_middleclass 0.147443859
## cs_elf_ind_man 0.222183364
## lf_d_2000_1980 -0.384371050
## median_house_value 0.185931196
## hhinc00 0.432336331
## working_pooled_pooled_mean unemp_rate
## le_raceadj_q1_both 0.130705110 0.01455630
## hhinc_mean2000 0.127067812 -0.25021494
## mean_commutetime2000 0.004496452 -0.19592924
## med_hhinc2016 0.153801029 -0.26672772
## poor_share2010 -0.047793562 0.30408934
## emp2000 0.352448967 -0.38033938
## jobs_highpay_5mi_2015 0.064631174 -0.03231251
## ann_avg_job_growth_2004_2013 -0.338528877 0.22292596
## par_rank_pooled_pooled_mean 0.684589434 -0.57421094
## frac_below_median_pooled_pooled -0.719350140 0.54304083
## hours_wk_pooled_pooled_p25 0.507372079 -0.23943448
## pos_hours_pooled_pooled_mean 0.901925174 -0.53261429
## working_pooled_pooled_mean 1.000000000 -0.44696368
## unemp_rate -0.446963678 1.00000000
## taxrate 0.354511451 -0.14137202
## frac_middleclass 0.236830082 -0.14830155
## cs_elf_ind_man 0.335033068 0.04453594
## lf_d_2000_1980 -0.496083807 0.05654919
## median_house_value -0.008773129 -0.11451578
## hhinc00 0.260173094 -0.51007599
## taxrate frac_middleclass cs_elf_ind_man
## le_raceadj_q1_both 0.204485917 -0.17525586 -0.33333630
## hhinc_mean2000 0.077111216 -0.23807430 -0.09068853
## mean_commutetime2000 0.003986853 -0.16529822 -0.09196177
## med_hhinc2016 0.125425304 -0.20820457 -0.15462678
## poor_share2010 -0.082385956 0.12323623 0.14092230
## emp2000 0.083986083 0.02472701 0.17494858
## jobs_highpay_5mi_2015 0.349344741 -0.24037063 -0.22674484
## ann_avg_job_growth_2004_2013 -0.157959339 -0.19107624 -0.19557677
## par_rank_pooled_pooled_mean 0.330619797 0.19918524 0.27533374
## frac_below_median_pooled_pooled -0.294810350 -0.29445295 -0.33537700
## hours_wk_pooled_pooled_p25 0.330039711 -0.10806629 -0.15291499
## pos_hours_pooled_pooled_mean 0.395086172 0.14744386 0.22218336
## working_pooled_pooled_mean 0.354511451 0.23683008 0.33503307
## unemp_rate -0.141372016 -0.14830155 0.04453594
## taxrate 1.000000000 -0.25114449 -0.10623790
## frac_middleclass -0.251144488 1.00000000 0.39727464
## cs_elf_ind_man -0.106237902 0.39727464 1.00000000
## lf_d_2000_1980 -0.057408286 -0.15608960 -0.47878651
## median_house_value 0.073634393 -0.27657458 -0.14053407
## hhinc00 0.315972353 -0.22740444 -0.13242301
## lf_d_2000_1980 median_house_value hhinc00
## le_raceadj_q1_both 0.15483635 0.310704041 0.0696931
## hhinc_mean2000 -0.02912371 0.455353825 0.4715112
## mean_commutetime2000 0.08184586 -0.026855017 0.1261268
## med_hhinc2016 -0.02181458 0.438757696 0.4097692
## poor_share2010 -0.07844649 -0.276779948 -0.3019016
## emp2000 -0.15447999 0.012295346 0.1072528
## jobs_highpay_5mi_2015 -0.12654618 0.260418594 0.2812136
## ann_avg_job_growth_2004_2013 0.11707719 -0.022060949 -0.2785371
## par_rank_pooled_pooled_mean -0.28729442 0.383196229 0.6280711
## frac_below_median_pooled_pooled 0.33270291 -0.319005890 -0.5419531
## hours_wk_pooled_pooled_p25 -0.08900170 0.109526982 0.1424125
## pos_hours_pooled_pooled_mean -0.38437105 0.185931196 0.4323363
## working_pooled_pooled_mean -0.49608381 -0.008773129 0.2601731
## unemp_rate 0.05654919 -0.114515777 -0.5100760
## taxrate -0.05740829 0.073634393 0.3159724
## frac_middleclass -0.15608960 -0.276574580 -0.2274044
## cs_elf_ind_man -0.47878651 -0.140534066 -0.1324230
## lf_d_2000_1980 1.00000000 0.014243785 0.1482713
## median_house_value 0.01424379 1.000000000 0.5704007
## hhinc00 0.14827132 0.570400741 1.0000000
Looking at the r correlation coefficients for the data, I am going to create a multiple regression model that uses the mean hours worked weekly in the twelve months prior to the survey, the fraction of people with positive working hours in the 12 months prior to the survey, the share working in manufacturing, median house value, and finally, the median household income in 2016 to create a multiple regression model.
workmr = lm(le_raceadj_q1_both ~ hours_wk_pooled_pooled_p25 + pos_hours_pooled_pooled_mean +
cs_elf_ind_man + hhinc_mean2000, data = work)
summary(workmr)
##
## Call:
## lm(formula = le_raceadj_q1_both ~ hours_wk_pooled_pooled_p25 +
## pos_hours_pooled_pooled_mean + cs_elf_ind_man + hhinc_mean2000,
## data = work)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87066 -0.28540 -0.04589 0.20837 1.26227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.427e-01 1.358e+00 -0.400 0.6903
## hours_wk_pooled_pooled_p25 8.747e-01 4.821e-02 18.144 < 2e-16 ***
## pos_hours_pooled_pooled_mean -1.489e+01 2.343e+00 -6.354 7.22e-09 ***
## cs_elf_ind_man -1.269e+00 8.823e-01 -1.439 0.1535
## hhinc_mean2000 2.790e-06 1.109e-06 2.515 0.0136 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4294 on 95 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.8358, Adjusted R-squared: 0.8289
## F-statistic: 120.9 on 4 and 95 DF, p-value: < 2.2e-16
Initial Observations
At first glance, it appears that the variables chosen do not carry much statistical significance within the model, as the correlational coefficients are all below 1, with only two predictor variables having a Pr(>|t|) value below the significance level of 0.05.
This model is simply not statistically significant, so I am going to use another method to choose the predictor variables for my model, stepwise regression. To begin, I created a regression model with all available work related variables. Then, I used a stepwise function within R to find the best model.
# create regression model with all variables
stepwise = lm(le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 +
med_hhinc2016 + poor_share2010 + emp2000 +
jobs_highpay_5mi_2015 + ann_avg_job_growth_2004_2013 +
hours_wk_pooled_pooled_p25 + working_pooled_pooled_mean +
unemp_rate + taxrate + frac_middleclass +
cs_elf_ind_man + lf_d_2000_1980 + median_house_value, data = work)
# stepwise regression
stp_s = ols_step_both_p(stepwise, details = TRUE)
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. hhinc_mean2000
## 2. mean_commutetime2000
## 3. med_hhinc2016
## 4. poor_share2010
## 5. emp2000
## 6. jobs_highpay_5mi_2015
## 7. ann_avg_job_growth_2004_2013
## 8. hours_wk_pooled_pooled_p25
## 9. working_pooled_pooled_mean
## 10. unemp_rate
## 11. taxrate
## 12. frac_middleclass
## 13. cs_elf_ind_man
## 14. lf_d_2000_1980
## 15. median_house_value
##
## We are selecting variables based on p value...
##
##
## Stepwise Selection: Step 1
##
## - hours_wk_pooled_pooled_p25 added
##
## Model Summary
## -------------------------------------------------------------
## R 0.850 RMSE 0.549
## R-Squared 0.723 Coef. Var 4.862
## Adj. R-Squared 0.720 MSE 0.302
## Pred R-Squared 0.713 MAE 0.433
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 77.149 1 77.149 255.859 0.0000
## Residual 29.550 98 0.302
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------------------
## (Intercept) -7.798 1.195 -6.526 0.000 -10.169 -5.427
## hours_wk_pooled_pooled_p25 0.683 0.043 0.850 15.996 0.000 0.598 0.768
## --------------------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 2
##
## - working_pooled_pooled_mean added
##
## Model Summary
## -------------------------------------------------------------
## R 0.919 RMSE 0.413
## R-Squared 0.845 Coef. Var 3.658
## Adj. R-Squared 0.842 MSE 0.171
## Pred R-Squared 0.836 MAE 0.311
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 90.144 2 45.072 264.082 0.0000
## Residual 16.555 97 0.171
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------------
## (Intercept) -0.208 1.251 -0.166 0.868 -2.690 2.275
## hours_wk_pooled_pooled_p25 0.848 0.037 1.056 22.748 0.000 0.774 0.922
## working_pooled_pooled_mean -15.711 1.801 -0.405 -8.726 0.000 -19.284 -12.137
## ----------------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.919 RMSE 0.413
## R-Squared 0.845 Coef. Var 3.658
## Adj. R-Squared 0.842 MSE 0.171
## Pred R-Squared 0.836 MAE 0.311
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 90.144 2 45.072 264.082 0.0000
## Residual 16.555 97 0.171
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------------
## (Intercept) -0.208 1.251 -0.166 0.868 -2.690 2.275
## hours_wk_pooled_pooled_p25 0.848 0.037 1.056 22.748 0.000 0.774 0.922
## working_pooled_pooled_mean -15.711 1.801 -0.405 -8.726 0.000 -19.284 -12.137
## ----------------------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 3
##
## - median_house_value added
##
## Model Summary
## -------------------------------------------------------------
## R 0.939 RMSE 0.362
## R-Squared 0.882 Coef. Var 3.204
## Adj. R-Squared 0.878 MSE 0.131
## Pred R-Squared 0.873 MAE 0.271
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 94.127 3 31.376 239.587 0.0000
## Residual 12.572 96 0.131
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------------
## (Intercept) -0.510 1.097 -0.465 0.643 -2.688 1.667
## hours_wk_pooled_pooled_p25 0.824 0.033 1.026 25.012 0.000 0.758 0.889
## working_pooled_pooled_mean -15.056 1.582 -0.388 -9.519 0.000 -18.195 -11.916
## median_house_value 0.000 0.000 0.195 5.515 0.000 0.000 0.000
## ----------------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.939 RMSE 0.362
## R-Squared 0.882 Coef. Var 3.204
## Adj. R-Squared 0.878 MSE 0.131
## Pred R-Squared 0.873 MAE 0.271
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 94.127 3 31.376 239.587 0.0000
## Residual 12.572 96 0.131
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------------
## (Intercept) -0.510 1.097 -0.465 0.643 -2.688 1.667
## hours_wk_pooled_pooled_p25 0.824 0.033 1.026 25.012 0.000 0.758 0.889
## working_pooled_pooled_mean -15.056 1.582 -0.388 -9.519 0.000 -18.195 -11.916
## median_house_value 0.000 0.000 0.195 5.515 0.000 0.000 0.000
## ----------------------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 4
##
## - unemp_rate added
##
## Model Summary
## -------------------------------------------------------------
## R 0.947 RMSE 0.340
## R-Squared 0.897 Coef. Var 3.007
## Adj. R-Squared 0.893 MSE 0.115
## Pred R-Squared 0.886 MAE 0.242
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 95.741 4 23.935 207.506 0.0000
## Residual 10.958 95 0.115
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -2.950 1.219 -2.420 0.017 -5.369 -0.530
## hours_wk_pooled_pooled_p25 0.824 0.031 1.026 26.647 0.000 0.762 0.885
## working_pooled_pooled_mean -12.642 1.619 -0.326 -7.810 0.000 -15.855 -9.428
## median_house_value 0.000 0.000 0.211 6.317 0.000 0.000 0.000
## unemp_rate 11.260 3.010 0.139 3.741 0.000 5.284 17.235
## ---------------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.947 RMSE 0.340
## R-Squared 0.897 Coef. Var 3.007
## Adj. R-Squared 0.893 MSE 0.115
## Pred R-Squared 0.886 MAE 0.242
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 95.741 4 23.935 207.506 0.0000
## Residual 10.958 95 0.115
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -2.950 1.219 -2.420 0.017 -5.369 -0.530
## hours_wk_pooled_pooled_p25 0.824 0.031 1.026 26.647 0.000 0.762 0.885
## working_pooled_pooled_mean -12.642 1.619 -0.326 -7.810 0.000 -15.855 -9.428
## median_house_value 0.000 0.000 0.211 6.317 0.000 0.000 0.000
## unemp_rate 11.260 3.010 0.139 3.741 0.000 5.284 17.235
## ---------------------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 5
##
## - frac_middleclass added
##
## Model Summary
## -------------------------------------------------------------
## R 0.953 RMSE 0.324
## R-Squared 0.908 Coef. Var 2.866
## Adj. R-Squared 0.903 MSE 0.105
## Pred R-Squared 0.892 MAE 0.231
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 96.854 5 19.371 184.94 0.0000
## Residual 9.846 94 0.105
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------------
## (Intercept) -3.764 1.188 -3.168 0.002 -6.123 -1.405
## hours_wk_pooled_pooled_p25 0.848 0.030 1.056 27.897 0.000 0.788 0.909
## working_pooled_pooled_mean -14.091 1.605 -0.363 -8.778 0.000 -17.278 -10.904
## median_house_value 0.000 0.000 0.240 7.261 0.000 0.000 0.000
## unemp_rate 12.136 2.881 0.150 4.212 0.000 6.416 17.856
## frac_middleclass 2.291 0.703 0.114 3.259 0.002 0.895 3.687
## ----------------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.953 RMSE 0.324
## R-Squared 0.908 Coef. Var 2.866
## Adj. R-Squared 0.903 MSE 0.105
## Pred R-Squared 0.892 MAE 0.231
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 96.854 5 19.371 184.94 0.0000
## Residual 9.846 94 0.105
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------------
## (Intercept) -3.764 1.188 -3.168 0.002 -6.123 -1.405
## hours_wk_pooled_pooled_p25 0.848 0.030 1.056 27.897 0.000 0.788 0.909
## working_pooled_pooled_mean -14.091 1.605 -0.363 -8.778 0.000 -17.278 -10.904
## median_house_value 0.000 0.000 0.240 7.261 0.000 0.000 0.000
## unemp_rate 12.136 2.881 0.150 4.212 0.000 6.416 17.856
## frac_middleclass 2.291 0.703 0.114 3.259 0.002 0.895 3.687
## ----------------------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 6
##
## - lf_d_2000_1980 added
##
## Model Summary
## -------------------------------------------------------------
## R 0.957 RMSE 0.311
## R-Squared 0.916 Coef. Var 2.751
## Adj. R-Squared 0.910 MSE 0.097
## Pred R-Squared 0.900 MAE 0.222
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 97.720 6 16.287 168.687 0.0000
## Residual 8.979 93 0.097
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -5.779 1.324 -4.364 0.000 -8.408 -3.149
## hours_wk_pooled_pooled_p25 0.829 0.030 1.032 27.740 0.000 0.770 0.888
## working_pooled_pooled_mean -11.135 1.830 -0.287 -6.085 0.000 -14.769 -7.502
## median_house_value 0.000 0.000 0.245 7.700 0.000 0.000 0.000
## unemp_rate 13.992 2.835 0.172 4.936 0.000 8.363 19.621
## frac_middleclass 2.313 0.675 0.115 3.426 0.001 0.972 3.653
## lf_d_2000_1980 0.352 0.117 0.109 2.996 0.004 0.119 0.585
## ---------------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.957 RMSE 0.311
## R-Squared 0.916 Coef. Var 2.751
## Adj. R-Squared 0.910 MSE 0.097
## Pred R-Squared 0.900 MAE 0.222
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 97.720 6 16.287 168.687 0.0000
## Residual 8.979 93 0.097
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -5.779 1.324 -4.364 0.000 -8.408 -3.149
## hours_wk_pooled_pooled_p25 0.829 0.030 1.032 27.740 0.000 0.770 0.888
## working_pooled_pooled_mean -11.135 1.830 -0.287 -6.085 0.000 -14.769 -7.502
## median_house_value 0.000 0.000 0.245 7.700 0.000 0.000 0.000
## unemp_rate 13.992 2.835 0.172 4.936 0.000 8.363 19.621
## frac_middleclass 2.313 0.675 0.115 3.426 0.001 0.972 3.653
## lf_d_2000_1980 0.352 0.117 0.109 2.996 0.004 0.119 0.585
## ---------------------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 7
##
## - cs_elf_ind_man added
##
## Model Summary
## -------------------------------------------------------------
## R 0.959 RMSE 0.306
## R-Squared 0.919 Coef. Var 2.706
## Adj. R-Squared 0.913 MSE 0.093
## Pred R-Squared 0.903 MAE 0.221
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 98.107 7 14.015 150.058 0.0000
## Residual 8.593 92 0.093
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -6.150 1.315 -4.677 0.000 -8.762 -3.538
## hours_wk_pooled_pooled_p25 0.811 0.031 1.010 26.416 0.000 0.750 0.872
## working_pooled_pooled_mean -10.052 1.877 -0.259 -5.355 0.000 -13.780 -6.324
## median_house_value 0.000 0.000 0.244 7.808 0.000 0.000 0.000
## unemp_rate 15.206 2.851 0.187 5.333 0.000 9.543 20.868
## frac_middleclass 2.730 0.695 0.135 3.928 0.000 1.350 4.110
## lf_d_2000_1980 0.276 0.121 0.085 2.272 0.025 0.035 0.517
## cs_elf_ind_man -1.468 0.721 -0.079 -2.034 0.045 -2.900 -0.035
## ---------------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.959 RMSE 0.306
## R-Squared 0.919 Coef. Var 2.706
## Adj. R-Squared 0.913 MSE 0.093
## Pred R-Squared 0.903 MAE 0.221
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 98.107 7 14.015 150.058 0.0000
## Residual 8.593 92 0.093
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -6.150 1.315 -4.677 0.000 -8.762 -3.538
## hours_wk_pooled_pooled_p25 0.811 0.031 1.010 26.416 0.000 0.750 0.872
## working_pooled_pooled_mean -10.052 1.877 -0.259 -5.355 0.000 -13.780 -6.324
## median_house_value 0.000 0.000 0.244 7.808 0.000 0.000 0.000
## unemp_rate 15.206 2.851 0.187 5.333 0.000 9.543 20.868
## frac_middleclass 2.730 0.695 0.135 3.928 0.000 1.350 4.110
## lf_d_2000_1980 0.276 0.121 0.085 2.272 0.025 0.035 0.517
## cs_elf_ind_man -1.468 0.721 -0.079 -2.034 0.045 -2.900 -0.035
## ---------------------------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 8
##
## - med_hhinc2016 added
##
## Model Summary
## -------------------------------------------------------------
## R 0.960 RMSE 0.302
## R-Squared 0.922 Coef. Var 2.676
## Adj. R-Squared 0.915 MSE 0.091
## Pred R-Squared 0.904 MAE 0.220
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 98.389 8 12.299 134.674 0.0000
## Residual 8.310 91 0.091
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -6.123 1.300 -4.709 0.000 -8.706 -3.540
## hours_wk_pooled_pooled_p25 0.816 0.031 1.017 26.758 0.000 0.756 0.877
## working_pooled_pooled_mean -10.542 1.877 -0.272 -5.617 0.000 -14.271 -6.814
## median_house_value 0.000 0.000 0.222 6.604 0.000 0.000 0.000
## unemp_rate 16.049 2.860 0.198 5.612 0.000 10.368 21.729
## frac_middleclass 2.902 0.694 0.144 4.182 0.000 1.524 4.281
## lf_d_2000_1980 0.277 0.120 0.086 2.309 0.023 0.039 0.515
## cs_elf_ind_man -1.323 0.718 -0.071 -1.843 0.069 -2.750 0.103
## med_hhinc2016 0.000 0.000 0.061 1.759 0.082 0.000 0.000
## ---------------------------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.960 RMSE 0.302
## R-Squared 0.922 Coef. Var 2.676
## Adj. R-Squared 0.915 MSE 0.091
## Pred R-Squared 0.904 MAE 0.220
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 98.389 8 12.299 134.674 0.0000
## Residual 8.310 91 0.091
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -6.123 1.300 -4.709 0.000 -8.706 -3.540
## hours_wk_pooled_pooled_p25 0.816 0.031 1.017 26.758 0.000 0.756 0.877
## working_pooled_pooled_mean -10.542 1.877 -0.272 -5.617 0.000 -14.271 -6.814
## median_house_value 0.000 0.000 0.222 6.604 0.000 0.000 0.000
## unemp_rate 16.049 2.860 0.198 5.612 0.000 10.368 21.729
## frac_middleclass 2.902 0.694 0.144 4.182 0.000 1.524 4.281
## lf_d_2000_1980 0.277 0.120 0.086 2.309 0.023 0.039 0.515
## cs_elf_ind_man -1.323 0.718 -0.071 -1.843 0.069 -2.750 0.103
## med_hhinc2016 0.000 0.000 0.061 1.759 0.082 0.000 0.000
## ---------------------------------------------------------------------------------------------------------
##
##
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -------------------------------------------------------------
## R 0.960 RMSE 0.302
## R-Squared 0.922 Coef. Var 2.676
## Adj. R-Squared 0.915 MSE 0.091
## Pred R-Squared 0.904 MAE 0.220
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 98.389 8 12.299 134.674 0.0000
## Residual 8.310 91 0.091
## Total 106.699 99
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------
## (Intercept) -6.123 1.300 -4.709 0.000 -8.706 -3.540
## hours_wk_pooled_pooled_p25 0.816 0.031 1.017 26.758 0.000 0.756 0.877
## working_pooled_pooled_mean -10.542 1.877 -0.272 -5.617 0.000 -14.271 -6.814
## median_house_value 0.000 0.000 0.222 6.604 0.000 0.000 0.000
## unemp_rate 16.049 2.860 0.198 5.612 0.000 10.368 21.729
## frac_middleclass 2.902 0.694 0.144 4.182 0.000 1.524 4.281
## lf_d_2000_1980 0.277 0.120 0.086 2.309 0.023 0.039 0.515
## cs_elf_ind_man -1.323 0.718 -0.071 -1.843 0.069 -2.750 0.103
## med_hhinc2016 0.000 0.000 0.061 1.759 0.082 0.000 0.000
## ---------------------------------------------------------------------------------------------------------
stp_s$model
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Coefficients:
## (Intercept) hours_wk_pooled_pooled_p25
## -6.123e+00 8.164e-01
## working_pooled_pooled_mean median_house_value
## -1.054e+01 3.070e-06
## unemp_rate frac_middleclass
## 1.605e+01 2.902e+00
## lf_d_2000_1980 cs_elf_ind_man
## 2.770e-01 -1.323e+00
## med_hhinc2016
## 2.048e-06
Doing this gives us a model the computer has determined to be most effective, which is:
Life Expectancy = -10.54(working_pooled_pooled_mean) - 16.05(unemp_rate) + 0.277(lf_d_2000_1980) + 0.000002048(med_hhinc2016) + 0.8164(hours_wk_pooled_pooled_p25) + 0.000003070(median_house_value) + 2.902(frac_middleclass) - 1.323(cs_elf_ind_man)
Looking at the summary of this model, we see:
worktwo = lm(le_raceadj_q1_both ~ working_pooled_pooled_mean +
unemp_rate + lf_d_2000_1980 + med_hhinc2016 + hours_wk_pooled_pooled_p25 +
median_house_value + frac_middleclass + cs_elf_ind_man, data = work)
summary(worktwo)
##
## Call:
## lm(formula = le_raceadj_q1_both ~ working_pooled_pooled_mean +
## unemp_rate + lf_d_2000_1980 + med_hhinc2016 + hours_wk_pooled_pooled_p25 +
## median_house_value + frac_middleclass + cs_elf_ind_man, data = work)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71731 -0.17284 -0.03969 0.17203 1.36397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.123e+00 1.300e+00 -4.709 8.88e-06 ***
## working_pooled_pooled_mean -1.054e+01 1.877e+00 -5.617 2.10e-07 ***
## unemp_rate 1.605e+01 2.860e+00 5.612 2.14e-07 ***
## lf_d_2000_1980 2.770e-01 1.200e-01 2.309 0.0232 *
## med_hhinc2016 2.048e-06 1.164e-06 1.759 0.0820 .
## hours_wk_pooled_pooled_p25 8.164e-01 3.051e-02 26.758 < 2e-16 ***
## median_house_value 3.070e-06 4.648e-07 6.604 2.63e-09 ***
## frac_middleclass 2.902e+00 6.941e-01 4.182 6.65e-05 ***
## cs_elf_ind_man -1.323e+00 7.181e-01 -1.843 0.0686 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3022 on 91 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.9221, Adjusted R-squared: 0.9153
## F-statistic: 134.7 on 8 and 91 DF, p-value: < 2.2e-16
Overall, this model has p-values that are much more statistically significant than the previous model. Despite this, there are still other ways to chose the best predictors for a multiple regression model, such as through a best subsets regression.
Best Subsets Regression Procedure
The best subsets regression procedure is a procedure that tells the computer to search for all possible subsets of predictors so that we don’t have to.
# best subsets regression
allwork = regsubsets(le_raceadj_q1_both ~ ., data = work, nvmax = 27)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 2 linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.exhaustive(a, really.big): XHAUST returned error code -999
summary(allwork)
## Subset selection object
## Call: regsubsets.formula(le_raceadj_q1_both ~ ., data = work, nvmax = 27)
## 26 Variables (and intercept)
## Forced in Forced out
## cz FALSE FALSE
## hhinc_mean2000 FALSE FALSE
## mean_commutetime2000 FALSE FALSE
## med_hhinc2016 FALSE FALSE
## poor_share2010 FALSE FALSE
## emp2000 FALSE FALSE
## jobs_highpay_5mi_2015 FALSE FALSE
## ann_avg_job_growth_2004_2013 FALSE FALSE
## par_rank_pooled_pooled_mean FALSE FALSE
## frac_below_median_pooled_pooled FALSE FALSE
## hours_wk_pooled_pooled_mean FALSE FALSE
## hours_wk_pooled_pooled_p1 FALSE FALSE
## hours_wk_pooled_pooled_p25 FALSE FALSE
## hours_wk_pooled_pooled_p75 FALSE FALSE
## hours_wk_pooled_pooled_p100 FALSE FALSE
## pos_hours_pooled_pooled_mean FALSE FALSE
## working_pooled_pooled_mean FALSE FALSE
## unemp_rate FALSE FALSE
## taxrate FALSE FALSE
## frac_middleclass FALSE FALSE
## cs_elf_ind_man FALSE FALSE
## lf_d_2000_1980 FALSE FALSE
## median_house_value FALSE FALSE
## hhinc00 FALSE FALSE
## emp2000.1 FALSE FALSE
## pos_hours_pooled_pooled_mean.1 FALSE FALSE
## 1 subsets of each size up to 24
## Selection Algorithm: exhaustive
## cz hhinc_mean2000 mean_commutetime2000 med_hhinc2016 poor_share2010
## 1 ( 1 ) "*" " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " "
## 4 ( 1 ) "*" "*" "*" "*" " "
## 5 ( 1 ) "*" "*" "*" "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*"
## 22 ( 1 ) "*" "*" "*" "*" "*"
## 23 ( 1 ) "*" "*" "*" "*" "*"
## 24 ( 1 ) "*" "*" "*" "*" "*"
## emp2000 emp2000.1 jobs_highpay_5mi_2015 ann_avg_job_growth_2004_2013
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) "*" " " " " " "
## 7 ( 1 ) "*" " " "*" " "
## 8 ( 1 ) "*" " " "*" "*"
## 9 ( 1 ) "*" " " "*" "*"
## 10 ( 1 ) "*" " " "*" "*"
## 11 ( 1 ) "*" " " "*" "*"
## 12 ( 1 ) "*" " " "*" "*"
## 13 ( 1 ) "*" " " "*" "*"
## 14 ( 1 ) "*" " " "*" "*"
## 15 ( 1 ) "*" " " "*" "*"
## 16 ( 1 ) "*" " " "*" "*"
## 17 ( 1 ) "*" " " "*" "*"
## 18 ( 1 ) "*" " " "*" "*"
## 19 ( 1 ) "*" " " "*" "*"
## 20 ( 1 ) "*" " " "*" "*"
## 21 ( 1 ) "*" " " "*" "*"
## 22 ( 1 ) "*" " " "*" "*"
## 23 ( 1 ) "*" " " "*" "*"
## 24 ( 1 ) "*" " " "*" "*"
## par_rank_pooled_pooled_mean frac_below_median_pooled_pooled
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) "*" " "
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
## 15 ( 1 ) "*" "*"
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
## 18 ( 1 ) "*" "*"
## 19 ( 1 ) "*" "*"
## 20 ( 1 ) "*" "*"
## 21 ( 1 ) "*" "*"
## 22 ( 1 ) "*" "*"
## 23 ( 1 ) "*" "*"
## 24 ( 1 ) "*" "*"
## hours_wk_pooled_pooled_mean hours_wk_pooled_pooled_p1
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) "*" " "
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
## 15 ( 1 ) "*" "*"
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
## 18 ( 1 ) "*" "*"
## 19 ( 1 ) "*" "*"
## 20 ( 1 ) "*" "*"
## 21 ( 1 ) "*" "*"
## 22 ( 1 ) "*" "*"
## 23 ( 1 ) "*" "*"
## 24 ( 1 ) "*" "*"
## hours_wk_pooled_pooled_p25 hours_wk_pooled_pooled_p75
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) "*" " "
## 14 ( 1 ) "*" "*"
## 15 ( 1 ) "*" "*"
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
## 18 ( 1 ) "*" "*"
## 19 ( 1 ) "*" "*"
## 20 ( 1 ) "*" "*"
## 21 ( 1 ) "*" "*"
## 22 ( 1 ) "*" "*"
## 23 ( 1 ) "*" "*"
## 24 ( 1 ) "*" "*"
## hours_wk_pooled_pooled_p100 pos_hours_pooled_pooled_mean
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) " " " "
## 14 ( 1 ) " " " "
## 15 ( 1 ) "*" " "
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
## 18 ( 1 ) "*" "*"
## 19 ( 1 ) "*" "*"
## 20 ( 1 ) "*" "*"
## 21 ( 1 ) "*" "*"
## 22 ( 1 ) "*" "*"
## 23 ( 1 ) "*" "*"
## 24 ( 1 ) "*" "*"
## pos_hours_pooled_pooled_mean.1 working_pooled_pooled_mean unemp_rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) " " " " " "
## 10 ( 1 ) " " " " " "
## 11 ( 1 ) " " " " " "
## 12 ( 1 ) " " " " " "
## 13 ( 1 ) " " " " " "
## 14 ( 1 ) " " " " " "
## 15 ( 1 ) " " " " " "
## 16 ( 1 ) " " " " " "
## 17 ( 1 ) " " "*" " "
## 18 ( 1 ) " " "*" "*"
## 19 ( 1 ) " " "*" "*"
## 20 ( 1 ) " " "*" "*"
## 21 ( 1 ) " " "*" "*"
## 22 ( 1 ) " " "*" "*"
## 23 ( 1 ) " " "*" "*"
## 24 ( 1 ) " " "*" "*"
## taxrate frac_middleclass cs_elf_ind_man lf_d_2000_1980
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " " " " "
## 8 ( 1 ) " " " " " " " "
## 9 ( 1 ) " " " " " " " "
## 10 ( 1 ) " " " " " " " "
## 11 ( 1 ) " " " " " " " "
## 12 ( 1 ) " " " " " " " "
## 13 ( 1 ) " " " " " " " "
## 14 ( 1 ) " " " " " " " "
## 15 ( 1 ) " " " " " " " "
## 16 ( 1 ) " " " " " " " "
## 17 ( 1 ) " " " " " " " "
## 18 ( 1 ) " " " " " " " "
## 19 ( 1 ) "*" " " " " " "
## 20 ( 1 ) "*" "*" " " " "
## 21 ( 1 ) "*" "*" "*" " "
## 22 ( 1 ) "*" "*" "*" "*"
## 23 ( 1 ) "*" "*" "*" "*"
## 24 ( 1 ) "*" "*" "*" "*"
## median_house_value hhinc00
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) " " " "
## 14 ( 1 ) " " " "
## 15 ( 1 ) " " " "
## 16 ( 1 ) " " " "
## 17 ( 1 ) " " " "
## 18 ( 1 ) " " " "
## 19 ( 1 ) " " " "
## 20 ( 1 ) " " " "
## 21 ( 1 ) " " " "
## 22 ( 1 ) " " " "
## 23 ( 1 ) "*" " "
## 24 ( 1 ) "*" "*"
res.sum = summary(allwork)
data.frame(
adj.r2 = which.max(res.sum$adjr2),
cp = which.min(res.sum$cp),
bic = which.min(res.sum$bic)
)
## adj.r2 cp bic
## 1 12 12 12
Here, the output illustrates the best models of each size from all of the predictor variables to predict life expectancy. This begs the question of what is the appropriate number of predictor variables to use, which is the purpose of the data.frame() function. The R^2, CP, and BIC criteria all tell us to use a 12 predictor model. To find which variables are most appropriate for this size model, we look at the output above, focusing on the 12th row. Each variable with an asterisk in the row 12 is a variable that is suggested to be used in a 12 predictor model. Using this information, we can create a new multiple regression model that we can compare against our prior two models to find the best option.
# create new model with predictor variables
subsetmodel = lm(le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 + med_hhinc2016 +
poor_share2010 + emp2000 + jobs_highpay_5mi_2015 + ann_avg_job_growth_2004_2013 +
par_rank_pooled_pooled_mean + frac_below_median_pooled_pooled +
hours_wk_pooled_pooled_mean, data = work)
summary(subsetmodel)
##
## Call:
## lm(formula = le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 +
## med_hhinc2016 + poor_share2010 + emp2000 + jobs_highpay_5mi_2015 +
## ann_avg_job_growth_2004_2013 + par_rank_pooled_pooled_mean +
## frac_below_median_pooled_pooled + hours_wk_pooled_pooled_mean,
## data = work)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91480 -0.63628 -0.02223 0.56264 2.72006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.158e+00 1.050e+01 -0.301 0.76434
## hhinc_mean2000 -1.500e-05 7.449e-06 -2.014 0.04698 *
## mean_commutetime2000 -2.055e-02 1.516e-02 -1.355 0.17884
## med_hhinc2016 3.098e-05 1.069e-05 2.899 0.00472 **
## poor_share2010 -3.624e-01 9.055e-01 -0.400 0.68997
## emp2000 -2.685e+00 1.008e+00 -2.663 0.00918 **
## jobs_highpay_5mi_2015 -1.500e-06 1.725e-06 -0.870 0.38663
## ann_avg_job_growth_2004_2013 2.230e+00 2.200e+00 1.014 0.31348
## par_rank_pooled_pooled_mean 9.790e-01 1.279e+01 0.077 0.93915
## frac_below_median_pooled_pooled 4.007e+00 7.969e+00 0.503 0.61636
## hours_wk_pooled_pooled_mean 4.305e-01 8.413e-02 5.117 1.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9097 on 89 degrees of freedom
## (50611 observations deleted due to missingness)
## Multiple R-squared: 0.3098, Adjusted R-squared: 0.2322
## F-statistic: 3.994 on 10 and 89 DF, p-value: 0.0001571
Looking at the new model, we can see that the Pr(>|t|) values are less statistically significant than those of that in our previous models. To find the best option, we can use Mallows’ Cp, which is a factor that is used to compare regression models to pick the best model.
# first, create linear regression model with all predictor variables
fullmodelone = lm(le_raceadj_q1_both ~ hhinc_mean2000 + mean_commutetime2000 + med_hhinc2016 +
poor_share2010 + emp2000 + jobs_highpay_5mi_2015 +
ann_avg_job_growth_2004_2013 + par_rank_pooled_pooled_mean +
hours_wk_pooled_pooled_p25 + pos_hours_pooled_pooled_mean +
working_pooled_pooled_mean + unemp_rate +
taxrate + frac_middleclass + cs_elf_ind_man +
lf_d_2000_1980 + median_house_value +
hhinc00, data = work)
# compute Mallow's Cp for each statistical model
ols_mallows_cp(workmr, fullmodelone)
## [1] 200.0821
ols_mallows_cp(worktwo, fullmodelone)
## [1] 55.62811
ols_mallows_cp(subsetmodel, fullmodelone)
## [1] 1141.712
Looking at the results, we see that “worktwo” or the linear model made using stepwise regression, has the Cp closest to the number of predictor variables, therefore establishing it as the best of the three regression models.
plot(worktwo)
# histogram of residuals
restwo = resid(worktwo)
hist(restwo)
Looking at the residuals vs. fitted plot, we can see that there is equal spread both above and below the zero line and that the points are evenly spread throughout the intervals, suggesting that this model satisfies the condition for linearity
Looking at the histogram of residuals, we observe a normal distribution centered around zero which meets the zero mean condition.
In the residual plot, the data is spread on both sides of the least squares line, but there is an outlier, point 96.
Opportunity Insights used the commuting zone system to organize and separate the data that was collected by the Census, which collects data from every citizen within a geographic area, meaning that the data was collected responsibly. Comparing the data by quartile, the differences in correlational coefficients from quartile to quartile tells us that we should be wary of unifying all of the models into one.
The qq-plot shows that there is a bit of right skew with the presence of the outlier, point 96.