Abstract

Which socioeconomic factors in the realms of labor and health, such as income and education level, most affect the life expectancy of Americans?

To answer this question, I analyzed data from Opportunity Insights (OI) and the Opportunity Atlas. I am interested in which types of communities, and which characteristics within them, are associated with differences in life expectancy. Most of the data is organized by commuting zone, a geographic unit commonly used in analyses of local economies and populations. I used two datasets, “National by-year life expectancy estimates for men and women, by income percentile” and “Neighborhood Characteristics by Census Tract”. By combining them, I was able to examine life expectancy alongside the commuting zone characteristics I was interested in.

Introduction

While humans are a product of genetics, they are also products of their environments. This prompts the question of which factors in work and health affect the ultimate outcome for a person: life expectancy. By exploring variables within these two areas, we can make observations about their association with human lifespan.

Data

All data used in this project comes from Opportunity Insights. Within the regression models, each data point is a commuting zone. Because data was collected only from commuting zones with populations over 25,000, smaller communities in North Dakota and Alaska are not included in this project.

This project uses data from the first income quartile, meaning people with incomes between the 1st and 25th percentiles. Within this quartile, life expectancy showed the strongest correlations with the other factors.

Simple Regression Exploration of Health Factors

Smoking

Before building a multiple regression model, it is important to understand the relationship between each predictor and the response variable, and simple linear regression is one way to do this.

Figure 2.1 shows the scatterplot and fitted regression line of life expectancy against the fraction of current smokers in the first income quartile. The output in the appendix shows that the regression equation is

  (LE) = 84.554 - 11.879(fraction of smokers)

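This regression equation tells us that as the fraction of smokers increases, predicted life expectancy decreases. Although the coefficient looks large, it is not the number of years lost when a person begins to smoke: the predictor is a fraction between 0 and 1, so the coefficient describes the difference between a commuting zone with no smokers and one where everyone smokes. A one-percentage-point increase in the fraction of smokers (0.01) corresponds to a decrease of about 0.12 years in predicted life expectancy. The p-value in the output (0.0079) is below 0.05, so the negative association between the fraction of smokers and life expectancy is statistically significant, although the R^2 of 0.0698 shows that smoking alone explains only about 7% of the variation.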
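To put the slope on a more intuitive scale, the fitted equation above can be evaluated at a few smoking fractions. The sketch below hard-codes the two coefficients from the Figure 2.1 output; the helper name `predict_le` and the example fractions 0.20 and 0.30 are illustrative assumptions, not values taken from the dataset.

```r
# Coefficients copied from the simple regression output (Figure 2.1)
intercept <- 84.554
slope <- -11.879

# Hypothetical helper: predicted life expectancy for a given smoking fraction
predict_le <- function(frac_smokers) {
  intercept + slope * frac_smokers
}

# Change in predicted life expectancy per one-percentage-point (0.01) increase
per_point <- slope * 0.01

# Difference between two illustrative commuting zones (30% vs. 20% smokers)
gap <- predict_le(0.30) - predict_le(0.20)

print(per_point)
print(gap)
```

Under these assumptions, a ten-point gap in the smoking fraction corresponds to a gap of a little over one year in predicted life expectancy, which is easier to interpret than the raw coefficient.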

Multiple Regression Exploration

Multiple Regression Health Exploration

With data showing that greater expenditures on healthcare are associated with longer life expectancies, it is clear that health plays a vital role in a long life. After looking at the correlation coefficients (r) between the candidate variables and life expectancy, I decided to create a multiple regression model that uses the fraction of current smokers in the first quartile, the fraction of the population that is obese in the first quartile, and the percentage of uninsured people in 2010 to predict life expectancy.

I chose these variables not only for their statistical significance, but also for their ability to represent two different facets of healthcare. They represent personal choices, such as the decision to smoke, as well as circumstances less within a person's control, such as the ability to obtain insurance for oneself or one's family. Exploring both types of variables allows a broader connection between life expectancy and healthcare to be drawn.

Correlation Matrix

To begin, I created a correlation matrix in R to see which predictor variables are most strongly associated with life expectancy (see Figure 3.1). The variables with the strongest associations with life expectancy are:

-0.492 Fraction of current smokers in the first quartile
-0.331 Fraction of obese people in the first quartile
0.410 Percent uninsured in 2010

These r-values tell us that, within the dataset, life expectancy (our response variable) has its strongest linear relationships with these three variables. The -0.492 value indicates a negative correlation between smoking and life expectancy, and the -0.331 value indicates a negative correlation between obesity and life expectancy. The 0.410 value indicates a positive correlation between life expectancy and the percentage of uninsured residents within a commuting zone, which is a surprising finding.

Looking at the matrix as a whole, some of the strongest correlations between predictor variables are:

0.869 Mean of Z-Scores for Dartmouth Atlas Ambulatory Care Measures and Percent of Females Aged 67-69 who have had Mammograms
0.728 Percent of Females Aged 67-69 who have had Mammograms and Percent Diabetic with Annual Eye Test
-0.578 Fraction of people who have exercised within the past 30 days and fraction of obese people in the first quartile
-0.669 Medicare $ Per Enrollee and fraction of people who have exercised in the past 30 days
0.857 30-day Mortality for Heart Attacks and 30-day Hospital Mortality Rate Index

These r-values describe relationships among predictor variables rather than with the response variable. For example, the 0.857 value indicates a strong positive relationship between the 30-day mortality rate for heart attacks and the overall 30-day hospital mortality rate index, meaning that as the mortality rate for heart attacks increases, the overall mortality index tends to increase as well. The -0.669 r-value between Medicare expenditure per enrollee and the fraction of people in the first income quartile who have exercised within the past 30 days tells us that commuting zones with higher Medicare expenditures tend to have a lower fraction of people who exercise.
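Scanning a 15-by-15 matrix by eye makes it easy to miss a strong pair, so one option is to rank the pairs programmatically. The sketch below is a minimal illustration using the built-in `mtcars` dataset as a stand-in for the health data; the column choice and the name `pair_tbl` are assumptions made for the example.

```r
# mtcars is a built-in R dataset used here only as a stand-in for the health data
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")], use = "complete.obs")

# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA

# Reshape to a long table of (var1, var2, r) and drop the blanked entries
pair_tbl <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pair_tbl) <- c("var1", "var2", "r")
pair_tbl <- pair_tbl[!is.na(pair_tbl$r), ]

# Sort by absolute correlation, strongest first
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
head(pair_tbl, 5)
```

Applied to the matrix in Figure 3.1, the same steps would list every pair in order of strength, which also helps flag predictors that are too strongly correlated to include in the same model.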

Interpretations of Multiple Regression Model

Figure 3.2 contains the output of the multiple regression model. The p-values for all three predictor variables are below the alpha level of 0.05, meaning that each is statistically significant given the other two in the model. In other words, there is a significant association between each of the three predictors and life expectancy.

What kind of relationship does this model show between the variables? The coefficient of the fraction of smokers is -11.45, showing that as the fraction of smokers increases, predicted life expectancy decreases. Holding the other predictors constant, each one-unit increase in the fraction of smokers corresponds to a decrease of 11.45 years in predicted life expectancy. While this may seem like a large decrease, keep in mind that the predictor is a fraction, so a one-unit increase spans the full range from no smokers to all smokers; a one-percentage-point increase corresponds to a decrease of about 0.11 years. To visualize this relationship in terms of the two variables alone, see Figure 3.3. The coefficient of the fraction of the population that is obese is -12.58, showing that within the model, as the fraction of obese people increases, life expectancy decreases (see Figure 3.4). The coefficient of the percentage of uninsured people in 2010 is 0.089: as the percentage of uninsured people in a commuting zone increases by one percentage point, predicted life expectancy increases by about 0.089 years. This is a pattern across commuting zones, not evidence that uninsured individuals live longer. The coefficient appears small next to the other two only because this predictor is measured as a percentage, while smoking and obesity are measured as fractions; all three coefficients are in years of life expectancy per unit of the predictor. To see a visualization of this relationship in simple linear terms, see Figure 3.5.
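As a worked example of reading these coefficients, the sketch below plugs the Figure 3.2 estimates into the model equation. It assumes that `puninsured2010` is recorded on a 0 to 100 scale, as the variable description suggests; the helper `predict_le` and the example commuting zone (25% smokers, 30% obese, 20% uninsured) are made up for illustration.

```r
# Coefficients copied from the multiple regression output (Figure 3.2)
b <- c(intercept = 83.79973, smoke = -11.45443,
       obese = -12.58156, uninsured = 0.08869)

# Hypothetical helper: smoke and obese are fractions (0-1);
# uninsured is assumed to be a percentage (0-100)
predict_le <- function(smoke, obese, uninsured) {
  unname(b["intercept"] + b["smoke"] * smoke +
           b["obese"] * obese + b["uninsured"] * uninsured)
}

# Effect of a one-percentage-point change in each predictor, others held fixed
effect_smoke <- unname(b["smoke"]) * 0.01        # fraction rises by 0.01
effect_obese <- unname(b["obese"]) * 0.01        # fraction rises by 0.01
effect_uninsured <- unname(b["uninsured"]) * 1   # percentage rises by 1

# Predicted life expectancy for the made-up example zone
example_le <- predict_le(0.25, 0.30, 20)

print(c(smoke = effect_smoke, obese = effect_obese, uninsured = effect_uninsured))
print(example_le)
```

Put on the same one-point scale, the three effects are of similar magnitude (roughly a tenth of a year each), which the raw coefficients obscure.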

Looking at the model as a whole, we see that the residual standard error is 1.372 on 96 degrees of freedom. This statistic means that the model's predictions of average life expectancy in the first income quartile are typically off by about 1.372 years. Its main use is in comparing the accuracy of different regression models for the same response. For example, if another variable were added to the model, such as Medicare costs per enrollee, the residual standard error would either increase or decrease. The lower the residual standard error, the more closely the model fits the data.
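For readers who want to see where this statistic comes from, the sketch below computes a residual standard error by hand and compares it with the value R reports. It uses a stand-in model fitted to the built-in `mtcars` data, not the health model from this project.

```r
# Stand-in model on the built-in mtcars data; not the health model from this project
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual standard error: sqrt(residual sum of squares / residual degrees of freedom)
rss <- sum(resid(fit)^2)
rse_manual <- sqrt(rss / df.residual(fit))

# The same quantity as reported by summary()
rse_reported <- summary(fit)$sigma

print(c(manual = rse_manual, reported = rse_reported))
```

The two values agree, which shows that the statistic is the square root of the average squared residual, with the average taken over the residual degrees of freedom rather than the number of observations.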

The R^2 value measures how much of the variation in life expectancy is explained by the predictor variables. Here, that value is 0.354, meaning that the three predictors together explain 35.4% of the variation in life expectancy across commuting zones (the adjusted R^2, which accounts for the number of predictors, is 0.334).
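The relationship between R^2 and adjusted R^2 can be checked directly from the figures in the Figure 3.2 output. The sketch below uses only the reported multiple R-squared, the residual degrees of freedom, and the number of predictors; the variable names are my own.

```r
# Values read from the regression output (Figure 3.2)
r2 <- 0.3542     # Multiple R-squared
df_resid <- 96   # residual degrees of freedom
p <- 3           # number of predictors

# Sample size implied by the residual degrees of freedom: n = df + p + 1
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(n)
print(adj_r2)
```

The result is approximately 0.334, in line with the adjusted R-squared of 0.3341 in the output (the small difference comes from starting with a rounded R^2), and the implied sample size is 100 commuting zones.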

Residuals

Residual plots help assess whether the model's assumptions hold and whether its predictions are biased. The plots in Figures 3.6 and 3.7 suggest that the data meet the conditions for regression, and they also allow some further observations about the data.

In Figure 3.7, the Residuals vs Fitted plot shows that as the fitted life expectancy increases, the residuals land closer to the zero line. Residuals show how far actual data points are from the model's predictions, so this pattern suggests that the model is more accurate for commuting zones with higher predicted life expectancy. In other words, the model may be more accurate when predicted life expectancy is in the range of 79.5 to 81 years, and less accurate below that range. A narrowing spread of this kind also means that the constant-variance condition is only approximately met.

The next plot in Figure 3.7 is a normal quantile-quantile (Q-Q) plot, which plots the quantiles of the residuals against the quantiles of a normal distribution to assess whether the residuals are approximately normal. Compared with many multiple regression models, this dataset is relatively small, with 100 commuting zones used in the fit. Because of the smaller sample size, it is important to check the residuals for normality so that the p-values in the model can be trusted. The normal Q-Q plot in Figure 3.7 shows that the quantiles broadly follow a normal distribution, with some skewness to the right due to two outliers, points 1 and 24. Overall, the points lie close to the reference line, which indicates that the residuals are approximately normal and that the data are acceptable to use in a multiple regression model. To further assess normality, we can refer to Figure 3.6, a histogram of the residuals. The histogram is centered around 0, which is consistent with approximately normal residuals.
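Visual checks can be supplemented with a formal test. The sketch below is one possible approach, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and flags large standardized residuals, using a stand-in model on the built-in `mtcars` data. The cutoff of 2 is a common rule of thumb and is an assumption here.

```r
# Stand-in model on the built-in mtcars data; not the health model from this project
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)
print(sw$p.value)

# Standardized residuals beyond about +/- 2 are worth a closer look as potential outliers
flagged <- which(abs(rstandard(fit)) > 2)
print(flagged)
```

Running the same two steps on `hlthmr` would give a numerical counterpart to the Q-Q plot and would show whether points 1 and 24 stand out by this rule as well.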

Appendix

## load libraries and data
library(ggplot2)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(corrplot)

## corrplot 0.92 loaded

library(ggpubr)
library(leaps)
library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

red = read.csv("~/desktop/finalfinal2.csv")

Figure 2.1

# Quartile 1
q1 = ggplot(red, aes(x = cur_smoke_q1, y = le_raceadj_q1_F)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q1") +
  ylab("Q1 Life Expectancy") + 
  ggtitle("Q1 Life Expectancy and Smoking")
q1

## Warning: Removed 50611 rows containing non-finite values (stat_cor).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

smoke1 = lm(le_raceadj_q1_F ~ cur_smoke_q1, data = red)
summary(smoke1)

## 
## Call:
## lm(formula = le_raceadj_q1_F ~ cur_smoke_q1, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8578 -1.0701 -0.1626  1.2956  4.5646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    84.554      1.195  70.782   <2e-16 ***
## cur_smoke_q1  -11.879      4.381  -2.712   0.0079 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.908 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.0698, Adjusted R-squared:  0.06031 
## F-statistic: 7.354 on 1 and 98 DF,  p-value: 0.007904

Figure 2.2

# Quartile 2
q2 = ggplot(red, aes(x = cur_smoke_q2, y = le_raceadj_q2_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q2") +
  ylab("Q2 Life Expectancy") + 
  ggtitle("Q2 Life Expectancy and Smoking")
q2

## Warning: Removed 50611 rows containing non-finite values (stat_cor).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

smoke2 = lm(le_raceadj_q2_M ~ cur_smoke_q2, data = red)
summary(smoke2)

## 
## Call:
## lm(formula = le_raceadj_q2_M ~ cur_smoke_q2, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0808 -1.2399 -0.2606  1.3909  5.8880 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    80.141      1.535  52.200   <2e-16 ***
## cur_smoke_q2   -3.180      6.842  -0.465    0.643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.18 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.002199,   Adjusted R-squared:  -0.007982 
## F-statistic: 0.216 on 1 and 98 DF,  p-value: 0.6431

Figure 2.3

# Quartile 3
q3 = ggplot(red, aes(x = cur_smoke_q3, y = le_raceadj_q3_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q3") +
  ylab("Q3 Life Expectancy") + 
  ggtitle("Q3 Life Expectancy and Smoking")
q3

## Warning: Removed 50611 rows containing non-finite values (stat_cor).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

smoke3 = lm(le_raceadj_q3_M ~ cur_smoke_q3, data = red)
summary(smoke3)

## 
## Call:
## lm(formula = le_raceadj_q3_M ~ cur_smoke_q3, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6731 -1.2965  0.1006  1.5783  5.0540 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    84.148      1.349  62.370   <2e-16 ***
## cur_smoke_q3  -11.073      7.478  -1.481    0.142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.036 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.02189,    Adjusted R-squared:  0.01191 
## F-statistic: 2.193 on 1 and 98 DF,  p-value: 0.1419

Figure 2.4

# Quartile 4
q4 = ggplot(red, aes(x = cur_smoke_q4, y = le_raceadj_q4_M)) + 
  geom_point() + 
  stat_cor(method = "pearson") +
  stat_smooth(method = 'lm') +
  xlab("Current Smokers in Q4") +
  ylab("Q4 Life Expectancy") + 
  ggtitle("Q4 Life Expectancy and Smoking")
q4

## Warning: Removed 50611 rows containing non-finite values (stat_cor).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

smoke4 = lm(le_raceadj_q4_M ~ cur_smoke_q4, data = red)
summary(smoke4)

## 
## Call:
## lm(formula = le_raceadj_q4_M ~ cur_smoke_q4, data = red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0203 -1.4971 -0.0728  1.3878  5.5532 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    87.037      1.240  70.167   <2e-16 ***
## cur_smoke_q4  -20.459      9.731  -2.102   0.0381 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.277 on 98 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.04316,    Adjusted R-squared:  0.03339 
## F-statistic:  4.42 on 1 and 98 DF,  p-value: 0.03808

Figure 3.1

# load data
hlth = read.csv("~/desktop/healthdata.csv")

# select columns by position (use names(hlth) to look up the column numbers)
cordata = cor(hlth[,c(32, 7, 8, 9, 13, 17, 21, 22, 23, 24, 25, 26, 27, 28, 29 )], use = "complete.obs")

# print correlational matrix
cordata

##                         le_raceadj_q1_both puninsured2010 mammogram_10
## le_raceadj_q1_both              1.00000000    0.409552583  -0.01313609
## puninsured2010                  0.40955258    1.000000000  -0.38511649
## mammogram_10                   -0.01313609   -0.385116491   1.00000000
## cur_smoke_q1                   -0.49178572   -0.412569923   0.10884627
## bmi_obese_q1                   -0.33064510    0.038422239  -0.13545923
## exercise_any_q1                 0.16905248   -0.104335824   0.10140325
## mort_30day_hosp_z               0.16244341    0.304858954  -0.16071040
## diab_eyeexam_10                -0.06469811   -0.351894277   0.72788659
## diab_lipids_10                  0.13626869    0.002075106   0.47383752
## primcarevis_10                 -0.27049860   -0.147541927   0.47615494
## reimb_penroll_adj10             0.07963038    0.500973373  -0.31191633
## adjmortmeas_amiall30day         0.14737474    0.362438451  -0.27202722
## adjmortmeas_chfall30day         0.11354249    0.151266932  -0.01102297
## adjmortmeas_pnall30day          0.15380309    0.274266108  -0.14079534
## med_prev_qual_z                -0.03769047   -0.337081052   0.86898970
##                         cur_smoke_q1 bmi_obese_q1 exercise_any_q1
## le_raceadj_q1_both      -0.491785716 -0.330645098      0.16905248
## puninsured2010          -0.412569923  0.038422239     -0.10433582
## mammogram_10             0.108846268 -0.135459231      0.10140325
## cur_smoke_q1             1.000000000  0.273436489     -0.36593813
## bmi_obese_q1             0.273436489  1.000000000     -0.57834977
## exercise_any_q1         -0.365938127 -0.578349771      1.00000000
## mort_30day_hosp_z       -0.051748942 -0.006387185      0.14486944
## diab_eyeexam_10          0.027868374 -0.166524409      0.10303742
## diab_lipids_10          -0.165990413 -0.091820075     -0.09728895
## primcarevis_10           0.366366833  0.345893828     -0.24021412
## reimb_penroll_adj10      0.048788895  0.293027828     -0.66955145
## adjmortmeas_amiall30day -0.075027995  0.106446468      0.02309639
## adjmortmeas_chfall30day -0.070661414 -0.219951072      0.41879551
## adjmortmeas_pnall30day   0.006981048  0.101377944     -0.06919963
## med_prev_qual_z          0.014671918 -0.160097018      0.19376191
##                         mort_30day_hosp_z diab_eyeexam_10 diab_lipids_10
## le_raceadj_q1_both            0.162443415     -0.06469811    0.136268688
## puninsured2010                0.304858954     -0.35189428    0.002075106
## mammogram_10                 -0.160710397      0.72788659    0.473837521
## cur_smoke_q1                 -0.051748942      0.02786837   -0.165990413
## bmi_obese_q1                 -0.006387185     -0.16652441   -0.091820075
## exercise_any_q1               0.144869441      0.10303742   -0.097288955
## mort_30day_hosp_z             1.000000000     -0.22717437   -0.248622569
## diab_eyeexam_10              -0.227174368      1.00000000    0.498365447
## diab_lipids_10               -0.248622569      0.49836545    1.000000000
## primcarevis_10                0.060866168      0.31002021    0.181463162
## reimb_penroll_adj10          -0.260424088     -0.25593691    0.124653227
## adjmortmeas_amiall30day       0.857258490     -0.30638700   -0.220575922
## adjmortmeas_chfall30day       0.848040307     -0.01855114   -0.220977623
## adjmortmeas_pnall30day        0.849187830     -0.26205789   -0.195358678
## med_prev_qual_z              -0.136234229      0.83490965    0.671550225
##                         primcarevis_10 reimb_penroll_adj10
## le_raceadj_q1_both       -0.2704985950          0.07963038
## puninsured2010           -0.1475419273          0.50097337
## mammogram_10              0.4761549447         -0.31191633
## cur_smoke_q1              0.3663668330          0.04878889
## bmi_obese_q1              0.3458938282          0.29302783
## exercise_any_q1          -0.2402141235         -0.66955145
## mort_30day_hosp_z         0.0608661678         -0.26042409
## diab_eyeexam_10           0.3100202117         -0.25593691
## diab_lipids_10            0.1814631619          0.12465323
## primcarevis_10            1.0000000000         -0.03321519
## reimb_penroll_adj10      -0.0332151898          1.00000000
## adjmortmeas_amiall30day  -0.0002811419         -0.03512968
## adjmortmeas_chfall30day   0.0683039573         -0.49432222
## adjmortmeas_pnall30day    0.0802329948         -0.12328907
## med_prev_qual_z           0.5351918376         -0.36554703
##                         adjmortmeas_amiall30day adjmortmeas_chfall30day
## le_raceadj_q1_both                 0.1473747442              0.11354249
## puninsured2010                     0.3624384508              0.15126693
## mammogram_10                      -0.2720272160             -0.01102297
## cur_smoke_q1                      -0.0750279951             -0.07066141
## bmi_obese_q1                       0.1064464677             -0.21995107
## exercise_any_q1                    0.0230963852              0.41879551
## mort_30day_hosp_z                  0.8572584903              0.84804031
## diab_eyeexam_10                   -0.3063870046             -0.01855114
## diab_lipids_10                    -0.2205759219             -0.22097762
## primcarevis_10                    -0.0002811419              0.06830396
## reimb_penroll_adj10               -0.0351296789             -0.49432222
## adjmortmeas_amiall30day            1.0000000000              0.62825004
## adjmortmeas_chfall30day            0.6282500375              1.00000000
## adjmortmeas_pnall30day             0.5972172770              0.54010587
## med_prev_qual_z                   -0.2584231901              0.07282819
##                         adjmortmeas_pnall30day med_prev_qual_z
## le_raceadj_q1_both                 0.153803088     -0.03769047
## puninsured2010                     0.274266108     -0.33708105
## mammogram_10                      -0.140795338      0.86898970
## cur_smoke_q1                       0.006981048      0.01467192
## bmi_obese_q1                       0.101377944     -0.16009702
## exercise_any_q1                   -0.069199628      0.19376191
## mort_30day_hosp_z                  0.849187830     -0.13623423
## diab_eyeexam_10                   -0.262057887      0.83490965
## diab_lipids_10                    -0.195358678      0.67155022
## primcarevis_10                     0.080232995      0.53519184
## reimb_penroll_adj10               -0.123289069     -0.36554703
## adjmortmeas_amiall30day            0.597217277     -0.25842319
## adjmortmeas_chfall30day            0.540105872      0.07282819
## adjmortmeas_pnall30day             1.000000000     -0.17293120
## med_prev_qual_z                   -0.172931205      1.00000000

Figure 3.2

# create multiple regression model
hlthmr = lm(le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010, data = hlth)

# view summary of model
summary(hlthmr)

## 
## Call:
## lm(formula = le_raceadj_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + 
##     puninsured2010, data = hlth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5447 -0.9496 -0.1078  0.7340  4.8681 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     83.79973    1.44151  58.133  < 2e-16 ***
## cur_smoke_q1   -11.45443    3.64734  -3.140  0.00224 ** 
## bmi_obese_q1   -12.58156    4.18100  -3.009  0.00335 ** 
## puninsured2010   0.08869    0.02734   3.244  0.00162 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.372 on 96 degrees of freedom
##   (50611 observations deleted due to missingness)
## Multiple R-squared:  0.3542, Adjusted R-squared:  0.3341 
## F-statistic: 17.55 on 3 and 96 DF,  p-value: 3.652e-09

Figure 3.3

plot(ggplot(hlth, aes(x = cur_smoke_q1, y = le_raceadj_q1_both)) +
  geom_point() +
  geom_smooth(method = lm))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

Figure 3.4

ggplot(hlth, aes(x = bmi_obese_q1, y = le_raceadj_q1_both)) +
  geom_point() +
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

Figure 3.5

ggplot(hlth, aes(x = puninsured2010, y = le_raceadj_q1_both)) +
  geom_point() +
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 50611 rows containing non-finite values (stat_smooth).

## Warning: Removed 50611 rows containing missing values (geom_point).

Figure 3.6

## histogram of residuals
hlthmrresid = resid(hlthmr)
hist(hlthmrresid, main = "Histogram of hlth Residuals")

Figure 3.7

# plot residuals
plot(hlthmr)
