Load and Process Data

We obtain life expectancy data from 2010 to 2015 from the CDC. We then obtain median household income for 2019 from the USDA. Note the difference in time between the two metrics.

raw_life_expectancy <- read_csv("https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NVSS/USALEEP/CSV/US_A.CSV")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `Tract ID` = col_character(),
##   STATE2KX = col_character(),
##   CNTY2KX = col_character(),
##   TRACT2KX = col_character(),
##   `e(0)` = col_double(),
##   `se(e(0))` = col_double(),
##   `Abridged life table flag` = col_double()
## )
raw_usda <- read_csv("https://www.ers.usda.gov/webdocs/DataFiles/48747/Unemployment.csv?v=6497.2")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   fips_txt = col_double(),
##   Stabr = col_character(),
##   area_name = col_character(),
##   Attribute = col_character(),
##   Value = col_double()
## )

We then retain the median household income rows from raw_usda and use FIPS numbers to match them to rows in raw_life_expectancy.

income <- raw_usda %>%
  filter(Attribute == "Median_Household_Income_2019") %>%
  mutate(fips = fips_txt, income = Value)

life_expectancy <- raw_life_expectancy %>%
  mutate(fips = parse_integer(paste(STATE2KX, CNTY2KX, sep = ""))) %>%
  group_by(fips) %>% summarize(years = median(`e(0)`))

income_vs_life_exp <- inner_join(income, life_expectancy)
## Joining, by = "fips"
head(income_vs_life_exp)
## # A tibble: 6 x 8
##   fips_txt Stabr area_name        Attribute             Value  fips income years
##      <dbl> <chr> <chr>            <chr>                 <dbl> <dbl>  <dbl> <dbl>
## 1     1001 AL    Autauga County,… Median_Household_Inc… 58233  1001  58233  74.7
## 2     1003 AL    Baldwin County,… Median_Household_Inc… 59871  1003  59871  78.2
## 3     1005 AL    Barbour County,… Median_Household_Inc… 35972  1005  35972  73.5
## 4     1007 AL    Bibb County, AL  Median_Household_Inc… 47918  1007  47918  74.1
## 5     1009 AL    Blount County, … Median_Household_Inc… 52902  1009  52902  76.9
## 6     1011 AL    Bullock County,… Median_Household_Inc… 31906  1011  31906  78

Plotting Income vs Life Expectancy

ggplot(income_vs_life_exp, aes(x=income, y=years)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x)

States

ggplot(filter(income_vs_life_exp, Stabr %in% c("ID", "MI", "MS", "MA")), aes(x=income, y=years)) +
  geom_point(aes(color = Stabr), size = 2)

Or how about we actually address the question on VHS, even though it involves generalizing the characteristics of each state? Note that this method takes the median of the median household incomes for all counties and the median of the median life expectancy for all counties in each state.

state_income_vs_life_exp <- income_vs_life_exp %>%
  group_by(Stabr) %>%
  summarize(income = median(income), years = median(years))
ggplot(state_income_vs_life_exp, aes(x=income, y=years)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

That’s more like it! How does the linear regression look?

model <- lm(years ~ income, state_income_vs_life_exp)
summary(model)
## 
## Call:
## lm(formula = years ~ income, data = state_income_vs_life_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7744 -0.9008  0.1241  0.8484  2.1582 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.228e+01  1.097e+00  65.905  < 2e-16 ***
## income      1.028e-04  1.858e-05   5.532 1.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.439 on 47 degrees of freedom
## Multiple R-squared:  0.3944, Adjusted R-squared:  0.3815 
## F-statistic: 30.61 on 1 and 47 DF,  p-value: 1.364e-06

This model features \(R^2 \approx 0.3815\). What do you think of it?