We obtain life expectancy data from 2010 to 2015 from the CDC. We then obtain median household income for 2019 from the USDA. Note the difference in time between the two metrics.
raw_life_expectancy <- read_csv("https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NVSS/USALEEP/CSV/US_A.CSV")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## `Tract ID` = col_character(),
## STATE2KX = col_character(),
## CNTY2KX = col_character(),
## TRACT2KX = col_character(),
## `e(0)` = col_double(),
## `se(e(0))` = col_double(),
## `Abridged life table flag` = col_double()
## )
raw_usda <- read_csv("https://www.ers.usda.gov/webdocs/DataFiles/48747/Unemployment.csv?v=6497.2")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## fips_txt = col_double(),
## Stabr = col_character(),
## area_name = col_character(),
## Attribute = col_character(),
## Value = col_double()
## )
We then retain the median household income rows from raw_usda and use FIPS numbers to match them to rows in raw_life_expectancy.
income <- raw_usda %>%
filter(Attribute == "Median_Household_Income_2019") %>%
mutate(fips = fips_txt, income = Value)
life_expectancy <- raw_life_expectancy %>%
mutate(fips = parse_integer(paste(STATE2KX, CNTY2KX, sep = ""))) %>%
group_by(fips) %>% summarize(years = median(`e(0)`))
income_vs_life_exp <- inner_join(income, life_expectancy)
## Joining, by = "fips"
head(income_vs_life_exp)
## # A tibble: 6 x 8
## fips_txt Stabr area_name Attribute Value fips income years
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1001 AL Autauga County,… Median_Household_Inc… 58233 1001 58233 74.7
## 2 1003 AL Baldwin County,… Median_Household_Inc… 59871 1003 59871 78.2
## 3 1005 AL Barbour County,… Median_Household_Inc… 35972 1005 35972 73.5
## 4 1007 AL Bibb County, AL Median_Household_Inc… 47918 1007 47918 74.1
## 5 1009 AL Blount County, … Median_Household_Inc… 52902 1009 52902 76.9
## 6 1011 AL Bullock County,… Median_Household_Inc… 31906 1011 31906 78
ggplot(income_vs_life_exp, aes(x=income, y=years)) +
geom_point(size = 0.5) +
geom_smooth(method = "loess", formula = y ~ x)
ggplot(filter(income_vs_life_exp, Stabr %in% c("ID", "MI", "MS", "MA")), aes(x=income, y=years)) +
geom_point(aes(color = Stabr), size = 2)
Or how about we actually address the question on VHS, even though it involves generalizing the characteristics of each state? Note that this method takes the median of the median household incomes for all counties and the median of the median life expectancy for all counties in each state.
state_income_vs_life_exp <- income_vs_life_exp %>%
group_by(Stabr) %>%
summarize(income = median(income), years = median(years))
ggplot(state_income_vs_life_exp, aes(x=income, y=years)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
That’s more like it! How does the linear regression look?
model <- lm(years ~ income, state_income_vs_life_exp)
summary(model)
##
## Call:
## lm(formula = years ~ income, data = state_income_vs_life_exp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7744 -0.9008 0.1241 0.8484 2.1582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.228e+01 1.097e+00 65.905 < 2e-16 ***
## income 1.028e-04 1.858e-05 5.532 1.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.439 on 47 degrees of freedom
## Multiple R-squared: 0.3944, Adjusted R-squared: 0.3815
## F-statistic: 30.61 on 1 and 47 DF, p-value: 1.364e-06
This model features \(R^2 \approx 0.3815\). What do you think of it?