US States Health and Income

The Problem: I want to explore the relationship between health and income across areas of the United States. My hypothesis is that states with higher median income will also have better health. I will use web scraping to collect a list of the top indicators of health. According to Wikipedia, health indicators “are quantifiable characteristics of a population which researchers use as supporting evidence for describing the health of a population”. I will then find state-level data for some of these indicators, compare it with median income data for each state, and check whether the two are correlated.

Where will the data come from? Health indicator data will be retrieved from the World Health Organization. This information is gathered using survey methodology. http://www.who.int/healthinfo/indicators/2015/metadata/en/

Based on the health indicator list, I can download state-level data for the US from the Centers for Disease Control and Prevention. https://www.cdc.gov/nchs/pressroom/sosmap/heart_disease_mortality/heart_disease.htm

The United States Census Bureau website will provide the data for median income for each state. https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk

The health indicator list was taken from the World Health Organization page linked above. I used the rvest package to scrape the list of health indicators from that page.

library(xml2)
library(rvest)
library(stringr)
health <- read_html("http://www.who.int/healthinfo/indicators/2015/metadata/en/")
health.indicator <- health %>% html_nodes("#primary li") 
health.indicator1 <- html_text(health.indicator)

# Remove the newlines and tab runs left over from the page layout
health.indicator1 <- gsub("\n", "", health.indicator1)
health.indicator1 <- gsub("\t\t\t\t", "", health.indicator1)
health.indicators <- as.factor(health.indicator1)
as.data.frame(health.indicators)
##                                                                                                           health.indicators
## 1                                                                                                 Life expectancy at birth 
## 2                                                                      Adult mortality rate between 15 and 60 years of age 
## 3                                                                                                Under-five mortality rate 
## 4                                                                                                    Infant mortality rate 
## 5                                                                                                  Neonatal mortality rate 
## 6                                                                                                          Stillbirth rate 
## 7                                                                                                 Maternal mortality ratio 
## 8                                                                                                        TB mortality rate 
## 9                                                                                              AIDS-related mortality rate 
## 10                                                                                                  Malaria mortality rate 
## 11 Mortality between 30 and 70 years of age from cardiovascular diseases, cancer, diabetes or chronic respiratory diseases 
## 12                                                                                                            Suicide rate 
## 13                                                                               Mortality rate from road traffic injuries 
## 14                                                                                               Adolescent fertility rate 
## 15                                                                                                    Total fertility rate 
## 16                                                                               New cases of vaccine-preventable diseases 
## 17                                                      New cases of IHR-notifiable diseases and other notifiable diseases 
## 18                                                                                                      HIV incidence rate 
## 19                                                                                                     HIV prevalence rate 
## 20                                                                                  Hepatitis B surface antigen prevalence 
## 21                                                                   Sexually transmitted infections (STIs) incidence rate 
## 22                                                                                                       TB incidence rate 
## 23                                                                                                    TB notification rate 
## 24                                                                                                      TB prevalence rate 
## 25                                                             Malaria parasite prevalence among children aged 6–59 months 
## 26                                                                                                  Malaria incidence rate 
## 27                                                                                     Cancer incidence, by type of cancer 
## 28                                                                          Exclusive breastfeeding rate 0–5 months of age 
## 29                                                                                       Early initiation of breastfeeding 
## 30                                                                            Incidence of low birth weight among newborns 
## 31                                                                                  Children under 5 years who are stunted 
## 32                                                                                   Children under 5 years who are wasted 
## 33                                                                                          Anaemia prevalence in children 
## 34                                                                         Anaemia prevalence in women of reproductive age 
## 35                                                                           Condom use at last sex with high-risk partner 
## 36                                                                 Population using safely managed drinking-water services 
## 37                                                                     Population using safely managed sanitation services 
## 38                                                              Population using modern fuels for cooking/heating/lighting 
## 39                                                                                           Air pollution level in cities 
## 40                                                                    Total alcohol per capita (age 15+ years) consumption 
## 41                                                                                Tobacco use among persons aged 18+ years 
## 42                                                                          Children aged under 5 years who are overweight 
## 43                                                                    Overweight and obesity in adults (Also: adolescents) 
## 44                                                                                      Raised blood pressure among adults 
## 45                                                                              Raised blood glucose/diabetes among adults 
## 46                                                                                                             Salt intake 
## 47                                                            Insufficient physical activity in adults (Also: adolescents) 
## 48                                                                                    Intimate partner violence prevalence 
## 49                                                                Demand for family planning satisfied with modern methods 
## 50                                                                                           Contraceptive prevalence rate 
## 51                                                                                                 Antenatal care coverage 
## 52                                                                             Births attended by skilled health personnel 
## 53                                                                                                Postpartum care coverage 
## 54                                                                                  Care-seeking for symptoms of pneumonia 
## 55                                                       Children with diarrhoea receiving oral rehydration solution (ORS) 
## 56                                                                                      Vitamin A supplementation coverage 
## 57                                         Immunization coverage rate by vaccine for each vaccine in the national schedule 
## 58                                                                          People living with HIV who have been diagnosed 
## 59                                                                              Prevention of mother-to-child transmission 
## 60                                                                                                       HIV care coverage 
## 61                                                                                   Antiretroviral therapy (ART) coverage 
## 62                                                                                              HIV viral load suppression 
## 63                                                TB preventive therapy for HIV-positive people newly enrolled in HIV care 
## 64                                                             HIV test results for registered new and relapse TB patients 
## 65                                                     HIV-positive new and relapse TB patients on ART during TB treatment 
## 66                                                                TB patients with results for drug susceptibility testing 
## 67                                                                                                  TB case detection rate 
## 68                                    Second-line treatment coverage among multidrug-resistant tuberculosis (MDR-TB) cases 
## 69                                                     Intermittent preventive therapy for malaria during pregnancy (IPTp) 
## 70                                                                                  Use of insecticide treated nets (ITNs) 
## 71                                                                                    Treatment of confirmed malaria cases 
## 72                                                                                 Indoor residual spraying (IRS) coverage 
## 73                                            Coverage of preventive chemotherapy for selected neglected tropical diseases 
## 74                                                                                               Cervical cancer screening 
## 75                                                                 Coverage of services for severe mental health disorders 
## 76                                                                                            Perioperative mortality rate 
## 77                                                               Obstetric and gynaecological admissions owing to abortion 
## 78                                                                                  Institutional maternal mortality ratio 
## 79                                                                                                  Maternal death reviews 
## 80                                                                                                      ART retention rate 
## 81                                                                                               TB treatment success rate 
## 82                                                                             Service-specific availability and readiness 
## 83                                                                                                     Service utilization 
## 84                                                                                                   Health service access 
## 85                                                                                                    Hospital bed density 
## 86                                                                     Availability of essential medicines and commodities 
## 87                                                                                  Health worker density and distribution 
## 88                                                                                            Output training institutions 
## 89                                                                                             Birth registration coverage 
## 90                                                                                             Death registration coverage 
## 91                                                                                 Completeness of reporting by facilities 
## 92                                                       Total current expenditure on health (% of gross domestic product) 
## 93         Current expenditure on health by general government and compulsory schemes (% of current expenditure on health) 
## 94                                                   Out-of-pocket payment for health (% of current expenditure on health) 
## 95                                                         Externally sourced funding (% of current expenditure on health) 
## 96                                         Total capital expenditure on health (% current + capital expenditure on health) 
## 97                                                                      Headcount ratio of catastrophic health expenditure 
## 98                                                                     Headcount ratio of impoverishing health expenditure 
## 99                                                              International Health Regulations (IHR) core capacity index
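The newline and tab cleanup above can also be done in a single pass by collapsing every run of whitespace and trimming both ends, which would also drop the trailing space visible in the printed list. The sketch below uses only base R. The `raw_items` vector is a made-up stand-in for scraped text, not actual output from the WHO page.

```r
# Hypothetical stand-in for the text returned by html_text()
raw_items <- c("\n\t\t\t\tLife expectancy at birth \n",
               "\n\t\t\t\tInfant   mortality rate\n")

# Collapse each run of spaces, tabs, or newlines into one space,
# then strip leading and trailing whitespace
clean_items <- trimws(gsub("[[:space:]]+", " ", raw_items))

print(clean_items)
```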

After reviewing the list, I decided to look specifically at obesity rates and cardiovascular health for each state.

Obesity: Obesity is a rising epidemic in the United States and an important health indicator. I downloaded 2016 obesity rates from the United Health Foundation's America's Health Rankings data. The data was downloaded as an Excel file and then converted to a CSV.

rankings <- read.csv("/Users/christinakasman/Desktop/obesity.csv") # read.csv() already returns a data frame
rankings <- rankings[-c(51, 52), ] # Remove District of Columbia and US
names(rankings) <- c("Edition", "Measure", "State", "OverallHealthrank", "obesityvalue", "Score", "LowerCI", "UpperCI", "Source", "Source.Year") # Rename headers
rankings <- subset(rankings, select = c(3, 4, 5, 6)) # Keep State, OverallHealthrank, obesityvalue, Score
State.Name <- state.abb[match(rankings$State, state.name)] # Match state names to abbreviations
rankings2 <- cbind(rankings, State.Name) # Bind new state abbreviation column, to be used as unique identifier
head(rankings2)
##        State OverallHealthrank obesityvalue Score State.Name
## 1    Alabama                47         35.6  1.54         AL
## 2     Alaska                24         29.8  0.00         AK
## 3    Arizona                17         28.4 -0.37         AZ
## 4   Arkansas                45         34.5  1.25         AR
## 5 California                 4         24.2 -1.49         CA
## 6   Colorado                 1         20.2 -2.00         CO
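The abbreviation lookup relies on R's built-in `state.name` and `state.abb` vectors, which cover only the 50 states. That is one reason rows such as District of Columbia are dropped first: `match()` returns `NA` for any name it cannot find. A small sketch, with an illustrative `full_names` vector that is not taken from the data file:

```r
# state.name and state.abb are built-in vectors covering the 50 states only
full_names <- c("Alabama", "Colorado", "District of Columbia")

# match() gives each name's position in state.name, or NA if it is absent
abbreviations <- state.abb[match(full_names, state.name)]

print(abbreviations)

# Counting NAs is a quick way to spot names that failed to match
print(sum(is.na(abbreviations)))
```

Running the same `sum(is.na(...))` check on `rankings2$State.Name` would confirm that every state received an abbreviation before merging.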

Cardiovascular Health:

I then explored #11 on the list of health indicators, “Mortality between 30 and 70 years of age from cardiovascular diseases, cancer, diabetes or chronic respiratory diseases”, looking specifically at deaths related to cardiovascular health.

I downloaded heart disease mortality data by state (from 2015) from the Centers for Disease Control and Prevention website.

heartdisease <- read.csv("/Users/christinakasman/Desktop/HEART_DISEASE2015 (2).csv") # Read in CSV
heartdisease1 <- as.data.frame(heartdisease)
names(heartdisease1) <- c("State.Name", "MortRate", "Deaths", "URL") # Rename headers
head(heartdisease1)
##   State.Name MortRate Deaths
## 1         AL    229.7 12,981
## 2         AK    154.1    846
## 3         AZ    138.8 11,458
## 4         AR    223.2  7,938
## 5         CA    145.6 61,289
## 6         CO    128.4  7,009
##                                                URL
## 1       /nchs/pressroom/states/alabama/alabama.htm
## 2         /nchs/pressroom/states/alaska/alaska.htm
## 3       /nchs/pressroom/states/arizona/arizona.htm
## 4     /nchs/pressroom/states/arkansas/arkansas.htm
## 5 /nchs/pressroom/states/california/california.htm
## 6     /nchs/pressroom/states/colorado/colorado.htm
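The `Deaths` column is printed with thousands separators (for example `12,981`), so R holds it as text rather than as numbers. The regressions later do not use it, but if death counts were needed, one possible conversion is to strip the commas first. The values below are copied from the `head()` output above, and the variable names are illustrative.

```r
# Death counts as they appear in the printed data (text with commas)
deaths_raw <- c("12,981", "846", "11,458")

# Remove the thousands separators, then convert to numeric
deaths_num <- as.numeric(gsub(",", "", deaths_raw, fixed = TRUE))

print(deaths_num)
```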

Median Income:

I downloaded the median income data (as a CSV) from the United States Census Bureau website, which provides median income for each state. https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk

income <- read.csv("/Users/christinakasman/Desktop/ACS_11_1YR_R1901.US01PRF_with_ann.csv") # Read in CSV file
income <- income[-c(1, 2, 7, 54), ] # Remove first row, United States, District of Columbia, and Puerto Rico
names(income) <- c("ID", "ID2", "Country", "Target ID", "Target ID 2", "Rank", "State", "EST", "MOE") # Rename headers
income1 <- subset(income, select = c(6, 7, 8, 9)) # Keep Rank, State, EST, MOE

Changing State Names to match abbreviations

State.Name <- state.abb[match(income1$State, state.name)] # Match state names to abbreviations
income2 <- cbind(income1, State.Name) # Bind new state abbreviation column, to be used as unique identifier
income2 <- income2[, c(1, 2, 5, 3, 4)] # Reorder so State.Name follows State
head(income2)
##   Rank         State State.Name   EST  MOE
## 3    1      Maryland         MD 70004  804
## 4    2        Alaska         AK 67825 1948
## 5    3    New Jersey         NJ 67458  721
## 6    4   Connecticut         CT 65753  854
## 8    6 Massachusetts         MA 62859  902
## 9    7 New Hampshire         NH 62647 1415

Merge Data Frames Together

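merge1 <- merge(income2, heartdisease1, by = "State.Name") # Join income and heart disease data on abbreviation
merge2 <- merge(merge1, rankings2, by = "State.Name") # Add the obesity rankings
merge2 <- subset(merge2, select = c(1, 2, 3, 4, 6, 7, 10, 11)) # Keep only the columns needed for analysis
# Convert EST to numeric (if EST is a factor, this yields level codes, not dollar values)
merge2 <- transform(merge2, EST = as.numeric(EST))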
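One pitfall with `as.numeric()` is worth checking here. Before R 4.0, `read.csv()` imported text-like columns as factors by default, and `as.numeric()` on a factor returns the integer level codes rather than the values printed on screen. Whether this applies depends on how `EST` was imported, and `class(merge2$EST)` would show it. The regression slopes reported later (about -0.16 and -1.18 per unit of `EST`) are easier to reconcile with a 1-to-50 scale than with dollar amounts, so the check seems worth running. The sketch below uses three income values from the `head(income2)` output, with illustrative variable names.

```r
# Income values stored as a factor, as older read.csv() defaults would do
est <- factor(c("70004", "67825", "67458"))

# as.numeric() on a factor returns level codes (positions in the sorted levels)
codes <- as.numeric(est)

# Converting through character first recovers the dollar amounts
values <- as.numeric(as.character(est))

print(codes)
print(values)
```

If `EST` does turn out to be a factor, replacing `as.numeric(EST)` with `as.numeric(as.character(EST))` would put the models on a dollar scale. The slope estimates would then change, so the output would need to be regenerated.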
library(ggplot2)
ggplot(merge2, aes(EST, MortRate, color = MortRate)) +
  geom_point(shape = 16, size = 2, show.legend = FALSE) +
  geom_smooth(method = lm) +
  theme_minimal() +
  xlab("Estimated Median Income") +
  ylab("Mortality Rate") +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

ggplot(merge2, aes(EST, obesityvalue, color = obesityvalue)) +
  geom_point(shape = 16, size = 2, show.legend = FALSE) +
  geom_smooth(method = lm) +
  theme_minimal() +
  xlab("Estimated Median Income") +
  ylab("Obesity") +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Is estimated median income a predictor of mortality rate or obesity?

oblm <- lm(merge2$obesityvalue~merge2$EST, data = merge2)
mortllm <- lm(merge2$MortRate~merge2$EST, data = merge2)
summary(oblm)
## 
## Call:
## lm(formula = merge2$obesityvalue ~ merge2$EST, data = merge2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9545 -1.8017  0.2292  2.2007  4.6133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.85021    0.83576  40.502  < 2e-16 ***
## merge2$EST  -0.16398    0.02708  -6.055 2.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.868 on 48 degrees of freedom
## Multiple R-squared:  0.433,  Adjusted R-squared:  0.4212 
## F-statistic: 36.66 on 1 and 48 DF,  p-value: 2.069e-07
summary(mortllm)
## 
## Call:
## lm(formula = merge2$MortRate ~ merge2$EST, data = merge2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.337 -13.626   2.433  15.198  46.627 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  200.377      6.511  30.776  < 2e-16 ***
## merge2$EST    -1.182      0.211  -5.603 1.01e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.35 on 48 degrees of freedom
## Multiple R-squared:  0.3954, Adjusted R-squared:  0.3828 
## F-statistic: 31.39 on 1 and 48 DF,  p-value: 1.007e-06
res1 <- cor.test(merge2$EST, merge2$obesityvalue, 
                    method = "pearson")
res1
## 
##  Pearson's product-moment correlation
## 
## data:  merge2$EST and merge2$obesityvalue
## t = -6.0547, df = 48, p-value = 2.069e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7914286 -0.4648365
## sample estimates:
##        cor 
## -0.6580426
res1 <- cor.test(merge2$EST, merge2$MortRate, 
                    method = "pearson")
res1
## 
##  Pearson's product-moment correlation
## 
## data:  merge2$EST and merge2$MortRate
## t = -5.6031, df = 48, p-value = 1.007e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7720415 -0.4248387
## sample estimates:
##        cor 
## -0.6288259
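In a regression with a single predictor, the model's R-squared equals the square of the Pearson correlation, so the `lm()` and `cor.test()` results above describe the same relationship. The sketch below checks this using the correlation estimates copied from the output. The variable names are illustrative.

```r
# Correlation estimates copied from the cor.test() output above
r_obesity   <- -0.6580426
r_mortality <- -0.6288259

# With one predictor, squaring r gives the model's R-squared
r2 <- round(c(obesity = r_obesity^2, mortality = r_mortality^2), 4)
print(r2)

# Strength of a correlation is compared by absolute value
stronger <- if (abs(r_obesity) > abs(r_mortality)) "obesity" else "mortality"
print(stronger)
```

The squared values agree with the Multiple R-squared figures of 0.433 and 0.3954 reported by `summary()`, and the comparison shows which of the two correlations is larger in magnitude.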

Outcome: The correlation between median income and obesity (r = -0.66) is slightly stronger than the correlation between median income and heart disease mortality rate (r = -0.63), but not by much. Both are fairly strong negative correlations: states with higher median income tend to have lower obesity rates and lower heart disease mortality. Overall, these results support the hypothesis that income and good health are correlated across U.S. states.