Introduction

There are several forms of health coverage in the United States of America, Each state has its own health policy; Among those that exist are the health care policy directed by the federal government and the insurance company. It is important to have a general idea about the economics of health in the States. It is required to have an idea of health spending, including healthcare and healthcare coverage and spending in all states. The question is. Is There a Relationship Between Health Coverage and Health Spending in the United States? Data will be collected, data will be imported, data will be cleaned, data will be analyzed and data will be visualized, using well-established and commonly used packages, including library (datasets), library (tidyr), library (dplyr), ggplot2, and ggrepel.

Healthcare data

We will be using the data from the Henry J Kaiser Family Foundation (KFF) (https://www.k

Health Insurance Coverage of the Total Population (https://www.kff.org/other/state- indicator/total-population/) - Includes years 2013-2016 Health Care Expenditures by State of Residence (in millions) (https://www.kff.org/other/state- indicator/health-care-expenditures-by-state-of-residence-in-millions/) - Includes years 1991- 2014

we first define the path to the data.

the tidyverse package will be used.

library(tidyverse)
library(readr)

the web data will be read.

library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
read_lines(file = "healthcare-coverage.csv", n_max = 10)
##  [1] "\"Title: Health Insurance Coverage of the Total Population | The Henry J. Kaiser Family Foundation\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
##  [2] "\"Timeframe: 2013 - 2016\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
##  [3] "\"Location\",\"2013__Employer\",\"2013__Non-Group\",\"2013__Medicaid\",\"2013__Medicare\",\"2013__Other Public\",\"2013__Uninsured\",\"2013__Total\",\"2014__Employer\",\"2014__Non-Group\",\"2014__Medicaid\",\"2014__Medicare\",\"2014__Other Public\",\"2014__Uninsured\",\"2014__Total\",\"2015__Employer\",\"2015__Non-Group\",\"2015__Medicaid\",\"2015__Medicare\",\"2015__Other Public\",\"2015__Uninsured\",\"2015__Total\",\"2016__Employer\",\"2016__Non-Group\",\"2016__Medicaid\",\"2016__Medicare\",\"2016__Other Public\",\"2016__Uninsured\",\"2016__Total\""
##  [4] "\"United States\",\"155696900\",\"13816000\",\"54919100\",\"40876300\",\"6295400\",\"41795100\",\"313401200\",\"154347500\",\"19313000\",\"61650400\",\"41896500\",\"5985000\",\"32967500\",\"316159900\",\"155965800\",\"21816500\",\"62384500\",\"43308400\",\"6422300\",\"28965900\",\"318868500\",\"157381500\",\"21884400\",\"62303400\",\"44550200\",\"6192200\",\"28051900\",\"320372000\""                                                                                                                                                                           
##  [5] "\"Alabama\",\"2126500\",\"174200\",\"869700\",\"783000\",\"85600\",\"724800\",\"4763900\",\"2202800\",\"288900\",\"891900\",\"718400\",\"143900\",\"522200\",\"4768000\",\"2218000\",\"291500\",\"911400\",\"719100\",\"174600\",\"519400\",\"4833900\",\"2263800\",\"262400\",\"997000\",\"761200\",\"128800\",\"420800\",\"4834100\""                                                                                                                                                                                                                                      
##  [6] "\"Alaska\",\"364900\",\"24000\",\"95000\",\"55200\",\"60600\",\"102200\",\"702000\",\"345300\",\"26800\",\"130100\",\"55300\",\"37300\",\"100800\",\"695700\",\"355700\",\"22300\",\"128100\",\"60900\",\"47700\",\"90500\",\"705300\",\"324400\",\"20300\",\"145400\",\"68200\",\"55600\",\"96900\",\"710800\""                                                                                                                                                                                                                                                             
##  [7] "\"Arizona\",\"2883800\",\"170800\",\"1346100\",\"842000\",\"N/A\",\"1223000\",\"6603100\",\"2835200\",\"333500\",\"1639400\",\"911100\",\"N/A\",\"827100\",\"6657200\",\"2766500\",\"278400\",\"1711500\",\"949000\",\"189300\",\"844800\",\"6739500\",\"3010700\",\"377000\",\"1468400\",\"1028000\",\"172500\",\"833700\",\"6890200\""                                                                                                                                                                                                                                     
##  [8] "\"Arkansas\",\"1128800\",\"155600\",\"600800\",\"515200\",\"67600\",\"436800\",\"2904800\",\"1176500\",\"231700\",\"639200\",\"479400\",\"82000\",\"287200\",\"2896000\",\"1293700\",\"200200\",\"641400\",\"484500\",\"63700\",\"268400\",\"2953000\",\"1290900\",\"252900\",\"618600\",\"490000\",\"67500\",\"225500\",\"2945300\""                                                                                                                                                                                                                                        
##  [9] "\"California\",\"17747300\",\"1986400\",\"8344800\",\"3828500\",\"675400\",\"5594100\",\"38176400\",\"17703700\",\"2778800\",\"9618800\",\"4049000\",\"634400\",\"3916700\",\"38701300\",\"17718300\",\"3444200\",\"10138100\",\"4080100\",\"752700\",\"2980600\",\"39113900\",\"18116200\",\"3195400\",\"9853800\",\"4436000\",\"556100\",\"3030800\",\"39188300\""                                                                                                                                                                                                         
## [10] "\"Colorado\",\"2852500\",\"426300\",\"697300\",\"549700\",\"118100\",\"654000\",\"5297800\",\"2489400\",\"397900\",\"1053700\",\"619500\",\"214000\",\"602900\",\"5377400\",\"2706000\",\"346900\",\"1036600\",\"708000\",\"148000\",\"475700\",\"5421300\",\"2872600\",\"370000\",\"855800\",\"692400\",\"190100\",\"528400\",\"5509200\""

the first two lines are descriptive and are not useful.

library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- read_csv("healthcare-coverage.csv", skip = 2, col_names = TRUE)
head(coverage)
## # A tibble: 6 x 29
##   Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
##   <chr>               <dbl>            <dbl>            <dbl>            <dbl>
## 1 United ~        155696900         13816000         54919100         40876300
## 2 Alabama           2126500           174200           869700           783000
## 3 Alaska             364900            24000            95000            55200
## 4 Arizona           2883800           170800          1346100           842000
## 5 Arkansas          1128800           155600           600800           515200
## 6 Califor~         17747300          1986400          8344800          3828500
## # ... with 24 more variables: `2013__Other Public` <chr>,
## #   `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
## #   `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
## #   `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
## #   `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
## #   `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
## #   `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
## #   `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
## #   `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>

it is verified if there are NAs:

library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
tail(coverage)
## # A tibble: 6 x 29
##   Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
##   <chr>               <dbl>            <dbl>            <dbl>            <dbl>
## 1 <NA>                   NA               NA               NA               NA
## 2 *Uninsu~               NA               NA               NA               NA
## 3 <NA>                   NA               NA               NA               NA
## 4 For exa~               NA               NA               NA               NA
## 5 <NA>                   NA               NA               NA               NA
## 6 *N/A*: ~               NA               NA               NA               NA
## # ... with 24 more variables: `2013__Other Public` <chr>,
## #   `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
## #   `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
## #   `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
## #   `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
## #   `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
## #   `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
## #   `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
## #   `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>

It looks like there is a line with a string Notes in it and everything below that line should not be read in. We can use the n_max argument here.

library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- read_csv("healthcare-coverage.csv", skip = 2, col_names = TRUE)
coverage <- read_csv("healthcare-coverage.csv", skip = 2, col_names = TRUE, n_max = which(coverage$Location == "Notes")-1)
tail(coverage)
## # A tibble: 6 x 29
##   Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
##   <chr>               <dbl>            <dbl>            <dbl>            <dbl>
## 1 Vermont            317700            26200           123400            96600
## 2 Virginia          4661600           364800           773200           968000
## 3 Washing~          3541600           309000          1026800           879000
## 4 West Vi~           841300            42600           382500           329400
## 5 Wiscons~          3154500           225300           907600           812900
## 6 Wyoming            305900            19500            74200            65400
## # ... with 24 more variables: `2013__Other Public` <chr>,
## #   `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
## #   `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
## #   `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
## #   `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
## #   `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
## #   `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
## #   `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
## #   `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>

Read in healthcare spending data

library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending <- read_csv("healthcare-spending.csv", skip = 2, col_names = TRUE)
spending <- read_csv("healthcare-spending.csv", skip = 2, col_names = TRUE, n_max = which(spending$Location == "Notes")-1)
tail(spending)
## # A tibble: 6 x 25
##   Location `1991__Total He~ `1992__Total He~ `1993__Total He~ `1994__Total He~
##   <chr>               <dbl>            <dbl>            <dbl>            <dbl>
## 1 Vermont              1330             1421             1522             1625
## 2 Virginia            14829            15599            16634            17637
## 3 Washing~            12674            13859            14523            15303
## 4 West Vi~             4672             5159             5550             5891
## 5 Wiscons~            12694            13669            14636            15532
## 6 Wyoming              1023             1067             1171             1265
## # ... with 20 more variables: `1995__Total Health Spending` <dbl>, `1996__Total
## #   Health Spending` <dbl>, `1997__Total Health Spending` <dbl>, `1998__Total
## #   Health Spending` <dbl>, `1999__Total Health Spending` <dbl>, `2000__Total
## #   Health Spending` <dbl>, `2001__Total Health Spending` <dbl>, `2002__Total
## #   Health Spending` <dbl>, `2003__Total Health Spending` <dbl>, `2004__Total
## #   Health Spending` <dbl>, `2005__Total Health Spending` <dbl>, `2006__Total
## #   Health Spending` <dbl>, `2007__Total Health Spending` <dbl>, `2008__Total
## #   Health Spending` <dbl>, `2009__Total Health Spending` <dbl>, `2010__Total
## #   Health Spending` <dbl>, `2011__Total Health Spending` <dbl>, `2012__Total
## #   Health Spending` <dbl>, `2013__Total Health Spending` <dbl>, `2014__Total
## #   Health Spending` <dbl>

the coverage data frame, we see that it is not “tidy”:

library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
glimpse(coverage)
## Rows: 52
## Columns: 29
## $ Location             <chr> "United States", "Alabama", "Alaska", "Arizona...
## $ `2013__Employer`     <dbl> 155696900, 2126500, 364900, 2883800, 1128800, ...
## $ `2013__Non-Group`    <dbl> 13816000, 174200, 24000, 170800, 155600, 19864...
## $ `2013__Medicaid`     <dbl> 54919100, 869700, 95000, 1346100, 600800, 8344...
## $ `2013__Medicare`     <dbl> 40876300, 783000, 55200, 842000, 515200, 38285...
## $ `2013__Other Public` <chr> "6295400", "85600", "60600", "N/A", "67600", "...
## $ `2013__Uninsured`    <dbl> 41795100, 724800, 102200, 1223000, 436800, 559...
## $ `2013__Total`        <dbl> 313401200, 4763900, 702000, 6603100, 2904800, ...
## $ `2014__Employer`     <dbl> 154347500, 2202800, 345300, 2835200, 1176500, ...
## $ `2014__Non-Group`    <dbl> 19313000, 288900, 26800, 333500, 231700, 27788...
## $ `2014__Medicaid`     <dbl> 61650400, 891900, 130100, 1639400, 639200, 961...
## $ `2014__Medicare`     <dbl> 41896500, 718400, 55300, 911100, 479400, 40490...
## $ `2014__Other Public` <chr> "5985000", "143900", "37300", "N/A", "82000", ...
## $ `2014__Uninsured`    <dbl> 32967500, 522200, 100800, 827100, 287200, 3916...
## $ `2014__Total`        <dbl> 316159900, 4768000, 695700, 6657200, 2896000, ...
## $ `2015__Employer`     <dbl> 155965800, 2218000, 355700, 2766500, 1293700, ...
## $ `2015__Non-Group`    <dbl> 21816500, 291500, 22300, 278400, 200200, 34442...
## $ `2015__Medicaid`     <dbl> 62384500, 911400, 128100, 1711500, 641400, 101...
## $ `2015__Medicare`     <dbl> 43308400, 719100, 60900, 949000, 484500, 40801...
## $ `2015__Other Public` <chr> "6422300", "174600", "47700", "189300", "63700...
## $ `2015__Uninsured`    <dbl> 28965900, 519400, 90500, 844800, 268400, 29806...
## $ `2015__Total`        <dbl> 318868500, 4833900, 705300, 6739500, 2953000, ...
## $ `2016__Employer`     <dbl> 157381500, 2263800, 324400, 3010700, 1290900, ...
## $ `2016__Non-Group`    <dbl> 21884400, 262400, 20300, 377000, 252900, 31954...
## $ `2016__Medicaid`     <dbl> 62303400, 997000, 145400, 1468400, 618600, 985...
## $ `2016__Medicare`     <dbl> 44550200, 761200, 68200, 1028000, 490000, 4436...
## $ `2016__Other Public` <chr> "6192200", "128800", "55600", "172500", "67500...
## $ `2016__Uninsured`    <dbl> 28051900, 420800, 96900, 833700, 225500, 30308...
## $ `2016__Total`        <dbl> 320372000, 4834100, 710800, 6890200, 2945300, ...

Read the State information

library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
unique(coverage$Location)
##  [1] "United States"        "Alabama"              "Alaska"              
##  [4] "Arizona"              "Arkansas"             "California"          
##  [7] "Colorado"             "Connecticut"          "Delaware"            
## [10] "District of Columbia" "Florida"              "Georgia"             
## [13] "Hawaii"               "Idaho"                "Illinois"            
## [16] "Indiana"              "Iowa"                 "Kansas"              
## [19] "Kentucky"             "Louisiana"            "Maine"               
## [22] "Maryland"             "Massachusetts"        "Michigan"            
## [25] "Minnesota"            "Mississippi"          "Missouri"            
## [28] "Montana"              "Nebraska"             "Nevada"              
## [31] "New Hampshire"        "New Jersey"           "New Mexico"          
## [34] "New York"             "North Carolina"       "North Dakota"        
## [37] "Ohio"                 "Oklahoma"             "Oregon"              
## [40] "Pennsylvania"         "Rhode Island"         "South Carolina"      
## [43] "South Dakota"         "Tennessee"            "Texas"               
## [46] "Utah"                 "Vermont"              "Virginia"            
## [49] "Washington"           "West Virginia"        "Wisconsin"           
## [52] "Wyoming"

Let’s look what states are inside the state dataset.

library(tidyverse)
library(readr)
library(datasets) 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
data(state)
unique(state.name)
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

dealing with DC as a special case.

library(tidyverse)
library(readr)
library(datasets) 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
state.abb <- c(state.abb, "DC")
state.region <- as.factor(c(as.character(state.region), "South")) 
state.name <- c(state.name, "District of Columbia")
library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
state.abb <- c(state.abb, "DC")
state.region <- as.factor(c(as.character(state.region), "South")) 
state.name <- c(state.name, "District of Columbia")

create a column titled year_type and coverage . also want to keep the Location column as it is because it also contains observational level data.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- gather(coverage, "year_type", "tot_coverage", -Location) 
coverage
## # A tibble: 1,456 x 3
##    Location             year_type      tot_coverage
##    <chr>                <chr>          <chr>       
##  1 United States        2013__Employer 155696900   
##  2 Alabama              2013__Employer 2126500     
##  3 Alaska               2013__Employer 364900      
##  4 Arizona              2013__Employer 2883800     
##  5 Arkansas             2013__Employer 1128800     
##  6 California           2013__Employer 17747300    
##  7 Colorado             2013__Employer 2852500     
##  8 Connecticut          2013__Employer 2030500     
##  9 Delaware             2013__Employer 473700      
## 10 District of Columbia 2013__Employer 324300      
## # ... with 1,446 more rows

Convert healthcare spending data to a long format .

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending <- gather(spending, "year", "tot_spending", -Location) 
spending
## # A tibble: 1,248 x 3
##    Location             year                        tot_spending
##    <chr>                <chr>                              <dbl>
##  1 United States        1991__Total Health Spending       675896
##  2 Alabama              1991__Total Health Spending        10393
##  3 Alaska               1991__Total Health Spending         1458
##  4 Arizona              1991__Total Health Spending         9269
##  5 Arkansas             1991__Total Health Spending         5632
##  6 California           1991__Total Health Spending        81438
##  7 Colorado             1991__Total Health Spending         8460
##  8 Connecticut          1991__Total Health Spending        10950
##  9 Delaware             1991__Total Health Spending         1938
## 10 District of Columbia 1991__Total Health Spending         2800
## # ... with 1,238 more rows

Convert data from long format to wide format

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spread(coverage, year_type, tot_coverage)
## # A tibble: 52 x 29
##    Location `2013__Employer` `2013__Medicaid` `2013__Medicare` `2013__Non-Grou~
##    <chr>    <chr>            <chr>            <chr>            <chr>           
##  1 Alabama  2126500          869700           783000           174200          
##  2 Alaska   364900           95000            55200            24000           
##  3 Arizona  2883800          1346100          842000           170800          
##  4 Arkansas 1128800          600800           515200           155600          
##  5 Califor~ 17747300         8344800          3828500          1986400         
##  6 Colorado 2852500          697300           549700           426300          
##  7 Connect~ 2030500          532000           475300           126800          
##  8 Delaware 473700           192700           141300           25100           
##  9 Distric~ 324300           174900           59900            30400           
## 10 Florida  8023400          3190900          3108800          968200          
## # ... with 42 more rows, and 24 more variables: `2013__Other Public` <chr>,
## #   `2013__Total` <chr>, `2013__Uninsured` <chr>, `2014__Employer` <chr>,
## #   `2014__Medicaid` <chr>, `2014__Medicare` <chr>, `2014__Non-Group` <chr>,
## #   `2014__Other Public` <chr>, `2014__Total` <chr>, `2014__Uninsured` <chr>,
## #   `2015__Employer` <chr>, `2015__Medicaid` <chr>, `2015__Medicare` <chr>,
## #   `2015__Non-Group` <chr>, `2015__Other Public` <chr>, `2015__Total` <chr>,
## #   `2015__Uninsured` <chr>, `2016__Employer` <chr>, `2016__Medicaid` <chr>,
## #   `2016__Medicare` <chr>, `2016__Non-Group` <chr>, `2016__Other
## #   Public` <chr>, `2016__Total` <chr>, `2016__Uninsured` <chr>

let’s separate the year_type column in the coverage dataset to two columns: year and health coverage type .

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
  coverage %>%
    separate(year_type, sep="__", into=c("year", "type"))
## # A tibble: 1,456 x 4
##    Location             year  type     tot_coverage
##    <chr>                <chr> <chr>    <chr>       
##  1 United States        2013  Employer 155696900   
##  2 Alabama              2013  Employer 2126500     
##  3 Alaska               2013  Employer 364900      
##  4 Arizona              2013  Employer 2883800     
##  5 Arkansas             2013  Employer 1128800     
##  6 California           2013  Employer 17747300    
##  7 Colorado             2013  Employer 2852500     
##  8 Connecticut          2013  Employer 2030500     
##  9 Delaware             2013  Employer 473700      
## 10 District of Columbia 2013  Employer 324300      
## # ... with 1,446 more rows
  coverage
## # A tibble: 1,456 x 3
##    Location             year_type      tot_coverage
##    <chr>                <chr>          <chr>       
##  1 United States        2013__Employer 155696900   
##  2 Alabama              2013__Employer 2126500     
##  3 Alaska               2013__Employer 364900      
##  4 Arizona              2013__Employer 2883800     
##  5 Arkansas             2013__Employer 1128800     
##  6 California           2013__Employer 17747300    
##  7 Colorado             2013__Employer 2852500     
##  8 Connecticut          2013__Employer 2030500     
##  9 Delaware             2013__Employer 473700      
## 10 District of Columbia 2013__Employer 324300      
## # ... with 1,446 more rows

We see that we now have two columns, except the year column was converted to a character. If we look at the help file ?separate , we see we can use the convert=TRUE argument to convert the character to an integer

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- coverage %>%
     separate(year_type, sep="__",  into=c("year", "type"), convert = TRUE)

coverage
## # A tibble: 1,456 x 4
##    Location              year type     tot_coverage
##    <chr>                <int> <chr>    <chr>       
##  1 United States         2013 Employer 155696900   
##  2 Alabama               2013 Employer 2126500     
##  3 Alaska                2013 Employer 364900      
##  4 Arizona               2013 Employer 2883800     
##  5 Arkansas              2013 Employer 1128800     
##  6 California            2013 Employer 17747300    
##  7 Colorado              2013 Employer 2852500     
##  8 Connecticut           2013 Employer 2030500     
##  9 Delaware              2013 Employer 473700      
## 10 District of Columbia  2013 Employer 324300      
## # ... with 1,446 more rows

we see that the tot_coverage column is also a character.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- coverage %>%
   mutate_at("tot_coverage", as.integer)

# Add the abbreviation of States
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]

coverage
## # A tibble: 1,456 x 6
##    Location              year type     tot_coverage abb   region   
##    <chr>                <int> <chr>           <int> <chr> <fct>    
##  1 United States         2013 Employer    155696900 <NA>  <NA>     
##  2 Alabama               2013 Employer      2126500 AL    South    
##  3 Alaska                2013 Employer       364900 AK    West     
##  4 Arizona               2013 Employer      2883800 AZ    West     
##  5 Arkansas              2013 Employer      1128800 AR    South    
##  6 California            2013 Employer     17747300 CA    West     
##  7 Colorado              2013 Employer      2852500 CO    West     
##  8 Connecticut           2013 Employer      2030500 CT    Northeast
##  9 Delaware              2013 Employer       473700 DE    South    
## 10 District of Columbia  2013 Employer       324300 DC    South    
## # ... with 1,446 more rows

We see that there are different year s and different types of healthcare coverage.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
# Add the abbreviation of States
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]
coverage
## # A tibble: 1,456 x 6
##    Location              year type     tot_coverage abb   region   
##    <chr>                <int> <chr>           <int> <chr> <fct>    
##  1 United States         2013 Employer    155696900 <NA>  <NA>     
##  2 Alabama               2013 Employer      2126500 AL    South    
##  3 Alaska                2013 Employer       364900 AK    West     
##  4 Arizona               2013 Employer      2883800 AZ    West     
##  5 Arkansas              2013 Employer      1128800 AR    South    
##  6 California            2013 Employer     17747300 CA    West     
##  7 Colorado              2013 Employer      2852500 CO    West     
##  8 Connecticut           2013 Employer      2030500 CT    Northeast
##  9 Delaware              2013 Employer       473700 DE    South    
## 10 District of Columbia  2013 Employer       324300 DC    South    
## # ... with 1,446 more rows

Next, we will look at the spending data. We see the year column has information that we do not want. We only care about the year.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending
## # A tibble: 1,248 x 3
##    Location             year                        tot_spending
##    <chr>                <chr>                              <dbl>
##  1 United States        1991__Total Health Spending       675896
##  2 Alabama              1991__Total Health Spending        10393
##  3 Alaska               1991__Total Health Spending         1458
##  4 Arizona              1991__Total Health Spending         9269
##  5 Arkansas             1991__Total Health Spending         5632
##  6 California           1991__Total Health Spending        81438
##  7 Colorado             1991__Total Health Spending         8460
##  8 Connecticut          1991__Total Health Spending        10950
##  9 Delaware             1991__Total Health Spending         1938
## 10 District of Columbia 1991__Total Health Spending         2800
## # ... with 1,238 more rows

To select all the columns except a specific column, use the “-” (subtraction) operator (also known as negative indexing).

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending <- spending %>%
     separate(year, sep="__", into=c("year", "name"), convert = TRUE) %>% select(-name)
spending
## # A tibble: 1,248 x 3
##    Location              year tot_spending
##    <chr>                <int>        <dbl>
##  1 United States         1991       675896
##  2 Alabama               1991        10393
##  3 Alaska                1991         1458
##  4 Arizona               1991         9269
##  5 Arkansas              1991         5632
##  6 California            1991        81438
##  7 Colorado              1991         8460
##  8 Connecticut           1991        10950
##  9 Delaware              1991         1938
## 10 District of Columbia  1991         2800
## # ... with 1,238 more rows

coverage and spending together, we have to decide a few things. Both share a Location column and a year column. However, the range of years is different between datasets. Do we want a dataset with all the years available or only a portion of it? Because there is spending information from 1991-2014, and coverage information from 2013-2016.

hc <- left_join(coverage, spending, by = c(“Location”, “year”)) head(hc)

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- left_join(coverage, spending, by = c("Location", "year"))
head(hc)
## # A tibble: 6 x 7
##   Location       year type     tot_coverage abb   region tot_spending
##   <chr>         <int> <chr>           <int> <chr> <fct>         <dbl>
## 1 United States  2013 Employer    155696900 <NA>  <NA>        2435624
## 2 Alabama        2013 Employer      2126500 AL    South         33788
## 3 Alaska         2013 Employer       364900 AK    West           7684
## 4 Arizona        2013 Employer      2883800 AZ    West          41481
## 5 Arkansas       2013 Employer      1128800 AR    South         20500
## 6 California     2013 Employer     17747300 CA    West         278168

We see that the new hc dataset includes all the years from 2013-2016 (as that is the range of years in coverage ), but because the spending dataset only goes to 2014, the tot_spending is reported as NA for years 2015 and 2016.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- right_join(coverage, spending, by = c("Location", "year"))
head(hc)
## # A tibble: 6 x 7
##   Location       year type     tot_coverage abb   region tot_spending
##   <chr>         <int> <chr>           <int> <chr> <fct>         <dbl>
## 1 United States  2013 Employer    155696900 <NA>  <NA>        2435624
## 2 Alabama        2013 Employer      2126500 AL    South         33788
## 3 Alaska         2013 Employer       364900 AK    West           7684
## 4 Arizona        2013 Employer      2883800 AZ    West          41481
## 5 Arkansas       2013 Employer      1128800 AR    South         20500
## 6 California     2013 Employer     17747300 CA    West         278168

Here, we see every row in the spending dataset is there, but with NAs for the years that there was no coverage data.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- inner_join(coverage, spending, by = c("Location", "year"))
head(hc)
## # A tibble: 6 x 7
##   Location       year type     tot_coverage abb   region tot_spending
##   <chr>         <int> <chr>           <int> <chr> <fct>         <dbl>
## 1 United States  2013 Employer    155696900 <NA>  <NA>        2435624
## 2 Alabama        2013 Employer      2126500 AL    South         33788
## 3 Alaska         2013 Employer       364900 AK    West           7684
## 4 Arizona        2013 Employer      2883800 AZ    West          41481
## 5 Arkansas       2013 Employer      1128800 AR    South         20500
## 6 California     2013 Employer     17747300 CA    West         278168

we are only interested in looking at US states, we can remove the rows corresponding to the Location == “United States”.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
   filter(Location != "United States")
hc
## # A tibble: 714 x 7
##    Location              year type     tot_coverage abb   region    tot_spending
##    <chr>                <int> <chr>           <int> <chr> <fct>            <dbl>
##  1 Alabama               2013 Employer      2126500 AL    South            33788
##  2 Alaska                2013 Employer       364900 AK    West              7684
##  3 Arizona               2013 Employer      2883800 AZ    West             41481
##  4 Arkansas              2013 Employer      1128800 AR    South            20500
##  5 California            2013 Employer     17747300 CA    West            278168
##  6 Colorado              2013 Employer      2852500 CO    West             34090
##  7 Connecticut           2013 Employer      2030500 CT    Northeast        34223
##  8 Delaware              2013 Employer       473700 DE    South             9038
##  9 District of Columbia  2013 Employer       324300 DC    South             7443
## 10 Florida               2013 Employer      8023400 FL    South           150547
## # ... with 704 more rows

Another problem is that inside our hc dataset, we have seen there are multiple types of healthcare coverage,

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
table(hc$type)
## 
##     Employer     Medicaid     Medicare    Non-Group Other Public        Total 
##          102          102          102          102          102          102 
##    Uninsured 
##          102

The total type is not really a formal type of healthcare coverage. It really represents just the total number of people in the state. This is useful information and we can include it as a column called tot_pop.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
pop <- hc %>%
filter(type == "Total") %>%
select(Location, year, tot_coverage) 
pop
## # A tibble: 102 x 3
##    Location              year tot_coverage
##    <chr>                <int>        <int>
##  1 Alabama               2013      4763900
##  2 Alaska                2013       702000
##  3 Arizona               2013      6603100
##  4 Arkansas              2013      2904800
##  5 California            2013     38176400
##  6 Colorado              2013      5297800
##  7 Connecticut           2013      3578900
##  8 Delaware              2013       909300
##  9 District of Columbia  2013       652100
## 10 Florida               2013     19429000
## # ... with 92 more rows
library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
filter(type != "Total") %>%
left_join(pop, by = c("Location", "year")) %>%
rename(tot_coverage = tot_coverage.x, tot_pop = tot_coverage.y) 
hc
## # A tibble: 612 x 8
##    Location         year type    tot_coverage abb   region  tot_spending tot_pop
##    <chr>           <int> <chr>          <int> <chr> <fct>          <dbl>   <int>
##  1 Alabama          2013 Employ~      2126500 AL    South          33788  4.76e6
##  2 Alaska           2013 Employ~       364900 AK    West            7684  7.02e5
##  3 Arizona          2013 Employ~      2883800 AZ    West           41481  6.60e6
##  4 Arkansas         2013 Employ~      1128800 AR    South          20500  2.90e6
##  5 California       2013 Employ~     17747300 CA    West          278168  3.82e7
##  6 Colorado         2013 Employ~      2852500 CO    West           34090  5.30e6
##  7 Connecticut      2013 Employ~      2030500 CT    Northe~        34223  3.58e6
##  8 Delaware         2013 Employ~       473700 DE    South           9038  9.09e5
##  9 District of Co~  2013 Employ~       324300 DC    South           7443  6.52e5
## 10 Florida          2013 Employ~      8023400 FL    South         150547  1.94e7
## # ... with 602 more rows

Is there a relationship between healthcare coverage and healthcare spending in the United States?

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc.employer.2013 <- hc %>%
filter(type == "Employer", year == "2013")
plot(hc.employer.2013$tot_spending,
hc.employer.2013$tot_coverage, log = "xy", xlab = "spending", ylab = "coverage")

We see there is a strong relationship.

we will calculate the proportion of people who are coverage in each state, year and type.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
mutate(prop_coverage = tot_coverage/tot_pop)
hc
## # A tibble: 612 x 9
##    Location  year type  tot_coverage abb   region tot_spending tot_pop
##    <chr>    <int> <chr>        <int> <chr> <fct>         <dbl>   <int>
##  1 Alabama   2013 Empl~      2126500 AL    South         33788  4.76e6
##  2 Alaska    2013 Empl~       364900 AK    West           7684  7.02e5
##  3 Arizona   2013 Empl~      2883800 AZ    West          41481  6.60e6
##  4 Arkansas  2013 Empl~      1128800 AR    South         20500  2.90e6
##  5 Califor~  2013 Empl~     17747300 CA    West         278168  3.82e7
##  6 Colorado  2013 Empl~      2852500 CO    West          34090  5.30e6
##  7 Connect~  2013 Empl~      2030500 CT    North~        34223  3.58e6
##  8 Delaware  2013 Empl~       473700 DE    South          9038  9.09e5
##  9 Distric~  2013 Empl~       324300 DC    South          7443  6.52e5
## 10 Florida   2013 Empl~      8023400 FL    South        150547  1.94e7
## # ... with 602 more rows, and 1 more variable: prop_coverage <dbl>

We will add the spending per capita (or spending per person) in dollars and name this column spending_capita.

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
mutate(spending_capita = (tot_spending*1e6) / tot_pop)
hc %>% select(prop_coverage, spending_capita)
## # A tibble: 612 x 2
##    prop_coverage spending_capita
##            <dbl>           <dbl>
##  1         0.446           7093.
##  2         0.520          10946.
##  3         0.437           6282.
##  4         0.389           7057.
##  5         0.465           7286.
##  6         0.538           6435.
##  7         0.567           9562.
##  8         0.521           9940.
##  9         0.497          11414.
## 10         0.413           7749.
## # ... with 602 more rows
library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc.employer.2013 <- hc %>%
filter(type == "Employer", year == "2013")
plot(hc.employer.2013$spending_capita,
hc.employer.2013$prop_coverage, log = "xy", xlab = "spending per capita",
ylab = "proportion of Employer coverage")

library(tidyr)
 
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) + geom_point() +
xlab("spending per capita") + ylab("coverage proportion") +
geom_smooth(method = "lm", col = "blue")

Conclusion

The total healthcare expenditure is associated with the population. To make a fair comparison, we create “healthcare expenditure per capita.” Further, the exploratory analysis via data visualization showed higher spending in healthcare per capita is positively associated with higher employer coverage proportion and is negatively associated with the porportion of uninsured population across the States.