There are several forms of health coverage in the United States of America, Each state has its own health policy; Among those that exist are the health care policy directed by the federal government and the insurance company. It is important to have a general idea about the economics of health in the States. It is required to have an idea of health spending, including healthcare and healthcare coverage and spending in all states. The question is. Is There a Relationship Between Health Coverage and Health Spending in the United States? Data will be collected, data will be imported, data will be cleaned, data will be analyzed and data will be visualized, using well-established and commonly used packages, including library (datasets), library (tidyr), library (dplyr), ggplot2, and ggrepel.
We will be using the data from the Henry J Kaiser Family Foundation (KFF) (https://www.k
Health Insurance Coverage of the Total Population (https://www.kff.org/other/state- indicator/total-population/) - Includes years 2013-2016 Health Care Expenditures by State of Residence (in millions) (https://www.kff.org/other/state- indicator/health-care-expenditures-by-state-of-residence-in-millions/) - Includes years 1991- 2014
we first define the path to the data.
the tidyverse package will be used.
library(tidyverse)
library(readr)
the web data will be read.
library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
read_lines(file = "healthcare-coverage.csv", n_max = 10)
## [1] "\"Title: Health Insurance Coverage of the Total Population | The Henry J. Kaiser Family Foundation\""
## [2] "\"Timeframe: 2013 - 2016\""
## [3] "\"Location\",\"2013__Employer\",\"2013__Non-Group\",\"2013__Medicaid\",\"2013__Medicare\",\"2013__Other Public\",\"2013__Uninsured\",\"2013__Total\",\"2014__Employer\",\"2014__Non-Group\",\"2014__Medicaid\",\"2014__Medicare\",\"2014__Other Public\",\"2014__Uninsured\",\"2014__Total\",\"2015__Employer\",\"2015__Non-Group\",\"2015__Medicaid\",\"2015__Medicare\",\"2015__Other Public\",\"2015__Uninsured\",\"2015__Total\",\"2016__Employer\",\"2016__Non-Group\",\"2016__Medicaid\",\"2016__Medicare\",\"2016__Other Public\",\"2016__Uninsured\",\"2016__Total\""
## [4] "\"United States\",\"155696900\",\"13816000\",\"54919100\",\"40876300\",\"6295400\",\"41795100\",\"313401200\",\"154347500\",\"19313000\",\"61650400\",\"41896500\",\"5985000\",\"32967500\",\"316159900\",\"155965800\",\"21816500\",\"62384500\",\"43308400\",\"6422300\",\"28965900\",\"318868500\",\"157381500\",\"21884400\",\"62303400\",\"44550200\",\"6192200\",\"28051900\",\"320372000\""
## [5] "\"Alabama\",\"2126500\",\"174200\",\"869700\",\"783000\",\"85600\",\"724800\",\"4763900\",\"2202800\",\"288900\",\"891900\",\"718400\",\"143900\",\"522200\",\"4768000\",\"2218000\",\"291500\",\"911400\",\"719100\",\"174600\",\"519400\",\"4833900\",\"2263800\",\"262400\",\"997000\",\"761200\",\"128800\",\"420800\",\"4834100\""
## [6] "\"Alaska\",\"364900\",\"24000\",\"95000\",\"55200\",\"60600\",\"102200\",\"702000\",\"345300\",\"26800\",\"130100\",\"55300\",\"37300\",\"100800\",\"695700\",\"355700\",\"22300\",\"128100\",\"60900\",\"47700\",\"90500\",\"705300\",\"324400\",\"20300\",\"145400\",\"68200\",\"55600\",\"96900\",\"710800\""
## [7] "\"Arizona\",\"2883800\",\"170800\",\"1346100\",\"842000\",\"N/A\",\"1223000\",\"6603100\",\"2835200\",\"333500\",\"1639400\",\"911100\",\"N/A\",\"827100\",\"6657200\",\"2766500\",\"278400\",\"1711500\",\"949000\",\"189300\",\"844800\",\"6739500\",\"3010700\",\"377000\",\"1468400\",\"1028000\",\"172500\",\"833700\",\"6890200\""
## [8] "\"Arkansas\",\"1128800\",\"155600\",\"600800\",\"515200\",\"67600\",\"436800\",\"2904800\",\"1176500\",\"231700\",\"639200\",\"479400\",\"82000\",\"287200\",\"2896000\",\"1293700\",\"200200\",\"641400\",\"484500\",\"63700\",\"268400\",\"2953000\",\"1290900\",\"252900\",\"618600\",\"490000\",\"67500\",\"225500\",\"2945300\""
## [9] "\"California\",\"17747300\",\"1986400\",\"8344800\",\"3828500\",\"675400\",\"5594100\",\"38176400\",\"17703700\",\"2778800\",\"9618800\",\"4049000\",\"634400\",\"3916700\",\"38701300\",\"17718300\",\"3444200\",\"10138100\",\"4080100\",\"752700\",\"2980600\",\"39113900\",\"18116200\",\"3195400\",\"9853800\",\"4436000\",\"556100\",\"3030800\",\"39188300\""
## [10] "\"Colorado\",\"2852500\",\"426300\",\"697300\",\"549700\",\"118100\",\"654000\",\"5297800\",\"2489400\",\"397900\",\"1053700\",\"619500\",\"214000\",\"602900\",\"5377400\",\"2706000\",\"346900\",\"1036600\",\"708000\",\"148000\",\"475700\",\"5421300\",\"2872600\",\"370000\",\"855800\",\"692400\",\"190100\",\"528400\",\"5509200\""
the first two lines are descriptive and are not useful.
library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- read_csv("healthcare-coverage.csv", skip = 2, col_names = TRUE)
head(coverage)
## # A tibble: 6 x 29
## Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 United ~ 155696900 13816000 54919100 40876300
## 2 Alabama 2126500 174200 869700 783000
## 3 Alaska 364900 24000 95000 55200
## 4 Arizona 2883800 170800 1346100 842000
## 5 Arkansas 1128800 155600 600800 515200
## 6 Califor~ 17747300 1986400 8344800 3828500
## # ... with 24 more variables: `2013__Other Public` <chr>,
## # `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
## # `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
## # `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
## # `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
## # `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
## # `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
## # `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
## # `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>
it is verified if there are NAs:
library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
tail(coverage)
## # A tibble: 6 x 29
## Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> NA NA NA NA
## 2 *Uninsu~ NA NA NA NA
## 3 <NA> NA NA NA NA
## 4 For exa~ NA NA NA NA
## 5 <NA> NA NA NA NA
## 6 *N/A*: ~ NA NA NA NA
## # ... with 24 more variables: `2013__Other Public` <chr>,
## # `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
## # `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
## # `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
## # `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
## # `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
## # `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
## # `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
## # `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>
It looks like there is a line with a string Notes in it and everything below that line should not be read in. We can use the n_max argument here.
library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- read_csv("healthcare-coverage.csv", skip = 2, col_names = TRUE)
coverage <- read_csv("healthcare-coverage.csv", skip = 2, col_names = TRUE, n_max = which(coverage$Location == "Notes")-1)
tail(coverage)
## # A tibble: 6 x 29
## Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Vermont 317700 26200 123400 96600
## 2 Virginia 4661600 364800 773200 968000
## 3 Washing~ 3541600 309000 1026800 879000
## 4 West Vi~ 841300 42600 382500 329400
## 5 Wiscons~ 3154500 225300 907600 812900
## 6 Wyoming 305900 19500 74200 65400
## # ... with 24 more variables: `2013__Other Public` <chr>,
## # `2013__Uninsured` <dbl>, `2013__Total` <dbl>, `2014__Employer` <dbl>,
## # `2014__Non-Group` <dbl>, `2014__Medicaid` <dbl>, `2014__Medicare` <dbl>,
## # `2014__Other Public` <chr>, `2014__Uninsured` <dbl>, `2014__Total` <dbl>,
## # `2015__Employer` <dbl>, `2015__Non-Group` <dbl>, `2015__Medicaid` <dbl>,
## # `2015__Medicare` <dbl>, `2015__Other Public` <chr>,
## # `2015__Uninsured` <dbl>, `2015__Total` <dbl>, `2016__Employer` <dbl>,
## # `2016__Non-Group` <dbl>, `2016__Medicaid` <dbl>, `2016__Medicare` <dbl>,
## # `2016__Other Public` <chr>, `2016__Uninsured` <dbl>, `2016__Total` <dbl>
library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending <- read_csv("healthcare-spending.csv", skip = 2, col_names = TRUE)
spending <- read_csv("healthcare-spending.csv", skip = 2, col_names = TRUE, n_max = which(spending$Location == "Notes")-1)
tail(spending)
## # A tibble: 6 x 25
## Location `1991__Total He~ `1992__Total He~ `1993__Total He~ `1994__Total He~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Vermont 1330 1421 1522 1625
## 2 Virginia 14829 15599 16634 17637
## 3 Washing~ 12674 13859 14523 15303
## 4 West Vi~ 4672 5159 5550 5891
## 5 Wiscons~ 12694 13669 14636 15532
## 6 Wyoming 1023 1067 1171 1265
## # ... with 20 more variables: `1995__Total Health Spending` <dbl>, `1996__Total
## # Health Spending` <dbl>, `1997__Total Health Spending` <dbl>, `1998__Total
## # Health Spending` <dbl>, `1999__Total Health Spending` <dbl>, `2000__Total
## # Health Spending` <dbl>, `2001__Total Health Spending` <dbl>, `2002__Total
## # Health Spending` <dbl>, `2003__Total Health Spending` <dbl>, `2004__Total
## # Health Spending` <dbl>, `2005__Total Health Spending` <dbl>, `2006__Total
## # Health Spending` <dbl>, `2007__Total Health Spending` <dbl>, `2008__Total
## # Health Spending` <dbl>, `2009__Total Health Spending` <dbl>, `2010__Total
## # Health Spending` <dbl>, `2011__Total Health Spending` <dbl>, `2012__Total
## # Health Spending` <dbl>, `2013__Total Health Spending` <dbl>, `2014__Total
## # Health Spending` <dbl>
the coverage data frame, we see that it is not “tidy”:
library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
glimpse(coverage)
## Rows: 52
## Columns: 29
## $ Location <chr> "United States", "Alabama", "Alaska", "Arizona...
## $ `2013__Employer` <dbl> 155696900, 2126500, 364900, 2883800, 1128800, ...
## $ `2013__Non-Group` <dbl> 13816000, 174200, 24000, 170800, 155600, 19864...
## $ `2013__Medicaid` <dbl> 54919100, 869700, 95000, 1346100, 600800, 8344...
## $ `2013__Medicare` <dbl> 40876300, 783000, 55200, 842000, 515200, 38285...
## $ `2013__Other Public` <chr> "6295400", "85600", "60600", "N/A", "67600", "...
## $ `2013__Uninsured` <dbl> 41795100, 724800, 102200, 1223000, 436800, 559...
## $ `2013__Total` <dbl> 313401200, 4763900, 702000, 6603100, 2904800, ...
## $ `2014__Employer` <dbl> 154347500, 2202800, 345300, 2835200, 1176500, ...
## $ `2014__Non-Group` <dbl> 19313000, 288900, 26800, 333500, 231700, 27788...
## $ `2014__Medicaid` <dbl> 61650400, 891900, 130100, 1639400, 639200, 961...
## $ `2014__Medicare` <dbl> 41896500, 718400, 55300, 911100, 479400, 40490...
## $ `2014__Other Public` <chr> "5985000", "143900", "37300", "N/A", "82000", ...
## $ `2014__Uninsured` <dbl> 32967500, 522200, 100800, 827100, 287200, 3916...
## $ `2014__Total` <dbl> 316159900, 4768000, 695700, 6657200, 2896000, ...
## $ `2015__Employer` <dbl> 155965800, 2218000, 355700, 2766500, 1293700, ...
## $ `2015__Non-Group` <dbl> 21816500, 291500, 22300, 278400, 200200, 34442...
## $ `2015__Medicaid` <dbl> 62384500, 911400, 128100, 1711500, 641400, 101...
## $ `2015__Medicare` <dbl> 43308400, 719100, 60900, 949000, 484500, 40801...
## $ `2015__Other Public` <chr> "6422300", "174600", "47700", "189300", "63700...
## $ `2015__Uninsured` <dbl> 28965900, 519400, 90500, 844800, 268400, 29806...
## $ `2015__Total` <dbl> 318868500, 4833900, 705300, 6739500, 2953000, ...
## $ `2016__Employer` <dbl> 157381500, 2263800, 324400, 3010700, 1290900, ...
## $ `2016__Non-Group` <dbl> 21884400, 262400, 20300, 377000, 252900, 31954...
## $ `2016__Medicaid` <dbl> 62303400, 997000, 145400, 1468400, 618600, 985...
## $ `2016__Medicare` <dbl> 44550200, 761200, 68200, 1028000, 490000, 4436...
## $ `2016__Other Public` <chr> "6192200", "128800", "55600", "172500", "67500...
## $ `2016__Uninsured` <dbl> 28051900, 420800, 96900, 833700, 225500, 30308...
## $ `2016__Total` <dbl> 320372000, 4834100, 710800, 6890200, 2945300, ...
library(tidyverse)
library(readr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
unique(coverage$Location)
## [1] "United States" "Alabama" "Alaska"
## [4] "Arizona" "Arkansas" "California"
## [7] "Colorado" "Connecticut" "Delaware"
## [10] "District of Columbia" "Florida" "Georgia"
## [13] "Hawaii" "Idaho" "Illinois"
## [16] "Indiana" "Iowa" "Kansas"
## [19] "Kentucky" "Louisiana" "Maine"
## [22] "Maryland" "Massachusetts" "Michigan"
## [25] "Minnesota" "Mississippi" "Missouri"
## [28] "Montana" "Nebraska" "Nevada"
## [31] "New Hampshire" "New Jersey" "New Mexico"
## [34] "New York" "North Carolina" "North Dakota"
## [37] "Ohio" "Oklahoma" "Oregon"
## [40] "Pennsylvania" "Rhode Island" "South Carolina"
## [43] "South Dakota" "Tennessee" "Texas"
## [46] "Utah" "Vermont" "Virginia"
## [49] "Washington" "West Virginia" "Wisconsin"
## [52] "Wyoming"
Let’s look what states are inside the state dataset.
library(tidyverse)
library(readr)
library(datasets)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
data(state)
unique(state.name)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
dealing with DC as a special case.
library(tidyverse)
library(readr)
library(datasets)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
state.abb <- c(state.abb, "DC")
state.region <- as.factor(c(as.character(state.region), "South"))
state.name <- c(state.name, "District of Columbia")
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
state.abb <- c(state.abb, "DC")
state.region <- as.factor(c(as.character(state.region), "South"))
state.name <- c(state.name, "District of Columbia")
create a column titled year_type and coverage . also want to keep the Location column as it is because it also contains observational level data.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- gather(coverage, "year_type", "tot_coverage", -Location)
coverage
## # A tibble: 1,456 x 3
## Location year_type tot_coverage
## <chr> <chr> <chr>
## 1 United States 2013__Employer 155696900
## 2 Alabama 2013__Employer 2126500
## 3 Alaska 2013__Employer 364900
## 4 Arizona 2013__Employer 2883800
## 5 Arkansas 2013__Employer 1128800
## 6 California 2013__Employer 17747300
## 7 Colorado 2013__Employer 2852500
## 8 Connecticut 2013__Employer 2030500
## 9 Delaware 2013__Employer 473700
## 10 District of Columbia 2013__Employer 324300
## # ... with 1,446 more rows
Convert healthcare spending data to a long format .
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending <- gather(spending, "year", "tot_spending", -Location)
spending
## # A tibble: 1,248 x 3
## Location year tot_spending
## <chr> <chr> <dbl>
## 1 United States 1991__Total Health Spending 675896
## 2 Alabama 1991__Total Health Spending 10393
## 3 Alaska 1991__Total Health Spending 1458
## 4 Arizona 1991__Total Health Spending 9269
## 5 Arkansas 1991__Total Health Spending 5632
## 6 California 1991__Total Health Spending 81438
## 7 Colorado 1991__Total Health Spending 8460
## 8 Connecticut 1991__Total Health Spending 10950
## 9 Delaware 1991__Total Health Spending 1938
## 10 District of Columbia 1991__Total Health Spending 2800
## # ... with 1,238 more rows
Convert data from long format to wide format
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spread(coverage, year_type, tot_coverage)
## # A tibble: 52 x 29
## Location `2013__Employer` `2013__Medicaid` `2013__Medicare` `2013__Non-Grou~
## <chr> <chr> <chr> <chr> <chr>
## 1 Alabama 2126500 869700 783000 174200
## 2 Alaska 364900 95000 55200 24000
## 3 Arizona 2883800 1346100 842000 170800
## 4 Arkansas 1128800 600800 515200 155600
## 5 Califor~ 17747300 8344800 3828500 1986400
## 6 Colorado 2852500 697300 549700 426300
## 7 Connect~ 2030500 532000 475300 126800
## 8 Delaware 473700 192700 141300 25100
## 9 Distric~ 324300 174900 59900 30400
## 10 Florida 8023400 3190900 3108800 968200
## # ... with 42 more rows, and 24 more variables: `2013__Other Public` <chr>,
## # `2013__Total` <chr>, `2013__Uninsured` <chr>, `2014__Employer` <chr>,
## # `2014__Medicaid` <chr>, `2014__Medicare` <chr>, `2014__Non-Group` <chr>,
## # `2014__Other Public` <chr>, `2014__Total` <chr>, `2014__Uninsured` <chr>,
## # `2015__Employer` <chr>, `2015__Medicaid` <chr>, `2015__Medicare` <chr>,
## # `2015__Non-Group` <chr>, `2015__Other Public` <chr>, `2015__Total` <chr>,
## # `2015__Uninsured` <chr>, `2016__Employer` <chr>, `2016__Medicaid` <chr>,
## # `2016__Medicare` <chr>, `2016__Non-Group` <chr>, `2016__Other
## # Public` <chr>, `2016__Total` <chr>, `2016__Uninsured` <chr>
let’s separate the year_type column in the coverage dataset to two columns: year and health coverage type .
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage %>%
separate(year_type, sep="__", into=c("year", "type"))
## # A tibble: 1,456 x 4
## Location year type tot_coverage
## <chr> <chr> <chr> <chr>
## 1 United States 2013 Employer 155696900
## 2 Alabama 2013 Employer 2126500
## 3 Alaska 2013 Employer 364900
## 4 Arizona 2013 Employer 2883800
## 5 Arkansas 2013 Employer 1128800
## 6 California 2013 Employer 17747300
## 7 Colorado 2013 Employer 2852500
## 8 Connecticut 2013 Employer 2030500
## 9 Delaware 2013 Employer 473700
## 10 District of Columbia 2013 Employer 324300
## # ... with 1,446 more rows
coverage
## # A tibble: 1,456 x 3
## Location year_type tot_coverage
## <chr> <chr> <chr>
## 1 United States 2013__Employer 155696900
## 2 Alabama 2013__Employer 2126500
## 3 Alaska 2013__Employer 364900
## 4 Arizona 2013__Employer 2883800
## 5 Arkansas 2013__Employer 1128800
## 6 California 2013__Employer 17747300
## 7 Colorado 2013__Employer 2852500
## 8 Connecticut 2013__Employer 2030500
## 9 Delaware 2013__Employer 473700
## 10 District of Columbia 2013__Employer 324300
## # ... with 1,446 more rows
We see that we now have two columns, except the year column was converted to a character. If we look at the help file ?separate , we see we can use the convert=TRUE argument to convert the character to an integer
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- coverage %>%
separate(year_type, sep="__", into=c("year", "type"), convert = TRUE)
coverage
## # A tibble: 1,456 x 4
## Location year type tot_coverage
## <chr> <int> <chr> <chr>
## 1 United States 2013 Employer 155696900
## 2 Alabama 2013 Employer 2126500
## 3 Alaska 2013 Employer 364900
## 4 Arizona 2013 Employer 2883800
## 5 Arkansas 2013 Employer 1128800
## 6 California 2013 Employer 17747300
## 7 Colorado 2013 Employer 2852500
## 8 Connecticut 2013 Employer 2030500
## 9 Delaware 2013 Employer 473700
## 10 District of Columbia 2013 Employer 324300
## # ... with 1,446 more rows
we see that the tot_coverage column is also a character.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
coverage <- coverage %>%
mutate_at("tot_coverage", as.integer)
# Add the abbreviation of States
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]
coverage
## # A tibble: 1,456 x 6
## Location year type tot_coverage abb region
## <chr> <int> <chr> <int> <chr> <fct>
## 1 United States 2013 Employer 155696900 <NA> <NA>
## 2 Alabama 2013 Employer 2126500 AL South
## 3 Alaska 2013 Employer 364900 AK West
## 4 Arizona 2013 Employer 2883800 AZ West
## 5 Arkansas 2013 Employer 1128800 AR South
## 6 California 2013 Employer 17747300 CA West
## 7 Colorado 2013 Employer 2852500 CO West
## 8 Connecticut 2013 Employer 2030500 CT Northeast
## 9 Delaware 2013 Employer 473700 DE South
## 10 District of Columbia 2013 Employer 324300 DC South
## # ... with 1,446 more rows
We see that there are different year s and different types of healthcare coverage.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
# Add the abbreviation of States
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]
coverage
## # A tibble: 1,456 x 6
## Location year type tot_coverage abb region
## <chr> <int> <chr> <int> <chr> <fct>
## 1 United States 2013 Employer 155696900 <NA> <NA>
## 2 Alabama 2013 Employer 2126500 AL South
## 3 Alaska 2013 Employer 364900 AK West
## 4 Arizona 2013 Employer 2883800 AZ West
## 5 Arkansas 2013 Employer 1128800 AR South
## 6 California 2013 Employer 17747300 CA West
## 7 Colorado 2013 Employer 2852500 CO West
## 8 Connecticut 2013 Employer 2030500 CT Northeast
## 9 Delaware 2013 Employer 473700 DE South
## 10 District of Columbia 2013 Employer 324300 DC South
## # ... with 1,446 more rows
Next, we will look at the spending data. We see the year column has information that we do not want. We only care about the year.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending
## # A tibble: 1,248 x 3
## Location year tot_spending
## <chr> <chr> <dbl>
## 1 United States 1991__Total Health Spending 675896
## 2 Alabama 1991__Total Health Spending 10393
## 3 Alaska 1991__Total Health Spending 1458
## 4 Arizona 1991__Total Health Spending 9269
## 5 Arkansas 1991__Total Health Spending 5632
## 6 California 1991__Total Health Spending 81438
## 7 Colorado 1991__Total Health Spending 8460
## 8 Connecticut 1991__Total Health Spending 10950
## 9 Delaware 1991__Total Health Spending 1938
## 10 District of Columbia 1991__Total Health Spending 2800
## # ... with 1,238 more rows
To select all the columns except a specific column, use the “-” (subtraction) operator (also known as negative indexing).
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
spending <- spending %>%
separate(year, sep="__", into=c("year", "name"), convert = TRUE) %>% select(-name)
spending
## # A tibble: 1,248 x 3
## Location year tot_spending
## <chr> <int> <dbl>
## 1 United States 1991 675896
## 2 Alabama 1991 10393
## 3 Alaska 1991 1458
## 4 Arizona 1991 9269
## 5 Arkansas 1991 5632
## 6 California 1991 81438
## 7 Colorado 1991 8460
## 8 Connecticut 1991 10950
## 9 Delaware 1991 1938
## 10 District of Columbia 1991 2800
## # ... with 1,238 more rows
coverage and spending together, we have to decide a few things. Both share a Location column and a year column. However, the range of years is different between datasets. Do we want a dataset with all the years available or only a portion of it? Because there is spending information from 1991-2014, and coverage information from 2013-2016.
hc <- left_join(coverage, spending, by = c(“Location”, “year”)) head(hc)
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- left_join(coverage, spending, by = c("Location", "year"))
head(hc)
## # A tibble: 6 x 7
## Location year type tot_coverage abb region tot_spending
## <chr> <int> <chr> <int> <chr> <fct> <dbl>
## 1 United States 2013 Employer 155696900 <NA> <NA> 2435624
## 2 Alabama 2013 Employer 2126500 AL South 33788
## 3 Alaska 2013 Employer 364900 AK West 7684
## 4 Arizona 2013 Employer 2883800 AZ West 41481
## 5 Arkansas 2013 Employer 1128800 AR South 20500
## 6 California 2013 Employer 17747300 CA West 278168
We see that the new hc dataset includes all the years from 2013-2016 (as that is the range of years in coverage ), but because the spending dataset only goes to 2014, the tot_spending is reported as NA for years 2015 and 2016.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- right_join(coverage, spending, by = c("Location", "year"))
head(hc)
## # A tibble: 6 x 7
## Location year type tot_coverage abb region tot_spending
## <chr> <int> <chr> <int> <chr> <fct> <dbl>
## 1 United States 2013 Employer 155696900 <NA> <NA> 2435624
## 2 Alabama 2013 Employer 2126500 AL South 33788
## 3 Alaska 2013 Employer 364900 AK West 7684
## 4 Arizona 2013 Employer 2883800 AZ West 41481
## 5 Arkansas 2013 Employer 1128800 AR South 20500
## 6 California 2013 Employer 17747300 CA West 278168
Here, we see every row in the spending dataset is there, but with NAs for the years that there was no coverage data.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- inner_join(coverage, spending, by = c("Location", "year"))
head(hc)
## # A tibble: 6 x 7
## Location year type tot_coverage abb region tot_spending
## <chr> <int> <chr> <int> <chr> <fct> <dbl>
## 1 United States 2013 Employer 155696900 <NA> <NA> 2435624
## 2 Alabama 2013 Employer 2126500 AL South 33788
## 3 Alaska 2013 Employer 364900 AK West 7684
## 4 Arizona 2013 Employer 2883800 AZ West 41481
## 5 Arkansas 2013 Employer 1128800 AR South 20500
## 6 California 2013 Employer 17747300 CA West 278168
we are only interested in looking at US states, we can remove the rows corresponding to the Location == “United States”.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
filter(Location != "United States")
hc
## # A tibble: 714 x 7
## Location year type tot_coverage abb region tot_spending
## <chr> <int> <chr> <int> <chr> <fct> <dbl>
## 1 Alabama 2013 Employer 2126500 AL South 33788
## 2 Alaska 2013 Employer 364900 AK West 7684
## 3 Arizona 2013 Employer 2883800 AZ West 41481
## 4 Arkansas 2013 Employer 1128800 AR South 20500
## 5 California 2013 Employer 17747300 CA West 278168
## 6 Colorado 2013 Employer 2852500 CO West 34090
## 7 Connecticut 2013 Employer 2030500 CT Northeast 34223
## 8 Delaware 2013 Employer 473700 DE South 9038
## 9 District of Columbia 2013 Employer 324300 DC South 7443
## 10 Florida 2013 Employer 8023400 FL South 150547
## # ... with 704 more rows
Another problem is that inside our hc dataset, we have seen there are multiple types of healthcare coverage,
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
table(hc$type)
##
## Employer Medicaid Medicare Non-Group Other Public Total
## 102 102 102 102 102 102
## Uninsured
## 102
The total type is not really a formal type of healthcare coverage. It really represents just the total number of people in the state. This is useful information and we can include it as a column called tot_pop.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
pop <- hc %>%
filter(type == "Total") %>%
select(Location, year, tot_coverage)
pop
## # A tibble: 102 x 3
## Location year tot_coverage
## <chr> <int> <int>
## 1 Alabama 2013 4763900
## 2 Alaska 2013 702000
## 3 Arizona 2013 6603100
## 4 Arkansas 2013 2904800
## 5 California 2013 38176400
## 6 Colorado 2013 5297800
## 7 Connecticut 2013 3578900
## 8 Delaware 2013 909300
## 9 District of Columbia 2013 652100
## 10 Florida 2013 19429000
## # ... with 92 more rows
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
filter(type != "Total") %>%
left_join(pop, by = c("Location", "year")) %>%
rename(tot_coverage = tot_coverage.x, tot_pop = tot_coverage.y)
hc
## # A tibble: 612 x 8
## Location year type tot_coverage abb region tot_spending tot_pop
## <chr> <int> <chr> <int> <chr> <fct> <dbl> <int>
## 1 Alabama 2013 Employ~ 2126500 AL South 33788 4.76e6
## 2 Alaska 2013 Employ~ 364900 AK West 7684 7.02e5
## 3 Arizona 2013 Employ~ 2883800 AZ West 41481 6.60e6
## 4 Arkansas 2013 Employ~ 1128800 AR South 20500 2.90e6
## 5 California 2013 Employ~ 17747300 CA West 278168 3.82e7
## 6 Colorado 2013 Employ~ 2852500 CO West 34090 5.30e6
## 7 Connecticut 2013 Employ~ 2030500 CT Northe~ 34223 3.58e6
## 8 Delaware 2013 Employ~ 473700 DE South 9038 9.09e5
## 9 District of Co~ 2013 Employ~ 324300 DC South 7443 6.52e5
## 10 Florida 2013 Employ~ 8023400 FL South 150547 1.94e7
## # ... with 602 more rows
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc.employer.2013 <- hc %>%
filter(type == "Employer", year == "2013")
plot(hc.employer.2013$tot_spending,
hc.employer.2013$tot_coverage, log = "xy", xlab = "spending", ylab = "coverage")
We see there is a strong relationship.
we will calculate the proportion of people who are coverage in each state, year and type.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
mutate(prop_coverage = tot_coverage/tot_pop)
hc
## # A tibble: 612 x 9
## Location year type tot_coverage abb region tot_spending tot_pop
## <chr> <int> <chr> <int> <chr> <fct> <dbl> <int>
## 1 Alabama 2013 Empl~ 2126500 AL South 33788 4.76e6
## 2 Alaska 2013 Empl~ 364900 AK West 7684 7.02e5
## 3 Arizona 2013 Empl~ 2883800 AZ West 41481 6.60e6
## 4 Arkansas 2013 Empl~ 1128800 AR South 20500 2.90e6
## 5 Califor~ 2013 Empl~ 17747300 CA West 278168 3.82e7
## 6 Colorado 2013 Empl~ 2852500 CO West 34090 5.30e6
## 7 Connect~ 2013 Empl~ 2030500 CT North~ 34223 3.58e6
## 8 Delaware 2013 Empl~ 473700 DE South 9038 9.09e5
## 9 Distric~ 2013 Empl~ 324300 DC South 7443 6.52e5
## 10 Florida 2013 Empl~ 8023400 FL South 150547 1.94e7
## # ... with 602 more rows, and 1 more variable: prop_coverage <dbl>
We will add the spending per capita (or spending per person) in dollars and name this column spending_capita.
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc <- hc %>%
mutate(spending_capita = (tot_spending*1e6) / tot_pop)
hc %>% select(prop_coverage, spending_capita)
## # A tibble: 612 x 2
## prop_coverage spending_capita
## <dbl> <dbl>
## 1 0.446 7093.
## 2 0.520 10946.
## 3 0.437 6282.
## 4 0.389 7057.
## 5 0.465 7286.
## 6 0.538 6435.
## 7 0.567 9562.
## 8 0.521 9940.
## 9 0.497 11414.
## 10 0.413 7749.
## # ... with 602 more rows
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc.employer.2013 <- hc %>%
filter(type == "Employer", year == "2013")
plot(hc.employer.2013$spending_capita,
hc.employer.2013$prop_coverage, log = "xy", xlab = "spending per capita",
ylab = "proportion of Employer coverage")
library(tidyr)
setwd("D:/coursera/TIDYVERSE/healthcare/data_files")
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) + geom_point() +
xlab("spending per capita") + ylab("coverage proportion") +
geom_smooth(method = "lm", col = "blue")
The total healthcare expenditure is associated with the population. To make a fair comparison, we create “healthcare expenditure per capita.” Further, the exploratory analysis via data visualization showed higher spending in healthcare per capita is positively associated with higher employer coverage proportion and is negatively associated with the porportion of uninsured population across the States.