Homework 5 should be completed using R Markdown. Submit the code as a Markdown File AND a “knitted” HTML or PDF file in a zipped folder to Blackboard. (Blackboard doesn’t allow html files to be uploaded unless in a zipped folder.) Use headers and other formatting tools to clearly indicate where your code and solutions for each question begin. Delete the instructions I include for you and instead document what you are doing, why you are doing it, and your interpretations of the output. Tell the story of your data cleaning and analysis, and annotate your code.

library(tidyverse) 
library("kableExtra") # Making cool tables in R

Opening Data

Open the coverage and spending data sets. These code chunks should work without additional edits until you get to the Pivoting stage in the Wranging section of the assignment.

Coverage Data

Use the read_csv() command to open the coverage dataset. How many rows and columns are there?

if you know there should be 50 states in your data, the fact that there is more than 50 rows is a sign that something may be suspicious about the data

coverage_nottidy <- read_csv('coverage_nottidy.csv')
coverage_nottidy #view tibble
#78 rows, 29 variables

# look at the last 30 rows:
tail(coverage_nottidy, n = 30)

## # A tibble: 30 x 29
##    Location  `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
##    <chr>                <dbl>            <dbl>            <dbl>            <dbl>
##  1 "Washing~          3541600           309000          1026800           879000
##  2 "West Vi~           841300            42600           382500           329400
##  3 "Wiscons~          3154500           225300           907600           812900
##  4 "Wyoming"           305900            19500            74200            65400
##  5 "Notes"                 NA               NA               NA               NA
##  6 "The maj~               NA               NA               NA               NA
##  7  <NA>                   NA               NA               NA               NA
##  8 "In this~               NA               NA               NA               NA
##  9  <NA>                   NA               NA               NA               NA
## 10 "Data ex~               NA               NA               NA               NA
## # ... with 20 more rows, and 24 more variables: 2013__Other Public <chr>,
## #   2013__Uninsured <dbl>, 2013__Total <dbl>, 2014__Employer <dbl>,
## #   2014__Non-Group <dbl>, 2014__Medicaid <dbl>, 2014__Medicare <dbl>,
## #   2014__Other Public <chr>, 2014__Uninsured <dbl>, 2014__Total <dbl>,
## #   2015__Employer <dbl>, 2015__Non-Group <dbl>, 2015__Medicaid <dbl>,
## #   2015__Medicare <dbl>, 2015__Other Public <chr>, 2015__Uninsured <dbl>,
## #   2015__Total <dbl>, 2016__Employer <dbl>, 2016__Non-Group <dbl>, ...

It looks like there is a lot of extra information at the end of the file due the “Notes” section that was included with the data. We want to only include rows before the value of “Notes” for the Location variable at the end of the file. Using n_max, we can specify that we want all the lines up to and including where the Location variable is equal to the row where “Notes” begins.

## read coverage data into R and stop at row 52
coverage_nottidy <- read_csv('coverage_nottidy.csv', n_max = 52)

tail(coverage_nottidy) # check the last rows again

## # A tibble: 6 x 29
##   Location      `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
##   <chr>                    <dbl>            <dbl>            <dbl>            <dbl>
## 1 Vermont                 317700            26200           123400            96600
## 2 Virginia               4661600           364800           773200           968000
## 3 Washington             3541600           309000          1026800           879000
## 4 West Virginia           841300            42600           382500           329400
## 5 Wisconsin              3154500           225300           907600           812900
## 6 Wyoming                 305900            19500            74200            65400
## # ... with 24 more variables: 2013__Other Public <chr>, 2013__Uninsured <dbl>,
## #   2013__Total <dbl>, 2014__Employer <dbl>, 2014__Non-Group <dbl>,
## #   2014__Medicaid <dbl>, 2014__Medicare <dbl>, 2014__Other Public <chr>,
## #   2014__Uninsured <dbl>, 2014__Total <dbl>, 2015__Employer <dbl>,
## #   2015__Non-Group <dbl>, 2015__Medicaid <dbl>, 2015__Medicare <dbl>,
## #   2015__Other Public <chr>, 2015__Uninsured <dbl>, 2015__Total <dbl>,
## #   2016__Employer <dbl>, 2016__Non-Group <dbl>, 2016__Medicaid <dbl>, ...

Looks much better now! We can then use the glimpse() function of the dplyr package to get a sense of what types of information are stored in our dataset.

glimpse(coverage_nottidy)

## Rows: 52
## Columns: 29
## $ Location             <chr> "United States", "Alabama", "Alaska", "Arizona", ~
## $ `2013__Employer`     <dbl> 155696900, 2126500, 364900, 2883800, 1128800, 177~
## $ `2013__Non-Group`    <dbl> 13816000, 174200, 24000, 170800, 155600, 1986400,~
## $ `2013__Medicaid`     <dbl> 54919100, 869700, 95000, 1346100, 600800, 8344800~
## $ `2013__Medicare`     <dbl> 40876300, 783000, 55200, 842000, 515200, 3828500,~
## $ `2013__Other Public` <chr> "6295400", "85600", "60600", "N/A", "67600", "675~
## $ `2013__Uninsured`    <dbl> 41795100, 724800, 102200, 1223000, 436800, 559410~
## $ `2013__Total`        <dbl> 313401200, 4763900, 702000, 6603100, 2904800, 381~
## $ `2014__Employer`     <dbl> 154347500, 2202800, 345300, 2835200, 1176500, 177~
## $ `2014__Non-Group`    <dbl> 19313000, 288900, 26800, 333500, 231700, 2778800,~
## $ `2014__Medicaid`     <dbl> 61650400, 891900, 130100, 1639400, 639200, 961880~
## $ `2014__Medicare`     <dbl> 41896500, 718400, 55300, 911100, 479400, 4049000,~
## $ `2014__Other Public` <chr> "5985000", "143900", "37300", "N/A", "82000", "63~
## $ `2014__Uninsured`    <dbl> 32967500, 522200, 100800, 827100, 287200, 3916700~
## $ `2014__Total`        <dbl> 316159900, 4768000, 695700, 6657200, 2896000, 387~
## $ `2015__Employer`     <dbl> 155965800, 2218000, 355700, 2766500, 1293700, 177~
## $ `2015__Non-Group`    <dbl> 21816500, 291500, 22300, 278400, 200200, 3444200,~
## $ `2015__Medicaid`     <dbl> 62384500, 911400, 128100, 1711500, 641400, 101381~
## $ `2015__Medicare`     <dbl> 43308400, 719100, 60900, 949000, 484500, 4080100,~
## $ `2015__Other Public` <chr> "6422300", "174600", "47700", "189300", "63700", ~
## $ `2015__Uninsured`    <dbl> 28965900, 519400, 90500, 844800, 268400, 2980600,~
## $ `2015__Total`        <dbl> 318868500, 4833900, 705300, 6739500, 2953000, 391~
## $ `2016__Employer`     <dbl> 157381500, 2263800, 324400, 3010700, 1290900, 181~
## $ `2016__Non-Group`    <dbl> 21884400, 262400, 20300, 377000, 252900, 3195400,~
## $ `2016__Medicaid`     <dbl> 62303400, 997000, 145400, 1468400, 618600, 985380~
## $ `2016__Medicare`     <dbl> 44550200, 761200, 68200, 1028000, 490000, 4436000~
## $ `2016__Other Public` <chr> "6192200", "128800", "55600", "172500", "67500", ~
## $ `2016__Uninsured`    <dbl> 28051900, 420800, 96900, 833700, 225500, 3030800,~
## $ `2016__Total`        <dbl> 320372000, 4834100, 710800, 6890200, 2945300, 391~

This gives an us output with all the variables listed on the far left. Thus essentially the data is rotated from the way it would be shown if we used head() instead of glimpse(). The first few observations for each variable are shown with a comma separating each observation.

Looks like we have a whole bunch of numeric variables (indicated by ), but a few that appear like they should be numeric, but are actually strings (indicated by , ex. 2016_Other Public). This is important: Keep it mind for when we wrangle the data! We will clean it up after we open the spending data.

Spending Data

## read spending data into R
spending_nottidy <- read_csv('spending_nottidy.csv')

spending_nottidy 
# has notes on the bottom again

We will do the same thing as we did for the coverage data. Read in the first 52 rows that we want.

#got some parsing errors...
spending_nottidy<- read_csv('spending_nottidy.csv', 
                     n_max  =  52) # stops at row 52

spending_nottidy # 52 by 25

Wranging Data

Pivoting longer

Look at the coverage_nottidy tibble again.

coverage_nottidy

## # A tibble: 52 x 29
##    Location  `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
##    <chr>                <dbl>            <dbl>            <dbl>            <dbl>
##  1 United S~        155696900         13816000         54919100         40876300
##  2 Alabama            2126500           174200           869700           783000
##  3 Alaska              364900            24000            95000            55200
##  4 Arizona            2883800           170800          1346100           842000
##  5 Arkansas           1128800           155600           600800           515200
##  6 Californ~         17747300          1986400          8344800          3828500
##  7 Colorado           2852500           426300           697300           549700
##  8 Connecti~          2030500           126800           532000           475300
##  9 Delaware            473700            25100           192700           141300
## 10 District~           324300            30400           174900            59900
## # ... with 42 more rows, and 24 more variables: 2013__Other Public <chr>,
## #   2013__Uninsured <dbl>, 2013__Total <dbl>, 2014__Employer <dbl>,
## #   2014__Non-Group <dbl>, 2014__Medicaid <dbl>, 2014__Medicare <dbl>,
## #   2014__Other Public <chr>, 2014__Uninsured <dbl>, 2014__Total <dbl>,
## #   2015__Employer <dbl>, 2015__Non-Group <dbl>, 2015__Medicaid <dbl>,
## #   2015__Medicare <dbl>, 2015__Other Public <chr>, 2015__Uninsured <dbl>,
## #   2015__Total <dbl>, 2016__Employer <dbl>, 2016__Non-Group <dbl>, ...

At a glance, we see that state-level information is stored in rows (with the exception of the first row, which stores country-level information) with columns corresponding to the amount of money spent on each type of health care, by year.

If we focus in on the columns within this dataframe, we see that we have a number of different types of health care (i.e. employer, medicare, medicaid, etc.) for each year between 2013 and 2016:

names(coverage_nottidy) # view all variable names. In Excel, these would be the "headers"

##  [1] "Location"           "2013__Employer"     "2013__Non-Group"   
##  [4] "2013__Medicaid"     "2013__Medicare"     "2013__Other Public"
##  [7] "2013__Uninsured"    "2013__Total"        "2014__Employer"    
## [10] "2014__Non-Group"    "2014__Medicaid"     "2014__Medicare"    
## [13] "2014__Other Public" "2014__Uninsured"    "2014__Total"       
## [16] "2015__Employer"     "2015__Non-Group"    "2015__Medicaid"    
## [19] "2015__Medicare"     "2015__Other Public" "2015__Uninsured"   
## [22] "2015__Total"        "2016__Employer"     "2016__Non-Group"   
## [25] "2016__Medicaid"     "2016__Medicare"     "2016__Other Public"
## [28] "2016__Uninsured"    "2016__Total"

While a lot of information in here will be helpful, it is not in a tidy format. This is because, each variable is not in a separate column: Each column currently includes year and the type of coverage. We want to use each piece of information separately later when we visualize and analyze these data. We need to first pivot the data longer, and then separate the years from the insurance types.

To accomplish this, the first thing we’ll have to do is reshape the data, using the pivot_longer() function from the tidyr package. This function gathers multiple columns and collapses them into new name-value pairs. This transforms data from wide format into a long format, where:

The first argument defines the columns to gather
The names_to argument is the name of the new column that you are creating which contains the values of the column headings that you are gathering
The values_to argument is the name of the new column that will contain the values themselves; you can indicate the name of this column with the values_to argument.

Here, we create a column titled year_type and tot_coverage, storing this newly formatted data as a new tibble named “coverage”.

We also want to keep the Location column as it is because it also contains observational level data. > To select all the columns except a specific column, use the - (subtraction) operator. (This process is also referred to as negative indexing.)

coverage_long <- coverage_nottidy %>%
  mutate(across(starts_with("20"), # find every variable that begins with "20"
                as.integer)) %>%  ## Convert all year-based columns to integer
  pivot_longer(-Location,         ## Pivots all columns EXCEPT 'Location'
               names_to = "year_type", 
               values_to = "tot_coverage")

What warning did you get? What does it mean?

How many variables are there now? What are they?

coverage_long

Separating Variables

Look at coverage_long. We still have a problem with the structure of the data. The year_type column still contains two pieces of information. We want to separate these out to ensure that the data are in a tidy format. To do this, use the separate() function, which allows us to separate out the information stored in a single column into two columns. We will also use the convert=TRUE argument to convert the character to an integer.

coveragetidy <- coverage_long %>% 
  separate(year_type, sep = "__", # uses the underscore to separate the text parts
           into = c("year", "type"), #left of the underscore goes to  `year` and right of the underscore goes to`type`
           convert = TRUE) # converts character into an integer

coveragetidy  # view your tidy dataset

Perfect! We now have the four columns we wanted, each storing a separate piece of information, and the year column is an integer, as you would want it to be!

Joining Region Info to Coverage Data

Let’s go one step further and add in information for the state abbreviations and region of each state. There are data and packages in R available to you for just this purpose!

library(datasets) # this package is already included in base R
# you do not need to install it

data(state) # loads vectors into your environment that have different state information. Look at them! 
# state.region, state.center, state.area, state.abb, state.name, etc,

state.name # view the state names

Add comments to the lines of code describing what each step is doing

state_data <- tibble(Location = state.name,
                     abb = state.abb,
                     region = state.region)

state_data # 50 by 3 tibble

Join the state data with our coverage dataset to add on the state abbreviation and region information to our coverage data.

coveragetidy <- coveragetidy %>%
  left_join(state_data, by = "Location") # similar to VLOOKUP() in excel
# matches state_data TO the coverage data based on location

coveragetidy

## # A tibble: 1,456 x 6
##    Location       year type         tot_coverage abb   region
##    <chr>         <int> <chr>               <int> <chr> <fct> 
##  1 United States  2013 Employer        155696900 <NA>  <NA>  
##  2 United States  2013 Non-Group        13816000 <NA>  <NA>  
##  3 United States  2013 Medicaid         54919100 <NA>  <NA>  
##  4 United States  2013 Medicare         40876300 <NA>  <NA>  
##  5 United States  2013 Other Public      6295400 <NA>  <NA>  
##  6 United States  2013 Uninsured        41795100 <NA>  <NA>  
##  7 United States  2013 Total           313401200 <NA>  <NA>  
##  8 United States  2014 Employer        154347500 <NA>  <NA>  
##  9 United States  2014 Non-Group        19313000 <NA>  <NA>  
## 10 United States  2014 Medicaid         61650400 <NA>  <NA>  
## # ... with 1,446 more rows

Perfect! At this point, each row is an observation and each column stores a single piece of information. This dataset is now in good shape!

How many observations and columns do you have now?

Pivoting Spending Data

spending_nottidy

Here, we reshape the data using year and tot_spending for the key and value. We also want to keep Location like before.

# take spending data from wide to long
spending_long <- spending_nottidy %>%
  pivot_longer(-Location, # all columns EXCEPT location
               names_to = "year", 
               values_to = "tot_spending")
spending_long

How many columns are there now? Now what is wrong with our data?

Then, using the separate() function, we create two new columns called “year” and “name”.

# separate year and name columns
spendingtidy <- spending_long %>% 
  separate(year, sep="__", 
           into = c("year", "name"), 
           convert = TRUE)

# look at the data
spendingtidy

We don’t need the “name” variable since it is the same for every observation. We can remove it.

spendingtidy <- spendingtidy %>% 
  select(-_______)

spendingtidy

Perfect, we have a tidy dataset!

Joining the Coverage and Spending Data

At this point, we have a tidy coverage dataset and a tidy spending dataset. However, we want all of this information in a single tidy tibble To do this, we must join the datasets together just like in Homework 4. First, we have to decide what type of join we want to do. For our questions, we only want information from years that are found in both the coverage and the spending datasets. This means that we want to do an inner_join(). This will keep the data from the intersection of years from coverage and spending (meaning only 2013 and 2014).

Combine your tidy coverage and tidy spending data into a new dataset named healthcare. How many observations are there after you join them together?

# inner join to combine data frames
healthcare <- inner_join(coveragetidy, spendingtidy, 
                 by = c("Location", "year"))

healthcare

Filter, Select, and Mutate

Remove observations for “United States” from Location using filter()

# use filter to include only the state level data
# by removing "United States" from the Location variable
healthcare <- healthcare %>% 
  filter(Location != "United States")

Remove observations where the Insurance type is equal to “Total” The “Total” type is not really a type of health care coverage. It represents the total number of people in the state. This is useful information and we can include it as a new column named tot_pop. To accomplish this, first store this information in a data frame called pop. This should be the same as Homework 4. Don’t forget to annotate your code

pop <- healthcare %>% 
  filter(type == "Total") %>% 
  select(Location, year, tot_coverage)

pop

Use a left_join() to keep all of the rows in the healthcare data frame and add the population level information while simultaneously removing the rows where type is “Total” from the dataset. Then rename the columns to num_covered and tot_pop so that they are more intuitively labeled. Annotate the steps of your code

# add population level information
healthcare <- healthcare %>% 
  filter(type != "Total") %>% 
  left_join(pop, by = c("Location", "year")) %>% 
  rename(num_covered = tot_coverage.x, 
         tot_pop = tot_coverage.y)

healthcare

From here, instead of only storing the absolute number of people who are covered (tot_coverage), calculate the proportion of people in each insurance category for each state and store this information in a new variable named perc_covered.

healthcare <- healthcare %>% 
    mutate(perc_covered = num_covered/tot_pop) 

healthcare

The tot_spending column is reported in millions. Therefore, to calculate spending_capita you will need to adjust for this scaling factor to report it on the original scale (just dollars) and then divide by tot_pop. We can again use mutate() to accomplish this:

# get spending per capita in dollars
healthcare <- healthcare %>% 
  mutate(spending_capita = (tot_spending*1000000) / tot_pop)

healthcare

Analysis and Communicating Findings

Missing Data

It is always important to check for missing data and try to determine if it will impact your analysis. Patterns within missing data can be interesting by themselves. Use filter() to find missing values in the “num_covered” variable. is.na() is one of many commands you can use to show you observations that are missing data.

healthcare %>% filter(is.na(num_covered))

How many observations have missing data? Which category / variable seems to be causing the problem?

Graphing Data

The code below creates a scatterplot. drop_na() is a very useful command to remove missing values when graphing data.

Add to the code provided so that:

Each dot is labeled with the state’s abbreviations using geom_text.
Add a best-fit line using geom_smooth() to visualize the magnitude of the linear relationship.
The regions are indicated with a legend where each region is a different colored dot on the scatterplot.
Add a title or caption to communicate what the graph is about or where the data is from.
Annotate the code

healthcare %>%
  filter(type == "Employer", 
         year == "2013") %>%
  drop_na() %>% 
  ggplot(aes(x = spending_capita, 
             y = perc_covered)) + 
  geom_point() + 
  labs(x = "Spending per Capita",
       y = "Percent Employer Insurance")

Facet wrap to visualize each insurance type

This code chunk and graph are here mostly to show you another way to visualize the data and provide some code clues for your other graphs or tables.

# visualize 2013 data by insurance type
healthcare %>%
  drop_na() %>%
  filter(year == "2013") %>%
  ggplot(aes(x = spending_capita,
             y = perc_covered,
             color = region)) +
  geom_point() +
  facet_wrap(~type) +
  labs(x = "Spending per Capita ($)",
       y = "% Population Covered") +
  geom_smooth(method = "lm", col = "red")

Descriptive Statistics

Is the the value calculated from the code below meaningful or should more data and context be provided? What is this an average of?

healthcare %>% 
  summarize(avg_spend_capita = mean(spending_capita))

Based on the boxplot below, describe the trends in spending per person for those that have health insurance from their Employer for the Southern region compared to the Northeast region. Make sure to mention the median, dispersion, and implications for residents of those areas or possible reasons behind any visible trends.

Read the lines of code below. Describe what each line of code is doing AND what you can infer from the image of the distribution that it creates.

ggplot(healthcare, aes(x = spending_capita))+
  geom_density()+
  geom_vline(aes(xintercept = mean(spending_capita)), color = "orange")+
  geom_vline(aes(xintercept = median(spending_capita)), color = "black")+
  geom_vline(aes(xintercept = mean(spending_capita)+2*sd(spending_capita)), color = "red")+
  geom_vline(aes(xintercept = mean(spending_capita)-2*sd(spending_capita)), color = "red")

Summary Tables

Exploratory tables are great for summarizing information for yourself, but they need some cleaning up before they are used to communicate information with others. Below is an example of a summary table containing the average, median, and standard deviation for spending per capita for each region.

table <- healthcare %>%
  drop_na() %>%
  group_by(region) %>%
  summarize("Average" = mean(spending_capita),
            "Median" = median(spending_capita),
            "Standard Deviation" = sd(spending_capita) )
table

## # A tibble: 4 x 4
##   region        Average Median `Standard Deviation`
##   <fct>           <dbl>  <dbl>                <dbl>
## 1 Northeast       9583.  9562.                 532.
## 2 South           7759.  7583.                 900.
## 3 North Central   8404.  8307.                 535.
## 4 West            7512.  7384.                1332.

Issues with this exploratory table that should be changed if you were making a final, pretty summary table: Round decimal points to clean up numbers, add commas for thousands, add a title, footnotes, or caption to summarize key points or state where the data is from, use a theme or formatting to frame the table and make it look professional and neat.

Make a nice summary table showing the average percent of people from each region covered by each type of health insurance (you do not need to separate data by year). You are welcome to explore different functions or make additional modifications to your table. Feel free to personalize the theme and formatting according to your preferences (I would suggest exploring a format that you like and can be applied to your final report throughout tables in this assignment). Have regions as columns and the type of health care as rows.

Note: you will need to summarize the data and then use pivot_wider() so that each region becomes a column

Example of Formatted Summary Table
type	Northeast	South	North Central	West
Employer	0.54	0.47	0.54	0.49
Medicaid	0.18	0.19	0.15	0.17
Medicare	0.14	0.14	0.14	0.12
Non-Group	0.05	0.05	0.06	0.05
Other Public	0.01	0.03	0.01	0.03
Uninsured	0.08	0.13	0.09	0.12

Which region has the smallest proportion of its population covered by Medicaid? Which region as the largest proportion of its population uninsured? Include the percent covered in your answer.

On average, how much did the spending per capita change from 2013 to 2014 for each region? Make sure to indicate the direction of the change in your answer. Make a table to summarize the change between years. If you want an extra challenge, try to figure out how to visualize this information.

healthcare # %>% 
#

Homework 5: Health Care in the US

Your Name - PPOL 105

Due Saturday Nov. 27 before Midnight: