Submit your work as either an .R or .Rmd file that knits to an HTML file when the code is run. Include all your answers and comments in the code. You will be graded on the organization and overall readability of your code and comments. I should be able to download your code to my computer and run it without errors.

Setup

getwd() tells you the current working directory

To choose the working directory in RStudio: Session > Set Working Directory > Choose Directory, then pick the folder you want

Or use full folder path: setwd("C:/Users/yourname/Desktop/PPOL105")

Opening Data

Healthcare Coverage

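## load the tidyverse (provides read_csv(), glimpse(), and the dplyr verbs used below)
library(tidyverse)

## read coverage data into R
coverage_ <- read_csv('coverage_tidy.csv') # from the readr package, included in the tidyverse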
class(coverage_)

coverage. <- read.csv('coverage_tidy.csv') # included in base R installation. Not from tidyverse.
class(coverage.)

We want to use only the tibble version created by read_csv(). Remove “coverage.” from your environment using the command rm().

rm(coverage.)
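
The difference between the two readers can be seen on a small example. The sketch below is only an illustration: the file is a made-up toy CSV written to a temporary location, and the names tmp, toy_tibble, and toy_df are hypothetical and not part of the homework data.

```r
library(readr)

# Hypothetical toy file written to a temporary location (not the homework data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Location,year,value", "Alpha,2013,10", "Beta,2014,20"), tmp)

toy_tibble <- suppressMessages(read_csv(tmp))  # readr: returns a tibble
toy_df     <- read.csv(tmp)                    # base R: returns a plain data.frame

class(toy_tibble)  # includes "tbl_df" as well as "data.frame"
class(toy_df)      # only "data.frame"
```

A tibble is still a data frame, so base R functions work on it; the extra classes mainly change how it prints.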

Create a new dataset called coverage from coverage_. Then remove coverage_ from your environment:

______ <- __________
rm(_______)

1. What is the difference in output between the lines of code view(coverage) and coverage below?

view(coverage)

coverage

We can also use the glimpse() function of the dplyr package to get a sense of what types of information are stored in our dataset.

glimpse(coverage)

This gives us an output with all the variables listed on the far left. The data displayed is rotated from the way it would be shown if we used head() instead of glimpse(). The first few observations of each variable are shown, with a comma separating each observation.

head(coverage) # for comparison to glimpse()
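
To see the two layouts side by side without the homework data, here is a minimal sketch on a made-up tibble. The object toy and its values are assumptions for illustration only.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  Location = c("Alpha", "Beta", "Gamma"),
  year     = c(2013L, 2013L, 2014L),
  value    = c(1.5, 2.5, 3.5)
)

glimpse(toy)  # one line per variable: name, storage type, first few values
head(toy, 2)  # first two rows, with one column per variable

# The storage type of each column can also be checked directly
sapply(toy, class)
```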

2. What type of data is each variable stored as (character, numeric, integer, logical, factor, etc)? How do you know this?

3. What level of measurement is appropriate for each variable? (nominal, ordinal, interval, ratio)

Healthcare Spending

## read spending data into R
spending <- read_csv()

4. What kind of information can you get from the output after you run the code spending <- read_csv("yourfilename.csv") above?

Now view your spending tibble to examine your variables and data:

_______     # view your spending tibble
# rows?
# columns?
# variables?

At a glance, we see that location information is stored in rows with columns for the amount of money spent in each state in different years.

Joining the Data

At this point, we have a coverage dataset and a spending dataset, but ultimately, we want all of this information in a single tidy data frame. To do this, we’ll have to join the data sets together.

We have to decide what type of join we want to do. Remember, there are multiple types of joins: left_join(), right_join(), full_join(), and inner_join(). For our questions, we only want information from years that are found in both the coverage and the spending datasets. This means that we want to do an inner_join(). This will keep the data from the intersection of years from coverage and spending (meaning only 2013 and 2014). We’ll store this in a new object: healthcare.

# inner join to combine data frames
healthcare <- inner_join(coverage, spending, 
                 by = c("Location", "year"))

healthcare
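
How the join types differ is easiest to see on a tiny example. The sketch below uses two made-up tables (cov_toy and spend_toy, both hypothetical) whose years only partly overlap; it is not the homework data.

```r
library(dplyr)

# Made-up tables whose years only partly overlap
cov_toy <- tibble(
  Location = c("Alpha", "Alpha", "Alpha"),
  year     = c(2013, 2014, 2015),
  covered  = c(100, 110, 120)
)
spend_toy <- tibble(
  Location = c("Alpha", "Alpha", "Alpha"),
  year     = c(2012, 2013, 2014),
  spent    = c(5, 6, 7)
)
by_cols <- c("Location", "year")

inner_toy <- inner_join(cov_toy, spend_toy, by = by_cols)  # years found in both tables
left_toy  <- left_join(cov_toy, spend_toy, by = by_cols)   # every year in cov_toy
full_toy  <- full_join(cov_toy, spend_toy, by = by_cols)   # every year in either table

nrow(inner_toy)  # 2 (2013 and 2014)
nrow(left_toy)   # 3 (2015 has NA for spent)
nrow(full_toy)   # 4 (2012 has NA for covered, 2015 has NA for spent)
```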

5. How many observations are there? How many variables?

Now use the full_join() command to join the data. Name the object you create healthcare_full. View healthcare_full and add a comment in your code indicating how many rows and columns there are:

______ <- full_join()

6. How many observations are there now? Is it different? How so? Why do you suppose this is?

Wrangling Data

Great, we have combined the information in our datasets. But we have a bit of extra information that we do not need. Reminder: We want to look only at state-level data.

Remember our key verbs:

select(), which returns a subset of the columns
filter(), which returns a subset of the rows
arrange(), which reorders the rows according to one or more variables
mutate(), which adds new columns computed from existing data
summarize(), which reduces each group to a single row by calculating aggregate measures
- plus group_by(), which defines the groups that summarize() works on
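
The verbs listed above can be chained with pipes. Here is a minimal sketch on made-up data; the object toy and the columns covered, pop, notes, and share are hypothetical and only serve the illustration.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  Location = c("Alpha", "Alpha", "Beta", "Beta"),
  year     = c(2013, 2014, 2013, 2014),
  covered  = c(40, 50, 10, 30),
  pop      = c(100, 100, 50, 50),
  notes    = c("a", "b", "c", "d")
)

result <- toy %>%
  select(Location, year, covered, pop) %>%  # keep only the columns needed
  filter(year == 2014) %>%                  # keep only the 2014 rows
  mutate(share = covered / pop) %>%         # new column computed from existing ones
  arrange(desc(share))                      # largest share first

by_location <- toy %>%
  group_by(Location) %>%                    # one group per Location
  summarize(mean_covered = mean(covered))   # one row per group

result
by_location
```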

Filter out the country-level summary row:

# filter to only include state-level data.
# this keeps every row whose Location is NOT "United States"
# Because the result is assigned back to healthcare, "United States" is removed from the healthcare object (the CSV file itself is unchanged).
healthcare <- healthcare %>% 
  filter(Location != "United States")

_______ # view healthcare

7. How many observations remain? What is the unit of analysis?

What if we want to look at the information for Illinois? Use the filter command to view only observations from Illinois:

healthcare %>% 
  filter(__________)

8. How many people in Illinois were covered by Medicare in 2013?

Univariate Tables

table(healthcare$type)

The “Total” type is not really a formal type of health care coverage. It represents the total number of people in the state during a specific year. This is useful information, and we can include it as a separate column called tot_pop. To accomplish this, we will first take all observations where the type is “Total” out of the healthcare dataframe and put them in a new dataframe called pop.

9. Why do the types in the table(healthcare$type) output have the same number of observations? What does this number represent? (i.e. what are we summarizing in this table?)

Creating Variables

Proportion of population covered

The code below finds all observations that match the criterion (healthcare type equal to “Total”) and puts them in a dataset called pop. It does NOT change the healthcare dataset.

pop <- healthcare %>% 
  filter(type == "Total") %>%  # keeps observations that have type equal to "Total" 
  select(Location, year, tot_coverage) # We only need the location, year, and tot_coverage columns

pop

The code below tells R to look inside healthcare, keep the observations whose type is NOT equal to “Total”, and save the result to healthcare_filtered.

healthcare_filtered <- healthcare %>% 
  filter(type != "Total")

Check to make sure that the Total category is gone in healthcare_filtered using table():

table(__$___)

Now join healthcare_filtered with the pop dataframe. This adds a column for the state population to the right of the health care variables, filled in wherever a row matches on both the Location and year columns.

joined <- left_join(healthcare_filtered, pop, by = c("Location", "year"))
joined

10. Notice that R changed the variable names for tot_coverage. Why did it do this?

Let’s give these variables more meaningful names.

joined <- joined %>% 
  rename(num_covered = tot_coverage.x,  # renames tot_coverage.x to num_covered for the number of people covered by a type of insurance in a state
         tot_pop = tot_coverage.y) # renames tot_coverage.y to tot_pop for the total population in a state

The end goal is to make our code as clean and efficient as possible so that a computer and a human can easily understand it. We could combine these steps into one smooth piece of code below.

11. Write comments next to each line of code to indicate what each step is doing.

# add population level information
healthcare2 <- healthcare %>% 
  filter(type != "Total") %>% 
  left_join(pop, by = c("Location", "year"))%>%
  rename(num_covered = tot_coverage.x, 
         tot_pop = tot_coverage.y)

healthcare2

From here, instead of only storing the absolute number of people who are covered (num_covered), we will calculate the proportion of people who are covered in each state per year for each type of healthcare, storing this information in prop_coverage.

This is the same as creating a new variable in Excel and setting it equal to the number of people covered / total number of people in the state.
prop_coverage = num_covered / tot_pop

# add proportion covered
joined <- joined %>% 
    mutate(prop_coverage = num_covered/tot_pop) 

joined

Spending per Capita

The tot_spending column is reported in millions. Therefore, to calculate spending_capita we will need to adjust for this scaling factor to report it on the original scale (just dollars, so multiply by 1e6 or 1000000) and then divide by tot_pop. We can again use mutate() to accomplish this:

# get spending capita in dollars
joined <- joined %>% 
  mutate(spending_capita = (tot_spending*1e6) / tot_pop) # calculate spending per person

joined
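
The unit conversion is worth checking by hand on a couple of made-up numbers. In this sketch, the tibble toy and its values are assumptions for illustration only; only the column names tot_spending, tot_pop, and spending_capita come from the text above.

```r
library(dplyr)

# Made-up values, for illustration only
toy <- tibble(
  Location     = c("Alpha", "Beta"),
  tot_spending = c(2000, 500),       # millions of dollars
  tot_pop      = c(400000, 250000)   # people
)

toy <- toy %>%
  mutate(spending_capita = (tot_spending * 1e6) / tot_pop)  # dollars per person

toy$spending_capita  # 5000 2000
```

For Alpha, 2000 million dollars is 2,000,000,000 dollars, and dividing by 400,000 people gives 5,000 dollars per person. Forgetting the 1e6 factor would give 0.005 instead.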

Save your work

After all the work cleaning the data, it is a good idea to save your clean and combined data as a CSV file or an R data file. Change the code below to match your data object and file name.

# save the data!

save(yourdataobject, file = "healthcare.rda")       # R data file; reload later with load()
write_csv(yourdataobject, file = "healthcare.csv")  # plain-text copy; reload later with read_csv()
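
A quick way to convince yourself that both formats work is to save a small object, remove it, and read it back. The sketch below uses a made-up tibble and a temporary folder as a stand-in for your working directory; the names toy, out_dir, rda_path, csv_path, and csv_copy are hypothetical.

```r
library(dplyr)
library(readr)

# Made-up data, for illustration only
toy <- tibble(Location = c("Alpha", "Beta"), year = c(2013, 2014))

out_dir  <- tempdir()                      # stand-in for your working directory
rda_path <- file.path(out_dir, "toy.rda")
csv_path <- file.path(out_dir, "toy.csv")

save(toy, file = rda_path)   # stores the object together with its name
write_csv(toy, csv_path)     # plain-text copy that other programs can open

file.exists(c(rda_path, csv_path))  # both should be TRUE

rm(toy)
load(rda_path)                                    # restores an object named toy
csv_copy <- suppressMessages(read_csv(csv_path))  # returns the data; you choose the name
```

Note the difference: load() brings the object back under its original name, while read_csv() returns the data so you assign it to whatever name you like.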

Check to see if the saved item shows up in your working directory! Also make sure you clean up and save your code in the same folder! Make sure you wrote comments to yourself because you will be reusing parts of this code for Homework 5.

Extra Credit

Extra Credit: Add more to the filter, select, and mutate commands to display the number of people covered (both the raw number and percent of the population) in Illinois by Medicare during 2013. Add comments to your code to clearly indicate what each step is doing. Do this in one smooth piece of code using pipes.

joined %>%
  __________ %>%
  ____________ %>% 
  
  ????

Homework 4

Student_Name

Due 11/7/2021 at Noon