Homework 5 should be completed using R Markdown. Submit the code as a Markdown File AND a “knitted” HTML or PDF file in a zipped folder to Blackboard. (Blackboard doesn’t allow html files to be uploaded unless in a zipped folder.) Use headers and other formatting tools to clearly indicate where your code and solutions for each question begin. Delete the instructions I include for you and instead document what you are doing, why you are doing it, and your interpretations of the output. Tell the story of your data cleaning and analysis, and annotate your code.
library(tidyverse)
library("kableExtra") # Making cool tables in R
Open the coverage and spending data sets. These code chunks should work without additional edits until you get to the Pivoting stage in the Wranging section of the assignment.
Use the read_csv() command to open the coverage dataset. How many rows and columns are there?
coverage_nottidy <- read_csv('coverage_nottidy.csv')
coverage_nottidy #view tibble
#78 rows, 29 variables
# look at the last 30 rows:
tail(coverage_nottidy, n = 30)
## # A tibble: 30 x 29
## Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 "Washing~ 3541600 309000 1026800 879000
## 2 "West Vi~ 841300 42600 382500 329400
## 3 "Wiscons~ 3154500 225300 907600 812900
## 4 "Wyoming" 305900 19500 74200 65400
## 5 "Notes" NA NA NA NA
## 6 "The maj~ NA NA NA NA
## 7 <NA> NA NA NA NA
## 8 "In this~ NA NA NA NA
## 9 <NA> NA NA NA NA
## 10 "Data ex~ NA NA NA NA
## # ... with 20 more rows, and 24 more variables: 2013__Other Public <chr>,
## # 2013__Uninsured <dbl>, 2013__Total <dbl>, 2014__Employer <dbl>,
## # 2014__Non-Group <dbl>, 2014__Medicaid <dbl>, 2014__Medicare <dbl>,
## # 2014__Other Public <chr>, 2014__Uninsured <dbl>, 2014__Total <dbl>,
## # 2015__Employer <dbl>, 2015__Non-Group <dbl>, 2015__Medicaid <dbl>,
## # 2015__Medicare <dbl>, 2015__Other Public <chr>, 2015__Uninsured <dbl>,
## # 2015__Total <dbl>, 2016__Employer <dbl>, 2016__Non-Group <dbl>, ...
It looks like there is a lot of extra information at the end of the file due the “Notes” section that was included with the data. We want to only include rows before the value of “Notes” for the Location variable at the end of the file. Using n_max, we can specify that we want all the lines up to and including where the Location variable is equal to the row where “Notes” begins.
## read coverage data into R and stop at row 52
coverage_nottidy <- read_csv('coverage_nottidy.csv', n_max = 52)
tail(coverage_nottidy) # check the last rows again
## # A tibble: 6 x 29
## Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Vermont 317700 26200 123400 96600
## 2 Virginia 4661600 364800 773200 968000
## 3 Washington 3541600 309000 1026800 879000
## 4 West Virginia 841300 42600 382500 329400
## 5 Wisconsin 3154500 225300 907600 812900
## 6 Wyoming 305900 19500 74200 65400
## # ... with 24 more variables: 2013__Other Public <chr>, 2013__Uninsured <dbl>,
## # 2013__Total <dbl>, 2014__Employer <dbl>, 2014__Non-Group <dbl>,
## # 2014__Medicaid <dbl>, 2014__Medicare <dbl>, 2014__Other Public <chr>,
## # 2014__Uninsured <dbl>, 2014__Total <dbl>, 2015__Employer <dbl>,
## # 2015__Non-Group <dbl>, 2015__Medicaid <dbl>, 2015__Medicare <dbl>,
## # 2015__Other Public <chr>, 2015__Uninsured <dbl>, 2015__Total <dbl>,
## # 2016__Employer <dbl>, 2016__Non-Group <dbl>, 2016__Medicaid <dbl>, ...
Looks much better now! We can then use the glimpse() function of the dplyr package to get a sense of what types of information are stored in our dataset.
glimpse(coverage_nottidy)
## Rows: 52
## Columns: 29
## $ Location <chr> "United States", "Alabama", "Alaska", "Arizona", ~
## $ `2013__Employer` <dbl> 155696900, 2126500, 364900, 2883800, 1128800, 177~
## $ `2013__Non-Group` <dbl> 13816000, 174200, 24000, 170800, 155600, 1986400,~
## $ `2013__Medicaid` <dbl> 54919100, 869700, 95000, 1346100, 600800, 8344800~
## $ `2013__Medicare` <dbl> 40876300, 783000, 55200, 842000, 515200, 3828500,~
## $ `2013__Other Public` <chr> "6295400", "85600", "60600", "N/A", "67600", "675~
## $ `2013__Uninsured` <dbl> 41795100, 724800, 102200, 1223000, 436800, 559410~
## $ `2013__Total` <dbl> 313401200, 4763900, 702000, 6603100, 2904800, 381~
## $ `2014__Employer` <dbl> 154347500, 2202800, 345300, 2835200, 1176500, 177~
## $ `2014__Non-Group` <dbl> 19313000, 288900, 26800, 333500, 231700, 2778800,~
## $ `2014__Medicaid` <dbl> 61650400, 891900, 130100, 1639400, 639200, 961880~
## $ `2014__Medicare` <dbl> 41896500, 718400, 55300, 911100, 479400, 4049000,~
## $ `2014__Other Public` <chr> "5985000", "143900", "37300", "N/A", "82000", "63~
## $ `2014__Uninsured` <dbl> 32967500, 522200, 100800, 827100, 287200, 3916700~
## $ `2014__Total` <dbl> 316159900, 4768000, 695700, 6657200, 2896000, 387~
## $ `2015__Employer` <dbl> 155965800, 2218000, 355700, 2766500, 1293700, 177~
## $ `2015__Non-Group` <dbl> 21816500, 291500, 22300, 278400, 200200, 3444200,~
## $ `2015__Medicaid` <dbl> 62384500, 911400, 128100, 1711500, 641400, 101381~
## $ `2015__Medicare` <dbl> 43308400, 719100, 60900, 949000, 484500, 4080100,~
## $ `2015__Other Public` <chr> "6422300", "174600", "47700", "189300", "63700", ~
## $ `2015__Uninsured` <dbl> 28965900, 519400, 90500, 844800, 268400, 2980600,~
## $ `2015__Total` <dbl> 318868500, 4833900, 705300, 6739500, 2953000, 391~
## $ `2016__Employer` <dbl> 157381500, 2263800, 324400, 3010700, 1290900, 181~
## $ `2016__Non-Group` <dbl> 21884400, 262400, 20300, 377000, 252900, 3195400,~
## $ `2016__Medicaid` <dbl> 62303400, 997000, 145400, 1468400, 618600, 985380~
## $ `2016__Medicare` <dbl> 44550200, 761200, 68200, 1028000, 490000, 4436000~
## $ `2016__Other Public` <chr> "6192200", "128800", "55600", "172500", "67500", ~
## $ `2016__Uninsured` <dbl> 28051900, 420800, 96900, 833700, 225500, 3030800,~
## $ `2016__Total` <dbl> 320372000, 4834100, 710800, 6890200, 2945300, 391~
This gives an us output with all the variables listed on the far left. Thus essentially the data is rotated from the way it would be shown if we used head() instead of glimpse(). The first few observations for each variable are shown with a comma separating each observation.
Looks like we have a whole bunch of numeric variables (indicated by 2016_Other Public). This is important: Keep it mind for when we wrangle the data! We will clean it up after we open the spending data.
## read spending data into R
spending_nottidy <- read_csv('spending_nottidy.csv')
spending_nottidy
# has notes on the bottom again
We will do the same thing as we did for the coverage data. Read in the first 52 rows that we want.
#got some parsing errors...
spending_nottidy<- read_csv('spending_nottidy.csv',
n_max = 52) # stops at row 52
spending_nottidy # 52 by 25
Look at the coverage_nottidy tibble again.
coverage_nottidy
## # A tibble: 52 x 29
## Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid` `2013__Medicare`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 United S~ 155696900 13816000 54919100 40876300
## 2 Alabama 2126500 174200 869700 783000
## 3 Alaska 364900 24000 95000 55200
## 4 Arizona 2883800 170800 1346100 842000
## 5 Arkansas 1128800 155600 600800 515200
## 6 Californ~ 17747300 1986400 8344800 3828500
## 7 Colorado 2852500 426300 697300 549700
## 8 Connecti~ 2030500 126800 532000 475300
## 9 Delaware 473700 25100 192700 141300
## 10 District~ 324300 30400 174900 59900
## # ... with 42 more rows, and 24 more variables: 2013__Other Public <chr>,
## # 2013__Uninsured <dbl>, 2013__Total <dbl>, 2014__Employer <dbl>,
## # 2014__Non-Group <dbl>, 2014__Medicaid <dbl>, 2014__Medicare <dbl>,
## # 2014__Other Public <chr>, 2014__Uninsured <dbl>, 2014__Total <dbl>,
## # 2015__Employer <dbl>, 2015__Non-Group <dbl>, 2015__Medicaid <dbl>,
## # 2015__Medicare <dbl>, 2015__Other Public <chr>, 2015__Uninsured <dbl>,
## # 2015__Total <dbl>, 2016__Employer <dbl>, 2016__Non-Group <dbl>, ...
At a glance, we see that state-level information is stored in rows (with the exception of the first row, which stores country-level information) with columns corresponding to the amount of money spent on each type of health care, by year.
If we focus in on the columns within this dataframe, we see that we have a number of different types of health care (i.e. employer, medicare, medicaid, etc.) for each year between 2013 and 2016:
names(coverage_nottidy) # view all variable names. In Excel, these would be the "headers"
## [1] "Location" "2013__Employer" "2013__Non-Group"
## [4] "2013__Medicaid" "2013__Medicare" "2013__Other Public"
## [7] "2013__Uninsured" "2013__Total" "2014__Employer"
## [10] "2014__Non-Group" "2014__Medicaid" "2014__Medicare"
## [13] "2014__Other Public" "2014__Uninsured" "2014__Total"
## [16] "2015__Employer" "2015__Non-Group" "2015__Medicaid"
## [19] "2015__Medicare" "2015__Other Public" "2015__Uninsured"
## [22] "2015__Total" "2016__Employer" "2016__Non-Group"
## [25] "2016__Medicaid" "2016__Medicare" "2016__Other Public"
## [28] "2016__Uninsured" "2016__Total"
While a lot of information in here will be helpful, it is not in a tidy format. This is because, each variable is not in a separate column: Each column currently includes year and the type of coverage. We want to use each piece of information separately later when we visualize and analyze these data. We need to first pivot the data longer, and then separate the years from the insurance types.
To accomplish this, the first thing we’ll have to do is reshape the data, using the pivot_longer() function from the tidyr package. This function gathers multiple columns and collapses them into new name-value pairs. This transforms data from wide format into a long format, where:
names_to argument is the name of the new column that you are creating which contains the values of the column headings that you are gatheringvalues_to argument is the name of the new column that will contain the values themselves; you can indicate the name of this column with the values_to argument.Here, we create a column titled year_type and tot_coverage, storing this newly formatted data as a new tibble named “coverage”.
We also want to keep the Location column as it is because it also contains observational level data. > To select all the columns except a specific column, use the - (subtraction) operator. (This process is also referred to as negative indexing.)
coverage_long <- coverage_nottidy %>%
mutate(across(starts_with("20"), # find every variable that begins with "20"
as.integer)) %>% ## Convert all year-based columns to integer
pivot_longer(-Location, ## Pivots all columns EXCEPT 'Location'
names_to = "year_type",
values_to = "tot_coverage")
What warning did you get? What does it mean?
How many variables are there now? What are they?
coverage_long
Look at coverage_long. We still have a problem with the structure of the data. The year_type column still contains two pieces of information. We want to separate these out to ensure that the data are in a tidy format. To do this, use the separate() function, which allows us to separate out the information stored in a single column into two columns. We will also use the convert=TRUE argument to convert the character to an integer.
coveragetidy <- coverage_long %>%
separate(year_type, sep = "__", # uses the underscore to separate the text parts
into = c("year", "type"), #left of the underscore goes to `year` and right of the underscore goes to`type`
convert = TRUE) # converts character into an integer
coveragetidy # view your tidy dataset
Perfect! We now have the four columns we wanted, each storing a separate piece of information, and the year column is an integer, as you would want it to be!
Let’s go one step further and add in information for the state abbreviations and region of each state. There are data and packages in R available to you for just this purpose!
library(datasets) # this package is already included in base R
# you do not need to install it
data(state) # loads vectors into your environment that have different state information. Look at them!
# state.region, state.center, state.area, state.abb, state.name, etc,
state.name # view the state names
Add comments to the lines of code describing what each step is doing
state_data <- tibble(Location = state.name,
abb = state.abb,
region = state.region)
state_data # 50 by 3 tibble
Join the state data with our coverage dataset to add on the state abbreviation and region information to our coverage data.
coveragetidy <- coveragetidy %>%
left_join(state_data, by = "Location") # similar to VLOOKUP() in excel
# matches state_data TO the coverage data based on location
coveragetidy
## # A tibble: 1,456 x 6
## Location year type tot_coverage abb region
## <chr> <int> <chr> <int> <chr> <fct>
## 1 United States 2013 Employer 155696900 <NA> <NA>
## 2 United States 2013 Non-Group 13816000 <NA> <NA>
## 3 United States 2013 Medicaid 54919100 <NA> <NA>
## 4 United States 2013 Medicare 40876300 <NA> <NA>
## 5 United States 2013 Other Public 6295400 <NA> <NA>
## 6 United States 2013 Uninsured 41795100 <NA> <NA>
## 7 United States 2013 Total 313401200 <NA> <NA>
## 8 United States 2014 Employer 154347500 <NA> <NA>
## 9 United States 2014 Non-Group 19313000 <NA> <NA>
## 10 United States 2014 Medicaid 61650400 <NA> <NA>
## # ... with 1,446 more rows
Perfect! At this point, each row is an observation and each column stores a single piece of information. This dataset is now in good shape!
How many observations and columns do you have now?
spending_nottidy
Here, we reshape the data using year and tot_spending for the key and value. We also want to keep Location like before.
# take spending data from wide to long
spending_long <- spending_nottidy %>%
pivot_longer(-Location, # all columns EXCEPT location
names_to = "year",
values_to = "tot_spending")
spending_long
How many columns are there now? Now what is wrong with our data?
Then, using the separate() function, we create two new columns called “year” and “name”.
# separate year and name columns
spendingtidy <- spending_long %>%
separate(year, sep="__",
into = c("year", "name"),
convert = TRUE)
# look at the data
spendingtidy
We don’t need the “name” variable since it is the same for every observation. We can remove it.
spendingtidy <- spendingtidy %>%
select(-_______)
spendingtidy
Perfect, we have a tidy dataset!
At this point, we have a tidy coverage dataset and a tidy spending dataset. However, we want all of this information in a single tidy tibble To do this, we must join the datasets together just like in Homework 4. First, we have to decide what type of join we want to do. For our questions, we only want information from years that are found in both the coverage and the spending datasets. This means that we want to do an inner_join(). This will keep the data from the intersection of years from coverage and spending (meaning only 2013 and 2014).
Combine your tidy coverage and tidy spending data into a new dataset named healthcare. How many observations are there after you join them together?
# inner join to combine data frames
healthcare <- inner_join(coveragetidy, spendingtidy,
by = c("Location", "year"))
healthcare
Remove observations for “United States” from Location using filter()
# use filter to include only the state level data
# by removing "United States" from the Location variable
healthcare <- healthcare %>%
filter(Location != "United States")
Remove observations where the Insurance type is equal to “Total” The “Total” type is not really a type of health care coverage. It represents the total number of people in the state. This is useful information and we can include it as a new column named tot_pop. To accomplish this, first store this information in a data frame called pop. This should be the same as Homework 4. Don’t forget to annotate your code
pop <- healthcare %>%
filter(type == "Total") %>%
select(Location, year, tot_coverage)
pop
Use a left_join() to keep all of the rows in the healthcare data frame and add the population level information while simultaneously removing the rows where type is “Total” from the dataset. Then rename the columns to num_covered and tot_pop so that they are more intuitively labeled. Annotate the steps of your code
# add population level information
healthcare <- healthcare %>%
filter(type != "Total") %>%
left_join(pop, by = c("Location", "year")) %>%
rename(num_covered = tot_coverage.x,
tot_pop = tot_coverage.y)
healthcare
From here, instead of only storing the absolute number of people who are covered (tot_coverage), calculate the proportion of people in each insurance category for each state and store this information in a new variable named perc_covered.
healthcare <- healthcare %>%
mutate(perc_covered = num_covered/tot_pop)
healthcare
The tot_spending column is reported in millions. Therefore, to calculate spending_capita you will need to adjust for this scaling factor to report it on the original scale (just dollars) and then divide by tot_pop. We can again use mutate() to accomplish this:
# get spending per capita in dollars
healthcare <- healthcare %>%
mutate(spending_capita = (tot_spending*1000000) / tot_pop)
healthcare
It is always important to check for missing data and try to determine if it will impact your analysis. Patterns within missing data can be interesting by themselves. Use filter() to find missing values in the “num_covered” variable. is.na() is one of many commands you can use to show you observations that are missing data.
healthcare %>% filter(is.na(num_covered))
How many observations have missing data? Which category / variable seems to be causing the problem?
The code below creates a scatterplot. drop_na() is a very useful command to remove missing values when graphing data.
Add to the code provided so that:
geom_text.geom_smooth() to visualize the magnitude of the linear relationship.healthcare %>%
filter(type == "Employer",
year == "2013") %>%
drop_na() %>%
ggplot(aes(x = spending_capita,
y = perc_covered)) +
geom_point() +
labs(x = "Spending per Capita",
y = "Percent Employer Insurance")
This code chunk and graph are here mostly to show you another way to visualize the data and provide some code clues for your other graphs or tables.
# visualize 2013 data by insurance type
healthcare %>%
drop_na() %>%
filter(year == "2013") %>%
ggplot(aes(x = spending_capita,
y = perc_covered,
color = region)) +
geom_point() +
facet_wrap(~type) +
labs(x = "Spending per Capita ($)",
y = "% Population Covered") +
geom_smooth(method = "lm", col = "red")
Is the the value calculated from the code below meaningful or should more data and context be provided? What is this an average of?
healthcare %>%
summarize(avg_spend_capita = mean(spending_capita))
Based on the boxplot below, describe the trends in spending per person for those that have health insurance from their Employer for the Southern region compared to the Northeast region. Make sure to mention the median, dispersion, and implications for residents of those areas or possible reasons behind any visible trends.
Read the lines of code below. Describe what each line of code is doing AND what you can infer from the image of the distribution that it creates.
ggplot(healthcare, aes(x = spending_capita))+
geom_density()+
geom_vline(aes(xintercept = mean(spending_capita)), color = "orange")+
geom_vline(aes(xintercept = median(spending_capita)), color = "black")+
geom_vline(aes(xintercept = mean(spending_capita)+2*sd(spending_capita)), color = "red")+
geom_vline(aes(xintercept = mean(spending_capita)-2*sd(spending_capita)), color = "red")
Exploratory tables are great for summarizing information for yourself, but they need some cleaning up before they are used to communicate information with others. Below is an example of a summary table containing the average, median, and standard deviation for spending per capita for each region.
table <- healthcare %>%
drop_na() %>%
group_by(region) %>%
summarize("Average" = mean(spending_capita),
"Median" = median(spending_capita),
"Standard Deviation" = sd(spending_capita) )
table
## # A tibble: 4 x 4
## region Average Median `Standard Deviation`
## <fct> <dbl> <dbl> <dbl>
## 1 Northeast 9583. 9562. 532.
## 2 South 7759. 7583. 900.
## 3 North Central 8404. 8307. 535.
## 4 West 7512. 7384. 1332.
Issues with this exploratory table that should be changed if you were making a final, pretty summary table: Round decimal points to clean up numbers, add commas for thousands, add a title, footnotes, or caption to summarize key points or state where the data is from, use a theme or formatting to frame the table and make it look professional and neat.
Make a nice summary table showing the average percent of people from each region covered by each type of health insurance (you do not need to separate data by year). You are welcome to explore different functions or make additional modifications to your table. Feel free to personalize the theme and formatting according to your preferences (I would suggest exploring a format that you like and can be applied to your final report throughout tables in this assignment). Have regions as columns and the type of health care as rows.
Note: you will need to summarize the data and then use pivot_wider() so that each region becomes a column
| type | Northeast | South | North Central | West |
|---|---|---|---|---|
| Employer | 0.54 | 0.47 | 0.54 | 0.49 |
| Medicaid | 0.18 | 0.19 | 0.15 | 0.17 |
| Medicare | 0.14 | 0.14 | 0.14 | 0.12 |
| Non-Group | 0.05 | 0.05 | 0.06 | 0.05 |
| Other Public | 0.01 | 0.03 | 0.01 | 0.03 |
| Uninsured | 0.08 | 0.13 | 0.09 | 0.12 |
Which region has the smallest proportion of its population covered by Medicaid? Which region as the largest proportion of its population uninsured? Include the percent covered in your answer.
On average, how much did the spending per capita change from 2013 to 2014 for each region? Make sure to indicate the direction of the change in your answer. Make a table to summarize the change between years. If you want an extra challenge, try to figure out how to visualize this information.
healthcare # %>%
#