Problem Set 3

Source: NY Times

What explains the differences in covid-19 cases throughout the United States? Mask usage most likely plays an important role in mitigating the spread of covid-19. We’ll explore whether mask usage corresponds to differences in covid-19 cases.

All the datasets we’ll be using are on the county level (unit of observation = county).

The datasets are listed below, each linked to its original source. Download the versions we provide on the problem set page rather than the originals: we spent some time cleaning them up so that they are a bit easier to use.

COVID-19 data. Source: NY Times
Mask usage. Source: NY Times
Population. Source: Vera Institute of Justice

Questions

1. Merge the 3 datasets together.

Hint 1: Make sure that the variable you use to link the datasets (the matching variable) has exactly the same name in every dataset. Upper and lower case matter.

Hint 2: This variable also needs to be the same class across all the datasets. I’m including code that will help.

# the fips variable must also be the same class across all datasets
# library(dplyr)  # provides %>% and mutate()
# mask <- mask %>%
#   mutate(fips = as.numeric(fips))
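
To make the two hints concrete, here is a minimal sketch of a three-way merge on toy data. The data frames, their values, and every column name other than `fips` are made up for illustration; your real files will look different, so treat this as one possible approach rather than the expected solution.

```r
library(dplyr)

# Toy stand-ins for the three county-level datasets (all values are made up)
covid <- data.frame(fips = c(1001, 1003, 1005), cases = c(120, 450, 80))
mask  <- data.frame(fips = c("1001", "1003", "1005"),
                    always = c(0.44, 0.52, 0.39),
                    stringsAsFactors = FALSE)
pop   <- data.frame(FIPS = c(1001, 1003, 1005),
                    population = c(55000, 220000, 25000))

# Hint 2: the key must have the same class everywhere (here: numeric)
mask <- mask %>% mutate(fips = as.numeric(fips))

# Hint 1: the key must have the same name everywhere (FIPS is not fips)
pop <- pop %>% rename(fips = FIPS)

# Join two datasets at a time on the shared key
merged <- covid %>%
  inner_join(mask, by = "fips") %>%
  inner_join(pop, by = "fips")

print(merged)
```

`inner_join()` keeps only counties that appear in all three datasets. `left_join()` would instead keep every row of the first dataset and fill the unmatched columns with `NA`, so comparing row counts before and after the merge is a quick sanity check.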

2. Graph and interpret the distribution of total covid-19 cases (`cases`). Should we be looking at the distribution of cases as a rate or as a count? Why? How would you describe the distributions? What do they tell you about the pandemic?

Bonus: Find the outlier counties in the top 2.5% of covid cases.
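
As a starting point, the sketch below computes a rate from a count and flags counties above the 97.5th percentile. It runs on made-up data, and the column names `population` and `cases_per_100k` are assumptions rather than names taken from the provided files.

```r
# Toy merged data (all values are made up)
merged <- data.frame(
  fips       = c(1001, 1003, 1005, 1007, 1009),
  cases      = c(120, 450, 80, 3000, 60),
  population = c(55000, 220000, 25000, 1000000, 20000)
)

# Rate: cases per 100,000 residents
merged$cases_per_100k <- merged$cases / merged$population * 100000

# Compare the distribution of the count with the distribution of the rate
hist(merged$cases, main = "Total cases (count)", xlab = "Cases")
hist(merged$cases_per_100k, main = "Cases per 100,000 residents",
     xlab = "Cases per 100,000")

# Bonus: counties above the 97.5th percentile of total cases
cutoff   <- quantile(merged$cases, 0.975)
outliers <- merged[merged$cases > cutoff, ]
print(outliers)
```

In this toy data the largest county has by far the most cases but not the highest rate, which is the kind of contrast the question asks you to think about.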

3. Graph and interpret the distribution of mask usage. Which variable do you think best describes mask usage? Do you need to convert/mutate it? How would you describe the distribution?

Hint: You’ll have to understand the mask data. See the NY Times description of the mask data.
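
One possible way to build a single mask-usage measure is sketched below. It assumes the mask data reports, for each county, the share of respondents in five answer categories with the column names shown; check the names and coding in the file you downloaded, since the cleaned version may differ. The values are made up, and combining the top two categories is only one defensible choice.

```r
# Toy mask data; column names are an assumption about the file layout
mask <- data.frame(
  fips       = c(1001, 1003, 1005),
  NEVER      = c(0.05, 0.08, 0.07),
  RARELY     = c(0.07, 0.08, 0.12),
  SOMETIMES  = c(0.13, 0.11, 0.12),
  FREQUENTLY = c(0.30, 0.25, 0.20),
  ALWAYS     = c(0.45, 0.48, 0.49)
)

# Share of respondents who say they wear a mask frequently or always
mask$high_use <- mask$FREQUENTLY + mask$ALWAYS

# Distribution of the combined measure across counties
hist(mask$high_use, main = "Share wearing masks frequently or always",
     xlab = "Share of respondents")
summary(mask$high_use)
```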

4. Is there any evidence that mask usage is leading to changes in covid-19 cases?

Hint 1: Use the 7-day or 30-day moving average of new cases. Think about why this may be a better measure for this question than total cumulative cases.
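
If your data only has cumulative counts, a moving average of new cases can be derived along the lines of this sketch. The series is made up and covers a single county; with the real data you would repeat the calculation for each county.

```r
# Toy cumulative case counts for one county over 10 days (made up)
cumulative <- c(10, 12, 15, 15, 20, 26, 30, 37, 41, 50)

# Daily new cases are the day-to-day differences of the cumulative series
new_cases <- c(NA, diff(cumulative))

# 7-day trailing moving average: mean of the current day and the 6 days before it.
# stats::filter is spelled out because dplyr also exports a function named filter.
ma7 <- as.numeric(stats::filter(new_cases, rep(1 / 7, 7), sides = 1))

print(data.frame(day = seq_along(cumulative), cumulative, new_cases, ma7))
```

The first seven values of `ma7` are `NA` because a full 7-day window of new cases is not available until day 8.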

Hint 2: Think about comparing the distribution of covid cases between counties with high mask usage vs. counties with low mask usage.
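
The comparison in Hint 2 could be set up roughly as follows. The data are made up, the column names `high_use` and `new_cases_per_100k` are hypothetical, and splitting counties at the median is just one way to define "high" and "low" mask usage.

```r
library(dplyr)

# Toy merged data (all values are made up)
merged <- data.frame(
  fips               = 1:6,
  high_use           = c(0.55, 0.62, 0.70, 0.78, 0.85, 0.91),
  new_cases_per_100k = c(40, 35, 38, 22, 18, 25)
)

# Label each county as high or low mask usage, split at the median
merged <- merged %>%
  mutate(mask_group = ifelse(high_use >= median(high_use), "high", "low"))

# Summarise the case-rate distribution within each group
summary_by_group <- merged %>%
  group_by(mask_group) %>%
  summarise(n = n(),
            mean_rate = mean(new_cases_per_100k),
            median_rate = median(new_cases_per_100k))
print(summary_by_group)

# Side-by-side view of the two distributions
boxplot(new_cases_per_100k ~ mask_group, data = merged,
        xlab = "Mask usage group", ylab = "New cases per 100,000")
```

A gap between the two groups shows an association, not proof that masks caused the difference. Counties with high and low mask usage may also differ in other ways, such as density or age.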