Introduction

Wherever I look, I can’t seem to escape the narrative of a widespread “anti-mask” movement. This informal analysis sets out to shed some light on the following questions:
1) Is the open refusal to wear masks actually widespread, or is this being overblown?
2) If this variation in mask adoption does exist, is it correlated with support for President Trump?

Some of the datasets used in this analysis are citizen-sourced (from Kaggle), so any conclusions made should be taken with a grain of salt. Either way, I hope this provides a good showcase of my ability to clean and combine different datasets, and make inferences using linear models.

Reading in the Data

The dataset that initially motivated this analysis was compiled and estimated by Josh Katz, Margot Sanger-Katz and Kevin Quealy of the New York Times. Here is the README they included in their Github repo:

This data comes from a large number of interviews conducted online by the global data and survey firm Dynata at the request of The New York Times. The firm asked a question about mask use to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. (Several states have imposed new mask requirements since the completion of these interviews.) Specifically, each participant was asked: How often do you wear a mask in public when you expect to be within six feet of another person?

The methodology used by the researchers to obtain their estimates is also included:

To transform raw survey responses into county-level estimates, the survey data was weighted by age and gender, and survey respondents’ locations were approximated from their ZIP codes. Then estimates of mask-wearing were made for each census tract by taking a weighted average of the 200 nearest responses, with closer responses getting more weight in the average. These tract-level estimates were then rolled up to the county level according to each tract’s total population.

By rolling the estimates up to counties, it reduces a lot of the random noise that is seen at the tract level. In addition, the shapes in the map are constructed from census tracts that have been merged together — this helps in displaying a detailed map, but is less useful than county-level in analyzing the data.

setwd("~/R Data Science/Winter Break Project/covid-19-data")
mask <- read.csv("./mask-use/mask-use-by-county.csv")
head(mask)
##   COUNTYFP NEVER RARELY SOMETIMES FREQUENTLY ALWAYS
## 1     1001 0.053  0.074     0.134      0.295  0.444
## 2     1003 0.083  0.059     0.098      0.323  0.436
## 3     1005 0.067  0.121     0.120      0.201  0.491
## 4     1007 0.020  0.034     0.096      0.278  0.572
## 5     1009 0.053  0.114     0.180      0.194  0.459
## 6     1011 0.031  0.040     0.144      0.286  0.500

Reading and Cleaning

I prefer to see percents represented as whole numbers, so I multiplied all responses by 100. As the summary statistics show, in the majority of US counties, over 50 percent of the population always wears a mask. However, if we look at lower levels of mask adoption, there is plenty of variation to be explored.

The histograms below represent the frequency of counties that fall at each percent-level of the response. So if we look at the “never” histogram, it looks like most counties are clustered around 0-10 percent, meaining that in most US counties, only 0-10 percent of the county population say they never wear a mask. This histogram is skewed right, with one county having 43.2 percent of respondents saying they never wear a mask.

#change range to be within [0,100] instead of [0,1]
mask[,-1] <- sapply(mask[,-1], function(x) x*100)

sapply(mask[,-1],summary)
##             NEVER    RARELY SOMETIMES FREQUENTLY   ALWAYS
## Min.     0.000000  0.000000    0.1000    2.90000 11.50000
## 1st Qu.  3.400000  4.000000    7.9000   16.40000 39.32500
## Median   6.800000  7.300000   11.5000   20.40000 49.70000
## Mean     7.993953  8.291852   12.1318   20.77247 50.80936
## 3rd Qu. 11.300000 11.500000   15.6000   24.70000 61.37500
## Max.    43.200000 38.400000   42.2000   54.90000 88.90000
hist.data.frame(mask[, -1], mtitl = "Share of County-level Responses")

The next dataset to be read in was sourced from the USDA Economic Research Service. I was interested in exploring levels of educational attainment to see if they correlated with mask usage. This dataset represents a 5-year average of educational attainment levels by county from 2014 to 2018.

In order to merge the two datasets by county ID, I had to add leading zeros to the USDA county ID column. Then I removed columns that would I would not use in the analysis.

I also decided to combine the “FREQUENTLY” and “ALWAYS” responses in the NYT survey to create a category I called “wearers”. Based my interpretation of the survey categories, this represents the percentage of a county’s population that can confidently say they wear a mask over 50 percent of the time they are within 6 feet of someone.

#reading in the county-level education data
education <- read_excel("~/R Data Science/Winter Break Project/education.xls", 
    col_names = FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
#formatting 
colnames(education) <- education[5,]
education <- education[-c(1:5),]
education <- as.data.frame(education)
str(education)
## 'data.frame':    3283 obs. of  47 variables:
##  $ FIPS Code                                                               : chr  "00000" "01000" "01001" "01003" ...
##  $ State                                                                   : chr  "US" "AL" "AL" "AL" ...
##  $ Area name                                                               : chr  "United States" "Alabama" "Autauga County" "Baldwin County" ...
##  $ 2003 Rural-urban Continuum Code                                         : chr  NA NA "2" "4" ...
##  $ 2003 Urban Influence Code                                               : chr  NA NA "2" "5" ...
##  $ 2013 Rural-urban Continuum Code                                         : chr  NA NA "2" "3" ...
##  $ 2013 Urban Influence Code                                               : chr  NA NA "2" "2" ...
##  $ Less than a high school diploma, 1970                                   : chr  "52373312" "1062306" "6611" "18726" ...
##  $ High school diploma only, 1970                                          : chr  "34158051" "468269" "3757" "8426" ...
##  $ Some college (1-3 years), 1970                                          : chr  "11650730" "136287" "933" "2334" ...
##  $ Four years of college or higher, 1970                                   : chr  "11717266" "141936" "767" "2038" ...
##  $ Percent of adults with less than a high school diploma, 1970            : chr  "47.700000000000003" "58.700000000000003" "54.799999999999997" "59.399999999999999" ...
##  $ Percent of adults with a high school diploma only, 1970                 : chr  "31.100000000000001" "25.899999999999999" "31.100000000000001" "26.699999999999999" ...
##  $ Percent of adults completing some college (1-3 years), 1970             : chr  "10.6" "7.5" "7.7000000000000002" "7.4000000000000004" ...
##  $ Percent of adults completing four years of college or higher, 1970      : chr  "10.699999999999999" "7.7999999999999998" "6.4000000000000004" "6.5" ...
##  $ Less than a high school diploma, 1980                                   : chr  "44535197" "964840" "7074" "18125" ...
##  $ High school diploma only, 1980                                          : chr  "45947035" "704207" "6145" "15380" ...
##  $ Some college (1-3 years), 1980                                          : chr  "20794975" "278205" "2104" "6602" ...
##  $ Four years of college or higher, 1980                                   : chr  "21558480" "270063" "2117" "5498" ...
##  $ Percent of adults with less than a high school diploma, 1980            : chr  "33.5" "43.5" "40.600000000000001" "39.700000000000003" ...
##  $ Percent of adults with a high school diploma only, 1980                 : chr  "34.600000000000001" "31.800000000000001" "35.200000000000003" "33.700000000000003" ...
##  $ Percent of adults completing some college (1-3 years), 1980             : chr  "15.699999999999999" "12.5" "12.1" "14.5" ...
##  $ Percent of adults completing four years of college or higher, 1980      : chr  "16.199999999999999" "12.199999999999999" "12.1" "12.1" ...
##  $ Less than a high school diploma, 1990                                   : chr  "39343718" "843638" "6252" "17309" ...
##  $ High school diploma only, 1990                                          : chr  "47642763" "749591" "6671" "20544" ...
##  $ Some college or associate's degree, 1990                                : chr  "39571702" "553512" "4912" "15900" ...
##  $ Bachelor's degree or higher, 1990                                       : chr  "32310253" "399228" "3026" "10870" ...
##  $ Percent of adults with less than a high school diploma, 1990            : chr  "24.800000000000001" "33.100000000000001" "30" "26.800000000000001" ...
##  $ Percent of adults with a high school diploma only, 1990                 : chr  "30" "29.399999999999999" "32" "31.800000000000001" ...
##  $ Percent of adults completing some college or associate's degree, 1990   : chr  "24.899999999999999" "21.699999999999999" "23.5" "24.600000000000001" ...
##  $ Percent of adults with a bachelor's degree or higher, 1990              : chr  "20.300000000000001" "15.699999999999999" "14.5" "16.800000000000001" ...
##  $ Less than a high school diploma, 2000                                   : chr  "35715625" "714081" "5872" "17258" ...
##  $ High school diploma only, 2000                                          : chr  "52168981" "877216" "9332" "28428" ...
##  $ Some college or associate's degree, 2000                                : chr  "49864428" "746495" "7413" "28178" ...
##  $ Bachelor's degree or higher, 2000                                       : chr  "44462605" "549608" "4972" "22146" ...
##  $ Percent of adults with less than a high school diploma, 2000            : chr  "19.600000000000001" "24.699999999999999" "21.300000000000001" "18" ...
##  $ Percent of adults with a high school diploma only, 2000                 : chr  "28.600000000000001" "30.399999999999999" "33.799999999999997" "29.600000000000001" ...
##  $ Percent of adults completing some college or associate's degree, 2000   : chr  "27.399999999999999" "25.899999999999999" "26.899999999999999" "29.300000000000001" ...
##  $ Percent of adults with a bachelor's degree or higher, 2000              : chr  "24.399999999999999" "19" "18" "23.100000000000001" ...
##  $ Less than a high school diploma, 2014-18                                : chr  "26948057" "470043" "4204" "14310" ...
##  $ High school diploma only, 2014-18                                       : chr  "59265308" "1020172" "12119" "40579" ...
##  $ Some college or associate's degree, 2014-18                             : chr  "63365655" "987148" "10552" "46025" ...
##  $ Bachelor's degree or higher, 2014-18                                    : chr  "68867051" "822595" "10291" "46075" ...
##  $ Percent of adults with less than a high school diploma, 2014-18         : chr  "12.300000000000001" "14.199999999999999" "11.300000000000001" "9.6999999999999993" ...
##  $ Percent of adults with a high school diploma only, 2014-18              : chr  "27.100000000000001" "30.899999999999999" "32.600000000000001" "27.600000000000001" ...
##  $ Percent of adults completing some college or associate's degree, 2014-18: chr  "29" "29.899999999999999" "28.399999999999999" "31.300000000000001" ...
##  $ Percent of adults with a bachelor's degree or higher, 2014-18           : chr  "31.5" "24.899999999999999" "27.699999999999999" "31.300000000000001" ...
#add leading zeros to county ID  in mask dataset

mask$COUNTYFP <- as.character(mask$COUNTYFP)

for(i in 1:length(mask$COUNTYFP)){
        if(nchar(mask$COUNTYFP[i]) < 5){
                mask$COUNTYFP[i] <- paste("0", mask$COUNTYFP[i], sep = "")
        }
}

#merging the datasets by county code
merged <- merge(mask, education, by.x = "COUNTYFP", by.y = "FIPS Code", all.x = TRUE)
#removing unnecessary columns
merged <- merged[,c(1:8,45:52)]
head(merged)
##   COUNTYFP NEVER RARELY SOMETIMES FREQUENTLY ALWAYS State      Area name
## 1    01001   5.3    7.4      13.4       29.5   44.4    AL Autauga County
## 2    01003   8.3    5.9       9.8       32.3   43.6    AL Baldwin County
## 3    01005   6.7   12.1      12.0       20.1   49.1    AL Barbour County
## 4    01007   2.0    3.4       9.6       27.8   57.2    AL    Bibb County
## 5    01009   5.3   11.4      18.0       19.4   45.9    AL  Blount County
## 6    01011   3.1    4.0      14.4       28.6   50.0    AL Bullock County
##   Less than a high school diploma, 2014-18 High school diploma only, 2014-18
## 1                                     4204                             12119
## 2                                    14310                             40579
## 3                                     4901                              6486
## 4                                     2650                              7471
## 5                                     7861                             13489
## 6                                     1760                              2817
##   Some college or associate's degree, 2014-18
## 1                                       10552
## 2                                       46025
## 3                                        4566
## 4                                        3846
## 5                                       13267
## 6                                        1582
##   Bachelor's degree or higher, 2014-18
## 1                                10291
## 2                                46075
## 3                                 2220
## 4                                 1813
## 5                                 5010
## 6                                  945
##   Percent of adults with less than a high school diploma, 2014-18
## 1                                              11.300000000000001
## 2                                              9.6999999999999993
## 3                                                              27
## 4                                              16.800000000000001
## 5                                              19.800000000000001
## 6                                              24.800000000000001
##   Percent of adults with a high school diploma only, 2014-18
## 1                                         32.600000000000001
## 2                                         27.600000000000001
## 3                                         35.700000000000003
## 4                                         47.299999999999997
## 5                                                         34
## 6                                         39.700000000000003
##   Percent of adults completing some college or associate's degree, 2014-18
## 1                                                       28.399999999999999
## 2                                                       31.300000000000001
## 3                                                       25.100000000000001
## 4                                                       24.399999999999999
## 5                                                                     33.5
## 6                                                       22.300000000000001
##   Percent of adults with a bachelor's degree or higher, 2014-18
## 1                                            27.699999999999999
## 2                                            31.300000000000001
## 3                                            12.199999999999999
## 4                                                          11.5
## 5                                                          12.6
## 6                                            13.300000000000001
#sum the share of county pop. that wears mask frequently and always 
merged$wearers <- (merged$ALWAYS + merged$FREQUENTLY)
hist(merged$wearers, main = "Mask Adoption by County", xlab = "Share of county that wears masks 'Frequently' or 'Always'")

summary(merged$wearers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.50   62.30   72.20   71.58   81.90   99.20
#change column classes to numeric
merged[,9:ncol(merged)] <- apply(merged[,9:ncol(merged)], 2, as.numeric)

Already, we can see from the histogram above that the notion of a widespread anti-mask movement is somewhat sensationalized. In 75 percent of US counties, at least 60 percent of county residents wear a mask frequently or always when within 6 feet of others. Further, the median percentage of county residents who are “mask wearers” is 72 percent.

Exploratory Data Analysis

First, I ran the correlations between the new variable “mask wearers” and the remaining variables in the dataset.

cor(merged[,13:(ncol(merged)-1)], merged[,"wearers"], use = "complete.obs")
##                                                                                 [,1]
## Percent of adults with less than a high school diploma, 2014-18           0.01303711
## Percent of adults with a high school diploma only, 2014-18               -0.35189409
## Percent of adults completing some college or associate's degree, 2014-18 -0.18278990
## Percent of adults with a bachelor's degree or higher, 2014-18             0.36052924

To further understand the relationships between mask “wearers” and education levels, I created scatterplots overlaid with a simple linear model.

#renaming columns to make them easier to work with
colnames(merged)[13:16] <- c("%less.HS", "%only.HS", "%some.college", "%bachelors.or.higher")

# % Bachelor's Degree EDA
plot1 <- plot(merged$`%bachelors.or.higher`, merged$wearers, xlab = "% of adults with Bachelor's degree or higher", ylab = "% mask wearers", main = "Mask wearers vs. % Bachelor's Degree or above")
plot1
## NULL
#simple linear model 
lm1 <- lm(wearers ~ `%bachelors.or.higher`, data = merged)
summary(lm1)
## 
## Call:
## lm(formula = wearers ~ `%bachelors.or.higher`, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.849  -7.797   0.864   9.151  29.810 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            60.77509    0.54455  111.61   <2e-16 ***
## `%bachelors.or.higher`  0.50093    0.02313   21.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 3140 degrees of freedom
## Multiple R-squared:   0.13,  Adjusted R-squared:  0.1297 
## F-statistic: 469.1 on 1 and 3140 DF,  p-value: < 2.2e-16
abline(lm1)

# %only HS degree EDA
plot2 <- plot(merged$`%only.HS`, merged$wearers, xlab = "% of adults only HS degree", ylab = "% mask wearers", main = "Mask wearers vs. % Only HS Degree")
plot2
## NULL
lm2 <- lm(wearers ~ `%only.HS`, data = merged)
summary(lm2)
## 
## Call:
## lm(formula = wearers ~ `%only.HS`, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.247  -7.857   0.908   9.084  32.485 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 93.57306    1.06663   87.73   <2e-16 ***
## `%only.HS`  -0.64139    0.03045  -21.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.27 on 3140 degrees of freedom
## Multiple R-squared:  0.1238, Adjusted R-squared:  0.1236 
## F-statistic: 443.8 on 1 and 3140 DF,  p-value: < 2.2e-16
abline(lm2)

#try MLR model 

lm3 <- lm(wearers ~ `%bachelors.or.higher` + `%only.HS`, data = merged)
summary(lm3)
## 
## Call:
## lm(formula = wearers ~ `%bachelors.or.higher` + `%only.HS`, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.302  -7.671   1.049   9.043  29.351 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            76.33684    2.30394  33.133  < 2e-16 ***
## `%bachelors.or.higher`  0.30551    0.03630   8.415  < 2e-16 ***
## `%only.HS`             -0.33091    0.04762  -6.948 4.48e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.14 on 3139 degrees of freedom
## Multiple R-squared:  0.1432, Adjusted R-squared:  0.1426 
## F-statistic: 262.2 on 2 and 3139 DF,  p-value: < 2.2e-16

So educational levels yield significant coefficients in the multiple linear regression model, but also a low adjusted R-squared value. A one percent increase in the percentage of a county with a Bachelor’s degree or higher is correlated with a .305 percent increase in mask adoption. Similarly, a one percent increase in the percentage of a county with only a high school degree is correlated with a .3309 percent decrease in mask adoption. However this only explains about 14% of the total variation in mask adoption across counties.

Exploring more variables

I would be remiss to not explore the relationship between the 2020 presidential election results and the percentage of conscientious mask-wearers in a county. After some searching on Kaggle, I came across a dataset that had what I was looking for. The dataset was compiled by the user Ethan Schacht.

This gives us a rough idea of the 2020 election vote share by county but there are some oddities in the dataset and Schact does not go into detail regarding where he sourced this dataset. For some reason, there are 4867 rows despite there only being about 3142 counties in the US. Additionally, many of the columns have about 1725 NAs.

I will still clean and analyze the dataset, but any information gained should be taken in the context of this user-sourced dataset.

#read in new dataset
us <- read.csv("./county_statistics.csv")
us <- us[,-1]
dim(us)
## [1] 4867   50
head(us)
##      county state percentage16_Donald_Trump percentage16_Hillary_Clinton
## 1 Abbeville    SC                     0.629                        0.346
## 2    Acadia    LA                     0.773                        0.206
## 3  Accomack    VA                     0.545                        0.428
## 4       Ada    ID                     0.479                        0.387
## 5     Adair    IA                     0.653                        0.300
## 6     Adair    KY                     0.806                        0.161
##   total_votes16 votes16_Donald_Trump votes16_Hillary_Clinton
## 1         10724                 6742                    3712
## 2         27386                21159                    5638
## 3         15755                 8582                    6737
## 4        195587                93748                   75676
## 5          3759                 2456                    1127
## 6          8231                 6637                    1323
##   percentage20_Donald_Trump percentage20_Joe_Biden total_votes20
## 1                     0.661                  0.330         12433
## 2                     0.795                  0.191         28425
## 3                     0.542                  0.447         16938
## 4                     0.504                  0.465        259389
## 5                     0.697                  0.286          4183
## 6                     0.830                  0.159          8766
##   votes20_Donald_Trump votes20_Joe_Biden      lat       long cases deaths
## 1                 8215              4101 34.22333  -82.46171   805     17
## 2                22596              5443 30.29506  -92.41420  3182    102
## 3                 9172              7578 37.76707  -75.63235  1227     19
## 4               130699            120539 43.45266 -116.24155 17451    181
## 5                 2917              1197 41.33076  -94.47106   222      1
## 6                 7275              1391 37.10460  -85.28130   517     22
##   TotalPop    Men  Women Hispanic White Black Native Asian Pacific
## 1    24788  12044  12744      1.3  68.9  27.6    0.1   0.3     0.0
## 2    62607  30433  32174      2.4  77.5  17.6    0.1   0.1     0.0
## 3    32840  16079  16761      8.8  60.3  28.3    0.3   0.7     0.0
## 4   435117 217999 217118      7.9  85.2   1.2    0.4   2.6     0.1
## 5     7192   3552   3640      1.7  96.6   0.3    0.0   0.4     0.0
## 6    19304   9632   9672      1.8  93.4   3.6    0.1   0.1     0.0
##   VotingAgeCitizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty
## 1            19452  35254      2259        19234             799    22.7
## 2            45197  40492      2544        21591            1002    21.5
## 3            24408  42260      2253        24266            1564    19.8
## 4           316189  60151      1294        31642             725    11.8
## 5             5572  49477      2633        28861            2055     9.5
## 6            15280  36575      3426        18408            1010    21.5
##   ChildPoverty Professional Service Office Construction Production Drive
## 1         32.1         27.2    20.7   20.8         10.6       20.7  78.3
## 2         27.6         27.6    16.9   25.7         15.0       14.8  83.2
## 3         31.8         31.1    17.7   18.8         15.1       17.3  80.0
## 4         13.1         43.0    16.6   25.0          6.9        8.4  80.7
## 5         12.1         28.2    16.9   20.0         17.3       17.6  77.9
## 6         27.1         28.5    15.9   19.7         12.2       23.8  84.5
##   Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork
## 1    11.1     0.5  1.8         1.8        6.5        25.8     9505        78.8
## 2    10.3     0.2  1.6         2.2        2.5        27.6    24982        80.0
## 3    10.6     0.5  2.6         1.8        4.5        22.0    13837        74.6
## 4     7.7     0.5  1.5         2.8        6.9        20.4   214984        78.3
## 5    12.4     0.3  2.8         0.4        6.2        22.3     3680        73.8
## 6     9.0     0.0  2.6         0.5        3.4        22.2     7988        74.1
##   PublicWork SelfEmployed FamilyWork Unemployment
## 1       13.3          7.8        0.1          9.4
## 2       12.1          7.6        0.3          8.9
## 3       18.1          7.1        0.2          5.4
## 4       15.0          6.6        0.1          4.3
## 5       15.3         10.4        0.5          3.0
## 6       15.8          9.9        0.1          6.2
sapply(us, function(x) sum(is.na(x)))
##                       county                        state 
##                            0                            0 
##    percentage16_Donald_Trump percentage16_Hillary_Clinton 
##                         1756                         1756 
##                total_votes16         votes16_Donald_Trump 
##                         1756                         1756 
##      votes16_Hillary_Clinton    percentage20_Donald_Trump 
##                         1756                          377 
##       percentage20_Joe_Biden                total_votes20 
##                          377                          234 
##         votes20_Donald_Trump            votes20_Joe_Biden 
##                          234                          234 
##                          lat                         long 
##                         1615                         1615 
##                        cases                       deaths 
##                         1615                         1615 
##                     TotalPop                          Men 
##                         1725                         1725 
##                        Women                     Hispanic 
##                         1725                         1725 
##                        White                        Black 
##                         1725                         1725 
##                       Native                        Asian 
##                         1725                         1725 
##                      Pacific             VotingAgeCitizen 
##                         1725                         1725 
##                       Income                    IncomeErr 
##                         1725                         1725 
##                 IncomePerCap              IncomePerCapErr 
##                         1725                         1725 
##                      Poverty                 ChildPoverty 
##                         1725                         1726 
##                 Professional                      Service 
##                         1725                         1725 
##                       Office                 Construction 
##                         1725                         1725 
##                   Production                        Drive 
##                         1725                         1725 
##                      Carpool                      Transit 
##                         1725                         1725 
##                         Walk                  OtherTransp 
##                         1725                         1725 
##                   WorkAtHome                  MeanCommute 
##                         1725                         1725 
##                     Employed                  PrivateWork 
##                         1725                         1725 
##                   PublicWork                 SelfEmployed 
##                         1725                         1725 
##                   FamilyWork                 Unemployment 
##                         1725                         1725

Since the new dataset does not have county ID’s, we will need to join the two dataframes by the county name and state abbreviation columns.

#merge the columns from the US dataset to the aggregate merged set
merged$`Area name` <- strsplit(merged$`Area name`, split = " ")
merged$`Area name` <- sapply(merged$`Area name`, function(x) x[1])

merged <- merge(merged, us, by.x = c('Area name', 'State'), by.y = c('county', 'state'), all.x = TRUE)

#express percentages within range of [0,100]
merged[,23:24] <- merged[,23:24]*100

Again, we will explore the relationships of this new dataset by running a correlation between all variables and the share of mask “wearers”.

cor(merged$wearers, merged[,18:65], use = "complete.obs")
##      percentage16_Donald_Trump percentage16_Hillary_Clinton total_votes16
## [1,]                -0.4671356                    0.4817604     0.3201092
##      votes16_Donald_Trump votes16_Hillary_Clinton percentage20_Donald_Trump
## [1,]            0.3280372               0.2938529                -0.4966669
##      percentage20_Joe_Biden total_votes20 votes20_Donald_Trump
## [1,]              0.5017273     0.3045235            0.3009313
##      votes20_Joe_Biden       lat      long     cases    deaths  TotalPop
## [1,]         0.2912494 -0.215381 0.1542209 0.2300885 0.2128475 0.3005584
##            Men     Women  Hispanic      White     Black      Native     Asian
## [1,] 0.3009001 0.3001504 0.2782887 -0.3371633 0.1819226 -0.09867881 0.3158354
##        Pacific VotingAgeCitizen    Income  IncomeErr IncomePerCap
## [1,] 0.0701423        0.3125965 0.2395558 -0.1792755    0.2122368
##      IncomePerCapErr     Poverty ChildPoverty Professional   Service    Office
## [1,]      -0.2107613 -0.01300673  0.005401577    0.1811291 0.1679398 0.2338466
##      Construction Production       Drive     Carpool   Transit       Walk
## [1,]   -0.2160054 -0.2766698 -0.02112641 0.001268234 0.2283708 -0.1199684
##      OtherTransp  WorkAtHome MeanCommute Employed PrivateWork PublicWork
## [1,]   0.1137755 -0.09420392   0.3047717 0.299713   0.1389315   0.034555
##      SelfEmployed FamilyWork Unemployment
## [1,]   -0.2964324 -0.1515119    0.1868341
#simple linear regression for percent trump voters and mask use 
plot(merged$percentage20_Donald_Trump, merged$wearers, main = "Mask wearers vs. 2020 Trump Vote Share", xlab = "2020 Trump vote %", ylab = "%  mask wearers")
lm4 <- lm(wearers ~ percentage20_Donald_Trump, data = merged)
summary(lm4)
## 
## Call:
## lm(formula = wearers ~ percentage20_Donald_Trump, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.153  -7.276   0.683   8.068  35.375 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               98.94587    0.93336  106.01   <2e-16 ***
## percentage20_Donald_Trump -0.42169    0.01375  -30.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.27 on 2873 degrees of freedom
##   (267 observations deleted due to missingness)
## Multiple R-squared:  0.2467, Adjusted R-squared:  0.2464 
## F-statistic: 940.8 on 1 and 2873 DF,  p-value: < 2.2e-16
abline(lm4)

The model has an adjusted R-squared value of .2464 so 2020 Trump vote share explains more of the variation in mask adoption than levels of education. It is important to note that the residual standard error is 11.27, indicating that there is plenty of unexplored variation in mask adoption that cannot be explained by vote share alone. Still, the coefficient is significant, indicating that a one percent increase in 2020 Trump vote share is correlated with a .42 percent decrease in mask adoption.

Next we will combine the educational predictor and 2020 vote share in a multiple linear regression model.

lm5 <- lm(wearers ~ percentage20_Donald_Trump + `%only.HS`, merged, na.action = na.omit)
summary(lm5)
## 
## Call:
## lm(formula = wearers ~ percentage20_Donald_Trump + `%only.HS`, 
##     data = merged, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.308  -6.876   0.753   7.766  35.351 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               103.94537    1.15179  90.247  < 2e-16 ***
## percentage20_Donald_Trump  -0.37051    0.01533 -24.170  < 2e-16 ***
## `%only.HS`                 -0.24248    0.03329  -7.285 4.13e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.17 on 2872 degrees of freedom
##   (267 observations deleted due to missingness)
## Multiple R-squared:  0.2603, Adjusted R-squared:  0.2598 
## F-statistic: 505.4 on 2 and 2872 DF,  p-value: < 2.2e-16

After adjusting for the % of the population in a county that only has a high school degree, a one percent increase in 2020 Trump vote share is correlated with a .37 percent decrease in mask adoption. The coefficients for % with HS degree and 2020 Trump vote share are significant, and the adjusted R-squared value is greater than any simple model run in this analysis.

Conclusion

First, we can conclude that in almost all counties, the vast majority of individuals do wear a mask within 6 feet of others. This seems to negate the idea of a widespread mask movement.

Second, there is a significant negative correlation between support for President Trump and mask adoption after controlling for education levels. However, this relationship is small in magnitude, and there is a high residual standard error. So support for Trump might be a small, significant factor among many other factors that explain mask adoption by county.