The Goal

Now that my group has finished reproducing all the summary statistics, tables and figures in our COVID paper, I though I’d give the verification report a start and do some exploratory analysis. One question from the data that interested me was whether varying levels of education affected perceived risks towards an individual, the population and complications from the virus. One could presume that people with higher levels of education would be more critical and independent in their thinking, drawing their own conclusions from media outlets and research reports to determine the true nature of the threat imposed by the pandemic and as a result, possibly being more or less paranoid than those with lower levels of education.

Progress

First, we have to load the relevant packages and extract the appropriate data into a separate data frame.

# load packages

library(tidyverse) # for dplyr and ggplot 
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(plotrix) # for easy calculation of standard error

# Creating the data frame

bweh <- "Covid_Data.csv" %>% 
  read_csv() %>%             
  select(educ, risk_self, risk_pop, risk_comp)
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   ResponseId = col_character(),
##   gender_all = col_character(),
##   race_all = col_character(),
##   educ = col_character(),
##   income = col_character(),
##   emp = col_character(),
##   livealone = col_character(),
##   gender_bin_f = col_character(),
##   race_bin = col_character(),
##   emp_bin = col_character()
## )
## i Use `spec()` for the full column specifications.
print(bweh)
## # A tibble: 945 x 4
##    educ                             risk_self risk_pop risk_comp
##    <chr>                                <dbl>    <dbl>     <dbl>
##  1 Graduated high school (or GED)           3        3         1
##  2 Graduated high school (or GED)           2        4         2
##  3 Some college or technical school         1        1         2
##  4 Less than high school                    2        4         2
##  5 Graduated high school (or GED)           3        2         1
##  6 Graduated high school (or GED)           2        3         2
##  7 Graduated high school (or GED)           0        4         3
##  8 Graduated high school (or GED)           3        4         2
##  9 Some college or technical school         2        4         2
## 10 Graduated high school (or GED)           1        3         1
## # ... with 935 more rows

Next, I plan on plotting this as a bar graph so there some different ways I can tackle this: one would be to use pivot_longer and plot the data, but I have recently discovered a new function, gather, which happens to be exactly the same as pivot_longer but is a little more condensed (as in it has fewer letters). Seeing as I like to have some variety every now and then, I’ll use this function instead.

Also for anyone who is interested, as gather is analogous to pivot_longer, so is spread to pivot_wider

Since I have very little experience with geom_col, being the bar graph plotting function, I consulted google for some perspective and found people often used it in conjunction with stat = “identity” for some undiscernable reason as well as with position = “dodge”, which dictates where the bars go.

For example, here is the intended bar graph without position = “dodge”

hm <- ggplot(bweh %>% gather(Risk, means, -educ), aes(x = educ, y = means, fill = Risk)) +
  geom_col(stat = 'identity')
## Warning: Ignoring unknown parameters: stat
print(hm)

As you can see, ignoring the atrocious labels and erroneous y-axis figures, it is one way of presenting the data though I find it is clearer to present it with the bars alongside each other, in which case position = “dodge” is appropriate.

ah <- ggplot(bweh %>% gather(Risk, means, -educ), aes(x = educ, y = means, fill = Risk)) +
  geom_col(stat = 'identity', position = "dodge")
## Warning: Ignoring unknown parameters: stat
print(ah)

And it appears education is meaningless

Alright, I must have messed up the data. I would immediately blame the newfound gather function but as displayed below, it hasn’t done anything wrong.

bweh %>% gather(Risk, means, -educ)
## # A tibble: 2,835 x 3
##    educ                             Risk      means
##    <chr>                            <chr>     <dbl>
##  1 Graduated high school (or GED)   risk_self     3
##  2 Graduated high school (or GED)   risk_self     2
##  3 Some college or technical school risk_self     1
##  4 Less than high school            risk_self     2
##  5 Graduated high school (or GED)   risk_self     3
##  6 Graduated high school (or GED)   risk_self     2
##  7 Graduated high school (or GED)   risk_self     0
##  8 Graduated high school (or GED)   risk_self     3
##  9 Some college or technical school risk_self     2
## 10 Graduated high school (or GED)   risk_self     1
## # ... with 2,825 more rows

It appears I may have just plotted each score point for each risk variable instead of their averages, contrary to the means column above which, in retrospect, was very poorly named.

After some more troubleshooting via google, I finally (somewhat) understood what the stat function thingy meant and by having stat = “identity”, it means that you are telling ggplot2 to skip the aggregation and that you’ll provide the y-values. Now that made no sense whatsoever to me but subsequent googling revealed the use of stat = “summary” which I can only deduce from the name it does something similar to summarise. Combining this with the fun = “mean” function, it becomes clear what it is intended to do.

Also, geom_col was supposed to be geom_bar for a reason that seems to yet again elude my understanding.

whee <- ggplot(bweh %>% gather(Risk, means, -educ), aes(x = educ, y = means, fill = Risk)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge")

print(whee)

> On the right track

Before I draw any tentative conclusions or comment on the values on the graph, I’d like to clean it up a little by:

Firstly, I will rename the risk levels to Self, Population, and Complications using the rename function then I will make the education levels into separate lines using the mutate and ifelse functions so that it will more easily fit into the graph.

new <- bweh %>% 
  rename(Self = "risk_self",
         Population = "risk_pop",
         Complications = "risk_comp") %>% 
  mutate(educ2 = ifelse(educ == "Completed 4-year college (BA, BS)",
                        "Completed 4-year \n college (BA, BS)",
                        ifelse(educ == "Completed graduate or professional degree",
                               "Completed \n graduate or \n professional degree",
                               ifelse(educ == "Graduated high school (or GED)",
                                      "Graduated high \n school (or GED)",
                                      ifelse(educ == "Some college or technical school",
                                             "Some college or \n technical school",
                                             ifelse(educ == "Less than high school",
                                                    "Less than \n high school",
                                                    educ)))))) %>% 
  select(-educ)

print(new)
## # A tibble: 945 x 4
##     Self Population Complications educ2                                
##    <dbl>      <dbl>         <dbl> <chr>                                
##  1     3          3             1 "Graduated high \n school (or GED)"  
##  2     2          4             2 "Graduated high \n school (or GED)"  
##  3     1          1             2 "Some college or \n technical school"
##  4     2          4             2 "Less than \n high school"           
##  5     3          2             1 "Graduated high \n school (or GED)"  
##  6     2          3             2 "Graduated high \n school (or GED)"  
##  7     0          4             3 "Graduated high \n school (or GED)"  
##  8     3          4             2 "Graduated high \n school (or GED)"  
##  9     2          4             2 "Some college or \n technical school"
## 10     1          3             1 "Graduated high \n school (or GED)"  
## # ... with 935 more rows

Next, to reorder the education levels so that it starts with Less than high school and gradually increases to Completed graduate or professional degree, I am going to use this factor function which will order the values into levels of your choosing.

new$educ2 <- factor(new$educ2, levels = c("Less than \n high school", 
                                          "Graduated high \n school (or GED)",
                                          "Some college or \n technical school",
                                          "Completed 4-year \n college (BA, BS)",
                                          "Completed \n graduate or \n professional degree"))

For the final edits with the labels on the x- and y-axis, I’ll just add to the graph and let’s see how it looks.

yas <- ggplot(new %>% gather(Risk, Mean, c(Complications, Population, Self)), 
             aes(x = educ2, y = Mean, fill = Risk)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge") +
  theme(axis.text = element_text(size = 8)) +
  xlab("Level of Education") +
  ylab("Average Perceived Risk")

print(yas)

Wonderful

From the data above, it appears as if the perceived risk to the population is quite stable across levels of education and is always higher than perceived risk to self and perceived risk to complications from the virus. This could suggest that most people simply assume everyone else is excessively foolhardy and/or unlucky in comparison and consequently have a higher chance of contracting the virus.

Risk of developing complications from the virus is also somewhat stable across education, with an odd dropoff in perceived risk for those who graduated high school or obtained a GED. People with lower levels of education also seem to underweight their risk of catching COVID or perhaps people with higher levels of education overweight their risk of catching COVID. Reasons why this may be the case will most likely vary depending on the individual and the level of education they possess.

Seeing as perceived risk for the self, population and complications are very similar for those who have some college education to those who have professional degrees, it is tempting to (incorrectly) conclude that minimal college education is sufficient for developing one’s worldview with no significant difference between completing a degree or dropping out. There is a high likelihood this is not the case however.

Though, it is worth noting that there is likely a correlation between level of education and age and potentially many other factors, meaning that education cannot solely account for these patterns with the extent of its influence still relatively unknown. Additionally, this pool of participants is far more educated than the US population with 88% of this sample containing people who received at least college-level education, significantly deviating from the recorded 60% of US population with a similar background. Coupled with the fact that there exists any number of explanations one can generate from these findings, it is largely uninformative but does remain interesting nevertheless to observe the trends within this one data set.

Error Bars?

I did attempt to add error bars for the graph but it was particularly difficult and onerous to figure out how to do so. For any other graph where there is only one bar per level, the geom_errorbar function would not be as tedious to use but that being said, it is certainly not impossible.

I tried to add the standard error values (calculated easily using the plotrix package which offers a std.error function) into a data set and merge it with the means but it was all very messy and hard to figure out how to tell R to get the corresponding means and standard errors to add up for the error bars.

My attempt can be seen below:

sedata <- new %>% 
  group_by(educ2) %>% 
  summarise(se1 = std.error(Complications),
            se2 = std.error(Population),
            se3 = std.error(Self),
            mean1 = mean(Complications),
            mean2 = mean(Population),
            mean3 = mean(Self))

print(sedata)
## # A tibble: 5 x 7
##   educ2                                      se1    se2    se3 mean1 mean2 mean3
## * <fct>                                    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 "Less than \n high school"              0.521  0.277  0.396   2.6   3.1   1.7 
## 2 "Graduated high \n school (or GED)"     0.136  0.117  0.123   2.11  3.06  2.07
## 3 "Some college or \n technical school"   0.0759 0.0513 0.0602  2.51  2.95  2.39
## 4 "Completed 4-year \n college (BA, BS)"  0.0734 0.0494 0.0613  2.52  2.95  2.39
## 5 "Completed \n graduate or \n professio~ 0.0935 0.0651 0.0776  2.39  2.95  2.49

That covers the data frame, but when inputting into the geom_errorbar function, you are required to have a ymax and ymin value, which normally would look something like this: ymax = mean+se, ymin = mean-se. However, since my graph has multiple levels of risk, I need to somehow input something saying ymax = mean1+se1, mean2+se2, mean3+se3, ymin = mean1-se1, mean2-se2, mean3-se3 across the educ levels but it currently transcends all comprehension so I’m willing to take a brief departure from all this coding business to do other things.

Challenges/Successes

I successfully managed to create a column graph of perceived risk across levels of education. Whilst it wasn’t too difficult an endeavour, I did encounter some issues with the geom_col and geom_bar functions and it demanded much effort and time to figure it out. There were some interesting points offered by the resulting graph but they are largely tentative and most likely trivial at best.

Unfortunately, I was unable to add error bars into the graph and this was perhaps due to fatigue and pent up frustration which precluded further attempts at trying to understand how to correctly use the geom_errorbar function.

Next Steps

Next, I might try conducting another exploratory analysis examining another question before trying to figure out how to add error bars into my existing graph.