The Goal

Given my previous struggle with error bars, my goal was to work out how to do them properly and to find another way of presenting the data for one of my exploratory analyses, since I now find the most recent output a little dissatisfying.

Progress

First, let’s do the usual: load the relevant packages and read in the data set.

library(tidyverse)
library(gt) # for nice tables
library(plotrix) # for std.error()
library(extrafont) # for nice fonts

loadfonts(device = "win", quiet = TRUE) # register fonts with the Windows graphics device

covid <- "Covid_Data.csv" %>% 
  read_csv() %>% 
  select(-ResponseId)

Here is the code I used last week to create the data frame for the graph. It renames the risk variables with rename, shortens the education level names with mutate and ifelse so they fit on the graph, and converts the education variable to a factor so that its levels appear in ascending order.

two <- covid %>% 
  select(educ, risk_self, risk_pop, risk_comp) %>% 
  rename(Self = "risk_self", 
         Population = "risk_pop",
         Complications = "risk_comp") %>% 
  mutate(educ2 = ifelse(educ == "Completed 4-year college (BA, BS)",
                        "Completed 4-year college \n (BA, BS)",
                        ifelse(educ == "Completed graduate or professional degree",
                               "Completed graduate or \n professional degree",
                               ifelse(educ == "Graduated high school (or GED)",
                                      "Graduated high school \n (or GED)",
                                      ifelse(educ == "Some college or technical school",
                                             "Some college or \n technical school",
                                             educ))))) %>% 
  select(-educ)

two$educ2 <- factor(two$educ2, levels = c("Less than high school", 
                           "Graduated high school \n (or GED)",
                           "Some college or \n technical school",
                           "Completed 4-year college \n (BA, BS)",
                           "Completed graduate or \n professional degree"))

glimpse(two)
## Rows: 945
## Columns: 4
## $ Self          <dbl> 3, 2, 1, 2, 3, 2, 0, 3, 2, 1, 2, 1, 3, 1, 3, 1, 2, 3, 1,~
## $ Population    <dbl> 3, 4, 1, 4, 2, 3, 4, 4, 4, 3, 2, 3, 2, 3, 4, 3, 3, 4, 3,~
## $ Complications <dbl> 1, 2, 2, 2, 1, 2, 3, 2, 2, 1, 2, 3, 1, 1, 1, 1, 0, 1, 1,~
## $ educ2         <fct> Graduated high school \n (or GED), Graduated high school \n (or GED), Some college or \n technical school, Less than high school,~
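As an aside, the relabelling and the reordering can probably be done in one step, because base R's factor() accepts both levels and labels. The sketch below is only an illustration under that assumption: the educ vector is a made-up sample rather than the real column, and educ_levels and educ_labels are names I introduced for the example.

```r
# Hypothetical sample of raw education responses (not the real data).
educ <- c("Some college or technical school",
          "Less than high school",
          "Completed 4-year college (BA, BS)",
          "Graduated high school (or GED)",
          "Completed graduate or professional degree")

# Original category names, listed in ascending order of education.
educ_levels <- c("Less than high school",
                 "Graduated high school (or GED)",
                 "Some college or technical school",
                 "Completed 4-year college (BA, BS)",
                 "Completed graduate or professional degree")

# Shortened display labels, in the same order as educ_levels.
educ_labels <- c("Less than high school",
                 "Graduated high school \n (or GED)",
                 "Some college or \n technical school",
                 "Completed 4-year college \n (BA, BS)",
                 "Completed graduate or \n professional degree")

# factor() maps each level to its label and fixes the level order in one call.
educ2 <- factor(educ, levels = educ_levels, labels = educ_labels)

levels(educ2)
```

The trade-off is that the labels live in a separate vector from the levels, so the two have to be kept in the same order by hand.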

As for the standard error data set, which was the main reason my attempt at creating error bars failed last week, I soon realised the solution was extremely simple; I must not have been functioning at full capacity when I was trying to figure it out.

To summarise, my issue was merging the data sets so that each standard error would line up with the right risk type, and having education levels to consider as well made it all the more confusing. What I didn’t notice was that I didn’t need the values for every participant; I just needed the means. In retrospect, it shouldn’t have been that difficult to figure out, but I guess sleep deprivation has that effect on me. So I made the appropriate adjustments by calculating the means and putting them alongside the standard errors in one combined data set.
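For reference, the standard error of a mean is the sample standard deviation divided by the square root of the number of observations. As far as I understand, that is the quantity std.error from plotrix returns, but treat that as my assumption. Here is a minimal base R sketch with made-up ratings; se_mean is a helper name I invented for the example.

```r
# Hypothetical risk ratings (illustrative values only, not the survey data).
ratings <- c(3, 2, 1, 2, 3, 2, 0, 3, 2, 1)

# Standard error of the mean: sample standard deviation over sqrt(n).
se_mean <- function(x) {
  x <- x[!is.na(x)]          # drop missing responses before counting
  sd(x) / sqrt(length(x))
}

mean(ratings)
se_mean(ratings)
```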

# means data set

sedata <- two %>% 
  group_by(educ2) %>% 
  summarise(Complications = mean(Complications),
            Population = mean(Population),
            Self = mean(Self)) %>% 
  pivot_longer(
    cols = c(Complications, Population, Self),
    names_to = "Type",
    values_to = "means"
  )

# standard error data set

ugh <- two %>% 
  group_by(educ2) %>% 
  summarise(Complications = std.error(Complications),
            Population = std.error(Population),
            Self = std.error(Self)) %>% 
  pivot_longer(
    cols = -educ2,
    names_to = "Type",
    values_to = "se"
  ) %>% 
  select(se)

# combining means and standard error
# (bind_cols matches rows by position, so both tables must share the same row order)

full <- bind_cols(sedata, ugh)

gt(full)
educ2                                      Type           means     se
Less than high school                      Complications  2.600000  0.52068331
Less than high school                      Population     3.100000  0.27688746
Less than high school                      Self           1.700000  0.39581140
Graduated high school (or GED)             Complications  2.113208  0.13559675
Graduated high school (or GED)             Population     3.056604  0.11673161
Graduated high school (or GED)             Self           2.066038  0.12305549
Some college or technical school           Complications  2.505988  0.07594495
Some college or technical school           Population     2.952096  0.05134657
Some college or technical school           Self           2.386228  0.06020774
Completed 4-year college (BA, BS)          Complications  2.523220  0.07341554
Completed 4-year college (BA, BS)          Population     2.953560  0.04944779
Completed 4-year college (BA, BS)          Self           2.390093  0.06130204
Completed graduate or professional degree  Complications  2.389535  0.09345892
Completed graduate or professional degree  Population     2.947674  0.06506884
Completed graduate or professional degree  Self           2.488372  0.07757036
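A possible alternative to building two tables and binding them is to reshape to long format first and then compute both statistics in a single summarise call. This is only a sketch: toy is a made-up stand-in for the real data frame, and the standard error is written out by hand as sd over the square root of n, which I am assuming matches what std.error gives.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the `two` data frame (not the real responses).
toy <- tibble(
  educ2         = factor(c("A", "A", "A", "B", "B", "B")),
  Self          = c(1, 2, 3, 2, 2, 4),
  Population    = c(3, 4, 4, 2, 3, 4),
  Complications = c(1, 1, 2, 3, 2, 2)
)

summary_tbl <- toy %>%
  # Go long first so each row is one response to one risk type.
  pivot_longer(cols = -educ2, names_to = "Type", values_to = "score") %>%
  group_by(educ2, Type) %>%
  # Mean and standard error are computed per group in the same call,
  # so they cannot fall out of alignment.
  summarise(means = mean(score),
            se    = sd(score) / sqrt(n()),
            .groups = "drop")

summary_tbl
```

Because each mean and its standard error come out of the same grouped row, nothing depends on two tables happening to share a row order.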

Not bad at all

Having solved that issue, I can now add error bars with relative ease using the geom_errorbar function. But before I do that, I’d like to turn my attention to the actual presentation of the data.

Previously, I presented it in this fashion (but now with error bars):

# bar graph with x = education, fill = risk

hm <- ggplot(full, aes(x = educ2, y = means, fill = Type)) +
  geom_col(position = "dodge") +
  geom_errorbar(aes(ymin = means-se, ymax = means+se), position = "dodge") +
  xlab("Level of Education") +
  ylab("Mean Perceived Risk")

print(hm)

Width of error bars and x-axis labelling aside, it is an okay way of presenting the data, but I can’t help feeling that there are better ways that would make comparisons between the different factors a lot clearer.
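On the error bar width: one common fix, as I understand it, is to give both layers the same position_dodge width so the bars and the error bars are shifted by the same amount, and then to set the error bar width separately. The sketch below uses a made-up toy data frame rather than the real summary, and 0.9 is chosen on the assumption that the columns keep ggplot2's default width of 0.9.

```r
library(ggplot2)

# Hypothetical summary values standing in for `full` (not the real results).
toy <- data.frame(
  educ2 = rep(c("Lower", "Higher"), each = 3),
  Type  = rep(c("Complications", "Population", "Self"), times = 2),
  means = c(2.6, 3.1, 1.7, 2.4, 2.9, 2.5),
  se    = c(0.5, 0.3, 0.4, 0.1, 0.1, 0.1)
)

# One dodge specification shared by both layers keeps bars and error bars aligned.
dodge <- position_dodge(width = 0.9)

aligned <- ggplot(toy, aes(x = educ2, y = means, fill = Type)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = means - se, ymax = means + se),
                width = 0.2, position = dodge) +
  xlab("Level of Education") +
  ylab("Mean Perceived Risk")

print(aligned)
```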

I decided to experiment a little with violin plots against my better judgement and, sure enough, the result is a very disgusting and uninformative depiction of the data.

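# pivot_longer() supersedes gather(); the reshaped data is the same
terrible <- ggplot(two %>% pivot_longer(c(Complications, Population, Self),
                                        names_to = "Risk", values_to = "Mean"),
                 aes(x = educ2, y = Mean, fill = educ2)) +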
  facet_wrap(vars(Risk)) +
  theme_minimal() +
  geom_jitter(width = 0.3, alpha = 0.1) + 
  scale_x_discrete(labels = NULL) +
  geom_violin(alpha = 0.8) +
  scale_fill_discrete(name = "Level of Education") +
  theme(strip.text.x = element_text(family="Arial Narrow", size = 14, face = "bold"),
        strip.text.y = element_text(family="Arial Narrow", size = 13, face = "bold"),
        axis.title.y = element_text(family="Arial Narrow", size = 14, face = "bold")) +
  ylab("Perceived Risk") +
  xlab("") +
  ggtitle("Various Forms of Risk from COVID-19 Across Education Levels")

print(terrible)

The y values are discrete, which explains the bead-like appearance of some of the violins but does not make them any more appealing to the eye. Compared with the bar graph, the variability in the data is a lot more apparent, with the frequency of responses visible and comparable across the sample. However, it certainly isn’t an improvement in terms of drawing conclusions. I do find the format clearer for directly comparing education levels across risk types, though, with colour showing the education levels and the risk types separated into different panels.
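Since the ratings only take a handful of distinct values, the information the violins are trying to show is really just how often each rating occurs in each group. A rough base R sketch of that tabulation, using made-up responses and hypothetical group names:

```r
# Hypothetical responses for two education groups (not the survey data).
toy <- data.frame(
  educ2 = c("Lower", "Lower", "Lower", "Lower",
            "Higher", "Higher", "Higher", "Higher"),
  Self  = c(1, 2, 2, 3, 2, 3, 3, 3)
)

# Count how often each rating occurs within each education group.
counts <- table(toy$educ2, toy$Self)

# Convert counts to within-group proportions (each row sums to 1).
props <- prop.table(counts, margin = 1)

round(props, 2)
```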

How about a boxplot?

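# pivot_longer() supersedes gather(); the reshaped data is the same
worse <- ggplot(two %>% pivot_longer(c(Complications, Population, Self),
                                     names_to = "Risk", values_to = "Mean"),
                 aes(x = educ2, y = Mean, fill = educ2)) +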
  facet_wrap(vars(Risk)) +
  theme_minimal() +
  geom_jitter(width = 0.3, alpha = 0.1) + 
  scale_x_discrete(labels = NULL) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_discrete(name = "Level of Education") +
  theme(strip.text.x = element_text(family="Arial Narrow", size = 14, face = "bold"),
        strip.text.y = element_text(family="Arial Narrow", size = 13, face = "bold"),
        axis.title.y = element_text(family="Arial Narrow", size = 14, face = "bold")) +
  ylab("Perceived Risk") +
  xlab("") +
  ggtitle("Various Forms of Risk from COVID-19 Across Education Levels")

print(worse)

Another bad idea

At this point, I’m starting to regret trying to find better ways of presenting the data as each attempt seems to be another step back. However, I did particularly enjoy the boxplot for the less than high school perceived risk towards the population.

A bar graph is definitely the best format so far, even if it sacrifices some of the information about variability: the error bars do give some indication of it, just not to the extent that violin plots or boxplots do.

wonderful <- ggplot(full, aes(x = educ2, y = means, fill = educ2)) +
  facet_wrap(vars(Type)) +
  geom_col(position = "dodge") +
  geom_errorbar(aes(ymin = means-se, ymax = means+se), width = 0.5, position = "dodge") +
  scale_x_discrete(labels = NULL) +
  theme_minimal() +
  scale_fill_discrete(name = "Level of Education") +
  theme(strip.text.x = element_text(family="Arial Narrow", size = 14, face = "bold"),
        strip.text.y = element_text(family="Arial Narrow", size = 13, face = "bold"),
        axis.title.y = element_text(family="Arial Narrow", size = 14, face = "bold")) +
  scale_y_continuous(expand = c(0,0), limits = c(0, 3.5)) +
  xlab("") +
  ylab("Mean Perceived Risk")


print(wonderful)

Looks good enough
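The error bars in this plot span one standard error either side of the mean. If an approximate 95% interval were wanted instead, one rough option is to multiply the standard error by the normal quantile of about 1.96. The sketch below does that for one row of the summary table (Less than high school, Complications). The normal approximation is my assumption here; for a small group a t quantile would be more appropriate, but that needs the group size, which the table does not show.

```r
# Mean and standard error for one group, copied from the summary table
# (Less than high school, Complications).
m  <- 2.600000
se <- 0.52068331

# Approximate 95% interval using the normal quantile (about 1.96).
z  <- qnorm(0.975)
ci <- c(lower = m - z * se, upper = m + z * se)

round(ci, 2)
```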

And there it is. Turns out all I needed was a little facet_wrap in my life and commitment to my first choice to make it work.

Challenges/Successes

I finally managed to figure out last week’s issue with my error bars, and it turns out it wasn’t particularly difficult; it was just a lapse in brain activity.

After experimenting with different forms of data presentation, from boxplots to violin plots, I finally found what I deem to be the best way of presenting the data: a facet-wrapped bar graph.

Next Steps

Next, I am going to stop wasting time concerning myself with aesthetics and get on with my other two exploratory analyses.