Given my previous struggle with error bars, my goal this week was to work out how to add them properly and to find another way of presenting the data for one of my exploratory analyses, since I now find the most recent output a little dissatisfying.
First, let’s do the usual: load the relevant packages and read in the data set.
library(tidyverse)
library(gt)        # for nice tables
library(plotrix)   # for std.error()
library(extrafont) # for nice fonts
loadfonts(device = "win", quiet = TRUE) # register fonts with the Windows graphics device (Windows only)

covid <- "Covid_Data.csv" %>%
  read_csv() %>%
  select(-ResponseId)
Here is the code used last week to create the data frame for the graph. It renames the risk variables with the rename function, adds line breaks to the education level labels with the mutate and ifelse functions so they fit on the graph, and converts the education variable to a factor so that its levels appear in ascending order.
two <- covid %>%
  select(educ, risk_self, risk_pop, risk_comp) %>%
  rename(Self = "risk_self",
         Population = "risk_pop",
         Complications = "risk_comp") %>%
  mutate(educ2 = ifelse(educ == "Completed 4-year college (BA, BS)",
                        "Completed 4-year college \n (BA, BS)",
                 ifelse(educ == "Completed graduate or professional degree",
                        "Completed graduate or \n professional degree",
                 ifelse(educ == "Graduated high school (or GED)",
                        "Graduated high school \n (or GED)",
                 ifelse(educ == "Some college or technical school",
                        "Some college or \n technical school",
                        educ))))) %>%
  select(-educ)

two$educ2 <- factor(two$educ2, levels = c("Less than high school",
                                          "Graduated high school \n (or GED)",
                                          "Some college or \n technical school",
                                          "Completed 4-year college \n (BA, BS)",
                                          "Completed graduate or \n professional degree"))

glimpse(two)
## Rows: 945
## Columns: 4
## $ Self <dbl> 3, 2, 1, 2, 3, 2, 0, 3, 2, 1, 2, 1, 3, 1, 3, 1, 2, 3, 1,~
## $ Population <dbl> 3, 4, 1, 4, 2, 3, 4, 4, 4, 3, 2, 3, 2, 3, 4, 3, 3, 4, 3,~
## $ Complications <dbl> 1, 2, 2, 2, 1, 2, 3, 2, 2, 1, 2, 3, 1, 1, 1, 1, 0, 1, 1,~
## $ educ2 <fct> Graduated high school (or GED), Graduated high school (or GED), Some college or technical school, Less than high school, Graduated high school (or GED),~
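As an aside, the nested ifelse calls above could probably be avoided. Here is a minimal sketch, assuming the stringr package (part of the tidyverse) and a small made-up vector standing in for the real education column: fix the level order using the original labels, then wrap both the values and the levels with str_wrap. The width of 25 is an arbitrary choice for illustration, and the line breaks may not fall in exactly the same places as the hand-written ones.

```r
library(stringr)

# Toy stand-in for covid$educ (made-up values, not the real data)
educ <- c("Some college or technical school",
          "Less than high school",
          "Completed graduate or professional degree",
          "Graduated high school (or GED)",
          "Completed 4-year college (BA, BS)")

# Desired ascending order, written with the original (unwrapped) labels
educ_levels <- c("Less than high school",
                 "Graduated high school (or GED)",
                 "Some college or technical school",
                 "Completed 4-year college (BA, BS)",
                 "Completed graduate or professional degree")

# Wrap values and levels the same way so they still match,
# which keeps the level order intact
educ2 <- factor(str_wrap(educ, width = 25),
                levels = str_wrap(educ_levels, width = 25))

levels(educ2)
```

The advantage of this approach is that the order is defined once, on the readable labels, and the wrapping width can be changed later without retyping every level.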
As for the standard error data set, which was the main reason my attempt at creating error bars failed last week, I soon realised the solution was extremely simple and that I must not have been functioning at full capacity when trying to figure it out.
To summarise, my issue was merging the data sets so that each standard error lined up with the right risk type, and the fact that there were also education levels to consider made it all the more confusing. What I hadn’t noticed was that I didn’t need every participant’s values; I just needed the mean for each combination of education level and risk type. In retrospect, it shouldn’t have been that difficult to figure out, but I guess sleep deprivation has that effect on me. So I made the appropriate adjustments by calculating the means and putting them alongside the standard errors in one combined data set.
# means data set
sedata <- two %>%
  group_by(educ2) %>%
  summarise(Complications = mean(Complications),
            Population = mean(Population),
            Self = mean(Self)) %>%
  pivot_longer(
    cols = c(Complications, Population, Self),
    names_to = "Type",
    values_to = "means"
  )

# standard error data set
ugh <- two %>%
  group_by(educ2) %>%
  summarise(Complications = std.error(Complications),
            Population = std.error(Population),
            Self = std.error(Self)) %>%
  pivot_longer(
    cols = -educ2,
    names_to = "Type",
    values_to = "se"
  ) %>%
  select(se)

# combining means and standard error
# (bind_cols matches rows by position, so this relies on both
#  data sets being built in the same educ2/Type order)
full <- bind_cols(sedata, ugh)

gt(full)
| educ2 | Type | means | se |
|---|---|---|---|
| Less than high school | Complications | 2.600000 | 0.52068331 |
| Less than high school | Population | 3.100000 | 0.27688746 |
| Less than high school | Self | 1.700000 | 0.39581140 |
| Graduated high school (or GED) | Complications | 2.113208 | 0.13559675 |
| Graduated high school (or GED) | Population | 3.056604 | 0.11673161 |
| Graduated high school (or GED) | Self | 2.066038 | 0.12305549 |
| Some college or technical school | Complications | 2.505988 | 0.07594495 |
| Some college or technical school | Population | 2.952096 | 0.05134657 |
| Some college or technical school | Self | 2.386228 | 0.06020774 |
| Completed 4-year college (BA, BS) | Complications | 2.523220 | 0.07341554 |
| Completed 4-year college (BA, BS) | Population | 2.953560 | 0.04944779 |
| Completed 4-year college (BA, BS) | Self | 2.390093 | 0.06130204 |
| Completed graduate or professional degree | Complications | 2.389535 | 0.09345892 |
| Completed graduate or professional degree | Population | 2.947674 | 0.06506884 |
| Completed graduate or professional degree | Self | 2.488372 | 0.07757036 |
Not bad at all
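For comparison, the means and standard errors can also be computed in one pipeline, which avoids relying on two separate data sets being in the same row order. This is only a sketch on made-up data with the same shape as the real data frame; the helper se_mean is a hypothetical name, defined here as the usual sd / sqrt(n), which should agree with std.error from plotrix when there are no missing values.

```r
library(dplyr)
library(tidyr)

# Toy data with the same shape as `two` (made-up values)
toy <- tibble(
  educ2         = rep(c("Low", "High"), each = 4),
  Self          = c(1, 2, 3, 2, 3, 4, 3, 2),
  Population    = c(3, 3, 4, 2, 4, 4, 3, 3),
  Complications = c(2, 1, 2, 3, 2, 2, 3, 1)
)

# Standard error of the mean: sd / sqrt(n)
# (hypothetical helper, assumes no missing values)
se_mean <- function(x) sd(x) / sqrt(length(x))

full_alt <- toy %>%
  # long format first: one row per participant per risk type
  pivot_longer(cols = -educ2, names_to = "Type", values_to = "value") %>%
  # then summarise both statistics for each education level and risk type
  group_by(educ2, Type) %>%
  summarise(means = mean(value), se = se_mean(value), .groups = "drop")

full_alt
```

Because the mean and the standard error are calculated in the same summarise call, each pair is guaranteed to belong to the same group.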
Having solved that issue, I can now add error bars with relative ease using the geom_errorbar function. But before I do that, I’d like to turn my attention to the actual presentation of the data.
Previously, I presented it in this fashion (but now with error bars):
# bar graph with x = education, fill = risk
hm <- ggplot(full, aes(x = educ2, y = means, fill = Type)) +
  geom_col(position = "dodge") +
  geom_errorbar(aes(ymin = means - se, ymax = means + se), position = "dodge") +
  xlab("Level of Education") +
  ylab("Mean Perceived Risk")

print(hm)
Width of error bars and x-axis labelling aside, it is an okay way of presenting the data, but I can’t help feeling that there are better ways which would make comparisons between the different factors a lot clearer.
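On the width of the error bars: one common way to narrow them while keeping them centred on the dodged bars is to give both layers the same position_dodge object and set width on the error bars only. The sketch below uses a made-up summary table with the same columns as the real one; the value 0.2 is an arbitrary choice, and 0.9 is used because it is the default bar width in geom_col.

```r
library(ggplot2)

# Toy summary table with the same columns as `full` (made-up values)
toy_full <- data.frame(
  educ2 = rep(c("Low", "High"), each = 3),
  Type  = rep(c("Complications", "Population", "Self"), times = 2),
  means = c(2.6, 3.1, 1.7, 2.4, 2.9, 2.5),
  se    = c(0.5, 0.3, 0.4, 0.1, 0.1, 0.1)
)

# One shared dodge so bars and error bars are shifted by the same amount
dodge <- position_dodge(width = 0.9)

p <- ggplot(toy_full, aes(x = educ2, y = means, fill = Type)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = means - se, ymax = means + se),
                width = 0.2,        # narrower caps
                position = dodge) + # same dodge as the bars
  labs(x = "Level of Education", y = "Mean Perceived Risk")

print(p)
```

Setting the dodge width explicitly matters here: with position = "dodge", the shift is derived from each layer's own width, so narrowing the error bars alone would pull them off-centre.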
I decided to experiment a little with violin plots against my better judgement and sure enough, it produces a very disgusting and uninformative depiction of the data.
terrible <- two %>%
  # long format: one row per participant per risk type
  # (pivot_longer replaces the superseded gather)
  pivot_longer(cols = c(Complications, Population, Self),
               names_to = "Risk", values_to = "Rating") %>%
  ggplot(aes(x = educ2, y = Rating, fill = educ2)) +
  facet_wrap(vars(Risk)) +
  theme_minimal() +
  geom_jitter(width = 0.3, alpha = 0.1) +
  scale_x_discrete(labels = NULL) +
  geom_violin(alpha = 0.8) +
  scale_fill_discrete(name = "Level of Education") +
  theme(strip.text.x = element_text(family = "Arial Narrow", size = 14, face = "bold"),
        strip.text.y = element_text(family = "Arial Narrow", size = 13, face = "bold"),
        axis.title.y = element_text(family = "Arial Narrow", size = 14, face = "bold")) +
  ylab("Perceived Risk") +
  xlab("") +
  ggtitle("Various Forms of Risk from COVID-19 Across Education Levels")

print(terrible)
Because the responses are discrete values, the violins take on a bead-like appearance, which explains the look but does not make it any more appealing to the eye. Unlike the bar graph, the variability in the data is a lot more apparent, with the frequency of each response visible and comparable across the sample. However, it certainly isn’t an improvement in terms of drawing conclusions, though I do find this layout, with colour showing education level and a separate panel for each risk type, much clearer for comparing education levels directly within each risk type.
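Since the responses only take a handful of values, the frequencies the violins are hinting at can also be tabulated directly. A minimal sketch on made-up long-format data (the column name Rating and the values are assumptions for illustration, not the real survey responses):

```r
library(dplyr)

# Toy long-format data (made-up values): one row per response
toy_long <- tibble(
  educ2  = rep(c("Low", "High"), each = 5),
  Rating = c(1, 2, 2, 3, 3, 2, 3, 3, 3, 4)
)

# Count each response within each education level,
# then convert the counts to within-group proportions
freq <- toy_long %>%
  count(educ2, Rating) %>%
  group_by(educ2) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

freq
```

A table like this makes the same information explicit that the width of each "bead" only suggests.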
How about a boxplot?
worse <- two %>%
  # long format: one row per participant per risk type
  # (pivot_longer replaces the superseded gather)
  pivot_longer(cols = c(Complications, Population, Self),
               names_to = "Risk", values_to = "Rating") %>%
  ggplot(aes(x = educ2, y = Rating, fill = educ2)) +
  facet_wrap(vars(Risk)) +
  theme_minimal() +
  geom_jitter(width = 0.3, alpha = 0.1) +
  scale_x_discrete(labels = NULL) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_discrete(name = "Level of Education") +
  theme(strip.text.x = element_text(family = "Arial Narrow", size = 14, face = "bold"),
        strip.text.y = element_text(family = "Arial Narrow", size = 13, face = "bold"),
        axis.title.y = element_text(family = "Arial Narrow", size = 14, face = "bold")) +
  ylab("Perceived Risk") +
  xlab("") +
  ggtitle("Various Forms of Risk from COVID-19 Across Education Levels")

print(worse)
Another bad idea
At this point, I’m starting to regret trying to find better ways of presenting the data, as each attempt seems to be another step back. However, I did particularly enjoy the boxplot of perceived risk to the population for the less-than-high-school group.
A bar graph is definitely the best format so far, even if it sacrifices some information about variability: the error bars still give some indication of it, just not to the extent that violin plots or boxplots do.
wonderful <- ggplot(full, aes(x = educ2, y = means, fill = educ2)) +
  facet_wrap(vars(Type)) +
  geom_col(position = "dodge") +
  geom_errorbar(aes(ymin = means - se, ymax = means + se), width = 0.5, position = "dodge") +
  scale_x_discrete(labels = NULL) +
  theme_minimal() +
  scale_fill_discrete(name = "Level of Education") +
  theme(strip.text.x = element_text(family = "Arial Narrow", size = 14, face = "bold"),
        strip.text.y = element_text(family = "Arial Narrow", size = 13, face = "bold"),
        axis.title.y = element_text(family = "Arial Narrow", size = 14, face = "bold")) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 3.5)) +
  xlab("") +
  ylab("Mean Perceived Risk")

print(wonderful)
Looks good enough
And there it is. Turns out all I needed was a little facet_wrap in my life and commitment to my first choice to make it work.
I finally managed to figure out the issue with my error bars from last week, and it turns out it wasn’t particularly difficult so much as a lapse in brain activity.
After experimenting with different forms of data presentation, from boxplots to violin plots, I finally found what I consider the best way of doing so: a facet-wrapped bar graph.
Next, I am going to stop wasting time concerning myself with aesthetics and get on with my other two exploratory analyses.