ggplot2
See my presentation slides here (right-click and open the link in a new tab).
ggplot2 in mind: layers of data + aesthetic mappings + geometries
These data sets are from: http://www.routledgetextbooks.com/textbooks/9781138024571/
The Ellis & Yuan (2004) data is simplified and reduced to include just two of the variables.
The Obarow data is simplified and manipulated to create the graphs that I would like to show, while keeping the concepts of a second language acquisition experiment.
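library(ggplot2)  # every plot below uses ggplot2
# stringsAsFactors = TRUE keeps text columns as factors, matching the str() output below
# (read.csv() stopped doing this by default in R 4.0)
ell <- read.csv("EllisYuan.csv", stringsAsFactors = TRUE)
obarow <- read.csv("obarow.csv", stringsAsFactors = TRUE)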
Let’s look at the data structure again.
str(obarow)
## 'data.frame': 67 obs. of 10 variables:
## $ id : int 3 2 6 7 1 10 5 8 11 14 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
## $ grade : int 1 1 1 1 1 1 2 2 2 2 ...
## $ treatment: Factor w/ 4 levels "NMNP","NMYP",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pretest : int 15 11 13 14 13 14 18 16 15 17 ...
## $ posttest : int 14 11 13 15 12 14 16 14 13 16 ...
## $ gain1 : int -1 0 0 1 -1 0 -2 -2 -2 -1 ...
## $ gain2 : int -2 1 0 1 0 5 4 2 1 -1 ...
## $ gain3 : int 0 0 1 3 -1 -1 0 0 1 1 ...
## $ gain4 : int 0 3 1 0 1 -1 0 0 0 1 ...
I would like to show (any) interaction effect between treatment condition and grade level on post-test score, so the variables I will plot are posttest, treatment, and grade.
I will change grade to a factor with 3 levels:
obarow$grade <- factor(obarow$grade, levels = 1:3,
                       labels = c("1st grade", "3rd grade", "5th grade"))
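To see how `levels` and `labels` pair up in that call, here is a minimal base-R sketch. The vector `grade_codes` is made up for illustration and is not the Obarow data.

```r
# Hypothetical grade codes standing in for obarow$grade (not the real data)
grade_codes <- c(1, 1, 2, 3, 2, 1)

# Each value in `levels` is paired with the label in the same position
grade_f <- factor(grade_codes, levels = 1:3,
                  labels = c("1st grade", "3rd grade", "5th grade"))

levels(grade_f)  # the three labels, in the order given by `levels`
table(grade_f)   # counts per level; a code outside 1:3 would become NA
```

Because the levels are stated explicitly, the label order does not depend on which codes happen to appear in the data.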
You can look at this data in two different ways:
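# grade on the x-axis, treatment as the fill color
ggplot(obarow, aes(x = grade, y = posttest, fill = treatment)) +
  geom_boxplot()
# treatment on the x-axis, grade as the fill color
ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) +
  geom_boxplot()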
How you map the variables will depend on what you’re trying to show based on your research question.
I’ll first try out boxplots with data points:
ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) +
  geom_boxplot(outlier.shape = NA) +  # hide boxplot outliers; the raw points are drawn next
  geom_point(alpha = .5)
You will have to set position to position_dodge() to visually separate the groups:
ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) +
  geom_boxplot(outlier.shape = NA) +  # hide boxplot outliers
  geom_point(position = position_dodge(.75), alpha = .5)  # .75 matches the default boxplot dodge width
position_jitterdodge simultaneously jitters and dodges:
ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) +
  geom_boxplot(outlier.shape = NA) +  # hide boxplot outliers
  geom_point(position = position_jitterdodge(.2), alpha = .4)
# note: `geom_jitter()` with `position_dodge()` would dodge the points without jittering them,
# because the supplied position replaces the default jitter
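The single number passed to position_jitterdodge() above is its first argument, `jitter.width`. The sketch below names both widths explicitly, which can make the call easier to read. The data frame `toy` is simulated and only mimics the shape of the Obarow data.

```r
library(ggplot2)

# Simulated stand-in for the obarow data (hypothetical values)
set.seed(1)
toy <- data.frame(
  treatment = rep(c("NMNP", "NMYP"), each = 30),
  grade     = factor(rep(c("1st grade", "3rd grade", "5th grade"), times = 20)),
  posttest  = round(rnorm(60, mean = 15, sd = 2))
)

p <- ggplot(toy, aes(x = treatment, y = posttest, fill = grade)) +
  geom_boxplot(outlier.shape = NA) +
  # jitter.width: horizontal noise within each group
  # dodge.width: spacing between groups; .75 lines up with the default boxplot dodge
  geom_point(position = position_jitterdodge(jitter.width = .2,
                                             dodge.width = .75),
             alpha = .4)
p
```

Keeping `dodge.width` equal to the boxplot dodge width is what keeps each cloud of points centered on its own box.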
But maybe I want to simply use lines to show the trends:
ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(aes(group = grade, color = grade), fun = "mean", geom = "line",
               position = position_dodge(.2), size = 1) +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "Effect of learning condition on vocabulary learning by grade level",
       x = "", y = "Vocabulary score\n") +
  theme_bw(base_size = 12) +
  theme(panel.grid.major.x = element_blank(),
        axis.ticks.x = element_blank())
This is a typical graph for an interaction, but I could add more information by adding boxplots:
ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) +
  geom_boxplot(width = .2, alpha = .5, outlier.shape = NA) +  # narrow boxes, outliers hidden
  stat_summary(aes(group = grade, color = grade), fun = "mean", geom = "line",
               position = position_dodge(.2), size = 1) +
  stat_summary(fun = "mean", geom = "point", shape = 22, size = 2,
               position = position_dodge(.2), show.legend = FALSE) +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "Effect of learning condition on vocabulary learning by grade level",
       x = "", y = "Vocabulary score\n") +
  theme_bw(base_size = 12) +
  theme(panel.grid.major.x = element_blank(),
        axis.ticks.x = element_blank())
# run two-way ANOVA with interaction (Type III); requires the car package
options(contrasts = c("contr.sum", "contr.poly"))
mod <- lm(posttest ~ treatment*grade, data = obarow)
car::Anova(mod, type = "III")
## Anova Table (Type III tests)
##
## Response: posttest
## Sum Sq Df F value Pr(>F)
## (Intercept) 22170.7 1 7555.0822 < 2.2e-16 ***
## treatment 101.2 3 11.4989 5.86e-06 ***
## grade 1063.1 2 181.1429 < 2.2e-16 ***
## treatment:grade 33.5 6 1.9045 0.09643 .
## Residuals 161.4 55
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When I actually run the ANOVA, the interaction is not significant (p = .096); only the main effects are. But from looking at the graph, there might be something going on with the NMYP group in the lower grades. Collecting more data would help reveal whether there is an interaction.
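One way to inspect a possible interaction numerically is to tabulate the cell means and compare the treatment difference within each grade. This is only a sketch on simulated data: `toy` and its values are hypothetical and were generated with no true interaction.

```r
# Simulated stand-in for the obarow data (hypothetical values, no real interaction built in)
set.seed(2)
toy <- expand.grid(obs = 1:5,
                   treatment = c("NMNP", "NMYP"),
                   grade = c("1st grade", "3rd grade", "5th grade"))
toy$posttest <- round(rnorm(nrow(toy), mean = 12 + 3 * as.integer(toy$grade), sd = 1.5))

# Mean post-test score for every treatment-by-grade cell
cell_means <- with(toy, tapply(posttest, list(treatment, grade), mean))
round(cell_means, 1)

# Treatment difference within each grade; roughly equal differences
# correspond to roughly parallel lines in the interaction plot
diffs <- cell_means["NMYP", ] - cell_means["NMNP", ]
round(diffs, 1)
```

If the differences vary a lot from grade to grade, the lines in the plot will not be parallel, which is what the interaction term in the ANOVA tests.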
str(obarow)
## 'data.frame': 67 obs. of 10 variables:
## $ id : int 3 2 6 7 1 10 5 8 11 14 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
## $ grade : Factor w/ 3 levels "1st grade","3rd grade",..: 1 1 1 1 1 1 2 2 2 2 ...
## $ treatment: Factor w/ 4 levels "NMNP","NMYP",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pretest : int 15 11 13 14 13 14 18 16 15 17 ...
## $ posttest : int 14 11 13 15 12 14 16 14 13 16 ...
## $ gain1 : int -1 0 0 1 -1 0 -2 -2 -2 -1 ...
## $ gain2 : int -2 1 0 1 0 5 4 2 1 -1 ...
## $ gain3 : int 0 0 1 3 -1 -1 0 0 1 1 ...
## $ gain4 : int 0 3 1 0 1 -1 0 0 0 1 ...
There are four gain scores (four measurements over time), and we want to plot how they change.
To do that, we need to gather the four scores into one column, so that we can map the four time points onto the x-axis and the score onto the y-axis.
ob3 <- tidyr::gather(obarow, key = gains, value = score, gain1:gain4)
str(ob3)
## 'data.frame': 268 obs. of 8 variables:
## $ id : int 3 2 6 7 1 10 5 8 11 14 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
## $ grade : Factor w/ 3 levels "1st grade","3rd grade",..: 1 1 1 1 1 1 2 2 2 2 ...
## $ treatment: Factor w/ 4 levels "NMNP","NMYP",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pretest : int 15 11 13 14 13 14 18 16 15 17 ...
## $ posttest : int 14 11 13 15 12 14 16 14 13 16 ...
## $ gains : chr "gain1" "gain1" "gain1" "gain1" ...
## $ score : int -1 0 0 1 -1 0 -2 -2 -2 -1 ...
ob3$gains <- factor(ob3$gains)
Now we have gains as a variable that holds the time points. I changed it to a factor.
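In newer versions of tidyr (1.0.0 and later), gather() is superseded by pivot_longer(). A minimal sketch of the same reshaping follows. The data frame `wide` is a made-up three-row example with the same column layout as the gain scores.

```r
library(tidyr)

# Hypothetical wide data with the same column layout as the gain scores
wide <- data.frame(
  id    = 1:3,
  gain1 = c(-1, 0, 0),
  gain2 = c(-2, 1, 0),
  gain3 = c(0, 0, 1),
  gain4 = c(0, 3, 1)
)

# names_to plays the role of gather()'s `key`, values_to the role of `value`
long <- pivot_longer(wide, cols = gain1:gain4,
                     names_to = "gains", values_to = "score")
long$gains <- factor(long$gains)
long
```

The rows come out ordered by id rather than by time point, but that makes no difference to the plots.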
Let’s draw line graphs that show group mean changes:
ggplot(ob3, aes(x = gains, y = score)) +
  geom_line(aes(group = id), color = "grey80") +  # one grey line per participant
  stat_summary(aes(group = treatment, color = treatment),
               fun = "mean", geom = "line", size = 2)  # one mean line per treatment
Polish it up:
ggplot(ob3, aes(x = gains, y = score, group = treatment, color = treatment)) +
  geom_line(aes(group = id, color = treatment), alpha = .3) +
  # mean_cl_normal needs the Hmisc package to be installed
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .1,
               position = position_dodge(.1), color = "black") +
  stat_summary(fun = "mean", geom = "line", size = 1.2,
               position = position_dodge(.1)) +
  stat_summary(fun = "mean", geom = "point", size = 2,
               position = position_dodge(.1)) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_discrete(labels = c("Time 1", "Time 2", "Time 3", "Time 4")) +
  labs(x = "", y = "Gain score\n", title = "Vocabulary gains after treatments\n",
       color = "Treatment Condition") +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(), legend.position = "bottom",
        plot.title = element_text(hjust = .5))
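The error bars above come from mean_cl_normal, which to my understanding wraps Hmisc::smean.cl.normal() and returns a t-based 95% confidence interval around the mean. The base-R sketch below shows that calculation on hypothetical gain scores, not the real data.

```r
# Hypothetical gain scores for one treatment group at one time point
x <- c(-1, 0, 0, 1, -1, 0, 2, 1)

n  <- length(x)
se <- sd(x) / sqrt(n)                      # standard error of the mean
half_width <- qt(0.975, df = n - 1) * se   # t critical value times the SE

ci <- c(lower = mean(x) - half_width, upper = mean(x) + half_width)
ci

# Cross-check against the interval reported by a one-sample t-test
t.test(x)$conf.int
```

With small groups the t critical value is noticeably larger than 1.96, so these bars are wider than a plain mean ± 1.96 SE.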
It’s generally better to label the lines directly than to rely on a legend, especially when you have more than two groups. To do that, I will prepare the labels using some functions from dplyr.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
m_label <- ob3 %>%
  filter(gains == "gain4") %>%    # keep only the last time point
  group_by(treatment) %>%
  summarize(mean = mean(score))   # one mean per treatment group
What I’m doing here is getting the mean score at the last time point for each group. That score will be the y-axis coordinate where I put the label for that group.
I’ll now add the labels using geom_text(). You can adjust the x-coordinate depending on where you would like the labels to appear along the x-axis. Note that for this geom function, I use the newly created data for my label, m_label.
ggplot(ob3, aes(x = gains, y = score, group = treatment, color = treatment)) +
  geom_line(aes(group = id, color = treatment), alpha = .3) +
  # mean_cl_normal needs the Hmisc package to be installed
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .1,
               position = position_dodge(.1), color = "black") +
  stat_summary(fun = "mean", geom = "line", size = 1.2,
               position = position_dodge(.1)) +
  stat_summary(fun = "mean", geom = "point", size = 2,
               position = position_dodge(.1)) +
  # labels sit just right of the fourth time point, at each group's final mean
  geom_text(data = m_label, aes(x = 4.35, y = mean, label = treatment),
            size = 4, hjust = 1, fontface = "bold") +
  scale_color_brewer(palette = "Spectral") +
  scale_x_discrete(labels = c("Time 1", "Time 2", "Time 3", "Time 4")) +
  labs(x = "", y = "Gain score\n", title = "Vocabulary gains after treatments\n",
       color = "Treatment Condition") +
  theme_bw(base_size = 12) +
  theme(panel.grid.major.x = element_blank(),
        legend.position = "none", # hide the legend
        plot.title = element_text(hjust = .5))
Instead of creating labels, you could just facet them, if that better suits your needs:
ggplot(ob3, aes(x = gains, y = score, group = treatment, color = treatment)) +
  geom_line(aes(group = id), color = "grey70", alpha = .3) +
  # mean_cl_normal needs the Hmisc package to be installed
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .1,
               position = position_dodge(.1), color = "black") +
  stat_summary(fun = "mean", geom = "line", size = 1.2,
               position = position_dodge(.1)) +
  stat_summary(fun = "mean", geom = "point", size = 2,
               position = position_dodge(.1)) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_discrete(labels = c("Time 1", "Time 2", "Time 3", "Time 4")) +
  facet_wrap(~ treatment, ncol = 2) +  # one panel per treatment condition
  labs(x = "", y = "Gain score\n", title = "Vocabulary gains after treatments\n",
       color = "Treatment Condition") +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(), legend.position = "none",
        plot.title = element_text(hjust = .5))
The last example I’d like to show is survey data and visualizing counts.
sur <- read.csv("example_survey.csv")
str(sur)
## 'data.frame': 33 obs. of 5 variables:
## $ Financial : int 1 1 2 1 3 1 1 2 1 0 ...
## $ Resources : int 1 1 1 1 3 1 1 1 1 1 ...
## $ Staffing : int 2 0 NA 1 0 1 0 1 1 1 ...
## $ Training : int 3 1 2 0 0 0 2 2 0 0 ...
## $ Technology: int 2 1 2 3 3 1 2 3 1 2 ...
This data is from a survey that targeted Korean language coordinators at colleges/universities in the United States. Here are responses on how much support the language program coordinators receive in these five categories: finance, resources, staffing, training, and technology. The responses are on a 4-point Likert scale, from complete support to no support.
We want to show the number of responses for each question, so we should have the categories along the x-axis and the number of responses on the y-axis.
That means that the data needs to be transformed so that we have one variable (column) called Category and another called Response.
sur2 <- tidyr::gather(sur, Category, Response, 1:5)
str(sur2)
## 'data.frame': 165 obs. of 2 variables:
## $ Category: chr "Financial" "Financial" "Financial" "Financial" ...
## $ Response: int 1 1 2 1 3 1 1 2 1 0 ...
We can plot the number of responses per category.
ggplot(sur2, aes(x = Category, y = after_stat(count))) +  # after_stat() replaces the deprecated stat()
  geom_bar()
Well, this is not useful because it just shows the total number of responses (which includes N/As). We want to see the number of responses at each of the four scale points. Therefore, we visualize each level of response in a different color:
ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) +
  geom_bar()
Why has nothing changed? Because the response variable is currently numeric, not categorical, so ggplot has no discrete groups to split the bars by.
sur2$Response <- factor(sur2$Response)
Try again:
ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) +
  geom_bar()
Note: the default setting for geom_bar(position = ) is “stack”. Maybe you want something like this:
ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) +
  geom_bar(position = "dodge")
I would definitely label the response levels for clarity.
sur2$Response <- factor(sur2$Response,
                        labels = c("None", "Basic", "Decent", "Complete"))
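The relabeling above relies on all four response values being present in the data. A slightly more defensive variant, sketched below on a hypothetical vector `resp`, names the levels explicitly so that each label stays attached to the right code even if one response option was never chosen.

```r
# Hypothetical responses on the 0-3 scale: nobody chose 2, and one answer is missing
resp <- c(0, 1, 3, 1, NA, 3)

# Without `levels`, factor() would find only three distinct values here,
# so supplying four labels would stop with an "invalid 'labels'" error.

# With explicit levels, every code keeps its intended label
resp_f <- factor(resp, levels = 0:3,
                 labels = c("None", "Basic", "Decent", "Complete"))

table(resp_f, useNA = "ifany")  # "Decent" appears with a count of 0; the NA is kept
```

An empty level still shows up in the legend, which is usually what you want when comparing categories on a common scale.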
With count figures on the y-axis, dodged bars are probably the better choice:
ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) +
  geom_bar(position = "dodge")
However, the stacked form has merits in displaying the proportion of the responses collected. If we’re talking in terms of proportion, it would make more sense to show the number of responses as a percentage. Setting the position argument to “fill” will automatically rescale the counts to proportions for you.
ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) +
  geom_bar(position = "fill")
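If only the axis needs to read in percent, one option is to keep position = "fill" and change the axis labels with the scales package. This is a sketch on a made-up data frame `toy`, not the survey data, and it assumes scales 1.1.0 or later for label_percent().

```r
library(ggplot2)

# Hypothetical long-format survey responses (not the real survey data)
toy <- data.frame(
  Category = rep(c("Financial", "Staffing"), each = 6),
  Response = factor(c("None", "Basic", "Basic", "Decent", "Decent", "Complete",
                      "None", "None", "Basic", "Basic", "Basic", "Decent"),
                    levels = c("None", "Basic", "Decent", "Complete"))
)

p <- ggplot(toy, aes(x = Category, fill = Response)) +
  geom_bar(position = "fill") +                           # bars rescaled to 0-1
  scale_y_continuous(labels = scales::label_percent()) +  # show 0-1 as 0%-100%
  labs(y = "Share of responses")
p
```

This changes only how the axis is labeled; the bar heights are still the proportions computed by position = "fill".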
This type of graph makes the most sense to me, but we can do more. It would be easier for readers to interpret the graph if we used a percentage scale and annotated the percentages.
To do that, I need a summary of the data:
sur_sum <- sur2 %>%
  group_by(Category, Response) %>%
  tally() %>%                           # n = count per Category and Response
  mutate(Percentage = n*100/sum(n))     # still grouped by Category, so sum(n) is the category total
head(sur_sum)
## # A tibble: 6 x 4
## # Groups: Category [2]
## Category Response n Percentage
## <chr> <fct> <int> <dbl>
## 1 Financial None 4 12.1
## 2 Financial Basic 15 45.5
## 3 Financial Decent 11 33.3
## 4 Financial Complete 3 9.09
## 5 Resources None 2 6.06
## 6 Resources Basic 18 54.5
Note that now we have the count for each category and each response level in a new column, as opposed to having one response per row.
When using this summarized data, we need to change something in geom_bar(): set stat to “identity”. The default is “count”, which means the function counts the rows in the raw data. Now that we already have the counts and percentages, we need to tell ggplot to use the numbers as they are in the summarized data.
ggplot(sur_sum, aes(x = Category, y = Percentage, fill = Response)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = round(Percentage, 1)), position = "stack",
            hjust = .5, vjust = 1, size = 4.5)
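Two small variations may be worth knowing here: geom_col() is shorthand for geom_bar(stat = "identity"), and position_stack(vjust = .5) places each label in the middle of its segment instead of at its top edge. The sketch below uses a made-up summary table `toy_sum` rather than the real percentages.

```r
library(ggplot2)

# Hypothetical pre-summarized percentages (each category sums to 100)
toy_sum <- data.frame(
  Category   = rep(c("Financial", "Staffing"), each = 3),
  Response   = factor(rep(c("None", "Basic", "Decent"), times = 2),
                      levels = c("None", "Basic", "Decent")),
  Percentage = c(20, 50, 30, 40, 35, 25)
)

p <- ggplot(toy_sum, aes(x = Category, y = Percentage, fill = Response)) +
  geom_col() +  # uses the y values as they are; stacks by default
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = .5))  # center labels within segments
p
```

Centered labels tend to stay inside their own segment even when a segment is thin, which makes them harder to misread.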
I actually like it better this way:
ggplot(sur_sum, aes(x = Category, y = Percentage, fill = Response)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = round(Percentage, 1)), position = "stack",
            hjust = 1, size = 4.5) +
  coord_flip()
To polish up the graph, I changed the label using the paste0() function, which concatenates the rounded percentage with the percent symbol %.
ggplot(sur_sum, aes(x = Category, y = Percentage, fill = Response)) +
  geom_bar(stat = "identity", position = "stack", width = .7, alpha = .9) +
  geom_text(aes(label = paste0(round(Percentage, 0), "%")), position = "stack",
            hjust = 1, size = 3) +
  scale_fill_brewer(palette = "Pastel1", na.value = "grey70") +  # grey for missing responses
  coord_flip() +
  theme_minimal(base_size = 12) +
  theme(axis.title = element_blank(), panel.grid.major.y = element_blank(),
        legend.position = "top")