I quite liked this paper that sets out different ways to show data other than bar and line graphs, and I thought it would be useful to produce some code for making simple graphs in R. The authors provide Excel templates in their supplementary information, so I’ve just taken the data from there.
Note that I’ve left the figures as being more functional than publication-quality, but there are a whole load of resources out there on how to tweak your figures as you’d like!
We need a few libraries to tidy and manipulate the data, as well as to plot it.
library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)
The data below is provided by the authors:
| Group.1 | Group.2 | Group.3 | Group.4 | Group.5 |
|---|---|---|---|---|
| 5 | 7 | 9 | 42 | 2 |
| 3 | 3 | 7 | 2 | 0 |
| 6 | 9 | 10 | 5 | 3 |
| 8 | 10 | 12 | 55 | 5 |
| 10 | 33 | 14 | 9 | 7 |
| 13 | 15 | 17 | 12 | 10 |
| 1 | 18 | 20 | 15 | 13 |
| 4 | 6 | 40 | 3 | 1 |
| 18 | 20 | 22 | NA | 15 |
| 4 | 30 | 35 | NA | 1 |
| 7 | NA | 42 | NA | 4 |
| 9 | NA | 13 | NA | 6 |
| 14 | NA | NA | NA | 11 |
| 15 | NA | NA | NA | 12 |
| 17 | NA | NA | NA | 14 |
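The post does not show how `df_ind` was loaded (presumably it was read from the authors' Excel template). As a minimal sketch, assuming you want to follow along without the file, the table above can simply be typed in by hand. The column names match the table; everything else is plain base R:

```r
# Hand-typed copy of the table above; NA marks the empty cells
# (the groups have different sample sizes).
df_ind <- data.frame(
  Group.1 = c(5, 3, 6, 8, 10, 13, 1, 4, 18, 4, 7, 9, 14, 15, 17),
  Group.2 = c(7, 3, 9, 10, 33, 15, 18, 6, 20, 30, rep(NA, 5)),
  Group.3 = c(9, 7, 10, 12, 14, 17, 20, 40, 22, 35, 42, 13, rep(NA, 3)),
  Group.4 = c(42, 2, 5, 55, 9, 12, 15, 3, rep(NA, 7)),
  Group.5 = c(2, 0, 3, 5, 7, 10, 13, 1, 15, 1, 4, 6, 11, 12, 14)
)

# Number of non-missing observations per group
colSums(!is.na(df_ind))
```

Counting the non-missing values per column is a cheap way to check that nothing was mistyped, and it already hints at the unequal sample sizes that the plots will reveal.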
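The first thing we have to do is get this into ‘tidy data’ format. You can read more about what that means in this presentation by Hadley Wickham, but here’s a brief synopsis: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.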
I find it easiest to think about how I would want my data configured for use in a regression model. For example, in this case we would want a column specifying the group that the observations belong to, and another column to hold the observations themselves.
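To make the regression analogy concrete, here is a small sketch using only base R. The data frame `df_long` is a hypothetical name, holding just the first three observations of Group.1 and Group.2 from the table above, already laid out with one column for the group and one for the value:

```r
# Long ("tidy") layout: one row per observation, one column per variable.
# Values are the first three observations of Group.1 and Group.2.
df_long <- data.frame(
  Group = rep(c("Group.1", "Group.2"), each = 3),
  Value = c(5, 3, 6, 7, 3, 9)
)

# The model formula can refer to the two columns directly
fit <- lm(Value ~ Group, data = df_long)
coef(fit)
```

With the original wide layout there is no single column to put on either side of the formula, which is why the long layout is the one to aim for.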
We can use the excellent {tidyr} package to quickly convert our data to ‘tidy’ format. I have also used dplyr’s ‘filter’ function to quickly get rid of NA values (which is also easier to do once data is in tidy format):
df_ind_t <- df_ind %>%
  gather(Group, Value, Group.1:Group.5) %>%
  filter(!is.na(Value))

kable(df_ind_t)
| Group | Value |
|---|---|
| Group.1 | 5 |
| Group.1 | 3 |
| Group.1 | 6 |
| Group.1 | 8 |
| Group.1 | 10 |
| Group.1 | 13 |
| Group.1 | 1 |
| Group.1 | 4 |
| Group.1 | 18 |
| Group.1 | 4 |
| Group.1 | 7 |
| Group.1 | 9 |
| Group.1 | 14 |
| Group.1 | 15 |
| Group.1 | 17 |
| Group.2 | 7 |
| Group.2 | 3 |
| Group.2 | 9 |
| Group.2 | 10 |
| Group.2 | 33 |
| Group.2 | 15 |
| Group.2 | 18 |
| Group.2 | 6 |
| Group.2 | 20 |
| Group.2 | 30 |
| Group.3 | 9 |
| Group.3 | 7 |
| Group.3 | 10 |
| Group.3 | 12 |
| Group.3 | 14 |
| Group.3 | 17 |
| Group.3 | 20 |
| Group.3 | 40 |
| Group.3 | 22 |
| Group.3 | 35 |
| Group.3 | 42 |
| Group.3 | 13 |
| Group.4 | 42 |
| Group.4 | 2 |
| Group.4 | 5 |
| Group.4 | 55 |
| Group.4 | 9 |
| Group.4 | 12 |
| Group.4 | 15 |
| Group.4 | 3 |
| Group.5 | 2 |
| Group.5 | 0 |
| Group.5 | 3 |
| Group.5 | 5 |
| Group.5 | 7 |
| Group.5 | 10 |
| Group.5 | 13 |
| Group.5 | 1 |
| Group.5 | 15 |
| Group.5 | 1 |
| Group.5 | 4 |
| Group.5 | 6 |
| Group.5 | 11 |
| Group.5 | 12 |
| Group.5 | 14 |
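In more recent versions of tidyr (1.0.0 and later), `gather()` is superseded by `pivot_longer()`, which can also drop the NA values in the same step. The sketch below is one possible equivalent rather than the code used for this post; `df_small` is a hypothetical three-row subset (rows 1, 2 and 9 of the authors' table), used to keep the example self-contained:

```r
library(tidyr)

# Rows 1, 2 and 9 of the authors' table, enough to include one NA
df_small <- data.frame(
  Group.1 = c(5, 3, 18),
  Group.2 = c(7, 3, 20),
  Group.3 = c(9, 7, 22),
  Group.4 = c(42, 2, NA),
  Group.5 = c(2, 0, 15)
)

# pivot_longer() reshapes and drops NA values in a single step
df_small_t <- pivot_longer(
  df_small,
  cols = Group.1:Group.5,
  names_to = "Group",
  values_to = "Value",
  values_drop_na = TRUE
)

df_small_t
```

One difference to be aware of: `pivot_longer()` works through the data row by row, so the output is not sorted by group in the way the `gather()` output above is. Adding `dplyr::arrange(Group)` restores that ordering if it matters.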
Although the whole point of this paper is to avoid bar charts, let’s just have a quick look at how you make them in ggplot2 (and this will help us compare different visualisations at the end). We can plot a basic bar graph with medians easily enough:
ggplot(df_ind_t, aes(x = Group, y = Value)) +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary_bin(fun = "median", geom = "bar") +
  theme_bw()
Plotting standard errors is a little more involved. Here, I’ve grouped the data (by, erm, ‘Group’) and made summary variables of the median and standard error for each group. There isn’t a function for ‘standard error’ in base R, so I defined a function that can be used within the ‘summarise’ call in my manipulation step. I then plot this newly-formed data frame. Note that we could pipe the manipulated data straight into ggplot2, but here I wanted to show the output of the summary table.
std_err <- function(x) {
  sd(x) / sqrt(length(x))
}

df_ind_sum <- df_ind_t %>%
  group_by(Group) %>%
  summarise(Grp_med = median(Value),
            Grp_se = std_err(Value))

kable(df_ind_sum)
| Group | Grp_med | Grp_se |
|---|---|---|
| Group.1 | 8.0 | 1.385182 |
| Group.2 | 12.5 | 3.219558 |
| Group.3 | 15.5 | 3.547210 |
| Group.4 | 10.5 | 6.970441 |
| Group.5 | 6.0 | 1.336187 |
ggplot(df_ind_sum, aes(x = Group, y = Grp_med)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = Grp_med - Grp_se,
                    ymax = Grp_med + Grp_se),
                width = .2) +
  theme_bw()
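The `std_err` helper above is the usual standard error of the mean: the sample standard deviation divided by the square root of the sample size. As a quick sanity check, this sketch applies it to the Group.1 values on their own (`group_1` is just a hypothetical vector holding that column) and should reproduce the first row of the summary table:

```r
# Same helper as in the post: sample SD divided by the square root of n
std_err <- function(x) {
  sd(x) / sqrt(length(x))
}

# Group.1 values copied from the authors' table
group_1 <- c(5, 3, 6, 8, 10, 13, 1, 4, 18, 4, 7, 9, 14, 15, 17)

median(group_1)   # 8
std_err(group_1)  # approximately 1.385, matching the summary table
```

Note that the helper does not handle missing values itself: `sd()` returns NA if any value is NA, and `length()` would count the NA entries. That is fine here because the NA rows were filtered out during tidying.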
Here is a basic scatterplot, with the median value plotted in red:
ggplot(df_ind_t, aes(x = Group, y = Value)) +
  geom_point() +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary_bin(fun = "median", geom = "point",
                   colour = "red", size = 3,
                   alpha = 0.5) +
  theme_bw()
We could be a bit fancier and plot our standard errors as well, plus jitter the points in case there is any overlap. ggplot2 can layer several data sets in one plot, and if they share variable names then it will use these to line the layers up.
Here, I first state that we want to plot our original values on the y-axis, split by ‘Group’ on the x-axis. In the ‘geom_pointrange’ call, I specify a new data frame - the summary table with a median value and standard error for each group. I don’t have to respecify ‘x’ here, as it is still ‘Group’ (the same as the first data frame). I do specify ‘y’ though, as I want to plot the median value in the pointrange. I then also set up the error bars, and change the size, colour and alpha of the pointrange object:
ggplot(df_ind_t, aes(x = Group, y = Value)) +
  geom_point(position = position_jitter(width = 0.2),
             alpha = 0.7) +
  geom_pointrange(data = df_ind_sum,
                  aes(y = Grp_med,
                      ymin = Grp_med - Grp_se,
                      ymax = Grp_med + Grp_se),
                  colour = "red",
                  alpha = 0.7,
                  size = 1) +
  theme_bw()
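If you are happy to show the mean rather than the median, ggplot2 can compute the summary for you, with no separate summary table needed. The sketch below is an alternative to the approach above, not a reproduction of it: it uses ggplot2's built-in `mean_se()` helper via the `fun.data` argument of `stat_summary()`. The data frame `df_demo` is a hypothetical subset (the Group.1 and Group.4 values from the authors' table) so that the example runs on its own:

```r
library(ggplot2)

# Group.1 and Group.4 values from the authors' table, already in long format
df_demo <- data.frame(
  Group = rep(c("Group.1", "Group.4"), times = c(15, 8)),
  Value = c(5, 3, 6, 8, 10, 13, 1, 4, 18, 4, 7, 9, 14, 15, 17,
            42, 2, 5, 55, 9, 12, 15, 3)
)

# mean_se() returns the mean (y) plus and minus one standard error (ymin, ymax)
mean_se(df_demo$Value[df_demo$Group == "Group.4"])

p <- ggplot(df_demo, aes(x = Group, y = Value)) +
  geom_point(position = position_jitter(width = 0.2), alpha = 0.7) +
  # fun.data lets stat_summary() compute y, ymin and ymax per group
  stat_summary(fun.data = mean_se, geom = "pointrange",
               colour = "red", alpha = 0.7) +
  theme_bw()

p
```

The trade-off is control: the summary-table approach lets you combine any centre (such as the median) with any measure of spread, whereas `mean_se()` is fixed to the mean and its standard error.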
As noted in the original article, presenting data in this way gives us a better idea of outliers, differences among groups in sample sizes, spread within groups, etc…
The data below is provided by the authors for paired data observations:
| Subject.IDs | Condition.1.Name | Condition.2.Name |
|---|---|---|
| 1 | 5 | 8 |
| 2 | 1 | 5 |
| 3 | 7 | 12 |
| 4 | 9 | 11 |
| 5 | 2 | 9 |
| 6 | 6 | 7 |
| 7 | 4 | 5 |
| 8 | 11 | 14 |
| 9 | 14 | 16 |
| 10 | 13 | 18 |
| 11 | 17 | 15 |
| 12 | 15 | 12 |
Again, let’s tidy the data up. We want a column for ID, a column for Condition, and a column for the value of the observation:
df_paired_t <- df_paired %>%
  rename(ID = Subject.IDs,
         Condition_1 = Condition.1.Name,
         Condition_2 = Condition.2.Name) %>%
  gather(Condition, Value,
         starts_with('Condition')) %>%
  arrange(ID, Condition)

kable(df_paired_t)
| ID | Condition | Value |
|---|---|---|
| 1 | Condition_1 | 5 |
| 1 | Condition_2 | 8 |
| 2 | Condition_1 | 1 |
| 2 | Condition_2 | 5 |
| 3 | Condition_1 | 7 |
| 3 | Condition_2 | 12 |
| 4 | Condition_1 | 9 |
| 4 | Condition_2 | 11 |
| 5 | Condition_1 | 2 |
| 5 | Condition_2 | 9 |
| 6 | Condition_1 | 6 |
| 6 | Condition_2 | 7 |
| 7 | Condition_1 | 4 |
| 7 | Condition_2 | 5 |
| 8 | Condition_1 | 11 |
| 8 | Condition_2 | 14 |
| 9 | Condition_1 | 14 |
| 9 | Condition_2 | 16 |
| 10 | Condition_1 | 13 |
| 10 | Condition_2 | 18 |
| 11 | Condition_1 | 17 |
| 11 | Condition_2 | 15 |
| 12 | Condition_1 | 15 |
| 12 | Condition_2 | 12 |
You’ll notice a couple of new functions there: ‘rename’ is a nice, easy way to rename variables in your data frame, while ‘arrange’ enabled me to sort the rows (here by subject ID and then by condition, so that each subject’s pair of values sits together).
As per the example in the paper linked at the start, we can plot points for all the observations, and use lines to join them at the subject-level:
ggplot(df_paired_t, aes(x = Condition,
                        y = Value,
                        group = ID)) +
  geom_line(alpha = 0.8) +
  geom_point(alpha = 0.7,
             size = 1.5) +
  theme_bw()
Above, I used the ‘group’ aesthetic in the ggplot specification to say how points should be grouped. If I were interested in the actual identification of each pair, I could also map ‘colour’ in this specification to tell ggplot to give each subject a different colour, and to create a legend detailing this. Note that - as IDs have been coded as numeric variables in the data frame - I make sure to specify that these are factors, so that ggplot knows to use a discrete colour palette:
ggplot(df_paired_t, aes(x = Condition,
                        y = Value,
                        group = ID,
                        colour = factor(ID))) +
  geom_line(alpha = 0.8) +
  geom_point(alpha = 0.7,
             size = 1.5) +
  theme_bw()
The data below is provided by the authors for paired observations from subjects in two groups (although I did slightly modify the initial Excel file, as it made me sad):
| Subject.ID | Group | Condition.1.Name | Condition.2.Name |
|---|---|---|---|
| 1 | 1 | 5 | 13 |
| 2 | 1 | 1 | 5 |
| 3 | 1 | 7 | 12 |
| 4 | 1 | 9 | 11 |
| 5 | 1 | 2 | 9 |
| 6 | 1 | 6 | 5 |
| 7 | 1 | 4 | 5 |
| 8 | 1 | 11 | 14 |
| 9 | 1 | 14 | 12 |
| 10 | 1 | 13 | 19 |
| 16 | 2 | 20 | 18 |
| 17 | 2 | 13 | 9 |
| 18 | 2 | 15 | 16 |
| 19 | 2 | 8 | 13 |
| 20 | 2 | 3 | 5 |
| 21 | 2 | 7 | 8 |
| 22 | 2 | 14 | 7 |
| 23 | 2 | 12 | 12 |
| 24 | 2 | 11 | 14 |
| 25 | 2 | 9 | 10 |
As before, we need to do some quick tidying steps: we want a column for ID, a column for Group, a column for Condition, and a column for the value of the observation:
df_paired_grp_t <- df_paired_grp %>%
  rename(ID = Subject.ID,
         Condition_1 = Condition.1.Name,
         Condition_2 = Condition.2.Name) %>%
  gather(Condition, Value,
         starts_with('Condition')) %>%
  arrange(ID, Condition)

kable(df_paired_grp_t)
| ID | Group | Condition | Value |
|---|---|---|---|
| 1 | 1 | Condition_1 | 5 |
| 1 | 1 | Condition_2 | 13 |
| 2 | 1 | Condition_1 | 1 |
| 2 | 1 | Condition_2 | 5 |
| 3 | 1 | Condition_1 | 7 |
| 3 | 1 | Condition_2 | 12 |
| 4 | 1 | Condition_1 | 9 |
| 4 | 1 | Condition_2 | 11 |
| 5 | 1 | Condition_1 | 2 |
| 5 | 1 | Condition_2 | 9 |
| 6 | 1 | Condition_1 | 6 |
| 6 | 1 | Condition_2 | 5 |
| 7 | 1 | Condition_1 | 4 |
| 7 | 1 | Condition_2 | 5 |
| 8 | 1 | Condition_1 | 11 |
| 8 | 1 | Condition_2 | 14 |
| 9 | 1 | Condition_1 | 14 |
| 9 | 1 | Condition_2 | 12 |
| 10 | 1 | Condition_1 | 13 |
| 10 | 1 | Condition_2 | 19 |
| 16 | 2 | Condition_1 | 20 |
| 16 | 2 | Condition_2 | 18 |
| 17 | 2 | Condition_1 | 13 |
| 17 | 2 | Condition_2 | 9 |
| 18 | 2 | Condition_1 | 15 |
| 18 | 2 | Condition_2 | 16 |
| 19 | 2 | Condition_1 | 8 |
| 19 | 2 | Condition_2 | 13 |
| 20 | 2 | Condition_1 | 3 |
| 20 | 2 | Condition_2 | 5 |
| 21 | 2 | Condition_1 | 7 |
| 21 | 2 | Condition_2 | 8 |
| 22 | 2 | Condition_1 | 14 |
| 22 | 2 | Condition_2 | 7 |
| 23 | 2 | Condition_1 | 12 |
| 23 | 2 | Condition_2 | 12 |
| 24 | 2 | Condition_1 | 11 |
| 24 | 2 | Condition_2 | 14 |
| 25 | 2 | Condition_1 | 9 |
| 25 | 2 | Condition_2 | 10 |
As per the example in the paper linked at the start, we can plot points for all the observations, and use lines to join them at the subject-level. ggplot2 also has the incredibly useful ‘facet_grid’ function, enabling us to split our data by ‘Group’ and plot those observations in separate panels:
ggplot(df_paired_grp_t, aes(x = Condition,
                            y = Value,
                            group = ID)) +
  geom_line(alpha = 0.8) +
  geom_point(alpha = 0.7,
             size = 1.5) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme_bw()
The Excel example also gives a template for plotting individual-level differences for each group. Again, this means making a summary data frame with dplyr, but it’s very easy to do:
df_paired_grp_diffs <- df_paired_grp_t %>%
  spread(Condition, Value) %>%
  group_by(ID, Group) %>%
  summarise(ID_diff = Condition_2 - Condition_1)

ggplot(df_paired_grp_diffs, aes(x = factor(Group), y = ID_diff)) +
  geom_point() +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary_bin(fun = "median", geom = "point",
                   colour = "red", size = 3,
                   alpha = 0.5) +
  theme_bw()
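As an aside, the differences can also be computed straight from the original wide data, since that layout already has one row per subject with both conditions side by side. This is a sketch of that alternative, not the route taken above; `df_wide` is a hypothetical four-subject subset of the authors' table (two subjects per group) to keep the example self-contained:

```r
library(dplyr)

# Four subjects from the authors' table (two per group), in the original wide layout
df_wide <- data.frame(
  Subject.ID       = c(1, 2, 16, 17),
  Group            = c(1, 1, 2, 2),
  Condition.1.Name = c(5, 1, 20, 13),
  Condition.2.Name = c(13, 5, 18, 9)
)

# One row per subject already, so the difference is a simple column operation
df_diffs <- df_wide %>%
  mutate(ID_diff = Condition.2.Name - Condition.1.Name) %>%
  select(ID = Subject.ID, Group, ID_diff)

df_diffs

# Median difference per group (what the red points in the figure above show)
df_diffs %>%
  group_by(Group) %>%
  summarise(Median_diff = median(ID_diff))
```

Which layout is "tidy" depends on the question: for plotting values by condition, one row per measurement is convenient, while for within-subject differences, one row per subject is the natural unit.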