Beyond Bar & Line Graphs: R code

I quite liked this paper that sets out different ways to show data other than bar and line graphs, and I thought it would be useful to produce some code for making simple graphs in R. The authors provide Excel templates in their supplementary information, so I’ve just taken the data from there.

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4): e1002128. doi: 10.1371/journal.pbio.1002128

Note that I’ve left the figures as being more functional than publication-quality, but there are a whole load of resources out there on how to tweak your figures as you’d like!

Libraries

We need a few libraries to tidy and manipulate the data, and to plot it.

library(tidyr)
library(dplyr)
library(ggplot2)

library(knitr)

Univariate scatterplots for independent data

The data below is provided by the authors:

Group.1	Group.2	Group.3	Group.4	Group.5
5	7	9	42	2
3	3	7	2	0
6	9	10	5	3
8	10	12	55	5
10	33	14	9	7
13	15	17	12	10
1	18	20	15	13
4	6	40	3	1
18	20	22	NA	15
4	30	35	NA	1
7	NA	42	NA	4
9	NA	13	NA	6
14	NA	NA	NA	11
15	NA	NA	NA	12
17	NA	NA	NA	14
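The code further down refers to a data frame called df_ind, but the step that creates it isn't shown (presumably it was read in from the authors' Excel template). A minimal sketch, assuming you simply want to type the table above straight into R, uses read.table() with its text argument; the name df_ind is taken from the later code, everything else is just one way of doing it:

```r
# Recreate the wide table shown above; "NA" strings are read as missing values
df_ind <- read.table(header = TRUE, text = "
Group.1 Group.2 Group.3 Group.4 Group.5
5   7   9   42  2
3   3   7   2   0
6   9   10  5   3
8   10  12  55  5
10  33  14  9   7
13  15  17  12  10
1   18  20  15  13
4   6   40  3   1
18  20  22  NA  15
4   30  35  NA  1
7   NA  42  NA  4
9   NA  13  NA  6
14  NA  NA  NA  11
15  NA  NA  NA  12
17  NA  NA  NA  14
")

# Quick look at the structure: 15 rows, 5 columns
str(df_ind)
```

If you have the Excel file itself, reading it with a spreadsheet-reading package would be the more usual route.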

The first thing we have to do is get this into ‘tidy data’ format. You can read more about what that means in this presentation by Hadley Wickham, but here’s a brief synopsis:

Each variable is in its own column
Each observation is in its own row
Each type of observational unit is in its own table

I find it easiest to think about how I would want my data configured for use in a regression model. For example, in this case we would want a column specifying the group that the observations belong to, and another column to hold the observations themselves.
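To make the regression analogy concrete, here is a small sketch in base R. The data frame toy is purely illustrative (it holds the first three values of Group.1 and Group.2 from the table above), and the model is only there to show that a ‘group’ column plus a ‘value’ column is exactly the layout a model formula expects:

```r
# Toy tidy data: one column for the group label, one for the observation
toy <- data.frame(
  Group = rep(c("Group.1", "Group.2"), each = 3),
  Value = c(5, 3, 6, 7, 3, 9)
)

# With tidy data the model formula maps directly onto column names
fit <- lm(Value ~ Group, data = toy)

# Intercept = mean of Group.1; the other coefficient = difference in group means
coef(fit)
```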

We can use the excellent {tidyr} package to convert our data to ‘tidy’ format. I have also used dplyr’s ‘filter’ function to drop the NA values, which is easier to do once the data is in tidy format:

df_ind_t <- df_ind %>% 
  gather(Group, Value, Group.1:Group.5) %>% 
  filter(!is.na(Value))

kable(df_ind_t)

Group	Value
Group.1	5
Group.1	3
Group.1	6
Group.1	8
Group.1	10
Group.1	13
Group.1	1
Group.1	4
Group.1	18
Group.1	4
Group.1	7
Group.1	9
Group.1	14
Group.1	15
Group.1	17
Group.2	7
Group.2	3
Group.2	9
Group.2	10
Group.2	33
Group.2	15
Group.2	18
Group.2	6
Group.2	20
Group.2	30
Group.3	9
Group.3	7
Group.3	10
Group.3	12
Group.3	14
Group.3	17
Group.3	20
Group.3	40
Group.3	22
Group.3	35
Group.3	42
Group.3	13
Group.4	42
Group.4	2
Group.4	5
Group.4	55
Group.4	9
Group.4	12
Group.4	15
Group.4	3
Group.5	2
Group.5	0
Group.5	3
Group.5	5
Group.5	7
Group.5	10
Group.5	13
Group.5	1
Group.5	15
Group.5	1
Group.5	4
Group.5	6
Group.5	11
Group.5	12
Group.5	14
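In more recent versions of tidyr (1.0.0 onwards), ‘gather’ is superseded by ‘pivot_longer’. A sketch of the same reshaping step with the newer function, shown on a tiny made-up data frame (wide is hypothetical, not the authors' data) so that it runs on its own:

```r
library(tidyr)
library(dplyr)

# Tiny wide data frame in the same shape as the original table
wide <- data.frame(Group.1 = c(5, 3, 6),
                   Group.2 = c(7, NA, 9))

long <- wide %>%
  pivot_longer(cols = starts_with("Group"),
               names_to = "Group",
               values_to = "Value",
               values_drop_na = TRUE) %>%  # does the job of filter(!is.na(Value))
  arrange(Group)                           # pivot_longer works row by row, so re-sort by group

long
```

Note that ‘pivot_longer’ returns the rows in a different order from ‘gather’, which is why the sketch sorts by ‘Group’ afterwards.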

Standard bar chart

Although the whole point of this paper is to avoid bar charts, let’s just have a quick look at how you make them in ggplot2 (and this will help us compare different visualisations at the end). We can plot a basic bar graph with medians easily enough:

ggplot(df_ind_t, aes(x = Group, y = Value)) +
  # 'fun' replaces 'fun.y', which is deprecated as of ggplot2 3.3.0
  stat_summary_bin(fun = "median", geom = "bar") +
  theme_bw()

Bar chart with standard errors

Plotting standard errors is a little more involved. Here, I’ve grouped the data (by, erm, ‘Group’) and made summary variables of the median and standard error for each group. There isn’t a function for ‘standard error’ in base R, so I defined a function that can be used within the ‘summarise’ call in my manipulation step. I then plot this newly-formed data frame. Note that we could pipe the manipulated data straight into ggplot2, but here I wanted to show the output of the summary table.

std_err <- function(x){ 
  sd(x) / sqrt(length(x))
}

df_ind_sum <- df_ind_t %>% 
  group_by(Group) %>% 
  summarise(Grp_med = median(Value),
            Grp_se = std_err(Value))

kable(df_ind_sum)

Group	Grp_med	Grp_se
Group.1	8.0	1.385182
Group.2	12.5	3.219558
Group.3	15.5	3.547210
Group.4	10.5	6.970441
Group.5	6.0	1.336187
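One thing to be aware of: std_err() only works here because the NA values were filtered out first. With missing values present, sd() returns NA and length() counts the missing entries. A sketch of an NA-tolerant variant (the name std_err_na is my own, not from the original code), checked against the Group.4 column of the wide table:

```r
# NA-tolerant variant of std_err(); the name std_err_na is hypothetical
std_err_na <- function(x) {
  x <- x[!is.na(x)]          # drop missing values before counting
  sd(x) / sqrt(length(x))
}

# Group.4 column from the original wide table, NAs included
group_4 <- c(42, 2, 5, 55, 9, 12, 15, 3, NA, NA, NA, NA, NA, NA, NA)

std_err_na(group_4)
```

This should reproduce the Grp_se value shown for Group.4 in the summary table (6.970441).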

ggplot(df_ind_sum, aes(x = Group, y = Grp_med)) +
  geom_bar(stat="identity") +
  geom_errorbar(aes(ymin = Grp_med - Grp_se, 
                    ymax = Grp_med + Grp_se),
                  width=.2) +
  theme_bw()
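As mentioned, the summary step can be piped straight into ggplot2 rather than stored first. A minimal sketch of that pattern, using a small invented data frame (toy and its values are made up for illustration) so that it runs on its own:

```r
library(dplyr)
library(ggplot2)

# Toy tidy data, invented purely for illustration
toy <- data.frame(
  Group = rep(c("A", "B"), each = 4),
  Value = c(1, 2, 3, 10, 4, 6, 8, 9)
)

std_err <- function(x) sd(x) / sqrt(length(x))

# The summarised data frame becomes the 'data' argument of ggplot();
# note the switch from %>% to + once we are inside the plot
p <- toy %>%
  group_by(Group) %>%
  summarise(Grp_med = median(Value),
            Grp_se  = std_err(Value)) %>%
  ggplot(aes(x = Group, y = Grp_med)) +
  geom_col() +
  geom_errorbar(aes(ymin = Grp_med - Grp_se,
                    ymax = Grp_med + Grp_se),
                width = 0.2) +
  theme_bw()

p
```

Here geom_col() is used as shorthand for geom_bar(stat = "identity").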

Scatterplot with median

Here is a basic scatterplot, with the median value plotted in red:

ggplot(df_ind_t, aes(x = Group, y = Value)) +
  geom_point() +
  # 'fun' replaces 'fun.y', which is deprecated as of ggplot2 3.3.0
  stat_summary_bin(fun = "median", geom = "point",
                   colour = "red", size = 3,
                   alpha = 0.5) +
  theme_bw()

Jittered scatterplot with median and standard errors

We could be a bit fancier and plot our standard errors as well, plus jitter the points in case there is any overlap. ggplot2 can layer several data sets in one plot: each layer can be given its own data frame, and any aesthetics that the layer doesn’t override are inherited from the main ‘ggplot’ call, provided the column names match.

Here, I first state that we want to plot our original values on the y-axis, split by ‘Group’ on the x-axis. In the ‘geom_pointrange’ call, I specify a new data frame - the summary table with a median value and standard error for each group. I don’t have to respecify ‘x’ here, as it is still ‘Group’ (the same as the first data frame). I do specify ‘y’ though, as I want to plot the median value in the pointrange. I then also set up the error bars, and change the size, colour and alpha of the pointrange object:

ggplot(df_ind_t, aes(x = Group, y = Value)) +
  geom_point(position = position_jitter(width = 0.2),
             alpha = 0.7) +
  geom_pointrange(data = df_ind_sum,
                  aes(y = Grp_med,
                      ymin = Grp_med - Grp_se, 
                      ymax = Grp_med + Grp_se),
                  colour = "red",
                  alpha = 0.7,
                  size = 1) +
  theme_bw()

As noted in the original article, presenting data in this way gives us a better idea of outliers, differences among groups in sample sizes, spread within groups, and so on.

Paired data

The data below is provided by the authors for paired data observations:

Subject.IDs	Condition.1.Name	Condition.2.Name
1	5	8
2	1	5
3	7	12
4	9	11
5	2	9
6	6	7
7	4	5
8	11	14
9	14	16
10	13	18
11	17	15
12	15	12
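As with the first data set, the step that creates df_paired isn't shown. A minimal sketch, assuming you want to type the table above in directly (the name df_paired comes from the code below):

```r
# Recreate the paired-data table shown above
df_paired <- read.table(header = TRUE, text = "
Subject.IDs Condition.1.Name Condition.2.Name
1   5   8
2   1   5
3   7   12
4   9   11
5   2   9
6   6   7
7   4   5
8   11  14
9   14  16
10  13  18
11  17  15
12  15  12
")

# One row per subject, one column per condition
str(df_paired)
```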

Again, let’s tidy the data up. We want a column for ID, a column for Condition, and a column for the value of the observation:

df_paired_t <- df_paired %>% 
  rename(ID = Subject.IDs,
         Condition_1 = Condition.1.Name,
         Condition_2 = Condition.2.Name) %>% 
  gather(Condition, Value,
         starts_with('Condition')) %>% 
  arrange(ID, Condition)

kable(df_paired_t)

ID	Condition	Value
1	Condition_1	5
1	Condition_2	8
2	Condition_1	1
2	Condition_2	5
3	Condition_1	7
3	Condition_2	12
4	Condition_1	9
4	Condition_2	11
5	Condition_1	2
5	Condition_2	9
6	Condition_1	6
6	Condition_2	7
7	Condition_1	4
7	Condition_2	5
8	Condition_1	11
8	Condition_2	14
9	Condition_1	14
9	Condition_2	16
10	Condition_1	13
10	Condition_2	18
11	Condition_1	17
11	Condition_2	15
12	Condition_1	15
12	Condition_2	12

You’ll notice a couple of new functions there: ‘rename’ is a nice, easy way to rename variables in your data frame, while ‘arrange’ enabled me to sort my rows (which I’ve used to put the pair of values for each subject ID next to each other).

As per the example in the paper linked at the start, we can plot points for all the observations, and use lines to join them at the subject-level:

ggplot(df_paired_t, aes(x = Condition, 
                        y = Value, 
                        group = ID)) +
  geom_line(alpha = 0.8) + 
  geom_point(alpha = 0.7,
             size = 1.5) + 
  theme_bw()

Above, I used the ‘group’ keyword in the ggplot specification to say how points should be grouped. If I was interested in the actual identification of each pair, I could use ‘colour’ in this specification to tell ggplot to give each subject a different colour, and to create a legend detailing this. Note that - as IDs have been coded as numeric variables in the data frame - I make sure to specify that these are factors, so that ggplot knows to use a discrete colour palette:

ggplot(df_paired_t, aes(x = Condition, 
                        y = Value, 
                        group = ID,
                        colour = factor(ID))) +
  geom_line(alpha = 0.8) + 
  geom_point(alpha = 0.7,
             size = 1.5) + 
  theme_bw()

Paired data, multiple groups

The data below is provided by the authors for paired observations in multiple groups (although I did slightly modify the initial Excel file, as it made me sad):

Subject.ID	Group	Condition.1.Name	Condition.2.Name
1	1	5	13
2	1	1	5
3	1	7	12
4	1	9	11
5	1	2	9
6	1	6	5
7	1	4	5
8	1	11	14
9	1	14	12
10	1	13	19
16	2	20	18
17	2	13	9
18	2	15	16
19	2	8	13
20	2	3	5
21	2	7	8
22	2	14	7
23	2	12	12
24	2	11	14
25	2	9	10
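Again, the creation of df_paired_grp isn't shown. A minimal sketch, assuming the (modified) table above is typed in directly; the name df_paired_grp comes from the code below:

```r
# Recreate the grouped paired-data table shown above
df_paired_grp <- read.table(header = TRUE, text = "
Subject.ID Group Condition.1.Name Condition.2.Name
1   1   5   13
2   1   1   5
3   1   7   12
4   1   9   11
5   1   2   9
6   1   6   5
7   1   4   5
8   1   11  14
9   1   14  12
10  1   13  19
16  2   20  18
17  2   13  9
18  2   15  16
19  2   8   13
20  2   3   5
21  2   7   8
22  2   14  7
23  2   12  12
24  2   11  14
25  2   9   10
")

# Ten subjects in each of the two groups
table(df_paired_grp$Group)
```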

As before, we need to do some quick tidying steps: we want a column for ID, a column for Group, a column for Condition, and a column for the value of the observation:

df_paired_grp_t <- df_paired_grp %>% 
  rename(ID = Subject.ID,
         Condition_1 = Condition.1.Name,
         Condition_2 = Condition.2.Name) %>% 
  gather(Condition, Value,
         starts_with('Condition')) %>% 
  arrange(ID, Condition)

kable(df_paired_grp_t)

ID	Group	Condition	Value
1	1	Condition_1	5
1	1	Condition_2	13
2	1	Condition_1	1
2	1	Condition_2	5
3	1	Condition_1	7
3	1	Condition_2	12
4	1	Condition_1	9
4	1	Condition_2	11
5	1	Condition_1	2
5	1	Condition_2	9
6	1	Condition_1	6
6	1	Condition_2	5
7	1	Condition_1	4
7	1	Condition_2	5
8	1	Condition_1	11
8	1	Condition_2	14
9	1	Condition_1	14
9	1	Condition_2	12
10	1	Condition_1	13
10	1	Condition_2	19
16	2	Condition_1	20
16	2	Condition_2	18
17	2	Condition_1	13
17	2	Condition_2	9
18	2	Condition_1	15
18	2	Condition_2	16
19	2	Condition_1	8
19	2	Condition_2	13
20	2	Condition_1	3
20	2	Condition_2	5
21	2	Condition_1	7
21	2	Condition_2	8
22	2	Condition_1	14
22	2	Condition_2	7
23	2	Condition_1	12
23	2	Condition_2	12
24	2	Condition_1	11
24	2	Condition_2	14
25	2	Condition_1	9
25	2	Condition_2	10

As per the example in the paper linked at the start, we can plot points for all the observations, and use lines to join them at the subject-level. ggplot2 also has the incredibly useful ‘facet_grid’ function, enabling us to split our data by ‘Group’ and plot those observations in separate panels:

ggplot(df_paired_grp_t, aes(x = Condition, 
                            y = Value, 
                            group = ID)) +
  geom_line(alpha = 0.8) + 
  geom_point(alpha = 0.7,
             size = 1.5) + 
  facet_grid(. ~ Group, labeller = label_both) +
  theme_bw()

The Excel template also plots individual-level differences for each group. Again, this means making a summary data frame with {tidyr} and dplyr, but it’s very easy to do:

df_paired_grp_diffs <- df_paired_grp_t %>% 
  spread(Condition, Value) %>% 
  group_by(ID, Group) %>% 
  summarise(ID_diff = Condition_2 - Condition_1)

ggplot(df_paired_grp_diffs, aes(x = factor(Group), y = ID_diff)) +
  geom_point() +
  # 'fun' replaces 'fun.y', which is deprecated as of ggplot2 3.3.0
  stat_summary_bin(fun = "median", geom = "point",
                   colour = "red", size = 3,
                   alpha = 0.5) +
  theme_bw()
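Since each subject contributes exactly one difference, the same column can also be built with ‘mutate’ rather than a grouped ‘summarise’. A sketch of that alternative, using ‘pivot_wider’ (the newer counterpart to ‘spread’ in tidyr 1.0.0 onwards) on a few rows copied from subjects 1, 2 and 16; toy_long and toy_diffs are illustrative names only:

```r
library(tidyr)
library(dplyr)

# A few rows in the same shape as df_paired_grp_t
toy_long <- data.frame(
  ID        = c(1, 1, 2, 2, 16, 16),
  Group     = c(1, 1, 1, 1, 2, 2),
  Condition = rep(c("Condition_1", "Condition_2"), times = 3),
  Value     = c(5, 13, 1, 5, 20, 18)
)

toy_diffs <- toy_long %>%
  # one row per subject, one column per condition
  pivot_wider(names_from = Condition, values_from = Value) %>%
  # row-wise difference; no grouping needed
  mutate(ID_diff = Condition_2 - Condition_1)

toy_diffs
```

This keeps the two condition columns alongside the difference, which can be handy for checking the result by eye.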