Intro

Hello there! In this last episode of the Viz Quiz series I am going to improve a graph that I picked as a messed-up one from a publication by Robin Dunbar, a classic of evolutionary and social psychology. The paper is called “Do online social media cut through the constraints that limit the size of offline social networks?” and is available via the following link: https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.150292.

I will be commenting on the very first plot in the paper (Figure 1), which compares the distributions of the number of friends in the two studied samples: social network users and business employees. The plot can be found below:

So what is wrong?

Four aspects are bothering me.

  1. First, and most important: neither the x-axis nor the y-axis is unified. The x-axis runs from 0 to 1000 on the upper plot and from 0 to 800 on the lower one, while the y-axis runs from 0 to 450 in the first case and from 0 to 200 in the second. This makes it almost impossible to compare the two distributions accurately.

  2. Secondly, there is no grid in the background. This is not a critical point, but it hampers comprehension anyway, as it is hard to tell the exact frequency for each value of the friends count, especially towards the tails of the distributions.

  3. Another point related to visual experience concerns the text labels on the x-axis, which are placed not right below the ticks but in the middle of the interval between two neighbouring ticks. This makes it hard for a reader to tell which tick (x-axis value) a bar belongs to: the one on the left or the one on the right.

  4. Lastly, the breaks on the x-axis are going a little wild. It is quite rare to visualize data this way: the number of friends is treated here as a continuous variable, so the plot is effectively a histogram, and the breaks of a histogram are expected to be equally spaced.
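To make the last point concrete, here is a minimal sketch that checks the spacing between consecutive breaks. The break positions are assumed to match the x-axis of the original figure (they are the same values used in the simulated data further down), and the variable names are purely illustrative.

```r
# x-axis positions per sample (assumed, approximate reading of the original figure)
breaks_social   <- c(0, 25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
breaks_business <- c(0, 25, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800)

# diff() gives the gap between neighbouring breaks;
# more than one unique gap means the intervals are not equal
gaps_social   <- unique(diff(breaks_social))
gaps_business <- unique(diff(breaks_business))

print(gaps_social)    # 25 100
print(gaps_business)  # 25 50 100
```

Under these assumptions the gaps grow from 25 to 100 along the axis, so bars that look equally wide actually cover intervals of very different sizes.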

Improvements

So, I simulated the data used in the figure and tried to play around with how it can be visualized.

Bar chart/Histogram

The first option shows the data in the same way as the original picture, but with several corrections addressing the remarks above. I also added some colors to make it prettier (yeah, just prettier).

library(ggplot2)

# Sample label for every row: 14 x-axis positions for social network users,
# 16 for business employees (factor levels are sorted alphabetically)
sample_group <- factor(rep(c("Social network users", "Business employees"), times = c(14, 16)))

# Simulated bar heights, first for social network users, then for business employees
frequency <- c(
  350, 270, 260, 230, 400, 235, 100, 70, 40, 18, 10, 15, 8, 25,
  25, 158, 150, 120, 180, 190, 160, 100, 90, 60, 38, 20, 30, 18, 15, 29
)

# Number of friends at each x-axis position, in the same order as frequency
no_friends <- c(
  0, 25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
  0, 25, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800
)

data <- data.frame(sample_group, frequency, no_friends)

ggplot(data = data, aes(x = no_friends, y = frequency, fill = sample_group)) +
  # Bars of both samples side by side on shared axes
  geom_col(position = "dodge2", col = "gray4") +
  # theme_bw() adds the background grid missing from the original figure
  theme_bw() +
  labs(
    x = "Number of friends",
    y = "Frequency",
    title = "Distribution of Network Size",
    fill = "",
    subtitle = "Sample 1 (social network user: N = 2000) and\nSample 2 (business employees: N = 1375)"
  ) +
  # Same breaks as on the original plot, with labels right below the ticks
  scale_x_continuous(breaks = c(0, 25, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000)) +
  scale_fill_manual(values = c("#440154FF", "#FDE725FF")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The breaks on the x-axis are kept the same as on the original plot, on the assumption that the author had a reason for this arrangement and wanted the data presented this way.
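As a quick sanity check of the simulated values, the frequencies of each sample can be summed and compared with the sample sizes quoted in the subtitle (N = 2000 and N = 1375). This is only a sketch: the vectors repeat the simulated data from above, and `totals` and `reported_n` are illustrative names.

```r
# Simulated data, repeated here so that the snippet runs on its own
sample_group <- factor(rep(c("Social network users", "Business employees"), times = c(14, 16)))
frequency <- c(
  350, 270, 260, 230, 400, 235, 100, 70, 40, 18, 10, 15, 8, 25,
  25, 158, 150, 120, 180, 190, 160, 100, 90, 60, 38, 20, 30, 18, 15, 29
)

# Sum of the simulated frequencies within each sample
totals <- tapply(frequency, sample_group, sum)
print(totals)  # Business employees: 1383, Social network users: 2031

# Sample sizes as quoted in the plot subtitle
reported_n <- c("Business employees" = 1375, "Social network users" = 2000)

# Difference between the simulated totals and the reported sample sizes
print(totals - reported_n[names(totals)])  # 8 and 31
```

The totals are within a few dozen observations of the reported sample sizes, which suggests the simulated bar heights are a reasonable approximation of the original figure.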

Line plots

Another option uses a different type of visualization. It is a line chart presenting the same information and addressing the issues of the original picture, including the breaks.

ggplot(data, aes(x = no_friends, y = frequency, group = sample_group)) +
  geom_line(aes(linetype = sample_group, col = sample_group)) +
  geom_point(aes(shape = sample_group, col = sample_group)) +
  theme_bw() +
  labs(
    x = "Number of friends",
    y = "Frequency",
    title = "Distribution of Network Size",
    # Same (empty) title for all three aesthetics so that they share one legend
    col = "",
    linetype = "",
    shape = "",
    subtitle = "Sample 1 (social network user: N = 2000) and\nSample 2 (business employees: N = 1375)"
  ) +
  # Equally spaced breaks every 100 friends
  scale_x_continuous(breaks = c(0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)) +
  scale_colour_manual(values = c("#440154FF", "#FDE725FF"))

The following one is a monochrome version of the previous plot, in case there was a restriction on the use of color in graphics when the paper was published.

ggplot(data, aes(x = no_friends, y = frequency, group = sample_group)) +
  # Samples are told apart by line type and point shape only
  geom_line(aes(linetype = sample_group)) +
  geom_point(aes(shape = sample_group)) +
  theme_bw() +
  labs(
    x = "Number of friends",
    y = "Frequency",
    title = "Distribution of Network Size",
    # No colour aesthetic here, so the legend title is removed via linetype and shape
    linetype = "",
    shape = "",
    subtitle = "Sample 1 (social network user: N = 2000) and\nSample 2 (business employees: N = 1375)"
  ) +
  scale_x_continuous(breaks = c(0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000))
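If the two-panel layout of the original figure had to be kept, one more possibility is to facet by sample while forcing both panels onto the same axes. The sketch below is only an illustration of that idea, not a reproduction of the published figure: it repeats the simulated data so that it runs on its own, and `p_facets` is an illustrative name.

```r
library(ggplot2)

# Simulated data, repeated here so that the snippet runs on its own
data <- data.frame(
  sample_group = factor(rep(c("Social network users", "Business employees"), times = c(14, 16))),
  frequency = c(
    350, 270, 260, 230, 400, 235, 100, 70, 40, 18, 10, 15, 8, 25,
    25, 158, 150, 120, 180, 190, 160, 100, 90, 60, 38, 20, 30, 18, 15, 29
  ),
  no_friends = c(
    0, 25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
    0, 25, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800
  )
)

p_facets <- ggplot(data, aes(x = no_friends, y = frequency)) +
  geom_col(fill = "gray60", col = "gray4") +
  # One panel per sample, stacked vertically; scales = "fixed" keeps
  # identical x and y ranges in both panels
  facet_wrap(~ sample_group, ncol = 1, scales = "fixed") +
  theme_bw() +
  labs(
    x = "Number of friends",
    y = "Frequency",
    title = "Distribution of Network Size"
  ) +
  scale_x_continuous(breaks = seq(0, 1000, by = 100))

print(p_facets)
```

With fixed scales a bar of a given height means the same frequency in both panels, which is exactly what the original figure lacks.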

To sum up, all of the presented graphs put the two samples on shared axes that cover both value ranges, which makes the distributions easier to compare. And that is all, thank you for your attention.

References:

  • Dunbar, R.I., 2016. Do online social media cut through the constraints that limit the size of offline social networks? Royal Society Open Science, 3(1), p.150292.