Data analysis and plotting

Simple examples

Overlaid histogram

First up is package imports (that’s just like adding whatever tools we’ll need later).

library(tidyverse)
library(ggplot2)
library(ggthemes)
library(plotly)
library(ggdist)
library(MASS)

Next we have to import the data into a data frame.¹

df <- read_csv("data.csv", col_types = cols(Participant = col_integer()))

We explicitly set the ‘Participant’ column to be an integer for the sake of efficiency – the rest are automatically assigned types.

What does df actually look like?

The top few rows of the data frame look like this:

head(df)

# A tibble: 6 × 4
  Participant Training Test      Error
        <int> <chr>    <chr>     <dbl>
1           1 Physical Pre-test   18.1
2           1 Physical Mid-test   12.8
3           1 Physical Post-test  16.4
4           1 Physical Carryover  16.5
5           2 Physical Pre-test   37.9
6           2 Physical Mid-test   21.6

Next, we’ll filter the data so it’s just the pre- and post-tests for visual training (and order them correctly).²

vis_prepost <- df %>%
  filter(Training == "Visual",
         Test %in% c("Pre-test", "Post-test")) %>%
  mutate(Test = factor(Test, levels = c("Pre-test", "Post-test")))

Now we can make the plot itself.

p <- vis_prepost %>%
1  ggplot(aes(x = Error, fill = Test)) +
  geom_histogram(
2    alpha = 0.6,
3    position = "identity",
4    bins = 20
  ) +
  scale_fill_manual(values = c("#69b3a2", "#404080"))

print(p)

1: Define the data.
2: Reduce the graph’s opacity (so you can actually see when they overlap).
3: Overlay the graphs (instead of stacking them).
4: Set the number of bins to 20.

Finally, we’ll add some labels for the axes and a basic theme.

Code

p <- p + labs(
    title = "Distribution of error scores (visual training)",
    subtitle = "Performance in pre- vs post-tests",
    x = "Error (cm)",
    y = "Count",
    fill = "Test stage"
  ) +
  theme_clean()

print(p)

An overlaid histogram showing the distribution of error scores between pre- and post-tests amongst participants who completed visual training only.

Interactive

We can also make it interactive (like below) or animated or literally whatever you want.

ggplotly(p)

Figure 1: An interactive version of the overlaid histogram.

Smoothed

A smoothed PDF approximation can also convey the info pretty well.

Code

p_density <- vis_prepost %>%
  ggplot(aes(x = Error, fill = Test)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  labs(
    title = "Smoothed distribution of error scores",
    subtitle = "Performance in pre- vs post-tests (kernel density estimate)",
    x = "Error (cm)",
    y = expression(Density~(cm^{-1})),
    fill = "Test stage"
  ) +
  theme_clean()

print(p_density)

Figure 2: An estimation of the probability density function the distribution of error scores would have.

What if I want it to be interactive?

It’s still only one line, like Figure 1.

ggplotly(p_density)

Faceted histogram

For a faceted histogram, you just add facet_wrap.

p_sxs <- p +
  facet_wrap(~ Test, ncol = 2, scales = "fixed") +
  guides(fill = "none")

print(p_sxs)

The same can be applied to the probability density function in Figure 2 (but it doesn’t look half as cool).

Code

p_density_faceted <- p_density +
  facet_wrap(~ Test, ncol = 2) +
  guides(fill = "none")

print(p_density_faceted)

Box plot

Making a box plot is just as simple.

p_base_dist <- vis_prepost %>%
  ggplot(aes(x = Test, y = Error, fill = Test)) +
  labs(
    title = "Distribution of error scores",
    subtitle = "Performance in pre- vs post-tests",
    x = "Test stage",
    y = "Error (cm)"
  ) +
  theme_clean() +
  guides(fill = "none")

p_box <- p_base_dist + geom_boxplot()
print(p_box)

Violin plot

A violin plot can be made with some minor adjustments to a box plot.

p_violin <- p_base_dist + geom_violin(trim = FALSE)
print(p_violin)

More complex examples

Individual trajectories

It’s super easy to add other factors. Here’s a graph with an extra one, training group:

Code

p_trajectories <- df %>%
  mutate(Test = factor(Test, levels = c("Pre-test", "Mid-test", "Post-test", "Carryover"))) %>%
  ggplot(aes(
    x = Test,
    y = Error,
    group = interaction(Participant, Training),
    colour = Training
  )) +
  geom_line(alpha = 0.2) +                     
  stat_summary(                                
    aes(group = Training),
    fun = mean,
    geom = "line",
    linewidth = 1.3
  ) +
  stat_summary(
    aes(group = Training, fill = Training),    
    fun = mean,
    geom = "point",
    shape = 21,                                
    fill = "white",                            
    size = 2.5,                                  
    stroke = 1.2                               
  ) +
  labs(
    title = "Individual participant trajectories",
    subtitle = "By training type, over time",
    x = "Test stage",
    y = "Error (cm)",
    colour = "Training group"
  ) +
  theme_clean()

print(p_trajectories)

The same applies for e.g. dominant vs non-dominant, or against different throwing lengths or whatever else.

Three-dimensional plots

Three-dimensional plots are also easy to make, if we want to like plot every participant without over-plotting the hell out of it.

Code

df_wide <- df %>%
  filter(Test %in% c("Pre-test", "Post-test")) %>%
  tidyr::pivot_wider(names_from = Test, values_from = Error)

kde_surface <- function(d, n = 120, pad = 0.5) {
  x <- d$`Pre-test`; y <- d$`Post-test`
  lims <- c(min(x) - pad, max(x) + pad, min(y) - pad, max(y) + pad)
  MASS::kde2d(x, y, n = n, lims = lims)
}

surfaces <- df_wide %>%
  group_split(Training) %>%
  setNames(df_wide %>% distinct(Training) %>% pull()) %>%
  lapply(kde_surface)

p_3d_visual <- plot_ly() |>
  add_surface(
    x = surfaces$Visual$x,
    y = surfaces$Visual$y,
    z = surfaces$Visual$z,
    colorscale = "Viridis",
    showscale = TRUE
  ) |>
  layout(
    title = "Visual training — 3D density surface",
    scene = list(
      xaxis = list(title = "Pre-test (cm)"),
      yaxis = list(title = "Post-test (cm)"),
      zaxis = list(title = "Density")
    )
  )

p_3d_visual

Code

p_3d_physical <- plot_ly() |>
  add_surface(
    x = surfaces$Physical$x,
    y = surfaces$Physical$y,
    z = surfaces$Physical$z,
    colorscale = "Viridis",
    showscale = TRUE
  ) |>
  layout(
    title = "Physical training — 3D density surface",
    scene = list(
      xaxis = list(title = "Pre-test (cm)"),
      yaxis = list(title = "Post-test (cm)"),
      zaxis = list(title = "Density")
    )
  )

p_3d_physical

Heatmap

Of course, a heatmap may be better in this case:

p_heat_visual2 <- df_wide %>%
  filter(Training == "Visual") %>%
  ggplot(aes(x = `Pre-test`, y = `Post-test`)) +
  stat_density_2d(aes(fill = after_stat(density)),
                  geom = "raster", contour = FALSE) +
  coord_equal() +
  scale_fill_viridis_c(name = "Density") +
  labs(
    title = "Visual training — 2D density heatmap",
    x = "Pre-test (cm)", y = "Post-test (cm)"
  ) +
  theme_minimal()

p_heat_visual2

Footnotes

A data frame is basically just like a table in Excel.↩︎
It isn’t usually necessary to order them – I just did it so the legend would be the right way around.↩︎

Simple examples

Overlaid histogram

Interactive

Smoothed

Faceted histogram

Box plot

Violin plot

More complex examples

Individual trajectories

Three-dimensional plots

Heatmap

Footnotes

Reuse