Data analysis and plotting

Author

Kai

Published

Saturday 4 October 2025

Simple examples

Overlaid histogram

First up is package imports (that’s just like adding whatever tools we’ll need later).

library(tidyverse)
library(ggplot2)
library(ggthemes)
library(plotly)
library(ggdist)
library(MASS)

Next we have to import the data into a data frame.1

df <- read_csv("data.csv", col_types = cols(Participant = col_integer()))

We explicitly set the ‘Participant’ column to be an integer for the sake of efficiency – the rest are automatically assigned types.

The top few rows of the data frame look like this:

head(df)
# A tibble: 6 × 4
  Participant Training Test      Error
        <int> <chr>    <chr>     <dbl>
1           1 Physical Pre-test   18.1
2           1 Physical Mid-test   12.8
3           1 Physical Post-test  16.4
4           1 Physical Carryover  16.5
5           2 Physical Pre-test   37.9
6           2 Physical Mid-test   21.6

Next, we’ll filter the data so it’s just the pre- and post-tests for visual training (and order them correctly).2

vis_prepost <- df %>%
  filter(Training == "Visual",
         Test %in% c("Pre-test", "Post-test")) %>%
  mutate(Test = factor(Test, levels = c("Pre-test", "Post-test")))

Now we can make the plot itself.

p <- vis_prepost %>%
1  ggplot(aes(x = Error, fill = Test)) +
  geom_histogram(
2    alpha = 0.6,
3    position = "identity",
4    bins = 20
  ) +
  scale_fill_manual(values = c("#69b3a2", "#404080"))

print(p)
1
Define the data.
2
Reduce the graph’s opacity (so you can actually see when they overlap).
3
Overlay the graphs (instead of stacking them).
4
Set the number of bins to 20.

Finally, we’ll add some labels for the axes and a basic theme.

Code
p <- p + labs(
    title = "Distribution of error scores (visual training)",
    subtitle = "Performance in pre- vs post-tests",
    x = "Error (cm)",
    y = "Count",
    fill = "Test stage"
  ) +
  theme_clean()

print(p)

An overlaid histogram showing the distribution of error scores between pre- and post-tests amongst participants who completed visual training only.

Interactive

We can also make it interactive (like below) or animated or literally whatever you want.

ggplotly(p)
Figure 1: An interactive version of the overlaid histogram.

Smoothed

A smoothed PDF approximation can also convey the info pretty well.

Code
p_density <- vis_prepost %>%
  ggplot(aes(x = Error, fill = Test)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  labs(
    title = "Smoothed distribution of error scores",
    subtitle = "Performance in pre- vs post-tests (kernel density estimate)",
    x = "Error (cm)",
    y = expression(Density~(cm^{-1})),
    fill = "Test stage"
  ) +
  theme_clean()

print(p_density)
Figure 2: An estimation of the probability density function the distribution of error scores would have.

It’s still only one line, like Figure 1.

ggplotly(p_density)

Faceted histogram

For a faceted histogram, you just add facet_wrap.

p_sxs <- p +
  facet_wrap(~ Test, ncol = 2, scales = "fixed") +
  guides(fill = "none")

print(p_sxs)

A faceted histogram.

The same can be applied to the probability density function in Figure 2 (but it doesn’t look half as cool).

Code
p_density_faceted <- p_density +
  facet_wrap(~ Test, ncol = 2) +
  guides(fill = "none")

print(p_density_faceted)

Box plot

Making a box plot is just as simple.

p_base_dist <- vis_prepost %>%
  ggplot(aes(x = Test, y = Error, fill = Test)) +
  labs(
    title = "Distribution of error scores",
    subtitle = "Performance in pre- vs post-tests",
    x = "Test stage",
    y = "Error (cm)"
  ) +
  theme_clean() +
  guides(fill = "none")

p_box <- p_base_dist + geom_boxplot()
print(p_box)

Violin plot

A violin plot can be made with some minor adjustments to a box plot.

p_violin <- p_base_dist + geom_violin(trim = FALSE)
print(p_violin)

More complex examples

Individual trajectories

It’s super easy to add other factors. Here’s a graph with an extra one, training group:

Code
p_trajectories <- df %>%
  mutate(Test = factor(Test, levels = c("Pre-test", "Mid-test", "Post-test", "Carryover"))) %>%
  ggplot(aes(
    x = Test,
    y = Error,
    group = interaction(Participant, Training),
    colour = Training
  )) +
  geom_line(alpha = 0.2) +                     
  stat_summary(                                
    aes(group = Training),
    fun = mean,
    geom = "line",
    linewidth = 1.3
  ) +
  stat_summary(
    aes(group = Training, fill = Training),    
    fun = mean,
    geom = "point",
    shape = 21,                                
    fill = "white",                            
    size = 2.5,                                  
    stroke = 1.2                               
  ) +
  labs(
    title = "Individual participant trajectories",
    subtitle = "By training type, over time",
    x = "Test stage",
    y = "Error (cm)",
    colour = "Training group"
  ) +
  theme_clean()

print(p_trajectories)

The same applies for e.g. dominant vs non-dominant, or against different throwing lengths or whatever else.

Three-dimensional plots

Three-dimensional plots are also easy to make, if we want to like plot every participant without over-plotting the hell out of it.

Code
df_wide <- df %>%
  filter(Test %in% c("Pre-test", "Post-test")) %>%
  tidyr::pivot_wider(names_from = Test, values_from = Error)

kde_surface <- function(d, n = 120, pad = 0.5) {
  x <- d$`Pre-test`; y <- d$`Post-test`
  lims <- c(min(x) - pad, max(x) + pad, min(y) - pad, max(y) + pad)
  MASS::kde2d(x, y, n = n, lims = lims)
}

surfaces <- df_wide %>%
  group_split(Training) %>%
  setNames(df_wide %>% distinct(Training) %>% pull()) %>%
  lapply(kde_surface)

p_3d_visual <- plot_ly() |>
  add_surface(
    x = surfaces$Visual$x,
    y = surfaces$Visual$y,
    z = surfaces$Visual$z,
    colorscale = "Viridis",
    showscale = TRUE
  ) |>
  layout(
    title = "Visual training — 3D density surface",
    scene = list(
      xaxis = list(title = "Pre-test (cm)"),
      yaxis = list(title = "Post-test (cm)"),
      zaxis = list(title = "Density")
    )
  )

p_3d_visual
Code
p_3d_physical <- plot_ly() |>
  add_surface(
    x = surfaces$Physical$x,
    y = surfaces$Physical$y,
    z = surfaces$Physical$z,
    colorscale = "Viridis",
    showscale = TRUE
  ) |>
  layout(
    title = "Physical training — 3D density surface",
    scene = list(
      xaxis = list(title = "Pre-test (cm)"),
      yaxis = list(title = "Post-test (cm)"),
      zaxis = list(title = "Density")
    )
  )

p_3d_physical

Heatmap

Of course, a heatmap may be better in this case:

p_heat_visual2 <- df_wide %>%
  filter(Training == "Visual") %>%
  ggplot(aes(x = `Pre-test`, y = `Post-test`)) +
  stat_density_2d(aes(fill = after_stat(density)),
                  geom = "raster", contour = FALSE) +
  coord_equal() +
  scale_fill_viridis_c(name = "Density") +
  labs(
    title = "Visual training — 2D density heatmap",
    x = "Pre-test (cm)", y = "Post-test (cm)"
  ) +
  theme_minimal()

p_heat_visual2

Footnotes

  1. A data frame is basically just like a table in Excel.↩︎

  2. It isn’t usually necessary to order them – I just did it so the legend would be the right way around.↩︎

Reuse

All rights reserved.