MwALT 2025 Pre-Conference Workshop: Visualizing Quantitative Data for Presentation and Publication
This document provides R code for the planned workshop tasks, supporting attendees during the workshop and serving as a post-workshop resource for review.
Preliminaries
The first thing we’ll do is load packages:
#load packages
library(psych)
library(tidyverse)
library(GGally)
library(easystats)
If you haven’t yet installed any of these packages, remember you can do so by accessing the “Packages” tab in the bottom right panel of RStudio.
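If you prefer the console, installation can also be done with install.packages() (this only needs to be done once per machine):
#install the workshop packages from the console (only needed once)
install.packages(c("psych", "tidyverse", "GGally", "easystats"))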
Next, we’ll load a few datasets that will be used throughout the workshop:
#load data
det <- read_csv("DET_scores.csv")
sa <- read_csv("isbell_lee_2022.csv")
eli <- read_csv("ELIPT_F2017_S2020_combined_data_anon.csv")
This workshop generally glosses over R basics, as our focus is primarily on one task (creating plots with ggplot2). Nonetheless, a few additional data wrangling and analysis tasks are featured, and some explanation and in-line code comments (marked with # in code snippets) are given.
Task #1: Improving R’s base histogram
The basic histogram function in R is very easy to execute, but quite plain (derogatory). Let’s try it out for plotting a distribution of Duolingo English Test scores.
#default histogram
hist(det$Score)
This histogram is serviceable, but not particularly attractive. Let’s spruce it up:
det %>% ggplot(aes(x = Score))+
geom_histogram(binwidth = 5, fill = "lightgreen", color = "black")+
scale_y_continuous(limits = c(0, 11), expand = c(0,0),
breaks = c(0, 2, 4, 6, 8, 10))+
scale_x_continuous(limits = c(10, 160),
breaks = seq(from = 10, to = 160, by = 10))+
labs(x = "DET Score", y = "Frequency")+
theme_classic()
This is looking better - the color is something people associate with the test, we’ve included the whole range of possible scores on the x-axis, and there is minimal chartjunk.
One last step to make this ready for a printed page - saving at high resolution:
ggsave("ggplot_det_hist.png", width = 6, height = 4, units = "in",
dpi = 300)
The classic minimum resolution was 72 dpi (dots per inch) for web viewing and 150 dpi for print (or at least that’s what I remember from my undergrad coursework half a lifetime ago…), but frankly storage is cheap and most .png files will not be too large. I prefer to save at 300 or 600 dpi, if not more.
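For example, to save a higher-resolution copy of the same figure (the filename here is just a placeholder):
#same histogram saved at 600 dpi (hypothetical filename)
ggsave("ggplot_det_hist_600dpi.png", width = 6, height = 4, units = "in",
dpi = 600)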
Task #2: Annotate to help viewers
We’ll annotate the plot we just made to include a key statistic and a curve that guides evaluation of the distribution.
First, calculate the mean score:
mean(det$Score)
[1] 107.6
Now we’ll build up the plot, starting with our code from the previous task and adding onto it:
det %>% ggplot(aes(x = Score))+
geom_histogram(binwidth = 5, fill = "lightgreen", color = "darkgray",
alpha = .8)+
geom_density(aes(y = after_stat(density)*(100*5)), color = "blue", linewidth = 1)+ #rescale density to the count scale: n (100) x binwidth (5)
geom_vline(aes(xintercept = mean(Score)), color = "black", linetype = 5,
linewidth = 1.3)+
geom_text(aes(x = mean(Score)-20, y = 10.5, label = "Mean = 107.60"),
color = "black", size = 4)+
scale_y_continuous(limits = c(0, 11), expand = c(0,0),
breaks = c(0, 2, 4, 6, 8, 10))+
scale_x_continuous(limits = c(10, 160),
breaks = seq(from = 10, to = 160, by = 10))+
labs(x = "DET Score", y = "Frequency")+
theme_classic()
Don’t forget to save the file!
ggsave("ggplot_det_hist_annotated.png", width = 6, height = 4, units = "in",
dpi = 300)
Task #3: Make a data-accountable version of a plot
Now we’ll try working with a different set of data: Isbell & Lee’s (2022) study on self-assessment of L2 Korean speech. The kind of plot we’ll work with is a boxplot. Here’s what it looks like with R’s base function:
#baseline: a boxplot of other-assessed comprehensibility
boxplot(Comprehensibility ~ Gender, data = sa)
It’s fine, but could use some customization and data-accountability. Let’s work on that with ggplot():
sa %>% ggplot(aes(x = Gender, y = Comprehensibility, fill = Gender))+
geom_boxplot(outliers = F)+
geom_jitter(height = 0, width = .25, size = 1)+
scale_y_continuous(limits = c(1, 9), breaks = 1:9)+
theme_bw()+
theme(legend.position = "none")Just by plotting the points, we can see the difference in group size and get a better sense of the distribution of Comprehensibility scores across female and male participants.
Save the plot if you’d like:
ggsave("ggplot_comp_boxplot.png", width = 6, height = 6, units = "in",
dpi = 300)
Task #4: Make a plot ready for the big screen
For this task, we’ll work on customization options that make a plot well-suited to presentation in a slide deck (think: large room and some folks sitting 50+ feet away from the projector screen).
We start with the last plot, so you can go ahead and copy and paste that code as a starting point.
Then, customize:
#customize the plot to display well for a slide
#data-accountable plot in ggplot
sa %>% ggplot(aes(x = Gender, y = Comprehensibility, fill = Gender))+
geom_boxplot(outliers = F)+
geom_jitter(height = 0, width = .25, size = 3)+
scale_y_continuous(limits = c(1, 9), breaks = 1:9)+
theme_bw()+
theme(legend.position = "none",
text = element_text(size = 21))
Afterwards, you can save this plot, too. This is set to 6 x 6 inches, but you can try out different dimensions based on what your slide deck uses (though I would keep this a square figure):
ggsave("ggplot_comp_boxplot_slide.png", width = 6, height = 6, units = "in",
dpi = 300)
Task #5.1: Table to plot for item statistics
For this first task focused on creating a plot instead of a table, we’ll calculate some item statistics using the psych::alpha() function (which generates Cronbach’s alpha estimates and CTT item stats). Then we’ll create a plot, rather than a table, to show item statistics from the Academic Listening Test of my university’s English Language Institute Placement Test battery.
First, running the item stats. We select just the items from the ALT and pass those through to the psych::alpha() function:
eli %>% select(ALT_01:ALT_36) %>% psych::alpha() -> ALT_alpha
Now we’ll create a plot. We’ll actually do a little data wrangling before sending things to ggplot(): we will pick out the table with item facility values and then change the rownames to a column (essentially going from item labels ‘hidden’ in rownames to a proper variable we can use in plots). Then we pass that through to ggplot() for a nice visual that includes some annotations.
ALT_alpha$item.stats %>%
rownames_to_column() %>%
ggplot(aes(y = fct_rev(as_factor(rowname)), x = mean))+
geom_bar(stat = "identity", fill = "darkgray")+
geom_vline(xintercept = .25, linetype = 2, color = "blue")+
geom_vline(xintercept = .75, linetype = 2, color = "blue")+
scale_x_continuous(limits = c(0,1.03), expand = c(0,0))+
labs(y = NULL, x = "Item Facility", title = "Academic Listening Test Item Facility Values",
subtitle = "Based on 528 observations between 2017 and 2020")+
theme_bw()
And we can save this plot:
ggsave("ggplot_itemfacility.png", width = 10, height = 7, units = "in", dpi = 300)For Task #5.1, think about what changes you might need to make to get it ready for a slide deck. Hint: think about text size!
Task #5.2: Table to plot for correlations
We’ll try out two approaches to creating correlation heatmaps, which are effective ways of summarizing a large number of correlations or highlighting a set of important correlations in a presentation.
Approach #1: GGally’s ggcorr() function to do it all
In this approach, we’ll select variables of interest and then pass them on to the GGally::ggcorr() function to create a correlation heatmap with minimal code:
sa %>%
select(Comprehensibility:AO) %>%
ggcorr(method = c("pairwise", "pearson"),
label = T,
label_size = 4,
layout.exp = 4,
angle = 0,
hjust = .95,
size = 5,
legend.position = "none")By this point, I hope you can figure out how to save the plot!
Approach #2: Calculate a subset of correlations and plot
In this second approach, we’ll first calculate a subset of focal correlations with a function from the easystats suite of packages and then use that for plotting in the next step.
##create a table of correlations in R
sa_corrs <- sa %>% correlation(
select = c("SA_Comprehensibility", "SA_Accentedness"),
select2 = c("Comprehensibility", "Accentedness",
"Satisfaction", "Value",
"Kor_Use", "LivingSK"),
method = "pearson",
p_adjust = "none"
)
Now we’ll use that dataframe of correlations to create a custom heatmap:
sa_corrs %>%
ggplot(aes(x = Parameter1, y = Parameter2))+
geom_tile(aes(fill = r), color = "black")+
geom_text(aes(label = round(r, 2)), color = "black",
size = 8)+
scale_fill_gradient(low = "white", high = "red")+
scale_x_discrete(position = "top")+
labs(x = NULL, y = NULL, fill = "Pearson \nCorrelation")+
theme_minimal()+
theme(text = element_text(size = 18))
Don’t forget to save the plot!
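As before, ggsave() will do the job; one possible call (hypothetical filename and dimensions):
#save the custom correlation heatmap (hypothetical filename)
ggsave("ggplot_corr_heatmap.png", width = 7, height = 6, units = "in", dpi = 300)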
Task #6.1: Putting it all together - comparing correlations for two different groups
Following on from Task #5.2, we’ll calculate some correlations and use them when annotating a pair of scatterplots illustrating the relationship between Comprehensibility and Self-Assessed Comprehensibility across male and female subgroups.
First, the correlations:
sa_corr_mf <- sa %>% group_by(Gender) %>%
correlation(select = "Comprehensibility", select2 = "SA_Comprehensibility",
method = "pearson", p_adjust = "none") %>%
mutate(p_apa = format_p(p, stars = F, digits = "apa")) %>%
rename(Gender = Group)
And now the plot:
sa %>% ggplot(aes(x = Comprehensibility, y = SA_Comprehensibility))+
geom_smooth(method = "lm")+
geom_point(alpha = .4)+
geom_label(data = sa_corr_mf, aes(label = paste0("r = ", round(r, 2),", ", p_apa),
x = Inf, y = Inf, hjust = 2.35, vjust = 1.5))+
scale_x_continuous(limits = c(1,9), breaks = 1:9, expand = c(.01, .01))+
scale_y_continuous(limits = c(1,9), breaks = 1:9, expand = c(.01, .01))+
labs(y = "Self-Assessed Comprehensibility", x = "Other-Assessed Comprehensibility")+
theme_bw()+
facet_wrap(~Gender)
Don’t forget to save the plot!
Task #6.2: Putting it all together - dynamite plots to compare mean scores
For this final task, we’ll calculate some means with 95% confidence intervals and use them to create a plot to compare means of each section of the ELIPT across three distinct student subgroups: undergraduate students, exchange students, and graduate students.
First, we’ll filter out a couple of small/irregular student groups:
eli <- eli %>% filter(!Student_Type %in% c("ExchG", "PBU"))
Next, the calculations:
eli %>% pivot_longer(ALT_R:GF_R, names_to = "Section", values_to = "Score") %>%
group_by(Student_Type, Section) %>%
filter(!is.na(Score)) %>%
summarise(n = n(),
mean = mean(Score, na.rm = T),
sd = sd(Score, na.rm = T),
se = sd/sqrt(n),
ci = se*1.96, #half-width of an approximate 95% CI (1.96 * SE)
max = max(Score)) -> eli_summary
Next, we’ll clean up some of the text labels so that our plot looks nicer in the end. This is something we have skipped over on some previous plots for the sake of time, but it is generally something you should do (especially for print publications).
#make prettier labels
eli_summary <- eli_summary %>%
mutate(Student_Type = case_when(Student_Type == "ExchU" ~ "Exchange",
Student_Type == "G" ~ "Graduate",
Student_Type == "U" ~ "Undergraduate"),
Section = case_when(Section == "ALT_R" ~ "Academic Listening",
Section == "DCT_R" ~ "Dictation",
Section == "GF_R" ~ "Gap-Fill Reading",
Section == "RCT_R" ~ "Reading Comprehension"))Finally, the plot. We’ll make this one ready for a slide deck:
eli_summary %>%
ggplot(aes(x = Student_Type, y = mean, fill = Student_Type))+
geom_bar(stat = "identity", position = "dodge2", color = "black")+
geom_errorbar(aes(ymin = mean-ci, ymax = mean+ci), position = "dodge2",
width = .4)+
geom_point(show.legend = FALSE)+
scale_y_continuous(expand = c(0,0), limits = c(0, 35))+
scale_fill_brewer(palette = "Accent")+
labs(title = "ELI Placement Test Mean Scores by \nSection and Student Academic Level",
y = "Mean Score", x = NULL)+
theme_bw()+
facet_wrap(~Section, nrow = 2)+
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.text.x = element_text(angle = 30, hjust = 1),
text = element_text(size = 20))
Don’t forget to save the plot! Also, think about what you might change for a print version of this plot.
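One possible direction for a print version (a sketch only; the palette, sizes, and filename are just suggestions, not the one “right” answer):
#possible print-oriented tweaks: grayscale-friendly fills, smaller text, no title (a figure caption can carry that information)
last_plot()+
  scale_fill_grey(start = .4, end = .9)+
  labs(title = NULL)+
  theme(text = element_text(size = 11))
ggsave("elipt_means_print.png", width = 6.5, height = 5, units = "in", dpi = 600)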