Note on a Tidyverse Certification Exam

library(tidyverse)
library(here)
library(visdat)

The Challenge

This note applies to an older exam for the RStudio Tidyverse Certification, which I came across while preparing for the newer version. (I had successfully completed the Teaching Exam with Greg Wilson on 2021-02-01, but shortly afterwards RStudio changed direction, and the Instructor Certification process is currently suspended.) I found that I respectfully disagreed with the solution that a few highly skilled and qualified individuals posted as the correct answer, so I offer my version below.

Working with the ranking.csv data, the test-taker was challenged to “Re-create this plot using the tidyverse and ggplot2, fixing any mistakes you notice along the way”.

Reproduce & Fix

The Challenge Plot

The Data

Let’s assume that ranking.csv has been downloaded to the folder data in your project directory. Otherwise, modify the following code as needed.

ranking <- read_csv(here::here("data", "ranking.csv"))

ranking %>% vis_dat()

ranking %>% glimpse()

## Rows: 6,460
## Columns: 2
## $ item <dbl> 192, 171, 191, 182, 41, 112, 60, 99, 168, 80, 322, 106, 104, 19, ~
## $ rank <chr> "positive", "negative", "positive", "positive", "positive", "posi~

The First Task

The data is in long format. We do not have the variables positive and negative to plot. Nor do we have a count for each – a value to plot. We have only item and rank. So this truthfully is the real challenge: transform the data to a set we can plot.

Let me state that the posted solutions I’ve seen for this part are excellent.

Breaking it down

Let’s break down what we need to do. Each item might have several rankings. The distinct levels (as rank is more properly a factor) are: positive, negative, indifferent, wtf. For our task, we need only consider positive and negative.
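
A quick way to confirm which levels occur, and how often, is to count them. The following is only a sketch: toy is a hypothetical stand-in for the real ranking data, with the same two columns.

```r
library(dplyr)

# Hypothetical stand-in for ranking.csv: one row per (item, rank) reply
toy <- tibble(
  item = c(1, 1, 1, 2, 2, 3),
  rank = c("positive", "negative", "positive", "wtf", "positive", "indifferent")
)

# One row per distinct level of rank, with the number of replies in n
level_counts <- toy %>% count(rank)
level_counts
```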

We need to group_by() the variable item, then count() the rank levels (“positive”, “negative”, etc.) for each item, then pivot_wider() the rank variable. So in our new wide format, we have the levels as names (new variables) and the counts as their values.

Finally, we see that the plot maps the positive and negative variables as percentages. So we will also need to convert each count to a proportion: the count for a level divided by the total number of replies for that item.
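
As a worked example of that conversion, here is the arithmetic for a single item. The counts are those of item 1, taken from the first row of the wide table shown later in this post; the vector names counts and shares are mine, for illustration only.

```r
# Counts for item 1, copied from the first row of the wide table in this post
counts <- c(indifferent = 7, positive = 12, wtf = 1, negative = 0)

# Share of each level = count for that level / total replies for the item
shares <- round(counts / sum(counts), digits = 3)
shares
```

With 20 replies in total, positive comes to 12 / 20 = 0.6, which agrees with the converted table further down.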

This can all be done in one step, but I will do it in two: create the wide version, and then convert the values. In this instance, we can convert NA to zero: the survey had 0 replies for that level. In other instances, it might be best to leave NA values standing.

rank_wide <- ranking %>%
  group_by(item) %>%
  count(rank) %>%
  pivot_wider(names_from = rank, values_from = n) %>%
  ungroup() %>%
  replace_na(list(positive = 0, 
                  negative = 0, 
                  indifferent = 0, 
                  wtf = 0) )

rank_wide %>% head() %>% knitr::kable()

item	indifferent	positive	wtf	negative
1	7	12	1	0
2	4	16	0	1
3	7	7	0	11
4	3	18	0	4
5	4	14	2	0
6	0	21	1	1
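
As an aside, the zero-filling can also be done during the pivot itself. This sketch uses a hypothetical toy data set rather than ranking.csv, and it assumes (as above) that a missing combination really does mean zero replies. Note that count(item, rank) returns an ungrouped result, so no ungroup() is needed.

```r
library(dplyr)
library(tidyr)

# Hypothetical replies: item 1 never received "wtf", item 2 never "negative"
toy <- tibble(
  item = c(1, 1, 1, 2, 2),
  rank = c("positive", "positive", "negative", "positive", "wtf")
)

toy_wide <- toy %>%
  count(item, rank) %>%                      # counts per item and level
  pivot_wider(names_from = rank,
              values_from = n,
              values_fill = 0)               # missing combinations become 0
toy_wide
```

One difference to keep in mind: values_fill only fills cells for levels that appear somewhere in the data, so it cannot create a column for a level that never occurs.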

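Now, we need to express the count values as percentages. We can use the purrr-style formula (lambda) syntax, in which ~ introduces an anonymous function and . stands for the column being transformed. Although I removed the NA values earlier, we can keep na.rm = TRUE as a safety check and good coding practice.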

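rank_wide <- rank_wide %>%
  # one row per item, so the grouped sum() gives each item's row total
  group_by(item) %>%
  mutate(tot = sum(positive, 
                   negative, 
                   indifferent, 
                   wtf, 
                   na.rm = TRUE)) %>%
  # divide every count column (all but item and tot) by the item's total
  mutate_at(vars(-item, -tot), 
            .funs = list(~ round(./tot, digits = 3))) %>%
  ungroup()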

rank_wide %>% head() %>% knitr::kable()

item	indifferent	positive	wtf	negative	tot
1	0.35	0.600	0.050	0.000	20
2	0.19	0.762	0.000	0.048	21
3	0.28	0.280	0.000	0.440	25
4	0.12	0.720	0.000	0.160	25
5	0.20	0.700	0.100	0.000	20
6	0.00	0.913	0.043	0.043	23
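
In current dplyr, mutate_at() is superseded by across(). The following sketch shows what I believe to be an equivalent conversion. The toy_wide tibble is a hypothetical stand-in that reuses the counts for items 1 and 2 from the table above; because + is vectorized, the row totals need no grouping here.

```r
library(dplyr)

# Hypothetical wide counts, same shape as rank_wide before the conversion
toy_wide <- tibble(
  item        = c(1, 2),
  positive    = c(12, 16),
  negative    = c(0, 1),
  indifferent = c(7, 4),
  wtf         = c(1, 0)
)

toy_pct <- toy_wide %>%
  mutate(tot = positive + negative + indifferent + wtf) %>%      # row totals
  mutate(across(-c(item, tot), ~ round(.x / tot, digits = 3)))   # counts to shares
toy_pct
```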

The Improved Plot

Now, we are ready to plot positive ~ negative. But let’s take a moment to look at the example plot. Our instructions were to “Re-create this plot using the tidyverse and ggplot2, fixing any mistakes you notice along the way”.

Questions About Original

First, why is alpha showing in the legend? It tells us nothing meaningful about the data. It clutters and detracts from the presentation. Second, why are the lines from geom_smooth() present in the legend? In fact, they obscure our guide to size – which does offer meaningful information. So turn them off. Third, the plot has no title and only the default labels. Fourth, and arguably, why not scale the color?

Modifications & Result

Because we will be using color to help distinguish data points, we can increase the alpha a bit as well as remove it from the legend.

rank_wide %>%
  ggplot( aes(x = negative, 
              y = positive,  
              color = positive, 
              size = tot)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm" , show.legend = FALSE) +
  labs(size = "Number\nof replies", 
       color ="Positivity" ,
       x = "Negative", 
       y = "Positive",
       title = "Survey Responses: Positive ~ Negative",
       caption = "Data: ranking.csv") +
  scale_color_viridis_c() +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent) +
  guides(size = guide_legend(order = 1))
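
On the alpha question: the code behind the original plot is not shown here, so this is only a guess, but one common way for alpha to end up in a legend is to place it inside aes(). The sketch below uses a small hypothetical data frame to contrast the two forms.

```r
library(ggplot2)

# Hypothetical data in the same shape as the converted rank_wide
toy <- data.frame(negative = c(0.1, 0.3, 0.5),
                  positive = c(0.8, 0.6, 0.3),
                  tot      = c(20, 25, 22))

# alpha inside aes(): treated as a data mapping, so it gets a legend entry
p_mapped <- ggplot(toy, aes(negative, positive, size = tot)) +
  geom_point(aes(alpha = 0.4))

# alpha outside aes(): a fixed setting for the layer, so no legend entry
p_fixed <- ggplot(toy, aes(negative, positive, size = tot)) +
  geom_point(alpha = 0.4)
```

Printing p_mapped and p_fixed side by side shows the difference in the legends.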

Qualification

So I submit the above plot as a better solution to the challenge of “reproduce and fix”. In fairness, though, everyone taking the original exam was under extreme time pressure. The real challenge, again, was not the plot but transforming the data; reproducing the plot was proof that you were able to transform the data. I confess I found that challenge difficult and time-consuming. In the context of a timed exam, I would have been quite happy to finish just by reproducing the original plot.

Thank you for reading!

Thomas J. Haslam
2021-07-26