D&D Workshop: Data Visualization in R

Data visualization in R and concepts of `ggplot2`:

See my presentation slides here (right-click and open the link in a new tab).

Keep the concepts for `ggplot2` in mind: layers of data + aesthetic mappings + geometries
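As a minimal sketch of those three pieces, here is a plot built from the built-in `mtcars` data set (my choice for illustration only; it is not part of the workshop data). The data and the aesthetic mappings go into `ggplot()`, and each geometry is added as its own layer.

```r
library(ggplot2)

# Data: the built-in mtcars data set (not part of the workshop data)
# Aesthetic mappings: cylinders on the x-axis, miles per gallon on the y-axis
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
    geom_boxplot()              # geometry: one boxplot per cylinder group

# A second geometry is simply another layer on top of the first
p2 <- p + geom_point(alpha = .5)
p2
```

The same pattern is used in every example below; only the data, the mappings, and the geometries change.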

Load data

These data sets are from: http://www.routledgetextbooks.com/textbooks/9781138024571/

The Ellis & Yuan (2004) data is simplified and reduced to include just two of the variables. The Obarow data is simplified and manipulated to create the graphs that I would like to show, while keeping the concepts of a second language acquisition experiment.

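library(ggplot2)
# stringsAsFactors = TRUE matches the str() output below (not the default since R 4.0.0)
ell <- read.csv("EllisYuan.csv", stringsAsFactors = TRUE)
obarow <- read.csv("obarow.csv", stringsAsFactors = TRUE)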

Example 1: Plot interaction

Creating boxplots with multiple independent variables

STEP 1: Data

Let’s look at the data structure again.

str(obarow)

## 'data.frame':    67 obs. of  10 variables:
##  $ id       : int  3 2 6 7 1 10 5 8 11 14 ...
##  $ gender   : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
##  $ grade    : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ treatment: Factor w/ 4 levels "NMNP","NMYP",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pretest  : int  15 11 13 14 13 14 18 16 15 17 ...
##  $ posttest : int  14 11 13 15 12 14 16 14 13 16 ...
##  $ gain1    : int  -1 0 0 1 -1 0 -2 -2 -2 -1 ...
##  $ gain2    : int  -2 1 0 1 0 5 4 2 1 -1 ...
##  $ gain3    : int  0 0 1 3 -1 -1 0 0 1 1 ...
##  $ gain4    : int  0 3 1 0 1 -1 0 0 0 1 ...

I would like to show (any) interaction effect between treatment condition and grade level on post-test score, so the variables I will plot are:

Dependent variable: posttest
Independent variables: treatment, grade

I will change grade to a factor with 3 levels:

obarow$grade <- factor(obarow$grade, levels = 1:3, 
                       labels = c("1st grade", "3rd grade", "5th grade"))
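To see what `levels` and `labels` do in that call, here is a small sketch on a made-up vector of grade codes (hypothetical values that only mirror the 1/2/3 coding of `obarow$grade`). Each code listed in `levels` is paired, in order, with the text in `labels`.

```r
# Hypothetical grade codes, mirroring the 1/2/3 coding in obarow$grade
grade_codes <- c(1, 1, 2, 3, 2, 3)

# Code 1 becomes "1st grade", code 2 "3rd grade", code 3 "5th grade"
grade_f <- factor(grade_codes, levels = 1:3,
                  labels = c("1st grade", "3rd grade", "5th grade"))

levels(grade_f)   # the labels, in the order given by `levels`
table(grade_f)    # number of observations per labelled level
```

Converting to a factor matters for plotting because ggplot2 treats a numeric variable as continuous, so it would not split the boxplots into separate groups.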

You can look at this data in two different ways:

ggplot(obarow, aes(x = grade, y = posttest, fill = treatment)) + 
    geom_boxplot()

ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) + 
    geom_boxplot()

How you map the variables will depend on what you’re trying to show based on your research question.

STEP 2: Include elements and adjust attributes

I’ll first try out boxplots with data points:

ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) + 
    geom_boxplot(outlier.shape = NA) + 
    geom_point(alpha = .5)

You will have to set position to position_dodge() to visually separate the groups:

ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) + 
    geom_boxplot(outlier.shape = NA) + 
    geom_point(position = position_dodge(.75), alpha = .5)

position_jitterdodge simultaneously jitters and dodges:

ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) + 
    geom_boxplot(outlier.shape = NA) + 
    geom_point(position = position_jitterdodge(.2), alpha = .4)

# you could also do `geom_jitter` instead of `geom_point` then use `position_dodge()`

But maybe I want to simply use lines to show the trends:

ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) + 
    stat_summary(aes(group = grade, color = grade), fun = "mean", geom = "line", 
                 position = position_dodge(.2), linewidth = 1) + 
    scale_y_continuous(limits = c(0, 30)) + 
    labs(title = "Effect of learning condition on vocabulary learning by grade level", 
         x = "", y = "Vocabulary score\n") + 
    theme_bw(base_size = 12) + 
    theme(panel.grid.major.x = element_blank(),
          axis.ticks.x = element_blank())

This is a typical graph for interaction but I could add more information by adding boxplots:

ggplot(obarow, aes(x = treatment, y = posttest, fill = grade)) + 
    geom_boxplot(width = .2, alpha = .5, outlier.shape = NA) + 
    stat_summary(aes(group = grade, color = grade), fun = "mean", geom = "line", 
                 position = position_dodge(.2), linewidth = 1) + 
    stat_summary(fun = "mean", geom = "point", shape = 22, size = 2, 
                 position = position_dodge(.2), show.legend = FALSE) + 
    scale_y_continuous(limits = c(0, 30)) + 
    labs(title = "Effect of learning condition on vocabulary learning by grade level", 
         x = "", y = "Vocabulary score\n") + 
    theme_bw(base_size = 12) + 
    theme(panel.grid.major.x = element_blank(),
          axis.ticks.x = element_blank())

# run two-way ANOVA with interaction (Type III)
options(contrasts = c("contr.sum", "contr.poly"))
mod <- lm(posttest ~ treatment*grade, data = obarow)
car::Anova(mod, type = "III")

## Anova Table (Type III tests)
## 
## Response: posttest
##                  Sum Sq Df   F value    Pr(>F)    
## (Intercept)     22170.7  1 7555.0822 < 2.2e-16 ***
## treatment         101.2  3   11.4989  5.86e-06 ***
## grade            1063.1  2  181.1429 < 2.2e-16 ***
## treatment:grade    33.5  6    1.9045   0.09643 .  
## Residuals         161.4 55                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When I actually run the ANOVA, I find no significant interaction (p = .096), only main effects of treatment and grade. But judging from the graph, there might be something going on with the NMYP group in the lower grades. Collecting more data would help reveal whether there is an interaction.
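One way to follow up on a pattern seen in a plot is to tabulate the cell means directly. The sketch below uses a made-up 2 x 2 data set (the names `toy`, `A`/`B`, and `low`/`high` and all the numbers are hypothetical, not the Obarow data) and base R's `tapply()`.

```r
# Hypothetical scores for a 2 x 2 design (made-up numbers, not the Obarow data)
toy <- data.frame(
    treatment = rep(c("A", "B"), each = 4),
    grade     = rep(c("low", "high"), times = 4),
    score     = c(10, 14, 12, 16, 11, 19, 13, 21)
)

# Mean score in each treatment-by-grade cell
cell_means <- with(toy, tapply(score, list(treatment, grade), mean))
cell_means

# Grade difference within each treatment; unequal differences are the
# numeric counterpart of non-parallel lines in an interaction plot
grade_diff <- cell_means[, "high"] - cell_means[, "low"]
grade_diff
```

Whether such a difference is statistically reliable is still a question for the ANOVA; the table only describes the pattern.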

Example 2: Repeated measures with more than two data points

str(obarow)

## 'data.frame':    67 obs. of  10 variables:
##  $ id       : int  3 2 6 7 1 10 5 8 11 14 ...
##  $ gender   : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
##  $ grade    : Factor w/ 3 levels "1st grade","3rd grade",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ treatment: Factor w/ 4 levels "NMNP","NMYP",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pretest  : int  15 11 13 14 13 14 18 16 15 17 ...
##  $ posttest : int  14 11 13 15 12 14 16 14 13 16 ...
##  $ gain1    : int  -1 0 0 1 -1 0 -2 -2 -2 -1 ...
##  $ gain2    : int  -2 1 0 1 0 5 4 2 1 -1 ...
##  $ gain3    : int  0 0 1 3 -1 -1 0 0 1 1 ...
##  $ gain4    : int  0 3 1 0 1 -1 0 0 0 1 ...

There are four gain scores, and we want to plot the changes over time (four measurements of gain scores).

To do that, we need to gather the four scores into a single column, so that we can map the four time points onto the x-axis and the score onto the y-axis.

ob3 <- tidyr::gather(obarow, key = gains, value = score, gain1:gain4)
str(ob3)

## 'data.frame':    268 obs. of  8 variables:
##  $ id       : int  3 2 6 7 1 10 5 8 11 14 ...
##  $ gender   : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 2 2 1 1 ...
##  $ grade    : Factor w/ 3 levels "1st grade","3rd grade",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ treatment: Factor w/ 4 levels "NMNP","NMYP",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pretest  : int  15 11 13 14 13 14 18 16 15 17 ...
##  $ posttest : int  14 11 13 15 12 14 16 14 13 16 ...
##  $ gains    : chr  "gain1" "gain1" "gain1" "gain1" ...
##  $ score    : int  -1 0 0 1 -1 0 -2 -2 -2 -1 ...

ob3$gains <- factor(ob3$gains)

Now we have gains as a variable, which are the time points. I changed it to a factor.
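As an aside, `gather()` still works but has since been superseded in tidyr by `pivot_longer()`. Here is a sketch of the same kind of reshape on a made-up wide data frame (the `wide` and `long` objects and their values are hypothetical).

```r
library(tidyr)

# Hypothetical wide data: one row per learner, one column per time point
wide <- data.frame(
    id    = 1:3,
    gain1 = c(-1, 0, 2),
    gain2 = c(0, 1, 3),
    gain3 = c(1, 1, 4),
    gain4 = c(2, 3, 5)
)

# Long format: one row per learner per time point
long <- pivot_longer(wide, cols = gain1:gain4,
                     names_to = "gains", values_to = "score")

# The time-point column is converted to a factor, as above
long$gains <- factor(long$gains)

str(long)
```

Note that `pivot_longer()` returns a tibble and orders the rows by learner rather than by time point, which makes no difference for plotting.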

Let’s draw line graphs that show group mean changes:

ggplot(ob3, aes(x = gains, y = score)) + 
    geom_line(aes(group = id), color = "grey80") + 
    stat_summary(aes(group = treatment, color = treatment), 
                 fun = "mean", geom = "line", linewidth = 2)

Polish it up:

ggplot(ob3, aes(x = gains, y = score, group = treatment, color = treatment)) + 
    geom_line(aes(group = id, color = treatment), alpha = .3) + 
    stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .1, 
                 position = position_dodge(.1), color = "black") + 
    stat_summary(fun = "mean", geom = "line", linewidth = 1.2, 
                 position = position_dodge(.1)) + 
    stat_summary(fun = "mean", geom = "point", size = 2, 
                 position = position_dodge(.1)) + 
    scale_color_brewer(palette = "Spectral") + 
    scale_x_discrete(labels = c("Time 1", "Time 2", "Time 3", "Time 4")) + 
    labs(x = "", y = "Gain score\n", title = "Vocabulary gains after treatments\n", 
         color = "Treatment Condition") + 
    theme_bw() + 
    theme(panel.grid.major.x = element_blank(), legend.position = "bottom", 
          plot.title = element_text(hjust = .5))

It’s generally better to label the lines directly than to rely on a legend, especially when you have more than two groups. To do that, I will prepare labels using some functions from dplyr.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

m_label <- ob3 %>% 
    filter(gains == "gain4") %>% 
    group_by(treatment) %>% 
    summarize(mean = mean(score))

What I’m doing here is getting the mean score at the last time point for each group. That mean will be the y-axis coordinate where I will put the label for that group.

I’ll now add the labels using geom_text(). You can adjust the x-coordinate depending on where you would like the labels to appear along the x-axis. Note that for this geom function, I use the newly created data for my label, m_label.

ggplot(ob3, aes(x = gains, y = score, group = treatment, color = treatment)) + 
    geom_line(aes(group = id, color = treatment), alpha = .3) + 
    stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .1, 
                 position = position_dodge(.1), color = "black") + 
    stat_summary(fun = "mean", geom = "line", linewidth = 1.2, 
                 position = position_dodge(.1)) + 
    stat_summary(fun = "mean", geom = "point", size = 2, 
                 position = position_dodge(.1)) + 
    geom_text(data = m_label, aes(x = 4.35, y = mean, label = treatment), 
              size = 4, hjust = 1, fontface = "bold") + 
    scale_color_brewer(palette = "Spectral") + 
    scale_x_discrete(labels = c("Time 1", "Time 2", "Time 3", "Time 4")) + 
    labs(x = "", y = "Gain score\n", title = "Vocabulary gains after treatments\n", 
         color = "Treatment Condition") + 
    theme_bw(base_size = 12) + 
    theme(panel.grid.major.x = element_blank(), 
          legend.position = "none",  # hide the legend
          plot.title = element_text(hjust = .5))

Instead of creating labels, you could just facet them, if that better suits your needs:

ggplot(ob3, aes(x = gains, y = score, group = treatment, color = treatment)) + 
    geom_line(aes(group = id), color = "grey70", alpha = .3) + 
    stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .1, 
                 position = position_dodge(.1), color = "black") + 
    stat_summary(fun = "mean", geom = "line", linewidth = 1.2, 
                 position = position_dodge(.1)) + 
    stat_summary(fun = "mean", geom = "point", size = 2, 
                 position = position_dodge(.1)) + 
    scale_color_brewer(palette = "Spectral") + 
    scale_x_discrete(labels = c("Time 1", "Time 2", "Time 3", "Time 4")) + 
    facet_wrap(~ treatment, ncol = 2) +
    labs(x = "", y = "Gain score\n", title = "Vocabulary gains after treatments\n", 
         color = "Treatment Condition") + 
    theme_bw() + 
    theme(panel.grid.major.x = element_blank(), legend.position = "none", 
          plot.title = element_text(hjust = .5))

Example 3: Count data - Survey

The last example I’d like to show is survey data and visualizing counts.

sur <- read.csv("example_survey.csv")
str(sur)

## 'data.frame':    33 obs. of  5 variables:
##  $ Financial : int  1 1 2 1 3 1 1 2 1 0 ...
##  $ Resources : int  1 1 1 1 3 1 1 1 1 1 ...
##  $ Staffing  : int  2 0 NA 1 0 1 0 1 1 1 ...
##  $ Training  : int  3 1 2 0 0 0 2 2 0 0 ...
##  $ Technology: int  2 1 2 3 3 1 2 3 1 2 ...

This data is from a survey that targeted Korean language coordinators at colleges/universities in the United States. These are responses on how much support the language program coordinators receive in five categories: finance, resources, staffing, training, and technology. The responses are on a 4-point Likert scale, from complete support to no support.

What are we plotting?

The number of responses for each question. Then we should have the categories along the x-axis, and the number of responses on the y-axis.

That means the data needs to be transformed so that we have one variable (column) called Category and another called Response:

sur2 <- tidyr::gather(sur, Category, Response, 1:5)

str(sur2)

## 'data.frame':    165 obs. of  2 variables:
##  $ Category: chr  "Financial" "Financial" "Financial" "Financial" ...
##  $ Response: int  1 1 2 1 3 1 1 2 1 0 ...

We can plot the number of responses per category.

ggplot(sur2, aes(x = Category, y = after_stat(count))) +
    geom_bar()

Well, this is not useful because it just shows the total number of responses (which includes NAs). We want to see the number of responses at each of the four scale points. Therefore, we visualize each level of response in a different color:

ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) + 
    geom_bar()

Why has nothing changed? Because the Response variable is currently numeric, not categorical.

sur2$Response <- factor(sur2$Response)
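To make the difference concrete, here is a sketch on a made-up data frame (the `toy` data and its values are hypothetical, not the survey data). It builds the same bar chart twice and uses `layer_data()` to count how many bars ggplot2 actually computes in each case.

```r
library(ggplot2)

# Hypothetical responses (made-up, not the survey data)
toy <- data.frame(
    Category = rep(c("A", "B"), each = 4),
    Response = c(0, 1, 1, 2, 0, 0, 2, 3)
)

# Numeric fill: a continuous variable does not define groups,
# so the counts are computed per category only
# (recent ggplot2 versions also warn that `fill` was dropped)
p_num <- ggplot(toy, aes(x = Category, fill = Response)) + geom_bar()

# Factor fill: one bar segment per category-by-response combination
p_fct <- ggplot(toy, aes(x = Category, fill = factor(Response))) + geom_bar()

n_bars_num <- nrow(layer_data(p_num))  # bars in the numeric version
n_bars_fct <- nrow(layer_data(p_fct))  # bar segments in the factor version
c(numeric = n_bars_num, factor = n_bars_fct)
```

With the factor version there are more bar segments than categories, which is exactly what we need for stacked or dodged bars.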

Try again:

ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) + 
    geom_bar()

Note: the default setting for geom_bar(position = ) is “stack”. Maybe you want something like this:

ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) + 
    geom_bar(position = "dodge")

I would definitely label the response levels for clarity.

sur2$Response <- factor(sur2$Response, 
                       labels = c("None", "Basic", "Decent", "Complete"))

With counts on the y-axis, dodged bars are probably the better choice:

ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) + 
    geom_bar(position = "dodge")

However, stacked form has merits in displaying the proportion of the responses collected. If we’re talking in terms of proportion, it would make more sense to have the number of responses in percentage. Setting the position argument to “fill” will automatically change the count numbers to proportions for you.

ggplot(sur2, aes(x = Category, y = after_stat(count), fill = Response)) + 
    geom_bar(position = "fill")

This type of graph makes the most sense to me, but we can do more. It would be easier for readers to interpret the graph if we used a percentage scale and annotated the percentages.

Then I need to have a summary of the data:

sur_sum <- sur2 %>% 
    group_by(Category, Response) %>% 
    tally() %>% 
    mutate(Percentage = n*100/sum(n))

head(sur_sum)

## # A tibble: 6 x 4
## # Groups:   Category [2]
##   Category  Response     n Percentage
##   <chr>     <fct>    <int>      <dbl>
## 1 Financial None         4      12.1 
## 2 Financial Basic       15      45.5 
## 3 Financial Decent      11      33.3 
## 4 Financial Complete     3       9.09
## 5 Resources None         2       6.06
## 6 Resources Basic       18      54.5

Note that now we have the count for each category and each response level in a new column, as opposed to having every response in its own row.
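If you want to double-check percentages like these, base R's `table()` and `prop.table()` give a quick cross-check. This is a sketch on made-up data (the `toy` data frame is hypothetical, not the survey data).

```r
# Hypothetical long-format responses (made-up, not the survey data)
toy <- data.frame(
    Category = rep(c("A", "B"), each = 4),
    Response = factor(c("None", "Basic", "Basic", "Decent",
                        "None", "None", "Decent", "Complete"),
                      levels = c("None", "Basic", "Decent", "Complete"))
)

# Counts per category and response level
counts <- table(toy$Category, toy$Response)

# Row percentages: margin = 1 divides each count by its category total
pct <- prop.table(counts, margin = 1) * 100
round(pct, 1)

# Each category should add up to 100
rowSums(pct)
```

One caveat: `table()` drops missing values by default, whereas the dplyr summary above keeps NA responses as their own group, so the two can differ for a category that has NAs.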

When using this summarized data, we need to change something in geom_bar(): set stat to “identity”. The default is “count”, which means the function counts the rows in the raw data. Now that we already have the counts, we need to tell ggplot to use the numbers as they are in the new summarized data.

ggplot(sur_sum, aes(x = Category, y = Percentage, fill = Response)) + 
    geom_bar(stat = "identity", position = "stack") + 
    geom_text(aes(label = round(Percentage, 1)), position = "stack", 
              hjust = .5, vjust = 1, size = 4.5)
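As a side note, ggplot2 also provides `geom_col()`, which is a shorthand for bars whose heights are already in the data. The sketch below, on a made-up summary table (the `toy_sum` data frame and its numbers are hypothetical), checks that both versions compute the same stacked bars.

```r
library(ggplot2)

# Hypothetical pre-summarized data (made-up numbers)
toy_sum <- data.frame(
    Category   = c("A", "A", "B", "B"),
    Response   = c("Low", "High", "Low", "High"),
    Percentage = c(40, 60, 25, 75)
)

# Version 1: geom_bar() told to use the values as they are
p_identity <- ggplot(toy_sum, aes(x = Category, y = Percentage, fill = Response)) +
    geom_bar(stat = "identity")

# Version 2: geom_col(), which uses the values as they are by default
p_col <- ggplot(toy_sum, aes(x = Category, y = Percentage, fill = Response)) +
    geom_col()

# Compare the top edge of every stacked bar segment in the two plots
same_heights <- all.equal(layer_data(p_identity)$ymax, layer_data(p_col)$ymax)
same_heights
```

Which one you use is a matter of taste; I keep `geom_bar(stat = "identity")` in the rest of this example.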

I actually like it better this way:

ggplot(sur_sum, aes(x = Category, y = Percentage, fill = Response)) + 
    geom_bar(stat = "identity", position = "stack") + 
    geom_text(aes(label = round(Percentage, 1)), position = "stack", 
              hjust = 1, size = 4.5) + 
    coord_flip()

To polish up the graph, I changed the label using the paste0 function, which concatenates the rounded percentage with the percent sign %.

ggplot(sur_sum, aes(x = Category, y = Percentage, fill = Response)) + 
    geom_bar(stat = "identity", position = "stack", width = .7, alpha = .9) + 
    geom_text(aes(label = paste0(round(Percentage, 0), "%")), position = "stack", 
              hjust = 1, size = 3) + 
    scale_fill_brewer(palette = "Pastel1", na.value = "grey70") + 
    coord_flip() + 
    theme_minimal(base_size = 12) + 
    theme(axis.title = element_blank(), panel.grid.major.y = element_blank(), 
          legend.position = "top")


Susie Kim

October 18, 2018

Happy plotting!
