Graphical Data Analysis

Paul Collett

2020-11-27

Preparing your data

Set your working directory to the folder where you will store the R script and any other files you need for the project:

setwd("~/Documents/R Worksets/JALT2020 Workshop")

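Create a data set with a bimodal distribution to illustrate normality assumptions. The rtruncnorm function used here comes from the truncnorm package, which has to be loaded first:

library(truncnorm) # provides rtruncnorm; run install.packages("truncnorm") once if it is not installed
nn <- 100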
set.seed(1234)
sim1 <- c(rtruncnorm(nn/2, a=0, b=10, mean=2, sd=.75),
          rtruncnorm(nn/2, a=0, b=10, mean=8, sd=.75))

Create a more normally-distributed second sample:

set.seed(1234)
sim2 <- rnorm(100, mean = 5, sd = .75)

x1 <- as.data.frame(sim1)
x2 <- as.data.frame(sim2)

Store the data in a format (dataframe) that can easily be retrieved for the analysis:

x3 <- as.data.frame(cbind(sim1, sim2)) ## this joins the two data sets together and transforms them into a dataframe
write.csv(x3, "x3_data.csv") ## this command writes the dataframe to a csv file ("x3_data.csv") and saves it on your computer in the working directory
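To retrieve the data in a later session, the file can be read back with read.csv. The following is a minimal sketch that uses a small made-up data frame and a temporary file rather than the workshop data; demo, tmp, and demo_back are illustrative names only. By default write.csv also writes the row names as an unnamed first column, which read.csv would return as an extra column named X, so the sketch sets row.names = FALSE.

```r
# Made-up values standing in for the x3 data frame
demo <- data.frame(sim1 = c(1.09, 2.21, 2.81), sim2 = c(4.09, 5.21, 5.81))

# Write to a temporary file so nothing in the working directory is touched
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)  # omit the row-name column

# Read the file back in and confirm the round trip preserved the data
demo_back <- read.csv(tmp)
all.equal(demo, demo_back)
```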

You can check that there are no problems with the data by using the head and tail commands, which display the first and last few rows of the data respectively:

head(x3); tail(x3)

##        sim1     sim2
## 1 1.0947007 4.094701
## 2 2.2080719 5.208072
## 3 2.8133309 5.813331
## 4 0.2407267 3.240727
## 5 2.3218435 5.321844
## 6 2.3795419 5.379542

##         sim1     sim2
## 95  7.628312 4.628312
## 96  8.266663 5.266663
## 97  7.149044 4.149044
## 98  8.658653 5.658653
## 99  8.729688 5.729688
## 100 9.590838 6.590838

Histograms to show data distribution.

These use the hist command, which is part of the base installation of R, with the sim1 & sim2 samples from above.

hist(sim1, prob=F, main = "Group 1 Distribution", xlab = "Group 1", xlim = c(0,10), ylim = c(0,20), breaks=15, cex.main=1.5, cex.lab=1.5, cex.axis=1.5)
abline(v=mean(sim1),col="blue", lty = 2)

hist(sim2, prob=F, main = "Group 2 Distribution", xlab = "Group 2", xlim = c(0,10), breaks=15, cex.main=1.5, cex.lab=1.5, cex.axis=1.5)
abline(v=mean(sim2),col="blue", lty = 2)
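Note that a single number passed to breaks is only a suggestion: R rounds the bin boundaries to "pretty" values, so the number of bars may not be exactly 15. A quick way to see what was actually used is to compute the histogram without drawing it. This is a sketch on a fresh sample generated with the same recipe as sim2; demo and h are illustrative names.

```r
set.seed(1234)
demo <- rnorm(100, mean = 5, sd = 0.75)  # same recipe as sim2

h <- hist(demo, breaks = 15, plot = FALSE)  # compute the bins without drawing
h$breaks          # the bin boundaries R actually chose
length(h$counts)  # number of bins; not necessarily 15
sum(h$counts)     # every observation falls into exactly one bin
```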

QQ Plots.

These use the qqnorm command, which is part of the base R installation, with the sim1 & sim2 samples from above.

qqnorm(sim1, cex.main=1.5, cex.lab=1.5, cex.axis=1.5, main = "Group 1") # this produces the graph
qqline(sim1) # this adds a line showing a theoretical normal distribution for comparison

qqnorm(sim2, cex.main=1.5, cex.lab=1.5, cex.axis=1.5,  main = "Group 2")    # this produces the graph
qqline(sim2)  # this adds a line showing a theoretical normal distribution for comparison
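To make clear what a QQ plot displays, the sketch below rebuilds its coordinates by hand: the observed values are plotted against the quantiles expected from a standard normal distribution. It uses a fresh sample generated like sim2; demo, qq, and theoretical are illustrative names.

```r
set.seed(1234)
demo <- rnorm(100, mean = 5, sd = 0.75)

qq <- qqnorm(demo, plot.it = FALSE)          # returns the plotted coordinates
theoretical <- qnorm(ppoints(length(demo)))  # expected standard normal quantiles

all.equal(sort(qq$x), theoretical)  # x axis: theoretical quantiles
all.equal(sort(qq$y), sort(demo))   # y axis: the observed values
```

By default, qqline draws its line through the first and third quartiles of the sample and of the theoretical distribution, so points that stray from the line at the ends indicate tails that differ from a normal distribution.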

Boxplots and variants for comparisons

Boxplots are good for showing the difference between two (or more) samples, e.g. when you would do a t-test or an ANOVA (or a non-parametric equivalent).

To create a basic boxplot using base R installation commands:

boxplot(sim2, col = "white", ann = F, horizontal = T, ylim = c(3,7))

#using dataset sim2, no annotations on the axes, y axis range is from 3 to 7
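The numbers behind a base R boxplot can be inspected with boxplot.stats, which returns the five values that define the box and whiskers along with any points that would be drawn as outliers. A short sketch, again on a sample generated like sim2 (demo and bs are illustrative names):

```r
set.seed(1234)
demo <- rnorm(100, mean = 5, sd = 0.75)

bs <- boxplot.stats(demo)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points drawn individually as outliers (may be empty)
```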

Boxplots for simulated data set 1

These are created using the ggplot2 package. This can be installed and activated as follows:

install.packages("ggplot2")
library(ggplot2)

The code to generate the boxplot is more complex here. The first three lines of code create the plot; the following code handles the appearance of the plot. In this case we’re creating two plots, fig5a & fig5b.

fig5a <- ggplot(x1) +
  aes(x = "", y = sim1) +
  geom_boxplot() +
  # These commands control the appearance of aspects of the theme
  theme_minimal() +
  theme(axis.text.x = element_text(size=14), 
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 1", y = "Comprehension Score")

fig5b <- ggplot(x2) +
  aes(x = "", y = sim2) +
  geom_boxplot() +
  theme_minimal()  +
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_blank(),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 2", y = "")

Load the cowplot package. This lets you plot different graphs on the same grid…

install.packages("cowplot")
library(cowplot)

…using the plot_grid command from cowplot

plot_grid(fig5a, fig5b, labels = c("Boxplot example", ""))

Boxplots for simulated data set 1 augmented with jittered data

Here we add the data points to the boxplot. This involves just one extra line of code:

fig6a <- ggplot(x1) +
  aes(x = "", y = sim1) +
  geom_boxplot() +
  # This additional line of code adds the data points
  geom_jitter(width = .2, size = 3, colour = "orange") + 
  theme_minimal() +
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 1", y = "Comprehension Score")

fig6b <- ggplot(x2) +
  aes(x = "", y = sim2) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = .2, size = 3, colour = "blue") +
  theme_minimal()  +
  theme(axis.text.x = element_text(size=14), 
        axis.text.y = element_blank(),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 2", y = "")

# plot the two graphs on the same grid

plot_grid(fig6a, fig6b, labels = c("Boxplots with data points", ""))

Dotplot for simulated data set 1

An alternative way to display the data is as a dotplot:

fig7a <- ggplot(x1) + 
  aes(x = "", y = sim1, fill = "sim1") +
  # main changes are as follows
  stat_summary(fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.6, size = 0.4, color = "black", alpha = 0.6) + # generate the bar showing the median score
  geom_dotplot(binaxis ="y", binwidth = 0.2, stackdir = "center", stackratio = 1.5) + # plot the data as individual points 
  theme_minimal() +
  theme(legend.position = "none", 
        axis.text.x = element_text(size=14),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 1", y = "Comprehension Score")

fig7b <- ggplot(x2) + 
  aes(x = "", y = sim2, fill = "sim2") +
  stat_summary(fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.6, size = 0.4, color = "black", alpha = 0.6) + # generate the bar showing the median score
  geom_dotplot(binaxis ="y", binwidth = 0.2, stackdir = "center", stackratio = 1.5) +
  theme_minimal() + 
  theme(legend.position = "none", 
        axis.text.x = element_text(size=14), 
        axis.text.y = element_blank(), 
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 2", y = "")

# plot the two graphs on the same grid

plot_grid(fig7a, fig7b,labels = c("Dotplot example", ""))

Violin plots for simulated data set 1

Violin plots show how the data is distributed and as such are helpful for understanding the structure of your dataset.

fig8a <- ggplot(x1) +
  aes(x = "", y = sim1) +
  # This is the only change from figure 5, calling for a violin plot rather than a boxplot.
  # adjust controls how much the density estimate is smoothed;
  # scale controls how the violin widths are scaled ("area" or "count").
  geom_violin(adjust = 1L, scale = "count") + 
  theme_minimal() +
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 1", y = "Comprehension Score")

fig8b <- ggplot(x2) +
  aes(x = "", y = sim2) +
  geom_violin(adjust = 1L, scale = "area") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_blank(),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) +
  ylim(0, 10) +
  labs(x = "Group 2", y = "")

# plot the two graphs on the same grid

plot_grid(fig8a, fig8b,labels = c("Violin plot example", ""))

Notched boxplots

set.seed(1234)
response = rnorm(n = 80, mean = c(74, 70), sd = c(3, 4.5))
group = rep(letters[1:2], length.out = 80)
sim4 <- data.frame(group,
                   response)

ggplot(sim4) +
  aes(x = group, y = response) +
  geom_boxplot(notch = TRUE, notchwidth = 0.75) +
  theme_minimal()  +
  theme(axis.text.x = element_text(size=14), # These commands control aspects of the theme, in this case the axis text
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        plot.title = element_text(size = 14, lineheight = 2, face="bold"), 
        strip.text.x = element_text(size = 14)) +
  ggtitle("Notched boxplot example") +
  ylim(55, 85) +
  labs(x = "Groups", y = "")
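According to the ggplot2 documentation for geom_boxplot, the notches extend 1.58 * IQR / sqrt(n) either side of the median, which gives a roughly 95% interval for comparing medians. If the notches of two boxes do not overlap, this suggests that the medians differ. The sketch below computes those limits by hand for the same simulated data; notch_limits is a hypothetical helper written for this illustration, not a ggplot2 function.

```r
set.seed(1234)
response <- rnorm(n = 80, mean = c(74, 70), sd = c(3, 4.5))
group <- rep(letters[1:2], length.out = 80)

# Hypothetical helper: median plus and minus 1.58 * IQR / sqrt(n)
notch_limits <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width,
    median = median(x),
    upper = median(x) + half_width)
}

# One column per group, rows are lower limit, median, upper limit
notches <- sapply(split(response, group), notch_limits)
notches
```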

Scatterplots - Looking at Relationships

Scatterplots are used to show how data is correlated.

Anscombe’s scatterplots.

(Code taken from the datasets package)

A demonstration of how statistics alone can be deceiving.

Looking at the dataset doesn’t give a lot of information:

anscombe

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Looking at the descriptive statistics of the data suggests each dataset is very similar:

summary(anscombe)

##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500

So do the results of a regression analysis on each pair of variables. As can be seen, the results of each regression below are almost identical:

summary(lm(y1 ~ x1, data = anscombe))

## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

summary(lm(y2 ~ x2, data = anscombe))

## 
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

summary(lm(y3 ~ x3, data = anscombe))

## 
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

summary(lm(y4 ~ x4, data = anscombe))

## 
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
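The same point can be made more compactly with the correlation coefficient for each x/y pair. This sketch uses only base R and the built-in anscombe data; r is an illustrative name.

```r
# Pearson correlation for each of the four x/y pairs
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r) <- paste0("pair", 1:4)
round(r, 2)  # each is about 0.82
```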

However, graphing the data shows how the statistics can be misleading:

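ff <- y ~ x                # template formula; both sides are replaced inside the loop
mods <- vector("list", 4)  # holds the four fitted regression models
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)  # build y1 ~ x1, y2 ~ x2, ...
  mods[[i]] <- lm(ff, data = anscombe)               # fit the regression for this pair
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13), main = paste("Anscombe's scatterplot", i))
  abline(mods[[i]], col = "blue")                    # add the fitted regression line
}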

The point being, always examine your data graphically to understand what it tells you.

Scatterplot with loess and regression lines for simulated dataset

First, generate the dataset to use:

sim5 <- data.frame(Subject = 1:30,
                   Group = rep(c("A", "B"), times = 15), # alternate A and B across the 30 subjects
                   x = c(1,1,1,2,2,2,3,3,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7,8,8,8,9,9),
                   y = c(6,3,5,1,9,4,6,2,11,4,12,7,13,6,10,6,18,4,17,7,16,7,10,6,14,5,15,9,16,12))

group_names <- c(
  A = "Group A",
  B = "Group B"
)

Next, generate the plot.

This can be done fairly simply in base R with the plot function:

plot(sim5$x, sim5$y,
     main = "Basic scatterplot")

Add a regression line (blue) and a lowess smoothing line (red):

plot(sim5$x, sim5$y,
     main = "Basic scatterplot")
abline(lm(y ~ x, data = sim5), col = "blue")
lines(lowess(sim5$x, sim5$y), col = "red")

ggplot2 gives more control. Let’s separate the data into groups and plot a linear regression line for each group (the broken blue line) with its confidence interval band, along with a loess line (the curved line):

ggplot(sim5) +
  aes(x = x, y = y) +
  geom_point(size = 3, aes(shape = Group, colour = Group), alpha = .8) +
  geom_smooth(span = 0.75, se = F, aes(colour = Group, linetype = "dashed")) +
  geom_smooth(method = "lm", aes(fill = Group, linetype = "dotted"), alpha = 0.1) +
  theme_minimal() +
  scale_colour_viridis_d(begin = .3, end = .7) +
  scale_fill_viridis_d(option = "C", begin = .2, end = .8) +
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        strip.text.x = element_text(size = 14)) + 
  scale_linetype_discrete(guide = "none") + # hide the linetype legend
  ggtitle("Scatterplot with grouped data")

Estimation Plots

Unpaired estimation plot (2 sample)

Estimation plots are an alternative to carrying out statistical significance tests. First, convert the data set to long format:

sim5.long <- reshape2::melt(sim5, id.vars = c("Subject", "Group"), measure.vars = c("x", "y"))

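Then load the dabestr package, process the data, and show a summary:

library(dabestr) # this code uses the interface of dabestr v0.3.0, the version shown in the output below
fig12 <- dabest(sim5.long, variable, value,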
                idx = c("x", "y"),
                paired = FALSE) %>% 
  cohens_d()

fig12

## dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
## =============================================================
## 
## Good afternoon!
## The current time is 14:27 PM on Tuesday August 17, 2021.
## 
## Dataset    :  sim5.long
## X Variable :  variable
## Y Variable :  value
## 
## Unpaired Cohen's d of y (n = 30) minus x (n = 30)
##  0.966 [95CI  0.484; 1.46]
## 
## 
## 5000 bootstrap resamples.
## All confidence intervals are bias-corrected and accelerated.
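As a check on the point estimate, the unpaired Cohen's d can be computed by hand as the difference in means divided by the pooled standard deviation. This is a sketch that assumes the usual pooled-SD definition; it does not reproduce the bootstrap confidence interval. The x and y values are copied from sim5, and pooled_sd and d are illustrative names.

```r
x <- c(1,1,1,2,2,2,3,3,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7,8,8,8,9,9)
y <- c(6,3,5,1,9,4,6,2,11,4,12,7,13,6,10,6,18,4,17,7,16,7,10,6,14,5,15,9,16,12)

# Pooled standard deviation of the two samples
pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                    (length(x) + length(y) - 2))

# Standardised mean difference (y minus x, as in the summary above)
d <- (mean(y) - mean(x)) / pooled_sd
round(d, 3)  # 0.966, the same point estimate as in the summary above
```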

Finally, generate the plot:

plot(fig12, color.column = Group, rawplot.ylabel = "Group Scores", effsize.ylabel = "Effect Size (Cohen's d) and 95% CI (5,000 bootstrap resamples)")

Paired estimation plot (2 sample)

Apply the analysis and show the output:

fig13 <- dabest(sim5.long, variable, value,
                           idx = c("x", "y"),
                           paired = TRUE, id.col = Subject)
fig13.effect <- cohens_d(fig13)  

fig13.effect

## dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
## =============================================================
## 
## Good afternoon!
## The current time is 14:27 PM on Tuesday August 17, 2021.
## 
## Dataset    :  sim5.long
## X Variable :  variable
## Y Variable :  value
## 
## Paired Cohen's d of y (n = 30) minus x (n = 30)
##  0.872 [95CI  0.301; 1.25]
## 
## 
## 5000 bootstrap resamples.
## All confidence intervals are bias-corrected and accelerated.

Generate the graph:

plot(fig13.effect, color.column = Group, rawplot.ylabel = "Group Scores", effsize.ylabel = "Effect Size (Cohen's d) and 95% CI (5,000 bootstrap resamples)")

Extras

The following was not included in the original paper due to word count limitations. It is presented here as a supplement.

ggstatsplot

This package plots graphs and supplements them with details of corresponding statistical tests, providing a streamlined way to test data and visualize the results.

The example below is generated from the simulated dataset used earlier for the notched boxplots, showing the results of a robust (Yuen) between-group t-test with an explanatory measure of effect size, and means and outliers annotated on the graph. Note that the CIs listed here are different to those in the paper. The CIs reported here are for the effect size in the graph, while those in the paper are the CIs of the difference in means. Other options for the type of t-test used are possible.

ggstatsplot::ggbetweenstats(
  data = sim4,
  x = "group",
  y = "response",
  title = "Between-subjects analysis with ggstatsplot",
  outlier.tagging = TRUE,
  type = "robust"
  )

Here is another example, a variation of the scatterplot made using the sim5 dataset:

ggstatsplot::ggscatterstats(
  data = sim5,
  x = x,
  y = y,
  xlab = "X Variable",
  ylab = "Y Variable",
  title = "An example scatterplot with ggstatsplot",
  type = "nonparametric"
)

This produces the scatterplot with additional marginal plots showing the distribution of the X & Y variables. The results of a significance test for the correlation, together with a non-parametric (Spearman) correlation coefficient, are included.

The ggstatsplot package offers a lot of options. By incorporating functions from numerous other packages, it makes generating and displaying results somewhat easier than if using each package separately. This is something that should be helpful for understanding your results when carrying out an analysis.

Other options

The data we will work with here is available on my GitHub repository https://github.com/pcjapan/graphical-data-analysis. The first data set is called comprehension-data.txt. Either load the data directly from GitHub or save a copy of the data in your working directory, and then read into R:

working <- read.delim("https://raw.githubusercontent.com/pcjapan/graphical-data-analysis/2c7d658e59f78610d2cc857b62eb79c3f4461b2f/comprehension-data.txt", header = TRUE, sep = "\t")

To work with the ggplot2 library, and many other functions in R, your data has to be in long format, as opposed to wide.

Wide format is where each individual subject is recorded as a unique row of your Excel/CSV file, with a column for every measured variable related to that subject. Long format has every row containing a single measure of a particular variable for a subject. In wide data the subjects are not repeated across rows. In long data, a subject may be listed in multiple rows, as the data is grouped by the subject’s response to each variable under study.

It’s fairly easy to change the format of your data in R. One way is by using the melt function in the reshape2 library. This is not part of the base R installation, so install it first if necessary with install.packages("reshape2"), then load it:

library(reshape2)

Then run the conversion. Remember, we’re working with data that has been loaded into the “working” dataframe, and we will save the transformed data into a new dataframe named “longComp”.

longComp <- melt(working, id.vars = c("ID", "Class"), measure.vars = c("Pretest", "Posttest"))
names(longComp) <- c("Student", "Class", "Test", "Comprehension")

To see the difference in the data, look at the first few rows.

Here, we have a pretest and posttest score for each student from separate classes on a comprehension test. Scores are entered as one row for each student.

head(working) ## Original data

##     ID Class Pretest Posttest
## 1 C101    C1     7.0      5.0
## 2 C102    C1     4.7      3.7
## 3 C103    C1     7.7      8.3
## 4 C104    C1     6.3      6.0
## 5 C105    C1     5.7      4.0
## 6 C106    C1     7.0      4.0

tail(working)

##      ID Class Pretest Posttest
## 47 C309    C3       7        5
## 48 C310    C3       4        4
## 49 C311    C3       6        6
## 50 C312    C3       8        8
## 51 C313    C3       8        8
## 52 C314    C3       6        4

Now the data has been rearranged to list in each row the comprehension score for each student by the kind of test.

head(longComp) ## New data

##   Student Class    Test Comprehension
## 1    C101    C1 Pretest           7.0
## 2    C102    C1 Pretest           4.7
## 3    C103    C1 Pretest           7.7
## 4    C104    C1 Pretest           6.3
## 5    C105    C1 Pretest           5.7
## 6    C106    C1 Pretest           7.0

tail(longComp)

##     Student Class     Test Comprehension
## 99     C309    C3 Posttest             5
## 100    C310    C3 Posttest             4
## 101    C311    C3 Posttest             6
## 102    C312    C3 Posttest             8
## 103    C313    C3 Posttest             8
## 104    C314    C3 Posttest             4
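The wide-to-long conversion can also be written out by hand in base R, which makes clear what the reshaping step does: the ID column is repeated once per test, and the score columns are stacked into a single column. This is a sketch on a three-row made-up data frame; wide and long are illustrative names, and the values are not the workshop data.

```r
# Wide format: one row per student, one column per test
wide <- data.frame(ID = c("S1", "S2", "S3"),
                   Pretest = c(7.0, 4.7, 7.7),
                   Posttest = c(5.0, 3.7, 8.3))

# Long format: one row per student per test
long <- data.frame(
  ID = rep(wide$ID, times = 2),                            # each student appears twice
  Test = rep(c("Pretest", "Posttest"), each = nrow(wide)), # which test the score is from
  Comprehension = c(wide$Pretest, wide$Posttest)           # scores stacked in one column
)
long
```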

For any research project, first check your data to see how it is distributed.

Check the data for normality by generating a QQ plot:

qqnorm(working$Pretest, pch = 1, frame = FALSE)
qqline(working$Pretest, col = "steelblue", lwd = 2)

Or look at a histogram of the distribution.

Here, think about what you are expecting to see in the data. A skewed distribution may be what your theory predicts, so may not be a problem; however, it raises issues for the kinds of statistical tests you can carry out.

ggplot(working, aes(x=Pretest)) + 
  geom_histogram(aes(y = after_stat(density)), colour="gray", fill="white", binwidth = .3) + # after_stat(density) replaces the older ..density.. notation (ggplot2 3.3.0 and later)
  geom_density(color = "red", alpha=.2, fill="#FF6666") +
  ggtitle("Pretest data distribution")

Line graphs can show changes in trends in a dataset, but are limited in the information they show. Augmenting with confidence interval bars makes them slightly more informative. Here, we use the facet_grid function in ggplot2 to plot the three graphs on individual panels and display them side-by-side, rather than having them all plotted on the same panel which would reduce readability.

line <- ggplot(longComp, aes(Test, Comprehension, color = Class))
line + stat_summary(fun = mean, geom = "point") + 
  stat_summary(fun = mean, geom = "line", aes(group = Class)) +
  theme_minimal() + 
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) + ## Add confidence interval (CI) error bars; mean_cl_boot requires the Hmisc package to be installed
  ggtitle("Example line graph") +
  facet_grid(.~Class)

Theming graphs

Here is where the power of ggplot comes into play. You can create a more attractive version of the above or modify most aspects of the display by setting various parameters.

If you haven’t already done so, install the current version of the ggplot library, ggplot2 (you only need to do this the first time you use it):

install.packages("ggplot2")

Then load the package into R. You need to do this with all packages you wish to use:

library(ggplot2) #load and activate the package

Generate the graph, here we’re making a boxplot:

boxplot <- ggplot(longComp, aes(Class, Comprehension)) #this creates the base graph

CompBox <- boxplot +
  geom_boxplot(outlier.shape = NA, aes(fill = Class)) +  # adding layers to the graph - this sets up the boxplot, using the "Class" condition to identify the data 
  labs(x = "Class", y = "Mean Comprehension Score") + # add labels to the x & y axes
  theme(legend.position = "none", axis.text.x = element_text(size=14), # These commands control the appearance of aspects of the theme
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        plot.title = element_text(size = 14, lineheight = 2, face="bold"), 
        strip.text.x = element_text(size = 14), 
        panel.background = element_rect(fill = "snow1")) + 
  viridis::scale_fill_viridis(discrete = TRUE, alpha=0.6) + # Use the viridis colour palettes for colour choices - improves readability
  ggtitle("Changes in comprehension scores") + # add a title
  scale_y_continuous(breaks=seq(0,10,1)) + # set the y-axis scale
  facet_wrap(~Test) # this creates a panelled view where the graphs are displayed in separate panels based on the "Test" condition

CompBox  +
  geom_jitter(color="black", size=2, alpha=0.6, width = 0.1) #here you are adding the data points to the graph

Violin plots

To get a violin plot, just change a couple of lines in the code above:

violinplot <- boxplot +
  geom_violin(scale = "count", aes(fill = Class)) + # adding the plot layer to the graph - "count" means areas are scaled proportionally to the number of observations
  geom_boxplot(width=0.1, color="grey60", alpha=0.8) + # optionally adding boxplots to provide another layer of detail
  labs(x = "Class", y = "Mean Comprehension Score") + 
  theme(legend.position = "none", axis.text.x = element_text(size=14), 
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 16, lineheight = 2),
        plot.title = element_text(size = 14, lineheight = 2, face="bold"), 
        strip.text.x = element_text(size = 14), 
        panel.background = element_rect(fill = "snow1")) + 
  viridis::scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  ggtitle("Changes in comprehension scores") + 
  scale_y_continuous(breaks=seq(0,10,1)) + 
  facet_wrap(~Test) 

violinplot

Visualising Correlations

R has strong graphic support for visualising correlations. Let’s look at this with a new set of data, corr_Dat. This data was from a study into developing a scale to measure certain attitudes of learners towards factors that influenced approach to language study. The file is once again available in GitHub: correlation.txt

Base R

While this doesn’t show you the exact correlations, it provides a scatterplot matrix that shows how each pair of variables is related:

corr_Dat <- read.csv("https://raw.githubusercontent.com/pcjapan/graphical-data-analysis/da29f849d1bc8891e4c1e9413a23d1777e9a6196/correlation.txt", header = TRUE)
pairs(corr_Dat)

* Using the psych library

The pairs.panels function from the psych library is more informative:

psych::pairs.panels(corr_Dat, main="Pretest - Class C", method = "spearman")

* Or use the corrplot library to visualise correlation matrices, as in the following example.

cPre <- cor(corr_Dat, method = "spearman") # create correlation matrix
corrplot::corrplot.mixed(cPre)
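The Spearman method used above is simply a Pearson correlation computed on the ranks of the data, which is why it suits ordinal questionnaire responses. The sketch below checks this on made-up scores for two hypothetical questionnaire items; item1, item2, rho, and rho_from_ranks are illustrative names.

```r
# Made-up scores for two hypothetical questionnaire items
item1 <- c(3, 5, 2, 4, 4, 1, 5, 3)
item2 <- c(2, 5, 3, 4, 5, 1, 4, 2)

rho <- cor(item1, item2, method = "spearman")
rho_from_ranks <- cor(rank(item1), rank(item2))  # Pearson correlation of the ranks

all.equal(rho, rho_from_ranks)
```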

Ridgeline Plots

A ridgeline plot shows the distribution of a numeric variable for a number of groups. It is helpful for Likert scale results, for example. Use the ggridges library along with ggplot to create this particular kind of graph.

In this example, we will use one more dataset, which contains the mean scores for subscale responses from the same scale used for the correlation example above. Again, get it from GitHub: the file is scale-means.txt

library(ggridges) # Remember to load ggplot2 if you haven't already, too
scaleMeans <- read.delim("https://raw.githubusercontent.com/pcjapan/graphical-data-analysis/JALT2020/scale-means.txt", header = TRUE, sep = "\t")

R has no way of knowing the intended order of a factor’s levels, so by default it sorts them alphabetically. This means that graphs and other output may display things in a different order from the one you want. In this case, the factor Test has two levels, Pre and Post. If we create the graph as is, the Post test results will be displayed before the Pre results, which could be confusing. To fix this, you can easily relevel factors to put them into the correct order for the analysis:

scaleMeans$Test <- factor(scaleMeans$Test, levels = c("Pre", "Post"))
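The effect of releveling is easy to see on a small vector. This sketch uses a made-up character vector rather than the scaleMeans data; test, default_order, and fixed_order are illustrative names.

```r
test <- c("Pre", "Post", "Pre", "Post")

# Default: levels are sorted alphabetically, so Post comes first
default_order <- levels(factor(test))

# Explicit levels put Pre before Post
fixed_order <- levels(factor(test, levels = c("Pre", "Post")))

default_order
fixed_order
```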

Then we can run the code to output the graph:

ridgeG <- ggplot(scaleMeans, 
                 aes(x = MResponse, y = Scale, fill = Scale)) # This generates the basic graph

ridgeG + 
  geom_density_ridges2(alpha = 0.8) + 
  facet_wrap(~Class + Test, ncol = 2) + 
  theme_ridges() + 
  scale_x_continuous(expand = c(0.01, 0)) + 
  scale_y_discrete(expand = c(0.01, 0)) + 
  viridis::scale_fill_viridis(discrete = TRUE, alpha=0.6) + 
  ggtitle("Ridgeline plot example") +
  theme(legend.position = "none") +
  xlab("Mean Response") # Theming, adding some colour, and separating the graphs onto individual panels

End

That covers everything for now. Thank you for your interest.