Preparing your data
Set your working directory where you will store the R script and any other files you may need for the project:
setwd("~/Documents/R Worksets/JALT2020 Workshop")Create bimodal distribution data set to illustrate normality assumptions:
nn <- 100
set.seed(1234)
sim1 <- c(rtruncnorm(nn/2, a=0, b=10, mean=2, sd=.75),
rtruncnorm(nn/2, a=0, b=10, mean=8, sd=.75))Create a more normally-distributed second sample:
set.seed(1234)
sim2 <- rnorm(100, mean = 5, sd = .75)
x1 <- as.data.frame(sim1)
x2 <- as.data.frame(sim2)Store the data in a format (dataframe) that can easily be retrieved for the analysis:
x3 <- as.data.frame(cbind(sim1, sim2)) ## this joins the two data sets together and transforms them into a dataframe
write.csv(x3, "x3_data.csv") ## this command writes the dataframe to a csv file ("x3_data.csv") and saves it on your computer in the working directoryYou can check there are no problems with the data by using the head & tail commands which display the first & last few lines respectively of the data:
head(x3); tail(x3)## sim1 sim2
## 1 1.0947007 4.094701
## 2 2.2080719 5.208072
## 3 2.8133309 5.813331
## 4 0.2407267 3.240727
## 5 2.3218435 5.321844
## 6 2.3795419 5.379542
## sim1 sim2
## 95 7.628312 4.628312
## 96 8.266663 5.266663
## 97 7.149044 4.149044
## 98 8.658653 5.658653
## 99 8.729688 5.729688
## 100 9.590838 6.590838
Histograms to show data distribution.
Using the hist command which is part of the base installation of R. Work with the sim1 & sim2 samples from above.
hist(sim1, prob=F, main = "Group 1 Distribution", xlab = "Group 1", xlim = c(0,10), ylim = c(0,20), breaks=15, cex.main=1.5, cex.lab=1.5, cex.axis=1.5)
abline(v=mean(sim1),col="blue", lty = 2)hist(sim2, prob=F, main = "Group 2 Distribution", xlab = "Group 2", xlim = c(0,10), breaks=15, cex.main=1.5, cex.lab=1.5, cex.axis=1.5)
abline(v=mean(sim1),col="blue", lty = 2)QQ Plots.
Using the qqnorm command which is part of the base R installation. Work with the sim1 & sim2 samples from above.
qqnorm(sim1, cex.main=1.5, cex.lab=1.5, cex.axis=1.5, main = "Group 1") # this produces the graph
qqline(sim1) # this adds a line showing a theoretical normal distribution for comparisonqqnorm(sim2, cex.main=1.5, cex.lab=1.5, cex.axis=1.5, main = "Group 2") # this produces the graph
qqline(sim2) # this adds a line showing a theoretical normal distribution for comparisonBoxplots and variants for comparisions
Boxplots are good for showing the difference between two (or more) samples, e.g. when you would do a t-test or an ANOVA (or a non-parametric equivalent).
- To create a basic boxplot using base R installation commands:
boxplot(sim2, col = "white", ann = F, horizontal = T, ylim = c(3,7))#using dataset sim2, no annotations on the axes, y axis range is from 3 to 7Boxplots for simulated data set 1
These are created using the ggplot2 package. This can be installed and activated as follows:
install.packages("ggplot2")
library(ggplot2)The code to generate the boxplot is more complex here. The first three lines of code create the plot; the following code handles the appearance of the plot. In this case we’re creating two plots, fig5a & fig5b.
fig5a <- ggplot(x1) +
aes(x = "", y = sim1) +
geom_boxplot() +
# These commands control the appearance of aspects of the theme
theme_minimal() +
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
ylim(0, 10) +
labs(x = "Group 1", y = "Comprehension Score")
fig5b <- ggplot(x2) +
aes(x = "", y = sim2) +
geom_boxplot() +
theme_minimal() +
theme(axis.text.x = element_text(size=14),
axis.text.y = element_blank(),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
ylim(0, 10) +
labs(x = "Group 2", y = "")Load the cowplot package. This lets you plot different graphs on the same grid…
install.packages("cowplot")
library(cowplot)…using the plot_grid command from cowplot
plot_grid(fig5a, fig5b, labels = c("Boxplot example", ""))Boxplots for simulated data set 1 augmented with jittered data
Here we add the data points to the boxplot. This involves just one extra line of code:
fig6a <- ggplot(x1) +
aes(x = "", y = sim1) +
geom_boxplot() +
# This additional line of code adds the data points
geom_jitter(width = .2, size = 3, colour = "orange") +
theme_minimal() +
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
ylim(0, 10) +
labs(x = "Group 1", y = "Comprehension Score")
fig6b <- ggplot(x2) +
aes(x = "", y = sim2) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = .2, size = 3, colour = "blue") +
theme_minimal() +
theme(axis.text.x = element_text(size=14),
axis.text.y = element_blank(),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14,)) +
ylim(0, 10) +
labs(x = "Group 2", y = "")
# plot the two graphs on the same grid
plot_grid(fig6a, fig6b, labels = c("Boxplots with data points", ""))Dotplot for simulated data set 1
An alternative way to display the data is as a dotplot
fig7a <- ggplot(x1) +
aes(x = "", y = sim1, fill = "sim1") +
# main changes are as follows
stat_summary(fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.6, size = 0.4, color = "black", alpha = 0.6) + # generate the bar showing the median score
geom_dotplot(binaxis ="y", binwidth = 0.2, stackdir = "center", stackratio = 1.5) + # plot the data as individual points
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(size=14),
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
ylim(0, 10) +
labs(x = "Group 1", y = "Comprehension Score")
fig7b <- ggplot(x2) +
aes(x = "", y = sim2, fill = "sim1") +
stat_summary(fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.6, size = 0.4, color = "black", alpha = 0.6) + # generate the bar showing the median score
geom_dotplot(binaxis ="y", binwidth = 0.2, stackdir = "center", stackratio = 1.5) +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(size=14),
axis.text.y = element_blank(),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
ylim(0, 10) +
labs(x = "Group 2", y = "")
# plot the two graphs on the same grid
plot_grid(fig7a, fig7b,labels = c("Dotplot example", ""))Violin plots for simulated data set 1
Violin plots show how the data is distributed and as such are helpful for understanding the structure of your dataset.
fig8a <- ggplot(x1) +
aes(x = "", y = sim1) +
# This is the only change from figure 5, calling for a violin plot rather than a boxplot.
# The adjust and scale arguments set the size of the plots.
geom_violin(adjust = 1L, scale = "count") +
theme_minimal() +
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
ylim(0, 10) +
labs(x = "Group 1", y = "Comprehension Score")
fig8b <- ggplot(x2) +
aes(x = "", y = sim2) +
geom_violin(adjust = 1L, scale = "area") +
theme_minimal() +
theme(axis.text.x = element_text(size=14),
axis.text.y = element_blank(),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
ylim(0, 10) +
labs(x = "Group 2", y = "")
# plot the two graphs on the same grid
plot_grid(fig8a, fig8b,labels = c("Violin plot example", ""))Notched boxplots
set.seed(1234)
response = rnorm(n = 80, mean = c(74, 70), sd = c(3, 4.5))
group = rep(letters[1:2], length.out = 80)
sim4 <- data.frame(group,
response)
ggplot(sim4) +
aes(x = group, y = response) +
geom_boxplot(notch = TRUE, notchwidth = 0.75) +
theme_minimal() +
theme(axis.text.x = element_text(size=14), # These commands control aspects of the theme, in this case the axis text
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
plot.title = element_text(size = 14, lineheight = 2, face="bold"),
strip.text.x = element_text(size = 14)) +
ggtitle("Notched boxplot example") +
ylim(55, 85) +
labs(x = "Groups", y = "")Scatterplots - Looking at Relationships
Scatterplots are used to show how data is correlated.
Anscombe’s scatterplots.
(Code taken from the dataset package)
A demonstration of how statistics alone can be deceiving.
Looking at the dataset doesn’t give a lot of information:
anscombe## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
Looking at the descriptive statistics of the data suggests each dataset is very similar:
summary(anscombe) ## x1 x2 x3 x4 y1
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840
## y2 y3 y4
## Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median :8.140 Median : 7.11 Median : 7.040
## Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :9.260 Max. :12.74 Max. :12.500
As do the results of a regression analysis comparing the datasets. As can be seen, the results of each regression below are almost identical:
summary(lm(y1 ~ x1, dat = anscombe))##
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
summary(lm(y2 ~ x2, dat = anscombe))##
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
summary(lm(y3 ~ x3, dat = anscombe))##
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
summary(lm(y4 ~ x4, dat = anscombe))##
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
However, graphing the data shows how the statistics can be misleading:
for(i in 1:4) {
ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
xlim = c(3, 19), ylim = c(3, 13), main = paste("Anscombe's scatterplot", i))
abline(mods[[i]], col = "blue")
}The point being, always examine your data graphically to understand what it tells you.
Scatterplot with loess and regression lines for simulated dataset
First, generate the dataset to use
sim5 <- data.frame(Subject = (1:30),
Group = rep(c("A","B")),
x = c(1,1,1,2,2,2,3,3,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7,8,8,8,9,9),
y = c(6,3,5,1,9,4,6,2,11,4,12,7,13,6,10,6,18,4,17,7,16,7,10,6,14,5,15,9,16,12))
group_names <- c(
A = "Group A",
B = "Group B"
)Next, generate the plot.
This can be done fairly simply in base R with the plot function:
plot(sim5$x, sim5$y,
main = "Basic scatterplot") Add a regression line (blue) and loess line (red):
plot(sim5$x, sim5$y,
main = "Basic scatterplot")
abline(lm(y ~ x, data = sim5), col = "blue")
lines(lowess(sim5$x, sim5$y), col = "red")ggplot gives more control. Let’s separate the data into groups and plot showing the regression line (dotted blue line) with confidence interval band, and a loess line (the curved line):
ggplot(sim5) +
aes(x = x, y = y) +
geom_point(size = 3, aes(shape = Group, colour = Group), alpha = .8) +
geom_smooth(span = 0.75, se = F, aes(colour = Group, linetype = "dashed")) +
geom_smooth(method = "lm", aes(fill = Group, linetype = "dotted"), alpha = 0.1) +
theme_minimal() +
scale_colour_viridis_d(begin = .3, end = .7) +
scale_fill_viridis_d(option = "C", begin = .2, end = .8) +
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
strip.text.x = element_text(size = 14)) +
scale_linetype_discrete(guide = FALSE) +
ggtitle("Scatterplot with grouped data")Estimation Plots
Unpaired estimation plot (2 sample)
An alternative to carrying out statistical significance tests. First, transfer the data set to the correct format:
sim5.long <- reshape2::melt(sim5, id = c("Subject", "Group"), measured = c("x", "y"))Then process the data and show a summary:
fig12 <- dabest(sim5.long, variable, value,
idx = c("x", "y"),
paired = FALSE) %>%
cohens_d()
fig12## dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
## =============================================================
##
## Good afternoon!
## The current time is 14:27 PM on Tuesday August 17, 2021.
##
## Dataset : sim5.long
## X Variable : variable
## Y Variable : value
##
## Unpaired Cohen's d of y (n = 30) minus x (n = 30)
## 0.966 [95CI 0.484; 1.46]
##
##
## 5000 bootstrap resamples.
## All confidence intervals are bias-corrected and accelerated.
Finally, generate the plot:
plot(fig12, color.column = Group, rawplot.ylabel = "Group Scores", effsize.ylabel = "Effect Size (Cohens d) and 95% CI (5,000 bootstrap resamples)")Paired estimation plot (2 sample)
Apply the analysis and show the output:
fig13 <- dabest(sim5.long, variable, value,
idx = c("x", "y"),
paired = TRUE, id.col = Subject)
fig13.effect <- cohens_d(fig13)
fig13.effect## dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
## =============================================================
##
## Good afternoon!
## The current time is 14:27 PM on Tuesday August 17, 2021.
##
## Dataset : sim5.long
## X Variable : variable
## Y Variable : value
##
## Paired Cohen's d of y (n = 30) minus x (n = 30)
## 0.872 [95CI 0.301; 1.25]
##
##
## 5000 bootstrap resamples.
## All confidence intervals are bias-corrected and accelerated.
Generate the graph:
plot(fig13.effect, color.column = Group, rawplot.ylabel = "Group Scores", effsize.ylabel = "Effect Size (Cohens d) and 95% CI (5,000 bootstrap resamples)")Extras
The following was not included in the original paper due to word count limitations. It is presented here as a supplement.
ggstatsplot
This package plots graphs and supplements them with details of corresponding statistical tests, providing a streamlined way to test data and visualize the results.
The example below is generated from the simulated dataset used earlier for the notched boxplots, showing the results of a robust (Yuen) between-group t-test with an explanatory measure of effect size, and means and outliers annotated on the graph. Note that the CIs listed here are different to those in the paper. The CIs reported here are for the effect size in the graph, while those in the paper are the CIs of the difference in means. Other options for the type of t-test used are possible.
ggstatsplot::ggbetweenstats(
data = sim4,
x = "group",
y = "response",
title = "Between-subjects analysis with ggstatsplot",
outlier.tagging = TRUE,
type = "robust"
)Here is another example, a variation of the scatterplot made using the sim5 dataset:
ggstatsplot::ggscatterstats(
data = sim5,
x = x,
y = y,
xlab = "X Variable",
ylab = "Y Variable",
title = "An example scatterplot with ggstatsplot",
type = "nonparametric"
)This produces the scatterplot with additional marginal plots showing the distribution of the X & Y variables. The results of a t-test, and a non-parametric correlation coefficient are included.
The ggstatsplot package offers a lot of options. By incorporating functions from numerous other packages, it makes generating and displaying results somewhat easier than if using each package separately. This is something that should be helpful for understanding your results when carrying out an analysis.
Other options
- The data we will work with here is available on my GitHub repository https://github.com/pcjapan/graphical-data-analysis. The first data set is called comprehension-data.txt. Either load the data directly from GitHub or save a copy of the data in your working directory, and then read into R:
working <- read.delim("https://raw.githubusercontent.com/pcjapan/graphical-data-analysis/2c7d658e59f78610d2cc857b62eb79c3f4461b2f/comprehension-data.txt", header = TRUE, sep = "\t")- To work with the
ggplot2library, and many other functions in R, your data has to be inlongformat, as opposed towide Wideformat is where each individual subject is recorded as a unique row of your excel/csv file, with a column for every measured variable related to that subject,Longformat has every row containing a measure for a particular variable for each subject. Inwidedata the subjects will not be repeated in rows. Inlongdata, the subject may be listed in multiple rows as the data is grouped by the subject response to each variable under study.- It’s fairly easy to change the format of your data in R. One way is by using the
meltfunction in thereshape2library (this is part of the R base installation):
require(reshape2)then run the necessary code. Remember, we’re working with data that has been loaded into the “working” dataframe, and we will save the transformed data into a new dataframe named “longComp”.
longComp <- melt(working, id = c("ID", "Class"), measured = c("Pretest", "Posttest"))
names(longComp) <- c("Student", "Class", "Test", "Comprehension")To see the difference in the data, look at the first few rows.
Here, we have a pretest and posttest score for each student from separate classes on a comprehension test. Scores are entered as one row for each student.
head(working) ## Original data## ID Class Pretest Posttest
## 1 C101 C1 7.0 5.0
## 2 C102 C1 4.7 3.7
## 3 C103 C1 7.7 8.3
## 4 C104 C1 6.3 6.0
## 5 C105 C1 5.7 4.0
## 6 C106 C1 7.0 4.0
tail(working)## ID Class Pretest Posttest
## 47 C309 C3 7 5
## 48 C310 C3 4 4
## 49 C311 C3 6 6
## 50 C312 C3 8 8
## 51 C313 C3 8 8
## 52 C314 C3 6 4
Now the data has been rearranged to list in each row the comprehension score for each student by the kind of test.
head(longComp) ## New data## Student Class Test Comprehension
## 1 C101 C1 Pretest 7.0
## 2 C102 C1 Pretest 4.7
## 3 C103 C1 Pretest 7.7
## 4 C104 C1 Pretest 6.3
## 5 C105 C1 Pretest 5.7
## 6 C106 C1 Pretest 7.0
tail(longComp)## Student Class Test Comprehension
## 99 C309 C3 Posttest 5
## 100 C310 C3 Posttest 4
## 101 C311 C3 Posttest 6
## 102 C312 C3 Posttest 8
## 103 C313 C3 Posttest 8
## 104 C314 C3 Posttest 4
For any research project, first check your data to see how it is distributed
- Check data for normality - generate a qqplot
qqnorm(working$Pretest, pch = 1, frame = FALSE)
qqline(working$Pretest, col = "steelblue", lwd = 2)- Or look at a histogram of the distribution.
Here, think about what you are expecting to see in the data. A skewed distribution may be what your theory predicts, so may not be a problem; however, it raises issues for the kinds of statistical tests you can carry out.
ggplot(working, aes(x=Pretest)) +
geom_histogram(aes(y=..density..), colour="gray", fill="white", binwidth = .3)+
geom_density(color = "red", alpha=.2, fill="#FF6666") +
ggtitle("Pretest data distribution")Line graphs can show changes in trends in a dataset, but are limited in the information they show. Augmenting with confidence interval bars makes them slightly more informative. Here, we use the facet_grid function in ggplot2 to plot the three graphs on individual panels and display them side-by-side, rather than having them all plotted on the same panel which would reduce readability.
line <- ggplot(longComp, aes(Test, Comprehension, color = Class))
line + stat_summary(fun = mean, geom = "point") +
stat_summary(fun = mean, geom = "line", aes(group = Class)) +
theme_minimal() +
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) + ## Add confidence interval (CI) error bars
ggtitle("Example line graph") +
facet_grid(.~Class)Theming graphs
Here is where the power of ggplot comes into play. You can create a more attractive version of the above or modify most aspects of the display by setting various paramater
- If you haven’t already done so, install (you only need to do this the first time you use it) and load the current version of the ggplot library,
ggplot2:
install.packages(ggplot2)- Then load the package into R. You need to do this with all packages you wish to use:
library(ggplot2) #load and activate the package- Generate the graph, here we’re making a boxplot:
boxplot <- ggplot(longComp, aes(Class, Comprehension)) #this creates the base graph
CompBox <- boxplot +
geom_boxplot(outlier.shape = NA, aes(fill = Class)) + # adding layers to the graph - this sets up the boxplot, using the "Class" condition to identify the data
labs(x = "Class", y = "Mean Comprehension Score") + # add labels to the x & y axes
theme(legend.position = "none", axis.text.x = element_text(size=14), # These commands control the appearance of aspects of the theme
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
plot.title = element_text(size = 14, lineheight = 2, face="bold"),
strip.text.x = element_text(size = 14),
panel.background = element_rect(fill = "snow1")) +
viridis::scale_fill_viridis(discrete = TRUE, alpha=0.6) + # Use the viridis colour palettes for colour choices - improves readability
ggtitle("Changes in comprehension scores") + # add a title
scale_y_continuous(breaks=seq(0,10,1)) + # set the y-axis scale
facet_wrap(~Test) # this creates a panelled view where the graphs are displayed in separate panels based on the "Test" condition
CompBox +
geom_jitter(color="black", size=2, alpha=0.6, width = 0.1) #here you are adding the data points to the graphViolin plots To get a violin plot, just change a couple of lines in the code above:
violinplot <- boxplot +
geom_violin(scale = "count", aes(fill = Class)) + # adding the plot layer to the graph - "count" means areas are scaled proportionally to the number of observations
geom_boxplot(width=0.1, color="grey60", alpha=0.8, ) + # optionally adding boxplots to provide another layer of detail
labs(x = "Class", y = "Mean Comprehension Score") +
theme(legend.position = "none", axis.text.x = element_text(size=14),
axis.text.y = element_text(size = 14),
axis.title = element_text(size = 16, lineheight = 2),
plot.title = element_text(size = 14, lineheight = 2, face="bold"),
strip.text.x = element_text(size = 14),
panel.background = element_rect(fill = "snow1")) +
viridis::scale_fill_viridis(discrete = TRUE, alpha=0.6) +
ggtitle("Changes in comprehension scores") +
scale_y_continuous(breaks=seq(0,10,1)) +
facet_wrap(~Test)
violinplotVisualising Correlations
R has strong graphic support for visualising correlations. Let’s look at this with a new set of data, corr_Dat. This data was from a study into developing a scale to measure certain attitudes of learners towards factors that influenced approach to language study. The file is once again available in GitHub: correlation.txt
- Base R
While this doesn’t show you the exact correlations, it provides a scatterplot that shows how each variable is related
corr_Dat <- read.csv("https://raw.githubusercontent.com/pcjapan/graphical-data-analysis/da29f849d1bc8891e4c1e9413a23d1777e9a6196/correlation.txt", head = TRUE)
pairs(corr_Dat) * Using the
psych library
Using the psych library pairs.panels function is more informative:
psych::pairs.panels(corr_Dat, main="Pretest - Class C", method = "spearman") * Or use the
corrplot library to visualise correlation matrices, as in the following example.
cPre <- cor(corr_Dat, method = "spearman") # create correlation matrix
corrplot::corrplot.mixed(cPre)Ridgeline Plots
A ridgeline plot shows the distribution of a numeric variable for a number of groups. Helpful for likert scale results, for example. Use the ggridges library along with ggplot to create this particular kind of graph.
In this example, we will use one more dataset, which contains the mean scores for subscale respones from the same scale used for the correlation example above. Again, get it from GitHub: the file is scale_means.txt
library(ggridges) # Remember to load ggplot2 if you haven't already, too
scaleMeans <- read.delim("https://raw.githubusercontent.com/pcjapan/graphical-data-analysis/JALT2020/scale-means.txt", header = TRUE, sep = "\t")R cannot tell which order to treat factors, so puts them into alphabetical order when it runs commands. This means that graphs, etc, may display things in a different order to which you want. In this case, the factor Test has two levels, Pre and Post. If we create the graph as is, the Post test results will be displayed before the Pre results, which could be confusing. To fix this, you can easily relevel factors to put them into the correct order for the analysis:
scaleMeans$Test <- factor(scaleMeans$Test, levels = c("Pre", "Post")) Then we can run the code to output the graph:
ridgeG <- ggplot(scaleMeans,
aes(x = MResponse, y = Scale, fill = Scale)) # This generates the basic graph
ridgeG +
geom_density_ridges2(alpha = 0.8) +
facet_wrap(~Class + Test, ncol = 2) +
theme_ridges() +
scale_x_continuous(expand = c(0.01, 0)) +
scale_y_discrete(expand = c(0.01, 0)) +
viridis::scale_fill_viridis(discrete = TRUE, alpha=0.6) +
ggtitle("Ridgeline plot example") +
theme(legend.position = "none") +
xlab("Mean Response") # Theming, adding some colour, and separating the graphs onto individual panelsEnd
That covers everything for now. Thank you for your interest.