In the last code-club, you had a go at generating some boxplots.
That session ended with a few questions:
What if you wanted to present results for multiple genes on the same page?
What if you wanted to test for significant differences between the different groups?
What if your data is stored in Excel files?
Are boxplots appropriate when there is matching among the samples?
This session help answer some of these questions, using the same dataset as last time.
From memory, most of you got everything working for the last session. If you didn’t, here is a reminder that you’ll need the packages ggplot2
and readr
installed.
In addition, you will need readxl
which allows reading of data from an Excel file. As is usual with R, there is more than one way to do things, and there are other packages that will let you read from Excel files. However, in my experience, this is the fastest so let’s stick with that. We will also need to install the reshape2
library.
If any of these packages aren’t installed, try installing them using Tools -> Install Packages
in Rstudio
.
# Load the required packages
library(ggplot2)
library(readr)
library(readxl)
library(reshape2)
You should fill in code wherever it says ## Your code goes here
and whenever you do, compile this document using the Knit
command in Rstudio. This should compile the document to an ‘.html’ file that you can view.
In this session, you will use the same data is in the previous session. However, this time the data are provided as an Excel file. Once the data are loaded into R, you will perform a t-test to ask whether there is a statistically significant difference between two sets of data. Finally, we will add annotations to the plots to show which differences are significant and think about how best to show the data when you are carried out a paired-sample t-test.
In the previous session, the data was provided as a ‘flat file’. If you receive data in Excel format, you could export or save them in a flat file format (e.g. tab or comma separated) and use the commands from the previous session. However, let’s take the opportunity to learn something new and have a look at how you import data in R from Excel.
Just a word of warning about Excel. Excel can do unexpected things with gene names by reinterpreting them as other kinds of data, e.g., dates (see this paper). This can be very difficult to spot if the file you are using is thousands of lines long. This would be one argument to stick with flat files - there is no threat of misinterpretation.
You should have been provided with a file called boxplot_genes.xlsx
, which contains two tabs - one called info
with some information regarding the experiment that was carried out, and one called data
, which contains the same data as before. To import this data into a data-frame, you should use the function read_excel
from the package readxl
.
Modify the code below, note that you will have to specify which tab contains the data.
# CHALLENGE 1: Read in the data from the excel file, by adding the appropriate
# arguments to the command below.
# df1 <- read_excel(...)
# head(df1)
# SOLUTION 1:
# Read in the 'data' table from the excel file
df1 <- read_excel("boxplot_genes.xlsx", sheet = "data")
# Print out the first few entries in the imported data
head(df1)
What you will get out of that is something similar to a data.frame, called a tibble
. Type ? tibble
in the R console to see more about a tibble
object and what it is; in the meantime, you can just handle it exactly like you would a data.frame
.
Last time, you generated a boxplot for one gene. The ggplot2
command would have looked something like this:
ggplot(
data = df1,
aes(x = drug, y = G1, fill = drug)
) +
geom_boxplot()
One of the questions that we started this session with is: what if you wanted to present results for multiple genes on the same page? Well, there are two ways that you could go about this.
One option is to take advantage of what are called for
loops in R, which is one of the ‘control structures’ in R. Control structures allow you to control the execution of bits of code. Spoiler: ggplot2
has a better way to plot data for multiple genes in one plot, so a for
loop wouldn’t be my go-to to solve this particular problem (there also a few other things to learn before you could implement this yourselves). However, depending on what you need to do, a for
loop might be necessary. I have included a couple of extra challenges related to this at the end (Challenges 11-12).
The other option is to use the facet_*
commands in ggplot2
, which make repeated plots very easy. However, first, we need to do a bit of data reformatting.
A facetted plot allows you to repeat a plot for different values of a particular variable. In this case, we want to repeat the plot for different genes. At the moment the genes are represented as different columns, not different values of the same variable. Luckily, there is a library (reshape2
) that will do it for us easily.
The following code sets up a toy dataset using random data to use as an example.
d_example <- expand.grid(LETTERS[1:6], letters[10:12])
d_example$measurement1 <- rnorm(nrow(d_example))
d_example$measurement2 <- rnorm(nrow(d_example)) + 5
d_example$measurement3 <- rnorm(nrow(d_example)) + 10
head(d_example)
You will see that this looks somewhat similar to our gene data, where Var1
and Var2
are equivalent to patient and drug, and measurements 1-3 are equivalent to the different genes.
To reshape the data, we need to use the melt
function. This function requires you to identify which data.frame
you want to reformat (or melt) and which columns are to remain untouched in the reformatting. I’ll admit that this can be a bit of a mind melt (ha!) when you first come across it but have a look at the example below:
d_example_melt <- melt(d_example, id.vars = c("Var1", "Var2"))
head(d_example_melt)
unique(d_example_melt$variable)
## [1] measurement1 measurement2 measurement3
## Levels: measurement1 measurement2 measurement3
You will see that the Var1
and Var2
columns are still the same, but there are new variable
and value
columns. You will see (from using the unique
command) that measurement1
, measurement2
and measurement3
strings, having been columns are now different values in the one column, and the values from those measurement columns are now included in the one value
column.
So, for our gene expression data, we need to reformat the data in exactly the same way. Have a go at modifying the code below.
# CHALLENGE 2: reformat the gene expression data (df1) as required
# Your code goes here.
# SOLUTION 2:
# Convert the data from 'wide' to 'long' format
df1_melt <- melt(df1, id.vars = c("patient", "drug"))
# Print out the first few entries in the restructured data
head(df1_melt)
unique(df1_melt$variable)
## [1] G1 G2 G3 G4 G5 G6 G7 G8 G9 G10
## Levels: G1 G2 G3 G4 G5 G6 G7 G8 G9 G10
With the data in this format, we are ready to draw all our boxplots. While it might not feel like it to you (yet!) ggplot2
makes it very easy to generate these multi-panel (facetted) plots. All you need to do is add one additional command. So where previously your command was of the format:
ggplot(
data = some_dataframe, aes(x = x_axis_colname, y = y_axis_colname)
) +
geom_some_plottype()
To generate a facetted plot, your command will look like:
ggplot(
data = some_dataframe, aes(x = x_axis_colname, y = y_axis_colname)
) +
geom_some_plottype() **+**
**facet_wrap( ~ facet_variable )**
...
where facet_variable
is the name of the column that separates the data into the desired subplots. For example, look at the difference in the two commands below, which plot the data for our example toy dataset:
ggplot(
data = d_example_melt,
aes(x = Var2, y = value)
) +
geom_boxplot()
ggplot(
data = d_example_melt,
aes(x = Var2, y = value)
) +
geom_boxplot() +
facet_wrap(~ variable)
Don’t forget that it is a good idea to add the data points on top of the boxplots so that you can easily see how many points are giving rise to the boxplot (but use outlier.shape = NA
to turn off outliers on the boxplot so that outlier points aren’t repeated).
ggplot(
data = d_example_melt,
aes(x = Var2, y = value)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1)
ggplot(
data = d_example_melt,
aes(x = Var2, y = value)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1) +
facet_wrap(~ variable)
If you wanted to facet on a different variable (e.g., Var2
you just change which variable that you refer to in the facet_wrap
command:
ggplot(
data = d_example_melt,
aes(x = Var2, y = value)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1) + facet_wrap(~Var1)
I hope you are convinced that it is straightforward to generate these facetted plots, and feeling ready to generate some of your own!! Try to generate the same thing - a facetted plot - for the gene expression data. Don’t forget you need to use the melted format, so if you didn’t manage the last challenge, revisit it now or ask for some help.
# CHALLENGE 3: Use `ggplot`, `geom_boxplot' and `facet_wrap` to generate panel
# of boxplots for all genes in the dataset.
# Your code goes here.
# SOLUTION 3:
# We facet on the 'variable' column so that a different chart is made for each
# gene in the dataset
ggplot(
data = df1_melt, aes(x = drug, y = value)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1) +
facet_wrap(~ variable)
So this command splits data into different plots, ‘wrapping’ them onto a new line when necessary to make an approximately square shape. It is also possible to facet into a grid shape, using two variables at once using the facet_grid
command. This makes a matrix of plots for each pairing of the two variables.
Again, here is an example from the toy dataset:
ggplot(
data = d_example_melt,
aes(x = Var2, y = value)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1) +
facet_grid(variable ~ Var1)
In this case, we only have one value per Var2
/value
pair, but you get the idea. Can you do the same thing with our gene expression data?
# CHALLENGE 4: Use `geom_boxplot' and `facet_grid` to generate a matrix of
# boxplots showing gene expression for all the genes in the dataset across all
# patients.
#
# HINT: your resulting graph matrix should contain 10 columns and 5 rows.
# Your code goes here.
# SOLUTION 4:
# We facet with respect to 'patient' (each row corresponds to a different
# patient) and gene (ie, the 'variable' column; each column corresponds to a
# different gene)
ggplot(
data = df1_melt,
aes(x = drug, y = value)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1) +
facet_grid(patient ~ variable)
OK, so you have your graph. Now, how about some statistics. Most of you will be used to doing statistics in a point-and-click kind of way either using Excel or GraphPad or something similar.
Let’s say that we want to know whether there is a statistical difference in G5 expression pre- / post-treatment with either drug. Let’s plot it first.
ggplot(df1, aes(x = drug, y = G5)) +
geom_boxplot()
Looks promising, but what do the stats say?
First we need to extract the data that we need to perform the test. Let’s focus on D1
first. The subset
function allows you to identify rows and columns of data according to certain criteria. Extract the patient
column at this point too - you will need this later on. See ? subset
for more information.
# CHALLENGE 5: Extract the relevant data to compare G5 expression in CTRL and
# D1.
#
# Your code goes here.
# SOLUTION 5:
G5_d <- subset(df1[, c("drug", "patient", "G5")], drug != "D2")
# Have a look at the data
G5_d
Once you have this data to hand, the t.test
command will carry out the test for you. The sleep
data provided in base R is used as an example dataset in the t.test
help page:
ggplot(sleep, aes(x = group, y = extra)) +
geom_boxplot()
t_out <- t.test(extra ~ group, data = sleep)
str(t_out)
## List of 9
## $ statistic : Named num -1.86
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 17.8
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.0794
## $ conf.int : atomic [1:2] -3.365 0.205
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 0.75 2.33
## ..- attr(*, "names")= chr [1:2] "mean in group 1" "mean in group 2"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "extra by group"
## - attr(*, "class")= chr "htest"
# Extract the p-value from the t-test results
cat(sprintf("The p-value is: %.2e\n", t_out$p.value))
## The p-value is: 7.94e-02
There are two variables for the t.test
command: the first is a ‘formula’ which defines which comparison you want to do and the second is the data.frame
that holds the data to be examined. The latter should be obvious (you just provide the data.frame
itself), but the formula is perhaps a bit more tricky. It is of the format A ~ B
, where A
is the dependent variable (the numbers) and B
is the independent variable (the groups). In the example above, the formula will generate a comparison of the numbers in the column extra
, between the groups that are defined by group
.
When you run the t-test, you will see that R gives you a summary of the result, with the statistic (t
), degrees of freedom (df
), 95% confidence intervals and p-value. You can capture this information for use later (e.g., in your plot) by storing the output of the test in a variable; above, we have saved it as t_out
. str(t_out)
shows exactly what is stored in the t_out
variable. To retrieve the specific information, you can use the $
symbol followed by the desired variable name; above I have printed the p-value.
# CHALLENGE 6: Working with the G5 data you extracted in the last challenge,
# try to write the command that compares G5 expression before and after
# treatment with D1.
#
# Your code goes here.
# SOLUTION 6:
# Run an unpaired t-test
t_unpaired <- t.test(G5 ~ drug, G5_d)
# Look at the information stored in the results of that t-test
str(t_unpaired)
## List of 9
## $ statistic : Named num -1.55
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 5.11
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.18
## $ conf.int : atomic [1:2] -0.4071 0.0994
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 1.01 1.17
## ..- attr(*, "names")= chr [1:2] "mean in group CTRL" "mean in group D1"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "G5 by drug"
## - attr(*, "class")= chr "htest"
So, by my calculations, the p-value here is 0.1801. If you are using an alpha of 0.05, this is not significant. However, is there something wrong with this comparison?
You may recall that we have pre- /post-treatment data from the same 5 patients, so the appropriate test to do is a paired sample t-test. This comparison This will accomodate any consistent, patient-specific differences in looking at the differences between the two groups.
# CHALLENGE 7: analyse the same data using a paired sample t.test
#
# Your code goes here.
# SOLUTION 7:
# Run a paired t-test
t_paired <- t.test(G5 ~ drug, G5_d, paired = T)
# Look at the results of that t-test:
str(t_paired)
## List of 9
## $ statistic : Named num -2.35
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 4
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.0782
## $ conf.int : atomic [1:2] -0.3354 0.0276
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num -0.154
## ..- attr(*, "names")= chr "mean of the differences"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr "Paired t-test"
## $ data.name : chr "G5 by drug"
## - attr(*, "class")= chr "htest"
So, let’s return to our toy dataset, and add this new information on as an annotation; the command we need to use here is annotate
. This command allows multiple kinds of annotations to be added to existing ggplot
s, including text, segments (lines), and rectangles. Here, we want to add text to show the p-value, and we need to give the annotate
function some information as to where we want it to go, by providing x
and y
coordinates. We want to put this annotation above the group 2 boxplot, so we set x
to 2 and y
to the maximum value in the dataset:
D1_annotation <- sprintf("p=%.2e", t_out$p.value)
D1_annotation_x <- 2
D1_annotation_y <- max(sleep$extra)
ggplot(sleep, aes(x = group, y = extra)) +
geom_boxplot() +
annotate(
"text", x = D1_annotation_x, y = D1_annotation_y, label = D1_annotation
)
You can set the value of D1_annotation
to whatever you like; for example, you could change this annotation to show the appropriate number of asterisks for varying thresholds of significance.
# CHALLENGE 8: annotate the boxplot for the G5 expression data to show p-value
# for a paired t-test.
#
# Your code goes here.
# SOLUTION 8:
# Construct the annotation text and position:
G5_annotation <- sprintf("p = %.2e", t_paired$p.value)
G5_annotation_x <- 2
G5_annotation_y <- max(df1$G5)
# Overlay the annotation text at the chosen position using `annotate` from
# `ggplot2`
ggplot(data = df1, aes(x = drug, y = G5)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1) +
annotate(
"text", x = G5_annotation_x, y = G5_annotation_y, label = G5_annotation
)
So, all good so far. But can it be better? :)
The problem with a boxplot is that it doesn’t communicate that there are paired samples. How could we do that?
One way would be to colour each point by patient. Give that a go below.
# CHALLENGE 9: draw the G5 boxplot with patients coloured differently.
#
# Your code goes here.
## SOLUTION 9:
# By providing col=patient as an aesthetic to geom_jitter, the points for a
# given patient are coloured consistently
ggplot(
data = df1,
aes(x = drug, y = G5)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(col = patient), width = 0.1)
While that does help a bit, it would make it clearer to join all the points from each patient together with a line. The human eye is good at judging the gradient or slope of a line, so a consistent up- or down-regulation in response to the drug treatment should be clearly visible.
We could overlay these lines on the boxplot, so that you have both the raw data and the summary as a boxplot.
# CHALLENGE 10: try to draw a slope graph.
#
# HINT: you will need to use the `geom_line` command and the `group` variable
# (within aes) to achieve this. You might also want to replace `geom_jitter`
# with `geom_point` so that the points are joined by the lines.
#
# Your code goes here.
## SOLUTION 10:
# By providing the patient-indicator to the `group` aesthetic in `geom_line`,
# it knows how to link points from a given patient together
ggplot(data = df1, aes(x = drug, y = G5)) +
geom_point(aes(col = patient), size = 2) +
geom_boxplot(outlier.shape = NA, alpha = 0.25) +
geom_line(aes(group = patient, col = patient)) +
theme_bw() +
annotate(
"text", x = G5_annotation_x, y = G5_annotation_y, label = G5_annotation
)
Have you any suggestions about how this graph could be improved:
Which information is distracting from the story?
What important information is obscured or isn’t clear enough?
There has been a lot of new information in this session. You can now read in data from both flat files and Excel files. You can generate a panel of plots. You have seen how to calculate a t.test
, both paired and unpaired, and you can annotate a graph with the results. You have seen an alternative way of presenting data analysed by a paired t-test. Along the way you have learned how to reformat data (using melt
), print information to the screen (sprintf
), subset a data.frame
so as to extract specific rows according to some critera (subset
) and the command to find unique values (unique
, surprisingly enough).
In the next session, we will reinforce all of this, while learning to draw barplots with error bars, another common visualisation task in biology.
There is an additional section below for those of you who have got this far and want to push on.
Here is an example. Say you have a list of people attending a workshop (like this one). And you want to write an R script to write ‘Hello X!’ (where X is the attendees name) to welcome all the participants. You could write a separate command for each of the people on your list, as follows:
cat("Hello Gillian!\n")
## Hello Gillian!
cat("Hello Leena!\n")
## Hello Leena!
cat("Hello Pablo!\n")
## Hello Pablo!
cat("Hello Alison!\n")
## Hello Alison!
# etc etc
However, what if you want to change your script to say ‘Good morning’ instead? You would need to remember to change every one of those commands. What if you want to use the same script for a different batch of students? Again, you would manually have to update every command.
Notice that what you are doing is essentially repeating the same instruction, but just changing one thing - the name of the person. This is precisely the kind of thing that computers are good at, and ‘command structures’ are the tools you can use to do this. This [ link ] ( https://www.statmethods.net/management/controlstructures.html ) provides some details about the many different control structures that are available in R ; for the time being, let’s just concern ourselves with the for
loop.
attendees <- c(
"Gillian", "Leena", "Pablo", "Alison", "Natasha", "Matt", "Chinmay", "Amy",
"Lucie", "Michael", "Caroline", "Martha", "Laura"
)
for (name in attendees) {
cat(sprintf("Hello %s!\n", name))
}
## Hello Gillian!
## Hello Leena!
## Hello Pablo!
## Hello Alison!
## Hello Natasha!
## Hello Matt!
## Hello Chinmay!
## Hello Amy!
## Hello Lucie!
## Hello Michael!
## Hello Caroline!
## Hello Martha!
## Hello Laura!
A for
loop is a way to repeat the same instruction, varying the value of one thing at a time.
In the code above, you are telling R that for every object in the attendees (and we are calling this object name
), execute everything between the {}
brackets (or the ‘curly’ brackets). So, in this case, the contents of the {}
block will be repeated 13 times as there are 13 names in the vector of attendees. You can say that the code ‘loops through’ the attendees
vector.
Now if you want to change the greeting to ‘Good morning’, all you have to do is change the one instance of the cat
command inside the curly brackets. If someone joins the group at a later date, you can add another name to the end of the list - the for
loop doesn’t need to change as it is being told to just repeat the {}
block for as many names are in the attendees
vector.
# CHALLENGE 11: Modifying the code below, make the command print:
# "Good morning Gillian (Attendee 1 of 13)!"
# "Good morning Leena (Attendee 2 of 13)!"
# "Good morning Pablo (Attendee 3 of 13)!"
#
# HINT: you will have to introduce a new variable that keeps a tally of which
# attendee number you are currently at.
#
# Your code goes here.
# SOLUTION 11:
# The vector of attendees:
attendees <- c(
"Gillian", "Leena", "Pablo", "Alison", "Natasha",
"Matt", "Chinmay", "Amy", "Lucie", "Michael",
"Caroline", "Martha", "Laura"
)
# This is used to keep-track of the position of each person in the attendees
# vector
graph_count <- 0
for (name in attendees) {
# update the attendee number:
graph_count <- graph_count + 1
# print out the welcome message
cat(sprintf(
"Good morning %s (Attendee %d of %d)!\n",
name, graph_count, length(attendees)
))
}
## Good morning Gillian (Attendee 1 of 13)!
## Good morning Leena (Attendee 2 of 13)!
## Good morning Pablo (Attendee 3 of 13)!
## Good morning Alison (Attendee 4 of 13)!
## Good morning Natasha (Attendee 5 of 13)!
## Good morning Matt (Attendee 6 of 13)!
## Good morning Chinmay (Attendee 7 of 13)!
## Good morning Amy (Attendee 8 of 13)!
## Good morning Lucie (Attendee 9 of 13)!
## Good morning Michael (Attendee 10 of 13)!
## Good morning Caroline (Attendee 11 of 13)!
## Good morning Martha (Attendee 12 of 13)!
## Good morning Laura (Attendee 13 of 13)!
So, back to what we were trying to do originally: plot a graph for each gene. Just as you looped through the attendees list, you can loop through the list of genes and draw a boxplot for each, saving each one to a different file. To do this, the reference to G1
in your previous ggplot
command is going to have to change in your loop, to acquire the value of each gene in the dataset.
# CHALLENGE 12: Add the necessary code below to generate a different plot for
# each gene in the gene list.
# HINT: you will have to use the alternative `aes_string` command instead of
# the normal `aes` command.
# gene_list = colnames( df1 )[3:12]
# for (this_gene in gene_list) {
# Your code goes here.
# }
# SOLUTION 12:
# TODO: Fix this solution
gene_list <- colnames(df1)[3:12]
for (this_gene in gene_list) {
ggplot(
data = df1, aes_string(x = "drug", y = this_gene)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(col = patient), width = 0.1)
}