Code-club session 1.0

To work through this code-club, you’ll need the packages ggplot2 and readr installed. If they aren’t installed, try installing them using Tools -> Install Packages in Rstudio.

# Load the required packages
# - if working interactively, either copy these lines into the R console or,
#   if in Rstudio, press the 'Run Current Chunk' button to the right
library(ggplot2)
library(readr)

You should fill in code wherever it says ## Your code goes here and whenever you do, compile this document using the ‘Knit’ command in Rstudio (See File -> Knit Document; or the Knit button directly above the soruce-code panel; or press Ctrl-Shift-K / Cmd-Shift-K). This should compile the document to an .html file that you can view.

There’s two main parts to this code-club - you’re going to import data from a flat file, and then make some box-plots (box-and-whisker plots) using base-R and ggplot2.

Importing data from comma-separated and tab-separated files

There’s loads of ways to get data into R. One of the simplest is to read it in from ‘flat files’, such as comma-separated or tab-separated files. Even for this simple case, there’s a couple of ways to read in the same data.

You should have been provided with a file called boxplot_genes.tsv. This is a tab-separated file containing 15 rows of synthetic data and 12 columns.

Please ensure

that boxplot_genes.tsv is in the same directory as the current code-club script (this ensures R can easily find the file when you knit this document); and
that the directory containing this code-club script is your current ‘working directory’ (so that R can find your files during ‘interactive’ work).
- To do this, type getwd() into the R console to get your w orking d irectory’, or type dir() to print the contents of the current working dir ectory. If there is any disagreement, set your working directory using Session->Set_Working_Directory in Rstudio (or find how to use setwd() in the R console).

We have to import the dataset and store the imported data into an object so that it can subsequently be used in R. In previous sessions you will have used code like the following to store the result of a function (here the data.frame function) in an R object (here the variable abc).

abc <- data.frame(a = 1:3, b = c(FALSE, TRUE, NA))

So, to import the data, and store it for subsequent use, we need to use code that looks like this:

object_to_store_the_data <- function_to_import_the_data(function, arguments)

To import boxplot_genes.tsv into a data-frame, you can use the base functions read.table, or read.csv, or you could use the function read_tsv from the package readr (make sure your spelling is correct; check the help pages for these functions by typing ? <function_name> into the R console: eg, ? read.table; or by typing the function name into the search bar in the Help panel).

A brief tutorial on read.csv / read.table is here and on readr::read_tsv is here.

Have a look at boxplot_genes.tsv, in a spreadsheet app or a text-editor.

The file has 12 columns. The first row contains a ‘header’, with each entry giving an appropriate title for the data in the column below it.

Your first challenge is to read in the dataset.

If you want to use read.table or read.csv you have to specify:

the filename (modify the ‘file’ argument)
that the file has a header (modify the ‘header’ argument)
the character that separates one column from another. To represent the ‘tab’ character, we use the string “”. To do this, in the arguments to read.[table|csv] you should modify the ‘sep’ argument to look like read.table(..., sep = "\t")

Whereas, if you want to use read_tsv from the readr package, you only need to specify the filename by modifying the file argument - it will determine that a header is present, and it will work out what type of data is stored in each column.

Do this now, and store the results of importing the data in a variable called df1

## Challenge 1

# Read in the file 'boxplot_genes.tsv' using read.table, read.csv or
# readr::read_tsv
# Make sure you specify an appropriate 'file' argument
# Make sure you specify appropriate 'header' and 'sep' arguments if you are
# using read.csv or read.table
#

## Your code goes here:

# df1 <- FILL_IN_THE_DETAILS, and remove the comment character at the start of the line

# Challenge 1 Solution:
#

df1 <- read_tsv("boxplot_genes.tsv")

## Parsed with column specification:
## cols(
##   patient = col_character(),
##   drug = col_character(),
##   G1 = col_double(),
##   G2 = col_double(),
##   G3 = col_double(),
##   G4 = col_double(),
##   G5 = col_double(),
##   G6 = col_double(),
##   G7 = col_double(),
##   G8 = col_double(),
##   G9 = col_double(),
##   G10 = col_double()
## )

# Solution alternatives:
# df1 <- read.csv("boxplot_genes.tsv", sep = "\t", header = TRUE)
# df1 <- read.table

Now that you’ve read it in, check that your imported data (‘df1’) looks the same as when you looked at it outwith R.

df1

Next time we’ll show you how to read data in from excel spreadsheets using the package readxl.

Box plots

The boxplot (little b) was developed by Tukey. For a sample of numerical data it indicates the lower-quartile, the median and the upper-quartile as horizontal lines within a box. Any points that lie beyond the quartiles are summarised: they’re either covered by whiskers or plotted individually.

This Points of Significance and this Points of View in the same Nature Methods gives a load of detail about the boxplot, a comparison to the bar plot and some variations on the theme.

In base R, you can use the function boxplot to make boxplots and you can find how to plot one using ggplot2 (have a look at the available geoms on the website).

Let’s take a step back first. The main features on a boxplot are

1. the box;
2. the lines that cross that box; and
3. the whiskers.

To make a boxplot for a single dataset by hand you’d need the dataset values and a few summary statistics:

1. The lower-quartile (Q1);
2. The median (Q2);
3. The upper-quartile (Q3);
4. The inter-quartile range (IQR = Q3 - Q1).

You would also need to decide how long the ’whisker’s should be (we’ll come back to that, but that’s why you need the IQR).

You draw a horizontal line at the median, the lower- and the upper-quartiles and encase these lines in a box. Then you draw the whiskers. Then you put a point for any point that lies beyond the whiskers.

In base R, the length of the whiskers is determined by a parameter called range that you pass to the boxplot function. The default value for range is 1.5, and the whiskers extend to the data-point that is most distant from the box but lies within a distance of 1.5 * (the interquartile range) from the box.

Typically, we also plot the position of any datapoints that are not covered by the whiskers (these may be outliers or may indicate that your dataset is skewed).

# Some example data:
# 100 even separated values and a couple of outliers as well:
# The values are 1, 2, 3, ...., 99, 100, 150, 275, 300
d <- c(1:100, 150, 275, 300)

You can see that this dataset is skewed by plotting a histogram of d using the base-R function hist:

hist(d, col = "wheat", main = "Check the skew on that!")

We can use the summary function in R to find the quartiles of the dataset d.

# Revision 1:

# Use R's `summary` function to find the median and lower/upper quartile of the
#   dataset `d`:

## Your code goes here:

# Revision Exercise:
summary(d)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   26.50   52.00   56.07   77.50  300.00

An alternative way to get the quartiles of the dataset is as follows:

# Alternatively, the quartiles can be found using
quantile(d, probs = c(0.25, 0.5, 0.75))

##  25%  50%  75% 
## 26.5 52.0 77.5

So the median is 52, and the upper- and lower-quartiles of d are 77.5 and 26.5, respectively and hence the inter-quartile range is 77.5 - 26.5 = 51. The interquartile range can also be computed in R using the function IQR:

# Try it yourself:
# - call IQR on the dataset `d`
IQR(d)

## [1] 51

To make a boxplot of the data in d that matches one generated by R, you’d draw the following:

1. A horizontal line segment at 52 (the median)
2. Horizontal line segments at 77.5 and 26.5 (the upper- and
lower-quanrtiles)
3. Two vertical lines joining the lines drawn in 2
4. A whisker extending from 77.5 to the largest value between 77.5 and
77.5 + 1.5 * 51 = 154 (that value is 150)
5. A whisker extending from 26.5 to the smallest value between 26.5 and
26.5 - 1.5 * 51 = -50 (which is 1 here).
6. Any points that lie beyond the range [-50, 154] (ie, the points at 275
and 300)

It’s a bit easier to get the computer to plot that out though using the base R function boxplot. Again, look at the help page for boxplot if you want some further info.

# Challenge 2:

# Make a boxplot of `d` using the `boxplot` function

## Your code goes here


# Play about a bit:
#
# .. See what happens if you change the 'range' argument - what happens with
#   range=0 and why (check the `Arguments` section of the help page for
#   `boxplot`)?
#
# .. Choose your own colours (col = ...)
#
# .. log-transform the presentation of the data (log = "y" in the args to boxplot)
#    or log transform the input data (`boxplot(log(d))`)
#      - is there still evidence of right-skew?
#      - how do the y-axis-labellings behave for the two ways of transforming?

# Challenge 2 Solution:
#
boxplot(d, col = "red")

`ggplot2` and `geom_boxplot`

Next we’ll use ggplot2 to make boxplots of the dataset that we imported from the tab-separated file. In previous workshops, we used ggplot2 to plot out data that was stored in data-frame structures. The dataset that we’ve imported can be viewed as the expression levels for 10 different genes (columns 3 to 12) in 15 different samples (5 each for 3 different treatments).

The value of the boxplot is revealed when comparing more than one group.

The syntax for plotting within ggplot2 looks something like:

ggplot(
  data = some_dataframe, aes(x = x_axis_colname, y = y_axis_colname)
) +
  geom_some_plottype()

By default we’d draw boxplots vertically, so that the different groups are indicated on the x-axis, and the numerical values are indicated on the y-axis.

Try to generate a boxplot of gene expression for the column G1 against drug treatment (‘drug’ column) using ggplot2 syntax for the dataframe df1. The ggplot2 geom_* function for boxplots is geom_boxplot() and you can find a few examples of its use here and here

# Challenge 3

# Use `ggplot` to plot a boxplot of G1 expression against drug-treatment
# Your code goes here

# Challenge 3 Solution:
#
ggplot(data = df1) +
  geom_boxplot(aes(x = drug, y = G1)) +
  ylab("Expression") +
  xlab("Drug Treatment")

# Alternative solution:
# ggplot(data = df1, aes(x = drug, y = G1)) +
#   geom_boxplot() +
#   labs(x = "Drug Treatment", y = "Expression")

You can do the same thing using the base-R boxplot function, but to indicate which columns you want to use, you have to write a formula of the form y ~ x. So, to generate an equivalent boxplot to the latter, your code would look like

boxplot(y_column ~ x_column, data = my_data_frame)

# Challenge 4

# Use boxplot() to plot G1-gene expression as a function of drug-treatment

# Your code goes here

# Challenge 4 Solution:
#
boxplot(G1 ~ drug, data = df1)

Overlaying plot types

Boxplots don’t really indicate how many data points were used in their creation. As such, they can give a misleading impression of whether there is a numerical / significant difference between two groups. Also, because a lot of information within a dataset is thrown away in the construction of a boxplot, the user can miss interesting phenomena (multimodality, many identical values etc) that may be present within a dataset if they only look at a boxplot.

There are a couple of remedies for this - the simplest of which is to overlay the dataset points on top of a boxplot. See here for some other approaches.

We’ll look at the G1 gene-expression data in df1 again. Take your ggplot code for making a boxplot, and replace the call to geom_boxplot() with a call to geom_point().

# Challenge 5:

# Make a scatter plot of G1 against drug treatment using geom_point

# Your code goes here

# Challenge 5 Solution:
#
ggplot(data = df1, aes(x = drug, y = G1)) +
  geom_point()

ggplot2 has treated the drug column as if it’s a numeric x-value and made a scatter plot. As a final challenge, try and overlay a scatter plot of these points on top of a boxplot

# Challenge 6:

# Make a boxplot of G1 against drug treatment using geom_boxplot
# and overlay a scatterplot of G1 against drug treatment

# Your code goes here

# Challenge 6 Solution:
# 
ggplot(data = df1, aes(x = drug, y = G1)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(aes(col = patient))

# Note that the colouring isn't essential
# 
# As an alternative, try jittering the points:
# 
# ggplot(data = df1, aes(x = drug, y = G1)) +
#   geom_boxplot(outlier.shape = NA) +
#   geom_jitter(aes(col = patient), position = position_jitter(width = 0.1))

Hopefully that worked and you can see the individual points as well as the boxplot. There are a few ways you can modify the latter plot. Using geom_jitter instead of geom_point would add a bit of noise the the points (this is useful if you’ve got lots of points at close-quarters for example).

Summary

You’ve seen a few different ways to make boxplots in R. You’ve read in data from a tab-separated file. You’ve overlayed scatterplots and boxplots.

But,

What if you wanted to present results for multiple genes on the same page?
What if you wanted to test for significant differences between the different groups?
What if your data is stored in excel files?
Are boxplots appropriate when there is matching among the samples?

Notes

# This prints out what versions of each package you used while you were running
# this code-club script.
sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /root/tools/miniconda3/envs/pog_cc/lib/R/lib/libRblas.so
## LAPACK: /root/tools/miniconda3/envs/pog_cc/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readr_1.1.1   ggplot2_2.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15     knitr_1.20       magrittr_1.5     hms_0.3         
##  [5] munsell_0.4.2    colorspace_1.2-4 R6_2.0.1         rlang_0.2.0     
##  [9] stringr_1.3.0    plyr_1.8.4       tools_3.4.1      grid_3.4.1      
## [13] gtable_0.1.2     htmltools_0.3.6  yaml_2.1.13      lazyeval_0.2.1  
## [17] rprojroot_1.3-2  digest_0.6.8     tibble_1.4.2     evaluate_0.10.1 
## [21] rmarkdown_1.6    labeling_0.3     stringi_1.1.7    compiler_3.4.1  
## [25] pillar_1.2.2     scales_0.5.0     backports_1.1.2  jsonlite_1.5

Code-Club: [01] - Boxplots and data I/O

Russell Hyde

18 June 2018

Code-club session 1.0

Importing data from comma-separated and tab-separated files

Box plots

`ggplot2` and `geom_boxplot`

Overlaying plot types

Summary

Notes

Code-Club: [01] - Boxplots and data I/O

Russell Hyde

18 June 2018

Code-club session 1.0

Importing data from comma-separated and tab-separated files

Box plots

ggplot2 and geom_boxplot

Overlaying plot types

Summary

Notes

`ggplot2` and `geom_boxplot`