Discuss:
What’s your current data analysis workflow?
Chat with your neighbors!


Before We Start

Lesson Can be found at http://tracykteal.github.io/R-genomics/00-before-we-start.html

This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.

Learning Objectives

  • Introduce participants to the RStudio interface
  • Set up participants to have a working directory with a data/ folder inside
  • Introduce R syntax
  • Point to relevant information on how to get help, and understand how to ask well formulated questions

RStudio Interface

Go over the following:

  • Console
  • Script File
  • Environment, History
  • Files, Plots, Packages, Help

Code and workflow are more reproducible if we can document EVERYTHING that we do!

Our Goal

Other people can easily and exactly replicate our workflow and results

Getting Started with Our Working Directory

  1. Under File, click on New Project, choose New Directory then Empty project.
  2. Enter a name for this new folder called SWC_HumanGenetics (known as a working directory). This is where we will work for the rest of the workshop (~/SWC_HumanGenetics).
  3. Click on “Create Project”.
  4. Under the Files Tab on the right of the screen, click on New Folder and create a folder named data within SWC_HumanGenetics (i.e. ~/SWC_HumanGenetics/data).
  5. Create a New R Script called Ecoli_Metagenomes.R (File -> New File -> R Script) and save it within Your working directory.
  6. Does everyone’s working directory look like this?

Interacting with R

Two Main ways with Interacting with R:

  1. Using the Console - this is not recommended.
    • Bottom left panel
    • Will show commands run and results
    • Can type commands directly into the console. However, they will not be saved when you close the session!
  2. Using Script Files (Plain text files that contain code). This is recommended.
    • Type the code/commands into a script file, send it to the console to check it, and then save the script file.

">" means that R is ready to receive commands. You can send code from a script file to the console in 2 ways.

  1. Click the Run button in the top right of the script file.
  2. Highlight the desired code and click Ctrl-Enter.

A stop sign might show up for longer code. This shows that R is working to execute the code.

"+" means that R is waiting for you to complete the code. R has not yet run the code. Usually this is because you forgot to’close’ a parenthesis or quotation. Press Esc to get out of trouble.

Basics of R

R is a versatile, open source programming/scripting language that’s useful for data analysis/science. It is:

  • Based on the language S
  • Open Source Software under GPL (a specific type of license).
  • Superior to most alternatives
  • Has over 7,000 contributed packages
    • Widely used in academia and industry
  • Available on all platforms
  • Not just for statistics, but also for general purpose programming
  • Large and growing community of peers

Organizing Your Working Directory

You should always separate the original, raw data from intermediate datasets you create! For example your working directory might have folders such as:

  • raw_data
  • intermediate_data
  • figures
  • original_analyses
  • and so on…

Three ways to Seek Help

  1. You know what function you would like to use but need details about it. You can type a single question mark:
?mean  # Reminding myself of the detail and structure of calling the mean function. 
  1. You cannot remember the function that you need to use, but you remember that it has to deal with calculating means. To search all functions within your R, you can use a double question mark. This can also help you remind yourself of which package the function is in.
## First way 
??mean  # Searching for a function that has to do with calculting means of some sort. 
## Second way 
help.search("mean")

Another way to do this is to

  1. If you need to remind yourself of the names of the arguments:
args(hist) # "hist" is a function that makes a histogram.

Another way is to hit tab.

You Get an error message that you don’t understand.

Google it! Be sure to use key words like R and Error. You can copy and paste the error into it.

Asking For Help

  • Be sure to make it possible for someone to grasp your problem rapidly. Make it easy to pinpoint the problem.
  • Try to reproduce the problem with different very simplified data. You can create a simplified data set by using dput().
dput(head(iris)) #Iris is an example data.frame that comes with R.  Head only shows the first 6 rows.  
## structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), 
##     Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4, 
##     1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, 
##     0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, 
##     1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", 
## "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
## 6L), class = "data.frame")
  • Always include sessionInfo() so people can replicate your example.
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.9.5 (Mavericks)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.2     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.8.1 knitr_1.11     
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7.2

Introduction to R

Lesson can be found at http://tracykteal.github.io/R-genomics/01-intro-to-R.html

This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.

Learning Objectives

  • Be familiar with R syntax
  • Understand the concepts of objects and assignment
  • Understand the concepts of vector and data types
  • Get exposed to a few functions

R Syntax

Instructor will open up Final_Analysis.R. Let’s look at the following:

  • The assignment operator <-
  • The = for arguments within functions.
  • # for comments
    • This is where you can tell your reader important “behind the scenes” information.
    • IMPORTANT: You should ALWAYS heavily comment your code
  • The $ operator for working with columns inside of a data frame.
  • Indentation and spacing for easy reading - Code is made for humans.

A principle: “Write data for computers and code for humans.”

Creating objects

We can do simple math with R.

3+5
12/7

What is missing?? COMMENTS! ALWAYS comment code

# Adding 3 and 5.  R is fun!
3+5 

If we forget the #, what happens?

  • R is trying to run that sentence as a command
  • Now R has the +
    • How do we get out of this? Esc

To do more interesting things, we now need to learn about variables and assignment operators.

Objects and Assignment Operators

For instance instead of adding 3 + 5 we can create objects, which are called objects We assign an object with the <- operator, which assigns the value on the right to the object on the left The <- can be read as “goes into”

Tip: If you use Alt - (push Alt and then -), you can create the <- in a single keystroke.

# assign 3 to a
a <- 3
# assign 5 to b
b <- 5

# what now is a
a
# what now is b
b

#Add a and b 
a + b

#Add a and b and create c
c <- a + b

# what now is c
c

Tip: The name of your object is important! It should be meaningful and not too long. The name of an object can never start with a number. The following are unacceptable names for objects:

-if, else, for: These represent fundamental functions within R.
-c, mean, df, data: These are also functions
-You can use _ and - but not .
-Use nouns for objects and verbs for functions that you write.
-Keep a consistent style of your coding. Remember, code is for humans! (Hadley Wickham and Google each have published style guides.)

Functions and their arguments

Functions are “canned scripts” that automate a task. They:

  • Requires one or more arguments, which are the inputs to the function.
    • Arguments can be anything, numbers, columns of a data.frame, a whole data file. It differs by the function and can be looked up in the help page.
    • Many functions have arguments that have defaults. This is a standard value that the author of the function thought would be “good enough in standard cases”.
      • For example, what symbol to use in a plot.
      • You can always change the default to what you prefer.
  • Often (but not always) return a value
  • Executing a function is known as calling a function

For example, let’s take the function sqrt():

  • Argument/Input: The input, a number.
  • Output/Return value: The value input number squared.

Another example, let’s take the function round():

Exercise:
Round Pi: 3.14159
1. How many arguments does round have?
2. What is the default of round?
3. Change the default to a different value. What happens?
4. What happens when we round 2.5? 3.5? What does this tell you about the round function?

# Round pi
round(3.14159) 

# What are the arguments and default of round?
args(round)  

# Changing the default number of digits to 3
round(3.14159, digits = 3)  #Recommended!
round(3.14159, 3) ## NOT recommended for functions with more than one argument!!  

We get 3, that’s because the default is to round to the nearest whole number.

Exercise:
Create an object named ecoli_genome_length_mb with a value of 4.6.
How many picograms does the genome weigh? (978 Mbp = 1 pg)

Solution
ecoli_genome_length_mb <- 4.6 # This is for an e-coli genome
ecoli_genome_length_mb
ecoli_genome_weight_pg <- ecoli_genome_length_mb / 978.0
ecoli_genome_weight_pg # Ecoli genomes don't weigh very much. 
Human Genome
human_genome_length_mb <- 30000 # This is for an e-coli genome
human_genome_length_mb
human_genome_weight_pg <- human_genome_length_mb / 978.0 # To calculate the number of picograms that human genomes weight
human_genome_weight_pg # Human genomes don't weigh very much. 
Comparing things
  • == denotes equality. Reads as “is equal to”
  • != denotes inequality. Reads as “is not equal to”
  • < “less than”
  • <= “less than or equal to”
  • > “greater than”
  • >= “greater than or equal to”
1 ==1 # is 1 equal to 1?
1 != 1 # is 1 inequal to 1?   
1 < 2 # Is 1 less than 2?

Exercise:
Compare the lengths of ecoli_genome_length_mb and human_genome_length_mb.

human_genome_length_mb == ecoli_genome_length_mb # Are the genome lengths equal?
human_genome_length_mb != ecoli_genome_length_mb # Are the genome lengths equal?
human_genome_length_mb > ecoli_genome_length_mb # Are human genome lengths greater than ecoli genome lengths?

Vectors and Data Types

A vector is the most common and basic data structure in R. It’s R’s workhorse data type. A vector is a list of values, mainly numbers or characters. You can:

  • Do math with vectors
  • Can assign a list of values to a variable

Let’s create a vector of genome lengths:

glengths <- c(4.6, 3000, 50000) # Create a vector of 3 genome lengths
glengths
glengths[1] # Return the first value 
glengths[3] # Return the 3rd value 

A vector can also contain characters:

species <- c("ecoli", "human", "corn")
species
species[2] #What's the 2nd value in the species vector?

There are many functions to help us explore the vector:

length(glengths) # How many vlaues are in the genome lengths vector?
length(species) # how many values in the species vector?
length(glengths) == length(species) # Are they the same length? 

Some math with the vectors:

5 * glengths
new_lengths <- 2 * glengths # multiply lengths by 2
new_lengths

This makes it easy to combine or work with data!

Let’s take a look at our history in the top right of our console. To do so we can click on the history tab or type

history()  #Opens up the history pane in the top right of RStudio

What type of data is in our vector?

class(glengths) # What type of data is in glengths?
class(species)
str(glengths) # What's the structure of glengths?
str(species)

Tip: str() provides an overview of the object and the elements it contains. It is a really useful function when working with large and complex objects:

lengths <- c(glengths, 90) # adding at the end
lengths
lengths <- c(30, glengths) # adding at the beginning
lengths

We just saw 2 of the 6 data types that R uses: character and numeric. The other 4 are:

  • logical for TRUE and FALSE (the boolean data type)
  • integer for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
  • complex to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them.
  • “raw” that we won’t discuss further.

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame) and factors (factor).

Exercise:
1. What’s the difference between a data structure and data type?
2. Discuss the definitions of vector, object, function, arguments.


Starting with Data

Lesson can be found at http://tracykteal.github.io/R-genomics/02-starting-with-data.html

This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.

Learning Objectives

  • Load external data (CSV files) in memory using the survey table (Ecoli_metadata.csv) as an example
  • Explore the structure and the content of the data in R
  • Understand what are factors and how to manipulate them

Looking at Metadata

We are studying a population of E. coli (designated Ara-3):

  • Propagated for more than 40,000 generations in a glucose-limited minimal medium.
  • This medium was supplemented with citrate which E. coli cannot metabolize in the aerobic conditions of the experiment.
  • Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations.
  • This metadata describes information on the Ara-3 clones and the columns represent:
Column Description
Sample Clone Name
generation generation when sample frozen
clade based on parsimony tree
strain ancestral strain
cit Citrate-using mutant status
run Sequence read archive sample ID
genome_size size in Mbp (Made up for this lesson)
  1. Make sure you are in the correct working directory by typing getwd().
  2. Create a new directory within this working directory called data.
  3. Move the downloaded file into this directory. Remember some of our unix commands:
cd  # Go to home directory 
cd Downloads/ # Go to downloads
ls *metadata.csv # list all files that end with "metadata.csv"
mv Ecoli_metadata.csv ../SWC_HumanGenetics/data/ # Move the file to our working directory!
  1. Check the status of your working directory. Is the Ecoli_metadata.csv file within the data folder in the SWC_HumanGenetics working directory.
  2. Time to load the data into R!
getwd() # R's pwd - where are we?!
## [1] "/Users/marschmi/SWC_Genomics"
# reads the Ecoli_metadata.csv file from the data sub-folder
metadata <- read.csv('data/Ecoli_metadata.csv') #provides no output 
metadata # This will provide too much output
head(metadata) # first 6 rows
tail(metadata) # last 6 rows 
View(metadata) # View it!

What is a data.frame?

data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. It is:

  • A collection of vectors of identical lengths.
  • Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors).
  • The str() function is useful to inspect the data types of the columns.
  • Spreadsheets can be imported with the read.csv() or the read.table functions.
  • By default, data.frame converts (= coerces) columns that contain characters (i.e., text) into the factor data type.
    • Depending on what you want to do, you might want to keep these columns as characters. If so, change read.csv('data/Ecoli_metadata.csv', stringsAsFactors = FALSE)

Inspecting data.frame objects

  • Size:
    • dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the __dim__ensions of the object)
    • nrow() - returns the number of rows
    • ncol() - returns the number of columns
  • Content:
    • head() - shows the first 6 rows
    • tail() - shows the last 6 rows
    • View() - view the data within RStudio
  • Names:
    • names() or colnames() - returns the column names
    • rownames() - returns the row names
  • Summary:
    • str() - structure of the object and information about the class, length and content of each column
    • summary() - summary statistics for each column

Exercises
What is the class of the object metadata?
How many rows and how many columns are in this object?
How many citrate+ mutants have been recorded in this population?

Factors

Factors are used to represent categorical data.

  • Can be ordered or unordered.
  • Are an important class for statistical analysis and for plotting.
  • Are stored as integers, and have labels associated with these unique integers.
  • While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
  • Once created, factors can only contain a pre-defined set values, known as levels.
    • By default, R always sorts levels in alphabetical order

Let’s now check the __str__ucture of our data.frame in more details with the function str():

str(metadata)
## 'data.frame':    30 obs. of  7 variables:
##  $ sample     : Factor w/ 30 levels "CZB152","CZB154",..: 7 6 18 19 20 21 22 23 24 25 ...
##  $ generation : int  0 2000 5000 10000 15000 20000 20000 20000 25000 25000 ...
##  $ clade      : Factor w/ 7 levels "(C1,C2)","C1",..: NA 7 7 6 6 1 1 1 2 4 ...
##  $ strain     : Factor w/ 1 level "REL606": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cit        : Factor w/ 3 levels "minus","plus",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ run        : Factor w/ 30 levels "","SRR097977",..: 1 5 22 23 24 25 26 27 28 29 ...
##  $ genome_size: num  4.62 4.63 4.6 4.59 4.66 4.63 4.62 4.61 4.65 4.59 ...
  • Some columns say Factor w/ 30 levels, which means that each row has a unique value.
  • cit is a Factor w/ 3 levels: minus, plus and unknown.

Exercise
By looking at the str(metadata) output:
1. What does the $ operator mean?
2. What do the integers mean at the end of the description for the columns that are factors?


Introducing data.frame - more on data frames

Lesson can be found at http://tracykteal.github.io/R-genomics/03-data-frames.html

This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.

Learning Objectives

  • Understand the concept of a data.frame
  • Using sequences to index data
  • Know how to access any element of a data.frame

Indexing and sequences (within a vector)

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets, just as we did with the vectors - this time with 2 numbers:

  • R indexes start at 1.
  • data.frame objects have 2 dimensions - rows and columns
    • Therefore, we need to specify the “coordinates” we want from it.
    • Row numbers come first, followed by column numbers (i.e. [row, column]).
metadata[1, 2]   # first element in the 2nd column of the data frame
metadata[1, 6]   # first element in the 6th column
metadata[1:3, 7] # first three elements in the 7th column
metadata[3, ]    # the 3rd element for all columns
metadata[, 7]    # the entire 7th column
head_meta <- metadata[1:6, ] # metadata[1:6, ] is equivalent to head(metadata)

Indexing and sequences (within a data.frame)

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are species names in column 5 or 7? oh, right… they are in column 6). In some cases, in which column the variable will be can change if the script you are using adds or removes columns.

It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.

You can do operations on a particular column, by selecting it using the $ sign.

  • In this case, the entire column is a vector.
  • You can use names(metadata) or colnames(metadata) to remind yourself of the column names.
    For instance, to extract all the strain information from our datasets:
metadata$strain # Return all of the strains in the data frame

In some cases, you may want to select more than one column. You can do this using the square brackets. Suppose we wanted strain and clade information:

metadata[, c("strain", "clade")] # Only select the strain and clade columns from metadata

You can even access columns by column name and select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:

metadata[4:7, c("strain", "clade")]

Exercise
Create a new object with a new name without the strain and run columns.


Aggregating and analyzing data with dplyr

Lesson can be found at http://tracykteal.github.io/R-genomics/04-dplyr.html

This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.

Data manipulation using dplyr

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.

Enter dplyr.

dplyr is a package for making data manipulation easier.

Packages in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str(), come built into R; packages give you access to more functions. You need to install a package and then load it to be able to use it.

install.packages("dplyr") ## install dplyr

You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; I’d recommend choosing the RStudio mirror.

library("dplyr")          ## load

You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.

What is dplyr?

dplyr is:

  • a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks.
  • Built to work with data frames.
  • Inspired by the plyr package, which has been in use for some time but suffered from being slow in some cases.
    • dplyr addresses this by porting much of the computation to C++.
  • An additional feature is the ability to work with data stored directly in an external database.
  • The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.

This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.

Selecting columns and filtering rows

We’re going to learn some of the most common dplyr functions:

  • select()
  • filter()
  • arrange()
  • mutate()
  • group_by()
  • summarize()

To select columns of a data frame, use select().

The first argument to this function is the data frame (metaadata), and the subsequent arguments are the columns to keep.

select(metadata, sample, clade, cit, genome_size) #select 4 columns

To choose rows, use filter():

filter(metadata, cit == "plus") # select only rows that are the citrate plus mutant

Challenge:

  1. Filter out samples that have a genome size that is equal or greater than 4.74.
  2. Filter out mutants that are “unknown” and are before generations 25,000.

To re-order or arrange the output, use arrange():

arrange(metadata, genome_size) # arrange by ascending genome_size
arrange(metadata, -genome_size) # arrange by -ascending genome_size
arrange(metadata, -generation) # arrange by descending generation

Challenge:

Create a new object named arranged_metadata where metadata is arranged first by descending generation and second by ascending clade

Summary so far:

  • select(): To subset certain columns out of the data.
  • filter(): To subset specific rows out of the data.
  • arrange(): To sort the data frame by columns.

Pipes

But what if you wanted to select and filter?

There are three ways to do this:

  1. Use intermediate steps: Create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects.
  2. Nested functions: You can also nest functions by putting one function inside of another. This is handy, but can be difficult to read if too many functions are nested as the process from inside out.
  3. Pipes: A fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set.
    • Pipes in R look like %>% and are made available via the magrittr package installed as part of dplyr.

Let’s say we would like to do the following:

  • Pick only the citrate plus mutants AND pick the sample, generation, and clade columns. All in one set of function calls. Enter pipes.
metadata %>%
  filter(cit == "plus") %>%
  select(sample, generation, clade)

In the above we use the pipe to send the metadata data set first through filter, to keep rows where cit was equal to ‘plus’, and then through select to keep the sample and generation and clade columns.

When the data frame is being passed to the filter() and select() functions through a pipe, we don’t need to include it as an argument to these functions anymore.

If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:

meta_citplus <- metadata %>%
  filter(cit == "plus") %>%
  select(sample, generation, clade)

meta_citplus

Challenge
Using pipes, subset the data to include rows where the clade is ‘unknown+’. Retain columns sample, cit, and genome_size. Finally arrange the output by descending genome_size.

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of genome size in bp:

metadata %>%
    mutate(genome_bp = genome_size *1e6) # 1 Mbp = 10^6 bp

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data (pipes work with non-dplyr functions too, as long as the dplyr or magrittr packages are loaded).

metadata %>%
  mutate(genome_bp = genome_size *1e6) %>%
  head

The row has a NA value for clade, so if we wanted to remove those we could insert a filter() in this chain:

We can do this by using the is.na() command.

is.na() is a function that determines whether something is or is not an NA. The ! symbol negates it, so we’re asking for everything that is not an NA.

Remember the != inequality operations, we can use this with is.na() by placing it before the command like this !is.na().

metadata %>%
  mutate(genome_bp = genome_size *1e6) %>%
  filter(!is.na(clade)) %>% # select all rows where clade is NOT an NA
  head

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the “split-apply-combine” paradigm:

  1. Split the data into groups.
  2. Apply some analysis to each group, and
  3. Combine the results.

dplyr makes this very easy through the use of the group_by() function. group_by() splits the data into groups upon which some operations can be run.

For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:

metadata %>%
  group_by(cit) %>%  
  tally()  # tallies each group.

count() is the same as tally but alrea will also do the same thing:

metadata %>%
  count(cit)  # tallies each group but has the group.

There are several different summary statistics that can be generated from our data. The R base package provides many built-in functions such as:

  • mean()
  • median()
  • min() & max()
  • range()

By default, all R functions operating on vectors that contains missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it.

When dealing with simple statistics like the mean(), the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove).

group_by() is often used together with summarize() which collapses each group into a single-row summary of that group. summarize() can then be used to apply a function such as those that compute simple stats to give an overview for the group. So to view mean genome_size by mutant status:

metadata %>%
  group_by(cit) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE))

You can group by multiple columns too:

metadata %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE))

Looks like for one of these clones, the clade is missing. We could then discard those rows using filter():

metadata %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>%
  filter(!is.na(clade))

Challenge:
Let’s calculate summary statistics! All in one go, let’s do the following:

  1. Group by cit
  2. Calculate the mean() generation for each mutant.
  3. Calculate the median() generation for each mutant.
  4. Calculate the min() generation for each mutant.
  5. calculate the max() generation for each mutant.

What do we notice about the output??

dplyr has changed our data.frame to a tbl_df. This is a data structure that’s very similar to a data frame; for our purposes the only difference is that it won’t automatically show tons of data going off the screen.

glimpse(metadata) # dplyr version, very convenient for the screen
str(metadta)

Link to the RStudio Data Wrangling Cheat Sheet


Data Visualization

Lesson can be found at http://tracykteal.github.io/R-genomics/05-data-visualization.html

This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.

Learning Objectives

  • Basic plots
  • Advanced plots (introducing ggplot)
  • Writing images (and other things) to file

Basic Plots in R

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers”, and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R’s plotting packages.

When we are working with large sets of numbers it can be useful to display that information graphically. R has a number of built-in tools for basic graph types such as hisotgrams with hist(), scatter plots with plot(), bar charts with barplot(), boxplots with boxplot() and much,much more

We’ll test a few of these out here on the genome_size vector from our metadata.

genome_size <- metadata$genome_size

Scatterplot

Let’s start with a scatterplot. A scatter plot provides a graphical view of the relationship between two sets of numbers.

We don’t have a variable in our metadata that is a continous variable, so there is nothing to plot it against but we can plot the values against their index values just to demonstrate the function.

plot(genome_size) # Scatter plot

Each point represents a clone and the value on the x-axis is the clone index in the file, where the values on the y-axis correspond to the genome size for the clone. For any plot you can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.

For example, we can change the shape and add a title of the data point using pch.

plot(genome_size, pch=8)  # Change the symbol
plot(genome_size, pch=8, main="Scatter plot of genome sizes") # Add a title 

Histogram

Another way to visualize the distribution of genome sizes is to use a histogram, we can do this buy using the hist() function:

#histograms
hist(genome_size)  # Create histogram of genome sizes
hist(genome_size, breaks = 10)

Boxplot

Using additional information from our metadata, we can use plots to compare values between the different citrate mutant status using a boxplot. A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set.

# Boxplot
boxplot(metadata$genome_size ~ metadata$cit) # Creates a box and whisker plot

To the left of the ~ is the response variable and to the right of the tilde is the explanatory variable.

boxplot(genome_size ~ cit, metadata,  col=c("pink","purple", "darkgrey"),
        main="Average expression differences between celltypes", ylab="Expression")

Advanced figures with ggplot2

More recently, R users have moved away from base graphic options and towards a plotting package called ggplot2 that adds a lot of functionality to the basic plots seen above.

An amazing resource for ggplot2 is the R Cookbook written by Winston Chang

The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot2 functions to get a feel for the syntax.

ggplot() is best used on data in the data.frame form, so we will work with metadata for the following figures. Let’s start by loading the ggplot2 library.

install.packages("ggplot2")
library(ggplot2)

The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot, and add them together using the + operator.

We will start with a blank plot and will find that you will get an error, because you need to add layers.

ggplot(metadata) # note the error 

Geometric objects are the actual marks we put on a plot. Examples include:

  • points (geom_point, for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

A plot must have at least one geom (short for geometric objects); there is no upper limit. You can add a geom to a plot using the + operator

ggplot(metadata) +
  geom_point() # note what happens here

Each type of geom usually has a required set of aesthetics to be set, and usually accepts only a subset of all aesthetics –refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function. Examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color) shape (of points)
  • linetype
  • size

ggplot Cheat Sheet

ggplot(metadata, aes(x = sample, y= genome_size)) +
  geom_point()
# above is the same as below
ggplot(metadata) +
+ geom_point(aes(x = sample, y= genome_size))

The labels on the x-axis are quite hard to read. To do this we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:

  • Axis labels
  • Plot background
  • Facet label backround
  • Legend appearance

There are built-in themes we can use, or we can adjust specific elements.

For our figure we will change:

  • The x-axis labels to be plotted on a 45 degree angle with a small horizontal shift to avoid overlap.
  • Add some additional aesthetics by mapping them to other variables in our dataframe.
    • For example, the color of the points will reflect the number of generations and the shape will reflect citrate mutant status.
  • The size of the points can be adjusted within the geom_point but does not need to be included in aes() since the value is not mapping to a variable.
# First let's change the shape of the symbols 
ggplot(metadata, aes(x = sample, y= genome_size, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) # put x-axis on 45 degree angle

# Then add the change in color due to generations 
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1))

# Why data type matters!  
metadata$genome_size_fact <- as.factor(metadata$genome_size)

ggplot(metadata, aes(x = sample, y= genome_size_fact, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1))

##  Let's look at genome_size by generation
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1))

# Zoom in on a certain part of the plot with setting the xlim
summary(metadata$generation) # let's plot the mean to max 
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) + 
  xlim(mean(metadata$generation), max(metadata$generation))

##  Flip the plot with coord_flip!
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) + 
  theme(axis.text.x = element_text(angle=45, hjust=1)) + 
  coord_flip()

Writing figures to file

There are two ways in which figures and plots can be output to a file (rather than simply displaying on screen).

  1. Easier way: Export directly from the RStudio ‘Plots’ panel, by clicking on Export when the image is plotted. This will give you the option of png or pdf and selecting the directory to which you wish to save it to.
  2. More reproducible way: Use R functions that allow you the flexibility to specify parameters to dictate the size and resolution of the output image. Some of the more popular formats include pdf(), png(), jpeg(), tiff()etc.

Initialize a plot that will be written directly to a file using pdf(), png(), jpeg(), tiff() etc.

Within the function you will need to:

  1. Specify a name for your image.
  2. Provide the width and height (optional).
  3. Create a plot using the usual functions in R.
  4. Close the file using the dev.off() function.

Make a new folder called figures to store plots in.

pdf("figures/genome_size_vs_generation.pdf", width = 7, height = 5)

ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1))

dev.off()

Histogram

To plot a histogram we require another geometric object geom_bar(), which requires a statistical transformation. Some plot types (such as scatterplots) do not require transformations, each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. need to be transformed, and usually has a default statistic that can be changed via the stat_bin argument.

ggplot(metadata, aes(x = genome_size)) +
  geom_bar() + xlab("Genome Size") + ylab("Count")

Try plotting with the default value and compare it to the plot using the binwidth values. How do they differ?

ggplot(metadata, aes(x = genome_size)) +
  geom_bar(stat = "bin", binwidth=0.01)

# Add black and white background
ggplot(metadata, aes(x = genome_size)) +
  geom_bar(stat = "bin", binwidth=0.01) + theme_bw() 

# make a pretty color
ggplot(metadata, aes(x = genome_size)) + 
  geom_bar(stat = "bin", fill = "cornflowerblue", color = "black", binwidth=0.01) + theme_bw()

# color by citrate mutant
ggplot(metadata, aes(x = genome_size, fill = cit)) + 
  geom_bar(stat = "bin", color = "black", binwidth=0.01) + 
  theme_bw()

# Move the legend within
ggplot(metadata, aes(x = genome_size, fill = cit)) + 
  geom_bar(stat = "bin", color = "black", binwidth=0.01) + 
  theme_bw() + 
  theme(legend.position = c(0.8, 0.8)) # (x,y) Think of each side of the plot to be 1 

Challenge:

Re-create the histogram plots from above. This time, save each of the plots that you make in the figures folder.

Boxplot

Let’s try plotting a boxplot similar to what we had done using the base plot functions at the start of this lesson. We can add some additional layers to include a plot title and change the axis labels. Explore the code below and all the different layers that we have added to understand what each layer contributes to the final graphic.

ggplot(metadata, aes(x = cit, y = genome_size, fill = cit)) +
  geom_boxplot() +
  ggtitle("Boxplot of genome size by citrate mutant type") +
  xlab("citrate mutant") +
  ylab("genome size") + 
  theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=45, hjust=1),
          axis.title = element_text(size = rel(1.5)),
          axis.text = element_text(size = rel(1.25)))

Error bars

Box and whisker plots are nice, but maybe we would prefer to calculate the mean and standard deviation of the genome_size within each of the citrate mutants with error bars.

To plot scatter plots with error bars, we use geom_errobar().

  • geom_errorbar(width = 0.1, aes(ymin = mean - sd, ymax = mean + sd))
    • The width argument will make wider error bars.

This means that we will have to calculate both the mean and the standard deviaition.

Exercise:
Work with your partner to do the following:

  1. Calculcate the mean() of the genome_length for each of the citrate mutants.
  2. Calculate the standard deviation with sd() of the genome_length for each of the citrate mutants.
  3. Plot cit (x-axis) and the mean (y-axis) with errorbars.

Faceting plots

One of the best features of ggplot is to be able to facet plots. When you facet plots you split up your data by one or more variables and plot the subsets of data together.

Two ways to facet:

  1. facet_grid: Input values for vertical ~ horizontal.
  2. facet_wrap: Instead of faceting with a variable in the horizontal or vertical direction, facets can be placed next to each other, wrapping with a certain number of columns or rows. The label for each plot will be at the top of the plot.

Tip: Facets are a great way to explore your data!

# Break up 3 groups 
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) + 
  facet_grid(cit ~ .) # To make 3 plots in one! Remember, vertical ~ horizontal. 

# This plot would be more useful.
# Break up 3 groups on the x-axis by citrate mutant status along the x-axis.
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) + 
  facet_grid(. ~ cit) # To make 3 plots in one!  Remember, vertical ~ horizontal. 


# Let's make the plot a little prettier
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) + 
  facet_grid(. ~ cit, scales = "free_x") # Delete all the extra samples 

Challenge:

  1. Create a faceted plot of sample (x-axis) and genome size (y-axis) broken up by clade. Make the colors of the plot to represent each clade.
  2. Create another faceted plot with the clade (x-axis) and mean genome size on the (y- axis).
  3. Create another facetted plot with sample (x-axis) and genome size (y-axis) broke up by generation. Color the points by the citrate mutant. What does this tell us about what “unknown” means?