Discuss:
What’s your current data analysis workflow?
Chat with your neighbors!
Lesson Can be found at http://tracykteal.github.io/R-genomics/00-before-we-start.html
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
Go over the following:
Code and workflow are more reproducible if we can document EVERYTHING that we do!
Other people can easily and exactly replicate our workflow and results
File, click on New Project, choose New Directory then Empty project.SWC_HumanGenetics (known as a working directory). This is where we will work for the rest of the workshop (~/SWC_HumanGenetics).Files Tab on the right of the screen, click on New Folder and create a folder named data within SWC_HumanGenetics (i.e. ~/SWC_HumanGenetics/data).Ecoli_Metagenomes.R (File -> New File -> R Script) and save it within Your working directory.Two Main ways with Interacting with R:
">" means that R is ready to receive commands. You can send code from a script file to the console in 2 ways.
Run button in the top right of the script file.Ctrl-Enter.A stop sign might show up for longer code. This shows that R is working to execute the code.
"+" means that R is waiting for you to complete the code. R has not yet run the code. Usually this is because you forgot to’close’ a parenthesis or quotation. Press Esc to get out of trouble.
R is a versatile, open source programming/scripting language that’s useful for data analysis/science. It is:
You should always separate the original, raw data from intermediate datasets you create! For example your working directory might have folders such as:
raw_dataintermediate_datafiguresoriginal_analyses?mean # Reminding myself of the detail and structure of calling the mean function.
## First way
??mean # Searching for a function that has to do with calculting means of some sort.
## Second way
help.search("mean")
Another way to do this is to
args(hist) # "hist" is a function that makes a histogram.
Another way is to hit tab.
Google it! Be sure to use key words like R and Error. You can copy and paste the error into it.
dput().dput(head(iris)) #Iris is an example data.frame that comes with R. Head only shows the first 6 rows.
## structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
## Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4,
## 1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2,
## 0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L,
## 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length",
## "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
## 6L), class = "data.frame")
sessionInfo() so people can replicate your example.sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.9.5 (Mavericks)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8.1 knitr_1.11
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7.2
Lesson can be found at http://tracykteal.github.io/R-genomics/01-intro-to-R.html
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
Instructor will open up Final_Analysis.R. Let’s look at the following:
<-= for arguments within functions.# for comments
$ operator for working with columns inside of a data frame.We can do simple math with R.
3+5
12/7
What is missing?? COMMENTS! ALWAYS comment code
# Adding 3 and 5. R is fun!
3+5
If we forget the #, what happens?
+
EscTo do more interesting things, we now need to learn about variables and assignment operators.
For instance instead of adding 3 + 5 we can create objects, which are called objects We assign an object with the <- operator, which assigns the value on the right to the object on the left The <- can be read as “goes into”
Tip: If you use
Alt -(pushAltand then-), you can create the<-in a single keystroke.
# assign 3 to a
a <- 3
# assign 5 to b
b <- 5
# what now is a
a
# what now is b
b
#Add a and b
a + b
#Add a and b and create c
c <- a + b
# what now is c
c
Tip: The name of your object is important! It should be meaningful and not too long. The name of an object can never start with a number. The following are unacceptable names for objects:
-
if,else,for: These represent fundamental functions within R.
-c,mean,df,data: These are also functions
-You can use_and-but not.
-Use nouns for objects and verbs for functions that you write.
-Keep a consistent style of your coding. Remember, code is for humans! (Hadley Wickham and Google each have published style guides.)
Functions are “canned scripts” that automate a task. They:
For example, let’s take the function sqrt():
Another example, let’s take the function round():
Exercise:
Round Pi: 3.14159
1. How many arguments does round have?
2. What is the default of round?
3. Change the default to a different value. What happens?
4. What happens when we round 2.5? 3.5? What does this tell you about the round function?
# Round pi
round(3.14159)
# What are the arguments and default of round?
args(round)
# Changing the default number of digits to 3
round(3.14159, digits = 3) #Recommended!
round(3.14159, 3) ## NOT recommended for functions with more than one argument!!
We get 3, that’s because the default is to round to the nearest whole number.
Exercise:
Create an object namedecoli_genome_length_mbwith a value of 4.6.
How many picograms does the genome weigh? (978 Mbp = 1 pg)
ecoli_genome_length_mb <- 4.6 # This is for an e-coli genome
ecoli_genome_length_mb
ecoli_genome_weight_pg <- ecoli_genome_length_mb / 978.0
ecoli_genome_weight_pg # Ecoli genomes don't weigh very much.
human_genome_length_mb <- 30000 # This is for an e-coli genome
human_genome_length_mb
human_genome_weight_pg <- human_genome_length_mb / 978.0 # To calculate the number of picograms that human genomes weight
human_genome_weight_pg # Human genomes don't weigh very much.
== denotes equality. Reads as “is equal to”!= denotes inequality. Reads as “is not equal to”< “less than”<= “less than or equal to”> “greater than”>= “greater than or equal to”1 ==1 # is 1 equal to 1?
1 != 1 # is 1 inequal to 1?
1 < 2 # Is 1 less than 2?
Exercise:
Compare the lengths ofecoli_genome_length_mbandhuman_genome_length_mb.
human_genome_length_mb == ecoli_genome_length_mb # Are the genome lengths equal?
human_genome_length_mb != ecoli_genome_length_mb # Are the genome lengths equal?
human_genome_length_mb > ecoli_genome_length_mb # Are human genome lengths greater than ecoli genome lengths?
A vector is the most common and basic data structure in R. It’s R’s workhorse data type. A vector is a list of values, mainly numbers or characters. You can:
Let’s create a vector of genome lengths:
glengths <- c(4.6, 3000, 50000) # Create a vector of 3 genome lengths
glengths
glengths[1] # Return the first value
glengths[3] # Return the 3rd value
A vector can also contain characters:
species <- c("ecoli", "human", "corn")
species
species[2] #What's the 2nd value in the species vector?
There are many functions to help us explore the vector:
length(glengths) # How many vlaues are in the genome lengths vector?
length(species) # how many values in the species vector?
length(glengths) == length(species) # Are they the same length?
Some math with the vectors:
5 * glengths
new_lengths <- 2 * glengths # multiply lengths by 2
new_lengths
This makes it easy to combine or work with data!
Let’s take a look at our history in the top right of our console. To do so we can click on the history tab or type
history() #Opens up the history pane in the top right of RStudio
What type of data is in our vector?
class(glengths) # What type of data is in glengths?
class(species)
str(glengths) # What's the structure of glengths?
str(species)
Tip:
str()provides an overview of the object and the elements it contains. It is a really useful function when working with large and complex objects:
lengths <- c(glengths, 90) # adding at the end
lengths
lengths <- c(30, glengths) # adding at the beginning
lengths
We just saw 2 of the 6 data types that R uses: character and numeric. The other 4 are:
logical for TRUE and FALSE (the boolean data type)integer for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)complex to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them.Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame) and factors (factor).
Exercise:
1. What’s the difference between a data structure and data type?
2. Discuss the definitions of vector, object, function, arguments.
Lesson can be found at http://tracykteal.github.io/R-genomics/02-starting-with-data.html
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
We are studying a population of E. coli (designated Ara-3):
| Column | Description |
|---|---|
| Sample | Clone Name |
| generation | generation when sample frozen |
| clade | based on parsimony tree |
| strain | ancestral strain |
| cit | Citrate-using mutant status |
| run | Sequence read archive sample ID |
| genome_size | size in Mbp (Made up for this lesson) |
getwd().cd # Go to home directory
cd Downloads/ # Go to downloads
ls *metadata.csv # list all files that end with "metadata.csv"
mv Ecoli_metadata.csv ../SWC_HumanGenetics/data/ # Move the file to our working directory!
Ecoli_metadata.csv file within the data folder in the SWC_HumanGenetics working directory.getwd() # R's pwd - where are we?!
## [1] "/Users/marschmi/SWC_Genomics"
# reads the Ecoli_metadata.csv file from the data sub-folder
metadata <- read.csv('data/Ecoli_metadata.csv') #provides no output
metadata # This will provide too much output
head(metadata) # first 6 rows
tail(metadata) # last 6 rows
View(metadata) # View it!
data.frame?data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. It is:
str() function is useful to inspect the data types of the columns.read.csv() or the read.table functions.read.csv('data/Ecoli_metadata.csv', stringsAsFactors = FALSE)data.frame objectsdim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the __dim__ensions of the object)nrow() - returns the number of rowsncol() - returns the number of columnshead() - shows the first 6 rowstail() - shows the last 6 rowsView() - view the data within RStudionames() or colnames() - returns the column namesrownames() - returns the row namesstr() - structure of the object and information about the class, length and content of each columnsummary() - summary statistics for each columnExercises
What is the class of the object metadata?
How many rows and how many columns are in this object?
How many citrate+ mutants have been recorded in this population?
Factors are used to represent categorical data.
Let’s now check the __str__ucture of our data.frame in more details with the function str():
str(metadata)
## 'data.frame': 30 obs. of 7 variables:
## $ sample : Factor w/ 30 levels "CZB152","CZB154",..: 7 6 18 19 20 21 22 23 24 25 ...
## $ generation : int 0 2000 5000 10000 15000 20000 20000 20000 25000 25000 ...
## $ clade : Factor w/ 7 levels "(C1,C2)","C1",..: NA 7 7 6 6 1 1 1 2 4 ...
## $ strain : Factor w/ 1 level "REL606": 1 1 1 1 1 1 1 1 1 1 ...
## $ cit : Factor w/ 3 levels "minus","plus",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ run : Factor w/ 30 levels "","SRR097977",..: 1 5 22 23 24 25 26 27 28 29 ...
## $ genome_size: num 4.62 4.63 4.6 4.59 4.66 4.63 4.62 4.61 4.65 4.59 ...
Factor w/ 30 levels, which means that each row has a unique value.cit is a Factor w/ 3 levels: minus, plus and unknown.Exercise
By looking at the str(metadata) output:
1. What does the$operator mean?
2. What do the integers mean at the end of the description for the columns that arefactors?
data.frame - more on data framesLesson can be found at http://tracykteal.github.io/R-genomics/03-data-frames.html
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
data.framedata.frameIf we want to extract one or several values from a vector, we must provide one or several indices in square brackets, just as we did with the vectors - this time with 2 numbers:
data.frame objects have 2 dimensions - rows and columns
metadata[1, 2] # first element in the 2nd column of the data frame
metadata[1, 6] # first element in the 6th column
metadata[1:3, 7] # first three elements in the 7th column
metadata[3, ] # the 3rd element for all columns
metadata[, 7] # the entire 7th column
head_meta <- metadata[1:6, ] # metadata[1:6, ] is equivalent to head(metadata)
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are species names in column 5 or 7? oh, right… they are in column 6). In some cases, in which column the variable will be can change if the script you are using adds or removes columns.
It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.
You can do operations on a particular column, by selecting it using the $ sign.
names(metadata) or colnames(metadata) to remind yourself of the column names.metadata$strain # Return all of the strains in the data frame
In some cases, you may want to select more than one column. You can do this using the square brackets. Suppose we wanted strain and clade information:
metadata[, c("strain", "clade")] # Only select the strain and clade columns from metadata
You can even access columns by column name and select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:
metadata[4:7, c("strain", "clade")]
Exercise
Create a new object with a new name without thestrainandruncolumns.
Lesson can be found at http://tracykteal.github.io/R-genomics/04-dplyr.html
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.
Enter dplyr.
dplyr is a package for making data manipulation easier.
Packages in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str(), come built into R; packages give you access to more functions. You need to install a package and then load it to be able to use it.
install.packages("dplyr") ## install dplyr
You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; I’d recommend choosing the RStudio mirror.
library("dplyr") ## load
You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.
dplyr is:
plyr package, which has been in use for some time but suffered from being slow in some cases.
dplyr addresses this by porting much of the computation to C++.This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.
We’re going to learn some of the most common dplyr functions:
select()filter()arrange()mutate()group_by()summarize()To select columns of a data frame, use select().
The first argument to this function is the data frame (metaadata), and the subsequent arguments are the columns to keep.
select(metadata, sample, clade, cit, genome_size) #select 4 columns
To choose rows, use filter():
filter(metadata, cit == "plus") # select only rows that are the citrate plus mutant
Challenge:
- Filter out samples that have a genome size that is equal or greater than 4.74.
- Filter out mutants that are “unknown” and are before generations 25,000.
To re-order or arrange the output, use arrange():
arrange(metadata, genome_size) # arrange by ascending genome_size
arrange(metadata, -genome_size) # arrange by -ascending genome_size
arrange(metadata, -generation) # arrange by descending generation
Challenge:
Create a new object named
arranged_metadatawheremetadatais arranged first by descending generation and second by ascending clade
Summary so far:
select(): To subset certain columns out of the data.filter(): To subset specific rows out of the data.arrange(): To sort the data frame by columns.But what if you wanted to select and filter?
There are three ways to do this:
%>% and are made available via the magrittr package installed as part of dplyr.Let’s say we would like to do the following:
metadata %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
In the above we use the pipe to send the metadata data set first through filter, to keep rows where cit was equal to ‘plus’, and then through select to keep the sample and generation and clade columns.
When the data frame is being passed to the filter() and select() functions through a pipe, we don’t need to include it as an argument to these functions anymore.
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
meta_citplus <- metadata %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
meta_citplus
Challenge
Using pipes, subset the data to include rows where the clade is ‘unknown+’. Retain columnssample,cit, andgenome_size. Finally arrange the output by descending genome_size.
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate().
To create a new column of genome size in bp:
metadata %>%
mutate(genome_bp = genome_size *1e6) # 1 Mbp = 10^6 bp
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data (pipes work with non-dplyr functions too, as long as the dplyr or magrittr packages are loaded).
metadata %>%
mutate(genome_bp = genome_size *1e6) %>%
head
The row has a NA value for clade, so if we wanted to remove those we could insert a filter() in this chain:
We can do this by using the is.na() command.
is.na() is a function that determines whether something is or is not an NA. The ! symbol negates it, so we’re asking for everything that is not an NA.
Remember the != inequality operations, we can use this with is.na() by placing it before the command like this !is.na().
metadata %>%
mutate(genome_bp = genome_size *1e6) %>%
filter(!is.na(clade)) %>% # select all rows where clade is NOT an NA
head
summarize() functionMany data analysis tasks can be approached using the “split-apply-combine” paradigm:
dplyr makes this very easy through the use of the group_by() function. group_by() splits the data into groups upon which some operations can be run.
For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:
metadata %>%
group_by(cit) %>%
tally() # tallies each group.
count() is the same as tally but alrea will also do the same thing:
metadata %>%
count(cit) # tallies each group but has the group.
There are several different summary statistics that can be generated from our data. The R base package provides many built-in functions such as:
mean()median()min() & max()range()By default, all R functions operating on vectors that contains missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it.
When dealing with simple statistics like the mean(), the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove).
group_by() is often used together with summarize() which collapses each group into a single-row summary of that group. summarize() can then be used to apply a function such as those that compute simple stats to give an overview for the group. So to view mean genome_size by mutant status:
metadata %>%
group_by(cit) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
You can group by multiple columns too:
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
Looks like for one of these clones, the clade is missing. We could then discard those rows using filter():
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>%
filter(!is.na(clade))
Challenge:
Let’s calculate summary statistics! All in one go, let’s do the following:
- Group by
cit- Calculate the
mean()generation for each mutant.- Calculate the
median()generation for each mutant.- Calculate the
min()generation for each mutant.- calculate the
max()generation for each mutant.
What do we notice about the output??
dplyr has changed our data.frame to a tbl_df. This is a data structure that’s very similar to a data frame; for our purposes the only difference is that it won’t automatically show tons of data going off the screen.
glimpse(metadata) # dplyr version, very convenient for the screen
str(metadta)
Link to the RStudio Data Wrangling Cheat Sheet
Lesson can be found at http://tracykteal.github.io/R-genomics/05-data-visualization.html
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers”, and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R’s plotting packages.
When we are working with large sets of numbers it can be useful to display that information graphically. R has a number of built-in tools for basic graph types such as hisotgrams with hist(), scatter plots with plot(), bar charts with barplot(), boxplots with boxplot() and much,much more
We’ll test a few of these out here on the genome_size vector from our metadata.
genome_size <- metadata$genome_size
Let’s start with a scatterplot. A scatter plot provides a graphical view of the relationship between two sets of numbers.
We don’t have a variable in our metadata that is a continous variable, so there is nothing to plot it against but we can plot the values against their index values just to demonstrate the function.
plot(genome_size) # Scatter plot
Each point represents a clone and the value on the x-axis is the clone index in the file, where the values on the y-axis correspond to the genome size for the clone. For any plot you can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.
For example, we can change the shape and add a title of the data point using pch.
plot(genome_size, pch=8) # Change the symbol
plot(genome_size, pch=8, main="Scatter plot of genome sizes") # Add a title
Another way to visualize the distribution of genome sizes is to use a histogram, we can do this buy using the hist() function:
#histograms
hist(genome_size) # Create histogram of genome sizes
hist(genome_size, breaks = 10)
Using additional information from our metadata, we can use plots to compare values between the different citrate mutant status using a boxplot. A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set.
# Boxplot
boxplot(metadata$genome_size ~ metadata$cit) # Creates a box and whisker plot
To the left of the ~ is the response variable and to the right of the tilde is the explanatory variable.
boxplot(genome_size ~ cit, metadata, col=c("pink","purple", "darkgrey"),
main="Average expression differences between celltypes", ylab="Expression")
ggplot2More recently, R users have moved away from base graphic options and towards a plotting package called ggplot2 that adds a lot of functionality to the basic plots seen above.
An amazing resource for ggplot2 is the R Cookbook written by Winston Chang
The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot2 functions to get a feel for the syntax.
ggplot() is best used on data in the data.frame form, so we will work with metadata for the following figures. Let’s start by loading the ggplot2 library.
install.packages("ggplot2")
library(ggplot2)
The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot, and add them together using the + operator.
We will start with a blank plot and will find that you will get an error, because you need to add layers.
ggplot(metadata) # note the error
Geometric objects are the actual marks we put on a plot. Examples include:
A plot must have at least one geom (short for geometric objects); there is no upper limit. You can add a geom to a plot using the + operator
ggplot(metadata) +
geom_point() # note what happens here
Each type of geom usually has a required set of aesthetics to be set, and usually accepts only a subset of all aesthetics –refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function. Examples include:
ggplot(metadata, aes(x = sample, y= genome_size)) +
geom_point()
# above is the same as below
ggplot(metadata) +
+ geom_point(aes(x = sample, y= genome_size))
The labels on the x-axis are quite hard to read. To do this we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:
There are built-in themes we can use, or we can adjust specific elements.
For our figure we will change:
aes() since the value is not mapping to a variable.# First let's change the shape of the symbols
ggplot(metadata, aes(x = sample, y= genome_size, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) # put x-axis on 45 degree angle
# Then add the change in color due to generations
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
# Why data type matters!
metadata$genome_size_fact <- as.factor(metadata$genome_size)
ggplot(metadata, aes(x = sample, y= genome_size_fact, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
## Let's look at genome_size by generation
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
# Zoom in on a certain part of the plot with setting the xlim
summary(metadata$generation) # let's plot the mean to max
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
xlim(mean(metadata$generation), max(metadata$generation))
## Flip the plot with coord_flip!
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
coord_flip()
There are two ways in which figures and plots can be output to a file (rather than simply displaying on screen).
png or pdf and selecting the directory to which you wish to save it to.pdf(), png(), jpeg(), tiff()etc.Initialize a plot that will be written directly to a file using pdf(), png(), jpeg(), tiff() etc.
Within the function you will need to:
dev.off() function.Make a new folder called figures to store plots in.
pdf("figures/genome_size_vs_generation.pdf", width = 7, height = 5)
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
dev.off()
To plot a histogram we require another geometric object geom_bar(), which requires a statistical transformation. Some plot types (such as scatterplots) do not require transformations, each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. need to be transformed, and usually has a default statistic that can be changed via the stat_bin argument.
ggplot(metadata, aes(x = genome_size)) +
geom_bar() + xlab("Genome Size") + ylab("Count")
Try plotting with the default value and compare it to the plot using the binwidth values. How do they differ?
ggplot(metadata, aes(x = genome_size)) +
geom_bar(stat = "bin", binwidth=0.01)
# Add black and white background
ggplot(metadata, aes(x = genome_size)) +
geom_bar(stat = "bin", binwidth=0.01) + theme_bw()
# make a pretty color
ggplot(metadata, aes(x = genome_size)) +
geom_bar(stat = "bin", fill = "cornflowerblue", color = "black", binwidth=0.01) + theme_bw()
# color by citrate mutant
ggplot(metadata, aes(x = genome_size, fill = cit)) +
geom_bar(stat = "bin", color = "black", binwidth=0.01) +
theme_bw()
# Move the legend within
ggplot(metadata, aes(x = genome_size, fill = cit)) +
geom_bar(stat = "bin", color = "black", binwidth=0.01) +
theme_bw() +
theme(legend.position = c(0.8, 0.8)) # (x,y) Think of each side of the plot to be 1
Challenge:
Re-create the histogram plots from above. This time, save each of the plots that you make in the
figuresfolder.
Let’s try plotting a boxplot similar to what we had done using the base plot functions at the start of this lesson. We can add some additional layers to include a plot title and change the axis labels. Explore the code below and all the different layers that we have added to understand what each layer contributes to the final graphic.
ggplot(metadata, aes(x = cit, y = genome_size, fill = cit)) +
geom_boxplot() +
ggtitle("Boxplot of genome size by citrate mutant type") +
xlab("citrate mutant") +
ylab("genome size") +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=45, hjust=1),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
Box and whisker plots are nice, but maybe we would prefer to calculate the mean and standard deviation of the genome_size within each of the citrate mutants with error bars.
To plot scatter plots with error bars, we use geom_errobar().
geom_errorbar(width = 0.1, aes(ymin = mean - sd, ymax = mean + sd))
width argument will make wider error bars.This means that we will have to calculate both the mean and the standard deviaition.
Exercise:
Work with your partner to do the following:
- Calculcate the
mean()of the genome_length for each of the citrate mutants.- Calculate the standard deviation with
sd()of the genome_length for each of the citrate mutants.- Plot cit (x-axis) and the mean (y-axis) with errorbars.
One of the best features of ggplot is to be able to facet plots. When you facet plots you split up your data by one or more variables and plot the subsets of data together.
Two ways to facet:
facet_grid: Input values for vertical ~ horizontal.facet_wrap: Instead of faceting with a variable in the horizontal or vertical direction, facets can be placed next to each other, wrapping with a certain number of columns or rows. The label for each plot will be at the top of the plot.Tip: Facets are a great way to explore your data!
# Break up 3 groups
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
facet_grid(cit ~ .) # To make 3 plots in one! Remember, vertical ~ horizontal.
# This plot would be more useful.
# Break up 3 groups on the x-axis by citrate mutant status along the x-axis.
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
facet_grid(. ~ cit) # To make 3 plots in one! Remember, vertical ~ horizontal.
# Let's make the plot a little prettier
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
facet_grid(. ~ cit, scales = "free_x") # Delete all the extra samples
Challenge:
- Create a faceted plot of sample (x-axis) and genome size (y-axis) broken up by clade. Make the colors of the plot to represent each clade.
- Create another faceted plot with the clade (x-axis) and mean genome size on the (y- axis).
- Create another facetted plot with sample (x-axis) and genome size (y-axis) broke up by generation. Color the points by the citrate mutant. What does this tell us about what “unknown” means?