Introduction

This exercise is meant to challenge you to build a dataframe by hand in R. It is based off of the Table 1 in Drake (1991) “A constant rate of spontaneous mutation in DNA-base microbes” (PNAS August 15, 1991 88 (16) 7160-7164; https://doi.org/10.1073/pnas.88.16.7160)

There are two parts to this tutorial. This versions will guide you through the process of building the dataframe and exploring it. A second version is designed to challenge your understanding of the steps we went through.

Functions, concepts and vocab

Functions

c() rep() is is.vector is.matrix class length nchar == round() data.frame() log() log10() ggpubr::ggscatter()

Concepts:

Scientific notation in R Encoding information on a plot with size and color

Vocab

smoother correlation

Packages

ggpubr

Preliminaries

Non-R materials

For this you’ll need a copy of Drake (1991), which can be found at https://www.pnas.org/content/88/16/7160

We will focus on rebuilding Table 1 and Figure 1 entirely in R without importing the data from a spreadsheet.

Packages

The only package you will need is ggpubr. Load ggpubr using library():

# enter command here
library("ggpubr")

The data

Table 1 of Drake (1991) has the following columns. Note that the symbol mu (u) is frequently used to represent mutation rates.

  1. A list of organisms (viruses, bacteria and invertebrate eukaryotes)
  2. genome sizes (G) for each organism, in scientific notation
  3. A target gene, which was assessed for the occurrence of mutations
  4. An estimate of the mutation rate per bases (bp) of the genome. This gets log transformed and used as the y variable in Figure 5. I’ll use the symbol u.bp for this variable
  5. An estimate of the mutation rate per genome (u.g). This variable isn’t in scientific notation so we’ll work with it first because its a bit easier to read.

The organism column has gaps in it because there are multiple target genes for these species

  1. E. coli genes
  2. S. cerevisiae genes
  3. N. crassa genes

Note that the very last numbers that appear in the mutation columns are means, not data!

Creating vectors in R

We can build up a dataframe in R by making a vector for each column. This is a bit tedious but will build up our skills, so be patient.

We can make a vector of organism names like this. We need to make sure each word representing an organisms is in quotes, and that there is a comma after each word.

org <- c("M13","lamb", "T2", "T4", "EC","EC","EC","SC","SC","SC","NC","NC")

We can use line breaks while defining an R object, which can make it easier to keep track of what were typing. The following code is equivalent to what we just ran. Because there are multiple E. coli (EC) , SC and NC, I’m putting them on their own lines.

org <- c("M13", "lamb", "T2", "T4", 
         "EC", "EC", "EC",
         "SC", "SC", "SC",
         "NC","NC")

If you want to be fancy, use rep() for the repeated elements.

org <- c("M13","lamb", "T2", "T4",
         rep("EC",3),
         rep("SC",3),
         rep("NC",2))

As always we need to check to make sure that the object we’re making is what we think it is and represents what we want. Use the following commands to check on the object:

# practice commands here
is(org) #prints different data types per org vector
## [1] "character"           "vector"              "data.frameRowLabels"
## [4] "SuperClassMethod"
is.vector(org) #prints TRUE
## [1] TRUE
is.matrix(org) #prints FALSE
## [1] FALSE
class(org) #prints character data type
## [1] "character"
length(org) #prints length of org, which is 12
## [1] 12
nchar(org) #prints number of character per each variables, which is 3 4 2 2 2 2 2 2 2 2 2 2
##  [1] 3 4 2 2 2 2 2 2 2 2 2 2

Next let’s make the next vector. Below is the code you’ll need, but I’ve left off the last 3 numbers. Paste the code into the code chunk and add the necessary numbers from the original table in Drake (1991). Don’t forget the commas! Also note that numbers don’t get quotes around them.

G.p <- c(6.41, 4.85, 1.60, 1.66, 4.70, 4.70, 4.70, 1.38, 1.38 )

Compare each genome size to the vector “org” in Drake (1991) org <- c(“M13”, “lamb”, “T2”, “T4”, “EC”, “EC”, “EC”, “SC”, “SC”, “SC”, “NC”,“NC”)

# Write the code to make the G.p vector here
# be sure to add the necessary numbers

G.p <- c(6.41, 4.85, 1.60, 1.66, 
         4.70, 4.70, 4.70, 
         1.38, 1.38, 1.38, 
         4.19, 4.19)

Again, we explore this. Run these functions and note the output

# Type code to explore the G.p vector here
is(G.p) #prints different data types per G.p vector
## [1] "numeric" "vector"
class(G.p) #prints "numeric" data type
## [1] "numeric"
length(G.p) #prints length of G.p, which is 12
## [1] 12
nchar(G.p) #prints n characters per each item in G.p
##  [1] 4 4 3 4 3 3 3 4 4 4 4 4
# This equates to 4 4 3 4 3 3 3 4 4 4 4 4

Checking the length of vectors

Vectors in R have length, not dimension, so to see how big they are we use the length() command; dim() won’t work. Neither will ncol() or nrow().

All of our columns have the same number of elements in them. We can make sure we don’t have any errors in our data entry by using a logical operation, ==, to confirm that the length of each vector is the same. This is very useful when vectors are big and/or when we’re writing code to automate a process

First, check the length of one of the vectors

# Confirm the length of of one of the vectors
length(org)
## [1] 12
length(G.p)
## [1] 12
length(org) == length(G.p) #prints TRUE if vector lengths are equal
## [1] TRUE
# if vectors are equivalent then prints "Vectors are equivalent
if (length(org) == length(G.p))
  print("Vectors are the same length")
## [1] "Vectors are the same length"
# This function equalLengthVector, takes to vectors to see if they are equal
# This function is redundant, but practice w/ function making

equalLengthVector <- function(vector1, vector2) {
  if (length(vector1) == length(vector2)) {
  print("Vectors are the same length")
  } else {
    print("Vectors are different lengths")
  }
}

equalLengthVector(org, G.p)
## [1] "Vectors are the same length"

Next, run this code to check that the two lengths are the same: length(G.p) == length(org)

# Check that the lengths are the same
length(G.p) == length(org)
## [1] TRUE

Exponents and scientific notation

To take the exponent of something in R use the up caret ^

So, if I want to type 1 million I can do this

standardNotation <- 1000000

or I can do this

scientificNotation <- 1*10^6

Note that the output is in “e” notation. I could type this if I wanted

eNotation <- 1e+06

In the chunk below, write a logical expression using == to confirm that 1000000 is equal to 1*10^6. Write another one to check that 1e+06 is the same as 1000000.

### Using "==" operator to see if the different numeric objects are equivalent
standardNotation == scientificNotation
## [1] TRUE
standardNotation == eNotation
## [1] TRUE
eNotation == scientificNotation 
## [1] TRUE

To represent the genome size column I could do this (just showing first 3 numbers)

G <- c(6.41*10^5, 4.85*10^4, 1.60*10^5)

I could also do this

G.with.e <- c(6.41e+05, 4.85e+04, 1.60e+05)

I can confirm that that they are the same like this

G == G.with.e
## [1] TRUE TRUE TRUE

When typing up this data I found it easier to break up the columns containing exponents into multiple columns, otherwise I was getting cross eyed.

One way to do this would be to split each number into what I’m calling a “prefix” and a “suffix”. Above we made a vector called G.p, which is the prefix of these numbers - the part before the multiplication symbol.

We can make another vector that contains the suffix - the 10^x part (or the 1e+0x part).

# G.p takes the constant of Genome size in Drake (1991)
# G.s is the multiplier 10^x for each item in G.p
G.s <- c(10^3, 10^4, 10^5, 10^5, 
         10^6, 10^6, 10^6, 10^7,
         10^7, 10^7, 10^7, 10^7)

G.s2 <- c(1e+03, 1e+04, 1e+05, 1e+05, 
         1e+06, 1e+06, 1e+06, 1e+07,
         1e+07, 1e+07, 1e+07, 1e+07)

We can then multiple these together and get the value we want

# Using the "^" notation to find genome size
G.p*G.s
##  [1]     6410    48500   160000   166000  4700000  4700000  4700000 13800000
##  [9] 13800000 13800000 41900000 41900000
# Using the "e" notation to find genome size
G.p*G.s2
##  [1]     6410    48500   160000   166000  4700000  4700000  4700000 13800000
##  [9] 13800000 13800000 41900000 41900000
#Using the "==" operator to check if each item is equivalent, should print TRUE for each
(G.p*G.s) == (G.p*G.s2)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Note that this is doing is multiplying each element of G.p by the corresponding element of G.s. We could break this up like this

G.p[1]*G.s[1]
## [1] 6410
G.p[2]*G.s[2]
## [1] 48500

This works pretty well, but I found typing all those exponents to still be a little annoying, especially since many get repeated. I think its easier to just type the exponent. We’ll call this object “G.exp” for exponent

G.exp <- c(3, 4, 5, 5, 
           6, 6, 6, 7, 
           7, 7, 7, 7)
G.exp
##  [1] 3 4 5 5 6 6 6 7 7 7 7 7

Again, if you want to be fancy you can play around with the rep() function, though this is optional

G.exp <- c(3, 4,    
         rep(5, 2),
         rep(6, 3),
         rep(7, 5))
G.exp
##  [1] 3 4 5 5 6 6 6 7 7 7 7 7

Now, to get the numbers we want, we can do this: G.p*10^G.exp Can you figure out what’s going on? If not, check the next chunk

### Does the same as r chunks above, but does not require copying 10^
G.p*10^G.exp
##  [1]     6410    48500   160000   166000  4700000  4700000  4700000 13800000
##  [9] 13800000 13800000 41900000 41900000

Here I’ve added some parentheses to clarify the order of operations

 G.p*(10^G.exp)
##  [1]     6410    48500   160000   166000  4700000  4700000  4700000 13800000
##  [9] 13800000 13800000 41900000 41900000

If its still not clear, then maybe this will help:

# G.s vector and 10^G.exp are equivalent 
G.s == 10^G.exp
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Let’s make a final G vector with genome sizes. This is the “prefix” I made (G.p) multiplied by 10 raised to the exponent I defined in the vector G.exp.

G <-  G.p*(10^G.exp)

G
##  [1]     6410    48500   160000   166000  4700000  4700000  4700000 13800000
##  [9] 13800000 13800000 41900000 41900000

We can make a basic plot showing the distribution of genome sizes using the histogram function in R, hist().

hist(G)

The far left value is 0e+00, which just means 0.0. The first tick mark is 1e+07. Write this out using an exponent symbol ^ below:

Swapping “e” Notation for “^” notatoin

## Change from 0e+00 --> 0.0 and 1e+07 --> 1.0 * 10^7
hist(G, labels = units)

units <- c("0.0", "1 * 10^7", "2 * 10^7", "3 * 10^7", "4 * 10^7")

How do you interpret this graph? If you aren’t comfortable interpreting histograms see datavizcatalogue.com/methods/histogram.html and https://en.wikipedia.org/wiki/Histogram

Another way to look at this data would be with a boxplot

boxplot(G)

This is a ugly plot, but whatever.

hist() and boxplot() are great for quick and dirty plots. Normally though I’ll use ggpubr to make plots using gghistogram() and ggboxplot().

Some more vectors

Next we have the per genome mutation rate. The “u” is a stand-in for “mu” which is commonly used for mutation rates. So “u.bp” reads “mutation rate in bp.” First let’s do the “prefix” and assign it to a vector u.bp.p (mutation rate in bp prefix).

u.bp.p <- c(7.2, 7.7, 2.7, 2, 4.1, 6.9, 5.1, 2.8, 7.9, 1.7, 4.5, 4.6)

Confirm that this vector is the exact same size as our previous ones (Hint: requires “=”)

### Confirm the length by using the function equalLengthVector

# Comparing G.s and u.bp.p vectors
equalLengthVector(G.s, u.bp.p)
## [1] "Vectors are the same length"
# Comparing G.p and u.bp.p vectors
equalLengthVector(G.p, u.bp.p)
## [1] "Vectors are the same length"
# Comparing G.p and u.bp.p vectors
equalLengthVector(G, u.bp.p)
## [1] "Vectors are the same length"

Confirm that what you just made was a vector. (Hint: what “is” it?)

### Using is() function we can compare if u.bp.p is a vector
# We can use is(object, "vector")
is(u.bp.p, "vector") # prints TRUE
## [1] TRUE

Now let’s do the exponent and save it to a vector u.bp.exp.

u.bp.exp <- c(-7, 
            -8, -8,-8,
            -10,-10,-10,-10,
            -9, -10, -11, -10)

For those who want to be fancy, can you do this using the rep() command? (Totally optionally).

As before, make sure this new vector u.bp.exp is the right length

# is vector length equal to vector G
equalLengthVector(G, u.bp.exp) 
## [1] "Vectors are the same length"
# is vector length equal to vector above u.bp.p
equalLengthVector(u.bp.p, u.bp.exp) 
## [1] "Vectors are the same length"

Now, take the final column by carrying out the math. To do this put the following components together correctly

  • The “prefix” u.bp.p
  • The caret ^
  • The exponent u.bp.exp

Assign the result to an object u.bp. Try to type this out yourself, or cut and paste it if needed:

u.bp <- u.bp.p*10^u.bp.exp

### ADD CODE HERE
u.bp <- u.bp.p*10^u.bp.exp
u.bp
##  [1] 7.2e-07 7.7e-08 2.7e-08 2.0e-08 4.1e-10 6.9e-10 5.1e-10 2.8e-10 7.9e-09
## [10] 1.7e-10 4.5e-11 4.6e-10

As always, check that the size of u.bp is correct

### ADD CODE HERE
# comparing vectors u.bp to vector G
equalLengthVector(u.bp, G)
## [1] "Vectors are the same length"

The last column of Drake (1991) Table 1 is the per-genome mutation rate, u.g.

u.g <- c(0.0046, 0.0038, 0.0043, 
         0.0033, 0.0019, 0.0033, 
         0.0024, 0.0038, 0.11,
         0.0024, 0.0019, 0.019)

We can confirm that the vector is that same size as the previous ones

length(G.s) == length(u.g)
## [1] TRUE

I’m not 110% sure about the methods for this paper. I’m wondering if the per genome mutation rate u.g can be calculated from the per base pair mutation rate (u.bp) and the genome size (G). How do you think you could do this calculation in terms of the biology, the math, and in R? Save the output as an object u.g.2 (2nd version of u.g).

Hint: Don’t over think it; there’s just one mathematical operation, either multiplication or division (The answer is a bit further down).

### Genome multiplied by the mutation rate will give us how many base pairs mutate
u.g.2 <- G * u.bp

u.g.2
##  [1] 0.0046152 0.0037345 0.0043200 0.0033200 0.0019270 0.0032430 0.0023970
##  [8] 0.0038640 0.1090200 0.0023460 0.0018855 0.0192740

Some more hints: G is the size of each genome. So the first genome in our vector has the size in base pairs (bp).

G[1]
## [1] 6410

u.bp is the rate of mutations per base pairs. The mutation rate per base pairs for the first genome in the vector is

u.bp[1]
## [1] 7.2e-07

So, if we want to determine the number of mutations per genome, we multiply the size of the genome in bp by the per bp rate of mutation

G[1]*u.bp[1]
## [1] 0.0046152

We can compare this to what is in the table as the per genome rate of mutation (u.g)

u.g[1]
## [1] 0.0046

We can do all the multiplication of the whole vector in one shot like this: u.g.2 <- G*u.bp

Try to type this first; if you have trouble, copy and paste this into the code chunk below.

### ADD CODE HERE
u.g.2 <- G * u.bp

So we have the value of u.g from the column in Table 1 that we typed up above, and now we have our attempt to re-calculate it. One way to check them against each other is to make a simple dataframe and compare by eye:

# Compare col1 to col2 to see how close the mutation rate is per b.p per each individual genome
data.frame(u.g, u.g.2)
##       u.g     u.g.2
## 1  0.0046 0.0046152
## 2  0.0038 0.0037345
## 3  0.0043 0.0043200
## 4  0.0033 0.0033200
## 5  0.0019 0.0019270
## 6  0.0033 0.0032430
## 7  0.0024 0.0023970
## 8  0.0038 0.0038640
## 9  0.1100 0.1090200
## 10 0.0024 0.0023460
## 11 0.0019 0.0018855
## 12 0.0190 0.0192740

This looks pretty close. Let’s round things off so its easier to read. Rounding is done with the round() command.

To round something in R, we need to give it a number (or numbers in a vector) and an indication of how many digits to round off to. In the original table the numbers are rounded off to 4 digits

round(0.0037, 4)
## [1] 0.0037

Instead of u.g.2, run the previous data.frame code with round(u.g.2, 4). That is, instead of u.g.2, put in round(u.g.2, 4)

### ADD CODE HERE
# Compare col1 to col2 w/ same sig figs and compare mutation rates of b.p per genome
data.frame(round(u.g.2, 4), u.g)
##    round.u.g.2..4.    u.g
## 1           0.0046 0.0046
## 2           0.0037 0.0038
## 3           0.0043 0.0043
## 4           0.0033 0.0033
## 5           0.0019 0.0019
## 6           0.0032 0.0033
## 7           0.0024 0.0024
## 8           0.0039 0.0038
## 9           0.1090 0.1100
## 10          0.0023 0.0024
## 11          0.0019 0.0019
## 12          0.0193 0.0190

Things don’t round off perfectly, perhaps because of other rounding they did during their workflow. But there’s something weird with the first number. Either I made a type or something else is up. Let me know what you think!

Original genomes could have been rounded after producing u.g values.

Another useful comparison here is ==. See if you can predict what will happen when you run:

u.g == round(u.g.2, 4)

### ADD CODE HERE
u.g == round(u.g.2, 4) # in theory should be true, but w/ rounding errors calculated mutation rates
##  [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
# are slightly different from theoretical mutation rates

When doing comparisons like this, why might it be really important to consider rounding?

Given the genome is very large, slight deviations in rounding will causes more or less bases to become mutated during simulation of different genomes.

Make a dataframe

We can put all our pieces together into a dataframe like this with data.frame():

table1 <- data.frame(organism = org,
           G = G,
           target = NA,
           u.bp = u.bp,
           u.g = u.g)

table1
##    organism        G target    u.bp    u.g
## 1       M13     6410     NA 7.2e-07 0.0046
## 2      lamb    48500     NA 7.7e-08 0.0038
## 3        T2   160000     NA 2.7e-08 0.0043
## 4        T4   166000     NA 2.0e-08 0.0033
## 5        EC  4700000     NA 4.1e-10 0.0019
## 6        EC  4700000     NA 6.9e-10 0.0033
## 7        EC  4700000     NA 5.1e-10 0.0024
## 8        SC 13800000     NA 2.8e-10 0.0038
## 9        SC 13800000     NA 7.9e-09 0.1100
## 10       SC 13800000     NA 1.7e-10 0.0024
## 11       NC 41900000     NA 4.5e-11 0.0019
## 12       NC 41900000     NA 4.6e-10 0.0190

For each column I am specifying a name and telling it the vector to turn into a column. For target I’m telling it just to fill it in with NA, which means there is no data

As always, we want to check what we just made. Run at least 2 commands exploring the size, share or content of the dataframe

### ADD CODE HERE
# Check the dimension of the data.frame w/ dim() -> row size, col size
dim(table1)
## [1] 12  5
# What data structures are being table1
is(table1)
## [1] "data.frame" "list"       "oldClass"   "vector"
# What data type is being stored in table1
class(table1)
## [1] "data.frame"

Make Figure 1 with base R

We can make figure 1 with the basic R plot() command. FIrst let’s plot the raw data.

# Modify with log() function on 
# both G and u.bp vectors.
plot(log(u.bp) ~ log(G), data = table1)

If you look at the figure its actually on the log scale. We can nest the log function within the plot function like this

plot(log(u.bp) ~ log(G), data = table1)

Check out the y axis and the x axis of the plot and compare it to the figure (get the figure at https://www.pnas.org/content/88/16/7160). What’s wrong?

R’s default log() functions uses the natural log ln. To do the base 10 log change the code to use log10()

### ADD CODE HERE
# Must change the y-axis and x-axis to match chart figure
# The function we are doing above w/ log() is the natural log (ln). log10 is = 10^x

# Also, add a line of best fit w/ abline(lm(y-axis object ~ x-axis object))
plot(log10(u.bp) ~ log10(G), data = table1, names(org))
abline(lm(log10(u.bp) ~ log10(G)), col="black") # regression line
text(log10(u.bp) ~ log10(G), labels=org, font = 2) # Can bold when font = 2

# When the genome is larger their are less mutations base pair mutations

Make Figure 1 with ggpubr

ggpubr makes nicer plots than base R, but you can’t nest functions in it. So what we need to do is make a column of the logged variables. For Genome size it requires this:

table1$log10.G <- log10(table1$G)

table1$log10.G
##  [1] 3.806858 4.685742 5.204120 5.220108 6.672098 6.672098 6.672098 7.139879
##  [9] 7.139879 7.139879 7.622214 7.622214

Now make a new column for logged u.bp, using log10. Call it log10.u.bp. Try to type it out yourself first.

Here’s the code if you need to copy and paste it:

table1\(log10.u.bp <- log10(table1\)u.bp)

# Make a col for the log10 of basepair mutation (calculated) rate per genome
table1$log10u.bp <- log10(table1$u.bp)
table1$log10u.bp
##  [1]  -6.142668  -7.113509  -7.568636  -7.698970  -9.387216  -9.161151
##  [7]  -9.292430  -9.552842  -8.102373  -9.769551 -10.346787  -9.337242

While we’re at it lets make a log10 mutation rate per genome column called log10.u.g

table1\(log10.u.g <- log10(table1\)u.g) `

table1$log10u.g <- log10(table1$u.g)
table1$log10u.g
##  [1] -2.3372422 -2.4202164 -2.3665315 -2.4814861 -2.7212464 -2.4814861
##  [7] -2.6197888 -2.4202164 -0.9586073 -2.6197888 -2.7212464 -1.7212464

Check our columns

# Presents length class and Mode of col of log10 calculated bp mutation rate

# summary of given bp mutation rate
summary(table1$log10u.g)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.7212 -2.6198 -2.4509 -2.3224 -2.3592 -0.9586
# summary of genomes
summary(table1$log10.G)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.807   5.216   6.672   6.300   7.140   7.622
# summary of calculated bp mutation rate
summary(table1$log10u.bp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -10.347  -9.429  -9.227  -8.623  -7.666  -6.143

We can make scatter plot using ggscatter

# import ggpubr R package
# Use ggscatter() to make a plot
# install.packages("ggpubr")
library(ggpubr)
ggpubr::ggscatter(data = table1, y = "log10u.bp",
  x = "log10.G")

We can make lots of nice tweaks with ggpubr. Let’s vary the size of our data points based on the log of the mutation rate per genome, log10.u.g. This lets assess another pattern on top of the main correlation between G and u.bp

ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g") #size

Now let’s add color based on log10.u.g also

ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g",   #size
          color = "log10u.g")  #color

The x axis isn’t very clear using the normal column name. Can you spot what is different about the code below?

ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g",   
          color = "log10u.g") +
  xlab("log10(Genome size)")

Now let’s re-label the y axis. Note the plus signs after the lines with xlab() and ylab(). What do you think they are doing?

ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g",   
          color = "log10u.g") +
  xlab("log10(Genome size)")     +
  ylab("log10(mutations per bp)")

There is a strong correlation between genome size and mutation rate (on the log scale). We can further d this pattern with a regression line. Note that within the call to ggscatter() I’ve added add = “reg.line”

ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g",   
          color = "log10u.g",
          add = "reg.line") + # Add regression line
  xlab("log10(Genome size)")     +
  ylab("log10(mutations per bp)")
## `geom_smooth()` using formula 'y ~ x'

We can also fit curved lines easily if we want. See if you can figure out what’s different about this plot. (The line I’ve added is called a smoother made using a specific statistical method called loess.)

ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g",   
          color = "log10u.g",
          add = "loess") + # Add line of best fit
  xlab("log10(Genome size)") +
  ylab("log10(mutations per bp)")
## `geom_smooth()` using formula 'y ~ x'

Adding Statistics

We can do some quick statistics on the fly here by adding cor.coef =TRUE. What this does is calculates the strength of the relationship between the x and the y variable (on the log10) scale, quantifies it with a statistics R, and calculates a p-value to give us a sense of whether R is not 0. This is a very strong relationship, so R is almost -1 (R varies between -1 and 1) and the p-value is very small.

(Note that the technical definition of a p-value has a very precise meaning; consider p < 0.05 to be “significant”. This means that the pattern we’re seeing in the data is unlikely to have occurred due to chance; however, is does not mean that if we repeated the study with a new data set that we would not get a different result. That is, it does not mean that our conclusions are true!)

ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g",   
          color = "log10u.g",
          add = "reg.line",
          cor.coef =TRUE) +
  xlab("log10(Genome size)") +
  ylab("log10(mutations per bp)")
## `geom_smooth()` using formula 'y ~ x'

# p-value is 2.3e-05 or 2.3 x 10^-5 is significant.
# Relation is also strongly negatively correlated with R = -0.92 being close to -1. 

Adding labels to ggscatter

# Can add organisms labels with label = org
ggscatter(data = table1,
          y = "log10u.bp",
          x = "log10.G",
          size = "log10u.g",  
          label = org,
          color = "log10u.g",
          add = "reg.line",
          cor.coef =TRUE) +
  xlab("log10(Genome size)") +
  ylab("log10(mutations per bp)")
## `geom_smooth()` using formula 'y ~ x'