R Markdown Tutorial

This is a tutorial on how to use r markdown for reproducible research

Here we can type long descriptions of our data without adding a hashtag to the beginning of the comments

In our first example we will be using the toothgrowth dataset. In this experiment, guinea pigs were given different amounts of vitamin c to see the effects on the animals’ tooth growth.

To run R code in an R markdown file, we need to denote the section that is considered R code. We call these sections “code chunks”. to call a code, use the three ` marks followed by the r. to end the code, use three more of the tick marks.

Below is a code chunk:

toothdata <- ToothGrowth
head(toothdata)

By running the “play” button on the code chunk, the results are printed in line of the r markdown file.

fit <- lm(len ~ dose, data = toothdata)
b<- fit$coefficients
plot(len ~ dose, data = toothdata)
abline(lm(len ~ dose, data = toothdata))

The slope of the regression line is 9.7635714.

Section Headers

We can also put sections and subsections in our r markdown file, similar to numbers or bullet points in a word document. This is done with the “#” that we previously used to denote text in the r script. *You have to have a space after the hashtags. If you don’t have a space, it doesn’t work.

First Level Header only uses one “#” before the text

Second Level Header uses two “#” before the text

Third Level Header uses three “#” before the text

Bullet Points

We can also add bullet points in our r markdown file. To do this, you use a dash then space. You also need a space between the lines of text. Indentation matters.

  • one item
  • one item
  • one item
    • one more item
    • one more item
      • one last item

Numeric Lists

We can also do numeric lists.

  1. first item
  2. second item
  3. third item
  1. subitem 1
  2. subitem 2
  3. subitem 3

Block Quotes

We can put really nice quotes into the markdown document. We do this by using the “>” symbol.

“Genes are like the story, and DNA is the language that the story is written in.”

— Sam Kean

Formulas

We can also put nicely formatted formulas into r markdown using dollar signs.

Hardy-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\]

You can also get really complex.

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

LaTex cheat sheet contains a variety of different codes for the r markdown.

Code Chunks

Code Chunk Options

There are options on your r markdown file on how knitter interprets the code chunks. There are the following options.

Eval (T or F): Whether or not to evaluate the code chunk

print("Hello World")

Echo (T or F): Whether or not to show the code chunk, but the result is still shown.

## [1] "Hello World"

You can use both eval and echo if you don’t want to show something yet, but you don’t want to delete it from the notes.

Cache: If enabled, the same code chunk will not be evaluated the next time the knitr is run. Useful when you have codes that have long run times. If you save something above where you cached, the value won’t change bc that code chunk has been saved in place.

fig.width or fig.height: the (graphical device) size of the r plots in inches. The figures are first written to the knitr document then to files that are saved separately. This only changes the sizes of your files that are not in the knit document/ html file.

out.width or out.height: the output size of the r plots in the r document.

fig.cap: caption for the figure

Table of Contents

We can add a table of contents to our html document by altering the YAML code (the code chunk at the very top of the document). We can add this:

title: “HTMI Tutorial” author: “Bella Johnson” date: “3/21/2022” output: html_document: toc: true toc_float: true

The above will give us a floating table of contents on the left-hand side of the document.

Tabs

You can also add the tabs in your report. To do this, you need to specify each section you want to appear as a tab by by placing “{.tabset}” after the line. Every subsequent header will be a new tab. Has to be after a level one header.

Themes

You can add themes to your html document that change the highlighting color and hyperlink color of your html output. This can be nice aestheticlly. To do this, you can change your theme in YAML to any of the following:

  • cerulean
  • journal
  • flatly
  • readable
  • spacelab
  • united
  • cosmo
  • lumen
  • paper
  • sandstone
  • simplex
  • yeti
  • null

You can also change the color by specifying highlight:

  • default
  • tango
  • payments
  • kate
  • monochrome
  • espresso
  • zenburn
  • haddock
  • textmate

However, you can only do theme or highlight; you can’t do both.

Code Folding

You can use the code_folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:

code_folding: hide

R Markdown Summary

There are a ton of options and ways to customize r code using the html format. This is a great way to display a portfolio of your work if you are trying to market yourself to interested parties.

Data Wrangling With R

NAME

Load the Data

We need to first load the tidyverse package.

library(tidyverse)

Load the flights data and call the first section of it.

my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Pull Specific Flight Data

First we will just look at the data from October 14th.

filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Subset Variable

If we want to subset this into a new variable, we do the following.

  • The product of this new variable can now be seen under “Data”
oct_14_flight <- filter(my_data, month == 10, day == 14)

Filter Based on Operators

If you want to filter your code based on different operators, you can use the following:

  • Equals ==
  • Not equal to !=
  • Greater than >
  • Less than <
  • Greater than or equal to >=
  • Less than or equal to <=

For example, you could use the following code to filter the flights from Janauary - September:

(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 252,474 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Selective Filterss

If you want to be more selective, you can use logical operators:

  • and &
  • or |
  • not !

Lets use the selective functions in some examples.

  • “or” function to pick flights in March or April
march_april_flights <- filter(my_data, month == 3 | month == 4)
  • “and” function to pick flights in March and April
march_april_flights2 <- filter(my_data, month == 3 & month == 4)
  • “and” function to a flight on a certain day and month
march_4th_flights <- filter(my_data, month == 3 & day == 4)
  • “not” function to select all flights but those in January
non_jan_flights <- filter(my_data, month != 1)

Arrange Data

The “arrange” function allows us to arrange the data set based on the variables we desire.

arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We can also arrange the data in descending fashion .

descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the data frame regardless of ascending or descending.

Select Data

The “select” function to select specific columns that we want to look at.

calendar <- select(my_data, year, month, day)
print(calendar)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

Lets look at a range of columns.

calendar2 <- select(my_data, year:day)
print(calendar2)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

Lets look at all columns months through carrier

calendar3 <- select(my_data, year:carrier)

We can also choose which columns not to include

everything_else <- select(my_data, -(year:day))

In the following example, you can also use the “not” operator. However, this does not always apply.

everything_else2 <- select(my_data, !(year:day))

Helper Functions

There are other helper functions that can help you select the columns or data that you’re looking for

  • starts_with(“xyz”) - will select the values that start with xyz
  • ends_with(“xyz”) - will select the values that end with xyz
  • contains(“xyz”) - will select the values that contain xyz
  • matches(“xyz”) - will match the identical value xyz

Renaming

Using the “renaming” function allows you to alter your data. However, this is a permanent change.

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We will now use the “rename” function change “dep_time” in the data above to “departure_time”.

rename(my_data, departure_time = dep_time)
## # A tibble: 336,776 × 19
##     year month   day departure_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>          <int>          <int>     <dbl>    <int>
##  1  2013     1     1            517            515         2      830
##  2  2013     1     1            533            529         4      850
##  3  2013     1     1            542            540         2      923
##  4  2013     1     1            544            545        -1     1004
##  5  2013     1     1            554            600        -6      812
##  6  2013     1     1            554            558        -4      740
##  7  2013     1     1            555            600        -5      913
##  8  2013     1     1            557            600        -3      709
##  9  2013     1     1            557            600        -3      838
## 10  2013     1     1            558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
my_data <- rename(my_data, departure_time = dep_time)

Mutate Function

By using the “mutate” function, we can alter data that is already in R. We will use this function to add new columns to our data frame.

First, we will make smaller data frames. This will allow us to see what we are doing.

my_data_small <- select(my_data, year:day, distance, air_time)

Lets calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)
## # A tibble: 336,776 × 6
##     year month   day distance air_time speed
##    <int> <int> <int>    <dbl>    <dbl> <dbl>
##  1  2013     1     1     1400      227  370.
##  2  2013     1     1     1416      227  374.
##  3  2013     1     1     1089      160  408.
##  4  2013     1     1     1576      183  517.
##  5  2013     1     1      762      116  394.
##  6  2013     1     1      719      150  288.
##  7  2013     1     1     1065      158  404.
##  8  2013     1     1      229       53  259.
##  9  2013     1     1      944      140  405.
## 10  2013     1     1      733      138  319.
## # … with 336,766 more rows
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

Now, lets create a new data frame with only our calculations using the “transmute” function.

airspeed <- transmute(my_data_small, speed = distance / air_time * 60)
airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)

Summarize

We can use the function “summarize: to run a function on a data column to get a single return.

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

We can above here that the avg delay is 12.6 minutes

We gain additional value in summarize by pairing it with “by_group()”.

by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # … with 355 more rows

As you can see, we now have the delay by the days of the year.

Missing Data

If you do not tell R what to do with the missing data, it will show up in the output as “NA”.

summarize(by_day, delay = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # … with 355 more rows

We can filter our data based on NA

  • In this data set, NA is cancelled flights
is.na(airspeed)
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

Lets run summarize on the above data. The product shown will be the average of delayed flights.

summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

Counts

We can count the number of variables that are NA by using the “sum” function.

sum(is.na(my_data$dep_delay))
## [1] 8255

We can also count the number of variables that are not NA by combining the “sum” and the “not” functions.

sum(!is.na(my_data$dep_delay))
## [1] 328521

Piping

With tibble data sets, we can pipe results to get rid of the need to use the dollar sign.

We can then summarize the number of flights by minutes delayed.

my_data %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(departure_time, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 1385.
##  2  2013     1     2 1354.
##  3  2013     1     3 1357.
##  4  2013     1     4 1348.
##  5  2013     1     5 1326.
##  6  2013     1     6 1399.
##  7  2013     1     7 1341.
##  8  2013     1     8 1335.
##  9  2013     1     9 1326.
## 10  2013     1    10 1333.
## # … with 355 more rows

Tidyverse

Tibbles

Tibbles are a modified data frame which tweaks some of the older features from data frames. R is an old language, and useful things from 20 years ago are not useful anymore.

For example, we will use the “tibble” function on the iris data set. As you will see, it is the same data frame but with different features.

as_tibble(iris)
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

You can create a tibble from scratch with the function “tibble()”.

tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

You can also use the function “tribble()” for basic data table creation.

tribble(
  ~ genea, ~geneb, ~genec,
  #######################
  110,     112,    114,
  6,       5,      4
)
## # A tibble: 2 × 3
##   genea geneb genec
##   <dbl> <dbl> <dbl>
## 1   110   112   114
## 2     6     5     4

Tibbles are built to not overwhelm your console when printing data, only showing the first few lines.

This is how a data frame prints.

print(by_day)
as.data.frame(by_day)
head(by_day)

nycflights13:: flights %>%
  print(n=10, width = Inf)

Subsetting

Subsetting tibbles is easy; it is similar to data frames. A tibble of the nycflights13 data can be seen below.

df_tibble <- tibble(nycflights13::flights)
df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Subset by Column Name

We can subset by column name using the “$”.

df_tibble$carrier

Subset by Position

We can also subset by the position using “[]”

df_tibble[[2]]

If you want to use this in a pipe, you need to use the “.” placeholder

df_tibble %>%
  .$carrier

Some older functions don’t like tibbles. Thus, you might have to convert back to data frame.

class(df_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble_2)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##             time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00

Tidyr

Tidyverse Rules

The tidyverse follows 3 rules:

  1. Each variable must have its own column
  2. Each observation has its own row
  3. Each value has its own cell

However, it is impossible to satisfy two of the three rules. This leads to the following instructions for tidy data:

  1. Put each data set in a table
  2. Put each variable into a column
  3. Profit

Picking one consistent method of data storage makes for easier understanding of your code and what is happening behind the scenes.

Create a Tidy Data Set

To create a Tidy data set, you will first need to load the tidyverse package.

library(tidyverse)

Lets now look at working with tibbles.

bmi <- tibble(women)

bmi %>%
  mutate(bmi = (703 * weight)/(height)^2)
## # A tibble: 15 × 3
##    height weight   bmi
##     <dbl>  <dbl> <dbl>
##  1     58    115  24.0
##  2     59    117  23.6
##  3     60    120  23.4
##  4     61    123  23.2
##  5     62    126  23.0
##  6     63    129  22.8
##  7     64    132  22.7
##  8     65    135  22.5
##  9     66    139  22.4
## 10     67    142  22.2
## 11     68    146  22.2
## 12     69    150  22.1
## 13     70    154  22.1
## 14     71    159  22.2
## 15     72    164  22.2

Gathering and Spreading

Gathering

Sometimes you will find a data set that doesn’t fit into a tibble.

For this example, we’ll use the built-in data from tidyverse.

table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

As you can see from the data, we have one variable in column A (country), but columns B and C are two of the same. Thus, there are two observations in each row.

  • To fix this, use the “gather” function.
table4a %>%
  gather("1999", "2000", key = "year", value = "cases")
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Lets look at another example.

table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

The same problem is seen in table 4b, so we will use the “gather” function to fix it.

table4b %>%
  gather("1999", "2000", key = "year", value = "population")
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

What if we want to join the two above tables?

  • Use the dplyr function
table4a <- table4a %>%
  gather("1999", "2000", key = "year", value = "cases")
table4b <- table4b %>%
  gather("1999", "2000", key = "year", value = "population")

left_join(table4a, table4b)
## Joining, by = c("country", "year")
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583

Spreading

Spreading is the opposite of gathering. For this example, we will use table2 from tidyverse.

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

As you can see, we have redundant info in columns 1 and 2. We can fix that by combining rows 1 and 2, 3 and 4, etc.

spread(table2, key = type, value = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Type is the key of what we are turning into columns. Value is what becomes rows/ observations.

Summary

In summary, spread makes long tables shorter and wider. Gather makes wide tables narrower and longer.

Separate and Pull

What happens when we have two observations stuck in one column, as seen in table 3 below?

table3
## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

As you can see, the rate is just the population and cases combined

  • We can use the “separate” function to fix this.
table3 %>%
  separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

However, the column type is not correct.

table3 %>%
  separate(rate, into = c("cases", "populate"), conver = TRUE)
## # A tibble: 6 × 4
##   country      year  cases   populate
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can specify what you want to separate based on, as seen in the example below.

table3 %>%
  separate(rate, into = c("cases", "populate"), sep = "/", conver = TRUE)
## # A tibble: 6 × 4
##   country      year  cases   populate
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Lets make the above more tidy.

table3 %>%
  separate(
    year,
    into = c("century", "year"),
    convert = TRUE,
    sep = 2
    )
## # A tibble: 6 × 4
##   country     century  year rate             
##   <chr>         <int> <int> <chr>            
## 1 Afghanistan      19    99 745/19987071     
## 2 Afghanistan      20     0 2666/20595360    
## 3 Brazil           19    99 37737/172006362  
## 4 Brazil           20     0 80488/174504898  
## 5 China            19    99 212258/1272915272
## 6 China            20     0 213766/1280428583

Missing Values

There are two types of missing values. NA (explicit) or just no entry (implicit).

gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1,2,3,4,2,3,4)
)

gene_data
## # A tibble: 7 × 3
##   gene    nuc   run
##   <chr> <dbl> <dbl>
## 1 a        20     1
## 2 a        22     2
## 3 a        24     3
## 4 a        25     4
## 5 b        NA     2
## 6 b        42     3
## 7 b        67     4
  • The nucleotide count for gene b run 2 is explicit missing
  • The nucleotide count for gene b run 1 is implicitly missing

Make Implicit Values Explicit

One way we can make implicit values explicit is by putting runs in columns.

gene_data %>%
  spread(gene, nuc)
## # A tibble: 4 × 3
##     run     a     b
##   <dbl> <dbl> <dbl>
## 1     1    20    NA
## 2     2    22    NA
## 3     3    24    42
## 4     4    25    67

Another way we can make implicit values explicit is the function “complete()”.

gene_data %>%
  complete(gene, run)
## # A tibble: 8 × 3
##   gene    run   nuc
##   <chr> <dbl> <dbl>
## 1 a         1    20
## 2 a         2    22
## 3 a         3    24
## 4 a         4    25
## 5 b         1    NA
## 6 b         2    NA
## 7 b         3    42
## 8 b         4    67

If we want to remove missing values, we can use spread and gather, and na.rm = TRUE

gene_data %>%
  spread(gene, nuc) %>%
  gather(gene, nuc, 'a':'b', na.rm = TRUE)
## # A tibble: 6 × 3
##     run gene    nuc
##   <dbl> <chr> <dbl>
## 1     1 a        20
## 2     2 a        22
## 3     3 a        24
## 4     4 a        25
## 5     3 b        42
## 6     4 b        67

Sometimes an NA is present to represent a value being carried forward

treatment <- tribble(
  ~person,        ~treatment,          ~response,
  ##############################################
  "Isaac",         1,                   7,
  NA,              2,                   10,
  NA,              3,                   9,
  "VDB",           1,                   8,
  NA,              2,                   11,
  NA,              3,                   10,
)

treatment
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Isaac          1        7
## 2 <NA>           2       10
## 3 <NA>           3        9
## 4 VDB            1        8
## 5 <NA>           2       11
## 6 <NA>           3       10

Use the “fill()” option to replace NA with your choice of text.

treatment %>%
  fill(person)
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Isaac          1        7
## 2 Isaac          2       10
## 3 Isaac          3        9
## 4 VDB            1        8
## 5 VDB            2       11
## 6 VDB            3       10

DPLYR

It is rare that you will be working with a single data table. The DPLYR package allows you to join two data tables based on common values.

  • Mutate joins- add new variables to one data frame from the matching observations in another
  • Filtering joins - filters observation from one data frame based on whether or not they are present in another
  • Set operations - treats observations as they are set elements

Lets load the packages that we will use for this example.

library(tidyverse)
library(nycflights13)

Lets pull full carrier names based on letter codes.

airlines
## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

Lets get info about airports.

airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # … with 1,448 more rows

Lets get info about planes.

planes
## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # … with 3,312 more rows

Lets get info on the weather.

weather
## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>

Lets get info on singular flights.

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Lets look at how these tables connect:

  • Flights -> planes bases on tail number
  • Flights -> airlines through carrier
  • Flights -> airports origin and destination
  • Flights -> weather via origin, year/month/day/hour

Keys

Keys are unique identifiers per observation.

  • Primary key uniquely identifies an observation in its own table
    • One way to identify a primary key is as follows:
planes %>%
  count(tailnum) %>%
  filter(n>1)
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>

This indicates that the tail number is unique.

planes
## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # … with 3,312 more rows
planes %>%
  count(model) %>%
  filter(n>1)
## # A tibble: 79 × 2
##    model       n
##    <chr>   <int>
##  1 717-200    88
##  2 737-301     2
##  3 737-3G7     2
##  4 737-3H4   105
##  5 737-3K2     2
##  6 737-3L9     2
##  7 737-3Q8     5
##  8 737-3TO     2
##  9 737-401     4
## 10 737-4B7    18
## # … with 69 more rows

Join Observations

Mutate Joins

One way to join observations is mutate joins. An example of this is seen below.

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)


flights2
## # A tibble: 336,776 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows
flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = "carrier")
## # A tibble: 336,776 × 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

We added the airline name to our data frame from the airline data frame.

Other Types of Joins

  • Inner joins (inner_join) matches a pair of observations when their key is equal
  • Outer joins (outer_join) keeps observations that appear in at least one table

Stringr

You can create strings using single or double quotes.

First, lets load the packages we need for this example.

library(tidyverse)
library(stringr)

Now we will look at examples of strings.

string1 <- "this is a string"
string2 <- "to put a 'quote' in your string, use the opposite"

string1
## [1] "this is a string"
string2
## [1] "to put a 'quote' in your string, use the opposite"

If you forget to close a string, everything following it will be included. If this happens, hit escape and try again.

Multiple strings are stored in character vectors.

string4 <- c("one", "two", "three")
string4
## [1] "one"   "two"   "three"

Measuring String Length

You can measure the length of a string by using “str_length”.

str_length(string2)
## [1] 49
str_length(string4)
## [1] 3 3 5

Combine Two Strings

You can combine two strings by using “str_combine”.

str_c("X", "Y")
## [1] "XY"
str_c(string1, string2)
## [1] "this is a stringto put a 'quote' in your string, use the opposite"

Separate Strings

You can use “sep” to control how the strings are separated.

str_c(string1, string2, sep = " ")
## [1] "this is a string to put a 'quote' in your string, use the opposite"
str_c("x", "y", "z", sep = "_")
## [1] "x_y_z"

Subsetting Strings

You can subset the string using “str_sub()”.

HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4,6)
## [1] "123" "234" "456"

This drops the first 4 letters from the string.

You can also use negatives to count back from the end of the string.

str_sub(HSP, -3, -1)
## [1] "123" "234" "456"

Convert Cases

You can convert the cases of strings by using “str_to_lower()” and “str_to_upper”.

HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_lower(HSP)
## [1] "hsp123" "hsp234" "hsp456"

Regular Gene Expression

First, lets install our necessary package.

install.packages("htmlwidgets")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

Next, lets view a string of genes.

x <- c("ATTAGA", "CGCCCCCGAT", "TATTA")
str_view(x, "G")
str_view(x, "TA")

The next step is using “.” where the “.” matches an entry.

str_view(x, ".G.")

Anchors

Anchors allow you to match at the start or the ending of the string.

Use “^” to highlight everything beginning in the selected lines.

str_view(x, "^TA")

Use “$” to highlight everything ending in the selected lines.

str_view(x, "TA$")

Character Classes/ Alternatives

The following character classes/ alternatives allow you to select certain spots in the text: - /d matches any digit - /s matches any space - [abc] matches a, b, or c

Below we will use “[abc]” to match any of the letters within the brackets.

str_view(x, "TA[GT]")

[^abc] matches anything but a, b, or c

str_view(x, "TA[^T]")

You can use | to pick between two alternatives

str_view(x, "TA[G|T]")

Detect matches

The function “str_detect()” returns a logical vector the same length of input.

y <- c("apple", "banana", "pear")
y
## [1] "apple"  "banana" "pear"
str_detect(y, "e")
## [1]  TRUE FALSE  TRUE

How many common words contain the letter e?

words
##   [1] "a"           "able"        "about"       "absolute"    "accept"     
##   [6] "account"     "achieve"     "across"      "act"         "active"     
##  [11] "actual"      "add"         "address"     "admit"       "advertise"  
##  [16] "affect"      "afford"      "after"       "afternoon"   "again"      
##  [21] "against"     "age"         "agent"       "ago"         "agree"      
##  [26] "air"         "all"         "allow"       "almost"      "along"      
##  [31] "already"     "alright"     "also"        "although"    "always"     
##  [36] "america"     "amount"      "and"         "another"     "answer"     
##  [41] "any"         "apart"       "apparent"    "appear"      "apply"      
##  [46] "appoint"     "approach"    "appropriate" "area"        "argue"      
##  [51] "arm"         "around"      "arrange"     "art"         "as"         
##  [56] "ask"         "associate"   "assume"      "at"          "attend"     
##  [61] "authority"   "available"   "aware"       "away"        "awful"      
##  [66] "baby"        "back"        "bad"         "bag"         "balance"    
##  [71] "ball"        "bank"        "bar"         "base"        "basis"      
##  [76] "be"          "bear"        "beat"        "beauty"      "because"    
##  [81] "become"      "bed"         "before"      "begin"       "behind"     
##  [86] "believe"     "benefit"     "best"        "bet"         "between"    
##  [91] "big"         "bill"        "birth"       "bit"         "black"      
##  [96] "bloke"       "blood"       "blow"        "blue"        "board"      
## [101] "boat"        "body"        "book"        "both"        "bother"     
## [106] "bottle"      "bottom"      "box"         "boy"         "break"      
## [111] "brief"       "brilliant"   "bring"       "britain"     "brother"    
## [116] "budget"      "build"       "bus"         "business"    "busy"       
## [121] "but"         "buy"         "by"          "cake"        "call"       
## [126] "can"         "car"         "card"        "care"        "carry"      
## [131] "case"        "cat"         "catch"       "cause"       "cent"       
## [136] "centre"      "certain"     "chair"       "chairman"    "chance"     
## [141] "change"      "chap"        "character"   "charge"      "cheap"      
## [146] "check"       "child"       "choice"      "choose"      "Christ"     
## [151] "Christmas"   "church"      "city"        "claim"       "class"      
## [156] "clean"       "clear"       "client"      "clock"       "close"      
## [161] "closes"      "clothe"      "club"        "coffee"      "cold"       
## [166] "colleague"   "collect"     "college"     "colour"      "come"       
## [171] "comment"     "commit"      "committee"   "common"      "community"  
## [176] "company"     "compare"     "complete"    "compute"     "concern"    
## [181] "condition"   "confer"      "consider"    "consult"     "contact"    
## [186] "continue"    "contract"    "control"     "converse"    "cook"       
## [191] "copy"        "corner"      "correct"     "cost"        "could"      
## [196] "council"     "count"       "country"     "county"      "couple"     
## [201] "course"      "court"       "cover"       "create"      "cross"      
## [206] "cup"         "current"     "cut"         "dad"         "danger"     
## [211] "date"        "day"         "dead"        "deal"        "dear"       
## [216] "debate"      "decide"      "decision"    "deep"        "definite"   
## [221] "degree"      "department"  "depend"      "describe"    "design"     
## [226] "detail"      "develop"     "die"         "difference"  "difficult"  
## [231] "dinner"      "direct"      "discuss"     "district"    "divide"     
## [236] "do"          "doctor"      "document"    "dog"         "door"       
## [241] "double"      "doubt"       "down"        "draw"        "dress"      
## [246] "drink"       "drive"       "drop"        "dry"         "due"        
## [251] "during"      "each"        "early"       "east"        "easy"       
## [256] "eat"         "economy"     "educate"     "effect"      "egg"        
## [261] "eight"       "either"      "elect"       "electric"    "eleven"     
## [266] "else"        "employ"      "encourage"   "end"         "engine"     
## [271] "english"     "enjoy"       "enough"      "enter"       "environment"
## [276] "equal"       "especial"    "europe"      "even"        "evening"    
## [281] "ever"        "every"       "evidence"    "exact"       "example"    
## [286] "except"      "excuse"      "exercise"    "exist"       "expect"     
## [291] "expense"     "experience"  "explain"     "express"     "extra"      
## [296] "eye"         "face"        "fact"        "fair"        "fall"       
## [301] "family"      "far"         "farm"        "fast"        "father"     
## [306] "favour"      "feed"        "feel"        "few"         "field"      
## [311] "fight"       "figure"      "file"        "fill"        "film"       
## [316] "final"       "finance"     "find"        "fine"        "finish"     
## [321] "fire"        "first"       "fish"        "fit"         "five"       
## [326] "flat"        "floor"       "fly"         "follow"      "food"       
## [331] "foot"        "for"         "force"       "forget"      "form"       
## [336] "fortune"     "forward"     "four"        "france"      "free"       
## [341] "friday"      "friend"      "from"        "front"       "full"       
## [346] "fun"         "function"    "fund"        "further"     "future"     
## [351] "game"        "garden"      "gas"         "general"     "germany"    
## [356] "get"         "girl"        "give"        "glass"       "go"         
## [361] "god"         "good"        "goodbye"     "govern"      "grand"      
## [366] "grant"       "great"       "green"       "ground"      "group"      
## [371] "grow"        "guess"       "guy"         "hair"        "half"       
## [376] "hall"        "hand"        "hang"        "happen"      "happy"      
## [381] "hard"        "hate"        "have"        "he"          "head"       
## [386] "health"      "hear"        "heart"       "heat"        "heavy"      
## [391] "hell"        "help"        "here"        "high"        "history"    
## [396] "hit"         "hold"        "holiday"     "home"        "honest"     
## [401] "hope"        "horse"       "hospital"    "hot"         "hour"       
## [406] "house"       "how"         "however"     "hullo"       "hundred"    
## [411] "husband"     "idea"        "identify"    "if"          "imagine"    
## [416] "important"   "improve"     "in"          "include"     "income"     
## [421] "increase"    "indeed"      "individual"  "industry"    "inform"     
## [426] "inside"      "instead"     "insure"      "interest"    "into"       
## [431] "introduce"   "invest"      "involve"     "issue"       "it"         
## [436] "item"        "jesus"       "job"         "join"        "judge"      
## [441] "jump"        "just"        "keep"        "key"         "kid"        
## [446] "kill"        "kind"        "king"        "kitchen"     "knock"      
## [451] "know"        "labour"      "lad"         "lady"        "land"       
## [456] "language"    "large"       "last"        "late"        "laugh"      
## [461] "law"         "lay"         "lead"        "learn"       "leave"      
## [466] "left"        "leg"         "less"        "let"         "letter"     
## [471] "level"       "lie"         "life"        "light"       "like"       
## [476] "likely"      "limit"       "line"        "link"        "list"       
## [481] "listen"      "little"      "live"        "load"        "local"      
## [486] "lock"        "london"      "long"        "look"        "lord"       
## [491] "lose"        "lot"         "love"        "low"         "luck"       
## [496] "lunch"       "machine"     "main"        "major"       "make"       
## [501] "man"         "manage"      "many"        "mark"        "market"     
## [506] "marry"       "match"       "matter"      "may"         "maybe"      
## [511] "mean"        "meaning"     "measure"     "meet"        "member"     
## [516] "mention"     "middle"      "might"       "mile"        "milk"       
## [521] "million"     "mind"        "minister"    "minus"       "minute"     
## [526] "miss"        "mister"      "moment"      "monday"      "money"      
## [531] "month"       "more"        "morning"     "most"        "mother"     
## [536] "motion"      "move"        "mrs"         "much"        "music"      
## [541] "must"        "name"        "nation"      "nature"      "near"       
## [546] "necessary"   "need"        "never"       "new"         "news"       
## [551] "next"        "nice"        "night"       "nine"        "no"         
## [556] "non"         "none"        "normal"      "north"       "not"        
## [561] "note"        "notice"      "now"         "number"      "obvious"    
## [566] "occasion"    "odd"         "of"          "off"         "offer"      
## [571] "office"      "often"       "okay"        "old"         "on"         
## [576] "once"        "one"         "only"        "open"        "operate"    
## [581] "opportunity" "oppose"      "or"          "order"       "organize"   
## [586] "original"    "other"       "otherwise"   "ought"       "out"        
## [591] "over"        "own"         "pack"        "page"        "paint"      
## [596] "pair"        "paper"       "paragraph"   "pardon"      "parent"     
## [601] "park"        "part"        "particular"  "party"       "pass"       
## [606] "past"        "pay"         "pence"       "pension"     "people"     
## [611] "per"         "percent"     "perfect"     "perhaps"     "period"     
## [616] "person"      "photograph"  "pick"        "picture"     "piece"      
## [621] "place"       "plan"        "play"        "please"      "plus"       
## [626] "point"       "police"      "policy"      "politic"     "poor"       
## [631] "position"    "positive"    "possible"    "post"        "pound"      
## [636] "power"       "practise"    "prepare"     "present"     "press"      
## [641] "pressure"    "presume"     "pretty"      "previous"    "price"      
## [646] "print"       "private"     "probable"    "problem"     "proceed"    
## [651] "process"     "produce"     "product"     "programme"   "project"    
## [656] "proper"      "propose"     "protect"     "provide"     "public"     
## [661] "pull"        "purpose"     "push"        "put"         "quality"    
## [666] "quarter"     "question"    "quick"       "quid"        "quiet"      
## [671] "quite"       "radio"       "rail"        "raise"       "range"      
## [676] "rate"        "rather"      "read"        "ready"       "real"       
## [681] "realise"     "really"      "reason"      "receive"     "recent"     
## [686] "reckon"      "recognize"   "recommend"   "record"      "red"        
## [691] "reduce"      "refer"       "regard"      "region"      "relation"   
## [696] "remember"    "report"      "represent"   "require"     "research"   
## [701] "resource"    "respect"     "responsible" "rest"        "result"     
## [706] "return"      "rid"         "right"       "ring"        "rise"       
## [711] "road"        "role"        "roll"        "room"        "round"      
## [716] "rule"        "run"         "safe"        "sale"        "same"       
## [721] "saturday"    "save"        "say"         "scheme"      "school"     
## [726] "science"     "score"       "scotland"    "seat"        "second"     
## [731] "secretary"   "section"     "secure"      "see"         "seem"       
## [736] "self"        "sell"        "send"        "sense"       "separate"   
## [741] "serious"     "serve"       "service"     "set"         "settle"     
## [746] "seven"       "sex"         "shall"       "share"       "she"        
## [751] "sheet"       "shoe"        "shoot"       "shop"        "short"      
## [756] "should"      "show"        "shut"        "sick"        "side"       
## [761] "sign"        "similar"     "simple"      "since"       "sing"       
## [766] "single"      "sir"         "sister"      "sit"         "site"       
## [771] "situate"     "six"         "size"        "sleep"       "slight"     
## [776] "slow"        "small"       "smoke"       "so"          "social"     
## [781] "society"     "some"        "son"         "soon"        "sorry"      
## [786] "sort"        "sound"       "south"       "space"       "speak"      
## [791] "special"     "specific"    "speed"       "spell"       "spend"      
## [796] "square"      "staff"       "stage"       "stairs"      "stand"      
## [801] "standard"    "start"       "state"       "station"     "stay"       
## [806] "step"        "stick"       "still"       "stop"        "story"      
## [811] "straight"    "strategy"    "street"      "strike"      "strong"     
## [816] "structure"   "student"     "study"       "stuff"       "stupid"     
## [821] "subject"     "succeed"     "such"        "sudden"      "suggest"    
## [826] "suit"        "summer"      "sun"         "sunday"      "supply"     
## [831] "support"     "suppose"     "sure"        "surprise"    "switch"     
## [836] "system"      "table"       "take"        "talk"        "tape"       
## [841] "tax"         "tea"         "teach"       "team"        "telephone"  
## [846] "television"  "tell"        "ten"         "tend"        "term"       
## [851] "terrible"    "test"        "than"        "thank"       "the"        
## [856] "then"        "there"       "therefore"   "they"        "thing"      
## [861] "think"       "thirteen"    "thirty"      "this"        "thou"       
## [866] "though"      "thousand"    "three"       "through"     "throw"      
## [871] "thursday"    "tie"         "time"        "to"          "today"      
## [876] "together"    "tomorrow"    "tonight"     "too"         "top"        
## [881] "total"       "touch"       "toward"      "town"        "trade"      
## [886] "traffic"     "train"       "transport"   "travel"      "treat"      
## [891] "tree"        "trouble"     "true"        "trust"       "try"        
## [896] "tuesday"     "turn"        "twelve"      "twenty"      "two"        
## [901] "type"        "under"       "understand"  "union"       "unit"       
## [906] "unite"       "university"  "unless"      "until"       "up"         
## [911] "upon"        "use"         "usual"       "value"       "various"    
## [916] "very"        "video"       "view"        "village"     "visit"      
## [921] "vote"        "wage"        "wait"        "walk"        "wall"       
## [926] "want"        "war"         "warm"        "wash"        "waste"      
## [931] "watch"       "water"       "way"         "we"          "wear"       
## [936] "wednesday"   "wee"         "week"        "weigh"       "welcome"    
## [941] "well"        "west"        "what"        "when"        "where"      
## [946] "whether"     "which"       "while"       "white"       "who"        
## [951] "whole"       "why"         "wide"        "wife"        "will"       
## [956] "win"         "wind"        "window"      "wish"        "with"       
## [961] "within"      "without"     "woman"       "wonder"      "wood"       
## [966] "word"        "work"        "world"       "worry"       "worse"      
## [971] "worth"       "would"       "write"       "wrong"       "year"       
## [976] "yes"         "yesterday"   "yet"         "you"         "young"
sum(str_detect(words, "e"))
## [1] 536

What proportion words end in a vowel?

mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306

What proportion words begin in a vowel?

mean(str_detect(words, "^[aeiou]"))
## [1] 0.1785714

Find all the words that don’t contain o or u.

no_o <- !str_detect(words, "[ou]")
no_o
##   [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
##  [25]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
##  [37] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
##  [49]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
##  [61] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [145]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [157]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [205] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [217]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [229]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [253]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [265]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [277]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [289]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [301]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [313]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [325]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [361] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [373] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [385]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [421]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [433] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [445]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
## [457]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [469]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [481]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [493] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [505]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [517]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [541] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [553]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [565] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [577] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [589] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [601]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [613]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [625] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [637]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
## [649] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [661] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [673]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [685]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [697] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [709]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [721] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [733] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [745]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [757] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [769]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [781] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [793]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [805]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [817] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [829] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [841]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [853]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [865] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [877] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
## [889]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [901]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [913] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [925]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [937]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [949]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [961]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [973]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Extract the words that dont end in “ou”.

words[!str_detect(words, "[ou]")]
##   [1] "a"          "able"       "accept"     "achieve"    "act"       
##   [6] "active"     "add"        "address"    "admit"      "advertise" 
##  [11] "affect"     "after"      "again"      "against"    "age"       
##  [16] "agent"      "agree"      "air"        "all"        "already"   
##  [21] "alright"    "always"     "america"    "and"        "answer"    
##  [26] "any"        "apart"      "apparent"   "appear"     "apply"     
##  [31] "area"       "arm"        "arrange"    "art"        "as"        
##  [36] "ask"        "at"         "attend"     "available"  "aware"     
##  [41] "away"       "baby"       "back"       "bad"        "bag"       
##  [46] "balance"    "ball"       "bank"       "bar"        "base"      
##  [51] "basis"      "be"         "bear"       "beat"       "bed"       
##  [56] "begin"      "behind"     "believe"    "benefit"    "best"      
##  [61] "bet"        "between"    "big"        "bill"       "birth"     
##  [66] "bit"        "black"      "break"      "brief"      "brilliant" 
##  [71] "bring"      "britain"    "by"         "cake"       "call"      
##  [76] "can"        "car"        "card"       "care"       "carry"     
##  [81] "case"       "cat"        "catch"      "cent"       "centre"    
##  [86] "certain"    "chair"      "chairman"   "chance"     "change"    
##  [91] "chap"       "character"  "charge"     "cheap"      "check"     
##  [96] "child"      "Christ"     "Christmas"  "city"       "claim"     
## [101] "class"      "clean"      "clear"      "client"     "create"    
## [106] "dad"        "danger"     "date"       "day"        "dead"      
## [111] "deal"       "dear"       "debate"     "decide"     "deep"      
## [116] "definite"   "degree"     "department" "depend"     "describe"  
## [121] "design"     "detail"     "die"        "difference" "dinner"    
## [126] "direct"     "district"   "divide"     "draw"       "dress"     
## [131] "drink"      "drive"      "dry"        "each"       "early"     
## [136] "east"       "easy"       "eat"        "effect"     "egg"       
## [141] "eight"      "either"     "elect"      "electric"   "eleven"    
## [146] "else"       "end"        "engine"     "english"    "enter"     
## [151] "especial"   "even"       "evening"    "ever"       "every"     
## [156] "evidence"   "exact"      "example"    "except"     "exercise"  
## [161] "exist"      "expect"     "expense"    "experience" "explain"   
## [166] "express"    "extra"      "eye"        "face"       "fact"      
## [171] "fair"       "fall"       "family"     "far"        "farm"      
## [176] "fast"       "father"     "feed"       "feel"       "few"       
## [181] "field"      "fight"      "file"       "fill"       "film"      
## [186] "final"      "finance"    "find"       "fine"       "finish"    
## [191] "fire"       "first"      "fish"       "fit"        "five"      
## [196] "flat"       "fly"        "france"     "free"       "friday"    
## [201] "friend"     "game"       "garden"     "gas"        "general"   
## [206] "germany"    "get"        "girl"       "give"       "glass"     
## [211] "grand"      "grant"      "great"      "green"      "hair"      
## [216] "half"       "hall"       "hand"       "hang"       "happen"    
## [221] "happy"      "hard"       "hate"       "have"       "he"        
## [226] "head"       "health"     "hear"       "heart"      "heat"      
## [231] "heavy"      "hell"       "help"       "here"       "high"      
## [236] "hit"        "idea"       "identify"   "if"         "imagine"   
## [241] "in"         "increase"   "indeed"     "inside"     "instead"   
## [246] "interest"   "invest"     "it"         "item"       "keep"      
## [251] "key"        "kid"        "kill"       "kind"       "king"      
## [256] "kitchen"    "lad"        "lady"       "land"       "large"     
## [261] "last"       "late"       "law"        "lay"        "lead"      
## [266] "learn"      "leave"      "left"       "leg"        "less"      
## [271] "let"        "letter"     "level"      "lie"        "life"      
## [276] "light"      "like"       "likely"     "limit"      "line"      
## [281] "link"       "list"       "listen"     "little"     "live"      
## [286] "machine"    "main"       "make"       "man"        "manage"    
## [291] "many"       "mark"       "market"     "marry"      "match"     
## [296] "matter"     "may"        "maybe"      "mean"       "meaning"   
## [301] "meet"       "member"     "middle"     "might"      "mile"      
## [306] "milk"       "mind"       "minister"   "miss"       "mister"    
## [311] "mrs"        "name"       "near"       "necessary"  "need"      
## [316] "never"      "new"        "news"       "next"       "nice"      
## [321] "night"      "nine"       "pack"       "page"       "paint"     
## [326] "pair"       "paper"      "paragraph"  "parent"     "park"      
## [331] "part"       "party"      "pass"       "past"       "pay"       
## [336] "pence"      "per"        "percent"    "perfect"    "perhaps"   
## [341] "pick"       "piece"      "place"      "plan"       "play"      
## [346] "please"     "practise"   "prepare"    "present"    "press"     
## [351] "pretty"     "price"      "print"      "private"    "rail"      
## [356] "raise"      "range"      "rate"       "rather"     "read"      
## [361] "ready"      "real"       "realise"    "really"     "receive"   
## [366] "recent"     "red"        "refer"      "regard"     "remember"  
## [371] "represent"  "research"   "respect"    "rest"       "rid"       
## [376] "right"      "ring"       "rise"       "safe"       "sale"      
## [381] "same"       "save"       "say"        "scheme"     "science"   
## [386] "seat"       "secretary"  "see"        "seem"       "self"      
## [391] "sell"       "send"       "sense"      "separate"   "serve"     
## [396] "service"    "set"        "settle"     "seven"      "sex"       
## [401] "shall"      "share"      "she"        "sheet"      "sick"      
## [406] "side"       "sign"       "similar"    "simple"     "since"     
## [411] "sing"       "single"     "sir"        "sister"     "sit"       
## [416] "site"       "six"        "size"       "sleep"      "slight"    
## [421] "small"      "space"      "speak"      "special"    "specific"  
## [426] "speed"      "spell"      "spend"      "staff"      "stage"     
## [431] "stairs"     "stand"      "standard"   "start"      "state"     
## [436] "stay"       "step"       "stick"      "still"      "straight"  
## [441] "strategy"   "street"     "strike"     "switch"     "system"    
## [446] "table"      "take"       "talk"       "tape"       "tax"       
## [451] "tea"        "teach"      "team"       "tell"       "ten"       
## [456] "tend"       "term"       "terrible"   "test"       "than"      
## [461] "thank"      "the"        "then"       "there"      "they"      
## [466] "thing"      "think"      "thirteen"   "thirty"     "this"      
## [471] "three"      "tie"        "time"       "trade"      "traffic"   
## [476] "train"      "travel"     "treat"      "tree"       "try"       
## [481] "twelve"     "twenty"     "type"       "very"       "view"      
## [486] "village"    "visit"      "wage"       "wait"       "walk"      
## [491] "wall"       "want"       "war"        "warm"       "wash"      
## [496] "waste"      "watch"      "water"      "way"        "we"        
## [501] "wear"       "wednesday"  "wee"        "week"       "weigh"     
## [506] "well"       "west"       "what"       "when"       "where"     
## [511] "whether"    "which"      "while"      "white"      "why"       
## [516] "wide"       "wife"       "will"       "win"        "wind"      
## [521] "wish"       "with"       "within"     "write"      "year"      
## [526] "yes"        "yesterday"  "yet"

Use the function “str_count()” to see how many matches there are in string.

First, we will see how many matches “x” has with “C”.

x
## [1] "ATTAGA"     "CGCCCCCGAT" "TATTA"
str_count(x, "C")
## [1] 0 6 0

Next, we will see how many matches “x” has with either “G” or “C”.

x
## [1] "ATTAGA"     "CGCCCCCGAT" "TATTA"
str_count(x, "[GC]")
## [1] 1 8 0

Couple this with mutate.

df <- tibble(
  word = words, 
  i = seq_along(word)
)

df
## # A tibble: 980 × 2
##    word         i
##    <chr>    <int>
##  1 a            1
##  2 able         2
##  3 about        3
##  4 absolute     4
##  5 accept       5
##  6 account      6
##  7 achieve      7
##  8 across       8
##  9 act          9
## 10 active      10
## # … with 970 more rows
df %>%
  mutate(
    vowels = str_count(words, "[aeiou]"),
    constonants = str_count(words, "[^aeiou]")
  )
## # A tibble: 980 × 4
##    word         i vowels constonants
##    <chr>    <int>  <int>       <int>
##  1 a            1      1           0
##  2 able         2      2           2
##  3 about        3      3           2
##  4 absolute     4      4           4
##  5 accept       5      2           4
##  6 account      6      3           4
##  7 achieve      7      4           3
##  8 across       8      2           4
##  9 act          9      1           2
## 10 active      10      3           3
## # … with 970 more rows

Microarrays

Microarrays Basics

Before beginning this section, we need to download the package “affy”.

BiocManager::install("affy")

Before proceeding, you will see “Update all/some/none? [a/s/n]:” in the conslole. Enter “n” underneath it.

After this, we can download the rest of our packages.

library(affy)
library(ath1121501cdf)
library(ath1121501.db)
library(limma)
library(oligo)
library(dplyr)
library(stats)
library(reshape)
library(affyio)

First, read the cell files into the directory.

setwd("~/Desktop/classroom/myfiles/microarrays/bric16")
targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")

ab <- ReadAffy()

Now we can plot the data.

hist(ab)

Normalize the microarray experiments.

eset <- affy::rma(ab)

pData(eset)

Lets plot the normalized data.

hist(eset)

Finally, lets characterize the data.

setwd("~/Desktop/classroom/myfiles/microarrays/bric16")
library(annotate)
ID <- featureNames(eset)
Symbol <- getSYMBOL(ID, "ath1121501.db")
affydata <- read.delim("affy_ATH1_array_elements.txt")

Differential Gene Expression Analysis

Flight vs Ground

First, set the conditions for the data.

condition <- targets$gravity

Next, set the design as a model matrix and set the column names.

design <- model.matrix(~factor(condition))
colnames(design) <- c("flight", "ground")

Do a regression analysis.

fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef = 2, n = Inf, adjust = "fdr")

Lets combine our annotations.

Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
                    KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
                    GO = sapply(contents(ath1121501GO), paste, collapse = ", "),
                    SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "),
                    ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))

Lets merge all of the data into one data frame.

setwd("~/Desktop/classroom/myfiles/microarrays/bric16")
all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)

all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")

Lets convert the accession number (ACCNUM) to a character value.

all2$ACCNUM <- as.character(all2$ACCNUM)

Now lets save the data.

write.table(all2, file = "BRIC_16_Final.csv", row.names = TRUE, col.names = TRUE, sep = "\t")

columns(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"

Create a variable called fold changes and add a column called “entrez” using the ACCNUM.

foldchanges <- as.data.frame(all2$logFC)

all2$entrez = mapIds(org.At.tair.db,
                     keys = all2$ACCNUM,
                     column = "ENTREZID",
                     keytype = "TAIR",
                     multiVals = "first")
## 'select()' returned 1:1 mapping between keys and columns

Finally, we can call the new table.

head(all2, 10)
##      Row.names
## 1    244901_at
## 2    244902_at
## 3    244903_at
## 4    244904_at
## 5    244905_at
## 6    244906_at
## 7    244907_at
## 8    244908_at
## 9    244909_at
## 10 244910_s_at
##                                                                                                                            GENE
## 1  encodes a plant b subunit of mitochondrial ATP synthase based on structural similarity and the presence in the F(0) complex.
## 2                                                                                        Encodes NADH dehydrogenase subunit 4L.
## 3                                                                                                          hypothetical protein
## 4                                                                                                          hypothetical protein
## 5                                                                                                          hypothetical protein
## 6                                                                                                          hypothetical protein
## 7                                                                                                          hypothetical protein
## 8                                                                                                          hypothetical protein
## 9                                                                                                          hypothetical protein
## 10                                                                                                         hypothetical protein
##    KEGG
## 1    NA
## 2    NA
## 3    NA
## 4    NA
## 5    NA
## 6    NA
## 7    NA
## 8    NA
## 9    NA
## 10   NA
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          GO
## 1  list(GOID = "GO:0015986", Evidence = "TAS", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "TAS", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "TAS", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "TAS", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0015986", Evidence = "TAS", Ontology = "BP"), list(GOID = "GO:0000276", Evidence = "IDA", Ontology = "CC"), list(GOID = "GO:0000276", Evidence = "IBA", Ontology = "CC"), list(GOID = "GO:0000276", Evidence = "IBA", Ontology = "CC"), list(GOID = "GO:0000276", Evidence = "IBA", Ontology = "CC"), list(GOID = "GO:0000276", Evidence = "IDA", Ontology = "CC"), list(GOID = "GO:0000276", Evidence = "IDA", Ontology = "CC"), list(GOID = "GO:0005739", Evidence = "HDA", Ontology = "CC"), list(GOID = "GO:0005739", Evidence = "HDA", Ontology = "CC"), list(GOID = "GO:0005739", Evidence = "HDA", Ontology = "CC"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0005753", Evidence = "HDA", Ontology = "CC"), list(GOID = "GO:0005753", Evidence = "HDA", Ontology = "CC"), list(GOID = "GO:0005753", Evidence = "HDA", Ontology = "CC"), list(GOID = "GO:0008270", Evidence = "HDA", Ontology = "MF"), list(GOID = "GO:0015078", Evidence = "IEA", Ontology = "MF"), list(GOID = "GO:0046933", Evidence = "TAS", Ontology = "MF"), list(GOID = "GO:0046933", Evidence = "TAS", Ontology = "MF")
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    list(GOID = "GO:0042773", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0042773", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0042773", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0045333", Evidence = "TAS", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "TAS", Ontology = "CC"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0030964", Evidence = "IBA", Ontology = "CC"), list(GOID = "GO:0045271", Evidence = "TAS", Ontology = "CC"), list(GOID = "GO:0045272", Evidence = "IBA", Ontology = "CC"), list(GOID = "GO:0045272", Evidence = "IBA", Ontology = "CC"), list(GOID = "GO:0003954", Evidence = "TAS", Ontology = "MF"), list(GOID = "GO:0008137", Evidence = "IBA", Ontology = "MF")
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC")
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC")
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
##     SYMBOL    ACCNUM  logFC AveExpr     t P.Value adj.P.Val    B
## 1    ORF25 ATMG00640 -0.742     8.5 -3.74  0.0069     0.081 -2.5
## 2    NAD4L ATMG00650 -0.826     7.5 -3.55  0.0090     0.094 -2.7
## 3   ORF149 ATMG00660 -1.214     7.8 -6.20  0.0004     0.018  0.5
## 4   ORF275 ATMG00670 -0.568     3.2 -2.07  0.0758     0.300 -4.9
## 5  ORF122C ATMG00680  0.120     1.4  0.60  0.5661     0.794 -6.6
## 6  ORF240A ATMG00690  0.055     9.1  0.18  0.8586     0.947 -6.8
## 7   ORF120 ATMG00710 -0.883     4.3 -2.91  0.0220     0.154 -3.7
## 8  ORF107D ATMG00720 -0.463     2.2 -1.73  0.1262     0.396 -5.4
## 9  ORF100A ATMG00740 -0.510     1.5 -3.18  0.0150     0.125 -3.3
## 10  ORF119 ATMG00750 -0.694     1.9 -2.09  0.0744     0.297 -4.9
##    array_element_type             organism is_control               locus
## 1     oligonucleotide Arabidopsis thaliana         no           ATMG00640
## 2     oligonucleotide Arabidopsis thaliana         no           ATMG00650
## 3     oligonucleotide Arabidopsis thaliana         no           ATMG00660
## 4     oligonucleotide Arabidopsis thaliana         no           ATMG00670
## 5     oligonucleotide Arabidopsis thaliana         no           ATMG00680
## 6     oligonucleotide Arabidopsis thaliana         no           ATMG00690
## 7     oligonucleotide Arabidopsis thaliana         no           ATMG00710
## 8     oligonucleotide Arabidopsis thaliana         no           ATMG00720
## 9     oligonucleotide Arabidopsis thaliana         no           ATMG00740
## 10    oligonucleotide Arabidopsis thaliana         no ATMG00750;AT2G07686
##                                                                       description
## 1  hydrogen ion transporting ATP synthases, rotational mechanism;zinc ion binding
## 2                                                   NADH dehydrogenase subunit 4L
## 3                                                            hypothetical protein
## 4                                                            hypothetical protein
## 5                                                            hypothetical protein
## 6                                                            hypothetical protein
## 7            Polynucleotidyl transferase, ribonuclease H-like superfamily protein
## 8                                                            hypothetical protein
## 9                                                            hypothetical protein
## 10    [ATMG00750, GAG/POL/ENV polyprotein];[AT2G07686, transposable element gene]
##                       chromosome                                    start
## 1                              M                                   188160
## 2                              M                                   188954
## 3                              M                                   190106
## 4                              M                                   191055
## 5                              M                                   201768
## 6                              M                                   203634
## 7                              M                                   207571
## 8                              M                                   209500
## 9                              M                                   220521
## 10 [ATMG00750, M];[AT2G07686, 2] [ATMG00750, 220851];[AT2G07686, 3309747]
##                                        stop entrez
## 1                                    188619     NA
## 2                                    189182     NA
## 3                                    190540     NA
## 4                                    191627     NA
## 5                                    202096     NA
## 6                                    204043     NA
## 7                                    207882     NA
## 8                                    209821     NA
## 9                                    220769     NA
## 10 [ATMG00750, 221184];[AT2G07686, 3310080]     NA

Path View

Lets load our packages and our data.

library(pathview)
## 
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
library(gage)
## 
library(gageData)
data(kegg.sets.hs)

Lets apply fold changes and call the beginning.

foldchanges = all2$logFC
names(foldchanges) = all2$entrez
head(foldchanges)
##   <NA>   <NA>   <NA>   <NA>   <NA>   <NA> 
## -0.742 -0.826 -1.214 -0.568  0.120  0.055

Kegg is the pathway we are going to map the genes to. We are going to map it using Arabidopsis as our organism.

kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]

Lets get the results and save them into the same directory.

keggres = gage(foldchanges, gsets = kegg.ath.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
##                                      p.geomean stat.mean   p.val   q.val
## ath03010 Ribosome                      2.4e-27      11.5 2.4e-27 2.7e-25
## ath00195 Photosynthesis                7.6e-06       4.6 7.6e-06 4.3e-04
## ath04145 Phagosome                     3.2e-05       4.1 3.2e-05 1.2e-03
## ath01230 Biosynthesis of amino acids   5.5e-05       3.9 5.5e-05 1.5e-03
## ath00020 Citrate cycle (TCA cycle)     1.1e-03       3.1 1.1e-03 2.5e-02
## ath00190 Oxidative phosphorylation     2.2e-03       2.9 2.2e-03 3.4e-02
##                                      set.size    exp1
## ath03010 Ribosome                         261 2.4e-27
## ath00195 Photosynthesis                    44 7.6e-06
## ath04145 Phagosome                         71 3.2e-05
## ath01230 Biosynthesis of amino acids      201 5.5e-05
## ath00020 Citrate cycle (TCA cycle)         58 1.1e-03
## ath00190 Oxidative phosphorylation        103 2.2e-03
## 
## $less
##                                                      p.geomean stat.mean  p.val
## ath04016 MAPK signaling pathway - plant                 0.0035      -2.7 0.0035
## ath00906 Carotenoid biosynthesis                        0.0198      -2.1 0.0198
## ath04141 Protein processing in endoplasmic reticulum    0.0283      -1.9 0.0283
## ath00592 alpha-Linolenic acid metabolism                0.0552      -1.6 0.0552
## ath00071 Fatty acid degradation                         0.1160      -1.2 0.1160
## ath04122 Sulfur relay system                            0.1415      -1.1 0.1415
##                                                      q.val set.size   exp1
## ath04016 MAPK signaling pathway - plant                0.4      129 0.0035
## ath00906 Carotenoid biosynthesis                       1.0       28 0.0198
## ath04141 Protein processing in endoplasmic reticulum   1.0      189 0.0283
## ath00592 alpha-Linolenic acid metabolism               1.0       36 0.0552
## ath00071 Fatty acid degradation                        1.0       45 0.1160
## ath04122 Sulfur relay system                           1.0       11 0.1415
## 
## $stats
##                                      stat.mean exp1
## ath03010 Ribosome                         11.5 11.5
## ath00195 Photosynthesis                    4.6  4.6
## ath04145 Phagosome                         4.1  4.1
## ath01230 Biosynthesis of amino acids       3.9  3.9
## ath00020 Citrate cycle (TCA cycle)         3.1  3.1
## ath00190 Oxidative phosphorylation         2.9  2.9

Greaters and Lessers

Create “greater” and “lesser” variables for the up and down-regulated genes.

greater <- keggres$greater
lessers <- keggres$less

Now write the variables into a file.

write.csv(greater, file = "BRIC_16_pathview_Greater_Pathways.csv")
write.csv(lessers, file = "BRIC_16_pathview_Lesser_Pathways.csv")

Map the greater values to the pathway.

keggrespathways_greaters = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
keggrespathways_greaters
## [1] "ath03010 Ribosome"                   
## [2] "ath00195 Photosynthesis"             
## [3] "ath04145 Phagosome"                  
## [4] "ath01230 Biosynthesis of amino acids"
## [5] "ath00020 Citrate cycle (TCA cycle)"

Remove the names so that you are left with just the pathway ID.

keggresids_greater = substr(keggrespathways_greaters, start = 1, stop = 8)
keggresids_greater
## [1] "ath03010" "ath00195" "ath04145" "ath01230" "ath00020"
genedata <- as.character(all2$logFC)

Define a plotting function to apply next.

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)

Plot Multiple Pathways - Greaters

  • Plots are saved to disk and returns a throwaway object list.
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath"))

Now we are going to apply the same functions to map the lesser data to the pathway.

keggrespathways_lessers = data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()
keggrespathways_lessers
## [1] "ath04016 MAPK signaling pathway - plant"             
## [2] "ath00906 Carotenoid biosynthesis"                    
## [3] "ath04141 Protein processing in endoplasmic reticulum"
## [4] "ath00592 alpha-Linolenic acid metabolism"            
## [5] "ath00071 Fatty acid degradation"

Again remove names so you are left with the pathway ID.

keggresids_less = substr(keggrespathways_lessers, start = 1, stop = 8)
keggresids_less
## [1] "ath04016" "ath00906" "ath04141" "ath00592" "ath00071"
genedata <- as.character(all2$logFC)

Define a plotting function to apply next

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)

Plot Multiple Pathways - Lessers

tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath"))

Add the following code so that the images of the pathways that we created will appear.

library(imager)
filenames <- list.files(path = "~/Desktop/classroom/myfiles", pattern = ".*pathview1.png")
microarrays_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames)

RNAseq

Mouse EdgeR

Load Packages

library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
## 

Before we load our data, we have to first set the working directory to the correct file location. After doing this, we can load our data.

setwd("~/Desktop/classroom/myfiles/RNAseq")
rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_id")

Create a Group

group <- factor(c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2"))

Differential Gene Expression

Lets call a new regression type variable, differential gene expression (“DGE”). We will also view the structure of the GLM.

dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 2
##   .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
##   .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
##   .. ..$ :'data.frame':  12 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
##   .. .. ..$ lib.size    : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
##   .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ names: chr [1:2] "counts" "samples"

Get Rid of Data

We want to get rid of the rows that have nothing in them. We do this using the “keep” function where we will select the content that we want to remian in the table.

keep <- rowSums(cpm(dgeGlm)>2) >=4

dgeGlm <-dgeGlm[keep,]

Pull the Name

Lets look at the names of our dgeGLM.

names(dgeGlm)
## [1] "counts"  "samples"
dgeGlm[["samples"]]
##                              group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23     1  4.0e+07            1
## Mmus_C57.6J_KDN_FLT_Rep2_M24     1  4.1e+07            1
## Mmus_C57.6J_KDN_FLT_Rep3_M25     1  3.8e+07            1
## Mmus_C57.6J_KDN_FLT_Rep4_M26     1  3.9e+07            1
## Mmus_C57.6J_KDN_FLT_Rep5_M27     1  3.7e+07            1
## Mmus_C57.6J_KDN_FLT_Rep6_M28     1  3.6e+07            1
## Mmus_C57.6J_KDN_GC_Rep1_M33      2  3.7e+07            1
## Mmus_C57.6J_KDN_GC_Rep2_M34      2  3.7e+07            1
## Mmus_C57.6J_KDN_GC_Rep3_M35      2  4.0e+07            1
## Mmus_C57.6J_KDN_GC_Rep4_M36      2  3.6e+07            1
## Mmus_C57.6J_KDN_GC_Rep5_M37      2  3.8e+07            1
## Mmus_C57.6J_KDN_GC_Rep6_M38      2  3.5e+07            1

Model Matrix

The design of our model will be a model matrix.

design <- model.matrix(~group)
design
##    (Intercept) group2
## 1            1      0
## 2            1      0
## 3            1      0
## 4            1      0
## 5            1      0
## 6            1      0
## 7            1      1
## 8            1      1
## 9            1      1
## 10           1      1
## 11           1      1
## 12           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Lets look at the common dispersion and dispersion trend of our matrix.

dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)

Plot the BCV

We wil now plot the BCV (Biological Coefficient of Variation). In the plot, we will set the column names and find the line of best fit.

plotBCV(dgeGlmTagDisp)

fit <- glmFit(dgeGlmTagDisp, design)

colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef = 2)

Find the top tags.

ttGlm <- topTags(lrt, n = Inf)

ttGlm

class(ttGlm)

dgeGLM Summary

We are going to run a summary of our dgeGLM using the decide test.

  • The decide test identifies which of the genes are differentially expressed
  • It will then pull out the significant genes
summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))
##        group2
## Down       64
## NotSig  13390
## Up        159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]

Lets do a plot smear of our lrt.

plotSmear(lrt, de.tags = tagsGlm)

The plot above shows the log fold changes for every gene. Those in red are the genes that are differentially expressed.

Pull the hits that are less than 0.1 false discovery rate and save it to a file.

hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]

write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")
columns(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"
ttGlm2 <- as.data.frame(ttGlm$table)

ttGlm2$symbol = mapIds(org.Mm.eg.db,
                       keys = row.names(ttGlm2),
                       column = "SYMBOL",
                       keytype = "ENSEMBL",
                       multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$entrez = mapIds(org.Mm.eg.db,
                       keys = row.names(ttGlm2),
                       column = "ENTREZID",
                       keytype = "ENSEMBL",
                       multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$name = mapIds(org.Mm.eg.db,
                       keys = row.names(ttGlm2),
                       column = "GENENAME",
                       keytype = "ENSEMBL",
                       multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(ttGlm2)
##                    logFC logCPM LR  PValue     FDR symbol entrez
## ENSMUSG00000026077 -1.36    3.6 80 4.3e-19 5.9e-15  Npas2  18143
## ENSMUSG00000007872  0.89    5.5 77 1.9e-18 1.3e-14    Id3  15903
## ENSMUSG00000021775  0.95    6.2 63 2.0e-15 9.1e-12  Nr1d2 353187
## ENSMUSG00000002250 -0.83    5.3 62 2.7e-15 9.2e-12  Ppard  19015
## ENSMUSG00000059824  2.26    4.6 58 2.6e-14 7.2e-11    Dbp  13170
## ENSMUSG00000074715 -1.99    3.8 47 7.0e-12 1.6e-08  Ccl28  56838
##                                                                name
## ENSMUSG00000026077                    neuronal PAS domain protein 2
## ENSMUSG00000007872                       inhibitor of DNA binding 3
## ENSMUSG00000021775  nuclear receptor subfamily 1, group D, member 2
## ENSMUSG00000002250 peroxisome proliferator activator receptor delta
## ENSMUSG00000059824          D site albumin promoter binding protein
## ENSMUSG00000074715                  chemokine (C-C motif) ligand 28
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]


# Lets get the results
keggres = gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                                                   p.geomean
## mmu03010 Ribosome                                                   9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells   2.0e-03
## mmu04350 TGF-beta signaling pathway                                 7.7e-03
## mmu04330 Notch signaling pathway                                    1.0e-02
## mmu04390 Hippo signaling pathway                                    2.0e-02
## mmu00830 Retinol metabolism                                         2.1e-02
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04350 TGF-beta signaling pathway                                     2.5
## mmu04330 Notch signaling pathway                                        2.4
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                     p.val q.val
## mmu03010 Ribosome                                                 9.5e-05 0.022
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.230
## mmu04350 TGF-beta signaling pathway                               7.7e-03 0.589
## mmu04330 Notch signaling pathway                                  1.0e-02 0.589
## mmu04390 Hippo signaling pathway                                  2.0e-02 0.809
## mmu00830 Retinol metabolism                                       2.1e-02 0.809
##                                                                   set.size
## mmu03010 Ribosome                                                      127
## mmu04550 Signaling pathways regulating pluripotency of stem cells      100
## mmu04350 TGF-beta signaling pathway                                     72
## mmu04330 Notch signaling pathway                                        51
## mmu04390 Hippo signaling pathway                                       125
## mmu00830 Retinol metabolism                                             37
##                                                                      exp1
## mmu03010 Ribosome                                                 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04350 TGF-beta signaling pathway                               7.7e-03
## mmu04330 Notch signaling pathway                                  1.0e-02
## mmu04390 Hippo signaling pathway                                  2.0e-02
## mmu00830 Retinol metabolism                                       2.1e-02
## 
## $less
##                                    p.geomean stat.mean  p.val q.val set.size
## mmu04145 Phagosome                    0.0019      -2.9 0.0019  0.33      121
## mmu04714 Thermogenesis                0.0048      -2.6 0.0048  0.33      207
## mmu04110 Cell cycle                   0.0051      -2.6 0.0051  0.33      110
## mmu04217 Necroptosis                  0.0056      -2.6 0.0056  0.33      113
## mmu00910 Nitrogen metabolism          0.0087      -2.6 0.0087  0.41       13
## mmu00190 Oxidative phosphorylation    0.0118      -2.3 0.0118  0.41      125
##                                      exp1
## mmu04145 Phagosome                 0.0019
## mmu04714 Thermogenesis             0.0048
## mmu04110 Cell cycle                0.0051
## mmu04217 Necroptosis               0.0056
## mmu00910 Nitrogen metabolism       0.0087
## mmu00190 Oxidative phosphorylation 0.0118
## 
## $stats
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04350 TGF-beta signaling pathway                                     2.5
## mmu04330 Notch signaling pathway                                        2.4
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                   exp1
## mmu03010 Ribosome                                                  3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells  2.9
## mmu04350 TGF-beta signaling pathway                                2.5
## mmu04330 Notch signaling pathway                                   2.4
## mmu04390 Hippo signaling pathway                                   2.1
## mmu00830 Retinol metabolism                                        2.1
greaters <- keggres$greater
lessers <- keggres$less
keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()
keggrespathways
## [1] "mmu03010 Ribosome"                                                
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04350 TGF-beta signaling pathway"                              
## [4] "mmu04330 Notch signaling pathway"                                 
## [5] "mmu04390 Hippo signaling pathway"
keggresids = substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu03010" "mmu04550" "mmu04350" "mmu04330" "mmu04390"
plot_pathway = function(pid) pathview (gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

# Plot multiple pathways

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))
keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()
keggrespathways
## [1] "mmu04145 Phagosome"           "mmu04714 Thermogenesis"      
## [3] "mmu04110 Cell cycle"          "mmu04217 Necroptosis"        
## [5] "mmu00910 Nitrogen metabolism"
keggresids = substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu04145" "mmu04714" "mmu04110" "mmu04217" "mmu00910"
plot_pathway = function(pid) pathview (gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

# Plot multiple pathways

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))
library(imager)
filenames2 <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = ".*pathview2.png")
rnaseq_images <- lapply(filenames2, load.image)

knitr::include_graphics(filenames2)

Mouse DESeq

First we need to install the “apeglm” package.

BiocManager::install("apeglm", update = FALSE)
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
## 
## replacement repositories:
##     CRAN: https://cloud.r-project.org
## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.3 (2022-03-10)
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'apeglm'

Lets load the required libraries for this analysis.

library("DESeq2")
library("dplyr")
library("apeglm")

Lets load in our data.

setwd("~/Desktop/classroom/myfiles/RNAseq")
countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")

colData <- read.csv("PHENO_DATA_Mouse.csv", sep = ",", row.names = 1)

We need to add leveling factors to colData

colData$group <- factor(colData$group, levels = c("0", "1"))

Lets make sure we have the same number of individuals as well as groups

all(rownames(colData))%in% colnames(countData)
## Warning in all(rownames(colData)): coercing argument of type 'character' to
## logical
## [1] FALSE

We need to round up the counts because DEseq2 only allows integers as an input and not fractional numbers. This depends on the method of alignment that was used upstream.

countData %>%
  mutate_if(is.numeric, ceiling)

countData[, 2:13] <- sapply(countData[, 2:13], as.integer)

row.names(countData) <- countData[,1]

countData <- countData[2:13]

rownames(colData) == colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)

dds <- dds[rowSums(counts(dds)>2) >=4]

dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)

Get the log fold change

resLFC <- lfcShrink(dds, coef = 2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
(resOrdered <- res[order(res$padj), ])
## log2 fold change (MLE): group 1 vs 0 
## Wald test p-value: group 1 vs 0 
## DataFrame with 22008 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE        stat      pvalue
##                    <numeric>      <numeric> <numeric>   <numeric>   <numeric>
## ENSMUSG00000002250  1459.223      -0.926292 0.1112628    -8.32526 8.41430e-17
## ENSMUSG00000007872  1719.375       0.818829 0.1122820     7.29261 3.04014e-13
## ENSMUSG00000026077   437.035      -1.191812 0.1655873    -7.19748 6.13338e-13
## ENSMUSG00000040998 14579.593      -0.506307 0.0703771    -7.19421 6.28252e-13
## ENSMUSG00000021775  2804.923       0.842511 0.1233312     6.83129 8.41546e-12
## ...                      ...            ...       ...         ...         ...
## ENSMUSG00000118345   4.22314    -0.12097478  0.599072 -0.20193699    0.839966
## ENSMUSG00000118353   6.60578     0.56456713  0.481195  1.17326031    0.240691
## ENSMUSG00000118358   3.30902    -0.00273584  0.763559 -0.00358301    0.997141
## ENSMUSG00000118369   2.91657    -1.11623145  0.790702 -1.41169702    0.158039
## ENSMUSG00000118384   7.43136     0.23830798  0.489273  0.48706567    0.626212
##                           padj
##                      <numeric>
## ENSMUSG00000002250 1.24077e-12
## ENSMUSG00000007872 2.24149e-09
## ENSMUSG00000026077 2.31605e-09
## ENSMUSG00000040998 2.31605e-09
## ENSMUSG00000021775 2.48189e-08
## ...                        ...
## ENSMUSG00000118345          NA
## ENSMUSG00000118353          NA
## ENSMUSG00000118358          NA
## ENSMUSG00000118369          NA
## ENSMUSG00000118384          NA
summary(res)
## 
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 325, 1.5%
## LFC < 0 (down)     : 327, 1.5%
## outliers [1]       : 15, 0.068%
## low counts [2]     : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)

head(FLT_Vs_GC)
##   res$log2FoldChange
## 1             0.0421
## 2            -0.1334
## 3            -0.0185
## 4            -0.0882
## 5            -0.0079
## 6             0.1136
plotMA(resLFC, ylim = c(-2,2))

pdf(file = "MA_Plot_FLT_vs_GC.pdf")

dev.off()
## png 
##   2

Lets save our differential expression results to file.

write.csv(as.data.frame(resOrdered), file = "Mouse_DEseq.csv")

Lets perform a custom transformation

dds <- estimateSizeFactors(dds)

se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) +1), colData = colData(dds))

pdf(file = "PCA_PLOT_FLT_vs_GC.pdf")

plotPCA(DESeqTransform(se), intgroup = "group")

Lets load out required packages.

library(AnnotationDbi)
library(org.Mm.eg.db)
columns(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"
foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
head(foldchanges)
##                    res$log2FoldChange
## ENSMUSG00000000001             0.0421
## ENSMUSG00000000028            -0.1334
## ENSMUSG00000000031            -0.0185
## ENSMUSG00000000037            -0.0882
## ENSMUSG00000000049            -0.0079
## ENSMUSG00000000056             0.1136
res$symbol = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "SYMBOL",
                    keytype = "ENSEMBL",
                    multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
res$entrez = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "ENTREZID",
                    keytype = "ENSEMBL",
                    multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
res$name = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "GENENAME",
                    keytype = "ENSEMBL",
                    multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(res, 10)
## log2 fold change (MLE): group 1 vs 0 
## Wald test p-value: group 1 vs 0 
## DataFrame with 10 rows and 9 columns
##                      baseMean log2FoldChange     lfcSE       stat    pvalue
##                     <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSMUSG00000000001 3132.35128     0.04214340 0.0436714  0.9650117  0.334539
## ENSMUSG00000000028   68.75801    -0.13342706 0.1565936 -0.8520597  0.394181
## ENSMUSG00000000031   21.05397    -0.01853142 0.2486477 -0.0745288  0.940590
## ENSMUSG00000000037   24.42314    -0.08817270 0.2982220 -0.2956613  0.767489
## ENSMUSG00000000049    3.24919    -0.00790342 0.9613572 -0.0082211  0.993441
## ENSMUSG00000000056 1424.88216     0.11355979 0.0777635  1.4603234  0.144201
## ENSMUSG00000000058 1420.78992    -0.03893850 0.0850602 -0.4577759  0.647113
## ENSMUSG00000000078 2254.53129    -0.07540275 0.0874314 -0.8624214  0.388456
## ENSMUSG00000000085  822.68179     0.04586772 0.0584699  0.7844667  0.432766
## ENSMUSG00000000088 5946.81754    -0.05461549 0.0691178 -0.7901795  0.429423
##                         padj      symbol      entrez                   name
##                    <numeric> <character> <character>            <character>
## ENSMUSG00000000001  0.739637       Gnai3       14679 guanine nucleotide b..
## ENSMUSG00000000028  0.777085       Cdc45       12544 cell division cycle 45
## ENSMUSG00000000031        NA         H19       14955 H19, imprinted mater..
## ENSMUSG00000000037        NA       Scml2      107815 Scm polycomb group p..
## ENSMUSG00000000049        NA        Apoh       11818       apolipoprotein H
## ENSMUSG00000000056  0.547527        Narf       67608 nuclear prelamin A r..
## ENSMUSG00000000058  0.901904        Cav2       12390             caveolin 2
## ENSMUSG00000000078  0.774294        Klf6       23849  Kruppel-like factor 6
## ENSMUSG00000000085  0.800018       Scmh1       29871 sex comb on midleg h..
## ENSMUSG00000000088  0.798420       Cox5a       12858 cytochrome c oxidase..
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange

names(foldchanges) = res$entrez

head(foldchanges)
##   14679   12544   14955  107815   11818   67608 
##  0.0421 -0.1334 -0.0185 -0.0882 -0.0079  0.1136
kegg.mm <- kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]

Lets map the results

keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                           p.geomean stat.mean  p.val q.val
## mmu03010 Ribosome                            0.0056       2.6 0.0056   0.9
## mmu04022 cGMP-PKG signaling pathway          0.0309       1.9 0.0309   0.9
## mmu04360 Axon guidance                       0.0382       1.8 0.0382   0.9
## mmu04330 Notch signaling pathway             0.0518       1.6 0.0518   0.9
## mmu04658 Th1 and Th2 cell differentiation    0.0538       1.6 0.0538   0.9
## mmu04350 TGF-beta signaling pathway          0.0544       1.6 0.0544   0.9
##                                           set.size   exp1
## mmu03010 Ribosome                              133 0.0056
## mmu04022 cGMP-PKG signaling pathway            153 0.0309
## mmu04360 Axon guidance                         176 0.0382
## mmu04330 Notch signaling pathway                55 0.0518
## mmu04658 Th1 and Th2 cell differentiation       78 0.0538
## mmu04350 TGF-beta signaling pathway             87 0.0544
## 
## $less
##                                                   p.geomean stat.mean p.val
## mmu04657 IL-17 signaling pathway                      0.011      -2.3 0.011
## mmu04110 Cell cycle                                   0.020      -2.1 0.020
## mmu04145 Phagosome                                    0.027      -1.9 0.027
## mmu04621 NOD-like receptor signaling pathway          0.042      -1.7 0.042
## mmu04625 C-type lectin receptor signaling pathway     0.044      -1.7 0.044
## mmu04115 p53 signaling pathway                        0.048      -1.7 0.048
##                                                   q.val set.size  exp1
## mmu04657 IL-17 signaling pathway                    0.9       75 0.011
## mmu04110 Cell cycle                                 0.9      121 0.020
## mmu04145 Phagosome                                  0.9      145 0.027
## mmu04621 NOD-like receptor signaling pathway        0.9      151 0.042
## mmu04625 C-type lectin receptor signaling pathway   0.9       99 0.044
## mmu04115 p53 signaling pathway                      0.9       71 0.048
## 
## $stats
##                                           stat.mean exp1
## mmu03010 Ribosome                               2.6  2.6
## mmu04022 cGMP-PKG signaling pathway             1.9  1.9
## mmu04360 Axon guidance                          1.8  1.8
## mmu04330 Notch signaling pathway                1.6  1.6
## mmu04658 Th1 and Th2 cell differentiation       1.6  1.6
## mmu04350 TGF-beta signaling pathway             1.6  1.6

Lets save our greater and less than pathways

greaters_deseq <- keggres$greater
lessers_deseq <- keggres$less
keggrespathways_des <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df()%>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()
keggrespathways
## [1] "mmu04145 Phagosome"           "mmu04714 Thermogenesis"      
## [3] "mmu04110 Cell cycle"          "mmu04217 Necroptosis"        
## [5] "mmu00910 Nitrogen metabolism"
keggresids_greatersdes <- substr(keggrespathways, start = 1, stop = 8)
keggresids_greatersdes
## [1] "mmu04145" "mmu04714" "mmu04110" "mmu04217" "mmu00910"

Plot

plot_pathway = function(pid) pathview(gene.data = foldchange, pathway.id = pid, species = "mouse", new.signature = FALSE)

tmp = sapply(keggresids_greatersdes, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))

Do the same for the lessers pathway

keggrespathways_des <- data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df()%>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()
keggrespathways
## [1] "mmu04145 Phagosome"           "mmu04714 Thermogenesis"      
## [3] "mmu04110 Cell cycle"          "mmu04217 Necroptosis"        
## [5] "mmu00910 Nitrogen metabolism"
keggresids_lessersdes <- substr(keggrespathways, start = 1, stop = 8)
keggresids_lessersdes
## [1] "mmu04145" "mmu04714" "mmu04110" "mmu04217" "mmu00910"

Plot

plot_pathway = function(pid) pathview(gene.data = foldchange, pathway.id = pid, species = "mouse", new.signature = FALSE)

tmp = sapply(keggresids_lessersdes, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))
library(imager)

filenames3 <- list.files(path = "~/Desktop/classroom/myfiles", pattern = ".*pathview2.png")

des_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames3)

EdgeR vs DESeq

In this section, we will compare EdgeR and DESeq.

First, we need to load the tidyverse database.

library(tidyverse)

Next, we will load the results of both EdgeR and DESeq.

EdgeR <- read.csv("Mouse_EdgeR_Results_Zero_vs_1.csv")

DESeq <- read.csv("Mouse_DEseq.csv")

DESeq2 <- DESeq %>%
  filter(padj < 0.1)

DESeq2 <- DESeq2 [,c(1,3)]

EdgeR <- EdgeR[, 1:2]

Label the sections of the plot.

colnames(DESeq2) <- c("ID", "logFC")
colnames(EdgeR) <- c("ID", "logFC")

Before we make the plot, we need to install the “GOplot” package.

library(GOplot)

Now we can make the plot comparing EdgeR and DESeq.

comp <- GOVenn(DESeq2, EdgeR, label = c("DESeq1", "EdgeR"), title = "Comparison of DESeq and EdgeR DE Genes", plot = FALSE)

comp$plot

We will also save the results in the form of a table as another form of data comparison.

comp$table
comp_table <- comp$table

Other Tools

Multiple Protein Alignment

First, load the “msa” package.

library(msa)

Now lets load the data and save it.

setwd("~/Desktop/classroom/myfiles/other tools")
seq <- readAAStringSet("hglobin.fa")
seq
## AAStringSet object of length 8:
##     width seq                                               names               
## [1]   142 MVLSPADKTNVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2]   142 MVLSGEDKSNIKAAWGKIGGHGA...PAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3]   142 MVLSPADKTIVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_Porpoise
## [4]   142 MVLSPADKTNVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_Gorilla
## [5]   142 MVLSPADKTNVKTAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_Capucin
## [6]   142 MTLTQAEKAAVVTIWAKVATQAD...PEVHAAWDKFLSSVSSVLTEKYR HBA_Owl_Parrot
## [7]   141 VLSSGDKANVKSVWSKVQGHLED...PDVHVSLDKFMGTVSTVLTSKYR HBA_Loggerhead
## [8]   142 MVLSDDDKAKVRAAWVPVAKNAE...PSVILALDKFLDLVAKVLVSRYR HBA_Pit_Viper

Align the eight different amino acid sequences.

alignments <- msa(seq, method = "ClustalW")
## use default substitution matrix
alignments
## CLUSTAL 2.1  
## 
## Call:
##    msa(seq, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 8 rows and 142 columns
##     aln                                                    names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_Gorilla
## [3] MVLSPADKTIVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_Porpoise
## [4] MVLSPADKTNVKTAWGKVGAHAGDYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_Capucin
## [5] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [6] MVLSDDDKAKVRAAWVPVAKNAEMYG...LKPSVILALDKFLDLVAKVLVSRYR HBA_Pit_Viper
## [7] -VLSSGDKANVKSVWSKVQGHLEDYG...FTPDVHVSLDKFMGTVSTVLTSKYR HBA_Loggerhead
## [8] MTLTQAEKAAVVTIWAKVATQADAIG...FTPEVHAAWDKFLSSVSSVLTEKYR HBA_Owl_Parrot
## Con MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus
msaPrettyPrint(alignments, output = "pdf", showNames = "left",
               showLogo = "none", askForOverwrite = FALSE,
               verbose = TRUE, file = "whole_align.pdf")

msaPrettyPrint(alignments, c(10,30), output = "pdf", showNames = "left",
               file = "Zoomed_align.pdf", showLogo = "top", 
               askForOverwrite = FALSE, verbose = TRUE)

Phylogenetic Tree

Here, we will use our protein alignment to create a phylogenetic tree.

Before we construct the tree, we need to load some packages.

library(ape)
library(seqinr)

Convert to seqinr alignment.

  • Get the distances and make a phylogenetic tree
alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")

distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")

tree <- ape::nj(distances1)
plot(tree, main = "Phylogenetic Tree of HBA Sequences")

Synteny

First load the “decipher” library.

library(DECIPHER)

We are working with a DNA string set object. Load the sequence into the long_seqs.

setwd("~/Desktop/classroom/myfiles/other tools")
long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq
## DNAStringSet object of length 5:
##      width seq                                              names               
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT Saccharina japoni...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA Asclepias nivea c...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC Nannochloropsis g...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC Cocos nucifera ch...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT Camellia cuspidat...

Now lets build a temporary SQLite database.

Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
## Adding 5 sequences to the database.
## 
## Added 5 new sequences to table Seqs.
## 70 total sequences in table Seqs.
## Time difference of 0.21 secs

Syntenic Blocks

Now we can find the syntenic blocks.

synteny <- FindSynteny("long_db")
## ================================================================================
## 
## Time difference of 22 secs

View blocs with plotting.

pairs(synteny)

plot(synteny)

Alignment File

We will now create an alignment using the syntenic blocks.

alignment <- AlignSynteny(synteny, "long_db")
## ================================================================================
## 
## Time difference of 219 secs

Lets create a structure holding all aligned syntenic blocks for a pair of sequences.

block <- unlist(alignment[[1]])

We can write to file one alignment at a time

writeXStringSet(block, "genome_blocks_out.fa")

Unannotated Gene Regions

Gene Information

Load the necessry libraries.

library(locfit)
library(Rsamtools)

Lets create a function that will load the gene region information in a GFF file and convert it to a bioconductor GRanges object

get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}

Get count in windows across the genome in 500bp segments.

setwd("~/Desktop/classroom/myfiles/other tools")
whole_genome <- csaw::windowCounts(
  file.path("windows.bam"),
  bin = TRUE,
  filter = 0,
  width = 500,
  param = csaw::readParam(
    minq = 20, # determines what is a passing read
    dedup = TRUE, # removes pcr duplicates
    pe = "both" # requires that both pairs of reads are aligned
  )
)

Since this is a single column of data, lets rename it

setwd("~/Desktop/classroom/myfiles/other tools")
colnames(whole_genome) <- c("small_data")

annotated_regions <- get_annotated_regions_from_gff(file.path("genes.gff"))

Overlaps

Now that we have the windows of high expression, we want to see if any of them overlap with annotated regions. First we need to load two more libraries.

library(IRanges)
library(SummarizedExperiment)

Find the overlaps between the windows we made.

windows_in_genes <- IRanges::overlapsAny(
  SummarizedExperiment::rowRanges(whole_genome), # creates a Granges object
  annotated_regions
)

windows_in_genes
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

Annotated and Unannotated Regions

Subset the whole genome object into annotated and unannotated regions.

annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]  

Use the function “assay()” to extract the actual counts from the Granges object.

assay(non_annotated_window_counts)
##      small_data
## [1,]          0
## [2,]          0
## [3,]         24
## [4,]         25
## [5,]          0
## [6,]          0

Per-Base Coverage

In this step, we use Rsamtools Pyleup() function to get a per-base coverage data frame. Each row represents a single nucleotide in the reference count and the count column gives the depth of coverage at that point.

First, load the “bumphunter” library.

library(bumphunter)

** FSJLKD

setwd("~/Desktop/classroom/myfiles/other tools")
pile_df <- Rsamtools::pileup(file.path("windows.bam"))

This step groups the reads with certain distance of each other into a cluster. We give the sequence names, position, and distance

clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap = 100)

table(clusters)
## clusters
##    1    2    3 
## 1486 1552 1520

In this step, we will map the reads to the regions we found for the genome.

bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff = 1)
## getSegments: segmenting
## getSegments: splitting
##    chr start  end value  area cluster indexStart indexEnd    L clusterL
## 3 Chr1  4503 5500  10.4 15811       3       3039     4558 1520     1520
## 1 Chr1   502 1500  10.0 14839       1          1     1486 1486     1486
## 2 Chr1  2501 3500   8.7 13436       2       1487     3038 1552     1552

GWAS

Identifying Variants

Load Data

Load the required libraries.

library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
library(VariantTools)

Load the data sets. We need “.Bam” and “.fa” for this pipeline to work.

setwd("~/Desktop/classroom/myfiles/gwas")
bam_file <- file.path(getwd(), "hg17_snps.bam")

fasta_file <- file.path(getwd(), "chr17.83k.fa")

Set up the genome object and the parameters object.

fa <- rtracklayer::FastaFile(fasta_file)

GMapGenome

Create a GMapGenome object, which desicribes the genome to the linear variant calling function.

setwd("~/Desktop/classroom/myfiles/gwas")
genome <- gmapR::GmapGenome(fa, create = TRUE)
## NOTE: genome 'chr17.83k' already exists, not overwriting

Determine Variants

This next step sets our parameter for what is considered a variant. There can be a lot of fine-tuning to this function.We are just going to use the standard settings.

qual_params <- TallyVariantsParam(
  genome = genome,
  minimum_mapq = 20)

var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)

Use the function “callVariants” in accordance with the parameters we defined above.

called_variants <- callVariants(bam_file,
                                qual_params,
                                calling.filter = var_params)

head(called_variants)
## VRanges object with 6 ranges and 17 metadata columns:
##           seqnames    ranges strand         ref              alt     totalDepth
##              <Rle> <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle>
##   [1] NC_000017.10        64      *           G                T            759
##   [2] NC_000017.10        69      *           G                T            812
##   [3] NC_000017.10        70      *           G                T            818
##   [4] NC_000017.10        73      *           T                A            814
##   [5] NC_000017.10        77      *           T                A            802
##   [6] NC_000017.10        78      *           G                T            798
##             refDepth       altDepth   sampleNames softFilterMatrix | n.read.pos
##       <integerOrRle> <integerOrRle> <factorOrRle>         <matrix> |  <integer>
##   [1]            739             20          <NA>                  |         17
##   [2]            790             22          <NA>                  |         19
##   [3]            796             22          <NA>                  |         20
##   [4]            795             19          <NA>                  |         13
##   [5]            780             22          <NA>                  |         19
##   [6]            777             21          <NA>                  |         17
##       n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
##            <integer>       <integer>  <integer>      <integer>   <integer>
##   [1]             64             759         20            739           0
##   [2]             69             812         22            790           0
##   [3]             70             818         22            796           0
##   [4]             70             814         19            795           0
##   [5]             70             802         22            780           0
##   [6]             70             798         21            777           0
##       count.minus.ref count.del.plus count.del.minus read.pos.mean
##             <integer>      <integer>       <integer>     <numeric>
##   [1]               0              0               0       30.9000
##   [2]               0              0               0       40.7273
##   [3]               0              0               0       34.7727
##   [4]               0              0               0       36.1579
##   [5]               0              0               0       38.3636
##   [6]               0              0               0       39.7143
##       read.pos.mean.ref read.pos.var read.pos.var.ref     mdfne mdfne.ref
##               <numeric>    <numeric>        <numeric> <numeric> <numeric>
##   [1]           32.8755      318.558          347.804        NA        NA
##   [2]           35.4190      377.004          398.876        NA        NA
##   [3]           36.3442      497.762          402.360        NA        NA
##   [4]           36.2176      519.551          402.843        NA        NA
##   [5]           36.0064      472.327          397.070        NA        NA
##   [6]           35.9241      609.076          390.463        NA        NA
##       count.high.nm count.high.nm.ref
##           <integer>         <integer>
##   [1]            20               738
##   [2]            22               789
##   [3]            22               796
##   [4]            19               769
##   [5]            22               780
##   [6]            21               777
##   -------
##   seqinfo: 1 sequence from chr17.83k genome
##   hardFilters(4): nonRef nonNRef readCount likelihoodRatio

We have identified 6 variants

Annotation

Now we move on to annotation and load the feature position information from gff.

get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}

You can also load this data from a bed file.

setwd("~/Desktop/classroom/myfiles/gwas")
genes <- get_annotated_regions_from_gff("chr17.83k.gff3")

Gene and Variant Overlaps

Calculate which variants overlap which genes.

overlaps <- GenomicRanges::findOverlaps(called_variants, genes)
overlaps
## Hits object with 12684 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]     35176           1
##       [2]     35176           2
##       [3]     35176           3
##       [4]     35177           1
##       [5]     35177           2
##       ...       ...         ...
##   [12680]     40944           2
##   [12681]     40944           7
##   [12682]     40945           1
##   [12683]     40945           2
##   [12684]     40945           7
##   -------
##   queryLength: 44949 / subjectLength: 8

Subset the Genes

Subset the genes with the list of overlaps.

identified <- genes[subjectHits(overlaps)]

Open Reading Frames

Load the required packages

library(Biostrings)
library(systemPipeR)

Load the data into a DNAStrings object, in this case, an Arabidopsis chloroplast genome.

setwd("~/Desktop/classroom/myfiles/gwas")
dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa.txt")

Predict ORFs

Predict the open reading frames with the function “predORF()”. We’ll predict all ORF on both strands.

predict_orfs <- predORF(dna_object, n = "all", type = "gr", mode = "ORF", strand = "both",
                        longest_disjoint = TRUE)
predict_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
##           seqnames        ranges strand | subject_id inframe2end
##              <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##      1 chloroplast   86762-93358      + |          1           2
##   1162 chloroplast     2056-2532      - |          1           3
##      2 chloroplast   72371-73897      + |          2           2
##   1163 chloroplast   77901-78362      - |          2           1
##      3 chloroplast   54937-56397      + |          3           3
##    ...         ...           ...    ... .        ...         ...
##   2497 chloroplast 129757-129762      - |       1336           3
##   2498 chloroplast 139258-139263      - |       1337           3
##   2499 chloroplast 140026-140031      - |       1338           3
##   2500 chloroplast 143947-143952      - |       1339           3
##   2501 chloroplast 153619-153624      - |       1340           3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

This printed out a GRanges object in return with 2, 501 open reading frames - this is far too many.

Filter ORFs

To filter out erroneous ORFs, we do a simulation. We first estimate the length an ORF can reach by chance. We will create a string of random nucleotides that is the length of our chloroplast genome and determine what the longer ORF that can arise by chance.

bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")

Now we need to ensure that our random nucleotides match the proportion of nucleotides in our chloroplast genome so we have no bias.

seq_length <- width(dna_object[1])
counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))} )
probs <- unlist(lapply(counts, function(base_count){signif(base_count/seq_length, 2)}))
probs
## [1] 6.5e-06 6.5e-06 6.5e-06 6.5e-06

Simulate the Genome

Now we can build our function to simulate the genome

get_longest_orf_in_random_genome <- function(x,
                                             length = 1000,
                                             probs = c(0.25, 0.25, 0,25, 0.25),
                                             bases = c("A", "T", "G", "C")){
  # Here we create our random genome and allow replacement for the next iteration
  random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
  random_dna_object <- DNAStringSet(random_genome)
  names(random_dna_object) <- c("random_dna_string")
  orfs <- predORF(random_dna_object, n = 1, type = "gr", mode = "ORF", strand = "both", longest_disjoint = TRUE)
  return(max(width(orfs)))
}

Predict ORFs

Now we use the function we just created to predict the ORFs in 10 random genomes.

random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length = seq_length, probs= probs, bases = bases))

Pull out the longest length from our 10 simulations.

longest_random_orf <- max(random_lengths)

Only keep the frames that are longer in our chloroplast genome.

keep <- width(predict_orfs) > longest_random_orf

orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
## GRanges object with 8 ranges and 2 metadata columns:
##        seqnames        ranges strand | subject_id inframe2end
##           <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##   1 chloroplast   86762-93358      + |          1           2
##   2 chloroplast   72371-73897      + |          2           2
##   3 chloroplast   54937-56397      + |          3           3
##   4 chloroplast   57147-58541      + |          4           1
##   5 chloroplast   33918-35141      + |          5           1
##   6 chloroplast   32693-33772      + |          6           2
##   7 chloroplast 109408-110436      + |          7           3
##   8 chloroplast 114461-115447      + |          8           2
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Write this data to file

extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")

KaryoplotR

Load the required packages

library(karyoploteR)
library(GenomicRanges)

Genome Object

Set up the genome object for our karyotype.

genome_df <- data.frame(
  # first we dictate the number of chromosomes
  chr = paste0("chr", 1:5),
  start = rep(1,5),
  # and then we will dictate the length of each chromosome 
  end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)

Convert the data frame to a granges object

genome_gr <- makeGRangesFromDataFrame(genome_df)

SNP Positions

Create some snp positions to map out. We do this by using the function “sample()”.

snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
  chr = paste0("chr", sample(1:5,25, replace = TRUE)),
  start = snp_pos,
  end = snp_pos
)

Again we convert the data frame to GRanges.

snps_gr <- makeGRangesFromDataFrame(snps)

SNP Labels

Now lets create some snp labels.

snp_labels <- paste0("snp_", 1:25)

Plot the SNPs

First, we have to set the margins for the plot.

plot.params <- getDefaultPlotParams(plot.type = 1)

plot.params$data1outmargin <- 600

Now lets plot our SNPs.

kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels) 

Add Numeric Data

We can also add some numeric data to our plots. We will generate 100 random numbers that plot to 100 windows on chromosome 4.

numeric_data <- data.frame(
  y = rnorm(100,mean = 1, sd = 0.5),
  chr = rep("chr4", 100),
  start = seq(1,20862711, 20862711/100),
  end = seq(1,20862711, 20862711/100)
)

Now lets make the data a GRanges object.

numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)

Again lets set our plot parameters.

plot.params <- getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800

Plot the data.

kp <- plotKaryotype(genome = genome_gr, plot.type = 2, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels) 
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)

Phylogenic Analysis

Treeio

Load the Data

First, load the required packages.

library(ggplot2)
library(ggtree)
library(treeio)

Now, load the raw tree data. It’s a Newick format so we use:

setwd("~/Desktop/classroom/myfiles/phylogenic analysis")
itol <- ape::read.tree("itol.nwk")

Basic Phylogenic Tree

Print out a very basic phylogenic tree.

ggtree(itol)

Change the format to make it a circular tree.

ggtree(itol, layout = "circular")

Change the left-right/ up-down direction.

ggtree(itol) + coord_flip() + scale_x_reverse()

Add Names

By using the function “geom_tipla()” we can add names to the end of tips.

ggtree(itol) + geom_tiplab(color = "royalblue", size = 1)

Highlight Sections

We can highlight clades in unrooted trees with blobs of color using the function “geom_hilight”.

ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "coral")

MRCAs

We can use the MRCA (most recent common ancestor) function to find the node we want. Before this, we need to install the package “BAMMtools”.

install.packages("BAMMtools")
library(BAMMtools)
getmrca(itol, "Photorhabdus_luminescens", "Blochmannia_floridanus")
## [1] 206

If we want to highlight the section of the most recent common ancestor between the two above, we can do:

ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "salmon")

Treespace

Quantifying Difference Between Trees with Treespace

Lets load our required packages

library(ape)
library(adegraphics)
library(treespace)

Load Tree Files

Load all of the treefiles into a multiPhylo object.

setwd("~/Desktop/classroom/myfiles/phylogenic analysis")
treefiles <- list.files(file.path(getwd(), "genetrees"), full.names = TRUE)
tree_list <- lapply(treefiles, read.tree)

Next, change the class of the tree list to multiPhylo.

class(tree_list)
## [1] "list"
class(tree_list) <- "multiPhylo"

class(tree_list)
## [1] "multiPhylo"

Kendal-Coljin Distances

Now we can compute the kendal-coljin distances between trees. This function does a lot of analyses.

  • First it runs a pairwise comparison of all trees in the input.
  • Second it carries out clustering using PCA. These results are returned in a list of objects, where the function “$D” contains the pairwise metric of the trees and th function “pco”contains the PCA.

The method we use (kendal-coljin) is particularly suitable for rooted trees as we have here. The option NF tells us how many principal components to retain.

comparisons <- treespace(tree_list, nf = 3)

Plot Distances

Plot the pairwise distances between trees with the function “table.image”.

table.image(comparisons$D, nclass = 25)

PCA and Clusters

Print the PCA and clusters. This is how the group of trees cluster.

plotGroves(comparisons$pco, lab.show = TRUE, lab.cex = 1.5)

groves <- findGroves(comparisons, nclust = 2)
plotGroves(groves)

Binding Trees

Extracting and working with subtrees using APE

Load the required packages and data.

library(ape)

Load the tree data we will be working with.

setwd("~/Desktop/classroom/myfiles/phylogenic analysis")
newick <- read.tree("mammal_tree.nwk.txt")

Subset Trees

Break the main tree into subtrees using the function “subtrees”.

subtrees <- subtrees(newick)

Plot the Tree

Lets plot the tree to see what it looks like.

plot(newick)

Subset this plot using the “node” function.

plot(subtrees[[4]], sub = "Node 4")

Extract the tree manually.

small_tree <- extract.clade(newick, 9)

plot(small_tree)

Bind the Trees

Bind the two trees together.

new_tree<- bind.tree(newick, small_tree, 3)
plot(new_tree)

Trees from Alignment

In this section, we will be reconstructing Trees from alignments.

Lets load the required packages.

library(Biostrings)
library(msa)
library(phangorn)

First we will load the sequences into a seqs variable.

setwd("~/Desktop/classroom/myfiles/phylogenic analysis")
seqs <- readAAStringSet("abc.fa")

Construct Alignment

Now, lets construct an alignnment with the msa package and ClustalOmega.

aln <- msa::msa(seqs, method = c("ClustalOmega"))
## using Gonnet

To create a tree, we need to convert the alignment into a phyData object.

aln <- as.phyDat(aln, type = "AA")

class(aln)
## [1] "phyDat"

Construct the trees.

In this step we will make the trees.

Trees are made from a distance matrix, which can be computed with the function “dist.ml()”.

  • ML stands for maximum liklihood
dist_mat <- dist.ml(aln)

Now we pass the distance matrix to UPGMA to make a UPGMA tree.

upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main = "UPGMA")

We can conversely pass the distance matrix to a neighbor joining function.

nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")

Bootstrap Support

We will use the function “bootstraps.phyDat()” to compute bootstrap support for the branches of the tree. The first argument is the object (aln), while the second argument is the function nj.

Bootstraps are essentially confidence intervals for how the clade is placed in the correct position.

fit <- pml(nj_tree, aln)

bootstraps <- bootstrap.phyDat(aln, FUN = function(x) (NJ(dist.ml(x))), bs = 100)
plotBS(nj_tree, bootstraps, p = 10)