This is a tutorial on how to use R markdown for reproducible research.
Here we can type long passages or descriptions of our data without the need of “hashing” out our comments with the # symbol. In our first example, we will be using the ToothGrowth dataset.In this experiment, Guinea pigs (literal) were given different amounts of vitamin C to see the effects on the animal’s tooth growth.
To run R code in a markdown file, we need to denote the section that is considered R code. We call these “code chunks.”
Below is a code chunk:
Toothdata <- ToothGrowth
head(Toothdata)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
As you can see, from running the “play” button on the code chunk, the results are printed inline of the r markdown file.
fit <- lm(len ~ dose, data = Toothdata)
b <- fit$coefficients
plot(len ~ dose, data = Toothdata)
abline(lm(len ~ dose, data = Toothdata))
Figure 1: The tooth growth of Guinea Pigs when given variable amounts of Vitamin C
The slope of the regression line is 9.7635714
We can also oput sections and subsections in our r markdown file, similar to numbers or bullet points in a word document. This is done with the “#” that we previously used to denote text in an R script.
Make sure that you put a space after the hashtag, otherwise it will not work!
We can also add bullet point-type marks in our r markdown file.
Its important to note here that in R markdown, indentation matters!
We can put really nice quotes into the markdown document. We do this by using thee “>” symbol.
“Genes are like the story, and DNA is the language that the story is written in.”
— Sam Kean
Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they aree in a web browser and will redirect the reader too the matreial that you are interested in showing them. Here we will use the link to R markdown’s homepage for this example. RMarkdown
We can also put nicely formatted formulas into Markdown using two dollar signs.
Hardy-Weinberg Formula
\[p^2 + 2pq + q^2 = 1\]
And you can get really complex as well!
\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
## [1] "Hello World"
There are also options for your R markdown file on how knitr interprets the code chunk. There are the following options.
Eval (T or F): whether or not to evaluate the code chunk.
Echo (T or F): Whether or not to show the code for the chunk, but results will still print.
Cache: If enable, the same code chunk will not be evaluated the next time that the knitr is run. Great for code that has LONG runtimes.
fig.width or fig.height: the (graphical device) size of the R. plots in inches. The figures are first written to the knitr document then to files that are saved separately.
out.width or out.height: The output size of the R plots IN THE R DOCUMENT.
fig.cap:
We can also add a table of contents to out HTML Document. We do this by altering the YAML code (the weird code chunk at the VERY top of the document.) We can add this:
title: “HTML_Tutorial” author: “AGibbons” date: “2024-11-11” output: html_document: toc: true toc_float: true
This will give us a very nice floating table of contents on the right hand side of the document.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
??flights
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
filter(my_data, month ==10, day ==14)
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
oct_14_flight <- filter(my_data, month ==10, day ==14)
(oct_14_flight_2 <- filter(my_data, month ==10, day ==14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we don’t use the == to mean equals, we get this: (oct_14_flight_2 <- filter(my_data, month =10, day =14))
March_April_Flights <- filter(my_data, month == 3 | month == 4)
March_4th_Flights <- filter(my_data, month == 3 & day == 4)
Non_jan_flights <- filter(my_data, month != 1)
arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
descending <- arrange(my_data, desc(year), desc(day), desc(month))
Missing values are always placed at the end of the dataframe regardless of ascending or descending.
calendar <- select(my_data, year, month, day)
print(calendar)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
calendar2 <- select(my_data, year:day)
calendar3 <- select(my_data, year:carrier)
everything_else <- select(my_data, -(year:day))
everything_else2 <- select(my_data, !(year:day))
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
rename(my_data, departure_time = dep_time)
## # A tibble: 336,776 × 19
## year month day departure_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
my_data <- rename(my_data, departure_time = dep_time)
We have the mutate() function for that!
my_data_small <- select(my_data, year:day, distance, air_time)
mutate(my_data_small, speed = distance / air_time * 60)
## # A tibble: 336,776 × 6
## year month day distance air_time speed
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2013 1 1 1400 227 370.
## 2 2013 1 1 1416 227 374.
## 3 2013 1 1 1089 160 408.
## 4 2013 1 1 1576 183 517.
## 5 2013 1 1 762 116 394.
## 6 2013 1 1 719 150 288.
## 7 2013 1 1 1065 158 404.
## 8 2013 1 1 229 53 259.
## 9 2013 1 1 944 140 405.
## 10 2013 1 1 733 138 319.
## # ℹ 336,766 more rows
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)
airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)
summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
So we can see here that the average delay is about 12 minutes.
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ℹ 355 more rows
As you can see, we now have the delay by the days of the year.
summarize(by_day, delay = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # ℹ 355 more rows
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
sum(is.na(my_data$dep_delay))
## [1] 8255
sum(!is.na(my_data$dep_delay))
## [1] 328521
We can then summarize the number of flights by minutes delayed.
my_data %>%
group_by(year, month, day) %>%
summarize(mean = mean(departure_time, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 1385.
## 2 2013 1 2 1354.
## 3 2013 1 3 1357.
## 4 2013 1 4 1348.
## 5 2013 1 5 1326.
## 6 2013 1 6 1399.
## 7 2013 1 7 1341.
## 8 2013 1 8 1335.
## 9 2013 1 9 1326.
## 10 2013 1 10 1333.
## # ℹ 355 more rows
library(tibble)
Tibbles are modified dataframes which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
As we can see, we have the same dataframe, but we have different features
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
## # A tibble: 5 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
tribble(
~ genea, ~geneb, ~genec,
#############################################
110, 112, 114,
6, 5, 4
)
## # A tibble: 2 × 3
## genea geneb genec
## <dbl> <dbl> <dbl>
## 1 110 112 114
## 2 6 5 4
Tibbles are built to not overwhelm your console when printing data by only showing the first few lines.
print(by_day)
as.data.frame(by_day)
head(by_day)
nycflights13::flights %>%
print(n=10, width = Inf)
df_tibble <- tibble(nycflights13::flights)
df_tibble
We can subset by COLUMN name using the “$”
df_tibble$carrier
We can subset by POSITION using “[[ ]]”
df_tibble[[2]]
If you want to use this in a pipe, you need to use the “.” placeholder
df_tibble %>%
.$carrier
Some older functions do not like tibbles, thus you might have to convert them back to dataframes.
class(df_tibble)
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
df_tibble
head(df_tibble_2)
library(tidyverse)
It is impossible to satisfy two of the three rules.
Picking one consistent method of data storage makes for easier understanding of your code and what is happening “under the hood” or behind the scenes.
bmi <-tibble(women)
bmi %>%
mutate(bmi = (703 * weight)/(height)^2)
## # A tibble: 15 × 3
## height weight bmi
## <dbl> <dbl> <dbl>
## 1 58 115 24.0
## 2 59 117 23.6
## 3 60 120 23.4
## 4 61 123 23.2
## 5 62 126 23.0
## 6 63 129 22.8
## 7 64 132 22.7
## 8 65 135 22.5
## 9 66 139 22.4
## 10 67 142 22.2
## 11 68 146 22.2
## 12 69 150 22.1
## 13 70 154 22.1
## 14 71 159 22.2
## 15 72 164 22.2
We’ll use the built-in data from tidyverse for this part.
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
As you can see from this data, we have one variable in Column A (country), but columns b and c are two of the same. Thus, there are two observations in each row. To fix this, we can use the gather function:
table4a %>%
gather('1999', '2000', key = 'year', value = 'cases')
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
As you can see, we have the same problem in table 4b. We can use the gather function again to fix this.
table4b %>%
gather('1999', '2000', key = 'year', value = 'population')
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
table4a <- table4a %>%
gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>%
gather('1999', '2000', key = 'year', value = 'population')
left_join(table4a, table4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
You can see that we have redundant info in columns 1 and 2. We can fix that by combining rows 1 & 2, 3 & 4, etc:
spread(table2, key = type, value = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Type is the column we are turning into columns and the value is what becomes rows/observations
In summary, spread makes long tables shorter while gather makes wide tables narrower and longer.
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
As you can see, the rate is just the population and cases combined. We can use “separate” to fix this.
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
However, if you notice, the column type is not correct.
table3 %>%
separate(rate, into =c("cases", "population"), convert = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can specify how you want to base your separation:
table3 %>%
separate(rate, into =c("cases", "population"), sep = "/", convert = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table3 %>%
separate(
year,
into = c("century", "year"),
convert= TRUE,
sep = 2
)
## # A tibble: 6 × 4
## country century year rate
## <chr> <int> <int> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 0 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 0 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 0 213766/1280428583
table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5 %>%
unite(date, century, year)
## # A tibble: 6 × 3
## country date rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
table5 %>%
unite(date, century, year, sep = "")
## # A tibble: 6 × 3
## country date rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
gene_data <- tibble(
gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
nuc = c(20, 22, 24, 25, NA, 42, 67),
run = c(1, 2, 3, 4, 2, 3, 4)
)
gene_data
## # A tibble: 7 × 3
## gene nuc run
## <chr> <dbl> <dbl>
## 1 a 20 1
## 2 a 22 2
## 3 a 24 3
## 4 a 25 4
## 5 b NA 2
## 6 b 42 3
## 7 b 67 4
The nucleotide count for Gene b run 2 is EXPLICITLY missing. The nucleotide count for Gene b run 1 is IMPLICITLY missing.
gene_data %>%
spread(gene, nuc)
## # A tibble: 4 × 3
## run a b
## <dbl> <dbl> <dbl>
## 1 1 20 NA
## 2 2 22 NA
## 3 3 24 42
## 4 4 25 67
gene_data %>%
spread(gene, nuc) %>%
gather(gene, nuc, 'a':'b', na.rm = TRUE)
## # A tibble: 6 × 3
## run gene nuc
## <dbl> <chr> <dbl>
## 1 1 a 20
## 2 2 a 22
## 3 3 a 24
## 4 4 a 25
## 5 3 b 42
## 6 4 b 67
gene_data %>%
complete(gene, run)
## # A tibble: 8 × 3
## gene run nuc
## <chr> <dbl> <dbl>
## 1 a 1 20
## 2 a 2 22
## 3 a 3 24
## 4 a 4 25
## 5 b 1 NA
## 6 b 2 NA
## 7 b 3 42
## 8 b 4 67
treatment <- tribble(
~ person, ~treatment, ~response,
#########################################
"Isaac", 1, 7,
NA, 2, 10,
NA, 3, 9,
"VDB", 1, 8,
NA, 2, 11,
NA, 3, 10,
)
treatment
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 VDB 1 8
## 5 <NA> 2 11
## 6 <NA> 3 10
treatment %>%
fill(person)
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 Isaac 2 10
## 3 Isaac 3 9
## 4 VDB 1 8
## 5 VDB 2 11
## 6 VDB 3 10
library(tidyverse)
library(nycflights13)
airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
planes
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## # ℹ 3,312 more rows
weather
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Flights -> planes based on tailnumber
Flights -> airlines through carrier
Flights -> airports through origin AND destination
Flights -> weather via origin, year/month/day/hour
Primary key uniquely identifies an observation in its own table.
planes %>%
count(tailnum) %>%
filter(n>1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
This indicates that the tailnumber is unique
planes %>%
count(model) %>%
filter(n>1)
## # A tibble: 79 × 2
## model n
## <chr> <int>
## 1 717-200 88
## 2 737-301 2
## 3 737-3G7 2
## 4 737-3H4 105
## 5 737-3K2 2
## 6 737-3L9 2
## 7 737-3Q8 5
## 8 737-3TO 2
## 9 737-401 4
## 10 737-4B7 18
## # ℹ 69 more rows
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 × 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ℹ 336,766 more rows
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = 'carrier')
## # A tibble: 336,776 × 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ℹ 336,766 more rows
We’ve now added the airline name to our dataframe from the airline data
Inner joins (inner_join) matches a pair of observations when their key is equal
Outer joins (outer_join) keeps observations that appear in at least one table.
library(tidyverse)
library(stringr)
string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite'
string1
## [1] "this is a string"
string2
## [1] "to put a \"quote\" in your string, use the opposite"
If you forget to close your string, you’ll get this:
string3 <- "where is this string going?"
string3
## [1] "where is this string going?"
Just hit escape and try again.
string4 <- c("one", "two", "three")
string4
## [1] "one" "two" "three"
str_length(string3)
## [1] 27
str_length(string4)
## [1] 3 3 5
str_c("X", "Y")
## [1] "XY"
str_c(string1, string2)
## [1] "this is a stringto put a \"quote\" in your string, use the opposite"
str_c(string1, string2, sep = " ")
## [1] "this is a string to put a \"quote\" in your string, use the opposite"
str_c("X", "Y", "Z", sep = "_")
## [1] "X_Y_Z"
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4,6)
## [1] "123" "234" "456"
This just drops the first four letters from the strings
Or, you can use negative to count back from the end.
str_sub(HSP, -3, -1)
## [1] "123" "234" "456"
HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_lower(HSP)
## [1] "hsp123" "hsp234" "hsp456"
HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_upper(HSP)
## [1] "HSP123" "HSP234" "HSP456"
install.packages("htmlwidgets")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')
str_view(x, "G")
## [1] │ ATTA<G>A
## [2] │ C<G>CCCCC<G><G>AT
str_view(x, "TA")
## [1] │ AT<TA>GA
## [3] │ <TA>T<TA>
str_view(".G.")
## [1] │ .G.
str_view(x, "^TA")
## [3] │ <TA>TTA
str_view(x, "TA$")
## [3] │ TAT<TA>
-“ matches any digit -”” matches any space - “[abc]” matches a, b, or c
str_view(x, "TA[GT]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA
[^abc] matches anything BUT a, b, or c
str_view(x, "TA[^T]")
## [1] │ AT<TAG>A
You can also use | to pick between 2 alternatives
str_view(x, "TA[G|T]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA
y <- c("apple", "banana", "pear")
y
## [1] "apple" "banana" "pear"
str_detect(y, "e")
## [1] TRUE FALSE TRUE
How many common words start with letter “e”?
words
## [1] "a" "able" "about" "absolute" "accept"
## [6] "account" "achieve" "across" "act" "active"
## [11] "actual" "add" "address" "admit" "advertise"
## [16] "affect" "afford" "after" "afternoon" "again"
## [21] "against" "age" "agent" "ago" "agree"
## [26] "air" "all" "allow" "almost" "along"
## [31] "already" "alright" "also" "although" "always"
## [36] "america" "amount" "and" "another" "answer"
## [41] "any" "apart" "apparent" "appear" "apply"
## [46] "appoint" "approach" "appropriate" "area" "argue"
## [51] "arm" "around" "arrange" "art" "as"
## [56] "ask" "associate" "assume" "at" "attend"
## [61] "authority" "available" "aware" "away" "awful"
## [66] "baby" "back" "bad" "bag" "balance"
## [71] "ball" "bank" "bar" "base" "basis"
## [76] "be" "bear" "beat" "beauty" "because"
## [81] "become" "bed" "before" "begin" "behind"
## [86] "believe" "benefit" "best" "bet" "between"
## [91] "big" "bill" "birth" "bit" "black"
## [96] "bloke" "blood" "blow" "blue" "board"
## [101] "boat" "body" "book" "both" "bother"
## [106] "bottle" "bottom" "box" "boy" "break"
## [111] "brief" "brilliant" "bring" "britain" "brother"
## [116] "budget" "build" "bus" "business" "busy"
## [121] "but" "buy" "by" "cake" "call"
## [126] "can" "car" "card" "care" "carry"
## [131] "case" "cat" "catch" "cause" "cent"
## [136] "centre" "certain" "chair" "chairman" "chance"
## [141] "change" "chap" "character" "charge" "cheap"
## [146] "check" "child" "choice" "choose" "Christ"
## [151] "Christmas" "church" "city" "claim" "class"
## [156] "clean" "clear" "client" "clock" "close"
## [161] "closes" "clothe" "club" "coffee" "cold"
## [166] "colleague" "collect" "college" "colour" "come"
## [171] "comment" "commit" "committee" "common" "community"
## [176] "company" "compare" "complete" "compute" "concern"
## [181] "condition" "confer" "consider" "consult" "contact"
## [186] "continue" "contract" "control" "converse" "cook"
## [191] "copy" "corner" "correct" "cost" "could"
## [196] "council" "count" "country" "county" "couple"
## [201] "course" "court" "cover" "create" "cross"
## [206] "cup" "current" "cut" "dad" "danger"
## [211] "date" "day" "dead" "deal" "dear"
## [216] "debate" "decide" "decision" "deep" "definite"
## [221] "degree" "department" "depend" "describe" "design"
## [226] "detail" "develop" "die" "difference" "difficult"
## [231] "dinner" "direct" "discuss" "district" "divide"
## [236] "do" "doctor" "document" "dog" "door"
## [241] "double" "doubt" "down" "draw" "dress"
## [246] "drink" "drive" "drop" "dry" "due"
## [251] "during" "each" "early" "east" "easy"
## [256] "eat" "economy" "educate" "effect" "egg"
## [261] "eight" "either" "elect" "electric" "eleven"
## [266] "else" "employ" "encourage" "end" "engine"
## [271] "english" "enjoy" "enough" "enter" "environment"
## [276] "equal" "especial" "europe" "even" "evening"
## [281] "ever" "every" "evidence" "exact" "example"
## [286] "except" "excuse" "exercise" "exist" "expect"
## [291] "expense" "experience" "explain" "express" "extra"
## [296] "eye" "face" "fact" "fair" "fall"
## [301] "family" "far" "farm" "fast" "father"
## [306] "favour" "feed" "feel" "few" "field"
## [311] "fight" "figure" "file" "fill" "film"
## [316] "final" "finance" "find" "fine" "finish"
## [321] "fire" "first" "fish" "fit" "five"
## [326] "flat" "floor" "fly" "follow" "food"
## [331] "foot" "for" "force" "forget" "form"
## [336] "fortune" "forward" "four" "france" "free"
## [341] "friday" "friend" "from" "front" "full"
## [346] "fun" "function" "fund" "further" "future"
## [351] "game" "garden" "gas" "general" "germany"
## [356] "get" "girl" "give" "glass" "go"
## [361] "god" "good" "goodbye" "govern" "grand"
## [366] "grant" "great" "green" "ground" "group"
## [371] "grow" "guess" "guy" "hair" "half"
## [376] "hall" "hand" "hang" "happen" "happy"
## [381] "hard" "hate" "have" "he" "head"
## [386] "health" "hear" "heart" "heat" "heavy"
## [391] "hell" "help" "here" "high" "history"
## [396] "hit" "hold" "holiday" "home" "honest"
## [401] "hope" "horse" "hospital" "hot" "hour"
## [406] "house" "how" "however" "hullo" "hundred"
## [411] "husband" "idea" "identify" "if" "imagine"
## [416] "important" "improve" "in" "include" "income"
## [421] "increase" "indeed" "individual" "industry" "inform"
## [426] "inside" "instead" "insure" "interest" "into"
## [431] "introduce" "invest" "involve" "issue" "it"
## [436] "item" "jesus" "job" "join" "judge"
## [441] "jump" "just" "keep" "key" "kid"
## [446] "kill" "kind" "king" "kitchen" "knock"
## [451] "know" "labour" "lad" "lady" "land"
## [456] "language" "large" "last" "late" "laugh"
## [461] "law" "lay" "lead" "learn" "leave"
## [466] "left" "leg" "less" "let" "letter"
## [471] "level" "lie" "life" "light" "like"
## [476] "likely" "limit" "line" "link" "list"
## [481] "listen" "little" "live" "load" "local"
## [486] "lock" "london" "long" "look" "lord"
## [491] "lose" "lot" "love" "low" "luck"
## [496] "lunch" "machine" "main" "major" "make"
## [501] "man" "manage" "many" "mark" "market"
## [506] "marry" "match" "matter" "may" "maybe"
## [511] "mean" "meaning" "measure" "meet" "member"
## [516] "mention" "middle" "might" "mile" "milk"
## [521] "million" "mind" "minister" "minus" "minute"
## [526] "miss" "mister" "moment" "monday" "money"
## [531] "month" "more" "morning" "most" "mother"
## [536] "motion" "move" "mrs" "much" "music"
## [541] "must" "name" "nation" "nature" "near"
## [546] "necessary" "need" "never" "new" "news"
## [551] "next" "nice" "night" "nine" "no"
## [556] "non" "none" "normal" "north" "not"
## [561] "note" "notice" "now" "number" "obvious"
## [566] "occasion" "odd" "of" "off" "offer"
## [571] "office" "often" "okay" "old" "on"
## [576] "once" "one" "only" "open" "operate"
## [581] "opportunity" "oppose" "or" "order" "organize"
## [586] "original" "other" "otherwise" "ought" "out"
## [591] "over" "own" "pack" "page" "paint"
## [596] "pair" "paper" "paragraph" "pardon" "parent"
## [601] "park" "part" "particular" "party" "pass"
## [606] "past" "pay" "pence" "pension" "people"
## [611] "per" "percent" "perfect" "perhaps" "period"
## [616] "person" "photograph" "pick" "picture" "piece"
## [621] "place" "plan" "play" "please" "plus"
## [626] "point" "police" "policy" "politic" "poor"
## [631] "position" "positive" "possible" "post" "pound"
## [636] "power" "practise" "prepare" "present" "press"
## [641] "pressure" "presume" "pretty" "previous" "price"
## [646] "print" "private" "probable" "problem" "proceed"
## [651] "process" "produce" "product" "programme" "project"
## [656] "proper" "propose" "protect" "provide" "public"
## [661] "pull" "purpose" "push" "put" "quality"
## [666] "quarter" "question" "quick" "quid" "quiet"
## [671] "quite" "radio" "rail" "raise" "range"
## [676] "rate" "rather" "read" "ready" "real"
## [681] "realise" "really" "reason" "receive" "recent"
## [686] "reckon" "recognize" "recommend" "record" "red"
## [691] "reduce" "refer" "regard" "region" "relation"
## [696] "remember" "report" "represent" "require" "research"
## [701] "resource" "respect" "responsible" "rest" "result"
## [706] "return" "rid" "right" "ring" "rise"
## [711] "road" "role" "roll" "room" "round"
## [716] "rule" "run" "safe" "sale" "same"
## [721] "saturday" "save" "say" "scheme" "school"
## [726] "science" "score" "scotland" "seat" "second"
## [731] "secretary" "section" "secure" "see" "seem"
## [736] "self" "sell" "send" "sense" "separate"
## [741] "serious" "serve" "service" "set" "settle"
## [746] "seven" "sex" "shall" "share" "she"
## [751] "sheet" "shoe" "shoot" "shop" "short"
## [756] "should" "show" "shut" "sick" "side"
## [761] "sign" "similar" "simple" "since" "sing"
## [766] "single" "sir" "sister" "sit" "site"
## [771] "situate" "six" "size" "sleep" "slight"
## [776] "slow" "small" "smoke" "so" "social"
## [781] "society" "some" "son" "soon" "sorry"
## [786] "sort" "sound" "south" "space" "speak"
## [791] "special" "specific" "speed" "spell" "spend"
## [796] "square" "staff" "stage" "stairs" "stand"
## [801] "standard" "start" "state" "station" "stay"
## [806] "step" "stick" "still" "stop" "story"
## [811] "straight" "strategy" "street" "strike" "strong"
## [816] "structure" "student" "study" "stuff" "stupid"
## [821] "subject" "succeed" "such" "sudden" "suggest"
## [826] "suit" "summer" "sun" "sunday" "supply"
## [831] "support" "suppose" "sure" "surprise" "switch"
## [836] "system" "table" "take" "talk" "tape"
## [841] "tax" "tea" "teach" "team" "telephone"
## [846] "television" "tell" "ten" "tend" "term"
## [851] "terrible" "test" "than" "thank" "the"
## [856] "then" "there" "therefore" "they" "thing"
## [861] "think" "thirteen" "thirty" "this" "thou"
## [866] "though" "thousand" "three" "through" "throw"
## [871] "thursday" "tie" "time" "to" "today"
## [876] "together" "tomorrow" "tonight" "too" "top"
## [881] "total" "touch" "toward" "town" "trade"
## [886] "traffic" "train" "transport" "travel" "treat"
## [891] "tree" "trouble" "true" "trust" "try"
## [896] "tuesday" "turn" "twelve" "twenty" "two"
## [901] "type" "under" "understand" "union" "unit"
## [906] "unite" "university" "unless" "until" "up"
## [911] "upon" "use" "usual" "value" "various"
## [916] "very" "video" "view" "village" "visit"
## [921] "vote" "wage" "wait" "walk" "wall"
## [926] "want" "war" "warm" "wash" "waste"
## [931] "watch" "water" "way" "we" "wear"
## [936] "wednesday" "wee" "week" "weigh" "welcome"
## [941] "well" "west" "what" "when" "where"
## [946] "whether" "which" "while" "white" "who"
## [951] "whole" "why" "wide" "wife" "will"
## [956] "win" "wind" "window" "wish" "with"
## [961] "within" "without" "woman" "wonder" "wood"
## [966] "word" "work" "world" "worry" "worse"
## [971] "worth" "would" "write" "wrong" "year"
## [976] "yes" "yesterday" "yet" "you" "young"
sum(str_detect(words, "e"))
## [1] 536
Lets get more complex. What proportion of words end in a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306
mean(str_detect(words, "^[aeiou]"))
## [1] 0.1785714
Lets find all the words that don’t contain “o” or “u”
no_o <- !str_detect(words, "[ou]")
no_o
## [1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
## [13] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [25] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [37] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [49] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [61] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [145] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [157] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [205] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [217] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [229] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
## [253] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [265] TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
## [277] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [289] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## [313] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [325] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [361] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [373] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [385] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
## [421] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
## [433] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [445] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## [457] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [469] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [481] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [493] FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [505] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [517] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [541] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [553] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [565] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [577] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [589] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [601] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [613] TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [625] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [637] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
## [649] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [661] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [673] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [685] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
## [697] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [709] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [721] FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## [733] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [745] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [757] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [769] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
## [781] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [793] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [805] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
## [817] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [829] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [841] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [853] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [865] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [877] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## [889] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [901] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [913] FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [925] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [937] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [949] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [961] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [973] TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
Now lets extract
words[!str_detect(words, "[ou]")]
## [1] "a" "able" "accept" "achieve" "act"
## [6] "active" "add" "address" "admit" "advertise"
## [11] "affect" "after" "again" "against" "age"
## [16] "agent" "agree" "air" "all" "already"
## [21] "alright" "always" "america" "and" "answer"
## [26] "any" "apart" "apparent" "appear" "apply"
## [31] "area" "arm" "arrange" "art" "as"
## [36] "ask" "at" "attend" "available" "aware"
## [41] "away" "baby" "back" "bad" "bag"
## [46] "balance" "ball" "bank" "bar" "base"
## [51] "basis" "be" "bear" "beat" "bed"
## [56] "begin" "behind" "believe" "benefit" "best"
## [61] "bet" "between" "big" "bill" "birth"
## [66] "bit" "black" "break" "brief" "brilliant"
## [71] "bring" "britain" "by" "cake" "call"
## [76] "can" "car" "card" "care" "carry"
## [81] "case" "cat" "catch" "cent" "centre"
## [86] "certain" "chair" "chairman" "chance" "change"
## [91] "chap" "character" "charge" "cheap" "check"
## [96] "child" "Christ" "Christmas" "city" "claim"
## [101] "class" "clean" "clear" "client" "create"
## [106] "dad" "danger" "date" "day" "dead"
## [111] "deal" "dear" "debate" "decide" "deep"
## [116] "definite" "degree" "department" "depend" "describe"
## [121] "design" "detail" "die" "difference" "dinner"
## [126] "direct" "district" "divide" "draw" "dress"
## [131] "drink" "drive" "dry" "each" "early"
## [136] "east" "easy" "eat" "effect" "egg"
## [141] "eight" "either" "elect" "electric" "eleven"
## [146] "else" "end" "engine" "english" "enter"
## [151] "especial" "even" "evening" "ever" "every"
## [156] "evidence" "exact" "example" "except" "exercise"
## [161] "exist" "expect" "expense" "experience" "explain"
## [166] "express" "extra" "eye" "face" "fact"
## [171] "fair" "fall" "family" "far" "farm"
## [176] "fast" "father" "feed" "feel" "few"
## [181] "field" "fight" "file" "fill" "film"
## [186] "final" "finance" "find" "fine" "finish"
## [191] "fire" "first" "fish" "fit" "five"
## [196] "flat" "fly" "france" "free" "friday"
## [201] "friend" "game" "garden" "gas" "general"
## [206] "germany" "get" "girl" "give" "glass"
## [211] "grand" "grant" "great" "green" "hair"
## [216] "half" "hall" "hand" "hang" "happen"
## [221] "happy" "hard" "hate" "have" "he"
## [226] "head" "health" "hear" "heart" "heat"
## [231] "heavy" "hell" "help" "here" "high"
## [236] "hit" "idea" "identify" "if" "imagine"
## [241] "in" "increase" "indeed" "inside" "instead"
## [246] "interest" "invest" "it" "item" "keep"
## [251] "key" "kid" "kill" "kind" "king"
## [256] "kitchen" "lad" "lady" "land" "large"
## [261] "last" "late" "law" "lay" "lead"
## [266] "learn" "leave" "left" "leg" "less"
## [271] "let" "letter" "level" "lie" "life"
## [276] "light" "like" "likely" "limit" "line"
## [281] "link" "list" "listen" "little" "live"
## [286] "machine" "main" "make" "man" "manage"
## [291] "many" "mark" "market" "marry" "match"
## [296] "matter" "may" "maybe" "mean" "meaning"
## [301] "meet" "member" "middle" "might" "mile"
## [306] "milk" "mind" "minister" "miss" "mister"
## [311] "mrs" "name" "near" "necessary" "need"
## [316] "never" "new" "news" "next" "nice"
## [321] "night" "nine" "pack" "page" "paint"
## [326] "pair" "paper" "paragraph" "parent" "park"
## [331] "part" "party" "pass" "past" "pay"
## [336] "pence" "per" "percent" "perfect" "perhaps"
## [341] "pick" "piece" "place" "plan" "play"
## [346] "please" "practise" "prepare" "present" "press"
## [351] "pretty" "price" "print" "private" "rail"
## [356] "raise" "range" "rate" "rather" "read"
## [361] "ready" "real" "realise" "really" "receive"
## [366] "recent" "red" "refer" "regard" "remember"
## [371] "represent" "research" "respect" "rest" "rid"
## [376] "right" "ring" "rise" "safe" "sale"
## [381] "same" "save" "say" "scheme" "science"
## [386] "seat" "secretary" "see" "seem" "self"
## [391] "sell" "send" "sense" "separate" "serve"
## [396] "service" "set" "settle" "seven" "sex"
## [401] "shall" "share" "she" "sheet" "sick"
## [406] "side" "sign" "similar" "simple" "since"
## [411] "sing" "single" "sir" "sister" "sit"
## [416] "site" "six" "size" "sleep" "slight"
## [421] "small" "space" "speak" "special" "specific"
## [426] "speed" "spell" "spend" "staff" "stage"
## [431] "stairs" "stand" "standard" "start" "state"
## [436] "stay" "step" "stick" "still" "straight"
## [441] "strategy" "street" "strike" "switch" "system"
## [446] "table" "take" "talk" "tape" "tax"
## [451] "tea" "teach" "team" "tell" "ten"
## [456] "tend" "term" "terrible" "test" "than"
## [461] "thank" "the" "then" "there" "they"
## [466] "thing" "think" "thirteen" "thirty" "this"
## [471] "three" "tie" "time" "trade" "traffic"
## [476] "train" "travel" "treat" "tree" "try"
## [481] "twelve" "twenty" "type" "very" "view"
## [486] "village" "visit" "wage" "wait" "walk"
## [491] "wall" "want" "war" "warm" "wash"
## [496] "waste" "watch" "water" "way" "we"
## [501] "wear" "wednesday" "wee" "week" "weigh"
## [506] "well" "west" "what" "when" "where"
## [511] "whether" "which" "while" "white" "why"
## [516] "wide" "wife" "will" "win" "wind"
## [521] "wish" "with" "within" "write" "year"
## [526] "yes" "yesterday" "yet"
You can also use str_count() too say how many matches there are in a string
x
## [1] "ATTAGA" "CGCCCCCGGAT" "TATTA"
str_count(x, "[GC]")
## [1] 1 9 0
Lets couple this with mutate
df <- tibble(
word = words,
count = seq_along(word)
)
df
## # A tibble: 980 × 2
## word count
## <chr> <int>
## 1 a 1
## 2 able 2
## 3 about 3
## 4 absolute 4
## 5 accept 5
## 6 account 6
## 7 achieve 7
## 8 across 8
## 9 act 9
## 10 active 10
## # ℹ 970 more rows
df %>%
mutate(
vowels = str_count(words, "[aeiou]"),
consotants = str_count(words, "[^aeiou]")
)
## # A tibble: 980 × 4
## word count vowels consotants
## <chr> <int> <int> <int>
## 1 a 1 1 0
## 2 able 2 2 2
## 3 about 3 3 2
## 4 absolute 4 4 4
## 5 accept 5 2 4
## 6 account 6 3 4
## 7 achieve 7 4 3
## 8 across 8 2 4
## 9 act 9 1 2
## 10 active 10 3 3
## # ℹ 970 more rows
library(ath1121501cdf)
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
## loading 'ath1121501cdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
## loading 'ath1121501cdf'
##
library(ath1121501.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:lubridate':
##
## second, second<-
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:tidyr':
##
## expand
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:lubridate':
##
## %within%
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:purrr':
##
## reduce
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: org.At.tair.db
##
##
library(annotate)
## Loading required package: XML
library(affy)
##
## Attaching package: 'affy'
## The following object is masked from 'package:lubridate':
##
## pm
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
library(oligo)
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.64.0
##
## Attaching package: 'oligoClasses'
## The following object is masked from 'package:affy':
##
## list.celfiles
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'XVector'
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## ================================================================================
## Welcome to oligo version 1.66.0
## ================================================================================
##
## Attaching package: 'oligo'
## The following object is masked from 'package:limma':
##
## backgroundCorrect
## The following objects are masked from 'package:affy':
##
## intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
## probeNames, rma
## The following object is masked from 'package:lubridate':
##
## pm
## The following object is masked from 'package:dplyr':
##
## summarize
library(dplyr)
library(stats)
library(reshape)
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:S4Vectors':
##
## expand, rename
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")
ab <- ReadAffy()
hist(ab)
eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
## sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL 1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL 2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL 3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL 4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL 5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL 6
hist(eset)
ID <- featureNames(eset)
symbol <- getSYMBOL(ID, "ath1121501.db")
affydata <- read.delim("AFFY_ATH1_array_elements.txt")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : EOF within quoted string
condition <- targets$gravity
design <- model.matrix(~factor(condition))
colnames(design) <- c("flight", "ground")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef =2, n=Inf, adjust='fdr')
Lets combine annotations
Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
GO = sapply(contents(ath1121501GO), paste, collapse = ", "),
SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "),
ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
Lets merge all the data into one dataframe
all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)
all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")
Lets convert the ACCNUM to a character value
all2$ACCNUM <- as.character(all2$ACCNUM)
write.table(all2, file="BRIC_16_Final.csv", row.names = TRUE, col.names = TRUE, sep = "\t")
columns(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
foldchanges <- as.data.frame(all2$logFC)
all2$entrez = mapIds(org.At.tair.db,
keys = all2$ACCNUM,
column = "ENTREZID",
keytype = "TAIR",
multivals = "first")
## 'select()' returned 1:1 mapping between keys and columns
head(all2, 10)
## Row.names GENE KEGG
## 1 244903_at hypothetical protein NA
## 2 244904_at hypothetical protein NA
## 3 244906_at hypothetical protein NA
## 4 244907_at hypothetical protein NA
## 5 244908_at hypothetical protein NA
## 6 244911_at hypothetical protein NA
## 7 244913_at hypothetical protein NA
## 8 244914_at hypothetical protein NA
## 9 244916_at hypothetical protein NA
## 10 244917_at hypothetical protein NA
## GO
## 1 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 2 list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 3 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005575", Evidence = "ND", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 7 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 8 list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005634", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## SYMBOL ACCNUM logFC AveExpr t P.Value adj.P.Val B
## 1 ORF149 ATMG00660 -1.214 7.8 -6.20 0.0004 0.018 0.5
## 2 ORF275 ATMG00670 -0.568 3.2 -2.07 0.0758 0.300 -4.9
## 3 ORF240A ATMG00690 0.055 9.1 0.18 0.8586 0.947 -6.8
## 4 ORF120 ATMG00710 -0.883 4.3 -2.91 0.0220 0.154 -3.7
## 5 ORF107D ATMG00720 -0.463 2.2 -1.73 0.1262 0.396 -5.4
## 6 ORF170 ATMG00820 -0.193 2.7 -0.72 0.4948 0.749 -6.5
## 7 ORF121B ATMG00840 -0.339 1.8 -1.18 0.2764 0.574 -6.1
## 8 ORF107E ATMG00850 -0.305 5.5 -1.30 0.2338 0.534 -6.0
## 9 ORF187 ATMG00880 -0.888 2.8 -2.32 0.0527 0.245 -4.6
## 10 ORF184 ATMG00870 -0.379 3.6 -1.31 0.2307 0.531 -5.9
## array_element_type organism is_control locus
## 1 oligonucleotide Arabidopsis thaliana no ATMG00660
## 2 oligonucleotide Arabidopsis thaliana no ATMG00670
## 3 oligonucleotide Arabidopsis thaliana no ATMG00690
## 4 oligonucleotide Arabidopsis thaliana no ATMG00710
## 5 oligonucleotide Arabidopsis thaliana no ATMG00720
## 6 oligonucleotide Arabidopsis thaliana no ATMG00820
## 7 oligonucleotide Arabidopsis thaliana no AT2G07626
## 8 oligonucleotide Arabidopsis thaliana no ATMG00850
## 9 oligonucleotide Arabidopsis thaliana no ATMG00880
## 10 oligonucleotide Arabidopsis thaliana no ATMG00870
## description
## 1 hypothetical protein;(source:Araport11)
## 2 transmembrane protein;(source:Araport11)
## 3 FO-ATPase subunit;(source:Araport11)
## 4 Polynucleotidyl transferase, ribonuclease H-like superfamily protein;(source:Araport11)
## 5 hypothetical protein;(source:Araport11)
## 6 Reverse transcriptase (RNA-dependent DNA polymerase);(source:Araport11)
## 7 hypothetical protein;(source:Araport11)
## 8 DNA/RNA polymerases superfamily protein;(source:Araport11)
## 9 hypothetical protein;(source:Araport11)
## 10 hypothetical protein;(source:Araport11)
## chromosome entrez
## 1 M <NA>
## 2 M <NA>
## 3 M <NA>
## 4 M <NA>
## 5 M <NA>
## 6 M <NA>
## 7 2 <NA>
## 8 M <NA>
## 9 M <NA>
## 10 M <NA>
library(pathview)
##
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
##
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
library(gage)
##
library(gageData)
data(kegg.sets.hs)
foldchanges = all$logFC
names(foldchanges) = all2$entrez
head(foldchanges)
## <NA> <NA> <NA> <NA> <NA> <NA>
## -0.742 -0.826 -1.214 -0.568 0.120 0.055
kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
Let’s get the results:
keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## ath04075 Plant hormone signal transduction 0.062 1.5 0.062 0.79
## ath03420 Nucleotide excision repair 0.068 1.5 0.068 0.79
## ath04626 Plant-pathogen interaction 0.071 1.5 0.071 0.79
## ath04070 Phosphatidylinositol signaling system 0.076 1.5 0.076 0.79
## ath04530 Tight junction 0.078 1.4 0.078 0.79
## ath00030 Pentose phosphate pathway 0.081 1.4 0.081 0.79
## set.size exp1
## ath04075 Plant hormone signal transduction 185 0.062
## ath03420 Nucleotide excision repair 25 0.068
## ath04626 Plant-pathogen interaction 99 0.071
## ath04070 Phosphatidylinositol signaling system 32 0.076
## ath04530 Tight junction 19 0.078
## ath00030 Pentose phosphate pathway 27 0.081
##
## $less
## p.geomean stat.mean p.val q.val
## ath00900 Terpenoid backbone biosynthesis 0.087 -1.38 0.087 0.92
## ath00940 Phenylpropanoid biosynthesis 0.132 -1.12 0.132 0.92
## ath00220 Arginine biosynthesis 0.142 -1.10 0.142 0.92
## ath00650 Butanoate metabolism 0.177 -0.95 0.177 0.92
## ath04110 Cell cycle 0.180 -0.92 0.180 0.92
## ath00020 Citrate cycle (TCA cycle) 0.189 -0.89 0.189 0.92
## set.size exp1
## ath00900 Terpenoid backbone biosynthesis 30 0.087
## ath00940 Phenylpropanoid biosynthesis 60 0.132
## ath00220 Arginine biosynthesis 14 0.142
## ath00650 Butanoate metabolism 11 0.177
## ath04110 Cell cycle 65 0.180
## ath00020 Citrate cycle (TCA cycle) 22 0.189
##
## $stats
## stat.mean exp1
## ath04075 Plant hormone signal transduction 1.5 1.5
## ath03420 Nucleotide excision repair 1.5 1.5
## ath04626 Plant-pathogen interaction 1.5 1.5
## ath04070 Phosphatidylinositol signaling system 1.5 1.5
## ath04530 Tight junction 1.4 1.4
## ath00030 Pentose phosphate pathway 1.4 1.4
greater <- keggres$greater
lessers <- keggres$less
write.csv(greater, file = "BRIC_16_pathview_Greater_Pathways.csv")
write.csv(greater, file = "BRIC_16_pathview_Lesser_Pathways.csv")
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tibble::as_tibble() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "ath04075 Plant hormone signal transduction"
## [2] "ath03420 Nucleotide excision repair"
## [3] "ath04626 Plant-pathogen interaction"
## [4] "ath04070 Phosphatidylinositol signaling system"
## [5] "ath04530 Tight junction"
keggresids_greater = substr(keggrespathways, start=1, stop=8)
keggresids_greater
## [1] "ath04075" "ath03420" "ath04626" "ath04070" "ath04530"
genedata <- as.character(all2$logFC)
Define a plotting function to apply next:
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)
Plot multiple pathways (plots are saved to disk and returns a throwaway object list).
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath"))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04075.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath03420.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04626.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04070.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Downloading xml files for ath04530, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/ath04530/kgml': HTTP status was '404 Not Found'
## Warning: Download of ath04530 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for ath04530, 1/1 pathways..
## Warning: Download of ath04530 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, ath04530 skipped!
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tibble::as_tibble() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "ath00900 Terpenoid backbone biosynthesis"
## [2] "ath00940 Phenylpropanoid biosynthesis"
## [3] "ath00220 Arginine biosynthesis"
## [4] "ath00650 Butanoate metabolism"
## [5] "ath04110 Cell cycle"
keggresids_less = substr(keggrespathways, start=1, stop=8)
keggresids_less
## [1] "ath00900" "ath00940" "ath00220" "ath00650" "ath04110"
genedata <- as.character(all2$logFC)
Define a plotting function to apply next
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)
# plot multiple pathways (plots are saved to disk and returns a throwaway object list)
tmp = sapply(keggresids_less, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath"))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00900.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00940.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00220.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00650.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Downloading xml files for ath04110, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/ath04110/kgml': HTTP status was '404 Not Found'
## Warning: Download of ath04110 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for ath04110, 1/1 pathways..
## Warning: Download of ath04110 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, ath04110 skipped!
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
##
rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_ID")
group <- factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))
dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
## ..@ .Data:List of 2
## .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
## .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
## .. ..$ :'data.frame': 12 obs. of 3 variables:
## .. .. ..$ group : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
## .. .. ..$ lib.size : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
## .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ names: chr [1:2] "counts" "samples"
keep <- rowSums(cpm(dgeGlm)>2) >=4
dgeGlm <- dgeGlm[keep,]
names(dgeGlm)
## [1] "counts" "samples"
dgeGlm[["samples"]]
## group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23 1 4.0e+07 1
## Mmus_C57.6J_KDN_FLT_Rep2_M24 1 4.1e+07 1
## Mmus_C57.6J_KDN_FLT_Rep3_M25 1 3.8e+07 1
## Mmus_C57.6J_KDN_FLT_Rep4_M26 1 3.9e+07 1
## Mmus_C57.6J_KDN_FLT_Rep5_M27 1 3.7e+07 1
## Mmus_C57.6J_KDN_FLT_Rep6_M28 1 3.6e+07 1
## Mmus_C57.6J_KDN_GC_Rep1_M33 2 3.7e+07 1
## Mmus_C57.6J_KDN_GC_Rep2_M34 2 3.7e+07 1
## Mmus_C57.6J_KDN_GC_Rep3_M35 2 4.0e+07 1
## Mmus_C57.6J_KDN_GC_Rep4_M36 2 3.6e+07 1
## Mmus_C57.6J_KDN_GC_Rep5_M37 2 3.8e+07 1
## Mmus_C57.6J_KDN_GC_Rep6_M38 2 3.5e+07 1
design <- model.matrix(~group)
design
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef =2)
ttGlm <- topTags(lrt, n = Inf)
class(ttGlm)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))
## group2
## Down 64
## NotSig 13390
## Up 159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
ttGlm2 <- as.data.frame(ttGlm$table)
ttGlm2$symbol = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "SYMBOL",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm$entrez = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "ENTREZID",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm$name = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "GENENAME",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(ttGlm2)
## logFC logCPM LR PValue FDR symbol
## ENSMUSG00000026077 -1.36 3.6 80 4.3e-19 5.9e-15 Npas2
## ENSMUSG00000007872 0.89 5.5 77 1.9e-18 1.3e-14 Id3
## ENSMUSG00000021775 0.95 6.2 63 2.0e-15 9.1e-12 Nr1d2
## ENSMUSG00000002250 -0.83 5.3 62 2.7e-15 9.2e-12 Ppard
## ENSMUSG00000059824 2.26 4.6 58 2.6e-14 7.2e-11 Dbp
## ENSMUSG00000074715 -1.99 3.8 47 7.0e-12 1.6e-08 Ccl28
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]
# Lets get the results
keggres = gage(foldchanges, gsets=kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## mmu00970 Aminoacyl-tRNA biosynthesis NA NaN NA NA
## mmu02010 ABC transporters NA NaN NA NA
## mmu03008 Ribosome biogenesis in eukaryotes NA NaN NA NA
## mmu03010 Ribosome NA NaN NA NA
## mmu03013 Nucleocytoplasmic transport NA NaN NA NA
## mmu03015 mRNA surveillance pathway NA NaN NA NA
## set.size exp1
## mmu00970 Aminoacyl-tRNA biosynthesis 0 NA
## mmu02010 ABC transporters 0 NA
## mmu03008 Ribosome biogenesis in eukaryotes 0 NA
## mmu03010 Ribosome 0 NA
## mmu03013 Nucleocytoplasmic transport 0 NA
## mmu03015 mRNA surveillance pathway 0 NA
##
## $less
## p.geomean stat.mean p.val q.val
## mmu00970 Aminoacyl-tRNA biosynthesis NA NaN NA NA
## mmu02010 ABC transporters NA NaN NA NA
## mmu03008 Ribosome biogenesis in eukaryotes NA NaN NA NA
## mmu03010 Ribosome NA NaN NA NA
## mmu03013 Nucleocytoplasmic transport NA NaN NA NA
## mmu03015 mRNA surveillance pathway NA NaN NA NA
## set.size exp1
## mmu00970 Aminoacyl-tRNA biosynthesis 0 NA
## mmu02010 ABC transporters 0 NA
## mmu03008 Ribosome biogenesis in eukaryotes 0 NA
## mmu03010 Ribosome 0 NA
## mmu03013 Nucleocytoplasmic transport 0 NA
## mmu03015 mRNA surveillance pathway 0 NA
##
## $stats
## stat.mean exp1
## mmu00970 Aminoacyl-tRNA biosynthesis NaN NA
## mmu02010 ABC transporters NaN NA
## mmu03008 Ribosome biogenesis in eukaryotes NaN NA
## mmu03010 Ribosome NaN NA
## mmu03013 Nucleocytoplasmic transport NaN NA
## mmu03015 mRNA surveillance pathway NaN NA
greaters <- keggres$greater
lessers <- keggres$less
keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tibble::as_tibble() %>%
filter(row_number()<=5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "mmu00970 Aminoacyl-tRNA biosynthesis"
## [2] "mmu02010 ABC transporters"
## [3] "mmu03008 Ribosome biogenesis in eukaryotes"
## [4] "mmu03010 Ribosome"
## [5] "mmu03013 Nucleocytoplasmic transport"
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
## [1] "mmu00970" "mmu02010" "mmu03008" "mmu03010" "mmu03013"
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id =pid, species = "mouse", new.signature = FALSE)
# PLOT MULTIPLE PATHWAYS
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id =pid, species = "mouse"))
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu00970.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu02010.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03008.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03010.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03013.pathview.png
keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
tibble::as_tibble() %>%
filter(row_number()<=5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "mmu00970 Aminoacyl-tRNA biosynthesis"
## [2] "mmu02010 ABC transporters"
## [3] "mmu03008 Ribosome biogenesis in eukaryotes"
## [4] "mmu03010 Ribosome"
## [5] "mmu03013 Nucleocytoplasmic transport"
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
## [1] "mmu00970" "mmu02010" "mmu03008" "mmu03010" "mmu03013"
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id =pid, species = "mouse", new.signature = FALSE)
# PLOT MULTIPLE PATHWAYS
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id =pid, species = "mouse"))
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu00970.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu02010.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03008.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03010.pathview.png
## Warning: None of the genes or compounds mapped to the pathway!
## Argument gene.idtype or cpd.idtype may be wrong.
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03013.pathview.png
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following object is masked from 'package:Biostrings':
##
## width
## The following object is masked from 'package:XVector':
##
## width
## The following objects are masked from 'package:oligoClasses':
##
## B, B<-
## The following objects are masked from 'package:IRanges':
##
## resize, width
## The following object is masked from 'package:S4Vectors':
##
## width
## The following object is masked from 'package:Biobase':
##
## channel
## The following object is masked from 'package:BiocGenerics':
##
## width
## The following object is masked from 'package:stringr':
##
## boundary
## The following object is masked from 'package:dplyr':
##
## where
## The following object is masked from 'package:tidyr':
##
## fill
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image
filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = ".*pathview.png")
all_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames)
library("DESeq2")
## Loading required package: GenomicRanges
##
## Attaching package: 'GenomicRanges'
## The following object is masked from 'package:magrittr':
##
## subtract
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
##
## count
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
library("dplyr")
library("apeglm")
countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("PHENO_DATA_mouse.csv", sep= ",", row.names = 1)
colData$group <- factor(colData$group, levels = c("0", "1"))
all(rownames(colData)%in% colnames(countData))
## [1] TRUE
countData %>%
mutate_if(is.numeric, ceiling)
countData[, 2:13] <- sapply(countData[,2:13], as.integer)
row.names(countData) <- countData[,1]
countData <- countData[2:13]
all(rownames(colData)) %in% colnames(countData)
## Warning in all(rownames(colData)): coercing argument of type 'character' to
## logical
rownames(colData) == colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- dds[rowSums(counts(dds)>2) >=4]
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
resLFC <- lfcShrink(dds, coef= 2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resOrdered <- res[order(res$padj), ]
summary(res)
##
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 325, 1.5%
## LFC < 0 (down) : 327, 1.5%
## outliers [1] : 15, 0.068%
## low counts [2] : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_Vs_GC)
## res$log2FoldChange
## 1 0.0421
## 2 -0.1334
## 3 -0.0185
## 4 -0.0882
## 5 -0.0079
## 6 0.1136
plotMA(resLFC, ylim = c(-2,2))
pdf(file = "MA_Plot_FLT_vs_GC.pdf")
dev.off()
## png
## 2
write.csv(as.data.frame(resOrdered), file = "Mouse_DESeq.csv")
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) +1), colData = colData(dds))
pdf(file = "PCA_PLOT_FLT_vs_GC.pdf")
plotPCA(DESeqTransform(se), intgroup = "group")
## using ntop=500 top features by variance
library(AnnotationDbi)
library(org.Mm.eg.db)
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
head(foldchanges)
## res$log2FoldChange
## ENSMUSG00000000001 0.0421
## ENSMUSG00000000028 -0.1334
## ENSMUSG00000000031 -0.0185
## ENSMUSG00000000037 -0.0882
## ENSMUSG00000000049 -0.0079
## ENSMUSG00000000056 0.1136
res$symbol = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "SYMBOL",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
res$entrez = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "ENTREZID",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
res$name = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "GENENAME",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(res, 10)
## log2 fold change (MLE): group 1 vs 0
## Wald test p-value: group 1 vs 0
## DataFrame with 10 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 3132.35128 0.04214340 0.0436714 0.9650117 0.334539
## ENSMUSG00000000028 68.75801 -0.13342706 0.1565936 -0.8520597 0.394181
## ENSMUSG00000000031 21.05397 -0.01853142 0.2486477 -0.0745288 0.940590
## ENSMUSG00000000037 24.42314 -0.08817270 0.2982220 -0.2956613 0.767489
## ENSMUSG00000000049 3.24919 -0.00790342 0.9613572 -0.0082211 0.993441
## ENSMUSG00000000056 1424.88216 0.11355979 0.0777635 1.4603234 0.144201
## ENSMUSG00000000058 1420.78992 -0.03893850 0.0850602 -0.4577759 0.647113
## ENSMUSG00000000078 2254.53129 -0.07540275 0.0874314 -0.8624214 0.388456
## ENSMUSG00000000085 822.68179 0.04586772 0.0584699 0.7844667 0.432766
## ENSMUSG00000000088 5946.81754 -0.05461549 0.0691178 -0.7901795 0.429423
## padj symbol entrez name
## <numeric> <character> <character> <character>
## ENSMUSG00000000001 0.739637 Gnai3 14679 G protein subunit al..
## ENSMUSG00000000028 0.777085 Cdc45 12544 cell division cycle 45
## ENSMUSG00000000031 NA H19 14955 H19, imprinted mater..
## ENSMUSG00000000037 NA Scml2 107815 Scm polycomb group p..
## ENSMUSG00000000049 NA Apoh 11818 apolipoprotein H
## ENSMUSG00000000056 0.547527 Narf 67608 nuclear prelamin A r..
## ENSMUSG00000000058 0.901904 Cav2 12390 caveolin 2
## ENSMUSG00000000078 0.774294 Klf6 23849 Kruppel-like transcr..
## ENSMUSG00000000085 0.800018 Scmh1 29871 sex comb on midleg h..
## ENSMUSG00000000088 0.798420 Cox5a 12858 cytochrome c oxidase..
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
## 14679 12544 14955 107815 11818 67608
## 0.0421 -0.1334 -0.0185 -0.0882 -0.0079 0.1136
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## mmu03010 Ribosome 0.017 2.1 0.017 0.89
## mmu04022 cGMP-PKG signaling pathway 0.030 1.9 0.030 0.89
## mmu04360 Axon guidance 0.038 1.8 0.038 0.89
## mmu04330 Notch signaling pathway 0.042 1.8 0.042 0.89
## mmu04658 Th1 and Th2 cell differentiation 0.054 1.6 0.054 0.89
## mmu02010 ABC transporters 0.075 1.5 0.075 0.89
## set.size exp1
## mmu03010 Ribosome 139 0.017
## mmu04022 cGMP-PKG signaling pathway 152 0.030
## mmu04360 Axon guidance 176 0.038
## mmu04330 Notch signaling pathway 58 0.042
## mmu04658 Th1 and Th2 cell differentiation 78 0.054
## mmu02010 ABC transporters 46 0.075
##
## $less
## p.geomean stat.mean p.val
## mmu04613 Neutrophil extracellular trap formation 0.0043 -2.6 0.0043
## mmu04110 Cell cycle 0.0062 -2.5 0.0062
## mmu04657 IL-17 signaling pathway 0.0113 -2.3 0.0113
## mmu04145 Phagosome 0.0332 -1.8 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.0388 -1.8 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.0441 -1.7 0.0441
## q.val set.size exp1
## mmu04613 Neutrophil extracellular trap formation 0.75 160 0.0043
## mmu04110 Cell cycle 0.75 151 0.0062
## mmu04657 IL-17 signaling pathway 0.90 75 0.0113
## mmu04145 Phagosome 0.91 144 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.91 154 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.91 99 0.0441
##
## $stats
## stat.mean exp1
## mmu03010 Ribosome 2.1 2.1
## mmu04022 cGMP-PKG signaling pathway 1.9 1.9
## mmu04360 Axon guidance 1.8 1.8
## mmu04330 Notch signaling pathway 1.8 1.8
## mmu04658 Th1 and Th2 cell differentiation 1.6 1.6
## mmu02010 ABC transporters 1.5 1.5
greaters <- keggres$greater
lessers <- keggres
keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu03010 Ribosome" "mmu04022 cGMP-PKG signaling pathway"
## [3] "mmu04360 Axon guidance"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu03010" "mmu04022" "mmu04360"
PLOT!
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03010.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04022.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04360.pathview.png
keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04110 Cell cycle"
## [3] "mmu04657 IL-17 signaling pathway"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu04613" "mmu04110" "mmu04657"
PLOT!
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04613.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04110.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04657.pathview.png
library(imager)
filenames <- list.files(path = "~/Desktop/classroom/myfiles", pattern = ".*pathview.png")
all_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames)
library(msa)
seq <- readAAStringSet("hglobin.fa")
seq
## AAStringSet object of length 3:
## width seq names
## [1] 142 MVLSPADKTNVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] 142 MVLSGEDKSNIKAAWGKIGGHGA...PAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] 142 MSLTRTERTIILSLWSKISTQAD...ADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
alignments <- msa(seq, method = "ClustalW")
## use default substitution matrix
alignments
## CLUSTAL 2.1
##
## Call:
## msa(seq, method = "ClustalW")
##
## MsaAAMultipleAlignment with 3 rows and 142 columns
## aln names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] MSLTRTERTIILSLWSKISTQADVIG...FTADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
## Con MVLS??DKTNIKAAWGKIG?HA?EYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus
msaPrettyPrint(alignments, output = "pdf", showNames = "left",
showLogo = "none", askForOverwrite = FALSE,
verbose = TRUE, file = "whole_align.pdf")
## Multiple alignment written to temporary file /tmp/RtmpLdkG5i/seq14271c99d025.fasta
## File whole_align.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file whole_align.pdf created
msaPrettyPrint(alignments, c(10,30), output = "pdf", showNames = "left",
file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
verbose = TRUE)
## Multiple alignment written to temporary file /tmp/RtmpLdkG5i/seq14271234c5bf.fasta
## File Zoomed_align.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file Zoomed_align.pdf created
library(ape)
##
## Attaching package: 'ape'
## The following object is masked from 'package:imager':
##
## where
## The following object is masked from 'package:Biostrings':
##
## complement
## The following object is masked from 'package:dplyr':
##
## where
library(seqinr)
##
## Attaching package: 'seqinr'
## The following objects are masked from 'package:ape':
##
## as.alignment, consensus
## The following object is masked from 'package:matrixStats':
##
## count
## The following object is masked from 'package:Biostrings':
##
## translate
## The following object is masked from 'package:limma':
##
## zscore
## The following object is masked from 'package:dplyr':
##
## count
alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")
distance1 <- seqinr::dist.alignment(alignment_seqinr, "identity")
tree <- ape::nj(distance1)
plot(tree, main = "Phylogenetic Tree of HBA Sequences")
library(DECIPHER)
## Loading required package: RSQLite
## Loading required package: parallel
This is a DNAStringSet object:
long_seq <- readDNAStringSet(file.path(getwd(), "plastid_genomes.fa"))
long_seq
## DNAStringSet object of length 5:
## width seq names
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT NC_018523.1 Sacch...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA NC_022431.1 Ascle...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC NC_022259.1 Nanno...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC NC_022417.1 Cocos...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT NC_022459.1 Camel...
Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
## Adding 5 sequences to the database.
##
## Added 5 new sequences to table Seqs.
## 30 total sequences in table Seqs.
## Time difference of 0.18 secs
####Now that we’ve built the database, we can do the following: a. Find the syntenic blocks
synteny <- FindSynteny("long_db")
## ================================================================================
##
## Time difference of 4.9 secs
pairs(synteny)
plot(synteny)
alignment <- AlignSynteny(synteny, "long_db")
## ================================================================================
##
## Time difference of 43 secs
block <- unlist(alignment[[1]])
writeXStringSet(block, "genome_blocks_out.fa")
library(locfit)
## locfit 1.5-9.8 2023-06-11
##
## Attaching package: 'locfit'
## The following object is masked from 'package:purrr':
##
## none
library(Rsamtools)
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
whole_genome <- csaw::windowCounts(
file.path("windows.bam"),
bin = TRUE,
filter = 0,
width = 500,
param = csaw::readParam(
minq = 20, # determines what is a passing read
dedup = TRUE, # removes pcr duplicate
pe = "both" # requires that both pairs of reads are aligned
)
)
Since this is a single column of data, let’s rename it as:
colnames(whole_genome) <- c("small_data")
annotated_regions <- get_annotated_regions_from_gff(file.path(getwd(), "genes.gff"))
Now that we have the window of high expression, we want to see if any of them overlap with annotated regions.
library(IRanges)
library(SummarizedExperiment)
windows_in_genes <- IRanges::overlapsAny(
SummarizedExperiment::rowRanges(whole_genome), # Creates a granges object
annotated_regions
)
windows_in_genes
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]
Use assay () to extract the actual counts from the Granges object:
assay(non_annotated_window_counts)
## small_data
## [1,] 0
## [2,] 31
## [3,] 25
## [4,] 0
## [5,] 0
## [6,] 24
## [7,] 25
## [8,] 0
## [9,] 0
Each row represents a signle nucleotide in the reference count and the count column gives the depth of coverage at that point.
library(bumphunter)
## Loading required package: foreach
## Parallel computing support for 'oligo/crlmm': Disabled
## - Load 'ff'
## - Load and register a 'foreach' adaptor
## Example - Using 'multicore' for 2 cores:
## library(doMC)
## registerDoMC(2)
## ================================================================================
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: iterators
pile_df <- Rsamtools::pileup(file.path(getwd(), "windows.bam"))
This step groups the reads with certain distance of each other into a cluster. We give the sequence names, position, and distance.
clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap = 100)
table(clusters)
## clusters
## 1 2 3
## 1486 1552 1520
In this step, we will map the reads to the regions we found for the genomes.
bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff = 1)
## getSegments: segmenting
## getSegments: splitting
## chr start end value area cluster indexStart indexEnd L clusterL
## 3 Chr1 4503 5500 10.4 15811 3 3039 4558 1520 1520
## 1 Chr1 502 1500 10.0 14839 1 1 1486 1486 1486
## 2 Chr1 2501 3500 8.7 13436 2 1487 3038 1552 1552
Lets load the required packages
library(ggplot2)
library(ggtree)
## ggtree v3.10.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
##
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
##
## Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods
## for mapping and visualizing associated data on phylogeny using ggtree.
## Molecular Biology and Evolution. 2018, 35(12):3041-3043.
## doi:10.1093/molbev/msy194
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:ape':
##
## rotate
## The following object is masked from 'package:magrittr':
##
## inset
## The following object is masked from 'package:reshape':
##
## expand
## The following object is masked from 'package:Biostrings':
##
## collapse
## The following object is masked from 'package:IRanges':
##
## collapse
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:tidyr':
##
## expand
library(treeio)
## treeio v1.26.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
##
## G Yu. Data Integration, Manipulation and Visualization of Phylogenetic
## Trees (1st ed.). Chapman and Hall/CRC. 2022. ISBN: 9781032233574
##
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
##
## Attaching package: 'treeio'
## The following object is masked from 'package:seqinr':
##
## read.fasta
## The following object is masked from 'package:Biostrings':
##
## mask
itol <- ape::read.tree("itol.nwk")
ggtree(itol)
ggtree(itol, layout = "circular")
ggtree(itol) + coord_flip() + scale_x_reverse()
ggtree(itol) + geom_tiplab(color = "indianred", size = 1.5)
ggtree(itol, layour = "unrooted") + geom_strip(13,14, color = "red", barsize = 1)
## Warning in stat_tree(data = data, mapping = mapping, geom = "segment", position = position, : Ignoring unknown parameters: `layour`
## Ignoring unknown parameters: `layour`
ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "steelblue")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
install.packages('BAMMtools')
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(BAMMtools)
getMRCA(itol, tip = c("Photorhabdus_luminescens", "Blochmannia_floridanus"))
## [1] 206
ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "steelblue")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
First lets load the required packages
library(ape)
library(adegraphics)
## Registered S3 methods overwritten by 'adegraphics':
## method from
## biplot.dudi ade4
## kplot.foucart ade4
## kplot.mcoa ade4
## kplot.mfa ade4
## kplot.pta ade4
## kplot.sepan ade4
## kplot.statis ade4
## scatter.coa ade4
## scatter.dudi ade4
## scatter.nipals ade4
## scatter.pco ade4
## score.acm ade4
## score.mix ade4
## score.pca ade4
## screeplot.dudi ade4
##
## Attaching package: 'adegraphics'
## The following object is masked from 'package:ape':
##
## zoom
## The following object is masked from 'package:GenomicRanges':
##
## score
## The following object is masked from 'package:Biostrings':
##
## score
## The following object is masked from 'package:BiocGenerics':
##
## score
library(treespace)
## Loading required package: ade4
##
## Attaching package: 'ade4'
## The following objects are masked from 'package:adegraphics':
##
## kplotsepan.coa, s.arrow, s.class, s.corcircle, s.distri, s.image,
## s.label, s.logo, s.match, s.traject, s.value, table.value,
## triangle.class
## The following object is masked from 'package:GenomicRanges':
##
## score
## The following object is masked from 'package:Biostrings':
##
## score
## The following object is masked from 'package:BiocGenerics':
##
## score
Now we need to load all the treefiles into a multiPhylo object:
treefiles <- list.files(file.path(getwd(), "gene_trees"), full.names = TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)
## [1] "list"
class(tree_list) <- "multiPhylo"
class(tree_list)
## [1] "multiPhylo"
Now we can compute the kendall-coljin distance between trees. This functiion does a LOT of analyses.
comparisons <- treespace(tree_list, nf = 3)
We can plot the pairwise distances between tree with table.image
table.image(comparisons$D, nclass= 40)
Now lets print the PCA and clusters, this shows us how the group of trees cluster
plotGroves(comparisons$pco, lab.show = TRUE, lab.cex = 1.5)
groves <- findGroves(comparisons, nclust = 2)
plotGroves(groves)
Load our required packages:
library(ape)
Now lets load the tree data we will be working with
newick <- read.tree("mammal_tree.nwk")
l <- subtrees(newick)
Lets plot the tree to see whaat it looks like
plot(newick)
We can subset this plot using the “node” function
plot(l[[4]], sub = "Node 4")
Extract the tree manually:
small_tree <- extract.clade(newick, 9)
plot(small_tree)
Now what if we want to bind two trees together?
new_tree <- bind.tree(newick, small_tree, 3)
plot(new_tree)
##Reconstructing trees from alignments
library(Biostrings)
library(msa)
library(phangorn)
seqs <- readAAStringSet("abc.fa")
aln <- msa::msa(seqs, method = c("ClustalOmega"))
## using Gonnet
aln <- as.phyDat(aln, type = "AA")
class(aln)
## [1] "phyDat"
dist_mat <- dist.ml(aln)
upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main = "UPGMA")
nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")
Bootstraps are essentially confidence intervals for how the clade is placed in the correct position
fit <- pml(nj_tree, aln)
bootstraps <- bootstrap.phyDat(aln, FUN = function(x) {nj(dist.ml(x))}, bs = 100)
plotBS(nj_tree, bootstraps, p = 10)
library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
##
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:stringr':
##
## fixed
## The following object is masked from 'package:base':
##
## tabulate
library(VariantTools)
bam_file <- file.path(getwd(), "hg17_snps.bam")
fasta_file <- file.path(getwd(), "chr17_83k.fa")
fa <- rtracklayer::FastaFile(fasta_file)
genome <- gmapR::GmapGenome(fa, create = TRUE)
## NOTE: genome 'chr17_83k' already exists, not overwriting
There can be a lot of fine-tuning to this function. We are just going to use the standard settings
qual_params <- TallyVariantsParam(
genome = genome,
minimum_mapq = 20)
var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)
called_variants <- callVariants(bam_file,
qual_params,
calling.filter = var_params)
head(called_variants)
## VRanges object with 6 ranges and 17 metadata columns:
## seqnames ranges strand ref alt totalDepth
## <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
## [1] NC_000017.10 64 * G T 759
## [2] NC_000017.10 69 * G T 812
## [3] NC_000017.10 70 * G T 818
## [4] NC_000017.10 73 * T A 814
## [5] NC_000017.10 77 * T A 802
## [6] NC_000017.10 78 * G T 798
## refDepth altDepth sampleNames softFilterMatrix | n.read.pos
## <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <integer>
## [1] 739 20 <NA> | 17
## [2] 790 22 <NA> | 19
## [3] 796 22 <NA> | 20
## [4] 795 19 <NA> | 13
## [5] 780 22 <NA> | 19
## [6] 777 21 <NA> | 17
## n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
## <integer> <integer> <integer> <integer> <integer>
## [1] 64 759 20 739 0
## [2] 69 812 22 790 0
## [3] 70 818 22 796 0
## [4] 70 814 19 795 0
## [5] 70 802 22 780 0
## [6] 70 798 21 777 0
## count.minus.ref count.del.plus count.del.minus read.pos.mean
## <integer> <integer> <integer> <numeric>
## [1] 0 0 0 30.9000
## [2] 0 0 0 40.7273
## [3] 0 0 0 34.7727
## [4] 0 0 0 36.1579
## [5] 0 0 0 38.3636
## [6] 0 0 0 39.7143
## read.pos.mean.ref read.pos.var read.pos.var.ref mdfne mdfne.ref
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 32.8755 318.558 347.804 NA NA
## [2] 35.4190 377.004 398.876 NA NA
## [3] 36.3442 497.762 402.360 NA NA
## [4] 36.2176 519.551 402.843 NA NA
## [5] 36.0064 472.327 397.070 NA NA
## [6] 35.9241 609.076 390.463 NA NA
## count.high.nm count.high.nm.ref
## <integer> <integer>
## [1] 20 738
## [2] 22 789
## [3] 22 796
## [4] 19 769
## [5] 22 780
## [6] 21 777
## -------
## seqinfo: 1 sequence from chr17_83k genome
## hardFilters(4): nonRef nonNRef readCount likelihoodRatio
We have identified 6 variants
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Note you can also load this data from a bed file.
genes <- get_annotated_regions_from_gff("chr17.83k.gff3")
overlaps <- GenomicRanges::findOverlaps(called_variants, genes)
overlaps
## Hits object with 12684 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 35176 1
## [2] 35176 2
## [3] 35176 3
## [4] 35177 1
## [5] 35177 2
## ... ... ...
## [12680] 40944 2
## [12681] 40944 7
## [12682] 40945 1
## [12683] 40945 2
## [12684] 40945 7
## -------
## queryLength: 44949 / subjectLength: 8
identified <- genes[subjectHits(overlaps)]
library(Biostrings)
library(systemPipeR)
## Loading required package: ShortRead
## Loading required package: BiocParallel
## Loading required package: GenomicAlignments
##
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:dplyr':
##
## last
##
## Attaching package: 'ShortRead'
## The following object is masked from 'package:adegraphics':
##
## zoom
## The following objects are masked from 'package:locfit':
##
## left, right
## The following object is masked from 'package:ape':
##
## zoom
## The following object is masked from 'package:imager':
##
## clean
## The following object is masked from 'package:magrittr':
##
## functions
## The following object is masked from 'package:oligo':
##
## intensity
## The following objects are masked from 'package:oligoClasses':
##
## chromosome, position
## The following object is masked from 'package:affy':
##
## intensity
## The following object is masked from 'package:dplyr':
##
## id
## The following object is masked from 'package:purrr':
##
## compose
## The following object is masked from 'package:tibble':
##
## view
##
## Attaching package: 'systemPipeR'
## The following object is masked from 'package:VariantAnnotation':
##
## reference
## The following object is masked from 'package:DESeq2':
##
## results
dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa")
predict_orfs <- predORF(dna_object, n = 'all', type = 'gr', mode = 'ORF', strand = 'both',
longest_disjoint = TRUE)
predict_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 1162 chloroplast 2056-2532 - | 1 3
## 2 chloroplast 72371-73897 + | 2 2
## 1163 chloroplast 77901-78362 - | 2 1
## 3 chloroplast 54937-56397 + | 3 3
## ... ... ... ... . ... ...
## 2497 chloroplast 129757-129762 - | 1336 3
## 2498 chloroplast 139258-139263 - | 1337 3
## 2499 chloroplast 140026-140031 - | 1338 3
## 2500 chloroplast 143947-143952 - | 1339 3
## 2501 chloroplast 153619-153624 - | 1340 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
This printed out a GRanges object in return, with 2,501 open reading frames. This is FAR too many open reading frames.
We first estimate the length an ORF can reach by chance. We will create a string of random nucleotides that is the length of our chloroplast genome and determine what the longer ORF that can arise by chance.
bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")
seq_length <- width(dna_object[1])
counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))} )
probs <- unlist(lapply(counts, function(base_count){signif(base_count/seq_length, 2)}))
probs
## [1] 6.5e-06 6.5e-06 6.5e-06 6.5e-06
get_longest_orf_in_random_genome <- function(x,
length = 1000,
probs = c(0.25, 0.25, 0.25, 0.25),
bases = c("A", "T", "G", "C")){
# Here we create our random genome and allow replacement for the next iteration
random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
random_dna_object <- DNAStringSet(random_genome)
names(random_dna_object) <- c("random_dna_string")
orfs <- predORF(random_dna_object, n =1, type = 'gr', mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
return(max(width(orfs)))
}
random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length = seq_length, probs = probs, bases = bases))
longest_random_orf <- max(random_lengths)
keep <- width(predict_orfs) > longest_random_orf
orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
## GRanges object with 10 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 2 chloroplast 72371-73897 + | 2 2
## 3 chloroplast 54937-56397 + | 3 3
## 4 chloroplast 57147-58541 + | 4 1
## 5 chloroplast 33918-35141 + | 5 1
## 6 chloroplast 32693-33772 + | 6 2
## 7 chloroplast 109408-110436 + | 7 3
## 8 chloroplast 114461-115447 + | 8 2
## 9 chloroplast 141539-142276 + | 9 2
## 10 chloroplast 60741-61430 + | 10 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")
library(karyoploteR)
## Loading required package: regioneR
library(GenomicRanges)
genome_df <- data.frame(
# first we dictate # of chromosomes
chr = paste0("chr", 1:5),
start = rep(1, 5),
# and then we will dictate the length of each chromosome
end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)
genome_gr <- makeGRangesFromDataFrame(genome_df)
snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
chr = paste0("chr", sample(1:5,25, replace = TRUE)),
start = snp_pos,
end = snp_pos
)
snps_gr <- makeGRangesFromDataFrame(snps)
snp_labels <- paste0("snp_", 1:25)
plot.params <- getDefaultPlotParams(plot.type =1)
plot.params$data1outmargin <- 600
kp <- plotKaryotype(genome= genome_gr, plot.type = 1, plot.params = plot.params)
## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.
## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call
kpPlotMarkers(kp, snps_gr, labels = snp_labels )
numeric_data <- data.frame(
y = rnorm(100,mean = 1, sd = 0.5),
chr = rep("chr4", 100),
start = seq(1, 20862711, 20862711/100),
end = seq(1, 20862711, 20862711/100)
)
numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)
plot.params <- getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800
kp <- plotKaryotype(genome= genome_gr, plot.type = 2, plot.params = plot.params)
## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.
## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call
kpPlotMarkers(kp, snps_gr, labels = snp_labels )
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)