Reproducible research using R Markdown tutorial.

We can type long passages or descriptions of data that we record without having to use the using the symbol # to type our comments like in normal r. ToothGrowth dataset is going to be used in the first example we are given. In this experiment we use, real Guinea Pigs were given different amounts of VItamin C to see if the animal’s teeth were affected.

When running the r code you have to denote what is going to be considered the R code. These sections are called “code chunks”.

Below is a code chunk:

Toothdata <- ToothGrowth 

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

The R Markdown file printed the code chunck after pressing the print button.

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(lm(len ~ dose, data = Toothdata))
Figure 1: The tooth growth of Guinea Pigs when given variable amounts of vitamin C

Figure 1: The tooth growth of Guinea Pigs when given variable amounts of vitamin C

the slope of the regression line is 9.7635714.

Section Headers

We can also put sections and subsections into our markdown file in a similar way to numbers and bullet points in a word document. We do this with the “#” that we used previously to denote and detect R script.

First level header

Second level header

Third level header

Always make sure to put a space after the hashtag or it will not work!!

We can also add bullet point-tyoe marks in our r markdown file.

  • one item
  • one item
  • one item
    • one more item
    • one more item
    • one more item
      • one last item

It’s important to note that indentation matters in r markdown.

  1. First item
  2. Second item
  3. Third item
  1. subitem 1
  2. subitem 2
  3. subitem 3

Block Quotes

We can put really nice quotes into the document. We can do this by using the “>” symbol.

“Genes are like the story, and DNA is the language that the story is written in.”

— Sam Kean

Formulas

We can also put nice formatted formulas into R Markdown using two dollar signs.

Hard-Weinberg Formula

\[p^2 + 2qp + q^2 = 1\]

You can get very complex!

\[\theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

Code Chunks

Code chunk options

There are options with R markdown files that can interprit the code chunk. These are the following options:

Eval (T or F): whether or not to evaluate the code chunk.

Echo (T or F): whether or not to show the code for the chunk, but results will still print.

Cache: if enabled, the same code chunk will not be evaluated the next time that the kniter is run. Great for a code that has LONG run times.

fig.width or fig.height: the (graphical device) size of the r plots in inches. The figures are first written to the knitr document then to files that are saved separately.

out.width or out.height: the output size of the r plots in the R documents.

fig.cap: the words for the figure caption.

Table of Contents

We can also add a table of contents to our HTML document. We do this by altering the YAML code (the weird code chunk at the very top of the document). We can add this:

title: “R_Markdown” author: “MLD” date: “2024-11-12” output: html_document: toc: true toc_float: true

This will give us a very nice floating table of contents on the right hand side of the document.

Tabs

You can also add TABS in the report. To do this you need to specify each section that you want to become a tab by placing {.tabset} after the line.

Themes

You can add themes to the HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this, change your theme in the YAML to one of the following:

cerulean journal flatly readable spacelab united cosmo lumen_ paper sandstone simplex yeti null

You can also change the color by specifying highlight:

default tango payments kate monochrome espresso zenburn haddock textmate

Code Folding

You can also use the code_folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:

code_folding: hide

Summary

There are many options for you to customize your R code using the HTML format. This is also a great way to display any “portfolio” of your work if you are trying to market yourself to interested parties.

library(tidyverse)

my_data <- nycflights13::flights

head(my_data)

filter(my_data, month == 10, day == 14)

oct_14_flight <- filter(my_data, month == 10, day == 14)

(oct_14_flight <- filter(my_data, month == 10, day == 14))

(flight_through_september <- filter(my_data, month == 10))

Arrange

Arrange allows us to arrange the dataset based on the variables we desire

arrange(my_data, year, day, month)

We can also do this in in descending fashion

descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the dataframe regardless od ascending or descending

###Select

We can also select specific columns tht we want to look at.

calendar <- dplyr::select(my_data, year, month, day) print(calendar)

We can also look at a range of columns

calendar2 <- dplyr::select(my_data, year:day)

Lets look at all columns months through carrier

calendar3 <- dplyr::select(my_data, year:carrier)

We can also choose which columns NOT to include

everything_else <- dplyr::select(my_data, -(year:day))

In this instance we can also use the “not” opperator !

everything_else2 <- dplyr::select(my_data, !(year:day))

There are also some other helper functions that can help you select the columns or data you’re look

Starts_with(“xyz”) – will select the values that start with xyz ends_with(“xyz”) — will select the values that end with xyz contains(“xyz”) — will select the values that contain xyz matches(“xyz”) —- will match the indentical value xyz

Renaming

head(my_data)

Mutate

What if you want to add new columns to your data frame? we have the mutate() function for that.

First, lets make smaller data frame so we can see what were doing.

my_data_small <- dplyr::select(my_data, year:day, distance, air_time)

###Lets calculate the speed of the flights

mutate(my_data_small, speed = distance / air_time * 60)

What if we wanted to create a new dataframe with only your calculations (transmute).

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

Summarize and by_group()

We can use summarize to run a function on a data column to get a single return

dplyr::summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))

So we can see that the average delay is about 12 minutes

We can give additional vlaue in summerize by pairing it with by_group().

by_day <- group_by(my_data, year, month, day) dplyr::summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

As you can see, we now have the delay by the days of the year

Missing Data

What happens if we dont tell R what to do with missing data?

dplyr::summarize(by_day, delay = mean(dep_delay))

We can also filter our data based on NA (which in this dataset was cancelled flights)

Lets run summarize again on this data

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

dplyr::summarize(not_cancelled, delay = mean(dep_delay))

Counts

We can also count the number of variables that are NA

sum(is.na(my_data$dep_delay))

We can also count the numbers that are not NA

sum(!is.na(my_data$dep_delay))

Piping

With tibble datasets (more on them soon), we can piper results to get rid of the need to use the dollar signs We can then summarize the number of flights by minutes delayed

my_data %>% group_by(year, month, day) %>% dplyr::summarize(mean = mean(dep_time, na.rm = TRUE))

Tibbles

First lets load the packages we need

library(tibble)

Now we will take a time to explore tibbles. Tibbles are modified dataframes that tweak some of the older features from data frames. R is an old language and useful things from 20 years ago are not useful anymore

as_tibble(iris)

As we can see, we ahve the same data fram, but we have different features. You can also create a tibble from scratch with tibble()

tibble( x = 1:5, y = 1, z = x ^ 2 + y )

You can also use tribble() for basic data table creation

tribble( ~genea, ~geneb, ~ genec, ######################## 110, 112, 114, 6, 5, 4 )

Tibbles are built to not overwhelm your console when printing data the first few lines

This is how a data frame prints

print(by_day) as.data.frame(by_day) head(by_day)

nycflights13::flights %>% print(n=10, width = Inf)

Subsetting

Subsetting tibbles is easy, similar to data.frames

df_tibble <- tibble(nycflights13::flights)

df_tibble

We can subset by column name using the $.

df_tibble$carrier

we can subset by postion using [[]]

df_tibble[[2]]

If you want to use this in a pipe, you need to use the “.” placeholder

Some older functions do not like tibbles, thus you might have to convert them back to dataframe.

class(df_tibble)

df_tibble2 <- as.data.frame(df_tibble)

df_tibble

head(df_tibble2)

Tidyr

Lets load the packages we need

library(tidyverse)

How do we make a tidy dataset? well the tidyverse follows three rules

1 - Each variable must have its own column 2 - Each observation has its own row 3 - Each value has its own cell

It is impossible to satisfy two of the three rules.

This leads to the following instructions for tidy data

1 put each dataset into a tibble 2 put each variable into a column 3 profit

Picking one consistent method of data storage makes for eaier understanding of your code and what is happening behind the scenes

bmi <- tibble(women)

bmi %>% mutate(bmi = (703 * weight)/(height)^2)

Spreading and Gathering

Sometimes you’ll find datasets that dont fit well into a tibble

We’ll use the built-in data from tidyverse for this part

table4a

As you can see from this data, we have one variable in column A (country)but the columns b and c are two of the same. Thus, there are two observation in each row

To fix this, we can use the gather function

table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’)

Lets look at another example

table4b

As you can see we have the same problem in table 4b

table4b %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘population’)

Now what if we want to join these two tables? we can us dplyr.

table4a <- table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’) table4b <- table4b %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘population’)

left_join(table4a, table4b)

Joining with by = join_by(country, year)

Spreading

Spreading is the opposite of gathering. Lets look at table 2

table2

You can see that we have redundant info in columns 1 and 2 We can fix thast by combing rows 1 and 2, 3 and 4, etc.

spread(table2, key = type, value = count)

Type is the key of that we are turing into columns, the value is what becomes rows/obervations

In summary, spread makes long tables shorter and wider Gather makes wide tables narrowe and longer

Separating and pulling

Now what happens when we can two observations stuck in one column

table3

this shows cases and population is combined to be the rate

we can use separate to fix it.

table3 %>% separate(rate, into = c(“cases”, “population”))

But, now the column type is not correct

table3 %>% separate(rate, into = c(“cases”, “populate”), conver = TRUE)

You can specify what you want to separate based on

table3 %>% separate(rate, into = c(“cases”, “populate”), sep = “/”, conver = TRUE)

Now we can make it look more tidy.

table3 %>% separate( year, into = c(“cases”, “population”), convert = TRUE, sep = 2 )

DPLYR

it is rare we will get to work with a single data table. The DPLYR package allows

you to join data tables by common values

Mutate joins - add new variables to one data frame from matching observations in another

Filtering joins - filters observations from one data frame based on whether or not

they are present in another

Set operations - treats observations like they are set elements

library(tidyverse) library(nycflights13)

Pulling full carrier names based on letter codes

airlines

Pulling info about airports

airports

Pulling info about each plane

planes

Getting info on the weather at the airports

weather

Getting info on singular flights

flights

seeing how the tables connect

Flights -> planes based on tailnumber

flights -> airlines through carriers

flights -> planes origin and destinations

flights -> weather via origin, year/month/day/hour

keys

keys are unique identifiers per observation

primar key uniquily identifies an observation in its own table:

Examples:

planes %>% count(tailnum) %>% filter(n>1)

THis means the tailnumber is unique

planes %>% count(model) %>% filter(n>1)

Mutate Join

flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2

flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = “carrier”)

Now, we have added the airline name to the dataframe from the airline dataframe

Other types of joins

Inner Joins (inner_join) matches a pair of observs. when their key is equal

Outer Joins (outer_join) keeps observations that appear in at least one table

Unite

What happens if we want to do the inverse of separate

table5

table5 %>% unite(data, century, year)

table5 %>% unite(data, century, year, sep = ““)

Missing values

There can be two types of missing values. NA (explicit) or just no entry (implicit)

gene_data <- tibble( gene = c(‘a’, ‘a’, ‘a’, ‘a’, ‘b’, ‘b’, ‘b’), nuc = c(20, 22, 24, 25, NA, 42, 67), run = c(1,2,3,4,2,3,4) )

gene_data

The nucleotide count for Gene b run 2 is explicit missing The nucleotide count for gene b run 1 is implicitly missing

The one way we can make implicit missing values explicit is by putting runs in column

If we want to remove the missing values, we can use spread and gather, and na.rm =TRUE

gene_data %>% spread(gene, nuc) %>% gather (gene, nuc, ‘a’:‘b’, na.rm = TRUE)

Another way that we can make implicit values explicit, is complete()

gene_data %>% complete(gene, run)

If we want to remove the missing values, we can use spread and gather and na.rm = true

gene_data %>% spread(gene, nuc) %>% gather(gene, nuc, ‘a’, ‘b’, na.rm = TRUE)

Another way is by using complete().

gene_data %>% complete(gene, nuc)

Sometimes an NA is present to represent a value being carried forward

treatment <- tribble( ~person, ~treatment, ~response, ########################################### “Isaac”, 1, 7, NA, 2, 10, NA, 3, 9, “VDB”, 1, 8, NA, 2, 11, NA, 3, 10 )

treatment

Here, we can use the fill() option.

treatment %>% fill(person)

Stringr

library(tidyverse) library(tringr)

You can create using single or double quotes

string1 <- “this is a string” string2 <- ‘to put a “quote” in your string, use the opposite’

string1

string2

If you forget to close your string, this happens:

string3 <- “where is this string going?”

string3

Just hit escape and try again.

Multiple strings are stored in character vectors

string4 <- c(“one”, “two”, “three”)

string4

Measuring String Length

str_length(string3)

str_length(string4)

Now we will combine two strings

str_c(“x”, “y”)

str_c(string1, string2)

You can use sep to control how they are separated.

str_c(string1, string2, sep = ” “)

str_c(“x”, “y”, “z”, sep = “_“)

Subsetting strings

you can subset a string using str_sub().

MSP <- c(“MSP123”, “MSP234”, “MSP456”)

str_sub(MSP, 4, 6)

This drops first four letter of the string

Can also use negatives to count ack from end

str_sub(MSP, -3, -1)

Can convert cases of strings:

MSP str_to_lower(MSP)

And upper case

Regular Expression

install.packages(“htmlwidgets”)

x <- c(‘ATTAGA’, ‘CGCCCCCGGAT’, ‘TATTA’)

str_view(x, “G”)

str_view(x, “TA”)

Next, “.” will be where it matches an entry

str_view(x, “.G.”)

Anchors allow you to match at the start or the ending

str_view(x, “^TA”)

str_view(x, “TA$”)

Character classes/ alternitaves

atches any digit

matches any space

[abc] matches a, b, or c

str_view(x, “TA[GT]”)

[^abc] matches anything but a, b, or c

str_view(x, “TA[^T]”)

Can use | to pick between two alternitaves

str_view(x, “TA[G|T]”)

Detect Matches

str_detect returns = a logical vector the same length of input

y <- c(“apple”, “banana”, “pear”) y

str_detect(y, “e”)

How many common words start with letter ‘e’?

sum(str_detect(words, “e”))

What proportion of words end in vowels?

mean(str_detect(words, “[aeiou]$”))

mean(str_detect(words, “1”))

Words that do not contain “o” or “u”

no_o <- !str_detect(words, “[ou]”) no_o

Extacting

words[!str_detect(words, “[ou]”)]

Can also use str_count() to sya how many matches in a string

x

str_count(x, “GC”)

Coupling this with Mutate

df <- tibble( word = words, i = seq_along(word) )

df

df %>% mutate( vowels = str_count(words, “[aeiou]”), constonants = str_count(words, “[^aeiou]”) )

Mircoarrays

First Lets load all the packages we need for Micro arrays

library(ath1121501cdf)

library(ath1121501.db)

library(dplyr) library(stats) library(reshape)

library(affy)

Now we must read the cel files into a dirrectory

targets <- readTargets(“Bric16_Targets.csv”, sep = “,”, row.names = “filename”)

ab <- ReadAffy()


  1. aeiou↩︎