R Programming for Biologists: Workshop 1

Gavin Douglas
Aug. 7th, 2018

Object assignment

Assigning a character:

tmp <- "hello world!"
print(tmp)

[1] "hello world!"

Working with numbers:

a <- 20
b <- 5
c <- a^2 + b^2
c

[1] 425

Why do I use <- instead of =?

Function help pages

print is a function, which you can read about by typing:

?print

Functions in base R and in installed packages have a help page that you can open with the ? syntax.

Key R classes

(1) numeric

(2) integer

(3) logical

(4) character

(5) factor

Key R datatypes

(1) vector

(2) list

(3) matrix

(4) dataframe

Numeric vectors

One of the main functions we'll be using is c.

test_vec <- c(5, 42, 44, 6)
test_vec

[1]  5 42 44  6

Getting a particular index from a vector:

test_vec[3]

[1] 44

Note that the first index is indicated by 1, not 0:

test_vec[0]

numeric(0)
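
As a small extension of the indexing shown above (a sketch that reuses the test_vec values from this slide; the other object names are made up for illustration), square brackets also accept a vector of positions, or a negative position to drop an element:

```r
# Same vector as defined above
test_vec <- c(5, 42, 44, 6)

# Several positions at once: pass a vector of indices
first_and_last <- test_vec[c(1, 4)]
first_and_last    # 5 6

# A negative index drops that position instead of selecting it
without_second <- test_vec[-2]
without_second    # 5 44 6

# length() gives the number of elements, i.e. the last valid index
length(test_vec)  # 4
```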

Two types of numbers

What's the difference between the below two objects (hint: try class)?

x <- 1

y <- 1L

Logical (i.e. Boolean) vectors

logical_vec <- c(TRUE, F, T, 10 > 2, 10 == 1)
logical_vec

[1]  TRUE FALSE  TRUE  TRUE FALSE

Note no " characters:

class(c("TRUE", "FALSE"))

[1] "character"

Beware working with variables with Boolean names:

T <- FALSE
print(T)

[1] FALSE

Assignment vs comparison operator

These two lines have very different meanings:

      T == 24
      T = 24

Using comparison operators will return logical vectors

"CAT" == "DOG"

[1] FALSE

x <- 2
x < 3

[1] TRUE

x > 10

[1] FALSE
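
Comparison operators also work element by element on longer vectors. The sketch below uses a made-up vector of measurements (weights is a hypothetical name) to show how the resulting logical vector can be used for subsetting and counting:

```r
# Hypothetical measurements, for illustration only
weights <- c(12.5, 30.1, 8.4, 22.0)

# The comparison is applied to each element in turn
is_heavy <- weights > 20
is_heavy        # FALSE TRUE FALSE TRUE

# A logical vector inside [ ] keeps only the TRUE positions
heavy_weights <- weights[is_heavy]
heavy_weights   # 30.1 22.0

# sum() counts the TRUE values (TRUE is treated as 1, FALSE as 0)
n_heavy <- sum(is_heavy)
n_heavy         # 2
```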

Character vectors

Defining a character vector:

tmp <- c("hey,", "this", "is", "multiple", "strings")
class(tmp)

[1] "character"

Converting to a character vector:

tmp2 <- c(10, 20, 40.0, 19)
tmp2 <- as.character(tmp2)
tmp2

[1] "10" "20" "40" "19"

class(tmp2)

[1] "character"
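
Going the other way is also possible with as.numeric. A minimal sketch (the object names nums and mixed are made up here): strings that look like numbers convert cleanly, while anything else becomes NA.

```r
tmp2 <- as.character(c(10, 20, 40.0, 19))

# Character strings that look like numbers convert back cleanly
nums <- as.numeric(tmp2)
class(nums)    # "numeric"

# Strings that are not numbers become NA, and R normally prints a warning;
# suppressWarnings() is used here only to keep the output quiet
mixed <- suppressWarnings(as.numeric(c("3.5", "seven")))
mixed          # 3.5 NA
is.na(mixed)   # FALSE TRUE
```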

Factor (i.e. categorical) vectors

In R versions before 4.0.0, character columns in tables are read in as factors by default (set stringsAsFactors=FALSE to avoid this).

Explicitly defining factors is better:

tmp <- c("hey,", "this", "is", "multiple", "strings")
tmp

[1] "hey,"     "this"     "is"       "multiple" "strings"

tmp <- factor(tmp)
tmp

[1] hey,     this     is       multiple strings 
Levels: hey, is multiple strings this

Factor levels are sorted alphabetically by default

tmp <- factor(c("treated_WT", "control_WT", "treated_KO", "control_WT", "control_KO", "control_KO", "treated_WT", "treated_KO"))
print(tmp)

[1] treated_WT control_WT treated_KO control_WT control_KO control_KO
[7] treated_WT treated_KO
Levels: control_KO control_WT treated_KO treated_WT

You can explicitly set the factor levels you want:

tmp2 <- factor(tmp, levels = c("control_WT", "treated_WT", "control_KO", "treated_KO"))
print(tmp2)

[1] treated_WT control_WT treated_KO control_WT control_KO control_KO
[7] treated_WT treated_KO
Levels: control_WT treated_WT control_KO treated_KO
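
One way to see why level order matters is to look at the integer codes stored behind a factor. The sketch below uses a shortened version of the vector above; codes_default, codes_custom and counts are names made up for illustration.

```r
tmp <- factor(c("treated_WT", "control_WT", "treated_KO", "control_WT"))

# Default levels are sorted alphabetically
levels(tmp)                        # "control_WT" "treated_KO" "treated_WT"

# Each value is stored as the position of its level
codes_default <- as.integer(tmp)
codes_default                      # 3 1 2 1

# Re-defining the levels changes the codes, not the values themselves
tmp2 <- factor(tmp, levels = c("control_WT", "treated_WT", "treated_KO"))
codes_custom <- as.integer(tmp2)
codes_custom                       # 2 1 3 1

# table() counts the values and reports them in level order
counts <- table(tmp2)
counts
```

The level order is also the order that functions such as table and boxplot use when displaying groups.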

Class coercion

What class is the object z below?

      z <- c(4154163, "Hi there gang!")

What class is the object funInSun below?

      print(funInSun)

[[1]]
[1] 1841   51

[[2]]
[1] "heya"

Lists

You can combine objects of different types in lists.

tmp <- list("prime"=c(1, 2, 5, 7, 11), "animals"=c("cow", "chicken"))
tmp

$prime
[1]  1  2  5  7 11

$animals
[1] "cow"     "chicken"

The key advantage of using lists is that you can use the function lapply, e.g.

tmp2 <- list("set1"=c(1, 2, 5, 7, 11), "set2"=c(3, 8, 10), "set3"=c(4, 8))
lapply(tmp2, sum)

$set1
[1] 26

$set2
[1] 21

$set3
[1] 12
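
Two related points that may help here, shown as a sketch with the same tmp2 list (set_sums is a made-up name): individual elements can be pulled out by name, and sapply is a variant of lapply that simplifies the result to a vector when it can.

```r
tmp2 <- list("set1"=c(1, 2, 5, 7, 11), "set2"=c(3, 8, 10), "set3"=c(4, 8))

# Pull out one element by name; both forms return the numeric vector
tmp2$set2             # 3 8 10
tmp2[["set2"]]        # 3 8 10

# Single brackets return a list of length one instead
class(tmp2["set2"])   # "list"

# sapply returns a named numeric vector here because every result has length one
set_sums <- sapply(tmp2, sum)
set_sums
```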

Matrices

Only for data of the same type. Take up less memory than dataframes.

test_matrix <- matrix(c(10, 32, 13, 54), nrow=2, ncol=2)
test_matrix

     [,1] [,2]
[1,]   10   13
[2,]   32   54

test_matrix2 <- matrix(c("dog", "cat", 13, 54), nrow=2, ncol=2)
test_matrix2

     [,1]  [,2]
[1,] "dog" "13"
[2,] "cat" "54"
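
Note that matrix fills values column by column by default, as the output above shows. A short sketch of the byrow option and a few basic matrix operations (by_col and by_row are names made up for illustration):

```r
# Default: values fill the first column, then the second
by_col <- matrix(c(10, 32, 13, 54), nrow=2, ncol=2)
by_col[1, 2]   # 13

# byrow=TRUE fills the first row, then the second
by_row <- matrix(c(10, 32, 13, 54), nrow=2, ncol=2, byrow=TRUE)
by_row[1, 2]   # 32

# dim() returns the number of rows and columns
dim(by_row)    # 2 2

# t() transposes; here it turns by_col into by_row
identical(t(by_col), by_row)   # TRUE

# Arithmetic is applied to every element
by_col * 2
```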

Dataframes

Columns can be of different types. Take up more memory, but are easier to interact with.

test_df <- as.data.frame(matrix(c(10, 32, 13, 54), nrow=2, ncol=2))
test_df

  V1 V2
1 10 13
2 32 54

test_df2 <- data.frame(pet=c("dog", "cat"), livestock=c("cow", "sheep"), stringsAsFactors = FALSE)
test_df2

  pet livestock
1 dog       cow
2 cat     sheep

Select a single column:

test_df2$pet

[1] "dog" "cat"
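
Column names have to be spelled exactly as they appear in the table. The sketch below (pets, same and missing_col are made-up object names) shows two equivalent ways to select a column and what happens when the name does not exist:

```r
test_df2 <- data.frame(pet=c("dog", "cat"), livestock=c("cow", "sheep"),
                       stringsAsFactors = FALSE)

# Check the available column names first
names(test_df2)                # "pet" "livestock"

# $ and [[ ]] both return the column as a vector
pets <- test_df2$pet
same <- test_df2[["pet"]]
pets                           # "dog" "cat"

# A name that is not a column returns NULL rather than an error
missing_col <- test_df2$pets
is.null(missing_col)           # TRUE
```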

Row and column names

Matrices and dataframes can both have row and column names.

test_df2 <- data.frame(pet=c("dog", "cat"), livestock=c("cow", "sheep"), stringsAsFactors = FALSE)
test_df2

  pet livestock
1 dog       cow
2 cat     sheep

rownames(test_df2) <- c("Bill", "Sandy")
colnames(test_df2) <- c("Pet", "Livestock")
test_df2

      Pet Livestock
Bill  dog       cow
Sandy cat     sheep

Subsetting by row and column names

To get the row corresponding to Bill (note the column is left blank):

test_df2["Bill", ]

     Pet Livestock
Bill dog       cow

To get Sandy's livestock:

test_df2["Sandy", "Livestock"]

[1] "sheep"

However, we can't remove rows or columns by name (at least not without an extra step).

test_df2[-"Bill",]
Error in -"Bill" : invalid argument to unary operator

Subsetting by row and column indices

Return element at first row and second column:

test_df2[1, 2]

[1] "cow"

Remove the first row:

test_df2[-1,]

      Pet Livestock
Sandy cat     sheep

Using "which" to get indices

Remove the first row based on its name (using the which function, which returns the index of the matching row):

test_df2[-which(rownames(test_df2) == "Bill"),]

      Pet Livestock
Sandy cat     sheep
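
An alternative worth knowing is to subset with a logical vector directly. This is a sketch rather than the only approach; the row name "Alex" and the object names below are made up to illustrate a case with no match, where -which() silently returns zero rows.

```r
test_df2 <- data.frame(Pet=c("dog", "cat"), Livestock=c("cow", "sheep"),
                       row.names=c("Bill", "Sandy"), stringsAsFactors=FALSE)

# Keep every row whose name is not "Bill"
no_bill <- test_df2[rownames(test_df2) != "Bill", ]
rownames(no_bill)           # "Sandy"

# %in% handles several names at once, including names that are absent
keep <- !(rownames(test_df2) %in% c("Bill", "Alex"))
no_bill_or_alex <- test_df2[keep, ]
rownames(no_bill_or_alex)   # "Sandy"

# Caution: when nothing matches, which() returns an empty vector and
# negating it selects no rows at all instead of keeping every row
no_match <- test_df2[-which(rownames(test_df2) == "Alex"), ]
nrow(no_match)              # 0
```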

Example commands to read and write a table

Reading:

in_table <- read.table("myfile.txt", header=TRUE, row.names=1, sep="\t", stringsAsFactors=FALSE)

There are many other options, but these ones are good to be aware of. Note that in this case the first column would be interpreted as the rownames.

Writing:

write.table(in_table_modified, file="newfile.txt", quote=FALSE, sep="\t", col.names=NA, row.names=TRUE)

Make sure to set quote=FALSE so that no quotes are in your output file.
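
To see how these options fit together, here is a minimal round-trip sketch. It writes a small made-up table (example_df) to a temporary file created with tempfile, so no real file names are assumed, and then reads it back with the same options as above.

```r
# Small hypothetical table with row names
example_df <- data.frame(Pet=c("dog", "cat"), Livestock=c("cow", "sheep"),
                         row.names=c("Bill", "Sandy"), stringsAsFactors=FALSE)

# Temporary file path, so nothing in the working directory is overwritten
out_path <- tempfile(fileext=".txt")

# col.names=NA adds an empty header field above the row names column
write.table(example_df, file=out_path, quote=FALSE, sep="\t",
            col.names=NA, row.names=TRUE)

# Inspect the raw lines that were written
readLines(out_path)

# Read the file back; the first column becomes the row names
round_trip <- read.table(out_path, header=TRUE, row.names=1, sep="\t",
                         stringsAsFactors=FALSE)
round_trip
```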

Two other datatypes to be aware of

Both of these datatypes can be used in the place of dataframes.

data.table (data.table package)
tibble (tibble package, part of the tidyverse)

Additional basic default functions

mean
min/max
plot
boxplot
summary
str
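
A quick sketch of what some of these functions return, using a made-up numeric vector (lengths_mm is a hypothetical name). The plotting functions are left commented out because they open a graphics device.

```r
# Hypothetical measurements, for illustration only
lengths_mm <- c(12.1, 15.3, 9.8, 14.0, 11.8)

# Average of the values
avg <- mean(lengths_mm)
avg

# Smallest and largest values
rng <- c(min(lengths_mm), max(lengths_mm))
rng                    # 9.8 15.3

# summary() gives the quartiles along with the mean, min and max
summary(lengths_mm)

# str() shows the type, length and first few values of an object
str(lengths_mm)

# plot(lengths_mm)     # scatterplot of value against index
# boxplot(lengths_mm)  # box-and-whisker plot of the values
```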

Good practice when running R commands (1/2)

Clear your environment when starting your work

rm(list=ls())

Restart R (if necessary) to make sure no packages are loaded.
Make a new R script file to save your R commands; relying on “History” will end in sadness
Run commands in this R script file in RStudio by highlighting lines and hitting CTRL-RETURN

Good practice when running R commands (2/2)

Write comments starting with the # character before blocks of code
Write your comments and commands in a way that someone else could understand
Start a new line once you reach 80 characters
Next workshop: write custom functions to avoid writing repetitive code

Problem #1 - Writing a table

Write the test_df2 table to a file named “write_test.txt” with the write.table function.

Problem #2 - Exploring a table

Load the dataset mtcars with the command

data(mtcars)

What class is this object?
Contrast the outputs of summary and str.
Take a look at the first 10 rows of this object with:

head(mtcars, 10)

Problem #3 - Basic scatterplot

Plot a scatterplot of miles per gallon against displacement in the mtcars table.
Change the axes names and point type (with the pch option)

Problem #4 - Subsetting a table by rownames

Make a copy of the mtcars object, but only keep the rows called: “Valiant”, “Merc 230”, and “Lotus Europa”.
Make another copy of this object, but this time only with rows that contain “Merc” using grep.
Get the mean miles per gallon for all Merc cars.

Problem #5 - Dealing with missing data

Load the airquality dataset with:

data(airquality)

How many NA values are in this table?
When you sum a logical vector, TRUE is interpreted as 1 and FALSE is interpreted as 0. With this in mind, how could you use the rowSums function to identify rows that have any NA values?
Make a new dataframe with rows with any NA values removed.

Videos to watch for next week (week 2 of Coursera)

All week 2 Coursera videos up to and including “Scoping Rules - R Scoping Rules”.