R Programming for Biologists: Workshop 1

Gavin Douglas
Aug. 7th, 2018

Object assignment


Assigning a character:

tmp <- "hello world!"
print(tmp)
[1] "hello world!"

Working with numbers:

a <- 20
b <- 5
c <- a^2 + b^2
c
[1] 425

Why do I use <- instead of =?

Function help pages


print is a function, which you can read about by typing:

?print

Almost every function has a help page that you can open with the ? syntax.

Key R classes


(1) numeric

(2) integer

(3) logical

(4) character

(5) factor

Key R datatypes


(1) vector

(2) list

(3) matrix

(4) dataframe

Numeric vectors

One of the main functions we'll be using is c.

test_vec <- c(5, 42, 44, 6)
test_vec
[1]  5 42 44  6

Getting a particular index from a vector:

test_vec[3]
[1] 44

Note that the first index is indicated by 1, not 0:

test_vec[0]
numeric(0)
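
A few more ways of indexing the same vector, as a small sketch (the expected values in the comments follow from test_vec as defined above):

```r
test_vec <- c(5, 42, 44, 6)

# Several positions at once, using a vector of indices
test_vec[c(1, 3)]        # 5 44

# A range of positions
test_vec[2:4]            # 42 44 6

# A negative index drops that position instead of selecting it
test_vec[-1]             # 42 44 6

# A logical vector keeps the positions that are TRUE
test_vec[test_vec > 10]  # 42 44

# Asking for a position past the end returns NA rather than an error
test_vec[10]             # NA
length(test_vec)         # 4
```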

Two types of numbers


What's the difference between the below two objects (hint: try class)?

x <- 1
y <- 1L

Logical (i.e. Boolean) vectors

logical_vec <- c(TRUE, F, T, 10 > 2, 10 == 1)
logical_vec
[1]  TRUE FALSE  TRUE  TRUE FALSE

Note that logical values are written without " characters; with quotes they are character strings:

class(c("TRUE", "FALSE"))
[1] "character"

Beware that the shorthands T and F are ordinary variables and can be overwritten (TRUE and FALSE cannot):

T <- FALSE
print(T)
[1] FALSE
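
An overwritten T stays in your workspace until you remove it. A minimal sketch of checking for and undoing the problem, using only base R:

```r
T <- FALSE   # T is now a variable that masks the built-in shorthand
isTRUE(T)    # FALSE

rm(T)        # delete the variable from the workspace
isTRUE(T)    # TRUE: the built-in value is visible again
```

Writing TRUE and FALSE in full avoids the issue entirely.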

Assignment vs comparison operator

These two lines have very different meanings:

      T == 24
      T = 24
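
To make the difference concrete, here is a short sketch with a hypothetical variable called value (not part of the original slides):

```r
value <- 10

# Comparison: asks a question and leaves value unchanged
before <- value == 24
before            # FALSE
value             # 10

# Assignment: overwrites value
value = 24
after <- value == 24
after             # TRUE
```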

Using comparison operators will return logical vectors:

"CAT" == "DOG"
[1] FALSE
x <- 2
x < 3
[1] TRUE
x > 10
[1] FALSE
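
Comparisons also work element by element on longer vectors. The sketch below uses a made-up vector named counts to show how the resulting logical vector can be counted, combined, and used for subsetting:

```r
counts <- c(3, 12, 7, 25)   # hypothetical example values

above_ten <- counts > 10
above_ten                   # FALSE TRUE FALSE TRUE

# TRUE counts as 1 when summed, so this is the number of values above 10
sum(above_ten)              # 2

# Keep only the values that passed the test
counts[above_ten]           # 12 25

# Combine two tests with & (and); | means or, ! means not
counts > 5 & counts < 20    # FALSE TRUE TRUE FALSE
```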

Character vectors

Defining a character vector:

tmp <- c("hey,", "this", "is", "multiple", "strings")
class(tmp)
[1] "character"

Converting to a character vector:

tmp2 <- c(10, 20, 40.0, 19)
tmp2 <- as.character(tmp2)
tmp2
[1] "10" "20" "40" "19"
class(tmp2)
[1] "character"
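
Going the other way is done with as.numeric. A small sketch (tmp3 is a new name introduced here for illustration):

```r
tmp2 <- as.character(c(10, 20, 40.0, 19))

# Convert back to numbers so that arithmetic works again
tmp3 <- as.numeric(tmp2)
class(tmp3)   # "numeric"
sum(tmp3)     # 89

# Strings that do not look like numbers become NA (R also prints a warning,
# which is silenced here to keep the output short)
suppressWarnings(as.numeric(c("10", "ten")))   # 10 NA
```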

Factor (i.e. categorical) vectors

In R versions before 4.0.0, character columns in tables are read in as factors by default (set stringsAsFactors=FALSE to avoid this).


Explicitly defining factors is better:

tmp <- c("hey,", "this", "is", "multiple", "strings")
tmp
[1] "hey,"     "this"     "is"       "multiple" "strings" 
tmp <- factor(tmp)
tmp
[1] hey,     this     is       multiple strings 
Levels: hey, is multiple strings this

Factor levels are sorted alphabetically by default:

tmp <- factor(c("treated_WT", "control_WT", "treated_KO", "control_WT", "control_KO", "control_KO", "treated_WT", "treated_KO"))
print(tmp)
[1] treated_WT control_WT treated_KO control_WT control_KO control_KO
[7] treated_WT treated_KO
Levels: control_KO control_WT treated_KO treated_WT

You can explicitly set the factor levels you want:

tmp2 <- factor(tmp, levels = c("control_WT", "treated_WT", "control_KO", "treated_KO"))
print(tmp2)
[1] treated_WT control_WT treated_KO control_WT control_KO control_KO
[7] treated_WT treated_KO
Levels: control_WT treated_WT control_KO treated_KO
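
Behind the scenes a factor stores an integer code for each value, and the codes follow the level order. Functions such as boxplot arrange groups in that order, which is the usual reason for setting levels yourself. A shortened sketch of the example above:

```r
tmp <- factor(c("treated_WT", "control_WT", "treated_KO", "control_WT"))
levels(tmp)       # "control_WT" "treated_KO" "treated_WT"
as.integer(tmp)   # 3 1 2 1

# The same values with an explicit level order get different codes
tmp2 <- factor(tmp, levels = c("control_WT", "treated_WT", "treated_KO"))
as.integer(tmp2)  # 2 1 3 1

# Count how many values fall in each level, reported in level order
table(tmp2)
```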

Class coercion

What class is the object z below?

      z <- c(4154163, "Hi there gang!")

What class is the object funInSun below?

      print(funInSun)
[[1]]
[1] 1841   51

[[2]]
[1] "heya"

Lists

You can combine objects of different types in lists.

tmp <- list("prime"=c(2, 3, 5, 7, 11), "animals"=c("cow", "chicken"))
tmp
$prime
[1]  2  3  5  7 11

$animals
[1] "cow"     "chicken"
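
Elements of a list can be pulled out by name or by position. The sketch below uses a hypothetical list called sample_info (made up for this illustration) to show the difference between double and single square brackets:

```r
sample_info <- list("genes"=c("BRCA1", "TP53"), "counts"=c(10, 32, 13))

# $ and [[ ]] both return the element itself
sample_info$genes            # "BRCA1" "TP53"
sample_info[["counts"]]      # 10 32 13
sample_info[[2]]             # same element, selected by position

# Single brackets return a smaller list that still wraps the element
class(sample_info[["counts"]])   # "numeric"
class(sample_info["counts"])     # "list"

# Indexing can be chained to reach inside an element
sample_info$genes[2]         # "TP53"

names(sample_info)           # "genes" "counts"
length(sample_info)          # 2
```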


A key advantage of lists is that you can apply a function to every element with lapply, e.g.

tmp2 <- list("set1"=c(1, 2, 5, 7, 11), "set2"=c(3, 8, 10), "set3"=c(4, 8))
lapply(tmp2, sum)
$set1
[1] 26

$set2
[1] 21

$set3
[1] 12
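
lapply always returns a list. When each result is a single value, the related base R function sapply simplifies the output to a named vector, which is often easier to work with. A brief sketch using the same list:

```r
tmp2 <- list("set1"=c(1, 2, 5, 7, 11), "set2"=c(3, 8, 10), "set3"=c(4, 8))

# One sum per element, returned as a named numeric vector
sums <- sapply(tmp2, sum)
sums            # set1 = 26, set2 = 21, set3 = 12
sums["set2"]    # 21

# You can also pass a function you define on the spot;
# here each element is reduced to its values above 5
lapply(tmp2, function(x) x[x > 5])
```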

Matrices

Matrices hold data of a single type only. They take up less memory than dataframes.

test_matrix <- matrix(c(10, 32, 13, 54), nrow=2, ncol=2)
test_matrix
     [,1] [,2]
[1,]   10   13
[2,]   32   54
test_matrix2 <- matrix(c("dog", "cat", 13, 54), nrow=2, ncol=2)
test_matrix2
     [,1]  [,2]
[1,] "dog" "13"
[2,] "cat" "54"
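
Note that matrix fills values column by column unless told otherwise. The sketch below shows the byrow option along with a few ways to inspect and index a matrix (by_row is a new name used only for this illustration):

```r
test_matrix <- matrix(c(10, 32, 13, 54), nrow=2, ncol=2)

# Fill row by row instead of column by column
by_row <- matrix(c(10, 32, 13, 54), nrow=2, ncol=2, byrow=TRUE)
by_row
#      [,1] [,2]
# [1,]   10   32
# [2,]   13   54

dim(test_matrix)     # 2 2 (rows, then columns)
test_matrix[2, 1]    # 32: second row, first column
test_matrix[, 2]     # 13 54: the whole second column

# Transposing the column-filled matrix gives the row-filled one
identical(t(test_matrix), by_row)   # TRUE
```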

Dataframes

Columns can be of different types. Dataframes take up more memory than matrices, but are easier to interact with.

test_df <- as.data.frame(matrix(c(10, 32, 13, 54), nrow=2, ncol=2))
test_df
  V1 V2
1 10 13
2 32 54
test_df2 <- data.frame(pet=c("dog", "cat"), livestock=c("cow", "sheep"), stringsAsFactors = FALSE)
test_df2
  pet livestock
1 dog       cow
2 cat     sheep

Select a single column:

test_df2$pet
[1] "dog" "cat"
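
Column names must be spelled exactly. As a sketch of what to watch for: a name that does not exist returns NULL with the $ syntax instead of raising an error, so a typo can go unnoticed.

```r
test_df2 <- data.frame(pet=c("dog", "cat"), livestock=c("cow", "sheep"),
                       stringsAsFactors = FALSE)

# Three equivalent ways to get the pet column as a character vector
test_df2$pet          # "dog" "cat"
test_df2[["pet"]]     # "dog" "cat"
test_df2[, "pet"]     # "dog" "cat"

# A misspelled column name silently gives NULL
is.null(test_df2$pets)             # TRUE

# Checking the available names first avoids the surprise
colnames(test_df2)                 # "pet" "livestock"
"pets" %in% colnames(test_df2)     # FALSE
```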

Row and column names

Matrices and dataframes can both have row and column names.

test_df2 <- data.frame(pet=c("dog", "cat"), livestock=c("cow", "sheep"), stringsAsFactors = FALSE)
test_df2
  pet livestock
1 dog       cow
2 cat     sheep
rownames(test_df2) <- c("Bill", "Sandy")
colnames(test_df2) <- c("Pet", "Livestock")
test_df2
      Pet Livestock
Bill  dog       cow
Sandy cat     sheep

Subsetting by row and column names

To get the row corresponding to Bill (note the column is left blank):

test_df2["Bill", ]
     Pet Livestock
Bill dog       cow

To get Sandy's livestock:

test_df2["Sandy", "Livestock"]
[1] "sheep"

However, we can't remove rows or columns by name (at least not without an extra step).

test_df2[-"Bill",]
Error in -"Bill" : invalid argument to unary operator

Subsetting by row and column indices

Return element at first row and second column:

test_df2[1, 2]
[1] "cow"

Remove the first row:

test_df2[-1,]
      Pet Livestock
Sandy cat     sheep

Using "which" to get indices

Remove the first row based on its name (with the which function, which returns an index):

test_df2[-which(rownames(test_df2) == "Bill"),]
      Pet Livestock
Sandy cat     sheep
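
One caveat with this approach: if the name is not found, which returns an empty vector and the negative index then selects no rows at all. A logical test is a safer alternative. The sketch below rebuilds test_df2 so it runs on its own; the row name "Alex" is made up and deliberately absent from the table.

```r
test_df2 <- data.frame(Pet=c("dog", "cat"), Livestock=c("cow", "sheep"),
                       row.names=c("Bill", "Sandy"), stringsAsFactors=FALSE)

# No row is called "Alex", so which() returns integer(0)
# and every row is dropped
nrow(test_df2[-which(rownames(test_df2) == "Alex"), ])   # 0

# A logical test keeps all rows when nothing matches
nrow(test_df2[rownames(test_df2) != "Alex", ])           # 2

# %in% handles several names at once; only Sandy's row remains
kept <- test_df2[!(rownames(test_df2) %in% c("Bill", "Alex")), ]
rownames(kept)                                           # "Sandy"
```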

Example commands to read and write a table

Reading:

in_table <- read.table("myfile.txt", header=TRUE, row.names=1, sep="\t", stringsAsFactors=FALSE)

There are many other options, but these ones are good to be aware of. Note that in this case the first column would be interpreted as the rownames.

Writing:

write.table(in_table_modified, file="newfile.txt", quote=FALSE, sep="\t", col.names=NA, row.names=TRUE)

Make sure to set quote=FALSE so that no quotes are in your output file.
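
To see how the two commands fit together, here is a sketch that writes a small table to a temporary file and reads it back with the same options. The names example_df, out_path and reread are made up for this illustration.

```r
example_df <- data.frame(Pet=c("dog", "cat"), Livestock=c("cow", "sheep"),
                         row.names=c("Bill", "Sandy"), stringsAsFactors=FALSE)

# Write to a temporary file so that nothing in your working folder is touched
out_path <- tempfile(fileext=".txt")
write.table(example_df, file=out_path, quote=FALSE, sep="\t",
            col.names=NA, row.names=TRUE)

# Look at the raw lines; col.names=NA leaves a blank header cell
# above the rownames column
readLines(out_path)

# Read the file back in, treating the first column as rownames
reread <- read.table(out_path, header=TRUE, row.names=1, sep="\t",
                     stringsAsFactors=FALSE)
rownames(reread)
reread$Livestock

# Clean up the temporary file
unlink(out_path)
```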

Two other datatypes to be aware of


Both of these datatypes can be used in the place of dataframes.

  • data.table (data.table package)

  • tibble (tibble package, part of the tidyverse)

Additional basic default functions


  • mean

  • min/max

  • plot

  • boxplot

  • summary

  • str
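
A quick sketch of the non-plotting functions above, run on a made-up numeric vector called weights:

```r
weights <- c(21.5, 19.8, 24.1, 22.0, 20.6)   # hypothetical measurements

mean(weights)      # average of the values
min(weights)       # 19.8
max(weights)       # 24.1

# Minimum, quartiles, median, mean and maximum in one call
summary(weights)

# Compact description of the object's structure: type, length, first values
str(weights)
```

plot and boxplot accept the same vector, e.g. boxplot(weights), and open a graphics window.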

Good practice when running R commands (1/2)


  • Clear your environment when starting your work
rm(list=ls())
  • Restart R (if necessary) to make sure no packages are loaded.

  • Make a new Rscript file to save your R commands; relying on “History” will end in sadness

  • Run commands in this Rscript file in RStudio by highlighting lines and hitting CTRL-RETURN

Good practice when running R commands (2/2)


  • Write comments starting with the # character before blocks of code

  • Write your comments and commands in a way that someone else could understand

  • Start a new line once you reach 80 characters

  • Next workshop: write custom functions to avoid writing repetitive code

Problem #1 - Writing a table


Write the test_df2 table to a file named “write_test.txt” with the write.table function.

Problem #2 - Exploring a table


  • Load the dataset mtcars with the command
data(mtcars)
  • What class is this object?

  • Contrast the outputs of summary and str.

  • Take a look at the first 10 rows of this object with:
head(mtcars, 10)

Problem #3 - Basic scatterplot


  • Plot a scatterplot of miles per gallon against displacement in the mtcars table.

  • Change the axis labels and point type (with the pch option)

Problem #4 - Subsetting a table by rownames


  • Make a copy of the mtcars object, but only keep the rows called: “Valiant”, “Merc 230”, and “Lotus Europa”.

  • Make another copy of this object, but this time only with rows that contain “Merc” using grep.

  • Get the mean miles per gallon for all Merc cars.

Problem #5 - Dealing with missing data


  • Load the airquality dataset with:
data(airquality)
  • How many NA values are in this table?

  • When you sum a logical vector, TRUE is interpreted as 1 and FALSE is interpreted as 0. With this in mind, how could you use the rowSums function to identify rows that have any NA values?

  • Make a new dataframe with rows with any NA values removed.

Videos to watch for next week (week 2 of Coursera)


All week 2 Coursera videos up to and including “Scoping Rules - R Scoping Rules”.