Introduction

This mini hands-on tutorial serves as an introduction to R, covering the following topics:

  • Online sources of information about R;
  • Packages, Documentation and Help;
  • Basics and syntax of R;
  • Main R data structures: Vectors, Matrices, Data frames, Lists and Factors;
  • Brief intro to R control-flow via Loops and Conditionals;
  • Brief description of function declaration;
  • Listing of some of the most commonly used built-in R functions.

This document will guide you through the initial steps toward using R. RStudio will be used as the development platform for this workshop, since it is free software, available for Linux, Mac and Windows, and integrates many functionalities that facilitate the learning process. You can download it directly from: https://www.rstudio.com/products/rstudio/download/

This protocol is divided into 7 parts, each one identified by a Title, Predicted execution time (in parentheses), a brief Task description and the R commands to be executed. These will always be inside grey text boxes, with the font colored according to the R syntax highlighting.

Keep Calm… and Good Work!


Online Sources and other useful Bibliography

  • www.r-project.org (The developers of R)
  • www.statmethods.net (Quick-R)
  • www.cookbook-r.com (R code “recipes”)
  • www.bioconductor.org/help/workflows (R code for pipelines of genomic analyses)
  • Advanced R (If you want to learn R from a programmer's point of view)

  • Introductory Statistics with R (Springer, Dalgaard)
  • A first course in statistical programming with R (CUP, Braun and Murdoch)
  • Computational Genome Analysis: An Introduction (Springer, Deonier, Tavaré and Waterman)
  • R programming for Bioinformatics (CRC Press, Gentleman)


Package repositories

In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. Packages are stored in online repositories, from which they can be easily retrieved and installed on your computer (R packages by Hadley Wickham). There are 2 main R repositories:

  • The Comprehensive R Archive Network (CRAN);
  • Bioconductor (packages for bioscience data analysis).

This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package for free.

To set the repositories that you want to use when searching and installing packages:

setRepositories()   
# then input the numbers corresponding to the requested repositories
 # (it is advisable to use repositories 1 2 3 4 5 6 7 to cover most packages)

##### THIS TAKES VERY LONG, SO PLEASE DO IT AT HOME #####
# To install Bioconductor, run the following code
# (biocLite() has been retired; Bioconductor is now installed through the BiocManager package)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()


Basics

General Notes (about R and RStudio)

  1. R is case sensitive - be aware of capital letters (b is different from B).
  2. All R code lines starting with the # (cardinal or hash) sign are interpreted as comments, and therefore not evaluated.
# This is a comment
# 3 + 4   # this code is not evaluated, so it does not print any result
2 + 3     # the code before the hash sign is evaluated, so it prints the result (value 5)
[1] 5
  3. Expressions in R are evaluated from the innermost parentheses toward the outermost ones (just like in the usual mathematical evaluation).
# Example with parentheses:
((2+2)/2)-2
[1] 0
# Without parentheses:
2+2/2-2
[1] 1
  4. Spaces are not allowed in variable names; use a dot or underscore instead, e.g. my.variable_name.
  5. Spaces between variables and operators do not matter: 3+2 is the same as 3 + 2, and function (arg1 , arg2) is the same as function(arg1,arg2).
  6. If you want to write 2 expressions/commands on the same line, you have to separate them with a ; (semi-colon)
#Example:
3 + 2 ; 5 + 1  
[1] 5
[1] 6
  7. More recent versions of RStudio auto-complete your commands by showing possible alternatives as soon as you type 3 consecutive characters. To see the options after typing fewer than 3 characters, just press Tab. Tip: Use auto-complete as much as possible to avoid typing mistakes.
  8. There are 4 main vector data types: Logical (TRUE or FALSE); Numeric (e.g. 1, 2, 3…); Character (e.g. “u”, “alg”, “arve”) and Complex (e.g. 3+2i).
  9. Vectors are ordered sets of elements. In R, vectors are 1-based, i.e. the first index position is number 1 (as opposed to other languages, whose indices start at zero).
  10. R objects can be divided into two main groups: Functions and Data-related Objects. Functions receive arguments inside round brackets ( ), while data objects are indexed with arguments inside square brackets [ ]:

    function (arguments)
    data.object [arguments]
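
As a small illustration of the difference between calling a function and indexing a data object, consider the sketch below. The vector name values is chosen only for this example.

```r
# A numeric vector (a data object)
values <- c(4, 9, 16)

sqrt(values)      # round brackets: call the function sqrt on the whole vector
values[2]         # square brackets: extract the element at position 2
sqrt(values)[3]   # both combined: index the result of a function call
```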


Start/Quit RStudio

RStudio can be opened by double-clicking its icon. Alternatively, in Linux and Mac, one can start R by typing R in a terminal.

The R environment is controlled by hidden files (files that start with a .) in the startup directory: .RData, .Rhistory and .Rprofile (optional).

  • .RData saves everything in memory (it can become very large, so be careful);
  • .Rhistory saves all commands that have been typed during the R session;
  • .Rprofile is most useful for advanced users to customize R behaviour.

It is good practice to save these files under project-specific names:

# DO NOT RUN
save.image (file="myProjectName.RData")
savehistory (file="myProjectName.Rhistory")

To quit R (close it), use the q() function. You will be asked whether you want to save the workspace image (i.e. the .RData file):

q()
Save workspace image to ~/path/to/your/working/directory/.RData? [y/n/c]:

If you type y (yes), the entire R workspace will be written to the .RData file, which can become very large. Often it is sufficient to just save the analysis protocol in an R source file. This way one can quickly regenerate all data sets and objects in the future.


Installing Packages and Getting Help

R has many built-in ways of providing help regarding its functions and packages:

install.packages ("ggplot2")   # install the package called ggplot2
library ("ggplot2")    # load the package ggplot2 
help (package=ggplot2) # help(package="package_name") to get help about a specific package
vignette ("ggplot2")   # open a package vignette (a long-form guide shipped with the package)
?qplot   # ?function to get quick info about the function of interest


Working Environment

Your working environment is the place where the variables you define are stored. More advanced users can create more than one environment.

ls()    # list all objects in your environment
dir()   # list all files in your working directory
getwd() # find out the path to your working directory
setwd("/home/isabel") # example of setting a new working directory path
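
A minimal sketch of how objects enter and leave the working environment. The object name tmp_value is hypothetical and used only for this example.

```r
tmp_value <- 42            # create an object in the environment
"tmp_value" %in% ls()      # TRUE: the object is now listed by ls()
rm(tmp_value)              # remove the object from the environment
"tmp_value" %in% ls()      # FALSE: the object is gone
```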


Hands-on Tutorial

1. Create an RStudio project (30 min)

To start, we will open RStudio. This is an Integrated Development Environment (IDE) that includes a syntax-highlighting text editor (1), an R console (2), workspace and history management (3), and tools for plotting and exporting images, browsing the workspace and creating projects (4).

Figure 1: RStudio GUI


Projects are a great functionality: they ease the transition between different dataset analyses and allow fast navigation to your analysis/working directory. To create a new project:

File > New Project... > New Directory > Empty Project
Directory name: r-absoluteBeginners
Create project as a subdirectory of: ~/
                           Browse... (directory/folder to save the workshop data)
Create Project

Projects should be personalized by clicking on the menu in the upper right corner. The general options (R General) are the most important to customize, since they define how RStudio behaves when the project is opened. The following suggestions are particularly useful:

Restore .RData at startup - Yes (for analyses with more than 1 GB of data, you should choose "No")
Save .RData on exit - Ask
Always save history - Yes
Figure 2: Customize Project



2. Operators (60 min)

Important NOTE: Please create a new R Script file to save all the code you use for today’s tutorial and save it in your current working directory. Name it: r4ab_day1.R

Assignment Operators

Values are assigned to named variables with an <- (arrow) or an = (equal) sign. In most cases they are interchangeable; however, it is good practice to use the arrow, since it is explicit about the direction of the assignment. If the equal sign is used, the assignment always occurs from right to left (the value on the right is stored in the variable named on the left).

x <- 7     # assign the number 7 to the variable x
x          # R will print the value associated with variable x
y <- 9     # assign the number 9 to the variable y
z = 3      # assign the value 3 to the variable z
42 -> lue  # assign the value 42 to the variable lue
x ->  xx   # assign the value of x (7) to the variable xx, which becomes 7
xx
my_variable = 5   # my_variable has the value 5


Comparison Operators

Allow the direct comparison between values:

Symbol   Description
==       exactly the same (equal)
!=       different (not equal)
<        smaller than
>        greater than
<=       smaller or equal
>=       greater or equal

1 == 1   # TRUE
1 != 1   # FALSE
x > 3    # TRUE (x is 7)
y <= 9   # TRUE (y is 9)
my_variable < z   # FALSE (z is 3 and my_variable is 5)


Logical Operators

Compare logical (TRUE or FALSE) values:

Symbol   Description
&        AND
|        OR
!        NOT

QUESTION: Are these TRUE, or FALSE?

x < y & x > 10   # AND means that both expressions have to be true
x < y | x > 10   # OR means that at least one expression must be true
!(x != y & my_variable <= y)  # yet another AND example
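
The logical operators are also applied element-wise to logical vectors, which makes it easy to print a small truth table. In this sketch the vector names lhs and rhs are arbitrary.

```r
# All four combinations of two logical values
lhs <- c(TRUE, TRUE, FALSE, FALSE)
rhs <- c(TRUE, FALSE, TRUE, FALSE)

lhs & rhs   # TRUE FALSE FALSE FALSE  (TRUE only when both are TRUE)
lhs | rhs   # TRUE TRUE TRUE FALSE    (TRUE when at least one is TRUE)
!lhs        # FALSE FALSE TRUE TRUE   (negation)
```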


Arithmetic Operators

R makes calculations using the following arithmetic operators:

Symbol   Description
+        addition
-        subtraction
*        multiplication
/        division
^        exponentiation

3 / y   ## 0.3333333

x * 2   ## 14

3 - 4   ## -1

my_variable + 2   ## 7

2^z   ## 8


3. Data Structures (120 min)

Vectors

The basic data structure in R is the vector, which requires all of its elements to be of the same type (e.g. all numeric; all character (text); all logical (TRUE or FALSE)).
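
A short sketch of what happens when elements of different types are combined: R silently converts (coerces) them to a single common type. The variable names are chosen only for this example.

```r
# Mixing numbers, text and logicals: everything becomes character
mixed <- c(1, "a", TRUE)
class(mixed)     # "character"
mixed            # "1" "a" "TRUE"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
num.log <- c(1.5, TRUE, FALSE)
class(num.log)   # "numeric"
num.log          # 1.5 1.0 0.0
```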


Creating Vectors

Function   Description
c          combine/concatenate
:          integer sequence
seq        general sequence
rep        repetitive patterns

x <- c (1,2,3,4,5,6)
x
[1] 1 2 3 4 5 6
class (x)   # this function outputs the class of the object
[1] "numeric"
y <- 10
class (y)
[1] "numeric"
z <- "a string"
class (z)
[1] "character"
# The results are shown in the comments next to each line

seq (1,6)   ## 1 2 3 4 5 6
seq (from=100, by=1, length.out=5)   ## 100 101 102 103 104

1:6    ## 1 2 3 4 5 6
10:1   ## 10  9  8  7  6  5  4  3  2  1

rep (1:2, 3)   ## 1 2 1 2 1 2


Vectorized Arithmetic

Most arithmetic operations in the R language are vectorized, i.e. the operation is applied element-wise. When one operand is shorter than the other, the shorter one is recycled, i.e. its values are re-used from the start until it matches the length of the longer vector.

1:3 + 10:12
[1] 11 13 15
# Notice the warning: this is recycling (the shorter vector "restarts" the "cycling")
1:5 + 10:12
Warning in 1:5 + 10:12: longer object length is not a multiple of shorter
object length
[1] 11 13 15 14 16
x + y         # Remember that x = c(1, 2, 3, 4, 5, 6) and y = 10
[1] 11 12 13 14 15 16
c(70,80) + x
[1] 71 82 73 84 75 86
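
To make recycling explicit, the sketch below expands the shorter vector by hand with rep_len() and checks that the result matches what R does automatically. The names long.vec, short.vec and recycled are chosen only for this example.

```r
long.vec  <- 1:6
short.vec <- c(70, 80)

# Repeat the shorter vector until it has the length of the longer one
recycled <- rep_len(short.vec, length(long.vec))
recycled                                                # 70 80 70 80 70 80

# Automatic recycling gives the same result as the manual expansion
identical(short.vec + long.vec, recycled + long.vec)    # TRUE
```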


Subsetting/Indexing Vectors

Subsetting is the extraction of one or more elements of interest from a vector. There are several ways of doing this.

Note: Please remember that indices in R are 1-based (see introduction).

# Subsetting by indices
myVec <- 1:26 ; myVec
myVec [1]
myVec [4:9]

# LETTERS is a built-in vector with the 26 letters of the alphabet
myLOL <- LETTERS
myLOL[c(3,3,13,1,18)]

#Subsetting by same length logical vectors
myLogical <- myVec > 10 ; myLogical
# returns only the values in positions corresponding to TRUE in the logical vector
myVec [myLogical]


Naming indexes of a vector

Referring to an index by name rather than by position can make code more readable and flexible. Use the names() function to assign a name to each position of the vector.

joe <- c (24, 1.70)
names (joe)              ## NULL
names (joe) <- c ("age","height")
names (joe)              ## "age"    "height"
joe ["age"] == joe [1]   ## age   TRUE

names (myVec) <- LETTERS
myVec
# Subsetting by field names
myVec [c("A", "A", "B", "C", "E", "H", "M")] ## The Fibonacci Series :o)


Excluding elements

Sometimes we want to retain most elements of a vector, except for a few unwanted positions. Instead of specifying all elements of interest, it is easier to specify the ones we want to remove.

alphabet <- LETTERS
alphabet   # print vector alphabet
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
vowel.positions <- c(1,5,9,15,21)
alphabet[vowel.positions]    # print alphabet in vowel.positions
[1] "A" "E" "I" "O" "U"
consonants <- alphabet [-vowel.positions]  # exclude all vowels from the alphabet
consonants    # print the consonants vector (the alphabet without the vowels)
 [1] "B" "C" "D" "F" "G" "H" "J" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V"
[18] "W" "X" "Y" "Z"
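
Elements can also be excluded by value rather than by position, by negating a logical vector. This is only one possible alternative, sketched here with the %in% operator; the names vowels, is.vowel and consonants.by.value are chosen for this example.

```r
vowels <- c("A", "E", "I", "O", "U")

# Logical vector that is TRUE at the positions holding a vowel
is.vowel <- LETTERS %in% vowels

# Keep only the positions where is.vowel is FALSE
consonants.by.value <- LETTERS[!is.vowel]
length(consonants.by.value)   # 21
```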


Matrices

Matrices are two-dimensional vectors, explicitly created with the matrix function. Just like one-dimensional vectors, they store same-type elements.

IMPORTANT NOTE: R uses a column-major order for the internal linear storage of array values, meaning that first all of column 1 is stored, then all of column 2, etc. This implies that, by default, when you create a matrix, R will populate the first column, then the second, then the third, and so on until all values given to the matrix function are used.
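
A tiny sketch of what column-major storage means in practice: indexing a matrix with a single number walks down the columns. The names by.col and by.row are chosen only for this example.

```r
by.col <- matrix(1:6, nrow = 2)                 # filled column by column (default)
by.row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

by.col[3]           # 3: the third stored value sits in row 1 of column 2
as.vector(by.col)   # 1 2 3 4 5 6  (storage order equals the input order)
as.vector(by.row)   # 1 4 2 5 3 6  (storage is still column by column)
```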

my.matrix <- matrix (1:12, nrow=3, byrow = FALSE)   # byrow = FALSE is the default (see ?matrix) 
dim (my.matrix)   # check the dimensions of my.matrix
my.matrix

xx <- matrix (1:12, nrow=3, byrow = TRUE)
dim (xx)  # check that the dimensions of xx are the same as the dimensions of my.matrix
xx        # compare my.matrix with xx and make sure you understand what is happening


Subsetting/Indexing Matrices

Very Important Note: The arguments inside the square brackets in matrices (and data.frames - see next section) are the [row_number, column_number]. If any of these is omitted, R assumes that all values are to be used.

# Creating a matrix of characters
my.matrix <- matrix (LETTERS, nrow = 4, byrow = TRUE) 
# Please notice the warning message (related to the "recycling" of the LETTERS)

my.matrix
dim (my.matrix)

# Subsetting by indices 
my.matrix [,2]   # all rows, column 2 (returns a vector)
my.matrix [3,]   # row 3, all columns (returns a vector)
my.matrix [1:3,c(4,2)]   # rows 1, 2 and 3 from columns 4 and 2 (in this order) (returns a matrix)
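
As noted in the comments above, selecting a single row or column returns a plain vector. If the matrix shape should be kept, the drop = FALSE argument of the square brackets can be used, as in this sketch (small.matrix and the other names are chosen only for this example).

```r
small.matrix <- matrix(1:6, nrow = 2)

one.col.vector <- small.matrix[, 2]                 # simplified to a vector
one.col.matrix <- small.matrix[, 2, drop = FALSE]   # stays a 2 x 1 matrix

is.matrix(one.col.vector)   # FALSE
dim(one.col.matrix)         # 2 1
```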


Data Frames

Data frames are the most flexible and commonly used R data structures, used to hold spreadsheet-like tables.
In a data.frame, the observations are the rows and the variables are the columns. They can be treated like matrices in which each column is a vector whose elements share one type, but different columns can have different types. They are easily subset by index and by column name.

df <- data.frame (type=rep(c("case","control"),c(2,3)),time=rnorm(5))  
# rnorm draws random numbers from a normal distribution

class (df)   ## "data.frame"
df
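
The sketch below builds a small data frame whose columns have different types and inspects it. The data frame patients and its values are invented for illustration.

```r
# Three observations (rows) described by three variables (columns)
patients <- data.frame(
  id     = 1:3,
  group  = c("case", "control", "case"),
  weight = c(70.5, 82.1, 65.0),
  stringsAsFactors = FALSE   # keep text columns as character
)

sapply(patients, class)   # id: "integer", group: "character", weight: "numeric"
nrow(patients)            # 3
ncol(patients)            # 3
```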


Subsetting/Indexing Data Frames

Remember: The arguments inside the square brackets, just like in matrices, are the [row_number, column_number]. If any of these is omitted, R assumes that all values are to be used.

NOTE: R includes a package in its default base installation, named “The R Datasets Package”. This resource includes a diverse group of datasets, containing data from different fields: biology, physics, chemistry, economics, psychology, mathematics. These data are very useful to learn R. For more info about these datasets, run the following command: library(help=datasets)

# Familiarize yourself with the iris dataset (a built-in data frame)
iris

# Subset by indices the iris dataset
iris [,3]   # all rows, column 3 
iris [1,]   # row 1, all columns
iris [1:9, c(3,4,1,2)]   # rows 1 to 9 with columns 3, 4, 1 and 2 (in this order)

# Subset by column name (for data.frames)
iris$Species
iris[,"Sepal.Length"]

# Select the time column from the df data frame created above
df$time      ## e.g. 0.5229577 0.7732990 2.1108504 0.4792064 1.3923535 (your values will differ, since they are random)
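
Rows can also be selected with a logical condition placed in the row position of the square brackets. The following is a sketch using the built-in iris data; the threshold 7.5 and the name long.sepals are arbitrary choices for this example.

```r
# Keep only the rows where Sepal.Length is above 7.5, and two of the columns
long.sepals <- iris[iris$Sepal.Length > 7.5, c("Sepal.Length", "Species")]
long.sepals

# The same selection written with the subset() function
subset(iris, Sepal.Length > 7.5, select = c(Sepal.Length, Species))
```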


Lists

Lists are very powerful data structures, consisting of ordered sets of elements that can be arbitrary R objects (vectors, strings, functions, etc.) and can be heterogeneous, i.e. each element can be of a different type.

lst = list (a=1:3, b="hello", fn=sqrt)   # index 3 contains the function "square root"
lst
lst$fn(49)   # outputs the square root of 49


Subsetting/Indexing Lists

# Subsetting by indices
lst [1]     # returns a list with the data contained in position 1 (preserves the type list)
class (lst[1])

lst [[1]]   # returns the data contained in position 1 (simplifies to inner data type) 
class(lst[[1]])

# Subsetting by name
lst$b       # returns the data contained in the element named b, i.e. position 2 (simplifies to inner data type)
class(lst$b)

# Compare the class of these alternative indexing by name
lst["a"]
lst[["a"]]
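
The difference between single and double square brackets can be checked on a second, made-up list. The list sample.info and its contents are hypothetical.

```r
sample.info <- list(id = "S01", counts = c(12, 7, 30))

class(sample.info["counts"])     # "list": single brackets return a sub-list
class(sample.info[["counts"]])   # "numeric": double brackets return the element itself

# Once the element is extracted, it can be indexed like any other vector
sample.info[["counts"]][2]       # 7
sample.info$counts[2]            # 7
```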


Data structures can be interconverted (coerced) from one type to another:

Sometimes it is useful to convert between data structure types (particularly when using packages). R has several functions for such conversions:

# To check the object class:
class(lst)

# To check the basic structure of an object:
str(lst)

# "Force" the object to be of a certain type:
 # (this is not valid code, just a syntax example)
as.matrix (myDataFrame)
as.numeric (myChar)
as.data.frame (myMatrix)
as.character (myNumeric)
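
A runnable sketch of the same conversions, using small objects created on the spot. The names my.char, my.df and my.mat are chosen only for this example.

```r
# Character to numeric, and numeric to character
my.char <- c("1", "2.5", "10")
as.numeric(my.char)          # 1.0 2.5 10.0
sum(as.numeric(my.char))     # 13.5
as.character(1:3)            # "1" "2" "3"

# Data frame to matrix, and back to data frame
my.df  <- data.frame(a = 1:2, b = c(0.5, 1.5))
my.mat <- as.matrix(my.df)
is.matrix(my.mat)                      # TRUE
is.data.frame(as.data.frame(my.mat))   # TRUE
```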


4. Loops and Conditionals in R (60 min)

for() and while() loops

R allows the implementation of loops, i.e. replicating instructions in an iterative way (also called cycles). The most common ones are for() loops and while() loops.

# creating a for loop to calculate the first 12 values of the Fibonacci sequence
my.x <- c(1,1)
for (i in 1:10) {
  my.x <- c(my.x, my.x[i] + my.x[i+1])
  print(my.x)
}

# while loops will execute a block of commands until a condition is no longer satisfied
x <- 3 ; x
while (x < 9)
{
  cat("Number", x, "is smaller than 9.\n") # cat is a printing function (see ?cat)
   x <- x+1
}
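
The Fibonacci loop above grows the vector by one element in each iteration. A common alternative, sketched here, is to create a vector of the final length first and then fill it in. The names n.values and fib are chosen only for this example.

```r
n.values <- 12
fib <- numeric(n.values)   # vector of 12 zeros, to be filled in
fib[1:2] <- 1              # the first two values of the sequence

for (i in 3:n.values) {
  fib[i] <- fib[i - 1] + fib[i - 2]   # each value is the sum of the previous two
}
fib   # 1 1 2 3 5 8 13 21 34 55 89 144
```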


Conditionals: if() statements

Conditionals allow running commands only when certain conditions are TRUE.

x <- -5 ; x
if (x >= 0) { print("Non-negative number") } else { print("Negative number") }
 # Note: The else clause is optional. If the command is run at the command line
  # and there is an else clause, then either the whole if/else statement must be
  # enclosed in curly braces, or else must be on the same line as the closing
  # brace of the if block.

# coupled with a for loop
x <- c(-5:5) ; x
for (i in seq_along(x)) {
  if (x[i] > 0) {
    print(x[i])
  } else {
    print("zero or negative number")   # x[i] > 0 is also FALSE for zero
  }
}
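
When more than two outcomes are needed, conditions can be chained with else if. This is a minimal sketch; the vectors values and sign.labels are invented for the example.

```r
values <- c(-2, 0, 3)
sign.labels <- character(length(values))   # empty character vector to fill in

for (i in seq_along(values)) {
  if (values[i] > 0) {
    sign.labels[i] <- "positive"
  } else if (values[i] < 0) {
    sign.labels[i] <- "negative"
  } else {
    sign.labels[i] <- "zero"
  }
}
sign.labels   # "negative" "zero" "positive"
```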


Conditionals: ifelse() statements

The ifelse() function evaluates a condition element-wise and returns, for each element, one value where the condition is TRUE and another where it is FALSE. Its major advantage over the standard if-else statement is that it is vectorized.

# re-code gender 1 as F (female) and 2 as M (male)
gender <- c(1,1,1,2,2,1,2,1,2,1,1,1,2,2,2,2,2)
ifelse(gender == 1, "F", "M")
 [1] "F" "F" "F" "M" "M" "F" "M" "F" "M" "F" "F" "F" "M" "M" "M" "M" "M"
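
Calls to ifelse() can be nested to recode a vector into more than two categories. In this sketch the temperature values, the cut-offs and the labels are all made up for illustration.

```r
temperature <- c(-3, 12, 25, 31)

# below 0 is "freezing", from 0 up to (not including) 20 is "mild", the rest is "warm"
temp.label <- ifelse(temperature < 0, "freezing",
                     ifelse(temperature < 20, "mild", "warm"))
temp.label   # "freezing" "mild" "warm" "warm"
```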


5. Functions (60 min)

R allows defining new functions using the function keyword. The syntax (in pseudo-code) is the following:

my.function.name <- function (argument1, argument2, ...) { 
  expression1
  expression2
  ...
  return (value)
  }

Now, let's code our own function to calculate the average (or mean) of the values from a vector:

# Define the function
my.average <- function (x) {
  average.result <- sum(x)/length(x)
  return (average.result)
}

# Create the data vector
my.data <- c(10,20,30)

# Run the function using the vector as argument
my.average(my.data)

# Compare with R built-in mean function
mean(my.data)
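
Function arguments can be given default values. As a sketch, the hypothetical function my.average2 below extends the idea of my.average with an optional argument (remove.na, a name invented here) that drops missing values before averaging.

```r
# remove.na defaults to FALSE, so it only needs to be given when wanted
my.average2 <- function (x, remove.na = FALSE) {
  if (remove.na) {
    x <- x[!is.na(x)]   # keep only the non-missing values
  }
  return (sum(x) / length(x))
}

my.average2(c(10, 20, NA))                     # NA (missing value propagates)
my.average2(c(10, 20, NA), remove.na = TRUE)   # 15
```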


6. Loading data and Saving files (30 min)

Most R users need to load their datasets, usually saved as table files (e.g. .csv files exported from Excel), to be able to analyse and manipulate them. After the analysis, the results need to be exported/saved (e.g. to view in another program).

# Inspect the esoph built-in data set
esoph
dim(esoph)
colnames(esoph)

### Saving ###
# Save to a file named esophData.csv the esoph R dataset, separated by commas and
 # without quotes (the file will be saved in the current working directory)
write.table (esoph, file="esophData.csv", sep="," , quote=FALSE)

# Save to a file named esophData.tab the esoph dataset, separated by tabs and without
 # quotes (the file will be saved in the current working directory)
write.table (esoph, file="esophData.tab", sep="\t" , quote=FALSE)

### Loading ###
# Load a data file into R (the file should be in the working directory)
  # read a table with columns separated by tabs
my.data.tab <- read.table ("esophData.tab", sep="\t", header=TRUE)
 # read a table with columns separated by commas
my.data.csv <- read.csv ("esophData.csv", header=TRUE)

Note: if you want to load or save the files in directories different from the working dir, just use (inside quotes) the full path as the first argument, instead of just the file name (e.g. “/home/Desktop/r_Workshop/esophData.csv”).
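
A quick way to check that saving and loading work as expected is a round trip: write a small table, read it back and compare. This sketch writes to a temporary file created with tempfile() so that nothing is left in the working directory; the table small.table is invented for the example.

```r
small.table <- data.frame(id = 1:3, label = c("a", "b", "c"))

tmp.file <- tempfile(fileext = ".csv")                    # path to a temporary file
write.csv(small.table, file = tmp.file, row.names = FALSE)

reloaded <- read.csv(tmp.file, header = TRUE)
reloaded
identical(reloaded$id, small.table$id)   # TRUE

file.remove(tmp.file)                                     # clean up
```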


7. Some Great R Functions to “play” with (60 min)

# the unique function returns a vector with unique entries only (remove duplicated elements)
unique (iris$Sepal.Length)

# length returns the size of the vector
length (unique (iris$Sepal.Length))

# table counts the occurrences of entries (tally)
table (iris$Species)

# aggregate computes statistics of data aggregates
aggregate (iris[,1:4], by=list (iris$Species), FUN=mean, na.rm=T)

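# %in% tests, for each element of the left vector, whether it is present in the right vector
 # (used as an index, as below, it returns the intersect between the two vectors)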
month.name [month.name %in% c("CCMar","May", "Fish", "July", "September","Cool")]

# merge joins data.frames based on a common column (that functions as a "key")
df1 <- data.frame(x=1:5, y=LETTERS[1:5]) ; df1
df2 <- data.frame(x=c("Eu","Tu","Ele"), y=1:6) ; df2
merge (df1, df2, by.x=1, by.y=2, all = TRUE)

# cbind and rbind take a sequence of vector, matrix or data-frame arguments
 # and combine them by columns or rows, respectively
my.binding <- as.data.frame(cbind(1:7, LETTERS[1:7]))    # cbind builds a character matrix, so the numbers are coerced to text
my.binding
my.binding <- cbind(my.binding, 8:14)[, c(1, 3, 2)] # insert a new column and re-order them
my.binding

my.binding2 <- rbind(seq(1,21,by=2), c(1:11))
my.binding2

# reverse the vector
rev (LETTERS)
# sum and cumulative sum
sum (1:50); cumsum (1:50)
# product and cumulative product
prod (1:25); cumprod (1:25)

### Playing with some R built-in datasets (see library(help=datasets) )
iris   # familiarize yourself with the iris data

# mean, standard deviation, variance and median 
mean (iris[,2]); sd (iris[,2]); var (iris[,2]); median (iris[,2]) 

# minimum, maximum, range and summary statistics
min (iris[,1]); max (iris[,1]); range (iris[,1]); summary (iris)

# exponential, logarithm
exp (iris[1,1:4]); log (iris[1,1:4])

# sine, cosine and tangent (radians, not degrees)
sin (iris[1,1:4]); cos (iris[1,1:4]); tan (iris[1,1:4]) 

# sort, order and rank the vector
 # (unlist turns the one-row data frame into a plain numeric vector)
sort (unlist (iris[1,1:4])); order (unlist (iris[1,1:4])); rank (unlist (iris[1,1:4]))

# useful to be used with if conditionals
any (iris[1,1:4] > 2)   # ask R if there are any values higher than 2
all (iris[1,1:4] > 2)   # ask R if all values are higher than 2

# select data
which (iris[1,1:4] > 2)
which.max (iris[1,1:4]) 


The esoph (Smoking, Alcohol and (O)esophageal Cancer data) built-in dataset presents 2 types of variables: continuous numerical variables (the number of cases and the number of controls), and discrete categorical variables (the age group, the tobacco smoking group and the alcohol drinking group). Sometimes it is hard to “categorize” continuous variables, i.e. to group them in specific intervals of interest, and name these groups (also called levels).

Accordingly, imagine that we are interested in classifying the number of cancer cases according to their occurrence: frequent, intermediate and rare. This type of variable recoding into factors is easily accomplished using the function cut(), which divides the range of x into intervals and codes the values in x according to the interval they fall into.

# subset non-contiguous data from the esoph dataset
esoph
summary(esoph)
# cancers in patients consuming more than 30 g/day of tobacco
subset(esoph$ncases, esoph$tobgp == "30+")
# total nr of cancers in patients aged 75 or older
sum(subset(esoph$ncases, esoph$agegp == "75+"))

# factorize the nr of cases in 3 levels, equally spaced,
 # and add the new column cat_ncases, to the dataset
esoph$cat_ncases <- cut (esoph$ncases,3,labels=c("rare","med","freq"))
summary(esoph)
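
Instead of asking cut() for a number of equally spaced intervals, the interval limits can be given explicitly through the breaks argument. The ages, limits and labels in this sketch are invented for illustration; by default each interval includes its upper limit.

```r
ages <- c(3, 17, 25, 64, 80)

# intervals: (0,18], (18,65] and (65,Inf]
age.group <- cut(ages, breaks = c(0, 18, 65, Inf),
                 labels = c("child", "adult", "senior"))

as.character(age.group)   # "child" "child" "adult" "adult" "senior"
table(age.group)          # counts per level: child 2, adult 2, senior 1
```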


END