1. Introduction

This mini hands-on tutorial serves as an introduction to R and will guide you through the initial steps towards using it.

RStudio will be used as the development platform for this workshop, since it is free software, available for Linux, Mac and Windows, that integrates many functionalities that facilitate the learning process. You can download it directly from: https://www.rstudio.com/products/rstudio/download/

This protocol is divided into 7 parts (sections 5.1 to 5.7), each one identified by a Title, Predicted execution time (in parentheses), a brief Task description and the R commands to be executed. These will always be inside grey text boxes, with the font colored according to the R syntax highlighting.

Keep Calm… and Good Work!

2. Online Sources and other useful Bibliography

3. Package repositories

To set the repositories that you want to use when searching and installing packages:

setRepositories()   # then input the numbers corresponding to the requested repositories (it is advisable to use repositories 1 2 3 4 5 6 7 to cover most packages)

4. Basics

General Notes

  1. R is case sensitive - be aware of capital letters (x is different from X).
  2. Expressions in R are evaluated from the innermost parentheses toward the outermost ones (just like in the usual mathematical notation). Example with parentheses: ((2+2)/2)-2 = 0; without parentheses: 2+2/2-2 = 1
  3. Variable names cannot contain spaces - use a dot or an underscore instead (my.variable_name).
  4. Spaces between variables and operators do not matter (3+2 is the same as 3 + 2, and function (arg1, arg2) is the same as function(arg1,arg2)).
  5. If you want to write 2 expressions/commands on the same line, you have to separate them with a ; (semicolon). Example: 3 + 2 ; 5 + 1
  6. More recent versions of RStudio autocomplete your keystrokes by showing you possible commands/variables as soon as you type 3 consecutive characters. If you want to see the options for fewer than 3 characters, just press Tab.
  7. There are 4 main data types: Logical (TRUE or FALSE); Numeric (ex. 1, 2.5, 3…); Character (ex. “u”, “alg”, “arve”) and Complex (ex. 2+3i)
  8. Vectors in R are 1-based, i.e. the first index position is number 1.
  9. All R code lines starting with the # (cardinal or hash) sign are not interpreted and function as comments.
# This is a comment
# 3 + 4   # this code is not evaluated, so it does not print the value 7
2 + 3   # the code before the hash sign is evaluated, so it prints 5
## [1] 5
  10. R has two main “structures”: Functions and Objects. Functions receive arguments inside parentheses ( ) and objects are indexed with arguments inside square brackets [ ]

function (arguments)
object [arguments]
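
As a small illustration of notes 7 and 8 above, the sketch below creates one value of each main data type and checks it with class(). All variable names are made up for this example.

```r
# One example value per main data type (names are illustrative only)
a_logical   <- TRUE
a_numeric   <- 3.5
a_character <- "alga"
a_complex   <- 2 + 3i

class(a_logical)     # "logical"
class(a_numeric)     # "numeric"
class(a_character)   # "character"
class(a_complex)     # "complex"

# Indexing is 1-based: the first element is at position 1
some_words <- c("u", "alg", "arve")
some_words[1]        # "u"
```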

Start/Quit R

RStudio can be opened by double-clicking its icon. Alternatively, in Linux and Mac, one can start R by typing ’R’ in a terminal.

The R environment is controlled by hidden files (files that start with a .) in the startup directory: .RData, .Rhistory and .Rprofile (optional).

  • .RData saves everything in memory (can be very large - be careful);
  • .Rhistory keeps all the commands you type, so they can be saved automatically;
  • .Rprofile holds startup settings and is most useful for advanced users.

It is good practice to save these files under a project-specific name:

save.image (file="myProjectName.RData")
savehistory (file="myProjectName.Rhistory")

To quit R (close it), use the q () function. You will be asked whether you want to save the workspace image (i.e. the .RData file):

q()
Save workspace image to ~/path/to/your/working/directory/.RData? [y/n/c]:

If you type ’y’ (yes), the entire R workspace will be written to the .RData file, which can become very large. Often it is sufficient to just save an analysis protocol in an R source file. This way one can quickly regenerate all data sets and objects in the future.
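
The idea of keeping the analysis in a source file can be sketched as follows. This is only a minimal example: the file name myAnalysis.R and the objects it creates are hypothetical, and tempdir() stands in for a real project folder.

```r
# Write a tiny analysis script to a temporary file (path is illustrative)
script_path <- file.path(tempdir(), "myAnalysis.R")
writeLines(c("squares <- (1:5)^2",
             "total <- sum(squares)"),
           con = script_path)

# Running the script recreates the objects, so no .RData file is needed
source(script_path)
total   # 55
```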

Installing Packages and Getting Help

Packages are installed and loaded with the functions below, and R has many built-in ways of providing help regarding its functions and packages:

install.packages ("diagram")   # install the package called diagram
library (diagram)   # load the library diagram 
require(diagram)   # alternative way to load a package
help (package=diagram) # help(package="package_name") to get help about a given package
vignette ("diagram")   # launch a pdf with the package manual (in R they are called vignettes)
?plotmat   # ?function to get quick info about the function of interest
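
A possible complement to the commands above, as a sketch: check whether a package is available before trying to load it, and search for functions by name. requireNamespace() and apropos() are base R functions; the message text is just an example.

```r
# Check whether a package is installed without loading it.
# "diagram" is the package used above; installing it needs an internet connection.
pkg <- "diagram"
if (!requireNamespace(pkg, quietly = TRUE)) {
  message("Package '", pkg, "' is not installed yet; run install.packages() first")
}

# List the names of all visible objects that contain "mean"
mean_like <- apropos("mean")
head(mean_like)
```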

Working Environment

ls() # list all objects in your environment
dir() # list all files in your working directory
getwd() # find out the path to your working directory
setwd("/home/isabel") # example of setting a new working directory path
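
A short sketch of how these functions work together. Here tempdir() is used only as a stand-in for a real project folder, and example.txt is a hypothetical file name.

```r
# Remember the current working directory so that it can be restored later
old_wd <- getwd()

# Move to another folder and create an empty file there
setwd(tempdir())
file.create("example.txt")
"example.txt" %in% dir()   # TRUE: the new file is listed

# Go back to where we started
setwd(old_wd)
```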

5. Hands-on Tutorial

5.1. Create an RStudio project (15 min)

To start, we will open RStudio. This is an Integrated Development Environment - IDE - that includes a syntax-highlighting text editor, an R console, workspace and history management, and tools for plotting and exporting images, browsing the workspace and creating projects.

Figure: RStudio GUI

Projects are a great functionality: they ease the transition between analyses of different datasets and allow fast navigation to your analysis/working directory. To create a new project:

File > New Project... > New Directory > Empty Project
Directory name: r-absoluteBeginners
Create project as a subdirectory of: ~/               Browse... (directory/folder to save the workshop data)
Create Project

Projects should be personalized by clicking on the menu in the upper-right corner. The general options - R General - are the most important to customize, since they define how RStudio “behaves” when the project is opened. The following suggestions are particularly useful:

Restore .RData at startup - **Yes** (for analyses with +1GB of data, you should choose "No")
Save .RData on exit - **Ask**
Always save history - **Yes**

Figure: Customize Project

5.2. R Syntax :: Operators (30 min)

Assignment Operators

Values are assigned to named variables with an <- (arrow) or an = sign. The arrow is the asymmetric assignment operator (the output is saved, or directed, to the object it points to), while the equal sign is the symmetric operator (the object and the function become equal). In most cases they are interchangeable; however, it is good practice to use the arrow, because = is also used to name arguments inside function calls, and the arrow makes it clear that a value is being assigned.

object <- function (arguments)
object <- object [arguments]

x <- 7   # assign the number 7 to the variable x
y <- 9   # assign the number 9 to the variable y
z = 3 # assign the value 3 to the variable z
my_variable = 5   # my_variable has the value 5
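
One place where the two operators are not interchangeable is inside a function call. The sketch below uses sum() and two hypothetical names (first_demo, second_demo) to show the difference.

```r
# Inside a call, "=" only names an argument: no variable is created
sum(first_demo = 2, 3)      # 5
exists("first_demo")        # FALSE

# Inside a call, "<-" assigns first and then passes the value on
sum(second_demo <- 2, 3)    # 5
second_demo                 # 2
```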

Comparison Operators

Allow the direct comparison between values:

Symbol   Description
==       exactly the same (equal)
!=       different (not equal)
<        smaller than
>        greater than
<=       smaller or equal
>=       greater or equal

1 == 1   # TRUE
1 != 1   # FALSE
x > 3   # TRUE
y <= 9   # TRUE
my_variable < z   # FALSE

Logical Operators

Allow the joining of several comparisons:

Symbol   Description
&        AND
|        OR
!        NOT

QUESTION: Are these TRUE or FALSE?

x < y & x > 10   # AND means that both expressions have to be true
x < y | x > 10   # OR means that at least one expression must be true
x != y & my_variable <= y  # yet another AND example
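
The same operators also work element by element on whole vectors. The following sketch uses a made-up vector called ages, so it does not give away the answers to the question above.

```r
# Illustrative values (not the x and y used above)
ages <- c(12, 25, 40, 67)

ages > 18 & ages < 65     # FALSE  TRUE  TRUE FALSE
ages < 18 | ages >= 65    #  TRUE FALSE FALSE  TRUE
!(ages > 18)              #  TRUE FALSE FALSE FALSE
```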

Calculations

R can also make calculations:

Symbol   Description
+        addition
-        subtraction
*        multiplication
/        division
^        exponentiation (power)

3 / y   ## [1] 0.3333333

x * 2   ## [1] 14

3 - 4   ## [1] -1

my_variable + 2   ## [1] 7

2^z   ## [1] 8
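
A few more arithmetic tools from base R, shown here as a short sketch with literal numbers so that no existing variable is changed:

```r
17 %/% 5      # integer division: 3
17 %% 5       # remainder (modulo): 2
sqrt(16)      # square root: 4
2 + 3 * 4     # multiplication is done before addition: 14
(2 + 3) * 4   # parentheses change the order: 20
```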

5.3. R Syntax :: Loading data and Saving files (15 min)

Most R users need to load their datasets, usually saved as table files (e.g. .csv files exported from Excel), to be able to analyse and manipulate them. After the analysis, the results need to be exported/saved (e.g. to view in Excel).

# Save to a file named orangeTreeData.csv the Orange dataset, separated by commas and without quotes (the file will be saved in the current working directory)
write.table (Orange, file="orangeTreeData.csv", sep="," , quote=FALSE)

# Save to a file named orangeTreeData.tab the Orange dataset, separated by tabs and without quotes (the file will be saved in the current working directory)
write.table (Orange, file="orangeTreeData.tab", sep="\t" , quote=FALSE)

# Load my data file into R (the file should be in the working directory)
my.data.tab <- read.table ("orangeTreeData.tab", sep="\t", header=TRUE)   # read a table with columns separated by tabs
my.data.csv <- read.csv ("orangeTreeData.csv", header=TRUE)   # read a table with columns separated by commas

Note: if you want to load or save the files in directories different from the working dir, just give the full path to R, inside quotes, instead of just the file name.
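
The note above can be sketched like this. tempdir() is only a stand-in for a folder outside the working directory, and the object names out_file and orange_copy are hypothetical.

```r
# Build a full path to a file outside the working directory
out_file <- file.path(tempdir(), "orangeTreeData.csv")

# write.csv() is a shortcut for write.table() with sep = ","
write.csv(Orange, file = out_file, row.names = FALSE)

# Read the file back and check that it has the same shape as the original
orange_copy <- read.csv(out_file, header = TRUE)
dim(orange_copy)   # 35 rows and 3 columns, like Orange
```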

5.4 Data Structures (60 min)

5.4.1 Vectors

x <- c (1,2,3,4,5,6)
class (x)   # this function outputs the data type
## [1] "numeric"
y <- 10
class (y)
## [1] "numeric"
z <- "a string"
class (z)
## [1] "character"

Functions for Creating Vectors

Function   Description
c          concatenate
:          integer sequence
seq        general sequence
rep        repetitive patterns
vector     vector of given length with default value

# The results are shown in the comment next to each line

seq (1,6)   ## [1] 1 2 3 4 5 6
seq (from=100, by=1, length=5)   ## [1] 100 101 102 103 104

1:6   ## [1] 1 2 3 4 5 6
10:1   ##  [1] 10  9  8  7  6  5  4  3  2  1

rep (1:2, 3)   ## [1] 1 2 1 2 1 2

vector (mode="character", length=5)   ## [1] "" "" "" "" ""
vector (mode="logical", length=5)   ## [1] FALSE FALSE FALSE FALSE FALSE
good <- seq (from=100, by=1, length=5)
good   # print vector good
## [1] 100 101 102 103 104
bad <- (1:3)
bad   # print vector "bad"
## [1] 1 2 3

Excluding elements

good [-bad]   # exclude all elements in "bad" positions from the "good" vector 
## [1] 103 104
good [-3]   # exclude the element from position 3 of the "good" vector
## [1] 100 101 103 104
good - 3   # subtract 3 from each value of the "good" vector
## [1]  97  98  99 100 101
good - bad   # subtract from each element of "good" the value of the corresponding element in "bad". Note that the shorter vector is "recycled" (see the next section, Vectorized Arithmetic)
## Warning in good - bad: longer object length is not a multiple of shorter
## object length
## [1]  99  99  99 102 102

Vectorized Arithmetic

Most arithmetic operations in the R language are vectorized, i.e. the operation is applied element-wise. When one operand is shorter than the other, the shortest is “recycled”.

1:3 + 10:12
## [1] 11 13 15
rep (1:2, 3) ### NOTE: this is NOT recycling
## [1] 1 2 1 2 1 2
1:5 + 10:12   # this IS recycling (the shorter vector "restarts" the cycling)
## Warning in 1:5 + 10:12: longer object length is not a multiple of shorter
## object length
## [1] 11 13 15 14 16
x + y        # Remember that x = c(1, 2, 3, 4, 5, 6) and y = 10
## [1] 11 12 13 14 15 16
c(70,80) + x
## [1] 71 82 73 84 75 86

Naming indexes of a vector

Referring to an index by name rather than by position can make code more readable and flexible.

joe <- c (24, 1.70)
names (joe)              ## NULL
names (joe) <- c ("age","height")
names (joe)              ## [1] "age"    "height"
joe ["age"] == joe [1]   ## age   TRUE

Common functions applicable to vectors

To test some of these functions, we will use the “iris” dataset, one of the many datasets included by default in all R installations. To learn more about all available datasets, run the command: library(help = "datasets")

length (joe)              # length/size of the vector  
rev (joe)                 # reverse the vector
sum (joe); cumsum (joe)   # sum and cumulative sum
prod (joe); cumprod (joe) # product and cumulative product

# mean, standard deviation, variance and median 
mean (iris[,2]); sd (iris[,2]); var (iris[,2]); median (iris[,2]) 
# minimum, maximum, range and summary statistics
min (iris[,1]); max (iris[,1]); range (iris[,1]); summary (iris)
# iris[1,1:4] is a one-row data.frame; unlist() turns it into a named numeric vector
flower <- unlist (iris[1,1:4])
# exponential, logarithm
exp (flower); log (flower)
# sine, cosine and tangent (radians, not degrees)
sin (flower); cos (flower); tan (flower)

# sort, order and rank the vector
sort (flower); order (flower); rank (flower)
# select data
which (flower > 2)
which.max (flower)
# useful to be used with if conditionals
any (flower > 2)   # ask R if there are any values higher than 2
all (flower > 2)   # ask R if all values are higher than 2

Functions can be called within other function calls; the following 2 commands are equivalent:

x <- c (3, 2, 9, 4)

# Alternative 1
y <- exp (x)
z1 <- which (y > 20)
z1
## [1] 1 3 4
# Alternative 2
z2 <- which (exp (x) > 20)
z2
## [1] 1 3 4
# Check if the results are the same
all.equal (z1, z2) ; z1 == z2
## [1] TRUE
## [1] TRUE TRUE TRUE

5.4.2 Matrices

Matrices are created with the matrix function and, by default, are filled column by column:

x <- matrix (1:10, nrow=2)
x
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
dim (x)
## [1] 2 5
as.vector (x)   # revert the matrix to a one-dimensional vector
##  [1]  1  2  3  4  5  6  7  8  9 10
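
As a complement, here is a small sketch of filling a matrix row by row and of selecting elements with [row, column]. The name m_demo is made up for this example.

```r
# Fill row by row instead of the default column by column
m_demo <- matrix(1:6, nrow = 2, byrow = TRUE)
m_demo
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

m_demo[2, 3]     # element in row 2, column 3: 6
m_demo[1, ]      # the whole first row: 1 2 3
dim(t(m_demo))   # t() transposes the matrix: 3 2
```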

5.4.3 Data Frames

Data frames are the most flexible and commonly used R data structures, used to hold spreadsheet-like tables.
In a data.frame, the observations are the rows and the variables are the columns. They can be treated like matrices, except that each column is a vector of a single type, while different columns can be of different types. They are easily subset (see section 5.5).

df <- data.frame (type=rep(c("case","control"),c(2,3)),time=rnorm(5))  # rnorm draws random numbers from a normal distribution
class (df)   ## [1] "data.frame"
df
df$time      ## [1] 0.5229577 0.7732990 2.1108504 0.4792064 1.3923535   (random values: yours will differ)
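
A minimal sketch of some common data frame operations, using fixed (non-random) values so the results are reproducible. The data frame patients and its columns are invented for this example.

```r
# A small data frame with fixed values; names are illustrative
patients <- data.frame(type   = c("case", "case", "control"),
                       weight = c(70.5, 82.0, 64.3))

nrow(patients); ncol(patients)   # 3 rows, 2 columns
str(patients)                    # compact overview of each column

# Add a new column computed from an existing one
patients$weight_g <- patients$weight * 1000
colnames(patients)               # "type" "weight" "weight_g"
```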

5.4.4 Lists

Lists are very powerful data structures, consisting of ordered sets of elements that can be arbitrary R objects (vectors, strings, functions, etc.) and heterogeneous, i.e. each element can be of a different type.

lst = list (a=1:3, b="hello", c=sqrt)   # index 3 contains the function "square root"
lst
lst$c(49)
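
List elements can be reached in several ways. The sketch below uses a separate list called demo_list (a hypothetical name) to show the difference between double and single square brackets.

```r
# Illustrative list with three elements of different types
demo_list <- list(a = 1:3, b = "hello", c = sqrt)

demo_list[["a"]]        # double brackets return the element itself: 1 2 3
class(demo_list["a"])   # single brackets return a (shorter) list: "list"
demo_list[[2]]          # access by position: "hello"
names(demo_list)        # "a" "b" "c"
length(demo_list)       # 3
```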

5.4.5 Data structures can be interconverted (coerced) from one type to another:

# To check the data type:
class(lst)

# To inspect the structure of the object:
str(lst)

# "Force" the object to be of a certain type: (this is not valid code, just example)
as.matrix ()
as.numeric ()
as.data.frame ()
as.array ()
as.character ()
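
Since the calls above have no arguments, here is a runnable sketch of a few coercions. The object num_df is a made-up example.

```r
as.numeric("3.14")        # character to number: 3.14
as.character(1:3)         # numbers to strings: "1" "2" "3"
as.logical(c(0, 1, 2))    # zero is FALSE, any other number is TRUE

# Coercion can fail: text that is not a number becomes NA (with a warning)
as.numeric("abc")

# A data frame with only numeric columns can become a matrix
num_df <- data.frame(a = 1:2, b = 3:4)
is.matrix(as.matrix(num_df))   # TRUE
```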

5.5 Subsetting Data (30 min)

Subsetting is the selection of particular data points, which are of interest. There are several ways of doing this.

NOTE: Very importantly, the arguments inside the square brackets in matrices and data.frames are the [row-number, column-number]. If any of these is omitted, R assumes that all values are to be used.

# Subsetting by indices
myVec <- 1:26 ; myVec
names (myVec) <- LETTERS ; myVec
myVec [1:4]
iris [,2]   # all rows, column 2 
iris [3,]   # row 3, all columns
iris [1:3,c(2,5)]   # rows 1, 2 and 3 with columns 2 and 5

# Subsetting by logical vectors of the same length
myLog <- myVec > 10 ; myLog
myVec [myLog]

# Subsetting by field names
myVec [c("G", "I", "S", "D")]

# Subsetting by column name (in the case of data.frames)
iris$Species
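
Logical conditions can also be used to select rows of a data frame. This is a sketch with two hypothetical result names (long_setosa, setosa_only); subset() is a base R convenience function that does the same kind of selection.

```r
# Rows where both conditions are TRUE, all columns (note the trailing comma)
long_setosa <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, ]
long_setosa

# The same kind of selection with subset(), keeping only two columns
setosa_only <- subset(iris, Species == "setosa",
                      select = c(Sepal.Length, Species))
nrow(setosa_only)   # 50 (iris has 50 flowers per species)
```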

5.6 Some Great R Functions (30 min)

# unique removes duplicated entries from a vector
unique (iris$Sepal.Length)
length (unique (iris$Sepal.Length))

# table counts the occurrences of entries
table (iris$Species)
# aggregate computes statistics of data aggregates
aggregate (iris[,1:4], by=list (iris$Species), FUN=mean, na.rm=T)
# the %in% operator tests which elements of one vector are present in another; used as an index, it returns the intersection of the two vectors
month.name [month.name %in% c ("May", "July")]

# merge joins data.frames based on a common column (that functions as a "key")
df1 <- data.frame(x=1:5, y=LETTERS[1:5]) ; df1
df2 <- data.frame(x=c("Eu","Tu","Ele"), y=1:6) ; df2
merge (df1, df2, by.x=1, by.y=2, all = TRUE)
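
When both tables share a key column with the same name, merge can also be called with by="name". A minimal sketch with two invented data frames (left_df and right_df):

```r
# Two small illustrative tables sharing the key column "id"
left_df  <- data.frame(id = c(1, 2, 3), name  = c("A", "B", "C"))
right_df <- data.frame(id = c(2, 3, 4), score = c(10, 20, 30))

# Default: keep only the keys present in both tables (ids 2 and 3)
inner_demo <- merge(left_df, right_df, by = "id")
inner_demo

# all = TRUE: keep every key (ids 1 to 4) and fill the gaps with NA
outer_demo <- merge(left_df, right_df, by = "id", all = TRUE)
outer_demo
```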

5.7 Loops and Conditionals in R (30 min)

5.7.1 for() and while() loops

R allows the implementation of loops, i.e. replicating instructions in an iterative way (also called cycles). The most common ones are for loops and while loops.

# creating a for loop to square all the elements of a dataset, which has all odd numbers in the range of 1 to 20

my.x = seq (1, 20, by=2) ; my.x
##  [1]  1  3  5  7  9 11 13 15 17 19
my.x.squared = NULL
for (n in 1:10)
{ 
  my.x.squared [n] = my.x [n]^2 
}
print (my.x.squared)
##  [1]   1   9  25  49  81 121 169 225 289 361
# while loops will execute a block of commands until the condition is no longer satisfied
x <- 1 ; x
## [1] 1
while (x < 5)
{
  x <- x+1
  print(x)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
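
Because most R operations are vectorized, many for loops can be written without a loop at all. The sketch below repeats the squaring example in both styles, using new hypothetical names so the objects above are left untouched.

```r
odd_demo <- seq(1, 20, by = 2)

# Loop version: seq_along() gives the positions 1, 2, ..., length(odd_demo)
squares_loop <- numeric(length(odd_demo))   # pre-allocate the result
for (n in seq_along(odd_demo)) {
  squares_loop[n] <- odd_demo[n]^2
}

# Vectorized version: the power is applied to every element at once
squares_vec <- odd_demo^2

identical(squares_loop, squares_vec)   # TRUE
```

Pre-allocating the result and using seq_along() instead of 1:10 keeps the loop correct even if the length of the input changes.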

5.7.2 Conditionals: if() statements

Conditionals allow the running of commands only if certain conditions are true.

x <- -5 ; x
## [1] -5
if(x >= 0){
   print("Non-negative number")
} else {
   print("Negative number")
}
## [1] "Negative number"
# coupled with a for loop
x <- c(-5:5) ; x
##  [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
for (i in 1:length(x)) {
  if (x[i] > 0) {
     print(x[i])
  } else {
   print ("is not a positive number")   # zero also ends up here
  }
}
## [1] "is not a positive number"
## [1] "is not a positive number"
## [1] "is not a positive number"
## [1] "is not a positive number"
## [1] "is not a positive number"
## [1] "is not a positive number"
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
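
Two common variations, shown as a sketch with a made-up vector values_demo: the vectorized ifelse() function, which tests all elements at once, and an else if chain that treats zero as a separate case.

```r
values_demo <- -2:2

# ifelse() tests every element and returns a vector of the same length
labels_demo <- ifelse(values_demo > 0, "positive", "zero or negative")
labels_demo

# Chained else if, handling zero as its own case
for (v in values_demo) {
  if (v > 0) {
    print("positive")
  } else if (v < 0) {
    print("negative")
  } else {
    print("zero")
  }
}
```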