0. How to Use RStudio

0.1. RStudio Interface

Opening RStudio will automatically launch R software. The platform interface looks as follows:
Your RStudio first screen

Roughly, we can divide the working window into three areas:

  • Left area: includes the tabs Console, Terminal, and Background Jobs

  • Top-right area: includes the tabs Environment, History, Connections, and Tutorial

  • Bottom-right area: includes the tabs Files, Plots, Packages, Help, Viewer, and Presentation.

Let’s take a closer look at the essential tabs.

0.1.1. Console

On this tab, we first see the information about the R version in use and also some basic commands to try. At the end of those descriptions, we can type our R code, press Enter, and get the result below the code line (e.g., try running 2*2 and see what happens). Virtually, we can do here anything we would do in any other R program, for example:

  • Installing and loading R packages
  • Performing simple or complex mathematical operations
  • Assigning the result of an operation to a variable
  • Import data
  • Creating common types of R objects, like vectors, matrices and dataframes
  • Exploring data
  • Statistical analysis
  • Building data visualizations

However, when we run our code directly in the console, it isn’t saved for being reproduced further. If we need (and we usually do) to write a reproducible code to solve a specific task, we have to record and regularly save it in a script file rather than in a console.

We’ll explore how to write scripts. For now, let’s keep in mind that you should mostly use the console to test the code and install R packages since they only need to be installed once.

0.1.1.2. Environment

Whenever we define a new or re-assign an existing variable in RStudio, it’s stored as an object in the workspace and gets displayed, together with its value, on the Environment tab in the top-right area of the RStudio window. Try running greeting <- "Hello, World!" in the console and see what happens on the Environment tab.

0.1.1.3. Other important tabs

  • Terminal - to run commands from the terminal
  • History - to track the history of all the operations performed during the current RStudio session
  • Files - to see the structure of the working folder, reset the working folder, navigate between folders, etc.
  • Plots - to preview and export created data visualizations
  • Packages - to check what packages were loaded and load or unload packages (by switching on/off the box to the left of a package name)

0.2. Creating an R File.

When creating an R File you can make an R-script (typical R document) or make an Rmarkdown, where you can add chunks with scripts and exporting the file as a pdf, html or word file.

To start recording a script, click File - New File - R Script. This will open a text editor in the top-left corner of the RStudio interface (above the Console tab)

In a script, we can do all the things we listed in the section on the console, only that now our actions will be stored in a file for further usage or sharing.

To run a single line of code from a script, put the cursor on that line and click the Run icon on the top-right of the text editor. Otherwise, use a keyboard shortcut (Ctrl + Enter in Windows/Linux, Cmd + Enter in Mac). To run multiple code lines, do the same after selecting the necessary lines. To run all code lines, select all the lines and click the Run icon OR use a keyboard shortcut (Ctrl + A + Enter in Windows/Linux, Cmd + A + Enter in Mac).

When we write a script, it makes sense to add code comments where necessary (using a hashtag symbol # followed by a line of a comment text) to explain to a potential future reader the why behind certain pieces of code.

1. Installing packages

R packages are add-ons to base R that help you achieve additional tasks that are not directly supported by base R. It is by the action of these extra functionality that R excels as a tool for computational genomics. The Bioconductor project (http://bioconductor.org/) is a dedicated package repository for computational biology-related packages. However main package repository of R, called CRAN, also has computational biology related packages. In addition, R-Forge (http://r-forge.r-project.org/), GitHub (https://github.com/), and Bitbucket (http://www.bitbucket.org) are some of the other locations where R packages might be hosted. The packages needed for the code snippets in this book and how to install them are explained in the Packages needed to run the book code section in the Preface of the book.

You can install CRAN packages using install.packages() (# is the comment character in R).

# install package named "ggplot2" from CRAN

install.packages("ggplot2")

You can install bioconductor packages with a specific installer script.

# get the installer package if you don't have
install.packages("BiocManager")

# install bioconductor package "rtracklayer"
BiocManager::install("rtracklayer")

You can install packages from GitHub using the install_github() function from devtools package.

library(devtools)
install_github("hadley/stringr")

Another way to install packages is from the source.

# download the source file
download.file(
"https://github.com/al2na/methylKit/releases/download/v0.99.2/methylKit_0.99.2.tar.gz",
               destfile="methylKit_0.99.2.tar.gz")
# install the package from the source file
install.packages("methylKit_0.99.2.tar.gz",
                 repos=NULL,type="source")
# delete the source file
unlink("methylKit_0.99.2.tar.gz")

You can also update CRAN and Bioconductor packages.

# updating CRAN packages
update.packages()

# updating bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

2. Getting help on functions and packages

You can get help on functions by using help() and help.search() functions. You can list the functions in a package with the ls() function

library(MASS)
ls("package:MASS") # functions in the package
ls() # objects in your R enviroment
# get help on hist() function
?hist
help("hist")
# search the word "hist" in help pages
help.search("hist")
??hist

3. Computations in R

R can be used as an ordinary calculator, and some say it is an over-grown calculator. Here are some examples. Remember that # is the comment character. The comments give details about the operations in case they are not clear.

2 + 3 * 5       # Note the order of operations.
log(10)        # Natural logarithm with base e
5^2            # 5 raised to the second power
3/2            # Division
sqrt(16)      # Square root
abs(3-7)      # Absolute value of 3-7
pi             # The number
exp(2)        # exponential function
# This is a comment line

4. Data structures

4.1 Vectors

Vectors are one of the core R data structures. It is basically a list of elements of the same type (numeric, character or logical). Later you will see that every column of a table will be represented as a vector. R handles vectors easily and intuitively. You can create vectors with the c() function, however that is not the only way. The operations on vectors will propagate to all the elements of the vectors.

x<-c(1,3,2,10,5)    #create a vector named x with 5 components
x = c(1,3,2,10,5)  
x
## [1]  1  3  2 10  5
y<-1:5              #create a vector of consecutive integers y
y+2                 #scalar addition
## [1] 3 4 5 6 7
2*y                 #scalar multiplication
## [1]  2  4  6  8 10
y                   #y itself has not been unchanged
## [1] 1 2 3 4 5

The standard assignment operator in R is <-. This operator is preferentially used in books and documentation. However, it is also possible to use the = operator for the assignment. We have an example in the above code snippet and throughout the book we use <- and = interchangeably for assignment.

4.2 Matricies

A matrix refers to a numeric array of rows and columns. You can think of it as a stacked version of vectors where each row or column is a vector. One of the easiest ways to create a matrix is to combine vectors of equal length using cbind(), meaning ‘column bind’.

x<-c(1,2,3,4)
y<-c(4,5,6,7)
m1<-cbind(x,y);m1
##      x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [4,] 4 7
dim(m1)              # 2 by 5 matrix
## [1] 4 2

4.3 Data frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). A data frame can be constructed by the data.frame() function. For example, we illustrate how to construct a data frame from genomic intervals or coordinates.

chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-","-","+","+")
start<- c(200,4000,100,400)
end<-c(250,410,200,450)
mydata <- data.frame(chr,start,end,strand)
#change column names
names(mydata) <- c("chr","start","end","strand")
mydata # OR this will work too
##    chr start end strand
## 1 chr1   200 250      -
## 2 chr1  4000 410      -
## 3 chr2   100 200      +
## 4 chr2   400 450      +

There are a variety of ways to extract the elements of a data frame. You can extract certain columns using column numbers or names, or you can extract certain rows by using row numbers. You can also extract data using logical arguments, such as extracting all rows that have a value in a column larger than your threshold.

mydata[,2:4] # columns 2,3,4 of data frame
##   start end strand
## 1   200 250      -
## 2  4000 410      -
## 3   100 200      +
## 4   400 450      +
mydata[,c("chr","start")] # columns chr and start from data frame
##    chr start
## 1 chr1   200
## 2 chr1  4000
## 3 chr2   100
## 4 chr2   400
mydata$start # variable start in the data frame
## [1]  200 4000  100  400
mydata[mydata$start>400,] # get all rows where start>400
##    chr start end strand
## 2 chr1  4000 410      -

4.4 Lists

A list in R is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name. You can create a list with the list() function. Each object or element in the list has a numbered position and can have names. Below we show a few examples of how to create lists.

# example of a list with 4 components
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred",
       mynumbers=c(1,2,3),
       mymatrix=matrix(1:4,ncol=2),
       age=5.3)
w
## $name
## [1] "Fred"
## 
## $mynumbers
## [1] 1 2 3
## 
## $mymatrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $age
## [1] 5.3

4.5 Factors

Factors are used to store categorical data. They are important for statistical modeling since categorical variables are treated differently in statistical models than continuous variables. This ensures categorical data treated accordingly in statistical models.

features=c("promoter","exon","intron")
f.feat=factor(features)

An important thing to note is that when you are reading a data frame with read.table() or creating a data frame with data.frame() function, the character columns are stored as factors by default, to change this behavior you need to set stringsAsFactors=FALSE in read.table() and/or data.frame() function arguments.

5. Data types

There are four common data types in R, they are numeric, logical, character and integer. All these data types can be used to create vectors natively.

#create a numeric vector x with 5 components
x<-c(1,3,2,10,5)
x
## [1]  1  3  2 10  5
#create a logical vector x
x<-c(TRUE,FALSE,TRUE)
x
## [1]  TRUE FALSE  TRUE
# create a character vector
x<-c("sds","sd","as")
x
## [1] "sds" "sd"  "as"
class(x)
## [1] "character"
x<-c(1L,2L,3L)
x
## [1] 1 2 3
class(x)
## [1] "integer"

6.Reading and writing data

world_population <- read.csv("world_population.csv")

(Ro run the above piece of code, first download the publicly available World Population Dataset from Kaggle and unzip it into the same folder where you store your R script.)

The result of running the above piece of code will be an R dataframe in your working folder.

In RStudio:

  • File - Import Dataset

OR

  • Click Import Dataset on the Environmnet tab

Also, in R, you can easily read tabular format data with the read.table() function. You can find a data set on different formats, some of the most typical are the following ones:

  • excel file (.xlsx)

  • csv file (.csv)

  • txt file (.txt).

There are different functions that can be used for opening the data frame file, like read.csv or read_xlsx. In order to save or export a dataset from rstudio, you can easly do it through the function write.csv() (.csv file) or write.table()

Tips: You can also import your dataset through the files window. First you should select your working directory with the function setwd() or with the option More from the Files window.
To know where is your working directory, you can use the function getwd(), and you’ll know where your working directory is located.

7. Descriptive Analysis functions

Once you’ve import the dataset and you have all your data uploaded on the Rstudio environment is important to make a descriptive analysis of the data. Some of the most important functions for the descriptive analysis are the following ones:

  • head() tail(): Returns the first or last parts of a vector, matrix, table, data frame or function.

  • str(): Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary

  • summary(): is a generic function used to produce result summaries of the results of various model fitting functions.

  • min() max(): Returns the (regular or parallel) maxima and minima of the input values.

  • range(): returns a vector containing the minimum and maximum of all the given arguments.

  • names(): Functions to get or set the names of an object.

  • mean(): Generic function for the (trimmed) arithmetic mean.

  • median(): Compute the sample median.

  • sd(): This function computes the standard deviation of the values in x

  • var() cov() cor(): compute the variance of x and the covariance or correlation of x and y if these are vectors.

  • lapply(): returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

8. Example

We’ll be using the dataset iris throughout this example. This dataset is imported by defaut in R, you only need to load it by running iris:

dat <- iris # load the iris dataset and renamed it dat

Below a preview of this dataset and its structure:

head(dat) # first 6 observations
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(dat) # structure of the dataset
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The dataset contains 150 observations and 5 variables, representing the length and width of the sepal and petal and the species of 150 flowers. Length and width of the sepal and petal are numeric variables and the species is a factor with 3 levels (indicated by num and Factor w/ 3 levels after the name of the variables).

8.1 Minimum and maximum

Minimum and maximum can be found thanks to the min() and max()functions:

min(dat$Sepal.Length) # The $ symbol is used to indicate which column we desire to calculate the minimum value from the dataset dat
## [1] 4.3
max(dat$Sepal.Length)
## [1] 7.9

Alternatively the range() function:

rng <- range(dat$Sepal.Length)
rng
## [1] 4.3 7.9

gives you the minimum and maximum directly. Note that the output of the range() function is actually an object containing the minimum and maximum (in that order). This means you can actually access the minimum with:

rng[1] # rng = name of the object specified above
## [1] 4.3

This reminds us that, in R, there are often several ways to arrive at the same result. The method that uses the shortest piece of code is usually preferred as a shorter piece of code is less prone to coding errors and more readable.

8.2 Range

The range can then be easily computed, as you have guessed, by subtracting the minimum from the maximum:

max(dat$Sepal.Length) - min(dat$Sepal.Length)
## [1] 3.6

8.3 Mean

The mean can be computed with the mean() function:

mean(dat$Sepal.Length)
## [1] 5.843333
  • if there is at least one missing value in your dataset, use mean(dat\$Sepal.Length, na.rm = TRUE) to compute the mean with the NA excluded. This argument can be used for most functions presented in this article, not only the mean
  • for a truncated mean, use mean(dat\$Sepal.Length, trim = 0.10) and change the trim argument to your needs

8.4 Median

The median can be computed thanks to the median() function:

median(dat$Sepal.Length)
## [1] 5.8

8.5 Standard deviation and variance

The standard deviation and the variance is computed with the sd() and var() functions:

sd(dat$Sepal.Length) # standard deviation
## [1] 0.8280661
var(dat$Sepal.Length) # variance
## [1] 0.6856935

to compute the standard deviation (or variance) of multiple variables at the same time, use lapply() with the appropriate statistics as second argument:

lapply(dat[, 1:4], sd)
## $Sepal.Length
## [1] 0.8280661
## 
## $Sepal.Width
## [1] 0.4358663
## 
## $Petal.Length
## [1] 1.765298
## 
## $Petal.Width
## [1] 0.7622377

The command dat[, 1:4] selects the variables 1 to 4 as the fifth variable is a qualitative variable and the standard deviation cannot be computed on such type of variable.

8.5 Summary

You can compute the minimum, 1st quartile, median, mean, 3rd quartile and the maximum for all numeric variables of a dataset at once using summary():

summary(dat)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

If you need these descriptive statistics by group use the by() function:

by(dat, dat$Species, summary)
## dat$Species: setosa
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
##                 
## ------------------------------------------------------------ 
## dat$Species: versicolor
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
## ------------------------------------------------------------ 
## dat$Species: virginica
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    : 0  
##  versicolor: 0  
##  virginica :50  
##                 
##                 
## 

9. More things

9.1 Plots

hist(dat$Sepal.Length)

plot(dat$Species, dat$Sepal.Length,
     main = "Sepal Length based on the Species",
     xlab = "Species", ylab = "Sepal Length (cm)",
     col = "red")

plot(dat$Sepal.Length, dat$Sepal.Width,
     main = "Sepal Length Vs Sepal Width" )

barplot(prop.table(table(dat$Species)),
        col=c("orange","blue","red"),
        legend.text =c("Setosa","versicolor","virginica"),
        ylim=c(0,0.5))

9.2 Statistics tests

shapiro.test(dat$Sepal.Length)
## 
##  Shapiro-Wilk normality test
## 
## data:  dat$Sepal.Length
## W = 0.97609, p-value = 0.01018
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.2
qqPlot(dat$Sepal.Length)

## [1] 132 118
lm.iris1 <- lm(Sepal.Length~Species, data=iris) ##makes a 'linear model' object which we will use for the next line of code and the ANOVA code
leveneTest(lm.iris1)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  6.3527 0.002259 **
##       147                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm.iris1)
## Analysis of Variance Table
## 
## Response: Sepal.Length
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Species     2 63.212  31.606  119.26 < 2.2e-16 ***
## Residuals 147 38.956   0.265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.iris1)
## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
iris.aov <- aov(Sepal.Length~Species, data = iris)
TukeyHSD(iris.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
## 
## $Species
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0
iris.aov2 <- aov(Sepal.Width~Species, dat = iris)
TukeyHSD(iris.aov2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Width ~ Species, data = iris)
## 
## $Species
##                        diff         lwr        upr     p adj
## versicolor-setosa    -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa     -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor  0.204  0.04314472  0.3648553 0.0087802
plot(dat$Species, dat$Sepal.Width,
     main = "Sepal Width based on the Species",
     xlab = "Species", ylab = "Sepal Width (cm)",
     col = "blue")