A quick introduction to R

What is R?

R is a simple and versatile programming language. You can use R for statistical data analysis, training of neural networks, beautiful visualizations or even for mathematical simulations. Moreover, R is the leading language for the analysis of biological data and the essential skill of almost any bioinformatician.

Installing R

If R isn’t installed on your PC then you can download it from CRAN: https://cran.r-project.org/.

In addition to R you will need to install RStudio. Please go to https://rstudio.com/products/rstudio/ and click on the “RStudio Desktop” button. After that click on the version recommended for your system and download it.

Instead of RStudio, you can enable the R programming language in Jupyter Notebook. To do this, you must have a pre-installed Jupiter Notebook (for example, from Anaconda distribution) and R on your PC. Further installation instructions are available here: https://irkernel.github.io/installation/

Installing packages

In addition to the Base R software, you can use additional code and data sets written by other people. To do this, you need to download and install the appropriate R packages. It’s easy to do with install.packages() function. For example, to install the ggplot2 package from CRAN type the following command:

install.packages("ggplot2")

Some specific bioinformatics packages are not available for download from CRAN. However, you can install them from Bioconductor. You can get the latest version of Bioconductor (3.10 for 03.20.2020) by entering the commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.10")

After that you can install specific packages using BiocManager::install():

BiocManager::install(c("DESeq2", "AnnotationDbi"))

Finally, load selected packages:

library(DESeq2)
library(AnnotationDbi)

The R syntax

Let’s get started with R syntax! First, we add two numbers in the console and output the result:

5 + 10

## [1] 15

If you need to leave a comment in the code, then use #:

# I'm adding two numbers! 
5 + 10

## [1] 15

We can assign values to variables using the assignment operator <-:

a <- 10
b <- a + 1
print(b)

## [1] 11

However, in most contexts, the = operator can also be used as an alternative:

a = 10
b = a + 1
print(b)

## [1] 11

Variables can have specific types within R: numeric, character, integer, logical, complex, raw.

a <- 50.2           # numeric
b <- 'character'    # character
c <- 2L             # integer
d <- TRUE           # logical

Data Structures

R deal with named data structures. The simplest such structure is vector. Let’s create two vectors, each of which will contain numeric or character values (not both!):

v1 <- c(1, 2, 7, 4, 10)
v2 <- c('Omics', 'Data', 'Analysis')

Matrices are multi-dimensional generalizations of vectors. There are several ways to create your own matrix in R. The first one is using matrix() function:

m <- matrix(data = 1:9, nrow = 3, ncol = 3)
m

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Alternatively, you can use cbind() function that generates matrices by combining several vectors of the same length:

x <- 1:3
y <- 4:6
z <- 7:9
m <- cbind(x, y, z)
m

##      x y z
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9

R dataframe is more flexible than matrix because it can store both string vectors and numeric vectors within the same object. To create a dataframe from vectors use the data.frame() function:

m <- data.frame(index = 1:3, 
                factors = c('Omics', 'Data', 'Analysis'),
                values = c(20, 10.5, 0))
m

##   index  factors values
## 1     1    Omics   20.0
## 2     2     Data   10.5
## 3     3 Analysis    0.0

Also, you can always upload your dataset. You can do it with read.table() function:

d <- read.table("SampleFile.csv", sep = ',', header=T, row.names=1)

Basic statistical functions

An example of square root calculation:

sqrt(64)

## [1] 8

Mean and median of the numbers in vector x:

x <- c(1, 2, 6, 84, 4, 5)
mn <- mean(x)
md <- median(x)
cat("Mean:", mn ,"\nMedian:", md)

## Mean: 17 
## Median: 4.5

Variance of the population and standard distance:

v <- var(x)
s <- sd(x)
cat("Variance:", v ,"\nStandard distance:", s)

## Variance: 1080.8 
## Standard distance: 32.87552

The correlation coefficient between the numbers in x and the numbers in y:

x <- c(1, 2, 6, 20, 4, 5)
y <- c(12, 8, 10, 33, 0, 2)
pearson <- cor(x, y, method = 'pearson')
spearman <- cor(x, y, method = 'spearman')
cat("Pearson correlation coefficient:", pearson ,"\nSpearman correlation coefficient:", spearman)

## Pearson correlation coefficient: 0.8445419 
## Spearman correlation coefficient: 0.2571429

Standard score (z-score) for the values in vector:

x <- c(1, 2, 6, 20, 4, 5)
scale(x)

##             [,1]
## [1,] -0.76767089
## [2,] -0.62373260
## [3,] -0.04797943
## [4,]  1.96715666
## [5,] -0.33585602
## [6,] -0.19191772
## attr(,"scaled:center")
## [1] 6.333333
## attr(,"scaled:scale")
## [1] 6.947422

T-test used to determine whether the means of two groups are equal to each other:

x = rnorm(50)
y = rnorm(50)
t.test(x,y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -0.79627, df = 95.672, p-value = 0.4278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5566100  0.2379056
## sample estimates:
##  mean of x  mean of y 
## 0.01319068 0.17254288

Let’s perform One-way ANOVA test. First of all, create a example dataframe:

df <- data.frame(Age = sample.int(100, 12), 
                 Group = rep(c('WT1', 'WT2', 'KO1', 'KO2'), each=3))
df

##    Age Group
## 1   24   WT1
## 2   22   WT1
## 3   54   WT1
## 4   51   WT2
## 5   35   WT2
## 6   11   WT2
## 7   20   KO1
## 8    6   KO1
## 9    4   KO1
## 10  23   KO2
## 11  94   KO2
## 12  34   KO2

Next compute the analysis of variance:

res <- aov(Age ~ Group, data = df)
summary(res)

##             Df Sum Sq Mean Sq F value Pr(>F)
## Group        3   2463   821.0   1.451  0.299
## Residuals    8   4526   565.8

Conditionals

The simplest form of decision controlling statement for conditional execution is the if() statement. Common operators used in this type of statements:

equals: ==
not equal: !=
less than/greater than: < or >
less than or equal/greater than or equal: <= or >=

a <- 3
if(a == 3){
  print('Yes, it\'s true!')
}

## [1] "Yes, it's true!"

The if/else statement may be implemented as

a <- 3
b <- 5
if(a > 6 && b != -10){ 
  print('Yes, it\'s true!') 
} else {
  print('Ups')
}

## [1] "Ups"

Loops

There are three different types of loops in R:

for (loop over a fixed number of items)
while (execute loop while a condition is TRUE)
repeat (until we “manually” break execution)

The most frequently used type of loops is the for loop:

items <- c(2,1,8,3,7)
for (i in items){
  print(i)
}

## [1] 2
## [1] 1
## [1] 8
## [1] 3
## [1] 7

We also can rewrite this loop as:

items <- c(2,1,8,3,7)
for (i in 1:length(items)){
  print(items[i])
}

## [1] 2
## [1] 1
## [1] 8
## [1] 3
## [1] 7

Another loop constructor is while:

x <- 0
while (x*2 < 10) {
  print(x)     
  x <- x + 1
}

## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Finally, repeat loop:

x <- 0
repeat {
  if (x*2 > 10) break
  print(x)
  x <- x + 1
}

## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Graphics system in R

One of the most powerful functions of R is it’s ability to produce a wide range of publication quality graphics.

The data frame below represents some random values for Sample 1-4:

x <- matrix(rexp(24, rate=.1), ncol=4, 
            dimnames = list(c("A", "B", "C", "D", "E", "F"),
                            c("Sample1", "Sample2", "Sample3", "Sample4")))
x

##     Sample1    Sample2   Sample3  Sample4
## A 33.082769 17.6335099  1.537352 6.881575
## B  8.261918 18.5985016  6.946848 3.656953
## C 14.559802  1.1750146  2.662541 7.423239
## D  9.263284  0.9041282  9.446722 7.724385
## E 20.503731 14.2910573  5.180612 4.477749
## F  2.076536  8.7899742 24.955601 5.183823

Basic line and scatter plots are build using the plot() command. Here is a simple example of scatter plot:

plot(x[,1], x[,2], col = 'red', xlab="Sample1", ylab="Sample2")

By adding the type = l inside the plot() function we select line type for plot:

x1 = seq(from = -3*pi, to = 3*pi, length.out = 200)
y1 = 1.5*sin(x1)+2
plot(x1, y1, type = "l")

Histograms are created with the hist() function. By setting the argument freq = FALSE the plot returns probability densities instead of frequencies:

hist(x, freq = FALSE)

A kernel density plot may be more informative than a simple histogram, because it doesn’t depend on the number of bins used. In R we can build a density plot by plotting the resulting object of the density() function.

plot(density(x), col = 'red')

The boxplot() function is used to create boxplots:

boxplot(x)

We want to add colors to existing boxplots. The package RColorBrewer will help us to do this:

library(RColorBrewer)
boxplot(x, col = RColorBrewer::brewer.pal(ncol(x), 'Set1'))

Data Vizualization with `ggplot2`

Let’s learn how to make plots from data in a data frame via ggplot2. It’s a high-level graphics system, based on the grammar of graphics from Leland Wilkinson.

ggplot2() function accepts two arguments:

data set to be plotted
aesthetic mappings provided by aes function

Additional parameters such as geometric objects (geom_* functions) are passed on by appending them with + as separator.

We’ll use the standard dataset mpg that contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. Output first 6 rows in the dataset:

library(ggplot2)
head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(… f        18    29 p     comp…
## 2 audi         a4      1.8  1999     4 manua… f        21    29 p     comp…
## 3 audi         a4      2    2008     4 manua… f        20    31 p     comp…
## 4 audi         a4      2    2008     4 auto(… f        21    30 p     comp…
## 5 audi         a4      2.8  1999     6 auto(… f        16    26 p     comp…
## 6 audi         a4      2.8  1999     6 manua… f        18    26 p     comp…

Let’s create scatterplot using geom_point(). Inside ggplot() function we specify data from mpg package and call aes() function. The aes() function indicates which variables in the data should be plotted.

ggplot(mpg, aes(x = displ, y = hwy, colour = class)) + 
  geom_point()

Similarly, we plot boxplot

ggplot(mpg, aes(x = manufacturer, y = hwy, colour = class)) + 
  geom_boxplot()

and histogram. In order to plot all car types in the same plot, we add a fill = class argument to aes:

ggplot(mpg, aes(x = hwy, color = class, fill = class)) + 
  geom_histogram(bins = 40)

Now we will create bar charts using geom_bar():

ggplot(mpg, aes(class))+
  geom_bar(aes(fill = drv))

The easiest way to save a plot is using ggsave() function:

ggsave('Barplot.png', height = 7, width = 7.5, dpi = 200)

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a classic method of reducing the dimensionality of data. In environment R, the calculation of principal components can be performed using the functions
a) princomp() (stats package)
b) prcomp() (stats package)
c) PCA() (FactoMineR package)
d) dudi.pca() (ade4 package)
e) acp() (amap package)

We will choose USArrests as multidimensional data. This dataframe contains information about violent crime rates by US State.

head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

PCA with function prcomp():

p = prcomp(USArrests, scale = TRUE)
summary(p)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

PCA plot:

percentVar <- p$sdev^2/sum(p$sdev^2)

ggplot(data = as.data.frame(p$x), aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.8, size = 2) +
  geom_label(aes(label = rownames(p$x)), label.size = 0.1)+
  xlab(paste0("PC1: ", round(percentVar[1] * 100), "% variance")) + 
  ylab(paste0("PC2: ", round(percentVar[2] * 100), "% variance")) +
  ggtitle("PCA plot of Crime Rates by US State")

A quick introduction to R

Dmitrii Smirnov, Anastasia Golova, and Ekaterina Khrameeva

03/31/2020

What is R?

Installing R

Installing packages

The R syntax

Data Structures

Basic statistical functions

Conditionals

Loops

Graphics system in R

Data Vizualization with `ggplot2`

Principal Component Analysis (PCA)

References

A quick introduction to R

Dmitrii Smirnov, Anastasia Golova, and Ekaterina Khrameeva

03/31/2020

What is R?

Installing R

Installing packages

The R syntax

Data Structures

Basic statistical functions

Conditionals

Loops

Graphics system in R

Data Vizualization with ggplot2

Principal Component Analysis (PCA)

References

Data Vizualization with `ggplot2`