Bioinformatics-Intro

R, python, and bash scripting are the three main pillars of bioinformatics. Here, we are starting with R and use of Rstudio. Please, follow the instructions below and install the required software to your own PC.

The main goal of this tutorial is to give you a flavor of how simple analysis are conducted in R environment. The first part will involve practicing basic terms and second will cover a data analytics for single cell RNA sequencing experiment.

Let’s start…

Installing R & Rstudio on your laptops

Download and install R

Download and install R (choose the version appropriate for your Operating System: https://cran.rstudio.com/)

Download and install Rstudio

Download and install Rstudio (choose the version appropriate for your Operating System: https://www.rstudio.com/products/rstudio/download/#download)

You can start Rstudio either from terminal or among installed softwares.

Install required R packages

These are a few R packages we will use along the tutorial:

# install.packages()
install.packages('tidyverse')
library(tidyverse)

# Enter commands in R (or R studio, if installed)
install.packages('Seurat')
library(Seurat)

Here is a link to get further help for installing Seurat package in case needed: https://satijalab.org/seurat/articles/install.html

Download example dataset

Our lab has generated single-cell RNA sequecing datasets using 10X Genomics Chromium platform from Hippocampus and Striatum of Rattus norvegicus brain.

Please, download these two example dataset through the link: https://drive.google.com/drive/folders/18DSiCHv-WM_48iutYdGJiycJBKnIzDuK?usp=sharing

Warm up exercise

Let’s begin

 1 + 2

## [1] 3

Variables:

x <- 1
x

## [1] 1

Functions

Functions take arguments in paranthesis and process the output. Here the function ‘c’ returns the values as a vector.

c(1, 2, 3)

## [1] 1 2 3

Comments

Everything that follows a ‘#’ is considered as command.

# This line is a command
1 + 4

## [1] 5

Logical operations

a <- 2
b <- 9.5
a > b

## [1] FALSE

z <- a > b
z

## [1] FALSE

class(a)

## [1] "numeric"

class(b)

## [1] "numeric"

class(z)

## [1] "logical"

Reverse the logic:

!z

## [1] TRUE

Character

x <- as.character(3.14) 
x

## [1] "3.14"

fname = "John"; lname ="Doe";
paste(fname, lname)

## [1] "John Doe"

?paste

Vectors

my_vec1 <- c("aa", "bb", "cc", "dd", "ee")
my_vec2 <- c(1, 2, 3, 4, 5)
length(my_vec2)

## [1] 5

#Combining vectors:
c(my_vec1, my_vec2)

##  [1] "aa" "bb" "cc" "dd" "ee" "1"  "2"  "3"  "4"  "5"

Arithmetics

my_vec2

## [1] 1 2 3 4 5

2 * my_vec2 # Mutliply by 2

## [1]  2  4  6  8 10

my_vec2 - 2

## [1] -1  0  1  2  3

my_vec3 <- c(10, 20, 30, 40, 50)

my_vec2 + my_vec3

## [1] 11 22 33 44 55

3 * (my_vec2 + my_vec3) / 5

## [1]  6.6 13.2 19.8 26.4 33.0

Vector indexing

Indexing starts with 1 unlike python.

s <- c("aa", "bb", "cc", "dd", "ee") 
s[3]

## [1] "cc"

s[3:5]

## [1] "cc" "dd" "ee"

s[-4] #Now 'dd' is gone!

## [1] "aa" "bb" "cc" "ee"

s[c(2, 3, 3)]

## [1] "bb" "cc" "cc"

s[c(2, 1, 3)]

## [1] "bb" "aa" "cc"

s[c(FALSE, TRUE, FALSE, TRUE, FALSE)] #Logical indexing

## [1] "bb" "dd"

Naming the vector members:

v <- c("Mary", "Sue")
v

## [1] "Mary" "Sue"

names(v) <- c("First", "Last")
v

##  First   Last 
## "Mary"  "Sue"

v["First"]

##  First 
## "Mary"

Matrix

“A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.”

A <- matrix( 
   c(2, 4, 3, 1, 5, 7), # the data elements 
   nrow=2,              # number of rows 
   ncol=3,              # number of columns 
   byrow = TRUE)        # fill matrix by rows 

A

##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7

A[2, 3] # A[row, column]

## [1] 7

A[2, ] # The entire 2nd row

## [1] 1 5 7

A[, 3] # The entire 3rd column

## [1] 3 7

A[, c(1,3)]  # the 1st and 3rd columns

##      [,1] [,2]
## [1,]    2    3
## [2,]    1    7

Transpose the matrix

B <- matrix( 
   c(2, 4, 3, 1, 5, 7), 
   nrow=3, 
   ncol=2)
B

##      [,1] [,2]
## [1,]    2    1
## [2,]    4    5
## [3,]    3    7

t(B)

##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7

Combine matrices

cbind(A, t(B)) # combine row-wise

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    4    3    2    4    3
## [2,]    1    5    7    1    5    7

rbind(A, t(B)) # combine column-wise

##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7
## [3,]    2    4    3
## [4,]    1    5    7

Data frames

“A data frame is used for storing data tables. It is a list of vectors of equal length.”

n <- c(2, 3, 5) 
s <- c("aa", "bb", "cc") 
b <- c(TRUE, FALSE, TRUE) 
df <- data.frame(n, s, b)       # df is a data frame
df

##   n  s     b
## 1 2 aa  TRUE
## 2 3 bb FALSE
## 3 5 cc  TRUE

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

mtcars[10, 4]

## [1] 123

mtcars['Merc 280', 'hp']

## [1] 123

mtcars["Merc 280", ]

##           mpg cyl  disp  hp drat   wt qsec vs am gear carb
## Merc 280 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4

mtcars[["hp"]]

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
## [20]  65  97 150 150 245 175  66  91 113 264 175 335 109

mtcars$hp

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
## [20]  65  97 150 150 245 175  66  91 113 264 175 335 109

nrow(mtcars)

## [1] 32

ncol(mtcars)

## [1] 11

mtcars[, c("mpg", "hp")]

##                      mpg  hp
## Mazda RX4           21.0 110
## Mazda RX4 Wag       21.0 110
## Datsun 710          22.8  93
## Hornet 4 Drive      21.4 110
## Hornet Sportabout   18.7 175
## Valiant             18.1 105
## Duster 360          14.3 245
## Merc 240D           24.4  62
## Merc 230            22.8  95
## Merc 280            19.2 123
## Merc 280C           17.8 123
## Merc 450SE          16.4 180
## Merc 450SL          17.3 180
## Merc 450SLC         15.2 180
## Cadillac Fleetwood  10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial   14.7 230
## Fiat 128            32.4  66
## Honda Civic         30.4  52
## Toyota Corolla      33.9  65
## Toyota Corona       21.5  97
## Dodge Challenger    15.5 150
## AMC Javelin         15.2 150
## Camaro Z28          13.3 245
## Pontiac Firebird    19.2 175
## Fiat X1-9           27.3  66
## Porsche 914-2       26.0  91
## Lotus Europa        30.4 113
## Ford Pantera L      15.8 264
## Ferrari Dino        19.7 175
## Maserati Bora       15.0 335
## Volvo 142E          21.4 109

Importing files:

getwd()
setwd("your target directory")

counts <- read.table(file = "SRR1039512.counts", sep = "\t", row.names = 1)
head(counts)
#Alternatives: read.csv(), read.delim(), readxl::read_xls()

Let’s compute the total library sequencing size:

sum(counts$V2)

dim(counts)

CPM <- counts / ( sum(counts$V2)/ 1000000 )
head(CPM)

head( cbind(counts, CPM) )

Disclaimer: This tutorial has been originated from: http://www.r-tutor.com/r-introduction

Single cell RNAseq of Rattus norvegicus Striatum and Hippocampus

Load the pre-processed scRNAseq expression matrix

library(tidyverse)
library(Seurat)
setwd('~/data/scData/')

striatum <- get(load("striatum.seurat.Robj"))
head(striatum@meta.data)

##                        orig.ident percent.mito res.0.4 nCount_RNA nFeature_RNA
## AAACCTGAGCCGCCTA Rat_Striatum_003  0.007434944       0        269          160
## AAACCTGAGGCGACAT Rat_Striatum_003  0.019531250       6        256          130
## AAACCTGAGGGATGGG Rat_Striatum_003  0.031634446       4        569          307
## AAACCTGAGTAAGTAC Rat_Striatum_003  0.024390244       3        369          198
## AAACCTGTCCCATTAT Rat_Striatum_003  0.029459902       4        611          227
## AAACCTGTCGTGGGAA Rat_Striatum_003  0.003021148       2        331          152

hippo <- get(load("hippocampus.seurat.Robj"))
head(hippo@meta.data)

##                           orig.ident percent.mito res.0.4 nCount_RNA
## AAACCTGCAAGGCTCC Rat_Hippocampus_006  0.041800643       3        311
## AAACCTGGTAGTAGTA Rat_Hippocampus_006  0.006622517       0        302
## AAACCTGGTATCACCA Rat_Hippocampus_006  0.012658228       0        237
## AAACCTGGTTATCCGA Rat_Hippocampus_006  0.017334778       3        923
## AAACCTGTCCGCGCAA Rat_Hippocampus_006  0.026525199       2        377
## AAACGGGAGACACTAA Rat_Hippocampus_006  0.019867550       0        302
##                  nFeature_RNA
## AAACCTGCAAGGCTCC          171
## AAACCTGGTAGTAGTA          164
## AAACCTGGTATCACCA          140
## AAACCTGGTTATCCGA          351
## AAACCTGTCCGCGCAA          173
## AAACGGGAGACACTAA          178

Quality Check

VlnPlot(hippo, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3)

VlnPlot(striatum, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3)

Gene expression normalization

#“LogNormalize” normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result
hippo <- NormalizeData(hippo, normalization.method = "LogNormalize", scale.factor = 10000)

Determine variable genes

#Calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others).
hippo <- FindVariableFeatures(hippo, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(hippo), 10)

Scaling the data

#Apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques
hippo <- ScaleData(hippo, features = rownames(hippo))

Principal Component Analysis

#Perform linear dimensional reduction - Principal Components Analysis
hippo <- RunPCA(hippo, features = VariableFeatures(object = hippo))
DimPlot(hippo, reduction = "pca")

ElbowPlot(hippo)

Clustering the cells based on their transcriptomes

#Cluster the cells
hippo <- FindNeighbors(hippo, dims = 1:10)
hippo <- FindClusters(hippo, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4206
## Number of edges: 140832
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8257
## Number of communities: 8
## Elapsed time: 0 seconds

#Run non-linear dimensional reduction (UMAP/tSNE)
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
hippo <- RunUMAP(hippo, dims = 1:10)
DimPlot(hippo, reduction = "umap")

Finding differentially expressed genes

#Finding differentially expressed features (cluster biomarkers)
hippo.markers <- FindAllMarkers(hippo, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
hippo.markers %>%
  group_by(cluster) %>%
  slice_max(n = 2, order_by = avg_log2FC)

## # A tibble: 16 × 7
## # Groups:   cluster [8]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene  
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
##  1 1.38e- 98      1.40  0.592 0.319 1.22e- 94 0       Mal   
##  2 1.81e- 93      1.24  0.627 0.363 1.61e- 89 0       Sept4 
##  3 1.29e-210      1.92  0.642 0.141 1.15e-206 1       S100b 
##  4 1.23e-129      1.67  0.4   0.074 1.09e-125 1       Klk6  
##  5 0              3.08  0.951 0.195 0         2       Apoe  
##  6 6.03e-195      3.03  0.446 0.061 5.35e-191 2       Csf1r 
##  7 0              3.80  0.654 0.044 0         3       Il1b  
##  8 1.88e-278      3.28  0.737 0.075 1.67e-274 3       Cd83  
##  9 3.87e-102      5.46  0.276 0.014 3.44e- 98 4       Mt3   
## 10 6.30e- 12      1.99  0.252 0.098 5.58e-  8 4       Tubb5 
## 11 6.23e- 16      1.04  0.311 0.055 5.53e- 12 5       Pnrc1 
## 12 2.77e- 14      0.771 0.475 0.115 2.46e- 10 5       Ier3  
## 13 2.74e- 77      5.45  0.911 0.109 2.43e- 73 6       Bsg   
## 14 0              5.12  0.733 0.003 0         6       Itm2a 
## 15 9.56e-157      3.89  0.632 0.015 8.48e-153 7       Mpzl1 
## 16 6.32e- 83      3.87  0.684 0.04  5.61e- 79 7       Snrpd1

Expression distribution of certain biomarker genes in each cluster

VlnPlot(hippo, features = c("Mal", "Sept4", "Mt3", "Mpzl1"))

FeaturePlot(hippo, features = c("Mal", "Sept4", "Mt3", "Mpzl1"))

Heatmap visualization of biomarker gene expression

hippo.markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(hippo, features = top10$gene) + NoLegend()

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] SeuratObject_4.0.4 Seurat_4.1.0       forcats_0.5.1      stringr_1.4.0     
##  [5] dplyr_1.0.7        purrr_0.3.4        readr_2.1.2        tidyr_1.1.4       
##  [9] tibble_3.1.6       ggplot2_3.3.5      tidyverse_1.3.1   
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1          backports_1.4.1       plyr_1.8.6           
##   [4] igraph_1.2.11         lazyeval_0.2.2        splines_4.1.0        
##   [7] listenv_0.8.0         scattermore_0.7       digest_0.6.29        
##  [10] htmltools_0.5.2       fansi_1.0.2           magrittr_2.0.2       
##  [13] tensor_1.5            cluster_2.1.2         ROCR_1.0-11          
##  [16] limma_3.48.3          tzdb_0.2.0            globals_0.14.0       
##  [19] modelr_0.1.8          matrixStats_0.61.0    spatstat.sparse_2.1-0
##  [22] colorspace_2.0-2      rvest_1.0.2           ggrepel_0.9.1        
##  [25] haven_2.4.3           xfun_0.29             crayon_1.4.2         
##  [28] jsonlite_1.7.3        spatstat.data_2.1-2   survival_3.2-13      
##  [31] zoo_1.8-9             glue_1.6.1            polyclip_1.10-0      
##  [34] gtable_0.3.0          leiden_0.3.9          future.apply_1.8.1   
##  [37] abind_1.4-5           scales_1.1.1          DBI_1.1.2            
##  [40] miniUI_0.1.1.1        Rcpp_1.0.8            viridisLite_0.4.0    
##  [43] xtable_1.8-4          reticulate_1.24       spatstat.core_2.3-2  
##  [46] htmlwidgets_1.5.4     httr_1.4.2            RColorBrewer_1.1-2   
##  [49] ellipsis_0.3.2        ica_1.0-2             pkgconfig_2.0.3      
##  [52] farver_2.1.0          sass_0.4.0            uwot_0.1.11          
##  [55] dbplyr_2.1.1          deldir_1.0-6          utf8_1.2.2           
##  [58] tidyselect_1.1.1      labeling_0.4.2        rlang_1.0.0          
##  [61] reshape2_1.4.4        later_1.3.0           munsell_0.5.0        
##  [64] cellranger_1.1.0      tools_4.1.0           cli_3.1.1            
##  [67] generics_0.1.2        broom_0.7.12          ggridges_0.5.3       
##  [70] evaluate_0.14         fastmap_1.1.0         yaml_2.2.2           
##  [73] goftest_1.2-3         knitr_1.37            fs_1.5.2             
##  [76] fitdistrplus_1.1-6    RANN_2.6.1            pbapply_1.5-0        
##  [79] future_1.23.0         nlme_3.1-155          mime_0.12            
##  [82] ggrastr_1.0.1         xml2_1.3.3            compiler_4.1.0       
##  [85] rstudioapi_0.13       beeswarm_0.4.0        plotly_4.10.0        
##  [88] png_0.1-7             spatstat.utils_2.3-0  reprex_2.0.1         
##  [91] bslib_0.3.1           stringi_1.7.6         highr_0.9            
##  [94] RSpectra_0.16-0       lattice_0.20-45       Matrix_1.4-0         
##  [97] vctrs_0.3.8           pillar_1.6.5          lifecycle_1.0.1      
## [100] spatstat.geom_2.3-1   lmtest_0.9-39         jquerylib_0.1.4      
## [103] RcppAnnoy_0.0.19      data.table_1.14.2     cowplot_1.1.1        
## [106] irlba_2.3.5           httpuv_1.6.5          patchwork_1.1.1      
## [109] R6_2.5.1              promises_1.2.0.1      KernSmooth_2.23-20   
## [112] gridExtra_2.3         vipor_0.4.5           parallelly_1.30.0    
## [115] codetools_0.2-18      MASS_7.3-55           assertthat_0.2.1     
## [118] withr_2.4.3           sctransform_0.3.3     mgcv_1.8-38          
## [121] parallel_4.1.0        hms_1.1.1             grid_4.1.0           
## [124] rpart_4.1.16          rmarkdown_2.11        Cairo_1.5-14         
## [127] Rtsne_0.15            shiny_1.7.1           lubridate_1.8.0      
## [130] ggbeeswarm_0.6.0