R, python, and bash scripting are the three main pillars of bioinformatics. Here, we are starting with R and use of Rstudio. Please, follow the instructions below and install the required software to your own PC.
.
The main goal of this tutorial is to give you a flavor of how simple analysis are conducted in R environment. The first part will involve practicing basic terms and second will cover a data analytics for single cell RNA sequencing experiment.
.
Let’s start…
You can start Rstudio either from terminal or among installed softwares.
These are a few R packages we will use along the tutorial:
# install.packages()
install.packages('tidyverse')
library(tidyverse)
# Enter commands in R (or R studio, if installed)
install.packages('Seurat')
library(Seurat)
Here is a link to get further help for installing Seurat package in case needed: https://satijalab.org/seurat/articles/install.html
Our lab has generated single-cell RNA sequecing datasets using 10X Genomics Chromium platform from Hippocampus and Striatum of Rattus norvegicus brain.
Please, download these two example dataset through the link: https://drive.google.com/drive/folders/18DSiCHv-WM_48iutYdGJiycJBKnIzDuK?usp=sharing
1 + 2
## [1] 3
x <- 1
x
## [1] 1
Functions take arguments in paranthesis and process the output. Here the function ‘c’ returns the values as a vector.
c(1, 2, 3)
## [1] 1 2 3
a <- 2
b <- 9.5
a > b
## [1] FALSE
z <- a > b
z
## [1] FALSE
class(a)
## [1] "numeric"
class(b)
## [1] "numeric"
class(z)
## [1] "logical"
Reverse the logic:
!z
## [1] TRUE
x <- as.character(3.14)
x
## [1] "3.14"
fname = "John"; lname ="Doe";
paste(fname, lname)
## [1] "John Doe"
?paste
my_vec1 <- c("aa", "bb", "cc", "dd", "ee")
my_vec2 <- c(1, 2, 3, 4, 5)
length(my_vec2)
## [1] 5
#Combining vectors:
c(my_vec1, my_vec2)
## [1] "aa" "bb" "cc" "dd" "ee" "1" "2" "3" "4" "5"
my_vec2
## [1] 1 2 3 4 5
2 * my_vec2 # Mutliply by 2
## [1] 2 4 6 8 10
my_vec2 - 2
## [1] -1 0 1 2 3
my_vec3 <- c(10, 20, 30, 40, 50)
my_vec2 + my_vec3
## [1] 11 22 33 44 55
3 * (my_vec2 + my_vec3) / 5
## [1] 6.6 13.2 19.8 26.4 33.0
Indexing starts with 1
unlike python.
s <- c("aa", "bb", "cc", "dd", "ee")
s[3]
## [1] "cc"
s[3:5]
## [1] "cc" "dd" "ee"
s[-4] #Now 'dd' is gone!
## [1] "aa" "bb" "cc" "ee"
s[c(2, 3, 3)]
## [1] "bb" "cc" "cc"
s[c(2, 1, 3)]
## [1] "bb" "aa" "cc"
s[c(FALSE, TRUE, FALSE, TRUE, FALSE)] #Logical indexing
## [1] "bb" "dd"
Naming the vector members:
v <- c("Mary", "Sue")
v
## [1] "Mary" "Sue"
names(v) <- c("First", "Last")
v
## First Last
## "Mary" "Sue"
v["First"]
## First
## "Mary"
“A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.”
A <- matrix(
c(2, 4, 3, 1, 5, 7), # the data elements
nrow=2, # number of rows
ncol=3, # number of columns
byrow = TRUE) # fill matrix by rows
A
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
A[2, 3] # A[row, column]
## [1] 7
A[2, ] # The entire 2nd row
## [1] 1 5 7
A[, 3] # The entire 3rd column
## [1] 3 7
A[, c(1,3)] # the 1st and 3rd columns
## [,1] [,2]
## [1,] 2 3
## [2,] 1 7
Transpose the matrix
B <- matrix(
c(2, 4, 3, 1, 5, 7),
nrow=3,
ncol=2)
B
## [,1] [,2]
## [1,] 2 1
## [2,] 4 5
## [3,] 3 7
t(B)
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
Combine matrices
cbind(A, t(B)) # combine row-wise
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2 4 3 2 4 3
## [2,] 1 5 7 1 5 7
rbind(A, t(B)) # combine column-wise
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
## [3,] 2 4 3
## [4,] 1 5 7
“A data frame is used for storing data tables. It is a list of vectors of equal length.”
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b) # df is a data frame
df
## n s b
## 1 2 aa TRUE
## 2 3 bb FALSE
## 3 5 cc TRUE
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mtcars[10, 4]
## [1] 123
mtcars['Merc 280', 'hp']
## [1] 123
mtcars["Merc 280", ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
mtcars[["hp"]]
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
## [20] 65 97 150 150 245 175 66 91 113 264 175 335 109
mtcars$hp
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
## [20] 65 97 150 150 245 175 66 91 113 264 175 335 109
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
mtcars[, c("mpg", "hp")]
## mpg hp
## Mazda RX4 21.0 110
## Mazda RX4 Wag 21.0 110
## Datsun 710 22.8 93
## Hornet 4 Drive 21.4 110
## Hornet Sportabout 18.7 175
## Valiant 18.1 105
## Duster 360 14.3 245
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Merc 280 19.2 123
## Merc 280C 17.8 123
## Merc 450SE 16.4 180
## Merc 450SL 17.3 180
## Merc 450SLC 15.2 180
## Cadillac Fleetwood 10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial 14.7 230
## Fiat 128 32.4 66
## Honda Civic 30.4 52
## Toyota Corolla 33.9 65
## Toyota Corona 21.5 97
## Dodge Challenger 15.5 150
## AMC Javelin 15.2 150
## Camaro Z28 13.3 245
## Pontiac Firebird 19.2 175
## Fiat X1-9 27.3 66
## Porsche 914-2 26.0 91
## Lotus Europa 30.4 113
## Ford Pantera L 15.8 264
## Ferrari Dino 19.7 175
## Maserati Bora 15.0 335
## Volvo 142E 21.4 109
getwd()
setwd("your target directory")
counts <- read.table(file = "SRR1039512.counts", sep = "\t", row.names = 1)
head(counts)
#Alternatives: read.csv(), read.delim(), readxl::read_xls()
Let’s compute the total library sequencing size:
sum(counts$V2)
dim(counts)
CPM <- counts / ( sum(counts$V2)/ 1000000 )
head(CPM)
head( cbind(counts, CPM) )
Disclaimer: This tutorial has been originated from: http://www.r-tutor.com/r-introduction
library(tidyverse)
library(Seurat)
setwd('~/data/scData/')
striatum <- get(load("striatum.seurat.Robj"))
head(striatum@meta.data)
## orig.ident percent.mito res.0.4 nCount_RNA nFeature_RNA
## AAACCTGAGCCGCCTA Rat_Striatum_003 0.007434944 0 269 160
## AAACCTGAGGCGACAT Rat_Striatum_003 0.019531250 6 256 130
## AAACCTGAGGGATGGG Rat_Striatum_003 0.031634446 4 569 307
## AAACCTGAGTAAGTAC Rat_Striatum_003 0.024390244 3 369 198
## AAACCTGTCCCATTAT Rat_Striatum_003 0.029459902 4 611 227
## AAACCTGTCGTGGGAA Rat_Striatum_003 0.003021148 2 331 152
hippo <- get(load("hippocampus.seurat.Robj"))
head(hippo@meta.data)
## orig.ident percent.mito res.0.4 nCount_RNA
## AAACCTGCAAGGCTCC Rat_Hippocampus_006 0.041800643 3 311
## AAACCTGGTAGTAGTA Rat_Hippocampus_006 0.006622517 0 302
## AAACCTGGTATCACCA Rat_Hippocampus_006 0.012658228 0 237
## AAACCTGGTTATCCGA Rat_Hippocampus_006 0.017334778 3 923
## AAACCTGTCCGCGCAA Rat_Hippocampus_006 0.026525199 2 377
## AAACGGGAGACACTAA Rat_Hippocampus_006 0.019867550 0 302
## nFeature_RNA
## AAACCTGCAAGGCTCC 171
## AAACCTGGTAGTAGTA 164
## AAACCTGGTATCACCA 140
## AAACCTGGTTATCCGA 351
## AAACCTGTCCGCGCAA 173
## AAACGGGAGACACTAA 178
VlnPlot(hippo, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3)
VlnPlot(striatum, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3)
#“LogNormalize” normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result
hippo <- NormalizeData(hippo, normalization.method = "LogNormalize", scale.factor = 10000)
#Calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others).
hippo <- FindVariableFeatures(hippo, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(hippo), 10)
#Apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques
hippo <- ScaleData(hippo, features = rownames(hippo))
#Perform linear dimensional reduction - Principal Components Analysis
hippo <- RunPCA(hippo, features = VariableFeatures(object = hippo))
DimPlot(hippo, reduction = "pca")
ElbowPlot(hippo)
#Cluster the cells
hippo <- FindNeighbors(hippo, dims = 1:10)
hippo <- FindClusters(hippo, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4206
## Number of edges: 140832
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8257
## Number of communities: 8
## Elapsed time: 0 seconds
#Run non-linear dimensional reduction (UMAP/tSNE)
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
hippo <- RunUMAP(hippo, dims = 1:10)
DimPlot(hippo, reduction = "umap")
#Finding differentially expressed features (cluster biomarkers)
hippo.markers <- FindAllMarkers(hippo, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
hippo.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
## # A tibble: 16 × 7
## # Groups: cluster [8]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.38e- 98 1.40 0.592 0.319 1.22e- 94 0 Mal
## 2 1.81e- 93 1.24 0.627 0.363 1.61e- 89 0 Sept4
## 3 1.29e-210 1.92 0.642 0.141 1.15e-206 1 S100b
## 4 1.23e-129 1.67 0.4 0.074 1.09e-125 1 Klk6
## 5 0 3.08 0.951 0.195 0 2 Apoe
## 6 6.03e-195 3.03 0.446 0.061 5.35e-191 2 Csf1r
## 7 0 3.80 0.654 0.044 0 3 Il1b
## 8 1.88e-278 3.28 0.737 0.075 1.67e-274 3 Cd83
## 9 3.87e-102 5.46 0.276 0.014 3.44e- 98 4 Mt3
## 10 6.30e- 12 1.99 0.252 0.098 5.58e- 8 4 Tubb5
## 11 6.23e- 16 1.04 0.311 0.055 5.53e- 12 5 Pnrc1
## 12 2.77e- 14 0.771 0.475 0.115 2.46e- 10 5 Ier3
## 13 2.74e- 77 5.45 0.911 0.109 2.43e- 73 6 Bsg
## 14 0 5.12 0.733 0.003 0 6 Itm2a
## 15 9.56e-157 3.89 0.632 0.015 8.48e-153 7 Mpzl1
## 16 6.32e- 83 3.87 0.684 0.04 5.61e- 79 7 Snrpd1
VlnPlot(hippo, features = c("Mal", "Sept4", "Mt3", "Mpzl1"))
FeaturePlot(hippo, features = c("Mal", "Sept4", "Mt3", "Mpzl1"))
hippo.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(hippo, features = top10$gene) + NoLegend()
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] SeuratObject_4.0.4 Seurat_4.1.0 forcats_0.5.1 stringr_1.4.0
## [5] dplyr_1.0.7 purrr_0.3.4 readr_2.1.2 tidyr_1.1.4
## [9] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.4.1 plyr_1.8.6
## [4] igraph_1.2.11 lazyeval_0.2.2 splines_4.1.0
## [7] listenv_0.8.0 scattermore_0.7 digest_0.6.29
## [10] htmltools_0.5.2 fansi_1.0.2 magrittr_2.0.2
## [13] tensor_1.5 cluster_2.1.2 ROCR_1.0-11
## [16] limma_3.48.3 tzdb_0.2.0 globals_0.14.0
## [19] modelr_0.1.8 matrixStats_0.61.0 spatstat.sparse_2.1-0
## [22] colorspace_2.0-2 rvest_1.0.2 ggrepel_0.9.1
## [25] haven_2.4.3 xfun_0.29 crayon_1.4.2
## [28] jsonlite_1.7.3 spatstat.data_2.1-2 survival_3.2-13
## [31] zoo_1.8-9 glue_1.6.1 polyclip_1.10-0
## [34] gtable_0.3.0 leiden_0.3.9 future.apply_1.8.1
## [37] abind_1.4-5 scales_1.1.1 DBI_1.1.2
## [40] miniUI_0.1.1.1 Rcpp_1.0.8 viridisLite_0.4.0
## [43] xtable_1.8-4 reticulate_1.24 spatstat.core_2.3-2
## [46] htmlwidgets_1.5.4 httr_1.4.2 RColorBrewer_1.1-2
## [49] ellipsis_0.3.2 ica_1.0-2 pkgconfig_2.0.3
## [52] farver_2.1.0 sass_0.4.0 uwot_0.1.11
## [55] dbplyr_2.1.1 deldir_1.0-6 utf8_1.2.2
## [58] tidyselect_1.1.1 labeling_0.4.2 rlang_1.0.0
## [61] reshape2_1.4.4 later_1.3.0 munsell_0.5.0
## [64] cellranger_1.1.0 tools_4.1.0 cli_3.1.1
## [67] generics_0.1.2 broom_0.7.12 ggridges_0.5.3
## [70] evaluate_0.14 fastmap_1.1.0 yaml_2.2.2
## [73] goftest_1.2-3 knitr_1.37 fs_1.5.2
## [76] fitdistrplus_1.1-6 RANN_2.6.1 pbapply_1.5-0
## [79] future_1.23.0 nlme_3.1-155 mime_0.12
## [82] ggrastr_1.0.1 xml2_1.3.3 compiler_4.1.0
## [85] rstudioapi_0.13 beeswarm_0.4.0 plotly_4.10.0
## [88] png_0.1-7 spatstat.utils_2.3-0 reprex_2.0.1
## [91] bslib_0.3.1 stringi_1.7.6 highr_0.9
## [94] RSpectra_0.16-0 lattice_0.20-45 Matrix_1.4-0
## [97] vctrs_0.3.8 pillar_1.6.5 lifecycle_1.0.1
## [100] spatstat.geom_2.3-1 lmtest_0.9-39 jquerylib_0.1.4
## [103] RcppAnnoy_0.0.19 data.table_1.14.2 cowplot_1.1.1
## [106] irlba_2.3.5 httpuv_1.6.5 patchwork_1.1.1
## [109] R6_2.5.1 promises_1.2.0.1 KernSmooth_2.23-20
## [112] gridExtra_2.3 vipor_0.4.5 parallelly_1.30.0
## [115] codetools_0.2-18 MASS_7.3-55 assertthat_0.2.1
## [118] withr_2.4.3 sctransform_0.3.3 mgcv_1.8-38
## [121] parallel_4.1.0 hms_1.1.1 grid_4.1.0
## [124] rpart_4.1.16 rmarkdown_2.11 Cairo_1.5-14
## [127] Rtsne_0.15 shiny_1.7.1 lubridate_1.8.0
## [130] ggbeeswarm_0.6.0
Comments
Everything that follows a ‘#’ is considered as command.