This is an R-markdown document for the PCL1491 - Introduction to bioinformatics module.
Let’s start by loading the Seurat package, which we will use for our class (or install first, as necessary).
#install.packages('Seurat') # as needed
library(Seurat) # for most if not all of our work with RNAseq
Now let’s load our data, which is a subset of the single-cell RNAseq data from the Kidney Precision Medicine Project (KPMP). We have 800 cells from the kidney samples of 32 patients, 12 of which have acute kidney injury (AKI).You can find the full dataset on the KPMP repository ( https://atlas.kpmp.org/repository/ ). This study/dataset is described in detail in Lake et al., 2023 https://www.nature.com/articles/s41586-023-05769-3 .
Seurat_object <- readRDS("./Data/KPMP_small.rds")
When starting any new research project, it is always important to understand the contents, characteristics and distribution of our data (or the data we will be collecting, if it was a brand new experiment). First, let’s take a look at the Seurat data structure.
Now let us look at the gene expression data we are working with.
expt_matrix <- as.matrix(Seurat_object@assays$RNA@counts) # our count matrix, expanded
We have every row representing one gene (identified by row names here starting with “AL627309.1”) and every column representing one sample. In our case, a sample is a cell. In other RNAseq experiments, a sample can be an entire chunk of tissue, a collection of cells in a test-tube, or anything else that contains RNA. The numbers in the main matrix represent the number of RNA fragments, or reads, detected for one gene in one sample. For example, the cell “AKI3010018_ACGCACGAGACCATAA-1” has ~0.84 RNA reads detected for the gene “SAMD11”. This is the general format of count matrices for RNAseq data.
Note that we have non-integers here for our counts, indicating that the data may have already been normalized (see analysis section below) and a potential error occurred in the upload from KPMP. This is ok for our demonstration purposes, but will definitely need to be addressed in real-life research situations.
Finally, let us look at the metadata. In single-cell RNAseq data, every cell has metadata
expt_metadata <- Seurat_object@meta.data
As you can see, this is just another table, but now every row is one sample and every column a metadata variable. For our purposes, we have 4 cell types - endothelial, epithelial, immune, and stromal cells. In our mini-dataset, we have 100 cells for each cell type and disease state (AKI or “LivingDonor”, which is our healthy controls). We would also be interested in looking at our patient demographics.
# select patient-level metadata, and focus on unique cases
expt_pt <- unique(expt_metadata[,c("orig.ident", "sampletype", "SpecimenID", "diseasetype", "age", "gender")])
We can see that there more control than AKI patients, with especially few female AKI patients in our study.
Now that we have an understanding of our data and metadata, we can move on to our analysis, which we will perform with Seurat functions.
The first thing people often do with RNAseq data is normalization. We need to do this too. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical variation.
# Seurat_object <- NormalizeData(Seurat_object,
# normalization.method = "LogNormalize",
# scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million
Seurat_object@assays$RNA@data[1:10,1:3] # our normalized data is stored here; preview/see our normalized data
## 10 x 3 sparse Matrix of class "dgCMatrix"
## AKI3010018_AACCACAAGCAGCCTC-1 AKI3010018_ACGCACGAGACCATAA-1
## AL627309.1 . .
## AL669831.5 . .
## LINC00115 . .
## FAM41C . .
## AL645608.1 . .
## SAMD11 . 0.8441467
## NOC2L . .
## KLHL17 . .
## PLEKHN1 . .
## HES4 . .
## AKI3010018_AGGACGACAACCCTCT-1
## AL627309.1 .
## AL669831.5 .
## LINC00115 .
## FAM41C .
## AL645608.1 .
## SAMD11 .
## NOC2L .
## KLHL17 .
## PLEKHN1 .
## HES4 3.973572
We do not actually run the normalization code here since we may be starting with already-normalized data. But if we did (and if we had integer counts to start with), we would divide every gene count by a size-factor (such as a simple sum of all reads per sample) and then log-transform the values. This is a common normalization step to control for differences in total RNA amounts in different cells.
In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts” , thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).
We can now do a mock cell clustering exercise with this KPMP dataset, the usual next-step for single-cell RNAseq data after normalization (and QC). For more information, you can review the basic Seurat vignette at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html .
Seurat_object <- FindVariableFeatures(Seurat_object, selection.method = "vst", nfeatures = 2000) # find variable features
Seurat_object <- ScaleData(Seurat_object) # center and z-score our data
## Centering and scaling data matrix
Seurat_object <- RunPCA(Seurat_object) # run PCA
## PC_ 1
## Positive: CORO1A, CCL5, CD2, FLT1, CD52, CD69, TRBC1, CD3D, PLVAP, RAMP3
## PECAM1, KDR, PTPRB, IL7R, SLCO2A1, NKG7, HCST, ADGRL4, GZMA, NOTCH4
## EGFL7, GZMH, CST7, CTSW, CD247, CALCRL, CEACAM1, RNASE1, LCP1, STAB1
## Negative: NDUFA4, CYSTM1, MAL, CA12, CHCHD10, EPCAM, COA3, MPC1, SPINT2, COX7B
## UQCRB, CLDN7, EFHD1, CKB, CYCS, ATP6V0B, OGDHL, COX6A1, FXYD2, ATP1B1
## TFCP2L1, RTN4, SDC4, CYB5A, SLC25A5, CYS1, SLC25A4, CTSD, ATP1A1, CDH16
## PC_ 2
## Positive: APOE, FTL, ATP5F1E, SLC7A7, TYROBP, AIF1, SMIM24, ATP5MG, HCST, GPX3
## ATP5F1D, ATP5MPL, ATP5IF1, ATP5F1C, HLA-DRA, COTL1, RIDA, ATP5MF, FCER1G, ALDOB
## C1orf162, ITGB2, GLYAT, HMOX1, LAPTM5, GATM, FBP1, UQCR11, HLA-DPB1, LGALS2
## Negative: SELM, SEPW1, SF3B14, TCEB1, C19orf43, ATP5I, SHFM1, MALAT1, USMG5, PTRF
## LSMD1, TSC22D1, ATP5L, ATP5D, C14orf2, MRP63, ATP5J, CAV1, ATPIF1, ATP5G2
## UQCR11.1, GNG11, MCAM, WBP5, SPARCL1, ADIRF, PRKCDBP, NHP2L1, LINC00493, ATP5G3
## PC_ 3
## Positive: GPX3, GATM, CUBN, ALDOB, DDC, GLYAT, PDZK1IP1, PDZK1, AK4, SLC5A12
## KHK, C11orf54, PAH, DPYS, GLYATL1, SLC4A4, GSTA1, AGXT2, SLC16A9, SLC13A3
## FAM151A, PLG, UGT2B7, DPEP1, SLC7A8, LRP2, ACSM2B, PRAP1, CMBL, MIOX
## Negative: HLA-DRA, CD74, HLA-DRB1, HLA-DPB1, HLA-DPA1, AIF1, TYROBP, SRGN, HLA-DQA1, FCER1G
## CTSS, LAPTM5, C1orf162, CSF1R, MNDA, MS4A6A, HLA-DQB1, RGS10, RGS2, CLEC7A
## MS4A7, PYCARD, CSTA, MSR1, ITGB2, SPI1, HLA-DRB5, HCST, LY96, ITGAX
## PC_ 4
## Positive: DMRT2, LINC01187, ATP6V0D2, SLC4A1, FAM24B, HEPACAM2, RHBG, NUPR2, SRARP, ATP6V1G3
## ATP6V1C2, CKMT2, ATP6V0A4, ATP5F1B, C12orf75, SPINT1-AS1, ATP6AP2, RHCG, RAB3B, DHRS7
## FMC1, ATP5MC1, SMIM6, CKB, TMEM101, PACRG, ATP5PO, KIT, PART1, SELENOM
## Negative: CD74, HLA-DRA, HLA-DRB1, AIF1, HLA-DPA1, HLA-DPB1, TYROBP, FTL, COTL1, CSF1R
## HLA-DQA1, FCER1G, CTSS, LAPTM5, ATPIF1, SRGN, C1orf162, MNDA, C5AR1, RGS2
## MS4A6A, DAB2, HLA-DQB1, HLA-DRB5, MSR1, MS4A7, SLC7A7, MRP63, CLEC7A, ATP5I
## PC_ 5
## Positive: ATP5G1, ATP5G3, ATP5L, KCNJ1, MFSD4, TMEM52B, KNG1, WNK1, SLC12A1, ATP5F1
## MMP24-AS1, ATP5C1, TMEM72, MUC15, RHOBTB3, USMG5, RP11-532F12.5, CYS1, ATP1B1, ATP5G2
## ERBB4, PROM2, WNK4, KCNJ16, POU3F3, LINC01158, SERPINA5, ATP5J, ATP5B, CLDN16
## Negative: GNG11, EGFL7, IFI27, RNASE1, TEK, SLC9A3R2, RAMP2, HYAL2, CALCRL, HEG1
## PTPRB, LMO2, SLCO2A1, FLT1, RAMP3, IGFBP5, CLEC3B, CD34, SOX18, SHANK3
## TM4SF18, RASIP1, NOSTRIN, TMEM88, LRRC32, TGFBR2, FAM167B, CD93, KDR, PODXL
# ElbowPlot(Seurat_object) # we will look at this if we have time
Seurat_object <- FindNeighbors(Seurat_object, dims = 1:10) # find cells that are similar to each other
## Computing nearest neighbor graph
## Computing SNN
Seurat_object <- FindClusters(Seurat_object, resolution = 0.5) # find clusters of cells
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 800
## Number of edges: 19357
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8838
## Number of communities: 8
## Elapsed time: 0 seconds
Seurat_object <- RunUMAP(Seurat_object, dims = 1:10) # dimension-reduction values for visualization
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 11:09:45 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:09:45 Read 800 rows and found 10 numeric columns
## 11:09:45 Using Annoy for neighbor search, n_neighbors = 30
## 11:09:45 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:09:45 Writing NN index file to temp file C:\Users\User\AppData\Local\Temp\RtmpOwbQU0\file201014681412
## 11:09:45 Searching Annoy index using 1 thread, search_k = 3000
## 11:09:46 Annoy recall = 100%
## 11:09:46 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:09:46 Initializing from normalized Laplacian + noise (using RSpectra)
## 11:09:46 Commencing optimization for 500 epochs, with 27330 positive edges
## 11:09:49 Optimization finished
DimPlot(Seurat_object)
Usually, this is where bioinfomaticians and scientists will need to spend a fair amount of time trying to define the identity of these cells based on the gene expression of cells in these clusters (also know as cell-type marker expression). Fortunately for us, the KPMP dataset comes with handy cell type annotations that define the cells.
DimPlot(Seurat_object, group.by = "ClusterClass") # now we are using "ClusterClass" to define our cells
We see above our cell clusters labelled by the KPMP “ClusterClass”
labels, which reveals some good separations between our cell types.
For our project, we will look for celltype-specific differentially expressed genes (DEG’s) between AKI and control samples. To focus on a specific celltype of interest, we will first subset our data. You can pick any of the four “ClusterClass” cell types, though stromal cells may be harder to interpret biologically.
Test_object <- subset(Seurat_object, subset = ClusterClass == "immune cells") # subset for immune cells
Now we can perform differential gene expression testing. Ideally, we would first summarize (add) all of the counts per patient per gene in a step call pseudobulking. This would better represent the degrees of freedom/sample size we actually have. For our purposes, we will focus on cell-level differences for now. We use the “FindMarkers” function to find differentially expressed genes. By default, it will use the Wilcoxon rank sum test to test for statistical significance.
Idents(Test_object) <- "sampletype" # we set "sampletype" as the active identity via "Idents()", this lets Seurat know we're using the disease state information stored in the "sampletype" variable to define our comparison groups
results <- FindMarkers(Test_object, # our data
ident.1 = "AKI", # what group are we looking for DEG's for
logfc.threshold = 0.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
min.pct = 0.1, # what percent of cells do we want to express a gene in order for us to consider testing it
only.pos = F) # do we want only genes that are enriched/greater in "ident.1"
And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers based on what kind of DEG’s we’re looking for, what context we’re studying, and more.
We can how take a look at the genes that turn up in our result.
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
neg_results <- sig_results[sig_results$avg_log2FC < 0,] # filter for down-regulated genes
The significance p values of our tests are in the “p_val” and “p_val_adj” columns, with the latter being corrected for multiple comparisons, which is very important when you are testing for thousands of genes.”avg_log2FC” is the log- fold-change of gene expression between our key group (AKI immune cells, in this case) and our reference/control group (healthy/living donor immune cells). The log is base 2, so a value of -10 means a 2^-10 fold change between AKI and control cells (or a 1024x decrease). However, note that it is sometimes easy to get inflated fold-change values when gene expression is near 0. Finally, pct.1 and pct.2 are the percentage of group1 (in this case, AKI immune cells) and group2 (control immune cells) cells that express the gene at a level > 0. For example, 37% of AKI immune cells express CXCR4.
For this example, we see that many of our differentially expressed genes are down-regulated. We can visualize some of the gene expression values with plots in Seurat.
VlnPlot(Test_object, features = c("CXCR4", "BTG1"))
Finally, we can save our results in a few different ways. CSV is a common format for saving tables.
# write.csv(neg_results, "./neg_results.csv") # note you don't actually need the "./" part here
And we are done! If we have time, we will try to take a look at these genes via enrichment analysis on https://maayanlab.cloud/Enrichr/.