Introduction to bioinformatics (2025 PCL1491)

Setting up

This is an R-markdown document for the PCL1491 - Introduction to bioinformatics module.

Let’s start by loading the Seurat package, which we will use for our class (or install first, as necessary).

#install.packages('Seurat') # as needed
library(Seurat) # for most if not all of our work with RNAseq

Now let’s load our data, which is a subset of the single-cell RNAseq data from the Kidney Precision Medicine Project (KPMP). We have 800 cells from the kidney samples of 32 patients, 12 of which have acute kidney injury (AKI).You can find the full dataset on the KPMP repository ( https://atlas.kpmp.org/repository/ ). This study/dataset is described in detail in Lake et al., 2023 https://www.nature.com/articles/s41586-023-05769-3 .

Seurat_object <- readRDS("./Data/KPMP_small.rds")

When starting any new research project, it is always important to understand the contents, characteristics and distribution of our data (or the data we will be collecting, if it was a brand new experiment being planned). First, let’s take a look at the Seurat data structure.

Seurat_object

## An object of class Seurat 
## 30395 features across 800 samples within 1 assay 
## Active assay: RNA (30395 features, 5000 variable features)
##  2 layers present: counts, data
##  3 dimensional reductions calculated: harmony, pca, umap

Now let us look at the gene expression data we are working with.

expt_matrix <- as.matrix(Seurat_object@assays$RNA@counts) # our count matrix, extracted and expanded
expt_matrix[1:10,1:3] # let's see the first 10 rows and 3 columns of our count matrix

##            AKI3010018_AACCACAAGCAGCCTC-1 AKI3010018_ACGCACGAGACCATAA-1
## AL627309.1                             0                     0.0000000
## AL669831.5                             0                     0.0000000
## LINC00115                              0                     0.0000000
## FAM41C                                 0                     0.0000000
## AL645608.1                             0                     0.0000000
## SAMD11                                 0                     0.8441467
## NOC2L                                  0                     0.0000000
## KLHL17                                 0                     0.0000000
## PLEKHN1                                0                     0.0000000
## HES4                                   0                     0.0000000
##            AKI3010018_AGGACGACAACCCTCT-1
## AL627309.1                      0.000000
## AL669831.5                      0.000000
## LINC00115                       0.000000
## FAM41C                          0.000000
## AL645608.1                      0.000000
## SAMD11                          0.000000
## NOC2L                           0.000000
## KLHL17                          0.000000
## PLEKHN1                         0.000000
## HES4                            3.973572

We have every row representing one gene (identified by row names here starting with “AL627309.1”) and every column representing one sample. In our case, a sample is a cell. In other RNAseq experiments, a sample can be an entire chunk of tissue, a collection of cells in a test-tube, or anything else that contains RNA. The numbers in the main matrix represent the number of RNA fragments, or reads, detected for one gene in one sample. For example, the cell “AKI3010018_ACGCACGAGACCATAA-1” has ~0.84 RNA reads detected for the gene “SAMD11”. This is the general format of count matrices for RNAseq data.

Note that we have non-integers here for our counts, indicating that the data may have already been normalized (see analysis section below) and a potential error occurred in the upload from KPMP. This is ok for our demonstration purposes, but will definitely need to be addressed in real-life research situations.

Finally, let us look at the metadata. In single-cell RNAseq data, sometimes every cell has metadata.

expt_metadata <- Seurat_object@meta.data
expt_metadata[1:10,1:10] #first 10 rows and columns of our metadata

##                               orig.ident nCount_RNA nFeature_RNA percent.mt
## AKI3010018_AACCACAAGCAGCCTC-1 AKI3010018  1224.1317          699  4.8946016
## AKI3010018_ACGCACGAGACCATAA-1 AKI3010018  4089.3051         1147 15.3634371
## AKI3010018_AGGACGACAACCCTCT-1 AKI3010018  3256.2663         1703  0.5778606
## AKI3010018_AGTGTTGAGATTGTGA-1 AKI3010018  7258.0065         2928  4.5127062
## AKI3010018_ATTCCCGTCGTCAGAT-1 AKI3010018   987.8799          657 11.9573496
## AKI3010018_CAAGGGACATGAGAAT-1 AKI3010018  3124.7667         1465  6.8607762
## AKI3010018_CATCAAGCAGCCGTCA-1 AKI3010018 10258.0674         3568  0.7305823
## AKI3010018_CATCGTCAGTTACGTC-1 AKI3010018  3086.3940         1413  9.2157844
## AKI3010018_CGACAGCGTATGTGTC-1 AKI3010018  2316.4746         1208  0.7709667
## AKI3010018_CTCATGCTCCGTGGGT-1 AKI3010018  2231.7650         1308  0.2691110
##                               subclass.l2  subclass.l1 dataSource sampletype
## AKI3010018_AACCACAAGCAGCCTC-1      MAC-M2       Immune       KPMP        AKI
## AKI3010018_ACGCACGAGACCATAA-1         FIB Interstitial       KPMP        AKI
## AKI3010018_AGGACGACAACCCTCT-1      EC-AEA           EC       KPMP        AKI
## AKI3010018_AGTGTTGAGATTGTGA-1         aPT           PT       KPMP        AKI
## AKI3010018_ATTCCCGTCGTCAGAT-1      EC-PTC           EC       KPMP        AKI
## AKI3010018_CAAGGGACATGAGAAT-1    CNT-IC-A           IC       KPMP        AKI
## AKI3010018_CATCAAGCAGCCGTCA-1     dPT/DTL           PT       KPMP        AKI
## AKI3010018_CATCGTCAGTTACGTC-1       aTAL2      ATL/TAL       KPMP        AKI
## AKI3010018_CGACAGCGTATGTGTC-1           T       Immune       KPMP        AKI
## AKI3010018_CTCATGCTCCGTGGGT-1         aPT           PT       KPMP        AKI
##                                    umap1      umap2
## AKI3010018_AACCACAAGCAGCCTC-1  -8.538363 -10.806451
## AKI3010018_ACGCACGAGACCATAA-1   8.487801  -3.340054
## AKI3010018_AGGACGACAACCCTCT-1  10.029637  -7.953340
## AKI3010018_AGTGTTGAGATTGTGA-1   4.302333   9.961152
## AKI3010018_ATTCCCGTCGTCAGAT-1   8.423757 -11.110086
## AKI3010018_CAAGGGACATGAGAAT-1  -6.220929  -4.677879
## AKI3010018_CATCAAGCAGCCGTCA-1   7.217938   5.555872
## AKI3010018_CATCGTCAGTTACGTC-1   2.085822   2.061441
## AKI3010018_CGACAGCGTATGTGTC-1 -11.171146  -3.826559
## AKI3010018_CTCATGCTCCGTGGGT-1   2.185218  10.247087

table(expt_metadata$diseasetype, expt_metadata$ClusterClass) # breakdown of our cells

##              
##               endothelial cells epithelial cells immune cells stromal cells
##   AKI                       100              100          100           100
##   LivingDonor               100              100          100           100

As you can see, this is just another table, but now every row is one sample and every column a metadata variable. Here we have 4 cell types - endothelial, epithelial, immune, and stromal cells. In our mini-dataset, we have 100 cells for each cell type and disease state (AKI or “LivingDonor”, which are our healthy controls). We are also interested in looking at our patient demographics.

# select patient-level metadata, and focus on unique cases (to narrow down to patients and not cells)
expt_pt <- unique(expt_metadata[,c("orig.ident", "sampletype", "SpecimenID", "diseasetype", "age", "gender")])

# see our patients
table(expt_pt$diseasetype, expt_pt$gender)

##              
##                M  F
##   AKI          9  3
##   LivingDonor  7 13

We can see that there are more control than AKI patients, with especially few female AKI patients in our study.

Analysis with Seurat

Now that we have an understanding of our data and metadata, we can move on to our analysis, which we will perform with Seurat functions.

The first thing people usually do with RNAseq data is normalization. We need to do this too. This is important to consider to avoid confounding increased gene expression because of biological reasons with increased RNA amplification due to technical reasons.

# Seurat_object <- NormalizeData(Seurat_object, 
#                                normalization.method = "LogNormalize", 
#                                scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million
Seurat_object@assays$RNA@data[1:10,1:3] # our normalized data is stored here; preview/see our normalized data

## 10 x 3 sparse Matrix of class "dgCMatrix"
##            AKI3010018_AACCACAAGCAGCCTC-1 AKI3010018_ACGCACGAGACCATAA-1
## AL627309.1                             .                     .        
## AL669831.5                             .                     .        
## LINC00115                              .                     .        
## FAM41C                                 .                     .        
## AL645608.1                             .                     .        
## SAMD11                                 .                     0.8441467
## NOC2L                                  .                     .        
## KLHL17                                 .                     .        
## PLEKHN1                                .                     .        
## HES4                                   .                     .        
##            AKI3010018_AGGACGACAACCCTCT-1
## AL627309.1                      .       
## AL669831.5                      .       
## LINC00115                       .       
## FAM41C                          .       
## AL645608.1                      .       
## SAMD11                          .       
## NOC2L                           .       
## KLHL17                          .       
## PLEKHN1                         .       
## HES4                            3.973572

We do not actually run the normalization code here since we may be starting with already-normalized data. But if we did (or if we had integer counts to start with), one common normalization method (which the above code does) would be to divide every gene count by a size-factor (such as a simple sum of all reads per sample or cell) and then log-transform the values.

In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts” , thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).

We can now do a mock cell clustering exercise with this KPMP dataset, the usual next-step for single-cell RNAseq data after normalization (and QC). For more information, you can review the basic Seurat vignette at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html .

Seurat_object <- FindVariableFeatures(Seurat_object, selection.method = "vst", nfeatures = 2000) # find variable features (information-rich genes)
Seurat_object <- ScaleData(Seurat_object) # center and z-score our data

## Centering and scaling data matrix

Seurat_object <- RunPCA(Seurat_object) # run PCA

## PC_ 1 
## Positive:  CORO1A, CCL5, CD2, FLT1, CD52, CD69, TRBC1, CD3D, PLVAP, RAMP3 
##     PECAM1, KDR, PTPRB, IL7R, SLCO2A1, NKG7, HCST, ADGRL4, GZMA, NOTCH4 
##     EGFL7, GZMH, CST7, CTSW, CD247, CALCRL, CEACAM1, RNASE1, LCP1, STAB1 
## Negative:  NDUFA4, CYSTM1, MAL, CA12, CHCHD10, EPCAM, COA3, MPC1, SPINT2, COX7B 
##     UQCRB, CLDN7, EFHD1, CKB, CYCS, ATP6V0B, OGDHL, COX6A1, FXYD2, ATP1B1 
##     TFCP2L1, RTN4, SDC4, CYB5A, SLC25A5, CYS1, SLC25A4, CTSD, ATP1A1, CDH16 
## PC_ 2 
## Positive:  APOE, FTL, ATP5F1E, SLC7A7, TYROBP, AIF1, SMIM24, ATP5MG, HCST, GPX3 
##     ATP5F1D, ATP5MPL, ATP5IF1, ATP5F1C, HLA-DRA, COTL1, RIDA, ATP5MF, FCER1G, ALDOB 
##     C1orf162, ITGB2, GLYAT, HMOX1, LAPTM5, GATM, FBP1, UQCR11, HLA-DPB1, LGALS2 
## Negative:  SELM, SEPW1, SF3B14, TCEB1, C19orf43, ATP5I, SHFM1, MALAT1, USMG5, PTRF 
##     LSMD1, TSC22D1, ATP5L, ATP5D, C14orf2, MRP63, ATP5J, CAV1, ATPIF1, ATP5G2 
##     UQCR11.1, GNG11, MCAM, WBP5, SPARCL1, ADIRF, PRKCDBP, NHP2L1, LINC00493, ATP5G3 
## PC_ 3 
## Positive:  GPX3, GATM, CUBN, ALDOB, DDC, GLYAT, PDZK1IP1, PDZK1, AK4, SLC5A12 
##     KHK, C11orf54, PAH, DPYS, GLYATL1, SLC4A4, GSTA1, AGXT2, SLC16A9, SLC13A3 
##     FAM151A, PLG, UGT2B7, DPEP1, SLC7A8, LRP2, ACSM2B, PRAP1, CMBL, MIOX 
## Negative:  HLA-DRA, CD74, HLA-DRB1, HLA-DPB1, HLA-DPA1, AIF1, TYROBP, SRGN, HLA-DQA1, FCER1G 
##     CTSS, LAPTM5, C1orf162, CSF1R, MNDA, MS4A6A, HLA-DQB1, RGS10, RGS2, CLEC7A 
##     MS4A7, PYCARD, CSTA, MSR1, ITGB2, SPI1, HLA-DRB5, HCST, LY96, ITGAX 
## PC_ 4 
## Positive:  DMRT2, LINC01187, ATP6V0D2, SLC4A1, FAM24B, HEPACAM2, RHBG, NUPR2, SRARP, ATP6V1G3 
##     ATP6V1C2, CKMT2, ATP6V0A4, ATP5F1B, C12orf75, SPINT1-AS1, ATP6AP2, RHCG, RAB3B, DHRS7 
##     FMC1, ATP5MC1, SMIM6, CKB, TMEM101, PACRG, ATP5PO, KIT, PART1, SELENOM 
## Negative:  CD74, HLA-DRA, HLA-DRB1, AIF1, HLA-DPA1, HLA-DPB1, TYROBP, FTL, COTL1, CSF1R 
##     HLA-DQA1, FCER1G, CTSS, LAPTM5, ATPIF1, SRGN, C1orf162, MNDA, C5AR1, RGS2 
##     MS4A6A, DAB2, HLA-DQB1, HLA-DRB5, MSR1, MS4A7, SLC7A7, MRP63, CLEC7A, ATP5I 
## PC_ 5 
## Positive:  ATP5G1, ATP5G3, ATP5L, KCNJ1, MFSD4, TMEM52B, KNG1, WNK1, SLC12A1, ATP5F1 
##     MMP24-AS1, ATP5C1, TMEM72, MUC15, RHOBTB3, USMG5, RP11-532F12.5, CYS1, ATP1B1, ATP5G2 
##     ERBB4, PROM2, WNK4, KCNJ16, POU3F3, LINC01158, SERPINA5, ATP5J, ATP5B, CLDN16 
## Negative:  GNG11, EGFL7, IFI27, RNASE1, TEK, SLC9A3R2, RAMP2, HYAL2, CALCRL, HEG1 
##     PTPRB, LMO2, SLCO2A1, FLT1, RAMP3, IGFBP5, CLEC3B, CD34, SOX18, SHANK3 
##     TM4SF18, RASIP1, NOSTRIN, TMEM88, LRRC32, TGFBR2, FAM167B, CD93, KDR, PODXL

ElbowPlot(Seurat_object) # we will look at this if we have time

Seurat_object <- FindNeighbors(Seurat_object, dims = 1:10) # find cells that are similar to each other

## Computing nearest neighbor graph

## Computing SNN

Seurat_object <- FindClusters(Seurat_object, resolution = 0.5) # find clusters of cells

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 800
## Number of edges: 19357
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8838
## Number of communities: 8
## Elapsed time: 0 seconds

Seurat_object <- RunUMAP(Seurat_object, dims = 1:10) # reduced-dimension values for visualization

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 16:44:48 UMAP embedding parameters a = 0.9922 b = 1.112

## 16:44:49 Read 800 rows and found 10 numeric columns

## 16:44:49 Using Annoy for neighbor search, n_neighbors = 30

## 16:44:49 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 16:44:49 Writing NN index file to temp file C:\Users\User\AppData\Local\Temp\Rtmpm8bTW8\fileaf0659f4b5
## 16:44:49 Searching Annoy index using 1 thread, search_k = 3000
## 16:44:49 Annoy recall = 100%
## 16:44:49 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:44:50 Initializing from normalized Laplacian + noise (using RSpectra)
## 16:44:50 Commencing optimization for 500 epochs, with 27330 positive edges
## 16:44:53 Optimization finished

DimPlot(Seurat_object)

Usually, this is where bioinfomaticians and scientists will need to spend a fair amount of time trying to define the identity of cells in their dataset based on the gene expression of cells in these clusters (also know as cell-type marker expression). Fortunately for us, the KPMP dataset comes with cell type annotations.

DimPlot(Seurat_object, group.by = "ClusterClass") # now we are using "ClusterClass" to define our cells

We see above our cell clusters labelled by the KPMP “ClusterClass” labels, which reveals some good separation between our cell types.

Module project

For our project, we will look for celltype-specific differentially expressed genes (DEG’s) between AKI and control samples. To focus on a specific celltype of interest, we will first subset our data. You can pick any of the four “ClusterClass” cell types, though stromal cells may be harder to interpret biologically.

Test_object <- subset(Seurat_object, subset = ClusterClass == "immune cells") # subset for immune cells
Test_object # note only 200 cells ("samples") are left

## An object of class Seurat 
## 30395 features across 200 samples within 1 assay 
## Active assay: RNA (30395 features, 2000 variable features)
##  3 layers present: counts, data, scale.data
##  3 dimensional reductions calculated: harmony, pca, umap

Now we can perform differential gene expression testing. Ideally, we would first summarize (add) all of the counts per patient per gene in a step call pseudobulking. This would better represent the degrees of freedom/sample size we actually have. For our purposes, we will focus on cell-level differences. We use the “FindMarkers” function to find differentially expressed genes. By default, it will use the Wilcoxon rank sum test to test for statistical significance.

Idents(Test_object) <- "sampletype" # we set "sampletype" as the active identity via "Idents()", this lets Seurat know we're using the disease state information stored in the "sampletype" variable to define our comparison groups

results <- FindMarkers(Test_object, # our data
                       ident.1 = "AKI", # what group are we looking for DEG's for
                       logfc.threshold = 0.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
                       min.pct = 0.1, # what fraction of cells do we want to express a gene in order for us to consider testing it
                       only.pos = F) # do we want only genes that are enriched/greater in "ident.1"

results[1:5,] # see first 5 results

##                p_val avg_log2FC pct.1 pct.2    p_val_adj
## CXCR4   3.497730e-16 -10.944638  0.37  0.81 1.063135e-11
## GNB2L1  1.749737e-13 -21.847774  0.01  0.45 5.318325e-09
## BTG1    2.030485e-13  -7.667475  0.60  0.87 6.171660e-09
## TSC22D3 2.331369e-12 -13.859380  0.43  0.74 7.086195e-08
## FKBP5   2.402719e-12  -9.922413  0.07  0.50 7.303065e-08

And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers (and/or maybe how we subset or prepare our data in general) based on what kind of DEG’s we’re looking for, what context we’re studying, and more.

We can how take a look at the genes that turn up in our result.

sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
neg_results <- sig_results[sig_results$avg_log2FC < 0,] # filter for down-regulated genes

neg_results[1:5,]

##                p_val avg_log2FC pct.1 pct.2    p_val_adj
## CXCR4   3.497730e-16 -10.944638  0.37  0.81 1.063135e-11
## GNB2L1  1.749737e-13 -21.847774  0.01  0.45 5.318325e-09
## BTG1    2.030485e-13  -7.667475  0.60  0.87 6.171660e-09
## TSC22D3 2.331369e-12 -13.859380  0.43  0.74 7.086195e-08
## FKBP5   2.402719e-12  -9.922413  0.07  0.50 7.303065e-08

The significance p values of our tests are in the “p_val” and “p_val_adj” columns, with the latter being corrected for multiple comparisons, which is very important when you are testing thousands of genes in one analysis.”avg_log2FC” is the log- fold-change of gene expression between our key group (AKI immune cells, in this case) and our reference/control group (healthy/living donor immune cells). The log is base 2, so a value of -10 means a 2^-10 fold change between AKI and control cells (or a 1024x decrease). However, note that it is sometimes easy to get inflated fold-change values when gene expression for one group is near 0. Finally, pct.1 and pct.2 are the percentage of group1 (in this case, AKI immune cells) and group2 (control immune cells) cells that express the gene at a level > 0. For example, 37% of AKI immune cells express CXCR4.

For this example, we see that many of our differentially expressed genes are down-regulated. We can visualize gene expression values with plotting functions in Seurat.

VlnPlot(Test_object, features = c("CXCR4", "BTG1"))

Finally, we can save our results. CSV is a common format for saving tables.

write.csv(neg_results, "./neg_results.csv") # note you don't actually need the "./" part here

And we are done! If we have time, we will try to take a look at these genes via enrichment analysis on https://maayanlab.cloud/Enrichr/.

Introduction to bioinformatics (2025 PCL1491)

Yuxiao (Sonny) Chen, PhD

June 05, 2025

Setting up

Analysis with Seurat

Module project