class7_largedata

Book chapters for today

large files with H5 (accadental last week!)

Super computer did not work last week! Here is why:

Normally R puts everything into memory for fast processing, but single cell RNAseq is very often too big to fit on the processor. The package solved it automatically for me by an out-of-memory representation (H5):

H5 objects solve memory limit problems by

    1. HDF5 breaks the file up into little chunks, which are kept on storage
    1. HDF5 creates a H5 index file that is loaded on the processor
    1. The processor uses the H5 index to only load what is being processed or about to be processed

large files with H5 (accadental last week!)

H5 is usually easy and awesome! But, problems with H5 are:

    1. H5 and HDF5 does not work well on windows machines;
    1. stuff like last week: I accadently had file chunks loading from my personal computer. The index was pointing from OSC to my home computer.

Still, using H5 is recommended generally

H5 alterantive: large files with k-nn or k-means

The other option is to summarize data into many centroids via k-means or k-nearest neighbor.

The example to the right shows the sce richard data with an increasing number of k nearest neighbor centeroids. As k gets larger, most of the data can be summarized into proximal centeroids.

Some approaches (e.g. metacell) just combine the data from nearby cells. Other approaches, like BioCNeighbors, creates a centeroid index file to retrieve information for cell close to chosen centeroid(s).

approx_sce <- findKNN(sce, k=100, BNPARAM=AnnoyParam())

Week 1: Installing packages and loading files

There are two main package repositories for us, CRAN (general R stuff) and bioconducter (biology stuff)

From CRAN

install.packages("ggplot2")

From Bioconducter

BiocManager::install("DESeq2")

(we also learned how to load objects and the package::function thing)

To load a r object, you can do it with a read_rds() command:

Acute_1 <- read_rds("1_acute.rds")

Week 2 - singleCellExperiment

The SingleCellExperiment objcet class is similar to many, many bioinfomatic structures.

(also we learned graphing in R via ggplot2)

  • Each data (or meta data) is stored in an assays slot
  • Rows are genes (access with rowData() )
  • Columns are cells (access with colData() )
  • Each assay has different features (gene counts, normalized counts, etc) or info about the cells (i.e. embeddings)

Week3: Quality Control

There are inherent chalanges to processing single cell RNAseq including sparsity and dead cell contamination.

A strict cut off can exclude real cells.

(we also learned pipes |>, dplyr filtering, and more ggplot2)

Week4: clustering

How we identify cell groups, but useful in most bioinfomatics. We discussed three ways:

1. Graph-based clustering

(connects cells to closest k cells. A high K gives few clusters, small K gives many.)

Pro: works on big and small datasets

con: inflates clusters in high density areas.

2. Vector quantization/k means

(connects cells to closest centroid of k total)

Pro: quick!

con: does not give reliable results - use for over/under clustering

3. Hierarchical clustering

(make a tree, cut tree for clusters)

Pro: accurate and tunable

con: computationally expensive. use vector/k-means first

Week5: differential gene expression

Different averages (mean, median, max, min) and statistics (AUC, Coehn’s FC) give different, but equally valid results. A p value is not valid.

Minimum rank with a log fold change threshold is the most stringent.

# r code
library(SingleCellExperiment)
library(scater)
library(tidyverse)

1sce <- read_rds("sce_dc.Rds")
markers<- scoreMarkers(sce,
                       sce$CellType ,
                       lfc=2)

2markers$DC6 |>
  data.frame() |>
  arrange(desc(min.AUC)) |>
  round(2) |>
  rownames_to_column(var="Symbol") |>
  write_csv("DC6.csv)
1
Top chunk runs scran::scoremarkers()
2
Bottom chunk prepares the table on the right

Week 5-6: Cell type and cell cycle annotation -

There are many ways each with their own limitations. We used a spearman’s correlation between two datasets

library(celldex)
library(SingleR)
ref <- BlueprintEncodeData()
pred <- SingleR(test=sce, ref=ref, labels=ref$label.main)

week 7: Large data and multi-sample!

Fundamental issues with comparing two single cell RNAseq datasets:

  1. Data become very large (>1 Gb), too large to fit on a processor directly, which is how R works. We have already encountered this problem.
  2. Wet lab processing, and thus quality control, cannot be identical between samples. Even if tissue processing, cell culture, etc were identical, and they are usually not, there are still issues of the number of reads per cell, PCR efficency during library generation, and more.
  • We already mentioned that quality control and normalization has no one size fits all solution!

paper

Single-cell transcriptomics identifies prothymosin α restriction of HIV-1 in vivo

getting the data

  1. I went to the paper, Copied the GSE number

  1. I searched the GSE233747 on the GEO website site and clicked the link

  1. From each sample page, I scrolled down and downloaded each of the three files, re-named them, and put them in a folder

general pipeline

For Thursday worksheet, you will get the pipeline below, and multiple samples. Then, you will need to run the pipeline on all three samples and either cluster or test differential expression.

  # Class 3 - Quality control . 
  df <- perCellQCMetrics(sce)
  outliers <- (isOutlier(df$sum, type="lower", log=TRUE) | isOutlier(df$sum, type="higher", log=TRUE))
  
  sce <- sce[, !outliers]
  
  # Normalization and feature selection -skipped
  sce <- logNormCounts(sce)
  dec <- modelGeneVar(sce)
  hvg <- getTopHVGs(dec, prop=0.1)
  
  # PCA - skipped
  set.seed(1234)
  sce <- runPCA(sce, ncomponents=25, subset_row=hvg)
  
  # class 4 - Clustering - class5
  set.seed(100)
  sce$clusters <- clusterCells(sce, 
                               use.dimred="PCA", 
                               BLUSPARAM=KmeansParam(centers=5))
  
  rownames(sce) <- rowData(sce)$Symbol
  
  
  labels_for_cell_id <- SingleR(test=sce, ref=ref, labels=ref$label.main)
  sce$pred <- labels_for_cell_id$labels

  
  # Visualization.
  sce <- runTSNE(sce, dimred = 'PCA')
  sce <- runUMAP(sce, dimred = 'PCA')

It is better to make your pipeline into a function

Reminder: in R, it is very easy to make a new function. For Thursday, you will be putting in a singleCellExperiment (sce) into the pipeline and getting out a singleCellExperiment. Life will be better if you make a function out of your script.

myfunction <- function(sce){

  # the rest of the script goes here 

return(sce)
}

Then you can run the script cleanly rather than copy and pasting everything

healthy |> myfunction() -> healthy
acute1 |> myfunction() -> acute1
acute2 |> myfunction() -> acute2

Batch effects from reagents drive TSNE/UMAP embeddings

ScRNAseq assays usually generate data across multiple batches - either replicates, conditions, or getting online data. That creates uncontrollable problems: changes in wet lab processing, changes in reagent quality, etc.

Thus, there are usually systematic differences between replicates, refereed to as “batch effects.” Examples:

The following 3 PBMC scRNAseq were done by the same group, but at different times.

The gray T cells (and all other cell types) separate by batch and cell type

The following 2 bone marrow were done at the same time with same reagents

The missing cells in the knockout are naive B and T cells, and more T mem cells

Normalizing for batch effects on the embedding

How to batch correct is an evolving field. Usually, scRNA datasets are first scaled and then options include:

Maybe scale the data

Pro: you don’t make up results

Con: embedding will reflect reagents, not biology

library(batchelor)
combined <- correctExperiments(healthy=healthy,
                               acute_1=Acute_1,
                               acute_2=Acute_2,
                               acute_3=Acute_3,
                               PARAM=NoCorrectParam())
combined <- runPCA(combined)
combined <- runTSNE(combined, dimred="PCA")

For each gene, scaled down value until equal to lowest mean across batches.

pro: Good when you know populations should be the same

con: Assumes same cells are there in all the samples & only kind of works.

rescaled <- rescaleBatches(healthy, 
                           Acute_1,
                           Acute_2,
                           Acute_3)


rescaled <- runPCA(rescaled,
    exprs_values="corrected",
    BSPARAM=BiocSingular::RandomParam())

rescaled <- runTSNE(rescaled, dimred="PCA")
rescaled$batch <- combined$batch
rescaled$pred <- combined$pred

Run K nearest neighbor, but allow closest cells across batches. Move neighbors together.

Pro: it works, and is tunable by changing K (more K is more together)

Con: You made up your results

mnn.out <- fastMNN(healthy,
                   Acute_1,
                   Acute_2, 
                   Acute_3, 
                   d=50, k=10, 
                   BSPARAM=BiocSingular::RandomParam(deferred=TRUE))

mnn.out <- runTSNE(mnn.out, dimred="corrected")
mnn.out$pred <- combined$pred
mnn.out$batch <- combined$batch

Differential expression between samples

You can not use the scaled values for differential expression (DE). They are only for embedding

Differential expression (DE) analysis is built upon bulk RNAseq, which requires replicates.

It is popular to not do replicates in single cell RNAseq, which would break budgets. That does not change the fact that it is not appropriate to treat cells as replicates and then apply a p value:

  • biological replication occurs at the sample level, not the cell level that has different inherant variance
  • we already ran statistics to get clusters. You would be double dipping

Pseudoexpression analysis

A solution is to collapse clusters into pseudo replicates, and do traditional differential expression analysis across replicates.

summed <- aggregateAcrossCells(merged, 
    id=colData(merged)[,c("pred", "batch")])
summed

Bibliography