This part 4 trying to correct for version conflicts in the code tutorials to get the machine learning algorithms for unsupervised learning to work by flattening the layers added in version 5 of Seurat.

We are using GSE318371, a recent gene study on NCBI that analyzes the aggressive Natural Killer T-Cell Lymphoma vs. healthy samples using single cell RNA (scRNA) Sequencing, of peripheral blood mononuclear cells (PBMCs).

Looking at the EDA of exploratory data analysis with new package for this thing on natural killer t-cell lymphoma associated by EBV infection. This study is GSE318371 within NCBI database of gene expression studies. I extracted the custom download option of the GSE318371_RAW.tar file scrolled at end of page on this series. The extraction of barcodes takes a very long time once downloaded for each file and you can avoid that process and leave them in zip form with the .tsv.gz name but remove the prepended file name in front of ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx’ but put these files with respective GSM sample ID into its own file to read in for each file folder.

This is a very recent February 2026 uploaded research on aggressive lymphoma associated with EBV infection. I found others and they include the nasopharyngeal carcinoma and Hodgkin and large B-cell lymphomas. But for now we work on this project to get our top genes for our machine to predict EBV, Lyme disease, or specific associated EBV pathology of multiple sclerosis, mononucleosis, primary EBV infection, fibromyalgia, as well as various lymphomas and nasopharyngeal carcinoma.

There is a process that has to be followed to extract each sample information of barcodes, features, and cells. This is array gene expression data where an array of many cells is input into a machine and each cell within the array is ran to count the number of times a gene appears. The barcodes are the cells and the features are the genes in matrix format. This is what single cell RNA sequencing is.

It is useful to review or get familiar with the Seurat package for this type of data as it is the go to method of handling unsupervised gene expression data which is basically what finding gene targets is, finding the genes to make target and good predictors on unseen data. But our data is known and the sample classes are seen therefore, and will still allow us the functionality to explore the unsupervised machine learning models in finding our top 10 genes.

This last section follows through with a tutorial found online video, great to follow along and learn a little bit or review the material in these type of data sets on gene expression data with single cell RNA sequencing data.

library(Seurat)

## Loading required package: SeuratObject

## Loading required package: sp

## 
## Attaching package: 'SeuratObject'

## The following objects are masked from 'package:base':
## 
##     intersect, t

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd(path)

list.files(path)

##  [1] "merged.RDS"                                            
##  [2] "mergedSamples.RDS"                                     
##  [3] "patient14"                                             
##  [4] "patient16"                                             
##  [5] "patient18"                                             
##  [6] "patient19"                                             
##  [7] "patient20"                                             
##  [8] "patient22"                                             
##  [9] "patient25"                                             
## [10] "patient27"                                             
## [11] "patient30"                                             
## [12] "patient31"                                             
## [13] "patient36"                                             
## [14] "patient37"                                             
## [15] "patient38"                                             
## [16] "patient39"                                             
## [17] "patient41"                                             
## [18] "patient44"                                             
## [19] "patient51"                                             
## [20] "patient52"                                             
## [21] "patient6"                                              
## [22] "patient9"                                              
## [23] "sampleHD1"                                             
## [24] "sampleHD10"                                            
## [25] "sampleHD11"                                            
## [26] "sampleHD12"                                            
## [27] "sampleHD2"                                             
## [28] "sampleHD3"                                             
## [29] "sampleHD4"                                             
## [30] "sampleHD5"                                             
## [31] "sampleHD6"                                             
## [32] "sampleHD7"                                             
## [33] "sampleHD8"                                             
## [34] "sampleHD9"                                             
## [35] "Seurat_filteredNormalized10kScaledLogScaled2kFeatures" 
## [36] "Seurat_filteredNormalized10kScaledLogScaled2kFeatures1"

set.seed(123)

The above files are in our RDS_objects folder. It can be set within the chunk but reverts back to directory outside the Rmarkdown notebook chunk when it is done running that chunk.

Lets read in the mergedSamples.RDS of our samples merged together in a 400k row of genes by 4 columns. I went back to the original merged data of the samples as Seurat objects from earlier part of project and added in a parameter in the merge function ‘merge.data=TRUE’ it might not make a difference, but this code works after another small change when normalizing the data and only normalizing one time, not twice like biostatsquid said to do to log transform it, although biostatsquid did do a good job adding in additional context to the code and was almost exactly like the Single Cell Genomics, Transcriptomics, & Proteomics code tutorial on youtube.

If you want it to behave like Seurat 4, the AI help online said to go back to when merging the samples of Seurat objects and to add merge.data=TRUE merged <- merge(sample1, sample2, merge.data = TRUE)

# merged <- merge(patient14, y=c(patient16, patient18, patient19,patient20 , patient22 , patient25 , patient27 , patient30,  patient31,  patient36 , patient37 , patient38 , patient39 , patient41 ,
# patient44,  patient51,  patient52,  patient6 ,  patient9,   sampleHD1 , sampleHD10,
# sampleHD11, sampleHD12, sampleHD2 , sampleHD3 , sampleHD4 , sampleHD5,  sampleHD6 ,
#  sampleHD7 , sampleHD8 , sampleHD9 ), add.cell.ids = ls()[1:32], merge.data=TRUE)

setwd(path)
#merged <- readRDS('mergedSamples.RDS')
merged <- readRDS('merged.RDS') #added extra line of merge.data=TRUE

Layout expanded of lager Seurat object, merged, screen1 Layout of expanded Large Seurat object, merged, screen 2 The above screen shots show the layout of this large Seurat object expanded out with the layers and additional information that is extracted individually by matrix @meta.data or by @assays for the RNA meta data of gene names as features to add to the matrix of barcodes.

We can see the details of this object by entering its object name.

merged

You can see it has labels by counts for samples. There are 30,960 genes AKA features and 407,006 cells AKA samples or barcodes in this 1 assay with layers. There are 32 layers. Seurat version 5 added layers from Version 4 that the tutorials use in Single Cell Genomics, Transcriptomics, & Proteomics youtube videos and in Biostatsquid youtube channel.

str(merged)

There are many attachments to this Seurat object, not just the matrix of merged columns with merged@meta.data but also in a $chr element list in this array as [1:30960] showing ENSEMBL IDs. This will be useful when looking for top genes and getting labels as well as this next code extracting the mitochondrial DNA with regex to make a percent of mitochondrial DNA column.

dim(merged@meta.data)

In the meta.data table there are 3 columns and 407,006 barcodes for cells of the array.

colnames(merged@meta.data)

The column names of merged give the original identity of GSM ID, the counts of RNA that showed up for each barcode cell, and the number of genes as features that the barcode was part of in this single cell RNA sequencing project.

#quality control (QC)

When using quality control you don’t want a low count as the sequencing is probably altered or a low, low number unique genes indicates cell dead, dying, or empty droplet, and if other cells have higher counts and this could be an error within that cell, and you don’t want doublets or multiplets of genes producing high counts that are very large compared to the average counts of cells because that could mean the cell has 2 cells in it as the genes stuck together and giving a higher count. And for mitochondrial DNA, it is not located in the cytoplasm where most RNA ready to be translated into a protein occurs, but could be due to apoptosis of the cell and the mitochondrial membrane ruptured and leaking out its DNA or the cell is damaged and dying. So a high percentage of mitochondrial DNA distorts the counts as well. Mitochondrial DNA is only transcribed in the mitochondrial.

##perc at

merged[['percent_mt']] <- PercentageFeatureSet(merged, pattern = '^MT-')

head(merged@meta.data)

meta.data table in large Array Seurat Object

image of first few rows of merged@meta.data

The nCount is total number of molecules in a cell The nFeature is the total number of unique genes in each cell

This violin plot VlnPlot() is used as a Seurat library plot.

VlnPlot(merged, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol=3)

violin Plot

We use the above plots of violin plots to compare the scatter and use those above lowest threshold of all scatter and not including less scatter above the highest theshold in common. For counts, visually estimated, the lowest threshold is 0 and highest is 2,000. For Features it looks like lowest threshold is about 100 and highest is around 4,000. The percent mitochondrial DNA looks like around 6-8 %.

This next plot is for a regression line to identify thresholds to filter out low and high counts and features as well at mt percent, the number of molecules on x-axis as counts of cells and the number of genes as features on y-axis.

You don’t want to have cells to be in lower right corner under regression line because it means that high molecule count but low unique features, capturing low genes and sequencing over and over again givig high count, and not in upper left corner bc high genes but low count because not deeply sequence enough, always check for this to see that it is due to low quality cells and not due to an artifact or sequencing error. That could ruin downstream analysis.

FeatureScatter(nsclc_seu, feature1 = “nCount_RNA”, feature2 = “nFeature_RNA”) + geom_smooth(method = ‘lm’)

library(ggplot2)

FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+

# Control axis limits using ggplot2 functions

      xlim(0, 30000)+    # Set X-axis range
      ylim(0, 16000)     # Set Y-axis range

Scatter Plot with liner regression line overlay

Looks like around 20,000 count of RNA and about 5,000 features as thresholds to filter out for quality control to select best genes leading to cell differentiation in this pathology. This scatter follows a log function so I think it was already normalized. We can place the log function over the scatter.

FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm', , formula = y ~ log(x))+

# Control axis limits using ggplot2 functions

      xlim(0, 30000)+    # Set X-axis range
      ylim(0, 16000)     # Set Y-axis range

Scatter Plot Logistic Regression Line overlay

It looks like the scatter are the shape of a log curve when compared to the linear regression line, but when fitting a log function over the scatter, the curve fell flat early and missed many scatter after about 4,000 count RNA. Lets proceed with filtering based on the linear fit model and thresholds estimated.

#filtering At this point, I am testing whether you need to keep the object the same name and store each change with the object, because each command made to the object is stored in its array of data in a separate list of changes made to the object. I originally kept renaming the object and removing the previous object after making changes to it like filtering it, normalizing it, and scaling it, then downstream error produced with inability to scale data with the ScaleData function. Lets see. They do explain in a few videos seen that each change you make is stored in the object maybe to revert back to original without removing all and reuploading.

merged <- subset(merged, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent_mt <8 & nCount_RNA < 18000)

You can make your own thresholds on how you interpret visualizations The file size shrank from 11 GB to 9.8 GB file size as it filtered out those barcodes not meeting this criteria.

Normalize data after removing filtered out cells

***** knitr doesn’t allow this chunk 17 to run so it will be print screens but everything ended up running fine in RStudio after making those AI internet solution corrections. *****

The other youtube video also shows this same exact code as this video. Both videos use the same exact Seurat R code commands and workflow. Normalization by default, use global scaling normalization of log normalize, default scaling of 10,000, and then get a log transformed result.

merged <- NormalizeData(
  merged,
  normalization.method = "LogNormalize",
  scale.factor = 10000,
  layer = "counts",       # Input layer
  #new.layer = "lognorm"   # Output layer
)

# the program doesn't use the 'new.layer' and posts it as a warning
# "The following arguments are not used: new.layer"

The file size increased to 19.5 GB in size after normalizing and log scaling it from 9.8 GB.

merged

An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 0 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328

There are now 64 layers instead of 32 layers, the additional data labeled layers are where the Normalized and scaled values are stored along with the count labeled layers in the object name > assays > RNA > layers.

There are now fewer cells from 407,006 to 370,409 that were kept after filtering based on visuals of Violin Plot and the scatter plots with residual regression best fit line.

merged@assays$RNA

Assay (v5) data with 30960 features for 370409 cells First 10 features: ENSG00000238009, ENSG00000241860, ENSG00000290385, ENSG00000291215, LINC01409, ENSG00000290784, LINC00115, LINC01128, ENSG00000288531, FAM41C Layers: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328

The first 10 features are listed when looking at the RNA under the assay tab.

Here is my original screenshot of RNA contents arrangement in Seurat object once merged. Layout expanded of lager Seurat object, merged, screen1

The following shows the sparce matrices for the first 5 layers by counts. This list could be very large if not limited. There are only 1s where the gene is present in the cell or barcode column of assay, the 0s are held with the period or dot.

merged@assays[["RNA"]]@layers[1:5]

$counts.GSM9493334 23499 x 8732 sparse Matrix of class “dgCMatrix”

[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [2,] . . . . . 1 . 1 . . . . . . . . . . . . . . . . . . . . . . …… [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [5,] . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . …… [6,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [7,] . . . . . . . 1 . . . . . . 1 . . . . . . . . . . . . . . . …… [8,] . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . …… [9,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [10,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [11,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [12,] 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [13,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [14,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [15,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [16,] . . . . 2 . . 2 . . . . . . . . . . . . . . . . . . . . . . …… [17,] 2 . . 1 . 2 1 . 2 2 . . . 10 . 2 . . . 2 . . . 2 1 . . . . . ……

………………………… ……..suppressing 8702 columns and 23466 rows in show(); maybe adjust options(max.print=, width=)

…

$counts.GSM9493335 25402 x 8847 sparse Matrix of class “dgCMatrix”

[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [2,] . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . …… [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [5,] . . . . . . . . . . . . . . . . . 1 . . . 1 . . 1 . . . . . …… [6,] . 1 . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . …… [7,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [8,] . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . …… [9,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [10,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [11,] . 1 . . 1 . . . . . . . . . . . . . 1 . . . . 1 . . . . 1 . …… [12,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [13,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [14,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [15,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [16,] 1 . . 2 . . 1 . . 1 1 . 1 1 . . . . . 2 . 1 . . . . . . 1 . …… [17,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ……

………………………… ……..suppressing 8817 columns and 25369 rows in show(); maybe adjust options(max.print=, width=) …………………………

[25387,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25388,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25389,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25390,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ……

This was copy and pasted due to knitr not letting code run past the Normalization chunk of R code. You can see not all features are counted in the cells shown. Here is a snippet. sparse matrix snippet

#gives sparse matrices for each GSM sample
#merged@assays[["RNA"]]@layers$counts #sparse matrices
#merged@assays[["RNA"]]@layers$data #null
merged

There aren’t any or there are 0 variable features from out put of object working with above but we will change that next code chunk.

identify highly variable features

The default is to return 2000 features per dataset, and ‘vst’ stands for ‘variance stabilizing transformation’ as a default as well.

merged <- FindVariableFeatures(merged, selection.method = 'vst', nfeatures = 2000)

High Variability Genes - Targets to Cell Differentiation in Aggressive Natural Killer T-cell Lymphoma Short snippet above of the grids per sample calculated to find the high variability genes.

Now look at the merged object to see that variable features should show 2000.

merged

An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 2000 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328

There are 30,960 features in 370,409 samples and 2000 variable features now after finding for the variable features. Before doing the findVariableFeatures command or transformation, there were 0 variable features.

We should limit the list as it will print out a lot of the values. But only those features in the list of 2000 highly variable features or having the highest differential expression values where cells changed the most from healthy to pathological state are listed by name, all others are listed as ‘NA’ values.

merged[["RNA"]]@meta.data$var.features[1:500]

snippet 500 genes with only top 2000 labeled

identify the top 10 highly variable genes, this builds a scatter plot that will identify and label the scatterpoints belonging to the top 10 genes with highest variability. Average expression on x-axis and Standardized Variance on y-axis.

top10 <- head(VariableFeatures(merged), 10)
top10

[1] “PPBP” “IGLC2” “IGKC” “S100A9” “EREG” “IGLC3” “SOX5” “PF4”
[9] “LYZ” “LINGO2”

top10_plot <- VariableFeaturePlot(merged)
LabelPoints(plot = top10_plot, points = top10, repel = TRUE)

scatter Plot with Labeled Top 10 High Variability Genes

Scaling after finding variable features in dataset.

This corrects for unwanted sources of variation, technical as example with batch effects, or biological such as cell cycle stage or mitochondrial contamination, this ensures results are due to clustering by gene differences and not technological or biological errors.

The ScaleData function is part of standard parameters, there is a more complex single cell Transform function or workflow within Seurat that could be alternatively used.

First to make a character string stored as the rownames of the structural array we have been using mergedFilteredNormalizedScaled. Lets look at the first 100 genes as well.

all_genes <- rownames(merged)

all_genes[1:100]

[1] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385” “ENSG00000291215” [5] “LINC01409” “ENSG00000290784” “LINC00115” “LINC01128”
[9] “ENSG00000288531” “FAM41C” “ENSG00000272438” “NOC2L”
[13] “KLHL17” “PLEKHN1” “ENSG00000272512” “HES4”
[17] “ISG15” “ENSG00000224969” “AGRN” “ENSG00000291156” [21] “C1orf159” “ENSG00000285812” “LINC01342” “TNFRSF18”
[25] “TNFRSF4” “SDF4” “B3GALT6” “C1QTNF12”
[29] “ENSG00000260179” “UBE2J2” “LINC01786” “SCNN1D”
[33] “ACAP3” “PUSL1” “INTS11” “CPTP”
[37] “TAS1R3” “DVL1” “MXRA8” “AURKAIP1”
[41] “CCNL2” “MRPL20-AS1” “MRPL20” “MRPL20-DT”
[45] “ATAD3C” “ATAD3B” “ENSG00000290916” “ATAD3A”
[49] “TMEM240” “SSU72” “ENSG00000215014” “FNDC10”
[53] “ENSG00000286989” “ENSG00000272106” “MIB2” “MMP23B”
[57] “CDK11B” “ENSG00000272004” “SLC35E2B” “CDK11A”
[61] “ENSG00000290854” “NADK” “GNB1” “GNB1-DT”
[65] “CFAP74” “PRKCZ” “ENSG00000271806” “PRKCZ-AS1”
[69] “FAAP20” “ENSG00000234396” “SKI” “ENSG00000287356” [73] “MORN1” “ENSG00000272420” “RER1” “PEX10”
[77] “PLCH2” “PANK4” “HES5” “ENSG00000272449” [81] “TNFRSF14-AS1” “TNFRSF14” “ENSG00000228037” “ENSG00000289610” [85] “PRXL2B” “MMEL1” “TTC34” “PRDM16”
[89] “MEGF6” “ENSG00000238260” “TPRG1L” “WRAP73”
[93] “TP73” “SMIM1” “LRRC47” “CEP104”
[97] “DFFB” “C1orf174” “LINC01134” “AJAP1”

length(all_genes)

[1] 30960

There are a total of 30,960 genes.

before scaling data lets look at the gene names and consider extracting out the gene information with our matrix of 4 columns

The row names of the object, merged, have the gene names, when we looked for the highest variation genes in most differentially expressed for 2000 genes with

merged <- FindVariableFeatures(merged, selection.method = ‘vst’, nfeatures = 2000)

Do they line up to the row names of merged and where are they located.

varFeatures <- merged[["RNA"]]@meta.data$var.features
genes <- rownames(merged)
df <- data.frame(Gene = genes, FeatureVariability = varFeatures)
#head(df,20)
view(df)

screenshot of df

The location of any of the genes if they were a high variability gene top 2000 is listed next to the correct row name in a random 20 sample draw from first 20 samples.

Are there only 2000 non NA values in our varFeatures set to find only 2000?

class(varFeatures)
length(unique(varFeatures))

[1] “character” [1] 2001

There are 2001 unique values in a 30,960 long vector of genes set to be the label if it is one of the 2000 most differentially expressed genes. The additional character is the NA value.

Maybe I can figure out something to do with that instead as this ‘work around’ has already taken more than one day in set backs. And produced no progress to following the work flow.

The VariableFeatures(merged) is how to list only those top 2000 genes we found with the findVariableFeatures function. We can see those by calling that function VariableFeatures(merged) or rather since its more than 1000 to write it to csv and then read that file. But would be nice to see how it is used with its values for pathology and healthy samples. Maybe we can make a vector of those top 10 like we do already but then only keep the matrix of meta.data with those 10 genes to see the counts and number of barcodes express it.

Lets scale the data first then come back to this task.

*** CHANGED TO ALL GENES SCALED FOR PART 5E, BUT RECALLED IT DOESN’T WORK *** There are 30,960 unique gene names. We will scale only the variable features we found of top 2000 most differentially expressed genes.

#merged <- ScaleData(merged, features = all_genes)

merged <- ScaleData(merged, features = VariableFeatures(merged))

Great news! the ScaleData function worked, must have been due to declaring the input and output (that was ignored) in the NormalizeData function and then not adding in the additional NormalizeData function.

Centering and scaling data matrix |===========================================================| 100%

This file is 25.4 GB in size just log transform scaling the 2,000 targeted features. When doing all features it wouldn’t work, but I can try again. This could be why it worked this time due to that edit as well as small changes in the parameter setting of the normalization function.

Lets save this Seurat object we filtered, normalized and scaled by 10,000 then scaled with a log transformation. We only log transformed the 2,000 most differentially expressed features in this set. When trying to scale all 30,960 features, the program couldn’t on this computer and stopped with an error message.

setwd(path)

#this takes a long time on this computer to save, its 25.4 GB in size

saveRDS(merged,"Seurat_filteredNormalized10kScaledLogScaled2kFeatures1")

Make a data frame of all the genes of variable features, only the top 2000 will be ranked.

allGenesDF <- merged[["RNA"]]@meta.data
dim(allGenesDF)

[1] 30960 194 There are 30960 genes and 194 columns for all the 32 samples and 5 other features of each sample. The variable features column will only have the name of the gene is it is one of the 2,000 highest variable feature.

all2k <- subset(allGenesDF, allGenesDF$var.features != "NA")
all2k <- all2k[order(all2k$var.features.rank, decreasing=FALSE),]

write.csv(all2k, 'all2000_DE_genes_NKTL_EBV_ordered.csv', row.names=FALSE)

You can get this data frame here. ***

Lets finish that task before moving onto the analysis portion. I want to extract these top 10 features and make a data frame not a Seurat object that has the gene names attached.

It is interesting because of all the stored information you know the genes are there and you just have to get them. They are in the row names, now we can also subset the Seurat object and make the object selective to only those 10 genes to see how they are expressed in each of our 12 healthy and 20 pathological NKTCL cases to do our regular workflow for supervised machine learning.

m <- merged[which(row.names(merged) %in% top10),]
m

An object of class Seurat 10 features across 370409 samples within 1 assay Active assay: RNA (10 features, 10 variable features) 65 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328, scale.data 1 dimensional reduction calculated: pca

This Seurat object has 10 features only, those that are the top or most differentially expressed genes in a much smaller 284 MB file size with 65 layers, because done after successfully using the ScaleData function, the additional layer beyond the 64 layers, the 65th layer is the layer holding the scale.data values. This is done on the filtered data from 407,006 cells down to 370,409 cells.

unique(m$orig.ident)

[1] “GSM9493334” “GSM9493335” “GSM9493336” “GSM9493337” “GSM9493338” [6] “GSM9493339” “GSM9493340” “GSM9493341” “GSM9493342” “GSM9493343” [11] “GSM9493344” “GSM9493345” “GSM9493346” “GSM9493347” “GSM9493348” [16] “GSM9493349” “GSM9493350” “GSM9493351” “GSM9493332” “GSM9493333” [21] “GSM9493320” “GSM9493329” “GSM9493330” “GSM9493331” “GSM9493321” [26] “GSM9493322” “GSM9493323” “GSM9493324” “GSM9493325” “GSM9493326” [31] “GSM9493327” “GSM9493328”

Those show the top 10 features or genes span all 32 samples of the 12 healthy and 20 pathological NKTC aggressive Lymphoma.

top10Data <- m[["RNA"]]@meta.data
colnames(top10Data)

[1] “vf_vst_counts.GSM9493334_mean”
[2] “vf_vst_counts.GSM9493334_variance”
[3] “vf_vst_counts.GSM9493334_variance.expected”
[4] “vf_vst_counts.GSM9493334_variance.standardized” [5] “vf_vst_counts.GSM9493334_variable”
[6] “vf_vst_counts.GSM9493334_rank”
[7] “vf_vst_counts.GSM9493335_mean”
[8] “vf_vst_counts.GSM9493335_variance”
[9] “vf_vst_counts.GSM9493335_variance.expected”
[10] “vf_vst_counts.GSM9493335_variance.standardized” [11] “vf_vst_counts.GSM9493335_variable”
[12] “vf_vst_counts.GSM9493335_rank”
[13] “vf_vst_counts.GSM9493336_mean”
[14] “vf_vst_counts.GSM9493336_variance”
[15] “vf_vst_counts.GSM9493336_variance.expected”
[16] “vf_vst_counts.GSM9493336_variance.standardized” [17] “vf_vst_counts.GSM9493336_variable”
[18] “vf_vst_counts.GSM9493336_rank”
[19] “vf_vst_counts.GSM9493337_mean”
[20] “vf_vst_counts.GSM9493337_variance”
[21] “vf_vst_counts.GSM9493337_variance.expected”
[22] “vf_vst_counts.GSM9493337_variance.standardized” [23] “vf_vst_counts.GSM9493337_variable”

…

[182] “vf_vst_counts.GSM9493327_variance”
[183] “vf_vst_counts.GSM9493327_variance.expected”
[184] “vf_vst_counts.GSM9493327_variance.standardized” [185] “vf_vst_counts.GSM9493327_variable”
[186] “vf_vst_counts.GSM9493327_rank”
[187] “vf_vst_counts.GSM9493328_mean”
[188] “vf_vst_counts.GSM9493328_variance”
[189] “vf_vst_counts.GSM9493328_variance.expected”
[190] “vf_vst_counts.GSM9493328_variance.standardized” [191] “vf_vst_counts.GSM9493328_variable”
[192] “vf_vst_counts.GSM9493328_rank”
[193] “var.features”
[194] “var.features.rank”

There are 194 columns and they are divided into each sample but have the same column metric per sample of 6 metrics each. There are 32 samples times 6 metrics, that is 192 columns.

“vf_vst_counts.GSM9493334_mean”
“vf_vst_counts.GSM9493334_variance”
“vf_vst_counts.GSM9493334_variance.expected”
“vf_vst_counts.GSM9493334_variance.standardized”
“vf_vst_counts.GSM9493334_variable”
“vf_vst_counts.GSM9493334_rank”

Our table with genes are listed as column 194 at the end of the columns with rank by gene across all samples for how differentially expressed that gene was across all samples as last column 195. So that there will be a data frame of 10 top genes and show how they are expressed across each sample from the sequencing data of single cell RNA sequencing information in the array.

DataFrame10 <- m[['RNA']]@meta.data
DataFrame10[,c(1:12,185:194)]

Here is a snippet of the DataFrame10 first and last columns in the viewer. beginning columns of top 10 genes of all samples

I only selected the first 10 columns and last 9 columns to show the samples those features explain. Otherwise, a huge amount of scrolling would be needed to look at all the 195 columns in this data for our top 10 most differentially expressed genes.

Lets write this out to csv and go back to the analysis.

write.csv(DataFrame10, 'top10DE_genes_NKTCL_EBV.csv', row.names=FALSE)
#write.csv(DataFrame10, 'top10DE_genes_NKTCL_EBV-UNSCALED.csv', row.names=FALSE)

We won’t use that now and maybe won’t at all. But it is good to have. Here is a copy of it if you would like it, access it here. I went through and separately made an unscaled to log transformation version of top 10 most differentially expressed genes to access here.

rm(m,df,all_genes, DataFrame10, top10Data)

=================================================================

We can stop and then return to read back in the Seurat object where we left off.

merged <- readRDS("Seurat_filteredNormalized10kScaledLogScaled2kFeatures1")

It reads in the same file size of 25.4 GB.

Analysis after filtering, normalizing, retrieving the 2,000 most differetially expressed genes, and scaling the data

#Dimensionality Reduction with PCA allow us to visualize multidimensional datasets, with common PCA principal component analysis, summarizing all gene expression data over many genes into principal components (the variation between scatterpoints are the components, variations off the regression lines or dimension, adding components to top components will give most variation and explain most of data).

Loadings is a term for PCA that explains a gene that is strongly associated with that principal component. They are ordered with PC1 most explanative and subsequent add to that explanation in order of best.

The positive and negative genes are named for how ever many features you set for the number of PCs that explain the most.

merged <- RunPCA(merged, features = VariableFeatures(merged))

print(merged[['pca']], dims = 1:5, nfeatures=5)

PCA 1st screenshot PCA 2nd screenshot

merged@assays$RNA

Assay (v5) data with 30960 features for 370409 cells Top 10 variable features: PPBP, IGLC2, IGKC, S100A9, EREG, IGLC3, SOX5, PF4, LYZ, LINGO2 Layers: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328, scale.data

We can take a look at all commands for this object saved when we saved it and with the PCA command added. We go to the ‘commands’ tab to expand on those commands and see the elements each has, there will be 4 commands after running the PCA command.

commands history on object

The heatmap will have the explanatory genes next to the rows as labels and the visual will show the expression profiles for these genes in the first PC or which ever PC is selected.

DimHeatmap(merged, dims = 1, cells = 500, balanced = TRUE)

Heatmap Top Differentially Expressed Genes after PCA

The dim plot is a scatter plot of a named PC for principal component, against another PC You will see cells group in clusters by PC comparison 1 by 1.

DimPlot(merged, reduction = "pca") + NoLegend()

clustering color coded after PCA

Determine dimensionality of the data with Elbow Plot

This type of plot, elbow plot, shows standard deviation on y-axis and the principal components or PCs on x-axis to see the threshold limit of number of dimensions or PCs to use, it starts high and drops until plateuing at elbow bend, won’t be useful to use additional that add little variation and more computing.

ElbowPlot(merged)

Elbow Plot of most DE PCAs to use explaining most variation

===================================================

This one seemed like it froze computer but was just really slow, took 1 1/2 hours to run, in console it was running, but just faded and spinning circle in the source code window of document.

Clustering using K-Nearest Neighbor

Basically the lower resolution the least clusters and higher resolution has more clusters

merged <- FindNeighbors(merged, dims = 1:7) #based on elbow plot above
merged <- FindClusters(merged, resolution = c(0.1, 0.3, 0.5, 0.7,1))