This part 4 trying to correct for version conflicts in the code tutorials to get the machine learning algorithms for unsupervised learning to work by flattening the layers added in version 5 of Seurat.
We are using GSE318371, a recent gene study on NCBI that analyzes the aggressive Natural Killer T-Cell Lymphoma vs. healthy samples using single cell RNA (scRNA) Sequencing, of peripheral blood mononuclear cells (PBMCs).
Looking at the EDA of exploratory data analysis with new package for this thing on natural killer t-cell lymphoma associated by EBV infection. This study is GSE318371 within NCBI database of gene expression studies. I extracted the custom download option of the GSE318371_RAW.tar file scrolled at end of page on this series. The extraction of barcodes takes a very long time once downloaded for each file and you can avoid that process and leave them in zip form with the .tsv.gz name but remove the prepended file name in front of ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx’ but put these files with respective GSM sample ID into its own file to read in for each file folder.
This is a very recent February 2026 uploaded research on aggressive lymphoma associated with EBV infection. I found others and they include the nasopharyngeal carcinoma and Hodgkin and large B-cell lymphomas. But for now we work on this project to get our top genes for our machine to predict EBV, Lyme disease, or specific associated EBV pathology of multiple sclerosis, mononucleosis, primary EBV infection, fibromyalgia, as well as various lymphomas and nasopharyngeal carcinoma.
There is a process that has to be followed to extract each sample information of barcodes, features, and cells. This is array gene expression data where an array of many cells is input into a machine and each cell within the array is ran to count the number of times a gene appears. The barcodes are the cells and the features are the genes in matrix format. This is what single cell RNA sequencing is.
It is useful to review or get familiar with the Seurat package for this type of data as it is the go to method of handling unsupervised gene expression data which is basically what finding gene targets is, finding the genes to make target and good predictors on unseen data. But our data is known and the sample classes are seen therefore, and will still allow us the functionality to explore the unsupervised machine learning models in finding our top 10 genes.
This last section follows through with a tutorial found online video, great to follow along and learn a little bit or review the material in these type of data sets on gene expression data with single cell RNA sequencing data.
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd(path)
list.files(path)
## [1] "merged.RDS"
## [2] "mergedSamples.RDS"
## [3] "patient14"
## [4] "patient16"
## [5] "patient18"
## [6] "patient19"
## [7] "patient20"
## [8] "patient22"
## [9] "patient25"
## [10] "patient27"
## [11] "patient30"
## [12] "patient31"
## [13] "patient36"
## [14] "patient37"
## [15] "patient38"
## [16] "patient39"
## [17] "patient41"
## [18] "patient44"
## [19] "patient51"
## [20] "patient52"
## [21] "patient6"
## [22] "patient9"
## [23] "sampleHD1"
## [24] "sampleHD10"
## [25] "sampleHD11"
## [26] "sampleHD12"
## [27] "sampleHD2"
## [28] "sampleHD3"
## [29] "sampleHD4"
## [30] "sampleHD5"
## [31] "sampleHD6"
## [32] "sampleHD7"
## [33] "sampleHD8"
## [34] "sampleHD9"
## [35] "Seurat_filteredNormalized10kScaledLogScaled2kFeatures"
## [36] "Seurat_filteredNormalized10kScaledLogScaled2kFeatures1"
set.seed(123)
The above files are in our RDS_objects folder. It can be set within the chunk but reverts back to directory outside the Rmarkdown notebook chunk when it is done running that chunk.
Lets read in the mergedSamples.RDS of our samples merged together in a 400k row of genes by 4 columns. I went back to the original merged data of the samples as Seurat objects from earlier part of project and added in a parameter in the merge function ‘merge.data=TRUE’ it might not make a difference, but this code works after another small change when normalizing the data and only normalizing one time, not twice like biostatsquid said to do to log transform it, although biostatsquid did do a good job adding in additional context to the code and was almost exactly like the Single Cell Genomics, Transcriptomics, & Proteomics code tutorial on youtube.
If you want it to behave like Seurat 4, the AI help online said to go back to when merging the samples of Seurat objects and to add merge.data=TRUE merged <- merge(sample1, sample2, merge.data = TRUE)
# merged <- merge(patient14, y=c(patient16, patient18, patient19,patient20 , patient22 , patient25 , patient27 , patient30, patient31, patient36 , patient37 , patient38 , patient39 , patient41 ,
# patient44, patient51, patient52, patient6 , patient9, sampleHD1 , sampleHD10,
# sampleHD11, sampleHD12, sampleHD2 , sampleHD3 , sampleHD4 , sampleHD5, sampleHD6 ,
# sampleHD7 , sampleHD8 , sampleHD9 ), add.cell.ids = ls()[1:32], merge.data=TRUE)
setwd(path)
#merged <- readRDS('mergedSamples.RDS')
merged <- readRDS('merged.RDS') #added extra line of merge.data=TRUE
The
above screen shots show the layout of this large Seurat object expanded
out with the layers and additional information that is extracted
individually by matrix @meta.data or by
@assays for the RNA meta data of gene
names as features to add to the matrix of barcodes.
We can see the details of this object by entering its object name.
merged
You can see it has labels by counts for samples. There are 30,960 genes AKA features and 407,006 cells AKA samples or barcodes in this 1 assay with layers. There are 32 layers. Seurat version 5 added layers from Version 4 that the tutorials use in Single Cell Genomics, Transcriptomics, & Proteomics youtube videos and in Biostatsquid youtube channel.
str(merged)
There are many attachments to this Seurat object, not just the matrix of merged columns with merged@meta.data but also in a $chr element list in this array as [1:30960] showing ENSEMBL IDs. This will be useful when looking for top genes and getting labels as well as this next code extracting the mitochondrial DNA with regex to make a percent of mitochondrial DNA column.
dim(merged@meta.data)
In the meta.data table there are 3 columns and 407,006 barcodes for cells of the array.
colnames(merged@meta.data)
The column names of merged give the original identity of GSM ID, the counts of RNA that showed up for each barcode cell, and the number of genes as features that the barcode was part of in this single cell RNA sequencing project.
#quality control (QC)
When using quality control you don’t want a low count as the sequencing is probably altered or a low, low number unique genes indicates cell dead, dying, or empty droplet, and if other cells have higher counts and this could be an error within that cell, and you don’t want doublets or multiplets of genes producing high counts that are very large compared to the average counts of cells because that could mean the cell has 2 cells in it as the genes stuck together and giving a higher count. And for mitochondrial DNA, it is not located in the cytoplasm where most RNA ready to be translated into a protein occurs, but could be due to apoptosis of the cell and the mitochondrial membrane ruptured and leaking out its DNA or the cell is damaged and dying. So a high percentage of mitochondrial DNA distorts the counts as well. Mitochondrial DNA is only transcribed in the mitochondrial.
##perc at
merged[['percent_mt']] <- PercentageFeatureSet(merged, pattern = '^MT-')
head(merged@meta.data)
The nCount is total number of molecules in a cell The nFeature is the total number of unique genes in each cell
This violin plot VlnPlot() is used as a Seurat library plot.
VlnPlot(merged, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol=3)
We use the above plots of violin plots to compare the scatter and use those above lowest threshold of all scatter and not including less scatter above the highest theshold in common. For counts, visually estimated, the lowest threshold is 0 and highest is 2,000. For Features it looks like lowest threshold is about 100 and highest is around 4,000. The percent mitochondrial DNA looks like around 6-8 %.
This next plot is for a regression line to identify thresholds to filter out low and high counts and features as well at mt percent, the number of molecules on x-axis as counts of cells and the number of genes as features on y-axis.
You don’t want to have cells to be in lower right corner under regression line because it means that high molecule count but low unique features, capturing low genes and sequencing over and over again givig high count, and not in upper left corner bc high genes but low count because not deeply sequence enough, always check for this to see that it is due to low quality cells and not due to an artifact or sequencing error. That could ruin downstream analysis.
FeatureScatter(nsclc_seu, feature1 = “nCount_RNA”, feature2 = “nFeature_RNA”) + geom_smooth(method = ‘lm’)
library(ggplot2)
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+
# Control axis limits using ggplot2 functions
xlim(0, 30000)+ # Set X-axis range
ylim(0, 16000) # Set Y-axis range
Looks like around 20,000 count of RNA and about 5,000 features as thresholds to filter out for quality control to select best genes leading to cell differentiation in this pathology. This scatter follows a log function so I think it was already normalized. We can place the log function over the scatter.
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm', , formula = y ~ log(x))+
# Control axis limits using ggplot2 functions
xlim(0, 30000)+ # Set X-axis range
ylim(0, 16000) # Set Y-axis range
It looks like the scatter are the shape of a log curve when compared to the linear regression line, but when fitting a log function over the scatter, the curve fell flat early and missed many scatter after about 4,000 count RNA. Lets proceed with filtering based on the linear fit model and thresholds estimated.
#filtering At this point, I am testing whether you need to keep the object the same name and store each change with the object, because each command made to the object is stored in its array of data in a separate list of changes made to the object. I originally kept renaming the object and removing the previous object after making changes to it like filtering it, normalizing it, and scaling it, then downstream error produced with inability to scale data with the ScaleData function. Lets see. They do explain in a few videos seen that each change you make is stored in the object maybe to revert back to original without removing all and reuploading.
merged <- subset(merged, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent_mt <8 & nCount_RNA < 18000)
You can make your own thresholds on how you interpret visualizations The file size shrank from 11 GB to 9.8 GB file size as it filtered out those barcodes not meeting this criteria.
***** knitr doesn’t allow this chunk 17 to run so it will be print screens but everything ended up running fine in RStudio after making those AI internet solution corrections. *****
The other youtube video also shows this same exact code as this video. Both videos use the same exact Seurat R code commands and workflow. Normalization by default, use global scaling normalization of log normalize, default scaling of 10,000, and then get a log transformed result.
merged <- NormalizeData(
merged,
normalization.method = "LogNormalize",
scale.factor = 10000,
layer = "counts", # Input layer
#new.layer = "lognorm" # Output layer
)
# the program doesn't use the 'new.layer' and posts it as a warning
# "The following arguments are not used: new.layer"
The file size increased to 19.5 GB in size after normalizing and log scaling it from 9.8 GB.
merged
An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 0 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328
There are now 64 layers instead of 32 layers, the additional data labeled layers are where the Normalized and scaled values are stored along with the count labeled layers in the object name > assays > RNA > layers.
There are now fewer cells from 407,006 to 370,409 that were kept after filtering based on visuals of Violin Plot and the scatter plots with residual regression best fit line.
merged@assays$RNA
Assay (v5) data with 30960 features for 370409 cells First 10 features: ENSG00000238009, ENSG00000241860, ENSG00000290385, ENSG00000291215, LINC01409, ENSG00000290784, LINC00115, LINC01128, ENSG00000288531, FAM41C Layers: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328
The first 10 features are listed when looking at the RNA under the assay tab.
Here is my original screenshot of RNA contents arrangement in Seurat
object once merged.
The following shows the sparce matrices for the first 5 layers by counts. This list could be very large if not limited. There are only 1s where the gene is present in the cell or barcode column of assay, the 0s are held with the period or dot.
merged@assays[["RNA"]]@layers[1:5]
$counts.GSM9493334 23499 x 8732 sparse Matrix of class “dgCMatrix”
[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [2,] . . . . . 1 . 1 . . . . . . . . . . . . . . . . . . . . . . …… [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [5,] . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . …… [6,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [7,] . . . . . . . 1 . . . . . . 1 . . . . . . . . . . . . . . . …… [8,] . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . …… [9,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [10,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [11,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [12,] 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [13,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [14,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [15,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [16,] . . . . 2 . . 2 . . . . . . . . . . . . . . . . . . . . . . …… [17,] 2 . . 1 . 2 1 . 2 2 . . . 10 . 2 . . . 2 . . . 2 1 . . . . . ……
………………………… ……..suppressing 8702 columns and 23466 rows in show(); maybe adjust options(max.print=, width=)
…
$counts.GSM9493335 25402 x 8847 sparse Matrix of class “dgCMatrix”
[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [2,] . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . …… [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [5,] . . . . . . . . . . . . . . . . . 1 . . . 1 . . 1 . . . . . …… [6,] . 1 . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . …… [7,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [8,] . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . …… [9,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [10,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [11,] . 1 . . 1 . . . . . . . . . . . . . 1 . . . . 1 . . . . 1 . …… [12,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [13,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [14,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [15,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [16,] 1 . . 2 . . 1 . . 1 1 . 1 1 . . . . . 2 . 1 . . . . . . 1 . …… [17,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ……
………………………… ……..suppressing 8817 columns and 25369 rows in show(); maybe adjust options(max.print=, width=) …………………………
[25387,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25388,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25389,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25390,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ……
This was copy and pasted due to knitr not letting code run past the
Normalization chunk of R code. You can see not all features are counted
in the cells shown. Here is a snippet.
#gives sparse matrices for each GSM sample
#merged@assays[["RNA"]]@layers$counts #sparse matrices
#merged@assays[["RNA"]]@layers$data #null
merged
An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 0 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328
There aren’t any or there are 0 variable features from out put of object working with above but we will change that next code chunk.
The default is to return 2000 features per dataset, and ‘vst’ stands for ‘variance stabilizing transformation’ as a default as well.
merged <- FindVariableFeatures(merged, selection.method = 'vst', nfeatures = 2000)
Short snippet above of the grids per sample calculated to find the high
variability genes.
Now look at the merged object to see that variable features should show 2000.
merged
An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 2000 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328
There are 30,960 features in 370,409 samples and 2000 variable features now after finding for the variable features. Before doing the findVariableFeatures command or transformation, there were 0 variable features.
We should limit the list as it will print out a lot of the values. But only those features in the list of 2000 highly variable features or having the highest differential expression values where cells changed the most from healthy to pathological state are listed by name, all others are listed as ‘NA’ values.
merged[["RNA"]]@meta.data$var.features[1:500]
top10 <- head(VariableFeatures(merged), 10)
top10
[1] “PPBP” “IGLC2” “IGKC” “S100A9” “EREG” “IGLC3” “SOX5” “PF4”
[9] “LYZ” “LINGO2”
top10_plot <- VariableFeaturePlot(merged)
LabelPoints(plot = top10_plot, points = top10, repel = TRUE)
This corrects for unwanted sources of variation, technical as example with batch effects, or biological such as cell cycle stage or mitochondrial contamination, this ensures results are due to clustering by gene differences and not technological or biological errors.
The ScaleData function is part of standard parameters, there is a more complex single cell Transform function or workflow within Seurat that could be alternatively used.
First to make a character string stored as the rownames of the structural array we have been using mergedFilteredNormalizedScaled. Lets look at the first 100 genes as well.
all_genes <- rownames(merged)
all_genes[1:100]
[1] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385”
“ENSG00000291215” [5] “LINC01409” “ENSG00000290784” “LINC00115”
“LINC01128”
[9] “ENSG00000288531” “FAM41C” “ENSG00000272438” “NOC2L”
[13] “KLHL17” “PLEKHN1” “ENSG00000272512” “HES4”
[17] “ISG15” “ENSG00000224969” “AGRN” “ENSG00000291156” [21] “C1orf159”
“ENSG00000285812” “LINC01342” “TNFRSF18”
[25] “TNFRSF4” “SDF4” “B3GALT6” “C1QTNF12”
[29] “ENSG00000260179” “UBE2J2” “LINC01786” “SCNN1D”
[33] “ACAP3” “PUSL1” “INTS11” “CPTP”
[37] “TAS1R3” “DVL1” “MXRA8” “AURKAIP1”
[41] “CCNL2” “MRPL20-AS1” “MRPL20” “MRPL20-DT”
[45] “ATAD3C” “ATAD3B” “ENSG00000290916” “ATAD3A”
[49] “TMEM240” “SSU72” “ENSG00000215014” “FNDC10”
[53] “ENSG00000286989” “ENSG00000272106” “MIB2” “MMP23B”
[57] “CDK11B” “ENSG00000272004” “SLC35E2B” “CDK11A”
[61] “ENSG00000290854” “NADK” “GNB1” “GNB1-DT”
[65] “CFAP74” “PRKCZ” “ENSG00000271806” “PRKCZ-AS1”
[69] “FAAP20” “ENSG00000234396” “SKI” “ENSG00000287356” [73] “MORN1”
“ENSG00000272420” “RER1” “PEX10”
[77] “PLCH2” “PANK4” “HES5” “ENSG00000272449” [81] “TNFRSF14-AS1”
“TNFRSF14” “ENSG00000228037” “ENSG00000289610” [85] “PRXL2B” “MMEL1”
“TTC34” “PRDM16”
[89] “MEGF6” “ENSG00000238260” “TPRG1L” “WRAP73”
[93] “TP73” “SMIM1” “LRRC47” “CEP104”
[97] “DFFB” “C1orf174” “LINC01134” “AJAP1”
length(all_genes)
[1] 30960
There are a total of 30,960 genes.
The row names of the object, merged, have the gene names, when we looked for the highest variation genes in most differentially expressed for 2000 genes with
merged <- FindVariableFeatures(merged, selection.method = ‘vst’, nfeatures = 2000)
Do they line up to the row names of merged and where are they located.
varFeatures <- merged[["RNA"]]@meta.data$var.features
genes <- rownames(merged)
df <- data.frame(Gene = genes, FeatureVariability = varFeatures)
#head(df,20)
view(df)
The location of any of the genes if they were a high variability gene top 2000 is listed next to the correct row name in a random 20 sample draw from first 20 samples.
Are there only 2000 non NA values in our varFeatures set to find only 2000?
class(varFeatures)
length(unique(varFeatures))
[1] “character” [1] 2001
There are 2001 unique values in a 30,960 long vector of genes set to be the label if it is one of the 2000 most differentially expressed genes. The additional character is the NA value.
Maybe I can figure out something to do with that instead as this ‘work around’ has already taken more than one day in set backs. And produced no progress to following the work flow.
The VariableFeatures(merged) is how to list only those top 2000 genes we found with the findVariableFeatures function. We can see those by calling that function VariableFeatures(merged) or rather since its more than 1000 to write it to csv and then read that file. But would be nice to see how it is used with its values for pathology and healthy samples. Maybe we can make a vector of those top 10 like we do already but then only keep the matrix of meta.data with those 10 genes to see the counts and number of barcodes express it.
Lets scale the data first then come back to this task.
*** CHANGED TO ALL GENES SCALED FOR PART 5E, BUT RECALLED IT DOESN’T WORK *** There are 30,960 unique gene names. We will scale only the variable features we found of top 2000 most differentially expressed genes.
#merged <- ScaleData(merged, features = all_genes)
merged <- ScaleData(merged, features = VariableFeatures(merged))
Great news! the ScaleData function worked, must have been due to declaring the input and output (that was ignored) in the NormalizeData function and then not adding in the additional NormalizeData function.
Centering and scaling data matrix |===========================================================| 100%
This file is 25.4 GB in size just log transform scaling the 2,000 targeted features. When doing all features it wouldn’t work, but I can try again. This could be why it worked this time due to that edit as well as small changes in the parameter setting of the normalization function.
Lets save this Seurat object we filtered, normalized and scaled by 10,000 then scaled with a log transformation. We only log transformed the 2,000 most differentially expressed features in this set. When trying to scale all 30,960 features, the program couldn’t on this computer and stopped with an error message.
setwd(path)
#this takes a long time on this computer to save, its 25.4 GB in size
saveRDS(merged,"Seurat_filteredNormalized10kScaledLogScaled2kFeatures1")
Make a data frame of all the genes of variable features, only the top 2000 will be ranked.
allGenesDF <- merged[["RNA"]]@meta.data
dim(allGenesDF)
[1] 30960 194 There are 30960 genes and 194 columns for all the 32 samples and 5 other features of each sample. The variable features column will only have the name of the gene is it is one of the 2,000 highest variable feature.
all2k <- subset(allGenesDF, allGenesDF$var.features != "NA")
all2k <- all2k[order(all2k$var.features.rank, decreasing=FALSE),]
write.csv(all2k, 'all2000_DE_genes_NKTL_EBV_ordered.csv', row.names=FALSE)
You can get this data frame here. ***
Lets finish that task before moving onto the analysis portion. I want to extract these top 10 features and make a data frame not a Seurat object that has the gene names attached.
It is interesting because of all the stored information you know the genes are there and you just have to get them. They are in the row names, now we can also subset the Seurat object and make the object selective to only those 10 genes to see how they are expressed in each of our 12 healthy and 20 pathological NKTCL cases to do our regular workflow for supervised machine learning.
m <- merged[which(row.names(merged) %in% top10),]
m
An object of class Seurat 10 features across 370409 samples within 1 assay Active assay: RNA (10 features, 10 variable features) 65 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328, scale.data 1 dimensional reduction calculated: pca
This Seurat object has 10 features only, those that are the top or most differentially expressed genes in a much smaller 284 MB file size with 65 layers, because done after successfully using the ScaleData function, the additional layer beyond the 64 layers, the 65th layer is the layer holding the scale.data values. This is done on the filtered data from 407,006 cells down to 370,409 cells.
unique(m$orig.ident)
[1] “GSM9493334” “GSM9493335” “GSM9493336” “GSM9493337” “GSM9493338” [6] “GSM9493339” “GSM9493340” “GSM9493341” “GSM9493342” “GSM9493343” [11] “GSM9493344” “GSM9493345” “GSM9493346” “GSM9493347” “GSM9493348” [16] “GSM9493349” “GSM9493350” “GSM9493351” “GSM9493332” “GSM9493333” [21] “GSM9493320” “GSM9493329” “GSM9493330” “GSM9493331” “GSM9493321” [26] “GSM9493322” “GSM9493323” “GSM9493324” “GSM9493325” “GSM9493326” [31] “GSM9493327” “GSM9493328”
Those show the top 10 features or genes span all 32 samples of the 12 healthy and 20 pathological NKTC aggressive Lymphoma.
top10Data <- m[["RNA"]]@meta.data
colnames(top10Data)
[1] “vf_vst_counts.GSM9493334_mean”
[2] “vf_vst_counts.GSM9493334_variance”
[3] “vf_vst_counts.GSM9493334_variance.expected”
[4] “vf_vst_counts.GSM9493334_variance.standardized” [5]
“vf_vst_counts.GSM9493334_variable”
[6] “vf_vst_counts.GSM9493334_rank”
[7] “vf_vst_counts.GSM9493335_mean”
[8] “vf_vst_counts.GSM9493335_variance”
[9] “vf_vst_counts.GSM9493335_variance.expected”
[10] “vf_vst_counts.GSM9493335_variance.standardized” [11]
“vf_vst_counts.GSM9493335_variable”
[12] “vf_vst_counts.GSM9493335_rank”
[13] “vf_vst_counts.GSM9493336_mean”
[14] “vf_vst_counts.GSM9493336_variance”
[15] “vf_vst_counts.GSM9493336_variance.expected”
[16] “vf_vst_counts.GSM9493336_variance.standardized” [17]
“vf_vst_counts.GSM9493336_variable”
[18] “vf_vst_counts.GSM9493336_rank”
[19] “vf_vst_counts.GSM9493337_mean”
[20] “vf_vst_counts.GSM9493337_variance”
[21] “vf_vst_counts.GSM9493337_variance.expected”
[22] “vf_vst_counts.GSM9493337_variance.standardized” [23]
“vf_vst_counts.GSM9493337_variable”
…
[182] “vf_vst_counts.GSM9493327_variance”
[183] “vf_vst_counts.GSM9493327_variance.expected”
[184] “vf_vst_counts.GSM9493327_variance.standardized” [185]
“vf_vst_counts.GSM9493327_variable”
[186] “vf_vst_counts.GSM9493327_rank”
[187] “vf_vst_counts.GSM9493328_mean”
[188] “vf_vst_counts.GSM9493328_variance”
[189] “vf_vst_counts.GSM9493328_variance.expected”
[190] “vf_vst_counts.GSM9493328_variance.standardized” [191]
“vf_vst_counts.GSM9493328_variable”
[192] “vf_vst_counts.GSM9493328_rank”
[193] “var.features”
[194] “var.features.rank”
There are 194 columns and they are divided into each sample but have the same column metric per sample of 6 metrics each. There are 32 samples times 6 metrics, that is 192 columns.
Our table with genes are listed as column 194 at the end of the columns with rank by gene across all samples for how differentially expressed that gene was across all samples as last column 195. So that there will be a data frame of 10 top genes and show how they are expressed across each sample from the sequencing data of single cell RNA sequencing information in the array.
DataFrame10 <- m[['RNA']]@meta.data
DataFrame10[,c(1:12,185:194)]
Here is a snippet of the DataFrame10 first and last columns in the
viewer.
I only selected the first 10 columns and last 9 columns to show the samples those features explain. Otherwise, a huge amount of scrolling would be needed to look at all the 195 columns in this data for our top 10 most differentially expressed genes.
Lets write this out to csv and go back to the analysis.
write.csv(DataFrame10, 'top10DE_genes_NKTCL_EBV.csv', row.names=FALSE)
#write.csv(DataFrame10, 'top10DE_genes_NKTCL_EBV-UNSCALED.csv', row.names=FALSE)
We won’t use that now and maybe won’t at all. But it is good to have. Here is a copy of it if you would like it, access it here. I went through and separately made an unscaled to log transformation version of top 10 most differentially expressed genes to access here.
rm(m,df,all_genes, DataFrame10, top10Data)
=================================================================
We can stop and then return to read back in the Seurat object where we left off.
merged <- readRDS("Seurat_filteredNormalized10kScaledLogScaled2kFeatures1")
It reads in the same file size of 25.4 GB.
#Dimensionality Reduction with PCA allow us to visualize multidimensional datasets, with common PCA principal component analysis, summarizing all gene expression data over many genes into principal components (the variation between scatterpoints are the components, variations off the regression lines or dimension, adding components to top components will give most variation and explain most of data).
Loadings is a term for PCA that explains a gene that is strongly associated with that principal component. They are ordered with PC1 most explanative and subsequent add to that explanation in order of best.
The positive and negative genes are named for how ever many features you set for the number of PCs that explain the most.
merged <- RunPCA(merged, features = VariableFeatures(merged))
print(merged[['pca']], dims = 1:5, nfeatures=5)
merged@assays$RNA
Assay (v5) data with 30960 features for 370409 cells Top 10 variable features: PPBP, IGLC2, IGKC, S100A9, EREG, IGLC3, SOX5, PF4, LYZ, LINGO2 Layers: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328, scale.data
We can take a look at all commands for this object saved when we saved it and with the PCA command added. We go to the ‘commands’ tab to expand on those commands and see the elements each has, there will be 4 commands after running the PCA command.
DimHeatmap(merged, dims = 1, cells = 500, balanced = TRUE)
The dim plot is a scatter plot of a named PC for principal component, against another PC You will see cells group in clusters by PC comparison 1 by 1.
DimPlot(merged, reduction = "pca") + NoLegend()
This type of plot, elbow plot, shows standard deviation on y-axis and the principal components or PCs on x-axis to see the threshold limit of number of dimensions or PCs to use, it starts high and drops until plateuing at elbow bend, won’t be useful to use additional that add little variation and more computing.
ElbowPlot(merged)
===================================================
This one seemed like it froze computer but was just really slow, took 1 1/2 hours to run, in console it was running, but just faded and spinning circle in the source code window of document.
Basically the lower resolution the least clusters and higher resolution has more clusters
merged <- FindNeighbors(merged, dims = 1:7) #based on elbow plot above
merged <- FindClusters(merged, resolution = c(0.1, 0.3, 0.5, 0.7,1))
Number of nodes: 370409 Number of edges: 10184126
Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9724 Number of communities: 9 Elapsed time: 453 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 370409 Number of edges: 10184126
Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9473 Number of communities: 17 Elapsed time: 424 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 370409 Number of edges: 10184126
Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9334 Number of communities: 25 Elapsed time: 441 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 370409 Number of edges: 10184126
Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9223 Number of communities: 28 Elapsed time: 512 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 370409 Number of edges: 10184126
Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9093 Number of communities: 34 Elapsed time: 550 seconds
There will be added columns to the matrix object of our merged@meta.data for each of the resolutions such as ‘RNA_snn_res0.1’ and ‘RNA_snn_res0.3’, and so on to 1. And the integer values in the new columns indicates to which cluster each cell (or barcode) belongs.
The results are color coded for each cluster. Since gene expression is basically the cell differentiation of becoming different cell types, usually, different clusters represents different cell types.
dim(merged@meta.data)
[1] 370409 10
colnames(merged@meta.data)
[1] “orig.ident” “nCount_RNA” “nFeature_RNA” “percent_mt”
[5] “RNA_snn_res.0.1” “RNA_snn_res.0.3” “RNA_snn_res.0.5”
“RNA_snn_res.0.7” [9] “RNA_snn_res.1” “seurat_clusters”
head(merged@meta.data)
The meta.data now has 10 columns instead of 4 and each resolution is a column with the number of the cluster for each cell in that respective resolution column.
The data set may interest someone in many clusters that overlap that could be useful for identifying subset of cells interested in.
DimPlot(merged, group.by = 'RNA_snn_res.1', label = TRUE)
DimPlot(merged, group.by = 'RNA_snn_res.0.1', label = TRUE)
Set identity of clusters makes more sense than using more clusters with resolution 1
Idents(merged) <- 'RNA_snn_res.0.1'
Gives a clearer visualization of separate clusters with space between each
merged <- RunUMAP(merged, dims = 1:15)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric To use Python UMAP via reticulate, set umap.method to ‘umap-learn’ and metric to ‘correlation’ This message will be shown once per session 18:17:22 UMAP embedding parameters a = 0.9922 b = 1.112 18:17:22 Read 370409 rows and found 15 numeric columns 18:17:22 Using Annoy for neighbor search, n_neighbors = 30 18:17:22 Building Annoy index with metric = cosine, n_trees = 50 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| 18:17:50 Writing NN index file to temp file C:0CQbbW51ac33c236d6 18:17:51 Searching Annoy index using 1 thread, search_k = 3000 18:20:17 Annoy recall = 100% 18:20:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30 18:20:32 Initializing from normalized Laplacian + noise (using RSpectra) 18:21:16 Commencing optimization for 200 epochs, with 15948322 positive edges 18:21:16 Using rng type: pcg Using method ‘umap’ 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| 18:27:54 Optimization finished
Next, identify the clusters and look at genetic markers for different cell types and identify different cell types, not in this tutorial but another one
DimPlot(merged, reduction = 'umap')
The clusters
look more defined than in clusters from k-nearest neighbors.
We didn’t use tSNE but might be added later. Thanks and keep checking in. Our object is now 26.3 GB in file size and has 7 commands.
Save the Seurat object to use later
saveRDS(merged, file='mergedClustersUMap1.RDS')
===================================================
We went back and created the 2,000 most differentially expressed data frame in this part to the Seurat analysis of the NKTC Lymphoma and EBV samples in GSE318371. We couldn’t put all 30,970 genes because after running the high variability function, only the genes that are in the 2,000 top genes are labeled, while all others are NA. Our other data set only had the top 10 genes, but this one was the most added gene information we could get with what we now know on how to extract the gene information. It may be that the genes are labeled somewhere else and a huge data set of gene labels could have been made, but at the moment I am unaware of where the genes are stored as labels other than using the findVariableFeatures function.
Make a data frame of all the genes of variable features, only the top 2000 will be ranked.
allGenesDF <- merged[["RNA"]]@meta.data
dim(allGenesDF)
[1] 30960 194 There are 30960 genes and 194 columns for all the 32 samples and 5 other features of each sample. The variable features column will only have the name of the gene is it is one of the 2,000 highest variable feature.
I looked it up and it turns out you can get the gene name for all entries of 30,960 by using the rownames function on data object[[“RNA”]]
allGenesDF$genes <- rownames(merged[["RNA"]])
write.csv(allGenesDF, "all30960genes_NKTC_Lymphoma_EBV_rownamesAdded.csv", row.names=FALSE)
You can get this data frame of all 30,960 genes and their gene names and 2,000 ranked order of highest differentiabilit here. This data of all genes is actually useful to us in another exploration and analysis of this gene expression data when we connect the target or associated genes known to be affected in EBV infection using the genes provided from a respected and trusted site for gathering gene expression information from current academia and peer reviewed research studies or alternate proven and reproducible studies. Once such organization is KEGG discussed with last part or part 4. We will be able to filter out those genes as well as our top ten or more genes we select from this or our own filtering process.
We will then use those top ten genes to predict on our samples of 12 healthy and 20 pathological cases of EBV infected NKTC Lymphoma patient samples and see how well those genes are for indicating a 2 class model of healthy or EBV NKTC Lymphoma.
When we compare the genes associated with EBV in the KEGG site for gene expression profiling we can look to see how our samples change from healthy to pathological for those genes that EBV affects and compare to the changes they go through in EBV, as well as other Lymphomas if there, or onco genes for blood cancers.
all2k <- subset(allGenesDF, allGenesDF$var.features != "NA")
all2k <- all2k[order(all2k$var.features.rank, decreasing=FALSE),]
You can get this data frame here.
We won’t be doing anything with that data frame now, but it will be useful to work with if we decide to get the gene expression data from the counts of how many times these genes showed up through barcodes of the gene fragments in all the cells of the array. The columns track stats for the counts of those genes twice filtered, at intake, excluding those not in at least 3 cells and not in at least 200 genes or a gene having at least 200 barcodes for it. Then filtered with the exploratory data analysis to see where most of the information gathered and filter out those outside those threshold limits for counts, features or genes, and mitochondrial percent DNA present in the cells. The data was then log normalized which puts values between 0 and 1, and then scaled by 10,000 (because some values in gene expression can be very very small as in 0.00000000000001 or even lower, so adding back in 5 digits of 1.0 ^5 or order of 5 makes the data useable because some computer systems make anything smaller than a certain magnitude 0 anyhow or some software like Excel possibly), and then the highest variable features were found from those counts that only included the top 2,000 genes with the most change from healthy to pathological (this is an assumption made but the data has all samples and the healthy and pathological aren’t labeled separately other than patient and sample ‘HD’ unless that is how the application software understands the separate samples, but it could also just be those genes that have such variation among all 32 samples that there are many with low and high across samples, so that those with the most change between samples in general, even if some other pathology in this case EBV and NKTC Lymphoma could skew results of genes.), and then those 2,000 most differentially expressed genes were scaled with a log transformation (to put them all in the same viewing window). So, These values are essentially log normalized, multiplied by 10,000, then log transformed.
Now that we got our data frames of all genes and labels and a data frame of only those 2,000 most differentially expressed genes, we can test out the other unsupervised model and see how well it does compared to clustering and UMap done last part of this project.
Now, we are going to run the TSNE unsupervised classifier on this data set.
merged <- readRDS('mergedClustersUMap.RDS')
merged <- RunTSNE(merged, dims = 1:15)
DimPlot(merged, reduction = 'tsne')
The T-SNE
clustering looks very interesting, like it is looking at the axial
version of a body but the clusters are in different layers of the body
in multidimensional space, like a cross section of the lumen of the
aorta or inferior vena cava, but also looking at the lungs, the
pancreas, and stomach portions from above or below axially. Or
concentrically.
A little about these algorithms, from another course using python, T-SNE and UMAP are intended for mapping high dimensionality space into lower dimensionality space for visuals and analysis. T-SNE doesn’t scale as well as UMAP, and T-SNE preserves relationships of points in clusters closer together with distance, while UMAP also preserves relationships between cluster points but uses a low dimensional manifold embedded in the higher dimension space to find those clusters. T-SNE is for T-Distributed Stochastic Neighbor Embedding for finding points in complex, high dimensional space, that can be visualized. UMAP is for Uniform Manifold Approximation and Projection.
Thanks and we have more to do. We need to now look at our genes and do our own manual filtering of most expressed genes from the means of the healthy vs the means of the pathological cases, then get the top 10 and bottom 10 of those as target genes and see if they are the same as the top genes in this process used by Seurat in finding most differentially expressed in general across all samples. All numeric data is counts of genes found across at least 3 cells and having at least 200 barcode fragments of the gene in the array of 407,006 single cells of RNA sequencing. The data was also already log normalized, scaled to 10,000 x its log normalized size, and then scaled to be within the same magnitude by subtracting the mean of each gene and then dividing by the standard deviation for that gene across all samples (looked it up with AI online)
Then we will compare the genes by first retrieving them from a known source, KEGG, (similar to drinking beer in the size or volume of a keg after waiting for all these computer processes to run on a 16 GB RAM, 500 GB SSD, and AMD Ryzen 5 processor), to get the EBV and Lymphoma or other blood cancer genes to see the changes in comparison with the genes in these samples to see if any noticeable similarities or nothing at all noticeable. But prefer to see some insights that can make solid connections and conclusions to how the pathology of EBV infected Natural Killer T-cell Lymphoma, an aggressive lymphoma, behaves and genes targeted compared to EBV infection and lymphomas in general or blood cancers in general.
Thanks again and keep checking back.