This part 4 trying to correct for version conflicts in the code tutorials to get the machine learning algorithms for unsupervised learning to work by flattening the layers added in version 5 of Seurat.

We are using GSE318371, a recent gene study on NCBI that analyzes the aggressive Natural Killer T-Cell Lymphoma vs. healthy samples using single cell RNA (scRNA) Sequencing, of peripheral blood mononuclear cells (PBMCs).

Looking at the EDA of exploratory data analysis with new package for this thing on natural killer t-cell lymphoma associated by EBV infection. This study is GSE318371 within NCBI database of gene expression studies. I extracted the custom download option of the GSE318371_RAW.tar file scrolled at end of page on this series. The extraction of barcodes takes a very long time once downloaded for each file and you can avoid that process and leave them in zip form with the .tsv.gz name but remove the prepended file name in front of ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx’ but put these files with respective GSM sample ID into its own file to read in for each file folder.

This is a very recent February 2026 uploaded research on aggressive lymphoma associated with EBV infection. I found others and they include the nasopharyngeal carcinoma and Hodgkin and large B-cell lymphomas. But for now we work on this project to get our top genes for our machine to predict EBV, Lyme disease, or specific associated EBV pathology of multiple sclerosis, mononucleosis, primary EBV infection, fibromyalgia, as well as various lymphomas and nasopharyngeal carcinoma.

There is a process that has to be followed to extract each sample information of barcodes, features, and cells. This is array gene expression data where an array of many cells is input into a machine and each cell within the array is ran to count the number of times a gene appears. The barcodes are the cells and the features are the genes in matrix format. This is what single cell RNA sequencing is.

It is useful to review or get familiar with the Seurat package for this type of data as it is the go to method of handling unsupervised gene expression data which is basically what finding gene targets is, finding the genes to make target and good predictors on unseen data. But our data is known and the sample classes are seen therefore, and will still allow us the functionality to explore the unsupervised machine learning models in finding our top 10 genes.

This last section follows through with a tutorial found online video, great to follow along and learn a little bit or review the material in these type of data sets on gene expression data with single cell RNA sequencing data.

library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd(path)

list.files(path)
##  [1] "merged.RDS"        "mergedSamples.RDS" "patient14"        
##  [4] "patient16"         "patient18"         "patient19"        
##  [7] "patient20"         "patient22"         "patient25"        
## [10] "patient27"         "patient30"         "patient31"        
## [13] "patient36"         "patient37"         "patient38"        
## [16] "patient39"         "patient41"         "patient44"        
## [19] "patient51"         "patient52"         "patient6"         
## [22] "patient9"          "sampleHD1"         "sampleHD10"       
## [25] "sampleHD11"        "sampleHD12"        "sampleHD2"        
## [28] "sampleHD3"         "sampleHD4"         "sampleHD5"        
## [31] "sampleHD6"         "sampleHD7"         "sampleHD8"        
## [34] "sampleHD9"
set.seed(123) 

The above files are in our RDS_objects folder. It can be set within the chunk but reverts back to directory outside the Rmarkdown notebook chunk when it is done running that chunk.

Lets read in the mergedSamples.RDS of our samples merged together in a 400k row of genes by 4 columns. I went back to the original merged data of the samples as Seurat objects from earlier part of project and added in a parameter in the merge function ‘merge.data=TRUE’ it might not make a difference, but this code works after another small change when normalizing the data and only normalizing one time, not twice like biostatsquid said to do to log transform it, although biostatsquid did do a good job adding in additional context to the code and was almost exactly like the Single Cell Genomics, Transcriptomics, & Proteomics code tutorial on youtube.

If you want it to behave like Seurat 4, the AI help online said to go back to when merging the samples of Seurat objects and to add merge.data=TRUE merged <- merge(sample1, sample2, merge.data = TRUE)

# merged <- merge(patient14, y=c(patient16, patient18, patient19,patient20 , patient22 , patient25 , patient27 , patient30,  patient31,  patient36 , patient37 , patient38 , patient39 , patient41 ,
# patient44,  patient51,  patient52,  patient6 ,  patient9,   sampleHD1 , sampleHD10,
# sampleHD11, sampleHD12, sampleHD2 , sampleHD3 , sampleHD4 , sampleHD5,  sampleHD6 ,
#  sampleHD7 , sampleHD8 , sampleHD9 ), add.cell.ids = ls()[1:32], merge.data=TRUE)
setwd(path)
#merged <- readRDS('mergedSamples.RDS')
merged <- readRDS('merged.RDS') #added extra line of merge.data=TRUE

Layout expanded of lager Seurat object, merged, screen1 Layout of expanded Large Seurat object, merged, screen 2 The above screen shots show the layout of this large Seurat object expanded out with the layers and additional information that is extracted individually by matrix @meta.data or by @assays for the RNA meta data of gene names as features to add to the matrix of barcodes.

We can see the details of this object by entering its object name.

merged
## An object of class Seurat 
## 30960 features across 407006 samples within 1 assay 
## Active assay: RNA (30960 features, 0 variable features)
##  32 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328

You can see it has labels by counts for samples. There are 30,960 genes AKA features and 407,006 cells AKA samples or barcodes in this 1 assay with layers. There are 32 layers. Seurat version 5 added layers from Version 4 that the tutorials use in Single Cell Genomics, Transcriptomics, & Proteomics youtube videos and in Biostatsquid youtube channel.

str(merged)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
##   .. .. .. ..@ layers    :List of 32
##   .. .. .. .. ..$ counts.GSM9493334:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:19172672] 11 16 42 49 68 81 97 104 128 152 ...
##   .. .. .. .. .. .. ..@ p       : int [1:8870] 0 1489 2641 4324 7342 9549 13787 15283 18922 21606 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 23499 8869
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:19172672] 1 2 1 1 2 1 1 7 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493335:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:24861569] 36 75 92 100 111 120 201 228 290 311 ...
##   .. .. .. .. .. .. ..@ p       : int [1:9487] 0 850 5559 8303 10653 13417 18143 21132 23892 26529 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25402 9486
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:24861569] 1 1 1 2 2 1 1 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493336:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:31248138] 29 42 49 56 70 74 80 103 112 118 ...
##   .. .. .. .. .. .. ..@ p       : int [1:10934] 0 2827 8343 10745 13203 16523 18213 19147 21558 23992 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25619 10933
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:31248138] 1 1 1 1 1 2 2 8 2 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493337:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:19140134] 10 16 20 29 40 42 48 55 57 58 ...
##   .. .. .. .. .. .. ..@ p       : int [1:6557] 0 4013 7109 9903 15488 18744 21955 23370 28882 32402 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25154 6556
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:19140134] 1 1 1 1 2 1 3 1 2 2 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493338:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:27825614] 29 40 42 62 65 68 74 104 119 121 ...
##   .. .. .. .. .. .. ..@ p       : int [1:10009] 0 2569 5953 8650 12471 16977 17181 17829 20857 24128 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25494 10008
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:27825614] 1 2 1 2 2 1 1 9 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493339:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:31243875] 58 60 61 161 211 250 266 283 284 347 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13711] 0 834 4699 7833 9457 11030 12634 15049 17639 19514 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25777 13710
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:31243875] 1 1 1 2 1 1 1 1 2 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493340:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:32944966] 45 60 80 103 123 127 133 135 159 173 ...
##   .. .. .. .. .. .. ..@ p       : int [1:11069] 0 2745 5095 8632 10668 10949 13888 14340 16676 19595 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 26214 11068
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:32944966] 1 1 1 10 3 1 1 5 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493341:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:26434092] 15 48 55 58 59 61 64 67 69 100 ...
##   .. .. .. .. .. .. ..@ p       : int [1:15502] 0 2972 4458 6500 7695 9381 11332 12874 14279 15479 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 26605 15501
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:26434092] 2 1 1 1 2 4 1 1 10 2 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493342:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:34759347] 11 32 39 42 49 56 61 62 68 74 ...
##   .. .. .. .. .. .. ..@ p       : int [1:14703] 0 2715 3179 6162 9795 11962 16497 20034 22102 24622 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25522 14702
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:34759347] 1 1 3 1 2 1 1 4 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493343:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:31984660] 42 49 60 68 72 75 80 96 103 120 ...
##   .. .. .. .. .. .. ..@ p       : int [1:15223] 0 2192 4327 5967 8166 11557 14780 17287 19373 19646 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25382 15222
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:31984660] 1 2 1 1 1 1 1 1 22 3 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493344:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:35642971] 16 39 42 47 49 53 61 62 68 74 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13311] 0 2776 7743 9484 9981 10573 15526 19447 22521 25763 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25898 13310
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:35642971] 1 1 4 1 2 1 1 4 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493345:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:35475776] 11 16 42 45 62 68 70 74 93 103 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13634] 0 2523 5146 7745 13543 17630 19045 21688 24242 28633 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 26108 13633
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:35475776] 2 1 1 1 1 2 1 1 1 14 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493346:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:35686204] 16 39 42 67 79 93 100 102 119 126 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13355] 0 2798 5408 8662 13023 15619 18022 21159 24432 27738 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25882 13354
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:35686204] 2 1 1 1 1 1 1 6 2 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493347:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:37473479] 6 11 16 49 56 59 61 62 81 90 ...
##   .. .. .. .. .. .. ..@ p       : int [1:14368] 0 3335 6504 11343 11644 16967 19987 22582 26236 31712 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25976 14367
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:37473479] 1 1 2 1 1 1 2 2 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493348:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:35264666] 15 25 28 33 41 53 59 60 61 64 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13791] 0 2546 8357 13536 15416 18140 19953 22287 23504 25845 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25501 13790
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:35264666] 1 1 2 1 1 1 1 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493349:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:38407661] 16 21 26 29 39 42 47 49 54 56 ...
##   .. .. .. .. .. .. ..@ p       : int [1:30123] 0 3331 3578 4714 8619 8907 9209 9410 9678 9885 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 26107 30122
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:38407661] 1 1 1 2 2 2 1 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493350:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:23914985] 40 56 62 103 120 127 134 172 188 210 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13679] 0 1591 5255 7925 9747 12264 14307 15592 17376 17740 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 24039 13678
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:23914985] 1 1 1 10 2 1 1 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493351:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:26380287] 41 53 55 102 111 134 147 164 195 210 ...
##   .. .. .. .. .. .. ..@ p       : int [1:14439] 0 1471 1887 3391 4716 7127 10482 11763 15732 15991 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 23925 14438
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:26380287] 1 1 2 1 1 1 1 1 1 2 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493332:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:17857942] 13 22 23 30 31 36 37 39 46 59 ...
##   .. .. .. .. .. .. ..@ p       : int [1:7626] 0 2862 5274 7638 10199 12559 14247 17211 19232 21174 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 21785 7625
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:17857942] 1 1 1 1 1 1 1 4 2 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493333:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:33544609] 5 58 59 97 111 119 121 142 156 171 ...
##   .. .. .. .. .. .. ..@ p       : int [1:14308] 0 1367 3685 4818 8275 10730 13394 15211 17495 20432 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 24125 14307
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:33544609] 1 1 2 2 1 1 2 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493320:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:23132352] 15 55 57 58 79 93 100 102 129 132 ...
##   .. .. .. .. .. .. ..@ p       : int [1:9150] 0 2534 5498 7925 10764 13088 17161 19540 22681 24627 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25684 9149
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:23132352] 1 1 1 1 2 1 2 2 2 4 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493329:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:20875724] 15 24 25 28 38 41 48 55 57 61 ...
##   .. .. .. .. .. .. ..@ p       : int [1:11874] 0 3701 4815 6698 8275 11240 12504 13516 15008 18863 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 23837 11873
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:20875724] 2 1 1 2 1 3 1 1 1 5 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493330:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:19617402] 7 11 16 39 41 42 56 61 62 94 ...
##   .. .. .. .. .. .. ..@ p       : int [1:11307] 0 4026 4904 6544 8208 9643 12091 13573 15903 19970 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 23877 11306
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:19617402] 1 1 1 2 1 4 1 2 7 2 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493331:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:25824743] 1 25 28 41 48 67 91 102 126 132 ...
##   .. .. .. .. .. .. ..@ p       : int [1:15737] 0 2581 3881 7809 10283 10787 11929 14141 17142 18874 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 24568 15736
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:25824743] 1 1 1 1 2 2 2 20 7 2 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493321:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:30294412] 11 16 20 29 40 56 62 74 80 87 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13432] 0 3536 5041 7914 10348 11510 13574 16308 19276 21197 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25857 13431
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:30294412] 1 1 2 2 1 1 1 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493322:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:25286681] 48 56 58 61 122 126 132 145 147 158 ...
##   .. .. .. .. .. .. ..@ p       : int [1:9720] 0 2348 4830 7477 9271 12370 15692 17835 20434 23523 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25891 9719
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:25286681] 1 1 1 1 1 2 4 1 3 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493323:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:30786304] 40 62 74 90 118 120 123 166 168 190 ...
##   .. .. .. .. .. .. ..@ p       : int [1:12595] 0 1591 5168 6593 7672 9924 12348 15386 18047 20644 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25707 12594
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:30786304] 1 2 1 1 2 1 1 2 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493324:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:21853271] 25 26 37 39 58 59 62 75 90 93 ...
##   .. .. .. .. .. .. ..@ p       : int [1:8447] 0 4355 7598 10336 13678 14012 16940 19320 22293 24939 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 25629 8446
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:21853271] 1 1 1 1 1 1 2 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493325:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:32314751] 42 58 62 68 103 112 115 133 135 148 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13152] 0 2985 5841 8453 11185 13639 15670 18155 20756 23165 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 26173 13151
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:32314751] 1 1 2 1 5 1 1 12 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493326:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:23506391] 126 333 366 456 522 547 567 608 720 1185 ...
##   .. .. .. .. .. .. ..@ p       : int [1:13606] 0 232 597 2183 3599 4542 7509 9826 11159 13361 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 24173 13605
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:23506391] 1 1 4 2 1 1 1 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493327:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:20465208] 41 48 61 102 119 125 177 193 236 253 ...
##   .. .. .. .. .. .. ..@ p       : int [1:12067] 0 1427 3980 5650 8990 11380 12347 14449 15570 17735 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 23737 12066
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:20465208] 1 1 1 4 1 1 1 1 1 1 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. .. ..$ counts.GSM9493328:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. .. ..@ i       : int [1:20123150] 40 47 56 60 71 72 98 130 132 145 ...
##   .. .. .. .. .. .. ..@ p       : int [1:11252] 0 2005 6063 7788 9543 9751 10629 11878 18200 19565 ...
##   .. .. .. .. .. .. ..@ Dim     : int [1:2] 23877 11251
##   .. .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. .. ..$ : NULL
##   .. .. .. .. .. .. ..@ x       : num [1:20123150] 1 1 2 3 1 1 1 1 2 2 ...
##   .. .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ cells     :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
##   .. .. .. .. .. ..@ .Data: logi [1:407006, 1:32] TRUE TRUE TRUE TRUE TRUE TRUE ...
##   .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. .. .. .. ..$ : chr [1:407006] "patient14_AAACCCAAGGAGTACC-1" "patient14_AAACCCACAACCGCTG-1" "patient14_AAACCCACACCGGAAA-1" "patient14_AAACCCACATCGATGT-1" ...
##   .. .. .. .. .. .. .. ..$ : chr [1:32] "counts.GSM9493334" "counts.GSM9493335" "counts.GSM9493336" "counts.GSM9493337" ...
##   .. .. .. .. .. ..$ dim     : int [1:2] 407006 32
##   .. .. .. .. .. ..$ dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:407006] "patient14_AAACCCAAGGAGTACC-1" "patient14_AAACCCACAACCGCTG-1" "patient14_AAACCCACACCGGAAA-1" "patient14_AAACCCACATCGATGT-1" ...
##   .. .. .. .. .. .. ..$ : chr [1:32] "counts.GSM9493334" "counts.GSM9493335" "counts.GSM9493336" "counts.GSM9493337" ...
##   .. .. .. ..@ features  :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
##   .. .. .. .. .. ..@ .Data: logi [1:30960, 1:32] TRUE TRUE TRUE TRUE TRUE TRUE ...
##   .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. .. .. .. ..$ : chr [1:30960] "ENSG00000238009" "ENSG00000241860" "ENSG00000290385" "ENSG00000291215" ...
##   .. .. .. .. .. .. .. ..$ : chr [1:32] "counts.GSM9493334" "counts.GSM9493335" "counts.GSM9493336" "counts.GSM9493337" ...
##   .. .. .. .. .. ..$ dim     : int [1:2] 30960 32
##   .. .. .. .. .. ..$ dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:30960] "ENSG00000238009" "ENSG00000241860" "ENSG00000290385" "ENSG00000291215" ...
##   .. .. .. .. .. .. ..$ : chr [1:32] "counts.GSM9493334" "counts.GSM9493335" "counts.GSM9493336" "counts.GSM9493337" ...
##   .. .. .. ..@ default   : int 32
##   .. .. .. ..@ assay.orig: chr(0) 
##   .. .. .. ..@ meta.data :'data.frame':  30960 obs. of  0 variables
##   .. .. .. ..@ misc      : list()
##   .. .. .. ..@ key       : chr "rna_"
##   ..@ meta.data   :'data.frame': 407006 obs. of  3 variables:
##   .. ..$ orig.ident  : chr [1:407006] "GSM9493334" "GSM9493334" "GSM9493334" "GSM9493334" ...
##   .. ..$ nCount_RNA  : num [1:407006] 3697 3064 3972 9102 4416 ...
##   .. ..$ nFeature_RNA: int [1:407006] 1489 1152 1683 3018 2207 4238 1496 3639 2684 2842 ...
##   ..@ active.assay: chr "RNA"
##   ..@ active.ident: Factor w/ 32 levels "GSM9493334","GSM9493335",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:407006] "patient14_AAACCCAAGGAGTACC-1" "patient14_AAACCCACAACCGCTG-1" "patient14_AAACCCACACCGGAAA-1" "patient14_AAACCCACATCGATGT-1" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "SeuratProject"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 5 3 0
##   ..@ commands    : list()
##   ..@ tools       : list()

There are many attachments to this Seurat object, not just the matrix of merged columns with but also in a $chr element list in this array as [1:30960] showing ENSEMBL IDs. This will be useful when looking for top genes and getting labels as well as this next code extracting the mitochondrial DNA with regex to make a percent of mitochondrial DNA column.

dim(merged@meta.data)
## [1] 407006      3

In the meta.data table there are 3 columns and 407,006 barcodes for cells of the array.

colnames(merged@meta.data)
## [1] "orig.ident"   "nCount_RNA"   "nFeature_RNA"

The column names of merged give the original identity of GSM ID, the counts of RNA that showed up for each barcode cell, and the number of genes as features that the barcode was part of in this single cell RNA sequencing project.

#quality control (QC)

When using quality control you don’t want a low count as the sequencing is probably altered or a low, low number unique genes indicates cell dead, dying, or empty droplet, and if other cells have higher counts and this could be an error within that cell, and you don’t want doublets or multiplets of genes producing high counts that are very large compared to the average counts of cells because that could mean the cell has 2 cells in it as the genes stuck together and giving a higher count. And for mitochondrial DNA, it is not located in the cytoplasm where most RNA ready to be translated into a protein occurs, but could be due to apoptosis of the cell and the mitochondrial membrane ruptured and leaking out its DNA or the cell is damaged and dying. So a high percentage of mitochondrial DNA distorts the counts as well. Mitochondrial DNA is only transcribed in the mitochondrial.

##perc at

merged[['percent_mt']] <- PercentageFeatureSet(merged, pattern = '^MT-') 
head(merged@meta.data)
##                              orig.ident nCount_RNA nFeature_RNA percent_mt
## patient14_AAACCCAAGGAGTACC-1 GSM9493334       3697         1489   1.812280
## patient14_AAACCCACAACCGCTG-1 GSM9493334       3064         1152   4.862924
## patient14_AAACCCACACCGGAAA-1 GSM9493334       3972         1683   4.229607
## patient14_AAACCCACATCGATGT-1 GSM9493334       9102         3018   2.548890
## patient14_AAACCCACATGAAAGT-1 GSM9493334       4416         2207   2.038043
## patient14_AAACCCAGTGATTCTG-1 GSM9493334      13535         4238   2.910972
meta.data table in large Array Seurat Object
meta.data table in large Array Seurat Object
image of first few rows of merged@meta.data
image of first few rows of

The nCount is total number of molecules in a cell The nFeature is the total number of unique genes in each cell

This violin plot VlnPlot() is used as a Seurat library plot.

VlnPlot(merged, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol=3) 
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.

violin Plot

We use the above plots of violin plots to compare the scatter and use those above lowest threshold of all scatter and not including less scatter above the highest theshold in common. For counts, visually estimated, the lowest threshold is 0 and highest is 2,000. For Features it looks like lowest threshold is about 100 and highest is around 4,000. The percent mitochondrial DNA looks like around 6-8 %.

This next plot is for a regression line to identify thresholds to filter out low and high counts and features as well at mt percent, the number of molecules on x-axis as counts of cells and the number of genes as features on y-axis.

You don’t want to have cells to be in lower right corner under regression line because it means that high molecule count but low unique features, capturing low genes and sequencing over and over again givig high count, and not in upper left corner bc high genes but low count because not deeply sequence enough, always check for this to see that it is due to low quality cells and not due to an artifact or sequencing error. That could ruin downstream analysis.

FeatureScatter(nsclc_seu, feature1 = “nCount_RNA”, feature2 = “nFeature_RNA”) + geom_smooth(method = ‘lm’)

library(ggplot2)
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+

# Control axis limits using ggplot2 functions

      xlim(0, 30000)+    # Set X-axis range
      ylim(0, 16000)     # Set Y-axis range
## Rasterizing points since number of points exceeds 100,000.
## To disable this behavior set `raster=FALSE`
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1456 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1456 rows containing missing values or values outside the scale range
## (`geom_scattermore()`).

Scatter Plot with liner regression line overlay

Looks like around 20,000 count of RNA and about 5,000 features as thresholds to filter out for quality control to select best genes leading to cell differentiation in this pathology. This scatter follows a log function so I think it was already normalized. We can place the log function over the scatter.

FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm', , formula = y ~ log(x))+

# Control axis limits using ggplot2 functions

      xlim(0, 30000)+    # Set X-axis range
      ylim(0, 16000)     # Set Y-axis range
## Rasterizing points since number of points exceeds 100,000.
## To disable this behavior set `raster=FALSE`
## Warning: Removed 1456 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1456 rows containing missing values or values outside the scale range
## (`geom_scattermore()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_smooth()`).

Scatter Plot Logistic Regression Line overlay

It looks like the scatter are the shape of a log curve when compared to the linear regression line, but when fitting a log function over the scatter, the curve fell flat early and missed many scatter after about 4,000 count RNA. Lets proceed with filtering based on the linear fit model and thresholds estimated.

#filtering At this point, I am testing whether you need to keep the object the same name and store each change with the object, because each command made to the object is stored in its array of data in a separate list of changes made to the object. I originally kept renaming the object and removing the previous object after making changes to it like filtering it, normalizing it, and scaling it, then downstream error produced with inability to scale data with the ScaleData function. Lets see. They do explain in a few videos seen that each change you make is stored in the object maybe to revert back to original without removing all and reuploading.

merged <- subset(merged, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent_mt <8 & nCount_RNA < 18000)

You can make your own thresholds on how you interpret visualizations The file size shrank from 11 GB to 9.8 GB file size as it filtered out those barcodes not meeting this criteria.

Normalize data after removing filtered out cells

***** knitr doesn’t allow this chunk 17 to run so it will be print screens but everything ended up running fine in RStudio after making those AI internet solution corrections. *****

The other youtube video also shows this same exact code as this video. Both videos use the same exact Seurat R code commands and workflow. Normalization by default, use global scaling normalization of log normalize, default scaling of 10,000, and then get a log transformed result.

merged <- NormalizeData(
  merged,
  normalization.method = "LogNormalize",
  scale.factor = 10000,
  layer = "counts",       # Input layer
  #new.layer = "lognorm"   # Output layer
)

# the program doesn't use the 'new.layer' and posts it as a warning
# "The following arguments are not used: new.layer"

The file size increased to 19.5 GB in size after normalizing and log scaling it from 9.8 GB.

merged

An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 0 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328

There are now 64 layers instead of 32 layers, the additional data labeled layers are where the Normalized and scaled values are stored along with the count labeled layers in the object name > assays > RNA > layers.

There are now fewer cells from 407,006 to 370,409 that were kept after filtering based on visuals of Violin Plot and the scatter plots with residual regression best fit line.

merged@assays$RNA

Assay (v5) data with 30960 features for 370409 cells First 10 features: ENSG00000238009, ENSG00000241860, ENSG00000290385, ENSG00000291215, LINC01409, ENSG00000290784, LINC00115, LINC01128, ENSG00000288531, FAM41C Layers: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328

The first 10 features are listed when looking at the RNA under the assay tab.

Here is my original screenshot of RNA contents arrangement in Seurat object once merged. Layout expanded of lager Seurat object, merged, screen1

The following shows the sparce matrices for the first 5 layers by counts. This list could be very large if not limited. There are only 1s where the gene is present in the cell or barcode column of assay, the 0s are held with the period or dot.

merged@assays[["RNA"]]@layers[1:5]

$counts.GSM9493334 23499 x 8732 sparse Matrix of class “dgCMatrix”

[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [2,] . . . . . 1 . 1 . . . . . . . . . . . . . . . . . . . . . . …… [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [5,] . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . …… [6,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [7,] . . . . . . . 1 . . . . . . 1 . . . . . . . . . . . . . . . …… [8,] . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . …… [9,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [10,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [11,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [12,] 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [13,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [14,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [15,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [16,] . . . . 2 . . 2 . . . . . . . . . . . . . . . . . . . . . . …… [17,] 2 . . 1 . 2 1 . 2 2 . . . 10 . 2 . . . 2 . . . 2 1 . . . . . ……

………………………… ……..suppressing 8702 columns and 23466 rows in show(); maybe adjust options(max.print=, width=)

$counts.GSM9493335 25402 x 8847 sparse Matrix of class “dgCMatrix”

[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [2,] . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . …… [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [5,] . . . . . . . . . . . . . . . . . 1 . . . 1 . . 1 . . . . . …… [6,] . 1 . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . …… [7,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [8,] . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . …… [9,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [10,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [11,] . 1 . . 1 . . . . . . . . . . . . . 1 . . . . 1 . . . . 1 . …… [12,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [13,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [14,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [15,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [16,] 1 . . 2 . . 1 . . 1 1 . 1 1 . . . . . 2 . 1 . . . . . . 1 . …… [17,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ……

………………………… ……..suppressing 8817 columns and 25369 rows in show(); maybe adjust options(max.print=, width=) …………………………

[25387,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25388,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25389,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …… [25390,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ……

This was copy and pasted due to knitr not letting code run past the Normalization chunk of R code. You can see not all features are counted in the cells shown. Here is a snippet. sparse matrix snippet

#gives sparse matrices for each GSM sample
#merged@assays[["RNA"]]@layers$counts #sparse matrices
#merged@assays[["RNA"]]@layers$data #null
merged

An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 0 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328

There aren’t any or there are 0 variable features from out put of object working with above but we will change that next code chunk.

identify highly variable features

The default is to return 2000 features per dataset, and ‘vst’ stands for ‘variance stabilizing transformation’ as a default as well.

merged <- FindVariableFeatures(merged, selection.method = 'vst', nfeatures = 2000) 

High Variability Genes - Targets to Cell Differentiation in Aggressive Natural Killer T-cell Lymphoma Short snippet above of the grids per sample calculated to find the high variability genes.

Now look at the merged object to see that variable features should show 2000.

merged

An object of class Seurat 30960 features across 370409 samples within 1 assay Active assay: RNA (30960 features, 2000 variable features) 64 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328

There are 30,960 features in 370,409 samples and 2000 variable features now after finding for the variable features. Before doing the findVariableFeatures command or transformation, there were 0 variable features.

We should limit the list as it will print out a lot of the values. But only those features in the list of 2000 highly variable features or having the highest differential expression values where cells changed the most from healthy to pathological state are listed by name, all others are listed as ‘NA’ values.

merged[["RNA"]]@meta.data$var.features[1:500]
snippet 500 genes with only top 2000 labeled
snippet 500 genes with only top 2000 labeled

identify the top 10 highly variable genes, this builds a scatter plot that will identify and label the scatterpoints belonging to the top 10 genes with highest variability. Average expression on x-axis and Standardized Variance on y-axis.

top10 <- head(VariableFeatures(merged), 10)
top10

[1] “PPBP” “IGLC2” “IGKC” “S100A9” “EREG” “IGLC3” “SOX5” “PF4”
[9] “LYZ” “LINGO2”

top10_plot <- VariableFeaturePlot(merged)
LabelPoints(plot = top10_plot, points = top10, repel = TRUE)
scatter Plot with Labeled Top 10 High Variability Genes
scatter Plot with Labeled Top 10 High Variability Genes

Scaling after finding variable features in dataset.

This corrects for unwanted sources of variation, technical as example with batch effects, or biological such as cell cycle stage or mitochondrial contamination, this ensures results are due to clustering by gene differences and not technological or biological errors.

The ScaleData function is part of standard parameters, there is a more complex single cell Transform function or workflow within Seurat that could be alternatively used.

First to make a character string stored as the rownames of the structural array we have been using mergedFilteredNormalizedScaled. Lets look at the first 100 genes as well.

all_genes <- rownames(merged)

all_genes[1:100]

[1] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385” “ENSG00000291215” [5] “LINC01409” “ENSG00000290784” “LINC00115” “LINC01128”
[9] “ENSG00000288531” “FAM41C” “ENSG00000272438” “NOC2L”
[13] “KLHL17” “PLEKHN1” “ENSG00000272512” “HES4”
[17] “ISG15” “ENSG00000224969” “AGRN” “ENSG00000291156” [21] “C1orf159” “ENSG00000285812” “LINC01342” “TNFRSF18”
[25] “TNFRSF4” “SDF4” “B3GALT6” “C1QTNF12”
[29] “ENSG00000260179” “UBE2J2” “LINC01786” “SCNN1D”
[33] “ACAP3” “PUSL1” “INTS11” “CPTP”
[37] “TAS1R3” “DVL1” “MXRA8” “AURKAIP1”
[41] “CCNL2” “MRPL20-AS1” “MRPL20” “MRPL20-DT”
[45] “ATAD3C” “ATAD3B” “ENSG00000290916” “ATAD3A”
[49] “TMEM240” “SSU72” “ENSG00000215014” “FNDC10”
[53] “ENSG00000286989” “ENSG00000272106” “MIB2” “MMP23B”
[57] “CDK11B” “ENSG00000272004” “SLC35E2B” “CDK11A”
[61] “ENSG00000290854” “NADK” “GNB1” “GNB1-DT”
[65] “CFAP74” “PRKCZ” “ENSG00000271806” “PRKCZ-AS1”
[69] “FAAP20” “ENSG00000234396” “SKI” “ENSG00000287356” [73] “MORN1” “ENSG00000272420” “RER1” “PEX10”
[77] “PLCH2” “PANK4” “HES5” “ENSG00000272449” [81] “TNFRSF14-AS1” “TNFRSF14” “ENSG00000228037” “ENSG00000289610” [85] “PRXL2B” “MMEL1” “TTC34” “PRDM16”
[89] “MEGF6” “ENSG00000238260” “TPRG1L” “WRAP73”
[93] “TP73” “SMIM1” “LRRC47” “CEP104”
[97] “DFFB” “C1orf174” “LINC01134” “AJAP1”

length(all_genes)

[1] 30960

There are a total of 30,960 genes.


before scaling data lets look at the gene names and consider extracting out the gene information with our matrix of 4 columns

The row names of the object, merged, have the gene names, when we looked for the highest variation genes in most differentially expressed for 2000 genes with

merged <- FindVariableFeatures(merged, selection.method = ‘vst’, nfeatures = 2000)

Do they line up to the row names of merged and where are they located.

varFeatures <- merged[["RNA"]]@meta.data$var.features
genes <- rownames(merged)
df <- data.frame(Gene = genes, FeatureVariability = varFeatures)
#head(df,20)
view(df)
screenshot of df
screenshot of df

The location of any of the genes if they were a high variability gene top 2000 is listed next to the correct row name in a random 20 sample draw from first 20 samples.

Are there only 2000 non NA values in our varFeatures set to find only 2000?

class(varFeatures)
length(unique(varFeatures))

[1] “character” [1] 2001

There are 2001 unique values in a 30,960 long vector of genes set to be the label if it is one of the 2000 most differentially expressed genes. The additional character is the NA value.

Maybe I can figure out something to do with that instead as this ‘work around’ has already taken more than one day in set backs. And produced no progress to following the work flow.

The VariableFeatures(merged) is how to list only those top 2000 genes we found with the findVariableFeatures function. We can see those by calling that function VariableFeatures(merged) or rather since its more than 1000 to write it to csv and then read that file. But would be nice to see how it is used with its values for pathology and healthy samples. Maybe we can make a vector of those top 10 like we do already but then only keep the matrix of meta.data with those 10 genes to see the counts and number of barcodes express it.

Lets scale the data first then come back to this task.


There are 30,960 unique gene names. We will scale only the variable features we found of top 2000 most differentially expressed genes.

#merged <- ScaleData(merged, features = all_genes)

merged <- ScaleData(merged, features = VariableFeatures(merged))

Great news! the ScaleData function worked, must have been due to declaring the input and output (that was ignored) in the NormalizeData function and then not adding in the additional NormalizeData function.

Centering and scaling data matrix |===========================================================| 100%

This file is 25.4 GB in size just log transform scaling the 2,000 targeted features. When doing all features it wouldn’t work, but I can try again. This could be why it worked this time due to that edit as well as small changes in the parameter setting of the normalization function.

Lets save this Seurat object we filtered, normalized and scaled by 10,000 then scaled with a log transformation. We only log transformed the 2,000 most differentially expressed features in this set.

setwd(path)

#this takes a long time on this computer to save, its 25.4 GB in size

saveRDS(merged,"Seurat_filteredNormalized10kScaledLogScaled2kFeatures")

Lets finish that task before moving onto the analysis portion. I want to extract these top 10 features and make a data frame not a Seurat object that has the gene names attached.

It is interesting because of all the stored information you know the genes are there and you just have to get them. They are in the row names, now we can also subset the Seurat object and make the object selective to only those 10 genes to see how they are expressed in each of our 12 healthy and 20 pathological NKTCL cases to do our regular workflow for supervised machine learning.

m <- merged[which(row.names(merged) %in% top10),]
m

An object of class Seurat 10 features across 370409 samples within 1 assay Active assay: RNA (10 features, 10 variable features) 65 layers present: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328, scale.data 1 dimensional reduction calculated: pca

This Seurat object has 10 features only, those that are the top or most differentially expressed genes in a much smaller 284 MB file size with 65 layers, because done after successfully using the ScaleData function, the additional layer beyond the 64 layers, the 65th layer is the layer holding the scale.data values. This is done on the filtered data from 407,006 cells down to 370,409 cells.

unique(m$orig.ident)

[1] “GSM9493334” “GSM9493335” “GSM9493336” “GSM9493337” “GSM9493338” [6] “GSM9493339” “GSM9493340” “GSM9493341” “GSM9493342” “GSM9493343” [11] “GSM9493344” “GSM9493345” “GSM9493346” “GSM9493347” “GSM9493348” [16] “GSM9493349” “GSM9493350” “GSM9493351” “GSM9493332” “GSM9493333” [21] “GSM9493320” “GSM9493329” “GSM9493330” “GSM9493331” “GSM9493321” [26] “GSM9493322” “GSM9493323” “GSM9493324” “GSM9493325” “GSM9493326” [31] “GSM9493327” “GSM9493328”

Those show the top 10 features or genes span all 32 samples of the 12 healthy and 20 pathological NKTC aggressive Lymphoma.

top10Data <- m[["RNA"]]@meta.data
colnames(top10Data)

[1] “vf_vst_counts.GSM9493334_mean”
[2] “vf_vst_counts.GSM9493334_variance”
[3] “vf_vst_counts.GSM9493334_variance.expected”
[4] “vf_vst_counts.GSM9493334_variance.standardized” [5] “vf_vst_counts.GSM9493334_variable”
[6] “vf_vst_counts.GSM9493334_rank”
[7] “vf_vst_counts.GSM9493335_mean”
[8] “vf_vst_counts.GSM9493335_variance”
[9] “vf_vst_counts.GSM9493335_variance.expected”
[10] “vf_vst_counts.GSM9493335_variance.standardized” [11] “vf_vst_counts.GSM9493335_variable”
[12] “vf_vst_counts.GSM9493335_rank”
[13] “vf_vst_counts.GSM9493336_mean”
[14] “vf_vst_counts.GSM9493336_variance”
[15] “vf_vst_counts.GSM9493336_variance.expected”
[16] “vf_vst_counts.GSM9493336_variance.standardized” [17] “vf_vst_counts.GSM9493336_variable”
[18] “vf_vst_counts.GSM9493336_rank”
[19] “vf_vst_counts.GSM9493337_mean”
[20] “vf_vst_counts.GSM9493337_variance”
[21] “vf_vst_counts.GSM9493337_variance.expected”
[22] “vf_vst_counts.GSM9493337_variance.standardized” [23] “vf_vst_counts.GSM9493337_variable”

[182] “vf_vst_counts.GSM9493327_variance”
[183] “vf_vst_counts.GSM9493327_variance.expected”
[184] “vf_vst_counts.GSM9493327_variance.standardized” [185] “vf_vst_counts.GSM9493327_variable”
[186] “vf_vst_counts.GSM9493327_rank”
[187] “vf_vst_counts.GSM9493328_mean”
[188] “vf_vst_counts.GSM9493328_variance”
[189] “vf_vst_counts.GSM9493328_variance.expected”
[190] “vf_vst_counts.GSM9493328_variance.standardized” [191] “vf_vst_counts.GSM9493328_variable”
[192] “vf_vst_counts.GSM9493328_rank”
[193] “var.features”
[194] “var.features.rank”

There are 194 columns and they are divided into each sample but have the same column metric per sample of 6 metrics each. There are 32 samples times 6 metrics, that is 192 columns.

Our table with genes are listed as column 194 at the end of the columns with rank by gene across all samples for how differentially expressed that gene was across all samples as last column 195. So that there will be a data frame of 10 top genes and show how they are expressed across each sample from the sequencing data of single cell RNA sequencing information in the array.

DataFrame10 <- m[['RNA']]@meta.data
DataFrame10[,c(1:12,185:194)]

Here is a snippet of the DataFrame10 first and last columns in the viewer. beginning columns of top 10 genes of all samples beginning columns of top 10 genes of all samples

I only selected the first 10 columns and last 9 columns to show the samples those features explain. Otherwise, a huge amount of scrolling would be needed to look at all the 195 columns in this data for our top 10 most differentially expressed genes.

Lets write this out to csv and go back to the analysis.

write.csv(DataFrame10, 'top10DE_genes_NKTCL_EBV.csv', row.names=FALSE)
#write.csv(DataFrame10, 'top10DE_genes_NKTCL_EBV-UNSCALED.csv', row.names=FALSE)

We won’t use that now and maybe won’t at all. But it is good to have. Here is a copy of it if you would like it, access it here. I went through and separately made an unscaled to log transformation version of top 10 most differentially expressed genes to access here.

rm(m,df,all_genes, DataFrame10, top10Data)

=================================================================

We can stop and then return to read back in the Seurat object where we left off.

merged <- readRDS("Seurat_filteredNormalized10kScaledLogScaled2kFeatures")

It reads in the same file size of 25.4 GB.

Analysis after filtering, normalizing, retrieving the 2,000 most differetially expressed genes, and scaling the data

#Dimensionality Reduction with PCA allow us to visualize multidimensional datasets, with common PCA principal component analysis, summarizing all gene expression data over many genes into principal components (the variation between scatterpoints are the components, variations off the regression lines or dimension, adding components to top components will give most variation and explain most of data).

Loadings is a term for PCA that explains a gene that is strongly associated with that principal component. They are ordered with PC1 most explanative and subsequent add to that explanation in order of best.

The positive and negative genes are named for how ever many features you set for the number of PCs that explain the most.

merged <- RunPCA(merged, features = VariableFeatures(merged))

print(merged[['pca']], dims = 1:5, nfeatures=5)

PCA 1st screenshot PCA 2nd screenshot

merged@assays$RNA

Assay (v5) data with 30960 features for 370409 cells Top 10 variable features: PPBP, IGLC2, IGKC, S100A9, EREG, IGLC3, SOX5, PF4, LYZ, LINGO2 Layers: counts.GSM9493334, counts.GSM9493335, counts.GSM9493336, counts.GSM9493337, counts.GSM9493338, counts.GSM9493339, counts.GSM9493340, counts.GSM9493341, counts.GSM9493342, counts.GSM9493343, counts.GSM9493344, counts.GSM9493345, counts.GSM9493346, counts.GSM9493347, counts.GSM9493348, counts.GSM9493349, counts.GSM9493350, counts.GSM9493351, counts.GSM9493332, counts.GSM9493333, counts.GSM9493320, counts.GSM9493329, counts.GSM9493330, counts.GSM9493331, counts.GSM9493321, counts.GSM9493322, counts.GSM9493323, counts.GSM9493324, counts.GSM9493325, counts.GSM9493326, counts.GSM9493327, counts.GSM9493328, data.GSM9493334, data.GSM9493335, data.GSM9493336, data.GSM9493337, data.GSM9493338, data.GSM9493339, data.GSM9493340, data.GSM9493341, data.GSM9493342, data.GSM9493343, data.GSM9493344, data.GSM9493345, data.GSM9493346, data.GSM9493347, data.GSM9493348, data.GSM9493349, data.GSM9493350, data.GSM9493351, data.GSM9493332, data.GSM9493333, data.GSM9493320, data.GSM9493329, data.GSM9493330, data.GSM9493331, data.GSM9493321, data.GSM9493322, data.GSM9493323, data.GSM9493324, data.GSM9493325, data.GSM9493326, data.GSM9493327, data.GSM9493328, scale.data

We can take a look at all commands for this object saved when we saved it and with the PCA command added. We go to the ‘commands’ tab to expand on those commands and see the elements each has, there will be 4 commands after running the PCA command.

commands history on object
commands history on object

The heatmap will have the explanatory genes next to the rows as labels and the visual will show the expression profiles for these genes in the first PC or which ever PC is selected.

DimHeatmap(merged, dims = 1, cells = 500, balanced = TRUE)
Heatmap Top Differentially Expressed Genes after PCA
Heatmap Top Differentially Expressed Genes after PCA

The dim plot is a scatter plot of a named PC for principal component, against another PC You will see cells group in clusters by PC comparison 1 by 1.

DimPlot(merged, reduction = "pca") + NoLegend()
clustering color coded after PCA
clustering color coded after PCA

Determine dimensionality of the data with Elbow Plot

This type of plot, elbow plot, shows standard deviation on y-axis and the principal components or PCs on x-axis to see the threshold limit of number of dimensions or PCs to use, it starts high and drops until plateuing at elbow bend, won’t be useful to use additional that add little variation and more computing.

ElbowPlot(merged)
Elbow Plot of most DE PCAs to use explaining most variation
Elbow Plot of most DE PCAs to use explaining most variation

===================================================

This one seemed like it froze computer but was just really slow, took 1 1/2 hours to run, in console it was running, but just faded and spinning circle in the source code window of document.

Clustering using K-Nearest Neighbor

Basically the lower resolution the least clusters and higher resolution has more clusters

merged <- FindNeighbors(merged, dims = 1:7) #based on elbow plot above
merged <- FindClusters(merged, resolution = c(0.1, 0.3, 0.5, 0.7,1))

Number of nodes: 370409 Number of edges: 10184126

Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9724 Number of communities: 9 Elapsed time: 453 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 370409 Number of edges: 10184126

Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9473 Number of communities: 17 Elapsed time: 424 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 370409 Number of edges: 10184126

Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9334 Number of communities: 25 Elapsed time: 441 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 370409 Number of edges: 10184126

Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9223 Number of communities: 28 Elapsed time: 512 seconds Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 370409 Number of edges: 10184126

Running Louvain algorithm… 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| Maximum modularity in 10 random starts: 0.9093 Number of communities: 34 Elapsed time: 550 seconds

There will be added columns to the matrix object of our for each of the resolutions such as ‘RNA_snn_res0.1’ and ‘RNA_snn_res0.3’, and so on to 1. And the integer values in the new columns indicates to which cluster each cell (or barcode) belongs.

The results are color coded for each cluster. Since gene expression is basically the cell differentiation of becoming different cell types, usually, different clusters represents different cell types.

dim(merged@meta.data)

[1] 370409 10

colnames(merged@meta.data)

[1] “orig.ident” “nCount_RNA” “nFeature_RNA” “percent_mt”
[5] “RNA_snn_res.0.1” “RNA_snn_res.0.3” “RNA_snn_res.0.5” “RNA_snn_res.0.7” [9] “RNA_snn_res.1” “seurat_clusters”

head(merged@meta.data)

The meta.data now has 10 columns instead of 4 and each resolution is a column with the number of the cluster for each cell in that respective resolution column.

The table of Merged meta.data with 10 columns where each resolution has a column
The table of Merged meta.data with 10 columns where each resolution has a column

The data set may interest someone in many clusters that overlap that could be useful for identifying subset of cells interested in.

DimPlot(merged, group.by = 'RNA_snn_res.1', label = TRUE)
resolution 1 with 33 clusters
resolution 1 with 33 clusters
DimPlot(merged, group.by = 'RNA_snn_res.0.1', label = TRUE)
resolution 0.1 cluster of 8
resolution 0.1 cluster of 8

Set identity of clusters makes more sense than using more clusters with resolution 1

Idents(merged) <- 'RNA_snn_res.0.1' 
merged object where clusters are located
merged object where clusters are located
merged object where commands are located further down list from clusters
merged object where commands are located further down list from clusters

UMap as a nonlinear dimensionality reduction, better results than tSNE

Gives a clearer visualization of separate clusters with space between each

merged <- RunUMAP(merged, dims = 1:15)
Output after running UMAP screenshot
Output after running UMAP screenshot

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric To use Python UMAP via reticulate, set umap.method to ‘umap-learn’ and metric to ‘correlation’ This message will be shown once per session 18:17:22 UMAP embedding parameters a = 0.9922 b = 1.112 18:17:22 Read 370409 rows and found 15 numeric columns 18:17:22 Using Annoy for neighbor search, n_neighbors = 30 18:17:22 Building Annoy index with metric = cosine, n_trees = 50 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| 18:17:50 Writing NN index file to temp file C:0CQbbW51ac33c236d6 18:17:51 Searching Annoy index using 1 thread, search_k = 3000 18:20:17 Annoy recall = 100% 18:20:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30 18:20:32 Initializing from normalized Laplacian + noise (using RSpectra) 18:21:16 Commencing optimization for 200 epochs, with 15948322 positive edges 18:21:16 Using rng type: pcg Using method ‘umap’ 0% 10 20 30 40 50 60 70 80 90 100% [—-|—-|—-|—-|—-|—-|—-|—-|—-|—-| **************************************************| 18:27:54 Optimization finished

Next, identify the clusters and look at genetic markers for different cell types and identify different cell types, not in this tutorial but another one

DimPlot(merged, reduction = 'umap')

Optimized UMap clusters using Seurat UMap function The clusters look more defined than in clusters from k-nearest neighbors.

merged object now has additional commands
merged object now has additional commands

We didn’t use tSNE but might be added later. Thanks and keep checking in. Our object is now 26.3 GB in file size and has 7 commands.

Save the Seurat object to use later

saveRDS(merged, file='mergedClustersUMap.RDS')

===================================================

Thanks again, keep checking back. I will see about adding in the tSNE option as well if it is available in Seurat. Seurat has its own implementation of UMap algorithm.

There are more in the tutorials by biostatsquid on this as well as many from the Single Cell Genomic, Transcriptomic, & Proteomic youtube channel. This will make a great addition of genes to add to our main database on genes with changes from healthy to pathological state that can be used to predict a pathology from healthy in a machine learning algorithm based on similar types of biological media like PBMCs as most are, or skeletal tissue for the fibromyalgia case. This is work on this project for unsupervised machine learning but we can still use this data as is in the data frame created and shared in the document before and after log scaling it but both having been filtered, normalized, and scaled by 10,000. There are also gene clusters related to each pathology and EBV actually has its own set of gene markers that could be used cross-comparatively with this same data base we are building as we add more pathology associated EBV data to it. One such online resource for obtaining those gene clusters by pathology or system of body is KEGG for Kyoto Encyclopedia of Genes and Genomes. Bioinformatics has a lot of depth but once you get the main work flows and understanding of concepts then it flows naturally and makes sense. Some of the language overlaps and means something different than other professions or within its own community. Understanding like the cell of the human cell and then this cell in here for the array so when it is single cell RNA Sequencing, it is not discussing the cell of the mammal or organism but the cells of the arrays that can hold hundreds of thousands of cells for sequencing, like this original array size of 407,006 cells and 30,960 genes. We only made a data frame of the top 10 genes, but we will need to go back and make a data frame of all genes and extract those cluster genes of EBV and any other questions or ideas for projects or new features or directions to go using that data later.

This project is not complete, but most of the unsupervised machine learning was done for PCA and top genes that explain most variation and have highest change from healthy to pathology, clustering with k-nearest neighbors and then with UMap to get sharp well defined boundaries that cells fall into.