This is a project with gastric carcinoma samples that looks at the metastasis in peritoneal stomach tissue and in the stomach. The study is GSE308231. There is a published article on this gene expression data available here.
In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article.
There is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn’t make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data.
We will be using Seurat for these samples. There are 6 samples in total, and 3 are gastric carcinoma while the other 3 samples are peritoneal metastasis of gastric carcinoma. This is single cell RNA data, expression profiling by highthroughput sequencing of transcriptomic RNA data of complementary DNA.
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rmarkdown)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
Lets read in the series information.
series <- read.table("GSE308231_series_matrix.txt", nrow=24)
paged_table(series)
series_b <- read.table("GSE308231_series_matrix.txt", skip=24, nrow=39)
paged_table(series_b)
series_c <- read.table("GSE308231_series_matrix.txt", skip=42, nrow=22)
paged_table(series_c)
series_d <- read.table("GSE308231_series_matrix.txt", skip=65, nrow=1)
Series <- rbind(series_b,series_c,series_d)
rm(series_b,series_c,series_d) #getting too cluttered already
paged_table(Series)
Lets keep the ID, GSE ID, and tissue sample type. I had to skip some rows because they didn’t have the same number of elements in extracting the tabular information from the text data.
ID_series <- Series[c(1,2,8,11),]
paged_table(ID_series)
The CA1-CA3 are stomach tumors in Gastric carcinoma, and the F14-F16 are the metastasis of stomach tumors into the peritoneal lining. This was from the sample page as well. The identification of cancer or tumor location is in the ‘cell line’ description.
Now we can start reading in the files with Seurat. These folders have the zipped barcodes, features and matrix.mtx files in .gz format, you don’t have to unzip them, but these are in each folder labeled by GSM ID for the sample they are in this series data. Each folder is already in my directory of this rmarkdown document. Just call the folder and the files will be unzipped into a Seurat object. I added the extended name of ’_CA1’ through ’_F16’ to each sample folder to clarify.
GSM9240670 <- Read10X(data.dir = "GSM9240670_CA1")
We can clarify how many barcodes we want for each gene. We want at least 3 barcodes per gene or 3 cells per gene, and drop any gene without at least 3 barcodes. And for features at least 200. We will use last 2 digits to name it as the sample ID here.
sample70 <- CreateSeuratObject(counts=GSM9240670, project="GSM9240670", min.cells=3, min.features=200)
class(sample70)
## [1] "Seurat"
## attr(,"package")
## [1] "SeuratObject"
Look at the first 20 columns of barcodes.
colnames(sample70)[1:20]
## [1] "AAACCCAAGCCGATTT-1" "AAACCCACATGACAAA-1" "AAACCCAGTTCAATCG-1"
## [4] "AAACCCATCAAACCTG-1" "AAACCCATCAGTCATG-1" "AAACCCATCTGTAAGC-1"
## [7] "AAACGAAAGCCATTCA-1" "AAACGAACACCCTTGT-1" "AAACGAATCCTTCTTC-1"
## [10] "AAACGAATCGCAATTG-1" "AAACGCTAGAAGGATG-1" "AAACGCTAGTAACGTA-1"
## [13] "AAACGCTAGTGATAAC-1" "AAACGCTAGTTGTCAC-1" "AAACGCTCAATCGCGC-1"
## [16] "AAACGCTCAGCAGGAT-1" "AAACGCTTCATTGTTC-1" "AAACGCTTCCCGAGTG-1"
## [19] "AAAGAACAGCGAGTAC-1" "AAAGAACGTAACATGA-1"
Look at the first 20 rows of genes.
rownames(sample70)[1:20]
## [1] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" "FAM87B"
## [6] "LINC01128" "LINC00115" "FAM41C" "AL645608.6" "AL645608.2"
## [11] "LINC02593" "SAMD11" "NOC2L" "KLHL17" "PLEKHN1"
## [16] "PERM1" "AL645608.7" "HES4" "ISG15" "AL645608.1"
sample70
## An object of class Seurat
## 28380 features across 7581 samples within 1 assay
## Active assay: RNA (28380 features, 0 variable features)
## 1 layer present: counts
paged_table(sample70@meta.data[1:20,])
saveRDS(sample70, file="RDS_objects/sample70")
GSM9240671 <- Read10X(data.dir = "GSM9240671_CA2")
sample71 <- CreateSeuratObject(counts=GSM9240671, project="GSM9240671", min.cells=3, min.features=200)
saveRDS(sample71, file="RDS_objects/sample71")
GSM9240672 <- Read10X(data.dir = "GSM9240672_CA3")
sample72 <- CreateSeuratObject(counts=GSM9240672, project="GSM9240672", min.cells=3, min.features=200)
saveRDS(sample72, file="RDS_objects/sample72")
GSM9240673 <- Read10X(data.dir = "GSM9240673_F14")
sample73 <- CreateSeuratObject(counts=GSM9240673, project="GSM9240673", min.cells=3, min.features=200)
saveRDS(sample73, file="RDS_objects/sample73")
GSM9240674 <- Read10X(data.dir = "GSM9240674_F15")
sample74 <- CreateSeuratObject(counts=GSM9240674, project="GSM9240674", min.cells=3, min.features=200)
saveRDS(sample74, file="RDS_objects/sample74")
GSM9240675 <- Read10X(data.dir = "GSM9240675_F16")
sample75 <- CreateSeuratObject(counts=GSM9240675, project="GSM9240675", min.cells=3, min.features=200)
saveRDS(sample75, file="RDS_objects/sample75")
We saved our 6 Seurat objects of the GSM samples of 3 gastric carcinoma and 3 gastric carcinoma that metastasized to the peritoneal tissue.
Now we merge these samples. But lets remove the GSM sample large matrix objects we don’t need.
rm(GSM9240670, GSM9240671, GSM9240672, GSM9240673, GSM9240674, GSM9240675)
mergedSamples <- merge(sample70, y=c(sample71, sample72, sample73, sample74, sample75), add.cell.ids = ls()[1:6])
str(mergedSamples)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 6
## .. .. .. .. ..$ counts.GSM9240670:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:21515992] 12 30 31 39 44 66 69 81 106 126 ...
## .. .. .. .. .. .. ..@ p : int [1:7582] 0 2351 3233 5429 12856 15819 17807 21294 24463 27239 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 7581
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:21515992] 1 1 1 1 1 1 1 2 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240671:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:24841051] 18 31 45 52 56 69 71 81 122 143 ...
## .. .. .. .. .. .. ..@ p : int [1:11051] 0 2378 5393 10193 13485 14322 17841 19678 21010 24119 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 11050
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:24841051] 1 1 1 1 1 1 1 1 5 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240672:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:26388472] 41 44 65 110 129 130 135 138 144 151 ...
## .. .. .. .. .. .. ..@ p : int [1:10708] 0 1852 4943 6683 8683 12823 15068 17285 19237 21098 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 10707
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:26388472] 1 1 1 1 1 2 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240673:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:40592441] 29 33 55 79 90 110 121 139 150 172 ...
## .. .. .. .. .. .. ..@ p : int [1:13457] 0 1803 5355 9584 10207 11926 15229 17238 21883 24648 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 13456
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:40592441] 1 1 3 2 1 1 5 2 2 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240674:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:32465719] 5 11 12 18 29 30 34 37 39 43 ...
## .. .. .. .. .. .. ..@ p : int [1:12843] 0 5983 6480 9218 12433 13898 20649 22481 24627 26066 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 12842
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:32465719] 1 5 1 6 1 5 1 1 2 8 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240675:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:30173389] 29 68 83 90 139 154 156 178 179 180 ...
## .. .. .. .. .. .. ..@ p : int [1:13011] 0 2279 2777 3813 4198 7293 13862 18642 20215 23195 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 13010
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:30173389] 1 4 1 1 1 2 2 3 1 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:68646, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 68646 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:32659, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 32659 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ default : int 6
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 32659 obs. of 0 variables
## .. .. .. ..@ misc : list()
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 68646 obs. of 3 variables:
## .. ..$ orig.ident : chr [1:68646] "GSM9240670" "GSM9240670" "GSM9240670" "GSM9240670" ...
## .. ..$ nCount_RNA : num [1:68646] 6570 7361 5615 45367 8295 ...
## .. ..$ nFeature_RNA: int [1:68646] 2351 882 2196 7427 2963 1988 3487 3169 2776 4420 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 6 levels "GSM9240670","GSM9240671",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 3 0
## ..@ commands : list()
## ..@ tools : list()
Above is the structure of this merged Seurat object of 6 samples. Lets save it and read it in to get used to reading in the RDS objects we save.
saveRDS(mergedSamples,file="RDS_objects/mergedSamples.RDS")
Lets read in this merged file. Clear out the environment to leave room for the large Seurat array, table of 10.9 Gb it says.
merged <- readRDS("RDS_objects/mergedSamples.RDS")
Our merged object is the same as the one we saved called mergedSamples, or should be. Lets see the structure of this object called merged.
str(merged)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 6
## .. .. .. .. ..$ counts.GSM9240670:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:21515992] 12 30 31 39 44 66 69 81 106 126 ...
## .. .. .. .. .. .. ..@ p : int [1:7582] 0 2351 3233 5429 12856 15819 17807 21294 24463 27239 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 7581
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:21515992] 1 1 1 1 1 1 1 2 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240671:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:24841051] 18 31 45 52 56 69 71 81 122 143 ...
## .. .. .. .. .. .. ..@ p : int [1:11051] 0 2378 5393 10193 13485 14322 17841 19678 21010 24119 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 11050
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:24841051] 1 1 1 1 1 1 1 1 5 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240672:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:26388472] 41 44 65 110 129 130 135 138 144 151 ...
## .. .. .. .. .. .. ..@ p : int [1:10708] 0 1852 4943 6683 8683 12823 15068 17285 19237 21098 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 10707
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:26388472] 1 1 1 1 1 2 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240673:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:40592441] 29 33 55 79 90 110 121 139 150 172 ...
## .. .. .. .. .. .. ..@ p : int [1:13457] 0 1803 5355 9584 10207 11926 15229 17238 21883 24648 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 13456
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:40592441] 1 1 3 2 1 1 5 2 2 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240674:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:32465719] 5 11 12 18 29 30 34 37 39 43 ...
## .. .. .. .. .. .. ..@ p : int [1:12843] 0 5983 6480 9218 12433 13898 20649 22481 24627 26066 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 12842
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:32465719] 1 5 1 6 1 5 1 1 2 8 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240675:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:30173389] 29 68 83 90 139 154 156 178 179 180 ...
## .. .. .. .. .. .. ..@ p : int [1:13011] 0 2279 2777 3813 4198 7293 13862 18642 20215 23195 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 13010
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:30173389] 1 4 1 1 1 2 2 3 1 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:68646, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 68646 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:32659, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 32659 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ default : int 6
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 32659 obs. of 0 variables
## .. .. .. ..@ misc : list()
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 68646 obs. of 3 variables:
## .. ..$ orig.ident : chr [1:68646] "GSM9240670" "GSM9240670" "GSM9240670" "GSM9240670" ...
## .. ..$ nCount_RNA : num [1:68646] 6570 7361 5615 45367 8295 ...
## .. ..$ nFeature_RNA: int [1:68646] 2351 882 2196 7427 2963 1988 3487 3169 2776 4420 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 6 levels "GSM9240670","GSM9240671",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 3 0
## ..@ commands : list()
## ..@ tools : list()
dim(merged)
## [1] 32659 68646
dim(mergedSamples)
## [1] 32659 68646
Lets remove mergedSamples and the samples we merged to make it. The processing will need this memory space as it can get heavy in processing and using a lot of memory to run commands.
rm(mergedSamples)
rm(sample70, sample71, sample72, sample73, sample74, sample75)
Lets look at the first 10 rows of the meta.data in the merged Seurat object. There are 68,646 rows of barcodes.
paged_table(merged@meta.data[1:10,])
##perc mt for mitochondrial DNA that indicates cell damage to mitochondrial organelle in cytoplasm from dying cell or damaged.
We add a column to our merged data called percent_mt.
merged[['percent_mt']] <- PercentageFeatureSet(merged, pattern = '^MT-')
paged_table(merged@meta.data)
The nCount is total number of molecules in a cell The nFeature is the total number of unique genes in each cell
This violin plot VlnPlot() is used as a Seurat library plot.
VlnPlot(merged, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol=3)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
Looks like most of the data is at around 3,000 to 4,000 counts, 5,000 features, and about 10-12 % mt. We will use that as our filtering to filter out the outliers.
Lets plot a line through our data of scatter points.
library(ggplot2)
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+
# Control axis limits using ggplot2 functions
xlim(0, 45000)+ # Set X-axis range
ylim(0, 45000) # Set Y-axis range
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1198 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1198 rows containing missing values or values outside the scale range
## (`geom_point()`).
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+
# Control axis limits using ggplot2 functions
xlim(0, 5000)+ # Set X-axis range
ylim(0, 5000) # Set Y-axis range
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 34371 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 34371 rows containing missing values or values outside the scale range
## (`geom_point()`).
Now, lets filter for the 5,000 counts, and 5,000 features, with no more than 12% mitochondrial DNA from the violin plots above.
#filtering
merged <- subset(merged, subset = nFeature_RNA < 3000 & percent_mt < 10 & nCount_RNA < 3000)
You can make your own thresholds on how you interpret visualizations
#Normalize data after removing filtered out cells
Normalization by default, use global scaling normalization of log normalize, default scaling of 10,000, and then get a log transformed result.
merged <- NormalizeData(
merged,
normalization.method = "LogNormalize",
scale.factor = 10000,
layer = "counts", # Input layer
new.layer = "lognorm" # Output layer
)
## Normalizing layer: counts.GSM9240670
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240671
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240672
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240673
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240674
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240675
## Warning: The following arguments are not used: new.layer
str(merged)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 12
## .. .. .. .. ..$ counts.GSM9240670:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:785896] 42 137 138 157 160 164 200 269 300 332 ...
## .. .. .. .. .. .. ..@ p : int [1:825] 0 983 1814 2162 2557 3188 4633 5951 7427 7908 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 824
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:785896] 1 1 1 1 3 1 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240671:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2650402] 30 42 122 152 161 242 309 337 382 406 ...
## .. .. .. .. .. .. ..@ p : int [1:2561] 0 509 1777 2930 3972 4876 5591 6920 8006 9214 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 2560
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2650402] 1 1 2 1 1 1 1 1 2 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240672:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1529215] 62 80 84 108 119 173 224 230 241 252 ...
## .. .. .. .. .. .. ..@ p : int [1:1393] 0 1340 2694 4080 5598 7021 7692 8917 10204 11683 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 1392
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1529215] 2 1 1 1 1 1 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240673:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1680987] 55 159 246 322 336 367 387 447 546 581 ...
## .. .. .. .. .. .. ..@ p : int [1:1589] 0 623 1371 2995 3451 3842 4860 5788 7326 8858 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 1588
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1680987] 1 1 1 1 1 1 1 1 1 3 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240674:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2762722] 188 199 417 420 536 567 603 790 849 868 ...
## .. .. .. .. .. .. ..@ p : int [1:3222] 0 497 1962 3401 3849 4534 5016 5407 6043 6556 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 3221
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2762722] 1 1 1 1 1 1 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240675:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:5153829] 21 42 52 68 79 118 124 139 165 179 ...
## .. .. .. .. .. .. ..@ p : int [1:5054] 0 1036 2542 3007 4245 5241 6015 7576 8620 9116 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 5053
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:5153829] 1 1 1 1 1 2 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240670 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:785896] 42 137 138 157 160 164 200 269 300 332 ...
## .. .. .. .. .. .. ..@ p : int [1:825] 0 983 1814 2162 2557 3188 4633 5951 7427 7908 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 824
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:785896] 2.1 2.1 2.1 2.1 3.12 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240671 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2650402] 30 42 122 152 161 242 309 337 382 406 ...
## .. .. .. .. .. .. ..@ p : int [1:2561] 0 509 1777 2930 3972 4876 5591 6920 8006 9214 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 2560
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2650402] 1.74 1.74 2.34 1.74 1.74 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240672 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1529215] 62 80 84 108 119 173 224 230 241 252 ...
## .. .. .. .. .. .. ..@ p : int [1:1393] 0 1340 2694 4080 5598 7021 7692 8917 10204 11683 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 1392
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1529215] 2.17 1.59 1.59 1.59 1.59 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240673 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1680987] 55 159 246 322 336 367 387 447 546 581 ...
## .. .. .. .. .. .. ..@ p : int [1:1589] 0 623 1371 2995 3451 3842 4860 5788 7326 8858 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 1588
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1680987] 2.28 2.28 2.28 2.28 2.28 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240674 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2762722] 188 199 417 420 536 567 603 790 849 868 ...
## .. .. .. .. .. .. ..@ p : int [1:3222] 0 497 1962 3401 3849 4534 5016 5407 6043 6556 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 3221
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2762722] 2.96 2.96 2.96 2.96 2.96 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240675 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:5153829] 21 42 52 68 79 118 124 139 165 179 ...
## .. .. .. .. .. .. ..@ p : int [1:5054] 0 1036 2542 3007 4245 5241 6015 7576 8620 9116 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 5053
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:5153829] 2.11 2.11 2.11 2.11 2.11 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:14638, 1:12] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:14638] "ID_series_AAACGCTAGTAACGTA-1" "ID_series_AAAGAACGTAACATGA-1" "ID_series_AAAGGGCAGATGGCGT-1" "ID_series_AAAGGGCCATACGCAT-1" ...
## .. .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 14638 12
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:14638] "ID_series_AAACGCTAGTAACGTA-1" "ID_series_AAAGAACGTAACATGA-1" "ID_series_AAAGGGCAGATGGCGT-1" "ID_series_AAAGGGCCATACGCAT-1" ...
## .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:32659, 1:12] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 32659 12
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ default : int 6
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 32659 obs. of 0 variables
## .. .. .. ..@ misc : list()
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 14638 obs. of 4 variables:
## .. ..$ orig.ident : chr [1:14638] "GSM9240670" "GSM9240670" "GSM9240670" "GSM9240670" ...
## .. ..$ nCount_RNA : num [1:14638] 1391 1350 591 796 1251 ...
## .. ..$ nFeature_RNA: int [1:14638] 983 831 348 395 631 1445 1318 1476 481 430 ...
## .. ..$ percent_mt : num [1:14638] 1.73 7.93 2.03 1.76 2.64 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 6 levels "GSM9240670","GSM9240671",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:14638] "ID_series_AAACGCTAGTAACGTA-1" "ID_series_AAAGAACGTAACATGA-1" "ID_series_AAAGGGCAGATGGCGT-1" "ID_series_AAAGGGCCATACGCAT-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 3 0
## ..@ commands :List of 1
## .. ..$ NormalizeData.RNA:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
## .. .. .. ..@ name : chr "NormalizeData.RNA"
## .. .. .. ..@ time.stamp : POSIXct[1:1], format: "2026-04-05 18:12:54"
## .. .. .. ..@ assay.used : chr "RNA"
## .. .. .. ..@ call.string: chr [1:2] "NormalizeData(merged, normalization.method = \"LogNormalize\", " " scale.factor = 10000, layer = \"counts\", new.layer = \"lognorm\")"
## .. .. .. ..@ params :List of 5
## .. .. .. .. ..$ assay : chr "RNA"
## .. .. .. .. ..$ normalization.method: chr "LogNormalize"
## .. .. .. .. ..$ scale.factor : num 10000
## .. .. .. .. ..$ margin : num 1
## .. .. .. .. ..$ verbose : logi TRUE
## ..@ tools : list()
merged
## An object of class Seurat
## 32659 features across 14638 samples within 1 assay
## Active assay: RNA (32659 features, 0 variable features)
## 12 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675
We have layers of counts and data now in our Seurat object and it is smaller due to filtering, now it has 32,659 features and 14,638 samples in this 1 assay.
merged@assays$RNA
## Assay (v5) data with 32659 features for 14638 cells
## First 10 features:
## AL627309.1, AL627309.5, AP006222.2, LINC01409, FAM87B, LINC01128,
## LINC00115, FAM41C, AL645608.6, AL645608.2
## Layers:
## counts.GSM9240670, counts.GSM9240671, counts.GSM9240672,
## counts.GSM9240673, counts.GSM9240674, counts.GSM9240675,
## data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673,
## data.GSM9240674, data.GSM9240675
merged@assays[["RNA"]]@layers$counts.GSM9240670[1:25,1:25]
## 25 x 25 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [2,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [3,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [4,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [5,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [6,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [7,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [8,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [9,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [10,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [11,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [12,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [13,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [14,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [15,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [16,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [17,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [18,] . . . . . . . 1 . . . . 1 2 1 . . . . . . . . 2 .
## [19,] . 2 . . 2 . . 1 2 . . . . 1 1 . . . . 1 2 . 2 1 1
## [20,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [21,] . . . . . . . . . . . . . . . 1 . . . . . . . . .
## [22,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [23,] . . . . . . . . . . . . . . . . . . . . 1 . . . 1
## [24,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [25,] . . . . . . . . . . . . . . . . . . . . . . . . .
The following shows the sparce matrices for the first layer by counts, but limited because it is huge and there are many layers, this is just the first sample layers. This list could be very large if not limited. There are only integer values where the gene is present in the cell or barcode column of assay, the 0s are held with the period or dot.
dim(merged@assays[["RNA"]]@layers$counts.GSM9240670)
## [1] 28380 824
You can see not all features are counted in the cells shown.
The default is to return 2000 features per dataset, and ‘vst’ stands for ‘variance stabilizing transformation’ as a default as well.
merged <- FindVariableFeatures(merged, selection.method = 'vst', nfeatures = 2000)
## Finding variable features for layer counts.GSM9240670
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -2.6149
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 6.2046e-15
## Finding variable features for layer counts.GSM9240671
## Finding variable features for layer counts.GSM9240672
## Finding variable features for layer counts.GSM9240673
## Finding variable features for layer counts.GSM9240674
## Finding variable features for layer counts.GSM9240675
Now look at the merged object to see that variable features should show 2000.
merged
## An object of class Seurat
## 32659 features across 14638 samples within 1 assay
## Active assay: RNA (32659 features, 2000 variable features)
## 12 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675
We should limit the list as it will print out a lot of the values. But only those features in the list of 2000 highly variable features or having the highest differential expression values where cells changed the most from healthy to pathological state are listed by name, all others are listed as ‘NA’ values.
merged[["RNA"]]@meta.data$var.features[1:500]
## [1] NA NA NA NA NA
## [6] NA NA NA NA NA
## [11] NA NA NA NA NA
## [16] NA NA "HES4" "ISG15" NA
## [21] NA NA NA NA NA
## [26] NA NA NA "TNFRSF18" "TNFRSF4"
## [31] NA NA NA NA NA
## [36] NA NA NA NA NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
## [51] NA "VWA1" NA NA NA
## [56] NA NA NA NA NA
## [61] NA NA NA NA NA
## [66] NA NA NA NA NA
## [71] NA NA NA NA NA
## [76] NA NA NA NA NA
## [81] NA NA NA NA NA
## [86] NA NA NA NA NA
## [91] NA NA NA NA NA
## [96] NA NA NA NA NA
## [101] NA NA NA NA NA
## [106] NA NA NA NA NA
## [111] NA NA NA NA NA
## [116] NA NA NA NA NA
## [121] NA NA NA NA NA
## [126] NA NA NA NA NA
## [131] NA NA NA NA NA
## [136] NA NA NA NA NA
## [141] NA NA NA NA NA
## [146] NA NA NA NA NA
## [151] NA NA NA NA NA
## [156] NA "TNFRSF9" NA NA NA
## [161] "ERRFI1" NA NA NA NA
## [166] NA NA NA NA NA
## [171] NA NA NA NA NA
## [176] NA NA NA NA NA
## [181] NA NA NA NA NA
## [186] NA NA NA "RBP7" NA
## [191] NA NA NA NA NA
## [196] NA NA NA NA NA
## [201] NA NA NA NA NA
## [206] NA NA NA NA NA
## [211] NA NA NA NA NA
## [216] NA NA NA NA NA
## [221] NA NA NA NA NA
## [226] NA NA "AL021155.5" NA NA
## [231] NA NA NA NA NA
## [236] NA NA NA NA NA
## [241] NA NA NA "KAZN" NA
## [246] NA NA NA NA NA
## [251] NA NA NA NA NA
## [256] NA NA NA NA NA
## [261] NA NA NA NA NA
## [266] NA NA NA NA NA
## [271] NA NA NA NA NA
## [276] NA NA NA NA NA
## [281] NA NA NA NA NA
## [286] NA NA NA NA NA
## [291] NA NA NA "CROCC" NA
## [296] NA NA NA NA NA
## [301] NA NA NA NA "PADI4"
## [306] NA NA NA NA NA
## [311] "IGSF21" NA NA NA NA
## [316] NA NA NA NA NA
## [321] NA NA NA NA NA
## [326] NA NA NA NA NA
## [331] NA NA "PLA2G2A" NA NA
## [336] NA NA NA NA NA
## [341] NA NA NA NA NA
## [346] NA NA NA NA NA
## [351] NA NA NA NA NA
## [356] NA NA NA NA NA
## [361] NA NA NA NA NA
## [366] NA "HSPG2" NA NA NA
## [371] NA NA NA NA NA
## [376] NA NA NA "C1QA" "C1QC"
## [381] "C1QB" NA NA NA NA
## [386] NA NA NA NA NA
## [391] NA NA NA NA NA
## [396] NA NA "ID3" NA NA
## [401] NA NA NA NA NA
## [406] NA NA "FUCA1" "CNR2" NA
## [411] NA NA NA NA NA
## [416] NA NA NA NA NA
## [421] NA NA NA NA NA
## [426] NA NA "CLIC4" NA NA
## [431] NA NA NA NA NA
## [436] NA NA NA NA NA
## [441] NA NA NA NA NA
## [446] NA NA NA NA NA
## [451] NA "STMN1" NA NA NA
## [456] NA NA NA NA NA
## [461] NA NA NA NA NA
## [466] NA NA NA NA NA
## [471] NA NA NA NA NA
## [476] NA NA NA NA NA
## [481] NA NA NA NA NA
## [486] NA NA NA NA NA
## [491] NA NA NA NA NA
## [496] NA NA NA NA NA
We only printed out the first 500 variable features and only the genes in the top 2,000 genes with high variability are shown all others are NA.
top10 <- head(VariableFeatures(merged), 10)
top10
## [1] "IGKC" "TPSB2" "CCL4" "CXCL8" "S100A9" "IGHG1" "GNLY" "LYZ"
## [9] "JCHAIN" "COL3A1"
top10_plot <- VariableFeaturePlot(merged)
LabelPoints(plot = top10_plot, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 4279 rows containing missing values or values outside the scale range
## (`geom_point()`).
This corrects for unwanted sources of variation, technical as example with batch effects, or biological such as cell cycle stage or mitochondrial contamination, this ensures results are due to clustering by gene differences and not technological or biological errors.
The ScaleData function is part of standard parameters, there is a more complex single cell Transform function or workflow within Seurat that could be alternatively used.
First to make a character string stored as the rownames of the structural array we have been using mergedFilteredNormalizedScaled. Lets look at the first 100 genes as well.
all_genes <- rownames(merged)
all_genes[1:100]
## [1] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" "FAM87B"
## [6] "LINC01128" "LINC00115" "FAM41C" "AL645608.6" "AL645608.2"
## [11] "LINC02593" "SAMD11" "NOC2L" "KLHL17" "PLEKHN1"
## [16] "PERM1" "AL645608.7" "HES4" "ISG15" "AL645608.1"
## [21] "AGRN" "RNF223" "C1orf159" "AL390719.3" "LINC01342"
## [26] "AL390719.2" "TTLL10-AS1" "TTLL10" "TNFRSF18" "TNFRSF4"
## [31] "SDF4" "B3GALT6" "C1QTNF12" "AL162741.1" "UBE2J2"
## [36] "LINC01786" "SCNN1D" "ACAP3" "PUSL1" "INTS11"
## [41] "CPTP" "TAS1R3" "DVL1" "MXRA8" "AURKAIP1"
## [46] "CCNL2" "MRPL20-AS1" "MRPL20" "AL391244.2" "ANKRD65"
## [51] "LINC01770" "VWA1" "ATAD3C" "ATAD3B" "ATAD3A"
## [56] "TMEM240" "SSU72" "AL645728.1" "FNDC10" "AL691432.4"
## [61] "AL691432.2" "MIB2" "MMP23B" "CDK11B" "FO704657.1"
## [66] "SLC35E2B" "CDK11A" "SLC35E2A" "NADK" "GNB1"
## [71] "AL109917.1" "TMEM52" "CFAP74" "AL391845.2" "GABRD"
## [76] "AL391845.1" "PRKCZ" "AL590822.2" "PRKCZ-AS1" "FAAP20"
## [81] "AL590822.1" "SKI" "AL590822.3" "MORN1" "AL513477.2"
## [86] "RER1" "PEX10" "PLCH2" "PANK4" "HES5"
## [91] "AL139246.5" "TNFRSF14-AS1" "TNFRSF14" "AL139246.3" "PRXL2B"
## [96] "MMEL1" "TTC34" "PRDM16-DT" "PRDM16" "AL590438.1"
length(all_genes)
## [1] 32659
There are 32,659 genes that made it through quality control filtering of less than 10% mitochondrial DNA, and less than 3,000 each of nFeatures_RNA and nCounts_RNA.
The row names of the object, merged, have the gene names, when we looked for the highest variation genes in most differentially expressed for 2000 genes with
merged <- FindVariableFeatures(merged, selection.method = ‘vst’, nfeatures = 2000)
Do they line up to the row names of merged and where are they located.
varFeatures <- merged[["RNA"]]@meta.data$var.features
genes <- rownames(merged)
df <- data.frame(Gene = as.factor(genes), FeatureVariability = as.factor(varFeatures))
head(df,20)
## Gene FeatureVariability
## 1 AL627309.1 <NA>
## 2 AL627309.5 <NA>
## 3 AP006222.2 <NA>
## 4 LINC01409 <NA>
## 5 FAM87B <NA>
## 6 LINC01128 <NA>
## 7 LINC00115 <NA>
## 8 FAM41C <NA>
## 9 AL645608.6 <NA>
## 10 AL645608.2 <NA>
## 11 LINC02593 <NA>
## 12 SAMD11 <NA>
## 13 NOC2L <NA>
## 14 KLHL17 <NA>
## 15 PLEKHN1 <NA>
## 16 PERM1 <NA>
## 17 AL645608.7 <NA>
## 18 HES4 HES4
## 19 ISG15 ISG15
## 20 AL645608.1 <NA>
The location of any of the genes if they were a high variability gene top 2000 is listed next to the correct row name in a random 20 sample draw from first 20 samples.
Are there only 2000 non NA values in our varFeatures set to find only 2000?
class(varFeatures)
## [1] "character"
length(unique(varFeatures))
## [1] 2001
There are 2001 unique values in a 32,659 long vector of genes set to be the label if it is one of the 2000 most differentially expressed genes. The additional character is the NA value.
Lets scale the data now.
#merged <- ScaleData(merged, features = all_genes)
merged <- ScaleData(merged, features = VariableFeatures(merged))
## Centering and scaling data matrix
m <- merged[which(row.names(merged) %in% top10),]
m
## An object of class Seurat
## 10 features across 14638 samples within 1 assay
## Active assay: RNA (10 features, 10 variable features)
## 13 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675, scale.data
unique(m$orig.ident)
## [1] "GSM9240670" "GSM9240671" "GSM9240672" "GSM9240673" "GSM9240674"
## [6] "GSM9240675"
top10Data <- m[["RNA"]]@meta.data
colnames(top10Data)
## [1] "vf_vst_counts.GSM9240670_mean"
## [2] "vf_vst_counts.GSM9240670_variance"
## [3] "vf_vst_counts.GSM9240670_variance.expected"
## [4] "vf_vst_counts.GSM9240670_variance.standardized"
## [5] "vf_vst_counts.GSM9240670_variable"
## [6] "vf_vst_counts.GSM9240670_rank"
## [7] "vf_vst_counts.GSM9240671_mean"
## [8] "vf_vst_counts.GSM9240671_variance"
## [9] "vf_vst_counts.GSM9240671_variance.expected"
## [10] "vf_vst_counts.GSM9240671_variance.standardized"
## [11] "vf_vst_counts.GSM9240671_variable"
## [12] "vf_vst_counts.GSM9240671_rank"
## [13] "vf_vst_counts.GSM9240672_mean"
## [14] "vf_vst_counts.GSM9240672_variance"
## [15] "vf_vst_counts.GSM9240672_variance.expected"
## [16] "vf_vst_counts.GSM9240672_variance.standardized"
## [17] "vf_vst_counts.GSM9240672_variable"
## [18] "vf_vst_counts.GSM9240672_rank"
## [19] "vf_vst_counts.GSM9240673_mean"
## [20] "vf_vst_counts.GSM9240673_variance"
## [21] "vf_vst_counts.GSM9240673_variance.expected"
## [22] "vf_vst_counts.GSM9240673_variance.standardized"
## [23] "vf_vst_counts.GSM9240673_variable"
## [24] "vf_vst_counts.GSM9240673_rank"
## [25] "vf_vst_counts.GSM9240674_mean"
## [26] "vf_vst_counts.GSM9240674_variance"
## [27] "vf_vst_counts.GSM9240674_variance.expected"
## [28] "vf_vst_counts.GSM9240674_variance.standardized"
## [29] "vf_vst_counts.GSM9240674_variable"
## [30] "vf_vst_counts.GSM9240674_rank"
## [31] "vf_vst_counts.GSM9240675_mean"
## [32] "vf_vst_counts.GSM9240675_variance"
## [33] "vf_vst_counts.GSM9240675_variance.expected"
## [34] "vf_vst_counts.GSM9240675_variance.standardized"
## [35] "vf_vst_counts.GSM9240675_variable"
## [36] "vf_vst_counts.GSM9240675_rank"
## [37] "var.features"
## [38] "var.features.rank"
DataFrame10 <- m[['RNA']]@meta.data
paged_table(DataFrame10)
write.csv(DataFrame10, 'top10DE_genes_gastricCarcinoma_GSE308231.csv', row.names=FALSE)
#Dimensionality Reduction with PCA allow us to visualize multidimensional datasets, with common PCA principal component analysis, summarizing all gene expression data over many genes into principal components (the variation between scatterpoints are the components, variations off the regression lines or dimension, adding components to top components will give most variation and explain most of data).
Loadings is a term for PCA that explains a gene that is strongly associated with that principal component. They are ordered with PC1 most explanative and subsequent add to that explanation in order of best.
The positive and negative genes are named for how ever many features you set for the number of PCs that explain the most.
merged <- RunPCA(merged, features = VariableFeatures(merged))
## PC_ 1
## Positive: RBMS3, SPARCL1, NFIB, LDB2, FBXL7, APBB2, LHFPL6, DLC1, KIAA1217, NRP1
## IGFBP7, TCF4, PTPRG, MAGI1, PRKG1, GSN, TIMP3, ZFPM2, KALRN, COL4A2
## CALD1, CD36, IGFBP4, EBF1, SASH1, RUNX1T1, RBPMS, RASAL2, CDC42BPA, PARD3B
## Negative: CD69, CXCR4, SRGN, TNFAIP3, CCL5, STAT4, SAMSN1, PDE3B, IL7R, ITK
## CD247, TRBC2, GNG2, CNOT6L, GZMA, THEMIS, TRAC, JUND, KLF6, CAMK4
## CCL4, CRYBG1, DUSP1, IKZF3, PTPN22, LINC01619, RGS1, TC2N, CBLB, DUSP2
## PC_ 2
## Positive: PDE4D, CCSER1, KIAA1217, MAGI1, KLF6, CDH13, MECOM, HDAC9, KALRN, NEAT1
## CD69, FCHSD2, CBLB, CEMIP2, CNOT6L, ABLIM1, ARL15, CADM2, PTPRM, ANO2
## ST6GALNAC3, PLCB1, LARGE1, WWOX, MACROD2, ANKRD28, JUN, RORA, CNTNAP2, CSMD1
## Negative: DCN, CST3, LUM, MGP, FTL, SERPINF1, SFRP2, COL1A2, COL1A1, COL3A1
## TIMP1, CFD, CCDC80, FBLN1, COL6A2, C1R, S100A11, FTH1, IFITM3, C3
## SPARC, CD63, MMP2, PCOLCE, LGALS1, EFEMP1, THY1, COL6A1, IGFBP4, C1S
## PC_ 3
## Positive: CXCL8, G0S2, AQP9, NAMPT, S100A8, S100A9, BCL2A1, MNDA, ACSL1, PLAUR
## CSF3R, PTGS2, SOD2, IL1R2, C5AR1, FPR1, FCGR3B, SAT1, TREM1, IRAK3
## DOCK4, MXD1, TLR2, FGD4, CLEC4E, LUCAT1, FTH1, PHACTR1, RBM47, PLEK
## Negative: CCL5, CD247, THEMIS, TC2N, GZMA, SAMD3, CBLB, ITK, LINC00861, CAMK4
## INPP4B, IL7R, TRAC, TRBC2, NKG7, GZMK, STAT4, PYHIN1, RORA, GPRIN3
## LINC01934, CNOT6L, ANK3, TOX, KLRK1, PPP2R2B, IKZF3, PDE3B, CST7, TRBC1
## PC_ 4
## Positive: CSMD1, RBFOX1, CNTNAP2, IL1RAPL1, CNTN5, LRP1B, DMD, CCDC26, DPP10, CTNNA3
## CTNNA2, LRRC4C, SGCZ, CSMD3, NRXN1, GRID2, NRXN3, PCDH15, PTCHD1-AS, NKAIN3
## PTPRT, MACROD2, PTPRD, IL1RAPL2, ANKS1B, AC011246.1, AL008633.1, NRG3, DPP6, AC084816.1
## Negative: BTNL9, VWF, PALMD, ABLIM3, ADGRL4, CLDN5, ANO2, CYYR1, CD36, EGFL7
## FABP4, RBP7, LDB2, PTPRB, CD300LG, CDH5, ADGRF5, DIPK2B, ARHGAP29, EMCN
## EPAS1, CCDC85A, PECAM1, CAV1, ADAMTS18, CA4, CALCRL, ERG, NOSTRIN, ADAMTS9
## PC_ 5
## Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, HLA-DRB5, HLA-DRB1, C1QA, C1QB, C1QC, HLA-DQA2
## IFI30, HLA-DQB1, MARCH1, MS4A6A, HBB, CD68, F13A1, CYBB, RNASE1, AIF1
## BANK1, CD14, HLA-DQA1, CTSZ, FOLR2, MS4A4A, CPVL, MS4A1, CTSB, CD163
## Negative: TNFAIP3, DUSP1, FOS, NEAT1, ELL2, ZFP36, NAMPT, SAMSN1, JUNB, KLF6
## JUND, FOSB, JUN, PPP1R15A, CBLB, RORA, FOXO3, SRGN, ADAMTSL4-AS1, ANKRD28
## CXCR4, G0S2, MXD1, IER2, FNDC3B, IVNS1ABP, CPD, CRY1, NIBAN1, CXCL8
print(merged[['pca']], dims = 1:5, nfeatures=5)
## PC_ 1
## Positive: RBMS3, SPARCL1, NFIB, LDB2, FBXL7
## Negative: CD69, CXCR4, SRGN, TNFAIP3, CCL5
## PC_ 2
## Positive: PDE4D, CCSER1, KIAA1217, MAGI1, KLF6
## Negative: DCN, CST3, LUM, MGP, FTL
## PC_ 3
## Positive: CXCL8, G0S2, AQP9, NAMPT, S100A8
## Negative: CCL5, CD247, THEMIS, TC2N, GZMA
## PC_ 4
## Positive: CSMD1, RBFOX1, CNTNAP2, IL1RAPL1, CNTN5
## Negative: BTNL9, VWF, PALMD, ABLIM3, ADGRL4
## PC_ 5
## Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, HLA-DRB5
## Negative: TNFAIP3, DUSP1, FOS, NEAT1, ELL2
merged@assays$RNA
## Assay (v5) data with 32659 features for 14638 cells
## Top 10 variable features:
## IGKC, TPSB2, CCL4, CXCL8, S100A9, IGHG1, GNLY, LYZ, JCHAIN, COL3A1
## Layers:
## counts.GSM9240670, counts.GSM9240671, counts.GSM9240672,
## counts.GSM9240673, counts.GSM9240674, counts.GSM9240675,
## data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673,
## data.GSM9240674, data.GSM9240675, scale.data
DimHeatmap(merged, dims = 1, cells = 500, balanced = TRUE)
The dim plot is a scatter plot of a named PC for principal component, against another PC You will see cells group in clusters by PC comparison 1 by 1.
DimPlot(merged, reduction = "pca") + NoLegend()
This shows std dev on y-axis and PCs on x-axis to see the threshold limit of number of dimensions or PCs to use, it starts high and drops until plateuing at elbow bend, won’t be useful to use additional that add little variation and more computing.
ElbowPlot(merged)
It looks like it could be at about 8 or 12, looks like 2 little bends at 8 and at 12 principal components.
Basically the lower resolution the least clusters and higher resolution has more clusters
merged <- FindNeighbors(merged, dims = 1:11)#chose this from elbow plot
## Computing nearest neighbor graph
## Computing SNN
merged <- FindClusters(merged, resolution = c(0.1, 0.3, 0.5, 0.7,1))
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9753
## Number of communities: 11
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9515
## Number of communities: 14
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9316
## Number of communities: 18
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 21
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8976
## Number of communities: 26
## Elapsed time: 1 seconds
There will be added columns to the matrix object of our merged@meta.data for each of the resolutios such as ‘RNA_snn_res0.1’ and ‘RNA_snn_res0.3’, and so on to 1. And the integer values in the new columns indicates to which cluster each cell (or barcode) belongs.
The results are color coded for each cluster. Since gene expression is basically the cell differentiation of becoming different cell types, usually, different clusters represents different cell types.
paged_table(merged@meta.data)
The data set may interest someone in many clusters that overlap that could be useful for identifying subset of cells interested in.
DimPlot(merged, group.by = 'RNA_snn_res.1', label = TRUE)
DimPlot(merged, group.by = 'RNA_snn_res.0.5', label = TRUE)
DimPlot(merged, group.by = 'RNA_snn_res.0.1', label = TRUE)
Set identity of clusters makes more sense than using more clusters with resolution 1
Idents(merged) <- 'RNA_snn_res.0.1'
Gives a clearer visualization of separate clusters with space between each
merged <- RunUMAP(merged, dims = 1:11)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 18:13:26 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:13:26 Read 14638 rows and found 11 numeric columns
## 18:13:26 Using Annoy for neighbor search, n_neighbors = 30
## 18:13:26 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:13:27 Writing NN index file to temp file C:\Users\jlcor\AppData\Local\Temp\Rtmpa23K6C\file23f05bc54f4
## 18:13:27 Searching Annoy index using 1 thread, search_k = 3000
## 18:13:30 Annoy recall = 100%
## 18:13:30 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 18:13:31 Initializing from normalized Laplacian + noise (using RSpectra)
## 18:13:32 Commencing optimization for 200 epochs, with 621096 positive edges
## 18:13:32 Using rng type: pcg
## 18:13:58 Optimization finished
Next, identify the clusters and look at genetic markers for different cell types and identify different cell types, not in this tutorial but another one
DimPlot(merged, reduction = 'umap')
Save the Seurat object to use later
saveRDS(merged, file='mergedClustersUMap.RDS')
Make a data frame of all the genes of variable features, only the top 2000 will be ranked.
allGenesDF <- merged[["RNA"]]@meta.data
dim(allGenesDF)
## [1] 32659 38
The variable features column will only have the name of the gene is it is one of the 2,000 highest variable feature.
I looked it up and it turns out you can get the gene name for all entries of 30,960 by using the rownames function on data object[[“RNA”]]
allGenesDF$genes <- rownames(merged[["RNA"]])
write.csv(allGenesDF, "all32659genes_gastricCarcinoma_rownamesAdded.csv", row.names=FALSE)
paged_table(allGenesDF[1:20,])
Here are only the highest variability genes of the top 2,000 in Seurat.
all2k <- subset(allGenesDF, allGenesDF$var.features != "NA")
all2k <- all2k[order(all2k$var.features.rank, decreasing=FALSE),]
This will show only the first 100 of the top 2,000 high variability genes in Seurat.
paged_table(all2k[1:100,35:39])
Now lets look at the tSNE unsupervised classifier.
merged <- RunTSNE(merged, dims = 1:11)
DimPlot(merged, reduction = 'tsne')
Both the tSNE and UMAP clustering in Seurat look much more precise and defined then just the K-Nearest Neighbor clustering.
Lets look at how our merged Seurat object looks now after doing all these processes on it.
merged
## An object of class Seurat
## 32659 features across 14638 samples within 1 assay
## Active assay: RNA (32659 features, 2000 variable features)
## 13 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675, scale.data
## 3 dimensional reductions calculated: pca, umap, tsne
It shows that it has 13 layers, is filtered, has 3 dimensionality reductions done after scaling the data, using PCA, UMAP, and TSNE. The K-Nearest Neighbor clustering was done using the PCA settings.
You can get the top10 genes and the statistical information with each of the 6 samples here.
You can get all 32,659 genes and the statistical information with each sample here.
Today we ran Seurat on gastric carcinoma 6 samples from GSE308231 on 3 samples of stomach cancer and 3 samples of metastatic gastric carcinoma in the peritoneal tissue surrounding the stomach. We did processing after filtering, normalizing, and scaling the data, then found principal components with PCA to do clustering with K-Nearest Neighbors using the idea number of PCs from the elbow plot of PCs, then used UMAP and TSNE unsupervised clustering to get more detailed and precise clusters of this data. We also found the 10 most variable genes from the 2,000 highest variability genes in Seurat from the two classes of stomach cancer, and also extracted the statistical data Seurat created for each sample and the names of the top 10, the 2,000 highest variable genes, and all genes that made the filter for less than 10% mitochondrial DNA and less than 3,000 count of features and counts of RNA.
Next part to this research will be on testing the top 10 genes with random forest to see how well they predict the sample, then use all the genes and extract only the count and gene name to do our generalized fold change values to get our top genes by fold change from stage 1 to stage 4 stomach cancer from stomach tumors to metastatic tumors spread outside the stomach tissue into the peritoneal lining. We will then add these genes of top 10 with Seurat and top fold change genes in top up regulated or enhanced to top down regulated or silenced into our pathologies database. I recommend reading the article that came with this gene expression data as it makes light of the study and shows some great deep learning machine learning data science to predict more severe tumors that will likely metastasize based on images they found on NCBI of various studies. They do mention that it is difficult to treat stomach cancer that metastasizes outside and into the peritoneal region due to the fibrotic tissue and that fibrous and cartilaginous or thick tissue lacks a rich blood supply to access the tumors and treat successfully with chemotherapy as this type of gastric carcinoma is the most difficult to treat. Their study seeks to find the tumors by shape and location as well as genes most abundant in those zones to target for treatment. I quick read the article and mostly the plots. The link is at the beginning of this project.
Thanks again, and keep checking in.
============================================================================
*** Part 2
Lets look at the columns of the top 10 genes.
colnames(DataFrame10)
## [1] "vf_vst_counts.GSM9240670_mean"
## [2] "vf_vst_counts.GSM9240670_variance"
## [3] "vf_vst_counts.GSM9240670_variance.expected"
## [4] "vf_vst_counts.GSM9240670_variance.standardized"
## [5] "vf_vst_counts.GSM9240670_variable"
## [6] "vf_vst_counts.GSM9240670_rank"
## [7] "vf_vst_counts.GSM9240671_mean"
## [8] "vf_vst_counts.GSM9240671_variance"
## [9] "vf_vst_counts.GSM9240671_variance.expected"
## [10] "vf_vst_counts.GSM9240671_variance.standardized"
## [11] "vf_vst_counts.GSM9240671_variable"
## [12] "vf_vst_counts.GSM9240671_rank"
## [13] "vf_vst_counts.GSM9240672_mean"
## [14] "vf_vst_counts.GSM9240672_variance"
## [15] "vf_vst_counts.GSM9240672_variance.expected"
## [16] "vf_vst_counts.GSM9240672_variance.standardized"
## [17] "vf_vst_counts.GSM9240672_variable"
## [18] "vf_vst_counts.GSM9240672_rank"
## [19] "vf_vst_counts.GSM9240673_mean"
## [20] "vf_vst_counts.GSM9240673_variance"
## [21] "vf_vst_counts.GSM9240673_variance.expected"
## [22] "vf_vst_counts.GSM9240673_variance.standardized"
## [23] "vf_vst_counts.GSM9240673_variable"
## [24] "vf_vst_counts.GSM9240673_rank"
## [25] "vf_vst_counts.GSM9240674_mean"
## [26] "vf_vst_counts.GSM9240674_variance"
## [27] "vf_vst_counts.GSM9240674_variance.expected"
## [28] "vf_vst_counts.GSM9240674_variance.standardized"
## [29] "vf_vst_counts.GSM9240674_variable"
## [30] "vf_vst_counts.GSM9240674_rank"
## [31] "vf_vst_counts.GSM9240675_mean"
## [32] "vf_vst_counts.GSM9240675_variance"
## [33] "vf_vst_counts.GSM9240675_variance.expected"
## [34] "vf_vst_counts.GSM9240675_variance.standardized"
## [35] "vf_vst_counts.GSM9240675_variable"
## [36] "vf_vst_counts.GSM9240675_rank"
## [37] "var.features"
## [38] "var.features.rank"
mean10 <- grep('mean',colnames(DataFrame10))
top10mean <- DataFrame10[,c(mean10,37,38)]
paged_table(top10mean)
The columns have large names, lets use their respective sample description instead. Columns with GSM ending in 70, 71, and 72 are gastric carcinoma (GC) 1, 2, and 3 respectively, while GSM ending in 73, 74, and 75 are peritoneal metastasis (PM) 14, 15, and 16 respectively. We can just call them 1, 2, 3 for GC and PM.
colnames(top10mean) <- c("GC1","GC2","GC3","PM1","PM2","PM3","Gene_ID","rank")
paged_table(top10mean)
GC is our baseline as less fatal then GC that metastasizes into the peritoneal tissue as PM. We will do fold change values of peritoneal over GC. These top genes were already found to be the top 10 most highly variable genes in the data set by own algorithm but they were filtered to be less than 10% mitochondrial DNA, and less than 3,000 n counts each of features and cell counts.
top10MeanOrdered <- top10mean[order(top10mean$rank, decreasing=F),]
paged_table(top10MeanOrdered)
Lets see what their fold change is out of curiosity.
top10MeanOrdered$GC_mean <- rowMeans(top10MeanOrdered[,1:3])
top10MeanOrdered$PM_mean <- rowMeans(top10MeanOrdered[,4:6])
top10MeanOrdered$FC_PM_over_GC <- top10MeanOrdered$PM_mean/top10MeanOrdered$GC_mean
paged_table(top10MeanOrdered)
We can see that the algorithm for highest variability in gene expression did not use fold change values for the samples in a group but across each sample as the fold change values are not ordered the same. We are using these genes as top 10 genes but from Seurat’s high variability algorithm not by fold change values.
Lets make a matrix and test how well these top 10 genes can predict the class as GC or PM.
Genes10 <- top10mean$Gene_ID
class <- c("Gastric carcinoma","Gastric carcinoma","Gastric carcinoma",
"peritoneal metastasis", "peritoneal metastasis",
"peritoneal metastasis")
top10mx <- data.frame(t(top10mean[,1:6]))
colnames(top10mx) <- Genes10
top10mx$class <- as.factor(class)
paged_table(top10mx)
Lets see what a random forest basic model will do in predicting the class of these samples in a 2 class model of Gastric Carcinoma or Peritoneal Metastasis. We have equal numbers of samples in each class, but the samples are small in number to work with as we have been having lately.
set.seed(123)
inTrain <- sample(1:6,.8*6)
training <- top10mx[inTrain,]
testing <- top10mx[-inTrain,]
table(training$class)
##
## Gastric carcinoma peritoneal metastasis
## 2 2
table(testing$class)
##
## Gastric carcinoma peritoneal metastasis
## 1 1
Each class is represented in the training and testing data.
rf_10 <- randomForest(training[,1:6], training$class, mtry=3, ntree=5000, confusion=T)
rf_10$confusion
## Gastric carcinoma peritoneal metastasis class.error
## Gastric carcinoma 1 1 0.5
## peritoneal metastasis 2 0 1.0
Our model built on training set scored 50% accuracy on predicting 1/2 samples of gastric carcinoma correctly and 100% accuracy on predicting 2/2 samples of peritoneal metastasis correctly.
Lets see how well it predicts on the hold out set of data in our testing set.
prediction_10 <- predict(rf_10,testing)
results_10 <- data.frame(predicted=prediction_10, actual=testing$class)
results_10
## predicted actual
## GC1 Gastric carcinoma Gastric carcinoma
## PM2 peritoneal metastasis peritoneal metastasis
The model scored 100% accuracy in prediction of each of the samples correctly. The gastric carcinoma sample was predicted as such and the peritoneal metastasis was also correctly predicted. This indicates that these top 10 genes are great targets to predicting gastric carcinoma or peritoneal metastasis.
Now, lets look at all the genes that we extracted and only take the means, but get the sample grouped means and then fold change of PM/GC to extract the top 10 upregulated and top 10 down regulated genes after excluding any NaNs or Infs if needed. The last column is the genes column 39.
keep <- grep('mean',colnames(allGenesDF))
allGenes_keepMeans <- allGenesDF[,c(keep,39)]
paged_table(allGenes_keepMeans[1:5,])
Change the column names here the same as with top 10 for the sample type of GC or PM, the 70-72 are GC and 73-75 are PM.
colnames(allGenes_keepMeans) <- c("GC1","GC2","GC3","PM1","PM2","PM3","Gene_ID")
paged_table(allGenes_keepMeans[1:5,])
allGenes_keepMeans$GC_mean <- rowMeans(allGenes_keepMeans[,1:3])
allGenes_keepMeans$PM_mean <- rowMeans(allGenes_keepMeans[,4:6])
allGenes_keepMeans$FC_PM_over_GC <- allGenes_keepMeans$PM_mean/allGenes_keepMeans$GC_mean
paged_table(allGenes_keepMeans[1:5,])
summary(allGenes_keepMeans$FC_PM_over_GC)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.7274 1.1018 Inf 1.8023 Inf 7683
There are Inf and NAs in the data, so we can eliminate them and then order the data to extract the top genes by fold change values.
allGenes_noNans <- allGenes_keepMeans[!is.na(allGenes_keepMeans$FC_PM_over_GC),]
allGenes_noInfs <- allGenes_noNans[!is.infinite(allGenes_noNans$FC_PM_over_GC),]
allGenes_noNansInfs <- allGenes_noInfs[order(allGenes_noInfs$FC_PM_over_GC, decreasing=T),]
allGenes_ready <- allGenes_noNansInfs[allGenes_noNansInfs$FC_PM_over_GC > 0,]
summary(allGenes_ready)
## GC1 GC2 GC3
## Min. : 0.000000 Min. : 0.000000 Min. : 0.000000
## 1st Qu.: 0.002427 1st Qu.: 0.001563 1st Qu.: 0.001437
## Median : 0.012136 Median : 0.010547 Median : 0.010057
## Mean : 0.072244 Mean : 0.082158 Mean : 0.084854
## 3rd Qu.: 0.052184 3rd Qu.: 0.051172 3rd Qu.: 0.053879
## Max. :59.208738 Max. :48.307422 Max. :56.477011
## PM1 PM2 PM3 Gene_ID
## Min. : 0.000000 Min. : 0.000000 Min. : 0.000000 Length:23552
## 1st Qu.: 0.001889 1st Qu.: 0.002794 1st Qu.: 0.001781 Class :character
## Median : 0.011965 Median : 0.012108 Median : 0.012072 Mode :character
## Mean : 0.076444 Mean : 0.063688 Mean : 0.074888
## 3rd Qu.: 0.056045 3rd Qu.: 0.044086 3rd Qu.: 0.055017
## Max. :50.014484 Max. :27.065818 Max. :69.656838
## GC_mean PM_mean FC_PM_over_GC
## Min. : 0.000130 Min. : 0.000066 Min. :1.900e-03
## 1st Qu.: 0.001848 1st Qu.: 0.002425 1st Qu.:7.128e-01
## Median : 0.011856 Median : 0.012352 Median :1.063e+00
## Mean : 0.079752 Mean : 0.071673 Mean :2.257e+00
## 3rd Qu.: 0.053574 3rd Qu.: 0.052687 3rd Qu.:1.616e+00
## Max. :54.664390 Max. :48.912380 Max. :1.214e+04
top20_GCPM <- allGenes_ready[c(1:10,23543:23552),]
paged_table(top20_GCPM)
write.csv(allGenes_ready,"allGenes_ready_GCPM_23552k_noNaNsInfs_ordered.csv", row.names=F)
write.csv(top20_GCPM,"top20_GC_PM_genes_FCs.csv", row.names=F)
You can get those two datasets here, top20_GCPM and allGenes_ready.
Now lets make a matrix and see how well these top 20 genes by fold change values can predict class of sample in our 2 class model of GC or GC with PM. We can reuse our class character vector from the top 10 Seurat genes matrix.
colnames(top20_GCPM)
## [1] "GC1" "GC2" "GC3" "PM1"
## [5] "PM2" "PM3" "Gene_ID" "GC_mean"
## [9] "PM_mean" "FC_PM_over_GC"
genes20 <- top20_GCPM$Gene_ID
top20mx <- data.frame(t(top20_GCPM[,c(1:6)]))
colnames(top20mx) <- genes20
top20mx$class <- as.factor(class)
paged_table(top20mx)
Now that we have our matrix of top 20 gastric carcinoma and peritoneal metastasis genes, lets see how well these genes predict the class of the sample.
set.seed(123)
inTrain <- sample(1:6,.8*6)
training <- top20mx[inTrain,]
testing <- top20mx[-inTrain,]
table(training$class)
##
## Gastric carcinoma peritoneal metastasis
## 2 2
table(testing$class)
##
## Gastric carcinoma peritoneal metastasis
## 1 1
Great, same distribution as with the top 10 Seurat genes matrix.
rf_20 <- randomForest(training[1:6], training$class, mtry=7, ntree=5000, confusion=T)
## Warning in randomForest.default(training[1:6], training$class, mtry = 7, :
## invalid mtry: reset to within valid range
rf_20$confusion
## Gastric carcinoma peritoneal metastasis class.error
## Gastric carcinoma 2 0 0
## peritoneal metastasis 0 2 0
Well, that’s impressive, the model performed perfectly at 100% accuracy in predicting the Gastric Carcinoma as well as the Peritoneal Metastasis samples 2/2 for each class. Lets see how well this model predicts the class on the hold out samples.
predicted_20 <- predict(rf_20, testing)
results <- data.frame(predicted=predicted_20, actual=testing$class)
results
## predicted actual
## GC1 Gastric carcinoma Gastric carcinoma
## PM2 peritoneal metastasis peritoneal metastasis
And the model also performed 100% accuracy in prediction of the class in correctly predicting the gastric carcinoma sample 1/1 and the peritoneal metastasis class correctly 1/1 times.
Looks like these genes can be added to our pathologies database.
Lets go ahead and read in our pathologies database.
path <- "...current Pathologies Database" #path to where you have the pathologies database stored.
setwd(path)
pathologies <- read.csv("Pathologies_UF_added_4-2-2026.csv", header=T, sep=',')
colnames(pathologies)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
We need the same names of the pathologies columns for our 2 additions of top10 seurat and top20 foldchange genes.
colnames(top10MeanOrdered)
## [1] "GC1" "GC2" "GC3" "PM1"
## [5] "PM2" "PM3" "Gene_ID" "rank"
## [9] "GC_mean" "PM_mean" "FC_PM_over_GC"
top10a <- top10MeanOrdered[,c(7,11)]
colnames(top10a) <- c("Genecards_ID","FC_pathology_control")
top10a$Ensemble_ID <- 'needed'
top10a$topGenePathology <- "Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma"
top10a$mediaType <- "scRNA high throughput sequencing, bulk RNA from stomach tissue"
top10a$studySummarized <- "This is the top 10 high variability genes out of 2,000 high variability genes using Seurat and their Differential Expression algorithm to find the genes across all samples with the highest degree of variability. The study used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article.
There is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data.
"
top10a$GSE_study_ID <- "GSE308231"
paged_table(top10a)
Now lets get the columns of the top 20 genes by fold change values ready to add to the pathologies database.
colnames(top20_GCPM)
## [1] "GC1" "GC2" "GC3" "PM1"
## [5] "PM2" "PM3" "Gene_ID" "GC_mean"
## [9] "PM_mean" "FC_PM_over_GC"
top20a <- top20_GCPM[,c(7,10)]
colnames(top20a) <- c("Genecards_ID","FC_pathology_control")
top20a$Ensemble_ID <- 'needed'
top20a$topGenePathology <- "Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma"
top20a$mediaType <- "scRNA high throughput sequencing, bulk RNA from stomach tissue"
top20a$studySummarized <- "Top 20 genes of 10 most enhanced and 10 most silenced excluding zeros, NAs, and Infinites in fold change data...This is a study that used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article.
There is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data.
"
top20a$GSE_study_ID <- "GSE308231"
colnames(pathologies)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
colnames(top10a)
## [1] "Genecards_ID" "FC_pathology_control" "Ensemble_ID"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
top10a <- top10a[,c(3,1,2,4:7)]
colnames(top20a)
## [1] "Genecards_ID" "FC_pathology_control" "Ensemble_ID"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
top20a <- top20a[,c(3,1,2,4:7)]
We can bind these two data sets of top 10 and top 20 before combining them to the pathologies data, so that we can use our list of Ensembl IDs from a previous study’s data to fill in that column with the Ensembl ID instead of ‘needed.’
top30 <- rbind(top10a,top20a)
paged_table(top30)
Lets read in our Ensemble database to attach the Ensemble IDs via a merge to this data then rbind the data to the pathologies database.
ensembl_path <- "C:...Ensemble IDs merger for Genecards IDs GSE271486 resource/"
setwd(ensembl_path)
ensembl <- read.csv("GSE271486_ensembleIDs_NPC_LBMP_study.csv", header=T, sep=',')
paged_table(ensembl[,1:4])
I used the study of GSE271486 for this file, the online link above only has the first 4 columns. These could be useful by locus location on chromosome and description but at a later time. Lets just merge the whole file.
Ensemble <- ensembl[,c(1:2)]
top30b <- top30[,c(2:7)] #exclude Ensembl id 'needed' due to duplications possibly
top30merge <- merge(Ensemble, top30b, by.x="gene_name", by.y="Genecards_ID")
colnames(top30merge)
## [1] "gene_name" "gene_id" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
top30merge <- top30merge[,c(2,1,3:7)]
colnames(top30merge)[1:2] <- c("Ensembl_ID" , "Genecards_ID")
head(top30merge)
## Ensembl_ID Genecards_ID FC_pathology_control
## 1 ENSG00000282080 AC006511.4 0.013086932
## 2 ENSG00000140873 ADAMTS18 55.511202929
## 3 ENSG00000275302 CCL4 0.278347734
## 4 ENSG00000100604 CHGA 0.002652753
## 5 ENSG00000168542 COL3A1 4.370448905
## 6 ENSG00000169429 CXCL8 0.346405936
## topGenePathology
## 1 Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma
## 2 Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma
## 3 Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma
## 4 Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma
## 5 Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma
## 6 Gastric Carcinoma and Peritoneal Metastatic Gastric Carcioma
## mediaType
## 1 scRNA high throughput sequencing, bulk RNA from stomach tissue
## 2 scRNA high throughput sequencing, bulk RNA from stomach tissue
## 3 scRNA high throughput sequencing, bulk RNA from stomach tissue
## 4 scRNA high throughput sequencing, bulk RNA from stomach tissue
## 5 scRNA high throughput sequencing, bulk RNA from stomach tissue
## 6 scRNA high throughput sequencing, bulk RNA from stomach tissue
## studySummarized
## 1 Top 20 genes of 10 most enhanced and 10 most silenced excluding zeros, NAs, and Infinites in fold change data...This is a study that used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article. \n\nThere is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data. \n
## 2 Top 20 genes of 10 most enhanced and 10 most silenced excluding zeros, NAs, and Infinites in fold change data...This is a study that used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article. \n\nThere is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data. \n
## 3 This is the top 10 high variability genes out of 2,000 high variability genes using Seurat and their Differential Expression algorithm to find the genes across all samples with the highest degree of variability. The study used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article. \n\nThere is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data. \n
## 4 Top 20 genes of 10 most enhanced and 10 most silenced excluding zeros, NAs, and Infinites in fold change data...This is a study that used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article. \n\nThere is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data. \n
## 5 This is the top 10 high variability genes out of 2,000 high variability genes using Seurat and their Differential Expression algorithm to find the genes across all samples with the highest degree of variability. The study used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article. \n\nThere is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data. \n
## 6 This is the top 10 high variability genes out of 2,000 high variability genes using Seurat and their Differential Expression algorithm to find the genes across all samples with the highest degree of variability. The study used gastric carcinoma 3 samples and peritoneal metastasis 3 samples, one of the most fatal cancers in cancer related deaths worldwide, single cell RNA or scRNA sequenced by high throughput sequencing. In a brief summary of the published article, the researchers found that a macrophage, fibroblast, and malignant epithelial cell area (MFM) of gastric carcinoma (GC) tumors and gastric carcinoma peritoneal metastasis (PM) had a design that was impeding treatment in GC and PM. GC and PM are aggressive because this MFM region tends to create a tough shell that blocks out immune cells like T cells as well as promotes tumor growth via VEGF that is used in all tumor proliferation cancers it seems, but the GC and PM tumors in the MFM region are not treatable due to this tough fibrous shell that chemotherapy cannot reach. They found a therapeutic route to try with the latest treatment of NIPS for intraperitoneal delivery of chemotherapy that is the best treatment so far, but they need to turn off the PLAU-PLAUR axis because it is responsible for the MFM and signaling pathway that prevents it from being obliterated by current therapeutics. The study was based more on a spatial transcriptomic analysis of images from an online cancer imaging resource, The Cancer Genome-Atlas Stomach Adenocarcinoma or TCGA-STAD, to develop a deep learning model with transfer learning that could predict from an input image if the tumor was MFM or not, and their study showed that MFM was a very poor prognosis because survival was 2% in 5 years and fatal in 1-2 years tops. This is a recent study. Gastric carcinoma is one of the malignant cancers responsible for cancer related mortality world wide, it was done in an Asian country this study, and GC affects mostly Chinese and South Asian populations. The study used this gene expression data but also many gene expression datasets from same resource of National Center for Bioinformatics Information or NCBI of their Gene Expression Omnibus or GEO section of public data. The CA1, CA2, and CA3 was stated as being gastric carcinoma but different types of samples as subtypes. With CA3 the worst prognosis of the 3 types. While the F14, F15, and F16 samples of peritoneal metastatic GC were not discussed in the published article. \n\nThere is an epthelial-mesenchymal transition called EMT that is seen in cancer progression of various cancers and in this study, the MFM shows this EMT through the niche regions of the tumors in GC that have malignant epithelial cells. The fibroblasts and macrophages are of the M2 type macrophages that are tumor promoters from various reported studies this article refers to. This could be the same EMT that occurs when Epstein-Barr virus (EBV) turns epithelial cells into a different cell structure. This study doesn't make mention of EBV, but gastric carcinoma is highly associated with EBV and so it was selected. We aim to add these top genes from Seurat on this scRNA sequenced data and the genes we find from top fold change values in GC compared to PM using extracted Seurat data. \n
## GSE_study_ID
## 1 GSE308231
## 2 GSE308231
## 3 GSE308231
## 4 GSE308231
## 5 GSE308231
## 6 GSE308231
colnames(pathologies)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
Now lets add this to our pathologies database.
Pathologies <- rbind(pathologies,top30merge)
dim(Pathologies)
## [1] 346 7
paged_table(Pathologies)
Lets write this pathologies table out to csv.
setwd(path)
write.csv(Pathologies,"pathologies_updated_GC_PM_added_4-5-2026.csv", row.names=F)
You can get this dataset at the link here.
We found that the model training and used on testing scored 100% accuracy with random forest basic settings for the top 20 genes found in fold change after eliminating NAs, Infinites, and 0.000000 values and taking the top 10 up regulated by ordering by decreasing values of fold change PM/GC mean values and top 10 down regulated or silenced from the bottom 10 of that ordered list.
The top genes by Seurat in top 10 by highest degree of variability we added the fold change values but they were already ranked, and the rank order is not the same as their fold change values as Seurat’s algorithm uses PCA or another type to rank each gene in each sample not class to get the highest degree of variability.
The study was interesting that we used this gene expression scRNA data of 6 samples but it was also a study in GEO that this study that used it also borrowed from publicly available gene expression data. Their study is fascinating and I hope they get closer to raising odds of survival from currently 2% in 5 years and fatality in 1-2 years for gastric carcinoma and peritoneal metastasis from gastric carcinoma. They shed light on the PLAU-PLAUR axis controlling signaling that leads to poor outcomes for those patients with malignant epithelial cells comingling in the same niche area of GC tumor as the fibroblastic macrophages in MTM they call it that is where EMT or Epethelial-Mesenchymal activity meets B cells and thick shell of peritoneal tissue that is unpenetrable by chemotherapy currentl unless intraperitoneal access to the medication of chemotherapy and knocking out the PLAU-PLAUR axis. This study did not link or mention EBV in any line of text in the published article, but GC is highly associated with EBV infection as epithelial lining can turn into another cell type such as the EMT the study focused on with more verbage on MFM niches of the tumor and imaging with deep neural networks transfer learning to identify an MFM tumor that needs attention and intervention to destroy the PLAU-PLAUR axis soon or death could occur within a year.
Thanks so much. I will be adding this pathologies data to my Tableau site that isn’t visually appealing but helps to use a key as the gene ID and see what the fold change values are in the current pathologies analyzed thus far in connection with EBV or EBV associated pathologies or as alternate pathologies to base the machine learning model on later when we gather all our information. To see what pathologies are similar with other genes. There are currently pathways of genes per disease, but this is one way to see how they connect from these studies.