This is a project with gastric carcinoma samples that looks at the metastasis in peritoneal stomach tissue and in the stomach. The study is GSE308231. There is a published article on this gene expression data available here.
We will be using Seurat for these samples. There are 6 samples in total, and 3 are gastric carcinoma while the other 3 samples are peritoneal metastasis of gastric carcinoma. This is single cell RNA data, expression profiling by highthroughput sequencing of transcriptomic RNA data of complementary DNA.
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rmarkdown)
Lets read in the series information.
series <- read.table("GSE308231_series_matrix.txt", nrow=24)
paged_table(series)
series_b <- read.table("GSE308231_series_matrix.txt", skip=24, nrow=39)
paged_table(series_b)
series_c <- read.table("GSE308231_series_matrix.txt", skip=42, nrow=22)
paged_table(series_c)
series_d <- read.table("GSE308231_series_matrix.txt", skip=65, nrow=1)
Series <- rbind(series_b,series_c,series_d)
rm(series_b,series_c,series_d) #getting too cluttered already
paged_table(Series)
Lets keep the ID, GSE ID, and tissue sample type. I had to skip some rows because they didn’t have the same number of elements in extracting the tabular information from the text data.
ID_series <- Series[c(1,2,8,11),]
paged_table(ID_series)
The CA1-CA3 are stomach tumors in Gastric carcinoma, and the F14-F16 are the metastasis of stomach tumors into the peritoneal lining. This was from the sample page as well. The identification of cancer or tumor location is in the ‘cell line’ description.
Now we can start reading in the files with Seurat. These folders have the zipped barcodes, features and matrix.mtx files in .gz format, you don’t have to unzip them, but these are in each folder labeled by GSM ID for the sample they are in this series data. Each folder is already in my directory of this rmarkdown document. Just call the folder and the files will be unzipped into a Seurat object. I added the extended name of ’_CA1’ through ’_F16’ to each sample folder to clarify.
GSM9240670 <- Read10X(data.dir = "GSM9240670_CA1")
We can clarify how many barcodes we want for each gene. We want at least 3 barcodes per gene or 3 cells per gene, and drop any gene without at least 3 barcodes. And for features at least 200. We will use last 2 digits to name it as the sample ID here.
sample70 <- CreateSeuratObject(counts=GSM9240670, project="GSM9240670", min.cells=3, min.features=200)
class(sample70)
## [1] "Seurat"
## attr(,"package")
## [1] "SeuratObject"
Look at the first 20 columns of barcodes.
colnames(sample70)[1:20]
## [1] "AAACCCAAGCCGATTT-1" "AAACCCACATGACAAA-1" "AAACCCAGTTCAATCG-1"
## [4] "AAACCCATCAAACCTG-1" "AAACCCATCAGTCATG-1" "AAACCCATCTGTAAGC-1"
## [7] "AAACGAAAGCCATTCA-1" "AAACGAACACCCTTGT-1" "AAACGAATCCTTCTTC-1"
## [10] "AAACGAATCGCAATTG-1" "AAACGCTAGAAGGATG-1" "AAACGCTAGTAACGTA-1"
## [13] "AAACGCTAGTGATAAC-1" "AAACGCTAGTTGTCAC-1" "AAACGCTCAATCGCGC-1"
## [16] "AAACGCTCAGCAGGAT-1" "AAACGCTTCATTGTTC-1" "AAACGCTTCCCGAGTG-1"
## [19] "AAAGAACAGCGAGTAC-1" "AAAGAACGTAACATGA-1"
Look at the first 20 rows of genes.
rownames(sample70)[1:20]
## [1] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" "FAM87B"
## [6] "LINC01128" "LINC00115" "FAM41C" "AL645608.6" "AL645608.2"
## [11] "LINC02593" "SAMD11" "NOC2L" "KLHL17" "PLEKHN1"
## [16] "PERM1" "AL645608.7" "HES4" "ISG15" "AL645608.1"
sample70
## An object of class Seurat
## 28380 features across 7581 samples within 1 assay
## Active assay: RNA (28380 features, 0 variable features)
## 1 layer present: counts
paged_table(sample70@meta.data[1:20,])
saveRDS(sample70, file="RDS_objects/sample70")
GSM9240671 <- Read10X(data.dir = "GSM9240671_CA2")
sample71 <- CreateSeuratObject(counts=GSM9240671, project="GSM9240671", min.cells=3, min.features=200)
saveRDS(sample71, file="RDS_objects/sample71")
GSM9240672 <- Read10X(data.dir = "GSM9240672_CA3")
sample72 <- CreateSeuratObject(counts=GSM9240672, project="GSM9240672", min.cells=3, min.features=200)
saveRDS(sample72, file="RDS_objects/sample72")
GSM9240673 <- Read10X(data.dir = "GSM9240673_F14")
sample73 <- CreateSeuratObject(counts=GSM9240673, project="GSM9240673", min.cells=3, min.features=200)
saveRDS(sample73, file="RDS_objects/sample73")
GSM9240674 <- Read10X(data.dir = "GSM9240674_F15")
sample74 <- CreateSeuratObject(counts=GSM9240674, project="GSM9240674", min.cells=3, min.features=200)
saveRDS(sample74, file="RDS_objects/sample74")
GSM9240675 <- Read10X(data.dir = "GSM9240675_F16")
sample75 <- CreateSeuratObject(counts=GSM9240675, project="GSM9240675", min.cells=3, min.features=200)
saveRDS(sample75, file="RDS_objects/sample75")
We saved our 6 Seurat objects of the GSM samples of 3 gastric carcinoma and 3 gastric carcinoma that metastasized to the peritoneal tissue.
Now we merge these samples. But lets remove the GSM sample large matrix objects we don’t need.
rm(GSM9240670, GSM9240671, GSM9240672, GSM9240673, GSM9240674, GSM9240675)
mergedSamples <- merge(sample70, y=c(sample71, sample72, sample73, sample74, sample75), add.cell.ids = ls()[1:6])
str(mergedSamples)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 6
## .. .. .. .. ..$ counts.GSM9240670:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:21515992] 12 30 31 39 44 66 69 81 106 126 ...
## .. .. .. .. .. .. ..@ p : int [1:7582] 0 2351 3233 5429 12856 15819 17807 21294 24463 27239 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 7581
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:21515992] 1 1 1 1 1 1 1 2 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240671:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:24841051] 18 31 45 52 56 69 71 81 122 143 ...
## .. .. .. .. .. .. ..@ p : int [1:11051] 0 2378 5393 10193 13485 14322 17841 19678 21010 24119 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 11050
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:24841051] 1 1 1 1 1 1 1 1 5 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240672:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:26388472] 41 44 65 110 129 130 135 138 144 151 ...
## .. .. .. .. .. .. ..@ p : int [1:10708] 0 1852 4943 6683 8683 12823 15068 17285 19237 21098 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 10707
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:26388472] 1 1 1 1 1 2 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240673:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:40592441] 29 33 55 79 90 110 121 139 150 172 ...
## .. .. .. .. .. .. ..@ p : int [1:13457] 0 1803 5355 9584 10207 11926 15229 17238 21883 24648 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 13456
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:40592441] 1 1 3 2 1 1 5 2 2 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240674:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:32465719] 5 11 12 18 29 30 34 37 39 43 ...
## .. .. .. .. .. .. ..@ p : int [1:12843] 0 5983 6480 9218 12433 13898 20649 22481 24627 26066 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 12842
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:32465719] 1 5 1 6 1 5 1 1 2 8 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240675:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:30173389] 29 68 83 90 139 154 156 178 179 180 ...
## .. .. .. .. .. .. ..@ p : int [1:13011] 0 2279 2777 3813 4198 7293 13862 18642 20215 23195 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 13010
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:30173389] 1 4 1 1 1 2 2 3 1 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:68646, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 68646 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:32659, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 32659 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ default : int 6
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 32659 obs. of 0 variables
## .. .. .. ..@ misc : list()
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 68646 obs. of 3 variables:
## .. ..$ orig.ident : chr [1:68646] "GSM9240670" "GSM9240670" "GSM9240670" "GSM9240670" ...
## .. ..$ nCount_RNA : num [1:68646] 6570 7361 5615 45367 8295 ...
## .. ..$ nFeature_RNA: int [1:68646] 2351 882 2196 7427 2963 1988 3487 3169 2776 4420 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 6 levels "GSM9240670","GSM9240671",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 3 0
## ..@ commands : list()
## ..@ tools : list()
Above is the structure of this merged Seurat object of 6 samples. Lets save it and read it in to get used to reading in the RDS objects we save.
saveRDS(mergedSamples,file="RDS_objects/mergedSamples.RDS")
Lets read in this merged file. Clear out the environment to leave room for the large Seurat array, table of 10.9 Gb it says.
merged <- readRDS("RDS_objects/mergedSamples.RDS")
Our merged object is the same as the one we saved called mergedSamples, or should be. Lets see the structure of this object called merged.
str(merged)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 6
## .. .. .. .. ..$ counts.GSM9240670:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:21515992] 12 30 31 39 44 66 69 81 106 126 ...
## .. .. .. .. .. .. ..@ p : int [1:7582] 0 2351 3233 5429 12856 15819 17807 21294 24463 27239 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 7581
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:21515992] 1 1 1 1 1 1 1 2 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240671:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:24841051] 18 31 45 52 56 69 71 81 122 143 ...
## .. .. .. .. .. .. ..@ p : int [1:11051] 0 2378 5393 10193 13485 14322 17841 19678 21010 24119 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 11050
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:24841051] 1 1 1 1 1 1 1 1 5 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240672:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:26388472] 41 44 65 110 129 130 135 138 144 151 ...
## .. .. .. .. .. .. ..@ p : int [1:10708] 0 1852 4943 6683 8683 12823 15068 17285 19237 21098 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 10707
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:26388472] 1 1 1 1 1 2 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240673:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:40592441] 29 33 55 79 90 110 121 139 150 172 ...
## .. .. .. .. .. .. ..@ p : int [1:13457] 0 1803 5355 9584 10207 11926 15229 17238 21883 24648 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 13456
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:40592441] 1 1 3 2 1 1 5 2 2 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240674:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:32465719] 5 11 12 18 29 30 34 37 39 43 ...
## .. .. .. .. .. .. ..@ p : int [1:12843] 0 5983 6480 9218 12433 13898 20649 22481 24627 26066 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 12842
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:32465719] 1 5 1 6 1 5 1 1 2 8 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240675:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:30173389] 29 68 83 90 139 154 156 178 179 180 ...
## .. .. .. .. .. .. ..@ p : int [1:13011] 0 2279 2777 3813 4198 7293 13862 18642 20215 23195 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 13010
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:30173389] 1 4 1 1 1 2 2 3 1 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:68646, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 68646 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:32659, 1:6] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 32659 6
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. ..$ : chr [1:6] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ default : int 6
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 32659 obs. of 0 variables
## .. .. .. ..@ misc : list()
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 68646 obs. of 3 variables:
## .. ..$ orig.ident : chr [1:68646] "GSM9240670" "GSM9240670" "GSM9240670" "GSM9240670" ...
## .. ..$ nCount_RNA : num [1:68646] 6570 7361 5615 45367 8295 ...
## .. ..$ nFeature_RNA: int [1:68646] 2351 882 2196 7427 2963 1988 3487 3169 2776 4420 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 6 levels "GSM9240670","GSM9240671",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:68646] "ID_series_AAACCCAAGCCGATTT-1" "ID_series_AAACCCACATGACAAA-1" "ID_series_AAACCCAGTTCAATCG-1" "ID_series_AAACCCATCAAACCTG-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 3 0
## ..@ commands : list()
## ..@ tools : list()
dim(merged)
## [1] 32659 68646
dim(mergedSamples)
## [1] 32659 68646
Lets remove mergedSamples and the samples we merged to make it. The processing will need this memory space as it can get heavy in processing and using a lot of memory to run commands.
rm(mergedSamples)
rm(sample70, sample71, sample72, sample73, sample74, sample75)
Lets look at the first 10 rows of the meta.data in the merged Seurat object. There are 68,646 rows of barcodes.
paged_table(merged@meta.data[1:10,])
##perc mt for mitochondrial DNA that indicates cell damage to mitochondrial organelle in cytoplasm from dying cell or damaged.
We add a column to our merged data called percent_mt.
merged[['percent_mt']] <- PercentageFeatureSet(merged, pattern = '^MT-')
paged_table(merged@meta.data)
The nCount is total number of molecules in a cell The nFeature is the total number of unique genes in each cell
This violin plot VlnPlot() is used as a Seurat library plot.
VlnPlot(merged, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol=3)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
Looks like most of the data is at around 3,000 to 4,000 counts, 5,000 features, and about 10-12 % mt. We will use that as our filtering to filter out the outliers.
Lets plot a line through our data of scatter points.
library(ggplot2)
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+
# Control axis limits using ggplot2 functions
xlim(0, 45000)+ # Set X-axis range
ylim(0, 45000) # Set Y-axis range
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1198 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1198 rows containing missing values or values outside the scale range
## (`geom_point()`).
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+
# Control axis limits using ggplot2 functions
xlim(0, 5000)+ # Set X-axis range
ylim(0, 5000) # Set Y-axis range
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 34371 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 34371 rows containing missing values or values outside the scale range
## (`geom_point()`).
Now, lets filter for the 5,000 counts, and 5,000 features, with no more than 12% mitochondrial DNA from the violin plots above.
#filtering
merged <- subset(merged, subset = nFeature_RNA < 3000 & percent_mt < 10 & nCount_RNA < 3000)
You can make your own thresholds on how you interpret visualizations
#Normalize data after removing filtered out cells
Normalization by default, use global scaling normalization of log normalize, default scaling of 10,000, and then get a log transformed result.
merged <- NormalizeData(
merged,
normalization.method = "LogNormalize",
scale.factor = 10000,
layer = "counts", # Input layer
new.layer = "lognorm" # Output layer
)
## Normalizing layer: counts.GSM9240670
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240671
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240672
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240673
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240674
## Warning: The following arguments are not used: new.layer
## Normalizing layer: counts.GSM9240675
## Warning: The following arguments are not used: new.layer
str(merged)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 12
## .. .. .. .. ..$ counts.GSM9240670:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:785896] 42 137 138 157 160 164 200 269 300 332 ...
## .. .. .. .. .. .. ..@ p : int [1:825] 0 983 1814 2162 2557 3188 4633 5951 7427 7908 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 824
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:785896] 1 1 1 1 3 1 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240671:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2650402] 30 42 122 152 161 242 309 337 382 406 ...
## .. .. .. .. .. .. ..@ p : int [1:2561] 0 509 1777 2930 3972 4876 5591 6920 8006 9214 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 2560
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2650402] 1 1 2 1 1 1 1 1 2 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240672:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1529215] 62 80 84 108 119 173 224 230 241 252 ...
## .. .. .. .. .. .. ..@ p : int [1:1393] 0 1340 2694 4080 5598 7021 7692 8917 10204 11683 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 1392
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1529215] 2 1 1 1 1 1 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240673:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1680987] 55 159 246 322 336 367 387 447 546 581 ...
## .. .. .. .. .. .. ..@ p : int [1:1589] 0 623 1371 2995 3451 3842 4860 5788 7326 8858 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 1588
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1680987] 1 1 1 1 1 1 1 1 1 3 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240674:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2762722] 188 199 417 420 536 567 603 790 849 868 ...
## .. .. .. .. .. .. ..@ p : int [1:3222] 0 497 1962 3401 3849 4534 5016 5407 6043 6556 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 3221
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2762722] 1 1 1 1 1 1 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ counts.GSM9240675:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:5153829] 21 42 52 68 79 118 124 139 165 179 ...
## .. .. .. .. .. .. ..@ p : int [1:5054] 0 1036 2542 3007 4245 5241 6015 7576 8620 9116 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 5053
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:5153829] 1 1 1 1 1 2 1 1 1 1 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240670 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:785896] 42 137 138 157 160 164 200 269 300 332 ...
## .. .. .. .. .. .. ..@ p : int [1:825] 0 983 1814 2162 2557 3188 4633 5951 7427 7908 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28380 824
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:785896] 2.1 2.1 2.1 2.1 3.12 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240671 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2650402] 30 42 122 152 161 242 309 337 382 406 ...
## .. .. .. .. .. .. ..@ p : int [1:2561] 0 509 1777 2930 3972 4876 5591 6920 8006 9214 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27340 2560
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2650402] 1.74 1.74 2.34 1.74 1.74 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240672 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1529215] 62 80 84 108 119 173 224 230 241 252 ...
## .. .. .. .. .. .. ..@ p : int [1:1393] 0 1340 2694 4080 5598 7021 7692 8917 10204 11683 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 27952 1392
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1529215] 2.17 1.59 1.59 1.59 1.59 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240673 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:1680987] 55 159 246 322 336 367 387 447 546 581 ...
## .. .. .. .. .. .. ..@ p : int [1:1589] 0 623 1371 2995 3451 3842 4860 5788 7326 8858 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 29540 1588
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:1680987] 2.28 2.28 2.28 2.28 2.28 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240674 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:2762722] 188 199 417 420 536 567 603 790 849 868 ...
## .. .. .. .. .. .. ..@ p : int [1:3222] 0 497 1962 3401 3849 4534 5016 5407 6043 6556 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 31248 3221
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:2762722] 2.96 2.96 2.96 2.96 2.96 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. .. ..$ data.GSM9240675 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:5153829] 21 42 52 68 79 118 124 139 165 179 ...
## .. .. .. .. .. .. ..@ p : int [1:5054] 0 1036 2542 3007 4245 5241 6015 7576 8620 9116 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 28546 5053
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:5153829] 2.11 2.11 2.11 2.11 2.11 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:14638, 1:12] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:14638] "ID_series_AAACGCTAGTAACGTA-1" "ID_series_AAAGAACGTAACATGA-1" "ID_series_AAAGGGCAGATGGCGT-1" "ID_series_AAAGGGCCATACGCAT-1" ...
## .. .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 14638 12
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:14638] "ID_series_AAACGCTAGTAACGTA-1" "ID_series_AAAGAACGTAACATGA-1" "ID_series_AAAGGGCAGATGGCGT-1" "ID_series_AAAGGGCCATACGCAT-1" ...
## .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:32659, 1:12] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. .. .. ..$ dim : int [1:2] 32659 12
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:32659] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" ...
## .. .. .. .. .. .. ..$ : chr [1:12] "counts.GSM9240670" "counts.GSM9240671" "counts.GSM9240672" "counts.GSM9240673" ...
## .. .. .. ..@ default : int 6
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 32659 obs. of 0 variables
## .. .. .. ..@ misc : list()
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 14638 obs. of 4 variables:
## .. ..$ orig.ident : chr [1:14638] "GSM9240670" "GSM9240670" "GSM9240670" "GSM9240670" ...
## .. ..$ nCount_RNA : num [1:14638] 1391 1350 591 796 1251 ...
## .. ..$ nFeature_RNA: int [1:14638] 983 831 348 395 631 1445 1318 1476 481 430 ...
## .. ..$ percent_mt : num [1:14638] 1.73 7.93 2.03 1.76 2.64 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 6 levels "GSM9240670","GSM9240671",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:14638] "ID_series_AAACGCTAGTAACGTA-1" "ID_series_AAAGAACGTAACATGA-1" "ID_series_AAAGGGCAGATGGCGT-1" "ID_series_AAAGGGCCATACGCAT-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 3 0
## ..@ commands :List of 1
## .. ..$ NormalizeData.RNA:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
## .. .. .. ..@ name : chr "NormalizeData.RNA"
## .. .. .. ..@ time.stamp : POSIXct[1:1], format: "2026-04-05 00:35:27"
## .. .. .. ..@ assay.used : chr "RNA"
## .. .. .. ..@ call.string: chr [1:2] "NormalizeData(merged, normalization.method = \"LogNormalize\", " " scale.factor = 10000, layer = \"counts\", new.layer = \"lognorm\")"
## .. .. .. ..@ params :List of 5
## .. .. .. .. ..$ assay : chr "RNA"
## .. .. .. .. ..$ normalization.method: chr "LogNormalize"
## .. .. .. .. ..$ scale.factor : num 10000
## .. .. .. .. ..$ margin : num 1
## .. .. .. .. ..$ verbose : logi TRUE
## ..@ tools : list()
merged
## An object of class Seurat
## 32659 features across 14638 samples within 1 assay
## Active assay: RNA (32659 features, 0 variable features)
## 12 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675
We have layers of counts and data now in our Seurat object and it is smaller due to filtering, now it has 32,659 features and 14,638 samples in this 1 assay.
merged@assays$RNA
## Assay (v5) data with 32659 features for 14638 cells
## First 10 features:
## AL627309.1, AL627309.5, AP006222.2, LINC01409, FAM87B, LINC01128,
## LINC00115, FAM41C, AL645608.6, AL645608.2
## Layers:
## counts.GSM9240670, counts.GSM9240671, counts.GSM9240672,
## counts.GSM9240673, counts.GSM9240674, counts.GSM9240675,
## data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673,
## data.GSM9240674, data.GSM9240675
merged@assays[["RNA"]]@layers$counts.GSM9240670[1:25,1:25]
## 25 x 25 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [2,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [3,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [4,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [5,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [6,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [7,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [8,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [9,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [10,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [11,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [12,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [13,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [14,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [15,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [16,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [17,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [18,] . . . . . . . 1 . . . . 1 2 1 . . . . . . . . 2 .
## [19,] . 2 . . 2 . . 1 2 . . . . 1 1 . . . . 1 2 . 2 1 1
## [20,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [21,] . . . . . . . . . . . . . . . 1 . . . . . . . . .
## [22,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [23,] . . . . . . . . . . . . . . . . . . . . 1 . . . 1
## [24,] . . . . . . . . . . . . . . . . . . . . . . . . .
## [25,] . . . . . . . . . . . . . . . . . . . . . . . . .
The following shows the sparce matrices for the first layer by counts, but limited because it is huge and there are many layers, this is just the first sample layers. This list could be very large if not limited. There are only integer values where the gene is present in the cell or barcode column of assay, the 0s are held with the period or dot.
dim(merged@assays[["RNA"]]@layers$counts.GSM9240670)
## [1] 28380 824
You can see not all features are counted in the cells shown.
The default is to return 2000 features per dataset, and ‘vst’ stands for ‘variance stabilizing transformation’ as a default as well.
merged <- FindVariableFeatures(merged, selection.method = 'vst', nfeatures = 2000)
## Finding variable features for layer counts.GSM9240670
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -2.6149
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 6.2046e-15
## Finding variable features for layer counts.GSM9240671
## Finding variable features for layer counts.GSM9240672
## Finding variable features for layer counts.GSM9240673
## Finding variable features for layer counts.GSM9240674
## Finding variable features for layer counts.GSM9240675
Now look at the merged object to see that variable features should show 2000.
merged
## An object of class Seurat
## 32659 features across 14638 samples within 1 assay
## Active assay: RNA (32659 features, 2000 variable features)
## 12 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675
We should limit the list as it will print out a lot of the values. But only those features in the list of 2000 highly variable features or having the highest differential expression values where cells changed the most from healthy to pathological state are listed by name, all others are listed as ‘NA’ values.
merged[["RNA"]]@meta.data$var.features[1:500]
## [1] NA NA NA NA NA
## [6] NA NA NA NA NA
## [11] NA NA NA NA NA
## [16] NA NA "HES4" "ISG15" NA
## [21] NA NA NA NA NA
## [26] NA NA NA "TNFRSF18" "TNFRSF4"
## [31] NA NA NA NA NA
## [36] NA NA NA NA NA
## [41] NA NA NA NA NA
## [46] NA NA NA NA NA
## [51] NA "VWA1" NA NA NA
## [56] NA NA NA NA NA
## [61] NA NA NA NA NA
## [66] NA NA NA NA NA
## [71] NA NA NA NA NA
## [76] NA NA NA NA NA
## [81] NA NA NA NA NA
## [86] NA NA NA NA NA
## [91] NA NA NA NA NA
## [96] NA NA NA NA NA
## [101] NA NA NA NA NA
## [106] NA NA NA NA NA
## [111] NA NA NA NA NA
## [116] NA NA NA NA NA
## [121] NA NA NA NA NA
## [126] NA NA NA NA NA
## [131] NA NA NA NA NA
## [136] NA NA NA NA NA
## [141] NA NA NA NA NA
## [146] NA NA NA NA NA
## [151] NA NA NA NA NA
## [156] NA "TNFRSF9" NA NA NA
## [161] "ERRFI1" NA NA NA NA
## [166] NA NA NA NA NA
## [171] NA NA NA NA NA
## [176] NA NA NA NA NA
## [181] NA NA NA NA NA
## [186] NA NA NA "RBP7" NA
## [191] NA NA NA NA NA
## [196] NA NA NA NA NA
## [201] NA NA NA NA NA
## [206] NA NA NA NA NA
## [211] NA NA NA NA NA
## [216] NA NA NA NA NA
## [221] NA NA NA NA NA
## [226] NA NA "AL021155.5" NA NA
## [231] NA NA NA NA NA
## [236] NA NA NA NA NA
## [241] NA NA NA "KAZN" NA
## [246] NA NA NA NA NA
## [251] NA NA NA NA NA
## [256] NA NA NA NA NA
## [261] NA NA NA NA NA
## [266] NA NA NA NA NA
## [271] NA NA NA NA NA
## [276] NA NA NA NA NA
## [281] NA NA NA NA NA
## [286] NA NA NA NA NA
## [291] NA NA NA "CROCC" NA
## [296] NA NA NA NA NA
## [301] NA NA NA NA "PADI4"
## [306] NA NA NA NA NA
## [311] "IGSF21" NA NA NA NA
## [316] NA NA NA NA NA
## [321] NA NA NA NA NA
## [326] NA NA NA NA NA
## [331] NA NA "PLA2G2A" NA NA
## [336] NA NA NA NA NA
## [341] NA NA NA NA NA
## [346] NA NA NA NA NA
## [351] NA NA NA NA NA
## [356] NA NA NA NA NA
## [361] NA NA NA NA NA
## [366] NA "HSPG2" NA NA NA
## [371] NA NA NA NA NA
## [376] NA NA NA "C1QA" "C1QC"
## [381] "C1QB" NA NA NA NA
## [386] NA NA NA NA NA
## [391] NA NA NA NA NA
## [396] NA NA "ID3" NA NA
## [401] NA NA NA NA NA
## [406] NA NA "FUCA1" "CNR2" NA
## [411] NA NA NA NA NA
## [416] NA NA NA NA NA
## [421] NA NA NA NA NA
## [426] NA NA "CLIC4" NA NA
## [431] NA NA NA NA NA
## [436] NA NA NA NA NA
## [441] NA NA NA NA NA
## [446] NA NA NA NA NA
## [451] NA "STMN1" NA NA NA
## [456] NA NA NA NA NA
## [461] NA NA NA NA NA
## [466] NA NA NA NA NA
## [471] NA NA NA NA NA
## [476] NA NA NA NA NA
## [481] NA NA NA NA NA
## [486] NA NA NA NA NA
## [491] NA NA NA NA NA
## [496] NA NA NA NA NA
We only printed out the first 500 variable features and only the genes in the top 2,000 genes with high variability are shown all others are NA.
top10 <- head(VariableFeatures(merged), 10)
top10
## [1] "IGKC" "TPSB2" "CCL4" "CXCL8" "S100A9" "IGHG1" "GNLY" "LYZ"
## [9] "JCHAIN" "COL3A1"
top10_plot <- VariableFeaturePlot(merged)
LabelPoints(plot = top10_plot, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 4279 rows containing missing values or values outside the scale range
## (`geom_point()`).
This corrects for unwanted sources of variation, technical as example with batch effects, or biological such as cell cycle stage or mitochondrial contamination, this ensures results are due to clustering by gene differences and not technological or biological errors.
The ScaleData function is part of standard parameters, there is a more complex single cell Transform function or workflow within Seurat that could be alternatively used.
First to make a character string stored as the rownames of the structural array we have been using mergedFilteredNormalizedScaled. Lets look at the first 100 genes as well.
all_genes <- rownames(merged)
all_genes[1:100]
## [1] "AL627309.1" "AL627309.5" "AP006222.2" "LINC01409" "FAM87B"
## [6] "LINC01128" "LINC00115" "FAM41C" "AL645608.6" "AL645608.2"
## [11] "LINC02593" "SAMD11" "NOC2L" "KLHL17" "PLEKHN1"
## [16] "PERM1" "AL645608.7" "HES4" "ISG15" "AL645608.1"
## [21] "AGRN" "RNF223" "C1orf159" "AL390719.3" "LINC01342"
## [26] "AL390719.2" "TTLL10-AS1" "TTLL10" "TNFRSF18" "TNFRSF4"
## [31] "SDF4" "B3GALT6" "C1QTNF12" "AL162741.1" "UBE2J2"
## [36] "LINC01786" "SCNN1D" "ACAP3" "PUSL1" "INTS11"
## [41] "CPTP" "TAS1R3" "DVL1" "MXRA8" "AURKAIP1"
## [46] "CCNL2" "MRPL20-AS1" "MRPL20" "AL391244.2" "ANKRD65"
## [51] "LINC01770" "VWA1" "ATAD3C" "ATAD3B" "ATAD3A"
## [56] "TMEM240" "SSU72" "AL645728.1" "FNDC10" "AL691432.4"
## [61] "AL691432.2" "MIB2" "MMP23B" "CDK11B" "FO704657.1"
## [66] "SLC35E2B" "CDK11A" "SLC35E2A" "NADK" "GNB1"
## [71] "AL109917.1" "TMEM52" "CFAP74" "AL391845.2" "GABRD"
## [76] "AL391845.1" "PRKCZ" "AL590822.2" "PRKCZ-AS1" "FAAP20"
## [81] "AL590822.1" "SKI" "AL590822.3" "MORN1" "AL513477.2"
## [86] "RER1" "PEX10" "PLCH2" "PANK4" "HES5"
## [91] "AL139246.5" "TNFRSF14-AS1" "TNFRSF14" "AL139246.3" "PRXL2B"
## [96] "MMEL1" "TTC34" "PRDM16-DT" "PRDM16" "AL590438.1"
length(all_genes)
## [1] 32659
There are 32,659 genes that made it through quality control filtering of less than 10% mitochondrial DNA, and less than 3,000 each of nFeatures_RNA and nCounts_RNA.
The row names of the object, merged, have the gene names, when we looked for the highest variation genes in most differentially expressed for 2000 genes with
merged <- FindVariableFeatures(merged, selection.method = ‘vst’, nfeatures = 2000)
Do they line up to the row names of merged and where are they located.
varFeatures <- merged[["RNA"]]@meta.data$var.features
genes <- rownames(merged)
df <- data.frame(Gene = as.factor(genes), FeatureVariability = as.factor(varFeatures))
head(df,20)
## Gene FeatureVariability
## 1 AL627309.1 <NA>
## 2 AL627309.5 <NA>
## 3 AP006222.2 <NA>
## 4 LINC01409 <NA>
## 5 FAM87B <NA>
## 6 LINC01128 <NA>
## 7 LINC00115 <NA>
## 8 FAM41C <NA>
## 9 AL645608.6 <NA>
## 10 AL645608.2 <NA>
## 11 LINC02593 <NA>
## 12 SAMD11 <NA>
## 13 NOC2L <NA>
## 14 KLHL17 <NA>
## 15 PLEKHN1 <NA>
## 16 PERM1 <NA>
## 17 AL645608.7 <NA>
## 18 HES4 HES4
## 19 ISG15 ISG15
## 20 AL645608.1 <NA>
The location of any of the genes if they were a high variability gene top 2000 is listed next to the correct row name in a random 20 sample draw from first 20 samples.
Are there only 2000 non NA values in our varFeatures set to find only 2000?
class(varFeatures)
## [1] "character"
length(unique(varFeatures))
## [1] 2001
There are 2001 unique values in a 32,659 long vector of genes set to be the label if it is one of the 2000 most differentially expressed genes. The additional character is the NA value.
Lets scale the data now.
#merged <- ScaleData(merged, features = all_genes)
merged <- ScaleData(merged, features = VariableFeatures(merged))
## Centering and scaling data matrix
m <- merged[which(row.names(merged) %in% top10),]
m
## An object of class Seurat
## 10 features across 14638 samples within 1 assay
## Active assay: RNA (10 features, 10 variable features)
## 13 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675, scale.data
unique(m$orig.ident)
## [1] "GSM9240670" "GSM9240671" "GSM9240672" "GSM9240673" "GSM9240674"
## [6] "GSM9240675"
top10Data <- m[["RNA"]]@meta.data
colnames(top10Data)
## [1] "vf_vst_counts.GSM9240670_mean"
## [2] "vf_vst_counts.GSM9240670_variance"
## [3] "vf_vst_counts.GSM9240670_variance.expected"
## [4] "vf_vst_counts.GSM9240670_variance.standardized"
## [5] "vf_vst_counts.GSM9240670_variable"
## [6] "vf_vst_counts.GSM9240670_rank"
## [7] "vf_vst_counts.GSM9240671_mean"
## [8] "vf_vst_counts.GSM9240671_variance"
## [9] "vf_vst_counts.GSM9240671_variance.expected"
## [10] "vf_vst_counts.GSM9240671_variance.standardized"
## [11] "vf_vst_counts.GSM9240671_variable"
## [12] "vf_vst_counts.GSM9240671_rank"
## [13] "vf_vst_counts.GSM9240672_mean"
## [14] "vf_vst_counts.GSM9240672_variance"
## [15] "vf_vst_counts.GSM9240672_variance.expected"
## [16] "vf_vst_counts.GSM9240672_variance.standardized"
## [17] "vf_vst_counts.GSM9240672_variable"
## [18] "vf_vst_counts.GSM9240672_rank"
## [19] "vf_vst_counts.GSM9240673_mean"
## [20] "vf_vst_counts.GSM9240673_variance"
## [21] "vf_vst_counts.GSM9240673_variance.expected"
## [22] "vf_vst_counts.GSM9240673_variance.standardized"
## [23] "vf_vst_counts.GSM9240673_variable"
## [24] "vf_vst_counts.GSM9240673_rank"
## [25] "vf_vst_counts.GSM9240674_mean"
## [26] "vf_vst_counts.GSM9240674_variance"
## [27] "vf_vst_counts.GSM9240674_variance.expected"
## [28] "vf_vst_counts.GSM9240674_variance.standardized"
## [29] "vf_vst_counts.GSM9240674_variable"
## [30] "vf_vst_counts.GSM9240674_rank"
## [31] "vf_vst_counts.GSM9240675_mean"
## [32] "vf_vst_counts.GSM9240675_variance"
## [33] "vf_vst_counts.GSM9240675_variance.expected"
## [34] "vf_vst_counts.GSM9240675_variance.standardized"
## [35] "vf_vst_counts.GSM9240675_variable"
## [36] "vf_vst_counts.GSM9240675_rank"
## [37] "var.features"
## [38] "var.features.rank"
DataFrame10 <- m[['RNA']]@meta.data
paged_table(DataFrame10)
write.csv(DataFrame10, 'top10DE_genes_gastricCarcinoma_GSE308231.csv', row.names=FALSE)
#Dimensionality Reduction with PCA allow us to visualize multidimensional datasets, with common PCA principal component analysis, summarizing all gene expression data over many genes into principal components (the variation between scatterpoints are the components, variations off the regression lines or dimension, adding components to top components will give most variation and explain most of data).
Loadings is a term for PCA that explains a gene that is strongly associated with that principal component. They are ordered with PC1 most explanative and subsequent add to that explanation in order of best.
The positive and negative genes are named for how ever many features you set for the number of PCs that explain the most.
merged <- RunPCA(merged, features = VariableFeatures(merged))
## PC_ 1
## Positive: RBMS3, SPARCL1, NFIB, LDB2, FBXL7, APBB2, LHFPL6, DLC1, KIAA1217, NRP1
## IGFBP7, TCF4, PTPRG, MAGI1, PRKG1, GSN, TIMP3, ZFPM2, KALRN, COL4A2
## CALD1, CD36, IGFBP4, EBF1, SASH1, RUNX1T1, RBPMS, RASAL2, CDC42BPA, PARD3B
## Negative: CD69, CXCR4, SRGN, TNFAIP3, CCL5, STAT4, SAMSN1, PDE3B, IL7R, ITK
## CD247, TRBC2, GNG2, CNOT6L, GZMA, THEMIS, TRAC, JUND, KLF6, CAMK4
## CCL4, CRYBG1, DUSP1, IKZF3, PTPN22, LINC01619, RGS1, TC2N, CBLB, DUSP2
## PC_ 2
## Positive: PDE4D, CCSER1, KIAA1217, MAGI1, KLF6, CDH13, MECOM, HDAC9, KALRN, NEAT1
## CD69, FCHSD2, CBLB, CEMIP2, CNOT6L, ABLIM1, ARL15, CADM2, PTPRM, ANO2
## ST6GALNAC3, PLCB1, LARGE1, WWOX, MACROD2, ANKRD28, JUN, RORA, CNTNAP2, CSMD1
## Negative: DCN, CST3, LUM, MGP, FTL, SERPINF1, SFRP2, COL1A2, COL1A1, COL3A1
## TIMP1, CFD, CCDC80, FBLN1, COL6A2, C1R, S100A11, FTH1, IFITM3, C3
## SPARC, CD63, MMP2, PCOLCE, LGALS1, EFEMP1, THY1, COL6A1, IGFBP4, C1S
## PC_ 3
## Positive: CXCL8, G0S2, AQP9, NAMPT, S100A8, S100A9, BCL2A1, MNDA, ACSL1, PLAUR
## CSF3R, PTGS2, SOD2, IL1R2, C5AR1, FPR1, FCGR3B, SAT1, TREM1, IRAK3
## DOCK4, MXD1, TLR2, FGD4, CLEC4E, LUCAT1, FTH1, PHACTR1, RBM47, PLEK
## Negative: CCL5, CD247, THEMIS, TC2N, GZMA, SAMD3, CBLB, ITK, LINC00861, CAMK4
## INPP4B, IL7R, TRAC, TRBC2, NKG7, GZMK, STAT4, PYHIN1, RORA, GPRIN3
## LINC01934, CNOT6L, ANK3, TOX, KLRK1, PPP2R2B, IKZF3, PDE3B, CST7, TRBC1
## PC_ 4
## Positive: CSMD1, RBFOX1, CNTNAP2, IL1RAPL1, CNTN5, LRP1B, DMD, CCDC26, DPP10, CTNNA3
## CTNNA2, LRRC4C, SGCZ, CSMD3, NRXN1, GRID2, NRXN3, PCDH15, PTCHD1-AS, NKAIN3
## PTPRT, MACROD2, PTPRD, IL1RAPL2, ANKS1B, AC011246.1, AL008633.1, NRG3, DPP6, AC084816.1
## Negative: BTNL9, VWF, PALMD, ABLIM3, ADGRL4, CLDN5, ANO2, CYYR1, CD36, EGFL7
## FABP4, RBP7, LDB2, PTPRB, CD300LG, CDH5, ADGRF5, DIPK2B, ARHGAP29, EMCN
## EPAS1, CCDC85A, PECAM1, CAV1, ADAMTS18, CA4, CALCRL, ERG, NOSTRIN, ADAMTS9
## PC_ 5
## Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, HLA-DRB5, HLA-DRB1, C1QA, C1QB, C1QC, HLA-DQA2
## IFI30, HLA-DQB1, MARCH1, MS4A6A, HBB, CD68, F13A1, CYBB, RNASE1, AIF1
## BANK1, CD14, HLA-DQA1, CTSZ, FOLR2, MS4A4A, CPVL, MS4A1, CTSB, CD163
## Negative: TNFAIP3, DUSP1, FOS, NEAT1, ELL2, ZFP36, NAMPT, SAMSN1, JUNB, KLF6
## JUND, FOSB, JUN, PPP1R15A, CBLB, RORA, FOXO3, SRGN, ADAMTSL4-AS1, ANKRD28
## CXCR4, G0S2, MXD1, IER2, FNDC3B, IVNS1ABP, CPD, CRY1, NIBAN1, CXCL8
print(merged[['pca']], dims = 1:5, nfeatures=5)
## PC_ 1
## Positive: RBMS3, SPARCL1, NFIB, LDB2, FBXL7
## Negative: CD69, CXCR4, SRGN, TNFAIP3, CCL5
## PC_ 2
## Positive: PDE4D, CCSER1, KIAA1217, MAGI1, KLF6
## Negative: DCN, CST3, LUM, MGP, FTL
## PC_ 3
## Positive: CXCL8, G0S2, AQP9, NAMPT, S100A8
## Negative: CCL5, CD247, THEMIS, TC2N, GZMA
## PC_ 4
## Positive: CSMD1, RBFOX1, CNTNAP2, IL1RAPL1, CNTN5
## Negative: BTNL9, VWF, PALMD, ABLIM3, ADGRL4
## PC_ 5
## Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, HLA-DRB5
## Negative: TNFAIP3, DUSP1, FOS, NEAT1, ELL2
merged@assays$RNA
## Assay (v5) data with 32659 features for 14638 cells
## Top 10 variable features:
## IGKC, TPSB2, CCL4, CXCL8, S100A9, IGHG1, GNLY, LYZ, JCHAIN, COL3A1
## Layers:
## counts.GSM9240670, counts.GSM9240671, counts.GSM9240672,
## counts.GSM9240673, counts.GSM9240674, counts.GSM9240675,
## data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673,
## data.GSM9240674, data.GSM9240675, scale.data
DimHeatmap(merged, dims = 1, cells = 500, balanced = TRUE)
The dim plot is a scatter plot of a named PC for principal component, against another PC You will see cells group in clusters by PC comparison 1 by 1.
DimPlot(merged, reduction = "pca") + NoLegend()
This shows std dev on y-axis and PCs on x-axis to see the threshold limit of number of dimensions or PCs to use, it starts high and drops until plateuing at elbow bend, won’t be useful to use additional that add little variation and more computing.
ElbowPlot(merged)
It looks like it could be at about 8 or 12, looks like 2 little bends at 8 and at 12 principal components.
Basically the lower resolution the least clusters and higher resolution has more clusters
merged <- FindNeighbors(merged, dims = 1:11)#chose this from elbow plot
## Computing nearest neighbor graph
## Computing SNN
merged <- FindClusters(merged, resolution = c(0.1, 0.3, 0.5, 0.7,1))
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9753
## Number of communities: 11
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9515
## Number of communities: 14
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9316
## Number of communities: 18
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 21
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 14638
## Number of edges: 486264
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8976
## Number of communities: 26
## Elapsed time: 1 seconds
There will be added columns to the matrix object of our merged@meta.data for each of the resolutios such as ‘RNA_snn_res0.1’ and ‘RNA_snn_res0.3’, and so on to 1. And the integer values in the new columns indicates to which cluster each cell (or barcode) belongs.
The results are color coded for each cluster. Since gene expression is basically the cell differentiation of becoming different cell types, usually, different clusters represents different cell types.
paged_table(merged@meta.data)
The data set may interest someone in many clusters that overlap that could be useful for identifying subset of cells interested in.
DimPlot(merged, group.by = 'RNA_snn_res.1', label = TRUE)
DimPlot(merged, group.by = 'RNA_snn_res.0.5', label = TRUE)
DimPlot(merged, group.by = 'RNA_snn_res.0.1', label = TRUE)
Set identity of clusters makes more sense than using more clusters with resolution 1
Idents(merged) <- 'RNA_snn_res.0.1'
Gives a clearer visualization of separate clusters with space between each
merged <- RunUMAP(merged, dims = 1:11)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 00:35:59 UMAP embedding parameters a = 0.9922 b = 1.112
## 00:35:59 Read 14638 rows and found 11 numeric columns
## 00:35:59 Using Annoy for neighbor search, n_neighbors = 30
## 00:35:59 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 00:36:00 Writing NN index file to temp file C:\Users\jlcor\AppData\Local\Temp\Rtmp06eLqi\file32e8537a4591
## 00:36:00 Searching Annoy index using 1 thread, search_k = 3000
## 00:36:03 Annoy recall = 100%
## 00:36:03 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 00:36:04 Initializing from normalized Laplacian + noise (using RSpectra)
## 00:36:04 Commencing optimization for 200 epochs, with 621096 positive edges
## 00:36:04 Using rng type: pcg
## 00:36:18 Optimization finished
Next, identify the clusters and look at genetic markers for different cell types and identify different cell types, not in this tutorial but another one
DimPlot(merged, reduction = 'umap')
Save the Seurat object to use later
saveRDS(merged, file='mergedClustersUMap.RDS')
Make a data frame of all the genes of variable features, only the top 2000 will be ranked.
allGenesDF <- merged[["RNA"]]@meta.data
dim(allGenesDF)
## [1] 32659 38
The variable features column will only have the name of the gene is it is one of the 2,000 highest variable feature.
I looked it up and it turns out you can get the gene name for all entries of 30,960 by using the rownames function on data object[[“RNA”]]
allGenesDF$genes <- rownames(merged[["RNA"]])
write.csv(allGenesDF, "all32659genes_gastricCarcinoma_rownamesAdded.csv", row.names=FALSE)
paged_table(allGenesDF[1:20,])
Here are only the highest variability genes of the top 2,000 in Seurat.
all2k <- subset(allGenesDF, allGenesDF$var.features != "NA")
all2k <- all2k[order(all2k$var.features.rank, decreasing=FALSE),]
This will show only the first 100 of the top 2,000 high variability genes in Seurat.
paged_table(all2k[1:100,35:39])
Now lets look at the tSNE unsupervised classifier.
merged <- RunTSNE(merged, dims = 1:11)
DimPlot(merged, reduction = 'tsne')
Both the tSNE and UMAP clustering in Seurat look much more precise and defined then just the K-Nearest Neighbor clustering.
Lets look at how our merged Seurat object looks now after doing all these processes on it.
merged
## An object of class Seurat
## 32659 features across 14638 samples within 1 assay
## Active assay: RNA (32659 features, 2000 variable features)
## 13 layers present: counts.GSM9240670, counts.GSM9240671, counts.GSM9240672, counts.GSM9240673, counts.GSM9240674, counts.GSM9240675, data.GSM9240670, data.GSM9240671, data.GSM9240672, data.GSM9240673, data.GSM9240674, data.GSM9240675, scale.data
## 3 dimensional reductions calculated: pca, umap, tsne
It shows that it has 13 layers, is filtered, has 3 dimensionality reductions done after scaling the data, using PCA, UMAP, and TSNE. The K-Nearest Neighbor clustering was done using the PCA settings.
You can get the top10 genes and the statistical information with each of the 6 samples here.
You can get all 32,659 genes and the statistical information with each sample here.
Today we ran Seurat on gastric carcinoma 6 samples from GSE308231 on 3 samples of stomach cancer and 3 samples of metastatic gastric carcinoma in the peritoneal tissue surrounding the stomach. We did processing after filtering, normalizing, and scaling the data, then found principal components with PCA to do clustering with K-Nearest Neighbors using the idea number of PCs from the elbow plot of PCs, then used UMAP and TSNE unsupervised clustering to get more detailed and precise clusters of this data. We also found the 10 most variable genes from the 2,000 highest variability genes in Seurat from the two classes of stomach cancer, and also extracted the statistical data Seurat created for each sample and the names of the top 10, the 2,000 highest variable genes, and all genes that made the filter for less than 10% mitochondrial DNA and less than 3,000 count of features and counts of RNA.
Next part to this research will be on testing the top 10 genes with random forest to see how well they predict the sample, then use all the genes and extract only the count and gene name to do our generalized fold change values to get our top genes by fold change from stage 1 to stage 4 stomach cancer from stomach tumors to metastatic tumors spread outside the stomach tissue into the peritoneal lining. We will then add these genes of top 10 with Seurat and top fold change genes in top up regulated or enhanced to top down regulated or silenced into our pathologies database. I recommend reading the article that came with this gene expression data as it makes light of the study and shows some great deep learning machine learning data science to predict more severe tumors that will likely metastasize based on images they found on NCBI of various studies. They do mention that it is difficult to treat stomach cancer that metastasizes outside and into the peritoneal region due to the fibrotic tissue and that fibrous and cartilaginous or thick tissue lacks a rich blood supply to access the tumors and treat successfully with chemotherapy as this type of gastric carcinoma is the most difficult to treat. Their study seeks to find the tumors by shape and location as well as genes most abundant in those zones to target for treatment. I quick read the article and mostly the plots. The link is at the beginning of this project.
Thanks again, and keep checking in.