This part 3 includes the 1st part minus the additional matrix from the meta.data table in the Seurat object of part 1. We are using GSE318371, a recent gene study on NCBI that analyzes the aggressive Natural Killer T-Cell Lymphoma vs. healthy samples using single cell RNA (scRNA) Sequencing, of peripheral blood mononuclear cells (PBMCs).
Looking at the EDA of exploratory data analysis with new package for this thing on natural killer t-cell lymphoma associated by EBV infection. This study is GSE318371 within NCBI database of gene expression studies. I extracted the custom download option of the GSE318371_RAW.tar file scrolled at end of page on this series. The extraction of barcodes takes a very long time once downloaded for each file and you can avoid that process and leave them in zip form with the .tsv.gz name but remove the prepended file name in front of ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx’ but put these files with respective GSM sample ID into its own file to read in for each file folder.
This is a very recent February 2026 uploaded research on aggressive lymphoma associated with EBV infection. I found others and they include the nasopharyngeal carcinoma and Hodgkin and large B-cell lymphomas. But for now we work on this project to get our top genes for our machine to predict EBV, Lyme disease, or specific associated EBV pathology of multiple sclerosis, mononucleosis, primary EBV infection, as well as various lymphomas and nasopharyngeal carcinoma.
There is a process that has to be followed to extract each sample information of barcodes, features, and cells. This is array gene expression data where an array of many cells is input into a machine and each cell within the array is ran to count the number of times a gene appears. The barcodes are the cells and the features are the genes in matrix format.
I had to look at a youtube video to understand this package Seurat better and how to read in this data. I would like to estimate each barcode file as taking around 30 minutes each to extract from the zipped file format. My 7zip isn’t working to extract it with right tab and I followed the videos to the exact step but it turns out the information is useful but just leave the folder in zip format because the code doesn’t work for the unzipped files.
It is useful to review or get familiar with the Seurat package for this type of data as it is the go to method of handling unsupervised gene expression data which is basically what finding gene targets is, finding the genes to make target and good predictors on unseen data. But our data is known and the sample classes are seen therefore, and will still allow us the functionality to explore the unsupervised machine learning models in finding our top 10 genes.
The code had to be made False in evaluating the chunks in Rmarkdown due to the large file size and knitr and/or computer not allowing additional storage or buffering for large file sizes.
This last section follows through with a tutorial found online video, great to follow along and learn a little bit or review the material in these type of data sets on gene expression data with single cell RNA sequencing data.
The tutorial failed my progress with this data after the gene names were extracted from the row names of the array object, when trying to scale the data with the ScaleData function, but the code to be used is left after the lines of equal symbols if this gets sorted out. I am sure it will but will not stress over it. I will review the video and see if a step was missed somewhere and test things out, but this is a massive large array of all sorts of information like running neural nets on layers and finding the hiccup. But have fun.
And also, the knitr stops at the portion of opening up the 10-11 Gb large file for some reason. So, I added screen shots that appear small.
===============================================================================
Lets read the summary file in the GSE318371-GPL34284_series_matrix.txt file to see how the samples were collected, handled, if normalized, type, and design of study.
seriesInfo1 <- read.csv("GSE318371-GPL34284_series_matrix.txt", nrows=25, sep='\t', stringsAsFactors = T, strip.white = T, na.strings=" ", header=F)
seriesInfo1
## V1
## 1 !Series_title
## 2 !Series_geo_accession
## 3 !Series_status
## 4 !Series_submission_date
## 5 !Series_last_update_date
## 6 !Series_summary
## 7 !Series_overall_design
## 8 !Series_type
## 9 !Series_contributor
## 10 !Series_contributor
## 11 !Series_contributor
## 12 !Series_contributor
## 13 !Series_sample_id
## 14 !Series_contact_name
## 15 !Series_contact_email
## 16 !Series_contact_institute
## 17 !Series_contact_address
## 18 !Series_contact_city
## 19 !Series_contact_zip/postal_code
## 20 !Series_contact_country
## 21 !Series_supplementary_file
## 22 !Series_platform_id
## 23 !Series_platform_id
## 24 !Series_platform_taxid
## 25 !Series_sample_taxid
## V2
## 1 Peripheral blood mononuclear cells single-cell landscape of newly diagnosed NK/T cell lymphoma patients
## 2 GSE318371
## 3 Public on Feb 07 2026
## 4 Feb 04 2026
## 5 Feb 07 2026
## 6 Natural killer/T cell lymphoma (NKTCL) is a rare and aggressive form of non-Hodgkin's lymphoma associated with Epstein-Barr Virus (EBV) infection.The recent advancement of multi-omics technologies has significantly enhanced our understanding of NKTCL disease biology, including genetics, transcription landscape, variations of EBV strain, and microenvironments. Emerging evidence suggests that immunoprofiling of peripheral blood mononuclear cells (PBMCs) is associated with the treatment response of cancer patients and can be used to guide clinical trials and therapy. In this study, we utilized single-cell RNA sequencing (scRNA-seq) to comprehensively characterize the phenotypic landscape of PBMCs in newly diagnosed patients with NKTCL. This research offers a valuable peripheral blood-based signature for newly diagnosed NKTCL, which could be a crucial resource for further investigations into the pathogenesis of NKTCL and the optimization of therapeutic regimens.
## 7 scRNA-seq profiling of PBMCs from healthy donors and newly diagnosed NKTCL patients
## 8 Expression profiling by high throughput sequencing
## 9 Xiaozhen,,Liang
## 10 Rong,,Tao
## 11 Ran,,Jia
## 12 Chuanxu,,Liu
## 13 GSM9493320 GSM9493321 GSM9493322 GSM9493323 GSM9493324 GSM9493325 GSM9493326 GSM9493327 GSM9493328 GSM9493329 GSM9493330 GSM9493331 GSM9493332 GSM9493333 GSM9493334 GSM9493335 GSM9493336 GSM9493337 GSM9493338 GSM9493339 GSM9493340 GSM9493341 GSM9493342 GSM9493343 GSM9493344 GSM9493345 GSM9493346 GSM9493347 GSM9493348 GSM9493349 GSM9493350 GSM9493351
## 14 Xiaozhen,,Liang
## 15 xzliang@simm.ac.cn
## 16 Shanghai Institute of Materia Medica Chinese Academy of Sciences
## 17 Life Science Research Building 320 Yueyang Road, Xuhui District
## 18 Shanghai
## 19 200031
## 20 China
## 21 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE318nnn/GSE318371/suppl/GSE318371_RAW.tar
## 22 GPL24676
## 23 GPL34284
## 24 9606
## 25 9606
seriesInfoDesign <-read.csv("GSE318371-GPL34284_series_matrix.txt", sep='\t', nrows=50,stringsAsFactors = T,strip.white=T,na.strings=" ", ncol(32), skip=25, header=F)
dim(seriesInfoDesign)
## [1] 42 30
There are 42 additional rows of metadata on study design and methods for handling data and biological material.
Lets look at first few columns and rows of interest that detail study design.
seriesInfoDesign[c(8:15,17:21,30:33),c(1:2)]
## V1
## 8 !Sample_source_name_ch1
## 9 !Sample_organism_ch1
## 10 !Sample_characteristics_ch1
## 11 !Sample_characteristics_ch1
## 12 !Sample_characteristics_ch1
## 13 !Sample_molecule_ch1
## 14 !Sample_extract_protocol_ch1
## 15 !Sample_extract_protocol_ch1
## 17 !Sample_description
## 18 !Sample_data_processing
## 19 !Sample_data_processing
## 20 !Sample_data_processing
## 21 !Sample_platform_id
## 30 !Sample_instrument_model
## 31 !Sample_library_selection
## 32 !Sample_library_source
## 33 !Sample_library_strategy
## V2
## 8 "blood"
## 9 "Homo sapiens"
## 10 "tissue: blood"
## 11 "cell line: PBMCs"
## 12 "cell type: Peripheral blood immune cells"
## 13 "total RNA"
## 14 "Isolated PBMCs were loaded into a 10× Chromium Chip (v3.1 PN:1000120) and barcoded using a 10x Chromium Controller."
## 15 "RNA from the barcoded cells was then reverse-transcribed, amplified, and prepared into sequencing libraries with the 10× Library Construction Kit (v3.1 PN:1000190) according to the manufacturer’s instructions."
## 17 "Library name: HD1"
## 18 "Raw scRNA-seq data were initially pre-processed using CellRanger (version 8.0.1, 10x Genomics) to align reads to the human genome (GRCh38, 2024-A from 10x Genomics) and count the unique molecular identifiers (UMIs) for each gene to generate specific gene cell count tables. For each scRNA-seq sample, the count tables were filtered to retain the genes detected in at least 10 cells and cells with a minimum gene count of 300."
## 19 "Assembly: GRCh38"
## 20 "Supplementary files format and content: barcodes, features, and matrix files for each samples"
## 21 "GPL34284"
## 30 "Illumina NovaSeq X Plus"
## 31 "cDNA"
## 32 "transcriptomic single cell"
## 33 "RNA-Seq"
The data is from peripheral blood mononuclear cells (PBMCs) of total RNA using chip sequencing or array sequencing, they kept the genes that showed in at least 10 cells of the array being sampled or having at least a count of 300 genes. The array of RNA-Seq analysis counts gene fragments that show up in the sequencing as many won’t show up but enough do. There is a useful youtube video that explains how chip sequencing operates here. This is where the barcodes and features makes sense as it seems like different but similar language to data science language. The features are the genes or rows as we have seen, and the cells are the barcodes of nucleotides as columns of our matrix when we read in the formatted files using Seurat library where the folder has to have the ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx.gz’ format to read it in. The youtube videos I watched showed a way of unzipping and reading in the packages similarly but the Seurat library from my recent experience only reads in the unzipped file formats with attached file name ‘gz’ meaning needs to be unzipped, this is the file format already in when downloading from the NCBI website for the gene expression data.
We can see the patient and healthy label to the GSM samples with row 17 and 41 of the metadata or series information.
seriesInfoDesign[c(17,41),]
## V1 V2 V3
## 17 !Sample_description "Library name: HD1" "Library name: HD2"
## 41 "ID_REF" "GSM9493320" "GSM9493321"
## V4 V5 V6
## 17 "Library name: HD3" "Library name: HD4" "Library name: HD5"
## 41 "GSM9493322" "GSM9493323" "GSM9493324"
## V7 V8 V9
## 17 "Library name: HD6" "Library name: HD7" "Library name: HD8"
## 41 "GSM9493325" "GSM9493326" "GSM9493327"
## V10 V11 V12
## 17 "Library name: HD9" "Library name: HD10" "Library name: HD11"
## 41 "GSM9493328" "GSM9493329" "GSM9493330"
## V13 V14 V15
## 17 "Library name: HD12" "Library name: patient16" "Library name: patient18"
## 41 "GSM9493331" "GSM9493335" "GSM9493336"
## V16 V17
## 17 "Library name: patient19" "Library name: patient20"
## 41 "GSM9493337" "GSM9493338"
## V18 V19
## 17 "Library name: patient22" "Library name: patient25"
## 41 "GSM9493339" "GSM9493340"
## V20 V21
## 17 "Library name: patient27" "Library name: patient30"
## 41 "GSM9493341" "GSM9493342"
## V22 V23
## 17 "Library name: patient31" "Library name: patient36"
## 41 "GSM9493343" "GSM9493344"
## V24 V25
## 17 "Library name: patient37" "Library name: patient38"
## 41 "GSM9493345" "GSM9493346"
## V26 V27
## 17 "Library name: patient39" "Library name: patient41"
## 41 "GSM9493347" "GSM9493348"
## V28 V29
## 17 "Library name: patient44" "Library name: patient51"
## 41 "GSM9493349" "GSM9493350"
## V30
## 17 "Library name: patient52"
## 41 "GSM9493351"
There are 12 healthy samples HD1-HD12, and randomly numbered patient samples from patient6 through patient51 totaling 17 patients.
str(seriesInfoDesign)
## 'data.frame': 42 obs. of 30 variables:
## $ V1 : Factor w/ 35 levels "!Sample_channel_count",..: 31 14 25 26 16 32 1 24 21 2 ...
## $ V2 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V3 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V4 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V5 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V6 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V7 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V8 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V9 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V10: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V11: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V12: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V13: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V14: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V15: Factor w/ 40 levels "","\"0\"","\"1\"",..: 25 19 26 13 14 32 3 8 20 36 ...
## $ V16: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V17: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V18: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V19: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V20: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V21: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V22: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V23: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V24: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V25: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V26: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V27: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V28: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V29: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V30: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
Install the Seurat package with install.packages(‘Seurat’) if you haven’t already. Then read in the library.
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
We need to use the tidyverse package.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Found a few youtube videos on using Seurat for this type of data.Here is an interesting video explaining extracting and moving files into their GSM sample but don’t do the unzipping and renaming, just leave in unzipped version and remove prepended file name to only get the names needed with extension of gz kept. Do put in its own sample name of folder you will name it as an object for in Rstudio, here is the video.
After making those folders you have to read them into R with Read10x function.
NML_1 <- Read10x(data.dir = “../Downloads/GSE132771_Raw/NPL1/”) where the file name and location are using the demonstration file that used the GSE132771 series data or RAW.Tar files.
After doing what video tutorial did, with extraction, the code didn’t work, but when I left the files in original tsv.gz format but removed prepended file name e.g. GSM9493320_HD1_barcodes.tsv.gz into barcodes.tsv.gz then the following line of code ran. This removes that tedious extraction step that added extra work and time to the file processing to use in our exploratory data analysis. The video was from 3-4 years ago from today’s date, so some errors may have been corrected within the Seurat library.
I had to use exact file location and make sure to not copy and paste with the file directory within Microsoft as ‘' or backslash because it has to be’/’ or forward slash.
Lets save my directory as a character string to use and save all these file objects of RDS type made using Seurat.
Fill in ellipses with your file directory location.
directoryFolder <- "C:/.../GSE318371_RAW"
RDS_objects <- "C:/.../RDS_objects"
setwd(directoryFolder)
GSM9493320 <- Read10X("GSM9493320")
Then you can create an object with CreateSeuratObject()
NML_1 <- CreateSeuratObject(counts = NML_1, project = “NML_1”, min.cells=3, min.features=200)
sampleHD1 <- CreateSeuratObject(counts=GSM9493320, project="GSM9493320", min.cells=3, min.features=200)
For the matrix, you will see genes are the barcodes in matrix columns that are counted in each of the matrix rows that are the cells. The video says the barcodes are the rows but that is probably the data frame before transposing to a matrix. Visually the barcodes are shown in the ‘column names’ above each column with counts.
When naming the barcodes.tsv file it is the cells, and the features.tsv file is the genes.
The cells are where the genes are counted in an array of cells that sequence gene data. There will be counts of the gene in each cell that vary from none to however many times they show. The next video explains reading the Seurat object created here.
class(sampleHD1)
## [1] "Seurat"
## attr(,"package")
## [1] "SeuratObject"
the colnames are the barcodes and the rownames are the genes. colnames([]) and will display barcodes
colnames(sampleHD1[])[1:100]
## [1] "AAACCCACAGCATTGT-1" "AAACCCACATCCGAAT-1" "AAACCCAGTACTAGCT-1"
## [4] "AAACCCAGTCACTCAA-1" "AAACCCAGTGGTAACG-1" "AAACCCATCAATGCAC-1"
## [7] "AAACCCATCCCTCTAG-1" "AAACCCATCGTCGATA-1" "AAACGAAAGCCAGACA-1"
## [10] "AAACGAAAGGATACCG-1" "AAACGAAAGTAATACG-1" "AAACGAACACGACCTG-1"
## [13] "AAACGAACACTGTTCC-1" "AAACGAACAGTTACCA-1" "AAACGAACATTCAGCA-1"
## [16] "AAACGAAGTAACATCC-1" "AAACGAAGTCTCCCTA-1" "AAACGAAGTGCAGATG-1"
## [19] "AAACGAATCCTCGCAT-1" "AAACGCTAGACGCCCT-1" "AAACGCTAGCTTTGTG-1"
## [22] "AAACGCTAGTGAACAT-1" "AAACGCTGTAGTATAG-1" "AAACGCTGTTGAAGTA-1"
## [25] "AAACGCTTCAATGTCG-1" "AAACGCTTCTCAGAAC-1" "AAAGAACAGCGACATG-1"
## [28] "AAAGAACAGGCTGTAG-1" "AAAGAACAGGGATCTG-1" "AAAGAACCAACAGAGC-1"
## [31] "AAAGAACGTCGCCACA-1" "AAAGAACGTGGTAACG-1" "AAAGAACTCATGAGGG-1"
## [34] "AAAGAACTCATTCGTT-1" "AAAGGATAGCTCGCAC-1" "AAAGGATAGTCCGCGT-1"
## [37] "AAAGGATCAAGGGTCA-1" "AAAGGATCAGAACTTC-1" "AAAGGATGTGCCTAAT-1"
## [40] "AAAGGATGTGTCCCTT-1" "AAAGGATTCACCGACG-1" "AAAGGATTCACGGTCG-1"
## [43] "AAAGGGCAGCGAACTG-1" "AAAGGGCAGTAGCAAT-1" "AAAGGGCGTGTGTCCG-1"
## [46] "AAAGGTAAGAGAGCAA-1" "AAAGGTACAAAGTATG-1" "AAAGGTACAAGACGGT-1"
## [49] "AAAGGTACACAAGTTC-1" "AAAGGTACACTCCTTG-1" "AAAGGTAGTATAGGAT-1"
## [52] "AAAGGTAGTGTGCCTG-1" "AAAGGTATCACAAGAA-1" "AAAGGTATCTGCGTCT-1"
## [55] "AAAGGTATCTGTCTCG-1" "AAAGTCCAGCACACAG-1" "AAAGTCCAGCGGTAGT-1"
## [58] "AAAGTCCAGGATTCAA-1" "AAAGTCCAGTACAACA-1" "AAAGTCCCACCTGCGA-1"
## [61] "AAAGTCCGTTCTTGCC-1" "AAAGTCCTCACCTGTC-1" "AAAGTCCTCAGGTGTT-1"
## [64] "AAAGTCCTCCATAAGC-1" "AAAGTCCTCTATGTGG-1" "AAAGTGAAGAAATGGG-1"
## [67] "AAAGTGAAGCAGTCTT-1" "AAAGTGAAGCTGTCCG-1" "AAAGTGAAGGTAAGAG-1"
## [70] "AAAGTGAAGTGATAAC-1" "AAAGTGAAGTTATGGA-1" "AAAGTGACAAGCAATA-1"
## [73] "AAAGTGACAATAGTGA-1" "AAAGTGAGTATACCCA-1" "AAAGTGAGTATCGTTG-1"
## [76] "AAAGTGAGTCGGTAAG-1" "AAAGTGAGTGAATGTA-1" "AAAGTGAGTTTCACTT-1"
## [79] "AAAGTGATCAAATGCC-1" "AAATGGAAGGGCCCTT-1" "AAATGGAGTACAACGG-1"
## [82] "AAATGGAGTTCCGTTC-1" "AAATGGATCCGCATAA-1" "AAATGGATCGCCGATG-1"
## [85] "AACAAAGAGCACTTTG-1" "AACAAAGAGCGAGGAG-1" "AACAAAGAGGCTAAAT-1"
## [88] "AACAAAGCAACAGTGG-1" "AACAAAGCACTGCACG-1" "AACAAAGCAGAGTCAG-1"
## [91] "AACAAAGGTCGCGGTT-1" "AACAAAGGTGAGGAAA-1" "AACAAAGTCATGCCGG-1"
## [94] "AACAAAGTCCTGTTAT-1" "AACAAAGTCGTCAGAT-1" "AACAAAGTCTCCGAGG-1"
## [97] "AACAACCAGATGTTAG-1" "AACAACCCACCCAAGC-1" "AACAACCCATCATCCC-1"
## [100] "AACAACCCATGTGCTA-1"
There are many columns but we limited it to 1st 100.
rownames([]) and will display gene names
rownames(sampleHD1[])[1:100]
## [1] "ENSG00000290826" "ENSG00000238009" "ENSG00000241860" "ENSG00000286448"
## [5] "ENSG00000290385" "ENSG00000291215" "LINC01409" "ENSG00000290784"
## [9] "LINC00115" "LINC01128" "ENSG00000288531" "FAM41C"
## [13] "NOC2L" "KLHL17" "PLEKHN1" "ENSG00000272512"
## [17] "HES4" "ISG15" "ENSG00000224969" "AGRN"
## [21] "ENSG00000291156" "C1orf159" "ENSG00000285812" "LINC01342"
## [25] "TTLL10" "TNFRSF18" "TNFRSF4" "SDF4"
## [29] "B3GALT6" "C1QTNF12" "ENSG00000260179" "UBE2J2"
## [33] "LINC01786" "SCNN1D" "ACAP3" "PUSL1"
## [37] "INTS11" "CPTP" "TAS1R3" "DVL1"
## [41] "MXRA8" "AURKAIP1" "CCNL2" "MRPL20-AS1"
## [45] "MRPL20" "MRPL20-DT" "ANKRD65" "ATAD3C"
## [49] "ATAD3B" "ENSG00000290916" "ATAD3A" "TMEM240"
## [53] "SSU72" "ENSG00000215014" "FNDC10" "ENSG00000286989"
## [57] "ENSG00000272106" "MIB2" "MMP23B" "CDK11B"
## [61] "ENSG00000272004" "SLC35E2B" "CDK11A" "ENSG00000290854"
## [65] "NADK" "GNB1" "GNB1-DT" "CFAP74"
## [69] "PRKCZ" "ENSG00000271806" "PRKCZ-AS1" "FAAP20"
## [73] "ENSG00000234396" "SKI" "ENSG00000287356" "MORN1"
## [77] "ENSG00000272420" "RER1" "PEX10" "PLCH2"
## [81] "ENSG00000224387" "PANK4" "ENSG00000272449" "TNFRSF14-AS1"
## [85] "TNFRSF14" "ENSG00000228037" "ENSG00000289610" "PRXL2B"
## [89] "MMEL1" "TTC34" "PRDM16" "MEGF6"
## [93] "ENSG00000238260" "TPRG1L" "WRAP73" "TP73"
## [97] "CCDC27" "SMIM1" "LRRC47" "ENSG00000272153"
You can see there are a lot of gene names some by the genecards ID and some by the Ensemble ID as the row names of the matrix object, and we also limited the view to first 100 rows.
setwd(RDS_objects)
saveRDS(sampleHD1,"sampleHD1")
Lets start importing and saving the other folder files.
setwd(directoryFolder)
GSM9493321 <- Read10X("GSM9493321")
sampleHD2 <- CreateSeuratObject(counts=GSM9493321, project="GSM9493321", min.cells=3, min.features=200)
View(sampleHD2@meta.data)
setwd(RDS_objects)
saveRDS(sampleHD2, "sampleHD2")
setwd(directoryFolder)
GSM9493322 <- Read10X("GSM9493322")
sampleHD3 <- CreateSeuratObject(counts=GSM9493322, project="GSM9493322", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD3, "sampleHD3")
setwd(directoryFolder)
GSM9493323 <- Read10X("GSM9493323")
sampleHD4 <- CreateSeuratObject(counts=GSM9493323, project="GSM9493323", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD4, "sampleHD4")
setwd(directoryFolder)
GSM9493324 <- Read10X("GSM9493324")
sampleHD5 <- CreateSeuratObject(counts=GSM9493324, project="GSM9493324", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD5,"sampleHD5")
setwd(directoryFolder)
GSM9493325 <- Read10X("GSM9493325")
sampleHD6 <- CreateSeuratObject(counts=GSM9493325, project="GSM9493325", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD6, "sampleHD6")
setwd(directoryFolder)
GSM9493326 <- Read10X("GSM9493326")
sampleHD7 <- CreateSeuratObject(counts=GSM9493326, project="GSM9493326", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD7,"sampleHD7")
setwd(directoryFolder)
GSM9493327 <- Read10X("GSM9493327")
sampleHD8 <- CreateSeuratObject(counts=GSM9493327, project="GSM9493327", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD8, "sampleHD8")
setwd(directoryFolder)
GSM9493328 <- Read10X("GSM9493328")
sampleHD9 <- CreateSeuratObject(counts=GSM9493328, project="GSM9493328", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD9, "sampleHD9")
setwd(directoryFolder)
GSM9493329 <- Read10X("GSM9493329")
sampleHD10 <- CreateSeuratObject(counts=GSM9493329, project="GSM9493329", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD10, "sampleHD10")
setwd(directoryFolder)
GSM9493330 <- Read10X("GSM9493330")
sampleHD11 <- CreateSeuratObject(counts=GSM9493330, project="GSM9493330", min.cells=3, min.features=200)
saveRDS(sampleHD11, "sampleHD11")
setwd(directoryFolder)
GSM9493331 <- Read10X("GSM9493331")
sampleHD12 <- CreateSeuratObject(counts=GSM9493331, project="GSM9493331", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD12, "sampleHD12")
setwd(directoryFolder)
GSM9493332 <- Read10X("GSM9493332")
patient6 <- CreateSeuratObject(counts=GSM9493332, project="GSM9493332", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient6, "patient6")
setwd(directoryFolder)
GSM9493333 <- Read10X("GSM9493333")
patient9 <- CreateSeuratObject(counts=GSM9493333, project="GSM9493333", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient9, "patient9")
setwd(directoryFolder)
GSM9493334 <- Read10X("GSM9493334")
patient14 <- CreateSeuratObject(counts=GSM9493334, project="GSM9493334", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient14, "patient14")
setwd(directoryFolder)
GSM9493335 <- Read10X("GSM9493335")
patient16 <- CreateSeuratObject(counts=GSM9493335, project="GSM9493335", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient16, "patient16")
setwd(directoryFolder)
GSM9493336 <- Read10X("GSM9493336")
patient18 <- CreateSeuratObject(counts=GSM9493336, project="GSM9493336", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient18, "patient18")
setwd(directoryFolder)
GSM9493337 <- Read10X("GSM9493337")
patient19 <- CreateSeuratObject(counts=GSM9493337, project="GSM9493337", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient19, "patient19")
setwd(directoryFolder)
GSM9493338 <- Read10X("GSM9493338")
patient20 <- CreateSeuratObject(counts=GSM9493338, project="GSM9493338", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient20, "patient20")
setwd(directoryFolder)
GSM9493339 <- Read10X("GSM9493339")
patient22 <- CreateSeuratObject(counts=GSM9493339, project="GSM9493339", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient22, "patient22")
setwd(directoryFolder)
GSM9493340 <- Read10X("GSM9493340")
patient25 <- CreateSeuratObject(counts=GSM9493340, project="GSM9493340", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient25, "patient25")
setwd(directoryFolder)
GSM9493341 <- Read10X("GSM9493341")
patient27 <- CreateSeuratObject(counts=GSM9493341, project="GSM9493341", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient27, "patient27")
setwd(directoryFolder)
GSM9493342 <- Read10X("GSM9493342")
patient30 <- CreateSeuratObject(counts=GSM9493342, project="GSM9493342", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient30, "patient30")
setwd(directoryFolder)
GSM9493343 <- Read10X("GSM9493343")
patient31 <- CreateSeuratObject(counts=GSM9493343, project="GSM9493343", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient31, "patient31")
setwd(directoryFolder)
GSM9493344 <- Read10X("GSM9493344")
patient36 <- CreateSeuratObject(counts=GSM9493344, project="GSM9493344", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient36, "patient36")
setwd(directoryFolder)
GSM9493345 <- Read10X("GSM9493345")
patient37 <- CreateSeuratObject(counts=GSM9493345, project="GSM9493345", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient37, "patient37")
setwd(directoryFolder)
GSM9493346 <- Read10X("GSM9493346")
patient38 <- CreateSeuratObject(counts=GSM9493346, project="GSM9493346", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient38, "patient38")
setwd(directoryFolder)
GSM9493347 <- Read10X("GSM9493347")
patient39 <- CreateSeuratObject(counts=GSM9493347, project="GSM9493347", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient39, "patient39")
setwd(directoryFolder)
GSM9493348 <- Read10X("GSM9493348")
patient41 <- CreateSeuratObject(counts=GSM9493348, project="GSM9493348", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient41, "patient41")
setwd(directoryFolder)
GSM9493349 <- Read10X("GSM9493349")
patient44 <- CreateSeuratObject(counts=GSM9493349, project="GSM9493349", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient44, "patient44")
setwd(directoryFolder)
GSM9493350 <- Read10X("GSM9493350")
patient51 <- CreateSeuratObject(counts=GSM9493350, project="GSM9493350", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient51, "patient51")
setwd(directoryFolder)
GSM9493351 <- Read10X("GSM9493351")
patient52 <- CreateSeuratObject(counts=GSM9493351, project="GSM9493351", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient52, "patient52")
We uploaded all the sampleHD files and patient files. There is another tutorial on merging these RDS files.
We still have these items in our environment and it is taking up quite a bit of space. Lets delete these objects after verifying we have all files.
ls()
## [1] "directoryFolder" "GSM9493320" "GSM9493321" "GSM9493322"
## [5] "GSM9493323" "GSM9493324" "GSM9493325" "GSM9493326"
## [9] "GSM9493327" "GSM9493328" "GSM9493329" "GSM9493330"
## [13] "GSM9493331" "GSM9493332" "GSM9493333" "GSM9493334"
## [17] "GSM9493335" "GSM9493336" "GSM9493337" "GSM9493338"
## [21] "GSM9493339" "GSM9493340" "GSM9493341" "GSM9493342"
## [25] "GSM9493343" "GSM9493344" "GSM9493345" "GSM9493346"
## [29] "GSM9493347" "GSM9493348" "GSM9493349" "GSM9493350"
## [33] "GSM9493351" "patient14" "patient16" "patient18"
## [37] "patient19" "patient20" "patient22" "patient25"
## [41] "patient27" "patient30" "patient31" "patient36"
## [45] "patient37" "patient38" "patient39" "patient41"
## [49] "patient44" "patient51" "patient52" "patient6"
## [53] "patient9" "RDS_objects" "sampleHD1" "sampleHD10"
## [57] "sampleHD11" "sampleHD12" "sampleHD2" "sampleHD3"
## [61] "sampleHD4" "sampleHD5" "sampleHD6" "sampleHD7"
## [65] "sampleHD8" "sampleHD9" "seriesInfo1" "seriesInfoDesign"
Lets remove the GSM samples.
rm("GSM9493320" , "GSM9493321" ,
"GSM9493322" , "GSM9493323" , "GSM9493324" ,
"GSM9493325" , "GSM9493326" , "GSM9493327" ,
"GSM9493328" , "GSM9493329" , "GSM9493330" ,
"GSM9493331" , "GSM9493332" , "GSM9493333" ,
"GSM9493334" , "GSM9493335" , "GSM9493336" ,
"GSM9493337" , "GSM9493338" , "GSM9493339" ,
"GSM9493340" , "GSM9493341" , "GSM9493342" ,
"GSM9493343" , "GSM9493344" , "GSM9493345",
"GSM9493346" , "GSM9493347" , "GSM9493348" ,
"GSM9493349" , "GSM9493350" , "GSM9493351")
==================================================================================
library(Seurat)
setwd(RDS_objects)
sampleHD1 <- readRDS("sampleHD1")
str(sampleHD1)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 1
## .. .. .. .. ..$ counts:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:23132352] 17 59 61 62 84 100 107 109 137 140 ...
## .. .. .. .. .. .. ..@ p : int [1:9150] 0 2534 5498 7925 10764 13088 17161 19540 22681 24627 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 25684 9149
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:23132352] 1 1 1 1 2 1 2 2 2 4 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:9149, 1] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:9149] "AAACCCACAGCATTGT-1" "AAACCCACATCCGAAT-1" "AAACCCAGTACTAGCT-1" "AAACCCAGTCACTCAA-1" ...
## .. .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. .. .. ..$ dim : int [1:2] 9149 1
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:9149] "AAACCCACAGCATTGT-1" "AAACCCACATCCGAAT-1" "AAACCCAGTACTAGCT-1" "AAACCCAGTCACTCAA-1" ...
## .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:25684, 1] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:25684] "ENSG00000290826" "ENSG00000238009" "ENSG00000241860" "ENSG00000286448" ...
## .. .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. .. .. ..$ dim : int [1:2] 25684 1
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:25684] "ENSG00000290826" "ENSG00000238009" "ENSG00000241860" "ENSG00000286448" ...
## .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. ..@ default : int 1
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 25684 obs. of 0 variables
## .. .. .. ..@ misc :List of 1
## .. .. .. .. ..$ calcN: logi TRUE
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 9149 obs. of 3 variables:
## .. ..$ orig.ident : Factor w/ 1 level "GSM9493320": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nCount_RNA : num [1:9149] 5606 6841 6104 7370 5848 ...
## .. ..$ nFeature_RNA: int [1:9149] 2534 2964 2427 2839 2324 4073 2379 3141 1946 2299 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 1 level "GSM9493320": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:9149] "AAACCCACAGCATTGT-1" "AAACCCACATCCGAAT-1" "AAACCCAGTACTAGCT-1" "AAACCCAGTCACTCAA-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "GSM9493320"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 3 0
## ..@ commands : list()
## ..@ tools : list()
The above is the structure of each of the following similarly that we will upload to merge using Seurat that will prepend the name of the sample to the barcode as it merges these barcodes that may be different in each sample as only barcodes that showed in more than 10 cells of the array or had more than 300 genes.
We can see the @meta.data feature is where the dataframe is located. We can look at it. But only the first 10 rows for aesthetics and to avoid printing out a long table.
sampleHD1@meta.data[1:10,]
## orig.ident nCount_RNA nFeature_RNA
## AAACCCACAGCATTGT-1 GSM9493320 5606 2534
## AAACCCACATCCGAAT-1 GSM9493320 6841 2964
## AAACCCAGTACTAGCT-1 GSM9493320 6104 2427
## AAACCCAGTCACTCAA-1 GSM9493320 7370 2839
## AAACCCAGTGGTAACG-1 GSM9493320 5848 2324
## AAACCCATCAATGCAC-1 GSM9493320 14448 4073
## AAACCCATCCCTCTAG-1 GSM9493320 5522 2379
## AAACCCATCGTCGATA-1 GSM9493320 8924 3141
## AAACGAAAGCCAGACA-1 GSM9493320 4517 1946
## AAACGAAAGGATACCG-1 GSM9493320 5301 2299
setwd(RDS_objects)
sampleHD2 <- readRDS("sampleHD2")
setwd(RDS_objects)
sampleHD3 <- readRDS("sampleHD3")
setwd(RDS_objects)
sampleHD4 <- readRDS("sampleHD4")
setwd(RDS_objects)
sampleHD5 <- readRDS("sampleHD5")
setwd(RDS_objects)
sampleHD6 <- readRDS("sampleHD6")
setwd(RDS_objects)
sampleHD7 <- readRDS("sampleHD7")
setwd(RDS_objects)
sampleHD8 <- readRDS("sampleHD8")
setwd(RDS_objects)
sampleHD9 <- readRDS("sampleHD9")
setwd(RDS_objects)
sampleHD10 <- readRDS("sampleHD10")
setwd(RDS_objects)
sampleHD11 <- readRDS("sampleHD11")
setwd(RDS_objects)
sampleHD12 <- readRDS("sampleHD12")
setwd(RDS_objects)
patient6 <- readRDS("patient6")
setwd(RDS_objects)
patient9 <- readRDS("patient9")
setwd(RDS_objects)
patient14 <- readRDS("patient14")
setwd(RDS_objects)
patient16 <- readRDS("patient16")
setwd(RDS_objects)
patient18 <- readRDS("patient18")
setwd(RDS_objects)
patient19 <- readRDS("patient19")
setwd(RDS_objects)
patient20 <- readRDS("patient20")
setwd(RDS_objects)
patient22 <- readRDS("patient22")
setwd(RDS_objects)
patient25 <- readRDS("patient25")
setwd(RDS_objects)
patient27 <- readRDS("patient27")
setwd(RDS_objects)
patient30 <- readRDS("patient30")
setwd(RDS_objects)
patient31 <- readRDS("patient31")
setwd(RDS_objects)
patient36 <- readRDS("/patient36")
setwd(RDS_objects)
patient37 <- readRDS("patient37")
setwd(RDS_objects)
patient38 <- readRDS("patient38")
setwd(RDS_objects)
patient39 <- readRDS("patient39")
setwd(RDS_objects)
patient41 <- readRDS("patient41")
setwd(RDS_objects)
patient44 <- readRDS("patient44")
setwd(RDS_objects)
patient51 <- readRDS("patient51")
setwd(RDS_objects)
patient52 <- readRDS("patient52")
All the Seurat objects are read in. Lets look at them.
setwd(RDS_objects)
ls()
## [1] "directoryFolder" "patient14" "patient16" "patient18"
## [5] "patient19" "patient20" "patient22" "patient25"
## [9] "patient27" "patient30" "patient31" "patient36"
## [13] "patient37" "patient38" "patient39" "patient41"
## [17] "patient44" "patient51" "patient52" "patient6"
## [21] "patient9" "RDS_objects" "sampleHD1" "sampleHD10"
## [25] "sampleHD11" "sampleHD12" "sampleHD2" "sampleHD3"
## [29] "sampleHD4" "sampleHD5" "sampleHD6" "sampleHD7"
## [33] "sampleHD8" "sampleHD9" "seriesInfo1" "seriesInfoDesign"
[1] “patient14” “patient16” “patient18” “patient19” “patient20” “patient22” “patient25” [8] “patient27” “patient30” “patient31” “patient36” “patient37” “patient38” “patient39” [15] “patient41” “patient44” “patient51” “patient52” “patient6” “patient9” “sampleHD1” [22] “sampleHD10” “sampleHD11” “sampleHD12” “sampleHD2” “sampleHD3” “sampleHD4” “sampleHD5” [29] “sampleHD6” “sampleHD7” “sampleHD8” “sampleHD9”
There are 32 samples, 12 healthy and 20 pathology for aggressive Natural Killer T-cell Lymphoma patients.
Now lets try to merge these using the commands given in the tutorial video described in last part. Find here.
There is also a useful youtube video that explains how chip sequencing operates in single cell RNA sequencing here.
mergedSamples <- merge(patient14, y=c(patient16, patient18, patient19,patient20 , patient22 , patient25 , patient27 , patient30, patient31, patient36 , patient37 , patient38 , patient39 , patient41 ,
patient44, patient51, patient52, patient6 , patient9, sampleHD1 , sampleHD10,
sampleHD11, sampleHD12, sampleHD2 , sampleHD3 , sampleHD4 , sampleHD5, sampleHD6 ,
sampleHD7 , sampleHD8 , sampleHD9 ), add.cell.ids = ls()[1:32])
setwd(RDS_objects)
saveRDS(mergedSamples,"mergedSamples.RDS")
Lets read in this merged file. Clear out the environment to leave room for the large Seurat array, table of 10.9 Gb it says.
setwd(RDS_objects)
merged <- readRDS("mergedSamples.RDS")
dim(merged)
[1] 30960 407006 When expanding the object array description of structure in environment in RStudio, the @arrays and $RNA descriptor shows that the 407,006 columns are the cells AKA barcodes, while the 30,960 is the features or genes.
Lets view the table random rows after viewing its structure.
str(merged)
Formal class ‘Seurat’ [package “SeuratObject”] with 13 slots ..@ assays :List of 1 .. ..$ RNA:Formal class ‘Assay5’ [package “SeuratObject”] with 8 slots .. .. .. ..@ layers :List of 32 .. .. .. .. ..$ counts.GSM9493334:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots .. .. .. .. .. .. ..@ i : int [1:19172672] 11 16 42 49 68 81 97 104 128 152 … .. .. .. .. .. .. ..@ p : int [1:8870] 0 1489 2641 4324 7342 9549 13787 15283 18922 21606 … .. .. .. .. .. .. ..@ Dim : int [1:2] 23499 8869 .. .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. ..@ x : num [1:19172672] 1 2 1 1 2 1 1 7 1 1 … .. .. .. .. .. .. ..@ factors : list() .. .. .. .. ..$ counts.GSM9493335:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots .. .. .. .. .. .. ..@ i : int [1:24861569] 36 75 92 100 111 120 201 228 290 311 … .. .. .. .. .. .. ..@ p : int [1:9487] 0 850 5559 8303 10653 13417 18143 21132 23892 26529 … .. .. .. .. .. .. ..@ Dim : int [1:2] 25402 9486 .. .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. ..@ x : num [1:24861569] 1 1 1 2 2 1 1 1 1 1 … .. .. .. .. .. .. ..@ factors : list() .. .. .. .. ..$ counts.GSM9493336:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots .. .. .. .. .. .. ..@ i : int [1:31248138] 29 42 49 56 70 74 80 103 112 118 … .. .. .. .. .. .. ..@ p : int [1:10934] 0 2827 8343 10745 13203 16523 18213 19147 21558 23992 … .. .. .. .. .. .. ..@ Dim : int [1:2] 25619 10933
…
(very long structure of each sample omitted here)
.. .. .. .. .. ..$ dimnames:List of 2 .. .. .. .. .. .. ..$ : chr [1:30960] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385” “ENSG00000291215” … .. .. .. .. .. .. ..$ : chr [1:32] “counts.GSM9493334” “counts.GSM9493335” “counts.GSM9493336” “counts.GSM9493337” … .. .. .. ..@ default : int 32 .. .. .. ..@ assay.orig: chr(0) .. .. .. ..@ meta.data :‘data.frame’: 30960 obs. of 0 variables .. .. .. ..@ misc : list() .. .. .. ..@ key : chr “rna_” ..@ meta.data :‘data.frame’: 407006 obs. of 3 variables: .. ..$ orig.ident : chr [1:407006] “GSM9493334” “GSM9493334” “GSM9493334” “GSM9493334” … .. ..$ nCount_RNA : num [1:407006] 3697 3064 3972 9102 4416 … .. ..$ nFeature_RNA: int [1:407006] 1489 1152 1683 3018 2207 4238 1496 3639 2684 2842 … ..@ active.assay: chr “RNA” ..@ active.ident: Factor w/ 32 levels “GSM9493334”,“GSM9493335”,..: 1 1 1 1 1 1 1 1 1 1 … .. ..- attr(*, “names”)= chr [1:407006] “patient14_AAACCCAAGGAGTACC-1” “patient14_AAACCCACAACCGCTG-1” “patient14_AAACCCACACCGGAAA-1” “patient14_AAACCCACATCGATGT-1” … ..@ graphs : list() ..@ neighbors : list() ..@ reductions : list() ..@ images : list() ..@ project.name: chr “SeuratProject” ..@ misc : list() ..@ version :Classes ‘package_version’, ‘numeric_version’ hidden list of 1 .. ..$ : int [1:3] 5 3 0 ..@ commands : list() ..@ tools : list()
The @meta.data feature of this Seurat array object is where the dataframe is.
merged@meta.data[c(1:10,10000:10010, 50000:50010, 100000:100010,200000:200010,300000:300010,400000:400010),]
Description:df [76 × 3]
orig.ident
patient14_AAACCCACAACCGCTG-1 GSM9493334 3064 1152
patient14_AAACCCACACCGGAAA-1 GSM9493334 3972 1683
patient14_AAACCCACATCGATGT-1 GSM9493334 9102 3018
patient14_AAACCCACATGAAAGT-1 GSM9493334 4416 2207
patient14_AAACCCAGTGATTCTG-1 GSM9493334 13535 4238
patient14_AAACCCATCACCTGTC-1 GSM9493334 3465 1496
patient14_AAACCCATCACTCTTA-1 GSM9493334 12162 3639
patient14_AAACCCATCGCTGTCT-1 GSM9493334 6912 2684
patient14_AAACCCATCGCTTGAA-1 GSM9493334 7876 2842
patient16_AGACCCGGTATCTTCT-1 GSM9493335 5436 2257
patient16_AGACCCGGTTCCAAAC-1 GSM9493335 5606 2253
patient16_AGACCCGTCTACCAGA-1 GSM9493335 7307 3009
patient16_AGACCCGTCTTACGTT-1 GSM9493335 8910 2964
patient16_AGACTCAAGAAGCTGC-1 GSM9493335 6154 2376
patient16_AGACTCAAGGATAATC-1 GSM9493335 11542 3613
patient16_AGACTCAAGGTGATAT-1 GSM9493335 13963 3975
patient16_AGACTCACAAGTGTCT-1 GSM9493335 6324 2609
patient16_AGACTCACAGGACAGT-1 GSM9493335 11725 3827
patient16_AGACTCAGTGACGTCC-1 GSM9493335 5617 2124
patient16_AGACTCAGTTTACTTC-1 GSM9493335 7401 3041
patient22_CATGGTAGTTGAGAGC-1 GSM9493339 6206 2587
patient22_CATGGTATCCACTAGA-1 GSM9493339 5385 2300
patient22_CATTCATAGAATGTTG-1 GSM9493339 4107 1870
patient22_CATTCATAGATCCCAT-1 GSM9493339 6588 2611
patient22_CATTCATAGGAACATT-1 GSM9493339 1519 909
patient22_CATTCATAGGTACTGG-1 GSM9493339 5552 2231
patient22_CATTCATAGTCTCTGA-1 GSM9493339 6460 2680
patient22_CATTCATCATGGCACC-1 GSM9493339 304 260
patient22_CATTCATGTAGTCTTG-1 GSM9493339 4360 1967
patient22_CATTCATGTCAAGTTC-1 GSM9493339 29873 6355
patient22_CATTCATGTCATTCCC-1 GSM9493339 9479 3409
patient30_TTAGTCTGTGTGCTTA-1 GSM9493342 10648 3595
patient30_TTAGTCTGTTAAGACA-1 GSM9493342 10342 3231
patient30_TTAGTCTGTTGGACCC-1 GSM9493342 4785 2124
patient30_TTAGTCTTCGCTACGG-1 GSM9493342 5640 2439
patient30_TTAGTCTTCGGTTCAA-1 GSM9493342 873 612
patient30_TTAGTCTTCGTCCTTG-1 GSM9493342 4520 2149
patient30_TTAGTCTTCTAGCCTC-1 GSM9493342 6459 2491
patient30_TTAGTCTTCTAGGAAA-1 GSM9493342 5409 2426
patient30_TTAGTCTTCTGGCCGA-1 GSM9493342 10246 3270
patient30_TTATTGCAGACTGAGC-1 GSM9493342 8046 3126
patient30_TTATTGCAGCGCATCC-1 GSM9493342 6185 2426
patient44_GATGTTGCACTTGTCC-1 GSM9493349 451 346
patient44_GATGTTGCAGACTGCC-1 GSM9493349 5012 1851
patient44_GATGTTGCAGCTACAT-1 GSM9493349 325 268
patient44_GATGTTGCAGCTTCCT-1 GSM9493349 408 323
patient44_GATGTTGCAGGTACGA-1 GSM9493349 327 284
patient44_GATGTTGCAGGTTACT-1 GSM9493349 304 267
patient44_GATGTTGCAGTGTACT-1 GSM9493349 457 372
patient44_GATGTTGCATAGTCAC-1 GSM9493349 10412 3154
patient44_GATGTTGCATATGAAG-1 GSM9493349 276 246
patient44_GATGTTGCATCCAACA-1 GSM9493349 267 233
patient44_GATGTTGCATCTGGGC-1 GSM9493349 319 256
sampleHD12_ATCATTCGTACAGTAA-1 GSM9493331 2508 681
sampleHD12_ATCATTCGTACCTATG-1 GSM9493331 4988 2131
sampleHD12_ATCATTCGTACGACTT-1 GSM9493331 4040 1497
sampleHD12_ATCATTCGTCCGACGT-1 GSM9493331 3855 1805
sampleHD12_ATCATTCGTCGCCTAG-1 GSM9493331 3326 1673
sampleHD12_ATCATTCGTTATGTGC-1 GSM9493331 9441 3193
sampleHD12_ATCATTCTCCCAATAG-1 GSM9493331 526 208
sampleHD12_ATCATTCTCTGTCAGA-1 GSM9493331 5601 2408
sampleHD12_ATCCACCAGCCTTGAT-1 GSM9493331 3383 1745
sampleHD12_ATCCACCAGCTGTGCC-1 GSM9493331 4399 2135
sampleHD12_ATCCACCAGGCTCACC-1 GSM9493331 2897 1196
sampleHD9_CGATCGGTCTTTCAGT-1 GSM9493328 895 441
sampleHD9_CGATCGGTCTTTGGAG-1 GSM9493328 3303 1071
sampleHD9_CGATGCGAGCGCGTTC-1 GSM9493328 1987 740
sampleHD9_CGATGCGCAAAGAACT-1 GSM9493328 4609 1617
sampleHD9_CGATGCGCAAAGTGTA-1 GSM9493328 2038 920
sampleHD9_CGATGCGCAACGATTC-1 GSM9493328 3533 1719
sampleHD9_CGATGCGCAAGAGATT-1 GSM9493328 5649 1822
sampleHD9_CGATGCGCACGGCCAT-1 GSM9493328 4760 1893
sampleHD9_CGATGCGCAGCTTTCC-1 GSM9493328 3406 1130
sampleHD9_CGATGCGCAGGCAATG-1 GSM9493328 4102 1541
sampleHD9_CGATGCGCATGTTCGA-1 GSM9493328 3295 1311
The above is a print out like much of this file because of the heavy computing and storage it requires to process these large files. But the way this merged data frame adds in the sample tag or ID is to prepend it to the barcode. When we made a data table adding in the sample ID we added it as its own column or feature next to same row giving GSM ID. This format of Seurat is able to be ran in other programs built for this type of object when it comes to filtering out the best quality, normalizing the data, and running the machine learning algorithms on the data to get best genes using PCA or principal component analysis, bayesion modeling, umap, tsne, and clustering with k-means. Those are machine learning algorithms best used on unsupervised data where the sample label isn’t known or the class the sample belongs to in classification isn’t known to build a model on a percent of a known class label data of 80% and test on a hold out test set of 20%. This makes sense since the genes known to cause pathology aren’t known yet, and these algorithms help find out in dimensional space the patterns that take shape in pathology samples compared to healthy samples.
All of these Rmarkdown chunks for the most part were printed copies of output that isn’t rendered the same as in RStudio, but the loading and saving of the files is tedious in time as well as storage, and my laptop didn’t have enough storage for the files and stopped the knitr program many times previously, so it is easier to do this since still in preprocessing stages. We will see with the other tutorials that this one is based off, the results with the unsupervised machine learning algorithms we need to find the best genes to target this pathology of Natural Killer T-cell Lymphoma and EBV associations to it.
Thanks and keep checking in.
=========================================================
I was directed to this youtube video when logging in to find other youtube videos on the next part of analysis in single cell RNA sequencing (scRNA sequencing) and found this one that explains work flow nicely, video. Lets use the tutorial code above for the workflow processing with
library(Seurat)
library(tidyverse)
setwd(path)
list.files(path)
## [1] "mergedSamples.RDS" "patient14" "patient16"
## [4] "patient18" "patient19" "patient20"
## [7] "patient22" "patient25" "patient27"
## [10] "patient30" "patient31" "patient36"
## [13] "patient37" "patient38" "patient39"
## [16] "patient41" "patient44" "patient51"
## [19] "patient52" "patient6" "patient9"
## [22] "sampleHD1" "sampleHD10" "sampleHD11"
## [25] "sampleHD12" "sampleHD2" "sampleHD3"
## [28] "sampleHD4" "sampleHD5" "sampleHD6"
## [31] "sampleHD7" "sampleHD8" "sampleHD9"
set.seed(123)
The above files are in our RDS_objects folder. It can be set within the chunk but reverts back to directory outside the Rmarkdown notebook chunk when it is done running that chunk.
Lets read in the mergedSamples.RDS of our samples merged together in a 400k row of genes by 4 columns.
setwd(path)
merged <- readRDS('mergedSamples.RDS')
str(merged)
Formal class ‘Seurat’ [package “SeuratObject”] with 13 slots ..@ assays :List of 1 .. ..$ RNA:Formal class ‘Assay5’ [package “SeuratObject”] with 8 slots .. .. .. ..@ layers :List of 32 .. .. .. .. ..$ counts.GSM9493334:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots .. .. .. .. .. .. ..@ i : int [1:19172672] 11 16 42 49 68 81 97 104 128 152 … .. .. .. .. .. .. ..@ p : int [1:8870] 0 1489 2641 4324 7342 9549 13787 15283 18922 21606 … .. .. .. .. .. .. ..@ Dim : int [1:2] 23499 8869 .. .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. ..@ x : num [1:19172672] 1 2 1 1 2 1 1 7 1 1 … .. .. .. .. .. .. ..@ factors : list() .. .. .. .. ..$ counts.GSM9493335:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots .. .. .. .. .. .. ..@ i : int [1:24861569] 36 75 92 100 111 120 201 228 290 311 … .. .. .. .. .. .. ..@ p : int [1:9487] 0 850 5559 8303 10653 13417 18143 21132 23892 26529 … .. .. .. .. .. .. ..@ Dim : int [1:2] 25402 9486 .. .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. ..@ x : num [1:24861569] 1 1 1 2 2 1 1 1 1 1 … .. .. .. .. .. .. ..@ factors : list() .. .. .. .. ..$ counts.GSM9493336:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots .. .. .. .. .. .. ..@ i : int [1:31248138] 29 42 49 56 70 74 80 103 112 118 … .. .. .. .. .. .. ..@ p : int [1:10934] 0 2827 8343 10745 13203 16523 18213 19147 21558 23992 … .. .. .. .. .. .. ..@ Dim : int [1:2] 25619 10933 .. .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. ..@ x : num [1:31248138] 1 1 1 1 1 2 2 8 2 1 … .. .. .. .. .. .. ..@ factors : list() .. .. .. .. ..$ counts.GSM9493337:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots
…
.. .. .. .. ..$ counts.GSM9493328:Formal class ‘dgCMatrix’ [package “Matrix”] with 6 slots .. .. .. .. .. .. ..@ i : int [1:20123150] 40 47 56 60 71 72 98 130 132 145 … .. .. .. .. .. .. ..@ p : int [1:11252] 0 2005 6063 7788 9543 9751 10629 11878 18200 19565 … .. .. .. .. .. .. ..@ Dim : int [1:2] 23877 11251 .. .. .. .. .. .. ..@ Dimnames:List of 2 .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. .. ..$ : NULL .. .. .. .. .. .. ..@ x : num [1:20123150] 1 1 2 3 1 1 1 1 2 2 … .. .. .. .. .. .. ..@ factors : list() .. .. .. ..@ cells :Formal class ‘LogMap’ [package “SeuratObject”] with 1 slot .. .. .. .. .. ..@ .Data: logi [1:407006, 1:32] TRUE TRUE TRUE TRUE TRUE TRUE … .. .. .. .. .. .. ..- attr(, “dimnames”)=List of 2 .. .. .. .. .. .. .. ..$ : chr [1:407006] “patient14_AAACCCAAGGAGTACC-1” “patient14_AAACCCACAACCGCTG-1” “patient14_AAACCCACACCGGAAA-1” “patient14_AAACCCACATCGATGT-1” … .. .. .. .. .. .. .. ..$ : chr [1:32] “counts.GSM9493334” “counts.GSM9493335” “counts.GSM9493336” “counts.GSM9493337” … .. .. .. .. .. ..$ dim : int [1:2] 407006 32 .. .. .. .. .. ..$ dimnames:List of 2 .. .. .. .. .. .. ..$ : chr [1:407006] “patient14_AAACCCAAGGAGTACC-1” “patient14_AAACCCACAACCGCTG-1” “patient14_AAACCCACACCGGAAA-1” “patient14_AAACCCACATCGATGT-1” … .. .. .. .. .. .. ..$ : chr [1:32] “counts.GSM9493334” “counts.GSM9493335” “counts.GSM9493336” “counts.GSM9493337” … .. .. .. ..@ features :Formal class ‘LogMap’ [package “SeuratObject”] with 1 slot .. .. .. .. .. ..@ .Data: logi [1:30960, 1:32] TRUE TRUE TRUE TRUE TRUE TRUE … .. .. .. .. .. .. ..- attr(, “dimnames”)=List of 2 .. .. .. .. .. .. .. ..$ : chr [1:30960] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385” “ENSG00000291215” … .. .. .. .. .. .. .. ..$ : chr [1:32] “counts.GSM9493334” “counts.GSM9493335” “counts.GSM9493336” “counts.GSM9493337” … .. .. .. .. .. ..$ dim : int [1:2] 30960 32 .. .. .. .. .. ..$ dimnames:List of 2 .. .. .. .. .. .. .. ..$ : chr [1:30960] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385” “ENSG00000291215” … .. .. .. .. .. .. .. ..$ : chr [1:32] “counts.GSM9493334” “counts.GSM9493335” “counts.GSM9493336” “counts.GSM9493337” … .. .. .. .. .. ..$ dim : int [1:2] 30960 32 .. .. .. .. .. ..$ dimnames:List of 2 .. .. .. .. .. .. ..$ : chr [1:30960] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385” “ENSG00000291215” … .. .. .. .. .. .. ..$ : chr [1:32] “counts.GSM9493334” “counts.GSM9493335” “counts.GSM9493336” “counts.GSM9493337” … .. .. .. ..@ default : int 32 .. .. .. ..@ assay.orig: chr(0) .. .. .. ..@ meta.data :‘data.frame’: 30960 obs. of 0 variables .. .. .. ..@ misc : list() .. .. .. ..@ key : chr “rna_” ..@ meta.data :‘data.frame’: 407006 obs. of 3 variables: .. ..$ orig.ident : chr [1:407006] “GSM9493334” “GSM9493334” “GSM9493334” “GSM9493334” … .. ..$ nCount_RNA : num [1:407006] 3697 3064 3972 9102 4416 … .. ..$ nFeature_RNA: int [1:407006] 1489 1152 1683 3018 2207 4238 1496 3639 2684 2842 … ..@ active.assay: chr “RNA” ..@ active.ident: Factor w/ 32 levels “GSM9493334”,“GSM9493335”,..: 1 1 1 1 1 1 1 1 1 1 … .. ..- attr(*, “names”)= chr [1:407006] “patient14_AAACCCAAGGAGTACC-1” “patient14_AAACCCACAACCGCTG-1” “patient14_AAACCCACACCGGAAA-1” “patient14_AAACCCACATCGATGT-1” … ..@ graphs : list() ..@ neighbors : list() ..@ reductions : list() ..@ images : list() ..@ project.name: chr “SeuratProject” ..@ misc : list() ..@ version :Classes ‘package_version’, ‘numeric_version’ hidden list of 1 .. ..$ : int [1:3] 5 3 0 ..@ commands : list() ..@ tools : list()
There are many attachments to this Seurat object, not just the matrix of merged columns with merged@meta.data but also in a $chr element list in this array as [1:30960] showing ENSEMBL IDs. This will be useful when looking for top genes and getting labels as well as this next code extracting the mitochondrial DNA with regex to make a percent of mitochondrial DNA column.
dim(merged@meta.data)
[1] 407006 3
In the meta.data table there are 3 columns and 407,006 barcodes for cells of the array.
colnames(merged@meta.data)
[1] “orig.ident” “nCount_RNA” “nFeature_RNA”
The column names of merged give the original identity of GSM ID, the counts of RNA that showed up for each barcode cell, and the number of genes as features that the barcode was part of in this single cell RNA sequencing project.
#quality control (QC)
When using quality control you don’t want a low count as the sequencing is probably altered or a low, low number unique genes indicates cell dead, dying, or empty droplet, and if other cells have higher counts and this could be an error within that cell, and you don’t want doublets or multiplets of genes producing high counts that are very large compared to the average counts of cells because that could mean the cell has 2 cells in it as the genes stuck together and giving a higher count. And for mitochondrial DNA, it is not located in the cytoplasm where most RNA ready to be translated into a protein occurs, but could be due to apoptosis of the cell and the mitochondrial membrane ruptured and leaking out its DNA or the cell is damaged and dying. So a high percentage of mitochondrial DNA distorts the counts as well. Mitochondrial DNA is only transcribed in the mitochondrial.
##perc at
merged[['percent_mt']] <- PercentageFeatureSet(merged, pattern = '^MT-')
dim(merged@meta.data)
[1] 407006 4
head(merged@meta.data)
Description:df [6 × 4]
orig.ident
The nCount is total number of molecules in a cell The nFeature is the total number of unique genes in each cell
This violin plot VlnPlot() is used as a Seurat library plot.
VlnPlot(merged, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol=3)
We use the above plots of violin plots to compare the scatter and use those above lowest threshold of all scatter and not including less scatter above the highest theshold in common. For counts, visually estimated, the lowest threshold is 0 and highest is 2,000. For Features it looks like lowest threshold is about 100 and highest is around 4,000. The percent mitochondrial DNA looks like around 6-8 %.
This next plot is for a regression line to identify thresholds to filter out low and high counts and features as well at mt percent, the number of molecules on x-axis as counts of cells and the number of genes as features on y-axis.
You don’t want to have cells to be in lower right corner under regression line because it means that high molecule count but low unique features, capturing low genes and sequencing over and over again givig high count, and not in upper left corner bc high genes but low count because not deeply sequence enough, always check for this to see that it is due to low quality cells and not due to an artifact or sequencing error. That could ruin downstream analysis.
FeatureScatter(nsclc_seu, feature1 = “nCount_RNA”, feature2 = “nFeature_RNA”) + geom_smooth(method = ‘lm’)
library(ggplot2)
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')+
# Control axis limits using ggplot2 functions
xlim(0, 30000)+ # Set X-axis range
ylim(0, 16000) # Set Y-axis range
Looks like around 20,000 count of RNA and about 5,000 features as thresholds to filter out for quality control to select best genes leading to cell differentiation in this pathology. This scatter follows a log function so I think it was already normalized. We can place the log function over the scatter.
FeatureScatter(merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm', , formula = y ~ log(x))+
# Control axis limits using ggplot2 functions
xlim(0, 30000)+ # Set X-axis range
ylim(0, 16000) # Set Y-axis range
It looks like the scatter are the shape of a log curve when compared to the linear regression line, but when fitting a log function over the scatter, the curve fell flat early and missed many scatter after about 4,000 count RNA. Lets proceed with filtering based on the linear fit model and thresholds estimated.
#filtering
mergedFiltered <- subset(merged, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent_mt <8 & nCount_RNA < 18000)
rm(merged)
You can make your own thresholds on how you interpret visualizations
#Normalize data after removing filtered out cells
Normalization by default, use global scaling normalization of log normalize, default scaling of 10,000, and then get a log transformed result.
#FNS for Filtered, Normalized, Scaled
mergedFNS <- NormalizeData(mergedFiltered, normalization.method = 'LogNormalize', scale.factor = 10000)
#mergedFilteredNormalizedScaled <- NormalizeData(mergedFNS)
rm(mergedFiltered)
str(mergedFNS)
It is log transformed after log normalizing then scaling the filtered data. This is done by re-running the NormalizeData function with no set parameters but only name of the filtered, log normalized, and scaled data.
mergedFilteredNormalizedScaled <- NormalizeData(mergedFNS)
rm(mergedFNS)
#identify highly variable features
The default is to return 2000 features per dataset
mergedFilteredNormalizedScaled <- FindVariableFeatures(mergedFilteredNormalizedScaled, selection.method = 'vst', nfeatures = 2000)
Short snippet above of the grids per sample calculated to find the high
variability genes.
#identify the top 10 highly variable genes, this builds a scatter plot that will identify and label the scatterpoints belonging to the top 10 genes with highest variability. Average expression on x-axis and Standardized Variance on y-axis.
top10 <- head(VariableFeatures(mergedFilteredNormalizedScaled), 10)
top10
[1] “PPBP” “IGLC2” “IGKC” “S100A9” “EREG” “IGLC3” “SOX5” “PF4” “LYZ” “LINGO2”
top10_plot <- VariableFeaturePlot(mergedFilteredNormalizedScaled)
LabelPoints(plot = top10_plot, points = top10, repel = TRUE)
#Scaling after finding variable features in dataset.
This corrects for unwanted sources of variation, technical as example with batch effects, or biological such as cell cycle stage or mitochondrial contamination, this ensures results are due to clustering by gene differences and not technological or biological errors.
The ScaleData function is part of standard parameters, there is a more complex single cell Transform function or workflow within Seurat that could be alternatively used.
First to make a character string stored as the rownames of the structural array we have been using mergedFilteredNormalizedScaled.
all_genes <- rownames(mergedFilteredNormalizedScaled)
all_genes[1:500]
[1] “ENSG00000238009” “ENSG00000241860” “ENSG00000290385”
“ENSG00000291215” “LINC01409”
[6] “ENSG00000290784” “LINC00115” “LINC01128” “ENSG00000288531”
“FAM41C”
[11] “ENSG00000272438” “NOC2L” “KLHL17” “PLEKHN1” “ENSG00000272512” [16]
“HES4” “ISG15” “ENSG00000224969” “AGRN” “ENSG00000291156” [21]
“C1orf159” “ENSG00000285812” “LINC01342” “TNFRSF18” “TNFRSF4”
[26] “SDF4” “B3GALT6” “C1QTNF12” “ENSG00000260179” “UBE2J2”
[31] “LINC01786” “SCNN1D” “ACAP3” “PUSL1” “INTS11”
[36] “CPTP” “TAS1R3” “DVL1” “MXRA8” “AURKAIP1”
[41] “CCNL2” “MRPL20-AS1” “MRPL20” “MRPL20-DT” “ATAD3C”
[46] “ATAD3B” “ENSG00000290916” “ATAD3A” “TMEM240” “SSU72”
[51] “ENSG00000215014” “FNDC10” “ENSG00000286989” “ENSG00000272106”
“MIB2”
[56] “MMP23B” “CDK11B” “ENSG00000272004” “SLC35E2B” “CDK11A”
[61] “ENSG00000290854” “NADK” “GNB1” “GNB1-DT” “CFAP74”
[66] “PRKCZ” “ENSG00000271806” “PRKCZ-AS1” “FAAP20” “ENSG00000234396”
[71] “SKI” “ENSG00000287356” “MORN1” “ENSG00000272420” “RER1”
[76] “PEX10” “PLCH2” “PANK4” “HES5” “ENSG00000272449” [81]
“TNFRSF14-AS1” “TNFRSF14” “ENSG00000228037” “ENSG00000289610”
“PRXL2B”
[86] “MMEL1” “TTC34” “PRDM16” “MEGF6” “ENSG00000238260” [91] “TPRG1L”
“WRAP73” “TP73” “SMIM1” “LRRC47”
[96] “CEP104” “DFFB” “C1orf174” “LINC01134” “AJAP1”
[101] “ENSG00000236948” “NPHP4” “KCNAB2” “CHD5” “RPL22”
[106] “RNF207” “ICMT” “GPR153” “ACOT7” “ENSG00000271746” [111] “ESPN”
“TNFRSF25” “PLEKHG5” “NOL9” “TAS1R1”
[116] “ZBTB48” “KLHL21” “PHF13” “THAP3” “DNAJC11”
[121] “CAMTA1-DT” “CAMTA1” “VAMP3” “ENSG00000269925” “PER3”
[126] “ENSG00000236266” “UTS2” “TNFRSF9” “PARK7” “ENSG00000284747” [131]
“ENSG00000284716” “ERRFI1” “ERRFI1-DT” “SLC45A1” “RERE”
[136] “RERE-AS1” “ENO1” “ENO1-AS1” “CA6” “ENSG00000290109” [141]
“SLC2A5” “GPR157” “MIR34AHG” “LNCTAM34A” “H6PD”
[146] “SPSB1” “LINC02606” “SLC25A33” “TMEM201” “PIK3CD”
[151] “PIK3CD-AS1” “PIK3CD-AS2” “CLSTN1” “CTNNBIP1” “ENSG00000285701”
[156] “LZIC” “ENSG00000228150” “NMNAT1” “RBP7” “UBE4B”
[161] “KIF1B” “ENSG00000284735” “ENSG00000284642” “PGD” “CENPS”
[166] “CORT” “DFFA” “PEX14” “ENSG00000287727” “CASZ1”
[171] “ENSG00000272078” “C1orf127” “TARDBP” “MASP2” “SRM”
[176] “EXOSC10” “ENSG00000226849” “EXOSC10-AS1” “MTOR” “MTOR-AS1”
[181] “ANGPTL7” “UBIAD1” “DISP3” “ENSG00000284708” “FBXO2”
[186] “FBXO44” “MAD2L2” “FBXO6” “DRAXIN” “AGTRAP”
[191] “C1orf167” “MTHFR” “CLCN6” “NPPA” “ENSG00000285646” [196]
“KIAA2013” “PLOD1” “MFN2” “MIIP” “TNFRSF8”
[201] “TNFRSF1B” “VPS13D” “LINC02766” “DHRS3” “PRDM2”
[206] “ENSG00000289380” “KAZN” “ENSG00000231606” “KAZN-AS1”
“TMEM51-AS1”
[211] “TMEM51” “TMEM51-AS2” “FHAD1” “FHAD1-AS1” “EFHD2-AS1”
[216] “EFHD2” “CTRC” “CELA2A” “CELA2B” “CASP9”
[221] “DNAJC16” “ENSG00000272510” “AGMAT” “ENSG00000237301” “DDI2”
[226] “ENSG00000271742” “PLEKHM2” “ENSG00000237938” “SLC25A34”
“FBLIM1”
[231] “UQCRHL” “SPEN-AS1” “SPEN” “ZBTB17” “EPHA2”
[236] “ARHGEF19” “ENSG00000291077” “ENSG00000288398” “CPLANE2”
“FBXO42”
[241] “SZRD1” “SPATA21” “NECAP2” “LINC01772” “CROCCP3”
[246] “ENSG00000261135” “NBPF1” “CROCCP2” “CROCC” “ENSG00000238142”
[251] “ENSG00000272426” “ENSG00000226526” “ATP13A2” “SDHB” “PADI2”
[256] “PADI3” “PADI4” “PADI6” “RCC2” “ENSG00000290096” [261] “ARHGEF10L”
“IGSF21” “ALDH4A1” “IFFO2” “ENSG00000272084” [266] “UBR4” “EMC1-AS1”
“EMC1” “MRTO4” “AKR7L”
[271] “AKR7A3” “AKR7A2” “SLC66A1” “CAPZB” “MICOS10”
[276] “MICOS10-DT” “NBL1” “HTR6” “TMCO4” “OTUD3”
[281] “ENSG00000227066” “PLA2G2C” “UBXN10” “VWA5B1” “ENSG00000226664”
[286] “CAMK2N1” “MUL1” “CDA” “PINK1” “PINK1-AS”
[291] “DDOST” “HP1BP3” “EIF4G3” “ECE1” “ECE1-AS1”
[296] “NBPF3” “ALPL” “RAP1GAP” “USP48” “LDLRAD2”
[301] “HSPG2” “LINC01635” “LINC00339” “CDC42-AS1” “CDC42”
[306] “CDC42-IT1” “WNT4” “ENSG00000285873” “ZBTB40” “C1QA”
[311] “C1QC” “C1QB” “EPHB2” “TEX46” “KDM1A”
[316] “ENSG00000240553” “LUZP1” “LINC01355” “ENSG00000284726”
“HNRNPR”
[321] “ZNF436” “ZNF436-AS1” “TCEA3” “ASAP3” “E2F2”
[326] “ID3” “ENSG00000285802” “MDS2” “RPL11” “ELOA-AS1”
[331] “ELOA” “PITHD1” “ENSG00000289835” “LYPLA2” “GALE”
[336] “HMGCL” “FUCA1” “CNR2” “ENSG00000232557” “PNRC2”
[341] “SRSF10” “IFNLR1” “GRHL3” “STPG1” “NIPAL3”
[346] “ENSG00000288982” “RCAN3AS” “RCAN3” “NCMAP-DT” “NCMAP”
[351] “SRRM1” “ENSG00000284699” “CLIC4” “RUNX3” “ENSG00000261025” [356]
“ENSG00000233755” “SYF2” “ENSG00000284602” “ENSG00000284657”
“RSRP1”
[361] “ENSG00000272432” “RHD” “TMEM50A” “RHCE” “MACO1”
[366] “LDLRAP1” “ENSG00000225643” “MAN1C1” “ENSG00000233478”
“SELENON”
[371] “ENSG00000228172” “MTFR1L” “AUNIP” “ENSG00000236528” “PAQR7”
[376] “STMN1” “PAFAH2” “PDIK1L” “ZNF593” “CNKSR1”
[381] “CEP85” “SH3BGRL3” “UBXN11” “CD52” “CRYBG2”
[386] “ZNF683” “DHDDS” “DHDDS-AS1” “HMGN2” “RPS6KA1”
[391] “ENSG00000289452” “ENSG00000260063” “ARID1A” “PIGV”
“ZDHHC18”
[396] “SFN” “GPN2” “ENSG00000289554” “GPATCH3” “NUDC”
[401] “KDF1” “TRNP1” “TENT5B” “SLC9A1” “WDTC1-DT”
[406] “WDTC1” “TMEM222” “SYTL1” “MAP3K6” “WASF2”
[411] “ENSG00000241169” “ENSG00000237429” “AHDC1” “FGR”
“LINC02574”
[416] “IFI6” “ENSG00000225886” “ENSG00000287244” “FAM76A” “STX12”
[421] “ENSG00000269971” “ENSG00000270031” “ENSG00000286433” “PPP1R8”
“THEMIS2”
[426] “RPA2” “SMPDL3B” “ENSG00000227050” “XKR8” “EYA3”
[431] “ENSG00000289576” “PTAFR” “DNAJC8” “ENSG00000290123”
“ATP5IF1”
[436] “ENSG00000270605” “SESN2” “MED18” “PHACTR4” “RCC1”
[441] “SNHG3” “TRNAU1AP” “SNHG12” “TAF12” “TAF12-DT”
[446] “ENSG00000270103” “GMEB1” “ENSG00000289291” “YTHDF2” “EPB41”
[451] “TMEM200B” “ENSG00000225750” “SRSF4” “MECR” “PTPRU”
[456] “MATN1” “MATN1-AS1” “ENSG00000287510” “LAPTM5” “SDC3”
[461] “PUM1” “NKAIN1” “SNRNP40” “ZCCHC17” “ENSG00000229044” [466]
“FABP3” “SERINC2” “LINC01226” “HCRTR1” “PEF1”
[471] “PEF1-AS1” “ENSG00000264078” “ADGRB2” “ENSG00000203620”
“ENSG00000269967” [476] “PTP4A2” “ENSG00000228634” “KHDRBS1”
“ENSG00000203325” “TMEM39B”
[481] “KPNA6” “ENSG00000250135” “TXLNA” “CCDC28B” “ENSG00000224066”
[486] “IQCC” “DCDC2B” “TMEM234” “EIF3I” “ENSG00000291132” [491]
“FAM167B” “LCK” “HDAC1” “MARCKSL1” “TSSK3”
[496] “FAM229A” “BSDC1” “ZBTB8A” “ZBTB8OS” “RBBP4”
length(all_genes)
There are 30,960 unique gene names.
mergedFilteredNormalizedScaled <- ScaleData(mergedFilteredNormalizedScaled, features = all_genes)
Error: std::bad_alloc 9. stop(structure(list(message = “std::bad_alloc”, call = NULL, cppstack = NULL), class = c(“std::bad_alloc”, “C++Error”, “error”, “condition”))) 8. RowMergeMatricesList(mat_list = all.mat, mat_rownames = all.rownames, all_rownames = all.names) 7. RowMergeSparseMatrices(mat1 = x, mat2 = y) 6. StitchMatrix.dgCMatrix(x = LayerData(object = object, layer = layer[1L], features = features), y = lapply(X = layer[2:length(x = layer)], FUN = LayerData, object = object, features = features), rowmap = slot(object = object, name = “features”)[features, layer], colmap = slot(object = object, … 5. StitchMatrix(x = LayerData(object = object, layer = layer[1L], features = features), y = lapply(X = layer[2:length(x = layer)], FUN = LayerData, object = object, features = features), rowmap = slot(object = object, name = “features”)[features, layer], colmap = slot(object = object, … 4. ScaleData.StdAssay(object = object[[assay]], features = features, vars.to.regress = vars.to.regress, latent.data = latent.data, split.by = split.by, model.use = model.use, use.umi = use.umi, do.scale = do.scale, do.center = do.center, scale.max = scale.max, … 3. ScaleData(object = object[[assay]], features = features, vars.to.regress = vars.to.regress, latent.data = latent.data, split.by = split.by, model.use = model.use, use.umi = use.umi, do.scale = do.scale, do.center = do.center, scale.max = scale.max, block.size = block.size, min.cells.to.block = min.cells.to.block, … 2. ScaleData.Seurat(mergedFilteredNormalizedScaled, features = all_genes) 1. ScaleData(mergedFilteredNormalizedScaled, features = all_genes)
This is where we end for now with at least the top 10 genes of highest variability based on our quality control filtering. But we weren’t able to get past the layer building in the Seurat ScaleData function. I have looked into it for about 15 minutes but not going to stress over it, if you know the fix let me know. Great follow through so far, seems to work with this RAW NCBI gene expression data, though given tutorial on youtube using a different gene expression profile for pathology of non-small cell lung carcinoma.
#Dimensionality Reduction with PCA allow us to visualize multidimensional datasets, with common PCA principal component analysis, summarizing all gene expression data over many genes into principal components (the variation between scatterpoints are the components, variations off the regression lines or dimension, adding components to top components will give most variation and explain most of data).
Loadings is a term for PCA that explains a gene that is strongly associated with that principal component. They are ordered with PC1 most explanative and subsequent add to that explanation in order of best.
The positive and negative genes are named for how ever many features you set for the number of PCs that explain the most.
mergedFilteredNormalizedScaled <- RunPCA(mergedFilteredNormalizedScaled, features = VariableFeatures(mergedFilteredNormalizedScaled))
print(mergedFilteredNormalizedScaled[['pca']], dims = 1:5, nfeatures=5)
View(mergedFilteredNormalizedScaled@assays$RNA)
DimHeatmap(mergedFilteredNormalizedScaled, dims = 1, cells = 500, balanced = TRUE)
The dim plot is a scatter plot of a named PC for principal component, against another PC You will see cells group in clusters by PC comparison 1 by 1.
DimPlot(mergedFilteredNormalizedScaled, reduction = "pca") + NoLegend()
This shows std dev on y-axis and PCs on x-axis to see the threshold limit of number of dimensions or PCs to use, it starts high and drops until plateuing at elbow bend, won’t be useful to use additional that add little variation and more computing.
ElbowPlot(mergedFilteredNormalizedScaled)
Basically the lower resolution the least clusters and higher resolution has more clusters
mergedFilteredNormalizedScaled <- FindNeighbors(mergedFilteredNormalizedScaled, sims = 1:15)
mergedFilteredNormalizedScaled <- FindClusters(mergedFilteredNormalizedScaled, resolution = c(0.1, 0.3, 0.5, 0.7,1))
There will be added columns to the matrix object of our mergedFilteredNormalizedScaled@meta.data for each of the resolutios such as ‘RNA_snn_res0.1’ and ‘RNA_snn_res0.3’, and so on to 1. And the integer values in the new columns indicates to which cluster each cell (or barcode) belongs.
The results are color coded for each cluster. Since gene expression is basically the cell differentiation of becoming different cell types, usually, different clusters represents different cell types.
View(mergedFilteredNormalizedScaled@meta.data)
The data set may interest someone in many clusters that overlap that could be useful for identifying subset of cells interested in.
DimPlot(mergedFilteredNormalizedScaled, group.by = 'RNA_snn_res.1', label = TRUE)
Set identity of clusters makes more sense than using more clusters with resolution 1
Idents(mergedFilteredNormalizedScaled) <- 'RNA_snn_res.0.1'
Gives a clearer visualization of separate clusters with space between each
mergedFilteredNormalizedScaled <- RunUMap(mergedFilteredNormalizedScaled, dims = 1:15)
Next, identify the clusters and look at genetic markers for different cell types and identify different cell types, not in this tutorial but another one
DimPlot(mergedFilteredNormalizedScaled, reduction = 'uMap')
Save the Seurat object to use later
saveRDS(mergedFilteredNormalizedScaled, file='nsclc_seu.RDS')
===================================================