This post is an exploratory data analysis (EDA), using a package that is new to me, of natural killer/T-cell lymphoma (NKTCL) associated with EBV infection. The study is GSE318371 in the NCBI Gene Expression Omnibus (GEO) database. I downloaded the GSE318371_RAW.tar file with the custom download option at the bottom of the series page. Extracting the barcodes files takes a very long time, and you can skip that step: leave each file gzipped with its .gz name, but remove the prefix in front of ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx.gz’, and put the three files for each GSM sample ID into their own folder so that each folder can be read in separately.
This is very recent research, uploaded in February 2026, on an aggressive lymphoma associated with EBV infection. I found other EBV-related studies, including nasopharyngeal carcinoma, Hodgkin lymphoma, and large B-cell lymphoma. For now we work on this project to get the top genes for our machine learning model to predict EBV, Lyme disease, or specific EBV-associated pathologies: multiple sclerosis, mononucleosis, primary EBV infection, various lymphomas, and nasopharyngeal carcinoma.
There is a process to follow to extract each sample's barcodes, features, and counts. This is single-cell RNA sequencing (scRNA-seq) data, not microarray data: many cells are loaded into the instrument, each cell is tagged with its own barcode, and the number of times each gene is detected is counted per cell. The barcodes are the cells and the features are the genes, stored together in matrix format.
I had to watch a YouTube video to better understand the Seurat package and how to read in this data. I would estimate that each barcodes file takes around 30 minutes to extract from its zipped format. My 7-Zip would not extract the files from the right-click menu, even though I followed the videos step by step. The information in them is useful, but it turns out you should just leave the files gzipped, because the code did not work for me on the unzipped files.
Let's read the series summary in the GSE318371-GPL34284_series_matrix.txt file to see how the samples were collected and handled, whether the data were normalized, and the type and design of the study.
seriesInfo1 <- read.csv("GSE318371-GPL34284_series_matrix.txt", nrows=25, sep='\t', stringsAsFactors = T, strip.white = T, na.strings=" ", header=F)
seriesInfo1
## V1
## 1 !Series_title
## 2 !Series_geo_accession
## 3 !Series_status
## 4 !Series_submission_date
## 5 !Series_last_update_date
## 6 !Series_summary
## 7 !Series_overall_design
## 8 !Series_type
## 9 !Series_contributor
## 10 !Series_contributor
## 11 !Series_contributor
## 12 !Series_contributor
## 13 !Series_sample_id
## 14 !Series_contact_name
## 15 !Series_contact_email
## 16 !Series_contact_institute
## 17 !Series_contact_address
## 18 !Series_contact_city
## 19 !Series_contact_zip/postal_code
## 20 !Series_contact_country
## 21 !Series_supplementary_file
## 22 !Series_platform_id
## 23 !Series_platform_id
## 24 !Series_platform_taxid
## 25 !Series_sample_taxid
## V2
## 1 Peripheral blood mononuclear cells single-cell landscape of newly diagnosed NK/T cell lymphoma patients
## 2 GSE318371
## 3 Public on Feb 07 2026
## 4 Feb 04 2026
## 5 Feb 07 2026
## 6 Natural killer/T cell lymphoma (NKTCL) is a rare and aggressive form of non-Hodgkin's lymphoma associated with Epstein-Barr Virus (EBV) infection.The recent advancement of multi-omics technologies has significantly enhanced our understanding of NKTCL disease biology, including genetics, transcription landscape, variations of EBV strain, and microenvironments. Emerging evidence suggests that immunoprofiling of peripheral blood mononuclear cells (PBMCs) is associated with the treatment response of cancer patients and can be used to guide clinical trials and therapy. In this study, we utilized single-cell RNA sequencing (scRNA-seq) to comprehensively characterize the phenotypic landscape of PBMCs in newly diagnosed patients with NKTCL. This research offers a valuable peripheral blood-based signature for newly diagnosed NKTCL, which could be a crucial resource for further investigations into the pathogenesis of NKTCL and the optimization of therapeutic regimens.
## 7 scRNA-seq profiling of PBMCs from healthy donors and newly diagnosed NKTCL patients
## 8 Expression profiling by high throughput sequencing
## 9 Xiaozhen,,Liang
## 10 Rong,,Tao
## 11 Ran,,Jia
## 12 Chuanxu,,Liu
## 13 GSM9493320 GSM9493321 GSM9493322 GSM9493323 GSM9493324 GSM9493325 GSM9493326 GSM9493327 GSM9493328 GSM9493329 GSM9493330 GSM9493331 GSM9493332 GSM9493333 GSM9493334 GSM9493335 GSM9493336 GSM9493337 GSM9493338 GSM9493339 GSM9493340 GSM9493341 GSM9493342 GSM9493343 GSM9493344 GSM9493345 GSM9493346 GSM9493347 GSM9493348 GSM9493349 GSM9493350 GSM9493351
## 14 Xiaozhen,,Liang
## 15 xzliang@simm.ac.cn
## 16 Shanghai Institute of Materia Medica Chinese Academy of Sciences
## 17 Life Science Research Building 320 Yueyang Road, Xuhui District
## 18 Shanghai
## 19 200031
## 20 China
## 21 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE318nnn/GSE318371/suppl/GSE318371_RAW.tar
## 22 GPL24676
## 23 GPL34284
## 24 9606
## 25 9606
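Rather than counting header rows by hand (nrows=25 above), the series-level fields could also be picked out by their !Series_ prefix. The sketch below is only an illustration on a small temporary file that imitates a few lines of the series matrix. The temporary file, the column names field and value, and the assumption that values are wrapped in double quotes are mine, not something taken from the real download.

```r
# Imitate a few lines of a GEO series matrix file (tab-separated, values assumed quoted)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "!Series_geo_accession\t\"GSE318371\"",
  "!Series_platform_id\t\"GPL24676\"",
  "!Series_platform_id\t\"GPL34284\"",
  "!Sample_source_name_ch1\t\"blood\""
), tmp)

# Keep only the series-level lines, however many there are
all_lines <- readLines(tmp)
series_lines <- all_lines[startsWith(all_lines, "!Series_")]

# Parse the kept lines as a two-column table (default quoting strips the double quotes)
series <- read.delim(text = series_lines, header = FALSE,
                     col.names = c("field", "value"),
                     stringsAsFactors = FALSE)

# All platform IDs listed for the series
platforms <- series$value[series$field == "!Series_platform_id"]
platforms
```

Selecting by prefix means the code does not need to change if a series has more or fewer contributor lines.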
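# quote="" turns off quote handling, so values keep their literal double quotes (as in the output below)
seriesInfoDesign <- read.csv("GSE318371-GPL34284_series_matrix.txt", sep='\t', nrows=50, stringsAsFactors = T, strip.white=T, na.strings=" ", quote="", skip=25, header=F)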
dim(seriesInfoDesign)
## [1] 42 30
There are 42 additional rows of metadata on the study design and the methods for handling the data and biological material.
Let's look at the first two columns of the rows that detail the study design.
seriesInfoDesign[c(8:15,17:21,30:33),c(1:2)]
## V1
## 8 !Sample_source_name_ch1
## 9 !Sample_organism_ch1
## 10 !Sample_characteristics_ch1
## 11 !Sample_characteristics_ch1
## 12 !Sample_characteristics_ch1
## 13 !Sample_molecule_ch1
## 14 !Sample_extract_protocol_ch1
## 15 !Sample_extract_protocol_ch1
## 17 !Sample_description
## 18 !Sample_data_processing
## 19 !Sample_data_processing
## 20 !Sample_data_processing
## 21 !Sample_platform_id
## 30 !Sample_instrument_model
## 31 !Sample_library_selection
## 32 !Sample_library_source
## 33 !Sample_library_strategy
## V2
## 8 "blood"
## 9 "Homo sapiens"
## 10 "tissue: blood"
## 11 "cell line: PBMCs"
## 12 "cell type: Peripheral blood immune cells"
## 13 "total RNA"
## 14 "Isolated PBMCs were loaded into a 10× Chromium Chip (v3.1 PN:1000120) and barcoded using a 10x Chromium Controller."
## 15 "RNA from the barcoded cells was then reverse-transcribed, amplified, and prepared into sequencing libraries with the 10× Library Construction Kit (v3.1 PN:1000190) according to the manufacturer’s instructions."
## 17 "Library name: HD1"
## 18 "Raw scRNA-seq data were initially pre-processed using CellRanger (version 8.0.1, 10x Genomics) to align reads to the human genome (GRCh38, 2024-A from 10x Genomics) and count the unique molecular identifiers (UMIs) for each gene to generate specific gene cell count tables. For each scRNA-seq sample, the count tables were filtered to retain the genes detected in at least 10 cells and cells with a minimum gene count of 300."
## 19 "Assembly: GRCh38"
## 20 "Supplementary files format and content: barcodes, features, and matrix files for each samples"
## 21 "GPL34284"
## 30 "Illumina NovaSeq X Plus"
## 31 "cDNA"
## 32 "transcriptomic single cell"
## 33 "RNA-Seq"
The data are total RNA from peripheral blood mononuclear cells (PBMCs), barcoded on a 10x Chromium chip and sequenced as single-cell RNA-seq (this is not microarray data). The authors kept genes detected in at least 10 cells and cells with at least 300 detected genes. RNA-seq counts the transcript fragments that show up in the sequencing; many transcripts are never captured, but enough are. There is a useful YouTube video that explains how this kind of sequencing works. This is where the barcodes and features make sense, since the vocabulary is different from, but similar to, data science vocabulary. The features are the genes, which are the rows, as we will see, and the barcodes are the nucleotide tags that identify the cells, which are the columns of the matrix when we read in the files with the Seurat library. The folder has to contain files named ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx.gz’ to be read in. The YouTube videos I watched unzipped the files before reading them in, but in my recent experience the Seurat library only reads these files while they are still gzipped, with the ‘.gz’ extension kept. That is the format the files are already in when downloaded from the NCBI website.
We can see which GSM samples are healthy donors and which are patients from rows 17 and 41 of the series metadata.
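The filter described in the data-processing note can be sketched on a toy matrix. This is only my illustration of the idea: the matrix, the scaled-down thresholds (2 instead of 10 and 300), and the order of the two steps are assumptions, since the metadata does not say which filter the authors applied first.

```r
# Toy count matrix: genes are rows, cells are columns (made-up numbers)
counts <- matrix(c(5, 0, 2,
                   0, 0, 1,
                   3, 1, 0,
                   0, 0, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

min_cells <- 2   # stands in for "detected in at least 10 cells"
min_genes <- 2   # stands in for "minimum gene count of 300"

# Step 1 (assumed order): keep genes detected in enough cells
keep_genes <- rowSums(counts > 0) >= min_cells
filtered <- counts[keep_genes, , drop = FALSE]

# Step 2: keep cells with enough detected genes among the remaining genes
keep_cells <- colSums(filtered > 0) >= min_genes
filtered <- filtered[, keep_cells, drop = FALSE]

filtered
```

The min.cells and min.features arguments of CreateSeuratObject() used later in this post apply the same kind of thresholds, with values 3 and 200.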
seriesInfoDesign[c(17,41),]
## V1 V2 V3
## 17 !Sample_description "Library name: HD1" "Library name: HD2"
## 41 "ID_REF" "GSM9493320" "GSM9493321"
## V4 V5 V6
## 17 "Library name: HD3" "Library name: HD4" "Library name: HD5"
## 41 "GSM9493322" "GSM9493323" "GSM9493324"
## V7 V8 V9
## 17 "Library name: HD6" "Library name: HD7" "Library name: HD8"
## 41 "GSM9493325" "GSM9493326" "GSM9493327"
## V10 V11 V12
## 17 "Library name: HD9" "Library name: HD10" "Library name: HD11"
## 41 "GSM9493328" "GSM9493329" "GSM9493330"
## V13 V14 V15
## 17 "Library name: HD12" "Library name: patient16" "Library name: patient18"
## 41 "GSM9493331" "GSM9493335" "GSM9493336"
## V16 V17
## 17 "Library name: patient19" "Library name: patient20"
## 41 "GSM9493337" "GSM9493338"
## V18 V19
## 17 "Library name: patient22" "Library name: patient25"
## 41 "GSM9493339" "GSM9493340"
## V20 V21
## 17 "Library name: patient27" "Library name: patient30"
## 41 "GSM9493341" "GSM9493342"
## V22 V23
## 17 "Library name: patient31" "Library name: patient36"
## 41 "GSM9493343" "GSM9493344"
## V24 V25
## 17 "Library name: patient37" "Library name: patient38"
## 41 "GSM9493345" "GSM9493346"
## V26 V27
## 17 "Library name: patient39" "Library name: patient41"
## 41 "GSM9493347" "GSM9493348"
## V28 V29
## 17 "Library name: patient44" "Library name: patient51"
## 41 "GSM9493349" "GSM9493350"
## V30
## 17 "Library name: patient52"
## 41 "GSM9493351"
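This platform file lists 12 healthy donor samples, HD1 to HD12, and 17 patient samples with non-consecutive numbers from patient16 through patient52. The series itself lists 32 GSM IDs. The remaining three, GSM9493332 to GSM9493334 (patient6, patient9, and patient14 in the code further down), do not appear in this GPL34284 file, presumably because they were run on the other platform, GPL24676. That makes 20 patients in total.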
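The two metadata rows can be turned into a small lookup table of GSM ID, library name, and group. The sketch below types in three of the values shown above rather than reading the real file; the helper strip_quotes, the table name sample_table, and the rule that a library name starting with HD means a healthy donor are my own assumptions for illustration.

```r
# Values typed in from rows 17 and 41 above (literal double quotes included)
library_name <- c("\"Library name: HD1\"", "\"Library name: HD12\"",
                  "\"Library name: patient16\"")
gsm_id <- c("\"GSM9493320\"", "\"GSM9493331\"", "\"GSM9493335\"")
# With the real object these might come from, e.g.,
#   vapply(seriesInfoDesign[17, -1], as.character, character(1))

# Hypothetical helper: drop the literal double quotes kept by the import
strip_quotes <- function(x) gsub("\"", "", x, fixed = TRUE)

sample_table <- data.frame(
  gsm     = strip_quotes(gsm_id),
  library = sub("^Library name: ", "", strip_quotes(library_name)),
  stringsAsFactors = FALSE
)

# Assumed rule: HD* libraries are healthy donors, everything else is a patient
sample_table$group <- ifelse(grepl("^HD", sample_table$library), "healthy", "patient")
sample_table
```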
str(seriesInfoDesign)
## 'data.frame': 42 obs. of 30 variables:
## $ V1 : Factor w/ 35 levels "!Sample_channel_count",..: 31 14 25 26 16 32 1 24 21 2 ...
## $ V2 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V3 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V4 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V5 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V6 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V7 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V8 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V9 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V10: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V11: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V12: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V13: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V14: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V15: Factor w/ 40 levels "","\"0\"","\"1\"",..: 25 19 26 13 14 32 3 8 20 36 ...
## $ V16: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V17: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V18: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V19: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V20: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V21: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V22: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V23: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V24: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V25: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V26: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V27: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V28: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V29: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
## $ V30: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
Install the Seurat package with install.packages('Seurat') if you haven't already. Then load the library.
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
We need to use the tidyverse package.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
I found a few YouTube videos on using Seurat for this type of data. Here is an interesting video that explains extracting the files and moving them into a folder for each GSM sample, but don't do the unzipping and renaming it shows. Just leave the files gzipped and remove the prefix so that only the required names remain, with the .gz extension kept. Do put each sample's files in their own folder, named for the sample you will read in as an object in RStudio.
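Renaming and moving 96 files by hand is tedious, so the same arrangement could be scripted. This is a minimal sketch that runs on empty placeholder files in a temporary folder. The helper name sort_10x_files and the demo folder are hypothetical, and the file-name pattern assumes every download looks like GSM9493320_HD1_barcodes.tsv.gz, which is the one example name given in this post.

```r
# Hypothetical helper: move each GEO 10x file into a folder named for its GSM ID
# and strip the prefix so only barcodes/features/matrix names remain.
sort_10x_files <- function(raw_dir) {
  pattern <- "^(GSM[0-9]+)_.*(barcodes\\.tsv\\.gz|features\\.tsv\\.gz|matrix\\.mtx\\.gz)$"
  files <- list.files(raw_dir, pattern = pattern)
  for (f in files) {
    gsm   <- sub(pattern, "\\1", f)   # e.g. "GSM9493320"
    short <- sub(pattern, "\\2", f)   # e.g. "barcodes.tsv.gz"
    dir.create(file.path(raw_dir, gsm), showWarnings = FALSE)
    file.rename(file.path(raw_dir, f), file.path(raw_dir, gsm, short))
  }
  invisible(files)
}

# Demo on empty placeholder files in a temporary folder
demo_dir <- file.path(tempdir(), "GSE_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("GSM9493320_HD1_barcodes.tsv.gz",
                "GSM9493320_HD1_features.tsv.gz",
                "GSM9493320_HD1_matrix.mtx.gz")
file.create(file.path(demo_dir, demo_files))

sort_10x_files(demo_dir)
list.files(demo_dir, recursive = TRUE)
```

The files stay gzipped throughout, so nothing is extracted.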
After making those folders, you have to read them into R with the Read10X() function. The video's example is
NML_1 <- Read10X(data.dir = "../Downloads/GSE132771_Raw/NPL1/") where the file name and location come from the demonstration, which used the GSE132771 series RAW.tar files.
After doing what the video tutorial did, with extraction, the code didn't work. When I left the files in their original tsv.gz format and only removed the prefix, e.g. renaming GSM9493320_HD1_barcodes.tsv.gz to barcodes.tsv.gz, the following line of code ran. This removes the tedious extraction step, which added extra work and time to preparing the files for our exploratory data analysis. The video is 3 to 4 years old as of this writing, so some of this behavior may have changed within the Seurat library since then.
I had to use the exact file location and make sure not to paste the directory as copied from Windows, which uses ‘\’ (backslash); in R it has to be ‘/’ (forward slash).
Let's save my directories as character strings to use when reading the samples and saving all the RDS objects made with Seurat.
Fill in the ellipses with your file directory location.
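If the path has already been copied with backslashes, it can be converted in R instead of retyped. A small sketch, using a made-up folder (the user name and Downloads location below are hypothetical, not my real directory):

```r
# A path as copied from Windows Explorer (hypothetical folder names).
# Inside an R string every backslash has to be doubled.
win_path <- "C:\\Users\\me\\Downloads\\GSE318371_RAW"

# Swap each backslash for a forward slash
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path

# file.path() joins pieces with forward slashes, so no slashes need typing
sample_dir <- file.path(r_path, "GSM9493320")
sample_dir
```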
directoryFolder <- "C:/.../GSE318371_RAW"
RDS_objects <- "C:/.../RDS_objects"
setwd(directoryFolder)
GSM9493320 <- Read10X("GSM9493320")
Then you can create an object with CreateSeuratObject(). The video's example is
NML_1 <- CreateSeuratObject(counts = NML_1, project = "NML_1", min.cells=3, min.features=200) and the equivalent for our first sample is:
sampleHD1 <- CreateSeuratObject(counts=GSM9493320, project="GSM9493320", min.cells=3, min.features=200)
In the count matrix, the barcodes (cells) are the columns and the genes are the rows; each entry is the count for that gene in that cell. The video says the barcodes are the rows, but that probably describes a different table, such as the object's meta.data, which has one row per barcode. Visually, the barcodes are shown as the ‘column names’ above each column of counts.
The barcodes.tsv.gz file holds the cells, and the features.tsv.gz file holds the genes.
Each gene is counted within each cell, and the counts vary from none to however many times the gene was detected. The next video explains reading the Seurat object created here.
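To make the orientation concrete, here is a tiny sparse matrix built by hand with the Matrix package, which ships with standard R installations. The gene names and barcodes are borrowed from the output in this post, but the counts themselves are made up for illustration.

```r
library(Matrix)

genes    <- c("ISG15", "NOC2L", "HES4")                       # rows = features
barcodes <- c("AAACCCACAGCATTGT-1", "AAACCCACATCCGAAT-1")     # columns = cells

# Made-up counts: entry (i, j) is how often gene i was seen in cell j
counts <- sparseMatrix(i = c(1, 1, 3), j = c(1, 2, 2), x = c(4, 1, 2),
                       dims = c(3, 2), dimnames = list(genes, barcodes))
counts

total_counts   <- colSums(counts)       # total counts per cell
genes_detected <- colSums(counts > 0)   # number of genes with a nonzero count per cell
total_counts
genes_detected
```

These two per-cell summaries correspond to the nCount_RNA and nFeature_RNA columns that appear in the Seurat metadata later in this post.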
class(sampleHD1)
## [1] "Seurat"
## attr(,"package")
## [1] "SeuratObject"
The column names are the barcodes and the row names are the genes. colnames() will display the barcodes.
colnames(sampleHD1[])[1:100]
## [1] "AAACCCACAGCATTGT-1" "AAACCCACATCCGAAT-1" "AAACCCAGTACTAGCT-1"
## [4] "AAACCCAGTCACTCAA-1" "AAACCCAGTGGTAACG-1" "AAACCCATCAATGCAC-1"
## [7] "AAACCCATCCCTCTAG-1" "AAACCCATCGTCGATA-1" "AAACGAAAGCCAGACA-1"
## [10] "AAACGAAAGGATACCG-1" "AAACGAAAGTAATACG-1" "AAACGAACACGACCTG-1"
## [13] "AAACGAACACTGTTCC-1" "AAACGAACAGTTACCA-1" "AAACGAACATTCAGCA-1"
## [16] "AAACGAAGTAACATCC-1" "AAACGAAGTCTCCCTA-1" "AAACGAAGTGCAGATG-1"
## [19] "AAACGAATCCTCGCAT-1" "AAACGCTAGACGCCCT-1" "AAACGCTAGCTTTGTG-1"
## [22] "AAACGCTAGTGAACAT-1" "AAACGCTGTAGTATAG-1" "AAACGCTGTTGAAGTA-1"
## [25] "AAACGCTTCAATGTCG-1" "AAACGCTTCTCAGAAC-1" "AAAGAACAGCGACATG-1"
## [28] "AAAGAACAGGCTGTAG-1" "AAAGAACAGGGATCTG-1" "AAAGAACCAACAGAGC-1"
## [31] "AAAGAACGTCGCCACA-1" "AAAGAACGTGGTAACG-1" "AAAGAACTCATGAGGG-1"
## [34] "AAAGAACTCATTCGTT-1" "AAAGGATAGCTCGCAC-1" "AAAGGATAGTCCGCGT-1"
## [37] "AAAGGATCAAGGGTCA-1" "AAAGGATCAGAACTTC-1" "AAAGGATGTGCCTAAT-1"
## [40] "AAAGGATGTGTCCCTT-1" "AAAGGATTCACCGACG-1" "AAAGGATTCACGGTCG-1"
## [43] "AAAGGGCAGCGAACTG-1" "AAAGGGCAGTAGCAAT-1" "AAAGGGCGTGTGTCCG-1"
## [46] "AAAGGTAAGAGAGCAA-1" "AAAGGTACAAAGTATG-1" "AAAGGTACAAGACGGT-1"
## [49] "AAAGGTACACAAGTTC-1" "AAAGGTACACTCCTTG-1" "AAAGGTAGTATAGGAT-1"
## [52] "AAAGGTAGTGTGCCTG-1" "AAAGGTATCACAAGAA-1" "AAAGGTATCTGCGTCT-1"
## [55] "AAAGGTATCTGTCTCG-1" "AAAGTCCAGCACACAG-1" "AAAGTCCAGCGGTAGT-1"
## [58] "AAAGTCCAGGATTCAA-1" "AAAGTCCAGTACAACA-1" "AAAGTCCCACCTGCGA-1"
## [61] "AAAGTCCGTTCTTGCC-1" "AAAGTCCTCACCTGTC-1" "AAAGTCCTCAGGTGTT-1"
## [64] "AAAGTCCTCCATAAGC-1" "AAAGTCCTCTATGTGG-1" "AAAGTGAAGAAATGGG-1"
## [67] "AAAGTGAAGCAGTCTT-1" "AAAGTGAAGCTGTCCG-1" "AAAGTGAAGGTAAGAG-1"
## [70] "AAAGTGAAGTGATAAC-1" "AAAGTGAAGTTATGGA-1" "AAAGTGACAAGCAATA-1"
## [73] "AAAGTGACAATAGTGA-1" "AAAGTGAGTATACCCA-1" "AAAGTGAGTATCGTTG-1"
## [76] "AAAGTGAGTCGGTAAG-1" "AAAGTGAGTGAATGTA-1" "AAAGTGAGTTTCACTT-1"
## [79] "AAAGTGATCAAATGCC-1" "AAATGGAAGGGCCCTT-1" "AAATGGAGTACAACGG-1"
## [82] "AAATGGAGTTCCGTTC-1" "AAATGGATCCGCATAA-1" "AAATGGATCGCCGATG-1"
## [85] "AACAAAGAGCACTTTG-1" "AACAAAGAGCGAGGAG-1" "AACAAAGAGGCTAAAT-1"
## [88] "AACAAAGCAACAGTGG-1" "AACAAAGCACTGCACG-1" "AACAAAGCAGAGTCAG-1"
## [91] "AACAAAGGTCGCGGTT-1" "AACAAAGGTGAGGAAA-1" "AACAAAGTCATGCCGG-1"
## [94] "AACAAAGTCCTGTTAT-1" "AACAAAGTCGTCAGAT-1" "AACAAAGTCTCCGAGG-1"
## [97] "AACAACCAGATGTTAG-1" "AACAACCCACCCAAGC-1" "AACAACCCATCATCCC-1"
## [100] "AACAACCCATGTGCTA-1"
There are many columns, but we limited the view to the first 100.
rownames() will display the gene names.
rownames(sampleHD1[])[1:100]
## [1] "ENSG00000290826" "ENSG00000238009" "ENSG00000241860" "ENSG00000286448"
## [5] "ENSG00000290385" "ENSG00000291215" "LINC01409" "ENSG00000290784"
## [9] "LINC00115" "LINC01128" "ENSG00000288531" "FAM41C"
## [13] "NOC2L" "KLHL17" "PLEKHN1" "ENSG00000272512"
## [17] "HES4" "ISG15" "ENSG00000224969" "AGRN"
## [21] "ENSG00000291156" "C1orf159" "ENSG00000285812" "LINC01342"
## [25] "TTLL10" "TNFRSF18" "TNFRSF4" "SDF4"
## [29] "B3GALT6" "C1QTNF12" "ENSG00000260179" "UBE2J2"
## [33] "LINC01786" "SCNN1D" "ACAP3" "PUSL1"
## [37] "INTS11" "CPTP" "TAS1R3" "DVL1"
## [41] "MXRA8" "AURKAIP1" "CCNL2" "MRPL20-AS1"
## [45] "MRPL20" "MRPL20-DT" "ANKRD65" "ATAD3C"
## [49] "ATAD3B" "ENSG00000290916" "ATAD3A" "TMEM240"
## [53] "SSU72" "ENSG00000215014" "FNDC10" "ENSG00000286989"
## [57] "ENSG00000272106" "MIB2" "MMP23B" "CDK11B"
## [61] "ENSG00000272004" "SLC35E2B" "CDK11A" "ENSG00000290854"
## [65] "NADK" "GNB1" "GNB1-DT" "CFAP74"
## [69] "PRKCZ" "ENSG00000271806" "PRKCZ-AS1" "FAAP20"
## [73] "ENSG00000234396" "SKI" "ENSG00000287356" "MORN1"
## [77] "ENSG00000272420" "RER1" "PEX10" "PLCH2"
## [81] "ENSG00000224387" "PANK4" "ENSG00000272449" "TNFRSF14-AS1"
## [85] "TNFRSF14" "ENSG00000228037" "ENSG00000289610" "PRXL2B"
## [89] "MMEL1" "TTC34" "PRDM16" "MEGF6"
## [93] "ENSG00000238260" "TPRG1L" "WRAP73" "TP73"
## [97] "CCDC27" "SMIM1" "LRRC47" "ENSG00000272153"
You can see there are a lot of gene names as the row names of the matrix object, some given as gene symbols and some as Ensembl IDs. We also limited the view to the first 100 rows.
setwd(RDS_objects)
saveRDS(sampleHD1,"sampleHD1")
Let's start importing and saving the other sample folders.
setwd(directoryFolder)
GSM9493321 <- Read10X("GSM9493321")
sampleHD2 <- CreateSeuratObject(counts=GSM9493321, project="GSM9493321", min.cells=3, min.features=200)
View(sampleHD2@meta.data)
setwd(RDS_objects)
saveRDS(sampleHD2, "sampleHD2")
setwd(directoryFolder)
GSM9493322 <- Read10X("GSM9493322")
sampleHD3 <- CreateSeuratObject(counts=GSM9493322, project="GSM9493322", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD3, "sampleHD3")
setwd(directoryFolder)
GSM9493323 <- Read10X("GSM9493323")
sampleHD4 <- CreateSeuratObject(counts=GSM9493323, project="GSM9493323", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD4, "sampleHD4")
setwd(directoryFolder)
GSM9493324 <- Read10X("GSM9493324")
sampleHD5 <- CreateSeuratObject(counts=GSM9493324, project="GSM9493324", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD5,"sampleHD5")
setwd(directoryFolder)
GSM9493325 <- Read10X("GSM9493325")
sampleHD6 <- CreateSeuratObject(counts=GSM9493325, project="GSM9493325", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD6, "sampleHD6")
setwd(directoryFolder)
GSM9493326 <- Read10X("GSM9493326")
sampleHD7 <- CreateSeuratObject(counts=GSM9493326, project="GSM9493326", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD7,"sampleHD7")
setwd(directoryFolder)
GSM9493327 <- Read10X("GSM9493327")
sampleHD8 <- CreateSeuratObject(counts=GSM9493327, project="GSM9493327", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD8, "sampleHD8")
setwd(directoryFolder)
GSM9493328 <- Read10X("GSM9493328")
sampleHD9 <- CreateSeuratObject(counts=GSM9493328, project="GSM9493328", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD9, "sampleHD9")
setwd(directoryFolder)
GSM9493329 <- Read10X("GSM9493329")
sampleHD10 <- CreateSeuratObject(counts=GSM9493329, project="GSM9493329", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD10, "sampleHD10")
setwd(directoryFolder)
GSM9493330 <- Read10X("GSM9493330")
sampleHD11 <- CreateSeuratObject(counts=GSM9493330, project="GSM9493330", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD11, "sampleHD11")
setwd(directoryFolder)
GSM9493331 <- Read10X("GSM9493331")
sampleHD12 <- CreateSeuratObject(counts=GSM9493331, project="GSM9493331", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(sampleHD12, "sampleHD12")
setwd(directoryFolder)
GSM9493332 <- Read10X("GSM9493332")
patient6 <- CreateSeuratObject(counts=GSM9493332, project="GSM9493332", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient6, "patient6")
setwd(directoryFolder)
GSM9493333 <- Read10X("GSM9493333")
patient9 <- CreateSeuratObject(counts=GSM9493333, project="GSM9493333", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient9, "patient9")
setwd(directoryFolder)
GSM9493334 <- Read10X("GSM9493334")
patient14 <- CreateSeuratObject(counts=GSM9493334, project="GSM9493334", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient14, "patient14")
setwd(directoryFolder)
GSM9493335 <- Read10X("GSM9493335")
patient16 <- CreateSeuratObject(counts=GSM9493335, project="GSM9493335", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient16, "patient16")
setwd(directoryFolder)
GSM9493336 <- Read10X("GSM9493336")
patient18 <- CreateSeuratObject(counts=GSM9493336, project="GSM9493336", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient18, "patient18")
setwd(directoryFolder)
GSM9493337 <- Read10X("GSM9493337")
patient19 <- CreateSeuratObject(counts=GSM9493337, project="GSM9493337", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient19, "patient19")
setwd(directoryFolder)
GSM9493338 <- Read10X("GSM9493338")
patient20 <- CreateSeuratObject(counts=GSM9493338, project="GSM9493338", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient20, "patient20")
setwd(directoryFolder)
GSM9493339 <- Read10X("GSM9493339")
patient22 <- CreateSeuratObject(counts=GSM9493339, project="GSM9493339", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient22, "patient22")
setwd(directoryFolder)
GSM9493340 <- Read10X("GSM9493340")
patient25 <- CreateSeuratObject(counts=GSM9493340, project="GSM9493340", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient25, "patient25")
setwd(directoryFolder)
GSM9493341 <- Read10X("GSM9493341")
patient27 <- CreateSeuratObject(counts=GSM9493341, project="GSM9493341", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient27, "patient27")
setwd(directoryFolder)
GSM9493342 <- Read10X("GSM9493342")
patient30 <- CreateSeuratObject(counts=GSM9493342, project="GSM9493342", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient30, "patient30")
setwd(directoryFolder)
GSM9493343 <- Read10X("GSM9493343")
patient31 <- CreateSeuratObject(counts=GSM9493343, project="GSM9493343", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient31, "patient31")
setwd(directoryFolder)
GSM9493344 <- Read10X("GSM9493344")
patient36 <- CreateSeuratObject(counts=GSM9493344, project="GSM9493344", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient36, "patient36")
setwd(directoryFolder)
GSM9493345 <- Read10X("GSM9493345")
patient37 <- CreateSeuratObject(counts=GSM9493345, project="GSM9493345", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient37, "patient37")
setwd(directoryFolder)
GSM9493346 <- Read10X("GSM9493346")
patient38 <- CreateSeuratObject(counts=GSM9493346, project="GSM9493346", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient38, "patient38")
setwd(directoryFolder)
GSM9493347 <- Read10X("GSM9493347")
patient39 <- CreateSeuratObject(counts=GSM9493347, project="GSM9493347", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient39, "patient39")
setwd(directoryFolder)
GSM9493348 <- Read10X("GSM9493348")
patient41 <- CreateSeuratObject(counts=GSM9493348, project="GSM9493348", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient41, "patient41")
setwd(directoryFolder)
GSM9493349 <- Read10X("GSM9493349")
patient44 <- CreateSeuratObject(counts=GSM9493349, project="GSM9493349", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient44, "patient44")
setwd(directoryFolder)
GSM9493350 <- Read10X("GSM9493350")
patient51 <- CreateSeuratObject(counts=GSM9493350, project="GSM9493350", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient51, "patient51")
setwd(directoryFolder)
GSM9493351 <- Read10X("GSM9493351")
patient52 <- CreateSeuratObject(counts=GSM9493351, project="GSM9493351", min.cells=3, min.features=200)
setwd(RDS_objects)
saveRDS(patient52, "patient52")
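The 32 blocks above all follow the same three steps, so they could be collapsed into one loop. The sketch below is a hedged illustration rather than the code I actually ran: process_samples, read_fun, and build_fun are names I made up, and the demo uses stand-in functions so that it runs without Seurat or the data. Using file.path() also removes the need to switch folders with setwd(), which is where a save can end up in the wrong place.

```r
# Hypothetical helper: read, build, and save every sample without setwd().
# read_fun and build_fun are passed in so the loop itself is easy to try out.
process_samples <- function(sample_names, in_dir, out_dir, read_fun, build_fun) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (gsm in names(sample_names)) {
    counts <- read_fun(file.path(in_dir, gsm))
    obj    <- build_fun(counts, gsm)
    saveRDS(obj, file.path(out_dir, sample_names[[gsm]]))
  }
  invisible(file.path(out_dir, unname(sample_names)))
}

# GSM folder name -> object name, as used above (first two shown)
sample_names <- c(GSM9493320 = "sampleHD1", GSM9493321 = "sampleHD2")

# Dry run with stand-in functions so the sketch runs without Seurat or the data
out_dir <- file.path(tempdir(), "RDS_demo")
saved <- process_samples(
  sample_names, in_dir = tempdir(), out_dir = out_dir,
  read_fun  = function(path) basename(path),
  build_fun = function(counts, gsm) list(project = gsm, counts = counts)
)
list.files(out_dir)

# With Seurat loaded, the real call would pass:
#   read_fun  = Seurat::Read10X
#   build_fun = function(counts, gsm) Seurat::CreateSeuratObject(
#     counts = counts, project = gsm, min.cells = 3, min.features = 200)
```

Because each object is saved and then overwritten on the next pass, a loop like this would also keep far fewer large objects in the environment at once.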
We loaded and saved all the sampleHD and patient objects. There is another tutorial on merging these RDS files.
We still have these items in our environment, and they take up quite a bit of space. Let's delete these objects after verifying we have all the files.
ls()
## [1] "directoryFolder" "GSM9493320" "GSM9493321" "GSM9493322"
## [5] "GSM9493323" "GSM9493324" "GSM9493325" "GSM9493326"
## [9] "GSM9493327" "GSM9493328" "GSM9493329" "GSM9493330"
## [13] "GSM9493331" "GSM9493332" "GSM9493333" "GSM9493334"
## [17] "GSM9493335" "GSM9493336" "GSM9493337" "GSM9493338"
## [21] "GSM9493339" "GSM9493340" "GSM9493341" "GSM9493342"
## [25] "GSM9493343" "GSM9493344" "GSM9493345" "GSM9493346"
## [29] "GSM9493347" "GSM9493348" "GSM9493349" "GSM9493350"
## [33] "GSM9493351" "patient14" "patient16" "patient18"
## [37] "patient19" "patient20" "patient22" "patient25"
## [41] "patient27" "patient30" "patient31" "patient36"
## [45] "patient37" "patient38" "patient39" "patient41"
## [49] "patient44" "patient51" "patient52" "patient6"
## [53] "patient9" "RDS_objects" "sampleHD1" "sampleHD10"
## [57] "sampleHD11" "sampleHD12" "sampleHD2" "sampleHD3"
## [61] "sampleHD4" "sampleHD5" "sampleHD6" "sampleHD7"
## [65] "sampleHD8" "sampleHD9" "seriesInfo1" "seriesInfoDesign"
Let's remove the GSM samples. Every raw count matrix has a name that starts with GSM, so one pattern covers all 32.
rm(list = ls(pattern = "^GSM"))
====================================================
Then you can open the folder and find each Seurat object saved as an RDS file. This follows video tutorial 3.
Go to the folder where the RDS objects were stored earlier and use readRDS(). The objects above were saved without a file extension, so use the file name exactly as it was saved.
sampleHD1 <- readRDS(file.path(RDS_objects, "sampleHD1"))
Merge after reading in the other objects, such as sampleHD2 and sampleHD3: mergedSamples <- merge(sampleHD1, y=c(sampleHD2, sampleHD3), add.cell.ids = c("HD1", "HD2", "HD3"), project="mergedSamples")
ls()
This merge combines the cells from all the objects. Because the same barcode is not allowed to appear twice, it prepends the sample ID given in add.cell.ids to each barcode. The metadata column names stay the same, while the prepended ID changes depending on whether the cell came from HD1, HD2, or HD3. This demo is adapted from the tutorial data to this data, and I have not yet tested the merge to see if it works.
Then save with saveRDS(mergedSamples, file="C:/Users/jlcor/OneDrive/Desktop/EBV and nonHodgkin aggressive lymphoma NK tcell type/GSE318371_RAW/mergedSamples.RDS")
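The effect of the prepended ID can be seen without Seurat. In this sketch the barcodes are plain character vectors, and the overlap between the two samples is made up so that there is a collision to resolve. As far as I know, Seurat joins the ID and the barcode with an underscore, which is what paste() imitates here.

```r
# Two samples that happen to share a barcode (hypothetical overlap)
hd1_barcodes <- c("AAACCCACAGCATTGT-1", "AAACCCACATCCGAAT-1")
hd2_barcodes <- c("AAACCCACAGCATTGT-1", "AAACCCAAGCACTCTA-1")

# Stacked as-is, the shared barcode appears twice
clash_before <- anyDuplicated(c(hd1_barcodes, hd2_barcodes)) > 0
clash_before

# Prepending the sample ID makes every cell name unique
merged_names <- c(paste("HD1", hd1_barcodes, sep = "_"),
                  paste("HD2", hd2_barcodes, sep = "_"))
merged_names

clash_after <- anyDuplicated(merged_names) > 0
clash_after
```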
===================================================================
healthy1DF <- data.frame(sampleHD1@meta.data)
#colnames(healthy1DF) <- c("h1",'h1_counts','h1_features')
healthy1DF$sample <- 'healthy1'
healthy1DF$barcode <- row.names(healthy1DF)
head(healthy1DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCACAGCATTGT-1 GSM9493320 5606 2534 healthy1
## AAACCCACATCCGAAT-1 GSM9493320 6841 2964 healthy1
## AAACCCAGTACTAGCT-1 GSM9493320 6104 2427 healthy1
## AAACCCAGTCACTCAA-1 GSM9493320 7370 2839 healthy1
## AAACCCAGTGGTAACG-1 GSM9493320 5848 2324 healthy1
## AAACCCATCAATGCAC-1 GSM9493320 14448 4073 healthy1
## barcode
## AAACCCACAGCATTGT-1 AAACCCACAGCATTGT-1
## AAACCCACATCCGAAT-1 AAACCCACATCCGAAT-1
## AAACCCAGTACTAGCT-1 AAACCCAGTACTAGCT-1
## AAACCCAGTCACTCAA-1 AAACCCAGTCACTCAA-1
## AAACCCAGTGGTAACG-1 AAACCCAGTGGTAACG-1
## AAACCCATCAATGCAC-1 AAACCCATCAATGCAC-1
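The same three lines are repeated for every sample, so they could be wrapped in a small helper. This is a sketch on a hand-typed stand-in for the metadata, using the first two rows printed above; meta_to_df and toy_meta are hypothetical names, and with the real objects the first argument would be something like sampleHD1@meta.data.

```r
# Hypothetical helper: copy the metadata, tag it with a sample label,
# and keep the barcode as an ordinary column
meta_to_df <- function(meta, sample_label) {
  df <- data.frame(meta)
  df$sample  <- sample_label
  df$barcode <- row.names(df)
  df
}

# Stand-in for sampleHD1@meta.data, typed in from the first two rows above
toy_meta <- data.frame(
  orig.ident   = c("GSM9493320", "GSM9493320"),
  nCount_RNA   = c(5606, 6841),
  nFeature_RNA = c(2534, 2964),
  row.names    = c("AAACCCACAGCATTGT-1", "AAACCCACATCCGAAT-1")
)

toy_df <- meta_to_df(toy_meta, "healthy1")
toy_df
```

Since every data frame built this way has the same columns, the per-sample tables can later be stacked with rbind().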
healthy2DF <- data.frame(sampleHD2@meta.data)
#colnames(healthy2DF) <- c("h2",'h2_counts','h2_features')
healthy2DF$sample <- 'healthy2'
healthy2DF$barcode <- row.names(healthy2DF)
head(healthy2DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGCACTCTA-1 GSM9493321 11161 3536 healthy2
## AAACCCAAGGCTGTAG-1 GSM9493321 2558 1505 healthy2
## AAACCCAAGTAGACCG-1 GSM9493321 6763 2873 healthy2
## AAACCCAAGTCAACAA-1 GSM9493321 5227 2434 healthy2
## AAACCCACAACTCATG-1 GSM9493321 2280 1162 healthy2
## AAACCCACAAGTGCAG-1 GSM9493321 3879 2064 healthy2
## barcode
## AAACCCAAGCACTCTA-1 AAACCCAAGCACTCTA-1
## AAACCCAAGGCTGTAG-1 AAACCCAAGGCTGTAG-1
## AAACCCAAGTAGACCG-1 AAACCCAAGTAGACCG-1
## AAACCCAAGTCAACAA-1 AAACCCAAGTCAACAA-1
## AAACCCACAACTCATG-1 AAACCCACAACTCATG-1
## AAACCCACAAGTGCAG-1 AAACCCACAAGTGCAG-1
healthy3DF <- data.frame(sampleHD3@meta.data)
#colnames(healthy3DF) <- c("h3",'h3_counts','h3_features')
healthy3DF$sample <- 'healthy3'
healthy3DF$barcode <- row.names(healthy3DF)
head(healthy3DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCACAGGCAATG-1 GSM9493322 5265 2348 healthy3
## AAACCCACATGTTCGA-1 GSM9493322 5948 2482 healthy3
## AAACCCAGTAAGATTG-1 GSM9493322 6445 2647 healthy3
## AAACCCAGTACGACAG-1 GSM9493322 3437 1794 healthy3
## AAACCCAGTATTCCTT-1 GSM9493322 7894 3099 healthy3
## AAACCCAGTCATTGCA-1 GSM9493322 9716 3322 healthy3
## barcode
## AAACCCACAGGCAATG-1 AAACCCACAGGCAATG-1
## AAACCCACATGTTCGA-1 AAACCCACATGTTCGA-1
## AAACCCAGTAAGATTG-1 AAACCCAGTAAGATTG-1
## AAACCCAGTACGACAG-1 AAACCCAGTACGACAG-1
## AAACCCAGTATTCCTT-1 AAACCCAGTATTCCTT-1
## AAACCCAGTCATTGCA-1 AAACCCAGTCATTGCA-1
healthy4DF <- data.frame(sampleHD4@meta.data)
#colnames(healthy4DF) <- c("h4",'h4_counts','h4_features')
healthy4DF$sample <- 'healthy4'
healthy4DF$barcode <- row.names(healthy4DF)
head(healthy4DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAAGCCTG-1 GSM9493323 2764 1591 healthy4
## AAACCCAAGCCATTGT-1 GSM9493323 11077 3577 healthy4
## AAACCCAAGCGTGTCC-1 GSM9493323 3445 1425 healthy4
## AAACCCAAGGCTTAAA-1 GSM9493323 2005 1079 healthy4
## AAACCCACACCGGTCA-1 GSM9493323 4722 2252 healthy4
## AAACCCACAGCGATTT-1 GSM9493323 5075 2424 healthy4
## barcode
## AAACCCAAGAAGCCTG-1 AAACCCAAGAAGCCTG-1
## AAACCCAAGCCATTGT-1 AAACCCAAGCCATTGT-1
## AAACCCAAGCGTGTCC-1 AAACCCAAGCGTGTCC-1
## AAACCCAAGGCTTAAA-1 AAACCCAAGGCTTAAA-1
## AAACCCACACCGGTCA-1 AAACCCACACCGGTCA-1
## AAACCCACAGCGATTT-1 AAACCCACAGCGATTT-1
healthy5DF <- data.frame(sampleHD5@meta.data)
#colnames(healthy5DF) <- c("h5",'h5_counts','h5_features')
healthy5DF$sample <- 'healthy5'
healthy5DF$barcode <- row.names(healthy5DF)
head(healthy5DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAGAGCAA-1 GSM9493324 15813 4355 healthy5
## AAACCCAAGAGGCGTT-1 GSM9493324 10103 3243 healthy5
## AAACCCAAGGACTTCT-1 GSM9493324 6251 2738 healthy5
## AAACCCAAGGTTCACT-1 GSM9493324 10841 3342 healthy5
## AAACCCAAGTCGCTAT-1 GSM9493324 776 334 healthy5
## AAACCCACAACCACGC-1 GSM9493324 8280 2928 healthy5
## barcode
## AAACCCAAGAGAGCAA-1 AAACCCAAGAGAGCAA-1
## AAACCCAAGAGGCGTT-1 AAACCCAAGAGGCGTT-1
## AAACCCAAGGACTTCT-1 AAACCCAAGGACTTCT-1
## AAACCCAAGGTTCACT-1 AAACCCAAGGTTCACT-1
## AAACCCAAGTCGCTAT-1 AAACCCAAGTCGCTAT-1
## AAACCCACAACCACGC-1 AAACCCACAACCACGC-1
healthy6DF <- data.frame(sampleHD6@meta.data)
#colnames(healthy6DF) <- c("h6",'h6_counts','h6_features')
healthy6DF$sample <- 'healthy6'
healthy6DF$barcode <- row.names(healthy6DF)
head(healthy6DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGGGCTGAT-1 GSM9493325 8614 2985 healthy6
## AAACCCACAACCGGAA-1 GSM9493325 6718 2856 healthy6
## AAACCCACAAGGTCAG-1 GSM9493325 5637 2612 healthy6
## AAACCCACACTCTAGA-1 GSM9493325 6069 2732 healthy6
## AAACCCACACTCTGCT-1 GSM9493325 5589 2454 healthy6
## AAACCCACAGAGGCAT-1 GSM9493325 4674 2031 healthy6
## barcode
## AAACCCAAGGGCTGAT-1 AAACCCAAGGGCTGAT-1
## AAACCCACAACCGGAA-1 AAACCCACAACCGGAA-1
## AAACCCACAAGGTCAG-1 AAACCCACAAGGTCAG-1
## AAACCCACACTCTAGA-1 AAACCCACACTCTAGA-1
## AAACCCACACTCTGCT-1 AAACCCACACTCTGCT-1
## AAACCCACAGAGGCAT-1 AAACCCACAGAGGCAT-1
healthy7DF <- data.frame(sampleHD7@meta.data)
#colnames(healthy7DF) <- c("h7",'h7_counts','h7_features')
healthy7DF$sample <- 'healthy7'
healthy7DF$barcode <- row.names(healthy7DF)
head(healthy7DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGACGGAAA-1 GSM9493326 412 232 healthy7
## AAACCCAAGGTGCTAG-1 GSM9493326 632 365 healthy7
## AAACCCAAGTTGTACC-1 GSM9493326 4339 1586 healthy7
## AAACCCACACCCATAA-1 GSM9493326 3536 1416 healthy7
## AAACCCACAGAGTGAC-1 GSM9493326 1929 943 healthy7
## AAACCCACAGTATTCG-1 GSM9493326 8275 2967 healthy7
## barcode
## AAACCCAAGACGGAAA-1 AAACCCAAGACGGAAA-1
## AAACCCAAGGTGCTAG-1 AAACCCAAGGTGCTAG-1
## AAACCCAAGTTGTACC-1 AAACCCAAGTTGTACC-1
## AAACCCACACCCATAA-1 AAACCCACACCCATAA-1
## AAACCCACAGAGTGAC-1 AAACCCACAGAGTGAC-1
## AAACCCACAGTATTCG-1 AAACCCACAGTATTCG-1
healthy8DF <- data.frame(sampleHD8@meta.data)
#colnames(healthy8DF) <- c("h8",'h8_counts','h8_features')
healthy8DF$sample <- 'healthy8'
healthy8DF$barcode <- row.names(healthy8DF)
head(healthy8DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAAACCCG-1 GSM9493327 4005 1427 healthy8
## AAACCCAAGCAGGCTA-1 GSM9493327 6821 2553 healthy8
## AAACCCAAGCCTAGGA-1 GSM9493327 4187 1670 healthy8
## AAACCCAAGGTTCATC-1 GSM9493327 10180 3340 healthy8
## AAACCCACAAATACGA-1 GSM9493327 8079 2390 healthy8
## AAACCCACACACACTA-1 GSM9493327 2072 967 healthy8
## barcode
## AAACCCAAGAAACCCG-1 AAACCCAAGAAACCCG-1
## AAACCCAAGCAGGCTA-1 AAACCCAAGCAGGCTA-1
## AAACCCAAGCCTAGGA-1 AAACCCAAGCCTAGGA-1
## AAACCCAAGGTTCATC-1 AAACCCAAGGTTCATC-1
## AAACCCACAAATACGA-1 AAACCCACAAATACGA-1
## AAACCCACACACACTA-1 AAACCCACACACACTA-1
healthy9DF <- data.frame(sampleHD9@meta.data)
#colnames(healthy9DF) <- c("h9",'h9_counts','h9_features')
healthy9DF$sample <- 'healthy9'
healthy9DF$barcode <- row.names(healthy9DF)
head(healthy9DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGCGGTAAC-1 GSM9493328 4587 2005 healthy9
## AAACCCAAGGCAGTCA-1 GSM9493328 14668 4058 healthy9
## AAACCCAAGGGATCAC-1 GSM9493328 3762 1725 healthy9
## AAACCCAAGGGTACGT-1 GSM9493328 5330 1755 healthy9
## AAACCCAAGGTAGTCA-1 GSM9493328 237 208 healthy9
## AAACCCAAGTCATCGT-1 GSM9493328 3140 878 healthy9
## barcode
## AAACCCAAGCGGTAAC-1 AAACCCAAGCGGTAAC-1
## AAACCCAAGGCAGTCA-1 AAACCCAAGGCAGTCA-1
## AAACCCAAGGGATCAC-1 AAACCCAAGGGATCAC-1
## AAACCCAAGGGTACGT-1 AAACCCAAGGGTACGT-1
## AAACCCAAGGTAGTCA-1 AAACCCAAGGTAGTCA-1
## AAACCCAAGTCATCGT-1 AAACCCAAGTCATCGT-1
healthy10DF <- data.frame(sampleHD10@meta.data)
#colnames(healthy10DF) <- c("h10",'h10_counts','h10_features')
healthy10DF$sample <- 'healthy10'
healthy10DF$barcode <- row.names(healthy10DF)
head(healthy10DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGCACGGAT-1 GSM9493329 13236 3701 healthy10
## AAACCCAAGCGTCAAG-1 GSM9493329 2654 1114 healthy10
## AAACCCAAGGAACGAA-1 GSM9493329 5531 1883 healthy10
## AAACCCAAGGCTCACC-1 GSM9493329 4045 1577 healthy10
## AAACCCAAGGTCATCT-1 GSM9493329 7792 2965 healthy10
## AAACCCAAGTCACGAG-1 GSM9493329 2913 1264 healthy10
## barcode
## AAACCCAAGCACGGAT-1 AAACCCAAGCACGGAT-1
## AAACCCAAGCGTCAAG-1 AAACCCAAGCGTCAAG-1
## AAACCCAAGGAACGAA-1 AAACCCAAGGAACGAA-1
## AAACCCAAGGCTCACC-1 AAACCCAAGGCTCACC-1
## AAACCCAAGGTCATCT-1 AAACCCAAGGTCATCT-1
## AAACCCAAGTCACGAG-1 AAACCCAAGTCACGAG-1
healthy11DF <- data.frame(sampleHD11@meta.data)
#colnames(healthy11DF) <- c("h11",'h11_counts','h11_features')
healthy11DF$sample <- 'healthy11'
healthy11DF$barcode <- row.names(healthy11DF)
head(healthy11DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGTATGCAA-1 GSM9493330 13816 4026 healthy11
## AAACCCAAGTATGTAG-1 GSM9493330 2565 878 healthy11
## AAACCCACAAAGTATG-1 GSM9493330 4423 1640 healthy11
## AAACCCACAAGACGAC-1 GSM9493330 4726 1664 healthy11
## AAACCCACAAGTGTCT-1 GSM9493330 3991 1435 healthy11
## AAACCCACAGAGTCAG-1 GSM9493330 5142 2448 healthy11
## barcode
## AAACCCAAGTATGCAA-1 AAACCCAAGTATGCAA-1
## AAACCCAAGTATGTAG-1 AAACCCAAGTATGTAG-1
## AAACCCACAAAGTATG-1 AAACCCACAAAGTATG-1
## AAACCCACAAGACGAC-1 AAACCCACAAGACGAC-1
## AAACCCACAAGTGTCT-1 AAACCCACAAGTGTCT-1
## AAACCCACAGAGTCAG-1 AAACCCACAGAGTCAG-1
healthy12DF <- data.frame(sampleHD12@meta.data)
#colnames(healthy12DF) <- c("h12",'h12_counts','h12_features')
healthy12DF$sample <- 'healthy12'
healthy12DF$barcode <- row.names(healthy12DF)
head(healthy12DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGGCATCGA-1 GSM9493331 7584 2581 healthy12
## AAACCCAAGGTCCGAA-1 GSM9493331 3517 1300 healthy12
## AAACCCAAGGTTATAG-1 GSM9493331 11275 3928 healthy12
## AAACCCAAGTATGATG-1 GSM9493331 5638 2474 healthy12
## AAACCCAAGTATTGCC-1 GSM9493331 816 504 healthy12
## AAACCCAAGTGGAATT-1 GSM9493331 3340 1142 healthy12
## barcode
## AAACCCAAGGCATCGA-1 AAACCCAAGGCATCGA-1
## AAACCCAAGGTCCGAA-1 AAACCCAAGGTCCGAA-1
## AAACCCAAGGTTATAG-1 AAACCCAAGGTTATAG-1
## AAACCCAAGTATGATG-1 AAACCCAAGTATGATG-1
## AAACCCAAGTATTGCC-1 AAACCCAAGTATTGCC-1
## AAACCCAAGTGGAATT-1 AAACCCAAGTGGAATT-1
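The twelve blocks above repeat the same three steps, so they could also be written once as a helper and applied to a list. This is only a sketch: tagSample is a hypothetical name, and the toy tables (with made-up GSM_A and GSM_B labels) stand in for sampleHD1@meta.data and sampleHD2@meta.data.

```r
# Hypothetical helper: copy a metadata table, tag it with a sample label,
# and keep the barcode (row name) as a regular column
tagSample <- function(meta, label) {
  df <- data.frame(meta)
  df$sample <- label
  df$barcode <- row.names(df)
  df
}

# Toy metadata standing in for sampleHD1@meta.data and sampleHD2@meta.data
toyMeta1 <- data.frame(orig.ident = "GSM_A", nCount_RNA = c(5606, 6841),
                       nFeature_RNA = c(2534, 2964),
                       row.names = c("AAACCCACAGCATTGT-1", "AAACCCACATCCGAAT-1"))
toyMeta2 <- data.frame(orig.ident = "GSM_B", nCount_RNA = c(11161, 2558),
                       nFeature_RNA = c(3536, 1505),
                       row.names = c("AAACCCAAGCACTCTA-1", "AAACCCAAGGCTGTAG-1"))

metaList <- list(healthy1 = toyMeta1, healthy2 = toyMeta2)

# Apply the helper to every table, pairing each one with its list name
taggedList <- Map(tagSample, metaList, names(metaList))
head(taggedList$healthy1)
```

With the real objects, the list would hold the meta.data slot of each Seurat object, and do.call(rbind, taggedList) would stack the results in one step.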
Let's merge these data frames on barcode across our 12 healthy samples to see which barcodes are shared between samples.
H1H2 <- merge(healthy1DF, healthy2DF, by.x="barcode",by.y="barcode")
H1H2$barcode
## [1] "ACATCCCTCCCTCTAG-1" "ACTGTGACAGACCGCT-1" "ACTTCGCTCGTTTACT-1"
## [4] "ATAGACCGTTGTCAGT-1" "ATCACAGGTACGGCAA-1" "ATCAGGTGTTGTGCCG-1"
## [7] "ATGAGGGGTTCGGTAT-1" "CAATACGCAAGGCCTC-1" "CAGCAGCTCTTCCGTG-1"
## [10] "CGAAGGAAGGATGGCT-1" "CTCAATTGTGCGTTTA-1" "CTGCGAGTCGATTCCC-1"
## [13] "GACGTTACATGGCACC-1" "GAGTCTAGTACGCTAT-1" "GATTCGAAGTAGGATT-1"
## [16] "GCCAGTGCATTACGGT-1" "GGAATGGGTTACAGCT-1" "GGCTTTCAGTCGCCAC-1"
## [19] "GGGCCATCAATACAGA-1" "GGTGAAGAGTTGTCGT-1" "GGTGTTACACCGTGGT-1"
## [22] "GTAGCTAAGGTACTGG-1" "GTCGAATGTATGTGTC-1" "GTGATGTGTTCGGCCA-1"
## [25] "GTGGGAAGTTTGGAAA-1" "GTTACGATCGTTACCC-1" "GTTGTAGCACAACGTT-1"
## [28] "TAGACTGGTACAGTAA-1" "TAGTGCACATTGCCGG-1" "TATCAGGCAAATACGA-1"
## [31] "TCACTATCACTCCTGT-1" "TCGACGGAGCGTGTCC-1" "TCGCAGGTCCCAAGTA-1"
## [34] "TCGGTCTTCTTACTGT-1" "TGAATGCGTGTGTGTT-1" "TGGAACTAGTAGGATT-1"
## [37] "TGTAACGGTGAGGAAA-1" "TTCATGTTCTTAAGGC-1" "TTCCTCTCACTGCTTC-1"
## [40] "TTGTTTGCATGAGTAA-1"
H1H2H3 <- merge(H1H2, healthy3DF, by.x="barcode",by.y="barcode")
H1H2H3$barcode
## character(0)
Early in the merge, no barcode is common to all of the first 3 healthy samples, so no barcode can be common to all 12.
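The same question can be asked without chaining merges. The sketch below uses toy barcode vectors in place of healthy1DF$barcode, healthy2DF$barcode and so on. One caution, which is my assumption about how these barcodes are assigned: the same barcode string showing up in two samples need not mean the same cell.

```r
# Toy barcode vectors standing in for healthy1DF$barcode, healthy2DF$barcode, ...
barcodeList <- list(
  healthy1 = c("ACATCCCTCCCTCTAG-1", "ACTGTGACAGACCGCT-1", "AAACCCACAGCATTGT-1"),
  healthy2 = c("ACATCCCTCCCTCTAG-1", "ACTGTGACAGACCGCT-1", "AAACCCAAGCACTCTA-1"),
  healthy3 = c("AAACCCACAGGCAATG-1", "AAACCCACATGTTCGA-1")
)

# Barcodes present in every sample (successive intersections)
commonAll <- Reduce(intersect, barcodeList)

# Number of samples each barcode appears in (barcodes are unique within a sample here)
sampleCounts <- table(unlist(barcodeList))
sharedByTwoOrMore <- names(sampleCounts[sampleCounts >= 2])

commonAll
sharedByTwoOrMore
```

Reduce(intersect, ...) returns an empty vector as soon as one sample shares nothing with the rest, which matches the character(0) result above.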
Instead we will stack the healthy data frames in long format with rbind(), which keeps each row's sample label and barcode.
combinedHealthy12 <- rbind(healthy1DF,healthy2DF,healthy3DF,healthy4DF,healthy5DF,healthy6DF,healthy7DF,healthy8DF,healthy9DF,healthy10DF,healthy11DF,healthy12DF)
dim(combinedHealthy12)
## [1] 142327 5
There are 142,327 barcodes across all healthy samples. Select every 9,000th row, plus the last row, to spot-check the results for these samples.
combinedHealthy12[c(9000,18000,27000,36000,45000,54000,63000,72000,81000,90000,99000,108000,117000,126000,135000,142327),]
## orig.ident nCount_RNA nFeature_RNA sample
## TTGTTCAGTCGCAACC-1 GSM9493320 5562 2695 healthy1
## GTCACTCGTTGGCCGT-1 GSM9493321 4542 1995 healthy2
## CTGCGAGCATAAGCGG-1 GSM9493322 11196 4006 healthy3
## CATCGGGAGCTGTTAC-1 GSM9493323 1622 916 healthy4
## AACAACCCATGTTCGA-1 GSM9493324 10639 3511 healthy5
## AATGGCTAGTGGTCAG-1 GSM9493325 6285 2360 healthy6
## TACATTCTCTGCCTCA-1 GSM9493325 7475 2820 healthy6
## CGTTAGACAATTGTGC-1 GSM9493326 4632 2012 healthy7
## ACGATCACAACTGGTT-1 GSM9493327 6117 2591 healthy8
## TCCTCTTTCCTGGCTT-1 GSM9493327 2657 1315 healthy8
## GGGACTCGTTCGTTCC-1 GSM9493328 4601 1243 healthy9
## CGCCATTCAAATGGCG-1 GSM9493329 3700 1202 healthy10
## AGGGTTTCATCACCAA-11 GSM9493330 8206 2748 healthy11
## TTCCAATAGCGTTCCG-1 GSM9493330 2968 1148 healthy11
## GATGATCAGAGAGTGA-1 GSM9493331 2563 858 healthy12
## TTTGTTGTCTACTTCA-1 GSM9493331 6889 2390 healthy12
## barcode
## TTGTTCAGTCGCAACC-1 TTGTTCAGTCGCAACC-1
## GTCACTCGTTGGCCGT-1 GTCACTCGTTGGCCGT-1
## CTGCGAGCATAAGCGG-1 CTGCGAGCATAAGCGG-1
## CATCGGGAGCTGTTAC-1 CATCGGGAGCTGTTAC-1
## AACAACCCATGTTCGA-1 AACAACCCATGTTCGA-1
## AATGGCTAGTGGTCAG-1 AATGGCTAGTGGTCAG-1
## TACATTCTCTGCCTCA-1 TACATTCTCTGCCTCA-1
## CGTTAGACAATTGTGC-1 CGTTAGACAATTGTGC-1
## ACGATCACAACTGGTT-1 ACGATCACAACTGGTT-1
## TCCTCTTTCCTGGCTT-1 TCCTCTTTCCTGGCTT-1
## GGGACTCGTTCGTTCC-1 GGGACTCGTTCGTTCC-1
## CGCCATTCAAATGGCG-1 CGCCATTCAAATGGCG-1
## AGGGTTTCATCACCAA-11 AGGGTTTCATCACCAA-1
## TTCCAATAGCGTTCCG-1 TTCCAATAGCGTTCCG-1
## GATGATCAGAGAGTGA-1 GATGATCAGAGAGTGA-1
## TTTGTTGTCTACTTCA-1 TTTGTTGTCTACTTCA-1
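The row positions typed out above can also be generated. A minimal sketch, using a 100-row toy data frame in place of combinedHealthy12 and a step of 9 in place of 9,000; it also shows a truly random draw for comparison.

```r
# Toy data frame standing in for combinedHealthy12
toyDF <- data.frame(id = 1:100, value = (1:100) * 2)

# Every 9th row plus the last row (the post uses every 9,000th of 142,327)
step <- 9
everyNth <- unique(c(seq(step, nrow(toyDF), by = step), nrow(toyDF)))
toyDF[everyNth, ]

# A random selection of the same size, reproducible through the seed
set.seed(1)
randomRows <- sort(sample(nrow(toyDF), length(everyNth)))
toyDF[randomRows, ]
```

The evenly spaced version is handy for checking that every sample made it into the stack, while the random version avoids always landing on the same rows.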
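Each sample shows its own GSM ID in the ‘orig.ident’ column, as it should. We added the healthy sample label as well as the barcode taken from the row names. One row name reads AGGGTTTCATCACCAA-11 while its barcode column reads AGGGTTTCATCACCAA-1: rbind() appends a digit to a repeated row name to keep row names unique, so the barcode column is the one to trust. Let's write this file out to csv as the merged barcodes of the 12 healthy samples.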
write.csv(combinedHealthy12,'combinedHealthy12.csv',row.names=FALSE)
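One detail in the stacked output is worth a small demonstration: when two samples share a barcode, rbind() changes the second row name so that row names stay unique. The sketch below uses two toy samples (values picked for illustration) to show why saving the barcode in its own column first matters.

```r
# Two toy samples that happen to share one barcode
sampleA <- data.frame(sample = "healthy1", nCount_RNA = c(5562, 8206),
                      row.names = c("TTGTTCAGTCGCAACC-1", "AGGGTTTCATCACCAA-1"))
sampleB <- data.frame(sample = "healthy11", nCount_RNA = c(2968, 8206),
                      row.names = c("TTCCAATAGCGTTCCG-1", "AGGGTTTCATCACCAA-1"))

# Save the barcode before binding, as done in the post
sampleA$barcode <- row.names(sampleA)
sampleB$barcode <- row.names(sampleB)

bound <- rbind(sampleA, sampleB)

# The repeated row name is altered; the barcode column is left as it was
row.names(bound)
bound$barcode

# Barcodes that occur in more than one sample
bound$barcode[duplicated(bound$barcode)]
```

Since write.csv() is called with row.names=FALSE, only the untouched barcode column ends up in the csv.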
We can do the same for the 17 patient samples and then combine the two datasets. First, let's clean out our data environment.
rm(sampleHD1,sampleHD2,sampleHD3,sampleHD4,sampleHD5,sampleHD6,sampleHD7,sampleHD8,sampleHD9,sampleHD10,sampleHD11,sampleHD12,H1H2,H1H2H3,healthy10DF,healthy11DF,healthy12DF,healthy1DF,healthy2DF,healthy3DF,healthy4DF,healthy5DF,healthy6DF,healthy7DF,healthy8DF,healthy9DF)
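A long rm() call like this can also be built from a name pattern. The sketch below runs inside a throwaway environment (demoEnv, a name made up for this example) so it does not touch the real workspace; the regular expression is an assumption about the naming scheme used in this post.

```r
# Work inside a throwaway environment so the sketch leaves the real workspace alone
demoEnv <- new.env()
assign("healthy1DF", data.frame(x = 1), envir = demoEnv)
assign("healthy2DF", data.frame(x = 2), envir = demoEnv)
assign("patient6", "kept", envir = demoEnv)

# Collect every object whose name looks like healthy<number>DF
toDrop <- ls(envir = demoEnv, pattern = "^healthy[0-9]+DF$")
rm(list = toDrop, envir = demoEnv)

# Only the patient object remains
ls(envir = demoEnv)
```

In the global environment the envir argument can be left out, but it is worth printing toDrop first to confirm that the pattern only matches objects you mean to delete.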
Now only our patient objects are left. We make each one into a data frame and add the sample and barcode columns before row binding them and writing the result out to csv.
patient6DF <- data.frame(patient6@meta.data)
patient6DF$sample <- 'patient6'
patient6DF$barcode <- row.names(patient6DF)
head(patient6DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCTGAGGCTCTTA-1 GSM9493332 10238 2862 patient6
## AAACCTGAGTGCCAGA-1 GSM9493332 6831 2412 patient6
## AAACCTGCAAGTTAAG-1 GSM9493332 5831 2364 patient6
## AAACCTGCAATCACAC-1 GSM9493332 8311 2561 patient6
## AAACCTGCACGGCCAT-1 GSM9493332 6643 2360 patient6
## AAACCTGCATGCATGT-1 GSM9493332 4568 1688 patient6
## barcode
## AAACCTGAGGCTCTTA-1 AAACCTGAGGCTCTTA-1
## AAACCTGAGTGCCAGA-1 AAACCTGAGTGCCAGA-1
## AAACCTGCAAGTTAAG-1 AAACCTGCAAGTTAAG-1
## AAACCTGCAATCACAC-1 AAACCTGCAATCACAC-1
## AAACCTGCACGGCCAT-1 AAACCTGCACGGCCAT-1
## AAACCTGCATGCATGT-1 AAACCTGCATGCATGT-1
patient9DF <- data.frame(patient9@meta.data)
patient9DF$sample <- 'patient9'
patient9DF$barcode <- row.names(patient9DF)
head(patient9DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCTGAGACCCACC-1 GSM9493333 2939 1367 patient9
## AAACCTGAGACGACGT-1 GSM9493333 5500 2318 patient9
## AAACCTGAGAGACGAA-1 GSM9493333 2382 1133 patient9
## AAACCTGAGATGCCAG-1 GSM9493333 9313 3457 patient9
## AAACCTGAGATGTGGC-1 GSM9493333 5353 2455 patient9
## AAACCTGAGCCAGTAG-1 GSM9493333 6768 2664 patient9
## barcode
## AAACCTGAGACCCACC-1 AAACCTGAGACCCACC-1
## AAACCTGAGACGACGT-1 AAACCTGAGACGACGT-1
## AAACCTGAGAGACGAA-1 AAACCTGAGAGACGAA-1
## AAACCTGAGATGCCAG-1 AAACCTGAGATGCCAG-1
## AAACCTGAGATGTGGC-1 AAACCTGAGATGTGGC-1
## AAACCTGAGCCAGTAG-1 AAACCTGAGCCAGTAG-1
patient14DF <- data.frame(patient14@meta.data)
patient14DF$sample <- 'patient14'
patient14DF$barcode <- row.names(patient14DF)
head(patient14DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGGAGTACC-1 GSM9493334 3697 1489 patient14
## AAACCCACAACCGCTG-1 GSM9493334 3064 1152 patient14
## AAACCCACACCGGAAA-1 GSM9493334 3972 1683 patient14
## AAACCCACATCGATGT-1 GSM9493334 9102 3018 patient14
## AAACCCACATGAAAGT-1 GSM9493334 4416 2207 patient14
## AAACCCAGTGATTCTG-1 GSM9493334 13535 4238 patient14
## barcode
## AAACCCAAGGAGTACC-1 AAACCCAAGGAGTACC-1
## AAACCCACAACCGCTG-1 AAACCCACAACCGCTG-1
## AAACCCACACCGGAAA-1 AAACCCACACCGGAAA-1
## AAACCCACATCGATGT-1 AAACCCACATCGATGT-1
## AAACCCACATGAAAGT-1 AAACCCACATGAAAGT-1
## AAACCCAGTGATTCTG-1 AAACCCAGTGATTCTG-1
patient16DF <- data.frame(patient16@meta.data)
patient16DF$sample <- 'patient16'
patient16DF$barcode <- row.names(patient16DF)
head(patient16DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGGCCCGTT-1 GSM9493335 1335 850 patient16
## AAACCCACACATATCG-1 GSM9493335 15942 4709 patient16
## AAACCCACACCTGATA-1 GSM9493335 8358 2744 patient16
## AAACCCACACTGCATA-1 GSM9493335 6433 2350 patient16
## AAACCCACATCTCCCA-1 GSM9493335 7349 2764 patient16
## AAACCCAGTATGTCAC-1 GSM9493335 22245 4726 patient16
## barcode
## AAACCCAAGGCCCGTT-1 AAACCCAAGGCCCGTT-1
## AAACCCACACATATCG-1 AAACCCACACATATCG-1
## AAACCCACACCTGATA-1 AAACCCACACCTGATA-1
## AAACCCACACTGCATA-1 AAACCCACACTGCATA-1
## AAACCCACATCTCCCA-1 AAACCCACATCTCCCA-1
## AAACCCAGTATGTCAC-1 AAACCCAGTATGTCAC-1
patient18DF <- data.frame(patient18@meta.data)
patient18DF$sample <- 'patient18'
patient18DF$barcode <- row.names(patient18DF)
head(patient18DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAATCCCT-1 GSM9493336 7380 2827 patient18
## AAACCCACAATCCTTT-1 GSM9493336 23600 5516 patient18
## AAACCCACACAAATGA-1 GSM9493336 6396 2402 patient18
## AAACCCACAGTCGCTG-1 GSM9493336 6393 2458 patient18
## AAACCCACATATGCGT-1 GSM9493336 10896 3320 patient18
## AAACCCAGTAATCAGA-1 GSM9493336 3265 1690 patient18
## barcode
## AAACCCAAGAATCCCT-1 AAACCCAAGAATCCCT-1
## AAACCCACAATCCTTT-1 AAACCCACAATCCTTT-1
## AAACCCACACAAATGA-1 AAACCCACACAAATGA-1
## AAACCCACAGTCGCTG-1 AAACCCACAGTCGCTG-1
## AAACCCACATATGCGT-1 AAACCCACATATGCGT-1
## AAACCCAGTAATCAGA-1 AAACCCAGTAATCAGA-1
patient19DF <- data.frame(patient19@meta.data)
patient19DF$sample <- 'patient19'
patient19DF$barcode <- row.names(patient19DF)
head(patient19DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGCTGACTT-1 GSM9493337 14516 4013 patient19
## AAACCCACAAGGCCTC-1 GSM9493337 9934 3096 patient19
## AAACCCACATATGAAG-1 GSM9493337 7642 2794 patient19
## AAACCCAGTACAGTAA-1 GSM9493337 29120 5585 patient19
## AAACCCAGTGACTGTT-1 GSM9493337 10785 3256 patient19
## AAACGAAAGGGTCTTT-1 GSM9493337 7637 3211 patient19
## barcode
## AAACCCAAGCTGACTT-1 AAACCCAAGCTGACTT-1
## AAACCCACAAGGCCTC-1 AAACCCACAAGGCCTC-1
## AAACCCACATATGAAG-1 AAACCCACATATGAAG-1
## AAACCCAGTACAGTAA-1 AAACCCAGTACAGTAA-1
## AAACCCAGTGACTGTT-1 AAACCCAGTGACTGTT-1
## AAACGAAAGGGTCTTT-1 AAACGAAAGGGTCTTT-1
patient20DF <- data.frame(patient20@meta.data)
patient20DF$sample <- 'patient20'
patient20DF$barcode <- row.names(patient20DF)
head(patient20DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGGGTACGT-1 GSM9493338 7174 2569 patient20
## AAACCCAAGTATAGAC-1 GSM9493338 11690 3384 patient20
## AAACCCACAGCTGAGA-1 GSM9493338 8594 2697 patient20
## AAACCCACAGTAGTTC-1 GSM9493338 12894 3821 patient20
## AAACCCACATAGGCGA-1 GSM9493338 16876 4506 patient20
## AAACCCAGTACTCCGG-1 GSM9493338 5461 204 patient20
## barcode
## AAACCCAAGGGTACGT-1 AAACCCAAGGGTACGT-1
## AAACCCAAGTATAGAC-1 AAACCCAAGTATAGAC-1
## AAACCCACAGCTGAGA-1 AAACCCACAGCTGAGA-1
## AAACCCACAGTAGTTC-1 AAACCCACAGTAGTTC-1
## AAACCCACATAGGCGA-1 AAACCCACATAGGCGA-1
## AAACCCAGTACTCCGG-1 AAACCCAGTACTCCGG-1
patient22DF <- data.frame(patient22@meta.data)
patient22DF$sample <- 'patient22'
patient22DF$barcode <- row.names(patient22DF)
head(patient22DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGATGTTAG-1 GSM9493339 1237 834 patient22
## AAACCCAAGCACTCGC-1 GSM9493339 11878 3865 patient22
## AAACCCAAGCCTCCAG-1 GSM9493339 8011 3134 patient22
## AAACCCAAGCGGTAGT-1 GSM9493339 3418 1624 patient22
## AAACCCAAGGAACGTC-1 GSM9493339 3655 1573 patient22
## AAACCCAAGTTCCTGA-1 GSM9493339 3008 1604 patient22
## barcode
## AAACCCAAGATGTTAG-1 AAACCCAAGATGTTAG-1
## AAACCCAAGCACTCGC-1 AAACCCAAGCACTCGC-1
## AAACCCAAGCCTCCAG-1 AAACCCAAGCCTCCAG-1
## AAACCCAAGCGGTAGT-1 AAACCCAAGCGGTAGT-1
## AAACCCAAGGAACGTC-1 AAACCCAAGGAACGTC-1
## AAACCCAAGTTCCTGA-1 AAACCCAAGTTCCTGA-1
patient25DF <- data.frame(patient25@meta.data)
patient25DF$sample <- 'patient25'
patient25DF$barcode <- row.names(patient25DF)
head(patient25DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAATTGCA-1 GSM9493340 6955 2745 patient25
## AAACCCAAGATAGCTA-1 GSM9493340 5617 2350 patient25
## AAACCCAAGCTCGAAG-1 GSM9493340 9442 3537 patient25
## AAACCCAAGGGTGGGA-1 GSM9493340 4770 2036 patient25
## AAACCCACAACTGTGT-1 GSM9493340 455 281 patient25
## AAACCCACAATAGTGA-1 GSM9493340 6766 2939 patient25
## barcode
## AAACCCAAGAATTGCA-1 AAACCCAAGAATTGCA-1
## AAACCCAAGATAGCTA-1 AAACCCAAGATAGCTA-1
## AAACCCAAGCTCGAAG-1 AAACCCAAGCTCGAAG-1
## AAACCCAAGGGTGGGA-1 AAACCCAAGGGTGGGA-1
## AAACCCACAACTGTGT-1 AAACCCACAACTGTGT-1
## AAACCCACAATAGTGA-1 AAACCCACAATAGTGA-1
patient27DF <- data.frame(patient27@meta.data)
patient27DF$sample <- 'patient27'
patient27DF$barcode <- row.names(patient27DF)
head(patient27DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGATGAACT-1 GSM9493341 8415 2972 patient27
## AAACCCAAGCGCTTCG-1 GSM9493341 2577 1486 patient27
## AAACCCAAGTAATACG-1 GSM9493341 4681 2042 patient27
## AAACCCACAGAGTGAC-1 GSM9493341 2268 1195 patient27
## AAACCCACAGGCAATG-1 GSM9493341 2779 1686 patient27
## AAACCCACAGTTAGGG-1 GSM9493341 3864 1951 patient27
## barcode
## AAACCCAAGATGAACT-1 AAACCCAAGATGAACT-1
## AAACCCAAGCGCTTCG-1 AAACCCAAGCGCTTCG-1
## AAACCCAAGTAATACG-1 AAACCCAAGTAATACG-1
## AAACCCACAGAGTGAC-1 AAACCCACAGAGTGAC-1
## AAACCCACAGGCAATG-1 AAACCCACAGGCAATG-1
## AAACCCACAGTTAGGG-1 AAACCCACAGTTAGGG-1
patient30DF <- data.frame(patient30@meta.data)
patient30DF$sample <- 'patient30'
patient30DF$barcode <- row.names(patient30DF)
head(patient30DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAACTTCC-1 GSM9493342 7731 2715 patient30
## AAACCCAAGATTGGGC-1 GSM9493342 893 464 patient30
## AAACCCAAGCATATGA-1 GSM9493342 8097 2983 patient30
## AAACCCAAGCATCTTG-1 GSM9493342 12789 3633 patient30
## AAACCCAAGTCTTCGA-1 GSM9493342 5672 2167 patient30
## AAACCCAAGTTCAACC-1 GSM9493342 16285 4535 patient30
## barcode
## AAACCCAAGAACTTCC-1 AAACCCAAGAACTTCC-1
## AAACCCAAGATTGGGC-1 AAACCCAAGATTGGGC-1
## AAACCCAAGCATATGA-1 AAACCCAAGCATATGA-1
## AAACCCAAGCATCTTG-1 AAACCCAAGCATCTTG-1
## AAACCCAAGTCTTCGA-1 AAACCCAAGTCTTCGA-1
## AAACCCAAGTTCAACC-1 AAACCCAAGTTCAACC-1
patient31DF <- data.frame(patient31@meta.data)
patient31DF$sample <- 'patient31'
patient31DF$barcode <- row.names(patient31DF)
head(patient31DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGATGCAGC-1 GSM9493343 6673 2192 patient31
## AAACCCAAGCCGCACT-1 GSM9493343 5623 2135 patient31
## AAACCCAAGTCAGAGC-1 GSM9493343 5247 1640 patient31
## AAACCCACAATTCACG-1 GSM9493343 5696 2199 patient31
## AAACCCACACAAATCC-1 GSM9493343 9451 3391 patient31
## AAACCCACACAATGTC-1 GSM9493343 10479 3223 patient31
## barcode
## AAACCCAAGATGCAGC-1 AAACCCAAGATGCAGC-1
## AAACCCAAGCCGCACT-1 AAACCCAAGCCGCACT-1
## AAACCCAAGTCAGAGC-1 AAACCCAAGTCAGAGC-1
## AAACCCACAATTCACG-1 AAACCCACAATTCACG-1
## AAACCCACACAAATCC-1 AAACCCACACAAATCC-1
## AAACCCACACAATGTC-1 AAACCCACACAATGTC-1
patient36DF <- data.frame(patient36@meta.data)
patient36DF$sample <- 'patient36'
patient36DF$barcode <- row.names(patient36DF)
head(patient36DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAAGTGTT-1 GSM9493344 8136 2776 patient36
## AAACCCAAGATAGCTA-1 GSM9493344 15803 4967 patient36
## AAACCCAAGCGTTCCG-1 GSM9493344 3286 1741 patient36
## AAACCCAAGGAACGAA-1 GSM9493344 886 497 patient36
## AAACCCAAGGCTGGAT-1 GSM9493344 918 592 patient36
## AAACCCACAGCTATTG-1 GSM9493344 18166 4953 patient36
## barcode
## AAACCCAAGAAGTGTT-1 AAACCCAAGAAGTGTT-1
## AAACCCAAGATAGCTA-1 AAACCCAAGATAGCTA-1
## AAACCCAAGCGTTCCG-1 AAACCCAAGCGTTCCG-1
## AAACCCAAGGAACGAA-1 AAACCCAAGGAACGAA-1
## AAACCCAAGGCTGGAT-1 AAACCCAAGGCTGGAT-1
## AAACCCACAGCTATTG-1 AAACCCACAGCTATTG-1
patient37DF <- data.frame(patient37@meta.data)
patient37DF$sample <- 'patient37'
patient37DF$barcode <- row.names(patient37DF)
head(patient37DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAAATTGC-1 GSM9493345 8154 2523 patient37
## AAACCCAAGCCTAGGA-1 GSM9493345 7610 2623 patient37
## AAACCCAAGTATCTGC-1 GSM9493345 6634 2599 patient37
## AAACCCAAGTCTTCGA-1 GSM9493345 30471 5798 patient37
## AAACCCAAGTTCGCAT-1 GSM9493345 13432 4087 patient37
## AAACCCACAACTCCAA-1 GSM9493345 3667 1415 patient37
## barcode
## AAACCCAAGAAATTGC-1 AAACCCAAGAAATTGC-1
## AAACCCAAGCCTAGGA-1 AAACCCAAGCCTAGGA-1
## AAACCCAAGTATCTGC-1 AAACCCAAGTATCTGC-1
## AAACCCAAGTCTTCGA-1 AAACCCAAGTCTTCGA-1
## AAACCCAAGTTCGCAT-1 AAACCCAAGTTCGCAT-1
## AAACCCACAACTCCAA-1 AAACCCACAACTCCAA-1
patient38DF <- data.frame(patient38@meta.data)
patient38DF$sample <- 'patient38'
patient38DF$barcode <- row.names(patient38DF)
head(patient38DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGGTCACAG-1 GSM9493346 6311 2798 patient38
## AAACCCAAGTGGAAGA-1 GSM9493346 6364 2610 patient38
## AAACCCACAGATTCGT-1 GSM9493346 10556 3254 patient38
## AAACCCACATCCTGTC-1 GSM9493346 13957 4361 patient38
## AAACCCACATCCTTCG-1 GSM9493346 6792 2596 patient38
## AAACCCAGTGACTGAG-1 GSM9493346 6586 2403 patient38
## barcode
## AAACCCAAGGTCACAG-1 AAACCCAAGGTCACAG-1
## AAACCCAAGTGGAAGA-1 AAACCCAAGTGGAAGA-1
## AAACCCACAGATTCGT-1 AAACCCACAGATTCGT-1
## AAACCCACATCCTGTC-1 AAACCCACATCCTGTC-1
## AAACCCACATCCTTCG-1 AAACCCACATCCTTCG-1
## AAACCCAGTGACTGAG-1 AAACCCAGTGACTGAG-1
patient39DF <- data.frame(patient39@meta.data)
patient39DF$sample <- 'patient39'
patient39DF$barcode <- row.names(patient39DF)
head(patient39DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGATGAACT-1 GSM9493347 9044 3335 patient39
## AAACCCAAGCGTATAA-1 GSM9493347 9370 3169 patient39
## AAACCCAAGGTGCGAT-1 GSM9493347 15035 4839 patient39
## AAACCCAAGGTGTGAT-1 GSM9493347 340 301 patient39
## AAACCCACAATTTCGG-1 GSM9493347 20410 5323 patient39
## AAACCCACACGTACAT-1 GSM9493347 7596 3020 patient39
## barcode
## AAACCCAAGATGAACT-1 AAACCCAAGATGAACT-1
## AAACCCAAGCGTATAA-1 AAACCCAAGCGTATAA-1
## AAACCCAAGGTGCGAT-1 AAACCCAAGGTGCGAT-1
## AAACCCAAGGTGTGAT-1 AAACCCAAGGTGTGAT-1
## AAACCCACAATTTCGG-1 AAACCCACAATTTCGG-1
## AAACCCACACGTACAT-1 AAACCCACACGTACAT-1
patient41DF <- data.frame(patient41@meta.data)
patient41DF$sample <- 'patient41'
patient41DF$barcode <- row.names(patient41DF)
head(patient41DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAAGCTGC-1 GSM9493348 6399 2546 patient41
## AAACCCAAGATTTGCC-1 GSM9493348 22051 5811 patient41
## AAACCCAAGCTGTTAC-1 GSM9493348 18725 5179 patient41
## AAACCCAAGGTAGTCG-1 GSM9493348 4545 1880 patient41
## AAACCCAAGTCTTCGA-1 GSM9493348 7017 2724 patient41
## AAACCCACAACCCTAA-1 GSM9493348 3604 1813 patient41
## barcode
## AAACCCAAGAAGCTGC-1 AAACCCAAGAAGCTGC-1
## AAACCCAAGATTTGCC-1 AAACCCAAGATTTGCC-1
## AAACCCAAGCTGTTAC-1 AAACCCAAGCTGTTAC-1
## AAACCCAAGGTAGTCG-1 AAACCCAAGGTAGTCG-1
## AAACCCAAGTCTTCGA-1 AAACCCAAGTCTTCGA-1
## AAACCCACAACCCTAA-1 AAACCCACAACCCTAA-1
patient44DF <- data.frame(patient44@meta.data)
patient44DF$sample <- 'patient44'
patient44DF$barcode <- row.names(patient44DF)
head(patient44DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGAGCCATG-1 GSM9493349 12359 3331 patient44
## AAACCCAAGCCTGCCA-1 GSM9493349 306 247 patient44
## AAACCCAAGTACAGCG-1 GSM9493349 2325 1136 patient44
## AAACCCAAGTCTCGTA-1 GSM9493349 13436 3905 patient44
## AAACCCACAACATACC-1 GSM9493349 332 288 patient44
## AAACCCACAAGCTGCC-1 GSM9493349 355 302 patient44
## barcode
## AAACCCAAGAGCCATG-1 AAACCCAAGAGCCATG-1
## AAACCCAAGCCTGCCA-1 AAACCCAAGCCTGCCA-1
## AAACCCAAGTACAGCG-1 AAACCCAAGTACAGCG-1
## AAACCCAAGTCTCGTA-1 AAACCCAAGTCTCGTA-1
## AAACCCACAACATACC-1 AAACCCACAACATACC-1
## AAACCCACAAGCTGCC-1 AAACCCACAAGCTGCC-1
patient51DF <- data.frame(patient51@meta.data)
patient51DF$sample <- 'patient51'
patient51DF$barcode <- row.names(patient51DF)
head(patient51DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGATTAGAC-1 GSM9493350 4036 1591 patient51
## AAACCCACAAAGGATT-1 GSM9493350 11263 3664 patient51
## AAACCCACAAGACAAT-1 GSM9493350 8074 2670 patient51
## AAACCCACAGGCACTC-1 GSM9493350 3792 1822 patient51
## AAACCCACATGAAAGT-1 GSM9493350 6838 2517 patient51
## AAACCCACATGGGAAC-1 GSM9493350 4908 2043 patient51
## barcode
## AAACCCAAGATTAGAC-1 AAACCCAAGATTAGAC-1
## AAACCCACAAAGGATT-1 AAACCCACAAAGGATT-1
## AAACCCACAAGACAAT-1 AAACCCACAAGACAAT-1
## AAACCCACAGGCACTC-1 AAACCCACAGGCACTC-1
## AAACCCACATGAAAGT-1 AAACCCACATGAAAGT-1
## AAACCCACATGGGAAC-1 AAACCCACATGGGAAC-1
patient52DF <- data.frame(patient52@meta.data)
patient52DF$sample <- 'patient52'
patient52DF$barcode <- row.names(patient52DF)
head(patient52DF)
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCCAAGACTACCT-1 GSM9493351 2723 1471 patient52
## AAACCCAAGAGCATAT-1 GSM9493351 1371 416 patient52
## AAACCCAAGCACTCAT-1 GSM9493351 3217 1504 patient52
## AAACCCAAGCTTAGTC-1 GSM9493351 4059 1325 patient52
## AAACCCAAGTCCCAAT-1 GSM9493351 6942 2411 patient52
## AAACCCACACGGAAGT-1 GSM9493351 9136 3355 patient52
## barcode
## AAACCCAAGACTACCT-1 AAACCCAAGACTACCT-1
## AAACCCAAGAGCATAT-1 AAACCCAAGAGCATAT-1
## AAACCCAAGCACTCAT-1 AAACCCAAGCACTCAT-1
## AAACCCAAGCTTAGTC-1 AAACCCAAGCTTAGTC-1
## AAACCCAAGTCCCAAT-1 AAACCCAAGTCCCAAT-1
## AAACCCACACGGAAGT-1 AAACCCACACGGAAGT-1
combined17patientBarcodes <- rbind(patient6DF,patient9DF,patient14DF,patient16DF,patient18DF,patient19DF,patient20DF,patient22DF,patient25DF,patient27DF,patient30DF,patient31DF,patient36DF,patient37DF,patient38DF,patient39DF,patient41DF,patient44DF,patient51DF,patient52DF)
dim(combined17patientBarcodes)
## [1] 264679 5
There are 264,679 barcodes in the patient samples, which we expected to come from 17 patients numbered non-consecutively from patient 6 through patient 52. We can look at every 25,000th row, plus the first and last rows, to see the patient ID.
combined17patientBarcodes[c(1,25000,50000,75000,100000,125000,150000,175000,200000,225000,250000,264679),]
## orig.ident nCount_RNA nFeature_RNA sample
## AAACCTGAGGCTCTTA-1 GSM9493332 10238 2862 patient6
## CCGGTGATCCATAAGC-1 GSM9493334 6448 2566 patient14
## TGCTTCGGTCCCGCAA-1 GSM9493336 8852 2958 patient18
## GATGAGGAGGTGAGAA-1 GSM9493339 5419 2070 patient22
## GAAGAATCATCTGTTT-1 GSM9493341 1540 985 patient27
## AGCTTCCCAGCGTTTA-1 GSM9493343 7552 3071 patient31
## TGGGAAGAGGTTGCCC-1 GSM9493344 5632 2681 patient36
## TAGGTACTCTCGACGG-1 GSM9493346 9334 2794 patient38
## GATCGTACACCCTTAC-1 GSM9493348 7391 2681 patient41
## GGCTTTCGTGCTATTG-1 GSM9493349 276 213 patient44
## TTGTTTGAGGTATAGT-1 GSM9493350 4217 1807 patient51
## TTTGTTGTCGGACAAG-1 GSM9493351 8375 2957 patient52
## barcode
## AAACCTGAGGCTCTTA-1 AAACCTGAGGCTCTTA-1
## CCGGTGATCCATAAGC-1 CCGGTGATCCATAAGC-1
## TGCTTCGGTCCCGCAA-1 TGCTTCGGTCCCGCAA-1
## GATGAGGAGGTGAGAA-1 GATGAGGAGGTGAGAA-1
## GAAGAATCATCTGTTT-1 GAAGAATCATCTGTTT-1
## AGCTTCCCAGCGTTTA-1 AGCTTCCCAGCGTTTA-1
## TGGGAAGAGGTTGCCC-1 TGGGAAGAGGTTGCCC-1
## TAGGTACTCTCGACGG-1 TAGGTACTCTCGACGG-1
## GATCGTACACCCTTAC-1 GATCGTACACCCTTAC-1
## GGCTTTCGTGCTATTG-1 GGCTTTCGTGCTATTG-1
## TTGTTTGAGGTATAGT-1 TTGTTTGAGGTATAGT-1
## TTTGTTGTCGGACAAG-1 TTTGTTGTCGGACAAG-1
Let's see the unique patient sample labels.
unique(combined17patientBarcodes$sample)
## [1] "patient6" "patient9" "patient14" "patient16" "patient18" "patient19"
## [7] "patient20" "patient22" "patient25" "patient27" "patient30" "patient31"
## [13] "patient36" "patient37" "patient38" "patient39" "patient41" "patient44"
## [19] "patient51" "patient52"
There are 20 patient samples, not 17. Noted. That gives 12 healthy samples and 20 patient samples, for 32 samples in all. The metadata on data extraction and handling does not list a patient 14, yet the files extracted and imported with Seurat include one. The math doesn't add up: the series matrix has 30 columns, one of which is the description column, so there should be 29 samples, or 30 with patient 14, but we have 32. That leaves 2 samples unaccounted for. They could be among the healthy samples. Let's see.
unique(combinedHealthy12$sample)
## [1] "healthy1" "healthy2" "healthy3" "healthy4" "healthy5" "healthy6"
## [7] "healthy7" "healthy8" "healthy9" "healthy10" "healthy11" "healthy12"
OK, so we have 12 healthy samples and 20 patient samples. The pathology in this study is natural killer/T-cell lymphoma (NKTCL), which is highly associated with Epstein-Barr virus (EBV) infection.
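The sample tally can be cross-checked against the accession IDs, since each sample label should map to exactly one ‘orig.ident’ value. In the outputs above those values run from GSM9493320 to GSM9493351, which is 32 accessions. A minimal sketch on a toy table (toyBarcodes, with made-up GSM_A style labels) standing in for the combined barcode data frames:

```r
# Toy stand-in for the combined barcode table (three samples, five cells)
toyBarcodes <- data.frame(
  orig.ident = c("GSM_A", "GSM_A", "GSM_B", "GSM_C", "GSM_C"),
  sample     = c("healthy1", "healthy1", "patient6", "patient9", "patient9")
)

# Cells per sample
table(toyBarcodes$sample)

# One row per (accession, sample) pair: each sample should map to one accession
pairs <- unique(toyBarcodes[, c("orig.ident", "sample")])
nrow(pairs)

# Samples per group, taken from the label prefix
group <- ifelse(grepl("^healthy", pairs$sample), "healthy", "patient")
table(group)
```

Run on the real data frames, nrow(pairs) should come out at 32 and table(group) at 12 healthy and 20 patient if the labels and accessions line up.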
Let's write the patient barcodes out to csv.
write.csv(combined17patientBarcodes,
"combined17patientBarcodes.csv",row.names=FALSE)
Now let's combine the healthy and patient barcodes and write them to csv.
allSampleBarcodes <- rbind(combined17patientBarcodes,combinedHealthy12)
dim(allSampleBarcodes)
## [1] 407006 5
We have a total of 407,006 barcodes for 32 samples, made up of 12 healthy donors and 20 NKTCL patients.
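A quick sanity check after a row bind is that the combined row count equals the sum of the parts. The sketch below does this on toy tables standing in for combined17patientBarcodes and combinedHealthy12, then repeats the arithmetic with the counts reported in this post.

```r
# Toy stand-ins for combined17patientBarcodes and combinedHealthy12
toyPatients <- data.frame(sample = c("patient6", "patient6", "patient9"),
                          barcode = c("AAACCTGAGGCTCTTA-1", "AAACCTGAGTGCCAGA-1",
                                      "AAACCTGAGACCCACC-1"))
toyHealthy <- data.frame(sample = c("healthy1", "healthy2"),
                         barcode = c("AAACCCACAGCATTGT-1", "AAACCCAAGCACTCTA-1"))

toyAll <- rbind(toyPatients, toyHealthy)

# No rows should be lost or duplicated by the bind
stopifnot(nrow(toyAll) == nrow(toyPatients) + nrow(toyHealthy))

# The same check with the row counts reported in the post
264679 + 142327 == 407006
```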
We will do further analysis on this data, following the workflow through to get the top genes associated with aggressive natural killer T-cell lymphoma (NKTCL) from EBV. This is PBMC single-cell RNA sequencing, meaning each cell is sequenced separately from the others rather than pooled. In a recent study we analyzed mononucleosis gene expression data involving microRNA (miRNA). A miRNA starts out as a stretch of single-stranded RNA that folds back on itself into a double-stranded hairpin; once processed, it binds a target messenger RNA and keeps it from being translated into protein. Now we are working with the same kind of complementary DNA, reverse transcribed from mRNA, but the gene expression data is obtained cell by cell through single-cell RNA sequencing.
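Write the last data frame of all samples’ barcodes to csv. The file name below still says nasopharyngeal carcinoma, but these are the NKTCL samples from GSE318371.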
write.csv(allSampleBarcodes,'allSampleBarcodes_nasopharyngealCarcinoma.csv',row.names=FALSE)
You can get the files below:
Thanks. Keep checking in for more work on this project.
Thanks again.