This post walks through exploratory data analysis (EDA) with a package that is new to me, Seurat, on a natural killer/T-cell lymphoma (NKTCL) dataset associated with EBV infection. The study is GSE318371 in the NCBI Gene Expression Omnibus (GEO) database. I downloaded the GSE318371_RAW.tar file using the custom download option at the bottom of the series page. Extracting each barcodes file after download takes a very long time, and you can skip that step: leave the files compressed with their .gz names, remove the prepended file name in front of ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx.gz’, and put the three files for each GSM sample ID into their own folder so that each sample can be read in from its folder.

This research on an aggressive lymphoma associated with EBV infection was uploaded very recently, in February 2026. I found other EBV-related studies as well, including nasopharyngeal carcinoma, Hodgkin lymphoma, and large B-cell lymphoma. For now we work on this project to get our top genes for our machine to predict EBV, Lyme disease, or a specific EBV-associated pathology: multiple sclerosis, mononucleosis, primary EBV infection, the various lymphomas, or nasopharyngeal carcinoma.

There is a process that has to be followed to extract each sample's barcodes, features, and count matrix. This is single-cell gene expression data: many cells are loaded into the instrument, each cell is tagged with its own barcode, and the number of times each gene is detected in each cell is counted. The barcodes identify the cells and the features are the genes of the count matrix.

I had to watch a YouTube video to understand the Seurat package better and how to read in this data. I estimate that each barcodes file takes around 30 minutes to extract from its compressed format. My 7-Zip wasn't extracting from the right-click menu, and I followed the videos step by step. The information in them is useful, but just leave the files in compressed format, because the code didn't work for me on the extracted files.

Let's read the summary in the GSE318371-GPL34284_series_matrix.txt file to see how the samples were collected and handled, whether they were normalized, and the type and design of the study.

seriesInfo1 <- read.csv("GSE318371-GPL34284_series_matrix.txt", nrows=25, sep='\t', stringsAsFactors = T, strip.white = T, na.strings=" ", header=F)

seriesInfo1
##                                 V1
## 1                    !Series_title
## 2            !Series_geo_accession
## 3                   !Series_status
## 4          !Series_submission_date
## 5         !Series_last_update_date
## 6                  !Series_summary
## 7           !Series_overall_design
## 8                     !Series_type
## 9              !Series_contributor
## 10             !Series_contributor
## 11             !Series_contributor
## 12             !Series_contributor
## 13               !Series_sample_id
## 14            !Series_contact_name
## 15           !Series_contact_email
## 16       !Series_contact_institute
## 17         !Series_contact_address
## 18            !Series_contact_city
## 19 !Series_contact_zip/postal_code
## 20         !Series_contact_country
## 21      !Series_supplementary_file
## 22             !Series_platform_id
## 23             !Series_platform_id
## 24          !Series_platform_taxid
## 25            !Series_sample_taxid

##    V2
## 1  Peripheral blood mononuclear cells single-cell landscape of newly diagnosed NK/T cell lymphoma patients
## 3  Public on Feb 07 2026
## 4  Feb 04 2026
## 5  Feb 07 2026
## 6  Natural killer/T cell lymphoma (NKTCL) is a rare and aggressive form of non-Hodgkin's lymphoma associated with Epstein-Barr Virus (EBV) infection.The recent advancement of multi-omics technologies has significantly enhanced our understanding of NKTCL disease biology, including genetics, transcription landscape, variations of EBV strain, and microenvironments. Emerging evidence suggests that immunoprofiling of peripheral blood mononuclear cells (PBMCs) is associated with the treatment response of cancer patients and can be used to guide clinical trials and therapy. In this study, we utilized single-cell RNA sequencing (scRNA-seq) to comprehensively characterize the phenotypic landscape of PBMCs in newly diagnosed patients with NKTCL. This research offers a valuable peripheral blood-based signature for newly diagnosed NKTCL, which could be a crucial resource for further investigations into the pathogenesis of NKTCL and the optimization of therapeutic regimens.
## 7  scRNA-seq profiling of PBMCs from healthy donors and newly diagnosed NKTCL patients
## 8  Expression profiling by high throughput sequencing
## 9  Xiaozhen,,Liang
## 10 …ong,,Tao
## 11 …an,,Jia
## 12 …huanxu,,Liu
## 14 Xiaozhen,,Liang
## 15 xzliang@simm.ac.cn
## 16 Shanghai Institute of Materia Medica Chinese Academy of Sciences
## 17 Life Science Research Building 320 Yueyang Road, Xuhui District
## 18 Shanghai
## 20 China
## 21 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE318nnn/GSE318371/suppl/GSE318371_RAW.tar
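The series header can also be pulled out as field and value pairs instead of a wide table. The following is only a sketch under assumptions: the demo lines stand in for the top of a series matrix file, and `read_series_header()` is a hypothetical helper name, not part of any package.

```r
# Demo text standing in for the first lines of a series matrix file
# (tab-separated, values quoted). The real file is much longer.
demo_lines <- c(
  "!Series_geo_accession\t\"GSE318371\"",
  "!Series_status\t\"Public on Feb 07 2026\"",
  "!Series_type\t\"Expression profiling by high throughput sequencing\"",
  "!Sample_description\t\"Library name: HD1\"\t\"Library name: HD2\""
)
demo_file <- tempfile(fileext = ".txt")
writeLines(demo_lines, demo_file)

# Hypothetical helper: keep only the "!Series_" lines and split each one
# into a field name and a value at the first tab.
read_series_header <- function(path) {
  lines <- readLines(path)
  lines <- lines[startsWith(lines, "!Series_")]
  key   <- sub("\t.*$", "", lines)        # text before the first tab
  value <- sub("^[^\t]*\t", "", lines)    # text after the first tab
  value <- gsub("\"", "", value, fixed = TRUE)
  data.frame(field = sub("^!Series_", "", key),
             value = value,
             stringsAsFactors = FALSE)
}

header <- read_series_header(demo_file)
header
```

With the real file, `read_series_header("GSE318371-GPL34284_series_matrix.txt")` would be the equivalent call, assuming the file sits in the working directory.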




seriesInfoDesign <- read.csv("GSE318371-GPL34284_series_matrix.txt", sep='\t', nrows=50, stringsAsFactors = T, strip.white=T, na.strings=" ", quote="", skip=25, header=F)

dim(seriesInfoDesign)
## [1] 42 30

There are 42 additional rows of metadata on study design and methods for handling data and biological material.

Let's look at the first two columns for the rows of interest that detail the study design.

seriesInfoDesign[c(8:15,17:21,30:33),c(1:2)]
##                              V1
## 8       !Sample_source_name_ch1
## 9          !Sample_organism_ch1
## 10  !Sample_characteristics_ch1
## 11  !Sample_characteristics_ch1
## 12  !Sample_characteristics_ch1
## 13         !Sample_molecule_ch1
## 14 !Sample_extract_protocol_ch1
## 15 !Sample_extract_protocol_ch1
## 17          !Sample_description
## 18      !Sample_data_processing
## 19      !Sample_data_processing
## 20      !Sample_data_processing
## 21          !Sample_platform_id
## 30     !Sample_instrument_model
## 31    !Sample_library_selection
## 32       !Sample_library_source
## 33     !Sample_library_strategy
##                                                                                                                                                                                                                                                                                                                                                                                                                                             V2
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                      "blood"
## 9                                                                                                                                                                                                                                                                                                                                                                                                                               "Homo sapiens"
## 10                                                                                                                                                                                                                                                                                                                                                                                                                             "tissue: blood"
## 11                                                                                                                                                                                                                                                                                                                                                                                                                          "cell line: PBMCs"
## 12                                                                                                                                                                                                                                                                                                                                                                                                  "cell type: Peripheral blood immune cells"
## 13                                                                                                                                                                                                                                                                                                                                                                                                                                 "total RNA"
## 14                                                                                                                                                                                                                                                                                                                       "Isolated PBMCs were loaded into a 10× Chromium Chip (v3.1 PN:1000120) and barcoded using a 10x Chromium Controller."
## 15                                                                                                                                                                                                                         "RNA from the barcoded cells was then reverse-transcribed, amplified, and prepared into sequencing libraries with the 10× Library Construction Kit (v3.1 PN:1000190) according to the manufacturer’s instructions."
## 17                                                                                                                                                                                                                                                                                                                                                                                                                         "Library name: HD1"
## 18 "Raw scRNA-seq data were initially pre-processed using CellRanger (version 8.0.1, 10x Genomics) to align reads to the human genome (GRCh38, 2024-A from 10x Genomics) and count the unique molecular identifiers (UMIs) for each gene to generate specific gene cell count tables. For each scRNA-seq sample, the count tables were filtered to retain the genes detected in at least 10 cells and cells with a minimum gene count of 300."
## 19                                                                                                                                                                                                                                                                                                                                                                                                                          "Assembly: GRCh38"
## 20                                                                                                                                                                                                                                                                                                                                             "Supplementary files format and content: barcodes, features, and matrix files for each samples"
## 21                                                                                                                                                                                                                                                                                                                                                                                                                                  "GPL34284"
## 30                                                                                                                                                                                                                                                                                                                                                                                                                   "Illumina NovaSeq X Plus"
## 31                                                                                                                                                                                                                                                                                                                                                                                                                                      "cDNA"
## 32                                                                                                                                                                                                                                                                                                                                                                                                                "transcriptomic single cell"
## 33                                                                                                                                                                                                                                                                                                                                                                                                                                   "RNA-Seq"

The data come from total RNA of peripheral blood mononuclear cells (PBMCs), prepared as single-cell RNA-seq libraries on a 10x Chromium chip and sequenced on an Illumina NovaSeq X Plus, so this is sequencing rather than a microarray. The authors kept the genes that were detected in at least 10 cells and the cells that had at least 300 genes detected. RNA-Seq counts the transcript fragments that show up in the sequencing; many won’t show up, but enough do. There is a useful YouTube video that explains how this kind of sequencing operates.

This is where barcodes and features start to make sense, as the vocabulary is different from but similar to data science language. The features are the genes, the rows as we have seen, and the barcodes are the nucleotide tags that identify the cells, the columns of the matrix we get when we read in the formatted files with the Seurat library. The folder has to contain ‘barcodes.tsv.gz’, ‘features.tsv.gz’, and ‘matrix.mtx.gz’ to be read in. The YouTube videos I watched unzipped the files before reading them in, but in my recent experience the Seurat library only reads the files while they are still compressed with the ‘.gz’ extension, which is the format they are already in when downloaded from the NCBI website.
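The filtering rule quoted in the metadata (genes detected in at least 10 cells, cells with at least 300 genes) can be illustrated on a toy matrix. This is a minimal sketch, not the authors' pipeline: `filter_counts()` is a hypothetical helper, the counts are made up, both thresholds are checked against the unfiltered matrix (the metadata does not say in which order they were applied), and the thresholds are lowered so the toy example has something to remove.

```r
# Hypothetical helper: keep genes detected in >= min_cells cells and
# cells with >= min_genes detected genes. Rows are genes, columns are cells.
filter_counts <- function(counts, min_cells = 10, min_genes = 300) {
  detected   <- counts > 0                      # TRUE where a gene was seen in a cell
  keep_genes <- rowSums(detected) >= min_cells  # cells per gene
  keep_cells <- colSums(detected) >= min_genes  # genes per cell
  counts[keep_genes, keep_cells, drop = FALSE]
}

# Made-up 4 genes x 4 cells count matrix.
toy <- matrix(c(1, 0, 2, 0,
                0, 0, 0, 3,
                4, 1, 0, 1,
                0, 2, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", 1:4), paste0("cell", 1:4)))

# Toy thresholds: a gene must appear in 2 cells, a cell must show 2 genes.
filtered <- filter_counts(toy, min_cells = 2, min_genes = 2)
filtered
```

Here gene2 is dropped because it is detected in only one cell, while all four cells pass the gene threshold.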

Rows 17 and 41 of the metadata (series information) show which GSM samples are labeled as patients and which as healthy donors.

seriesInfoDesign[c(17,41),]
##                     V1                  V2                  V3
## 17 !Sample_description "Library name: HD1" "Library name: HD2"
## 41            "ID_REF"        "GSM9493320"        "GSM9493321"
##                     V4                  V5                  V6
## 17 "Library name: HD3" "Library name: HD4" "Library name: HD5"
## 41        "GSM9493322"        "GSM9493323"        "GSM9493324"
##                     V7                  V8                  V9
## 17 "Library name: HD6" "Library name: HD7" "Library name: HD8"
## 41        "GSM9493325"        "GSM9493326"        "GSM9493327"
##                    V10                  V11                  V12
## 17 "Library name: HD9" "Library name: HD10" "Library name: HD11"
## 41        "GSM9493328"         "GSM9493329"         "GSM9493330"
##                     V13                       V14                       V15
## 17 "Library name: HD12" "Library name: patient16" "Library name: patient18"
## 41         "GSM9493331"              "GSM9493335"              "GSM9493336"
##                          V16                       V17
## 17 "Library name: patient19" "Library name: patient20"
## 41              "GSM9493337"              "GSM9493338"
##                          V18                       V19
## 17 "Library name: patient22" "Library name: patient25"
## 41              "GSM9493339"              "GSM9493340"
##                          V20                       V21
## 17 "Library name: patient27" "Library name: patient30"
## 41              "GSM9493341"              "GSM9493342"
##                          V22                       V23
## 17 "Library name: patient31" "Library name: patient36"
## 41              "GSM9493343"              "GSM9493344"
##                          V24                       V25
## 17 "Library name: patient37" "Library name: patient38"
## 41              "GSM9493345"              "GSM9493346"
##                          V26                       V27
## 17 "Library name: patient39" "Library name: patient41"
## 41              "GSM9493347"              "GSM9493348"
##                          V28                       V29
## 17 "Library name: patient44" "Library name: patient51"
## 41              "GSM9493349"              "GSM9493350"
##                          V30
## 17 "Library name: patient52"
## 41              "GSM9493351"

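There are 12 healthy donor samples, HD1 through HD12, and 17 patient samples with non-consecutive numbers from patient16 through patient52 in this file. The GSM accessions jump from GSM9493331 to GSM9493335; the three in between (GSM9493332 through GSM9493334) are read in later as patient6, patient9, and patient14.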
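Rows 17 and 41 can be combined into a small lookup table of GSM accession, library name, and group. The sketch below uses three values copied from the output above rather than the full data frame, and the `group` rule (library names starting with "HD" are healthy donors) is my assumption based on the naming.

```r
# Three example values copied from rows 17 and 41, quotes included,
# as they appear when the file is read with quoting disabled.
library_names <- c("\"Library name: HD1\"",
                   "\"Library name: HD12\"",
                   "\"Library name: patient16\"")
gsm_ids <- c("\"GSM9493320\"", "\"GSM9493331\"", "\"GSM9493335\"")

# With the real data frame, the inputs could come from something like
# sapply(seriesInfoDesign[17, -1], as.character) and the same for row 41.

strip_quotes <- function(x) gsub("\"", "", x, fixed = TRUE)

sample_table <- data.frame(
  gsm     = strip_quotes(gsm_ids),
  library = sub("^Library name: ", "", strip_quotes(library_names)),
  stringsAsFactors = FALSE
)

# Assumed rule: "HD" prefix means healthy donor, anything else is a patient.
sample_table$group <- ifelse(startsWith(sample_table$library, "HD"),
                             "healthy", "patient")
sample_table
```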

str(seriesInfoDesign)
## 'data.frame':    42 obs. of  30 variables:
##  $ V1 : Factor w/ 35 levels "!Sample_channel_count",..: 31 14 25 26 16 32 1 24 21 2 ...
##  $ V2 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V3 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V4 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V5 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V6 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V7 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V8 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V9 : Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V10: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V11: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V12: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V13: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V14: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V15: Factor w/ 40 levels "","\"0\"","\"1\"",..: 25 19 26 13 14 32 3 8 20 36 ...
##  $ V16: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V17: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V18: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V19: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V20: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V21: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V22: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V23: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V24: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V25: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V26: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V27: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V28: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V29: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...
##  $ V30: Factor w/ 40 levels "","\"\"","\"0\"",..: 26 20 27 14 15 33 4 9 21 36 ...

Install the Seurat package with install.packages("Seurat") if you haven’t already. Then load the library.

library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t

We need to use the tidyverse package.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

I found a few YouTube videos on using Seurat for this type of data. Here is an interesting video explaining how to extract the files and move them into a folder per GSM sample. Don’t do the unzipping and renaming it shows; leave the files compressed and only remove the prepended file name, keeping the .gz extension. Do put each sample's files in their own folder, named after the sample you will load as an object in RStudio.

After making those folders you have to read them into R with the Read10X function.

NML_1 <- Read10X(data.dir = "../Downloads/GSE132771_Raw/NPL1/")

Here the file name and location are those of the video's demonstration, which used the GSE132771 series RAW.tar files.

After following the video tutorial, including the extraction, the code didn’t work. When I left the files in the original .gz format and only removed the prepended file name, e.g. renaming GSM9493320_HD1_barcodes.tsv.gz to barcodes.tsv.gz, the Read10X line further down ran. This removes the tedious extraction step that added extra work and time to the file processing for our exploratory data analysis. The video is 3-4 years old as of this writing, so some things may have been changed or corrected within the Seurat library since then.
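The renaming and sorting into folders can be scripted rather than done by hand. This is a sketch under assumptions: `organize_10x_files()` is a hypothetical helper, it assumes every file is named like GSM9493320_HD1_barcodes.tsv.gz, and the demo runs on empty placeholder files in a temporary folder rather than on the real download.

```r
# Hypothetical helper: move files named <GSM>_<label>_<part> into a folder
# per GSM ID and strip the prefix, leaving the .gz files compressed.
organize_10x_files <- function(raw_dir) {
  pattern <- "^(GSM[0-9]+)_(.*_)?(barcodes\\.tsv\\.gz|features\\.tsv\\.gz|matrix\\.mtx\\.gz)$"
  files <- list.files(raw_dir, pattern = pattern)
  for (f in files) {
    gsm  <- sub(pattern, "\\1", f)  # e.g. "GSM9493320"
    part <- sub(pattern, "\\3", f)  # e.g. "barcodes.tsv.gz"
    dir.create(file.path(raw_dir, gsm), showWarnings = FALSE)
    file.rename(file.path(raw_dir, f), file.path(raw_dir, gsm, part))
  }
  invisible(files)
}

# Demo on empty placeholder files in a temporary folder.
demo_dir <- file.path(tempdir(), "GSE318371_RAW_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("GSM9493320_HD1_barcodes.tsv.gz",
                                  "GSM9493320_HD1_features.tsv.gz",
                                  "GSM9493320_HD1_matrix.mtx.gz")))
organize_10x_files(demo_dir)
list.files(file.path(demo_dir, "GSM9493320"))
```

Try it on a copy of the download first, since `file.rename()` moves the files rather than copying them.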

I had to use the exact file location, and a path copied and pasted from Windows cannot be used as is, because it uses ‘\’ (backslash) as the separator and R needs ‘/’ (forward slash).
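A pasted Windows path can be converted in R instead of being edited by hand. The path below is a made-up example, not my actual folder, and the raw string syntax in the last step needs R 4.0 or newer.

```r
# Hypothetical Windows path. Inside a normal R string each backslash
# has to be doubled, which is why a plain paste does not work.
win_path <- "C:\\Users\\someone\\Downloads\\GSE318371_RAW"

# Replace every backslash with a forward slash.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path

# R 4.0+ raw strings accept the pasted path without doubling.
raw_path <- r"(C:\Users\someone\Downloads\GSE318371_RAW)"
identical(raw_path, win_path)
```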

Let's save my directories as character strings, one for the raw data and one for all the RDS file objects made using Seurat.

Fill in the ellipses with your file directory location.

directoryFolder <- "C:/.../GSE318371_RAW"

RDS_objects <- "C:/.../RDS_objects"
setwd(directoryFolder)

GSM9493320 <- Read10X("GSM9493320")

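Then you can create an object with CreateSeuratObject(). Here min.cells=3 keeps genes detected in at least 3 cells and min.features=200 keeps cells with at least 200 detected genes. The video's version was:

NML_1 <- CreateSeuratObject(counts = NML_1, project = "NML_1", min.cells=3, min.features=200)

For our first sample: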

sampleHD1 <- CreateSeuratObject(counts=GSM9493320, project="GSM9493320", min.cells=3, min.features=200)

In the count matrix, the genes are the rows and the barcodes are the columns, with each entry counting one gene in one cell. The video says the barcodes are the rows, but that probably refers to a table with one row per cell, such as the object's meta.data, rather than the count matrix. Visually, the barcodes are shown in the ‘column names’ above each column of counts.

As for the file names, the barcodes.tsv file holds the cells and the features.tsv file holds the genes.

Each cell is where the genes are counted. The count of a gene in a cell varies from none to however many times it shows up. The next video explains reading the Seurat object created here.
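To make the orientation concrete, here is a toy reconstruction of how the three files fit together. The gene names and barcodes are taken from this sample's output, but the counts are made up, and a real matrix.mtx.gz holds the same kind of (row, column, value) triplets in Matrix Market format.

```r
# Rows come from the features file, columns from the barcodes file.
features <- c("NOC2L", "ISG15", "AGRN")
barcodes <- c("AAACCCACAGCATTGT-1", "AAACCCACATCCGAAT-1")

# Made-up triplets in the style of the matrix file:
# gene (row) index, cell (column) index, count.
triplets <- data.frame(gene_index = c(1, 2, 2, 3),
                       cell_index = c(1, 1, 2, 2),
                       count      = c(2, 5, 1, 3))

# Start from all zeros, since unlisted gene/cell pairs have a count of 0.
counts <- matrix(0, nrow = length(features), ncol = length(barcodes),
                 dimnames = list(features, barcodes))
counts[cbind(triplets$gene_index, triplets$cell_index)] <- triplets$count

counts
counts["ISG15", "AAACCCACATCCGAAT-1"]  # count of one gene in one cell
```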

class(sampleHD1)
## [1] "Seurat"
## attr(,"package")
## [1] "SeuratObject"

The column names are the barcodes and the row names are the genes. colnames() displays the barcodes.

colnames(sampleHD1[])[1:100]
##   [1] "AAACCCACAGCATTGT-1" "AAACCCACATCCGAAT-1" "AAACCCAGTACTAGCT-1"
##   [4] "AAACCCAGTCACTCAA-1" "AAACCCAGTGGTAACG-1" "AAACCCATCAATGCAC-1"
##   [7] "AAACCCATCCCTCTAG-1" "AAACCCATCGTCGATA-1" "AAACGAAAGCCAGACA-1"
##  [10] "AAACGAAAGGATACCG-1" "AAACGAAAGTAATACG-1" "AAACGAACACGACCTG-1"
##  [13] "AAACGAACACTGTTCC-1" "AAACGAACAGTTACCA-1" "AAACGAACATTCAGCA-1"
##  [16] "AAACGAAGTAACATCC-1" "AAACGAAGTCTCCCTA-1" "AAACGAAGTGCAGATG-1"
##  [19] "AAACGAATCCTCGCAT-1" "AAACGCTAGACGCCCT-1" "AAACGCTAGCTTTGTG-1"
##  [22] "AAACGCTAGTGAACAT-1" "AAACGCTGTAGTATAG-1" "AAACGCTGTTGAAGTA-1"
##  [25] "AAACGCTTCAATGTCG-1" "AAACGCTTCTCAGAAC-1" "AAAGAACAGCGACATG-1"
##  [28] "AAAGAACAGGCTGTAG-1" "AAAGAACAGGGATCTG-1" "AAAGAACCAACAGAGC-1"
##  [31] "AAAGAACGTCGCCACA-1" "AAAGAACGTGGTAACG-1" "AAAGAACTCATGAGGG-1"
##  [34] "AAAGAACTCATTCGTT-1" "AAAGGATAGCTCGCAC-1" "AAAGGATAGTCCGCGT-1"
##  [37] "AAAGGATCAAGGGTCA-1" "AAAGGATCAGAACTTC-1" "AAAGGATGTGCCTAAT-1"
##  [40] "AAAGGATGTGTCCCTT-1" "AAAGGATTCACCGACG-1" "AAAGGATTCACGGTCG-1"
##  [43] "AAAGGGCAGCGAACTG-1" "AAAGGGCAGTAGCAAT-1" "AAAGGGCGTGTGTCCG-1"
##  [46] "AAAGGTAAGAGAGCAA-1" "AAAGGTACAAAGTATG-1" "AAAGGTACAAGACGGT-1"
##  [49] "AAAGGTACACAAGTTC-1" "AAAGGTACACTCCTTG-1" "AAAGGTAGTATAGGAT-1"
##  [52] "AAAGGTAGTGTGCCTG-1" "AAAGGTATCACAAGAA-1" "AAAGGTATCTGCGTCT-1"
##  [55] "AAAGGTATCTGTCTCG-1" "AAAGTCCAGCACACAG-1" "AAAGTCCAGCGGTAGT-1"
##  [58] "AAAGTCCAGGATTCAA-1" "AAAGTCCAGTACAACA-1" "AAAGTCCCACCTGCGA-1"
##  [61] "AAAGTCCGTTCTTGCC-1" "AAAGTCCTCACCTGTC-1" "AAAGTCCTCAGGTGTT-1"
##  [64] "AAAGTCCTCCATAAGC-1" "AAAGTCCTCTATGTGG-1" "AAAGTGAAGAAATGGG-1"
##  [67] "AAAGTGAAGCAGTCTT-1" "AAAGTGAAGCTGTCCG-1" "AAAGTGAAGGTAAGAG-1"
##  [70] "AAAGTGAAGTGATAAC-1" "AAAGTGAAGTTATGGA-1" "AAAGTGACAAGCAATA-1"
##  [73] "AAAGTGACAATAGTGA-1" "AAAGTGAGTATACCCA-1" "AAAGTGAGTATCGTTG-1"
##  [76] "AAAGTGAGTCGGTAAG-1" "AAAGTGAGTGAATGTA-1" "AAAGTGAGTTTCACTT-1"
##  [79] "AAAGTGATCAAATGCC-1" "AAATGGAAGGGCCCTT-1" "AAATGGAGTACAACGG-1"
##  [82] "AAATGGAGTTCCGTTC-1" "AAATGGATCCGCATAA-1" "AAATGGATCGCCGATG-1"
##  [85] "AACAAAGAGCACTTTG-1" "AACAAAGAGCGAGGAG-1" "AACAAAGAGGCTAAAT-1"
##  [88] "AACAAAGCAACAGTGG-1" "AACAAAGCACTGCACG-1" "AACAAAGCAGAGTCAG-1"
##  [91] "AACAAAGGTCGCGGTT-1" "AACAAAGGTGAGGAAA-1" "AACAAAGTCATGCCGG-1"
##  [94] "AACAAAGTCCTGTTAT-1" "AACAAAGTCGTCAGAT-1" "AACAAAGTCTCCGAGG-1"
##  [97] "AACAACCAGATGTTAG-1" "AACAACCCACCCAAGC-1" "AACAACCCATCATCCC-1"
## [100] "AACAACCCATGTGCTA-1"

There are many columns, but we limited the view to the first 100.

rownames() displays the gene names.

rownames(sampleHD1[])[1:100]
##   [1] "ENSG00000290826" "ENSG00000238009" "ENSG00000241860" "ENSG00000286448"
##   [5] "ENSG00000290385" "ENSG00000291215" "LINC01409"       "ENSG00000290784"
##   [9] "LINC00115"       "LINC01128"       "ENSG00000288531" "FAM41C"         
##  [13] "NOC2L"           "KLHL17"          "PLEKHN1"         "ENSG00000272512"
##  [17] "HES4"            "ISG15"           "ENSG00000224969" "AGRN"           
##  [21] "ENSG00000291156" "C1orf159"        "ENSG00000285812" "LINC01342"      
##  [25] "TTLL10"          "TNFRSF18"        "TNFRSF4"         "SDF4"           
##  [29] "B3GALT6"         "C1QTNF12"        "ENSG00000260179" "UBE2J2"         
##  [33] "LINC01786"       "SCNN1D"          "ACAP3"           "PUSL1"          
##  [37] "INTS11"          "CPTP"            "TAS1R3"          "DVL1"           
##  [41] "MXRA8"           "AURKAIP1"        "CCNL2"           "MRPL20-AS1"     
##  [45] "MRPL20"          "MRPL20-DT"       "ANKRD65"         "ATAD3C"         
##  [49] "ATAD3B"          "ENSG00000290916" "ATAD3A"          "TMEM240"        
##  [53] "SSU72"           "ENSG00000215014" "FNDC10"          "ENSG00000286989"
##  [57] "ENSG00000272106" "MIB2"            "MMP23B"          "CDK11B"         
##  [61] "ENSG00000272004" "SLC35E2B"        "CDK11A"          "ENSG00000290854"
##  [65] "NADK"            "GNB1"            "GNB1-DT"         "CFAP74"         
##  [69] "PRKCZ"           "ENSG00000271806" "PRKCZ-AS1"       "FAAP20"         
##  [73] "ENSG00000234396" "SKI"             "ENSG00000287356" "MORN1"          
##  [77] "ENSG00000272420" "RER1"            "PEX10"           "PLCH2"          
##  [81] "ENSG00000224387" "PANK4"           "ENSG00000272449" "TNFRSF14-AS1"   
##  [85] "TNFRSF14"        "ENSG00000228037" "ENSG00000289610" "PRXL2B"         
##  [89] "MMEL1"           "TTC34"           "PRDM16"          "MEGF6"          
##  [93] "ENSG00000238260" "TPRG1L"          "WRAP73"          "TP73"           
##  [97] "CCDC27"          "SMIM1"           "LRRC47"          "ENSG00000272153"

You can see there are a lot of gene names as the row names of the matrix object, some given as gene symbols and some as Ensembl IDs, and we also limited the view to the first 100 rows.

setwd(RDS_objects)

saveRDS(sampleHD1,"sampleHD1")

Let's start importing and saving the other sample folders.

setwd(directoryFolder)

GSM9493321 <- Read10X("GSM9493321")
sampleHD2 <- CreateSeuratObject(counts=GSM9493321, project="GSM9493321", min.cells=3, min.features=200)
View(sampleHD2@meta.data)
setwd(RDS_objects)
saveRDS(sampleHD2, "sampleHD2")
setwd(directoryFolder)

GSM9493322 <- Read10X("GSM9493322")
sampleHD3 <- CreateSeuratObject(counts=GSM9493322, project="GSM9493322", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD3, "sampleHD3")
setwd(directoryFolder)

GSM9493323 <- Read10X("GSM9493323")
sampleHD4 <- CreateSeuratObject(counts=GSM9493323, project="GSM9493323", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD4, "sampleHD4")
setwd(directoryFolder)

GSM9493324 <- Read10X("GSM9493324")
sampleHD5 <- CreateSeuratObject(counts=GSM9493324, project="GSM9493324", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD5,"sampleHD5")
setwd(directoryFolder)

GSM9493325 <- Read10X("GSM9493325")
sampleHD6 <- CreateSeuratObject(counts=GSM9493325, project="GSM9493325", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD6, "sampleHD6")
setwd(directoryFolder)

GSM9493326 <- Read10X("GSM9493326")
sampleHD7 <- CreateSeuratObject(counts=GSM9493326, project="GSM9493326", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD7,"sampleHD7")
setwd(directoryFolder)

GSM9493327 <- Read10X("GSM9493327")
sampleHD8 <- CreateSeuratObject(counts=GSM9493327, project="GSM9493327", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD8, "sampleHD8")
setwd(directoryFolder)

GSM9493328 <- Read10X("GSM9493328")
sampleHD9 <- CreateSeuratObject(counts=GSM9493328, project="GSM9493328", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD9, "sampleHD9")
setwd(directoryFolder)

GSM9493329 <- Read10X("GSM9493329")
sampleHD10 <- CreateSeuratObject(counts=GSM9493329, project="GSM9493329", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD10, "sampleHD10")
setwd(directoryFolder)

GSM9493330 <- Read10X("GSM9493330")
sampleHD11 <- CreateSeuratObject(counts=GSM9493330, project="GSM9493330", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD11, "sampleHD11")
setwd(directoryFolder)

GSM9493331 <- Read10X("GSM9493331")
sampleHD12 <- CreateSeuratObject(counts=GSM9493331, project="GSM9493331", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(sampleHD12, "sampleHD12")
setwd(directoryFolder)

GSM9493332 <- Read10X("GSM9493332")
patient6 <- CreateSeuratObject(counts=GSM9493332, project="GSM9493332", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient6, "patient6")
setwd(directoryFolder)

GSM9493333 <- Read10X("GSM9493333")
patient9 <- CreateSeuratObject(counts=GSM9493333, project="GSM9493333", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient9, "patient9")
setwd(directoryFolder)

GSM9493334 <- Read10X("GSM9493334")
patient14 <- CreateSeuratObject(counts=GSM9493334, project="GSM9493334", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient14, "patient14")
setwd(directoryFolder)

GSM9493335 <- Read10X("GSM9493335")
patient16 <- CreateSeuratObject(counts=GSM9493335, project="GSM9493335", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient16, "patient16")
setwd(directoryFolder)

GSM9493336 <- Read10X("GSM9493336")
patient18 <- CreateSeuratObject(counts=GSM9493336, project="GSM9493336", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient18, "patient18")
setwd(directoryFolder)

GSM9493337 <- Read10X("GSM9493337")
patient19 <- CreateSeuratObject(counts=GSM9493337, project="GSM9493337", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient19, "patient19")
setwd(directoryFolder)

GSM9493338 <- Read10X("GSM9493338")
patient20 <- CreateSeuratObject(counts=GSM9493338, project="GSM9493338", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient20, "patient20")
setwd(directoryFolder)

GSM9493339 <- Read10X("GSM9493339")
patient22 <- CreateSeuratObject(counts=GSM9493339, project="GSM9493339", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient22, "patient22")
setwd(directoryFolder)

GSM9493340 <- Read10X("GSM9493340")
patient25 <- CreateSeuratObject(counts=GSM9493340, project="GSM9493340", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient25, "patient25")
setwd(directoryFolder)

GSM9493341 <- Read10X("GSM9493341")
patient27 <- CreateSeuratObject(counts=GSM9493341, project="GSM9493341", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient27, "patient27")
setwd(directoryFolder)

GSM9493342 <- Read10X("GSM9493342")
patient30 <- CreateSeuratObject(counts=GSM9493342, project="GSM9493342", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient30, "patient30")
setwd(directoryFolder)

GSM9493343 <- Read10X("GSM9493343")
patient31 <- CreateSeuratObject(counts=GSM9493343, project="GSM9493343", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient31, "patient31")
setwd(directoryFolder)

GSM9493344 <- Read10X("GSM9493344")
patient36 <- CreateSeuratObject(counts=GSM9493344, project="GSM9493344", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient36, "patient36")
setwd(directoryFolder)

GSM9493345 <- Read10X("GSM9493345")
patient37 <- CreateSeuratObject(counts=GSM9493345, project="GSM9493345", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient37, "patient37")
setwd(directoryFolder)

GSM9493346 <- Read10X("GSM9493346")
patient38 <- CreateSeuratObject(counts=GSM9493346, project="GSM9493346", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient38, "patient38")
setwd(directoryFolder)

GSM9493347 <- Read10X("GSM9493347")
patient39 <- CreateSeuratObject(counts=GSM9493347, project="GSM9493347", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient39, "patient39")
setwd(directoryFolder)

GSM9493348 <- Read10X("GSM9493348")
patient41 <- CreateSeuratObject(counts=GSM9493348, project="GSM9493348", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient41, "patient41")
setwd(directoryFolder)

GSM9493349 <- Read10X("GSM9493349")
patient44 <- CreateSeuratObject(counts=GSM9493349, project="GSM9493349", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient44, "patient44")
setwd(directoryFolder)

GSM9493350 <- Read10X("GSM9493350")
patient51 <- CreateSeuratObject(counts=GSM9493350, project="GSM9493350", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient51, "patient51")
setwd(directoryFolder)

GSM9493351 <- Read10X("GSM9493351")
patient52 <- CreateSeuratObject(counts=GSM9493351, project="GSM9493351", min.cells=3, min.features=200)
setwd(RDS_objects)

saveRDS(patient52, "patient52")

We saved all the sampleHD and patient objects as RDS files. There is another tutorial on merging these RDS files.
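The repeated read, create, and save blocks above all follow one pattern, so they could also be written as a loop. This is only a minimal sketch, not the code that was run for this post: the helper name `build_and_save`, the use of `file.path()` instead of `setwd()`, and the `.RDS` extension are assumptions for illustration. The GSM IDs and object names are the ones used in this post, with GSM9493320 to GSM9493325 taken to be sampleHD1 to sampleHD6. The loop only runs when Seurat is installed and both folder variables exist.

```r
# GSM IDs mapped to the object names used in this post.
gsm_ids <- paste0("GSM", 9493320:9493351)
sample_names <- c(
  paste0("sampleHD", 1:12),
  paste0("patient", c(6, 9, 14, 16, 18, 19, 20, 22, 25, 27,
                      30, 31, 36, 37, 38, 39, 41, 44, 51, 52))
)
sample_map <- setNames(sample_names, gsm_ids)

# Hypothetical helper: read one 10x folder, build the Seurat object, save it.
build_and_save <- function(gsm, name, data_dir, rds_dir) {
  counts <- Seurat::Read10X(file.path(data_dir, gsm))
  obj <- Seurat::CreateSeuratObject(counts = counts, project = gsm,
                                    min.cells = 3, min.features = 200)
  # Assumption: files get an .RDS extension so readRDS() paths are predictable.
  saveRDS(obj, file.path(rds_dir, paste0(name, ".RDS")))
  invisible(obj)
}

# Guard so the sketch does nothing when Seurat or the folders are missing.
can_run <- requireNamespace("Seurat", quietly = TRUE) &&
  exists("directoryFolder") && exists("RDS_objects")
if (can_run) {
  for (gsm in names(sample_map)) {
    build_and_save(gsm, sample_map[[gsm]], directoryFolder, RDS_objects)
  }
}
```

Because the loop does not keep each object in the environment, it should use less memory than the step-by-step version, but any object needed later has to be read back in with readRDS().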

We still have these objects in our environment and they take up quite a bit of memory. Let's delete them after verifying that we have all the files.

ls()
##  [1] "directoryFolder"  "GSM9493320"       "GSM9493321"       "GSM9493322"      
##  [5] "GSM9493323"       "GSM9493324"       "GSM9493325"       "GSM9493326"      
##  [9] "GSM9493327"       "GSM9493328"       "GSM9493329"       "GSM9493330"      
## [13] "GSM9493331"       "GSM9493332"       "GSM9493333"       "GSM9493334"      
## [17] "GSM9493335"       "GSM9493336"       "GSM9493337"       "GSM9493338"      
## [21] "GSM9493339"       "GSM9493340"       "GSM9493341"       "GSM9493342"      
## [25] "GSM9493343"       "GSM9493344"       "GSM9493345"       "GSM9493346"      
## [29] "GSM9493347"       "GSM9493348"       "GSM9493349"       "GSM9493350"      
## [33] "GSM9493351"       "patient14"        "patient16"        "patient18"       
## [37] "patient19"        "patient20"        "patient22"        "patient25"       
## [41] "patient27"        "patient30"        "patient31"        "patient36"       
## [45] "patient37"        "patient38"        "patient39"        "patient41"       
## [49] "patient44"        "patient51"        "patient52"        "patient6"        
## [53] "patient9"         "RDS_objects"      "sampleHD1"        "sampleHD10"      
## [57] "sampleHD11"       "sampleHD12"       "sampleHD2"        "sampleHD3"       
## [61] "sampleHD4"        "sampleHD5"        "sampleHD6"        "sampleHD7"       
## [65] "sampleHD8"        "sampleHD9"        "seriesInfo1"      "seriesInfoDesign"
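Before deleting anything, it may help to confirm that every object really has a file on disk. The sketch below assumes the files were saved under the object name with no extension, as in the saveRDS() calls above; `missing_rds` is a hypothetical helper name, and the demo uses a temporary folder rather than the real RDS_objects folder.

```r
# Hypothetical helper: return the names that have no file in rds_dir.
missing_rds <- function(object_names, rds_dir) {
  object_names[!file.exists(file.path(rds_dir, object_names))]
}

# Demo in a temporary folder with one saved object and one missing.
demo_dir <- file.path(tempdir(), "rds_check_demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(1:3, file.path(demo_dir, "sampleHD1"))
missing_demo <- missing_rds(c("sampleHD1", "sampleHD2"), demo_dir)
print(missing_demo)
```

In the real session the call would look something like missing_rds(ls(pattern = "^(sampleHD|patient)[0-9]+$"), RDS_objects), and an empty result would mean nothing is missing.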

Let's remove the GSM objects, which are the raw count matrices returned by Read10X().
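Instead of typing every name, the GSM objects can be picked out by a name pattern. This is a small sketch run inside a throwaway environment (`demo_env` is a made-up name) so that it does not touch the real workspace.

```r
# Throwaway environment standing in for the global environment.
demo_env <- new.env()
assign("GSM9493320", 1, envir = demo_env)
assign("GSM9493321", 2, envir = demo_env)
assign("sampleHD1", 3, envir = demo_env)

# Select only the objects whose names are "GSM" followed by digits.
gsm_objects <- ls(demo_env, pattern = "^GSM[0-9]+$")
rm(list = gsm_objects, envir = demo_env)

remaining <- ls(demo_env)
print(remaining)
```

In the global environment the equivalent would be rm(list = ls(pattern = "^GSM[0-9]+$")). The explicit call that was actually used is kept below.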

rm("GSM9493320", "GSM9493321", "GSM9493322", "GSM9493323",
   "GSM9493324", "GSM9493325", "GSM9493326", "GSM9493327",
   "GSM9493328", "GSM9493329", "GSM9493330", "GSM9493331",
   "GSM9493332", "GSM9493333", "GSM9493334", "GSM9493335",
   "GSM9493336", "GSM9493337", "GSM9493338", "GSM9493339",
   "GSM9493340", "GSM9493341", "GSM9493342", "GSM9493343",
   "GSM9493344", "GSM9493345", "GSM9493346", "GSM9493347",
   "GSM9493348", "GSM9493349", "GSM9493350", "GSM9493351")

====================================================

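Then you can open the folder and find each Seurat object saved as an RDS file. Video tutorial 3 here

Go to the folder where the RDS objects were stored earlier (RDS_objects) and read one back in with readRDS(). The file name has to match the name given to saveRDS(); the calls above wrote names such as "sampleHD7" with no extension, so add or drop the .RDS ending as needed.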

sampleHD1 <- readRDS("C:/Users/jlcor/OneDrive/Desktop/EBV and nonHodgkin aggressive lymphoma NK tcell type/GSE318371_RAW/sampleHD1.RDS")

After reading in the other objects, such as sampleHD2 and sampleHD3, merge them:

mergedSamples <- merge(sampleHD1, y=c(sampleHD2, sampleHD3), add.cell.ids=c("sampleHD1","sampleHD2","sampleHD3"), project="mergedSamples")

ls()

This merge essentially row-binds the cells of the three objects. Because the same barcode is not allowed twice in one object, the sample ID is prepended to each barcode. The metadata columns stay the same, and the ID shows whether a cell came from HD1, HD2, or HD3. This demo was adapted from the tutorial data for this dataset and has not yet been tested to see if the merge works.
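The effect of the prepended ID can be shown without Seurat. The sketch below uses plain character vectors and assumes the ID is joined to the barcode with an underscore, which is how Seurat's merge() appears to name cells when add.cell.ids is given. The barcodes are taken from the healthy1 and healthy2 output shown further down.

```r
# Two samples that happen to share one barcode.
hd1 <- c("AAACCCACAGCATTGT-1", "ACATCCCTCCCTCTAG-1")
hd2 <- c("AAACCCAAGCACTCTA-1", "ACATCCCTCCCTCTAG-1")

# Without a prefix the combined barcodes contain a duplicate.
has_duplicate_raw <- anyDuplicated(c(hd1, hd2)) > 0

# With the sample ID prepended every cell name is unique again.
prefixed <- c(paste("sampleHD1", hd1, sep = "_"),
              paste("sampleHD2", hd2, sep = "_"))
has_duplicate_prefixed <- anyDuplicated(prefixed) > 0

print(prefixed)
```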

Then save with saveRDS(mergedSamples, file="C:/Users/jlcor/OneDrive/Desktop/EBV and nonHodgkin aggressive lymphoma NK tcell type/GSE318371_RAW/mergedSamples.RDS")

===================================================================

healthy1DF <- data.frame(sampleHD1@meta.data)
#colnames(healthy1DF) <- c("h1",'h1_counts','h1_features')
healthy1DF$sample <- 'healthy1'
healthy1DF$barcode <- row.names(healthy1DF)
head(healthy1DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCACAGCATTGT-1 GSM9493320       5606         2534 healthy1
## AAACCCACATCCGAAT-1 GSM9493320       6841         2964 healthy1
## AAACCCAGTACTAGCT-1 GSM9493320       6104         2427 healthy1
## AAACCCAGTCACTCAA-1 GSM9493320       7370         2839 healthy1
## AAACCCAGTGGTAACG-1 GSM9493320       5848         2324 healthy1
## AAACCCATCAATGCAC-1 GSM9493320      14448         4073 healthy1
##                               barcode
## AAACCCACAGCATTGT-1 AAACCCACAGCATTGT-1
## AAACCCACATCCGAAT-1 AAACCCACATCCGAAT-1
## AAACCCAGTACTAGCT-1 AAACCCAGTACTAGCT-1
## AAACCCAGTCACTCAA-1 AAACCCAGTCACTCAA-1
## AAACCCAGTGGTAACG-1 AAACCCAGTGGTAACG-1
## AAACCCATCAATGCAC-1 AAACCCATCAATGCAC-1
healthy2DF <- data.frame(sampleHD2@meta.data)
#colnames(healthy2DF) <- c("h2",'h2_counts','h2_features')
healthy2DF$sample <- 'healthy2'
healthy2DF$barcode <- row.names(healthy2DF)
head(healthy2DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCAAGCACTCTA-1 GSM9493321      11161         3536 healthy2
## AAACCCAAGGCTGTAG-1 GSM9493321       2558         1505 healthy2
## AAACCCAAGTAGACCG-1 GSM9493321       6763         2873 healthy2
## AAACCCAAGTCAACAA-1 GSM9493321       5227         2434 healthy2
## AAACCCACAACTCATG-1 GSM9493321       2280         1162 healthy2
## AAACCCACAAGTGCAG-1 GSM9493321       3879         2064 healthy2
##                               barcode
## AAACCCAAGCACTCTA-1 AAACCCAAGCACTCTA-1
## AAACCCAAGGCTGTAG-1 AAACCCAAGGCTGTAG-1
## AAACCCAAGTAGACCG-1 AAACCCAAGTAGACCG-1
## AAACCCAAGTCAACAA-1 AAACCCAAGTCAACAA-1
## AAACCCACAACTCATG-1 AAACCCACAACTCATG-1
## AAACCCACAAGTGCAG-1 AAACCCACAAGTGCAG-1
healthy3DF <- data.frame(sampleHD3@meta.data)
#colnames(healthy3DF) <- c("h3",'h3_counts','h3_features')
healthy3DF$sample <- 'healthy3'

healthy3DF$barcode <- row.names(healthy3DF)
head(healthy3DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCACAGGCAATG-1 GSM9493322       5265         2348 healthy3
## AAACCCACATGTTCGA-1 GSM9493322       5948         2482 healthy3
## AAACCCAGTAAGATTG-1 GSM9493322       6445         2647 healthy3
## AAACCCAGTACGACAG-1 GSM9493322       3437         1794 healthy3
## AAACCCAGTATTCCTT-1 GSM9493322       7894         3099 healthy3
## AAACCCAGTCATTGCA-1 GSM9493322       9716         3322 healthy3
##                               barcode
## AAACCCACAGGCAATG-1 AAACCCACAGGCAATG-1
## AAACCCACATGTTCGA-1 AAACCCACATGTTCGA-1
## AAACCCAGTAAGATTG-1 AAACCCAGTAAGATTG-1
## AAACCCAGTACGACAG-1 AAACCCAGTACGACAG-1
## AAACCCAGTATTCCTT-1 AAACCCAGTATTCCTT-1
## AAACCCAGTCATTGCA-1 AAACCCAGTCATTGCA-1
healthy4DF <- data.frame(sampleHD4@meta.data)
#colnames(healthy4DF) <- c("h4",'h4_counts','h4_features')
healthy4DF$sample <- 'healthy4'

healthy4DF$barcode <- row.names(healthy4DF)
head(healthy4DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCAAGAAGCCTG-1 GSM9493323       2764         1591 healthy4
## AAACCCAAGCCATTGT-1 GSM9493323      11077         3577 healthy4
## AAACCCAAGCGTGTCC-1 GSM9493323       3445         1425 healthy4
## AAACCCAAGGCTTAAA-1 GSM9493323       2005         1079 healthy4
## AAACCCACACCGGTCA-1 GSM9493323       4722         2252 healthy4
## AAACCCACAGCGATTT-1 GSM9493323       5075         2424 healthy4
##                               barcode
## AAACCCAAGAAGCCTG-1 AAACCCAAGAAGCCTG-1
## AAACCCAAGCCATTGT-1 AAACCCAAGCCATTGT-1
## AAACCCAAGCGTGTCC-1 AAACCCAAGCGTGTCC-1
## AAACCCAAGGCTTAAA-1 AAACCCAAGGCTTAAA-1
## AAACCCACACCGGTCA-1 AAACCCACACCGGTCA-1
## AAACCCACAGCGATTT-1 AAACCCACAGCGATTT-1
healthy5DF <- data.frame(sampleHD5@meta.data)
#colnames(healthy5DF) <- c("h5",'h5_counts','h5_features')
healthy5DF$sample <- 'healthy5'

healthy5DF$barcode <- row.names(healthy5DF)
head(healthy5DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCAAGAGAGCAA-1 GSM9493324      15813         4355 healthy5
## AAACCCAAGAGGCGTT-1 GSM9493324      10103         3243 healthy5
## AAACCCAAGGACTTCT-1 GSM9493324       6251         2738 healthy5
## AAACCCAAGGTTCACT-1 GSM9493324      10841         3342 healthy5
## AAACCCAAGTCGCTAT-1 GSM9493324        776          334 healthy5
## AAACCCACAACCACGC-1 GSM9493324       8280         2928 healthy5
##                               barcode
## AAACCCAAGAGAGCAA-1 AAACCCAAGAGAGCAA-1
## AAACCCAAGAGGCGTT-1 AAACCCAAGAGGCGTT-1
## AAACCCAAGGACTTCT-1 AAACCCAAGGACTTCT-1
## AAACCCAAGGTTCACT-1 AAACCCAAGGTTCACT-1
## AAACCCAAGTCGCTAT-1 AAACCCAAGTCGCTAT-1
## AAACCCACAACCACGC-1 AAACCCACAACCACGC-1
healthy6DF <- data.frame(sampleHD6@meta.data)
#colnames(healthy6DF) <- c("h6",'h6_counts','h6_features')
healthy6DF$sample <- 'healthy6'

healthy6DF$barcode <- row.names(healthy6DF)
head(healthy6DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCAAGGGCTGAT-1 GSM9493325       8614         2985 healthy6
## AAACCCACAACCGGAA-1 GSM9493325       6718         2856 healthy6
## AAACCCACAAGGTCAG-1 GSM9493325       5637         2612 healthy6
## AAACCCACACTCTAGA-1 GSM9493325       6069         2732 healthy6
## AAACCCACACTCTGCT-1 GSM9493325       5589         2454 healthy6
## AAACCCACAGAGGCAT-1 GSM9493325       4674         2031 healthy6
##                               barcode
## AAACCCAAGGGCTGAT-1 AAACCCAAGGGCTGAT-1
## AAACCCACAACCGGAA-1 AAACCCACAACCGGAA-1
## AAACCCACAAGGTCAG-1 AAACCCACAAGGTCAG-1
## AAACCCACACTCTAGA-1 AAACCCACACTCTAGA-1
## AAACCCACACTCTGCT-1 AAACCCACACTCTGCT-1
## AAACCCACAGAGGCAT-1 AAACCCACAGAGGCAT-1
healthy7DF <- data.frame(sampleHD7@meta.data)
#colnames(healthy7DF) <- c("h7",'h7_counts','h7_features')
healthy7DF$sample <- 'healthy7'

healthy7DF$barcode <- row.names(healthy7DF)
head(healthy7DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCAAGACGGAAA-1 GSM9493326        412          232 healthy7
## AAACCCAAGGTGCTAG-1 GSM9493326        632          365 healthy7
## AAACCCAAGTTGTACC-1 GSM9493326       4339         1586 healthy7
## AAACCCACACCCATAA-1 GSM9493326       3536         1416 healthy7
## AAACCCACAGAGTGAC-1 GSM9493326       1929          943 healthy7
## AAACCCACAGTATTCG-1 GSM9493326       8275         2967 healthy7
##                               barcode
## AAACCCAAGACGGAAA-1 AAACCCAAGACGGAAA-1
## AAACCCAAGGTGCTAG-1 AAACCCAAGGTGCTAG-1
## AAACCCAAGTTGTACC-1 AAACCCAAGTTGTACC-1
## AAACCCACACCCATAA-1 AAACCCACACCCATAA-1
## AAACCCACAGAGTGAC-1 AAACCCACAGAGTGAC-1
## AAACCCACAGTATTCG-1 AAACCCACAGTATTCG-1
healthy8DF <- data.frame(sampleHD8@meta.data)
#colnames(healthy8DF) <- c("h8",'h8_counts','h8_features')
healthy8DF$sample <- 'healthy8'

healthy8DF$barcode <- row.names(healthy8DF)
head(healthy8DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCAAGAAACCCG-1 GSM9493327       4005         1427 healthy8
## AAACCCAAGCAGGCTA-1 GSM9493327       6821         2553 healthy8
## AAACCCAAGCCTAGGA-1 GSM9493327       4187         1670 healthy8
## AAACCCAAGGTTCATC-1 GSM9493327      10180         3340 healthy8
## AAACCCACAAATACGA-1 GSM9493327       8079         2390 healthy8
## AAACCCACACACACTA-1 GSM9493327       2072          967 healthy8
##                               barcode
## AAACCCAAGAAACCCG-1 AAACCCAAGAAACCCG-1
## AAACCCAAGCAGGCTA-1 AAACCCAAGCAGGCTA-1
## AAACCCAAGCCTAGGA-1 AAACCCAAGCCTAGGA-1
## AAACCCAAGGTTCATC-1 AAACCCAAGGTTCATC-1
## AAACCCACAAATACGA-1 AAACCCACAAATACGA-1
## AAACCCACACACACTA-1 AAACCCACACACACTA-1
healthy9DF <- data.frame(sampleHD9@meta.data)
#colnames(healthy9DF) <- c("h9",'h9_counts','h9_features')
healthy9DF$sample <- 'healthy9'

healthy9DF$barcode <- row.names(healthy9DF)
head(healthy9DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCCAAGCGGTAAC-1 GSM9493328       4587         2005 healthy9
## AAACCCAAGGCAGTCA-1 GSM9493328      14668         4058 healthy9
## AAACCCAAGGGATCAC-1 GSM9493328       3762         1725 healthy9
## AAACCCAAGGGTACGT-1 GSM9493328       5330         1755 healthy9
## AAACCCAAGGTAGTCA-1 GSM9493328        237          208 healthy9
## AAACCCAAGTCATCGT-1 GSM9493328       3140          878 healthy9
##                               barcode
## AAACCCAAGCGGTAAC-1 AAACCCAAGCGGTAAC-1
## AAACCCAAGGCAGTCA-1 AAACCCAAGGCAGTCA-1
## AAACCCAAGGGATCAC-1 AAACCCAAGGGATCAC-1
## AAACCCAAGGGTACGT-1 AAACCCAAGGGTACGT-1
## AAACCCAAGGTAGTCA-1 AAACCCAAGGTAGTCA-1
## AAACCCAAGTCATCGT-1 AAACCCAAGTCATCGT-1
healthy10DF <- data.frame(sampleHD10@meta.data)
#colnames(healthy10DF) <- c("h10",'h10_counts','h10_features')
healthy10DF$sample <- 'healthy10'

healthy10DF$barcode <- row.names(healthy10DF)
head(healthy10DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGCACGGAT-1 GSM9493329      13236         3701 healthy10
## AAACCCAAGCGTCAAG-1 GSM9493329       2654         1114 healthy10
## AAACCCAAGGAACGAA-1 GSM9493329       5531         1883 healthy10
## AAACCCAAGGCTCACC-1 GSM9493329       4045         1577 healthy10
## AAACCCAAGGTCATCT-1 GSM9493329       7792         2965 healthy10
## AAACCCAAGTCACGAG-1 GSM9493329       2913         1264 healthy10
##                               barcode
## AAACCCAAGCACGGAT-1 AAACCCAAGCACGGAT-1
## AAACCCAAGCGTCAAG-1 AAACCCAAGCGTCAAG-1
## AAACCCAAGGAACGAA-1 AAACCCAAGGAACGAA-1
## AAACCCAAGGCTCACC-1 AAACCCAAGGCTCACC-1
## AAACCCAAGGTCATCT-1 AAACCCAAGGTCATCT-1
## AAACCCAAGTCACGAG-1 AAACCCAAGTCACGAG-1
healthy11DF <- data.frame(sampleHD11@meta.data)
#colnames(healthy11DF) <- c("h11",'h11_counts','h11_features')
healthy11DF$sample <- 'healthy11'

healthy11DF$barcode <- row.names(healthy11DF)
head(healthy11DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGTATGCAA-1 GSM9493330      13816         4026 healthy11
## AAACCCAAGTATGTAG-1 GSM9493330       2565          878 healthy11
## AAACCCACAAAGTATG-1 GSM9493330       4423         1640 healthy11
## AAACCCACAAGACGAC-1 GSM9493330       4726         1664 healthy11
## AAACCCACAAGTGTCT-1 GSM9493330       3991         1435 healthy11
## AAACCCACAGAGTCAG-1 GSM9493330       5142         2448 healthy11
##                               barcode
## AAACCCAAGTATGCAA-1 AAACCCAAGTATGCAA-1
## AAACCCAAGTATGTAG-1 AAACCCAAGTATGTAG-1
## AAACCCACAAAGTATG-1 AAACCCACAAAGTATG-1
## AAACCCACAAGACGAC-1 AAACCCACAAGACGAC-1
## AAACCCACAAGTGTCT-1 AAACCCACAAGTGTCT-1
## AAACCCACAGAGTCAG-1 AAACCCACAGAGTCAG-1
healthy12DF <- data.frame(sampleHD12@meta.data)
#colnames(healthy12DF) <- c("h12",'h12_counts','h12_features')
healthy12DF$sample <- 'healthy12'

healthy12DF$barcode <- row.names(healthy12DF)
head(healthy12DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGGCATCGA-1 GSM9493331       7584         2581 healthy12
## AAACCCAAGGTCCGAA-1 GSM9493331       3517         1300 healthy12
## AAACCCAAGGTTATAG-1 GSM9493331      11275         3928 healthy12
## AAACCCAAGTATGATG-1 GSM9493331       5638         2474 healthy12
## AAACCCAAGTATTGCC-1 GSM9493331        816          504 healthy12
## AAACCCAAGTGGAATT-1 GSM9493331       3340         1142 healthy12
##                               barcode
## AAACCCAAGGCATCGA-1 AAACCCAAGGCATCGA-1
## AAACCCAAGGTCCGAA-1 AAACCCAAGGTCCGAA-1
## AAACCCAAGGTTATAG-1 AAACCCAAGGTTATAG-1
## AAACCCAAGTATGATG-1 AAACCCAAGTATGATG-1
## AAACCCAAGTATTGCC-1 AAACCCAAGTATTGCC-1
## AAACCCAAGTGGAATT-1 AAACCCAAGTGGAATT-1

Let's merge the data frames of our 12 healthy samples on barcode to see which barcodes, if any, are common to all samples.

H1H2 <- merge(healthy1DF, healthy2DF, by.x="barcode",by.y="barcode")
H1H2$barcode
##  [1] "ACATCCCTCCCTCTAG-1" "ACTGTGACAGACCGCT-1" "ACTTCGCTCGTTTACT-1"
##  [4] "ATAGACCGTTGTCAGT-1" "ATCACAGGTACGGCAA-1" "ATCAGGTGTTGTGCCG-1"
##  [7] "ATGAGGGGTTCGGTAT-1" "CAATACGCAAGGCCTC-1" "CAGCAGCTCTTCCGTG-1"
## [10] "CGAAGGAAGGATGGCT-1" "CTCAATTGTGCGTTTA-1" "CTGCGAGTCGATTCCC-1"
## [13] "GACGTTACATGGCACC-1" "GAGTCTAGTACGCTAT-1" "GATTCGAAGTAGGATT-1"
## [16] "GCCAGTGCATTACGGT-1" "GGAATGGGTTACAGCT-1" "GGCTTTCAGTCGCCAC-1"
## [19] "GGGCCATCAATACAGA-1" "GGTGAAGAGTTGTCGT-1" "GGTGTTACACCGTGGT-1"
## [22] "GTAGCTAAGGTACTGG-1" "GTCGAATGTATGTGTC-1" "GTGATGTGTTCGGCCA-1"
## [25] "GTGGGAAGTTTGGAAA-1" "GTTACGATCGTTACCC-1" "GTTGTAGCACAACGTT-1"
## [28] "TAGACTGGTACAGTAA-1" "TAGTGCACATTGCCGG-1" "TATCAGGCAAATACGA-1"
## [31] "TCACTATCACTCCTGT-1" "TCGACGGAGCGTGTCC-1" "TCGCAGGTCCCAAGTA-1"
## [34] "TCGGTCTTCTTACTGT-1" "TGAATGCGTGTGTGTT-1" "TGGAACTAGTAGGATT-1"
## [37] "TGTAACGGTGAGGAAA-1" "TTCATGTTCTTAAGGC-1" "TTCCTCTCACTGCTTC-1"
## [40] "TTGTTTGCATGAGTAA-1"
H1H2H3 <- merge(H1H2, healthy3DF, by.x="barcode",by.y="barcode")
H1H2H3$barcode
## character(0)

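Early in the merge, no barcode is common to all of the first 3 healthy samples. This is expected: 10x barcodes come from a fixed list of sequences and are only unique within one sample, so the 40 barcodes shared by healthy1 and healthy2 are coincidences, not the same cells.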
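The same check can be made for any number of samples at once by intersecting the barcode columns. This is a minimal sketch with toy data frames; `shared_barcodes` is a hypothetical helper, and the barcodes are borrowed from the output above only as examples.

```r
# Hypothetical helper: barcodes present in every data frame of the list.
shared_barcodes <- function(df_list) {
  Reduce(intersect, lapply(df_list, function(d) d$barcode))
}

h1 <- data.frame(barcode = c("ACATCCCTCCCTCTAG-1", "AAACCCACAGCATTGT-1"),
                 stringsAsFactors = FALSE)
h2 <- data.frame(barcode = c("ACATCCCTCCCTCTAG-1", "AAACCCAAGCACTCTA-1"),
                 stringsAsFactors = FALSE)
h3 <- data.frame(barcode = "AAACCCACAGGCAATG-1", stringsAsFactors = FALSE)

two_way <- shared_barcodes(list(h1, h2))        # one barcode in common
three_way <- shared_barcodes(list(h1, h2, h3))  # nothing in common
print(two_way)
print(length(three_way))
```

With the real data the list would hold healthy1DF through healthy12DF. Unlike merge(), this returns only the barcodes and does not carry the count columns along.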

Instead we will stack the healthy data frames lengthwise with rbind(). The sample column we added keeps track of which sample each row came from.

combinedHealthy12 <- rbind(healthy1DF,healthy2DF,healthy3DF,healthy4DF,healthy5DF,healthy6DF,healthy7DF,healthy8DF,healthy9DF,healthy10DF,healthy11DF,healthy12DF)
dim(combinedHealthy12)
## [1] 142327      5

There are 142,327 barcodes in all healthy samples. Take every 9,000th row, plus the last row, to spot-check the results for these samples.

combinedHealthy12[c(seq(9000, 135000, by=9000), nrow(combinedHealthy12)),]
##                     orig.ident nCount_RNA nFeature_RNA    sample
## TTGTTCAGTCGCAACC-1  GSM9493320       5562         2695  healthy1
## GTCACTCGTTGGCCGT-1  GSM9493321       4542         1995  healthy2
## CTGCGAGCATAAGCGG-1  GSM9493322      11196         4006  healthy3
## CATCGGGAGCTGTTAC-1  GSM9493323       1622          916  healthy4
## AACAACCCATGTTCGA-1  GSM9493324      10639         3511  healthy5
## AATGGCTAGTGGTCAG-1  GSM9493325       6285         2360  healthy6
## TACATTCTCTGCCTCA-1  GSM9493325       7475         2820  healthy6
## CGTTAGACAATTGTGC-1  GSM9493326       4632         2012  healthy7
## ACGATCACAACTGGTT-1  GSM9493327       6117         2591  healthy8
## TCCTCTTTCCTGGCTT-1  GSM9493327       2657         1315  healthy8
## GGGACTCGTTCGTTCC-1  GSM9493328       4601         1243  healthy9
## CGCCATTCAAATGGCG-1  GSM9493329       3700         1202 healthy10
## AGGGTTTCATCACCAA-11 GSM9493330       8206         2748 healthy11
## TTCCAATAGCGTTCCG-1  GSM9493330       2968         1148 healthy11
## GATGATCAGAGAGTGA-1  GSM9493331       2563          858 healthy12
## TTTGTTGTCTACTTCA-1  GSM9493331       6889         2390 healthy12
##                                barcode
## TTGTTCAGTCGCAACC-1  TTGTTCAGTCGCAACC-1
## GTCACTCGTTGGCCGT-1  GTCACTCGTTGGCCGT-1
## CTGCGAGCATAAGCGG-1  CTGCGAGCATAAGCGG-1
## CATCGGGAGCTGTTAC-1  CATCGGGAGCTGTTAC-1
## AACAACCCATGTTCGA-1  AACAACCCATGTTCGA-1
## AATGGCTAGTGGTCAG-1  AATGGCTAGTGGTCAG-1
## TACATTCTCTGCCTCA-1  TACATTCTCTGCCTCA-1
## CGTTAGACAATTGTGC-1  CGTTAGACAATTGTGC-1
## ACGATCACAACTGGTT-1  ACGATCACAACTGGTT-1
## TCCTCTTTCCTGGCTT-1  TCCTCTTTCCTGGCTT-1
## GGGACTCGTTCGTTCC-1  GGGACTCGTTCGTTCC-1
## CGCCATTCAAATGGCG-1  CGCCATTCAAATGGCG-1
## AGGGTTTCATCACCAA-11 AGGGTTTCATCACCAA-1
## TTCCAATAGCGTTCCG-1  TTCCAATAGCGTTCCG-1
## GATGATCAGAGAGTGA-1  GATGATCAGAGAGTGA-1
## TTTGTTGTCTACTTCA-1  TTTGTTGTCTACTTCA-1
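One row name in the output above ends in -11 while its barcode column still ends in -1. That comes from rbind(): data frame row names must be unique, so when the same barcode appears in a second sample, rbind() appends a digit to the repeated row name. The sketch below reproduces this with two one-row data frames; the second count value is made up.

```r
# Two samples containing the same barcode as a row name.
a <- data.frame(nCount_RNA = 8206, row.names = "AGGGTTTCATCACCAA-1")
b <- data.frame(nCount_RNA = 100,  row.names = "AGGGTTTCATCACCAA-1")

# Copy the row names into a column before binding, as done above.
a$barcode <- row.names(a)
b$barcode <- row.names(b)

both <- rbind(a, b)
print(row.names(both))  # the second row name gets an extra "1"
print(both$barcode)     # the barcode column keeps the original value
```

This is why the barcode column, not the row names, is the safer key once samples are stacked.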

The ‘orig.ident’ column shows different GSM sample IDs, as it should. We added the healthy sample label as well as the barcode taken from the row names. Let's write this out to csv as the combined barcodes of the 12 healthy samples.

write.csv(combinedHealthy12,'combinedHealthy12.csv',row.names=FALSE)

We can do the same for the 20 patient samples and then combine the two datasets. Let's clean out our data environment.

rm(sampleHD1,sampleHD2,sampleHD3,sampleHD4,sampleHD5,sampleHD6,sampleHD7,sampleHD8,sampleHD9,sampleHD10,sampleHD11,sampleHD12,H1H2,H1H2H3,healthy10DF,healthy11DF,healthy12DF,healthy1DF,healthy2DF,healthy3DF,healthy4DF,healthy5DF,healthy6DF,healthy7DF,healthy8DF,healthy9DF)

Now only the patient objects are left. We make each one into a data frame and add the sample and barcode columns before row binding them and writing the result out to csv.
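The per-patient steps that follow repeat the same three lines, so they could be wrapped in a small function and applied to a named list. This is a sketch only: `tag_metadata` is a hypothetical helper, and the toy data frames stand in for patient6@meta.data and patient9@meta.data, using a few values from the output below.

```r
# Hypothetical helper: add the sample label and barcode columns.
tag_metadata <- function(meta, sample_label) {
  df <- data.frame(meta)
  df$sample <- sample_label
  df$barcode <- row.names(df)
  df
}

# Toy metadata; with the real objects each entry would be object@meta.data.
meta_list <- list(
  patient6 = data.frame(orig.ident = "GSM9493332",
                        nCount_RNA = c(10238, 6831),
                        row.names = c("AAACCTGAGGCTCTTA-1", "AAACCTGAGTGCCAGA-1")),
  patient9 = data.frame(orig.ident = "GSM9493333",
                        nCount_RNA = 2939,
                        row.names = "AAACCTGAGACCCACC-1")
)

# Tag every data frame with its list name, then stack them.
tagged <- Map(tag_metadata, meta_list, names(meta_list))
combined <- do.call(rbind, unname(tagged))
print(table(combined$sample))
```

The unname() call keeps rbind() from prefixing the row names with the list names. The explicit one-patient-at-a-time version is what follows.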

patient6DF <- data.frame(patient6@meta.data)

patient6DF$sample <- 'patient6'
patient6DF$barcode <- row.names(patient6DF)

head(patient6DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCTGAGGCTCTTA-1 GSM9493332      10238         2862 patient6
## AAACCTGAGTGCCAGA-1 GSM9493332       6831         2412 patient6
## AAACCTGCAAGTTAAG-1 GSM9493332       5831         2364 patient6
## AAACCTGCAATCACAC-1 GSM9493332       8311         2561 patient6
## AAACCTGCACGGCCAT-1 GSM9493332       6643         2360 patient6
## AAACCTGCATGCATGT-1 GSM9493332       4568         1688 patient6
##                               barcode
## AAACCTGAGGCTCTTA-1 AAACCTGAGGCTCTTA-1
## AAACCTGAGTGCCAGA-1 AAACCTGAGTGCCAGA-1
## AAACCTGCAAGTTAAG-1 AAACCTGCAAGTTAAG-1
## AAACCTGCAATCACAC-1 AAACCTGCAATCACAC-1
## AAACCTGCACGGCCAT-1 AAACCTGCACGGCCAT-1
## AAACCTGCATGCATGT-1 AAACCTGCATGCATGT-1
patient9DF <- data.frame(patient9@meta.data)

patient9DF$sample <- 'patient9'
patient9DF$barcode <- row.names(patient9DF)

head(patient9DF)
##                    orig.ident nCount_RNA nFeature_RNA   sample
## AAACCTGAGACCCACC-1 GSM9493333       2939         1367 patient9
## AAACCTGAGACGACGT-1 GSM9493333       5500         2318 patient9
## AAACCTGAGAGACGAA-1 GSM9493333       2382         1133 patient9
## AAACCTGAGATGCCAG-1 GSM9493333       9313         3457 patient9
## AAACCTGAGATGTGGC-1 GSM9493333       5353         2455 patient9
## AAACCTGAGCCAGTAG-1 GSM9493333       6768         2664 patient9
##                               barcode
## AAACCTGAGACCCACC-1 AAACCTGAGACCCACC-1
## AAACCTGAGACGACGT-1 AAACCTGAGACGACGT-1
## AAACCTGAGAGACGAA-1 AAACCTGAGAGACGAA-1
## AAACCTGAGATGCCAG-1 AAACCTGAGATGCCAG-1
## AAACCTGAGATGTGGC-1 AAACCTGAGATGTGGC-1
## AAACCTGAGCCAGTAG-1 AAACCTGAGCCAGTAG-1
patient14DF <- data.frame(patient14@meta.data)

patient14DF$sample <- 'patient14'
patient14DF$barcode <- row.names(patient14DF)

head(patient14DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGGAGTACC-1 GSM9493334       3697         1489 patient14
## AAACCCACAACCGCTG-1 GSM9493334       3064         1152 patient14
## AAACCCACACCGGAAA-1 GSM9493334       3972         1683 patient14
## AAACCCACATCGATGT-1 GSM9493334       9102         3018 patient14
## AAACCCACATGAAAGT-1 GSM9493334       4416         2207 patient14
## AAACCCAGTGATTCTG-1 GSM9493334      13535         4238 patient14
##                               barcode
## AAACCCAAGGAGTACC-1 AAACCCAAGGAGTACC-1
## AAACCCACAACCGCTG-1 AAACCCACAACCGCTG-1
## AAACCCACACCGGAAA-1 AAACCCACACCGGAAA-1
## AAACCCACATCGATGT-1 AAACCCACATCGATGT-1
## AAACCCACATGAAAGT-1 AAACCCACATGAAAGT-1
## AAACCCAGTGATTCTG-1 AAACCCAGTGATTCTG-1
patient16DF <- data.frame(patient16@meta.data)

patient16DF$sample <- 'patient16'
patient16DF$barcode <- row.names(patient16DF)

head(patient16DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGGCCCGTT-1 GSM9493335       1335          850 patient16
## AAACCCACACATATCG-1 GSM9493335      15942         4709 patient16
## AAACCCACACCTGATA-1 GSM9493335       8358         2744 patient16
## AAACCCACACTGCATA-1 GSM9493335       6433         2350 patient16
## AAACCCACATCTCCCA-1 GSM9493335       7349         2764 patient16
## AAACCCAGTATGTCAC-1 GSM9493335      22245         4726 patient16
##                               barcode
## AAACCCAAGGCCCGTT-1 AAACCCAAGGCCCGTT-1
## AAACCCACACATATCG-1 AAACCCACACATATCG-1
## AAACCCACACCTGATA-1 AAACCCACACCTGATA-1
## AAACCCACACTGCATA-1 AAACCCACACTGCATA-1
## AAACCCACATCTCCCA-1 AAACCCACATCTCCCA-1
## AAACCCAGTATGTCAC-1 AAACCCAGTATGTCAC-1
patient18DF <- data.frame(patient18@meta.data)

patient18DF$sample <- 'patient18'
patient18DF$barcode <- row.names(patient18DF)

head(patient18DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGAATCCCT-1 GSM9493336       7380         2827 patient18
## AAACCCACAATCCTTT-1 GSM9493336      23600         5516 patient18
## AAACCCACACAAATGA-1 GSM9493336       6396         2402 patient18
## AAACCCACAGTCGCTG-1 GSM9493336       6393         2458 patient18
## AAACCCACATATGCGT-1 GSM9493336      10896         3320 patient18
## AAACCCAGTAATCAGA-1 GSM9493336       3265         1690 patient18
##                               barcode
## AAACCCAAGAATCCCT-1 AAACCCAAGAATCCCT-1
## AAACCCACAATCCTTT-1 AAACCCACAATCCTTT-1
## AAACCCACACAAATGA-1 AAACCCACACAAATGA-1
## AAACCCACAGTCGCTG-1 AAACCCACAGTCGCTG-1
## AAACCCACATATGCGT-1 AAACCCACATATGCGT-1
## AAACCCAGTAATCAGA-1 AAACCCAGTAATCAGA-1
patient19DF <- data.frame(patient19@meta.data)

patient19DF$sample <- 'patient19'
patient19DF$barcode <- row.names(patient19DF)

head(patient19DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGCTGACTT-1 GSM9493337      14516         4013 patient19
## AAACCCACAAGGCCTC-1 GSM9493337       9934         3096 patient19
## AAACCCACATATGAAG-1 GSM9493337       7642         2794 patient19
## AAACCCAGTACAGTAA-1 GSM9493337      29120         5585 patient19
## AAACCCAGTGACTGTT-1 GSM9493337      10785         3256 patient19
## AAACGAAAGGGTCTTT-1 GSM9493337       7637         3211 patient19
##                               barcode
## AAACCCAAGCTGACTT-1 AAACCCAAGCTGACTT-1
## AAACCCACAAGGCCTC-1 AAACCCACAAGGCCTC-1
## AAACCCACATATGAAG-1 AAACCCACATATGAAG-1
## AAACCCAGTACAGTAA-1 AAACCCAGTACAGTAA-1
## AAACCCAGTGACTGTT-1 AAACCCAGTGACTGTT-1
## AAACGAAAGGGTCTTT-1 AAACGAAAGGGTCTTT-1
patient20DF <- data.frame(patient20@meta.data)

patient20DF$sample <- 'patient20'
patient20DF$barcode <- row.names(patient20DF)

head(patient20DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGGGTACGT-1 GSM9493338       7174         2569 patient20
## AAACCCAAGTATAGAC-1 GSM9493338      11690         3384 patient20
## AAACCCACAGCTGAGA-1 GSM9493338       8594         2697 patient20
## AAACCCACAGTAGTTC-1 GSM9493338      12894         3821 patient20
## AAACCCACATAGGCGA-1 GSM9493338      16876         4506 patient20
## AAACCCAGTACTCCGG-1 GSM9493338       5461          204 patient20
##                               barcode
## AAACCCAAGGGTACGT-1 AAACCCAAGGGTACGT-1
## AAACCCAAGTATAGAC-1 AAACCCAAGTATAGAC-1
## AAACCCACAGCTGAGA-1 AAACCCACAGCTGAGA-1
## AAACCCACAGTAGTTC-1 AAACCCACAGTAGTTC-1
## AAACCCACATAGGCGA-1 AAACCCACATAGGCGA-1
## AAACCCAGTACTCCGG-1 AAACCCAGTACTCCGG-1
patient22DF <- data.frame(patient22@meta.data)

patient22DF$sample <- 'patient22'
patient22DF$barcode <- row.names(patient22DF)

head(patient22DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGATGTTAG-1 GSM9493339       1237          834 patient22
## AAACCCAAGCACTCGC-1 GSM9493339      11878         3865 patient22
## AAACCCAAGCCTCCAG-1 GSM9493339       8011         3134 patient22
## AAACCCAAGCGGTAGT-1 GSM9493339       3418         1624 patient22
## AAACCCAAGGAACGTC-1 GSM9493339       3655         1573 patient22
## AAACCCAAGTTCCTGA-1 GSM9493339       3008         1604 patient22
##                               barcode
## AAACCCAAGATGTTAG-1 AAACCCAAGATGTTAG-1
## AAACCCAAGCACTCGC-1 AAACCCAAGCACTCGC-1
## AAACCCAAGCCTCCAG-1 AAACCCAAGCCTCCAG-1
## AAACCCAAGCGGTAGT-1 AAACCCAAGCGGTAGT-1
## AAACCCAAGGAACGTC-1 AAACCCAAGGAACGTC-1
## AAACCCAAGTTCCTGA-1 AAACCCAAGTTCCTGA-1
patient25DF <- data.frame(patient25@meta.data)

patient25DF$sample <- 'patient25'
patient25DF$barcode <- row.names(patient25DF)

head(patient25DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGAATTGCA-1 GSM9493340       6955         2745 patient25
## AAACCCAAGATAGCTA-1 GSM9493340       5617         2350 patient25
## AAACCCAAGCTCGAAG-1 GSM9493340       9442         3537 patient25
## AAACCCAAGGGTGGGA-1 GSM9493340       4770         2036 patient25
## AAACCCACAACTGTGT-1 GSM9493340        455          281 patient25
## AAACCCACAATAGTGA-1 GSM9493340       6766         2939 patient25
##                               barcode
## AAACCCAAGAATTGCA-1 AAACCCAAGAATTGCA-1
## AAACCCAAGATAGCTA-1 AAACCCAAGATAGCTA-1
## AAACCCAAGCTCGAAG-1 AAACCCAAGCTCGAAG-1
## AAACCCAAGGGTGGGA-1 AAACCCAAGGGTGGGA-1
## AAACCCACAACTGTGT-1 AAACCCACAACTGTGT-1
## AAACCCACAATAGTGA-1 AAACCCACAATAGTGA-1
patient27DF <- data.frame(patient27@meta.data)

patient27DF$sample <- 'patient27'
patient27DF$barcode <- row.names(patient27DF)

head(patient27DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGATGAACT-1 GSM9493341       8415         2972 patient27
## AAACCCAAGCGCTTCG-1 GSM9493341       2577         1486 patient27
## AAACCCAAGTAATACG-1 GSM9493341       4681         2042 patient27
## AAACCCACAGAGTGAC-1 GSM9493341       2268         1195 patient27
## AAACCCACAGGCAATG-1 GSM9493341       2779         1686 patient27
## AAACCCACAGTTAGGG-1 GSM9493341       3864         1951 patient27
##                               barcode
## AAACCCAAGATGAACT-1 AAACCCAAGATGAACT-1
## AAACCCAAGCGCTTCG-1 AAACCCAAGCGCTTCG-1
## AAACCCAAGTAATACG-1 AAACCCAAGTAATACG-1
## AAACCCACAGAGTGAC-1 AAACCCACAGAGTGAC-1
## AAACCCACAGGCAATG-1 AAACCCACAGGCAATG-1
## AAACCCACAGTTAGGG-1 AAACCCACAGTTAGGG-1
patient30DF <- data.frame(patient30@meta.data)

patient30DF$sample <- 'patient30'
patient30DF$barcode <- row.names(patient30DF)

head(patient30DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGAACTTCC-1 GSM9493342       7731         2715 patient30
## AAACCCAAGATTGGGC-1 GSM9493342        893          464 patient30
## AAACCCAAGCATATGA-1 GSM9493342       8097         2983 patient30
## AAACCCAAGCATCTTG-1 GSM9493342      12789         3633 patient30
## AAACCCAAGTCTTCGA-1 GSM9493342       5672         2167 patient30
## AAACCCAAGTTCAACC-1 GSM9493342      16285         4535 patient30
##                               barcode
## AAACCCAAGAACTTCC-1 AAACCCAAGAACTTCC-1
## AAACCCAAGATTGGGC-1 AAACCCAAGATTGGGC-1
## AAACCCAAGCATATGA-1 AAACCCAAGCATATGA-1
## AAACCCAAGCATCTTG-1 AAACCCAAGCATCTTG-1
## AAACCCAAGTCTTCGA-1 AAACCCAAGTCTTCGA-1
## AAACCCAAGTTCAACC-1 AAACCCAAGTTCAACC-1
patient31DF <- data.frame(patient31@meta.data)

patient31DF$sample <- 'patient31'
patient31DF$barcode <- row.names(patient31DF)

head(patient31DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGATGCAGC-1 GSM9493343       6673         2192 patient31
## AAACCCAAGCCGCACT-1 GSM9493343       5623         2135 patient31
## AAACCCAAGTCAGAGC-1 GSM9493343       5247         1640 patient31
## AAACCCACAATTCACG-1 GSM9493343       5696         2199 patient31
## AAACCCACACAAATCC-1 GSM9493343       9451         3391 patient31
## AAACCCACACAATGTC-1 GSM9493343      10479         3223 patient31
##                               barcode
## AAACCCAAGATGCAGC-1 AAACCCAAGATGCAGC-1
## AAACCCAAGCCGCACT-1 AAACCCAAGCCGCACT-1
## AAACCCAAGTCAGAGC-1 AAACCCAAGTCAGAGC-1
## AAACCCACAATTCACG-1 AAACCCACAATTCACG-1
## AAACCCACACAAATCC-1 AAACCCACACAAATCC-1
## AAACCCACACAATGTC-1 AAACCCACACAATGTC-1
patient36DF <- data.frame(patient36@meta.data)

patient36DF$sample <- 'patient36'
patient36DF$barcode <- row.names(patient36DF)

head(patient36DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGAAGTGTT-1 GSM9493344       8136         2776 patient36
## AAACCCAAGATAGCTA-1 GSM9493344      15803         4967 patient36
## AAACCCAAGCGTTCCG-1 GSM9493344       3286         1741 patient36
## AAACCCAAGGAACGAA-1 GSM9493344        886          497 patient36
## AAACCCAAGGCTGGAT-1 GSM9493344        918          592 patient36
## AAACCCACAGCTATTG-1 GSM9493344      18166         4953 patient36
##                               barcode
## AAACCCAAGAAGTGTT-1 AAACCCAAGAAGTGTT-1
## AAACCCAAGATAGCTA-1 AAACCCAAGATAGCTA-1
## AAACCCAAGCGTTCCG-1 AAACCCAAGCGTTCCG-1
## AAACCCAAGGAACGAA-1 AAACCCAAGGAACGAA-1
## AAACCCAAGGCTGGAT-1 AAACCCAAGGCTGGAT-1
## AAACCCACAGCTATTG-1 AAACCCACAGCTATTG-1
patient37DF <- data.frame(patient37@meta.data)

patient37DF$sample <- 'patient37'
patient37DF$barcode <- row.names(patient37DF)

head(patient37DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGAAATTGC-1 GSM9493345       8154         2523 patient37
## AAACCCAAGCCTAGGA-1 GSM9493345       7610         2623 patient37
## AAACCCAAGTATCTGC-1 GSM9493345       6634         2599 patient37
## AAACCCAAGTCTTCGA-1 GSM9493345      30471         5798 patient37
## AAACCCAAGTTCGCAT-1 GSM9493345      13432         4087 patient37
## AAACCCACAACTCCAA-1 GSM9493345       3667         1415 patient37
##                               barcode
## AAACCCAAGAAATTGC-1 AAACCCAAGAAATTGC-1
## AAACCCAAGCCTAGGA-1 AAACCCAAGCCTAGGA-1
## AAACCCAAGTATCTGC-1 AAACCCAAGTATCTGC-1
## AAACCCAAGTCTTCGA-1 AAACCCAAGTCTTCGA-1
## AAACCCAAGTTCGCAT-1 AAACCCAAGTTCGCAT-1
## AAACCCACAACTCCAA-1 AAACCCACAACTCCAA-1
patient38DF <- data.frame(patient38@meta.data)

patient38DF$sample <- 'patient38'
patient38DF$barcode <- row.names(patient38DF)

head(patient38DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGGTCACAG-1 GSM9493346       6311         2798 patient38
## AAACCCAAGTGGAAGA-1 GSM9493346       6364         2610 patient38
## AAACCCACAGATTCGT-1 GSM9493346      10556         3254 patient38
## AAACCCACATCCTGTC-1 GSM9493346      13957         4361 patient38
## AAACCCACATCCTTCG-1 GSM9493346       6792         2596 patient38
## AAACCCAGTGACTGAG-1 GSM9493346       6586         2403 patient38
##                               barcode
## AAACCCAAGGTCACAG-1 AAACCCAAGGTCACAG-1
## AAACCCAAGTGGAAGA-1 AAACCCAAGTGGAAGA-1
## AAACCCACAGATTCGT-1 AAACCCACAGATTCGT-1
## AAACCCACATCCTGTC-1 AAACCCACATCCTGTC-1
## AAACCCACATCCTTCG-1 AAACCCACATCCTTCG-1
## AAACCCAGTGACTGAG-1 AAACCCAGTGACTGAG-1
patient39DF <- data.frame(patient39@meta.data)

patient39DF$sample <- 'patient39'
patient39DF$barcode <- row.names(patient39DF)

head(patient39DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGATGAACT-1 GSM9493347       9044         3335 patient39
## AAACCCAAGCGTATAA-1 GSM9493347       9370         3169 patient39
## AAACCCAAGGTGCGAT-1 GSM9493347      15035         4839 patient39
## AAACCCAAGGTGTGAT-1 GSM9493347        340          301 patient39
## AAACCCACAATTTCGG-1 GSM9493347      20410         5323 patient39
## AAACCCACACGTACAT-1 GSM9493347       7596         3020 patient39
##                               barcode
## AAACCCAAGATGAACT-1 AAACCCAAGATGAACT-1
## AAACCCAAGCGTATAA-1 AAACCCAAGCGTATAA-1
## AAACCCAAGGTGCGAT-1 AAACCCAAGGTGCGAT-1
## AAACCCAAGGTGTGAT-1 AAACCCAAGGTGTGAT-1
## AAACCCACAATTTCGG-1 AAACCCACAATTTCGG-1
## AAACCCACACGTACAT-1 AAACCCACACGTACAT-1
patient41DF <- data.frame(patient41@meta.data)

patient41DF$sample <- 'patient41'
patient41DF$barcode <- row.names(patient41DF)

head(patient41DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGAAGCTGC-1 GSM9493348       6399         2546 patient41
## AAACCCAAGATTTGCC-1 GSM9493348      22051         5811 patient41
## AAACCCAAGCTGTTAC-1 GSM9493348      18725         5179 patient41
## AAACCCAAGGTAGTCG-1 GSM9493348       4545         1880 patient41
## AAACCCAAGTCTTCGA-1 GSM9493348       7017         2724 patient41
## AAACCCACAACCCTAA-1 GSM9493348       3604         1813 patient41
##                               barcode
## AAACCCAAGAAGCTGC-1 AAACCCAAGAAGCTGC-1
## AAACCCAAGATTTGCC-1 AAACCCAAGATTTGCC-1
## AAACCCAAGCTGTTAC-1 AAACCCAAGCTGTTAC-1
## AAACCCAAGGTAGTCG-1 AAACCCAAGGTAGTCG-1
## AAACCCAAGTCTTCGA-1 AAACCCAAGTCTTCGA-1
## AAACCCACAACCCTAA-1 AAACCCACAACCCTAA-1
patient44DF <- data.frame(patient44@meta.data)

patient44DF$sample <- 'patient44'
patient44DF$barcode <- row.names(patient44DF)

head(patient44DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGAGCCATG-1 GSM9493349      12359         3331 patient44
## AAACCCAAGCCTGCCA-1 GSM9493349        306          247 patient44
## AAACCCAAGTACAGCG-1 GSM9493349       2325         1136 patient44
## AAACCCAAGTCTCGTA-1 GSM9493349      13436         3905 patient44
## AAACCCACAACATACC-1 GSM9493349        332          288 patient44
## AAACCCACAAGCTGCC-1 GSM9493349        355          302 patient44
##                               barcode
## AAACCCAAGAGCCATG-1 AAACCCAAGAGCCATG-1
## AAACCCAAGCCTGCCA-1 AAACCCAAGCCTGCCA-1
## AAACCCAAGTACAGCG-1 AAACCCAAGTACAGCG-1
## AAACCCAAGTCTCGTA-1 AAACCCAAGTCTCGTA-1
## AAACCCACAACATACC-1 AAACCCACAACATACC-1
## AAACCCACAAGCTGCC-1 AAACCCACAAGCTGCC-1
patient51DF <- data.frame(patient51@meta.data)

patient51DF$sample <- 'patient51'
patient51DF$barcode <- row.names(patient51DF)

head(patient51DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGATTAGAC-1 GSM9493350       4036         1591 patient51
## AAACCCACAAAGGATT-1 GSM9493350      11263         3664 patient51
## AAACCCACAAGACAAT-1 GSM9493350       8074         2670 patient51
## AAACCCACAGGCACTC-1 GSM9493350       3792         1822 patient51
## AAACCCACATGAAAGT-1 GSM9493350       6838         2517 patient51
## AAACCCACATGGGAAC-1 GSM9493350       4908         2043 patient51
##                               barcode
## AAACCCAAGATTAGAC-1 AAACCCAAGATTAGAC-1
## AAACCCACAAAGGATT-1 AAACCCACAAAGGATT-1
## AAACCCACAAGACAAT-1 AAACCCACAAGACAAT-1
## AAACCCACAGGCACTC-1 AAACCCACAGGCACTC-1
## AAACCCACATGAAAGT-1 AAACCCACATGAAAGT-1
## AAACCCACATGGGAAC-1 AAACCCACATGGGAAC-1

patient52DF <- data.frame(patient52@meta.data)

patient52DF$sample <- 'patient52'
patient52DF$barcode <- row.names(patient52DF)

head(patient52DF)
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCCAAGACTACCT-1 GSM9493351       2723         1471 patient52
## AAACCCAAGAGCATAT-1 GSM9493351       1371          416 patient52
## AAACCCAAGCACTCAT-1 GSM9493351       3217         1504 patient52
## AAACCCAAGCTTAGTC-1 GSM9493351       4059         1325 patient52
## AAACCCAAGTCCCAAT-1 GSM9493351       6942         2411 patient52
## AAACCCACACGGAAGT-1 GSM9493351       9136         3355 patient52
##                               barcode
## AAACCCAAGACTACCT-1 AAACCCAAGACTACCT-1
## AAACCCAAGAGCATAT-1 AAACCCAAGAGCATAT-1
## AAACCCAAGCACTCAT-1 AAACCCAAGCACTCAT-1
## AAACCCAAGCTTAGTC-1 AAACCCAAGCTTAGTC-1
## AAACCCAAGTCCCAAT-1 AAACCCAAGTCCCAAT-1
## AAACCCACACGGAAGT-1 AAACCCACACGGAAGT-1

Now let's rbind these 17 patient barcodes.

combined17patientBarcodes <- rbind(patient6DF,patient9DF,patient14DF,patient16DF,patient18DF,patient19DF,patient20DF,patient22DF,patient25DF,patient27DF,patient30DF,patient31DF,patient36DF,patient37DF,patient38DF,patient39DF,patient41DF,patient44DF,patient51DF,patient52DF)

dim(combined17patientBarcodes)
## [1] 264679      5
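
The per-patient steps above repeat the same three lines for every sample, so they could also be written as a single pass over a named list. This is only a minimal sketch on toy data: `make_meta` and `meta_list` are hypothetical names, and the toy counts are made up. In a real session the list entries would be the `@meta.data` slots of the Seurat objects already loaded.

```r
# Toy stand-ins for the patientNN@meta.data slots (values are made up).
make_meta <- function(gsm, barcodes) {
  data.frame(orig.ident   = gsm,
             nCount_RNA   = seq_along(barcodes) * 100,
             nFeature_RNA = seq_along(barcodes) * 10,
             row.names    = barcodes)
}

# Hypothetical named list: the names become the sample labels.
meta_list <- list(
  patient51 = make_meta("GSM9493350", c("AAACCCAAGATTAGAC-1", "AAACCCACAAAGGATT-1")),
  patient52 = make_meta("GSM9493351", c("AAACCCAAGACTACCT-1", "AAACCCAAGAGCATAT-1"))
)

# Add the sample label and the barcode column to each data frame.
tagged <- lapply(names(meta_list), function(nm) {
  df <- meta_list[[nm]]
  df$sample  <- nm
  df$barcode <- row.names(df)
  df
})

# Stack all per-sample data frames into one.
combined <- do.call(rbind, tagged)
dim(combined)
```

Building the list by name also makes it harder to miscount the samples, since `length(meta_list)` reports how many data frames were actually stacked.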

There are 264,679 barcodes in the patient samples, which are numbered non-consecutively from patient 6 through patient 52. We can look at the first row, every 25,000th row, and the last row to see which patient ID each one belongs to.

combined17patientBarcodes[c(1,25000,50000,75000,100000,125000,150000,175000,200000,225000,250000,264679),]
##                    orig.ident nCount_RNA nFeature_RNA    sample
## AAACCTGAGGCTCTTA-1 GSM9493332      10238         2862  patient6
## CCGGTGATCCATAAGC-1 GSM9493334       6448         2566 patient14
## TGCTTCGGTCCCGCAA-1 GSM9493336       8852         2958 patient18
## GATGAGGAGGTGAGAA-1 GSM9493339       5419         2070 patient22
## GAAGAATCATCTGTTT-1 GSM9493341       1540          985 patient27
## AGCTTCCCAGCGTTTA-1 GSM9493343       7552         3071 patient31
## TGGGAAGAGGTTGCCC-1 GSM9493344       5632         2681 patient36
## TAGGTACTCTCGACGG-1 GSM9493346       9334         2794 patient38
## GATCGTACACCCTTAC-1 GSM9493348       7391         2681 patient41
## GGCTTTCGTGCTATTG-1 GSM9493349        276          213 patient44
## TTGTTTGAGGTATAGT-1 GSM9493350       4217         1807 patient51
## TTTGTTGTCGGACAAG-1 GSM9493351       8375         2957 patient52
##                               barcode
## AAACCTGAGGCTCTTA-1 AAACCTGAGGCTCTTA-1
## CCGGTGATCCATAAGC-1 CCGGTGATCCATAAGC-1
## TGCTTCGGTCCCGCAA-1 TGCTTCGGTCCCGCAA-1
## GATGAGGAGGTGAGAA-1 GATGAGGAGGTGAGAA-1
## GAAGAATCATCTGTTT-1 GAAGAATCATCTGTTT-1
## AGCTTCCCAGCGTTTA-1 AGCTTCCCAGCGTTTA-1
## TGGGAAGAGGTTGCCC-1 TGGGAAGAGGTTGCCC-1
## TAGGTACTCTCGACGG-1 TAGGTACTCTCGACGG-1
## GATCGTACACCCTTAC-1 GATCGTACACCCTTAC-1
## GGCTTTCGTGCTATTG-1 GGCTTTCGTGCTATTG-1
## TTGTTTGAGGTATAGT-1 TTGTTTGAGGTATAGT-1
## TTTGTTGTCGGACAAG-1 TTTGTTGTCGGACAAG-1
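
The row positions in the call above were typed out by hand. As a hedged sketch, the same kind of index can be generated with `seq()` so that it adapts to the number of rows. The `toy` data frame and the `step` of 25 are assumptions for illustration; for the real table the step would be 25000.

```r
# Toy data frame standing in for the combined barcode table (103 rows).
toy <- data.frame(
  sample = rep(c("patient6", "patient9", "patient14"), times = c(40, 35, 28))
)

step <- 25  # assumed step for the toy data; 25000 for the real table

# First row, every step-th row, and the last row, without repeats.
idx <- unique(c(1, seq(step, nrow(toy), by = step), nrow(toy)))
idx

# Peek at the sample label found at each of those positions.
toy[idx, , drop = FALSE]
```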

Let's see the unique patient sample labels.

unique(combined17patientBarcodes$sample)
##  [1] "patient6"  "patient9"  "patient14" "patient16" "patient18" "patient19"
##  [7] "patient20" "patient22" "patient25" "patient27" "patient30" "patient31"
## [13] "patient36" "patient37" "patient38" "patient39" "patient41" "patient44"
## [19] "patient51" "patient52"

There are 20 patient samples, not 17. Noted. That gives 12 healthy samples and 20 patient samples, for 32 samples in all. The series matrix metadata on data extraction and handling does not list a patient14, yet the files extracted and imported with Seurat include one. The math doesn't add up: the series matrix has 30 columns, one of which is the description column, so there should be 29 samples, or 30 with patient14 added. But we have 20 patient samples and 12 healthy samples, 32 in total, which leaves 2 samples unaccounted for. They could be among the healthy samples. Let's see.

unique(combinedHealthy12$sample)
##  [1] "healthy1"  "healthy2"  "healthy3"  "healthy4"  "healthy5"  "healthy6" 
##  [7] "healthy7"  "healthy8"  "healthy9"  "healthy10" "healthy11" "healthy12"

OK, so we have 12 healthy samples and 20 patient samples. The pathology in this case is natural killer T-cell lymphoma (NKTCL), which is strongly associated with Epstein-Barr virus (EBV) infection.
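
The sample counting above can also be done in code rather than by eye. The following is a rough sketch on toy data: `listed_in_series_matrix` is a hypothetical vector standing in for the sample labels read off the series matrix file, and it is not the real list.

```r
# Toy stand-in for the combined barcode table.
toy <- data.frame(
  orig.ident = c("GSM9493332", "GSM9493332", "GSM9493334", "GSM9493351"),
  sample     = c("patient6", "patient6", "patient14", "patient52")
)

# Number of barcodes (rows) under each sample label.
table(toy$sample)

# Number of distinct samples actually present.
n_samples <- length(unique(toy$sample))

# Each GSM ID should pair with exactly one sample label.
unique(toy[, c("orig.ident", "sample")])

# Hypothetical labels taken from the series matrix (assumption).
listed_in_series_matrix <- c("patient6", "patient52")

# Samples present in the imported files but absent from the metadata.
not_in_metadata <- setdiff(unique(toy$sample), listed_in_series_matrix)
not_in_metadata
```

Running `setdiff()` in the other direction would show any samples the metadata lists that were never imported.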

Let's write this out to CSV.

write.csv(combined17patientBarcodes,
        "combined17patientBarcodes.csv",row.names=FALSE)

Now let's combine the healthy and patient barcodes and write them to CSV.

allSampleBarcodes <- rbind(combined17patientBarcodes,combinedHealthy12)
dim(allSampleBarcodes)
## [1] 407006      5

We have a total of 407,006 barcodes across 32 samples: 12 healthy samples and 20 NKTCL patient samples.
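
One thing to watch in the combined table is that the same barcode string can appear in more than one sample. In the `head()` output earlier, for example, AAACCCAAGATAGCTA-1 shows up under both patient25 and patient36. A minimal sketch of one way to keep every cell uniquely identifiable is to paste the sample label onto the barcode. The column name `cell_id` is an assumption, not something from the original workflow.

```r
# Toy rows reusing barcodes seen in the head() output above.
toy <- data.frame(
  sample  = c("patient25", "patient36", "patient30"),
  barcode = c("AAACCCAAGATAGCTA-1", "AAACCCAAGATAGCTA-1", "AAACCCAAGAACTTCC-1")
)

# How many rows reuse a barcode already seen in an earlier row?
n_repeated <- sum(duplicated(toy$barcode))

# Build an ID that is unique across samples (hypothetical column name).
toy$cell_id <- paste(toy$sample, toy$barcode, sep = "_")

# anyDuplicated() returns 0 when every value is unique.
anyDuplicated(toy$cell_id)
```

This matters because the CSV is written with `row.names=FALSE`, so the `barcode` and `sample` columns are the only identifiers that survive in the file.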

We will continue the workflow on this data to find the top genes associated with aggressive natural killer T-cell lymphoma (NKTCL) from EBV. This is PBMC single-cell RNA sequencing, which means each cell in the sample is sequenced separately from the others and is identified by its own barcode.

In a recent study we analyzed mononucleosis gene expression data involving microRNA (miRNA). A miRNA comes from a stretch of single-stranded RNA (ssRNA) that folds back on itself into a hairpin shape, making that region double-stranded. The mature miRNA then binds a target messenger RNA (mRNA) and typically inhibits its translation into protein. Now we are working with the same kind of complementary DNA, reverse transcribed from mRNA, but "single-cell RNA sequencing" refers to how the gene expression data is obtained for analysis: one cell at a time.

Write the last dataframe of all samples’ barcodes to csv.

write.csv(allSampleBarcodes,'allSampleBarcodes_nasopharyngealCarcinoma.csv',row.names=FALSE)

You can get the files below:

Thanks. Keep checking in for more work on this project.

Thanks again.