MS 20bp gene fragment analysis 3rd batch Data Extraction

This is the 3rd batch document to import and make transformations to it before adding to the 1st and 2nd batches of controls and 1st patient samples of transformations. The file was uploaded to Google Drive and you can get the file created from publicly available gene expression data in GSE293036 with this program here. Or you can upload every file from the site directly and put them in the same folder and run this program to get the file by visiting NCBE for GSE293036.

These are very large files of 10-50 Million rows of nucleotide strings 20 base pairs long but only 2 dimensions each. The 1st batch is in this document but code not ran, it can take a large amount of time to run each chunk once import and transformations begin. There will be a 3rd batch as well done separately for the B cell samples of repeats as comparison the study used.

This is the Multiple Sclerosis (MS) gene expression found on NCBI for the GSE293036-GPL34284 series and platform. There are 3 repeats of a control, 5 repeats of a MS patient, 5 repeats of one other MS patient, and 5 repeats of a B-cell line that is commercially available to find the MS loci that known gene studies have found associated with MS. The study says they found 83 independent MS risk loci, 150 allelic enhancer variants, and 286 allelic silencing variants. Enhancer could be upregulated gene expression and silencer could be down regulated gene expression. The ID_REF was used in this data and I never worked with it directly but know BLAST or basic alignment of sequencing tool online works with it but with an internet search wasn’t able to get a gene. I had already tried genecards.org and nothing. But once I realized I should be getting the type of gene label used as they are specific or standardized within their bioinformatics community as ‘ID_REF’ just like HGNC and ENSEMBL and GeneCards ID is, then the internet said to use the following program in R, with the use of Bioconductor for analyzing the 20 base pair sequences and returnign their gene, or reverse to get the 20 base pair sequence from a gene.

These are large samples, each one uploaded, different amount of genes for the control repeats and had to extract the same genes in all 3 sample repeats of the control group and then order them from largest to smallest so I could column bind them together. There are over 10 million nucleotide sequences of 20 base pairs long in the data so it takes a while to upload and modify at each step because it is 10 million observations or more, but only 2 columns each sample for counts of each nucleotide strand that gives allele information as some variants of a gene have different amino acid make up from one nucleotide being different or more but still operating the same in the finished protein, that could be the alleles or gene variants MS patients have in common that demyelinates their central nervous system in the brain, cord, and peripheral nerves in later stages. You can see the brain demyelination as white spots on the brain from the demyelinated lipids of brain tissue from what I remember in special imaging and differential diagnosis, possibly with a brain CT that operates at a different window to look at soft tissue inside the cranium and not a skull CT to look at bone.

Running the gene expression data in 20 base pair nucleotide allele form and found on internet that Bioconductor can find the gene name as Gene cards wasn’t allowing it. Install these packages if you haven’t and run package refseqR to get gene IDs for the REFSEQ nucleotide strings in ID_REF format.

These data frames are huge 10-50 million rows each sample, so it takes a lot of memory to store each one while transforming it or modifying it, as needed remove the ones not used after your done with it in its new transformation.

#install.packages("refseqR")

#install.packages("Biostrings")
#doesn't work have to install next line for bioconductor

#install.packages("BiocManager") 
#BiocManager::install("Biostrings")

#library(refseqR)

header <- read.table("GSE293036-GPL34284_series_matrix.txt", comment.char='!',
                     header = T)
str(header)

‘data.frame’: 0 obs. of 16 variables: $ ID_REF : logi $ GSM8874365: logi $ GSM8874366: logi $ GSM8874367: logi $ GSM8874368: logi $ GSM8874369: logi $ GSM8874370: logi $ GSM8874371: logi $ GSM8874372: logi $ GSM8874373: logi $ GSM8874374: logi $ GSM8874375: logi $ GSM8874376: logi $ GSM8874377: logi $ GSM8874378: logi $ GSM8874379: logi

cntrl_rep3 <- read.table('GSM8874364_control_rep3.hg38.barcode_count.txt', header=F)
str(cntrl_rep3)

‘data.frame’: 15038071 obs. of 2 variables: $ V1: chr “TTGCGTAAGCTCCATCGCGA” “TTTGCTTCTGCTGATGTTTG” “GAGGCTCACCGCACTACTAG” “CCCGATGCTTTCTGAGTTAG” … $ V2: int 12 4 21 6 28 1 3 6 21 1 …

cntrl_rep2 <- read.table('GSM8874363_control_rep2.hg38.barcode_count.txt', header=F)
str(cntrl_rep2)

‘data.frame’: 14920932 obs. of 2 variables: $ V1: chr “GTGGTCTGCGCCCACCTCCA” “AACCACCGCCTCTTTGAGTT” “TCAGCGGGGCCTTTTTAAAT” “GCTTCGGGCTTATGTCTGGT” … $ V2: int 11 27 7 1 14 1 21 6 7 15 …

cntrl_rep1 <- read.table('GSM8874362_control_rep1.hg38.barcode_count.txt', header=F)
str(cntrl_rep1)

‘data.frame’: 15294307 obs. of 2 variables: $ V1: chr “TATGCATATTTCCAACTGCT” “GTCCGCTTTTTTAAGTTGCT” “ACCTTTTTACAAATATCAGG” “TAAACCTGCCATCTCTTCGC” … $ V2: int 3 14 6 3 2 10 12 1 7 1 …

Get the nucleotide strands that are common to all 3 control data sets as they have varying number of nucleotide strands per data set.

same <- cntrl_rep1[which(cntrl_rep1$V1 %in% cntrl_rep2$V1),]
dim(same)

[1] 11423289 2

same2 <- cntrl_rep1[which(cntrl_rep1$V1 %in% cntrl_rep3$V1),]
dim(same2)

[1] 11441782 2

sameAll <- same[which(same$V1 %in% same2$V1),]
dim(sameAll)

[1] 10838413 2

They have 10,838,413 nucleotide strands in common.

rm(same,same2)

We want a list of those strings to filter each control to have so they have the same strands and can be combined into a data frame together.

commonBPs <- sameAll$V1
rm(sameAll)

common1 <- cntrl_rep1[which(cntrl_rep1$V1 %in% commonBPs),]
rm(cntrl_rep1)
str(common1)

‘data.frame’: 10838413 obs. of 2 variables: $ V1: chr “TATGCATATTTCCAACTGCT” “GTCCGCTTTTTTAAGTTGCT” “ACCTTTTTACAAATATCAGG” “TCTCAGCTTCTACCCGAACT” … $ V2: int 3 14 6 2 10 12 7 13 18 3 …

common2 <- cntrl_rep2[which(cntrl_rep2$V1 %in% commonBPs),]
rm(cntrl_rep2)
str(common2)

‘data.frame’: 10838413 obs. of 2 variables: $ V1: chr “GTGGTCTGCGCCCACCTCCA” “AACCACCGCCTCTTTGAGTT” “TCAGCGGGGCCTTTTTAAAT” “TTCGCTCACCCCTTGCTGAA” … $ V2: int 11 27 7 14 21 6 7 15 7 28 …

common3 <- cntrl_rep3[which(cntrl_rep3$V1 %in% commonBPs),]
rm(cntrl_rep3)
str(common3)

‘data.frame’: 10838413 obs. of 2 variables: $ V1: chr “TTGCGTAAGCTCCATCGCGA” “TTTGCTTCTGCTGATGTTTG” “GAGGCTCACCGCACTACTAG” “CCCGATGCTTTCTGAGTTAG” … $ V2: int 12 4 21 6 28 3 6 21 2 8 …

Order these so we can drop the ID_REF field in the other 2 sets and know the rows are for the same strand.

comm1ordered <- common1[order(common1$V1, decreasing=T),]
rm(common1)
head(comm1ordered$V1)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCTCCCGCT” “TTTTTTTTTTTTCCCGGTCG”

comm2ordered <- common2[order(common2$V1,decreasing=T),]
rm(common2)
head(comm2ordered$V1)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCTCCCGCT” “TTTTTTTTTTTTCCCGGTCG”

comm3ordered <- common3[order(common3$V1,decreasing=T),]
rm(common3)
head(comm3ordered$V1)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCTCCCGCT” “TTTTTTTTTTTTCCCGGTCG”

compare tails in same data since only 2 variables with very small names. {r eval=FALSE, include=TRUE tail(comm1ordered)

V1 V2 13323086 AAAAAAAAAAAAAAACTAAA 2
5755838 AAAAAAAAAAAAAAAACAAA 2
948484 AAAAAAAAAAAAAAAAATAA 1
8856417 AAAAAAAAAAAAAAAAACAA 1
9962297 AAAAAAAAAAAAAAAAAACA 1
14061785 AAAAAAAAAAAAAAAAAAAA 5

{r eval=FALSE, include=TRUE tail(comm2ordered) V1 V2 10218013 AAAAAAAAAAAAAAACTAAA 1
4913588 AAAAAAAAAAAAAAAACAAA 1
7408372 AAAAAAAAAAAAAAAAATAA 2
260301 AAAAAAAAAAAAAAAAACAA 1
9772349 AAAAAAAAAAAAAAAAAACA 2
1620579 AAAAAAAAAAAAAAAAAAAA 7

{r eval=FALSE, include=TRUE tail(comm3ordered) V1 V2 12876964 AAAAAAAAAAAAAAACTAAA 2
4187456 AAAAAAAAAAAAAAAACAAA 2
10737186 AAAAAAAAAAAAAAAAATAA 2
4267113 AAAAAAAAAAAAAAAAACAA 1
1795217 AAAAAAAAAAAAAAAAAACA 1
2151989 AAAAAAAAAAAAAAAAAAAA 8

Looks like all the strand are the same in the top 5 and bottom 5 so we can go ahead and combine them into one dataset.

controls <- cbind(comm1ordered,comm2ordered$V2,comm3ordered$V2)
rm(comm1ordered,comm2ordered,comm3ordered)
str(controls)

‘data.frame’: 10838413 obs. of 4 variables: $ V1 : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ V2 : int 4 2 4 4 1 1 2 5 2 6 … $ comm2ordered$V2: int 4 1 1 1 4 1 3 5 1 5 … $ comm3ordered$V2: int 3 5 1 2 2 3 3 8 3 1 …

Give them names by their repeat sample number and type which is control type sample and the GSM sample ID last 4 digits.

colnames(controls) <- c("ID_REF","control1-4362", "control2-4363", "control3-4364")
rm(header) #don't need this header column names anymore
str(controls)

‘data.frame’: 10838413 obs. of 4 variables: $ ID_REF : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ control1-4362: int 4 2 4 4 1 1 2 5 2 6 … $ control2-4363: int 4 1 1 1 4 1 3 5 1 5 … $ control3-4364: int 3 5 1 2 2 3 3 8 3 1 …

write.csv(controls, 'control.csv', row.names=F)

head(controls,10)

ID_REF control1-4362 control2-4363 control3-4364 TTTTTTTTTTTTTTCGTCCC 4 4 3 TTTTTTTTTTTTTCCTTGCT 2 1 5 TTTTTTTTTTTTGCAGTGAT 4 1 1 TTTTTTTTTTTTCTGCTATG 4 1 2 TTTTTTTTTTTTCTCCCGCT 1 4 2 TTTTTTTTTTTTCCCGGTCG 1 1 3 TTTTTTTTTTTTCATGTTCG 2 3 3 TTTTTTTTTTTGTCTGATGA 5 5 8 TTTTTTTTTTTGGGATGTTA 2 1 3 TTTTTTTTTTTGACACTCTA 6 5 1

To look up a 20 bp barcode in RefSeq with ID_REF, you can use the refseqR package in R, which allows you to retrieve associated mRNA, transcript, and protein accessions. Here’s how you can do it: Install the refseqR package. Use the refseq_fromGene function to retrieve the mRNA transcript and protein accessions for the given gene ID. If you have the 20 bp barcode, you can use the refseq_fromGene function with the appropriate sequence option to get the corresponding mRNA transcript and protein accessions. For example, if you have a gene ID like LOC101512347, you can use the following code: GeneID <- c(“LOC101512347”) transcript <- refseq_fromGene(GeneID, sequence = “transcript”) protein <- refseq_fromGene(GeneID, sequence = “protein”)

GeneID <- c("TTTTTTTTTTTTTTCGTCCC")

#transcript <- refseq_fromGene(GeneID,sequence="transcript")
#protein <- refseq_fromGene(GeneID, sequence="protein")

It says the NCBI servers are busy and try again a bit later.

Now load in the 1st multiple sclerosis 5 repeats of samples 20 base pair nucleic acid strands of complementary DNA.

controls <- read.csv("control.csv", header=T, sep=',', na.strings=c('',' ','na','NA'
                                                                    ))
commonBPs <- controls$ID_REF

dim(controls)

[1] 10838413 4

{r eval=FALSE, include=TRUE str(controls)

Lets go ahead and get the 5 repeats for the 1st MS patient into its dataframe.

MS1_rep1 <- read.table("GSM8874370_MS-1_rep1.hg38.barcode_count.txt",header=F)
str(MS1_rep1)

‘data.frame’: 27317734 obs. of 2 variables: $ V1: chr “CCCCTCTAAGCTTTCCCAAC” “CGGTCGGGGTCCCGTATGCG” “TGCCCTCACTCCATTGGTAG” “GCGGTCTGTCTGGCTATGCT” … $ V2: int 1 1 1 1 1 1 7 4 1 8 …

MS1_rep2 <- read.table("GSM8874371_MS-1_rep2.hg38.barcode_count.txt",header=F)
str(MS1_rep2)

‘data.frame’: 31523571 obs. of 2 variables: $ V1: chr “GAACCAGTCTGCGTGATCAC” “AGCTTTGTCTGTGGTCGGGC” “AACTTGCTGCTCCCCGTGTA” “CATGCTACGGCTGTGGGCGC” … $ V2: int 3 1 86 2 56 1 1 1 2 4 …

MS1_rep3 <- read.table("GSM8874372_MS-1_rep3.hg38.barcode_count.txt",header=F)
str(MS1_rep3)

‘data.frame’: 23113758 obs. of 2 variables: $ V1: chr “GACTTGTCAACAAGGGAAGC” “GAGTTCTCTGACGGAGACTA” “TCGGTTACTTAATGGGGTCA” “TTTTCCTGTTTTTTATCCGT” … $ V2: int 9 3 15 21 1 21 9 1 1 1 …

MS1_rep4 <- read.table("GSM8874373_MS-1_rep4.hg38.barcode_count.txt",header=F)
str(MS1_rep4)

‘data.frame’: 39272023 obs. of 2 variables: $ V1: chr “GCGCGGTGGGATAAGACATC” “CCCTCCTTCCTCACCACCTA” “GCAGTCTCACGATTGCATCT” “CAGTCTCTCACTTACGTTGC” … $ V2: int 15 1 1 118 1 1 1 1 4 1 …

MS1_rep5 <- read.table("GSM8874374_MS-1_rep5.hg38.barcode_count.txt",header=F)
str(MS1_rep5)

‘data.frame’: 35574942 obs. of 2 variables: $ V1: chr “TAATGAATTTCGCTCCAGAA” “CAAGGAAGATACCTTTGTTA” “CCCTTCATCTCCCCGCCCTC” “CTCTCAAAACTAACCTCCTC” … $ V2: int 1 93 1 16 1 1 4 1 1 1 …

Lets see which onces are in commonBPs so we can add them to our controls data frame.

MS1_1 <- MS1_rep1[which(MS1_rep1$V1 %in% commonBPs),]
rm(MS1_rep1)
dim(MS1_1)

[1] 9219633 2

MS1_2 <- MS1_rep2[which(MS1_rep2$V1 %in% commonBPs),]
rm(MS1_rep2)
dim(MS1_2)

[1] 10369355 2

MS1_3 <- MS1_rep3[which(MS1_rep3$V1 %in% commonBPs),]
rm(MS1_rep3)
dim(MS1_3)

[1] 10279747 2

MS1_4 <- MS1_rep4[which(MS1_rep4$V1 %in% commonBPs),]
rm(MS1_rep4)
dim(MS1_4)

[1] 10388707 2

MS1_5 <- MS1_rep5[which(MS1_rep5$V1 %in% commonBPs),]
rm(MS1_rep5)
dim(MS1_5)

[1] 10415580 2

Write these files out if computer crashes, because sometimes if you don’t put the data frame commands to include a list by another data frame object list, the Rstudio will keep running and not be interrupted and if you restart your computer or Rstudio then you have to do everything all over again because it won’t store the objects in Environment.

write.csv(MS1_1, "MS1_1.csv", row.names=F)
write.csv(MS1_2, "MS1_2.csv", row.names=F)
write.csv(MS1_3, "MS1_3.csv", row.names=F)
write.csv(MS1_4, "MS1_4.csv", row.names=F)
write.csv(MS1_5, "MS1_5.csv", row.names=F)
write.csv(names4, 'names4.csv', row.names=F)

Read in each file if you stopped and started again.

MS1_1 <- read.csv("MS1_1.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_2 <- read.csv("MS1_2.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_3 <- read.csv("MS1_3.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_4 <- read.csv("MS1_4.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_5 <- read.csv("MS1_5.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))

Now we will see which are in the set {1,2} of the 5 sets of MS first patient repeat samples. We find the common nucleotide strands in set 1 and 2 as {1,2}, then in set 3 and 4 as {3,4}, then in set {1,2} and {3,4} as set {1,2,3,4}, and then use set {1,2,3,4} to find what is common in set 5.

Set 1 and 2 as set12

set12 <- MS1_1[which(MS1_1$V1 %in% MS1_2$V1),]
dim(set12)

[1] 9158446 2

Which nucleotide strands are in set {3,4} so that sets 3 and 4 have common strands in set34.

set34 <- MS1_3[which(MS1_3$V1 %in% MS1_4$V1),]
dim(set34)

[1] 10187517 2

Which are in set 3 and set 4 as set1234

set1234 <- set12[which(set12$V1 %in% set34$V1),]
dim(set1234)

[1] 9035693 2

Now, which strands are in set1234 and set 5.

set12345 <- set1234[which(set1234$V1 %in% MS1_5$V1),]
dim(set12345)

[1] 9011758 2

All common nucleotide strands are in set12345 first feature V1.

commonMS1 <- set12345$V1

MS1_1common <- MS1_1[which(MS1_1$V1 %in% commonMS1),]
dim(MS1_1common)

[1] 9011758 2

MS1_2common <- MS1_2[which(MS1_2$V1 %in% commonMS1),]
dim(MS1_2common)

[1] 9011758 2

MS1_3common <- MS1_3[which(MS1_3$V1 %in% commonMS1),]
dim(MS1_3common)

[1] 9011758 2

MS1_4common <- MS1_4[which(MS1_4$V1 %in% commonMS1),]
dim(MS1_4common)

[1] 9011758 2

MS1_5common <- MS1_5[which(MS1_5$V1 %in% commonMS1),]
dim(MS1_5common)

[1] 9011758 2

Remove the clutter of unused objects in Environment. {r eval=FALSE, include=TRUE rm(MS1_1,MS1_2,MS1_3,MS1_4,MS1_5, set12,set34,set1234,set12345,commonBPs)

Get the controls to share only the common nucleotide strands in the first MS patient samples.

control <- controls[which(controls$ID_REF %in% commonMS1),]
rm(controls)
dim(control)

[1] 9011758 4

Now order the control and MS1 data sets so these gene strands of complementary DNA are in the same order to combine all samples thus far together.

control <- control[order(control$ID_REF, decreasing=T),]
head(control$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

MS1_1order <- MS1_1common[order(MS1_1common$V1,decreasing=T),]
rm(MS1_1common)
dim(MS1_1order)

[1] 9011758 2

MS1_2order <- MS1_2common[order(MS1_2common$V1,decreasing=T),]
rm(MS1_2common)
dim(MS1_2order)

[1] 9011758 2

MS1_3order <- MS1_3common[order(MS1_3common$V1,decreasing=T),]
rm(MS1_3common)
dim(MS1_3order)

[1] 9011758 2

MS1_4order <- MS1_4common[order(MS1_4common$V1,decreasing=T),]
rm(MS1_4common)
dim(MS1_4order)

[1] 9011758 2

MS1_5order <- MS1_5common[order(MS1_5common$V1,decreasing=T),]
rm(MS1_5common)
dim(MS1_5order)

[1] 9011758 2

Now give each MS1 data set a column name for its sample type as MS1 for 1st patient with MS and the repeat value of 1-5 for which one it is and last 4 numeric digits to the GSM sample ID.

colnames(MS1_1order) <- c("ID_REF", "MS1_r1_4370")
colnames(MS1_2order) <- c("ID_REF", "MS1_r2_4371")
colnames(MS1_3order) <- c("ID_REF", "MS1_r3_4372")
colnames(MS1_4order) <- c("ID_REF", "MS1_r4_4373")
colnames(MS1_5order) <- c("ID_REF", "MS1_r5_4374")

Lets look at all the first 5 samples in the ID_REF of our data sets before combining them.

head(control$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” [3] “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” [5] “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

head(MS1_1order$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” [3] “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” [5] “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

head(MS1_2order$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” [3] “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” [5] “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

head(MS1_3order$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” [3] “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” [5] “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

head(MS1_4order$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” [3] “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” [5] “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

head(MS1_5order$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” [3] “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” [5] “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

And also look at the tails of each of the sample ID_REF to make sure all is good before combining.

tail(MS1_1order$ID_REF)

[1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” [3] “AAAAAAAAAATCCACCTTCC” “AAAAAAAAAATAACGATCTG” [5] “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

tail(MS1_2order$ID_REF)

[1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” [3] “AAAAAAAAAATCCACCTTCC” “AAAAAAAAAATAACGATCTG” [5] “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

tail(MS1_3order$ID_REF)

[1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” [3] “AAAAAAAAAATCCACCTTCC” “AAAAAAAAAATAACGATCTG” [5] “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

tail(MS1_4order$ID_REF)

[1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” [3] “AAAAAAAAAATCCACCTTCC” “AAAAAAAAAATAACGATCTG” [5] “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

tail(MS1_5order$ID_REF)

[1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” [3] “AAAAAAAAAATCCACCTTCC” “AAAAAAAAAATAACGATCTG” [5] “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

controlsMS1 <- cbind(control,MS1_1order$MS1_r1_4370, MS1_2order$MS1_r2_4371,MS1_3order$MS1_r3_4372, MS1_4order$MS1_r4_4373,MS1_5order$MS1_r5_4374)

str(controlsMS1)

‘data.frame’: 9011758 obs. of 9 variables: $ ID_REF : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ control1.4362 : int 4 2 4 4 1 5 2 1 4 3 … $ control2.4363 : int 4 1 1 1 1 5 1 2 3 1 … $ control3.4364 : int 3 5 1 2 3 8 3 4 1 2 … $ MS1_1order$MS1_r1_4370: int 3 5 3 3 21 15 12 23 2 4 … $ MS1_2order$MS1_r2_4371: int 1 4 4 8 9 23 1 15 10 4 … $ MS1_3order$MS1_r3_4372: int 3 3 3 2 7 16 7 7 5 10 … $ MS1_4order$MS1_r4_4373: int 5 9 3 4 13 43 4 17 18 22 … $ MS1_5order$MS1_r5_4374: int 6 12 5 9 10 26 5 12 21 8 …

write.csv(controlsMS1, "controls_MS1.csv", row.names=F)

Now lets remove the clutter if continuing.

rm(control,MS1_1order,MS1_2order,MS1_3order,MS1_4order,MS1_5order,commonMS1)

We have a combined data set of genes in common between control repeats 1-3 and repeats 1-5 for the 1st MS patient. This is a 9,011,758 observation long data frame with 9 variables. The ID_REF was ordered after the common genes found for each sample and then combined. Each sample has the last 4 numbers of their GSM sample ID in the header name for the sample. This is the largest data set we have worked on and that time doesn’t even include running machine learning to make predictions because the original data were each 10-50 million gene samples as nucleotide base pairs 20 base pairs or bp long.

====================================================================

We have to add the 2nd MS patient 5 repeats of data here in this section to our controlsMS1.

controlsMS1 <- read.csv("controls_MS1.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
str(controlsMS1)

‘data.frame’: 9011758 obs. of 9 variables: $ ID_REF : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ control1.4362 : int 4 2 4 4 1 5 2 1 4 3 … $ control2.4363 : int 4 1 1 1 1 5 1 2 3 1 … $ control3.4364 : int 3 5 1 2 3 8 3 4 1 2 … $ MS1_1order.MS1_r1_4370: int 3 5 3 3 21 15 12 23 2 4 … $ MS1_2order.MS1_r2_4371: int 1 4 4 8 9 23 1 15 10 4 … $ MS1_3order.MS1_r3_4372: int 3 3 3 2 7 16 7 7 5 10 … $ MS1_4order.MS1_r4_4373: int 5 9 3 4 13 43 4 17 18 22 … $ MS1_5order.MS1_r5_4374: int 6 12 5 9 10 26 5 12 21 8 …

Store the common nucleotide strings in a vector object.

commonID_REF <- controlsMS1$ID_REF

MS2_1 <- read.table("GSM8874375_MS-2_rep1.hg38.barcode_count.txt", header=F)
dim(MS2_1)

[1] 35173454 2

MS2_2 <- read.table("GSM8874376_MS-2_rep2.hg38.barcode_count.txt", header=F)
dim(MS2_2)

[1] 35510754 2

MS2_3 <- read.table("GSM8874377_MS-2_rep3.hg38.barcode_count.txt", header=F)
dim(MS2_3)

[1] 32775712 2

MS2_4 <- read.table("GSM8874378_MS-2_rep4.hg38.barcode_count.txt", header=F)
dim(MS2_4)

[1] 34362486 2

MS2_5 <- read.table("GSM8874379_MS-2_rep5.hg38.barcode_count.txt", header=F)
dim(MS2_5)

[1] 32812495 2

Next are extracting only those genes in the 5 repeat samples that are also in the control 3 repeats and MS1 5 repeat samples.

MS2_1_commons <- MS2_1[which(MS2_1$V1 %in% commonID_REF),]
rm(MS2_1)
dim(MS2_1_commons)

[1] 9001606 2

MS2_2_commons <- MS2_2[which(MS2_2$V1 %in% commonID_REF),]
rm(MS2_2)
dim(MS2_2_commons)

[1] 8999305 2

MS2_3_commons <- MS2_3[which(MS2_3$V1 %in% commonID_REF),]
rm(MS2_3)
dim(MS2_3_commons)

[1] 8989982 2

MS2_4_commons <- MS2_4[which(MS2_4$V1 %in% commonID_REF),]
rm(MS2_4)
dim(MS2_4_commons)

[1] 8994641 2

MS2_5_commons <- MS2_5[which(MS2_5$V1 %in% commonID_REF),]
rm(MS2_5)
dim(MS2_5_commons)

[1] 8999356 2

Now we find the nucleotide strings common to {1,2} as set12, and {3,4} as set34, and to {1,2,3,4} as set1234, and common to {1,2,3,4,5} as set12345

common to set 1 and 2

set12 <- MS2_1_commons[which(MS2_1_commons$V1 %in% MS2_2_commons$V1),]
dim(set12)

[1] 8989837 2

common to set 3 and 4:

set34 <- MS2_3_commons[which(MS2_3_commons$V1 %in% MS2_4_commons$V1),]
dim(set34)

[1] 8974237 2

common to set 1,2, 3, and 4:

set1234 <- set12[which(set12$V1 %in% set34$V1),]
dim(set1234)

[1] 8955579 2

common to set 1,2,3,4, and 5:

set12345 <- set1234[which(set1234$V1 %in% MS2_5_commons$V1),]
dim(set12345)

[1] 8946025 2

The common strings of nucleotides to the 2nd patient’s 5 repeats.

common2ndID_REF <- set12345$V1

Now use the controlsMS1 data frame ID_Ref to get those in common2ndID_REF names to combine the 2nd patient samples to it, after ordering each of sets MS2_1_commons through MS2_5_commons by those common genes in all.There are less names in the 2nd set batch than 1st data set batch.

Clear up space by removing these objects we won’t use later.

rm(set12,set34,set1234,set12345,commonID_REF)

NewControlsMS1 <- controlsMS1[which(controlsMS1$ID_REF %in% common2ndID_REF),]
dim(NewControlsMS1)

[1] 8946025 9

MS2_1 <- MS2_1_commons[which(MS2_1_commons$V1 %in% common2ndID_REF),]
dim(MS2_1)

[1] 8946025 2

MS2_2 <- MS2_2_commons[which(MS2_2_commons$V1 %in% common2ndID_REF),]
dim(MS2_2)

[1] 8946025 2

MS2_3 <- MS2_3_commons[which(MS2_3_commons$V1 %in% common2ndID_REF),]
dim(MS2_3)

[1] 8946025 2

MS2_4 <- MS2_4_commons[which(MS2_4_commons$V1 %in% common2ndID_REF),]
dim(MS2_4)

[1] 8946025 2

MS2_5 <- MS2_5_commons[which(MS2_5_commons$V1 %in% common2ndID_REF),]
dim(MS2_5)

[1] 8946025 2

Remove the clutter of unused objects.

rm(MS2_1_commons, MS2_2_commons, MS2_3_commons, MS2_4_commons, MS2_5_commons,controlsMS1,common2ndID_REF)
ls()

[1] “MS2_1” “MS2_2” “MS2_3” “MS2_4”
[5] “MS2_5” “NewControlsMS1”

Order all the variables in these data sets before combining and add new column name identifiers to each MS2 table.

NewControlsMS1 <- NewControlsMS1[order(NewControlsMS1$ID_REF,decreasing=T),]
head(NewControlsMS1$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

write.csv(NewControlsMS1,'newControlsMS1.csv',row.names=F)

Add the column names to the MS2 5 samples and order them as well. Name them by their 2nd patient with MS as MS2, the repeat number and last 4 digits of GSM sample ID>

colnames(MS2_1) <- c("ID_REF", "MS2_r1_4375")
MS2_1o <- MS2_1[order(MS2_1$ID_REF, decreasing=T),]
head(MS2_1o$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

colnames(MS2_2) <- c("ID_REF", "MS2_r2_4376")
MS2_2o <- MS2_2[order(MS2_2$ID_REF, decreasing=T),]
head(MS2_2o$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

colnames(MS2_3) <- c("ID_REF", "MS2_r3_4377")
MS2_3o <- MS2_3[order(MS2_3$ID_REF, decreasing=T),]
head(MS2_3o$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

colnames(MS2_4) <- c("ID_REF", "MS2_r4_4378")
MS2_4o <- MS2_4[order(MS2_4$ID_REF, decreasing=T),]
head(MS2_4o$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

colnames(MS2_5) <- c("ID_REF", "MS2_r5_4379")
MS2_5o <- MS2_5[order(MS2_5$ID_REF, decreasing=T),]
head(MS2_5o$ID_REF)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA”

Remove the old data frames we won’t use again. {r eval=FALSE, include=TRUE rm(MS2_1,MS2_2,MS2_3,MS2_4,MS2_5)

Get the tails to make sure the last nucleotide strands are the same in all sets before combining them.

{r eval=FALSE, include=TRUE tail(NewControlsMS1$ID_REF) [1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” “AAAAAAAAAATCCACCTTCC” [4] “AAAAAAAAAATAACGATCTG” “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

{r eval=FALSE, include=TRUE tail(MS2_1o$ID_REF) [1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” “AAAAAAAAAATCCACCTTCC” [4] “AAAAAAAAAATAACGATCTG” “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

{r eval=FALSE, include=TRUE tail(MS2_2o$ID_REF) [1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” “AAAAAAAAAATCCACCTTCC” [4] “AAAAAAAAAATAACGATCTG” “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

{r eval=FALSE, include=TRUE tail(MS2_3o$ID_REF) [1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” “AAAAAAAAAATCCACCTTCC” [4] “AAAAAAAAAATAACGATCTG” “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

{r eval=FALSE, include=TRUE tail(MS2_4o$ID_REF) [1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” “AAAAAAAAAATCCACCTTCC” [4] “AAAAAAAAAATAACGATCTG” “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

{r eval=FALSE, include=TRUE tail(MS2_5o$ID_REF) [1] “AAAAAAAAACCCATTCCGGA” “AAAAAAAAACACGCATATTT” “AAAAAAAAAATCCACCTTCC” [4] “AAAAAAAAAATAACGATCTG” “AAAAAAAAAAGCTCTGTCCT” “AAAAAAAAAAGATAACTCTG”

Now that the first and last entries are the same and observation count across samples, we can combine the 2nd MS samples to the big data frame of controls and 1st MS patient samples.

control_MS1_MS2 <- cbind(NewControlsMS1,MS2_1o$MS2_r1_4375,MS2_2o$MS2_r2_4376,MS2_3o$MS2_r3_4377,MS2_4o$MS2_r4_4378,MS2_5o$MS2_r5_4379)

str(control_MS1_MS2)

Lets make sure are sample names have similar feature descriptors.

colnames(control_MS1_MS2)

Write this data frame out to table. We still have the commercial 5 repeat samples of an MS patient to add using the same algorithm.

write.csv(control_MS1_MS2, "controls_MS1_MS2.csv", row.names=F)

control_MS1_MS2 <- read.csv("controls_MS1_MS2.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
summary(control_MS1_MS2)

ID_REF control1.4362 control2.4363 control3.4364
Length:8946025 Min. : 1.00 Min. : 1.00 Min. : 1.00
Class :character 1st Qu.: 6.00 1st Qu.: 5.00 1st Qu.: 5.00
Mode :character Median : 10.00 Median : 10.00 Median : 10.00
Mean : 11.95 Mean : 11.56 Mean : 11.66
3rd Qu.: 16.00 3rd Qu.: 16.00 3rd Qu.: 16.00
Max. :724.00 Max. :634.00 Max. :693.00
MS1_1order.MS1_r1_4370 MS1_2order.MS1_r2_4371 MS1_3order.MS1_r3_4372 Min. : 1.00 Min. : 1.0 Min. : 1.00
1st Qu.: 12.00 1st Qu.: 14.0 1st Qu.: 10.00
Median : 24.00 Median : 26.0 Median : 18.00
Mean : 32.96 Mean : 33.1 Mean : 23.27
3rd Qu.: 45.00 3rd Qu.: 43.0 3rd Qu.: 30.00
Max. :2287.00 Max. :2734.0 Max. :2089.00
MS1_4order.MS1_r4_4373 MS1_5order.MS1_r5_4374 MS2_1o.MS2_r1_4375 Min. : 1.00 Min. : 1.00 Min. : 1.00
1st Qu.: 20.00 1st Qu.: 18.00 1st Qu.: 16.00
Median : 37.00 Median : 32.00 Median : 27.00
Mean : 47.28 Mean : 40.45 Mean : 33.74
3rd Qu.: 62.00 3rd Qu.: 53.00 3rd Qu.: 44.00
Max. :3993.00 Max. :3215.00 Max. :2398.00
MS2_2o.MS2_r2_4376 MS2_3o.MS2_r3_4377 MS2_4o.MS2_r4_4378 MS2_5o.MS2_r5_4379 Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
1st Qu.: 16.00 1st Qu.: 14.00 1st Qu.: 16.00 1st Qu.: 15.00
Median : 27.00 Median : 24.00 Median : 27.00 Median : 25.00
Mean : 33.39 Mean : 30.22 Mean : 33.72 Mean : 30.87
3rd Qu.: 44.00 3rd Qu.: 40.00 3rd Qu.: 44.00 3rd Qu.: 41.00
Max. :2412.00 Max. :2127.00 Max. :2298.00 Max. :2173.00

str(control_MS1_MS2)

‘data.frame’: 8946025 obs. of 14 variables: $ ID_REF : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ control1.4362 : int 4 2 4 4 1 5 1 4 3 2 … $ control2.4363 : int 4 1 1 1 1 5 2 3 1 3 … $ control3.4364 : int 3 5 1 2 3 8 4 1 2 3 … $ MS1_1order.MS1_r1_4370: int 3 5 3 3 21 15 23 2 4 11 … $ MS1_2order.MS1_r2_4371: int 1 4 4 8 9 23 15 10 4 4 … $ MS1_3order.MS1_r3_4372: int 3 3 3 2 7 16 7 5 10 7 … $ MS1_4order.MS1_r4_4373: int 5 9 3 4 13 43 17 18 22 24 … $ MS1_5order.MS1_r5_4374: int 6 12 5 9 10 26 12 21 8 5 … $ MS2_1o.MS2_r1_4375 : int 4 5 3 1 15 19 9 8 12 12 … $ MS2_2o.MS2_r2_4376 : int 8 3 7 8 7 27 19 6 7 13 … $ MS2_3o.MS2_r3_4377 : int 11 10 8 7 4 25 19 6 6 8 … $ MS2_4o.MS2_r4_4378 : int 3 8 4 4 7 19 22 11 8 20 … $ MS2_5o.MS2_r5_4379 : int 4 9 5 5 4 17 21 9 8 14 …

We need to add the 3rd batch that is the B cell commercial line they used to compare in the controls and two multiple sclerosis patients they received samples from.

We have a data frame of 14 features where the 3 controls and 5 repeats each from the 2 multiple sclerosis patients and the target feature that is the ID_REF field with the so far 8,946,025 nucleotide strands of 20 base pairs of complementary DNA as they have thymine (T) in them, that is what reverse transcription of RNA into DNA means, they get the complementary DNA or cDNA strand from the messenger RNA or mRNA to see the allelic copy variants that can happen at the nucleic acid base level with replacement of one nucleic acid that may not affect the triplet codon amino acid it forms to form the translated protein at the ribosomal process of turning mRNA into a protein.

The current stored size of this file is 516.4 Mb on my computer and most likely not small enough for Google Sheets to share. But you can get those samples from NCBI by looking up the GSE and GPL or GSM names given in the beginning of this document and run the code above to get to the almost 9 million nucleotide base pairs with 14 variables before adding in the additional 5 comparison repeats of commercial B cells of a patient with MS.

Thanks.

=========================================================== Now for the 3rd batch of adding in the commercial B-cell patient with MS used as a comparison to the two participating MS patient samples added to data frame and controls already.

control_MS1_MS2 <- read.csv("controls_MS1_MS2.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
summary(control_MS1_MS2)

ID_REF control1.4362 control2.4363 control3.4364
Length:8946025 Min. : 1.00 Min. : 1.00 Min. : 1.00
Class :character 1st Qu.: 6.00 1st Qu.: 5.00 1st Qu.: 5.00
Mode :character Median : 10.00 Median : 10.00 Median : 10.00
Mean : 11.95 Mean : 11.56 Mean : 11.66
3rd Qu.: 16.00 3rd Qu.: 16.00 3rd Qu.: 16.00
Max. :724.00 Max. :634.00 Max. :693.00
MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
Min. : 1.00 Min. : 1.0 Min. : 1.00 Min. : 1.00
1st Qu.: 12.00 1st Qu.: 14.0 1st Qu.: 10.00 1st Qu.: 20.00
Median : 24.00 Median : 26.0 Median : 18.00 Median : 37.00
Mean : 32.96 Mean : 33.1 Mean : 23.27 Mean : 47.28
3rd Qu.: 45.00 3rd Qu.: 43.0 3rd Qu.: 30.00 3rd Qu.: 62.00
Max. :2287.00 Max. :2734.0 Max. :2089.00 Max. :3993.00
MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377
Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
1st Qu.: 18.00 1st Qu.: 16.00 1st Qu.: 16.00 1st Qu.: 14.00
Median : 32.00 Median : 27.00 Median : 27.00 Median : 24.00
Mean : 40.45 Mean : 33.74 Mean : 33.39 Mean : 30.22
3rd Qu.: 53.00 3rd Qu.: 44.00 3rd Qu.: 44.00 3rd Qu.: 40.00
Max. :3215.00 Max. :2398.00 Max. :2412.00 Max. :2127.00
MS2_r4_4378 MS2_r5_4379
Min. : 1.00 Min. : 1.00
1st Qu.: 16.00 1st Qu.: 15.00
Median : 27.00 Median : 25.00
Mean : 33.72 Mean : 30.87
3rd Qu.: 44.00 3rd Qu.: 41.00
Max. :2298.00 Max. :2173.00

str(control_MS1_MS2)

‘data.frame’: 8946025 obs. of 14 variables: $ ID_REF : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ control1.4362: int 4 2 4 4 1 5 1 4 3 2 … $ control2.4363: int 4 1 1 1 1 5 2 3 1 3 … $ control3.4364: int 3 5 1 2 3 8 4 1 2 3 … $ MS1_r1_4370 : int 3 5 3 3 21 15 23 2 4 11 … $ MS1_r2_4371 : int 1 4 4 8 9 23 15 10 4 4 … $ MS1_r3_4372 : int 3 3 3 2 7 16 7 5 10 7 … $ MS1_r4_4373 : int 5 9 3 4 13 43 17 18 22 24 … $ MS1_r5_4374 : int 6 12 5 9 10 26 12 21 8 5 … $ MS2_r1_4375 : int 4 5 3 1 15 19 9 8 12 12 … $ MS2_r2_4376 : int 8 3 7 8 7 27 19 6 7 13 … $ MS2_r3_4377 : int 11 10 8 7 4 25 19 6 6 8 … $ MS2_r4_4378 : int 3 8 4 4 7 19 22 11 8 20 … $ MS2_r5_4379 : int 4 9 5 5 4 17 21 9 8 14 …

head(control_MS1_MS2$ID_REF,10)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTTCCCGGTCG” “TTTTTTTTTTTGTCTGATGA” [7] “TTTTTTTTTTTCTTCCTGGC” “TTTTTTTTTTTCGTTCTGAA” “TTTTTTTTTTTCGTGTTTCC” [10] “TTTTTTTTTTTCGGCTGCTC”

MScommercial_r1 <- read.table("GSM8874365_GM12878_rep1.hg38.barcode_count.txt", header=F)

str(MScommercial_r1)

‘data.frame’: 30950992 obs. of 2 variables: $ V1: chr “TCGCGTTGTACGCCTGATCC” “TTAGTATTACTAGTCGATCT” “CCTTCTGTTTTGGGTACAAT” “CTATTGAGGCTAAGTTTGCA” … $ V2: int 59 13 1 39 9 1 1 18 1 1 …

MScommercial_r2 <- read.table("GSM8874366_GM12878_rep2.hg38.barcode_count.txt", header=F)
str(MScommercial_r2)

‘data.frame’: 31688321 obs. of 2 variables: $ V1: chr “GCTTTGTCCATTTAAGTACT” “ATACCTTCCCCTTCCACCTC” “CTCGCCCGACTATTGTTGCT” “CATTGACCTAGAGCAGGGAG” … $ V2: int 20 1 1 1 57 1 56 1 16 1 …

MScommercial_r3 <- read.table("GSM8874367_GM12878_rep3.hg38.barcode_count.txt", header=F)
str(MScommercial_r3)

‘data.frame’: 32667987 obs. of 2 variables: $ V1: chr “CGGGACTTAAACTGCCTGTC” “CCCTACCACTCCGGACTCCC” “TCGGTTCCCAATCACGTTCC” “GAGTTCTGGTTGTGCGTCCT” … $ V2: int 1 1 16 8 1 1 8 1 1 1 …

MScommercial_r4 <- read.table("GSM8874368_GM12878_rep4.hg38.barcode_count.txt", header=F)
str(MScommercial_r4)

‘data.frame’: 28423531 obs. of 2 variables: $ V1: chr “CCTGTTCTCATTTCCGCTCC” “GTCGGATGTCCCCATACACG” “CGGATAAATATTTGTTGTTA” “CTGAGAATACCGCAATTATT” … $ V2: int 1 5 24 8 1 18 1 8 1 1 …

MScommercial_r5 <- read.table("GSM8874369_GM12878_rep5.hg38.barcode_count.txt", header=F)
str(MScommercial_r4)

NOw shrink these to the same size as the controls and other two MS patient samples.

comm1 <- MScommercial_r1[which(MScommercial_r1$V1 %in% control_MS1_MS2$ID_REF),]
str(comm1)

‘data.frame’: 8930146 obs. of 2 variables: $ V1: chr “TCGCGTTGTACGCCTGATCC” “TTAGTATTACTAGTCGATCT” “CTATTGAGGCTAAGTTTGCA” “TCTAATAGCCGCAATATGTT” … $ V2: int 59 13 39 18 50 18 13 60 19 3 …

comm2 <- MScommercial_r2[which(MScommercial_r2$V1 %in% control_MS1_MS2$ID_REF),]
str(comm2)

‘data.frame’: 8923413 obs. of 2 variables: $ V1: chr “GCTTTGTCCATTTAAGTACT” “TATTCGGGCTACCTGATCGC” “ACCTGTCACTGGCCATGTAC” “TCTACACTCGAAAAAACTCA” … $ V2: int 20 57 56 16 5 17 42 26 65 28 …

comm3 <- MScommercial_r3[which(MScommercial_r3$V1 %in% control_MS1_MS2$ID_REF),]
str(comm3)

‘data.frame’: 8934208 obs. of 2 variables: $ V1: chr “GAGTTCTGGTTGTGCGTCCT” “AGCCGGTCGAACTCTTTGTA” “TTTCGGGAGGCTGTTATAAT” “GGTTGGATCGGCTTTCTCGG” … $ V2: int 8 8 21 77 295 31 50 33 36 27 …

comm4 <- MScommercial_r4[which(MScommercial_r4$V1 %in% control_MS1_MS2$ID_REF),]
str(comm4)

‘data.frame’: 8914358 obs. of 2 variables: $ V1: chr “CGGATAAATATTTGTTGTTA” “CTGAGAATACCGCAATTATT” “ACTCCCACGCTATCCTACAC” “ACGTTAAGAGGTTGGATCTG” … $ V2: int 24 8 18 21 24 31 4 69 14 5 …

comm5 <- MScommercial_r5[which(MScommercial_r5$V1 %in% control_MS1_MS2$ID_REF),]
str(comm5)

‘data.frame’: 8907648 obs. of 2 variables: $ V1: chr “TCGTCGCTCTCCATGCACCG” “CACGGCTGCCCGAGTCTCTT” “TGGTATTTCAATGGTCGGTC” “CTAAACGGGCTTATTTGTCA” … $ V2: int 8 13 9 56 21 14 28 11 10 33 …

Remove the unused objects.

rm(MScommercial_r1,MScommercial_r2,MScommercial_r3,MScommercial_r4,MScommercial_r5)

Now we have to get the same number of common nucleotides in each of these 5 commercial B-cell line MS patient strings.

Now we have to make a set common to the set of {1,2} then {3,4}, then set {5}, get set {1,3} then whats common between that set an set {3,4} to get set {1,2,3,4} and then use that to get what is common with set{5}

set1_2 <- comm1[which(comm1$V1 %in% comm2$V1),]
str(set1_2)

‘data.frame’: 8908632 obs. of 2 variables: $ V1: chr “TCGCGTTGTACGCCTGATCC” “TTAGTATTACTAGTCGATCT” “CTATTGAGGCTAAGTTTGCA” “TCTAATAGCCGCAATATGTT” … $ V2: int 59 13 39 18 50 18 13 60 19 3 …

set3_4 <- comm3[which(comm3$V1 %in% comm4$V1),]
str(set3_4)

‘data.frame’: 8903674 obs. of 2 variables: $ V1: chr “GAGTTCTGGTTGTGCGTCCT” “AGCCGGTCGAACTCTTTGTA” “TTTCGGGAGGCTGTTATAAT” “GGTTGGATCGGCTTTCTCGG” … $ V2: int 8 8 21 77 295 31 50 33 36 27 …

set1_2_3_4 <- set1_2[which(set1_2$V1 %in% set3_4$V1),]
str(set1_2_3_4)

‘data.frame’: 8870531 obs. of 2 variables: $ V1: chr “TCGCGTTGTACGCCTGATCC” “TTAGTATTACTAGTCGATCT” “CTATTGAGGCTAAGTTTGCA” “TCTAATAGCCGCAATATGTT” … $ V2: int 59 13 39 18 50 18 13 60 19 3 …

set1_5 <- set1_2_3_4[which(set1_2_3_4$V1 %in% comm5$V1),]
str(set1_2_3_4)

We have the common strings of nucleotides between the 5 sets of commercial line MS patient from those common with controls and the other realparticipating MS patients samples.

commonNucleotides <- set1_5$V1
str(commonNucleotides)

chr [1:8838657] “TCGCGTTGTACGCCTGATCC” “CTATTGAGGCTAAGTTTGCA” …

Now we must get those common to this strand of nucleotides in all sets of commercial and our bigger data frame of MS patients and controls.

bigCommon <- control_MS1_MS2[which(control_MS1_MS2$ID_REF %in% commonNucleotides),]
str(bigCommon)

‘data.frame’: 8838657 obs. of 14 variables: $ ID_REF : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ control1.4362: int 4 2 4 4 5 1 4 3 2 5 … $ control2.4363: int 4 1 1 1 5 2 3 1 3 2 … $ control3.4364: int 3 5 1 2 8 4 1 2 3 3 … $ MS1_r1_4370 : int 3 5 3 3 15 23 2 4 11 11 … $ MS1_r2_4371 : int 1 4 4 8 23 15 10 4 4 6 … $ MS1_r3_4372 : int 3 3 3 2 16 7 5 10 7 1 … $ MS1_r4_4373 : int 5 9 3 4 43 17 18 22 24 17 … $ MS1_r5_4374 : int 6 12 5 9 26 12 21 8 5 8 … $ MS2_r1_4375 : int 4 5 3 1 19 9 8 12 12 6 … $ MS2_r2_4376 : int 8 3 7 8 27 19 6 7 13 5 … $ MS2_r3_4377 : int 11 10 8 7 25 19 6 6 8 6 … $ MS2_r4_4378 : int 3 8 4 4 19 22 11 8 20 4 … $ MS2_r5_4379 : int 4 9 5 5 17 21 9 8 14 8 …

Lets remove clutter before getting the other 5 sets.

rm(set1_2, set1_2_3_4, set1_5, set3_4)

commercial1 <- comm1[which(comm1$V1 %in% commonNucleotides),]
str(commercial1)

‘data.frame’: 8838657 obs. of 2 variables: $ V1: chr “TCGCGTTGTACGCCTGATCC” “CTATTGAGGCTAAGTTTGCA” “TCTAATAGCCGCAATATGTT” “AATGGGCCACCGTGTTTCTA” … $ V2: int 59 39 18 50 18 13 60 19 3 37 …

commercial2 <- comm2[which(comm2$V1 %in% commonNucleotides),]
str(commercial2)

‘data.frame’: 8838657 obs. of 2 variables: $ V1: chr “GCTTTGTCCATTTAAGTACT” “TATTCGGGCTACCTGATCGC” “ACCTGTCACTGGCCATGTAC” “TCTACACTCGAAAAAACTCA” … $ V2: int 20 57 56 16 5 17 42 26 65 28 …

commercial3 <- comm3[which(comm3$V1 %in% commonNucleotides),]
str(commercial3)

‘data.frame’: 8838657 obs. of 2 variables: $ V1: chr “GAGTTCTGGTTGTGCGTCCT” “AGCCGGTCGAACTCTTTGTA” “TTTCGGGAGGCTGTTATAAT” “GGTTGGATCGGCTTTCTCGG” … $ V2: int 8 8 21 77 295 31 50 33 36 27 …

commercial4 <- comm4[which(comm4$V1 %in% commonNucleotides),]
str(commercial4)

‘data.frame’: 8838657 obs. of 2 variables: $ V1: chr “CGGATAAATATTTGTTGTTA” “CTGAGAATACCGCAATTATT” “ACTCCCACGCTATCCTACAC” “ACGTTAAGAGGTTGGATCTG” … $ V2: int 24 8 18 21 24 31 4 69 14 5 …

commercial5 <- comm5[which(comm5$V1 %in% commonNucleotides),]
str(commercial5)

‘data.frame’: 8838657 obs. of 2 variables: $ V1: chr “TCGTCGCTCTCCATGCACCG” “CACGGCTGCCCGAGTCTCTT” “TGGTATTTCAATGGTCGGTC” “CTAAACGGGCTTATTTGTCA” … $ V2: int 8 13 9 56 21 14 28 11 10 33 …

Now we must order these all before combining them, they all have the same observations of same nucleotide strings.

Big1 <- bigCommon[order(bigCommon$ID_REF,decreasing=T),]
head(Big1$ID_REF,10)

[1] “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” [4] “TTTTTTTTTTTTCTGCTATG” “TTTTTTTTTTTGTCTGATGA” “TTTTTTTTTTTCTTCCTGGC” [7] “TTTTTTTTTTTCGTTCTGAA” “TTTTTTTTTTTCGTGTTTCC” “TTTTTTTTTTTCGGCTGCTC” [10] “TTTTTTTTTTTCGCCTAAGG”

commercial1o <- commercial1[order(commercial1$V1, decreasing=T),]
head(commercial1o$V1,10)

commercial2o <- commercial2[order(commercial2$V1, decreasing=T),]
head(commercial2o$V1,10)

commercial3o <- commercial3[order(commercial3$V1, decreasing=T),]
head(commercial3o$V1,10)

commercial4o <- commercial4[order(commercial4$V1, decreasing=T),]
head(commercial4o$V1,10)

commercial5o <- commercial5[order(commercial5$V1, decreasing=T),]
head(commercial5o$V1,10)

Now lets change the names of the columns for the 5 samples of commercial MS patient.

colnames(commercial1o) <- c("ID_REF","commercial_r1_4365")
head(commercial1o)

colnames(commercial2o) <- c("ID_REF","commercial_r2_4366")
head(commercial2o)

colnames(commercial3o) <- c("ID_REF","commercial_r3_4367")
head(commercial3o)

colnames(commercial4o) <- c("ID_REF","commercial_r4_4368")
head(commercial4o)

colnames(commercial5o) <- c("ID_REF","commercial_r5_4369")
head(commercial5o)

Lets remove some clutter.

rm(comm1,comm2,comm3,comm4,comm5,commercial1,commercial2,commercial3,commercial4,commercial5,control_MS1_MS2,bigCommon,commonNucleotides)
ls()

Now, we can combine all samples into 1 data set of all controls, participating MS patients, and commercial comparison MS samples.

allSamples <- cbind(Big1,commercial1o$commercial_r1_4365,commercial2o$commercial_r2_4366,commercial3o$commercial_r3_4367,commercial4o$commercial_r4_4368,commercial5o$commercial_r5_4369)
str(allSamples)

‘data.frame’: 8838657 obs. of 19 variables: $ ID_REF : chr “TTTTTTTTTTTTTTCGTCCC” “TTTTTTTTTTTTTCCTTGCT” “TTTTTTTTTTTTGCAGTGAT” “TTTTTTTTTTTTCTGCTATG” … $ control1.4362 : int 4 2 4 4 5 1 4 3 2 5 … $ control2.4363 : int 4 1 1 1 5 2 3 1 3 2 … $ control3.4364 : int 3 5 1 2 8 4 1 2 3 3 … $ MS1_r1_4370 : int 3 5 3 3 15 23 2 4 11 11 … $ MS1_r2_4371 : int 1 4 4 8 23 15 10 4 4 6 … $ MS1_r3_4372 : int 3 3 3 2 16 7 5 10 7 1 … $ MS1_r4_4373 : int 5 9 3 4 43 17 18 22 24 17 … $ MS1_r5_4374 : int 6 12 5 9 26 12 21 8 5 8 … $ MS2_r1_4375 : int 4 5 3 1 19 9 8 12 12 6 … $ MS2_r2_4376 : int 8 3 7 8 27 19 6 7 13 5 … $ MS2_r3_4377 : int 11 10 8 7 25 19 6 6 8 6 … $ MS2_r4_4378 : int 3 8 4 4 19 22 11 8 20 4 … $ MS2_r5_4379 : int 4 9 5 5 17 21 9 8 14 8 … $ commercial1o$commercial_r1_4365: int 5 5 6 9 24 14 6 4 7 5 … $ commercial2o$commercial_r2_4366: int 8 8 8 13 16 16 8 6 7 6 … $ commercial3o$commercial_r3_4367: int 5 3 5 6 33 17 8 4 13 6 … $ commercial4o$commercial_r4_4368: int 9 8 4 3 29 12 4 7 8 4 … $ commercial5o$commercial_r5_4369: int 1 8 2 2 16 10 6 5 18 1 …

write.csv(allSamples, 'allSampleRepeatsControlsMS1MS2Commercial.csv',row.names=F)

Get this 631.5 Mb file here.

Next we will do machine learning on this file. to get the top fold change strands and shrink dimensions then use those to predict with Random Forest the sample as control or multiple sclerosis.

MS 20bp gene fragment analysis 3rd batch Data Extraction

Janis Harris

2026-01-13