This is the 2nd batch document to import and make transformations to it before adding to the 1st batch of controls and 1st patient samples of transformations. These are very large files of 10-50 Million rows of nucleotide strings 20 base pairs long but only 2 dimensions each. The 1st batch is in this document but code not ran, it can take a large amount of time to run each chunk once import and transformations begin. There will be a 3rd batch as well done separately for the B cell samples of repeats as comparison the study used.
This is the Multiple Sclerosis (MS) gene expression found on NCBI for the GSE293036-GPL34284 series and platform. There are 3 repeats of a control, 5 repeats of a MS patient, 5 repeats of one other MS patient, and 5 repeats of a B-cell line that is commercially available to find the MS loci that known gene studies have found associated with MS. The study says they found 83 independent MS risk loci, 150 allelic enhancer variants, and 286 allelic silencing variants. Enhancer could be upregulated gene expression and silencer could be down regulated gene expression. The ID_REF was used in this data and I never worked with it directly but know BLAST or basic alignment of sequencing tool online works with it but with an internet search wasn’t able to get a gene. I had already tried genecards.org and nothing. But once I realized I should be getting the type of gene label used as they are specific or standardized within their bioinformatics community as ‘ID_REF’ just like HGNC and ENSEMBL and GeneCards ID is, then the internet said to use the following program in R, with the use of Bioconductor for analyzing the 20 base pair sequences and returnign their gene, or reverse to get the 20 base pair sequence from a gene.
These are large samples, each one uploaded, different amount of genes for the control repeats and had to extract the same genes in all 3 sample repeats of the control group and then order them from largest to smallest so I could column bind them together. There are over 10 million nucleotide sequences of 20 base pairs long in the data so it takes a while to upload and modify at each step because it is 10 million observations or more, but only 2 columns each sample for counts of each nucleotide strand that gives allele information as some variants of a gene have different amino acid make up from one nucleotide being different or more but still operating the same in the finished protein, that could be the alleles or gene variants MS patients have in common that demyelinates their central nervous system in the brain, cord, and peripheral nerves in later stages. You can see the brain demyelination as white spots on the brain from the demyelinated lipids of brain tissue from what I remember in special imaging and differential diagnosis, possibly with a brain CT that operates at a different window to look at soft tissue inside the cranium and not a skull CT to look at bone.
Running the gene expression data in 20 base pair nucleotide allele form and found on internet that Bioconductor can find the gene name as Gene cards wasn’t allowing it. Install these packages if you haven’t and run package refseqR to get gene IDs for the REFSEQ nucleotide strings in ID_REF format.
These data frames are huge 10-50 million rows each sample, so it takes a lot of memory to store each one while transforming it or modifying it, as needed remove the ones not used after your done with it in its new transformation.
#install.packages("refseqR")
#install.packages("Biostrings")
#doesn't work have to install next line for bioconductor
#install.packages("BiocManager")
#BiocManager::install("Biostrings")
#library(refseqR)
header <- read.table("GSE293036-GPL34284_series_matrix.txt", comment.char='!',
header = T)
str(header)
cntrl_rep3 <- read.table('GSM8874364_control_rep3.hg38.barcode_count.txt', header=F)
cntrl_rep2 <- read.table('GSM8874363_control_rep2.hg38.barcode_count.txt', header=F)
cntrl_rep1 <- read.table('GSM8874362_control_rep1.hg38.barcode_count.txt', header=F)
same <- cntrl_rep1[which(cntrl_rep1$V1 %in% cntrl_rep2$V1),]
same2 <- cntrl_rep1[which(cntrl_rep1$V1 %in% cntrl_rep3$V1),]
sameAll <- same[which(same$V1 %in% same2$V1),]
rm(same,same2)
commonBPs <- sameAll$V1
common1 <- cntrl_rep1[which(cntrl_rep1$V1 %in% commonBPs),]
rm(cntrl_rep1)
common2 <- cntrl_rep2[which(cntrl_rep2$V1 %in% commonBPs),]
rm(cntrl_rep2)
common3 <- cntrl_rep3[which(cntrl_rep3$V1 %in% commonBPs),]
rm(cntrl_rep3)
comm1ordered <- common1[order(common1$V1, decreasing=T),]
rm(common1)
comm2ordered <- common2[order(common2$V1,decreasing=T),]
rm(common2)
comm3ordered <- common3[order(common3$V1,decreasing=T),]
rm(common3)
controls <- cbind(comm1ordered,comm2ordered$V2,comm3ordered$V2)
rm(comm1ordered,comm2ordered,comm3ordered)
colnames(controls) <- c("ID_REF","control1-4362", "control2-4363", "control3-4364")
write.csv(controls, 'control.csv', row.names=F)
colnames(controls) <- c("ID_REF","control1-4362", "control2-4363", "control3-4364")
head(controls,10)
To look up a 20 bp barcode in RefSeq with ID_REF, you can use the refseqR package in R, which allows you to retrieve associated mRNA, transcript, and protein accessions. Here’s how you can do it: Install the refseqR package. Use the refseq_fromGene function to retrieve the mRNA transcript and protein accessions for the given gene ID. If you have the 20 bp barcode, you can use the refseq_fromGene function with the appropriate sequence option to get the corresponding mRNA transcript and protein accessions. For example, if you have a gene ID like LOC101512347, you can use the following code: GeneID <- c(“LOC101512347”) transcript <- refseq_fromGene(GeneID, sequence = “transcript”) protein <- refseq_fromGene(GeneID, sequence = “protein”)
GeneID <- c("TTTTTTTTTTTTTTCGTCCC")
#transcript <- refseq_fromGene(GeneID,sequence="transcript")
#protein <- refseq_fromGene(GeneID, sequence="protein")
It says the NCBI servers are busy and try again a bit later.
controls <- read.csv("control.csv", header=T, sep=',', na.strings=c('',' ','na','NA'
))
commonBPs <- controls$ID_REF
Lets go ahead and get the 5 repeats for the 1st MS patient into its dataframe.
MS1_rep1 <- read.table("GSM8874370_MS-1_rep1.hg38.barcode_count.txt",header=F)
MS1_rep2 <- read.table("GSM8874371_MS-1_rep2.hg38.barcode_count.txt",header=F)
MS1_rep3 <- read.table("GSM8874372_MS-1_rep3.hg38.barcode_count.txt",header=F)
MS1_rep4 <- read.table("GSM8874373_MS-1_rep4.hg38.barcode_count.txt",header=F)
MS1_rep5 <- read.table("GSM8874374_MS-1_rep5.hg38.barcode_count.txt",header=F)
Lets see which onces are in commonBPs so we can add them to our controls data frame.
MS1_1 <- MS1_rep1[which(MS1_rep1$V1 %in% commonBPs),]
rm(MS1_rep1)
MS1_2 <- MS1_rep2[which(MS1_rep2$V1 %in% commonBPs),]
rm(MS1_rep2)
MS1_3 <- MS1_rep3[which(MS1_rep3$V1 %in% commonBPs),]
rm(MS1_rep3)
MS1_4 <- MS1_rep4[which(MS1_rep4$V1 %in% commonBPs),]
rm(MS1_rep4)
MS1_5 <- MS1_rep5[which(MS1_rep5$V1 %in% commonBPs),]
rm(MS1_rep5)
# write.csv(MS1_1, "MS1_1.csv", row.names=F)
# write.csv(MS1_2, "MS1_2.csv", row.names=F)
# write.csv(MS1_3, "MS1_3.csv", row.names=F)
# write.csv(MS1_4, "MS1_4.csv", row.names=F)
# write.csv(MS1_5, "MS1_5.csv", row.names=F)
# write.csv(names4, 'names4.csv', row.names=F)
MS1_1 <- read.csv("MS1_1.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_2 <- read.csv("MS1_2.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_3 <- read.csv("MS1_3.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_4 <- read.csv("MS1_4.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_5 <- read.csv("MS1_5.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
MS1_1names <- MS1_1[which(MS1_1$V1 %in% MS1_2$V1),]
rm(MS1_1)
MS1_2names <- MS1_2[which(MS1_2$V1 %in% MS1_1names$V1),]
rm(MS1_2)
MS1_3names <- MS1_3[which(MS1_3$V1 %in% MS1_2names$V1),]
rm(MS1_3)
MS1_4names <- MS1_4[which(MS1_4$V1 %in% MS1_3names$V1),]
rm(MS1_4)
MS1_5names <- MS1_5[which(MS1_5$V1 %in% MS1_4names$V1),]
rm(MS1_5)
commonMS1 <- MS1_5names$V1
MS1_1 <- MS1_1names[which(MS1_1names$V1 %in% commonMS1),]
rm(MS1_1names)
MS1_2 <- MS1_2names[which(MS1_2names$V1 %in% commonMS1),]
rm(MS1_2names)
MS1_3 <- MS1_3names[which(MS1_3names$V1 %in% commonMS1),]
rm(MS1_3names)
MS1_4 <- MS1_4names[which(MS1_4names$V1 %in% commonMS1),]
rm(MS1_4names)
MS1_5 <- MS1_5names[which(MS1_5names$V1 %in% commonMS1),]
rm(MS1_5names)
control <- controls[which(controls$ID_REF %in% commonMS1),]
rm(controls)
control <- control[order(control$ID_REF, decreasing=T),]
MS1_1order <- MS1_1[order(MS1_1$V1,decreasing=T),]
rm(MS1_1)
MS1_2order <- MS1_2[order(MS1_2$V1,decreasing=T),]
rm(MS1_2)
MS1_3order <- MS1_3[order(MS1_3$V1,decreasing=T),]
rm(MS1_3)
MS1_4order <- MS1_4[order(MS1_4$V1,decreasing=T),]
rm(MS1_4)
MS1_5order <- MS1_5[order(MS1_5$V1,decreasing=T),]
rm(MS1_5)
colnames(MS1_1order) <- c("ID_REF", "MS1_1_4370")
colnames(MS1_2order) <- c("ID_REF", "MS1_2_4371")
colnames(MS1_3order) <- c("ID_REF", "MS1_3_4372")
colnames(MS1_4order) <- c("ID_REF", "MS1_4_4373")
colnames(MS1_5order) <- c("ID_REF", "MS1_5_4374")
str(control)
str(MS1_1order)
str(MS1_2order)
str(MS1_3order)
str(MS1_4order)
str(MS1_5order)
controlsMS1 <- cbind(control,MS1_1order$MS1_1_4370, MS1_2order$MS1_2_4371,MS1_3order$MS1_3_4372, MS1_4order$MS1_4_4373,MS1_5order$MS1_5_4374)
str(controlsMS1)
colnames(controlsMS1)[5:9] <- c("MS1_r1_4370","MS1_r2_4371", "MS1_r3_4372",
"MS1_r4_4373", "MS1_r5_4374")
str(controlsMS1)
write.csv(ControlsMS1, "controls_MS1.csv", row.names=F)
rm(control,MS1_1order,MS1_2order,MS1_3order,MS1_4order,MS1_5order,commonBPs,commonMS1,names)
We have a combined data set of genes in common between control repeats 1-3 and repeats 1-5 for the 1st MS patient. This is a 9,011,758 observation long data frame with 9 variables. The ID_REF was ordered after the common genes found for each sample and then combined. Each sample has the last 4 numbers of their GSM sample ID in the header name for the sample. This is the largest data set we have worked on and that time doesn’t even include running machine learning to make predictions because the original data were each 10-50 million gene samples as nucleotide base pairs 20 base pairs or bp long.
====================================================================
We have to add the 2nd MS patient 5 repeats of data here in this section to our controlsMS1.
controlsMS1 <- read.csv("controls_MS1.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
str(controlsMS1)
commonID_REF <- controlsMS1$ID_REF
MS2_1 <- read.table("GSM8874375_MS-2_rep1.hg38.barcode_count.txt", header=F)
MS2_2 <- read.table("GSM8874376_MS-2_rep2.hg38.barcode_count.txt", header=F)
MS2_3 <- read.table("GSM8874377_MS-2_rep3.hg38.barcode_count.txt", header=F)
MS2_4 <- read.table("GSM8874378_MS-2_rep4.hg38.barcode_count.txt", header=F)
MS2_5 <- read.table("GSM8874379_MS-2_rep5.hg38.barcode_count.txt", header=F)
Next are extracting only those genes in the 5 repeat samples that are also in the control 3 repeats and MS1 5 repeat samples.
MS2_1_commons <- MS2_1[which(MS2_1$V1 %in% commonID_REF),]
rm(MS2_1)
MS2_2_commons <- MS2_2[which(MS2_2$V1 %in% commonID_REF),]
rm(MS2_2)
MS2_3_commons <- MS2_3[which(MS2_3$V1 %in% commonID_REF),]
rm(MS2_3)
MS2_4_commons <- MS2_4[which(MS2_4$V1 %in% commonID_REF),]
rm(MS2_4)
MS2_5_commons <- MS2_5[which(MS2_5$V1 %in% commonID_REF),]
rm(MS2_5)
Now find nucleotide strings in the 2nd MS patients that are in the first 2 repeats.
Common to repeat 1 and 2 in MS2_1$V1.
MS2_1 <- MS2_1_commons[which(MS2_1_commons$V1 %in% MS2_2_commons$V1),]
Common to repeat 2 and 3 in MS2_2$V1.
MS2_2 <- MS2_2_commons[which(MS2_2_commons$V1 %in% MS2_3_commons$V1),]
Common to repeat 3 and 4 in MS2_3$V1
MS2_3 <- MS2_3_commons[which(MS2_3_commons$V1 %in% MS2_4_commons$V1),]
Common to repeat 4 and 5 in MS2_4$V1.
MS2_4 <- MS2_4_commons[which(MS2_4_commons$V1 %in% MS2_5_commons$V1),]
Now use MS2_1 is common to set {1,2}, common in set 2 and 3 is MS2_2 as set {2,3}, for resulting set common to set {1,2} and set {2,3} we get set1_3 is {1,2,3}, then in commone to set 3 and 4 is MS2_3 as set {3,4}, to get common to sets 1 through 4, we get set1_4 {1,2,3,4}, and then common to set 4 and 5 is set MS2_4 as set {4,5} to get common to all sets 1 through 5 we get set1_5 {1,2,3,4,5} common strings in all.
common to set 1, 2, and 3:
set1_3 <- MS2_1[which(MS2_1$V1 %in% MS2_2$V1),]
common to set 2, 3, and 4:
set2_4 <- MS2_2[which(MS2_2$V1 %in% MS2_3$V1),]
common to set 1,2, 3, and 4:
set1_2_3_4 <- set1_3[which(set1_3$V1 %in% set2_4$V1),]
common to set 3,4 and 5:
set4_5 <- MS2_3[which(MS2_3$V1 %in% MS2_4$V1),]
set1_2_3_4_5 <- set1_2_3_4[which(set1_2_3_4$V1 %in% set4_5$V1),]
common strings of nucleotides to the 2nd patient’s 5 repeats.
common2ndID_REF <- set1_2_3_4_5$V1
Now use the controlsMS1 data frame ID_Ref to get those in common2ndID_REF names to combine the 2nd patient samples to it, after ordering each of sets MS2_1_commons through MS2_5_commons by those common genes in all.There are less names in the 2nd set batch than 1st data set batch.
Clear up space by removing these objects we won’t use later.
rm(set1_3,set2_3,set2_4,set3_4,set4_5,set1_2_3_4,set1_2_3_4_5,set1_2_3, MS2_1,MS2_2,MS2_3,MS2_4,MS2_5)
NewControlsMS1 <- controlsMS1[which(controlsMS1$ID_REF %in% common2ndID_REF),]
MS2_1 <- MS2_1_commons[which(MS2_1_commons$V1 %in% common2ndID_REF),]
MS2_2 <- MS2_2_commons[which(MS2_2_commons$V1 %in% common2ndID_REF),]
MS2_3 <- MS2_3_commons[which(MS2_3_commons$V1 %in% common2ndID_REF),]
MS2_4 <- MS2_4_commons[which(MS2_4_commons$V1 %in% common2ndID_REF),]
MS2_5 <- MS2_5_commons[which(MS2_5_commons$V1 %in% common2ndID_REF),]
Remove the clutter of unused objects.
rm(MS2_1_commons, MS2_2_commons, MS2_3_commons, MS2_4_commons, MS2_5_commons,controlsMS1,commonID_REF,common2ndID_REF)
Order all the variables in these data sets before combining and add new column name identifiers to each MS2 table.
NewControlsMS1 <- NewControlsMS1[order(NewControlsMS1$ID_REF,decreasing=T),]
write.csv(NewControlsMS1,'newControlsMS1.csv',row.names=F)
colnames(MS2_1) <- c("ID_REF", "MS2_1_4375")
MS2_1o <- MS2_1[order(MS2_1$ID_REF, decreasing=T),]
colnames(MS2_2) <- c("ID_REF", "MS2_2_4376")
MS2_2o <- MS2_2[order(MS2_2$ID_REF, decreasing=T),]
colnames(MS2_3) <- c("ID_REF", "MS2_3_4377")
MS2_3o <- MS2_3[order(MS2_3$ID_REF, decreasing=T),]
colnames(MS2_4) <- c("ID_REF", "MS2_4_4378")
MS2_4o <- MS2_4[order(MS2_4$ID_REF, decreasing=T),]
colnames(MS2_5) <- c("ID_REF", "MS2_5_4379")
MS2_5o <- MS2_5[order(MS2_5$ID_REF, decreasing=T),]
combine these 2nd samples to look at the ID_REF top and bottom ones.
MS2 <- cbind(MS2_1o,MS2_2o,MS2_3o,MS2_4o,MS2_5o)
head(MS2)
tail(MS2)
control_MS1_MS2 <- cbind(NewControlsMS1,MS2_1o$MS2_1_4375,MS2_2o$MS2_2_4376,MS2_3o$MS2_3_4377,MS2_4o$MS2_4_4378,MS2_5o$MS2_5_4379)
str(control_MS1_MS2)
Change the new added features into similar names.
colnames(control_MS1_MS2)[10:14] <- c("MS2_r1_4375","MS2_r2_4376","MS2_r3_4377","MS2_r4_4378","MS2_r5_4379")
colnames(control_MS1_MS2)
write.csv(control_MS1_MS2, "controls_MS1_MS2.csv", row.names=F)
control_MS1_MS2 <- read.csv("controls_MS1_MS2.csv", header=T, sep=',', na.strings=c('',' ','na','NA'))
summary(control_MS1_MS2)
## ID_REF control1.4362 control2.4363 control3.4364
## Length:8946025 Min. : 1.00 Min. : 1.00 Min. : 1.00
## Class :character 1st Qu.: 6.00 1st Qu.: 5.00 1st Qu.: 5.00
## Mode :character Median : 10.00 Median : 10.00 Median : 10.00
## Mean : 11.95 Mean : 11.56 Mean : 11.66
## 3rd Qu.: 16.00 3rd Qu.: 16.00 3rd Qu.: 16.00
## Max. :724.00 Max. :634.00 Max. :693.00
## MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
## Min. : 1.00 Min. : 1.0 Min. : 1.00 Min. : 1.00
## 1st Qu.: 12.00 1st Qu.: 14.0 1st Qu.: 10.00 1st Qu.: 20.00
## Median : 24.00 Median : 26.0 Median : 18.00 Median : 37.00
## Mean : 32.96 Mean : 33.1 Mean : 23.27 Mean : 47.28
## 3rd Qu.: 45.00 3rd Qu.: 43.0 3rd Qu.: 30.00 3rd Qu.: 62.00
## Max. :2287.00 Max. :2734.0 Max. :2089.00 Max. :3993.00
## MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 18.00 1st Qu.: 16.00 1st Qu.: 16.00 1st Qu.: 14.00
## Median : 32.00 Median : 27.00 Median : 27.00 Median : 24.00
## Mean : 40.45 Mean : 33.74 Mean : 33.39 Mean : 30.22
## 3rd Qu.: 53.00 3rd Qu.: 44.00 3rd Qu.: 44.00 3rd Qu.: 40.00
## Max. :3215.00 Max. :2398.00 Max. :2412.00 Max. :2127.00
## MS2_r4_4378 MS2_r5_4379
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 16.00 1st Qu.: 15.00
## Median : 27.00 Median : 25.00
## Mean : 33.72 Mean : 30.87
## 3rd Qu.: 44.00 3rd Qu.: 41.00
## Max. :2298.00 Max. :2173.00
str(control_MS1_MS2)
## 'data.frame': 8946025 obs. of 14 variables:
## $ ID_REF : chr "TTTTTTTTTTTTTTCGTCCC" "TTTTTTTTTTTTTCCTTGCT" "TTTTTTTTTTTTGCAGTGAT" "TTTTTTTTTTTTCTGCTATG" ...
## $ control1.4362: int 4 2 4 4 1 5 1 4 3 2 ...
## $ control2.4363: int 4 1 1 1 1 5 2 3 1 3 ...
## $ control3.4364: int 3 5 1 2 3 8 4 1 2 3 ...
## $ MS1_r1_4370 : int 3 5 3 3 21 15 23 2 4 11 ...
## $ MS1_r2_4371 : int 1 4 4 8 9 23 15 10 4 4 ...
## $ MS1_r3_4372 : int 3 3 3 2 7 16 7 5 10 7 ...
## $ MS1_r4_4373 : int 5 9 3 4 13 43 17 18 22 24 ...
## $ MS1_r5_4374 : int 6 12 5 9 10 26 12 21 8 5 ...
## $ MS2_r1_4375 : int 4 5 3 1 15 19 9 8 12 12 ...
## $ MS2_r2_4376 : int 8 3 7 8 7 27 19 6 7 13 ...
## $ MS2_r3_4377 : int 11 10 8 7 4 25 19 6 6 8 ...
## $ MS2_r4_4378 : int 3 8 4 4 7 19 22 11 8 20 ...
## $ MS2_r5_4379 : int 4 9 5 5 4 17 21 9 8 14 ...
We need to add the 3rd batch that is the B cell commercial line they used to compare in the controls and two multiple sclerosis patients they received samples from.
We have a data frame of 14 features where the 3 controls and 5 repeats each from the 2 multiple sclerosis patients and the target feature that is the ID_REF field with the so far 8,946,025 nucleotide strands of 20 base pairs of complementary DNA as they have thymine (T) in them, that is what reverse transcription of RNA into DNA means, they get the complementary DNA or cDNA strand from the messenger RNA or mRNA to see the allelic copy variants that can happen at the nucleic acid base level with replacement of one nucleic acid that may not affect the triplet codon amino acid it forms to form the translated protein at the ribosomal process of turning mRNA into a protein.
The current stored size of this file is 516.4 Mb on my computer and most likely not small enough for Google Sheets to share. But you can get those samples from NCBI by looking up the GSE and GPL or GSM names given in the beginning of this document and run the code above to get to the almost 9 million nucleotide base pairs with 14 variables before adding in the additional 5 comparison repeats of commercial B cells of a patient with MS.