This is to re-examine the UL and non-UL samples from the Gene Expression Omnibus online data repository (GEO) for genotypes in the ULs compared to those samples without tumor tissue in them. The accession IDs for the Series is GSE95101 and for the platform is GPL13376
Lets look at some of these copy number variants of one gene with seven copy number variants or CNVs and see where the changes in the nucleotide sequences occur. Copy number variations in nucleotides can have short repeats, jumps in sequence, or deletions of a gene. I have been calling these CNVs genotypes, which are the traits and alleles responsible for the physical traits or phenotypes of an organism. Some CNVs are responsible for diseases, and in tumors there are many different CNVs that are found to be responsible. A uterine leiomyoma or fibroid is a benign tumor. These samples were taken from uterus tissue with these uternine tumors and the same neighboring uterine tissue without uterine tumors.
library(dplyr)
library(tidyr)
library(e1071)
library(caret)
library(randomForest)
library(MASS)
library(gbm)
library(DT)
library(stringr)
UL1a <- read.csv('UL1a.csv', sep=',',
header=T, na.strings=c('',' '))
UL1b <- read.csv('UL1b.csv', sep=',',
header=T, na.strings=c('',' '))
UL1c <- read.csv('UL1c.csv', sep=',',
header=T, na.strings=c('',' '))
UL1d <- read.csv('UL1d.csv', sep=',',
header=T, na.strings=c('',' '))
UL1 <- rbind(UL1a,UL1b,UL1c,UL1d)
rm(UL1a,UL1b,UL1c,UL1d)
str(UL1)
## 'data.frame': 48701 obs. of 51 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID : Factor w/ 48701 levels "ILMN_1343289",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ...
## $ Source : Factor w/ 3 levels "ILMN_Controls",..: 1 1 2 1 1 2 2 2 2 2 ...
## $ Search_Key : Factor w/ 46721 levels "ILMN_10001","ILMN_10014",..: 11664 11664 11664 11664 11664 11664 11664 9848 11306 1417 ...
## $ Transcript : Factor w/ 46724 levels "ILMN_10001","ILMN_10014",..: 2290 2291 1377 2292 2293 4903 1218 9858 11314 1421 ...
## $ ILMN_Gene : Factor w/ 43186 levels "1-Dec","1-Mar",..: 8904 10174 2074 10143 10156 88 2713 4872 9250 2029 ...
## $ Source_Reference_ID : Factor w/ 46721 levels "NM_000015.1",..: 776 4636 1225 1123 1827 1130 1438 9379 7158 10894 ...
## $ RefSeq_ID : Factor w/ 28570 levels "NM_000015.1",..: 776 4636 1225 1123 1827 1130 1438 9379 7158 10894 ...
## $ Unigene_ID : Factor w/ 18153 levels "NA","Hs.100554",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Entrez_Gene_ID : Factor w/ 16063 levels "10","10000","10001",..: 10744 10744 1487 10744 10744 5871 2331 6516 10718 5545 ...
## $ GI : int 14141192 20149305 25453469 4507728 4507744 5016088 7669491 88954077 33469136 89070645 ...
## $ Accession : Factor w/ 46721 levels "NM_000015.1",..: 776 4636 1225 1123 1827 1130 1438 9379 7158 10894 ...
## $ Symbol : Factor w/ 25036 levels "1-Dec","1-Mar",..: 8904 10174 2074 10143 10156 88 2713 4872 9250 2029 ...
## $ Protein_Product : Factor w/ 28173 levels "NA","NP_000006.1",..: 1 1 1224 1 1 1129 1437 9300 7155 10815 ...
## $ Probe_Id : Factor w/ 48701 levels "ILMN_1343289",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Array_Address_Id : int 2140735 6550370 2690379 4590356 4260048 5860528 1770601 50270 3310274 7040079 ...
## $ Probe_Type : Factor w/ 3 levels "A","I","S": 3 3 3 3 3 3 3 3 3 2 ...
## $ Probe_Start : int 416 1856 1293 1408 72 1725 930 1 1103 2975 ...
## $ SEQUENCE : Factor w/ 48701 levels "AAAAAACAGGAATAGCTCTAGGAGTCCTTACACAGGTCCGAGGGACCAGC",..: 9453 9026 11512 4877 6702 9532 5532 1967 10077 11476 ...
## $ Chromosome : Factor w/ 56 levels "1","10","11",..: 27 27 22 27 27 23 4 14 1 27 ...
## $ Probe_Chr_Orientation: Factor w/ 3 levels "-","+","NA": 3 3 1 3 3 1 2 2 1 3 ...
## $ Probe_Coordinates : Factor w/ 41351 levels "100000925-100000974",..: 10161 10161 8915 10161 10161 7449 8192 4231 2830 10161 ...
## $ Cytoband : Factor w/ 3676 levels "10p11.1d","10p11.21a",..: 2501 2501 2039 2501 2501 2165 288 1490 1036 2033 ...
## $ Definition : Factor w/ 46614 levels "Homo sapiens 1-acylglycerol-3-phosphate O-acyltransferase 1 (lysophosphatidic acid acyltransferase, alpha) (AGP"| __truncated__,..: 5692 7144 2221 7047 6692 98 2786 8195 6195 7895 ...
## $ Ontology_Component : Factor w/ 7849 levels "A 20S multiprotein assembly of total mass about 1.2 MDa that activates dynein-based activity in vivo. A large s"| __truncated__,..: 2614 1893 1304 1321 1266 2112 1468 1893 1763 269 ...
## $ Ontology_Process : Factor w/ 8950 levels "[goid 6069] [pmid 1755855] [evidence IDA]; A change in state or activity of a cell or an organism (in terms of "| __truncated__,..: 2067 2257 3716 1085 547 417 1784 1091 1091 962 ...
## $ Ontology_Function : Factor w/ 9453 levels "[goid 16505] [pmid 10426319] [evidence NAS]",..: 3847 2486 1975 1911 2750 2514 95 3637 3637 828 ...
## $ Synonyms : Factor w/ 16472 levels "0610037N12Rik; RPP20; RPP2",..: 4591 4591 5325 4591 4591 5300 2402 4591 3649 4591 ...
## $ Obsolete_Probe_Id : Factor w/ 16878 levels "0610037N12Rik; RPP20; RPP2",..: 4784 4784 645 6450 1251 5508 2490 4784 3785 4784 ...
## $ GB_ACC : Factor w/ 46717 levels "NA","NM_000015.1",..: 777 4635 1225 1123 1827 1130 1438 9377 7156 10892 ...
## $ GSM2496185 : num 13942 23759 27434 3092 6857 ...
## $ GSM2496186 : num 12934 15091 26473 4269 7799 ...
## $ GSM2496187 : num 11909 22609 23964 3455 7954 ...
## $ GSM2496188 : num 12147 18225 27823 4258 7380 ...
## $ GSM2496189 : num 14142 20728 24486 3333 6445 ...
## $ GSM2496190 : num 11650 19582 26225 4545 9215 ...
## $ GSM2496191 : num 12786 19105 28200 3413 10031 ...
## $ GSM2496192 : num 9383 10008 27997 3191 7428 ...
## $ GSM2496193 : num 11481 10575 23172 3597 7712 ...
## $ GSM2496203 : num 4136 1028 16324 4994 6466 ...
## $ GSM2496204 : num 11458 17921 26664 3095 7471 ...
## $ GSM2496205 : num 15445 18186 25687 3138 7047 ...
## $ GSM2496206 : num 11098 8905 22094 2473 6307 ...
## $ GSM2496207 : num 11510 9721 21161 4353 4826 ...
## $ GSM2496208 : num 11446 11451 26427 3863 7069 ...
## $ GSM2496209 : num 9945 16387 27837 3027 6462 ...
## $ GSM2496217 : num 12707 18456 28792 3251 8407 ...
## $ GSM2496218 : num 12261 19342 25018 2322 6925 ...
## $ GSM2496219 : num 11087 9198 27179 4554 9100 ...
## $ GSM2496220 : num 11746 21023 29030 4131 7771 ...
nonUL1a <- read.csv('nonUL1a.csv', sep=',',
header=T, na.strings=c('',' '))
nonUL1b <- read.csv('nonUL1b.csv', sep=',',
header=T, na.strings=c('',' '))
nonUL1c <- read.csv('nonUL1c.csv', sep=',',
header=T, na.strings=c('',' '))
nonUL1d <- read.csv('nonUL1d.csv', sep=',',
header=T, na.strings=c('',' '))
nonUL1 <- rbind(nonUL1a,nonUL1b,nonUL1c,nonUL1d)
rm(nonUL1a,nonUL1b,nonUL1c,nonUL1d)
str(nonUL1)
## 'data.frame': 48701 obs. of 49 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID : Factor w/ 48701 levels "ILMN_1343289",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ...
## $ Source : Factor w/ 3 levels "ILMN_Controls",..: 1 1 2 1 1 2 2 2 2 2 ...
## $ Search_Key : Factor w/ 46721 levels "ILMN_10001","ILMN_10014",..: 11664 11664 11664 11664 11664 11664 11664 9848 11306 1417 ...
## $ Transcript : Factor w/ 46724 levels "ILMN_10001","ILMN_10014",..: 2290 2291 1377 2292 2293 4903 1218 9858 11314 1421 ...
## $ ILMN_Gene : Factor w/ 43186 levels "1-Dec","1-Mar",..: 8904 10174 2074 10143 10156 88 2713 4872 9250 2029 ...
## $ Source_Reference_ID : Factor w/ 46721 levels "NM_000015.1",..: 776 4636 1225 1123 1827 1130 1438 9379 7158 10894 ...
## $ RefSeq_ID : Factor w/ 28570 levels "NM_000015.1",..: 776 4636 1225 1123 1827 1130 1438 9379 7158 10894 ...
## $ Unigene_ID : Factor w/ 18153 levels "NA","Hs.100554",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Entrez_Gene_ID : Factor w/ 16063 levels "10","10000","10001",..: 10744 10744 1487 10744 10744 5871 2331 6516 10718 5545 ...
## $ GI : int 14141192 20149305 25453469 4507728 4507744 5016088 7669491 88954077 33469136 89070645 ...
## $ Accession : Factor w/ 46721 levels "NM_000015.1",..: 776 4636 1225 1123 1827 1130 1438 9379 7158 10894 ...
## $ Symbol : Factor w/ 25036 levels "1-Dec","1-Mar",..: 8904 10174 2074 10143 10156 88 2713 4872 9250 2029 ...
## $ Protein_Product : Factor w/ 28173 levels "NA","NP_000006.1",..: 1 1 1224 1 1 1129 1437 9300 7155 10815 ...
## $ Probe_Id : Factor w/ 48701 levels "ILMN_1343289",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Array_Address_Id : int 2140735 6550370 2690379 4590356 4260048 5860528 1770601 50270 3310274 7040079 ...
## $ Probe_Type : Factor w/ 3 levels "A","I","S": 3 3 3 3 3 3 3 3 3 2 ...
## $ Probe_Start : int 416 1856 1293 1408 72 1725 930 1 1103 2975 ...
## $ SEQUENCE : Factor w/ 48701 levels "AAAAAACAGGAATAGCTCTAGGAGTCCTTACACAGGTCCGAGGGACCAGC",..: 9453 9026 11512 4877 6702 9532 5532 1967 10077 11476 ...
## $ Chromosome : Factor w/ 56 levels "1","10","11",..: 27 27 22 27 27 23 4 14 1 27 ...
## $ Probe_Chr_Orientation: Factor w/ 3 levels "-","+","NA": 3 3 1 3 3 1 2 2 1 3 ...
## $ Probe_Coordinates : Factor w/ 41351 levels "100000925-100000974",..: 10161 10161 8915 10161 10161 7449 8192 4231 2830 10161 ...
## $ Cytoband : Factor w/ 3676 levels "10p11.1d","10p11.21a",..: 2501 2501 2039 2501 2501 2165 288 1490 1036 2033 ...
## $ Definition : Factor w/ 46614 levels "Homo sapiens 1-acylglycerol-3-phosphate O-acyltransferase 1 (lysophosphatidic acid acyltransferase, alpha) (AGP"| __truncated__,..: 5692 7144 2221 7047 6692 98 2786 8195 6195 7895 ...
## $ Ontology_Component : Factor w/ 7849 levels "A 20S multiprotein assembly of total mass about 1.2 MDa that activates dynein-based activity in vivo. A large s"| __truncated__,..: 2614 1893 1304 1321 1266 2112 1468 1893 1763 269 ...
## $ Ontology_Process : Factor w/ 8950 levels "[goid 6069] [pmid 1755855] [evidence IDA]; A change in state or activity of a cell or an organism (in terms of "| __truncated__,..: 2067 2257 3716 1085 547 417 1784 1091 1091 962 ...
## $ Ontology_Function : Factor w/ 9453 levels "[goid 16505] [pmid 10426319] [evidence NAS]",..: 3847 2486 1975 1911 2750 2514 95 3637 3637 828 ...
## $ Synonyms : Factor w/ 16472 levels "0610037N12Rik; RPP20; RPP2",..: 4591 4591 5325 4591 4591 5300 2402 4591 3649 4591 ...
## $ Obsolete_Probe_Id : Factor w/ 16878 levels "0610037N12Rik; RPP20; RPP2",..: 4784 4784 645 6450 1251 5508 2490 4784 3785 4784 ...
## $ GB_ACC : Factor w/ 46717 levels "NA","NM_000015.1",..: 777 4635 1225 1123 1827 1130 1438 9377 7156 10892 ...
## $ GSM2496194 : num 9823 18157 27796 3428 7706 ...
## $ GSM2496195 : num 11265 20893 24042 4279 9407 ...
## $ GSM2496196 : num 13016 20943 24368 3049 9110 ...
## $ GSM2496197 : num 11698 18242 24179 3574 7935 ...
## $ GSM2496198 : num 11448 20998 25276 2178 7307 ...
## $ GSM2496199 : num 11454 21756 26935 3768 8928 ...
## $ GSM2496200 : num 11514 21849 26969 3170 9457 ...
## $ GSM2496201 : num 10621 10200 24231 2292 7765 ...
## $ GSM2496202 : num 11066 9349 22945 4513 8454 ...
## $ GSM2496210 : num 10189 21816 29280 4816 8773 ...
## $ GSM2496211 : num 9998 18435 26231 4683 8579 ...
## $ GSM2496212 : num 11407 23942 27389 3589 7977 ...
## $ GSM2496213 : num 9476 10440 23432 5444 8126 ...
## $ GSM2496214 : num 11708 9478 22640 4533 8418 ...
## $ GSM2496215 : num 11457 11803 24008 5839 8549 ...
## $ GSM2496216 : num 12900 20375 28086 4044 7252 ...
## $ GSM2496221 : num 10020 16842 25324 2469 7225 ...
## $ GSM2496222 : num 13409 9030 22273 3315 7844 ...
UL <- UL1[,-c(1:13,15:19,21,29:31)]
nonUL <- nonUL1[,-c(1:13,15:19,21,29:31)]
write.csv(UL,'UL.csv', row.names=FALSE)
write.csv(nonUL, 'nonUL.csv', row.names=FALSE)
fibroid <- read.csv('UL.csv', sep=',', header=T, na.strings=c('',' '))
nonFibroid <- read.csv('nonUL.csv', sep=',', header=T, na.strings=c('',' '))
fibroid_gene_n <- fibroid %>% group_by(Symbol) %>% count(n())
narm <- grep('^NA$',fibroid_gene_n$Symbol)
fibroid1 <- fibroid_gene_n[-narm,-2]
colnames(fibroid1)[2] <- 'gene_count'
NONfibroid_gene_n <- nonFibroid %>% group_by(Symbol) %>% count(n())
narm1 <- grep('^NA$',NONfibroid_gene_n$Symbol)
nonFibroid1 <- NONfibroid_gene_n[-narm1,-2]
colnames(nonFibroid1)[2] <- 'gene_count'
GeneCopyNumberVariants <- fibroid1[order(fibroid1$gene_count,decreasing=TRUE)[1:10],]
GeneCopyNumberVariants
## # A tibble: 10 x 2
## # Groups: Symbol [10]
## Symbol gene_count
## <fct> <int>
## 1 DDX12 10
## 2 KIAA0692 9
## 3 LOC23117 8
## 4 PLEC1 8
## 5 BDNF 7
## 6 CTNNB1 7
## 7 DMD 7
## 8 LOC202134 7
## 9 LOC339047 7
## 10 LOC653086 7
Combine the gene counts with the tables of samples for each type of UL or nonUL.
Fibroid_count <- merge(fibroid1, fibroid, by.x='Symbol', by.y='Symbol')
nonFibroid_count <- merge(nonFibroid1, nonFibroid, by.x='Symbol', by.y='Symbol')
Fibroid_count[order(Fibroid_count$gene_count, decreasing=TRUE)[1:20],1:3]
## Symbol gene_count SEQUENCE
## 5646 DDX12 10 CCAGTCCCTGACTACAGAGGATTTCCCCAAAGTCCCTGGCTGTGAGGTTC
## 5647 DDX12 10 TTACTGGGGATGGTATTTAGGAGCCAGGAAAGCCGGTGCATTCCTAGTGA
## 5648 DDX12 10 TCTCCTGCCCCCTCCGGAAGCTTGGATGCCCCTCCACACCCTCTTGATCT
## 5649 DDX12 10 CAGACTTCTCGCTTCCTTTCTGCTGGGCCTCTGAGGGGTCATGGGGCCAT
## 5650 DDX12 10 ACATGTGCTGTCACTGGAACTTGCTCTTTTCACTCAGCAGCCAGAGGGTC
## 5651 DDX12 10 AAACGTTACAGTGTTCCGATGAGACACAGTAGGCAGTACTTGGGAGGGTC
## 5652 DDX12 10 CAGGGCAGGAACCACGTCTTTACAGTTTGATGTTCCCAGAGCTGACCCAG
## 5653 DDX12 10 GCAGGGGAGATTGGGTTTAGGGGCTTTCCTGGTCTGCATTCTGCTACAGC
## 5654 DDX12 10 CCGCCGGGCTGCTTTTTCCTTGGATGCCCATCAGGACGCCTCAGTTCTCT
## 5655 DDX12 10 CGTTGCTACAAGCTGTTTTTTGAATGTCTCTACACAGTCCAGGCAGGAAG
## 10983 KIAA0692 9 AAGTGGTGCCTGGCTGTCCCTATACTGTGCTGCTGGGTGTTCCAGCCTGT
## 10984 KIAA0692 9 TAAGTGCAGTGAGCTCTGGCGGAAACCACCCTCTGCCCCGTCTGTTGGAT
## 10985 KIAA0692 9 CATTGTAATGATAAGGAAATGTTGCGATCAAATAAGATTTAGACACACTT
## 10986 KIAA0692 9 GATCACAGGCACAGGGAAGCCACAAGGAGCTCTGTATGAGTTGTGTTTGC
## 10987 KIAA0692 9 CAGGCGACTGGGTAGCAGATGTGGAAGCTGATGGTTAGGCCCAGGGCATG
## 10988 KIAA0692 9 GTTGTTCTGGACGATCTTCGGGATCCTCTGGGGCACTGTGACACTCGGAG
## 10989 KIAA0692 9 GAGTGCTGGGAAGGTTAATGTTAAATGGGTTGTGTGTCGGGGAGGGTACA
## 10990 KIAA0692 9 AGCTCCACCTTGACCCAGCCTCACAACAAAAAGTTTGTGTATGACCAGGC
## 10991 KIAA0692 9 GCAAATGTAACTCAGGGGTTTGGGGCCAGAGGAAGAGGGAGAAGGTGGCC
## 12174 LOC23117 8 CTGGCCTTCCCTCATCAGCCGTAAATGATGATTTACTGCTGTTACCATCA
Add a mean, median, min, and max column to these tables.
Fibroid_count$Fibroid_Mean <- rowMeans(Fibroid_count[11:30])
nonFibroid_count$nonFibroid_Mean <- rowMeans(nonFibroid_count[11:28])
Use the tidyr package to group by sample ID by gathering those columns into one.
UL_3 <- gather(Fibroid_count, 'UL_Sample_ID','Value',11:30)
nonUL_3 <- gather(nonFibroid_count, 'nonUL_Sample_ID', 'Value',11:28)
Create the stat tables then combine for the UL and nonUL sample sets using the dplyr package.
UL_median <- UL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), median)
colnames(UL_median)[2] <- 'Fibroid_Median'
nonUL_median <- nonUL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), median)
colnames(nonUL_median)[2] <- 'nonFibroid_Median'
UL_max <- UL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), max)
colnames(UL_max)[2] <- 'Fibroid_max'
nonUL_max <- nonUL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), max)
colnames(nonUL_max)[2] <- 'nonFibroid_max'
UL_min <- UL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), min)
colnames(UL_min)[2] <- 'Fibroid_min'
nonUL_min <- nonUL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), min)
colnames(nonUL_min)[2] <- 'nonFibroid_min'
UL_sd <- UL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), sd)
colnames(UL_sd)[2] <- 'Fibroid_stdError'
nonUL_sd <- nonUL_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(Value), sd)
colnames(nonUL_sd)[2] <- 'nonFibroid_stdError'
Combine these four tables together.
Fibroid_stats <- merge(UL_median, UL_max, by.x='SEQUENCE', by.y='SEQUENCE')
Fibroid_stats1 <- merge(Fibroid_stats, UL_min, by.x='SEQUENCE', by.y='SEQUENCE')
Fibroid_stats2 <- merge(Fibroid_count, Fibroid_stats1, by.x='SEQUENCE', by.y='SEQUENCE')
Fibroid_stats3 <- merge(Fibroid_stats2, UL_sd, by.x='SEQUENCE', by.y='SEQUENCE')
colnames(Fibroid_stats3)[11:30] <- paste('UL_', colnames(Fibroid_stats3)[11:30], sep='')
nonFibroid_stats <- merge(nonUL_median, nonUL_max, by.x='SEQUENCE', by.y='SEQUENCE')
nonFibroid_stats1 <- merge(nonFibroid_stats, nonUL_min, by.x='SEQUENCE', by.y='SEQUENCE')
nonFibroid_stats2 <- merge(nonFibroid_count, nonFibroid_stats1, by.x='SEQUENCE', by.y='SEQUENCE')
nonFibroid_stats3 <- merge(nonFibroid_stats2, nonUL_sd, by.x='SEQUENCE', by.y='SEQUENCE')
colnames(nonFibroid_stats3)[11:28] <- paste('nonUL_', colnames(nonFibroid_stats3)[11:28], sep='')
nonfibroid <- nonFibroid_stats3[,c(1,11:33)]
all <- merge(Fibroid_stats3, nonfibroid, by.x='SEQUENCE', by.y='SEQUENCE')
str(all)
## 'data.frame': 30549 obs. of 58 variables:
## $ SEQUENCE : Factor w/ 48701 levels "AAAAAACAAAACCGCGCAGCGGAGAACCGGTGCCTGAGTCTCCCAGGGAC",..: 1 4 5 7 8 10 12 13 15 18 ...
## $ Symbol : Factor w/ 25036 levels "1-Dec","1-Mar",..: 12031 11383 13002 14397 15611 12721 12474 10953 24822 18000 ...
## $ gene_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Probe_Chr_Orientation: Factor w/ 3 levels "-","+","NA": 2 1 2 3 3 2 2 3 1 2 ...
## $ Probe_Coordinates : Factor w/ 41351 levels "100000925-100000974",..: 33600 3552 19430 41351 41351 36542 28872 41351 31503 41119 ...
## $ Cytoband : Factor w/ 3676 levels "10p11.1d","10p11.21a",..: 3472 3472 3472 3472 3472 293 3472 3472 1261 2180 ...
## $ Definition : Factor w/ 46614 levels "{3 region, probe S2} [human, 76N, mammary epithelial cells, mRNA Partial, 339 nt]",..: 33743 33388 39229 37744 38664 34103 33968 35730 29449 21454 ...
## $ Ontology_Component : Factor w/ 7849 levels "A 20S multiprotein assembly of total mass about 1.2 MDa that activates dynein-based activity in vivo. A large s"| __truncated__,..: 4150 4150 4150 4150 4150 4150 4150 4150 5434 6267 ...
## $ Ontology_Process : Factor w/ 8950 levels "[goid 19642] [evidence IEA]; The chemical reactions and pathways involving carbohydrates, any of a group of org"| __truncated__,..: 2492 2492 2492 2492 2492 2492 2492 2492 8502 3316 ...
## $ Ontology_Function : Factor w/ 9453 levels "[goid 15280] [evidence IEA]; Interacting selectively with sodium ions (Na+) [goid 31402] [evidence IEA]",..: 8111 8111 8111 8111 8111 8111 8111 8111 7071 2705 ...
## $ UL_GSM2496185 : num 49.5 51.2 59.9 312.3 52.3 ...
## $ UL_GSM2496186 : num 51.5 52.3 64.5 333.3 53.7 ...
## $ UL_GSM2496187 : num 46.9 49 50.3 331.8 56.4 ...
## $ UL_GSM2496188 : num 50.1 48.4 52 360.9 51.8 ...
## $ UL_GSM2496189 : num 54 49.8 52.8 339 52.6 ...
## $ UL_GSM2496190 : num 50.1 48.6 52 411.7 51.9 ...
## $ UL_GSM2496191 : num 49.5 46 54.7 424.3 54.8 ...
## $ UL_GSM2496192 : num 46.5 50.6 50.2 517.3 53.6 ...
## $ UL_GSM2496193 : num 48 46.7 52.9 631.7 54.1 ...
## $ UL_GSM2496203 : num 48.1 51.5 55.1 576.7 55 ...
## $ UL_GSM2496204 : num 51.7 50.8 51.6 300.4 52.3 ...
## $ UL_GSM2496205 : num 46.9 47.5 52.3 348.9 53.2 ...
## $ UL_GSM2496206 : num 50.5 43.9 54.9 722.6 56.1 ...
## $ UL_GSM2496207 : num 44 46.5 56.5 1218 55.9 ...
## $ UL_GSM2496208 : num 51.1 46 51.9 535.5 55.5 ...
## $ UL_GSM2496209 : num 55.2 46.9 50.6 669 53.3 ...
## $ UL_GSM2496217 : num 48.9 45.8 49.6 361.9 49.8 ...
## $ UL_GSM2496218 : num 50 49.9 53.9 377.2 53.6 ...
## $ UL_GSM2496219 : num 48.3 49.5 50.4 544.1 58.1 ...
## $ UL_GSM2496220 : num 47.7 46.7 51.7 406.1 56.3 ...
## $ Fibroid_Mean : num 49.4 48.4 53.4 486.1 54 ...
## $ Fibroid_Median : num 49.5 48.5 52.1 408.9 53.7 ...
## $ Fibroid_max : num 55.2 52.3 64.5 1218 58.1 ...
## $ Fibroid_min : num 44 43.9 49.6 300.4 49.8 ...
## $ Fibroid_stdError : num 2.59 2.3 3.61 214.26 2 ...
## $ nonUL_GSM2496194 : num 42.7 42.7 56.6 442 57 ...
## $ nonUL_GSM2496195 : num 49.1 45.3 56 505.4 55.3 ...
## $ nonUL_GSM2496196 : num 51.1 45.3 55.1 416.5 57.4 ...
## $ nonUL_GSM2496197 : num 46 52.6 56.2 475.3 54.6 ...
## $ nonUL_GSM2496198 : num 49.1 54.1 53.7 408.1 50.8 ...
## $ nonUL_GSM2496199 : num 52.6 47.9 51.1 510.7 52.5 ...
## $ nonUL_GSM2496200 : num 46 47.1 52.5 681.9 50.3 ...
## $ nonUL_GSM2496201 : num 45.9 46.8 48.3 689.3 56.7 ...
## $ nonUL_GSM2496202 : num 50.6 45.3 56.5 735.2 48.7 ...
## $ nonUL_GSM2496210 : num 48.3 49.3 56.3 426.9 52.5 ...
## $ nonUL_GSM2496211 : num 46.8 46 54.6 567.3 47.4 ...
## $ nonUL_GSM2496212 : num 50.6 48.7 54.3 493.8 51.5 ...
## $ nonUL_GSM2496213 : num 49.1 49.5 56.7 792.8 48.8 ...
## $ nonUL_GSM2496214 : num 48.8 49.9 54 710.5 50.6 ...
## $ nonUL_GSM2496215 : num 48 47.1 52.9 795.4 53.1 ...
## $ nonUL_GSM2496216 : num 47.9 49.2 48 574.5 51.1 ...
## $ nonUL_GSM2496221 : num 48.8 48.2 50.6 507.4 54.7 ...
## $ nonUL_GSM2496222 : num 44.6 49.5 50.4 590 56 ...
## $ nonFibroid_Mean : num 48.1 48 53.5 573.5 52.7 ...
## $ nonFibroid_Median : num 48.5 48.1 54.1 539 52.5 ...
## $ nonFibroid_max : num 52.6 54.1 56.7 795.4 57.4 ...
## $ nonFibroid_min : num 42.7 42.7 48 408.1 47.4 ...
## $ nonFibroid_stdError : num 2.46 2.75 2.85 130.03 3.07 ...
Lets change the ‘fibroid’ in the column names to ‘UL’ for uterine leiomyoma.
colnames(all) <- gsub('Fibroid', 'UL', colnames(all))
Reorder the table so that the stats are at the end of the columns.
All <- all[,c(1:10,11:30, 36:53,31:35,54:58)]
str(All)
## 'data.frame': 30549 obs. of 58 variables:
## $ SEQUENCE : Factor w/ 48701 levels "AAAAAACAAAACCGCGCAGCGGAGAACCGGTGCCTGAGTCTCCCAGGGAC",..: 1 4 5 7 8 10 12 13 15 18 ...
## $ Symbol : Factor w/ 25036 levels "1-Dec","1-Mar",..: 12031 11383 13002 14397 15611 12721 12474 10953 24822 18000 ...
## $ gene_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Probe_Chr_Orientation: Factor w/ 3 levels "-","+","NA": 2 1 2 3 3 2 2 3 1 2 ...
## $ Probe_Coordinates : Factor w/ 41351 levels "100000925-100000974",..: 33600 3552 19430 41351 41351 36542 28872 41351 31503 41119 ...
## $ Cytoband : Factor w/ 3676 levels "10p11.1d","10p11.21a",..: 3472 3472 3472 3472 3472 293 3472 3472 1261 2180 ...
## $ Definition : Factor w/ 46614 levels "{3 region, probe S2} [human, 76N, mammary epithelial cells, mRNA Partial, 339 nt]",..: 33743 33388 39229 37744 38664 34103 33968 35730 29449 21454 ...
## $ Ontology_Component : Factor w/ 7849 levels "A 20S multiprotein assembly of total mass about 1.2 MDa that activates dynein-based activity in vivo. A large s"| __truncated__,..: 4150 4150 4150 4150 4150 4150 4150 4150 5434 6267 ...
## $ Ontology_Process : Factor w/ 8950 levels "[goid 19642] [evidence IEA]; The chemical reactions and pathways involving carbohydrates, any of a group of org"| __truncated__,..: 2492 2492 2492 2492 2492 2492 2492 2492 8502 3316 ...
## $ Ontology_Function : Factor w/ 9453 levels "[goid 15280] [evidence IEA]; Interacting selectively with sodium ions (Na+) [goid 31402] [evidence IEA]",..: 8111 8111 8111 8111 8111 8111 8111 8111 7071 2705 ...
## $ UL_GSM2496185 : num 49.5 51.2 59.9 312.3 52.3 ...
## $ UL_GSM2496186 : num 51.5 52.3 64.5 333.3 53.7 ...
## $ UL_GSM2496187 : num 46.9 49 50.3 331.8 56.4 ...
## $ UL_GSM2496188 : num 50.1 48.4 52 360.9 51.8 ...
## $ UL_GSM2496189 : num 54 49.8 52.8 339 52.6 ...
## $ UL_GSM2496190 : num 50.1 48.6 52 411.7 51.9 ...
## $ UL_GSM2496191 : num 49.5 46 54.7 424.3 54.8 ...
## $ UL_GSM2496192 : num 46.5 50.6 50.2 517.3 53.6 ...
## $ UL_GSM2496193 : num 48 46.7 52.9 631.7 54.1 ...
## $ UL_GSM2496203 : num 48.1 51.5 55.1 576.7 55 ...
## $ UL_GSM2496204 : num 51.7 50.8 51.6 300.4 52.3 ...
## $ UL_GSM2496205 : num 46.9 47.5 52.3 348.9 53.2 ...
## $ UL_GSM2496206 : num 50.5 43.9 54.9 722.6 56.1 ...
## $ UL_GSM2496207 : num 44 46.5 56.5 1218 55.9 ...
## $ UL_GSM2496208 : num 51.1 46 51.9 535.5 55.5 ...
## $ UL_GSM2496209 : num 55.2 46.9 50.6 669 53.3 ...
## $ UL_GSM2496217 : num 48.9 45.8 49.6 361.9 49.8 ...
## $ UL_GSM2496218 : num 50 49.9 53.9 377.2 53.6 ...
## $ UL_GSM2496219 : num 48.3 49.5 50.4 544.1 58.1 ...
## $ UL_GSM2496220 : num 47.7 46.7 51.7 406.1 56.3 ...
## $ nonUL_GSM2496194 : num 42.7 42.7 56.6 442 57 ...
## $ nonUL_GSM2496195 : num 49.1 45.3 56 505.4 55.3 ...
## $ nonUL_GSM2496196 : num 51.1 45.3 55.1 416.5 57.4 ...
## $ nonUL_GSM2496197 : num 46 52.6 56.2 475.3 54.6 ...
## $ nonUL_GSM2496198 : num 49.1 54.1 53.7 408.1 50.8 ...
## $ nonUL_GSM2496199 : num 52.6 47.9 51.1 510.7 52.5 ...
## $ nonUL_GSM2496200 : num 46 47.1 52.5 681.9 50.3 ...
## $ nonUL_GSM2496201 : num 45.9 46.8 48.3 689.3 56.7 ...
## $ nonUL_GSM2496202 : num 50.6 45.3 56.5 735.2 48.7 ...
## $ nonUL_GSM2496210 : num 48.3 49.3 56.3 426.9 52.5 ...
## $ nonUL_GSM2496211 : num 46.8 46 54.6 567.3 47.4 ...
## $ nonUL_GSM2496212 : num 50.6 48.7 54.3 493.8 51.5 ...
## $ nonUL_GSM2496213 : num 49.1 49.5 56.7 792.8 48.8 ...
## $ nonUL_GSM2496214 : num 48.8 49.9 54 710.5 50.6 ...
## $ nonUL_GSM2496215 : num 48 47.1 52.9 795.4 53.1 ...
## $ nonUL_GSM2496216 : num 47.9 49.2 48 574.5 51.1 ...
## $ nonUL_GSM2496221 : num 48.8 48.2 50.6 507.4 54.7 ...
## $ nonUL_GSM2496222 : num 44.6 49.5 50.4 590 56 ...
## $ UL_Mean : num 49.4 48.4 53.4 486.1 54 ...
## $ UL_Median : num 49.5 48.5 52.1 408.9 53.7 ...
## $ UL_max : num 55.2 52.3 64.5 1218 58.1 ...
## $ UL_min : num 44 43.9 49.6 300.4 49.8 ...
## $ UL_stdError : num 2.59 2.3 3.61 214.26 2 ...
## $ nonUL_Mean : num 48.1 48 53.5 573.5 52.7 ...
## $ nonUL_Median : num 48.5 48.1 54.1 539 52.5 ...
## $ nonUL_max : num 52.6 54.1 56.7 795.4 57.4 ...
## $ nonUL_min : num 42.7 42.7 48 408.1 47.4 ...
## $ nonUL_stdError : num 2.46 2.75 2.85 130.03 3.07 ...
All_stats_only <- All[,c(1,2,3,49:58)]
stats_all <- All_stats_only[!duplicated(All_stats_only),]
stats_all$foldChangeMean_UL_to_nonUL <- stats_all$UL_Mean/stats_all$nonUL_Mean
FoldChangeGenes <- stats_all[order(stats_all$foldChangeMean_UL_to_nonUL, decreasing=TRUE)[c(1:5,30545:30549)],]
FoldChangeGenes
## SEQUENCE Symbol gene_count
## 16709 GCAACGCTCCTCTGAAATGCTTGTCTTTTTTCTGTTGCCGAAATAGCTGG KIAA1199 1
## 17948 GCCCCAGCAAGCCTCCCTCCATCCTCCAGTGGGAAACTGTTGATGGTGTT PENK 1
## 22790 GGTATTGCTGATCGTATGCAGAAGGAAATCACTGCTCTGGCTCCTAGCAC ACTC 1
## 28162 TGCGAGACCTGGGTGTCCAACCTGCGCTACAACCACATGCTGCGGAAGAA DLK1 2
## 3569 AGGCCCTGGAGGCTGCAACATACCTCAATCCTGTCCCAGGCCGGATCCTC MMP11 1
## 17565 GCCAACCTCCTCTCACAGCCTCTGTATCTCTGCAGGCCATACTGGTTCCA ABCA8 1
## 6582 CAGATGTTTTCCCTTGTGGCAGTCTTCAGCCTCCTCTACCCTACATGATC ADH1A 1
## 8958 CCCAGTGACACTTCAGAGAGCTGGTAGTTAGTAGCATGTTGAGCCAGGCC FOS 1
## 27542 TGACTGTCCCTGCCAATGCTCCAGCTGTCGTCTGACTCTGGGTTCGTTGG FOSB 1
## 6881 CAGCTGGGCGATGTGCGAGCTGATAGTGAGCGGCAGAATCAGGAGTACCA KRT19 1
## UL_Mean UL_Median UL_max UL_min UL_stdError nonUL_Mean
## 16709 3029.4710 2566.7096 6548.8160 156.8375 1948.88193 167.75283
## 17948 1798.4359 237.7790 26134.7105 48.2000 5783.33114 120.28719
## 22790 4693.5182 4880.4535 13992.2647 110.2923 3328.78611 392.30767
## 28162 946.4365 249.4440 5799.7170 54.7250 1542.06289 87.41913
## 3569 4767.6862 3652.1430 15286.3475 235.9674 4029.71487 442.19991
## 17565 143.2650 99.0957 628.2257 56.0500 126.17477 724.33959
## 6582 608.7258 545.4995 1947.6415 54.3500 476.16852 3102.90066
## 8958 1287.1220 805.1551 7335.1691 220.6810 1573.12379 6916.73077
## 27542 205.9075 70.8273 2290.1261 53.1000 493.97933 1142.67872
## 6881 130.4870 85.6123 311.2020 45.6000 90.60451 815.61118
## nonUL_Median nonUL_max nonUL_min nonUL_stdError
## 16709 116.83410 514.9222 63.1000 127.90943
## 17948 71.88335 431.7050 54.2000 98.67055
## 22790 332.26735 830.2335 169.4130 197.47324
## 28162 52.70835 669.4027 47.3000 145.29680
## 3569 308.24915 1683.5592 75.9750 430.96362
## 17565 643.88430 1493.2380 125.9636 317.38002
## 6582 2961.70150 7727.8238 275.2883 1833.54639
## 8958 6722.92010 17362.0317 2406.0706 3676.04993
## 27542 595.09105 7604.0221 92.4250 1864.52510
## 6881 738.20585 2049.6529 57.4000 498.12489
## foldChangeMean_UL_to_nonUL
## 16709 18.0591346
## 17948 14.9511840
## 22790 11.9638706
## 28162 10.8264232
## 3569 10.7817440
## 17565 0.1977871
## 6582 0.1961796
## 8958 0.1860882
## 27542 0.1801972
## 6881 0.1599868
str(stats_all)
## 'data.frame': 30549 obs. of 14 variables:
## $ SEQUENCE : Factor w/ 48701 levels "AAAAAACAAAACCGCGCAGCGGAGAACCGGTGCCTGAGTCTCCCAGGGAC",..: 1 4 5 7 8 10 12 13 15 18 ...
## $ Symbol : Factor w/ 25036 levels "1-Dec","1-Mar",..: 12031 11383 13002 14397 15611 12721 12474 10953 24822 18000 ...
## $ gene_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ UL_Mean : num 49.4 48.4 53.4 486.1 54 ...
## $ UL_Median : num 49.5 48.5 52.1 408.9 53.7 ...
## $ UL_max : num 55.2 52.3 64.5 1218 58.1 ...
## $ UL_min : num 44 43.9 49.6 300.4 49.8 ...
## $ UL_stdError : num 2.59 2.3 3.61 214.26 2 ...
## $ nonUL_Mean : num 48.1 48 53.5 573.5 52.7 ...
## $ nonUL_Median : num 48.5 48.1 54.1 539 52.5 ...
## $ nonUL_max : num 52.6 54.1 56.7 795.4 57.4 ...
## $ nonUL_min : num 42.7 42.7 48 408.1 47.4 ...
## $ nonUL_stdError : num 2.46 2.75 2.85 130.03 3.07 ...
## $ foldChangeMean_UL_to_nonUL: num 1.027 1.007 0.997 0.848 1.024 ...
write.csv(stats_all, 'stats_only_UL_nonUL.csv', row.names=FALSE)
Combine the table of top and bottom five genes in fold change values of the ratio of UL to non-UL sample means, FoldChangeGenes, with the table of the ten genes having the highest number of copy number variations or genotypes, GeneCopyNumberVariants.
ontology <- nonFibroid[,c(1,6:9)]
gnc <- as.data.frame(GeneCopyNumberVariants)[1]
keyGenes1 <- merge(gnc, stats_all, by.x='Symbol', by.y='Symbol')
keyGenes1a <- merge(keyGenes1, ontology, by.x='Symbol', by.y='Symbol')
keyGenes2 <- merge(FoldChangeGenes, ontology, by.x='Symbol', by.y='Symbol')
keyGenes2a <- keyGenes2[,c(1:3,15:18,4:14)]
keyGenes1b <- keyGenes1a[,c(1:3,15:18,4:14)]
KeyGenes <- rbind(keyGenes2a, keyGenes1b)
KG <- KeyGenes[!duplicated(KeyGenes$SEQUENCE),]
KG1 <- KG[order(KG$foldChangeMean_UL_to_nonUL, decreasing=TRUE),]
KG1[,c(1:3,18)]
## Symbol SEQUENCE gene_count
## 8 KIAA1199 GCAACGCTCCTCTGAAATGCTTGTCTTTTTTCTGTTGCCGAAATAGCTGG 1
## 11 PENK GCCCCAGCAAGCCTCCCTCCATCCTCCAGTGGGAAACTGTTGATGGTGTT 1
## 2 ACTC GGTATTGCTGATCGTATGCAGAAGGAAATCACTGCTCTGGCTCCTAGCAC 1
## 4 DLK1 TGCGAGACCTGGGTGTCCAACCTGCGCTACAACCACATGCTGCGGAAGAA 2
## 10 MMP11 AGGCCCTGGAGGCTGCAACATACCTCAATCCTGTCCCAGGCCGGATCCTC 1
## 68 CTNNB1 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG 7
## 89 CTNNB1 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT 7
## 82 CTNNB1 AGTCTCTCGTAGTGTTAAGTTATAGTGAATACTGCTACAGCAATTTCTAA 7
## 231 DMD CAGTGTTGGGATCACTCACTTTCCCCCTACAGGACTCAGATCTGGGAGGC 7
## 210 DMD CTCCTCTCAGCTGAACACCCTCCTTTCACTCCCAAATGCAAACAGTCTCT 7
## 96 CTNNB1 GCCTCTTGCACTCTGAATTGGGAATGTTTGCACCACAGTGGGGGGCTTGC 7
## 140 DDX12 CCGCCGGGCTGCTTTTTCCTTGGATGCCCATCAGGACGCCTCAGTTCTCT 10
## 397 LOC23117 TCAACCACATCCTTCAAAAGGACTATGCCTGTTTATAAGCCCAGCTGTTT 8
## 361 LOC202134 GCCAAAGGAATGGGCTCCAGACACCCCCTCTTCCAGAGCAAGGATGAAGG 7
## 286 KIAA0692 TAAGTGCAGTGAGCTCTGGCGGAAACCACCCTCTGCCCCGTCTGTTGGAT 9
## 61 CTNNB1 CAAACTTTACAGAGGAGAATGCCCTGTTTGTTAACCATGTTTCTTTTGGC 7
## 516 LOC653086 TGATGTGTCACGCCACTGTACTCCAGCCTGACGGCAGAGCGAGACTCCAT 7
## 33 BDNF CCCTCCACCTCCTGCTCGGGGGGCTTTAATGAGACACCCACCGCTGCTGT 7
## 551 PLEC1 CCCGACGAGCAGGACTTCATCCAGGCCTACGAGGAGGTGCGCGAGAAGTA 8
## 445 LOC23117 TGTCGTTTCCTCCATTCTTCACCAAAACATCAGCGTACATAGGCACATGG 8
## 474 LOC339047 ACTGCCTGTGTGGCTCCTTGAGTGCGCGGAGGCCAAAGCTGAGATGACTT 7
## 331 KIAA0692 CATTGTAATGATAAGGAAATGTTGCGATCAAATAAGATTTAGACACACTT 9
## 559 PLEC1 CCCTCGGGCAGCCTGTTTCCCTCCCTGGTGGTTGTGGGTCACGTTGTCAC 8
## 530 LOC653086 GGTGTGCTCTGGTATGTAATGACAATATGTGAACAAACCTGTGGAATTAA 7
## 259 KIAA0692 GAGTGCTGGGAAGGTTAATGTTAAATGGGTTGTGTGTCGGGGAGGGTACA 9
## 429 LOC23117 GGCTCCTCTTTGGGCTCCTACTGGAATTTATCAGCCATCAGTGCATCTCT 8
## 252 DMD TCTATCAACAGAGCTGAATGAGTGCCAGGAAGCTGCGAAATCTGTCTTAC 7
## 268 KIAA0692 GCAAATGTAACTCAGGGGTTTGGGGCCAGAGGAAGAGGGAGAAGGTGGCC 9
## 19 BDNF AATAATAGAGTGTGGGAGTTTTGGGGCCGAAGTCTTTCCCGGAGCAGCTG 7
## 340 LOC202134 CAAACCCTTGAAGACATTTCAGGGCCATGCTCACTTGGGAGGGTTTGAGG 7
## 405 LOC23117 AAAGCAGTGGTTTTCAGCTGCCAGAGGCCTGAGAGAGTTTGGGCATACTC 8
## 245 DMD CCATTCAGAAGAATGATAAATGCCACAAGCATTTGGAAACAGGCTTCCCT 7
## 322 KIAA0692 AGCTCCACCTTGACCCAGCCTCACAACAAAAAGTTTGTGTATGACCAGGC 9
## 375 LOC202134 AGTGGGCAGAATGATGAGGGAAGTGGGCACGTGCCCATGTTCTTCTTGGC 7
## 382 LOC202134 TTCATCCAGGCCTGCGCCGGTGTTCACAGTGGTCCTCATCTAAGCCAGCC 7
## 591 PLEC1 CCGGGCCTTCTCGTGGTACCCTGCCTGCTGCCTTTGCCCCCGCACTGACT 8
## 47 BDNF GCTCGCTGAAGTTGGCTTCCTAGCGGTGTAGGCTGGAATAGACTCTTGGC 7
## 575 PLEC1 GGCGCAGACATGGACCCCTCGCGAGCCATCCAGAACGAGATCAGCTCCCT 8
## 313 KIAA0692 GATCACAGGCACAGGGAAGCCACAAGGAGCTCTGTATGAGTTGTGTTTGC 9
## 495 LOC339047 TTTCAGGCCCATGGCAGAGGGTGGGCTCAGGAGGGCCATCGTGGGTGTCC 7
## 170 DDX12 GCAGGGGAGATTGGGTTTAGGGGCTTTCCTGGTCTGCATTCTGCTACAGC 10
## 460 LOC339047 AGTGCCCACATCACACAGCATCTAGCACGTAACTGCACCCCGGGAGTCGT 7
## 437 LOC23117 GGCTCTGTTGGAATCCGCATAGTGTGGAAATGAGTTTGCCCTGGAAAGGG 8
## 502 LOC653086 AGTGTTGGGACTACAGGTGTGTGTTACTGCTCCCAGCTGGGAGGCAGGCT 7
## 509 LOC653086 GTGAGCCTGTTTCATCATCTGTAAACTTTGAATAATGATACCTACCCCGC 7
## 467 LOC339047 CGCCCTGAAAGGACCAGGACATGCGGGTGCGGTGGCTGCTCTTTTGGCTC 7
## 150 DDX12 CAGGGCAGGAACCACGTCTTTACAGTTTGATGTTCCCAGAGCTGACCCAG 10
## 75 CTNNB1 CAGGAATCTAGTCTGGATGACTGCTTCTGGAGCCTGGATGCAGTACCATT 7
## 295 KIAA0692 GTTGTTCTGGACGATCTTCGGGATCCTCTGGGGCACTGTGACACTCGGAG 9
## 180 DDX12 TCTCCTGCCCCCTCCGGAAGCTTGGATGCCCCTCCACACCCTCTTGATCT 10
## 389 LOC23117 CTGGCCTTCCCTCATCAGCCGTAAATGATGATTTACTGCTGTTACCATCA 8
## 103 CTNNB1 GCAATTTGCCAAGTTTCTTTAGCATTTGGCCCTGGATTACGCTGGACCCC 7
## 413 LOC23117 CCCTTCCTACATTCTTGTTTTCATTTTTTCGGAGGAAGAGGAGTTGCTAG 8
## 523 LOC653086 AGCAGCACATCGTCATTTTACAATTGAGAAACATGGAGACTCCAAATGGA 7
## 421 LOC23117 GGGAAGTACATGGGGCAGATGGAAGAACCTGAGATAATCGCAAGGATGGC 8
## 40 BDNF TCAGACCCCTCAGGCCACTGCTGTTCCTGTCACACATTCCTGCAAAGGAC 7
## 607 PLEC1 CTCCGTCTGCCCCGTGGGCTCCTGCCACCGTCCCCGATGAAGATCGTGCC 8
## 453 LOC339047 TTTCCTGAAATGGAGCTTTGCTCTTGTTGCCCAGGCCGTAGTGCAATGGC 7
## 599 PLEC1 GCCTTTGCCTCGCCGAGGGAGGTCTTGCTGGAGCGGCCGTGCTGGCTGGA 8
## 12 BDNF ATGTACGTGGGGGATTCTTGACTCGGGTTAGTCTCTGGGGATGCAGAGCC 7
## 583 PLEC1 CAGCCCTGGGGACACACTGCCCTGGAACCTTGGGAAAACGCAGCGGAGCC 8
## 190 DDX12 CGTTGCTACAAGCTGTTTTTTGAATGTCTCTACACAGTCCAGGCAGGAAG 10
## 481 LOC339047 TCTGTATGGACCCTGCCAAGCTCTGCCCCTCTGCCCCTGCATTGGGGCGC 7
## 54 BDNF TGGGGAGACGAGATTTTAAGACACTTGAGTCTCCAGGACAGCAAAGGCAC 7
## 160 DDX12 CAGACTTCTCGCTTCCTTTCTGCTGGGCCTCTGAGGGGTCATGGGGCCAT 10
## 304 KIAA0692 AAGTGGTGCCTGGCTGTCCCTATACTGTGCTGCTGGGTGTTCCAGCCTGT 9
## 224 DMD GCAGCCAACTTATTGGCATGATGGAGTGACAGGAAAAACAGCTGGCATGG 7
## 200 DDX12 TTACTGGGGATGGTATTTAGGAGCCAGGAAAGCCGGTGCATTCCTAGTGA 10
## 544 LOC653086 TACCTGGCCTATCTTTCATAGGTTATATAAATTCCTTGGTTCCCAGTTTT 7
## 217 DMD GGGTTTTCTCAGGATTGCTATGCAACAGGATCAGTGCTGTAGTGCCCGGT 7
## 120 DDX12 ACATGTGCTGTCACTGGAACTTGCTCTTTTCACTCAGCAGCCAGAGGGTC 10
## 110 DDX12 CCAGTCCCTGACTACAGAGGATTTCCCCAAAGTCCCTGGCTGTGAGGTTC 10
## 130 DDX12 AAACGTTACAGTGTTCCGATGAGACACAGTAGGCAGTACTTGGGAGGGTC 10
## 368 LOC202134 GACCAAAGCAGGACAATTGCTTGATCCCAGGAGTTTAAGACCAGCCGGGG 7
## 537 LOC653086 AAGGACTCAGATGCAGGGTCTTCTCTGCTCCCCGTCACACAGAGGGTGGC 7
## 277 KIAA0692 CAGGCGACTGGGTAGCAGATGTGGAAGCTGATGGTTAGGCCCAGGGCATG 9
## 488 LOC339047 GACCTGTAGCTAAACCTTCCACCAGCGCTTGAGAACTTAATTTGAACCGG 7
## 238 DMD GCACTCCGACTACATCAGGAGAAGATGTTCGAGACTTTGCCAAGGTACTA 7
## 354 LOC202134 CCACGCCGGCAAAGAAATTGGAAGACTCCACCATTACAGGCAGCCACCAG 7
## 26 BDNF CTTGCTGTGGTCTCTTTGTGGCAGAAGTGTTTCATGCATGGCAGCAGGCC 7
## 567 PLEC1 AGCCTCTGTTCCCCTAGTAAGTGCCTTCCATGTCGGCCTCTAACCCCAGG 8
## 347 LOC202134 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC 7
## 1 ABCA8 GCCAACCTCCTCTCACAGCCTCTGTATCTCTGCAGGCCATACTGGTTCCA 1
## 3 ADH1A CAGATGTTTTCCCTTGTGGCAGTCTTCAGCCTCCTCTACCCTACATGATC 1
## 6 FOS CCCAGTGACACTTCAGAGAGCTGGTAGTTAGTAGCATGTTGAGCCAGGCC 1
## 7 FOSB TGACTGTCCCTGCCAATGCTCCAGCTGTCGTCTGACTCTGGGTTCGTTGG 1
## 9 KRT19 CAGCTGGGCGATGTGCGAGCTGATAGTGAGCGGCAGAATCAGGAGTACCA 1
## foldChangeMean_UL_to_nonUL
## 8 18.0591346
## 11 14.9511840
## 2 11.9638706
## 4 10.8264232
## 10 10.7817440
## 68 1.5689829
## 89 1.5248323
## 82 1.2503004
## 231 1.1977828
## 210 1.1332249
## 96 1.1098758
## 140 1.1076674
## 397 1.0971923
## 361 1.0775532
## 286 1.0762308
## 61 1.0740152
## 516 1.0680820
## 33 1.0451334
## 551 1.0433765
## 445 1.0317830
## 474 1.0291369
## 331 1.0250419
## 559 1.0243757
## 530 1.0237370
## 259 1.0212595
## 429 1.0204791
## 252 1.0202573
## 268 1.0150804
## 19 1.0137555
## 340 1.0132932
## 405 1.0108892
## 245 1.0100160
## 322 1.0098454
## 375 1.0093062
## 382 1.0091169
## 591 1.0071980
## 47 1.0067778
## 575 1.0042547
## 313 1.0038554
## 495 1.0036545
## 170 1.0019589
## 460 1.0005056
## 437 0.9998782
## 502 0.9991520
## 509 0.9982094
## 467 0.9981707
## 150 0.9980314
## 75 0.9977081
## 295 0.9976800
## 180 0.9964084
## 389 0.9937542
## 103 0.9930219
## 413 0.9927896
## 523 0.9924044
## 421 0.9921585
## 40 0.9911345
## 607 0.9868791
## 453 0.9867059
## 599 0.9860355
## 12 0.9821732
## 583 0.9816619
## 190 0.9793547
## 481 0.9791535
## 54 0.9767836
## 160 0.9762911
## 304 0.9737579
## 224 0.9735481
## 200 0.9725752
## 544 0.9708972
## 217 0.9699886
## 120 0.9694418
## 110 0.9672238
## 130 0.9664926
## 368 0.9545894
## 537 0.9529309
## 277 0.9406746
## 488 0.9392117
## 238 0.9047898
## 354 0.8995475
## 26 0.8520654
## 567 0.7581050
## 347 0.4746867
## 1 0.1977871
## 3 0.1961796
## 6 0.1860882
## 7 0.1801972
## 9 0.1599868
write.csv(KG1,'keyGenes_UL_FCs_CNVs.csv', row.names=FALSE)
Order by gene count, then by fold change.
KG2 <- KG1[with(KG1, order(gene_count, foldChangeMean_UL_to_nonUL, decreasing=TRUE)),]
Lets add in a fold change of the median value ratios of UL to non-UL samples to compare.
colnames(KG2)[18] <- 'foldChange_Mean'
KG2$foldChange_Median <- KG2$UL_Median/KG2$nonUL_Median
Lets look at some of these copy number variants of one gene with seven copy number variants or CNVs and see where the changes in the nucleotide sequences occur. Copy number variations in nucleotides can have short repeats, jumps in sequence, insertions, or deletions of a gene. I have been calling these CNVs genotypes, which are the traits and alleles responsible for the physical traits or phenotypes of an organism. Some CNVs are responsible for diseases, and in tumors there are many different CNVs that are found to be responsible. A uterine leiomyoma or fibroid is a benign tumor. These samples were taken from uterus tissue with these uternine tumors and the same neighboring uterine tissue without uterine tumors.
CTNNB1 <- subset(KG2, KG2$Symbol=='CTNNB1')
CTNNB1_seq <- CTNNB1[,1:2]
Add in a column to describe the length of the nucleotides in each copy number variant nucleotide strand.
CTNNB1_seq$SEQUENCE <- as.character(CTNNB1$SEQUENCE)
CTNNB1_seq$nChar <- nchar(CTNNB1_seq$SEQUENCE)
Lets look at the CNVs of the CTNNB1 gene.
CTNNB1_seq
## Symbol SEQUENCE nChar
## 68 CTNNB1 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG 50
## 89 CTNNB1 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT 50
## 82 CTNNB1 AGTCTCTCGTAGTGTTAAGTTATAGTGAATACTGCTACAGCAATTTCTAA 50
## 96 CTNNB1 GCCTCTTGCACTCTGAATTGGGAATGTTTGCACCACAGTGGGGGGCTTGC 50
## 61 CTNNB1 CAAACTTTACAGAGGAGAATGCCCTGTTTGTTAACCATGTTTCTTTTGGC 50
## 75 CTNNB1 CAGGAATCTAGTCTGGATGACTGCTTCTGGAGCCTGGATGCAGTACCATT 50
## 103 CTNNB1 GCAATTTGCCAAGTTTCTTTAGCATTTGGCCCTGGATTACGCTGGACCCC 50
From the above, some of the CNVs make you wonder if they are even the same gene. The first two have the same pattern of ‘CTGCAGGG’ then some variations. Then its not obvious what the other sequence alignments are. We could go back to the cytoband location and where the gene starts to see if there is more information.
Lets get the SEQUENCE, protein product, and cytoband columns from the original UL1 table.
cytoband <- UL1[,c(15,20,24)]
Now combine with the CTNNB1_seq and the KG2 table.
CTNNB1_cyto <- merge(cytoband, CTNNB1_seq, by.x='SEQUENCE', by.y='SEQUENCE')
KG2_cyto <- merge(cytoband, KG2, by.x='SEQUENCE', by.y='SEQUENCE')
Now lets look at the KG2_cyto table to see where these CNVs are located within the cytoband of each gene location.
KG3 <- KG2_cyto[with(KG2_cyto, order(gene_count,foldChange_Mean, decreasing = TRUE)),]
KG3[,1:5]
## SEQUENCE Protein_Product
## 37 CCGCCGGGCTGCTTTTTCCTTGGATGCCCATCAGGACGCCTCAGTTCTCT XP_936926.1
## 56 GCAGGGGAGATTGGGTTTAGGGGCTTTCCTGGTCTGCATTCTGCTACAGC XP_937020.1
## 26 CAGGGCAGGAACCACGTCTTTACAGTTTGATGTTCCCAGAGCTGACCCAG XP_936919.1
## 77 TCTCCTGCCCCCTCCGGAAGCTTGGATGCCCCTCCACACCCTCTTGATCT XP_936947.1
## 41 CGTTGCTACAAGCTGTTTTTTGAATGTCTCTACACAGTCCAGGCAGGAAG XP_937000.1
## 20 CAGACTTCTCGCTTCCTTTCTGCTGGGCCTCTGAGGGGTCATGGGGCCAT XP_936988.1
## 84 TTACTGGGGATGGTATTTAGGAGCCAGGAAAGCCGGTGCATTCCTAGTGA XP_936932.1
## 6 ACATGTGCTGTCACTGGAACTTGCTCTTTTCACTCAGCAGCCAGAGGGTC XP_936976.1
## 30 CCAGTCCCTGACTACAGAGGATTTCCCCAAAGTCCCTGGCTGTGAGGTTC XP_936952.1
## 1 AAACGTTACAGTGTTCCGATGAGACACAGTAGGCAGTACTTGGGAGGGTC XP_936980.1
## 72 TAAGTGCAGTGAGCTCTGGCGGAAACCACCCTCTGCCCCGTCTGTTGGAT XP_935983.1
## 28 CATTGTAATGATAAGGAAATGTTGCGATCAAATAAGATTTAGACACACTT XP_935991.1
## 49 GAGTGCTGGGAAGGTTAATGTTAAATGGGTTGTGTGTCGGGGAGGGTACA XP_935974.1
## 51 GCAAATGTAACTCAGGGGTTTGGGGCCAGAGGAAGAGGGAGAAGGTGGCC XP_935936.1
## 10 AGCTCCACCTTGACCCAGCCTCACAACAAAAAGTTTGTGTATGACCAGGC XP_935967.1
## 50 GATCACAGGCACAGGGAAGCCACAAGGAGCTCTGTATGAGTTGTGTTTGC XP_935893.1
## 71 GTTGTTCTGGACGATCTTCGGGATCCTCTGGGGCACTGTGACACTCGGAG XP_936004.1
## 4 AAGTGGTGCCTGGCTGTCCCTATACTGTGCTGCTGGGTGTTCCAGCCTGT XP_935903.1
## 25 CAGGCGACTGGGTAGCAGATGTGGAAGCTGATGGTTAGGCCCAGGGCATG XP_935881.1
## 74 TCAACCACATCCTTCAAAAGGACTATGCCTGTTTATAAGCCCAGCTGTTT XP_938957.1
## 33 CCCGACGAGCAGGACTTCATCCAGGCCTACGAGGAGGTGCGCGAGAAGTA NP_958780.1
## 83 TGTCGTTTCCTCCATTCTTCACCAAAACATCAGCGTACATAGGCACATGG XP_938806.1
## 35 CCCTCGGGCAGCCTGTTTCCCTCCCTGGTGGTTGTGGGTCACGTTGTCAC NP_958784.1
## 64 GGCTCCTCTTTGGGCTCCTACTGGAATTTATCAGCCATCAGTGCATCTCT XP_938917.1
## 2 AAAGCAGTGGTTTTCAGCTGCCAGAGGCCTGAGAGAGTTTGGGCATACTC XP_938927.1
## 38 CCGGGCCTTCTCGTGGTACCCTGCCTGCTGCCTTTGCCCCCGCACTGACT NP_958782.1
## 63 GGCGCAGACATGGACCCCTCGCGAGCCATCCAGAACGAGATCAGCTCCCT NP_958781.1
## 65 GGCTCTGTTGGAATCCGCATAGTGTGGAAATGAGTTTGCCCTGGAAAGGG XP_938916.1
## 45 CTGGCCTTCCCTCATCAGCCGTAAATGATGATTTACTGCTGTTACCATCA XP_939002.1
## 36 CCCTTCCTACATTCTTGTTTTCATTTTTTCGGAGGAAGAGGAGTTGCTAG XP_938960.1
## 66 GGGAAGTACATGGGGCAGATGGAAGAACCTGAGATAATCGCAAGGATGGC XP_938807.1
## 42 CTCCGTCTGCCCCGTGGGCTCCTGCCACCGTCCCCGATGAAGATCGTGCC NP_958783.1
## 61 GCCTTTGCCTCGCCGAGGGAGGTCTTGCTGGAGCGGCCGTGCTGGCTGGA NP_958785.1
## 22 CAGCCCTGGGGACACACTGCCCTGGAACCTTGGGAAAACGCAGCGGAGCC NP_000436.2
## 9 AGCCTCTGTTCCCCTAGTAAGTGCCTTCCATGTCGGCCTCTAACCCCAGG NP_958786.1
## 11 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG XP_950743.1
## 44 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT NP_001895.1
## 13 AGTCTCTCGTAGTGTTAAGTTATAGTGAATACTGCTACAGCAATTTCTAA NP_001895.1
## 27 CAGTGTTGGGATCACTCACTTTCCCCCTACAGGACTCAGATCTGGGAGGC NP_003997.1
## 43 CTCCTCTCAGCTGAACACCCTCCTTTCACTCCCAAATGCAAACAGTCTCT NP_004010.1
## 60 GCCTCTTGCACTCTGAATTGGGAATGTTTGCACCACAGTGGGGGGCTTGC XP_950747.1
## 57 GCCAAAGGAATGGGCTCCAGACACCCCCTCTTCCAGAGCAAGGATGAAGG XP_937236.1
## 19 CAAACTTTACAGAGGAGAATGCCCTGTTTGTTAACCATGTTTCTTTTGGC XP_950746.1
## 80 TGATGTGTCACGCCACTGTACTCCAGCCTGACGGCAGAGCGAGACTCCAT XP_936088.1
## 34 CCCTCCACCTCCTGCTCGGGGGGCTTTAATGAGACACCCACCGCTGCTGT NP_001700.2
## 7 ACTGCCTGTGTGGCTCCTTGAGTGCGCGGAGGCCAAAGCTGAGATGACTT XP_937640.1
## 69 GGTGTGCTCTGGTATGTAATGACAATATGTGAACAAACCTGTGGAATTAA XP_936056.1
## 76 TCTATCAACAGAGCTGAATGAGTGCCAGGAAGCTGCGAAATCTGTCTTAC NP_004003.1
## 5 AATAATAGAGTGTGGGAGTTTTGGGGCCGAAGTCTTTCCCGGAGCAGCTG NP_733929.1
## 18 CAAACCCTTGAAGACATTTCAGGGCCATGCTCACTTGGGAGGGTTTGAGG XP_937228.1
## 31 CCATTCAGAAGAATGATAAATGCCACAAGCATTTGGAAACAGGCTTCCCT NP_004001.1
## 15 AGTGGGCAGAATGATGAGGGAAGTGGGCACGTGCCCATGTTCTTCTTGGC XP_937222.1
## 85 TTCATCCAGGCCTGCGCCGGTGTTCACAGTGGTCCTCATCTAAGCCAGCC XP_937214.1
## 62 GCTCGCTGAAGTTGGCTTCCTAGCGGTGTAGGCTGGAATAGACTCTTGGC NP_733928.1
## 86 TTTCAGGCCCATGGCAGAGGGTGGGCTCAGGAGGGCCATCGTGGGTGTCC XP_937694.1
## 14 AGTGCCCACATCACACAGCATCTAGCACGTAACTGCACCCCGGGAGTCGT XP_937456.1
## 16 AGTGTTGGGACTACAGGTGTGTGTTACTGCTCCCAGCTGGGAGGCAGGCT XP_936104.1
## 70 GTGAGCCTGTTTCATCATCTGTAAACTTTGAATAATGATACCTACCCCGC XP_936038.1
## 40 CGCCCTGAAAGGACCAGGACATGCGGGTGCGGTGGCTGCTCTTTTGGCTC XP_937724.1
## 24 CAGGAATCTAGTCTGGATGACTGCTTCTGGAGCCTGGATGCAGTACCATT XP_950748.1
## 53 GCAATTTGCCAAGTTTCTTTAGCATTTGGCCCTGGATTACGCTGGACCCC XP_947138.1
## 8 AGCAGCACATCGTCATTTTACAATTGAGAAACATGGAGACTCCAAATGGA XP_936080.1
## 75 TCAGACCCCTCAGGCCACTGCTGTTCCTGTCACACATTCCTGCAAAGGAC NP_733931.1
## 87 TTTCCTGAAATGGAGCTTTGCTCTTGTTGCCCAGGCCGTAGTGCAATGGC XP_937537.1
## 17 ATGTACGTGGGGGATTCTTGACTCGGGTTAGTCTCTGGGGATGCAGAGCC NP_733930.1
## 78 TCTGTATGGACCCTGCCAAGCTCTGCCCCTCTGCCCCTGCATTGGGGCGC XP_937505.1
## 82 TGGGGAGACGAGATTTTAAGACACTTGAGTCTCCAGGACAGCAAAGGCAC NP_733927.1
## 55 GCAGCCAACTTATTGGCATGATGGAGTGACAGGAAAAACAGCTGGCATGG NP_000100.2
## 73 TACCTGGCCTATCTTTCATAGGTTATATAAATTCCTTGGTTCCCAGTTTT XP_936049.1
## 67 GGGTTTTCTCAGGATTGCTATGCAACAGGATCAGTGCTGTAGTGCCCGGT NP_004005.1
## 47 GACCAAAGCAGGACAATTGCTTGATCCCAGGAGTTTAAGACCAGCCGGGG XP_932593.1
## 3 AAGGACTCAGATGCAGGGTCTTCTCTGCTCCCCGTCACACAGAGGGTGGC XP_936046.1
## 48 GACCTGTAGCTAAACCTTCCACCAGCGCTTGAGAACTTAATTTGAACCGG XP_937490.1
## 54 GCACTCCGACTACATCAGGAGAAGATGTTCGAGACTTTGCCAAGGTACTA NP_004010.1
## 29 CCACGCCGGCAAAGAAATTGGAAGACTCCACCATTACAGGCAGCCACCAG XP_937214.1
## 46 CTTGCTGTGGTCTCTTTGTGGCAGAAGTGTTTCATGCATGGCAGCAGGCC NP_001700.2
## 39 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC XP_371783.3
## 81 TGCGAGACCTGGGTGTCCAACCTGCGCTACAACCACATGCTGCGGAAGAA NP_003827.3
## 52 GCAACGCTCCTCTGAAATGCTTGTCTTTTTTCTGTTGCCGAAATAGCTGG NP_061159.1
## 59 GCCCCAGCAAGCCTCCCTCCATCCTCCAGTGGGAAACTGTTGATGGTGTT NP_006202.1
## 68 GGTATTGCTGATCGTATGCAGAAGGAAATCACTGCTCTGGCTCCTAGCAC NP_005150.1
## 12 AGGCCCTGGAGGCTGCAACATACCTCAATCCTGTCCCAGGCCGGATCCTC NP_005931.2
## 58 GCCAACCTCCTCTCACAGCCTCTGTATCTCTGCAGGCCATACTGGTTCCA NP_009099.1
## 21 CAGATGTTTTCCCTTGTGGCAGTCTTCAGCCTCCTCTACCCTACATGATC NP_000658.1
## 32 CCCAGTGACACTTCAGAGAGCTGGTAGTTAGTAGCATGTTGAGCCAGGCC NP_005243.1
## 79 TGACTGTCCCTGCCAATGCTCCAGCTGTCGTCTGACTCTGGGTTCGTTGG NP_006723.1
## 23 CAGCTGGGCGATGTGCGAGCTGATAGTGAGCGGCAGAATCAGGAGTACCA NP_002267.2
## Cytoband Symbol gene_count
## 37 12p13.31a DDX12 10
## 56 12p13.31a DDX12 10
## 26 12p13.31a DDX12 10
## 77 12p13.31a DDX12 10
## 41 12p13.31a DDX12 10
## 20 12p13.31a DDX12 10
## 84 12p13.31a DDX12 10
## 6 12p13.31a DDX12 10
## 30 12p13.31a DDX12 10
## 1 12p13.31a DDX12 10
## 72 12q24.33d KIAA0692 9
## 28 12q24.33d KIAA0692 9
## 49 12q24.33d KIAA0692 9
## 51 12q24.33d KIAA0692 9
## 10 12q24.33d KIAA0692 9
## 50 12q24.33d KIAA0692 9
## 71 12q24.33d KIAA0692 9
## 4 12q24.33d KIAA0692 9
## 25 12q24.33d KIAA0692 9
## 74 16p12.2a LOC23117 8
## 33 8q24.3g PLEC1 8
## 83 16p12.2a LOC23117 8
## 35 8q24.3g PLEC1 8
## 64 16p12.2a LOC23117 8
## 2 16p12.2a LOC23117 8
## 38 8q24.3g PLEC1 8
## 63 8q24.3g PLEC1 8
## 65 16p12.2a LOC23117 8
## 45 16p12.2a LOC23117 8
## 36 16p12.2a LOC23117 8
## 66 16p12.2a LOC23117 8
## 42 8q24.3g PLEC1 8
## 61 8q24.3g PLEC1 8
## 22 8q24.3g PLEC1 8
## 9 8q24.3g PLEC1 8
## 11 3p22.1b CTNNB1 7
## 44 3p22.1b CTNNB1 7
## 13 3p22.1b CTNNB1 7
## 27 Xp21.2a-p21.1d DMD 7
## 43 Xp21.2a-p21.1d DMD 7
## 60 3p22.1b CTNNB1 7
## 57 5q35.2d LOC202134 7
## 19 3p22.1b CTNNB1 7
## 80 NA LOC653086 7
## 34 11p14.1d BDNF 7
## 7 16p13.11b LOC339047 7
## 69 NA LOC653086 7
## 76 Xp21.2a-p21.1d DMD 7
## 5 11p14.1d BDNF 7
## 18 5q35.2d LOC202134 7
## 31 Xp21.2a-p21.1d DMD 7
## 15 5q35.2d LOC202134 7
## 85 5q35.2d LOC202134 7
## 62 11p14.1d BDNF 7
## 86 16p13.11b LOC339047 7
## 14 16p13.11b LOC339047 7
## 16 NA LOC653086 7
## 70 NA LOC653086 7
## 40 16p13.11b LOC339047 7
## 24 3p22.1b CTNNB1 7
## 53 3p22.1b CTNNB1 7
## 8 NA LOC653086 7
## 75 11p14.1d BDNF 7
## 87 16p13.11b LOC339047 7
## 17 11p14.1d BDNF 7
## 78 16p13.11b LOC339047 7
## 82 11p14.1d BDNF 7
## 55 Xp21.2a-p21.1d DMD 7
## 73 NA LOC653086 7
## 67 Xp21.2a-p21.1d DMD 7
## 47 5q35.2d LOC202134 7
## 3 NA LOC653086 7
## 48 16p13.11b LOC339047 7
## 54 Xp21.2a-p21.1d DMD 7
## 29 5q35.2d LOC202134 7
## 46 11p14.1d BDNF 7
## 39 5q35.2d LOC202134 7
## 81 14q32.2b DLK1 2
## 52 15q25.1b KIAA1199 1
## 59 8q12.1b PENK 1
## 68 15q14a ACTC 1
## 12 22q11.23a MMP11 1
## 58 17q24.2c ABCA8 1
## 21 4q23b ADH1A 1
## 32 14q24.3b FOS 1
## 79 19q13.32a FOSB 1
## 23 17q21.2b KRT19 1
CTNNB1_b <- subset(KG3, KG3$Symbol=='CTNNB1')
CTNNB1_b[,c(1:5,20:21)]
## SEQUENCE Protein_Product Cytoband
## 11 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG XP_950743.1 3p22.1b
## 44 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT NP_001895.1 3p22.1b
## 13 AGTCTCTCGTAGTGTTAAGTTATAGTGAATACTGCTACAGCAATTTCTAA NP_001895.1 3p22.1b
## 60 GCCTCTTGCACTCTGAATTGGGAATGTTTGCACCACAGTGGGGGGCTTGC XP_950747.1 3p22.1b
## 19 CAAACTTTACAGAGGAGAATGCCCTGTTTGTTAACCATGTTTCTTTTGGC XP_950746.1 3p22.1b
## 24 CAGGAATCTAGTCTGGATGACTGCTTCTGGAGCCTGGATGCAGTACCATT XP_950748.1 3p22.1b
## 53 GCAATTTGCCAAGTTTCTTTAGCATTTGGCCCTGGATTACGCTGGACCCC XP_947138.1 3p22.1b
## Symbol gene_count foldChange_Mean foldChange_Median
## 11 CTNNB1 7 1.5689829 1.5744314
## 44 CTNNB1 7 1.5248323 1.4655024
## 13 CTNNB1 7 1.2503004 1.1602376
## 60 CTNNB1 7 1.1098758 1.0850642
## 19 CTNNB1 7 1.0740152 1.0471422
## 24 CTNNB1 7 0.9977081 0.9924948
## 53 CTNNB1 7 0.9930219 0.9818097
The cytoband location of each of these CNVs for CTNNB1 is the same location on chromosome 3 on the p strand/direction along 22.1b. Also, the fold change for the mean and median values for the first listed CNVs changed by 16-57 percent more in UL compared to non-UL samples. This could mean that these four CNVs of the gene CTNNB1 offer some clues as to what mutations or changes impact risk in developing uterine leiomyomas for some females.
Lets order the key genes by fold change median then by CNVs.
KG4 <- KG3[with(KG3, order(foldChange_Median, gene_count, decreasing = TRUE)),]
KG4[,c(1:5,21)]
## SEQUENCE Protein_Product
## 52 GCAACGCTCCTCTGAAATGCTTGTCTTTTTTCTGTTGCCGAAATAGCTGG NP_061159.1
## 68 GGTATTGCTGATCGTATGCAGAAGGAAATCACTGCTCTGGCTCCTAGCAC NP_005150.1
## 12 AGGCCCTGGAGGCTGCAACATACCTCAATCCTGTCCCAGGCCGGATCCTC NP_005931.2
## 81 TGCGAGACCTGGGTGTCCAACCTGCGCTACAACCACATGCTGCGGAAGAA NP_003827.3
## 59 GCCCCAGCAAGCCTCCCTCCATCCTCCAGTGGGAAACTGTTGATGGTGTT NP_006202.1
## 11 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG XP_950743.1
## 44 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT NP_001895.1
## 43 CTCCTCTCAGCTGAACACCCTCCTTTCACTCCCAAATGCAAACAGTCTCT NP_004010.1
## 13 AGTCTCTCGTAGTGTTAAGTTATAGTGAATACTGCTACAGCAATTTCTAA NP_001895.1
## 27 CAGTGTTGGGATCACTCACTTTCCCCCTACAGGACTCAGATCTGGGAGGC NP_003997.1
## 60 GCCTCTTGCACTCTGAATTGGGAATGTTTGCACCACAGTGGGGGGCTTGC XP_950747.1
## 74 TCAACCACATCCTTCAAAAGGACTATGCCTGTTTATAAGCCCAGCTGTTT XP_938957.1
## 10 AGCTCCACCTTGACCCAGCCTCACAACAAAAAGTTTGTGTATGACCAGGC XP_935967.1
## 19 CAAACTTTACAGAGGAGAATGCCCTGTTTGTTAACCATGTTTCTTTTGGC XP_950746.1
## 33 CCCGACGAGCAGGACTTCATCCAGGCCTACGAGGAGGTGCGCGAGAAGTA NP_958780.1
## 80 TGATGTGTCACGCCACTGTACTCCAGCCTGACGGCAGAGCGAGACTCCAT XP_936088.1
## 31 CCATTCAGAAGAATGATAAATGCCACAAGCATTTGGAAACAGGCTTCCCT NP_004001.1
## 64 GGCTCCTCTTTGGGCTCCTACTGGAATTTATCAGCCATCAGTGCATCTCT XP_938917.1
## 69 GGTGTGCTCTGGTATGTAATGACAATATGTGAACAAACCTGTGGAATTAA XP_936056.1
## 28 CATTGTAATGATAAGGAAATGTTGCGATCAAATAAGATTTAGACACACTT XP_935991.1
## 76 TCTATCAACAGAGCTGAATGAGTGCCAGGAAGCTGCGAAATCTGTCTTAC NP_004003.1
## 16 AGTGTTGGGACTACAGGTGTGTGTTACTGCTCCCAGCTGGGAGGCAGGCT XP_936104.1
## 37 CCGCCGGGCTGCTTTTTCCTTGGATGCCCATCAGGACGCCTCAGTTCTCT XP_936926.1
## 86 TTTCAGGCCCATGGCAGAGGGTGGGCTCAGGAGGGCCATCGTGGGTGTCC XP_937694.1
## 49 GAGTGCTGGGAAGGTTAATGTTAAATGGGTTGTGTGTCGGGGAGGGTACA XP_935974.1
## 83 TGTCGTTTCCTCCATTCTTCACCAAAACATCAGCGTACATAGGCACATGG XP_938806.1
## 35 CCCTCGGGCAGCCTGTTTCCCTCCCTGGTGGTTGTGGGTCACGTTGTCAC NP_958784.1
## 57 GCCAAAGGAATGGGCTCCAGACACCCCCTCTTCCAGAGCAAGGATGAAGG XP_937236.1
## 14 AGTGCCCACATCACACAGCATCTAGCACGTAACTGCACCCCGGGAGTCGT XP_937456.1
## 50 GATCACAGGCACAGGGAAGCCACAAGGAGCTCTGTATGAGTTGTGTTTGC XP_935893.1
## 70 GTGAGCCTGTTTCATCATCTGTAAACTTTGAATAATGATACCTACCCCGC XP_936038.1
## 38 CCGGGCCTTCTCGTGGTACCCTGCCTGCTGCCTTTGCCCCCGCACTGACT NP_958782.1
## 65 GGCTCTGTTGGAATCCGCATAGTGTGGAAATGAGTTTGCCCTGGAAAGGG XP_938916.1
## 78 TCTGTATGGACCCTGCCAAGCTCTGCCCCTCTGCCCCTGCATTGGGGCGC XP_937505.1
## 85 TTCATCCAGGCCTGCGCCGGTGTTCACAGTGGTCCTCATCTAAGCCAGCC XP_937214.1
## 7 ACTGCCTGTGTGGCTCCTTGAGTGCGCGGAGGCCAAAGCTGAGATGACTT XP_937640.1
## 5 AATAATAGAGTGTGGGAGTTTTGGGGCCGAAGTCTTTCCCGGAGCAGCTG NP_733929.1
## 18 CAAACCCTTGAAGACATTTCAGGGCCATGCTCACTTGGGAGGGTTTGAGG XP_937228.1
## 62 GCTCGCTGAAGTTGGCTTCCTAGCGGTGTAGGCTGGAATAGACTCTTGGC NP_733928.1
## 66 GGGAAGTACATGGGGCAGATGGAAGAACCTGAGATAATCGCAAGGATGGC XP_938807.1
## 41 CGTTGCTACAAGCTGTTTTTTGAATGTCTCTACACAGTCCAGGCAGGAAG XP_937000.1
## 45 CTGGCCTTCCCTCATCAGCCGTAAATGATGATTTACTGCTGTTACCATCA XP_939002.1
## 51 GCAAATGTAACTCAGGGGTTTGGGGCCAGAGGAAGAGGGAGAAGGTGGCC XP_935936.1
## 40 CGCCCTGAAAGGACCAGGACATGCGGGTGCGGTGGCTGCTCTTTTGGCTC XP_937724.1
## 84 TTACTGGGGATGGTATTTAGGAGCCAGGAAAGCCGGTGCATTCCTAGTGA XP_936932.1
## 17 ATGTACGTGGGGGATTCTTGACTCGGGTTAGTCTCTGGGGATGCAGAGCC NP_733930.1
## 20 CAGACTTCTCGCTTCCTTTCTGCTGGGCCTCTGAGGGGTCATGGGGCCAT XP_936988.1
## 24 CAGGAATCTAGTCTGGATGACTGCTTCTGGAGCCTGGATGCAGTACCATT XP_950748.1
## 36 CCCTTCCTACATTCTTGTTTTCATTTTTTCGGAGGAAGAGGAGTTGCTAG XP_938960.1
## 15 AGTGGGCAGAATGATGAGGGAAGTGGGCACGTGCCCATGTTCTTCTTGGC XP_937222.1
## 26 CAGGGCAGGAACCACGTCTTTACAGTTTGATGTTCCCAGAGCTGACCCAG XP_936919.1
## 2 AAAGCAGTGGTTTTCAGCTGCCAGAGGCCTGAGAGAGTTTGGGCATACTC XP_938927.1
## 75 TCAGACCCCTCAGGCCACTGCTGTTCCTGTCACACATTCCTGCAAAGGAC NP_733931.1
## 56 GCAGGGGAGATTGGGTTTAGGGGCTTTCCTGGTCTGCATTCTGCTACAGC XP_937020.1
## 55 GCAGCCAACTTATTGGCATGATGGAGTGACAGGAAAAACAGCTGGCATGG NP_000100.2
## 8 AGCAGCACATCGTCATTTTACAATTGAGAAACATGGAGACTCCAAATGGA XP_936080.1
## 42 CTCCGTCTGCCCCGTGGGCTCCTGCCACCGTCCCCGATGAAGATCGTGCC NP_958783.1
## 53 GCAATTTGCCAAGTTTCTTTAGCATTTGGCCCTGGATTACGCTGGACCCC XP_947138.1
## 61 GCCTTTGCCTCGCCGAGGGAGGTCTTGCTGGAGCGGCCGTGCTGGCTGGA NP_958785.1
## 72 TAAGTGCAGTGAGCTCTGGCGGAAACCACCCTCTGCCCCGTCTGTTGGAT XP_935983.1
## 71 GTTGTTCTGGACGATCTTCGGGATCCTCTGGGGCACTGTGACACTCGGAG XP_936004.1
## 22 CAGCCCTGGGGACACACTGCCCTGGAACCTTGGGAAAACGCAGCGGAGCC NP_000436.2
## 73 TACCTGGCCTATCTTTCATAGGTTATATAAATTCCTTGGTTCCCAGTTTT XP_936049.1
## 77 TCTCCTGCCCCCTCCGGAAGCTTGGATGCCCCTCCACACCCTCTTGATCT XP_936947.1
## 63 GGCGCAGACATGGACCCCTCGCGAGCCATCCAGAACGAGATCAGCTCCCT NP_958781.1
## 67 GGGTTTTCTCAGGATTGCTATGCAACAGGATCAGTGCTGTAGTGCCCGGT NP_004005.1
## 82 TGGGGAGACGAGATTTTAAGACACTTGAGTCTCCAGGACAGCAAAGGCAC NP_733927.1
## 4 AAGTGGTGCCTGGCTGTCCCTATACTGTGCTGCTGGGTGTTCCAGCCTGT XP_935903.1
## 48 GACCTGTAGCTAAACCTTCCACCAGCGCTTGAGAACTTAATTTGAACCGG XP_937490.1
## 87 TTTCCTGAAATGGAGCTTTGCTCTTGTTGCCCAGGCCGTAGTGCAATGGC XP_937537.1
## 6 ACATGTGCTGTCACTGGAACTTGCTCTTTTCACTCAGCAGCCAGAGGGTC XP_936976.1
## 34 CCCTCCACCTCCTGCTCGGGGGGCTTTAATGAGACACCCACCGCTGCTGT NP_001700.2
## 30 CCAGTCCCTGACTACAGAGGATTTCCCCAAAGTCCCTGGCTGTGAGGTTC XP_936952.1
## 1 AAACGTTACAGTGTTCCGATGAGACACAGTAGGCAGTACTTGGGAGGGTC XP_936980.1
## 3 AAGGACTCAGATGCAGGGTCTTCTCTGCTCCCCGTCACACAGAGGGTGGC XP_936046.1
## 47 GACCAAAGCAGGACAATTGCTTGATCCCAGGAGTTTAAGACCAGCCGGGG XP_932593.1
## 46 CTTGCTGTGGTCTCTTTGTGGCAGAAGTGTTTCATGCATGGCAGCAGGCC NP_001700.2
## 29 CCACGCCGGCAAAGAAATTGGAAGACTCCACCATTACAGGCAGCCACCAG XP_937214.1
## 25 CAGGCGACTGGGTAGCAGATGTGGAAGCTGATGGTTAGGCCCAGGGCATG XP_935881.1
## 54 GCACTCCGACTACATCAGGAGAAGATGTTCGAGACTTTGCCAAGGTACTA NP_004010.1
## 9 AGCCTCTGTTCCCCTAGTAAGTGCCTTCCATGTCGGCCTCTAACCCCAGG NP_958786.1
## 39 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC XP_371783.3
## 21 CAGATGTTTTCCCTTGTGGCAGTCTTCAGCCTCCTCTACCCTACATGATC NP_000658.1
## 58 GCCAACCTCCTCTCACAGCCTCTGTATCTCTGCAGGCCATACTGGTTCCA NP_009099.1
## 32 CCCAGTGACACTTCAGAGAGCTGGTAGTTAGTAGCATGTTGAGCCAGGCC NP_005243.1
## 79 TGACTGTCCCTGCCAATGCTCCAGCTGTCGTCTGACTCTGGGTTCGTTGG NP_006723.1
## 23 CAGCTGGGCGATGTGCGAGCTGATAGTGAGCGGCAGAATCAGGAGTACCA NP_002267.2
## Cytoband Symbol gene_count foldChange_Median
## 52 15q25.1b KIAA1199 1 21.9688396
## 68 15q14a ACTC 1 14.6883331
## 12 22q11.23a MMP11 1 11.8480229
## 81 14q32.2b DLK1 2 4.7325319
## 59 8q12.1b PENK 1 3.3078453
## 11 3p22.1b CTNNB1 7 1.5744314
## 44 3p22.1b CTNNB1 7 1.4655024
## 43 Xp21.2a-p21.1d DMD 7 1.2286298
## 13 3p22.1b CTNNB1 7 1.1602376
## 27 Xp21.2a-p21.1d DMD 7 1.1029701
## 60 3p22.1b CTNNB1 7 1.0850642
## 74 16p12.2a LOC23117 8 1.0667330
## 10 12q24.33d KIAA0692 9 1.0473309
## 19 3p22.1b CTNNB1 7 1.0471422
## 33 8q24.3g PLEC1 8 1.0462839
## 80 NA LOC653086 7 1.0414456
## 31 Xp21.2a-p21.1d DMD 7 1.0377049
## 64 16p12.2a LOC23117 8 1.0286458
## 69 NA LOC653086 7 1.0279570
## 28 12q24.33d KIAA0692 9 1.0247769
## 76 Xp21.2a-p21.1d DMD 7 1.0233573
## 16 NA LOC653086 7 1.0191080
## 37 12p13.31a DDX12 10 1.0179040
## 86 16p13.11b LOC339047 7 1.0175470
## 49 12q24.33d KIAA0692 9 1.0151057
## 83 16p12.2a LOC23117 8 1.0146392
## 35 8q24.3g PLEC1 8 1.0138161
## 57 5q35.2d LOC202134 7 1.0137122
## 14 16p13.11b LOC339047 7 1.0120598
## 50 12q24.33d KIAA0692 9 1.0118272
## 70 NA LOC653086 7 1.0087172
## 38 8q24.3g PLEC1 8 1.0081202
## 65 16p12.2a LOC23117 8 1.0071909
## 78 16p13.11b LOC339047 7 1.0071241
## 85 5q35.2d LOC202134 7 1.0066428
## 7 16p13.11b LOC339047 7 1.0047835
## 5 11p14.1d BDNF 7 1.0046246
## 18 5q35.2d LOC202134 7 1.0037413
## 62 11p14.1d BDNF 7 1.0016629
## 66 16p12.2a LOC23117 8 1.0009319
## 41 12p13.31a DDX12 10 0.9972581
## 45 16p12.2a LOC23117 8 0.9970746
## 51 12q24.33d KIAA0692 9 0.9962676
## 40 16p13.11b LOC339047 7 0.9950549
## 84 12p13.31a DDX12 10 0.9945794
## 17 11p14.1d BDNF 7 0.9937282
## 20 12p13.31a DDX12 10 0.9930535
## 24 3p22.1b CTNNB1 7 0.9924948
## 36 16p12.2a LOC23117 8 0.9922770
## 15 5q35.2d LOC202134 7 0.9917228
## 26 12p13.31a DDX12 10 0.9915254
## 2 16p12.2a LOC23117 8 0.9908288
## 75 11p14.1d BDNF 7 0.9900315
## 56 12p13.31a DDX12 10 0.9899665
## 55 Xp21.2a-p21.1d DMD 7 0.9894561
## 8 NA LOC653086 7 0.9847075
## 42 8q24.3g PLEC1 8 0.9844314
## 53 3p22.1b CTNNB1 7 0.9818097
## 61 8q24.3g PLEC1 8 0.9800936
## 72 12q24.33d KIAA0692 9 0.9784906
## 71 12q24.33d KIAA0692 9 0.9774402
## 22 8q24.3g PLEC1 8 0.9773820
## 73 NA LOC653086 7 0.9770270
## 77 12p13.31a DDX12 10 0.9757264
## 63 8q24.3g PLEC1 8 0.9754829
## 67 Xp21.2a-p21.1d DMD 7 0.9733874
## 82 11p14.1d BDNF 7 0.9728164
## 4 12q24.33d KIAA0692 9 0.9714923
## 48 16p13.11b LOC339047 7 0.9696381
## 87 16p13.11b LOC339047 7 0.9692033
## 6 12p13.31a DDX12 10 0.9685507
## 34 11p14.1d BDNF 7 0.9681280
## 30 12p13.31a DDX12 10 0.9657570
## 1 12p13.31a DDX12 10 0.9617268
## 3 NA LOC653086 7 0.9575042
## 47 5q35.2d LOC202134 7 0.9568036
## 46 11p14.1d BDNF 7 0.9289947
## 29 5q35.2d LOC202134 7 0.9285395
## 25 12q24.33d KIAA0692 9 0.8823950
## 54 Xp21.2a-p21.1d DMD 7 0.7956211
## 9 8q24.3g PLEC1 8 0.7882088
## 39 5q35.2d LOC202134 7 0.3998123
## 21 4q23b ADH1A 1 0.1841845
## 58 17q24.2c ABCA8 1 0.1539030
## 32 14q24.3b FOS 1 0.1197627
## 79 19q13.32a FOSB 1 0.1190193
## 23 17q21.2b KRT19 1 0.1159735
The above table gives the protein products, the ontology function, the fold change of the median values of UL/nonUL, gene symbol, sequence of CNV, and cytoband location. The protein products can be found at genecards.org by entering the ID for the protein product into the search bar. A quick scan of a few of the protein products in genecards.org gave the following descriptions. The first listed protein NP_061159.1 says it is a colon cancer secreted protein. Many of the above CNVs are listed as proteins involved in the extracellular matrix like DMD and CTNNB1. There are also various neurological and synapses diseases associated with those proteins.
The site genecards.org has very useful properties in analyzing gene expression data from this research. If you are a member, you can download the network genes involved in diseases you query and compare to how the genes in certain tissues compare to those genes. Three out of the seven CNVs for CTNNB1 are in the top fold change median values in the ratio of UL/nonUL samples.
write.csv(KG4, 'keyGenes_topMedFCs.csv', row.names=FALSE)
Lets make a machine learning data set to test various algorithms on predicting if the sample is a UL or not. We will use the samples in this set, plus add in some microarray samples that have been studied by me elsewhere using this set of genes and sequences if available in any of the microarray studies.
Lets isolate those genes that are in our key genes of top picks for UL targets and combine the UL and nonUL sample information to those genes and sequences without the stats.
keyTargets <- KG4[,c(1,4)]
ULs <- UL[,c(2,10:29)]
colnames(ULs)[2:21] <- paste('UL', colnames(ULs)[2:21], sep='_')
nonULs <- nonUL[,c(2,10:27)]
colnames(nonULs)[2:19] <- paste('nonUL', colnames(nonULs)[2:19], sep='_')
keyULs <- merge(keyTargets, ULs, by.x='SEQUENCE', by.y='SEQUENCE')
keys <- merge(keyULs, nonULs, by.x='SEQUENCE', by.y='SEQUENCE')
write.csv(keys,'keyGeneTargetsCNVs.csv',row.names=FALSE)
Lets create the matrix for machine learning.
keysNames <- paste(keys$Symbol,keys$SEQUENCE, sep='_')
keys0 <- keys[,-(1:2)]
keys_ml <- as.data.frame(t(keys0))
colnames(keys_ml) <- keysNames
keys_ml$Type <- as.factor(c(rep('UL',length(grep('^UL_',row.names(keys_ml)))),
rep('nonUL',length(grep('^nonUL_',row.names(keys_ml))))))
keys_ml0 <- keys_ml[,c(88,1:87)]
write.csv(keys_ml0, 'ml_ready_UL_classes.csv',row.names=TRUE)
pretty_headers <- str_to_title(colnames(keys_ml0))
keys_ml0a <- datatable(data=keys_ml0, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=I('colvis'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
keys_ml0a
Now, lets pull in the other data sets that are from the microarray samples and see if we can get the genes and sequences that correspond to our key genes above in identifying a sample as UL or not with predictive analytics.
There is one study of the other studies that has Sequence, Gene symbol and a few UL and nonUL microarray samples to compare to this above beadchip UL and nonUL set. The GEO series ID is GSE68295 with the GEO platform of GPL6480. The files are 27 MB each in file size.
setwd('./microArray UL')
non <- read.csv('nonUL_GSE68295_GPL6480_table.csv', sep=',',
header=T, na.strings=c('',' '))
uls <- read.csv('UL_GSE68295_GPL6480_table.csv', sep=',',
header=T, na.strings=c('',' '))
setwd('../')
Keep only the needed columns.
uls_array <- uls[,c(8,18:21)]
colnames(uls_array)[3:5] <- paste('UL', colnames(uls_array)[3:5], sep='_')
non_array <- non[,c(8,18:21)]
colnames(non_array)[3:5] <- paste('nonUL', colnames(non_array)[3:5], sep='_')
The sequences don’t align or match any in the microarrays with the beadchip UL samples.
uls_array0 <- merge(keyTargets, uls_array, by.x='SEQUENCE', by.y='SEQUENCE')
ulsArray0 <- uls_array0[,-2]
Match by gene symbol between the microarray and beadchip UL samples.
uls_array1 <- merge(keyTargets, uls_array, by.x='Symbol', by.y='GENE_SYMBOL')
ulsArray <- uls_array1[,-2]
The sequences don’t align between the arrays and beadchip samples for nonULs.
non_array0 <- merge(keyTargets, non_array, by.x='SEQUENCE', by.y='SEQUENCE')
nonArray0 <- non_array0[,-(1:2)]
Match by Gene symbol between the microarray and beadchip samples of nonULs.
non_array1 <- merge(keyTargets, non_array, by.x='Symbol', by.y='GENE_SYMBOL')
nonArray <- non_array1[,-(1:2)]
Combine the UL and nonUL samples of the microarrays into one dataset.
microarrays <- merge(ulsArray, nonArray, by.x='SEQUENCE.y', by.y='SEQUENCE.y')
Marrays <- microarrays[!duplicated(microarrays$SEQUENCE),]
Marrays
## SEQUENCE.y Symbol
## 1 AAATGTAGATGTGGCAAATGACTTGGCCCTGAAACTTCTCCGGGATTATTCTGCAGATGA DMD
## 50 AACGCTCCTCTGAAATGCTTGTCTTTTTTCTGTTGCCGAAATAGCTGGTCCTTTTTCGGG KIAA1199
## 51 AGAAAATATAGTCACAGGAAACTACTCACGTAAGTAGTAATGATTCTCAAGATCAAAGGG DMD
## 100 AGAGCTCTACCAGTCTTTAGCTGACCTGAATAATGTCAGATTCTCAGCTTATAGGACTGC DMD
## 149 AGAGGGTTCCTGTAGACCTAGGGAGGACCTTATCTGTGCGTGAAACACACCAGGCTGTGG FOS
## 150 AGGTTCAAGAGGCTTGACATCATTGGCTGACACTTTCGAACACGTGATAGAAGAGCTGTT BDNF
## 199 AGTTTCCTATGGGAACAATTGAAGTAAACTTTTTGTTCTGGTCCTTTTTGGTCGAGGAGT CTNNB1
## 248 ATAACCCATGTTTTACCTTTTGAAAAAATAAATGAAGGATTTGACCTGCTTCACTCTGGG ADH1A
## 249 CCCTATAGGCCAGAATTTTCCAATTAGAGGAATTCAGTTATATGATGGCCCCATCAACAT KIAA1199
## 250 CTTCTGCTGTCCTTTGGAGGGTGTCTTCTGGGTAGAGGGATGGGAAGGAAGGGACCCTTA KRT19
## 251 GAACATCCTGATATTTTATCCATTCTATAGAGATCATTGATGGTACACAGACATACAGTG CTNNB1
## 300 GCCTTACTATTGTATTATAGTACTGCTTTACTGTGTATCTCAATAAAGCACGCAGTTATG DMD
## 349 GGCCAAAAAGTTCACAGTCAAATGGGGAGGGGTATTCTTCATGCAGGAGACCCCAGGCCC MMP11
## 350 GGTATCGTCTTCCTCAACAAGTGCGAGACCTGGGTGTCCAACCTGCGCTACAACCACATG DLK1
## 351 TAATATGTAAGGAATGCTTGGAATATCTGCTGTATGTCAACTTTATGCAGCTTCCTTTTG BDNF
## 400 TACTGCTTATCTACCATCATTTCTGCATCCATCGAAACATCTGGGTTTACCTGAAAAGAA PENK
## 401 TAGCTGTCACCTCCCGAGCAGAAGAGTGGTTAAATCTTTTGTTGGAATACCAGAAACACA DMD
## 450 TATGCTGCTACTGACTGGACGTACCTGGAAGGGAGCTATTCTTGGTGGCTTTAAAAGTAA ADH1A
## 451 TGATCAAAAACTATTTGGGATATGTATGGGTAGGGTAAATCAGTAAGAGGTGTTATTTGG CTNNB1
## 500 TGATGGTTTATAAGTTGCCAGTGGAAGATGTGCAACCTTTAGCCCAAGCTTTCTTCAAAT ABCA8
## 501 TGTTTCACAGTACAGGATCTGTACATAAAAGTTTCTTTCCTAAACCATTCACCAAGAGCC KIAA1199
## 502 TTTGCTAGCCAAAAGGTATGGGGGCTTCATGAAAAGGTATGGAGGCTTCATGAAGAAAAT PENK
## 503 TTTTTACAATCTGTATCTTTGACAATTCTGGGTGCGAGTGTGAGAGTGTGAGCAGGGCTT FOSB
## UL_GSM1667147 UL_GSM1667148 UL_GSM1667149 nonUL_GSM1667144 nonUL_GSM1667145
## 1 6.348170 8.452940 3.133639 6.045532 5.489635
## 50 14.415790 12.807570 12.130890 8.088513 7.994159
## 51 2.384431 2.850962 3.133639 5.123847 3.922084
## 100 5.278357 5.919466 3.133639 5.250090 5.551075
## 149 10.455070 9.526076 11.971260 10.784700 14.087970
## 150 3.361238 2.704974 3.133639 2.576334 8.335834
## 199 10.436110 11.297670 11.962520 9.379255 10.684990
## 248 5.796909 5.426538 6.308862 10.089030 9.312589
## 249 5.142129 3.612948 5.464374 2.505503 4.809947
## 250 7.941722 3.870707 5.465157 13.917330 11.867160
## 251 8.540211 9.106644 9.567971 7.524949 9.472293
## 300 12.787010 11.485870 11.094990 12.600150 12.116980
## 349 13.149410 11.217970 8.370618 9.060750 7.865174
## 350 6.477206 9.691022 3.133639 4.366345 2.498480
## 351 2.995668 2.910897 3.133639 2.598976 5.933189
## 400 2.449153 2.925143 3.133639 3.915916 2.909686
## 401 9.579811 10.857530 10.098500 9.145992 10.959560
## 450 6.416634 7.539459 8.776969 10.703870 9.403824
## 451 13.944110 14.830000 13.426710 12.492420 14.130850
## 500 2.459625 4.291164 3.133639 7.546093 7.542825
## 501 11.651600 9.577540 3.133639 5.381942 3.061552
## 502 3.638878 2.312648 4.921288 3.314055 2.498480
## 503 5.952776 3.329733 8.833434 7.481725 12.119600
## nonUL_GSM1667146
## 1 8.694016
## 50 7.026303
## 51 2.876342
## 100 8.478863
## 149 8.440556
## 150 2.270915
## 199 9.869629
## 248 10.472460
## 249 2.995668
## 250 14.578700
## 251 5.727058
## 300 12.129400
## 349 7.606486
## 350 2.595905
## 351 2.255719
## 400 2.524270
## 401 10.098020
## 450 11.138870
## 451 13.460310
## 500 5.062359
## 501 5.435981
## 502 2.386559
## 503 4.620255
pretty_headers <- str_to_title(colnames(Marrays))
Marrays1 <- datatable(data=Marrays, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=I('colvis'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
Marrays1
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
Since these two expression types can’t be compared by sequence, they should be compared by gene. Lets combine them into a study by gene expression values.
keys1 <- keys[,-1]
keys2 <- keys1 %>% group_by(Symbol) %>%
summarise_at(vars(as.vector(colnames(keys1)[2:39])), mean)
Marrays1 <- Marrays[,-1]
Marrays2 <- Marrays1 %>% group_by(Symbol) %>%
summarise_at(vars(as.vector(colnames(Marrays1)[2:7])), mean)
beadArrays <- merge(keys2, Marrays2, by.x='Symbol', by.y='Symbol')
pretty_headers <- str_to_title(colnames(beadArrays))
beadArrays1 <- datatable(data=beadArrays, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
beadArrays1
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
There are only 12 genes in common among these combined samples of microarray and beadchip UL and nonUL samples.
names <- (beadArrays$Symbol)
beadArrays1 <- beadArrays[,-1]
beadArrays_ML <- as.data.frame(t(beadArrays1))
colnames(beadArrays_ML) <- names
beadArrays_ML$Type <- as.factor(c(rep('UL_bead',20), rep('nonUL_bead',18),
rep('UL_array',3), rep('nonUL_array',3)))
beadArrays_ML2 <- beadArrays_ML[,c(13,1:12)]
pretty_headers <- str_to_title(colnames(beadArrays_ML2))
beadArrays_ML3 <- datatable(data=beadArrays_ML2, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
beadArrays_ML3
UL_seq_ML <- keys_ml0
UL_gene_ML <- beadArrays_ML2
There are two datasets to use for machine learning. The first is our beadchip samples of 88 sequences and 38 samples of 20 UL and 18 nonUL in the UL_seq_ML data set. The second dataset for machine learning is the mixed microarray and beadchip samples of UL and nonUL by gene in the UL_gene_ML data set, because there were no common sequence or copy number variants of the gene sequences between the beadchip and microarray sets of UL and nonUL samples.
The libraries were installed earlier.
set.seed(12356789)
Create a partition of the data with a 70/30 split into training/testing sets of the first data set with two classes of UL or nonUL and 88 features of genes with their CNVs.
inTrain <- createDataPartition(y=UL_seq_ML$Type, p=0.7, list=FALSE)
trainingSet <- UL_seq_ML[inTrain,]
testingSet <- UL_seq_ML[-inTrain,]
RandomForest, cross-validation (cv) = 5
rfMod <- train(Type~., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
plot(rfMod)
Run predictions on the testing set
predRF <- predict(rfMod, testingSet)
predDF <- data.frame(predRF, type=testingSet$Type)
predDF
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 UL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
sum <- sum(predRF==testingSet$Type)
length <- length(testingSet$Type)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 1
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
Results
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 UL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
## results 1 100
pretty_headers <- str_to_title(colnames(Results))
Results1 <- datatable(data=Results, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
Results1
The above shows that using the genes and the CNV of each gene totalling 88 features, makes a perfect data set of results with only 27 observations to train and 11 to test on 2 classes of UL or non-UL. Using only the random forest algorithm it classified each sample 100% accurately trained on 70% of the samples.
What if we used random forest to only predict by gene in the first beadchip type data set? We can use the transpose of the keys2 data set made earlier when combining to make the 2nd ML dataset.
names <- keys2$Symbol
keys_2 <- keys2[,-1]
keys_t <- as.data.frame(t(keys_2))
colnames(keys_t) <- names
keys_t$Type <- keys_ml$Type
keys_ML <- keys_t[,c(21,1:20)]
Now we will use our new data set based on the beadchip genes and not the CNVs of those genes, in the keys_ML data set to predict with RandomForest.
inTrain <- createDataPartition(y=keys_ML$Type, p=0.7, list=FALSE)
trainingSet <- keys_ML[inTrain,]
testingSet <- keys_ML[-inTrain,]
RandomForest, cross-validation (cv) = 5
rfMod <- train(Type~., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
plot(rfMod)
Run predictions on the testing set
predRF <- predict(rfMod, testingSet)
predDF <- data.frame(predRF, type=testingSet$Type)
predDF
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 nonUL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
sum <- sum(predRF==testingSet$Type)
length <- length(testingSet$Type)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 0.9090909
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
Results
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 nonUL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
## results 0.91 100
pretty_headers <- str_to_title(colnames(Results))
Results1 <- datatable(data=Results, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
Results1
From the above table, the random forest algorithm only misclassified one sample as nonUL, when it was really a UL sample. This data set used the mean values of the genes and not the copy number variants of each gene as the previous data set we just used was built on. So, a true positive was misclassified as a negative. This means its a false negative or Type II error. If it had misclassified a nonUL as UL then it would be a Type I error for false positive. There is a good article to review material you never really use, until time to write a theoretical research paper. The values are good to know for precision and recall, and sensitivity and specificity. Depending on what your overarching goal is in sampling outcomes, you want to improve one more than the other.
How about with the KNN algorithm.
knnMod <- train(Type ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, trControl=trainControl(method='cv'), data=trainingSet)
plot(knnMod)
rpartMod <- train(Type ~ ., method='rpart', tuneLength=7, data=trainingSet)
glmMod <- train(Type ~ .,
method='glm', data=trainingSet)
predKNN <- predict(knnMod, testingSet)
predRPART <- predict(rpartMod, testingSet)
predGLM <- predict(glmMod, testingSet)
length=length(testingSet$Type)
sumKNN <- sum(predKNN==testingSet$Type)
sumRPart <- sum(predRPART==testingSet$Type)
sumGLM <- sum(predGLM==testingSet$Type)
accuracy_KNN <- sumKNN/length
accuracy_RPART <- sumRPart/length
accuracy_GLM <- sumGLM/length
predDF2 <- data.frame(predRF,predKNN,predRPART,predGLM,
TYPE=testingSet$Type)
colnames(predDF2) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
results <- c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
Results <- rbind(predDF2, results)
Results
## RandomForest KNN Rpart GLM TrueValue
## 1 UL UL nonUL UL UL
## 2 UL UL UL UL UL
## 3 UL UL nonUL UL UL
## 4 UL UL UL UL UL
## 5 UL UL UL UL UL
## 6 nonUL nonUL nonUL nonUL UL
## 7 nonUL nonUL nonUL nonUL nonUL
## 8 nonUL nonUL nonUL nonUL nonUL
## 9 nonUL nonUL nonUL nonUL nonUL
## 10 nonUL nonUL nonUL nonUL nonUL
## 11 nonUL nonUL UL nonUL nonUL
## results 0.91 0.91 0.64 0.91 100
pretty_headers <- str_to_title(colnames(Results))
Results1 <- datatable(data=Results, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
Results1
As far as the algorithms used above go, the prediction accuracy is great for Random Forest, GLM, and K-Nearest Neighbor with 91% accuracy. Make sure to remove any fields before transposing such as the symbol field when keeping the samples as numeric. Because the numeric sample values will be factors, and throw off the algorithms. Or you could manually change each of the class types of the above genes above. Rpart, or recursive partitioning trees did the worst with 64% accuracy.
This next data set uses the three most expressed and two lease expressed genes by fold change in UL to nonUL sample means.
keys_ML_b <- keys_ML[,c(1,3,10,11,13,19)]
inTrain <- createDataPartition(y=keys_ML_b$Type, p=0.7, list=FALSE)
trainingSet <- keys_ML_b[inTrain,]
testingSet <- keys_ML_b[-inTrain,]
RandomForest, cross-validation (cv) = 5
rfMod <- train(Type~., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
plot(rfMod)
Run predictions on the testing set
predRF <- predict(rfMod, testingSet)
predDF <- data.frame(predRF, type=testingSet$Type)
predDF
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 UL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
sum <- sum(predRF==testingSet$Type)
length <- length(testingSet$Type)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 1
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
Results
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 UL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
## results 1 100
How about with the KNN algorithm.
knnMod <- train(Type ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, trControl=trainControl(method='cv'), data=trainingSet)
plot(knnMod)
The accuracy seems to be better between 5 and 17 neighbors for classification from what the above plot is displaying.
rpartMod <- train(Type ~ ., method='rpart', tuneLength=7, data=trainingSet)
glmMod <- train(Type ~ .,
method='glm', data=trainingSet)
predKNN <- predict(knnMod, testingSet)
predRPART <- predict(rpartMod, testingSet)
predGLM <- predict(glmMod, testingSet)
length=length(testingSet$Type)
sumKNN <- sum(predKNN==testingSet$Type)
sumRPart <- sum(predRPART==testingSet$Type)
sumGLM <- sum(predGLM==testingSet$Type)
accuracy_KNN <- sumKNN/length
accuracy_RPART <- sumRPart/length
accuracy_GLM <- sumGLM/length
predDF2 <- data.frame(predRF,predKNN,predRPART,predGLM,
TYPE=testingSet$Type)
colnames(predDF2) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
results <- c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
Results <- rbind(predDF2, results)
pretty_headers <- str_to_title(colnames(Results))
Results1 <- datatable(data=Results, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
Results1
The above data shows that using the three highest fold change and lowest fold change genes to predict the sample as a UL or not scored 100% accuracy for GLM and Random Forest. And the KNN and Rpart scored 91% accuracy. Make sure to set your seed to the same value so you don’t get different results when re-running the algorithms above, because the first seed was set by me and I got different results where KNN scored 100% and GLM and Random Forest scored 82% accuracy. ***
Lets go back to the data that placed the fold change values by mean of the sequence values.
fib <- fibroid[,c(1,10:29)]
fib_mean <- fib %>% group_by(Symbol) %>%
summarise_at(vars(as.vector(colnames(fib)[2:21])), mean, na.rm=TRUE)
fib_mean$UL_gene_mean <- rowMeans(fib_mean[2:21])
colnames(fib_mean)[2:21] <- paste('UL', colnames(fib_mean)[2:21], sep='_')
nfib <- nonFibroid[,c(1,10:27)]
nfib_mean <- nfib %>% group_by(Symbol) %>%
summarise_at(vars(as.vector(colnames(nfib)[2:19])), mean, na.rm=TRUE)
nfib_mean$nonUL_gene_mean <- rowMeans(nfib_mean[2:19])
colnames(nfib_mean)[2:19] <- paste('nonUL', colnames(nfib_mean)[2:19], sep='_')
fib_nonfib <- merge(fib_mean,nfib_mean, by.x='Symbol', by.y='Symbol')
fib_nonfib$FC_UL2non <- fib_nonfib$UL_gene_mean/fib_nonfib$nonUL_gene_mean
Fib_Non <- fib_nonfib[,c(1,22,41,42,2:21,23:40)]
FC_genes <- Fib_Non[order(Fib_Non$FC_UL2non, decreasing=TRUE)[1:5],]
FCs <- FC_genes[,c(1,5:42)]
FCs_t <- as.data.frame(t(FCs))
colnames(FCs_t) <- FCs$Symbol
FCs_ML <- FCs_t[-1,]
FCs_ML$Type <- as.factor(c(rep('UL',20),rep('nonUL',18)))
FCs_ML1 <- FCs_ML[,c(6,1:5)]
head(FCs_ML1)
## Type KIAA1199 PENK ACTC MMP11 DLK1
## UL_GSM2496185 UL 861.2 442.0 1452.7 13478.5 151.6
## UL_GSM2496186 UL 3915.9192 219.7620 4341.5140 8300.2949 221.1859
## UL_GSM2496187 UL 2292.2627 300.5855 1460.5967 6096.7978 750.0491
## UL_GSM2496188 UL 4272.8288 699.5680 8327.8549 8858.7194 297.5915
## UL_GSM2496189 UL 2212.2382 99.1353 5434.4145 3590.9747 105.1468
## UL_GSM2496190 UL 6548.8160 255.7960 6894.4949 4141.8642 246.1572
Now lets try these algorithms again, to see if they provide better results on the top five genes with the highest fold change values in each gene.
FCs_ML1$KIAA1199 <- as.numeric(FCs_ML1$KIAA1199)
FCs_ML1$PENK <- as.numeric(FCs_ML1$PENK)
FCs_ML1$ACTC <- as.numeric(FCs_ML1$ACTC)
FCs_ML1$MMP11 <- as.numeric(FCs_ML1$MMP11)
FCs_ML1$DLK1 <- as.numeric(FCs_ML1$DLK1)
set.seed(123789)
inTrain <- createDataPartition(y=FCs_ML1$Type, p=0.7, list=FALSE)
trainingSet <- FCs_ML1[inTrain,]
testingSet <- FCs_ML1[-inTrain,]
RandomForest, cross-validation (cv) = 5
rfMod <- train(Type~., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
Run predictions on the testing set
predRF <- predict(rfMod, testingSet)
predDF <- data.frame(predRF, type=testingSet$Type)
predDF
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 UL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
sum <- sum(predRF==testingSet$Type)
length <- length(testingSet$Type)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 1
The above table shows that using the set of five genes that had the highest fold change values scored 100% accuracy using the random forest algorithm to predict a sample as being a UL or not.
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
Results
## predRF type
## 1 UL UL
## 2 UL UL
## 3 UL UL
## 4 UL UL
## 5 UL UL
## 6 UL UL
## 7 nonUL nonUL
## 8 nonUL nonUL
## 9 nonUL nonUL
## 10 nonUL nonUL
## 11 nonUL nonUL
## results 1 100
How about with the KNN algorithm.
knnMod <- train(Type ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, trControl=trainControl(method='cv'), data=trainingSet)
rpartMod <- train(Type ~ ., method='rpart', tuneLength=7, data=trainingSet)
glmMod <- train(Type ~ .,
method='glm', data=trainingSet)
predKNN <- predict(knnMod, testingSet)
predRPART <- predict(rpartMod, testingSet)
predGLM <- predict(glmMod, testingSet)
length=length(testingSet$Type)
sumKNN <- sum(predKNN==testingSet$Type)
sumRPart <- sum(predRPART==testingSet$Type)
sumGLM <- sum(predGLM==testingSet$Type)
accuracy_KNN <- sumKNN/length
accuracy_RPART <- sumRPart/length
accuracy_GLM <- sumGLM/length
predDF2 <- data.frame(predRF,predKNN,predRPART,predGLM,
TYPE=testingSet$Type)
colnames(predDF2) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
results <- c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
Results <- rbind(predDF2, results)
pretty_headers <- str_to_title(colnames(Results))
Results1 <- datatable(data=Results, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
Results1
The random forest classifier scored 100% accuracy, while KNN, Rpart, and GLM algorithms all scored 82% accuracy in predicting a sample as uL or not.
Lets use the data set with four classes to predict in the mixed beadchip and microarray samples as either UL or nonUL in their respective medium.
set.seed(123789)
inTrain <- createDataPartition(y=UL_gene_ML$Type, p=0.7, list=FALSE)
trainingSet <- UL_gene_ML[inTrain,]
testingSet <- UL_gene_ML[-inTrain,]
RandomForest, cross-validation (cv) = 5
rfMod <- train(Type~., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
Run predictions on the testing set
predRF <- predict(rfMod, testingSet)
predDF <- data.frame(predRF, type=testingSet$Type)
predDF
## predRF type
## 1 UL_bead UL_bead
## 2 UL_bead UL_bead
## 3 UL_bead UL_bead
## 4 UL_bead UL_bead
## 5 UL_bead UL_bead
## 6 UL_bead UL_bead
## 7 nonUL_bead nonUL_bead
## 8 nonUL_bead nonUL_bead
## 9 nonUL_bead nonUL_bead
## 10 nonUL_bead nonUL_bead
## 11 nonUL_bead nonUL_bead
sum <- sum(predRF==testingSet$Type)
length <- length(testingSet$Type)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 1
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
Results
## predRF type
## 1 UL_bead UL_bead
## 2 UL_bead UL_bead
## 3 UL_bead UL_bead
## 4 UL_bead UL_bead
## 5 UL_bead UL_bead
## 6 UL_bead UL_bead
## 7 nonUL_bead nonUL_bead
## 8 nonUL_bead nonUL_bead
## 9 nonUL_bead nonUL_bead
## 10 nonUL_bead nonUL_bead
## 11 nonUL_bead nonUL_bead
## results 1 100
The above table shows that the random forest algorithm scored 100% accuracy on the four class predictions of our testing set.
How about with the KNN algorithm.
knnMod <- train(Type ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, trControl=trainControl(method='cv'), data=trainingSet)
rpartMod <- train(Type ~ ., method='rpart', tuneLength=7, data=trainingSet)
#glmMod <- train(Type ~ .,
# method='glm', data=trainingSet)
glmMod <- glm(Type ~ .,family = binomial(), data=trainingSet, method='glm.fit',)
GLMpred <- predict.glm(glmMod, type = "response")
The above GLM model is not liking this type of data, predicting the class by the numeric probabilities of the features provided. It seemed to do fine with the other data sets that had two classes and also used numeric data to predict each class factor.
predKNN <- predict(knnMod, testingSet)
predRPART <- predict(rpartMod, testingSet)
predGLM <- GLMpred
length=length(testingSet$Type)
sumKNN <- sum(predKNN==testingSet$Type)
sumRPart <- sum(predRPART==testingSet$Type)
sumGLM <- sum(predGLM==testingSet$Type)
accuracy_KNN <- sumKNN/length
accuracy_RPART <- sumRPart/length
accuracy_GLM <- sumGLM/length
predDF2 <- data.frame(predRF,predKNN,predRPART,predGLM,
TYPE=testingSet$Type)
colnames(predDF2) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
results <- c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
Results <- rbind(predDF2, results)
pretty_headers <- str_to_title(colnames(Results))
Results1 <- datatable(data=Results, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
Results1
The GLM or generalized linear model is used to regress actual predicted numeric values using linear regression and naive bayes type linear models. This could be why its values were numeric the first run and now unable to complete training so it was excluded from results as there were no results for the GLM model. The Random Forest and KNN scored 100% accuracy, and the Rpart scored 91% accuracy.
Lets add in the gene that had the lowest fold change value and use it as an outcome variable to predict the value based on these high fold change values. We will remove the Type field.
FC_genes1 <- Fib_Non[order(Fib_Non$FC_UL2non)[c(1:2,25032:25036)],]
row.names(FC_genes1) <- FC_genes1$Symbol
FC_genes2 <- FC_genes1[-2,]
FC_genes2_ML <- as.data.frame(t(FC_genes2))
FCs_ML_4 <- FC_genes2_ML[-c(1:4),]
write.csv(FCs_ML_4,'ML_highFCs_lowFC.csv', row.names=TRUE)
set.seed(123789)
ML4 <- FCs_ML_4
ML4$KRT19 <- as.numeric(ML4$KRT19)
ML4$DLK1 <- as.numeric(ML4$DLK1)
ML4$MMP11 <- as.numeric(ML4$MMP11)
ML4$ACTC <- as.numeric(ML4$ACTC)
ML4$PENK <- as.numeric(ML4$PENK)
ML4$KIAA1199 <- as.numeric(ML4$KIAA1199)
inTrain <- createDataPartition(y=ML4$KRT19, p=0.7, list=FALSE)
trainingSet <- ML4[inTrain,]
testingSet <- ML4[-inTrain,]
RandomForest, cross-validation (cv) = 5
rfMod <- train(KRT19~., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
Run predictions on the testing set, altered so that the predicted value is within one standard deviations of the mean.
predRF <- round(predict(rfMod, testingSet),0)
predDF <- data.frame(predRF, KRT19_Value=testingSet$KRT19)
predDF
## predRF KRT19_Value
## UL_GSM2496185 16 3
## UL_GSM2496191 21 2
## UL_GSM2496204 16 11
## UL_GSM2496208 15 9
## UL_GSM2496219 24 18
## nonUL_GSM2496199 33 34
## nonUL_GSM2496210 33 35
## nonUL_GSM2496212 22 28
## nonUL_GSM2496215 32 39
## nonUL_GSM2496222 27 23
mu <- mean(testingSet$KRT19)
sde <- sd(testingSet$KRT19)
sum <- sum(predRF < (mu+sde))
length <- length(testingSet$KRT19)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 1
All predicted values are within one standard deviation of the mean.
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
Results
## predRF KRT19_Value
## UL_GSM2496185 16 3
## UL_GSM2496191 21 2
## UL_GSM2496204 16 11
## UL_GSM2496208 15 9
## UL_GSM2496219 24 18
## nonUL_GSM2496199 33 34
## nonUL_GSM2496210 33 35
## nonUL_GSM2496212 22 28
## nonUL_GSM2496215 32 39
## nonUL_GSM2496222 27 23
## results 1 100
The above didn’t get any of the predicted results exactly equal to the true value of the lowest expressed gene in fold change of UL/nonUL, but it did get every predicted value within one standard deviation of the sample mean.
How about with the KNN algorithm.
knnMod <- train(KRT19 ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, trControl=trainControl(method='cv'), data=trainingSet)
The accuracy seems to be better between 8 and 9 neighbors for classification from what the above plot is displaying.
rpartMod <- train(KRT19 ~ ., method='rpart', tuneLength=9, data=trainingSet)
glmMod <- train(KRT19 ~ .,
method='glm', data=trainingSet)
predKNN <- predict(knnMod, testingSet)
predRPART <- predict(rpartMod, testingSet)
predGLM <- predict(glmMod, testingSet)
length=length(testingSet$KRT19)
sumKNN <- sum(predKNN==testingSet$KRT19)
sumRPart <- sum(predRPART==testingSet$KRT19)
sumGLM <- sum(predGLM==testingSet$KRT19)
accuracy_KNN <- sumKNN/length
accuracy_RPART <- sumRPart/length
accuracy_GLM <- sumGLM/length
predDF2 <- data.frame(predRF,predKNN,predRPART,predGLM,
KRT19=testingSet$KRT19)
colnames(predDF2) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
results <- c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- c('RandomForest','KNN','Rpart','GLM','TrueValue')
Results <- rbind(predDF2, results)
Results
## RandomForest KNN Rpart GLM TrueValue
## UL_GSM2496185 16 21.8 15.3125 17.7925113372005 3
## UL_GSM2496191 21 30.8 28.9166666666667 41.9876957302681 2
## UL_GSM2496204 16 16 15.3125 11.2416036435578 11
## UL_GSM2496208 15 18.8 15.3125 10.8507470045519 9
## UL_GSM2496219 24 27.8 28.9166666666667 34.1668248778496 18
## nonUL_GSM2496199 33 35.6 28.9166666666667 21.3231243095602 34
## nonUL_GSM2496210 33 36.2 28.9166666666667 33.9084163681288 35
## nonUL_GSM2496212 22 21.8 15.3125 21.5506884597176 28
## nonUL_GSM2496215 32 27 28.9166666666667 24.3442067799312 39
## nonUL_GSM2496222 27 30.2 28.9166666666667 23.9479962823136 23
## results 1 0 0 0 100
When it comes to regression on numeric values and using the five highest expressed genes to predict the value of the lowest expressed gene in each sample of UL or nonUL, the results were far from useful. Every algorithm scored 0% but the random forest which was modified to gain accuracty if the prediction is within 1 standard deviation of the sample mean in which case it scored 100%. But would have also scored 0% as the other algorithms have.