This Rmarkdown file will extract the genes having high Copy Number Variants (CNVs) and fold change expression values in their available data. The information is from the previous GEO data sets used earlier, that also had sequence information as well as gene information. These studies having gene sequence information are: Alzheimer’s Disease (AD), Colon Cancer, Stomach cancer, pancreatic cancer, a beadchip study done on uterine leiomyomas (UL), and a microarry study done on uterine leiomyomas (UL). Once these genes having the most CNVs and fold changes are found, a link network will be built to demonstrate how these genes are linked to any of these diseases if applicable. From the top genes of each disease just listed.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(DT)
library(visNetwork)
## Warning: package 'visNetwork' was built under R version 3.6.3
library(igraph)
## Warning: package 'igraph' was built under R version 3.6.3
##
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
The tables being used are from the following disease study files. - UL_study1_thresholds.csv - UL_microarray_thresholds.csv - GastricStomachCancer_Thresholds.csv - ColonCancerThresholds.csv - Pancreatic_Cancer_Thresholds.csv - useableAlzheimerStats.csv
These files will be uploaded to the github folder of files for this Rmarkdown script that creates a link analysis with visNetwork to display the genes associated with certain diseases and if associated with other diseases.
To skip straight through to the section that builds the visual network analysis of genes related to diseases, search for ‘%^&’ or !!!THE BEGINNING of our VISNETWORK!!! within RStudio. About line 870 of this Rmarkdown file.
Uterine leiomyomas Disease data table.
ULp <- read.delim('GPL21145-48548.txt', sep='\t', comment.char='#',
header=TRUE, na.strings=c('',' ', 'NA'))
head(ULp)
These two are from the same study with beadchip expressions.
UL1 <- read.csv('UL_GSE95101_GPL13376_table.csv', sep=',',header=TRUE,
na.strings=c('',' ','NA'))
nonUL1 <- read.csv('nonUL_GSE95101_GPL13376_table.csv', sep=',',header=TRUE,
na.strings=c('',' ','NA'))
These next two are from different studies with beadchip expressions. These have to be merged with the platform ULp we read in a couple Rmarkdown chunks ago.
UL2 <- read.csv('UL_GSE120854_GPL23767_table.csv', sep=',',header=TRUE,
na.strings=c('',' ','NA'))
nonUL2 <- read.csv('nonUL_GSE120854_GPL23767_table.csv', sep=',',header=TRUE,
na.strings=c('',' ','NA'))
Keep only the columns needed of ULp, ID, SourceSeq, and CHR.There is no gene symbol to bind to this data table platform GPL21145.
ULP <- ULp %>% select(ID,SourceSeq,CHR)
UL2b <- UL2 %>% select(-X,-SPOT_ID)
nonUL2b <- nonUL2 %>% select(-X,-SPOT_ID)
colnames(UL2b)[2:25] <- paste('UL',colnames(UL2b)[2:25],sep='_')
colnames(nonUL2b)[2:11] <- paste('nonUL', colnames(nonUL2b)[2:11],sep='_')
UL_2study <- merge(UL2b,nonUL2b, by.x='ID', by.y='ID')
UL_study2 <- merge(ULP, UL_2study, by.x='ID', by.y='ID')
str(UL_study2)
Remove the ‘null’ values that aren’t being recognized by the na.strings or complete.cases. And also the numeric gene expression values are being read in as factors. Do this by writing it to csv and reading it back in and declaring ‘null’ as an na.string.
write.csv(UL_study2, 'UL_study2.csv', row.names=FALSE)
UL_study2 <- read.csv('UL_study2.csv', sep=',', header=TRUE, na.strings=c('',' ','NA','null'))
str(UL_study2)
Thats great that it now recognizes each column as a numeric data type and that the null values are now being recognized as NA.
Now, lets use complete.cases to get only those observations with complete cases and remove the NA values.
UL_study2_cc <- UL_study2[complete.cases(UL_study2),]
obsRM <- length(UL_study2$ID) - length(UL_study2_cc$ID)
cat('The number of observations removed is: ', obsRM,'.')
Its too bad that we cannot really do anything with this data, unless we open up ubuntu Linux and use the 3.4 version of R to use Bioconductor to get the gene names. But we can at least use the sequences to see if there are any with copies. Not necessarily CNVs because they would have to have the gene names to see what insertions, deletions, skips, etc make the gene a CNV. We can always extract what we need from it now, then get the gene information with a VM of Ubuntu, download the R version and use the tutorial (assuming it works exactly as demonstrated) online.
The following are the requirements using the above online tutorial: - R version 3.4.1 (2017-06-30) - Platform: x86_64-pc-linux-gnu (64-bit) - Running under: Ubuntu 16.04.2 LTS
So lets get the median and foldchange median values. We also don’t have the cytoband location, but we do have the chromosome for this set.
row.names(UL_study2_cc) <- UL_study2_cc$ID
ul <- grep('^UL', colnames(UL_study2_cc))
UL2_t <- UL_study2_cc[,ul]
nUL2_t <- UL_study2_cc[,-c(1:3,ul)]
Get the row medians of the UL and nonUL 2nd study data.
UL2_t$UL_Median <- apply(UL2_t,1,median)
nUL2_t$nonUL_Median <- apply(nUL2_t,1,median)
FoldChange_Median_UL_to_nonUL <- UL2_t$UL_Median/nUL2_t$nonUL_Median
UL_study2_cc$FoldChange_Median_UL_to_nonUL <- FoldChange_Median_UL_to_nonUL
UL2_cnv <- UL_study2_cc %>%
group_by(SourceSeq) %>%
mutate(Sequence_CNVs = n()) %>%
select(ID,SourceSeq,Sequence_CNVs, CHR,FoldChange_Median_UL_to_nonUL, everything()) %>%
ungroup() %>%
unique()
The above has all unique CNVs of the sequences, as should be expected since there was no error making the row names the ID field.
write.csv(UL2_cnv, 'UL_study2_needsGeneData.csv', row.names=FALSE)
Now lets get to the first study of UL gene expression data that we can use to add to our network analysis of genes related to diseases. These two tables are the UL1 and nonUL1 tables.
nonUL1b <- nonUL1 %>% select(Symbol,SEQUENCE,Chromosome,Cytoband,GSM2496194:GSM2496222)
UL1b <- UL1 %>% select(Symbol,SEQUENCE,Chromosome,Cytoband,GSM2496185:GSM2496220)
colnames(nonUL1b)[5:22] <- paste('nonUL',colnames(nonUL1b)[5:22],sep='_')
colnames(UL1b)[5:24] <- paste('UL',colnames(UL1b)[5:24],sep='_')
Add the Median fields to each table.
UL1c <- UL1b[,5:24]
nUL1c <- nonUL1b[,5:22]
UL1c$UL1_Median <- apply(UL1c,1,median)
nUL1c$nonUL1_Median <- apply(nUL1c,1,median)
FC <- UL1c$UL1_Median/nUL1c$nonUL1_Median
mUL1 <- UL1b[,1:4]
UL_study1 <- cbind(mUL1,UL1c,nUL1c)
UL_study1$FoldChange_Median_UL1_to_nonUL1 <- FC
UL_study1b <- UL_study1 %>% select(Symbol,SEQUENCE,Chromosome,Cytoband,
FoldChange_Median_UL1_to_nonUL1,
UL1_Median,nonUL1_Median,
everything())
There are some missing values in the chromosome and cytoband fields, but we don’t need to omit them yet.
colSums(is.na(UL_study1b))
The above shows that there aren’t any missing values in the samples or the sequence fields, but there are some in the gene symbol, chromosome, and cytoband fields.
Lets get the CNVs of the genes now, but remove the NAs from the Symbol field first.
UL_study1b2 <- UL_study1b[complete.cases(UL_study1b$Symbol),]
UL_study1_cnv <- UL_study1b2 %>%
group_by(Symbol) %>%
mutate(Symbol_CNVs = n()) %>%
select(Symbol,Symbol_CNVs,everything()) %>%
ungroup() %>%
unique()
Lets look at what we have for our first few fields and the values.
head(UL_study1_cnv[,1:6])
This is going to grab our target genes for those genes that have a CNV gene count greater than the top 90th percentile and also is either of the bottom 5th percentile or top 95th percentile for fold change median values of UL to nonUL gene expression values.
UL_study1_thresholds <- subset(UL_study1_cnv,
UL_study1_cnv$Symbol_CNVs > quantile(UL_study1_cnv$Symbol_CNVs,.9) &
(UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1 <
quantile(UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1,.05) |
UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1 >
quantile(UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1,.95))
)
Lets now order this table by CNV and Fold Change values.
UL_study1_ordered <- UL_study1_thresholds[with(UL_study1_thresholds,
order(Symbol_CNVs,FoldChange_Median_UL1_to_nonUL1,
decreasing=TRUE)),]
Lets take a look at the top few rows and first few columns of this table of threshold value target genes for UL.
head(UL_study1_ordered[,1:7])
The table is ordered by CNVs first then by fold change values within the top CNVs. Lets write this to csv and use it later for our visNetwork. So far, it looks like either the Chromosome or the cytoband could be used for location of the nodes. But we could also group the genes by link to gene disease networks, gene networks, or anything else we get ideas about later.
write.csv(UL_study1_ordered,'UL_study1_thresholds.csv',row.names=FALSE)
UL microarray study Uterine leiomyomas Disease data table (microarray data). These two are from the same study with microarray expressions.
UL1 <- read.csv('UL_GSE68295_GPL6480_table.csv', sep=',',header=TRUE,
na.strings=c('',' ','NA'))
nonUL1 <- read.csv('nonUL_GSE68295_GPL6480_table.csv', sep=',',header=TRUE,
na.strings=c('',' ','NA'))
colnames(UL1)
colnames(nonUL1)
Lets keep only the columns interested in. Those are
nonUL1b <- nonUL1 %>% select(GENE_SYMBOL,SEQUENCE,CYTOBAND,GSM1667144:GSM1667146)
UL1b <- UL1 %>% select(GENE_SYMBOL,SEQUENCE,CYTOBAND,GSM1667147:GSM1667149)
colnames(nonUL1b)[4:6] <- paste('nonUL',colnames(nonUL1b)[4:6],sep='_')
colnames(UL1b)[4:6] <- paste('UL',colnames(UL1b)[4:6],sep='_')
Add the Median fields to each table.
UL1c <- UL1b[,4:6]
nUL1c <- nonUL1b[,4:6]
UL1c$UL1_MedianArray <- apply(UL1c,1,median)
nUL1c$nonUL1_MedianArray <- apply(nUL1c,1,median)
FC <- UL1c$UL1_MedianArray/nUL1c$nonUL1_MedianArray
mUL1 <- UL1b[,1:3]
UL_study1 <- cbind(mUL1,UL1c,nUL1c)
UL_study1$FoldChange_MedianArray_UL1_to_nonUL1 <- FC
UL_study1b <- UL_study1 %>% select(GENE_SYMBOL,SEQUENCE,CYTOBAND,
FoldChange_MedianArray_UL1_to_nonUL1,
UL1_MedianArray,nonUL1_MedianArray,
everything())
There are some missing values in the GENE_SYMBOL, SEQUENCE, and CYTOBAND fields, but we don’t need to omit them yet.
colSums(is.na(UL_study1b))
The above shows that there aren’t any missing values in the samples or the sequence fields, but there are some in the gene symbol, chromosome, and cytoband fields.
Lets get the CNVs of the genes now, but remove the NAs from the Symbol field first.
UL_study1b2 <- UL_study1b[complete.cases(UL_study1b$GENE_SYMBOL),]
UL_study1_cnv <- UL_study1b2 %>%
group_by(GENE_SYMBOL) %>%
mutate(GENE_SYMBOL_CNVs = n()) %>%
select(GENE_SYMBOL,GENE_SYMBOL_CNVs,everything()) %>%
ungroup() %>%
unique()
Lets look at what we have for our first few fields and the values.
head(UL_study1_cnv[,1:6])
This is going to grab our target genes for those genes that have a CNV gene count greater than the top 90th percentile and also is either of the bottom 5th percentile or top 95th percentile for fold change median values of UL to nonUL gene expression values.
UL_study1_thresholds <- subset(UL_study1_cnv,
UL_study1_cnv$GENE_SYMBOL_CNVs > quantile(UL_study1_cnv$GENE_SYMBOL_CNVs,.9) &
(UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1 <
quantile(UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1,.05) |
UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1 >
quantile(UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1,.95))
)
Lets now order this table by CNV and Fold Change values.
UL_study1_ordered <- UL_study1_thresholds[with(UL_study1_thresholds,
order(GENE_SYMBOL_CNVs,FoldChange_MedianArray_UL1_to_nonUL1,
decreasing=TRUE)),]
Lets take a look at the top few rows and first few columns of this table of threshold value target genes for UL.
head(UL_study1_ordered[,1:7])
The table is ordered by CNVs first then by fold change values within the top CNVs. Lets write this to csv and use it later for our visNetwork. So far, it looks like either the Chromosome or the cytoband could be used for location of the nodes. But we could also group the genes by link to gene disease networks, gene networks, or anything else we get ideas about later.
write.csv(UL_study1_ordered,'UL_microarray_thresholds.csv',row.names=FALSE)
Gastric Stomach Cancer data table.
GSC <- read.delim('GSE64916_series_matrix.txt', sep='\t', comment.char='!',
header=TRUE, na.strings=c('',' ', 'NA'))
head(GSC)
These are samples of gastric stomach cancer gene expressions in located peripheral blood. The samples GSM1583284, GSM1583285, GSM1583286, and GSM1583287 are GSC, and the control of this study is from the blood of a person without gastric stomach cancer in GSM1583288.
Lets label these columns to be their sample types.
colnames(GSC) <- c("ID","GSC1","GSC2","GSC3","GSC4","noGSC_ctrl")
head(GSC)
GSCp <- read.delim('GPL13497-9755.txt', sep='\t',header=TRUE, comment.char='#',
na.strings=c('',' ','NA'))
head(GSCp)
Lets keep the ID, GENE_SYMBOL, CYTOBAND, and SEQUENCE columns of our platform meta data.
GSC_P <- GSCp %>% select(ID, GENE_SYMBOL, CYTOBAND, SEQUENCE)
head(GSC_P)
Now lets combine the two data tables of sample and meta information.
GSC2 <- merge(GSC_P, GSC, by.x='ID',by.y='ID')
head(GSC2)
We will now remove the NAs from the Gene Symbol field.
GSC3 <- GSC2[complete.cases(GSC2$GENE_SYMBOL),]
head(GSC3)
Lets group by gene and get the count of each gene and attach it to the sequence information.
GSC4 <- GSC3 %>%
group_by(GENE_SYMBOL) %>%
mutate(GENE_CNVs = n()) %>%
select(GENE_SYMBOL, GENE_CNVs, everything()) %>%
ungroup() %>%
unique()
colnames(GSC4)
Now, we will add the fold change of median values of the GSC samples to the control sample of a healthy person.
GSC4$GSC_Median <- apply(GSC4[6:9],1,median)
GSC4$FoldChangeMedian_GSC_to_healthyCtrl <- GSC4$GSC_Median/GSC4$noGSC_ctrl
GSC5 <- GSC4 %>% select(GENE_SYMBOL, GENE_CNVs,-ID,CYTOBAND,SEQUENCE,
FoldChangeMedian_GSC_to_healthyCtrl,GSC_Median, GSC1:noGSC_ctrl)
colnames(GSC5)
Lets get the subset of GSC that has the genes with a high number of CNVs and a high Fold change by median values.
GSC_genes <- subset(GSC5, GSC5$GENE_CNVs > median(GSC5$GENE_CNVs) &
(GSC5$FoldChangeMedian_GSC_to_healthyCtrl <
quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.05) |
GSC5$FoldChangeMedian_GSC_to_healthyCtrl >
quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.95)
)
)
length(unique(GSC_genes$GENE_SYMBOL))
That is too many genes to make the targeted gastric stomach cancer genes, so we will raise our CNV threshold to the top 99th percentile.
GSC_genes <- subset(GSC5, GSC5$GENE_CNVs > quantile(GSC5$GENE_CNVs,0.99) &
(GSC5$FoldChangeMedian_GSC_to_healthyCtrl <
quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.05) |
GSC5$FoldChangeMedian_GSC_to_healthyCtrl >
quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.95)
)
)
length(unique(GSC_genes$GENE_SYMBOL))
Lets order this table of targeted genes by CNV count then by decreasing value of fold change from GSC to healthy as a ratio.
GSC6 <- GSC_genes[with(GSC_genes, order(GENE_CNVs,FoldChangeMedian_GSC_to_healthyCtrl,
decreasing=TRUE)),]
head(GSC6)
Lets write this file out to csv and work on the other disease data sets separate from this one.
write.csv(GSC6, 'GastricStomachCancer_Thresholds.csv', row.names=FALSE)
Colon cancer data table.
CC <- read.delim('GSE135749_series_matrix.txt', sep='\t', comment.char='!',
header=TRUE, na.strings=c('',' ', 'NA'))
head(CC)
CCp <- read.delim('GPL10558-50081.txt', sep='\t',header=TRUE, comment.char='#',
na.strings=c('',' ','NA'))
head(CCp)
colnames(CC)
colnames(CCp)
Keep only the SYMBOL, ID, Cytoband, and SEQUENCE columns of the CC platform to combine with the CC samples by ID_REF.
CC_p <- CCp %>% select(ID, Symbol,SEQUENCE,Cytoband)
The CC data set will be produced by this next command.
CC_1 <- merge(CC_p,CC, by.x='ID', by.y='ID_REF')
head(CC_1)
Note: Since these are quantile normalized, and there are various ways and reasons to do this, but the purpose being primarily to eliminate noise, we will keep the values, for our purposes, we are still getting those threshold genes that are in the bottom 5th and top 95th percentile while also having more CNVs than the median CNVs of all genes in this set.
We should separate the columns that are samples of colon cancer tumors without treatment of the short hairpin RNAs (shRNA) targeting two different RNAs of LGR5. The first sample columns are the control colon cancer samples, the next two are the first type of LGR5 being knocked down with a shRNA, and the last two samples as columns are the 2nd type of LGR5 being knocked down with the 2nd type of shRNA.
These are colon cancer tumor cells. with two replicates of each sample type or group. We will prepend cc0, cc1,cc2 to these 6 samples for if no shRNA knockdown (cc0), the 1st type shRNA (cc1), or the 2nd type shRNA (cc2) is what the sample is a product of procedure type.
colnames(CC_1)[5:6] <- paste('cc0', colnames(CC_1)[5:6],sep='_')
colnames(CC_1)[7:8] <- paste('cc1', colnames(CC_1)[7:8],sep='_')
colnames(CC_1)[9:10] <- paste('cc2', colnames(CC_1)[9:10],sep='_')
colnames(CC_1)
Now, lets add in the median values of the samples of these three groups of control, knockdown 1, and knockdown 2. The median values will be knockdown 1 compared to control, and knockdown 2 compared to control.We want to be consistent with the methods of our other disease type sample extraction of target genes, so even though there are only two samples of each group, the median is used. Notice the median seems to be programmed to be the mean when there are only two samples.
cc0 <- apply(CC_1[5:6],1,median)
cc1 <- apply(CC_1[7:8],1,median)
cc2 <- apply(CC_1[9:10],1,median)
FoldChange_CC1_to_CC0 <- cc1/cc0
FoldChange_CC2_to_CC0 <- cc2/cc0
Add these median and fold change values to our colon cancer table, CC_1.
CC_1$ctrl_CC_median <- cc0
CC_1$group1_CC_median <- cc1
CC_1$group2_CC_median <- cc2
CC_1$FoldChange_CC1_to_CC0 <- FoldChange_CC1_to_CC0
CC_1$FoldChange_CC2_to_CC0 <- FoldChange_CC2_to_CC0
colnames(CC_1)
Rearrange the columns to see the stats and meta information before the sample information.
CC2 <- CC_1 %>% select(ID:Cytoband, FoldChange_CC1_to_CC0:FoldChange_CC2_to_CC0,
ctrl_CC_median:group2_CC_median,everything())
colnames(CC2)
Lets group by gene and get the count of each gene and attach it to the sequence information.
CC3 <- CC2[complete.cases(CC2$Symbol),]
CC4 <- CC3 %>%
group_by(Symbol) %>%
mutate(Symbol_CC_CNVs = n()) %>%
select(Symbol, Symbol_CC_CNVs, everything()) %>%
ungroup() %>%
unique()
Now, lets get our threshold values so that the genes with CNVs greater than the median value of CNVs of genes in our data set are selected and also those genes in that group but also having a threshold of being in the bottom 5th and top 95th percentile for fold change values of group1/control and group2/control are selected as target genes.
CC5 <- subset(CC4,
(CC4$Symbol_CC_CNVs > median(CC4$Symbol_CC_CNVs)) &
(CC4$FoldChange_CC1_to_CC0 < quantile(CC4$FoldChange_CC1_to_CC0,.05, na.rm=TRUE)
|
CC4$FoldChange_CC1_to_CC0 > quantile(CC4$FoldChange_CC1_to_CC0,.95, na.rm=TRUE)
)
|
(CC4$Symbol_CC_CNVs > median(CC4$Symbol_CC_CNVs))
&
(CC4$FoldChange_CC2_to_CC0 < quantile(CC4$FoldChange_CC2_to_CC0,.05, na.rm=TRUE)
|
CC4$FoldChange_CC2_to_CC0 > quantile(CC4$FoldChange_CC2_to_CC0,.95, na.rm=TRUE))
)
There are a lot of observations for this threshold dataset of targeted colon cancer genes. Lets see how many genes are unique.
length(unique(CC5$Symbol))
That is a lot of genes as gene targets, so lets set a higher threshold for the genes with CNVs higher than the 97th percentile.
CC6 <- subset(CC4,
(CC4$Symbol_CC_CNVs > quantile(CC4$Symbol_CC_CNVs,.97)) &
(CC4$FoldChange_CC1_to_CC0 < quantile(CC4$FoldChange_CC1_to_CC0,.05, na.rm=TRUE)
|
CC4$FoldChange_CC1_to_CC0 > quantile(CC4$FoldChange_CC1_to_CC0,.95, na.rm=TRUE)
)
|
(CC4$Symbol_CC_CNVs > quantile(CC4$Symbol_CC_CNVs,.97))
&
(CC4$FoldChange_CC2_to_CC0 < quantile(CC4$FoldChange_CC2_to_CC0,.05, na.rm=TRUE)
|
CC4$FoldChange_CC2_to_CC0 > quantile(CC4$FoldChange_CC2_to_CC0,.95, na.rm=TRUE))
)
This dataset is much better to work with for targeted genes based on our set threshold limits as discriminates. Lets see how many are unique.
length(CC6$Symbol)
Lets order this data table by CNVs then by fold change decreasing by group1 then by group2.
CC7 <- CC6[with(CC6, order(Symbol_CC_CNVs,FoldChange_CC1_to_CC0,
FoldChange_CC2_to_CC0, decreasing=TRUE)),]
If we can’t get the CYTOBAND for the other diseases then we will leave it out or add it in later if there are few to look up and add manually. Lets write this file out to csv and work on the other disease data sets separate from this one.
write.csv(CC7, 'ColonCancerThresholds.csv', row.names=FALSE)
Pancreatic cancer data table.
PC <- read.delim('GSE131859_non-normalized.txt', sep='\t', comment.char='!',
header=TRUE, na.strings=c('',' ', 'NA'))
head(PC)
PCp <- read.delim('GPL14951-11332.txt', sep='\t',header=TRUE, comment.char='#',
na.strings=c('',' ','NA'))
head(PCp)
colnames(PC)
These samples have the pvalue scores attached to the sample microarray readings. We aren’t interested in these pval fields, so we will remove them and the added X field when loading of the row numbers when saved to csv from other file.
PC1 <- PC %>% select(-MiaPACA2_EV.Detection_Pval,-MiaPACA2_ARHGEF10.Detection_Pval,
-Hs766T_shGFP.Detection_Pval,-Hs766T_shARHGEF10.Detection_Pval,-X)
head(PC1)
We have four samples of pancreatic cancer. MiaPACA2_EV is the control of pancreatic cancer epithelial cells. The idea is to show that ARHGEF10 is a tumor suppressor by examining two cell lines of available commercial pancreatic cancer cells where MiaPACA2 is directly from the epithelial pancreatic cells and Hs766T are from the lymph nodes of someone with pancreatic cancer. MiaPACA2_ARHGEF10 sample is when ARHGEF10 is overexpressed to demonstrate suppression of pancreatic cancer cells not mutating. The Hs766T_shGFP is the sample where the short hair pin RNA additive to eliminate the tumor suppressor ARHGEF10 wasn’t added and is the lymph node pancreatic cancer control. The Hs766T_shARHGEF10 is when the Hs766T cell line of pancreatic cancer cells had the ARHGEF10 gene knocked down or suppressed. Lets rename these column samples by what they are,
colnames(PC1) <- c('ID_REF','PC_epithelial_ctrl','PC_epithelial_SuppressorAdded',
'PC_lymph_ctrl','PC_lymph_SuppressorKnockedOut')
colnames(PCp)
Keep only the Symbol, Cytoband, ID, and SEQUENCE columns of the PC platform to combine with the PC samples by ID_REF and GENE_SYMBOL.
PC_p <- PCp %>% select(ID, Symbol, SEQUENCE, Cytoband)
head(PC_p)
The PC data set will be produced by this next chunk.
PC2 <- merge(PC_p,PC1, by.x='ID', by.y='ID_REF')
head(PC2)
Since there is only one sample of each, it is redundant to get the median of each type, so we will get the fold change values of the suppresor added epithelial PC cells to the epithelial PC cells control, and the fold change values of the lymph node PC cells with the suppressor knocked out or suppressed to the lymph node PC cells control.
FC_epithelial_PC <- PC2$PC_epithelial_SuppressorAdded/PC2$PC_epithelial_ctrl
FC_lymphNode_PC <- PC2$PC_lymph_SuppressorKnockedOut/PC2$PC_lymph_ctrl
PC2$FC_epithelial_PC <- FC_epithelial_PC
PC2$FC_lymphNode_PC <- FC_lymphNode_PC
PC3 <- PC2 %>% select(ID:Cytoband, FC_epithelial_PC,FC_lymphNode_PC, everything())
Lets group by gene and get the count of each gene and attach it to the sequence information.
PC4 <- PC3 %>%
group_by(Symbol) %>%
mutate(GENE_CNVs = n()) %>%
select(Symbol, GENE_CNVs, everything()) %>%
ungroup() %>%
unique()
colnames(PC4) <- gsub('FC_','FoldChange_', colnames(PC4))
colnames(PC4)
PC5 <- PC4[with(PC4, order(GENE_CNVs, FoldChange_epithelial_PC,
FoldChange_lymphNode_PC, decreasing=TRUE)),]
Get the subset of AD that has the genes with a high number of CNVs and a high Fold change by median values.
PC_genes <- subset(PC5, PC5$GENE_CNVs > median(PC5$GENE_CNVs, na.rm=TRUE) &
(PC5$FoldChange_epithelial_PC <
quantile(PC5$FoldChange_epithelial_PC,0.05, na.rm=TRUE) |
PC5$FoldChange_lymphNode_PC >
quantile(PC5$FoldChange_lymphNode_PC, .95, na.rm=TRUE))
)
length(unique(PC_genes))
This is a lot of unique genes, so lets filter more by making the CNVs to be in the top 95th percentile of CNV counts instead of greater than the median of gene CNVs.
PC_genes2 <- subset(PC5, PC5$GENE_CNVs > quantile(PC5$GENE_CNVs, 0.95,na.rm=TRUE) &
(PC5$FoldChange_epithelial_PC <
quantile(PC5$FoldChange_epithelial_PC,0.05, na.rm=TRUE) |
PC5$FoldChange_lymphNode_PC >
quantile(PC5$FoldChange_lymphNode_PC, .95, na.rm=TRUE))
)
length(unique(PC_genes2))
That is a much better quantity to work with for targeted Pancreatic Cancer genes. Lets order this by CNVs then by fold change epithelial then fold change lymph.
PC6 <- PC_genes2[with(PC_genes2, order(GENE_CNVs,FoldChange_epithelial_PC,
FoldChange_lymphNode_PC, decreasing=TRUE)),]
head(PC6)
Lets drop the ID field.
PC7 <- PC6 %>% select(-ID)
head(PC7)
Lets write this file out to csv and work on the other disease data sets separate from this one.
write.csv(PC7, 'Pancreatic_Cancer_Thresholds.csv', row.names=FALSE)
Alzheimer’s Disease data table.
AD <- read.delim('GSE109887_series_matrix.txt', sep='\t', comment.char='!',
header=TRUE, na.strings=c('',' ', 'NA'))
head(AD)
ADp <- read.delim('GPL4133-12599.txt', sep='\t',header=TRUE, comment.char='#',
na.strings=c('',' ','NA'))
head(ADp)
colnames(AD)
colnames(ADp)
Keep only the GENE_SYMBOL and SEQUENCE columns of the AD platform to combine with the AD samples by ID_REF and GENE_SYMBOL.
AD_p <- ADp[,c(10,20)]
The AD data set will be produced by this next chunk.
AD_1 <- merge(AD_p,AD, by.x='GENE_SYMBOL', by.y='ID_REF')
head(AD_1)
We should separate the columns that are samples of AD patients to those that are the control samples of AD patients.
metaAD <- read.csv('AlzheimerAgeGenderTissueSamplesMeta.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
head(metaAD)
row.names(metaAD) <- metaAD$sampleID
meta_AD <- as.data.frame(t(metaAD))
head(meta_AD)
Get the indices of the AD and control patients
ad <- grep('AD', meta_AD$disease)
#add 1 because of the additional sequence field in AD_1 not in metaAD
ad <- ad+1
Paste the type of sample to the sample ID fields.
colnames(AD_1)[ad] <- paste('AD',colnames(AD_1)[ad],sep='_')
colnames(AD_1)[-c(1,2,ad)] <- paste('ctrl_AD',colnames(AD_1)[-c(1,2,ad)], sep='_')
Create separate tables for the AD and the control patients.
ad1 <- AD_1[,ad]
ctrl_ad1 <- AD_1[,-c(1,2,ad)]
Row Means of each subset.
ad1$Mean_AD <- rowMeans(ad1)
ctrl_ad1$Mean_ctrl_AD <- rowMeans(ctrl_ad1)
ad1$Median_AD <- apply(ad1,1,median)
ctrl_ad1$Median_ctrl_AD <- apply(ctrl_ad1,1,median)
Now combine these new value fields to the original table for AD and control mean and median.
AD_1$Mean_AD <- ad1$Mean_AD
AD_1$Mean_ctrl_AD <- ctrl_ad1$Mean_ctrl_AD
AD_1$Median_AD <- ad1$Median_AD
AD_1$Median_ctrl_AD <- ctrl_ad1$Median_ctrl_AD
Remove duplicate entries, the genes even thought they have different sequence values, are showing duplicate or identical gene expression values by gene.
AD2 <- AD_1[!duplicated(AD_1),]
AD2$FoldChange_Mean_AD_2_ctrl <- AD2$Mean_AD/AD2$Mean_ctrl_AD
AD2$FoldChange_Median_AD_2_ctrl <- AD2$Median_AD/AD2$Median_ctrl_AD
head(AD2)[,c(1,2,83:86)]
You can see from the above table that even though the sequence is different in these first displayed genes that the gene the sequences belong to have the same gene expression values.
Lets group by gene and get the count of each gene and attach it to the sequence information.
AD3 <- AD2 %>%
group_by(GENE_SYMBOL) %>%
mutate(GENE_CNVs = n()) %>%
select(GENE_SYMBOL, GENE_CNVs, SEQUENCE, Median_AD, Median_ctrl_AD,
FoldChange_Mean_AD_2_ctrl, FoldChange_Median_AD_2_ctrl, everything()) %>%
ungroup() %>%
unique()
Alzheimer <- AD3[with(AD3, order(GENE_CNVs, GENE_SYMBOL, decreasing=TRUE)),]
Get the subset of AD that has the genes with a high number of CNVs and a high Fold change by median values.
AD_genes <- subset(Alzheimer, Alzheimer$GENE_CNVs > median(Alzheimer$GENE_CNVs) &
(Alzheimer$FoldChange_Median_AD_2_ctrl < 0.90 |
Alzheimer$FoldChange_Median_AD_2_ctrl > 1.10)
)
alz_genes <- unique(AD_genes$GENE_SYMBOL)
alz_genes
AD4 <- AD_genes[,c(1:3,7)]
AD4
AD_genes1 <- datatable(data=AD4, rownames=FALSE,
extensions=c('Buttons','Responsive','FixedColumns'),
filter=list(position='top'),
options=list(pageLength=10,
dom='Bfrtip',scrollX = TRUE, fixedColumns = TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
AD_genes1
Lets also add in the cytoband location.
cyto <- ADp %>% select(GENE_SYMBOL, CYTOBAND)
AD5 <- merge(cyto, AD4, by.x='GENE_SYMBOL', by.y='GENE_SYMBOL')
head(AD5)
So, now we have our table of Alzheimer genes that have high CNVs and Fold Change Median values. Lets start getting this information for the other diseases to word towards building our visNetwork of genes related to other genes if any are.
We want to get these same values exactly as we have for the Alzheimer’s Disease genes.
colnames(AD5)
If we can’t get the CYTOBAND for the other diseases then we will leave it out. Lets write this file out to csv and work on the other disease data sets separate from this one.
write.csv(AD5, 'useableAlzheimerStats.csv', row.names=FALSE)
%^& !!!THE BEGINNING of our VISNETWORK!!!
Lets start by reading in those data tables we will be working with that derived the genes in the highes percentile of Copy Number Variants of genes as well as being in the bottom 5th and top 90th-99th percentiles of fold change values. The library visNetwork requires that there be an edges and a nodes data file to read from and that there be certain specific columns within each table, such as ID, to, from, label, and so on. How we will link the genes is by their ID and to see if that ID links to any other gene in our tables of disease genes. We will first read in our tables, then make sure the column names describe what disease the fold change values and CNVs are for, as well as add a field to each of these targeted genes tables that identifies the gene as being a top gene related to the disease table. Such as, pancreatic cancer, beadchip benign uterine leiomyoma, microarray benign uterine leiomyoma, gastric stomach cancer, colon cancer, or Alzheimer’s Disease.
pancreatic <- read.csv('Pancreatic_Cancer_Thresholds.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
colon <- read.csv('ColonCancerThresholds.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
stomach <- read.csv('GastricStomachCancer_Thresholds.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
alzheimer <- read.csv('useableAlzheimerStats.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
ul_beadchip <- read.csv('UL_study1_thresholds.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
ul_microarray <- read.csv('UL_microarray_thresholds.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
Lets now look at each of these six diseases and add in what type of disease it is to the fields in each table that don’t give that information.
#colnames(alzheimer)[1:4] <- paste('AD',colnames(alzheimer)[1:4],sep='_')
#colnames(colon)[c(1,3,4,5)] <- paste('CC',colnames(colon)[c(1,3,4,5)],sep='_')
#colnames(stomach)[1:4] <- paste('GSC', colnames(stomach)[1:4],sep='_')
#colnames(pancreatic)[1:4] <- paste('PC', colnames(pancreatic)[1:4], sep='_')
#colnames(ul_beadchip)[1:5] <- paste('UL_beadchip', colnames(ul_beadchip)[1:5], sep='_')
#colnames(ul_beadchip)[9:46] <- gsub('UL','UL_beadchip', colnames(ul_beadchip)[9:46])
#colnames(ul_microarray)[c(1:4)] <- paste('UL_array', colnames(ul_microarray)[c(1:4)],
# sep='_')
#colnames(ul_microarray)[8:13] <- gsub('UL_','UL_microarray_', colnames(ul_microarray)[8:13])
Lets add in the disease type field of factor type to each table.
pancreatic$pancreatic_cancer <- as.factor('pancreatic cancer')
stomach$stomach_cancer <- as.factor('stomach cancer')
colon$colon_cancer <- as.factor('colon cancer')
alzheimer$alzheimerDisease <- as.factor('Alzheimer\'s Disease')
ul_beadchip$uterine_leiomyoma_BC <- as.factor('benign UL beadchip')
ul_microarray$uterine_leiomyoma_MA <- as.factor('benign UL microarray')
Pancreatic cancer data table:
head(pancreatic)
## Symbol GENE_CNVs SEQUENCE
## 1 DMD 8 GGCAAGTTCAACAGGATTTTGTGACCAGCGCAGGCTGGGCCTCCTTCTGC
## 2 HYDIN 8 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG
## 3 DMD 8 CCAAAAGGTATTTTGCGAAGCATCCCCGAATGGGCTACCTGCCAGTGCAG
## 4 ABCG1 6 CAGCACTTGTGAAGGATTGAATGCAGGTTCCAGGTGGAGGGAAGACGTGG
## 5 KCNIP4 6 CTTGGAAGGGCTTGAAATGATAGCAGTTCTGATCGTCATTGTGCTTTTTG
## 6 GPR177 5 TGAGATCTACAAGTTGACCCGCAAGGAGGCCCAGGAGTAGGAGGCTGCAG
## Cytoband FoldChange_epithelial_PC FoldChange_lymphNode_PC
## 1 Xp21.2a-p21.1d 2.309727 0.4945016
## 2 16q22.1f-q22.3a 1.967041 0.4460581
## 3 Xp21.2a-p21.1d 1.414647 0.4765914
## 4 21q22.3b 3.787784 0.6271994
## 5 4p15.31d-p15.31b 2.674067 0.5404346
## 6 1p31.3a 8.289087 0.4977605
## PC_epithelial_ctrl PC_epithelial_SuppressorAdded PC_lymph_ctrl
## 1 117.2 270.7 281.9
## 2 133.5 262.6 289.2
## 3 185.7 262.7 296.9
## 4 1408.0 5333.2 8854.6
## 5 112.6 301.1 280.7
## 6 550.7 4564.8 15584.0
## PC_lymph_SuppressorKnockedOut pancreatic_cancer
## 1 139.4 pancreatic cancer
## 2 129.0 pancreatic cancer
## 3 141.5 pancreatic cancer
## 4 5553.6 pancreatic cancer
## 5 151.7 pancreatic cancer
## 6 7757.1 pancreatic cancer
Stomach cancer data table:
head(stomach)
## GENE_SYMBOL GENE_CNVs CYTOBAND
## 1 PDE4DIP 15 hs|1q21.1
## 2 PDE4DIP 15 hs|1q21.1
## 3 LOC100128869 8 hs|4q28.3
## 4 LOC729444 7 hs|22q11.21
## 5 LOC100131581 7 hs|8p23.1
## 6 LOC389834 7 hs|21p11.2
## SEQUENCE
## 1 CAAAAAATGGACTAGAAGAGAAGCTGGCTGAGGAGCTGAGATCAGCCTCGTGGCCTGGGT
## 2 CTACCAAAGCTGTGTATTTTTCATTTCTTCATGGCACTGTGCTGTTAATTTCTGTTAACA
## 3 TCAAAGACACACGCTCACTGCATGTGCTCTTGGGGGACGTCAGTGCCACGTTTGGTCACA
## 4 TTCCATCATACAGTCTTCAGCAACCTTGAAAGATTGGACAAGCTTCAGCCCACTCTTGAA
## 5 AGATAACTTCAGACTCAGCCTAGAAGTAAAGATCCTCCAGATATGACCTCAACTACCCTC
## 6 GGAACACTCTCTTAAAGTCTACTTATCCTATTCCTTCAAATTAAGACCAAAGTCCAATGT
## FoldChangeMedian_GSC_to_healthyCtrl GSC_Median GSC1 GSC2 GSC3
## 1 1.3059347 3.035789 2.332352 3.456686 2.614892
## 2 0.7574300 6.580579 6.966718 6.194441 8.041944
## 3 0.4429174 3.180155 2.332352 4.027958 2.326500
## 4 1.3486174 3.714299 3.762692 3.665906 3.405338
## 5 0.7360758 2.943508 3.092651 2.794365 4.078270
## 6 0.7229002 3.439584 2.791133 3.231682 5.058745
## GSC4 noGSC_ctrl stomach_cancer
## 1 4.208106 2.324610 stomach cancer
## 2 5.767812 8.688036 stomach cancer
## 3 4.576213 7.180018 stomach cancer
## 4 4.299684 2.754154 stomach cancer
## 5 2.402923 3.998920 stomach cancer
## 6 3.647486 4.758035 stomach cancer
Colon cancer data table:
head(colon)
## Symbol Symbol_CC_CNVs ID
## 1 HYDIN 17 ILMN_1697764
## 2 HYDIN 17 ILMN_1743138
## 3 HYDIN 17 ILMN_1658413
## 4 SDHALP1 13 ILMN_1734640
## 5 SDHALP1 13 ILMN_1757306
## 6 SDHALP1 13 ILMN_1718539
## SEQUENCE Cytoband
## 1 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG 16q22.1f-q22.3a
## 2 TGTGCATTGCCAGTCATTCCCATGCCTTTGCCACGGTGTCCTTCACCCCG 16q22.1f-q22.3a
## 3 CTTTGGGAGCCAGGACCCCCCTTTGGTATGTCACTTAAAGAGCGCTGGAG 16q22.1f-q22.3a
## 4 GTGATGACAGAATCAGCTTTTGTAATTATGTATAATAGCTCATGCATGTG 3q29f
## 5 AGCCCTCCTGGACGATTGGAACTGTAATGTGGAAAGGGCTCTGATGGAGC 3q29f
## 6 GTTCTTCAGATGAACTGATTTTTGTGCAGAGCACACGTGTTGGATTCTGC 3q29f
## FoldChange_CC1_to_CC0 FoldChange_CC2_to_CC0 ctrl_CC_median group1_CC_median
## 1 7.153846 0.4615385 -0.65 -4.65
## 2 5.355556 -0.4666667 2.25 12.05
## 3 -11.375000 10.4583333 -1.20 13.65
## 4 8.621622 -1.6756757 -1.85 -15.95
## 5 1.976190 -8.7857143 -4.20 -8.30
## 6 -6.272727 -15.3636364 0.55 -3.45
## group2_CC_median cc0_GSM4027437 cc0_GSM4027439 cc1_GSM4027441 cc1_GSM4027442
## 1 -0.30 -1.3 0.0 -9.8 0.5
## 2 -1.05 13.3 -8.8 14.7 9.4
## 3 -12.55 -7.9 5.5 8.6 18.7
## 4 3.10 10.6 -14.3 -17.2 -14.7
## 5 36.90 5.4 -13.8 -10.5 -6.1
## 6 -8.45 1.6 -0.5 -3.1 -3.8
## cc2_GSM4027444 cc2_GSM4027446 colon_cancer
## 1 8.5 -9.1 colon cancer
## 2 -7.0 4.9 colon cancer
## 3 -18.4 -6.7 colon cancer
## 4 15.2 -9.0 colon cancer
## 5 13.5 60.3 colon cancer
## 6 7.9 -24.8 colon cancer
Alzheimer’s Disease data table:
head(alzheimer)
## GENE_SYMBOL CYTOBAND GENE_CNVs
## 1 AHNAK hs|11q12.3 5
## 2 AHNAK hs|11q12.3 5
## 3 AHNAK hs|11q12.3 5
## 4 AHNAK hs|11q12.3 5
## 5 AHNAK hs|11q12.3 5
## 6 AHNAK hs|11q12.3 5
## SEQUENCE
## 1 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## FoldChange_Median_AD_2_ctrl alzheimerDisease
## 1 1.10655 Alzheimer's Disease
## 2 1.10655 Alzheimer's Disease
## 3 1.10655 Alzheimer's Disease
## 4 1.10655 Alzheimer's Disease
## 5 1.10655 Alzheimer's Disease
## 6 1.10655 Alzheimer's Disease
Uterine Leiomyoma beadchip data table:
head(ul_beadchip)
## Symbol Symbol_CNVs SEQUENCE
## 1 CTNNB1 7 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG
## 2 CTNNB1 7 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT
## 3 LOC202134 7 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC
## 4 SLCO3A1 6 CCTGTATGTCAGCATCGCCATCGCGCTCAAATCCTTCGCCTTCATCCTGT
## 5 SLCO3A1 6 CCTCCTCTGGGGATGGCTTGCCGACATGCTTTAGAAACTCCCCTCTTGGG
## 6 LOC653673 6 GGAAAAGGCATGGTGACAATGGGGGCTGTAGCCCTAGGAGAACGGGGGGA
## Chromosome Cytoband FoldChange_Median_UL1_to_nonUL1 UL1_Median nonUL1_Median
## 1 <NA> 3p22.1b 1.5744314 2748.51615 1745.71980
## 2 3 3p22.1b 1.4655024 2113.89070 1442.43415
## 3 5 5q35.2d 0.3998123 72.51865 181.38175
## 4 <NA> 15q26.1d 1.4533565 578.43480 397.99925
## 5 <NA> 15q26.1d 1.3643572 161.32180 118.24015
## 6 17 <NA> 1.2756026 112.62110 88.28855
## UL_GSM2496185 UL_GSM2496186 UL_GSM2496187 UL_GSM2496188 UL_GSM2496189
## 1 4742.3 3302.9018 3239.9835 2694.4572 4329.4024
## 2 4482.6 2738.1917 2868.2212 2320.8855 3582.6563
## 3 120.2 95.5970 114.1447 61.4714 53.0500
## 4 909.7 693.5790 596.4898 657.6947 774.6664
## 5 188.2 165.9787 133.6730 127.4902 189.8039
## 6 63.6 71.3818 59.9000 99.5000 114.9687
## UL_GSM2496190 UL_GSM2496191 UL_GSM2496192 UL_GSM2496193 UL_GSM2496203
## 1 2212.7663 2349.6953 3868.5521 4720.0408 56.6000
## 2 1879.2320 2129.0526 2743.9472 3241.3755 70.6800
## 3 102.4450 109.9962 58.7000 54.6500 151.3824
## 4 635.5111 580.4067 462.8962 576.4629 285.0000
## 5 197.9507 179.2933 152.0385 159.9105 195.6714
## 6 167.1138 107.2792 120.4000 95.0750 130.6000
## UL_GSM2496204 UL_GSM2496205 UL_GSM2496206 UL_GSM2496207 UL_GSM2496208
## 1 2104.1954 2984.8842 4542.3102 2943.6535 2356.1480
## 2 1776.8023 2413.5624 3268.0110 2098.7288 1766.2115
## 3 67.7818 52.9000 70.4667 186.8900 59.0000
## 4 849.8864 586.4250 214.2731 75.3222 372.6541
## 5 218.1418 160.5893 66.1333 69.5778 162.0543
## 6 136.3000 110.2735 94.8857 121.5657 126.0100
## UL_GSM2496209 UL_GSM2496217 UL_GSM2496218 UL_GSM2496219 UL_GSM2496220
## 1 2569.9042 1461.1724 2194.4782 2802.5751 1277.3170
## 2 1777.2178 1231.3778 1977.7932 2085.1125 1063.1578
## 3 60.2286 66.7474 74.5706 172.9389 147.4045
## 4 453.3000 590.5725 528.8004 448.9996 539.1882
## 5 162.1348 165.1192 113.8667 157.4818 138.7429
## 6 137.8122 121.4737 92.0086 154.2255 98.5815
## nonUL_GSM2496194 nonUL_GSM2496195 nonUL_GSM2496196 nonUL_GSM2496197
## 1 1819.0576 1590.1865 1589.6078 1780.0956
## 2 1523.4501 1488.2989 1355.8731 1396.6437
## 3 173.7488 107.0138 250.1007 248.4337
## 4 572.5591 498.7206 398.9950 564.0055
## 5 140.0642 137.7730 117.3053 115.9455
## 6 69.4000 66.2286 64.9667 73.1000
## nonUL_GSM2496198 nonUL_GSM2496199 nonUL_GSM2496200 nonUL_GSM2496201
## 1 2641.8803 1371.1857 1711.3440 1509.7454
## 2 1978.2444 1274.5286 1445.0758 1309.4235
## 3 91.9065 193.9691 145.8866 122.1739
## 4 442.3333 380.6980 323.2386 316.2327
## 5 115.3744 112.5600 113.1290 139.1017
## 6 85.5313 92.0763 102.1674 84.6600
## nonUL_GSM2496202 nonUL_GSM2496210 nonUL_GSM2496211 nonUL_GSM2496212
## 1 1832.1853 1300.1667 1612.5461 1383.7698
## 2 1439.7925 1144.2250 1382.7999 1285.0511
## 3 115.3957 229.9462 189.0147 564.7217
## 4 392.9562 503.8777 419.0000 292.6316
## 5 139.6097 138.8037 116.9000 105.9600
## 6 89.5727 88.2200 92.6342 104.2000
## nonUL_GSM2496213 nonUL_GSM2496214 nonUL_GSM2496215 nonUL_GSM2496216
## 1 2141.9502 1986.5859 2006.4213 1632.0421
## 2 1630.3266 1936.1433 1480.8517 1091.4076
## 3 281.0516 350.7591 103.6000 108.7818
## 4 397.0035 260.3180 288.9834 522.8038
## 5 119.1750 84.9333 135.0706 171.8562
## 6 87.2000 135.4857 96.1591 115.1231
## nonUL_GSM2496221 nonUL_GSM2496222 uterine_leiomyoma_BC
## 1 2194.8792 2451.2007 benign UL beadchip
## 2 1837.8157 1864.2062 benign UL beadchip
## 3 201.6956 87.3286 benign UL beadchip
## 4 652.5280 278.8829 benign UL beadchip
## 5 141.1556 67.2000 benign UL beadchip
## 6 85.8444 88.3571 benign UL beadchip
Uterine Leiomyoma microarray data table:
head(ul_microarray)
## GENE_SYMBOL GENE_SYMBOL_CNVs
## 1 TPM3 11
## 2 SPDYE3 10
## 3 RPSA 8
## 4 LPP 8
## 5 ST13 8
## 6 NAV2 7
## SEQUENCE CYTOBAND
## 1 AATTGCATTTTAATTGGGGCAGTTAGAATGTTGATTTCCTAACAGCATTGTGAAGTTGAC hs|1q21.3
## 2 ATCAGGAAGGAGGAGAGGAACCATTTGTGCAGATCATCTAGAAGAACCTGGACCATTCTT hs|7q22.1
## 3 CTGAGTCAGGTAGTGCAGTGGTTGTAAAGTGCAGCATATTCATTTCCAAGCTAGAGGGTT hs|3p22.1
## 4 TCGCTGACGCATTAGACTCAAGGTGGACTTAATTTGGAGATCACCTATGAAGACATCTGT hs|3q28
## 5 AATTATACAAGCATATTTAATGCTGAGTTTAATTTAATATGTAATACATATGGTAATTGT hs|22q13.2
## 6 CACATGGCTATTGATCTCTACTGCGGTTTGGCTTGTCTGTGGGGAATACATGAGCCCCGA hs|11p15.1
## FoldChange_MedianArray_UL1_to_nonUL1 UL1_MedianArray nonUL1_MedianArray
## 1 0.6516203 3.133639 4.808995
## 2 0.6738332 4.118197 6.111597
## 3 1.6315379 5.039850 3.089018
## 4 0.7150036 3.133639 4.382690
## 5 0.7121234 4.853509 6.815545
## 6 1.6569822 7.909182 4.773245
## UL_GSM1667147 UL_GSM1667148 UL_GSM1667149 nonUL_GSM1667144 nonUL_GSM1667145
## 1 3.500581 2.387398 3.133639 4.808995 5.921844
## 2 4.118197 5.116485 3.133639 2.867115 6.111597
## 3 6.903503 5.039850 3.133639 5.185176 2.498480
## 4 2.634312 3.411704 3.133639 3.249956 6.112889
## 5 7.195406 3.280124 4.853509 8.044680 6.815545
## 6 6.692666 8.541624 7.909182 4.773245 5.542953
## nonUL_GSM1667146 uterine_leiomyoma_MA
## 1 2.513410 benign UL microarray
## 2 6.481402 benign UL microarray
## 3 3.089018 benign UL microarray
## 4 4.382690 benign UL microarray
## 5 5.230738 benign UL microarray
## 6 3.444421 benign UL microarray
Lets only keep the cytoband, sequence, symbol, fold change, and disease type columns of each table.
The alzheimer table already has the only fields we want, but lets rearrange and rename them to make them more consistent.
AD1 <- alzheimer %>% select(GENE_SYMBOL:SEQUENCE,alzheimerDisease,
FoldChange_Median_AD_2_ctrl)
colnames(AD1) <- c('gene','cytoband','cnv','sequence','disease','foldChangeAD')
colnames(AD1)
## [1] "gene" "cytoband" "cnv" "sequence" "disease"
## [6] "foldChangeAD"
The pancreatic cancer table has two fold change values for different cell lines observed. The first fold change values are of epithelial cells of PC with a tumor/cancer suppressor(ARHGEF10) added compared to not. And the second fold change values are of lymphatic cells that the cancer suppressor (ARHGEF10) was knocked out or eliminated as much as possible when compared to lymphatic cells of pancreatic cancer patients.
PC1 <- pancreatic %>% select(Symbol,Cytoband,GENE_CNVs,SEQUENCE,pancreatic_cancer,
FoldChange_epithelial_PC,
FoldChange_lymphNode_PC
)
colnames(PC1) <- c('gene','cytoband','cnv','sequence','disease','foldChangePC_epith',
'foldChangePC_lymph')
colnames(PC1)
## [1] "gene" "cytoband" "cnv"
## [4] "sequence" "disease" "foldChangePC_epith"
## [7] "foldChangePC_lymph"
The colon cancer dataset also has two fold change values. The first is a set of LGR5 being knocked down with a shRNA, and the 2nd type is of LGR5 being knocked down with the 2nd type of shRNA.
CC1 <- colon %>% select(Symbol,Cytoband,Symbol_CC_CNVs,SEQUENCE,colon_cancer,
FoldChange_CC1_to_CC0, FoldChange_CC2_to_CC0)
colnames(CC1) <- c('gene','cytoband','cnv','sequence','disease','FoldChange_CC1_to_CC0',
'FoldChange_CC2_to_CC0')
colnames(CC1)
## [1] "gene" "cytoband" "cnv"
## [4] "sequence" "disease" "FoldChange_CC1_to_CC0"
## [7] "FoldChange_CC2_to_CC0"
The stomach cancer dataset has one fold change value
GSC1 <- stomach %>% select(GENE_SYMBOL,CYTOBAND,GENE_CNVs,SEQUENCE,stomach_cancer,
FoldChangeMedian_GSC_to_healthyCtrl)
colnames(GSC1) <- c('gene','cytoband','cnv','sequence','disease',
'FoldChangeMedian_GSC_to_healthyCtrl' )
colnames(GSC1)
## [1] "gene" "cytoband"
## [3] "cnv" "sequence"
## [5] "disease" "FoldChangeMedian_GSC_to_healthyCtrl"
The UL beadchip has one fold change column.
UL_BC <- ul_beadchip %>% select(Symbol,Cytoband,Symbol_CNVs,SEQUENCE,uterine_leiomyoma_BC,
FoldChange_Median_UL1_to_nonUL1,
)
colnames(UL_BC) <- c('gene','cytoband','cnv','sequence', 'disease',
'FoldChange_Median_UL1_to_nonUL1')
colnames(UL_BC)
## [1] "gene" "cytoband"
## [3] "cnv" "sequence"
## [5] "disease" "FoldChange_Median_UL1_to_nonUL1"
The UL microarray data also has one foldchange column.
UL_MA <- ul_microarray %>% select(GENE_SYMBOL,CYTOBAND,GENE_SYMBOL_CNVs,SEQUENCE,
uterine_leiomyoma_MA,
FoldChange_MedianArray_UL1_to_nonUL1)
colnames(UL_MA) <- c('gene','cytoband','cnv','sequence', 'disease',
'FoldChange_MedianArray_UL1_to_nonUL1')
colnames(UL_MA)
## [1] "gene"
## [2] "cytoband"
## [3] "cnv"
## [4] "sequence"
## [5] "disease"
## [6] "FoldChange_MedianArray_UL1_to_nonUL1"
Our data tables with fold change values are AD1, CC1, GSC1, PC1, UL_BC, and UL_MA. Lets look at the first few observations of each table.
head(AD1)
## gene cytoband cnv
## 1 AHNAK hs|11q12.3 5
## 2 AHNAK hs|11q12.3 5
## 3 AHNAK hs|11q12.3 5
## 4 AHNAK hs|11q12.3 5
## 5 AHNAK hs|11q12.3 5
## 6 AHNAK hs|11q12.3 5
## sequence
## 1 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## disease foldChangeAD
## 1 Alzheimer's Disease 1.10655
## 2 Alzheimer's Disease 1.10655
## 3 Alzheimer's Disease 1.10655
## 4 Alzheimer's Disease 1.10655
## 5 Alzheimer's Disease 1.10655
## 6 Alzheimer's Disease 1.10655
Lets alter the cytoband field so that all values are the same format.
AD2 <- AD1
cyto1 <- strsplit(as.character(AD2$cytoband),split='|',fixed=TRUE)
cyto2 <- lapply(cyto1,'[',2)
AD2$cytoband <- as.character(paste(cyto2))
AD2$cytoband <- as.factor(AD2$cytoband)
head(AD2)
## gene cytoband cnv
## 1 AHNAK 11q12.3 5
## 2 AHNAK 11q12.3 5
## 3 AHNAK 11q12.3 5
## 4 AHNAK 11q12.3 5
## 5 AHNAK 11q12.3 5
## 6 AHNAK 11q12.3 5
## sequence
## 1 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## disease foldChangeAD
## 1 Alzheimer's Disease 1.10655
## 2 Alzheimer's Disease 1.10655
## 3 Alzheimer's Disease 1.10655
## 4 Alzheimer's Disease 1.10655
## 5 Alzheimer's Disease 1.10655
## 6 Alzheimer's Disease 1.10655
head(CC1)
## gene cytoband cnv
## 1 HYDIN 16q22.1f-q22.3a 17
## 2 HYDIN 16q22.1f-q22.3a 17
## 3 HYDIN 16q22.1f-q22.3a 17
## 4 SDHALP1 3q29f 13
## 5 SDHALP1 3q29f 13
## 6 SDHALP1 3q29f 13
## sequence disease
## 1 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG colon cancer
## 2 TGTGCATTGCCAGTCATTCCCATGCCTTTGCCACGGTGTCCTTCACCCCG colon cancer
## 3 CTTTGGGAGCCAGGACCCCCCTTTGGTATGTCACTTAAAGAGCGCTGGAG colon cancer
## 4 GTGATGACAGAATCAGCTTTTGTAATTATGTATAATAGCTCATGCATGTG colon cancer
## 5 AGCCCTCCTGGACGATTGGAACTGTAATGTGGAAAGGGCTCTGATGGAGC colon cancer
## 6 GTTCTTCAGATGAACTGATTTTTGTGCAGAGCACACGTGTTGGATTCTGC colon cancer
## FoldChange_CC1_to_CC0 FoldChange_CC2_to_CC0
## 1 7.153846 0.4615385
## 2 5.355556 -0.4666667
## 3 -11.375000 10.4583333
## 4 8.621622 -1.6756757
## 5 1.976190 -8.7857143
## 6 -6.272727 -15.3636364
CC2 <- CC1
CC2$cytoband <- gsub('-','|',CC2$cytoband)
CC2$cytoband <- gsub('[|].*[a-z]{1}$','', CC2$cytoband, perl=TRUE)
CC2$cytoband <- gsub('[a-z]{1}$','', CC2$cytoband, perl=TRUE)
CC2$cytoband <- as.factor(CC2$cytoband)
head(CC2)
## gene cytoband cnv sequence
## 1 HYDIN 16q22.1 17 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG
## 2 HYDIN 16q22.1 17 TGTGCATTGCCAGTCATTCCCATGCCTTTGCCACGGTGTCCTTCACCCCG
## 3 HYDIN 16q22.1 17 CTTTGGGAGCCAGGACCCCCCTTTGGTATGTCACTTAAAGAGCGCTGGAG
## 4 SDHALP1 3q29 13 GTGATGACAGAATCAGCTTTTGTAATTATGTATAATAGCTCATGCATGTG
## 5 SDHALP1 3q29 13 AGCCCTCCTGGACGATTGGAACTGTAATGTGGAAAGGGCTCTGATGGAGC
## 6 SDHALP1 3q29 13 GTTCTTCAGATGAACTGATTTTTGTGCAGAGCACACGTGTTGGATTCTGC
## disease FoldChange_CC1_to_CC0 FoldChange_CC2_to_CC0
## 1 colon cancer 7.153846 0.4615385
## 2 colon cancer 5.355556 -0.4666667
## 3 colon cancer -11.375000 10.4583333
## 4 colon cancer 8.621622 -1.6756757
## 5 colon cancer 1.976190 -8.7857143
## 6 colon cancer -6.272727 -15.3636364
head(GSC1)
## gene cytoband cnv
## 1 PDE4DIP hs|1q21.1 15
## 2 PDE4DIP hs|1q21.1 15
## 3 LOC100128869 hs|4q28.3 8
## 4 LOC729444 hs|22q11.21 7
## 5 LOC100131581 hs|8p23.1 7
## 6 LOC389834 hs|21p11.2 7
## sequence disease
## 1 CAAAAAATGGACTAGAAGAGAAGCTGGCTGAGGAGCTGAGATCAGCCTCGTGGCCTGGGT stomach cancer
## 2 CTACCAAAGCTGTGTATTTTTCATTTCTTCATGGCACTGTGCTGTTAATTTCTGTTAACA stomach cancer
## 3 TCAAAGACACACGCTCACTGCATGTGCTCTTGGGGGACGTCAGTGCCACGTTTGGTCACA stomach cancer
## 4 TTCCATCATACAGTCTTCAGCAACCTTGAAAGATTGGACAAGCTTCAGCCCACTCTTGAA stomach cancer
## 5 AGATAACTTCAGACTCAGCCTAGAAGTAAAGATCCTCCAGATATGACCTCAACTACCCTC stomach cancer
## 6 GGAACACTCTCTTAAAGTCTACTTATCCTATTCCTTCAAATTAAGACCAAAGTCCAATGT stomach cancer
## FoldChangeMedian_GSC_to_healthyCtrl
## 1 1.3059347
## 2 0.7574300
## 3 0.4429174
## 4 1.3486174
## 5 0.7360758
## 6 0.7229002
GSC2 <- GSC1
cyto1 <- strsplit(as.character(GSC2$cytoband),split='|',fixed=TRUE)
cyto2 <- lapply(cyto1,'[',2)
GSC2$cytoband <- as.character(paste(cyto2))
GSC2$cytoband <- as.factor(GSC2$cytoband)
head(GSC2)
## gene cytoband cnv
## 1 PDE4DIP 1q21.1 15
## 2 PDE4DIP 1q21.1 15
## 3 LOC100128869 4q28.3 8
## 4 LOC729444 22q11.21 7
## 5 LOC100131581 8p23.1 7
## 6 LOC389834 21p11.2 7
## sequence disease
## 1 CAAAAAATGGACTAGAAGAGAAGCTGGCTGAGGAGCTGAGATCAGCCTCGTGGCCTGGGT stomach cancer
## 2 CTACCAAAGCTGTGTATTTTTCATTTCTTCATGGCACTGTGCTGTTAATTTCTGTTAACA stomach cancer
## 3 TCAAAGACACACGCTCACTGCATGTGCTCTTGGGGGACGTCAGTGCCACGTTTGGTCACA stomach cancer
## 4 TTCCATCATACAGTCTTCAGCAACCTTGAAAGATTGGACAAGCTTCAGCCCACTCTTGAA stomach cancer
## 5 AGATAACTTCAGACTCAGCCTAGAAGTAAAGATCCTCCAGATATGACCTCAACTACCCTC stomach cancer
## 6 GGAACACTCTCTTAAAGTCTACTTATCCTATTCCTTCAAATTAAGACCAAAGTCCAATGT stomach cancer
## FoldChangeMedian_GSC_to_healthyCtrl
## 1 1.3059347
## 2 0.7574300
## 3 0.4429174
## 4 1.3486174
## 5 0.7360758
## 6 0.7229002
head(PC1)
## gene cytoband cnv
## 1 DMD Xp21.2a-p21.1d 8
## 2 HYDIN 16q22.1f-q22.3a 8
## 3 DMD Xp21.2a-p21.1d 8
## 4 ABCG1 21q22.3b 6
## 5 KCNIP4 4p15.31d-p15.31b 6
## 6 GPR177 1p31.3a 5
## sequence disease
## 1 GGCAAGTTCAACAGGATTTTGTGACCAGCGCAGGCTGGGCCTCCTTCTGC pancreatic cancer
## 2 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG pancreatic cancer
## 3 CCAAAAGGTATTTTGCGAAGCATCCCCGAATGGGCTACCTGCCAGTGCAG pancreatic cancer
## 4 CAGCACTTGTGAAGGATTGAATGCAGGTTCCAGGTGGAGGGAAGACGTGG pancreatic cancer
## 5 CTTGGAAGGGCTTGAAATGATAGCAGTTCTGATCGTCATTGTGCTTTTTG pancreatic cancer
## 6 TGAGATCTACAAGTTGACCCGCAAGGAGGCCCAGGAGTAGGAGGCTGCAG pancreatic cancer
## foldChangePC_epith foldChangePC_lymph
## 1 2.309727 0.4945016
## 2 1.967041 0.4460581
## 3 1.414647 0.4765914
## 4 3.787784 0.6271994
## 5 2.674067 0.5404346
## 6 8.289087 0.4977605
PC2 <- PC1
PC2$cytoband <- gsub('-','|',PC2$cytoband)
PC2$cytoband <- gsub('[|].*[a-z]{1}$','', PC2$cytoband, perl=TRUE)
PC2$cytoband <- gsub('[a-z]{1}$','', PC2$cytoband, perl=TRUE)
PC2$cytoband <- as.factor(PC2$cytoband)
head(PC2)
## gene cytoband cnv sequence
## 1 DMD Xp21.2 8 GGCAAGTTCAACAGGATTTTGTGACCAGCGCAGGCTGGGCCTCCTTCTGC
## 2 HYDIN 16q22.1 8 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG
## 3 DMD Xp21.2 8 CCAAAAGGTATTTTGCGAAGCATCCCCGAATGGGCTACCTGCCAGTGCAG
## 4 ABCG1 21q22.3 6 CAGCACTTGTGAAGGATTGAATGCAGGTTCCAGGTGGAGGGAAGACGTGG
## 5 KCNIP4 4p15.31 6 CTTGGAAGGGCTTGAAATGATAGCAGTTCTGATCGTCATTGTGCTTTTTG
## 6 GPR177 1p31.3 5 TGAGATCTACAAGTTGACCCGCAAGGAGGCCCAGGAGTAGGAGGCTGCAG
## disease foldChangePC_epith foldChangePC_lymph
## 1 pancreatic cancer 2.309727 0.4945016
## 2 pancreatic cancer 1.967041 0.4460581
## 3 pancreatic cancer 1.414647 0.4765914
## 4 pancreatic cancer 3.787784 0.6271994
## 5 pancreatic cancer 2.674067 0.5404346
## 6 pancreatic cancer 8.289087 0.4977605
head(UL_BC)
## gene cytoband cnv sequence
## 1 CTNNB1 3p22.1b 7 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG
## 2 CTNNB1 3p22.1b 7 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT
## 3 LOC202134 5q35.2d 7 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC
## 4 SLCO3A1 15q26.1d 6 CCTGTATGTCAGCATCGCCATCGCGCTCAAATCCTTCGCCTTCATCCTGT
## 5 SLCO3A1 15q26.1d 6 CCTCCTCTGGGGATGGCTTGCCGACATGCTTTAGAAACTCCCCTCTTGGG
## 6 LOC653673 <NA> 6 GGAAAAGGCATGGTGACAATGGGGGCTGTAGCCCTAGGAGAACGGGGGGA
## disease FoldChange_Median_UL1_to_nonUL1
## 1 benign UL beadchip 1.5744314
## 2 benign UL beadchip 1.4655024
## 3 benign UL beadchip 0.3998123
## 4 benign UL beadchip 1.4533565
## 5 benign UL beadchip 1.3643572
## 6 benign UL beadchip 1.2756026
UL_BC2 <- UL_BC
UL_BC2$cytoband <- gsub('-','|',UL_BC2$cytoband)
UL_BC2$cytoband <- gsub('[|].*[a-z]{1}$','', UL_BC2$cytoband, perl=TRUE)
UL_BC2$cytoband <- gsub('[a-z]{1}$','', UL_BC2$cytoband, perl=TRUE)
UL_BC2$cytoband <- as.factor(UL_BC2$cytoband)
UL_BC2$cytoband <- as.factor(UL_BC2$cytoband)
head(UL_BC2)
## gene cytoband cnv sequence
## 1 CTNNB1 3p22.1 7 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG
## 2 CTNNB1 3p22.1 7 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT
## 3 LOC202134 5q35.2 7 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC
## 4 SLCO3A1 15q26.1 6 CCTGTATGTCAGCATCGCCATCGCGCTCAAATCCTTCGCCTTCATCCTGT
## 5 SLCO3A1 15q26.1 6 CCTCCTCTGGGGATGGCTTGCCGACATGCTTTAGAAACTCCCCTCTTGGG
## 6 LOC653673 <NA> 6 GGAAAAGGCATGGTGACAATGGGGGCTGTAGCCCTAGGAGAACGGGGGGA
## disease FoldChange_Median_UL1_to_nonUL1
## 1 benign UL beadchip 1.5744314
## 2 benign UL beadchip 1.4655024
## 3 benign UL beadchip 0.3998123
## 4 benign UL beadchip 1.4533565
## 5 benign UL beadchip 1.3643572
## 6 benign UL beadchip 1.2756026
head(UL_MA)
## gene cytoband cnv
## 1 TPM3 hs|1q21.3 11
## 2 SPDYE3 hs|7q22.1 10
## 3 RPSA hs|3p22.1 8
## 4 LPP hs|3q28 8
## 5 ST13 hs|22q13.2 8
## 6 NAV2 hs|11p15.1 7
## sequence
## 1 AATTGCATTTTAATTGGGGCAGTTAGAATGTTGATTTCCTAACAGCATTGTGAAGTTGAC
## 2 ATCAGGAAGGAGGAGAGGAACCATTTGTGCAGATCATCTAGAAGAACCTGGACCATTCTT
## 3 CTGAGTCAGGTAGTGCAGTGGTTGTAAAGTGCAGCATATTCATTTCCAAGCTAGAGGGTT
## 4 TCGCTGACGCATTAGACTCAAGGTGGACTTAATTTGGAGATCACCTATGAAGACATCTGT
## 5 AATTATACAAGCATATTTAATGCTGAGTTTAATTTAATATGTAATACATATGGTAATTGT
## 6 CACATGGCTATTGATCTCTACTGCGGTTTGGCTTGTCTGTGGGGAATACATGAGCCCCGA
## disease FoldChange_MedianArray_UL1_to_nonUL1
## 1 benign UL microarray 0.6516203
## 2 benign UL microarray 0.6738332
## 3 benign UL microarray 1.6315379
## 4 benign UL microarray 0.7150036
## 5 benign UL microarray 0.7121234
## 6 benign UL microarray 1.6569822
UL_MA2 <- UL_MA
cyto1 <- strsplit(as.character(UL_MA2$cytoband),split='|',fixed=TRUE)
cyto2 <- lapply(cyto1,'[',2)
UL_MA2$cytoband <- as.character(paste(cyto2))
UL_MA2$cytoband <- as.factor(UL_MA2$cytoband)
head(UL_MA2)
## gene cytoband cnv
## 1 TPM3 1q21.3 11
## 2 SPDYE3 7q22.1 10
## 3 RPSA 3p22.1 8
## 4 LPP 3q28 8
## 5 ST13 22q13.2 8
## 6 NAV2 11p15.1 7
## sequence
## 1 AATTGCATTTTAATTGGGGCAGTTAGAATGTTGATTTCCTAACAGCATTGTGAAGTTGAC
## 2 ATCAGGAAGGAGGAGAGGAACCATTTGTGCAGATCATCTAGAAGAACCTGGACCATTCTT
## 3 CTGAGTCAGGTAGTGCAGTGGTTGTAAAGTGCAGCATATTCATTTCCAAGCTAGAGGGTT
## 4 TCGCTGACGCATTAGACTCAAGGTGGACTTAATTTGGAGATCACCTATGAAGACATCTGT
## 5 AATTATACAAGCATATTTAATGCTGAGTTTAATTTAATATGTAATACATATGGTAATTGT
## 6 CACATGGCTATTGATCTCTACTGCGGTTTGGCTTGTCTGTGGGGAATACATGAGCCCCGA
## disease FoldChange_MedianArray_UL1_to_nonUL1
## 1 benign UL microarray 0.6516203
## 2 benign UL microarray 0.6738332
## 3 benign UL microarray 1.6315379
## 4 benign UL microarray 0.7150036
## 5 benign UL microarray 0.7121234
## 6 benign UL microarray 1.6569822
Now we have our tables with the same formatted cytoband locations as factors. These tables are AD2, CC2, GSC2, PC2, UL_BC2, and UL_MA2. Lets create other tables of these six tables that have all the same values. Lets gather the fold change fields to bind all together. Four tables only have one fold change type, but the colon cancer and the pancreatic cancer have two types.
AD3 <- gather(AD2,'foldChange','foldChangeValue',6)
CC3 <- gather(CC2,'foldChange','foldChangeValue',6:7)
PC3 <- gather(PC2,'foldChange','foldChangeValue',6:7)
GSC3 <- gather(GSC2,'foldChange','foldChangeValue',6)
UL_BC3 <- gather(UL_BC2,'foldChange','foldChangeValue',6)
UL_MA3 <-gather(UL_MA2,'foldChange','foldChangeValue',6)
We can now row bind these tables together.
All_data <- rbind(AD3,CC3,PC3,GSC3,UL_BC3,UL_MA3)
All <- All_data[order(All_data$gene),]
head(All,20)
## gene cytoband cnv
## 1 AHNAK 11q12.3 5
## 2 AHNAK 11q12.3 5
## 3 AHNAK 11q12.3 5
## 4 AHNAK 11q12.3 5
## 5 AHNAK 11q12.3 5
## 6 AHNAK 11q12.3 5
## 7 AHNAK 11q12.3 5
## 8 AHNAK 11q12.3 5
## 9 AHNAK 11q12.3 5
## 10 AHNAK 11q12.3 5
## 11 AHNAK 11q12.3 5
## 12 AHNAK 11q12.3 5
## 13 AHNAK 11q12.3 5
## 14 AHNAK 11q12.3 5
## 15 AHNAK 11q12.3 5
## 16 AHNAK 11q12.3 5
## 17 AHNAK 11q12.3 5
## 18 AHNAK 11q12.3 5
## 19 AHNAK 11q12.3 5
## 20 AHNAK 11q12.3 5
## sequence
## 1 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 7 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 8 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 9 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 10 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 11 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 12 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 13 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 14 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 15 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 16 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 17 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 18 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 19 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 20 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## disease foldChange foldChangeValue
## 1 Alzheimer's Disease foldChangeAD 1.10655
## 2 Alzheimer's Disease foldChangeAD 1.10655
## 3 Alzheimer's Disease foldChangeAD 1.10655
## 4 Alzheimer's Disease foldChangeAD 1.10655
## 5 Alzheimer's Disease foldChangeAD 1.10655
## 6 Alzheimer's Disease foldChangeAD 1.10655
## 7 Alzheimer's Disease foldChangeAD 1.10655
## 8 Alzheimer's Disease foldChangeAD 1.10655
## 9 Alzheimer's Disease foldChangeAD 1.10655
## 10 Alzheimer's Disease foldChangeAD 1.10655
## 11 Alzheimer's Disease foldChangeAD 1.10655
## 12 Alzheimer's Disease foldChangeAD 1.10655
## 13 Alzheimer's Disease foldChangeAD 1.10655
## 14 Alzheimer's Disease foldChangeAD 1.10655
## 15 Alzheimer's Disease foldChangeAD 1.10655
## 16 Alzheimer's Disease foldChangeAD 1.10655
## 17 Alzheimer's Disease foldChangeAD 1.10655
## 18 Alzheimer's Disease foldChangeAD 1.10655
## 19 Alzheimer's Disease foldChangeAD 1.10655
## 20 Alzheimer's Disease foldChangeAD 1.10655
Lets use a DT library datatable to display the new table.
All_DT1 <- datatable(data=All, rownames=FALSE, width = 800, height = 700,
extensions=c('Buttons','Responsive','FixedColumns'),
#filter=list(position='top'),
options=list(pageLength=10,
dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,fixedColumns = TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
All_DT1
Lets now see what genes are in other diseases with slicing.
All2 <- All_data %>% select(gene,disease)
All3 <- All2[!duplicated(All2),]
All4 <- All3[order(All3$gene, decreasing=FALSE),]
head(All4)
## gene disease
## 1 AHNAK Alzheimer's Disease
## 26 CALD1 Alzheimer's Disease
## 42 DCLK1 Alzheimer's Disease
## 706 DCLK1 benign UL microarray
## 58 ELAVL4 Alzheimer's Disease
## 67 EPHA4 Alzheimer's Disease
Lets create a DT data table of this information showing the unique genes with their diseases associated with.
All_DT2 <- datatable(data=All4, rownames=FALSE, width = 800, height = 700,
extensions=c('Buttons','Responsive','FixedColumns'),
#filter=list(position='top'),
options=list(pageLength=10,
dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,fixedColumns = TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
All_DT2
We can see from the data table above that there are some genes that are associated with more than one disease. This will make for presumably an interesting visual network. Lets see exactly which genes are associated with more than one disease.
All5 <- All4 %>%
group_by(gene) %>%
mutate(geneDiseases = n()) %>%
select(gene,geneDiseases, disease) %>%
ungroup()
All6 <- subset(All5, All5$geneDiseases > 1)
All7 <- All6[order(All6$geneDiseases, decreasing=TRUE),]
All7
## # A tibble: 57 x 3
## gene geneDiseases disease
## <fct> <int> <fct>
## 1 DMD 3 colon cancer
## 2 DMD 3 pancreatic cancer
## 3 DMD 3 benign UL microarray
## 4 HYDIN 3 colon cancer
## 5 HYDIN 3 pancreatic cancer
## 6 HYDIN 3 benign UL microarray
## 7 KCNIP4 3 colon cancer
## 8 KCNIP4 3 pancreatic cancer
## 9 KCNIP4 3 benign UL beadchip
## 10 DCLK1 2 Alzheimer's Disease
## # … with 47 more rows
length(unique(All7$gene))
## [1] 27
unique(All7$gene)
## [1] DMD HYDIN KCNIP4 DCLK1 ABCC6 ABCG1 AGL BMP1 CAST
## [10] CTNNB1 FSD1 PTGER3 S100A13 WHSC1 CREB5 HIF3A ITIH5 MDK
## [19] MTUS1 NRG1 PHF20L1 RFX4 RUNX1T1 SORBS1 MCF2L TNXB SYNE2
## 345 Levels: AHNAK CALD1 DCLK1 ELAVL4 EPHA4 GABRG2 GAD1 KIAA1107 PNMAL1 ... ZRANB3
We know that there are 27 genes in our data associated with more than one disease. We should just plot these diseases with visNetwork. There are other possibilities in using link analysis with visNetwork for location of the genes that are in our 792 observational data of gene targets associated with six specific diseases. And that too could be interesting to compare by cytoband location which genes are neighbors in the nucleus where transcription starts. Here are some facts about gene expression:
-DNA information is transcribed into a special type of RNA called messenger RNA in the nucleus (mRNA)
mRNA is modified in the nucleus then is transported out to the cytoplasm
Protein synthesis, or translation, occurs in the cytoplasm with the help of transfer RNA molecules (tRNA) and ribosomes that contain ribosomal RNA molecules (rRNA)
A gene transcribed into RNA is said to be expressed
(Those aren’t my words, They were taken verbatum from week 1 of a graduate coure called ‘Data Systems of the Life Sciences.’)
So, we are working with the All7 table on genes that are associated with more than one of our six diseases. We will use the genes as our nodes, and the edges move from the nodes of diseases if they are associated with the other diseases, and the weights as counts of how many diseases each node is associated with. Lets now recall what fields are needed to build this visual network using visNetwork. We can always attach the cytoband location with a merge of All7 to All but only selecting the gene and cytoband fields.
We need a node data frame with an id, label, and location. The label can be the gene, and the location can be the disease. We can get the id from the row number in the unique genes and locations of the All5 table. The counts in the All5 table can be the weights. Then add in the cytoband information as the location, and the label as the gene. The edges table needs the weights from the counts and the from and to columns.
The nodes will have the id, label (gene), location (cytoband). Lets add the id field to the All5 table that groups by gene and counts the diseases per gene.
nodes <- All5
nodes$id <- as.factor(row.names(nodes))
nodes$label <- nodes$gene
nodes$location <- nodes$disease
nodes$weight <- nodes$geneDiseases
nodes1 <- nodes %>% select(id,label,location,weight)
head(nodes1)
## # A tibble: 6 x 4
## id label location weight
## <fct> <fct> <fct> <int>
## 1 1 AHNAK Alzheimer's Disease 1
## 2 2 CALD1 Alzheimer's Disease 1
## 3 3 DCLK1 Alzheimer's Disease 2
## 4 4 DCLK1 benign UL microarray 2
## 5 5 ELAVL4 Alzheimer's Disease 1
## 6 6 EPHA4 Alzheimer's Disease 1
This adds in a title that will be displayed when hovered over, and also adds in a group, so that the groups will be color coded and grouped by whichever the group is being set to. Zoom in with the interactive display buttons to view the names and cluster groups once the interactive plot is ran.
nodes3 <- All %>% select(gene,cytoband,disease)
nodes4 <- merge(nodes1,nodes3, by.x='label', by.y='gene')
nodes4$group <- nodes4$disease
nodes4$title <- nodes4$cytoband
nodes5 <- nodes4 %>% select(id,label,group,title,location,weight,cytoband)
head(nodes5)
## id label group title location weight cytoband
## 1 125 AAK1 stomach cancer 2p13.3 stomach cancer 1 2p13.3
## 2 125 AAK1 stomach cancer 2p13.3 stomach cancer 1 2p13.3
## 3 208 ABCB10 benign UL microarray 1q42.13 benign UL microarray 1 1q42.13
## 4 14 ABCC6 colon cancer 16p13.11 colon cancer 2 16p13.11
## 5 14 ABCC6 colon cancer 16p13.11 colon cancer 2 16p13.11
## 6 14 ABCC6 colon cancer 16p13.11 colon cancer 2 16p13.11
The edges will have the from (id), to (disease), label, weight (counts), and width(fraction of counts for arrow width)
nodes6 <- nodes5[!duplicated(nodes5$id),]
edges <- nodes6 %>% select(id, location, weight,cytoband)
edges$from <- edges$id
edges$width <- edges$weight/3
edges$label <- edges$location
to <- as.data.frame(unique(nodes5$location))
colnames(to) <- 'location'
to$to <- as.factor(row.names(to))
edges2 <- merge(edges,to, by.x='location', by.y='location')
edges3 <- edges2 %>% select(from,to,label,weight, width)
head(edges3)
## from to label weight width
## 1 3 6 Alzheimer's Disease 2 0.6666667
## 2 10 6 Alzheimer's Disease 1 0.3333333
## 3 7 6 Alzheimer's Disease 1 0.3333333
## 4 8 6 Alzheimer's Disease 1 0.3333333
## 5 2 6 Alzheimer's Disease 1 0.3333333
## 6 12 6 Alzheimer's Disease 1 0.3333333
Our nodes table is nodes3 and our edges table is edges3.
visNetwork(nodes=nodes6, edges=edges3, main='Genes Associated with Six Diseases') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=FALSE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend
Its great that the visual network worked but many genes are counted repeatedly and makes it cluttered. We should use the table with 57 genes of those genes that had an assoication with more than one disease in table All7. But regardless of the clutter we see that there are six clusters around the six disease types. When you zoom in on the clusters, the cluster with the most links to other diseases is the Alzheimer cluster that is associated with both UL studies, colon cancer, and stomach cancer, but not with pancreatic cancer. The UL microarray study was associated with genes in pancreatic cancer and Alzheimer disease. This is what the chart showed when we looked at it earlier, the All7 data table.
Here is the All7 datatable as a refresher of the connections seen above in the visNetwork link analysis plot.
All7_DT3 <- datatable(data=All7, rownames=FALSE, width = 800, height = 700,
extensions=c('Buttons','Responsive','FixedColumns'),
#filter=list(position='top'),
options=list(pageLength=10,
dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,fixedColumns = TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
All7_DT3
The above table All7 would probably make a much better visual network analysis to view those specific genes associated with multiple diseases. In this case, five diseases, of which two were different media type for the UL samples as beadchip or microarray. These diseases were selected because they had sequence information attached to their GEO sample and platform data to make our targeted gene selections by copy number variants of genes as well as fold change values in the 5th and 95th percentile approximately. These genes out of tens of thousands fit those threshold values.
nodes10 <- All7 %>% select(gene, disease)
nodes10$id <- row.names(nodes10)
nodes10$label <- nodes10$gene
nodes10$location <- nodes10$disease
nodes10$title <- nodes10$disease
nodes11 <- nodes10 %>% select(id,label,title,location)
head(nodes11,10)
## # A tibble: 10 x 4
## id label title location
## <chr> <fct> <fct> <fct>
## 1 1 DMD colon cancer colon cancer
## 2 2 DMD pancreatic cancer pancreatic cancer
## 3 3 DMD benign UL microarray benign UL microarray
## 4 4 HYDIN colon cancer colon cancer
## 5 5 HYDIN pancreatic cancer pancreatic cancer
## 6 6 HYDIN benign UL microarray benign UL microarray
## 7 7 KCNIP4 colon cancer colon cancer
## 8 8 KCNIP4 pancreatic cancer pancreatic cancer
## 9 9 KCNIP4 benign UL beadchip benign UL beadchip
## 10 10 DCLK1 Alzheimer's Disease Alzheimer's Disease
edges10 <- All7 %>% select(gene,disease, geneDiseases)
edges10$from <- row.names(All7)
to <- as.data.frame(unique(All7$disease))
colnames(to) <- 'disease'
to$to <- as.factor(row.names(to))
edges11 <- merge(to, edges10, by.x='disease', by.y='disease')
edges11$label <- edges11$disease
edges11$weight <- edges11$geneDiseases
edges11$width <- edges11$weight/2
edges12 <- edges11 %>% select(from, to, label, weight, width)
head(edges12,10)
## from to label weight width
## 1 10 5 Alzheimer's Disease 2 1.0
## 2 23 4 benign UL beadchip 2 1.0
## 3 39 4 benign UL beadchip 2 1.0
## 4 17 4 benign UL beadchip 2 1.0
## 5 49 4 benign UL beadchip 2 1.0
## 6 41 4 benign UL beadchip 2 1.0
## 7 21 4 benign UL beadchip 2 1.0
## 8 9 4 benign UL beadchip 3 1.5
## 9 27 4 benign UL beadchip 2 1.0
## 10 33 4 benign UL beadchip 2 1.0
visNetwork(nodes=nodes11, edges=edges12, main='Mulitple Disease Associations of 27 Unique Genes') %>% visEdges(arrows=c('from','middle','to')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout()
Lets compare this with the underlying data of our datatable All7_DT3.
All7_DT3
Lets see what this looks like to link the genes from cytoband location to other diseases in our link analysis.
all10 <- All %>% select(gene,cytoband,cnv,disease,foldChangeValue)
all10$foldChangeValue <- round(all10$foldChangeValue,1)
all11 <- all10[!duplicated(all10),]
row.names(all11) <- NULL
all12 <- all11[with(all11, order(cytoband)),]
row.names(all12) <- NULL
head(all12,10)
## gene cytoband cnv disease foldChangeValue
## 1 AHNAK 11q12.3 5 Alzheimer's Disease 1.1
## 2 ASRGL1 11q12.3 4 benign UL microarray 0.5
## 3 SCN3B 11q24.1 3 Alzheimer's Disease 0.9
## 4 DCLK1 13q13.3 4 Alzheimer's Disease 0.9
## 5 DCLK1 13q13.3 4 benign UL microarray 1.4
## 6 PRKCB 16p12.1 3 Alzheimer's Disease 0.9
## 7 PNMAL1 19q13.32 3 Alzheimer's Disease 0.9
## 8 HIF3A 19q13.32 5 pancreatic cancer 1.7
## 9 HIF3A 19q13.32 5 pancreatic cancer 0.3
## 10 HIF3A 19q13.32 4 benign UL microarray 0.7
Lets rename and rearrange the fields above.
all12$weight <- all12$cnv
all12$width <- all12$foldChangeValue
all12$id <- as.factor(row.names(all12))
all12$label <- all12$gene
all12$title <- all12$cytoband
all12$group <- all12$disease
all12$receiveLocation <- all12$disease
all12$sendLocation <- all12$gene
all12$from <- all12$id
to <- as.data.frame(unique(all12$receiveLocation))
colnames(to) <- 'receiveLocation'
to$to <- row.names(to)
all12b <- merge(all12,to, by.x='receiveLocation', by.y='receiveLocation')
all13 <- all12b %>% select(id,label,title,group,sendLocation,receiveLocation,weight,width,from,to)
head(all13,10)
## id label title group sendLocation receiveLocation
## 1 1 AHNAK 11q12.3 Alzheimer's Disease AHNAK Alzheimer's Disease
## 2 15 EPHA4 2q36.1 Alzheimer's Disease EPHA4 Alzheimer's Disease
## 3 3 SCN3B 11q24.1 Alzheimer's Disease SCN3B Alzheimer's Disease
## 4 4 DCLK1 13q13.3 Alzheimer's Disease DCLK1 Alzheimer's Disease
## 5 18 CALD1 7q33 Alzheimer's Disease CALD1 Alzheimer's Disease
## 6 6 PRKCB 16p12.1 Alzheimer's Disease PRKCB Alzheimer's Disease
## 7 7 PNMAL1 19q13.32 Alzheimer's Disease PNMAL1 Alzheimer's Disease
## 8 12 KIAA1107 1p22.1 Alzheimer's Disease KIAA1107 Alzheimer's Disease
## 9 13 ELAVL4 1p33 Alzheimer's Disease ELAVL4 Alzheimer's Disease
## 10 14 GAD1 2q31.1 Alzheimer's Disease GAD1 Alzheimer's Disease
## weight width from to
## 1 5 1.1 1 1
## 2 3 0.9 15 1
## 3 3 0.9 3 1
## 4 4 0.9 4 1
## 5 4 1.1 18 1
## 6 3 0.9 6 1
## 7 3 0.9 7 1
## 8 3 0.9 12 1
## 9 3 0.9 13 1
## 10 3 0.9 14 1
Build the nodes and edges tables. The weight in this set is the gene CNVs per disease. Previously it was the number of diseases each gene was associated with.
nodes15 <- all13 %>% select(id,label,title, group,sendLocation)
edges14 <- all13 %>% select(from,to,receiveLocation,weight,width)
edges14$label <- edges14$receiveLocation
edges15 <- edges14 %>% select(from,to,label,weight, width)
visNetwork(nodes=nodes15, edges=edges15, main='Cytoband Nodes and Disease Association') %>% visEdges(arrows=c('from','middle','to')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend(ncol=2)
Lets move some stuff around and see what happens.
all10 <- All %>% select(gene,cytoband,cnv,disease,foldChangeValue)
all10$foldChangeValue <- round(all10$foldChangeValue,1)
all11 <- all10[!duplicated(all10),]
row.names(all11) <- NULL
all12 <- all11[with(all11, order(cytoband)),]
row.names(all12) <- NULL
head(all12,10)
## gene cytoband cnv disease foldChangeValue
## 1 AHNAK 11q12.3 5 Alzheimer's Disease 1.1
## 2 ASRGL1 11q12.3 4 benign UL microarray 0.5
## 3 SCN3B 11q24.1 3 Alzheimer's Disease 0.9
## 4 DCLK1 13q13.3 4 Alzheimer's Disease 0.9
## 5 DCLK1 13q13.3 4 benign UL microarray 1.4
## 6 PRKCB 16p12.1 3 Alzheimer's Disease 0.9
## 7 PNMAL1 19q13.32 3 Alzheimer's Disease 0.9
## 8 HIF3A 19q13.32 5 pancreatic cancer 1.7
## 9 HIF3A 19q13.32 5 pancreatic cancer 0.3
## 10 HIF3A 19q13.32 4 benign UL microarray 0.7
Lets rename and rearrange the fields above.
all12$weight <- all12$cnv
all12$width <- all12$foldChangeValue
all12$id <- as.factor(row.names(all12))
all12$label <- all12$disease
all12$receiveLocation <- all12$gene
all12$sendLocation <- all12$disease
all12$from <- all12$id
to <- as.data.frame(unique(all12$receiveLocation))
colnames(to) <- 'receiveLocation'
to$to <- row.names(to)
all12b <- merge(all12,to, by.x='receiveLocation', by.y='receiveLocation')
all13 <- all12b %>% select(id,label,sendLocation,receiveLocation,weight,width,from,to)
head(all13,10)
## id label sendLocation receiveLocation weight width
## 1 393 stomach cancer stomach cancer AAK1 6 0.7
## 2 392 stomach cancer stomach cancer AAK1 6 0.8
## 3 545 benign UL microarray benign UL microarray ABCB10 4 0.7
## 4 72 colon cancer colon cancer ABCC6 6 3.5
## 5 73 colon cancer colon cancer ABCC6 6 -118.3
## 6 74 colon cancer colon cancer ABCC6 6 -3.9
## 7 75 colon cancer colon cancer ABCC6 6 -37.3
## 8 76 pancreatic cancer pancreatic cancer ABCC6 5 2.0
## 9 77 pancreatic cancer pancreatic cancer ABCC6 5 0.4
## 10 136 colon cancer colon cancer ABCG1 6 17.8
## from to
## 1 393 152
## 2 392 152
## 3 545 266
## 4 72 34
## 5 73 34
## 6 74 34
## 7 75 34
## 8 76 34
## 9 77 34
## 10 136 58
Build the nodes table.
nodes15 <- all13 %>% select(id,label,sendLocation)
edges14 <- all13 %>% select(from,to,receiveLocation,weight,width)
edges14$label <- edges14$receiveLocation
edges15 <- edges14 %>% select(from,to,label,weight, width)
visNetwork(nodes=nodes15, edges=edges15, main='Disease as Nodes and Genes as Edges') %>% visEdges(arrows=c('from','middle','to')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout()