Gene Sequence Data for Link Analysis

This Rmarkdown file will extract the genes having high Copy Number Variants (CNVs) and fold change expression values in their available data. The information is from the previous GEO data sets used earlier, that also had sequence information as well as gene information. These studies having gene sequence information are: Alzheimer’s Disease (AD), Colon Cancer, Stomach cancer, pancreatic cancer, a beadchip study done on uterine leiomyomas (UL), and a microarry study done on uterine leiomyomas (UL). Once these genes having the most CNVs and fold changes are found, a link network will be built to demonstrate how these genes are linked to any of these diseases if applicable. From the top genes of each disease just listed.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(DT)
library(visNetwork)

## Warning: package 'visNetwork' was built under R version 3.6.3

library(igraph)

## Warning: package 'igraph' was built under R version 3.6.3

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

The tables being used are from the following disease study files. - UL_study1_thresholds.csv - UL_microarray_thresholds.csv - GastricStomachCancer_Thresholds.csv - ColonCancerThresholds.csv - Pancreatic_Cancer_Thresholds.csv - useableAlzheimerStats.csv

These files will be uploaded to the github folder of files for this Rmarkdown script that creates a link analysis with visNetwork to display the genes associated with certain diseases and if associated with other diseases.

To skip straight through to the section that builds the visual network analysis of genes related to diseases, search for ‘%^&’ or !!!THE BEGINNING of our VISNETWORK!!! within RStudio. About line 870 of this Rmarkdown file.

Uterine leiomyomas Disease data table.

ULp <- read.delim('GPL21145-48548.txt', sep='\t', comment.char='#',
                 header=TRUE, na.strings=c('',' ', 'NA'))
head(ULp)

These two are from the same study with beadchip expressions.

UL1 <- read.csv('UL_GSE95101_GPL13376_table.csv', sep=',',header=TRUE,
                  na.strings=c('',' ','NA'))
nonUL1 <- read.csv('nonUL_GSE95101_GPL13376_table.csv', sep=',',header=TRUE,
                  na.strings=c('',' ','NA'))

These next two are from different studies with beadchip expressions. These have to be merged with the platform ULp we read in a couple Rmarkdown chunks ago.

UL2 <- read.csv('UL_GSE120854_GPL23767_table.csv', sep=',',header=TRUE,
                  na.strings=c('',' ','NA'))
nonUL2 <- read.csv('nonUL_GSE120854_GPL23767_table.csv', sep=',',header=TRUE,
                  na.strings=c('',' ','NA'))

Keep only the columns needed of ULp, ID, SourceSeq, and CHR.There is no gene symbol to bind to this data table platform GPL21145.

ULP <- ULp %>% select(ID,SourceSeq,CHR)
UL2b <- UL2 %>% select(-X,-SPOT_ID)
nonUL2b <- nonUL2 %>% select(-X,-SPOT_ID)

colnames(UL2b)[2:25] <- paste('UL',colnames(UL2b)[2:25],sep='_')
colnames(nonUL2b)[2:11] <- paste('nonUL', colnames(nonUL2b)[2:11],sep='_')

UL_2study <- merge(UL2b,nonUL2b, by.x='ID', by.y='ID') 

UL_study2 <- merge(ULP, UL_2study, by.x='ID', by.y='ID')
str(UL_study2)

Remove the ‘null’ values that aren’t being recognized by the na.strings or complete.cases. And also the numeric gene expression values are being read in as factors. Do this by writing it to csv and reading it back in and declaring ‘null’ as an na.string.

write.csv(UL_study2, 'UL_study2.csv', row.names=FALSE)

UL_study2 <- read.csv('UL_study2.csv', sep=',', header=TRUE, na.strings=c('',' ','NA','null'))
str(UL_study2)

Thats great that it now recognizes each column as a numeric data type and that the null values are now being recognized as NA.

Now, lets use complete.cases to get only those observations with complete cases and remove the NA values.

UL_study2_cc <- UL_study2[complete.cases(UL_study2),]

obsRM <- length(UL_study2$ID) - length(UL_study2_cc$ID)

cat('The number of observations removed is: ', obsRM,'.')

Its too bad that we cannot really do anything with this data, unless we open up ubuntu Linux and use the 3.4 version of R to use Bioconductor to get the gene names. But we can at least use the sequences to see if there are any with copies. Not necessarily CNVs because they would have to have the gene names to see what insertions, deletions, skips, etc make the gene a CNV. We can always extract what we need from it now, then get the gene information with a VM of Ubuntu, download the R version and use the tutorial (assuming it works exactly as demonstrated) online.

The following are the requirements using the above online tutorial: - R version 3.4.1 (2017-06-30) - Platform: x86_64-pc-linux-gnu (64-bit) - Running under: Ubuntu 16.04.2 LTS

So lets get the median and foldchange median values. We also don’t have the cytoband location, but we do have the chromosome for this set.

row.names(UL_study2_cc) <- UL_study2_cc$ID

ul <- grep('^UL', colnames(UL_study2_cc))
UL2_t <- UL_study2_cc[,ul]
nUL2_t <- UL_study2_cc[,-c(1:3,ul)]

Get the row medians of the UL and nonUL 2nd study data.

UL2_t$UL_Median <- apply(UL2_t,1,median)
nUL2_t$nonUL_Median <- apply(nUL2_t,1,median)

FoldChange_Median_UL_to_nonUL <- UL2_t$UL_Median/nUL2_t$nonUL_Median

UL_study2_cc$FoldChange_Median_UL_to_nonUL <- FoldChange_Median_UL_to_nonUL

UL2_cnv <- UL_study2_cc %>%
  group_by(SourceSeq) %>%
  mutate(Sequence_CNVs = n()) %>%
  select(ID,SourceSeq,Sequence_CNVs, CHR,FoldChange_Median_UL_to_nonUL, everything()) %>%
  ungroup() %>%
  unique()

The above has all unique CNVs of the sequences, as should be expected since there was no error making the row names the ID field.

write.csv(UL2_cnv, 'UL_study2_needsGeneData.csv', row.names=FALSE)

Now lets get to the first study of UL gene expression data that we can use to add to our network analysis of genes related to diseases. These two tables are the UL1 and nonUL1 tables.

nonUL1b <- nonUL1 %>% select(Symbol,SEQUENCE,Chromosome,Cytoband,GSM2496194:GSM2496222)
UL1b <- UL1 %>% select(Symbol,SEQUENCE,Chromosome,Cytoband,GSM2496185:GSM2496220)
colnames(nonUL1b)[5:22] <- paste('nonUL',colnames(nonUL1b)[5:22],sep='_')
colnames(UL1b)[5:24] <- paste('UL',colnames(UL1b)[5:24],sep='_')

Add the Median fields to each table.

UL1c <- UL1b[,5:24]
nUL1c <- nonUL1b[,5:22]

UL1c$UL1_Median <- apply(UL1c,1,median)
nUL1c$nonUL1_Median <- apply(nUL1c,1,median)

FC <- UL1c$UL1_Median/nUL1c$nonUL1_Median

mUL1 <- UL1b[,1:4]
UL_study1 <- cbind(mUL1,UL1c,nUL1c)

UL_study1$FoldChange_Median_UL1_to_nonUL1 <- FC

UL_study1b <- UL_study1 %>% select(Symbol,SEQUENCE,Chromosome,Cytoband,
                                   FoldChange_Median_UL1_to_nonUL1,
                                   UL1_Median,nonUL1_Median,
                                   everything())

There are some missing values in the chromosome and cytoband fields, but we don’t need to omit them yet.

colSums(is.na(UL_study1b))

The above shows that there aren’t any missing values in the samples or the sequence fields, but there are some in the gene symbol, chromosome, and cytoband fields.

Lets get the CNVs of the genes now, but remove the NAs from the Symbol field first.

UL_study1b2 <- UL_study1b[complete.cases(UL_study1b$Symbol),]
UL_study1_cnv <- UL_study1b2 %>%
  group_by(Symbol) %>%
  mutate(Symbol_CNVs = n()) %>%
  select(Symbol,Symbol_CNVs,everything()) %>%
  ungroup() %>%
  unique()

Lets look at what we have for our first few fields and the values.

head(UL_study1_cnv[,1:6])

This is going to grab our target genes for those genes that have a CNV gene count greater than the top 90th percentile and also is either of the bottom 5th percentile or top 95th percentile for fold change median values of UL to nonUL gene expression values.

UL_study1_thresholds <- subset(UL_study1_cnv, 
                          UL_study1_cnv$Symbol_CNVs > quantile(UL_study1_cnv$Symbol_CNVs,.9) &
                          (UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1 < 
                            quantile(UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1,.05) |
                          UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1 > 
                            quantile(UL_study1_cnv$FoldChange_Median_UL1_to_nonUL1,.95))
                        )

Lets now order this table by CNV and Fold Change values.

UL_study1_ordered <- UL_study1_thresholds[with(UL_study1_thresholds,
                        order(Symbol_CNVs,FoldChange_Median_UL1_to_nonUL1,
                              decreasing=TRUE)),]

Lets take a look at the top few rows and first few columns of this table of threshold value target genes for UL.

head(UL_study1_ordered[,1:7])

The table is ordered by CNVs first then by fold change values within the top CNVs. Lets write this to csv and use it later for our visNetwork. So far, it looks like either the Chromosome or the cytoband could be used for location of the nodes. But we could also group the genes by link to gene disease networks, gene networks, or anything else we get ideas about later.

write.csv(UL_study1_ordered,'UL_study1_thresholds.csv',row.names=FALSE)

UL microarray study Uterine leiomyomas Disease data table (microarray data). These two are from the same study with microarray expressions.

UL1 <- read.csv('UL_GSE68295_GPL6480_table.csv', sep=',',header=TRUE,
                  na.strings=c('',' ','NA'))
nonUL1 <- read.csv('nonUL_GSE68295_GPL6480_table.csv', sep=',',header=TRUE,
                  na.strings=c('',' ','NA'))

colnames(UL1)
colnames(nonUL1)

Lets keep only the columns interested in. Those are

nonUL1b <- nonUL1 %>% select(GENE_SYMBOL,SEQUENCE,CYTOBAND,GSM1667144:GSM1667146)
UL1b <- UL1 %>% select(GENE_SYMBOL,SEQUENCE,CYTOBAND,GSM1667147:GSM1667149)


colnames(nonUL1b)[4:6] <- paste('nonUL',colnames(nonUL1b)[4:6],sep='_')
colnames(UL1b)[4:6] <- paste('UL',colnames(UL1b)[4:6],sep='_')

Add the Median fields to each table.

UL1c <- UL1b[,4:6]
nUL1c <- nonUL1b[,4:6]

UL1c$UL1_MedianArray <- apply(UL1c,1,median)
nUL1c$nonUL1_MedianArray <- apply(nUL1c,1,median)

FC <- UL1c$UL1_MedianArray/nUL1c$nonUL1_MedianArray

mUL1 <- UL1b[,1:3]
UL_study1 <- cbind(mUL1,UL1c,nUL1c)

UL_study1$FoldChange_MedianArray_UL1_to_nonUL1 <- FC

UL_study1b <- UL_study1 %>% select(GENE_SYMBOL,SEQUENCE,CYTOBAND,
                                   FoldChange_MedianArray_UL1_to_nonUL1,
                                   UL1_MedianArray,nonUL1_MedianArray,
                                   everything())

There are some missing values in the GENE_SYMBOL, SEQUENCE, and CYTOBAND fields, but we don’t need to omit them yet.

colSums(is.na(UL_study1b))

The above shows that there aren’t any missing values in the samples or the sequence fields, but there are some in the gene symbol, chromosome, and cytoband fields.

Lets get the CNVs of the genes now, but remove the NAs from the Symbol field first.

UL_study1b2 <- UL_study1b[complete.cases(UL_study1b$GENE_SYMBOL),]

UL_study1_cnv <- UL_study1b2 %>%
  group_by(GENE_SYMBOL) %>%
  mutate(GENE_SYMBOL_CNVs = n()) %>%
  select(GENE_SYMBOL,GENE_SYMBOL_CNVs,everything()) %>%
  ungroup() %>%
  unique()

Lets look at what we have for our first few fields and the values.

head(UL_study1_cnv[,1:6])

UL_study1_thresholds <- subset(UL_study1_cnv, 
                          UL_study1_cnv$GENE_SYMBOL_CNVs > quantile(UL_study1_cnv$GENE_SYMBOL_CNVs,.9) &
                          (UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1 < 
                            quantile(UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1,.05) |
                          UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1 > 
                            quantile(UL_study1_cnv$FoldChange_MedianArray_UL1_to_nonUL1,.95))
                        )

Lets now order this table by CNV and Fold Change values.

UL_study1_ordered <- UL_study1_thresholds[with(UL_study1_thresholds,
                        order(GENE_SYMBOL_CNVs,FoldChange_MedianArray_UL1_to_nonUL1,
                              decreasing=TRUE)),]

Lets take a look at the top few rows and first few columns of this table of threshold value target genes for UL.

head(UL_study1_ordered[,1:7])

write.csv(UL_study1_ordered,'UL_microarray_thresholds.csv',row.names=FALSE)

Gastric Stomach Cancer data table.

GSC <- read.delim('GSE64916_series_matrix.txt', sep='\t', comment.char='!',
                 header=TRUE, na.strings=c('',' ', 'NA'))
head(GSC)

These are samples of gastric stomach cancer gene expressions in located peripheral blood. The samples GSM1583284, GSM1583285, GSM1583286, and GSM1583287 are GSC, and the control of this study is from the blood of a person without gastric stomach cancer in GSM1583288.

Lets label these columns to be their sample types.

colnames(GSC) <- c("ID","GSC1","GSC2","GSC3","GSC4","noGSC_ctrl")
head(GSC)

GSCp <- read.delim('GPL13497-9755.txt', sep='\t',header=TRUE, comment.char='#',
                  na.strings=c('',' ','NA'))
head(GSCp)

Lets keep the ID, GENE_SYMBOL, CYTOBAND, and SEQUENCE columns of our platform meta data.

GSC_P <- GSCp %>% select(ID, GENE_SYMBOL, CYTOBAND, SEQUENCE)
head(GSC_P)

Now lets combine the two data tables of sample and meta information.

GSC2 <- merge(GSC_P, GSC, by.x='ID',by.y='ID')
head(GSC2)

We will now remove the NAs from the Gene Symbol field.

GSC3 <- GSC2[complete.cases(GSC2$GENE_SYMBOL),]
head(GSC3)

Lets group by gene and get the count of each gene and attach it to the sequence information.

GSC4 <- GSC3 %>%
  group_by(GENE_SYMBOL) %>%
  mutate(GENE_CNVs = n()) %>%
  select(GENE_SYMBOL, GENE_CNVs, everything()) %>%
  ungroup() %>%
  unique()

colnames(GSC4)

Now, we will add the fold change of median values of the GSC samples to the control sample of a healthy person.

GSC4$GSC_Median <- apply(GSC4[6:9],1,median)
GSC4$FoldChangeMedian_GSC_to_healthyCtrl <- GSC4$GSC_Median/GSC4$noGSC_ctrl

GSC5 <- GSC4 %>% select(GENE_SYMBOL, GENE_CNVs,-ID,CYTOBAND,SEQUENCE,
                        FoldChangeMedian_GSC_to_healthyCtrl,GSC_Median, GSC1:noGSC_ctrl)
colnames(GSC5)

Lets get the subset of GSC that has the genes with a high number of CNVs and a high Fold change by median values.

GSC_genes <- subset(GSC5, GSC5$GENE_CNVs > median(GSC5$GENE_CNVs) &
                (GSC5$FoldChangeMedian_GSC_to_healthyCtrl < 
                   quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.05) |
                 GSC5$FoldChangeMedian_GSC_to_healthyCtrl > 
                   quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.95)
                )
             )

length(unique(GSC_genes$GENE_SYMBOL))

That is too many genes to make the targeted gastric stomach cancer genes, so we will raise our CNV threshold to the top 99th percentile.

GSC_genes <- subset(GSC5, GSC5$GENE_CNVs > quantile(GSC5$GENE_CNVs,0.99) &
                (GSC5$FoldChangeMedian_GSC_to_healthyCtrl < 
                   quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.05) |
                 GSC5$FoldChangeMedian_GSC_to_healthyCtrl > 
                   quantile(GSC5$FoldChangeMedian_GSC_to_healthyCtrl, 0.95)
                )
             )

length(unique(GSC_genes$GENE_SYMBOL))

Lets order this table of targeted genes by CNV count then by decreasing value of fold change from GSC to healthy as a ratio.

GSC6 <- GSC_genes[with(GSC_genes, order(GENE_CNVs,FoldChangeMedian_GSC_to_healthyCtrl,
                              decreasing=TRUE)),]
head(GSC6)

Lets write this file out to csv and work on the other disease data sets separate from this one.

write.csv(GSC6, 'GastricStomachCancer_Thresholds.csv', row.names=FALSE)

Colon cancer data table.

CC <- read.delim('GSE135749_series_matrix.txt', sep='\t', comment.char='!',
                 header=TRUE, na.strings=c('',' ', 'NA'))
head(CC)

CCp <- read.delim('GPL10558-50081.txt', sep='\t',header=TRUE, comment.char='#',
                  na.strings=c('',' ','NA'))
head(CCp)

colnames(CC)

colnames(CCp)

Keep only the SYMBOL, ID, Cytoband, and SEQUENCE columns of the CC platform to combine with the CC samples by ID_REF.

CC_p <- CCp %>% select(ID, Symbol,SEQUENCE,Cytoband)

The CC data set will be produced by this next command.

CC_1 <- merge(CC_p,CC, by.x='ID', by.y='ID_REF')
head(CC_1)

Note: Since these are quantile normalized, and there are various ways and reasons to do this, but the purpose being primarily to eliminate noise, we will keep the values, for our purposes, we are still getting those threshold genes that are in the bottom 5th and top 95th percentile while also having more CNVs than the median CNVs of all genes in this set.

We should separate the columns that are samples of colon cancer tumors without treatment of the short hairpin RNAs (shRNA) targeting two different RNAs of LGR5. The first sample columns are the control colon cancer samples, the next two are the first type of LGR5 being knocked down with a shRNA, and the last two samples as columns are the 2nd type of LGR5 being knocked down with the 2nd type of shRNA.

These are colon cancer tumor cells. with two replicates of each sample type or group. We will prepend cc0, cc1,cc2 to these 6 samples for if no shRNA knockdown (cc0), the 1st type shRNA (cc1), or the 2nd type shRNA (cc2) is what the sample is a product of procedure type.

colnames(CC_1)[5:6] <- paste('cc0', colnames(CC_1)[5:6],sep='_')
colnames(CC_1)[7:8] <- paste('cc1', colnames(CC_1)[7:8],sep='_')
colnames(CC_1)[9:10] <- paste('cc2', colnames(CC_1)[9:10],sep='_')

colnames(CC_1)

Now, lets add in the median values of the samples of these three groups of control, knockdown 1, and knockdown 2. The median values will be knockdown 1 compared to control, and knockdown 2 compared to control.We want to be consistent with the methods of our other disease type sample extraction of target genes, so even though there are only two samples of each group, the median is used. Notice the median seems to be programmed to be the mean when there are only two samples.

cc0 <- apply(CC_1[5:6],1,median)
cc1 <- apply(CC_1[7:8],1,median)
cc2 <- apply(CC_1[9:10],1,median)

FoldChange_CC1_to_CC0 <- cc1/cc0
FoldChange_CC2_to_CC0 <- cc2/cc0

Add these median and fold change values to our colon cancer table, CC_1.

CC_1$ctrl_CC_median <- cc0
CC_1$group1_CC_median <- cc1
CC_1$group2_CC_median <- cc2
CC_1$FoldChange_CC1_to_CC0 <- FoldChange_CC1_to_CC0
CC_1$FoldChange_CC2_to_CC0 <- FoldChange_CC2_to_CC0

colnames(CC_1)

Rearrange the columns to see the stats and meta information before the sample information.

CC2 <- CC_1 %>% select(ID:Cytoband, FoldChange_CC1_to_CC0:FoldChange_CC2_to_CC0,
                       ctrl_CC_median:group2_CC_median,everything())
colnames(CC2)

Lets group by gene and get the count of each gene and attach it to the sequence information.

CC3 <- CC2[complete.cases(CC2$Symbol),]

CC4 <- CC3 %>%
  group_by(Symbol) %>%
  mutate(Symbol_CC_CNVs = n()) %>%
  select(Symbol, Symbol_CC_CNVs, everything()) %>%
  ungroup() %>%
  unique()

Now, lets get our threshold values so that the genes with CNVs greater than the median value of CNVs of genes in our data set are selected and also those genes in that group but also having a threshold of being in the bottom 5th and top 95th percentile for fold change values of group1/control and group2/control are selected as target genes.

CC5 <- subset(CC4, 
              (CC4$Symbol_CC_CNVs > median(CC4$Symbol_CC_CNVs)) &
               (CC4$FoldChange_CC1_to_CC0 < quantile(CC4$FoldChange_CC1_to_CC0,.05, na.rm=TRUE) 
               | 
                CC4$FoldChange_CC1_to_CC0 > quantile(CC4$FoldChange_CC1_to_CC0,.95, na.rm=TRUE)
               )
              |  
              (CC4$Symbol_CC_CNVs > median(CC4$Symbol_CC_CNVs))  
               &            
              (CC4$FoldChange_CC2_to_CC0 < quantile(CC4$FoldChange_CC2_to_CC0,.05, na.rm=TRUE) 
               |               
               CC4$FoldChange_CC2_to_CC0 > quantile(CC4$FoldChange_CC2_to_CC0,.95, na.rm=TRUE)) 
              )

There are a lot of observations for this threshold dataset of targeted colon cancer genes. Lets see how many genes are unique.

length(unique(CC5$Symbol))

That is a lot of genes as gene targets, so lets set a higher threshold for the genes with CNVs higher than the 97th percentile.

CC6 <- subset(CC4, 
              (CC4$Symbol_CC_CNVs > quantile(CC4$Symbol_CC_CNVs,.97)) &
               (CC4$FoldChange_CC1_to_CC0 < quantile(CC4$FoldChange_CC1_to_CC0,.05, na.rm=TRUE) 
               | 
                CC4$FoldChange_CC1_to_CC0 > quantile(CC4$FoldChange_CC1_to_CC0,.95, na.rm=TRUE)
               )
              |  
              (CC4$Symbol_CC_CNVs > quantile(CC4$Symbol_CC_CNVs,.97))  
               &            
              (CC4$FoldChange_CC2_to_CC0 < quantile(CC4$FoldChange_CC2_to_CC0,.05, na.rm=TRUE) 
               |               
               CC4$FoldChange_CC2_to_CC0 > quantile(CC4$FoldChange_CC2_to_CC0,.95, na.rm=TRUE)) 
              )

This dataset is much better to work with for targeted genes based on our set threshold limits as discriminates. Lets see how many are unique.

length(CC6$Symbol)

Lets order this data table by CNVs then by fold change decreasing by group1 then by group2.

CC7 <- CC6[with(CC6, order(Symbol_CC_CNVs,FoldChange_CC1_to_CC0,
                           FoldChange_CC2_to_CC0, decreasing=TRUE)),]

If we can’t get the CYTOBAND for the other diseases then we will leave it out or add it in later if there are few to look up and add manually. Lets write this file out to csv and work on the other disease data sets separate from this one.

write.csv(CC7, 'ColonCancerThresholds.csv', row.names=FALSE)

Pancreatic cancer data table.

PC <- read.delim('GSE131859_non-normalized.txt', sep='\t', comment.char='!',
                 header=TRUE, na.strings=c('',' ', 'NA'))
head(PC)

PCp <- read.delim('GPL14951-11332.txt', sep='\t',header=TRUE, comment.char='#',
                  na.strings=c('',' ','NA'))
head(PCp)

colnames(PC)

These samples have the pvalue scores attached to the sample microarray readings. We aren’t interested in these pval fields, so we will remove them and the added X field when loading of the row numbers when saved to csv from other file.

PC1 <- PC %>% select(-MiaPACA2_EV.Detection_Pval,-MiaPACA2_ARHGEF10.Detection_Pval,
                     -Hs766T_shGFP.Detection_Pval,-Hs766T_shARHGEF10.Detection_Pval,-X)
head(PC1)

We have four samples of pancreatic cancer. MiaPACA2_EV is the control of pancreatic cancer epithelial cells. The idea is to show that ARHGEF10 is a tumor suppressor by examining two cell lines of available commercial pancreatic cancer cells where MiaPACA2 is directly from the epithelial pancreatic cells and Hs766T are from the lymph nodes of someone with pancreatic cancer. MiaPACA2_ARHGEF10 sample is when ARHGEF10 is overexpressed to demonstrate suppression of pancreatic cancer cells not mutating. The Hs766T_shGFP is the sample where the short hair pin RNA additive to eliminate the tumor suppressor ARHGEF10 wasn’t added and is the lymph node pancreatic cancer control. The Hs766T_shARHGEF10 is when the Hs766T cell line of pancreatic cancer cells had the ARHGEF10 gene knocked down or suppressed. Lets rename these column samples by what they are,

colnames(PC1) <- c('ID_REF','PC_epithelial_ctrl','PC_epithelial_SuppressorAdded',
                   'PC_lymph_ctrl','PC_lymph_SuppressorKnockedOut')

colnames(PCp)

Keep only the Symbol, Cytoband, ID, and SEQUENCE columns of the PC platform to combine with the PC samples by ID_REF and GENE_SYMBOL.

PC_p <- PCp %>% select(ID, Symbol, SEQUENCE, Cytoband)
head(PC_p)

The PC data set will be produced by this next chunk.

PC2 <- merge(PC_p,PC1, by.x='ID', by.y='ID_REF')
head(PC2)

Since there is only one sample of each, it is redundant to get the median of each type, so we will get the fold change values of the suppresor added epithelial PC cells to the epithelial PC cells control, and the fold change values of the lymph node PC cells with the suppressor knocked out or suppressed to the lymph node PC cells control.

FC_epithelial_PC <- PC2$PC_epithelial_SuppressorAdded/PC2$PC_epithelial_ctrl
FC_lymphNode_PC <- PC2$PC_lymph_SuppressorKnockedOut/PC2$PC_lymph_ctrl

PC2$FC_epithelial_PC <- FC_epithelial_PC
PC2$FC_lymphNode_PC <- FC_lymphNode_PC

PC3 <- PC2 %>% select(ID:Cytoband, FC_epithelial_PC,FC_lymphNode_PC, everything())

Lets group by gene and get the count of each gene and attach it to the sequence information.

PC4 <- PC3 %>%
  group_by(Symbol) %>%
  mutate(GENE_CNVs = n()) %>%
  select(Symbol, GENE_CNVs, everything()) %>%
  ungroup() %>%
  unique()

colnames(PC4) <- gsub('FC_','FoldChange_', colnames(PC4))
colnames(PC4)

PC5 <- PC4[with(PC4, order(GENE_CNVs, FoldChange_epithelial_PC, 
                                        FoldChange_lymphNode_PC, decreasing=TRUE)),]

Get the subset of AD that has the genes with a high number of CNVs and a high Fold change by median values.

PC_genes <- subset(PC5, PC5$GENE_CNVs > median(PC5$GENE_CNVs, na.rm=TRUE) &
                    (PC5$FoldChange_epithelial_PC < 
                       quantile(PC5$FoldChange_epithelial_PC,0.05, na.rm=TRUE) |
                    PC5$FoldChange_lymphNode_PC > 
                      quantile(PC5$FoldChange_lymphNode_PC, .95, na.rm=TRUE))
                  )

length(unique(PC_genes))

This is a lot of unique genes, so lets filter more by making the CNVs to be in the top 95th percentile of CNV counts instead of greater than the median of gene CNVs.

PC_genes2 <- subset(PC5, PC5$GENE_CNVs > quantile(PC5$GENE_CNVs, 0.95,na.rm=TRUE) &
                    (PC5$FoldChange_epithelial_PC < 
                       quantile(PC5$FoldChange_epithelial_PC,0.05, na.rm=TRUE) |
                    PC5$FoldChange_lymphNode_PC > 
                      quantile(PC5$FoldChange_lymphNode_PC, .95, na.rm=TRUE))
                  )
length(unique(PC_genes2))

That is a much better quantity to work with for targeted Pancreatic Cancer genes. Lets order this by CNVs then by fold change epithelial then fold change lymph.

PC6 <- PC_genes2[with(PC_genes2, order(GENE_CNVs,FoldChange_epithelial_PC,
                                       FoldChange_lymphNode_PC, decreasing=TRUE)),]
head(PC6)

Lets drop the ID field.

PC7 <- PC6 %>% select(-ID)
head(PC7)

Lets write this file out to csv and work on the other disease data sets separate from this one.

write.csv(PC7, 'Pancreatic_Cancer_Thresholds.csv', row.names=FALSE)

Alzheimer’s Disease data table.

AD <- read.delim('GSE109887_series_matrix.txt', sep='\t', comment.char='!',
                 header=TRUE, na.strings=c('',' ', 'NA'))
head(AD)

ADp <- read.delim('GPL4133-12599.txt', sep='\t',header=TRUE, comment.char='#',
                  na.strings=c('',' ','NA'))
head(ADp)

colnames(AD)

colnames(ADp)

Keep only the GENE_SYMBOL and SEQUENCE columns of the AD platform to combine with the AD samples by ID_REF and GENE_SYMBOL.

AD_p <- ADp[,c(10,20)]

The AD data set will be produced by this next chunk.

AD_1 <- merge(AD_p,AD, by.x='GENE_SYMBOL', by.y='ID_REF')
head(AD_1)

We should separate the columns that are samples of AD patients to those that are the control samples of AD patients.

metaAD <- read.csv('AlzheimerAgeGenderTissueSamplesMeta.csv',sep=',', header=TRUE,
                   na.strings=c('',' ','NA'))
head(metaAD)

row.names(metaAD) <- metaAD$sampleID

meta_AD <- as.data.frame(t(metaAD))
head(meta_AD)

Get the indices of the AD and control patients

ad <- grep('AD', meta_AD$disease)
#add 1 because of the additional sequence field in AD_1 not in metaAD
ad <- ad+1

Paste the type of sample to the sample ID fields.

colnames(AD_1)[ad] <- paste('AD',colnames(AD_1)[ad],sep='_')
colnames(AD_1)[-c(1,2,ad)] <- paste('ctrl_AD',colnames(AD_1)[-c(1,2,ad)], sep='_')

Create separate tables for the AD and the control patients.

ad1 <- AD_1[,ad]
ctrl_ad1 <- AD_1[,-c(1,2,ad)]

Row Means of each subset.

ad1$Mean_AD <- rowMeans(ad1)
ctrl_ad1$Mean_ctrl_AD <- rowMeans(ctrl_ad1)

ad1$Median_AD <- apply(ad1,1,median)
ctrl_ad1$Median_ctrl_AD <- apply(ctrl_ad1,1,median)

Now combine these new value fields to the original table for AD and control mean and median.

AD_1$Mean_AD <- ad1$Mean_AD
AD_1$Mean_ctrl_AD <- ctrl_ad1$Mean_ctrl_AD
AD_1$Median_AD <- ad1$Median_AD
AD_1$Median_ctrl_AD <- ctrl_ad1$Median_ctrl_AD

Remove duplicate entries, the genes even thought they have different sequence values, are showing duplicate or identical gene expression values by gene.

AD2 <- AD_1[!duplicated(AD_1),]

AD2$FoldChange_Mean_AD_2_ctrl <- AD2$Mean_AD/AD2$Mean_ctrl_AD
AD2$FoldChange_Median_AD_2_ctrl <- AD2$Median_AD/AD2$Median_ctrl_AD

head(AD2)[,c(1,2,83:86)]

You can see from the above table that even though the sequence is different in these first displayed genes that the gene the sequences belong to have the same gene expression values.

Lets group by gene and get the count of each gene and attach it to the sequence information.

AD3 <- AD2 %>%
  group_by(GENE_SYMBOL) %>%
  mutate(GENE_CNVs = n()) %>%
  select(GENE_SYMBOL, GENE_CNVs, SEQUENCE, Median_AD, Median_ctrl_AD, 
         FoldChange_Mean_AD_2_ctrl, FoldChange_Median_AD_2_ctrl, everything()) %>%
  ungroup() %>%
  unique()

Alzheimer <- AD3[with(AD3, order(GENE_CNVs, GENE_SYMBOL,  decreasing=TRUE)),]

Get the subset of AD that has the genes with a high number of CNVs and a high Fold change by median values.

AD_genes <- subset(Alzheimer, Alzheimer$GENE_CNVs > median(Alzheimer$GENE_CNVs) &
                              (Alzheimer$FoldChange_Median_AD_2_ctrl < 0.90 |
                              Alzheimer$FoldChange_Median_AD_2_ctrl > 1.10)
                  )
alz_genes <- unique(AD_genes$GENE_SYMBOL)
alz_genes

AD4 <- AD_genes[,c(1:3,7)]
AD4

AD_genes1 <- datatable(data=AD4,  rownames=FALSE,
                      extensions=c('Buttons','Responsive','FixedColumns'),
                      filter=list(position='top'),
                      options=list(pageLength=10,
                        dom='Bfrtip',scrollX = TRUE, fixedColumns = TRUE,
                        buttons=c('colvis','csv'),
                        language=list(sSearch='Filter:')
                        )
                      )

AD_genes1

Lets also add in the cytoband location.

cyto <- ADp %>% select(GENE_SYMBOL, CYTOBAND) 
AD5 <- merge(cyto, AD4, by.x='GENE_SYMBOL', by.y='GENE_SYMBOL')
head(AD5)

So, now we have our table of Alzheimer genes that have high CNVs and Fold Change Median values. Lets start getting this information for the other diseases to word towards building our visNetwork of genes related to other genes if any are.

We want to get these same values exactly as we have for the Alzheimer’s Disease genes.

colnames(AD5)

If we can’t get the CYTOBAND for the other diseases then we will leave it out. Lets write this file out to csv and work on the other disease data sets separate from this one.

write.csv(AD5, 'useableAlzheimerStats.csv', row.names=FALSE)

%^& !!!THE BEGINNING of our VISNETWORK!!!

Lets start by reading in those data tables we will be working with that derived the genes in the highes percentile of Copy Number Variants of genes as well as being in the bottom 5th and top 90th-99th percentiles of fold change values. The library visNetwork requires that there be an edges and a nodes data file to read from and that there be certain specific columns within each table, such as ID, to, from, label, and so on. How we will link the genes is by their ID and to see if that ID links to any other gene in our tables of disease genes. We will first read in our tables, then make sure the column names describe what disease the fold change values and CNVs are for, as well as add a field to each of these targeted genes tables that identifies the gene as being a top gene related to the disease table. Such as, pancreatic cancer, beadchip benign uterine leiomyoma, microarray benign uterine leiomyoma, gastric stomach cancer, colon cancer, or Alzheimer’s Disease.

pancreatic <- read.csv('Pancreatic_Cancer_Thresholds.csv',sep=',', header=TRUE,
                       na.strings=c('',' ','NA'))
colon <- read.csv('ColonCancerThresholds.csv',sep=',', header=TRUE,
                  na.strings=c('',' ','NA')) 
stomach <-  read.csv('GastricStomachCancer_Thresholds.csv',sep=',', header=TRUE, 
                     na.strings=c('',' ','NA'))
alzheimer <- read.csv('useableAlzheimerStats.csv',sep=',', header=TRUE,
                      na.strings=c('',' ','NA'))
ul_beadchip <- read.csv('UL_study1_thresholds.csv',sep=',', header=TRUE, 
                        na.strings=c('',' ','NA'))
ul_microarray <- read.csv('UL_microarray_thresholds.csv',sep=',', header=TRUE, 
                          na.strings=c('',' ','NA'))

Lets now look at each of these six diseases and add in what type of disease it is to the fields in each table that don’t give that information.

#colnames(alzheimer)[1:4] <- paste('AD',colnames(alzheimer)[1:4],sep='_')

#colnames(colon)[c(1,3,4,5)] <- paste('CC',colnames(colon)[c(1,3,4,5)],sep='_')

#colnames(stomach)[1:4] <- paste('GSC', colnames(stomach)[1:4],sep='_')

#colnames(pancreatic)[1:4] <- paste('PC', colnames(pancreatic)[1:4], sep='_')

#colnames(ul_beadchip)[1:5] <- paste('UL_beadchip', colnames(ul_beadchip)[1:5], sep='_')
#colnames(ul_beadchip)[9:46] <- gsub('UL','UL_beadchip', colnames(ul_beadchip)[9:46])

#colnames(ul_microarray)[c(1:4)] <- paste('UL_array', colnames(ul_microarray)[c(1:4)],
#                                              sep='_')
#colnames(ul_microarray)[8:13] <- gsub('UL_','UL_microarray_', colnames(ul_microarray)[8:13])

Lets add in the disease type field of factor type to each table.

pancreatic$pancreatic_cancer <- as.factor('pancreatic cancer')
stomach$stomach_cancer <- as.factor('stomach cancer')
colon$colon_cancer <- as.factor('colon cancer')
alzheimer$alzheimerDisease <- as.factor('Alzheimer\'s Disease')
ul_beadchip$uterine_leiomyoma_BC <- as.factor('benign UL beadchip')
ul_microarray$uterine_leiomyoma_MA <- as.factor('benign UL microarray')

Pancreatic cancer data table:

head(pancreatic)

##   Symbol GENE_CNVs                                           SEQUENCE
## 1    DMD         8 GGCAAGTTCAACAGGATTTTGTGACCAGCGCAGGCTGGGCCTCCTTCTGC
## 2  HYDIN         8 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG
## 3    DMD         8 CCAAAAGGTATTTTGCGAAGCATCCCCGAATGGGCTACCTGCCAGTGCAG
## 4  ABCG1         6 CAGCACTTGTGAAGGATTGAATGCAGGTTCCAGGTGGAGGGAAGACGTGG
## 5 KCNIP4         6 CTTGGAAGGGCTTGAAATGATAGCAGTTCTGATCGTCATTGTGCTTTTTG
## 6 GPR177         5 TGAGATCTACAAGTTGACCCGCAAGGAGGCCCAGGAGTAGGAGGCTGCAG
##           Cytoband FoldChange_epithelial_PC FoldChange_lymphNode_PC
## 1   Xp21.2a-p21.1d                 2.309727               0.4945016
## 2  16q22.1f-q22.3a                 1.967041               0.4460581
## 3   Xp21.2a-p21.1d                 1.414647               0.4765914
## 4         21q22.3b                 3.787784               0.6271994
## 5 4p15.31d-p15.31b                 2.674067               0.5404346
## 6          1p31.3a                 8.289087               0.4977605
##   PC_epithelial_ctrl PC_epithelial_SuppressorAdded PC_lymph_ctrl
## 1              117.2                         270.7         281.9
## 2              133.5                         262.6         289.2
## 3              185.7                         262.7         296.9
## 4             1408.0                        5333.2        8854.6
## 5              112.6                         301.1         280.7
## 6              550.7                        4564.8       15584.0
##   PC_lymph_SuppressorKnockedOut pancreatic_cancer
## 1                         139.4 pancreatic cancer
## 2                         129.0 pancreatic cancer
## 3                         141.5 pancreatic cancer
## 4                        5553.6 pancreatic cancer
## 5                         151.7 pancreatic cancer
## 6                        7757.1 pancreatic cancer

Stomach cancer data table:

head(stomach)

##    GENE_SYMBOL GENE_CNVs    CYTOBAND
## 1      PDE4DIP        15   hs|1q21.1
## 2      PDE4DIP        15   hs|1q21.1
## 3 LOC100128869         8   hs|4q28.3
## 4    LOC729444         7 hs|22q11.21
## 5 LOC100131581         7   hs|8p23.1
## 6    LOC389834         7  hs|21p11.2
##                                                       SEQUENCE
## 1 CAAAAAATGGACTAGAAGAGAAGCTGGCTGAGGAGCTGAGATCAGCCTCGTGGCCTGGGT
## 2 CTACCAAAGCTGTGTATTTTTCATTTCTTCATGGCACTGTGCTGTTAATTTCTGTTAACA
## 3 TCAAAGACACACGCTCACTGCATGTGCTCTTGGGGGACGTCAGTGCCACGTTTGGTCACA
## 4 TTCCATCATACAGTCTTCAGCAACCTTGAAAGATTGGACAAGCTTCAGCCCACTCTTGAA
## 5 AGATAACTTCAGACTCAGCCTAGAAGTAAAGATCCTCCAGATATGACCTCAACTACCCTC
## 6 GGAACACTCTCTTAAAGTCTACTTATCCTATTCCTTCAAATTAAGACCAAAGTCCAATGT
##   FoldChangeMedian_GSC_to_healthyCtrl GSC_Median     GSC1     GSC2     GSC3
## 1                           1.3059347   3.035789 2.332352 3.456686 2.614892
## 2                           0.7574300   6.580579 6.966718 6.194441 8.041944
## 3                           0.4429174   3.180155 2.332352 4.027958 2.326500
## 4                           1.3486174   3.714299 3.762692 3.665906 3.405338
## 5                           0.7360758   2.943508 3.092651 2.794365 4.078270
## 6                           0.7229002   3.439584 2.791133 3.231682 5.058745
##       GSC4 noGSC_ctrl stomach_cancer
## 1 4.208106   2.324610 stomach cancer
## 2 5.767812   8.688036 stomach cancer
## 3 4.576213   7.180018 stomach cancer
## 4 4.299684   2.754154 stomach cancer
## 5 2.402923   3.998920 stomach cancer
## 6 3.647486   4.758035 stomach cancer

Colon cancer data table:

head(colon)

##    Symbol Symbol_CC_CNVs           ID
## 1   HYDIN             17 ILMN_1697764
## 2   HYDIN             17 ILMN_1743138
## 3   HYDIN             17 ILMN_1658413
## 4 SDHALP1             13 ILMN_1734640
## 5 SDHALP1             13 ILMN_1757306
## 6 SDHALP1             13 ILMN_1718539
##                                             SEQUENCE        Cytoband
## 1 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG 16q22.1f-q22.3a
## 2 TGTGCATTGCCAGTCATTCCCATGCCTTTGCCACGGTGTCCTTCACCCCG 16q22.1f-q22.3a
## 3 CTTTGGGAGCCAGGACCCCCCTTTGGTATGTCACTTAAAGAGCGCTGGAG 16q22.1f-q22.3a
## 4 GTGATGACAGAATCAGCTTTTGTAATTATGTATAATAGCTCATGCATGTG           3q29f
## 5 AGCCCTCCTGGACGATTGGAACTGTAATGTGGAAAGGGCTCTGATGGAGC           3q29f
## 6 GTTCTTCAGATGAACTGATTTTTGTGCAGAGCACACGTGTTGGATTCTGC           3q29f
##   FoldChange_CC1_to_CC0 FoldChange_CC2_to_CC0 ctrl_CC_median group1_CC_median
## 1              7.153846             0.4615385          -0.65            -4.65
## 2              5.355556            -0.4666667           2.25            12.05
## 3            -11.375000            10.4583333          -1.20            13.65
## 4              8.621622            -1.6756757          -1.85           -15.95
## 5              1.976190            -8.7857143          -4.20            -8.30
## 6             -6.272727           -15.3636364           0.55            -3.45
##   group2_CC_median cc0_GSM4027437 cc0_GSM4027439 cc1_GSM4027441 cc1_GSM4027442
## 1            -0.30           -1.3            0.0           -9.8            0.5
## 2            -1.05           13.3           -8.8           14.7            9.4
## 3           -12.55           -7.9            5.5            8.6           18.7
## 4             3.10           10.6          -14.3          -17.2          -14.7
## 5            36.90            5.4          -13.8          -10.5           -6.1
## 6            -8.45            1.6           -0.5           -3.1           -3.8
##   cc2_GSM4027444 cc2_GSM4027446 colon_cancer
## 1            8.5           -9.1 colon cancer
## 2           -7.0            4.9 colon cancer
## 3          -18.4           -6.7 colon cancer
## 4           15.2           -9.0 colon cancer
## 5           13.5           60.3 colon cancer
## 6            7.9          -24.8 colon cancer

Alzheimer’s Disease data table:

head(alzheimer)

##   GENE_SYMBOL   CYTOBAND GENE_CNVs
## 1       AHNAK hs|11q12.3         5
## 2       AHNAK hs|11q12.3         5
## 3       AHNAK hs|11q12.3         5
## 4       AHNAK hs|11q12.3         5
## 5       AHNAK hs|11q12.3         5
## 6       AHNAK hs|11q12.3         5
##                                                       SEQUENCE
## 1 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
##   FoldChange_Median_AD_2_ctrl    alzheimerDisease
## 1                     1.10655 Alzheimer's Disease
## 2                     1.10655 Alzheimer's Disease
## 3                     1.10655 Alzheimer's Disease
## 4                     1.10655 Alzheimer's Disease
## 5                     1.10655 Alzheimer's Disease
## 6                     1.10655 Alzheimer's Disease

Uterine Leiomyoma beadchip data table:

head(ul_beadchip)

##      Symbol Symbol_CNVs                                           SEQUENCE
## 1    CTNNB1           7 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG
## 2    CTNNB1           7 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT
## 3 LOC202134           7 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC
## 4   SLCO3A1           6 CCTGTATGTCAGCATCGCCATCGCGCTCAAATCCTTCGCCTTCATCCTGT
## 5   SLCO3A1           6 CCTCCTCTGGGGATGGCTTGCCGACATGCTTTAGAAACTCCCCTCTTGGG
## 6 LOC653673           6 GGAAAAGGCATGGTGACAATGGGGGCTGTAGCCCTAGGAGAACGGGGGGA
##   Chromosome Cytoband FoldChange_Median_UL1_to_nonUL1 UL1_Median nonUL1_Median
## 1       <NA>  3p22.1b                       1.5744314 2748.51615    1745.71980
## 2          3  3p22.1b                       1.4655024 2113.89070    1442.43415
## 3          5  5q35.2d                       0.3998123   72.51865     181.38175
## 4       <NA> 15q26.1d                       1.4533565  578.43480     397.99925
## 5       <NA> 15q26.1d                       1.3643572  161.32180     118.24015
## 6         17     <NA>                       1.2756026  112.62110      88.28855
##   UL_GSM2496185 UL_GSM2496186 UL_GSM2496187 UL_GSM2496188 UL_GSM2496189
## 1        4742.3     3302.9018     3239.9835     2694.4572     4329.4024
## 2        4482.6     2738.1917     2868.2212     2320.8855     3582.6563
## 3         120.2       95.5970      114.1447       61.4714       53.0500
## 4         909.7      693.5790      596.4898      657.6947      774.6664
## 5         188.2      165.9787      133.6730      127.4902      189.8039
## 6          63.6       71.3818       59.9000       99.5000      114.9687
##   UL_GSM2496190 UL_GSM2496191 UL_GSM2496192 UL_GSM2496193 UL_GSM2496203
## 1     2212.7663     2349.6953     3868.5521     4720.0408       56.6000
## 2     1879.2320     2129.0526     2743.9472     3241.3755       70.6800
## 3      102.4450      109.9962       58.7000       54.6500      151.3824
## 4      635.5111      580.4067      462.8962      576.4629      285.0000
## 5      197.9507      179.2933      152.0385      159.9105      195.6714
## 6      167.1138      107.2792      120.4000       95.0750      130.6000
##   UL_GSM2496204 UL_GSM2496205 UL_GSM2496206 UL_GSM2496207 UL_GSM2496208
## 1     2104.1954     2984.8842     4542.3102     2943.6535     2356.1480
## 2     1776.8023     2413.5624     3268.0110     2098.7288     1766.2115
## 3       67.7818       52.9000       70.4667      186.8900       59.0000
## 4      849.8864      586.4250      214.2731       75.3222      372.6541
## 5      218.1418      160.5893       66.1333       69.5778      162.0543
## 6      136.3000      110.2735       94.8857      121.5657      126.0100
##   UL_GSM2496209 UL_GSM2496217 UL_GSM2496218 UL_GSM2496219 UL_GSM2496220
## 1     2569.9042     1461.1724     2194.4782     2802.5751     1277.3170
## 2     1777.2178     1231.3778     1977.7932     2085.1125     1063.1578
## 3       60.2286       66.7474       74.5706      172.9389      147.4045
## 4      453.3000      590.5725      528.8004      448.9996      539.1882
## 5      162.1348      165.1192      113.8667      157.4818      138.7429
## 6      137.8122      121.4737       92.0086      154.2255       98.5815
##   nonUL_GSM2496194 nonUL_GSM2496195 nonUL_GSM2496196 nonUL_GSM2496197
## 1        1819.0576        1590.1865        1589.6078        1780.0956
## 2        1523.4501        1488.2989        1355.8731        1396.6437
## 3         173.7488         107.0138         250.1007         248.4337
## 4         572.5591         498.7206         398.9950         564.0055
## 5         140.0642         137.7730         117.3053         115.9455
## 6          69.4000          66.2286          64.9667          73.1000
##   nonUL_GSM2496198 nonUL_GSM2496199 nonUL_GSM2496200 nonUL_GSM2496201
## 1        2641.8803        1371.1857        1711.3440        1509.7454
## 2        1978.2444        1274.5286        1445.0758        1309.4235
## 3          91.9065         193.9691         145.8866         122.1739
## 4         442.3333         380.6980         323.2386         316.2327
## 5         115.3744         112.5600         113.1290         139.1017
## 6          85.5313          92.0763         102.1674          84.6600
##   nonUL_GSM2496202 nonUL_GSM2496210 nonUL_GSM2496211 nonUL_GSM2496212
## 1        1832.1853        1300.1667        1612.5461        1383.7698
## 2        1439.7925        1144.2250        1382.7999        1285.0511
## 3         115.3957         229.9462         189.0147         564.7217
## 4         392.9562         503.8777         419.0000         292.6316
## 5         139.6097         138.8037         116.9000         105.9600
## 6          89.5727          88.2200          92.6342         104.2000
##   nonUL_GSM2496213 nonUL_GSM2496214 nonUL_GSM2496215 nonUL_GSM2496216
## 1        2141.9502        1986.5859        2006.4213        1632.0421
## 2        1630.3266        1936.1433        1480.8517        1091.4076
## 3         281.0516         350.7591         103.6000         108.7818
## 4         397.0035         260.3180         288.9834         522.8038
## 5         119.1750          84.9333         135.0706         171.8562
## 6          87.2000         135.4857          96.1591         115.1231
##   nonUL_GSM2496221 nonUL_GSM2496222 uterine_leiomyoma_BC
## 1        2194.8792        2451.2007   benign UL beadchip
## 2        1837.8157        1864.2062   benign UL beadchip
## 3         201.6956          87.3286   benign UL beadchip
## 4         652.5280         278.8829   benign UL beadchip
## 5         141.1556          67.2000   benign UL beadchip
## 6          85.8444          88.3571   benign UL beadchip

Uterine Leiomyoma microarray data table:

head(ul_microarray)

##   GENE_SYMBOL GENE_SYMBOL_CNVs
## 1        TPM3               11
## 2      SPDYE3               10
## 3        RPSA                8
## 4         LPP                8
## 5        ST13                8
## 6        NAV2                7
##                                                       SEQUENCE   CYTOBAND
## 1 AATTGCATTTTAATTGGGGCAGTTAGAATGTTGATTTCCTAACAGCATTGTGAAGTTGAC  hs|1q21.3
## 2 ATCAGGAAGGAGGAGAGGAACCATTTGTGCAGATCATCTAGAAGAACCTGGACCATTCTT  hs|7q22.1
## 3 CTGAGTCAGGTAGTGCAGTGGTTGTAAAGTGCAGCATATTCATTTCCAAGCTAGAGGGTT  hs|3p22.1
## 4 TCGCTGACGCATTAGACTCAAGGTGGACTTAATTTGGAGATCACCTATGAAGACATCTGT    hs|3q28
## 5 AATTATACAAGCATATTTAATGCTGAGTTTAATTTAATATGTAATACATATGGTAATTGT hs|22q13.2
## 6 CACATGGCTATTGATCTCTACTGCGGTTTGGCTTGTCTGTGGGGAATACATGAGCCCCGA hs|11p15.1
##   FoldChange_MedianArray_UL1_to_nonUL1 UL1_MedianArray nonUL1_MedianArray
## 1                            0.6516203        3.133639           4.808995
## 2                            0.6738332        4.118197           6.111597
## 3                            1.6315379        5.039850           3.089018
## 4                            0.7150036        3.133639           4.382690
## 5                            0.7121234        4.853509           6.815545
## 6                            1.6569822        7.909182           4.773245
##   UL_GSM1667147 UL_GSM1667148 UL_GSM1667149 nonUL_GSM1667144 nonUL_GSM1667145
## 1      3.500581      2.387398      3.133639         4.808995         5.921844
## 2      4.118197      5.116485      3.133639         2.867115         6.111597
## 3      6.903503      5.039850      3.133639         5.185176         2.498480
## 4      2.634312      3.411704      3.133639         3.249956         6.112889
## 5      7.195406      3.280124      4.853509         8.044680         6.815545
## 6      6.692666      8.541624      7.909182         4.773245         5.542953
##   nonUL_GSM1667146 uterine_leiomyoma_MA
## 1         2.513410 benign UL microarray
## 2         6.481402 benign UL microarray
## 3         3.089018 benign UL microarray
## 4         4.382690 benign UL microarray
## 5         5.230738 benign UL microarray
## 6         3.444421 benign UL microarray

Lets only keep the cytoband, sequence, symbol, fold change, and disease type columns of each table.

The alzheimer table already has the only fields we want, but lets rearrange and rename them to make them more consistent.

AD1 <- alzheimer %>% select(GENE_SYMBOL:SEQUENCE,alzheimerDisease,
                            FoldChange_Median_AD_2_ctrl)
colnames(AD1) <- c('gene','cytoband','cnv','sequence','disease','foldChangeAD')
colnames(AD1)

## [1] "gene"         "cytoband"     "cnv"          "sequence"     "disease"     
## [6] "foldChangeAD"

The pancreatic cancer table has two fold change values for different cell lines observed. The first fold change values are of epithelial cells of PC with a tumor/cancer suppressor(ARHGEF10) added compared to not. And the second fold change values are of lymphatic cells that the cancer suppressor (ARHGEF10) was knocked out or eliminated as much as possible when compared to lymphatic cells of pancreatic cancer patients.

PC1 <- pancreatic %>% select(Symbol,Cytoband,GENE_CNVs,SEQUENCE,pancreatic_cancer,
                             FoldChange_epithelial_PC,
                             FoldChange_lymphNode_PC
                             )
colnames(PC1) <- c('gene','cytoband','cnv','sequence','disease','foldChangePC_epith',
                   'foldChangePC_lymph')
colnames(PC1)

## [1] "gene"               "cytoband"           "cnv"               
## [4] "sequence"           "disease"            "foldChangePC_epith"
## [7] "foldChangePC_lymph"

The colon cancer dataset also has two fold change values. The first is a set of LGR5 being knocked down with a shRNA, and the 2nd type is of LGR5 being knocked down with the 2nd type of shRNA.

CC1 <- colon %>% select(Symbol,Cytoband,Symbol_CC_CNVs,SEQUENCE,colon_cancer,
                        FoldChange_CC1_to_CC0, FoldChange_CC2_to_CC0)
colnames(CC1) <- c('gene','cytoband','cnv','sequence','disease','FoldChange_CC1_to_CC0',
                   'FoldChange_CC2_to_CC0')
colnames(CC1)

## [1] "gene"                  "cytoband"              "cnv"                  
## [4] "sequence"              "disease"               "FoldChange_CC1_to_CC0"
## [7] "FoldChange_CC2_to_CC0"

The stomach cancer dataset has one fold change value

GSC1 <- stomach %>% select(GENE_SYMBOL,CYTOBAND,GENE_CNVs,SEQUENCE,stomach_cancer,
                           FoldChangeMedian_GSC_to_healthyCtrl)
colnames(GSC1) <-  c('gene','cytoband','cnv','sequence','disease',
                     'FoldChangeMedian_GSC_to_healthyCtrl' )
colnames(GSC1)

## [1] "gene"                                "cytoband"                           
## [3] "cnv"                                 "sequence"                           
## [5] "disease"                             "FoldChangeMedian_GSC_to_healthyCtrl"

The UL beadchip has one fold change column.

UL_BC <- ul_beadchip %>% select(Symbol,Cytoband,Symbol_CNVs,SEQUENCE,uterine_leiomyoma_BC,
                                FoldChange_Median_UL1_to_nonUL1,
                                )
colnames(UL_BC) <-  c('gene','cytoband','cnv','sequence', 'disease',
                     'FoldChange_Median_UL1_to_nonUL1')
colnames(UL_BC)

## [1] "gene"                            "cytoband"                       
## [3] "cnv"                             "sequence"                       
## [5] "disease"                         "FoldChange_Median_UL1_to_nonUL1"

The UL microarray data also has one foldchange column.

UL_MA <- ul_microarray %>% select(GENE_SYMBOL,CYTOBAND,GENE_SYMBOL_CNVs,SEQUENCE,
                                  uterine_leiomyoma_MA,
                                  FoldChange_MedianArray_UL1_to_nonUL1)
colnames(UL_MA) <-  c('gene','cytoband','cnv','sequence', 'disease',
                     'FoldChange_MedianArray_UL1_to_nonUL1')
colnames(UL_MA)

## [1] "gene"                                
## [2] "cytoband"                            
## [3] "cnv"                                 
## [4] "sequence"                            
## [5] "disease"                             
## [6] "FoldChange_MedianArray_UL1_to_nonUL1"

Our data tables with fold change values are AD1, CC1, GSC1, PC1, UL_BC, and UL_MA. Lets look at the first few observations of each table.

head(AD1)

##    gene   cytoband cnv
## 1 AHNAK hs|11q12.3   5
## 2 AHNAK hs|11q12.3   5
## 3 AHNAK hs|11q12.3   5
## 4 AHNAK hs|11q12.3   5
## 5 AHNAK hs|11q12.3   5
## 6 AHNAK hs|11q12.3   5
##                                                       sequence
## 1 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
##               disease foldChangeAD
## 1 Alzheimer's Disease      1.10655
## 2 Alzheimer's Disease      1.10655
## 3 Alzheimer's Disease      1.10655
## 4 Alzheimer's Disease      1.10655
## 5 Alzheimer's Disease      1.10655
## 6 Alzheimer's Disease      1.10655

Lets alter the cytoband field so that all values are the same format.

AD2 <- AD1
cyto1 <- strsplit(as.character(AD2$cytoband),split='|',fixed=TRUE)
cyto2 <- lapply(cyto1,'[',2)
AD2$cytoband <- as.character(paste(cyto2))
AD2$cytoband <- as.factor(AD2$cytoband)
head(AD2)

##    gene cytoband cnv
## 1 AHNAK  11q12.3   5
## 2 AHNAK  11q12.3   5
## 3 AHNAK  11q12.3   5
## 4 AHNAK  11q12.3   5
## 5 AHNAK  11q12.3   5
## 6 AHNAK  11q12.3   5
##                                                       sequence
## 1 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
##               disease foldChangeAD
## 1 Alzheimer's Disease      1.10655
## 2 Alzheimer's Disease      1.10655
## 3 Alzheimer's Disease      1.10655
## 4 Alzheimer's Disease      1.10655
## 5 Alzheimer's Disease      1.10655
## 6 Alzheimer's Disease      1.10655

head(CC1)

##      gene        cytoband cnv
## 1   HYDIN 16q22.1f-q22.3a  17
## 2   HYDIN 16q22.1f-q22.3a  17
## 3   HYDIN 16q22.1f-q22.3a  17
## 4 SDHALP1           3q29f  13
## 5 SDHALP1           3q29f  13
## 6 SDHALP1           3q29f  13
##                                             sequence      disease
## 1 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG colon cancer
## 2 TGTGCATTGCCAGTCATTCCCATGCCTTTGCCACGGTGTCCTTCACCCCG colon cancer
## 3 CTTTGGGAGCCAGGACCCCCCTTTGGTATGTCACTTAAAGAGCGCTGGAG colon cancer
## 4 GTGATGACAGAATCAGCTTTTGTAATTATGTATAATAGCTCATGCATGTG colon cancer
## 5 AGCCCTCCTGGACGATTGGAACTGTAATGTGGAAAGGGCTCTGATGGAGC colon cancer
## 6 GTTCTTCAGATGAACTGATTTTTGTGCAGAGCACACGTGTTGGATTCTGC colon cancer
##   FoldChange_CC1_to_CC0 FoldChange_CC2_to_CC0
## 1              7.153846             0.4615385
## 2              5.355556            -0.4666667
## 3            -11.375000            10.4583333
## 4              8.621622            -1.6756757
## 5              1.976190            -8.7857143
## 6             -6.272727           -15.3636364

CC2 <- CC1
CC2$cytoband <- gsub('-','|',CC2$cytoband)
CC2$cytoband <- gsub('[|].*[a-z]{1}$','', CC2$cytoband, perl=TRUE)
CC2$cytoband <- gsub('[a-z]{1}$','', CC2$cytoband, perl=TRUE)
CC2$cytoband <- as.factor(CC2$cytoband)
head(CC2)

##      gene cytoband cnv                                           sequence
## 1   HYDIN  16q22.1  17 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG
## 2   HYDIN  16q22.1  17 TGTGCATTGCCAGTCATTCCCATGCCTTTGCCACGGTGTCCTTCACCCCG
## 3   HYDIN  16q22.1  17 CTTTGGGAGCCAGGACCCCCCTTTGGTATGTCACTTAAAGAGCGCTGGAG
## 4 SDHALP1     3q29  13 GTGATGACAGAATCAGCTTTTGTAATTATGTATAATAGCTCATGCATGTG
## 5 SDHALP1     3q29  13 AGCCCTCCTGGACGATTGGAACTGTAATGTGGAAAGGGCTCTGATGGAGC
## 6 SDHALP1     3q29  13 GTTCTTCAGATGAACTGATTTTTGTGCAGAGCACACGTGTTGGATTCTGC
##        disease FoldChange_CC1_to_CC0 FoldChange_CC2_to_CC0
## 1 colon cancer              7.153846             0.4615385
## 2 colon cancer              5.355556            -0.4666667
## 3 colon cancer            -11.375000            10.4583333
## 4 colon cancer              8.621622            -1.6756757
## 5 colon cancer              1.976190            -8.7857143
## 6 colon cancer             -6.272727           -15.3636364

head(GSC1)

##           gene    cytoband cnv
## 1      PDE4DIP   hs|1q21.1  15
## 2      PDE4DIP   hs|1q21.1  15
## 3 LOC100128869   hs|4q28.3   8
## 4    LOC729444 hs|22q11.21   7
## 5 LOC100131581   hs|8p23.1   7
## 6    LOC389834  hs|21p11.2   7
##                                                       sequence        disease
## 1 CAAAAAATGGACTAGAAGAGAAGCTGGCTGAGGAGCTGAGATCAGCCTCGTGGCCTGGGT stomach cancer
## 2 CTACCAAAGCTGTGTATTTTTCATTTCTTCATGGCACTGTGCTGTTAATTTCTGTTAACA stomach cancer
## 3 TCAAAGACACACGCTCACTGCATGTGCTCTTGGGGGACGTCAGTGCCACGTTTGGTCACA stomach cancer
## 4 TTCCATCATACAGTCTTCAGCAACCTTGAAAGATTGGACAAGCTTCAGCCCACTCTTGAA stomach cancer
## 5 AGATAACTTCAGACTCAGCCTAGAAGTAAAGATCCTCCAGATATGACCTCAACTACCCTC stomach cancer
## 6 GGAACACTCTCTTAAAGTCTACTTATCCTATTCCTTCAAATTAAGACCAAAGTCCAATGT stomach cancer
##   FoldChangeMedian_GSC_to_healthyCtrl
## 1                           1.3059347
## 2                           0.7574300
## 3                           0.4429174
## 4                           1.3486174
## 5                           0.7360758
## 6                           0.7229002

GSC2 <- GSC1
cyto1 <- strsplit(as.character(GSC2$cytoband),split='|',fixed=TRUE)
cyto2 <- lapply(cyto1,'[',2)
GSC2$cytoband <- as.character(paste(cyto2))
GSC2$cytoband <- as.factor(GSC2$cytoband)
head(GSC2)

##           gene cytoband cnv
## 1      PDE4DIP   1q21.1  15
## 2      PDE4DIP   1q21.1  15
## 3 LOC100128869   4q28.3   8
## 4    LOC729444 22q11.21   7
## 5 LOC100131581   8p23.1   7
## 6    LOC389834  21p11.2   7
##                                                       sequence        disease
## 1 CAAAAAATGGACTAGAAGAGAAGCTGGCTGAGGAGCTGAGATCAGCCTCGTGGCCTGGGT stomach cancer
## 2 CTACCAAAGCTGTGTATTTTTCATTTCTTCATGGCACTGTGCTGTTAATTTCTGTTAACA stomach cancer
## 3 TCAAAGACACACGCTCACTGCATGTGCTCTTGGGGGACGTCAGTGCCACGTTTGGTCACA stomach cancer
## 4 TTCCATCATACAGTCTTCAGCAACCTTGAAAGATTGGACAAGCTTCAGCCCACTCTTGAA stomach cancer
## 5 AGATAACTTCAGACTCAGCCTAGAAGTAAAGATCCTCCAGATATGACCTCAACTACCCTC stomach cancer
## 6 GGAACACTCTCTTAAAGTCTACTTATCCTATTCCTTCAAATTAAGACCAAAGTCCAATGT stomach cancer
##   FoldChangeMedian_GSC_to_healthyCtrl
## 1                           1.3059347
## 2                           0.7574300
## 3                           0.4429174
## 4                           1.3486174
## 5                           0.7360758
## 6                           0.7229002

head(PC1)

##     gene         cytoband cnv
## 1    DMD   Xp21.2a-p21.1d   8
## 2  HYDIN  16q22.1f-q22.3a   8
## 3    DMD   Xp21.2a-p21.1d   8
## 4  ABCG1         21q22.3b   6
## 5 KCNIP4 4p15.31d-p15.31b   6
## 6 GPR177          1p31.3a   5
##                                             sequence           disease
## 1 GGCAAGTTCAACAGGATTTTGTGACCAGCGCAGGCTGGGCCTCCTTCTGC pancreatic cancer
## 2 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG pancreatic cancer
## 3 CCAAAAGGTATTTTGCGAAGCATCCCCGAATGGGCTACCTGCCAGTGCAG pancreatic cancer
## 4 CAGCACTTGTGAAGGATTGAATGCAGGTTCCAGGTGGAGGGAAGACGTGG pancreatic cancer
## 5 CTTGGAAGGGCTTGAAATGATAGCAGTTCTGATCGTCATTGTGCTTTTTG pancreatic cancer
## 6 TGAGATCTACAAGTTGACCCGCAAGGAGGCCCAGGAGTAGGAGGCTGCAG pancreatic cancer
##   foldChangePC_epith foldChangePC_lymph
## 1           2.309727          0.4945016
## 2           1.967041          0.4460581
## 3           1.414647          0.4765914
## 4           3.787784          0.6271994
## 5           2.674067          0.5404346
## 6           8.289087          0.4977605

PC2 <- PC1
PC2$cytoband <- gsub('-','|',PC2$cytoband)
PC2$cytoband <- gsub('[|].*[a-z]{1}$','', PC2$cytoband, perl=TRUE)
PC2$cytoband <- gsub('[a-z]{1}$','', PC2$cytoband, perl=TRUE)
PC2$cytoband <- as.factor(PC2$cytoband)
head(PC2)

##     gene cytoband cnv                                           sequence
## 1    DMD   Xp21.2   8 GGCAAGTTCAACAGGATTTTGTGACCAGCGCAGGCTGGGCCTCCTTCTGC
## 2  HYDIN  16q22.1   8 GCAATTAAGTATGTGGAACACCCTCAGATAGACAGCCTGGACCTGCGCGG
## 3    DMD   Xp21.2   8 CCAAAAGGTATTTTGCGAAGCATCCCCGAATGGGCTACCTGCCAGTGCAG
## 4  ABCG1  21q22.3   6 CAGCACTTGTGAAGGATTGAATGCAGGTTCCAGGTGGAGGGAAGACGTGG
## 5 KCNIP4  4p15.31   6 CTTGGAAGGGCTTGAAATGATAGCAGTTCTGATCGTCATTGTGCTTTTTG
## 6 GPR177   1p31.3   5 TGAGATCTACAAGTTGACCCGCAAGGAGGCCCAGGAGTAGGAGGCTGCAG
##             disease foldChangePC_epith foldChangePC_lymph
## 1 pancreatic cancer           2.309727          0.4945016
## 2 pancreatic cancer           1.967041          0.4460581
## 3 pancreatic cancer           1.414647          0.4765914
## 4 pancreatic cancer           3.787784          0.6271994
## 5 pancreatic cancer           2.674067          0.5404346
## 6 pancreatic cancer           8.289087          0.4977605

head(UL_BC)

##        gene cytoband cnv                                           sequence
## 1    CTNNB1  3p22.1b   7 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG
## 2    CTNNB1  3p22.1b   7 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT
## 3 LOC202134  5q35.2d   7 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC
## 4   SLCO3A1 15q26.1d   6 CCTGTATGTCAGCATCGCCATCGCGCTCAAATCCTTCGCCTTCATCCTGT
## 5   SLCO3A1 15q26.1d   6 CCTCCTCTGGGGATGGCTTGCCGACATGCTTTAGAAACTCCCCTCTTGGG
## 6 LOC653673     <NA>   6 GGAAAAGGCATGGTGACAATGGGGGCTGTAGCCCTAGGAGAACGGGGGGA
##              disease FoldChange_Median_UL1_to_nonUL1
## 1 benign UL beadchip                       1.5744314
## 2 benign UL beadchip                       1.4655024
## 3 benign UL beadchip                       0.3998123
## 4 benign UL beadchip                       1.4533565
## 5 benign UL beadchip                       1.3643572
## 6 benign UL beadchip                       1.2756026

UL_BC2 <- UL_BC
UL_BC2$cytoband <- gsub('-','|',UL_BC2$cytoband)
UL_BC2$cytoband <- gsub('[|].*[a-z]{1}$','', UL_BC2$cytoband, perl=TRUE)
UL_BC2$cytoband <- gsub('[a-z]{1}$','', UL_BC2$cytoband, perl=TRUE)
UL_BC2$cytoband <- as.factor(UL_BC2$cytoband)
UL_BC2$cytoband <- as.factor(UL_BC2$cytoband)
head(UL_BC2)

##        gene cytoband cnv                                           sequence
## 1    CTNNB1   3p22.1   7 AGCTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAG
## 2    CTNNB1   3p22.1   7 CTGCAGGGGTCCTCTGTGAACTTGCTCAGGACAAGGAAGCTGCAGAAGCT
## 3 LOC202134   5q35.2   7 CCTGTTTGGATCACATGGTCTTGTCCTGATAACTTGGAAGAGGTTGCTTC
## 4   SLCO3A1  15q26.1   6 CCTGTATGTCAGCATCGCCATCGCGCTCAAATCCTTCGCCTTCATCCTGT
## 5   SLCO3A1  15q26.1   6 CCTCCTCTGGGGATGGCTTGCCGACATGCTTTAGAAACTCCCCTCTTGGG
## 6 LOC653673     <NA>   6 GGAAAAGGCATGGTGACAATGGGGGCTGTAGCCCTAGGAGAACGGGGGGA
##              disease FoldChange_Median_UL1_to_nonUL1
## 1 benign UL beadchip                       1.5744314
## 2 benign UL beadchip                       1.4655024
## 3 benign UL beadchip                       0.3998123
## 4 benign UL beadchip                       1.4533565
## 5 benign UL beadchip                       1.3643572
## 6 benign UL beadchip                       1.2756026

head(UL_MA)

##     gene   cytoband cnv
## 1   TPM3  hs|1q21.3  11
## 2 SPDYE3  hs|7q22.1  10
## 3   RPSA  hs|3p22.1   8
## 4    LPP    hs|3q28   8
## 5   ST13 hs|22q13.2   8
## 6   NAV2 hs|11p15.1   7
##                                                       sequence
## 1 AATTGCATTTTAATTGGGGCAGTTAGAATGTTGATTTCCTAACAGCATTGTGAAGTTGAC
## 2 ATCAGGAAGGAGGAGAGGAACCATTTGTGCAGATCATCTAGAAGAACCTGGACCATTCTT
## 3 CTGAGTCAGGTAGTGCAGTGGTTGTAAAGTGCAGCATATTCATTTCCAAGCTAGAGGGTT
## 4 TCGCTGACGCATTAGACTCAAGGTGGACTTAATTTGGAGATCACCTATGAAGACATCTGT
## 5 AATTATACAAGCATATTTAATGCTGAGTTTAATTTAATATGTAATACATATGGTAATTGT
## 6 CACATGGCTATTGATCTCTACTGCGGTTTGGCTTGTCTGTGGGGAATACATGAGCCCCGA
##                disease FoldChange_MedianArray_UL1_to_nonUL1
## 1 benign UL microarray                            0.6516203
## 2 benign UL microarray                            0.6738332
## 3 benign UL microarray                            1.6315379
## 4 benign UL microarray                            0.7150036
## 5 benign UL microarray                            0.7121234
## 6 benign UL microarray                            1.6569822

UL_MA2 <- UL_MA
cyto1 <- strsplit(as.character(UL_MA2$cytoband),split='|',fixed=TRUE)
cyto2 <- lapply(cyto1,'[',2)
UL_MA2$cytoband <- as.character(paste(cyto2))
UL_MA2$cytoband <- as.factor(UL_MA2$cytoband)
head(UL_MA2)

##     gene cytoband cnv
## 1   TPM3   1q21.3  11
## 2 SPDYE3   7q22.1  10
## 3   RPSA   3p22.1   8
## 4    LPP     3q28   8
## 5   ST13  22q13.2   8
## 6   NAV2  11p15.1   7
##                                                       sequence
## 1 AATTGCATTTTAATTGGGGCAGTTAGAATGTTGATTTCCTAACAGCATTGTGAAGTTGAC
## 2 ATCAGGAAGGAGGAGAGGAACCATTTGTGCAGATCATCTAGAAGAACCTGGACCATTCTT
## 3 CTGAGTCAGGTAGTGCAGTGGTTGTAAAGTGCAGCATATTCATTTCCAAGCTAGAGGGTT
## 4 TCGCTGACGCATTAGACTCAAGGTGGACTTAATTTGGAGATCACCTATGAAGACATCTGT
## 5 AATTATACAAGCATATTTAATGCTGAGTTTAATTTAATATGTAATACATATGGTAATTGT
## 6 CACATGGCTATTGATCTCTACTGCGGTTTGGCTTGTCTGTGGGGAATACATGAGCCCCGA
##                disease FoldChange_MedianArray_UL1_to_nonUL1
## 1 benign UL microarray                            0.6516203
## 2 benign UL microarray                            0.6738332
## 3 benign UL microarray                            1.6315379
## 4 benign UL microarray                            0.7150036
## 5 benign UL microarray                            0.7121234
## 6 benign UL microarray                            1.6569822

Now we have our tables with the same formatted cytoband locations as factors. These tables are AD2, CC2, GSC2, PC2, UL_BC2, and UL_MA2. Lets create other tables of these six tables that have all the same values. Lets gather the fold change fields to bind all together. Four tables only have one fold change type, but the colon cancer and the pancreatic cancer have two types.

AD3 <- gather(AD2,'foldChange','foldChangeValue',6)
CC3 <- gather(CC2,'foldChange','foldChangeValue',6:7)
PC3 <- gather(PC2,'foldChange','foldChangeValue',6:7)
GSC3 <- gather(GSC2,'foldChange','foldChangeValue',6)
UL_BC3 <- gather(UL_BC2,'foldChange','foldChangeValue',6)
UL_MA3 <-gather(UL_MA2,'foldChange','foldChangeValue',6)

We can now row bind these tables together.

All_data <- rbind(AD3,CC3,PC3,GSC3,UL_BC3,UL_MA3)

All <- All_data[order(All_data$gene),]
head(All,20)

##     gene cytoband cnv
## 1  AHNAK  11q12.3   5
## 2  AHNAK  11q12.3   5
## 3  AHNAK  11q12.3   5
## 4  AHNAK  11q12.3   5
## 5  AHNAK  11q12.3   5
## 6  AHNAK  11q12.3   5
## 7  AHNAK  11q12.3   5
## 8  AHNAK  11q12.3   5
## 9  AHNAK  11q12.3   5
## 10 AHNAK  11q12.3   5
## 11 AHNAK  11q12.3   5
## 12 AHNAK  11q12.3   5
## 13 AHNAK  11q12.3   5
## 14 AHNAK  11q12.3   5
## 15 AHNAK  11q12.3   5
## 16 AHNAK  11q12.3   5
## 17 AHNAK  11q12.3   5
## 18 AHNAK  11q12.3   5
## 19 AHNAK  11q12.3   5
## 20 AHNAK  11q12.3   5
##                                                        sequence
## 1  GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 2  AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 3  GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 4  ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 5  TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 6  GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 7  AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 8  GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 9  ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 10 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 11 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 12 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 13 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 14 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 15 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
## 16 GAAAAGTAACATTCCCTAAAATGAAGATCCCCAAATTTACCTTCTCTGGCCGTGAGCTGG
## 17 AAGGCCCCAGTTTGGACATTGACACACCTGATGTCAATATTGAAGGTCCGGAAGGAAAAT
## 18 GTGCATTTCTTTACTTGCAAGGGGACAGAGTGTGGGCTTAGGTTTGGGACTAGAGGGGGC
## 19 ATTTCCAGCACTTTAATGGCCAATTAACTGAGAATGTAAGAAAATTGATGCTGTACAAGG
## 20 TCTGTGCCCAAGGTAGAAGGTGAAATGAAAGTGCCAGATGTTGACATTAAAGGGCCCAAA
##                disease   foldChange foldChangeValue
## 1  Alzheimer's Disease foldChangeAD         1.10655
## 2  Alzheimer's Disease foldChangeAD         1.10655
## 3  Alzheimer's Disease foldChangeAD         1.10655
## 4  Alzheimer's Disease foldChangeAD         1.10655
## 5  Alzheimer's Disease foldChangeAD         1.10655
## 6  Alzheimer's Disease foldChangeAD         1.10655
## 7  Alzheimer's Disease foldChangeAD         1.10655
## 8  Alzheimer's Disease foldChangeAD         1.10655
## 9  Alzheimer's Disease foldChangeAD         1.10655
## 10 Alzheimer's Disease foldChangeAD         1.10655
## 11 Alzheimer's Disease foldChangeAD         1.10655
## 12 Alzheimer's Disease foldChangeAD         1.10655
## 13 Alzheimer's Disease foldChangeAD         1.10655
## 14 Alzheimer's Disease foldChangeAD         1.10655
## 15 Alzheimer's Disease foldChangeAD         1.10655
## 16 Alzheimer's Disease foldChangeAD         1.10655
## 17 Alzheimer's Disease foldChangeAD         1.10655
## 18 Alzheimer's Disease foldChangeAD         1.10655
## 19 Alzheimer's Disease foldChangeAD         1.10655
## 20 Alzheimer's Disease foldChangeAD         1.10655

Lets use a DT library datatable to display the new table.

All_DT1 <- datatable(data=All,  rownames=FALSE, width = 800, height = 700,
                      extensions=c('Buttons','Responsive','FixedColumns'),
                      #filter=list(position='top'),
                      options=list(pageLength=10,
                        dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,fixedColumns = TRUE,
                        buttons=c('colvis','csv'),
                        language=list(sSearch='Filter:')
                        )
                      )

All_DT1

Lets now see what genes are in other diseases with slicing.

All2 <- All_data %>% select(gene,disease)
All3 <- All2[!duplicated(All2),]
All4 <- All3[order(All3$gene, decreasing=FALSE),]
head(All4)

##       gene              disease
## 1    AHNAK  Alzheimer's Disease
## 26   CALD1  Alzheimer's Disease
## 42   DCLK1  Alzheimer's Disease
## 706  DCLK1 benign UL microarray
## 58  ELAVL4  Alzheimer's Disease
## 67   EPHA4  Alzheimer's Disease

Lets create a DT data table of this information showing the unique genes with their diseases associated with.

All_DT2 <- datatable(data=All4,  rownames=FALSE, width = 800, height = 700,
                      extensions=c('Buttons','Responsive','FixedColumns'),
                      #filter=list(position='top'),
                      options=list(pageLength=10,
                        dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,fixedColumns = TRUE,
                        buttons=c('colvis','csv'),
                        language=list(sSearch='Filter:')
                        )
                      )

All_DT2

We can see from the data table above that there are some genes that are associated with more than one disease. This will make for presumably an interesting visual network. Lets see exactly which genes are associated with more than one disease.

All5 <- All4 %>%
  group_by(gene) %>%
  mutate(geneDiseases = n()) %>%
  select(gene,geneDiseases, disease) %>%
  ungroup() 
  
All6 <- subset(All5, All5$geneDiseases > 1)
All7 <- All6[order(All6$geneDiseases, decreasing=TRUE),]
All7

## # A tibble: 57 x 3
##    gene   geneDiseases disease             
##    <fct>         <int> <fct>               
##  1 DMD               3 colon cancer        
##  2 DMD               3 pancreatic cancer   
##  3 DMD               3 benign UL microarray
##  4 HYDIN             3 colon cancer        
##  5 HYDIN             3 pancreatic cancer   
##  6 HYDIN             3 benign UL microarray
##  7 KCNIP4            3 colon cancer        
##  8 KCNIP4            3 pancreatic cancer   
##  9 KCNIP4            3 benign UL beadchip  
## 10 DCLK1             2 Alzheimer's Disease 
## # … with 47 more rows

length(unique(All7$gene))

## [1] 27

unique(All7$gene)

##  [1] DMD     HYDIN   KCNIP4  DCLK1   ABCC6   ABCG1   AGL     BMP1    CAST   
## [10] CTNNB1  FSD1    PTGER3  S100A13 WHSC1   CREB5   HIF3A   ITIH5   MDK    
## [19] MTUS1   NRG1    PHF20L1 RFX4    RUNX1T1 SORBS1  MCF2L   TNXB    SYNE2  
## 345 Levels: AHNAK CALD1 DCLK1 ELAVL4 EPHA4 GABRG2 GAD1 KIAA1107 PNMAL1 ... ZRANB3

We know that there are 27 genes in our data associated with more than one disease. We should just plot these diseases with visNetwork. There are other possibilities in using link analysis with visNetwork for location of the genes that are in our 792 observational data of gene targets associated with six specific diseases. And that too could be interesting to compare by cytoband location which genes are neighbors in the nucleus where transcription starts. Here are some facts about gene expression:

-DNA information is transcribed into a special type of RNA called messenger RNA in the nucleus (mRNA)

mRNA is modified in the nucleus then is transported out to the cytoplasm
Protein synthesis, or translation, occurs in the cytoplasm with the help of transfer RNA molecules (tRNA) and ribosomes that contain ribosomal RNA molecules (rRNA)
A gene transcribed into RNA is said to be expressed

(Those aren’t my words, They were taken verbatum from week 1 of a graduate coure called ‘Data Systems of the Life Sciences.’)

So, we are working with the All7 table on genes that are associated with more than one of our six diseases. We will use the genes as our nodes, and the edges move from the nodes of diseases if they are associated with the other diseases, and the weights as counts of how many diseases each node is associated with. Lets now recall what fields are needed to build this visual network using visNetwork. We can always attach the cytoband location with a merge of All7 to All but only selecting the gene and cytoband fields.

We need a node data frame with an id, label, and location. The label can be the gene, and the location can be the disease. We can get the id from the row number in the unique genes and locations of the All5 table. The counts in the All5 table can be the weights. Then add in the cytoband information as the location, and the label as the gene. The edges table needs the weights from the counts and the from and to columns.

The nodes will have the id, label (gene), location (cytoband). Lets add the id field to the All5 table that groups by gene and counts the diseases per gene.

nodes <- All5

nodes$id <- as.factor(row.names(nodes))
nodes$label <- nodes$gene
nodes$location <- nodes$disease
nodes$weight <- nodes$geneDiseases

nodes1 <- nodes %>% select(id,label,location,weight)
head(nodes1)

## # A tibble: 6 x 4
##   id    label  location             weight
##   <fct> <fct>  <fct>                 <int>
## 1 1     AHNAK  Alzheimer's Disease       1
## 2 2     CALD1  Alzheimer's Disease       1
## 3 3     DCLK1  Alzheimer's Disease       2
## 4 4     DCLK1  benign UL microarray      2
## 5 5     ELAVL4 Alzheimer's Disease       1
## 6 6     EPHA4  Alzheimer's Disease       1

This adds in a title that will be displayed when hovered over, and also adds in a group, so that the groups will be color coded and grouped by whichever the group is being set to. Zoom in with the interactive display buttons to view the names and cluster groups once the interactive plot is ran.

nodes3 <- All %>% select(gene,cytoband,disease)
nodes4 <- merge(nodes1,nodes3, by.x='label', by.y='gene')
nodes4$group <- nodes4$disease
nodes4$title <- nodes4$cytoband
nodes5 <- nodes4 %>% select(id,label,group,title,location,weight,cytoband)

head(nodes5)

##    id  label                group    title             location weight cytoband
## 1 125   AAK1       stomach cancer   2p13.3       stomach cancer      1   2p13.3
## 2 125   AAK1       stomach cancer   2p13.3       stomach cancer      1   2p13.3
## 3 208 ABCB10 benign UL microarray  1q42.13 benign UL microarray      1  1q42.13
## 4  14  ABCC6         colon cancer 16p13.11         colon cancer      2 16p13.11
## 5  14  ABCC6         colon cancer 16p13.11         colon cancer      2 16p13.11
## 6  14  ABCC6         colon cancer 16p13.11         colon cancer      2 16p13.11

The edges will have the from (id), to (disease), label, weight (counts), and width(fraction of counts for arrow width)

nodes6 <- nodes5[!duplicated(nodes5$id),]

edges <- nodes6 %>% select(id, location, weight,cytoband)
edges$from <- edges$id
edges$width <- edges$weight/3
edges$label <- edges$location

to <- as.data.frame(unique(nodes5$location))
colnames(to) <- 'location'
to$to <- as.factor(row.names(to))

edges2 <- merge(edges,to, by.x='location', by.y='location')
edges3 <- edges2 %>% select(from,to,label,weight, width)

head(edges3)

##   from to               label weight     width
## 1    3  6 Alzheimer's Disease      2 0.6666667
## 2   10  6 Alzheimer's Disease      1 0.3333333
## 3    7  6 Alzheimer's Disease      1 0.3333333
## 4    8  6 Alzheimer's Disease      1 0.3333333
## 5    2  6 Alzheimer's Disease      1 0.3333333
## 6   12  6 Alzheimer's Disease      1 0.3333333

Our nodes table is nodes3 and our edges table is edges3.

visNetwork(nodes=nodes6, edges=edges3, main='Genes Associated with Six Diseases') %>% visEdges(arrows=c('from','middle')) %>%
  visInteraction(navigationButtons=TRUE, dragNodes=FALSE,
                 dragView=TRUE, zoomView = TRUE) %>%
  visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
  visIgraphLayout() %>%
  visLegend

Its great that the visual network worked but many genes are counted repeatedly and makes it cluttered. We should use the table with 57 genes of those genes that had an assoication with more than one disease in table All7. But regardless of the clutter we see that there are six clusters around the six disease types. When you zoom in on the clusters, the cluster with the most links to other diseases is the Alzheimer cluster that is associated with both UL studies, colon cancer, and stomach cancer, but not with pancreatic cancer. The UL microarray study was associated with genes in pancreatic cancer and Alzheimer disease. This is what the chart showed when we looked at it earlier, the All7 data table.

Here is the All7 datatable as a refresher of the connections seen above in the visNetwork link analysis plot.

All7_DT3 <- datatable(data=All7, rownames=FALSE,  width = 800, height = 700,
                      extensions=c('Buttons','Responsive','FixedColumns'),
                      #filter=list(position='top'),
                      options=list(pageLength=10,
                        dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,fixedColumns = TRUE,
                        buttons=c('colvis','csv'),
                        language=list(sSearch='Filter:')
                        )
                      )
                  
All7_DT3

The above table All7 would probably make a much better visual network analysis to view those specific genes associated with multiple diseases. In this case, five diseases, of which two were different media type for the UL samples as beadchip or microarray. These diseases were selected because they had sequence information attached to their GEO sample and platform data to make our targeted gene selections by copy number variants of genes as well as fold change values in the 5th and 95th percentile approximately. These genes out of tens of thousands fit those threshold values.

nodes10 <- All7 %>% select(gene, disease)
nodes10$id <- row.names(nodes10)
nodes10$label <- nodes10$gene
nodes10$location <- nodes10$disease
nodes10$title <- nodes10$disease

nodes11 <- nodes10 %>% select(id,label,title,location)
head(nodes11,10)

## # A tibble: 10 x 4
##    id    label  title                location            
##    <chr> <fct>  <fct>                <fct>               
##  1 1     DMD    colon cancer         colon cancer        
##  2 2     DMD    pancreatic cancer    pancreatic cancer   
##  3 3     DMD    benign UL microarray benign UL microarray
##  4 4     HYDIN  colon cancer         colon cancer        
##  5 5     HYDIN  pancreatic cancer    pancreatic cancer   
##  6 6     HYDIN  benign UL microarray benign UL microarray
##  7 7     KCNIP4 colon cancer         colon cancer        
##  8 8     KCNIP4 pancreatic cancer    pancreatic cancer   
##  9 9     KCNIP4 benign UL beadchip   benign UL beadchip  
## 10 10    DCLK1  Alzheimer's Disease  Alzheimer's Disease

edges10 <- All7 %>% select(gene,disease, geneDiseases)
edges10$from <- row.names(All7)

to <- as.data.frame(unique(All7$disease))
colnames(to) <- 'disease'
to$to <- as.factor(row.names(to))

edges11 <- merge(to, edges10, by.x='disease', by.y='disease')
edges11$label <- edges11$disease
edges11$weight <- edges11$geneDiseases
edges11$width <- edges11$weight/2

edges12 <- edges11 %>% select(from, to, label, weight, width)
head(edges12,10)

##    from to               label weight width
## 1    10  5 Alzheimer's Disease      2   1.0
## 2    23  4  benign UL beadchip      2   1.0
## 3    39  4  benign UL beadchip      2   1.0
## 4    17  4  benign UL beadchip      2   1.0
## 5    49  4  benign UL beadchip      2   1.0
## 6    41  4  benign UL beadchip      2   1.0
## 7    21  4  benign UL beadchip      2   1.0
## 8     9  4  benign UL beadchip      3   1.5
## 9    27  4  benign UL beadchip      2   1.0
## 10   33  4  benign UL beadchip      2   1.0

visNetwork(nodes=nodes11, edges=edges12, main='Mulitple Disease Associations of 27 Unique Genes') %>% visEdges(arrows=c('from','middle','to')) %>%
  visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
                 dragView=TRUE, zoomView = TRUE) %>%
  visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
  visIgraphLayout()

Lets compare this with the underlying data of our datatable All7_DT3.

All7_DT3

Lets see what this looks like to link the genes from cytoband location to other diseases in our link analysis.

all10 <- All %>% select(gene,cytoband,cnv,disease,foldChangeValue)
all10$foldChangeValue <- round(all10$foldChangeValue,1)
all11 <- all10[!duplicated(all10),]
row.names(all11) <- NULL
all12 <- all11[with(all11, order(cytoband)),]
row.names(all12) <- NULL
head(all12,10)

##      gene cytoband cnv              disease foldChangeValue
## 1   AHNAK  11q12.3   5  Alzheimer's Disease             1.1
## 2  ASRGL1  11q12.3   4 benign UL microarray             0.5
## 3   SCN3B  11q24.1   3  Alzheimer's Disease             0.9
## 4   DCLK1  13q13.3   4  Alzheimer's Disease             0.9
## 5   DCLK1  13q13.3   4 benign UL microarray             1.4
## 6   PRKCB  16p12.1   3  Alzheimer's Disease             0.9
## 7  PNMAL1 19q13.32   3  Alzheimer's Disease             0.9
## 8   HIF3A 19q13.32   5    pancreatic cancer             1.7
## 9   HIF3A 19q13.32   5    pancreatic cancer             0.3
## 10  HIF3A 19q13.32   4 benign UL microarray             0.7

Lets rename and rearrange the fields above.

all12$weight <- all12$cnv
all12$width <- all12$foldChangeValue
all12$id <- as.factor(row.names(all12))
all12$label <- all12$gene
all12$title <- all12$cytoband
all12$group <- all12$disease
all12$receiveLocation <- all12$disease
all12$sendLocation <- all12$gene
all12$from <- all12$id

to <- as.data.frame(unique(all12$receiveLocation))
colnames(to) <- 'receiveLocation'
to$to <- row.names(to)

all12b <- merge(all12,to, by.x='receiveLocation', by.y='receiveLocation')

all13 <- all12b %>% select(id,label,title,group,sendLocation,receiveLocation,weight,width,from,to)

head(all13,10)

##    id    label    title               group sendLocation     receiveLocation
## 1   1    AHNAK  11q12.3 Alzheimer's Disease        AHNAK Alzheimer's Disease
## 2  15    EPHA4   2q36.1 Alzheimer's Disease        EPHA4 Alzheimer's Disease
## 3   3    SCN3B  11q24.1 Alzheimer's Disease        SCN3B Alzheimer's Disease
## 4   4    DCLK1  13q13.3 Alzheimer's Disease        DCLK1 Alzheimer's Disease
## 5  18    CALD1     7q33 Alzheimer's Disease        CALD1 Alzheimer's Disease
## 6   6    PRKCB  16p12.1 Alzheimer's Disease        PRKCB Alzheimer's Disease
## 7   7   PNMAL1 19q13.32 Alzheimer's Disease       PNMAL1 Alzheimer's Disease
## 8  12 KIAA1107   1p22.1 Alzheimer's Disease     KIAA1107 Alzheimer's Disease
## 9  13   ELAVL4     1p33 Alzheimer's Disease       ELAVL4 Alzheimer's Disease
## 10 14     GAD1   2q31.1 Alzheimer's Disease         GAD1 Alzheimer's Disease
##    weight width from to
## 1       5   1.1    1  1
## 2       3   0.9   15  1
## 3       3   0.9    3  1
## 4       4   0.9    4  1
## 5       4   1.1   18  1
## 6       3   0.9    6  1
## 7       3   0.9    7  1
## 8       3   0.9   12  1
## 9       3   0.9   13  1
## 10      3   0.9   14  1

Build the nodes and edges tables. The weight in this set is the gene CNVs per disease. Previously it was the number of diseases each gene was associated with.

nodes15 <- all13 %>% select(id,label,title, group,sendLocation)
edges14 <- all13 %>% select(from,to,receiveLocation,weight,width)
edges14$label <- edges14$receiveLocation
edges15 <- edges14 %>% select(from,to,label,weight, width)

visNetwork(nodes=nodes15, edges=edges15, main='Cytoband Nodes and Disease Association') %>% visEdges(arrows=c('from','middle','to')) %>%
  visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
                 dragView=TRUE, zoomView = TRUE) %>%
  visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
  visIgraphLayout() %>%
  visLegend(ncol=2)

Lets move some stuff around and see what happens.

all10 <- All %>% select(gene,cytoband,cnv,disease,foldChangeValue)
all10$foldChangeValue <- round(all10$foldChangeValue,1)
all11 <- all10[!duplicated(all10),]
row.names(all11) <- NULL
all12 <- all11[with(all11, order(cytoband)),]
row.names(all12) <- NULL
head(all12,10)

##      gene cytoband cnv              disease foldChangeValue
## 1   AHNAK  11q12.3   5  Alzheimer's Disease             1.1
## 2  ASRGL1  11q12.3   4 benign UL microarray             0.5
## 3   SCN3B  11q24.1   3  Alzheimer's Disease             0.9
## 4   DCLK1  13q13.3   4  Alzheimer's Disease             0.9
## 5   DCLK1  13q13.3   4 benign UL microarray             1.4
## 6   PRKCB  16p12.1   3  Alzheimer's Disease             0.9
## 7  PNMAL1 19q13.32   3  Alzheimer's Disease             0.9
## 8   HIF3A 19q13.32   5    pancreatic cancer             1.7
## 9   HIF3A 19q13.32   5    pancreatic cancer             0.3
## 10  HIF3A 19q13.32   4 benign UL microarray             0.7

Lets rename and rearrange the fields above.

all12$weight <- all12$cnv
all12$width <- all12$foldChangeValue
all12$id <- as.factor(row.names(all12))
all12$label <- all12$disease
all12$receiveLocation <- all12$gene
all12$sendLocation <- all12$disease
all12$from <- all12$id

to <- as.data.frame(unique(all12$receiveLocation))
colnames(to) <- 'receiveLocation'
to$to <- row.names(to)

all12b <- merge(all12,to, by.x='receiveLocation', by.y='receiveLocation')

all13 <- all12b %>% select(id,label,sendLocation,receiveLocation,weight,width,from,to)

head(all13,10)

##     id                label         sendLocation receiveLocation weight  width
## 1  393       stomach cancer       stomach cancer            AAK1      6    0.7
## 2  392       stomach cancer       stomach cancer            AAK1      6    0.8
## 3  545 benign UL microarray benign UL microarray          ABCB10      4    0.7
## 4   72         colon cancer         colon cancer           ABCC6      6    3.5
## 5   73         colon cancer         colon cancer           ABCC6      6 -118.3
## 6   74         colon cancer         colon cancer           ABCC6      6   -3.9
## 7   75         colon cancer         colon cancer           ABCC6      6  -37.3
## 8   76    pancreatic cancer    pancreatic cancer           ABCC6      5    2.0
## 9   77    pancreatic cancer    pancreatic cancer           ABCC6      5    0.4
## 10 136         colon cancer         colon cancer           ABCG1      6   17.8
##    from  to
## 1   393 152
## 2   392 152
## 3   545 266
## 4    72  34
## 5    73  34
## 6    74  34
## 7    75  34
## 8    76  34
## 9    77  34
## 10  136  58

Build the nodes table.

nodes15 <- all13 %>% select(id,label,sendLocation)
edges14 <- all13 %>% select(from,to,receiveLocation,weight,width)
edges14$label <- edges14$receiveLocation
edges15 <- edges14 %>% select(from,to,label,weight, width)

visNetwork(nodes=nodes15, edges=edges15, main='Disease as Nodes and Genes as Edges') %>% visEdges(arrows=c('from','middle','to')) %>%
  visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
                 dragView=TRUE, zoomView = TRUE) %>%
  visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
  visIgraphLayout()

Gene Sequence Data for Link Analysis

Janis Corona

3/27/2020