RNA-seq and Microarray Data Integration Survival Analysis

R Markdown

This R Markdown gives step-by-step explanations of the methods we used in the project.

First, we loaded all packages we would need for the analysis. (R version 3.3.3)

##Commented out for purpose of time
#Install needed packages and load them into R
#install.packages("randomForest", repos="http://cran.us.r-project.org")
#install.packages("kknn", repos="http://cran.us.r-project.org")
#install.packages("gbm", repos="http://cran.us.r-project.org")
library(mlr, quietly = TRUE)
library(dplyr, quietly = TRUE)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr, quietly = TRUE)
library(stringr, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(limma, quietly = TRUE)

## 
## Attaching package: 'limma'

## The following object is masked from 'package:ParamHelpers':
## 
##     isNumeric

library(DESeq2, quietly = TRUE)

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following object is masked from 'package:limma':
## 
##     plotMA

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:ParamHelpers':
## 
##     isEmpty

## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums

## 
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, regroup, slice

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

library(genefilter, quietly = TRUE)

## 
## Attaching package: 'genefilter'

## The following object is masked from 'package:readr':
## 
##     spec

RNA-seq Data Preparation Before we began analysis, there was a great deal of data preparation needed. The data for this project was provided by CAMDA as part of the Neuroblastoma Data Integration Challenge. We were given RNA-seq average expression levels for 498 children neuroblastoma patients see Zhang et al for more details. The file we read in was a filtered file from the original, with samples in each row and genes (by gene name) in each column, with the exception of the first column, which is the class label (given to us in the clinical data).

#Prep rna seq expression data
rna_expr <- as.data.frame(read_tsv("Documents/BIO465/nb/RNA_GENE_EXPRESSION.txt"), quiet = TRUE)

## Warning: Missing column names filled in: 'X28434' [28434]

## Warning: Duplicated column names deduplicated: 'RPS21' => 'RPS21_1' [109],
## 'INS-IGF2' => 'INS-IGF2_1' [168], 'HSD3B2' => 'HSD3B2_1' [533], 'MYL6B'
## => 'MYL6B_1' [584], 'NME1-NME2' => 'NME1-NME2_1' [604], 'CSNK1E' =>
## 'CSNK1E_1' [988], 'KIDINS220' => 'KIDINS220_1' [1207], 'RNF213' =>
## 'RNF213_1' [1519], 'THBS2' => 'THBS2_1' [1655], 'TPM1' => 'TPM1_1' [1681],
## 'ACTN1' => 'ACTN1_1' [1775], 'AMZ2' => 'AMZ2_1' [1818], 'C22orf32'
## => 'C22orf32_1' [1990], 'CRISPLD2' => 'CRISPLD2_1' [2184], 'FLJ45340'
## => 'FLJ45340_1' [2426], 'IFI44L' => 'IFI44L_1' [2598], 'IGF2R' =>
## 'IGF2R_1' [2605], 'MAFIP' => 'MAFIP_1' [2849], 'MT1X' => 'MT1X_1' [2942],
## 'PNMA5' => 'PNMA5_1' [3166], 'RNPS1' => 'RNPS1_1' [3331], 'SMAP2' =>
## 'SMAP2_1' [3471], 'WNK1' => 'WNK1_1' [3771], 'AKR1C2' => 'AKR1C2_1' [3909],
## 'BRD1' => 'BRD1_1' [4120], 'C1orf21' => 'C1orf21_1' [4209], 'C7orf36'
## => 'C7orf36_1' [4266], 'DNAH17' => 'DNAH17_1' [4702], 'FKBP5' =>
## 'FKBP5_1' [4950], 'GOSR2' => 'GOSR2_1' [5093], 'ICA1L' => 'ICA1L_1' [5251],
## 'INF2' => 'INF2_1' [5299], 'LPHN2' => 'LPHN2_1' [5630], 'NFYA' =>
## 'NFYA_1' [5929], 'QSOX1' => 'QSOX1_1' [6364], 'RGMA' => 'RGMA_1' [6460],
## 'RNF130' => 'RNF130_1' [6486], 'SPON2' => 'SPON2_1' [6797], 'TBC1D9'
## => 'TBC1D9_1' [6924], 'TINAGL1' => 'TINAGL1_1' [6973], 'TMEM189-UBE2V1'
## => 'TMEM189-UBE2V1_1' [7011], 'ZFHX2' => 'ZFHX2_1' [7337], 'ZNF195'
## => 'ZNF195_1' [7354], 'AIG1' => 'AIG1_1' [7504], 'ANKHD1-EIF4EBP3' =>
## 'ANKHD1-EIF4EBP3_1' [7537], 'ANKRD19' => 'ANKRD19_1' [7540], 'C1orf86'
## => 'C1orf86_1' [7843], 'C8orf83' => 'C8orf83_1' [7924], 'CD163L1' =>
## 'CD163L1_1' [8008], 'CD58' => 'CD58_1' [8013], 'CLN8' => 'CLN8_1' [8119],
## 'CSTF3' => 'CSTF3_1' [8212], 'CTSL2' => 'CTSL2_1' [8221], 'DNAJC25-
## GNG10' => 'DNAJC25-GNG10_1' [8340], 'ELF1' => 'ELF1_1' [8441],
## 'FRMD4B' => 'FRMD4B_1' [8701], 'GABPB1' => 'GABPB1_1' [8721], 'GEMIN6'
## => 'GEMIN6_1' [8760], 'HS2ST1' => 'HS2ST1_1' [8966], 'LARP4' =>
## 'LARP4_1' [9224], 'LEPROT' => 'LEPROT_1' [9245], 'LOC100132288' =>
## 'LOC100132288_1' [9311], 'METT10D' => 'METT10D_1' [9707], 'NFE2L2'
## => 'NFE2L2_1' [9939], 'NPIPL3' => 'NPIPL3_1' [9983], 'PALM2-
## AKAP2' => 'PALM2-AKAP2_1' [10091], 'PANK1' => 'PANK1_1' [10095],
## 'PMS2CL' => 'PMS2CL_1' [10264], 'POLR1A' => 'POLR1A_1' [10286],
## 'RHOF' => 'RHOF_1' [10540], 'SDHC' => 'SDHC_1' [10680], 'SEC16B'
## => 'SEC16B_1' [10686], 'SIRPG' => 'SIRPG_1' [10765], 'SNAP47'
## => 'SNAP47_1' [10877], 'SYS1-DBNDD2' => 'SYS1-DBNDD2_1' [11010],
## 'TBC1D3F' => 'TBC1D3F_1' [11045], 'TDRKH' => 'TDRKH_1' [11071],
## 'TMEM111' => 'TMEM111_1' [11131], 'USP6' => 'USP6_1' [11366],
## 'VPS13A' => 'VPS13A_1' [11388], 'ZNF254' => 'ZNF254_1' [11555],
## 'ZNF280D' => 'ZNF280D_1' [11564], 'ALKBH3' => 'ALKBH3_1' [11765],
## 'CA5BP' => 'CA5BP_1' [12097], 'DDTL' => 'DDTL_1' [12404], 'DIS3L2'
## => 'DIS3L2_1' [12432], 'FBXL20' => 'FBXL20_1' [12628], 'GRIK1'
## => 'GRIK1_1' [12809], 'GTF2H2' => 'GTF2H2_1' [12824], 'KLHDC4' =>
## 'KLHDC4_1' [13048], 'LOC100132062' => 'LOC100132062_1' [13176],
## 'LOC100132287' => 'LOC100132287_1' [13182], 'LOC100133331' =>
## 'LOC100133331_1' [13199], 'LOC100190938' => 'LOC100190938_1' [13205],
## 'LOC641298' => 'LOC641298_1' [13478], 'MEF2B' => 'MEF2B_1' [13651],
## 'MTCP1NB' => 'MTCP1NB_1' [13721], 'MTIF3' => 'MTIF3_1' [13726],
## 'MTPAP' => 'MTPAP_1' [13731], 'NBPF11' => 'NBPF11_1' [13762], 'NME6'
## => 'NME6_1' [13807], 'PALMD' => 'PALMD_1' [13905], 'PLGLB1' =>
## 'PLGLB1_1' [14037], 'PPAN-P2RY11' => 'PPAN-P2RY11_1' [14067], 'RAD9A'
## => 'RAD9A_1' [14168], 'RGPD6' => 'RGPD6_1' [14211], 'SDHAP3' =>
## 'SDHAP3_1' [14344], 'SRR' => 'SRR_1' [14529], 'TEX14' => 'TEX14_1' [14627],
## 'TMEM144' => 'TMEM144_1' [14669], 'ZNF506' => 'ZNF506_1' [15009], 'AIM2' =>
## 'AIM2_1' [15124], 'APTX' => 'APTX_1' [15154], 'BCAR3' => 'BCAR3_1' [15208],
## 'C22orf43' => 'C22orf43_1' [15312], 'C4orf21' => 'C4orf21_1' [15331],
## 'CCDC144B' => 'CCDC144B_1' [15409], 'CCDC29' => 'CCDC29_1' [15417],
## 'FAM84B' => 'FAM84B_1' [15670], 'FLJ39061' => 'FLJ39061_1' [15711],
## 'IFT88' => 'IFT88_1' [15905], 'LOC100133150' => 'LOC100133150_1' [16141],
## 'LOC115110' => 'LOC115110_1' [16453], 'LOC150381' => 'LOC150381_1' [16465],
## 'LOC390627' => 'LOC390627_1' [16517], 'LOC727751' => 'LOC727751_1' [16625],
## 'MBNL3' => 'MBNL3_1' [16739], 'RNU4-2' => 'RNU4-2_1' [17091], 'SNRPG'
## => 'SNRPG_1' [17257], 'TNFSF12-TNFSF13' => 'TNFSF12-TNFSF13_1' [17404],
## 'TOR1AIP2' => 'TOR1AIP2_1' [17411], 'ZFP91-CNTF' => 'ZFP91-CNTF_1' [17536],
## 'ZNF141' => 'ZNF141_1' [17547], 'ZNF322B' => 'ZNF322B_1' [17576],
## 'ZNF568' => 'ZNF568_1' [17624], 'CDC42' => 'CDC42_1' [17948],
## 'ELMOD1' => 'ELMOD1_1' [18063], 'FAM182B' => 'FAM182B_1' [18097],
## 'FAM9B' => 'FAM9B_1' [18117], 'FOLH1B' => 'FOLH1B_1' [18159],
## 'NBPF8' => 'NBPF8_1' [19218], 'PLGLA' => 'PLGLA_1' [19327], 'PPCDC'
## => 'PPCDC_1' [19337], 'SERHL2' => 'SERHL2_1' [19482], 'SLC34A3'
## => 'SLC34A3_1' [19515], 'SLFN12' => 'SLFN12_1' [19525], 'SSX6' =>
## 'SSX6_1' [19570], 'STON1-GTF2A1L' => 'STON1-GTF2A1L_1' [19575], 'TDGF1'
## => 'TDGF1_1' [19603], 'TRIM6-TRIM34' => 'TRIM6-TRIM34_1' [19654],
## 'C1orf127' => 'C1orf127_1' [19890], 'C2orf84' => 'C2orf84_1' [19916],
## 'CYP4B1' => 'CYP4B1_1' [20034], 'EFHB' => 'EFHB_1' [20082], 'EIF5AL1' =>
## 'EIF5AL1_1' [20085], 'LOC727849' => 'LOC727849_1' [20975], 'LOC729051'
## => 'LOC729051_1' [21009], 'MGC87042' => 'MGC87042_1' [21067],
## 'SCGB1C1' => 'SCGB1C1_1' [21364], 'SRP9' => 'SRP9_1' [21417],
## 'ZNF534' => 'ZNF534_1' [21539], 'BLID' => 'BLID_1' [21614], 'EIF4A1'
## => 'EIF4A1_1' [21792], 'KRT86' => 'KRT86_1' [22004], 'LOC100133091'
## => 'LOC100133091_1' [22173], 'RNU12' => 'RNU12_1' [22888], 'TAGLN2'
## => 'TAGLN2_1' [23069], 'UMODL1' => 'UMODL1_1' [23123], 'HSD3B1' =>
## 'HSD3B1_1' [23455], 'LOC402160' => 'LOC402160_1' [23904], 'LOC644339' =>
## 'LOC644339_1' [23944], 'LOC728175' => 'LOC728175_1' [24002], 'LOC729862'
## => 'LOC729862_1' [24043], 'LOC91149' => 'LOC91149_1' [24053], 'OR2C3'
## => 'OR2C3_1' [24145], 'RNU5A' => 'RNU5A_1' [24240], 'LOC285045' =>
## 'LOC285045_1' [25097], 'SFTPA2' => 'SFTPA2_1' [25500], 'TAS2R1' =>
## 'TAS2R1_1' [25551], 'LOC348840' => 'LOC348840_1' [25889], 'BPY2'
## => 'BPY2_1' [26594], 'BPY2B' => 'BPY2B_1' [26596], 'BPY2C' =>
## 'BPY2C_1' [26598], 'DEFB110' => 'DEFB110_1' [26684], 'HIST1H2BA'
## => 'HIST1H2BA_1' [26766], 'LOC389834' => 'LOC389834_1' [27303],
## 'LOC728728' => 'LOC728728_1' [27464], 'MOXD2' => 'MOXD2_1' [27579],
## 'OR1D4' => 'OR1D4_1' [27672], 'OR1D5' => 'OR1D5_1' [27674], 'SFTPA1' =>
## 'SFTPA1_1' [28033], 'TMPRSS11E' => 'TMPRSS11E_1' [28365]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   NCBIGeneId = col_character(),
##   Class = col_integer(),
##   BPY2 = col_integer(),
##   BPY2_1 = col_integer(),
##   BPY2B = col_integer(),
##   BPY2B_1 = col_integer(),
##   BPY2C = col_integer(),
##   BPY2C_1 = col_integer(),
##   BTF3P14 = col_integer(),
##   C1orf68 = col_integer(),
##   CDY1 = col_integer(),
##   CDY1B = col_integer(),
##   CDY2A = col_integer(),
##   CDY2B = col_integer(),
##   CXorf1 = col_integer(),
##   DEFB106A = col_integer(),
##   DEFB106B = col_integer(),
##   DEFB108P1 = col_integer(),
##   DEFB108P2 = col_integer(),
##   DEFB110 = col_integer()
##   # ... with 386 more columns
## )

## See spec(...) for full column specifications.

rna_expr <- rna_expr[,1:(ncol(rna_expr)-1)]

In addition to the RNA-seq expression level data, we were given microarray data for 145 children patients. We selected for only those patients in our RNA-seq data. The remaining data preparation involved including only wanted columns and fixing invalid column names.

#Keep only samples included in microarray data
microarray_samples <- scan("Documents/BIO465/nb/microarray_ids.txt", what="", sep="\n", quiet = TRUE)
rna_expr_data_to_keep <- rna_expr[rna_expr$NCBIGeneId %in% microarray_samples, ]
rna_expr_data_filtered <- rna_expr_data_to_keep[,-1]
colnames(rna_expr_data_filtered) <- lapply(colnames(rna_expr_data_filtered), function(x) str_replace_all(x, ';', '_'))
rna_expr_data_filtered_practice <- rna_expr_data_filtered
colnames(rna_expr_data_filtered_practice) <- make.names(colnames(rna_expr_data_filtered_practice), unique = TRUE)

Because the RNA-seq data included expression levels for > 60,000 genes, we had to trim the dataset for the purposes of analysis. We used the genefilter package to keep genes with an expression level > 10 for at least .25 of the samples and to keep genes with variance across samples in the >.75 interquartile range (IQR).

#Keep genes with at least .25 of the samples expression levels >=10. 
f <- pOverA(0.25,10)
ffun <- filterfun(f)
rna_expr_data <- t(rna_expr_data_filtered_practice[,-1])
flrGene1 <- genefilter(rna_expr_data, ffun)
f1.rna <- rna_expr_data[flrGene1, ]

#Keep genes with variance across the samples in the >.75 IQR
f2 <- function(x) {IQR(x) > 0.75}
ffun2 <- filterfun(f2)
sprdGns1 <- genefilter(f1.rna, ffun2)
f3rna <- f1.rna[sprdGns1, ]
rna_expr1 <- t(f3rna)
#Append class label to expression data
classlabel <- rna_expr_data_filtered_practice$Class
rna_expr1 <- cbind(Class = classlabel, rna_expr1)
#Fix column names and types
colnames(rna_expr) <- make.names(colnames(rna_expr1), unique = TRUE)
rna_expr_filtered <- as.data.frame(rna_expr1)
rna_expr_filtered[,1] <- sapply(rna_expr_filtered[,1], function(x) as.integer(x))

Microarray Expression Levels

The next step of our project required us to analyze raw microarray expression levels. The data for this step was also provided by CAMDA. It is composed Agilent44k microarray gene expression profiles for clinical endpoint prediction of 145 of the previously mentioned 498 children patients. We used the limma package (version 3.30.13) to analyze the microarray data. Documentation on the limma package can be found here. We based our analysis on detailed instructions provided by the Mattick Lab.

Step 1: We read in tab-delimited targets file with the following columns: NCBI Id, Filename, Condition (either favorable or nonfavorable)

targets <- readTargets("Documents/BIO465/nb/targets-new.txt")

Step 2: We used the read.maimages function to read in the microarray files. Then we corrected for background noise and normalized the arrays using the backgroundCorrect and normalizeBetweenArrays functions.

x <- read.maimages(targets, path="/Users/kristencockriel/Documents/BIO465/nb/microarray", source="agilent.median", green.only = TRUE, verbose = FALSE)
y <- backgroundCorrect(x, method="normexp", offset=16, verbose= FALSE)
#This provides log2 values
y <- normalizeBetweenArrays(y, method="quantile")
y.ave <- avereps(y, ID=y$genes$ProbeName)

Step 3: We filtered the microarray data using the genefilter package.

#Keep genes with at least .25 of the samples expression levels >=10. 
flrGene <- genefilter(y.ave$E, ffun)
f1.y <- y.ave$E[flrGene, ]

#Keep genes with variance across the samples in the >.75 IQR
ffun1 <- filterfun(f2)
sprdGns <- genefilter(f1.y, ffun1)
f3y <- f1.y[sprdGns, ]
ma_expr <- t(f3y)
#Append class label to expression data
ma_expr <- cbind(Class = classlabel, ma_expr)
#Fix column names and types
colnames(ma_expr) <- make.names(colnames(ma_expr), unique = TRUE)
ma_expr_filtered <- as.data.frame(ma_expr)
ma_expr_filtered[,1] <- sapply(ma_expr_filtered[,1], function(x) as.integer(x))

Data Integration

The purpose of our project is to determine a combination of the two datasets that provides more accuracy in classification and survival analysis. We used a naive combination of the datasets, simply combining all of the filtered genes from both rna-seq and microarray into one dataset (after normalizing the datasets). Then, we performed similar analyses to those of the individual datasets. We then plotted the data.

rna_de_normalized <- normalizeFeatures(rna_expr_filtered, target = "Class", method = "standardize")
ma_de_normalized <- normalizeFeatures(ma_expr_filtered, target="Class", method = "standardize")
combined_rna_and_microarray <- cbind(rna_de_normalized, ma_de_normalized[,2:ncol(ma_de_normalized)])
combined.DE.data <- combined_rna_and_microarray

#combined.DE.data <- cbind(rna_expr_filtered, ma_expr_filtered[,2:ncol(ma_expr_filtered)])
c_data_to_filter <- t(combined.DE.data[,-1])
f2 <- pOverA(0.50,0)
ffun2 <- filterfun(f2)
flrGene2 <- genefilter(c_data_to_filter, ffun2)
f2.c <- c_data_to_filter[flrGene2, ]
c_expr <- t(f2.c)

c_expr <- cbind(Class = classlabel, c_expr)
#Fix column names and types
colnames(c_expr) <- make.names(colnames(c_expr), unique = TRUE)
c_expr_filtered <- as.data.frame(c_expr)
c_expr_filtered[,1] <- sapply(c_expr_filtered[,1], function(x) as.integer(x))

Fifty-nine Gene Expression Signature

Our last dataset is derived from 59 genes known to impact neuroblastoma clinical endpoints. To create this dataset, we simply selected for those 59 genes from the original rna_seq_filtered_practice dataset.

fifty_nine_genes <- scan("Documents/BIO465/nb/59_gene.txt", what="", sep="\n")
rna_expr_59 <- rna_expr_filtered[, colnames(rna_expr_filtered) %in% fifty_nine_genes]
rna_expr_59 <- cbind(Class = rna_expr_filtered[,1], rna_expr_59)

Classification Analysis

After the data was prepared, we were able to begin the analysis process. We first wanted to determine the accuracy of each dataset in classification algorithms. We tested the datasets five algorithms: Support Vector Machines (svm), Random Forest (rf), Naive Bayes (nb), k-Nearest Neighbor (kknn), Gradient Boosting Machine (gbm). We used the mlr package (documentation can be found here) to apply these machine-learning algorithms to a classification analysis. The data was being classified as either favorable (0) or unfavorable (1) in response to treatment. We recorded performance metrics for the accuracy (acc), area under the curve (auc), true negative rate (tnr), and true positive rate (tpr).

classification_analysis <- function(my_data, algorithm_to_implement, target_name, predict_type){
  task <- makeClassifTask(data = my_data, target = target_name)
  # Create a learner object that indicates which classification algorithm we want to use.
  # Also indicate that we want the algorithm to generate probabilistic predictions.
  # In this example, we are using the Support Vector Machines (SVM) algorithm.
  # For survival, predict_type should either be "response" or "prob"
  learner <- makeLearner(algorithm_to_implement, predict.type = predict_type)
  # Specify the resampling strategy.
  # In this case, we indicate that we want to use cross-validation with 5 folds.
  resampleDesc = makeResampleDesc("CV", iters = 5)
  # Because cross validation relies on random assignments, set a random seed to
  #   ensure that we will get consistent results across multiple executions.
  set.seed(1)
  # Make the predictions (combine the data with the algorithm and resampling strategy).
  # By specifying "show.info=FALSE", we are telling mlr to only display important information
  #   at the console.
  results <- resample(learner, task, resampleDesc, show.info=FALSE)
  return(results)
}


#Perform classification analysis on the following functions: support vector machine, random forest, naive bayes, k-nearest neighbor, and generalized boosting model
#RNAseq results
classif_svm_rna_results <- classification_analysis(rna_expr_filtered, "classif.svm", "Class", "prob")
classif_svm_rna_metrics <- performance(classif_svm_rna_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_rf_rna_results <- classification_analysis(rna_expr_filtered, "classif.randomForest", "Class", "prob")
classif_rf_rna_metrics <- performance(classif_rf_rna_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_nb_rna_results <- classification_analysis(rna_expr_filtered, "classif.naiveBayes", "Class", "prob")
classif_nb_rna_metrics <-  performance(classif_nb_rna_results$pred, measures = list(mlr::acc, auc, tpr, tnr))
 
classif_kknn_rna_results <- classification_analysis(rna_expr_filtered, "classif.kknn", "Class", "prob")

## Loading required package: kknn

classif_kknn_rna_metrics <-  performance(classif_kknn_rna_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

#classif_gbm_rna_results <- classification_analysis(rna_expr_filtered, "classif.gbm", "Class", "prob")
#classif_gbm_rna_metrics <-  performance(classif_gbm_rna_results$pred, measures = list(mlr::acc, auc, tpr, tnr))


#Microarray results
classif_svm_ma_results <- classification_analysis(ma_expr_filtered, "classif.svm", "Class", "prob")
classif_svm_ma_metrics <- performance(classif_svm_ma_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_rf_ma_results <- classification_analysis(ma_expr_filtered, "classif.randomForest", "Class", "prob")
classif_rf_ma_metrics <- performance(classif_rf_ma_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_nb_ma_results <- classification_analysis(ma_expr_filtered, "classif.naiveBayes", "Class", "prob")
classif_nb_ma_metrics <-  performance(classif_nb_ma_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_kknn_ma_results <-classification_analysis(ma_expr_filtered, "classif.kknn", "Class", "prob")
classif_kknn_ma_metrics <-  performance(classif_kknn_ma_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

#classif_gbm_ma_results <- classification_analysis(ma_expr_filtered, "classif.gbm", "Class", "prob")
#classif_gbm_ma_metrics <-  performance(classif_gbm_ma_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

#Combined results
classif_svm_c_results <- classification_analysis(c_expr_filtered, "classif.svm", "Class", "prob")
classif_svm_c_metrics <- performance(classif_svm_c_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_rf_c_results <- classification_analysis(c_expr_filtered, "classif.randomForest", "Class", "prob")
classif_rf_c_metrics <- performance(classif_rf_c_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_nb_c_results <- classification_analysis(c_expr_filtered, "classif.naiveBayes", "Class", "prob")
classif_nb_c_metrics <-  performance(classif_nb_c_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_kknn_c_results <-classification_analysis(c_expr_filtered, "classif.kknn", "Class", "prob")
classif_kknn_c_metrics <-  performance(classif_kknn_c_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

#classif_gbm_c_results <- classification_analysis(c_expr_filtered, "classif.gbm", "Class", "prob")
#classif_gbm_c_metrics <-  performance(classif_gbm_c_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

#59 results

classif_svm_59_results <- classification_analysis(rna_expr_59, "classif.svm", "Class", "prob")
classif_svm_59_metrics <- performance(classif_svm_59_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_rf_59_results <- classification_analysis(rna_expr_59, "classif.randomForest", "Class", "prob")
classif_rf_59_metrics <- performance(classif_rf_59_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_nb_59_results <- classification_analysis(rna_expr_59, "classif.naiveBayes", "Class", "prob")
classif_nb_59_metrics <-  performance(classif_nb_59_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

classif_kknn_59_results <-classification_analysis(rna_expr_59, "classif.kknn", "Class", "prob")
classif_kknn_59_metrics <-  performance(classif_kknn_59_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

#classif_gbm_59_results <- classification_analysis(rna_expr_59, "classif.gbm", "Class", "prob")
#classif_gbm_59_metrics <-  performance(classif_gbm_59_results$pred, measures = list(mlr::acc, auc, tpr, tnr))

After we obtained results for each of the 5 algorithms, we computed the average accuracy (acc) and area under the curve (auc) of the algorithms with each data set.

getAverage <- function(metric, given_data){
  total <- 0
  count <- length(given_data)
  for (i in 1:count)
  {
    total <- total + given_data[[i]][metric]
  }
  return(total/count)
}
#classif_rna_average_acc <- getAverage("acc", list(classif_svm_rna_metrics, classif_rf_rna_metrics, classif_kknn_rna_metrics, classif_nb_rna_metrics, classif_gbm_rna_metrics))
classif_rna_average_acc <- getAverage("acc", list(classif_svm_rna_metrics, classif_rf_rna_metrics, classif_kknn_rna_metrics, classif_nb_rna_metrics))
#rna_average_auc <- getAverage("auc", list(svm_expr_metrics, rf_expr_metrics, kknn_expr_metrics, nb_expr_metrics, gbm_expr_metrics))

#classif_ma_average_acc <- getAverage("acc", list(classif_svm_ma_metrics, classif_rf_ma_metrics, classif_kknn_ma_metrics, classif_nb_ma_metrics, classif_gbm_ma_metrics))
classif_ma_average_acc <- getAverage("acc", list(classif_svm_ma_metrics, classif_rf_ma_metrics, classif_kknn_ma_metrics, classif_nb_ma_metrics))
#ma_average_auc <-  getAverage("auc", list(svm_ma_metrics, rf_ma_metrics, kknn_ma_metrics, nb_ma_metrics, gbm_ma_metrics))

#classif_combined_average_acc <- getAverage("acc", list(classif_svm_c_metrics, classif_rf_c_metrics, classif_kknn_c_metrics, classif_nb_c_metrics, classif_gbm_c_metrics))
classif_combined_average_acc <- getAverage("acc", list(classif_svm_c_metrics, classif_rf_c_metrics, classif_kknn_c_metrics, classif_nb_c_metrics))

#classif_fifty_nine_average_acc <- getAverage("acc", list(classif_svm_59_metrics, classif_rf_59_metrics, classif_kknn_59_metrics, classif_nb_59_metrics, classif_gbm_59_metrics))
classif_fifty_nine_average_acc <- getAverage("acc", list(classif_svm_59_metrics, classif_rf_59_metrics, classif_kknn_59_metrics, classif_nb_59_metrics))

Plotting Results

We first created graphs using the ggplot2 package to compare the microarray and RNA-seq datasets. We created bar graphs to display both the average performance of the two datasets and the performance of each algorithm with the two datasets.

##CREATE BAR PLOT WITH TWO AVERAGE VALUES
info <- data.frame(data = factor(c("RNAseq", "Microarray"), levels=c("RNAseq", "Microarray")), average = c(classif_rna_average_acc, classif_ma_average_acc))

print(info)

##         data   average
## 1     RNAseq 0.8258621
## 2 Microarray 0.8362069

average_accuracy_plot <- ggplot(data = info, aes(x = data, y=average, fill=data, label=signif(average, digits=4))) + geom_bar(width=.5, colour="black", stat="identity") + guides(fill=FALSE) + xlab("Dataset") + ylab("Average Prediction Accuracy") + ggtitle("RNA-seq and Microarray Classification Accuracy") + scale_fill_manual(values=c("#56B4E9", "#009E73")) + ylim(0,1) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(size = 5, hjust = 0.5, vjust = 3)

average_accuracy_plot

##CREATE LINE GRAPH WITH ACC VALUES FOR EACH ALGORITHM/DATASET 
#classif_rna_acc <- c(classif_svm_rna_metrics["acc"], classif_rf_rna_metrics["acc"], classif_kknn_rna_metrics["acc"], classif_nb_rna_metrics["acc"], classif_gbm_rna_metrics["acc"])
classif_rna_acc <- c(classif_svm_rna_metrics["acc"], classif_rf_rna_metrics["acc"], classif_kknn_rna_metrics["acc"], classif_nb_rna_metrics["acc"])

#classif_ma_acc <- c(classif_svm_ma_metrics["acc"], classif_rf_ma_metrics["acc"], classif_kknn_ma_metrics["acc"], classif_nb_ma_metrics["acc"], classif_gbm_ma_metrics["acc"])
classif_ma_acc <- c(classif_svm_ma_metrics["acc"], classif_rf_ma_metrics["acc"], classif_kknn_ma_metrics["acc"], classif_nb_ma_metrics["acc"])

#acc_values <- data.frame(dataset = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "RNAseq", "Microarray", "Microarray", "Microarray", "Microarray", "Microarray"), algorithm = c("Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine","Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine"), accuracy= c(classif_rna_acc, classif_ma_acc))
acc_values <- data.frame(dataset = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "Microarray", "Microarray", "Microarray", "Microarray"), algorithm = c("Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes"), accuracy= c(classif_rna_acc, classif_ma_acc))

print(acc_values)

##      dataset              algorithm  accuracy
## 1     RNAseq Support Vector Machine 0.8482759
## 2     RNAseq          Random Forest 0.8275862
## 3     RNAseq     k-Nearest Neighbor 0.7724138
## 4     RNAseq            Naive Bayes 0.8551724
## 5 Microarray Support Vector Machine 0.8344828
## 6 Microarray          Random Forest 0.8551724
## 7 Microarray     k-Nearest Neighbor 0.7931034
## 8 Microarray            Naive Bayes 0.8620690

accuracy_plot <- ggplot(data=acc_values, aes(x=algorithm, y=accuracy, fill=dataset)) + geom_bar(position="dodge", stat="identity") + ylab("Accuracy") + theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_fill_manual(values=c("#009E73", "#56B4E9")) + xlab("Algorithm")
#+ ggtitle("RNA-seq and Microarray Classification Algorithms")
accuracy_plot

Next, to be able to compare results, we recreated these graphs with the combined dataset results included.

##CREATE BAR PLOT WITH THREE AVERAGE VALUES
info2 <- data.frame(data = factor(c("RNAseq", "Microarray", "Combined"), levels=c("RNAseq", "Microarray", "Combined")), average = c(classif_rna_average_acc, classif_ma_average_acc, classif_combined_average_acc))

print(info2)

##         data   average
## 1     RNAseq 0.8258621
## 2 Microarray 0.8362069
## 3   Combined 0.8327586

average_accuracy_plot2 <- ggplot(data = info2, aes(x = data, y=average, fill=data, label=signif(average, digits=4))) + geom_bar(width=.5, colour="black", stat="identity") + guides(fill=FALSE) + xlab("Dataset") + ylab("Average Prediction Accuracy") + ggtitle("Combined Data Classification Accuracy") + scale_fill_manual(values=c("#56B4E9", "#009E73", "#FF9999")) + ylim(0,1) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(size = 5, hjust = 0.5, vjust = 3)

average_accuracy_plot2

##CREATE LINE GRAPH WITH ACC VALUES FOR EACH ALGORITHM/DATASET 
#classif_combined_acc <- c(classif_svm_c_metrics["acc"], classif_rf_c_metrics["acc"], classif_kknn_c_metrics["acc"], classif_nb_c_metrics["acc"], classif_gbm_c_metrics["acc"])
classif_combined_acc <- c(classif_svm_c_metrics["acc"], classif_rf_c_metrics["acc"], classif_kknn_c_metrics["acc"], classif_nb_c_metrics["acc"])


#acc_values2 <- data.frame(dataset = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "RNAseq", "Microarray", "Microarray", "Microarray", "Microarray", "Microarray", "Combined", "Combined", "Combined", "Combined", "Combined"), algorithm = c("Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine"), accuracy= c(classif_rna_acc, classif_ma_acc, classif_combined_acc))
acc_values2 <- data.frame(dataset = c("RNAseq", "RNAseq", "RNAseq", "RNAseq",  "Microarray", "Microarray", "Microarray", "Microarray", "Combined", "Combined", "Combined", "Combined"), algorithm = c("Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes"), accuracy= c(classif_rna_acc, classif_ma_acc, classif_combined_acc))

print(acc_values2)

##       dataset              algorithm  accuracy
## 1      RNAseq Support Vector Machine 0.8482759
## 2      RNAseq          Random Forest 0.8275862
## 3      RNAseq     k-Nearest Neighbor 0.7724138
## 4      RNAseq            Naive Bayes 0.8551724
## 5  Microarray Support Vector Machine 0.8344828
## 6  Microarray          Random Forest 0.8551724
## 7  Microarray     k-Nearest Neighbor 0.7931034
## 8  Microarray            Naive Bayes 0.8620690
## 9    Combined Support Vector Machine 0.8482759
## 10   Combined          Random Forest 0.8482759
## 11   Combined     k-Nearest Neighbor 0.7655172
## 12   Combined            Naive Bayes 0.8689655

accuracy_plot2 <- ggplot(data=acc_values2, aes(x=algorithm, y=accuracy, fill=dataset)) + geom_bar(position="dodge", stat="identity") + ylab("Prediction Accuracy") + xlab("Dataset") + ggtitle("Combined Data Classification Algorithms") + theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_fill_manual(values=c("#FF9999", "#009E73", "#56B4E9"))

accuracy_plot2

Finally, we created the same graphs with all four datasets for a comprehensive comparison.

GRAPHS

##CREATE BAR PLOT WITH THREE AVERAGE VALUES
info5 <- data.frame(data = factor(c("RNAseq", "Microarray", "Combined", "Fifty-nine"), levels=c("RNAseq", "Microarray", "Combined", "Fifty-nine")), average = c(classif_rna_average_acc, classif_ma_average_acc, classif_combined_average_acc, classif_fifty_nine_average_acc))

print(info5)

##         data   average
## 1     RNAseq 0.8258621
## 2 Microarray 0.8362069
## 3   Combined 0.8327586
## 4 Fifty-nine 0.8344828

average_accuracy_plot5 <- ggplot(data = info5, aes(x = data, y=average, fill=data, label=signif(average, digits=4))) + geom_bar(width=.5, colour="black", stat="identity") + guides(fill=FALSE) + xlab("Dataset") + ylab("Average Prediction Accuracy") + ggtitle("Average Prediction Accuracy for Gene Expression Data") + scale_fill_manual(values=c("#56B4E9", "#009E73", "#FF9999", "#F8A31B")) + ylim(0,1) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(size = 5, hjust = 0.5, vjust = 3)

average_accuracy_plot5

##CREATE LINE GRAPH WITH ACC VALUES FOR EACH ALGORITHM/DATASET 

#classif_fifty_nine_acc <- c(classif_svm_59_metrics["acc"], classif_rf_59_metrics["acc"], classif_kknn_59_metrics["acc"], classif_nb_59_metrics["acc"], classif_gbm_59_metrics["acc"])
classif_fifty_nine_acc <- c(classif_svm_59_metrics["acc"], classif_rf_59_metrics["acc"], classif_kknn_59_metrics["acc"], classif_nb_59_metrics["acc"])

#acc_values5 <- data.frame(dataset = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "RNAseq", "Microarray", "Microarray", "Microarray", "Microarray", "Microarray", "Combined", "Combined", "Combined", "Combined", "Combined", "Fifty-nine", "Fifty-nine", "Fifty-nine", "Fifty-nine", "Fifty-nine"), algorithm = c("Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Gradient Boosting Machine"), accuracy= c(rna_acc, ma_acc, combined_acc, fifty_nine_acc))
acc_values5 <- data.frame(dataset = c( "RNAseq", "RNAseq", "RNAseq", "RNAseq", "Microarray", "Microarray", "Microarray", "Microarray", "Combined", "Combined", "Combined",  "Combined", "Fifty-nine", "Fifty-nine", "Fifty-nine", "Fifty-nine"), algorithm = c("Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes",  "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes", "Support Vector Machine", "Random Forest", "k-Nearest Neighbor", "Naive Bayes"), accuracy= c(classif_rna_acc, classif_ma_acc, classif_combined_acc, classif_fifty_nine_acc))

print(acc_values5)

##       dataset              algorithm  accuracy
## 1      RNAseq Support Vector Machine 0.8482759
## 2      RNAseq          Random Forest 0.8275862
## 3      RNAseq     k-Nearest Neighbor 0.7724138
## 4      RNAseq            Naive Bayes 0.8551724
## 5  Microarray Support Vector Machine 0.8344828
## 6  Microarray          Random Forest 0.8551724
## 7  Microarray     k-Nearest Neighbor 0.7931034
## 8  Microarray            Naive Bayes 0.8620690
## 9    Combined Support Vector Machine 0.8482759
## 10   Combined          Random Forest 0.8482759
## 11   Combined     k-Nearest Neighbor 0.7655172
## 12   Combined            Naive Bayes 0.8689655
## 13 Fifty-nine Support Vector Machine 0.8413793
## 14 Fifty-nine          Random Forest 0.8344828
## 15 Fifty-nine     k-Nearest Neighbor 0.8068966
## 16 Fifty-nine            Naive Bayes 0.8551724

accuracy_plot5 <- ggplot(data=acc_values5, aes(x=algorithm, y=accuracy, fill=dataset)) + geom_bar(position="dodge", stat="identity") + ylab("Accuracy") + xlab("Dataset")  + theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_fill_manual(values=c("#FF9999", "#F8A31B", "#009E73", "#56B4E9" ))
#+ ggtitle("Accuracy of Classification Algorithms")

accuracy_plot5

Survival Analysis

To better evaluate the results of our data integration, we used the same datasets to perform survival analysis. We were given overal survival time (in days) in the clinical data provided by CAMDA. We appended this information to each dataset to create new datasets for the analysis.

#Read in survival time information and filter to keep only survival times for the 145 patients used in analysis
survival_time <- as.data.frame(read_tsv("Documents/BIO465/nb/sample_ids_with_survival_time.txt"))

## Parsed with column specification:
## cols(
##   ID = col_character(),
##   OSDay = col_integer()
## )

survival_times_to_keep <- survival_time[survival_time$ID %in% microarray_samples, ]

#Append survival times to all four datasets
rna_survival <- cbind(Time = survival_times_to_keep[,2], rna_expr_filtered)
ma_survival <- cbind(Time = survival_times_to_keep[,2], ma_expr_filtered)
combined_survival <- cbind(Time = survival_times_to_keep[,2], c_expr_filtered) 
rna_expr_59_survival <- cbind(Time = survival_times_to_keep[,2], rna_expr_59)

Again, we used the mlr package, and applied four different algorithms to the analysis: Random Forest (rf), Survival Tree (rp), GLM with Regularization (glm), and Gradient Boosting Machine (gbm). We recorded the performance of the algorithms with the metric “cindex”, which is the “fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects that can actually be ordered. In other words, it is the probability of concordance between the predicted and the observed survival.” citation here

survival_analysis <- function(my_data, algorithm_to_implement, target_names){
  task <- makeSurvTask(data=my_data, target = target_names)
  learner <- makeLearner(algorithm_to_implement, predict.type = "response")
  resampleDesc = makeResampleDesc("CV", iters = 5)
  set.seed(1)
  results <- resample(learner, task, resampleDesc, show.info=FALSE)
  return(results)
}

#Perform survival analysis on the 4 algorithms
#RNA-seq results

#surv_cox_rna_results <- survival_analysis(rna_survival, "surv.coxph", c("Time", "Class"))
#cox_rna_metrics <- performance(surv_cox_rna_results$pred, measures = mlr::cindex)

surv_rf_rna_results <- survival_analysis(rna_survival, "surv.randomForestSRC", c("Time", "Class"))
rf_rna_metrics <- performance(surv_rf_rna_results$pred, measures = mlr::cindex)

surv_rp_rna_results <- survival_analysis(rna_survival, "surv.rpart", c("Time", "Class"))
rp_rna_metrics <-  performance(surv_rp_rna_results$pred, measures = mlr::cindex)

surv_glm_rna_results <- survival_analysis(rna_survival, "surv.glmnet", c("Time", "Class"))
glm_rna_metrics <-  performance(surv_glm_rna_results$pred, measures = mlr::cindex)

surv_gbm_rna_results <- survival_analysis(rna_survival, "surv.gbm", c("Time", "Class"))
gbm_rna_metrics <-  performance(surv_gbm_rna_results$pred, measures = mlr::cindex)


#Microarray results

#surv_cox_ma_results <- survival_analysis(ma_survival, "surv.coxph", c("Time", "Class"))
#cox_ma_metrics <- performance(surv_cox_ma_results$pred, measures = mlr::cindex)

surv_rf_ma_results <- survival_analysis(ma_survival, "surv.randomForestSRC", c("Time", "Class"))
rf_ma_metrics <- performance(surv_rf_ma_results$pred, measures = mlr::cindex)

surv_rp_ma_results <- survival_analysis(ma_survival, "surv.rpart", c("Time", "Class"))
rp_ma_metrics <-  performance(surv_rp_ma_results$pred, measures = mlr::cindex)

surv_glm_ma_results <-survival_analysis(ma_survival, "surv.glmnet", c("Time", "Class"))
glm_ma_metrics <-  performance(surv_glm_ma_results$pred, measures = mlr::cindex)

surv_gbm_ma_results <- survival_analysis(ma_survival, "surv.gbm", c("Time", "Class"))
gbm_ma_metrics <-  performance(surv_gbm_ma_results$pred, measures = mlr::cindex)

#Combined results

#surv_cox_c_results <- survival_analysis(combined_survival, "surv.coxph", c("Time", "Class"))
#cox_c_metrics <- performance(surv_cox_c_results$pred, measures = mlr::cindex)

surv_rf_c_results <- survival_analysis(combined_survival, "surv.randomForestSRC", c("Time", "Class"))
rf_c_metrics <- performance(surv_rf_c_results$pred, measures = mlr::cindex)

surv_rp_c_results <- survival_analysis(combined_survival, "surv.rpart", c("Time", "Class"))
rp_c_metrics <-  performance(surv_rp_c_results$pred, measures = mlr::cindex)

surv_glm_c_results <-survival_analysis(combined_survival, "surv.glmnet", c("Time", "Class"))
glm_c_metrics <-  performance(surv_glm_c_results$pred, measures = mlr::cindex)

surv_gbm_c_results <- survival_analysis(combined_survival, "surv.gbm", c("Time", "Class"))
gbm_c_metrics <-  performance(surv_gbm_c_results$pred, measures = mlr::cindex)

#59 genes
#surv_cox_59_results <- survival_analysis(rna_expr_59_survival, "surv.coxph", c("Time", "Class"))
#cox_59_metrics <- performance(surv_cox_59_results$pred, measures = mlr::cindex)

surv_rf_59_results <- survival_analysis(rna_expr_59_survival, "surv.randomForestSRC", c("Time", "Class"))
rf_59_metrics <- performance(surv_rf_ma_results$pred, measures = mlr::cindex)

surv_rp_59_results <- survival_analysis(rna_expr_59_survival, "surv.rpart", c("Time", "Class"))
rp_59_metrics <- performance(surv_rp_59_results$pred, measures = mlr::cindex)

surv_glm_59_results <- survival_analysis(rna_expr_59_survival, "surv.glmnet", c("Time", "Class"))

## Warning: from glmnet Fortran code (error code -30075); Numerical error at
## 75th lambda value; solutions for larger values of lambda returned

## Warning: from glmnet Fortran code (error code -30078); Numerical error at
## 78th lambda value; solutions for larger values of lambda returned

## Warning: from glmnet Fortran code (error code -30080); Numerical error at
## 80th lambda value; solutions for larger values of lambda returned

glm_59_metrics <- performance(surv_glm_59_results$pred, measures = mlr::cindex)

surv_gbm_59_results <- survival_analysis(rna_expr_59_survival, "surv.gbm", c("Time", "Class"))
gbm_59_metrics <- performance(surv_gbm_59_results$pred, measures = mlr::cindex)

We calculated the average cindex of the four algorithms for each dataset.

#Get Average cindex
rna_average_cindex <- mean(c(rf_rna_metrics, rp_rna_metrics, glm_rna_metrics, gbm_rna_metrics))
ma_average_cindex <- mean(c(rf_ma_metrics, rp_ma_metrics, glm_ma_metrics, gbm_ma_metrics))
combined_average_cindex <- mean(c(rf_c_metrics, rp_c_metrics, glm_c_metrics, gbm_c_metrics))
fifty_nine_average_cindex <- mean(c(rf_59_metrics, rp_59_metrics, glm_59_metrics, gbm_59_metrics))

We then created the same graphs we made for the classification analysis with the survival analysis data.

info7 <- data.frame(data = factor(c("RNAseq", "Microarray"), levels=c("RNAseq", "Microarray")), cindex=c(rna_average_cindex, ma_average_cindex))
print(info7)

##         data    cindex
## 1     RNAseq 0.8246590
## 2 Microarray 0.8244884

average_accuracy_plot7 <- ggplot(data = info7, aes(x = data, y=cindex, fill=data, label=signif(cindex, digits=4))) + geom_bar(width=.5, colour="black", stat="identity") + guides(fill=FALSE) + xlab("Dataset") + ylab("Average cIndex") + ggtitle("RNA-seq and Microarray Survival Analysis cIndex") + scale_fill_manual(values=c("#56B4E9", "#009E73")) + ylim(0,1) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(size = 5, hjust = 0.5, vjust = 3)

average_accuracy_plot7

rna_cindex <- c(rf_rna_metrics, rp_rna_metrics, glm_rna_metrics, gbm_rna_metrics)
ma_cindex <- c(rf_ma_metrics, rp_ma_metrics, glm_ma_metrics, gbm_ma_metrics)

acc_values7 <- data.frame(dataset = c( "RNAseq", "RNAseq", "RNAseq", "RNAseq",  "Microarray", "Microarray", "Microarray", "Microarray"), algorithm = c("Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine", "Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine"), cindex= c(rna_cindex, ma_cindex))

print(acc_values7)

##      dataset                 algorithm    cindex
## 1     RNAseq             Random Forest 0.8496478
## 2     RNAseq             Survival Tree 0.7665579
## 3     RNAseq   GLM with Regularization 0.8145590
## 4     RNAseq Gradient Boosting Machine 0.8678713
## 5 Microarray             Random Forest 0.8409954
## 6 Microarray             Survival Tree 0.7640876
## 7 Microarray   GLM with Regularization 0.8409711
## 8 Microarray Gradient Boosting Machine 0.8518997

accuracy_plot7 <- ggplot(data=acc_values7, aes(x=algorithm, y=cindex, fill=dataset)) + geom_bar(position = "dodge", stat = "identity") + ylab("Cindex") + xlab("Algorithm") + ggtitle("RNA-seq and Microarray Survival Analysis Algorithms") + theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_fill_manual(values=c("#009E73", "#56B4E9"))
 
accuracy_plot7

#Graphs
##CREATE BAR PLOT WITH THREE AVERAGE VALUES
info3 <- data.frame(data = factor(c("RNAseq", "Microarray", "Combined"), levels=c("RNAseq", "Microarray", "Combined")), cindex = c(rna_average_cindex, ma_average_cindex, combined_average_cindex))

print(info3)

##         data    cindex
## 1     RNAseq 0.8246590
## 2 Microarray 0.8244884
## 3   Combined 0.8186945

average_accuracy_plot3 <- ggplot(data = info3, aes(x = data, y=cindex, fill=data, label=signif(cindex, digits=4))) + geom_bar(width=.5, colour="black", stat="identity") + guides(fill=FALSE) + xlab("Dataset") + ylab("Average cIndex") + ggtitle("Combined Data Survival Analysis cIndex") + scale_fill_manual(values=c("#56B4E9", "#009E73", "#FF9999")) + ylim(0,1) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(size = 5, hjust = 0.5, vjust = 3)

average_accuracy_plot3

##CREATE LINE GRAPH WITH ACC VALUES FOR EACH ALGORITHM/DATASET 

combined_cindex <- c( rf_c_metrics, rp_c_metrics, glm_c_metrics, gbm_c_metrics)

acc_values3 <- data.frame(dataset = c( "RNAseq", "RNAseq", "RNAseq", "RNAseq",  "Microarray", "Microarray", "Microarray", "Microarray", "Combined", "Combined",  "Combined", "Combined"), algorithm = c("Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine", "Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine",  "Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine"), cindex= c(rna_cindex, ma_cindex, combined_cindex))

print(acc_values3)

##       dataset                 algorithm    cindex
## 1      RNAseq             Random Forest 0.8496478
## 2      RNAseq             Survival Tree 0.7665579
## 3      RNAseq   GLM with Regularization 0.8145590
## 4      RNAseq Gradient Boosting Machine 0.8678713
## 5  Microarray             Random Forest 0.8409954
## 6  Microarray             Survival Tree 0.7640876
## 7  Microarray   GLM with Regularization 0.8409711
## 8  Microarray Gradient Boosting Machine 0.8518997
## 9    Combined             Random Forest 0.8390574
## 10   Combined             Survival Tree 0.7702676
## 11   Combined   GLM with Regularization 0.8006410
## 12   Combined Gradient Boosting Machine 0.8648120

accuracy_plot3 <- ggplot(data=acc_values3, aes(x=algorithm, y=cindex, fill=dataset)) + geom_bar(position = "dodge", stat = "identity") + ylab("Cindex") + xlab("Algorithm") + ggtitle("Combined Data Survival Analysis Algorithms") + theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_fill_manual(values=c("#FF9999", "#009E73", "#56B4E9"))
 
accuracy_plot3

#Graphs
##CREATE BAR PLOT WITH THREE AVERAGE VALUES
info4 <- data.frame(data = factor(c("RNAseq", "Microarray", "Combined", "Fifty-Nine"), levels=c("RNAseq", "Microarray", "Combined", "Fifty-Nine")), cindex = c(rna_average_cindex, ma_average_cindex, combined_average_cindex, fifty_nine_average_cindex))

print(info4)

##         data    cindex
## 1     RNAseq 0.8246590
## 2 Microarray 0.8244884
## 3   Combined 0.8186945
## 4 Fifty-Nine 0.8176222

average_accuracy_plot4 <- ggplot(data = info3, aes(x = data, y=cindex, fill=data, label=signif(cindex, digits=4))) + geom_bar(width=.5, colour="black", stat="identity") + guides(fill=FALSE) + xlab("Dataset") + ylab("Average cIndex") + ggtitle("All Datasets Surival Analysis cIndex") + scale_fill_manual(values=c("#56B4E9", "#009E73", "#FF9999", "#F8A31B")) + ylim(0,1) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(size = 5, hjust = 0.5, vjust = 3)

average_accuracy_plot3

##CREATE LINE GRAPH WITH ACC VALUES FOR EACH ALGORITHM/DATASET 
rna_cindex <- c( rf_rna_metrics, rp_rna_metrics, glm_rna_metrics, gbm_rna_metrics)
ma_cindex <- c( rf_ma_metrics, rp_ma_metrics, glm_ma_metrics, gbm_ma_metrics)
combined_cindex <- c( rf_c_metrics, rp_c_metrics, glm_c_metrics, gbm_c_metrics)
fiftynine_cindex <- c(rf_59_metrics, rp_59_metrics, glm_59_metrics, gbm_59_metrics)

acc_values4 <- data.frame(dataset = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "Microarray", "Microarray", "Microarray", "Microarray", "Combined", "Combined", "Combined", "Combined", "Fifty-nine", "Fifty-nine", "Fifty-nine", "Fifty-nine"), algorithm = c("Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine", "Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine","Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine", "Random Forest", "Survival Tree", "GLM with Regularization", "Gradient Boosting Machine"), cindex= c(rna_cindex, ma_cindex, combined_cindex, fiftynine_cindex))

print(acc_values4)

##       dataset                 algorithm    cindex
## 1      RNAseq             Random Forest 0.8496478
## 2      RNAseq             Survival Tree 0.7665579
## 3      RNAseq   GLM with Regularization 0.8145590
## 4      RNAseq Gradient Boosting Machine 0.8678713
## 5  Microarray             Random Forest 0.8409954
## 6  Microarray             Survival Tree 0.7640876
## 7  Microarray   GLM with Regularization 0.8409711
## 8  Microarray Gradient Boosting Machine 0.8518997
## 9    Combined             Random Forest 0.8390574
## 10   Combined             Survival Tree 0.7702676
## 11   Combined   GLM with Regularization 0.8006410
## 12   Combined Gradient Boosting Machine 0.8648120
## 13 Fifty-nine             Random Forest 0.8409954
## 14 Fifty-nine             Survival Tree 0.7407474
## 15 Fifty-nine   GLM with Regularization 0.8280677
## 16 Fifty-nine Gradient Boosting Machine 0.8606783

accuracy_plot4 <- ggplot(data=acc_values4, aes(x=algorithm, y=cindex, fill=dataset)) + geom_bar(position = "dodge", stat = "identity") + ylab("cIndex") + xlab("Dataset") + theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_fill_manual(values=c("#FF9999","#F8A31B", "#009E73", "#56B4E9" ))
#+ ggtitle("All Datasets Survival Analysis Algorithms")
accuracy_plot4

To better be able to interpret our results, we created a boxplot for both classification and survival analyses results.

#Classification boxplot
ggplot(acc_values, aes(x= dataset, y = accuracy, fill=dataset)) + geom_boxplot() +
     scale_x_discrete() + xlab("Dataset") + ylab("Performance") +
     scale_fill_manual(values=c("#009E73", "#56B4E9")) + theme(axis.text.x =    element_text(angle = 45, hjust = 1))

ggplot(acc_values5, aes(x = dataset, y = accuracy, fill = dataset)) + 
     geom_boxplot() +
     scale_x_discrete() + xlab("Dataset") +
     ylab("Performance") + scale_fill_manual(values=c("#FF9999","#F8A31B", "#009E73", "#56B4E9")) +
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Survival boxplot
ggplot(acc_values7, aes(x= dataset, y = cindex, fill=dataset)) + geom_boxplot() +
     scale_x_discrete() + xlab("Dataset") + ylab("Performance") +
     scale_fill_manual(values=c("#009E73", "#56B4E9")) + theme(axis.text.x =    element_text(angle = 45, hjust = 1))

ggplot(acc_values4, aes(x = dataset, y = cindex, fill = dataset)) +
     geom_boxplot() +
     scale_x_discrete() + xlab("Dataset") +
     ylab("Performance") + scale_fill_manual(values=c("#FF9999","#F8A31B", "#009E73", "#56B4E9")) +
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

RNA-seq and Microarray Data Integration Survival Analysis

Kristen Cockriel and Jessica Rigby

4/20/2017

R Markdown

GRAPHS