Data Retrieval

This project is focused on analyzing the TCGA-THCA data from a multiomics approach. Specifically, we will be analyzing mRNA, miRNA, lncRNA, and methylation data to describe the trends and relationships between the multiomic profiles of patients and their prognostic characteristics (e.g. survival and/or recurrence).

To start off, we need to import the relevant data we will be using. This data is sourced from TCGA (The Cancer Genome Atlas), although the mRNA, miRNA, methylation, and clinical data came from Firehose (from the Broad Institute), and the lncRNA data came from TANRIC.

Let’s first import the datasets we will be working with, and assign them to variables.

#Inputting the data into variables that we can manipulate
library(readxl)
Warning: package ‘readxl’ was built under R version 4.2.2

Let’s look at the head of these tables to understand their structure.

#Accessing the first few rows of each dataset
head(clinical)
head(lncRNA)
#head(methylation)
head(miRNA)
head(mRNA)

Data Preprocessing

After looking at the table heads, there are a couple of issues that we can see:

1) The sample labels aren't identical - some have extra elements at the beginning or end, which         need to be standardized

2) The number of samples in each file aren't the same - for example, miRNA only has 541 samples,        and lncRNA has 556. This also needs to be standardized to make sure only samples with all           data are used.

Let’s start with step 1.

The general format of the data should look something like this: ‘TCGA-XX-XXXX-XX’, so let’s look at each file to see how we can fix it.

lncRNA seems to have extra labels at the start which we can remove, as well as underscores which we can convert to hyphens. However, what’s also important here is that we have normal and tumor samples explictly labeled, whereas other files use the ‘01’ marking for tumors and ‘11’ marking for normal samples. We need to adapt our lncRNA data to follow this structure.

#We are going through each column in the lncRNA dataset
for (i in 1:ncol(lncRNA)) {
  #If the column name - which is the sample ID - contains Normal, we'll add a -11 and format the name to remove the beginning and convert all underscores to hyphens.
  
  if (grepl("Normal", colnames(lncRNA)[i], fixed = TRUE) == TRUE) {
    colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-11', sep="")
  } else {
    #Otherwise, we'll still format the name, but we'll add a -01 at the end instead.
    colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-01', sep="")
  }
    
}

#We'll convert the first column to the standard name Gene_ID and remove the extra marker in the Ensemble ID after the decimal point.
colnames(lncRNA)[1] <- "Gene_ID"
lncRNA$Gene_ID <- sub("\\..*", "", lncRNA$Gene_ID)
head(lncRNA)

That looks more in line with the standard sample notation.

Both the methylation and the mRNA file have underscores that we can convert to hyphens.

#We'll iterate over each column and convert the underscores to dashes, for both datasets. Then, we'll convert the first column to the Gene_ID label.
for (i in 1:ncol(mRNA)) {
  colnames(mRNA)[i] <- gsub("_", "-", colnames(mRNA)[i])
}
colnames(mRNA)[1] <- "Gene_ID"
mRNA <- mRNA[30:20531,]
head(mRNA)

#for (i in 1:ncol(methylation)) {
#  colnames(methylation)[i] <- gsub("_", "-", colnames(methylation)[i])
#}
#colnames(methylation)[1] <- "Gene_ID"
#head(methylation)

That seems to be fixed as well.

For miRNA data, there is a lot of extraneous data at the end about sample and vials, which we don’t need, so let’s remove that.

#We'll iterate over the column names and remove any extraneous information after our normal or tumor sample labels, and then create the Gene_ID column.

for (i in 1:ncol(miRNA)) {
  colnames(miRNA)[i] <- sub("_01.*", "_01", colnames(miRNA)[i])
  colnames(miRNA)[i] <- sub("_11.*", "_11", colnames(miRNA)[i])
  colnames(miRNA)[i] <- gsub("_", "-", colnames(miRNA)[i])
}
colnames(miRNA)[1] <- "Gene_ID"
miRNA$'miRNA-ID' <- NULL
miRNA <- miRNA[!grepl('precursor', miRNA$`miRNA-region`),]
miRNA <- miRNA[!grepl('stemloop', miRNA$`miRNA-region`),]
miRNA$'miRNA-region' <- NULL
head(miRNA)

Done!

For clinical data, we have our ‘Sample ID’ column with appropriate notation, so we don’t need to change that.

Now, for step 2, we have to identify which samples are missing from any of the files, as we only want data points present in each file. To accomplish this, I am first going to access all the sample IDs from each file. I’m then going to add them together to form a vector, and use the table function to see how many times each value appears in the vector. Given that we have 5 files of data, a sample present in all files should appear at least 5 times. Thus, we will isolate any values that appear less than 5 times.

#Here we access the specific sample names through indexing of the data (which starts from column 2).
clin_samples <- clinical$`Sample ID`
lnc_samples <- colnames(lncRNA)[2:557]
#meth_samples <- colnames(methylation)[5:571]
miR_samples <- colnames(miRNA)[2:1609]
mRNA_samples <- colnames(mRNA)[2:569]

#We concatenate all the sample data into a vector.
temp <- c(lnc_samples, miR_samples, mRNA_samples)
#Next, we count the occurrence of each unique sample ID in our vector using the table function - if a sample is present in all 4 datasets, it'll have a count of 4.
temp_table = table(temp)
#Now, we just want to keep the samples that have an occurrence of 4.
keep1 <- names(temp_table[temp_table >= 3])
keep1 <- c("Gene_ID", keep1)

Now, we will remove their columns/rows from each data file in a sequential manner.

#Next, we just subset the datasets to only keep columns that match a value in our vector of accepted samples - using the subset and select functions.
miRNA_new <- subset(miRNA, select=c(keep1))
mRNA_new <- subset(mRNA, select=c(keep1))
#methylation_new <- subset(methylation, select=c(keep1))
lncRNA_new <- subset(lncRNA, select=c(keep1))

One additional step we need to take is to make sure that all our tumor samples have clinical data available. Thus, I’m going to compare our tumor samples (with the -01 ending) to our clinical sample IDs to find and also remove any samples that don’t overlap, using a similar method.

normal_samples
 [1] "TCGA-BJ-A28R-11" "TCGA-BJ-A28W-11" "TCGA-BJ-A28X-11" "TCGA-BJ-A290-11" "TCGA-BJ-A2N7-11"
 [6] "TCGA-BJ-A2N8-11" "TCGA-BJ-A2N9-11" "TCGA-BJ-A2NA-11" "TCGA-BJ-A3PR-11" "TCGA-BJ-A3PU-11"
[11] "TCGA-DO-A1JZ-11" "TCGA-E8-A2JQ-11" "TCGA-EL-A3GZ-11" "TCGA-EL-A3H1-11" "TCGA-EL-A3H2-11"
[16] "TCGA-EL-A3H7-11" "TCGA-EL-A3MW-11" "TCGA-EL-A3MX-11" "TCGA-EL-A3MY-11" "TCGA-EL-A3N2-11"
[21] "TCGA-EL-A3N3-11" "TCGA-EL-A3T0-11" "TCGA-EL-A3T1-11" "TCGA-EL-A3T2-11" "TCGA-EL-A3T3-11"
[26] "TCGA-EL-A3T6-11" "TCGA-EL-A3T7-11" "TCGA-EL-A3T8-11" "TCGA-EL-A3TA-11" "TCGA-EL-A3TB-11"
[31] "TCGA-EL-A3ZG-11" "TCGA-EL-A3ZH-11" "TCGA-EL-A3ZK-11" "TCGA-EL-A3ZL-11" "TCGA-EL-A3ZM-11"
[36] "TCGA-EL-A3ZO-11" "TCGA-EL-A3ZP-11" "TCGA-EL-A3ZQ-11" "TCGA-EL-A3ZR-11" "TCGA-EL-A3ZS-11"
[41] "TCGA-EL-A3ZT-11" "TCGA-EM-A1CS-11" "TCGA-EM-A1CT-11" "TCGA-EM-A1CU-11" "TCGA-EM-A1CV-11"
[46] "TCGA-EM-A1CW-11" "TCGA-EM-A1YC-11" "TCGA-EM-A3ST-11" "TCGA-ET-A2MX-11" "TCGA-ET-A2N5-11"
[51] "TCGA-ET-A3DP-11" "TCGA-ET-A3DW-11" "TCGA-FY-A3TY-11" "TCGA-GE-A2C6-11" "TCGA-H2-A2K9-11"
[56] "TCGA-H2-A3RI-11" "TCGA-KS-A41I-11" "TCGA-KS-A41J-11" "TCGA-KS-A41L-11"

Now, all our data formatting is done - each of our datasets is composed of 542 samples, with 492 tumor samples (as the clinical data suggests) and 50 normal samples. Before we move into our actual analysis, we need to make sure all our samples have data points for the RNAs being analyzed. Since a lot of the files have ‘NA’ values in them, we can’t work with this, so we’ll just eliminate all rows containing an NA.

library(dplyr)
Warning: package ‘dplyr’ was built under R version 4.2.2
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(readr)
Warning: package ‘readr’ was built under R version 4.2.2
library(edgeR)
Warning: package ‘edgeR’ was built under R version 4.2.2Loading required package: limma
Warning: package ‘limma’ was built under R version 4.2.2
#For each dataset, we are going to remove any rows with NA in them, and then convert it to a data frame to work with our DESeq2 functions.
miRNA_new[miRNA_new == "NA"] <- NA
miRNA_final <- miRNA_new[complete.cases(miRNA_new), ]
miRNA_final <- na.omit(miRNA_final)
miRNA_final = data.frame(miRNA_final)

mRNA_new[mRNA_new == "NA"] <- NA
mRNA_final <- mRNA_new[complete.cases(mRNA_new), ]
mRNA_final = data.frame(mRNA_final)

lncRNA_new[lncRNA_new == "NA"] <- NA
lncRNA_final <- lncRNA_new[complete.cases(lncRNA_new), ]
lncRNA_final = data.frame(lncRNA_final)

#methylation_new[methylation_new == "NA"] <- NA 
#methylation_final <- methylation_new[complete.cases(methylation_new), ]
#methylation_final = data.frame(methylation_final)

Differential Gene Expression

We finally have our final datasets that we can work with. Let’s begin with analysis.

We first should create a proper metadata variable to begin analysis, by considering tumor and normal samples. However, there are different levels of analysis we can conduct, which means we’ll create a couple of metadata files based on clinical data.

First, let’s create a file with a simple distribution of group 0 for normal tissue and group 1 for tumor tissue.

library(DESeq2)
library(ggplot2)

#Access sample IDs to form column 1, then check for the .01 tag to assign groups as 1 or 0 for tumor or normal.
metadata <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
metadata$'Group' <- ifelse(grepl(".01", metadata$Sample_ID, fixed = TRUE) == TRUE, 1, 0)
metadata$Group <- as.factor(metadata$Group)

for (i in 1:551) {
  sample <- metadata[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_2[which(clin_2$sample == sample2), ]
  metadata[i, c("OS_time")] <- as.numeric(row1$'OS.time')
  metadata[i, c("OS_Status")] <- as.numeric(row1$'OS')
  metadata[i, c("PFI_time")] <- as.numeric(row1$'PFI.time')
  metadata[i, c("PFI_Status")] <- as.numeric(row1$'PFI')
  metadata[i, c("DSS_time")] <- as.numeric(row1$'DSS.time')
  metadata[i, c("DSS_Status")] <- as.numeric(row1$'DSS')
  #if (grepl(".11", sample, fixed=TRUE) == TRUE) {
  #  metadata[i, c("days_to_death")] <- runif(1, 1000, 4000)
  #  metadata[i, c("Status")] <- 0
  #} else{
  #  metadata[i, c("days_to_death")] <- as.numeric(group1) * 30
  #  metadata[i, c("Status")] <- as.numeric(row1$Survival_Group)
  #}
}

Next, let’s create a metadata file that adds in recurrent samples as being group 2, and primary samples as being group 1.

#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for disease free status, and assign group 2 to recurred samples.
metadata_recur <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
for (i in 1:551) {
  sample <- metadata_recur[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
  group1 <- row1$Recurrence_Group
  group2 <- row1$Overall.Survival..Months.
  if (grepl(".11", sample, fixed=TRUE) == TRUE) {
    metadata_recur[i, c("Group")] <- 0
    metadata_recur[i, c("days_to_death")] <- runif(1, 1000, 4000)
    metadata_recur[i, c("Status")] <- 0
  } else {
    metadata_recur[i, c("days_to_death")] <- as.numeric(group2) * 30
    metadata_recur[i, c("Status")] <- as.numeric(row1$Survival_Group)
    if (group1 == 1) {
      metadata_recur[i, c("Group")] <- 1
    } else{
      metadata_recur[i, c("Group")] <- 2
    }
  }
}

metadata_recur$Group <- as.factor(metadata_recur$Group)

We can also create a metadata file to evaluate metastatic vs. non-metastatic samples, which can form group 1 and 2 respectively.

#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for metastatic status, and assign group 1 to metastatic samples.
metadata_metas <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:543]))

for (i in 1:542) {
  sample <- metadata_metas[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
  group1 <- row1$Metastasis_Group
  if (grepl(".11", sample, fixed=TRUE) == TRUE) {
    metadata_metas[i, c("Group")] <- 0
  } else {
    if (group1 == 1) {
      metadata_metas[i, c("Group")] <- 1
    } else{
      metadata_metas[i, c("Group")] <- 2
    }
    metadata_metas[i, c("days_to_death")] <- row1$Overall.Survival..Months.
  }
}

metadata_metas$Group <- as.factor(metadata_metas$Group)
#Need to convert the gene names to the actual row names, and remove that column
m_DESEQ <- mRNA_final[,-1]
rownames(m_DESEQ) <- mRNA_final$Gene_ID
library(GDCRNATools)

Registered S3 methods overwritten by 'htmltools':
  method               from         
  print.html           tools:rstudio
  print.shiny.tag      tools:rstudio
  print.shiny.tag.list tools:rstudio

Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
formally cite the original Pathview paper (not just mention it) in publications
or products. For details, do citation("pathview") within R.

The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
license agreement (details at http://www.kegg.jp/kegg/legal.html).
##############################################################################
library("AnnotationDbi")
Warning: package ‘AnnotationDbi’ was built under R version 4.2.2
Attaching package: ‘AnnotationDbi’

The following object is masked from ‘package:dplyr’:

    select
library("org.Hs.eg.db")

m_DESEQ <- m_DESEQ[30:20531,]
m_DESEQ$'ENTREZ' <- c(rownames(m_DESEQ))
m_DESEQ$'ENTREZ' <- gsub(".*\\|", "", m_DESEQ$'ENTREZ')
m_DESEQ$'Ensemble' <- mapIds(org.Hs.eg.db, keys=m_DESEQ$'ENTREZ', keytype="ENTREZID", column="ENSEMBL")
'select()' returned 1:many mapping between keys and columns
rownames(m_DESEQ) <- make.names(m_DESEQ$Ensemble, unique=TRUE)
m_DESEQ$Ensemble <- NULL
m_DESEQ$ENTREZ <- NULL

m_DESEQ <- m_DESEQ[complete.cases(m_DESEQ), ]

metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')
metadata_new$'days_to_death' <- metadata$days_to_death
#metadata_new$'days_to_last_follow_up' = NA

#metadata_test <- gdcParseMetadata(project.id='TCGA-THCA', data.type='RNAseq')

DEGAll <- gdcDEAnalysis(counts=round(m_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_mRNA <- gdcDEReport(deg=DEGAll)
head(dePC_mRNA)
#Need to convert the gene names to the actual row names, and remove that column
mi_DESEQ <- miRNA_final[,-1]
rownames(mi_DESEQ) <- miRNA_final$Gene_ID
mi_DESEQ[] <- lapply(mi_DESEQ, as.numeric)
rownames(mi_DESEQ) <- sub("r", "R", rownames(mi_DESEQ))
mi_DESEQ <- mi_DESEQ[20:292,]
mi_DESEQ <- mi_DESEQ[!grepl('unannotated', rownames(mi_DESEQ)),]
mi_DESEQ <- na.omit(mi_DESEQ)
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

DEGAll <- gdcDEAnalysis(counts=round(mi_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_miRNA <- gdcDEReport(deg=DEGAll)
head(dePC_miRNA)
#Need to convert the gene names to the actual row names, and remove that column
lnc_DESEQ <- lncRNA_final[,-1]
rownames(lnc_DESEQ) <- lncRNA_final$Gene_ID
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')

DEGAll <- gdcDEAnalysis(counts=round(lnc_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_lnc <- gdcDEReport(deg=DEGAll)
head(dePC_lnc)
NA
ceOutput <- gdcCEAnalysis(lnc = rownames(lnc_Expr_2),
                          pc = rownames(m_Expr_2),
                          lnc.targets = 'starBase',
                          pc.targets = 'starBase',
                          rna.expr = RNA_tot,
                          mir.expr = mi_Expr_2)
Step 1/3: Hypergenometric test done !
Error in mir.expr[mir, ] : subscript out of bounds
gdcBarPlot(dePC_miRNA, angle=45, data.type='miRNA')
lnc_Expr_2 <- lnc_Expr[rownames(lnc_Expr) %in% rownames(lnc_de),]
m_Expr_2 <- m_Expr[rownames(m_Expr) %in% rownames(m_de),]

mi_Expr_2 <- mi_Expr[rownames(mi_Expr) %in% rownames(dePC_miRNA), ]


RNA_tot = rbind(lnc_Expr_2, m_Expr_2)
mi_Expr_df <- data.frame(mi_Expr)
deLNC <- c('ENSG00000260920','ENSG00000242125','ENSG00000261211')
dePC <- c('ENSG00000043355','ENSG00000109586','ENSG00000144355', 'ENSG00000198626')
genes <- c(deLNC, dePC)
samples <- c('TCGA-2F-A9KO-01', 'TCGA-2F-A9KP-01',
'TCGA-2F-A9KQ-01', 'TCGA-2F-A9KR-01',
'TCGA-2F-A9KT-01', 'TCGA-2F-A9KW-01')
rnaExpr <- data.frame(matrix(c(2.7,7.0,4.9,6.9,4.6,2.5,
0.5,2.5,5.7,6.5,4.9,3.8,
2.1,2.9,5.9,5.7,4.5,3.5,
2.7,5.9,4.5,5.8,5.2,3.0,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,3.8,6.2,3.8,3.8,4.2,
1.0,2.0,3.0,4.0,5.0,6.0),7,6),
stringsAsFactors=FALSE)
rownames(rnaExpr) <- genes
colnames(rnaExpr) <- samples
mirExpr <- data.frame(matrix(c(7.7,7.4,7.9,8.9,8.6,9.5,
5.1,4.4,5.5,8.5,4.4,3.5,
4.9,5.5,6.9,6.1,5.5,4.1,
12.4,13.5,15.1,15.4,13.0,12.8,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,2.7,6.2,1.5,4.4,4.2, 
1.0, 2.0, 3.0, 4.0, 5.0, 6.0),7,6),
stringsAsFactors=FALSE)
colnames(mirExpr) <- samples
rownames(mirExpr) <- c('hsa-miR-340-5p-mature','hsa-miR-181b-1-5p-mature',
'hsa-miR-181a-2-3p-mature', 'hsa-miR-195-5p-mature',
'hsa-miR-199b-5p','hsa-miR-182-5p', 'hsa-miR-146b-5p-mature')
ceOutput <- gdcCEAnalysis(lnc = deLNC,
pc = dePC,
lnc.targets = 'starBase',
pc.targets = 'starBase',
rna.expr = rnaExpr,
mir.expr = mirExpr)
Step 1/3: Hypergenometric test done !
Step 2/3: Correlation analysis done !
Step 3/3: Regulation pattern analysis done !
library(survminer)
Warning: package ‘survminer’ was built under R version 4.2.2Loading required package: ggplot2
Warning: package ‘ggplot2’ was built under R version 4.2.2Loading required package: ggpubr
Warning: package ‘ggpubr’ was built under R version 4.2.2
Attaching package: ‘survminer’

The following object is masked from ‘package:survival’:

    myeloma
gene_1 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene1_Quartile %in% c(1, 4),]

gene_1$Gene1_Quartile <- sub(4, 2, gene_1$Gene1_Quartile)
sfit_1 <- survfit(Surv(OS_time, OS_Status)~Gene1_Quartile, data=gene_1)
summary(sfit_1)
Call: survfit(formula = Surv(OS_time, OS_Status) ~ Gene1_Quartile, 
    data = gene_1)

                Gene1_Quartile=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  243    132       1    0.992 0.00755        0.978        1.000
 1019     69       1    0.978 0.01610        0.947        1.000
 1500     44       1    0.956 0.02703        0.904        1.000
 1597     41       1    0.933 0.03501        0.866        1.000
 1734     38       1    0.908 0.04181        0.830        0.994
 1753     37       1    0.883 0.04734        0.795        0.981

                Gene1_Quartile=2 
        time       n.risk      n.event     survival      std.err lower 95% CI upper 95% CI 
    1.01e+03     6.50e+01     1.00e+00     9.85e-01     1.53e-02     9.55e-01     1.00e+00 
ggsurvplot(sfit_1, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for hsa-miR-146b-3p-mature", risk.table.height=0.3)



gene_2 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene2_Quartile %in% c(1, 4),]

gene_2$Gene2_Quartile <- sub(4, 2, gene_2$Gene2_Quartile)
sfit_2 <- survfit(Surv(PFI_time, PFI_Status)~Gene2_Quartile, data=gene_2)
summary(sfit_2)
Call: survfit(formula = Surv(PFI_time, PFI_Status) ~ Gene2_Quartile, 
    data = gene_2)

                Gene2_Quartile=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   30    138       1    0.993 0.00722        0.979        1.000
  127    132       2    0.978 0.01273        0.953        1.000
  158    130       1    0.970 0.01468        0.942        0.999
  225    127       1    0.963 0.01644        0.931        0.995
  267    124       2    0.947 0.01950        0.910        0.986
  279    121       1    0.939 0.02085        0.899        0.981
  329    120       1    0.931 0.02209        0.889        0.976
  363    116       1    0.923 0.02332        0.879        0.970
  385    113       1    0.915 0.02450        0.868        0.964
  439    112       1    0.907 0.02561        0.858        0.959
  445    110       1    0.899 0.02667        0.848        0.953
  494    105       1    0.890 0.02775        0.837        0.946
  526    103       1    0.882 0.02880        0.827        0.940
  533    101       2    0.864 0.03076        0.806        0.927
  544     98       1    0.855 0.03169        0.795        0.920
  903     68       2    0.830 0.03540        0.764        0.902
 1019     62       1    0.817 0.03727        0.747        0.893
 1062     60       1    0.803 0.03906        0.730        0.883
 1204     54       1    0.788 0.04107        0.712        0.873
 1385     48       1    0.772 0.04337        0.691        0.862

                Gene2_Quartile=2 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  113    133       1    0.992 0.00749        0.978        1.000
  265    127       1    0.985 0.01076        0.964        1.000
  711     82       1    0.973 0.01598        0.942        1.000
  720     81       1    0.961 0.01979        0.923        1.000
  781     76       1    0.948 0.02322        0.904        0.995
  874     70       1    0.934 0.02654        0.884        0.988
  966     62       1    0.919 0.03009        0.862        0.980
ggsurvplot(sfit_2, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for FAM167B", risk.table.height=0.3)


gene_3 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene3_Quartile %in% c(1, 4),]

gene_3$Gene3_Quartile <- sub(4, 2, gene_3$Gene3_Quartile)
sfit_3 <- survfit(Surv(PFI_time, PFI_Status)~Gene3_Quartile, data=gene_3)
summary(sfit_3)
Call: survfit(formula = Surv(PFI_time, PFI_Status) ~ Gene3_Quartile, 
    data = gene_3)

                Gene3_Quartile=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  127    133       1    0.992 0.00749        0.978        1.000
  225    128       2    0.977 0.01314        0.952        1.000
  267    125       1    0.969 0.01518        0.940        0.999
  329    122       1    0.961 0.01701        0.928        0.995
  431    115       1    0.953 0.01881        0.917        0.990
  533    107       1    0.944 0.02063        0.904        0.985
  544    105       1    0.935 0.02231        0.892        0.980
  711     93       1    0.925 0.02423        0.879        0.974
 1062     66       1    0.911 0.02762        0.858        0.967
 1215     53       1    0.894 0.03200        0.833        0.959
 1385     39       1    0.871 0.03852        0.798        0.950
 1785     28       1    0.840 0.04809        0.751        0.939

                Gene3_Quartile=2 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    6    137       1    0.993 0.00727        0.979        1.000
  113    134       1    0.985 0.01032        0.965        1.000
  158    132       1    0.978 0.01266        0.953        1.000
  174    131       3    0.955 0.01779        0.921        0.991
  218    128       1    0.948 0.01915        0.911        0.986
  267    126       1    0.940 0.02042        0.901        0.981
  279    125       1    0.933 0.02160        0.892        0.976
  411    114       1    0.925 0.02291        0.881        0.971
  514     98       1    0.915 0.02454        0.868        0.965
  577     87       1    0.905 0.02642        0.854        0.958
  641     82       1    0.894 0.02831        0.840        0.951
  781     71       1    0.881 0.03058        0.823        0.943
  874     65       1    0.868 0.03298        0.805        0.935
  903     62       2    0.840 0.03738        0.769        0.916
  966     54       1    0.824 0.03979        0.750        0.906
 1019     50       2    0.791 0.04451        0.709        0.883
 1426     38       1    0.770 0.04796        0.682        0.870
 1708     26       2    0.711 0.05983        0.603        0.839
ggsurvplot(sfit_3, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for DNAH11", risk.table.height=0.3)


gene_4 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene4_Quartile %in% c(1, 4),]

gene_4$Gene1_Quartile <- sub(4, 2, gene_4$Gene4_Quartile)
sfit_4 <- survfit(Surv(PFI_time, PFI_Status)~Gene4_Quartile, data=gene_4)
summary(sfit_4)
Call: survfit(formula = Surv(PFI_time, PFI_Status) ~ Gene4_Quartile, 
    data = gene_4)

                Gene4_Quartile=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    6    138       1    0.993 0.00722        0.979        1.000
   30    137       1    0.986 0.01017        0.966        1.000
  110    132       1    0.978 0.01254        0.954        1.000
  113    131       1    0.971 0.01450        0.943        0.999
  127    130       1    0.963 0.01619        0.932        0.995
  174    128       1    0.956 0.01773        0.921        0.991
  204    126       1    0.948 0.01914        0.911        0.986
  225    124       1    0.940 0.02046        0.901        0.981
  265    120       1    0.933 0.02174        0.891        0.976
  267    119       1    0.925 0.02292        0.881        0.971
  279    117       1    0.917 0.02405        0.871        0.965
  332    114       1    0.909 0.02515        0.861        0.959
  363    109       1    0.900 0.02626        0.850        0.953
  439    102       1    0.892 0.02745        0.839        0.947
  514     95       1    0.882 0.02872        0.828        0.940
  711     84       1    0.872 0.03024        0.814        0.933
  874     71       1    0.859 0.03221        0.799        0.925
  933     66       1    0.846 0.03425        0.782        0.916
 1062     55       1    0.831 0.03692        0.762        0.907
 1204     47       1    0.813 0.04015        0.738        0.896
 1215     46       1    0.796 0.04299        0.716        0.885
 1385     40       1    0.776 0.04629        0.690        0.872
 1426     39       1    0.756 0.04919        0.665        0.859
 1708     27       1    0.728 0.05476        0.628        0.844
 1785     26       1    0.700 0.05938        0.593        0.826

                Gene4_Quartile=4 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  127    131       1    0.992  0.0076        0.978        1.000
  174    130       1    0.985  0.0107        0.964        1.000
  279    125       1    0.977  0.0132        0.951        1.000
  533     98       1    0.967  0.0164        0.935        1.000
  641     84       1    0.955  0.0198        0.917        0.995
  720     75       1    0.943  0.0233        0.898        0.989
  966     60       2    0.911  0.0314        0.852        0.975
 1019     55       1    0.895  0.0349        0.829        0.966
 1708     27       1    0.862  0.0468        0.775        0.958
ggsurvplot(sfit_4, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for BHMT2", risk.table.height=0.3)

---
title: "R Notebook"
output: html_notebook
---
## *Data Retrieval*

This project is focused on analyzing the TCGA-THCA data from a multiomics approach. Specifically, we will be analyzing mRNA, miRNA, lncRNA, and methylation data to describe the trends and relationships between the multiomic profiles of patients and their prognostic characteristics (e.g. survival and/or recurrence).

To start off, we need to import the relevant data we will be using. This data is sourced from TCGA (The Cancer Genome Atlas), although the mRNA, miRNA, methylation, and clinical data came from Firehose (from the Broad Institute), and the lncRNA data came from TANRIC.

Let's first import the datasets we will be working with, and assign them to variables.
```{r}
#Inputting the data into variables that we can manipulate
library(readxl)

#Inputting the data into variables that we can manipulate
clinical <- read_excel("THCA_Clinical_Data.xlsx")
clinical_2 <- read_excel("Clinical_Possible.xlsx")
lncRNA <- read_excel("THCA_lncRNA_Data.xlsx")
#methylation <- read_excel("THCA_Methylation_Data.xlsx")
miRNA <- read_excel("THCA_miRNA_Data_Test.xlsx")
mRNA <- read_excel("THCA_mRNA_Data_Final.xlsx")

```

Let's look at the head of these tables to understand their structure.

```{r}
#Accessing the first few rows of each dataset
head(clinical)
head(lncRNA)
#head(methylation)
head(miRNA)
head(mRNA)
```
## *Data Preprocessing*

After looking at the table heads, there are a couple of issues that we can see:

    1) The sample labels aren't identical - some have extra elements at the beginning or end, which         need to be standardized
    
    2) The number of samples in each file aren't the same - for example, miRNA only has 541 samples,        and lncRNA has 556. This also needs to be standardized to make sure only samples with all           data are used.

Let's start with step 1.

The general format of the data should look something like this: 'TCGA-XX-XXXX-XX', so let's look at each file to see how we can fix it.

lncRNA seems to have extra labels at the start which we can remove, as well as underscores which we can convert to hyphens. However, what's also important here is that we have normal and tumor samples explictly labeled, whereas other files use the '01' marking for tumors and '11' marking for normal samples. We need to adapt our lncRNA data to follow this structure.
```{r}
#We are going through each column in the lncRNA dataset
for (i in 1:ncol(lncRNA)) {
  #If the column name - which is the sample ID - contains Normal, we'll add a -11 and format the name to remove the beginning and convert all underscores to hyphens.
  
  if (grepl("Normal", colnames(lncRNA)[i], fixed = TRUE) == TRUE) {
    colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-11', sep="")
  } else {
    #Otherwise, we'll still format the name, but we'll add a -01 at the end instead.
    colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-01', sep="")
  }
    
}

#We'll convert the first column to the standard name Gene_ID and remove the extra marker in the Ensemble ID after the decimal point.
colnames(lncRNA)[1] <- "Gene_ID"
lncRNA$Gene_ID <- sub("\\..*", "", lncRNA$Gene_ID)
head(lncRNA)
```

That looks more in line with the standard sample notation.

Both the methylation and the mRNA file have underscores that we can convert to hyphens.

```{r}
#We'll iterate over each column and convert the underscores to dashes, for both datasets. Then, we'll convert the first column to the Gene_ID label.
for (i in 1:ncol(mRNA)) {
  colnames(mRNA)[i] <- gsub("_", "-", colnames(mRNA)[i])
}
colnames(mRNA)[1] <- "Gene_ID"
mRNA <- mRNA[30:20531,]
head(mRNA)

#for (i in 1:ncol(methylation)) {
#  colnames(methylation)[i] <- gsub("_", "-", colnames(methylation)[i])
#}
#colnames(methylation)[1] <- "Gene_ID"
#head(methylation)
```
That seems to be fixed as well.

For miRNA data, there is a lot of extraneous data at the end about sample and vials, which we don't need, so let's remove that.

```{r}
#We'll iterate over the column names and remove any extraneous information after our normal or tumor sample labels, and then create the Gene_ID column.

for (i in 1:ncol(miRNA)) {
  colnames(miRNA)[i] <- sub("_01.*", "_01", colnames(miRNA)[i])
  colnames(miRNA)[i] <- sub("_11.*", "_11", colnames(miRNA)[i])
  colnames(miRNA)[i] <- gsub("_", "-", colnames(miRNA)[i])
}
colnames(miRNA)[1] <- "Gene_ID"
miRNA$'miRNA-ID' <- NULL
miRNA <- miRNA[!grepl('precursor', miRNA$`miRNA-region`),]
miRNA <- miRNA[!grepl('stemloop', miRNA$`miRNA-region`),]
miRNA$'miRNA-region' <- NULL
head(miRNA)
```
Done!

For clinical data, we have our 'Sample ID' column with appropriate notation, so we don't need to change that.

Now, for step 2, we have to identify which samples are missing from any of the files, as we only want data points present in each file. To accomplish this, I am first going to access all the sample IDs from each file. I'm then going to add them together to form a vector, and use the table function to see how many times each value appears in the vector. Given that we have 5 files of data, a sample present in all files should appear at least 5 times. Thus, we will isolate any values that appear less than 5 times.

```{r}
#Here we access the specific sample names through indexing of the data (which starts from column 2).
clin_samples <- clinical$`Sample ID`
lnc_samples <- colnames(lncRNA)[2:557]
#meth_samples <- colnames(methylation)[5:571]
miR_samples <- colnames(miRNA)[2:1609]
mRNA_samples <- colnames(mRNA)[2:569]

#We concatenate all the sample data into a vector.
temp <- c(lnc_samples, miR_samples, mRNA_samples)
#Next, we count the occurrence of each unique sample ID in our vector using the table function - if a sample is present in all 4 datasets, it'll have a count of 3.
temp_table = table(temp)
#Now, we just want to keep the samples that have an occurrence of 4.
keep1 <- names(temp_table[temp_table >= 3])
keep1 <- c("Gene_ID", keep1)

```

Now, we will remove their columns/rows from each data file in a sequential manner.

```{r}
#Next, we just subset the datasets to only keep columns that match a value in our vector of accepted samples - using the subset and select functions.
miRNA_new <- subset(miRNA, select=c(keep1))
mRNA_new <- subset(mRNA, select=c(keep1))
#methylation_new <- subset(methylation, select=c(keep1))
lncRNA_new <- subset(lncRNA, select=c(keep1))
```

One additional step we need to take is to make sure that all our tumor samples have clinical data available. Thus, I'm going to compare our tumor samples (with the -01 ending) to our clinical sample IDs to find and also remove any samples that don't overlap, using a similar method.

```{r}
#Now, since clinical data only contains tumor samples, we need to separate out our tumor samples by searching for all samples with the -01 tag.
tumor_samples <- grep("-01", keep1, value=TRUE)
normal_samples <- grep("-11", keep1, value=TRUE)
normal_samples

#Same as above - vector, count, keep anything occurring twice since we are comparing 2 datasets.
temp2 <- c(tumor_samples, clin_samples)
temp_table_2 = table(temp2)
keep2 <- names(temp_table_2[temp_table_2 >= 2])
keep2 <- c(keep2)

#We then delete any rows that don't correspond to a sample in our list, by using the in function to only keep rows with values in our vector.

clin_new <- clinical[clinical$`Sample ID` %in% keep2,]
clin_new <- data.frame(clin_new)

clin_2 <- clinical_2[clinical_2$sample %in% keep1,]
clin_2 <- data.frame(clin_2)

clin_new$'Metastasis_Group' <- with(clin_new, ifelse(clin_new$American.Joint.Committee.on.Cancer.Metastasis.Stage.Code == "M1", 1, 2))

clin_new$'Recurrence_Group' <- with(clin_new, ifelse(clin_new$Disease.Free.Status == "1:Recurred/Progressed", 1, 2))

clin_new
```

Now, all our data formatting is done - each of our datasets is composed of 542 samples, with 492 tumor samples (as the clinical data suggests) and 50 normal samples. Before we move into our actual analysis, we need to make sure all our samples have data points for the RNAs being analyzed. Since a lot of the files have 'NA' values in them, we can't work with this, so we'll just eliminate all rows containing an NA.

```{r}
library(dplyr)
library(readr)
library(edgeR)
#For each dataset, we are going to remove any rows with NA in them, and then convert it to a data frame to work with our DESeq2 functions.
miRNA_new[miRNA_new == "NA"] <- NA
miRNA_final <- miRNA_new[complete.cases(miRNA_new), ]
miRNA_final <- na.omit(miRNA_final)
miRNA_final = data.frame(miRNA_final)

mRNA_new[mRNA_new == "NA"] <- NA
mRNA_final <- mRNA_new[complete.cases(mRNA_new), ]
mRNA_final = data.frame(mRNA_final)

lncRNA_new[lncRNA_new == "NA"] <- NA
lncRNA_final <- lncRNA_new[complete.cases(lncRNA_new), ]
lncRNA_final = data.frame(lncRNA_final)

#methylation_new[methylation_new == "NA"] <- NA 
#methylation_final <- methylation_new[complete.cases(methylation_new), ]
#methylation_final = data.frame(methylation_final)

```

## *Differential Gene Expression*


We finally have our final datasets that we can work with. Let's begin with analysis. 

We first should create a proper metadata variable to begin analysis, by considering tumor and normal samples. However, there are different levels of analysis we can conduct, which means we'll create a couple of metadata files based on clinical data.

First, let's create a file with a simple distribution of group 0 for normal tissue and group 1 for tumor tissue.

```{r}
library(DESeq2)
library(ggplot2)

#Access sample IDs to form column 1, then check for the .01 tag to assign groups as 1 or 0 for tumor or normal.
metadata <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
metadata$'Group' <- ifelse(grepl(".01", metadata$Sample_ID, fixed = TRUE) == TRUE, 1, 0)
metadata$Group <- as.factor(metadata$Group)

for (i in 1:551) {
  sample <- metadata[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_2[which(clin_2$sample == sample2), ]
  metadata[i, c("OS_time")] <- as.numeric(row1$'OS.time')
  metadata[i, c("OS_Status")] <- as.numeric(row1$'OS')
  metadata[i, c("PFI_time")] <- as.numeric(row1$'PFI.time')
  metadata[i, c("PFI_Status")] <- as.numeric(row1$'PFI')
  metadata[i, c("DSS_time")] <- as.numeric(row1$'DSS.time')
  metadata[i, c("DSS_Status")] <- as.numeric(row1$'DSS')
  #if (grepl(".11", sample, fixed=TRUE) == TRUE) {
  #  metadata[i, c("days_to_death")] <- runif(1, 1000, 4000)
  #  metadata[i, c("Status")] <- 0
  #} else{
  #  metadata[i, c("days_to_death")] <- as.numeric(group1) * 30
  #  metadata[i, c("Status")] <- as.numeric(row1$Survival_Group)
  #}
}

```

Next, let's create a metadata file that adds in recurrent samples as being group 2, and primary samples as being group 1.

```{r}
#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for disease free status, and assign group 2 to recurred samples.
metadata_recur <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
for (i in 1:551) {
  sample <- metadata_recur[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
  group1 <- row1$Recurrence_Group
  group2 <- row1$Overall.Survival..Months.
  if (grepl(".11", sample, fixed=TRUE) == TRUE) {
    metadata_recur[i, c("Group")] <- 0
    metadata_recur[i, c("days_to_death")] <- runif(1, 1000, 4000)
    metadata_recur[i, c("Status")] <- 0
  } else {
    metadata_recur[i, c("days_to_death")] <- as.numeric(group2) * 30
    metadata_recur[i, c("Status")] <- as.numeric(row1$Survival_Group)
    if (group1 == 1) {
      metadata_recur[i, c("Group")] <- 1
    } else{
      metadata_recur[i, c("Group")] <- 2
    }
  }
}

metadata_recur$Group <- as.factor(metadata_recur$Group)
```

We can also create a metadata file to evaluate metastatic vs. non-metastatic samples, which can form group 1 and 2 respectively.

```{r}
#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for metastatic status, and assign group 1 to metastatic samples.
metadata_metas <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:543]))

for (i in 1:542) {
  sample <- metadata_metas[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
  group1 <- row1$Metastasis_Group
  if (grepl(".11", sample, fixed=TRUE) == TRUE) {
    metadata_metas[i, c("Group")] <- 0
  } else {
    if (group1 == 1) {
      metadata_metas[i, c("Group")] <- 1
    } else{
      metadata_metas[i, c("Group")] <- 2
    }
    metadata_metas[i, c("days_to_death")] <- row1$Overall.Survival..Months.
  }
}

metadata_metas$Group <- as.factor(metadata_metas$Group)
```


```{r}

```





```{r}
#Need to convert the gene names to the actual row names, and remove that column
m_DESEQ <- mRNA_final[,-1]
rownames(m_DESEQ) <- mRNA_final$Gene_ID
```

```{r}
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

m_DESEQ <- m_DESEQ[30:20531,]
m_DESEQ$'ENTREZ' <- c(rownames(m_DESEQ))
m_DESEQ$'ENTREZ' <- gsub(".*\\|", "", m_DESEQ$'ENTREZ')
m_DESEQ$'Ensemble' <- mapIds(org.Hs.eg.db, keys=m_DESEQ$'ENTREZ', keytype="ENTREZID", column="ENSEMBL")
rownames(m_DESEQ) <- make.names(m_DESEQ$Ensemble, unique=TRUE)
m_DESEQ$Ensemble <- NULL
m_DESEQ$ENTREZ <- NULL

m_DESEQ <- m_DESEQ[complete.cases(m_DESEQ), ]

metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')
metadata_new$'days_to_death' <- metadata$days_to_death
#metadata_new$'days_to_last_follow_up' = NA

#metadata_test <- gdcParseMetadata(project.id='TCGA-THCA', data.type='RNAseq')

DEGAll <- gdcDEAnalysis(counts=round(m_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_mRNA <- gdcDEReport(deg=DEGAll)
head(dePC_mRNA)
```

```{r}
#Need to convert the gene names to the actual row names, and remove that column
mi_DESEQ <- miRNA_final[,-1]
rownames(mi_DESEQ) <- miRNA_final$Gene_ID
mi_DESEQ[] <- lapply(mi_DESEQ, as.numeric)
rownames(mi_DESEQ) <- sub("r", "R", rownames(mi_DESEQ))
mi_DESEQ <- mi_DESEQ[20:292,]
mi_DESEQ <- mi_DESEQ[!grepl('unannotated', rownames(mi_DESEQ)),]
mi_DESEQ <- na.omit(mi_DESEQ)
```

```{r}
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

DEGAll <- gdcDEAnalysis(counts=round(mi_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_miRNA <- gdcDEReport(deg=DEGAll)
head(dePC_miRNA)
```


```{r}
#Need to convert the gene names to the actual row names, and remove that column
lnc_DESEQ <- lncRNA_final[,-1]
rownames(lnc_DESEQ) <- lncRNA_final$Gene_ID
```

```{r}
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')

DEGAll <- gdcDEAnalysis(counts=round(lnc_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_lnc <- gdcDEReport(deg=DEGAll)
head(dePC_lnc)

```


```{r}
library(GDCRNATools)
mi_Expr <- gdcVoomNormalization(counts=mi_DESEQ, filter=TRUE)
colnames(mi_Expr) <- gsub("\\.", "-", colnames(mi_Expr))

m_Expr <- gdcVoomNormalization(counts=m_DESEQ, filter=TRUE)
colnames(m_Expr) <- gsub("\\.", "-", colnames(m_Expr))

lnc_Expr <- gdcVoomNormalization(counts=lnc_DESEQ, filter=TRUE)
colnames(lnc_Expr) <- gsub("\\.", "-", colnames(lnc_Expr))

de_RNA <- rbind(dePC_lnc, dePC_mRNA)
lnc_de <- de_RNA[de_RNA$'group' == 'long_non_coding',]
m_de <- de_RNA[de_RNA$'group' == 'protein_coding',]

gdcBarPlot(de_RNA, angle=45, data.type='RNAseq')

enrichOutput <- gdcEnrichAnalysis(gene=rownames(m_de), simplify=TRUE)
gdcEnrichPlot(enrichment=enrichOutput, type='bar', category='GO', num.terms=5)

gdcBarPlot(lnc_de, angle=45, data.type='RNAseq')

gdcBarPlot(dePC_miRNA, angle=45, data.type='miRNAs')

lnc_Expr_2 <- lnc_Expr[rownames(lnc_Expr) %in% rownames(lnc_de),]
m_Expr_2 <- m_Expr[rownames(m_Expr) %in% rownames(m_de),]

mi_Expr_2 <- mi_Expr[rownames(mi_Expr) %in% rownames(dePC_miRNA), ]


RNA_tot = rbind(lnc_Expr_2, m_Expr_2)
mi_Expr_df <- data.frame(mi_Expr)

ceOutput <- gdcCEAnalysis(lnc = rownames(lnc_Expr_2),
                          pc = rownames(m_Expr_2),
                          lnc.targets = 'starBase',
                          pc.targets = 'starBase',
                          rna.expr = RNA_tot,
                          mir.expr = mi_Expr_2)


```

```{r}
gdcBarPlot(dePC_miRNA, angle=45, data.type='miRNA')
lnc_Expr_2 <- lnc_Expr[rownames(lnc_Expr) %in% rownames(lnc_de),]
m_Expr_2 <- m_Expr[rownames(m_Expr) %in% rownames(m_de),]

mi_Expr_2 <- mi_Expr[rownames(mi_Expr) %in% rownames(dePC_miRNA), ]


RNA_tot = rbind(lnc_Expr_2, m_Expr_2)
mi_Expr_df <- data.frame(mi_Expr)
```


```{r}
deLNC <- c('ENSG00000260920','ENSG00000242125','ENSG00000261211')
dePC <- c('ENSG00000043355','ENSG00000109586','ENSG00000144355', 'ENSG00000198626')
genes <- c(deLNC, dePC)
samples <- c('TCGA-2F-A9KO-01', 'TCGA-2F-A9KP-01',
'TCGA-2F-A9KQ-01', 'TCGA-2F-A9KR-01',
'TCGA-2F-A9KT-01', 'TCGA-2F-A9KW-01')
rnaExpr <- data.frame(matrix(c(2.7,7.0,4.9,6.9,4.6,2.5,
0.5,2.5,5.7,6.5,4.9,3.8,
2.1,2.9,5.9,5.7,4.5,3.5,
2.7,5.9,4.5,5.8,5.2,3.0,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,3.8,6.2,3.8,3.8,4.2,
1.0,2.0,3.0,4.0,5.0,6.0),7,6),
stringsAsFactors=FALSE)
rownames(rnaExpr) <- genes
colnames(rnaExpr) <- samples
mirExpr <- data.frame(matrix(c(7.7,7.4,7.9,8.9,8.6,9.5,
5.1,4.4,5.5,8.5,4.4,3.5,
4.9,5.5,6.9,6.1,5.5,4.1,
12.4,13.5,15.1,15.4,13.0,12.8,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,2.7,6.2,1.5,4.4,4.2, 
1.0, 2.0, 3.0, 4.0, 5.0, 6.0),7,6),
stringsAsFactors=FALSE)
colnames(mirExpr) <- samples
rownames(mirExpr) <- c('hsa-miR-340-5p-mature','hsa-miR-181b-1-5p-mature',
'hsa-miR-181a-2-3p-mature', 'hsa-miR-195-5p-mature',
'hsa-miR-199b-5p','hsa-miR-182-5p', 'hsa-miR-146b-5p-mature')
ceOutput <- gdcCEAnalysis(lnc = deLNC,
pc = dePC,
lnc.targets = 'starBase',
pc.targets = 'starBase',
rna.expr = rnaExpr,
mir.expr = mirExpr)
```








```{r}
library(survival)
library(survminer)


s <- Surv(metadata$PFI_time, metadata$PFI_Status)

sfit <- survfit(Surv(OS_time, OS_Status)~PFI_Status, data=metadata)
summary(sfit)


ggsurvplot(sfit, conf.int=TRUE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Not-recurrent", "Recurrent"), legend.title="Group",  
           palette=c("dodgerblue2", "orchid2"), 
           title="Kaplan-Meier Curve for Thyroid Cancer Survival", risk.table.height=0.3)
```


```{r}
mod <- coxph(Surv(OS_time, OS_Status)~PFI_Status, data=metadata)
summary(mod)
related <- c()
p_vals <- c()
hazards <- c()
lower_95 <- c()
higher_95 <- c()
RNA_tot_2 <- as.data.frame(t(RNA_tot))
lnc_Expr_3 <- as.data.frame(t(lnc_Expr_2))
lnc_merged <- cbind(lnc_Expr_3, metadata)
RNA_tot_2_merged <- cbind(RNA_tot_2, metadata)

mi_Expr_3 <- as.data.frame(t(mi_Expr_2))
mi_merged <- cbind(mi_Expr_3, metadata)

full_merge <- cbind(lnc_Expr_3, mi_merged)


for (i in 1:110) {
  
  mod2 <- coxph(Surv(PFI_time, PFI_Status)~get(colnames(full_merge)[i]), data=full_merge)
  p_val <- summary(mod2)$coefficients[1,5]
  hazard <- summary(mod2)$coefficients[1,2]
  lower_conf <- summary(mod2)$conf.int[1,3]
  higher_conf <- summary(mod2)$conf.int[1,4]
  #print(p_val)
  if (p_val <= 0.01) {
    related <- c(related, colnames(full_merge)[i])
    p_vals <- c(p_vals, p_val)
    hazards <- c(hazards, hazard)
    lower_95 <- c(lower_95, lower_conf)
    higher_95 <- c(higher_95, higher_conf)
    print(colnames(full_merge)[i])
    print(summary(mod2))
  }
}

library(dplyr)
RNA_tot_2_merged$'Gene1_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000187678, 4)
RNA_tot_2_merged$'Gene2_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000183615, 4)
RNA_tot_2_merged$'Gene3_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000105877, 4)
RNA_tot_2_merged$'Gene4_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000132840, 4)

mi_merged$'Gene1_Quartile' <- ntile(mi_merged$'hsa-miR-181a-2-3p-star', 4)
mi_merged$'Gene2_Quartile' <- ntile(mi_merged$'hsa-miR-146b-3p-mature', 4)

gene_1 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene1_Quartile %in% c(1, 4),]

gene_1$Gene1_Quartile <- sub(4, 2, gene_1$Gene1_Quartile)
sfit_1 <- survfit(Surv(OS_time, OS_Status)~Gene1_Quartile, data=gene_1)
summary(sfit_1)

ggsurvplot(sfit_1, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for SPRY4", risk.table.height=0.3)


gene_2 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene2_Quartile %in% c(1, 4),]

gene_2$Gene2_Quartile <- sub(4, 2, gene_2$Gene2_Quartile)
sfit_2 <- survfit(Surv(PFI_time, PFI_Status)~Gene2_Quartile, data=gene_2)
summary(sfit_2)

ggsurvplot(sfit_2, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for FAM167B", risk.table.height=0.3)

gene_3 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene3_Quartile %in% c(1, 4),]

gene_3$Gene3_Quartile <- sub(4, 2, gene_3$Gene3_Quartile)
sfit_3 <- survfit(Surv(PFI_time, PFI_Status)~Gene3_Quartile, data=gene_3)
summary(sfit_3)

ggsurvplot(sfit_3, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for DNAH11", risk.table.height=0.3)

gene_4 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene4_Quartile %in% c(1, 4),]

gene_4$Gene1_Quartile <- sub(4, 2, gene_4$Gene4_Quartile)
sfit_4 <- survfit(Surv(PFI_time, PFI_Status)~Gene4_Quartile, data=gene_4)
summary(sfit_4)

ggsurvplot(sfit_4, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for BHMT2", risk.table.height=0.3)


```


```{r}
res <- data.frame(related, p_vals, hazards, lower_95, higher_95)
res$related <- de_RNA[rownames(de_RNA) %in% res$related,]$symbol
p <- 
  res |>
  ggplot(aes(y = fct_rev(related))) + 
  theme_classic()

p <- p +
  geom_point(aes(x=hazards), shape=15, size=3) +
  geom_linerange(aes(xmin=lower_95, xmax=higher_95))

p <- p +
  geom_vline(xintercept = 1, linetype="dashed") +
  labs(x="Hazard Ratio", y="")

p <- p +
  coord_cartesian(ylim=c(1,13), xlim=c(0, 2))

p <- p +
  annotate("text", x = .3, y = 13, label = "Lower recur chance") +
  annotate("text", x = 1.7, y = 13, label = "Higher recur chance")

p_mid <- p + 
  theme(axis.line.y = element_blank(),
        axis.ticks.y= element_blank(),
        axis.text.y= element_blank(),
        axis.title.y= element_blank())

res_plot <- res |>
  # round estimates and 95% CIs to 2 decimal places for journal specifications
  mutate(across(
    c(hazards, lower_95, higher_95),
    ~ str_pad(
      round(.x, 2),
      width = 4,
      pad = "0",
      side = "right"
    )
  ),
  # add an "-" between HR estimate confidence intervals
  estimate_lab = paste0(hazards, " (", lower_95, "-", higher_95, ")")) |>
  # round p-values to two decimal places, except in cases where p < .001
  mutate(p_vals = case_when(
    p_vals < .001 ~ "<0.001",
    round(p_vals, 2) == .05 ~ as.character(round(p_vals,3)),
    p_vals < .01 ~ str_pad( # if less than .01, go one more decimal place
      as.character(round(p_vals, 3)),
      width = 4,
      pad = "0",
      side = "right"
    ),
    TRUE ~ str_pad( # otherwise just round to 2 decimal places and pad string so that .2 reads as 0.20
      as.character(round(p_vals, 2)),
      width = 4,
      pad = "0",
      side = "right"
    )
  )) |>
  # add a row of data that are actually column names which will be shown on the plot in the next step
  bind_rows(
    data.frame(
      related = "Model",
      estimate_lab = "Hazard Ratio (95% CI)",
      lower_95 = "",
      higher_95 = "",
      p_vals = "p-value"
    )
  ) |>
  mutate(related = fct_rev(fct_relevel(related, "Model")))

p_left <-
  res_plot  |>
  ggplot(aes(y = related))

p_left <- 
  p_left +
  geom_text(aes(x = 0, label = related), hjust = 0, fontface = "bold")

p_left <- 
  p_left +
  geom_text(
    aes(x = 2.5, label = estimate_lab),
    hjust = 0,
    fontface = ifelse(res_plot$estimate_lab == "Hazard Ratio (95% CI)", "bold", "plain")
  )

p_left <-
  p_left +
  theme_void() +
  coord_cartesian(xlim = c(0, 4))

p_right <-
  res_plot  |>
  ggplot() +
  geom_text(
    aes(x = 0, y = related, label = p_vals),
    hjust = 0,
    fontface = ifelse(res_plot$p_vals == "p-value", "bold", "plain")
  ) +
  theme_void()

layout <- c(
  area(t = 0, l = 0, b = 30, r = 3), # left plot, starts at the top of the page (0) and goes 30 units down and 3 units to the right
  area(t = 1, l = 4, b = 30, r = 9), # middle plot starts a little lower (t=1) because there's no title. starts 1 unit right of the left plot (l=4, whereas left plot is r=3), goes to the bottom of the page (30 units), and 6 units further over from the left plot (r=9 whereas left plot is r=3)
  area(t = 0, l = 9, b = 30, r = 11) # right most plot starts at top of page, begins where middle plot ends (l=9, and middle plot is r=9), goes to bottom of page (b=30), and extends two units wide (r=11)
)
# final plot arrangement
p_left + p_mid + p_right + plot_layout(design = layout)

```

