Data Preprocessing
After looking at the table heads, there are a couple of issues that
we can see:
1) The sample labels aren't identical - some have extra elements at the beginning or end, which need to be standardized
2) The number of samples in each file aren't the same - for example, miRNA only has 541 samples, and lncRNA has 556. This also needs to be standardized to make sure only samples with all data are used.
Let’s start with step 1.
The general format of the data should look something like this:
‘TCGA-XX-XXXX-XX’, so let’s look at each file to see how we can fix
it.
lncRNA seems to have extra labels at the start which we can remove,
as well as underscores which we can convert to hyphens. However, what’s
also important here is that we have normal and tumor samples explictly
labeled, whereas other files use the ‘01’ marking for tumors and ‘11’
marking for normal samples. We need to adapt our lncRNA data to follow
this structure.
#We are going through each column in the lncRNA dataset
for (i in 1:ncol(lncRNA)) {
#If the column name - which is the sample ID - contains Normal, we'll add a -11 and format the name to remove the beginning and convert all underscores to hyphens.
if (grepl("Normal", colnames(lncRNA)[i], fixed = TRUE) == TRUE) {
colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-11', sep="")
} else {
#Otherwise, we'll still format the name, but we'll add a -01 at the end instead.
colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-01', sep="")
}
}
#We'll convert the first column to the standard name Gene_ID and remove the extra marker in the Ensemble ID after the decimal point.
colnames(lncRNA)[1] <- "Gene_ID"
lncRNA$Gene_ID <- sub("\\..*", "", lncRNA$Gene_ID)
head(lncRNA)
That looks more in line with the standard sample notation.
Both the methylation and the mRNA file have underscores that we can
convert to hyphens.
#We'll iterate over each column and convert the underscores to dashes, for both datasets. Then, we'll convert the first column to the Gene_ID label.
for (i in 1:ncol(mRNA)) {
colnames(mRNA)[i] <- gsub("_", "-", colnames(mRNA)[i])
}
colnames(mRNA)[1] <- "Gene_ID"
mRNA <- mRNA[30:20531,]
head(mRNA)
#for (i in 1:ncol(methylation)) {
# colnames(methylation)[i] <- gsub("_", "-", colnames(methylation)[i])
#}
#colnames(methylation)[1] <- "Gene_ID"
#head(methylation)
That seems to be fixed as well.
For miRNA data, there is a lot of extraneous data at the end about
sample and vials, which we don’t need, so let’s remove that.
#We'll iterate over the column names and remove any extraneous information after our normal or tumor sample labels, and then create the Gene_ID column.
for (i in 1:ncol(miRNA)) {
colnames(miRNA)[i] <- sub("_01.*", "_01", colnames(miRNA)[i])
colnames(miRNA)[i] <- sub("_11.*", "_11", colnames(miRNA)[i])
colnames(miRNA)[i] <- gsub("_", "-", colnames(miRNA)[i])
}
colnames(miRNA)[1] <- "Gene_ID"
miRNA$'miRNA-ID' <- NULL
miRNA <- miRNA[!grepl('precursor', miRNA$`miRNA-region`),]
miRNA <- miRNA[!grepl('stemloop', miRNA$`miRNA-region`),]
miRNA$'miRNA-region' <- NULL
head(miRNA)
Done!
For clinical data, we have our ‘Sample ID’ column with appropriate
notation, so we don’t need to change that.
Now, for step 2, we have to identify which samples are missing from
any of the files, as we only want data points present in each file. To
accomplish this, I am first going to access all the sample IDs from each
file. I’m then going to add them together to form a vector, and use the
table function to see how many times each value appears in the vector.
Given that we have 5 files of data, a sample present in all files should
appear at least 5 times. Thus, we will isolate any values that appear
less than 5 times.
#Here we access the specific sample names through indexing of the data (which starts from column 2).
clin_samples <- clinical$`Sample ID`
lnc_samples <- colnames(lncRNA)[2:557]
#meth_samples <- colnames(methylation)[5:571]
miR_samples <- colnames(miRNA)[2:1609]
mRNA_samples <- colnames(mRNA)[2:569]
#We concatenate all the sample data into a vector.
temp <- c(lnc_samples, miR_samples, mRNA_samples)
#Next, we count the occurrence of each unique sample ID in our vector using the table function - if a sample is present in all 4 datasets, it'll have a count of 4.
temp_table = table(temp)
#Now, we just want to keep the samples that have an occurrence of 4.
keep1 <- names(temp_table[temp_table >= 3])
keep1 <- c("Gene_ID", keep1)
Now, we will remove their columns/rows from each data file in a
sequential manner.
#Next, we just subset the datasets to only keep columns that match a value in our vector of accepted samples - using the subset and select functions.
miRNA_new <- subset(miRNA, select=c(keep1))
mRNA_new <- subset(mRNA, select=c(keep1))
#methylation_new <- subset(methylation, select=c(keep1))
lncRNA_new <- subset(lncRNA, select=c(keep1))
One additional step we need to take is to make sure that all our
tumor samples have clinical data available. Thus, I’m going to compare
our tumor samples (with the -01 ending) to our clinical sample IDs to
find and also remove any samples that don’t overlap, using a similar
method.
normal_samples
[1] "TCGA-BJ-A28R-11" "TCGA-BJ-A28W-11" "TCGA-BJ-A28X-11" "TCGA-BJ-A290-11" "TCGA-BJ-A2N7-11"
[6] "TCGA-BJ-A2N8-11" "TCGA-BJ-A2N9-11" "TCGA-BJ-A2NA-11" "TCGA-BJ-A3PR-11" "TCGA-BJ-A3PU-11"
[11] "TCGA-DO-A1JZ-11" "TCGA-E8-A2JQ-11" "TCGA-EL-A3GZ-11" "TCGA-EL-A3H1-11" "TCGA-EL-A3H2-11"
[16] "TCGA-EL-A3H7-11" "TCGA-EL-A3MW-11" "TCGA-EL-A3MX-11" "TCGA-EL-A3MY-11" "TCGA-EL-A3N2-11"
[21] "TCGA-EL-A3N3-11" "TCGA-EL-A3T0-11" "TCGA-EL-A3T1-11" "TCGA-EL-A3T2-11" "TCGA-EL-A3T3-11"
[26] "TCGA-EL-A3T6-11" "TCGA-EL-A3T7-11" "TCGA-EL-A3T8-11" "TCGA-EL-A3TA-11" "TCGA-EL-A3TB-11"
[31] "TCGA-EL-A3ZG-11" "TCGA-EL-A3ZH-11" "TCGA-EL-A3ZK-11" "TCGA-EL-A3ZL-11" "TCGA-EL-A3ZM-11"
[36] "TCGA-EL-A3ZO-11" "TCGA-EL-A3ZP-11" "TCGA-EL-A3ZQ-11" "TCGA-EL-A3ZR-11" "TCGA-EL-A3ZS-11"
[41] "TCGA-EL-A3ZT-11" "TCGA-EM-A1CS-11" "TCGA-EM-A1CT-11" "TCGA-EM-A1CU-11" "TCGA-EM-A1CV-11"
[46] "TCGA-EM-A1CW-11" "TCGA-EM-A1YC-11" "TCGA-EM-A3ST-11" "TCGA-ET-A2MX-11" "TCGA-ET-A2N5-11"
[51] "TCGA-ET-A3DP-11" "TCGA-ET-A3DW-11" "TCGA-FY-A3TY-11" "TCGA-GE-A2C6-11" "TCGA-H2-A2K9-11"
[56] "TCGA-H2-A3RI-11" "TCGA-KS-A41I-11" "TCGA-KS-A41J-11" "TCGA-KS-A41L-11"
Now, all our data formatting is done - each of our datasets is
composed of 542 samples, with 492 tumor samples (as the clinical data
suggests) and 50 normal samples. Before we move into our actual
analysis, we need to make sure all our samples have data points for the
RNAs being analyzed. Since a lot of the files have ‘NA’ values in them,
we can’t work with this, so we’ll just eliminate all rows containing an
NA.
library(dplyr)
Warning: package ‘dplyr’ was built under R version 4.2.2
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(readr)
Warning: package ‘readr’ was built under R version 4.2.2
library(edgeR)
Warning: package ‘edgeR’ was built under R version 4.2.2Loading required package: limma
Warning: package ‘limma’ was built under R version 4.2.2
#For each dataset, we are going to remove any rows with NA in them, and then convert it to a data frame to work with our DESeq2 functions.
miRNA_new[miRNA_new == "NA"] <- NA
miRNA_final <- miRNA_new[complete.cases(miRNA_new), ]
miRNA_final <- na.omit(miRNA_final)
miRNA_final = data.frame(miRNA_final)
mRNA_new[mRNA_new == "NA"] <- NA
mRNA_final <- mRNA_new[complete.cases(mRNA_new), ]
mRNA_final = data.frame(mRNA_final)
lncRNA_new[lncRNA_new == "NA"] <- NA
lncRNA_final <- lncRNA_new[complete.cases(lncRNA_new), ]
lncRNA_final = data.frame(lncRNA_final)
#methylation_new[methylation_new == "NA"] <- NA
#methylation_final <- methylation_new[complete.cases(methylation_new), ]
#methylation_final = data.frame(methylation_final)
Differential Gene Expression
We finally have our final datasets that we can work with. Let’s begin
with analysis.
We first should create a proper metadata variable to begin analysis,
by considering tumor and normal samples. However, there are different
levels of analysis we can conduct, which means we’ll create a couple of
metadata files based on clinical data.
First, let’s create a file with a simple distribution of group 0 for
normal tissue and group 1 for tumor tissue.
library(DESeq2)
library(ggplot2)
#Access sample IDs to form column 1, then check for the .01 tag to assign groups as 1 or 0 for tumor or normal.
metadata <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
metadata$'Group' <- ifelse(grepl(".01", metadata$Sample_ID, fixed = TRUE) == TRUE, 1, 0)
metadata$Group <- as.factor(metadata$Group)
for (i in 1:551) {
sample <- metadata[i, 1]
sample2 <- gsub("\\.", "-", sample)
row1 <- clin_2[which(clin_2$sample == sample2), ]
metadata[i, c("OS_time")] <- as.numeric(row1$'OS.time')
metadata[i, c("OS_Status")] <- as.numeric(row1$'OS')
metadata[i, c("PFI_time")] <- as.numeric(row1$'PFI.time')
metadata[i, c("PFI_Status")] <- as.numeric(row1$'PFI')
metadata[i, c("DSS_time")] <- as.numeric(row1$'DSS.time')
metadata[i, c("DSS_Status")] <- as.numeric(row1$'DSS')
#if (grepl(".11", sample, fixed=TRUE) == TRUE) {
# metadata[i, c("days_to_death")] <- runif(1, 1000, 4000)
# metadata[i, c("Status")] <- 0
#} else{
# metadata[i, c("days_to_death")] <- as.numeric(group1) * 30
# metadata[i, c("Status")] <- as.numeric(row1$Survival_Group)
#}
}
Next, let’s create a metadata file that adds in recurrent samples as
being group 2, and primary samples as being group 1.
#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for disease free status, and assign group 2 to recurred samples.
metadata_recur <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
for (i in 1:551) {
sample <- metadata_recur[i, 1]
sample2 <- gsub("\\.", "-", sample)
row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
group1 <- row1$Recurrence_Group
group2 <- row1$Overall.Survival..Months.
if (grepl(".11", sample, fixed=TRUE) == TRUE) {
metadata_recur[i, c("Group")] <- 0
metadata_recur[i, c("days_to_death")] <- runif(1, 1000, 4000)
metadata_recur[i, c("Status")] <- 0
} else {
metadata_recur[i, c("days_to_death")] <- as.numeric(group2) * 30
metadata_recur[i, c("Status")] <- as.numeric(row1$Survival_Group)
if (group1 == 1) {
metadata_recur[i, c("Group")] <- 1
} else{
metadata_recur[i, c("Group")] <- 2
}
}
}
metadata_recur$Group <- as.factor(metadata_recur$Group)
We can also create a metadata file to evaluate metastatic
vs. non-metastatic samples, which can form group 1 and 2
respectively.
#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for metastatic status, and assign group 1 to metastatic samples.
metadata_metas <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:543]))
for (i in 1:542) {
sample <- metadata_metas[i, 1]
sample2 <- gsub("\\.", "-", sample)
row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
group1 <- row1$Metastasis_Group
if (grepl(".11", sample, fixed=TRUE) == TRUE) {
metadata_metas[i, c("Group")] <- 0
} else {
if (group1 == 1) {
metadata_metas[i, c("Group")] <- 1
} else{
metadata_metas[i, c("Group")] <- 2
}
metadata_metas[i, c("days_to_death")] <- row1$Overall.Survival..Months.
}
}
metadata_metas$Group <- as.factor(metadata_metas$Group)
#Need to convert the gene names to the actual row names, and remove that column
m_DESEQ <- mRNA_final[,-1]
rownames(m_DESEQ) <- mRNA_final$Gene_ID
library(GDCRNATools)
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
formally cite the original Pathview paper (not just mention it) in publications
or products. For details, do citation("pathview") within R.
The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
license agreement (details at http://www.kegg.jp/kegg/legal.html).
##############################################################################
library("AnnotationDbi")
Warning: package ‘AnnotationDbi’ was built under R version 4.2.2
Attaching package: ‘AnnotationDbi’
The following object is masked from ‘package:dplyr’:
select
library("org.Hs.eg.db")
m_DESEQ <- m_DESEQ[30:20531,]
m_DESEQ$'ENTREZ' <- c(rownames(m_DESEQ))
m_DESEQ$'ENTREZ' <- gsub(".*\\|", "", m_DESEQ$'ENTREZ')
m_DESEQ$'Ensemble' <- mapIds(org.Hs.eg.db, keys=m_DESEQ$'ENTREZ', keytype="ENTREZID", column="ENSEMBL")
'select()' returned 1:many mapping between keys and columns
rownames(m_DESEQ) <- make.names(m_DESEQ$Ensemble, unique=TRUE)
m_DESEQ$Ensemble <- NULL
m_DESEQ$ENTREZ <- NULL
m_DESEQ <- m_DESEQ[complete.cases(m_DESEQ), ]
metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')
metadata_new$'days_to_death' <- metadata$days_to_death
#metadata_new$'days_to_last_follow_up' = NA
#metadata_test <- gdcParseMetadata(project.id='TCGA-THCA', data.type='RNAseq')
DEGAll <- gdcDEAnalysis(counts=round(m_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')
dePC_mRNA <- gdcDEReport(deg=DEGAll)
head(dePC_mRNA)
#Need to convert the gene names to the actual row names, and remove that column
mi_DESEQ <- miRNA_final[,-1]
rownames(mi_DESEQ) <- miRNA_final$Gene_ID
mi_DESEQ[] <- lapply(mi_DESEQ, as.numeric)
rownames(mi_DESEQ) <- sub("r", "R", rownames(mi_DESEQ))
mi_DESEQ <- mi_DESEQ[20:292,]
mi_DESEQ <- mi_DESEQ[!grepl('unannotated', rownames(mi_DESEQ)),]
mi_DESEQ <- na.omit(mi_DESEQ)
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")
DEGAll <- gdcDEAnalysis(counts=round(mi_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')
dePC_miRNA <- gdcDEReport(deg=DEGAll)
head(dePC_miRNA)
#Need to convert the gene names to the actual row names, and remove that column
lnc_DESEQ <- lncRNA_final[,-1]
rownames(lnc_DESEQ) <- lncRNA_final$Gene_ID
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")
metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')
DEGAll <- gdcDEAnalysis(counts=round(lnc_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')
dePC_lnc <- gdcDEReport(deg=DEGAll)
head(dePC_lnc)
NA
ceOutput <- gdcCEAnalysis(lnc = rownames(lnc_Expr_2),
pc = rownames(m_Expr_2),
lnc.targets = 'starBase',
pc.targets = 'starBase',
rna.expr = RNA_tot,
mir.expr = mi_Expr_2)
Step 1/3: Hypergenometric test done !
Error in mir.expr[mir, ] : subscript out of bounds
gdcBarPlot(dePC_miRNA, angle=45, data.type='miRNA')
lnc_Expr_2 <- lnc_Expr[rownames(lnc_Expr) %in% rownames(lnc_de),]
m_Expr_2 <- m_Expr[rownames(m_Expr) %in% rownames(m_de),]
mi_Expr_2 <- mi_Expr[rownames(mi_Expr) %in% rownames(dePC_miRNA), ]
RNA_tot = rbind(lnc_Expr_2, m_Expr_2)
mi_Expr_df <- data.frame(mi_Expr)
deLNC <- c('ENSG00000260920','ENSG00000242125','ENSG00000261211')
dePC <- c('ENSG00000043355','ENSG00000109586','ENSG00000144355', 'ENSG00000198626')
genes <- c(deLNC, dePC)
samples <- c('TCGA-2F-A9KO-01', 'TCGA-2F-A9KP-01',
'TCGA-2F-A9KQ-01', 'TCGA-2F-A9KR-01',
'TCGA-2F-A9KT-01', 'TCGA-2F-A9KW-01')
rnaExpr <- data.frame(matrix(c(2.7,7.0,4.9,6.9,4.6,2.5,
0.5,2.5,5.7,6.5,4.9,3.8,
2.1,2.9,5.9,5.7,4.5,3.5,
2.7,5.9,4.5,5.8,5.2,3.0,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,3.8,6.2,3.8,3.8,4.2,
1.0,2.0,3.0,4.0,5.0,6.0),7,6),
stringsAsFactors=FALSE)
rownames(rnaExpr) <- genes
colnames(rnaExpr) <- samples
mirExpr <- data.frame(matrix(c(7.7,7.4,7.9,8.9,8.6,9.5,
5.1,4.4,5.5,8.5,4.4,3.5,
4.9,5.5,6.9,6.1,5.5,4.1,
12.4,13.5,15.1,15.4,13.0,12.8,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,2.7,6.2,1.5,4.4,4.2,
1.0, 2.0, 3.0, 4.0, 5.0, 6.0),7,6),
stringsAsFactors=FALSE)
colnames(mirExpr) <- samples
rownames(mirExpr) <- c('hsa-miR-340-5p-mature','hsa-miR-181b-1-5p-mature',
'hsa-miR-181a-2-3p-mature', 'hsa-miR-195-5p-mature',
'hsa-miR-199b-5p','hsa-miR-182-5p', 'hsa-miR-146b-5p-mature')
ceOutput <- gdcCEAnalysis(lnc = deLNC,
pc = dePC,
lnc.targets = 'starBase',
pc.targets = 'starBase',
rna.expr = rnaExpr,
mir.expr = mirExpr)
Step 1/3: Hypergenometric test done !
Step 2/3: Correlation analysis done !
Step 3/3: Regulation pattern analysis done !
library(survminer)
Warning: package ‘survminer’ was built under R version 4.2.2Loading required package: ggplot2
Warning: package ‘ggplot2’ was built under R version 4.2.2Loading required package: ggpubr
Warning: package ‘ggpubr’ was built under R version 4.2.2
Attaching package: ‘survminer’
The following object is masked from ‘package:survival’:
myeloma
gene_1 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene1_Quartile %in% c(1, 4),]
gene_1$Gene1_Quartile <- sub(4, 2, gene_1$Gene1_Quartile)
sfit_1 <- survfit(Surv(OS_time, OS_Status)~Gene1_Quartile, data=gene_1)
summary(sfit_1)
Call: survfit(formula = Surv(OS_time, OS_Status) ~ Gene1_Quartile,
data = gene_1)
Gene1_Quartile=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
243 132 1 0.992 0.00755 0.978 1.000
1019 69 1 0.978 0.01610 0.947 1.000
1500 44 1 0.956 0.02703 0.904 1.000
1597 41 1 0.933 0.03501 0.866 1.000
1734 38 1 0.908 0.04181 0.830 0.994
1753 37 1 0.883 0.04734 0.795 0.981
Gene1_Quartile=2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1.01e+03 6.50e+01 1.00e+00 9.85e-01 1.53e-02 9.55e-01 1.00e+00
ggsurvplot(sfit_1, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata",
legend.labs=c("Low Expression", "High Expression"), legend.title="Group",
palette=c("dodgerblue2", "red3"),
title="Kaplan-Meier Curve for hsa-miR-146b-3p-mature", risk.table.height=0.3)


gene_2 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene2_Quartile %in% c(1, 4),]
gene_2$Gene2_Quartile <- sub(4, 2, gene_2$Gene2_Quartile)
sfit_2 <- survfit(Surv(PFI_time, PFI_Status)~Gene2_Quartile, data=gene_2)
summary(sfit_2)
Call: survfit(formula = Surv(PFI_time, PFI_Status) ~ Gene2_Quartile,
data = gene_2)
Gene2_Quartile=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
30 138 1 0.993 0.00722 0.979 1.000
127 132 2 0.978 0.01273 0.953 1.000
158 130 1 0.970 0.01468 0.942 0.999
225 127 1 0.963 0.01644 0.931 0.995
267 124 2 0.947 0.01950 0.910 0.986
279 121 1 0.939 0.02085 0.899 0.981
329 120 1 0.931 0.02209 0.889 0.976
363 116 1 0.923 0.02332 0.879 0.970
385 113 1 0.915 0.02450 0.868 0.964
439 112 1 0.907 0.02561 0.858 0.959
445 110 1 0.899 0.02667 0.848 0.953
494 105 1 0.890 0.02775 0.837 0.946
526 103 1 0.882 0.02880 0.827 0.940
533 101 2 0.864 0.03076 0.806 0.927
544 98 1 0.855 0.03169 0.795 0.920
903 68 2 0.830 0.03540 0.764 0.902
1019 62 1 0.817 0.03727 0.747 0.893
1062 60 1 0.803 0.03906 0.730 0.883
1204 54 1 0.788 0.04107 0.712 0.873
1385 48 1 0.772 0.04337 0.691 0.862
Gene2_Quartile=2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
113 133 1 0.992 0.00749 0.978 1.000
265 127 1 0.985 0.01076 0.964 1.000
711 82 1 0.973 0.01598 0.942 1.000
720 81 1 0.961 0.01979 0.923 1.000
781 76 1 0.948 0.02322 0.904 0.995
874 70 1 0.934 0.02654 0.884 0.988
966 62 1 0.919 0.03009 0.862 0.980
ggsurvplot(sfit_2, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata",
legend.labs=c("Low Expression", "High Expression"), legend.title="Group",
palette=c("dodgerblue2", "red3"),
title="Kaplan-Meier Curve for FAM167B", risk.table.height=0.3)

gene_3 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene3_Quartile %in% c(1, 4),]
gene_3$Gene3_Quartile <- sub(4, 2, gene_3$Gene3_Quartile)
sfit_3 <- survfit(Surv(PFI_time, PFI_Status)~Gene3_Quartile, data=gene_3)
summary(sfit_3)
Call: survfit(formula = Surv(PFI_time, PFI_Status) ~ Gene3_Quartile,
data = gene_3)
Gene3_Quartile=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
127 133 1 0.992 0.00749 0.978 1.000
225 128 2 0.977 0.01314 0.952 1.000
267 125 1 0.969 0.01518 0.940 0.999
329 122 1 0.961 0.01701 0.928 0.995
431 115 1 0.953 0.01881 0.917 0.990
533 107 1 0.944 0.02063 0.904 0.985
544 105 1 0.935 0.02231 0.892 0.980
711 93 1 0.925 0.02423 0.879 0.974
1062 66 1 0.911 0.02762 0.858 0.967
1215 53 1 0.894 0.03200 0.833 0.959
1385 39 1 0.871 0.03852 0.798 0.950
1785 28 1 0.840 0.04809 0.751 0.939
Gene3_Quartile=2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
6 137 1 0.993 0.00727 0.979 1.000
113 134 1 0.985 0.01032 0.965 1.000
158 132 1 0.978 0.01266 0.953 1.000
174 131 3 0.955 0.01779 0.921 0.991
218 128 1 0.948 0.01915 0.911 0.986
267 126 1 0.940 0.02042 0.901 0.981
279 125 1 0.933 0.02160 0.892 0.976
411 114 1 0.925 0.02291 0.881 0.971
514 98 1 0.915 0.02454 0.868 0.965
577 87 1 0.905 0.02642 0.854 0.958
641 82 1 0.894 0.02831 0.840 0.951
781 71 1 0.881 0.03058 0.823 0.943
874 65 1 0.868 0.03298 0.805 0.935
903 62 2 0.840 0.03738 0.769 0.916
966 54 1 0.824 0.03979 0.750 0.906
1019 50 2 0.791 0.04451 0.709 0.883
1426 38 1 0.770 0.04796 0.682 0.870
1708 26 2 0.711 0.05983 0.603 0.839
ggsurvplot(sfit_3, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata",
legend.labs=c("Low Expression", "High Expression"), legend.title="Group",
palette=c("dodgerblue2", "red3"),
title="Kaplan-Meier Curve for DNAH11", risk.table.height=0.3)

gene_4 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene4_Quartile %in% c(1, 4),]
gene_4$Gene1_Quartile <- sub(4, 2, gene_4$Gene4_Quartile)
sfit_4 <- survfit(Surv(PFI_time, PFI_Status)~Gene4_Quartile, data=gene_4)
summary(sfit_4)
Call: survfit(formula = Surv(PFI_time, PFI_Status) ~ Gene4_Quartile,
data = gene_4)
Gene4_Quartile=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
6 138 1 0.993 0.00722 0.979 1.000
30 137 1 0.986 0.01017 0.966 1.000
110 132 1 0.978 0.01254 0.954 1.000
113 131 1 0.971 0.01450 0.943 0.999
127 130 1 0.963 0.01619 0.932 0.995
174 128 1 0.956 0.01773 0.921 0.991
204 126 1 0.948 0.01914 0.911 0.986
225 124 1 0.940 0.02046 0.901 0.981
265 120 1 0.933 0.02174 0.891 0.976
267 119 1 0.925 0.02292 0.881 0.971
279 117 1 0.917 0.02405 0.871 0.965
332 114 1 0.909 0.02515 0.861 0.959
363 109 1 0.900 0.02626 0.850 0.953
439 102 1 0.892 0.02745 0.839 0.947
514 95 1 0.882 0.02872 0.828 0.940
711 84 1 0.872 0.03024 0.814 0.933
874 71 1 0.859 0.03221 0.799 0.925
933 66 1 0.846 0.03425 0.782 0.916
1062 55 1 0.831 0.03692 0.762 0.907
1204 47 1 0.813 0.04015 0.738 0.896
1215 46 1 0.796 0.04299 0.716 0.885
1385 40 1 0.776 0.04629 0.690 0.872
1426 39 1 0.756 0.04919 0.665 0.859
1708 27 1 0.728 0.05476 0.628 0.844
1785 26 1 0.700 0.05938 0.593 0.826
Gene4_Quartile=4
time n.risk n.event survival std.err lower 95% CI upper 95% CI
127 131 1 0.992 0.0076 0.978 1.000
174 130 1 0.985 0.0107 0.964 1.000
279 125 1 0.977 0.0132 0.951 1.000
533 98 1 0.967 0.0164 0.935 1.000
641 84 1 0.955 0.0198 0.917 0.995
720 75 1 0.943 0.0233 0.898 0.989
966 60 2 0.911 0.0314 0.852 0.975
1019 55 1 0.895 0.0349 0.829 0.966
1708 27 1 0.862 0.0468 0.775 0.958
ggsurvplot(sfit_4, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata",
legend.labs=c("Low Expression", "High Expression"), legend.title="Group",
palette=c("dodgerblue2", "red3"),
title="Kaplan-Meier Curve for BHMT2", risk.table.height=0.3)


---
title: "R Notebook"
output: html_notebook
---
## *Data Retrieval*

This project is focused on analyzing the TCGA-THCA data from a multiomics approach. Specifically, we will be analyzing mRNA, miRNA, lncRNA, and methylation data to describe the trends and relationships between the multiomic profiles of patients and their prognostic characteristics (e.g. survival and/or recurrence).

To start off, we need to import the relevant data we will be using. This data is sourced from TCGA (The Cancer Genome Atlas), although the mRNA, miRNA, methylation, and clinical data came from Firehose (from the Broad Institute), and the lncRNA data came from TANRIC.

Let's first import the datasets we will be working with, and assign them to variables.
```{r}
#Inputting the data into variables that we can manipulate
library(readxl)

#Inputting the data into variables that we can manipulate
clinical <- read_excel("THCA_Clinical_Data.xlsx")
clinical_2 <- read_excel("Clinical_Possible.xlsx")
lncRNA <- read_excel("THCA_lncRNA_Data.xlsx")
#methylation <- read_excel("THCA_Methylation_Data.xlsx")
miRNA <- read_excel("THCA_miRNA_Data_Test.xlsx")
mRNA <- read_excel("THCA_mRNA_Data_Final.xlsx")

```

Let's look at the head of these tables to understand their structure.

```{r}
#Accessing the first few rows of each dataset
head(clinical)
head(lncRNA)
#head(methylation)
head(miRNA)
head(mRNA)
```
## *Data Preprocessing*

After looking at the table heads, there are a couple of issues that we can see:

    1) The sample labels aren't identical - some have extra elements at the beginning or end, which         need to be standardized
    
    2) The number of samples in each file aren't the same - for example, miRNA only has 541 samples,        and lncRNA has 556. This also needs to be standardized to make sure only samples with all           data are used.

Let's start with step 1.

The general format of the data should look something like this: 'TCGA-XX-XXXX-XX', so let's look at each file to see how we can fix it.

lncRNA seems to have extra labels at the start which we can remove, as well as underscores which we can convert to hyphens. However, what's also important here is that we have normal and tumor samples explictly labeled, whereas other files use the '01' marking for tumors and '11' marking for normal samples. We need to adapt our lncRNA data to follow this structure.
```{r}
#We are going through each column in the lncRNA dataset
for (i in 1:ncol(lncRNA)) {
  #If the column name - which is the sample ID - contains Normal, we'll add a -11 and format the name to remove the beginning and convert all underscores to hyphens.
  
  if (grepl("Normal", colnames(lncRNA)[i], fixed = TRUE) == TRUE) {
    colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-11', sep="")
  } else {
    #Otherwise, we'll still format the name, but we'll add a -01 at the end instead.
    colnames(lncRNA)[i] <- sub(".*-TC", "TC", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- gsub("_", "-", colnames(lncRNA)[i])
    colnames(lncRNA)[i] <- paste(colnames(lncRNA)[i], '-01', sep="")
  }
    
}

#We'll convert the first column to the standard name Gene_ID and remove the extra marker in the Ensemble ID after the decimal point.
colnames(lncRNA)[1] <- "Gene_ID"
lncRNA$Gene_ID <- sub("\\..*", "", lncRNA$Gene_ID)
head(lncRNA)
```

That looks more in line with the standard sample notation.

Both the methylation and the mRNA file have underscores that we can convert to hyphens.

```{r}
#We'll iterate over each column and convert the underscores to dashes, for both datasets. Then, we'll convert the first column to the Gene_ID label.
for (i in 1:ncol(mRNA)) {
  colnames(mRNA)[i] <- gsub("_", "-", colnames(mRNA)[i])
}
colnames(mRNA)[1] <- "Gene_ID"
mRNA <- mRNA[30:20531,]
head(mRNA)

#for (i in 1:ncol(methylation)) {
#  colnames(methylation)[i] <- gsub("_", "-", colnames(methylation)[i])
#}
#colnames(methylation)[1] <- "Gene_ID"
#head(methylation)
```
That seems to be fixed as well.

For miRNA data, there is a lot of extraneous data at the end about sample and vials, which we don't need, so let's remove that.

```{r}
#We'll iterate over the column names and remove any extraneous information after our normal or tumor sample labels, and then create the Gene_ID column.

for (i in 1:ncol(miRNA)) {
  colnames(miRNA)[i] <- sub("_01.*", "_01", colnames(miRNA)[i])
  colnames(miRNA)[i] <- sub("_11.*", "_11", colnames(miRNA)[i])
  colnames(miRNA)[i] <- gsub("_", "-", colnames(miRNA)[i])
}
colnames(miRNA)[1] <- "Gene_ID"
miRNA$'miRNA-ID' <- NULL
miRNA <- miRNA[!grepl('precursor', miRNA$`miRNA-region`),]
miRNA <- miRNA[!grepl('stemloop', miRNA$`miRNA-region`),]
miRNA$'miRNA-region' <- NULL
head(miRNA)
```
Done!

For clinical data, we have our 'Sample ID' column with appropriate notation, so we don't need to change that.

Now, for step 2, we have to identify which samples are missing from any of the files, as we only want data points present in each file. To accomplish this, I am first going to access all the sample IDs from each file. I'm then going to add them together to form a vector, and use the table function to see how many times each value appears in the vector. Given that we have 5 files of data, a sample present in all files should appear at least 5 times. Thus, we will isolate any values that appear less than 5 times.

```{r}
#Here we access the specific sample names through indexing of the data (which starts from column 2).
clin_samples <- clinical$`Sample ID`
lnc_samples <- colnames(lncRNA)[2:557]
#meth_samples <- colnames(methylation)[5:571]
miR_samples <- colnames(miRNA)[2:1609]
mRNA_samples <- colnames(mRNA)[2:569]

#We concatenate all the sample data into a vector.
temp <- c(lnc_samples, miR_samples, mRNA_samples)
#Next, we count the occurrence of each unique sample ID in our vector using the table function - if a sample is present in all 4 datasets, it'll have a count of 3.
temp_table = table(temp)
#Now, we just want to keep the samples that have an occurrence of 4.
keep1 <- names(temp_table[temp_table >= 3])
keep1 <- c("Gene_ID", keep1)

```

Now, we will remove their columns/rows from each data file in a sequential manner.

```{r}
#Next, we just subset the datasets to only keep columns that match a value in our vector of accepted samples - using the subset and select functions.
miRNA_new <- subset(miRNA, select=c(keep1))
mRNA_new <- subset(mRNA, select=c(keep1))
#methylation_new <- subset(methylation, select=c(keep1))
lncRNA_new <- subset(lncRNA, select=c(keep1))
```

One additional step we need to take is to make sure that all our tumor samples have clinical data available. Thus, I'm going to compare our tumor samples (with the -01 ending) to our clinical sample IDs to find and also remove any samples that don't overlap, using a similar method.

```{r}
#Now, since clinical data only contains tumor samples, we need to separate out our tumor samples by searching for all samples with the -01 tag.
tumor_samples <- grep("-01", keep1, value=TRUE)
normal_samples <- grep("-11", keep1, value=TRUE)
normal_samples

#Same as above - vector, count, keep anything occurring twice since we are comparing 2 datasets.
temp2 <- c(tumor_samples, clin_samples)
temp_table_2 = table(temp2)
keep2 <- names(temp_table_2[temp_table_2 >= 2])
keep2 <- c(keep2)

#We then delete any rows that don't correspond to a sample in our list, by using the in function to only keep rows with values in our vector.

clin_new <- clinical[clinical$`Sample ID` %in% keep2,]
clin_new <- data.frame(clin_new)

clin_2 <- clinical_2[clinical_2$sample %in% keep1,]
clin_2 <- data.frame(clin_2)

clin_new$'Metastasis_Group' <- with(clin_new, ifelse(clin_new$American.Joint.Committee.on.Cancer.Metastasis.Stage.Code == "M1", 1, 2))

clin_new$'Recurrence_Group' <- with(clin_new, ifelse(clin_new$Disease.Free.Status == "1:Recurred/Progressed", 1, 2))

clin_new
```

Now, all our data formatting is done - each of our datasets is composed of 542 samples, with 492 tumor samples (as the clinical data suggests) and 50 normal samples. Before we move into our actual analysis, we need to make sure all our samples have data points for the RNAs being analyzed. Since a lot of the files have 'NA' values in them, we can't work with this, so we'll just eliminate all rows containing an NA.

```{r}
library(dplyr)
library(readr)
library(edgeR)
#For each dataset, we are going to remove any rows with NA in them, and then convert it to a data frame to work with our DESeq2 functions.
miRNA_new[miRNA_new == "NA"] <- NA
miRNA_final <- miRNA_new[complete.cases(miRNA_new), ]
miRNA_final <- na.omit(miRNA_final)
miRNA_final = data.frame(miRNA_final)

mRNA_new[mRNA_new == "NA"] <- NA
mRNA_final <- mRNA_new[complete.cases(mRNA_new), ]
mRNA_final = data.frame(mRNA_final)

lncRNA_new[lncRNA_new == "NA"] <- NA
lncRNA_final <- lncRNA_new[complete.cases(lncRNA_new), ]
lncRNA_final = data.frame(lncRNA_final)

#methylation_new[methylation_new == "NA"] <- NA 
#methylation_final <- methylation_new[complete.cases(methylation_new), ]
#methylation_final = data.frame(methylation_final)

```

## *Differential Gene Expression*


We finally have our final datasets that we can work with. Let's begin with analysis. 

We first should create a proper metadata variable to begin analysis, by considering tumor and normal samples. However, there are different levels of analysis we can conduct, which means we'll create a couple of metadata files based on clinical data.

First, let's create a file with a simple distribution of group 0 for normal tissue and group 1 for tumor tissue.

```{r}
library(DESeq2)
library(ggplot2)

#Access sample IDs to form column 1, then check for the .01 tag to assign groups as 1 or 0 for tumor or normal.
metadata <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
metadata$'Group' <- ifelse(grepl(".01", metadata$Sample_ID, fixed = TRUE) == TRUE, 1, 0)
metadata$Group <- as.factor(metadata$Group)

for (i in 1:551) {
  sample <- metadata[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_2[which(clin_2$sample == sample2), ]
  metadata[i, c("OS_time")] <- as.numeric(row1$'OS.time')
  metadata[i, c("OS_Status")] <- as.numeric(row1$'OS')
  metadata[i, c("PFI_time")] <- as.numeric(row1$'PFI.time')
  metadata[i, c("PFI_Status")] <- as.numeric(row1$'PFI')
  metadata[i, c("DSS_time")] <- as.numeric(row1$'DSS.time')
  metadata[i, c("DSS_Status")] <- as.numeric(row1$'DSS')
  #if (grepl(".11", sample, fixed=TRUE) == TRUE) {
  #  metadata[i, c("days_to_death")] <- runif(1, 1000, 4000)
  #  metadata[i, c("Status")] <- 0
  #} else{
  #  metadata[i, c("days_to_death")] <- as.numeric(group1) * 30
  #  metadata[i, c("Status")] <- as.numeric(row1$Survival_Group)
  #}
}

```

Next, let's create a metadata file that adds in recurrent samples as being group 2, and primary samples as being group 1.

```{r}
#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for disease free status, and assign group 2 to recurred samples.
metadata_recur <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:552]))
for (i in 1:551) {
  sample <- metadata_recur[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
  group1 <- row1$Recurrence_Group
  group2 <- row1$Overall.Survival..Months.
  if (grepl(".11", sample, fixed=TRUE) == TRUE) {
    metadata_recur[i, c("Group")] <- 0
    metadata_recur[i, c("days_to_death")] <- runif(1, 1000, 4000)
    metadata_recur[i, c("Status")] <- 0
  } else {
    metadata_recur[i, c("days_to_death")] <- as.numeric(group2) * 30
    metadata_recur[i, c("Status")] <- as.numeric(row1$Survival_Group)
    if (group1 == 1) {
      metadata_recur[i, c("Group")] <- 1
    } else{
      metadata_recur[i, c("Group")] <- 2
    }
  }
}

metadata_recur$Group <- as.factor(metadata_recur$Group)
```

We can also create a metadata file to evaluate metastatic vs. non-metastatic samples, which can form group 1 and 2 respectively.

```{r}
#Access sample IDs to form column 1, then if the tumor is normal, assign group 0. If the tumor is not normal, reference clinical data for metastatic status, and assign group 1 to metastatic samples.
metadata_metas <- data.frame("Sample_ID"=c(colnames(lncRNA_final)[2:543]))

for (i in 1:542) {
  sample <- metadata_metas[i, 1]
  sample2 <- gsub("\\.", "-", sample)
  row1 <- clin_new[which(clin_new$Sample.ID == sample2), ]
  group1 <- row1$Metastasis_Group
  if (grepl(".11", sample, fixed=TRUE) == TRUE) {
    metadata_metas[i, c("Group")] <- 0
  } else {
    if (group1 == 1) {
      metadata_metas[i, c("Group")] <- 1
    } else{
      metadata_metas[i, c("Group")] <- 2
    }
    metadata_metas[i, c("days_to_death")] <- row1$Overall.Survival..Months.
  }
}

metadata_metas$Group <- as.factor(metadata_metas$Group)
```


```{r}

```





```{r}
#Need to convert the gene names to the actual row names, and remove that column
m_DESEQ <- mRNA_final[,-1]
rownames(m_DESEQ) <- mRNA_final$Gene_ID
```

```{r}
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

m_DESEQ <- m_DESEQ[30:20531,]
m_DESEQ$'ENTREZ' <- c(rownames(m_DESEQ))
m_DESEQ$'ENTREZ' <- gsub(".*\\|", "", m_DESEQ$'ENTREZ')
m_DESEQ$'Ensemble' <- mapIds(org.Hs.eg.db, keys=m_DESEQ$'ENTREZ', keytype="ENTREZID", column="ENSEMBL")
rownames(m_DESEQ) <- make.names(m_DESEQ$Ensemble, unique=TRUE)
m_DESEQ$Ensemble <- NULL
m_DESEQ$ENTREZ <- NULL

m_DESEQ <- m_DESEQ[complete.cases(m_DESEQ), ]

metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')
metadata_new$'days_to_death' <- metadata$days_to_death
#metadata_new$'days_to_last_follow_up' = NA

#metadata_test <- gdcParseMetadata(project.id='TCGA-THCA', data.type='RNAseq')

DEGAll <- gdcDEAnalysis(counts=round(m_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_mRNA <- gdcDEReport(deg=DEGAll)
head(dePC_mRNA)
```

```{r}
#Need to convert the gene names to the actual row names, and remove that column
mi_DESEQ <- miRNA_final[,-1]
rownames(mi_DESEQ) <- miRNA_final$Gene_ID
mi_DESEQ[] <- lapply(mi_DESEQ, as.numeric)
rownames(mi_DESEQ) <- sub("r", "R", rownames(mi_DESEQ))
mi_DESEQ <- mi_DESEQ[20:292,]
mi_DESEQ <- mi_DESEQ[!grepl('unannotated', rownames(mi_DESEQ)),]
mi_DESEQ <- na.omit(mi_DESEQ)
```

```{r}
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

DEGAll <- gdcDEAnalysis(counts=round(mi_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_miRNA <- gdcDEReport(deg=DEGAll)
head(dePC_miRNA)
```


```{r}
#Need to convert the gene names to the actual row names, and remove that column
lnc_DESEQ <- lncRNA_final[,-1]
rownames(lnc_DESEQ) <- lncRNA_final$Gene_ID
```

```{r}
library(GDCRNATools)
library("AnnotationDbi")
library("org.Hs.eg.db")

metadata_new <- data.frame(c(metadata$Sample_ID))
metadata_new$'Group' <- ifelse(metadata$Group == 0, 'NormalTissue', 'PrimaryTumor')

DEGAll <- gdcDEAnalysis(counts=round(lnc_DESEQ), group=metadata_new$Group, comparison='NormalTissue-PrimaryTumor', method='limma')

dePC_lnc <- gdcDEReport(deg=DEGAll)
head(dePC_lnc)

```


```{r}
library(GDCRNATools)
mi_Expr <- gdcVoomNormalization(counts=mi_DESEQ, filter=TRUE)
colnames(mi_Expr) <- gsub("\\.", "-", colnames(mi_Expr))

m_Expr <- gdcVoomNormalization(counts=m_DESEQ, filter=TRUE)
colnames(m_Expr) <- gsub("\\.", "-", colnames(m_Expr))

lnc_Expr <- gdcVoomNormalization(counts=lnc_DESEQ, filter=TRUE)
colnames(lnc_Expr) <- gsub("\\.", "-", colnames(lnc_Expr))

de_RNA <- rbind(dePC_lnc, dePC_mRNA)
lnc_de <- de_RNA[de_RNA$'group' == 'long_non_coding',]
m_de <- de_RNA[de_RNA$'group' == 'protein_coding',]

gdcBarPlot(de_RNA, angle=45, data.type='RNAseq')

enrichOutput <- gdcEnrichAnalysis(gene=rownames(m_de), simplify=TRUE)
gdcEnrichPlot(enrichment=enrichOutput, type='bar', category='GO', num.terms=5)

gdcBarPlot(lnc_de, angle=45, data.type='RNAseq')

gdcBarPlot(dePC_miRNA, angle=45, data.type='miRNAs')

lnc_Expr_2 <- lnc_Expr[rownames(lnc_Expr) %in% rownames(lnc_de),]
m_Expr_2 <- m_Expr[rownames(m_Expr) %in% rownames(m_de),]

mi_Expr_2 <- mi_Expr[rownames(mi_Expr) %in% rownames(dePC_miRNA), ]


RNA_tot = rbind(lnc_Expr_2, m_Expr_2)
mi_Expr_df <- data.frame(mi_Expr)

ceOutput <- gdcCEAnalysis(lnc = rownames(lnc_Expr_2),
                          pc = rownames(m_Expr_2),
                          lnc.targets = 'starBase',
                          pc.targets = 'starBase',
                          rna.expr = RNA_tot,
                          mir.expr = mi_Expr_2)


```

```{r}
gdcBarPlot(dePC_miRNA, angle=45, data.type='miRNA')
lnc_Expr_2 <- lnc_Expr[rownames(lnc_Expr) %in% rownames(lnc_de),]
m_Expr_2 <- m_Expr[rownames(m_Expr) %in% rownames(m_de),]

mi_Expr_2 <- mi_Expr[rownames(mi_Expr) %in% rownames(dePC_miRNA), ]


RNA_tot = rbind(lnc_Expr_2, m_Expr_2)
mi_Expr_df <- data.frame(mi_Expr)
```


```{r}
deLNC <- c('ENSG00000260920','ENSG00000242125','ENSG00000261211')
dePC <- c('ENSG00000043355','ENSG00000109586','ENSG00000144355', 'ENSG00000198626')
genes <- c(deLNC, dePC)
samples <- c('TCGA-2F-A9KO-01', 'TCGA-2F-A9KP-01',
'TCGA-2F-A9KQ-01', 'TCGA-2F-A9KR-01',
'TCGA-2F-A9KT-01', 'TCGA-2F-A9KW-01')
rnaExpr <- data.frame(matrix(c(2.7,7.0,4.9,6.9,4.6,2.5,
0.5,2.5,5.7,6.5,4.9,3.8,
2.1,2.9,5.9,5.7,4.5,3.5,
2.7,5.9,4.5,5.8,5.2,3.0,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,3.8,6.2,3.8,3.8,4.2,
1.0,2.0,3.0,4.0,5.0,6.0),7,6),
stringsAsFactors=FALSE)
rownames(rnaExpr) <- genes
colnames(rnaExpr) <- samples
mirExpr <- data.frame(matrix(c(7.7,7.4,7.9,8.9,8.6,9.5,
5.1,4.4,5.5,8.5,4.4,3.5,
4.9,5.5,6.9,6.1,5.5,4.1,
12.4,13.5,15.1,15.4,13.0,12.8,
2.5,2.2,5.3,4.4,4.4,2.9,
2.4,2.7,6.2,1.5,4.4,4.2, 
1.0, 2.0, 3.0, 4.0, 5.0, 6.0),7,6),
stringsAsFactors=FALSE)
colnames(mirExpr) <- samples
rownames(mirExpr) <- c('hsa-miR-340-5p-mature','hsa-miR-181b-1-5p-mature',
'hsa-miR-181a-2-3p-mature', 'hsa-miR-195-5p-mature',
'hsa-miR-199b-5p','hsa-miR-182-5p', 'hsa-miR-146b-5p-mature')
ceOutput <- gdcCEAnalysis(lnc = deLNC,
pc = dePC,
lnc.targets = 'starBase',
pc.targets = 'starBase',
rna.expr = rnaExpr,
mir.expr = mirExpr)
```








```{r}
library(survival)
library(survminer)


s <- Surv(metadata$PFI_time, metadata$PFI_Status)

sfit <- survfit(Surv(OS_time, OS_Status)~PFI_Status, data=metadata)
summary(sfit)


ggsurvplot(sfit, conf.int=TRUE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Not-recurrent", "Recurrent"), legend.title="Group",  
           palette=c("dodgerblue2", "orchid2"), 
           title="Kaplan-Meier Curve for Thyroid Cancer Survival", risk.table.height=0.3)
```


```{r}
mod <- coxph(Surv(OS_time, OS_Status)~PFI_Status, data=metadata)
summary(mod)
related <- c()
p_vals <- c()
hazards <- c()
lower_95 <- c()
higher_95 <- c()
RNA_tot_2 <- as.data.frame(t(RNA_tot))
lnc_Expr_3 <- as.data.frame(t(lnc_Expr_2))
lnc_merged <- cbind(lnc_Expr_3, metadata)
RNA_tot_2_merged <- cbind(RNA_tot_2, metadata)

mi_Expr_3 <- as.data.frame(t(mi_Expr_2))
mi_merged <- cbind(mi_Expr_3, metadata)

full_merge <- cbind(lnc_Expr_3, mi_merged)


for (i in 1:110) {
  
  mod2 <- coxph(Surv(PFI_time, PFI_Status)~get(colnames(full_merge)[i]), data=full_merge)
  p_val <- summary(mod2)$coefficients[1,5]
  hazard <- summary(mod2)$coefficients[1,2]
  lower_conf <- summary(mod2)$conf.int[1,3]
  higher_conf <- summary(mod2)$conf.int[1,4]
  #print(p_val)
  if (p_val <= 0.01) {
    related <- c(related, colnames(full_merge)[i])
    p_vals <- c(p_vals, p_val)
    hazards <- c(hazards, hazard)
    lower_95 <- c(lower_95, lower_conf)
    higher_95 <- c(higher_95, higher_conf)
    print(colnames(full_merge)[i])
    print(summary(mod2))
  }
}

library(dplyr)
RNA_tot_2_merged$'Gene1_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000187678, 4)
RNA_tot_2_merged$'Gene2_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000183615, 4)
RNA_tot_2_merged$'Gene3_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000105877, 4)
RNA_tot_2_merged$'Gene4_Quartile' <- ntile(RNA_tot_2_merged$ENSG00000132840, 4)

mi_merged$'Gene1_Quartile' <- ntile(mi_merged$'hsa-miR-181a-2-3p-star', 4)
mi_merged$'Gene2_Quartile' <- ntile(mi_merged$'hsa-miR-146b-3p-mature', 4)

gene_1 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene1_Quartile %in% c(1, 4),]

gene_1$Gene1_Quartile <- sub(4, 2, gene_1$Gene1_Quartile)
sfit_1 <- survfit(Surv(OS_time, OS_Status)~Gene1_Quartile, data=gene_1)
summary(sfit_1)

ggsurvplot(sfit_1, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for SPRY4", risk.table.height=0.3)


gene_2 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene2_Quartile %in% c(1, 4),]

gene_2$Gene2_Quartile <- sub(4, 2, gene_2$Gene2_Quartile)
sfit_2 <- survfit(Surv(PFI_time, PFI_Status)~Gene2_Quartile, data=gene_2)
summary(sfit_2)

ggsurvplot(sfit_2, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for FAM167B", risk.table.height=0.3)

gene_3 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene3_Quartile %in% c(1, 4),]

gene_3$Gene3_Quartile <- sub(4, 2, gene_3$Gene3_Quartile)
sfit_3 <- survfit(Surv(PFI_time, PFI_Status)~Gene3_Quartile, data=gene_3)
summary(sfit_3)

ggsurvplot(sfit_3, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for DNAH11", risk.table.height=0.3)

gene_4 <- RNA_tot_2_merged[RNA_tot_2_merged$Gene4_Quartile %in% c(1, 4),]

gene_4$Gene1_Quartile <- sub(4, 2, gene_4$Gene4_Quartile)
sfit_4 <- survfit(Surv(PFI_time, PFI_Status)~Gene4_Quartile, data=gene_4)
summary(sfit_4)

ggsurvplot(sfit_4, conf.int=FALSE, pval=TRUE, risk.table=TRUE, risk.table.col="strata", 
           legend.labs=c("Low Expression", "High Expression"), legend.title="Group",  
           palette=c("dodgerblue2", "red3"), 
           title="Kaplan-Meier Curve for BHMT2", risk.table.height=0.3)


```


```{r}
res <- data.frame(related, p_vals, hazards, lower_95, higher_95)
res$related <- de_RNA[rownames(de_RNA) %in% res$related,]$symbol
p <- 
  res |>
  ggplot(aes(y = fct_rev(related))) + 
  theme_classic()

p <- p +
  geom_point(aes(x=hazards), shape=15, size=3) +
  geom_linerange(aes(xmin=lower_95, xmax=higher_95))

p <- p +
  geom_vline(xintercept = 1, linetype="dashed") +
  labs(x="Hazard Ratio", y="")

p <- p +
  coord_cartesian(ylim=c(1,13), xlim=c(0, 2))

p <- p +
  annotate("text", x = .3, y = 13, label = "Lower recur chance") +
  annotate("text", x = 1.7, y = 13, label = "Higher recur chance")

p_mid <- p + 
  theme(axis.line.y = element_blank(),
        axis.ticks.y= element_blank(),
        axis.text.y= element_blank(),
        axis.title.y= element_blank())

res_plot <- res |>
  # round estimates and 95% CIs to 2 decimal places for journal specifications
  mutate(across(
    c(hazards, lower_95, higher_95),
    ~ str_pad(
      round(.x, 2),
      width = 4,
      pad = "0",
      side = "right"
    )
  ),
  # add an "-" between HR estimate confidence intervals
  estimate_lab = paste0(hazards, " (", lower_95, "-", higher_95, ")")) |>
  # round p-values to two decimal places, except in cases where p < .001
  mutate(p_vals = case_when(
    p_vals < .001 ~ "<0.001",
    round(p_vals, 2) == .05 ~ as.character(round(p_vals,3)),
    p_vals < .01 ~ str_pad( # if less than .01, go one more decimal place
      as.character(round(p_vals, 3)),
      width = 4,
      pad = "0",
      side = "right"
    ),
    TRUE ~ str_pad( # otherwise just round to 2 decimal places and pad string so that .2 reads as 0.20
      as.character(round(p_vals, 2)),
      width = 4,
      pad = "0",
      side = "right"
    )
  )) |>
  # add a row of data that are actually column names which will be shown on the plot in the next step
  bind_rows(
    data.frame(
      related = "Model",
      estimate_lab = "Hazard Ratio (95% CI)",
      lower_95 = "",
      higher_95 = "",
      p_vals = "p-value"
    )
  ) |>
  mutate(related = fct_rev(fct_relevel(related, "Model")))

p_left <-
  res_plot  |>
  ggplot(aes(y = related))

p_left <- 
  p_left +
  geom_text(aes(x = 0, label = related), hjust = 0, fontface = "bold")

p_left <- 
  p_left +
  geom_text(
    aes(x = 2.5, label = estimate_lab),
    hjust = 0,
    fontface = ifelse(res_plot$estimate_lab == "Hazard Ratio (95% CI)", "bold", "plain")
  )

p_left <-
  p_left +
  theme_void() +
  coord_cartesian(xlim = c(0, 4))

p_right <-
  res_plot  |>
  ggplot() +
  geom_text(
    aes(x = 0, y = related, label = p_vals),
    hjust = 0,
    fontface = ifelse(res_plot$p_vals == "p-value", "bold", "plain")
  ) +
  theme_void()

layout <- c(
  area(t = 0, l = 0, b = 30, r = 3), # left plot, starts at the top of the page (0) and goes 30 units down and 3 units to the right
  area(t = 1, l = 4, b = 30, r = 9), # middle plot starts a little lower (t=1) because there's no title. starts 1 unit right of the left plot (l=4, whereas left plot is r=3), goes to the bottom of the page (30 units), and 6 units further over from the left plot (r=9 whereas left plot is r=3)
  area(t = 0, l = 9, b = 30, r = 11) # right most plot starts at top of page, begins where middle plot ends (l=9, and middle plot is r=9), goes to bottom of page (b=30), and extends two units wide (r=11)
)
# final plot arrangement
p_left + p_mid + p_right + plot_layout(design = layout)

```

