FINAL PROJECT PROGRESS REPORT: EVALUATING EXPRESSION PROFILES OF KEY INFLAMMATORY GENES IN THYROID CANCER
For this project, two key sources of data were extracted from publicly available research literature on thyroid cancer microarray assays. Microarrays provide an opportunity for scientists to scan a list of pre-selected target genes from an organism’s genome in order to generate a panel of relative gene expressions (in arbitrary units), doing so by using an instrument capable of detecting relative intensity levels of fluorescent signals emitted by specific molecules. Unlike similar assays such as real time quantitative Polymerase Chain Reactions (RT-qPCR), microarrays allow for the visualization of thousands of DNA and mRNA molecules at once, providing a “big-picture” data set that is essential for analyzing general trends in gene expression.
In the context of thyroid cancers, data was collected on multiple samples (tumors surgically obtained from human subjects). However, given the extensive data collected for thousands of genes across hundreds of samples, the actual biological features of the tumors, such as histological classification, were saved on a separate file labelled as “public annotation.” The files were too big to be exported as excel documents from the original laboratory drive and onto my computer, so they were instead transferred into rds formats. Note that since the clinical data obtained was public and did not contain any identifying information (tumor samples have unique ID codes), permission was granted by laboratory staff to save the files onto my personal computer for the project.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Human_thyroid_arrays_public_trans_9_27_17 <- readRDS("R_project_IRF7_IL18_TMEM173_04282026.rds")
public_annotation <- readRDS("public_annotation.rds")
The first goal of the project was to extract the data from these two files, isolate the observations pertaining only to the three genes of interest (IRF7, IL18, and STING), cleaning up the data, then merging them into a single data frame using tidyverse. Note that in this data set, STING is being referred to as its alternate name, “TMEM173”.
Data Cleaning and Joining:
thyroid_df_filtered <- Human_thyroid_arrays_public_trans_9_27_17 %>%
filter(Gene.Symbol %in% c("IRF7", "IL18", "TMEM173"))
thyroid_long <- thyroid_df_filtered %>%
pivot_longer(
cols = starts_with("GSM"),
names_to = "Sample_ID",
values_to = "Expression"
)
thyroid_long_clean <- thyroid_long %>%
mutate(Sample_ID = str_extract(Sample_ID, "GSM[0-9]+"))
thyroid_long_unique <- thyroid_long_clean %>%
group_by(Sample_ID, Gene.Symbol) %>%
summarize(Expression = mean(Expression, na.rm = TRUE), .groups = 'drop')
colnames(public_annotation) <- make.names(colnames(public_annotation),
unique = TRUE)
colnames(public_annotation)[12:24] <- c("GSE_Number", "Sample_ID", "Sample_Name",
"Histology", "Sample_type", "Driver_summary",
"Tumor_purity", "X12_Index", "Unknown_Status",
"BRAF", "RAS", "RET_PTC", "Other_Mutation")
public_clean <- public_annotation %>%
filter(if_any(everything(), ~ str_detect(., "GSM")))
thyroid_master <- thyroid_long_unique %>%
left_join(public_clean, by = "Sample_ID")
summary(thyroid_master)
## Sample_ID Gene.Symbol.x Expression Gene.Symbol.y
## Length:648 Length:648 Min. : 3.391 Length:648
## Class :character Class :character 1st Qu.: 6.077 Class :character
## Mode :character Mode :character Median : 6.893 Mode :character
## Mean : 6.794
## 3rd Qu.: 7.694
## Max. :10.037
## PECAM1 PECAM1.1 PECAM1.2 PECAM1.3
## Length:648 Length:648 Length:648 Length:648
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PECAM1.4 X X.1 X.2 X.3
## Length:648 Mode:logical Mode:logical Mode:logical Mode:logical
## Class :character NA's:648 NA's:648 NA's:648 NA's:648
## Mode :character
##
##
##
## X.4 GSE_Number Sample_Name Histology
## Mode:logical Length:648 Length:648 Length:648
## NA's:648 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Sample_type Driver_summary Tumor_purity X12_Index
## Length:648 Length:648 Length:648 Min. :1
## Class :character Class :character Class :character 1st Qu.:1
## Mode :character Mode :character Mode :character Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
## Unknown_Status BRAF RAS RET_PTC
## Length:648 Length:648 Length:648 Length:648
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Other_Mutation
## Length:648
## Class :character
## Mode :character
##
##
##
Gene expression was then visualized based on histology, as this is the main method of representing how malignant the tumor cells of a sample are.
For the sake of simplicity, histology types can be understood based on how non-cancerous, or “well-differentiated,” the thyroid cells are. The order of cancer aggressiveness, from least to worst, goes as follows: Normal (healthy thyroid), Papillary Thyroid Carcinoma (PTC), Poorly Differentiated Thyroid Carcinoma (PDTC), and Anaplastic thyroid Carcinoma (ATC).
library(ggplot2)
plot_data <- thyroid_master %>%
filter(Histology %in% c("Normal", "ATC", "PDTC", "PTC")) %>%
rename(Gene.Symbol = Gene.Symbol.x) %>%
mutate(Histology = factor(Histology, levels = c("Normal", "PTC", "PDTC", "ATC")))
sting_data <- plot_data %>%
filter(str_detect(Gene.Symbol, "STING|TMEM173"))
sting_plot <- ggplot(sting_data, aes(x = Histology, y = Expression, fill = Histology)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3, size = 1.5) +
theme_minimal() +
scale_fill_manual(values = c("Normal" = "#4DAF4A", "PTC" = "#377EB8", "PDTC" = "#984EA3", "ATC" = "#E41A1C")) +
labs(title = "STING (TMEM173) Expression", x = NULL, y = "Log2 Intensity") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")
print(sting_plot)
irf7_data <- plot_data %>%
filter(Gene.Symbol == "IRF7")
irf7_plot <- ggplot(irf7_data, aes(x = Histology, y = Expression, fill = Histology)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3, size = 1.5) +
theme_minimal() +
scale_fill_manual(values = c("Normal" = "#4DAF4A", "PTC" = "#377EB8", "PDTC" = "#984EA3", "ATC" = "#E41A1C")) +
labs(title = "IRF7 Expression", x = NULL, y = "Log2 Intensity") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")
print(irf7_plot)
il18_data <- plot_data %>%
filter(Gene.Symbol == "IL18")
il18_plot <- ggplot(il18_data, aes(x = Histology, y = Expression, fill = Histology)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3, size = 1.5) +
theme_minimal() +
scale_fill_manual(values = c("Normal" = "#4DAF4A", "PTC" = "#377EB8", "PDTC" = "#984EA3", "ATC" = "#E41A1C")) +
labs(title = "IL18 Expression", x = "Histological Classification", y = "Log2 Intensity") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")
print(il18_plot)
Statistical Analysis of Microarray Data: Welch’s ANOVA test. A statistically appropriate way of measuring statistically significant differences between histological groups via calculation of p values for each gene. This is preferred over the standard t-test due to its ability to account for more than two groups being compared, and Welch’s test specifically is capable of doing calculations across groups of different sample sizes. Any p value output less than 0.05 is considered statistically significant, decreasing the chances that the differences observed are due to random chance and are instead biologically relevant.
irf7_anova <- oneway.test(Expression ~ Histology,
data = filter(plot_data, Gene.Symbol == "IRF7"),
var.equal = FALSE)
sting_anova <- oneway.test(Expression ~ Histology,
data = filter(plot_data, str_detect(Gene.Symbol, "STING|TMEM173")),
var.equal = FALSE)
il18_anova <- oneway.test(Expression ~ Histology,
data = filter(plot_data, Gene.Symbol == "IL18"),
var.equal = FALSE)
cat(
paste("IRF7 p-value: ", irf7_anova$p.value),
paste("STING p-value:", sting_anova$p.value),
paste("IL18 p-value: ", il18_anova$p.value),
sep = "\n"
)
## IRF7 p-value: 4.78435014859038e-06
## STING p-value: 4.29440250760277e-19
## IL18 p-value: 3.93989157719563e-08
Further readjustments had to be made following the ggplot analysis. The data trends observed for STING show a significantly higher level of gene expression for the benign PTC tumors compared to the ATC cancers known for having higher inflammatory signatures. Therefore, upon further analysis of the original microarray data, it was observed that the samples had a heterogeneous distribution of genetic mutations driving each cancer. Over 60% of thyroid cancers are driven specifically by the BRAF V600E mutation, and the predominant literature available on the activity of inflammatory pathways in thyroid cancer are based on it. In order to reflect a more accurate representation of trends in gene expression of BRAF V600E-driven cancers, a subgroup analysis was performed using the “driver summary” metadata, resulting in the visualizations below:
braf_only_data <- plot_data %>%
dplyr::filter(Driver_summary == "BRAF_V600E")
braf_plot <- ggplot(braf_only_data, aes(x = Histology, y = Expression, fill = Histology)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.4, size = 1.2) +
facet_wrap(~Gene.Symbol, scales = "free_y") +
theme_minimal() +
# Custom colors: Blue for PTC, Purple for PDTC, Red for ATC
scale_fill_manual(values = c("PTC" = "#3498db", "PDTC" = "#9b59b6", "ATC" = "#e74c3c")) +
labs(
title = "Expression Analysis: BRAF_V600E Mutants Only",
subtitle = "Standardizing by driver mutation to isolate histological impact",
x = "Histology",
y = "Log2 Expression (Normalized)"
) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
print(braf_plot)
Readjusted Welch’s ANOVA test:
irf7_braf_anova <- oneway.test(Expression ~ Histology,
data = filter(braf_only_data, Gene.Symbol == "IRF7"),
var.equal = FALSE)
sting_braf_anova <- oneway.test(Expression ~ Histology,
data = filter(braf_only_data, str_detect(Gene.Symbol, "STING|TMEM173")),
var.equal = FALSE)
il18_braf_anova <- oneway.test(Expression ~ Histology,
data = filter(braf_only_data, Gene.Symbol == "IL18"),
var.equal = FALSE)
cat(
paste("IRF7_BRAF p-value: ", irf7_braf_anova$p.value),
paste("STING_BRAF p-value:", sting_braf_anova$p.value),
paste("IL18_BRAF p-value: ", il18_braf_anova$p.value),
sep = "\n"
)
## IRF7_BRAF p-value: 0.0266210630837767
## STING_BRAF p-value: 0.487752065273772
## IL18_BRAF p-value: 0.108941338525097
Further verification of data based on number of samples missing a driver mutation category:
driver_stats <- plot_data %>%
summarise(
count_missing = sum(Driver_summary == "" | Driver_summary == "ND" | is.na(Driver_summary)),
total_rows = n(),
proportion_missing = count_missing / total_rows,
percentage_missing = proportion_missing * 100
)
print(driver_stats)
## # A tibble: 1 × 4
## count_missing total_rows proportion_missing percentage_missing
## <int> <int> <dbl> <dbl>
## 1 357 648 0.551 55.1
Conclusion/ Future Steps:
When the groups are controlled for the BRAF V600E mutation, the
differences in gene expression for STING and IL18 become statistically
insignificant (p>0.05). However, upon further verification of the
data, it was determined that over 55% of all samples have either blank
or “not determined” mutation specification, as seen above. This suggests
that a significant amount of data has been excluded from the “BRAF
V600E-only” data plot, and therefore is not reliable. As a result, the
project will return its focus to analyzing gene expression patters based
on histology.