Emily_Coty_Final

FINAL PROJECT PROGRESS REPORT: EVALUATING EXPRESSION PROFILES OF KEY INFLAMMATORY GENES IN THYROID CANCER

For this project, two key sources of data were extracted from publicly available research literature on thyroid cancer microarray assays. Microarrays provide an opportunity for scientists to scan a list of pre-selected target genes from an organism’s genome in order to generate a panel of relative gene expressions (in arbitrary units), doing so by using an instrument capable of detecting relative intensity levels of fluorescent signals emitted by specific molecules. Unlike similar assays such as real time quantitative Polymerase Chain Reactions (RT-qPCR), microarrays allow for the visualization of thousands of DNA and mRNA molecules at once, providing a “big-picture” data set that is essential for analyzing general trends in gene expression.

In the context of thyroid cancers, data was collected on multiple samples (tumors surgically obtained from human subjects). However, given the extensive data collected for thousands of genes across hundreds of samples, the actual biological features of the tumors, such as histological classification, were saved on a separate file labelled as “public annotation.” The files were too big to be exported as excel documents from the original laboratory drive and onto my computer, so they were instead transferred into rds formats. Note that since the clinical data obtained was public and did not contain any identifying information (tumor samples have unique ID codes), permission was granted by laboratory staff to save the files onto my personal computer for the project.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Human_thyroid_arrays_public_trans_9_27_17 <- readRDS("R_project_IRF7_IL18_TMEM173_04282026.rds")

public_annotation <- readRDS("public_annotation.rds")

The first goal of the project was to extract the data from these two files, isolate the observations pertaining only to the three genes of interest (IRF7, IL18, and STING), cleaning up the data, then merging them into a single data frame using tidyverse. Note that in this data set, STING is being referred to as its alternate name, “TMEM173”.

Data Cleaning and Joining:

thyroid_df_filtered <- Human_thyroid_arrays_public_trans_9_27_17 %>%
      filter(Gene.Symbol %in% c("IRF7", "IL18", "TMEM173"))

thyroid_long <- thyroid_df_filtered %>% 
      pivot_longer(
        cols = starts_with("GSM"),
        names_to = "Sample_ID",
        values_to = "Expression"
      )

thyroid_long_clean <- thyroid_long %>% 
  mutate(Sample_ID = str_extract(Sample_ID, "GSM[0-9]+"))


thyroid_long_unique <- thyroid_long_clean %>%
  group_by(Sample_ID, Gene.Symbol) %>%
  summarize(Expression = mean(Expression, na.rm = TRUE), .groups = 'drop')


colnames(public_annotation) <- make.names(colnames(public_annotation), 
                                          unique = TRUE)

colnames(public_annotation)[12:24] <- c("GSE_Number", "Sample_ID", "Sample_Name", 
                                        "Histology", "Sample_type", "Driver_summary", 
                                        "Tumor_purity", "X12_Index", "Unknown_Status", 
                                        "BRAF", "RAS", "RET_PTC", "Other_Mutation")

public_clean <- public_annotation %>% 
  filter(if_any(everything(), ~ str_detect(., "GSM")))


thyroid_master <- thyroid_long_unique %>% 
  left_join(public_clean, by = "Sample_ID")

summary(thyroid_master)

##   Sample_ID         Gene.Symbol.x        Expression     Gene.Symbol.y     
##  Length:648         Length:648         Min.   : 3.391   Length:648        
##  Class :character   Class :character   1st Qu.: 6.077   Class :character  
##  Mode  :character   Mode  :character   Median : 6.893   Mode  :character  
##                                        Mean   : 6.794                     
##                                        3rd Qu.: 7.694                     
##                                        Max.   :10.037                     
##     PECAM1            PECAM1.1           PECAM1.2           PECAM1.3        
##  Length:648         Length:648         Length:648         Length:648        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    PECAM1.4            X             X.1            X.2            X.3         
##  Length:648         Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  Class :character   NA's:648       NA's:648       NA's:648       NA's:648      
##  Mode  :character                                                              
##                                                                                
##                                                                                
##                                                                                
##    X.4           GSE_Number        Sample_Name         Histology        
##  Mode:logical   Length:648         Length:648         Length:648        
##  NA's:648       Class :character   Class :character   Class :character  
##                 Mode  :character   Mode  :character   Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##  Sample_type        Driver_summary     Tumor_purity         X12_Index
##  Length:648         Length:648         Length:648         Min.   :1  
##  Class :character   Class :character   Class :character   1st Qu.:1  
##  Mode  :character   Mode  :character   Mode  :character   Median :1  
##                                                           Mean   :1  
##                                                           3rd Qu.:1  
##                                                           Max.   :1  
##  Unknown_Status         BRAF               RAS              RET_PTC         
##  Length:648         Length:648         Length:648         Length:648        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Other_Mutation    
##  Length:648        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Gene expression was then visualized based on histology, as this is the main method of representing how malignant the tumor cells of a sample are.

For the sake of simplicity, histology types can be understood based on how non-cancerous, or “well-differentiated,” the thyroid cells are. The order of cancer aggressiveness, from least to worst, goes as follows: Normal (healthy thyroid), Papillary Thyroid Carcinoma (PTC), Poorly Differentiated Thyroid Carcinoma (PDTC), and Anaplastic thyroid Carcinoma (ATC).

library(ggplot2)

plot_data <- thyroid_master %>%
  filter(Histology %in% c("Normal", "ATC", "PDTC", "PTC")) %>%
  rename(Gene.Symbol = Gene.Symbol.x) %>% 
  mutate(Histology = factor(Histology, levels = c("Normal", "PTC", "PDTC", "ATC")))


sting_data <- plot_data %>% 
  filter(str_detect(Gene.Symbol, "STING|TMEM173"))

sting_plot <- ggplot(sting_data, aes(x = Histology, y = Expression, fill = Histology)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1.5) +
  theme_minimal() +
  scale_fill_manual(values = c("Normal" = "#4DAF4A", "PTC" = "#377EB8", "PDTC" = "#984EA3", "ATC" = "#E41A1C")) +
  labs(title = "STING (TMEM173) Expression", x = NULL, y = "Log2 Intensity") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")

print(sting_plot)

irf7_data <- plot_data %>% 
  filter(Gene.Symbol == "IRF7")

irf7_plot <- ggplot(irf7_data, aes(x = Histology, y = Expression, fill = Histology)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1.5) +
  theme_minimal() +
  scale_fill_manual(values = c("Normal" = "#4DAF4A", "PTC" = "#377EB8", "PDTC" = "#984EA3", "ATC" = "#E41A1C")) +
  labs(title = "IRF7 Expression", x = NULL, y = "Log2 Intensity") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")

print(irf7_plot)

il18_data <- plot_data %>% 
  filter(Gene.Symbol == "IL18")

il18_plot <- ggplot(il18_data, aes(x = Histology, y = Expression, fill = Histology)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1.5) +
  theme_minimal() +
  scale_fill_manual(values = c("Normal" = "#4DAF4A", "PTC" = "#377EB8", "PDTC" = "#984EA3", "ATC" = "#E41A1C")) +
  labs(title = "IL18 Expression", x = "Histological Classification", y = "Log2 Intensity") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")

print(il18_plot)

Statistical Analysis of Microarray Data: Welch’s ANOVA test. A statistically appropriate way of measuring statistically significant differences between histological groups via calculation of p values for each gene. This is preferred over the standard t-test due to its ability to account for more than two groups being compared, and Welch’s test specifically is capable of doing calculations across groups of different sample sizes. Any p value output less than 0.05 is considered statistically significant, decreasing the chances that the differences observed are due to random chance and are instead biologically relevant.

irf7_anova <- oneway.test(Expression ~ Histology, 
                          data = filter(plot_data, Gene.Symbol == "IRF7"), 
                          var.equal = FALSE)


sting_anova <- oneway.test(Expression ~ Histology, 
                           data = filter(plot_data, str_detect(Gene.Symbol, "STING|TMEM173")), 
                           var.equal = FALSE)


il18_anova <- oneway.test(Expression ~ Histology, 
                          data = filter(plot_data, Gene.Symbol == "IL18"), 
                          var.equal = FALSE)

cat(
  paste("IRF7 p-value: ", irf7_anova$p.value),
  paste("STING p-value:", sting_anova$p.value),
  paste("IL18 p-value: ", il18_anova$p.value),
  sep = "\n"
)

## IRF7 p-value:  4.78435014859038e-06
## STING p-value: 4.29440250760277e-19
## IL18 p-value:  3.93989157719563e-08

Further readjustments had to be made following the ggplot analysis. The data trends observed for STING show a significantly higher level of gene expression for the benign PTC tumors compared to the ATC cancers known for having higher inflammatory signatures. Therefore, upon further analysis of the original microarray data, it was observed that the samples had a heterogeneous distribution of genetic mutations driving each cancer. Over 60% of thyroid cancers are driven specifically by the BRAF V600E mutation, and the predominant literature available on the activity of inflammatory pathways in thyroid cancer are based on it. In order to reflect a more accurate representation of trends in gene expression of BRAF V600E-driven cancers, a subgroup analysis was performed using the “driver summary” metadata, resulting in the visualizations below:

braf_only_data <- plot_data %>%
  dplyr::filter(Driver_summary == "BRAF_V600E")

braf_plot <- ggplot(braf_only_data, aes(x = Histology, y = Expression, fill = Histology)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 1.2) +
  facet_wrap(~Gene.Symbol, scales = "free_y") +
  theme_minimal() +
  # Custom colors: Blue for PTC, Purple for PDTC, Red for ATC
  scale_fill_manual(values = c("PTC" = "#3498db", "PDTC" = "#9b59b6", "ATC" = "#e74c3c")) +
  labs(
    title = "Expression Analysis: BRAF_V600E Mutants Only",
    subtitle = "Standardizing by driver mutation to isolate histological impact",
    x = "Histology",
    y = "Log2 Expression (Normalized)"
  ) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

print(braf_plot)

Readjusted Welch’s ANOVA test:

irf7_braf_anova <- oneway.test(Expression ~ Histology, 
                          data = filter(braf_only_data, Gene.Symbol == "IRF7"), 
                          var.equal = FALSE)


sting_braf_anova <- oneway.test(Expression ~ Histology, 
                           data = filter(braf_only_data, str_detect(Gene.Symbol, "STING|TMEM173")), 
                           var.equal = FALSE)


il18_braf_anova <- oneway.test(Expression ~ Histology, 
                          data = filter(braf_only_data, Gene.Symbol == "IL18"), 
                          var.equal = FALSE)

cat(
  paste("IRF7_BRAF p-value: ", irf7_braf_anova$p.value),
  paste("STING_BRAF p-value:", sting_braf_anova$p.value),
  paste("IL18_BRAF p-value: ", il18_braf_anova$p.value),
  sep = "\n"
)

## IRF7_BRAF p-value:  0.0266210630837767
## STING_BRAF p-value: 0.487752065273772
## IL18_BRAF p-value:  0.108941338525097

Further verification of data based on number of samples missing a driver mutation category:

driver_stats <- plot_data %>%
  summarise(
    count_missing = sum(Driver_summary == "" | Driver_summary == "ND" | is.na(Driver_summary)),
    total_rows = n(),
    proportion_missing = count_missing / total_rows,
    percentage_missing = proportion_missing * 100
  )

print(driver_stats)

## # A tibble: 1 × 4
##   count_missing total_rows proportion_missing percentage_missing
##           <int>      <int>              <dbl>              <dbl>
## 1           357        648              0.551               55.1

Conclusion/ Future Steps:
When the groups are controlled for the BRAF V600E mutation, the differences in gene expression for STING and IL18 become statistically insignificant (p>0.05). However, upon further verification of the data, it was determined that over 55% of all samples have either blank or “not determined” mutation specification, as seen above. This suggests that a significant amount of data has been excluded from the “BRAF V600E-only” data plot, and therefore is not reliable. As a result, the project will return its focus to analyzing gene expression patters based on histology.

Emily_Coty_Final_project

Emily Coty

2026-05-02