Thyroid Cancer Recurrence

Introduction

The thyroid is a small, butterfly-shaped gland in the neck that regulates energy, metabolism, heartbeat, mood, body temperature, and how fast the body works. It responds to signals from the brain by releasing hormones that affect many organs. A person can live without a thyroid, but must then take hormone replacement for life.

Thyroid Cancer

Thyroid cancer occurs when abnormal cells grow in the thyroid gland. These cells multiply uncontrollably and can form a lump called a tumor, and in some cases they can spread to other parts of the body (metastasis). Although treatment is usually successful, especially when the cancer is detected early, there is still a chance that the cancer may return.

Most patients receive radioactive iodine (RAI) treatment, which is often used after surgery to destroy any leftover cancer cells.

Several factors may increase the risk of developing thyroid cancer, including low or high iodine levels, female hormones, older age, family history, radiation exposure, and long-standing thyroid problems such as goiter or thyroiditis.

Thyroid Cancer Image

Symptoms

  • A painless lump in the front of the neck

  • Swelling in the neck

  • Hoarse voice or voice changes

  • Difficulty swallowing

  • Difficulty breathing

  • Neck or throat pain

  • Persistent cough not caused by a cold

About the Dataset

This dataset contains information about 383 patients who received RAI therapy and tracks whether their cancer recurred. By analyzing this data, we can identify patterns and factors that may influence recurrence. The data is valuable for predicting cancer recurrence, understanding risk factors, and evaluating treatment outcomes.

Column Name Explanations

  1. Age- How old the patient is, in years.

  2. Gender- Whether the patient is male or female.

  3. Hx Radiotherapy- Whether the patient had any prior radiation treatment before this therapy.

  4. Adenopathy- Whether cancer has spread to nearby lymph nodes. “Yes” means lymph nodes are affected, “No” means they are not.

  5. Pathology- The type of thyroid cancer. Example: micropapillary is a common form of thyroid cancer.

  6. Focality- Whether the tumor is in one spot (Uni-Focal) or multiple spots (Multi-Focal) in the thyroid.

  7. Risk- Overall risk level of cancer based on tumor size, spread, and other factors. Classified as Low, Intermediate, or High.

  8. T (Tumor)- Size and extent of the main tumor. T1 is small, T4 is large or invading nearby tissues.

  9. N (Nodes)- Spread to lymph nodes. N0 means no lymph nodes affected, N1 means some are.

  10. M (Metastasis)- Whether cancer has spread to other parts of the body. M0 means no, M1 means yes.

  11. Stage- Overall cancer stage, combining T, N, and M. Stage I is early, Stage IV is advanced.

  12. Response- How well the patient responded to treatment: Excellent, Indeterminate, Biochemical Incomplete, or Structural Incomplete.

  13. Recurred- Whether the cancer came back after treatment. “Yes” means recurrence, “No” means it did not.

AIM

To analyze factors affecting thyroid cancer recurrence after RAI therapy and develop insights that can help predict which patients are at higher risk of recurrence.

QUESTIONS

  • Are thyroid cancer recurrences more common in men or women?

  • How does age affect recurrence risk?

  • Does clinical risk level influence whether thyroid cancer comes back?

  • What is the relationship between treatment response and recurrence?

  • Can we predict recurrence based on tumor staging and treatment response?

Loading Necessary libraries

library(readr)
library(ggplot2)
library(plotly)
library(dplyr)

library(visdat) # for visualizing missing values

library(caret) # for splitting data and model evaluation

library(randomForest) # for random forest models

library(pROC)   # for ROC curves and AUC
library(rpart)
library(rpart.plot)
library(car)

Importing data

Thyriod<-read.csv("~/GLADYS FOLDER .R/Thyroid_Diff.csv",stringsAsFactors = FALSE)
# View(Thyriod)

EXPLORATORY DATA ANALYSIS

# Check the first six rows of the data
head(Thyriod)
##   Age Gender Smoking Hx.Smoking Hx.Radiothreapy Thyroid.Function
## 1  27      F      No         No              No        Euthyroid
## 2  34      F      No        Yes              No        Euthyroid
## 3  30      F      No         No              No        Euthyroid
## 4  62      F      No         No              No        Euthyroid
## 5  62      F      No         No              No        Euthyroid
## 6  52      M     Yes         No              No        Euthyroid
##          Physical.Examination Adenopathy      Pathology    Focality Risk   T  N
## 1  Single nodular goiter-left         No Micropapillary   Uni-Focal  Low T1a N0
## 2         Multinodular goiter         No Micropapillary   Uni-Focal  Low T1a N0
## 3 Single nodular goiter-right         No Micropapillary   Uni-Focal  Low T1a N0
## 4 Single nodular goiter-right         No Micropapillary   Uni-Focal  Low T1a N0
## 5         Multinodular goiter         No Micropapillary Multi-Focal  Low T1a N0
## 6         Multinodular goiter         No Micropapillary Multi-Focal  Low T1a N0
##    M Stage      Response Recurred
## 1 M0     I Indeterminate       No
## 2 M0     I     Excellent       No
## 3 M0     I     Excellent       No
## 4 M0     I     Excellent       No
## 5 M0     I     Excellent       No
## 6 M0     I Indeterminate       No
#Structure of the data set
str(Thyriod)
## 'data.frame':    383 obs. of  17 variables:
##  $ Age                 : int  27 34 30 62 62 52 41 46 51 40 ...
##  $ Gender              : chr  "F" "F" "F" "F" ...
##  $ Smoking             : chr  "No" "No" "No" "No" ...
##  $ Hx.Smoking          : chr  "No" "Yes" "No" "No" ...
##  $ Hx.Radiothreapy     : chr  "No" "No" "No" "No" ...
##  $ Thyroid.Function    : chr  "Euthyroid" "Euthyroid" "Euthyroid" "Euthyroid" ...
##  $ Physical.Examination: chr  "Single nodular goiter-left" "Multinodular goiter" "Single nodular goiter-right" "Single nodular goiter-right" ...
##  $ Adenopathy          : chr  "No" "No" "No" "No" ...
##  $ Pathology           : chr  "Micropapillary" "Micropapillary" "Micropapillary" "Micropapillary" ...
##  $ Focality            : chr  "Uni-Focal" "Uni-Focal" "Uni-Focal" "Uni-Focal" ...
##  $ Risk                : chr  "Low" "Low" "Low" "Low" ...
##  $ T                   : chr  "T1a" "T1a" "T1a" "T1a" ...
##  $ N                   : chr  "N0" "N0" "N0" "N0" ...
##  $ M                   : chr  "M0" "M0" "M0" "M0" ...
##  $ Stage               : chr  "I" "I" "I" "I" ...
##  $ Response            : chr  "Indeterminate" "Excellent" "Excellent" "Excellent" ...
##  $ Recurred            : chr  "No" "No" "No" "No" ...
# checking the summary of the data
summary(Thyriod)
##       Age           Gender            Smoking           Hx.Smoking       
##  Min.   :15.00   Length:383         Length:383         Length:383        
##  1st Qu.:29.00   Class :character   Class :character   Class :character  
##  Median :37.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.87                                                           
##  3rd Qu.:51.00                                                           
##  Max.   :82.00                                                           
##  Hx.Radiothreapy    Thyroid.Function   Physical.Examination  Adenopathy       
##  Length:383         Length:383         Length:383           Length:383        
##  Class :character   Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character   Mode  :character     Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##   Pathology           Focality             Risk                T            
##  Length:383         Length:383         Length:383         Length:383        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       N                  M                Stage             Response        
##  Length:383         Length:383         Length:383         Length:383        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Recurred        
##  Length:383        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
#checking missing values
colSums(is.na(Thyriod))
##                  Age               Gender              Smoking 
##                    0                    0                    0 
##           Hx.Smoking      Hx.Radiothreapy     Thyroid.Function 
##                    0                    0                    0 
## Physical.Examination           Adenopathy            Pathology 
##                    0                    0                    0 
##             Focality                 Risk                    T 
##                    0                    0                    0 
##                    N                    M                Stage 
##                    0                    0                    0 
##             Response             Recurred 
##                    0                    0

There were no missing values in any column, and every variable except Age is stored as character.
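
One caveat worth checking: `is.na()` only counts true `NA` values, so blank strings in character columns would slip through. The sketch below uses a small made-up data frame (the `toy` object and its values are hypothetical, not taken from the dataset) to show one way of counting both.

```r
# Hypothetical toy data: one true NA and one blank string
toy <- data.frame(
  Age    = c(27, NA, 52),
  Gender = c("F", "", "M"),
  stringsAsFactors = FALSE
)

# Count true NA values per column (the same check used above)
na_counts <- colSums(is.na(toy))

# Count blank or whitespace-only strings in character columns
blank_counts <- sapply(toy, function(col) {
  if (!is.character(col)) return(0L)
  sum(!is.na(col) & trimws(col) == "")
})

na_counts
blank_counts
```

Applying the same two checks to `Thyriod` would confirm whether the character columns are complete in both senses.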

Checking for the uniqueness of each observation

# See the levels (categories) that exist in each variable
unique(Thyriod$T)
## [1] "T1a" "T1b" "T2"  "T3a" "T3b" "T4a" "T4b"
unique(Thyriod$Focality)
## [1] "Uni-Focal"   "Multi-Focal"
unique(Thyriod$Risk)
## [1] "Low"          "Intermediate" "High"
unique(Thyriod$Response)
## [1] "Indeterminate"          "Excellent"              "Structural Incomplete" 
## [4] "Biochemical Incomplete"
unique(Thyriod$Recurred)
## [1] "No"  "Yes"
unique(Thyriod$Stage)
## [1] "I"   "II"  "IVB" "III" "IVA"
unique(Thyriod$Pathology)
## [1] "Micropapillary" "Papillary"      "Follicular"     "Hurthel cell"

The count of each unique observation

# Check how many patients fall into each category
table(Thyriod$T)
## 
## T1a T1b  T2 T3a T3b T4a T4b 
##  49  43 151  96  16  20   8
table(Thyriod$Focality)
## 
## Multi-Focal   Uni-Focal 
##         136         247
table(Thyriod$Risk)
## 
##         High Intermediate          Low 
##           32          102          249
table(Thyriod$Response)
## 
## Biochemical Incomplete              Excellent          Indeterminate 
##                     23                    208                     61 
##  Structural Incomplete 
##                     91
table(Thyriod$Stage)
## 
##   I  II III IVA IVB 
## 333  32   4   3  11
table(Thyriod$Recurred)
## 
##  No Yes 
## 275 108

Renaming the column names for better understanding

Thyriod<-Thyriod %>%
  rename(
    Tumor= T,
    LymphNodes=N,
    Metastasis=M,
    TreatmentResponse=Response ,
    Recurrence=Recurred
  )
head(Thyriod)
##   Age Gender Smoking Hx.Smoking Hx.Radiothreapy Thyroid.Function
## 1  27      F      No         No              No        Euthyroid
## 2  34      F      No        Yes              No        Euthyroid
## 3  30      F      No         No              No        Euthyroid
## 4  62      F      No         No              No        Euthyroid
## 5  62      F      No         No              No        Euthyroid
## 6  52      M     Yes         No              No        Euthyroid
##          Physical.Examination Adenopathy      Pathology    Focality Risk Tumor
## 1  Single nodular goiter-left         No Micropapillary   Uni-Focal  Low   T1a
## 2         Multinodular goiter         No Micropapillary   Uni-Focal  Low   T1a
## 3 Single nodular goiter-right         No Micropapillary   Uni-Focal  Low   T1a
## 4 Single nodular goiter-right         No Micropapillary   Uni-Focal  Low   T1a
## 5         Multinodular goiter         No Micropapillary Multi-Focal  Low   T1a
## 6         Multinodular goiter         No Micropapillary Multi-Focal  Low   T1a
##   LymphNodes Metastasis Stage TreatmentResponse Recurrence
## 1         N0         M0     I     Indeterminate         No
## 2         N0         M0     I         Excellent         No
## 3         N0         M0     I         Excellent         No
## 4         N0         M0     I         Excellent         No
## 5         N0         M0     I         Excellent         No
## 6         N0         M0     I     Indeterminate         No
#saving my restructured data

write.csv(Thyriod, "Thyroid_cleaned.csv", row.names = FALSE)
Thyriod_cleaned<-read.csv("Thyroid_cleaned.csv")
str(Thyriod_cleaned)
## 'data.frame':    383 obs. of  17 variables:
##  $ Age                 : int  27 34 30 62 62 52 41 46 51 40 ...
##  $ Gender              : chr  "F" "F" "F" "F" ...
##  $ Smoking             : chr  "No" "No" "No" "No" ...
##  $ Hx.Smoking          : chr  "No" "Yes" "No" "No" ...
##  $ Hx.Radiothreapy     : chr  "No" "No" "No" "No" ...
##  $ Thyroid.Function    : chr  "Euthyroid" "Euthyroid" "Euthyroid" "Euthyroid" ...
##  $ Physical.Examination: chr  "Single nodular goiter-left" "Multinodular goiter" "Single nodular goiter-right" "Single nodular goiter-right" ...
##  $ Adenopathy          : chr  "No" "No" "No" "No" ...
##  $ Pathology           : chr  "Micropapillary" "Micropapillary" "Micropapillary" "Micropapillary" ...
##  $ Focality            : chr  "Uni-Focal" "Uni-Focal" "Uni-Focal" "Uni-Focal" ...
##  $ Risk                : chr  "Low" "Low" "Low" "Low" ...
##  $ Tumor               : chr  "T1a" "T1a" "T1a" "T1a" ...
##  $ LymphNodes          : chr  "N0" "N0" "N0" "N0" ...
##  $ Metastasis          : chr  "M0" "M0" "M0" "M0" ...
##  $ Stage               : chr  "I" "I" "I" "I" ...
##  $ TreatmentResponse   : chr  "Indeterminate" "Excellent" "Excellent" "Excellent" ...
##  $ Recurrence          : chr  "No" "No" "No" "No" ...

Some column names were abbreviated, so I renamed them for clarity using dplyr's rename() function, and I also saved the restructured data.

Grouping Tumor, Stage, and Lymph Nodes Variables

# Group Tumor
Thyriod_cleaned$Tumor_group <- dplyr::case_when(
  Thyriod_cleaned$Tumor %in% c("T1a","T1b") ~ "T1",
  Thyriod_cleaned$Tumor == "T2" ~ "T2",
  Thyriod_cleaned$Tumor %in% c("T3a","T3b") ~ "T3",
  Thyriod_cleaned$Tumor %in% c("T4a","T4b") ~ "T4",
  TRUE ~ NA_character_
)

# Group Stage
Thyriod_cleaned$Stage_group <- dplyr::case_when(
  Thyriod_cleaned$Stage == "I" ~ "I",
  Thyriod_cleaned$Stage == "II" ~ "II",
  Thyriod_cleaned$Stage == "III" ~ "III",
  Thyriod_cleaned$Stage %in% c("IVA","IVB") ~ "IV",
  TRUE ~ NA_character_
)

# Group Lymph Nodes
Thyriod_cleaned$LymphNodes_group <- dplyr::case_when(
  Thyriod_cleaned$LymphNodes == "N0" ~ "N0",
  Thyriod_cleaned$LymphNodes %in% c("N1a","N1b") ~ "N1",
  TRUE ~ NA_character_
)

I grouped tumor size, cancer stage, and lymph node involvement into simpler clinical categories. This reduces complexity, makes the patterns easier to analyze, and helps us compare recurrence rates across meaningful severity levels.
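
As a possible alternative to the `case_when()` calls, the same collapsing can be sketched with a single regular expression that strips the trailing sub-stage letter. The helper name `collapse_substage` is an assumption introduced here for illustration. Unlike the `case_when()` version, unexpected labels pass through unchanged instead of becoming `NA`.

```r
# Sub-stage labels as they appear in the dataset
tumor_levels <- c("T1a", "T1b", "T2", "T3a", "T3b", "T4a", "T4b")
stage_levels <- c("I", "II", "IVB", "III", "IVA")
node_levels  <- c("N0", "N1a", "N1b")

# Hypothetical helper: drop a trailing a/b (tumor, nodes) or A/B (stage)
collapse_substage <- function(x) sub("[abAB]$", "", x)

tumor_group <- collapse_substage(tumor_levels)
stage_group <- collapse_substage(stage_levels)
node_group  <- collapse_substage(node_levels)

tumor_group
stage_group
node_group
```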

Converting selected variables to factors.

# Convert selected variables to factors
Thyriod_cleaned <- Thyriod_cleaned %>%
  mutate(across(c(Stage, Pathology, LymphNodes, TreatmentResponse), as.factor))

# Recurrence as factor
Thyriod_cleaned$Recurrence_fac <- as.factor(Thyriod_cleaned$Recurrence)

# Numeric version of recurrence (Yes = 1, No = 0)
Thyriod_cleaned$Recurrence_num <- ifelse(Thyriod_cleaned$Recurrence == "Yes", 1, 0)

# Set ordered factor levels
Thyriod_cleaned$Stage_group <- factor(Thyriod_cleaned$Stage_group,
                                      levels = c("I","II","III","IV"), ordered = TRUE)

Thyriod_cleaned$Tumor_group <- factor(Thyriod_cleaned$Tumor_group,
                                      levels = c("T1","T2","T3","T4"), ordered = TRUE)

Thyriod_cleaned$LymphNodes_group <- factor(Thyriod_cleaned$LymphNodes_group,
                                           levels = c("N0","N1"), ordered = TRUE)

Thyriod_cleaned$Metastasis <- factor(Thyriod_cleaned$Metastasis,
                                     levels = c("M0","M1"), ordered = TRUE)

# Check structure
str(Thyriod_cleaned)
## 'data.frame':    383 obs. of  22 variables:
##  $ Age                 : int  27 34 30 62 62 52 41 46 51 40 ...
##  $ Gender              : chr  "F" "F" "F" "F" ...
##  $ Smoking             : chr  "No" "No" "No" "No" ...
##  $ Hx.Smoking          : chr  "No" "Yes" "No" "No" ...
##  $ Hx.Radiothreapy     : chr  "No" "No" "No" "No" ...
##  $ Thyroid.Function    : chr  "Euthyroid" "Euthyroid" "Euthyroid" "Euthyroid" ...
##  $ Physical.Examination: chr  "Single nodular goiter-left" "Multinodular goiter" "Single nodular goiter-right" "Single nodular goiter-right" ...
##  $ Adenopathy          : chr  "No" "No" "No" "No" ...
##  $ Pathology           : Factor w/ 4 levels "Follicular","Hurthel cell",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Focality            : chr  "Uni-Focal" "Uni-Focal" "Uni-Focal" "Uni-Focal" ...
##  $ Risk                : chr  "Low" "Low" "Low" "Low" ...
##  $ Tumor               : chr  "T1a" "T1a" "T1a" "T1a" ...
##  $ LymphNodes          : Factor w/ 3 levels "N0","N1a","N1b": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Metastasis          : Ord.factor w/ 2 levels "M0"<"M1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stage               : Factor w/ 5 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TreatmentResponse   : Factor w/ 4 levels "Biochemical Incomplete",..: 3 2 2 2 2 3 2 2 2 2 ...
##  $ Recurrence          : chr  "No" "No" "No" "No" ...
##  $ Tumor_group         : Ord.factor w/ 4 levels "T1"<"T2"<"T3"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stage_group         : Ord.factor w/ 4 levels "I"<"II"<"III"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LymphNodes_group    : Ord.factor w/ 2 levels "N0"<"N1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Recurrence_fac      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Recurrence_num      : num  0 0 0 0 0 0 0 0 0 0 ...

I converted some columns into factors so that R treats them as categories, such as “Stage I/Stage II” or “Excellent/Indeterminate,” rather than as free text. This helps when making charts, tables, and comparisons. I also made a numeric version of the recurrence column (Yes = 1, No = 0) so it can be used in calculations and in prediction models that expect a numeric outcome.
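
To make the difference between the three encodings concrete, here is a minimal sketch on made-up values (the vectors below are hypothetical, not rows from the dataset). It shows an unordered factor, a 0/1 numeric version, and an ordered factor that supports comparisons.

```r
# Hypothetical toy values
recurrence <- c("No", "Yes", "No", "Yes")
stage      <- c("I", "III", "II", "IV")

# Unordered factor: categories with no ranking
recurrence_fac <- factor(recurrence, levels = c("No", "Yes"))

# Numeric 0/1 version: its mean is the proportion of "Yes"
recurrence_num <- ifelse(recurrence == "Yes", 1, 0)
mean(recurrence_num)

# Ordered factor: categories with a ranking, so comparisons are allowed
stage_ord <- factor(stage, levels = c("I", "II", "III", "IV"), ordered = TRUE)
stage_ord >= "III"
```

Ordering matters mainly for variables with a natural severity scale, which is why the grouped stage and tumor variables are ordered while recurrence is not.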

=====================================================================================================

Univariate Analysis

====================================================================================================

STAGE

This is the overall stage of cancer, determined using tumor, lymph nodes, and metastasis together.

stage_counts <- Thyriod_cleaned %>%
  count(Stage_group)

stage_counts
##   Stage_group   n
## 1           I 333
## 2          II  32
## 3         III   4
## 4          IV  14
ggplot(stage_counts, aes(x = Stage_group, y = n)) +
  geom_col(fill = "red3") +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(
    title = "Distribution of Stages",
    x = "Thyroid Cancer Stages",
    y = "Count"
  ) +
  theme_minimal()

Observation

Stage I is by far the most common stage in the dataset (333 of 383 patients, about 87%). This likely reflects the fact that most thyroid cancers are found early: thyroid cancer tends to grow slowly, many people discover it during routine checkups, survival rates are high, and early detection is common, especially in women.
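
The counts can be turned into percentages with `prop.table()`. The sketch below re-enters the stage counts printed above by hand, so it runs without the data file; the object names are assumptions for illustration.

```r
# Stage counts copied from the output above
stage_n <- c(I = 333, II = 32, III = 4, IV = 14)

# Share of patients in each stage, as a percentage
stage_pct <- round(100 * prop.table(stage_n), 1)
stage_pct
```

The same pattern applies to the tumor, lymph node, and metastasis counts in the sections that follow.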

TUMOR

This refers to the size and extent of the tumor in the thyroid.

tumor_count<-Thyriod_cleaned%>%
  count(Tumor_group)
tumor_count
##   Tumor_group   n
## 1          T1  92
## 2          T2 151
## 3          T3 112
## 4          T4  28
ggplot(tumor_count,aes(x=Tumor_group,y=n))+
  geom_col(fill = "red4")+
  geom_text(aes(label=n),vjust=-0.3)+
  
  labs(title="Distribution of Tumor Level",
       x="Tumor Level",
       y="Count")+
  theme_minimal()

Observation

T2 is the most common tumor category in the dataset (151 of 383 patients). A likely explanation is that many thyroid cancers are not caught while they are still tiny (T1), but are found before they have grown dangerously large. Most people do not feel symptoms early, so the cancer quietly grows to a moderate size, which is exactly what T2 represents.

A T2 tumor is still early and treatable, but bigger than the very small T1 tumors. In this dataset, T2 and T3 together account for 263 of the 383 patients, so moderate-sized tumors are the most common.

LYMPHNODES

Lymph nodes are small, bean-shaped glands all over our body that help fight infection and filter harmful substances. They are part of our immune system.

In this dataset, the LymphNodes column tells us whether the thyroid cancer has spread to the nearby lymph nodes. N0 means the lymph nodes are clear and unaffected, while N1 indicates that cancer cells have reached the lymph nodes, which can influence treatment decisions and the risk of recurrence.

lymph<-Thyriod_cleaned%>%
  count(LymphNodes_group)
lymph
##   LymphNodes_group   n
## 1               N0 268
## 2               N1 115
ggplot(lymph,aes(x=LymphNodes_group,y=n))+
  geom_col(fill = "blue3")+
  geom_text(aes(label=n),vjust=-0.3)+
  labs(title="Distribution of Lymph Node Involvement",
       x="Lymphnodes",
       y="Count")+
  theme_minimal()

Observation

N0 is the most common category (268 of 383 patients, about 70%), likely because in most thyroid cancer cases the cancer is detected before it spreads to the lymph nodes. Thyroid cancer often grows slowly, and many patients are diagnosed early through routine checkups or incidental findings. As a result, most patients initially have no lymph node involvement.

METASTASIS

Metastasis means that cancer has spread from its original location to other parts of the body.

In thyroid cancer, metastasis usually happens when cancer cells travel beyond the thyroid and nearby lymph nodes, often reaching the lungs or bones. Patients with metastasis (M1) generally have a higher risk of recurrence.

meta<-Thyriod_cleaned%>%
  count(Metastasis)
meta
##   Metastasis   n
## 1         M0 365
## 2         M1  18
ggplot(meta,aes(x=Metastasis,y=n))+
  geom_col(fill = "green4")+
  geom_text(aes(label=n),vjust=-0.3)+
  labs(title="Distribution of Metastasis",
       x="Metastasis Level",
       y="Count")+
  theme_minimal()

Observation

M0 is by far the most common category (365 of 383 patients, about 95%), likely because in most thyroid cancer cases the cancer has not yet spread to distant organs. Thyroid cancer usually grows slowly, and many patients are diagnosed early, when the disease is still localized to the thyroid and nearby lymph nodes.

=====================================================================================================

Bivariate analysis

=====================================================================================================

  1. Are thyroid cancer recurrences more common in men or women?
ggplot(Thyriod_cleaned, aes(x = Gender, fill = Recurrence)) +
  geom_bar(position = "dodge") +
  geom_text(
    stat = "count",
    aes(label = after_stat(count)),
    position = position_dodge(width = 0.9),  # aligns text with bars
    vjust = -0.5  # puts count just above the bar
  ) +
  labs(
    title = "Recurrence by Gender",
    x = "Gender",
    y = "Count",
    fill = "Recurrence"
  ) +
  theme_minimal()

Females had the highest number of recurrences in the dataset. Because the bar chart shows counts rather than rates, this may partly reflect how many women are in the dataset rather than a higher per-patient risk; comparing the proportion of recurrences within each gender would settle this. Possible biological explanations for recurrence in women include the effect of female hormones such as estrogen on thyroid cell growth, the higher frequency of thyroid problems such as goiter and thyroiditis in women, and the extra stress that pregnancy puts on the thyroid.
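
Counts and rates can point in different directions when group sizes differ. The sketch below uses deliberately made-up numbers (groups "A" and "B" and all counts are hypothetical, not the dataset's gender figures) to show a group with more recurrences in absolute terms but a lower recurrence rate.

```r
# Hypothetical counts, chosen only to illustrate the point
toy <- matrix(
  c(80, 20,    # group A: 20 of 100 recurred
    10, 10),   # group B: 10 of 20 recurred
  nrow = 2, byrow = TRUE,
  dimnames = list(Group = c("A", "B"), Recurrence = c("No", "Yes"))
)

# Raw counts: group A has more recurrences
toy[, "Yes"]

# Row proportions: group B has the higher recurrence rate
rate <- prop.table(toy, 1)[, "Yes"]
rate
```

On the real data, `prop.table(table(Thyriod_cleaned$Gender, Thyriod_cleaned$Recurrence), 1)` would give the corresponding within-gender rates.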

  2. How does age affect recurrence risk?
tapply(Thyriod_cleaned$Age, Thyriod_cleaned$Recurrence, mean, na.rm = TRUE)
##       No      Yes 
## 38.41455 47.11111
tapply(Thyriod_cleaned$Age, Thyriod_cleaned$Recurrence, median, na.rm = TRUE)
##   No  Yes 
## 36.0 44.5
box_plot <- plot_ly(
  data = Thyriod_cleaned,
  x=~Recurrence,
  y = ~Age,
  type = "box",
  color = ~Recurrence
)

box_plot <- layout(
  box_plot,
  title = "Age vs Recurrence",
  xaxis = list(title = "Recurrence"),
  yaxis = list(title = "Age")
)

box_plot

The boxplot shows that patients whose cancer recurred tend to be older. The median age is 44.5 years in the recurrence group, compared with 36 years in the non-recurrence group (the means are about 47.1 and 38.4 years). This is consistent with medical evidence that older age increases the risk of recurrence, possibly because of greater tumor aggressiveness and a weaker immune response in older adults.
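
A difference in medians can be checked formally with a Wilcoxon rank-sum test. The sketch below runs the test on simulated ages (the `age_no` and `age_yes` vectors are hypothetical, loosely centered on the group means reported above), so it illustrates the mechanics only and says nothing about the real p-value.

```r
set.seed(1)

# Simulated ages (hypothetical), not the real patient data
age_no  <- rnorm(100, mean = 38, sd = 12)
age_yes <- rnorm(40,  mean = 47, sd = 12)

# Wilcoxon rank-sum test: do the two groups differ in location?
wt <- wilcox.test(age_yes, age_no)
wt$p.value
```

On the real data, the equivalent call would be `wilcox.test(Age ~ Recurrence, data = Thyriod_cleaned)`.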

  3. What is the relationship between treatment response and recurrence?
# Table of counts
table(Thyriod_cleaned$TreatmentResponse, Thyriod_cleaned$Recurrence)
##                         
##                           No Yes
##   Biochemical Incomplete  12  11
##   Excellent              207   1
##   Indeterminate           54   7
##   Structural Incomplete    2  89
# Proportions
prop.table(table(Thyriod_cleaned$TreatmentResponse, Thyriod_cleaned$Recurrence), 1)
##                         
##                                   No         Yes
##   Biochemical Incomplete 0.521739130 0.478260870
##   Excellent              0.995192308 0.004807692
##   Indeterminate          0.885245902 0.114754098
##   Structural Incomplete  0.021978022 0.978021978
ggplot(Thyriod_cleaned, aes(x=TreatmentResponse, fill=Recurrence)) +
  geom_bar(position="dodge") + # side-by-side bars showing counts
  labs(y="Count", title="Recurrence by Treatment Response") +
  scale_fill_manual(values=c("No"="skyblue", "Yes"="red2")) +
  theme_minimal()

Observation

  1. Excellent: The patient shows no signs of residual disease after treatment. Almost no patients in this category experienced recurrence (1 of 208, about 0.5%).

  2. Biochemical Incomplete: Lab tests (like thyroid hormone or thyroglobulin levels) suggest some disease may still be present, but there may not be visible tumors yet. Roughly equal numbers of patients did and did not have a recurrence (11 of 23, about 48%, recurred; the blue and red bars are similar).

  3. Structural Incomplete: Imaging shows visible remaining tumor or metastasis, so disease is clearly still present. Almost all patients in this category experienced recurrence (89 of 91, about 98%; the red bar is much higher than the blue).

  4. Indeterminate: The response is unclear; lab or imaging results are ambiguous, so we can’t tell whether the disease is gone or persists. Even when the response is uncertain, most patients stay recurrence-free, but some relapse (7 of 61, about 11%).
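
The association in the table can also be tested formally. The sketch below re-enters the counts printed above by hand and applies a chi-squared test of independence; the object names are assumptions for illustration.

```r
# Counts copied from the treatment response by recurrence table above
resp_tab <- matrix(
  c( 12, 11,
    207,  1,
     54,  7,
      2, 89),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Response   = c("Biochemical Incomplete", "Excellent",
                   "Indeterminate", "Structural Incomplete"),
    Recurrence = c("No", "Yes")
  )
)

# Recurrence rate within each response category
recur_rate <- round(resp_tab[, "Yes"] / rowSums(resp_tab), 3)
recur_rate

# Chi-squared test of independence between response and recurrence
chi <- chisq.test(resp_tab)
chi$p.value
```

A very small p-value here indicates that recurrence is not independent of treatment response, in line with the pattern visible in the bar chart.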

  4. Does clinical risk level influence whether thyroid cancer comes back?

ggplot(Thyriod_cleaned, aes(x = Risk, fill = Recurrence)) +
  geom_bar(position = "dodge") +
  #coord_flip()+
  labs(y = "Count", title = "Recurrence by Risk Category") +
  theme_minimal()

Risk levels in thyroid cancer help doctors predict how likely the disease is to return.

My plot shows how recurrence changes across the three risk groups:

  1. Low-risk patients usually have small tumors and no spread, so recurrence is low.

  2. Intermediate-risk patients have some warning signs, so their recurrence is higher.

  3. High-risk patients have aggressive tumors or spread, making recurrence most likely.

Age vs Treatment response

table(Thyriod_cleaned$TreatmentResponse)
## 
## Biochemical Incomplete              Excellent          Indeterminate 
##                     23                    208                     61 
##  Structural Incomplete 
##                     91
library(plotly)

plot_ly(
  data = Thyriod_cleaned,
  x = ~TreatmentResponse,
  y = ~Age,
  type = "box",
  color = ~TreatmentResponse
) %>%
  layout(
    title = "Age vs Treatment Response",
    xaxis = list(title = "Treatment Response"),
    yaxis = list(title = "Age")
  )

QUESTION 5: Can we predict recurrence based on tumor staging and treatment response?

Random Forest

#splitting into training and test set
set.seed(123)
trainIndex <- createDataPartition(Thyriod_cleaned$Recurrence, p = 0.7, list = FALSE)

train_data <- Thyriod_cleaned[trainIndex, ]
test_data <- Thyriod_cleaned[-trainIndex, ]
train_data$Recurrence<-as.factor(train_data$Recurrence)
test_data$Recurrence<-as.factor(test_data$Recurrence)

#base predictors
predictors<-c("Stage_group","Pathology","Metastasis","TreatmentResponse","Risk","Physical.Examination","Age","Focality","LymphNodes_group","Tumor_group")

formula<-reformulate(predictors,response="Recurrence")

#train the random forest model
rf_model<-randomForest(
  formula,
  data=train_data,
  ntree=600,
  mtry = 3,
  importance = TRUE,
  seed=123  # not a documented randomForest() argument; reproducibility comes from set.seed(123) above
)
print(rf_model)
## 
## Call:
##  randomForest(formula = formula, data = train_data, ntree = 600,      mtry = 3, importance = TRUE, seed = 123) 
##                Type of random forest: classification
##                      Number of trees: 600
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.72%
## Confusion matrix:
##      No Yes class.error
## No  189   4  0.02072539
## Yes   6  70  0.07894737
#Predict on test set
rf_pred<-predict(rf_model,newdata=test_data)

#confusion matrix
con_matrix_rf<-confusionMatrix(rf_pred,test_data$Recurrence)
print(con_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  82   3
##        Yes  0  29
##                                          
##                Accuracy : 0.9737         
##                  95% CI : (0.925, 0.9945)
##     No Information Rate : 0.7193         
##     P-Value [Acc > NIR] : 7.461e-13      
##                                          
##                   Kappa : 0.9329         
##                                          
##  Mcnemar's Test P-Value : 0.2482         
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.9062         
##          Pos Pred Value : 0.9647         
##          Neg Pred Value : 1.0000         
##              Prevalence : 0.7193         
##          Detection Rate : 0.7193         
##    Detection Prevalence : 0.7456         
##       Balanced Accuracy : 0.9531         
##                                          
##        'Positive' Class : No             
## 
con<-as.data.frame(con_matrix_rf$table)
#rename columns
colnames(con)<-c("Predicted","Actual","Freq")

#plot confusion matrix
ggplot(con,aes(x=Actual,y=Predicted,fill=Freq))+
  geom_tile(color="black",linewidth=1.2)+
  geom_text(aes(label=Freq),color="white",size=6,fontface="bold")+
  scale_fill_gradient(low="#FF9973",high ="#3366CC")+
  labs(
    title="Confusion Matrix-Random Forest Model",
    x="Actual Class",
    y="Predicted Class",
    fill="Count"
  )+
  theme_minimal(base_size = 14)+
  theme(
    plot.title = element_text(face="bold",hjust = 0.5),
    axis.text = element_text(color = "black",face = "bold"),
    panel.grid = element_blank(),
    legend.position = "right"
  )
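
Because caret reports "No" as the positive class, the printed sensitivity describes how well the model finds non-recurrence. The sketch below re-enters the test-set confusion matrix from the output above by hand and computes the metrics from the point of view of the "Yes" (recurrence) class; the object names are assumptions for illustration.

```r
# Test-set confusion matrix copied from the output above
cm <- matrix(
  c(82,  3,
     0, 29),
  nrow = 2, byrow = TRUE,
  dimnames = list(Predicted = c("No", "Yes"), Actual = c("No", "Yes"))
)

# Overall accuracy: (82 + 29) / 114
accuracy <- sum(diag(cm)) / sum(cm)

# Recall for recurrence: share of actual "Yes" cases the model caught (29 / 32)
recall_yes <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Precision for recurrence: share of predicted "Yes" that were correct (29 / 29)
precision_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])

c(accuracy = accuracy, recall_yes = recall_yes, precision_yes = precision_yes)
```

The recall for "Yes" equals the value caret labels as specificity (0.9062): the model missed 3 of the 32 recurrences in the test set. Passing `positive = "Yes"` to `confusionMatrix()` would report these figures directly.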

Plot feature importance

# Plot Feature Importance
varImpPlot(rf_model,main = "Feature Importance-Random Forest",
           pch=19,
           col="blue",
           cex=0.9)

The model showed that Treatment Response is the strongest predictor of recurrence. This is medically accurate because cancer is more likely to return when treatment does not fully work.

The second strongest predictor is Risk level, which combines tumor size, lymph node spread, metastasis, and pathology. In simple terms, Risk tells us how aggressive the cancer is, so it naturally relates to recurrence.

Tumor stage and lymph node involvement also contributed moderately, which makes sense because larger tumors and spread to lymph nodes increase recurrence chances.

Pathology ranked lower, most likely because most patients in the dataset had similar cancer types (mainly papillary), so it did little to separate patients.

Overall, the feature importance from the model aligns with real medical knowledge: poor treatment response and high-risk tumors drive recurrence the most.

Decision Tree

predictors<-c("Stage_group","Pathology","Metastasis","TreatmentResponse","Risk","Physical.Examination","Age","Focality","LymphNodes_group","Tumor_group")

train_data$Risk <- factor(train_data$Risk, levels = c("Low", "Intermediate", "High"))


formula<-reformulate(predictors,response = "Recurrence")

## build decision tree
# The control settings let the tree grow deeper than the rpart defaults
tree_model<-rpart(
  formula,
  data = train_data,
  method = "class",
  control=rpart.control(
    maxdepth = 5,    # Maximum depth of any node below the root
    minsplit = 10,   # Minimum observations in a node before a split is attempted
    cp=0.001         # Low complexity threshold, so smaller improvements still split
  )
)

# Plot the tree (disabled in this report; requires the rpart.plot package)
#rpart.plot(
  #tree_model,
  #type = 2,
  #extra = 104,
  #fallen.leaves = TRUE,
  #cex = 0.6,         # Font size set for readability
  #box.palette = "GnBu",
  #shadow.col = "gray",
  #nn = TRUE,
  #main = "DECISION TREE"
#)
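
A low `cp` such as 0.001 lets the tree keep splitting on small improvements, so it can be worth checking whether a simpler tree cross-validates as well. The sketch below is only an illustration of that check: it uses the `kyphosis` data that ships with rpart as a stand-in for `train_data`, and the names `demo_tree`, `best_cp` and `pruned_tree` are hypothetical.

```r
library(rpart)

set.seed(42)  # rpart's internal cross-validation is random

# Stand-in data: kyphosis ships with rpart; it is NOT the thyroid dataset
demo_tree <- rpart(
  Kyphosis ~ Age + Number + Start,
  data = kyphosis,
  method = "class",
  control = rpart.control(maxdepth = 5, minsplit = 10, cp = 0.001)
)

# cptable holds one row per candidate tree size, including the
# cross-validated error ("xerror") for each complexity value
cp_table <- demo_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the tree with the lowest cross-validated error
pruned_tree <- prune(demo_tree, cp = best_cp)

print(cp_table)
cat("Nodes before pruning:", nrow(demo_tree$frame),
    "after pruning:", nrow(pruned_tree$frame), "\n")
```

Applied to the report's own model, the same steps would use `tree_model$cptable` and `prune(tree_model, cp = best_cp)`.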
# make predictions on training and test data
train_pred_tree<-predict(tree_model,train_data,type="class")
test_pred_tree<-predict(tree_model,test_data,type="class")

conf_matrix_train_tree<-confusionMatrix(
  factor(train_pred_tree),
  factor(train_data$Recurrence)
)


conf_matrix_test_tree<-confusionMatrix(
  factor(test_pred_tree),
  factor(test_data$Recurrence)
)
#print result
cat("Confusion matrix-TrainData\n")
## Confusion matrix-TrainData
print(conf_matrix_train_tree)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  189   5
##        Yes   4  71
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9374, 0.9846)
##     No Information Rate : 0.7175          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9171          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9793          
##             Specificity : 0.9342          
##          Pos Pred Value : 0.9742          
##          Neg Pred Value : 0.9467          
##              Prevalence : 0.7175          
##          Detection Rate : 0.7026          
##    Detection Prevalence : 0.7212          
##       Balanced Accuracy : 0.9567          
##                                           
##        'Positive' Class : No              
## 
#print result
cat("Confusion matrix-Test Data\n")
## Confusion matrix-Test Data
print(conf_matrix_test_tree)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  82   2
##        Yes  0  30
##                                           
##                Accuracy : 0.9825          
##                  95% CI : (0.9381, 0.9979)
##     No Information Rate : 0.7193          
##     P-Value [Acc > NIR] : 5e-14           
##                                           
##                   Kappa : 0.9557          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9375          
##          Pos Pred Value : 0.9762          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7193          
##          Detection Rate : 0.7193          
##    Detection Prevalence : 0.7368          
##       Balanced Accuracy : 0.9688          
##                                           
##        'Positive' Class : No              
## 
#Check feature importance
print(tree_model$variable.importance)
##    TreatmentResponse                 Risk     LymphNodes_group 
##           83.8708396           42.4768009           26.0088754 
##          Stage_group          Tumor_group           Metastasis 
##           22.5984001           22.2692233           14.5234065 
##                  Age Physical.Examination             Focality 
##            2.7746914            1.8148148            0.6049383 
##            Pathology 
##            0.6049383
# Extract importance values
importance_values <- tree_model$variable.importance

# Create barplot with extended y-limit
bp <- barplot(
  importance_values,
  main = "Feature Importance (Decision Tree)",
  xlab = "Features",
  ylab = "Importance Score",
  ylim = c(0, max(importance_values) + 5),   # Increase y-axis limit
  col = "purple3",
  names.arg = FALSE                           # Suppress default labels; slanted ones are added below
)

# Add slanted x-axis labels
text(
  x = bp,
  y = par("usr")[3] - 0.5,                    # Position below x-axis
  labels = names(importance_values),
  srt = 45,                                   # Slant labels 45 degrees
  adj = 1,
  xpd = TRUE,                                 # Allow drawing outside plot area
  cex=0.8
)

The decision tree shows that Treatment Response is the strongest predictor of thyroid cancer recurrence: patients with an “Excellent/Indeterminate” response had very low recurrence, while those with an incomplete response showed much higher recurrence risk. The next most important factor is Risk level, which reflects tumor size, lymph-node spread, metastasis, and pathology; higher-risk patients had greater recurrence. Lymph-node involvement, stage, tumor size, and metastasis contributed moderately (importance scores of roughly 15 to 26), while age, physical examination, focality, and pathology added very little (scores below 3). Both the decision tree and the random forest indicate that how well a patient responds to treatment and their initial clinical risk are the main drivers of recurrence.
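
The raw importance scores are easier to compare when expressed as shares of the total. The sketch below rescales the values printed by `tree_model$variable.importance` above; the numbers are copied from that output, and the names `importance_share` and `top_two_share` are illustrative. Since rpart also credits surrogate splits in these scores, the shares are best read as a rough guide rather than an exact attribution.

```r
# Importance scores copied from print(tree_model$variable.importance) above
importance_values <- c(
  TreatmentResponse    = 83.8708396,
  Risk                 = 42.4768009,
  LymphNodes_group     = 26.0088754,
  Stage_group          = 22.5984001,
  Tumor_group          = 22.2692233,
  Metastasis           = 14.5234065,
  Age                  = 2.7746914,
  Physical.Examination = 1.8148148,
  Focality             = 0.6049383,
  Pathology            = 0.6049383
)

# Express each score as a fraction of the total importance
importance_share <- importance_values / sum(importance_values)

# Combined share of the two strongest predictors
top_two_share <- sum(sort(importance_share, decreasing = TRUE)[1:2])

print(round(100 * importance_share, 1))   # percentages per feature
print(round(100 * top_two_share, 1))      # Treatment Response + Risk
```

On these figures, Treatment Response and Risk together account for a little under three fifths of the total importance.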

Best model plot

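# Create a data frame for plotting, using the test-set accuracies
# from the confusion matrices above rather than hard-coded values
model_accuracy <- data.frame(
  Model = c("Decision Tree", "Random Forest"),
  Accuracy = c(
    unname(conf_matrix_test_tree$overall["Accuracy"]),  # 0.9825
    unname(con_matrix_rf$overall["Accuracy"])           # 0.9737
  )
)

# Base R barplot; barplot() returns the bar midpoints for labelling
bp_acc <- barplot(
  model_accuracy$Accuracy,
  names.arg = model_accuracy$Model,
  col = c("tomato", "steelblue"),
  ylim = c(0, 1.1),                 # Leave room for the labels above the bars
  main = "Comparison of Model Accuracies",
  ylab = "Accuracy",
  cex.names = 2
)

# Optional: Add text labels on top of bars
text(
  x = bp_acc,
  y = model_accuracy$Accuracy + 0.03,
  labels = round(model_accuracy$Accuracy, 3)
)

On the test set the two models performed almost identically: the decision tree reached 98.2% accuracy and the random forest 97.4%, a difference of one misclassified patient out of 114, so neither is clearly the better model.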
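
The test-set confusion matrices above give 112 of 114 correct predictions for the decision tree and 111 of 114 for the random forest. With a test set this small, it helps to look at the uncertainty around each accuracy before ranking the models. The sketch below recomputes the exact binomial 95% intervals in base R from those counts; the variable names are illustrative.

```r
# Correct predictions on the 114 test patients, read off the
# test-set confusion matrices above (82 + 30 and 82 + 29)
n_test  <- 114
correct <- c(DecisionTree = 112, RandomForest = 111)

accuracy <- correct / n_test

# Exact (Clopper-Pearson) 95% confidence interval for each accuracy
ci <- sapply(correct, function(k) binom.test(k, n_test)$conf.int)
rownames(ci) <- c("lower", "upper")

print(round(rbind(accuracy, ci), 4))

# The intervals overlap if each lower bound sits below the other upper bound
intervals_overlap <-
  ci["lower", "DecisionTree"] <= ci["upper", "RandomForest"] &&
  ci["lower", "RandomForest"] <= ci["upper", "DecisionTree"]
print(intervals_overlap)
```

Overlapping intervals mean the one-patient gap between the models is well within sampling noise for a test set of this size.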

Conclusion

From this study:

  1. Thyroid cancer recurrence was more common in women, which agrees with medical evidence that women develop thyroid disorders more often than men, possibly through the effects of estrogen, the hormone that controls female growth and development.

  2. Older patients had a higher recurrence risk, as shown in the age boxplot, which is consistent with clinical findings that aging reduces immune response and increases tumor aggressiveness.

  3. Clinical risk level strongly influenced recurrence: intermediate- and high-risk patients had significantly more recurrence than low-risk cases.

  4. Treatment response had the strongest relationship with recurrence. Patients with a biochemical incomplete or structural incomplete response were the most likely to relapse, which was confirmed by both the decision tree and random forest.

  5. Prediction based on stage and pathology was possible, but weak, because treatment response and risk carried more clinical power.

OVERALL:

The strongest predictors of recurrence in this dataset were treatment response and clinical risk, followed by age. Women and older adults showed the highest recurrence burden. Tumor stage and pathology were helpful but less influential.

Recommendation

  1. Eating foods rich in iodine: Iodine helps the thyroid work properly and may reduce future problems. Examples of iodine-rich foods: iodized salt, fish (like tuna, sardines), milk, eggs, yogurt, and seaweed.

  2. Women should attend regular checkups: Because thyroid cancer is more common in women and they showed higher recurrence in this study, it is important for women to do routine follow-ups, especially if they notice any neck changes or symptoms.

  3. Patients who have had thyroid surgery should be monitored regularly: Anyone who has undergone thyroid surgery should have more frequent hospital visits, because recurrence can happen even after treatment, and early detection makes management easier.