The thyroid is a small butterfly-shaped gland in the neck that controls energy, metabolism, heartbeat, mood, body temperature, and how fast the body works. It takes signals from the brain to release hormones that affect many organs. Someone can live without a thyroid, but they must take hormone replacement for life.
Thyroid Cancer
Thyroid cancer occurs when abnormal cells grow in the thyroid gland. These cells multiply uncontrollably and can form a lump called a tumor, and in some cases they can spread to other parts of the body (metastasis). Treatment is usually successful, especially when the cancer is detected early, but there is still a chance that the cancer may return.
Most patients receive radioactive iodine (RAI) treatment, which is often used after surgery to destroy any leftover cancer cells.
Several factors may increase the risk of developing thyroid cancer, including low or high iodine levels, female hormones, older age, family history, radiation exposure, and long-standing thyroid problems such as goiter or thyroiditis.
Symptoms
A painless lump in the front of the neck
Swelling in the neck
Hoarse voice or voice changes
Difficulty swallowing
Difficulty breathing
Neck or throat pain
Persistent cough not caused by a cold
This dataset contains information about 383 patients who received RAI therapy and tracks whether their cancer recurred. By analyzing this data, we can identify patterns and factors that may influence recurrence. The data is valuable for predicting cancer recurrence, understanding risk factors, and evaluating treatment outcomes.
Age- How old the patient is, in years.
Gender- Whether the patient is male or female.
Hx Radiotherapy- Whether the patient had any prior radiation treatment before this therapy.
Adenopathy- Whether cancer has spread to nearby lymph nodes. “Yes” means lymph nodes are affected, “No” means they are not.
Pathology- The type of thyroid cancer. Example: micropapillary is a common form of thyroid cancer.
Focality- Whether the tumor is in one spot (Uni-Focal) or multiple spots (Multi-Focal) in the thyroid.
Risk- Overall risk level of cancer based on tumor size, spread, and other factors. Classified as Low, Intermediate, or High.
T (Tumor)- Size and extent of the main tumor. T1 is small, T4 is large or invading nearby tissues.
N (Nodes)- Spread to lymph nodes. N0 means no lymph nodes affected, N1 means some are.
M (Metastasis)- Whether cancer has spread to other parts of the body. M0 means no, M1 means yes.
Stage- Overall cancer stage, combining T, N, and M. Stage I is early, Stage IV is advanced.
Response- How well the patient responded to treatment: Excellent (very good), Indeterminate (uncertain), etc.
Recurred- Whether the cancer came back after treatment. “Yes” means recurrence, “No” means it did not.
To analyze factors affecting thyroid cancer recurrence after RAI therapy and develop insights that can help predict which patients are at higher risk of recurrence.
Are thyroid cancer recurrences more common in men or women?
How does age affect recurrence risk?
Does clinical risk level influence whether thyroid cancer comes back?
What is the relationship between treatment response and recurrence?
Can we predict recurrence based on tumor staging and treatment response?
library(readr)
library(ggplot2)
library(plotly)
library(dplyr)
library(visdat) # for checking missing values
library(caret) # for splitting data and evaluation
library(randomForest) # for random forest
library(pROC) # ROC and AUC for model evaluation
library(rpart)
library(rpart.plot)
library(car)
Thyriod<-read.csv("~/GLADYS FOLDER .R/Thyroid_Diff.csv",stringsAsFactors = FALSE)
# View(Thyriod)
# Check the first six rows of the data
head(Thyriod)
## Age Gender Smoking Hx.Smoking Hx.Radiothreapy Thyroid.Function
## 1 27 F No No No Euthyroid
## 2 34 F No Yes No Euthyroid
## 3 30 F No No No Euthyroid
## 4 62 F No No No Euthyroid
## 5 62 F No No No Euthyroid
## 6 52 M Yes No No Euthyroid
## Physical.Examination Adenopathy Pathology Focality Risk T N
## 1 Single nodular goiter-left No Micropapillary Uni-Focal Low T1a N0
## 2 Multinodular goiter No Micropapillary Uni-Focal Low T1a N0
## 3 Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0
## 4 Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0
## 5 Multinodular goiter No Micropapillary Multi-Focal Low T1a N0
## 6 Multinodular goiter No Micropapillary Multi-Focal Low T1a N0
## M Stage Response Recurred
## 1 M0 I Indeterminate No
## 2 M0 I Excellent No
## 3 M0 I Excellent No
## 4 M0 I Excellent No
## 5 M0 I Excellent No
## 6 M0 I Indeterminate No
#Structure of the data set
str(Thyriod)
## 'data.frame': 383 obs. of 17 variables:
## $ Age : int 27 34 30 62 62 52 41 46 51 40 ...
## $ Gender : chr "F" "F" "F" "F" ...
## $ Smoking : chr "No" "No" "No" "No" ...
## $ Hx.Smoking : chr "No" "Yes" "No" "No" ...
## $ Hx.Radiothreapy : chr "No" "No" "No" "No" ...
## $ Thyroid.Function : chr "Euthyroid" "Euthyroid" "Euthyroid" "Euthyroid" ...
## $ Physical.Examination: chr "Single nodular goiter-left" "Multinodular goiter" "Single nodular goiter-right" "Single nodular goiter-right" ...
## $ Adenopathy : chr "No" "No" "No" "No" ...
## $ Pathology : chr "Micropapillary" "Micropapillary" "Micropapillary" "Micropapillary" ...
## $ Focality : chr "Uni-Focal" "Uni-Focal" "Uni-Focal" "Uni-Focal" ...
## $ Risk : chr "Low" "Low" "Low" "Low" ...
## $ T : chr "T1a" "T1a" "T1a" "T1a" ...
## $ N : chr "N0" "N0" "N0" "N0" ...
## $ M : chr "M0" "M0" "M0" "M0" ...
## $ Stage : chr "I" "I" "I" "I" ...
## $ Response : chr "Indeterminate" "Excellent" "Excellent" "Excellent" ...
## $ Recurred : chr "No" "No" "No" "No" ...
# checking the summary of the data
summary(Thyriod)
## Age Gender Smoking Hx.Smoking
## Min. :15.00 Length:383 Length:383 Length:383
## 1st Qu.:29.00 Class :character Class :character Class :character
## Median :37.00 Mode :character Mode :character Mode :character
## Mean :40.87
## 3rd Qu.:51.00
## Max. :82.00
## Hx.Radiothreapy Thyroid.Function Physical.Examination Adenopathy
## Length:383 Length:383 Length:383 Length:383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Pathology Focality Risk T
## Length:383 Length:383 Length:383 Length:383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## N M Stage Response
## Length:383 Length:383 Length:383 Length:383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Recurred
## Length:383
## Class :character
## Mode :character
##
##
##
#checking missing values
colSums(is.na(Thyriod))
## Age Gender Smoking
## 0 0 0
## Hx.Smoking Hx.Radiothreapy Thyroid.Function
## 0 0 0
## Physical.Examination Adenopathy Pathology
## 0 0 0
## Focality Risk T
## 0 0 0
## N M Stage
## 0 0 0
## Response Recurred
## 0 0
There were no missing values in any column, and every variable except Age is stored as character.
Checking the unique values of each variable
# To see the levels or categories that exist in each variable
unique(Thyriod$T)
## [1] "T1a" "T1b" "T2" "T3a" "T3b" "T4a" "T4b"
unique(Thyriod$Focality)
## [1] "Uni-Focal" "Multi-Focal"
unique(Thyriod$Risk)
## [1] "Low" "Intermediate" "High"
unique(Thyriod$Response)
## [1] "Indeterminate" "Excellent" "Structural Incomplete"
## [4] "Biochemical Incomplete"
unique(Thyriod$Recurred)
## [1] "No" "Yes"
unique(Thyriod$Stage)
## [1] "I" "II" "IVB" "III" "IVA"
unique(Thyriod$Pathology)
## [1] "Micropapillary" "Papillary" "Follicular" "Hurthel cell"
The count of each unique observation
# To check how many patients fall into each category
table(Thyriod$T)
##
## T1a T1b T2 T3a T3b T4a T4b
## 49 43 151 96 16 20 8
table(Thyriod$Focality)
##
## Multi-Focal Uni-Focal
## 136 247
table(Thyriod$Risk)
##
## High Intermediate Low
## 32 102 249
table(Thyriod$Response)
##
## Biochemical Incomplete Excellent Indeterminate
## 23 208 61
## Structural Incomplete
## 91
table(Thyriod$Stage)
##
## I II III IVA IVB
## 333 32 4 3 11
table(Thyriod$Recurred)
##
## No Yes
## 275 108
Thyriod<-Thyriod %>%
rename(
Tumor= T,
LymphNodes=N,
Metastasis=M,
TreatmentResponse=Response ,
Recurrence=Recurred
)
head(Thyriod)
## Age Gender Smoking Hx.Smoking Hx.Radiothreapy Thyroid.Function
## 1 27 F No No No Euthyroid
## 2 34 F No Yes No Euthyroid
## 3 30 F No No No Euthyroid
## 4 62 F No No No Euthyroid
## 5 62 F No No No Euthyroid
## 6 52 M Yes No No Euthyroid
## Physical.Examination Adenopathy Pathology Focality Risk Tumor
## 1 Single nodular goiter-left No Micropapillary Uni-Focal Low T1a
## 2 Multinodular goiter No Micropapillary Uni-Focal Low T1a
## 3 Single nodular goiter-right No Micropapillary Uni-Focal Low T1a
## 4 Single nodular goiter-right No Micropapillary Uni-Focal Low T1a
## 5 Multinodular goiter No Micropapillary Multi-Focal Low T1a
## 6 Multinodular goiter No Micropapillary Multi-Focal Low T1a
## LymphNodes Metastasis Stage TreatmentResponse Recurrence
## 1 N0 M0 I Indeterminate No
## 2 N0 M0 I Excellent No
## 3 N0 M0 I Excellent No
## 4 N0 M0 I Excellent No
## 5 N0 M0 I Excellent No
## 6 N0 M0 I Indeterminate No
#saving my restructured data
write.csv(Thyriod, "Thyroid_cleaned.csv", row.names = FALSE)
Thyriod_cleaned<-read.csv("Thyroid_cleaned.csv")
str(Thyriod_cleaned)
## 'data.frame': 383 obs. of 17 variables:
## $ Age : int 27 34 30 62 62 52 41 46 51 40 ...
## $ Gender : chr "F" "F" "F" "F" ...
## $ Smoking : chr "No" "No" "No" "No" ...
## $ Hx.Smoking : chr "No" "Yes" "No" "No" ...
## $ Hx.Radiothreapy : chr "No" "No" "No" "No" ...
## $ Thyroid.Function : chr "Euthyroid" "Euthyroid" "Euthyroid" "Euthyroid" ...
## $ Physical.Examination: chr "Single nodular goiter-left" "Multinodular goiter" "Single nodular goiter-right" "Single nodular goiter-right" ...
## $ Adenopathy : chr "No" "No" "No" "No" ...
## $ Pathology : chr "Micropapillary" "Micropapillary" "Micropapillary" "Micropapillary" ...
## $ Focality : chr "Uni-Focal" "Uni-Focal" "Uni-Focal" "Uni-Focal" ...
## $ Risk : chr "Low" "Low" "Low" "Low" ...
## $ Tumor : chr "T1a" "T1a" "T1a" "T1a" ...
## $ LymphNodes : chr "N0" "N0" "N0" "N0" ...
## $ Metastasis : chr "M0" "M0" "M0" "M0" ...
## $ Stage : chr "I" "I" "I" "I" ...
## $ TreatmentResponse : chr "Indeterminate" "Excellent" "Excellent" "Excellent" ...
## $ Recurrence : chr "No" "No" "No" "No" ...
Some column names were abbreviated, so I renamed them for clarity using dplyr's rename() function, and I saved the restructured data.
# Group Tumor
Thyriod_cleaned$Tumor_group <- dplyr::case_when(
Thyriod_cleaned$Tumor %in% c("T1a","T1b") ~ "T1",
Thyriod_cleaned$Tumor == "T2" ~ "T2",
Thyriod_cleaned$Tumor %in% c("T3a","T3b") ~ "T3",
Thyriod_cleaned$Tumor %in% c("T4a","T4b") ~ "T4",
TRUE ~ NA_character_
)
# Group Stage
Thyriod_cleaned$Stage_group <- dplyr::case_when(
Thyriod_cleaned$Stage == "I" ~ "I",
Thyriod_cleaned$Stage == "II" ~ "II",
Thyriod_cleaned$Stage == "III" ~ "III",
Thyriod_cleaned$Stage %in% c("IVA","IVB") ~ "IV",
TRUE ~ NA_character_
)
# Group Lymph Nodes
Thyriod_cleaned$LymphNodes_group <- dplyr::case_when(
Thyriod_cleaned$LymphNodes == "N0" ~ "N0",
Thyriod_cleaned$LymphNodes %in% c("N1a","N1b") ~ "N1",
TRUE ~ NA_character_
)
I grouped tumor size, cancer stage, and lymph node involvement into simpler clinical categories. This reduces complexity, makes the patterns easier to analyze, and helps us compare recurrence rates across meaningful severity levels.
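The same grouping can also be written more compactly by stripping the trailing sub-stage letter. This is a minimal sketch in base R, assuming the codes look exactly like the ones listed earlier (`T1a`, `N1b`, `IVA`); the helper name `collapse_substage` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: drop a trailing a/b (T and N codes) or A/B (stage codes)
# so that each sub-stage maps to its parent category.
collapse_substage <- function(x) {
  sub("[abAB]$", "", x)
}

tumor <- c("T1a", "T1b", "T2", "T3a", "T3b", "T4a", "T4b")
nodes <- c("N0", "N1a", "N1b")
stage <- c("I", "II", "III", "IVA", "IVB")

collapse_substage(tumor)  # "T1" "T1" "T2" "T3" "T3" "T4" "T4"
collapse_substage(nodes)  # "N0" "N1" "N1"
collapse_substage(stage)  # "I" "II" "III" "IV" "IV"
```

Unlike the case_when() version, this sketch passes unexpected codes through unchanged instead of turning them into NA, so the explicit version is safer when the input might contain surprises.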
# Convert selected variables to factors
Thyriod_cleaned <- Thyriod_cleaned %>%
mutate(across(c(Stage, Pathology, LymphNodes, TreatmentResponse), as.factor))
# Recurrence as factor
Thyriod_cleaned$Recurrence_fac <- as.factor(Thyriod_cleaned$Recurrence)
# Recurrence as numeric (Yes = 1, No = 0)
Thyriod_cleaned$Recurrence_num <- ifelse(Thyriod_cleaned$Recurrence == "Yes", 1, 0)
# Set ordered factor levels
Thyriod_cleaned$Stage_group <- factor(Thyriod_cleaned$Stage_group,
levels = c("I","II","III","IV"), ordered = TRUE)
Thyriod_cleaned$Tumor_group <- factor(Thyriod_cleaned$Tumor_group,
levels = c("T1","T2","T3","T4"), ordered = TRUE)
Thyriod_cleaned$LymphNodes_group <- factor(Thyriod_cleaned$LymphNodes_group,
levels = c("N0","N1"), ordered = TRUE)
Thyriod_cleaned$Metastasis <- factor(Thyriod_cleaned$Metastasis,
levels = c("M0","M1"), ordered = TRUE)
# Check structure
str(Thyriod_cleaned)
## 'data.frame': 383 obs. of 22 variables:
## $ Age : int 27 34 30 62 62 52 41 46 51 40 ...
## $ Gender : chr "F" "F" "F" "F" ...
## $ Smoking : chr "No" "No" "No" "No" ...
## $ Hx.Smoking : chr "No" "Yes" "No" "No" ...
## $ Hx.Radiothreapy : chr "No" "No" "No" "No" ...
## $ Thyroid.Function : chr "Euthyroid" "Euthyroid" "Euthyroid" "Euthyroid" ...
## $ Physical.Examination: chr "Single nodular goiter-left" "Multinodular goiter" "Single nodular goiter-right" "Single nodular goiter-right" ...
## $ Adenopathy : chr "No" "No" "No" "No" ...
## $ Pathology : Factor w/ 4 levels "Follicular","Hurthel cell",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Focality : chr "Uni-Focal" "Uni-Focal" "Uni-Focal" "Uni-Focal" ...
## $ Risk : chr "Low" "Low" "Low" "Low" ...
## $ Tumor : chr "T1a" "T1a" "T1a" "T1a" ...
## $ LymphNodes : Factor w/ 3 levels "N0","N1a","N1b": 1 1 1 1 1 1 1 1 1 1 ...
## $ Metastasis : Ord.factor w/ 2 levels "M0"<"M1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Stage : Factor w/ 5 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ TreatmentResponse : Factor w/ 4 levels "Biochemical Incomplete",..: 3 2 2 2 2 3 2 2 2 2 ...
## $ Recurrence : chr "No" "No" "No" "No" ...
## $ Tumor_group : Ord.factor w/ 4 levels "T1"<"T2"<"T3"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Stage_group : Ord.factor w/ 4 levels "I"<"II"<"III"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LymphNodes_group : Ord.factor w/ 2 levels "N0"<"N1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Recurrence_fac : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Recurrence_num : num 0 0 0 0 0 0 0 0 0 0 ...
I changed some columns into factors so the computer understands they are categories like “Male/Female” or “Stage I/Stage II,” not numbers or random text. This helps when making charts, tables, and comparisons. Then I made a numeric version of the recurrence column (Yes = 1, No = 0) so it can be used in calculations and prediction models, because computers need numbers when doing statistics or machine learning.
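To make the difference concrete, here is a small sketch using made-up values (not the study data) that stand in for the Stage_group and Recurrence columns. It shows what an ordered factor adds over plain text and why the 0/1 coding is convenient.

```r
# Toy values standing in for the Stage_group and Recurrence columns
stage_chr <- c("II", "I", "IV", "I", "III")
recur_chr <- c("No", "No", "Yes", "No", "Yes")

# Ordered factor: the levels carry the clinical ordering I < II < III < IV
stage_ord <- factor(stage_chr, levels = c("I", "II", "III", "IV"), ordered = TRUE)
stage_ord >= "III"      # comparisons respect that ordering
as.integer(stage_ord)   # underlying codes: 2 1 4 1 3

# 0/1 coding: the mean is then simply the recurrence rate
recur_num <- ifelse(recur_chr == "Yes", 1, 0)
mean(recur_num)         # 0.4
```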
STAGE
This is the overall stage of cancer, determined using tumor, lymph nodes, and metastasis together.
Stage I- Early cancer, tumor is still small, no spread.
Stage II- Larger tumor or some spread into nearby tissue.
Stage III- Cancer spread to local lymph nodes or tissues.
Stage IV- Advanced cancer, often with distant spread (metastasis).
stage_counts <- Thyriod_cleaned %>%
count(Stage_group)
stage_counts
## Stage_group n
## 1 I 333
## 2 II 32
## 3 III 4
## 4 IV 14
ggplot(stage_counts, aes(x = Stage_group, y = n)) +
geom_col(fill = "red3") +
geom_text(aes(label = n), vjust = -0.3) +
labs(
title = "Distribution of Stages",
x = "Thyroid Cancer Stages",
y = "Count"
) +
theme_minimal()
Observation
Stage I is the most common stage in the dataset, which is consistent with most thyroid cancers being found early. Thyroid cancer usually grows slowly, many people discover it during routine checkups, survival rates are high, and early detection, especially in women, is very common.
TUMOR
This refers to the size and extent of the tumor in the thyroid.
T1- Very small tumor < 2 cm, limited to thyroid.
T2- Tumor between 2 – 4 cm, still within thyroid.
T3- Tumor > 4 cm or slightly extending outside thyroid.
T4- Tumor growing beyond thyroid into nearby tissues.
tumor_count<-Thyriod_cleaned%>%
count(Tumor_group)
tumor_count
## Tumor_group n
## 1 T1 92
## 2 T2 151
## 3 T3 112
## 4 T4 28
ggplot(tumor_count,aes(x=Tumor_group,y=n))+
geom_col(fill = "red4")+
geom_text(aes(label=n),vjust=-0.3)+
labs(title="Distribution of Tumor Levels",
x="Tumor Level",
y="Count")+
theme_minimal()
Observation
T2 is the most common category in the dataset. A likely explanation is that many thyroid cancers are not caught while they are still tiny (T1), but are found before they have grown dangerously large. Most people do not feel symptoms early, so the cancer quietly grows to a moderate size, which is exactly what T2 represents.
It is still early and still treatable, but bigger than the very small T1 tumors. In this dataset, T2 and T3 together account for 263 of the 383 patients.
LYMPHNODES
Lymph nodes are small, bean-shaped glands all over our body that help fight infection and filter harmful substances. They are part of our immune system.
In this dataset, the LymphNodes column tells us whether the thyroid cancer has spread to the nearby lymph nodes. N0 means the lymph nodes are clear and unaffected, while N1 indicates that cancer cells have reached the lymph nodes, which can influence treatment decisions and the risk of recurrence.
lymph<-Thyriod_cleaned%>%
count(LymphNodes_group)
lymph
## LymphNodes_group n
## 1 N0 268
## 2 N1 115
ggplot(lymph,aes(x=LymphNodes_group,y=n))+
geom_col(fill = "blue3")+
geom_text(aes(label=n),vjust=-0.3)+
labs(title="Distribution of Lymph Node Involvement",
x="Lymph Nodes",
y="Count")+
theme_minimal()
Observation
N0 was the highest because, in most thyroid cancer cases, the cancer is detected before it spreads to the lymph nodes. Thyroid cancer often grows slowly, and many patients are diagnosed early through routine checkups or incidental findings. As a result, most patients initially have no lymph node involvement.
METASTASIS
Metastasis means that cancer has spread from its original location to other parts of the body.
Metastasis in thyroid cancer usually happens when cancer cells travel beyond the thyroid and nearby lymph nodes, often reaching the lungs or bones. Patients with metastasis (M1) generally have a higher risk of recurrence.
meta<-Thyriod_cleaned%>%
count(Metastasis)
meta
## Metastasis n
## 1 M0 365
## 2 M1 18
ggplot(meta,aes(x=Metastasis,y=n))+
geom_col(fill = "green4")+
geom_text(aes(label=n),vjust=-0.3)+
labs(title="Distribution of Metastasis",
x="Metastasis Level",
y="Count")+
theme_minimal()
Observation
M0 is the highest in the dataset because, in most thyroid cancer cases, the cancer has not yet spread to distant organs. Thyroid cancer usually grows slowly, and many patients are diagnosed early when the disease is still localized to the thyroid and nearby lymph nodes.
ggplot(Thyriod_cleaned, aes(x = Gender, fill = Recurrence)) +
  geom_bar(position = "dodge") +
  geom_text(
    stat = "count",
    aes(label = after_stat(count)), # after_stat() replaces the deprecated ..count..
    position = position_dodge(width = 0.9), # aligns text with bars
    vjust = -0.5 # puts count just above the bar
  ) +
  labs(
    title = "Recurrence by Gender",
    x = "Gender",
    y = "Count",
    fill = "Recurrence"
  ) +
  theme_minimal()
By raw count, more recurrences occurred in women than in men. Part of this reflects the fact that thyroid cancer itself is more common in women, so the bar chart alone does not show whether recurrence is more likely for an individual woman than for an individual man; that requires comparing the proportion of recurrences within each gender. Possible biological explanations for the higher burden in women include female hormones such as estrogen, which may stimulate thyroid cell growth, the higher frequency of thyroid problems such as goiter and thyroiditis in women, and the extra demand that pregnancy places on the thyroid.
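Because the bar chart shows raw counts, groups of very different sizes are hard to compare. The sketch below shows one way to compute the recurrence proportion within each gender. The `toy` data frame is made up for illustration and is not the study data; only the column names Gender and Recurrence match the cleaned dataset.

```r
# Made-up example data, NOT the study data
toy <- data.frame(
  Gender     = c("F", "F", "F", "F", "F", "F", "M", "M"),
  Recurrence = c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No")
)

# Cross-tabulate gender against recurrence
counts <- table(toy$Gender, toy$Recurrence)
counts

# margin = 1 divides each row by its row total,
# giving the share of "No" and "Yes" within each gender
rates <- prop.table(counts, margin = 1)
rates[, "Yes"]
```

In this toy example the larger group has more recurrences in absolute terms but a lower recurrence rate, which is why both views are worth checking. On the real data, `toy` would be replaced by `Thyriod_cleaned`.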
tapply(Thyriod_cleaned$Age, Thyriod_cleaned$Recurrence, mean, na.rm = TRUE)
## No Yes
## 38.41455 47.11111
tapply(Thyriod_cleaned$Age, Thyriod_cleaned$Recurrence, median, na.rm = TRUE)
## No Yes
## 36.0 44.5
box_plot <- plot_ly(
data = Thyriod_cleaned,
x=~Recurrence,
y = ~Age,
type = "box",
color = ~Recurrence
)
box_plot <- layout(
box_plot,
title = "Age vs Recurrence",
xaxis = list(title = "Recurrence"),
yaxis = list(title = "Age")
)
box_plot
The boxplot shows that patients whose cancer recurred tend to be older. The median age is 44.5 years in the recurrence group, compared to 36 years in the non-recurrence group (means of 47.1 and 38.4). This is consistent with medical evidence that older age increases the risk of recurrence, possibly due to greater tumor aggressiveness and weaker immune response in older adults.
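A difference in medians can be checked more formally with a rank-based test. The sketch below uses the Wilcoxon rank-sum test from base R on made-up ages (not the study data); choosing this test rather than, say, a t-test is an assumption made here because age distributions are often skewed.

```r
# Made-up ages, NOT the study data
toy <- data.frame(
  Age        = c(25, 31, 36, 38, 42, 29, 44, 51, 58, 63, 47, 70),
  Recurrence = rep(c("No", "Yes"), each = 6)
)

# Median age in each group
tapply(toy$Age, toy$Recurrence, median)

# Two-sided Wilcoxon rank-sum test of Age between the two groups
res <- wilcox.test(Age ~ Recurrence, data = toy)
res$p.value
```

On the real data the same call would be `wilcox.test(Age ~ Recurrence, data = Thyriod_cleaned)`; with repeated ages R falls back to a normal approximation and prints a note about ties.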
# Table of counts
table(Thyriod_cleaned$TreatmentResponse, Thyriod_cleaned$Recurrence)
##
## No Yes
## Biochemical Incomplete 12 11
## Excellent 207 1
## Indeterminate 54 7
## Structural Incomplete 2 89
# Proportions
prop.table(table(Thyriod_cleaned$TreatmentResponse, Thyriod_cleaned$Recurrence), 1)
##
## No Yes
## Biochemical Incomplete 0.521739130 0.478260870
## Excellent 0.995192308 0.004807692
## Indeterminate 0.885245902 0.114754098
## Structural Incomplete 0.021978022 0.978021978
ggplot(Thyriod_cleaned, aes(x=TreatmentResponse, fill=Recurrence)) +
  geom_bar(position="dodge") + # side-by-side counts; use position="fill" for proportions
  labs(y="Count", title="Recurrence by Treatment Response") +
  scale_fill_manual(values=c("No"="skyblue", "Yes"="red2")) +
  theme_minimal()
Observation
Biochemical Incomplete: Roughly equal numbers of patients had recurrence or not (11 vs 12; blue and red bars similar).
Excellent: Almost no patients experienced recurrence (1 of 208).
Indeterminate: Even when the response is uncertain, most patients stay recurrence-free (54 of 61), but some relapse.
Structural Incomplete: Most patients in this category experienced recurrence (89 of 91; red bar much higher than blue).
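The recurrence rate within each response category can be recomputed directly from the counts in the table printed above, and the association can be checked with a chi-squared test. This is a sketch in base R; the chi-squared test is an addition for illustration and was not part of the original analysis.

```r
# Counts copied from the TreatmentResponse x Recurrence table above
resp <- matrix(
  c(12, 11,
    207, 1,
    54, 7,
    2, 89),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Response = c("Biochemical Incomplete", "Excellent",
                 "Indeterminate", "Structural Incomplete"),
    Recurrence = c("No", "Yes")
  )
)

# Recurrence rate within each response category, in percent
rate_yes <- 100 * resp[, "Yes"] / rowSums(resp)
round(sort(rate_yes), 1)

# Chi-squared test of independence between response and recurrence
chi <- chisq.test(resp)
chi$p.value
```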
4. Does clinical risk level influence whether thyroid cancer comes back?
ggplot(Thyriod_cleaned, aes(x = Risk, fill = Recurrence)) +
geom_bar(position = "dodge") +
#coord_flip()+
labs(y = "Count", title = "Recurrence by Risk Category") +
theme_minimal()
My plot shows how recurrence changes across these risk groups.
Risk levels in thyroid cancer help doctors predict how likely the disease is to return.
Low-risk patients usually have small tumors and no spread, so recurrence is low.
Intermediate-risk patients have some warning signs, so their recurrence is higher.
High-risk patients have aggressive tumors or spread, making recurrence most likely.
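Risk is stored as text, so plots and tables list its categories alphabetically (High, Intermediate, Low) rather than in clinical order. The sketch below, on made-up values (not the study data), shows how explicit factor levels fix the order and how the recurrence share per risk level can be computed.

```r
# Made-up example values, NOT the study data
risk  <- c("Low", "High", "Intermediate", "Low", "Low", "High", "Intermediate", "Low")
recur <- c("No",  "Yes",  "Yes",          "No",  "No",  "Yes",  "No",           "Yes")

# Without explicit levels, categories sort alphabetically
levels(factor(risk))   # "High" "Intermediate" "Low"

# Explicit levels give the clinical order Low -> Intermediate -> High
risk_f <- factor(risk, levels = c("Low", "Intermediate", "High"))

# Share of recurrences within each risk level
share_yes <- prop.table(table(risk_f, recur), margin = 1)[, "Yes"]
share_yes
```

Applying the same `factor(..., levels = ...)` call to `Thyriod_cleaned$Risk` before plotting would put the bars in Low, Intermediate, High order.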
Age vs Treatment response
table(Thyriod_cleaned$TreatmentResponse)
##
## Biochemical Incomplete Excellent Indeterminate
## 23 208 61
## Structural Incomplete
## 91
library(plotly)
plot_ly(
data = Thyriod_cleaned,
x = ~TreatmentResponse,
y = ~Age,
type = "box",
color = ~TreatmentResponse
) %>%
layout(
title = "Age vs Treatment Response",
xaxis = list(title = "Treatment Response"),
yaxis = list(title = "Age")
)
QUESTION 5: Can we predict recurrence based on tumor staging and treatment response?
#splitting into training and test set
set.seed(123)
trainIndex <- createDataPartition(Thyriod_cleaned$Recurrence, p = 0.7, list = FALSE)
train_data <- Thyriod_cleaned[trainIndex, ]
test_data <- Thyriod_cleaned[-trainIndex, ]
train_data$Recurrence<-as.factor(train_data$Recurrence)
test_data$Recurrence<-as.factor(test_data$Recurrence)
#base predictors
predictors<-c("Stage_group","Pathology","Metastasis","TreatmentResponse","Risk","Physical.Examination","Age","Focality","LymphNodes_group","Tumor_group")
formula<-reformulate(predictors,response="Recurrence")
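#train the random forest model
# Note: randomForest() has no `seed` argument, so `seed=123` below is ignored;
# reproducibility comes from the set.seed(123) call before the data split.
rf_model<-randomForest(
  formula,
  data=train_data,
  ntree=600,
  mtry = 3,
  importance = TRUE,
  seed=123
)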
print(rf_model)
##
## Call:
## randomForest(formula = formula, data = train_data, ntree = 600, mtry = 3, importance = TRUE, seed = 123)
## Type of random forest: classification
## Number of trees: 600
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.72%
## Confusion matrix:
## No Yes class.error
## No 189 4 0.02072539
## Yes 6 70 0.07894737
#Predict on test set
rf_pred<-predict(rf_model,newdata=test_data)
#confusion matrix
con_matrix_rf<-confusionMatrix(rf_pred,test_data$Recurrence)
print(con_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 82 3
## Yes 0 29
##
## Accuracy : 0.9737
## 95% CI : (0.925, 0.9945)
## No Information Rate : 0.7193
## P-Value [Acc > NIR] : 7.461e-13
##
## Kappa : 0.9329
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9062
## Pos Pred Value : 0.9647
## Neg Pred Value : 1.0000
## Prevalence : 0.7193
## Detection Rate : 0.7193
## Detection Prevalence : 0.7456
## Balanced Accuracy : 0.9531
##
## 'Positive' Class : No
##
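Note that caret treated "No" as the positive class here, so the Sensitivity of 1.0 describes how well non-recurrence is detected. For a recurrence model, the figures for the "Yes" class are usually of more interest. They can be recomputed by hand from the matrix above, as in this base R sketch; passing `positive = "Yes"` to confusionMatrix() should report the same quantities directly.

```r
# Counts copied from the random forest test-set confusion matrix above
cm <- matrix(
  c(82, 3,
    0, 29),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("No", "Yes"), Reference = c("No", "Yes"))
)

tp <- cm["Yes", "Yes"]  # recurrences correctly flagged
fn <- cm["No",  "Yes"]  # recurrences the model missed
fp <- cm["Yes", "No"]   # false alarms

recall_yes    <- tp / (tp + fn)  # 29 / 32
precision_yes <- tp / (tp + fp)  # 29 / 29
c(recall = recall_yes, precision = precision_yes)
```

The recall of about 0.906 for "Yes" matches the value caret labels Specificity, because the two classes simply swap roles.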
con<-as.data.frame(con_matrix_rf$table)
#rename columns
colnames(con)<-c("Predicted","Actual","Freq")
#plot confusion matrix
ggplot(con,aes(x=Actual,y=Predicted,fill=Freq))+
geom_tile(color="black",linewidth=1.2)+
geom_text(aes(label=Freq),color="white",size=6,fontface="bold")+
scale_fill_gradient(low="#FF9973",high ="#3366CC")+
labs(
title="Confusion Matrix-Random Forest Model",
x="Actual Class",
y="Predicted Class",
fill="Count"
)+
theme_minimal(base_size = 14)+
theme(
plot.title = element_text(face="bold",hjust = 0.5),
axis.text = element_text(color = "black",face = "bold"),
panel.grid = element_blank(),
legend.position = "right"
)
Plot feature importance
# Plot Feature Importance
varImpPlot(rf_model,main = "Feature Importance-Random Forest",
pch=19,
col="blue",
cex=0.9)
The model showed that Treatment Response is the strongest predictor of recurrence. This is medically accurate because cancer is more likely to return when treatment does not fully work.
The second strongest predictor is Risk level, which combines tumor size, lymph node spread, metastasis, and pathology. In simple terms, Risk tells us how aggressive the cancer is, so it naturally relates to recurrence.
Tumor stage and lymph node involvement also contributed moderately, which makes sense because larger tumors and spread to lymph nodes increase recurrence chances.
Pathology ranked lower only because most patients in the dataset had similar cancer types (mainly papillary), so it did not create much difference between patients.
Overall, the feature importance from the model aligns with real medical knowledge: poor treatment response and high-risk tumors drive recurrence the most.
predictors<-c("Stage_group","Pathology","Metastasis","TreatmentResponse","Risk","Physical.Examination","Age","Focality","LymphNodes_group","Tumor_group")
train_data$Risk <- factor(train_data$Risk, levels = c("Low", "Intermediate", "High"))
formula<-reformulate(predictors,response = "Recurrence")
# Build the decision tree with explicit complexity controls
tree_model<-rpart(
formula,
data = train_data,
method = "class",
control=rpart.control(
maxdepth = 5, # Allows for a deeper, more detailed tree
minsplit = 10,
cp=0.001 # Allows for smaller splits
)
)
# Plot the tree (left commented out; uncomment to draw it in the Plots pane)
#rpart.plot(
#tree_model,
#type = 2,
#extra = 104,
#fallen.leaves = TRUE,
#cex = 0.6, # Font size set for readability
#box.palette = "GnBu",
#shadow.col = "gray",
#nn = TRUE,
#main = "DECISION TREE"
#)
# make predictions on training and test data
train_pred_tree<-predict(tree_model,train_data,type="class")
test_pred_tree<-predict(tree_model,test_data,type="class")
conf_matrix_train_tree<-confusionMatrix(
factor(train_pred_tree),
factor(train_data$Recurrence)
)
conf_matrix_test_tree<-confusionMatrix(
factor(test_pred_tree),
factor(test_data$Recurrence)
)
#print result
cat("Confusion matrix - Train Data\n")
## Confusion matrix - Train Data
print(conf_matrix_train_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 189 5
## Yes 4 71
##
## Accuracy : 0.9665
## 95% CI : (0.9374, 0.9846)
## No Information Rate : 0.7175
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9171
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9793
## Specificity : 0.9342
## Pos Pred Value : 0.9742
## Neg Pred Value : 0.9467
## Prevalence : 0.7175
## Detection Rate : 0.7026
## Detection Prevalence : 0.7212
## Balanced Accuracy : 0.9567
##
## 'Positive' Class : No
##
#print result
cat("Confusion matrix - Test Data\n")
## Confusion matrix - Test Data
print(conf_matrix_test_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 82 2
## Yes 0 30
##
## Accuracy : 0.9825
## 95% CI : (0.9381, 0.9979)
## No Information Rate : 0.7193
## P-Value [Acc > NIR] : 5e-14
##
## Kappa : 0.9557
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9375
## Pos Pred Value : 0.9762
## Neg Pred Value : 1.0000
## Prevalence : 0.7193
## Detection Rate : 0.7193
## Detection Prevalence : 0.7368
## Balanced Accuracy : 0.9688
##
## 'Positive' Class : No
##
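For a side-by-side check, the test-set accuracy of both models can be recomputed from the two test confusion matrices printed above, together with exact binomial confidence intervals. This is a sketch in base R; the counts are copied from the printed output, and using binom.test() for the intervals is a choice made here for illustration.

```r
n_test       <- 114      # 82 + 2 + 0 + 30 patients in the test set
correct_tree <- 82 + 30  # diagonal of the decision tree matrix
correct_rf   <- 82 + 29  # diagonal of the random forest matrix

acc <- c(tree = correct_tree / n_test, rf = correct_rf / n_test)
round(acc, 4)

# Exact (Clopper-Pearson) 95% intervals for each accuracy
ci_tree <- binom.test(correct_tree, n_test)$conf.int
ci_rf   <- binom.test(correct_rf,   n_test)$conf.int
rbind(tree = ci_tree, rf = ci_rf)
```

The two models differ by a single test patient, and the intervals overlap heavily, so this test set cannot separate them.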
#Check feature importance
print(tree_model$variable.importance)
## TreatmentResponse Risk LymphNodes_group
## 83.8708396 42.4768009 26.0088754
## Stage_group Tumor_group Metastasis
## 22.5984001 22.2692233 14.5234065
## Age Physical.Examination Focality
## 2.7746914 1.8148148 0.6049383
## Pathology
## 0.6049383
# Extract importance values
importance_values <- tree_model$variable.importance
# Create barplot with extended y-limit
bp <- barplot(
importance_values,
main = "Feature Importance (Decision Tree)",
xlab = "Features",
ylab = "Importance Score",
ylim = c(0, max(importance_values) + 5), # Increase y-axis limit
col = "purple3",
names.arg = FALSE # Remove labels first
)
# Add slanted x-axis labels
text(
x = bp,
y = par("usr")[3] - 0.5, # Position below x-axis
labels = names(importance_values),
srt = 45, # Slant labels 45 degrees
adj = 1,
xpd = TRUE, # Allow drawing outside plot area
cex=0.8
)
The decision tree shows that Treatment Response is the strongest predictor of thyroid cancer recurrence: patients with an “Excellent” or “Indeterminate” response had very low recurrence, while those with an incomplete response showed much higher recurrence risk. The next most important factor is Risk level, which reflects tumor size, lymph node spread, metastasis, and pathology; higher-risk patients had greater recurrence. Lymph node involvement, stage, tumor group, and metastasis contributed moderately, while age, physical examination, focality, and pathology type added very little. Both the decision tree and the random forest indicate that how well a patient responds to treatment and their initial clinical risk are the main drivers of recurrence.
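The raw importance scores are easier to compare as shares of the total. The sketch below rescales the values printed above to percentages; the numbers are copied from the tree_model$variable.importance output, and the rescaling itself is an illustrative addition.

```r
# Importance scores copied from tree_model$variable.importance above
imp <- c(
  TreatmentResponse = 83.8708396, Risk = 42.4768009,
  LymphNodes_group = 26.0088754, Stage_group = 22.5984001,
  Tumor_group = 22.2692233, Metastasis = 14.5234065,
  Age = 2.7746914, Physical.Examination = 1.8148148,
  Focality = 0.6049383, Pathology = 0.6049383
)

# Express each score as a percentage of the total
share <- 100 * imp / sum(imp)
round(share, 1)

# Combined share of the two leading predictors
top_two <- sum(share[c("TreatmentResponse", "Risk")])
top_two
```

Treatment response and risk together account for more than half of the total importance in this tree.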
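# Create a data frame for plotting, using the test-set accuracies computed above
model_accuracy <- data.frame(
  Model = c("Decision Tree", "Random Forest"),
  Accuracy = c(
    unname(conf_matrix_test_tree$overall["Accuracy"]), # 0.9825
    unname(con_matrix_rf$overall["Accuracy"])          # 0.9737
  )
)
# Base R barplot
barplot(
  model_accuracy$Accuracy,
  names.arg = model_accuracy$Model,
  col = c("tomato", "steelblue"),
  ylim = c(0, 1.1), # headroom for the labels above the bars
  main = "Comparison of Model Accuracies",
  ylab = "Accuracy",
  cex.names = 2
)
# Optional: Add text labels on top of bars
text(
  x = c(0.7, 1.9),
  y = model_accuracy$Accuracy + 0.03,
  labels = round(model_accuracy$Accuracy, 4)
)
On the held-out test set the two models performed almost identically: the decision tree reached an accuracy of 0.9825 (2 misclassified patients out of 114) and the random forest 0.9737 (3 misclassified). A difference of one patient is too small to call either model clearly better.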
From this study:
By raw count, thyroid cancer recurrence was more common in women, which is consistent with thyroid disorders in general being more common in women; hormonal factors such as estrogen have been suggested as one explanation.
Older patients had a higher recurrence risk, as shown in the age boxplot, matching clinical findings that aging reduces immune response and increases tumor aggressiveness.
Clinical risk level strongly influenced recurrence: intermediate- and high-risk patients had more recurrence than low-risk cases.
Treatment response had the strongest relationship with recurrence. Patients with a structural or biochemical incomplete response were the most likely to relapse, which was confirmed by both the decision tree and random forest.
Prediction based on stage and pathology was possible, but weak, because treatment response and risk carried more clinical power.
OVERALL:
The strongest predictors of recurrence in this dataset were treatment response and clinical risk, followed by age. Women and older adults showed the highest recurrence burden. Tumor stage and pathology were helpful but less influential.
Eating foods rich in iodine: Iodine helps the thyroid work properly and may reduce future problems. Examples of iodine-rich foods: iodized salt, fish (like tuna, sardines), milk, eggs, yogurt, and seaweed.
Women should attend regular checkups: Because thyroid cancer is more common in women and they showed higher recurrence in this study, it is important for women to do routine follow-ups, especially if they notice any neck changes or symptoms.
Patients who have had thyroid surgery should monitor regularly: Anyone who has undergone thyroid surgery should have more frequent hospital visits, because recurrence can happen even after treatment, and early detection makes management easier.