DSA_406_001_SP25_PA3_ebbopp

Gentle introduction

Author

DSA_406_001_SP25_PA3_ebbopp

Reading in the Dataset

# Setting up the libraries needed for this project
library(data.table)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()     masks data.table::between()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::first()       masks data.table::first()
✖ lubridate::hour()    masks data.table::hour()
✖ lubridate::isoweek() masks data.table::isoweek()
✖ dplyr::lag()         masks stats::lag()
✖ dplyr::last()        masks data.table::last()
✖ lubridate::mday()    masks data.table::mday()
✖ lubridate::minute()  masks data.table::minute()
✖ lubridate::month()   masks data.table::month()
✖ lubridate::quarter() masks data.table::quarter()
✖ lubridate::second()  masks data.table::second()
✖ purrr::transpose()   masks data.table::transpose()
✖ lubridate::wday()    masks data.table::wday()
✖ lubridate::week()    masks data.table::week()
✖ lubridate::yday()    masks data.table::yday()
✖ lubridate::year()    masks data.table::year()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tinytex)
library(DescTools)
Warning: package 'DescTools' was built under R version 4.4.3

Attaching package: 'DescTools'

The following object is masked from 'package:data.table':

    %like%
library(reshape2)
Warning: package 'reshape2' was built under R version 4.4.3

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths

The following objects are masked from 'package:data.table':

    dcast, melt
# Reading in the dataset
brain_tumor_dt <- read.csv("data/Brain_Tumor_Prediction_Dataset.csv")

Brief Data Descriptions

# Getting a basic frame of reference for the dataset with introductory
# analysis tools
summary(brain_tumor_dt)
      Age           Gender            Country            Tumor_Size    
 Min.   : 5.00   Length:250000      Length:250000      Min.   : 0.500  
 1st Qu.:26.00   Class :character   Class :character   1st Qu.: 2.870  
 Median :47.00   Mode  :character   Mode  :character   Median : 5.260  
 Mean   :46.96                                         Mean   : 5.252  
 3rd Qu.:68.00                                         3rd Qu.: 7.630  
 Max.   :89.00                                         Max.   :10.000  
 Tumor_Location     MRI_Findings        Genetic_Risk Smoking_History   
 Length:250000      Length:250000      Min.   :  0   Length:250000     
 Class :character   Class :character   1st Qu.: 25   Class :character  
 Mode  :character   Mode  :character   Median : 50   Mode  :character  
                                       Mean   : 50                     
                                       3rd Qu.: 75                     
                                       Max.   :100                     
 Alcohol_Consumption Radiation_Exposure Head_Injury_History Chronic_Illness   
 Length:250000       Length:250000      Length:250000       Length:250000     
 Class :character    Class :character   Class :character    Class :character  
 Mode  :character    Mode  :character   Mode  :character    Mode  :character  
                                                                              
                                                                              
                                                                              
 Blood_Pressure       Diabetes          Tumor_Type        Treatment_Received
 Length:250000      Length:250000      Length:250000      Length:250000     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 Survival_Rate... Tumor_Growth_Rate  Family_History     Symptom_Severity  
 Min.   :10.00    Length:250000      Length:250000      Length:250000     
 1st Qu.:32.00    Class :character   Class :character   Class :character  
 Median :55.00    Mode  :character   Mode  :character   Mode  :character  
 Mean   :54.48                                                            
 3rd Qu.:77.00                                                            
 Max.   :99.00                                                            
 Brain_Tumor_Present
 Length:250000      
 Class :character   
 Mode  :character   
                    
                    
                    
dim(brain_tumor_dt)
[1] 250000     21
head(brain_tumor_dt)
  Age Gender   Country Tumor_Size Tumor_Location MRI_Findings Genetic_Risk
1  66  Other     China       8.70     Cerebellum       Severe           81
2  87 Female Australia       8.14       Temporal       Normal           65
3  41   Male    Canada       6.02      Occipital       Severe          100
4  52   Male     Japan       7.26      Occipital       Normal           19
5  84 Female    Brazil       7.94       Temporal     Abnormal           47
6  29   Male   Germany       7.97        Frontal     Abnormal           70
  Smoking_History Alcohol_Consumption Radiation_Exposure Head_Injury_History
1              No                 Yes             Medium                  No
2              No                 Yes             Medium                  No
3             Yes                  No                Low                 Yes
4             Yes                 Yes               High                 Yes
5              No                 Yes             Medium                  No
6             Yes                 Yes             Medium                  No
  Chronic_Illness Blood_Pressure Diabetes Tumor_Type Treatment_Received
1             Yes         122/88       No  Malignant               None
2              No        126/119       No  Malignant               None
3              No         118/65       No     Benign       Chemotherapy
4              No        165/119      Yes     Benign          Radiation
5             Yes         156/97      Yes  Malignant               None
6              No          95/85       No  Malignant            Surgery
  Survival_Rate... Tumor_Growth_Rate Family_History Symptom_Severity
1               58              Slow            Yes             Mild
2               13             Rapid            Yes           Severe
3               67              Slow            Yes         Moderate
4               85          Moderate             No         Moderate
5               17          Moderate             No         Moderate
6               65             Rapid            Yes           Severe
  Brain_Tumor_Present
1                  No
2                  No
3                 Yes
4                 Yes
5                  No
6                  No

As we can see from inspecting the data above, this dataset contains 250,000 observations, each with 21 variables. It mixes quantitative and qualitative values: integers, floating-point numbers, Yes/No indicators, and categorical strings (every non-numeric column is read in as character).

I retrieved this dataset from Kaggle at the following link: https://www.kaggle.com/datasets/ankushpanday1/brain-tumor-prediction-dataset

Questions to Answer

  • What is the dataset about?

    • The Brain Tumor Prediction Dataset is a collection of medical records intended to support research and development in brain tumor diagnosis and prediction. It comprises 250,000 patient records, each with 21 medical features (the 21 columns reported by dim() above). The dataset includes MRI findings, demographic information, medical history, and other relevant attributes.
  • Where did the data come from?

    • I found this dataset on Kaggle while browsing several dataset websites.
  • Is there a website or publication to cite the authors? If yes, include it.

    • https://www.kaggle.com/datasets/ankushpanday1/brain-tumor-prediction-dataset

Motivation

  • What are your motivations for exploring this dataset?

    • I have long been interested in improving medical diagnoses and finding meaningful ways to help people and communities. For many years I wanted to be a radiologist, reading medical images to find diagnoses that would help my patients. Just before college I learned about the potential of computer science, data science, and algorithms in the medical space. This led me to want to develop software that outperforms the systems we have today and so has an even wider impact on people and communities. This dataset aligns strongly with these goals and is a good first step into diagnostic algorithms and the interplay between medicine and computing.
  • What questions do you want to answer? (This involves identifying and articulating the key questions that you aim to explore through your analysis of the dataset. These questions set the direction for your research and data exploration. They are typically broad and open-ended, aimed at uncovering insights or patterns within the data. IE: What are the main factors that affect customer satisfaction?)

    • What are the main factors that affect the presence of brain tumors?

    • What factors influence the size and/or location of brain tumors?

    • Are there factors that correlate with the severity of a brain tumor?

    • Are there any factors commonly associated with tumors that do not actually affect the presence, location, or severity of a tumor?

  • Provide a hypothesis about the dataset. (Formulating a hypothesis involves making a specific, testable statement based on your initial understanding or assumptions about the data. A hypothesis is more focused than a general question and often predicts a relationship between variables that you can test through your analysis. IE: Customers with shorter wait times report higher satisfaction levels, suggesting a significant negative correlation between wait time and satisfaction.)

    • Patients with a family history of brain tumors and a high genetic risk score are more likely to develop malignant tumors than those without a family history and low genetic risk.
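
One way this hypothesis could be tested is with a contingency table and a chi-squared test of independence. The sketch below is only an illustration under stated assumptions: it runs on randomly generated toy data rather than the real dataset, and the cut-offs of 75 and 25 for "high" and "low" genetic risk are hypothetical choices (they happen to match the quartiles reported by summary() above).

```r
# Toy data standing in for the real columns (values are random, not real)
set.seed(406)
n <- 2000
toy <- data.frame(
  Family_History = sample(c("Yes", "No"), n, replace = TRUE),
  Genetic_Risk   = sample(0:100, n, replace = TRUE),
  Tumor_Type     = sample(c("Benign", "Malignant"), n, replace = TRUE)
)

# Hypothetical group definitions: the 75 / 25 cut-offs are assumptions
high <- toy$Family_History == "Yes" & toy$Genetic_Risk >= 75
low  <- toy$Family_History == "No"  & toy$Genetic_Risk <= 25

# Keep only the two groups being compared and label them
sub <- toy[high | low, ]
sub$Risk_Group <- ifelse(sub$Family_History == "Yes", "High", "Low")

# Share of malignant tumors in each group, then a test of independence
tab <- table(sub$Risk_Group, sub$Tumor_Type)
print(prop.table(tab, margin = 1))
hyp_test <- chisq.test(tab)
print(hyp_test$p.value)
```

A small p-value together with a higher malignant share in the "High" row would support the hypothesis; a large p-value would not.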

Ethical Considerations

  • What are some ethical considerations? 

    • Key ethical considerations include patient privacy and data sensitivity, bias in data collection, assumptions in medical diagnosis, and fair treatment and algorithmic bias.

      • Even though this dataset may be anonymized, medical data is highly sensitive. Ensuring compliance with HIPAA, GDPR, or similar regulations is critical if used in real-world applications.

      • The dataset might not represent all demographics equally, and if the data is skewed toward certain regions, ethnic groups, or medical histories, models trained on it could be less effective for underrepresented populations.

      • AI models built on this data should be used as decision-support tools, not replacements for medical professionals. False positives/negatives could have severe consequences, leading to unnecessary stress, procedures, or missed diagnoses.

      • If the dataset disproportionately represents certain tumor types, age groups, or genders, the AI model might generalize poorly. Addressing bias before deploying AI in healthcare is crucial to ensure fair treatment for all patients.

  • Do you have any bias coming into this analysis?

    1. For example, do you already assume certain things? (We all have internal biases that we should recognize.)
    • I don’t know that I have many biases coming into this. I do expect strong correlations involving some of the factors, such as family history, history of illness, and age. That expectation may color my reading of the results, so I need to keep it in mind when performing this analysis.

    • Additionally, I am making assumptions that this data has been collected appropriately in an ethical way that equally encompasses all populations. This likely is not the case in reality and will need to be considered further.

Table Creation/Data Dictionary

# Create a variable to hold descriptions
descriptions <- c(
  "Patient's age in years",
  "Gender of the patient (Male, Female, Other)",
  "Patient's country of residence",
  "Size of the tumor in cm",
  "Brain lobe affected (e.g., Frontal, Temporal, Parietal)",
  "Severity of MRI results (e.g., Normal, Abnormal, Severe)",
  "Score indicating genetic risk (0–100 scale)",
  "Whether the patient has a history of smoking (Yes/No)",
  "Whether the patient consumes alcohol (Yes/No)",
  "Level of radiation exposure (Low, Medium, High)",
  "History of head injury (Yes/No)",
  "Presence of chronic illnesses (Yes/No)",
  "Systolic/Diastolic values (e.g., 120/80)",
  "Presence of diabetes (Yes/No)",
  "Classification of tumor (Benign/Malignant)",
  "Type of treatment received (e.g., Chemotherapy, Radiation, None)",
  "Estimated 5-year survival probability",
  "Rate of tumor growth (Slow, Moderate, Rapid)",
  "Whether the patient has a family history of tumors (Yes/No)",
  "Severity of symptoms (Mild, Moderate, Severe)",
  "Whether the patient has a brain tumor (Yes/No)"
)
# Create the data dictionary using R functions
data_dictionary <- data.frame(
  Variable_Name = colnames(brain_tumor_dt),
  Class = sapply(brain_tumor_dt, class),
  Continuity = ifelse(sapply(brain_tumor_dt, is.numeric), "Continuous", "Discrete"),
  Description = descriptions
)
# Print out table
data_dictionary
                          Variable_Name     Class Continuity
Age                                 Age   integer Continuous
Gender                           Gender character   Discrete
Country                         Country character   Discrete
Tumor_Size                   Tumor_Size   numeric Continuous
Tumor_Location           Tumor_Location character   Discrete
MRI_Findings               MRI_Findings character   Discrete
Genetic_Risk               Genetic_Risk   integer Continuous
Smoking_History         Smoking_History character   Discrete
Alcohol_Consumption Alcohol_Consumption character   Discrete
Radiation_Exposure   Radiation_Exposure character   Discrete
Head_Injury_History Head_Injury_History character   Discrete
Chronic_Illness         Chronic_Illness character   Discrete
Blood_Pressure           Blood_Pressure character   Discrete
Diabetes                       Diabetes character   Discrete
Tumor_Type                   Tumor_Type character   Discrete
Treatment_Received   Treatment_Received character   Discrete
Survival_Rate...       Survival_Rate...   integer Continuous
Tumor_Growth_Rate     Tumor_Growth_Rate character   Discrete
Family_History           Family_History character   Discrete
Symptom_Severity       Symptom_Severity character   Discrete
Brain_Tumor_Present Brain_Tumor_Present character   Discrete
                                                                         Description
Age                                                           Patient's age in years
Gender                                   Gender of the patient (Male, Female, Other)
Country                                               Patient's country of residence
Tumor_Size                                                   Size of the tumor in cm
Tumor_Location               Brain lobe affected (e.g., Frontal, Temporal, Parietal)
MRI_Findings                Severity of MRI results (e.g., Normal, Abnormal, Severe)
Genetic_Risk                             Score indicating genetic risk (0–100 scale)
Smoking_History                Whether the patient has a history of smoking (Yes/No)
Alcohol_Consumption                    Whether the patient consumes alcohol (Yes/No)
Radiation_Exposure                   Level of radiation exposure (Low, Medium, High)
Head_Injury_History                                  History of head injury (Yes/No)
Chronic_Illness                               Presence of chronic illnesses (Yes/No)
Blood_Pressure                              Systolic/Diastolic values (e.g., 120/80)
Diabetes                                               Presence of diabetes (Yes/No)
Tumor_Type                                Classification of tumor (Benign/Malignant)
Treatment_Received  Type of treatment received (e.g., Chemotherapy, Radiation, None)
Survival_Rate...                               Estimated 5-year survival probability
Tumor_Growth_Rate                       Rate of tumor growth (Slow, Moderate, Rapid)
Family_History           Whether the patient has a family history of tumors (Yes/No)
Symptom_Severity                       Severity of symptoms (Mild, Moderate, Severe)
Brain_Tumor_Present                   Whether the patient has a brain tumor (Yes/No)

Data Processing

Missing Values

# Check for missing values
colSums(is.na(brain_tumor_dt))
                Age              Gender             Country          Tumor_Size 
                  0                   0                   0                   0 
     Tumor_Location        MRI_Findings        Genetic_Risk     Smoking_History 
                  0                   0                   0                   0 
Alcohol_Consumption  Radiation_Exposure Head_Injury_History     Chronic_Illness 
                  0                   0                   0                   0 
     Blood_Pressure            Diabetes          Tumor_Type  Treatment_Received 
                  0                   0                   0                   0 
   Survival_Rate...   Tumor_Growth_Rate      Family_History    Symptom_Severity 
                  0                   0                   0                   0 
Brain_Tumor_Present 
                  0 

There are no missing values in this dataset, so there is no need for removal, imputation or other missing value handling techniques.

If there were missing values, I would fill them in as follows:

# Fill missing numeric columns with the median, and missing categorical columns with the mode
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) %>%
  # Mode() needs na.rm = TRUE to ignore the NAs; [1] keeps one value if several modes tie
  mutate(across(where(is.character), ~ ifelse(is.na(.), Mode(., na.rm = TRUE)[1], .)))
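
To make the imputation idea concrete, here is a minimal base R sketch on a tiny toy data frame. The data values and the helper name most_common are made up for illustration; it is not the code used in this report, only a small instance of median/mode filling.

```r
# Toy data with one missing numeric and one missing categorical value
toy <- data.frame(
  Age    = c(34, NA, 51, 47),
  Gender = c("Male", "Female", NA, "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-missing value (table() drops NA)
most_common <- function(x) names(which.max(table(x)))

# Fill numeric columns with the median and character columns with the mode
for (col in names(toy)) {
  missing <- is.na(toy[[col]])
  if (is.numeric(toy[[col]])) {
    toy[[col]][missing] <- median(toy[[col]], na.rm = TRUE)
  } else {
    toy[[col]][missing] <- most_common(toy[[col]])
  }
}

print(toy)
```

Here the missing Age becomes 47 (the median of 34, 47, and 51) and the missing Gender becomes "Female" (the most frequent value).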

Outliers

# Identify and handle outliers using IQR
numeric_cols <- brain_tumor_dt %>% select(where(is.numeric))
# Function that returns the lower and upper IQR fences for a numeric vector
outlier_bounds <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1  # lowercase name avoids shadowing stats::IQR()
  lower <- Q1 - 1.5 * iqr
  upper <- Q3 + 1.5 * iqr
  return(c(lower, upper))
}

# Replace outliers with NA
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.numeric), ~ ifelse(. < outlier_bounds(.)[1] | . > outlier_bounds(.)[2], NA, .)))
# Identify and handle outliers using Z-score
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.numeric), ~ ifelse(abs(scale(.)) > 3, NA, .)))

I decided to use both the IQR and Z-score methods to identify outliers since both are widely used. The IQR method works well for skewed or non-normal distributions because it relies on the middle 50% of the data and flags values that fall more than 1.5 × IQR beyond the quartiles. The Z-score method suits approximately normal data: it flags values more than three standard deviations from the mean. Any value flagged by either method is replaced with NA. Using the two in tandem gives a more complete picture of potential outliers.
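
As a small worked illustration of the two rules, the sketch below applies them to a made-up vector in which one value (50) sits far from the rest. The vector is purely hypothetical; the thresholds (1.5 × IQR and 3 standard deviations) are the ones used in this report.

```r
# Made-up measurements with one extreme value at the end
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 50)

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q <- quantile(x, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
iqr_flag <- x < q[1] - fence | x > q[2] + fence

# Z-score rule: flag values more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
z_flag <- abs(z) > 3

print(which(iqr_flag))
print(which(z_flag))
```

On this vector both rules flag only the last element. Note that the extreme value itself inflates the mean and standard deviation, so on small samples the Z-score rule can be much less sensitive than the IQR rule.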

Data Transformation

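I needed to rename the Survival_Rate column: read.csv() replaced the special characters in the original header with dots, leaving the column named Survival_Rate... in the output above.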

# Rename using pattern matching
colnames(brain_tumor_dt)[grep("Survival_Rate", colnames(brain_tumor_dt))] <- "Survival_Rate"

colnames(brain_tumor_dt)
 [1] "Age"                 "Gender"              "Country"            
 [4] "Tumor_Size"          "Tumor_Location"      "MRI_Findings"       
 [7] "Genetic_Risk"        "Smoking_History"     "Alcohol_Consumption"
[10] "Radiation_Exposure"  "Head_Injury_History" "Chronic_Illness"    
[13] "Blood_Pressure"      "Diabetes"            "Tumor_Type"         
[16] "Treatment_Received"  "Survival_Rate"       "Tumor_Growth_Rate"  
[19] "Family_History"      "Symptom_Severity"    "Brain_Tumor_Present"
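
The trailing dots come from how read.csv() sanitizes headers by default. The sketch below shows that behavior on an inline toy CSV; the header "Rate(%)" is a made-up example and is not claimed to be the original column name in this dataset.

```r
# Inline toy CSV with a header containing characters that are invalid in R names
csv_text <- "Age,Rate(%)\n66,58\n87,13"

# Default: check.names = TRUE turns each invalid character into a dot
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Keeping the raw names avoids the rename step but means the column must be quoted with backticks in formulas and dplyr calls, so renaming to a clean name is often the simpler choice.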

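Turning the binary Yes/No values into 1/0. Because ifelse() mixes the numbers with the untouched strings, the recoded columns remain of type character ("1"/"0"), which is why they are still treated as categorical variables in the analysis below.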

# Convert all true/false variables to 1's and 0's
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.character), ~ ifelse(. %in% c("Yes", "True"), 1, ifelse(. %in% c("No", "False"), 0, .))))
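
The sketch below, on a made-up three-row data frame, shows what type a nested ifelse() recode of this kind produces and one possible way to get true integer 0/1 columns instead. The helper to_binary is hypothetical and is not used elsewhere in this report; switching to it would move the binary columns from the categorical analysis into the numeric one.

```r
# Toy data: one Yes/No column and one other categorical column
toy <- data.frame(
  Smoking_History = c("Yes", "No", "Yes"),
  Tumor_Type      = c("Benign", "Malignant", "Benign"),
  stringsAsFactors = FALSE
)

# Nested ifelse() with the original strings as the fallback value
nested <- ifelse(toy$Smoking_History %in% c("Yes", "True"), 1,
                 ifelse(toy$Smoking_History %in% c("No", "False"), 0,
                        toy$Smoking_History))
print(class(nested))

# Hypothetical helper: recode a column only if every value is a yes/no label
to_binary <- function(x) {
  if (is.character(x) && all(x %in% c("Yes", "No", "True", "False"))) {
    as.integer(x %in% c("Yes", "True"))
  } else {
    x
  }
}

# Apply column by column while keeping the data frame structure
toy[] <- lapply(toy, to_binary)
str(toy)
```

The nested version yields the strings "1" and "0", while the helper yields integers and leaves Tumor_Type untouched.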

Splitting Blood_Pressure into two numeric columns, Systolic and Diastolic

# Split Blood_Pressure into Systolic and Diastolic columns
brain_tumor_dt <- brain_tumor_dt %>%
  separate(Blood_Pressure, into = c("Systolic", "Diastolic"), sep = "/") %>%
  mutate(across(c(Systolic, Diastolic), as.numeric))
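
A split like this silently produces NA if a reading is malformed, so a quick check afterwards can be worthwhile. The following is a base R sketch of the same split plus such a check, using three readings taken from the head() output above; it is an illustration, not the code run in this report.

```r
# Three blood pressure strings in "systolic/diastolic" form
bp <- c("122/88", "126/119", "95/85")

# Split on the slash and convert each half to a number
parts <- strsplit(bp, "/", fixed = TRUE)
toy <- data.frame(
  Systolic  = as.numeric(vapply(parts, `[`, character(1), 1)),
  Diastolic = as.numeric(vapply(parts, `[`, character(1), 2))
)

# Sanity check: every reading produced two usable numbers
all_parsed <- !anyNA(toy)
print(toy)
print(all_parsed)
```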

Exploratory Visualization

Question 1.a: What is the distribution of tumor types (Benign vs. Malignant) in the dataset?

# Bar chart for tumor type distribution
ggplot(brain_tumor_dt, aes(x = Tumor_Type, fill = Tumor_Type)) +
  geom_bar() +
  labs(title = "Distribution of Tumor Types", x = "Tumor Type", y = "Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

This chart shows the count of benign vs. malignant tumors. The two types are evenly represented in this dataset, so we can analyze both fully without either class skewing a model.

While this may not match the real-world distribution, it gives us a good look at many potential cases of both benign and malignant tumors.

Question 1.b: How does tumor size vary between benign and malignant tumors?

# Box plot for tumor size by tumor type
ggplot(brain_tumor_dt, aes(x = Tumor_Type, y = Tumor_Size, fill = Tumor_Type)) +
  geom_boxplot() +
  labs(title = "Tumor Size by Tumor Type", x = "Tumor Type", y = "Tumor Size (cm)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Here we see very similar tumor size distributions for benign and malignant tumors. This probably says more about how clean and standardized this Kaggle dataset is than about real clinical differences. As with the prevalence of benign and malignant tumors, the distribution may not be completely accurate, but it does let us see a full variety of cases, and it gives us enough data to analyze and train models for both tumor types.

Question 2.a: What correlations are there between the numeric values in the dataset?

# Select only numeric columns for correlation
numeric_cols <- brain_tumor_dt %>% select(where(is.numeric))

# Compute the correlation matrix
correlation_matrix <- cor(numeric_cols, use = "complete.obs")

# Print the correlation matrix
print(correlation_matrix)
                        Age    Tumor_Size  Genetic_Risk      Systolic
Age            1.0000000000  0.0012242342 -0.0009739399 -0.0026875324
Tumor_Size     0.0012242342  1.0000000000  0.0004892489  0.0027917385
Genetic_Risk  -0.0009739399  0.0004892489  1.0000000000  0.0029550387
Systolic      -0.0026875324  0.0027917385  0.0029550387  1.0000000000
Diastolic     -0.0030774137 -0.0018677792  0.0023899252  0.0003398664
Survival_Rate  0.0028852311  0.0019161016  0.0040355733 -0.0054996748
                  Diastolic Survival_Rate
Age           -0.0030774137   0.002885231
Tumor_Size    -0.0018677792   0.001916102
Genetic_Risk   0.0023899252   0.004035573
Systolic       0.0003398664  -0.005499675
Diastolic      1.0000000000  -0.003883594
Survival_Rate -0.0038835936   1.000000000
# Melt the correlation matrix for visualization
melted_corr <- melt(correlation_matrix)

# Plot the correlation matrix
ggplot(data = melted_corr, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-.05, .05), space = "Lab") +
  theme_minimal() +
  labs(title = "Correlation Matrix", x = "Variables", y = "Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The correlation matrix shows very weak correlations between the numeric variables in the dataset. The values are close to zero, indicating no strong linear relationships between variables like Age, Tumor_Size, Genetic_Risk, Systolic, Diastolic, and Survival_Rate.

The lack of strong correlations has two possible readings: the variables may be related in non-linear ways or through interactions that a Pearson correlation cannot detect, or they may simply be independent of one another. Non-linear models such as decision trees could help tell these cases apart. Because every pair of variables is this close to zero, the pattern may also reflect how the dataset was constructed rather than real-world variability, which could limit how far the findings apply to real-world scenarios.
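
To illustrate why a near-zero Pearson correlation does not by itself rule out a relationship, the sketch below uses made-up values where y is completely determined by x, yet the linear correlation is zero. The variable names are placeholders and have nothing to do with the dataset's columns.

```r
# y depends perfectly on x, but the dependence is symmetric, not linear
x <- -10:10
y <- x^2

# Pearson correlation measures only the linear part of the relationship
pearson <- cor(x, y)

# Correlating y with the right transformation of x recovers the link
pearson_squared <- cor(x^2, y)

print(pearson)
print(pearson_squared)
```

The first value is zero up to floating-point error and the second is 1, so a correlation matrix full of near-zero entries is a reason to look further rather than a final answer.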

Question 2.b: What correlations are there between the categorical values in the dataset?

# Select only non-numeric columns
categorical_cols <- brain_tumor_dt %>% select(where(is.character))

# Function to build a matrix of Cramér's V values for every pair of categorical variables
cramers_v_matrix <- function(data) {
  n <- ncol(data)
  result <- matrix(NA, n, n, dimnames = list(names(data), names(data)))
  
  for (i in 1:n) {
    for (j in 1:n) {
      result[i, j] <- CramerV(table(data[[i]], data[[j]]))
    }
  }
  
  return(as.data.frame(result))
}

# Compute the Cramér's V correlation matrix
categorical_corr_matrix <- cramers_v_matrix(categorical_cols)

# Print the correlation matrix
print(categorical_corr_matrix)
                          Gender     Country Tumor_Location MRI_Findings
Gender              1.0000000000 0.006768408    0.002896465  0.004929259
Country             0.0067684084 1.000000000    0.007262927  0.007373935
Tumor_Location      0.0028964646 0.007262927    1.000000000  0.003870838
MRI_Findings        0.0049292592 0.007373935    0.003870838  1.000000000
Smoking_History     0.0028717866 0.006768925    0.003115167  0.002265326
Alcohol_Consumption 0.0039394716 0.005239916    0.003851057  0.003156193
Radiation_Exposure  0.0036874778 0.005090887    0.001978456  0.002971286
Head_Injury_History 0.0022260668 0.005645962    0.002274990  0.001045439
Chronic_Illness     0.0017551630 0.005704842    0.006513967  0.003008542
Diabetes            0.0043891850 0.009248222    0.001951946  0.003006531
Tumor_Type          0.0008994088 0.006509211    0.002862941  0.001692099
Treatment_Received  0.0007786495 0.006699331    0.003842924  0.003984925
Tumor_Growth_Rate   0.0039010986 0.008226064    0.003891257  0.001353536
Family_History      0.0016063964 0.008012543    0.004708727  0.004604334
Symptom_Severity    0.0028764067 0.005941383    0.004560911  0.003675545
Brain_Tumor_Present 0.0013510813 0.006449056    0.003365100  0.001361001
                    Smoking_History Alcohol_Consumption Radiation_Exposure
Gender                 0.0028717866        0.0039394716       0.0036874778
Country                0.0067689254        0.0052399160       0.0050908873
Tumor_Location         0.0031151669        0.0038510565       0.0019784557
MRI_Findings           0.0022653259        0.0031561927       0.0029712864
Smoking_History        1.0000000000        0.0006176618       0.0033049206
Alcohol_Consumption    0.0006176618        1.0000000000       0.0030670312
Radiation_Exposure     0.0033049206        0.0030670312       1.0000000000
Head_Injury_History    0.0001517217        0.0006716796       0.0012362661
Chronic_Illness        0.0031089309        0.0004284477       0.0024735936
Diabetes               0.0013140765        0.0003943925       0.0014671455
Tumor_Type             0.0012620442        0.0006617428       0.0022010152
Treatment_Received     0.0019091323        0.0033702934       0.0039962231
Tumor_Growth_Rate      0.0050322647        0.0012265070       0.0043603150
Family_History         0.0009123463        0.0005196019       0.0040891004
Symptom_Severity       0.0018477511        0.0026966220       0.0041357149
Brain_Tumor_Present    0.0007188680        0.0003266940       0.0003796155
                    Head_Injury_History Chronic_Illness     Diabetes
Gender                     2.226067e-03    0.0017551630 0.0043891850
Country                    5.645962e-03    0.0057048419 0.0092482216
Tumor_Location             2.274990e-03    0.0065139665 0.0019519455
MRI_Findings               1.045439e-03    0.0030085416 0.0030065309
Smoking_History            1.517217e-04    0.0031089309 0.0013140765
Alcohol_Consumption        6.716796e-04    0.0004284477 0.0003943925
Radiation_Exposure         1.236266e-03    0.0024735936 0.0014671455
Head_Injury_History        1.000000e+00    0.0003205968 0.0015124032
Chronic_Illness            3.205968e-04    1.0000000000 0.0028355761
Diabetes                   1.512403e-03    0.0028355761 1.0000000000
Tumor_Type                 3.239626e-03    0.0015641982 0.0008028224
Treatment_Received         2.780067e-03    0.0026125229 0.0035212871
Tumor_Growth_Rate          3.955264e-03    0.0008696386 0.0015190369
Family_History             2.393319e-05    0.0001047399 0.0008164989
Symptom_Severity           5.563807e-03    0.0011061493 0.0019344017
Brain_Tumor_Present        1.544220e-03    0.0034344370 0.0045616401
                      Tumor_Type Treatment_Received Tumor_Growth_Rate
Gender              0.0008994088       0.0007786495      0.0039010986
Country             0.0065092113       0.0066993308      0.0082260636
Tumor_Location      0.0028629412       0.0038429244      0.0038912572
MRI_Findings        0.0016920991       0.0039849249      0.0013535359
Smoking_History     0.0012620442       0.0019091323      0.0050322647
Alcohol_Consumption 0.0006617428       0.0033702934      0.0012265070
Radiation_Exposure  0.0022010152       0.0039962231      0.0043603150
Head_Injury_History 0.0032396258       0.0027800668      0.0039552641
Chronic_Illness     0.0015641982       0.0026125229      0.0008696386
Diabetes            0.0008028224       0.0035212871      0.0015190369
Tumor_Type          1.0000000000       0.0024216472      0.0026757097
Treatment_Received  0.0024216472       1.0000000000      0.0030543753
Tumor_Growth_Rate   0.0026757097       0.0030543753      1.0000000000
Family_History      0.0018084725       0.0041150805      0.0036371668
Symptom_Severity    0.0004878983       0.0026311166      0.0023422943
Brain_Tumor_Present 0.0038575475       0.0031566969      0.0016443958
                    Family_History Symptom_Severity Brain_Tumor_Present
Gender                1.606396e-03     0.0028764067        0.0013510813
Country               8.012543e-03     0.0059413832        0.0064490559
Tumor_Location        4.708727e-03     0.0045609114        0.0033651001
MRI_Findings          4.604334e-03     0.0036755454        0.0013610010
Smoking_History       9.123463e-04     0.0018477511        0.0007188680
Alcohol_Consumption   5.196019e-04     0.0026966220        0.0003266940
Radiation_Exposure    4.089100e-03     0.0041357149        0.0003796155
Head_Injury_History   2.393319e-05     0.0055638074        0.0015442197
Chronic_Illness       1.047399e-04     0.0011061493        0.0034344370
Diabetes              8.164989e-04     0.0019344017        0.0045616401
Tumor_Type            1.808472e-03     0.0004878983        0.0038575475
Treatment_Received    4.115081e-03     0.0026311166        0.0031566969
Tumor_Growth_Rate     3.637167e-03     0.0023422943        0.0016443958
Family_History        1.000000e+00     0.0037444309        0.0022082729
Symptom_Severity      3.744431e-03     1.0000000000        0.0019828674
Brain_Tumor_Present   2.208273e-03     0.0019828674        1.0000000000
# Convert the matrix to long format for ggplot
melted_corr <- melt(as.matrix(categorical_corr_matrix))

# Plot the heatmap
ggplot(data = melted_corr, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", limit = c(-.02, .02), space = "Lab") +
  labs(title = "Categorical Correlation Matrix (Cramér's V)", x = "Variables", y = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The Categorical Correlation Matrix (Cramér’s V) shows very weak associations between most categorical variables in the dataset. The values are close to zero, indicating minimal relationships between variables like Gender, Tumor_Location, Family_History, Treatment_Received, and others.

In the columns above for Family_History, Symptom_Severity, and Brain_Tumor_Present, every off-diagonal value is below 0.01, so no pair of categorical variables shows a meaningful association. This includes the tumor-related variables: Symptom_Severity, Tumor_Type, and Tumor_Growth_Rate each have a Cramér’s V of roughly 0.002 to 0.004 with Brain_Tumor_Present, so the matrix gives no basis for prioritizing them over other predictors. Demographic variables such as Gender and Country are similarly unassociated with the tumor-related variables. The uniformly weak associations likely reflect the synthetic nature of the dataset, which limits its applicability to real-world scenarios and underscores the importance of validating findings with real-world data.
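
For reference, Cramér's V for two categorical variables can be computed from their contingency table as sqrt(chi-squared / (n * (min(rows, columns) - 1))). The sketch below is a minimal base-R illustration on made-up vectors; the helper `cramers_v` and the vectors `toy_a`, `toy_b`, and `toy_c` are hypothetical and are not part of the analysis above, which may have relied on a packaged implementation. It shows the two ends of the scale so the near-zero values in the matrix can be read against them.

```r
# Hypothetical helper: Cramér's V from the chi-squared statistic of a
# contingency table (no continuity correction).
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- chisq.test(tab, correct = FALSE)$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

# Toy data (illustrative only, not from the brain tumor dataset)
toy_a <- rep(c("Yes", "No"), each = 50)
toy_b <- toy_a                                # identical to toy_a: perfect association
toy_c <- rep(c("Low", "High"), times = 50)    # evenly spread within each level of toy_a

v_same  <- cramers_v(toy_a, toy_b)
v_indep <- cramers_v(toy_a, toy_c)
print(c(perfect = v_same, independent = v_indep))
```

Values near 1 indicate that one variable almost determines the other, while values near 0 indicate independence, which is where the entries in the matrix above sit.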

Question 3: Is there a multivariable link between Tumor_Size, Tumor_Type, Genetic_Risk, Tumor_Location, and Symptom_Severity with Survival_Rate?

# Multiple linear regression: one response (Survival_Rate), several predictors
multi_model <- lm(Survival_Rate ~ Tumor_Size + Tumor_Type + Genetic_Risk + Tumor_Location + Symptom_Severity, data = brain_tumor_dt)
summary(multi_model)

Call:
lm(formula = Survival_Rate ~ Tumor_Size + Tumor_Type + Genetic_Risk + 
    Tumor_Location + Symptom_Severity, data = brain_tumor_dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.935 -22.529   0.192  22.498  45.072 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              54.174596   0.198350 273.127   <2e-16 ***
Tumor_Size                0.018199   0.018961   0.960   0.3371    
Tumor_TypeMalignant       0.101471   0.104000   0.976   0.3292    
Genetic_Risk              0.003589   0.001782   2.014   0.0440 *  
Tumor_LocationFrontal    -0.274582   0.164653  -1.668   0.0954 .  
Tumor_LocationOccipital   0.002688   0.164491   0.016   0.9870    
Tumor_LocationParietal    0.058222   0.164405   0.354   0.7232    
Tumor_LocationTemporal   -0.057151   0.164565  -0.347   0.7284    
Symptom_SeverityModerate  0.013192   0.127459   0.104   0.9176    
Symptom_SeveritySevere    0.092365   0.127233   0.726   0.4679    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26 on 249990 degrees of freedom
Multiple R-squared:  4.612e-05, Adjusted R-squared:  1.012e-05 
F-statistic: 1.281 on 9 and 249990 DF,  p-value: 0.2411
ggplot(brain_tumor_dt, aes(x = Genetic_Risk, y = Survival_Rate, color = Tumor_Type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Interaction Plot: Genetic Risk vs Survival Rate by Tumor Type",
       x = "Genetic Risk", y = "Survival Rate (%)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

This graph explores the relationship between Genetic Risk and Survival Rate, segmented by Tumor Type (Benign and Malignant). The scatterplot shows individual data points, and the fitted lines show the linear trend for each tumor type. Both lines are nearly flat, which matches the regression output: Genetic_Risk is the only predictor below the 0.05 threshold (p = 0.044), but its coefficient is only about 0.0036 percentage points of survival per unit of genetic risk, and the model as a whole is not significant (F-test p = 0.2411, R-squared of about 4.6e-05). This aligns with earlier findings in the document, such as the weak correlations observed in the numeric variables and the categorical correlation matrix, which indicate that no single factor strongly predicts survival outcomes.

These results highlight the need for multivariate approaches to uncover complex interactions between variables like Tumor_Size, Tumor_Location, and Symptom_Severity. Additionally, the dataset’s synthetic nature and even distribution of benign and malignant tumors provide a controlled environment for analysis but may not fully reflect real-world variability. This reinforces the importance of validating findings with real-world data to ensure applicability in medical diagnosis and treatment planning.
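
The summary statistics in the regression output are tied together by simple arithmetic, which makes the weakness of the fit easy to check by hand. The sketch below recomputes R-squared, adjusted R-squared, and the overall p-value from the reported F-statistic and degrees of freedom. The variable names are illustrative, and small differences from the printed output are expected because the F-statistic is rounded to three decimals.

```r
# Values copied from the summary(multi_model) output above
f_stat   <- 1.281
df_model <- 9
df_resid <- 249990

# R-squared implied by the F-statistic: F = (R2 / df_model) / ((1 - R2) / df_resid)
r_squared <- (f_stat * df_model) / (f_stat * df_model + df_resid)

# Adjusted R-squared penalizes for the number of predictors
n_obs <- df_model + df_resid + 1
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared, p_value = p_value))
```

An R-squared this close to zero means the five predictors together account for a negligible share of the variation in Survival_Rate.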

Question 4: How does the mean survival rate vary with binned genetic risk levels across different tumor types (Benign vs. Malignant)?

# Aggregate data by binned Genetic_Risk (10 equal-width bins) and Tumor_Type
aggregated_data <- brain_tumor_dt %>%
  group_by(Genetic_Risk = cut(Genetic_Risk, breaks = 10), Tumor_Type) %>%
  summarize(Mean_Survival_Rate = mean(Survival_Rate, na.rm = TRUE), .groups = "drop")
# Plot aggregated data with lines of best fit
ggplot(aggregated_data, aes(x = Genetic_Risk, y = Mean_Survival_Rate, color = Tumor_Type, group = Tumor_Type)) +
  geom_line() +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, aes(group = Tumor_Type)) +
  labs(title = "Mean Survival Rate by Genetic Risk and Tumor Type",
       x = "Genetic Risk (Binned)", y = "Mean Survival Rate (%)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

This graph shows the Mean Survival Rate (%) within each of ten equal-width Genetic Risk bins, separately for Benign and Malignant tumors, with a line of best fit for each tumor type. For benign tumors, there is a slight upward trend in survival rate with increasing genetic risk, while malignant tumors show a more variable pattern with no clear linear trend. These differences should be read cautiously: the regression in Question 3 explained essentially none of the variance in survival rate (R-squared of about 4.6e-05, residual standard error of 26), so bin-to-bin movement of this kind may simply be noise, and other factors may play a larger role in determining survival rates.
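
One way to judge whether bin-to-bin differences are more than noise is to attach a standard error to each binned mean. The sketch below does this in base R on simulated data; the data frame `toy`, its value ranges, and the column names `Risk_Bin`, `Mean`, and `SE` are assumptions made for illustration and do not come from the dataset.

```r
set.seed(42)

# Simulated stand-in data (illustrative only): survival is unrelated to risk
toy <- data.frame(
  Genetic_Risk  = runif(2000, min = 0, max = 100),
  Survival_Rate = runif(2000, min = 10, max = 100)
)

# cut(breaks = 10) splits the observed range into 10 equal-width intervals
toy$Risk_Bin <- cut(toy$Genetic_Risk, breaks = 10)

# Mean and standard error of the mean within each bin
bin_mean <- tapply(toy$Survival_Rate, toy$Risk_Bin, mean)
bin_se   <- tapply(toy$Survival_Rate, toy$Risk_Bin,
                   function(v) sd(v) / sqrt(length(v)))

bin_summary <- data.frame(
  Risk_Bin = names(bin_mean),
  Mean     = as.numeric(bin_mean),
  SE       = as.numeric(bin_se)
)
print(bin_summary)
```

If the binned means differ by less than about two standard errors, the apparent trend is within what sampling variation alone would produce.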

Question 5: How does the mean survival rate vary across different tumor growth rates (Slow, Moderate, Rapid) and symptom severity levels (Mild, Moderate, Severe)?

# Convert Tumor_Growth_Rate to a factor with the desired order
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(Tumor_Growth_Rate = factor(Tumor_Growth_Rate, levels = c("Slow", "Moderate", "Rapid")))

# Aggregate data by Tumor_Growth_Rate and Symptom_Severity
aggregated_data_updated <- brain_tumor_dt %>%
  group_by(Tumor_Growth_Rate, Symptom_Severity) %>%
  summarize(Mean_Survival_Rate = mean(Survival_Rate, na.rm = TRUE), .groups = "drop")

# Plot 1: Data points and lines connecting them
ggplot(aggregated_data_updated, aes(x = Tumor_Growth_Rate, y = Mean_Survival_Rate, color = Symptom_Severity, group = Symptom_Severity)) +
  geom_line() +
  geom_point() +
  labs(title = "Mean Survival Rate by Tumor Growth Rate and Symptom Severity (Data Points)",
       x = "Tumor Growth Rate", y = "Mean Survival Rate (%)") +
  theme_minimal()

# Plot 2: Best-fit lines only
ggplot(aggregated_data_updated, aes(x = Tumor_Growth_Rate, y = Mean_Survival_Rate, color = Symptom_Severity, group = Symptom_Severity)) +
  geom_smooth(method = "lm", se = FALSE, aes(group = Symptom_Severity)) +
  labs(title = "Best-Fit Lines: Tumor Growth Rate vs. Mean Survival Rate",
       x = "Tumor Growth Rate", y = "Mean Survival Rate (%)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

These visualizations explore the relationship between tumor growth rate, symptom severity, and mean survival rate. The first graph, “Mean Survival Rate by Tumor Growth Rate and Symptom Severity (Data Points)”, shows the raw data trends, with mean survival rates plotted for each combination of tumor growth rate (Slow, Moderate, Rapid) and symptom severity (Mild, Moderate, Severe). The data points are connected by lines to highlight the variability within each symptom severity group. The trends suggest that survival rates fluctuate slightly across growth rates, with Severe symptoms generally associated with higher survival rates compared to Mild symptoms, though the patterns are inconsistent.

The second graph, “Best-Fit Lines: Tumor Growth Rate vs. Mean Survival Rate”, provides a clearer view of the overall trends using linear regression lines. For Severe symptoms, survival rates decrease slightly as tumor growth rate increases, while for Moderate symptoms, survival rates show a slight upward trend with faster growth rates. The Mild symptom group exhibits a relatively flat trend, indicating minimal impact of growth rate on survival. Together, these graphs suggest that symptom severity may influence how tumor growth rate affects survival outcomes, highlighting the need for further analysis to uncover potential interactions between these variables.

Exploratory Analysis Questions

Question 1: What is the distribution of symptom severity in the dataset?

Why this is important:
Understanding the distribution of symptom severity (Mild, Moderate, Severe) provides insight into the dataset’s composition and whether it represents a balanced range of cases. This is essential for ensuring that analyses and models are not biased toward one severity level.

Exploration Approach:
A bar chart was created to visualize the count of cases for each symptom severity level.

# Bar chart for symptom severity distribution
ggplot(brain_tumor_dt, aes(x = Symptom_Severity, fill = Symptom_Severity)) +
  geom_bar() +
  labs(title = "Distribution of Symptom Severity", x = "Symptom Severity", y = "Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Findings:
The bar chart shows that the dataset has a balanced distribution of symptom severity levels (Mild, Moderate, Severe). Each category has a similar number of observations, with no single category dominating the dataset. This balance ensures that analyses and models will not be biased toward any specific severity level, allowing for fair comparisons and robust insights across all severity groups.

This balanced representation is particularly useful for exploring the relationships between symptom severity and other variables, such as survival rate or tumor growth rate, without the risk of skewed results due to overrepresentation of one category.
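
The balance seen in the bar chart can also be checked numerically. The sketch below tabulates counts and shares and runs a chi-squared goodness-of-fit test against equal proportions; the vector `toy_severity` is a made-up, perfectly balanced example rather than the real column.

```r
# Toy example (illustrative only): 100 cases in each severity level
toy_severity <- rep(c("Mild", "Moderate", "Severe"), times = c(100, 100, 100))

severity_counts <- table(toy_severity)
severity_share  <- prop.table(severity_counts)

# Goodness-of-fit test against equal proportions (the default null hypothesis
# when a single vector of counts is supplied)
balance_test <- chisq.test(severity_counts)

print(severity_share)
print(balance_test$p.value)
```

A large p-value means the counts are consistent with equal representation; on the real data, the same calls would be applied to brain_tumor_dt$Symptom_Severity.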

Question 2: How does tumor size relate to symptom severity?

Why this is important:
Tumor size is often linked to symptom severity, with larger tumors potentially causing more severe symptoms. Understanding this relationship can help prioritize cases for treatment based on tumor size.

Exploration Approach:
A box plot was created to compare tumor sizes across different levels of symptom severity.

# Box plot for tumor size by symptom severity
ggplot(brain_tumor_dt, aes(x = Symptom_Severity, y = Tumor_Size, fill = Symptom_Severity)) +
  geom_boxplot() +
  labs(title = "Tumor Size by Symptom Severity", x = "Symptom Severity", y = "Tumor Size (cm)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Findings:
The box plot shows the distribution of tumor sizes across different levels of symptom severity (Mild, Moderate, Severe). The median tumor size is consistent across all severity levels, with no significant differences observed. The interquartile ranges (IQRs) and overall spread of tumor sizes are also similar for all groups. This suggests that tumor size alone may not be a strong determinant of symptom severity in this dataset.

However, the lack of variation could be due to the synthetic nature of the dataset, which may not fully capture real-world variability. Further analysis incorporating other factors, such as tumor growth rate or location, may provide additional insights into the relationship between tumor size and symptom severity.
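
A box plot alone cannot establish that the groups do not differ, so a formal comparison is a natural complement. The sketch below compares group medians and applies a Kruskal-Wallis rank-sum test on simulated data; the data frame `toy` and the tumor size range used to generate it are assumptions for illustration only.

```r
set.seed(1)

# Simulated stand-in data (illustrative only): size is unrelated to severity
toy <- data.frame(
  Symptom_Severity = factor(rep(c("Mild", "Moderate", "Severe"), each = 200)),
  Tumor_Size       = runif(600, min = 0.5, max = 10)
)

# Median tumor size per severity level
group_medians <- tapply(toy$Tumor_Size, toy$Symptom_Severity, median)

# Kruskal-Wallis test: do the three groups share the same distribution?
kw <- kruskal.test(Tumor_Size ~ Symptom_Severity, data = toy)

print(group_medians)
print(kw)
```

The Kruskal-Wallis test is used here because it makes no normality assumption; a one-way ANOVA would be a reasonable alternative.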

Question 3: How does survival rate differ across tumor types (Benign vs. Malignant)?

Why this is important:
Survival rate is a critical outcome metric. Comparing survival rates between benign and malignant tumors can provide insights into the severity and prognosis of each tumor type.

Exploration Approach:
A violin plot was created to visualize the distribution of survival rates for benign and malignant tumors.

# Violin plot for survival rate by tumor type
ggplot(brain_tumor_dt, aes(x = Tumor_Type, y = Survival_Rate, fill = Tumor_Type)) +
  geom_violin(trim = FALSE) +
  labs(title = "Survival Rate by Tumor Type", x = "Tumor Type", y = "Survival Rate (%)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")

Findings:
The violin plot illustrates the distribution of survival rates for benign and malignant tumors. Both tumor types show a similarly wide range of survival rates; because the plot uses trim = FALSE, the violin tails extend somewhat beyond the observed minimum and maximum. The medians appear nearly identical, and the earlier lm() model of Survival_Rate estimated the malignant-versus-benign difference at only about 0.1 percentage points (p = 0.33). The distributions are relatively uniform, with no significant skewness or clustering observed.

This suggests that while tumor type (benign vs. malignant) may influence survival rates, the impact is not as pronounced as expected. Other factors, such as tumor size, growth rate, or symptom severity, may play a more significant role in determining survival outcomes. This highlights the need for multivariate analysis to uncover the combined effects of these variables.

Question 4: What happens to survival rate when symptom severity increases?

Why this is important:
Symptom severity is a key indicator of patient condition. Understanding its impact on survival rate can help identify high-risk patients and prioritize treatment.

Exploration Approach:
A line plot was created to show the mean survival rate for each level of symptom severity.

Code:

# Line plot for survival rate by symptom severity
aggregated_severity <- brain_tumor_dt %>%
  group_by(Symptom_Severity) %>%
  summarize(Mean_Survival_Rate = mean(Survival_Rate, na.rm = TRUE))

ggplot(aggregated_severity, aes(x = Symptom_Severity, y = Mean_Survival_Rate, group = 1)) +
  geom_line(color = "blue") +
  geom_point(size = 3, color = "red") +
  labs(title = "Survival Rate by Symptom Severity", x = "Symptom Severity", y = "Mean Survival Rate (%)") +
  theme_minimal()

Findings:
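The line plot shows the mean survival rate for each level of symptom severity (Mild, Moderate, Severe). Interestingly, survival rates increase as symptom severity progresses from Mild to Severe. This trend is counterintuitive, as one might expect more severe symptoms to correlate with lower survival rates. The increase is small, however: the earlier lm() model of Survival_Rate estimated the Severe-versus-Mild difference at about 0.09 percentage points (p = 0.47).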

This pattern could indicate that patients with severe symptoms are receiving more aggressive or effective treatments, leading to higher survival rates. Alternatively, it may reflect the synthetic nature of the dataset, where symptom severity does not align with real-world outcomes. Further analysis is needed to explore the underlying factors driving this trend, such as treatment type, tumor growth rate, or other patient characteristics.

Hypothesis Generation

Hypothesis: Patients with severe symptoms and rapid tumor growth rates have significantly lower survival rates compared to patients with mild symptoms and slow tumor growth rates, regardless of tumor type (Benign or Malignant).

EDA Observations:

  1. Symptom severity appears to influence survival rates, with severe symptoms generally associated with higher survival rates, which may reflect aggressive treatment or dataset characteristics.

  2. Tumor growth rate shows a slight relationship with survival rates, where rapid growth rates tend to correlate with worse outcomes.

  3. The interaction between symptom severity and tumor growth rate may amplify the impact on survival rates, as seen in the variability of trends across groups.

  4. The synthetic nature of the dataset may introduce biases or unrealistic patterns that do not fully reflect real-world scenarios. For example, the balanced distribution of tumor types and symptom severity levels, as well as the weak correlations between variables, may limit the generalizability of findings to real-world applications. This highlights the importance of validating results with real-world clinical data.

Relationships and patterns:

The exploratory data analysis revealed several key relationships and patterns within the dataset:

  1. Symptom Severity and Survival Rate:

    • Contrary to expectations, survival rates increased with symptom severity (Mild < Moderate < Severe). This counterintuitive trend may reflect aggressive treatment for patients with severe symptoms or biases introduced by the synthetic nature of the dataset. It suggests that symptom severity alone may not be a direct predictor of survival outcomes but could interact with other factors, such as treatment type or tumor growth rate.
  2. Tumor Growth Rate and Survival Rate:

    • Tumor growth rate (Slow, Moderate, Rapid) showed a slight relationship with survival rates. Patients with rapid tumor growth tended to have lower survival rates, particularly when combined with severe symptoms. This aligns with clinical expectations, as faster-growing tumors are often more aggressive and harder to treat.
  3. Genetic Risk and Tumor Type:

    • Genetic risk appeared to have a stronger and more consistent impact on survival rates for benign tumors compared to malignant ones. For benign tumors, survival rates increased slightly with higher genetic risk, while malignant tumors exhibited more variability, with no clear trend. This suggests that genetic risk may play a more significant role in less aggressive tumor types.
  4. Tumor Size and Symptom Severity:

    • Tumor size was consistent across all levels of symptom severity, with no significant differences observed. This indicates that tumor size alone may not determine symptom severity, and other factors, such as tumor location or growth rate, may play a more critical role in influencing symptoms.
  5. Weak Correlations Between Variables:

    • Both the numeric and categorical correlation matrices revealed weak associations between most variables. For example, Age, Tumor_Size, and Genetic_Risk showed minimal correlation with survival rates. Similarly, categorical variables like Gender and Country had almost no association with tumor-related outcomes. This suggests that survival outcomes and tumor characteristics may depend on complex, non-linear interactions rather than simple linear relationships.
  6. Synthetic Nature of the Dataset:

    • The dataset’s synthetic nature introduced balanced distributions (e.g., equal representation of benign and malignant tumors, and symptom severity levels) and weak correlations between variables. While this provides a controlled environment for analysis, it may not fully reflect real-world variability, limiting the generalizability of findings. For example, the even distribution of tumor types and symptom severity levels may obscure real-world trends where certain tumor types or severity levels are more prevalent.
  7. Interaction Effects:

    • The interaction between symptom severity and tumor growth rate emerged as a potential key factor influencing survival outcomes. Patients with severe symptoms and rapid tumor growth rates appeared to be at the highest risk of poor survival outcomes. This highlights the importance of exploring interaction effects in multivariate analyses to uncover hidden patterns.

These relationships and patterns suggest that survival outcomes are influenced by a combination of factors, including symptom severity, tumor growth rate, and genetic risk. However, the weak correlations and synthetic nature of the dataset emphasize the need for advanced modeling techniques and validation with real-world data to ensure the applicability of findings.

Stakeholder value:

This hypothesis is critical for healthcare providers and researchers as it identifies high-risk patient groups who may benefit from prioritized and aggressive treatment strategies. For policymakers and healthcare administrators, it provides insights into resource allocation for patients with the most severe conditions.

Additional data necessary:

  1. Longitudinal data on survival outcomes to track changes over time.

  2. Detailed treatment information (e.g., type, duration, and effectiveness) to control for treatment effects.

  3. Additional demographic data (e.g., socioeconomic status, access to healthcare) to account for confounding factors.

  4. Biomarker data to explore biological mechanisms underlying symptom severity and tumor growth.

  5. More realistic (non-synthetic) data, so that this analysis and any models built from it are more accurate and applicable to real-world use cases.

Appropriate analysis methods:

  1. Multivariate Regression Analysis

    • Multivariate regression can model the relationship between survival rate and multiple independent variables, such as symptom severity, tumor growth rate, tumor type, and genetic risk. Interaction terms can be included to test whether the combined effect of symptom severity and tumor growth rate significantly impacts survival rates.
    • This method provides a clear, interpretable framework for understanding how individual variables and their interactions influence survival rates.
  2. Interaction Analysis

    • Interaction analysis explicitly tests whether the effect of one variable on survival rate depends on the level of another variable. This can be done within regression models or using ANOVA techniques.
    • It directly addresses the hypothesis by quantifying the combined impact of symptom severity and tumor growth rate on survival outcomes.
  3. Decision Trees and Random Forests

    • Decision trees and random forests are machine learning methods that can model non-linear relationships and interactions between variables. They can identify the most important predictors of survival rates and how these predictors interact.

    • These methods are robust to complex, non-linear relationships and can handle large datasets effectively. They also provide variable importance scores to identify key drivers of survival outcomes.

  4. Clustering Analysis

    • Clustering methods, such as k-means or hierarchical clustering, can group patients based on similar characteristics. These clusters can then be analyzed for differences in survival rates.

    • It helps identify subgroups of patients with similar profiles, which can provide insights into high-risk groups and guide personalized treatment strategies.
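
The first two methods in the list above can be sketched together: fit one model with main effects only, fit a second with the symptom severity by growth rate interaction, and compare them with an F-test. The example below runs on simulated data; the data frame `toy` and the names `additive_model`, `interaction_model`, and `model_comparison` are hypothetical, and the simulated survival values carry no real signal.

```r
set.seed(123)
n <- 900

# Simulated stand-in data (illustrative only, not the brain tumor dataset)
toy <- data.frame(
  Symptom_Severity  = factor(sample(c("Mild", "Moderate", "Severe"), n, replace = TRUE)),
  Tumor_Growth_Rate = factor(sample(c("Slow", "Moderate", "Rapid"), n, replace = TRUE),
                             levels = c("Slow", "Moderate", "Rapid")),
  Survival_Rate     = runif(n, min = 10, max = 100)
)

# Main effects only
additive_model <- lm(Survival_Rate ~ Symptom_Severity + Tumor_Growth_Rate, data = toy)

# Main effects plus the severity-by-growth-rate interaction
interaction_model <- lm(Survival_Rate ~ Symptom_Severity * Tumor_Growth_Rate, data = toy)

# F-test: do the interaction terms improve the fit?
model_comparison <- anova(additive_model, interaction_model)
print(model_comparison)
```

A small p-value in the comparison would indicate that the effect of growth rate on survival differs by symptom severity, which is the pattern the hypothesis predicts.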

Implications of the hypothesis:

  • If true:

    • High-risk patients (severe symptoms and rapid tumor growth) can be identified early and prioritized for intensive treatment.

    • Healthcare providers can develop targeted intervention strategies to improve survival outcomes for these patients.

    • Resource allocation can be optimized to focus on the most vulnerable groups.

  • If false:

    • It would suggest that symptom severity and tumor growth rate may not be as critical in determining survival outcomes as previously thought.

    • Further investigation would be needed to identify other factors (e.g., genetic risk, treatment type) that have a stronger influence on survival rates.

    • It may highlight the need for more comprehensive datasets or alternative modeling approaches to uncover hidden patterns.

Stakeholder Communication

The exploratory analysis of the Brain Tumor Prediction Dataset revealed several key insights. First, the dataset is well balanced across symptom severity levels (Mild, Moderate, Severe) and tumor types (Benign, Malignant), which avoids bias from class imbalance. Tumor size was consistent across symptom severity levels, suggesting that size alone may not determine symptom severity. Survival rates were similar for benign and malignant tumors, with no visible differences in their distributions. Interestingly, survival rates increased slightly with symptom severity, potentially reflecting aggressive treatment for severe cases or dataset-specific characteristics.

The numeric correlation matrix showed weak correlations between variables like Age, Tumor_Size, and Genetic_Risk with survival rates, indicating that survival outcomes may depend on complex, non-linear interactions. Similarly, the categorical correlation matrix (Cramér’s V) revealed uniformly weak associations between the categorical variables; Symptom_Severity, Tumor_Type, and Tumor_Growth_Rate were no more strongly associated with Brain_Tumor_Present than the other variables.

Based on these findings, I hypothesize that patients with severe symptoms and rapid tumor growth rates have significantly lower survival rates compared to those with mild symptoms and slow tumor growth rates, regardless of tumor type. This hypothesis is meaningful for stakeholders as it identifies high-risk groups that may benefit from targeted interventions and resource prioritization.

It is also important to note that these patterns may be artifacts of a synthetic dataset that does not fully represent reality, which could distort the trends reported here.

To test this hypothesis, additional data on treatment types, longitudinal survival outcomes, and demographic factors would be valuable. Multivariate regression, interaction analysis, and survival analysis are recommended to validate the hypothesis and uncover complex relationships. If the hypothesis is confirmed, healthcare providers can prioritize high-risk patients for intensive care, while policymakers can allocate resources more effectively. If disproven, further investigation into other factors influencing survival outcomes will be necessary.

Recommended Next Steps:

  1. Collect additional data on treatment types, patient demographics, and longitudinal outcomes.

  2. Perform advanced statistical and machine learning analyses to validate the hypothesis.

  3. Develop predictive models to identify high-risk patients early.

  4. Validate findings with real-world clinical data to ensure applicability.

The following visualization highlights the relationship between tumor growth rate, symptom severity, and survival rate, emphasizing the need to focus on patients with rapid tumor growth and severe symptoms:

# Stakeholder-friendly visualization
ggplot(aggregated_data_updated, aes(x = Tumor_Growth_Rate, y = Mean_Survival_Rate, color = Symptom_Severity, group = Symptom_Severity)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Survival Rate by Tumor Growth Rate and Symptom Severity",
       x = "Tumor Growth Rate", y = "Mean Survival Rate (%)",
       color = "Symptom Severity") +
  theme_minimal()