library(reshape2)
# Reading in the dataset
brain_tumor_dt <- read.csv("data/Brain_Tumor_Prediction_Dataset.csv")
Brief Data Descriptions
# Getting a basic frame of reference for the dataset with intro
# analysis tools
summary(brain_tumor_dt)
Age Gender Country Tumor_Size
Min. : 5.00 Length:250000 Length:250000 Min. : 0.500
1st Qu.:26.00 Class :character Class :character 1st Qu.: 2.870
Median :47.00 Mode :character Mode :character Median : 5.260
Mean :46.96 Mean : 5.252
3rd Qu.:68.00 3rd Qu.: 7.630
Max. :89.00 Max. :10.000
Tumor_Location MRI_Findings Genetic_Risk Smoking_History
Length:250000 Length:250000 Min. : 0 Length:250000
Class :character Class :character 1st Qu.: 25 Class :character
Mode :character Mode :character Median : 50 Mode :character
Mean : 50
3rd Qu.: 75
Max. :100
Alcohol_Consumption Radiation_Exposure Head_Injury_History Chronic_Illness
Length:250000 Length:250000 Length:250000 Length:250000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Blood_Pressure Diabetes Tumor_Type Treatment_Received
Length:250000 Length:250000 Length:250000 Length:250000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Survival_Rate... Tumor_Growth_Rate Family_History Symptom_Severity
Min. :10.00 Length:250000 Length:250000 Length:250000
1st Qu.:32.00 Class :character Class :character Class :character
Median :55.00 Mode :character Mode :character Mode :character
Mean :54.48
3rd Qu.:77.00
Max. :99.00
Brain_Tumor_Present
Length:250000
Class :character
Mode :character
dim(brain_tumor_dt)
[1] 250000 21
head(brain_tumor_dt)
Age Gender Country Tumor_Size Tumor_Location MRI_Findings Genetic_Risk
1 66 Other China 8.70 Cerebellum Severe 81
2 87 Female Australia 8.14 Temporal Normal 65
3 41 Male Canada 6.02 Occipital Severe 100
4 52 Male Japan 7.26 Occipital Normal 19
5 84 Female Brazil 7.94 Temporal Abnormal 47
6 29 Male Germany 7.97 Frontal Abnormal 70
Smoking_History Alcohol_Consumption Radiation_Exposure Head_Injury_History
1 No Yes Medium No
2 No Yes Medium No
3 Yes No Low Yes
4 Yes Yes High Yes
5 No Yes Medium No
6 Yes Yes Medium No
Chronic_Illness Blood_Pressure Diabetes Tumor_Type Treatment_Received
1 Yes 122/88 No Malignant None
2 No 126/119 No Malignant None
3 No 118/65 No Benign Chemotherapy
4 No 165/119 Yes Benign Radiation
5 Yes 156/97 Yes Malignant None
6 No 95/85 No Malignant Surgery
Survival_Rate... Tumor_Growth_Rate Family_History Symptom_Severity
1 58 Slow Yes Mild
2 13 Rapid Yes Severe
3 67 Slow Yes Moderate
4 85 Moderate No Moderate
5 17 Moderate No Moderate
6 65 Rapid Yes Severe
Brain_Tumor_Present
1 No
2 No
3 Yes
4 Yes
5 No
6 No
As we can see from inspecting the data above, this dataset contains 250,000 observations, each with 21 variables. It mixes quantitative and qualitative values: integers (e.g., Age), floating-point numbers (e.g., Tumor_Size), and character strings that hold Yes/No flags, categorical labels, and the combined blood pressure reading.
I retrieved this dataset from Kaggle at the following link: https://www.kaggle.com/datasets/ankushpanday1/brain-tumor-prediction-dataset
Questions to Answer
What is the dataset about?
The Brain Tumor Prediction Dataset is a collection of medical records aimed at facilitating research and development in brain tumor diagnosis and prediction. It comprises 250,000 patient records, each with 21 medical features. The dataset includes MRI scan results, demographic information, medical history, and other relevant attributes.
Where did the data come from?
I got this data online while browsing numerous dataset websites. I eventually found this specific dataset on Kaggle.
Is there a website or publication to cite the authors? If yes, include it.
What are your motivations for exploring this dataset?
For many years I have been very interested in improving medical diagnoses and finding meaningful ways to help people and communities. I originally wanted to be a radiologist, reading medical images and finding meaningful diagnoses within them to help my patients. Just before college I learned about the potential of computer science, data science, and computer algorithms within the medical space. This led me to want to develop software capable of outperforming the systems we have today, to provide even greater and more widespread impact on people and communities. This dataset strongly aligns with these goals and is a great step into the world of medical diagnosis algorithms and the interplay between medicine and computing.
What questions do you want to answer? (This involves identifying and articulating the key questions that you aim to explore through your analysis of the dataset. These questions set the direction for your research and data exploration. They are typically broad and open-ended, aimed at uncovering insights or patterns within the data. IE: What are the main factors that affect customer satisfaction?)
What are the main factors that affect the presence of brain tumors?
What factors influence the size and/or location of brain tumors?
Are there factors that correlate with the severity of a brain tumor?
Are there any factors commonly associated with tumors that do not actually affect the presence, location, or severity of a tumor?
Provide a hypothesis about the dataset. (Formulating a hypothesis involves making a specific, testable statement based on your initial understanding or assumptions about the data. A hypothesis is more focused than a general question and often predicts a relationship between variables that you can test through your analysis. IE: Customers with shorter wait times report higher satisfaction levels, suggesting a significant negative correlation between wait time and satisfaction.)
Patients with a family history of brain tumors and a high genetic risk score are more likely to develop malignant tumors than those without a family history and low genetic risk.
Ethical Considerations
What are some ethical considerations?
There are many ethical considerations here, including patient privacy and data sensitivity, bias in data collection, assumptions in medical diagnosis, and fair treatment and algorithmic bias.
Even though this dataset may be anonymized, medical data is highly sensitive. Ensuring compliance with HIPAA, GDPR, or similar regulations is critical if used in real-world applications.
The dataset might not represent all demographics equally, and if the data is skewed toward certain regions, ethnic groups, or medical histories, models trained on it could be less effective for underrepresented populations.
AI models built on this data should be used as decision-support tools, not replacements for medical professionals. False positives/negatives could have severe consequences, leading to unnecessary stress, procedures, or missed diagnoses.
If the dataset disproportionately represents certain tumor types, age groups, or genders, the AI model might generalize poorly. Addressing bias before deploying AI in healthcare is crucial to ensure fair treatment for all patients.
Do you have any bias coming into this analysis?
Such as do you assume certain things already (we all have internal bias that we should recognize)
I don’t believe I have many biases coming into this. I do expect strong correlations between the outcomes and factors such as family history, history of illness, and age, which may influence how I read the results and is something I need to keep in mind when performing this analysis.
Additionally, I am making assumptions that this data has been collected appropriately in an ethical way that equally encompasses all populations. This likely is not the case in reality and will need to be considered further.
Table Creation/Data Dictionary
# Create a variable to hold descriptions
descriptions <- c(
  "Patient's age in years",
  "Gender of the patient (Male, Female, Other)",
  "Patient's country of residence",
  "Size of the tumor in cm",
  "Brain lobe affected (e.g., Frontal, Temporal, Parietal)",
  "Severity of MRI results (e.g., Normal, Abnormal, Severe))",
  "Score indicating genetic risk (0–100 scale)",
  "Whether the patient has a history of smoking (Yes/No)",
  "Whether the patient consumes alcohol (Yes/No)",
  "Level of radiation exposure (Low, Medium, High)",
  "History of head injury (Yes/No)",
  "Presence of chronic illnesses (Yes/No)",
  "Systolic/Diastolic values (e.g., 120/80)",
  "Presence of diabetes (Yes/No)",
  "Classification of tumor (Benign/Malignant)",
  "Type of treatment received (e.g., Chemotherapy, Radiation, None)",
  "Estimated 5-year survival probability",
  "Rate of tumor growth (Slow, Moderate, Rapid)",
  "Whether the patient has a family history of tumors (Yes/No)",
  "Severity of symptoms (Mild, Moderate, Severe)",
  "Whether the patient has a brain tumor (Yes/No)"
)
# Create the data dictionary using R functions
data_dictionary <- data.frame(
  Variable_Name = colnames(brain_tumor_dt),
  Class = sapply(brain_tumor_dt, class),
  Continuity = ifelse(sapply(brain_tumor_dt, is.numeric), "Continuous", "Discrete"),
  Description = descriptions
)
# Print out table
data_dictionary
Variable_Name Class Continuity
Age Age integer Continuous
Gender Gender character Discrete
Country Country character Discrete
Tumor_Size Tumor_Size numeric Continuous
Tumor_Location Tumor_Location character Discrete
MRI_Findings MRI_Findings character Discrete
Genetic_Risk Genetic_Risk integer Continuous
Smoking_History Smoking_History character Discrete
Alcohol_Consumption Alcohol_Consumption character Discrete
Radiation_Exposure Radiation_Exposure character Discrete
Head_Injury_History Head_Injury_History character Discrete
Chronic_Illness Chronic_Illness character Discrete
Blood_Pressure Blood_Pressure character Discrete
Diabetes Diabetes character Discrete
Tumor_Type Tumor_Type character Discrete
Treatment_Received Treatment_Received character Discrete
Survival_Rate... Survival_Rate... integer Continuous
Tumor_Growth_Rate Tumor_Growth_Rate character Discrete
Family_History Family_History character Discrete
Symptom_Severity Symptom_Severity character Discrete
Brain_Tumor_Present Brain_Tumor_Present character Discrete
Description
Age Patient's age in years
Gender Gender of the patient (Male, Female, Other)
Country Patient's country of residence
Tumor_Size Size of the tumor in cm
Tumor_Location Brain lobe affected (e.g., Frontal, Temporal, Parietal)
MRI_Findings Severity of MRI results (e.g., Normal, Abnormal, Severe))
Genetic_Risk Score indicating genetic risk (0–100 scale)
Smoking_History Whether the patient has a history of smoking (Yes/No)
Alcohol_Consumption Whether the patient consumes alcohol (Yes/No)
Radiation_Exposure Level of radiation exposure (Low, Medium, High)
Head_Injury_History History of head injury (Yes/No)
Chronic_Illness Presence of chronic illnesses (Yes/No)
Blood_Pressure Systolic/Diastolic values (e.g., 120/80)
Diabetes Presence of diabetes (Yes/No)
Tumor_Type Classification of tumor (Benign/Malignant)
Treatment_Received Type of treatment received (e.g., Chemotherapy, Radiation, None)
Survival_Rate... Estimated 5-year survival probability
Tumor_Growth_Rate Rate of tumor growth (Slow, Moderate, Rapid)
Family_History Whether the patient has a family history of tumors (Yes/No)
Symptom_Severity Severity of symptoms (Mild, Moderate, Severe)
Brain_Tumor_Present Whether the patient has a brain tumor (Yes/No)
Data Processing
Missing Values
# Check for missing values
colSums(is.na(brain_tumor_dt))
There are no missing values in this dataset, so there is no need for removal, imputation or other missing value handling techniques.
If there were missing values, I would fill them in as follows:
# Fill missing numeric columns with the median value, and missing
# categorical columns with the mode.
# Mode() comes from DescTools; na.rm = TRUE ignores the missing values and
# [1] keeps a single value if there are ties.
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) %>%
  mutate(across(where(is.character), ~ ifelse(is.na(.), Mode(., na.rm = TRUE)[1], .)))
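Since the dataset has no missing values, the imputation rule above never actually runs. As a quick illustration of what it would do, here is a minimal base-R sketch on a small made-up data frame. The data frame `toy` and the helper `mode_value()` are illustrative assumptions and are not part of the analysis.

```r
# Toy data with one missing numeric and one missing categorical value
toy <- data.frame(
  age    = c(20, 30, NA, 50),
  smoker = c("Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-missing value (first one on ties)
mode_value <- function(x) {
  counts <- table(x)  # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Median for the numeric column, mode for the character column
toy$age[is.na(toy$age)] <- median(toy$age, na.rm = TRUE)
toy$smoker[is.na(toy$smoker)] <- mode_value(toy$smoker)

print(toy)
```

The missing age is filled with the median of the observed ages and the missing smoker flag with the most common observed label.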
Outliers
# Identify and handle outliers using IQR
# Creating a function to identify the outliers
numeric_cols <- brain_tumor_dt %>%
  select(where(is.numeric))
outlier_bounds <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr_value <- Q3 - Q1  # named iqr_value to avoid shadowing stats::IQR()
  lower <- Q1 - 1.5 * iqr_value
  upper <- Q3 + 1.5 * iqr_value
  return(c(lower, upper))
}
# Replace outliers with NA
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.numeric),
                ~ ifelse(. < outlier_bounds(.)[1] | . > outlier_bounds(.)[2], NA, .)))
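# Identify and handle outliers using Z-score
# scale() returns a one-column matrix, so as.vector() keeps each column a plain vector
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.numeric), ~ ifelse(abs(as.vector(scale(.))) > 3, NA, .)))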
I decided to use both the IQR and Z-score methods to identify and address outliers since both are widely used. The IQR rule is built on quartiles, so it is less affected by extreme values and suits skewed or non-normal distributions: it flags values more than 1.5 × IQR below the first quartile or above the third quartile. The Z-score rule measures the number of standard deviations from the mean and works best for roughly normally distributed variables; here it flags values more than 3 standard deviations away. Values flagged by either rule are replaced with NA, which is why later summaries use na.rm = TRUE. Using the two in tandem gives a clear handle on potential outliers.
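To make the two rules concrete, here is a small base-R sketch on a made-up vector with one extreme value. The numbers are invented for illustration only; the thresholds (1.5 × IQR and 3 standard deviations) are the ones used in the analysis code.

```r
# Made-up sample: twenty ordinary values plus one obviously extreme value
x <- c(rep(c(10, 11, 12, 13), 5), 60)

# IQR rule: flag values beyond 1.5 * IQR from the quartiles
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
iqr_flag <- x < q[1] - fence | x > q[2] + fence

# Z-score rule: flag values more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
z_flag <- abs(z) > 3

print(x[iqr_flag])
print(x[z_flag])
```

In this sample both rules flag only the extreme value. In very small samples the Z-score rule can miss an outlier, because the outlier itself inflates the standard deviation.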
Data Transformation
I needed to rename the survival rate column, which read.csv() imported as Survival_Rate... (see the summary output above).
# Rename using pattern matching
colnames(brain_tumor_dt)[grep("Survival_Rate", colnames(brain_tumor_dt))] <- "Survival_Rate"
colnames(brain_tumor_dt)
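# Convert all Yes/No (or True/False) variables to 1's and 0's.
# Note: ifelse() returns character here because the fallback value is the
# original string, so the flags are stored as "1"/"0" and the affected
# columns remain character columns.
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(across(where(is.character),
                ~ ifelse(. %in% c("Yes", "True"), 1,
                         ifelse(. %in% c("No", "False"), 0, .))))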
Splitting the Blood_Pressure column into two columns for Systolic and Diastolic
# Split Blood_Pressure into Systolic and Diastolic columns
brain_tumor_dt <- brain_tumor_dt %>%
  separate(Blood_Pressure, into = c("Systolic", "Diastolic"), sep = "/") %>%
  mutate(across(c(Systolic, Diastolic), as.numeric))
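For readers less familiar with tidyr, the same split can be sketched in base R on a few made-up readings. The vector `bp` and the data frame `bp_split` are illustrative names only.

```r
# Made-up readings in the same "systolic/diastolic" text format
bp <- c("122/88", "126/119", "95/85")

# Split each string on "/" and bind the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(bp, "/", fixed = TRUE))

# Convert both pieces from text to numbers
bp_split <- data.frame(
  Systolic  = as.numeric(parts[, 1]),
  Diastolic = as.numeric(parts[, 2])
)

print(bp_split)
```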
Exploratory Visualization
Question 1.a: What is the distribution of tumor types (Benign vs. Malignant) in the dataset?
# Bar chart for tumor type distribution
ggplot(brain_tumor_dt, aes(x = Tumor_Type, fill = Tumor_Type)) +
  geom_bar() +
  labs(title = "Distribution of Tumor Types", x = "Tumor Type", y = "Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
This chart shows the proportion of benign vs. malignant tumors. We see that in this dataset we have an even distribution of benign and malignant tumors which will allow us to make a full analysis of both types without any major model skew.
While this may not be completely accurate to the real world distribution, this allows us to get a good look at the many potential cases for both benign and malignant tumors.
Question 1.b: How does tumor size vary between benign and malignant tumors?
# Box plot for tumor size by tumor type
ggplot(brain_tumor_dt, aes(x = Tumor_Type, y = Tumor_Size, fill = Tumor_Type)) +
  geom_boxplot() +
  labs(title = "Tumor Size by Tumor Type", x = "Tumor Type", y = "Tumor Size (cm)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
Here we see very similar tumor sizes for benign and malignant tumors, likely because this Kaggle dataset is very clean and standardized. As with the prevalence of benign and malignant tumors, this may not match the real-world distribution, but it does allow us to see a full variety of cases. Even if the distribution is a bit off, it gives us enough data to perform deeper analysis and train models for both benign and malignant tumors.
Question 2.a: What correlations are there between the numeric values in the dataset?
# Select only numeric columns for correlation
numeric_cols <- brain_tumor_dt %>%
  select(where(is.numeric))
# Compute the correlation matrix
correlation_matrix <- cor(numeric_cols, use = "complete.obs")
# Print the correlation matrix
print(correlation_matrix)
# Melt the correlation matrix for visualization
melted_corr <- melt(correlation_matrix)
# Plot the correlation matrix
# Values outside the limits (including the diagonal of 1s) are drawn in grey
ggplot(data = melted_corr, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limits = c(-.05, .05), space = "Lab") +
  theme_minimal() +
  labs(title = "Correlation Matrix", x = "Variables", y = "Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The correlation matrix shows very weak correlations between the numeric variables in the dataset. The values are close to zero, indicating no strong linear relationships between variables like Age, Tumor_Size, Genetic_Risk, Systolic, Diastolic, and Survival_Rate.
Near-zero correlations only rule out linear relationships. Any relationships that do exist between these variables would have to be non-linear or involve interactions, which is where techniques such as decision trees or other machine learning models can help uncover hidden patterns. It is also possible that the variables are simply unrelated in this dataset, which may not fully capture real-world variability. This could limit the applicability of findings to real-world scenarios.
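The point that a weak correlation does not rule out a relationship can be shown with a tiny made-up example in base R. The vectors `x` and `y` below are invented for illustration and have nothing to do with the tumor data.

```r
# Symmetric x values and a purely quadratic response
x <- -5:5
y <- x^2

# y is fully determined by x, yet the linear (Pearson) correlation is essentially zero
pearson <- cor(x, y)
print(pearson)

# A rank-based measure is also essentially zero because the relationship is not monotonic
spearman <- cor(x, y, method = "spearman")
print(spearman)
```

This is why a flat correlation matrix is a reason to try other methods rather than proof that the variables are unrelated.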
Question 2.b: What correlations are there between the categorical values in the dataset?
# Select only non-numeric columns
categorical_cols <- brain_tumor_dt %>%
  select(where(is.character))
# Function to calculate Cramér's V for each pair of categorical variables
cramers_v_matrix <- function(data) {
  n <- ncol(data)
  result <- matrix(NA, n, n, dimnames = list(names(data), names(data)))
  for (i in 1:n) {
    for (j in 1:n) {
      result[i, j] <- CramerV(table(data[[i]], data[[j]]))
    }
  }
  return(as.data.frame(result))
}
# Compute the Cramér's V correlation matrix
categorical_corr_matrix <- cramers_v_matrix(categorical_cols)
# Print the correlation matrix
print(categorical_corr_matrix)
# Convert the matrix to long format for ggplot
melted_corr <- melt(as.matrix(categorical_corr_matrix))
# Plot the heatmap
ggplot(data = melted_corr, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       limits = c(-.02, .02), space = "Lab") +
  labs(title = "Categorical Correlation Matrix (Cramér's V)", x = "Variables", y = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The Categorical Correlation Matrix (Cramér’s V) shows very weak associations between most categorical variables in the dataset. The values are close to zero, indicating minimal relationships between variables like Gender, Tumor_Location, Family_History, Treatment_Received, and others.
Most categorical variables show very weak associations, suggesting that they may not have strong direct relationships or that the dataset may not capture these relationships effectively. Variables like Symptom_Severity, Tumor_Type, and Tumor_Growth_Rate may be more predictive of tumor presence and severity. These should be prioritized in predictive modeling. Variables like Gender and Country show almost no association with tumor-related variables, suggesting that demographic factors may not play a significant role in this dataset. The weak associations might reflect the synthetic nature of the dataset, which could limit its applicability to real-world scenarios. This highlights the importance of validating findings with real-world data.
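For reference, Cramér's V is usually defined from the Pearson chi-squared statistic as V = sqrt(chi² / (n × (min(r, c) − 1))), where n is the number of observations and r and c are the numbers of rows and columns in the contingency table. The base-R sketch below computes it by hand on made-up labels. The helper `cramers_v()` is a hypothetical illustration of the standard formula, not the DescTools implementation used above.

```r
# Hypothetical helper: Cramér's V from the Pearson chi-squared statistic
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  # correct = FALSE turns off the continuity correction for 2x2 tables;
  # warnings about small expected counts are suppressed for this toy data
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (n * (k - 1))))
}

# Perfectly associated labels: each value of a maps to one value of b
a <- c("x", "x", "y", "y", "x", "y")
b <- c("u", "u", "v", "v", "u", "v")
v_perfect <- cramers_v(a, b)
print(v_perfect)

# Labels with no association: every combination occurs equally often
c1 <- c("x", "x", "y", "y")
c2 <- c("u", "v", "u", "v")
v_none <- cramers_v(c1, c2)
print(v_none)
```

V ranges from 0 (no association) to 1 (perfect association), so the values near zero in the matrix above indicate that the categorical variables are close to independent of one another.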
Question 3: Is there a multivariable link between Tumor_Size, Tumor_Type, Genetic_Risk, Tumor_Location, and Symptom_Severity with Survival_Rate?
# Scatter plot with one linear fit per tumor type
ggplot(brain_tumor_dt, aes(x = Genetic_Risk, y = Survival_Rate, color = Tumor_Type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Interaction Plot: Genetic Risk vs Survival Rate by Tumor Type",
       x = "Genetic Risk", y = "Survival Rate (%)") +
  theme_minimal()
This graph explores the relationship between Genetic Risk and Survival Rate, segmented by Tumor Type (Benign and Malignant). The scatterplot shows individual data points, while the linear regression lines provide a visual representation of the trends for each tumor type. The regression analysis suggests that Genetic Risk has a minimal direct impact on Survival Rate, as the slopes of the lines are relatively flat. This aligns with earlier findings in the document, such as the weak correlations observed in the numeric variables and the categorical correlation matrix, which indicate that no single factor strongly predicts survival outcomes.
These results highlight the need for multivariate approaches to uncover complex interactions between variables like Tumor_Size, Tumor_Location, and Symptom_Severity. Additionally, the dataset’s synthetic nature and even distribution of benign and malignant tumors provide a controlled environment for analysis but may not fully reflect real-world variability. This reinforces the importance of validating findings with real-world data to ensure applicability in medical diagnosis and treatment planning.
Question 4: How does the mean survival rate vary with binned genetic risk levels across different tumor types (Benign vs. Malignant)?
# Aggregate data by Genetic_Risk and Tumor_Type
aggregated_data <- brain_tumor_dt %>%
  group_by(Genetic_Risk = cut(Genetic_Risk, breaks = 10), Tumor_Type) %>%
  summarize(Mean_Survival_Rate = mean(Survival_Rate, na.rm = TRUE), .groups = "drop")
# Plot aggregated data with lines of best fit
ggplot(aggregated_data, aes(x = Genetic_Risk, y = Mean_Survival_Rate,
                            color = Tumor_Type, group = Tumor_Type)) +
  geom_line() +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(group = Tumor_Type)) +
  labs(title = "Mean Survival Rate by Genetic Risk and Tumor Type",
       x = "Genetic Risk (Binned)", y = "Mean Survival Rate (%)") +
  theme_minimal()
This graph illustrates the relationship between Genetic Risk (binned into intervals) and the Mean Survival Rate (%) for Benign and Malignant tumors. The lines of best fit for each tumor type highlight the trends in survival rates as genetic risk increases. For benign tumors, there is a slight upward trend in survival rate with increasing genetic risk, while malignant tumors show a more variable pattern with no clear linear trend. This suggests that genetic risk may have a stronger and more consistent impact on survival rates for benign tumors compared to malignant ones. However, the variability in the data indicates that other factors may also play a significant role in determining survival rates.
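The binning step behind this plot relies on cut() with a single number of breaks, which splits the observed range into equal-width intervals. A minimal base-R sketch on made-up values shows the idea with three bins instead of ten. The vectors `risk` and `survival` are invented for illustration.

```r
# Made-up scores on a 0-100 scale with a made-up outcome
risk     <- c(5, 15, 30, 45, 60, 75, 95, 85)
survival <- c(50, 54, 60, 58, 40, 44, 70, 66)

# cut() with a single number splits the observed range into equal-width bins
risk_bin <- cut(risk, breaks = 3)

# Mean outcome within each bin
bin_means <- tapply(survival, risk_bin, mean)
print(bin_means)
```

Because each bin mean averages over many patients in the real data, small differences between bins can look like trends even when the underlying patient-level relationship is very weak.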
Question 5: How does the mean survival rate vary across different tumor growth rates (Slow, Moderate, Rapid) and symptom severity levels (Mild, Moderate, Severe)?
# Convert Tumor_Growth_Rate to a factor with the desired order
brain_tumor_dt <- brain_tumor_dt %>%
  mutate(Tumor_Growth_Rate = factor(Tumor_Growth_Rate, levels = c("Slow", "Moderate", "Rapid")))
# Aggregate data by Tumor_Growth_Rate and Symptom_Severity
aggregated_data_updated <- brain_tumor_dt %>%
  group_by(Tumor_Growth_Rate, Symptom_Severity) %>%
  summarize(Mean_Survival_Rate = mean(Survival_Rate, na.rm = TRUE), .groups = "drop")
# Plot 1: Data points and lines connecting them
ggplot(aggregated_data_updated, aes(x = Tumor_Growth_Rate, y = Mean_Survival_Rate,
                                    color = Symptom_Severity, group = Symptom_Severity)) +
  geom_line() +
  geom_point() +
  labs(title = "Mean Survival Rate by Tumor Growth Rate and Symptom Severity (Data Points)",
       x = "Tumor Growth Rate", y = "Mean Survival Rate (%)") +
  theme_minimal()
# Plot 2: Best-fit lines only
ggplot(aggregated_data_updated, aes(x = Tumor_Growth_Rate, y = Mean_Survival_Rate,
                                    color = Symptom_Severity, group = Symptom_Severity)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(group = Symptom_Severity)) +
  labs(title = "Best-Fit Lines: Tumor Growth Rate vs. Mean Survival Rate",
       x = "Tumor Growth Rate", y = "Mean Survival Rate (%)") +
  theme_minimal()
These visualizations explore the relationship between tumor growth rate, symptom severity, and mean survival rate. The first graph, “Mean Survival Rate by Tumor Growth Rate and Symptom Severity (Data Points)”, shows the raw data trends, with mean survival rates plotted for each combination of tumor growth rate (Slow, Moderate, Rapid) and symptom severity (Mild, Moderate, Severe). The data points are connected by lines to highlight the variability within each symptom severity group. The trends suggest that survival rates fluctuate slightly across growth rates, with Severe symptoms generally associated with higher survival rates compared to Mild symptoms, though the patterns are inconsistent.
The second graph, “Best-Fit Lines: Tumor Growth Rate vs. Mean Survival Rate”, provides a clearer view of the overall trends using linear regression lines. For Severe symptoms, survival rates decrease slightly as tumor growth rate increases, while for Moderate symptoms, survival rates show a slight upward trend with faster growth rates. The Mild symptom group exhibits a relatively flat trend, indicating minimal impact of growth rate on survival. Together, these graphs suggest that symptom severity may influence how tumor growth rate affects survival outcomes, highlighting the need for further analysis to uncover potential interactions between these variables.
Exploratory Analysis Questions
Question 1: What is the distribution of symptom severity in the dataset?
Why this is important:
Understanding the distribution of symptom severity (Mild, Moderate, Severe) provides insight into the dataset’s composition and whether it represents a balanced range of cases. This is essential for ensuring that analyses and models are not biased toward one severity level.
Exploration Approach:
A bar chart was created to visualize the count of cases for each symptom severity level.
# Bar chart for symptom severity distribution
ggplot(brain_tumor_dt, aes(x = Symptom_Severity, fill = Symptom_Severity)) +
  geom_bar() +
  labs(title = "Distribution of Symptom Severity", x = "Symptom Severity", y = "Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
Findings:
The bar chart shows that the dataset has a balanced distribution of symptom severity levels (Mild, Moderate, Severe). Each category has a similar number of observations, with no single category dominating the dataset. This balance ensures that analyses and models will not be biased toward any specific severity level, allowing for fair comparisons and robust insights across all severity groups.
This balanced representation is particularly useful for exploring the relationships between symptom severity and other variables, such as survival rate or tumor growth rate, without the risk of skewed results due to overrepresentation of one category.
Question 2: How does tumor size relate to symptom severity?
Why this is important:
Tumor size is often linked to symptom severity, with larger tumors potentially causing more severe symptoms. Understanding this relationship can help prioritize cases for treatment based on tumor size.
Exploration Approach:
A box plot was created to compare tumor sizes across different levels of symptom severity.
# Box plot for tumor size by symptom severity
ggplot(brain_tumor_dt, aes(x = Symptom_Severity, y = Tumor_Size, fill = Symptom_Severity)) +
  geom_boxplot() +
  labs(title = "Tumor Size by Symptom Severity", x = "Symptom Severity", y = "Tumor Size (cm)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
Findings:
The box plot shows the distribution of tumor sizes across different levels of symptom severity (Mild, Moderate, Severe). The median tumor size is consistent across all severity levels, with no significant differences observed. The interquartile ranges (IQRs) and overall spread of tumor sizes are also similar for all groups. This suggests that tumor size alone may not be a strong determinant of symptom severity in this dataset.
However, the lack of variation could be due to the synthetic nature of the dataset, which may not fully capture real-world variability. Further analysis incorporating other factors, such as tumor growth rate or location, may provide additional insights into the relationship between tumor size and symptom severity.
Question 3: How does survival rate differ across tumor types (Benign vs. Malignant)?
Why this is important:
Survival rate is a critical outcome metric. Comparing survival rates between benign and malignant tumors can provide insights into the severity and prognosis of each tumor type.
Exploration Approach:
A violin plot was created to visualize the distribution of survival rates for benign and malignant tumors.
# Violin plot for survival rate by tumor type
ggplot(brain_tumor_dt, aes(x = Tumor_Type, y = Survival_Rate, fill = Tumor_Type)) +
  geom_violin(trim = FALSE) +
  labs(title = "Survival Rate by Tumor Type", x = "Tumor Type", y = "Survival Rate (%)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")
Findings:
The violin plot illustrates the distribution of survival rates for benign and malignant tumors. Both tumor types show a wide range of survival rates. The recorded values run from 10% to 99% (see the summary output above); the violins extend slightly past these limits because trim = FALSE lets the density tails run beyond the data. The central tendency (median) appears similar for both groups, with benign tumors showing slightly higher survival rates overall. The distributions are relatively uniform, with no significant skewness or clustering observed.
This suggests that while tumor type (benign vs. malignant) may influence survival rates, the impact is not as pronounced as expected. Other factors, such as tumor size, growth rate, or symptom severity, may play a more significant role in determining survival outcomes. This highlights the need for multivariate analysis to uncover the combined effects of these variables.
Question 4: What happens to survival rate when symptom severity increases?
Why this is important:
Symptom severity is a key indicator of patient condition. Understanding its impact on survival rate can help identify high-risk patients and prioritize treatment.
Exploration Approach:
A line plot was created to show the mean survival rate for each level of symptom severity.
Code:
# Line plot for survival rate by symptom severity
aggregated_severity <- brain_tumor_dt %>%
  group_by(Symptom_Severity) %>%
  summarize(Mean_Survival_Rate = mean(Survival_Rate, na.rm = TRUE))
ggplot(aggregated_severity, aes(x = Symptom_Severity, y = Mean_Survival_Rate, group = 1)) +
  geom_line(color = "blue") +
  geom_point(size = 3, color = "red") +
  labs(title = "Survival Rate by Symptom Severity", x = "Symptom Severity", y = "Mean Survival Rate (%)") +
  theme_minimal()
Findings:
The line plot shows the mean survival rate for each level of symptom severity (Mild, Moderate, Severe). Interestingly, survival rates increase as symptom severity progresses from Mild to Severe. This trend is counterintuitive, as one might expect more severe symptoms to correlate with lower survival rates.
This pattern could indicate that patients with severe symptoms are receiving more aggressive or effective treatments, leading to higher survival rates. Alternatively, it may reflect the synthetic nature of the dataset, where symptom severity does not align with real-world outcomes. Further analysis is needed to explore the underlying factors driving this trend, such as treatment type, tumor growth rate, or other patient characteristics.
Hypothesis Generation
Hypothesis: Patients with severe symptoms and rapid tumor growth rates have significantly lower survival rates compared to patients with mild symptoms and slow tumor growth rates, regardless of tumor type (Benign or Malignant).
EDA Observations:
Symptom severity appears to influence survival rates, with severe symptoms generally associated with higher survival rates, which may reflect aggressive treatment or dataset characteristics.
Tumor growth rate shows a slight relationship with survival rates, where rapid growth rates tend to correlate with worse outcomes.
The interaction between symptom severity and tumor growth rate may amplify the impact on survival rates, as seen in the variability of trends across groups.
The synthetic nature of the dataset may introduce biases or unrealistic patterns that do not fully reflect real-world scenarios. For example, the balanced distribution of tumor types and symptom severity levels, as well as the weak correlations between variables, may limit the generalizability of findings to real-world applications. This highlights the importance of validating results with real-world clinical data.
Relationships and patterns:
The exploratory data analysis revealed several key relationships and patterns within the dataset:
Symptom Severity and Survival Rate:
Contrary to expectations, survival rates increased with symptom severity (Mild < Moderate < Severe). This counterintuitive trend may reflect aggressive treatment for patients with severe symptoms or biases introduced by the synthetic nature of the dataset. It suggests that symptom severity alone may not be a direct predictor of survival outcomes but could interact with other factors, such as treatment type or tumor growth rate.
Tumor Growth Rate and Survival Rate:
Tumor growth rate (Slow, Moderate, Rapid) showed a slight relationship with survival rates. Patients with rapid tumor growth tended to have lower survival rates, particularly when combined with severe symptoms. This aligns with clinical expectations, as faster-growing tumors are often more aggressive and harder to treat.
Genetic Risk and Tumor Type:
Genetic risk appeared to have a stronger and more consistent impact on survival rates for benign tumors compared to malignant ones. For benign tumors, survival rates increased slightly with higher genetic risk, while malignant tumors exhibited more variability, with no clear trend. This suggests that genetic risk may play a more significant role in less aggressive tumor types.
Tumor Size and Symptom Severity:
Tumor size was consistent across all levels of symptom severity, with no significant differences observed. This indicates that tumor size alone may not determine symptom severity, and other factors, such as tumor location or growth rate, may play a more critical role in influencing symptoms.
Weak Correlations Between Variables:
Both the numeric and categorical correlation matrices revealed weak associations between most variables. For example, Age, Tumor_Size, and Genetic_Risk showed minimal correlation with survival rates. Similarly, categorical variables like Gender and Country had almost no association with tumor-related outcomes. This suggests that simple pairwise associations do not explain survival outcomes or tumor characteristics: any real dependence may involve non-linear or interaction effects, or the variables may simply be close to independent in this synthetic dataset.
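For reference, the two kinds of association measure discussed here can be sketched in base R: a rank-based correlation matrix for numeric columns and Cramér's V for a pair of categorical columns. This is an illustration rather than the code that produced the matrices above. The data frame `toy` is simulated, the `Gender` labels are hypothetical, and `cramers_v()` is a helper written for this example.

```r
# Simulated stand-in for brain_tumor_dt (assumption: same column names).
set.seed(7)
n <- 500
toy <- data.frame(
  Age           = sample(5:89, n, replace = TRUE),
  Tumor_Size    = runif(n, min = 0.5, max = 10),
  Genetic_Risk  = sample(0:100, n, replace = TRUE),
  Survival_Rate = runif(n, min = 10, max = 99),
  Gender        = sample(c("Male", "Female"), n, replace = TRUE),  # hypothetical labels
  Tumor_Type    = sample(c("Benign", "Malignant"), n, replace = TRUE)
)

# Spearman correlation captures monotonic (not only linear) relationships.
numeric_cor <- cor(
  toy[, c("Age", "Tumor_Size", "Genetic_Risk", "Survival_Rate")],
  method = "spearman"
)

# Cramer's V: chi-squared statistic rescaled to the range 0 (none) to 1 (perfect).
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

v_gender_type <- cramers_v(toy$Gender, toy$Tumor_Type)

print(round(numeric_cor, 3))
print(v_gender_type)
```

Values near zero in both measures are what independent columns would produce, which is the pattern the section describes.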
Synthetic Nature of the Dataset:
The dataset’s synthetic nature introduced balanced distributions (e.g., equal representation of benign and malignant tumors, and symptom severity levels) and weak correlations between variables. While this provides a controlled environment for analysis, it may not fully reflect real-world variability, limiting the generalizability of findings. For example, the even distribution of tumor types and symptom severity levels may obscure real-world trends where certain tumor types or severity levels are more prevalent.
Interaction Effects:
The interaction between symptom severity and tumor growth rate emerged as a potential key factor influencing survival outcomes. Patients with severe symptoms and rapid tumor growth rates appeared to be at the highest risk of poor survival outcomes. This highlights the importance of exploring interaction effects in multivariate analyses to uncover hidden patterns.
These relationships and patterns suggest that survival outcomes are influenced by a combination of factors, including symptom severity, tumor growth rate, and genetic risk. However, the weak correlations and synthetic nature of the dataset emphasize the need for advanced modeling techniques and validation with real-world data to ensure the applicability of findings.
Stakeholder value:
This hypothesis is critical for healthcare providers and researchers as it identifies high-risk patient groups who may benefit from prioritized and aggressive treatment strategies. For policymakers and healthcare administrators, it provides insights into resource allocation for patients with the most severe conditions.
Additional data necessary:
Longitudinal data on survival outcomes to track changes over time.
Detailed treatment information (e.g., type, duration, and effectiveness) to control for treatment effects.
Additional demographic data (e.g., socioeconomic status, access to healthcare) to account for confounding factors.
Biomarker data to explore biological mechanisms underlying symptom severity and tumor growth.
Real-world clinical data, so that this analysis and any models built on it are more accurate and applicable to real-world use cases.
Appropriate analysis methods:
Multivariate Regression Analysis
Description: Multivariate regression (more precisely, multiple regression, since survival rate is the single outcome) can model the relationship between survival rate and multiple independent variables, such as symptom severity, tumor growth rate, tumor type, and genetic risk. Interaction terms can be included to test whether the combined effect of symptom severity and tumor growth rate significantly impacts survival rates.
This method provides a clear, interpretable framework for understanding how individual variables and their interactions influence survival rates.
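A minimal sketch of such a model, assuming a linear model fitted with `lm()` is acceptable for a bounded percentage outcome: the interaction is tested by comparing a model with the `Symptom_Severity:Tumor_Growth_Rate` term against one without it. The data frame `toy` is simulated and only borrows the dataset's column names, so the printed numbers say nothing about the real data.

```r
# Simulated stand-in for brain_tumor_dt (assumption: same column names and levels).
set.seed(2024)
n <- 900
toy <- data.frame(
  Symptom_Severity = factor(
    sample(c("Mild", "Moderate", "Severe"), n, replace = TRUE),
    levels = c("Mild", "Moderate", "Severe")
  ),
  Tumor_Growth_Rate = factor(
    sample(c("Slow", "Moderate", "Rapid"), n, replace = TRUE),
    levels = c("Slow", "Moderate", "Rapid")
  ),
  Tumor_Type    = factor(sample(c("Benign", "Malignant"), n, replace = TRUE)),
  Genetic_Risk  = sample(0:100, n, replace = TRUE),
  Survival_Rate = runif(n, min = 10, max = 99)
)

# Additive model: each predictor contributes independently.
additive_fit <- lm(
  Survival_Rate ~ Symptom_Severity + Tumor_Growth_Rate + Tumor_Type + Genetic_Risk,
  data = toy
)

# Full model: adds the severity-by-growth-rate interaction.
full_fit <- lm(
  Survival_Rate ~ Symptom_Severity * Tumor_Growth_Rate + Tumor_Type + Genetic_Risk,
  data = toy
)

# F-test for the nested models: does the interaction improve the fit?
interaction_test    <- anova(additive_fit, full_fit)
interaction_p_value <- interaction_test[["Pr(>F)"]][2]

print(summary(full_fit)$coefficients)
print(interaction_test)
```

The coefficient table is read relative to the reference levels (Mild, Slow, Benign here), which is why the factor levels are ordered explicitly.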
Interaction Analysis
Description: Interaction analysis explicitly tests whether the effect of one variable on survival rate depends on the level of another variable. This can be done within regression models or using ANOVA techniques.
It directly addresses the hypothesis by quantifying the combined impact of symptom severity and tumor growth rate on survival outcomes.
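The ANOVA route can be sketched as a table of cell means plus a two-way ANOVA with an interaction term. Again, `toy` is simulated data under the assumption that the real columns share these names and levels; it illustrates the mechanics only.

```r
# Simulated stand-in for brain_tumor_dt (assumption: same column names and levels).
set.seed(11)
n <- 900
toy <- data.frame(
  Symptom_Severity = factor(
    sample(c("Mild", "Moderate", "Severe"), n, replace = TRUE),
    levels = c("Mild", "Moderate", "Severe")
  ),
  Tumor_Growth_Rate = factor(
    sample(c("Slow", "Moderate", "Rapid"), n, replace = TRUE),
    levels = c("Slow", "Moderate", "Rapid")
  ),
  Survival_Rate = runif(n, min = 10, max = 99)
)

# Mean survival rate for every severity x growth-rate combination.
cell_means <- with(
  toy,
  tapply(Survival_Rate, list(Symptom_Severity, Tumor_Growth_Rate), mean)
)

# Effect of Rapid vs Slow growth within each severity level.
# Without an interaction, these three differences should be roughly equal.
rapid_vs_slow <- cell_means[, "Rapid"] - cell_means[, "Slow"]

# Two-way ANOVA; the third row of the table is the interaction term.
two_way_aov <- aov(Survival_Rate ~ Symptom_Severity * Tumor_Growth_Rate, data = toy)
aov_table   <- summary(two_way_aov)[[1]]
interaction_p_value <- aov_table[["Pr(>F)"]][3]

print(round(cell_means, 2))
print(round(rapid_vs_slow, 2))
print(aov_table)
```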
Decision Trees and Random Forests
Decision trees and random forests are machine learning methods that can model non-linear relationships and interactions between variables. They can identify the most important predictors of survival rates and how these predictors interact.
These methods are robust to complex, non-linear relationships and can handle large datasets effectively. They also provide variable importance scores to identify key drivers of survival outcomes.
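As a dependency-light illustration, the sketch below fits a single regression tree with `rpart` (a recommended package shipped with standard R installations) and reads its variable importance scores. A random forest would follow the same pattern with a package such as `randomForest` or `ranger`, which is not used here. The data are simulated, and the 20-point penalty for rapid growth is a built-in toy effect chosen so the tree has something to find. It is not a result from the dataset.

```r
library(rpart)

# Simulated stand-in for brain_tumor_dt (assumption: same column names).
set.seed(99)
n <- 600
toy <- data.frame(
  Symptom_Severity  = factor(sample(c("Mild", "Moderate", "Severe"), n, replace = TRUE)),
  Tumor_Growth_Rate = factor(sample(c("Slow", "Moderate", "Rapid"), n, replace = TRUE)),
  Genetic_Risk      = sample(0:100, n, replace = TRUE)
)

# Toy effect (assumption, not a finding): rapid growth lowers survival by 20 points.
toy$Survival_Rate <- 60 - 20 * (toy$Tumor_Growth_Rate == "Rapid") + rnorm(n, sd = 10)

# Regression tree for a numeric outcome.
tree_fit <- rpart(
  Survival_Rate ~ Symptom_Severity + Tumor_Growth_Rate + Genetic_Risk,
  data = toy,
  method = "anova"
)

# Importance scores: larger values mean the variable drove more of the splits.
importance    <- tree_fit$variable.importance
top_predictor <- names(importance)[which.max(importance)]

print(tree_fit)
print(importance)
```

On the toy data the tree recovers `Tumor_Growth_Rate` as the dominant predictor, which is the kind of sanity check worth running before trusting importance scores on the real data.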
Clustering Analysis
Clustering methods, such as k-means or hierarchical clustering, can group patients based on similar characteristics. These clusters can then be analyzed for differences in survival rates.
Clustering helps identify subgroups of patients with similar profiles, which can provide insights into high-risk groups and guide personalized treatment strategies.
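A minimal k-means sketch on the numeric columns, followed by a per-cluster comparison of survival rates, might look like the following. The choice of three clusters is arbitrary, `toy` is simulated data that only borrows the dataset's column names, and the outcome is deliberately left out of the clustering features so that the comparison is not circular.

```r
# Simulated stand-in for brain_tumor_dt (assumption: same column names).
set.seed(5)
n <- 300
toy <- data.frame(
  Age           = sample(5:89, n, replace = TRUE),
  Tumor_Size    = runif(n, min = 0.5, max = 10),
  Genetic_Risk  = sample(0:100, n, replace = TRUE),
  Survival_Rate = runif(n, min = 10, max = 99)
)

# Standardize features so no single scale dominates the distance calculation.
features <- scale(toy[, c("Age", "Tumor_Size", "Genetic_Risk")])

# k-means with several random starts; k = 3 is an arbitrary illustrative choice.
km_fit <- kmeans(features, centers = 3, nstart = 10)
toy$Cluster <- factor(km_fit$cluster)

# Compare mean survival rate and size across the resulting clusters.
cluster_survival <- aggregate(Survival_Rate ~ Cluster, data = toy, FUN = mean)
cluster_sizes    <- table(toy$Cluster)

print(cluster_survival)
print(cluster_sizes)
```

Because k-means relies on Euclidean distance, categorical columns such as `Symptom_Severity` would need to be encoded first, or a mixed-type dissimilarity (for example Gower distance) would have to be used with hierarchical clustering instead.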
Implications of the hypothesis:
If true:
High-risk patients (severe symptoms and rapid tumor growth) can be identified early and prioritized for intensive treatment.
Healthcare providers can develop targeted intervention strategies to improve survival outcomes for these patients.
Resource allocation can be optimized to focus on the most vulnerable groups.
If false:
It would suggest that symptom severity and tumor growth rate may not be as critical in determining survival outcomes as previously thought.
Further investigation would be needed to identify other factors (e.g., genetic risk, treatment type) that have a stronger influence on survival rates.
It may highlight the need for more comprehensive datasets or alternative modeling approaches to uncover hidden patterns.
Stakeholder Communication
The exploratory analysis of the Brain Tumor Prediction Dataset revealed several key insights. First, the dataset is well-balanced across symptom severity levels (Mild, Moderate, Severe) and tumor types (Benign, Malignant), so group comparisons are not distorted by class imbalance. Tumor size was consistent across symptom severity levels, suggesting that size alone may not determine symptom severity. Survival rates were similar for benign and malignant tumors, with no significant differences in their distributions. Interestingly, survival rates increased with symptom severity, potentially reflecting aggressive treatment for severe cases or dataset-specific characteristics.
The numeric correlation matrix showed weak correlations between variables like Age, Tumor_Size, and Genetic_Risk with survival rates, indicating that survival outcomes may depend on complex, non-linear interactions. Similarly, the categorical correlation matrix revealed weak associations between most categorical variables, with Symptom_Severity, Tumor_Type, and Tumor_Growth_Rate showing slightly stronger relationships with survival outcomes.
Based on these findings, I hypothesize that patients with severe symptoms and rapid tumor growth rates have significantly lower survival rates compared to those with mild symptoms and slow tumor growth rates, regardless of tumor type. This hypothesis is meaningful for stakeholders as it identifies high-risk groups that may benefit from targeted interventions and resource prioritization.
It is also important to note that these trends may be artifacts of a synthetic dataset that does not fully represent reality.
To test this hypothesis, additional data on treatment types, longitudinal survival outcomes, and demographic factors would be valuable. Multivariate regression, interaction analysis, and survival analysis are recommended to validate the hypothesis and uncover complex relationships. If the hypothesis is confirmed, healthcare providers can prioritize high-risk patients for intensive care, while policymakers can allocate resources more effectively. If disproven, further investigation into other factors influencing survival outcomes will be necessary.
Recommended Next Steps:
Collect additional data on treatment types, patient demographics, and longitudinal outcomes.
Perform advanced statistical and machine learning analyses to validate the hypothesis.
Develop predictive models to identify high-risk patients early.
Validate findings with real-world clinical data to ensure applicability.
The following visualization highlights the relationship between tumor growth rate, symptom severity, and survival rate, emphasizing the need to focus on patients with rapid tumor growth and severe symptoms:
# Stakeholder-friendly visualization
# linewidth replaces the size aesthetic for lines, deprecated in ggplot2 3.4.0
ggplot(aggregated_data_updated,
       aes(x = Tumor_Growth_Rate, y = Mean_Survival_Rate,
           color = Symptom_Severity, group = Symptom_Severity)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Survival Rate by Tumor Growth Rate and Symptom Severity",
       x = "Tumor Growth Rate", y = "Mean Survival Rate (%)",
       color = "Symptom Severity") +
  theme_minimal()