Level 1: Understanding the Dataset (Basic Exploration)

What are the column names, and are there any missing values in the dataset?

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.3

## corrplot 0.95 loaded

library(readr)
brain_tumor_data <- read_csv("C:/Users/hp/OneDrive/Desktop/project 1 eda/Brain_Tumor_Prediction_Dataset.csv")

## Rows: 250000 Columns: 21

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): Gender, Country, Tumor_Location, MRI_Findings, Smoking_History, Al...
## dbl  (4): Age, Tumor_Size, Genetic_Risk, Survival_Rate(%)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(brain_tumor_data)
colSums(is.na(brain_tumor_data))

##                 Age              Gender             Country          Tumor_Size 
##                   0                   0                   0                   0 
##      Tumor_Location        MRI_Findings        Genetic_Risk     Smoking_History 
##                   0                   0                   0                   0 
## Alcohol_Consumption  Radiation_Exposure Head_Injury_History     Chronic_Illness 
##                   0                   0                   0                   0 
##      Blood_Pressure            Diabetes          Tumor_Type  Treatment_Received 
##                   0                   0                   0                   0 
##    Survival_Rate(%)   Tumor_Growth_Rate      Family_History    Symptom_Severity 
##                   0                   0                   0                   0 
## Brain_Tumor_Present 
##                   0

Interpretation

– The dataset, “Brain_Tumor_Prediction_Dataset.csv”, loads successfully with read_csv().

– It contains 250,000 rows and 21 columns.

– 17 columns are read as character (categorical) data (e.g., Gender, Country, Tumor Type) and 4 as numeric data (Age, Tumor Size, Genetic Risk, Survival Rate).

– The columns include information such as:

– Demographics: Age, Gender, Country

– Tumor Characteristics: Tumor Size, Tumor Location, Tumor Type

– Health Factors: MRI Findings, Genetic Risk, Smoking History, Alcohol Consumption

– Other Medical Conditions: Diabetes, Chronic Illness, Head Injury History

– Outcome Measures: Survival Rate, Brain Tumor Present (Yes/No)

– There are no missing values.
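The same checks can be tried on a small stand-in table. The sketch below uses a hypothetical three-row data frame (`toy`, which is not the real dataset and deliberately contains two NA values) and base R only. It lists the column names, the column types, and the number of missing values per column.

```r
# Hypothetical toy data standing in for brain_tumor_data (not the real file)
toy <- data.frame(
  Age        = c(29, NA, 66),
  Gender     = c("Male", "Other", NA),
  Tumor_Size = c(7.97, 8.65, 8.70),
  stringsAsFactors = FALSE
)

col_names   <- names(toy)            # column names
col_types   <- sapply(toy, class)    # storage type of each column
missing_per <- colSums(is.na(toy))   # NA count per column
any_missing <- anyNA(toy)            # TRUE if at least one NA exists

print(col_names)
print(col_types)
print(missing_per)
print(any_missing)
```

On the real data, `anyNA(brain_tumor_data)` would be a one-line confirmation of the all-zero `colSums()` output.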

Count of different tumor types

tumor_distribution <- brain_tumor_data %>%
group_by(Tumor_Type) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
print(tumor_distribution)

## # A tibble: 2 × 2
##   Tumor_Type  Count
##   <chr>       <int>
## 1 Benign     125204
## 2 Malignant  124796

Interpretation

It shows the count of two types of brain tumors in the dataset:

– Benign tumors: 125,204 cases

– Malignant tumors: 124,796 cases
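To see how balanced the two classes are, the counts can be converted to percentages. This is a small sketch that reuses only the two counts printed above; the variable names are illustrative.

```r
# Counts copied from the tumor_distribution output above
tumor_counts <- c(Benign = 125204, Malignant = 124796)

# Share of each tumor type, in percent
tumor_share <- 100 * prop.table(tumor_counts)
print(round(tumor_share, 2))

# Absolute gap between the two classes
class_gap <- unname(tumor_counts["Benign"] - tumor_counts["Malignant"])
print(class_gap)
```

Both shares come out within a tenth of a percentage point of 50%, so the dataset is close to perfectly balanced between the two tumor types.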

Level 2: Data Extraction & Filtering

Find the most common tumor type

most_common_tumor <- brain_tumor_data %>%
count(Tumor_Type) %>%
arrange(desc(n)) %>%
head(1)
print(most_common_tumor)

## # A tibble: 1 × 2
##   Tumor_Type      n
##   <chr>       <int>
## 1 Benign     125204

Interpretation

– The dataset contains two tumor types, Benign and Malignant.

– “Benign” is the more common type, with 125,204 patients.

– The margin is small: Malignant has 124,796 patients, only 408 fewer, so the two classes are almost perfectly balanced.

Filter patients under 30

young_patients <- brain_tumor_data %>%
filter(Age < 30)
print(head(young_patients))

## # A tibble: 6 × 21
##     Age Gender Country Tumor_Size Tumor_Location MRI_Findings Genetic_Risk
##   <dbl> <chr>  <chr>        <dbl> <chr>          <chr>               <dbl>
## 1    29 Male   Germany       7.97 Frontal        Abnormal               70
## 2     5 Other  Brazil        8.65 Parietal       Abnormal               68
## 3    19 Other  Russia        6.86 Temporal       Normal                 81
## 4    16 Female USA           8.06 Frontal        Normal                 47
## 5    17 Other  USA           9.66 Parietal       Abnormal               89
## 6    29 Female India         1.11 Parietal       Abnormal               34
## # ℹ 14 more variables: Smoking_History <chr>, Alcohol_Consumption <chr>,
## #   Radiation_Exposure <chr>, Head_Injury_History <chr>, Chronic_Illness <chr>,
## #   Blood_Pressure <chr>, Diabetes <chr>, Tumor_Type <chr>,
## #   Treatment_Received <chr>, `Survival_Rate(%)` <dbl>,
## #   Tumor_Growth_Rate <chr>, Family_History <chr>, Symptom_Severity <chr>,
## #   Brain_Tumor_Present <chr>

Interpretation

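– The filter keeps only patients younger than 30; head() prints the first six of them.

– Each row shows age, gender, country, tumor details, MRI findings, and risk factors.

– This subset is a starting point for analyzing tumor trends in young patients.

– The filter by itself does not identify risk factors; that requires summarizing the subset or comparing it with older patients.

– Such comparisons could inform early diagnosis and preventive measures.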
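Printing the first rows shows what the subset looks like, but a summary says more about it. The sketch below works on a made-up six-row data frame (`toy`; the values are invented for illustration) and uses base R to count the under-30 patients, their share of all rows, and the share of malignant tumors among them.

```r
# Made-up rows for illustration only (not taken from the real dataset)
toy <- data.frame(
  Age        = c(29, 5, 66, 87, 19, 43),
  Tumor_Type = c("Malignant", "Benign", "Malignant",
                 "Malignant", "Malignant", "Benign"),
  stringsAsFactors = FALSE
)

young <- toy[toy$Age < 30, ]                     # same condition as filter(Age < 30)

n_young               <- nrow(young)             # number of patients under 30
share_young           <- n_young / nrow(toy)     # their share of all rows
malignant_share_young <- mean(young$Tumor_Type == "Malignant")

print(n_young)
print(share_young)
print(malignant_share_young)
```

The same three summaries could be computed on `young_patients` and compared with the rest of the data.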

Filter patients with malignant tumors

high_malignancy_cases <- brain_tumor_data %>%
filter(Tumor_Type == "Malignant")
print(head(high_malignancy_cases))

## # A tibble: 6 × 21
##     Age Gender Country   Tumor_Size Tumor_Location MRI_Findings Genetic_Risk
##   <dbl> <chr>  <chr>          <dbl> <chr>          <chr>               <dbl>
## 1    66 Other  China           8.7  Cerebellum     Severe                 81
## 2    87 Female Australia       8.14 Temporal       Normal                 65
## 3    84 Female Brazil          7.94 Temporal       Abnormal               47
## 4    29 Male   Germany         7.97 Frontal        Abnormal               70
## 5    19 Other  Russia          6.86 Temporal       Normal                 81
## 6    43 Other  Australia       1.59 Temporal       Abnormal               58
## # ℹ 14 more variables: Smoking_History <chr>, Alcohol_Consumption <chr>,
## #   Radiation_Exposure <chr>, Head_Injury_History <chr>, Chronic_Illness <chr>,
## #   Blood_Pressure <chr>, Diabetes <chr>, Tumor_Type <chr>,
## #   Treatment_Received <chr>, `Survival_Rate(%)` <dbl>,
## #   Tumor_Growth_Rate <chr>, Family_History <chr>, Symptom_Severity <chr>,
## #   Brain_Tumor_Present <chr>

Interpretation

– The filter keeps only patients diagnosed with malignant (cancerous) brain tumors; head() prints the first six of them.

– Each row shows age, gender, country, tumor details, MRI findings, and risk factors.

– The subset can be summarized to look for common traits in malignant cases.

– Such summaries are useful for early detection, treatment planning, and risk assessment.

Level 3: Grouping & Summarization

Calculate average age per tumor type

avg_age <- brain_tumor_data %>%
group_by(Tumor_Type) %>%
summarise(Average_Age = mean(Age, na.rm = TRUE))
print(avg_age)

## # A tibble: 2 × 2
##   Tumor_Type Average_Age
##   <chr>            <dbl>
## 1 Benign            47.0
## 2 Malignant         47.0

Interpretation

– The average age is 47.0 years for both benign (non-cancerous) and malignant (cancerous) brain tumors.

– Because the two averages are identical to one decimal place, age does not distinguish between the tumor types in this dataset.

– On its own, an average age of 47 describes the sample; it does not show that people around this age are at higher risk.

Count frequency of each tumor location

Find the most frequent location

location_count <- table(brain_tumor_data$Tumor_Location)
most_common_location <- names(location_count[which.max(location_count)])
print(most_common_location)

## [1] "Parietal"

Interpretation

“Parietal” is the most common location where brain tumors are found in this dataset.

Identify the age group with the highest tumor risk

age_risk <- brain_tumor_data %>%
group_by(Age) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
print(head(age_risk))

## # A tibble: 6 × 2
##     Age Count
##   <dbl> <int>
## 1    25  3061
## 2    46  3060
## 3    28  3025
## 4    39  3014
## 5     6  3011
## 6    26  3011

Interpretation

– The table lists the six single ages with the most records in the dataset.

– The counts are nearly identical (3,011 to 3,061), so no age stands out among them.

– Record counts alone do not measure risk. Identifying high-risk age groups would need the number of cases relative to the population at each age.

Level 4: Sorting & Ranking Data

Rank tumors by severity level

tumor_rank <- brain_tumor_data %>%
mutate(Symptom_Severity_Num = as.numeric(factor(Symptom_Severity, levels = c("Mild", "Moderate", "Severe")))) %>%
group_by(Tumor_Type) %>%
summarise(Average_Severity = mean(Symptom_Severity_Num, na.rm = TRUE)) %>%
arrange(desc(Average_Severity))
print(tumor_rank)

## # A tibble: 2 × 2
##   Tumor_Type Average_Severity
##   <chr>                 <dbl>
## 1 Benign                 2.00
## 2 Malignant              2.00

Interpretation

– Symptom severity is converted into numerical values:

o “Mild” → 1

o “Moderate” → 2

o “Severe” → 3

– The code then calculates the average severity score for each tumor type (Benign and Malignant).

– Both Benign and Malignant tumors have an average severity score of 2.00.

Count occurrences of each severity level for each tumor type

severity_distribution <- table(brain_tumor_data$Tumor_Type, brain_tumor_data$Symptom_Severity)
print(severity_distribution)

##            
##              Mild Moderate Severe
##   Benign    41818    41522  41864
##   Malignant 41663    41443  41690

Interpretation

– The number of cases in each severity category is almost equal for both Benign and Malignant tumors.

– Example:

– Benign Tumors

o Mild: 41,818

o Moderate: 41,522

o Severe: 41,864

– Malignant Tumors

o Mild: 41,663

o Moderate: 41,443

o Severe: 41,690

– Since the three severity levels are nearly equally frequent within each tumor type, the mean severity score for both types lands very close to 2, the midpoint of the 1 to 3 scale.

– This explains why the earlier ranking showed Benign = 2.00 and Malignant = 2.00.
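The link between the counts and the average score can be checked by hand. The sketch below recomputes each tumor type's mean severity as a weighted average, using only the counts from the table above and the 1/2/3 coding described earlier.

```r
# Counts copied from the severity_distribution table above
severity_counts <- rbind(
  Benign    = c(Mild = 41818, Moderate = 41522, Severe = 41864),
  Malignant = c(Mild = 41663, Moderate = 41443, Severe = 41690)
)
scores <- c(Mild = 1, Moderate = 2, Severe = 3)

# Weighted mean per tumor type: sum(count * score) / sum(count)
mean_severity <- as.vector(severity_counts %*% scores) / rowSums(severity_counts)
print(round(mean_severity, 4))
```

Both values differ from 2 only in the fourth decimal place, which is why the printed tibble shows 2.00 for each tumor type.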

Identify major risk factors

risk_factors <- colSums(brain_tumor_data[, c("Genetic_Risk", "Smoking_History", "Radiation_Exposure", "Family_History")] == "Yes", na.rm = TRUE)
print(risk_factors)

##       Genetic_Risk    Smoking_History Radiation_Exposure     Family_History 
##                  0             125150                  0             124964

Interpretation

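– The code counts the number of “Yes” values in four columns treated as risk factors for brain tumors:

o Genetic Risk

o Smoking History

o Radiation Exposure

o Family History

– Genetic Risk: 0 → This zero is an artifact. Genetic_Risk is a numeric column (values such as 70 or 81), so it can never equal “Yes”.

– Smoking History: 125,150 → About half of all records (50.1%) report a smoking history.

– Radiation Exposure: 0 → No record has the value “Yes”. The column is probably coded with other labels, so this count does not show that radiation exposure is absent; its values should be inspected first (for example with unique()).

– Family History: 124,964 → About half of all records (50.0%) report a family history of brain tumors.

– Only Smoking History and Family History are actually measured by this count. Each is present in roughly half of the records, which by itself does not show that either is a significant risk factor; that would require comparing tumor outcomes between the “Yes” and “No” groups.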
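A zero count from a comparison with “Yes” can mean either that nobody answered “Yes” or that the column is not a Yes/No column at all. The sketch below shows the difference on a hypothetical three-row data frame. The `Low`/`Medium`/`High` labels for Radiation_Exposure are an assumption made for illustration, and `count_yes()` is a helper invented here, not part of any package.

```r
# Hypothetical toy rows; the column coding below is an assumption
toy <- data.frame(
  Genetic_Risk       = c(70, 68, 81),                # numeric score
  Smoking_History    = c("Yes", "No", "Yes"),
  Radiation_Exposure = c("Low", "High", "Medium"),   # assumed labels
  Family_History     = c("No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Count "Yes" only where the column really is a Yes/No text column; otherwise NA
count_yes <- function(x) {
  if (is.character(x) && all(x %in% c("Yes", "No"))) sum(x == "Yes") else NA_integer_
}

naive_counts   <- colSums(toy == "Yes")    # silently gives 0 for mismatched columns
checked_counts <- sapply(toy, count_yes)   # NA flags columns that need other handling

print(naive_counts)
print(checked_counts)
```

Returning NA instead of 0 makes it obvious which columns need a different summary, such as a mean for a numeric score or a table() of the labels.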

Count tumor cases by gender

gender_distribution <- brain_tumor_data %>%
group_by(Gender) %>%
summarise(Count = n())
print(gender_distribution)

## # A tibble: 3 × 2
##   Gender Count
##   <chr>  <int>
## 1 Female 83375
## 2 Male   83407
## 3 Other  83218

Interpretation

The output shows the distribution of brain tumor cases by Gender.

– Female Cases: 83,375

– Male Cases: 83,407

– Other Cases: 83,218

– The counts are very close to each other, meaning no gender category is noticeably overrepresented.

– Since all three categories are almost the same size, the dataset shows no gender imbalance in tumor occurrence.

– If one gender had significantly more cases, we could explore potential risk factors. Since all values are close, gender does not seem to be a major differentiating factor in tumor distribution.
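Whether the three counts differ by more than chance would allow can be checked with a chi-squared goodness-of-fit test. This is a sketch that uses only the counts printed above and tests them against equal proportions, which is the default for `chisq.test()` on a single vector.

```r
# Counts copied from the gender_distribution output above
gender_counts <- c(Female = 83375, Male = 83407, Other = 83218)

# Goodness-of-fit test against equal proportions (1/3 each by default)
gof <- chisq.test(gender_counts)

print(gof$statistic)
print(gof$p.value)
```

A p-value well above 0.05 is consistent with the three categories being equally sized apart from random variation.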

Level 5: Feature Engineering

Create a new column for risk category based on age

brain_tumor_data <- brain_tumor_data %>%
mutate(Risk_Category = case_when(
Age <= 25 ~ "Low Risk",
Age > 25 & Age <= 50 ~ "Moderate Risk",
Age > 50 ~ "High Risk"
))
# Count of each risk category
table(brain_tumor_data$Risk_Category)

## 
##     High Risk      Low Risk Moderate Risk 
##        114510         61647         73843

Interpretation

This output categorizes individuals into different risk levels based on age and counts the number of cases in each category.

– Risk Categories Based on Age:

o Low Risk (Age ≤ 25): 61,647 cases

o Moderate Risk (Age 26–50): 73,843 cases

o High Risk (Age > 50): 114,510 cases

– Most Cases are in the High-Risk Group:

o The highest number of cases (114,510) falls into the High Risk category (Age > 50).

o This does not by itself show that older individuals are more likely to develop brain tumors: the High Risk band covers more years of age than the other two, and the largest per-age counts reported earlier are all close to 3,000 records.

– Moderate Risk Group is the Second Largest:

o The Moderate Risk group has 73,843 cases, covering the 25 ages from 26 to 50.

– Young People Have the Fewest Cases:

o The Low Risk group (Age ≤ 25) has the fewest cases (61,647), and it also spans the fewest years of age.

– The category sizes therefore mostly reflect how wide each age band is. The labels “Low”, “Moderate”, and “High” are assigned by age alone and are not evidence that risk increases with age in this dataset.
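One caveat when reading these counts is that the three age bands do not cover the same number of years. The sketch below shows what the category shares would look like if every age were equally common. It assumes, purely for illustration, that ages run from 5 to 89 with one record per age; that range is an assumption and is not read from the dataset.

```r
# Assumed for illustration only: one record at every whole age from 5 to 89
ages <- 5:89

risk_category <- ifelse(ages <= 25, "Low Risk",
                 ifelse(ages <= 50, "Moderate Risk", "High Risk"))

# Number of distinct ages (band width in years) per category, and its share
band_width <- table(risk_category)
band_share <- round(100 * prop.table(band_width), 1)
print(band_share)

# Shares observed in the Risk_Category table above, for comparison
observed <- c(`High Risk` = 114510, `Low Risk` = 61647, `Moderate Risk` = 73843)
observed_share <- round(100 * observed / sum(observed), 1)
print(observed_share)
```

If the two sets of shares are close, the category sizes are explained by band width alone, with no need for risk to change with age.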

Find the most effective treatment based on survival rates

treatment_effectiveness <- brain_tumor_data %>%
group_by(Treatment_Received) %>%
summarise(Average_Survival = mean(`Survival_Rate(%)`, na.rm = TRUE)) %>%
arrange(desc(Average_Survival))
print(treatment_effectiveness)

## # A tibble: 4 × 2
##   Treatment_Received Average_Survival
##   <chr>                         <dbl>
## 1 Surgery                        54.6
## 2 None                           54.5
## 3 Radiation                      54.4
## 4 Chemotherapy                   54.3

Interpretation

The table shows the average survival rate (%) of patients by treatment received.

– Surgery has the highest average survival rate (54.6%), but only by a narrow margin.

– No Treatment (None) comes second with 54.5%, almost the same as surgery. Some patients may not have required treatment, or other factors may be influencing survival.

– Radiation follows closely with 54.4%.

– Chemotherapy has the lowest average survival rate (54.3%) of the four groups.

– The whole range is only 0.3 percentage points, so these averages do not establish that any treatment is more effective than another. The ANOVA later in this report finds no statistically significant difference between treatments (p = 0.235).

– An untreated group that survives as well as the treated groups suggests that, in this dataset, survival rate is essentially unrelated to treatment.

Level 6: Visualizations

Is there a correlation between Age and Survival Rate?

# Select only Age and Survival Rate columns
correlation_data <- brain_tumor_data[, c("Age", "Survival_Rate(%)")]
cor_matrix <- cor(correlation_data)
print(cor_matrix)

##                          Age Survival_Rate(%)
## Age              1.000000000      0.002885231
## Survival_Rate(%) 0.002885231      1.000000000

corrplot(cor_matrix, method = "circle", type = "upper",
         tl.col = "black", tl.srt = 45)

Interpretation

– Correlation analysis was done between Age and Survival Rate (%).

– The correlation coefficient is 0.0029, very close to 0.

– This means there is almost no linear relationship between Age and Survival Rate.

– Finding:

– Patients’ Age does not significantly affect their Survival Rate.

– Conclusion:

– Whether a patient is younger or older makes no meaningful difference to the survival rate in this dataset.
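“Very close to 0” can be made precise by testing whether the correlation differs from zero. The sketch below applies the standard t test for a Pearson correlation, using only the coefficient and row count reported above; the variable names are illustrative.

```r
# Values reported above
r <- 0.002885231   # correlation between Age and Survival_Rate(%)
n <- 250000        # number of rows

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(t_stat, 3))
print(round(p_value, 3))
```

With the full data available, `cor.test()` on the two columns would give the same test directly.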

Is there a correlation between Tumor Size and Survival Rate?

# Select only Tumor Size and Survival Rate columns
correlation_data <- brain_tumor_data[, c("Tumor_Size", "Survival_Rate(%)")]
# Calculate the correlation matrix (cor() defaults to use = "everything", which is fine here because no values are missing)
cor_matrix <- cor(correlation_data)
print(cor_matrix)

##                   Tumor_Size Survival_Rate(%)
## Tumor_Size       1.000000000      0.001916102
## Survival_Rate(%) 0.001916102      1.000000000

corrplot(cor_matrix, method = "circle", type = "upper",
         tl.col = "black", tl.srt = 45)

Interpretation

– The correlation coefficient between Tumor Size and Survival Rate is 0.0019.

– This value is very close to 0, meaning almost no linear correlation.

– Finding:

– Patients with large tumors and small tumors have similar survival rates.

– Conclusion:

– Tumor Size does not significantly affect Survival Rate.

Predict Survival Rate based on Age

model1 <- lm(`Survival_Rate(%)` ~ Age, data = brain_tumor_data)
summary(model1)

## 
## Call:
## lm(formula = `Survival_Rate(%)` ~ Age, data = brain_tumor_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.610 -22.497   0.393  22.507  44.647 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.337840   0.112362 483.596   <2e-16 ***
## Age          0.003060   0.002121   1.443    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26 on 249998 degrees of freedom
## Multiple R-squared:  8.325e-06,  Adjusted R-squared:  4.325e-06 
## F-statistic: 2.081 on 1 and 249998 DF,  p-value: 0.1491

# Scatter plot with a trend line
ggplot(brain_tumor_data, aes(x = Age, y = `Survival_Rate(%)`)) +
  geom_point(color = "darkblue", size = 2) +
  geom_smooth(method = "lm", color = "red") +   # Regression line
  labs(title = "Scatter Plot: Age vs Survival Rate",
       x = "Age",
       y = "Survival Rate (%)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Interpretation

– Age has a very weak and non-significant association with Survival Rate.

– Each additional year of age is associated with an increase of about 0.003 percentage points in Survival Rate, which is practically meaningless.

– The p-value (0.149) is above 0.05, so the relationship is not statistically significant: the data give no evidence that the slope differs from zero.

– The R-squared (about 0.0008%) means that Age explains essentially none of the variation in survival rate among patients.

– A scatter plot was created between Age and Survival Rate (%).

– Blue dots show individual patient data.

– The red line (trend line) is almost flat.

– This means Age has almost no effect on Survival Rate.

– Conclusion: No meaningful relationship between Age and Survival Rate.
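The size of the Age effect is easier to judge as a predicted difference between two patients. The sketch below plugs two ages into the fitted line, using the coefficients copied from the summary above; `predict_survival()` is a helper defined here for illustration.

```r
# Coefficients copied from summary(model1) above
intercept <- 54.337840
slope_age <- 0.003060

# Fitted survival rate (%) at a given age
predict_survival <- function(age) intercept + slope_age * age

fitted_20 <- predict_survival(20)
fitted_80 <- predict_survival(80)

print(c(age_20 = fitted_20, age_80 = fitted_80))
print(fitted_80 - fitted_20)   # predicted change over 60 years of age
```

Across 60 years of age the fitted survival rate moves by less than 0.2 percentage points, against a residual standard error of 26.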

Predict Survival Rate using Age, Tumor Size, and Symptom Severity

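# Set the level order explicitly so Mild = 1, Moderate = 2, Severe = 3 (same coding as in the severity ranking)
brain_tumor_data$Symptom_Score <- as.numeric(factor(brain_tumor_data$Symptom_Severity, levels = c("Mild", "Moderate", "Severe")))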
model2 <- lm(`Survival_Rate(%)` ~ Age + Tumor_Size + Symptom_Score, data = brain_tumor_data)
summary(model2)

## 
## Call:
## lm(formula = `Survival_Rate(%)` ~ Age + Tumor_Size + Symptom_Score, 
##     data = brain_tumor_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.723 -22.499   0.322  22.508  44.772 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   54.150782   0.196455 275.640   <2e-16 ***
## Age            0.003051   0.002121   1.438    0.150    
## Tumor_Size     0.018125   0.018960   0.956    0.339    
## Symptom_Score  0.046133   0.063617   0.725    0.468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26 on 249996 degrees of freedom
## Multiple R-squared:  1.409e-05,  Adjusted R-squared:  2.086e-06 
## F-statistic: 1.174 on 3 and 249996 DF,  p-value: 0.318

Interpretation

– The p-values for all three predictors (Age, Tumor Size, and Symptom Severity) are greater than 0.05, meaning none of them is statistically significant.

– The R-squared (~0%) tells us that these three variables together explain almost none of the variation in Survival Rate.

– The F-statistic p-value (0.318) confirms that the overall model is not significant.
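The overall F-statistic follows directly from the R-squared, which shows why a near-zero R-squared produces a non-significant model even with 250,000 rows. The sketch below recomputes it from the values printed in the summary above; the variable names are illustrative.

```r
# Values copied from summary(model2) above
r_squared <- 1.409e-05   # Multiple R-squared
n <- 250000              # number of rows
k <- 3                   # number of predictors

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
f_stat  <- (r_squared / k) / ((1 - r_squared) / (n - k - 1))
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

print(round(f_stat, 3))
print(round(p_value, 3))
```

The result should agree with the F-statistic and p-value in the last line of the summary, up to rounding of the printed R-squared.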

Visualize the number of patients by Symptom Severity (Bar Plot)

ggplot(brain_tumor_data, aes(x = Symptom_Severity)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Number of Patients by Symptom Severity", x = "Severity", y = "Count") +
  theme_minimal()

Interpretation

– Mild and Severe symptoms are almost equal in number (83,481 and 83,554 patients in the severity table above).

– Moderate symptoms are slightly less common (82,965 patients).

– Overall, the number of patients is very similar across all three symptom severity levels.

Does the location of the tumor affect the survival rate?

Scenario: I want to find out whether the survival rate changes based on where in the brain the tumor is located.

anova_location <- aov(`Survival_Rate(%)` ~ Tumor_Location, data = brain_tumor_data)
summary(anova_location)

##                    Df    Sum Sq Mean Sq F value Pr(>F)
## Tumor_Location      4      3358   839.5   1.242  0.291
## Residuals      249995 168995087   676.0

Interpretation

Since the p-value is 0.291 (> 0.05):

– We fail to reject the null hypothesis that the mean survival rate is the same for every tumor location.

– Tumor location does not significantly affect the survival rate.

This means that where the tumor is located in the brain makes no detectable difference to patients’ survival rates.
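The F value and p-value in the ANOVA table can be rebuilt from the sums of squares, which also yields an effect size. The sketch below uses only the numbers printed in the table above; eta-squared (between-group sum of squares divided by the total) is added here as a common effect-size measure and is not part of the original output.

```r
# Values copied from summary(anova_location) above
ss_between <- 3358
df_between <- 4
ss_within  <- 168995087
df_within  <- 249995

ms_between <- ss_between / df_between   # mean square for Tumor_Location
ms_within  <- ss_within / df_within     # mean square of the residuals

f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Share of total variation explained by tumor location
eta_squared <- ss_between / (ss_between + ss_within)

print(round(f_stat, 3))
print(round(p_value, 3))
print(eta_squared)
```

An eta-squared this small means tumor location accounts for a negligible fraction of the variation in survival rate.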

Histogram of Survival Rate (%) Distribution

ggplot(brain_tumor_data, aes(x = `Survival_Rate(%)`)) +
  geom_histogram(fill = "tomato", bins = 20, color = "black") +
  labs(title = "Distribution of Survival Rates", x = "Survival Rate (%)", y = "Count") +
  theme_minimal()

Interpretation

– Survival rates are widely spread across patients, with no single survival rate dominating.

– Patients are almost evenly distributed across the range of survival percentages.

– This indicates a lot of variation in patients’ survival outcomes.

Compare Survival Rates by Gender (Boxplot)

ggplot(brain_tumor_data, aes(x = Gender, y = `Survival_Rate(%)`)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Survival Rate by Gender", x = "Gender", y = "Survival Rate (%)") +
  theme_minimal()

Interpretation:

– Median survival rates for Female, Male, and Other genders are almost the same (around 55%, in line with the overall average of about 54.5%).

– The spread (variation) of survival rates is similar for all genders.

– Minimum and maximum survival rates are nearly equal across all gender groups.

– No extreme outliers are visible in any gender category.

Bar Plot of Treatment Received

ggplot(brain_tumor_data, aes(x = Treatment_Received)) +
  geom_bar(fill = "salmon") +
  labs(title = "Number of Patients per Treatment", x = "Treatment", y = "Count") +
  theme_minimal()

Interpretation:

– The number of patients is almost equal across all treatment types: Chemotherapy, Radiation, Surgery, and even No Treatment.

– No treatment type stands out with a very high or very low patient count.

– Each treatment category has roughly 62,500 patients (250,000 patients split four ways).

– “None” was recorded for almost as many patients as each active treatment (chemotherapy, radiation, or surgery).

Boxplot: Tumor Size by Tumor Type

ggplot(brain_tumor_data, aes(x = Tumor_Type, y = Tumor_Size)) +
  geom_boxplot(fill = "violet") +
  labs(title = "Tumor Size per Tumor Type", x = "Tumor Type", y = "Tumor Size") +
  theme_minimal()

Interpretation:

– Benign and Malignant tumors have almost the same tumor size distribution.

– The median tumor size (middle line inside the box) is very similar for both tumor types.

– Both tumor types have similar ranges, from very small tumors (close to 0) up to tumors larger than 10 units.

– No major outliers or extreme differences are visible between benign and malignant tumors.

– Key Findings:

– Tumor size alone does not clearly differentiate between Benign and Malignant tumors.

– Even small tumors can be malignant, and large tumors can still be benign.

– Other factors (like tumor location, symptom severity, or genetic risk) might be more important for predicting tumor type.

Histogram of Age Distribution

ggplot(brain_tumor_data, aes(x = as.numeric(Age))) +
  geom_histogram(fill = "lightblue", bins = 20, color = "black") +
  labs(title = "Distribution of Patient Ages", x = "Age", y = "Number of Patients") +
  theme_minimal()

Interpretation

– The age distribution of patients is fairly even across different age groups.

– Most age groups, whether young (20s) or older (60s-70s), have a similar number of patients.

– The bars for the youngest (under 10 years) and oldest (above 80 years) patients are slightly shorter. This may be a binning artifact at the edges of the age range rather than a real dip: in the per-age counts shown earlier, age 6 is among the most frequent ages (3,011 records).

– Key Findings:

– Records are spread almost equally across ages, from children to older people.

– No age range stands out as needing special focus on the basis of this histogram.

– The histogram does not show that pediatric (child) brain tumors are less common in this dataset.

Visualizing Risk Category Count

ggplot(brain_tumor_data, aes(x = Risk_Category)) +
  geom_bar(fill = "orange") +
  labs(title = "Patient Count by Risk Category", x = "Risk Level", y = "Number of Patients") +
  theme_minimal()

Interpretation:

– High Risk patients (above 50 years old) are the largest group.

– Moderate Risk patients (aged 26–50) are fewer than High Risk, but still a large group.

– Low Risk patients (aged 25 years and below) are the fewest.

– Key Findings:

– Older patients (High Risk) form the largest share of records.

– Young adults and children (Low Risk) account for fewer records than older adults.

– The bands cover different numbers of years, so bar height mostly reflects band width; the chart alone does not justify prioritizing the High Risk group for screening.

Compare Survival Rate by Tumor Location

ggplot(brain_tumor_data, aes(x = Tumor_Location, y = `Survival_Rate(%)`)) +
  geom_boxplot(fill = "gold") +
  labs(title = "Survival Rate by Tumor Location", x = "Tumor Location", y = "Survival Rate (%)") +
  theme_minimal()

Interpretation:

– The median survival rate is almost the same across all tumor locations (Cerebellum, Frontal, Occipital, Parietal, Temporal).

– The spread (variability) of survival rates looks very similar for each location.

– There are some patients with very low survival rates in every location (seen as longer lower whiskers).

– Key Findings:

– Tumor location (Cerebellum, Frontal, Occipital, Parietal, Temporal) does not drastically affect the survival rate.

– Median survival rates are almost identical across different parts of the brain.

– Wide variability in survival exists within each group, so survival depends on factors other than tumor location.

How does the average Survival Rate change with Age?

avg_survival <- brain_tumor_data %>%
  group_by(Age) %>%
  summarise(Average_Survival = mean(`Survival_Rate(%)`, na.rm = TRUE))

ggplot(avg_survival, aes(x = Age, y = Average_Survival)) +
  geom_line(color = "purple") +
  labs(title = "Average Survival Rate by Age", x = "Age", y = "Average Survival Rate (%)") +
  theme_minimal()

Interpretation:

– The average survival rate is not constant; it fluctuates from age to age.

– It mostly stays between 54% and 55.5%.

– There are some sharp increases and decreases at certain ages.

Key Findings:

– The fluctuations are small and irregular, with no steady increase or decrease, which is what sampling noise around a constant mean would look like.

– This agrees with the regression above, where the Age coefficient is not statistically significant (p = 0.149).

– Survival rate is effectively stable across ages.
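A rough noise check helps decide whether the ups and downs in the line chart mean anything. The sketch below estimates the standard error of a per-age average, assuming a standard deviation of about 26 (the residual standard error from the regression summaries) and about 3,000 records per age (the order of magnitude of the per-age counts shown earlier). Both inputs are approximations.

```r
# Approximate inputs taken from earlier output (assumptions, not exact values)
residual_sd <- 26     # spread of individual survival rates, in percentage points
n_per_age   <- 3000   # records behind each per-age average

# Standard error of a mean of n_per_age observations
se_mean <- residual_sd / sqrt(n_per_age)

# Approximate width of a +/- 2 standard error band around the overall mean
noise_band <- 2 * se_mean

print(round(se_mean, 3))
print(round(noise_band, 3))
```

Averages wandering between 54% and 55.5% stay inside a band of roughly plus or minus one percentage point around the overall mean of about 54.5%, so the fluctuations are consistent with random noise.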

Is there a difference in Survival Rate based on Treatment Received?

Scenario: I want to check whether Surgery, Radiation, Chemotherapy, and no treatment result in different average survival rates.

anova_treatment <- aov(`Survival_Rate(%)` ~ Treatment_Received, data = brain_tumor_data)
summary(anova_treatment)

##                        Df    Sum Sq Mean Sq F value Pr(>F)
## Treatment_Received      3      2877   958.9   1.418  0.235
## Residuals          249996 168995568   676.0

Interpretation:

– The p-value is 0.235 (greater than 0.05).

– The F-value is 1.418, which is small.

– Key Findings:

– Since the p-value is greater than 0.05, we fail to reject the null hypothesis that all treatment groups have the same mean survival rate.

– Conclusion:

– There is no statistically significant difference in survival rates between different treatments (Surgery, Radiation, Chemotherapy, or None).

– Treatments did not show a measurable impact on survival rate in this dataset.

Grouped Bar Chart: Gender and Treatment Received

ggplot(brain_tumor_data, aes(x = Treatment_Received, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(title = "Treatment Distribution by Gender", x = "Treatment", y = "Count") +
  theme_minimal()

Interpretation:

– The bar chart shows the number of patients (Count) by Treatment Type and Gender (Female, Male, Other).

– The bars for all three genders (red for Female, green for Male, blue for Other) are almost equal in height for each treatment type (Chemotherapy, None, Radiation, Surgery).

– Key Findings:

– Treatment usage is fairly balanced across all genders.

– There is no major gender-based difference in the number of patients receiving a particular treatment.

– Whether it is Chemotherapy, Radiation, Surgery, or no treatment, males, females, and others receive each option in similar numbers.

Exploring Brain Tumor Trends

Neetu and Ramandeep

2025-04-25
