Introduction

This document presents several visualizations of the brain tumor dataset. They explore tumor growth rates, tumor types, survival outcomes, tumor size by age, and more. Each “Tab” below presents one plot together with its corresponding code.

Dataset

Below is a snapshot of the dataset and its structure:

library(ggplot2)


getwd()

## [1] "/Users/dibaramezan/Desktop/SPRING 25/DATA VISUALIZATION/Module 1"

setwd("/Users/dibaramezan/Desktop/Spring 25/DATA VISUALIZATION/Module 1")
# Confirm the CSV is present in the working directory before reading it
file.exists("Brain Tumor Prediction Dataset.csv")

## [1] TRUE

filename <- "Brain Tumor Prediction Dataset.csv"
df <- read.csv(filename, header = TRUE)
head(df)

##   Age Gender   Country Tumor_Size Tumor_Location MRI_Findings Genetic_Risk
## 1  66  Other     China       8.70     Cerebellum       Severe           81
## 2  87 Female Australia       8.14       Temporal       Normal           65
## 3  41   Male    Canada       6.02      Occipital       Severe          100
## 4  52   Male     Japan       7.26      Occipital       Normal           19
## 5  84 Female    Brazil       7.94       Temporal     Abnormal           47
## 6  29   Male   Germany       7.97        Frontal     Abnormal           70
##   Smoking_History Alcohol_Consumption Radiation_Exposure Head_Injury_History
## 1              No                 Yes             Medium                  No
## 2              No                 Yes             Medium                  No
## 3             Yes                  No                Low                 Yes
## 4             Yes                 Yes               High                 Yes
## 5              No                 Yes             Medium                  No
## 6             Yes                 Yes             Medium                  No
##   Chronic_Illness Blood_Pressure Diabetes Tumor_Type Treatment_Received
## 1             Yes         122/88       No  Malignant               None
## 2              No        126/119       No  Malignant               None
## 3              No         118/65       No     Benign       Chemotherapy
## 4              No        165/119      Yes     Benign          Radiation
## 5             Yes         156/97      Yes  Malignant               None
## 6              No          95/85       No  Malignant            Surgery
##   Survival_Rate... Tumor_Growth_Rate Family_History Symptom_Severity
## 1               58              Slow            Yes             Mild
## 2               13             Rapid            Yes           Severe
## 3               67              Slow            Yes         Moderate
## 4               85          Moderate             No         Moderate
## 5               17          Moderate             No         Moderate
## 6               65             Rapid            Yes           Severe
##   Brain_Tumor_Present
## 1                  No
## 2                  No
## 3                 Yes
## 4                 Yes
## 5                  No
## 6                  No

str(df)

## 'data.frame':    250000 obs. of  21 variables:
##  $ Age                : int  66 87 41 52 84 29 5 19 16 43 ...
##  $ Gender             : chr  "Other" "Female" "Male" "Male" ...
##  $ Country            : chr  "China" "Australia" "Canada" "Japan" ...
##  $ Tumor_Size         : num  8.7 8.14 6.02 7.26 7.94 7.97 8.65 6.86 8.06 1.59 ...
##  $ Tumor_Location     : chr  "Cerebellum" "Temporal" "Occipital" "Occipital" ...
##  $ MRI_Findings       : chr  "Severe" "Normal" "Severe" "Normal" ...
##  $ Genetic_Risk       : int  81 65 100 19 47 70 68 81 47 58 ...
##  $ Smoking_History    : chr  "No" "No" "Yes" "Yes" ...
##  $ Alcohol_Consumption: chr  "Yes" "Yes" "No" "Yes" ...
##  $ Radiation_Exposure : chr  "Medium" "Medium" "Low" "High" ...
##  $ Head_Injury_History: chr  "No" "No" "Yes" "Yes" ...
##  $ Chronic_Illness    : chr  "Yes" "No" "No" "No" ...
##  $ Blood_Pressure     : chr  "122/88" "126/119" "118/65" "165/119" ...
##  $ Diabetes           : chr  "No" "No" "No" "Yes" ...
##  $ Tumor_Type         : chr  "Malignant" "Malignant" "Benign" "Benign" ...
##  $ Treatment_Received : chr  "None" "None" "Chemotherapy" "Radiation" ...
##  $ Survival_Rate...   : int  58 13 67 85 17 65 91 18 31 40 ...
##  $ Tumor_Growth_Rate  : chr  "Slow" "Rapid" "Slow" "Moderate" ...
##  $ Family_History     : chr  "Yes" "Yes" "Yes" "No" ...
##  $ Symptom_Severity   : chr  "Mild" "Severe" "Moderate" "Moderate" ...
##  $ Brain_Tumor_Present: chr  "No" "No" "Yes" "Yes" ...
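
Two columns in the structure above need care before analysis: `Survival_Rate...` has a name that must be back-quoted, and `Blood_Pressure` is stored as text. The sketch below shows one way to tidy both on a small hand-made data frame. The original CSV header is not shown in this document, so `Survival_Rate(%)` is only an assumed example of a header that `read.csv()` would sanitise to `Survival_Rate...`; the new column names (`Survival_Rate`, `Systolic`, `Diastolic`) are hypothetical.

```r
# Assumed raw header; read.csv() applies make.names() when check.names = TRUE
raw_header <- "Survival_Rate(%)"
clean_header <- make.names(raw_header)  # each invalid character becomes "."

# Toy rows copied from the head(df) output above
toy <- data.frame(
  Blood_Pressure = c("122/88", "126/119", "118/65"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
toy[["Survival_Rate..."]] <- c(58L, 13L, 67L)

# Rename the awkward column (hypothetical new name)
names(toy)[names(toy) == "Survival_Rate..."] <- "Survival_Rate"

# Split "systolic/diastolic" text into two integer columns
bp_parts <- strsplit(toy$Blood_Pressure, "/", fixed = TRUE)
toy$Systolic  <- as.integer(vapply(bp_parts, `[`, character(1), 1))
toy$Diastolic <- as.integer(vapply(bp_parts, `[`, character(1), 2))

str(toy)
```

With a plain name, the survival column no longer needs back-quotes in later `summarise()` or `case_when()` calls.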

General Findings

Tab 1

Horizontal Stacked Bar Chart of Tumor Growth Rates by Country

This chart displays the proportion of tumor growth rates (Slow, Rapid, Moderate) by country.

library(ggplot2)
library(scales)        # percent_format(), comma
library(RColorBrewer)  # brewer.pal()
library(ggthemes)      # theme_economist()
library(plyr)          # ddply()

# Reorder countries by the total number of observations (largest first)
country_counts <- ddply(df, .(Country), nrow)
df$Country <- factor(df$Country, levels = country_counts$Country[order(country_counts$V1, decreasing = TRUE)])

# Horizontal stacked bar chart: proportion of tumor growth rate (slow, moderate, rapid) by country
ggplot(df, aes(x = Country, fill = Tumor_Growth_Rate)) +
  geom_bar(position = "fill", colour = "black") +   # normalise each bar to 100%
  coord_flip() +                                    # horizontal bars
  scale_y_continuous(labels = percent_format()) +   # show proportions as percentages
  labs(title = "Proportion of Tumor Growth Rates by Country",
       x = "Country",
       y = "Proportion",
       fill = "Tumor Growth Rate") +
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  theme_economist()

Each bar is subdivided into Slow, Moderate, and Rapid proportions that sum to 100%. No country is dominated by a single category; each shows a mix of all three growth rates. The differences between countries are modest: most countries have at least 25–30% in each growth-rate category. This suggests that the distribution of tumor growth rates is fairly consistent across countries in this dataset.
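
The 25–30% reading above can be checked numerically rather than by eye. The following is a minimal sketch using base R on a small made-up data frame (`toy` and its counts are illustrative, not taken from the dataset); on the real data the same two calls would be applied to `df$Country` and `df$Tumor_Growth_Rate`.

```r
# Illustrative data only: 2 countries x 3 growth rates
toy <- data.frame(
  Country = rep(c("Brazil", "Japan"), each = 10),
  Tumor_Growth_Rate = c(
    rep(c("Slow", "Moderate", "Rapid"), times = c(3, 4, 3)),
    rep(c("Slow", "Moderate", "Rapid"), times = c(4, 3, 3))
  )
)

# Counts per country and growth rate
counts <- table(toy$Country, toy$Tumor_Growth_Rate)

# Row-wise proportions: each country's row sums to 1, matching position = "fill"
props <- prop.table(counts, margin = 1)
print(round(props, 2))

# Smallest share of any category within any country
min_share <- min(props)
print(min_share)
```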

Tab 2

Dual Axis Stacked Bar Chart of Tumor Type Distribution and Average Survival Rate by Country

This chart shows the distribution of tumor types (Benign vs. Malignant) as stacked bars with an overlaid line representing the average survival rate per country.

# Dual axis on a stacked bar chart

# Patient counts per country and tumor type
bar_df <- ddply(df, .(Country, Tumor_Type), summarise, count = length(Tumor_Type))

# Total patients per country, used to order the bars (largest first)
total_counts <- ddply(bar_df, .(Country), summarise, total_count = sum(count))

bar_df$Country <- factor(bar_df$Country, 
                         levels = total_counts$Country[order(total_counts$total_count, decreasing = TRUE)])

# Average survival rate per country
survival_df <- ddply(df, .(Country), summarise, avg_survival = mean(`Survival_Rate...`, na.rm = TRUE))

# Use the same country order as the bars
survival_df$Country <- factor(survival_df$Country, levels = levels(bar_df$Country))

# Factor that maps survival rates onto the count axis
max_count <- max(total_counts$total_count)
max_avg_survival <- max(survival_df$avg_survival)
scaling_factor <- max_count / max_avg_survival


p <- ggplot(bar_df, aes(x = Country, y = count, fill = Tumor_Type)) +

  geom_bar(stat = "identity", 
           position = position_stack(reverse = TRUE), 
           colour = "black") +
  coord_flip() +  # Make the bars horizontal
  theme_light() +
  labs(title = "Tumor Type Distribution and Average Survival Rate by Country",
       y = "Patient Count",
       fill = "Tumor Type") +
  theme(plot.title = element_text(hjust = 0.5)) +

  scale_fill_manual(values = c("Benign" = "#FFB6C1",   # Light Pink
                               "Malignant" = "#FF69B4"), # Hot Pink
                    guide = guide_legend(reverse = TRUE)) +
 
  # Overlay the average survival rate, rescaled onto the count axis
  geom_line(data = survival_df, 
            aes(x = Country, y = avg_survival * scaling_factor, group = 1, colour = "Avg Survival Rate"), 
            linewidth = 1, 
            inherit.aes = FALSE) +
  geom_point(data = survival_df, 
             aes(x = Country, y = avg_survival * scaling_factor, group = 1), 
             size = 3, shape = 21, fill = "white", color = "black", 
             inherit.aes = FALSE) +

  scale_color_manual(name = NULL, values = c("Avg Survival Rate" = "black")) +

  # Secondary axis converts the rescaled values back to survival percentages
  scale_y_continuous(labels = comma,
                     sec.axis = sec_axis(~ . / scaling_factor,
                                         name = "Average Survival Rate (%)")) +

  theme(legend.background = element_rect(fill = "transparent"),
        legend.box.background = element_rect(fill = "transparent", colour = NA),
        legend.spacing = unit(-1, "lines"))

print(p)

Each horizontal bar represents the total number of patients in a given country, split between Malignant (darker pink) and Benign (lighter pink) tumors. Some countries (e.g., UK and South Africa) appear to have a relatively higher proportion of malignant cases, while others (e.g., Brazil) show a more balanced or lower proportion. The overlaid line, read against the secondary axis, gives each country’s average survival rate.
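
The overlaid line relies on a simple rescaling: survival rates are multiplied by `scaling_factor` so they share the count axis, and the secondary axis divides by the same factor to show the original values. A small sketch with made-up numbers (the counts and rates below are illustrative, not from the dataset) shows the round trip:

```r
# Illustrative values only
total_count  <- c(UK = 12000, Brazil = 11000, Japan = 12500)
avg_survival <- c(UK = 54.2,  Brazil = 55.1,  Japan = 54.8)

# Same definition as in the chart code above
scaling_factor <- max(total_count) / max(avg_survival)

# Position of each line point on the primary (count) axis
line_y <- avg_survival * scaling_factor

# What the secondary axis displays for those positions
recovered <- line_y / scaling_factor

print(line_y)
print(recovered)
```

By construction, the largest average survival rate is drawn at the height of the tallest bar, so the line always fits within the plotted range.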

Tab 3

Line Chart of Average Tumor Size by Age

This chart displays how the average tumor size varies with age, with an annotation for the maximum average tumor size.

library(ggplot2)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(scales)
library(ggthemes)
library(RColorBrewer)


# Average tumor size at each age
age_tumor <- df %>%
  group_by(Age) %>%
  summarise(avg_tumor_size = mean(Tumor_Size, na.rm = TRUE)) %>%
  ungroup()

p <- ggplot(age_tumor, aes(x = Age, y = avg_tumor_size)) +
  geom_line(color = "#39FF14", linewidth = 1) +
  geom_point(color = "#FF6EC7", size = 2, shape = 21, fill = "white") +
  labs(title = "Average Tumor Size by Age",
       x = "Age",
       y = "Average Tumor Size") +
  theme_economist() +
  scale_y_continuous(labels = comma)

# Label the age(s) with the largest average tumor size
ann <- age_tumor %>% filter(avg_tumor_size == max(avg_tumor_size, na.rm = TRUE))

p + geom_text(data = ann, aes(label = round(avg_tumor_size, 1)),
              vjust = -0.5, size = 3)

Although the average tumor size fluctuates from age to age, the overall range remains fairly tight. There doesn’t appear to be a strong upward or downward trend with increasing age; instead, the line follows a “zigzag” pattern within a narrow corridor. Because the line shows no clear age-related progression, factors other than age (e.g., genetic risk, lifestyle, or treatment differences) may play a larger role in determining tumor size.
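
One way to put a number on the “no clear trend” reading is to fit a straight line to the per-age averages and look at the slope. The sketch below uses a short hand-made series in place of `age_tumor` (the values are illustrative assumptions); a slope near zero relative to the spread of the averages would be consistent with the zigzag pattern described above.

```r
# Illustrative stand-in for age_tumor; values are made up
toy_age <- data.frame(
  Age = 20:29,
  avg_tumor_size = c(5.4, 5.6, 5.3, 5.7, 5.4, 5.6, 5.5, 5.3, 5.6, 5.5)
)

# Width of the corridor the averages move in
size_range <- diff(range(toy_age$avg_tumor_size))

# Linear trend of average tumor size on age
fit <- lm(avg_tumor_size ~ Age, data = toy_age)
slope <- unname(coef(fit)["Age"])

print(size_range)
print(slope)
```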

Tab 4

Heat Map of Average Tumor Size by Country and Tumor Type

This heat map visualizes the average tumor size for each combination of country and tumor type.

The color scale ranges from lighter shades (representing smaller average tumor sizes) to darker/redder shades (representing larger average tumor sizes).

library(ggplot2)
library(dplyr)
library(scales)
library(ggthemes)
library(RColorBrewer)

# Mean tumor size for each country and tumor type; .groups = "drop" returns an ungrouped result
heat_df <- df %>%
  group_by(Country, Tumor_Type) %>%
  summarise(avg_tumor_size = mean(Tumor_Size, na.rm = TRUE), .groups = "drop")

ggplot(heat_df, aes(x = Country, y = Tumor_Type, fill = avg_tumor_size)) +
  geom_tile() +
  labs(title = "Heat Map: Average Tumor Size by Country and Tumor Type",
       x = "Country",
       y = "Tumor Type",
       fill = "Average Tumor Size") +
  theme_economist() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_gradientn(colors = brewer.pal(9, "YlOrRd"), labels = comma)

In some countries, such as USA, Russia, Germany, and Brazil, malignant tumors appear slightly darker (indicating a marginally larger average size), whereas in others, such as South Africa, benign tumors are similar in size or slightly larger on average.
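
Small color differences in a heat map are hard to judge, so it can help to tabulate the gap directly. The sketch below computes the malignant-minus-benign difference in mean tumor size per country with base R; the `toy` data frame is made up for illustration, and on the real data the same `tapply()` call would use `df$Tumor_Size`, `df$Country`, and `df$Tumor_Type`.

```r
# Illustrative data only
toy <- data.frame(
  Country    = rep(c("Brazil", "South Africa"), each = 4),
  Tumor_Type = rep(c("Benign", "Malignant"), times = 4),
  Tumor_Size = c(5.0, 5.4, 5.2, 5.6,   # Brazil
                 5.6, 5.3, 5.4, 5.5)   # South Africa
)

# Mean tumor size for every Country x Tumor_Type cell (same cells as the heat map)
cell_means <- tapply(toy$Tumor_Size, list(toy$Country, toy$Tumor_Type), mean)

# Positive values: malignant tumors are larger on average in that country
size_gap <- cell_means[, "Malignant"] - cell_means[, "Benign"]

print(cell_means)
print(size_gap)
```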

Tab 5

Nested Donut Chart: Treatment Received, Genetic Risk, and Survival Rate

This interactive Plotly chart presents a nested donut chart with three rings:

Outer ring: Treatment Received
Middle ring: Genetic Risk (binned into Low/Medium/High)
Inner ring: Survival Rate (binned into Low/Medium/High)
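
The Low/Medium/High binning used for the middle and inner rings can also be written with base R’s `cut()`. This is a sketch of an equivalent rule, assuming the same thresholds as the chart code (below 34 is Low, 34 to 66 is Medium, 67 and above is High); the helper name `bin_three` is hypothetical.

```r
# Hypothetical helper: three-level binning with the thresholds 34 and 67
bin_three <- function(x) {
  cut(x,
      breaks = c(-Inf, 34, 67, Inf),
      labels = c("Low", "Medium", "High"),
      right = FALSE)  # intervals are [a, b), so 34 falls in "Medium"
}

# Genetic_Risk values from the first rows of the dataset, plus edge cases
scores <- c(81, 65, 100, 19, 47, 70, 33, 34, 66, 67)
binned <- bin_three(scores)

print(data.frame(score = scores, category = binned))
print(table(binned))
```

Unlike the `case_when()` version, where the final `TRUE` branch also catches missing values and labels them “High”, `cut()` leaves `NA` inputs as `NA`.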

library(dplyr)
library(plotly)

## 
## Attaching package: 'plotly'

## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(RColorBrewer)


df_treatment <- df %>%
  group_by(Treatment_Received) %>%
  summarise(Count = n()) %>%
  ungroup()


df_risk <- df %>%
  mutate(risk_cat = case_when(
    Genetic_Risk < 34 ~ "Low",
    Genetic_Risk < 67 ~ "Medium",
    TRUE              ~ "High"
  )) %>%
  group_by(risk_cat) %>%
  summarise(Count = n()) %>%
  ungroup()


df_surv <- df %>%
  mutate(surv_cat = case_when(
    `Survival_Rate...` < 34 ~ "Low",
    `Survival_Rate...` < 67 ~ "Medium",
    TRUE                    ~ "High"
  )) %>%
  group_by(surv_cat) %>%
  summarise(Count = n()) %>%
  ungroup()


num_treatments <- nrow(df_treatment)

treatment_colors <- brewer.pal(min(num_treatments, 8), "Set2")


fig <- plot_ly() %>%
  # Outer Ring: Treatment Received
  add_pie(
    data = df_treatment,
    labels = ~Treatment_Received,
    values = ~Count,
    name = "Treatment Received",
    hole = 0.2,             
    textinfo = 'label+percent',
    textposition = 'inside',
    marker = list(colors = treatment_colors, 
                  line = list(color = '#FFFFFF', width = 1)),
    domain = list(x = c(0,1), y = c(0,1))
  ) %>%
  # Middle Ring: Genetic Risk
  add_pie(
    data = df_risk,
    labels = ~risk_cat,
    values = ~Count,
    name = "Genetic Risk",
    hole = 0.4,             
    textinfo = 'label+percent',
    textposition = 'inside',
    marker = list(colors = brewer.pal(3, "Blues"), 
                  line = list(color = '#FFFFFF', width = 1)),
    domain = list(x = c(0,1), y = c(0,1))
  ) %>%
  # Inner Ring: Survival Rate
  add_pie(
    data = df_surv,
    labels = ~surv_cat,
    values = ~Count,
    name = "Survival Rate",
    hole = 0.6,             
    textinfo = 'label+percent',
    textposition = 'inside',
    marker = list(colors = brewer.pal(3, "Oranges"), 
                  line = list(color = '#FFFFFF', width = 1)),
    domain = list(x = c(0,1), y = c(0,1))
  ) %>%
  layout(
    title = "Nested Donut Chart: Treatment, Genetic Risk, and Survival Rate",
    showlegend = TRUE
  )

fig

Because each ring is computed independently, the chart cannot show whether high-risk patients tend to have low survival or whether particular treatments lead to particular outcomes. It simply provides a snapshot of the three separate distributions. The “Low” survival-rate slice accounts for about 27% of patients, indicating a notable share with poorer prognoses.
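
To see how genetic risk and survival relate, which the independent rings cannot show, the two binned variables can be cross-tabulated. A minimal sketch with made-up categories follows (the `toy` values are illustrative only, not results from the dataset):

```r
# Illustrative categories only
lvl <- c("Low", "Medium", "High")
toy <- data.frame(
  risk_cat = factor(c("Low", "Low", "Medium", "Medium", "High", "High", "High", "Low"),
                    levels = lvl),
  surv_cat = factor(c("High", "Medium", "Medium", "Low", "Low", "High", "Medium", "Low"),
                    levels = lvl)
)

# Joint counts: rows are genetic-risk bins, columns are survival bins
joint <- table(risk = toy$risk_cat, survival = toy$surv_cat)

# Share of each survival bin within each risk bin (rows sum to 1)
within_risk <- prop.table(joint, margin = 1)

print(joint)
print(round(within_risk, 2))
```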

Tab 6

Multiple Bar Chart by Country and Tumor Location (Uniform Color by Country)

This chart shows counts of tumor locations by country with a uniform color for each country. The y-axis is “zoomed” to 0–2100.

library(ggplot2)
library(lubridate)  
library(dplyr)
library(scales)
library(ggthemes)
library(RColorBrewer)


# Count of patients per country and tumor location; .groups = "drop" returns an ungrouped result
df_agg <- df %>%
  group_by(Country, Tumor_Location) %>%
  summarise(count = n(), .groups = "drop")

num_countries <- length(unique(df$Country))
colors <- if (num_countries <= 12) {
  brewer.pal(n = num_countries, name = "Set3")
} else {
  colorRampPalette(brewer.pal(12, "Set3"))(num_countries)
}

ggplot(df_agg, aes(x = Tumor_Location, y = count, fill = Country)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(~ Country, ncol = 3, scales = "free_y") +
  labs(
    title = "Tumor Location Counts by Country",
    x = "Tumor Location",
    y = "Count"
  ) +
  scale_fill_manual(values = colors) +
  scale_y_continuous(labels = comma) +
  coord_cartesian(ylim = c(0, 2100)) +  
  theme_economist() +
  theme(
    strip.text = element_text(size = 12),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

Within each facet (representing a country), the bars for different tumor locations are nearly the same height. This suggests that no single location (e.g., Cerebellum, Frontal, Occipital, Parietal, Temporal) overwhelmingly dominates or lags behind the others in terms of frequency. A large sample size might naturally smooth out differences, or the dataset may come from a source where tumors are recorded in a uniform manner. It could also be that, biologically, these tumor locations are genuinely similar in prevalence.
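
Whether the bar heights within a country differ by more than chance can be checked with a chi-squared goodness-of-fit test against equal frequencies. The counts below are made up for illustration; on the real data they would come from one country’s rows of `df_agg`.

```r
# Illustrative counts for one country (not taken from the dataset)
location_counts <- c(Cerebellum = 2010, Frontal = 1985, Occipital = 2002,
                     Parietal = 1990, Temporal = 2013)

# Null hypothesis: all five locations are equally likely
res <- chisq.test(location_counts)

print(res$statistic)
print(res$p.value)
```

A large p-value means the counts are compatible with a uniform distribution; it does not by itself explain why they are uniform.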

Conclusion

Overall, the data point to a relatively uniform global picture in terms of tumor location and growth rates, slight country-level variations in malignant versus benign prevalence, and only modest age-related changes in tumor size. While these high-level patterns provide useful insights, the charts also highlight the need for deeper investigation, especially into how genetic risk, treatment choices, and demographic factors collectively influence survival outcomes.

R Module: Brain Tumor Dataset

03-01-2025