Assignment 3

Author

Ishmael Mahmoud

Introduction

For this project, I chose a large dataset entitled “Global Cancer Patients (2015–2024)”, containing anonymized records of 50,000 individuals from multiple countries. The dataset provides demographic data, cancer types and stages, environmental and genetic risk factors, treatment costs, and patient outcomes. I selected it because it is multidimensional and supports both spatial and statistical analysis of cancer burden across global health systems.

My analysis has two main components: (1) a temporal and categorical assessment of cancer treatment costs using interactive bar charts created with Plotly, and (2) a geographical representation of cancer severity and healthcare burden across countries using interactive maps built with Leaflet.

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(lubridate)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Code

library(leaflet)

Code

df <- read_csv("/Users/ishmaelhassan/Desktop/global_cancer_patients_2015_2024.csv")

Rows: 50000 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Patient_ID, Gender, Country_Region, Cancer_Type, Cancer_Stage
dbl (10): Age, Year, Genetic_Risk, Air_Pollution, Alcohol_Use, Smoking, Obes...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

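# Names of the four most frequent cancer types in the dataset
top_cancers <- df %>%
  count(Cancer_Type, sort = TRUE) %>%
  slice_max(n, n = 4) %>%  # replaces the superseded top_n(); orders by the count column n
  pull(Cancer_Type)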

Code

filtered_data <- df %>%
  filter(Cancer_Type %in% top_cancers)


# Average treatment cost per year and cancer type
grouped_data <- filtered_data %>%
  group_by(Year, Cancer_Type) %>%
  summarise(
    Avg_Treatment_Cost = mean(Treatment_Cost_USD, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble without the grouping message
  )

Code

fig <- plot_ly(
  data = grouped_data,
  x = ~Year,
  y = ~Avg_Treatment_Cost,
  color = ~Cancer_Type,
  type = "bar",
  # paste0() avoids the stray space that paste() would put between "$" and the amount
  text = ~paste0(
    "Cancer Type: ", Cancer_Type,
    "<br>Year: ", Year,
    "<br>Avg Cost: ", sprintf("$%.2f", Avg_Treatment_Cost)
  ),
  hoverinfo = "text"
) %>%
  layout(
    title = "Average Treatment Cost for Top 4 Cancer Types (2015–2024)",
    xaxis = list(title = "Year"),
    yaxis = list(title = "Avg Treatment Cost (USD)"),
    barmode = "group",
    legend = list(title = list(text = "Cancer Type"))
  )

fig

Analysis

The interactive bar chart above shows average treatment costs from 2015 to 2024 for the four most common cancer types in the dataset. Restricting the chart to these four types prevents overcrowding. The grouped bar format supports year-over-year comparisons, while distinct colors and hover tooltips give the exact average cost for each cancer type and year. This makes trends easier to spot, including persistently elevated costs for specific cancers and fluctuations over time. The chart could be improved by adding interactive filters for specific regions or patient demographics. Overall, it turns a large dataset into a readable summary of treatment cost trends among the most common cancer types.
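
The year-over-year comparison described above can also be computed directly rather than read off the bars. The following is a minimal sketch, assuming a table shaped like grouped_data; the `toy` values are hypothetical and are not taken from the dataset, and the `Change_USD` and `Change_Pct` column names are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for grouped_data (values are made up for illustration)
toy <- tibble::tribble(
  ~Year, ~Cancer_Type, ~Avg_Treatment_Cost,
  2015,  "Lung",       50000,
  2016,  "Lung",       52000,
  2017,  "Lung",       51000,
  2015,  "Breast",     48000,
  2016,  "Breast",     48000,
  2017,  "Breast",     49500
)

# Within each cancer type, compare each year's average with the previous year's
yoy <- toy %>%
  arrange(Cancer_Type, Year) %>%
  group_by(Cancer_Type) %>%
  mutate(
    Change_USD = Avg_Treatment_Cost - lag(Avg_Treatment_Cost),
    Change_Pct = 100 * Change_USD / lag(Avg_Treatment_Cost)
  ) %>%
  ungroup()

yoy
```

The first year of each cancer type has no previous year, so its change columns are NA. A table like this makes it possible to check whether a visible difference between bars is a real trend or a small fluctuation.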

Interactive map

Code

library(tidyverse)
library(leaflet)

country_coords <- tibble::tribble(
  ~Country_Region, ~lat, ~lng,
  "USA", 37.0902, -95.7129,
  "UK", 55.3781, -3.4360,
  "India", 20.5937, 78.9629,
  "China", 35.8617, 104.1954,
  "Brazil", -14.2350, -51.9253,
  "Pakistan", 30.3753, 69.3451,
  "Canada", 56.1304, -106.3468,
  "Australia", -25.2744, 133.7751,
  "Germany", 51.1657, 10.4515,
  "South Africa", -30.5595, 22.9375
)

map_data <- df %>%
  group_by(Country_Region) %>%
  summarise(
    Avg_Severity = mean(Target_Severity_Score, na.rm = TRUE),
    Avg_Cost = mean(Treatment_Cost_USD, na.rm = TRUE),
    Patient_Count = n()
  ) %>%
  inner_join(country_coords, by = "Country_Region")

leaflet(data = map_data) %>%
  addTiles() %>%
  setView(lng = 10, lat = 20, zoom = 2) %>%
  addCircleMarkers(
    ~lng, ~lat,
    radius = ~sqrt(Avg_Cost) / 300, # scale radius by treatment cost
    color = ~colorNumeric("Reds", Avg_Severity)(Avg_Severity),
    fillOpacity = 0.7,
    popup = ~paste0(
      "<strong>", Country_Region, "</strong><br>",
      "Avg Severity: ", round(Avg_Severity, 2), "<br>",
      "Avg Cost: $", round(Avg_Cost, 0), "<br>",
      "Patients: ", Patient_Count
    )
  )
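
One thing to check in the marker code above: `sqrt(Avg_Cost) / 300` stays below one pixel for any average cost under 90,000 USD, so the circles may be hard to see and hard to tell apart. The following is a sketch of one alternative, not the method used in this report. It assumes the scales package is installed (it is a ggplot2 dependency), uses a hypothetical `toy_map` table in place of map_data, and picks the 6 to 20 pixel range arbitrarily.

```r
library(leaflet)

# Hypothetical per-country summary standing in for map_data
toy_map <- data.frame(
  Country_Region = c("A", "B", "C"),
  lat = c(10, 20, 30),
  lng = c(0, 40, 80),
  Avg_Cost = c(50000, 52000, 54000),
  Avg_Severity = c(4.8, 5.0, 5.2)
)

# Build the palette once so markers and legend share the same mapping
pal <- colorNumeric("Reds", domain = toy_map$Avg_Severity)

# Map average cost linearly onto a visible pixel range (6 to 20 px, arbitrary)
toy_map$radius_px <- scales::rescale(toy_map$Avg_Cost, to = c(6, 20))

m <- leaflet(toy_map) %>%
  addTiles() %>%
  addCircleMarkers(
    ~lng, ~lat,
    radius = ~radius_px,
    color = ~pal(Avg_Severity),
    fillOpacity = 0.7
  ) %>%
  addLegend("bottomright", pal = pal, values = ~Avg_Severity,
            title = "Avg Severity")

m
```

Linear rescaling stretches the smallest and largest values to the ends of the pixel range, so it exaggerates differences when the averages are close together. The legend helps the reader judge how large the underlying differences are.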

Analysis

The code creates an interactive Leaflet map that summarizes cancer severity and treatment cost in ten countries. It starts by defining a latitude and longitude for each country by hand. Next, it joins these coordinates to a per-country summary of the patient data: the total number of patients, the average severity score, and the average treatment cost. These metrics are then drawn with addCircleMarkers(). Each country is represented by a circle whose radius is scaled by average treatment cost and whose color follows a red gradient based on average severity score. Clicking a circle opens a popup showing the country's name, average severity, average cost, and number of patients. The initial view is zoomed out to show most of the world at once. The map gives a quick picture of how cancer severity and treatment cost vary geographically and which countries have the most severe and most expensive cases. However, because the coordinates are entered by hand, the map covers only the ten countries listed, and the inner join drops any country in the dataset that is not in that list. Overall, the map makes geographic trends and differences in the cancer data easy to see.
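
Because the coordinates are joined by country name, a spelling mismatch or a missing entry removes that country from the map without any warning. The following is a minimal sketch of a check using anti_join(). The `summary_tbl` and `coords` tables and their country names are hypothetical, chosen only to show one mismatch in each direction.

```r
library(dplyr)

# Hypothetical per-country summary (as produced from the patient data)
summary_tbl <- tibble::tibble(
  Country_Region = c("USA", "UK", "Russia"),
  Patient_Count  = c(10, 12, 9)
)

# Hypothetical hand-entered coordinate table
coords <- tibble::tibble(
  Country_Region = c("USA", "UK", "South Africa"),
  lat = c(37.0902, 55.3781, -30.5595),
  lng = c(-95.7129, -3.4360, 22.9375)
)

# Countries in the data that have no coordinates: an inner join would drop these
dropped <- anti_join(summary_tbl, coords, by = "Country_Region")

# Coordinates that match no country in the data: these never appear on the map
unused <- anti_join(coords, summary_tbl, by = "Country_Region")

dropped$Country_Region
unused$Country_Region
```

Running the same two anti_join() calls on the real summary and on country_coords before building the map would show whether all ten hand-entered countries actually appear in the dataset.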

Conclusion

The Plotly bar chart showed how treatment costs vary across cancer types and over time, revealing that specific cancers, such as lung and breast cancer, consistently generate higher average expenses. The Leaflet map, in turn, revealed geographic disparities, indicating higher severity scores and treatment expenses concentrated in economically advanced countries. The bar chart gives a detailed view of cost over time and by cancer type, whereas the map gives a broader spatial overview of severity and treatment cost. One challenge was that the dataset has no latitude and longitude columns, which I addressed by manually adding country coordinates for the map.

ChatGPT link

https://chatgpt.com/share/681593b0-c1b0-8002-8fd3-1bb18762006b