Overview

Our data is from one of the largest breast oncology institutions in the nation, Houston’s MD Anderson Cancer Center. They serve about 40,000 patients with breast cancer annually, amounting to an average of 3,333 patients per month. The data we are going to look at provides a basic notion of the incidence of breast cancer in the Houston area, even though MD Anderson does not provide its exact monthly diagnosis rates. The data presents an aggregated table of service counts for patient groups from 2020-2024. The goal is to tidy up the data and do minimal analysis.

Loading data and libraries

library(tidyverse)
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Project%202/Untidy%20dataset%20MDA.csv") 
cancer <- read.csv(text = x, na.strings = c("NA",""), skip = 1, header = TRUE) # start by skipping title line
cancer
##    Patient_ID    Month Age_Group     Race Cases_2020 Cases_2021 Cases_2022
## 1        P001  January     30-39    White         50         55         60
## 2        P002  January     40-49    Black         90         95        100
## 3        P003  January     50-59 Hispanic         80         85         88
## 4        P004 February     30-39    Asian         70         75         80
## 5        P005 February     40-49    White        110        115        120
## 6        P006    March     50-59    Black        130        135        140
## 7        P007    March     60-69 Hispanic        100        105        110
## 8        P008    March       70+    Asian         70         75         78
## 9        P009    April     30-39    White         60         65         70
## 10       P010    April     40-49    Black         90         95        100
## 11       P011      May     50-59 Hispanic        120        125        130
##    Cases_2023 Cases_2024 Total_Patients            Type  X X.1
## 1          65         70            300 Invasive Ductal NA  NA
## 2         105        120            510         Lobular NA  NA
## 3          90        110            453 Triple-Negative NA  NA
## 4          85         90            400           HER2+ NA  NA
## 5         125        140            610    Inflammatory NA  NA
## 6         145        160            710 Invasive Ductal NA  NA
## 7         115        130            560         Lobular NA  NA
## 8          80        100            403 Triple-Negative NA  NA
## 9          75         85            355           HER2+ NA  NA
## 10        105        120            510    Inflammatory NA  NA
## 11        135        150            660 Invasive Ductal NA  NA

Tidying and reshaping data

We see started by skipping the title line when we read the csv into our environment. Next, we want to get rid of the blank columns at the tail end of this data frame and rename the Patient_ID column to reflect that it is an aggregated group ID. Then, we’ll reshape the data to move case years into its own column, creating a long format dataset instead of wide.

tidy_cancer <- cancer |> 
  select(-c(X,X.1)) |>   # remove empty columns
  rename(Patient_Group_ID = Patient_ID) |>   # rename to reflect group
  rename_with(~gsub("Cases_","",.),starts_with("Cases_")) |>  # isolate year
  pivot_longer(cols = "2020":"2024", names_to = "Cases_Year", values_to = "Count") # reshape to longer format

# next we add a column with the rate of patients served for that year + group
tidy_cancer <- tidy_cancer |>
  group_by(Cases_Year) |>
  mutate(Year_service_rate = paste0(round((Count/sum(Count))*100,1),"%")) |>
  ungroup() |>
  group_by(Race) |>
  mutate(Race_service_rate = paste0(round((Count/sum(Count))*100,1),"%"))
tidy_cancer
## # A tibble: 55 × 10
## # Groups:   Race [4]
##    Patient_Group_ID Month  Age_Group Race  Total_Patients Type  Cases_Year Count
##    <chr>            <chr>  <chr>     <chr>          <int> <chr> <chr>      <int>
##  1 P001             Janua… 30-39     White            300 Inva… 2020          50
##  2 P001             Janua… 30-39     White            300 Inva… 2021          55
##  3 P001             Janua… 30-39     White            300 Inva… 2022          60
##  4 P001             Janua… 30-39     White            300 Inva… 2023          65
##  5 P001             Janua… 30-39     White            300 Inva… 2024          70
##  6 P002             Janua… 40-49     Black            510 Lobu… 2020          90
##  7 P002             Janua… 40-49     Black            510 Lobu… 2021          95
##  8 P002             Janua… 40-49     Black            510 Lobu… 2022         100
##  9 P002             Janua… 40-49     Black            510 Lobu… 2023         105
## 10 P002             Janua… 40-49     Black            510 Lobu… 2024         120
## # ℹ 45 more rows
## # ℹ 2 more variables: Year_service_rate <chr>, Race_service_rate <chr>

Analysis

From our new dataset above we see the new columns Year_service_rate and Race_service_rate. The Year_service_rate aggregates by year and shows the service rate for each patient group for that year. The Race_service_rate shows the service rate of each patient group year out of all patient groups from within each race group. We can see that from 2020 to 2024, service rates increased across all patient groups by race from year to year. However, service rates by year did not seem to change across patient groups from year to year.

Conclusion

In conclusion, we took this wide format dataset and reshaped it into a longer format while cleaning up empty columns and column names. Then, we added two new columns for quick analysis on service rates for patient groups by year and in proportion to the groups race make up. Given more time, I would have liked to cross reference treatment success rates with this count data for patients served.