Our data is from one of the largest breast oncology institutions in the nation, Houston’s MD Anderson Cancer Center. They serve about 40,000 patients with breast cancer annually, amounting to an average of 3,333 patients per month. The data we are going to look at provides a basic notion of the incidence of breast cancer in the Houston area, even though MD Anderson does not provide its exact monthly diagnosis rates. The data presents an aggregated table of service counts for patient groups from 2020-2024. The goal is to tidy up the data and do minimal analysis.
Loading data and libraries
library(tidyverse)
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Project%202/Untidy%20dataset%20MDA.csv")
cancer <- read.csv(text = x, na.strings = c("NA",""), skip = 1, header = TRUE) # start by skipping title line
cancer
## Patient_ID Month Age_Group Race Cases_2020 Cases_2021 Cases_2022
## 1 P001 January 30-39 White 50 55 60
## 2 P002 January 40-49 Black 90 95 100
## 3 P003 January 50-59 Hispanic 80 85 88
## 4 P004 February 30-39 Asian 70 75 80
## 5 P005 February 40-49 White 110 115 120
## 6 P006 March 50-59 Black 130 135 140
## 7 P007 March 60-69 Hispanic 100 105 110
## 8 P008 March 70+ Asian 70 75 78
## 9 P009 April 30-39 White 60 65 70
## 10 P010 April 40-49 Black 90 95 100
## 11 P011 May 50-59 Hispanic 120 125 130
## Cases_2023 Cases_2024 Total_Patients Type X X.1
## 1 65 70 300 Invasive Ductal NA NA
## 2 105 120 510 Lobular NA NA
## 3 90 110 453 Triple-Negative NA NA
## 4 85 90 400 HER2+ NA NA
## 5 125 140 610 Inflammatory NA NA
## 6 145 160 710 Invasive Ductal NA NA
## 7 115 130 560 Lobular NA NA
## 8 80 100 403 Triple-Negative NA NA
## 9 75 85 355 HER2+ NA NA
## 10 105 120 510 Inflammatory NA NA
## 11 135 150 660 Invasive Ductal NA NA
Tidying and reshaping data
We see started by skipping the title line when we read the csv into our environment. Next, we want to get rid of the blank columns at the tail end of this data frame and rename the Patient_ID column to reflect that it is an aggregated group ID. Then, we’ll reshape the data to move case years into its own column, creating a long format dataset instead of wide.
tidy_cancer <- cancer |>
select(-c(X,X.1)) |> # remove empty columns
rename(Patient_Group_ID = Patient_ID) |> # rename to reflect group
rename_with(~gsub("Cases_","",.),starts_with("Cases_")) |> # isolate year
pivot_longer(cols = "2020":"2024", names_to = "Cases_Year", values_to = "Count") # reshape to longer format
# next we add a column with the rate of patients served for that year + group
tidy_cancer <- tidy_cancer |>
group_by(Cases_Year) |>
mutate(Year_service_rate = paste0(round((Count/sum(Count))*100,1),"%")) |>
ungroup() |>
group_by(Race) |>
mutate(Race_service_rate = paste0(round((Count/sum(Count))*100,1),"%"))
tidy_cancer
## # A tibble: 55 × 10
## # Groups: Race [4]
## Patient_Group_ID Month Age_Group Race Total_Patients Type Cases_Year Count
## <chr> <chr> <chr> <chr> <int> <chr> <chr> <int>
## 1 P001 Janua… 30-39 White 300 Inva… 2020 50
## 2 P001 Janua… 30-39 White 300 Inva… 2021 55
## 3 P001 Janua… 30-39 White 300 Inva… 2022 60
## 4 P001 Janua… 30-39 White 300 Inva… 2023 65
## 5 P001 Janua… 30-39 White 300 Inva… 2024 70
## 6 P002 Janua… 40-49 Black 510 Lobu… 2020 90
## 7 P002 Janua… 40-49 Black 510 Lobu… 2021 95
## 8 P002 Janua… 40-49 Black 510 Lobu… 2022 100
## 9 P002 Janua… 40-49 Black 510 Lobu… 2023 105
## 10 P002 Janua… 40-49 Black 510 Lobu… 2024 120
## # ℹ 45 more rows
## # ℹ 2 more variables: Year_service_rate <chr>, Race_service_rate <chr>
Analysis
From our new dataset above we see the new columns
Year_service_rate
and Race_service_rate
. The
Year_service_rate
aggregates by year and shows the service
rate for each patient group for that year. The
Race_service_rate
shows the service rate of each patient
group year out of all patient groups from within each race group. We can
see that from 2020 to 2024, service rates increased across all patient
groups by race from year to year. However, service rates by year did not
seem to change across patient groups from year to year.
Conclusion
In conclusion, we took this wide format dataset and reshaped it into a longer format while cleaning up empty columns and column names. Then, we added two new columns for quick analysis on service rates for patient groups by year and in proportion to the groups race make up. Given more time, I would have liked to cross reference treatment success rates with this count data for patients served.