In this project, I have chosen to work on breast cancer. There are various resources available on this topic, with the Surveillance, Epidemiology, and End Results (SEER) [1] program being the most reliable one.
The SEER Program of the National Cancer Institute (NCI) collects and publishes cancer data through a coordinated system of strategically placed cancer registries, covering nearly 30% of the US population.
Currently, there are 18 SEER registries in the USA. You can find this information on the following website: SEER Data Access.
I have also used the following repository to assist me with this project: SEER_solid_tumor [2]. The database contains extensive data, and my investigation focuses solely on breast cancer for the years 2011-2015 and 2019-2020. SEER provides software called SEER*Stat, which I used to extract the data; the exported files are stored and used on my local computer. Additionally, there are two GitHub repositories that I have referenced to some extent in this project:
The first repository [2] covers all types of cancer, whereas my study focuses specifically on breast cancer and addresses different research questions.
The second repository [3] carried out machine learning analyses on various cancer types using Python (not R). I drew inspiration and learned methods from its approach to survival studies in cancer patients.
Check that all required packages are installed, and install any that are missing.
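A minimal sketch of such a check follows. The package list is an assumption inferred from the functions called later in this report (`read_csv`, `GET`, `kable`, `%>%`, `skim`), and `missing_packages()` and `ensure_packages()` are hypothetical helper names, not functions from any package.

```r
# Packages assumed from the functions used later in this report
required <- c("readr", "httr", "dplyr", "knitr", "skimr")

# Return the subset of `pkgs` that is not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing, then attach every package in `pkgs`
ensure_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Report what would be installed; call ensure_packages(required) to act on it
print(missing_packages(required))
```

Keeping the check (`missing_packages()`) separate from the action (`ensure_packages()`) makes it possible to see what is missing before anything is downloaded.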
The primary focus of my research is to explore the survival rates of breast cancer patients and the various factors influencing these rates, including age, cancer type, treatment modalities, and other pertinent parameters. The commonly utilized five-year survival rate benchmark serves as a pivotal point of analysis in this study.
Acknowledging the significance of this benchmark, I have divided the data into two distinct datasets. The dataset spanning from 2011 to 2015 assumes that the status of all patients within that period is known up to the database’s current date in 2022. Additionally, I have selected the most recent data from 2019 to 2020 as the target years for potential correlation and regression studies to estimate survival rates.
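As a rough illustration of how the five-year benchmark could be turned into a label, the sketch below classifies toy records by a 60-month horizon. It is an assumption-laden simplification rather than the method used in this report: it treats any death as an event (all-cause, not breast-cancer-specific), the data frame `toy` and the function `five_year_status()` are hypothetical, and the two inputs stand in for the ‘Survival months’ and ‘COD to site recode’ columns.

```r
# Toy records; values are illustrative, not taken from the full dataset
toy <- data.frame(
  survival_months = c(60, 28, 99, 10, 43, 17),
  cod = c("Alive", "Breast", "Alive", "Breast",
          "Cerebrovascular Diseases", "Alive"),
  stringsAsFactors = FALSE
)

# Label each record relative to a follow-up horizon (default 60 months):
#   "survived" : followed for at least `horizon` months
#   "died"     : died (any cause) before the horizon
#   NA         : alive but followed for less than the horizon (censored)
five_year_status <- function(months, cod, horizon = 60) {
  status <- rep(NA_character_, length(months))
  known <- !is.na(months)
  status[known & months >= horizon] <- "survived"
  status[known & months < horizon & !is.na(cod) & cod != "Alive"] <- "died"
  status
}

toy$status_5y <- five_year_status(toy$survival_months, toy$cod)
print(toy)
```

Records that are alive with fewer than 60 months of follow-up are left as NA rather than counted as survivors, which is why a cohort with long follow-up is the natural choice for this benchmark.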
Although my research is not conducted within a strictly scientific framework, it is approached with rigor and attention to detail. While I do not possess expertise in the field of breast cancer, my personal connection to the topic motivates me to delve deeper into understanding the complexities surrounding it.
The dataset from 2011 to 2015 comprises approximately 303,000 rows with 36 selected columns. For the purpose of prediction, I have chosen to focus solely on the 2019-2020 data, which encompasses about 131,000 rows. The multifaceted nature of the research question necessitates a thorough examination, from data tidying to cleaning.
Some of the key parameters under consideration include years of diagnoses, age groups at diagnosis, and cancer type. However, I also recognize the importance of incorporating additional factors such as tumor characteristics and treatment modalities to provide a comprehensive understanding of breast cancer survival outcomes.
In conclusion, while my knowledge of the subject may not be extensive, I am committed to learning and contributing meaningful insights to the field of breast cancer research through meticulous analysis and interpretation of data.
According to the American Cancer Society, the five-year relative survival rate for localized breast cancer is around 99%, but it drops to about 27% for distant-stage breast cancer. These rates can vary over time and with advances in treatment. Reference [5]: American Cancer Society - Breast Cancer Survival Rates
# Function to load a CSV file, falling back to a remote copy if it is not found locally
load_csv <- function(file_path, fallback_url = gdrive_link) {
  if (file.exists(file_path)) {
    return(readr::read_csv(file_path))
  } else {
    message("File not found locally. Attempting to fetch from server...")
    return(fetch_database(fallback_url))
  }
}
# Function to fetch the database from a download URL
fetch_database <- function(url) {
  response <- httr::GET(url)
  if (httr::http_type(response) == "application/force-download") {
    httr::stop_for_status(response)
    # Parse the response body as literal CSV text
    csv_text <- httr::content(response, as = "text", encoding = "UTF-8")
    return(readr::read_csv(I(csv_text)))
  } else {
    message("Failed to fetch from server. Please select the file manually.")
    return(readr::read_csv(file.choose()))
  }
}
# Local file paths
directory <- "C:/Users/kohya/OneDrive/CUNY/DATA 606/DATA 606 Spring/Project"
file_2020 <- "BREAST_2019-2020-updated.csv"
file_serv <- "BREAST_2011-2015.csv"
gdrive_link <- "https://drive.google.com/uc?export=download&id=1vBR2SZ-aFX3jjU6kQMjPkxfYKP-EwqRE"
# Complete the file paths
full_path_serv <- file.path(directory, file_serv)
full_path_eval <- file.path(directory, file_2020)
# Attempt to load the databases
BREAST_DF_surv <- load_csv(full_path_serv)
## Rows: 303557 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): Sex, Race recode (W, B, AI, API), Race and origin recode (NHW, NHB...
## dbl (2): Year of diagnosis, Year of follow-up recode
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
BREAST_DF_eval <- load_csv(full_path_eval)
## Rows: 131395 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): Sex, Race recode (W, B, AI, API), Race and origin recode (NHW, NHB...
## dbl (2): Year of diagnosis, Year of follow-up recode
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows of the data frame
kable(head(BREAST_DF_surv, 10))
| Sex | Year of diagnosis | Race recode (W, B, AI, API) | Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) | Site recode ICD-O-3/WHO 2008 | Site recode ICD-O-3 2023 Revision | Primary Site - labeled | Grade Recode (thru 2017) | Grade Clinical (2018+) | Grade Pathological (2018+) | Diagnostic Confirmation | Laterality | Chemotherapy recode (yes, no/unk) | Radiation recode | Months from diagnosis to treatment | Reason no cancer-directed surgery | Scope of reg lymph nd surg (1998-2002) | Survival months flag | Survival months | COD to site recode | First malignant primary indicator | Sequence number | Total number of in situ/malignant tumors for patient | Total number of benign/borderline tumors for patient | Patient ID | Marital status at diagnosis | Median household income inflation adj to 2021 | Rural-Urban Continuum Code | Age recode (<60,60-69,70+) | Race and origin (recommended by SEER) | Year of follow-up recode | Year of death recode | SEER other cause of death classification | Tumor Size Summary (2016+) | RX Summ–Systemic/Sur Seq (2007+) | Origin recode NHIA (Hispanic, Non-Hisp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | 2015 | White | Non-Hispanic White | Breast | Breast | C50.4-Upper-outer quadrant of breast | Moderately differentiated; Grade II | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | Yes | Beam radiation | 002 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0060 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00000309 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 50-54 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | Blank(s) | Systemic therapy after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2013 | White | Non-Hispanic White | Breast | Breast | C50.9-Breast, NOS | Unknown | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | Blank(s) | Not recommended | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0028 | Breast | No | 3rd of 3 or more primaries | 03 | 00 | 00000346 | Divorced | $75,000+ | Counties in metropolitan areas ge 1 million pop | 40-44 years | All races/ethnicities | 2015 | 2015 | Alive or dead due to cancer | Blank(s) | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
| Female | 2012 | White | Non-Hispanic White | Breast | Breast | C50.2-Upper-inner quadrant of breast | Moderately differentiated; Grade II | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 004 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0099 | Alive | No | 2nd of 2 or more primaries | 03 | 00 | 00000374 | Widowed | $75,000+ | Counties in metropolitan areas ge 1 million pop | 80-84 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | Blank(s) | Systemic therapy before surgery | Non-Spanish-Hispanic-Latino |
| Female | 2014 | White | Non-Hispanic White | Breast | Breast | C50.8-Overlapping lesion of breast | Moderately differentiated; Grade II | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 001 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0081 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00000391 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 55-59 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | Blank(s) | Systemic therapy after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2011 | Black | Non-Hispanic Black | Breast | Breast | C50.9-Breast, NOS | Unknown | Blank(s) | Blank(s) | Direct visualization without microscopic confirmation | Left - origin of primary | No/Unknown | None/Unknown | Blank(s) | Not recommended | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0010 | Breast | No | 2nd of 2 or more primaries | 02 | 00 | 00000547 | Widowed | $75,000+ | Counties in metropolitan areas ge 1 million pop | 85+ years | All races/ethnicities | 2012 | 2012 | Alive or dead due to cancer | Blank(s) | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
| Female | 2013 | White | Hispanic (All Races) | Breast | Breast | C50.9-Breast, NOS | Moderately differentiated; Grade II | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | No/Unknown | Beam radiation | 001 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0086 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00000567 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 70-74 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | Blank(s) | No systemic therapy and/or surgical procedures | Spanish-Hispanic-Latino |
| Female | 2015 | White | Non-Hispanic White | Breast | Breast | C50.8-Overlapping lesion of breast | Unknown | Blank(s) | Blank(s) | Positive histology | Left - origin of primary | Yes | None/Unknown | 001 | Not recommended | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0017 | Breast | No | 2nd of 2 or more primaries | 02 | 00 | 00000760 | Widowed | $75,000+ | Counties in metropolitan areas ge 1 million pop | 75-79 years | All races/ethnicities | 2016 | 2016 | Alive or dead due to cancer | Blank(s) | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
| Female | 2015 | White | Hispanic (All Races) | Breast | Breast | C50.4-Upper-outer quadrant of breast | Poorly differentiated; Grade III | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 001 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0007 | Other Cause of Death | No | 2nd of 2 or more primaries | 02 | 00 | 00000941 | Widowed | $75,000+ | Counties in metropolitan areas ge 1 million pop | 85+ years | All races/ethnicities | 2015 | 2015 | Dead (attributable to causes other than this cancer dx) | Blank(s) | No systemic therapy and/or surgical procedures | Spanish-Hispanic-Latino |
| Female | 2015 | White | Non-Hispanic White | Breast | Breast | C50.9-Breast, NOS | Poorly differentiated; Grade III | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | No/Unknown | Beam radiation | 001 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0043 | Cerebrovascular Diseases | No | 2nd of 2 or more primaries | 02 | 00 | 00002056 | Widowed | $75,000+ | Counties in metropolitan areas ge 1 million pop | 80-84 years | All races/ethnicities | 2019 | 2019 | Dead (attributable to causes other than this cancer dx) | Blank(s) | Systemic therapy after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2015 | Black | Non-Hispanic Black | Breast | Breast | C50.8-Overlapping lesion of breast | Poorly differentiated; Grade III | Blank(s) | Blank(s) | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 001 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0070 | Alive | No | 3rd of 3 or more primaries | 04 | 00 | 00002605 | Divorced | $75,000+ | Counties in metropolitan areas ge 1 million pop | 60-64 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | Blank(s) | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
kable(head(BREAST_DF_eval, 10))
| Sex | Year of diagnosis | Race recode (W, B, AI, API) | Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) | Site recode ICD-O-3/WHO 2008 | Site recode ICD-O-3 2023 Revision | Primary Site - labeled | Grade Recode (thru 2017) | Grade Clinical (2018+) | Grade Pathological (2018+) | Diagnostic Confirmation | Laterality | Chemotherapy recode (yes, no/unk) | Radiation recode | Months from diagnosis to treatment | Reason no cancer-directed surgery | Scope of reg lymph nd surg (1998-2002) | Survival months flag | Survival months | COD to site recode | First malignant primary indicator | Sequence number | Total number of in situ/malignant tumors for patient | Total number of benign/borderline tumors for patient | Patient ID | Marital status at diagnosis | Median household income inflation adj to 2021 | Rural-Urban Continuum Code | Age recode (<60,60-69,70+) | Race and origin (recommended by SEER) | Year of follow-up recode | Year of death recode | SEER other cause of death classification | Tumor Size Summary (2016+) | RX Summ–Systemic/Sur Seq (2007+) | Origin recode NHIA (Hispanic, Non-Hisp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | 2019 | Asian or Pacific Islander | Non-Hispanic Asian or Pacific Islander | Breast | Breast | C50.8-Overlapping lesion of breast | Unknown | 1 | 1 | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 002 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0019 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00002750 | Divorced | $75,000+ | Counties in metropolitan areas ge 1 million pop | 65-69 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 008 | Systemic therapy after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2020 | Asian or Pacific Islander | Non-Hispanic Asian or Pacific Islander | Breast | Breast | C50.8-Overlapping lesion of breast | Unknown | 2 | 9 | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 000 | Recommended, unknown if performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0000 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00002870 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 75-79 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 050 | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
| Female | 2020 | White | Non-Hispanic White | Breast | Breast | C50.4-Upper-outer quadrant of breast | Unknown | 1 | 2 | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 000 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0007 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00003067 | Divorced | $75,000+ | Counties in metropolitan areas ge 1 million pop | 85+ years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 018 | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
| Female | 2020 | White | Non-Hispanic White | Breast | Breast | C50.5-Lower-outer quadrant of breast | Unknown | 2 | 9 | Positive histology | Right - origin of primary | Yes | None/Unknown | 001 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0010 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00003365 | Widowed | $75,000+ | Counties in metropolitan areas ge 1 million pop | 85+ years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 060 | Systemic therapy both before and after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2019 | White | Non-Hispanic White | Breast | Breast | C50.8-Overlapping lesion of breast | Unknown | 2 | 2 | Positive histology | Right - origin of primary | No/Unknown | Radioactive implants (includes brachytherapy) (1988+) | 000 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0016 | Alive | No | 3rd of 3 or more primaries | 03 | 00 | 00003679 | Divorced | $75,000+ | Counties in metropolitan areas ge 1 million pop | 75-79 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 010 | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
| Female | 2019 | Asian or Pacific Islander | Non-Hispanic Asian or Pacific Islander | Breast | Breast | C50.9-Breast, NOS | Unknown | 2 | 2 | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 004 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0014 | Alive | No | 3rd of 3 or more primaries | 04 | 00 | 00003771 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 55-59 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 030 | Systemic therapy after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2019 | Asian or Pacific Islander | Non-Hispanic Asian or Pacific Islander | Breast | Breast | C50.4-Upper-outer quadrant of breast | Unknown | 1 | 1 | Positive histology | Left - origin of primary | No/Unknown | None/Unknown | 004 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0014 | Alive | No | 4th of 4 or more primaries | 04 | 00 | 00003771 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 55-59 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 004 | Systemic therapy after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2020 | White | Non-Hispanic White | Breast | Breast | C50.8-Overlapping lesion of breast | Unknown | 2 | 9 | Positive histology | Right - origin of primary | No/Unknown | None/Unknown | 001 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0003 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00006501 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 80-84 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 036 | Systemic therapy both before and after surgery | Non-Spanish-Hispanic-Latino |
| Female | 2020 | White | Non-Hispanic White | Breast | Breast | C50.3-Lower-inner quadrant of breast | Unknown | 1 | 1 | Positive histology | Left - origin of primary | No/Unknown | None/Unknown | 002 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0007 | Alive | No | 3rd of 3 or more primaries | 03 | 00 | 00007723 | Married (including common law) | $75,000+ | Counties in metropolitan areas ge 1 million pop | 70-74 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 006 | No systemic therapy and/or surgical procedures | Non-Spanish-Hispanic-Latino |
| Female | 2019 | White | Non-Hispanic White | Breast | Breast | C50.4-Upper-outer quadrant of breast | Unknown | 2 | 9 | Positive histology | Right - origin of primary | Yes | None/Unknown | 002 | Surgery performed | Blank(s) | Complete dates are available and there are more than 0 days of survival | 0021 | Alive | No | 2nd of 2 or more primaries | 02 | 00 | 00008406 | Unmarried or Domestic Partner | $75,000+ | Counties in metropolitan areas ge 1 million pop | 55-59 years | All races/ethnicities | 2020 | Alive at last contact | Alive or dead due to cancer | 019 | Systemic therapy both before and after surgery | Non-Spanish-Hispanic-Latino |
The 2019-2020 breast cancer dataset contains 131,395 cases, and the 2011-2015 dataset contains 303,557.
I used SEER*Stat to collect the data and export it as a TXT file so that it could be imported into R for analysis. SEER's data collection process, as summarized on its training site, is as follows:
The SEER program collects cancer incidence data through a network of population-based cancer registries. These registries gather information on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, and first course of treatment. They also follow up with patients for vital status.
By law, these facilities are required to report new cancer cases to a central cancer registry, like a state cancer registry.
The SEER program releases new research data annually, based on submissions from the previous year, and makes it available for public use through a data request process. This comprehensive approach ensures that the SEER database is a valuable resource for cancer research and surveillance. https://training.seer.cancer.gov/registration/data/collection.html
This is an observational study: the information has already been gathered for different patients, and I evaluate and present the available data.
The data are collected by the SEER program, and I used the SEER*Stat software to extract them in a TXT/CSV format that can be imported into R (Surveillance, Epidemiology, and End Results Program 2023).
The data combine numeric and categorical variables. For example, the number of tumors and the survival months are quantitative, while race, marital status, and cancer type are categorical.
Categorical features such as ‘Median household income …’, ‘Marital status’, ‘Grade Recode’, ‘Laterality’, and ‘Radiation recode’ are stored as character columns.
On import, only ‘Year of diagnosis’ and ‘Year of follow-up recode’ are read as numeric (double). ‘Patient ID’ and the ‘Total number of …’ columns are read as character, and the tumor counts are converted to numeric later.
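The split between numeric and character columns can be checked directly. The sketch below uses a two-row toy data frame (`toy`, a hypothetical stand-in whose values follow the first two preview rows) rather than the full extract.

```r
# Toy stand-in for the SEER extract
toy <- data.frame(
  `Year of diagnosis` = c(2015, 2013),
  `Patient ID` = c("00000309", "00000346"),
  `Survival months` = c("0060", "0028"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Class of each column, then a count of columns per class
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
print(col_classes)
print(table(col_classes))

# Names of the columns that are still stored as text
char_cols <- names(col_classes)[col_classes == "character"]
print(char_cols)
```

Applied to `BREAST_DF_surv`, the same `vapply()` call would list which columns still need conversion before any numeric summary is computed.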
# Number of unique values in each column
unique_values <- data.frame(
  unique = apply(BREAST_DF_surv, 2, function(x) length(unique(x))),
  colnames = colnames(BREAST_DF_surv)
)
# Number of unique values together with the unique values themselves
unique_info <- data.frame(
  unique_count = sapply(BREAST_DF_surv, function(x) length(unique(x))),
  unique_values = sapply(BREAST_DF_surv, function(x) toString(unique(x))),
  column_names = names(BREAST_DF_surv)
)
# Check for NULL values
any_null <- any(sapply(BREAST_DF_surv, is.null))
# Check for NA values
any_na <- any(sapply(BREAST_DF_surv, is.na))
# Check if there are any NULL or NA values
if (any_null || any_na) {
print("The data frame contains NULL or NA values.")
} else {
print("The data frame does not contain any NULL or NA values.")
}
## [1] "The data frame does not contain any NULL or NA values."
has_na_character <- any(sapply(BREAST_DF_surv, function(x) any(x == "NA")))
if (has_na_character) {
print("The data frame contains character values of 'NA'.")
} else {
print("The data frame does not contain character values of 'NA'.")
}
## [1] "The data frame does not contain character values of 'NA'."
Exploring the data suggests that some columns may be empty; in this database, empty values are filled with “Blank(s)”. In this section, I first check whether any column is entirely empty and remove it. For the remaining columns, I replace “Blank(s)” with NA, which dplyr and the tidyverse handle better.
# Some cells in the data frame contain "Blank(s)", which effectively means NA.
# First, find any column whose values are all "Blank(s)" so it can be removed.
Empty_column <- BREAST_DF_surv %>%
  dplyr::summarise(dplyr::across(everything(), ~all(. == "Blank(s)"))) %>%
  as.logical() %>%
  unlist()
# Get the names of columns with all cells containing "Blank(s)"
blank_column_names <- names(BREAST_DF_surv)[Empty_column]
# Print the names of the columns in which every cell is "Blank(s)"
paste("list of empty column(s): ", blank_column_names)
## [1] "list of empty column(s): Grade Clinical (2018+)"
## [2] "list of empty column(s): Grade Pathological (2018+)"
## [3] "list of empty column(s): Scope of reg lymph nd surg (1998-2002)"
## [4] "list of empty column(s): Tumor Size Summary (2016+)"
# Remove those empty columns from the survival data frame
BREAST_DF_surv <- BREAST_DF_surv[, !names(BREAST_DF_surv) %in% blank_column_names]
# The same columns are dropped from the evaluation set, although some of them
# (e.g. `Grade Clinical (2018+)`) are populated for 2019-2020
BREAST_DF_eval <- BREAST_DF_eval[, !names(BREAST_DF_eval) %in% blank_column_names]
# Next, check whether any remaining cell still holds "Blank(s)"; if so, replace it with NA, which R handles better.
# The code first replaces every "Blank(s)" with an empty string "" and then uses na_if() to convert the empty strings to NA.
BREAST_DF_surv <- BREAST_DF_surv %>%
  mutate_if(is.character, ~ifelse(. == "Blank(s)", "", .)) %>% # For character columns
  mutate_if(is.numeric, ~ifelse(. == "", as.numeric(NA), .)) # For numeric columns (defensive)
# Now replace empty character cells with NA
BREAST_DF_surv <- BREAST_DF_surv %>%
  mutate_if(is.character, na_if, "")
# Do the same for the evaluation dataset
BREAST_DF_eval <- BREAST_DF_eval %>%
  mutate_if(is.character, ~ifelse(. == "Blank(s)", "", .)) %>% # For character columns
  mutate_if(is.numeric, ~ifelse(. == "", as.numeric(NA), .)) # For numeric columns (defensive)
# Now replace empty character cells with NA
BREAST_DF_eval <- BREAST_DF_eval %>%
  mutate_if(is.character, na_if, "")
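As an alternative to replacing “Blank(s)” after import, the marker can be declared as a missing-value string while the file is read. The sketch below uses base R's `read.csv()` on a small hypothetical CSV (the column names and values are made up) to stay dependency-free; `readr::read_csv()` has an analogous `na` argument.

```r
# Toy CSV text standing in for a SEER export (hypothetical values)
csv_text <- "grade_clinical,months_to_treatment,laterality
Blank(s),002,Right
Blank(s),Blank(s),Left
Blank(s),004,Right"

# Treat "Blank(s)" as missing while reading; keep all codes as text
toy <- read.csv(
  text = csv_text,
  na.strings = c("", "NA", "Blank(s)"),
  colClasses = "character"
)

# Columns in which every value is missing
empty_cols <- names(toy)[colSums(!is.na(toy)) == 0]
print(empty_cols)

# Drop the fully empty columns
toy_clean <- toy[, !(names(toy) %in% empty_cols), drop = FALSE]
print(toy_clean)
```

Handling the marker at read time means the all-empty check reduces to counting non-missing values per column, with no intermediate empty-string step.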
#Change characters to numerics
BREAST_DF_surv$`Months from diagnosis to treatment` <- as.numeric(BREAST_DF_surv$`Months from diagnosis to treatment`)
BREAST_DF_surv$`Survival months` <- as.numeric(BREAST_DF_surv$`Survival months`)
## Warning: NAs introduced by coercion
BREAST_DF_surv$`Total number of in situ/malignant tumors for patient` <-
as.numeric(BREAST_DF_surv$`Total number of in situ/malignant tumors for patient`)
## Warning: NAs introduced by coercion
BREAST_DF_surv$`Total number of benign/borderline tumors for patient` <-
as.numeric(BREAST_DF_surv$`Total number of benign/borderline tumors for patient`)
#Change the character to numeric in Eval dataset too
BREAST_DF_eval$`Months from diagnosis to treatment` <- as.numeric(BREAST_DF_eval$`Months from diagnosis to treatment`)
BREAST_DF_eval$`Survival months` <- as.numeric(BREAST_DF_eval$`Survival months`)
## Warning: NAs introduced by coercion
BREAST_DF_eval$`Total number of in situ/malignant tumors for patient` <-
as.numeric(BREAST_DF_eval$`Total number of in situ/malignant tumors for patient`)
## Warning: NAs introduced by coercion
BREAST_DF_eval$`Total number of benign/borderline tumors for patient` <-
as.numeric(BREAST_DF_eval$`Total number of benign/borderline tumors for patient`)
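The "NAs introduced by coercion" warnings mean that some non-missing strings could not be parsed as numbers. A small sketch of how to find out which strings are responsible is shown below; the vector `raw` is hypothetical, and "Unknown" is only an assumed example of a non-numeric code.

```r
# Hypothetical raw values; "Unknown" is an assumed example of a non-numeric code
raw <- c("0060", "0028", "Unknown", NA, "0010")

# Convert quietly, then flag entries that were present but failed to parse
converted <- suppressWarnings(as.numeric(raw))
failed <- !is.na(raw) & is.na(converted)

# Which distinct strings caused the warning, and how often
print(table(raw[failed]))
```

Running the same three lines on a column before converting it in place shows whether the new NAs correspond to a meaningful category that should be kept separately.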
# View the structure of the data frame
#str(BREAST_DF_surv)
skimr::skim(BREAST_DF_surv)
| | |
|---|---|
| Name | BREAST_DF_surv |
| Number of rows | 303557 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 26 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Sex | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
| Race recode (W, B, AI, API) | 0 | 1 | 5 | 29 | 0 | 5 | 0 |
| Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) | 0 | 1 | 18 | 42 | 0 | 6 | 0 |
| Site recode ICD-O-3/WHO 2008 | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
| Site recode ICD-O-3 2023 Revision | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
| Primary Site - labeled | 0 | 1 | 12 | 36 | 0 | 9 | 0 |
| Grade Recode (thru 2017) | 0 | 1 | 7 | 38 | 0 | 5 | 0 |
| Diagnostic Confirmation | 0 | 1 | 7 | 57 | 0 | 9 | 0 |
| Laterality | 0 | 1 | 24 | 53 | 0 | 5 | 0 |
| Chemotherapy recode (yes, no/unk) | 0 | 1 | 3 | 10 | 0 | 2 | 0 |
| Radiation recode | 0 | 1 | 12 | 53 | 0 | 8 | 0 |
| Reason no cancer-directed surgery | 0 | 1 | 15 | 76 | 0 | 8 | 0 |
| Survival months flag | 0 | 1 | 61 | 73 | 0 | 5 | 0 |
| COD to site recode | 0 | 1 | 5 | 55 | 0 | 87 | 0 |
| First malignant primary indicator | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Sequence number | 0 | 1 | 16 | 60 | 0 | 13 | 0 |
| Patient ID | 0 | 1 | 8 | 8 | 0 | 294480 | 0 |
| Marital status at diagnosis | 0 | 1 | 7 | 30 | 0 | 7 | 0 |
| Median household income inflation adj to 2021 | 0 | 1 | 8 | 38 | 0 | 11 | 0 |
| Rural-Urban Continuum Code | 0 | 1 | 38 | 60 | 0 | 7 | 0 |
| Age recode (<60,60-69,70+) | 0 | 1 | 9 | 11 | 0 | 18 | 0 |
| Race and origin (recommended by SEER) | 0 | 1 | 21 | 21 | 0 | 1 | 0 |
| Year of death recode | 0 | 1 | 4 | 21 | 0 | 11 | 0 |
| SEER other cause of death classification | 0 | 1 | 16 | 55 | 0 | 4 | 0 |
| RX Summ–Systemic/Sur Seq (2007+) | 0 | 1 | 16 | 55 | 0 | 8 | 0 |
| Origin recode NHIA (Hispanic, Non-Hisp) | 0 | 1 | 23 | 27 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year of diagnosis | 0 | 1.00 | 2013.04 | 1.42 | 2011 | 2012 | 2013 | 2014 | 2015 | ▇▇▇▇▇ |
| Months from diagnosis to treatment | 15843 | 0.95 | 1.13 | 1.14 | 0 | 0 | 1 | 2 | 24 | ▇▁▁▁▁ |
| Survival months | 1290 | 1.00 | 74.22 | 29.88 | 0 | 62 | 78 | 97 | 119 | ▂▂▆▇▆ |
| Total number of in situ/malignant tumors for patient | 3 | 1.00 | 1.36 | 0.65 | 1 | 1 | 1 | 2 | 20 | ▇▁▁▁▁ |
| Total number of benign/borderline tumors for patient | 0 | 1.00 | 0.01 | 0.09 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |
| Year of follow-up recode | 0 | 1.00 | 2018.90 | 2.14 | 2011 | 2019 | 2020 | 2020 | 2020 | ▁▁▁▁▇ |
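The summary above reports 294,480 unique values of ‘Patient ID’ across 303,557 rows, so some patients contribute more than one record. A minimal sketch of how such repeats could be counted is given below, using a toy vector `ids` (hypothetical apart from the repeated ID, which mirrors a repeat visible in the 2019-2020 preview).

```r
# Toy IDs; "00003771" appears twice
ids <- c("00002750", "00003771", "00003771", "00006501")

# Number of records per patient
records_per_patient <- table(ids)

# Distinct patients, and patients with more than one record
n_patients <- length(records_per_patient)
n_multi <- sum(records_per_patient > 1)
repeat_ids <- names(records_per_patient)[records_per_patient > 1]

print(n_patients)
print(n_multi)
print(repeat_ids)
```

Whether to keep every record or only one per patient depends on the research question; for per-patient survival estimates, repeated records would otherwise be counted more than once.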
skimr::skim(BREAST_DF_eval)
| | |
|---|---|
| Name | BREAST_DF_eval |
| Number of rows | 131395 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 26 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Sex | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
| Race recode (W, B, AI, API) | 0 | 1 | 5 | 29 | 0 | 5 | 0 |
| Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) | 0 | 1 | 18 | 42 | 0 | 6 | 0 |
| Site recode ICD-O-3/WHO 2008 | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
| Site recode ICD-O-3 2023 Revision | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
| Primary Site - labeled | 0 | 1 | 12 | 36 | 0 | 9 | 0 |
| Grade Recode (thru 2017) | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
| Diagnostic Confirmation | 0 | 1 | 7 | 57 | 0 | 9 | 0 |
| Laterality | 0 | 1 | 24 | 53 | 0 | 5 | 0 |
| Chemotherapy recode (yes, no/unk) | 0 | 1 | 3 | 10 | 0 | 2 | 0 |
| Radiation recode | 0 | 1 | 12 | 53 | 0 | 8 | 0 |
| Reason no cancer-directed surgery | 0 | 1 | 15 | 76 | 0 | 8 | 0 |
| Survival months flag | 0 | 1 | 61 | 73 | 0 | 5 | 0 |
| COD to site recode | 0 | 1 | 5 | 55 | 0 | 67 | 0 |
| First malignant primary indicator | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Sequence number | 0 | 1 | 16 | 60 | 0 | 16 | 0 |
| Patient ID | 0 | 1 | 8 | 8 | 0 | 127795 | 0 |
| Marital status at diagnosis | 0 | 1 | 7 | 30 | 0 | 7 | 0 |
| Median household income inflation adj to 2021 | 0 | 1 | 8 | 38 | 0 | 11 | 0 |
| Rural-Urban Continuum Code | 0 | 1 | 38 | 60 | 0 | 7 | 0 |
| Age recode (<60,60-69,70+) | 0 | 1 | 9 | 11 | 0 | 17 | 0 |
| Race and origin (recommended by SEER) | 0 | 1 | 21 | 21 | 0 | 1 | 0 |
| Year of death recode | 0 | 1 | 4 | 21 | 0 | 3 | 0 |
| SEER other cause of death classification | 0 | 1 | 16 | 55 | 0 | 4 | 0 |
| RX Summ--Systemic/Sur Seq (2007+) | 0 | 1 | 16 | 55 | 0 | 8 | 0 |
| Origin recode NHIA (Hispanic, Non-Hisp) | 0 | 1 | 23 | 27 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year of diagnosis | 0 | 1.00 | 2019.48 | 0.50 | 2019 | 2019 | 2019 | 2020 | 2020 | ▇▁▁▁▇ |
| Months from diagnosis to treatment | 6807 | 0.95 | 1.26 | 1.18 | 0 | 1 | 1 | 2 | 24 | ▇▁▁▁▁ |
| Survival months | 537 | 1.00 | 11.07 | 7.05 | 0 | 5 | 11 | 17 | 23 | ▇▆▆▇▆ |
| Total number of in situ/malignant tumors for patient | 11 | 1.00 | 1.31 | 0.62 | 1 | 1 | 1 | 1 | 50 | ▇▁▁▁▁ |
| Total number of benign/borderline tumors for patient | 0 | 1.00 | 0.01 | 0.09 | 0 | 0 | 0 | 0 | 2 | ▇▁▁▁▁ |
| Year of follow-up recode | 0 | 1.00 | 2019.98 | 0.14 | 2019 | 2020 | 2020 | 2020 | 2020 | ▁▁▁▁▇ |
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
# Store the column names for later use if needed
DF_col_names <- colnames(BREAST_DF_surv)
# use ggplot to plot the race information
BREAST_DF_surv |>
ggplot(mapping = aes(x=`Race recode (W, B, AI, API)`)) +
geom_bar(stat = "count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
ylim(0, 246000)
# To compare the racial composition of the survival and eval data, summarise each data set into a new data frame that stores only the summary statistics, including the percentage of each race group in the population
# Percentage of each race group in the survival data
BREAST_DF_perc_surv <- BREAST_DF_surv %>%
group_by(`Race recode (W, B, AI, API)`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
# Plot the percentages
ggplot(BREAST_DF_perc_surv, aes(x = `Race recode (W, B, AI, API)`, y = percentage)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, color = "black") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Percentage of Population by Race between 2011-2015", x = "Race recode (W, B, AI, API)", y = "Percentage") + ylim (0,90)
BREAST_DF_eval |>
ggplot(mapping = aes(x=`Race recode (W, B, AI, API)`)) +
geom_bar(stat = "count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
ylim(0, 104000)
BREAST_DF_perc_eval <- BREAST_DF_eval %>%
group_by(`Race recode (W, B, AI, API)`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
# Plot the percentages
ggplot(BREAST_DF_perc_eval, aes(x = `Race recode (W, B, AI, API)`, y = percentage)) +
geom_bar(stat = "identity", fill = "plum") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, color = "black") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Percentage of Population by Race between 2019-2020", x = "Race recode (W, B, AI, API)", y = "Percentage") + ylim (0,90)
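The count-and-percentage summary above is repeated for several columns and for both data sets. A minimal sketch of how it could be wrapped in one function is shown below; the helper name `percent_by_group` and the `toy` data frame are hypothetical and only stand in for `BREAST_DF_surv` and `BREAST_DF_eval`.

```r
library(dplyr)

# Hypothetical helper: count and percentage of rows for each level of one column.
# `col` is the column name given as a string.
percent_by_group <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop") %>%   # count per group
    mutate(total_count = sum(count),               # total number of rows
           percentage = count / total_count * 100) # share of each group
}

# Toy data (made up for illustration, not SEER data)
toy <- data.frame(race = c("White", "White", "Black", "Asian"))
race_pct <- percent_by_group(toy, "race")
print(race_pct)
```

Under these assumptions, a call such as `percent_by_group(BREAST_DF_surv, "Race recode (W, B, AI, API)")` should give the same columns as the summary built by hand above.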
# This section focuses on age to see whether it matters. The same plots are produced for age, starting with the percentages for the survival and eval data.
# Find the unique values of the age column
uniques_ages <- unique(BREAST_DF_surv["Age recode (<60,60-69,70+)"])
BREAST_DF_age_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(`Age recode (<60,60-69,70+)`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
perc_max <- max(BREAST_DF_age_perc_surv$percentage)
# Plot the percentages
ggplot(BREAST_DF_age_perc_surv, aes(x = `Age recode (<60,60-69,70+)`, y = percentage)) +
geom_bar(stat = "identity", fill = "brown") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 90) + # Rotate the text vertically
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +labs(title = "Percentage of Population by Age range 2011-2015",
x = "Age range",
y = "Percentage") +
ylim(0, round(1.5 * perc_max, 1))
# Repeat the same analysis for the eval data, based on age
BREAST_DF_age_perc_eval <- BREAST_DF_eval %>%
dplyr::group_by(`Age recode (<60,60-69,70+)`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
# Plot the percentages
ggplot(BREAST_DF_age_perc_eval, aes(x = `Age recode (<60,60-69,70+)`, y = percentage)) +
geom_bar(stat = "identity", fill = "brown") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 90) + # Rotate the text vertically
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(title = "Percentage of Population by Age range 2019-2020",
x = "Age range",
y = "Percentage") +
ylim(0, round(1.5 * perc_max, 1))
# This section analyses household income
# Find the unique values of the household income column
uniques_householdes <- unique(BREAST_DF_surv["Median household income inflation adj to 2021"])
BREAST_DF_income_perc_surv <- BREAST_DF_surv %>% dplyr::group_by(`Median household income inflation adj to 2021`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
perc_max <- max(BREAST_DF_income_perc_surv$percentage)
# Plot the percentages
ggplot(BREAST_DF_income_perc_surv, aes(x = `Median household income inflation adj to 2021`, y = percentage)) +
geom_bar(stat = "identity", fill = "brown") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Percentage of Population by income 2011-2015", x = "Household Income", y = "Percentage") +
ylim(0, 1.2*perc_max)
# Repeat the same analysis for the eval data, based on household income
BREAST_DF_income_perc_eval <- BREAST_DF_eval %>%
dplyr::group_by(`Median household income inflation adj to 2021`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
#Plot the percentages
perc_max <- max(BREAST_DF_income_perc_eval$percentage)
ggplot(BREAST_DF_income_perc_eval, aes(x = `Median household income inflation adj to 2021`, y = percentage)) +
geom_bar(stat = "identity", fill = "brown") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Percentage of Population by income 2019-2020", x = "Household Income", y = "Percentage") +
ylim(0, 1.2*perc_max)
# This section analyses the primary site
# Find the unique values of the primary site column
uniques_canter_type <- unique(BREAST_DF_surv["Primary Site - labeled"])
BREAST_DF_labeled_perc_surv <- BREAST_DF_surv %>% dplyr::group_by(`Primary Site - labeled`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
perc_max <- max(BREAST_DF_labeled_perc_surv$percentage)
# Plot the percentages
ggplot(BREAST_DF_labeled_perc_surv, aes(x = `Primary Site - labeled`, y = percentage)) +
geom_bar(stat = "identity", fill = "darkgreen") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Percentage of Population by Primary Site labels 2011-2015", x = "Primary Labels", y = "Percentage") +
ylim(0, 1.2*perc_max)
# Repeat the same analysis for the eval data, based on primary site
BREAST_DF_labeled_perc_eval <- BREAST_DF_eval %>%
dplyr::group_by(`Primary Site - labeled`) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(total_count = sum(count)) %>% # Calculate total count
mutate(percentage = count / total_count * 100) # Calculate percentage using total count
#Plot the percentages
perc_max <- max(BREAST_DF_labeled_perc_eval$percentage)
ggplot(BREAST_DF_labeled_perc_eval, aes(x = `Primary Site - labeled`, y = percentage)) +
geom_bar(stat = "identity", fill = "darkgreen") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Percentage of Population by Primary Site labels 2019-2020", x = "Primary Labels", y = "Percentage") +
ylim(0, 1.2*perc_max)
# Recode `COD to site recode` into three groups: "Alive" (the patient is still alive), "Breast" (the patient died of breast cancer), and "Other" (the patient passed away, but not because of breast cancer).
BREAST_DF_surv <- BREAST_DF_surv %>%
mutate(COD = ifelse(`COD to site recode` %in% c("Alive","Breast"), `COD to site recode`, "Other"))
In this section, we carry out exploratory data analysis on the following factors:
Cause of death of those who have had cancer
Total number of tumors (Malignant or Benign)
Radiation and chemotherapy
Surgery Performed
Marital Status
Household income
For each factor, we look at the size of each group in the population and then at how many patients within that group survived the cancer. Later we will run further analyses to see whether these were important or deciding factors.
BREAST_DF_COD_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(COD) %>%
dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group
ungroup() %>% # Ungroup the data
mutate(`Total Count` = sum(count)) %>% # Calculate total count
mutate(`Population %` = round(count / `Total Count` * 100, 2)) # Calculate percentage using total count
kable(BREAST_DF_COD_perc_surv)
| COD | count | Total Count | Population % |
|---|---|---|---|
| Alive | 228221 | 303557 | 75.18 |
| Breast | 38472 | 303557 | 12.67 |
| Other | 36864 | 303557 | 12.14 |
# First group by the number of tumors and find how many patients in the population fall into each group. Then, within each group, determine how many passed away solely due to breast cancer. Note that this approach may not be completely accurate, as some patients may have passed away from breast cancer complications that are not accounted for in these counts.
BREAST_DF_TNoT_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(`Total number of in situ/malignant tumors for patient`) %>%
dplyr::add_count() %>%
filter(COD == "Breast") %>%
dplyr::summarise(`Event Population` = n(),
Population = dplyr::first(n)) # Use `first()` to extract the total count in each group
# Calculate each group's percentage of the population and then the percentage of deceased patients within the group.
BREAST_DF_TNoT_perc_surv$`Group % in total` <- round(BREAST_DF_TNoT_perc_surv$Population/sum(BREAST_DF_TNoT_perc_surv$Population)*100,2)
BREAST_DF_TNoT_perc_surv$`Death %` <- round(BREAST_DF_TNoT_perc_surv$`Event Population`/BREAST_DF_TNoT_perc_surv$Population*100,2)
kable(BREAST_DF_TNoT_perc_surv)
| Total number of in situ/malignant tumors for patient | Event Population | Population | Group % in total | Death % |
|---|---|---|---|---|
| 1 | 27314 | 217122 | 71.53 | 12.58 |
| 2 | 8945 | 68082 | 22.43 | 13.14 |
| 3 | 1808 | 14579 | 4.80 | 12.40 |
| 4 | 322 | 2996 | 0.99 | 10.75 |
| 5 | 68 | 595 | 0.20 | 11.43 |
| 6 | 9 | 126 | 0.04 | 7.14 |
| 7 | 3 | 29 | 0.01 | 10.34 |
| 8 | 2 | 18 | 0.01 | 11.11 |
| 18 | 1 | 1 | 0.00 | 100.00 |
# Now focus on treatment. There are two types of treatment, radiation (R) and chemotherapy (C), which give four combinations: R:N-C:N, R:Y-C:N, R:N-C:Y, R:Y-C:Y. For each of these four groups we find the total number of patients and the number of deaths, and report them in the same way as above.
BREAST_DF_surv <- BREAST_DF_surv %>%
mutate(Radiation = ifelse(`Radiation recode` %in% c("None/Unknown","Refused (1988+)","Recommended, unknown if administered"),"No/Unknown","Yes"))
BREAST_DF_eval <- BREAST_DF_eval %>%
mutate(Radiation = ifelse(`Radiation recode` %in% c("None/Unknown","Refused (1988+)","Recommended, unknown if administered"),"No/Unknown","Yes"))
# Use dplyr to group by two parameters, chemotherapy and radiation therapy, and evaluate the death rate accordingly
BREAST_DF_RNC_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(Radiation,`Chemotherapy recode (yes, no/unk)`) %>%
dplyr::add_count() %>%
filter(COD == "Breast") %>%
dplyr::summarise(`Event Population` = n(),
Population = dplyr::first(n)) # Use `first()` to extract the total count in each group
## `summarise()` has grouped output by 'Radiation'. You can override using the
## `.groups` argument.
# Replace "No/Unknown" with "No" in the original columns
BREAST_DF_RNC_perc_surv$Radiation <- ifelse(BREAST_DF_RNC_perc_surv$Radiation == "No/Unknown", "No", BREAST_DF_RNC_perc_surv$Radiation)
BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)" <- ifelse(BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)" == "No/Unknown", "No", BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)")
# Create a new column "Radiation_Chemo" with values separated by "/"
BREAST_DF_RNC_perc_surv$Radiation_Chemo <- paste(BREAST_DF_RNC_perc_surv$Radiation, BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)", sep = "/")
# Optionally, remove the original "Radiation" and "Chemotherapy recode (yes, no/unk)" columns
BREAST_DF_RNC_perc_surv <- subset(BREAST_DF_RNC_perc_surv, select = -c(Radiation, `Chemotherapy recode (yes, no/unk)`))
BREAST_DF_RNC_perc_surv <- BREAST_DF_RNC_perc_surv[, c("Radiation_Chemo", setdiff(names(BREAST_DF_RNC_perc_surv), "Radiation_Chemo"))]
# Knowing the population, calculate each group's share of the total and the death rate within each group
BREAST_DF_RNC_perc_surv$`Group % in total` <- round(BREAST_DF_RNC_perc_surv$Population/sum(BREAST_DF_RNC_perc_surv$Population)*100,2)
BREAST_DF_RNC_perc_surv$`Death %` <- round(BREAST_DF_RNC_perc_surv$`Event Population`/BREAST_DF_RNC_perc_surv$Population*100,2)
kable(BREAST_DF_RNC_perc_surv)
| Radiation_Chemo | Event Population | Population | Group % in total | Death % |
|---|---|---|---|---|
| No/No | 15684 | 107012 | 35.25 | 14.66 |
| No/Yes | 9929 | 54966 | 18.11 | 18.06 |
| Yes/No | 3731 | 79926 | 26.33 | 4.67 |
| Yes/Yes | 9128 | 61653 | 20.31 | 14.81 |
# Next, look at surgery and the survival rate, and whether surgery might have been critical.
BREAST_DF_SUR_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(`Reason no cancer-directed surgery`) %>%
dplyr::add_count() %>%
filter(COD == "Breast") %>%
dplyr::summarise(`Event Population` = n(),
Population = dplyr::first(n)) # Use `first()` to extract the total count in each group
# Knowing the population, calculate each group's share of the total and the death rate within each group
BREAST_DF_SUR_perc_surv$`Group % in total` <- round(BREAST_DF_SUR_perc_surv$Population/sum(BREAST_DF_SUR_perc_surv$Population)*100,2)
BREAST_DF_SUR_perc_surv$`Death %` <- round(BREAST_DF_SUR_perc_surv$`Event Population`/BREAST_DF_SUR_perc_surv$Population*100,2)
kable(BREAST_DF_SUR_perc_surv)
| Reason no cancer-directed surgery | Event Population | Population | Group % in total | Death % |
|---|---|---|---|---|
| Not performed, patient died prior to recommended surgery | 139 | 278 | 0.09 | 50.00 |
| Not recommended | 11636 | 23199 | 7.64 | 50.16 |
| Not recommended, contraindicated due to other cond; autopsy only (1973-2002) | 593 | 1356 | 0.45 | 43.73 |
| Recommended but not performed, patient refused | 1171 | 2608 | 0.86 | 44.90 |
| Recommended but not performed, unknown reason | 545 | 1604 | 0.53 | 33.98 |
| Recommended, unknown if performed | 613 | 2649 | 0.87 | 23.14 |
| Surgery performed | 22376 | 269730 | 88.86 | 8.30 |
| Unknown; death certificate; or autopsy only (2003+) | 1399 | 2133 | 0.70 | 65.59 |
# Next, look at marital status and the survival rate, and whether marital status might have been critical.
BREAST_DF_MARI_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(`Marital status at diagnosis`) %>%
dplyr::add_count() %>%
filter(COD == "Breast") %>%
dplyr::summarise(`Event Population` = n(),
Population = dplyr::first(n)) # Use `first()` to extract the total count in each group
# Knowing the population, calculate each group's share of the total and the death rate within each group
BREAST_DF_MARI_perc_surv$`Group % in total` <- round(BREAST_DF_MARI_perc_surv$Population/sum(BREAST_DF_MARI_perc_surv$Population)*100,2)
BREAST_DF_MARI_perc_surv$`Death %` <- round(BREAST_DF_MARI_perc_surv$`Event Population`/BREAST_DF_MARI_perc_surv$Population*100,2)
kable(BREAST_DF_MARI_perc_surv)
| Marital status at diagnosis | Event Population | Population | Group % in total | Death % |
|---|---|---|---|---|
| Divorced | 4399 | 32214 | 10.61 | 13.66 |
| Married (including common law) | 15694 | 160551 | 52.89 | 9.78 |
| Separated | 544 | 3225 | 1.06 | 16.87 |
| Single (never married) | 7161 | 44678 | 14.72 | 16.03 |
| Unknown | 2774 | 18481 | 6.09 | 15.01 |
| Unmarried or Domestic Partner | 110 | 1014 | 0.33 | 10.85 |
| Widowed | 7790 | 43394 | 14.30 | 17.95 |
# Next, look at median household income and the survival rate, and whether income might have been critical.
BREAST_DF_HHI_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(`Median household income inflation adj to 2021`) %>%
dplyr::add_count() %>%
filter(COD == "Breast") %>%
dplyr::summarise(`Event Population` = n(),
Population = dplyr::first(n)) # Use `first()` to extract the total count in each group
# Knowing the population, calculate each group's share of the total and the death rate within each group
BREAST_DF_HHI_perc_surv$`Group % in total` <- round(BREAST_DF_HHI_perc_surv$Population/sum(BREAST_DF_HHI_perc_surv$Population)*100,2)
BREAST_DF_HHI_perc_surv$`Death %` <- round(BREAST_DF_HHI_perc_surv$`Event Population`/BREAST_DF_HHI_perc_surv$Population*100,2)
kable(BREAST_DF_HHI_perc_surv)
| Median household income inflation adj to 2021 | Event Population | Population | Group % in total | Death % |
|---|---|---|---|---|
| $35,000 - $39,999 | 1000 | 6077 | 2.00 | 16.46 |
| $40,000 - $44,999 | 1630 | 10225 | 3.37 | 15.94 |
| $45,000 - $49,999 | 2289 | 14917 | 4.91 | 15.34 |
| $50,000 - $54,999 | 2310 | 16794 | 5.53 | 13.75 |
| $55,000 - $59,999 | 3371 | 24860 | 8.19 | 13.56 |
| $60,000 - $64,999 | 6010 | 43537 | 14.34 | 13.80 |
| $65,000 - $69,999 | 5848 | 44978 | 14.82 | 13.00 |
| $70,000 - $74,999 | 3927 | 31930 | 10.52 | 12.30 |
| $75,000+ | 11608 | 107459 | 35.40 | 10.80 |
| < $35,000 | 469 | 2716 | 0.89 | 17.27 |
| Unknown/missing/no match/Not 1990-2021 | 10 | 64 | 0.02 | 15.62 |
# Next, look at the type of cancer (primary site) and the survival rate, and whether it might have been critical.
BREAST_DF_PSL_perc_surv <- BREAST_DF_surv %>%
dplyr::group_by(`Primary Site - labeled`) %>%
dplyr::add_count() %>%
filter(COD == "Breast") %>%
dplyr::summarise(`Event Population` = n(),
Population = dplyr::first(n)) # Use `first()` to extract the total count in each group
# Knowing the population, calculate each group's share of the total and the death rate within each group
BREAST_DF_PSL_perc_surv$`Group % in total` <- round(BREAST_DF_PSL_perc_surv$Population/sum(BREAST_DF_PSL_perc_surv$Population)*100,2)
BREAST_DF_PSL_perc_surv$`Death %` <- round(BREAST_DF_PSL_perc_surv$`Event Population`/BREAST_DF_PSL_perc_surv$Population*100,2)
kable(BREAST_DF_PSL_perc_surv)
| Primary Site - labeled | Event Population | Population | Group % in total | Death % |
|---|---|---|---|---|
| C50.0-Nipple | 173 | 1477 | 0.49 | 11.71 |
| C50.1-Central portion of breast | 2043 | 14012 | 4.62 | 14.58 |
| C50.2-Upper-inner quadrant of breast | 3058 | 36006 | 11.86 | 8.49 |
| C50.3-Lower-inner quadrant of breast | 1572 | 16365 | 5.39 | 9.61 |
| C50.4-Upper-outer quadrant of breast | 9710 | 98199 | 32.35 | 9.89 |
| C50.5-Lower-outer quadrant of breast | 2287 | 21939 | 7.23 | 10.42 |
| C50.6-Axillary tail of breast | 270 | 1685 | 0.56 | 16.02 |
| C50.8-Overlapping lesion of breast | 7514 | 68285 | 22.49 | 11.00 |
| C50.9-Breast, NOS | 11845 | 45589 | 15.02 | 25.98 |
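The tables above all follow the same pattern: group by one column, count the patients, count the breast cancer deaths, and derive two percentages. A minimal sketch of a reusable version follows. The helper name `death_rate_by` and the `toy` data are hypothetical. Unlike the chunks above, which filter on `COD == "Breast"` before summarising, this variant counts events inside `summarise()`, so groups with no breast cancer deaths are kept and `Group % in total` is computed over all patients.

```r
library(dplyr)

# Hypothetical helper: population, event count, and death rate per group.
death_rate_by <- function(df, group_col, cod_col = "COD", event = "Breast") {
  df %>%
    group_by(across(all_of(group_col))) %>%
    summarise(
      `Event Population` = sum(.data[[cod_col]] == event), # deaths from the event cause
      Population = n(),                                    # all patients in the group
      .groups = "drop"
    ) %>%
    mutate(
      `Group % in total` = round(Population / sum(Population) * 100, 2),
      `Death %` = round(`Event Population` / Population * 100, 2)
    )
}

# Toy data (made up for illustration, not SEER data)
toy <- data.frame(
  COD     = c("Alive", "Breast", "Other", "Alive", "Breast", "Alive"),
  Surgery = c("Yes",   "Yes",    "Yes",   "No",    "No",     "None")
)
toy_rates <- death_rate_by(toy, "Surgery")
print(toy_rates)
```

The `None` group in the toy data has no breast cancer deaths and still appears in the result with a death rate of 0, which the filter-based version would drop.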
# Names of the summary data frames to plot
DF_names <- c (
"BREAST_DF_TNoT_perc_surv",
"BREAST_DF_RNC_perc_surv",
"BREAST_DF_SUR_perc_surv",
"BREAST_DF_MARI_perc_surv",
"BREAST_DF_HHI_perc_surv",
"BREAST_DF_PSL_perc_surv")
# Create an empty list to store plots
plot_list <- list()
chart_color <- c("plum", "darkgreen", "darkred", "darkblue", "darkorange", "darkmagenta",
"darkcyan", "purple", "lightblue", "darkgray", "lightpink", "blue",
"brown", "red")
chart_title <- c("# of Malignant Tumors",
"Radiation/Chemo Status",
"Cancer Surgery",
"Marital Status",
"Household Income",
"Primary Site Labeled")
set.seed(2014)
# Loop through each dataframe
for (i in seq_along(DF_names)) {
# Access the dataframe
df <- get(DF_names[i])
# Generate a random color
random_color <- sample(chart_color, 1)
# Get the name of the first column and wrap the text
column_name <- str_wrap(names(df)[1], width = 10) # Adjust width as needed
# Create the plot and store it in the plot list
plot <- ggplot(df, aes(x = !!rlang::sym(names(df)[1]), y = !!rlang::sym("Death %"))) +
geom_bar(stat = "identity", fill = random_color) +
labs(title = chart_title[i],
x = NULL, y = "Death %") + # Remove x-axis label
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) # Rotate x-axis labels
plot_list[[i]] <- plot
}
# Arrange the plots in a 2 by 3 matrix
grid.arrange(grobs = plot_list, ncol = 3)
# Plot individually
# Loop through each dataframe
for (i in seq_along(DF_names)) {
# Access the dataframe
df <- get(DF_names[i])
# Generate a random color
random_color <- sample(chart_color, 1)
# Get the name of the first column and wrap the text
column_name <- str_wrap(names(df)[1], width = 10) # Adjust width as needed
# Create the plot
plot <- ggplot(df, aes(x = !!rlang::sym(names(df)[1]), y = !!rlang::sym("Death %"))) +
geom_bar(stat = "identity", fill = random_color) +
labs(title = chart_title[i],
x = NULL, y = "Death %") + # Remove x-axis label
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) # Rotate x-axis labels
# Print the plot
print(plot)
}
In this section we use several R packages to perform correlation and other analyses on the data. To do so, we first need to change the data slightly to make it suitable for packages such as survival, purrr, caret, and GGally.
The first step is to convert the categorical data to factors in the columns where they exist. Then we use purrr to calculate the chi-squared and Fisher's exact tests for the different variables. Since the population is large, we use bootstrapping and simulated p-values to calculate the p-value and gauge the importance of the different variables.
The strategy is to find the variables with the highest effect. The code calculates the p-values from a chi-squared or Fisher's exact test of independence between each categorical variable and the COD (cause of death) column. The lower the p-value, the stronger the evidence against the null hypothesis of independence, suggesting a significant association between the variable and COD. We then simplify the model by keeping the most relevant variables. We also need to check for homoscedasticity and remove the variables that may violate it.
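As a minimal sketch of the tests described above, the following base R example compares the asymptotic chi-squared p-value with Monte Carlo (simulated) p-values on a small synthetic table. The data frame `toy` and its `treatment` column are made up for illustration and are not SEER data; `B = 2000` is an arbitrary number of simulation replicates.

```r
set.seed(1)

# Synthetic data (an assumption for illustration only)
toy <- data.frame(
  COD       = sample(c("Alive", "Breast", "Other"), 300, replace = TRUE),
  treatment = sample(c("Yes", "No"), 300, replace = TRUE)
)

# Contingency table of cause of death against the categorical variable
tab <- table(toy$COD, toy$treatment)

# Asymptotic chi-squared test of independence
asymptotic <- chisq.test(tab)

# Same test with a Monte Carlo p-value instead of the chi-squared approximation
simulated <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

# Fisher's exact test with a simulated p-value (practical for larger tables)
fisher_sim <- fisher.test(tab, simulate.p.value = TRUE, B = 2000)

print(c(asymptotic = asymptotic$p.value,
        simulated  = simulated$p.value,
        fisher     = fisher_sim$p.value))
```

Because the two toy columns are drawn independently, none of the three p-values is expected to be systematically small; on the real data the same calls would be applied to `COD` and each factor column.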
Next we explore the data. Some columns can be eliminated from this analysis, e.g., year and race (which is recorded twice). The following lists the columns that are eliminated in the next steps of the analysis.
Race Recode
# List of columns to remove
uncritical_column <- c("Sex", "Year of diagnosis",
"Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)",
"Site recode ICD-O-3/WHO 2008", "Site recode ICD-O-3 2023 Revision",
# "Diagnostic Confirmation" is listed further down; "Survival months flag" is kept in the cleaned data
"COD to site recode",
"Patient ID", "Year of follow-up recode", "Year of death recode",
"SEER other cause of death classification",
"RX Summ--Systemic/Sur Seq (2007+)",
"Origin recode NHIA (Hispanic, Non-Hisp)",
"Race and origin (recommended by SEER)",
"Diagnostic Confirmation",
"Sequence number", "Radiation recode")
# Create BREAST_DF_surv_clean by removing uncritical columns
BREAST_DF_surv_clean <- BREAST_DF_surv[, !names(BREAST_DF_surv) %in% uncritical_column]
BREAST_DF_eval_clean <- BREAST_DF_eval[, !names(BREAST_DF_eval) %in% uncritical_column]
# Identify character and numeric columns
char_cols <- sapply(BREAST_DF_surv_clean, is.character)
num_cols <- sapply(BREAST_DF_surv_clean, is.numeric)
char_cols_e <- sapply(BREAST_DF_eval_clean, is.character)
# Convert character columns to factors
BREAST_DF_surv_clean[char_cols] <- lapply(BREAST_DF_surv_clean[char_cols], as.factor)
BREAST_DF_eval_clean[char_cols_e] <- lapply(BREAST_DF_eval_clean[char_cols_e], as.factor)
#BREAST_DF_surv[num_cols] <- lapply(BREAST_DF_surv[num_cols], as.factor)
# Check the class of each column to ensure they are factors now
#sapply(BREAST_DF_surv, class)
# Find the variables that have only one level
one_level_vars <- sapply(BREAST_DF_surv_clean, function(x) length(unique(x)) == 1)
# Names of the variables with only one level
one_level_vars_names <- names(one_level_vars)[one_level_vars]
#print(names(one_level_vars)[one_level_vars])
# Remove variables with only one level from the data frame
BREAST_DF_surv_clean <- BREAST_DF_surv_clean[, !names(BREAST_DF_surv_clean) %in% one_level_vars_names]
BREAST_DF_eval_clean <- BREAST_DF_eval_clean[, !names(BREAST_DF_eval_clean) %in% one_level_vars_names]
skimr::skim(BREAST_DF_surv_clean)
| Name | BREAST_DF_surv_clean |
| Number of rows | 303557 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| factor | 14 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Race recode (W, B, AI, API) | 0 | 1 | FALSE | 5 | Whi: 240584, Bla: 32165, Asi: 27061, Ame: 1933 |
| Primary Site - labeled | 0 | 1 | FALSE | 9 | C50: 98199, C50: 68285, C50: 45589, C50: 36006 |
| Grade Recode (thru 2017) | 0 | 1 | FALSE | 5 | Mod: 119566, Poo: 84251, Wel: 64536, Unk: 34855 |
| Laterality | 0 | 1 | FALSE | 5 | Lef: 152350, Rig: 147730, Pai: 3152, Onl: 190 |
| Chemotherapy recode (yes, no/unk) | 0 | 1 | FALSE | 2 | No/: 186938, Yes: 116619 |
| Reason no cancer-directed surgery | 0 | 1 | FALSE | 8 | Sur: 269730, Not: 23199, Rec: 2649, Rec: 2608 |
| Survival months flag | 0 | 1 | FALSE | 5 | Com: 295136, Inc: 6620, Not: 1290, Com: 376 |
| First malignant primary indicator | 0 | 1 | FALSE | 2 | Yes: 252683, No: 50874 |
| Marital status at diagnosis | 0 | 1 | FALSE | 7 | Mar: 160551, Sin: 44678, Wid: 43394, Div: 32214 |
| Median household income inflation adj to 2021 | 0 | 1 | FALSE | 11 | $75: 107459, $65: 44978, $60: 43537, $70: 31930 |
| Rural-Urban Continuum Code | 0 | 1 | FALSE | 7 | Cou: 185374, Cou: 65041, Cou: 21239, Non: 18125 |
| Age recode (<60,60-69,70+) | 0 | 1 | FALSE | 18 | 60-: 41318, 65-: 41060, 55-: 37068, 50-: 34424 |
| COD | 0 | 1 | FALSE | 3 | Ali: 228221, Bre: 38472, Oth: 36864 |
| Radiation | 0 | 1 | FALSE | 2 | No/: 161978, Yes: 141579 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Months from diagnosis to treatment | 15843 | 0.95 | 1.13 | 1.14 | 0 | 0 | 1 | 2 | 24 | ▇▁▁▁▁ |
| Survival months | 1290 | 1.00 | 74.22 | 29.88 | 0 | 62 | 78 | 97 | 119 | ▂▂▆▇▆ |
| Total number of in situ/malignant tumors for patient | 3 | 1.00 | 1.36 | 0.65 | 1 | 1 | 1 | 2 | 20 | ▇▁▁▁▁ |
| Total number of benign/borderline tumors for patient | 0 | 1.00 | 0.01 | 0.09 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |
skimr::skim(BREAST_DF_eval_clean)
| Name | BREAST_DF_eval_clean |
| Number of rows | 131395 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| factor | 13 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Race recode (W, B, AI, API) | 0 | 1 | FALSE | 5 | Whi: 100601, Bla: 14533, Asi: 13448, Unk: 1891 |
| Primary Site - labeled | 0 | 1 | FALSE | 9 | C50: 43321, C50: 30822, C50: 16539, C50: 16423 |
| Grade Recode (thru 2017) | 0 | 1 | FALSE | 1 | Unk: 131395 |
| Laterality | 0 | 1 | FALSE | 5 | Lef: 66096, Rig: 63885, Pai: 1317, Bil: 52 |
| Chemotherapy recode (yes, no/unk) | 0 | 1 | FALSE | 2 | No/: 83776, Yes: 47619 |
| Reason no cancer-directed surgery | 0 | 1 | FALSE | 8 | Sur: 114210, Not: 12567, Rec: 1144, Rec: 1111 |
| Survival months flag | 0 | 1 | FALSE | 5 | Com: 128932, Inc: 1037, Com: 633, Not: 537 |
| First malignant primary indicator | 0 | 1 | FALSE | 2 | Yes: 107910, No: 23485 |
| Marital status at diagnosis | 0 | 1 | FALSE | 7 | Mar: 70613, Sin: 20883, Wid: 16724, Div: 13667 |
| Median household income inflation adj to 2021 | 0 | 1 | FALSE | 11 | $75: 84913, $55: 8336, $65: 8298, $70: 8158 |
| Rural-Urban Continuum Code | 0 | 1 | FALSE | 7 | Cou: 80172, Cou: 28055, Cou: 9574, Non: 7900 |
| Age recode (<60,60-69,70+) | 0 | 1 | FALSE | 17 | 65-: 18702, 60-: 17760, 70-: 17096, 55-: 15189 |
| Radiation | 0 | 1 | FALSE | 2 | Yes: 65993, No/: 65402 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Months from diagnosis to treatment | 6807 | 0.95 | 1.26 | 1.18 | 0 | 1 | 1 | 2 | 24 | ▇▁▁▁▁ |
| Survival months | 537 | 1.00 | 11.07 | 7.05 | 0 | 5 | 11 | 17 | 23 | ▇▆▆▇▆ |
| Total number of in situ/malignant tumors for patient | 11 | 1.00 | 1.31 | 0.62 | 1 | 1 | 1 | 1 | 50 | ▇▁▁▁▁ |
| Total number of benign/borderline tumors for patient | 0 | 1.00 | 0.01 | 0.09 | 0 | 0 | 0 | 0 | 2 | ▇▁▁▁▁ |
# Function to calculate chi-squared test for independence
chi_squared_cal <- function(var, data) {
tab <- table(data$COD, var)
chisq_result <- chisq.test(tab)
p_value <- chisq_result$p.value
return(p_value)
}
# Function to calculate Fisher's exact test for independence
fisher_exact_cal <- function(var, data) {
tab <- table(data$COD, var)
# Perform Fisher's exact test
fisher_result <- fisher.test(tab, simulate.p.value = TRUE)
# Extract the p-value
p_value <- fisher_result$p.value
return(p_value)
}
# Initialize an empty list to store p-values
p_values <- list()
# Number of bootstrap samples
n_bootstrap <- 50
# I bootstrap and downsample (5% of the rows per draw) to reduce the effect of the large sample size on the test; even so, the p-values for all variables remain very close to 0
# Loop over each column in the dataframe
for (col in names(BREAST_DF_surv_clean)) {
# Check if the column is a factor
if (is.factor(BREAST_DF_surv_clean[[col]])) {
# Initialize an empty vector to store p-values from bootstrap samples
bootstrap_p_values <- numeric(n_bootstrap)
# Perform bootstrap sampling and calculate the Fisher's exact test p-value for each sample
for (i in seq_len(n_bootstrap)) {
# Draw a 5% subsample of the rows, with replacement
bootstrap_data <-
BREAST_DF_surv_clean[sample(nrow(BREAST_DF_surv_clean),
size = floor(0.05 * nrow(BREAST_DF_surv_clean)),
replace = TRUE), ]
# Chi-squared alternative, kept for reference:
#bootstrap_p_values[i] <- chi_squared_cal(bootstrap_data[[col]], bootstrap_data)
bootstrap_p_values[i] <- fisher_exact_cal(bootstrap_data[[col]], bootstrap_data)
}
# Calculate the mean p-value from bootstrap samples
mean_p_value <- mean(bootstrap_p_values)
# Store the mean p-value for the column
p_values[[col]] <- mean_p_value
}
}
# Convert the list of p-values to a data frame
p_values_df <- data.frame(variable = names(p_values), p_value = unlist(p_values))
# Sort the results by p-values
sorted_results <- p_values_df[order(p_values_df$p_value, na.last = TRUE), ]
# Print the sorted p-values
kable(sorted_results)
| | variable | p_value |
|---|---|---|
| Race recode (W, B, AI, API) | Race recode (W, B, AI, API) | 0.0004998 |
| Primary Site - labeled | Primary Site - labeled | 0.0004998 |
| Grade Recode (thru 2017) | Grade Recode (thru 2017) | 0.0004998 |
| Laterality | Laterality | 0.0004998 |
| Chemotherapy recode (yes, no/unk) | Chemotherapy recode (yes, no/unk) | 0.0004998 |
| Reason no cancer-directed surgery | Reason no cancer-directed surgery | 0.0004998 |
| Survival months flag | Survival months flag | 0.0004998 |
| First malignant primary indicator | First malignant primary indicator | 0.0004998 |
| Marital status at diagnosis | Marital status at diagnosis | 0.0004998 |
| Median household income inflation adj to 2021 | Median household income inflation adj to 2021 | 0.0004998 |
| Age recode (<60,60-69,70+) | Age recode (<60,60-69,70+) | 0.0004998 |
| COD | COD | 0.0004998 |
| Radiation | Radiation | 0.0004998 |
| Rural-Urban Continuum Code | Rural-Urban Continuum Code | 0.0015892 |
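Nearly every variable in the table above reports the same value, 0.0004998. A likely explanation, offered here as an assumption rather than something stated in the analysis, is that `fisher.test()` with `simulate.p.value = TRUE` uses `B = 2000` Monte Carlo replicates by default, so the smallest p-value it can return is 1/(B + 1) = 1/2001. The sketch below illustrates that floor on a hypothetical 3 x 3 table; the counts are invented for illustration and are not SEER data.

```r
# Smallest p-value a Monte Carlo test can report with B replicates
B <- 2000                     # default B in fisher.test() when simulate.p.value = TRUE
p_floor <- 1 / (B + 1)        # 1 / 2001, about 0.0004998

# Hypothetical, strongly associated 3 x 3 table (not SEER data)
tab <- matrix(c(50,  5,  5,
                 5, 50,  5,
                 5,  5, 50),
              nrow = 3, byrow = TRUE)

set.seed(42)
p_sim <- fisher.test(tab, simulate.p.value = TRUE, B = B)$p.value

# When no simulated table is as extreme as the observed one,
# the reported p-value equals the floor
print(c(p_floor = p_floor, p_sim = p_sim))
```

If that is what happened here, the identical values say only that the p-values fall below the simulation's resolution. Raising `B` gives finer resolution at the cost of run time.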
In this section I use existing R packages to calculate the correlation between each column and COD. To do so, I first separate the numerical and categorical data, since they need to be treated differently when calculating their association with COD. I start by finding the Pearson correlation coefficient between COD and the numerical columns.
# Select numerical columns in your dataset
numeric_cols <- sapply(BREAST_DF_surv_clean, is.numeric)
# Separate numerical and categorical columns
numeric_data <- BREAST_DF_surv_clean[, numeric_cols]
categorical_data <- BREAST_DF_surv_clean[, !numeric_cols]
# Calculate Pearson correlation coefficients between "COD" and the numerical columns
# (rcorr() is from the Hmisc package; COD is appended as the column labelled "y")
correlation_with_COD_numeric <- rcorr(as.matrix(numeric_data), y = BREAST_DF_surv_clean$COD, type = "pearson")
library(kableExtra)
# Extract the correlation coefficients
correlation_table <- correlation_with_COD_numeric$r
rownames(correlation_table) <- colnames(correlation_table)
# Display as a table
kable(correlation_table, caption = "Correlation Coefficients with COD")
| | Months from diagnosis to treatment | Survival months | Total number of in situ/malignant tumors for patient | Total number of benign/borderline tumors for patient | y |
|---|---|---|---|---|---|
| Months from diagnosis to treatment | 1.0000000 | -0.0139649 | 0.0186951 | 0.0005761 | -0.0037166 |
| Survival months | -0.0139649 | 1.0000000 | -0.0347760 | 0.0051819 | -0.5516706 |
| Total number of in situ/malignant tumors for patient | 0.0186951 | -0.0347760 | 1.0000000 | 0.0181349 | 0.1470846 |
| Total number of benign/borderline tumors for patient | 0.0005761 | 0.0051819 | 0.0181349 | 1.0000000 | 0.0096745 |
| y | -0.0037166 | -0.5516706 | 0.1470846 | 0.0096745 | 1.0000000 |
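The `y` row above is the correlation of each numeric column with COD. If COD is stored as a three-level factor, it is presumably reduced to its integer level codes before the Pearson coefficient is computed, so the result depends on the order of the levels. The sketch below uses a hypothetical toy data frame (not SEER records) to compare that coding with a binary "died" indicator, for which the Pearson coefficient is the point-biserial correlation. The names `toy`, `codes`, and `died` are illustrative assumptions.

```r
# Hypothetical toy data, not SEER records
toy <- data.frame(
  survival_months = c(60, 58, 12, 8, 55, 20, 59, 5),
  COD = factor(c("Alive", "Alive", "Breast", "Breast",
                 "Alive", "Other", "Alive", "Other"))
)

# Integer codes follow the alphabetical level order: Alive = 1, Breast = 2, Other = 3
codes <- as.integer(toy$COD)
r_codes <- cor(toy$survival_months, codes)

# Binary indicator: 1 if the patient died (of any cause), 0 if alive
died <- as.integer(toy$COD != "Alive")
r_died <- cor(toy$survival_months, died)

print(c(r_codes = r_codes, r_died = r_died))
```

The binary version has a direct reading: a negative value means patients who died had shorter recorded survival. The coefficient based on level codes would change if the levels were reordered.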
library(reshape2) # For melt function
# Melt correlation matrix
correlation_melted <- melt(correlation_table)
# Plot heatmap
ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 8, hjust = 1)) +
coord_fixed()
# Calculate Cramér's V for association between "COD" and categorical columns
cramer_v <- apply(categorical_data, 2, function(x) {
table_data <- table(x, BREAST_DF_surv_clean$COD)
assoc(table_data, method = "cramers")
})
# Print Cramér's V for association with categorical columns
#print(cramer_v)
# Insert a line break or comment to separate the code blocks
cat("\n")
# Initialize an empty data frame
cramer_v_df <- data.frame(Variable = character(), Value = numeric(), row.names = NULL)
# Iterate over each variable and its associated Cramér's V value
for (var_name in names(cramer_v)) {
# Extract Cramér's V value for the current variable
cramer_v_value <- cramer_v[[var_name]]
# Append a row to the data frame with the variable name and its Cramér's V value
cramer_v_df <- rbind(cramer_v_df, data.frame(Variable = var_name, Value = cramer_v_value))
}
# Print as a table
kable(cramer_v_df, caption = "Cramer's V for Association with COD")
| Variable | Value.x | Value.A | Value.Freq |
|---|---|---|---|
| Race recode (W, B, AI, API) | American Indian/Alaska Native | Alive | 1416 |
| Race recode (W, B, AI, API) | Asian or Pacific Islander | Alive | 22312 |
| Race recode (W, B, AI, API) | Black | Alive | 21523 |
| Race recode (W, B, AI, API) | Unknown | Alive | 1681 |
| Race recode (W, B, AI, API) | White | Alive | 181289 |
| Race recode (W, B, AI, API) | American Indian/Alaska Native | Breast | 252 |
| Race recode (W, B, AI, API) | Asian or Pacific Islander | Breast | 2749 |
| Race recode (W, B, AI, API) | Black | Breast | 6369 |
| Race recode (W, B, AI, API) | Unknown | Breast | 70 |
| Race recode (W, B, AI, API) | White | Breast | 29032 |
| Race recode (W, B, AI, API) | American Indian/Alaska Native | Other | 265 |
| Race recode (W, B, AI, API) | Asian or Pacific Islander | Other | 2000 |
| Race recode (W, B, AI, API) | Black | Other | 4273 |
| Race recode (W, B, AI, API) | Unknown | Other | 63 |
| Race recode (W, B, AI, API) | White | Other | 30263 |
| Primary Site - labeled | C50.0-Nipple | Alive | 1033 |
| Primary Site - labeled | C50.1-Central portion of breast | Alive | 9851 |
| Primary Site - labeled | C50.2-Upper-inner quadrant of breast | Alive | 28945 |
| Primary Site - labeled | C50.3-Lower-inner quadrant of breast | Alive | 12769 |
| Primary Site - labeled | C50.4-Upper-outer quadrant of breast | Alive | 77369 |
| Primary Site - labeled | C50.5-Lower-outer quadrant of breast | Alive | 17218 |
| Primary Site - labeled | C50.6-Axillary tail of breast | Alive | 1205 |
| Primary Site - labeled | C50.8-Overlapping lesion of breast | Alive | 52392 |
| Primary Site - labeled | C50.9-Breast, NOS | Alive | 27439 |
| Primary Site - labeled | C50.0-Nipple | Breast | 173 |
| Primary Site - labeled | C50.1-Central portion of breast | Breast | 2043 |
| Primary Site - labeled | C50.2-Upper-inner quadrant of breast | Breast | 3058 |
| Primary Site - labeled | C50.3-Lower-inner quadrant of breast | Breast | 1572 |
| Primary Site - labeled | C50.4-Upper-outer quadrant of breast | Breast | 9710 |
| Primary Site - labeled | C50.5-Lower-outer quadrant of breast | Breast | 2287 |
| Primary Site - labeled | C50.6-Axillary tail of breast | Breast | 270 |
| Primary Site - labeled | C50.8-Overlapping lesion of breast | Breast | 7514 |
| Primary Site - labeled | C50.9-Breast, NOS | Breast | 11845 |
| Primary Site - labeled | C50.0-Nipple | Other | 271 |
| Primary Site - labeled | C50.1-Central portion of breast | Other | 2118 |
| Primary Site - labeled | C50.2-Upper-inner quadrant of breast | Other | 4003 |
| Primary Site - labeled | C50.3-Lower-inner quadrant of breast | Other | 2024 |
| Primary Site - labeled | C50.4-Upper-outer quadrant of breast | Other | 11120 |
| Primary Site - labeled | C50.5-Lower-outer quadrant of breast | Other | 2434 |
| Primary Site - labeled | C50.6-Axillary tail of breast | Other | 210 |
| Primary Site - labeled | C50.8-Overlapping lesion of breast | Other | 8379 |
| Primary Site - labeled | C50.9-Breast, NOS | Other | 6305 |
| Grade Recode (thru 2017) | Moderately differentiated; Grade II | Alive | 93775 |
| Grade Recode (thru 2017) | Poorly differentiated; Grade III | Alive | 59437 |
| Grade Recode (thru 2017) | Undifferentiated; anaplastic; Grade IV | Alive | 202 |
| Grade Recode (thru 2017) | Unknown | Alive | 20725 |
| Grade Recode (thru 2017) | Well differentiated; Grade I | Alive | 54082 |
| Grade Recode (thru 2017) | Moderately differentiated; Grade II | Breast | 11130 |
| Grade Recode (thru 2017) | Poorly differentiated; Grade III | Breast | 15938 |
| Grade Recode (thru 2017) | Undifferentiated; anaplastic; Grade IV | Breast | 98 |
| Grade Recode (thru 2017) | Unknown | Breast | 8913 |
| Grade Recode (thru 2017) | Well differentiated; Grade I | Breast | 2393 |
| Grade Recode (thru 2017) | Moderately differentiated; Grade II | Other | 14661 |
| Grade Recode (thru 2017) | Poorly differentiated; Grade III | Other | 8876 |
| Grade Recode (thru 2017) | Undifferentiated; anaplastic; Grade IV | Other | 49 |
| Grade Recode (thru 2017) | Unknown | Other | 5217 |
| Grade Recode (thru 2017) | Well differentiated; Grade I | Other | 8061 |
| Laterality | Bilateral, single primary | Alive | 21 |
| Laterality | Left - origin of primary | Alive | 115104 |
| Laterality | Only one side - side unspecified | Alive | 59 |
| Laterality | Paired site, but no information concerning laterality | Alive | 438 |
| Laterality | Right - origin of primary | Alive | 112599 |
| Laterality | Bilateral, single primary | Breast | 89 |
| Laterality | Left - origin of primary | Breast | 18661 |
| Laterality | Only one side - side unspecified | Breast | 87 |
| Laterality | Paired site, but no information concerning laterality | Breast | 2080 |
| Laterality | Right - origin of primary | Breast | 17555 |
| Laterality | Bilateral, single primary | Other | 25 |
| Laterality | Left - origin of primary | Other | 18585 |
| Laterality | Only one side - side unspecified | Other | 44 |
| Laterality | Paired site, but no information concerning laterality | Other | 634 |
| Laterality | Right - origin of primary | Other | 17576 |
| Chemotherapy recode (yes, no/unk) | No/Unknown | Alive | 137991 |
| Chemotherapy recode (yes, no/unk) | Yes | Alive | 90230 |
| Chemotherapy recode (yes, no/unk) | No/Unknown | Breast | 19415 |
| Chemotherapy recode (yes, no/unk) | Yes | Breast | 19057 |
| Chemotherapy recode (yes, no/unk) | No/Unknown | Other | 29532 |
| Chemotherapy recode (yes, no/unk) | Yes | Other | 7332 |
| Reason no cancer-directed surgery | Not performed, patient died prior to recommended surgery | Alive | 0 |
| Reason no cancer-directed surgery | Not recommended | Alive | 6917 |
| Reason no cancer-directed surgery | Not recommended, contraindicated due to other cond; autopsy only (1973-2002) | Alive | 118 |
| Reason no cancer-directed surgery | Recommended but not performed, patient refused | Alive | 686 |
| Reason no cancer-directed surgery | Recommended but not performed, unknown reason | Alive | 729 |
| Reason no cancer-directed surgery | Recommended, unknown if performed | Alive | 1741 |
| Reason no cancer-directed surgery | Surgery performed | Alive | 217725 |
| Reason no cancer-directed surgery | Unknown; death certificate; or autopsy only (2003+) | Alive | 305 |
| Reason no cancer-directed surgery | Not performed, patient died prior to recommended surgery | Breast | 139 |
| Reason no cancer-directed surgery | Not recommended | Breast | 11636 |
| Reason no cancer-directed surgery | Not recommended, contraindicated due to other cond; autopsy only (1973-2002) | Breast | 593 |
| Reason no cancer-directed surgery | Recommended but not performed, patient refused | Breast | 1171 |
| Reason no cancer-directed surgery | Recommended but not performed, unknown reason | Breast | 545 |
| Reason no cancer-directed surgery | Recommended, unknown if performed | Breast | 613 |
| Reason no cancer-directed surgery | Surgery performed | Breast | 22376 |
| Reason no cancer-directed surgery | Unknown; death certificate; or autopsy only (2003+) | Breast | 1399 |
| Reason no cancer-directed surgery | Not performed, patient died prior to recommended surgery | Other | 139 |
| Reason no cancer-directed surgery | Not recommended | Other | 4646 |
| Reason no cancer-directed surgery | Not recommended, contraindicated due to other cond; autopsy only (1973-2002) | Other | 645 |
| Reason no cancer-directed surgery | Recommended but not performed, patient refused | Other | 751 |
| Reason no cancer-directed surgery | Recommended but not performed, unknown reason | Other | 330 |
| Reason no cancer-directed surgery | Recommended, unknown if performed | Other | 295 |
| Reason no cancer-directed surgery | Surgery performed | Other | 29629 |
| Reason no cancer-directed surgery | Unknown; death certificate; or autopsy only (2003+) | Other | 429 |
| Survival months flag | Complete dates are available and there are 0 days of survival | Alive | 248 |
| Survival months flag | Complete dates are available and there are more than 0 days of survival | Alive | 223378 |
| Survival months flag | Incomplete dates are available and there cannot be zero days of follow-up | Alive | 4551 |
| Survival months flag | Incomplete dates are available and there could be zero days of follow-up | Alive | 44 |
| Survival months flag | Not calculated because a Death Certificate Only or Autopsy Only case | Alive | 0 |
| Survival months flag | Complete dates are available and there are 0 days of survival | Breast | 83 |
| Survival months flag | Complete dates are available and there are more than 0 days of survival | Breast | 36117 |
| Survival months flag | Incomplete dates are available and there cannot be zero days of follow-up | Breast | 1183 |
| Survival months flag | Incomplete dates are available and there could be zero days of follow-up | Breast | 59 |
| Survival months flag | Not calculated because a Death Certificate Only or Autopsy Only case | Breast | 1030 |
| Survival months flag | Complete dates are available and there are 0 days of survival | Other | 45 |
| Survival months flag | Complete dates are available and there are more than 0 days of survival | Other | 35641 |
| Survival months flag | Incomplete dates are available and there cannot be zero days of follow-up | Other | 886 |
| Survival months flag | Incomplete dates are available and there could be zero days of follow-up | Other | 32 |
| Survival months flag | Not calculated because a Death Certificate Only or Autopsy Only case | Other | 260 |
| First malignant primary indicator | No | Alive | 32987 |
| First malignant primary indicator | Yes | Alive | 195234 |
| First malignant primary indicator | No | Breast | 7480 |
| First malignant primary indicator | Yes | Breast | 30992 |
| First malignant primary indicator | No | Other | 10407 |
| First malignant primary indicator | Yes | Other | 26457 |
| Marital status at diagnosis | Divorced | Alive | 23903 |
| Marital status at diagnosis | Married (including common law) | Alive | 132121 |
| Marital status at diagnosis | Separated | Alive | 2401 |
| Marital status at diagnosis | Single (never married) | Alive | 32829 |
| Marital status at diagnosis | Unknown | Alive | 12919 |
| Marital status at diagnosis | Unmarried or Domestic Partner | Alive | 844 |
| Marital status at diagnosis | Widowed | Alive | 23204 |
| Marital status at diagnosis | Divorced | Breast | 4399 |
| Marital status at diagnosis | Married (including common law) | Breast | 15694 |
| Marital status at diagnosis | Separated | Breast | 544 |
| Marital status at diagnosis | Single (never married) | Breast | 7161 |
| Marital status at diagnosis | Unknown | Breast | 2774 |
| Marital status at diagnosis | Unmarried or Domestic Partner | Breast | 110 |
| Marital status at diagnosis | Widowed | Breast | 7790 |
| Marital status at diagnosis | Divorced | Other | 3912 |
| Marital status at diagnosis | Married (including common law) | Other | 12736 |
| Marital status at diagnosis | Separated | Other | 280 |
| Marital status at diagnosis | Single (never married) | Other | 4688 |
| Marital status at diagnosis | Unknown | Other | 2788 |
| Marital status at diagnosis | Unmarried or Domestic Partner | Other | 60 |
| Marital status at diagnosis | Widowed | Other | 12400 |
| Median household income inflation adj to 2021 | $35,000 - $39,999 | Alive | 4108 |
| Median household income inflation adj to 2021 | $40,000 - $44,999 | Alive | 6976 |
| Median household income inflation adj to 2021 | $45,000 - $49,999 | Alive | 10351 |
| Median household income inflation adj to 2021 | $50,000 - $54,999 | Alive | 11978 |
| Median household income inflation adj to 2021 | $55,000 - $59,999 | Alive | 18238 |
| Median household income inflation adj to 2021 | $60,000 - $64,999 | Alive | 32172 |
| Median household income inflation adj to 2021 | $65,000 - $69,999 | Alive | 34163 |
| Median household income inflation adj to 2021 | $70,000 - $74,999 | Alive | 23995 |
| Median household income inflation adj to 2021 | $75,000+ | Alive | 84391 |
| Median household income inflation adj to 2021 | < $35,000 | Alive | 1799 |
| Median household income inflation adj to 2021 | Unknown/missing/no match/Not 1990-2021 | Alive | 50 |
| Median household income inflation adj to 2021 | $35,000 - $39,999 | Breast | 1000 |
| Median household income inflation adj to 2021 | $40,000 - $44,999 | Breast | 1630 |
| Median household income inflation adj to 2021 | $45,000 - $49,999 | Breast | 2289 |
| Median household income inflation adj to 2021 | $50,000 - $54,999 | Breast | 2310 |
| Median household income inflation adj to 2021 | $55,000 - $59,999 | Breast | 3371 |
| Median household income inflation adj to 2021 | $60,000 - $64,999 | Breast | 6010 |
| Median household income inflation adj to 2021 | $65,000 - $69,999 | Breast | 5848 |
| Median household income inflation adj to 2021 | $70,000 - $74,999 | Breast | 3927 |
| Median household income inflation adj to 2021 | $75,000+ | Breast | 11608 |
| Median household income inflation adj to 2021 | < $35,000 | Breast | 469 |
| Median household income inflation adj to 2021 | Unknown/missing/no match/Not 1990-2021 | Breast | 10 |
| Median household income inflation adj to 2021 | $35,000 - $39,999 | Other | 969 |
| Median household income inflation adj to 2021 | $40,000 - $44,999 | Other | 1619 |
| Median household income inflation adj to 2021 | $45,000 - $49,999 | Other | 2277 |
| Median household income inflation adj to 2021 | $50,000 - $54,999 | Other | 2506 |
| Median household income inflation adj to 2021 | $55,000 - $59,999 | Other | 3251 |
| Median household income inflation adj to 2021 | $60,000 - $64,999 | Other | 5355 |
| Median household income inflation adj to 2021 | $65,000 - $69,999 | Other | 4967 |
| Median household income inflation adj to 2021 | $70,000 - $74,999 | Other | 4008 |
| Median household income inflation adj to 2021 | $75,000+ | Other | 11460 |
| Median household income inflation adj to 2021 | < $35,000 | Other | 448 |
| Median household income inflation adj to 2021 | Unknown/missing/no match/Not 1990-2021 | Other | 4 |
| Rural-Urban Continuum Code | Counties in metropolitan areas ge 1 million pop | Alive | 141535 |
| Rural-Urban Continuum Code | Counties in metropolitan areas of 250,000 to 1 million pop | Alive | 48846 |
| Rural-Urban Continuum Code | Counties in metropolitan areas of lt 250 thousand pop | Alive | 15452 |
| Rural-Urban Continuum Code | Nonmetropolitan counties adjacent to a metropolitan area | Alive | 12781 |
| Rural-Urban Continuum Code | Nonmetropolitan counties not adjacent to a metropolitan area | Alive | 9289 |
| Rural-Urban Continuum Code | Unknown/missing/no match (Alaska or Hawaii - Entire State) | Alive | 268 |
| Rural-Urban Continuum Code | Unknown/missing/no match/Not 1990-2021 | Alive | 50 |
| Rural-Urban Continuum Code | Counties in metropolitan areas ge 1 million pop | Breast | 23147 |
| Rural-Urban Continuum Code | Counties in metropolitan areas of 250,000 to 1 million pop | Breast | 7884 |
| Rural-Urban Continuum Code | Counties in metropolitan areas of lt 250 thousand pop | Breast | 2843 |
| Rural-Urban Continuum Code | Nonmetropolitan counties adjacent to a metropolitan area | Breast | 2578 |
| Rural-Urban Continuum Code | Nonmetropolitan counties not adjacent to a metropolitan area | Breast | 1970 |
| Rural-Urban Continuum Code | Unknown/missing/no match (Alaska or Hawaii - Entire State) | Breast | 40 |
| Rural-Urban Continuum Code | Unknown/missing/no match/Not 1990-2021 | Breast | 10 |
| Rural-Urban Continuum Code | Counties in metropolitan areas ge 1 million pop | Other | 20692 |
| Rural-Urban Continuum Code | Counties in metropolitan areas of 250,000 to 1 million pop | Other | 8311 |
| Rural-Urban Continuum Code | Counties in metropolitan areas of lt 250 thousand pop | Other | 2944 |
| Rural-Urban Continuum Code | Nonmetropolitan counties adjacent to a metropolitan area | Other | 2766 |
| Rural-Urban Continuum Code | Nonmetropolitan counties not adjacent to a metropolitan area | Other | 2090 |
| Rural-Urban Continuum Code | Unknown/missing/no match (Alaska or Hawaii - Entire State) | Other | 57 |
| Rural-Urban Continuum Code | Unknown/missing/no match/Not 1990-2021 | Other | 4 |
| Age recode (<60,60-69,70+) | 01-04 years | Alive | 1 |
| Age recode (<60,60-69,70+) | 05-09 years | Alive | 2 |
| Age recode (<60,60-69,70+) | 10-14 years | Alive | 2 |
| Age recode (<60,60-69,70+) | 15-19 years | Alive | 14 |
| Age recode (<60,60-69,70+) | 20-24 years | Alive | 178 |
| Age recode (<60,60-69,70+) | 25-29 years | Alive | 1097 |
| Age recode (<60,60-69,70+) | 30-34 years | Alive | 3307 |
| Age recode (<60,60-69,70+) | 35-39 years | Alive | 7040 |
| Age recode (<60,60-69,70+) | 40-44 years | Alive | 15293 |
| Age recode (<60,60-69,70+) | 45-49 years | Alive | 24158 |
| Age recode (<60,60-69,70+) | 50-54 years | Alive | 29263 |
| Age recode (<60,60-69,70+) | 55-59 years | Alive | 30741 |
| Age recode (<60,60-69,70+) | 60-64 years | Alive | 33793 |
| Age recode (<60,60-69,70+) | 65-69 years | Alive | 32764 |
| Age recode (<60,60-69,70+) | 70-74 years | Alive | 23598 |
| Age recode (<60,60-69,70+) | 75-79 years | Alive | 15007 |
| Age recode (<60,60-69,70+) | 80-84 years | Alive | 8013 |
| Age recode (<60,60-69,70+) | 85+ years | Alive | 3950 |
| Age recode (<60,60-69,70+) | 01-04 years | Breast | 0 |
| Age recode (<60,60-69,70+) | 05-09 years | Breast | 0 |
| Age recode (<60,60-69,70+) | 10-14 years | Breast | 0 |
| Age recode (<60,60-69,70+) | 15-19 years | Breast | 1 |
| Age recode (<60,60-69,70+) | 20-24 years | Breast | 59 |
| Age recode (<60,60-69,70+) | 25-29 years | Breast | 265 |
| Age recode (<60,60-69,70+) | 30-34 years | Breast | 686 |
| Age recode (<60,60-69,70+) | 35-39 years | Breast | 1327 |
| Age recode (<60,60-69,70+) | 40-44 years | Breast | 2019 |
| Age recode (<60,60-69,70+) | 45-49 years | Breast | 2765 |
| Age recode (<60,60-69,70+) | 50-54 years | Breast | 3909 |
| Age recode (<60,60-69,70+) | 55-59 years | Breast | 4360 |
| Age recode (<60,60-69,70+) | 60-64 years | Breast | 4531 |
| Age recode (<60,60-69,70+) | 65-69 years | Breast | 4136 |
| Age recode (<60,60-69,70+) | 70-74 years | Breast | 3663 |
| Age recode (<60,60-69,70+) | 75-79 years | Breast | 3196 |
| Age recode (<60,60-69,70+) | 80-84 years | Breast | 3003 |
| Age recode (<60,60-69,70+) | 85+ years | Breast | 4552 |
| Age recode (<60,60-69,70+) | 01-04 years | Other | 0 |
| Age recode (<60,60-69,70+) | 05-09 years | Other | 0 |
| Age recode (<60,60-69,70+) | 10-14 years | Other | 0 |
| Age recode (<60,60-69,70+) | 15-19 years | Other | 2 |
| Age recode (<60,60-69,70+) | 20-24 years | Other | 11 |
| Age recode (<60,60-69,70+) | 25-29 years | Other | 43 |
| Age recode (<60,60-69,70+) | 30-34 years | Other | 100 |
| Age recode (<60,60-69,70+) | 35-39 years | Other | 182 |
| Age recode (<60,60-69,70+) | 40-44 years | Other | 423 |
| Age recode (<60,60-69,70+) | 45-49 years | Other | 713 |
| Age recode (<60,60-69,70+) | 50-54 years | Other | 1252 |
| Age recode (<60,60-69,70+) | 55-59 years | Other | 1967 |
| Age recode (<60,60-69,70+) | 60-64 years | Other | 2994 |
| Age recode (<60,60-69,70+) | 65-69 years | Other | 4160 |
| Age recode (<60,60-69,70+) | 70-74 years | Other | 4927 |
| Age recode (<60,60-69,70+) | 75-79 years | Other | 5630 |
| Age recode (<60,60-69,70+) | 80-84 years | Other | 6182 |
| Age recode (<60,60-69,70+) | 85+ years | Other | 8278 |
| COD | Alive | Alive | 228221 |
| COD | Breast | Alive | 0 |
| COD | Other | Alive | 0 |
| COD | Alive | Breast | 0 |
| COD | Breast | Breast | 38472 |
| COD | Other | Breast | 0 |
| COD | Alive | Other | 0 |
| COD | Breast | Other | 0 |
| COD | Other | Other | 36864 |
| Radiation | No/Unknown | Alive | 111019 |
| Radiation | Yes | Alive | 117202 |
| Radiation | No/Unknown | Breast | 25613 |
| Radiation | Yes | Breast | 12859 |
| Radiation | No/Unknown | Other | 25346 |
| Radiation | Yes | Other | 11518 |
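The table above lists the contingency counts of every level against COD rather than a single coefficient per variable. As a minimal sketch, Cramér's V can be derived from such counts with base R's `chisq.test()`, using V = sqrt(chi-squared / (n * (min(rows, cols) - 1))). The helper name `cramers_v` is hypothetical and not part of the original analysis. The Radiation counts are copied from the table above, and the second table is an invented sanity check.

```r
# Cramér's V from a contingency table of counts
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  sqrt(chi2 / (n * (k - 1)))
}

# Radiation by COD, counts taken from the table above
radiation_tab <- matrix(
  c(111019, 117202,   # Alive:  No/Unknown, Yes
     25613,  12859,   # Breast: No/Unknown, Yes
     25346,  11518),  # Other:  No/Unknown, Yes
  nrow = 2,
  dimnames = list(Radiation = c("No/Unknown", "Yes"),
                  COD = c("Alive", "Breast", "Other"))
)
v_radiation <- cramers_v(radiation_tab)

# Hypothetical table with perfect association, for which V should be 1
v_perfect <- cramers_v(diag(c(30, 30, 30)))

print(c(radiation = v_radiation, perfect = v_perfect))
```

Unlike the p-values reported earlier, V is scaled to lie between 0 and 1 and does not grow with the sample size, which makes it easier to compare across variables.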
# Melt Cramér's V results
cramer_v_melted <- melt(cramer_v_df, id.vars = "Variable", variable.name = "Var1", value.name = "value")
## Warning: attributes are not identical across measure variables; they will be
## dropped
# Plot as a bar graph
ggplot(cramer_v_melted, aes(x = Variable, y = value, fill = Var1)) +
geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(angle = 45, hjust = 1, vjust = 1)) + # Rotate y-axis labels by 45 degrees
scale_y_discrete(labels = function(x) str_wrap(x, width = 10)) + # Wrap labels with a width of 10 characters
labs(x = "Variable", y = "Cramer's V", fill = "Variable") +
ggtitle("Cramer's V for Association with COD")
# Since there are many factors and categorical variables, I need to encode them.
# The following code handles the encoding.
# Step 1: Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean) == "COD")
# Step 2: Exclude "COD" column from model matrix
encoded_data <- model.matrix(~ . - 1, data = BREAST_DF_surv_clean[, -cod_column_index])
# Step 3: Select encoded variables and target variable
encoded_data <- cbind(encoded_data, COD = BREAST_DF_surv_clean$COD)
## Warning in base::cbind(...): number of rows of result is not a multiple of
## vector length (arg 2)
# Step 4: Calculate correlation matrix
correlation_matrix <- cor(encoded_data)
## Warning in cor(encoded_data): the standard deviation is zero
# Step 5: Display summary statistics of the correlation matrix
summary_table <- summary(correlation_matrix)
summary_table_kable <- kable(summary_table)
# Step 6: Plot correlation matrix as a heatmap
library(corrplot)
corrplot(correlation_matrix, method = "color", tl.cex = 0.15, title = "Correlation Matrix")
# Display the summary table
summary_table_kable
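The `cbind()` warning above indicates that the encoded matrix and the `COD` vector have different numbers of rows. By default `model.matrix()` drops rows that contain missing values, which would explain the mismatch, since "Months from diagnosis to treatment" has missing entries. The sketch below shows one way to keep the rows aligned. It uses a hypothetical toy data frame, and the names `toy`, `keep`, and `cod_aligned` are illustrative assumptions.

```r
# Hypothetical toy data: a numeric column with one missing value, a factor, and the outcome
toy <- data.frame(
  months = c(1, NA, 3, 2),
  side   = factor(c("Left", "Right", "Left", "Right")),
  COD    = factor(c("Alive", "Breast", "Other", "Alive"))
)

predictors <- toy[, names(toy) != "COD", drop = FALSE]

# Keep only rows with no missing predictor, and subset the outcome the same way
keep <- complete.cases(predictors)
encoded <- model.matrix(~ . - 1, data = predictors[keep, , drop = FALSE])
cod_aligned <- toy$COD[keep]

# The row counts now agree, so cbind() does not recycle COD
encoded_with_cod <- cbind(encoded, COD = as.integer(cod_aligned))
print(dim(encoded_with_cod))
```

Here `COD` enters the matrix as integer level codes, so any correlation computed on that column carries the same level-order caveat as the Pearson coefficients earlier in this section.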
| Race recode (W, B, AI, API)American Indian/Alaska Native | Race recode (W, B, AI, API)Asian or Pacific Islander | Race recode (W, B, AI, API)Black | Race recode (W, B, AI, API)Unknown | Race recode (W, B, AI, API)White | Primary Site - labeledC50.1-Central portion of breast | Primary Site - labeledC50.2-Upper-inner quadrant of breast | Primary Site - labeledC50.3-Lower-inner quadrant of breast | Primary Site - labeledC50.4-Upper-outer quadrant of breast | Primary Site - labeledC50.5-Lower-outer quadrant of breast | Primary Site - labeledC50.6-Axillary tail of breast | Primary Site - labeledC50.8-Overlapping lesion of breast | Primary Site - labeledC50.9-Breast, NOS | Grade Recode (thru 2017)Poorly differentiated; Grade III | Grade Recode (thru 2017)Undifferentiated; anaplastic; Grade IV | Grade Recode (thru 2017)Unknown | Grade Recode (thru 2017)Well differentiated; Grade I | LateralityLeft - origin of primary | LateralityOnly one side - side unspecified | LateralityPaired site, but no information concerning laterality | LateralityRight - origin of primary | Chemotherapy recode (yes, no/unk)Yes | Months from diagnosis to treatment | Reason no cancer-directed surgeryNot recommended | Reason no cancer-directed surgeryNot recommended, contraindicated due to other cond; autopsy only (1973-2002) | Reason no cancer-directed surgeryRecommended but not performed, patient refused | Reason no cancer-directed surgeryRecommended but not performed, unknown reason | Reason no cancer-directed surgeryRecommended, unknown if performed | Reason no cancer-directed surgerySurgery performed | Reason no cancer-directed surgeryUnknown; death certificate; or autopsy only (2003+) | Survival months flagComplete dates are available and there are more than 0 days of survival | Survival months flagIncomplete dates are available and there cannot be zero days of follow-up | Survival months flagIncomplete dates are available and there could be zero days of follow-up | Survival months flagNot calculated because a Death Certificate Only or Autopsy Only case | Survival months | First malignant primary indicatorYes | Total number of in situ/malignant tumors for patient | Total number of benign/borderline tumors for patient | Marital status at diagnosisMarried (including common law) | Marital status at diagnosisSeparated | Marital status at diagnosisSingle (never married) | Marital status at diagnosisUnknown | Marital status at diagnosisUnmarried or Domestic Partner | Marital status at diagnosisWidowed | Median household income inflation adj to 2021$40,000 - $44,999 | Median household income inflation adj to 2021$45,000 - $49,999 | Median household income inflation adj to 2021$50,000 - $54,999 | Median household income inflation adj to 2021$55,000 - $59,999 | Median household income inflation adj to 2021$60,000 - $64,999 | Median household income inflation adj to 2021$65,000 - $69,999 | Median household income inflation adj to 2021$70,000 - $74,999 | Median household income inflation adj to 2021$75,000+ | Median household income inflation adj to 2021< $35,000 | Median household income inflation adj to 2021Unknown/missing/no match/Not 1990-2021 | Rural-Urban Continuum CodeCounties in metropolitan areas of 250,000 to 1 million pop | Rural-Urban Continuum CodeCounties in metropolitan areas of lt 250 thousand pop | Rural-Urban Continuum CodeNonmetropolitan counties adjacent to a metropolitan area | Rural-Urban Continuum CodeNonmetropolitan counties not adjacent to a metropolitan area | Rural-Urban Continuum CodeUnknown/missing/no match (Alaska or Hawaii - Entire State) | Rural-Urban Continuum CodeUnknown/missing/no match/Not 1990-2021 | Age recode (<60,60-69,70+)05-09 years | Age recode (<60,60-69,70+)10-14 years | Age recode (<60,60-69,70+)15-19 years | Age recode (<60,60-69,70+)20-24 years | Age recode (<60,60-69,70+)25-29 years | Age recode (<60,60-69,70+)30-34 years | Age recode (<60,60-69,70+)35-39 years | Age recode (<60,60-69,70+)40-44 years | Age recode (<60,60-69,70+)45-49 years | Age recode (<60,60-69,70+)50-54 years | Age recode (<60,60-69,70+)55-59 years | Age recode (<60,60-69,70+)60-64 years | Age recode (<60,60-69,70+)65-69 years | Age recode (<60,60-69,70+)70-74 years | Age recode (<60,60-69,70+)75-79 years | Age recode (<60,60-69,70+)80-84 years | Age recode (<60,60-69,70+)85+ years | RadiationYes | COD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :-0.1576838 | Min. :-0.6165221 | Min. :-0.673943 | Min. :-0.1334814 | Min. :-0.6739426 | Min. :-0.155876 | Min. :-0.2615058 | Min. :-0.169973 | Min. :-0.382989 | Min. :-0.199024 | Min. :-0.0520480 | Min. :-0.382989 | Min. :-0.275063 | Min. :-0.3318287 | Min. :-0.0208938 | Min. :-0.204073 | Min. :-0.331829 | Min. :-0.9935062 | Min. :-0.0366469 | Min. :-0.1575071 | Min. :-0.9935062 | Min. :-0.284581 | Min. :-0.071647 | Min. :-0.8574575 | Min. :-0.2202216 | Min. :-0.248924 | Min. :-0.1439998 | Min. :-0.306675 | Min. :-0.857457 | Min. :-0.0598241 | Min. :-0.9942023 | Min. :-0.994202 | Min. :-0.0557427 | Min. :1 | Min. :-0.277625 | Min. :-0.691429 | Min. :-0.6914293 | Min. :-0.0159064 | Min. :-0.449883 | Min. :-0.1121800 | Min. :-0.4498831 | Min. :-0.260209 | Min. :-0.0640663 | Min. :-0.433220 | Min. :-0.1387420 | Min. :-0.1694547 | Min. :-0.1808955 | Min. :-0.2237456 | Min. :-0.302021 | Min. :-0.307945 | Min. :-0.2570005 | Min. :-0.307945 | Min. :-0.070190 | Min. :-0.0098790 | Min. :-0.144158 | Min. :-0.1728759 | Min. :-0.154232 | Min. :-0.154618 | Min. :-0.0693888 | Min. :-0.0098790 | Min. :-0.0028688 | Min. :-0.0051977 | Min. :-0.0071789 | Min. :-0.0193592 | Min. :-0.027421 | Min. :-0.047111 | Min. :-0.068738 | Min. :-0.100748 | Min. :-0.128046 | Min. :-0.1445947 | Min. :-0.1504952 | Min. :-0.1593788 | Min. :-0.159379 | Min. :-0.138078 | Min. :-0.133971 | Min. :-0.150245 | Min. :-0.1873481 | Min. :-0.132353 | Min. :-0.0121551 | |
| 1st Qu.:-0.0020006 | 1st Qu.:-0.0202167 | 1st Qu.:-0.012099 | 1st Qu.:-0.0058338 | 1st Qu.:-0.0189862 | 1st Qu.:-0.010531 | 1st Qu.:-0.0055591 | 1st Qu.:-0.005036 | 1st Qu.:-0.007592 | 1st Qu.:-0.004752 | 1st Qu.:-0.0033956 | 1st Qu.:-0.005180 | 1st Qu.:-0.012712 | 1st Qu.:-0.0102566 | 1st Qu.:-0.0020484 | 1st Qu.:-0.005608 | 1st Qu.:-0.011881 | 1st Qu.:-0.0025589 | 1st Qu.:-0.0026654 | 1st Qu.:-0.0054651 | 1st Qu.:-0.0025611 | 1st Qu.:-0.008157 | 1st Qu.:-0.009361 | 1st Qu.:-0.0113926 | 1st Qu.:-0.0066137 | 1st Qu.:-0.005443 | 1st Qu.:-0.0023943 | 1st Qu.:-0.008606 | 1st Qu.:-0.012623 | 1st Qu.:-0.0016251 | 1st Qu.:-0.0036413 | 1st Qu.:-0.008391 | 1st Qu.:-0.0015205 | 1st Qu.:1 | 1st Qu.:-0.017262 | 1st Qu.:-0.008279 | 1st Qu.:-0.0105455 | 1st Qu.:-0.0024496 | 1st Qu.:-0.018276 | 1st Qu.:-0.0034648 | 1st Qu.:-0.0153153 | 1st Qu.:-0.005487 | 1st Qu.:-0.0034380 | 1st Qu.:-0.025953 | 1st Qu.:-0.0065187 | 1st Qu.:-0.0075767 | 1st Qu.:-0.0064648 | 1st Qu.:-0.0062737 | 1st Qu.:-0.007336 | 1st Qu.:-0.012559 | 1st Qu.:-0.0049280 | 1st Qu.:-0.014613 | 1st Qu.:-0.004756 | 1st Qu.:-0.0027459 | 1st Qu.:-0.006427 | 1st Qu.:-0.0063794 | 1st Qu.:-0.008253 | 1st Qu.:-0.008279 | 1st Qu.:-0.0042149 | 1st Qu.:-0.0027459 | 1st Qu.:-0.0008929 | 1st Qu.:-0.0008929 | 1st Qu.:-0.0018815 | 1st Qu.:-0.0038172 | 1st Qu.:-0.005673 | 1st Qu.:-0.008648 | 1st Qu.:-0.008935 | 1st Qu.:-0.012379 | 1st Qu.:-0.014009 | 1st Qu.:-0.0100840 | 1st Qu.:-0.0062954 | 1st Qu.:-0.0035576 | 1st Qu.:-0.013136 | 1st Qu.:-0.014378 | 1st Qu.:-0.019360 | 1st Qu.:-0.023319 | 1st Qu.:-0.0243034 | 1st Qu.:-0.019630 | 1st Qu.:-0.0011461 | |
| Median : 0.0005336 | Median :-0.0033835 | Median : 0.001860 | Median :-0.0006944 | Median :-0.0007078 | Median :-0.001262 | Median :-0.0028889 | Median :-0.001448 | Median :-0.001242 | Median :-0.002111 | Median :-0.0003168 | Median :-0.001220 | Median : 0.000506 | Median :-0.0009646 | Median :-0.0002264 | Median : 0.002904 | Median :-0.001414 | Median :-0.0004302 | Median :-0.0004554 | Median :-0.0002511 | Median : 0.0002321 | Median : 0.003513 | Median :-0.002447 | Median :-0.0019345 | Median :-0.0007267 | Median :-0.001201 | Median :-0.0000559 | Median :-0.001012 | Median : 0.000035 | Median :-0.0002439 | Median : 0.0016351 | Median :-0.002005 | Median :-0.0003634 | Median :1 | Median :-0.002944 | Median : 0.001289 | Median :-0.0017191 | Median :-0.0003627 | Median :-0.005281 | Median :-0.0003402 | Median :-0.0007771 | Median :-0.001318 | Median :-0.0004321 | Median :-0.001559 | Median :-0.0005868 | Median :-0.0009248 | Median :-0.0007155 | Median :-0.0002672 | Median :-0.001239 | Median :-0.001800 | Median :-0.0016338 | Median :-0.003181 | Median :-0.001151 | Median :-0.0008847 | Median :-0.000154 | Median :-0.0008082 | Median :-0.001274 | Median :-0.001003 | Median :-0.0004993 | Median :-0.0008847 | Median :-0.0002912 | Median :-0.0003251 | Median :-0.0005455 | Median :-0.0008537 | Median :-0.001546 | Median :-0.000829 | Median :-0.001997 | Median :-0.003132 | Median :-0.002266 | Median :-0.0018848 | Median :-0.0003493 | Median :-0.0004631 | Median :-0.002854 | Median :-0.002379 | Median :-0.002270 | Median :-0.002061 | Median :-0.0020631 | Median :-0.002587 | Median : 0.0003216 | |
| Mean : 0.0169704 | Mean :-0.0009204 | Mean : 0.005789 | Mean : 0.0107475 | Mean :-0.0088807 | Mean : 0.005178 | Mean : 0.0001063 | Mean : 0.004543 | Mean :-0.006328 | Mean : 0.002687 | Mean : 0.0103478 | Mean :-0.003984 | Mean : 0.003542 | Mean : 0.0100118 | Mean : 0.0125342 | Mean : 0.012084 | Mean : 0.001227 | Mean :-0.0009544 | Mean : 0.0126757 | Mean : 0.0127019 | Mean :-0.0007810 | Mean : 0.016328 | Mean : 0.008602 | Mean : 0.0009881 | Mean : 0.0085826 | Mean : 0.008312 | Mean : 0.0118276 | Mean : 0.008002 | Mean :-0.010200 | Mean : 0.0125725 | Mean : 0.0006063 | Mean :-0.001404 | Mean : 0.0115660 | Mean :1 | Mean : 0.006655 | Mean : 0.007717 | Mean :-0.0004392 | Mean : 0.0127352 | Mean :-0.005335 | Mean : 0.0102429 | Mean : 0.0056400 | Mean : 0.008018 | Mean : 0.0110086 | Mean : 0.001435 | Mean : 0.0110037 | Mean : 0.0105688 | Mean : 0.0091652 | Mean : 0.0060635 | Mean :-0.002248 | Mean :-0.004613 | Mean :-0.0003216 | Mean :-0.014703 | Mean : 0.013485 | Mean : 0.0250915 | Mean : 0.008380 | Mean : 0.0112452 | Mean : 0.011215 | Mean : 0.012771 | Mean : 0.0167936 | Mean : 0.0250915 | Mean : 0.0128817 | Mean : 0.0127117 | Mean : 0.0128602 | Mean : 0.0121784 | Mean : 0.010861 | Mean : 0.009444 | Mean : 0.007175 | Mean : 0.003054 | Mean : 0.000216 | Mean :-0.0007652 | Mean :-0.0011515 | Mean :-0.0022937 | Mean :-0.004167 | Mean :-0.002999 | Mean :-0.001605 | Mean :-0.000504 | Mean :-0.0000585 | Mean : 0.009374 | Mean : 0.0132475 | |
| 3rd Qu.: 0.0040061 | 3rd Qu.: 0.0070447 | 3rd Qu.: 0.016525 | 3rd Qu.: 0.0047715 | 3rd Qu.: 0.0125200 | 3rd Qu.: 0.005010 | 3rd Qu.: 0.0011539 | 3rd Qu.: 0.004086 | 3rd Qu.: 0.004650 | 3rd Qu.: 0.001679 | 3rd Qu.: 0.0022560 | 3rd Qu.: 0.002721 | 3rd Qu.: 0.012246 | 3rd Qu.: 0.0092260 | 3rd Qu.: 0.0026775 | 3rd Qu.: 0.009980 | 3rd Qu.: 0.007513 | 3rd Qu.: 0.0014938 | 3rd Qu.: 0.0015396 | 3rd Qu.: 0.0030307 | 3rd Qu.: 0.0022817 | 3rd Qu.: 0.022748 | 3rd Qu.: 0.005722 | 3rd Qu.: 0.0066273 | 3rd Qu.: 0.0027397 | 3rd Qu.: 0.002368 | 3rd Qu.: 0.0043855 | 3rd Qu.: 0.004974 | 3rd Qu.: 0.009365 | 3rd Qu.: 0.0016627 | 3rd Qu.: 0.0095196 | 3rd Qu.: 0.002287 | 3rd Qu.: 0.0003999 | 3rd Qu.:1 | 3rd Qu.: 0.012122 | 3rd Qu.: 0.010252 | 3rd Qu.: 0.0075641 | 3rd Qu.: 0.0015427 | 3rd Qu.: 0.010096 | 3rd Qu.: 0.0030447 | 3rd Qu.: 0.0166287 | 3rd Qu.: 0.007431 | 3rd Qu.: 0.0018987 | 3rd Qu.: 0.011320 | 3rd Qu.: 0.0031320 | 3rd Qu.: 0.0035984 | 3rd Qu.: 0.0053395 | 3rd Qu.: 0.0042906 | 3rd Qu.: 0.002973 | 3rd Qu.: 0.003492 | 3rd Qu.: 0.0032218 | 3rd Qu.: 0.004352 | 3rd Qu.: 0.004585 | 3rd Qu.:-0.0000462 | 3rd Qu.: 0.003579 | 3rd Qu.: 0.0059771 | 3rd Qu.: 0.005613 | 3rd Qu.: 0.005251 | 3rd Qu.: 0.0020642 | 3rd Qu.:-0.0000462 | 3rd Qu.:-0.0000228 | 3rd Qu.:-0.0000348 | 3rd Qu.: 0.0004444 | 3rd Qu.: 0.0016579 | 3rd Qu.: 0.002306 | 3rd Qu.: 0.003628 | 3rd Qu.: 0.004160 | 3rd Qu.: 0.003860 | 3rd Qu.: 0.003601 | 3rd Qu.: 0.0033621 | 3rd Qu.: 0.0021982 | 3rd Qu.: 0.0038639 | 3rd Qu.: 0.003548 | 3rd Qu.: 0.003082 | 3rd Qu.: 0.002808 | 3rd Qu.: 0.003460 | 3rd Qu.: 0.0041037 | 3rd Qu.: 0.007550 | 3rd Qu.: 0.0019772 | |
| Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. :1 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.000000 | Max. : 1.0000000 | Max. : 1.000000 | Max. : 1.0000000 | |
| NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :78 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 |
# Extract correlation with COD
correlation_with_COD <- correlation_matrix[, "COD"]
# Convert correlation_with_COD to a data frame with column names
correlation_df <- data.frame(variable = names(correlation_with_COD), correlation = correlation_with_COD)
# Sort correlation values
correlation_df <- correlation_df[order(correlation_df$correlation, decreasing = TRUE), ]
# Create bar plot using ggplot2. reorder() is used because ggplot2 orders a
# character x axis alphabetically, regardless of the row order of the data frame.
ggplot(correlation_df, aes(x = reorder(variable, -correlation), y = correlation)) +
  geom_bar(stat = "identity") +
  labs(title = "Correlation with COD", x = "Variables", y = "Correlation")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
# Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean) == "COD")
# Exclude "COD" column from model matrix and encode factors
encoded_data <- predict(dummyVars(" ~ .", data = BREAST_DF_surv_clean[, -cod_column_index], fullRank = TRUE), newdata = BREAST_DF_surv_clean)
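# Caution: "COD" was already excluded from the encoding above, and cod_column_index
# refers to BREAST_DF_surv_clean, so this drops whichever encoded column sits at that position
encoded_data <- encoded_data[, -cod_column_index]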
# Add "COD" column back to encoded_data
encoded_data <- cbind(encoded_data, COD = BREAST_DF_surv_clean$COD)
# Calculate correlation matrix
correlation_matrix <- cor(encoded_data)
# Extract correlation with COD
correlation_with_COD <- correlation_matrix[, "COD"]
# Summary of correlation matrix
summary(correlation_matrix)
## `Race recode (W, B, AI, API)`Asian or Pacific Islander
## Min. :-0.611482
## 1st Qu.:-0.020636
## Median :-0.004543
## Mean :-0.001203
## 3rd Qu.: 0.006915
## Max. : 1.000000
## NA's :3
## `Race recode (W, B, AI, API)`Black `Race recode (W, B, AI, API)`Unknown
## Min. :-0.672898 Min. :-0.151550
## 1st Qu.:-0.008186 1st Qu.:-0.008196
## Median : 0.002961 Median :-0.001720
## Mean : 0.007238 Mean : 0.011140
## 3rd Qu.: 0.015977 3rd Qu.: 0.004775
## Max. : 1.000000 Max. : 1.000000
## NA's :3 NA's :3
## `Race recode (W, B, AI, API)`White
## Min. :-0.672898
## 1st Qu.:-0.021506
## Median :-0.001173
## Mean :-0.007624
## 3rd Qu.: 0.013379
## Max. : 1.000000
## NA's :3
## `Primary Site - labeled`C50.1-Central portion of breast
## Min. :-0.152121
## 1st Qu.:-0.009934
## Median :-0.002113
## Mean : 0.005421
## 3rd Qu.: 0.006062
## Max. : 1.000000
## NA's :3
## `Primary Site - labeled`C50.2-Upper-inner quadrant of breast
## Min. :-0.253678
## 1st Qu.:-0.009990
## Median :-0.003675
## Mean :-0.001860
## 3rd Qu.: 0.000150
## Max. : 1.000000
## NA's :3
## `Primary Site - labeled`C50.3-Lower-inner quadrant of breast
## Min. :-0.165071
## 1st Qu.:-0.008009
## Median :-0.002641
## Mean : 0.003667
## 3rd Qu.: 0.003119
## Max. : 1.000000
## NA's :3
## `Primary Site - labeled`C50.4-Upper-outer quadrant of breast
## Min. :-0.372542
## 1st Qu.:-0.015158
## Median :-0.001181
## Mean :-0.008656
## 3rd Qu.: 0.005932
## Max. : 1.000000
## NA's :3
## `Primary Site - labeled`C50.5-Lower-outer quadrant of breast
## Min. :-0.1930083
## 1st Qu.:-0.0068032
## Median :-0.0026344
## Mean : 0.0017748
## 3rd Qu.: 0.0000719
## Max. : 1.0000000
## NA's :3
## `Primary Site - labeled`C50.6-Axillary tail of breast
## Min. :-0.0516638
## 1st Qu.:-0.0032916
## Median :-0.0002025
## Mean : 0.0108061
## 3rd Qu.: 0.0025517
## Max. : 1.0000000
## NA's :3
## `Primary Site - labeled`C50.8-Overlapping lesion of breast
## Min. :-0.372542
## 1st Qu.:-0.005557
## Median :-0.001620
## Mean :-0.005418
## 3rd Qu.: 0.001945
## Max. : 1.000000
## NA's :3
## `Primary Site - labeled`C50.9-Breast, NOS
## Min. :-0.290700
## 1st Qu.:-0.014812
## Median : 0.002148
## Mean : 0.010813
## 3rd Qu.: 0.017795
## Max. : 1.000000
## NA's :3
## `Grade Recode (thru 2017)`Poorly differentiated; Grade III
## Min. :-0.3220663
## 1st Qu.:-0.0132914
## Median :-0.0008937
## Mean : 0.0111492
## 3rd Qu.: 0.0112681
## Max. : 1.0000000
## NA's :3
## `Grade Recode (thru 2017)`Undifferentiated; anaplastic; Grade IV
## Min. :-0.0210283
## 1st Qu.:-0.0017196
## Median :-0.0004927
## Mean : 0.0132802
## 3rd Qu.: 0.0023513
## Max. : 1.0000000
## NA's :3
## `Grade Recode (thru 2017)`Unknown
## Min. :-0.269170
## 1st Qu.:-0.009920
## Median : 0.003273
## Mean : 0.019721
## 3rd Qu.: 0.024205
## Max. : 1.000000
## NA's :3
## `Grade Recode (thru 2017)`Well differentiated; Grade I
## Min. :-0.322066
## 1st Qu.:-0.017850
## Median :-0.005277
## Mean :-0.001697
## 3rd Qu.: 0.007549
## Max. : 1.000000
## NA's :3
## Laterality.Only one side - side unspecified
## Min. :-0.0505764
## 1st Qu.:-0.0025761
## Median :-0.0002076
## Mean : 0.0149982
## 3rd Qu.: 0.0036708
## Max. : 1.0000000
## NA's :3
## Laterality.Paired site, but no information concerning laterality
## Min. :-0.306824
## 1st Qu.:-0.013601
## Median :-0.001130
## Mean : 0.026582
## 3rd Qu.: 0.007532
## Max. : 1.000000
## NA's :3
## Laterality.Right - origin of primary `Chemotherapy recode (yes, no/unk)`Yes
## Min. :-0.0997361 Min. :-0.266711
## 1st Qu.:-0.0033810 1st Qu.:-0.022998
## Median :-0.0004293 Median : 0.003276
## Mean : 0.0100453 Mean : 0.014559
## 3rd Qu.: 0.0021401 3rd Qu.: 0.022243
## Max. : 1.0000000 Max. : 1.000000
## NA's :3 NA's :3
## `Months from diagnosis to treatment`
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
## NA's :76
## `Reason no cancer-directed surgery`Not recommended
## Min. :-0.812290
## 1st Qu.:-0.020352
## Median :-0.004180
## Mean : 0.006545
## 3rd Qu.: 0.008143
## Max. : 1.000000
## NA's :3
## `Reason no cancer-directed surgery`Not recommended, contraindicated due to other cond; autopsy only (1973-2002)
## Min. :-0.189154
## 1st Qu.:-0.006970
## Median :-0.001004
## Mean : 0.011121
## 3rd Qu.: 0.002799
## Max. : 1.000000
## NA's :3
## `Reason no cancer-directed surgery`Recommended but not performed, patient refused
## Min. :-0.262869
## 1st Qu.:-0.008973
## Median :-0.001474
## Mean : 0.009488
## 3rd Qu.: 0.002102
## Max. : 1.000000
## NA's :3
## `Reason no cancer-directed surgery`Recommended but not performed, unknown reason
## Min. :-0.205810
## 1st Qu.:-0.006578
## Median :-0.002151
## Mean : 0.013125
## 3rd Qu.: 0.006299
## Max. : 1.000000
## NA's :3
## `Reason no cancer-directed surgery`Recommended, unknown if performed
## Min. :-0.2649458
## 1st Qu.:-0.0085961
## Median :-0.0009167
## Mean : 0.0086230
## 3rd Qu.: 0.0060008
## Max. : 1.0000000
## NA's :3
## `Reason no cancer-directed surgery`Surgery performed
## Min. :-0.8122899
## 1st Qu.:-0.0354131
## Median : 0.0005905
## Mean :-0.0229615
## 3rd Qu.: 0.0230988
## Max. : 1.0000000
## NA's :3
## `Reason no cancer-directed surgery`Unknown; death certificate; or autopsy only (2003+)
## Min. :-0.316222
## 1st Qu.:-0.013675
## Median :-0.002965
## Mean : 0.025584
## 3rd Qu.: 0.008967
## Max. : 1.000000
## NA's :3
## `Survival months flag`Complete dates are available and there are more than 0 days of survival
## Min. :-0.883947
## 1st Qu.:-0.012528
## Median : 0.005244
## Mean :-0.013255
## 3rd Qu.: 0.017204
## Max. : 1.000000
## NA's :3
## `Survival months flag`Incomplete dates are available and there cannot be zero days of follow-up
## Min. :-0.883947
## 1st Qu.:-0.014050
## Median :-0.002561
## Mean : 0.001254
## 3rd Qu.: 0.004816
## Max. : 1.000000
## NA's :3
## `Survival months flag`Incomplete dates are available and there could be zero days of follow-up
## Min. :-0.124874
## 1st Qu.:-0.003584
## Median :-0.001239
## Mean : 0.012886
## 3rd Qu.: 0.002214
## Max. : 1.000000
## NA's :3
## `Survival months flag`Not calculated because a Death Certificate Only or Autopsy Only case
## Min. :-0.386749
## 1st Qu.:-0.014145
## Median :-0.003250
## Mean : 0.025055
## 3rd Qu.: 0.001901
## Max. : 1.000000
## NA's :3
## `Survival months` `First malignant primary indicator`Yes
## Min. :1 Min. :-0.121335
## 1st Qu.:1 1st Qu.:-0.009865
## Median :1 Median : 0.001646
## Mean :1 Mean : 0.015425
## 3rd Qu.:1 3rd Qu.: 0.013781
## Max. :1 Max. : 1.000000
## NA's :76 NA's :3
## `Total number of in situ/malignant tumors for patient`
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
## NA's :76
## `Total number of benign/borderline tumors for patient`
## Min. :-0.0152234
## 1st Qu.:-0.0032617
## Median :-0.0005528
## Mean : 0.0130747
## 3rd Qu.: 0.0019016
## Max. : 1.0000000
## NA's :3
## `Marital status at diagnosis`Married (including common law)
## Min. :-0.440177
## 1st Qu.:-0.032660
## Median :-0.006210
## Mean :-0.009718
## 3rd Qu.: 0.013420
## Max. : 1.000000
## NA's :3
## `Marital status at diagnosis`Separated
## Min. :-0.109798
## 1st Qu.:-0.004177
## Median :-0.000266
## Mean : 0.011281
## 3rd Qu.: 0.003666
## Max. : 1.000000
## NA's :3
## `Marital status at diagnosis`Single (never married)
## Min. :-0.440177
## 1st Qu.:-0.014297
## Median :-0.002190
## Mean : 0.005418
## 3rd Qu.: 0.017116
## Max. : 1.000000
## NA's :3
## `Marital status at diagnosis`Unknown
## Min. :-0.269781
## 1st Qu.:-0.009316
## Median :-0.001311
## Mean : 0.009429
## 3rd Qu.: 0.006888
## Max. : 1.000000
## NA's :3
## `Marital status at diagnosis`Unmarried or Domestic Partner
## Min. :-0.061342
## 1st Qu.:-0.004189
## Median :-0.000586
## Mean : 0.011305
## 3rd Qu.: 0.001826
## Max. : 1.000000
## NA's :3
## `Marital status at diagnosis`Widowed
## Min. :-0.432734
## 1st Qu.:-0.027195
## Median :-0.001960
## Mean : 0.007008
## 3rd Qu.: 0.018651
## Max. : 1.000000
## NA's :3
## `Median household income inflation adj to 2021`$40,000 - $44,999
## Min. :-0.1382091
## 1st Qu.:-0.0056870
## Median :-0.0007594
## Mean : 0.0128226
## 3rd Qu.: 0.0041608
## Max. : 1.0000000
## NA's :3
## `Median household income inflation adj to 2021`$45,000 - $49,999
## Min. :-0.1682857
## 1st Qu.:-0.0069108
## Median :-0.0009922
## Mean : 0.0119225
## 3rd Qu.: 0.0053678
## Max. : 1.0000000
## NA's :3
## `Median household income inflation adj to 2021`$50,000 - $54,999
## Min. :-0.1791432
## 1st Qu.:-0.0069663
## Median :-0.0009649
## Mean : 0.0100901
## 3rd Qu.: 0.0055250
## Max. : 1.0000000
## NA's :3
## `Median household income inflation adj to 2021`$55,000 - $59,999
## Min. :-0.221090
## 1st Qu.:-0.005620
## Median :-0.000249
## Mean : 0.006952
## 3rd Qu.: 0.005061
## Max. : 1.000000
## NA's :3
## `Median household income inflation adj to 2021`$60,000 - $64,999
## Min. :-0.302908
## 1st Qu.:-0.008606
## Median :-0.001320
## Mean :-0.002785
## 3rd Qu.: 0.003066
## Max. : 1.000000
## NA's :3
## `Median household income inflation adj to 2021`$65,000 - $69,999
## Min. :-0.308737
## 1st Qu.:-0.012660
## Median :-0.001992
## Mean :-0.005016
## 3rd Qu.: 0.004584
## Max. : 1.000000
## NA's :3
## `Median household income inflation adj to 2021`$70,000 - $74,999
## Min. :-0.2538036
## 1st Qu.:-0.0053899
## Median :-0.0016729
## Mean :-0.0009235
## 3rd Qu.: 0.0020979
## Max. : 1.0000000
## NA's :3
## `Median household income inflation adj to 2021`$75,000+
## Min. :-0.308737
## 1st Qu.:-0.020431
## Median :-0.005049
## Mean :-0.016574
## 3rd Qu.: 0.003574
## Max. : 1.000000
## NA's :3
## `Median household income inflation adj to 2021`< $35,000
## Min. :-0.070337
## 1st Qu.:-0.005079
## Median :-0.001380
## Mean : 0.014739
## 3rd Qu.: 0.005036
## Max. : 1.000000
## NA's :3
## `Median household income inflation adj to 2021`Unknown/missing/no match/Not 1990-2021
## Min. :-0.0127447
## 1st Qu.:-0.0034638
## Median :-0.0012849
## Mean : 0.0267364
## 3rd Qu.: 0.0009734
## Max. : 1.0000000
## NA's :3
## `Rural-Urban Continuum Code`Counties in metropolitan areas of 250,000 to 1 million pop
## Min. :-0.1432295
## 1st Qu.:-0.0074650
## Median :-0.0002018
## Mean : 0.0091835
## 3rd Qu.: 0.0038568
## Max. : 1.0000000
## NA's :3
## `Rural-Urban Continuum Code`Counties in metropolitan areas of lt 250 thousand pop
## Min. :-0.1717685
## 1st Qu.:-0.0050985
## Median :-0.0001767
## Mean : 0.0126107
## 3rd Qu.: 0.0058588
## Max. : 1.0000000
## NA's :3
## `Rural-Urban Continuum Code`Nonmetropolitan counties adjacent to a metropolitan area
## Min. :-0.1531643
## 1st Qu.:-0.0064625
## Median :-0.0008947
## Mean : 0.0127534
## 3rd Qu.: 0.0065129
## Max. : 1.0000000
## NA's :3
## `Rural-Urban Continuum Code`Nonmetropolitan counties not adjacent to a metropolitan area
## Min. :-0.1543301
## 1st Qu.:-0.0077561
## Median :-0.0005505
## Mean : 0.0142939
## 3rd Qu.: 0.0078590
## Max. : 1.0000000
## NA's :3
## `Rural-Urban Continuum Code`Unknown/missing/no match (Alaska or Hawaii - Entire State)
## Min. :-0.0678178
## 1st Qu.:-0.0038102
## Median :-0.0002637
## Mean : 0.0120203
## 3rd Qu.: 0.0021074
## Max. : 1.0000000
## NA's :3
## `Rural-Urban Continuum Code`Unknown/missing/no match/Not 1990-2021
## Min. :-0.0127447
## 1st Qu.:-0.0034638
## Median :-0.0012849
## Mean : 0.0267364
## 3rd Qu.: 0.0009734
## Max. : 1.0000000
## NA's :3
## `Age recode (<60,60-69,70+)`05-09 years
## Min. :-0.0027197
## 1st Qu.:-0.0008100
## Median :-0.0002830
## Mean : 0.0136204
## 3rd Qu.:-0.0000415
## Max. : 1.0000000
## NA's :3
## `Age recode (<60,60-69,70+)`10-14 years
## Min. :-0.0050171
## 1st Qu.:-0.0008100
## Median :-0.0003417
## Mean : 0.0134044
## 3rd Qu.:-0.0000665
## Max. : 1.0000000
## NA's :3
## `Age recode (<60,60-69,70+)`15-19 years
## Min. :-0.0070476
## 1st Qu.:-0.0018281
## Median :-0.0006534
## Mean : 0.0134773
## 3rd Qu.: 0.0000811
## Max. : 1.0000000
## NA's :3
## `Age recode (<60,60-69,70+)`20-24 years
## Min. :-0.018979
## 1st Qu.:-0.003199
## Median :-0.001403
## Mean : 0.012949
## 3rd Qu.: 0.001502
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`25-29 years
## Min. :-0.027433
## 1st Qu.:-0.005254
## Median :-0.001847
## Mean : 0.011652
## 3rd Qu.: 0.002353
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`30-34 years
## Min. :-0.046406
## 1st Qu.:-0.007629
## Median :-0.001896
## Mean : 0.009909
## 3rd Qu.: 0.002237
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`35-39 years
## Min. :-0.067571
## 1st Qu.:-0.009978
## Median :-0.002013
## Mean : 0.007331
## 3rd Qu.: 0.004198
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`40-44 years
## Min. :-0.098876
## 1st Qu.:-0.016888
## Median :-0.005596
## Mean : 0.002280
## 3rd Qu.: 0.003749
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`45-49 years
## Min. :-0.125622
## 1st Qu.:-0.017717
## Median :-0.006664
## Mean :-0.001411
## 3rd Qu.: 0.004519
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`50-54 years
## Min. :-0.141961
## 1st Qu.:-0.016959
## Median :-0.003925
## Mean :-0.002563
## 3rd Qu.: 0.003557
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`55-59 years
## Min. :-0.148041
## 1st Qu.:-0.015581
## Median :-0.001369
## Mean :-0.003108
## 3rd Qu.: 0.002133
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`60-64 years
## Min. :-0.1569887
## 1st Qu.:-0.0139543
## Median :-0.0007929
## Mean :-0.0045549
## 3rd Qu.: 0.0037404
## Max. : 1.0000000
## NA's :3
## `Age recode (<60,60-69,70+)`65-69 years
## Min. :-0.156989
## 1st Qu.:-0.018456
## Median :-0.004416
## Mean :-0.006050
## 3rd Qu.: 0.002814
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`70-74 years
## Min. :-0.136706
## 1st Qu.:-0.018582
## Median :-0.002493
## Mean :-0.003747
## 3rd Qu.: 0.004444
## Max. : 1.000000
## NA's :3
## `Age recode (<60,60-69,70+)`75-79 years
## Min. :-0.1299541
## 1st Qu.:-0.0192648
## Median :-0.0017527
## Mean :-0.0004472
## 3rd Qu.: 0.0042832
## Max. : 1.0000000
## NA's :3
## `Age recode (<60,60-69,70+)`80-84 years `Age recode (<60,60-69,70+)`85+ years
## Min. :-0.1497345 Min. :-0.185038
## 1st Qu.:-0.0260547 1st Qu.:-0.032765
## Median :-0.0009538 Median :-0.002749
## Mean : 0.0030973 Mean : 0.009643
## 3rd Qu.: 0.0066626 3rd Qu.: 0.005992
## Max. : 1.0000000 Max. : 1.000000
## NA's :3 NA's :3
## Radiation.Yes COD
## Min. :-0.18043 Min. :-0.274122
## 1st Qu.:-0.02974 1st Qu.:-0.028538
## Median :-0.00240 Median : 0.003475
## Mean : 0.00561 Mean : 0.019223
## 3rd Qu.: 0.01477 3rd Qu.: 0.028403
## Max. : 1.00000 Max. : 1.000000
## NA's :3 NA's :3
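In the summary above, three columns (`Months from diagnosis to treatment`, `Survival months`, and `Total number of in situ/malignant tumors for patient`) show only the diagonal value 1 and `NA's :76`. A plausible cause is that these columns contain missing values, because `cor()` defaults to `use = "everything"` and returns `NA` for any pair involving a column with an `NA`. The sketch below illustrates the effect on a small made-up data frame (the column names and values are hypothetical, not SEER data) and shows how `use = "pairwise.complete.obs"` would recover a correlation from the rows that are observed.

```r
# Toy data (hypothetical values): one predictor has a missing entry,
# as may be the case for columns such as `Survival months`.
toy <- data.frame(
  survival_months = c(12, 30, NA, 48, 60, 7),
  age_85_plus     = c(1, 0, 0, 0, 0, 1),
  COD             = c(1, 0, 0, 0, 0, 1)
)

# Default (use = "everything"): a single NA makes the correlation NA.
cor_default <- cor(toy)[, "COD"]

# Pairwise deletion: each pair of columns uses the rows where both are observed.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")[, "COD"]

print(cor_default)
print(cor_pairwise)
```

Applying the same `use` argument to `cor(encoded_data)` would be one way to obtain a correlation between `Survival months` and `COD`. Pairwise deletion computes each coefficient on a different subset of rows, so the resulting values are not strictly comparable with one another.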
# Print correlation with COD
print(correlation_with_COD)
## `Race recode (W, B, AI, API)`Asian or Pacific Islander
## -0.0545190384
## `Race recode (W, B, AI, API)`Black
## 0.0469532442
## `Race recode (W, B, AI, API)`Unknown
## -0.0293993259
## `Race recode (W, B, AI, API)`White
## 0.0074658579
## `Primary Site - labeled`C50.1-Central portion of breast
## 0.0250324542
## `Primary Site - labeled`C50.2-Upper-inner quadrant of breast
## -0.0331489825
## `Primary Site - labeled`C50.3-Lower-inner quadrant of breast
## -0.0090667775
## `Primary Site - labeled`C50.4-Upper-outer quadrant of breast
## -0.0443648371
## `Primary Site - labeled`C50.5-Lower-outer quadrant of breast
## -0.0175945778
## `Primary Site - labeled`C50.6-Axillary tail of breast
## 0.0043188952
## `Primary Site - labeled`C50.8-Overlapping lesion of breast
## -0.0110631898
## `Primary Site - labeled`C50.9-Breast, NOS
## 0.1016503665
## `Grade Recode (thru 2017)`Poorly differentiated; Grade III
## 0.0271874011
## `Grade Recode (thru 2017)`Undifferentiated; anaplastic; Grade IV
## 0.0094420265
## `Grade Recode (thru 2017)`Unknown
## 0.0968239858
## `Grade Recode (thru 2017)`Well differentiated; Grade I
## -0.0623106812
## Laterality.Only one side - side unspecified
## 0.0200049734
## Laterality.Paired site, but no information concerning laterality
## 0.1028374128
## Laterality.Right - origin of primary
## -0.0181205661
## `Chemotherapy recode (yes, no/unk)`Yes
## -0.0921253574
## `Months from diagnosis to treatment`
## NA
## `Reason no cancer-directed surgery`Not recommended
## 0.2220449034
## `Reason no cancer-directed surgery`Not recommended, contraindicated due to other cond; autopsy only (1973-2002)
## 0.0989504823
## `Reason no cancer-directed surgery`Recommended but not performed, patient refused
## 0.0884305414
## `Reason no cancer-directed surgery`Recommended but not performed, unknown reason
## 0.0403204344
## `Reason no cancer-directed surgery`Recommended, unknown if performed
## 0.0114951426
## `Reason no cancer-directed surgery`Surgery performed
## -0.2741215608
## `Reason no cancer-directed surgery`Unknown; death certificate; or autopsy only (2003+)
## 0.0839598865
## `Survival months flag`Complete dates are available and there are more than 0 days of survival
## -0.0490960286
## `Survival months flag`Incomplete dates are available and there cannot be zero days of follow-up
## 0.0166136901
## `Survival months flag`Incomplete dates are available and there could be zero days of follow-up
## 0.0165572269
## `Survival months flag`Not calculated because a Death Certificate Only or Autopsy Only case
## 0.0787841243
## `Survival months`
## NA
## `First malignant primary indicator`Yes
## -0.1213346094
## `Total number of in situ/malignant tumors for patient`
## NA
## `Total number of benign/borderline tumors for patient`
## 0.0096744569
## `Marital status at diagnosis`Married (including common law)
## -0.1738909087
## `Marital status at diagnosis`Separated
## -0.0040996820
## `Marital status at diagnosis`Single (never married)
## 0.0003130660
## `Marital status at diagnosis`Unknown
## 0.0303384662
## `Marital status at diagnosis`Unmarried or Domestic Partner
## -0.0119834990
## `Marital status at diagnosis`Widowed
## 0.2258045820
## `Median household income inflation adj to 2021`$40,000 - $44,999
## 0.0288158876
## `Median household income inflation adj to 2021`$45,000 - $49,999
## 0.0293692229
## `Median household income inflation adj to 2021`$50,000 - $54,999
## 0.0232834837
## `Median household income inflation adj to 2021`$55,000 - $59,999
## 0.0119175068
## `Median household income inflation adj to 2021`$60,000 - $64,999
## 0.0085555964
## `Median household income inflation adj to 2021`$65,000 - $69,999
## -0.0113267709
## `Median household income inflation adj to 2021`$70,000 - $74,999
## 0.0021964742
## `Median household income inflation adj to 2021`$75,000+
## -0.0518348420
## `Median household income inflation adj to 2021`< $35,000
## 0.0183133390
## `Median household income inflation adj to 2021`Unknown/missing/no match/Not 1990-2021
## -0.0018601995
## `Rural-Urban Continuum Code`Counties in metropolitan areas of 250,000 to 1 million pop
## 0.0054201133
## `Rural-Urban Continuum Code`Counties in metropolitan areas of lt 250 thousand pop
## 0.0164868975
## `Rural-Urban Continuum Code`Nonmetropolitan counties adjacent to a metropolitan area
## 0.0284308358
## `Rural-Urban Continuum Code`Nonmetropolitan counties not adjacent to a metropolitan area
## 0.0283202178
## `Rural-Urban Continuum Code`Unknown/missing/no match (Alaska or Hawaii - Entire State)
## 0.0026305237
## `Rural-Urban Continuum Code`Unknown/missing/no match/Not 1990-2021
## -0.0018601995
## `Age recode (<60,60-69,70+)`05-09 years
## -0.0013753077
## `Age recode (<60,60-69,70+)`10-14 years
## -0.0013753077
## `Age recode (<60,60-69,70+)`15-19 years
## -0.0008190567
## `Age recode (<60,60-69,70+)`20-24 years
## -0.0017825828
## `Age recode (<60,60-69,70+)`25-29 years
## -0.0118417779
## `Age recode (<60,60-69,70+)`30-34 years
## -0.0259548035
## `Age recode (<60,60-69,70+)`35-39 years
## -0.0423991343
## `Age recode (<60,60-69,70+)`40-44 years
## -0.0751335018
## `Age recode (<60,60-69,70+)`45-49 years
## -0.0999972407
## `Age recode (<60,60-69,70+)`50-54 years
## -0.0950419640
## `Age recode (<60,60-69,70+)`55-59 years
## -0.0788618264
## `Age recode (<60,60-69,70+)`60-64 years
## -0.0661892685
## `Age recode (<60,60-69,70+)`65-69 years
## -0.0379863553
## `Age recode (<60,60-69,70+)`70-74 years
## 0.0251230157
## `Age recode (<60,60-69,70+)`75-79 years
## 0.1002552651
## `Age recode (<60,60-69,70+)`80-84 years
## 0.1861215456
## `Age recode (<60,60-69,70+)`85+ years
## 0.3114860645
## Radiation.Yes
## -0.1573241890
## COD
## 1.0000000000
# Exclude "COD" column from model matrix and encode factors
encoded_data <- predict(dummyVars(" ~ .", data = BREAST_DF_surv_clean[, -cod_column_index], fullRank = TRUE), newdata = BREAST_DF_surv_clean)
# Alternatively, using ggplot
correlation_df <- data.frame(variable = colnames(correlation_matrix), correlation = correlation_with_COD)
# Plot the correlations as bar charts, one slice of variables at a time
ggplot(correlation_df[1:19, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1,
                                   size = 7)) + # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25)) # Wrap text
ggplot(correlation_df[20:39, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1,
                                   size = 7)) + # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25)) # Wrap text
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).
ggplot(correlation_df[40:59, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1,
                                   size = 7)) + # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25)) # Wrap text
ggplot(correlation_df[60:77, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1,
                                   size = 7)) + # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25)) # Wrap text
To work with this database, the categorical data (factors) must be converted to numerical variables. I used one-hot encoding for this. Target encoding may be the better method for this analysis, but I decided not to apply it because of its complexity and time constraints [1,2].
In general, the machine learning phase consists of four main steps:
Encode categorical variables.
Split the data into training and testing sets.
Train the models.
Evaluate the models.
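The four steps above can be sketched end to end on simulated data. This is only an illustration in base R: the data frame `toy`, its columns, and the coefficients used to simulate the outcome are hypothetical, and a plain random split stands in for the stratified split that caret's createDataPartition() provides.

```r
set.seed(1)  # reproducible toy data

# Hypothetical toy data: one factor, one numeric predictor, binary outcome
n <- 400
toy <- data.frame(
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  age   = round(runif(n, 30, 90))
)
# Simulated outcome: the probability of the event rises with age
toy$event <- rbinom(n, 1, plogis(-6 + 0.1 * toy$age))

# 1. Encode categorical variables (dummy columns, first level as reference)
x <- model.matrix(~ group + age, data = toy)[, -1]
encoded <- data.frame(x, event = toy$event)

# 2. Split into training (70%) and testing (30%) sets
train_rows <- sample(seq_len(n), size = round(0.7 * n))
train <- encoded[train_rows, ]
test  <- encoded[-train_rows, ]

# 3. Train a model (logistic regression as the simplest case)
fit <- glm(event ~ ., data = train, family = binomial)

# 4. Evaluate on the held-out data
pred_class <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
accuracy <- mean(pred_class == test$event)
print(accuracy)
```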
Target encoding, also known as mean encoding or likelihood encoding, is a technique for converting categorical variables into numerical values based on the target variable. It replaces each category with the mean (or some other summary statistic) of the target variable for that category. The caret package used in this project provides one-hot encoding through dummyVars(); target encoding has to be computed separately or with a dedicated package.
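To make the idea concrete, here is a minimal base R sketch of mean target encoding. The data frame `toy` and its columns `stage` and `alive` are hypothetical and are not taken from the SEER data. In practice the category means would be computed on the training set only, so that information from the test set does not leak into the encoding.

```r
# Hypothetical toy data: one categorical predictor and a binary target
toy <- data.frame(
  stage = c("I", "I", "II", "II", "II", "III", "III", "III"),
  alive = c(1, 1, 1, 0, 1, 0, 0, 1)
)

# Mean of the target within each category
category_means <- tapply(toy$alive, toy$stage, mean)

# Replace each category with its mean target value
toy$stage_encoded <- as.numeric(category_means[as.character(toy$stage)])

print(category_means)
print(toy)
```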
One-hot encoding is a technique used in classification tasks to represent categorical variables, such as alive or deceased in the case of survival analysis, as binary vectors. In R, this is achieved by converting each category into a binary vector where each element corresponds to a category, with a value of 1 indicating the presence of the category and 0 otherwise. This allows machine learning algorithms to effectively interpret and utilize categorical data in predictive models.
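As a small illustration, one-hot encoding can be reproduced in base R with model.matrix(). The factor `grade` is a hypothetical example, not a SEER column. The second call drops the first level as a reference category, which is similar in spirit to calling dummyVars() with fullRank = TRUE.

```r
# Hypothetical toy factor with three levels
toy <- data.frame(grade = factor(c("low", "high", "medium", "low")))

# "- 1" removes the intercept, so every level gets its own 0/1 column
one_hot <- model.matrix(~ grade - 1, data = toy)
print(one_hot)

# With an intercept, the first level becomes the reference and is dropped
full_rank <- model.matrix(~ grade, data = toy)[, -1, drop = FALSE]
print(full_rank)
```

Dropping one level avoids perfectly collinear columns, which matters for models such as logistic regression.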
Random Forest (rf): Random forest is a popular machine learning algorithm that can be adapted for survival analysis. It constructs a multitude of decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
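The aggregation step can be sketched without fitting any trees. The votes and tree outputs below are made-up values for illustration only; a real forest would produce them from its fitted trees.

```r
# Hypothetical class votes from five trees (columns) for four patients (rows)
votes <- matrix(
  c("Alive",    "Alive",    "Deceased", "Alive",    "Alive",
    "Deceased", "Deceased", "Alive",    "Deceased", "Deceased",
    "Alive",    "Deceased", "Alive",    "Alive",    "Deceased",
    "Deceased", "Deceased", "Deceased", "Deceased", "Alive"),
  nrow = 4, byrow = TRUE
)

# Classification: the forest predicts the most frequent class (the mode)
majority_vote <- apply(votes, 1, function(v) names(which.max(table(v))))
print(majority_vote)

# The share of trees voting "Deceased" can be read as a class probability
prob_deceased <- rowMeans(votes == "Deceased")
print(prob_deceased)

# Regression: the forest predicts the mean of the individual tree outputs
tree_outputs <- c(58, 61, 64, 60, 57)  # hypothetical survival months
print(mean(tree_outputs))
```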
Logistic Regression (glm): Logistic regression, a standard technique for modeling binary outcomes, is used in this project to model the relationship between various prognostic factors and the probability of survival or death in breast cancer patients.
Deep Neural Network (DNN): A deep neural network is a powerful machine learning model that can learn complex patterns in data to classify individuals as either alive or deceased. In R, DNNs can be implemented using packages like keras, which provide a flexible framework for building and training deep learning models tailored to specific datasets.
BREAST_DF_surv_clean_no_missing <- na.omit(BREAST_DF_surv_clean)
# Reduce the problem to a binary outcome (Alive vs. Breast) by removing the other causes of death; a binomial target is easier to tackle
# Recode the factor as numbers: 1 for "Alive" and 0 for "Breast"
# Remove "Other" from the COD column
BREAST_DF_surv_clean_no_missing_bi <- BREAST_DF_surv_clean_no_missing[BREAST_DF_surv_clean_no_missing$COD != "Other", ]
# Replace remaining categories with numerical values
#BREAST_DF_surv_clean_no_missing_bi$COD <- as.numeric(factor(BREAST_DF_surv_clean_no_missing_bi$COD, levels = c("Alive", "Breast")))
BREAST_DF_surv_clean_no_missing_bi$COD <- ifelse(BREAST_DF_surv_clean_no_missing_bi$COD == "Alive", 1, 0)
BREAST_DF_surv_clean_no_missing_bi$COD <- as.factor(BREAST_DF_surv_clean_no_missing_bi$COD)
# Convert to binomial distribution
#model_rf <- randomForest(COD ~ ., data = BREAST_DF_surv_clean_no_missing_bi, type = "response", ntree = 100)
# Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean_no_missing_bi) == "COD")
# Exclude "COD" column from the data
data_without_cod <- BREAST_DF_surv_clean_no_missing_bi[, -cod_column_index]
# Perform one-hot encoding
encoded_data <- dummyVars(" ~ .", data = data_without_cod)
# Create the design matrix with encoded data
design_matrix <- predict(encoded_data, newdata = data_without_cod)
design_matrix <- data.frame(design_matrix)
# Add the target variable (COD) back to the design matrix
design_matrix <- cbind(design_matrix, COD = BREAST_DF_surv_clean_no_missing_bi$COD)
design_matrix$COD <- factor(design_matrix$COD)
# Split the data into training and testing sets
set.seed(123) # for reproducibility
train_indices <- createDataPartition(design_matrix$COD, p = 0.7, list = FALSE)
train_data <- design_matrix[train_indices, ]
test_data <- design_matrix[-train_indices, ]
Random Forests are a powerful machine learning technique that can be applied to tasks like predicting patient survival in cancer cases. Rather than relying on a single decision tree, a Random Forest builds a multitude of them (the "forest"). Each tree is trained on a bootstrap sample of the data (rows drawn at random with replacement) and considers a random selection of features at each split.
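The two sources of randomness described above can be sketched in a few lines of base R. The row and feature counts are hypothetical. The only package-specific fact used here is that randomForest considers floor(sqrt(p)) features per split by default for classification.

```r
set.seed(42)

n_rows <- 1000     # hypothetical number of training rows
n_features <- 16   # hypothetical number of predictors

# Each tree sees a bootstrap sample: n_rows draws with replacement
boot_rows <- sample(n_rows, size = n_rows, replace = TRUE)

# Rows never drawn are "out-of-bag" for this tree (about 37% on average)
oob_fraction <- 1 - length(unique(boot_rows)) / n_rows
print(oob_fraction)

# At each split only a random subset of the features is considered
mtry <- floor(sqrt(n_features))
split_candidates <- sample(n_features, size = mtry)
print(split_candidates)
```

The out-of-bag rows give each tree its own held-out data, which is how a Random Forest can report an error estimate without a separate validation set.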
# Fit the Random Forest model
model_rf <- randomForest(COD ~ ., data = train_data, type = "prob")
# Make predictions on the test set
predictions_rf <- predict(model_rf, newdata = test_data)
# Evaluate the model
conf_matrix <- confusionMatrix(predictions_rf, test_data$COD)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6970 1479
## 1 2591 65338
##
## Accuracy : 0.9467
## 95% CI : (0.9451, 0.9483)
## No Information Rate : 0.8748
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7439
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.72900
## Specificity : 0.97786
## Pos Pred Value : 0.82495
## Neg Pred Value : 0.96186
## Prevalence : 0.12518
## Detection Rate : 0.09126
## Detection Prevalence : 0.11062
## Balanced Accuracy : 0.85343
##
## 'Positive' Class : 0
##
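The statistics reported by confusionMatrix() can be recomputed by hand from the four counts in the table above, which makes it clearer what each number means. The variable names are my own. Note that caret treats class 0 (died of breast cancer) as the positive class here.

```r
# Counts copied from the confusion matrix printed above
# (rows = predicted class, columns = reference class; 0 = Breast, 1 = Alive)
cm <- matrix(c(6970, 2591, 1479, 65338), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s that were predicted 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s that were predicted 1

# No-information rate: accuracy of always predicting the majority class
no_info_rate <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, no_info_rate = no_info_rate,
              kappa = cohen_kappa, sensitivity = sensitivity,
              specificity = specificity), 4))
```

Because 87.5% of the test cases are alive, the accuracy of 94.7% is best read against the no-information rate rather than against 50%.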
# Plot confusion matrix as a heatmap
conf_table <- as.table(conf_matrix$table)
heatmap(conf_table,
        Colv = NA,
        Rowv = NA,
        col = cm.colors(12),
        scale = "column",
        margins = c(10, 10),
        xlab = "True Class",      # Columns of the table hold the reference classes
        ylab = "Predicted Class", # Rows of the table hold the predictions
        main = "Confusion Matrix Heatmap")
# Heatmap
heatmap_data <- as.data.frame(as.table(conf_matrix))
heatmap <- ggplot(heatmap_data, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) + # Add text labels
  ggtitle("Random Forest Predictive Model") + # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(conf_matrix$overall["Accuracy"]))) + # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5)) # Center subtitle
print(heatmap)
# Get predicted probabilities for each class (ensure type="prob" is used)
predictions_rf_probs <- predict(model_rf, test_data, type = "prob")
# Extract true class labels and convert them to factor
true_class <- as.factor(test_data$COD)
# Convert factor predictions to ordered factors
predictions_order <- ordered(as.numeric(predictions_rf) - 1, levels = c(0, 1))
# Create ROC curve
roc_curve <- roc(true_class, predictions_rf_probs[, "1"])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve (legacy.axes = TRUE puts 1 - specificity on the x-axis, matching the label)
plot(roc_curve, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", legacy.axes = TRUE, main = "ROC Curve", xlab = "1 - Specificity", ylab = "Sensitivity")
Logistic regression is a statistical model used to analyze the relationship between a binary outcome variable and one or more independent variables. It estimates the probability of the outcome variable being in a particular category (usually coded as 0 or 1) based on the values of the independent variables. The model employs the logistic function to constrain the predicted probabilities between 0 and 1, making it suitable for binary classification tasks like survival/death analyses in our case. In R, logistic regression can be implemented using the glm() function with a binomial family distribution.
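As a small worked example of the logistic function, the sketch below turns log-odds into probabilities and then into class labels. The intercept, slope, and ages are hypothetical values chosen for illustration, not coefficients from the fitted model.

```r
# Hypothetical coefficients on the log-odds scale
intercept <- 4.0
slope_age <- -0.05

age <- c(40, 60, 70, 100)
log_odds <- intercept + slope_age * age

# The logistic function maps any real number into the interval (0, 1)
prob_alive <- 1 / (1 + exp(-log_odds))

# plogis() is the built-in equivalent of the line above
prob_alive_builtin <- plogis(log_odds)

# A 0.5 threshold converts probabilities into class labels
predicted_class <- ifelse(prob_alive > 0.5, 1, 0)
print(data.frame(age, log_odds, prob_alive, predicted_class))
```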
# Train the logistic regression model
logistic_model <- glm(COD ~ ., data = train_data, family = binomial)
# Make predictions on the test set
predictions_logistic <- predict(logistic_model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# Convert predicted probabilities to class labels
predicted_class <- ifelse(predictions_logistic > 0.5, 1, 0)
# Evaluate the model
confusion_matrix <- table(predicted_class, test_data$COD)
print(confusion_matrix)
##
## predicted_class 0 1
## 0 5755 1538
## 1 3806 65279
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.9300322082275"
# Plot the confusion matrix as a heatmap
heatmap(confusion_matrix,
        Colv = NA,
        Rowv = NA,
        col = cm.colors(12), # Color palette for heatmap
        scale = "column", # Scale values within each column (true class)
        margins = c(10, 10), # Add extra space for row and column names
        xlab = "True Class", # Columns of the table hold the true classes
        ylab = "Predicted Class", # Rows of the table hold the predictions
        main = "Confusion Matrix Heatmap")
# Heatmap
heatmap_data <- as.data.frame(as.table(confusion_matrix))
heatmap <- ggplot(heatmap_data, aes(x = predicted_class, y = Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) + # Add text labels
  ggtitle("Logistic Regression Predictive Model") + # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(accuracy))) + # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5)) # Center subtitle
print(heatmap)
# Calculate AUC ROC
roc_curve <- roc(test_data$COD, predictions_logistic)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(roc_curve)
##
## Call:
## roc.default(response = test_data$COD, predictor = predictions_logistic)
##
## Data: predictions_logistic in 9561 controls (test_data$COD 0) < 66817 cases (test_data$COD 1).
## Area under the curve: 0.9291
# Plot the ROC curve
plot(roc_curve, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", main = "ROC Curve")
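# Prepare data
# Find the indices of the "COD" and "Survival months" columns (%in% matches each name; == would recycle the vector)
cod_column_index_1 <- which(names(BREAST_DF_surv_clean_no_missing) %in% c("COD", "Survival months"))
# Exclude the "COD" and "Survival months" columns from the predictors
#data_without_cod <- BREAST_DF_surv_clean[, -cod_column_index]
data_without_cod_1 <- BREAST_DF_surv_clean_no_missing[, -cod_column_index_1]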
# Perform one-hot encoding
encoded_data_1 <- dummyVars(" ~ .", data = data_without_cod_1)
# Create the design matrix with encoded data
design_matrix_1 <- predict(encoded_data_1, newdata = data_without_cod_1)
# Add the target variable (Survival months and status) back to the design matrix
design_matrix_1 <- cbind(design_matrix_1,
Time = BREAST_DF_surv_clean_no_missing$`Survival months`,
Status = BREAST_DF_surv_clean_no_missing$COD)
design_matrix_1 <- data.frame(design_matrix_1)
# Split the data into training and testing sets
set.seed(123) # for reproducibility
train_indices_1 <- createDataPartition(design_matrix_1$Status, p = 0.7, list = FALSE)
# Use the indices created for this design matrix, not those from the binary data set
train_data_1 <- design_matrix_1[train_indices_1, ]
test_data_1 <- design_matrix_1[-train_indices_1, ]
A deep neural network for survival analysis is a powerful machine learning model capable of capturing complex patterns in survival data to predict the likelihood of an event (e.g., death) occurring over a given period. In binary classification tasks such as alive/deceased outcomes, a deep neural network consists of multiple layers of interconnected nodes (neurons) that process the input features to predict the probability that an individual experiences the event of interest. Such networks come in various architectures, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs); the model used here is a simple feed-forward network of fully connected (dense) layers. They are trained with optimization algorithms such as stochastic gradient descent (SGD) or Adam to minimize prediction errors. In R, deep neural networks can be implemented using libraries like keras or tensorflow, allowing for flexible modeling and customization.
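To show what such a network computes, here is a minimal base R sketch of one forward pass through dense layers of 64, 32, and 1 units. The weights are random and untrained, and the number of input features and the three input rows are hypothetical, so the output probabilities carry no meaning beyond illustrating the mechanics.

```r
set.seed(7)

relu    <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))

n_inputs <- 10                              # hypothetical number of encoded features
x <- matrix(rnorm(3 * n_inputs), nrow = 3)  # three hypothetical patients

# Randomly initialised weights and zero biases (untrained, illustration only)
w1 <- matrix(rnorm(n_inputs * 64, sd = 0.1), nrow = n_inputs)
b1 <- rep(0, 64)
w2 <- matrix(rnorm(64 * 32, sd = 0.1), nrow = 64)
b2 <- rep(0, 32)
w3 <- matrix(rnorm(32, sd = 0.1), nrow = 32)
b3 <- 0

# Forward pass: dense(64, relu) -> dense(32, relu) -> dense(1, sigmoid)
h1 <- relu(sweep(x %*% w1, 2, b1, "+"))
h2 <- relu(sweep(h1 %*% w2, 2, b2, "+"))
p  <- sigmoid(h2 %*% w3 + b3)
print(p)
```

Training adjusts the weights and biases so that these output probabilities move toward the observed 0/1 labels.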
# Load required libraries
library(keras)
library(survival)
library(survMisc) # For cindex() function
##
## Attaching package: 'survMisc'
## The following object is masked from 'package:pROC':
##
## ci
## The following object is masked from 'package:R.utils':
##
## asLong
## The following object is masked from 'package:ggplot2':
##
## autoplot
library(reticulate)
#use_python("C:/Users/kohya/AppData/Local/Programs/Python/Python37")
# Define the neural network architecture
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(train_data) - 1) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
# Compile the model
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_adam(),
  metrics = c("accuracy")
)
# Train the model
history <- model %>% fit(
  x = as.matrix(train_data[, -ncol(train_data)]), # Features
  y = as.numeric(train_data$COD) - 1, # Target variable (convert to 0-based index)
  epochs = 100,
  batch_size = 32,
  validation_split = 0.2
)
## Epoch 1/100
## 4456/4456 - 11s - loss: 0.1832 - accuracy: 0.9373 - val_loss: 0.1733 - val_accuracy: 0.9429 - 11s/epoch - 2ms/step
## Epoch 2/100
## 4456/4456 - 10s - loss: 0.1690 - accuracy: 0.9435 - val_loss: 0.1695 - val_accuracy: 0.9466 - 10s/epoch - 2ms/step
## Epoch 3/100
## 4456/4456 - 9s - loss: 0.1656 - accuracy: 0.9446 - val_loss: 0.1676 - val_accuracy: 0.9469 - 9s/epoch - 2ms/step
## Epoch 4/100
## 4456/4456 - 11s - loss: 0.1637 - accuracy: 0.9457 - val_loss: 0.1649 - val_accuracy: 0.9466 - 11s/epoch - 3ms/step
## Epoch 5/100
## 4456/4456 - 10s - loss: 0.1620 - accuracy: 0.9463 - val_loss: 0.1655 - val_accuracy: 0.9467 - 10s/epoch - 2ms/step
## Epoch 6/100
## 4456/4456 - 11s - loss: 0.1612 - accuracy: 0.9469 - val_loss: 0.1673 - val_accuracy: 0.9467 - 11s/epoch - 2ms/step
## Epoch 7/100
## 4456/4456 - 10s - loss: 0.1603 - accuracy: 0.9468 - val_loss: 0.1718 - val_accuracy: 0.9457 - 10s/epoch - 2ms/step
## Epoch 8/100
## 4456/4456 - 10s - loss: 0.1597 - accuracy: 0.9471 - val_loss: 0.1644 - val_accuracy: 0.9473 - 10s/epoch - 2ms/step
## Epoch 9/100
## 4456/4456 - 10s - loss: 0.1591 - accuracy: 0.9473 - val_loss: 0.1637 - val_accuracy: 0.9479 - 10s/epoch - 2ms/step
## Epoch 10/100
## 4456/4456 - 10s - loss: 0.1584 - accuracy: 0.9474 - val_loss: 0.1642 - val_accuracy: 0.9480 - 10s/epoch - 2ms/step
## Epoch 11/100
## 4456/4456 - 10s - loss: 0.1576 - accuracy: 0.9480 - val_loss: 0.1660 - val_accuracy: 0.9474 - 10s/epoch - 2ms/step
## Epoch 12/100
## 4456/4456 - 11s - loss: 0.1575 - accuracy: 0.9479 - val_loss: 0.1644 - val_accuracy: 0.9472 - 11s/epoch - 2ms/step
## Epoch 13/100
## 4456/4456 - 10s - loss: 0.1568 - accuracy: 0.9487 - val_loss: 0.1671 - val_accuracy: 0.9462 - 10s/epoch - 2ms/step
## Epoch 14/100
## 4456/4456 - 10s - loss: 0.1564 - accuracy: 0.9487 - val_loss: 0.1653 - val_accuracy: 0.9486 - 10s/epoch - 2ms/step
## Epoch 15/100
## 4456/4456 - 10s - loss: 0.1557 - accuracy: 0.9487 - val_loss: 0.1674 - val_accuracy: 0.9473 - 10s/epoch - 2ms/step
## Epoch 16/100
## 4456/4456 - 10s - loss: 0.1552 - accuracy: 0.9492 - val_loss: 0.1641 - val_accuracy: 0.9485 - 10s/epoch - 2ms/step
## Epoch 17/100
## 4456/4456 - 10s - loss: 0.1547 - accuracy: 0.9490 - val_loss: 0.1650 - val_accuracy: 0.9475 - 10s/epoch - 2ms/step
## Epoch 18/100
## 4456/4456 - 10s - loss: 0.1546 - accuracy: 0.9491 - val_loss: 0.1712 - val_accuracy: 0.9453 - 10s/epoch - 2ms/step
## Epoch 19/100
## 4456/4456 - 10s - loss: 0.1541 - accuracy: 0.9495 - val_loss: 0.1666 - val_accuracy: 0.9482 - 10s/epoch - 2ms/step
## Epoch 20/100
## 4456/4456 - 10s - loss: 0.1538 - accuracy: 0.9495 - val_loss: 0.1694 - val_accuracy: 0.9474 - 10s/epoch - 2ms/step
## Epoch 21/100
## 4456/4456 - 10s - loss: 0.1532 - accuracy: 0.9499 - val_loss: 0.1676 - val_accuracy: 0.9476 - 10s/epoch - 2ms/step
## Epoch 22/100
## 4456/4456 - 10s - loss: 0.1525 - accuracy: 0.9499 - val_loss: 0.1682 - val_accuracy: 0.9474 - 10s/epoch - 2ms/step
## Epoch 23/100
## 4456/4456 - 10s - loss: 0.1522 - accuracy: 0.9503 - val_loss: 0.1679 - val_accuracy: 0.9478 - 10s/epoch - 2ms/step
## Epoch 24/100
## 4456/4456 - 10s - loss: 0.1522 - accuracy: 0.9500 - val_loss: 0.1692 - val_accuracy: 0.9480 - 10s/epoch - 2ms/step
## Epoch 25/100
## 4456/4456 - 11s - loss: 0.1515 - accuracy: 0.9503 - val_loss: 0.1769 - val_accuracy: 0.9462 - 11s/epoch - 2ms/step
## Epoch 26/100
## 4456/4456 - 11s - loss: 0.1514 - accuracy: 0.9502 - val_loss: 0.1690 - val_accuracy: 0.9484 - 11s/epoch - 2ms/step
## Epoch 27/100
## 4456/4456 - 10s - loss: 0.1509 - accuracy: 0.9506 - val_loss: 0.1811 - val_accuracy: 0.9470 - 10s/epoch - 2ms/step
## Epoch 28/100
## 4456/4456 - 9s - loss: 0.1504 - accuracy: 0.9508 - val_loss: 0.1738 - val_accuracy: 0.9472 - 9s/epoch - 2ms/step
## Epoch 29/100
## 4456/4456 - 10s - loss: 0.1500 - accuracy: 0.9509 - val_loss: 0.1746 - val_accuracy: 0.9462 - 10s/epoch - 2ms/step
## Epoch 30/100
## 4456/4456 - 10s - loss: 0.1497 - accuracy: 0.9509 - val_loss: 0.1774 - val_accuracy: 0.9470 - 10s/epoch - 2ms/step
## Epoch 31/100
## 4456/4456 - 10s - loss: 0.1494 - accuracy: 0.9512 - val_loss: 0.1767 - val_accuracy: 0.9463 - 10s/epoch - 2ms/step
## Epoch 32/100
## 4456/4456 - 9s - loss: 0.1491 - accuracy: 0.9514 - val_loss: 0.1818 - val_accuracy: 0.9456 - 9s/epoch - 2ms/step
## Epoch 33/100
## 4456/4456 - 9s - loss: 0.1487 - accuracy: 0.9514 - val_loss: 0.1866 - val_accuracy: 0.9439 - 9s/epoch - 2ms/step
## Epoch 34/100
## 4456/4456 - 9s - loss: 0.1485 - accuracy: 0.9512 - val_loss: 0.1888 - val_accuracy: 0.9456 - 9s/epoch - 2ms/step
## Epoch 35/100
## 4456/4456 - 9s - loss: 0.1483 - accuracy: 0.9516 - val_loss: 0.1833 - val_accuracy: 0.9468 - 9s/epoch - 2ms/step
## Epoch 36/100
## 4456/4456 - 9s - loss: 0.1479 - accuracy: 0.9519 - val_loss: 0.1796 - val_accuracy: 0.9461 - 9s/epoch - 2ms/step
## Epoch 37/100
## 4456/4456 - 9s - loss: 0.1475 - accuracy: 0.9518 - val_loss: 0.1849 - val_accuracy: 0.9459 - 9s/epoch - 2ms/step
## Epoch 38/100
## 4456/4456 - 10s - loss: 0.1473 - accuracy: 0.9518 - val_loss: 0.1890 - val_accuracy: 0.9445 - 10s/epoch - 2ms/step
## Epoch 39/100
## 4456/4456 - 9s - loss: 0.1470 - accuracy: 0.9524 - val_loss: 0.1873 - val_accuracy: 0.9476 - 9s/epoch - 2ms/step
## Epoch 40/100
## 4456/4456 - 9s - loss: 0.1469 - accuracy: 0.9522 - val_loss: 0.1861 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 41/100
## 4456/4456 - 11s - loss: 0.1465 - accuracy: 0.9521 - val_loss: 0.1875 - val_accuracy: 0.9458 - 11s/epoch - 2ms/step
## Epoch 42/100
## 4456/4456 - 9s - loss: 0.1463 - accuracy: 0.9524 - val_loss: 0.1880 - val_accuracy: 0.9456 - 9s/epoch - 2ms/step
## Epoch 43/100
## 4456/4456 - 10s - loss: 0.1460 - accuracy: 0.9523 - val_loss: 0.1922 - val_accuracy: 0.9466 - 10s/epoch - 2ms/step
## Epoch 44/100
## 4456/4456 - 10s - loss: 0.1457 - accuracy: 0.9523 - val_loss: 0.1941 - val_accuracy: 0.9462 - 10s/epoch - 2ms/step
## Epoch 45/100
## 4456/4456 - 10s - loss: 0.1458 - accuracy: 0.9524 - val_loss: 0.1984 - val_accuracy: 0.9461 - 10s/epoch - 2ms/step
## Epoch 46/100
## 4456/4456 - 10s - loss: 0.1452 - accuracy: 0.9530 - val_loss: 0.1909 - val_accuracy: 0.9449 - 10s/epoch - 2ms/step
## Epoch 47/100
## 4456/4456 - 10s - loss: 0.1451 - accuracy: 0.9528 - val_loss: 0.1986 - val_accuracy: 0.9453 - 10s/epoch - 2ms/step
## Epoch 48/100
## 4456/4456 - 10s - loss: 0.1450 - accuracy: 0.9527 - val_loss: 0.1939 - val_accuracy: 0.9455 - 10s/epoch - 2ms/step
## Epoch 49/100
## 4456/4456 - 10s - loss: 0.1447 - accuracy: 0.9531 - val_loss: 0.1980 - val_accuracy: 0.9449 - 10s/epoch - 2ms/step
## Epoch 50/100
## 4456/4456 - 9s - loss: 0.1440 - accuracy: 0.9533 - val_loss: 0.1966 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 51/100
## 4456/4456 - 9s - loss: 0.1441 - accuracy: 0.9529 - val_loss: 0.2018 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 52/100
## 4456/4456 - 9s - loss: 0.1442 - accuracy: 0.9531 - val_loss: 0.2009 - val_accuracy: 0.9457 - 9s/epoch - 2ms/step
## Epoch 53/100
## 4456/4456 - 9s - loss: 0.1439 - accuracy: 0.9532 - val_loss: 0.1999 - val_accuracy: 0.9472 - 9s/epoch - 2ms/step
## Epoch 54/100
## 4456/4456 - 9s - loss: 0.1437 - accuracy: 0.9534 - val_loss: 0.2137 - val_accuracy: 0.9440 - 9s/epoch - 2ms/step
## Epoch 55/100
## 4456/4456 - 9s - loss: 0.1435 - accuracy: 0.9533 - val_loss: 0.2028 - val_accuracy: 0.9460 - 9s/epoch - 2ms/step
## Epoch 56/100
## 4456/4456 - 9s - loss: 0.1432 - accuracy: 0.9535 - val_loss: 0.2091 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 57/100
## 4456/4456 - 9s - loss: 0.1431 - accuracy: 0.9535 - val_loss: 0.2102 - val_accuracy: 0.9453 - 9s/epoch - 2ms/step
## Epoch 58/100
## 4456/4456 - 9s - loss: 0.1429 - accuracy: 0.9534 - val_loss: 0.2069 - val_accuracy: 0.9447 - 9s/epoch - 2ms/step
## Epoch 59/100
## 4456/4456 - 9s - loss: 0.1427 - accuracy: 0.9536 - val_loss: 0.2110 - val_accuracy: 0.9436 - 9s/epoch - 2ms/step
## Epoch 60/100
## 4456/4456 - 9s - loss: 0.1426 - accuracy: 0.9537 - val_loss: 0.2148 - val_accuracy: 0.9461 - 9s/epoch - 2ms/step
## Epoch 61/100
## 4456/4456 - 9s - loss: 0.1424 - accuracy: 0.9539 - val_loss: 0.2207 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 62/100
## 4456/4456 - 9s - loss: 0.1420 - accuracy: 0.9538 - val_loss: 0.2233 - val_accuracy: 0.9435 - 9s/epoch - 2ms/step
## Epoch 63/100
## 4456/4456 - 9s - loss: 0.1422 - accuracy: 0.9539 - val_loss: 0.2147 - val_accuracy: 0.9448 - 9s/epoch - 2ms/step
## Epoch 64/100
## 4456/4456 - 10s - loss: 0.1421 - accuracy: 0.9540 - val_loss: 0.2154 - val_accuracy: 0.9436 - 10s/epoch - 2ms/step
## Epoch 65/100
## 4456/4456 - 9s - loss: 0.1417 - accuracy: 0.9542 - val_loss: 0.2284 - val_accuracy: 0.9432 - 9s/epoch - 2ms/step
## Epoch 66/100
## 4456/4456 - 9s - loss: 0.1417 - accuracy: 0.9542 - val_loss: 0.2242 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 67/100
## 4456/4456 - 10s - loss: 0.1415 - accuracy: 0.9538 - val_loss: 0.2321 - val_accuracy: 0.9459 - 10s/epoch - 2ms/step
## Epoch 68/100
## 4456/4456 - 10s - loss: 0.1412 - accuracy: 0.9542 - val_loss: 0.2217 - val_accuracy: 0.9448 - 10s/epoch - 2ms/step
## Epoch 69/100
## 4456/4456 - 10s - loss: 0.1412 - accuracy: 0.9541 - val_loss: 0.2239 - val_accuracy: 0.9434 - 10s/epoch - 2ms/step
## Epoch 70/100
## 4456/4456 - 9s - loss: 0.1412 - accuracy: 0.9543 - val_loss: 0.2233 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 71/100
## 4456/4456 - 9s - loss: 0.1411 - accuracy: 0.9545 - val_loss: 0.2309 - val_accuracy: 0.9436 - 9s/epoch - 2ms/step
## Epoch 72/100
## 4456/4456 - 9s - loss: 0.1405 - accuracy: 0.9545 - val_loss: 0.2264 - val_accuracy: 0.9460 - 9s/epoch - 2ms/step
## Epoch 73/100
## 4456/4456 - 9s - loss: 0.1404 - accuracy: 0.9546 - val_loss: 0.2319 - val_accuracy: 0.9445 - 9s/epoch - 2ms/step
## Epoch 74/100
## 4456/4456 - 9s - loss: 0.1403 - accuracy: 0.9549 - val_loss: 0.2337 - val_accuracy: 0.9437 - 9s/epoch - 2ms/step
## Epoch 75/100
## 4456/4456 - 9s - loss: 0.1405 - accuracy: 0.9548 - val_loss: 0.2356 - val_accuracy: 0.9457 - 9s/epoch - 2ms/step
## Epoch 76/100
## 4456/4456 - 9s - loss: 0.1401 - accuracy: 0.9547 - val_loss: 0.2387 - val_accuracy: 0.9427 - 9s/epoch - 2ms/step
## Epoch 77/100
## 4456/4456 - 9s - loss: 0.1404 - accuracy: 0.9548 - val_loss: 0.2388 - val_accuracy: 0.9421 - 9s/epoch - 2ms/step
## Epoch 78/100
## 4456/4456 - 9s - loss: 0.1398 - accuracy: 0.9550 - val_loss: 0.2401 - val_accuracy: 0.9448 - 9s/epoch - 2ms/step
## Epoch 79/100
## 4456/4456 - 10s - loss: 0.1398 - accuracy: 0.9549 - val_loss: 0.2425 - val_accuracy: 0.9434 - 10s/epoch - 2ms/step
## Epoch 80/100
## 4456/4456 - 10s - loss: 0.1398 - accuracy: 0.9549 - val_loss: 0.2396 - val_accuracy: 0.9440 - 10s/epoch - 2ms/step
## Epoch 81/100
## 4456/4456 - 9s - loss: 0.1395 - accuracy: 0.9550 - val_loss: 0.2386 - val_accuracy: 0.9445 - 9s/epoch - 2ms/step
## Epoch 82/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9554 - val_loss: 0.2533 - val_accuracy: 0.9413 - 9s/epoch - 2ms/step
## Epoch 83/100
## 4456/4456 - 10s - loss: 0.1392 - accuracy: 0.9554 - val_loss: 0.2612 - val_accuracy: 0.9436 - 10s/epoch - 2ms/step
## Epoch 84/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9552 - val_loss: 0.2531 - val_accuracy: 0.9418 - 9s/epoch - 2ms/step
## Epoch 85/100
## 4456/4456 - 9s - loss: 0.1391 - accuracy: 0.9550 - val_loss: 0.2554 - val_accuracy: 0.9436 - 9s/epoch - 2ms/step
## Epoch 86/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9553 - val_loss: 0.2459 - val_accuracy: 0.9446 - 9s/epoch - 2ms/step
## Epoch 87/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9550 - val_loss: 0.2494 - val_accuracy: 0.9437 - 9s/epoch - 2ms/step
## Epoch 88/100
## 4456/4456 - 9s - loss: 0.1390 - accuracy: 0.9552 - val_loss: 0.2509 - val_accuracy: 0.9430 - 9s/epoch - 2ms/step
## Epoch 89/100
## 4456/4456 - 9s - loss: 0.1388 - accuracy: 0.9556 - val_loss: 0.2612 - val_accuracy: 0.9435 - 9s/epoch - 2ms/step
## Epoch 90/100
## 4456/4456 - 9s - loss: 0.1385 - accuracy: 0.9552 - val_loss: 0.2587 - val_accuracy: 0.9437 - 9s/epoch - 2ms/step
## Epoch 91/100
## 4456/4456 - 9s - loss: 0.1385 - accuracy: 0.9556 - val_loss: 0.2599 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 92/100
## 4456/4456 - 9s - loss: 0.1386 - accuracy: 0.9555 - val_loss: 0.2558 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 93/100
## 4456/4456 - 10s - loss: 0.1381 - accuracy: 0.9553 - val_loss: 0.2613 - val_accuracy: 0.9425 - 10s/epoch - 2ms/step
## Epoch 94/100
## 4456/4456 - 10s - loss: 0.1382 - accuracy: 0.9555 - val_loss: 0.2516 - val_accuracy: 0.9441 - 10s/epoch - 2ms/step
## Epoch 95/100
## 4456/4456 - 10s - loss: 0.1382 - accuracy: 0.9554 - val_loss: 0.2607 - val_accuracy: 0.9433 - 10s/epoch - 2ms/step
## Epoch 96/100
## 4456/4456 - 10s - loss: 0.1382 - accuracy: 0.9555 - val_loss: 0.2555 - val_accuracy: 0.9423 - 10s/epoch - 2ms/step
## Epoch 97/100
## 4456/4456 - 10s - loss: 0.1378 - accuracy: 0.9555 - val_loss: 0.2652 - val_accuracy: 0.9438 - 10s/epoch - 2ms/step
## Epoch 98/100
## 4456/4456 - 9s - loss: 0.1377 - accuracy: 0.9559 - val_loss: 0.2777 - val_accuracy: 0.9423 - 9s/epoch - 2ms/step
## Epoch 99/100
## 4456/4456 - 9s - loss: 0.1376 - accuracy: 0.9559 - val_loss: 0.2686 - val_accuracy: 0.9429 - 9s/epoch - 2ms/step
## Epoch 100/100
## 4456/4456 - 9s - loss: 0.1378 - accuracy: 0.9559 - val_loss: 0.2660 - val_accuracy: 0.9434 - 9s/epoch - 2ms/step
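In the log above, the training loss keeps falling (0.1832 to 0.1378) while the validation loss reaches its minimum of 0.1637 at epoch 9 and then climbs to 0.2660, which suggests overfitting in the later epochs. A patience-based early-stopping rule is one common response. The sketch below applies such a rule in base R to the first 20 validation losses copied from the log. The patience of 5 epochs is an arbitrary choice for illustration, and this rule was not part of the original training run.

```r
# Validation loss for epochs 1-20, copied from the training log above
val_loss <- c(0.1733, 0.1695, 0.1676, 0.1649, 0.1655, 0.1673, 0.1718, 0.1644,
              0.1637, 0.1642, 0.1660, 0.1644, 0.1671, 0.1653, 0.1674, 0.1641,
              0.1650, 0.1712, 0.1666, 0.1694)

patience <- 5  # hypothetical: epochs to wait for an improvement before stopping

best_loss  <- Inf
best_epoch <- NA
stop_epoch <- NA
wait <- 0

for (epoch in seq_along(val_loss)) {
  if (val_loss[epoch] < best_loss) {
    # New best validation loss: remember it and reset the counter
    best_loss  <- val_loss[epoch]
    best_epoch <- epoch
    wait <- 0
  } else {
    # No improvement: stop once the patience is used up
    wait <- wait + 1
    if (wait >= patience) {
      stop_epoch <- epoch
      break
    }
  }
}

cat("Best epoch:", best_epoch, "with val_loss", best_loss, "\n")
cat("Training would stop after epoch:", stop_epoch, "\n")
```

In keras, the same idea is available through callback_early_stopping(), which can monitor val_loss and restore the best weights.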
# Evaluate the model
metrics <- model %>% evaluate(
  x = as.matrix(test_data[, -ncol(test_data)]), # Features
  y = as.numeric(test_data$COD) - 1, # Target variable (convert to 0-based index)
  verbose = 0
)
# Print evaluation metrics
cat("Test Loss:", metrics["loss"], "\n")
## Test Loss: 0.2614014
cat("Test Accuracy:", metrics["accuracy"], "\n")
## Test Accuracy: 0.9416193
# Predictions on test data
predictions <- model %>% predict(as.matrix(test_data[, -ncol(test_data)]))
## 2387/2387 - 2s - 2s/epoch - 1ms/step
predictions <- ifelse(predictions > 0.5, 1, 0)
# Confusion matrix
conf_matrix <- table(Actual = as.numeric(test_data$COD) - 1, Predicted = predictions)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(conf_matrix)
## Predicted
## Actual 0 1
## 0 6565 2996
## 1 1463 65354
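# Accuracy, Sensitivity, and Specificity
# Note: class 1 (Alive) is the positive class here, whereas the caret output for the Random Forest used class 0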
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity <- conf_matrix[1, 1] / sum(conf_matrix[1, ])
paste("Accuracy:",accuracy)
## [1] "Accuracy: 0.94161931446228"
paste("Sensitivity:", sensitivity)
## [1] "Sensitivity: 0.978104374635198"
paste("Specificity:", specificity)
## [1] "Specificity: 0.686643656521284"
# Calculate overall accuracy
overall_accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
# Heatmap
heatmap_data <- as.data.frame(conf_matrix)
heatmap <- ggplot(heatmap_data, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) + # Add text labels
  ggtitle("Deep NN Predictive Model") + # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(overall_accuracy))) + # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5)) # Center subtitle
print(heatmap)
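# Plot ROC curve
# Note: `predictions` holds thresholded 0/1 labels at this point, so the curve has a single operating point; passing the predicted probabilities as a numeric vector would trace the full curve
roc_data <- roc(test_data$COD, predictions)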
## Setting levels: control = 0, case = 1
## Warning in roc.default(test_data$COD, predictions): Deprecated use a matrix as
## predictor. Unexpected results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
#plot(roc_data, main = "ROC Curve", col = "blue")
plot(roc_data, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", main = "ROC Curve")
In this project, I aimed to predict the survival of patients with breast cancer with more than 96% accuracy, knowing that the survival rate is 75%. The goal was to apply machine learning, the available resources, and the techniques learned in DATA606 and DATA607 to this complex problem. I used the SEER database for 2011 to 2015, comprising over 300,000 cases, to predict the survival of cancer patients from 16 critical indicators, including race, household income, cancer type, treatment, time to treatment, number of tumors, and more. Preliminary exploratory data analysis identified these key indicators from a pool of 36, followed by data cleaning and organization for the machine learning tasks. Various R packages were used for data cleaning, type conversion, handling missing values, and organizing the database. In addition, correlation analyses using ggplot, chi-square tests, Fisher tests, and other R packages explored the relationships between the numeric and categorical variables and the target parameter of interest, Alive/Death.
Initially, the intention was to include all three categories, Alive/Death/Other, but it later became clear that including the “Other” category rendered the analysis irrelevant. The analysis therefore focused solely on Alive/Death, as breast cancer was the primary cause of death even if patients had other conditions.
A range of machine learning algorithms was applied, from Logistic Regression and Random Forest to more sophisticated methods such as deep neural networks (DNN). Overall, the project demonstrated that even individuals with limited domain knowledge can use available resources to predict cancer patient outcomes with approximately 94% accuracy. Further work, such as stratification, analysis of parameter importance, and additional data gathering, could improve accuracy and offer significant contributions to the healthcare industry, patient care, and family circumstances.
Despite the complexities associated with managing different packages and large databases, I enjoyed exploring new concepts and learning how different methods can be employed. Particularly, I gained insights into the significance of encoding and its impact on survival model performance. While this analysis lacks the rigor of academic research, it underscores the potential of machine learning in addressing complex problems, paving the way for future exploration and study.
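To illustrate why encoding matters, here is a minimal sketch of smoothed target encoding in base R. It is an assumption-laden example, not the encoding used in this project: the `site` column, the synthetic data, the helper `target_encode`, and the smoothing strength `m = 20` are all hypothetical, and this simple shrinkage is not the full regularized scheme studied in [4].

```r
# Smoothed target encoding on synthetic data (illustrative only)
set.seed(1)
train <- data.frame(
  site  = sample(c("A", "B", "C", "D"), 500, replace = TRUE,
                 prob = c(0.5, 0.3, 0.15, 0.05)),
  death = rbinom(500, 1, 0.25)
)

# Hypothetical helper: x is a categorical vector, y a 0/1 outcome,
# m an assumed smoothing strength
target_encode <- function(x, y, m = 20) {
  global_mean <- mean(y)
  n_k    <- tapply(y, x, length)  # observations per level
  mean_k <- tapply(y, x, mean)    # outcome rate per level
  # Shrink each level's rate toward the global rate; rare levels shrink more
  (n_k * mean_k + m * global_mean) / (n_k + m)
}

enc <- target_encode(train$site, train$death)
print(enc)

# Map every row to the encoded value of its level
train$site_enc <- as.numeric(enc[as.character(train$site)])
head(train)
```

In a real model the encoding should be computed on the training data only and then applied to the test data, otherwise the outcome leaks into the features.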
In summary, among the developed models, Logistic Regression emerged as the simplest and fastest, achieving 93% accuracy, followed by Random Forest. Neural networks were also successful but were time-consuming and carried black-box risks. In future iterations, I would focus on Logistic Regression and Random Forest, dedicating more time to encoding, data preparation, stratification, and parameter stress testing to potentially improve accuracy.
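Because most patients survive, accuracy is easier to interpret next to a majority-class baseline. The sketch below shows one way to do a stratified split, fit a logistic regression, and compare it with that baseline. It uses a synthetic cohort: the predictors `age` and `tumors`, the coefficients, and the 80/20 split are assumptions for illustration and do not reproduce the models or results reported above.

```r
# Synthetic cohort with a minority of deaths (COD = 1); all values are made up
set.seed(123)
n <- 5000
age    <- rnorm(n, mean = 60, sd = 12)
tumors <- rpois(n, lambda = 1) + 1
cod    <- rbinom(n, 1, plogis(-4.9 + 0.05 * age + 0.4 * tumors))
dat    <- data.frame(COD = cod, age = age, tumors = tumors)

# Stratified 80/20 split: sample 80% of the row indices within each outcome class
train_idx <- unlist(
  lapply(split(seq_len(n), dat$COD),
         function(idx) sample(idx, floor(0.8 * length(idx)))),
  use.names = FALSE
)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Logistic regression and class predictions at a 0.5 threshold
fit  <- glm(COD ~ age + tumors, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
model_acc <- mean(pred == test$COD)

# Baseline: always predict the majority class seen in the training data
majority     <- as.integer(mean(train$COD) > 0.5)
baseline_acc <- mean(test$COD == majority)

cat("Baseline accuracy:", round(baseline_acc, 3),
    " Model accuracy:", round(model_acc, 3), "\n")
```

The gap between the two numbers, rather than the model accuracy alone, indicates how much the predictors actually contribute.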
This project highlights the potential of machine learning for patient survival prediction, even for individuals with limited domain knowledge. However, further research is needed, particularly on stratification, parameter importance, and additional data gathering.
By addressing these limitations, future studies can contribute significantly to personalized medicine, patient care planning, and supporting families facing this challenging diagnosis.
I would like to thank the professors in both DATA606 and DATA607, as well as the students in the classes, who made the courses interesting and challenging. I have learned a lot and dealt with many challenges throughout these courses, despite having little specific background in data science beforehand. The course content was carefully chosen to help students like me develop an understanding of the topic and find enjoyment in the learning process.
[1] SEER (https://seer.cancer.gov/data/access.html)
[3] XAI_Healthcare_eXplainable_AI_in_Healthcare.pdf (upc.edu)
[4] Pargent, F., Pfisterer, F., Thomas, J., Bischl, B.: Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Computational Statistics 37(5), 2671–2692 (Nov 2022)
[5] American Cancer Society - Breast Cancer Survival Rates