Data Preparation

In this project, I have chosen to work on breast cancer. There are various resources available on this topic, with the Surveillance, Epidemiology, and End Results (SEER) [1] program being the most reliable one.

The SEER Program of the National Cancer Institute (NCI) collects and publishes cancer data through a coordinated system of strategically placed cancer registries, covering nearly 30% of the US population.

Currently, there are 18 SEER registries in the USA. You can find this information on the following website: SEER Data Access.

I have also utilized the following repository to assist me with this project: SEER_solid_tumor [2]. The database contains extensive data, and my investigation focuses solely on breast cancer for the years 2011-2015 and 2019-2020. SEER provides software called SEER*Stat, which I used to extract the data; the extracted files are stored and used on my local computer. Additionally, there are two GitHub repositories that I’ve referenced to some extent in this project:

  1. The first repository [2] covers all types of cancer, whereas my study focuses specifically on breast cancer and addresses different research questions.

  2. The second [3] repository has conducted machine learning analyses on various cancer types using Python (not R). I’ve drawn inspiration and learned methods from their approach to survival studies in cancer patients.

R initialization

Check that all required packages are installed, and install any that are missing.
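As a minimal sketch of this step: the package list below is an assumption inferred from the functions called later in this report (read_csv, GET, kable, skim, ggplot, and the dplyr verbs), and the helper names find_missing_packages and ensure_packages are hypothetical, not part of any package.

```r
# Assumed package list, inferred from the functions used in this report
required_packages <- c("readr", "httr", "dplyr", "ggplot2", "knitr", "skimr")

# Return the entries of `pkgs` that are not installed (hypothetical helper)
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install anything missing, then attach every package (hypothetical helper)
ensure_packages <- function(pkgs) {
  missing_pkgs <- find_missing_packages(pkgs)
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# ensure_packages(required_packages)  # uncomment to run; may download packages
```

The install call is left commented out so that nothing is downloaded unless it is run deliberately.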

Research question

The primary focus of my research is to explore the survival rates of breast cancer patients and the various factors influencing these rates, including age, cancer type, treatment modalities, and other pertinent parameters. The commonly utilized five-year survival rate benchmark serves as a pivotal point of analysis in this study.

Acknowledging the significance of this benchmark, I have divided the data into two distinct datasets. For the dataset spanning 2011 to 2015, I assume that the status of every patient is known up to the database’s current date in 2022. In addition, I have selected the most recent data, from 2019 to 2020, as the target years for potential correlation and regression studies to estimate survival rates.

Although my research is not conducted within a strictly scientific framework, it is approached with rigor and attention to detail. While I do not possess expertise in the field of breast cancer, my personal connection to the topic motivates me to delve deeper into understanding the complexities surrounding it.

The dataset from 2011 to 2015 comprises approximately 303,000 rows with 36 selected columns. For the purpose of prediction, I have chosen to focus solely on the 2019-2020 data, which encompasses about 131,000 rows. The multifaceted nature of the research question necessitates a thorough examination, from data tidying to cleaning.

Some of the key parameters under consideration include years of diagnoses, age groups at diagnosis, and cancer type. However, I also recognize the importance of incorporating additional factors such as tumor characteristics and treatment modalities to provide a comprehensive understanding of breast cancer survival outcomes.

In conclusion, while my knowledge of the subject may not be extensive, I am committed to learning and contributing meaningful insights to the field of breast cancer research through meticulous analysis and interpretation of data.

Note on 5 years threshold

According to the American Cancer Society, the five-year relative survival rate for localized breast cancer is around 99%, but it drops to about 27% for distant-stage breast cancer. These rates can vary over time and with advances in treatment. Reference [5]: American Cancer Society - Breast Cancer Survival Rates
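To make the five-year benchmark concrete, here is a small sketch on toy data. The column name survival_months and the values are made up for illustration, and the 60-month cutoff is my assumption for "five years". Note that this is a crude observed proportion that ignores censoring; it is not the relative survival rate reported by the American Cancer Society.

```r
# Toy data; the column name is a hypothetical stand-in for the SEER field
toy <- data.frame(
  survival_months = c(60, 28, 99, 81, 10, 86),
  stringsAsFactors = FALSE
)

# Assumed threshold: five years expressed in months
five_year_months <- 60

# Flag patients whose recorded survival reaches the threshold
toy$survived_5y <- toy$survival_months >= five_year_months

# Crude percentage of patients reaching five years
crude_rate <- mean(toy$survived_5y) * 100
print(crude_rate)
```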

# Function to load CSV file
load_csv <- function(file_path) {
  if (file.exists(file_path)) {
    return(read_csv(file_path))
  } else {
    message("File not found locally. Attempting to fetch from server...")
    return(fetch_database(gdrive_link))
  }
}

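# Function to fetch the database from a download URL
# (uses the httr and readr packages)
fetch_database <- function(url) {
  response <- httr::GET(url)
  # Accept the download only if the request succeeded and has the expected content type
  if (!httr::http_error(response) &&
      httr::http_type(response) == "application/force-download") {
    # Decode the body as text and wrap it in I() so readr parses it as literal CSV data
    csv_text <- httr::content(response, as = "text", encoding = "UTF-8")
    return(readr::read_csv(I(csv_text)))
  } else {
    message("Failed to fetch from server. Please select the file manually.")
    return(readr::read_csv(file.choose()))
  }
}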

# Local file paths
directory <- "C:/Users/kohya/OneDrive/CUNY/DATA 606/DATA 606 Spring/Project"
file_2020 <- "BREAST_2019-2020-updated.csv"
file_serv <- "BREAST_2011-2015.csv"
gdrive_link <- "https://drive.google.com/uc?export=download&id=1vBR2SZ-aFX3jjU6kQMjPkxfYKP-EwqRE"

# Complete the file paths
full_path_serv <- file.path(directory, file_serv)
full_path_eval <- file.path(directory, file_2020)

# Attempt to load the databases
BREAST_DF_surv <- load_csv(full_path_serv)
## Rows: 303557 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): Sex, Race recode (W, B, AI, API), Race and origin recode (NHW, NHB...
## dbl  (2): Year of diagnosis, Year of follow-up recode
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
BREAST_DF_eval <- load_csv(full_path_eval)
## Rows: 131395 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): Sex, Race recode (W, B, AI, API), Race and origin recode (NHW, NHB...
## dbl  (2): Year of diagnosis, Year of follow-up recode
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows of the data frame
kable(head(BREAST_DF_surv, 10))
Sex Year of diagnosis Race recode (W, B, AI, API) Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) Site recode ICD-O-3/WHO 2008 Site recode ICD-O-3 2023 Revision Primary Site - labeled Grade Recode (thru 2017) Grade Clinical (2018+) Grade Pathological (2018+) Diagnostic Confirmation Laterality Chemotherapy recode (yes, no/unk) Radiation recode Months from diagnosis to treatment Reason no cancer-directed surgery Scope of reg lymph nd surg (1998-2002) Survival months flag Survival months COD to site recode First malignant primary indicator Sequence number Total number of in situ/malignant tumors for patient Total number of benign/borderline tumors for patient Patient ID Marital status at diagnosis Median household income inflation adj to 2021 Rural-Urban Continuum Code Age recode (<60,60-69,70+) Race and origin (recommended by SEER) Year of follow-up recode Year of death recode SEER other cause of death classification Tumor Size Summary (2016+) RX Summ–Systemic/Sur Seq (2007+) Origin recode NHIA (Hispanic, Non-Hisp)
Female 2015 White Non-Hispanic White Breast Breast C50.4-Upper-outer quadrant of breast Moderately differentiated; Grade II Blank(s) Blank(s) Positive histology Right - origin of primary Yes Beam radiation 002 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0060 Alive No 2nd of 2 or more primaries 02 00 00000309 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 50-54 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer Blank(s) Systemic therapy after surgery Non-Spanish-Hispanic-Latino
Female 2013 White Non-Hispanic White Breast Breast C50.9-Breast, NOS Unknown Blank(s) Blank(s) Positive histology Right - origin of primary No/Unknown None/Unknown Blank(s) Not recommended Blank(s) Complete dates are available and there are more than 0 days of survival 0028 Breast No 3rd of 3 or more primaries 03 00 00000346 Divorced $75,000+ Counties in metropolitan areas ge 1 million pop 40-44 years All races/ethnicities 2015 2015 Alive or dead due to cancer Blank(s) No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
Female 2012 White Non-Hispanic White Breast Breast C50.2-Upper-inner quadrant of breast Moderately differentiated; Grade II Blank(s) Blank(s) Positive histology Right - origin of primary No/Unknown None/Unknown 004 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0099 Alive No 2nd of 2 or more primaries 03 00 00000374 Widowed $75,000+ Counties in metropolitan areas ge 1 million pop 80-84 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer Blank(s) Systemic therapy before surgery Non-Spanish-Hispanic-Latino
Female 2014 White Non-Hispanic White Breast Breast C50.8-Overlapping lesion of breast Moderately differentiated; Grade II Blank(s) Blank(s) Positive histology Right - origin of primary No/Unknown None/Unknown 001 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0081 Alive No 2nd of 2 or more primaries 02 00 00000391 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 55-59 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer Blank(s) Systemic therapy after surgery Non-Spanish-Hispanic-Latino
Female 2011 Black Non-Hispanic Black Breast Breast C50.9-Breast, NOS Unknown Blank(s) Blank(s) Direct visualization without microscopic confirmation Left - origin of primary No/Unknown None/Unknown Blank(s) Not recommended Blank(s) Complete dates are available and there are more than 0 days of survival 0010 Breast No 2nd of 2 or more primaries 02 00 00000547 Widowed $75,000+ Counties in metropolitan areas ge 1 million pop 85+ years All races/ethnicities 2012 2012 Alive or dead due to cancer Blank(s) No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
Female 2013 White Hispanic (All Races) Breast Breast C50.9-Breast, NOS Moderately differentiated; Grade II Blank(s) Blank(s) Positive histology Right - origin of primary No/Unknown Beam radiation 001 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0086 Alive No 2nd of 2 or more primaries 02 00 00000567 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 70-74 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer Blank(s) No systemic therapy and/or surgical procedures Spanish-Hispanic-Latino
Female 2015 White Non-Hispanic White Breast Breast C50.8-Overlapping lesion of breast Unknown Blank(s) Blank(s) Positive histology Left - origin of primary Yes None/Unknown 001 Not recommended Blank(s) Complete dates are available and there are more than 0 days of survival 0017 Breast No 2nd of 2 or more primaries 02 00 00000760 Widowed $75,000+ Counties in metropolitan areas ge 1 million pop 75-79 years All races/ethnicities 2016 2016 Alive or dead due to cancer Blank(s) No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
Female 2015 White Hispanic (All Races) Breast Breast C50.4-Upper-outer quadrant of breast Poorly differentiated; Grade III Blank(s) Blank(s) Positive histology Right - origin of primary No/Unknown None/Unknown 001 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0007 Other Cause of Death No 2nd of 2 or more primaries 02 00 00000941 Widowed $75,000+ Counties in metropolitan areas ge 1 million pop 85+ years All races/ethnicities 2015 2015 Dead (attributable to causes other than this cancer dx) Blank(s) No systemic therapy and/or surgical procedures Spanish-Hispanic-Latino
Female 2015 White Non-Hispanic White Breast Breast C50.9-Breast, NOS Poorly differentiated; Grade III Blank(s) Blank(s) Positive histology Right - origin of primary No/Unknown Beam radiation 001 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0043 Cerebrovascular Diseases No 2nd of 2 or more primaries 02 00 00002056 Widowed $75,000+ Counties in metropolitan areas ge 1 million pop 80-84 years All races/ethnicities 2019 2019 Dead (attributable to causes other than this cancer dx) Blank(s) Systemic therapy after surgery Non-Spanish-Hispanic-Latino
Female 2015 Black Non-Hispanic Black Breast Breast C50.8-Overlapping lesion of breast Poorly differentiated; Grade III Blank(s) Blank(s) Positive histology Right - origin of primary No/Unknown None/Unknown 001 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0070 Alive No 3rd of 3 or more primaries 04 00 00002605 Divorced $75,000+ Counties in metropolitan areas ge 1 million pop 60-64 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer Blank(s) No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
kable(head(BREAST_DF_eval, 10))
Sex Year of diagnosis Race recode (W, B, AI, API) Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) Site recode ICD-O-3/WHO 2008 Site recode ICD-O-3 2023 Revision Primary Site - labeled Grade Recode (thru 2017) Grade Clinical (2018+) Grade Pathological (2018+) Diagnostic Confirmation Laterality Chemotherapy recode (yes, no/unk) Radiation recode Months from diagnosis to treatment Reason no cancer-directed surgery Scope of reg lymph nd surg (1998-2002) Survival months flag Survival months COD to site recode First malignant primary indicator Sequence number Total number of in situ/malignant tumors for patient Total number of benign/borderline tumors for patient Patient ID Marital status at diagnosis Median household income inflation adj to 2021 Rural-Urban Continuum Code Age recode (<60,60-69,70+) Race and origin (recommended by SEER) Year of follow-up recode Year of death recode SEER other cause of death classification Tumor Size Summary (2016+) RX Summ–Systemic/Sur Seq (2007+) Origin recode NHIA (Hispanic, Non-Hisp)
Female 2019 Asian or Pacific Islander Non-Hispanic Asian or Pacific Islander Breast Breast C50.8-Overlapping lesion of breast Unknown 1 1 Positive histology Right - origin of primary No/Unknown None/Unknown 002 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0019 Alive No 2nd of 2 or more primaries 02 00 00002750 Divorced $75,000+ Counties in metropolitan areas ge 1 million pop 65-69 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 008 Systemic therapy after surgery Non-Spanish-Hispanic-Latino
Female 2020 Asian or Pacific Islander Non-Hispanic Asian or Pacific Islander Breast Breast C50.8-Overlapping lesion of breast Unknown 2 9 Positive histology Right - origin of primary No/Unknown None/Unknown 000 Recommended, unknown if performed Blank(s) Complete dates are available and there are more than 0 days of survival 0000 Alive No 2nd of 2 or more primaries 02 00 00002870 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 75-79 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 050 No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
Female 2020 White Non-Hispanic White Breast Breast C50.4-Upper-outer quadrant of breast Unknown 1 2 Positive histology Right - origin of primary No/Unknown None/Unknown 000 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0007 Alive No 2nd of 2 or more primaries 02 00 00003067 Divorced $75,000+ Counties in metropolitan areas ge 1 million pop 85+ years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 018 No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
Female 2020 White Non-Hispanic White Breast Breast C50.5-Lower-outer quadrant of breast Unknown 2 9 Positive histology Right - origin of primary Yes None/Unknown 001 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0010 Alive No 2nd of 2 or more primaries 02 00 00003365 Widowed $75,000+ Counties in metropolitan areas ge 1 million pop 85+ years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 060 Systemic therapy both before and after surgery Non-Spanish-Hispanic-Latino
Female 2019 White Non-Hispanic White Breast Breast C50.8-Overlapping lesion of breast Unknown 2 2 Positive histology Right - origin of primary No/Unknown Radioactive implants (includes brachytherapy) (1988+) 000 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0016 Alive No 3rd of 3 or more primaries 03 00 00003679 Divorced $75,000+ Counties in metropolitan areas ge 1 million pop 75-79 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 010 No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
Female 2019 Asian or Pacific Islander Non-Hispanic Asian or Pacific Islander Breast Breast C50.9-Breast, NOS Unknown 2 2 Positive histology Right - origin of primary No/Unknown None/Unknown 004 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0014 Alive No 3rd of 3 or more primaries 04 00 00003771 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 55-59 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 030 Systemic therapy after surgery Non-Spanish-Hispanic-Latino
Female 2019 Asian or Pacific Islander Non-Hispanic Asian or Pacific Islander Breast Breast C50.4-Upper-outer quadrant of breast Unknown 1 1 Positive histology Left - origin of primary No/Unknown None/Unknown 004 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0014 Alive No 4th of 4 or more primaries 04 00 00003771 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 55-59 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 004 Systemic therapy after surgery Non-Spanish-Hispanic-Latino
Female 2020 White Non-Hispanic White Breast Breast C50.8-Overlapping lesion of breast Unknown 2 9 Positive histology Right - origin of primary No/Unknown None/Unknown 001 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0003 Alive No 2nd of 2 or more primaries 02 00 00006501 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 80-84 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 036 Systemic therapy both before and after surgery Non-Spanish-Hispanic-Latino
Female 2020 White Non-Hispanic White Breast Breast C50.3-Lower-inner quadrant of breast Unknown 1 1 Positive histology Left - origin of primary No/Unknown None/Unknown 002 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0007 Alive No 3rd of 3 or more primaries 03 00 00007723 Married (including common law) $75,000+ Counties in metropolitan areas ge 1 million pop 70-74 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 006 No systemic therapy and/or surgical procedures Non-Spanish-Hispanic-Latino
Female 2019 White Non-Hispanic White Breast Breast C50.4-Upper-outer quadrant of breast Unknown 2 9 Positive histology Right - origin of primary Yes None/Unknown 002 Surgery performed Blank(s) Complete dates are available and there are more than 0 days of survival 0021 Alive No 2nd of 2 or more primaries 02 00 00008406 Unmarried or Domestic Partner $75,000+ Counties in metropolitan areas ge 1 million pop 55-59 years All races/ethnicities 2020 Alive at last contact Alive or dead due to cancer 019 Systemic therapy both before and after surgery Non-Spanish-Hispanic-Latino

Cases

The 2019-2020 breast cancer dataset contains 131,395 cases, and the 2011-2015 dataset contains 303,557.

Data collection

I used SEER*Stat to collect the data and export it as a TXT file for import into R for analysis. How SEER collects the data is summarized on the following page:

Type of study

This is an observational study: the information was gathered for different patients, and I evaluate and present the available data.

Data Source

The data are collected by the SEER program, and I used the SEER*Stat software to extract them in a TXT/CSV format that can be imported into R (Surveillance, Epidemiology, and End Results Program 2023).

Dependent Variable

We have a combination of numeric and categorical data to work with. For example, the number of tumors and the survival months are quantitative, while others, such as race, marital status, and type of cancer, are categorical.

Categorical features such as ‘Median household income …’, ‘Marital status’, ‘Grade recode’, ‘Laterality’, and ‘Radiation recode’ are represented as character columns.

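On import, only ‘Year of diagnosis’ and ‘Year of follow-up recode’ are read as numeric (dbl). ‘Patient ID’ and the ‘Total number of …’ columns are read as character; the tumor counts are converted to numeric during tidying.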
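One way to check the type of each column is sketched below on a toy data frame. The column names and values are illustrative only; on the real data, the same call could be applied to BREAST_DF_surv once it has been loaded.

```r
# Toy data frame with illustrative column names (not the real SEER names)
toy <- data.frame(
  patient_id = c("00000309", "00000346"),
  year_of_diagnosis = c(2015, 2013),
  marital_status = c("Married", "Divorced"),
  stringsAsFactors = FALSE
)

# Record the first class of every column
column_types <- vapply(toy, function(col) class(col)[1], character(1))

# Count how many columns there are of each type
type_counts <- table(column_types)
print(column_types)
print(type_counts)
```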

# Find the number of unique values in each column
unique_values <- data.frame(unique = apply(BREAST_DF_surv, 2, function(x) length(unique(x))), colnames = colnames(BREAST_DF_surv))

# Find the number of unique values and the unique values themselves
unique_info <- data.frame(
  unique_count = sapply(BREAST_DF_surv, function(x) length(unique(x))),
  unique_values = sapply(BREAST_DF_surv, function(x) toString(unique(x))),
  column_names = names(BREAST_DF_surv)
)


# Check for NULL values
any_null <- any(sapply(BREAST_DF_surv, is.null))

# Check for NA values
any_na <- any(sapply(BREAST_DF_surv, is.na))

# Check if there are any NULL or NA values
if (any_null || any_na) {
  print("The data frame contains NULL or NA values.")
} else {
  print("The data frame does not contain any NULL or NA values.")
}
## [1] "The data frame does not contain any NULL or NA values."
has_na_character <- any(sapply(BREAST_DF_surv, function(x) any(x == "NA")))

if (has_na_character) {
  print("The data frame contains character values of 'NA'.")
} else {
  print("The data frame does not contain character values of 'NA'.")
}
## [1] "The data frame does not contain character values of 'NA'."

Data tidying

Exploring the data suggests that some columns may be entirely empty. In this database, empty values are filled with “Blank(s)”. In this section, I first check whether any column is entirely empty and remove it. For the remaining columns, I replace any “Blank(s)” values with NA, which dplyr and the tidyverse handle better.
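The two steps described above can be sketched on a toy data frame. This version uses base R only, so it runs without extra packages; it is an illustration under that assumption rather than the pipeline applied to the real data, and the column names are made up.

```r
# Toy data frame: `a` is entirely "Blank(s)", `b` is partly blank, `n` is numeric
toy <- data.frame(
  a = c("Blank(s)", "Blank(s)", "Blank(s)"),
  b = c("x", "Blank(s)", "y"),
  n = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Step 1: find and drop columns where every value is "Blank(s)"
all_blank <- vapply(toy, function(col) all(col == "Blank(s)"), logical(1))
tidy <- toy[, !all_blank, drop = FALSE]

# Step 2: in the remaining character columns, turn "Blank(s)" into NA
is_chr <- vapply(tidy, is.character, logical(1))
tidy[is_chr] <- lapply(
  tidy[is_chr],
  function(col) replace(col, col == "Blank(s)", NA_character_)
)

print(tidy)
```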

# Some cells in the data frame contain "Blank(s)", which is effectively NA. First, find any column whose values are all "Blank(s)" so that it can be removed.

# Look for columns with all "Blank(s)" values
Empty_column <- BREAST_DF_surv %>%
  dplyr::summarise(dplyr::across(everything(), ~all(. == "Blank(s)"))) %>%
  as.logical() %>%
  unlist()

# Get the names of columns with all cells containing "Blank(s)"
blank_column_names <- names(BREAST_DF_surv)[Empty_column]

# Print the column names with all cells containing "Blanks"
paste("list of empty column(s): ", blank_column_names)
## [1] "list of empty column(s):  Grade Clinical (2018+)"                
## [2] "list of empty column(s):  Grade Pathological (2018+)"            
## [3] "list of empty column(s):  Scope of reg lymph nd surg (1998-2002)"
## [4] "list of empty column(s):  Tumor Size Summary (2016+)"
# Remove those empty columns from both data frames
BREAST_DF_surv <- BREAST_DF_surv[, !names(BREAST_DF_surv) %in% blank_column_names]
BREAST_DF_eval <- BREAST_DF_eval[, !names(BREAST_DF_eval) %in% blank_column_names]

# Next, check whether any remaining cell still contains "Blank(s)"; if so, replace it with NA, which R handles better.

# The code below first replaces every "Blank(s)" with an empty string "" and then uses na_if() to convert the empty strings to NA, which makes missing values easier to handle in R.

BREAST_DF_surv <- BREAST_DF_surv %>%
  mutate_if(is.character, ~ifelse(. == "Blank(s)", "", .)) %>%  # For character columns
  mutate_if(is.numeric, ~ifelse(. == "", as.numeric(NA), .))  # For numeric columns

# Now, empty character cells are replaced with NA
BREAST_DF_surv <- BREAST_DF_surv %>%
  mutate_if(is.character, na_if, "")


#same to be done for eval dataset
BREAST_DF_eval <- BREAST_DF_eval %>%
  mutate_if(is.character, ~ifelse(. == "Blank(s)", "", .)) %>%  # For character columns
  mutate_if(is.numeric, ~ifelse(. == "", as.numeric(NA), .))  # For numeric columns

# Now, empty character cells are replaced with NA
BREAST_DF_eval <- BREAST_DF_eval %>%
  mutate_if(is.character, na_if, "")

#Change characters to numerics 
BREAST_DF_surv$`Months from diagnosis to treatment` <- as.numeric(BREAST_DF_surv$`Months from diagnosis to treatment`)
BREAST_DF_surv$`Survival months` <- as.numeric(BREAST_DF_surv$`Survival months`)
## Warning: NAs introduced by coercion
BREAST_DF_surv$`Total number of in situ/malignant tumors for patient` <- 
  as.numeric(BREAST_DF_surv$`Total number of in situ/malignant tumors for patient`)
## Warning: NAs introduced by coercion
BREAST_DF_surv$`Total number of benign/borderline tumors for patient` <- 
  as.numeric(BREAST_DF_surv$`Total number of benign/borderline tumors for patient`)
#Change the character to numeric in Eval dataset too
BREAST_DF_eval$`Months from diagnosis to treatment` <- as.numeric(BREAST_DF_eval$`Months from diagnosis to treatment`)
BREAST_DF_eval$`Survival months` <- as.numeric(BREAST_DF_eval$`Survival months`)
## Warning: NAs introduced by coercion
BREAST_DF_eval$`Total number of in situ/malignant tumors for patient` <- 
  as.numeric(BREAST_DF_eval$`Total number of in situ/malignant tumors for patient`)
## Warning: NAs introduced by coercion
BREAST_DF_eval$`Total number of benign/borderline tumors for patient` <- 
  as.numeric(BREAST_DF_eval$`Total number of benign/borderline tumors for patient`)


# View the structure of the data frame
#str(BREAST_DF_surv)
skimr::skim(BREAST_DF_surv)
Data summary
Name BREAST_DF_surv
Number of rows 303557
Number of columns 32
_______________________
Column type frequency:
character 26
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sex 0 1 6 6 0 1 0
Race recode (W, B, AI, API) 0 1 5 29 0 5 0
Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) 0 1 18 42 0 6 0
Site recode ICD-O-3/WHO 2008 0 1 6 6 0 1 0
Site recode ICD-O-3 2023 Revision 0 1 6 6 0 1 0
Primary Site - labeled 0 1 12 36 0 9 0
Grade Recode (thru 2017) 0 1 7 38 0 5 0
Diagnostic Confirmation 0 1 7 57 0 9 0
Laterality 0 1 24 53 0 5 0
Chemotherapy recode (yes, no/unk) 0 1 3 10 0 2 0
Radiation recode 0 1 12 53 0 8 0
Reason no cancer-directed surgery 0 1 15 76 0 8 0
Survival months flag 0 1 61 73 0 5 0
COD to site recode 0 1 5 55 0 87 0
First malignant primary indicator 0 1 2 3 0 2 0
Sequence number 0 1 16 60 0 13 0
Patient ID 0 1 8 8 0 294480 0
Marital status at diagnosis 0 1 7 30 0 7 0
Median household income inflation adj to 2021 0 1 8 38 0 11 0
Rural-Urban Continuum Code 0 1 38 60 0 7 0
Age recode (<60,60-69,70+) 0 1 9 11 0 18 0
Race and origin (recommended by SEER) 0 1 21 21 0 1 0
Year of death recode 0 1 4 21 0 11 0
SEER other cause of death classification 0 1 16 55 0 4 0
RX Summ–Systemic/Sur Seq (2007+) 0 1 16 55 0 8 0
Origin recode NHIA (Hispanic, Non-Hisp) 0 1 23 27 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year of diagnosis 0 1.00 2013.04 1.42 2011 2012 2013 2014 2015 ▇▇▇▇▇
Months from diagnosis to treatment 15843 0.95 1.13 1.14 0 0 1 2 24 ▇▁▁▁▁
Survival months 1290 1.00 74.22 29.88 0 62 78 97 119 ▂▂▆▇▆
Total number of in situ/malignant tumors for patient 3 1.00 1.36 0.65 1 1 1 2 20 ▇▁▁▁▁
Total number of benign/borderline tumors for patient 0 1.00 0.01 0.09 0 0 0 0 5 ▇▁▁▁▁
Year of follow-up recode 0 1.00 2018.90 2.14 2011 2019 2020 2020 2020 ▁▁▁▁▇
skimr::skim(BREAST_DF_eval)
Data summary
Name BREAST_DF_eval
Number of rows 131395
Number of columns 32
_______________________
Column type frequency:
character 26
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sex 0 1 6 6 0 1 0
Race recode (W, B, AI, API) 0 1 5 29 0 5 0
Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) 0 1 18 42 0 6 0
Site recode ICD-O-3/WHO 2008 0 1 6 6 0 1 0
Site recode ICD-O-3 2023 Revision 0 1 6 6 0 1 0
Primary Site - labeled 0 1 12 36 0 9 0
Grade Recode (thru 2017) 0 1 7 7 0 1 0
Diagnostic Confirmation 0 1 7 57 0 9 0
Laterality 0 1 24 53 0 5 0
Chemotherapy recode (yes, no/unk) 0 1 3 10 0 2 0
Radiation recode 0 1 12 53 0 8 0
Reason no cancer-directed surgery 0 1 15 76 0 8 0
Survival months flag 0 1 61 73 0 5 0
COD to site recode 0 1 5 55 0 67 0
First malignant primary indicator 0 1 2 3 0 2 0
Sequence number 0 1 16 60 0 16 0
Patient ID 0 1 8 8 0 127795 0
Marital status at diagnosis 0 1 7 30 0 7 0
Median household income inflation adj to 2021 0 1 8 38 0 11 0
Rural-Urban Continuum Code 0 1 38 60 0 7 0
Age recode (<60,60-69,70+) 0 1 9 11 0 17 0
Race and origin (recommended by SEER) 0 1 21 21 0 1 0
Year of death recode 0 1 4 21 0 3 0
SEER other cause of death classification 0 1 16 55 0 4 0
RX Summ–Systemic/Sur Seq (2007+) 0 1 16 55 0 8 0
Origin recode NHIA (Hispanic, Non-Hisp) 0 1 23 27 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year of diagnosis 0 1.00 2019.48 0.50 2019 2019 2019 2020 2020 ▇▁▁▁▇
Months from diagnosis to treatment 6807 0.95 1.26 1.18 0 1 1 2 24 ▇▁▁▁▁
Survival months 537 1.00 11.07 7.05 0 5 11 17 23 ▇▆▆▇▆
Total number of in situ/malignant tumors for patient 11 1.00 1.31 0.62 1 1 1 1 50 ▇▁▁▁▁
Total number of benign/borderline tumors for patient 0 1.00 0.01 0.09 0 0 0 0 2 ▇▁▁▁▁
Year of follow-up recode 0 1.00 2019.98 0.14 2019 2020 2020 2020 2020 ▁▁▁▁▇

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
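As a quick cross-check for grouped percentages, the share of each category can also be computed in base R with table() and prop.table(). The sketch below uses a made-up vector of race labels, not the SEER data.

```r
# Toy vector of category labels (illustrative values only)
toy_race <- c("White", "White", "Black", "White", "Asian or Pacific Islander")

# Count the cases in each category
race_counts <- table(toy_race)

# Convert the counts to percentages of the total, rounded to one decimal
race_percent <- round(100 * prop.table(race_counts), 1)
print(race_percent)
```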

#find column name to use later if needed
DF_col_names <- colnames(BREAST_DF_surv)

# use ggplot to plot the race information 
BREAST_DF_surv |> 
  ggplot(mapping = aes(x=`Race recode (W, B, AI, API)`)) +
  geom_bar(stat = "count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  ylim(0, 246000)

# To compare the racial composition of the evaluation and survival datasets, I use summarise() to build two new data frames that hold only summary statistics, including each race's percentage of the cohort.
# Percentage by race for the survival dataset
BREAST_DF_perc_surv <- BREAST_DF_surv %>%
  group_by(`Race recode (W, B, AI, API)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

# Plot the percentages
ggplot(BREAST_DF_perc_surv, aes(x = `Race recode (W, B, AI, API)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, color = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Race between 2011-2015", x = "Race recode (W, B, AI, API)", y = "Percentage") + ylim (0,90)

BREAST_DF_eval |> 
  ggplot(mapping = aes(x=`Race recode (W, B, AI, API)`)) +
  geom_bar(stat = "count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  ylim(0, 104000)

BREAST_DF_perc_eval <- BREAST_DF_eval %>%
  group_by(`Race recode (W, B, AI, API)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

# Plot the percentages
ggplot(BREAST_DF_perc_eval, aes(x = `Race recode (W, B, AI, API)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "plum") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, color = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Race between 2019-2020", x = "Race recode (W, B, AI, API)", y = "Percentage") + ylim(0, 90)

# In this section I focus on age to see whether it matters; the same percentages are computed and plotted by age group for the survival and eval data
# Find the unique values of the age-related column
uniques_ages <- unique(BREAST_DF_surv[29])

BREAST_DF_age_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Age recode (<60,60-69,70+)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

perc_max <- max(BREAST_DF_age_perc_surv$percentage)
# Plot the percentages
ggplot(BREAST_DF_age_perc_surv, aes(x = `Age recode (<60,60-69,70+)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "brown") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 90) +  # Rotate the text vertically
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +labs(title = "Percentage of Population by Age range 2011-2015", 
       x = "Age range", 
       y = "Percentage") + 
  ylim(0, round(1.5 * perc_max, 1))

# In this section we do the same analysis for the eval data based on age
BREAST_DF_age_perc_eval <- BREAST_DF_eval %>%
  dplyr::group_by(`Age recode (<60,60-69,70+)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

# Plot the percentages (the axis limit is recomputed from the eval data)
perc_max <- max(BREAST_DF_age_perc_eval$percentage)
ggplot(BREAST_DF_age_perc_eval, aes(x = `Age recode (<60,60-69,70+)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "brown") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 90) +  # Rotate the text vertically
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Age range 2019-2020", 
       x = "Age range", 
       y = "Percentage") + 
  ylim(0, round(1.5 * perc_max, 1))

# In this section, we do the analysis on household income 
# Find the unique values of the column related to household income 
uniques_householdes <- unique(BREAST_DF_surv[27])

BREAST_DF_income_perc_surv <- BREAST_DF_surv %>% dplyr::group_by(`Median household income inflation adj to 2021`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count

perc_max <- max(BREAST_DF_income_perc_surv$percentage) # Plot the percentages 
ggplot(BREAST_DF_income_perc_surv, aes(x = `Median household income inflation adj to 2021`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "brown") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by income 2011-2015", x = "Household Income", y = "Percentage") + 
  ylim(0, 1.2*perc_max)

# In this section we do the same analysis for the eval data based on household income
BREAST_DF_income_perc_eval <- BREAST_DF_eval %>% 
  dplyr::group_by(`Median household income inflation adj to 2021`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count


#Plot the percentages
perc_max <- max(BREAST_DF_income_perc_eval$percentage)
ggplot(BREAST_DF_income_perc_eval, aes(x = `Median household income inflation adj to 2021`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "brown") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by income 2019-2020", x = "Household Income", y = "Percentage") + 
  ylim(0, 1.2*perc_max)

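# In this section, we do the analysis on Primary Site
# Find the unique values of the primary site column (selected by name; the original index 27 pointed at the household income column used above)
uniques_canter_type <- unique(BREAST_DF_surv["Primary Site - labeled"])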

BREAST_DF_labeled_perc_surv <- BREAST_DF_surv %>% dplyr::group_by(`Primary Site - labeled`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count

perc_max <- max(BREAST_DF_labeled_perc_surv$percentage) # Plot the percentages 
ggplot(BREAST_DF_labeled_perc_surv, aes(x = `Primary Site - labeled`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "darkgreen") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Primary Site Labels 2011-2015", x = "Primary Labels", y = "Percentage") + 
  ylim(0, 1.2*perc_max)

# In this section we do the same analysis for the eval data based on primary site
BREAST_DF_labeled_perc_eval <- BREAST_DF_eval %>% 
  dplyr::group_by(`Primary Site - labeled`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count


#Plot the percentages
perc_max <- max(BREAST_DF_labeled_perc_eval$percentage)
ggplot(BREAST_DF_labeled_perc_eval, aes(x = `Primary Site - labeled`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "darkgreen") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Primary Site Labels 2019-2020", x = "Primary Labels", y = "Percentage") + 
  ylim(0, 1.2*perc_max)
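
The percentage summaries above repeat the same group, count, and percentage steps for every column and for both data frames. As a minimal sketch (the helper name `percent_by` and the toy data are my own assumptions, not part of the original analysis), the pattern could be wrapped in one function:

```r
library(dplyr)

# Hypothetical helper: count the rows in each level of `col` and express the
# counts as a percentage of all rows.
percent_by <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop") %>%
    mutate(total_count = sum(count),
           percentage = count / total_count * 100)
}

# Toy data standing in for BREAST_DF_surv / BREAST_DF_eval
toy <- data.frame(`Race recode (W, B, AI, API)` = c("White", "White", "Black", "White"),
                  check.names = FALSE)
race_perc <- percent_by(toy, "Race recode (W, B, AI, API)")
print(race_perc)
```

With such a helper, each summary would reduce to one call per column and data frame, which also makes it harder for a comment or an axis limit to be carried over from a previous block by mistake.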

# Recode `COD to site recode`: keep "Alive" (still alive) and "Breast" (died of breast cancer), and label every other value "Other" (passed away, but not because of breast cancer)

BREAST_DF_surv <- BREAST_DF_surv %>%
  mutate(COD = ifelse(`COD to site recode` %in% c("Alive","Breast"), `COD to site recode`, "Other"))
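
To make the recode concrete, here is a small sketch on a toy vector. The two non-breast causes are invented labels for illustration only and are not taken from the SEER data:

```r
# Toy cause-of-death values standing in for `COD to site recode`
cod_raw <- c("Alive", "Breast", "Diseases of Heart", "Alive", "Lung and Bronchus")

# Keep "Alive" and "Breast", collapse every other cause into "Other"
cod <- ifelse(cod_raw %in% c("Alive", "Breast"), cod_raw, "Other")

# Cross-tabulate raw against recoded values to check the mapping
print(table(cod_raw, cod))
```

Cross-tabulating the raw and recoded columns is a quick way to confirm that no category was collapsed by accident.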

Results of the exploratory data analysis

In this section, we carry out some exploratory data analysis. We first look at the whole population and at how many patients survived the cancer, and then break the breast cancer death rate down by several factors. Later we will run further analyses to see whether these were important or deciding factors.

BREAST_DF_COD_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(COD) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(`Total Count` = sum(count)) %>%  # Calculate total count
  mutate(`Population %` = round(count / `Total Count` * 100, 2))  # Percentage of the total, rounded to 2 decimals

kable(BREAST_DF_COD_perc_surv)

| COD | count | Total Count | Population % |
|:---|---:|---:|---:|
| Alive | 228221 | 303557 | 75.18 |
| Breast | 38472 | 303557 | 12.67 |
| Other | 36864 | 303557 | 12.14 |

# Let's first group by the number of tumors and find how many people in the population fall in each group. Then, within each group, let's determine how many passed away due to breast cancer. Note that this approach may not be completely accurate, as some deaths related to breast cancer complications may be recorded under another cause and are not counted here.
 
BREAST_DF_TNoT_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Total number of in situ/malignant tumors for patient`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total count in each group

# Compute each group's share of the population and the percentage of breast cancer deaths within the group

BREAST_DF_TNoT_perc_surv$`Group % in total` <- round(BREAST_DF_TNoT_perc_surv$Population/sum(BREAST_DF_TNoT_perc_surv$Population)*100,2)

BREAST_DF_TNoT_perc_surv$`Death %` <- round(BREAST_DF_TNoT_perc_surv$`Event Population`/BREAST_DF_TNoT_perc_surv$Population*100,2)

    
kable(BREAST_DF_TNoT_perc_surv)

| Total number of in situ/malignant tumors for patient | Event Population | Population | Group % in total | Death % |
|---:|---:|---:|---:|---:|
| 1 | 27314 | 217122 | 71.53 | 12.58 |
| 2 | 8945 | 68082 | 22.43 | 13.14 |
| 3 | 1808 | 14579 | 4.80 | 12.40 |
| 4 | 322 | 2996 | 0.99 | 10.75 |
| 5 | 68 | 595 | 0.20 | 11.43 |
| 6 | 9 | 126 | 0.04 | 7.14 |
| 7 | 3 | 29 | 0.01 | 10.34 |
| 8 | 2 | 18 | 0.01 | 11.11 |
| 18 | 1 | 1 | 0.00 | 100.00 |

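
One detail of the pattern used here: because `filter(COD == "Breast")` runs before `summarise()`, any group without a breast cancer death disappears from the table and from the "Group % in total" denominator (the populations listed above add up to 303,548 rather than the 303,557 rows in the data). A possible alternative, sketched on invented toy data with a hypothetical column name `tumors`, counts events and group sizes in one step so that zero-event groups are kept:

```r
library(dplyr)

# Toy data: tumour-count group 3 has no breast cancer deaths
toy <- data.frame(
  tumors = c(1, 1, 1, 2, 2, 3),
  COD    = c("Alive", "Breast", "Other", "Breast", "Alive", "Alive")
)

death_by_group <- toy %>%
  group_by(tumors) %>%
  summarise(`Event Population` = sum(COD == "Breast"),  # deaths from breast cancer
            Population = n(),                           # all patients in the group
            .groups = "drop") %>%
  mutate(`Group % in total` = round(Population / sum(Population) * 100, 2),
         `Death %` = round(`Event Population` / Population * 100, 2))

print(death_by_group)
```

This is only a sketch of one option; the same idea would apply to the treatment, surgery, marital status, income, and primary site tables.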
# Let's focus on the treatment. There are two types of treatment, radiation (R) and chemotherapy (C), which give four combinations: R:N-C:N, R:Y-C:N, R:N-C:Y, and R:Y-C:Y. For each of these four groups we find the total number of patients and the number of deaths, and report them in the same way as above. 

BREAST_DF_surv <- BREAST_DF_surv %>% 
  mutate(Radiation = ifelse(`Radiation recode` %in% c("None/Unknown","Refused (1988+)","Recommended, unknown if administered"),"No/Unknown","Yes"))
BREAST_DF_eval <- BREAST_DF_eval %>% 
  mutate(Radiation = ifelse(`Radiation recode` %in% c("None/Unknown","Refused (1988+)","Recommended, unknown if administered"),"No/Unknown","Yes"))

# Use dplyr to group by the two treatments (radiation therapy and chemotherapy) and evaluate the death rate in each group
BREAST_DF_RNC_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(Radiation,`Chemotherapy recode (yes, no/unk)`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n),  # Use `first()` to extract the total count in each group
            .groups = "drop")  # Return an ungrouped result
# Replace "No/Unknown" with "No" in the original columns
BREAST_DF_RNC_perc_surv$Radiation <- ifelse(BREAST_DF_RNC_perc_surv$Radiation == "No/Unknown", "No", BREAST_DF_RNC_perc_surv$Radiation)

BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)" <- ifelse(BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)" == "No/Unknown", "No", BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)")

# Create a new column "Radiation_Chemo" with values separated by "/"
BREAST_DF_RNC_perc_surv$Radiation_Chemo <- paste(BREAST_DF_RNC_perc_surv$Radiation, BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)", sep = "/")


# Optionally, remove the original "Radiation" and "Chemotherapy recode (yes, no/unk)" columns
BREAST_DF_RNC_perc_surv <- subset(BREAST_DF_RNC_perc_surv, select = -c(Radiation, `Chemotherapy recode (yes, no/unk)`))

BREAST_DF_RNC_perc_surv <- BREAST_DF_RNC_perc_surv[, c("Radiation_Chemo", setdiff(names(BREAST_DF_RNC_perc_surv), "Radiation_Chemo"))]


# Knowing the population, calculate each group's share of the total and the death rate within each group 
BREAST_DF_RNC_perc_surv$`Group % in total` <- round(BREAST_DF_RNC_perc_surv$Population/sum(BREAST_DF_RNC_perc_surv$Population)*100,2)

BREAST_DF_RNC_perc_surv$`Death %` <- round(BREAST_DF_RNC_perc_surv$`Event Population`/BREAST_DF_RNC_perc_surv$Population*100,2)



kable(BREAST_DF_RNC_perc_surv)

| Radiation_Chemo | Event Population | Population | Group % in total | Death % |
|:---|---:|---:|---:|---:|
| No/No | 15684 | 107012 | 35.25 | 14.66 |
| No/Yes | 9929 | 54966 | 18.11 | 18.06 |
| Yes/No | 3731 | 79926 | 26.33 | 4.67 |
| Yes/Yes | 9128 | 61653 | 20.31 | 14.81 |

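
The label-building steps for this table can be sketched compactly in base R. The vectors below are invented toy data, and `shorten` is a hypothetical helper; note that "No" still means "No/Unknown", so these groups mix untreated patients with patients whose treatment status is unknown:

```r
# Toy treatment flags standing in for the Radiation and chemotherapy columns
radiation <- c("Yes", "No/Unknown", "Yes", "No/Unknown", "Yes")
chemo     <- c("No/Unknown", "No/Unknown", "Yes", "Yes", "No/Unknown")
died      <- c(FALSE, TRUE, FALSE, TRUE, FALSE)  # TRUE = death from breast cancer

# Shorten "No/Unknown" to "No" and join the two flags as "Radiation/Chemo"
shorten <- function(x) ifelse(x == "No/Unknown", "No", x)
radiation_chemo <- paste(shorten(radiation), shorten(chemo), sep = "/")

# Death rate (%) within each combination
death_pct <- round(tapply(died, radiation_chemo, mean) * 100, 2)
print(death_pct)
```

These are descriptive rates only; they do not adjust for stage or age, which likely influence who receives which treatment.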
# Next, let's look into surgery and the survival rate, and whether surgery might have been critical or not. 
BREAST_DF_SUR_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Reason no cancer-directed surgery`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total 

# Knowing the population, calculate each group's share of the total and the death rate within each group 
BREAST_DF_SUR_perc_surv$`Group % in total` <- round(BREAST_DF_SUR_perc_surv$Population/sum(BREAST_DF_SUR_perc_surv$Population)*100,2)

BREAST_DF_SUR_perc_surv$`Death %` <- round(BREAST_DF_SUR_perc_surv$`Event Population`/BREAST_DF_SUR_perc_surv$Population*100,2)

kable(BREAST_DF_SUR_perc_surv)

| Reason no cancer-directed surgery | Event Population | Population | Group % in total | Death % |
|:---|---:|---:|---:|---:|
| Not performed, patient died prior to recommended surgery | 139 | 278 | 0.09 | 50.00 |
| Not recommended | 11636 | 23199 | 7.64 | 50.16 |
| Not recommended, contraindicated due to other cond; autopsy only (1973-2002) | 593 | 1356 | 0.45 | 43.73 |
| Recommended but not performed, patient refused | 1171 | 2608 | 0.86 | 44.90 |
| Recommended but not performed, unknown reason | 545 | 1604 | 0.53 | 33.98 |
| Recommended, unknown if performed | 613 | 2649 | 0.87 | 23.14 |
| Surgery performed | 22376 | 269730 | 88.86 | 8.30 |
| Unknown; death certificate; or autopsy only (2003+) | 1399 | 2133 | 0.70 | 65.59 |

# Next, let's look into marital status and the survival rate, and whether marital status might have been critical or not. 
BREAST_DF_MARI_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Marital status at diagnosis`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total 

# Knowing the population, calculate each group's share of the total and the death rate within each group 
BREAST_DF_MARI_perc_surv$`Group % in total` <- round(BREAST_DF_MARI_perc_surv$Population/sum(BREAST_DF_MARI_perc_surv$Population)*100,2)

BREAST_DF_MARI_perc_surv$`Death %` <- round(BREAST_DF_MARI_perc_surv$`Event Population`/BREAST_DF_MARI_perc_surv$Population*100,2)

kable(BREAST_DF_MARI_perc_surv)

| Marital status at diagnosis | Event Population | Population | Group % in total | Death % |
|:---|---:|---:|---:|---:|
| Divorced | 4399 | 32214 | 10.61 | 13.66 |
| Married (including common law) | 15694 | 160551 | 52.89 | 9.78 |
| Separated | 544 | 3225 | 1.06 | 16.87 |
| Single (never married) | 7161 | 44678 | 14.72 | 16.03 |
| Unknown | 2774 | 18481 | 6.09 | 15.01 |
| Unmarried or Domestic Partner | 110 | 1014 | 0.33 | 10.85 |
| Widowed | 7790 | 43394 | 14.30 | 17.95 |

# Next, let's look into median household income and the survival rate, and whether income might have been critical or not. 
BREAST_DF_HHI_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Median household income inflation adj to 2021`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total 

# Knowing the population, calculate each group's share of the total and the death rate within each group 
BREAST_DF_HHI_perc_surv$`Group % in total` <- round(BREAST_DF_HHI_perc_surv$Population/sum(BREAST_DF_HHI_perc_surv$Population)*100,2)

BREAST_DF_HHI_perc_surv$`Death %` <- round(BREAST_DF_HHI_perc_surv$`Event Population`/BREAST_DF_HHI_perc_surv$Population*100,2)

kable(BREAST_DF_HHI_perc_surv)

| Median household income inflation adj to 2021 | Event Population | Population | Group % in total | Death % |
|:---|---:|---:|---:|---:|
| $35,000 - $39,999 | 1000 | 6077 | 2.00 | 16.46 |
| $40,000 - $44,999 | 1630 | 10225 | 3.37 | 15.94 |
| $45,000 - $49,999 | 2289 | 14917 | 4.91 | 15.34 |
| $50,000 - $54,999 | 2310 | 16794 | 5.53 | 13.75 |
| $55,000 - $59,999 | 3371 | 24860 | 8.19 | 13.56 |
| $60,000 - $64,999 | 6010 | 43537 | 14.34 | 13.80 |
| $65,000 - $69,999 | 5848 | 44978 | 14.82 | 13.00 |
| $70,000 - $74,999 | 3927 | 31930 | 10.52 | 12.30 |
| $75,000+ | 11608 | 107459 | 35.40 | 10.80 |
| < $35,000 | 469 | 2716 | 0.89 | 17.27 |
| Unknown/missing/no match/Not 1990-2021 | 10 | 64 | 0.02 | 15.62 |

# Next, let's look into the type of cancer (primary site) and the survival rate, and whether the site might have been critical or not. 
BREAST_DF_PSL_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Primary Site - labeled`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total 

# Knowing the population, calculate each group's share of the total and the death rate within each group 
BREAST_DF_PSL_perc_surv$`Group % in total` <- round(BREAST_DF_PSL_perc_surv$Population/sum(BREAST_DF_PSL_perc_surv$Population)*100,2)

BREAST_DF_PSL_perc_surv$`Death %` <- round(BREAST_DF_PSL_perc_surv$`Event Population`/BREAST_DF_PSL_perc_surv$Population*100,2)

kable(BREAST_DF_PSL_perc_surv)

| Primary Site - labeled | Event Population | Population | Group % in total | Death % |
|:---|---:|---:|---:|---:|
| C50.0-Nipple | 173 | 1477 | 0.49 | 11.71 |
| C50.1-Central portion of breast | 2043 | 14012 | 4.62 | 14.58 |
| C50.2-Upper-inner quadrant of breast | 3058 | 36006 | 11.86 | 8.49 |
| C50.3-Lower-inner quadrant of breast | 1572 | 16365 | 5.39 | 9.61 |
| C50.4-Upper-outer quadrant of breast | 9710 | 98199 | 32.35 | 9.89 |
| C50.5-Lower-outer quadrant of breast | 2287 | 21939 | 7.23 | 10.42 |
| C50.6-Axillary tail of breast | 270 | 1685 | 0.56 | 16.02 |
| C50.8-Overlapping lesion of breast | 7514 | 68285 | 22.49 | 11.00 |
| C50.9-Breast, NOS | 11845 | 45589 | 15.02 | 25.98 |

# Names of the summary data frames to plot
DF_names <- c (
  "BREAST_DF_TNoT_perc_surv", 
  "BREAST_DF_RNC_perc_surv",
  "BREAST_DF_SUR_perc_surv",
  "BREAST_DF_MARI_perc_surv",
  "BREAST_DF_HHI_perc_surv",
  "BREAST_DF_PSL_perc_surv")

# Create an empty list to store plots
plot_list <- list()
chart_color <- c("plum", "darkgreen", "darkred", "darkblue", "darkorange", "darkmagenta",
                 "darkcyan", "purple", "lightblue", "darkgray", "lightpink", "blue",
                 "brown", "red")
chart_title <- c("# of Malignant Tumors", 
                 "Radiation/Chemo Status", 
                 "Cancer Surgery",
                 "Marital Status",
                 "Household Income",
                 "Primary Site Labeled")
set.seed(2014)
# Loop through each dataframe
for (i in seq_along(DF_names)) {
  # Access the dataframe
  df <- get(DF_names[i])
  
  # Pick a random color from the palette
  random_color <- sample(chart_color, 1)
  
  # Create the plot and store it in the plot list
  plot <- ggplot(df, aes(x = !!rlang::sym(names(df)[1]), y = !!rlang::sym("Death %"))) +
    geom_bar(stat = "identity", fill = random_color) +
    labs(title = chart_title[i],
         x = NULL, y = "Death %") +  # Remove x-axis label
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))  # Rotate x-axis labels

  plot_list[[i]] <- plot
}

# Arrange the plots in a 2 by 3 matrix
grid.arrange(grobs = plot_list, ncol = 3)

# Plot individually 

# Loop through each dataframe
for (i in seq_along(DF_names)) {
  # Access the dataframe
  df <- get(DF_names[i])
  
  # Pick a random color from the palette
  random_color <- sample(chart_color, 1)
  
  # Create the plot
  plot <- ggplot(df, aes(x = !!rlang::sym(names(df)[1]), y = !!rlang::sym("Death %"))) +
    geom_bar(stat = "identity", fill = random_color) +
    labs(title = chart_title[i],
         x = NULL, y = "Death %") +  # Remove x-axis label
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))  # Rotate x-axis labels

  # Print the plot
  print(plot)
}
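
The two loops above look data frames up by name with `get()` and pick colors at random. A possible variant, sketched here on two small toy tables (their values are copied from the marital status and income tables above; the function name `make_death_plot` is my own assumption), keeps the tables in a named list and assigns one fixed color per chart:

```r
library(ggplot2)

# Toy summary tables standing in for the BREAST_DF_*_perc_surv data frames
summaries <- list(
  "Marital Status" = data.frame(group = c("Married (including common law)", "Single (never married)"),
                                `Death %` = c(9.78, 16.03), check.names = FALSE),
  "Household Income" = data.frame(group = c("< $35,000", "$75,000+"),
                                  `Death %` = c(17.27, 10.80), check.names = FALSE)
)

# One fixed color per chart, so the figure is the same on every run
chart_color <- c("plum", "darkgreen")

# Hypothetical helper: bar chart of "Death %" against the first column of df
make_death_plot <- function(df, title, fill) {
  ggplot(df, aes(x = .data[[names(df)[1]]], y = .data[["Death %"]])) +
    geom_col(fill = fill) +
    labs(title = title, x = NULL, y = "Death %") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
}

# Build one plot per table; the list names double as chart titles
plot_list <- Map(make_death_plot, summaries, names(summaries), chart_color)
```

The resulting `plot_list` can be passed to `grid.arrange(grobs = plot_list, ncol = 3)` or printed one element at a time, so the plotting code is written once.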

Correlation investigation

In this section we use several R packages to run correlation and related analyses on the data. To do so, we first need to change the data slightly to make it suitable for packages such as survival, purrr, caret, and GGally.

The first step is to convert the categorical columns to factors. We then calculate chi-squared and Fisher's exact tests for the different variables. Because the population is large, we use bootstrap resampling and simulated p-values to gauge the importance of the different variables.

The strategy is to find the variables with the strongest effect. The code calculates p-values from the chi-squared or Fisher's exact test for independence between each categorical variable and the COD (cause of death) column. The lower the p-value, the stronger the evidence against the null hypothesis of independence, suggesting a significant association between the variable and COD. We then simplify the model by keeping the most relevant variables; we also need to check the homoscedasticity assumption and remove variables that may violate it.
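
As a minimal illustration of the two tests described above, the sketch below runs both on a small contingency table. The counts and the column name `treatment` are invented for illustration only:

```r
# Toy contingency table: rows are COD, columns are a categorical variable
tab <- matrix(c(80, 10, 10,
                60, 25, 15),
              nrow = 3,
              dimnames = list(COD = c("Alive", "Breast", "Other"),
                              treatment = c("A", "B")))

# Chi-squared test of independence
chisq_p <- chisq.test(tab)$p.value

# Fisher's exact test with a Monte Carlo (simulated) p-value
set.seed(1)
fisher_p <- fisher.test(tab, simulate.p.value = TRUE, B = 2000)$p.value

print(c(chi_squared = chisq_p, fisher = fisher_p))
```

A small p-value from either test only says that the variable and COD are unlikely to be independent; it does not measure how strong the association is.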

Before running the tests, we explore the data: some columns can be eliminated from this analysis, e.g., the year columns, the duplicated race columns, and so on. The code below lists the columns that are removed in the next steps of the analysis.

Fisher's exact test and chi-squared test

# List of columns to remove
uncritical_column <- c("Sex", "Year of diagnosis", 
                       "Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)", 
                       "Site recode ICD-O-3/WHO 2008", "Site recode ICD-O-3 2023 Revision", 
                       "COD to site recode",  # "Diagnostic Confirmation" is listed below; "Survival months flag" is kept
                       "Patient ID", "Year of follow-up recode", "Year of death recode", 
                       "SEER other cause of death classification", 
                       "RX Summ--Systemic/Sur Seq (2007+)",
                       "Origin recode NHIA (Hispanic, Non-Hisp)",
                       "Race and origin (recommended by SEER)",
                       "Diagnostic Confirmation",
                       "Sequence number", "Radiation recode")

# Create BREAST_DF_surv_clean by removing uncritical columns
BREAST_DF_surv_clean <- BREAST_DF_surv[, !names(BREAST_DF_surv) %in% uncritical_column]
BREAST_DF_eval_clean <- BREAST_DF_eval[, !names(BREAST_DF_eval) %in% uncritical_column]


# Identify character and numeric columns
char_cols <- sapply(BREAST_DF_surv_clean, is.character)
num_cols <- sapply(BREAST_DF_surv_clean, is.numeric)
char_cols_e <- sapply(BREAST_DF_eval_clean, is.character)

# Convert character columns to factors
BREAST_DF_surv_clean[char_cols] <- lapply(BREAST_DF_surv_clean[char_cols], as.factor)
BREAST_DF_eval_clean[char_cols_e] <- lapply(BREAST_DF_eval_clean[char_cols_e], as.factor)
#BREAST_DF_surv[num_cols] <- lapply(BREAST_DF_surv[num_cols], as.factor)

# Check the class of each column to ensure they are factors now
#sapply(BREAST_DF_surv, class)


# Check that every variable has more than one level 
one_level_vars <- sapply(BREAST_DF_surv_clean, function(x) length(unique(x)) == 1)
# Names of the variables with only one level
one_level_vars_names <- names(one_level_vars)[one_level_vars]
#print(names(one_level_vars)[one_level_vars])

# Remove variables with only one level from the data frame
BREAST_DF_surv_clean <- BREAST_DF_surv_clean[, !names(BREAST_DF_surv_clean) %in% one_level_vars_names]
BREAST_DF_eval_clean <- BREAST_DF_eval_clean[, !names(BREAST_DF_eval_clean) %in% one_level_vars_names]


skimr::skim(BREAST_DF_surv_clean)
Data summary

| | |
|:---|:---|
| Name | BREAST_DF_surv_clean |
| Number of rows | 303557 |
| Number of columns | 18 |
| Column type frequency: | |
| factor | 14 |
| numeric | 4 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:---|---:|---:|:---|---:|:---|
| Race recode (W, B, AI, API) | 0 | 1 | FALSE | 5 | Whi: 240584, Bla: 32165, Asi: 27061, Ame: 1933 |
| Primary Site - labeled | 0 | 1 | FALSE | 9 | C50: 98199, C50: 68285, C50: 45589, C50: 36006 |
| Grade Recode (thru 2017) | 0 | 1 | FALSE | 5 | Mod: 119566, Poo: 84251, Wel: 64536, Unk: 34855 |
| Laterality | 0 | 1 | FALSE | 5 | Lef: 152350, Rig: 147730, Pai: 3152, Onl: 190 |
| Chemotherapy recode (yes, no/unk) | 0 | 1 | FALSE | 2 | No/: 186938, Yes: 116619 |
| Reason no cancer-directed surgery | 0 | 1 | FALSE | 8 | Sur: 269730, Not: 23199, Rec: 2649, Rec: 2608 |
| Survival months flag | 0 | 1 | FALSE | 5 | Com: 295136, Inc: 6620, Not: 1290, Com: 376 |
| First malignant primary indicator | 0 | 1 | FALSE | 2 | Yes: 252683, No: 50874 |
| Marital status at diagnosis | 0 | 1 | FALSE | 7 | Mar: 160551, Sin: 44678, Wid: 43394, Div: 32214 |
| Median household income inflation adj to 2021 | 0 | 1 | FALSE | 11 | $75: 107459, $65: 44978, $60: 43537, $70: 31930 |
| Rural-Urban Continuum Code | 0 | 1 | FALSE | 7 | Cou: 185374, Cou: 65041, Cou: 21239, Non: 18125 |
| Age recode (<60,60-69,70+) | 0 | 1 | FALSE | 18 | 60-: 41318, 65-: 41060, 55-: 37068, 50-: 34424 |
| COD | 0 | 1 | FALSE | 3 | Ali: 228221, Bre: 38472, Oth: 36864 |
| Radiation | 0 | 1 | FALSE | 2 | No/: 161978, Yes: 141579 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| Months from diagnosis to treatment | 15843 | 0.95 | 1.13 | 1.14 | 0 | 0 | 1 | 2 | 24 | ▇▁▁▁▁ |
| Survival months | 1290 | 1.00 | 74.22 | 29.88 | 0 | 62 | 78 | 97 | 119 | ▂▂▆▇▆ |
| Total number of in situ/malignant tumors for patient | 3 | 1.00 | 1.36 | 0.65 | 1 | 1 | 1 | 2 | 20 | ▇▁▁▁▁ |
| Total number of benign/borderline tumors for patient | 0 | 1.00 | 0.01 | 0.09 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |

skimr::skim(BREAST_DF_eval_clean)
Data summary

| | |
|:---|:---|
| Name | BREAST_DF_eval_clean |
| Number of rows | 131395 |
| Number of columns | 17 |
| Column type frequency: | |
| factor | 13 |
| numeric | 4 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:---|---:|---:|:---|---:|:---|
| Race recode (W, B, AI, API) | 0 | 1 | FALSE | 5 | Whi: 100601, Bla: 14533, Asi: 13448, Unk: 1891 |
| Primary Site - labeled | 0 | 1 | FALSE | 9 | C50: 43321, C50: 30822, C50: 16539, C50: 16423 |
| Grade Recode (thru 2017) | 0 | 1 | FALSE | 1 | Unk: 131395 |
| Laterality | 0 | 1 | FALSE | 5 | Lef: 66096, Rig: 63885, Pai: 1317, Bil: 52 |
| Chemotherapy recode (yes, no/unk) | 0 | 1 | FALSE | 2 | No/: 83776, Yes: 47619 |
| Reason no cancer-directed surgery | 0 | 1 | FALSE | 8 | Sur: 114210, Not: 12567, Rec: 1144, Rec: 1111 |
| Survival months flag | 0 | 1 | FALSE | 5 | Com: 128932, Inc: 1037, Com: 633, Not: 537 |
| First malignant primary indicator | 0 | 1 | FALSE | 2 | Yes: 107910, No: 23485 |
| Marital status at diagnosis | 0 | 1 | FALSE | 7 | Mar: 70613, Sin: 20883, Wid: 16724, Div: 13667 |
| Median household income inflation adj to 2021 | 0 | 1 | FALSE | 11 | $75: 84913, $55: 8336, $65: 8298, $70: 8158 |
| Rural-Urban Continuum Code | 0 | 1 | FALSE | 7 | Cou: 80172, Cou: 28055, Cou: 9574, Non: 7900 |
| Age recode (<60,60-69,70+) | 0 | 1 | FALSE | 17 | 65-: 18702, 60-: 17760, 70-: 17096, 55-: 15189 |
| Radiation | 0 | 1 | FALSE | 2 | Yes: 65993, No/: 65402 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| Months from diagnosis to treatment | 6807 | 0.95 | 1.26 | 1.18 | 0 | 1 | 1 | 2 | 24 | ▇▁▁▁▁ |
| Survival months | 537 | 1.00 | 11.07 | 7.05 | 0 | 5 | 11 | 17 | 23 | ▇▆▆▇▆ |
| Total number of in situ/malignant tumors for patient | 11 | 1.00 | 1.31 | 0.62 | 1 | 1 | 1 | 1 | 50 | ▇▁▁▁▁ |
| Total number of benign/borderline tumors for patient | 0 | 1.00 | 0.01 | 0.09 | 0 | 0 | 0 | 0 | 2 | ▇▁▁▁▁ |

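
The eval summary shows `Grade Recode (thru 2017)` with a single level (Unk: 131395), while the one-level check above was run on the survival data only. If the two data frames should carry the same usable predictors, one option is to look for constant columns in both. The sketch below uses invented toy frames and hypothetical column names `grade` and `side`:

```r
# Toy frames standing in for BREAST_DF_surv_clean / BREAST_DF_eval_clean
surv_toy <- data.frame(grade = factor(c("Well", "Poor", "Unknown")),
                       side  = factor(c("Left", "Right", "Left")))
eval_toy <- data.frame(grade = factor(c("Unknown", "Unknown", "Unknown")),
                       side  = factor(c("Right", "Left", "Left")))

# TRUE for columns that hold a single distinct value
is_constant <- function(df) vapply(df, function(x) length(unique(x)) == 1, logical(1))

# Columns that are constant in either frame
constant_cols <- union(names(surv_toy)[is_constant(surv_toy)],
                       names(eval_toy)[is_constant(eval_toy)])
print(constant_cols)

# Drop them from both frames so the two keep the same predictors
surv_toy <- surv_toy[, !names(surv_toy) %in% constant_cols, drop = FALSE]
eval_toy <- eval_toy[, !names(eval_toy) %in% constant_cols, drop = FALSE]
```

Whether grade should really be dropped from the survival data is a modelling decision; the sketch only shows how the mismatch could be detected.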
# Function to calculate chi-squared test for independence
chi_squared_cal <- function(var, data) {
  tab <- table(data$COD, var)
  chisq_result <- chisq.test(tab)
  p_value <- chisq_result$p.value
  return(p_value)
}

# Function to calculate Fisher's exact test for independence
fisher_exact_cal <- function(var, data) {
  tab <- table(data$COD, var)
  # Perform Fisher's exact test
  fisher_result <- fisher.test(tab, simulate.p.value = TRUE)
  # Extract the p-value
  p_value <- fisher_result$p.value  
  return(p_value)
}


# Initialize an empty list to store p-values
p_values <- list()

# Number of bootstrap samples
n_bootstrap <- 50

# I perform bootstrapping with downsampling (5% of the rows per sample) to reduce the effect of the population size on the test; even so, the association still looks strong, with all p-values very close to 0 
# Loop over each column in the dataframe
for (col in names(BREAST_DF_surv_clean)) {
  # Check if the column is a factor
  if (is.factor(BREAST_DF_surv_clean[[col]])) {
    # Initialize an empty vector to store p-values from bootstrap samples
    bootstrap_p_values <- numeric(n_bootstrap)
    
    # Draw the bootstrap samples and calculate the Fisher's exact test p-value for each one
    for (i in seq_len(n_bootstrap)) {
      # Generate a bootstrap sample (5% of the rows, drawn with replacement)
      bootstrap_data <- 
        BREAST_DF_surv_clean[sample(nrow(BREAST_DF_surv_clean), 
                                    size = 0.05 * nrow(BREAST_DF_surv_clean), 
                                    replace = TRUE), ]
      
      # Calculate the p-value for the bootstrap sample (the chi-squared alternative is left commented out)
      #bootstrap_p_values[i] <- chi_squared_cal(bootstrap_data[[col]], bootstrap_data)
      bootstrap_p_values[i] <- fisher_exact_cal(bootstrap_data[[col]], bootstrap_data)
    }
    
    # Calculate the mean p-value from bootstrap samples
    mean_p_value <- mean(bootstrap_p_values)
    
    # Store the mean p-value for the column
    p_values[[col]] <- mean_p_value
  }
}

# Convert the list of p-values to a data frame
p_values_df <- data.frame(variable = names(p_values), p_value = unlist(p_values))

# Sort the results by p-values
sorted_results <- p_values_df[order(p_values_df$p_value, na.last = TRUE), ]

# Print the sorted p-values
kable(sorted_results)

| | variable | p_value |
|:---|:---|---:|
| Race recode (W, B, AI, API) | Race recode (W, B, AI, API) | 0.0004998 |
| Primary Site - labeled | Primary Site - labeled | 0.0004998 |
| Grade Recode (thru 2017) | Grade Recode (thru 2017) | 0.0004998 |
| Laterality | Laterality | 0.0004998 |
| Chemotherapy recode (yes, no/unk) | Chemotherapy recode (yes, no/unk) | 0.0004998 |
| Reason no cancer-directed surgery | Reason no cancer-directed surgery | 0.0004998 |
| Survival months flag | Survival months flag | 0.0004998 |
| First malignant primary indicator | First malignant primary indicator | 0.0004998 |
| Marital status at diagnosis | Marital status at diagnosis | 0.0004998 |
| Median household income inflation adj to 2021 | Median household income inflation adj to 2021 | 0.0004998 |
| Age recode (<60,60-69,70+) | Age recode (<60,60-69,70+) | 0.0004998 |
| COD | COD | 0.0004998 |
| Radiation | Radiation | 0.0004998 |
| Rural-Urban Continuum Code | Rural-Urban Continuum Code | 0.0015892 |
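
Thirteen of the fourteen values in this table are identical, which is what one would expect if they all sit at the smallest p-value the simulation can return. With `simulate.p.value = TRUE`, `fisher.test()` uses B = 2000 replicates by default and reports (count + 1) / (B + 1), so the floor works out as follows (a short worked calculation, not a re-run of the analysis):

```r
# Default number of Monte Carlo replicates in fisher.test(simulate.p.value = TRUE)
B <- 2000

# Smallest p-value the simulation can report: no simulated table is as extreme
# as the observed one, so p = (0 + 1) / (B + 1)
p_floor <- 1 / (B + 1)
print(signif(p_floor, 7))

# Averaging 50 bootstrap p-values that all sit at the floor returns the floor
mean_p <- mean(rep(p_floor, 50))
```

This suggests the p-values alone cannot rank these variables (note that COD is also tested against itself); a larger `B` or an effect-size measure such as Cramér's V would be needed to separate them.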

Correlation Analyses

In this section I use existing R packages to calculate the correlations between the different columns and COD. To do so, we first separate the numerical and categorical data, since they need to be treated differently when calculating their correlation with COD. We start with the Pearson correlation coefficients between COD and the numerical columns.
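
Since COD is a three-level categorical variable, a Pearson coefficient computed against it depends on how the levels are turned into numbers. One alternative, shown here only as a sketch on invented toy values (this is a swapped-in technique, not the method used in the analysis), is to correlate each numerical column with a 0/1 indicator for death from breast cancer, i.e. a point-biserial correlation:

```r
# Toy data standing in for BREAST_DF_surv_clean (values invented)
toy <- data.frame(
  COD = factor(c("Alive", "Breast", "Other", "Alive", "Breast", "Alive")),
  `Survival months` = c(90, 20, 40, 110, NA, 75),
  check.names = FALSE
)

# Binary indicator: 1 = death from breast cancer, 0 = alive or other cause
breast_death <- as.integer(toy$COD == "Breast")

# Pearson correlation with a 0/1 variable (point-biserial correlation),
# using only the rows where both values are present
r_pb <- cor(toy$`Survival months`, breast_death, use = "complete.obs")
print(r_pb)
```

A negative coefficient here would mean that shorter survival times go together with death from breast cancer, which is easier to interpret than a coefficient against arbitrary level codes.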

# Select numerical columns in your dataset
numeric_cols <- sapply(BREAST_DF_surv_clean, is.numeric)

# Separate numerical and categorical columns
numeric_data <- BREAST_DF_surv_clean[, numeric_cols]
categorical_data <- BREAST_DF_surv_clean[, !numeric_cols]

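# Calculate Pearson correlation coefficients between the numerical columns and "COD"
# (rcorr() appends y as an extra column labelled "y"; COD is a factor, so it enters through its integer level codes)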
correlation_with_COD_numeric <- rcorr(as.matrix(numeric_data), y = BREAST_DF_surv_clean$COD, type = "pearson")

# Print correlation coefficients for numerical columns
#kable(print(correlation_with_COD_numeric$r))

library(kableExtra)

# Print correlation coefficients for numerical columns
correlation_table <- correlation_with_COD_numeric$r
rownames(correlation_table) <- colnames(correlation_table)

# Display as a table
kable(correlation_table, caption = "Correlation Coefficients with COD")
Correlation Coefficients with COD

| | Months from diagnosis to treatment | Survival months | Total number of in situ/malignant tumors for patient | Total number of benign/borderline tumors for patient | y |
|:---|---:|---:|---:|---:|---:|
| Months from diagnosis to treatment | 1.0000000 | -0.0139649 | 0.0186951 | 0.0005761 | -0.0037166 |
| Survival months | -0.0139649 | 1.0000000 | -0.0347760 | 0.0051819 | -0.5516706 |
| Total number of in situ/malignant tumors for patient | 0.0186951 | -0.0347760 | 1.0000000 | 0.0181349 | 0.1470846 |
| Total number of benign/borderline tumors for patient | 0.0005761 | 0.0051819 | 0.0181349 | 1.0000000 | 0.0096745 |
| y | -0.0037166 | -0.5516706 | 0.1470846 | 0.0096745 | 1.0000000 |

library(reshape2)  # For melt function

# Melt correlation matrix
correlation_melted <- melt(correlation_table)

# Plot heatmap
ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limits = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                    size = 8, hjust = 1)) +
  coord_fixed()

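# Cross-tabulate each categorical column against "COD" and pass the table to assoc().
# Caution: as the table printed below shows, this call returns the cell counts of each
# cross-tabulation, not a single Cramér's V value per variable.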
cramer_v <- apply(categorical_data, 2, function(x) {
  table_data <- table(x, BREAST_DF_surv_clean$COD)
  assoc(table_data, method = "cramers")
})

# Print Cramér's V for association with categorical columns
#print(cramer_v)

# Insert a line break or comment to separate the code blocks
cat("\n")
# Initialize an empty data frame
cramer_v_df <- data.frame(Variable = character(), Value = numeric(), row.names = NULL)

# Iterate over each variable and its associated Cramér's V value
for (var_name in names(cramer_v)) {
  # Extract Cramér's V value for the current variable
  cramer_v_value <- cramer_v[[var_name]]
  
  # Append a row to the data frame with the variable name and its Cramér's V value
  cramer_v_df <- rbind(cramer_v_df, data.frame(Variable = var_name, Value = cramer_v_value))
}

# Print as a table
kable(cramer_v_df, caption = "Cramer's V for Association with COD")
Cramer’s V for Association with COD
Variable   Level (Value.x)   COD outcome (Value.A)   Count (Value.Freq)
Race recode (W, B, AI, API) American Indian/Alaska Native Alive 1416
Race recode (W, B, AI, API) Asian or Pacific Islander Alive 22312
Race recode (W, B, AI, API) Black Alive 21523
Race recode (W, B, AI, API) Unknown Alive 1681
Race recode (W, B, AI, API) White Alive 181289
Race recode (W, B, AI, API) American Indian/Alaska Native Breast 252
Race recode (W, B, AI, API) Asian or Pacific Islander Breast 2749
Race recode (W, B, AI, API) Black Breast 6369
Race recode (W, B, AI, API) Unknown Breast 70
Race recode (W, B, AI, API) White Breast 29032
Race recode (W, B, AI, API) American Indian/Alaska Native Other 265
Race recode (W, B, AI, API) Asian or Pacific Islander Other 2000
Race recode (W, B, AI, API) Black Other 4273
Race recode (W, B, AI, API) Unknown Other 63
Race recode (W, B, AI, API) White Other 30263
Primary Site - labeled C50.0-Nipple Alive 1033
Primary Site - labeled C50.1-Central portion of breast Alive 9851
Primary Site - labeled C50.2-Upper-inner quadrant of breast Alive 28945
Primary Site - labeled C50.3-Lower-inner quadrant of breast Alive 12769
Primary Site - labeled C50.4-Upper-outer quadrant of breast Alive 77369
Primary Site - labeled C50.5-Lower-outer quadrant of breast Alive 17218
Primary Site - labeled C50.6-Axillary tail of breast Alive 1205
Primary Site - labeled C50.8-Overlapping lesion of breast Alive 52392
Primary Site - labeled C50.9-Breast, NOS Alive 27439
Primary Site - labeled C50.0-Nipple Breast 173
Primary Site - labeled C50.1-Central portion of breast Breast 2043
Primary Site - labeled C50.2-Upper-inner quadrant of breast Breast 3058
Primary Site - labeled C50.3-Lower-inner quadrant of breast Breast 1572
Primary Site - labeled C50.4-Upper-outer quadrant of breast Breast 9710
Primary Site - labeled C50.5-Lower-outer quadrant of breast Breast 2287
Primary Site - labeled C50.6-Axillary tail of breast Breast 270
Primary Site - labeled C50.8-Overlapping lesion of breast Breast 7514
Primary Site - labeled C50.9-Breast, NOS Breast 11845
Primary Site - labeled C50.0-Nipple Other 271
Primary Site - labeled C50.1-Central portion of breast Other 2118
Primary Site - labeled C50.2-Upper-inner quadrant of breast Other 4003
Primary Site - labeled C50.3-Lower-inner quadrant of breast Other 2024
Primary Site - labeled C50.4-Upper-outer quadrant of breast Other 11120
Primary Site - labeled C50.5-Lower-outer quadrant of breast Other 2434
Primary Site - labeled C50.6-Axillary tail of breast Other 210
Primary Site - labeled C50.8-Overlapping lesion of breast Other 8379
Primary Site - labeled C50.9-Breast, NOS Other 6305
Grade Recode (thru 2017) Moderately differentiated; Grade II Alive 93775
Grade Recode (thru 2017) Poorly differentiated; Grade III Alive 59437
Grade Recode (thru 2017) Undifferentiated; anaplastic; Grade IV Alive 202
Grade Recode (thru 2017) Unknown Alive 20725
Grade Recode (thru 2017) Well differentiated; Grade I Alive 54082
Grade Recode (thru 2017) Moderately differentiated; Grade II Breast 11130
Grade Recode (thru 2017) Poorly differentiated; Grade III Breast 15938
Grade Recode (thru 2017) Undifferentiated; anaplastic; Grade IV Breast 98
Grade Recode (thru 2017) Unknown Breast 8913
Grade Recode (thru 2017) Well differentiated; Grade I Breast 2393
Grade Recode (thru 2017) Moderately differentiated; Grade II Other 14661
Grade Recode (thru 2017) Poorly differentiated; Grade III Other 8876
Grade Recode (thru 2017) Undifferentiated; anaplastic; Grade IV Other 49
Grade Recode (thru 2017) Unknown Other 5217
Grade Recode (thru 2017) Well differentiated; Grade I Other 8061
Laterality Bilateral, single primary Alive 21
Laterality Left - origin of primary Alive 115104
Laterality Only one side - side unspecified Alive 59
Laterality Paired site, but no information concerning laterality Alive 438
Laterality Right - origin of primary Alive 112599
Laterality Bilateral, single primary Breast 89
Laterality Left - origin of primary Breast 18661
Laterality Only one side - side unspecified Breast 87
Laterality Paired site, but no information concerning laterality Breast 2080
Laterality Right - origin of primary Breast 17555
Laterality Bilateral, single primary Other 25
Laterality Left - origin of primary Other 18585
Laterality Only one side - side unspecified Other 44
Laterality Paired site, but no information concerning laterality Other 634
Laterality Right - origin of primary Other 17576
Chemotherapy recode (yes, no/unk) No/Unknown Alive 137991
Chemotherapy recode (yes, no/unk) Yes Alive 90230
Chemotherapy recode (yes, no/unk) No/Unknown Breast 19415
Chemotherapy recode (yes, no/unk) Yes Breast 19057
Chemotherapy recode (yes, no/unk) No/Unknown Other 29532
Chemotherapy recode (yes, no/unk) Yes Other 7332
Reason no cancer-directed surgery Not performed, patient died prior to recommended surgery Alive 0
Reason no cancer-directed surgery Not recommended Alive 6917
Reason no cancer-directed surgery Not recommended, contraindicated due to other cond; autopsy only (1973-2002) Alive 118
Reason no cancer-directed surgery Recommended but not performed, patient refused Alive 686
Reason no cancer-directed surgery Recommended but not performed, unknown reason Alive 729
Reason no cancer-directed surgery Recommended, unknown if performed Alive 1741
Reason no cancer-directed surgery Surgery performed Alive 217725
Reason no cancer-directed surgery Unknown; death certificate; or autopsy only (2003+) Alive 305
Reason no cancer-directed surgery Not performed, patient died prior to recommended surgery Breast 139
Reason no cancer-directed surgery Not recommended Breast 11636
Reason no cancer-directed surgery Not recommended, contraindicated due to other cond; autopsy only (1973-2002) Breast 593
Reason no cancer-directed surgery Recommended but not performed, patient refused Breast 1171
Reason no cancer-directed surgery Recommended but not performed, unknown reason Breast 545
Reason no cancer-directed surgery Recommended, unknown if performed Breast 613
Reason no cancer-directed surgery Surgery performed Breast 22376
Reason no cancer-directed surgery Unknown; death certificate; or autopsy only (2003+) Breast 1399
Reason no cancer-directed surgery Not performed, patient died prior to recommended surgery Other 139
Reason no cancer-directed surgery Not recommended Other 4646
Reason no cancer-directed surgery Not recommended, contraindicated due to other cond; autopsy only (1973-2002) Other 645
Reason no cancer-directed surgery Recommended but not performed, patient refused Other 751
Reason no cancer-directed surgery Recommended but not performed, unknown reason Other 330
Reason no cancer-directed surgery Recommended, unknown if performed Other 295
Reason no cancer-directed surgery Surgery performed Other 29629
Reason no cancer-directed surgery Unknown; death certificate; or autopsy only (2003+) Other 429
Survival months flag Complete dates are available and there are 0 days of survival Alive 248
Survival months flag Complete dates are available and there are more than 0 days of survival Alive 223378
Survival months flag Incomplete dates are available and there cannot be zero days of follow-up Alive 4551
Survival months flag Incomplete dates are available and there could be zero days of follow-up Alive 44
Survival months flag Not calculated because a Death Certificate Only or Autopsy Only case Alive 0
Survival months flag Complete dates are available and there are 0 days of survival Breast 83
Survival months flag Complete dates are available and there are more than 0 days of survival Breast 36117
Survival months flag Incomplete dates are available and there cannot be zero days of follow-up Breast 1183
Survival months flag Incomplete dates are available and there could be zero days of follow-up Breast 59
Survival months flag Not calculated because a Death Certificate Only or Autopsy Only case Breast 1030
Survival months flag Complete dates are available and there are 0 days of survival Other 45
Survival months flag Complete dates are available and there are more than 0 days of survival Other 35641
Survival months flag Incomplete dates are available and there cannot be zero days of follow-up Other 886
Survival months flag Incomplete dates are available and there could be zero days of follow-up Other 32
Survival months flag Not calculated because a Death Certificate Only or Autopsy Only case Other 260
First malignant primary indicator No Alive 32987
First malignant primary indicator Yes Alive 195234
First malignant primary indicator No Breast 7480
First malignant primary indicator Yes Breast 30992
First malignant primary indicator No Other 10407
First malignant primary indicator Yes Other 26457
Marital status at diagnosis Divorced Alive 23903
Marital status at diagnosis Married (including common law) Alive 132121
Marital status at diagnosis Separated Alive 2401
Marital status at diagnosis Single (never married) Alive 32829
Marital status at diagnosis Unknown Alive 12919
Marital status at diagnosis Unmarried or Domestic Partner Alive 844
Marital status at diagnosis Widowed Alive 23204
Marital status at diagnosis Divorced Breast 4399
Marital status at diagnosis Married (including common law) Breast 15694
Marital status at diagnosis Separated Breast 544
Marital status at diagnosis Single (never married) Breast 7161
Marital status at diagnosis Unknown Breast 2774
Marital status at diagnosis Unmarried or Domestic Partner Breast 110
Marital status at diagnosis Widowed Breast 7790
Marital status at diagnosis Divorced Other 3912
Marital status at diagnosis Married (including common law) Other 12736
Marital status at diagnosis Separated Other 280
Marital status at diagnosis Single (never married) Other 4688
Marital status at diagnosis Unknown Other 2788
Marital status at diagnosis Unmarried or Domestic Partner Other 60
Marital status at diagnosis Widowed Other 12400
Median household income inflation adj to 2021 $35,000 - $39,999 Alive 4108
Median household income inflation adj to 2021 $40,000 - $44,999 Alive 6976
Median household income inflation adj to 2021 $45,000 - $49,999 Alive 10351
Median household income inflation adj to 2021 $50,000 - $54,999 Alive 11978
Median household income inflation adj to 2021 $55,000 - $59,999 Alive 18238
Median household income inflation adj to 2021 $60,000 - $64,999 Alive 32172
Median household income inflation adj to 2021 $65,000 - $69,999 Alive 34163
Median household income inflation adj to 2021 $70,000 - $74,999 Alive 23995
Median household income inflation adj to 2021 $75,000+ Alive 84391
Median household income inflation adj to 2021 < $35,000 Alive 1799
Median household income inflation adj to 2021 Unknown/missing/no match/Not 1990-2021 Alive 50
Median household income inflation adj to 2021 $35,000 - $39,999 Breast 1000
Median household income inflation adj to 2021 $40,000 - $44,999 Breast 1630
Median household income inflation adj to 2021 $45,000 - $49,999 Breast 2289
Median household income inflation adj to 2021 $50,000 - $54,999 Breast 2310
Median household income inflation adj to 2021 $55,000 - $59,999 Breast 3371
Median household income inflation adj to 2021 $60,000 - $64,999 Breast 6010
Median household income inflation adj to 2021 $65,000 - $69,999 Breast 5848
Median household income inflation adj to 2021 $70,000 - $74,999 Breast 3927
Median household income inflation adj to 2021 $75,000+ Breast 11608
Median household income inflation adj to 2021 < $35,000 Breast 469
Median household income inflation adj to 2021 Unknown/missing/no match/Not 1990-2021 Breast 10
Median household income inflation adj to 2021 $35,000 - $39,999 Other 969
Median household income inflation adj to 2021 $40,000 - $44,999 Other 1619
Median household income inflation adj to 2021 $45,000 - $49,999 Other 2277
Median household income inflation adj to 2021 $50,000 - $54,999 Other 2506
Median household income inflation adj to 2021 $55,000 - $59,999 Other 3251
Median household income inflation adj to 2021 $60,000 - $64,999 Other 5355
Median household income inflation adj to 2021 $65,000 - $69,999 Other 4967
Median household income inflation adj to 2021 $70,000 - $74,999 Other 4008
Median household income inflation adj to 2021 $75,000+ Other 11460
Median household income inflation adj to 2021 < $35,000 Other 448
Median household income inflation adj to 2021 Unknown/missing/no match/Not 1990-2021 Other 4
Rural-Urban Continuum Code Counties in metropolitan areas ge 1 million pop Alive 141535
Rural-Urban Continuum Code Counties in metropolitan areas of 250,000 to 1 million pop Alive 48846
Rural-Urban Continuum Code Counties in metropolitan areas of lt 250 thousand pop Alive 15452
Rural-Urban Continuum Code Nonmetropolitan counties adjacent to a metropolitan area Alive 12781
Rural-Urban Continuum Code Nonmetropolitan counties not adjacent to a metropolitan area Alive 9289
Rural-Urban Continuum Code Unknown/missing/no match (Alaska or Hawaii - Entire State) Alive 268
Rural-Urban Continuum Code Unknown/missing/no match/Not 1990-2021 Alive 50
Rural-Urban Continuum Code Counties in metropolitan areas ge 1 million pop Breast 23147
Rural-Urban Continuum Code Counties in metropolitan areas of 250,000 to 1 million pop Breast 7884
Rural-Urban Continuum Code Counties in metropolitan areas of lt 250 thousand pop Breast 2843
Rural-Urban Continuum Code Nonmetropolitan counties adjacent to a metropolitan area Breast 2578
Rural-Urban Continuum Code Nonmetropolitan counties not adjacent to a metropolitan area Breast 1970
Rural-Urban Continuum Code Unknown/missing/no match (Alaska or Hawaii - Entire State) Breast 40
Rural-Urban Continuum Code Unknown/missing/no match/Not 1990-2021 Breast 10
Rural-Urban Continuum Code Counties in metropolitan areas ge 1 million pop Other 20692
Rural-Urban Continuum Code Counties in metropolitan areas of 250,000 to 1 million pop Other 8311
Rural-Urban Continuum Code Counties in metropolitan areas of lt 250 thousand pop Other 2944
Rural-Urban Continuum Code Nonmetropolitan counties adjacent to a metropolitan area Other 2766
Rural-Urban Continuum Code Nonmetropolitan counties not adjacent to a metropolitan area Other 2090
Rural-Urban Continuum Code Unknown/missing/no match (Alaska or Hawaii - Entire State) Other 57
Rural-Urban Continuum Code Unknown/missing/no match/Not 1990-2021 Other 4
Age recode (<60,60-69,70+) 01-04 years Alive 1
Age recode (<60,60-69,70+) 05-09 years Alive 2
Age recode (<60,60-69,70+) 10-14 years Alive 2
Age recode (<60,60-69,70+) 15-19 years Alive 14
Age recode (<60,60-69,70+) 20-24 years Alive 178
Age recode (<60,60-69,70+) 25-29 years Alive 1097
Age recode (<60,60-69,70+) 30-34 years Alive 3307
Age recode (<60,60-69,70+) 35-39 years Alive 7040
Age recode (<60,60-69,70+) 40-44 years Alive 15293
Age recode (<60,60-69,70+) 45-49 years Alive 24158
Age recode (<60,60-69,70+) 50-54 years Alive 29263
Age recode (<60,60-69,70+) 55-59 years Alive 30741
Age recode (<60,60-69,70+) 60-64 years Alive 33793
Age recode (<60,60-69,70+) 65-69 years Alive 32764
Age recode (<60,60-69,70+) 70-74 years Alive 23598
Age recode (<60,60-69,70+) 75-79 years Alive 15007
Age recode (<60,60-69,70+) 80-84 years Alive 8013
Age recode (<60,60-69,70+) 85+ years Alive 3950
Age recode (<60,60-69,70+) 01-04 years Breast 0
Age recode (<60,60-69,70+) 05-09 years Breast 0
Age recode (<60,60-69,70+) 10-14 years Breast 0
Age recode (<60,60-69,70+) 15-19 years Breast 1
Age recode (<60,60-69,70+) 20-24 years Breast 59
Age recode (<60,60-69,70+) 25-29 years Breast 265
Age recode (<60,60-69,70+) 30-34 years Breast 686
Age recode (<60,60-69,70+) 35-39 years Breast 1327
Age recode (<60,60-69,70+) 40-44 years Breast 2019
Age recode (<60,60-69,70+) 45-49 years Breast 2765
Age recode (<60,60-69,70+) 50-54 years Breast 3909
Age recode (<60,60-69,70+) 55-59 years Breast 4360
Age recode (<60,60-69,70+) 60-64 years Breast 4531
Age recode (<60,60-69,70+) 65-69 years Breast 4136
Age recode (<60,60-69,70+) 70-74 years Breast 3663
Age recode (<60,60-69,70+) 75-79 years Breast 3196
Age recode (<60,60-69,70+) 80-84 years Breast 3003
Age recode (<60,60-69,70+) 85+ years Breast 4552
Age recode (<60,60-69,70+) 01-04 years Other 0
Age recode (<60,60-69,70+) 05-09 years Other 0
Age recode (<60,60-69,70+) 10-14 years Other 0
Age recode (<60,60-69,70+) 15-19 years Other 2
Age recode (<60,60-69,70+) 20-24 years Other 11
Age recode (<60,60-69,70+) 25-29 years Other 43
Age recode (<60,60-69,70+) 30-34 years Other 100
Age recode (<60,60-69,70+) 35-39 years Other 182
Age recode (<60,60-69,70+) 40-44 years Other 423
Age recode (<60,60-69,70+) 45-49 years Other 713
Age recode (<60,60-69,70+) 50-54 years Other 1252
Age recode (<60,60-69,70+) 55-59 years Other 1967
Age recode (<60,60-69,70+) 60-64 years Other 2994
Age recode (<60,60-69,70+) 65-69 years Other 4160
Age recode (<60,60-69,70+) 70-74 years Other 4927
Age recode (<60,60-69,70+) 75-79 years Other 5630
Age recode (<60,60-69,70+) 80-84 years Other 6182
Age recode (<60,60-69,70+) 85+ years Other 8278
COD Alive Alive 228221
COD Breast Alive 0
COD Other Alive 0
COD Alive Breast 0
COD Breast Breast 38472
COD Other Breast 0
COD Alive Other 0
COD Breast Other 0
COD Other Other 36864
Radiation No/Unknown Alive 111019
Radiation Yes Alive 117202
Radiation No/Unknown Breast 25613
Radiation Yes Breast 12859
Radiation No/Unknown Other 25346
Radiation Yes Other 11518
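The rows above are cell counts rather than association statistics. As a hedged sketch of how Cramér's V could be obtained from such counts, the block below applies the standard formula V = sqrt(chi^2 / (n * (min(rows, cols) - 1))) using base R's chisq.test(). The helper name cramers_v is my own (hypothetical), and the example matrix is typed in from the Radiation rows of the table above.

```r
# Cramér's V from a two-way table of counts (hypothetical helper, base R only):
# V = sqrt(chi^2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(tab) {
  tab <- as.matrix(tab)
  # Drop empty rows/columns, which would give zero expected counts
  tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
  k <- min(dim(tab)) - 1
  if (k < 1) return(NA_real_)  # undefined when one variable has a single level
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * k)))
}

# Radiation by COD counts, copied from the table above
radiation_by_cod <- matrix(
  c(111019, 25613, 25346,
    117202, 12859, 11518),
  nrow = 2, byrow = TRUE,
  dimnames = list(Radiation = c("No/Unknown", "Yes"),
                  COD = c("Alive", "Breast", "Other"))
)
print(cramers_v(radiation_by_cod))
```

Under the same assumption, one value per variable could be collected with sapply(categorical_data, function(x) cramers_v(table(x, BREAST_DF_surv_clean$COD))), which returns a named numeric vector that fits the Variable/Value data frame built earlier.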
# Melt Cramér's V results
cramer_v_melted <- melt(cramer_v_df, id.vars = "Variable", variable.name = "Var1", value.name = "value")
## Warning: attributes are not identical across measure variables; they will be
## dropped
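# Plot as a bar graph.
# Note: after melt(), `value` mixes level names, COD labels, and counts (see the
# warning above), so the bars show those entries rather than Cramér's V statistics.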
ggplot(cramer_v_melted, aes(x = Variable, y = value, fill = Var1)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_text(angle = 45, hjust = 1, vjust = 1)) + # Rotate y-axis labels by 45 degrees
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10)) + # Wrap labels with a width of 10 characters
  labs(x = "Variable", y = "Cramer's V", fill = "Variable") +
  ggtitle("Cramer's V for Association with COD")

# Because there are many factor (categorical) variables, they need to be encoded.
# The following code handles the encoding.
# Step 1: Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean) == "COD")

# Step 2: Exclude "COD" column from model matrix
encoded_data <- model.matrix(~ . - 1, data = BREAST_DF_surv_clean[, -cod_column_index])

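# Step 3: Append the target variable to the encoded variables.
# Caution: model.matrix() drops rows that contain NA, so encoded_data can have fewer
# rows than COD; the cbind() warning below signals that COD is truncated to fit
# instead of being matched row by row.
encoded_data <- cbind(encoded_data, COD = BREAST_DF_surv_clean$COD)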
## Warning in base::cbind(...): number of rows of result is not a multiple of
## vector length (arg 2)
# Step 4: Calculate the correlation matrix.
# Constant columns (zero standard deviation) yield NA entries, as the warning below reports.
correlation_matrix <- cor(encoded_data)
## Warning in cor(encoded_data): the standard deviation is zero
# Step 5: Display summary statistics of the correlation matrix
summary_table <- summary(correlation_matrix)
summary_table_kable <- kable(summary_table)

# Step 6: Plot correlation matrix as a heatmap
library(corrplot)
corrplot(correlation_matrix, method = "color", tl.cex = 0.15, title = "Correlation Matrix")

# Display the summary table
summary_table_kable
Race recode (W, B, AI, API)American Indian/Alaska Native | Race recode (W, B, AI, API)Asian or Pacific Islander | Race recode (W, B, AI, API)Black | Race recode (W, B, AI, API)Unknown | Race recode (W, B, AI, API)White | Primary Site - labeledC50.1-Central portion of breast | Primary Site - labeledC50.2-Upper-inner quadrant of breast | Primary Site - labeledC50.3-Lower-inner quadrant of breast | Primary Site - labeledC50.4-Upper-outer quadrant of breast | Primary Site - labeledC50.5-Lower-outer quadrant of breast | Primary Site - labeledC50.6-Axillary tail of breast | Primary Site - labeledC50.8-Overlapping lesion of breast | Primary Site - labeledC50.9-Breast, NOS | Grade Recode (thru 2017)Poorly differentiated; Grade III | Grade Recode (thru 2017)Undifferentiated; anaplastic; Grade IV | Grade Recode (thru 2017)Unknown | Grade Recode (thru 2017)Well differentiated; Grade I | LateralityLeft - origin of primary | LateralityOnly one side - side unspecified | LateralityPaired site, but no information concerning laterality | LateralityRight - origin of primary | Chemotherapy recode (yes, no/unk)Yes | Months from diagnosis to treatment | Reason no cancer-directed surgeryNot recommended | Reason no cancer-directed surgeryNot recommended, contraindicated due to other cond; autopsy only (1973-2002) | Reason no cancer-directed surgeryRecommended but not performed, patient refused | Reason no cancer-directed surgeryRecommended but not performed, unknown reason | Reason no cancer-directed surgeryRecommended, unknown if performed | Reason no cancer-directed surgerySurgery performed | Reason no cancer-directed surgeryUnknown; death certificate; or autopsy only (2003+) | Survival months flagComplete dates are available and there are more than 0 days of survival | Survival months flagIncomplete dates are available and there cannot be zero days of follow-up | Survival months flagIncomplete dates are available and there could be zero days of follow-up | Survival months flagNot calculated because a Death Certificate Only or Autopsy Only case | Survival months | First malignant primary indicatorYes | Total number of in situ/malignant tumors for patient | Total number of benign/borderline tumors for patient | Marital status at diagnosisMarried (including common law) | Marital status at diagnosisSeparated | Marital status at diagnosisSingle (never married) | Marital status at diagnosisUnknown | Marital status at diagnosisUnmarried or Domestic Partner | Marital status at diagnosisWidowed | Median household income inflation adj to 2021$40,000 - $44,999 | Median household income inflation adj to 2021$45,000 - $49,999 | Median household income inflation adj to 2021$50,000 - $54,999 | Median household income inflation adj to 2021$55,000 - $59,999 | Median household income inflation adj to 2021$60,000 - $64,999 | Median household income inflation adj to 2021$65,000 - $69,999 | Median household income inflation adj to 2021$70,000 - $74,999 | Median household income inflation adj to 2021$75,000+ | Median household income inflation adj to 2021< $35,000 | Median household income inflation adj to 2021Unknown/missing/no match/Not 1990-2021 | Rural-Urban Continuum CodeCounties in metropolitan areas of 250,000 to 1 million pop | Rural-Urban Continuum CodeCounties in metropolitan areas of lt 250 thousand pop | Rural-Urban Continuum CodeNonmetropolitan counties adjacent to a metropolitan area | Rural-Urban Continuum CodeNonmetropolitan counties not adjacent to a metropolitan area | Rural-Urban Continuum CodeUnknown/missing/no match (Alaska or Hawaii - Entire State) | Rural-Urban Continuum CodeUnknown/missing/no match/Not 1990-2021 | Age recode (<60,60-69,70+)05-09 years | Age recode (<60,60-69,70+)10-14 years | Age recode (<60,60-69,70+)15-19 years | Age recode (<60,60-69,70+)20-24 years | Age recode (<60,60-69,70+)25-29 years | Age recode (<60,60-69,70+)30-34 years | Age recode (<60,60-69,70+)35-39 years | Age recode (<60,60-69,70+)40-44 years | Age recode (<60,60-69,70+)45-49 years | Age recode (<60,60-69,70+)50-54 years | Age recode (<60,60-69,70+)55-59 years | Age recode (<60,60-69,70+)60-64 years | Age recode (<60,60-69,70+)65-69 years | Age recode (<60,60-69,70+)70-74 years | Age recode (<60,60-69,70+)75-79 years | Age recode (<60,60-69,70+)80-84 years | Age recode (<60,60-69,70+)85+ years | RadiationYes | COD
Min. :-0.1576838 Min. :-0.6165221 Min. :-0.673943 Min. :-0.1334814 Min. :-0.6739426 Min. :-0.155876 Min. :-0.2615058 Min. :-0.169973 Min. :-0.382989 Min. :-0.199024 Min. :-0.0520480 Min. :-0.382989 Min. :-0.275063 Min. :-0.3318287 Min. :-0.0208938 Min. :-0.204073 Min. :-0.331829 Min. :-0.9935062 Min. :-0.0366469 Min. :-0.1575071 Min. :-0.9935062 Min. :-0.284581 Min. :-0.071647 Min. :-0.8574575 Min. :-0.2202216 Min. :-0.248924 Min. :-0.1439998 Min. :-0.306675 Min. :-0.857457 Min. :-0.0598241 Min. :-0.9942023 Min. :-0.994202 Min. :-0.0557427 Min. :1 Min. :-0.277625 Min. :-0.691429 Min. :-0.6914293 Min. :-0.0159064 Min. :-0.449883 Min. :-0.1121800 Min. :-0.4498831 Min. :-0.260209 Min. :-0.0640663 Min. :-0.433220 Min. :-0.1387420 Min. :-0.1694547 Min. :-0.1808955 Min. :-0.2237456 Min. :-0.302021 Min. :-0.307945 Min. :-0.2570005 Min. :-0.307945 Min. :-0.070190 Min. :-0.0098790 Min. :-0.144158 Min. :-0.1728759 Min. :-0.154232 Min. :-0.154618 Min. :-0.0693888 Min. :-0.0098790 Min. :-0.0028688 Min. :-0.0051977 Min. :-0.0071789 Min. :-0.0193592 Min. :-0.027421 Min. :-0.047111 Min. :-0.068738 Min. :-0.100748 Min. :-0.128046 Min. :-0.1445947 Min. :-0.1504952 Min. :-0.1593788 Min. :-0.159379 Min. :-0.138078 Min. :-0.133971 Min. :-0.150245 Min. :-0.1873481 Min. :-0.132353 Min. :-0.0121551
1st Qu.:-0.0020006 1st Qu.:-0.0202167 1st Qu.:-0.012099 1st Qu.:-0.0058338 1st Qu.:-0.0189862 1st Qu.:-0.010531 1st Qu.:-0.0055591 1st Qu.:-0.005036 1st Qu.:-0.007592 1st Qu.:-0.004752 1st Qu.:-0.0033956 1st Qu.:-0.005180 1st Qu.:-0.012712 1st Qu.:-0.0102566 1st Qu.:-0.0020484 1st Qu.:-0.005608 1st Qu.:-0.011881 1st Qu.:-0.0025589 1st Qu.:-0.0026654 1st Qu.:-0.0054651 1st Qu.:-0.0025611 1st Qu.:-0.008157 1st Qu.:-0.009361 1st Qu.:-0.0113926 1st Qu.:-0.0066137 1st Qu.:-0.005443 1st Qu.:-0.0023943 1st Qu.:-0.008606 1st Qu.:-0.012623 1st Qu.:-0.0016251 1st Qu.:-0.0036413 1st Qu.:-0.008391 1st Qu.:-0.0015205 1st Qu.:1 1st Qu.:-0.017262 1st Qu.:-0.008279 1st Qu.:-0.0105455 1st Qu.:-0.0024496 1st Qu.:-0.018276 1st Qu.:-0.0034648 1st Qu.:-0.0153153 1st Qu.:-0.005487 1st Qu.:-0.0034380 1st Qu.:-0.025953 1st Qu.:-0.0065187 1st Qu.:-0.0075767 1st Qu.:-0.0064648 1st Qu.:-0.0062737 1st Qu.:-0.007336 1st Qu.:-0.012559 1st Qu.:-0.0049280 1st Qu.:-0.014613 1st Qu.:-0.004756 1st Qu.:-0.0027459 1st Qu.:-0.006427 1st Qu.:-0.0063794 1st Qu.:-0.008253 1st Qu.:-0.008279 1st Qu.:-0.0042149 1st Qu.:-0.0027459 1st Qu.:-0.0008929 1st Qu.:-0.0008929 1st Qu.:-0.0018815 1st Qu.:-0.0038172 1st Qu.:-0.005673 1st Qu.:-0.008648 1st Qu.:-0.008935 1st Qu.:-0.012379 1st Qu.:-0.014009 1st Qu.:-0.0100840 1st Qu.:-0.0062954 1st Qu.:-0.0035576 1st Qu.:-0.013136 1st Qu.:-0.014378 1st Qu.:-0.019360 1st Qu.:-0.023319 1st Qu.:-0.0243034 1st Qu.:-0.019630 1st Qu.:-0.0011461
Median : 0.0005336 Median :-0.0033835 Median : 0.001860 Median :-0.0006944 Median :-0.0007078 Median :-0.001262 Median :-0.0028889 Median :-0.001448 Median :-0.001242 Median :-0.002111 Median :-0.0003168 Median :-0.001220 Median : 0.000506 Median :-0.0009646 Median :-0.0002264 Median : 0.002904 Median :-0.001414 Median :-0.0004302 Median :-0.0004554 Median :-0.0002511 Median : 0.0002321 Median : 0.003513 Median :-0.002447 Median :-0.0019345 Median :-0.0007267 Median :-0.001201 Median :-0.0000559 Median :-0.001012 Median : 0.000035 Median :-0.0002439 Median : 0.0016351 Median :-0.002005 Median :-0.0003634 Median :1 Median :-0.002944 Median : 0.001289 Median :-0.0017191 Median :-0.0003627 Median :-0.005281 Median :-0.0003402 Median :-0.0007771 Median :-0.001318 Median :-0.0004321 Median :-0.001559 Median :-0.0005868 Median :-0.0009248 Median :-0.0007155 Median :-0.0002672 Median :-0.001239 Median :-0.001800 Median :-0.0016338 Median :-0.003181 Median :-0.001151 Median :-0.0008847 Median :-0.000154 Median :-0.0008082 Median :-0.001274 Median :-0.001003 Median :-0.0004993 Median :-0.0008847 Median :-0.0002912 Median :-0.0003251 Median :-0.0005455 Median :-0.0008537 Median :-0.001546 Median :-0.000829 Median :-0.001997 Median :-0.003132 Median :-0.002266 Median :-0.0018848 Median :-0.0003493 Median :-0.0004631 Median :-0.002854 Median :-0.002379 Median :-0.002270 Median :-0.002061 Median :-0.0020631 Median :-0.002587 Median : 0.0003216
Mean : 0.0169704 Mean :-0.0009204 Mean : 0.005789 Mean : 0.0107475 Mean :-0.0088807 Mean : 0.005178 Mean : 0.0001063 Mean : 0.004543 Mean :-0.006328 Mean : 0.002687 Mean : 0.0103478 Mean :-0.003984 Mean : 0.003542 Mean : 0.0100118 Mean : 0.0125342 Mean : 0.012084 Mean : 0.001227 Mean :-0.0009544 Mean : 0.0126757 Mean : 0.0127019 Mean :-0.0007810 Mean : 0.016328 Mean : 0.008602 Mean : 0.0009881 Mean : 0.0085826 Mean : 0.008312 Mean : 0.0118276 Mean : 0.008002 Mean :-0.010200 Mean : 0.0125725 Mean : 0.0006063 Mean :-0.001404 Mean : 0.0115660 Mean :1 Mean : 0.006655 Mean : 0.007717 Mean :-0.0004392 Mean : 0.0127352 Mean :-0.005335 Mean : 0.0102429 Mean : 0.0056400 Mean : 0.008018 Mean : 0.0110086 Mean : 0.001435 Mean : 0.0110037 Mean : 0.0105688 Mean : 0.0091652 Mean : 0.0060635 Mean :-0.002248 Mean :-0.004613 Mean :-0.0003216 Mean :-0.014703 Mean : 0.013485 Mean : 0.0250915 Mean : 0.008380 Mean : 0.0112452 Mean : 0.011215 Mean : 0.012771 Mean : 0.0167936 Mean : 0.0250915 Mean : 0.0128817 Mean : 0.0127117 Mean : 0.0128602 Mean : 0.0121784 Mean : 0.010861 Mean : 0.009444 Mean : 0.007175 Mean : 0.003054 Mean : 0.000216 Mean :-0.0007652 Mean :-0.0011515 Mean :-0.0022937 Mean :-0.004167 Mean :-0.002999 Mean :-0.001605 Mean :-0.000504 Mean :-0.0000585 Mean : 0.009374 Mean : 0.0132475
3rd Qu.: 0.0040061 3rd Qu.: 0.0070447 3rd Qu.: 0.016525 3rd Qu.: 0.0047715 3rd Qu.: 0.0125200 3rd Qu.: 0.005010 3rd Qu.: 0.0011539 3rd Qu.: 0.004086 3rd Qu.: 0.004650 3rd Qu.: 0.001679 3rd Qu.: 0.0022560 3rd Qu.: 0.002721 3rd Qu.: 0.012246 3rd Qu.: 0.0092260 3rd Qu.: 0.0026775 3rd Qu.: 0.009980 3rd Qu.: 0.007513 3rd Qu.: 0.0014938 3rd Qu.: 0.0015396 3rd Qu.: 0.0030307 3rd Qu.: 0.0022817 3rd Qu.: 0.022748 3rd Qu.: 0.005722 3rd Qu.: 0.0066273 3rd Qu.: 0.0027397 3rd Qu.: 0.002368 3rd Qu.: 0.0043855 3rd Qu.: 0.004974 3rd Qu.: 0.009365 3rd Qu.: 0.0016627 3rd Qu.: 0.0095196 3rd Qu.: 0.002287 3rd Qu.: 0.0003999 3rd Qu.:1 3rd Qu.: 0.012122 3rd Qu.: 0.010252 3rd Qu.: 0.0075641 3rd Qu.: 0.0015427 3rd Qu.: 0.010096 3rd Qu.: 0.0030447 3rd Qu.: 0.0166287 3rd Qu.: 0.007431 3rd Qu.: 0.0018987 3rd Qu.: 0.011320 3rd Qu.: 0.0031320 3rd Qu.: 0.0035984 3rd Qu.: 0.0053395 3rd Qu.: 0.0042906 3rd Qu.: 0.002973 3rd Qu.: 0.003492 3rd Qu.: 0.0032218 3rd Qu.: 0.004352 3rd Qu.: 0.004585 3rd Qu.:-0.0000462 3rd Qu.: 0.003579 3rd Qu.: 0.0059771 3rd Qu.: 0.005613 3rd Qu.: 0.005251 3rd Qu.: 0.0020642 3rd Qu.:-0.0000462 3rd Qu.:-0.0000228 3rd Qu.:-0.0000348 3rd Qu.: 0.0004444 3rd Qu.: 0.0016579 3rd Qu.: 0.002306 3rd Qu.: 0.003628 3rd Qu.: 0.004160 3rd Qu.: 0.003860 3rd Qu.: 0.003601 3rd Qu.: 0.0033621 3rd Qu.: 0.0021982 3rd Qu.: 0.0038639 3rd Qu.: 0.003548 3rd Qu.: 0.003082 3rd Qu.: 0.002808 3rd Qu.: 0.003460 3rd Qu.: 0.0041037 3rd Qu.: 0.007550 3rd Qu.: 0.0019772
Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. :1 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.000000 Max. : 1.0000000
NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :78 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1
# Extract correlation with COD
correlation_with_COD <- correlation_matrix[, "COD"]

# Convert correlation_with_COD to a data frame with column names
correlation_df <- data.frame(variable = names(correlation_with_COD), correlation = correlation_with_COD)

# Sort correlation values
correlation_df <- correlation_df[order(correlation_df$correlation, decreasing = TRUE), ]

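# Create bar plot using ggplot2, ordering the bars by correlation
# (without reorder(), ggplot2 sorts the variable names alphabetically)
ggplot(correlation_df, aes(x = reorder(variable, correlation), y = correlation)) +
  geom_bar(stat = "identity") +
  labs(title = "Correlation with COD", x = "Variables", y = "Correlation")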
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
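
The warning above most likely comes from a variable whose correlation with COD is NA. Separately, when a character column is mapped to the x axis, ggplot2 orders the bars alphabetically, so sorting the data frame alone does not change the bar order. Below is a minimal sketch of one way to handle both points, using a small made-up correlation vector. The names `toy_cor` and `toy_df` and all values are illustrative assumptions, not part of the analysis.

```r
# Toy stand-in for correlation_with_COD (values are made up for illustration)
toy_cor <- c(age = 0.30, grade = -0.12, surgery = NA, income = 0.05)

toy_df <- data.frame(variable = names(toy_cor), correlation = toy_cor)

# Drop variables whose correlation could not be computed
toy_df <- toy_df[!is.na(toy_df$correlation), ]

# Fix the bar order by turning `variable` into a factor ordered by correlation
toy_df$variable <- factor(
  toy_df$variable,
  levels = toy_df$variable[order(toy_df$correlation)]
)

# Plot only if ggplot2 is installed; coord_flip() keeps long names readable
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy_df, ggplot2::aes(x = variable, y = correlation)) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::labs(title = "Correlation with COD (toy data)",
                  x = "Variables", y = "Correlation")
  print(p)
}
```

Flipping the coordinates is a presentation choice; it tends to help here because the encoded SEER variable names are long.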

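In the summary printed for the next chunk, a few columns (for example `Months from diagnosis to treatment`) show 1 for every statistic together with 76 NA's. That pattern is what `cor()` gives by default when a column contains missing values: every pair involving that column is NA. The sketch below reproduces the effect on made-up data and shows pairwise deletion as one possible alternative for an exploratory look. The data frame `toy` is hypothetical.

```r
# Made-up data: column `b` has one missing value
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, NA, 6, 8, 11),
  c = c(5, 3, 4, 1, 2)
)

# Default (use = "everything"): any pair involving `b` is NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses the rows where both columns are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_default, 3))
print(round(cor_pairwise, 3))
```

With pairwise deletion each coefficient can be based on a different subset of rows, so the resulting matrix is best treated as exploratory.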
# Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean) == "COD")

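# One-hot encode the factor columns, leaving "COD" out of the encoding formula
encoded_data <- predict(
  dummyVars(" ~ .", data = BREAST_DF_surv_clean[, -cod_column_index], fullRank = TRUE),
  newdata = BREAST_DF_surv_clean
)

# "COD" is already absent from encoded_data, so nothing needs to be dropped here.
# Dropping by cod_column_index would remove an unrelated dummy column, because
# that index refers to BREAST_DF_surv_clean, not to the encoded matrix.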

# Add "COD" column back to encoded_data
encoded_data <- cbind(encoded_data, COD = BREAST_DF_surv_clean$COD)

# Calculate correlation matrix
correlation_matrix <- cor(encoded_data)

# Extract correlation with COD
correlation_with_COD <- correlation_matrix[, "COD"]

# Summary of correlation matrix
summary(correlation_matrix)
##  `Race recode (W, B, AI, API)`Asian or Pacific Islander
##  Min.   :-0.611482                                     
##  1st Qu.:-0.020636                                     
##  Median :-0.004543                                     
##  Mean   :-0.001203                                     
##  3rd Qu.: 0.006915                                     
##  Max.   : 1.000000                                     
##  NA's   :3                                             
##  `Race recode (W, B, AI, API)`Black `Race recode (W, B, AI, API)`Unknown
##  Min.   :-0.672898                  Min.   :-0.151550                   
##  1st Qu.:-0.008186                  1st Qu.:-0.008196                   
##  Median : 0.002961                  Median :-0.001720                   
##  Mean   : 0.007238                  Mean   : 0.011140                   
##  3rd Qu.: 0.015977                  3rd Qu.: 0.004775                   
##  Max.   : 1.000000                  Max.   : 1.000000                   
##  NA's   :3                          NA's   :3                           
##  `Race recode (W, B, AI, API)`White
##  Min.   :-0.672898                 
##  1st Qu.:-0.021506                 
##  Median :-0.001173                 
##  Mean   :-0.007624                 
##  3rd Qu.: 0.013379                 
##  Max.   : 1.000000                 
##  NA's   :3                         
##  `Primary Site - labeled`C50.1-Central portion of breast
##  Min.   :-0.152121                                      
##  1st Qu.:-0.009934                                      
##  Median :-0.002113                                      
##  Mean   : 0.005421                                      
##  3rd Qu.: 0.006062                                      
##  Max.   : 1.000000                                      
##  NA's   :3                                              
##  `Primary Site - labeled`C50.2-Upper-inner quadrant of breast
##  Min.   :-0.253678                                           
##  1st Qu.:-0.009990                                           
##  Median :-0.003675                                           
##  Mean   :-0.001860                                           
##  3rd Qu.: 0.000150                                           
##  Max.   : 1.000000                                           
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.3-Lower-inner quadrant of breast
##  Min.   :-0.165071                                           
##  1st Qu.:-0.008009                                           
##  Median :-0.002641                                           
##  Mean   : 0.003667                                           
##  3rd Qu.: 0.003119                                           
##  Max.   : 1.000000                                           
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.4-Upper-outer quadrant of breast
##  Min.   :-0.372542                                           
##  1st Qu.:-0.015158                                           
##  Median :-0.001181                                           
##  Mean   :-0.008656                                           
##  3rd Qu.: 0.005932                                           
##  Max.   : 1.000000                                           
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.5-Lower-outer quadrant of breast
##  Min.   :-0.1930083                                          
##  1st Qu.:-0.0068032                                          
##  Median :-0.0026344                                          
##  Mean   : 0.0017748                                          
##  3rd Qu.: 0.0000719                                          
##  Max.   : 1.0000000                                          
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.6-Axillary tail of breast
##  Min.   :-0.0516638                                   
##  1st Qu.:-0.0032916                                   
##  Median :-0.0002025                                   
##  Mean   : 0.0108061                                   
##  3rd Qu.: 0.0025517                                   
##  Max.   : 1.0000000                                   
##  NA's   :3                                            
##  `Primary Site - labeled`C50.8-Overlapping lesion of breast
##  Min.   :-0.372542                                         
##  1st Qu.:-0.005557                                         
##  Median :-0.001620                                         
##  Mean   :-0.005418                                         
##  3rd Qu.: 0.001945                                         
##  Max.   : 1.000000                                         
##  NA's   :3                                                 
##  `Primary Site - labeled`C50.9-Breast, NOS
##  Min.   :-0.290700                        
##  1st Qu.:-0.014812                        
##  Median : 0.002148                        
##  Mean   : 0.010813                        
##  3rd Qu.: 0.017795                        
##  Max.   : 1.000000                        
##  NA's   :3                                
##  `Grade Recode (thru 2017)`Poorly differentiated; Grade III
##  Min.   :-0.3220663                                        
##  1st Qu.:-0.0132914                                        
##  Median :-0.0008937                                        
##  Mean   : 0.0111492                                        
##  3rd Qu.: 0.0112681                                        
##  Max.   : 1.0000000                                        
##  NA's   :3                                                 
##  `Grade Recode (thru 2017)`Undifferentiated; anaplastic; Grade IV
##  Min.   :-0.0210283                                              
##  1st Qu.:-0.0017196                                              
##  Median :-0.0004927                                              
##  Mean   : 0.0132802                                              
##  3rd Qu.: 0.0023513                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Grade Recode (thru 2017)`Unknown
##  Min.   :-0.269170                
##  1st Qu.:-0.009920                
##  Median : 0.003273                
##  Mean   : 0.019721                
##  3rd Qu.: 0.024205                
##  Max.   : 1.000000                
##  NA's   :3                        
##  `Grade Recode (thru 2017)`Well differentiated; Grade I
##  Min.   :-0.322066                                     
##  1st Qu.:-0.017850                                     
##  Median :-0.005277                                     
##  Mean   :-0.001697                                     
##  3rd Qu.: 0.007549                                     
##  Max.   : 1.000000                                     
##  NA's   :3                                             
##  Laterality.Only one side - side unspecified
##  Min.   :-0.0505764                         
##  1st Qu.:-0.0025761                         
##  Median :-0.0002076                         
##  Mean   : 0.0149982                         
##  3rd Qu.: 0.0036708                         
##  Max.   : 1.0000000                         
##  NA's   :3                                  
##  Laterality.Paired site, but no information concerning laterality
##  Min.   :-0.306824                                               
##  1st Qu.:-0.013601                                               
##  Median :-0.001130                                               
##  Mean   : 0.026582                                               
##  3rd Qu.: 0.007532                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  Laterality.Right - origin of primary `Chemotherapy recode (yes, no/unk)`Yes
##  Min.   :-0.0997361                   Min.   :-0.266711                     
##  1st Qu.:-0.0033810                   1st Qu.:-0.022998                     
##  Median :-0.0004293                   Median : 0.003276                     
##  Mean   : 0.0100453                   Mean   : 0.014559                     
##  3rd Qu.: 0.0021401                   3rd Qu.: 0.022243                     
##  Max.   : 1.0000000                   Max.   : 1.000000                     
##  NA's   :3                            NA's   :3                             
##  `Months from diagnosis to treatment`
##  Min.   :1                           
##  1st Qu.:1                           
##  Median :1                           
##  Mean   :1                           
##  3rd Qu.:1                           
##  Max.   :1                           
##  NA's   :76                          
##  `Reason no cancer-directed surgery`Not recommended
##  Min.   :-0.812290                                 
##  1st Qu.:-0.020352                                 
##  Median :-0.004180                                 
##  Mean   : 0.006545                                 
##  3rd Qu.: 0.008143                                 
##  Max.   : 1.000000                                 
##  NA's   :3                                         
##  `Reason no cancer-directed surgery`Not recommended, contraindicated due to other cond; autopsy only (1973-2002)
##  Min.   :-0.189154                                                                                              
##  1st Qu.:-0.006970                                                                                              
##  Median :-0.001004                                                                                              
##  Mean   : 0.011121                                                                                              
##  3rd Qu.: 0.002799                                                                                              
##  Max.   : 1.000000                                                                                              
##  NA's   :3                                                                                                      
##  `Reason no cancer-directed surgery`Recommended but not performed, patient refused
##  Min.   :-0.262869                                                                
##  1st Qu.:-0.008973                                                                
##  Median :-0.001474                                                                
##  Mean   : 0.009488                                                                
##  3rd Qu.: 0.002102                                                                
##  Max.   : 1.000000                                                                
##  NA's   :3                                                                        
##  `Reason no cancer-directed surgery`Recommended but not performed, unknown reason
##  Min.   :-0.205810                                                               
##  1st Qu.:-0.006578                                                               
##  Median :-0.002151                                                               
##  Mean   : 0.013125                                                               
##  3rd Qu.: 0.006299                                                               
##  Max.   : 1.000000                                                               
##  NA's   :3                                                                       
##  `Reason no cancer-directed surgery`Recommended, unknown if performed
##  Min.   :-0.2649458                                                  
##  1st Qu.:-0.0085961                                                  
##  Median :-0.0009167                                                  
##  Mean   : 0.0086230                                                  
##  3rd Qu.: 0.0060008                                                  
##  Max.   : 1.0000000                                                  
##  NA's   :3                                                           
##  `Reason no cancer-directed surgery`Surgery performed
##  Min.   :-0.8122899                                  
##  1st Qu.:-0.0354131                                  
##  Median : 0.0005905                                  
##  Mean   :-0.0229615                                  
##  3rd Qu.: 0.0230988                                  
##  Max.   : 1.0000000                                  
##  NA's   :3                                           
##  `Reason no cancer-directed surgery`Unknown; death certificate; or autopsy only (2003+)
##  Min.   :-0.316222                                                                     
##  1st Qu.:-0.013675                                                                     
##  Median :-0.002965                                                                     
##  Mean   : 0.025584                                                                     
##  3rd Qu.: 0.008967                                                                     
##  Max.   : 1.000000                                                                     
##  NA's   :3                                                                             
##  `Survival months flag`Complete dates are available and there are more than 0 days of survival
##  Min.   :-0.883947                                                                            
##  1st Qu.:-0.012528                                                                            
##  Median : 0.005244                                                                            
##  Mean   :-0.013255                                                                            
##  3rd Qu.: 0.017204                                                                            
##  Max.   : 1.000000                                                                            
##  NA's   :3                                                                                    
##  `Survival months flag`Incomplete dates are available and there cannot be zero days of follow-up
##  Min.   :-0.883947                                                                              
##  1st Qu.:-0.014050                                                                              
##  Median :-0.002561                                                                              
##  Mean   : 0.001254                                                                              
##  3rd Qu.: 0.004816                                                                              
##  Max.   : 1.000000                                                                              
##  NA's   :3                                                                                      
##  `Survival months flag`Incomplete dates are available and there could be zero days of follow-up
##  Min.   :-0.124874                                                                             
##  1st Qu.:-0.003584                                                                             
##  Median :-0.001239                                                                             
##  Mean   : 0.012886                                                                             
##  3rd Qu.: 0.002214                                                                             
##  Max.   : 1.000000                                                                             
##  NA's   :3                                                                                     
##  `Survival months flag`Not calculated because a Death Certificate Only or Autopsy Only case
##  Min.   :-0.386749                                                                         
##  1st Qu.:-0.014145                                                                         
##  Median :-0.003250                                                                         
##  Mean   : 0.025055                                                                         
##  3rd Qu.: 0.001901                                                                         
##  Max.   : 1.000000                                                                         
##  NA's   :3                                                                                 
##  `Survival months` `First malignant primary indicator`Yes
##  Min.   :1         Min.   :-0.121335                     
##  1st Qu.:1         1st Qu.:-0.009865                     
##  Median :1         Median : 0.001646                     
##  Mean   :1         Mean   : 0.015425                     
##  3rd Qu.:1         3rd Qu.: 0.013781                     
##  Max.   :1         Max.   : 1.000000                     
##  NA's   :76        NA's   :3                             
##  `Total number of in situ/malignant tumors for patient`
##  Min.   :1                                             
##  1st Qu.:1                                             
##  Median :1                                             
##  Mean   :1                                             
##  3rd Qu.:1                                             
##  Max.   :1                                             
##  NA's   :76                                            
##  `Total number of benign/borderline tumors for patient`
##  Min.   :-0.0152234                                    
##  1st Qu.:-0.0032617                                    
##  Median :-0.0005528                                    
##  Mean   : 0.0130747                                    
##  3rd Qu.: 0.0019016                                    
##  Max.   : 1.0000000                                    
##  NA's   :3                                             
##  `Marital status at diagnosis`Married (including common law)
##  Min.   :-0.440177                                          
##  1st Qu.:-0.032660                                          
##  Median :-0.006210                                          
##  Mean   :-0.009718                                          
##  3rd Qu.: 0.013420                                          
##  Max.   : 1.000000                                          
##  NA's   :3                                                  
##  `Marital status at diagnosis`Separated
##  Min.   :-0.109798                     
##  1st Qu.:-0.004177                     
##  Median :-0.000266                     
##  Mean   : 0.011281                     
##  3rd Qu.: 0.003666                     
##  Max.   : 1.000000                     
##  NA's   :3                             
##  `Marital status at diagnosis`Single (never married)
##  Min.   :-0.440177                                  
##  1st Qu.:-0.014297                                  
##  Median :-0.002190                                  
##  Mean   : 0.005418                                  
##  3rd Qu.: 0.017116                                  
##  Max.   : 1.000000                                  
##  NA's   :3                                          
##  `Marital status at diagnosis`Unknown
##  Min.   :-0.269781                   
##  1st Qu.:-0.009316                   
##  Median :-0.001311                   
##  Mean   : 0.009429                   
##  3rd Qu.: 0.006888                   
##  Max.   : 1.000000                   
##  NA's   :3                           
##  `Marital status at diagnosis`Unmarried or Domestic Partner
##  Min.   :-0.061342                                         
##  1st Qu.:-0.004189                                         
##  Median :-0.000586                                         
##  Mean   : 0.011305                                         
##  3rd Qu.: 0.001826                                         
##  Max.   : 1.000000                                         
##  NA's   :3                                                 
##  `Marital status at diagnosis`Widowed
##  Min.   :-0.432734                   
##  1st Qu.:-0.027195                   
##  Median :-0.001960                   
##  Mean   : 0.007008                   
##  3rd Qu.: 0.018651                   
##  Max.   : 1.000000                   
##  NA's   :3                           
##  `Median household income inflation adj to 2021`$40,000 - $44,999
##  Min.   :-0.1382091                                              
##  1st Qu.:-0.0056870                                              
##  Median :-0.0007594                                              
##  Mean   : 0.0128226                                              
##  3rd Qu.: 0.0041608                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$45,000 - $49,999
##  Min.   :-0.1682857                                              
##  1st Qu.:-0.0069108                                              
##  Median :-0.0009922                                              
##  Mean   : 0.0119225                                              
##  3rd Qu.: 0.0053678                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$50,000 - $54,999
##  Min.   :-0.1791432                                              
##  1st Qu.:-0.0069663                                              
##  Median :-0.0009649                                              
##  Mean   : 0.0100901                                              
##  3rd Qu.: 0.0055250                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$55,000 - $59,999
##  Min.   :-0.221090                                               
##  1st Qu.:-0.005620                                               
##  Median :-0.000249                                               
##  Mean   : 0.006952                                               
##  3rd Qu.: 0.005061                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$60,000 - $64,999
##  Min.   :-0.302908                                               
##  1st Qu.:-0.008606                                               
##  Median :-0.001320                                               
##  Mean   :-0.002785                                               
##  3rd Qu.: 0.003066                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$65,000 - $69,999
##  Min.   :-0.308737                                               
##  1st Qu.:-0.012660                                               
##  Median :-0.001992                                               
##  Mean   :-0.005016                                               
##  3rd Qu.: 0.004584                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$70,000 - $74,999
##  Min.   :-0.2538036                                              
##  1st Qu.:-0.0053899                                              
##  Median :-0.0016729                                              
##  Mean   :-0.0009235                                              
##  3rd Qu.: 0.0020979                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$75,000+
##  Min.   :-0.308737                                      
##  1st Qu.:-0.020431                                      
##  Median :-0.005049                                      
##  Mean   :-0.016574                                      
##  3rd Qu.: 0.003574                                      
##  Max.   : 1.000000                                      
##  NA's   :3                                              
##  `Median household income inflation adj to 2021`< $35,000
##  Min.   :-0.070337                                       
##  1st Qu.:-0.005079                                       
##  Median :-0.001380                                       
##  Mean   : 0.014739                                       
##  3rd Qu.: 0.005036                                       
##  Max.   : 1.000000                                       
##  NA's   :3                                               
##  `Median household income inflation adj to 2021`Unknown/missing/no match/Not 1990-2021
##  Min.   :-0.0127447                                                                   
##  1st Qu.:-0.0034638                                                                   
##  Median :-0.0012849                                                                   
##  Mean   : 0.0267364                                                                   
##  3rd Qu.: 0.0009734                                                                   
##  Max.   : 1.0000000                                                                   
##  NA's   :3                                                                            
##  `Rural-Urban Continuum Code`Counties in metropolitan areas of 250,000 to 1 million pop
##  Min.   :-0.1432295                                                                    
##  1st Qu.:-0.0074650                                                                    
##  Median :-0.0002018                                                                    
##  Mean   : 0.0091835                                                                    
##  3rd Qu.: 0.0038568                                                                    
##  Max.   : 1.0000000                                                                    
##  NA's   :3                                                                             
##  `Rural-Urban Continuum Code`Counties in metropolitan areas of lt 250 thousand pop
##  Min.   :-0.1717685                                                               
##  1st Qu.:-0.0050985                                                               
##  Median :-0.0001767                                                               
##  Mean   : 0.0126107                                                               
##  3rd Qu.: 0.0058588                                                               
##  Max.   : 1.0000000                                                               
##  NA's   :3                                                                        
##  `Rural-Urban Continuum Code`Nonmetropolitan counties adjacent to a metropolitan area
##  Min.   :-0.1531643                                                                  
##  1st Qu.:-0.0064625                                                                  
##  Median :-0.0008947                                                                  
##  Mean   : 0.0127534                                                                  
##  3rd Qu.: 0.0065129                                                                  
##  Max.   : 1.0000000                                                                  
##  NA's   :3                                                                           
##  `Rural-Urban Continuum Code`Nonmetropolitan counties not adjacent to a metropolitan area
##  Min.   :-0.1543301                                                                      
##  1st Qu.:-0.0077561                                                                      
##  Median :-0.0005505                                                                      
##  Mean   : 0.0142939                                                                      
##  3rd Qu.: 0.0078590                                                                      
##  Max.   : 1.0000000                                                                      
##  NA's   :3                                                                               
##  `Rural-Urban Continuum Code`Unknown/missing/no match (Alaska or Hawaii - Entire State)
##  Min.   :-0.0678178                                                                    
##  1st Qu.:-0.0038102                                                                    
##  Median :-0.0002637                                                                    
##  Mean   : 0.0120203                                                                    
##  3rd Qu.: 0.0021074                                                                    
##  Max.   : 1.0000000                                                                    
##  NA's   :3                                                                             
##  `Rural-Urban Continuum Code`Unknown/missing/no match/Not 1990-2021
##  Min.   :-0.0127447                                                
##  1st Qu.:-0.0034638                                                
##  Median :-0.0012849                                                
##  Mean   : 0.0267364                                                
##  3rd Qu.: 0.0009734                                                
##  Max.   : 1.0000000                                                
##  NA's   :3                                                         
##  `Age recode (<60,60-69,70+)`05-09 years
##  Min.   :-0.0027197                     
##  1st Qu.:-0.0008100                     
##  Median :-0.0002830                     
##  Mean   : 0.0136204                     
##  3rd Qu.:-0.0000415                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`10-14 years
##  Min.   :-0.0050171                     
##  1st Qu.:-0.0008100                     
##  Median :-0.0003417                     
##  Mean   : 0.0134044                     
##  3rd Qu.:-0.0000665                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`15-19 years
##  Min.   :-0.0070476                     
##  1st Qu.:-0.0018281                     
##  Median :-0.0006534                     
##  Mean   : 0.0134773                     
##  3rd Qu.: 0.0000811                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`20-24 years
##  Min.   :-0.018979                      
##  1st Qu.:-0.003199                      
##  Median :-0.001403                      
##  Mean   : 0.012949                      
##  3rd Qu.: 0.001502                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`25-29 years
##  Min.   :-0.027433                      
##  1st Qu.:-0.005254                      
##  Median :-0.001847                      
##  Mean   : 0.011652                      
##  3rd Qu.: 0.002353                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`30-34 years
##  Min.   :-0.046406                      
##  1st Qu.:-0.007629                      
##  Median :-0.001896                      
##  Mean   : 0.009909                      
##  3rd Qu.: 0.002237                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`35-39 years
##  Min.   :-0.067571                      
##  1st Qu.:-0.009978                      
##  Median :-0.002013                      
##  Mean   : 0.007331                      
##  3rd Qu.: 0.004198                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`40-44 years
##  Min.   :-0.098876                      
##  1st Qu.:-0.016888                      
##  Median :-0.005596                      
##  Mean   : 0.002280                      
##  3rd Qu.: 0.003749                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`45-49 years
##  Min.   :-0.125622                      
##  1st Qu.:-0.017717                      
##  Median :-0.006664                      
##  Mean   :-0.001411                      
##  3rd Qu.: 0.004519                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`50-54 years
##  Min.   :-0.141961                      
##  1st Qu.:-0.016959                      
##  Median :-0.003925                      
##  Mean   :-0.002563                      
##  3rd Qu.: 0.003557                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`55-59 years
##  Min.   :-0.148041                      
##  1st Qu.:-0.015581                      
##  Median :-0.001369                      
##  Mean   :-0.003108                      
##  3rd Qu.: 0.002133                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`60-64 years
##  Min.   :-0.1569887                     
##  1st Qu.:-0.0139543                     
##  Median :-0.0007929                     
##  Mean   :-0.0045549                     
##  3rd Qu.: 0.0037404                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`65-69 years
##  Min.   :-0.156989                      
##  1st Qu.:-0.018456                      
##  Median :-0.004416                      
##  Mean   :-0.006050                      
##  3rd Qu.: 0.002814                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`70-74 years
##  Min.   :-0.136706                      
##  1st Qu.:-0.018582                      
##  Median :-0.002493                      
##  Mean   :-0.003747                      
##  3rd Qu.: 0.004444                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`75-79 years
##  Min.   :-0.1299541                     
##  1st Qu.:-0.0192648                     
##  Median :-0.0017527                     
##  Mean   :-0.0004472                     
##  3rd Qu.: 0.0042832                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`80-84 years `Age recode (<60,60-69,70+)`85+ years
##  Min.   :-0.1497345                      Min.   :-0.185038                    
##  1st Qu.:-0.0260547                      1st Qu.:-0.032765                    
##  Median :-0.0009538                      Median :-0.002749                    
##  Mean   : 0.0030973                      Mean   : 0.009643                    
##  3rd Qu.: 0.0066626                      3rd Qu.: 0.005992                    
##  Max.   : 1.0000000                      Max.   : 1.000000                    
##  NA's   :3                               NA's   :3                            
##  Radiation.Yes           COD           
##  Min.   :-0.18043   Min.   :-0.274122  
##  1st Qu.:-0.02974   1st Qu.:-0.028538  
##  Median :-0.00240   Median : 0.003475  
##  Mean   : 0.00561   Mean   : 0.019223  
##  3rd Qu.: 0.01477   3rd Qu.: 0.028403  
##  Max.   : 1.00000   Max.   : 1.000000  
##  NA's   :3          NA's   :3
# Print correlation with COD
print(correlation_with_COD)
##                                                          `Race recode (W, B, AI, API)`Asian or Pacific Islander 
##                                                                                                   -0.0545190384 
##                                                                              `Race recode (W, B, AI, API)`Black 
##                                                                                                    0.0469532442 
##                                                                            `Race recode (W, B, AI, API)`Unknown 
##                                                                                                   -0.0293993259 
##                                                                              `Race recode (W, B, AI, API)`White 
##                                                                                                    0.0074658579 
##                                                         `Primary Site - labeled`C50.1-Central portion of breast 
##                                                                                                    0.0250324542 
##                                                    `Primary Site - labeled`C50.2-Upper-inner quadrant of breast 
##                                                                                                   -0.0331489825 
##                                                    `Primary Site - labeled`C50.3-Lower-inner quadrant of breast 
##                                                                                                   -0.0090667775 
##                                                    `Primary Site - labeled`C50.4-Upper-outer quadrant of breast 
##                                                                                                   -0.0443648371 
##                                                    `Primary Site - labeled`C50.5-Lower-outer quadrant of breast 
##                                                                                                   -0.0175945778 
##                                                           `Primary Site - labeled`C50.6-Axillary tail of breast 
##                                                                                                    0.0043188952 
##                                                      `Primary Site - labeled`C50.8-Overlapping lesion of breast 
##                                                                                                   -0.0110631898 
##                                                                       `Primary Site - labeled`C50.9-Breast, NOS 
##                                                                                                    0.1016503665 
##                                                      `Grade Recode (thru 2017)`Poorly differentiated; Grade III 
##                                                                                                    0.0271874011 
##                                                `Grade Recode (thru 2017)`Undifferentiated; anaplastic; Grade IV 
##                                                                                                    0.0094420265 
##                                                                               `Grade Recode (thru 2017)`Unknown 
##                                                                                                    0.0968239858 
##                                                          `Grade Recode (thru 2017)`Well differentiated; Grade I 
##                                                                                                   -0.0623106812 
##                                                                     Laterality.Only one side - side unspecified 
##                                                                                                    0.0200049734 
##                                                Laterality.Paired site, but no information concerning laterality 
##                                                                                                    0.1028374128 
##                                                                            Laterality.Right - origin of primary 
##                                                                                                   -0.0181205661 
##                                                                          `Chemotherapy recode (yes, no/unk)`Yes 
##                                                                                                   -0.0921253574 
##                                                                            `Months from diagnosis to treatment` 
##                                                                                                              NA 
##                                                              `Reason no cancer-directed surgery`Not recommended 
##                                                                                                    0.2220449034 
## `Reason no cancer-directed surgery`Not recommended, contraindicated due to other cond; autopsy only (1973-2002) 
##                                                                                                    0.0989504823 
##                               `Reason no cancer-directed surgery`Recommended but not performed, patient refused 
##                                                                                                    0.0884305414 
##                                `Reason no cancer-directed surgery`Recommended but not performed, unknown reason 
##                                                                                                    0.0403204344 
##                                            `Reason no cancer-directed surgery`Recommended, unknown if performed 
##                                                                                                    0.0114951426 
##                                                            `Reason no cancer-directed surgery`Surgery performed 
##                                                                                                   -0.2741215608 
##                          `Reason no cancer-directed surgery`Unknown; death certificate; or autopsy only (2003+) 
##                                                                                                    0.0839598865 
##                   `Survival months flag`Complete dates are available and there are more than 0 days of survival 
##                                                                                                   -0.0490960286 
##                 `Survival months flag`Incomplete dates are available and there cannot be zero days of follow-up 
##                                                                                                    0.0166136901 
##                  `Survival months flag`Incomplete dates are available and there could be zero days of follow-up 
##                                                                                                    0.0165572269 
##                      `Survival months flag`Not calculated because a Death Certificate Only or Autopsy Only case 
##                                                                                                    0.0787841243 
##                                                                                               `Survival months` 
##                                                                                                              NA 
##                                                                          `First malignant primary indicator`Yes 
##                                                                                                   -0.1213346094 
##                                                          `Total number of in situ/malignant tumors for patient` 
##                                                                                                              NA 
##                                                          `Total number of benign/borderline tumors for patient` 
##                                                                                                    0.0096744569 
##                                                     `Marital status at diagnosis`Married (including common law) 
##                                                                                                   -0.1738909087 
##                                                                          `Marital status at diagnosis`Separated 
##                                                                                                   -0.0040996820 
##                                                             `Marital status at diagnosis`Single (never married) 
##                                                                                                    0.0003130660 
##                                                                            `Marital status at diagnosis`Unknown 
##                                                                                                    0.0303384662 
##                                                      `Marital status at diagnosis`Unmarried or Domestic Partner 
##                                                                                                   -0.0119834990 
##                                                                            `Marital status at diagnosis`Widowed 
##                                                                                                    0.2258045820 
##                                                `Median household income inflation adj to 2021`$40,000 - $44,999 
##                                                                                                    0.0288158876 
##                                                `Median household income inflation adj to 2021`$45,000 - $49,999 
##                                                                                                    0.0293692229 
##                                                `Median household income inflation adj to 2021`$50,000 - $54,999 
##                                                                                                    0.0232834837 
##                                                `Median household income inflation adj to 2021`$55,000 - $59,999 
##                                                                                                    0.0119175068 
##                                                `Median household income inflation adj to 2021`$60,000 - $64,999 
##                                                                                                    0.0085555964 
##                                                `Median household income inflation adj to 2021`$65,000 - $69,999 
##                                                                                                   -0.0113267709 
##                                                `Median household income inflation adj to 2021`$70,000 - $74,999 
##                                                                                                    0.0021964742 
##                                                         `Median household income inflation adj to 2021`$75,000+ 
##                                                                                                   -0.0518348420 
##                                                        `Median household income inflation adj to 2021`< $35,000 
##                                                                                                    0.0183133390 
##                           `Median household income inflation adj to 2021`Unknown/missing/no match/Not 1990-2021 
##                                                                                                   -0.0018601995 
##                          `Rural-Urban Continuum Code`Counties in metropolitan areas of 250,000 to 1 million pop 
##                                                                                                    0.0054201133 
##                               `Rural-Urban Continuum Code`Counties in metropolitan areas of lt 250 thousand pop 
##                                                                                                    0.0164868975 
##                            `Rural-Urban Continuum Code`Nonmetropolitan counties adjacent to a metropolitan area 
##                                                                                                    0.0284308358 
##                        `Rural-Urban Continuum Code`Nonmetropolitan counties not adjacent to a metropolitan area 
##                                                                                                    0.0283202178 
##                          `Rural-Urban Continuum Code`Unknown/missing/no match (Alaska or Hawaii - Entire State) 
##                                                                                                    0.0026305237 
##                                              `Rural-Urban Continuum Code`Unknown/missing/no match/Not 1990-2021 
##                                                                                                   -0.0018601995 
##                                                                         `Age recode (<60,60-69,70+)`05-09 years 
##                                                                                                   -0.0013753077 
##                                                                         `Age recode (<60,60-69,70+)`10-14 years 
##                                                                                                   -0.0013753077 
##                                                                         `Age recode (<60,60-69,70+)`15-19 years 
##                                                                                                   -0.0008190567 
##                                                                         `Age recode (<60,60-69,70+)`20-24 years 
##                                                                                                   -0.0017825828 
##                                                                         `Age recode (<60,60-69,70+)`25-29 years 
##                                                                                                   -0.0118417779 
##                                                                         `Age recode (<60,60-69,70+)`30-34 years 
##                                                                                                   -0.0259548035 
##                                                                         `Age recode (<60,60-69,70+)`35-39 years 
##                                                                                                   -0.0423991343 
##                                                                         `Age recode (<60,60-69,70+)`40-44 years 
##                                                                                                   -0.0751335018 
##                                                                         `Age recode (<60,60-69,70+)`45-49 years 
##                                                                                                   -0.0999972407 
##                                                                         `Age recode (<60,60-69,70+)`50-54 years 
##                                                                                                   -0.0950419640 
##                                                                         `Age recode (<60,60-69,70+)`55-59 years 
##                                                                                                   -0.0788618264 
##                                                                         `Age recode (<60,60-69,70+)`60-64 years 
##                                                                                                   -0.0661892685 
##                                                                         `Age recode (<60,60-69,70+)`65-69 years 
##                                                                                                   -0.0379863553 
##                                                                         `Age recode (<60,60-69,70+)`70-74 years 
##                                                                                                    0.0251230157 
##                                                                         `Age recode (<60,60-69,70+)`75-79 years 
##                                                                                                    0.1002552651 
##                                                                         `Age recode (<60,60-69,70+)`80-84 years 
##                                                                                                    0.1861215456 
##                                                                           `Age recode (<60,60-69,70+)`85+ years 
##                                                                                                    0.3114860645 
##                                                                                                   Radiation.Yes 
##                                                                                                   -0.1573241890 
##                                                                                                             COD 
##                                                                                                    1.0000000000
# Exclude "COD" column from model matrix and encode factors
encoded_data <- predict(dummyVars(" ~ .", data = BREAST_DF_surv_clean[, -cod_column_index], fullRank = TRUE), newdata = BREAST_DF_surv_clean)

# Alternatively, using ggplot
correlation_df <- data.frame(variable = colnames(correlation_matrix), correlation = correlation_with_COD)
# Plot the correlations as bar charts, one chunk of variables per plot
ggplot(correlation_df[1:19, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text

ggplot(correlation_df[20:39, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).

ggplot(correlation_df[40:59, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text

ggplot(correlation_df[60:77, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text
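
Three entries in the correlation output above are NA, and the second plot drops three bars for the same reason. One likely cause, which is an assumption and not confirmed by the output itself, is that those numeric columns contain missing values: by default cor() returns NA whenever either input has an NA. The sketch below uses a made-up toy data frame (the names toy, months and cod are hypothetical) to show the default behaviour next to use = "pairwise.complete.obs".

```r
# Toy data (hypothetical): a numeric predictor with two missing values.
toy <- data.frame(
  months = c(2, NA, 5, 1, 3, NA, 4, 6),
  cod    = c(0, 1, 1, 0, 0, 1, 1, 1)
)

# Default (use = "everything"): any NA in the inputs gives an NA correlation.
cor_default <- cor(toy$months, toy$cod)

# Use only the rows where both values are present.
cor_pairwise <- cor(toy$months, toy$cod, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

Whether dropping incomplete rows is appropriate depends on why the values are missing, so this is only one possible way to obtain a number for those columns.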

Machine learning, Random Forest classification model

To work with this database, I need to transform the categorical data (factors) into numerical variables. I use a method known as one-hot encoding for this. Although target encoding may be the better method for this survival analysis, I decided not to apply it due to its complexity and time constraints [1,2].

In general, the machine learning phase consists of four main steps:

  1. Encode categorical variables.

  2. Split the data into training and testing sets.

  3. Train the models.

  4. Evaluate the models.
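
The four steps above can be sketched end to end in base R. This is a minimal illustration on simulated data, not the pipeline used in this project: the data frame toy, its columns, and the coefficients used to simulate the outcome are all hypothetical, and logistic regression stands in for whichever model is trained.

```r
set.seed(42)

# Hypothetical toy data: one categorical and one numeric predictor.
n <- 400
toy <- data.frame(
  grade = factor(sample(c("I", "II", "III"), n, replace = TRUE)),
  age   = round(runif(n, 30, 90))
)
# Simulated binary outcome (1 = alive, 0 = deceased).
p_alive <- plogis(4 - 0.05 * toy$age - 0.5 * (toy$grade == "III"))
toy$alive <- rbinom(n, 1, p_alive)

# 1. Encode categorical variables (one 0/1 column per non-reference level).
x <- model.matrix(~ grade + age, data = toy)[, -1]
encoded <- data.frame(x, alive = toy$alive)

# 2. Split into 70% training and 30% testing rows.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- encoded[train_idx, ]
test  <- encoded[-train_idx, ]

# 3. Train a model (logistic regression as a simple stand-in).
fit <- glm(alive ~ ., data = train, family = binomial)

# 4. Evaluate on the held-out rows.
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
accuracy <- mean(pred == test$alive)
print(table(Predicted = pred, Actual = test$alive))
print(accuracy)
```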

What is target encoding:

Target encoding, also known as mean encoding or likelihood encoding, is a technique used to encode categorical variables into numerical values based on the target variable. It replaces each category with the mean (or some other summary statistic) of the target variable for that category. In R, the caret package used in this project provides one-hot encoding through dummyVars; target encoding is available in other packages, for example embed.
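
As a small worked example of the idea, the sketch below applies mean target encoding by hand in base R. The data frame toy and its columns marital and alive are made up for illustration and are not taken from the SEER data.

```r
# Hypothetical toy data: a categorical feature and a binary target (1 = alive).
toy <- data.frame(
  marital = c("Married", "Married", "Widowed", "Widowed", "Single", "Married"),
  alive   = c(1, 1, 0, 1, 1, 0)
)

# Mean of the target within each category.
category_means <- tapply(toy$alive, toy$marital, mean)

# Replace each category with the mean target value of that category.
toy$marital_encoded <- as.numeric(category_means[as.character(toy$marital)])

print(category_means)
print(toy)
```

In a real model the category means should be computed on the training set only and then applied to the test set, otherwise information about the target leaks into the predictors.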

What is One-Hot encoding:

One-hot encoding represents a categorical variable as a set of binary (0/1) columns, one per category: a value of 1 marks the category an observation belongs to and 0 marks all others. For example, a marital status variable becomes separate columns for married, widowed, and so on. In R, the dummyVars function from caret performs this conversion. This allows machine learning algorithms that expect numeric input to interpret and utilize categorical data in predictive models.
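
The sketch below shows what this looks like on a tiny made-up factor. It uses base R's model.matrix as a stand-in for caret's dummyVars so that it runs without extra packages; the data frame toy and its column laterality are hypothetical.

```r
# Hypothetical toy data: one categorical variable with three levels.
toy <- data.frame(
  laterality = factor(c("Left", "Right", "Left", "Unknown"))
)

# "- 1" drops the intercept so every level gets its own 0/1 column.
one_hot <- model.matrix(~ laterality - 1, data = toy)
print(one_hot)

# Dropping one reference level (as fullRank = TRUE does in dummyVars)
# avoids perfectly collinear columns.
full_rank <- model.matrix(~ laterality, data = toy)[, -1, drop = FALSE]
print(full_rank)
```

Each row of one_hot contains exactly one 1, while full_rank encodes the reference level as a row of zeros.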

Different models investigated in this Project

  1. Random Forest (rf): Random forest is a popular machine learning algorithm that can be adapted for survival analysis. It constructs a multitude of decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

  2. Logistic Regression (glm): Logistic regression is a standard model for binary outcomes. It is employed in this project to model the relationship between various prognostic factors and the probability of survival or death in breast cancer patients.

  3. Deep Neural Network (DNN): This is a powerful machine learning model that can learn complex patterns in data to classify individuals as either alive or deceased in a given classification problem. In R, DNNs can be implemented using packages like keras, providing a flexible framework for building and training deep learning models tailored to specific datasets.

Data Preparation for Ensemble models

BREAST_DF_surv_clean_no_missing <- na.omit(BREAST_DF_surv_clean)

# Reduce the problem to a binary outcome (Alive vs. Breast) by removing the other causes of death; a binomial outcome is easier to tackle
# COD is then recoded from "Alive"/"Breast" to 1/0
# Remove "Other" from the COD column
BREAST_DF_surv_clean_no_missing_bi <- BREAST_DF_surv_clean_no_missing[BREAST_DF_surv_clean_no_missing$COD != "Other", ]

# Replace remaining categories with numerical values
#BREAST_DF_surv_clean_no_missing_bi$COD <- as.numeric(factor(BREAST_DF_surv_clean_no_missing_bi$COD, levels = c("Alive", "Breast")))

BREAST_DF_surv_clean_no_missing_bi$COD <- ifelse(BREAST_DF_surv_clean_no_missing_bi$COD == "Alive", 1, 0)

BREAST_DF_surv_clean_no_missing_bi$COD <- as.factor(BREAST_DF_surv_clean_no_missing_bi$COD)

# Convert to binomial distribution
#model_rf <- randomForest(COD ~ ., data = BREAST_DF_surv_clean_no_missing_bi, type = "response", ntree = 100)


# Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean_no_missing_bi) == "COD")

# Exclude "COD" column from the data 
data_without_cod <- BREAST_DF_surv_clean_no_missing_bi[, -cod_column_index]

# Perform one-hot encoding
encoded_data <- dummyVars(" ~ .", data = data_without_cod)

# Create the design matrix with encoded data
design_matrix <- predict(encoded_data, newdata = data_without_cod)
design_matrix <- data.frame(design_matrix)

# Add the target variable (COD) back to the design matrix
design_matrix <- cbind(design_matrix, COD = BREAST_DF_surv_clean_no_missing_bi$COD)
design_matrix$COD <- factor(design_matrix$COD)

# Split the data into training and testing sets
set.seed(123)  # for reproducibility
train_indices <- createDataPartition(design_matrix$COD, p = 0.7, list = FALSE)
train_data <- design_matrix[train_indices, ]
test_data <- design_matrix[-train_indices, ]

Machine Learning: Random Forest

Random forests are a powerful machine learning technique, applied here to predict patient survival in cancer cases. A random forest does not rely on a single decision tree but on a multitude of them (a “forest”). Each tree is built on a bootstrap sample of the data (drawn with replacement) and considers a random selection of features at each split.
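
The two sources of randomness and the final vote can be illustrated in a few lines of base R. This is only a sketch of the sampling ideas, not a random forest implementation; n_rows, n_features and the five tree votes are made-up values.

```r
set.seed(123)

n_rows     <- 1000   # hypothetical number of training rows
n_features <- 77     # hypothetical number of encoded predictor columns

# Bootstrap sample: draw n_rows row indices with replacement.
boot_idx <- sample(seq_len(n_rows), size = n_rows, replace = TRUE)

# Rows never drawn are "out-of-bag" for this tree (roughly 37% on average).
oob_fraction <- 1 - length(unique(boot_idx)) / n_rows

# Random feature subset tried at a split; floor(sqrt(p)) is the default
# mtry that randomForest uses for classification.
mtry <- floor(sqrt(n_features))
split_features <- sample(seq_len(n_features), size = mtry)

# The forest's class prediction is the majority vote over its trees.
tree_votes <- c("1", "1", "0", "1", "0")   # made-up votes from five trees
majority <- names(which.max(table(tree_votes)))

print(oob_fraction)
print(split_features)
print(majority)
```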

# Fit the Random Forest model (class probabilities are requested later via predict(..., type = "prob"))
model_rf <- randomForest(COD ~ ., data = train_data)

# Make predictions on the test set
predictions_rf <- predict(model_rf, newdata = test_data)

# Evaluate the model
conf_matrix <- confusionMatrix(predictions_rf, test_data$COD)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  6970  1479
##          1  2591 65338
##                                           
##                Accuracy : 0.9467          
##                  95% CI : (0.9451, 0.9483)
##     No Information Rate : 0.8748          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7439          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.72900         
##             Specificity : 0.97786         
##          Pos Pred Value : 0.82495         
##          Neg Pred Value : 0.96186         
##              Prevalence : 0.12518         
##          Detection Rate : 0.09126         
##    Detection Prevalence : 0.11062         
##       Balanced Accuracy : 0.85343         
##                                           
##        'Positive' Class : 0               
## 
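
The statistics in the output above can be reproduced by hand from the four counts, which helps when reading them: caret treats the first factor level, here class 0 (death), as the "positive" class. The sketch below copies the counts from the printed matrix; the variable names are illustrative.

```r
# Counts copied from the confusion matrix printed above
# (rows = prediction, columns = reference; class 0 is the "positive" class).
cm <- matrix(c(6970, 2591, 1479, 65338), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

tp <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["1", "0"]  # predicted 1, actually 0
fp <- cm["0", "1"]  # predicted 0, actually 1
tn <- cm["1", "1"]  # predicted 1, actually 1

cm_sensitivity <- tp / (tp + fn)          # share of actual class 0 that was detected
cm_specificity <- tn / (tn + fp)          # share of actual class 1 that was detected
cm_ppv         <- tp / (tp + fp)          # positive predictive value
cm_accuracy    <- (tp + tn) / sum(cm)
cm_balanced    <- (cm_sensitivity + cm_specificity) / 2

print(round(c(sensitivity = cm_sensitivity, specificity = cm_specificity,
              ppv = cm_ppv, accuracy = cm_accuracy,
              balanced_accuracy = cm_balanced), 5))
```

Read this way, the model identifies about 73% of the patients who died and about 98% of those who survived, which the overall accuracy alone does not reveal.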
# Plot confusion matrix as a heatmap (rows = predicted class, columns = true class)
conf_table <- as.table(conf_matrix$table)
heatmap(conf_table, 
        Colv = NA, 
        Rowv = NA, 
        col = cm.colors(12),  
        scale = "column",     
        margins = c(10, 10),   
        xlab = "True Class", 
        ylab = "Predicted Class",
        main = "Confusion Matrix Heatmap")

# Heatmap
heatmap_data <- as.data.frame(as.table(conf_matrix))
heatmap <- ggplot(heatmap_data, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) +  # Add text labels
  ggtitle("Random Forest Predictive Model") +  # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(conf_matrix$overall["Accuracy"]))) +  # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5))  # Center subtitle

print(heatmap)

# Get predicted probabilities for each class (ensure type="prob" is used)
predictions_rf_probs <- predict(model_rf, test_data, type = "prob")

# Extract true class labels and convert them to factor
true_class <- as.factor(test_data$COD)

# Convert factor predictions to ordered factors
predictions_order <- ordered(as.numeric(predictions_rf) - 1, levels = c(0, 1))

# Create ROC curve
roc_curve <- roc(true_class, predictions_rf_probs[, "1"])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve (legacy.axes = TRUE puts 1 - specificity on the x-axis, matching the label)
plot(roc_curve, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", legacy.axes = TRUE, main = "ROC Curve", xlab = "1 - Specificity", ylab = "Sensitivity")
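
The area under the ROC curve (AUC) has a direct interpretation: it is the probability that a randomly chosen case receives a higher score than a randomly chosen control. The following base-R sketch checks this on simulated scores; the data and names are hypothetical, and pROC is not needed to run it.

```r
# Illustrative sketch only: AUC as a rank statistic, checked against brute force.
set.seed(123)

# Hypothetical scores: cases (label 1) tend to score higher than controls (label 0).
labels <- c(rep(0, 200), rep(1, 300))
scores <- c(rnorm(200, mean = 0), rnorm(300, mean = 1))

n1 <- sum(labels == 1)
n0 <- sum(labels == 0)

# Rank-based (Mann-Whitney) formula for the AUC.
auc_rank <- (sum(rank(scores)[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Brute-force check: share of (case, control) pairs where the case scores higher.
auc_pairs <- mean(outer(scores[labels == 1], scores[labels == 0], ">"))

cat("AUC (rank formula):", round(auc_rank, 4), "\n")
cat("AUC (pairwise)    :", round(auc_pairs, 4), "\n")
```

Unlike accuracy, this measure does not depend on the 0.5 cut-off or on the class balance, which makes it a useful complement when one class is much rarer than the other.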

Machine Learning: Logistic Regression

Logistic regression is a statistical model used to analyze the relationship between a binary outcome variable and one or more independent variables. It estimates the probability that the outcome falls in a particular category (usually coded as 0 or 1) based on the values of the independent variables. The model applies the logistic function to a linear combination of the predictors, which constrains the predicted probabilities between 0 and 1 and makes it suitable for binary classification tasks such as the alive/dead outcome studied here. In R, logistic regression can be implemented using the glm() function with a binomial family distribution.
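
As a minimal sketch of this idea, the example below simulates a single predictor with known coefficients, fits the model with glm(), and confirms that the predicted probabilities are the logistic function of the linear predictor. The data, coefficients, and variable names are hypothetical.

```r
# Illustrative sketch on simulated data; names and numbers are hypothetical.
set.seed(123)

# The logistic (sigmoid) function maps any real number into (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Simulate one predictor and a binary outcome with known coefficients.
n <- 2000
x <- rnorm(n)
true_intercept <- -0.5
true_slope <- 1.5
y <- rbinom(n, size = 1, prob = logistic(true_intercept + true_slope * x))

# Fit with glm() and a binomial family.
fit <- glm(y ~ x, family = binomial)

# type = "link" returns the linear predictor; type = "response" returns probabilities.
eta   <- predict(fit, type = "link")
p_hat <- predict(fit, type = "response")

print(coef(fit))
print(all.equal(unname(p_hat), unname(logistic(eta))))
```

The fitted coefficients are on the log-odds scale, so exp(coef(fit)) gives odds ratios, which is one reason the model is easier to interpret than the other methods used in this report.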

# Train the logistic regression model
logistic_model <- glm(COD ~ ., data = train_data, family = binomial)

# Make predictions on the test set
predictions_logistic <- predict(logistic_model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# Convert predicted probabilities to class labels
predicted_class <- ifelse(predictions_logistic > 0.5, 1, 0)

# Evaluate the model
confusion_matrix <- table(predicted_class, test_data$COD)
print(confusion_matrix)
##                
## predicted_class     0     1
##               0  5755  1538
##               1  3806 65279
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.9300322082275"
# Plot the confusion matrix as a heatmap (rows = predicted class, columns = true class)
heatmap(confusion_matrix, 
        Colv = NA, 
        Rowv = NA, 
        col = cm.colors(12),  # Color palette for heatmap
        scale = "column",     # Scale values within each column (true class)
        margins = c(10, 10),  # Add extra space for row and column names
        xlab = "True Class", 
        ylab = "Predicted Class",
        main = "Confusion Matrix Heatmap")

# Heatmap
heatmap_data <- as.data.frame(as.table(confusion_matrix))
heatmap <- ggplot(heatmap_data, aes(x = predicted_class, y = Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) +  # Add text labels
  ggtitle("Logistic Regression Predictive Model") +  # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(accuracy))) +  # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5))  # Center subtitle

print(heatmap)

# Calculate AUC ROC
roc_curve <- roc(test_data$COD, predictions_logistic)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(roc_curve)
## 
## Call:
## roc.default(response = test_data$COD, predictor = predictions_logistic)
## 
## Data: predictions_logistic in 9561 controls (test_data$COD 0) < 66817 cases (test_data$COD 1).
## Area under the curve: 0.9291
# Plot the ROC curve
plot(roc_curve, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", main = "ROC Curve")
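
The "prediction from rank-deficient fit" warning reported by predict() above is consistent with full one-hot encoding: when every level of a factor gets its own indicator column, the indicators sum to one in each row and duplicate the intercept. The toy example below, in base R with hypothetical data, reproduces the effect and shows that dropping one reference level removes it. In caret, dummyVars() has a fullRank argument that drops one level per factor when set to TRUE; whether that would change the accuracy reported here has not been tested.

```r
# Toy illustration (hypothetical data), base R only.
set.seed(123)
stage <- factor(sample(c("I", "II", "III"), 200, replace = TRUE))
y <- rbinom(200, size = 1, prob = 0.7)

# Full one-hot encoding: one indicator per level, no reference level dropped.
one_hot <- model.matrix(~ stage - 1)
colnames(one_hot) <- c("stage_I", "stage_II", "stage_III")
toy <- data.frame(one_hot, y = y)

# The indicators sum to 1 in every row, duplicating the intercept column,
# so glm() cannot estimate one coefficient and reports it as NA.
fit_full <- glm(y ~ ., data = toy, family = binomial)
print(coef(fit_full))

# Dropping one indicator (a reference level) restores full rank.
fit_ref <- glm(y ~ stage_II + stage_III, data = toy, family = binomial)
print(coef(fit_ref))
```

The predictions of the two fits are the same; the difference is that the full-rank version has well-defined coefficients and does not trigger the warning.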

Data Preparation for Survival model

# Prepare data: locate the outcome columns ("COD" and "Survival months")
cod_column_index_1 <- which(names(BREAST_DF_surv_clean_no_missing) %in% c("COD", "Survival months"))


# Exclude the outcome columns from the predictors
#data_without_cod <- BREAST_DF_surv_clean[, -cod_column_index]
data_without_cod_1 <- BREAST_DF_surv_clean_no_missing[, -cod_column_index_1]

# Perform one-hot encoding
encoded_data_1 <- dummyVars(" ~ .", data = data_without_cod_1)

# Create the design matrix with encoded data
design_matrix_1 <- predict(encoded_data_1, newdata = data_without_cod_1)

# Add the target variables (survival time and status) back to the design matrix.
# Building a data frame directly (rather than cbind on a matrix) keeps the column types intact.
design_matrix_1 <- data.frame(design_matrix_1, 
                       Time = BREAST_DF_surv_clean_no_missing$`Survival months`, 
                       Status = BREAST_DF_surv_clean_no_missing$COD)

# Split the data into training and testing sets
set.seed(123)  # for reproducibility
train_indices_1 <- createDataPartition(design_matrix_1$Status, p = 0.7, list = FALSE)
train_data_1 <- design_matrix_1[train_indices_1, ]
test_data_1 <- design_matrix_1[-train_indices_1, ]

Deep Neural Network (DNN)

A deep neural network is a machine learning model capable of capturing complex, non-linear patterns in patient data to predict the likelihood of an event (e.g., death). In a binary classification task such as the alive/dead outcome here, the network consists of multiple layers of interconnected nodes (neurons) that transform the input features into a predicted probability of the event of interest. Deep learning also includes specialized architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), but the model below is a plain fully connected (dense) network. Such networks are trained with gradient-based optimizers, such as stochastic gradient descent (SGD) or Adam, to minimize the prediction error. In R, they can be implemented using libraries like keras or tensorflow, allowing for flexible modeling and customization.
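
To make the layer-by-layer computation concrete, the sketch below runs a forward pass through a small dense network in base R and evaluates the binary cross-entropy loss. The weights are random and the inputs are toy values, so it only illustrates the arithmetic, not a trained model; the helper names and sizes are hypothetical, apart from the 64 and 32 hidden units that mirror the architecture defined below.

```r
# Illustrative sketch with random weights and toy input; base R only.
set.seed(123)

relu    <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))

# One dense layer: affine transform (x %*% w + b) followed by an activation.
dense <- function(x, w, b, activation) activation(sweep(x %*% w, 2, b, "+"))

n_obs <- 5    # hypothetical batch size
n_in  <- 10   # hypothetical number of input features
x <- matrix(rnorm(n_obs * n_in), nrow = n_obs)

# Weights for a 10 -> 64 -> 32 -> 1 network.
w1 <- matrix(rnorm(n_in * 64, sd = 0.1), n_in, 64); b1 <- rep(0, 64)
w2 <- matrix(rnorm(64 * 32, sd = 0.1), 64, 32);     b2 <- rep(0, 32)
w3 <- matrix(rnorm(32 * 1, sd = 0.1), 32, 1);       b3 <- 0

h1 <- dense(x, w1, b1, relu)       # first hidden layer
h2 <- dense(h1, w2, b2, relu)      # second hidden layer
p  <- dense(h2, w3, b3, sigmoid)   # predicted probability of class 1

# Binary cross-entropy, the loss minimized during training.
y <- c(1, 0, 1, 1, 0)
bce <- -mean(y * log(p) + (1 - y) * log(1 - p))

print(round(as.vector(p), 4))
cat("Binary cross-entropy:", round(bce, 4), "\n")
```

Training consists of repeatedly adjusting the weight matrices so that this loss decreases on the training data, which keras handles automatically through backpropagation.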

# Load required libraries
library(keras)
library(survival)
library(survMisc)  # For cindex() function
## 
## Attaching package: 'survMisc'
## The following object is masked from 'package:pROC':
## 
##     ci
## The following object is masked from 'package:R.utils':
## 
##     asLong
## The following object is masked from 'package:ggplot2':
## 
##     autoplot
library(reticulate)
#use_python("C:/Users/kohya/AppData/Local/Programs/Python/Python37")
# Define the neural network architecture
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(train_data) - 1) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# Compile the model
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_adam(),
  metrics = c("accuracy")
)

# Train the model
history <- model %>% fit(
  x = as.matrix(train_data[, -ncol(train_data)]),  # Features
  y = as.numeric(train_data$COD) - 1,  # Target variable (convert to 0-based index)
  epochs = 100,
  batch_size = 32,
  validation_split = 0.2
)
## Epoch 1/100
## 4456/4456 - 23s - loss: 0.1812 - accuracy: 0.9382 - val_loss: 0.1829 - val_accuracy: 0.9422 - 23s/epoch - 5ms/step
## Epoch 2/100
## 4456/4456 - 19s - loss: 0.1694 - accuracy: 0.9434 - val_loss: 0.1695 - val_accuracy: 0.9446 - 19s/epoch - 4ms/step
## Epoch 3/100
## 4456/4456 - 20s - loss: 0.1658 - accuracy: 0.9448 - val_loss: 0.1697 - val_accuracy: 0.9466 - 20s/epoch - 4ms/step
## Epoch 4/100
## 4456/4456 - 19s - loss: 0.1638 - accuracy: 0.9457 - val_loss: 0.1649 - val_accuracy: 0.9474 - 19s/epoch - 4ms/step
## Epoch 5/100
## 4456/4456 - 19s - loss: 0.1623 - accuracy: 0.9459 - val_loss: 0.1657 - val_accuracy: 0.9470 - 19s/epoch - 4ms/step
## Epoch 6/100
## 4456/4456 - 21s - loss: 0.1613 - accuracy: 0.9461 - val_loss: 0.1644 - val_accuracy: 0.9469 - 21s/epoch - 5ms/step
## Epoch 7/100
## 4456/4456 - 20s - loss: 0.1602 - accuracy: 0.9468 - val_loss: 0.1679 - val_accuracy: 0.9453 - 20s/epoch - 4ms/step
## Epoch 8/100
## 4456/4456 - 20s - loss: 0.1594 - accuracy: 0.9472 - val_loss: 0.1637 - val_accuracy: 0.9473 - 20s/epoch - 4ms/step
## Epoch 9/100
## 4456/4456 - 20s - loss: 0.1587 - accuracy: 0.9473 - val_loss: 0.1651 - val_accuracy: 0.9484 - 20s/epoch - 5ms/step
## Epoch 10/100
## 4456/4456 - 20s - loss: 0.1582 - accuracy: 0.9475 - val_loss: 0.1654 - val_accuracy: 0.9472 - 20s/epoch - 4ms/step
## Epoch 11/100
## 4456/4456 - 19s - loss: 0.1575 - accuracy: 0.9475 - val_loss: 0.1674 - val_accuracy: 0.9472 - 19s/epoch - 4ms/step
## Epoch 12/100
## 4456/4456 - 20s - loss: 0.1568 - accuracy: 0.9483 - val_loss: 0.1627 - val_accuracy: 0.9487 - 20s/epoch - 4ms/step
## Epoch 13/100
## 4456/4456 - 20s - loss: 0.1564 - accuracy: 0.9480 - val_loss: 0.1631 - val_accuracy: 0.9489 - 20s/epoch - 4ms/step
## Epoch 14/100
## 4456/4456 - 20s - loss: 0.1558 - accuracy: 0.9481 - val_loss: 0.1645 - val_accuracy: 0.9479 - 20s/epoch - 4ms/step
## Epoch 15/100
## 4456/4456 - 21s - loss: 0.1552 - accuracy: 0.9485 - val_loss: 0.1695 - val_accuracy: 0.9476 - 21s/epoch - 5ms/step
## Epoch 16/100
## 4456/4456 - 21s - loss: 0.1546 - accuracy: 0.9486 - val_loss: 0.1682 - val_accuracy: 0.9475 - 21s/epoch - 5ms/step
## Epoch 17/100
## 4456/4456 - 20s - loss: 0.1544 - accuracy: 0.9484 - val_loss: 0.1663 - val_accuracy: 0.9482 - 20s/epoch - 4ms/step
## Epoch 18/100
## 4456/4456 - 20s - loss: 0.1539 - accuracy: 0.9490 - val_loss: 0.1668 - val_accuracy: 0.9482 - 20s/epoch - 5ms/step
## Epoch 19/100
## 4456/4456 - 21s - loss: 0.1535 - accuracy: 0.9491 - val_loss: 0.1682 - val_accuracy: 0.9480 - 21s/epoch - 5ms/step
## Epoch 20/100
## 4456/4456 - 20s - loss: 0.1529 - accuracy: 0.9494 - val_loss: 0.1708 - val_accuracy: 0.9470 - 20s/epoch - 5ms/step
## Epoch 21/100
## 4456/4456 - 20s - loss: 0.1527 - accuracy: 0.9493 - val_loss: 0.1693 - val_accuracy: 0.9475 - 20s/epoch - 4ms/step
## Epoch 22/100
## 4456/4456 - 20s - loss: 0.1522 - accuracy: 0.9494 - val_loss: 0.1670 - val_accuracy: 0.9481 - 20s/epoch - 4ms/step
## Epoch 23/100
## 4456/4456 - 20s - loss: 0.1517 - accuracy: 0.9497 - val_loss: 0.1673 - val_accuracy: 0.9483 - 20s/epoch - 4ms/step
## Epoch 24/100
## 4456/4456 - 20s - loss: 0.1515 - accuracy: 0.9500 - val_loss: 0.1794 - val_accuracy: 0.9450 - 20s/epoch - 4ms/step
## Epoch 25/100
## 4456/4456 - 20s - loss: 0.1512 - accuracy: 0.9499 - val_loss: 0.1711 - val_accuracy: 0.9483 - 20s/epoch - 5ms/step
## Epoch 26/100
## 4456/4456 - 19s - loss: 0.1505 - accuracy: 0.9503 - val_loss: 0.1746 - val_accuracy: 0.9469 - 19s/epoch - 4ms/step
## Epoch 27/100
## 4456/4456 - 21s - loss: 0.1505 - accuracy: 0.9503 - val_loss: 0.1735 - val_accuracy: 0.9471 - 21s/epoch - 5ms/step
## Epoch 28/100
## 4456/4456 - 19s - loss: 0.1502 - accuracy: 0.9504 - val_loss: 0.1741 - val_accuracy: 0.9478 - 19s/epoch - 4ms/step
## Epoch 29/100
## 4456/4456 - 20s - loss: 0.1494 - accuracy: 0.9506 - val_loss: 0.1760 - val_accuracy: 0.9473 - 20s/epoch - 5ms/step
## Epoch 30/100
## 4456/4456 - 20s - loss: 0.1493 - accuracy: 0.9510 - val_loss: 0.1761 - val_accuracy: 0.9478 - 20s/epoch - 5ms/step
## Epoch 31/100
## 4456/4456 - 20s - loss: 0.1489 - accuracy: 0.9508 - val_loss: 0.1768 - val_accuracy: 0.9486 - 20s/epoch - 5ms/step
## Epoch 32/100
## 4456/4456 - 20s - loss: 0.1486 - accuracy: 0.9513 - val_loss: 0.1793 - val_accuracy: 0.9479 - 20s/epoch - 5ms/step
## Epoch 33/100
## 4456/4456 - 20s - loss: 0.1483 - accuracy: 0.9515 - val_loss: 0.1815 - val_accuracy: 0.9487 - 20s/epoch - 4ms/step
## Epoch 34/100
## 4456/4456 - 20s - loss: 0.1479 - accuracy: 0.9512 - val_loss: 0.1812 - val_accuracy: 0.9470 - 20s/epoch - 5ms/step
## Epoch 35/100
## 4456/4456 - 20s - loss: 0.1474 - accuracy: 0.9517 - val_loss: 0.1817 - val_accuracy: 0.9465 - 20s/epoch - 4ms/step
## Epoch 36/100
## 4456/4456 - 20s - loss: 0.1471 - accuracy: 0.9518 - val_loss: 0.1847 - val_accuracy: 0.9462 - 20s/epoch - 5ms/step
## Epoch 37/100
## 4456/4456 - 21s - loss: 0.1469 - accuracy: 0.9520 - val_loss: 0.1862 - val_accuracy: 0.9484 - 21s/epoch - 5ms/step
## Epoch 38/100
## 4456/4456 - 20s - loss: 0.1465 - accuracy: 0.9518 - val_loss: 0.1867 - val_accuracy: 0.9468 - 20s/epoch - 5ms/step
## Epoch 39/100
## 4456/4456 - 20s - loss: 0.1466 - accuracy: 0.9520 - val_loss: 0.1941 - val_accuracy: 0.9466 - 20s/epoch - 5ms/step
## Epoch 40/100
## 4456/4456 - 21s - loss: 0.1461 - accuracy: 0.9522 - val_loss: 0.1905 - val_accuracy: 0.9465 - 21s/epoch - 5ms/step
## Epoch 41/100
## 4456/4456 - 20s - loss: 0.1458 - accuracy: 0.9521 - val_loss: 0.1892 - val_accuracy: 0.9471 - 20s/epoch - 4ms/step
## Epoch 42/100
## 4456/4456 - 21s - loss: 0.1452 - accuracy: 0.9519 - val_loss: 0.1883 - val_accuracy: 0.9480 - 21s/epoch - 5ms/step
## Epoch 43/100
## 4456/4456 - 21s - loss: 0.1453 - accuracy: 0.9524 - val_loss: 0.1928 - val_accuracy: 0.9475 - 21s/epoch - 5ms/step
## Epoch 44/100
## 4456/4456 - 20s - loss: 0.1450 - accuracy: 0.9525 - val_loss: 0.1945 - val_accuracy: 0.9469 - 20s/epoch - 5ms/step
## Epoch 45/100
## 4456/4456 - 21s - loss: 0.1448 - accuracy: 0.9525 - val_loss: 0.1958 - val_accuracy: 0.9457 - 21s/epoch - 5ms/step
## Epoch 46/100
## 4456/4456 - 20s - loss: 0.1443 - accuracy: 0.9526 - val_loss: 0.2016 - val_accuracy: 0.9473 - 20s/epoch - 5ms/step
## Epoch 47/100
## 4456/4456 - 20s - loss: 0.1441 - accuracy: 0.9527 - val_loss: 0.1975 - val_accuracy: 0.9477 - 20s/epoch - 4ms/step
## Epoch 48/100
## 4456/4456 - 20s - loss: 0.1443 - accuracy: 0.9527 - val_loss: 0.2026 - val_accuracy: 0.9478 - 20s/epoch - 4ms/step
## Epoch 49/100
## 4456/4456 - 20s - loss: 0.1438 - accuracy: 0.9529 - val_loss: 0.2021 - val_accuracy: 0.9448 - 20s/epoch - 4ms/step
## Epoch 50/100
## 4456/4456 - 20s - loss: 0.1434 - accuracy: 0.9531 - val_loss: 0.2044 - val_accuracy: 0.9460 - 20s/epoch - 4ms/step
## Epoch 51/100
## 4456/4456 - 20s - loss: 0.1431 - accuracy: 0.9532 - val_loss: 0.2069 - val_accuracy: 0.9475 - 20s/epoch - 4ms/step
## Epoch 52/100
## 4456/4456 - 20s - loss: 0.1429 - accuracy: 0.9534 - val_loss: 0.2089 - val_accuracy: 0.9459 - 20s/epoch - 5ms/step
## Epoch 53/100
## 4456/4456 - 20s - loss: 0.1429 - accuracy: 0.9531 - val_loss: 0.2122 - val_accuracy: 0.9464 - 20s/epoch - 4ms/step
## Epoch 54/100
## 4456/4456 - 20s - loss: 0.1427 - accuracy: 0.9536 - val_loss: 0.2056 - val_accuracy: 0.9451 - 20s/epoch - 4ms/step
## Epoch 55/100
## 4456/4456 - 21s - loss: 0.1424 - accuracy: 0.9536 - val_loss: 0.2066 - val_accuracy: 0.9464 - 21s/epoch - 5ms/step
## Epoch 56/100
## 4456/4456 - 21s - loss: 0.1421 - accuracy: 0.9533 - val_loss: 0.2112 - val_accuracy: 0.9449 - 21s/epoch - 5ms/step
## Epoch 57/100
## 4456/4456 - 20s - loss: 0.1420 - accuracy: 0.9538 - val_loss: 0.2113 - val_accuracy: 0.9462 - 20s/epoch - 5ms/step
## Epoch 58/100
## 4456/4456 - 21s - loss: 0.1417 - accuracy: 0.9537 - val_loss: 0.2226 - val_accuracy: 0.9451 - 21s/epoch - 5ms/step
## Epoch 59/100
## 4456/4456 - 20s - loss: 0.1415 - accuracy: 0.9540 - val_loss: 0.2131 - val_accuracy: 0.9464 - 20s/epoch - 4ms/step
## Epoch 60/100
## 4456/4456 - 20s - loss: 0.1415 - accuracy: 0.9540 - val_loss: 0.2164 - val_accuracy: 0.9457 - 20s/epoch - 4ms/step
## Epoch 61/100
## 4456/4456 - 20s - loss: 0.1412 - accuracy: 0.9539 - val_loss: 0.2189 - val_accuracy: 0.9468 - 20s/epoch - 5ms/step
## Epoch 62/100
## 4456/4456 - 20s - loss: 0.1412 - accuracy: 0.9539 - val_loss: 0.2237 - val_accuracy: 0.9439 - 20s/epoch - 5ms/step
## Epoch 63/100
## 4456/4456 - 21s - loss: 0.1409 - accuracy: 0.9545 - val_loss: 0.2208 - val_accuracy: 0.9458 - 21s/epoch - 5ms/step
## Epoch 64/100
## 4456/4456 - 19s - loss: 0.1406 - accuracy: 0.9542 - val_loss: 0.2237 - val_accuracy: 0.9441 - 19s/epoch - 4ms/step
## Epoch 65/100
## 4456/4456 - 20s - loss: 0.1406 - accuracy: 0.9546 - val_loss: 0.2315 - val_accuracy: 0.9474 - 20s/epoch - 5ms/step
## Epoch 66/100
## 4456/4456 - 20s - loss: 0.1404 - accuracy: 0.9546 - val_loss: 0.2286 - val_accuracy: 0.9450 - 20s/epoch - 4ms/step
## Epoch 67/100
## 4456/4456 - 20s - loss: 0.1398 - accuracy: 0.9549 - val_loss: 0.2272 - val_accuracy: 0.9453 - 20s/epoch - 4ms/step
## Epoch 68/100
## 4456/4456 - 19s - loss: 0.1403 - accuracy: 0.9546 - val_loss: 0.2265 - val_accuracy: 0.9465 - 19s/epoch - 4ms/step
## Epoch 69/100
## 4456/4456 - 20s - loss: 0.1401 - accuracy: 0.9547 - val_loss: 0.2256 - val_accuracy: 0.9466 - 20s/epoch - 5ms/step
## Epoch 70/100
## 4456/4456 - 20s - loss: 0.1396 - accuracy: 0.9552 - val_loss: 0.2301 - val_accuracy: 0.9465 - 20s/epoch - 4ms/step
## Epoch 71/100
## 4456/4456 - 20s - loss: 0.1400 - accuracy: 0.9547 - val_loss: 0.2389 - val_accuracy: 0.9440 - 20s/epoch - 5ms/step
## Epoch 72/100
## 4456/4456 - 20s - loss: 0.1395 - accuracy: 0.9553 - val_loss: 0.2336 - val_accuracy: 0.9456 - 20s/epoch - 4ms/step
## Epoch 73/100
## 4456/4456 - 20s - loss: 0.1395 - accuracy: 0.9551 - val_loss: 0.2350 - val_accuracy: 0.9453 - 20s/epoch - 5ms/step
## Epoch 74/100
## 4456/4456 - 20s - loss: 0.1392 - accuracy: 0.9552 - val_loss: 0.2404 - val_accuracy: 0.9448 - 20s/epoch - 4ms/step
## Epoch 75/100
## 4456/4456 - 20s - loss: 0.1391 - accuracy: 0.9552 - val_loss: 0.2386 - val_accuracy: 0.9453 - 20s/epoch - 4ms/step
## Epoch 76/100
## 4456/4456 - 20s - loss: 0.1391 - accuracy: 0.9551 - val_loss: 0.2493 - val_accuracy: 0.9419 - 20s/epoch - 4ms/step
## Epoch 77/100
## 4456/4456 - 20s - loss: 0.1390 - accuracy: 0.9553 - val_loss: 0.2429 - val_accuracy: 0.9456 - 20s/epoch - 4ms/step
## Epoch 78/100
## 4456/4456 - 20s - loss: 0.1390 - accuracy: 0.9551 - val_loss: 0.2468 - val_accuracy: 0.9456 - 20s/epoch - 4ms/step
## Epoch 79/100
## 4456/4456 - 19s - loss: 0.1384 - accuracy: 0.9554 - val_loss: 0.2519 - val_accuracy: 0.9447 - 19s/epoch - 4ms/step
## Epoch 80/100
## 4456/4456 - 15s - loss: 0.1384 - accuracy: 0.9555 - val_loss: 0.2506 - val_accuracy: 0.9408 - 15s/epoch - 3ms/step
## Epoch 81/100
## 4456/4456 - 7s - loss: 0.1388 - accuracy: 0.9554 - val_loss: 0.2428 - val_accuracy: 0.9442 - 7s/epoch - 2ms/step
## Epoch 82/100
## 4456/4456 - 8s - loss: 0.1383 - accuracy: 0.9551 - val_loss: 0.2478 - val_accuracy: 0.9455 - 8s/epoch - 2ms/step
## Epoch 83/100
## 4456/4456 - 8s - loss: 0.1380 - accuracy: 0.9555 - val_loss: 0.2471 - val_accuracy: 0.9446 - 8s/epoch - 2ms/step
## Epoch 84/100
## 4456/4456 - 8s - loss: 0.1379 - accuracy: 0.9554 - val_loss: 0.2560 - val_accuracy: 0.9431 - 8s/epoch - 2ms/step
## Epoch 85/100
## 4456/4456 - 9s - loss: 0.1379 - accuracy: 0.9554 - val_loss: 0.2597 - val_accuracy: 0.9423 - 9s/epoch - 2ms/step
## Epoch 86/100
## 4456/4456 - 11s - loss: 0.1379 - accuracy: 0.9555 - val_loss: 0.2618 - val_accuracy: 0.9439 - 11s/epoch - 2ms/step
## Epoch 87/100
## 4456/4456 - 8s - loss: 0.1377 - accuracy: 0.9554 - val_loss: 0.2538 - val_accuracy: 0.9451 - 8s/epoch - 2ms/step
## Epoch 88/100
## 4456/4456 - 8s - loss: 0.1377 - accuracy: 0.9557 - val_loss: 0.2540 - val_accuracy: 0.9448 - 8s/epoch - 2ms/step
## Epoch 89/100
## 4456/4456 - 8s - loss: 0.1370 - accuracy: 0.9556 - val_loss: 0.2601 - val_accuracy: 0.9452 - 8s/epoch - 2ms/step
## Epoch 90/100
## 4456/4456 - 9s - loss: 0.1373 - accuracy: 0.9559 - val_loss: 0.2617 - val_accuracy: 0.9436 - 9s/epoch - 2ms/step
## Epoch 91/100
## 4456/4456 - 10s - loss: 0.1373 - accuracy: 0.9555 - val_loss: 0.2802 - val_accuracy: 0.9440 - 10s/epoch - 2ms/step
## Epoch 92/100
## 4456/4456 - 10s - loss: 0.1373 - accuracy: 0.9556 - val_loss: 0.2708 - val_accuracy: 0.9427 - 10s/epoch - 2ms/step
## Epoch 93/100
## 4456/4456 - 10s - loss: 0.1367 - accuracy: 0.9558 - val_loss: 0.2617 - val_accuracy: 0.9449 - 10s/epoch - 2ms/step
## Epoch 94/100
## 4456/4456 - 10s - loss: 0.1371 - accuracy: 0.9560 - val_loss: 0.2858 - val_accuracy: 0.9407 - 10s/epoch - 2ms/step
## Epoch 95/100
## 4456/4456 - 9s - loss: 0.1369 - accuracy: 0.9561 - val_loss: 0.2771 - val_accuracy: 0.9432 - 9s/epoch - 2ms/step
## Epoch 96/100
## 4456/4456 - 9s - loss: 0.1364 - accuracy: 0.9561 - val_loss: 0.2710 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 97/100
## 4456/4456 - 9s - loss: 0.1370 - accuracy: 0.9556 - val_loss: 0.2697 - val_accuracy: 0.9442 - 9s/epoch - 2ms/step
## Epoch 98/100
## 4456/4456 - 9s - loss: 0.1362 - accuracy: 0.9560 - val_loss: 0.2771 - val_accuracy: 0.9419 - 9s/epoch - 2ms/step
## Epoch 99/100
## 4456/4456 - 9s - loss: 0.1362 - accuracy: 0.9559 - val_loss: 0.2728 - val_accuracy: 0.9440 - 9s/epoch - 2ms/step
## Epoch 100/100
## 4456/4456 - 9s - loss: 0.1362 - accuracy: 0.9560 - val_loss: 0.2760 - val_accuracy: 0.9435 - 9s/epoch - 2ms/step
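
In the log above, the training loss keeps falling while the validation loss reaches its lowest value at epoch 12 and then rises, which suggests overfitting in the later epochs. One common remedy is early stopping; the keras package provides callback_early_stopping() for this. The base-R sketch below only illustrates the stopping rule on the first 20 validation losses copied from the log; the helper function and the patience value are hypothetical.

```r
# Validation losses for epochs 1-20, copied from the training log above.
val_loss <- c(0.1829, 0.1695, 0.1697, 0.1649, 0.1657, 0.1644, 0.1679, 0.1637,
              0.1651, 0.1654, 0.1674, 0.1627, 0.1631, 0.1645, 0.1695, 0.1682,
              0.1663, 0.1668, 0.1682, 0.1708)

# Hypothetical helper: stop once `patience` epochs pass without a new best loss.
early_stop_epoch <- function(losses, patience = 5) {
  best <- Inf
  best_epoch <- 0
  for (epoch in seq_along(losses)) {
    if (losses[epoch] < best) {
      best <- losses[epoch]
      best_epoch <- epoch
    } else if (epoch - best_epoch >= patience) {
      return(list(stop_epoch = epoch, best_epoch = best_epoch, best_loss = best))
    }
  }
  list(stop_epoch = length(losses), best_epoch = best_epoch, best_loss = best)
}

result <- early_stop_epoch(val_loss, patience = 5)
print(result)
# With patience = 5, training would stop at epoch 17 and keep the epoch 12 weights.
```

Under this rule most of the 100 epochs would be skipped, which would also address the long training time noted for this model.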
# Evaluate the model
metrics <- model %>% evaluate(
  x = as.matrix(test_data[, -ncol(test_data)]),  # Features
  y = as.numeric(test_data$COD) - 1,  # Target variable (convert to 0-based index)
  verbose = 0
)

# Print evaluation metrics
cat("Test Loss:", metrics["loss"], "\n")
## Test Loss: 0.2963572
cat("Test Accuracy:", metrics["accuracy"], "\n")
## Test Accuracy: 0.9421561
# Predictions on test data
predictions <- model %>% predict(as.matrix(test_data[, -ncol(test_data)]))
## 2387/2387 - 3s - 3s/epoch - 1ms/step
predictions <- ifelse(predictions > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(Actual = as.numeric(test_data$COD) - 1, Predicted = predictions)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(conf_matrix)
##       Predicted
## Actual     0     1
##      0  6741  2820
##      1  1598 65219
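# Accuracy, sensitivity, and specificity
# Note: class 1 (Alive) is treated as the positive class here, whereas the caret
# output for the Random Forest above used class 0 as the positive class.
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity <- conf_matrix[2, 2] / sum(conf_matrix[2, ])  # true positive rate for class 1
specificity <- conf_matrix[1, 1] / sum(conf_matrix[1, ])  # true negative rate for class 0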
paste("Accuracy:",accuracy)
## [1] "Accuracy: 0.942156118253948"
paste("Sensitivity:", sensitivity)
## [1] "Sensitivity: 0.97608393073619"
paste("Specificity:", specificity)
## [1] "Specificity: 0.70505177282711"
# Calculate overall accuracy
overall_accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Heatmap
heatmap_data <- as.data.frame(conf_matrix)
heatmap <- ggplot(heatmap_data, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) +  # Add text labels
  ggtitle("Deep NN Predictive Model") +  # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(overall_accuracy))) +  # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5))  # Center subtitle

print(heatmap)

# Plot ROC curve
# Use the predicted probabilities (as a numeric vector) rather than the thresholded 0/1 labels
pred_probs <- as.numeric(model %>% predict(as.matrix(test_data[, -ncol(test_data)])))
roc_data <- roc(test_data$COD, pred_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#plot(roc_data, main = "ROC Curve", col = "blue")
plot(roc_data, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", main = "ROC Curve")

Conclusion:

In this project, I aimed to predict the survival of patients with breast cancer with more than 96% accuracy, knowing that the survival rate is 75%. The goal was to apply machine learning, the available resources, and the techniques learned in DATA606 and DATA607 to this complex problem. I used the SEER database for 2011 to 2015, comprising over 300,000 cases, to predict patient survival based on 16 critical indicators, including race, household income, cancer type, treatment, time to treatment, number of tumors, and more. Preliminary exploratory data analysis identified these key indicators from a pool of 36, followed by data cleaning and organization for the machine learning tasks. Various R packages were employed for data cleaning, type conversion, handling missing values, and database organization. In addition, ggplot visualizations, chi-square tests, Fisher tests, and other R tools were used to explore associations between the numeric and categorical variables and the target of interest, Alive/Death.

Initially, the intention was to include all three categories, Alive/Death/Other, but it later became clear that including the “Other” category rendered the analysis irrelevant. The analysis therefore focused solely on Alive/Death, as breast cancer was the primary cause of death even if patients had other conditions.

A range of machine learning algorithms was applied, from Logistic Regression and Random Forest to more sophisticated methods such as a DNN. Overall, the project demonstrated that even individuals with limited domain knowledge can use available resources to predict cancer patient outcomes with approximately 94% accuracy. However, further work, such as stratification, analysis of parameter importance, and additional data gathering, could improve accuracy and offer significant contributions to the healthcare industry, patient care, and family circumstances.

Despite the complexities associated with managing different packages and large databases, I enjoyed exploring new concepts and learning how different methods can be employed. Particularly, I gained insights into the significance of encoding and its impact on survival model performance. While this analysis lacks the rigor of academic research, it underscores the potential of machine learning in addressing complex problems, paving the way for future exploration and study.

In summary, among the developed models, Logistic Regression was the simplest and fastest, achieving 93% accuracy, followed by Random Forest, which reached about 94.7% on the test set. The neural network was also successful (about 94.2% test accuracy) but was time-consuming and carries black-box risks. For future iterations, I would focus on Logistic Regression and Random Forest, dedicating more time to encoding, data preparation, stratification, and parameter stress testing to potentially improve accuracy.

This project highlights the potential of machine learning for patient survival prediction, even for individuals with limited domain knowledge. However, further research is needed on the limitations noted above, such as stratification, parameter importance, and additional data gathering.

By addressing these limitations, future studies can contribute significantly to personalized medicine, patient care planning, and supporting families facing this challenging diagnosis.

Acknowledgement:

I would like to thank the professors in both DATA606 and DATA607, as well as the students in the classes, who made the courses interesting and challenging. I have learned a lot and dealt with many challenges throughout these courses, despite having little specific background in data science beforehand. The course content was carefully chosen to help students like me develop an understanding of the topic and find enjoyment in the learning process.

References:

[1] SEER (https://seer.cancer.gov/data/access.html)

[2] zgalochkina/SEER_solid_tumor: R code for SEER data analysis of solid tumor in different populations (github.com)

[3] XAI_Healthcare_eXplainable_AI_in_Healthcare.pdf (upc.edu)

[4] Pargent, F., Pfisterer, F., Thomas, J., Bischl, B.: Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Computational Statistics 37(5), 2671–2692 (Nov 2022)

[5] American Cancer Society - Breast Cancer Survival Rates

Surveillance, Epidemiology, and End Results Program. 2023. “SEER*stat Database: Incidence - SEER Research Data, 8 Registries, Nov 2021 Sub (1975-2020) - Linked to County Attributes - Time Dependent (1990-2020) Income/Rurality, 1969-2020 Counties.” National Cancer Institute, DCCPS, Surveillance Research Program, released April 2023, based on the November 2022 submission. https://seer.cancer.gov/data/citation.html.