Data Science Project - Breast Cancer Survival with SEER data

Data Preparation

In this project, I have chosen to work on breast cancer. There are various resources available on this topic, with the Surveillance, Epidemiology, and End Results (SEER) [1] program being the most reliable one.

The SEER Program of the National Cancer Institute (NCI) collects and publishes cancer data through a coordinated system of strategically placed cancer registries, covering nearly 30% of the US population.

Currently, there are 18 SEER registries in the USA. You can find this information on the following website: SEER Data Access.

I have also used the following repository to assist me with this project: SEER_solid_tumor [2]. The SEER database contains extensive data, and my investigation focuses solely on breast cancer for the years 2011-2015 and 2019-2020. SEER provides software called SEER*Stat, which I used to export the data; the exported files are stored and used on my local computer. Additionally, there are two GitHub repositories that I’ve referenced to some extent in this project:

The first repository [2] covers all types of cancer, whereas my study focuses specifically on breast cancer and addresses different research questions.
The second repository [3] applies machine learning analyses to various cancer types using Python (not R). I’ve drawn inspiration and learned methods from its approach to survival studies in cancer patients.

R initialization

Check that all required packages are installed, and install any that are missing.
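The initialization chunk itself is not shown in this report. A minimal sketch of such a check could look like the following. The package list is an assumption inferred from the functions called later in the report, and the helper names `find_missing_packages` and `ensure_packages` are hypothetical.

```r
# Hypothetical helper: return the packages in `required` that are not installed.
# `installed` defaults to the packages found in the current R library paths.
find_missing_packages <- function(required,
                                  installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# Hypothetical helper: install whatever is missing, then attach everything.
ensure_packages <- function(required) {
  missing_packages <- find_missing_packages(required)
  if (length(missing_packages) > 0) {
    install.packages(missing_packages)
  }
  invisible(lapply(required, library, character.only = TRUE))
}

# Assumed package list, inferred from the calls used in this report
required_packages <- c("readr", "httr", "dplyr", "ggplot2", "knitr", "skimr")
```

Calling `ensure_packages(required_packages)` at the top of the report would then install anything missing and attach all six packages.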

Research question

The primary focus of my research is to explore the survival rates of breast cancer patients and the various factors influencing these rates, including age, cancer type, treatment modalities, and other pertinent parameters. The commonly utilized five-year survival rate benchmark serves as a pivotal point of analysis in this study.

Because of the significance of this benchmark, I have divided the data into two distinct datasets. For the dataset spanning 2011 to 2015, I assume that the status of every patient is known up to the database’s current date in 2022, so each patient can be checked against the five-year mark. Additionally, I have selected the most recent data, from 2019 to 2020, as the target years for potential correlation and regression studies to estimate survival rates.

Although my research is not conducted within a strictly scientific framework, it is approached with rigor and attention to detail. While I do not possess expertise in the field of breast cancer, my personal connection to the topic motivates me to delve deeper into understanding the complexities surrounding it.

The dataset from 2011 to 2015 comprises approximately 303,000 rows with 36 selected columns. For the purpose of prediction, I have chosen to focus solely on the 2019-2020 data, which encompasses about 131,000 rows. The multifaceted nature of the research question necessitates a thorough examination, from data tidying to cleaning.

Some of the key parameters under consideration include years of diagnoses, age groups at diagnosis, and cancer type. However, I also recognize the importance of incorporating additional factors such as tumor characteristics and treatment modalities to provide a comprehensive understanding of breast cancer survival outcomes.

In conclusion, while my knowledge of the subject may not be extensive, I am committed to learning and contributing meaningful insights to the field of breast cancer research through meticulous analysis and interpretation of data.

Note on the 5-year threshold

According to the American Cancer Society, the five-year relative survival rate for localized breast cancer is around 99%, but it drops to about 27% for distant-stage breast cancer. These rates can vary over time and with advances in treatment. Reference [5]: American Cancer Society - Breast Cancer Survival Rates
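As a rough illustration of how the five-year benchmark can be applied to a survival-months column, the sketch below flags patients who reached 60 months. It uses made-up values and a hypothetical helper, `survived_threshold`. The result is a crude observed proportion, not the relative survival rate quoted by the American Cancer Society, and it is only meaningful when every patient had at least five years of possible follow-up.

```r
# Toy data with made-up values; the column name is a simplified stand-in
toy <- data.frame(survival_months = c(60, 28, 99, 10, 86))

# Hypothetical helper: TRUE when a patient survived at least `threshold` months
survived_threshold <- function(months, threshold = 60) {
  months >= threshold
}

toy$survived_5y <- survived_threshold(toy$survival_months)

# Crude share of patients reaching the five-year (60-month) mark
crude_rate <- mean(toy$survived_5y, na.rm = TRUE)
```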

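library(readr)  # read_csv()
library(httr)   # GET(), http_type(), stop_for_status(), content()

# Load a CSV file from disk, falling back to a download when it is missing.
# fallback_url defaults to gdrive_link, which is defined below and is only
# looked up when the fallback is actually needed.
load_csv <- function(file_path, fallback_url = gdrive_link) {
  if (file.exists(file_path)) {
    return(read_csv(file_path))
  } else {
    message("File not found locally. Attempting to fetch from server...")
    return(fetch_database(fallback_url))
  }
}

# Fetch the database from a download URL
fetch_database <- function(url) {
  response <- GET(url)
  if (http_type(response) == "application/force-download") {
    stop_for_status(response)
    # Wrap the body in I() so read_csv() treats it as literal data, not a path
    return(read_csv(I(content(response, as = "text", encoding = "UTF-8"))))
  } else {
    message("Failed to fetch from server. Please select the file manually.")
    return(read_csv(file.choose()))
  }
}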

# Local file paths
directory <- "C:/Users/kohya/OneDrive/CUNY/DATA 606/DATA 606 Spring/Project"
file_2020 <- "BREAST_2019-2020-updated.csv"
file_serv <- "BREAST_2011-2015.csv"
gdrive_link <- "https://drive.google.com/uc?export=download&id=1vBR2SZ-aFX3jjU6kQMjPkxfYKP-EwqRE"

# Complete the file paths
full_path_serv <- file.path(directory, file_serv)
full_path_eval <- file.path(directory, file_2020)

# Attempt to load the databases
BREAST_DF_surv <- load_csv(full_path_serv)

## Rows: 303557 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): Sex, Race recode (W, B, AI, API), Race and origin recode (NHW, NHB...
## dbl  (2): Year of diagnosis, Year of follow-up recode
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

BREAST_DF_eval <- load_csv(full_path_eval)

## Rows: 131395 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): Sex, Race recode (W, B, AI, API), Race and origin recode (NHW, NHB...
## dbl  (2): Year of diagnosis, Year of follow-up recode
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View the first few rows of the data frame
kable(head(BREAST_DF_surv, 10))

Sex	Year of diagnosis	Race recode (W, B, AI, API)	Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)	Site recode ICD-O-3/WHO 2008	Site recode ICD-O-3 2023 Revision	Primary Site - labeled	Grade Recode (thru 2017)	Grade Clinical (2018+)	Grade Pathological (2018+)	Diagnostic Confirmation	Laterality	Chemotherapy recode (yes, no/unk)	Radiation recode	Months from diagnosis to treatment	Reason no cancer-directed surgery	Scope of reg lymph nd surg (1998-2002)	Survival months flag	Survival months	COD to site recode	First malignant primary indicator	Sequence number	Total number of in situ/malignant tumors for patient	Patient ID	Marital status at diagnosis	Median household income inflation adj to 2021	Rural-Urban Continuum Code	Age recode (<60,60-69,70+)	Race and origin (recommended by SEER)	Year of follow-up recode	Year of death recode	SEER other cause of death classification	Tumor Size Summary (2016+)	RX Summ–Systemic/Sur Seq (2007+)	Origin recode NHIA (Hispanic, Non-Hisp)
Female	2015	White	Non-Hispanic White	Breast	Breast	C50.4-Upper-outer quadrant of breast	Moderately differentiated; Grade II	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	Yes	Beam radiation	002	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0060	Alive	No	2nd of 2 or more primaries	02	00000309	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	50-54 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	Blank(s)	Systemic therapy after surgery	Non-Spanish-Hispanic-Latino
Female	2013	White	Non-Hispanic White	Breast	Breast	C50.9-Breast, NOS	Unknown	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	Blank(s)	Not recommended	Blank(s)	Complete dates are available and there are more than 0 days of survival	0028	Breast	No	3rd of 3 or more primaries	03	00000346	Divorced	$75,000+	Counties in metropolitan areas ge 1 million pop	40-44 years	All races/ethnicities	2015	2015	Alive or dead due to cancer	Blank(s)	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino
Female	2012	White	Non-Hispanic White	Breast	Breast	C50.2-Upper-inner quadrant of breast	Moderately differentiated; Grade II	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	004	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0099	Alive	No	2nd of 2 or more primaries	03	00000374	Widowed	$75,000+	Counties in metropolitan areas ge 1 million pop	80-84 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	Blank(s)	Systemic therapy before surgery	Non-Spanish-Hispanic-Latino
Female	2014	White	Non-Hispanic White	Breast	Breast	C50.8-Overlapping lesion of breast	Moderately differentiated; Grade II	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	001	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0081	Alive	No	2nd of 2 or more primaries	02	00000391	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	55-59 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	Blank(s)	Systemic therapy after surgery	Non-Spanish-Hispanic-Latino
Female	2011	Black	Non-Hispanic Black	Breast	Breast	C50.9-Breast, NOS	Unknown	Blank(s)	Blank(s)	Direct visualization without microscopic confirmation	Left - origin of primary	No/Unknown	None/Unknown	Blank(s)	Not recommended	Blank(s)	Complete dates are available and there are more than 0 days of survival	0010	Breast	No	2nd of 2 or more primaries	02	00000547	Widowed	$75,000+	Counties in metropolitan areas ge 1 million pop	85+ years	All races/ethnicities	2012	2012	Alive or dead due to cancer	Blank(s)	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino
Female	2013	White	Hispanic (All Races)	Breast	Breast	C50.9-Breast, NOS	Moderately differentiated; Grade II	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	No/Unknown	Beam radiation	001	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0086	Alive	No	2nd of 2 or more primaries	02	00000567	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	70-74 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	Blank(s)	No systemic therapy and/or surgical procedures	Spanish-Hispanic-Latino
Female	2015	White	Non-Hispanic White	Breast	Breast	C50.8-Overlapping lesion of breast	Unknown	Blank(s)	Blank(s)	Positive histology	Left - origin of primary	Yes	None/Unknown	001	Not recommended	Blank(s)	Complete dates are available and there are more than 0 days of survival	0017	Breast	No	2nd of 2 or more primaries	02	00000760	Widowed	$75,000+	Counties in metropolitan areas ge 1 million pop	75-79 years	All races/ethnicities	2016	2016	Alive or dead due to cancer	Blank(s)	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino
Female	2015	White	Hispanic (All Races)	Breast	Breast	C50.4-Upper-outer quadrant of breast	Poorly differentiated; Grade III	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	001	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0007	Other Cause of Death	No	2nd of 2 or more primaries	02	00000941	Widowed	$75,000+	Counties in metropolitan areas ge 1 million pop	85+ years	All races/ethnicities	2015	2015	Dead (attributable to causes other than this cancer dx)	Blank(s)	No systemic therapy and/or surgical procedures	Spanish-Hispanic-Latino
Female	2015	White	Non-Hispanic White	Breast	Breast	C50.9-Breast, NOS	Poorly differentiated; Grade III	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	No/Unknown	Beam radiation	001	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0043	Cerebrovascular Diseases	No	2nd of 2 or more primaries	02	00002056	Widowed	$75,000+	Counties in metropolitan areas ge 1 million pop	80-84 years	All races/ethnicities	2019	2019	Dead (attributable to causes other than this cancer dx)	Blank(s)	Systemic therapy after surgery	Non-Spanish-Hispanic-Latino
Female	2015	Black	Non-Hispanic Black	Breast	Breast	C50.8-Overlapping lesion of breast	Poorly differentiated; Grade III	Blank(s)	Blank(s)	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	001	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0070	Alive	No	3rd of 3 or more primaries	04	00002605	Divorced	$75,000+	Counties in metropolitan areas ge 1 million pop	60-64 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	Blank(s)	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino

kable(head(BREAST_DF_eval, 10))

Sex	Year of diagnosis	Race recode (W, B, AI, API)	Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)	Site recode ICD-O-3/WHO 2008	Site recode ICD-O-3 2023 Revision	Primary Site - labeled	Grade Recode (thru 2017)	Grade Clinical (2018+)	Grade Pathological (2018+)	Diagnostic Confirmation	Laterality	Chemotherapy recode (yes, no/unk)	Radiation recode	Months from diagnosis to treatment	Reason no cancer-directed surgery	Scope of reg lymph nd surg (1998-2002)	Survival months flag	Survival months	COD to site recode	First malignant primary indicator	Sequence number	Total number of in situ/malignant tumors for patient	Patient ID	Marital status at diagnosis	Median household income inflation adj to 2021	Rural-Urban Continuum Code	Age recode (<60,60-69,70+)	Race and origin (recommended by SEER)	Year of follow-up recode	Year of death recode	SEER other cause of death classification	Tumor Size Summary (2016+)	RX Summ–Systemic/Sur Seq (2007+)	Origin recode NHIA (Hispanic, Non-Hisp)
Female	2019	Asian or Pacific Islander	Non-Hispanic Asian or Pacific Islander	Breast	Breast	C50.8-Overlapping lesion of breast	Unknown	1	1	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	002	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0019	Alive	No	2nd of 2 or more primaries	02	00002750	Divorced	$75,000+	Counties in metropolitan areas ge 1 million pop	65-69 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	008	Systemic therapy after surgery	Non-Spanish-Hispanic-Latino
Female	2020	Asian or Pacific Islander	Non-Hispanic Asian or Pacific Islander	Breast	Breast	C50.8-Overlapping lesion of breast	Unknown	2	9	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	000	Recommended, unknown if performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0000	Alive	No	2nd of 2 or more primaries	02	00002870	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	75-79 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	050	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino
Female	2020	White	Non-Hispanic White	Breast	Breast	C50.4-Upper-outer quadrant of breast	Unknown	1	2	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	000	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0007	Alive	No	2nd of 2 or more primaries	02	00003067	Divorced	$75,000+	Counties in metropolitan areas ge 1 million pop	85+ years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	018	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino
Female	2020	White	Non-Hispanic White	Breast	Breast	C50.5-Lower-outer quadrant of breast	Unknown	2	9	Positive histology	Right - origin of primary	Yes	None/Unknown	001	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0010	Alive	No	2nd of 2 or more primaries	02	00003365	Widowed	$75,000+	Counties in metropolitan areas ge 1 million pop	85+ years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	060	Systemic therapy both before and after surgery	Non-Spanish-Hispanic-Latino
Female	2019	White	Non-Hispanic White	Breast	Breast	C50.8-Overlapping lesion of breast	Unknown	2	2	Positive histology	Right - origin of primary	No/Unknown	Radioactive implants (includes brachytherapy) (1988+)	000	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0016	Alive	No	3rd of 3 or more primaries	03	00003679	Divorced	$75,000+	Counties in metropolitan areas ge 1 million pop	75-79 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	010	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino
Female	2019	Asian or Pacific Islander	Non-Hispanic Asian or Pacific Islander	Breast	Breast	C50.9-Breast, NOS	Unknown	2	2	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	004	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0014	Alive	No	3rd of 3 or more primaries	04	00003771	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	55-59 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	030	Systemic therapy after surgery	Non-Spanish-Hispanic-Latino
Female	2019	Asian or Pacific Islander	Non-Hispanic Asian or Pacific Islander	Breast	Breast	C50.4-Upper-outer quadrant of breast	Unknown	1	1	Positive histology	Left - origin of primary	No/Unknown	None/Unknown	004	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0014	Alive	No	4th of 4 or more primaries	04	00003771	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	55-59 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	004	Systemic therapy after surgery	Non-Spanish-Hispanic-Latino
Female	2020	White	Non-Hispanic White	Breast	Breast	C50.8-Overlapping lesion of breast	Unknown	2	9	Positive histology	Right - origin of primary	No/Unknown	None/Unknown	001	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0003	Alive	No	2nd of 2 or more primaries	02	00006501	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	80-84 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	036	Systemic therapy both before and after surgery	Non-Spanish-Hispanic-Latino
Female	2020	White	Non-Hispanic White	Breast	Breast	C50.3-Lower-inner quadrant of breast	Unknown	1	1	Positive histology	Left - origin of primary	No/Unknown	None/Unknown	002	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0007	Alive	No	3rd of 3 or more primaries	03	00007723	Married (including common law)	$75,000+	Counties in metropolitan areas ge 1 million pop	70-74 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	006	No systemic therapy and/or surgical procedures	Non-Spanish-Hispanic-Latino
Female	2019	White	Non-Hispanic White	Breast	Breast	C50.4-Upper-outer quadrant of breast	Unknown	2	9	Positive histology	Right - origin of primary	Yes	None/Unknown	002	Surgery performed	Blank(s)	Complete dates are available and there are more than 0 days of survival	0021	Alive	No	2nd of 2 or more primaries	02	00008406	Unmarried or Domestic Partner	$75,000+	Counties in metropolitan areas ge 1 million pop	55-59 years	All races/ethnicities	2020	Alive at last contact	Alive or dead due to cancer	019	Systemic therapy both before and after surgery	Non-Spanish-Hispanic-Latino

Cases

There are 131,395 cases in the 2019-2020 breast cancer dataset and 303,557 cases in the 2011-2015 dataset.

Data collection

I used SEER*Stat to collect the data and export it as a TXT file that can be imported into R for analysis. How SEER collects the data is summarized below, based on the SEER training page cited at the end of this section:

The SEER program collects cancer incidence data through a network of population-based cancer registries. These registries gather information on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, and first course of treatment. They also follow up with patients for vital status.
By law, these facilities are required to report new cancer cases to a central cancer registry, like a state cancer registry.
The SEER program releases new research data annually, based on submissions from the previous year, and makes it available for public use through a data request process. This comprehensive approach ensures that the SEER database is a valuable resource for cancer research and surveillance. https://training.seer.cancer.gov/registration/data/collection.html

Type of study

This is an observational study: the information was gathered for different patients without any intervention, and I evaluate and present the data that is available.

Data Source

The data comes from the SEER program. I used the SEER*Stat software to extract it in a format (TXT/CSV) that can be imported into R (Surveillance, Epidemiology, and End Results Program 2023).

Dependent Variable

We have a combination of both numeric and categorical data to work with. For example, the number of tumors and the survival months are quantitative, while others, such as race, marital status, and type of cancer, are categorical.

Categorical features such as ‘Median household income …’, ‘Marital status’, ‘Grade recode’, ‘Laterality’, and ‘Radiation recode’ are stored as character columns.

When the files are read with read_csv, only ‘Year of diagnosis’ and ‘Year of follow-up recode’ are parsed as numeric (dbl). ‘Patient ID’ stays a character column, and count-like columns such as ‘Total number of …’ and ‘Survival months’ are read as character and converted to numeric during tidying.
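A quick way to confirm how each column is stored is to list the class of every column. The sketch below does this on a small toy data frame with simplified, hypothetical column names; the same two calls could be applied to the real data frames once they are loaded.

```r
# Toy data frame with illustrative values standing in for the SEER export
toy <- data.frame(
  year_of_diagnosis = c(2011, 2013, 2015),
  patient_id        = c("00000309", "00000346", "00000374"),
  marital_status    = c("Married", "Divorced", "Widowed"),
  stringsAsFactors  = FALSE
)

# Storage class of every column, as a named character vector
column_classes <- vapply(toy, function(x) class(x)[1], character(1))

# Number of columns per class
class_counts <- table(column_classes)
```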

# Count the number of unique values in each column
unique_values <- data.frame(
  unique = sapply(BREAST_DF_surv, function(x) length(unique(x))),
  colnames = colnames(BREAST_DF_surv)
)

# Collect the number of unique values and the unique values themselves
unique_info <- data.frame(
  unique_count = sapply(BREAST_DF_surv, function(x) length(unique(x))),
  unique_values = sapply(BREAST_DF_surv, function(x) toString(unique(x))),
  column_names = names(BREAST_DF_surv)
)


# Check for NULL columns (is.null() applies to whole columns, not to individual cells)
any_null <- any(sapply(BREAST_DF_surv, is.null))

# Check for NA values in any cell
any_na <- anyNA(BREAST_DF_surv)

# Check if there are any NULL or NA values
if (any_null || any_na) {
  print("The data frame contains NULL or NA values.")
} else {
  print("The data frame does not contain any NULL or NA values.")
}

## [1] "The data frame does not contain any NULL or NA values."

has_na_character <- any(sapply(BREAST_DF_surv, function(x) any(x == "NA")))

if (has_na_character) {
  print("The data frame contains character values of 'NA'.")
} else {
  print("The data frame does not contain character values of 'NA'.")
}

## [1] "The data frame does not contain character values of 'NA'."

Data tidying

While exploring the data, I noticed that some columns might be entirely empty. In this database, empty values are filled with “Blank(s)”. In this section, I first check whether any column is entirely empty and remove it. For the remaining columns, I replace any “Blank(s)” values with NA, which dplyr and the tidyverse handle better.

# Some cells in the data frame contain "Blank(s)", which effectively means NA.
# First, find any column whose values are all "Blank(s)" so that it can be removed.

# Look for columns in which every value is "Blank(s)"
Empty_column <- BREAST_DF_surv %>%
  dplyr::summarise(dplyr::across(everything(), ~all(. == "Blank(s)"))) %>%
  as.logical() %>%
  unlist()

# Get the names of columns with all cells containing "Blank(s)"
blank_column_names <- names(BREAST_DF_surv)[Empty_column]

# Print the column names with all cells containing "Blanks"
paste("list of empty column(s): ", blank_column_names)

## [1] "list of empty column(s):  Grade Clinical (2018+)"                
## [2] "list of empty column(s):  Grade Pathological (2018+)"            
## [3] "list of empty column(s):  Scope of reg lymph nd surg (1998-2002)"
## [4] "list of empty column(s):  Tumor Size Summary (2016+)"

# Remove those empty columns from both data frames
BREAST_DF_surv <- BREAST_DF_surv[, !names(BREAST_DF_surv) %in% blank_column_names]
BREAST_DF_eval <- BREAST_DF_eval[, !names(BREAST_DF_eval) %in% blank_column_names]

# Next, check whether any remaining cell still contains "Blank(s)" and, if so,
# replace it with NA, which R handles better.

# na_if() converts "Blank(s)" and any empty strings in the character columns to NA,
# making missing values easier to handle in R. Numeric columns cannot hold
# "Blank(s)", so they are left untouched.
BREAST_DF_surv <- BREAST_DF_surv %>%
  mutate(across(where(is.character), ~ na_if(na_if(.x, "Blank(s)"), "")))

# Do the same for the eval dataset
BREAST_DF_eval <- BREAST_DF_eval %>%
  mutate(across(where(is.character), ~ na_if(na_if(.x, "Blank(s)"), "")))

#Change characters to numerics 
BREAST_DF_surv$`Months from diagnosis to treatment` <- as.numeric(BREAST_DF_surv$`Months from diagnosis to treatment`)
BREAST_DF_surv$`Survival months` <- as.numeric(BREAST_DF_surv$`Survival months`)

## Warning: NAs introduced by coercion

BREAST_DF_surv$`Total number of in situ/malignant tumors for patient` <- 
  as.numeric(BREAST_DF_surv$`Total number of in situ/malignant tumors for patient`)

## Warning: NAs introduced by coercion

BREAST_DF_surv$`Total number of benign/borderline tumors for patient` <- 
  as.numeric(BREAST_DF_surv$`Total number of benign/borderline tumors for patient`)
# Convert the same columns to numeric in the eval dataset
BREAST_DF_eval$`Months from diagnosis to treatment` <- as.numeric(BREAST_DF_eval$`Months from diagnosis to treatment`)
BREAST_DF_eval$`Survival months` <- as.numeric(BREAST_DF_eval$`Survival months`)

## Warning: NAs introduced by coercion

BREAST_DF_eval$`Total number of in situ/malignant tumors for patient` <- 
  as.numeric(BREAST_DF_eval$`Total number of in situ/malignant tumors for patient`)

## Warning: NAs introduced by coercion

BREAST_DF_eval$`Total number of benign/borderline tumors for patient` <- 
  as.numeric(BREAST_DF_eval$`Total number of benign/borderline tumors for patient`)
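The "NAs introduced by coercion" warnings above mean that some values in those columns are not numeric strings. One way to see which values are affected is sketched below with a hypothetical helper, `non_numeric_values`, run on a toy vector. The "Unknown" entry is an assumed example of such a code, not a value confirmed from the SEER export.

```r
# Hypothetical helper: list the distinct non-missing values that as.numeric()
# cannot parse, i.e. the values behind "NAs introduced by coercion".
non_numeric_values <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  unique(x[!is.na(x) & is.na(converted)])
}

# Toy vector; "Unknown" is an assumed example of a non-numeric code
toy_months <- c("0060", "0028", "Unknown", NA, "0010")

# Values that would silently become NA during conversion
problem_values <- non_numeric_values(toy_months)

# Zero-padded strings such as "0060" convert cleanly to 60
converted_months <- suppressWarnings(as.numeric(toy_months))
```

Running the helper on a column before converting it makes it possible to decide whether those values should really be treated as missing.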


# View the structure of the data frame
#str(BREAST_DF_surv)
skimr::skim(BREAST_DF_surv)

Data summary
Name	BREAST_DF_surv
Number of rows	303557
Number of columns	32
_______________________
Column type frequency:
character	26
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Sex	1	6	6	1
Race recode (W, B, AI, API)	1	5	29	5
Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)	1	18	42	6
Site recode ICD-O-3/WHO 2008	1	6	6	1
Site recode ICD-O-3 2023 Revision	1	6	6	1
Primary Site - labeled	1	12	36	9
Grade Recode (thru 2017)	1	7	38	5
Diagnostic Confirmation	1	7	57	9
Laterality	1	24	53	5
Chemotherapy recode (yes, no/unk)	1	3	10	2
Radiation recode	1	12	53	8
Reason no cancer-directed surgery	1	15	76	8
Survival months flag	1	61	73	5
COD to site recode	1	5	55	87
First malignant primary indicator	1	2	3	2
Sequence number	1	16	60	13
Patient ID	1	8	8	294480
Marital status at diagnosis	1	7	30	7
Median household income inflation adj to 2021	1	8	38	11
Rural-Urban Continuum Code	1	38	60	7
Age recode (<60,60-69,70+)	1	9	11	18
Race and origin (recommended by SEER)	1	21	21	1
Year of death recode	1	4	21	11
SEER other cause of death classification	1	16	55	4
RX Summ–Systemic/Sur Seq (2007+)	1	16	55	8
Origin recode NHIA (Hispanic, Non-Hisp)	1	23	27	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year of diagnosis	0	1.00	2013.04	1.42	2011	2012	2013	2014	2015	▇▇▇▇▇
Months from diagnosis to treatment	15843	0.95	1.13	1.14	0	0	1	2	24	▇▁▁▁▁
Survival months	1290	1.00	74.22	29.88	0	62	78	97	119	▂▂▆▇▆
Total number of in situ/malignant tumors for patient	3	1.00	1.36	0.65	1	1	1	2	20	▇▁▁▁▁
Total number of benign/borderline tumors for patient	0	1.00	0.01	0.09	0	0	0	0	5	▇▁▁▁▁
Year of follow-up recode	0	1.00	2018.90	2.14	2011	2019	2020	2020	2020	▁▁▁▁▇

skimr::skim(BREAST_DF_eval)

Data summary
Name	BREAST_DF_eval
Number of rows	131395
Number of columns	32
_______________________
Column type frequency:
character	26
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Sex	1	6	6	1
Race recode (W, B, AI, API)	1	5	29	5
Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)	1	18	42	6
Site recode ICD-O-3/WHO 2008	1	6	6	1
Site recode ICD-O-3 2023 Revision	1	6	6	1
Primary Site - labeled	1	12	36	9
Grade Recode (thru 2017)	1	7	7	1
Diagnostic Confirmation	1	7	57	9
Laterality	1	24	53	5
Chemotherapy recode (yes, no/unk)	1	3	10	2
Radiation recode	1	12	53	8
Reason no cancer-directed surgery	1	15	76	8
Survival months flag	1	61	73	5
COD to site recode	1	5	55	67
First malignant primary indicator	1	2	3	2
Sequence number	1	16	60	16
Patient ID	1	8	8	127795
Marital status at diagnosis	1	7	30	7
Median household income inflation adj to 2021	1	8	38	11
Rural-Urban Continuum Code	1	38	60	7
Age recode (<60,60-69,70+)	1	9	11	17
Race and origin (recommended by SEER)	1	21	21	1
Year of death recode	1	4	21	3
SEER other cause of death classification	1	16	55	4
RX Summ–Systemic/Sur Seq (2007+)	1	16	55	8
Origin recode NHIA (Hispanic, Non-Hisp)	1	23	27	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year of diagnosis	0	1.00	2019.48	0.50	2019	2019	2019	2020	2020	▇▁▁▁▇
Months from diagnosis to treatment	6807	0.95	1.26	1.18	0	1	1	2	24	▇▁▁▁▁
Survival months	537	1.00	11.07	7.05	0	5	11	17	23	▇▆▆▇▆
Total number of in situ/malignant tumors for patient	11	1.00	1.31	0.62	1	1	1	1	50	▇▁▁▁▁
Total number of benign/borderline tumors for patient	0	1.00	0.01	0.09	0	0	0	0	2	▇▁▁▁▁
Year of follow-up recode	0	1.00	2019.98	0.14	2019	2020	2020	2020	2020	▁▁▁▁▇
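The summaries show more rows than unique Patient ID values (303,557 rows against 294,480 IDs for 2011-2015), so some patients contribute more than one record. A minimal sketch of how to quantify this, using a toy data frame with made-up IDs and a simplified column name:

```r
# Toy data with made-up patient IDs; A2 and A3 appear more than once
toy <- data.frame(
  patient_id = c("A1", "A2", "A2", "A3", "A3", "A3"),
  stringsAsFactors = FALSE
)

# Number of records for each patient
records_per_patient <- table(toy$patient_id)

# Number of distinct patients
n_patients <- length(records_per_patient)

# Patients with more than one record
n_multi_record <- sum(records_per_patient > 1)

# Rows beyond the first record of each patient
extra_rows <- nrow(toy) - n_patients
```

Whether such repeated records should be kept or collapsed to one row per patient depends on the research question, so this only measures the overlap.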

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

#find column name to use later if needed
DF_col_names <- colnames(BREAST_DF_surv)

# use ggplot to plot the race information 
BREAST_DF_surv |> 
  ggplot(mapping = aes(x=`Race recode (W, B, AI, API)`)) +
  geom_bar(stat = "count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  ylim(0, 246000)

# To compare the racial composition of the eval and survival data, summarise() is used to build two new data frames that store only the summary statistics, specifically the percentage of each race in the dataset
# Find the percentage of each race in the survival dataset
BREAST_DF_perc_surv <- BREAST_DF_surv %>%
  group_by(`Race recode (W, B, AI, API)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

# Plot the percentages
ggplot(BREAST_DF_perc_surv, aes(x = `Race recode (W, B, AI, API)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, color = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Race between 2011-2015", x = "Race recode (W, B, AI, API)", y = "Percentage") + ylim (0,90)

BREAST_DF_eval |> 
  ggplot(mapping = aes(x=`Race recode (W, B, AI, API)`)) +
  geom_bar(stat = "count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  ylim(0, 104000)

BREAST_DF_perc_eval <- BREAST_DF_eval %>%
  group_by(`Race recode (W, B, AI, API)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

# Plot the percentages
ggplot(BREAST_DF_perc_eval, aes(x = `Race recode (W, B, AI, API)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "plum") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, color = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Race between 2019-2020", x = "Race recode (W, B, AI, API)", y = "Percentage") + ylim (0,90)

# Next, focus on age to see whether it matters: the same tables and plots are produced by age group, starting with the percentages for the survival and evaluation data
# Percentage by age group for the survival data
# Unique values of the age column
uniques_ages <- unique(BREAST_DF_surv[29])

BREAST_DF_age_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Age recode (<60,60-69,70+)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

perc_max <- max(BREAST_DF_age_perc_surv$percentage)
# Plot the percentages
ggplot(BREAST_DF_age_perc_surv, aes(x = `Age recode (<60,60-69,70+)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "brown") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 90) +  # Rotate the text vertically
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Age range 2011-2015", 
       x = "Age range", 
       y = "Percentage") + 
  ylim(0, round(1.5 * perc_max, 1))

# Same analysis for the evaluation data, based on age
BREAST_DF_age_perc_eval <- BREAST_DF_eval %>%
  dplyr::group_by(`Age recode (<60,60-69,70+)`) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(total_count = sum(count)) %>%  # Calculate total count
  mutate(percentage = count / total_count * 100)  # Calculate percentage using total count

# Plot the percentages
ggplot(BREAST_DF_age_perc_eval, aes(x = `Age recode (<60,60-69,70+)`, y = percentage)) +
  geom_bar(stat = "identity", fill = "brown") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 90) +  # Rotate the text vertically
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Age range 2019-2020", 
       x = "Age range", 
       y = "Percentage") + 
  ylim(0, round(1.5 * perc_max, 1))

# In this section, we do the analyses on household income 
# Unique values of the household income column
uniques_householdes <- unique(BREAST_DF_surv[27])

BREAST_DF_income_perc_surv <- BREAST_DF_surv %>% dplyr::group_by(`Median household income inflation adj to 2021`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count

# Plot the percentages
perc_max <- max(BREAST_DF_income_perc_surv$percentage)
ggplot(BREAST_DF_income_perc_surv, aes(x = `Median household income inflation adj to 2021`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "brown") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by income 2011-2015", x = "Household Income", y = "Percentage") + 
  ylim(0, 1.2*perc_max)

# Same analysis for the evaluation data, based on household income
BREAST_DF_income_perc_eval <- BREAST_DF_eval %>% 
  dplyr::group_by(`Median household income inflation adj to 2021`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count


#Plot the percentages
perc_max <- max(BREAST_DF_income_perc_eval$percentage)
ggplot(BREAST_DF_income_perc_eval, aes(x = `Median household income inflation adj to 2021`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "brown") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by income 2019-2020", x = "Household Income", y = "Percentage") + 
  ylim(0, 1.2*perc_max)

# In this section, we do the analyses on Primary Site
# Unique values of the primary site column
uniques_canter_type <- unique(BREAST_DF_surv["Primary Site - labeled"])

BREAST_DF_labeled_perc_surv <- BREAST_DF_surv %>% dplyr::group_by(`Primary Site - labeled`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count

# Plot the percentages
perc_max <- max(BREAST_DF_labeled_perc_surv$percentage)
ggplot(BREAST_DF_labeled_perc_surv, aes(x = `Primary Site - labeled`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "darkgreen") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Primary Site labels 2011-2015", x = "Primary Labels", y = "Percentage") + 
  ylim(0, 1.2*perc_max)

# Same analysis for the evaluation data, based on primary site
BREAST_DF_labeled_perc_eval <- BREAST_DF_eval %>% 
  dplyr::group_by(`Primary Site - labeled`) %>% 
  dplyr::summarise(count = dplyr::n()) %>% # Calculate count per group 
  ungroup() %>% # Ungroup the data 
  mutate(total_count = sum(count)) %>% # Calculate total count 
  mutate(percentage = count / total_count * 100) # Calculate percentage using total count


#Plot the percentages
perc_max <- max(BREAST_DF_labeled_perc_eval$percentage)
ggplot(BREAST_DF_labeled_perc_eval, aes(x = `Primary Site - labeled`, y = percentage)) + 
  geom_bar(stat = "identity", fill = "darkgreen") + 
  geom_text(aes(label = paste0(round(percentage, 1), "%")), hjust = -0.1 , vjust = 0.4, color = "black", angle = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Percentage of Population by Primary Site labels 2019-2020", x = "Primary Labels", y = "Percentage") + 
  ylim(0, 1.2*perc_max)

# Recode `COD to site recode` into three groups: "Alive" (still alive), "Breast" (died of breast cancer), and "Other" (passed away, but not because of breast cancer).

BREAST_DF_surv <- BREAST_DF_surv %>%
  mutate(COD = ifelse(`COD to site recode` %in% c("Alive","Breast"), `COD to site recode`, "Other"))
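The race, age, income, and primary-site tables above all repeat the same group, count, and percentage steps. A minimal sketch of a reusable helper follows, written in base R so that it runs without extra packages; `percent_by` and the toy data frame are hypothetical names used only for illustration and are not part of the original analysis.

```r
# Hypothetical helper: counts and percentages for one categorical column
percent_by <- function(data, column) {
  counts <- table(data[[column]], useNA = "ifany")
  data.frame(
    group = names(counts),
    count = as.integer(counts),
    percentage = as.numeric(counts) / sum(counts) * 100,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for BREAST_DF_surv (illustration only)
toy <- data.frame(race = c("White", "White", "Black", "White"))
race_perc <- percent_by(toy, "race")
print(race_perc)
```

Calling `percent_by(BREAST_DF_surv, "Race recode (W, B, AI, API)")` should give a table comparable to `BREAST_DF_perc_surv`, with the grouping column named `group` rather than carrying the original column name.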

Results of the exploratory data analysis

In this section, we carry out exploratory data analysis on the following:

Cause of death of those who have had cancer
Total number of tumors (Malignant or Benign)
Radiation and chemotherapy
Surgery Performed
Marital Status
Household income

We look at the population and then, within it, at how many survived the cancer. Later we will run further analyses to see whether these were important or deciding factors.

BREAST_DF_COD_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(COD) %>%
  dplyr::summarise(count = dplyr::n()) %>%  # Calculate count per group
  ungroup() %>%  # Ungroup the data
  mutate(`Total Count` = sum(count)) %>%  # Calculate total count
  mutate(Population = round(count / `Total Count` * 100, 2))  # Percentage of the total, rounded to 2 decimals

kable(BREAST_DF_COD_perc_surv)

COD	count	Total Count	Population
Alive	228221	303557	75.18
Breast	38472	303557	12.67
Other	36864	303557	12.14

# Let's first group by the number of tumors and find out how many people in the population fall into each group. Then, within each group, let's determine how many passed away due to breast cancer. Note that this approach may not be completely accurate, as there could be cases where individuals passed away due to breast cancer complications that are not accounted for in these counts.

BREAST_DF_TNoT_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Total number of in situ/malignant tumors for patient`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total count in each group

# Calculate the percentage of each group in the population and then the percentage of the deceased within the group.

BREAST_DF_TNoT_perc_surv$`Group % in total` <- round(BREAST_DF_TNoT_perc_surv$Population/sum(BREAST_DF_TNoT_perc_surv$Population)*100,2)

BREAST_DF_TNoT_perc_surv$`Death %` <- round(BREAST_DF_TNoT_perc_surv$`Event Population`/BREAST_DF_TNoT_perc_surv$Population*100,2)

    
kable(BREAST_DF_TNoT_perc_surv)

Total number of in situ/malignant tumors for patient	Event Population	Population	Group % in total	Death %
1	27314	217122	71.53	12.58
2	8945	68082	22.43	13.14
3	1808	14579	4.80	12.40
4	322	2996	0.99	10.75
5	68	595	0.20	11.43
6	9	126	0.04	7.14
7	3	29	0.01	10.34
8	2	18	0.01	11.11
18	1	1	0.00	100.00
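One side effect of filtering on `COD == "Breast"` before summarising is that any group with no breast cancer deaths drops out of the table, which is why the populations above sum to 303548 rather than 303557. A minimal base R sketch that keeps zero-event groups is given here; `group_event_rate` and the toy data are hypothetical, and `tapply` silently excludes rows whose grouping value is `NA`, so missing values would still need separate handling.

```r
# Hypothetical helper: per-group population, events, and death percentage,
# keeping groups that have zero events
group_event_rate <- function(data, group_col, event_col = "COD", event_value = "Breast") {
  is_event <- data[[event_col]] == event_value
  population <- tapply(is_event, data[[group_col]], length)
  events <- tapply(is_event, data[[group_col]], sum)
  data.frame(
    group = names(population),
    `Event Population` = as.integer(events),
    Population = as.integer(population),
    `Group % in total` = round(as.numeric(population) / sum(population) * 100, 2),
    `Death %` = round(as.numeric(events) / as.numeric(population) * 100, 2),
    check.names = FALSE
  )
}

# Toy data (illustration only): group 3 has no breast cancer deaths
toy <- data.frame(
  tumors = c(1, 1, 1, 2, 2, 3),
  COD = c("Breast", "Alive", "Other", "Breast", "Breast", "Alive")
)
rates <- group_event_rate(toy, "tumors")
print(rates)
```

With this shape, the `Group % in total` column sums to 100 over all non-missing groups, not only over groups that recorded at least one breast cancer death.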

# Let's focus on treatment. There are two types of treatment, radiation (R) and chemotherapy (C), which give four combinations: R:N-C:N, R:Y-C:N, R:N-C:Y, and R:Y-C:Y. We look at these four groups, find the total number of patients in each, and then the number of deaths in each. Finally, we report them in the same way as above.

BREAST_DF_surv <- BREAST_DF_surv %>% 
  mutate(Radiation = ifelse(`Radiation recode` %in% c("None/Unknown","Refused (1988+)","Recommended, unknown if administered"),"No/Unknown","Yes"))
BREAST_DF_eval <- BREAST_DF_eval %>% 
  mutate(Radiation = ifelse(`Radiation recode` %in% c("None/Unknown","Refused (1988+)","Recommended, unknown if administered"),"No/Unknown","Yes"))

# Use dplyr to group by the two treatment variables, chemotherapy and radiation therapy, and evaluate the death rate accordingly
BREAST_DF_RNC_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(Radiation,`Chemotherapy recode (yes, no/unk)`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total count in each group

## `summarise()` has grouped output by 'Radiation'. You can override using the
## `.groups` argument.

# Replace "No/Unknown" with "No" in the original columns
BREAST_DF_RNC_perc_surv$Radiation <- ifelse(BREAST_DF_RNC_perc_surv$Radiation == "No/Unknown", "No", BREAST_DF_RNC_perc_surv$Radiation)

BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)" <- ifelse(BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)" == "No/Unknown", "No", BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)")

# Create a new column "Radiation_Chemo" with values separated by "/"
BREAST_DF_RNC_perc_surv$Radiation_Chemo <- paste(BREAST_DF_RNC_perc_surv$Radiation, BREAST_DF_RNC_perc_surv$"Chemotherapy recode (yes, no/unk)", sep = "/")


# Optionally, remove the original "Radiation" and "Chemotherapy recode (yes, no/unk)" columns
BREAST_DF_RNC_perc_surv <- subset(BREAST_DF_RNC_perc_surv, select = -c(Radiation, `Chemotherapy recode (yes, no/unk)`))

BREAST_DF_RNC_perc_surv <- BREAST_DF_RNC_perc_surv[, c("Radiation_Chemo", setdiff(names(BREAST_DF_RNC_perc_surv), "Radiation_Chemo"))]


# Knowing the population, calculate the group share and the death rate in each group
BREAST_DF_RNC_perc_surv$`Group % in total` <- round(BREAST_DF_RNC_perc_surv$Population/sum(BREAST_DF_RNC_perc_surv$Population)*100,2)

BREAST_DF_RNC_perc_surv$`Death %` <- round(BREAST_DF_RNC_perc_surv$`Event Population`/BREAST_DF_RNC_perc_surv$Population*100,2)



kable(BREAST_DF_RNC_perc_surv)

Radiation_Chemo	Event Population	Population	Group % in total	Death %
No/No	15684	107012	35.25	14.66
No/Yes	9929	54966	18.11	18.06
Yes/No	3731	79926	26.33	4.67
Yes/Yes	9128	61653	20.31	14.81

# Next, let's look at surgery and survival, and whether surgery might have been critical.
BREAST_DF_SUR_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Reason no cancer-directed surgery`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total count in each group

# Knowing the population, calculate the group share and the death rate in each group
BREAST_DF_SUR_perc_surv$`Group % in total` <- round(BREAST_DF_SUR_perc_surv$Population/sum(BREAST_DF_SUR_perc_surv$Population)*100,2)

BREAST_DF_SUR_perc_surv$`Death %` <- round(BREAST_DF_SUR_perc_surv$`Event Population`/BREAST_DF_SUR_perc_surv$Population*100,2)

kable(BREAST_DF_SUR_perc_surv)

Reason no cancer-directed surgery	Event Population	Population	Group % in total	Death %
Not performed, patient died prior to recommended surgery	139	278	0.09	50.00
Not recommended	11636	23199	7.64	50.16
Not recommended, contraindicated due to other cond; autopsy only (1973-2002)	593	1356	0.45	43.73
Recommended but not performed, patient refused	1171	2608	0.86	44.90
Recommended but not performed, unknown reason	545	1604	0.53	33.98
Recommended, unknown if performed	613	2649	0.87	23.14
Surgery performed	22376	269730	88.86	8.30
Unknown; death certificate; or autopsy only (2003+)	1399	2133	0.70	65.59

# Next, let's look at marital status and survival, and whether marital status might have been critical.
BREAST_DF_MARI_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Marital status at diagnosis`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total count in each group

# Knowing the population, calculate the group share and the death rate in each group
BREAST_DF_MARI_perc_surv$`Group % in total` <- round(BREAST_DF_MARI_perc_surv$Population/sum(BREAST_DF_MARI_perc_surv$Population)*100,2)

BREAST_DF_MARI_perc_surv$`Death %` <- round(BREAST_DF_MARI_perc_surv$`Event Population`/BREAST_DF_MARI_perc_surv$Population*100,2)

kable(BREAST_DF_MARI_perc_surv)

Marital status at diagnosis	Event Population	Population	Group % in total	Death %
Divorced	4399	32214	10.61	13.66
Married (including common law)	15694	160551	52.89	9.78
Separated	544	3225	1.06	16.87
Single (never married)	7161	44678	14.72	16.03
Unknown	2774	18481	6.09	15.01
Unmarried or Domestic Partner	110	1014	0.33	10.85
Widowed	7790	43394	14.30	17.95

# Next, let's look at median household income and survival, and whether income might have been critical.
BREAST_DF_HHI_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Median household income inflation adj to 2021`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total count in each group

# Knowing the population, calculate the group share and the death rate in each group
BREAST_DF_HHI_perc_surv$`Group % in total` <- round(BREAST_DF_HHI_perc_surv$Population/sum(BREAST_DF_HHI_perc_surv$Population)*100,2)

BREAST_DF_HHI_perc_surv$`Death %` <- round(BREAST_DF_HHI_perc_surv$`Event Population`/BREAST_DF_HHI_perc_surv$Population*100,2)

kable(BREAST_DF_HHI_perc_surv)

Median household income inflation adj to 2021	Event Population	Population	Group % in total	Death %
$35,000 - $39,999	1000	6077	2.00	16.46
$40,000 - $44,999	1630	10225	3.37	15.94
$45,000 - $49,999	2289	14917	4.91	15.34
$50,000 - $54,999	2310	16794	5.53	13.75
$55,000 - $59,999	3371	24860	8.19	13.56
$60,000 - $64,999	6010	43537	14.34	13.80
$65,000 - $69,999	5848	44978	14.82	13.00
$70,000 - $74,999	3927	31930	10.52	12.30
$75,000+	11608	107459	35.40	10.80
< $35,000	469	2716	0.89	17.27
Unknown/missing/no match/Not 1990-2021	10	64	0.02	15.62
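In the table above, "< $35,000" appears after "$75,000+" because the income brackets are sorted as text. A minimal sketch of imposing the natural order with an explicit factor is given here; the level labels are copied from the table, while `income_levels` and the toy vector are names introduced only for illustration.

```r
# Income brackets in their natural order (labels taken from the table above)
income_levels <- c(
  "< $35,000", "$35,000 - $39,999", "$40,000 - $44,999", "$45,000 - $49,999",
  "$50,000 - $54,999", "$55,000 - $59,999", "$60,000 - $64,999",
  "$65,000 - $69,999", "$70,000 - $74,999", "$75,000+",
  "Unknown/missing/no match/Not 1990-2021"
)

# Toy values (illustration only)
toy_income <- c("$75,000+", "< $35,000", "$50,000 - $54,999")
income_factor <- factor(toy_income, levels = income_levels)

# Sorting a factor follows the level order, not alphabetical order
sorted_income <- as.character(sort(income_factor))
print(sorted_income)
```

Applying the same `factor(..., levels = income_levels)` conversion to the income column before grouping or plotting would likely make the tables and bar charts read from lowest to highest bracket.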

# Next, let's look at the type of cancer (primary site) and survival, and whether the site might have been critical.
BREAST_DF_PSL_perc_surv <- BREAST_DF_surv %>%
  dplyr::group_by(`Primary Site - labeled`) %>%
  dplyr::add_count() %>%
  filter(COD == "Breast") %>%
  dplyr::summarise(`Event Population` = n(), 
            Population = dplyr::first(n))  # Use `first()` to extract the total count in each group

# Knowing the population, calculate the group share and the death rate in each group
BREAST_DF_PSL_perc_surv$`Group % in total` <- round(BREAST_DF_PSL_perc_surv$Population/sum(BREAST_DF_PSL_perc_surv$Population)*100,2)

BREAST_DF_PSL_perc_surv$`Death %` <- round(BREAST_DF_PSL_perc_surv$`Event Population`/BREAST_DF_PSL_perc_surv$Population*100,2)

kable(BREAST_DF_PSL_perc_surv)

Primary Site - labeled	Event Population	Population	Group % in total	Death %
C50.0-Nipple	173	1477	0.49	11.71
C50.1-Central portion of breast	2043	14012	4.62	14.58
C50.2-Upper-inner quadrant of breast	3058	36006	11.86	8.49
C50.3-Lower-inner quadrant of breast	1572	16365	5.39	9.61
C50.4-Upper-outer quadrant of breast	9710	98199	32.35	9.89
C50.5-Lower-outer quadrant of breast	2287	21939	7.23	10.42
C50.6-Axillary tail of breast	270	1685	0.56	16.02
C50.8-Overlapping lesion of breast	7514	68285	22.49	11.00
C50.9-Breast, NOS	11845	45589	15.02	25.98

# Names of the summary data frames to plot
DF_names <- c(
  "BREAST_DF_TNoT_perc_surv", 
  "BREAST_DF_RNC_perc_surv",
  "BREAST_DF_SUR_perc_surv",
  "BREAST_DF_MARI_perc_surv",
  "BREAST_DF_HHI_perc_surv",
  "BREAST_DF_PSL_perc_surv")

# Create an empty list to store plots
plot_list <- list()
chart_color <- c("plum", "darkgreen", "darkred", "darkblue", "darkorange", "darkmagenta",
                 "darkcyan", "purple", "lightblue", "darkgray", "lightpink", "blue",
                 "brown", "red")
chart_title <- c("# of Malignant Tumors", 
                 "Radiation/Chemo Status", 
                 "Cancer Surgery",
                 "Marital Status",
                 "Household Income",
                 "Primary Site Labeled")
set.seed(2014)
# Loop through each dataframe
for (i in seq_along(DF_names)) {
  # Access the dataframe
  df <- get(DF_names[i])
  
  # Generate a random color
  random_color <- sample(chart_color, 1)
  
  # Get the name of the first column and wrap the text
  column_name <- str_wrap(names(df)[1], width = 10)  # Adjust width as needed
  
  # Create the plot and store it in the plot list
  plot <- ggplot(df, aes(x = !!rlang::sym(names(df)[1]), y = !!rlang::sym("Death %"))) +
    geom_bar(stat = "identity", fill = random_color) +
    labs(title = chart_title[i],
         x = NULL, y = "Death %") +  # Remove x-axis label
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))  # Rotate x-axis labels

  plot_list[[i]] <- plot
}

# Arrange the plots in a 2 by 3 matrix
grid.arrange(grobs = plot_list, ncol = 3)

# Plot individually 

# Loop through each dataframe
for (i in seq_along(DF_names)) {
  # Access the dataframe
  df <- get(DF_names[i])
  
  # Generate a random color
  random_color <- sample(chart_color, 1)
  
  # Get the name of the first column and wrap the text
  column_name <- str_wrap(names(df)[1], width = 10)  # Adjust width as needed
  
  # Create the plot
  plot <- ggplot(df, aes(x = !!rlang::sym(names(df)[1]), y = !!rlang::sym("Death %"))) +
    geom_bar(stat = "identity", fill = random_color) +
    labs(title = chart_title[i],
         x = NULL, y = "Death %") +  # Remove x-axis label
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))  # Rotate x-axis labels

  # Print the plot
  print(plot)
}

Correlation investigation

In this section we use several R packages to perform correlation and other analyses on the data. To do so, we first need to adjust the data slightly to make it suitable for packages such as survival, purrr, caret, and GGally.

The first step is to convert the categorical columns to factors. We then use purrr to calculate chi-squared and Fisher's exact tests for the different variables. Because the population is large, we use bootstrap resampling and simulated p-values to gauge the importance of the different variables.

The strategy is to find the variables with the strongest effect. In theory, the code calculates the p-values from chi-squared/Fisher's exact tests of independence between each categorical variable and the COD (cause of death) column. The lower the p-value, the stronger the evidence against the null hypothesis of independence, suggesting a significant association between the variable and COD. We then simplify the model by keeping the most relevant variables; we also need to check for homoscedasticity and remove the variables that may contribute to violating it.
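To illustrate why a large population makes raw p-values hard to interpret, the sketch here runs `chisq.test` on two toy 2x2 tables that have the same 52/48 split and differ only in sample size. The tables are invented for illustration and are not SEER data.

```r
# Same 52/48 split in both tables; only the sample size differs
small <- matrix(c(52, 48, 48, 52), nrow = 2)   # n = 200
large <- small * 1000                          # n = 200000

p_small <- chisq.test(small)$p.value
p_large <- chisq.test(large)$p.value

print(c(n_200 = p_small, n_200000 = p_large))
```

The small table gives no evidence against independence, while the large one gives a vanishingly small p-value for the same weak association, which is the motivation for downsampling and for looking at effect sizes alongside p-values.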

We then explore the data. Some columns can be eliminated from this analysis, e.g., year, race (there are two race columns), and so on. The following list shows the columns that are eliminated in the next steps of the analysis.

Sex, Year of diagnosis
Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic), due to collinearity with Race recode
Site recode ICD-O-3/WHO 2008, Site recode ICD-O-3 2023 Revision, Diagnostic Confirmation, Survival months flag, COD to site recode (replaced with COD), Patient ID, Year of follow-up recode, Year of death recode, SEER other cause of death classification, RX Summ--Systemic/Sur Seq (2007+), Origin recode NHIA (Hispanic, Non-Hisp)

Fisher's exact test and chi-squared test

# List of columns to remove
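uncritical_column <- c("Sex", "Year of diagnosis", 
                       "Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)", 
                       "Site recode ICD-O-3/WHO 2008", "Site recode ICD-O-3 2023 Revision", 
                       # NOTE: the next string joins two column names, so it matches no column and
                       # "Survival months flag" stays in the cleaned data (see the skim output)
                       "Diagnostic Confirmation, Survival months flag", "COD to site recode", 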
                       "Patient ID", "Year of follow-up recode", "Year of death recode", 
                       "SEER other cause of death classification", 
                       "RX Summ--Systemic/Sur Seq (2007+)",
                       "Origin recode NHIA (Hispanic, Non-Hisp)",
                       "Race and origin (recommended by SEER)",
                       "Diagnostic Confirmation",
                       "Sequence number", "Radiation recode")

# Create BREAST_DF_surv_clean by removing uncritical columns
BREAST_DF_surv_clean <- BREAST_DF_surv[, !names(BREAST_DF_surv) %in% uncritical_column]
BREAST_DF_eval_clean <- BREAST_DF_eval[, !names(BREAST_DF_eval) %in% uncritical_column]


# Identify character and numeric columns
char_cols <- sapply(BREAST_DF_surv_clean, is.character)
num_cols <- sapply(BREAST_DF_surv_clean, is.numeric)
char_cols_e <- sapply(BREAST_DF_eval_clean, is.character)

# Convert character columns to factors
BREAST_DF_surv_clean[char_cols] <- lapply(BREAST_DF_surv_clean[char_cols], as.factor)
BREAST_DF_eval_clean[char_cols_e] <- lapply(BREAST_DF_eval_clean[char_cols_e], as.factor)
#BREAST_DF_surv[num_cols] <- lapply(BREAST_DF_surv[num_cols], as.factor)

# Check the class of each column to ensure they are factors now
#sapply(BREAST_DF_surv, class)


# Check that every variable has more than one level
one_level_vars <- sapply(BREAST_DF_surv_clean, function(x) length(unique(x)) == 1)
# Names of the variables with only one level
one_level_vars_names <- names(one_level_vars)[one_level_vars]
#print(names(one_level_vars)[one_level_vars])

# Remove variables with only one level from the data frame
BREAST_DF_surv_clean <- BREAST_DF_surv_clean[, !names(BREAST_DF_surv_clean) %in% one_level_vars_names]
BREAST_DF_eval_clean <- BREAST_DF_eval_clean[, !names(BREAST_DF_eval_clean) %in% one_level_vars_names]


skimr::skim(BREAST_DF_surv_clean)

Data summary
Name	BREAST_DF_surv_clean
Number of rows	303557
Number of columns	18
_______________________
Column type frequency:
factor	14
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Race recode (W, B, AI, API)	1	FALSE	5	Whi: 240584, Bla: 32165, Asi: 27061, Ame: 1933
Primary Site - labeled	1	FALSE	9	C50: 98199, C50: 68285, C50: 45589, C50: 36006
Grade Recode (thru 2017)	1	FALSE	5	Mod: 119566, Poo: 84251, Wel: 64536, Unk: 34855
Laterality	1	FALSE	5	Lef: 152350, Rig: 147730, Pai: 3152, Onl: 190
Chemotherapy recode (yes, no/unk)	1	FALSE	2	No/: 186938, Yes: 116619
Reason no cancer-directed surgery	1	FALSE	8	Sur: 269730, Not: 23199, Rec: 2649, Rec: 2608
Survival months flag	1	FALSE	5	Com: 295136, Inc: 6620, Not: 1290, Com: 376
First malignant primary indicator	1	FALSE	2	Yes: 252683, No: 50874
Marital status at diagnosis	1	FALSE	7	Mar: 160551, Sin: 44678, Wid: 43394, Div: 32214
Median household income inflation adj to 2021	1	FALSE	11	$75: 107459, $65: 44978, $60: 43537, $70: 31930
Rural-Urban Continuum Code	1	FALSE	7	Cou: 185374, Cou: 65041, Cou: 21239, Non: 18125
Age recode (<60,60-69,70+)	1	FALSE	18	60-: 41318, 65-: 41060, 55-: 37068, 50-: 34424
COD	1	FALSE	3	Ali: 228221, Bre: 38472, Oth: 36864
Radiation	1	FALSE	2	No/: 161978, Yes: 141579

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Months from diagnosis to treatment	15843	0.95	1.13	1.14	0	0	1	2	24	▇▁▁▁▁
Survival months	1290	1.00	74.22	29.88	0	62	78	97	119	▂▂▆▇▆
Total number of in situ/malignant tumors for patient	3	1.00	1.36	0.65	1	1	1	2	20	▇▁▁▁▁
Total number of benign/borderline tumors for patient	0	1.00	0.01	0.09	0	0	0	0	5	▇▁▁▁▁

skimr::skim(BREAST_DF_eval_clean)

Data summary
Name	BREAST_DF_eval_clean
Number of rows	131395
Number of columns	17
_______________________
Column type frequency:
factor	13
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Race recode (W, B, AI, API)	1	FALSE	5	Whi: 100601, Bla: 14533, Asi: 13448, Unk: 1891
Primary Site - labeled	1	FALSE	9	C50: 43321, C50: 30822, C50: 16539, C50: 16423
Grade Recode (thru 2017)	1	FALSE	1	Unk: 131395
Laterality	1	FALSE	5	Lef: 66096, Rig: 63885, Pai: 1317, Bil: 52
Chemotherapy recode (yes, no/unk)	1	FALSE	2	No/: 83776, Yes: 47619
Reason no cancer-directed surgery	1	FALSE	8	Sur: 114210, Not: 12567, Rec: 1144, Rec: 1111
Survival months flag	1	FALSE	5	Com: 128932, Inc: 1037, Com: 633, Not: 537
First malignant primary indicator	1	FALSE	2	Yes: 107910, No: 23485
Marital status at diagnosis	1	FALSE	7	Mar: 70613, Sin: 20883, Wid: 16724, Div: 13667
Median household income inflation adj to 2021	1	FALSE	11	$75: 84913, $55: 8336, $65: 8298, $70: 8158
Rural-Urban Continuum Code	1	FALSE	7	Cou: 80172, Cou: 28055, Cou: 9574, Non: 7900
Age recode (<60,60-69,70+)	1	FALSE	17	65-: 18702, 60-: 17760, 70-: 17096, 55-: 15189
Radiation	1	FALSE	2	Yes: 65993, No/: 65402

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Months from diagnosis to treatment	6807	0.95	1.26	1.18	0	1	1	2	24	▇▁▁▁▁
Survival months	537	1.00	11.07	7.05	0	5	11	17	23	▇▆▆▇▆
Total number of in situ/malignant tumors for patient	11	1.00	1.31	0.62	1	1	1	1	50	▇▁▁▁▁
Total number of benign/borderline tumors for patient	0	1.00	0.01	0.09	0	0	0	0	2	▇▁▁▁▁

# Function to calculate chi-squared test for independence
chi_squared_cal <- function(var, data) {
  tab <- table(data$COD, var)
  chisq_result <- chisq.test(tab)
  p_value <- chisq_result$p.value
  return(p_value)
}

# Function to calculate Fisher's exact test for independence (simulated p-value)
fisher_exact_cal <- function(var, data) {
  tab <- table(data$COD, var)
  # Perform Fisher's exact test
  fisher_result <- fisher.test(tab, simulate.p.value = TRUE)
  # Extract the p-value
  p_value <- fisher_result$p.value  
  return(p_value)
}


# Initialize an empty list to store p-values
p_values <- list()

# Number of bootstrap samples
n_bootstrap <- 50

# Bootstrap with downsampling (5% of the rows per sample) to reduce the effect of population size on the test; the association still appears strong, with all p-values very close to 0
# Loop over each column in the data frame
for (col in names(BREAST_DF_surv_clean)) {
  # Check if the column is a factor
  if (is.factor(BREAST_DF_surv_clean[[col]])) {
    # Initialize an empty vector to store p-values from bootstrap samples
    bootstrap_p_values <- numeric(n_bootstrap)
    
    # Perform bootstrap sampling and calculate the Fisher's exact test p-value for each sample
    for (i in 1:n_bootstrap) {
      # Generate a bootstrap sample with replacement
      bootstrap_data <- 
        BREAST_DF_surv_clean[sample(nrow(BREAST_DF_surv_clean), 
                                    size = 0.05 * nrow(BREAST_DF_surv_clean), 
                                    replace = TRUE), ]
      
      # Calculate the p-value for the bootstrap sample (the chi-squared alternative is kept commented out)
      #bootstrap_p_values[i] <- chi_squared_cal(bootstrap_data[[col]], bootstrap_data)
      bootstrap_p_values[i] <- fisher_exact_cal(bootstrap_data[[col]], bootstrap_data)
    }
    
    # Calculate the mean p-value from bootstrap samples
    mean_p_value <- mean(bootstrap_p_values)
    
    # Store the mean p-value for the column
    p_values[[col]] <- mean_p_value
  }
}

# Convert the list of p-values to a data frame
p_values_df <- data.frame(variable = names(p_values), p_value = unlist(p_values))

# Sort the results by p-values
sorted_results <- p_values_df[order(p_values_df$p_value, na.last = TRUE), ]

# Print the sorted p-values
kable(sorted_results)

	variable	p_value
Race recode (W, B, AI, API)	Race recode (W, B, AI, API)	0.0004998
Primary Site - labeled	Primary Site - labeled	0.0004998
Grade Recode (thru 2017)	Grade Recode (thru 2017)	0.0004998
Laterality	Laterality	0.0004998
Chemotherapy recode (yes, no/unk)	Chemotherapy recode (yes, no/unk)	0.0004998
Reason no cancer-directed surgery	Reason no cancer-directed surgery	0.0004998
Survival months flag	Survival months flag	0.0004998
First malignant primary indicator	First malignant primary indicator	0.0004998
Marital status at diagnosis	Marital status at diagnosis	0.0004998
Median household income inflation adj to 2021	Median household income inflation adj to 2021	0.0004998
Age recode (<60,60-69,70+)	Age recode (<60,60-69,70+)	0.0004998
COD	COD	0.0004998
Radiation	Radiation	0.0004998
Rural-Urban Continuum Code	Rural-Urban Continuum Code	0.0015892
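The value 0.0004998 that appears for almost every variable equals 1/2001. With `simulate.p.value = TRUE`, `fisher.test` uses `B = 2000` Monte Carlo replicates by default and reports (1 + number of simulated tables at least as extreme as the observed one) / (B + 1), so 1/2001 is the smallest p-value it can return. A mean of 0.0004998 therefore indicates that every bootstrap sample hit this floor, and these values cannot be used to rank the variables against each other. The COD row is also a test of COD against itself. The sketch here reproduces the floor on a toy table that is not SEER data.

```r
set.seed(2014)
B <- 2000                # default number of Monte Carlo replicates in fisher.test
p_floor <- 1 / (B + 1)   # smallest p-value the simulation can report

# Toy 3x3 table with a very strong association (illustration only)
x <- factor(rep(c("a", "b", "c"), each = 100))
y <- x
p_sim <- fisher.test(table(x, y), simulate.p.value = TRUE, B = B)$p.value

print(c(p_floor = p_floor, p_sim = p_sim))
```

Raising `B` lowers the floor at the cost of run time; an effect-size measure such as Cramér's V is usually a more informative way to compare variables at this sample size.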

Correlation Analyses

In this section I use existing R packages to calculate the correlations between the different columns and COD. To do so, we first separate the numerical and categorical data, since they need to be treated separately when calculating their correlation with COD. We start by finding the Pearson correlation coefficient between COD and the numerical columns.

# Select numerical columns in your dataset
numeric_cols <- sapply(BREAST_DF_surv_clean, is.numeric)

# Separate numerical and categorical columns
numeric_data <- BREAST_DF_surv_clean[, numeric_cols]
categorical_data <- BREAST_DF_surv_clean[, !numeric_cols]

# Calculate Pearson correlation coefficients between "COD" and the numerical columns
# (rcorr() comes from the Hmisc package; COD appears as column "y" in the result)
correlation_with_COD_numeric <- rcorr(as.matrix(numeric_data), y = BREAST_DF_surv_clean$COD, type = "pearson")

# Print correlation coefficients for numerical columns
#kable(print(correlation_with_COD_numeric$r))

library(kableExtra)

# Print correlation coefficients for numerical columns
correlation_table <- correlation_with_COD_numeric$r
rownames(correlation_table) <- colnames(correlation_table)

# Display as a table
kable(correlation_table, caption = "Correlation Coefficients with COD")

Correlation Coefficients with COD
	Months from diagnosis to treatment	Survival months	Total number of in situ/malignant tumors for patient	Total number of benign/borderline tumors for patient	y
Months from diagnosis to treatment	1.0000000	-0.0139649	0.0186951	0.0005761	-0.0037166
Survival months	-0.0139649	1.0000000	-0.0347760	0.0051819	-0.5516706
Total number of in situ/malignant tumors for patient	0.0186951	-0.0347760	1.0000000	0.0181349	0.1470846
Total number of benign/borderline tumors for patient	0.0005761	0.0051819	0.0181349	1.0000000	0.0096745
y	-0.0037166	-0.5516706	0.1470846	0.0096745	1.0000000
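
Because COD is a categorical outcome, the Pearson coefficient in the y column presumably depends on how its levels are coded as integers. A minimal sketch, using made-up toy values rather than SEER records, of correlating a numeric column with an explicit 0/1 death indicator instead (the point-biserial case of Pearson's r). The names `toy`, `died`, and `r_pb` are hypothetical and not part of the original analysis:

```r
# Toy data (made-up values, not SEER records)
toy <- data.frame(
  survival_months = c(60, 58, 12, 55, 8, 30, 59, 15),
  COD = factor(c("Alive", "Alive", "Breast", "Alive",
                 "Other", "Breast", "Alive", "Other"))
)

# Explicit binary indicator: 1 = died (any cause), 0 = alive
toy$died <- as.integer(toy$COD != "Alive")

# Pearson's r with a 0/1 variable is the point-biserial correlation
r_pb <- cor(toy$survival_months, toy$died, method = "pearson")
print(r_pb)
```

With an explicit indicator the sign of the coefficient has a direct reading: a negative value means that patients who died tend to have fewer survival months.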

library(reshape2)  # For melt function

# Melt correlation matrix
correlation_melted <- melt(correlation_table)

# Plot heatmap
ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limits = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                    size = 8, hjust = 1)) +
  coord_fixed()

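# Cross-tabulate each categorical column against "COD".
# As the table printed below shows, this call returns the contingency counts
# per level rather than a single Cramér's V value per column.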
cramer_v <- apply(categorical_data, 2, function(x) {
  table_data <- table(x, BREAST_DF_surv_clean$COD)
  assoc(table_data, method = "cramers")
})

# Print Cramér's V for association with categorical columns
#print(cramer_v)

# Insert a line break or comment to separate the code blocks
cat("\n")

# Initialize an empty data frame
cramer_v_df <- data.frame(Variable = character(), Value = numeric(), row.names = NULL)

# Iterate over each variable and its associated Cramér's V value
for (var_name in names(cramer_v)) {
  # Extract Cramér's V value for the current variable
  cramer_v_value <- cramer_v[[var_name]]
  
  # Append a row to the data frame with the variable name and its Cramér's V value
  cramer_v_df <- rbind(cramer_v_df, data.frame(Variable = var_name, Value = cramer_v_value))
}

# Print as a table
kable(cramer_v_df, caption = "Cramer's V for Association with COD")

Cramer’s V for Association with COD
Variable	Level	COD	Count
Race recode (W, B, AI, API)	American Indian/Alaska Native	Alive	1416
Race recode (W, B, AI, API)	Asian or Pacific Islander	Alive	22312
Race recode (W, B, AI, API)	Black	Alive	21523
Race recode (W, B, AI, API)	Unknown	Alive	1681
Race recode (W, B, AI, API)	White	Alive	181289
Race recode (W, B, AI, API)	American Indian/Alaska Native	Breast	252
Race recode (W, B, AI, API)	Asian or Pacific Islander	Breast	2749
Race recode (W, B, AI, API)	Black	Breast	6369
Race recode (W, B, AI, API)	Unknown	Breast	70
Race recode (W, B, AI, API)	White	Breast	29032
Race recode (W, B, AI, API)	American Indian/Alaska Native	Other	265
Race recode (W, B, AI, API)	Asian or Pacific Islander	Other	2000
Race recode (W, B, AI, API)	Black	Other	4273
Race recode (W, B, AI, API)	Unknown	Other	63
Race recode (W, B, AI, API)	White	Other	30263
Primary Site - labeled	C50.0-Nipple	Alive	1033
Primary Site - labeled	C50.1-Central portion of breast	Alive	9851
Primary Site - labeled	C50.2-Upper-inner quadrant of breast	Alive	28945
Primary Site - labeled	C50.3-Lower-inner quadrant of breast	Alive	12769
Primary Site - labeled	C50.4-Upper-outer quadrant of breast	Alive	77369
Primary Site - labeled	C50.5-Lower-outer quadrant of breast	Alive	17218
Primary Site - labeled	C50.6-Axillary tail of breast	Alive	1205
Primary Site - labeled	C50.8-Overlapping lesion of breast	Alive	52392
Primary Site - labeled	C50.9-Breast, NOS	Alive	27439
Primary Site - labeled	C50.0-Nipple	Breast	173
Primary Site - labeled	C50.1-Central portion of breast	Breast	2043
Primary Site - labeled	C50.2-Upper-inner quadrant of breast	Breast	3058
Primary Site - labeled	C50.3-Lower-inner quadrant of breast	Breast	1572
Primary Site - labeled	C50.4-Upper-outer quadrant of breast	Breast	9710
Primary Site - labeled	C50.5-Lower-outer quadrant of breast	Breast	2287
Primary Site - labeled	C50.6-Axillary tail of breast	Breast	270
Primary Site - labeled	C50.8-Overlapping lesion of breast	Breast	7514
Primary Site - labeled	C50.9-Breast, NOS	Breast	11845
Primary Site - labeled	C50.0-Nipple	Other	271
Primary Site - labeled	C50.1-Central portion of breast	Other	2118
Primary Site - labeled	C50.2-Upper-inner quadrant of breast	Other	4003
Primary Site - labeled	C50.3-Lower-inner quadrant of breast	Other	2024
Primary Site - labeled	C50.4-Upper-outer quadrant of breast	Other	11120
Primary Site - labeled	C50.5-Lower-outer quadrant of breast	Other	2434
Primary Site - labeled	C50.6-Axillary tail of breast	Other	210
Primary Site - labeled	C50.8-Overlapping lesion of breast	Other	8379
Primary Site - labeled	C50.9-Breast, NOS	Other	6305
Grade Recode (thru 2017)	Moderately differentiated; Grade II	Alive	93775
Grade Recode (thru 2017)	Poorly differentiated; Grade III	Alive	59437
Grade Recode (thru 2017)	Undifferentiated; anaplastic; Grade IV	Alive	202
Grade Recode (thru 2017)	Unknown	Alive	20725
Grade Recode (thru 2017)	Well differentiated; Grade I	Alive	54082
Grade Recode (thru 2017)	Moderately differentiated; Grade II	Breast	11130
Grade Recode (thru 2017)	Poorly differentiated; Grade III	Breast	15938
Grade Recode (thru 2017)	Undifferentiated; anaplastic; Grade IV	Breast	98
Grade Recode (thru 2017)	Unknown	Breast	8913
Grade Recode (thru 2017)	Well differentiated; Grade I	Breast	2393
Grade Recode (thru 2017)	Moderately differentiated; Grade II	Other	14661
Grade Recode (thru 2017)	Poorly differentiated; Grade III	Other	8876
Grade Recode (thru 2017)	Undifferentiated; anaplastic; Grade IV	Other	49
Grade Recode (thru 2017)	Unknown	Other	5217
Grade Recode (thru 2017)	Well differentiated; Grade I	Other	8061
Laterality	Bilateral, single primary	Alive	21
Laterality	Left - origin of primary	Alive	115104
Laterality	Only one side - side unspecified	Alive	59
Laterality	Paired site, but no information concerning laterality	Alive	438
Laterality	Right - origin of primary	Alive	112599
Laterality	Bilateral, single primary	Breast	89
Laterality	Left - origin of primary	Breast	18661
Laterality	Only one side - side unspecified	Breast	87
Laterality	Paired site, but no information concerning laterality	Breast	2080
Laterality	Right - origin of primary	Breast	17555
Laterality	Bilateral, single primary	Other	25
Laterality	Left - origin of primary	Other	18585
Laterality	Only one side - side unspecified	Other	44
Laterality	Paired site, but no information concerning laterality	Other	634
Laterality	Right - origin of primary	Other	17576
Chemotherapy recode (yes, no/unk)	No/Unknown	Alive	137991
Chemotherapy recode (yes, no/unk)	Yes	Alive	90230
Chemotherapy recode (yes, no/unk)	No/Unknown	Breast	19415
Chemotherapy recode (yes, no/unk)	Yes	Breast	19057
Chemotherapy recode (yes, no/unk)	No/Unknown	Other	29532
Chemotherapy recode (yes, no/unk)	Yes	Other	7332
Reason no cancer-directed surgery	Not performed, patient died prior to recommended surgery	Alive	0
Reason no cancer-directed surgery	Not recommended	Alive	6917
Reason no cancer-directed surgery	Not recommended, contraindicated due to other cond; autopsy only (1973-2002)	Alive	118
Reason no cancer-directed surgery	Recommended but not performed, patient refused	Alive	686
Reason no cancer-directed surgery	Recommended but not performed, unknown reason	Alive	729
Reason no cancer-directed surgery	Recommended, unknown if performed	Alive	1741
Reason no cancer-directed surgery	Surgery performed	Alive	217725
Reason no cancer-directed surgery	Unknown; death certificate; or autopsy only (2003+)	Alive	305
Reason no cancer-directed surgery	Not performed, patient died prior to recommended surgery	Breast	139
Reason no cancer-directed surgery	Not recommended	Breast	11636
Reason no cancer-directed surgery	Not recommended, contraindicated due to other cond; autopsy only (1973-2002)	Breast	593
Reason no cancer-directed surgery	Recommended but not performed, patient refused	Breast	1171
Reason no cancer-directed surgery	Recommended but not performed, unknown reason	Breast	545
Reason no cancer-directed surgery	Recommended, unknown if performed	Breast	613
Reason no cancer-directed surgery	Surgery performed	Breast	22376
Reason no cancer-directed surgery	Unknown; death certificate; or autopsy only (2003+)	Breast	1399
Reason no cancer-directed surgery	Not performed, patient died prior to recommended surgery	Other	139
Reason no cancer-directed surgery	Not recommended	Other	4646
Reason no cancer-directed surgery	Not recommended, contraindicated due to other cond; autopsy only (1973-2002)	Other	645
Reason no cancer-directed surgery	Recommended but not performed, patient refused	Other	751
Reason no cancer-directed surgery	Recommended but not performed, unknown reason	Other	330
Reason no cancer-directed surgery	Recommended, unknown if performed	Other	295
Reason no cancer-directed surgery	Surgery performed	Other	29629
Reason no cancer-directed surgery	Unknown; death certificate; or autopsy only (2003+)	Other	429
Survival months flag	Complete dates are available and there are 0 days of survival	Alive	248
Survival months flag	Complete dates are available and there are more than 0 days of survival	Alive	223378
Survival months flag	Incomplete dates are available and there cannot be zero days of follow-up	Alive	4551
Survival months flag	Incomplete dates are available and there could be zero days of follow-up	Alive	44
Survival months flag	Not calculated because a Death Certificate Only or Autopsy Only case	Alive	0
Survival months flag	Complete dates are available and there are 0 days of survival	Breast	83
Survival months flag	Complete dates are available and there are more than 0 days of survival	Breast	36117
Survival months flag	Incomplete dates are available and there cannot be zero days of follow-up	Breast	1183
Survival months flag	Incomplete dates are available and there could be zero days of follow-up	Breast	59
Survival months flag	Not calculated because a Death Certificate Only or Autopsy Only case	Breast	1030
Survival months flag	Complete dates are available and there are 0 days of survival	Other	45
Survival months flag	Complete dates are available and there are more than 0 days of survival	Other	35641
Survival months flag	Incomplete dates are available and there cannot be zero days of follow-up	Other	886
Survival months flag	Incomplete dates are available and there could be zero days of follow-up	Other	32
Survival months flag	Not calculated because a Death Certificate Only or Autopsy Only case	Other	260
First malignant primary indicator	No	Alive	32987
First malignant primary indicator	Yes	Alive	195234
First malignant primary indicator	No	Breast	7480
First malignant primary indicator	Yes	Breast	30992
First malignant primary indicator	No	Other	10407
First malignant primary indicator	Yes	Other	26457
Marital status at diagnosis	Divorced	Alive	23903
Marital status at diagnosis	Married (including common law)	Alive	132121
Marital status at diagnosis	Separated	Alive	2401
Marital status at diagnosis	Single (never married)	Alive	32829
Marital status at diagnosis	Unknown	Alive	12919
Marital status at diagnosis	Unmarried or Domestic Partner	Alive	844
Marital status at diagnosis	Widowed	Alive	23204
Marital status at diagnosis	Divorced	Breast	4399
Marital status at diagnosis	Married (including common law)	Breast	15694
Marital status at diagnosis	Separated	Breast	544
Marital status at diagnosis	Single (never married)	Breast	7161
Marital status at diagnosis	Unknown	Breast	2774
Marital status at diagnosis	Unmarried or Domestic Partner	Breast	110
Marital status at diagnosis	Widowed	Breast	7790
Marital status at diagnosis	Divorced	Other	3912
Marital status at diagnosis	Married (including common law)	Other	12736
Marital status at diagnosis	Separated	Other	280
Marital status at diagnosis	Single (never married)	Other	4688
Marital status at diagnosis	Unknown	Other	2788
Marital status at diagnosis	Unmarried or Domestic Partner	Other	60
Marital status at diagnosis	Widowed	Other	12400
Median household income inflation adj to 2021	$35,000 - $39,999	Alive	4108
Median household income inflation adj to 2021	$40,000 - $44,999	Alive	6976
Median household income inflation adj to 2021	$45,000 - $49,999	Alive	10351
Median household income inflation adj to 2021	$50,000 - $54,999	Alive	11978
Median household income inflation adj to 2021	$55,000 - $59,999	Alive	18238
Median household income inflation adj to 2021	$60,000 - $64,999	Alive	32172
Median household income inflation adj to 2021	$65,000 - $69,999	Alive	34163
Median household income inflation adj to 2021	$70,000 - $74,999	Alive	23995
Median household income inflation adj to 2021	$75,000+	Alive	84391
Median household income inflation adj to 2021	< $35,000	Alive	1799
Median household income inflation adj to 2021	Unknown/missing/no match/Not 1990-2021	Alive	50
Median household income inflation adj to 2021	$35,000 - $39,999	Breast	1000
Median household income inflation adj to 2021	$40,000 - $44,999	Breast	1630
Median household income inflation adj to 2021	$45,000 - $49,999	Breast	2289
Median household income inflation adj to 2021	$50,000 - $54,999	Breast	2310
Median household income inflation adj to 2021	$55,000 - $59,999	Breast	3371
Median household income inflation adj to 2021	$60,000 - $64,999	Breast	6010
Median household income inflation adj to 2021	$65,000 - $69,999	Breast	5848
Median household income inflation adj to 2021	$70,000 - $74,999	Breast	3927
Median household income inflation adj to 2021	$75,000+	Breast	11608
Median household income inflation adj to 2021	< $35,000	Breast	469
Median household income inflation adj to 2021	Unknown/missing/no match/Not 1990-2021	Breast	10
Median household income inflation adj to 2021	$35,000 - $39,999	Other	969
Median household income inflation adj to 2021	$40,000 - $44,999	Other	1619
Median household income inflation adj to 2021	$45,000 - $49,999	Other	2277
Median household income inflation adj to 2021	$50,000 - $54,999	Other	2506
Median household income inflation adj to 2021	$55,000 - $59,999	Other	3251
Median household income inflation adj to 2021	$60,000 - $64,999	Other	5355
Median household income inflation adj to 2021	$65,000 - $69,999	Other	4967
Median household income inflation adj to 2021	$70,000 - $74,999	Other	4008
Median household income inflation adj to 2021	$75,000+	Other	11460
Median household income inflation adj to 2021	< $35,000	Other	448
Median household income inflation adj to 2021	Unknown/missing/no match/Not 1990-2021	Other	4
Rural-Urban Continuum Code	Counties in metropolitan areas ge 1 million pop	Alive	141535
Rural-Urban Continuum Code	Counties in metropolitan areas of 250,000 to 1 million pop	Alive	48846
Rural-Urban Continuum Code	Counties in metropolitan areas of lt 250 thousand pop	Alive	15452
Rural-Urban Continuum Code	Nonmetropolitan counties adjacent to a metropolitan area	Alive	12781
Rural-Urban Continuum Code	Nonmetropolitan counties not adjacent to a metropolitan area	Alive	9289
Rural-Urban Continuum Code	Unknown/missing/no match (Alaska or Hawaii - Entire State)	Alive	268
Rural-Urban Continuum Code	Unknown/missing/no match/Not 1990-2021	Alive	50
Rural-Urban Continuum Code	Counties in metropolitan areas ge 1 million pop	Breast	23147
Rural-Urban Continuum Code	Counties in metropolitan areas of 250,000 to 1 million pop	Breast	7884
Rural-Urban Continuum Code	Counties in metropolitan areas of lt 250 thousand pop	Breast	2843
Rural-Urban Continuum Code	Nonmetropolitan counties adjacent to a metropolitan area	Breast	2578
Rural-Urban Continuum Code	Nonmetropolitan counties not adjacent to a metropolitan area	Breast	1970
Rural-Urban Continuum Code	Unknown/missing/no match (Alaska or Hawaii - Entire State)	Breast	40
Rural-Urban Continuum Code	Unknown/missing/no match/Not 1990-2021	Breast	10
Rural-Urban Continuum Code	Counties in metropolitan areas ge 1 million pop	Other	20692
Rural-Urban Continuum Code	Counties in metropolitan areas of 250,000 to 1 million pop	Other	8311
Rural-Urban Continuum Code	Counties in metropolitan areas of lt 250 thousand pop	Other	2944
Rural-Urban Continuum Code	Nonmetropolitan counties adjacent to a metropolitan area	Other	2766
Rural-Urban Continuum Code	Nonmetropolitan counties not adjacent to a metropolitan area	Other	2090
Rural-Urban Continuum Code	Unknown/missing/no match (Alaska or Hawaii - Entire State)	Other	57
Rural-Urban Continuum Code	Unknown/missing/no match/Not 1990-2021	Other	4
Age recode (<60,60-69,70+)	01-04 years	Alive	1
Age recode (<60,60-69,70+)	05-09 years	Alive	2
Age recode (<60,60-69,70+)	10-14 years	Alive	2
Age recode (<60,60-69,70+)	15-19 years	Alive	14
Age recode (<60,60-69,70+)	20-24 years	Alive	178
Age recode (<60,60-69,70+)	25-29 years	Alive	1097
Age recode (<60,60-69,70+)	30-34 years	Alive	3307
Age recode (<60,60-69,70+)	35-39 years	Alive	7040
Age recode (<60,60-69,70+)	40-44 years	Alive	15293
Age recode (<60,60-69,70+)	45-49 years	Alive	24158
Age recode (<60,60-69,70+)	50-54 years	Alive	29263
Age recode (<60,60-69,70+)	55-59 years	Alive	30741
Age recode (<60,60-69,70+)	60-64 years	Alive	33793
Age recode (<60,60-69,70+)	65-69 years	Alive	32764
Age recode (<60,60-69,70+)	70-74 years	Alive	23598
Age recode (<60,60-69,70+)	75-79 years	Alive	15007
Age recode (<60,60-69,70+)	80-84 years	Alive	8013
Age recode (<60,60-69,70+)	85+ years	Alive	3950
Age recode (<60,60-69,70+)	01-04 years	Breast	0
Age recode (<60,60-69,70+)	05-09 years	Breast	0
Age recode (<60,60-69,70+)	10-14 years	Breast	0
Age recode (<60,60-69,70+)	15-19 years	Breast	1
Age recode (<60,60-69,70+)	20-24 years	Breast	59
Age recode (<60,60-69,70+)	25-29 years	Breast	265
Age recode (<60,60-69,70+)	30-34 years	Breast	686
Age recode (<60,60-69,70+)	35-39 years	Breast	1327
Age recode (<60,60-69,70+)	40-44 years	Breast	2019
Age recode (<60,60-69,70+)	45-49 years	Breast	2765
Age recode (<60,60-69,70+)	50-54 years	Breast	3909
Age recode (<60,60-69,70+)	55-59 years	Breast	4360
Age recode (<60,60-69,70+)	60-64 years	Breast	4531
Age recode (<60,60-69,70+)	65-69 years	Breast	4136
Age recode (<60,60-69,70+)	70-74 years	Breast	3663
Age recode (<60,60-69,70+)	75-79 years	Breast	3196
Age recode (<60,60-69,70+)	80-84 years	Breast	3003
Age recode (<60,60-69,70+)	85+ years	Breast	4552
Age recode (<60,60-69,70+)	01-04 years	Other	0
Age recode (<60,60-69,70+)	05-09 years	Other	0
Age recode (<60,60-69,70+)	10-14 years	Other	0
Age recode (<60,60-69,70+)	15-19 years	Other	2
Age recode (<60,60-69,70+)	20-24 years	Other	11
Age recode (<60,60-69,70+)	25-29 years	Other	43
Age recode (<60,60-69,70+)	30-34 years	Other	100
Age recode (<60,60-69,70+)	35-39 years	Other	182
Age recode (<60,60-69,70+)	40-44 years	Other	423
Age recode (<60,60-69,70+)	45-49 years	Other	713
Age recode (<60,60-69,70+)	50-54 years	Other	1252
Age recode (<60,60-69,70+)	55-59 years	Other	1967
Age recode (<60,60-69,70+)	60-64 years	Other	2994
Age recode (<60,60-69,70+)	65-69 years	Other	4160
Age recode (<60,60-69,70+)	70-74 years	Other	4927
Age recode (<60,60-69,70+)	75-79 years	Other	5630
Age recode (<60,60-69,70+)	80-84 years	Other	6182
Age recode (<60,60-69,70+)	85+ years	Other	8278
COD	Alive	Alive	228221
COD	Breast	Alive	0
COD	Other	Alive	0
COD	Alive	Breast	0
COD	Breast	Breast	38472
COD	Other	Breast	0
COD	Alive	Other	0
COD	Breast	Other	0
COD	Other	Other	36864
Radiation	No/Unknown	Alive	111019
Radiation	Yes	Alive	117202
Radiation	No/Unknown	Breast	25613
Radiation	Yes	Breast	12859
Radiation	No/Unknown	Other	25346
Radiation	Yes	Other	11518
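
The table above lists cross-tabulated counts rather than one association value per variable. A minimal sketch of how a single Cramér's V could be derived from such counts, using the standard formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The helper `cramers_v` is a hypothetical name introduced here for illustration; the counts are the Radiation rows copied from the table:

```r
# Cramér's V from a two-way contingency table
cramers_v <- function(tab) {
  tab <- as.matrix(tab)
  # Drop empty rows/columns so no expected count is zero
  tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
  k <- min(dim(tab)) - 1
  if (k < 1) return(NA_real_)  # undefined for a single row or column
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * k)))
}

# Radiation vs COD counts, copied from the table above
radiation_tab <- matrix(
  c(111019, 25613, 25346,
    117202, 12859, 11518),
  nrow = 2, byrow = TRUE,
  dimnames = list(Radiation = c("No/Unknown", "Yes"),
                  COD = c("Alive", "Breast", "Other"))
)

v_radiation <- cramers_v(radiation_tab)
print(v_radiation)
```

Applied to the full data, something like `sapply(categorical_data, function(x) cramers_v(table(x, BREAST_DF_surv_clean$COD)))` might give one value per variable, assuming the columns are factors or character vectors.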

# Melt Cramér's V results
cramer_v_melted <- melt(cramer_v_df, id.vars = "Variable", variable.name = "Var1", value.name = "value")

## Warning: attributes are not identical across measure variables; they will be
## dropped

# Plot as a bar graph
ggplot(cramer_v_melted, aes(x = Variable, y = value, fill = Var1)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_text(angle = 45, hjust = 1, vjust = 1)) + # Rotate y-axis labels by 45 degrees
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10)) + # Wrap labels with a width of 10 characters
  labs(x = "Variable", y = "Cramer's V", fill = "Variable") +
  ggtitle("Cramer's V for Association with COD")

# Since there are many factors and categorical variables, they need to be encoded.
# The following code handles the encoding.
# Step 1: Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean) == "COD")

# Step 2: Exclude "COD" column from model matrix
encoded_data <- model.matrix(~ . - 1, data = BREAST_DF_surv_clean[, -cod_column_index])

# Step 3: Select encoded variables and target variable
encoded_data <- cbind(encoded_data, COD = BREAST_DF_surv_clean$COD)

## Warning in base::cbind(...): number of rows of result is not a multiple of
## vector length (arg 2)
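
The warning above suggests that the encoded matrix and the COD vector have different lengths, most likely because `model.matrix()` drops rows with missing values by default, so COD is recycled instead of aligned row by row. A minimal sketch of one way to keep the rows aligned, on made-up toy data; the names `toy`, `predictors`, `keep`, and `encoded` are hypothetical:

```r
# Toy data with one missing predictor value (made-up, not SEER records)
toy <- data.frame(
  grade  = factor(c("I", "II", NA, "III", "II")),
  months = c(60, 40, 12, 8, 55),
  COD    = factor(c("Alive", "Alive", "Breast", "Other", "Alive"))
)

predictors <- toy[, names(toy) != "COD"]

# Rows that model.matrix() keeps under the default na.action (na.omit)
keep <- complete.cases(predictors)

# One-hot encode the predictors (no intercept column)
encoded <- model.matrix(~ . - 1, data = predictors)

# Attach the outcome for the same rows, as integer level codes
encoded <- cbind(encoded, COD = as.integer(toy$COD[keep]))
print(dim(encoded))
```

Subsetting COD with the same complete-case mask keeps every outcome value next to the predictors of the same patient.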

# Step 4: Calculate correlation matrix
correlation_matrix <- cor(encoded_data)

## Warning in cor(encoded_data): the standard deviation is zero
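
The zero-standard-deviation warning indicates that at least one encoded column is constant, which makes its correlations undefined (NA). A minimal sketch, on a made-up toy matrix, of dropping constant columns before calling `cor()`; the names `m`, `col_sd`, `m_var`, and `cor_m` are hypothetical:

```r
# Toy numeric matrix with one constant column (made-up values)
m <- cbind(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3), const = c(0, 0, 0, 0))

# Standard deviation per column; constant columns have sd == 0
col_sd <- apply(m, 2, sd, na.rm = TRUE)

# Keep only columns that actually vary
m_var <- m[, !is.na(col_sd) & col_sd > 0, drop = FALSE]

cor_m <- cor(m_var)
print(cor_m)
```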

# Step 5: Display summary statistics of the correlation matrix
summary_table <- summary(correlation_matrix)
summary_table_kable <- kable(summary_table)

# Step 6: Plot correlation matrix as a heatmap
library(corrplot)
corrplot(correlation_matrix, method = "color", tl.cex = 0.15, title = "Correlation Matrix")

# Display the summary table
summary_table_kable

`Race recode (W, B, AI, API)`American Indian/Alaska Native	`Race recode (W, B, AI, API)`Asian or Pacific Islander	`Race recode (W, B, AI, API)`Black	`Race recode (W, B, AI, API)`Unknown	`Race recode (W, B, AI, API)`White	`Primary Site - labeled`C50.1-Central portion of breast	`Primary Site - labeled`C50.2-Upper-inner quadrant of breast	`Primary Site - labeled`C50.3-Lower-inner quadrant of breast	`Primary Site - labeled`C50.4-Upper-outer quadrant of breast	`Primary Site - labeled`C50.5-Lower-outer quadrant of breast	`Primary Site - labeled`C50.6-Axillary tail of breast	`Primary Site - labeled`C50.8-Overlapping lesion of breast	`Primary Site - labeled`C50.9-Breast, NOS	`Grade Recode (thru 2017)`Poorly differentiated; Grade III	`Grade Recode (thru 2017)`Undifferentiated; anaplastic; Grade IV	`Grade Recode (thru 2017)`Unknown	`Grade Recode (thru 2017)`Well differentiated; Grade I	LateralityLeft - origin of primary	LateralityOnly one side - side unspecified	LateralityPaired site, but no information concerning laterality	LateralityRight - origin of primary	`Chemotherapy recode (yes, no/unk)`Yes	`Months from diagnosis to treatment`	`Reason no cancer-directed surgery`Not recommended	`Reason no cancer-directed surgery`Not recommended, contraindicated due to other cond; autopsy only (1973-2002)	`Reason no cancer-directed surgery`Recommended but not performed, patient refused	`Reason no cancer-directed surgery`Recommended but not performed, unknown reason	`Reason no cancer-directed surgery`Recommended, unknown if performed	`Reason no cancer-directed surgery`Surgery performed	`Reason no cancer-directed surgery`Unknown; death certificate; or autopsy only (2003+)	`Survival months flag`Complete dates are available and there are more than 0 days of survival	`Survival months flag`Incomplete dates are available and there cannot be zero days of follow-up	`Survival months flag`Incomplete dates are available and there could be zero days of follow-up	`Survival months flag`Not calculated because a Death Certificate Only or Autopsy Only case	`Survival months`	`First malignant primary indicator`Yes	`Total number of in situ/malignant tumors for patient`	`Total number of benign/borderline tumors for patient`	`Marital status at diagnosis`Married (including common law)	`Marital status at diagnosis`Separated	`Marital status at diagnosis`Single (never married)	`Marital status at diagnosis`Unknown	`Marital status at diagnosis`Unmarried or Domestic Partner	`Marital status at diagnosis`Widowed	`Median household income inflation adj to 2021`$40,000 - $44,999	`Median household income inflation adj to 2021`$45,000 - $49,999	`Median household income inflation adj to 2021`$50,000 - $54,999	`Median household income inflation adj to 2021`$55,000 - $59,999	`Median household income inflation adj to 2021`$60,000 - $64,999	`Median household income inflation adj to 2021`$65,000 - $69,999	`Median household income inflation adj to 2021`$70,000 - $74,999	`Median household income inflation adj to 2021`$75,000+	`Median household income inflation adj to 2021`< $35,000	`Median household income inflation adj to 2021`Unknown/missing/no match/Not 1990-2021	`Rural-Urban Continuum Code`Counties in metropolitan areas of 250,000 to 1 million pop	`Rural-Urban Continuum Code`Counties in metropolitan areas of lt 250 thousand pop	`Rural-Urban Continuum Code`Nonmetropolitan counties adjacent to a metropolitan area	`Rural-Urban Continuum Code`Nonmetropolitan counties not adjacent to a metropolitan area	`Rural-Urban Continuum Code`Unknown/missing/no match (Alaska or Hawaii - Entire State)	`Rural-Urban Continuum Code`Unknown/missing/no match/Not 1990-2021	`Age recode (<60,60-69,70+)`05-09 years	`Age recode (<60,60-69,70+)`10-14 years	`Age recode (<60,60-69,70+)`15-19 years	`Age recode (<60,60-69,70+)`20-24 years	`Age recode (<60,60-69,70+)`25-29 years	`Age recode (<60,60-69,70+)`30-34 years	`Age recode (<60,60-69,70+)`35-39 years	`Age recode (<60,60-69,70+)`40-44 years	`Age recode (<60,60-69,70+)`45-49 years	`Age recode (<60,60-69,70+)`50-54 years	`Age recode (<60,60-69,70+)`55-59 years	`Age recode (<60,60-69,70+)`60-64 years	`Age recode (<60,60-69,70+)`65-69 years	`Age recode (<60,60-69,70+)`70-74 years	`Age recode (<60,60-69,70+)`75-79 years	`Age recode (<60,60-69,70+)`80-84 years	`Age recode (<60,60-69,70+)`85+ years	RadiationYes	COD
Min. :-0.1576838	Min. :-0.6165221	Min. :-0.673943	Min. :-0.1334814	Min. :-0.6739426	Min. :-0.155876	Min. :-0.2615058	Min. :-0.169973	Min. :-0.382989	Min. :-0.199024	Min. :-0.0520480	Min. :-0.382989	Min. :-0.275063	Min. :-0.3318287	Min. :-0.0208938	Min. :-0.204073	Min. :-0.331829	Min. :-0.9935062	Min. :-0.0366469	Min. :-0.1575071	Min. :-0.9935062	Min. :-0.284581	Min. :-0.071647	Min. :-0.8574575	Min. :-0.2202216	Min. :-0.248924	Min. :-0.1439998	Min. :-0.306675	Min. :-0.857457	Min. :-0.0598241	Min. :-0.9942023	Min. :-0.994202	Min. :-0.0557427	Min. :1	Min. :-0.277625	Min. :-0.691429	Min. :-0.6914293	Min. :-0.0159064	Min. :-0.449883	Min. :-0.1121800	Min. :-0.4498831	Min. :-0.260209	Min. :-0.0640663	Min. :-0.433220	Min. :-0.1387420	Min. :-0.1694547	Min. :-0.1808955	Min. :-0.2237456	Min. :-0.302021	Min. :-0.307945	Min. :-0.2570005	Min. :-0.307945	Min. :-0.070190	Min. :-0.0098790	Min. :-0.144158	Min. :-0.1728759	Min. :-0.154232	Min. :-0.154618	Min. :-0.0693888	Min. :-0.0098790	Min. :-0.0028688	Min. :-0.0051977	Min. :-0.0071789	Min. :-0.0193592	Min. :-0.027421	Min. :-0.047111	Min. :-0.068738	Min. :-0.100748	Min. :-0.128046	Min. :-0.1445947	Min. :-0.1504952	Min. :-0.1593788	Min. :-0.159379	Min. :-0.138078	Min. :-0.133971	Min. :-0.150245	Min. :-0.1873481	Min. :-0.132353	Min. :-0.0121551
1st Qu.:-0.0020006	1st Qu.:-0.0202167	1st Qu.:-0.012099	1st Qu.:-0.0058338	1st Qu.:-0.0189862	1st Qu.:-0.010531	1st Qu.:-0.0055591	1st Qu.:-0.005036	1st Qu.:-0.007592	1st Qu.:-0.004752	1st Qu.:-0.0033956	1st Qu.:-0.005180	1st Qu.:-0.012712	1st Qu.:-0.0102566	1st Qu.:-0.0020484	1st Qu.:-0.005608	1st Qu.:-0.011881	1st Qu.:-0.0025589	1st Qu.:-0.0026654	1st Qu.:-0.0054651	1st Qu.:-0.0025611	1st Qu.:-0.008157	1st Qu.:-0.009361	1st Qu.:-0.0113926	1st Qu.:-0.0066137	1st Qu.:-0.005443	1st Qu.:-0.0023943	1st Qu.:-0.008606	1st Qu.:-0.012623	1st Qu.:-0.0016251	1st Qu.:-0.0036413	1st Qu.:-0.008391	1st Qu.:-0.0015205	1st Qu.:1	1st Qu.:-0.017262	1st Qu.:-0.008279	1st Qu.:-0.0105455	1st Qu.:-0.0024496	1st Qu.:-0.018276	1st Qu.:-0.0034648	1st Qu.:-0.0153153	1st Qu.:-0.005487	1st Qu.:-0.0034380	1st Qu.:-0.025953	1st Qu.:-0.0065187	1st Qu.:-0.0075767	1st Qu.:-0.0064648	1st Qu.:-0.0062737	1st Qu.:-0.007336	1st Qu.:-0.012559	1st Qu.:-0.0049280	1st Qu.:-0.014613	1st Qu.:-0.004756	1st Qu.:-0.0027459	1st Qu.:-0.006427	1st Qu.:-0.0063794	1st Qu.:-0.008253	1st Qu.:-0.008279	1st Qu.:-0.0042149	1st Qu.:-0.0027459	1st Qu.:-0.0008929	1st Qu.:-0.0008929	1st Qu.:-0.0018815	1st Qu.:-0.0038172	1st Qu.:-0.005673	1st Qu.:-0.008648	1st Qu.:-0.008935	1st Qu.:-0.012379	1st Qu.:-0.014009	1st Qu.:-0.0100840	1st Qu.:-0.0062954	1st Qu.:-0.0035576	1st Qu.:-0.013136	1st Qu.:-0.014378	1st Qu.:-0.019360	1st Qu.:-0.023319	1st Qu.:-0.0243034	1st Qu.:-0.019630	1st Qu.:-0.0011461
Median : 0.0005336	Median :-0.0033835	Median : 0.001860	Median :-0.0006944	Median :-0.0007078	Median :-0.001262	Median :-0.0028889	Median :-0.001448	Median :-0.001242	Median :-0.002111	Median :-0.0003168	Median :-0.001220	Median : 0.000506	Median :-0.0009646	Median :-0.0002264	Median : 0.002904	Median :-0.001414	Median :-0.0004302	Median :-0.0004554	Median :-0.0002511	Median : 0.0002321	Median : 0.003513	Median :-0.002447	Median :-0.0019345	Median :-0.0007267	Median :-0.001201	Median :-0.0000559	Median :-0.001012	Median : 0.000035	Median :-0.0002439	Median : 0.0016351	Median :-0.002005	Median :-0.0003634	Median :1	Median :-0.002944	Median : 0.001289	Median :-0.0017191	Median :-0.0003627	Median :-0.005281	Median :-0.0003402	Median :-0.0007771	Median :-0.001318	Median :-0.0004321	Median :-0.001559	Median :-0.0005868	Median :-0.0009248	Median :-0.0007155	Median :-0.0002672	Median :-0.001239	Median :-0.001800	Median :-0.0016338	Median :-0.003181	Median :-0.001151	Median :-0.0008847	Median :-0.000154	Median :-0.0008082	Median :-0.001274	Median :-0.001003	Median :-0.0004993	Median :-0.0008847	Median :-0.0002912	Median :-0.0003251	Median :-0.0005455	Median :-0.0008537	Median :-0.001546	Median :-0.000829	Median :-0.001997	Median :-0.003132	Median :-0.002266	Median :-0.0018848	Median :-0.0003493	Median :-0.0004631	Median :-0.002854	Median :-0.002379	Median :-0.002270	Median :-0.002061	Median :-0.0020631	Median :-0.002587	Median : 0.0003216
Mean : 0.0169704	Mean :-0.0009204	Mean : 0.005789	Mean : 0.0107475	Mean :-0.0088807	Mean : 0.005178	Mean : 0.0001063	Mean : 0.004543	Mean :-0.006328	Mean : 0.002687	Mean : 0.0103478	Mean :-0.003984	Mean : 0.003542	Mean : 0.0100118	Mean : 0.0125342	Mean : 0.012084	Mean : 0.001227	Mean :-0.0009544	Mean : 0.0126757	Mean : 0.0127019	Mean :-0.0007810	Mean : 0.016328	Mean : 0.008602	Mean : 0.0009881	Mean : 0.0085826	Mean : 0.008312	Mean : 0.0118276	Mean : 0.008002	Mean :-0.010200	Mean : 0.0125725	Mean : 0.0006063	Mean :-0.001404	Mean : 0.0115660	Mean :1	Mean : 0.006655	Mean : 0.007717	Mean :-0.0004392	Mean : 0.0127352	Mean :-0.005335	Mean : 0.0102429	Mean : 0.0056400	Mean : 0.008018	Mean : 0.0110086	Mean : 0.001435	Mean : 0.0110037	Mean : 0.0105688	Mean : 0.0091652	Mean : 0.0060635	Mean :-0.002248	Mean :-0.004613	Mean :-0.0003216	Mean :-0.014703	Mean : 0.013485	Mean : 0.0250915	Mean : 0.008380	Mean : 0.0112452	Mean : 0.011215	Mean : 0.012771	Mean : 0.0167936	Mean : 0.0250915	Mean : 0.0128817	Mean : 0.0127117	Mean : 0.0128602	Mean : 0.0121784	Mean : 0.010861	Mean : 0.009444	Mean : 0.007175	Mean : 0.003054	Mean : 0.000216	Mean :-0.0007652	Mean :-0.0011515	Mean :-0.0022937	Mean :-0.004167	Mean :-0.002999	Mean :-0.001605	Mean :-0.000504	Mean :-0.0000585	Mean : 0.009374	Mean : 0.0132475
3rd Qu.: 0.0040061	3rd Qu.: 0.0070447	3rd Qu.: 0.016525	3rd Qu.: 0.0047715	3rd Qu.: 0.0125200	3rd Qu.: 0.005010	3rd Qu.: 0.0011539	3rd Qu.: 0.004086	3rd Qu.: 0.004650	3rd Qu.: 0.001679	3rd Qu.: 0.0022560	3rd Qu.: 0.002721	3rd Qu.: 0.012246	3rd Qu.: 0.0092260	3rd Qu.: 0.0026775	3rd Qu.: 0.009980	3rd Qu.: 0.007513	3rd Qu.: 0.0014938	3rd Qu.: 0.0015396	3rd Qu.: 0.0030307	3rd Qu.: 0.0022817	3rd Qu.: 0.022748	3rd Qu.: 0.005722	3rd Qu.: 0.0066273	3rd Qu.: 0.0027397	3rd Qu.: 0.002368	3rd Qu.: 0.0043855	3rd Qu.: 0.004974	3rd Qu.: 0.009365	3rd Qu.: 0.0016627	3rd Qu.: 0.0095196	3rd Qu.: 0.002287	3rd Qu.: 0.0003999	3rd Qu.:1	3rd Qu.: 0.012122	3rd Qu.: 0.010252	3rd Qu.: 0.0075641	3rd Qu.: 0.0015427	3rd Qu.: 0.010096	3rd Qu.: 0.0030447	3rd Qu.: 0.0166287	3rd Qu.: 0.007431	3rd Qu.: 0.0018987	3rd Qu.: 0.011320	3rd Qu.: 0.0031320	3rd Qu.: 0.0035984	3rd Qu.: 0.0053395	3rd Qu.: 0.0042906	3rd Qu.: 0.002973	3rd Qu.: 0.003492	3rd Qu.: 0.0032218	3rd Qu.: 0.004352	3rd Qu.: 0.004585	3rd Qu.:-0.0000462	3rd Qu.: 0.003579	3rd Qu.: 0.0059771	3rd Qu.: 0.005613	3rd Qu.: 0.005251	3rd Qu.: 0.0020642	3rd Qu.:-0.0000462	3rd Qu.:-0.0000228	3rd Qu.:-0.0000348	3rd Qu.: 0.0004444	3rd Qu.: 0.0016579	3rd Qu.: 0.002306	3rd Qu.: 0.003628	3rd Qu.: 0.004160	3rd Qu.: 0.003860	3rd Qu.: 0.003601	3rd Qu.: 0.0033621	3rd Qu.: 0.0021982	3rd Qu.: 0.0038639	3rd Qu.: 0.003548	3rd Qu.: 0.003082	3rd Qu.: 0.002808	3rd Qu.: 0.003460	3rd Qu.: 0.0041037	3rd Qu.: 0.007550	3rd Qu.: 0.0019772
Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. :1	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.000000	Max. : 1.0000000	Max. : 1.000000	Max. : 1.0000000
NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :78	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :1
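
The NA's row in the summary above follows from how cor() treats missing values by default: a column that contains any NA produces NA for every pair it takes part in, leaving only its diagonal entry. The sketch below illustrates this on a small made-up data frame (the names toy, a, b and c are hypothetical and not part of the SEER data) and shows pairwise deletion as one possible alternative. It is an illustration under these assumptions, not the computation used in this report.

```r
# Toy data (hypothetical): column `b` has one missing value
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, NA, 6, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

# Default (use = "everything"): every pair involving `b` is NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses only the rows where both columns are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_default, 3))
print(round(cor_pairwise, 3))
```

With pairwise deletion, each correlation may be based on a different subset of rows, so the resulting matrix should be read with that in mind.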

# Extract correlation with COD
correlation_with_COD <- correlation_matrix[, "COD"]

# Convert correlation_with_COD to a data frame with column names
correlation_df <- data.frame(variable = names(correlation_with_COD), correlation = correlation_with_COD)

# Sort correlation values
correlation_df <- correlation_df[order(correlation_df$correlation, decreasing = TRUE), ]

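# Create bar plot using ggplot2; reorder() sorts the bars by correlation,
# since ggplot2 otherwise orders a character x-axis alphabetically
ggplot(correlation_df, aes(x = reorder(variable, correlation), y = correlation)) +
  geom_bar(stat = "identity") +
  labs(title = "Correlation with COD", x = "Variables", y = "Correlation")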

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
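
The warning most likely refers to the one variable whose correlation with COD is NA. A minimal sketch of one way to tidy the plot input is given below, using made-up correlation values (toy_cor and its names are hypothetical stand-ins for correlation_with_COD). It drops undefined correlations and the outcome's correlation with itself, orders the bars by value, and flips the axes so that long variable names stay readable.

```r
library(ggplot2)

# Hypothetical correlations standing in for correlation_with_COD
toy_cor <- c(age = 0.30, grade = -0.12, stage = 0.45, site = NA, COD = 1)

plot_df <- data.frame(variable = names(toy_cor), correlation = unname(toy_cor))

# Drop the outcome's self-correlation and any undefined (NA) correlations
plot_df <- plot_df[plot_df$variable != "COD" & !is.na(plot_df$correlation), ]

# reorder() sets the factor levels by correlation, which controls the bar order
p <- ggplot(plot_df, aes(x = reorder(variable, correlation), y = correlation)) +
  geom_col() +
  coord_flip() +
  labs(title = "Correlation with COD", x = "Variables", y = "Correlation")

print(p)
```

Filtering before plotting makes it explicit which rows are excluded, instead of leaving the removal to ggplot2.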

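Dummy encoding expands each factor into several indicator columns, so a column position taken from the original data frame does not point to the same variable in the encoded matrix. The sketch below shows this on a toy data frame, using base R's model.matrix() as a stand-in for caret's dummyVars() (an assumption made to keep the example free of dependencies; the column names are hypothetical).

```r
# Toy data (hypothetical columns, not the SEER variables)
toy <- data.frame(
  race  = factor(c("A", "B", "C", "A")),
  COD   = c(0, 1, 0, 1),
  grade = factor(c("I", "II", "III", "I"))
)

# Position of COD in the original frame (2 here)
cod_index <- which(names(toy) == "COD")

# Full-rank dummy coding of the predictors only; [, -1] drops the intercept column
encoded <- model.matrix(~ ., data = toy[, -cod_index])[, -1]

# Four indicator columns: two for race, two for grade; COD is not among them
print(colnames(encoded))

# The column at COD's original position is a race indicator, not COD
print(colnames(encoded)[cod_index])
```

Selecting columns by name in the encoded matrix avoids relying on positions that shift once factors are expanded.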
# Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean) == "COD")

# Exclude "COD" column from model matrix and encode factors
encoded_data <- predict(dummyVars(" ~ .", data = BREAST_DF_surv_clean[, -cod_column_index], fullRank = TRUE), newdata = BREAST_DF_surv_clean)



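# Remove the "COD" column from encoded_data if present, matching by name
# (positions in the encoded matrix differ from those in the original data frame)
encoded_data <- encoded_data[, colnames(encoded_data) != "COD", drop = FALSE]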

# Add "COD" column back to encoded_data
encoded_data <- cbind(encoded_data, COD = BREAST_DF_surv_clean$COD)

# Calculate correlation matrix
correlation_matrix <- cor(encoded_data)

# Extract correlation with COD
correlation_with_COD <- correlation_matrix[, "COD"]

# Summary of correlation matrix
summary(correlation_matrix)

##  `Race recode (W, B, AI, API)`Asian or Pacific Islander
##  Min.   :-0.611482                                     
##  1st Qu.:-0.020636                                     
##  Median :-0.004543                                     
##  Mean   :-0.001203                                     
##  3rd Qu.: 0.006915                                     
##  Max.   : 1.000000                                     
##  NA's   :3                                             
##  `Race recode (W, B, AI, API)`Black `Race recode (W, B, AI, API)`Unknown
##  Min.   :-0.672898                  Min.   :-0.151550                   
##  1st Qu.:-0.008186                  1st Qu.:-0.008196                   
##  Median : 0.002961                  Median :-0.001720                   
##  Mean   : 0.007238                  Mean   : 0.011140                   
##  3rd Qu.: 0.015977                  3rd Qu.: 0.004775                   
##  Max.   : 1.000000                  Max.   : 1.000000                   
##  NA's   :3                          NA's   :3                           
##  `Race recode (W, B, AI, API)`White
##  Min.   :-0.672898                 
##  1st Qu.:-0.021506                 
##  Median :-0.001173                 
##  Mean   :-0.007624                 
##  3rd Qu.: 0.013379                 
##  Max.   : 1.000000                 
##  NA's   :3                         
##  `Primary Site - labeled`C50.1-Central portion of breast
##  Min.   :-0.152121                                      
##  1st Qu.:-0.009934                                      
##  Median :-0.002113                                      
##  Mean   : 0.005421                                      
##  3rd Qu.: 0.006062                                      
##  Max.   : 1.000000                                      
##  NA's   :3                                              
##  `Primary Site - labeled`C50.2-Upper-inner quadrant of breast
##  Min.   :-0.253678                                           
##  1st Qu.:-0.009990                                           
##  Median :-0.003675                                           
##  Mean   :-0.001860                                           
##  3rd Qu.: 0.000150                                           
##  Max.   : 1.000000                                           
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.3-Lower-inner quadrant of breast
##  Min.   :-0.165071                                           
##  1st Qu.:-0.008009                                           
##  Median :-0.002641                                           
##  Mean   : 0.003667                                           
##  3rd Qu.: 0.003119                                           
##  Max.   : 1.000000                                           
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.4-Upper-outer quadrant of breast
##  Min.   :-0.372542                                           
##  1st Qu.:-0.015158                                           
##  Median :-0.001181                                           
##  Mean   :-0.008656                                           
##  3rd Qu.: 0.005932                                           
##  Max.   : 1.000000                                           
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.5-Lower-outer quadrant of breast
##  Min.   :-0.1930083                                          
##  1st Qu.:-0.0068032                                          
##  Median :-0.0026344                                          
##  Mean   : 0.0017748                                          
##  3rd Qu.: 0.0000719                                          
##  Max.   : 1.0000000                                          
##  NA's   :3                                                   
##  `Primary Site - labeled`C50.6-Axillary tail of breast
##  Min.   :-0.0516638                                   
##  1st Qu.:-0.0032916                                   
##  Median :-0.0002025                                   
##  Mean   : 0.0108061                                   
##  3rd Qu.: 0.0025517                                   
##  Max.   : 1.0000000                                   
##  NA's   :3                                            
##  `Primary Site - labeled`C50.8-Overlapping lesion of breast
##  Min.   :-0.372542                                         
##  1st Qu.:-0.005557                                         
##  Median :-0.001620                                         
##  Mean   :-0.005418                                         
##  3rd Qu.: 0.001945                                         
##  Max.   : 1.000000                                         
##  NA's   :3                                                 
##  `Primary Site - labeled`C50.9-Breast, NOS
##  Min.   :-0.290700                        
##  1st Qu.:-0.014812                        
##  Median : 0.002148                        
##  Mean   : 0.010813                        
##  3rd Qu.: 0.017795                        
##  Max.   : 1.000000                        
##  NA's   :3                                
##  `Grade Recode (thru 2017)`Poorly differentiated; Grade III
##  Min.   :-0.3220663                                        
##  1st Qu.:-0.0132914                                        
##  Median :-0.0008937                                        
##  Mean   : 0.0111492                                        
##  3rd Qu.: 0.0112681                                        
##  Max.   : 1.0000000                                        
##  NA's   :3                                                 
##  `Grade Recode (thru 2017)`Undifferentiated; anaplastic; Grade IV
##  Min.   :-0.0210283                                              
##  1st Qu.:-0.0017196                                              
##  Median :-0.0004927                                              
##  Mean   : 0.0132802                                              
##  3rd Qu.: 0.0023513                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Grade Recode (thru 2017)`Unknown
##  Min.   :-0.269170                
##  1st Qu.:-0.009920                
##  Median : 0.003273                
##  Mean   : 0.019721                
##  3rd Qu.: 0.024205                
##  Max.   : 1.000000                
##  NA's   :3                        
##  `Grade Recode (thru 2017)`Well differentiated; Grade I
##  Min.   :-0.322066                                     
##  1st Qu.:-0.017850                                     
##  Median :-0.005277                                     
##  Mean   :-0.001697                                     
##  3rd Qu.: 0.007549                                     
##  Max.   : 1.000000                                     
##  NA's   :3                                             
##  Laterality.Only one side - side unspecified
##  Min.   :-0.0505764                         
##  1st Qu.:-0.0025761                         
##  Median :-0.0002076                         
##  Mean   : 0.0149982                         
##  3rd Qu.: 0.0036708                         
##  Max.   : 1.0000000                         
##  NA's   :3                                  
##  Laterality.Paired site, but no information concerning laterality
##  Min.   :-0.306824                                               
##  1st Qu.:-0.013601                                               
##  Median :-0.001130                                               
##  Mean   : 0.026582                                               
##  3rd Qu.: 0.007532                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  Laterality.Right - origin of primary `Chemotherapy recode (yes, no/unk)`Yes
##  Min.   :-0.0997361                   Min.   :-0.266711                     
##  1st Qu.:-0.0033810                   1st Qu.:-0.022998                     
##  Median :-0.0004293                   Median : 0.003276                     
##  Mean   : 0.0100453                   Mean   : 0.014559                     
##  3rd Qu.: 0.0021401                   3rd Qu.: 0.022243                     
##  Max.   : 1.0000000                   Max.   : 1.000000                     
##  NA's   :3                            NA's   :3                             
##  `Months from diagnosis to treatment`
##  Min.   :1                           
##  1st Qu.:1                           
##  Median :1                           
##  Mean   :1                           
##  3rd Qu.:1                           
##  Max.   :1                           
##  NA's   :76                          
##  `Reason no cancer-directed surgery`Not recommended
##  Min.   :-0.812290                                 
##  1st Qu.:-0.020352                                 
##  Median :-0.004180                                 
##  Mean   : 0.006545                                 
##  3rd Qu.: 0.008143                                 
##  Max.   : 1.000000                                 
##  NA's   :3                                         
##  `Reason no cancer-directed surgery`Not recommended, contraindicated due to other cond; autopsy only (1973-2002)
##  Min.   :-0.189154                                                                                              
##  1st Qu.:-0.006970                                                                                              
##  Median :-0.001004                                                                                              
##  Mean   : 0.011121                                                                                              
##  3rd Qu.: 0.002799                                                                                              
##  Max.   : 1.000000                                                                                              
##  NA's   :3                                                                                                      
##  `Reason no cancer-directed surgery`Recommended but not performed, patient refused
##  Min.   :-0.262869                                                                
##  1st Qu.:-0.008973                                                                
##  Median :-0.001474                                                                
##  Mean   : 0.009488                                                                
##  3rd Qu.: 0.002102                                                                
##  Max.   : 1.000000                                                                
##  NA's   :3                                                                        
##  `Reason no cancer-directed surgery`Recommended but not performed, unknown reason
##  Min.   :-0.205810                                                               
##  1st Qu.:-0.006578                                                               
##  Median :-0.002151                                                               
##  Mean   : 0.013125                                                               
##  3rd Qu.: 0.006299                                                               
##  Max.   : 1.000000                                                               
##  NA's   :3                                                                       
##  `Reason no cancer-directed surgery`Recommended, unknown if performed
##  Min.   :-0.2649458                                                  
##  1st Qu.:-0.0085961                                                  
##  Median :-0.0009167                                                  
##  Mean   : 0.0086230                                                  
##  3rd Qu.: 0.0060008                                                  
##  Max.   : 1.0000000                                                  
##  NA's   :3                                                           
##  `Reason no cancer-directed surgery`Surgery performed
##  Min.   :-0.8122899                                  
##  1st Qu.:-0.0354131                                  
##  Median : 0.0005905                                  
##  Mean   :-0.0229615                                  
##  3rd Qu.: 0.0230988                                  
##  Max.   : 1.0000000                                  
##  NA's   :3                                           
##  `Reason no cancer-directed surgery`Unknown; death certificate; or autopsy only (2003+)
##  Min.   :-0.316222                                                                     
##  1st Qu.:-0.013675                                                                     
##  Median :-0.002965                                                                     
##  Mean   : 0.025584                                                                     
##  3rd Qu.: 0.008967                                                                     
##  Max.   : 1.000000                                                                     
##  NA's   :3                                                                             
##  `Survival months flag`Complete dates are available and there are more than 0 days of survival
##  Min.   :-0.883947                                                                            
##  1st Qu.:-0.012528                                                                            
##  Median : 0.005244                                                                            
##  Mean   :-0.013255                                                                            
##  3rd Qu.: 0.017204                                                                            
##  Max.   : 1.000000                                                                            
##  NA's   :3                                                                                    
##  `Survival months flag`Incomplete dates are available and there cannot be zero days of follow-up
##  Min.   :-0.883947                                                                              
##  1st Qu.:-0.014050                                                                              
##  Median :-0.002561                                                                              
##  Mean   : 0.001254                                                                              
##  3rd Qu.: 0.004816                                                                              
##  Max.   : 1.000000                                                                              
##  NA's   :3                                                                                      
##  `Survival months flag`Incomplete dates are available and there could be zero days of follow-up
##  Min.   :-0.124874                                                                             
##  1st Qu.:-0.003584                                                                             
##  Median :-0.001239                                                                             
##  Mean   : 0.012886                                                                             
##  3rd Qu.: 0.002214                                                                             
##  Max.   : 1.000000                                                                             
##  NA's   :3                                                                                     
##  `Survival months flag`Not calculated because a Death Certificate Only or Autopsy Only case
##  Min.   :-0.386749                                                                         
##  1st Qu.:-0.014145                                                                         
##  Median :-0.003250                                                                         
##  Mean   : 0.025055                                                                         
##  3rd Qu.: 0.001901                                                                         
##  Max.   : 1.000000                                                                         
##  NA's   :3                                                                                 
##  `Survival months` `First malignant primary indicator`Yes
##  Min.   :1         Min.   :-0.121335                     
##  1st Qu.:1         1st Qu.:-0.009865                     
##  Median :1         Median : 0.001646                     
##  Mean   :1         Mean   : 0.015425                     
##  3rd Qu.:1         3rd Qu.: 0.013781                     
##  Max.   :1         Max.   : 1.000000                     
##  NA's   :76        NA's   :3                             
##  `Total number of in situ/malignant tumors for patient`
##  Min.   :1                                             
##  1st Qu.:1                                             
##  Median :1                                             
##  Mean   :1                                             
##  3rd Qu.:1                                             
##  Max.   :1                                             
##  NA's   :76                                            
##  `Total number of benign/borderline tumors for patient`
##  Min.   :-0.0152234                                    
##  1st Qu.:-0.0032617                                    
##  Median :-0.0005528                                    
##  Mean   : 0.0130747                                    
##  3rd Qu.: 0.0019016                                    
##  Max.   : 1.0000000                                    
##  NA's   :3                                             
##  `Marital status at diagnosis`Married (including common law)
##  Min.   :-0.440177                                          
##  1st Qu.:-0.032660                                          
##  Median :-0.006210                                          
##  Mean   :-0.009718                                          
##  3rd Qu.: 0.013420                                          
##  Max.   : 1.000000                                          
##  NA's   :3                                                  
##  `Marital status at diagnosis`Separated
##  Min.   :-0.109798                     
##  1st Qu.:-0.004177                     
##  Median :-0.000266                     
##  Mean   : 0.011281                     
##  3rd Qu.: 0.003666                     
##  Max.   : 1.000000                     
##  NA's   :3                             
##  `Marital status at diagnosis`Single (never married)
##  Min.   :-0.440177                                  
##  1st Qu.:-0.014297                                  
##  Median :-0.002190                                  
##  Mean   : 0.005418                                  
##  3rd Qu.: 0.017116                                  
##  Max.   : 1.000000                                  
##  NA's   :3                                          
##  `Marital status at diagnosis`Unknown
##  Min.   :-0.269781                   
##  1st Qu.:-0.009316                   
##  Median :-0.001311                   
##  Mean   : 0.009429                   
##  3rd Qu.: 0.006888                   
##  Max.   : 1.000000                   
##  NA's   :3                           
##  `Marital status at diagnosis`Unmarried or Domestic Partner
##  Min.   :-0.061342                                         
##  1st Qu.:-0.004189                                         
##  Median :-0.000586                                         
##  Mean   : 0.011305                                         
##  3rd Qu.: 0.001826                                         
##  Max.   : 1.000000                                         
##  NA's   :3                                                 
##  `Marital status at diagnosis`Widowed
##  Min.   :-0.432734                   
##  1st Qu.:-0.027195                   
##  Median :-0.001960                   
##  Mean   : 0.007008                   
##  3rd Qu.: 0.018651                   
##  Max.   : 1.000000                   
##  NA's   :3                           
##  `Median household income inflation adj to 2021`$40,000 - $44,999
##  Min.   :-0.1382091                                              
##  1st Qu.:-0.0056870                                              
##  Median :-0.0007594                                              
##  Mean   : 0.0128226                                              
##  3rd Qu.: 0.0041608                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$45,000 - $49,999
##  Min.   :-0.1682857                                              
##  1st Qu.:-0.0069108                                              
##  Median :-0.0009922                                              
##  Mean   : 0.0119225                                              
##  3rd Qu.: 0.0053678                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$50,000 - $54,999
##  Min.   :-0.1791432                                              
##  1st Qu.:-0.0069663                                              
##  Median :-0.0009649                                              
##  Mean   : 0.0100901                                              
##  3rd Qu.: 0.0055250                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$55,000 - $59,999
##  Min.   :-0.221090                                               
##  1st Qu.:-0.005620                                               
##  Median :-0.000249                                               
##  Mean   : 0.006952                                               
##  3rd Qu.: 0.005061                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$60,000 - $64,999
##  Min.   :-0.302908                                               
##  1st Qu.:-0.008606                                               
##  Median :-0.001320                                               
##  Mean   :-0.002785                                               
##  3rd Qu.: 0.003066                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$65,000 - $69,999
##  Min.   :-0.308737                                               
##  1st Qu.:-0.012660                                               
##  Median :-0.001992                                               
##  Mean   :-0.005016                                               
##  3rd Qu.: 0.004584                                               
##  Max.   : 1.000000                                               
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$70,000 - $74,999
##  Min.   :-0.2538036                                              
##  1st Qu.:-0.0053899                                              
##  Median :-0.0016729                                              
##  Mean   :-0.0009235                                              
##  3rd Qu.: 0.0020979                                              
##  Max.   : 1.0000000                                              
##  NA's   :3                                                       
##  `Median household income inflation adj to 2021`$75,000+
##  Min.   :-0.308737                                      
##  1st Qu.:-0.020431                                      
##  Median :-0.005049                                      
##  Mean   :-0.016574                                      
##  3rd Qu.: 0.003574                                      
##  Max.   : 1.000000                                      
##  NA's   :3                                              
##  `Median household income inflation adj to 2021`< $35,000
##  Min.   :-0.070337                                       
##  1st Qu.:-0.005079                                       
##  Median :-0.001380                                       
##  Mean   : 0.014739                                       
##  3rd Qu.: 0.005036                                       
##  Max.   : 1.000000                                       
##  NA's   :3                                               
##  `Median household income inflation adj to 2021`Unknown/missing/no match/Not 1990-2021
##  Min.   :-0.0127447                                                                   
##  1st Qu.:-0.0034638                                                                   
##  Median :-0.0012849                                                                   
##  Mean   : 0.0267364                                                                   
##  3rd Qu.: 0.0009734                                                                   
##  Max.   : 1.0000000                                                                   
##  NA's   :3                                                                            
##  `Rural-Urban Continuum Code`Counties in metropolitan areas of 250,000 to 1 million pop
##  Min.   :-0.1432295                                                                    
##  1st Qu.:-0.0074650                                                                    
##  Median :-0.0002018                                                                    
##  Mean   : 0.0091835                                                                    
##  3rd Qu.: 0.0038568                                                                    
##  Max.   : 1.0000000                                                                    
##  NA's   :3                                                                             
##  `Rural-Urban Continuum Code`Counties in metropolitan areas of lt 250 thousand pop
##  Min.   :-0.1717685                                                               
##  1st Qu.:-0.0050985                                                               
##  Median :-0.0001767                                                               
##  Mean   : 0.0126107                                                               
##  3rd Qu.: 0.0058588                                                               
##  Max.   : 1.0000000                                                               
##  NA's   :3                                                                        
##  `Rural-Urban Continuum Code`Nonmetropolitan counties adjacent to a metropolitan area
##  Min.   :-0.1531643                                                                  
##  1st Qu.:-0.0064625                                                                  
##  Median :-0.0008947                                                                  
##  Mean   : 0.0127534                                                                  
##  3rd Qu.: 0.0065129                                                                  
##  Max.   : 1.0000000                                                                  
##  NA's   :3                                                                           
##  `Rural-Urban Continuum Code`Nonmetropolitan counties not adjacent to a metropolitan area
##  Min.   :-0.1543301                                                                      
##  1st Qu.:-0.0077561                                                                      
##  Median :-0.0005505                                                                      
##  Mean   : 0.0142939                                                                      
##  3rd Qu.: 0.0078590                                                                      
##  Max.   : 1.0000000                                                                      
##  NA's   :3                                                                               
##  `Rural-Urban Continuum Code`Unknown/missing/no match (Alaska or Hawaii - Entire State)
##  Min.   :-0.0678178                                                                    
##  1st Qu.:-0.0038102                                                                    
##  Median :-0.0002637                                                                    
##  Mean   : 0.0120203                                                                    
##  3rd Qu.: 0.0021074                                                                    
##  Max.   : 1.0000000                                                                    
##  NA's   :3                                                                             
##  `Rural-Urban Continuum Code`Unknown/missing/no match/Not 1990-2021
##  Min.   :-0.0127447                                                
##  1st Qu.:-0.0034638                                                
##  Median :-0.0012849                                                
##  Mean   : 0.0267364                                                
##  3rd Qu.: 0.0009734                                                
##  Max.   : 1.0000000                                                
##  NA's   :3                                                         
##  `Age recode (<60,60-69,70+)`05-09 years
##  Min.   :-0.0027197                     
##  1st Qu.:-0.0008100                     
##  Median :-0.0002830                     
##  Mean   : 0.0136204                     
##  3rd Qu.:-0.0000415                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`10-14 years
##  Min.   :-0.0050171                     
##  1st Qu.:-0.0008100                     
##  Median :-0.0003417                     
##  Mean   : 0.0134044                     
##  3rd Qu.:-0.0000665                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`15-19 years
##  Min.   :-0.0070476                     
##  1st Qu.:-0.0018281                     
##  Median :-0.0006534                     
##  Mean   : 0.0134773                     
##  3rd Qu.: 0.0000811                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`20-24 years
##  Min.   :-0.018979                      
##  1st Qu.:-0.003199                      
##  Median :-0.001403                      
##  Mean   : 0.012949                      
##  3rd Qu.: 0.001502                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`25-29 years
##  Min.   :-0.027433                      
##  1st Qu.:-0.005254                      
##  Median :-0.001847                      
##  Mean   : 0.011652                      
##  3rd Qu.: 0.002353                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`30-34 years
##  Min.   :-0.046406                      
##  1st Qu.:-0.007629                      
##  Median :-0.001896                      
##  Mean   : 0.009909                      
##  3rd Qu.: 0.002237                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`35-39 years
##  Min.   :-0.067571                      
##  1st Qu.:-0.009978                      
##  Median :-0.002013                      
##  Mean   : 0.007331                      
##  3rd Qu.: 0.004198                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`40-44 years
##  Min.   :-0.098876                      
##  1st Qu.:-0.016888                      
##  Median :-0.005596                      
##  Mean   : 0.002280                      
##  3rd Qu.: 0.003749                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`45-49 years
##  Min.   :-0.125622                      
##  1st Qu.:-0.017717                      
##  Median :-0.006664                      
##  Mean   :-0.001411                      
##  3rd Qu.: 0.004519                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`50-54 years
##  Min.   :-0.141961                      
##  1st Qu.:-0.016959                      
##  Median :-0.003925                      
##  Mean   :-0.002563                      
##  3rd Qu.: 0.003557                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`55-59 years
##  Min.   :-0.148041                      
##  1st Qu.:-0.015581                      
##  Median :-0.001369                      
##  Mean   :-0.003108                      
##  3rd Qu.: 0.002133                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`60-64 years
##  Min.   :-0.1569887                     
##  1st Qu.:-0.0139543                     
##  Median :-0.0007929                     
##  Mean   :-0.0045549                     
##  3rd Qu.: 0.0037404                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`65-69 years
##  Min.   :-0.156989                      
##  1st Qu.:-0.018456                      
##  Median :-0.004416                      
##  Mean   :-0.006050                      
##  3rd Qu.: 0.002814                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`70-74 years
##  Min.   :-0.136706                      
##  1st Qu.:-0.018582                      
##  Median :-0.002493                      
##  Mean   :-0.003747                      
##  3rd Qu.: 0.004444                      
##  Max.   : 1.000000                      
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`75-79 years
##  Min.   :-0.1299541                     
##  1st Qu.:-0.0192648                     
##  Median :-0.0017527                     
##  Mean   :-0.0004472                     
##  3rd Qu.: 0.0042832                     
##  Max.   : 1.0000000                     
##  NA's   :3                              
##  `Age recode (<60,60-69,70+)`80-84 years `Age recode (<60,60-69,70+)`85+ years
##  Min.   :-0.1497345                      Min.   :-0.185038                    
##  1st Qu.:-0.0260547                      1st Qu.:-0.032765                    
##  Median :-0.0009538                      Median :-0.002749                    
##  Mean   : 0.0030973                      Mean   : 0.009643                    
##  3rd Qu.: 0.0066626                      3rd Qu.: 0.005992                    
##  Max.   : 1.0000000                      Max.   : 1.000000                    
##  NA's   :3                               NA's   :3                            
##  Radiation.Yes           COD           
##  Min.   :-0.18043   Min.   :-0.274122  
##  1st Qu.:-0.02974   1st Qu.:-0.028538  
##  Median :-0.00240   Median : 0.003475  
##  Mean   : 0.00561   Mean   : 0.019223  
##  3rd Qu.: 0.01477   3rd Qu.: 0.028403  
##  Max.   : 1.00000   Max.   : 1.000000  
##  NA's   :3          NA's   :3

# Print correlation with COD
print(correlation_with_COD)

##                                                          `Race recode (W, B, AI, API)`Asian or Pacific Islander 
##                                                                                                   -0.0545190384 
##                                                                              `Race recode (W, B, AI, API)`Black 
##                                                                                                    0.0469532442 
##                                                                            `Race recode (W, B, AI, API)`Unknown 
##                                                                                                   -0.0293993259 
##                                                                              `Race recode (W, B, AI, API)`White 
##                                                                                                    0.0074658579 
##                                                         `Primary Site - labeled`C50.1-Central portion of breast 
##                                                                                                    0.0250324542 
##                                                    `Primary Site - labeled`C50.2-Upper-inner quadrant of breast 
##                                                                                                   -0.0331489825 
##                                                    `Primary Site - labeled`C50.3-Lower-inner quadrant of breast 
##                                                                                                   -0.0090667775 
##                                                    `Primary Site - labeled`C50.4-Upper-outer quadrant of breast 
##                                                                                                   -0.0443648371 
##                                                    `Primary Site - labeled`C50.5-Lower-outer quadrant of breast 
##                                                                                                   -0.0175945778 
##                                                           `Primary Site - labeled`C50.6-Axillary tail of breast 
##                                                                                                    0.0043188952 
##                                                      `Primary Site - labeled`C50.8-Overlapping lesion of breast 
##                                                                                                   -0.0110631898 
##                                                                       `Primary Site - labeled`C50.9-Breast, NOS 
##                                                                                                    0.1016503665 
##                                                      `Grade Recode (thru 2017)`Poorly differentiated; Grade III 
##                                                                                                    0.0271874011 
##                                                `Grade Recode (thru 2017)`Undifferentiated; anaplastic; Grade IV 
##                                                                                                    0.0094420265 
##                                                                               `Grade Recode (thru 2017)`Unknown 
##                                                                                                    0.0968239858 
##                                                          `Grade Recode (thru 2017)`Well differentiated; Grade I 
##                                                                                                   -0.0623106812 
##                                                                     Laterality.Only one side - side unspecified 
##                                                                                                    0.0200049734 
##                                                Laterality.Paired site, but no information concerning laterality 
##                                                                                                    0.1028374128 
##                                                                            Laterality.Right - origin of primary 
##                                                                                                   -0.0181205661 
##                                                                          `Chemotherapy recode (yes, no/unk)`Yes 
##                                                                                                   -0.0921253574 
##                                                                            `Months from diagnosis to treatment` 
##                                                                                                              NA 
##                                                              `Reason no cancer-directed surgery`Not recommended 
##                                                                                                    0.2220449034 
## `Reason no cancer-directed surgery`Not recommended, contraindicated due to other cond; autopsy only (1973-2002) 
##                                                                                                    0.0989504823 
##                               `Reason no cancer-directed surgery`Recommended but not performed, patient refused 
##                                                                                                    0.0884305414 
##                                `Reason no cancer-directed surgery`Recommended but not performed, unknown reason 
##                                                                                                    0.0403204344 
##                                            `Reason no cancer-directed surgery`Recommended, unknown if performed 
##                                                                                                    0.0114951426 
##                                                            `Reason no cancer-directed surgery`Surgery performed 
##                                                                                                   -0.2741215608 
##                          `Reason no cancer-directed surgery`Unknown; death certificate; or autopsy only (2003+) 
##                                                                                                    0.0839598865 
##                   `Survival months flag`Complete dates are available and there are more than 0 days of survival 
##                                                                                                   -0.0490960286 
##                 `Survival months flag`Incomplete dates are available and there cannot be zero days of follow-up 
##                                                                                                    0.0166136901 
##                  `Survival months flag`Incomplete dates are available and there could be zero days of follow-up 
##                                                                                                    0.0165572269 
##                      `Survival months flag`Not calculated because a Death Certificate Only or Autopsy Only case 
##                                                                                                    0.0787841243 
##                                                                                               `Survival months` 
##                                                                                                              NA 
##                                                                          `First malignant primary indicator`Yes 
##                                                                                                   -0.1213346094 
##                                                          `Total number of in situ/malignant tumors for patient` 
##                                                                                                              NA 
##                                                          `Total number of benign/borderline tumors for patient` 
##                                                                                                    0.0096744569 
##                                                     `Marital status at diagnosis`Married (including common law) 
##                                                                                                   -0.1738909087 
##                                                                          `Marital status at diagnosis`Separated 
##                                                                                                   -0.0040996820 
##                                                             `Marital status at diagnosis`Single (never married) 
##                                                                                                    0.0003130660 
##                                                                            `Marital status at diagnosis`Unknown 
##                                                                                                    0.0303384662 
##                                                      `Marital status at diagnosis`Unmarried or Domestic Partner 
##                                                                                                   -0.0119834990 
##                                                                            `Marital status at diagnosis`Widowed 
##                                                                                                    0.2258045820 
##                                                `Median household income inflation adj to 2021`$40,000 - $44,999 
##                                                                                                    0.0288158876 
##                                                `Median household income inflation adj to 2021`$45,000 - $49,999 
##                                                                                                    0.0293692229 
##                                                `Median household income inflation adj to 2021`$50,000 - $54,999 
##                                                                                                    0.0232834837 
##                                                `Median household income inflation adj to 2021`$55,000 - $59,999 
##                                                                                                    0.0119175068 
##                                                `Median household income inflation adj to 2021`$60,000 - $64,999 
##                                                                                                    0.0085555964 
##                                                `Median household income inflation adj to 2021`$65,000 - $69,999 
##                                                                                                   -0.0113267709 
##                                                `Median household income inflation adj to 2021`$70,000 - $74,999 
##                                                                                                    0.0021964742 
##                                                         `Median household income inflation adj to 2021`$75,000+ 
##                                                                                                   -0.0518348420 
##                                                        `Median household income inflation adj to 2021`< $35,000 
##                                                                                                    0.0183133390 
##                           `Median household income inflation adj to 2021`Unknown/missing/no match/Not 1990-2021 
##                                                                                                   -0.0018601995 
##                          `Rural-Urban Continuum Code`Counties in metropolitan areas of 250,000 to 1 million pop 
##                                                                                                    0.0054201133 
##                               `Rural-Urban Continuum Code`Counties in metropolitan areas of lt 250 thousand pop 
##                                                                                                    0.0164868975 
##                            `Rural-Urban Continuum Code`Nonmetropolitan counties adjacent to a metropolitan area 
##                                                                                                    0.0284308358 
##                        `Rural-Urban Continuum Code`Nonmetropolitan counties not adjacent to a metropolitan area 
##                                                                                                    0.0283202178 
##                          `Rural-Urban Continuum Code`Unknown/missing/no match (Alaska or Hawaii - Entire State) 
##                                                                                                    0.0026305237 
##                                              `Rural-Urban Continuum Code`Unknown/missing/no match/Not 1990-2021 
##                                                                                                   -0.0018601995 
##                                                                         `Age recode (<60,60-69,70+)`05-09 years 
##                                                                                                   -0.0013753077 
##                                                                         `Age recode (<60,60-69,70+)`10-14 years 
##                                                                                                   -0.0013753077 
##                                                                         `Age recode (<60,60-69,70+)`15-19 years 
##                                                                                                   -0.0008190567 
##                                                                         `Age recode (<60,60-69,70+)`20-24 years 
##                                                                                                   -0.0017825828 
##                                                                         `Age recode (<60,60-69,70+)`25-29 years 
##                                                                                                   -0.0118417779 
##                                                                         `Age recode (<60,60-69,70+)`30-34 years 
##                                                                                                   -0.0259548035 
##                                                                         `Age recode (<60,60-69,70+)`35-39 years 
##                                                                                                   -0.0423991343 
##                                                                         `Age recode (<60,60-69,70+)`40-44 years 
##                                                                                                   -0.0751335018 
##                                                                         `Age recode (<60,60-69,70+)`45-49 years 
##                                                                                                   -0.0999972407 
##                                                                         `Age recode (<60,60-69,70+)`50-54 years 
##                                                                                                   -0.0950419640 
##                                                                         `Age recode (<60,60-69,70+)`55-59 years 
##                                                                                                   -0.0788618264 
##                                                                         `Age recode (<60,60-69,70+)`60-64 years 
##                                                                                                   -0.0661892685 
##                                                                         `Age recode (<60,60-69,70+)`65-69 years 
##                                                                                                   -0.0379863553 
##                                                                         `Age recode (<60,60-69,70+)`70-74 years 
##                                                                                                    0.0251230157 
##                                                                         `Age recode (<60,60-69,70+)`75-79 years 
##                                                                                                    0.1002552651 
##                                                                         `Age recode (<60,60-69,70+)`80-84 years 
##                                                                                                    0.1861215456 
##                                                                           `Age recode (<60,60-69,70+)`85+ years 
##                                                                                                    0.3114860645 
##                                                                                                   Radiation.Yes 
##                                                                                                   -0.1573241890 
##                                                                                                             COD 
##                                                                                                    1.0000000000

# Exclude "COD" column from model matrix and encode factors
encoded_data <- predict(dummyVars(" ~ .", data = BREAST_DF_surv_clean[, -cod_column_index], fullRank = TRUE), newdata = BREAST_DF_surv_clean)

# Alternatively, using ggplot
correlation_df <- data.frame(variable = colnames(correlation_matrix), correlation = correlation_with_COD)
# Plot the correlations as bar charts, one chunk of about 20 variables at a time, so the axis labels stay readable
ggplot(correlation_df[1:19, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text

ggplot(correlation_df[20:39, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).

ggplot(correlation_df[40:59, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text

ggplot(correlation_df[60:77, ], aes(x = variable, y = correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, 
                                   size = 7)) +  # Adjust size as needed
  scale_x_discrete(labels = function(x) str_wrap(x, width = 25))  # Wrap text

Machine learning, Random Forest classification model

To work with this dataset, I need to transform the categorical data (factors) into numerical variables. I use a method known as one-hot encoding for this. Although target encoding is the better method for this survival analysis, I decided not to apply it because of its complexity and time constraints [1,2].

In general, the machine learning phase consists of four main steps:

Encode categorical variables.
Split the data into training and testing sets.
Train the models.
Evaluate the models.

What is target encoding:

Target encoding, also known as mean encoding or likelihood encoding, converts a categorical variable into numerical values based on the target variable. It replaces each category with the mean (or another summary statistic) of the target variable for that category. Note that dummyVars() from the caret package, which is used in this project, performs one-hot encoding; target encoding requires a dedicated package such as embed.
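As a minimal sketch of the idea (not the method applied in this project), the following base R code replaces each level of a categorical predictor with the mean of a binary target for that level. The toy data frame and the column names grade and died are illustrative assumptions, not columns from the SEER extract.

```r
# Toy data: hypothetical column names and values, not taken from the SEER extract
toy <- data.frame(
  grade = c("I", "I", "II", "II", "II", "III", "III", "III"),
  died  = c(0, 0, 0, 1, 0, 1, 1, 0)
)

# Mean of the target within each category
level_means <- tapply(toy$died, toy$grade, mean)

# Replace each category with the target mean of its level
toy$grade_encoded <- as.numeric(level_means[as.character(toy$grade)])

print(level_means)
print(toy)
```

In a real workflow the per-level means would be computed on the training set only and then applied to the test set, otherwise information about the target leaks into the predictors.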

What is One-Hot encoding:

One-hot encoding represents a categorical variable, such as marital status or tumor grade, as a set of binary columns. Each category gets its own column, with a value of 1 indicating that the category is present for a record and 0 otherwise. This allows machine learning algorithms to interpret and use categorical data in predictive models.
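A small illustration of one-hot encoding, assuming a toy factor with made-up values; it uses base R's model.matrix() rather than the caret call used later in this project, so it runs without extra packages.

```r
# Toy predictor: hypothetical values for illustration only
toy <- data.frame(laterality = factor(c("Left", "Right", "Left", "Unknown")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
one_hot <- model.matrix(~ laterality - 1, data = toy)

print(one_hot)
```

Each row contains exactly one 1. caret's dummyVars() produces the same kind of matrix; with fullRank = TRUE it drops one level per factor to avoid perfectly collinear columns.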

Different models investigated in this Project

Random Forest (rf): Random forest is a popular machine learning algorithm that can be adapted for survival analysis. It constructs a multitude of decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Logistic Regression (glm): Logistic regression, a foundational technique in survival analysis, is employed in this project to model the relationship between various prognostic factors and the probability of survival or death outcomes in breast cancer patients.
Deep Neural Network (DNN): This is a powerful machine learning model that can learn complex patterns in data to classify individuals as either alive or deceased. In R, DNNs can be implemented using packages like keras, which provide a flexible framework for building and training deep learning models tailored to specific datasets.
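To make the logistic regression entry concrete, here is a minimal sketch on simulated data. The variables age and died and the coefficients used to simulate them are assumptions chosen for illustration; they are not estimates from the SEER data.

```r
set.seed(1)

# Simulated data: hypothetical predictor and binary outcome
n <- 200
age <- runif(n, min = 30, max = 90)
# Assumed relationship: the probability of death rises with age
p_death <- plogis(-6 + 0.08 * age)
died <- rbinom(n, size = 1, prob = p_death)
toy <- data.frame(age = age, died = died)

# Logistic regression: binomial family with the default logit link
fit <- glm(died ~ age, data = toy, family = binomial)

# Predicted probability of death for two hypothetical patients
new_patients <- data.frame(age = c(50, 80))
pred <- predict(fit, newdata = new_patients, type = "response")
print(pred)
```

With type = "response", predict() returns probabilities between 0 and 1, which can be thresholded (for example at 0.5) to obtain a class label.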

Data Preparation for Ensemble models

BREAST_DF_surv_clean_no_missing <- na.omit(BREAST_DF_surv_clean)

# Reduce the problem to a binary outcome (alive vs. death from breast cancer), which is easier to tackle
# The remaining levels are recoded below as 1 ("Alive") and 0 (all other rows, i.e. "Breast")
# Remove "Other" from the COD column
BREAST_DF_surv_clean_no_missing_bi <- BREAST_DF_surv_clean_no_missing[BREAST_DF_surv_clean_no_missing$COD != "Other", ]

# Replace remaining categories with numerical values
#BREAST_DF_surv_clean_no_missing_bi$COD <- as.numeric(factor(BREAST_DF_surv_clean_no_missing_bi$COD, levels = c("Alive", "Breast")))

BREAST_DF_surv_clean_no_missing_bi$COD <- ifelse(BREAST_DF_surv_clean_no_missing_bi$COD == "Alive", 1, 0)

BREAST_DF_surv_clean_no_missing_bi$COD <- as.factor(BREAST_DF_surv_clean_no_missing_bi$COD)

# Convert to binomial distribution
#model_rf <- randomForest(COD ~ ., data = BREAST_DF_surv_clean_no_missing_bi, type = "response", ntree = 100)


# Find the index of the column named "COD"
cod_column_index <- which(names(BREAST_DF_surv_clean_no_missing_bi) == "COD")

# Exclude "COD" column from the data 
data_without_cod <- BREAST_DF_surv_clean_no_missing_bi[, -cod_column_index]

# Perform one-hot encoding
encoded_data <- dummyVars(" ~ .", data = data_without_cod)

# Create the design matrix with encoded data
design_matrix <- predict(encoded_data, newdata = data_without_cod)
design_matrix <- data.frame(design_matrix)

# Add the target variable (COD) back to the design matrix
design_matrix <- cbind(design_matrix, COD = BREAST_DF_surv_clean_no_missing_bi$COD)
design_matrix$COD <- factor(design_matrix$COD)

# Split the data into training and testing sets
set.seed(123)  # for reproducibility
train_indices <- createDataPartition(design_matrix$COD, p = 0.7, list = FALSE)
train_data <- design_matrix[train_indices, ]
test_data <- design_matrix[-train_indices, ]

Machine Learning: Random Forest

Random Forests are a powerful machine learning technique well-suited for tasks like predicting patient survival in cancer cases. A Random Forest does not rely on a single decision tree but on a multitude of them (a "forest"). Each tree is built on a bootstrap sample of the data (rows drawn at random with replacement) and considers only a random selection of features at each split.
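The two sources of randomness described above can be sketched in a few lines of base R. This is only an illustration of the sampling steps, not of tree growing itself, and the feature names are hypothetical.

```r
set.seed(42)

n_rows <- 10
# Hypothetical feature names for illustration
feature_names <- c("age", "grade", "laterality", "radiation", "chemo", "marital")

# 1. Bootstrap sample: draw row indices with replacement for one tree
boot_idx <- sample(n_rows, size = n_rows, replace = TRUE)

# Rows never drawn are "out-of-bag" for this tree
oob_idx <- setdiff(seq_len(n_rows), boot_idx)

# 2. Random feature subset considered at one split
#    (floor(sqrt(p)) is the randomForest default for classification)
mtry <- floor(sqrt(length(feature_names)))
split_features <- sample(feature_names, size = mtry)

print(sort(boot_idx))
print(oob_idx)
print(split_features)
```

Because the rows are drawn with replacement, some appear several times in a tree's sample and others not at all; the randomForest package uses the left-out rows to compute its out-of-bag error estimate.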

# Fit the Random Forest model
# (class probabilities are requested later with predict(..., type = "prob"))
model_rf <- randomForest(COD ~ ., data = train_data)

# Make predictions on the test set
predictions_rf <- predict(model_rf, newdata = test_data)

# Evaluate the model
conf_matrix <- confusionMatrix(predictions_rf, test_data$COD)
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  6970  1479
##          1  2591 65338
##                                           
##                Accuracy : 0.9467          
##                  95% CI : (0.9451, 0.9483)
##     No Information Rate : 0.8748          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7439          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.72900         
##             Specificity : 0.97786         
##          Pos Pred Value : 0.82495         
##          Neg Pred Value : 0.96186         
##              Prevalence : 0.12518         
##          Detection Rate : 0.09126         
##    Detection Prevalence : 0.11062         
##       Balanced Accuracy : 0.85343         
##                                           
##        'Positive' Class : 0               
##
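
The statistics reported above can be reproduced by hand from the four cell counts. The sketch below copies the counts from the printed matrix and uses class 0 (death) as the positive class, as caret does here. The variable names are my own.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(6970, 2591, 1479, 65338),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

acc_rf  <- sum(diag(cm)) / sum(cm)           # share of correct predictions
sens_rf <- cm["0", "0"] / sum(cm[, "0"])     # true class 0 predicted as 0
spec_rf <- cm["1", "1"] / sum(cm[, "1"])     # true class 1 predicted as 1
bal_rf  <- (sens_rf + spec_rf) / 2           # balanced accuracy
nir_rf  <- max(colSums(cm)) / sum(cm)        # no-information rate

round(c(accuracy = acc_rf, sensitivity = sens_rf, specificity = spec_rf,
        balanced_accuracy = bal_rf, no_information_rate = nir_rf), 4)
```

The no-information rate of about 0.875 is the accuracy obtained by always predicting "Alive", so it is the baseline against which the 0.947 accuracy should be judged.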

# Plot confusion matrix as a heatmap
conf_table <- as.table(conf_matrix$table)
heatmap(conf_table, 
        Colv = NA, 
        Rowv = NA, 
        col = cm.colors(12),  
        scale = "column",     
        margins = c(10, 10),   
        xlab = "True Class", 
        ylab = "Predicted Class",
        main = "Confusion Matrix Heatmap")

# Heatmap
heatmap_data <- as.data.frame(as.table(conf_matrix))
heatmap <- ggplot(heatmap_data, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) +  # Add text labels
  ggtitle("Random Forest Predictive Model") +  # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(conf_matrix$overall["Accuracy"]))) +  # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5))  # Center subtitle

print(heatmap)

# Get predicted probabilities for each class (ensure type="prob" is used)
predictions_rf_probs <- predict(model_rf, test_data, type = "prob")

# Extract true class labels and convert them to factor
true_class <- as.factor(test_data$COD)

# Convert factor predictions to ordered factors
predictions_order <- ordered(as.numeric(predictions_rf) - 1, levels = c(0, 1))

# Create ROC curve
roc_curve <- roc(true_class, predictions_rf_probs[, "1"])

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# Plot ROC curve
plot(roc_curve, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", main = "ROC Curve", legacy.axes = TRUE, xlab = "1 - Specificity", ylab = "Sensitivity")
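
To make the AUC value on the ROC plot more concrete, it can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it from ranks in base R. The function name `auc_rank` and the toy vectors are hypothetical, and this is not how pROC computes the value internally.

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(labels, scores) {
  # labels: 0/1 vector (1 = case); scores: predicted probability of class 1
  r  <- rank(scores)          # average ranks, so ties count as half
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)

# 3 of the 4 case/control pairs are ordered correctly
auc_rank(toy_labels, toy_scores)
```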

Machine Learning: Logistic Regression

Logistic regression is a statistical model used to analyze the relationship between a binary outcome variable and one or more independent variables. It estimates the probability of the outcome variable being in a particular category (usually coded as 0 or 1) based on the values of the independent variables. The model employs the logistic function to constrain the predicted probabilities between 0 and 1, making it suitable for binary classification tasks like survival/death analyses in our case. In R, logistic regression can be implemented using the glm() function with a binomial family distribution.
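
The link between the linear predictor and the predicted probability can be checked on a small example. This sketch uses R's built-in mtcars data, which has nothing to do with SEER, purely to show that glm() with a binomial family applies the logistic function to the log-odds.

```r
# Logistic (sigmoid) function: maps any real number to the interval (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Toy binary outcome: transmission type (am) explained by car weight (wt)
toy_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Linear predictor (log-odds) and the probabilities derived from it
eta      <- predict(toy_fit, type = "link")
p_manual <- logistic(eta)
p_glm    <- predict(toy_fit, type = "response")

# Both routes give the same probabilities
all.equal(unname(p_manual), unname(p_glm))
range(p_glm)
```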

# Train the logistic regression model
logistic_model <- glm(COD ~ ., data = train_data, family = binomial)

# Make predictions on the test set
predictions_logistic <- predict(logistic_model, newdata = test_data, type = "response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

# Convert predicted probabilities to class labels
predicted_class <- ifelse(predictions_logistic > 0.5, 1, 0)

# Evaluate the model
confusion_matrix <- table(predicted_class, test_data$COD)
print(confusion_matrix)

##                
## predicted_class     0     1
##               0  5755  1538
##               1  3806 65279

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))

## [1] "Accuracy: 0.9300322082275"

# Plot the confusion matrix as a heatmap
heatmap(confusion_matrix, 
        Colv = NA, 
        Rowv = NA, 
        col = cm.colors(12),  # Color palette for heatmap
        scale = "column",     # Scale within each column (true class)
        margins = c(10, 10),  # Add extra space for row and column names
        xlab = "True Class", 
        ylab = "Predicted Class",
        main = "Confusion Matrix Heatmap")

# Heatmap
heatmap_data <- as.data.frame(as.table(confusion_matrix))
heatmap <- ggplot(heatmap_data, aes(x = predicted_class, y = Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) +  # Add text labels
  ggtitle("Logistic Regression Predictive Model") +  # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(accuracy))) +  # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5))  # Center subtitle

print(heatmap)

# Calculate AUC ROC
roc_curve <- roc(test_data$COD, predictions_logistic)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

print(roc_curve)

## 
## Call:
## roc.default(response = test_data$COD, predictor = predictions_logistic)
## 
## Data: predictions_logistic in 9561 controls (test_data$COD 0) < 66817 cases (test_data$COD 1).
## Area under the curve: 0.9291

# Plot the ROC curve
plot(roc_curve, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", main = "ROC Curve")

Data Preparation for Survival model

# Prepare data: find the outcome columns ("COD" and "Survival months")
cod_column_index_1 <- which(names(BREAST_DF_surv_clean_no_missing) %in% c("COD", "Survival months"))

# Exclude the outcome columns from the predictors
data_without_cod_1 <- BREAST_DF_surv_clean_no_missing[, -cod_column_index_1]

# Perform one-hot encoding
encoded_data_1 <- dummyVars(" ~ .", data = data_without_cod_1)

# Create the design matrix with encoded data
design_matrix_1 <- data.frame(predict(encoded_data_1, newdata = data_without_cod_1))

# Add the target variables (survival time and status) back to the design matrix
design_matrix_1$Time <- BREAST_DF_surv_clean_no_missing$`Survival months`
design_matrix_1$Status <- BREAST_DF_surv_clean_no_missing$COD

# Split the data into training and testing sets
set.seed(123)  # for reproducibility
train_indices_1 <- createDataPartition(design_matrix_1$Status, p = 0.7, list = FALSE)
train_data_1 <- design_matrix_1[train_indices_1, ]
test_data_1 <- design_matrix_1[-train_indices_1, ]

Deep Neural Network (DNN)

A deep neural network is a machine learning model capable of capturing complex patterns in survival data to predict the likelihood of an event (e.g., death) over a given period. In binary classification tasks such as alive/dead outcomes, the network consists of multiple layers of interconnected nodes (neurons) that process the input features to predict the probability that an individual experiences the event of interest. Other architectures exist, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), but the model below is a fully connected feed-forward network with two hidden layers, trained as a binary classifier on the same one-hot encoded data as the previous models. Networks are trained with optimization algorithms such as stochastic gradient descent (SGD) or Adam to minimize prediction error. In R, deep neural networks can be implemented with libraries such as keras or tensorflow, which allow flexible modeling and customization.

# Load required libraries
library(keras)
library(survival)
library(survMisc)  # For cindex() function

## 
## Attaching package: 'survMisc'

## The following object is masked from 'package:pROC':
## 
##     ci

## The following object is masked from 'package:R.utils':
## 
##     asLong

## The following object is masked from 'package:ggplot2':
## 
##     autoplot

library(reticulate)
#use_python("C:/Users/kohya/AppData/Local/Programs/Python/Python37")
# Define the neural network architecture
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(train_data) - 1) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# Compile the model
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_adam(),
  metrics = c("accuracy")
)

# Train the model
history <- model %>% fit(
  x = as.matrix(train_data[, -ncol(train_data)]),  # Features
  y = as.numeric(train_data$COD) - 1,  # Target variable (convert to 0-based index)
  epochs = 100,
  batch_size = 32,
  validation_split = 0.2
)

## Epoch 1/100
## 4456/4456 - 11s - loss: 0.1832 - accuracy: 0.9373 - val_loss: 0.1733 - val_accuracy: 0.9429 - 11s/epoch - 2ms/step
## Epoch 2/100
## 4456/4456 - 10s - loss: 0.1690 - accuracy: 0.9435 - val_loss: 0.1695 - val_accuracy: 0.9466 - 10s/epoch - 2ms/step
## Epoch 3/100
## 4456/4456 - 9s - loss: 0.1656 - accuracy: 0.9446 - val_loss: 0.1676 - val_accuracy: 0.9469 - 9s/epoch - 2ms/step
## Epoch 4/100
## 4456/4456 - 11s - loss: 0.1637 - accuracy: 0.9457 - val_loss: 0.1649 - val_accuracy: 0.9466 - 11s/epoch - 3ms/step
## Epoch 5/100
## 4456/4456 - 10s - loss: 0.1620 - accuracy: 0.9463 - val_loss: 0.1655 - val_accuracy: 0.9467 - 10s/epoch - 2ms/step
## Epoch 6/100
## 4456/4456 - 11s - loss: 0.1612 - accuracy: 0.9469 - val_loss: 0.1673 - val_accuracy: 0.9467 - 11s/epoch - 2ms/step
## Epoch 7/100
## 4456/4456 - 10s - loss: 0.1603 - accuracy: 0.9468 - val_loss: 0.1718 - val_accuracy: 0.9457 - 10s/epoch - 2ms/step
## Epoch 8/100
## 4456/4456 - 10s - loss: 0.1597 - accuracy: 0.9471 - val_loss: 0.1644 - val_accuracy: 0.9473 - 10s/epoch - 2ms/step
## Epoch 9/100
## 4456/4456 - 10s - loss: 0.1591 - accuracy: 0.9473 - val_loss: 0.1637 - val_accuracy: 0.9479 - 10s/epoch - 2ms/step
## Epoch 10/100
## 4456/4456 - 10s - loss: 0.1584 - accuracy: 0.9474 - val_loss: 0.1642 - val_accuracy: 0.9480 - 10s/epoch - 2ms/step
## Epoch 11/100
## 4456/4456 - 10s - loss: 0.1576 - accuracy: 0.9480 - val_loss: 0.1660 - val_accuracy: 0.9474 - 10s/epoch - 2ms/step
## Epoch 12/100
## 4456/4456 - 11s - loss: 0.1575 - accuracy: 0.9479 - val_loss: 0.1644 - val_accuracy: 0.9472 - 11s/epoch - 2ms/step
## Epoch 13/100
## 4456/4456 - 10s - loss: 0.1568 - accuracy: 0.9487 - val_loss: 0.1671 - val_accuracy: 0.9462 - 10s/epoch - 2ms/step
## Epoch 14/100
## 4456/4456 - 10s - loss: 0.1564 - accuracy: 0.9487 - val_loss: 0.1653 - val_accuracy: 0.9486 - 10s/epoch - 2ms/step
## Epoch 15/100
## 4456/4456 - 10s - loss: 0.1557 - accuracy: 0.9487 - val_loss: 0.1674 - val_accuracy: 0.9473 - 10s/epoch - 2ms/step
## Epoch 16/100
## 4456/4456 - 10s - loss: 0.1552 - accuracy: 0.9492 - val_loss: 0.1641 - val_accuracy: 0.9485 - 10s/epoch - 2ms/step
## Epoch 17/100
## 4456/4456 - 10s - loss: 0.1547 - accuracy: 0.9490 - val_loss: 0.1650 - val_accuracy: 0.9475 - 10s/epoch - 2ms/step
## Epoch 18/100
## 4456/4456 - 10s - loss: 0.1546 - accuracy: 0.9491 - val_loss: 0.1712 - val_accuracy: 0.9453 - 10s/epoch - 2ms/step
## Epoch 19/100
## 4456/4456 - 10s - loss: 0.1541 - accuracy: 0.9495 - val_loss: 0.1666 - val_accuracy: 0.9482 - 10s/epoch - 2ms/step
## Epoch 20/100
## 4456/4456 - 10s - loss: 0.1538 - accuracy: 0.9495 - val_loss: 0.1694 - val_accuracy: 0.9474 - 10s/epoch - 2ms/step
## Epoch 21/100
## 4456/4456 - 10s - loss: 0.1532 - accuracy: 0.9499 - val_loss: 0.1676 - val_accuracy: 0.9476 - 10s/epoch - 2ms/step
## Epoch 22/100
## 4456/4456 - 10s - loss: 0.1525 - accuracy: 0.9499 - val_loss: 0.1682 - val_accuracy: 0.9474 - 10s/epoch - 2ms/step
## Epoch 23/100
## 4456/4456 - 10s - loss: 0.1522 - accuracy: 0.9503 - val_loss: 0.1679 - val_accuracy: 0.9478 - 10s/epoch - 2ms/step
## Epoch 24/100
## 4456/4456 - 10s - loss: 0.1522 - accuracy: 0.9500 - val_loss: 0.1692 - val_accuracy: 0.9480 - 10s/epoch - 2ms/step
## Epoch 25/100
## 4456/4456 - 11s - loss: 0.1515 - accuracy: 0.9503 - val_loss: 0.1769 - val_accuracy: 0.9462 - 11s/epoch - 2ms/step
## Epoch 26/100
## 4456/4456 - 11s - loss: 0.1514 - accuracy: 0.9502 - val_loss: 0.1690 - val_accuracy: 0.9484 - 11s/epoch - 2ms/step
## Epoch 27/100
## 4456/4456 - 10s - loss: 0.1509 - accuracy: 0.9506 - val_loss: 0.1811 - val_accuracy: 0.9470 - 10s/epoch - 2ms/step
## Epoch 28/100
## 4456/4456 - 9s - loss: 0.1504 - accuracy: 0.9508 - val_loss: 0.1738 - val_accuracy: 0.9472 - 9s/epoch - 2ms/step
## Epoch 29/100
## 4456/4456 - 10s - loss: 0.1500 - accuracy: 0.9509 - val_loss: 0.1746 - val_accuracy: 0.9462 - 10s/epoch - 2ms/step
## Epoch 30/100
## 4456/4456 - 10s - loss: 0.1497 - accuracy: 0.9509 - val_loss: 0.1774 - val_accuracy: 0.9470 - 10s/epoch - 2ms/step
## Epoch 31/100
## 4456/4456 - 10s - loss: 0.1494 - accuracy: 0.9512 - val_loss: 0.1767 - val_accuracy: 0.9463 - 10s/epoch - 2ms/step
## Epoch 32/100
## 4456/4456 - 9s - loss: 0.1491 - accuracy: 0.9514 - val_loss: 0.1818 - val_accuracy: 0.9456 - 9s/epoch - 2ms/step
## Epoch 33/100
## 4456/4456 - 9s - loss: 0.1487 - accuracy: 0.9514 - val_loss: 0.1866 - val_accuracy: 0.9439 - 9s/epoch - 2ms/step
## Epoch 34/100
## 4456/4456 - 9s - loss: 0.1485 - accuracy: 0.9512 - val_loss: 0.1888 - val_accuracy: 0.9456 - 9s/epoch - 2ms/step
## Epoch 35/100
## 4456/4456 - 9s - loss: 0.1483 - accuracy: 0.9516 - val_loss: 0.1833 - val_accuracy: 0.9468 - 9s/epoch - 2ms/step
## Epoch 36/100
## 4456/4456 - 9s - loss: 0.1479 - accuracy: 0.9519 - val_loss: 0.1796 - val_accuracy: 0.9461 - 9s/epoch - 2ms/step
## Epoch 37/100
## 4456/4456 - 9s - loss: 0.1475 - accuracy: 0.9518 - val_loss: 0.1849 - val_accuracy: 0.9459 - 9s/epoch - 2ms/step
## Epoch 38/100
## 4456/4456 - 10s - loss: 0.1473 - accuracy: 0.9518 - val_loss: 0.1890 - val_accuracy: 0.9445 - 10s/epoch - 2ms/step
## Epoch 39/100
## 4456/4456 - 9s - loss: 0.1470 - accuracy: 0.9524 - val_loss: 0.1873 - val_accuracy: 0.9476 - 9s/epoch - 2ms/step
## Epoch 40/100
## 4456/4456 - 9s - loss: 0.1469 - accuracy: 0.9522 - val_loss: 0.1861 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 41/100
## 4456/4456 - 11s - loss: 0.1465 - accuracy: 0.9521 - val_loss: 0.1875 - val_accuracy: 0.9458 - 11s/epoch - 2ms/step
## Epoch 42/100
## 4456/4456 - 9s - loss: 0.1463 - accuracy: 0.9524 - val_loss: 0.1880 - val_accuracy: 0.9456 - 9s/epoch - 2ms/step
## Epoch 43/100
## 4456/4456 - 10s - loss: 0.1460 - accuracy: 0.9523 - val_loss: 0.1922 - val_accuracy: 0.9466 - 10s/epoch - 2ms/step
## Epoch 44/100
## 4456/4456 - 10s - loss: 0.1457 - accuracy: 0.9523 - val_loss: 0.1941 - val_accuracy: 0.9462 - 10s/epoch - 2ms/step
## Epoch 45/100
## 4456/4456 - 10s - loss: 0.1458 - accuracy: 0.9524 - val_loss: 0.1984 - val_accuracy: 0.9461 - 10s/epoch - 2ms/step
## Epoch 46/100
## 4456/4456 - 10s - loss: 0.1452 - accuracy: 0.9530 - val_loss: 0.1909 - val_accuracy: 0.9449 - 10s/epoch - 2ms/step
## Epoch 47/100
## 4456/4456 - 10s - loss: 0.1451 - accuracy: 0.9528 - val_loss: 0.1986 - val_accuracy: 0.9453 - 10s/epoch - 2ms/step
## Epoch 48/100
## 4456/4456 - 10s - loss: 0.1450 - accuracy: 0.9527 - val_loss: 0.1939 - val_accuracy: 0.9455 - 10s/epoch - 2ms/step
## Epoch 49/100
## 4456/4456 - 10s - loss: 0.1447 - accuracy: 0.9531 - val_loss: 0.1980 - val_accuracy: 0.9449 - 10s/epoch - 2ms/step
## Epoch 50/100
## 4456/4456 - 9s - loss: 0.1440 - accuracy: 0.9533 - val_loss: 0.1966 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 51/100
## 4456/4456 - 9s - loss: 0.1441 - accuracy: 0.9529 - val_loss: 0.2018 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 52/100
## 4456/4456 - 9s - loss: 0.1442 - accuracy: 0.9531 - val_loss: 0.2009 - val_accuracy: 0.9457 - 9s/epoch - 2ms/step
## Epoch 53/100
## 4456/4456 - 9s - loss: 0.1439 - accuracy: 0.9532 - val_loss: 0.1999 - val_accuracy: 0.9472 - 9s/epoch - 2ms/step
## Epoch 54/100
## 4456/4456 - 9s - loss: 0.1437 - accuracy: 0.9534 - val_loss: 0.2137 - val_accuracy: 0.9440 - 9s/epoch - 2ms/step
## Epoch 55/100
## 4456/4456 - 9s - loss: 0.1435 - accuracy: 0.9533 - val_loss: 0.2028 - val_accuracy: 0.9460 - 9s/epoch - 2ms/step
## Epoch 56/100
## 4456/4456 - 9s - loss: 0.1432 - accuracy: 0.9535 - val_loss: 0.2091 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 57/100
## 4456/4456 - 9s - loss: 0.1431 - accuracy: 0.9535 - val_loss: 0.2102 - val_accuracy: 0.9453 - 9s/epoch - 2ms/step
## Epoch 58/100
## 4456/4456 - 9s - loss: 0.1429 - accuracy: 0.9534 - val_loss: 0.2069 - val_accuracy: 0.9447 - 9s/epoch - 2ms/step
## Epoch 59/100
## 4456/4456 - 9s - loss: 0.1427 - accuracy: 0.9536 - val_loss: 0.2110 - val_accuracy: 0.9436 - 9s/epoch - 2ms/step
## Epoch 60/100
## 4456/4456 - 9s - loss: 0.1426 - accuracy: 0.9537 - val_loss: 0.2148 - val_accuracy: 0.9461 - 9s/epoch - 2ms/step
## Epoch 61/100
## 4456/4456 - 9s - loss: 0.1424 - accuracy: 0.9539 - val_loss: 0.2207 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 62/100
## 4456/4456 - 9s - loss: 0.1420 - accuracy: 0.9538 - val_loss: 0.2233 - val_accuracy: 0.9435 - 9s/epoch - 2ms/step
## Epoch 63/100
## 4456/4456 - 9s - loss: 0.1422 - accuracy: 0.9539 - val_loss: 0.2147 - val_accuracy: 0.9448 - 9s/epoch - 2ms/step
## Epoch 64/100
## 4456/4456 - 10s - loss: 0.1421 - accuracy: 0.9540 - val_loss: 0.2154 - val_accuracy: 0.9436 - 10s/epoch - 2ms/step
## Epoch 65/100
## 4456/4456 - 9s - loss: 0.1417 - accuracy: 0.9542 - val_loss: 0.2284 - val_accuracy: 0.9432 - 9s/epoch - 2ms/step
## Epoch 66/100
## 4456/4456 - 9s - loss: 0.1417 - accuracy: 0.9542 - val_loss: 0.2242 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 67/100
## 4456/4456 - 10s - loss: 0.1415 - accuracy: 0.9538 - val_loss: 0.2321 - val_accuracy: 0.9459 - 10s/epoch - 2ms/step
## Epoch 68/100
## 4456/4456 - 10s - loss: 0.1412 - accuracy: 0.9542 - val_loss: 0.2217 - val_accuracy: 0.9448 - 10s/epoch - 2ms/step
## Epoch 69/100
## 4456/4456 - 10s - loss: 0.1412 - accuracy: 0.9541 - val_loss: 0.2239 - val_accuracy: 0.9434 - 10s/epoch - 2ms/step
## Epoch 70/100
## 4456/4456 - 9s - loss: 0.1412 - accuracy: 0.9543 - val_loss: 0.2233 - val_accuracy: 0.9452 - 9s/epoch - 2ms/step
## Epoch 71/100
## 4456/4456 - 9s - loss: 0.1411 - accuracy: 0.9545 - val_loss: 0.2309 - val_accuracy: 0.9436 - 9s/epoch - 2ms/step
## Epoch 72/100
## 4456/4456 - 9s - loss: 0.1405 - accuracy: 0.9545 - val_loss: 0.2264 - val_accuracy: 0.9460 - 9s/epoch - 2ms/step
## Epoch 73/100
## 4456/4456 - 9s - loss: 0.1404 - accuracy: 0.9546 - val_loss: 0.2319 - val_accuracy: 0.9445 - 9s/epoch - 2ms/step
## Epoch 74/100
## 4456/4456 - 9s - loss: 0.1403 - accuracy: 0.9549 - val_loss: 0.2337 - val_accuracy: 0.9437 - 9s/epoch - 2ms/step
## Epoch 75/100
## 4456/4456 - 9s - loss: 0.1405 - accuracy: 0.9548 - val_loss: 0.2356 - val_accuracy: 0.9457 - 9s/epoch - 2ms/step
## Epoch 76/100
## 4456/4456 - 9s - loss: 0.1401 - accuracy: 0.9547 - val_loss: 0.2387 - val_accuracy: 0.9427 - 9s/epoch - 2ms/step
## Epoch 77/100
## 4456/4456 - 9s - loss: 0.1404 - accuracy: 0.9548 - val_loss: 0.2388 - val_accuracy: 0.9421 - 9s/epoch - 2ms/step
## Epoch 78/100
## 4456/4456 - 9s - loss: 0.1398 - accuracy: 0.9550 - val_loss: 0.2401 - val_accuracy: 0.9448 - 9s/epoch - 2ms/step
## Epoch 79/100
## 4456/4456 - 10s - loss: 0.1398 - accuracy: 0.9549 - val_loss: 0.2425 - val_accuracy: 0.9434 - 10s/epoch - 2ms/step
## Epoch 80/100
## 4456/4456 - 10s - loss: 0.1398 - accuracy: 0.9549 - val_loss: 0.2396 - val_accuracy: 0.9440 - 10s/epoch - 2ms/step
## Epoch 81/100
## 4456/4456 - 9s - loss: 0.1395 - accuracy: 0.9550 - val_loss: 0.2386 - val_accuracy: 0.9445 - 9s/epoch - 2ms/step
## Epoch 82/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9554 - val_loss: 0.2533 - val_accuracy: 0.9413 - 9s/epoch - 2ms/step
## Epoch 83/100
## 4456/4456 - 10s - loss: 0.1392 - accuracy: 0.9554 - val_loss: 0.2612 - val_accuracy: 0.9436 - 10s/epoch - 2ms/step
## Epoch 84/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9552 - val_loss: 0.2531 - val_accuracy: 0.9418 - 9s/epoch - 2ms/step
## Epoch 85/100
## 4456/4456 - 9s - loss: 0.1391 - accuracy: 0.9550 - val_loss: 0.2554 - val_accuracy: 0.9436 - 9s/epoch - 2ms/step
## Epoch 86/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9553 - val_loss: 0.2459 - val_accuracy: 0.9446 - 9s/epoch - 2ms/step
## Epoch 87/100
## 4456/4456 - 9s - loss: 0.1392 - accuracy: 0.9550 - val_loss: 0.2494 - val_accuracy: 0.9437 - 9s/epoch - 2ms/step
## Epoch 88/100
## 4456/4456 - 9s - loss: 0.1390 - accuracy: 0.9552 - val_loss: 0.2509 - val_accuracy: 0.9430 - 9s/epoch - 2ms/step
## Epoch 89/100
## 4456/4456 - 9s - loss: 0.1388 - accuracy: 0.9556 - val_loss: 0.2612 - val_accuracy: 0.9435 - 9s/epoch - 2ms/step
## Epoch 90/100
## 4456/4456 - 9s - loss: 0.1385 - accuracy: 0.9552 - val_loss: 0.2587 - val_accuracy: 0.9437 - 9s/epoch - 2ms/step
## Epoch 91/100
## 4456/4456 - 9s - loss: 0.1385 - accuracy: 0.9556 - val_loss: 0.2599 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 92/100
## 4456/4456 - 9s - loss: 0.1386 - accuracy: 0.9555 - val_loss: 0.2558 - val_accuracy: 0.9441 - 9s/epoch - 2ms/step
## Epoch 93/100
## 4456/4456 - 10s - loss: 0.1381 - accuracy: 0.9553 - val_loss: 0.2613 - val_accuracy: 0.9425 - 10s/epoch - 2ms/step
## Epoch 94/100
## 4456/4456 - 10s - loss: 0.1382 - accuracy: 0.9555 - val_loss: 0.2516 - val_accuracy: 0.9441 - 10s/epoch - 2ms/step
## Epoch 95/100
## 4456/4456 - 10s - loss: 0.1382 - accuracy: 0.9554 - val_loss: 0.2607 - val_accuracy: 0.9433 - 10s/epoch - 2ms/step
## Epoch 96/100
## 4456/4456 - 10s - loss: 0.1382 - accuracy: 0.9555 - val_loss: 0.2555 - val_accuracy: 0.9423 - 10s/epoch - 2ms/step
## Epoch 97/100
## 4456/4456 - 10s - loss: 0.1378 - accuracy: 0.9555 - val_loss: 0.2652 - val_accuracy: 0.9438 - 10s/epoch - 2ms/step
## Epoch 98/100
## 4456/4456 - 9s - loss: 0.1377 - accuracy: 0.9559 - val_loss: 0.2777 - val_accuracy: 0.9423 - 9s/epoch - 2ms/step
## Epoch 99/100
## 4456/4456 - 9s - loss: 0.1376 - accuracy: 0.9559 - val_loss: 0.2686 - val_accuracy: 0.9429 - 9s/epoch - 2ms/step
## Epoch 100/100
## 4456/4456 - 9s - loss: 0.1378 - accuracy: 0.9559 - val_loss: 0.2660 - val_accuracy: 0.9434 - 9s/epoch - 2ms/step
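
In the log above the training loss keeps falling while the validation loss reaches its minimum early (0.1637 at epoch 9) and then rises, which suggests overfitting. Early stopping is a common remedy. The base-R sketch below replays the rule on the validation losses of the first 20 epochs, copied from the log. The function `early_stop_epoch` and the patience of 5 are my own illustrative choices, and the model in this project was not trained this way. In keras the same idea is available through callback_early_stopping().

```r
# Validation loss for epochs 1-20, copied from the training log above
val_loss <- c(0.1733, 0.1695, 0.1676, 0.1649, 0.1655, 0.1673, 0.1718, 0.1644,
              0.1637, 0.1642, 0.1660, 0.1644, 0.1671, 0.1653, 0.1674, 0.1641,
              0.1650, 0.1712, 0.1666, 0.1694)

# Stop once the validation loss has not improved for `patience` epochs in a row
early_stop_epoch <- function(val_loss, patience) {
  best <- Inf
  best_epoch <- NA_integer_
  waited <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]
      best_epoch <- epoch
      waited <- 0
    } else {
      waited <- waited + 1
      if (waited >= patience) {
        return(list(stop_epoch = epoch, best_epoch = best_epoch, best_loss = best))
      }
    }
  }
  list(stop_epoch = length(val_loss), best_epoch = best_epoch, best_loss = best)
}

result <- early_stop_epoch(val_loss, patience = 5)
str(result)
```

With a patience of 5 the rule would stop training at epoch 14 and keep the weights from epoch 9, saving most of the 100 epochs.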

# Evaluate the model
metrics <- model %>% evaluate(
  x = as.matrix(test_data[, -ncol(test_data)]),  # Features
  y = as.numeric(test_data$COD) - 1,  # Target variable (convert to 0-based index)
  verbose = 0
)

# Print evaluation metrics
cat("Test Loss:", metrics["loss"], "\n")

## Test Loss: 0.2614014

cat("Test Accuracy:", metrics["accuracy"], "\n")

## Test Accuracy: 0.9416193

# Predictions on test data
predictions <- model %>% predict(as.matrix(test_data[, -ncol(test_data)]))

## 2387/2387 - 2s - 2s/epoch - 1ms/step

predictions <- ifelse(predictions > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(Actual = as.numeric(test_data$COD) - 1, Predicted = predictions)
print("Confusion Matrix:")

## [1] "Confusion Matrix:"

print(conf_matrix)

##       Predicted
## Actual     0     1
##      0  6565  2996
##      1  1463 65354

# Accuracy, sensitivity, and specificity
# (here the positive class is 1 = Alive, whereas caret's confusionMatrix() above used 0)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity <- conf_matrix[1, 1] / sum(conf_matrix[1, ])
paste("Accuracy:",accuracy)

## [1] "Accuracy: 0.94161931446228"

paste("Sensitivity:", sensitivity)

## [1] "Sensitivity: 0.978104374635198"

paste("Specificity:", specificity)

## [1] "Specificity: 0.686643656521284"

# Calculate overall accuracy
overall_accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Heatmap
heatmap_data <- as.data.frame(conf_matrix)
heatmap <- ggplot(heatmap_data, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(x = "Predicted", y = "Actual", fill = "Frequency") +
  theme_minimal() +
  geom_text(aes(label = Freq), color = "black", size = 3) +  # Add text labels
  ggtitle("Deep NN Predictive Model") +  # Add title
  labs(subtitle = paste("Accuracy:", scales::percent(overall_accuracy))) +  # Add accuracy as subtitle
  theme(plot.subtitle = element_text(hjust = 0.5))  # Center subtitle

print(heatmap)

# Plot ROC curve from the predicted probabilities (not the thresholded 0/1 labels)
pred_probs <- as.numeric(model %>% predict(as.matrix(test_data[, -ncol(test_data)]), verbose = 0))
roc_data <- roc(test_data$COD, pred_probs)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

#plot(roc_data, main = "ROC Curve", col = "blue")
plot(roc_data, print.auc = TRUE, auc.polygon = TRUE, max.auc.polygon = TRUE, grid = TRUE, grid.col = "lightgray", main = "ROC Curve")

Conclusion:

In this project, I aimed to predict the survival of breast cancer patients with more than 96% accuracy, given that the overall survival rate is about 75%. The goal was to apply machine learning, the available resources, and the techniques learned in DATA606 and DATA607 to this complex problem. I used the SEER database for 2011 to 2015, comprising over 300,000 cases, to predict patient survival from 16 critical indicators, including race, household income, cancer type, treatment, time to treatment, and number of tumors. Preliminary exploratory data analysis identified these key indicators from a pool of 36, followed by data cleaning and organization for the machine learning tasks. Various R packages were used for data cleaning, type conversion, handling of missing values, and database organization. In addition, correlation analyses using ggplot, the chi-square test, the Fisher test, and other R packages explored the relationships between the numeric and categorical variables and the target of interest, Alive/Death.

Initially, the intention was to include all three categories of Alive/Death/Other, but it was later recognized that the inclusion of the “Other” category rendered the analysis irrelevant. Therefore, the analysis was focused solely on Alive/Death, as breast cancer was the primary cause of death even if patients had other conditions.

A range of machine learning algorithms was applied, from Logistic Regression and Random Forest to more sophisticated methods such as a DNN. Overall, the project demonstrated that even individuals with limited domain knowledge can use available resources to predict cancer patient outcomes with approximately 94% accuracy. Further work, such as stratification, feature importance analysis, and additional data gathering, could improve accuracy and offer significant contributions to the healthcare industry, patient care, and family circumstances.

Despite the complexities associated with managing different packages and large databases, I enjoyed exploring new concepts and learning how different methods can be employed. Particularly, I gained insights into the significance of encoding and its impact on survival model performance. While this analysis lacks the rigor of academic research, it underscores the potential of machine learning in addressing complex problems, paving the way for future exploration and study.

In summary, among the developed models, Logistic Regression was the simplest and fastest and reached 93.0% test accuracy. Random Forest reached the highest test accuracy (94.7%). The neural network reached 94.2% but was time-consuming to train and carries black-box risks. For future iterations, I would focus on Logistic Regression and Random Forest, dedicating more time to encoding, data preparation, stratification, and parameter stress testing to potentially improve accuracy.

This project highlights the potential of machine learning for patient survival prediction, even for individuals with limited domain knowledge. However, further research is needed to:

Enhance Accuracy: Techniques like stratification, parameter importance analysis, and additional data acquisition can be explored.
Improve Generalizability: Future studies could benefit from more diverse datasets and address the limitations of retrospective analysis.
Mitigate Black-Box Risks: While DNNs offered promise, further exploration is required to understand their inner workings and enhance interpretability.

By addressing these limitations, future studies can contribute significantly to personalized medicine, patient care planning, and supporting families facing this challenging diagnosis.

Acknowledgement:

I would like to thank the professors in both DATA606 and DATA607, as well as the students in the classes, who made the courses interesting and challenging. I have learned a lot and dealt with many challenges throughout these courses, despite having little specific background in data science beforehand. The course content was carefully chosen to help students like me develop an understanding of the topic and find enjoyment in the learning process.

References:

[1] SEER (https://seer.cancer.gov/data/access.html)

[2] zgalochkina/SEER_solid_tumor: R code for SEER data analysis of solid tumor in different populations (github.com)

[3] XAI_Healthcare_eXplainable_AI_in_Healthcare.pdf (upc.edu)

[4] Pargent, F., Pfisterer, F., Thomas, J., Bischl, B.: Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Computational Statistics 37(5), 2671–2692 (Nov 2022)

[5] American Cancer Society - Breast Cancer Survival Rates

Surveillance, Epidemiology, and End Results Program. 2023. “SEER*stat Database: Incidence - SEER Research Data, 8 Registries, Nov 2021 Sub (1975-2020) - Linked to County Attributes - Time Dependent (1990-2020) Income/Rurality, 1969-2020 Counties.” National Cancer Institute, DCCPS, Surveillance Research Program, released April 2023, based on the November 2022 submission. https://seer.cancer.gov/data/citation.html.

Data Science Project - Breast Cancer Survival with SEER data

KoohPy <- Koohyar Pooladvand

2024-05-12
