Bussiness understanding

Problem statement

Breast cancer is one of the most common cancers, often referred to as the “pink killer”, and ranks first in the incidence of malignant tumors in women. The World Health Organization(WHO) reports that 685000 people worldwide died from breast cancer in 2020. Therefore, it is important to study breast cancer mortality and the suivival time of patients, as it provides insight into the causes of death from breast cancer and informs efforts to reduce the number of deaths from the disease.

This study may provide important information about differences in breast cancer outcomes in different population groups. For example, studies may show that certain racial or ethnic groups have higher breast cancer mortality rates, which may indicate that targeted interventions are needed to address these disparities.

Overall, studying breast cancer mortality and how long patients can survive is important for understanding the disease and the best ways to reduce the number of deaths caused by it. It also helps to identify the needs and priorities for research to further improve the survival rate and reduce the burden of this disease.

Objective:

The research aims to provide:

To determine inconsistencies, missing values and outliers in the dataset
Identify the relationship between features through data Exploration
Predict the survivability of patient diagnosed with breast cancer through machine learning modules.

Samples:

Age: the age of female patients ranged from 30 to 69 years. The mean age of the patients was 53.97 years.
Races: White, Black and Other. The majority of patients belonged to the white race, while a few patients were identified as black or other races.
Marital status: Married, Single, Divorced, Widowed, and Separated. Most patients diagnosed with cancer were married.
T stage: the size of tumor. T1, T2, T3, and T4.
N stage: the extent of cancer spread to the lymph nodes. N1, N2, and N3.
Sixth stage: the stage grouping of breast cancer. Based on a combination of T stage, N stage, and M stage(not available in the dataset). The five categories of stage VI and their proportions in the dataset are IIA, IIB, IIIA, IIIB, and IIIC.
Differentiate or grade: how well the tumor resenbles the normal tissue. > * Grade I - Well differentiated > * Grade II - Moderately differentiated > * Grade III - Poorly differentiated > * Grade IV - Undifferentiated
A stage: Regional and Distant. Regional means the spread of cancer is localized. Distant corresponds to the spread of cancer to distant parts of the body.
Tumor size(mm): the size of a tumor in mm.
Survival months: the average value is 71.3 months. The range of this is between 1 and 107.
Dataset: https://ieee-dataport.org/open-access/seer-breast-cancer-data JING TENG, January 18, 2019

Imports

install.packages("dplyr")
install.packages("gmodels")
install.packages("psych")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘gtools’, ‘gdata’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘mnormt’

library(tidyverse)
library(dplyr)
library(psych)
library(gmodels)
library(ggplot2)
library(tidyr)

Warning message in system("timedatectl", intern = TRUE):
“running command 'timedatectl' had status 1”
── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.1 ──

[32m✔[39m [34mggplot2[39m 3.4.0     [32m✔[39m [34mpurrr  [39m 1.0.1
[32m✔[39m [34mtibble [39m 3.1.8     [32m✔[39m [34mstringr[39m 1.4.1
[32m✔[39m [34mreadr  [39m 2.1.3     [32m✔[39m [34mforcats[39m 0.5.2

── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()


Attaching package: ‘psych’


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

Data Cleaning

Data Inconsistencies issue in the ‘Race’ Attribute

df <- read.csv("/content/sample_data/Uncleaned_Datav1.csv")

As can be seen in the bar chart below, we can see that each race is repeated twice due to the inconsistent of category name.

race <- table(df$Race)
barplot(race, main="Race Distribution",
        xlab="Race", ylab ="Frequencies")
table(df$Race)

 Black Blacks  Other otherS  wHiTe  White 
   180    118    241     81   1039   2418

png

To tackle inconsistencies in the naming convention which might be due to the combination of two data sources, data smoothing is required by using replacement method

#Replace "Blacks" = Black, "otherS" = "Other" and "wHiTe" <- "White" 
df$Race[df$Race == "Blacks"] <- "Black"
df$Race[df$Race == "otherS"] <- "Other"
df$Race[df$Race == "wHiTe"] <- "White"

After data smoothing

race <- table(df$Race)
barplot(race, main="Race Distribution",
        xlab="Race", ylab ="Frequencies")
table(df$Race)

Black Other White 
  298   322  3457

png

Data Intentionality issue in the “Marital Status” Attribute

As can be seen in the bar chart below there are Missing Value called Unknown which has 25 occurences in the data

MaritalStat <- table(df$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
        xlab="Marital", ylab="Frequencies")
table(df$Marital.Status)

 Divorced   Married Separated   Single    Unknown   Widowed 
      488      2662        45       620        25       237

png

Since the unknown value is 25 occurences, the percentage of the unknown value is 25/4077x100: 0.6132 % of the whole dataset. The optimal solution is by performing Data Reduction which the unknown value will be dropped and it is not resulted in loss of significant information.

df <- subset(df, Marital.Status != "Unknown")

After Data Reduction

MaritalStat <- table(df$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
        xlab="Marital", ylab="Frequencies")
table(df$Marital.Status)

 Divorced   Married Separated   Single    Widowed 
      488      2662        45       620       237

png

Data Noisiness(Unbelievability) issue in the “Age” Attributes

From the Age Histogram Below, there are existent of outliers where some observation have their age equal to “0”. The range of this unbelievabiltiy age is from -1 to 4. The percentage noisiness data in the Age Attribute is 28/4052 x 100 : 0.69 %, therefore it can be eliminated by using Data Reduction as it not effect significant loss of information

hist(df$Age, breaks = 30, main = "Age Histogram", xlab = "Age", ylab = "Frequency")
count <- sum(df$Age >= -1 & df$Age <=5)
cat("The count of noisiness data in Age Attribtutes is", count)

The count of noisiness data in Age Attribtutes is 28

png

After Data Reduction Method

df_filtered <- subset(df, !(Age >= -1 & Age <= 6))
hist(df_filtered$Age, breaks = 30, main = "Age Histogram", xlab = "Age", ylab = "Frequency")

png

Data Incomplete issue in ‘Differentiate’ Attribute

From the bar chart below, there are presence of missing value which accumulated 733 correspond to 18.23%. Dropping these missing values can be impratical because the missing values are significantly high. This missing values will be replaced with the mode which is “Moderately differentiated” by using Data Imputation Method.

differentiate <- table(df$differentiate)
barplot(differentiate, main="Differentiate Distribution",
        xlab="Differentiate", ylab="Frequencies")
table(df$differentiate)

Moderately differentiated     Poorly differentiated          Undifferentiated 
                     1632                      1119                        19 
                  Unknown       Well differentiated 
                      733                       549

png

After Data Imputation

df$differentiate[df$differentiate == "Unknown"] <- "Moderately differentiated"
differentiate <- table(df$differentiate)
barplot(differentiate, main="Differentiate Distribution",
        xlab="Differentiate", ylab="Frequencies")
table(df$differentiate)

Moderately differentiated     Poorly differentiated          Undifferentiated 
                     2365                      1119                        19 
      Well differentiated 
                      549

png

Finding total outliers in Numeric Variables

library(dplyr)
library(tidyr)

# Create a function to identify outliers
find_outliers <- function(x) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  y <- x[x < qnt[1] - H | x > qnt[2] + H]
  return(length(y))
}

# Use the function to find outliers in all columns of the dataframe
total_outliers <- df %>% 
  select_if(is.numeric) %>% 
  summarise_all(find_outliers) %>% 
  gather() %>%
  group_by(key) %>%
  summarise(total_outliers = sum(value)) 

colnames(total_outliers) <- c("Variable", "Outliers")
total_outliers
cat("the total outliers are :",sum(total_outliers$Outliers))

A tibble: 5 × 2
Variable	Outliers
<chr>	<int>
Age	28
Reginol.Node.Positive	346
Regional.Node.Examined	73
Survival.Months	19
Tumor.Size	222

the total outliers are : 688

Since the total outliers is quite high, which more than 15% of the total observations. Removing 15% or more of the row values may significantly affect the final predictions. Hence the outliers are not removed for machine learning modeling.

Univariate Exploration

BreastCancer <- read.csv("/content/sample_data/Breast_Cancer.csv")

The mean, max, median of the dataset’s attributes

describe(BreastCancer)

A psych: 16 × 13
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
	<int>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
Age	1	4024	53.972167	8.9631344	54	54.245031	10.3782	30	69	39	-0.2202085	-0.75675566	0.141296343
Race*	2	4024	2.775845	0.5644652	3	2.935093	0.0000	1	3	2	-2.4052242	4.38936802	0.008898323
Marital.Status*	3	4024	2.371272	1.0636375	2	2.266149	0.0000	1	5	4	1.1459202	0.36387786	0.016767358
T.Stage*	4	4024	1.784791	0.7655309	2	1.699379	1.4826	1	4	3	0.7246804	0.06106612	0.012067957
N.Stage*	5	4024	1.438370	0.6934794	1	1.298137	0.0000	1	3	2	1.2761763	0.21228692	0.010932125
X6th.Stage*	6	4024	2.321819	1.2666239	2	2.152484	1.4826	1	5	4	0.8144155	-0.19692684	0.019967270
differentiate*	7	4024	1.690358	1.0164183	1	1.488199	0.0000	1	4	3	1.4181264	0.73747790	0.016022987
Grade*	8	4024	3.131710	0.6423983	3	3.170497	0.0000	1	4	3	-0.2335884	-0.20517763	0.010126874
A.Stage*	9	4024	1.977137	0.1494847	2	2.000000	0.0000	1	2	1	-6.3821741	38.74177405	0.002356501
Tumor.Size	10	4024	30.473658	21.1196961	25	27.110248	14.8260	1	140	139	1.7384530	3.62490859	0.332934406
Estrogen.Status*	11	4024	1.933151	0.2497912	2	2.000000	0.0000	1	2	1	-3.4672437	10.02426991	0.003937750
Progesterone.Status*	12	4024	1.826541	0.3786909	2	1.908075	0.0000	1	2	1	-1.7241483	0.97292932	0.005969746
Regional.Node.Examined	13	4024	14.357107	8.0996748	14	13.854348	7.4130	1	61	60	0.8286556	1.64347439	0.127684622
Reginol.Node.Positive	14	4024	4.158052	5.1093311	2	2.961180	1.4826	1	46	45	2.7005214	8.96267158	0.080544347
Survival.Months	15	4024	71.297962	22.9214295	73	72.970497	25.2042	1	107	106	-0.5895585	0.01696196	0.361337232
Status*	16	4024	1.153082	0.3601108	1	1.066460	0.0000	1	2	1	1.9262531	1.71087644	0.005676847

Numeric features graph distribution

x1 <- BreastCancer$Tumor.Size
TumorSize1 <- hist(x1, col = "Orange", xlab = "Tumor Size", ylab = "Frequencies", 
                   main = "Tumor Size Histogram")
x2 <- BreastCancer$Age
Age1 <- hist(x2, col = "Orange", xlab = "Age ", ylab = "Frequencies", 
                   main = " Age Distribution Histogram ")
x3 <- BreastCancer$Regional.Node.Examined
Regional.N.E1 <- hist(x3, col = "Orange", xlab = "Regional Node Examined ", ylab = "Frequencies", 
                   main = " Regional Node Examined Distribution Histogram ")
x4 <- BreastCancer$Reginol.Node.Positive
Regional.N.E1 <- hist(x4, col = "Orange", xlab = "Regional Node Positive Examined ", ylab = "Frequencies", 
                   main = " Regional Node Positive Distribution Histogram ")
x5 <- BreastCancer$Survival.Months
SurvivalMonth1 <- hist(x5, col = "Orange", xlab = "Survival Month ", ylab = "Frequencies", 
                   main = " Survival Month Distribution Histogram ")

png

Categorical variable bar chart distribution

race <- table(BreastCancer$Race)
barplot(race, main="Race Distribution",
        xlab="Race", ylab ="Frequencies")

MaritalStat <- table(BreastCancer$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
        xlab="Marital", ylab="Frequencies")

T.stage <- table(BreastCancer$T.Stage)
barplot(T.stage, main="T.Stage Distribution",
        xlab="T.Stage", ylab="Frequencies")

N.stage <- table(BreastCancer$N.Stage)
barplot(N.stage, main="N.Stage Distribution",
        xlab="N.Stage", ylab="Frequencies")

X6th.stage <- table(BreastCancer$X6th.Stage)
barplot(X6th.stage, main="X6th.Stage Distribution",
        xlab="N.Stage", ylab="Frequencies")

differentiate <- table(BreastCancer$differentiate)
barplot(differentiate, main="Differentiate Distribution",
        xlab="Differentiate", ylab="Frequencies")

Grade <- table(BreastCancer$Grade)
barplot(Grade, main="Grade Distribution",
        xlab="Grade", ylab="Frequencies")


A.Stage <- table(BreastCancer$A.Stage)
barplot(A.Stage, main="A.Stage Distribution",
        xlab="A.Stage", ylab="Frequencies")

Estrogen.Stat <- table(BreastCancer$Estrogen.Status)
barplot(Estrogen.Stat, main="Estrogen Status Distribution",
        xlab="Estrogen Status", ylab="Frequencies")

Progestrone.Stat <- table(BreastCancer$Progesterone.Status)
barplot(Progestrone.Stat, main="Progestrone Status Distribution",
        xlab="Progestrone Status", ylab="Frequencies")

png

Numerical Variable Box Plot in the graph


BreastCancer %>% pivot_longer(cols=c('Survival.Months', 'Reginol.Node.Positive', 'Regional.Node.Examined', 'Tumor.Size', 'Age'), names_to = 'Column', values_to = 'Value') %>% 
  ggplot(aes(x=Column, y=Value, fill=Column)) +
  geom_boxplot() +
  facet_wrap(~Column, scales = 'free')

png

Bivariate Exploration

Chi Square Test on the categorical variables with target variable

Arrange the categorical features according to the levels

BreastCancerv1 <- BreastCancer

Race1 = factor(BreastCancerv1$Race, levels = c('White', 'Black', 'Other'))
T.Stage1 = factor(BreastCancerv1$T.Stage, levels =c('T1', 'T2', 'T3', 'T4'))
N.Stage1 = factor(BreastCancerv1$N.Stage, levels = c('N1', 'N2', 'N3'))
X6th.Stage1 = factor(BreastCancerv1$X6th.Stage, levels = c('IIA', 'IIB', 'IIIA', 'IIIB', 'IIIC'))
differentiate1 = factor(BreastCancerv1$differentiate, levels = c('Well differentiated', 'Moderately differentiated', 'Poorly differentiated', 'Undifferentiated'))
Grade1 = factor(BreastCancerv1$Grade, levels = c('1', '2', '3', 'anaplastic; Grade IV'))
A.Stage1 = factor(BreastCancerv1$A.Stage, levels = c('Regional', 'Distant'))
Estrogen.Stat1 = factor(BreastCancerv1$Estrogen.Status, levels = c('Positive', 'Negative'))
Status1 = factor(BreastCancerv1$Status, levels = c('High Survival', 'Non - Survival'))

Cross Table

cat_vars <- sapply(X = BreastCancerv1, function(x) is.character(x))
char_vars <- BreastCancerv1[cat_vars]
tables_list<- list()

# Loop through all columns of the data frame
for (col in names(char_vars)) {
    if(col != "Status"){
        # Create a cross tabulation table with respect to the target variable
        tab <- table(char_vars[, col], char_vars$Status)
        # Add the cross tabulation table to the list
        tables_list[[col]] <- tab
    }
}
 
View(tables_list)

$Race
       
        Alive Dead
  Black   218   73
  Other   287   33
  White  2903  510

$Marital.Status
           
            Alive Dead
  Divorced    396   90
  Married    2285  358
  Separated    30   15
  Single      511  104
  Widowed     186   49

$T.Stage
    
     Alive Dead
  T1  1446  157
  T2  1483  303
  T3   417  116
  T4    62   40

$N.Stage
    
     Alive Dead
  N1  2462  270
  N2   655  165
  N3   291  181

$X6th.Stage
      
       Alive Dead
  IIA   1209   96
  IIB    995  135
  IIIA   866  184
  IIIB    47   20
  IIIC   291  181

$differentiate
                           
                            Alive Dead
  Moderately differentiated  2046  305
  Poorly differentiated       848  263
  Undifferentiated             10    9
  Well differentiated         504   39

$Grade
                       
                        Alive Dead
   anaplastic; Grade IV    10    9
  1                       504   39
  2                      2046  305
  3                       848  263

$A.Stage
          
           Alive Dead
  Distant     57   35
  Regional  3351  581

$Estrogen.Status
          
           Alive Dead
  Negative   161  108
  Positive  3247  508

$Progesterone.Status
          
           Alive Dead
  Negative   494  204
  Positive  2914  412

Chi-Test Result

The chi-square (χ2) statistics is a way to check the relationship between two categorical nominal variables. In this data exploration, we explored the relationship between input categorical features in dataset with the target variable which is “Status”. Our null hypothesis stated that the input categorical features is indepent from the target variable (Status) and our hypothesis that the input categorical variable is dependent on the target variable (Status). From the result below the p value for all relation between input categorical variable with target variable is less than <0.005. Therefore we reject the null hypothesis

chi_results_list <- lapply(tables_list, chisq.test)
View(chi_results_list)

Warning message in FUN(X[[i]], ...):
“Chi-squared approximation may be incorrect”
Warning message in FUN(X[[i]], ...):
“Chi-squared approximation may be incorrect”



$Race

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 27.97, df = 2, p-value = 8.441e-07


$Marital.Status

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 28.264, df = 4, p-value = 1.103e-05


$T.Stage

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 103.48, df = 3, p-value < 2.2e-16


$N.Stage

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 269.93, df = 2, p-value < 2.2e-16


$X6th.Stage

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 281.65, df = 4, p-value < 2.2e-16


$differentiate

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 112.56, df = 3, p-value < 2.2e-16


$Grade

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 112.56, df = 3, p-value < 2.2e-16


$A.Stage

    Pearson's Chi-squared test with Yates' continuity correction

data:  X[[i]]
X-squared = 35.765, df = 1, p-value = 2.226e-09


$Estrogen.Status

    Pearson's Chi-squared test with Yates' continuity correction

data:  X[[i]]
X-squared = 135.16, df = 1, p-value < 2.2e-16


$Progesterone.Status

    Pearson's Chi-squared test with Yates' continuity correction

data:  X[[i]]
X-squared = 124.89, df = 1, p-value < 2.2e-16

T-Test

T test is used to draw a comparison or find the difference between one categorical (with two categories) and another continuous variable. The continous variable must be normally distributed.Therefore to normalize each continous variable we employed Z-score method. A z-score, also known as a standard score, is a measure of how many standard deviations a data point is from the mean of the data set.

Filtering some of the columns

# Create logical vectors indicating which columns are integers or character type
is_int <- sapply(BreastCancerv1, is.integer)
is_char <- sapply(BreastCancerv1, is.character)

# Remove all columns with character values, except the Status one
BreastCancerv2 <- BreastCancerv1[!(is_char & seq_along(BreastCancerv1) != 16) | is_int]

BreastCancerv2

A data.frame: 4024 × 6
Age	Tumor.Size	Regional.Node.Examined	Reginol.Node.Positive	Survival.Months	Status
<int>	<int>	<int>	<int>	<int>	<chr>
68	4	24	1	60	Alive
50	35	14	5	62	Alive
58	63	14	7	75	Alive
58	18	2	1	84	Alive
47	41	3	1	50	Alive
51	20	18	2	89	Alive
51	8	11	1	54	Alive
40	30	9	1	14	Dead
40	103	20	18	70	Alive
69	32	21	12	92	Alive
68	13	9	1	64	Dead
46	59	11	3	92	Alive
65	35	13	3	56	Alive
48	15	23	7	38	Alive
62	35	16	14	64	Alive
61	19	20	1	49	Alive
56	46	1	1	105	Alive
43	24	22	1	62	Alive
48	25	16	1	107	Alive
60	29	20	1	77	Alive
48	30	15	2	81	Alive
57	40	15	5	50	Alive
55	29	4	1	78	Alive
48	70	18	1	102	Alive
62	20	26	22	98	Alive
63	22	31	17	70	Alive
48	50	25	23	102	Alive
46	17	14	1	82	Alive
57	25	14	4	64	Alive
66	21	10	1	86	Alive
⋮	⋮	⋮	⋮	⋮	⋮
51	2	19	2	50	Alive
62	25	17	14	69	Alive
69	19	21	9	88	Dead
35	19	15	2	56	Alive
68	18	11	1	100	Alive
53	30	12	5	85	Alive
31	45	13	2	63	Alive
55	45	16	1	24	Alive
57	15	5	1	61	Alive
66	19	21	7	64	Dead
58	24	19	9	65	Dead
68	40	14	2	76	Alive
41	35	18	2	66	Alive
44	21	10	1	75	Alive
60	18	19	2	52	Alive
53	100	20	1	107	Alive
59	23	6	1	27	Dead
62	25	24	4	100	Alive
51	80	20	12	79	Alive
68	15	10	1	102	Alive
40	68	22	2	86	Alive
65	40	4	4	68	Alive
54	50	29	2	52	Alive
46	28	2	1	19	Dead
64	10	11	1	70	Alive
62	9	1	1	49	Alive
56	46	14	8	69	Alive
68	22	11	3	69	Alive
58	44	11	1	72	Alive
46	30	7	2	100	Alive

Install the required packages

install.packages("magrittr") # package installations are only needed the first time you use it
install.packages("dplyr")    # alternative installation of the %>%
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)    # alternatively, this also loads %>%

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


Attaching package: ‘magrittr’


The following object is masked from ‘package:purrr’:

    set_names


The following object is masked from ‘package:tidyr’:

    extract

Normalize the numeric data using Z-Score

# Select all columns that are of type 'numeric'
numerical_columns <- BreastCancerv2 %>% select_if(is.numeric)

# Select all columns that are of type 'character'
character_columns <- BreastCancerv2 %>% select_if(is.character)

# Normalize the numerical columns using Z-score
numerical_columns <- scale(numerical_columns,center = TRUE , scale = TRUE)

# Bind the normalized numerical columns with the character columns
BreastCancerv2 <- bind_cols(character_columns, numerical_columns)
View(BreastCancerv2)

A data.frame: 4024 × 6
Status	Age	Tumor.Size	Regional.Node.Examined	Reginol.Node.Positive	Survival.Months
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
Alive	1.5650589	-1.25350563	1.19052837	-0.61809494	-0.49289955
Alive	-0.4431672	0.21431852	-0.04408910	0.16478641	-0.40564495
Alive	0.4493777	1.54009517	-0.04408910	0.55622708	0.16150990
Alive	0.4493777	-0.59061731	-1.52563006	-0.61809494	0.55415557
Alive	-0.7778715	0.49841351	-1.40216831	-0.61809494	-0.92917251
Alive	-0.3315991	-0.49591898	0.44975789	-0.42237460	0.77229205
Alive	-0.3315991	-1.06410897	-0.41447434	-0.61809494	-0.75466332
Dead	-1.5588483	-0.02242731	-0.66139783	-0.61809494	-2.49975518
Alive	-1.5588483	3.43406182	0.69668138	2.70915076	-0.05662658
Alive	1.6766270	0.07227102	0.82014313	1.53482875	0.90317394
Dead	1.5650589	-0.82736314	-0.66139783	-0.61809494	-0.31839036
Alive	-0.8894396	1.35069851	-0.41447434	-0.22665426	0.90317394
Alive	1.2303545	0.21431852	-0.16755084	-0.22665426	-0.66740873
Alive	-0.6663034	-0.73266481	1.06706662	0.55622708	-1.45270007
Alive	0.8956502	0.21431852	0.20283440	1.92626942	-0.31839036
Alive	0.7840821	-0.54326814	0.69668138	-0.61809494	-0.97279981
Alive	0.2262415	0.73515935	-1.64909181	-0.61809494	1.47032879
Alive	-1.2241440	-0.30652231	0.94360488	-0.61809494	-0.40564495
Alive	-0.6663034	-0.25917314	0.20283440	-0.61809494	1.55758338
Alive	0.6725140	-0.06977648	0.69668138	-0.61809494	0.24876449
Alive	-0.6663034	-0.02242731	0.07937265	-0.42237460	0.42327368
Alive	0.3378096	0.45106435	0.07937265	0.16478641	-0.92917251
Alive	0.1146734	-0.06977648	-1.27870657	-0.61809494	0.29239179
Alive	-0.6663034	1.87153933	0.44975789	-0.61809494	1.33944690
Alive	0.8956502	-0.49591898	1.43745186	3.49203210	1.16493772
Alive	1.0072183	-0.40122064	2.05476060	2.51343043	-0.05662658
Alive	-0.6663034	0.92455601	1.31399012	3.68775244	1.33944690
Alive	-0.8894396	-0.63796647	-0.04408910	-0.61809494	0.46690097
Alive	0.3378096	-0.25917314	-0.04408910	-0.03093393	-0.31839036
Alive	1.3419226	-0.44856981	-0.53793609	-0.61809494	0.64141016
⋮	⋮	⋮	⋮	⋮	⋮
Alive	-0.331599066	-1.34820397	0.57321964	-0.42237460	-0.92917251
Alive	0.895650186	-0.25917314	0.32629614	1.92626942	-0.10025388
Dead	1.676626983	-0.54326814	0.82014313	0.94766775	0.72866475
Alive	-2.116688887	-0.54326814	0.07937265	-0.42237460	-0.66740873
Alive	1.565058869	-0.59061731	-0.41447434	-0.61809494	1.25219231
Alive	-0.108462838	-0.02242731	-0.29101259	0.16478641	0.59778286
Alive	-2.562961342	0.68781018	-0.16755084	-0.42237460	-0.36201766
Alive	0.114673389	0.68781018	0.20283440	-0.61809494	-2.06348222
Alive	0.337809617	-0.73266481	-1.15524482	-0.61809494	-0.44927225
Dead	1.341922641	-0.54326814	0.82014313	0.55622708	-0.31839036
Dead	0.449377731	-0.30652231	0.57321964	0.94766775	-0.27476306
Alive	1.565058869	0.45106435	-0.04408910	-0.42237460	0.20513720
Alive	-1.447280204	0.21431852	0.44975789	-0.42237460	-0.23113577
Alive	-1.112575863	-0.44856981	-0.53793609	-0.61809494	0.16150990
Alive	0.672513958	-0.59061731	0.57321964	-0.42237460	-0.84191792
Alive	-0.108462838	3.29201432	0.69668138	-0.61809494	1.55758338
Dead	0.560945845	-0.35387148	-1.03178307	-0.61809494	-1.93260033
Alive	0.895650186	-0.25917314	1.19052837	-0.03093393	1.25219231
Alive	-0.331599066	2.34503100	0.69668138	1.53482875	0.33601908
Alive	1.565058869	-0.73266481	-0.53793609	-0.61809494	1.33944690
Alive	-1.558848318	1.77684100	0.94360488	-0.42237460	0.64141016
Alive	1.230354528	0.45106435	-1.27870657	-0.03093393	-0.14388118
Alive	0.003105276	0.92455601	1.80783710	-0.42237460	-0.84191792
Dead	-0.889439635	-0.11712565	-1.52563006	-0.61809494	-2.28161870
Alive	1.118786414	-0.96941064	-0.41447434	-0.61809494	-0.05662658
Alive	0.895650186	-1.01675980	-1.64909181	-0.61809494	-0.97279981
Alive	0.226241503	0.73515935	-0.04408910	0.75194741	-0.10025388
Alive	1.565058869	-0.40122064	-0.41447434	-0.22665426	-0.10025388
Alive	0.449377731	0.64046101	-0.41447434	-0.61809494	0.03062801
Alive	-0.889439635	-0.02242731	-0.90832133	-0.42237460	1.25219231

T-Test Result

From the T-test result we have seen that the p-value for each continous variable in respect with the target variable is less than < 0.005. Therefore we accept the hypothesis where the input continous variable is dependent with the target variable (status) and reject the null hypothesis


# Create an empty list to store the test results
Tresults_list <- list()
# Loop through all columns of the data frame
for (col in names(BreastCancerv2)) {
    if(col != "Status"){
        # Conduct T-test on current column with respect to target variable
        Ttest_result <- t.test(BreastCancerv2[, col] ~ BreastCancerv2$Status)
        # Extract the p-value and test statistic from the test results
        p_value <- Ttest_result$p.value
        t_statistic <- Ttest_result$statistic
        # Add the p-value and test statistic to the results list
        Tresults_list[[col]] <- list(p_value = p_value,t_statistic = t_statistic)
    }
}
View(Tresults_list)

\(Age</dt> <dd><dl> <dt>\)p_value

<dt>$Tumor.Size</dt>
    <dd><dl>
<dt>$p_value</dt>
    <dd>7.01956816737513e-14</dd>
<dt>$t_statistic</dt>
    <dd><strong>t:</strong> -7.62653452543241</dd>

<dt>$Regional.Node.Examined</dt>
    <dd><dl>
<dt>$p_value</dt>
    <dd>0.0341129729107396</dd>
<dt>$t_statistic</dt>
    <dd><strong>t:</strong> -2.12225993465628</dd>

<dt>$Reginol.Node.Positive</dt>
    <dd><dl>
<dt>$p_value</dt>
    <dd>3.44036275174707e-30</dd>
<dt>$t_statistic</dt>
    <dd><strong>t:</strong> -11.9770766977036</dd>

<dt>$Survival.Months</dt>
    <dd><dl>
<dt>$p_value</dt>
    <dd>1.19257066205458e-129</dd>
<dt>$t_statistic</dt>
    <dd><strong>t:</strong> 29.702852507452</dd>

Anova

ANOVA is helpful for testing three or more levels in the category variable. It is similar to multiple two-sample t-tests. However, it results in fewer type I errors and is appropriate for a range of issues. ANOVA groups differences by comparing the means of each group and includes spreading out the variance into diverse sources. For this data exploration we assumed the category variable is Marital Status, Race,T.Stage,N.stage,X6th Stage, differentiate, Grade because it has 3 or more levels.

Assume Target Variable: Age


BreastCancerv3 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 1) | is_char]
BreastCancerv3 <- BreastCancerv3[-c(9:12)]
aov.models = BreastCancerv3[ , -grep("Age", names(BreastCancerv3))] %>%
  map(~ aov(BreastCancerv3$Age ~ .x))
aov.models

$Race
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    3064.1  320134.8
Deg. of Freedom        2      4021

Residual standard error: 8.922764
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                       .x Residuals
Sum of Squares   17556.43 305642.45
Deg. of Freedom         4      4019

Residual standard error: 8.72063
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    1925.9  321273.0
Deg. of Freedom        3      4020

Residual standard error: 8.939724
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares      55.5  323143.4
Deg. of Freedom        2      4021

Residual standard error: 8.964594
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    1013.7  322185.2
Deg. of Freedom        4      4019

Residual standard error: 8.953519
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    3326.7  319872.2
Deg. of Freedom        3      4020

Residual standard error: 8.920213
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    3326.7  319872.2
Deg. of Freedom        3      4020

Residual standard error: 8.920213
Estimated effects may be unbalanced

Assume Target Variable: Tumor Size

BreastCancerv4 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 10) | is_char]
BreastCancerv4 <- BreastCancerv4[-c(8,10:12)]
aov.models1 = BreastCancerv4[ , -grep("Tumor.Size", names(BreastCancerv4))] %>%
  map(~ aov(BreastCancerv4$Tumor.Size ~ .x))
aov.models1

$Race
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                     .x Residuals
Sum of Squares       98   1794327
Deg. of Freedom       2      4021

Residual standard error: 21.12437
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares     1561.3 1792863.9
Deg. of Freedom         4      4019

Residual standard error: 21.12101
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares  1354557.5  439867.7
Deg. of Freedom         3      4020

Residual standard error: 10.46039
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                     .x Residuals
Sum of Squares   140869   1653556
Deg. of Freedom       2      4021

Residual standard error: 20.27881
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                     .x Residuals
Sum of Squares   634624   1159801
Deg. of Freedom       4      4019

Residual standard error: 16.98763
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares    26497.1 1767928.2
Deg. of Freedom         3      4020

Residual standard error: 20.97101
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares    26497.1 1767928.2
Deg. of Freedom         3      4020

Residual standard error: 20.97101
Estimated effects may be unbalanced

Assume Target Variable: Regional Node Examine

BreastCancerv5 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 13) | is_char]
BreastCancerv5 <- BreastCancerv5[-c(8:10,12)]
aov.models2 = BreastCancerv5[ , -grep("Regional.Node.Examined", names(BreastCancerv5))] %>%
  map(~ aov(BreastCancerv5$Regional.Node.Examined ~ .x))
aov.models2

$Race
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares      43.97 263883.87
Deg. of Freedom         2      4021

Residual standard error: 8.101014
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares     108.67 263819.16
Deg. of Freedom         4      4019

Residual standard error: 8.102036
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares    3742.71 260185.13
Deg. of Freedom         3      4020

Residual standard error: 8.04504
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares   29058.49 234869.35
Deg. of Freedom         2      4021

Residual standard error: 7.642688
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares   28918.44 235009.39
Deg. of Freedom         4      4019

Residual standard error: 7.646868
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares    2169.77 261758.07
Deg. of Freedom         3      4020

Residual standard error: 8.069321
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares    2169.77 261758.07
Deg. of Freedom         3      4020

Residual standard error: 8.069321
Estimated effects may be unbalanced

Assume Target Variable: Survival Months

BreastCancerv6 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 15) | is_char]
BreastCancerv6 <- BreastCancerv6[-c(8:10,12)]
aov.models3 = BreastCancerv6[ , -grep("Survival.Months", names(BreastCancerv6))] %>%
  map(~ aov(BreastCancerv6$Survival.Months ~ .x))
aov.models3

$Race
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares     7739.1 2105912.6
Deg. of Freedom         2      4021

Residual standard error: 22.88512
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares     6870.1 2106781.7
Deg. of Freedom         4      4019

Residual standard error: 22.89553
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    16252.2 2097399.5
Deg. of Freedom         3      4020

Residual standard error: 22.84165
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    42434.1 2071217.6
Deg. of Freedom         2      4021

Residual standard error: 22.69582
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    45935.1 2067716.6
Deg. of Freedom         4      4019

Residual standard error: 22.68227
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    11397.2 2102254.6
Deg. of Freedom         3      4020

Residual standard error: 22.86808
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    11397.2 2102254.6
Deg. of Freedom         3      4020

Residual standard error: 22.86808
Estimated effects may be unbalanced

Assume Target Variable: Reginol Node Positive

BreastCancerv7 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 14) | is_char]
BreastCancerv7 <- BreastCancerv7[-c(8:10,12)]
aov.models4 = BreastCancerv7[ , -grep("Reginol.Node.Positive", names(BreastCancerv7))] %>%
  map(~ aov(BreastCancerv7$Reginol.Node.Positive ~ .x))
aov.models4

$Race
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares      23.62 104997.85
Deg. of Freedom         2      4021

Residual standard error: 5.110027
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares     491.13 104530.35
Deg. of Freedom         4      4019

Residual standard error: 5.099906
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                      .x Residuals
Sum of Squares   6213.48  98808.00
Deg. of Freedom        3      4020

Residual standard error: 4.957732
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                      .x Residuals
Sum of Squares  77973.10  27048.37
Deg. of Freedom        2      4021

Residual standard error: 2.593603
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                      .x Residuals
Sum of Squares  75327.78  29693.70
Deg. of Freedom        4      4019

Residual standard error: 2.718148
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares    1955.19 103066.29
Deg. of Freedom         3      4020

Residual standard error: 5.063436
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares    1955.19 103066.29
Deg. of Freedom         3      4020

Residual standard error: 5.063436
Estimated effects may be unbalanced

Bar Chart Comparing Categorical Variables with Target Variable

library(dplyr)
library(tidyverse)
library(ggplot2)
install.packages("purrr")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

library(purrr)

input_vars <- char_vars[(-c(11))]
target_var <- char_vars[(-c(1:10))]

The below bar chart shows the distribution for each of the category variable with each of its level with the target variable (status).

char_vars %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%  
  ggplot(aes(x=Values, fill=Status)) +
  geom_bar() +
  facet_wrap(~Columns, ncol = 2, scale="free")+
  coord_flip()

png

To get more valuable insight we porpotion the graph according to its ratio since the graph is highly disporportion in terms of its levels distribution.

In A.Stage features, we can see that if the cancer is spreading Regionally has the highest chance of survival as compared to if the cancer is spreading distanly.
Furthermore we can see that when the estrogen level is negative there is a higher chance of the patients will not survive from breastcancer.
Moreover when the tumor is undifferentiated( Grade IV) the chance of survival becomes lessen and higher chance of mortality.
Although marital status does not play a significant impact on the survivability, it can be seen that Married and Single patients are most likely to have high survivability.
It can be seen also in N stage which the cancer spreading to the lymph nodes progresses as it move from N1 to N3 and making it has higher mortality rate.
Blacks has higher mortality which follows by White and Other.
As the tumor size, T stage increases from T1 to T4 the survivability chance decreases.

char_vars %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%  
  ggplot(aes(x=Values, fill=Status)) +
  geom_bar(position = "fill", 
           alpha = 1) +
  facet_wrap(~Columns, ncol = 2, scale="free")+
  coord_flip()

png

Box Plot Comparing Numeric Variables with Target Variable

BreastCancerv2 %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%  
  ggplot(aes(x=Values, fill=Status)) +
  geom_boxplot() +
  facet_wrap(~Columns, ncol = 2, scale="free")+
  coord_flip()

png

Multivariate Exploration & Dimensionality Reduction

install.packages(c('MASS', 'pscl', "corrgram", "psych", "gmodels", "dplyr"))

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

library(tidyverse) 
library(psych)
library(gmodels)  
library(corrgram)    
library(pscl) 
library(MASS)   
library(dplyr)  
library(scales)

Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis


Attaching package: ‘MASS’


The following object is masked from ‘package:dplyr’:

    select



Attaching package: ‘scales’


The following objects are masked from ‘package:psych’:

    alpha, rescale


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor

data=read_csv("/content/sample_data/Breast_Cancer.csv") %>%
   mutate(Race=as.factor(Race),
 `Marital Status`=as.factor(`Marital Status`),
 `T Stage`=as.factor(`T Stage`),
 `N Stage`=as.factor(`N Stage`),
 `6th Stage`=as.factor(`6th Stage`),
 differentiate=as.factor(differentiate),
 Grade=as.factor(Grade),
 `A Stage`=as.factor(`A Stage`),
 `Estrogen Status`=as.factor(`Estrogen Status`),
 `Progesterone Status`=as.factor(`Progesterone Status`),
 Status=as.factor(Status))
cat_cols = c('Race', 'Marital Status', 'T Stage', 'N Stage', '6th Stage', 'differentiate','Grade', 'A Stage', 'Estrogen Status', 'Progesterone Status', 'Status')
num_cols = c('Age', 'Tumor Size', 'Regional Node Examined', 'Reginol Node Positive', 'Survival Months')
cat_data = data[,cat_cols]
num_data = data[,num_cols]
head(data)
head(cat_data)
head(num_data)

[1mRows: [22m[34m4024[39m [1mColumns: [22m[34m16[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (11): Race, Marital Status, T Stage, N Stage, 6th Stage, differentiate, ...
[32mdbl[39m  (5): Age, Tumor Size, Regional Node Examined, Reginol Node Positive, Su...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.

A tibble: 6 × 16
Age	Race	Marital Status	T Stage	N Stage	6th Stage	differentiate	Grade	A Stage	Tumor Size	Estrogen Status	Progesterone Status	Regional Node Examined	Reginol Node Positive	Survival Months	Status
<dbl>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<dbl>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<fct>
68	White	Married	T1	N1	IIA	Poorly differentiated	3	Regional	4	Positive	Positive	24	1	60	Alive
50	White	Married	T2	N2	IIIA	Moderately differentiated	2	Regional	35	Positive	Positive	14	5	62	Alive
58	White	Divorced	T3	N3	IIIC	Moderately differentiated	2	Regional	63	Positive	Positive	14	7	75	Alive
58	White	Married	T1	N1	IIA	Poorly differentiated	3	Regional	18	Positive	Positive	2	1	84	Alive
47	White	Married	T2	N1	IIB	Poorly differentiated	3	Regional	41	Positive	Positive	3	1	50	Alive
51	White	Single	T1	N1	IIA	Moderately differentiated	2	Regional	20	Positive	Positive	18	2	89	Alive

A tibble: 6 × 11
Race	Marital Status	T Stage	N Stage	6th Stage	differentiate	Grade	A Stage	Estrogen Status	Progesterone Status	Status
<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>
White	Married	T1	N1	IIA	Poorly differentiated	3	Regional	Positive	Positive	Alive
White	Married	T2	N2	IIIA	Moderately differentiated	2	Regional	Positive	Positive	Alive
White	Divorced	T3	N3	IIIC	Moderately differentiated	2	Regional	Positive	Positive	Alive
White	Married	T1	N1	IIA	Poorly differentiated	3	Regional	Positive	Positive	Alive
White	Married	T2	N1	IIB	Poorly differentiated	3	Regional	Positive	Positive	Alive
White	Single	T1	N1	IIA	Moderately differentiated	2	Regional	Positive	Positive	Alive

A tibble: 6 × 5
Age	Tumor Size	Regional Node Examined	Reginol Node Positive	Survival Months
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
68	4	24	1	60
50	35	14	5	62
58	63	14	7	75
58	18	2	1	84
47	41	3	1	50
51	20	18	2	89

Using reduced dataset, we can note that Age, Survival Months, mca2_1, mca2_3, mca3_1, pca1, and pca3 are significant features for the model with 99.9% interval.

Still, only 35% of variances are explained by this model.

As our objective is in binomial classification, we analyse our dataset using logistic regression to see how much the data can be explained and which feature are significant to logistic regression model.

model <- glm(Status ~., data = data, family = "binomial")
summary(model)
pR2(model)['McFadden']

Call:
glm(formula = Status ~ ., family = "binomial", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2543  -0.4670  -0.2640  -0.1355   3.3031  

Coefficients: (4 not defined because of singularities)
                                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         1.192035   0.592075   2.013 0.044081 *  
Age                                 0.028301   0.006516   4.343 1.40e-05 ***
RaceOther                          -0.876361   0.290957  -3.012 0.002595 ** 
RaceWhite                          -0.449858   0.190396  -2.363 0.018140 *  
`Marital Status`Married            -0.162967   0.166478  -0.979 0.327624    
`Marital Status`Separated           0.503724   0.485998   1.036 0.299981    
`Marital Status`Single             -0.094715   0.206431  -0.459 0.646361    
`Marital Status`Widowed             0.047243   0.261059   0.181 0.856393    
`T Stage`T2                         0.239508   0.229609   1.043 0.296897    
`T Stage`T3                         0.742368   0.372473   1.993 0.046253 *  
`T Stage`T4                         1.359086   0.570897   2.381 0.017284 *  
`N Stage`N2                         0.702434   0.278582   2.521 0.011687 *  
`N Stage`N3                         0.630783   0.351167   1.796 0.072455 .  
`6th Stage`IIB                      0.266879   0.267898   0.996 0.319156    
`6th Stage`IIIA                    -0.199731   0.341157  -0.585 0.558244    
`6th Stage`IIIB                     0.015832   0.662766   0.024 0.980942    
`6th Stage`IIIC                           NA         NA      NA       NA    
differentiatePoorly differentiated  0.433001   0.122914   3.523 0.000427 ***
differentiateUndifferentiated       1.705851   0.793648   2.149 0.031604 *  
differentiateWell differentiated   -0.597017   0.207476  -2.878 0.004008 ** 
Grade2                                    NA         NA      NA       NA    
Grade3                                    NA         NA      NA       NA    
Gradeanaplastic; Grade IV                 NA         NA      NA       NA    
`A Stage`Regional                   0.177552   0.325627   0.545 0.585573    
`Tumor Size`                       -0.003163   0.004729  -0.669 0.503671    
`Estrogen Status`Positive          -0.374343   0.227843  -1.643 0.100385    
`Progesterone Status`Positive      -0.521779   0.152356  -3.425 0.000615 ***
`Regional Node Examined`           -0.031172   0.008045  -3.875 0.000107 ***
`Reginol Node Positive`             0.077710   0.017946   4.330 1.49e-05 ***
`Survival Months`                  -0.061531   0.002763 -22.271  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3444.7  on 4023  degrees of freedom
Residual deviance: 2232.4  on 3998  degrees of freedom
AIC: 2284.4

Number of Fisher Scoring iterations: 6



fitting null model for pseudo-r2

McFadden: 0.351925027032208

Using original dataset, we can note that Grade 2, Grade 3, Grade IV, and IIIC 6th Stage have singularities to the class, thus removed from the regression.

Then, Age, differentiate, progesterone Status, Regional Node Examined, Reginol Node Positive, and Survival Months are significant features for the model with 99.9% interval.

However, only 35% of variances are explained by this model.

Logistic Regression Analysis

Multiple Correspondence Analysis (MCA)

data %>% pairs.panels(scale=T,
             stars=T,
             gap = 0,
             bg = c("red", "blue")[data$Status],
             pch=21)

png

A few factor columns, like T Stage, N Stage, and 6th Stage, have at least moderate correlation that is statistically significant. We use MCA to reduce these columns to numerical vectors.

res_mca <- mca(cat_data, nf=30)
res_mca 
summary(res_mca)
plot(res_mca)

Call:
mca(df = cat_data, nf = 30)

Multiple correspondence analysis of 4024 cases of 11 factors

Correlations 0.517 0.441 0.430 0.426 0.412 0.400 0.398 0.344 0.327 0.305 0.301 0.301 0.297 0.282 0.274 0.269 0.248 0.209 0.128 0.111 0.088 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  cumulative % explained  5.17  9.58 13.87 18.14 22.26 26.26 30.25 33.69 36.96 40.01 43.02 46.03 49.00 51.81 54.55 57.24 59.72 61.81 63.09 64.20 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 



     Length Class  Mode   
rs   120720 -none- numeric
cs     1080 -none- numeric
fs   120720 -none- numeric
d        30 -none- numeric
p         1 -none- numeric
call      3 -none- call

png

MCA on all factor columns only covers at most 65% of variances. This low value means we cannot use MCA to reduce all factor columns at once.

So, we start with columns those are colinear. These columns are T Stage, N Stage, 6th Stage, Race, and Marital Status.

mca_cols <- c('T Stage', 'N Stage', '6th Stage','Race','Marital Status')
res_mca <- mca(cat_data[,names(cat_data) %in% mca_cols], nf=7)
res_mca  
plot(res_mca)  
pred_mca <- predict(res_mca, cat_data[,names(cat_data) %in% mca_cols])
data_mca <- data %>% mutate(mca1_1 = pred_mca[,1], mca1_2 = pred_mca[,2], mca1_3 = pred_mca[,3], 
  mca1_4 = pred_mca[,4], mca1_5 = pred_mca[,5], mca1_6 = pred_mca[,6], mca1_7 = pred_mca[,7])
data_mca <- data_mca[,!names(data_mca) %in% mca_cols]
head(data_mca)

Call:
mca(df = cat_data[, names(cat_data) %in% mca_cols], nf = 7)

Multiple correspondence analysis of 4024 cases of 5 factors

Correlations 0.667 0.625 0.603 0.590 0.486 0.453 0.447  cumulative % explained 16.67 32.31 47.38 62.14 74.29 85.63 96.80

A tibble: 6 × 18
Age	differentiate	Grade	A Stage	Tumor Size	Estrogen Status	Progesterone Status	Regional Node Examined	Reginol Node Positive	Survival Months	Status	mca1_1	mca1_2	mca1_3	mca1_4	mca1_5	mca1_6	mca1_7
<dbl>	<fct>	<fct>	<fct>	<dbl>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
68	Poorly differentiated	3	Regional	4	Positive	Positive	24	1	60	Alive	-0.0023461091	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05
50	Moderately differentiated	2	Regional	35	Positive	Positive	14	5	62	Alive	0.0014321129	0.0030905826	-0.0001677426	0.0002330912	-0.0006702577	6.418918e-05	2.464879e-04
58	Moderately differentiated	2	Regional	63	Positive	Positive	14	7	75	Alive	0.0047429421	-0.0030614242	0.0006657707	0.0025064765	-0.0004448306	-2.617071e-03	1.369498e-03
58	Poorly differentiated	3	Regional	18	Positive	Positive	2	1	84	Alive	-0.0023461091	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05
47	Poorly differentiated	3	Regional	41	Positive	Positive	3	1	50	Alive	-0.0005593943	0.0001839509	0.0019673530	-0.0019829816	-0.0009202945	-1.987249e-04	-1.041004e-04
51	Moderately differentiated	2	Regional	20	Positive	Positive	18	2	89	Alive	-0.0022315778	-0.0009130826	-0.0010431065	0.0007508688	0.0019226744	-4.756507e-04	-2.239220e-04

png

7 outputs from MCA covers 97% of cumulative variances. Thus, we take all 7 output to replace N Stage, T Stage, 6th Stage, Race, and Marital Status.

Then, we proceed to reduce differentiate, Grade, and A Stage columns as they are slightly colinear.

mca_cols <- c('differentiate', 'Grade', 'A Stage')
res_mca <- mca(data_mca[,names(data_mca) %in% mca_cols], nf=3)
res_mca  
plot(res_mca)  
pred_mca <- predict(res_mca, data_mca[,names(data_mca) %in% mca_cols])
data_mca1 <- data_mca %>% mutate(mca2_1 = pred_mca[,1],mca2_2 = pred_mca[,2],mca2_3 = pred_mca[,3])
data_mca1 <- data_mca1[,!names(data_mca1) %in% mca_cols]
head(data_mca1)

Call:
mca(df = data_mca[, names(data_mca) %in% mca_cols], nf = 3)

Multiple correspondence analysis of 4024 cases of 3 factors

Correlations 0.818 0.816 0.816  cumulative % explained  40.88  81.70 122.53

A tibble: 6 × 18
Age	Tumor Size	Estrogen Status	Progesterone Status	Regional Node Examined	Reginol Node Positive	Survival Months	Status	mca1_1	mca1_2	mca1_3	mca1_4	mca1_5	mca1_6	mca1_7	mca2_1	mca2_2	mca2_3
<dbl>	<dbl>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
68	4	Positive	Positive	24	1	60	Alive	-0.0023461091	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05	-0.006802372	0.0003366833	-1.086844e-03
50	35	Positive	Positive	14	5	62	Alive	0.0014321129	0.0030905826	-0.0001677426	0.0002330912	-0.0006702577	6.418918e-05	2.464879e-04	0.002587008	-0.0025582201	-1.565432e-18
58	63	Positive	Positive	14	7	75	Alive	0.0047429421	-0.0030614242	0.0006657707	0.0025064765	-0.0004448306	-2.617071e-03	1.369498e-03	0.002587008	-0.0025582201	-1.565432e-18
58	18	Positive	Positive	2	1	84	Alive	-0.0023461091	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05	-0.006802372	0.0003366833	-1.086844e-03
47	41	Positive	Positive	3	1	50	Alive	-0.0005593943	0.0001839509	0.0019673530	-0.0019829816	-0.0009202945	-1.987249e-04	-1.041004e-04	-0.006802372	0.0003366833	-1.086844e-03
51	20	Positive	Positive	18	2	89	Alive	-0.0022315778	-0.0009130826	-0.0010431065	0.0007508688	0.0019226744	-4.756507e-04	-2.239220e-04	0.002587008	-0.0025582201	-1.565432e-18

png

First 2 output of MCA covered 82% of cumulative variances while third output covers 123% of cumulative variances. Since the first 2 output does not cover at least 95% of variances, we take the all 3 output to replace differentiate, Grade, and A Stage columns.

After that, we proceed with reducing Estrogen Status and Progesterone Status columns.

mca_cols <- c('Estrogen Status','Progesterone Status')
res_mca <- mca(data_mca1[,names(data_mca1) %in% mca_cols], nf=length(mca_cols))
res_mca  
plot(res_mca)  
pred_mca <- predict(res_mca, data_mca1[,names(data_mca1) %in% mca_cols])
data_mca2 <- data_mca1 %>% mutate(mca3_1 = pred_mca[,1], mca3_2 = pred_mca[,2])
data_mca2 <- data_mca2[,!names(data_mca2) %in% mca_cols]
head(data_mca2)

Call:
mca(df = data_mca1[, names(data_mca1) %in% mca_cols], nf = length(mca_cols))

Multiple correspondence analysis of 4024 cases of 2 factors

Correlations 0.870 0.493  cumulative % explained  86.99 136.32

A tibble: 6 × 18
Age	Tumor Size	Regional Node Examined	Reginol Node Positive	Survival Months	Status	mca1_1	mca1_2	mca1_3	mca1_4	mca1_5	mca1_6	mca1_7	mca2_1	mca2_2	mca2_3	mca3_1	mca3_2
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
68	4	24	1	60	Alive	-0.0023461091	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05	-0.006802372	0.0003366833	-1.086844e-03	-0.002860247	-0.000750588
50	35	14	5	62	Alive	0.0014321129	0.0030905826	-0.0001677426	0.0002330912	-0.0006702577	6.418918e-05	2.464879e-04	0.002587008	-0.0025582201	-1.565432e-18	-0.002860247	-0.000750588
58	63	14	7	75	Alive	0.0047429421	-0.0030614242	0.0006657707	0.0025064765	-0.0004448306	-2.617071e-03	1.369498e-03	0.002587008	-0.0025582201	-1.565432e-18	-0.002860247	-0.000750588
58	18	2	1	84	Alive	-0.0023461091	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05	-0.006802372	0.0003366833	-1.086844e-03	-0.002860247	-0.000750588
47	41	3	1	50	Alive	-0.0005593943	0.0001839509	0.0019673530	-0.0019829816	-0.0009202945	-1.987249e-04	-1.041004e-04	-0.006802372	0.0003366833	-1.086844e-03	-0.002860247	-0.000750588
51	20	18	2	89	Alive	-0.0022315778	-0.0009130826	-0.0010431065	0.0007508688	0.0019226744	-4.756507e-04	-2.239220e-04	0.002587008	-0.0025582201	-1.565432e-18	-0.002860247	-0.000750588

png

First output of MCA already covered 87% of cumulative variances while second output covers 136% of cumulative variances. Thus, we take the both output to replace Estrogen Status and Progesterone Status columns because the first output is not enough to cover at least 95% of the variances.

data_mca2 %>% pairs.panels(scale=T,
             stars=T,
             gap = 0,
             bg = c("red", "blue")[data_mca2$Status],
             pch=21)

png

Now, all the factor columns has been reduced to numerical vectors. Still, we can see that some columns are correlated to each other. So, we use PCA to remove colinearity and dimensionality.

Principal Component Analysis (PCA)

pca_cols = c('mca1_1', 'Tumor Size', 'Reginol Node Positive', 'Regional Node Examined')
res_pca <- princomp(data_mca2[,pca_cols], cor = T) 
summary(res_pca)

pred_pca <- predict(res_pca, data_mca2[,pca_cols])  
data_pca <- data_mca2 %>% mutate(pca1=pred_pca[,1], pca2=pred_pca[,2], pca3=pred_pca[,3])
data_pca <- data_pca[, !names(data_pca) %in% pca_cols]
head(data_pca)

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5029084 0.9851157 0.7791421 0.40466125
Proportion of Variance 0.5646834 0.2426133 0.1517656 0.04093768
Cumulative Proportion  0.5646834 0.8072967 0.9590623 1.00000000

A tibble: 6 × 17
Age	Survival Months	Status	mca1_2	mca1_3	mca1_4	mca1_5	mca1_6	mca1_7	mca2_1	mca2_2	mca2_3	mca3_1	mca3_2	pca1	pca2	pca3
<dbl>	<dbl>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
68	60	Alive	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05	-0.006802372	0.0003366833	-1.086844e-03	-0.002860247	-0.000750588	-1.1333213	-1.6942370	-0.67609196
50	62	Alive	0.0030905826	-0.0001677426	0.0002330912	-0.0006702577	6.418918e-05	2.464879e-04	0.002587008	-0.0025582201	-1.565432e-18	-0.002860247	-0.000750588	0.5848042	0.2571267	0.15663818
58	75	Alive	-0.0030614242	0.0006657707	0.0025064765	-0.0004448306	-2.617071e-03	1.369498e-03	0.002587008	-0.0025582201	-1.565432e-18	-0.002860247	-0.000750588	2.3363389	1.3114219	0.03636938
58	84	Alive	-0.0010077059	-0.0010327568	0.0009605870	-0.0006947017	-4.192764e-05	4.440576e-05	-0.006802372	0.0003366833	-1.086844e-03	-0.002860247	-0.000750588	-1.8342221	0.6278887	0.65335871
47	50	Alive	0.0001839509	0.0019673530	-0.0019829816	-0.0009202945	-1.987249e-04	-1.041004e-04	-0.006802372	0.0003366833	-1.086844e-03	-0.002860247	-0.000750588	-0.8010371	1.4039334	0.20289254
51	89	Alive	-0.0009130826	-0.0010431065	0.0007508688	0.0019226744	-4.756507e-04	-2.239220e-04	0.002587008	-0.0025582201	-1.565432e-18	-0.002860247	-0.000750588	-0.9345595	-0.7290785	-0.48789923

First 3 output of PCA already covered 96% of cumulative variances. Thus, we take the first 3 output to replace mca1_1, Tumor size, Reginol Node Positive, and Regiona Node Examined columns.

data_pca %>% pairs.panels(scale=T,
             stars=T,
             gap = 0,
             bg = c("red", "blue")[data_pca$Status],
             pch=21)

png

Now, we see that none of the feature columns are correlated to each other.

Lastly, we will normalized the rest of the columns.

data_scaled <- data_pca %>% mutate(Age = rescale(Age), `Survival Months` = rescale(`Survival Months`))

#Modeling

install.packages(c('caret', 'kernlab', 'naivebayes', 'rstatix', 'randomForest'))

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘future.apply’, ‘progressr’, ‘SQUAREM’, ‘lava’, ‘prodlim’, ‘numDeriv’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘RcppEigen’, ‘proxy’, ‘iterators’, ‘Rcpp’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘timeDate’, ‘carData’, ‘abind’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’, ‘corrplot’, ‘car’

library(caret)    
library(kernlab) 
library(naivebayes)
library(rstatix) 
library(randomForest)

Loading required package: lattice


Attaching package: ‘lattice’


The following object is masked from ‘package:corrgram’:

    panel.fill



Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift



Attaching package: ‘kernlab’


The following object is masked from ‘package:scales’:

    alpha


The following object is masked from ‘package:psych’:

    alpha


The following object is masked from ‘package:purrr’:

    cross


The following object is masked from ‘package:ggplot2’:

    alpha


naivebayes 0.9.7 loaded


Attaching package: ‘rstatix’


The following object is masked from ‘package:MASS’:

    select


The following object is masked from ‘package:stats’:

    filter


randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: ‘randomForest’


The following object is masked from ‘package:psych’:

    outlier


The following object is masked from ‘package:dplyr’:

    combine


The following object is masked from ‘package:ggplot2’:

    margin

For modelling, we use Naive Bayes, Random Forest, and Support Vector Machine (SVM) algorithms against original data, reduced data, and built-in PCA pre-processed data.

10-fold cross validation is used for model evaluation later.

Upsampling method is used to address the imbalance size between the classes.

ctrl <- trainControl(method = "cv", number = 10)  

set.seed(0)
set1 <- upSample(data %>% select(-Status), data$Status, yname='Status') 
set.seed(0)
set2 <- upSample(data_scaled %>% select(-Status), data_scaled$Status, yname='Status')

set.seed(0)
nb_1 <- train(Status ~., data = set1, method = "naive_bayes", trControl = ctrl)
nb_1

set.seed(0)
nb_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "naive_bayes", trControl = ctrl)
nb_1p

set.seed(0)
nb_2 <- train(Status ~., data = set2, method = "naive_bayes", trControl = ctrl)
nb_2

Naive Bayes 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.6506751  0.3013481
   TRUE      0.5963957  0.1927825

Tuning parameter 'laplace' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE
 and adjust = 1.



Naive Bayes 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

Pre-processing: principal component signal extraction (29), centered
 (29), scaled (29) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.6943899  0.3887863
   TRUE      0.7452929  0.4905871

Tuning parameter 'laplace' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = TRUE
 and adjust = 1.



Naive Bayes 

6816 samples
  16 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.7139037  0.4278098
   TRUE      0.7052600  0.4105174

Tuning parameter 'laplace' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE
 and adjust = 1.

set.seed(0)
rf_1 <- train(Status ~., data = set1, method = "rf", trControl = ctrl)
rf_1

set.seed(0)
rf_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "rf", trControl = ctrl)
rf_1p

set.seed(0)
rf_2 <- train(Status ~., data = set2, method = "rf", trControl = ctrl)
rf_2

Random Forest 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8148435  0.6296850
  15    0.9743221  0.9486443
  29    0.9694812  0.9389629

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.


Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”



Random Forest 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

Pre-processing: principal component signal extraction (29), centered
 (29), scaled (29) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9772589  0.9545179
  15    0.9677195  0.9354391
  29    0.9655182  0.9310365

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.



Random Forest 

6816 samples
  16 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9630255  0.9260508
   9    0.9746171  0.9492340
  16    0.9718301  0.9436600

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 9.

set.seed(0)
svm_1 <- train(Status ~., data = set1, method = "svmPoly", trControl = ctrl)
svm_1

set.seed(0)
svm_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "svmPoly", trControl = ctrl)
svm_1p

set.seed(0)
svm_2 <- train(Status ~., data = set2, method = "svmPoly", trControl = ctrl)
svm_2

Support Vector Machines with Polynomial Kernel 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  degree  scale  C     Accuracy   Kappa    
  1       0.001  0.25  0.7602668  0.5205353
  1       0.001  0.50  0.7759653  0.5519309
  1       0.001  1.00  0.7818327  0.5636653
  1       0.010  0.25  0.7856459  0.5712925
  1       0.010  0.50  0.7891686  0.5783356
  1       0.010  1.00  0.7897541  0.5795070
  1       0.100  0.25  0.7904878  0.5809754
  1       0.100  0.50  0.7899011  0.5798018
  1       0.100  1.00  0.7906347  0.5812686
  2       0.001  0.25  0.7761125  0.5522252
  2       0.001  0.50  0.7819789  0.5639577
  2       0.001  1.00  0.7874074  0.5748151
  2       0.010  0.25  0.7928347  0.5856686
  2       0.010  0.50  0.7965035  0.5930063
  2       0.010  1.00  0.7989985  0.5979962
  2       0.100  0.25  0.8069222  0.6138445
  2       0.100  0.50  0.8082406  0.6164795
  2       0.100  1.00  0.8091188  0.6182365
  3       0.001  0.25  0.7822733  0.5645473
  3       0.001  0.50  0.7827151  0.5654295
  3       0.001  1.00  0.7879939  0.5759878
  3       0.010  0.25  0.7991445  0.5982870
  3       0.010  0.50  0.8029587  0.6059163
  3       0.010  1.00  0.8079499  0.6158996
  3       0.100  0.25  0.8524020  0.7048047
  3       0.100  0.50  0.8600305  0.7200620
  3       0.100  1.00  0.8661908  0.7323835

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.



Support Vector Machines with Polynomial Kernel 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

Pre-processing: principal component signal extraction (29), centered
 (29), scaled (29) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  degree  scale  C     Accuracy   Kappa    
  1       0.001  0.25  0.7904911  0.5809824
  1       0.001  0.50  0.7906351  0.5812685
  1       0.001  1.00  0.7912214  0.5824411
  1       0.010  0.25  0.7868185  0.5736349
  1       0.010  0.50  0.7901942  0.5803867
  1       0.010  1.00  0.7882876  0.5765735
  1       0.100  0.25  0.7884338  0.5768660
  1       0.100  0.50  0.7888756  0.5777497
  1       0.100  1.00  0.7885821  0.5771630
  2       0.001  0.25  0.7901952  0.5803887
  2       0.001  0.50  0.7909282  0.5818546
  2       0.001  1.00  0.7888743  0.5777471
  2       0.010  0.25  0.7954792  0.5909577
  2       0.010  0.50  0.7963588  0.5927170
  2       0.010  1.00  0.7959193  0.5918371
  2       0.100  0.25  0.7991471  0.5982949
  2       0.100  0.50  0.8023750  0.6047494
  2       0.100  1.00  0.8000264  0.6000515
  3       0.001  0.25  0.7903408  0.5806800
  3       0.001  0.50  0.7903414  0.5806809
  3       0.001  1.00  0.7888734  0.5777448
  3       0.010  0.25  0.7970917  0.5941831
  3       0.010  0.50  0.7967991  0.5935957
  3       0.010  1.00  0.8017892  0.6035760
  3       0.100  0.25  0.8362632  0.6725273
  3       0.100  0.50  0.8399306  0.6798623
  3       0.100  1.00  0.8443321  0.6886654

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.



Support Vector Machines with Polynomial Kernel 

6816 samples
  16 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  degree  scale  C     Accuracy   Kappa    
  1       0.001  0.25  0.7761134  0.5522271
  1       0.001  0.50  0.7844765  0.5689538
  1       0.001  1.00  0.7868196  0.5736393
  1       0.010  0.25  0.7856468  0.5712933
  1       0.010  0.50  0.7868222  0.5736435
  1       0.010  1.00  0.7888756  0.5777502
  1       0.100  0.25  0.7896089  0.5792171
  1       0.100  0.50  0.7894617  0.5789224
  1       0.100  1.00  0.7894625  0.5789238
  2       0.001  0.25  0.7840367  0.5680740
  2       0.001  0.50  0.7868198  0.5736396
  2       0.001  1.00  0.7850618  0.5701239
  2       0.010  0.25  0.7918103  0.5836191
  2       0.010  0.50  0.7960662  0.5921322
  2       0.010  1.00  0.7969455  0.5938915
  2       0.100  0.25  0.8004671  0.6009350
  2       0.100  0.50  0.8016410  0.6032816
  2       0.100  1.00  0.8022271  0.6044530
  3       0.001  0.25  0.7868230  0.5736466
  3       0.001  0.50  0.7844742  0.5689482
  3       0.001  1.00  0.7879963  0.5759904
  3       0.010  0.25  0.7948908  0.5897825
  3       0.010  0.50  0.7965047  0.5930103
  3       0.010  1.00  0.8019343  0.6038683
  3       0.100  0.25  0.8324492  0.6648965
  3       0.100  0.50  0.8368495  0.6736992
  3       0.100  1.00  0.8402260  0.6804537

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.

Model Evaluation

We evaluate our models across the algorithms, features, and inividual models by its accuracy.

all_model <- list(nb_1=nb_1, nb_1p=nb_1p, nb_2=nb_2, svm_1=svm_1, svm_1p=svm_1p, svm_2=svm_2, rf_1=rf_1, rf_1p=rf_1p, rf_2=rf_2)

long_results <- resamples(all_model)$values %>% pivot_longer(!Resample, names_sep = '~', names_to = c('.value', 'Metric')) %>% 
  pivot_longer(names(all_model), names_to = 'Model' , values_to = 'Value')

long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
  mutate(ModelType = sub('_.*','', Model)) %>% 
  group_by(ModelType) %>%
  summarise_at(vars(-Resample, -Model), ~mean(.)) %>%
  arrange(desc(Accuracy))

long_results %>% mutate(ModelType = sub('_.*','', Model)) %>%
  ggplot(aes(x=ModelType, y=Value, fill=ModelType)) +
  geom_boxplot() +
  facet_wrap(~Metric)

A tibble: 3 × 3
ModelType	Accuracy	Kappa
<chr>	<dbl>	<dbl>
rf	0.9753994	0.9507987
svm	0.8502496	0.7005009
nb	0.7032906	0.4065817

png

We can see that the algorithm with highest accuracy is Random Forest, then SVM, and lastly Naive Bayes

long_results %>% 
  filter(Metric == 'Accuracy') %>% 
  mutate(ModelType = sub('_.*','', Model)) %>% 
  t_test(Value ~ModelType) %>% 
  add_significance()

A rstatix_test: 3 × 10
.y.	group1	group2	n1	n2	statistic	df	p	p.adj	p.adj.signif
<chr>	<chr>	<chr>	<int>	<int>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>
Value	nb	rf	30	30	-34.04656	29.89943	1.79e-25	3.58e-25	****
Value	nb	svm	30	30	-16.97885	39.68954	8.43e-20	8.43e-20	****
Value	rf	svm	30	30	34.72177	33.67863	5.94e-28	1.78e-27	****

From the t-test results above, we can see that our previous observation is statistically significant with 99.99% interval.

long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
  mutate(FeatureType = sub('.*_','', Model)) %>% 
  group_by(FeatureType) %>%
  summarise_at(vars(-Resample, -Model), ~mean(.)) %>%
  arrange(desc(Accuracy))

long_results %>% mutate(FeatureType = sub('.*_','', Model)) %>%
  ggplot(aes(x=FeatureType, y=Value, fill=FeatureType)) +
  geom_boxplot() +
  facet_wrap(~Metric)

A tibble: 3 × 3
FeatureType	Accuracy	Kappa
<chr>	<dbl>	<dbl>
1p	0.8556280	0.7112568
2	0.8429156	0.6858325
1	0.8303960	0.6607920

png

Here, we can see that, by average, features from PCA has the highest accuracy, followed by our processed data, then the original data.

long_results %>% 
  filter(Metric == 'Accuracy') %>% 
  mutate(FeatureType = sub('.*_','', Model)) %>% 
  t_test(Value ~FeatureType) %>% 
  add_significance()

A rstatix_test: 3 × 10
.y.	group1	group2	n1	n2	statistic	df	p	p.adj	p.adj.signif
<chr>	<chr>	<chr>	<int>	<int>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>
Value	1	1p	30	30	-0.8197950	52.41153	0.416	1	ns
Value	1	2	30	30	-0.3909963	55.18921	0.697	1	ns
Value	1p	2	30	30	0.4751066	57.32042	0.637	1	ns

However, t-test results shows that our observation is not statistically significant.

long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
  group_by(Model) %>%
  summarise_at(vars(-Resample), ~mean(.)) %>%
  arrange(desc(Accuracy))

long_results %>% ggplot(aes(x=Model, y=Value, fill=Model)) +
  geom_boxplot() +
  facet_wrap(~Metric)

A tibble: 9 × 3
Model	Accuracy	Kappa
<chr>	<dbl>	<dbl>
rf_1p	0.9772589	0.9545179
rf_2	0.9746171	0.9492340
rf_1	0.9743221	0.9486443
svm_1	0.8661908	0.7323835
svm_1p	0.8443321	0.6886654
svm_2	0.8402260	0.6804537
nb_1p	0.7452929	0.4905871
nb_2	0.7139037	0.4278098
nb_1	0.6506751	0.3013481

png

Here, we can see that Random Forest models have higher accuracy, followed by SVM and Naive Bayes. Random Forest models have around the same accuracy among each other. In the meanwhile, SVM using original data and naive bayes using PCA data have the highest accuracy within their respective group.

long_results %>% 
  filter(Metric == 'Accuracy') %>% 
  t_test(Value ~Model) %>% 
  add_significance()

A rstatix_test: 36 × 10
.y.	group1	group2	n1	n2	statistic	df	p	p.adj	p.adj.signif
<chr>	<chr>	<chr>	<int>	<int>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>
Value	nb_1	nb_1p	10	10	-13.1198526	15.347841	9.49e-10	9.49e-09	****
Value	nb_1	nb_2	10	10	-8.5190425	14.987547	3.96e-07	3.17e-06	****
Value	nb_1	rf_1	10	10	-72.8832000	13.923057	2.21e-19	7.96e-18	****
Value	nb_1	rf_1p	10	10	-81.2290603	10.144161	1.31e-15	4.19e-14	****
Value	nb_1	rf_2	10	10	-75.2114985	12.790714	2.59e-18	9.06e-17	****
Value	nb_1	svm_1	10	10	-35.7138208	17.520288	8.17e-18	2.78e-16	****
Value	nb_1	svm_1p	10	10	-27.2885189	15.553784	1.47e-14	4.56e-13	****
Value	nb_1	svm_2	10	10	-33.6827955	17.970672	1.08e-17	3.56e-16	****
Value	nb_1p	nb_2	10	10	3.5839676	17.971095	2.00e-03	1.30e-02
Value	nb_1p	rf_1	10	10	-35.6211233	11.179135	7.27e-13	1.96e-11	****
Value	nb_1p	rf_1p	10	10	-37.7367091	9.473827	1.20e-11	2.64e-10	****
Value	nb_1p	rf_2	10	10	-36.1818961	10.627350	1.79e-12	4.30e-11	****
Value	nb_1p	svm_1	10	10	-15.8697585	16.787838	1.54e-11	2.94e-10	****
Value	nb_1p	svm_1p	10	10	-11.6732626	17.990573	7.91e-10	9.49e-09	****
Value	nb_1p	svm_2	10	10	-13.0045904	15.710899	8.02e-10	9.49e-09	****
Value	nb_2	rf_1	10	10	-39.0753012	11.015532	3.62e-13	1.05e-11	****
Value	nb_2	rf_1p	10	10	-41.1988640	9.437332	5.66e-12	1.30e-10	****
Value	nb_2	rf_2	10	10	-39.6443930	10.503691	8.79e-13	2.20e-11	****
Value	nb_2	svm_1	10	10	-19.4804626	16.463260	8.44e-13	2.19e-11	****
Value	nb_2	svm_1p	10	10	-15.0546371	17.928978	1.29e-11	2.71e-10	****
Value	nb_2	svm_2	10	10	-16.8256639	15.350480	2.63e-11	4.73e-10	****
Value	rf_1	rf_1p	10	10	-1.2529569	12.688843	2.33e-01	9.00e-01	ns
Value	rf_1	rf_2	10	10	-0.1050795	17.612935	9.18e-01	1.00e+00	ns
Value	rf_1	svm_1	10	10	21.3111332	12.670798	2.65e-11	4.73e-10	****
Value	rf_1	svm_1p	10	10	20.6328671	11.277988	2.60e-10	3.90e-09	****
Value	rf_1	svm_2	10	10	29.2631862	13.596623	1.13e-13	3.39e-12	****
Value	rf_1p	rf_2	10	10	1.2702148	13.800530	2.25e-01	9.00e-01	ns
Value	rf_1p	svm_1	10	10	23.5780454	9.820894	5.67e-10	7.37e-09	****
Value	rf_1p	svm_1p	10	10	22.1123499	9.495995	1.73e-09	1.56e-08	****
Value	rf_1p	svm_2	10	10	32.8096435	10.055947	1.47e-11	2.94e-10	****
Value	rf_2	svm_1	10	10	21.8712517	11.777272	6.71e-11	1.07e-09	****
Value	rf_2	svm_1p	10	10	20.9908981	10.702255	4.85e-10	6.79e-09	****
Value	rf_2	svm_2	10	10	30.1790475	12.520781	4.51e-13	1.26e-11	****
Value	svm_1	svm_1p	10	10	2.9109250	16.962301	1.00e-02	4.90e-02
Value	svm_1	svm_2	10	10	4.2290527	17.719075	5.20e-04	4.00e-03	**
Value	svm_1p	svm_2	10	10	0.5713722	15.914702	5.76e-01	1.00e+00	ns

From the t-test results above, we can see that the difference of accuracy among Random Forest models is not statistically significant. In the meanwhile, the results shows that SVM with original data has significantly higher accuracy than the other SVM models with at least 95% confidence interval. Also, Naive Bayes with PCA data has significantly higher accuracy than the other Naive Bayes models with at least 95% confidence interval.