Group Members

  • Dhia Syahmie Bin Muhammad Sukor (S2147929)
  • Liu Zening (S2179213)
  • Muhammad Shakyr Bin Rosman (S2152185)
  • Raimi Bin Ridzuan (S2186754)
  • Zheng Jiangmeng (S2174210)

Business Understanding

Problem Statement

Breast cancer is one of the most common cancers, often referred to as the “pink killer”, and ranks first in incidence among malignant tumors in women. The World Health Organization (WHO) reports that 685,000 people worldwide died from breast cancer in 2020. It is therefore important to study breast cancer mortality and the survival time of patients, as this provides insight into the causes of death from breast cancer and informs efforts to reduce the number of deaths from the disease.

This study may provide important information about differences in breast cancer outcomes in different population groups. For example, studies may show that certain racial or ethnic groups have higher breast cancer mortality rates, which may indicate that targeted interventions are needed to address these disparities.

Overall, studying breast cancer mortality and how long patients can survive is important for understanding the disease and the best ways to reduce the number of deaths caused by it. It also helps to identify the needs and priorities for research to further improve the survival rate and reduce the burden of this disease.

Objective

The research aims to:

  1. Determine inconsistencies, missing values and outliers in the dataset
  2. Identify the relationships between features through data exploration
  3. Predict the survivability of patients diagnosed with breast cancer using machine learning models

Samples

  • Age: the ages of the female patients range from 30 to 69 years, with a mean of 53.97 years.

  • Races: White, Black and Other. The majority of patients belonged to the white race, while a few patients were identified as black or other races.

  • Marital status: Married, Single, Divorced, Widowed, and Separated. Most patients diagnosed with cancer were married.

  • T stage: the size of the tumor, with categories T1, T2, T3, and T4.

  • N stage: the extent of cancer spread to the lymph nodes, with categories N1, N2, and N3.

  • Sixth stage: the stage grouping of breast cancer, based on a combination of T stage, N stage, and M stage (not available in the dataset). The five sixth-stage categories present in the dataset are IIA, IIB, IIIA, IIIB, and IIIC.

  • Differentiate or grade: how well the tumor resembles normal tissue: Grade I (well differentiated), Grade II (moderately differentiated), Grade III (poorly differentiated), and Grade IV (undifferentiated).

  • A stage: Regional or Distant. Regional means the spread of the cancer is localized; Distant means the cancer has spread to distant parts of the body.

  • Tumor size (mm): the size of the tumor in mm.

  • Survival months: the mean value is 71.3 months, and values range from 1 to 107.

  • Dataset: SEER Breast Cancer Data, JING TENG, January 18, 2019. https://ieee-dataport.org/open-access/seer-breast-cancer-data

Imports

install.packages("dplyr")
install.packages("gmodels")
install.packages("psych") 
library(tidyverse)
library(dplyr)
library(psych)
library(gmodels)
library(ggplot2)
library(tidyr) 
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘psych’


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

Data Cleaning

Data Inconsistency Issue in the ‘Race’ Attribute

df <- read.csv("/content/sample_data/Uncleaned_Datav1.csv")

As the bar chart below shows, each race appears twice due to inconsistent category names.

race <- table(df$Race)
barplot(race, main="Race Distribution",
        xlab="Race", ylab ="Frequencies")
table(df$Race)
 Black Blacks  Other otherS  wHiTe  White 
   180    118    241     81   1039   2418 

[figure: Race Distribution bar chart, before smoothing]

To tackle these inconsistencies in the naming convention, which might be due to the combination of two data sources, data smoothing is performed using the replacement method.

#Replace "Blacks" = Black, "otherS" = "Other" and "wHiTe" <- "White" 
df$Race[df$Race == "Blacks"] <- "Black"
df$Race[df$Race == "otherS"] <- "Other"
df$Race[df$Race == "wHiTe"] <- "White"

After data smoothing

race <- table(df$Race)
barplot(race, main="Race Distribution",
        xlab="Race", ylab ="Frequencies")
table(df$Race)
Black Other White 
  298   322  3457 

[figure: Race Distribution bar chart, after smoothing]

Data Intentionality Issue in the “Marital Status” Attribute

As the bar chart below shows, there is a missing-value category called “Unknown” with 25 occurrences in the data.

MaritalStat <- table(df$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
        xlab="Marital", ylab="Frequencies")
table(df$Marital.Status)
 Divorced   Married Separated   Single    Unknown   Widowed 
      488      2662        45       620        25       237 

[figure: Marital Status Distribution bar chart]

Since “Unknown” has only 25 occurrences, it accounts for 25/4077 × 100 = 0.6132% of the whole dataset. The optimal solution is data reduction: the “Unknown” rows are dropped, which does not result in a loss of significant information.
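As a quick check, that proportion can be computed directly; a minimal sketch (run before the rows are dropped):

# Percentage of "Unknown" marital status records in the data
round(mean(df$Marital.Status == "Unknown") * 100, 4)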

df <- subset(df, Marital.Status != "Unknown")

After Data Reduction

MaritalStat <- table(df$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
        xlab="Marital", ylab="Frequencies")
table(df$Marital.Status)
 Divorced   Married Separated   Single    Widowed 
      488      2662        45       620       237 

[figure: Marital Status Distribution bar chart, after data reduction]

Data Noisiness (Unbelievability) Issue in the “Age” Attribute

The age histogram below shows outliers: some observations have implausible ages around “0”, ranging from -1 to 4. This noisy data makes up 28/4052 × 100 = 0.69% of the Age attribute, so it can be eliminated by data reduction without a significant loss of information.

hist(df$Age, breaks = 30, main = "Age Histogram", xlab = "Age", ylab = "Frequency")
count <- sum(df$Age >= -1 & df$Age <=5)
cat("The count of noisiness data in Age Attribtutes is", count)
The count of noisiness data in Age Attribtutes is 28

[figure: Age Histogram]

After Data Reduction

df_filtered <- subset(df, !(Age >= -1 & Age <= 6))
hist(df_filtered$Age, breaks = 30, main = "Age Histogram", xlab = "Age", ylab = "Frequency")

[figure: Age Histogram, after data reduction]

Data Incompleteness Issue in the ‘Differentiate’ Attribute

The bar chart below shows missing values labelled “Unknown”, totalling 733 observations, or 18.23% of the data. Dropping these missing values would be impractical because their number is significantly high, so they are instead replaced with the mode, “Moderately differentiated”, using the data imputation method.

differentiate <- table(df$differentiate)
barplot(differentiate, main="Differentiate Distribution",
        xlab="Differentiate", ylab="Frequencies")
table(df$differentiate)
Moderately differentiated     Poorly differentiated          Undifferentiated 
                     1632                      1119                        19 
                  Unknown       Well differentiated 
                      733                       549 

[figure: Differentiate Distribution bar chart]
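Rather than hard-coding the mode in the imputation step below, it can be derived from the frequency table; a small sketch:

# Most frequent category (the mode) of differentiate, ignoring "Unknown"
freq <- table(df$differentiate[df$differentiate != "Unknown"])
names(which.max(freq))   # "Moderately differentiated"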

After Data Imputation

df$differentiate[df$differentiate == "Unknown"] <- "Moderately differentiated"
differentiate <- table(df$differentiate)
barplot(differentiate, main="Differentiate Distribution",
        xlab="Differentiate", ylab="Frequencies")
table(df$differentiate)
Moderately differentiated     Poorly differentiated          Undifferentiated 
                     2365                      1119                        19 
      Well differentiated 
                      549 

[figure: Differentiate Distribution bar chart, after imputation]

Finding Total Outliers in the Numeric Variables

library(dplyr)
library(tidyr)

# Create a function to identify outliers
find_outliers <- function(x) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  y <- x[x < qnt[1] - H | x > qnt[2] + H]
  return(length(y))
}
# Use the function to find outliers in all columns of the dataframe
total_outliers <- df %>% 
  select_if(is.numeric) %>% 
  summarise_all(find_outliers) %>% 
  gather() %>%
  group_by(key) %>%
  summarise(total_outliers = sum(value)) 

colnames(total_outliers) <- c("Variable", "Outliers")
total_outliers
cat("the total outliers are :",sum(total_outliers$Outliers))
A tibble: 5 × 2
Variable                 Outliers
Age                            28
Reginol.Node.Positive         346
Regional.Node.Examined         73
Survival.Months                19
Tumor.Size                    222
the total outliers are : 688

The total number of outliers is quite high: 688 out of 4,024 observations, more than 15% of the dataset. Removing 15% or more of the rows may significantly affect the final predictions, hence the outliers are not removed for machine learning modeling.
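Had removal been necessary, an alternative that keeps every row is to cap extreme values at the same 1.5 × IQR fences used by find_outliers (winsorizing); a hedged sketch, not applied here:

# Cap a numeric vector at the 1.5*IQR fences instead of dropping rows (sketch)
cap_outliers <- function(x) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  pmin(pmax(x, qnt[1] - H), qnt[2] + H)
}
# Example usage: df$Tumor.Size <- cap_outliers(df$Tumor.Size)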

Univariate Exploration

BreastCancer <- read.csv("/content/sample_data/Breast_Cancer.csv")

The mean, max, median of the dataset’s attributes

describe(BreastCancer)
A psych: 16 × 13
vars n mean sd median trimmed mad min max range skew kurtosis se
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Age 1 4024 53.972167 8.9631344 54 54.245031 10.3782 30 69 39 -0.2202085 -0.75675566 0.141296343
Race* 2 4024 2.775845 0.5644652 3 2.935093 0.0000 1 3 2 -2.4052242 4.38936802 0.008898323
Marital.Status* 3 4024 2.371272 1.0636375 2 2.266149 0.0000 1 5 4 1.1459202 0.36387786 0.016767358
T.Stage* 4 4024 1.784791 0.7655309 2 1.699379 1.4826 1 4 3 0.7246804 0.06106612 0.012067957
N.Stage* 5 4024 1.438370 0.6934794 1 1.298137 0.0000 1 3 2 1.2761763 0.21228692 0.010932125
X6th.Stage* 6 4024 2.321819 1.2666239 2 2.152484 1.4826 1 5 4 0.8144155 -0.19692684 0.019967270
differentiate* 7 4024 1.690358 1.0164183 1 1.488199 0.0000 1 4 3 1.4181264 0.73747790 0.016022987
Grade* 8 4024 3.131710 0.6423983 3 3.170497 0.0000 1 4 3 -0.2335884 -0.20517763 0.010126874
A.Stage* 9 4024 1.977137 0.1494847 2 2.000000 0.0000 1 2 1 -6.3821741 38.74177405 0.002356501
Tumor.Size 10 4024 30.473658 21.1196961 25 27.110248 14.8260 1 140 139 1.7384530 3.62490859 0.332934406
Estrogen.Status* 11 4024 1.933151 0.2497912 2 2.000000 0.0000 1 2 1 -3.4672437 10.02426991 0.003937750
Progesterone.Status* 12 4024 1.826541 0.3786909 2 1.908075 0.0000 1 2 1 -1.7241483 0.97292932 0.005969746
Regional.Node.Examined 13 4024 14.357107 8.0996748 14 13.854348 7.4130 1 61 60 0.8286556 1.64347439 0.127684622
Reginol.Node.Positive 14 4024 4.158052 5.1093311 2 2.961180 1.4826 1 46 45 2.7005214 8.96267158 0.080544347
Survival.Months 15 4024 71.297962 22.9214295 73 72.970497 25.2042 1 107 106 -0.5895585 0.01696196 0.361337232
Status* 16 4024 1.153082 0.3601108 1 1.066460 0.0000 1 2 1 1.9262531 1.71087644 0.005676847

Numeric Feature Distributions

x1 <- BreastCancer$Tumor.Size
TumorSize1 <- hist(x1, col = "Orange", xlab = "Tumor Size", ylab = "Frequencies", 
                   main = "Tumor Size Histogram")
x2 <- BreastCancer$Age
Age1 <- hist(x2, col = "Orange", xlab = "Age ", ylab = "Frequencies", 
                   main = " Age Distribution Histogram ")
x3 <- BreastCancer$Regional.Node.Examined
Regional.N.E1 <- hist(x3, col = "Orange", xlab = "Regional Node Examined ", ylab = "Frequencies", 
                   main = " Regional Node Examined Distribution Histogram ")
x4 <- BreastCancer$Reginol.Node.Positive
Regional.N.E1 <- hist(x4, col = "Orange", xlab = "Regional Node Positive Examined ", ylab = "Frequencies", 
                   main = " Regional Node Positive Distribution Histogram ")
x5 <- BreastCancer$Survival.Months
SurvivalMonth1 <- hist(x5, col = "Orange", xlab = "Survival Month ", ylab = "Frequencies", 
                   main = " Survival Month Distribution Histogram ")

[figures: histograms of Tumor Size, Age, Regional Node Examined, Regional Node Positive, and Survival Months]

Categorical Variable Distributions (Bar Charts)

race <- table(BreastCancer$Race)
barplot(race, main="Race Distribution",
        xlab="Race", ylab ="Frequencies")

MaritalStat <- table(BreastCancer$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
        xlab="Marital", ylab="Frequencies")

T.stage <- table(BreastCancer$T.Stage)
barplot(T.stage, main="T.Stage Distribution",
        xlab="T.Stage", ylab="Frequencies")

N.stage <- table(BreastCancer$N.Stage)
barplot(N.stage, main="N.Stage Distribution",
        xlab="N.Stage", ylab="Frequencies")

X6th.stage <- table(BreastCancer$X6th.Stage)
barplot(X6th.stage, main="X6th.Stage Distribution",
        xlab="X6th.Stage", ylab="Frequencies")

differentiate <- table(BreastCancer$differentiate)
barplot(differentiate, main="Differentiate Distribution",
        xlab="Differentiate", ylab="Frequencies")

Grade <- table(BreastCancer$Grade)
barplot(Grade, main="Grade Distribution",
        xlab="Grade", ylab="Frequencies")


A.Stage <- table(BreastCancer$A.Stage)
barplot(A.Stage, main="A.Stage Distribution",
        xlab="A.Stage", ylab="Frequencies")

Estrogen.Stat <- table(BreastCancer$Estrogen.Status)
barplot(Estrogen.Stat, main="Estrogen Status Distribution",
        xlab="Estrogen Status", ylab="Frequencies")

Progesterone.Stat <- table(BreastCancer$Progesterone.Status)
barplot(Progesterone.Stat, main="Progesterone Status Distribution",
        xlab="Progesterone Status", ylab="Frequencies")

[figures: bar charts of the Race, Marital Status, T.Stage, N.Stage, X6th.Stage, differentiate, Grade, A.Stage, Estrogen Status, and Progesterone Status distributions]

Box Plots of the Numerical Variables


BreastCancer %>% pivot_longer(cols=c('Survival.Months', 'Reginol.Node.Positive', 'Regional.Node.Examined', 'Tumor.Size', 'Age'), names_to = 'Column', values_to = 'Value') %>% 
  ggplot(aes(x=Column, y=Value, fill=Column)) +
  geom_boxplot() +
  facet_wrap(~Column, scales = 'free')

[figure: box plots of the five numeric variables]

Bivariate Exploration

Chi-Square Test of the Categorical Variables against the Target Variable

Arrange the categorical features according to their levels

BreastCancerv1 <- BreastCancer
Race1 = factor(BreastCancerv1$Race, levels = c('White', 'Black', 'Other'))
T.Stage1 = factor(BreastCancerv1$T.Stage, levels =c('T1', 'T2', 'T3', 'T4'))
N.Stage1 = factor(BreastCancerv1$N.Stage, levels = c('N1', 'N2', 'N3'))
X6th.Stage1 = factor(BreastCancerv1$X6th.Stage, levels = c('IIA', 'IIB', 'IIIA', 'IIIB', 'IIIC'))
differentiate1 = factor(BreastCancerv1$differentiate, levels = c('Well differentiated', 'Moderately differentiated', 'Poorly differentiated', 'Undifferentiated'))
Grade1 = factor(BreastCancerv1$Grade, levels = c('1', '2', '3', 'anaplastic; Grade IV'))
A.Stage1 = factor(BreastCancerv1$A.Stage, levels = c('Regional', 'Distant'))
Estrogen.Stat1 = factor(BreastCancerv1$Estrogen.Status, levels = c('Positive', 'Negative'))
# The raw Status values are 'Alive'/'Dead'; relabel them while ordering the levels
Status1 = factor(BreastCancerv1$Status, levels = c('Alive', 'Dead'), labels = c('High Survival', 'Non - Survival'))

Cross Table

cat_vars <- sapply(X = BreastCancerv1, function(x) is.character(x))
char_vars <- BreastCancerv1[cat_vars]
tables_list<- list()

# Loop through all columns of the data frame
for (col in names(char_vars)) {
    if(col != "Status"){
        # Create a cross tabulation table with respect to the target variable
        tab <- table(char_vars[, col], char_vars$Status)
        # Add the cross tabulation table to the list
        tables_list[[col]] <- tab
    }
}
 
View(tables_list)
 
$Race
       
        Alive Dead
  Black   218   73
  Other   287   33
  White  2903  510

$Marital.Status
           
            Alive Dead
  Divorced    396   90
  Married    2285  358
  Separated    30   15
  Single      511  104
  Widowed     186   49

$T.Stage
    
     Alive Dead
  T1  1446  157
  T2  1483  303
  T3   417  116
  T4    62   40

$N.Stage
    
     Alive Dead
  N1  2462  270
  N2   655  165
  N3   291  181

$X6th.Stage
      
       Alive Dead
  IIA   1209   96
  IIB    995  135
  IIIA   866  184
  IIIB    47   20
  IIIC   291  181

$differentiate
                           
                            Alive Dead
  Moderately differentiated  2046  305
  Poorly differentiated       848  263
  Undifferentiated             10    9
  Well differentiated         504   39

$Grade
                       
                        Alive Dead
   anaplastic; Grade IV    10    9
  1                       504   39
  2                      2046  305
  3                       848  263

$A.Stage
          
           Alive Dead
  Distant     57   35
  Regional  3351  581

$Estrogen.Status
          
           Alive Dead
  Negative   161  108
  Positive  3247  508

$Progesterone.Status
          
           Alive Dead
  Negative   494  204
  Positive  2914  412

Chi-Square Test Result

The chi-square (χ²) statistic is a way to check the relationship between two categorical nominal variables. In this data exploration, we examined the relationship between each input categorical feature in the dataset and the target variable, “Status”. Our null hypothesis states that the input categorical feature is independent of the target variable (Status); the alternative hypothesis states that it is dependent on the target variable. From the results below, the p-value for every relation between an input categorical variable and the target variable is well below the 0.05 significance level. Therefore we reject the null hypothesis.

chi_results_list <- lapply(tables_list, chisq.test)
View(chi_results_list)
Warning message in FUN(X[[i]], ...):
“Chi-squared approximation may be incorrect”
Warning message in FUN(X[[i]], ...):
“Chi-squared approximation may be incorrect”



$Race

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 27.97, df = 2, p-value = 8.441e-07


$Marital.Status

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 28.264, df = 4, p-value = 1.103e-05


$T.Stage

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 103.48, df = 3, p-value < 2.2e-16


$N.Stage

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 269.93, df = 2, p-value < 2.2e-16


$X6th.Stage

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 281.65, df = 4, p-value < 2.2e-16


$differentiate

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 112.56, df = 3, p-value < 2.2e-16


$Grade

    Pearson's Chi-squared test

data:  X[[i]]
X-squared = 112.56, df = 3, p-value < 2.2e-16


$A.Stage

    Pearson's Chi-squared test with Yates' continuity correction

data:  X[[i]]
X-squared = 35.765, df = 1, p-value = 2.226e-09


$Estrogen.Status

    Pearson's Chi-squared test with Yates' continuity correction

data:  X[[i]]
X-squared = 135.16, df = 1, p-value < 2.2e-16


$Progesterone.Status

    Pearson's Chi-squared test with Yates' continuity correction

data:  X[[i]]
X-squared = 124.89, df = 1, p-value < 2.2e-16
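For a compact overview, the p-values can be pulled out of chi_results_list; and the two “approximation may be incorrect” warnings, which come from tables with small expected counts (e.g. the Undifferentiated and Separated levels), can be double-checked with a simulated p-value. A minimal sketch:

# Collect all chi-square p-values into one named vector
sapply(chi_results_list, function(res) res$p.value)

# Re-run a warned test with a Monte Carlo p-value (no large-sample approximation)
chisq.test(tables_list$differentiate, simulate.p.value = TRUE, B = 10000)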

T-Test

The t-test is used to draw a comparison between the two categories of a categorical variable on a continuous variable. The continuous variable should be approximately normally distributed. Before testing, we standardized each continuous variable using the Z-score method; note that standardization rescales the data but does not change the shape of its distribution. A z-score, also known as a standard score, is a measure of how many standard deviations a data point is from the mean of the data set.
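The z-score of a value x is z = (x − mean) / sd. A small sketch comparing the manual computation with scale(), which is used below:

# Manual z-score versus scale() on the Age column (sketch)
age <- BreastCancerv1$Age
z_manual <- (age - mean(age)) / sd(age)
z_scale  <- as.numeric(scale(age, center = TRUE, scale = TRUE))
all.equal(z_manual, z_scale)   # TRUE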

Filtering some of the columns

# Create logical vectors indicating which columns are integers or character type
is_int <- sapply(BreastCancerv1, is.integer)
is_char <- sapply(BreastCancerv1, is.character)

# Remove all columns with character values, except the Status one
BreastCancerv2 <- BreastCancerv1[!(is_char & seq_along(BreastCancerv1) != 16) | is_int]
BreastCancerv2
A data.frame: 4024 × 6
Age Tumor.Size Regional.Node.Examined Reginol.Node.Positive Survival.Months Status
<int> <int> <int> <int> <int> <chr>
68 4 24 1 60 Alive
50 35 14 5 62 Alive
58 63 14 7 75 Alive
58 18 2 1 84 Alive
47 41 3 1 50 Alive
51 20 18 2 89 Alive
51 8 11 1 54 Alive
40 30 9 1 14 Dead
40 103 20 18 70 Alive
69 32 21 12 92 Alive
68 13 9 1 64 Dead
46 59 11 3 92 Alive
65 35 13 3 56 Alive
48 15 23 7 38 Alive
62 35 16 14 64 Alive
61 19 20 1 49 Alive
56 46 1 1 105 Alive
43 24 22 1 62 Alive
48 25 16 1 107 Alive
60 29 20 1 77 Alive
48 30 15 2 81 Alive
57 40 15 5 50 Alive
55 29 4 1 78 Alive
48 70 18 1 102 Alive
62 20 26 22 98 Alive
63 22 31 17 70 Alive
48 50 25 23 102 Alive
46 17 14 1 82 Alive
57 25 14 4 64 Alive
66 21 10 1 86 Alive
51 2 19 2 50 Alive
62 25 17 14 69 Alive
69 19 21 9 88 Dead
35 19 15 2 56 Alive
68 18 11 1 100 Alive
53 30 12 5 85 Alive
31 45 13 2 63 Alive
55 45 16 1 24 Alive
57 15 5 1 61 Alive
66 19 21 7 64 Dead
58 24 19 9 65 Dead
68 40 14 2 76 Alive
41 35 18 2 66 Alive
44 21 10 1 75 Alive
60 18 19 2 52 Alive
53 100 20 1 107 Alive
59 23 6 1 27 Dead
62 25 24 4 100 Alive
51 80 20 12 79 Alive
68 15 10 1 102 Alive
40 68 22 2 86 Alive
65 40 4 4 68 Alive
54 50 29 2 52 Alive
46 28 2 1 19 Dead
64 10 11 1 70 Alive
62 9 1 1 49 Alive
56 46 14 8 69 Alive
68 22 11 3 69 Alive
58 44 11 1 72 Alive
46 30 7 2 100 Alive

Install the required packages

install.packages("magrittr") # package installations are only needed the first time you use it
install.packages("dplyr")    # alternative installation of the %>%
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)    # alternatively, this also loads %>%

Attaching package: ‘magrittr’


The following object is masked from ‘package:purrr’:

    set_names


The following object is masked from ‘package:tidyr’:

    extract

Normalize the numeric data using Z-Score

# Select all columns that are of type 'numeric'
numerical_columns <- BreastCancerv2 %>% select_if(is.numeric)

# Select all columns that are of type 'character'
character_columns <- BreastCancerv2 %>% select_if(is.character)

# Normalize the numerical columns using Z-score
numerical_columns <- scale(numerical_columns,center = TRUE , scale = TRUE)

# Bind the normalized numerical columns with the character columns
BreastCancerv2 <- bind_cols(character_columns, numerical_columns)
View(BreastCancerv2)
A data.frame: 4024 × 6
Status Age Tumor.Size Regional.Node.Examined Reginol.Node.Positive Survival.Months
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
Alive 1.5650589 -1.25350563 1.19052837 -0.61809494 -0.49289955
Alive -0.4431672 0.21431852 -0.04408910 0.16478641 -0.40564495
Alive 0.4493777 1.54009517 -0.04408910 0.55622708 0.16150990
Alive 0.4493777 -0.59061731 -1.52563006 -0.61809494 0.55415557
Alive -0.7778715 0.49841351 -1.40216831 -0.61809494 -0.92917251
Alive -0.3315991 -0.49591898 0.44975789 -0.42237460 0.77229205
Alive -0.3315991 -1.06410897 -0.41447434 -0.61809494 -0.75466332
Dead -1.5588483 -0.02242731 -0.66139783 -0.61809494 -2.49975518
Alive -1.5588483 3.43406182 0.69668138 2.70915076 -0.05662658
Alive 1.6766270 0.07227102 0.82014313 1.53482875 0.90317394
Dead 1.5650589 -0.82736314 -0.66139783 -0.61809494 -0.31839036
Alive -0.8894396 1.35069851 -0.41447434 -0.22665426 0.90317394
Alive 1.2303545 0.21431852 -0.16755084 -0.22665426 -0.66740873
Alive -0.6663034 -0.73266481 1.06706662 0.55622708 -1.45270007
Alive 0.8956502 0.21431852 0.20283440 1.92626942 -0.31839036
Alive 0.7840821 -0.54326814 0.69668138 -0.61809494 -0.97279981
Alive 0.2262415 0.73515935 -1.64909181 -0.61809494 1.47032879
Alive -1.2241440 -0.30652231 0.94360488 -0.61809494 -0.40564495
Alive -0.6663034 -0.25917314 0.20283440 -0.61809494 1.55758338
Alive 0.6725140 -0.06977648 0.69668138 -0.61809494 0.24876449
Alive -0.6663034 -0.02242731 0.07937265 -0.42237460 0.42327368
Alive 0.3378096 0.45106435 0.07937265 0.16478641 -0.92917251
Alive 0.1146734 -0.06977648 -1.27870657 -0.61809494 0.29239179
Alive -0.6663034 1.87153933 0.44975789 -0.61809494 1.33944690
Alive 0.8956502 -0.49591898 1.43745186 3.49203210 1.16493772
Alive 1.0072183 -0.40122064 2.05476060 2.51343043 -0.05662658
Alive -0.6663034 0.92455601 1.31399012 3.68775244 1.33944690
Alive -0.8894396 -0.63796647 -0.04408910 -0.61809494 0.46690097
Alive 0.3378096 -0.25917314 -0.04408910 -0.03093393 -0.31839036
Alive 1.3419226 -0.44856981 -0.53793609 -0.61809494 0.64141016
Alive -0.331599066 -1.34820397 0.57321964 -0.42237460 -0.92917251
Alive 0.895650186 -0.25917314 0.32629614 1.92626942 -0.10025388
Dead 1.676626983 -0.54326814 0.82014313 0.94766775 0.72866475
Alive -2.116688887 -0.54326814 0.07937265 -0.42237460 -0.66740873
Alive 1.565058869 -0.59061731 -0.41447434 -0.61809494 1.25219231
Alive -0.108462838 -0.02242731 -0.29101259 0.16478641 0.59778286
Alive -2.562961342 0.68781018 -0.16755084 -0.42237460 -0.36201766
Alive 0.114673389 0.68781018 0.20283440 -0.61809494 -2.06348222
Alive 0.337809617 -0.73266481 -1.15524482 -0.61809494 -0.44927225
Dead 1.341922641 -0.54326814 0.82014313 0.55622708 -0.31839036
Dead 0.449377731 -0.30652231 0.57321964 0.94766775 -0.27476306
Alive 1.565058869 0.45106435 -0.04408910 -0.42237460 0.20513720
Alive -1.447280204 0.21431852 0.44975789 -0.42237460 -0.23113577
Alive -1.112575863 -0.44856981 -0.53793609 -0.61809494 0.16150990
Alive 0.672513958 -0.59061731 0.57321964 -0.42237460 -0.84191792
Alive -0.108462838 3.29201432 0.69668138 -0.61809494 1.55758338
Dead 0.560945845 -0.35387148 -1.03178307 -0.61809494 -1.93260033
Alive 0.895650186 -0.25917314 1.19052837 -0.03093393 1.25219231
Alive -0.331599066 2.34503100 0.69668138 1.53482875 0.33601908
Alive 1.565058869 -0.73266481 -0.53793609 -0.61809494 1.33944690
Alive -1.558848318 1.77684100 0.94360488 -0.42237460 0.64141016
Alive 1.230354528 0.45106435 -1.27870657 -0.03093393 -0.14388118
Alive 0.003105276 0.92455601 1.80783710 -0.42237460 -0.84191792
Dead -0.889439635 -0.11712565 -1.52563006 -0.61809494 -2.28161870
Alive 1.118786414 -0.96941064 -0.41447434 -0.61809494 -0.05662658
Alive 0.895650186 -1.01675980 -1.64909181 -0.61809494 -0.97279981
Alive 0.226241503 0.73515935 -0.04408910 0.75194741 -0.10025388
Alive 1.565058869 -0.40122064 -0.41447434 -0.22665426 -0.10025388
Alive 0.449377731 0.64046101 -0.41447434 -0.61809494 0.03062801
Alive -0.889439635 -0.02242731 -0.90832133 -0.42237460 1.25219231

T-Test Result

From the t-test results, the p-value for each continuous variable with respect to the target variable (Status) is below the 0.05 significance level. Therefore we reject the null hypothesis and conclude that each input continuous variable is dependent on the target variable.


# Create an empty list to store the test results
Tresults_list <- list()
# Loop through all columns of the data frame
for (col in names(BreastCancerv2)) {
    if(col != "Status"){
        # Conduct T-test on current column with respect to target variable
        Ttest_result <- t.test(BreastCancerv2[, col] ~ BreastCancerv2$Status)
        # Extract the p-value and test statistic from the test results
        p_value <- Ttest_result$p.value
        t_statistic <- Ttest_result$statistic
        # Add the p-value and test statistic to the results list
        Tresults_list[[col]] <- list(p_value = p_value,t_statistic = t_statistic)
    }
}
View(Tresults_list)
$Age
    p_value: 0.000930907837822076
    t_statistic: t = -3.32289602486267

$Tumor.Size
    p_value: 7.01956816737513e-14
    t_statistic: t = -7.62653452543241

$Regional.Node.Examined
    p_value: 0.0341129729107396
    t_statistic: t = -2.12225993465628

$Reginol.Node.Positive
    p_value: 3.44036275174707e-30
    t_statistic: t = -11.9770766977036

$Survival.Months
    p_value: 1.19257066205458e-129
    t_statistic: t = 29.702852507452

ANOVA

ANOVA is helpful for testing a categorical variable with three or more levels. It is similar to running multiple two-sample t-tests, but it produces fewer type I errors and is appropriate for a wider range of problems. ANOVA compares the means of each group and partitions the variance into different sources. For this data exploration we use Marital Status, Race, T.Stage, N.Stage, X6th.Stage, differentiate, and Grade as the categorical variables, because each has three or more levels.
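The aov printouts below report sums of squares and degrees of freedom but not p-values; those can be obtained by calling summary() on a fitted model. A minimal sketch for one predictor:

# F statistic and p-value for Race against Age (sketch)
summary(aov(Age ~ Race, data = BreastCancerv1))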

Assume Target Variable: Age


BreastCancerv3 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 1) | is_char]
BreastCancerv3 <- BreastCancerv3[-c(9:12)]
aov.models = BreastCancerv3[ , -grep("Age", names(BreastCancerv3))] %>%
  map(~ aov(BreastCancerv3$Age ~ .x))
aov.models
$Race
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    3064.1  320134.8
Deg. of Freedom        2      4021

Residual standard error: 8.922764
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                       .x Residuals
Sum of Squares   17556.43 305642.45
Deg. of Freedom         4      4019

Residual standard error: 8.72063
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    1925.9  321273.0
Deg. of Freedom        3      4020

Residual standard error: 8.939724
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares      55.5  323143.4
Deg. of Freedom        2      4021

Residual standard error: 8.964594
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    1013.7  322185.2
Deg. of Freedom        4      4019

Residual standard error: 8.953519
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    3326.7  319872.2
Deg. of Freedom        3      4020

Residual standard error: 8.920213
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv3$Age ~ .x)

Terms:
                      .x Residuals
Sum of Squares    3326.7  319872.2
Deg. of Freedom        3      4020

Residual standard error: 8.920213
Estimated effects may be unbalanced

Assume Target Variable: Tumor Size

BreastCancerv4 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 10) | is_char]
BreastCancerv4 <- BreastCancerv4[-c(8,10:12)]
aov.models1 = BreastCancerv4[ , -grep("Tumor.Size", names(BreastCancerv4))] %>%
  map(~ aov(BreastCancerv4$Tumor.Size ~ .x))
aov.models1
$Race
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                     .x Residuals
Sum of Squares       98   1794327
Deg. of Freedom       2      4021

Residual standard error: 21.12437
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares     1561.3 1792863.9
Deg. of Freedom         4      4019

Residual standard error: 21.12101
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares  1354557.5  439867.7
Deg. of Freedom         3      4020

Residual standard error: 10.46039
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                     .x Residuals
Sum of Squares   140869   1653556
Deg. of Freedom       2      4021

Residual standard error: 20.27881
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                     .x Residuals
Sum of Squares   634624   1159801
Deg. of Freedom       4      4019

Residual standard error: 16.98763
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares    26497.1 1767928.2
Deg. of Freedom         3      4020

Residual standard error: 20.97101
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv4$Tumor.Size ~ .x)

Terms:
                       .x Residuals
Sum of Squares    26497.1 1767928.2
Deg. of Freedom         3      4020

Residual standard error: 20.97101
Estimated effects may be unbalanced

Assume Target Variable: Regional Node Examined

BreastCancerv5 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 13) | is_char]
BreastCancerv5 <- BreastCancerv5[-c(8:10,12)]
aov.models2 = BreastCancerv5[ , -grep("Regional.Node.Examined", names(BreastCancerv5))] %>%
  map(~ aov(BreastCancerv5$Regional.Node.Examined ~ .x))
aov.models2




$Race
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares      43.97 263883.87
Deg. of Freedom         2      4021

Residual standard error: 8.101014
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares     108.67 263819.16
Deg. of Freedom         4      4019

Residual standard error: 8.102036
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares    3742.71 260185.13
Deg. of Freedom         3      4020

Residual standard error: 8.04504
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares   29058.49 234869.35
Deg. of Freedom         2      4021

Residual standard error: 7.642688
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares   28918.44 235009.39
Deg. of Freedom         4      4019

Residual standard error: 7.646868
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares    2169.77 261758.07
Deg. of Freedom         3      4020

Residual standard error: 8.069321
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)

Terms:
                       .x Residuals
Sum of Squares    2169.77 261758.07
Deg. of Freedom         3      4020

Residual standard error: 8.069321
Estimated effects may be unbalanced

Assume Target Variable: Survival Months

BreastCancerv6 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 15) | is_char]
BreastCancerv6 <- BreastCancerv6[-c(8:10,12)]
aov.models3 = BreastCancerv6[ , -grep("Survival.Months", names(BreastCancerv6))] %>%
  map(~ aov(BreastCancerv6$Survival.Months ~ .x))
aov.models3




$Race
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares     7739.1 2105912.6
Deg. of Freedom         2      4021

Residual standard error: 22.88512
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares     6870.1 2106781.7
Deg. of Freedom         4      4019

Residual standard error: 22.89553
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    16252.2 2097399.5
Deg. of Freedom         3      4020

Residual standard error: 22.84165
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    42434.1 2071217.6
Deg. of Freedom         2      4021

Residual standard error: 22.69582
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    45935.1 2067716.6
Deg. of Freedom         4      4019

Residual standard error: 22.68227
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    11397.2 2102254.6
Deg. of Freedom         3      4020

Residual standard error: 22.86808
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv6$Survival.Months ~ .x)

Terms:
                       .x Residuals
Sum of Squares    11397.2 2102254.6
Deg. of Freedom         3      4020

Residual standard error: 22.86808
Estimated effects may be unbalanced

Assume Target Variable: Reginol Node Positive

BreastCancerv7 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 14) | is_char]
BreastCancerv7 <- BreastCancerv7[-c(8:10,12)]
aov.models4 = BreastCancerv7[ , -grep("Reginol.Node.Positive", names(BreastCancerv7))] %>%
  map(~ aov(BreastCancerv7$Reginol.Node.Positive ~ .x))
aov.models4
$Race
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares      23.62 104997.85
Deg. of Freedom         2      4021

Residual standard error: 5.110027
Estimated effects may be unbalanced

$Marital.Status
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares     491.13 104530.35
Deg. of Freedom         4      4019

Residual standard error: 5.099906
Estimated effects may be unbalanced

$T.Stage
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                      .x Residuals
Sum of Squares   6213.48  98808.00
Deg. of Freedom        3      4020

Residual standard error: 4.957732
Estimated effects may be unbalanced

$N.Stage
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                      .x Residuals
Sum of Squares  77973.10  27048.37
Deg. of Freedom        2      4021

Residual standard error: 2.593603
Estimated effects may be unbalanced

$X6th.Stage
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                      .x Residuals
Sum of Squares  75327.78  29693.70
Deg. of Freedom        4      4019

Residual standard error: 2.718148
Estimated effects may be unbalanced

$differentiate
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares    1955.19 103066.29
Deg. of Freedom         3      4020

Residual standard error: 5.063436
Estimated effects may be unbalanced

$Grade
Call:
   aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)

Terms:
                       .x Residuals
Sum of Squares    1955.19 103066.29
Deg. of Freedom         3      4020

Residual standard error: 5.063436
Estimated effects may be unbalanced

Bar Chart Comparing Categorical Variables with Target Variable

library(dplyr)
library(tidyverse)
library(ggplot2)
install.packages("purrr")
library(purrr)
input_vars <- char_vars[(-c(11))]
target_var <- char_vars[(-c(1:10))] 

The bar charts below show the distribution of each categorical variable, level by level, against the target variable (Status).

char_vars %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%  
  ggplot(aes(x=Values, fill=Status)) +
  geom_bar() +
  facet_wrap(~Columns, ncol = 2, scale="free")+
  coord_flip()

[figure: stacked bar charts of each categorical variable by Status]

To get more valuable insight, we scale each bar to proportions, since the levels are highly disproportionate in their distributions (see the chart below).

  • In the A.Stage feature, patients whose cancer has spread regionally have the highest chance of survival compared to those whose cancer has spread distantly.
  • When the estrogen status is negative, there is a higher chance that the patient will not survive breast cancer.
  • When the tumor is undifferentiated (Grade IV), the chance of survival decreases and the chance of mortality increases.
  • Although marital status does not have a significant impact on survivability, married and single patients are the most likely to have high survivability.
  • In the N stage, the spread of cancer to the lymph nodes progresses from N1 to N3, giving a higher mortality rate.
  • Black patients have the highest mortality, followed by White and Other.
  • As the tumor size (T stage) increases from T1 to T4, the chance of survival decreases.

char_vars %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%  
  ggplot(aes(x=Values, fill=Status)) +
  geom_bar(position = "fill", 
           alpha = 1) +
  facet_wrap(~Columns, ncol = 2, scale="free")+
  coord_flip()

[figure: proportional bar charts of each categorical variable by Status]

Box Plot Comparing Numeric Variables with Target Variable

BreastCancerv2 %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%  
  ggplot(aes(x=Values, fill=Status)) +
  geom_boxplot() +
  facet_wrap(~Columns, ncol = 2, scale="free")+
  coord_flip()

[figure: box plots of each numeric variable by Status]

Multivariate Exploration & Dimensionality Reduction

install.packages(c('MASS', 'pscl', "corrgram", "psych", "gmodels", "dplyr"))
library(tidyverse) 
library(psych)
library(gmodels)  
library(corrgram)    
library(pscl) 
library(MASS)   
library(dplyr)  
library(scales)
Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis


Attaching package: ‘MASS’


The following object is masked from ‘package:dplyr’:

    select



Attaching package: ‘scales’


The following objects are masked from ‘package:psych’:

    alpha, rescale


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor
data=read_csv("/content/sample_data/Breast_Cancer.csv") %>%
   mutate(Race=as.factor(Race),
 `Marital Status`=as.factor(`Marital Status`),
 `T Stage`=as.factor(`T Stage`),
 `N Stage`=as.factor(`N Stage`),
 `6th Stage`=as.factor(`6th Stage`),
 differentiate=as.factor(differentiate),
 Grade=as.factor(Grade),
 `A Stage`=as.factor(`A Stage`),
 `Estrogen Status`=as.factor(`Estrogen Status`),
 `Progesterone Status`=as.factor(`Progesterone Status`),
 Status=as.factor(Status))
cat_cols = c('Race', 'Marital Status', 'T Stage', 'N Stage', '6th Stage', 'differentiate','Grade', 'A Stage', 'Estrogen Status', 'Progesterone Status', 'Status')
num_cols = c('Age', 'Tumor Size', 'Regional Node Examined', 'Reginol Node Positive', 'Survival Months')
cat_data = data[,cat_cols]
num_data = data[,num_cols]
head(data)
head(cat_data)
head(num_data)
Rows: 4024 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Race, Marital Status, T Stage, N Stage, 6th Stage, differentiate, ...
dbl  (5): Age, Tumor Size, Regional Node Examined, Reginol Node Positive, Su...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 16
Age Race Marital Status T Stage N Stage 6th Stage differentiate Grade A Stage Tumor Size Estrogen Status Progesterone Status Regional Node Examined Reginol Node Positive Survival Months Status
<dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <fct>
68 White Married T1 N1 IIA Poorly differentiated 3 Regional 4 Positive Positive 24 1 60 Alive
50 White Married T2 N2 IIIA Moderately differentiated 2 Regional 35 Positive Positive 14 5 62 Alive
58 White Divorced T3 N3 IIIC Moderately differentiated 2 Regional 63 Positive Positive 14 7 75 Alive
58 White Married T1 N1 IIA Poorly differentiated 3 Regional 18 Positive Positive 2 1 84 Alive
47 White Married T2 N1 IIB Poorly differentiated 3 Regional 41 Positive Positive 3 1 50 Alive
51 White Single T1 N1 IIA Moderately differentiated 2 Regional 20 Positive Positive 18 2 89 Alive
A tibble: 6 × 11
Race Marital Status T Stage N Stage 6th Stage differentiate Grade A Stage Estrogen Status Progesterone Status Status
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
White Married T1 N1 IIA Poorly differentiated 3 Regional Positive Positive Alive
White Married T2 N2 IIIA Moderately differentiated 2 Regional Positive Positive Alive
White Divorced T3 N3 IIIC Moderately differentiated 2 Regional Positive Positive Alive
White Married T1 N1 IIA Poorly differentiated 3 Regional Positive Positive Alive
White Married T2 N1 IIB Poorly differentiated 3 Regional Positive Positive Alive
White Single T1 N1 IIA Moderately differentiated 2 Regional Positive Positive Alive
A tibble: 6 × 5
Age Tumor Size Regional Node Examined Reginol Node Positive Survival Months
<dbl> <dbl> <dbl> <dbl> <dbl>
68 4 24 1 60
50 35 14 5 62
58 63 14 7 75
58 18 2 1 84
47 41 3 1 50
51 20 18 2 89

Logistic Regression Analysis

As our objective is binomial classification, we analyse the dataset using logistic regression, to see how much of the data can be explained and which features are significant to the logistic regression model. (The same model is refitted on the reduced dataset at the end of this section.)

model <- glm(Status ~., data = data, family = "binomial")
summary(model)
pR2(model)['McFadden']
Call:
glm(formula = Status ~ ., family = "binomial", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2543  -0.4670  -0.2640  -0.1355   3.3031  

Coefficients: (4 not defined because of singularities)
                                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         1.192035   0.592075   2.013 0.044081 *  
Age                                 0.028301   0.006516   4.343 1.40e-05 ***
RaceOther                          -0.876361   0.290957  -3.012 0.002595 ** 
RaceWhite                          -0.449858   0.190396  -2.363 0.018140 *  
`Marital Status`Married            -0.162967   0.166478  -0.979 0.327624    
`Marital Status`Separated           0.503724   0.485998   1.036 0.299981    
`Marital Status`Single             -0.094715   0.206431  -0.459 0.646361    
`Marital Status`Widowed             0.047243   0.261059   0.181 0.856393    
`T Stage`T2                         0.239508   0.229609   1.043 0.296897    
`T Stage`T3                         0.742368   0.372473   1.993 0.046253 *  
`T Stage`T4                         1.359086   0.570897   2.381 0.017284 *  
`N Stage`N2                         0.702434   0.278582   2.521 0.011687 *  
`N Stage`N3                         0.630783   0.351167   1.796 0.072455 .  
`6th Stage`IIB                      0.266879   0.267898   0.996 0.319156    
`6th Stage`IIIA                    -0.199731   0.341157  -0.585 0.558244    
`6th Stage`IIIB                     0.015832   0.662766   0.024 0.980942    
`6th Stage`IIIC                           NA         NA      NA       NA    
differentiatePoorly differentiated  0.433001   0.122914   3.523 0.000427 ***
differentiateUndifferentiated       1.705851   0.793648   2.149 0.031604 *  
differentiateWell differentiated   -0.597017   0.207476  -2.878 0.004008 ** 
Grade2                                    NA         NA      NA       NA    
Grade3                                    NA         NA      NA       NA    
Gradeanaplastic; Grade IV                 NA         NA      NA       NA    
`A Stage`Regional                   0.177552   0.325627   0.545 0.585573    
`Tumor Size`                       -0.003163   0.004729  -0.669 0.503671    
`Estrogen Status`Positive          -0.374343   0.227843  -1.643 0.100385    
`Progesterone Status`Positive      -0.521779   0.152356  -3.425 0.000615 ***
`Regional Node Examined`           -0.031172   0.008045  -3.875 0.000107 ***
`Reginol Node Positive`             0.077710   0.017946   4.330 1.49e-05 ***
`Survival Months`                  -0.061531   0.002763 -22.271  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3444.7  on 4023  degrees of freedom
Residual deviance: 2232.4  on 3998  degrees of freedom
AIC: 2284.4

Number of Fisher Scoring iterations: 6



fitting null model for pseudo-r2

McFadden: 0.351925027032208

Using the original dataset, we note that Grade 2, Grade 3, anaplastic Grade IV, and 6th Stage IIIC are singular (perfectly collinear with other predictors) and are therefore removed from the regression.

Then, Age, differentiate, Progesterone Status, Regional Node Examined, Reginol Node Positive, and Survival Months are significant features for the model at the 99.9% confidence level.

However, only about 35% of the variance is explained by this model (McFadden pseudo-R² ≈ 0.352).


Multiple Correspondence Analysis (MCA)

data %>% pairs.panels(scale=T,
             stars=T,
             gap = 0,
             bg = c("red", "blue")[data$Status],
             pch=21)

[figure: pairs panel of all variables, colored by Status]

A few factor columns, like T Stage, N Stage, and 6th Stage, have at least moderate correlation that is statistically significant. We use MCA to reduce these columns to numerical vectors.

res_mca <- mca(cat_data, nf=30)
res_mca 
summary(res_mca)
plot(res_mca) 
Call:
mca(df = cat_data, nf = 30)

Multiple correspondence analysis of 4024 cases of 11 factors

Correlations 0.517 0.441 0.430 0.426 0.412 0.400 0.398 0.344 0.327 0.305 0.301 0.301 0.297 0.282 0.274 0.269 0.248 0.209 0.128 0.111 0.088 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  cumulative % explained  5.17  9.58 13.87 18.14 22.26 26.26 30.25 33.69 36.96 40.01 43.02 46.03 49.00 51.81 54.55 57.24 59.72 61.81 63.09 64.20 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 



     Length Class  Mode   
rs   120720 -none- numeric
cs     1080 -none- numeric
fs   120720 -none- numeric
d        30 -none- numeric
p         1 -none- numeric
call      3 -none- call   

[figure: MCA plot of all 11 factor columns]

MCA on all the factor columns covers at most 65% of the variance. This low value means we cannot use MCA to reduce all the factor columns at once.

So, we start with the columns that are collinear: T Stage, N Stage, 6th Stage, Race, and Marital Status.

mca_cols <- c('T Stage', 'N Stage', '6th Stage','Race','Marital Status')
res_mca <- mca(cat_data[,names(cat_data) %in% mca_cols], nf=7)
res_mca  
plot(res_mca)  
pred_mca <- predict(res_mca, cat_data[,names(cat_data) %in% mca_cols])
data_mca <- data %>% mutate(mca1_1 = pred_mca[,1], mca1_2 = pred_mca[,2], mca1_3 = pred_mca[,3], 
  mca1_4 = pred_mca[,4], mca1_5 = pred_mca[,5], mca1_6 = pred_mca[,6], mca1_7 = pred_mca[,7])
data_mca <- data_mca[,!names(data_mca) %in% mca_cols]
head(data_mca)
Call:
mca(df = cat_data[, names(cat_data) %in% mca_cols], nf = 7)

Multiple correspondence analysis of 4024 cases of 5 factors

Correlations 0.667 0.625 0.603 0.590 0.486 0.453 0.447  cumulative % explained 16.67 32.31 47.38 62.14 74.29 85.63 96.80 
A tibble: 6 × 18
Age differentiate Grade A Stage Tumor Size Estrogen Status Progesterone Status Regional Node Examined Reginol Node Positive Survival Months Status mca1_1 mca1_2 mca1_3 mca1_4 mca1_5 mca1_6 mca1_7
<dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
68 Poorly differentiated 3 Regional 4 Positive Positive 24 1 60 Alive -0.0023461091 -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05
50 Moderately differentiated 2 Regional 35 Positive Positive 14 5 62 Alive 0.0014321129 0.0030905826 -0.0001677426 0.0002330912 -0.0006702577 6.418918e-05 2.464879e-04
58 Moderately differentiated 2 Regional 63 Positive Positive 14 7 75 Alive 0.0047429421 -0.0030614242 0.0006657707 0.0025064765 -0.0004448306 -2.617071e-03 1.369498e-03
58 Poorly differentiated 3 Regional 18 Positive Positive 2 1 84 Alive -0.0023461091 -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05
47 Poorly differentiated 3 Regional 41 Positive Positive 3 1 50 Alive -0.0005593943 0.0001839509 0.0019673530 -0.0019829816 -0.0009202945 -1.987249e-04 -1.041004e-04
51 Moderately differentiated 2 Regional 20 Positive Positive 18 2 89 Alive -0.0022315778 -0.0009130826 -0.0010431065 0.0007508688 0.0019226744 -4.756507e-04 -2.239220e-04

[figure: MCA plot of T Stage, N Stage, 6th Stage, Race, and Marital Status]

The 7 outputs from MCA cover 97% of the cumulative variance. Thus, we take all 7 outputs to replace N Stage, T Stage, 6th Stage, Race, and Marital Status.

Then, we proceed to reduce the differentiate, Grade, and A Stage columns, as they are slightly collinear.

mca_cols <- c('differentiate', 'Grade', 'A Stage')
res_mca <- mca(data_mca[,names(data_mca) %in% mca_cols], nf=3)
res_mca  
plot(res_mca)  
pred_mca <- predict(res_mca, data_mca[,names(data_mca) %in% mca_cols])
data_mca1 <- data_mca %>% mutate(mca2_1 = pred_mca[,1],mca2_2 = pred_mca[,2],mca2_3 = pred_mca[,3])
data_mca1 <- data_mca1[,!names(data_mca1) %in% mca_cols]
head(data_mca1)
Call:
mca(df = data_mca[, names(data_mca) %in% mca_cols], nf = 3)

Multiple correspondence analysis of 4024 cases of 3 factors

Correlations 0.818 0.816 0.816  cumulative % explained  40.88  81.70 122.53 
A tibble: 6 × 18
Age Tumor Size Estrogen Status Progesterone Status Regional Node Examined Reginol Node Positive Survival Months Status mca1_1 mca1_2 mca1_3 mca1_4 mca1_5 mca1_6 mca1_7 mca2_1 mca2_2 mca2_3
<dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
68 4 Positive Positive 24 1 60 Alive -0.0023461091 -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05 -0.006802372 0.0003366833 -1.086844e-03
50 35 Positive Positive 14 5 62 Alive 0.0014321129 0.0030905826 -0.0001677426 0.0002330912 -0.0006702577 6.418918e-05 2.464879e-04 0.002587008 -0.0025582201 -1.565432e-18
58 63 Positive Positive 14 7 75 Alive 0.0047429421 -0.0030614242 0.0006657707 0.0025064765 -0.0004448306 -2.617071e-03 1.369498e-03 0.002587008 -0.0025582201 -1.565432e-18
58 18 Positive Positive 2 1 84 Alive -0.0023461091 -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05 -0.006802372 0.0003366833 -1.086844e-03
47 41 Positive Positive 3 1 50 Alive -0.0005593943 0.0001839509 0.0019673530 -0.0019829816 -0.0009202945 -1.987249e-04 -1.041004e-04 -0.006802372 0.0003366833 -1.086844e-03
51 20 Positive Positive 18 2 89 Alive -0.0022315778 -0.0009130826 -0.0010431065 0.0007508688 0.0019226744 -4.756507e-04 -2.239220e-04 0.002587008 -0.0025582201 -1.565432e-18

[figure: MCA plot of differentiate, Grade, and A Stage]

The first 2 outputs of MCA cover 82% of the cumulative variance, and the reported cumulative figure including the third output is 123%. Since the first 2 outputs do not cover at least 95% of the variance, we take all 3 outputs to replace the differentiate, Grade, and A Stage columns.

After that, we proceed with reducing Estrogen Status and Progesterone Status columns.

mca_cols <- c('Estrogen Status','Progesterone Status')
res_mca <- mca(data_mca1[,names(data_mca1) %in% mca_cols], nf=length(mca_cols))
res_mca  
plot(res_mca)  
pred_mca <- predict(res_mca, data_mca1[,names(data_mca1) %in% mca_cols])
data_mca2 <- data_mca1 %>% mutate(mca3_1 = pred_mca[,1], mca3_2 = pred_mca[,2])
data_mca2 <- data_mca2[,!names(data_mca2) %in% mca_cols]
head(data_mca2)
Call:
mca(df = data_mca1[, names(data_mca1) %in% mca_cols], nf = length(mca_cols))

Multiple correspondence analysis of 4024 cases of 2 factors

Correlations 0.870 0.493  cumulative % explained  86.99 136.32 
A tibble: 6 × 18
Age Tumor Size Regional Node Examined Reginol Node Positive Survival Months Status mca1_1 mca1_2 mca1_3 mca1_4 mca1_5 mca1_6 mca1_7 mca2_1 mca2_2 mca2_3 mca3_1 mca3_2
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
68 4 24 1 60 Alive -0.0023461091 -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05 -0.006802372 0.0003366833 -1.086844e-03 -0.002860247 -0.000750588
50 35 14 5 62 Alive 0.0014321129 0.0030905826 -0.0001677426 0.0002330912 -0.0006702577 6.418918e-05 2.464879e-04 0.002587008 -0.0025582201 -1.565432e-18 -0.002860247 -0.000750588
58 63 14 7 75 Alive 0.0047429421 -0.0030614242 0.0006657707 0.0025064765 -0.0004448306 -2.617071e-03 1.369498e-03 0.002587008 -0.0025582201 -1.565432e-18 -0.002860247 -0.000750588
58 18 2 1 84 Alive -0.0023461091 -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05 -0.006802372 0.0003366833 -1.086844e-03 -0.002860247 -0.000750588
47 41 3 1 50 Alive -0.0005593943 0.0001839509 0.0019673530 -0.0019829816 -0.0009202945 -1.987249e-04 -1.041004e-04 -0.006802372 0.0003366833 -1.086844e-03 -0.002860247 -0.000750588
51 20 18 2 89 Alive -0.0022315778 -0.0009130826 -0.0010431065 0.0007508688 0.0019226744 -4.756507e-04 -2.239220e-04 0.002587008 -0.0025582201 -1.565432e-18 -0.002860247 -0.000750588

[Figure: MCA plot of the Estrogen Status and Progesterone Status category scores]

The first MCA dimension alone covers about 87% of the cumulative variance (the printed cumulative figure reaches 136.32% with the second). Because 87% falls short of the 95% threshold, we take both dimensions to replace the Estrogen Status and Progesterone Status columns.
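Applying the same threshold rule sketched earlier confirms this choice:

n_dims_needed(c(86.99, 136.32))  # 2 -> keep both dimensions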

# Pairs panel of the reduced data; points coloured red/blue by Status
data_mca2 %>% pairs.panels(scale=T,
             stars=T,
             gap = 0,
             bg = c("red", "blue")[data_mca2$Status],
             pch=21)

[Figure: pairs panel of data_mca2, points coloured by Status]

Now, all the factor columns have been reduced to numerical vectors. Still, the pairs panel shows that some columns remain correlated with each other, so we use PCA to reduce both collinearity and dimensionality.

Principal Component Analysis (PCA)
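Before fitting, a quick look at the pairwise correlations can confirm that the candidate columns are indeed correlated. This is a sketch we add for illustration (cand_cols is our name; it simply mirrors the pca_cols chosen below):

# Pairwise correlations among the columns selected for PCA below
cand_cols <- c('mca1_1', 'Tumor Size', 'Reginol Node Positive', 'Regional Node Examined')
round(cor(data_mca2[, cand_cols]), 2)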

# Correlated numeric columns to compress with PCA (on the correlation matrix)
pca_cols <- c('mca1_1', 'Tumor Size', 'Reginol Node Positive', 'Regional Node Examined')
res_pca <- princomp(data_mca2[,pca_cols], cor = T) 
summary(res_pca)

# Score each row on the components, keep the first three, and drop the originals
pred_pca <- predict(res_pca, data_mca2[,pca_cols])  
data_pca <- data_mca2 %>% mutate(pca1=pred_pca[,1], pca2=pred_pca[,2], pca3=pred_pca[,3])
data_pca <- data_pca[, !names(data_pca) %in% pca_cols]
head(data_pca)
Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5029084 0.9851157 0.7791421 0.40466125
Proportion of Variance 0.5646834 0.2426133 0.1517656 0.04093768
Cumulative Proportion  0.5646834 0.8072967 0.9590623 1.00000000
A tibble: 6 × 17
Age Survival Months Status mca1_2 mca1_3 mca1_4 mca1_5 mca1_6 mca1_7 mca2_1 mca2_2 mca2_3 mca3_1 mca3_2 pca1 pca2 pca3
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
68 60 Alive -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05 -0.006802372 0.0003366833 -1.086844e-03 -0.002860247 -0.000750588 -1.1333213 -1.6942370 -0.67609196
50 62 Alive 0.0030905826 -0.0001677426 0.0002330912 -0.0006702577 6.418918e-05 2.464879e-04 0.002587008 -0.0025582201 -1.565432e-18 -0.002860247 -0.000750588 0.5848042 0.2571267 0.15663818
58 75 Alive -0.0030614242 0.0006657707 0.0025064765 -0.0004448306 -2.617071e-03 1.369498e-03 0.002587008 -0.0025582201 -1.565432e-18 -0.002860247 -0.000750588 2.3363389 1.3114219 0.03636938
58 84 Alive -0.0010077059 -0.0010327568 0.0009605870 -0.0006947017 -4.192764e-05 4.440576e-05 -0.006802372 0.0003366833 -1.086844e-03 -0.002860247 -0.000750588 -1.8342221 0.6278887 0.65335871
47 50 Alive 0.0001839509 0.0019673530 -0.0019829816 -0.0009202945 -1.987249e-04 -1.041004e-04 -0.006802372 0.0003366833 -1.086844e-03 -0.002860247 -0.000750588 -0.8010371 1.4039334 0.20289254
51 89 Alive -0.0009130826 -0.0010431065 0.0007508688 0.0019226744 -4.756507e-04 -2.239220e-04 0.002587008 -0.0025582201 -1.565432e-18 -0.002860247 -0.000750588 -0.9345595 -0.7290785 -0.48789923

The first three principal components already cover about 96% of the cumulative variance, so we take them to replace the mca1_1, Tumor Size, Reginol Node Positive, and Regional Node Examined columns.
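For reference, the cumulative proportion printed by summary() can also be computed directly from the component standard deviations (a sketch assuming res_pca from above):

# Variance explained = sdev^2; cumulative share across components
cum_prop <- cumsum(res_pca$sdev^2) / sum(res_pca$sdev^2)
round(cum_prop, 4)  # Comp.3 reaches ~0.9591, matching the summary above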

# Pairs panel after PCA; the feature columns should now be uncorrelated
data_pca %>% pairs.panels(scale=T,
             stars=T,
             gap = 0,
             bg = c("red", "blue")[data_pca$Status],
             pch=21)

[Figure: pairs panel of data_pca, points coloured by Status]

Now we can see that none of the feature columns are strongly correlated with each other.

Lastly, we normalize the remaining numeric columns.

data_scaled <- data_pca %>% mutate(Age = rescale(Age), `Survival Months` = rescale(`Survival Months`))
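scales::rescale() maps each column linearly onto [0, 1] using its observed range. A quick illustration we add here, using values spanning the 1-to-107-month range of Survival Months in this dataset:

# rescale() maps the minimum to 0 and the maximum to 1, linearly in between
scales::rescale(c(1, 54, 107))
# [1] 0.0 0.5 1.0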

Modeling

install.packages(c('caret', 'kernlab', 'naivebayes', 'rstatix', 'randomForest'))
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘future.apply’, ‘progressr’, ‘SQUAREM’, ‘lava’, ‘prodlim’, ‘numDeriv’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘RcppEigen’, ‘proxy’, ‘iterators’, ‘Rcpp’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘timeDate’, ‘carData’, ‘abind’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’, ‘corrplot’, ‘car’
library(caret)    
library(kernlab) 
library(naivebayes)
library(rstatix) 
library(randomForest)  
Loading required package: lattice


Attaching package: ‘lattice’


The following object is masked from ‘package:corrgram’:

    panel.fill



Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift



Attaching package: ‘kernlab’


The following object is masked from ‘package:scales’:

    alpha


The following object is masked from ‘package:psych’:

    alpha


The following object is masked from ‘package:purrr’:

    cross


The following object is masked from ‘package:ggplot2’:

    alpha


naivebayes 0.9.7 loaded


Attaching package: ‘rstatix’


The following object is masked from ‘package:MASS’:

    select


The following object is masked from ‘package:stats’:

    filter


randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: ‘randomForest’


The following object is masked from ‘package:psych’:

    outlier


The following object is masked from ‘package:dplyr’:

    combine


The following object is masked from ‘package:ggplot2’:

    margin

For modelling, we train Naive Bayes, Random Forest, and Support Vector Machine (SVM) algorithms on three feature sets: the original data, our reduced data, and data pre-processed with caret's built-in PCA.

10-fold cross-validation is used to evaluate the models later on.

Upsampling is used to address the size imbalance between the 'Alive' and 'Dead' classes.
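Once set1 and set2 are built in the next cell, the effect of upsampling can be verified with a quick check (a sketch we add, not part of the original pipeline). Since upSample duplicates minority rows until the classes match, the 6816-sample figure reported below implies 3408 rows per class (and hence an original split of 3408 'Alive' to 616 'Dead'):

# Compare class counts before and after upsampling (run after the next cell)
table(data$Status)   # original counts: 'Alive' majority, 'Dead' minority
table(set1$Status)   # after upSample: both classes equal at 3408 rows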

# Shared resampling scheme: 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)  

# set1: upsampled original data; set2: upsampled reduced (MCA/PCA, scaled) data
set.seed(0)
set1 <- upSample(data %>% select(-Status), data$Status, yname='Status') 
set.seed(0)
set2 <- upSample(data_scaled %>% select(-Status), data_scaled$Status, yname='Status') 
set.seed(0)
nb_1 <- train(Status ~., data = set1, method = "naive_bayes", trControl = ctrl)
nb_1

set.seed(0)
nb_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "naive_bayes", trControl = ctrl)
nb_1p

set.seed(0)
nb_2 <- train(Status ~., data = set2, method = "naive_bayes", trControl = ctrl)
nb_2 
Naive Bayes 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.6506751  0.3013481
   TRUE      0.5963957  0.1927825

Tuning parameter 'laplace' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE
 and adjust = 1.



Naive Bayes 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

Pre-processing: principal component signal extraction (29), centered
 (29), scaled (29) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.6943899  0.3887863
   TRUE      0.7452929  0.4905871

Tuning parameter 'laplace' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = TRUE
 and adjust = 1.



Naive Bayes 

6816 samples
  16 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.7139037  0.4278098
   TRUE      0.7052600  0.4105174

Tuning parameter 'laplace' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE
 and adjust = 1.
set.seed(0)
rf_1 <- train(Status ~., data = set1, method = "rf", trControl = ctrl)
rf_1

set.seed(0)
rf_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "rf", trControl = ctrl)
rf_1p

set.seed(0)
rf_2 <- train(Status ~., data = set2, method = "rf", trControl = ctrl)
rf_2
Random Forest 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8148435  0.6296850
  15    0.9743221  0.9486443
  29    0.9694812  0.9389629

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.


Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
(warning repeated 9 times)



Random Forest 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

Pre-processing: principal component signal extraction (29), centered
 (29), scaled (29) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9772589  0.9545179
  15    0.9677195  0.9354391
  29    0.9655182  0.9310365

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.



Random Forest 

6816 samples
  16 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9630255  0.9260508
   9    0.9746171  0.9492340
  16    0.9718301  0.9436600

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 9.
set.seed(0)
svm_1 <- train(Status ~., data = set1, method = "svmPoly", trControl = ctrl)
svm_1

set.seed(0)
svm_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "svmPoly", trControl = ctrl)
svm_1p

set.seed(0)
svm_2 <- train(Status ~., data = set2, method = "svmPoly", trControl = ctrl)
svm_2
Support Vector Machines with Polynomial Kernel 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  degree  scale  C     Accuracy   Kappa    
  1       0.001  0.25  0.7602668  0.5205353
  1       0.001  0.50  0.7759653  0.5519309
  1       0.001  1.00  0.7818327  0.5636653
  1       0.010  0.25  0.7856459  0.5712925
  1       0.010  0.50  0.7891686  0.5783356
  1       0.010  1.00  0.7897541  0.5795070
  1       0.100  0.25  0.7904878  0.5809754
  1       0.100  0.50  0.7899011  0.5798018
  1       0.100  1.00  0.7906347  0.5812686
  2       0.001  0.25  0.7761125  0.5522252
  2       0.001  0.50  0.7819789  0.5639577
  2       0.001  1.00  0.7874074  0.5748151
  2       0.010  0.25  0.7928347  0.5856686
  2       0.010  0.50  0.7965035  0.5930063
  2       0.010  1.00  0.7989985  0.5979962
  2       0.100  0.25  0.8069222  0.6138445
  2       0.100  0.50  0.8082406  0.6164795
  2       0.100  1.00  0.8091188  0.6182365
  3       0.001  0.25  0.7822733  0.5645473
  3       0.001  0.50  0.7827151  0.5654295
  3       0.001  1.00  0.7879939  0.5759878
  3       0.010  0.25  0.7991445  0.5982870
  3       0.010  0.50  0.8029587  0.6059163
  3       0.010  1.00  0.8079499  0.6158996
  3       0.100  0.25  0.8524020  0.7048047
  3       0.100  0.50  0.8600305  0.7200620
  3       0.100  1.00  0.8661908  0.7323835

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.



Support Vector Machines with Polynomial Kernel 

6816 samples
  15 predictor
   2 classes: 'Alive', 'Dead' 

Pre-processing: principal component signal extraction (29), centered
 (29), scaled (29) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  degree  scale  C     Accuracy   Kappa    
  1       0.001  0.25  0.7904911  0.5809824
  1       0.001  0.50  0.7906351  0.5812685
  1       0.001  1.00  0.7912214  0.5824411
  1       0.010  0.25  0.7868185  0.5736349
  1       0.010  0.50  0.7901942  0.5803867
  1       0.010  1.00  0.7882876  0.5765735
  1       0.100  0.25  0.7884338  0.5768660
  1       0.100  0.50  0.7888756  0.5777497
  1       0.100  1.00  0.7885821  0.5771630
  2       0.001  0.25  0.7901952  0.5803887
  2       0.001  0.50  0.7909282  0.5818546
  2       0.001  1.00  0.7888743  0.5777471
  2       0.010  0.25  0.7954792  0.5909577
  2       0.010  0.50  0.7963588  0.5927170
  2       0.010  1.00  0.7959193  0.5918371
  2       0.100  0.25  0.7991471  0.5982949
  2       0.100  0.50  0.8023750  0.6047494
  2       0.100  1.00  0.8000264  0.6000515
  3       0.001  0.25  0.7903408  0.5806800
  3       0.001  0.50  0.7903414  0.5806809
  3       0.001  1.00  0.7888734  0.5777448
  3       0.010  0.25  0.7970917  0.5941831
  3       0.010  0.50  0.7967991  0.5935957
  3       0.010  1.00  0.8017892  0.6035760
  3       0.100  0.25  0.8362632  0.6725273
  3       0.100  0.50  0.8399306  0.6798623
  3       0.100  1.00  0.8443321  0.6886654

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.



Support Vector Machines with Polynomial Kernel 

6816 samples
  16 predictor
   2 classes: 'Alive', 'Dead' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ... 
Resampling results across tuning parameters:

  degree  scale  C     Accuracy   Kappa    
  1       0.001  0.25  0.7761134  0.5522271
  1       0.001  0.50  0.7844765  0.5689538
  1       0.001  1.00  0.7868196  0.5736393
  1       0.010  0.25  0.7856468  0.5712933
  1       0.010  0.50  0.7868222  0.5736435
  1       0.010  1.00  0.7888756  0.5777502
  1       0.100  0.25  0.7896089  0.5792171
  1       0.100  0.50  0.7894617  0.5789224
  1       0.100  1.00  0.7894625  0.5789238
  2       0.001  0.25  0.7840367  0.5680740
  2       0.001  0.50  0.7868198  0.5736396
  2       0.001  1.00  0.7850618  0.5701239
  2       0.010  0.25  0.7918103  0.5836191
  2       0.010  0.50  0.7960662  0.5921322
  2       0.010  1.00  0.7969455  0.5938915
  2       0.100  0.25  0.8004671  0.6009350
  2       0.100  0.50  0.8016410  0.6032816
  2       0.100  1.00  0.8022271  0.6044530
  3       0.001  0.25  0.7868230  0.5736466
  3       0.001  0.50  0.7844742  0.5689482
  3       0.001  1.00  0.7879963  0.5759904
  3       0.010  0.25  0.7948908  0.5897825
  3       0.010  0.50  0.7965047  0.5930103
  3       0.010  1.00  0.8019343  0.6038683
  3       0.100  0.25  0.8324492  0.6648965
  3       0.100  0.50  0.8368495  0.6736992
  3       0.100  1.00  0.8402260  0.6804537

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.

Model Evaluation

We evaluate our models by accuracy, comparing across algorithms, feature sets, and individual models.
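As a quick cross-check (our addition, not part of the original analysis), caret can summarise the same resampling results directly once all_model is assembled in the next cell:

# One-line summary of per-model Accuracy and Kappa distributions
summary(resamples(all_model))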

# Collect all trained models for joint resampling comparison
all_model <- list(nb_1=nb_1, nb_1p=nb_1p, nb_2=nb_2, svm_1=svm_1, svm_1p=svm_1p, svm_2=svm_2, rf_1=rf_1, rf_1p=rf_1p, rf_2=rf_2)

# Reshape the per-fold results to long format: one row per (fold, model, metric)
long_results <- resamples(all_model)$values %>% pivot_longer(!Resample, names_sep = '~', names_to = c('.value', 'Metric')) %>% 
  pivot_longer(names(all_model), names_to = 'Model' , values_to = 'Value')
# Mean Accuracy and Kappa per algorithm (the prefix before '_' in the model name)
long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
  mutate(ModelType = sub('_.*','', Model)) %>% 
  group_by(ModelType) %>%
  summarise_at(vars(-Resample, -Model), ~mean(.)) %>%
  arrange(desc(Accuracy))

long_results %>% mutate(ModelType = sub('_.*','', Model)) %>%
  ggplot(aes(x=ModelType, y=Value, fill=ModelType)) +
  geom_boxplot() +
  facet_wrap(~Metric)
A tibble: 3 × 3
ModelType Accuracy Kappa
<chr> <dbl> <dbl>
rf 0.9753994 0.9507987
svm 0.8502496 0.7005009
nb 0.7032906 0.4065817

[Figure: boxplots of Accuracy and Kappa by algorithm]

We can see that the algorithm with the highest accuracy is Random Forest, followed by SVM, and lastly Naive Bayes.

long_results %>% 
  filter(Metric == 'Accuracy') %>% 
  mutate(ModelType = sub('_.*','', Model)) %>% 
  t_test(Value ~ModelType) %>% 
  add_significance()
A rstatix_test: 3 × 10
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
<chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
Value nb rf 30 30 -34.04656 29.89943 1.79e-25 3.58e-25 ****
Value nb svm 30 30 -16.97885 39.68954 8.43e-20 8.43e-20 ****
Value rf svm 30 30 34.72177 33.67863 5.94e-28 1.78e-27 ****

From the t-test results above, our previous observation is statistically significant: every pairwise comparison has an adjusted p-value below 0.0001 (****).

long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
  mutate(FeatureType = sub('.*_','', Model)) %>% 
  group_by(FeatureType) %>%
  summarise_at(vars(-Resample, -Model), ~mean(.)) %>%
  arrange(desc(Accuracy))

long_results %>% mutate(FeatureType = sub('.*_','', Model)) %>%
  ggplot(aes(x=FeatureType, y=Value, fill=FeatureType)) +
  geom_boxplot() +
  facet_wrap(~Metric)
A tibble: 3 × 3
FeatureType Accuracy Kappa
<chr> <dbl> <dbl>
1p 0.8556280 0.7112568
2 0.8429156 0.6858325
1 0.8303960 0.6607920

[Figure: boxplots of Accuracy and Kappa by feature set]

Here we can see that, on average, the models using caret's PCA pre-processing have the highest accuracy, followed by our manually reduced data, then the original data.

long_results %>% 
  filter(Metric == 'Accuracy') %>% 
  mutate(FeatureType = sub('.*_','', Model)) %>% 
  t_test(Value ~FeatureType) %>% 
  add_significance()
A rstatix_test: 3 × 10
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
<chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
Value 1 1p 30 30 -0.8197950 52.41153 0.416 1 ns
Value 1 2 30 30 -0.3909963 55.18921 0.697 1 ns
Value 1p 2 30 30 0.4751066 57.32042 0.637 1 ns

However, the t-test results show that this observation is not statistically significant.

long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
  group_by(Model) %>%
  summarise_at(vars(-Resample), ~mean(.)) %>%
  arrange(desc(Accuracy))

long_results %>% ggplot(aes(x=Model, y=Value, fill=Model)) +
  geom_boxplot() +
  facet_wrap(~Metric)
A tibble: 9 × 3
Model Accuracy Kappa
<chr> <dbl> <dbl>
rf_1p 0.9772589 0.9545179
rf_2 0.9746171 0.9492340
rf_1 0.9743221 0.9486443
svm_1 0.8661908 0.7323835
svm_1p 0.8443321 0.6886654
svm_2 0.8402260 0.6804537
nb_1p 0.7452929 0.4905871
nb_2 0.7139037 0.4278098
nb_1 0.6506751 0.3013481

[Figure: boxplots of Accuracy and Kappa by individual model]

Here we can see that the Random Forest models have the highest accuracy, followed by SVM and Naive Bayes. The Random Forest models perform about the same as one another. Meanwhile, SVM on the original data and Naive Bayes on the PCA data have the highest accuracy within their respective groups.

long_results %>% 
  filter(Metric == 'Accuracy') %>% 
  t_test(Value ~Model) %>% 
  add_significance()
A rstatix_test: 36 × 10
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
<chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
Value nb_1 nb_1p 10 10 -13.1198526 15.347841 9.49e-10 9.49e-09 ****
Value nb_1 nb_2 10 10 -8.5190425 14.987547 3.96e-07 3.17e-06 ****
Value nb_1 rf_1 10 10 -72.8832000 13.923057 2.21e-19 7.96e-18 ****
Value nb_1 rf_1p 10 10 -81.2290603 10.144161 1.31e-15 4.19e-14 ****
Value nb_1 rf_2 10 10 -75.2114985 12.790714 2.59e-18 9.06e-17 ****
Value nb_1 svm_1 10 10 -35.7138208 17.520288 8.17e-18 2.78e-16 ****
Value nb_1 svm_1p 10 10 -27.2885189 15.553784 1.47e-14 4.56e-13 ****
Value nb_1 svm_2 10 10 -33.6827955 17.970672 1.08e-17 3.56e-16 ****
Value nb_1p nb_2 10 10 3.5839676 17.971095 2.00e-03 1.30e-02 *
Value nb_1p rf_1 10 10 -35.6211233 11.179135 7.27e-13 1.96e-11 ****
Value nb_1p rf_1p 10 10 -37.7367091 9.473827 1.20e-11 2.64e-10 ****
Value nb_1p rf_2 10 10 -36.1818961 10.627350 1.79e-12 4.30e-11 ****
Value nb_1p svm_1 10 10 -15.8697585 16.787838 1.54e-11 2.94e-10 ****
Value nb_1p svm_1p 10 10 -11.6732626 17.990573 7.91e-10 9.49e-09 ****
Value nb_1p svm_2 10 10 -13.0045904 15.710899 8.02e-10 9.49e-09 ****
Value nb_2 rf_1 10 10 -39.0753012 11.015532 3.62e-13 1.05e-11 ****
Value nb_2 rf_1p 10 10 -41.1988640 9.437332 5.66e-12 1.30e-10 ****
Value nb_2 rf_2 10 10 -39.6443930 10.503691 8.79e-13 2.20e-11 ****
Value nb_2 svm_1 10 10 -19.4804626 16.463260 8.44e-13 2.19e-11 ****
Value nb_2 svm_1p 10 10 -15.0546371 17.928978 1.29e-11 2.71e-10 ****
Value nb_2 svm_2 10 10 -16.8256639 15.350480 2.63e-11 4.73e-10 ****
Value rf_1 rf_1p 10 10 -1.2529569 12.688843 2.33e-01 9.00e-01 ns
Value rf_1 rf_2 10 10 -0.1050795 17.612935 9.18e-01 1.00e+00 ns
Value rf_1 svm_1 10 10 21.3111332 12.670798 2.65e-11 4.73e-10 ****
Value rf_1 svm_1p 10 10 20.6328671 11.277988 2.60e-10 3.90e-09 ****
Value rf_1 svm_2 10 10 29.2631862 13.596623 1.13e-13 3.39e-12 ****
Value rf_1p rf_2 10 10 1.2702148 13.800530 2.25e-01 9.00e-01 ns
Value rf_1p svm_1 10 10 23.5780454 9.820894 5.67e-10 7.37e-09 ****
Value rf_1p svm_1p 10 10 22.1123499 9.495995 1.73e-09 1.56e-08 ****
Value rf_1p svm_2 10 10 32.8096435 10.055947 1.47e-11 2.94e-10 ****
Value rf_2 svm_1 10 10 21.8712517 11.777272 6.71e-11 1.07e-09 ****
Value rf_2 svm_1p 10 10 20.9908981 10.702255 4.85e-10 6.79e-09 ****
Value rf_2 svm_2 10 10 30.1790475 12.520781 4.51e-13 1.26e-11 ****
Value svm_1 svm_1p 10 10 2.9109250 16.962301 1.00e-02 4.90e-02 *
Value svm_1 svm_2 10 10 4.2290527 17.719075 5.20e-04 4.00e-03 **
Value svm_1p svm_2 10 10 0.5713722 15.914702 5.76e-01 1.00e+00 ns

From the t-test results above, the differences in accuracy among the Random Forest models are not statistically significant. Meanwhile, the results show that SVM with the original data has significantly higher accuracy than the other SVM models at the 95% confidence level, and Naive Bayes with PCA data likewise has significantly higher accuracy than the other Naive Bayes models.

In conclusion, the Random Forest algorithm works better on this dataset than SVM and Naive Bayes.