Group Member - Dhia Syahmie Bin Muhammad Sukor S2147929 - Liu Zening S2179213 - Muhammad Shakyr Bin Rosman S2152185 - Raimi Bin Ridzuan S2186754 - Zheng Jiangmeng S2174210
Breast cancer is one of the most common cancers, often referred to as the “pink killer”, and ranks first in the incidence of malignant tumors in women. The World Health Organization(WHO) reports that 685000 people worldwide died from breast cancer in 2020. Therefore, it is important to study breast cancer mortality and the suivival time of patients, as it provides insight into the causes of death from breast cancer and informs efforts to reduce the number of deaths from the disease.
This study may provide important information about differences in breast cancer outcomes in different population groups. For example, studies may show that certain racial or ethnic groups have higher breast cancer mortality rates, which may indicate that targeted interventions are needed to address these disparities.
Overall, studying breast cancer mortality and how long patients can survive is important for understanding the disease and the best ways to reduce the number of deaths caused by it. It also helps to identify the needs and priorities for research to further improve the survival rate and reduce the burden of this disease.
The research aims to provide:
Age: the age of female patients ranged from 30 to 69 years. The mean age of the patients was 53.97 years.
Races: White, Black and Other. The majority of patients belonged to the white race, while a few patients were identified as black or other races.
Marital status: Married, Single, Divorced, Widowed, and Separated. Most patients diagnosed with cancer were married.
T stage: the size of tumor. T1, T2, T3, and T4.
N stage: the extent of cancer spread to the lymph nodes. N1, N2, and N3.
Sixth stage: the stage grouping of breast cancer. Based on a combination of T stage, N stage, and M stage(not available in the dataset). The five categories of stage VI and their proportions in the dataset are IIA, IIB, IIIA, IIIB, and IIIC.
Differentiate or grade: how well the tumor resenbles the normal tissue. > * Grade I - Well differentiated > * Grade II - Moderately differentiated > * Grade III - Poorly differentiated > * Grade IV - Undifferentiated
A stage: Regional and Distant. Regional means the spread of cancer is localized. Distant corresponds to the spread of cancer to distant parts of the body.
Tumor size(mm): the size of a tumor in mm.
Survival months: the average value is 71.3 months. The range of this is between 1 and 107.
Dataset: https://ieee-dataport.org/open-access/seer-breast-cancer-data JING TENG, January 18, 2019
install.packages("dplyr")
install.packages("gmodels")
install.packages("psych")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘gtools’, ‘gdata’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependency ‘mnormt’
library(tidyverse)
library(dplyr)
library(psych)
library(gmodels)
library(ggplot2)
library(tidyr)
Warning message in system("timedatectl", intern = TRUE):
“running command 'timedatectl' had status 1”
── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.1 ──
[32m✔[39m [34mggplot2[39m 3.4.0 [32m✔[39m [34mpurrr [39m 1.0.1
[32m✔[39m [34mtibble [39m 3.1.8 [32m✔[39m [34mstringr[39m 1.4.1
[32m✔[39m [34mreadr [39m 2.1.3 [32m✔[39m [34mforcats[39m 0.5.2
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m masks [34mstats[39m::lag()
Attaching package: ‘psych’
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
df <- read.csv("/content/sample_data/Uncleaned_Datav1.csv")
As can be seen in the bar chart below, we can see that each race is repeated twice due to the inconsistent of category name.
race <- table(df$Race)
barplot(race, main="Race Distribution",
xlab="Race", ylab ="Frequencies")
table(df$Race)
Black Blacks Other otherS wHiTe White
180 118 241 81 1039 2418
png
To tackle inconsistencies in the naming convention which might be due to the combination of two data sources, data smoothing is required by using replacement method
#Replace "Blacks" = Black, "otherS" = "Other" and "wHiTe" <- "White"
df$Race[df$Race == "Blacks"] <- "Black"
df$Race[df$Race == "otherS"] <- "Other"
df$Race[df$Race == "wHiTe"] <- "White"
After data smoothing
race <- table(df$Race)
barplot(race, main="Race Distribution",
xlab="Race", ylab ="Frequencies")
table(df$Race)
Black Other White
298 322 3457
png
As can be seen in the bar chart below there are Missing Value called Unknown which has 25 occurences in the data
MaritalStat <- table(df$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
xlab="Marital", ylab="Frequencies")
table(df$Marital.Status)
Divorced Married Separated Single Unknown Widowed
488 2662 45 620 25 237
png
Since the unknown value is 25 occurences, the percentage of the unknown value is 25/4077x100: 0.6132 % of the whole dataset. The optimal solution is by performing Data Reduction which the unknown value will be dropped and it is not resulted in loss of significant information.
df <- subset(df, Marital.Status != "Unknown")
After Data Reduction
MaritalStat <- table(df$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
xlab="Marital", ylab="Frequencies")
table(df$Marital.Status)
Divorced Married Separated Single Widowed
488 2662 45 620 237
png
From the Age Histogram Below, there are existent of outliers where some observation have their age equal to “0”. The range of this unbelievabiltiy age is from -1 to 4. The percentage noisiness data in the Age Attribute is 28/4052 x 100 : 0.69 %, therefore it can be eliminated by using Data Reduction as it not effect significant loss of information
hist(df$Age, breaks = 30, main = "Age Histogram", xlab = "Age", ylab = "Frequency")
count <- sum(df$Age >= -1 & df$Age <=5)
cat("The count of noisiness data in Age Attribtutes is", count)
The count of noisiness data in Age Attribtutes is 28
png
After Data Reduction Method
df_filtered <- subset(df, !(Age >= -1 & Age <= 6))
hist(df_filtered$Age, breaks = 30, main = "Age Histogram", xlab = "Age", ylab = "Frequency")
png
From the bar chart below, there are presence of missing value which accumulated 733 correspond to 18.23%. Dropping these missing values can be impratical because the missing values are significantly high. This missing values will be replaced with the mode which is “Moderately differentiated” by using Data Imputation Method.
differentiate <- table(df$differentiate)
barplot(differentiate, main="Differentiate Distribution",
xlab="Differentiate", ylab="Frequencies")
table(df$differentiate)
Moderately differentiated Poorly differentiated Undifferentiated
1632 1119 19
Unknown Well differentiated
733 549
png
After Data Imputation
df$differentiate[df$differentiate == "Unknown"] <- "Moderately differentiated"
differentiate <- table(df$differentiate)
barplot(differentiate, main="Differentiate Distribution",
xlab="Differentiate", ylab="Frequencies")
table(df$differentiate)
Moderately differentiated Poorly differentiated Undifferentiated
2365 1119 19
Well differentiated
549
png
library(dplyr)
library(tidyr)
# Create a function to identify outliers
find_outliers <- function(x) {
qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
H <- 1.5 * IQR(x, na.rm = TRUE)
y <- x[x < qnt[1] - H | x > qnt[2] + H]
return(length(y))
}
# Use the function to find outliers in all columns of the dataframe
total_outliers <- df %>%
select_if(is.numeric) %>%
summarise_all(find_outliers) %>%
gather() %>%
group_by(key) %>%
summarise(total_outliers = sum(value))
colnames(total_outliers) <- c("Variable", "Outliers")
total_outliers
cat("the total outliers are :",sum(total_outliers$Outliers))
| Variable | Outliers |
|---|---|
| <chr> | <int> |
| Age | 28 |
| Reginol.Node.Positive | 346 |
| Regional.Node.Examined | 73 |
| Survival.Months | 19 |
| Tumor.Size | 222 |
the total outliers are : 688
Since the total outliers is quite high, which more than 15% of the total observations. Removing 15% or more of the row values may significantly affect the final predictions. Hence the outliers are not removed for machine learning modeling.
BreastCancer <- read.csv("/content/sample_data/Breast_Cancer.csv")
describe(BreastCancer)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| Age | 1 | 4024 | 53.972167 | 8.9631344 | 54 | 54.245031 | 10.3782 | 30 | 69 | 39 | -0.2202085 | -0.75675566 | 0.141296343 |
| Race* | 2 | 4024 | 2.775845 | 0.5644652 | 3 | 2.935093 | 0.0000 | 1 | 3 | 2 | -2.4052242 | 4.38936802 | 0.008898323 |
| Marital.Status* | 3 | 4024 | 2.371272 | 1.0636375 | 2 | 2.266149 | 0.0000 | 1 | 5 | 4 | 1.1459202 | 0.36387786 | 0.016767358 |
| T.Stage* | 4 | 4024 | 1.784791 | 0.7655309 | 2 | 1.699379 | 1.4826 | 1 | 4 | 3 | 0.7246804 | 0.06106612 | 0.012067957 |
| N.Stage* | 5 | 4024 | 1.438370 | 0.6934794 | 1 | 1.298137 | 0.0000 | 1 | 3 | 2 | 1.2761763 | 0.21228692 | 0.010932125 |
| X6th.Stage* | 6 | 4024 | 2.321819 | 1.2666239 | 2 | 2.152484 | 1.4826 | 1 | 5 | 4 | 0.8144155 | -0.19692684 | 0.019967270 |
| differentiate* | 7 | 4024 | 1.690358 | 1.0164183 | 1 | 1.488199 | 0.0000 | 1 | 4 | 3 | 1.4181264 | 0.73747790 | 0.016022987 |
| Grade* | 8 | 4024 | 3.131710 | 0.6423983 | 3 | 3.170497 | 0.0000 | 1 | 4 | 3 | -0.2335884 | -0.20517763 | 0.010126874 |
| A.Stage* | 9 | 4024 | 1.977137 | 0.1494847 | 2 | 2.000000 | 0.0000 | 1 | 2 | 1 | -6.3821741 | 38.74177405 | 0.002356501 |
| Tumor.Size | 10 | 4024 | 30.473658 | 21.1196961 | 25 | 27.110248 | 14.8260 | 1 | 140 | 139 | 1.7384530 | 3.62490859 | 0.332934406 |
| Estrogen.Status* | 11 | 4024 | 1.933151 | 0.2497912 | 2 | 2.000000 | 0.0000 | 1 | 2 | 1 | -3.4672437 | 10.02426991 | 0.003937750 |
| Progesterone.Status* | 12 | 4024 | 1.826541 | 0.3786909 | 2 | 1.908075 | 0.0000 | 1 | 2 | 1 | -1.7241483 | 0.97292932 | 0.005969746 |
| Regional.Node.Examined | 13 | 4024 | 14.357107 | 8.0996748 | 14 | 13.854348 | 7.4130 | 1 | 61 | 60 | 0.8286556 | 1.64347439 | 0.127684622 |
| Reginol.Node.Positive | 14 | 4024 | 4.158052 | 5.1093311 | 2 | 2.961180 | 1.4826 | 1 | 46 | 45 | 2.7005214 | 8.96267158 | 0.080544347 |
| Survival.Months | 15 | 4024 | 71.297962 | 22.9214295 | 73 | 72.970497 | 25.2042 | 1 | 107 | 106 | -0.5895585 | 0.01696196 | 0.361337232 |
| Status* | 16 | 4024 | 1.153082 | 0.3601108 | 1 | 1.066460 | 0.0000 | 1 | 2 | 1 | 1.9262531 | 1.71087644 | 0.005676847 |
x1 <- BreastCancer$Tumor.Size
TumorSize1 <- hist(x1, col = "Orange", xlab = "Tumor Size", ylab = "Frequencies",
main = "Tumor Size Histogram")
x2 <- BreastCancer$Age
Age1 <- hist(x2, col = "Orange", xlab = "Age ", ylab = "Frequencies",
main = " Age Distribution Histogram ")
x3 <- BreastCancer$Regional.Node.Examined
Regional.N.E1 <- hist(x3, col = "Orange", xlab = "Regional Node Examined ", ylab = "Frequencies",
main = " Regional Node Examined Distribution Histogram ")
x4 <- BreastCancer$Reginol.Node.Positive
Regional.N.E1 <- hist(x4, col = "Orange", xlab = "Regional Node Positive Examined ", ylab = "Frequencies",
main = " Regional Node Positive Distribution Histogram ")
x5 <- BreastCancer$Survival.Months
SurvivalMonth1 <- hist(x5, col = "Orange", xlab = "Survival Month ", ylab = "Frequencies",
main = " Survival Month Distribution Histogram ")
png
png
png
png
png
race <- table(BreastCancer$Race)
barplot(race, main="Race Distribution",
xlab="Race", ylab ="Frequencies")
MaritalStat <- table(BreastCancer$Marital.Status)
barplot(MaritalStat, main="Marital Status Distribution",
xlab="Marital", ylab="Frequencies")
T.stage <- table(BreastCancer$T.Stage)
barplot(T.stage, main="T.Stage Distribution",
xlab="T.Stage", ylab="Frequencies")
N.stage <- table(BreastCancer$N.Stage)
barplot(N.stage, main="N.Stage Distribution",
xlab="N.Stage", ylab="Frequencies")
X6th.stage <- table(BreastCancer$X6th.Stage)
barplot(X6th.stage, main="X6th.Stage Distribution",
xlab="N.Stage", ylab="Frequencies")
differentiate <- table(BreastCancer$differentiate)
barplot(differentiate, main="Differentiate Distribution",
xlab="Differentiate", ylab="Frequencies")
Grade <- table(BreastCancer$Grade)
barplot(Grade, main="Grade Distribution",
xlab="Grade", ylab="Frequencies")
A.Stage <- table(BreastCancer$A.Stage)
barplot(A.Stage, main="A.Stage Distribution",
xlab="A.Stage", ylab="Frequencies")
Estrogen.Stat <- table(BreastCancer$Estrogen.Status)
barplot(Estrogen.Stat, main="Estrogen Status Distribution",
xlab="Estrogen Status", ylab="Frequencies")
Progestrone.Stat <- table(BreastCancer$Progesterone.Status)
barplot(Progestrone.Stat, main="Progestrone Status Distribution",
xlab="Progestrone Status", ylab="Frequencies")
png
png
png
png
png
png
png
png
png
png
BreastCancer %>% pivot_longer(cols=c('Survival.Months', 'Reginol.Node.Positive', 'Regional.Node.Examined', 'Tumor.Size', 'Age'), names_to = 'Column', values_to = 'Value') %>%
ggplot(aes(x=Column, y=Value, fill=Column)) +
geom_boxplot() +
facet_wrap(~Column, scales = 'free')
png
BreastCancerv1 <- BreastCancer
Race1 = factor(BreastCancerv1$Race, levels = c('White', 'Black', 'Other'))
T.Stage1 = factor(BreastCancerv1$T.Stage, levels =c('T1', 'T2', 'T3', 'T4'))
N.Stage1 = factor(BreastCancerv1$N.Stage, levels = c('N1', 'N2', 'N3'))
X6th.Stage1 = factor(BreastCancerv1$X6th.Stage, levels = c('IIA', 'IIB', 'IIIA', 'IIIB', 'IIIC'))
differentiate1 = factor(BreastCancerv1$differentiate, levels = c('Well differentiated', 'Moderately differentiated', 'Poorly differentiated', 'Undifferentiated'))
Grade1 = factor(BreastCancerv1$Grade, levels = c('1', '2', '3', 'anaplastic; Grade IV'))
A.Stage1 = factor(BreastCancerv1$A.Stage, levels = c('Regional', 'Distant'))
Estrogen.Stat1 = factor(BreastCancerv1$Estrogen.Status, levels = c('Positive', 'Negative'))
Status1 = factor(BreastCancerv1$Status, levels = c('High Survival', 'Non - Survival'))
cat_vars <- sapply(X = BreastCancerv1, function(x) is.character(x))
char_vars <- BreastCancerv1[cat_vars]
tables_list<- list()
# Loop through all columns of the data frame
for (col in names(char_vars)) {
if(col != "Status"){
# Create a cross tabulation table with respect to the target variable
tab <- table(char_vars[, col], char_vars$Status)
# Add the cross tabulation table to the list
tables_list[[col]] <- tab
}
}
View(tables_list)
$Race
Alive Dead
Black 218 73
Other 287 33
White 2903 510
$Marital.Status
Alive Dead
Divorced 396 90
Married 2285 358
Separated 30 15
Single 511 104
Widowed 186 49
$T.Stage
Alive Dead
T1 1446 157
T2 1483 303
T3 417 116
T4 62 40
$N.Stage
Alive Dead
N1 2462 270
N2 655 165
N3 291 181
$X6th.Stage
Alive Dead
IIA 1209 96
IIB 995 135
IIIA 866 184
IIIB 47 20
IIIC 291 181
$differentiate
Alive Dead
Moderately differentiated 2046 305
Poorly differentiated 848 263
Undifferentiated 10 9
Well differentiated 504 39
$Grade
Alive Dead
anaplastic; Grade IV 10 9
1 504 39
2 2046 305
3 848 263
$A.Stage
Alive Dead
Distant 57 35
Regional 3351 581
$Estrogen.Status
Alive Dead
Negative 161 108
Positive 3247 508
$Progesterone.Status
Alive Dead
Negative 494 204
Positive 2914 412
The chi-square (χ2) statistics is a way to check the relationship between two categorical nominal variables. In this data exploration, we explored the relationship between input categorical features in dataset with the target variable which is “Status”. Our null hypothesis stated that the input categorical features is indepent from the target variable (Status) and our hypothesis that the input categorical variable is dependent on the target variable (Status). From the result below the p value for all relation between input categorical variable with target variable is less than <0.005. Therefore we reject the null hypothesis
chi_results_list <- lapply(tables_list, chisq.test)
View(chi_results_list)
Warning message in FUN(X[[i]], ...):
“Chi-squared approximation may be incorrect”
Warning message in FUN(X[[i]], ...):
“Chi-squared approximation may be incorrect”
$Race
Pearson's Chi-squared test
data: X[[i]]
X-squared = 27.97, df = 2, p-value = 8.441e-07
$Marital.Status
Pearson's Chi-squared test
data: X[[i]]
X-squared = 28.264, df = 4, p-value = 1.103e-05
$T.Stage
Pearson's Chi-squared test
data: X[[i]]
X-squared = 103.48, df = 3, p-value < 2.2e-16
$N.Stage
Pearson's Chi-squared test
data: X[[i]]
X-squared = 269.93, df = 2, p-value < 2.2e-16
$X6th.Stage
Pearson's Chi-squared test
data: X[[i]]
X-squared = 281.65, df = 4, p-value < 2.2e-16
$differentiate
Pearson's Chi-squared test
data: X[[i]]
X-squared = 112.56, df = 3, p-value < 2.2e-16
$Grade
Pearson's Chi-squared test
data: X[[i]]
X-squared = 112.56, df = 3, p-value < 2.2e-16
$A.Stage
Pearson's Chi-squared test with Yates' continuity correction
data: X[[i]]
X-squared = 35.765, df = 1, p-value = 2.226e-09
$Estrogen.Status
Pearson's Chi-squared test with Yates' continuity correction
data: X[[i]]
X-squared = 135.16, df = 1, p-value < 2.2e-16
$Progesterone.Status
Pearson's Chi-squared test with Yates' continuity correction
data: X[[i]]
X-squared = 124.89, df = 1, p-value < 2.2e-16
T test is used to draw a comparison or find the difference between one categorical (with two categories) and another continuous variable. The continous variable must be normally distributed.Therefore to normalize each continous variable we employed Z-score method. A z-score, also known as a standard score, is a measure of how many standard deviations a data point is from the mean of the data set.
# Create logical vectors indicating which columns are integers or character type
is_int <- sapply(BreastCancerv1, is.integer)
is_char <- sapply(BreastCancerv1, is.character)
# Remove all columns with character values, except the Status one
BreastCancerv2 <- BreastCancerv1[!(is_char & seq_along(BreastCancerv1) != 16) | is_int]
BreastCancerv2
| Age | Tumor.Size | Regional.Node.Examined | Reginol.Node.Positive | Survival.Months | Status |
|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | <chr> |
| 68 | 4 | 24 | 1 | 60 | Alive |
| 50 | 35 | 14 | 5 | 62 | Alive |
| 58 | 63 | 14 | 7 | 75 | Alive |
| 58 | 18 | 2 | 1 | 84 | Alive |
| 47 | 41 | 3 | 1 | 50 | Alive |
| 51 | 20 | 18 | 2 | 89 | Alive |
| 51 | 8 | 11 | 1 | 54 | Alive |
| 40 | 30 | 9 | 1 | 14 | Dead |
| 40 | 103 | 20 | 18 | 70 | Alive |
| 69 | 32 | 21 | 12 | 92 | Alive |
| 68 | 13 | 9 | 1 | 64 | Dead |
| 46 | 59 | 11 | 3 | 92 | Alive |
| 65 | 35 | 13 | 3 | 56 | Alive |
| 48 | 15 | 23 | 7 | 38 | Alive |
| 62 | 35 | 16 | 14 | 64 | Alive |
| 61 | 19 | 20 | 1 | 49 | Alive |
| 56 | 46 | 1 | 1 | 105 | Alive |
| 43 | 24 | 22 | 1 | 62 | Alive |
| 48 | 25 | 16 | 1 | 107 | Alive |
| 60 | 29 | 20 | 1 | 77 | Alive |
| 48 | 30 | 15 | 2 | 81 | Alive |
| 57 | 40 | 15 | 5 | 50 | Alive |
| 55 | 29 | 4 | 1 | 78 | Alive |
| 48 | 70 | 18 | 1 | 102 | Alive |
| 62 | 20 | 26 | 22 | 98 | Alive |
| 63 | 22 | 31 | 17 | 70 | Alive |
| 48 | 50 | 25 | 23 | 102 | Alive |
| 46 | 17 | 14 | 1 | 82 | Alive |
| 57 | 25 | 14 | 4 | 64 | Alive |
| 66 | 21 | 10 | 1 | 86 | Alive |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 51 | 2 | 19 | 2 | 50 | Alive |
| 62 | 25 | 17 | 14 | 69 | Alive |
| 69 | 19 | 21 | 9 | 88 | Dead |
| 35 | 19 | 15 | 2 | 56 | Alive |
| 68 | 18 | 11 | 1 | 100 | Alive |
| 53 | 30 | 12 | 5 | 85 | Alive |
| 31 | 45 | 13 | 2 | 63 | Alive |
| 55 | 45 | 16 | 1 | 24 | Alive |
| 57 | 15 | 5 | 1 | 61 | Alive |
| 66 | 19 | 21 | 7 | 64 | Dead |
| 58 | 24 | 19 | 9 | 65 | Dead |
| 68 | 40 | 14 | 2 | 76 | Alive |
| 41 | 35 | 18 | 2 | 66 | Alive |
| 44 | 21 | 10 | 1 | 75 | Alive |
| 60 | 18 | 19 | 2 | 52 | Alive |
| 53 | 100 | 20 | 1 | 107 | Alive |
| 59 | 23 | 6 | 1 | 27 | Dead |
| 62 | 25 | 24 | 4 | 100 | Alive |
| 51 | 80 | 20 | 12 | 79 | Alive |
| 68 | 15 | 10 | 1 | 102 | Alive |
| 40 | 68 | 22 | 2 | 86 | Alive |
| 65 | 40 | 4 | 4 | 68 | Alive |
| 54 | 50 | 29 | 2 | 52 | Alive |
| 46 | 28 | 2 | 1 | 19 | Dead |
| 64 | 10 | 11 | 1 | 70 | Alive |
| 62 | 9 | 1 | 1 | 49 | Alive |
| 56 | 46 | 14 | 8 | 69 | Alive |
| 68 | 22 | 11 | 3 | 69 | Alive |
| 58 | 44 | 11 | 1 | 72 | Alive |
| 46 | 30 | 7 | 2 | 100 | Alive |
install.packages("magrittr") # package installations are only needed the first time you use it
install.packages("dplyr") # alternative installation of the %>%
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr) # alternatively, this also loads %>%
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Attaching package: ‘magrittr’
The following object is masked from ‘package:purrr’:
set_names
The following object is masked from ‘package:tidyr’:
extract
# Select all columns that are of type 'numeric'
numerical_columns <- BreastCancerv2 %>% select_if(is.numeric)
# Select all columns that are of type 'character'
character_columns <- BreastCancerv2 %>% select_if(is.character)
# Normalize the numerical columns using Z-score
numerical_columns <- scale(numerical_columns,center = TRUE , scale = TRUE)
# Bind the normalized numerical columns with the character columns
BreastCancerv2 <- bind_cols(character_columns, numerical_columns)
View(BreastCancerv2)
| Status | Age | Tumor.Size | Regional.Node.Examined | Reginol.Node.Positive | Survival.Months |
|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Alive | 1.5650589 | -1.25350563 | 1.19052837 | -0.61809494 | -0.49289955 |
| Alive | -0.4431672 | 0.21431852 | -0.04408910 | 0.16478641 | -0.40564495 |
| Alive | 0.4493777 | 1.54009517 | -0.04408910 | 0.55622708 | 0.16150990 |
| Alive | 0.4493777 | -0.59061731 | -1.52563006 | -0.61809494 | 0.55415557 |
| Alive | -0.7778715 | 0.49841351 | -1.40216831 | -0.61809494 | -0.92917251 |
| Alive | -0.3315991 | -0.49591898 | 0.44975789 | -0.42237460 | 0.77229205 |
| Alive | -0.3315991 | -1.06410897 | -0.41447434 | -0.61809494 | -0.75466332 |
| Dead | -1.5588483 | -0.02242731 | -0.66139783 | -0.61809494 | -2.49975518 |
| Alive | -1.5588483 | 3.43406182 | 0.69668138 | 2.70915076 | -0.05662658 |
| Alive | 1.6766270 | 0.07227102 | 0.82014313 | 1.53482875 | 0.90317394 |
| Dead | 1.5650589 | -0.82736314 | -0.66139783 | -0.61809494 | -0.31839036 |
| Alive | -0.8894396 | 1.35069851 | -0.41447434 | -0.22665426 | 0.90317394 |
| Alive | 1.2303545 | 0.21431852 | -0.16755084 | -0.22665426 | -0.66740873 |
| Alive | -0.6663034 | -0.73266481 | 1.06706662 | 0.55622708 | -1.45270007 |
| Alive | 0.8956502 | 0.21431852 | 0.20283440 | 1.92626942 | -0.31839036 |
| Alive | 0.7840821 | -0.54326814 | 0.69668138 | -0.61809494 | -0.97279981 |
| Alive | 0.2262415 | 0.73515935 | -1.64909181 | -0.61809494 | 1.47032879 |
| Alive | -1.2241440 | -0.30652231 | 0.94360488 | -0.61809494 | -0.40564495 |
| Alive | -0.6663034 | -0.25917314 | 0.20283440 | -0.61809494 | 1.55758338 |
| Alive | 0.6725140 | -0.06977648 | 0.69668138 | -0.61809494 | 0.24876449 |
| Alive | -0.6663034 | -0.02242731 | 0.07937265 | -0.42237460 | 0.42327368 |
| Alive | 0.3378096 | 0.45106435 | 0.07937265 | 0.16478641 | -0.92917251 |
| Alive | 0.1146734 | -0.06977648 | -1.27870657 | -0.61809494 | 0.29239179 |
| Alive | -0.6663034 | 1.87153933 | 0.44975789 | -0.61809494 | 1.33944690 |
| Alive | 0.8956502 | -0.49591898 | 1.43745186 | 3.49203210 | 1.16493772 |
| Alive | 1.0072183 | -0.40122064 | 2.05476060 | 2.51343043 | -0.05662658 |
| Alive | -0.6663034 | 0.92455601 | 1.31399012 | 3.68775244 | 1.33944690 |
| Alive | -0.8894396 | -0.63796647 | -0.04408910 | -0.61809494 | 0.46690097 |
| Alive | 0.3378096 | -0.25917314 | -0.04408910 | -0.03093393 | -0.31839036 |
| Alive | 1.3419226 | -0.44856981 | -0.53793609 | -0.61809494 | 0.64141016 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Alive | -0.331599066 | -1.34820397 | 0.57321964 | -0.42237460 | -0.92917251 |
| Alive | 0.895650186 | -0.25917314 | 0.32629614 | 1.92626942 | -0.10025388 |
| Dead | 1.676626983 | -0.54326814 | 0.82014313 | 0.94766775 | 0.72866475 |
| Alive | -2.116688887 | -0.54326814 | 0.07937265 | -0.42237460 | -0.66740873 |
| Alive | 1.565058869 | -0.59061731 | -0.41447434 | -0.61809494 | 1.25219231 |
| Alive | -0.108462838 | -0.02242731 | -0.29101259 | 0.16478641 | 0.59778286 |
| Alive | -2.562961342 | 0.68781018 | -0.16755084 | -0.42237460 | -0.36201766 |
| Alive | 0.114673389 | 0.68781018 | 0.20283440 | -0.61809494 | -2.06348222 |
| Alive | 0.337809617 | -0.73266481 | -1.15524482 | -0.61809494 | -0.44927225 |
| Dead | 1.341922641 | -0.54326814 | 0.82014313 | 0.55622708 | -0.31839036 |
| Dead | 0.449377731 | -0.30652231 | 0.57321964 | 0.94766775 | -0.27476306 |
| Alive | 1.565058869 | 0.45106435 | -0.04408910 | -0.42237460 | 0.20513720 |
| Alive | -1.447280204 | 0.21431852 | 0.44975789 | -0.42237460 | -0.23113577 |
| Alive | -1.112575863 | -0.44856981 | -0.53793609 | -0.61809494 | 0.16150990 |
| Alive | 0.672513958 | -0.59061731 | 0.57321964 | -0.42237460 | -0.84191792 |
| Alive | -0.108462838 | 3.29201432 | 0.69668138 | -0.61809494 | 1.55758338 |
| Dead | 0.560945845 | -0.35387148 | -1.03178307 | -0.61809494 | -1.93260033 |
| Alive | 0.895650186 | -0.25917314 | 1.19052837 | -0.03093393 | 1.25219231 |
| Alive | -0.331599066 | 2.34503100 | 0.69668138 | 1.53482875 | 0.33601908 |
| Alive | 1.565058869 | -0.73266481 | -0.53793609 | -0.61809494 | 1.33944690 |
| Alive | -1.558848318 | 1.77684100 | 0.94360488 | -0.42237460 | 0.64141016 |
| Alive | 1.230354528 | 0.45106435 | -1.27870657 | -0.03093393 | -0.14388118 |
| Alive | 0.003105276 | 0.92455601 | 1.80783710 | -0.42237460 | -0.84191792 |
| Dead | -0.889439635 | -0.11712565 | -1.52563006 | -0.61809494 | -2.28161870 |
| Alive | 1.118786414 | -0.96941064 | -0.41447434 | -0.61809494 | -0.05662658 |
| Alive | 0.895650186 | -1.01675980 | -1.64909181 | -0.61809494 | -0.97279981 |
| Alive | 0.226241503 | 0.73515935 | -0.04408910 | 0.75194741 | -0.10025388 |
| Alive | 1.565058869 | -0.40122064 | -0.41447434 | -0.22665426 | -0.10025388 |
| Alive | 0.449377731 | 0.64046101 | -0.41447434 | -0.61809494 | 0.03062801 |
| Alive | -0.889439635 | -0.02242731 | -0.90832133 | -0.42237460 | 1.25219231 |
From the T-test result we have seen that the p-value for each continous variable in respect with the target variable is less than < 0.005. Therefore we accept the hypothesis where the input continous variable is dependent with the target variable (status) and reject the null hypothesis
# Create an empty list to store the test results
Tresults_list <- list()
# Loop through all columns of the data frame
for (col in names(BreastCancerv2)) {
if(col != "Status"){
# Conduct T-test on current column with respect to target variable
Ttest_result <- t.test(BreastCancerv2[, col] ~ BreastCancerv2$Status)
# Extract the p-value and test statistic from the test results
p_value <- Ttest_result$p.value
t_statistic <- Ttest_result$statistic
# Add the p-value and test statistic to the results list
Tresults_list[[col]] <- list(p_value = p_value,t_statistic = t_statistic)
}
}
View(Tresults_list)
<dd>0.000930907837822076</dd>
<dt>$t_statistic</dt>
<dd><strong>t:</strong> -3.32289602486267</dd>
<dt>$Tumor.Size</dt>
<dd><dl>
<dt>$p_value</dt>
<dd>7.01956816737513e-14</dd>
<dt>$t_statistic</dt>
<dd><strong>t:</strong> -7.62653452543241</dd>
<dt>$Regional.Node.Examined</dt>
<dd><dl>
<dt>$p_value</dt>
<dd>0.0341129729107396</dd>
<dt>$t_statistic</dt>
<dd><strong>t:</strong> -2.12225993465628</dd>
<dt>$Reginol.Node.Positive</dt>
<dd><dl>
<dt>$p_value</dt>
<dd>3.44036275174707e-30</dd>
<dt>$t_statistic</dt>
<dd><strong>t:</strong> -11.9770766977036</dd>
<dt>$Survival.Months</dt>
<dd><dl>
<dt>$p_value</dt>
<dd>1.19257066205458e-129</dd>
<dt>$t_statistic</dt>
<dd><strong>t:</strong> 29.702852507452</dd>
ANOVA is helpful for testing three or more levels in the category variable. It is similar to multiple two-sample t-tests. However, it results in fewer type I errors and is appropriate for a range of issues. ANOVA groups differences by comparing the means of each group and includes spreading out the variance into diverse sources. For this data exploration we assumed the category variable is Marital Status, Race,T.Stage,N.stage,X6th Stage, differentiate, Grade because it has 3 or more levels.
BreastCancerv3 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 1) | is_char]
BreastCancerv3 <- BreastCancerv3[-c(9:12)]
aov.models = BreastCancerv3[ , -grep("Age", names(BreastCancerv3))] %>%
map(~ aov(BreastCancerv3$Age ~ .x))
aov.models
$Race
Call:
aov(formula = BreastCancerv3$Age ~ .x)
Terms:
.x Residuals
Sum of Squares 3064.1 320134.8
Deg. of Freedom 2 4021
Residual standard error: 8.922764
Estimated effects may be unbalanced
$Marital.Status
Call:
aov(formula = BreastCancerv3$Age ~ .x)
Terms:
.x Residuals
Sum of Squares 17556.43 305642.45
Deg. of Freedom 4 4019
Residual standard error: 8.72063
Estimated effects may be unbalanced
$T.Stage
Call:
aov(formula = BreastCancerv3$Age ~ .x)
Terms:
.x Residuals
Sum of Squares 1925.9 321273.0
Deg. of Freedom 3 4020
Residual standard error: 8.939724
Estimated effects may be unbalanced
$N.Stage
Call:
aov(formula = BreastCancerv3$Age ~ .x)
Terms:
.x Residuals
Sum of Squares 55.5 323143.4
Deg. of Freedom 2 4021
Residual standard error: 8.964594
Estimated effects may be unbalanced
$X6th.Stage
Call:
aov(formula = BreastCancerv3$Age ~ .x)
Terms:
.x Residuals
Sum of Squares 1013.7 322185.2
Deg. of Freedom 4 4019
Residual standard error: 8.953519
Estimated effects may be unbalanced
$differentiate
Call:
aov(formula = BreastCancerv3$Age ~ .x)
Terms:
.x Residuals
Sum of Squares 3326.7 319872.2
Deg. of Freedom 3 4020
Residual standard error: 8.920213
Estimated effects may be unbalanced
$Grade
Call:
aov(formula = BreastCancerv3$Age ~ .x)
Terms:
.x Residuals
Sum of Squares 3326.7 319872.2
Deg. of Freedom 3 4020
Residual standard error: 8.920213
Estimated effects may be unbalanced
BreastCancerv4 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 10) | is_char]
BreastCancerv4 <- BreastCancerv4[-c(8,10:12)]
aov.models1 = BreastCancerv4[ , -grep("Tumor.Size", names(BreastCancerv4))] %>%
map(~ aov(BreastCancerv4$Tumor.Size ~ .x))
aov.models1
$Race
Call:
aov(formula = BreastCancerv4$Tumor.Size ~ .x)
Terms:
.x Residuals
Sum of Squares 98 1794327
Deg. of Freedom 2 4021
Residual standard error: 21.12437
Estimated effects may be unbalanced
$Marital.Status
Call:
aov(formula = BreastCancerv4$Tumor.Size ~ .x)
Terms:
.x Residuals
Sum of Squares 1561.3 1792863.9
Deg. of Freedom 4 4019
Residual standard error: 21.12101
Estimated effects may be unbalanced
$T.Stage
Call:
aov(formula = BreastCancerv4$Tumor.Size ~ .x)
Terms:
.x Residuals
Sum of Squares 1354557.5 439867.7
Deg. of Freedom 3 4020
Residual standard error: 10.46039
Estimated effects may be unbalanced
$N.Stage
Call:
aov(formula = BreastCancerv4$Tumor.Size ~ .x)
Terms:
.x Residuals
Sum of Squares 140869 1653556
Deg. of Freedom 2 4021
Residual standard error: 20.27881
Estimated effects may be unbalanced
$X6th.Stage
Call:
aov(formula = BreastCancerv4$Tumor.Size ~ .x)
Terms:
.x Residuals
Sum of Squares 634624 1159801
Deg. of Freedom 4 4019
Residual standard error: 16.98763
Estimated effects may be unbalanced
$differentiate
Call:
aov(formula = BreastCancerv4$Tumor.Size ~ .x)
Terms:
.x Residuals
Sum of Squares 26497.1 1767928.2
Deg. of Freedom 3 4020
Residual standard error: 20.97101
Estimated effects may be unbalanced
$Grade
Call:
aov(formula = BreastCancerv4$Tumor.Size ~ .x)
Terms:
.x Residuals
Sum of Squares 26497.1 1767928.2
Deg. of Freedom 3 4020
Residual standard error: 20.97101
Estimated effects may be unbalanced
BreastCancerv5 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 13) | is_char]
BreastCancerv5 <- BreastCancerv5[-c(8:10,12)]
aov.models2 = BreastCancerv5[ , -grep("Regional.Node.Examined", names(BreastCancerv5))] %>%
map(~ aov(BreastCancerv5$Regional.Node.Examined ~ .x))
aov.models2
$Race
Call:
aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)
Terms:
.x Residuals
Sum of Squares 43.97 263883.87
Deg. of Freedom 2 4021
Residual standard error: 8.101014
Estimated effects may be unbalanced
$Marital.Status
Call:
aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)
Terms:
.x Residuals
Sum of Squares 108.67 263819.16
Deg. of Freedom 4 4019
Residual standard error: 8.102036
Estimated effects may be unbalanced
$T.Stage
Call:
aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)
Terms:
.x Residuals
Sum of Squares 3742.71 260185.13
Deg. of Freedom 3 4020
Residual standard error: 8.04504
Estimated effects may be unbalanced
$N.Stage
Call:
aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)
Terms:
.x Residuals
Sum of Squares 29058.49 234869.35
Deg. of Freedom 2 4021
Residual standard error: 7.642688
Estimated effects may be unbalanced
$X6th.Stage
Call:
aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)
Terms:
.x Residuals
Sum of Squares 28918.44 235009.39
Deg. of Freedom 4 4019
Residual standard error: 7.646868
Estimated effects may be unbalanced
$differentiate
Call:
aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)
Terms:
.x Residuals
Sum of Squares 2169.77 261758.07
Deg. of Freedom 3 4020
Residual standard error: 8.069321
Estimated effects may be unbalanced
$Grade
Call:
aov(formula = BreastCancerv5$Regional.Node.Examined ~ .x)
Terms:
.x Residuals
Sum of Squares 2169.77 261758.07
Deg. of Freedom 3 4020
Residual standard error: 8.069321
Estimated effects may be unbalanced
BreastCancerv6 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 15) | is_char]
BreastCancerv6 <- BreastCancerv6[-c(8:10,12)]
aov.models3 = BreastCancerv6[ , -grep("Survival.Months", names(BreastCancerv6))] %>%
map(~ aov(BreastCancerv6$Survival.Months ~ .x))
aov.models3
$Race
Call:
aov(formula = BreastCancerv6$Survival.Months ~ .x)
Terms:
.x Residuals
Sum of Squares 7739.1 2105912.6
Deg. of Freedom 2 4021
Residual standard error: 22.88512
Estimated effects may be unbalanced
$Marital.Status
Call:
aov(formula = BreastCancerv6$Survival.Months ~ .x)
Terms:
.x Residuals
Sum of Squares 6870.1 2106781.7
Deg. of Freedom 4 4019
Residual standard error: 22.89553
Estimated effects may be unbalanced
$T.Stage
Call:
aov(formula = BreastCancerv6$Survival.Months ~ .x)
Terms:
.x Residuals
Sum of Squares 16252.2 2097399.5
Deg. of Freedom 3 4020
Residual standard error: 22.84165
Estimated effects may be unbalanced
$N.Stage
Call:
aov(formula = BreastCancerv6$Survival.Months ~ .x)
Terms:
.x Residuals
Sum of Squares 42434.1 2071217.6
Deg. of Freedom 2 4021
Residual standard error: 22.69582
Estimated effects may be unbalanced
$X6th.Stage
Call:
aov(formula = BreastCancerv6$Survival.Months ~ .x)
Terms:
.x Residuals
Sum of Squares 45935.1 2067716.6
Deg. of Freedom 4 4019
Residual standard error: 22.68227
Estimated effects may be unbalanced
$differentiate
Call:
aov(formula = BreastCancerv6$Survival.Months ~ .x)
Terms:
.x Residuals
Sum of Squares 11397.2 2102254.6
Deg. of Freedom 3 4020
Residual standard error: 22.86808
Estimated effects may be unbalanced
$Grade
Call:
aov(formula = BreastCancerv6$Survival.Months ~ .x)
Terms:
.x Residuals
Sum of Squares 11397.2 2102254.6
Deg. of Freedom 3 4020
Residual standard error: 22.86808
Estimated effects may be unbalanced
BreastCancerv7 <- BreastCancerv1[!(is_int & seq_along(BreastCancerv1) != 14) | is_char]
BreastCancerv7 <- BreastCancerv7[-c(8:10,12)]
aov.models4 = BreastCancerv7[ , -grep("Reginol.Node.Positive", names(BreastCancerv7))] %>%
map(~ aov(BreastCancerv7$Reginol.Node.Positive ~ .x))
aov.models4
$Race
Call:
aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)
Terms:
.x Residuals
Sum of Squares 23.62 104997.85
Deg. of Freedom 2 4021
Residual standard error: 5.110027
Estimated effects may be unbalanced
$Marital.Status
Call:
aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)
Terms:
.x Residuals
Sum of Squares 491.13 104530.35
Deg. of Freedom 4 4019
Residual standard error: 5.099906
Estimated effects may be unbalanced
$T.Stage
Call:
aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)
Terms:
.x Residuals
Sum of Squares 6213.48 98808.00
Deg. of Freedom 3 4020
Residual standard error: 4.957732
Estimated effects may be unbalanced
$N.Stage
Call:
aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)
Terms:
.x Residuals
Sum of Squares 77973.10 27048.37
Deg. of Freedom 2 4021
Residual standard error: 2.593603
Estimated effects may be unbalanced
$X6th.Stage
Call:
aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)
Terms:
.x Residuals
Sum of Squares 75327.78 29693.70
Deg. of Freedom 4 4019
Residual standard error: 2.718148
Estimated effects may be unbalanced
$differentiate
Call:
aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)
Terms:
.x Residuals
Sum of Squares 1955.19 103066.29
Deg. of Freedom 3 4020
Residual standard error: 5.063436
Estimated effects may be unbalanced
$Grade
Call:
aov(formula = BreastCancerv7$Reginol.Node.Positive ~ .x)
Terms:
.x Residuals
Sum of Squares 1955.19 103066.29
Deg. of Freedom 3 4020
Residual standard error: 5.063436
Estimated effects may be unbalanced
library(dplyr)
library(tidyverse)
library(ggplot2)
install.packages("purrr")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
library(purrr)
input_vars <- char_vars[(-c(11))]
target_var <- char_vars[(-c(1:10))]
The below bar chart shows the distribution for each of the category variable with each of its level with the target variable (status).
char_vars %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%
ggplot(aes(x=Values, fill=Status)) +
geom_bar() +
facet_wrap(~Columns, ncol = 2, scale="free")+
coord_flip()
png
To get more valuable insight we porpotion the graph according to its ratio since the graph is highly disporportion in terms of its levels distribution.
char_vars %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%
ggplot(aes(x=Values, fill=Status)) +
geom_bar(position = "fill",
alpha = 1) +
facet_wrap(~Columns, ncol = 2, scale="free")+
coord_flip()
png
BreastCancerv2 %>% pivot_longer(!Status, names_to = "Columns", values_to = "Values") %>%
ggplot(aes(x=Values, fill=Status)) +
geom_boxplot() +
facet_wrap(~Columns, ncol = 2, scale="free")+
coord_flip()
png
install.packages(c('MASS', 'pscl', "corrgram", "psych", "gmodels", "dplyr"))
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
library(tidyverse)
library(psych)
library(gmodels)
library(corrgram)
library(pscl)
library(MASS)
library(dplyr)
library(scales)
Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
Attaching package: ‘scales’
The following objects are masked from ‘package:psych’:
alpha, rescale
The following object is masked from ‘package:purrr’:
discard
The following object is masked from ‘package:readr’:
col_factor
data=read_csv("/content/sample_data/Breast_Cancer.csv") %>%
mutate(Race=as.factor(Race),
`Marital Status`=as.factor(`Marital Status`),
`T Stage`=as.factor(`T Stage`),
`N Stage`=as.factor(`N Stage`),
`6th Stage`=as.factor(`6th Stage`),
differentiate=as.factor(differentiate),
Grade=as.factor(Grade),
`A Stage`=as.factor(`A Stage`),
`Estrogen Status`=as.factor(`Estrogen Status`),
`Progesterone Status`=as.factor(`Progesterone Status`),
Status=as.factor(Status))
cat_cols = c('Race', 'Marital Status', 'T Stage', 'N Stage', '6th Stage', 'differentiate','Grade', 'A Stage', 'Estrogen Status', 'Progesterone Status', 'Status')
num_cols = c('Age', 'Tumor Size', 'Regional Node Examined', 'Reginol Node Positive', 'Survival Months')
cat_data = data[,cat_cols]
num_data = data[,num_cols]
head(data)
head(cat_data)
head(num_data)
[1mRows: [22m[34m4024[39m [1mColumns: [22m[34m16[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (11): Race, Marital Status, T Stage, N Stage, 6th Stage, differentiate, ...
[32mdbl[39m (5): Age, Tumor Size, Regional Node Examined, Reginol Node Positive, Su...
[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
| Age | Race | Marital Status | T Stage | N Stage | 6th Stage | differentiate | Grade | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <dbl> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <fct> |
| 68 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 4 | Positive | Positive | 24 | 1 | 60 | Alive |
| 50 | White | Married | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | 35 | Positive | Positive | 14 | 5 | 62 | Alive |
| 58 | White | Divorced | T3 | N3 | IIIC | Moderately differentiated | 2 | Regional | 63 | Positive | Positive | 14 | 7 | 75 | Alive |
| 58 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 18 | Positive | Positive | 2 | 1 | 84 | Alive |
| 47 | White | Married | T2 | N1 | IIB | Poorly differentiated | 3 | Regional | 41 | Positive | Positive | 3 | 1 | 50 | Alive |
| 51 | White | Single | T1 | N1 | IIA | Moderately differentiated | 2 | Regional | 20 | Positive | Positive | 18 | 2 | 89 | Alive |
| Race | Marital Status | T Stage | N Stage | 6th Stage | differentiate | Grade | A Stage | Estrogen Status | Progesterone Status | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> |
| White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | Positive | Positive | Alive |
| White | Married | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | Positive | Positive | Alive |
| White | Divorced | T3 | N3 | IIIC | Moderately differentiated | 2 | Regional | Positive | Positive | Alive |
| White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | Positive | Positive | Alive |
| White | Married | T2 | N1 | IIB | Poorly differentiated | 3 | Regional | Positive | Positive | Alive |
| White | Single | T1 | N1 | IIA | Moderately differentiated | 2 | Regional | Positive | Positive | Alive |
| Age | Tumor Size | Regional Node Examined | Reginol Node Positive | Survival Months |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 68 | 4 | 24 | 1 | 60 |
| 50 | 35 | 14 | 5 | 62 |
| 58 | 63 | 14 | 7 | 75 |
| 58 | 18 | 2 | 1 | 84 |
| 47 | 41 | 3 | 1 | 50 |
| 51 | 20 | 18 | 2 | 89 |
Using reduced dataset, we can note that Age, Survival Months, mca2_1, mca2_3, mca3_1, pca1, and pca3 are significant features for the model with 99.9% interval.
Still, only 35% of variances are explained by this model.
As our objective is in binomial classification, we analyse our dataset using logistic regression to see how much the data can be explained and which feature are significant to logistic regression model.
model <- glm(Status ~., data = data, family = "binomial")
summary(model)
pR2(model)['McFadden']
Call:
glm(formula = Status ~ ., family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2543 -0.4670 -0.2640 -0.1355 3.3031
Coefficients: (4 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.192035 0.592075 2.013 0.044081 *
Age 0.028301 0.006516 4.343 1.40e-05 ***
RaceOther -0.876361 0.290957 -3.012 0.002595 **
RaceWhite -0.449858 0.190396 -2.363 0.018140 *
`Marital Status`Married -0.162967 0.166478 -0.979 0.327624
`Marital Status`Separated 0.503724 0.485998 1.036 0.299981
`Marital Status`Single -0.094715 0.206431 -0.459 0.646361
`Marital Status`Widowed 0.047243 0.261059 0.181 0.856393
`T Stage`T2 0.239508 0.229609 1.043 0.296897
`T Stage`T3 0.742368 0.372473 1.993 0.046253 *
`T Stage`T4 1.359086 0.570897 2.381 0.017284 *
`N Stage`N2 0.702434 0.278582 2.521 0.011687 *
`N Stage`N3 0.630783 0.351167 1.796 0.072455 .
`6th Stage`IIB 0.266879 0.267898 0.996 0.319156
`6th Stage`IIIA -0.199731 0.341157 -0.585 0.558244
`6th Stage`IIIB 0.015832 0.662766 0.024 0.980942
`6th Stage`IIIC NA NA NA NA
differentiatePoorly differentiated 0.433001 0.122914 3.523 0.000427 ***
differentiateUndifferentiated 1.705851 0.793648 2.149 0.031604 *
differentiateWell differentiated -0.597017 0.207476 -2.878 0.004008 **
Grade2 NA NA NA NA
Grade3 NA NA NA NA
Gradeanaplastic; Grade IV NA NA NA NA
`A Stage`Regional 0.177552 0.325627 0.545 0.585573
`Tumor Size` -0.003163 0.004729 -0.669 0.503671
`Estrogen Status`Positive -0.374343 0.227843 -1.643 0.100385
`Progesterone Status`Positive -0.521779 0.152356 -3.425 0.000615 ***
`Regional Node Examined` -0.031172 0.008045 -3.875 0.000107 ***
`Reginol Node Positive` 0.077710 0.017946 4.330 1.49e-05 ***
`Survival Months` -0.061531 0.002763 -22.271 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3444.7 on 4023 degrees of freedom
Residual deviance: 2232.4 on 3998 degrees of freedom
AIC: 2284.4
Number of Fisher Scoring iterations: 6
fitting null model for pseudo-r2
McFadden: 0.351925027032208
Using original dataset, we can note that Grade 2, Grade 3, Grade IV, and IIIC 6th Stage have singularities to the class, thus removed from the regression.
Then, Age, differentiate, progesterone Status, Regional Node Examined, Reginol Node Positive, and Survival Months are significant features for the model with 99.9% interval.
However, only 35% of variances are explained by this model.
data %>% pairs.panels(scale=T,
stars=T,
gap = 0,
bg = c("red", "blue")[data$Status],
pch=21)
png
A few factor columns, like T Stage, N Stage, and 6th Stage, have at least moderate correlation that is statistically significant. We use MCA to reduce these columns to numerical vectors.
res_mca <- mca(cat_data, nf=30)
res_mca
summary(res_mca)
plot(res_mca)
Call:
mca(df = cat_data, nf = 30)
Multiple correspondence analysis of 4024 cases of 11 factors
Correlations 0.517 0.441 0.430 0.426 0.412 0.400 0.398 0.344 0.327 0.305 0.301 0.301 0.297 0.282 0.274 0.269 0.248 0.209 0.128 0.111 0.088 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 cumulative % explained 5.17 9.58 13.87 18.14 22.26 26.26 30.25 33.69 36.96 40.01 43.02 46.03 49.00 51.81 54.55 57.24 59.72 61.81 63.09 64.20 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08 65.08
Length Class Mode
rs 120720 -none- numeric
cs 1080 -none- numeric
fs 120720 -none- numeric
d 30 -none- numeric
p 1 -none- numeric
call 3 -none- call
png
MCA on all factor columns only covers at most 65% of variances. This low value means we cannot use MCA to reduce all factor columns at once.
So, we start with columns those are colinear. These columns are T Stage, N Stage, 6th Stage, Race, and Marital Status.
mca_cols <- c('T Stage', 'N Stage', '6th Stage','Race','Marital Status')
res_mca <- mca(cat_data[,names(cat_data) %in% mca_cols], nf=7)
res_mca
plot(res_mca)
pred_mca <- predict(res_mca, cat_data[,names(cat_data) %in% mca_cols])
data_mca <- data %>% mutate(mca1_1 = pred_mca[,1], mca1_2 = pred_mca[,2], mca1_3 = pred_mca[,3],
mca1_4 = pred_mca[,4], mca1_5 = pred_mca[,5], mca1_6 = pred_mca[,6], mca1_7 = pred_mca[,7])
data_mca <- data_mca[,!names(data_mca) %in% mca_cols]
head(data_mca)
Call:
mca(df = cat_data[, names(cat_data) %in% mca_cols], nf = 7)
Multiple correspondence analysis of 4024 cases of 5 factors
Correlations 0.667 0.625 0.603 0.590 0.486 0.453 0.447 cumulative % explained 16.67 32.31 47.38 62.14 74.29 85.63 96.80
| Age | differentiate | Grade | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status | mca1_1 | mca1_2 | mca1_3 | mca1_4 | mca1_5 | mca1_6 | mca1_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <fct> | <fct> | <fct> | <dbl> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 68 | Poorly differentiated | 3 | Regional | 4 | Positive | Positive | 24 | 1 | 60 | Alive | -0.0023461091 | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 |
| 50 | Moderately differentiated | 2 | Regional | 35 | Positive | Positive | 14 | 5 | 62 | Alive | 0.0014321129 | 0.0030905826 | -0.0001677426 | 0.0002330912 | -0.0006702577 | 6.418918e-05 | 2.464879e-04 |
| 58 | Moderately differentiated | 2 | Regional | 63 | Positive | Positive | 14 | 7 | 75 | Alive | 0.0047429421 | -0.0030614242 | 0.0006657707 | 0.0025064765 | -0.0004448306 | -2.617071e-03 | 1.369498e-03 |
| 58 | Poorly differentiated | 3 | Regional | 18 | Positive | Positive | 2 | 1 | 84 | Alive | -0.0023461091 | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 |
| 47 | Poorly differentiated | 3 | Regional | 41 | Positive | Positive | 3 | 1 | 50 | Alive | -0.0005593943 | 0.0001839509 | 0.0019673530 | -0.0019829816 | -0.0009202945 | -1.987249e-04 | -1.041004e-04 |
| 51 | Moderately differentiated | 2 | Regional | 20 | Positive | Positive | 18 | 2 | 89 | Alive | -0.0022315778 | -0.0009130826 | -0.0010431065 | 0.0007508688 | 0.0019226744 | -4.756507e-04 | -2.239220e-04 |
png
7 outputs from MCA covers 97% of cumulative variances. Thus, we take all 7 output to replace N Stage, T Stage, 6th Stage, Race, and Marital Status.
Then, we proceed to reduce differentiate, Grade, and A Stage columns as they are slightly colinear.
mca_cols <- c('differentiate', 'Grade', 'A Stage')
res_mca <- mca(data_mca[,names(data_mca) %in% mca_cols], nf=3)
res_mca
plot(res_mca)
pred_mca <- predict(res_mca, data_mca[,names(data_mca) %in% mca_cols])
data_mca1 <- data_mca %>% mutate(mca2_1 = pred_mca[,1],mca2_2 = pred_mca[,2],mca2_3 = pred_mca[,3])
data_mca1 <- data_mca1[,!names(data_mca1) %in% mca_cols]
head(data_mca1)
Call:
mca(df = data_mca[, names(data_mca) %in% mca_cols], nf = 3)
Multiple correspondence analysis of 4024 cases of 3 factors
Correlations 0.818 0.816 0.816 cumulative % explained 40.88 81.70 122.53
| Age | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status | mca1_1 | mca1_2 | mca1_3 | mca1_4 | mca1_5 | mca1_6 | mca1_7 | mca2_1 | mca2_2 | mca2_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 68 | 4 | Positive | Positive | 24 | 1 | 60 | Alive | -0.0023461091 | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 | -0.006802372 | 0.0003366833 | -1.086844e-03 |
| 50 | 35 | Positive | Positive | 14 | 5 | 62 | Alive | 0.0014321129 | 0.0030905826 | -0.0001677426 | 0.0002330912 | -0.0006702577 | 6.418918e-05 | 2.464879e-04 | 0.002587008 | -0.0025582201 | -1.565432e-18 |
| 58 | 63 | Positive | Positive | 14 | 7 | 75 | Alive | 0.0047429421 | -0.0030614242 | 0.0006657707 | 0.0025064765 | -0.0004448306 | -2.617071e-03 | 1.369498e-03 | 0.002587008 | -0.0025582201 | -1.565432e-18 |
| 58 | 18 | Positive | Positive | 2 | 1 | 84 | Alive | -0.0023461091 | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 | -0.006802372 | 0.0003366833 | -1.086844e-03 |
| 47 | 41 | Positive | Positive | 3 | 1 | 50 | Alive | -0.0005593943 | 0.0001839509 | 0.0019673530 | -0.0019829816 | -0.0009202945 | -1.987249e-04 | -1.041004e-04 | -0.006802372 | 0.0003366833 | -1.086844e-03 |
| 51 | 20 | Positive | Positive | 18 | 2 | 89 | Alive | -0.0022315778 | -0.0009130826 | -0.0010431065 | 0.0007508688 | 0.0019226744 | -4.756507e-04 | -2.239220e-04 | 0.002587008 | -0.0025582201 | -1.565432e-18 |
png
First 2 output of MCA covered 82% of cumulative variances while third output covers 123% of cumulative variances. Since the first 2 output does not cover at least 95% of variances, we take the all 3 output to replace differentiate, Grade, and A Stage columns.
After that, we proceed with reducing Estrogen Status and Progesterone Status columns.
mca_cols <- c('Estrogen Status','Progesterone Status')
res_mca <- mca(data_mca1[,names(data_mca1) %in% mca_cols], nf=length(mca_cols))
res_mca
plot(res_mca)
pred_mca <- predict(res_mca, data_mca1[,names(data_mca1) %in% mca_cols])
data_mca2 <- data_mca1 %>% mutate(mca3_1 = pred_mca[,1], mca3_2 = pred_mca[,2])
data_mca2 <- data_mca2[,!names(data_mca2) %in% mca_cols]
head(data_mca2)
Call:
mca(df = data_mca1[, names(data_mca1) %in% mca_cols], nf = length(mca_cols))
Multiple correspondence analysis of 4024 cases of 2 factors
Correlations 0.870 0.493 cumulative % explained 86.99 136.32
| Age | Tumor Size | Regional Node Examined | Reginol Node Positive | Survival Months | Status | mca1_1 | mca1_2 | mca1_3 | mca1_4 | mca1_5 | mca1_6 | mca1_7 | mca2_1 | mca2_2 | mca2_3 | mca3_1 | mca3_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 68 | 4 | 24 | 1 | 60 | Alive | -0.0023461091 | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 | -0.006802372 | 0.0003366833 | -1.086844e-03 | -0.002860247 | -0.000750588 |
| 50 | 35 | 14 | 5 | 62 | Alive | 0.0014321129 | 0.0030905826 | -0.0001677426 | 0.0002330912 | -0.0006702577 | 6.418918e-05 | 2.464879e-04 | 0.002587008 | -0.0025582201 | -1.565432e-18 | -0.002860247 | -0.000750588 |
| 58 | 63 | 14 | 7 | 75 | Alive | 0.0047429421 | -0.0030614242 | 0.0006657707 | 0.0025064765 | -0.0004448306 | -2.617071e-03 | 1.369498e-03 | 0.002587008 | -0.0025582201 | -1.565432e-18 | -0.002860247 | -0.000750588 |
| 58 | 18 | 2 | 1 | 84 | Alive | -0.0023461091 | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 | -0.006802372 | 0.0003366833 | -1.086844e-03 | -0.002860247 | -0.000750588 |
| 47 | 41 | 3 | 1 | 50 | Alive | -0.0005593943 | 0.0001839509 | 0.0019673530 | -0.0019829816 | -0.0009202945 | -1.987249e-04 | -1.041004e-04 | -0.006802372 | 0.0003366833 | -1.086844e-03 | -0.002860247 | -0.000750588 |
| 51 | 20 | 18 | 2 | 89 | Alive | -0.0022315778 | -0.0009130826 | -0.0010431065 | 0.0007508688 | 0.0019226744 | -4.756507e-04 | -2.239220e-04 | 0.002587008 | -0.0025582201 | -1.565432e-18 | -0.002860247 | -0.000750588 |
png
First output of MCA already covered 87% of cumulative variances while second output covers 136% of cumulative variances. Thus, we take the both output to replace Estrogen Status and Progesterone Status columns because the first output is not enough to cover at least 95% of the variances.
data_mca2 %>% pairs.panels(scale=T,
stars=T,
gap = 0,
bg = c("red", "blue")[data_mca2$Status],
pch=21)
png
Now, all the factor columns has been reduced to numerical vectors. Still, we can see that some columns are correlated to each other. So, we use PCA to remove colinearity and dimensionality.
pca_cols = c('mca1_1', 'Tumor Size', 'Reginol Node Positive', 'Regional Node Examined')
res_pca <- princomp(data_mca2[,pca_cols], cor = T)
summary(res_pca)
pred_pca <- predict(res_pca, data_mca2[,pca_cols])
data_pca <- data_mca2 %>% mutate(pca1=pred_pca[,1], pca2=pred_pca[,2], pca3=pred_pca[,3])
data_pca <- data_pca[, !names(data_pca) %in% pca_cols]
head(data_pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5029084 0.9851157 0.7791421 0.40466125
Proportion of Variance 0.5646834 0.2426133 0.1517656 0.04093768
Cumulative Proportion 0.5646834 0.8072967 0.9590623 1.00000000
| Age | Survival Months | Status | mca1_2 | mca1_3 | mca1_4 | mca1_5 | mca1_6 | mca1_7 | mca2_1 | mca2_2 | mca2_3 | mca3_1 | mca3_2 | pca1 | pca2 | pca3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 68 | 60 | Alive | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 | -0.006802372 | 0.0003366833 | -1.086844e-03 | -0.002860247 | -0.000750588 | -1.1333213 | -1.6942370 | -0.67609196 |
| 50 | 62 | Alive | 0.0030905826 | -0.0001677426 | 0.0002330912 | -0.0006702577 | 6.418918e-05 | 2.464879e-04 | 0.002587008 | -0.0025582201 | -1.565432e-18 | -0.002860247 | -0.000750588 | 0.5848042 | 0.2571267 | 0.15663818 |
| 58 | 75 | Alive | -0.0030614242 | 0.0006657707 | 0.0025064765 | -0.0004448306 | -2.617071e-03 | 1.369498e-03 | 0.002587008 | -0.0025582201 | -1.565432e-18 | -0.002860247 | -0.000750588 | 2.3363389 | 1.3114219 | 0.03636938 |
| 58 | 84 | Alive | -0.0010077059 | -0.0010327568 | 0.0009605870 | -0.0006947017 | -4.192764e-05 | 4.440576e-05 | -0.006802372 | 0.0003366833 | -1.086844e-03 | -0.002860247 | -0.000750588 | -1.8342221 | 0.6278887 | 0.65335871 |
| 47 | 50 | Alive | 0.0001839509 | 0.0019673530 | -0.0019829816 | -0.0009202945 | -1.987249e-04 | -1.041004e-04 | -0.006802372 | 0.0003366833 | -1.086844e-03 | -0.002860247 | -0.000750588 | -0.8010371 | 1.4039334 | 0.20289254 |
| 51 | 89 | Alive | -0.0009130826 | -0.0010431065 | 0.0007508688 | 0.0019226744 | -4.756507e-04 | -2.239220e-04 | 0.002587008 | -0.0025582201 | -1.565432e-18 | -0.002860247 | -0.000750588 | -0.9345595 | -0.7290785 | -0.48789923 |
First 3 output of PCA already covered 96% of cumulative variances. Thus, we take the first 3 output to replace mca1_1, Tumor size, Reginol Node Positive, and Regiona Node Examined columns.
data_pca %>% pairs.panels(scale=T,
stars=T,
gap = 0,
bg = c("red", "blue")[data_pca$Status],
pch=21)
png
Now, we see that none of the feature columns are correlated to each other.
Lastly, we will normalized the rest of the columns.
data_scaled <- data_pca %>% mutate(Age = rescale(Age), `Survival Months` = rescale(`Survival Months`))
#Modeling
install.packages(c('caret', 'kernlab', 'naivebayes', 'rstatix', 'randomForest'))
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘future.apply’, ‘progressr’, ‘SQUAREM’, ‘lava’, ‘prodlim’, ‘numDeriv’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘RcppEigen’, ‘proxy’, ‘iterators’, ‘Rcpp’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘timeDate’, ‘carData’, ‘abind’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’, ‘corrplot’, ‘car’
library(caret)
library(kernlab)
library(naivebayes)
library(rstatix)
library(randomForest)
Loading required package: lattice
Attaching package: ‘lattice’
The following object is masked from ‘package:corrgram’:
panel.fill
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
lift
Attaching package: ‘kernlab’
The following object is masked from ‘package:scales’:
alpha
The following object is masked from ‘package:psych’:
alpha
The following object is masked from ‘package:purrr’:
cross
The following object is masked from ‘package:ggplot2’:
alpha
naivebayes 0.9.7 loaded
Attaching package: ‘rstatix’
The following object is masked from ‘package:MASS’:
select
The following object is masked from ‘package:stats’:
filter
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:psych’:
outlier
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:ggplot2’:
margin
For modelling, we use Naive Bayes, Random Forest, and Support Vector Machine (SVM) algorithms against original data, reduced data, and built-in PCA pre-processed data.
10-fold cross validation is used for model evaluation later.
Upsampling method is used to address the imbalance size between the classes.
ctrl <- trainControl(method = "cv", number = 10)
set.seed(0)
set1 <- upSample(data %>% select(-Status), data$Status, yname='Status')
set.seed(0)
set2 <- upSample(data_scaled %>% select(-Status), data_scaled$Status, yname='Status')
set.seed(0)
nb_1 <- train(Status ~., data = set1, method = "naive_bayes", trControl = ctrl)
nb_1
set.seed(0)
nb_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "naive_bayes", trControl = ctrl)
nb_1p
set.seed(0)
nb_2 <- train(Status ~., data = set2, method = "naive_bayes", trControl = ctrl)
nb_2
Naive Bayes
6816 samples
15 predictor
2 classes: 'Alive', 'Dead'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.6506751 0.3013481
TRUE 0.5963957 0.1927825
Tuning parameter 'laplace' was held constant at a value of 0
Tuning
parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE
and adjust = 1.
Naive Bayes
6816 samples
15 predictor
2 classes: 'Alive', 'Dead'
Pre-processing: principal component signal extraction (29), centered
(29), scaled (29)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.6943899 0.3887863
TRUE 0.7452929 0.4905871
Tuning parameter 'laplace' was held constant at a value of 0
Tuning
parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = TRUE
and adjust = 1.
Naive Bayes
6816 samples
16 predictor
2 classes: 'Alive', 'Dead'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.7139037 0.4278098
TRUE 0.7052600 0.4105174
Tuning parameter 'laplace' was held constant at a value of 0
Tuning
parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = FALSE
and adjust = 1.
set.seed(0)
rf_1 <- train(Status ~., data = set1, method = "rf", trControl = ctrl)
rf_1
set.seed(0)
rf_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "rf", trControl = ctrl)
rf_1p
set.seed(0)
rf_2 <- train(Status ~., data = set2, method = "rf", trControl = ctrl)
rf_2
Random Forest
6816 samples
15 predictor
2 classes: 'Alive', 'Dead'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8148435 0.6296850
15 0.9743221 0.9486443
29 0.9694812 0.9389629
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Warning message in randomForest.default(x, y, mtry = param$mtry, ...):
“invalid mtry: reset to within valid range”
Random Forest
6816 samples
15 predictor
2 classes: 'Alive', 'Dead'
Pre-processing: principal component signal extraction (29), centered
(29), scaled (29)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9772589 0.9545179
15 0.9677195 0.9354391
29 0.9655182 0.9310365
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Random Forest
6816 samples
16 predictor
2 classes: 'Alive', 'Dead'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9630255 0.9260508
9 0.9746171 0.9492340
16 0.9718301 0.9436600
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 9.
set.seed(0)
svm_1 <- train(Status ~., data = set1, method = "svmPoly", trControl = ctrl)
svm_1
set.seed(0)
svm_1p <- train(Status ~., data = set1, preProc = c("pca"), method = "svmPoly", trControl = ctrl)
svm_1p
set.seed(0)
svm_2 <- train(Status ~., data = set2, method = "svmPoly", trControl = ctrl)
svm_2
Support Vector Machines with Polynomial Kernel
6816 samples
15 predictor
2 classes: 'Alive', 'Dead'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 0.001 0.25 0.7602668 0.5205353
1 0.001 0.50 0.7759653 0.5519309
1 0.001 1.00 0.7818327 0.5636653
1 0.010 0.25 0.7856459 0.5712925
1 0.010 0.50 0.7891686 0.5783356
1 0.010 1.00 0.7897541 0.5795070
1 0.100 0.25 0.7904878 0.5809754
1 0.100 0.50 0.7899011 0.5798018
1 0.100 1.00 0.7906347 0.5812686
2 0.001 0.25 0.7761125 0.5522252
2 0.001 0.50 0.7819789 0.5639577
2 0.001 1.00 0.7874074 0.5748151
2 0.010 0.25 0.7928347 0.5856686
2 0.010 0.50 0.7965035 0.5930063
2 0.010 1.00 0.7989985 0.5979962
2 0.100 0.25 0.8069222 0.6138445
2 0.100 0.50 0.8082406 0.6164795
2 0.100 1.00 0.8091188 0.6182365
3 0.001 0.25 0.7822733 0.5645473
3 0.001 0.50 0.7827151 0.5654295
3 0.001 1.00 0.7879939 0.5759878
3 0.010 0.25 0.7991445 0.5982870
3 0.010 0.50 0.8029587 0.6059163
3 0.010 1.00 0.8079499 0.6158996
3 0.100 0.25 0.8524020 0.7048047
3 0.100 0.50 0.8600305 0.7200620
3 0.100 1.00 0.8661908 0.7323835
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.
Support Vector Machines with Polynomial Kernel
6816 samples
15 predictor
2 classes: 'Alive', 'Dead'
Pre-processing: principal component signal extraction (29), centered
(29), scaled (29)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 0.001 0.25 0.7904911 0.5809824
1 0.001 0.50 0.7906351 0.5812685
1 0.001 1.00 0.7912214 0.5824411
1 0.010 0.25 0.7868185 0.5736349
1 0.010 0.50 0.7901942 0.5803867
1 0.010 1.00 0.7882876 0.5765735
1 0.100 0.25 0.7884338 0.5768660
1 0.100 0.50 0.7888756 0.5777497
1 0.100 1.00 0.7885821 0.5771630
2 0.001 0.25 0.7901952 0.5803887
2 0.001 0.50 0.7909282 0.5818546
2 0.001 1.00 0.7888743 0.5777471
2 0.010 0.25 0.7954792 0.5909577
2 0.010 0.50 0.7963588 0.5927170
2 0.010 1.00 0.7959193 0.5918371
2 0.100 0.25 0.7991471 0.5982949
2 0.100 0.50 0.8023750 0.6047494
2 0.100 1.00 0.8000264 0.6000515
3 0.001 0.25 0.7903408 0.5806800
3 0.001 0.50 0.7903414 0.5806809
3 0.001 1.00 0.7888734 0.5777448
3 0.010 0.25 0.7970917 0.5941831
3 0.010 0.50 0.7967991 0.5935957
3 0.010 1.00 0.8017892 0.6035760
3 0.100 0.25 0.8362632 0.6725273
3 0.100 0.50 0.8399306 0.6798623
3 0.100 1.00 0.8443321 0.6886654
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.
Support Vector Machines with Polynomial Kernel
6816 samples
16 predictor
2 classes: 'Alive', 'Dead'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6135, 6135, 6134, 6134, 6134, 6134, ...
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 0.001 0.25 0.7761134 0.5522271
1 0.001 0.50 0.7844765 0.5689538
1 0.001 1.00 0.7868196 0.5736393
1 0.010 0.25 0.7856468 0.5712933
1 0.010 0.50 0.7868222 0.5736435
1 0.010 1.00 0.7888756 0.5777502
1 0.100 0.25 0.7896089 0.5792171
1 0.100 0.50 0.7894617 0.5789224
1 0.100 1.00 0.7894625 0.5789238
2 0.001 0.25 0.7840367 0.5680740
2 0.001 0.50 0.7868198 0.5736396
2 0.001 1.00 0.7850618 0.5701239
2 0.010 0.25 0.7918103 0.5836191
2 0.010 0.50 0.7960662 0.5921322
2 0.010 1.00 0.7969455 0.5938915
2 0.100 0.25 0.8004671 0.6009350
2 0.100 0.50 0.8016410 0.6032816
2 0.100 1.00 0.8022271 0.6044530
3 0.001 0.25 0.7868230 0.5736466
3 0.001 0.50 0.7844742 0.5689482
3 0.001 1.00 0.7879963 0.5759904
3 0.010 0.25 0.7948908 0.5897825
3 0.010 0.50 0.7965047 0.5930103
3 0.010 1.00 0.8019343 0.6038683
3 0.100 0.25 0.8324492 0.6648965
3 0.100 0.50 0.8368495 0.6736992
3 0.100 1.00 0.8402260 0.6804537
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.
We evaluate our models across the algorithms, features, and inividual models by its accuracy.
all_model <- list(nb_1=nb_1, nb_1p=nb_1p, nb_2=nb_2, svm_1=svm_1, svm_1p=svm_1p, svm_2=svm_2, rf_1=rf_1, rf_1p=rf_1p, rf_2=rf_2)
long_results <- resamples(all_model)$values %>% pivot_longer(!Resample, names_sep = '~', names_to = c('.value', 'Metric')) %>%
pivot_longer(names(all_model), names_to = 'Model' , values_to = 'Value')
long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
mutate(ModelType = sub('_.*','', Model)) %>%
group_by(ModelType) %>%
summarise_at(vars(-Resample, -Model), ~mean(.)) %>%
arrange(desc(Accuracy))
long_results %>% mutate(ModelType = sub('_.*','', Model)) %>%
ggplot(aes(x=ModelType, y=Value, fill=ModelType)) +
geom_boxplot() +
facet_wrap(~Metric)
| ModelType | Accuracy | Kappa |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| rf | 0.9753994 | 0.9507987 |
| svm | 0.8502496 | 0.7005009 |
| nb | 0.7032906 | 0.4065817 |
png
We can see that the algorithm with highest accuracy is Random Forest, then SVM, and lastly Naive Bayes
long_results %>%
filter(Metric == 'Accuracy') %>%
mutate(ModelType = sub('_.*','', Model)) %>%
t_test(Value ~ModelType) %>%
add_significance()
| .y. | group1 | group2 | n1 | n2 | statistic | df | p | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> |
| Value | nb | rf | 30 | 30 | -34.04656 | 29.89943 | 1.79e-25 | 3.58e-25 | **** |
| Value | nb | svm | 30 | 30 | -16.97885 | 39.68954 | 8.43e-20 | 8.43e-20 | **** |
| Value | rf | svm | 30 | 30 | 34.72177 | 33.67863 | 5.94e-28 | 1.78e-27 | **** |
From the t-test results above, we can see that our previous observation is statistically significant with 99.99% interval.
long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
mutate(FeatureType = sub('.*_','', Model)) %>%
group_by(FeatureType) %>%
summarise_at(vars(-Resample, -Model), ~mean(.)) %>%
arrange(desc(Accuracy))
long_results %>% mutate(FeatureType = sub('.*_','', Model)) %>%
ggplot(aes(x=FeatureType, y=Value, fill=FeatureType)) +
geom_boxplot() +
facet_wrap(~Metric)
| FeatureType | Accuracy | Kappa |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| 1p | 0.8556280 | 0.7112568 |
| 2 | 0.8429156 | 0.6858325 |
| 1 | 0.8303960 | 0.6607920 |
png
Here, we can see that, by average, features from PCA has the highest accuracy, followed by our processed data, then the original data.
long_results %>%
filter(Metric == 'Accuracy') %>%
mutate(FeatureType = sub('.*_','', Model)) %>%
t_test(Value ~FeatureType) %>%
add_significance()
| .y. | group1 | group2 | n1 | n2 | statistic | df | p | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> |
| Value | 1 | 1p | 30 | 30 | -0.8197950 | 52.41153 | 0.416 | 1 | ns |
| Value | 1 | 2 | 30 | 30 | -0.3909963 | 55.18921 | 0.697 | 1 | ns |
| Value | 1p | 2 | 30 | 30 | 0.4751066 | 57.32042 | 0.637 | 1 | ns |
However, t-test results shows that our observation is not statistically significant.
long_results %>% pivot_wider(names_from = Metric, values_from = Value) %>%
group_by(Model) %>%
summarise_at(vars(-Resample), ~mean(.)) %>%
arrange(desc(Accuracy))
long_results %>% ggplot(aes(x=Model, y=Value, fill=Model)) +
geom_boxplot() +
facet_wrap(~Metric)
| Model | Accuracy | Kappa |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| rf_1p | 0.9772589 | 0.9545179 |
| rf_2 | 0.9746171 | 0.9492340 |
| rf_1 | 0.9743221 | 0.9486443 |
| svm_1 | 0.8661908 | 0.7323835 |
| svm_1p | 0.8443321 | 0.6886654 |
| svm_2 | 0.8402260 | 0.6804537 |
| nb_1p | 0.7452929 | 0.4905871 |
| nb_2 | 0.7139037 | 0.4278098 |
| nb_1 | 0.6506751 | 0.3013481 |
png
Here, we can see that Random Forest models have higher accuracy, followed by SVM and Naive Bayes. Random Forest models have around the same accuracy among each other. In the meanwhile, SVM using original data and naive bayes using PCA data have the highest accuracy within their respective group.
long_results %>%
filter(Metric == 'Accuracy') %>%
t_test(Value ~Model) %>%
add_significance()
| .y. | group1 | group2 | n1 | n2 | statistic | df | p | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> |
| Value | nb_1 | nb_1p | 10 | 10 | -13.1198526 | 15.347841 | 9.49e-10 | 9.49e-09 | **** |
| Value | nb_1 | nb_2 | 10 | 10 | -8.5190425 | 14.987547 | 3.96e-07 | 3.17e-06 | **** |
| Value | nb_1 | rf_1 | 10 | 10 | -72.8832000 | 13.923057 | 2.21e-19 | 7.96e-18 | **** |
| Value | nb_1 | rf_1p | 10 | 10 | -81.2290603 | 10.144161 | 1.31e-15 | 4.19e-14 | **** |
| Value | nb_1 | rf_2 | 10 | 10 | -75.2114985 | 12.790714 | 2.59e-18 | 9.06e-17 | **** |
| Value | nb_1 | svm_1 | 10 | 10 | -35.7138208 | 17.520288 | 8.17e-18 | 2.78e-16 | **** |
| Value | nb_1 | svm_1p | 10 | 10 | -27.2885189 | 15.553784 | 1.47e-14 | 4.56e-13 | **** |
| Value | nb_1 | svm_2 | 10 | 10 | -33.6827955 | 17.970672 | 1.08e-17 | 3.56e-16 | **** |
| Value | nb_1p | nb_2 | 10 | 10 | 3.5839676 | 17.971095 | 2.00e-03 | 1.30e-02 |
|
| Value | nb_1p | rf_1 | 10 | 10 | -35.6211233 | 11.179135 | 7.27e-13 | 1.96e-11 | **** |
| Value | nb_1p | rf_1p | 10 | 10 | -37.7367091 | 9.473827 | 1.20e-11 | 2.64e-10 | **** |
| Value | nb_1p | rf_2 | 10 | 10 | -36.1818961 | 10.627350 | 1.79e-12 | 4.30e-11 | **** |
| Value | nb_1p | svm_1 | 10 | 10 | -15.8697585 | 16.787838 | 1.54e-11 | 2.94e-10 | **** |
| Value | nb_1p | svm_1p | 10 | 10 | -11.6732626 | 17.990573 | 7.91e-10 | 9.49e-09 | **** |
| Value | nb_1p | svm_2 | 10 | 10 | -13.0045904 | 15.710899 | 8.02e-10 | 9.49e-09 | **** |
| Value | nb_2 | rf_1 | 10 | 10 | -39.0753012 | 11.015532 | 3.62e-13 | 1.05e-11 | **** |
| Value | nb_2 | rf_1p | 10 | 10 | -41.1988640 | 9.437332 | 5.66e-12 | 1.30e-10 | **** |
| Value | nb_2 | rf_2 | 10 | 10 | -39.6443930 | 10.503691 | 8.79e-13 | 2.20e-11 | **** |
| Value | nb_2 | svm_1 | 10 | 10 | -19.4804626 | 16.463260 | 8.44e-13 | 2.19e-11 | **** |
| Value | nb_2 | svm_1p | 10 | 10 | -15.0546371 | 17.928978 | 1.29e-11 | 2.71e-10 | **** |
| Value | nb_2 | svm_2 | 10 | 10 | -16.8256639 | 15.350480 | 2.63e-11 | 4.73e-10 | **** |
| Value | rf_1 | rf_1p | 10 | 10 | -1.2529569 | 12.688843 | 2.33e-01 | 9.00e-01 | ns |
| Value | rf_1 | rf_2 | 10 | 10 | -0.1050795 | 17.612935 | 9.18e-01 | 1.00e+00 | ns |
| Value | rf_1 | svm_1 | 10 | 10 | 21.3111332 | 12.670798 | 2.65e-11 | 4.73e-10 | **** |
| Value | rf_1 | svm_1p | 10 | 10 | 20.6328671 | 11.277988 | 2.60e-10 | 3.90e-09 | **** |
| Value | rf_1 | svm_2 | 10 | 10 | 29.2631862 | 13.596623 | 1.13e-13 | 3.39e-12 | **** |
| Value | rf_1p | rf_2 | 10 | 10 | 1.2702148 | 13.800530 | 2.25e-01 | 9.00e-01 | ns |
| Value | rf_1p | svm_1 | 10 | 10 | 23.5780454 | 9.820894 | 5.67e-10 | 7.37e-09 | **** |
| Value | rf_1p | svm_1p | 10 | 10 | 22.1123499 | 9.495995 | 1.73e-09 | 1.56e-08 | **** |
| Value | rf_1p | svm_2 | 10 | 10 | 32.8096435 | 10.055947 | 1.47e-11 | 2.94e-10 | **** |
| Value | rf_2 | svm_1 | 10 | 10 | 21.8712517 | 11.777272 | 6.71e-11 | 1.07e-09 | **** |
| Value | rf_2 | svm_1p | 10 | 10 | 20.9908981 | 10.702255 | 4.85e-10 | 6.79e-09 | **** |
| Value | rf_2 | svm_2 | 10 | 10 | 30.1790475 | 12.520781 | 4.51e-13 | 1.26e-11 | **** |
| Value | svm_1 | svm_1p | 10 | 10 | 2.9109250 | 16.962301 | 1.00e-02 | 4.90e-02 |
|
| Value | svm_1 | svm_2 | 10 | 10 | 4.2290527 | 17.719075 | 5.20e-04 | 4.00e-03 | ** |
| Value | svm_1p | svm_2 | 10 | 10 | 0.5713722 | 15.914702 | 5.76e-01 | 1.00e+00 | ns |