There are approximately 500,000 new cases of invasive cervical cancer diagnosed globally and kills about 300,000 women each year. Although this nnumber is steadily decreasing with recent advances in screening technologies.
Cervical cancer diagnoses occur predominately in women between the ages of 35 and 54 years (50%) with 20% for those aged over 65 and 15% for younger women (20-30 years). Women youger than 20 rarely develop cervical cancer however they can become infected with the human papilloma virus (HPV) which increases their chances of developing this disease later in life.
The human papilloma virus (HPV), which is sexually transmitted, is a major risk factor in developing cervical cancer. Studeies have shown that women wo have taken oral contraceptives for longer than 5-10 years appear to have a higher risk of HPV infection. Other STDs have also been associated with a high risk of developing cervical cancer.
Previous studies have suggested a strong correlation between the long term use of oral contraceptives and cervical cance3r.
Women who have had a number of children have also been shown to have an increased risk of developing cervical cancer. Particularly for those women infected with HPV.
Smoking has been shown to be associated with a higher risk for precancerous changes (dysplasia) in the cervix and subsequent prgression to cancer, especially for those infected with HPV.
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre-9.0.4')
library(dplyr)
library(readr)
library(tidyverse)
library(mlr)
library(tidyverse) # for ggplot and data wrangly
library(rJava)
library(FSelector)
library(ggplot2)
library(gridExtra)
library(gmodels)
library(GGally)
library(cowplot)
library(tidyr)
library(magrittr)
library(moments)
library(purrr)
library(data.table)
library(latex2exp)
library(caret)
library(robustHD)
library(spFSR)
library(rjson)
library(party)
library(knitr)
library(kableExtra)
library(stringr)
library(mlbench)
library(e1071)
library(MASS)
set.seed(999)
Read in dataset.
dataCancer <- read_csv("kag_risk_factors_cervical_cancer.csv", col_names = TRUE, na = "?", trim_ws = TRUE)
kable(dataCancer %>% head(), caption = "Table 1. Cervical Cancer Dataset") %>% kable_styling(bootstrap_options = c("condensed"), full_width = TRUE, font_size = 10)
| Age | Num_sexual_partners | First_sexual_intercourse | Num_pregnancies | Smokes | Smokes_years | Smokes_packs_year | Hormonal_Contraceptives | Hormonal_Contraceptives_years | IUD | IUD_years | STDs | STDs_number | STDs_condylomatosis | STDs_cervical_condylomatosis | STDs_vaginal_condylomatosis | STDs_vulvo_perineal_condylomatosis | STDs_syphilis | STDs_pelvic_inflam_disease | STDs_genital_herpes | STDs_molluscum_contagiosum | STDs_AIDS | STDs_HIV | STDs_Hepatitis_B | STDs_HPV | STDs_Num_diagnosis | STDs_Time_since_first_diagnosis | STDs_Time_since_last_diagnosis | Dx_Cancer | Dx_CIN | Dx_HPV | Dx | Hinselmann | Schiller | Citology | Biopsy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 4 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 1 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34 | 1 | NA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 52 | 5 | 16 | 4 | 1 | 37 | 37 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 46 | 3 | 21 | 4 | 0 | 0 | 0 | 1 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42 | 3 | 23 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
dim(dataCancer)
## [1] 858 36
kable(dataCancer %>% summarizeColumns(), caption = "Table 2. Statistical Summary") %>% kable_styling(bootstrap_options = c("condensed"), full_width = TRUE, font_size = 10)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Age | integer | 0 | 26.8205128 | 8.4979481 | 25.0 | 8.1543 | 13 | 84 | 0 |
| Num_sexual_partners | integer | 26 | 2.5276442 | 1.6677605 | 2.0 | 1.4826 | 1 | 28 | 0 |
| First_sexual_intercourse | integer | 7 | 16.9952996 | 2.8033554 | 17.0 | 2.9652 | 10 | 32 | 0 |
| Num_pregnancies | integer | 56 | 2.2755611 | 1.4474141 | 2.0 | 1.4826 | 0 | 11 | 0 |
| Smokes | integer | 13 | 0.1455621 | 0.3528756 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Smokes_years | numeric | 13 | 1.2197214 | 4.0890169 | 0.0 | 0.0000 | 0 | 37 | 0 |
| Smokes_packs_year | numeric | 13 | 0.4531440 | 2.2266098 | 0.0 | 0.0000 | 0 | 37 | 0 |
| Hormonal_Contraceptives | integer | 108 | 0.6413333 | 0.4799292 | 1.0 | 0.0000 | 0 | 1 | 0 |
| Hormonal_Contraceptives_years | numeric | 108 | 2.2564192 | 3.7642535 | 0.5 | 0.7413 | 0 | 30 | 0 |
| IUD | integer | 117 | 0.1120108 | 0.3155928 | 0.0 | 0.0000 | 0 | 1 | 0 |
| IUD_years | numeric | 117 | 0.5148043 | 1.9430885 | 0.0 | 0.0000 | 0 | 19 | 0 |
| STDs | integer | 105 | 0.1049137 | 0.3066458 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_number | integer | 105 | 0.1766268 | 0.5619928 | 0.0 | 0.0000 | 0 | 4 | 0 |
| STDs_condylomatosis | integer | 105 | 0.0584329 | 0.2347162 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_cervical_condylomatosis | integer | 105 | 0.0000000 | 0.0000000 | 0.0 | 0.0000 | 0 | 0 | 0 |
| STDs_vaginal_condylomatosis | integer | 105 | 0.0053121 | 0.0727385 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_vulvo_perineal_condylomatosis | integer | 105 | 0.0571049 | 0.2321972 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_syphilis | integer | 105 | 0.0239044 | 0.1528528 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_pelvic_inflam_disease | integer | 105 | 0.0013280 | 0.0364420 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_genital_herpes | integer | 105 | 0.0013280 | 0.0364420 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_molluscum_contagiosum | integer | 105 | 0.0013280 | 0.0364420 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_AIDS | integer | 105 | 0.0000000 | 0.0000000 | 0.0 | 0.0000 | 0 | 0 | 0 |
| STDs_HIV | integer | 105 | 0.0239044 | 0.1528528 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_Hepatitis_B | integer | 105 | 0.0013280 | 0.0364420 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_HPV | integer | 105 | 0.0026560 | 0.0515025 | 0.0 | 0.0000 | 0 | 1 | 0 |
| STDs_Num_diagnosis | integer | 0 | 0.0874126 | 0.3025447 | 0.0 | 0.0000 | 0 | 3 | 0 |
| STDs_Time_since_first_diagnosis | integer | 787 | 6.1408451 | 5.8950240 | 4.0 | 4.4478 | 1 | 22 | 0 |
| STDs_Time_since_last_diagnosis | integer | 787 | 5.8169014 | 5.7552705 | 3.0 | 2.9652 | 1 | 22 | 0 |
| Dx_Cancer | integer | 0 | 0.0209790 | 0.1433976 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Dx_CIN | integer | 0 | 0.0104895 | 0.1019392 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Dx_HPV | integer | 0 | 0.0209790 | 0.1433976 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Dx | integer | 0 | 0.0279720 | 0.1649888 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Hinselmann | integer | 0 | 0.0407925 | 0.1979246 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Schiller | integer | 0 | 0.0862471 | 0.2808923 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Citology | integer | 0 | 0.0512821 | 0.2207011 | 0.0 | 0.0000 | 0 | 1 | 0 |
| Biopsy | integer | 0 | 0.0641026 | 0.2450784 | 0.0 | 0.0000 | 0 | 1 | 0 |
Removed columns for STDs_cervical_condylomatosis, STDs_vaginal_condylomatosis, STDs_pelvic_inflam_disease, STDs_genital_herpes, STDs_molluscum_contagiosum, STDs_Hepatitis_B, STDs_HPV and STD_AIDS since there 4 or less patient results for these features. Also removed features STDs_Time_since_first_diagnosis and STDs_Time_since_last_diagnosis which contained greater than 60% missing values (787 of 858).
dataCancer <- subset(dataCancer, select = -STDs_cervical_condylomatosis)
dataCancer <- subset(dataCancer, select = -STDs_vaginal_condylomatosis)
dataCancer <- subset(dataCancer, select = -STDs_pelvic_inflam_disease)
dataCancer <- subset(dataCancer, select = -STDs_genital_herpes)
dataCancer <- subset(dataCancer, select = -STDs_molluscum_contagiosum)
dataCancer <- subset(dataCancer, select = -STDs_Hepatitis_B)
dataCancer <- subset(dataCancer, select = -STDs_HPV)
dataCancer <- subset(dataCancer, select = -STDs_AIDS)
dataCancer <- subset(dataCancer, select = -STDs_Time_since_first_diagnosis)
dataCancer <- subset(dataCancer, select = -STDs_Time_since_last_diagnosis)
# How many NAs are in the data
length(which(is.na(dataCancer)))
## [1] 1208
kable(colSums(is.na(dataCancer)), caption = "Table 3. Number of Missing Values in Each Column") %>% kable_styling(bootstrap_options = c("condensed"), full_width = TRUE, font_size = 10)
| x | |
|---|---|
| Age | 0 |
| Num_sexual_partners | 26 |
| First_sexual_intercourse | 7 |
| Num_pregnancies | 56 |
| Smokes | 13 |
| Smokes_years | 13 |
| Smokes_packs_year | 13 |
| Hormonal_Contraceptives | 108 |
| Hormonal_Contraceptives_years | 108 |
| IUD | 117 |
| IUD_years | 117 |
| STDs | 105 |
| STDs_number | 105 |
| STDs_condylomatosis | 105 |
| STDs_vulvo_perineal_condylomatosis | 105 |
| STDs_syphilis | 105 |
| STDs_HIV | 105 |
| STDs_Num_diagnosis | 0 |
| Dx_Cancer | 0 |
| Dx_CIN | 0 |
| Dx_HPV | 0 |
| Dx | 0 |
| Hinselmann | 0 |
| Schiller | 0 |
| Citology | 0 |
| Biopsy | 0 |
Data rows containing missing values for smokes were removed since these only comprised 13 of the 858 patients. Also removed rows of First_Sexual_Intercourse with missing values (7 of 858 patients). Rows missing values for STDs were also removed since they were missing in all features.
dataCancer <- dataCancer[!is.na(dataCancer$Smokes),]
dataCancer <- dataCancer[!is.na(dataCancer$First_sexual_intercourse),]
dataCancer <- dataCancer[!is.na(dataCancer$STDs),]
Proportion of missing values
prop_NA <- function(x) { mean(is.na(x))}
missingData <- sapply(dataCancer, prop_NA)
missingData <- data.frame(Variables = names(missingData), Proportion = missingData, Completude = 1 - missingData)
ggplot(missingData, aes(x=missingData$Variables, y=missingData$Completude)) + geom_col(color = "black", fill = "skyblue", width = 1) + xlab("") + ylab("Completude") + theme_classic() + theme(axis.text.x=element_text(angle=90, size=8, hjust = 1), axis.title.y=element_text(face='bold', size=14))
Figure 1. Proportion of missing values for each feature in the data set.
Assume missing values for number of pregnancies is 0
dataCancer$Num_pregnancies[is.na(dataCancer$Num_pregnancies)] <- 0
The values for Biopsy, Hinselmann, Schiller, Citology represent the results for cervical cancer exams. Therefore, the data for these columns were combined to produce a target feature called Cancer. Higher values for this feature represent an increased likelihood of cervical cancer. Finally, the Biopsy, Hinselmann, Schiller, Citology features were removed to avoid intoducing bias during the training of the various ML models.
dataCancer$Cancer <- dataCancer$Biopsy + dataCancer$Citology + dataCancer$Schiller + dataCancer$Hinselmann
dataCancer$Cancer[dataCancer$Cancer == 2] <- 1
dataCancer$Cancer[dataCancer$Cancer == 3] <- 1
dataCancer$Cancer[dataCancer$Cancer == 4] <- 1
dataCancer$Cancer <- factor(dataCancer$Cancer, levels = c("0", "1"))
dataCancer <- subset(dataCancer, select = -Biopsy)
dataCancer <- subset(dataCancer, select = -Citology)
dataCancer <- subset(dataCancer, select = -Schiller)
dataCancer <- subset(dataCancer, select = -Hinselmann)
ggplot(dataCancer, aes(x = dataCancer$Cancer)) + geom_bar(color = "black", fill = "skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Frequency") + theme_bw()
Figure 2. Frequency of the likelihood of cervical cancer based on the results of several cinical exams.
The plot in Fig. 2 shows that the data set is highly unbalanced toward those diagnosed as having no clinical evidence for cervical cancer.
p1 <- ggplot(dataCancer, aes(x=Age, fill = Cancer)) + geom_histogram(binwidth = 2, color = "black") + xlab("Patient Age (years)") + ylab("Frequency") + geom_vline(aes(xintercept=median(dataCancer$Age, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p2 <- ggplot(dataCancer, aes(Cancer, Age)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Patient Age (years)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p1, p2, nrow = 1)
Figure 3. Distribution of Patient ages. Left - histogram. Right - boxplot.
We capped the data by replacing those observations above the upper limit, with the value of 95th %ile.
qnt <- quantile(dataCancer$Age, probs=c(.25, .75), na.rm = TRUE)
caps <- quantile(dataCancer$Age, probs=c(.05, .95), na.rm = TRUE)
H <- 1.5 * IQR(dataCancer$Age, na.rm = TRUE)
dataCancer$Age[dataCancer$Age > (qnt[2] + H)] <- caps[2]
p1 <- ggplot(dataCancer, aes(x=Age, fill = Cancer)) + geom_histogram(binwidth = 2, color = "black") + xlab("Patient Age (years)") + ylab("Frequency") + geom_vline(aes(xintercept=median(dataCancer$Age, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p2 <- ggplot(dataCancer, aes(Cancer, Age)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Patient Age (years)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p1, p2, nrow = 1)
Figure 4. Distribution of likelihood for cervical cancer with patient age. Left - histogram. Right - boxplot.
As illustrated in figs. 3 and 4 the likelihood to have cervical cancer does not appear to vary greatly with age (Fig. 3).
p1 <- ggplot(dataCancer, aes(x=Num_sexual_partners, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Number of Sexual Partners") + ylab("Frequency") + geom_vline(aes(xintercept=median(dataCancer$Num_sexual_partners, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p2 <- ggplot(dataCancer, aes(Cancer, Num_sexual_partners)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Number of Sexual Partners") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p1, p2, nrow = 1)
Figure 5. Distribution of likelihood for cervical cancer with number of sexual partners. Left - histogram. Right - boxplot.
The majority of patients had less than 5 sexual partners. The data is highly positively skewed with a maximum value of 28 and a median of 2. Removed outlier Num_sexual_partners = 28 since this patient exhibited no clinical sign of cervical cancer.
dataCancer$Num_sexual_partners[(dataCancer$Num_sexual_partners == 28)] <- 0
p1 <- ggplot(dataCancer, aes(x=Num_sexual_partners, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Number of Sexual Partners") + ylab("Frequency") + geom_vline(aes(xintercept=median(dataCancer$Num_sexual_partners, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p2 <- ggplot(dataCancer, aes(Cancer, Num_sexual_partners)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Number of Sexual Partners") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p1, p2, nrow = 1)
Figure 6. Distribution of the Number of Sexual Partners for each partner. Left - histogram. Right - Boxplot.
The number of sexual partners exibited no apparent correlation with the likelihood of having cervical cancer.
p3 <- ggplot(dataCancer, aes(x=First_sexual_intercourse, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Age of First Sexual Intercourse") + ylab("Frequency") + geom_vline(aes(xintercept=median(dataCancer$First_sexual_intercourse, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p4 <- ggplot(dataCancer, aes(Cancer, First_sexual_intercourse)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Age of First Sexual Intercourse") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p3, p4, nrow = 1)
Figure 7. Distribution of time of first sexual intercourse for patients. Left - histogram. Right - boxplot.
Patients varied in the age of first sexual encounter ranging from 10 to 32 years with a largely normal distribution and a median value of 17 years old (Fig. 7).
qqnorm(dataCancer$First_sexual_intercourse, ylab="Sample Quantiles \nfor Number of Sexual Partners")
qqline(dataCancer$First_sexual_intercourse, col="red")
skewness(dataCancer$First_sexual_intercourse)
## [1] 1.556652
Figure 8. The qq-plot for Number of Sexual Partners feature.
The qqplot for the Number of Sexual Partners indicates that the data is slightly positively skewed (skewness = 1.56).
p5 <- ggplot(dataCancer, aes(x=Num_pregnancies, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Number of Pregnancies") + ylab("Frequency") + geom_vline(aes(xintercept=median(dataCancer$Num_pregnancies, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p6 <- ggplot(dataCancer, aes(Cancer, Num_pregnancies)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Number of Pregnancies") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p5, p6, nrow = 1)
Figure 9. Distribution of the number of pregnancies for each patient. Left - histogram. Right - boxplot.
The number of pregnancies for each patient varied from 0 to 11 with a median value of 2 (Fig. 9). The number of pregnancies did not appear to account for the likelihood to have cervical cancer.
dataCancer$Smokes <- factor(dataCancer$Smokes, levels = c("0", "1"))
labels <- c("0" = "Non-smokers", "1" = "Smokers")
ggplot(data = subset(dataCancer, !is.na(Smokes)), aes(x = Cancer)) + geom_bar(color = "black", fill = "skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Frequency") + theme_bw() + facet_wrap(~ Smokes, scales = "free", labeller=labeller(Smokes = labels))
CrossTable(dataCancer$Cancer, dataCancer$Smokes)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 737
##
##
## | dataCancer$Smokes
## dataCancer$Cancer | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 556 | 86 | 642 |
## | 0.073 | 0.435 | |
## | 0.866 | 0.134 | 0.871 |
## | 0.881 | 0.811 | |
## | 0.754 | 0.117 | |
## ------------------|-----------|-----------|-----------|
## 1 | 75 | 20 | 95 |
## | 0.494 | 2.939 | |
## | 0.789 | 0.211 | 0.129 |
## | 0.119 | 0.189 | |
## | 0.102 | 0.027 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 631 | 106 | 737 |
## | 0.856 | 0.144 | |
## ------------------|-----------|-----------|-----------|
##
##
Figure 10. Smoking status for patients.
The patients are predominantly non-smokers (85%, Fig. 10).The table shows that there is a slightly higher chance of patients who are smokers to also have cervical cancer.
Smokes_years_nz <- dataCancer %>% filter(Smokes_years > 0)
p7 <- ggplot(Smokes_years_nz, aes(x=Smokes_years, fill = Cancer)) + geom_histogram(binwidth = 2, color = "black") + xlab("Time Smoking (years)") + ylab("Frequency") + geom_vline(aes(xintercept=median(Smokes_years_nz$Smokes_years, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p8 <- ggplot(Smokes_years_nz, aes(Cancer, Smokes_years)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Time Smoking (years)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p7, p8, nrow = 1)
Figure 11.. Distribution of number of years spent smoking for each patient. Left - histogram. Right - boxplot.
The period spent smoking by smokers within the patient cohort varied between 0.16 and 37 years with a median value of 7 years (Fig. 11). The data was positively skewed with several patients who had been smoking for longer than 30 years. Several patients with no clinical signs of cervical cancer had smoked for long periods of their lives (i.e. > 10 years) however these tended to be outliers. Interestingly, smokers with clinical evidence for cervical cancer had smoked for longer periods.
Smokes_packs_years_nz <- dataCancer %>% filter(Smokes_packs_year > 0)
p9 <- ggplot(Smokes_packs_years_nz, aes(x=Smokes_packs_year, fill = Cancer)) + geom_histogram(binwidth = 2, color = "black") + xlab("Cigarettes (packs)") + ylab("Frequency") + geom_vline(aes(xintercept=median(Smokes_packs_years_nz$Smokes_packs_year, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p10 <- ggplot(Smokes_packs_years_nz, aes(Cancer, Smokes_packs_year)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Cigarettes (packs)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p9, p10, nrow = 1)
Figure 12. Quantity of cigarettes smoked each year. Left - histogram. Right - boxplot.
An outlier (Smokes_packs_year = 37) was removed since it did not correlate with an increased likelihood for cervical cancer (Fig. 12).
dataCancer$Smokes_packs_year[(dataCancer$Smokes_packs_year == 37)] <- 0
Smokes_packs_years_nz <- dataCancer %>% filter(Smokes_packs_year > 0)
p9 <- ggplot(Smokes_packs_years_nz, aes(x=Smokes_packs_year, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Cigarettes (packs)") + ylab("Frequency") + geom_vline(aes(xintercept=median(Smokes_packs_years_nz$Smokes_packs_year, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p10 <- ggplot(Smokes_packs_years_nz, aes(Cancer, Smokes_packs_year)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Cigarettes (packs)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p9, p10, nrow = 1)
skewness(dataCancer$Smokes_packs_year)
## [1] 7.452461
Figure 13. Distribution of packs of cigarettes smoked by cervical cancer patients. Left - histogram. Right - boxplot.
qqnorm(Smokes_packs_years_nz$Smokes_packs_year, ylab="Sample Quantiles for Number of Cigarettes")
qqline(Smokes_packs_years_nz$Smokes_packs_year, col="red")
skewness(Smokes_packs_years_nz$Smokes_packs_year)
## [1] 2.69972
Figure 14. The qq-plot for the Smokes_packs_year feature.
The qqplot for the Smokes_packs_year feature are highly skewed (skewness = 2.69972). The data for the Smokes_packs_year feature were cube-root transformed in order to obtain a more gaussian distribution.
#Cubed root transformation of Smokes_packs_year feature
dataCancer$Smokes_packs_year <- sign(dataCancer$Smokes_packs_year) * abs(dataCancer$Smokes_packs_year)^(1/3)
Smokes_packs_years_nz$Smokes_packs_year <- sign(Smokes_packs_years_nz$Smokes_packs_year) * abs(Smokes_packs_years_nz$Smokes_packs_year)^(1/3)
p9 <- ggplot(Smokes_packs_years_nz, aes(x=Smokes_packs_year, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Cigarettes (packs)") + ylab("Frequency") + geom_vline(aes(xintercept=median(Smokes_packs_years_nz$Smokes_packs_year, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p10 <- ggplot(Smokes_packs_years_nz, aes(Cancer, Smokes_packs_year)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Cigarettes (packs)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p9, p10, nrow = 1)
Figure 15. Data for the Smokes_packs_year feature following cubed root transformation. Left - histogram. Right - boxplot.
The boxplots in fig. 15 indicate that smokers who consumed more cigarettes each year had a higher likelihood of exhibiting the clinical signs of cervical cancer.
qqnorm(Smokes_packs_years_nz$Smokes_packs_year, ylab="Sample Quantiles for Number of Cigarettes")
qqline(Smokes_packs_years_nz$Smokes_packs_year, col="red")
skewness(Smokes_packs_years_nz$Smokes_packs_year)
## [1] 0.6762561
Figure 16. The qq-plot following cubed root transformation of the Smokes_packs_year feature.
dataCancer$Hormonal_Contraceptives <- factor(dataCancer$Hormonal_Contraceptives, levels = c("0", "1"))
labels <- c("0" = "Not on Contraceptives", "1" = "On Contraceptives")
ggplot(data = subset(dataCancer, !is.na(Hormonal_Contraceptives)), aes(x = Cancer)) + geom_bar(color = "black", fill = "skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Frequency") + theme_bw() + facet_wrap(~ Hormonal_Contraceptives, scales = "free", labeller=labeller(Hormonal_Contraceptives = labels))
CrossTable(dataCancer$Cancer, dataCancer$Hormonal_Contraceptives)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 726
##
##
## | dataCancer$Hormonal_Contraceptives
## dataCancer$Cancer | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 226 | 405 | 631 |
## | 0.003 | 0.002 | |
## | 0.358 | 0.642 | 0.869 |
## | 0.866 | 0.871 | |
## | 0.311 | 0.558 | |
## ------------------|-----------|-----------|-----------|
## 1 | 35 | 60 | 95 |
## | 0.021 | 0.012 | |
## | 0.368 | 0.632 | 0.131 |
## | 0.134 | 0.129 | |
## | 0.048 | 0.083 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 261 | 465 | 726 |
## | 0.360 | 0.640 | |
## ------------------|-----------|-----------|-----------|
##
##
Figure 17. Patients on hormonal contraceptives by likelihood of cervical cancer.
Approximately two thirds (417 out 650) of the patient cohort were taking hormonal contraceptives (Fig. 17).
Hormonal_Contraceptives_years_nz <- dataCancer %>% filter(Hormonal_Contraceptives_years > 0)
p11 <- ggplot(Hormonal_Contraceptives_years_nz, aes(x=Hormonal_Contraceptives_years, fill = Cancer)) + geom_histogram(binwidth = 2, color = "black") + xlab("Time on Hormonal Contraceptives (years)") + ylab("Frequency") + geom_vline(aes(xintercept=median(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p12 <- ggplot(Hormonal_Contraceptives_years_nz, aes(Cancer, Hormonal_Contraceptives_years)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Time on Hormonal Contraceptives (years)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p11, p12, nrow = 1)
Figure 18. Distribution of length of period on hormonal contraceptives. Left - histogram. Right - boxplot.
Patients on hormonal contraceptives had been taking them for varying lengths of time from 0.08 to 30 years (Fig. 18). The majority of the patient cohort were on contraceptives for less than 10 years.
qqnorm(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years, ylab="Sample Quantiles for \nTime on Hormonal Contraceptives")
qqline(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years, col="red")
skewness(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years)
## [1] 1.818768
Figure 19. The qq-plot for the Hormonal_Contraceptives_years feature.
The qq-plot for the data for the Hormonal_Contraceptives_years is highly skewed (Fig. 19). The data for the Hormonal_Contraceptives_years feature were cube-root transformed in order to obtain a more gaussian distribution.
#Cubed root transformation of Smokes_packs_year feature
dataCancer$Hormonal_Contraceptives_years <- sign(dataCancer$Hormonal_Contraceptives_years) * abs(dataCancer$Hormonal_Contraceptives_years)^(1/3)
Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years <- sign(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years) * abs(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years)^(1/3)
p9 <- ggplot(Hormonal_Contraceptives_years_nz, aes(x=Hormonal_Contraceptives_years, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Time on Hormonal Contraceptives (years)") + ylab("Frequency") + geom_vline(aes(xintercept=median(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p10 <- ggplot(Hormonal_Contraceptives_years_nz, aes(Cancer, Hormonal_Contraceptives_years)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Time on Hormonal Contraceptives (years)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p9, p10, nrow = 1)
Figure 20. Distribution of the data for the Hormonal_Contraceptives_years feature. Left - histogram. Right - boxplot.
qqnorm(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years, ylab="Sample Quantiles for \nTime on Hormonal Contraceptives")
qqline(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years, col="red")
skewness(Hormonal_Contraceptives_years_nz$Hormonal_Contraceptives_years)
## [1] 0.384565
Figure 21. The qq-plot of the Hormonal_Contraceptives_years feature following cubed root transformation of the data.
dataCancer$IUD <- factor(dataCancer$IUD, levels = c("0", "1"))
labels <- c("0" = "Does no use IUD", "1" = "Uses IUD")
ggplot(data = subset(dataCancer, !is.na(IUD)), aes(x = Cancer)) + geom_bar(color = "black", fill = "skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Frequency") + theme_bw() + facet_wrap(~ IUD, scales = "free", labeller=labeller(IUD = labels))
CrossTable(dataCancer$Cancer, dataCancer$IUD)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 722
##
##
## | dataCancer$IUD
## dataCancer$Cancer | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 564 | 65 | 629 |
## | 0.055 | 0.439 | |
## | 0.897 | 0.103 | 0.871 |
## | 0.880 | 0.802 | |
## | 0.781 | 0.090 | |
## ------------------|-----------|-----------|-----------|
## 1 | 77 | 16 | 93 |
## | 0.375 | 2.970 | |
## | 0.828 | 0.172 | 0.129 |
## | 0.120 | 0.198 | |
## | 0.107 | 0.022 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 641 | 81 | 722 |
## | 0.888 | 0.112 | |
## ------------------|-----------|-----------|-----------|
##
##
Figure 22. Barchart showing the proportion of patients using an Intra Uterine Device (IUD) by likelihood of cervical cancer.
Approximately 10 percent of the patient cohort were using an Intra Uterine Device (IUD) (Fig. 22).
IUD_years_nz <- dataCancer %>% filter(IUD_years > 0)
p13 <- ggplot(IUD_years_nz, aes(x=IUD_years, fill = Cancer)) + geom_histogram(binwidth = 2, color = "black") + xlab("IUD use (years)") + ylab("Frequency") + geom_vline(aes(xintercept=median(IUD_years_nz$IUD_years, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p14 <- ggplot(IUD_years_nz, aes(Cancer, IUD_years)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("IUD use (years)") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p13, p14, nrow = 1)
Figure 23. Distribution of the number of years using an IUD. Left - histogram. Right - boxplot.
The period of time that patients had been using an IUD varied between 0.08 and 19 years with a median value of 4 years (Fig. 23).
qqnorm(IUD_years_nz$IUD_years, ylab="Sample Quantiles for \nTime using IUD")
qqline(IUD_years_nz$IUD_years, col="red")
skewness(IUD_years_nz$IUD_years)
## [1] 1.301422
Figure 24. The qq-plot of the IUD_years feature.
The qq-plot (Fig. 24) indicates that data for IUD_years is highly skewed. Therefore the data were transformed with a cubed root function to obtain a more gaussian distribution.
#Cubed root transformation of IUD_years feature
dataCancer$IUD_years <- sign(dataCancer$IUD_years) * abs(dataCancer$IUD_years)^(1/3)
IUD_years_nz$IUD_years <- sign(IUD_years_nz$IUD_years) * abs(IUD_years_nz$IUD_years)^(1/3)
p9 <- ggplot(IUD_years_nz, aes(x=IUD_years, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Time using IUD") + ylab("Frequency") + geom_vline(aes(xintercept=median(IUD_years_nz$IUD_years, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p10 <- ggplot(IUD_years_nz, aes(Cancer, IUD_years)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Time using IUD") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p9, p10, nrow = 1)
Figure 25. Distribution of cube root transformed data for the IUD_years feature. Left - histogram. Right - boxplot.
qqnorm(IUD_years_nz$IUD_years, ylab="Sample Quantiles for \nTime using IUD")
qqline(IUD_years_nz$IUD_years, col="red")
skewness(IUD_years_nz$IUD_years)
## [1] -0.1482602
Figure 26. The qq-plot for the cube root transformed data for the IUD_years feature.
dataCancer$STDs <- factor(dataCancer$STDs, levels = c("0", "1"))
labels <- c("0" = "No STD", "1" = "Has an STD")
ggplot(data = subset(dataCancer, !is.na(STDs)), aes(x = Cancer)) + geom_bar(color = "black", fill = "skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Frequency") + theme_bw() + facet_wrap(~ STDs, scales = "free", labeller=labeller(STDs = labels))
CrossTable(dataCancer$Cancer, dataCancer$STDs)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 737
##
##
## | dataCancer$STDs
## dataCancer$Cancer | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 585 | 57 | 642 |
## | 0.147 | 1.279 | |
## | 0.911 | 0.089 | 0.871 |
## | 0.885 | 0.750 | |
## | 0.794 | 0.077 | |
## ------------------|-----------|-----------|-----------|
## 1 | 76 | 19 | 95 |
## | 0.994 | 8.646 | |
## | 0.800 | 0.200 | 0.129 |
## | 0.115 | 0.250 | |
## | 0.103 | 0.026 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 661 | 76 | 737 |
## | 0.897 | 0.103 | |
## ------------------|-----------|-----------|-----------|
##
##
Figure 27. Proportion of patients having an STD by likelihood of cervical cancer.
Less than 10 percent (60 out of 656) of the patient cohort suffered from a sexually transmitted disease (STD) (Fig. 27). The data indicate that a higher proportion of those having a STD also exhibited clinical signs of cervical cancer.
STDs_number_nz <- dataCancer %>% filter(STDs_number > 0)
p15 <- ggplot(STDs_number_nz, aes(x=STDs_number, fill = Cancer)) + geom_histogram(binwidth = 1, color = "black") + xlab("Number of STDs") + ylab("Frequency") + geom_vline(aes(xintercept=median(STDs_number_nz$STDs_number, na.rm = TRUE)), color="red", linetype="dashed", size=1) + theme_bw()
p16 <- ggplot(STDs_number_nz, aes(Cancer, STDs_number)) + geom_boxplot(fill="skyblue") + xlab("Likelihood of Cervical Cancer") + ylab("Number of STDs") + geom_point(aes(fill="black"), size = 1, alpha = 0.2, position = position_jitterdodge(), show.legend = FALSE) + theme_bw()
grid.arrange(p15, p16, nrow = 1)
Figure 28. Relationship between number of STDs and the likelihood of exhibiting clinical signs of cervical cancer. Left - histogram. Right - boxplot.
Patients in the cohort with STDs had between 1 and 4 different STDs with the median being 2 (Fig. 28).
dataCancer$STDs_condylomatosis <- factor(dataCancer$STDs_condylomatosis, levels = c("0", "1"))
dataCancer$STDs_vulvo_perineal_condylomatosis <- factor(dataCancer$STDs_vulvo_perineal_condylomatosis, levels = c("0", "1"))
dataCancer$STDs_syphilis <- factor(dataCancer$STDs_syphilis, levels = c("0", "1"))
dataCancer$STDs_HIV <- factor(dataCancer$STDs_HIV, levels = c("0", "1"))
dataCancer$Dx_Cancer <- factor(dataCancer$Dx_Cancer, levels = c("0", "1"))
dataCancer$Dx_CIN <- factor(dataCancer$Dx_CIN, levels = c("0", "1"))
dataCancer$IUD <- factor(dataCancer$IUD, levels = c("0", "1"))
dataCancer$Dx_HPV <- factor(dataCancer$Dx_HPV, levels = c("0", "1"))
dataCancer$Dx <- factor(dataCancer$Dx, levels = c("0", "1"))
Missing data in features referring to contraceptives, IUDs and Num_sexual_partners were replaced by 0.
dataCancer$Num_sexual_partners[is.na(dataCancer$Num_sexual_partners)] <- 0
dataCancer$Hormonal_Contraceptives[is.na(dataCancer$Hormonal_Contraceptives)] <- 0
dataCancer$Hormonal_Contraceptives_years[is.na(dataCancer$Hormonal_Contraceptives_years)] <- 0
dataCancer$IUD[is.na(dataCancer$IUD)] <- 0
dataCancer$IUD_years[is.na(dataCancer$IUD_years)] <- 0
kable(summarizeColumns(dataCancer), caption = "Table 4. Feature Summary before Data Preprocessing") %>% kable_styling(bootstrap_options = c("condensed"), full_width = TRUE, font_size = 10)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Age | numeric | 0 | 26.9905020 | 7.8647650 | 26.0000000 | 8.8956 | 13 | 51.000000 | 0 |
| Num_sexual_partners | numeric | 0 | 2.4301221 | 1.3375083 | 2.0000000 | 1.4826 | 0 | 8.000000 | 0 |
| First_sexual_intercourse | integer | 0 | 17.0895522 | 2.8499832 | 17.0000000 | 2.9652 | 10 | 32.000000 | 0 |
| Num_pregnancies | numeric | 0 | 2.1818182 | 1.5365282 | 2.0000000 | 1.4826 | 0 | 11.000000 | 0 |
| Smokes | factor | 0 | NA | 0.1438263 | NA | NA | 106 | 631.000000 | 2 |
| Smokes_years | numeric | 0 | 1.2223388 | 4.1403498 | 0.0000000 | 0.0000 | 0 | 37.000000 | 0 |
| Smokes_packs_year | numeric | 0 | 0.1678524 | 0.4638158 | 0.0000000 | 0.0000 | 0 | 2.802039 | 0 |
| Hormonal_Contraceptives | factor | 0 | NA | 0.3690638 | NA | NA | 272 | 465.000000 | 2 |
| Hormonal_Contraceptives_years | numeric | 0 | 0.8151810 | 0.7711324 | 0.7488872 | 1.1103 | 0 | 2.802039 | 0 |
| IUD | factor | 0 | NA | 0.1099050 | NA | NA | 81 | 656.000000 | 2 |
| IUD_years | numeric | 0 | 0.1664169 | 0.5026574 | 0.0000000 | 0.0000 | 0 | 2.668402 | 0 |
| STDs | factor | 0 | NA | 0.1031208 | NA | NA | 76 | 661.000000 | 2 |
| STDs_number | integer | 0 | 0.1723202 | 0.5530490 | 0.0000000 | 0.0000 | 0 | 4.000000 | 0 |
| STDs_condylomatosis | factor | 0 | NA | 0.0569878 | NA | NA | 42 | 695.000000 | 2 |
| STDs_vulvo_perineal_condylomatosis | factor | 0 | NA | 0.0556309 | NA | NA | 41 | 696.000000 | 2 |
| STDs_syphilis | factor | 0 | NA | 0.0230665 | NA | NA | 17 | 720.000000 | 2 |
| STDs_HIV | factor | 0 | NA | 0.0230665 | NA | NA | 17 | 720.000000 | 2 |
| STDs_Num_diagnosis | integer | 0 | 0.0963365 | 0.3131192 | 0.0000000 | 0.0000 | 0 | 3.000000 | 0 |
| Dx_Cancer | factor | 0 | NA | 0.0230665 | NA | NA | 17 | 720.000000 | 2 |
| Dx_CIN | factor | 0 | NA | 0.0108548 | NA | NA | 8 | 729.000000 | 2 |
| Dx_HPV | factor | 0 | NA | 0.0230665 | NA | NA | 17 | 720.000000 | 2 |
| Dx | factor | 0 | NA | 0.0298507 | NA | NA | 22 | 715.000000 | 2 |
| Cancer | factor | 0 | NA | 0.1289009 | NA | NA | 95 | 642.000000 | 2 |
library(GGally)
ggscatmat(dataCancer, alpha = 0.5, columns = c('Age', 'Num_sexual_partners', 'Num_pregnancies', 'Smokes_years', 'Smokes_packs_year', 'Hormonal_Contraceptives_years', 'IUD_years'), color = "Cancer") + ylab("") + theme_classic(base_size = 8)
Figure 29. Scatter matrix for various variables in the cervical cancer dataset.
The scatter matrix shown in Fig. 29 indicates correlation between (1) Number of Years Smoking (Smokes_years) and Number of Packs of Cigarettes Smoked per Year (Smokes_packs_year), (2) Patient Age (Age) and Number of Pregnancies (Num_pregnancies), (3) Patient Age (Age) and Number of Years Smoking (Smokes_years), (4) Patient Age (Age) and Number of Years on Contraceptives (Hormonal_Contraceptives_years), (5) Patient Age (Age) and Number of Years on IUDs (IUD_years) and (6) Number of Pregnancies (Num_pregnancies) and Number of Years Smoking (Smokes_years).
The data were standardized to compensate for differences in the scales used for various features.
numerical_col_index <- unlist(lapply(dataCancer, is.numeric)) #For all numeric columns
data_standardized <- lapply(dataCancer[, numerical_col_index], standardize) #Standardize
mean(data_standardized$Age)
## [1] -1.175171e-17
sd(data_standardized$Age)
## [1] 1
categorical_col_index <- !numerical_col_index
data_standardized <- cbind(data_standardized, dataCancer[, categorical_col_index])
The rows in the dataset were randomized prior to splitting into training and test sets.
#Shuffle dataset rows
set.seed(1234)
n <- nrow(data_standardized)
shuffled_data <- data_standardized[sample(n), ]
# Split into 70% training and 30% test data.
training_index <- sample(nrow(shuffled_data)*0.70)
test_index <- setdiff(seq(1:nrow(shuffled_data)), training_index)
# Get the training and test data.
training_data <- shuffled_data[training_index, ]
test_data <- shuffled_data[test_index, ]
prop.table(table(training_data$Cancer))
##
## 0 1
## 0.8796117 0.1203883
prop.table(table(test_data$Cancer))
##
## 0 1
## 0.8513514 0.1486486
The datasets were reasonably balanced and representative of the full dataset. We shall use training data for modeling and test data for model evaluation.
The data were initially modeled using all 33 features.
## Loading required package: kknn
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
## Type len Def Constr Req Tunable Trafo
## laplace numeric - 0 0 to Inf - TRUE -
## Type len Def Constr Req Tunable Trafo
## ntree integer - 500 1 to Inf - TRUE -
## mtry integer - - 1 to Inf - TRUE -
## replace logical - TRUE - - TRUE -
## classwt numericvector <NA> - 0 to Inf - TRUE -
## cutoff numericvector <NA> - 0 to 1 - TRUE -
## strata untyped - - - - FALSE -
## sampsize integervector <NA> - 1 to Inf - TRUE -
## nodesize integer - 1 1 to Inf - TRUE -
## maxnodes integer - - 1 to Inf - TRUE -
## importance logical - FALSE - - TRUE -
## localImp logical - FALSE - - TRUE -
## proximity logical - FALSE - - FALSE -
## oob.prox logical - - - Y FALSE -
## norm.votes logical - TRUE - - FALSE -
## do.trace logical - FALSE - - FALSE -
## keep.forest logical - TRUE - - FALSE -
## keep.inbag logical - FALSE - - FALSE -
## Type len Def Constr Req
## k integer - 7 1 to Inf -
## distance numeric - 2 0 to Inf -
## kernel discrete - optimal rectangular,triangular,epanechnikov,b... -
## scale logical - TRUE - -
## Tunable Trafo
## k TRUE -
## distance TRUE -
## kernel TRUE -
## scale TRUE -
For naive bayes we fine-tuned the Laplace parameter testing values between 0 and 30.
For random forest we fine-tuned mtry testing values of 2, 3, 4 (the number of variables randomly sampled as candidates at each split). Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
## [Tune] Started tuning learner classif.naiveBayes for parameter set:
## Type len Def Constr Req Tunable Trafo
## laplace numeric - - 0 to 30 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: laplace=0
## [Tune-y] 1: mmce.test.mean=0.1923355; time: 0.0 min
## [Tune-x] 2: laplace=3.33
## [Tune-y] 2: mmce.test.mean=0.2077758; time: 0.0 min
## [Tune-x] 3: laplace=6.67
## [Tune-y] 3: mmce.test.mean=0.2039113; time: 0.0 min
## [Tune-x] 4: laplace=10
## [Tune-y] 4: mmce.test.mean=0.2039113; time: 0.0 min
## [Tune-x] 5: laplace=13.3
## [Tune-y] 5: mmce.test.mean=0.2019505; time: 0.0 min
## [Tune-x] 6: laplace=16.7
## [Tune-y] 6: mmce.test.mean=0.2038923; time: 0.0 min
## [Tune-x] 7: laplace=20
## [Tune-y] 7: mmce.test.mean=0.2019879; time: 0.0 min
## [Tune-x] 8: laplace=23.3
## [Tune-y] 8: mmce.test.mean=0.2019879; time: 0.0 min
## [Tune-x] 9: laplace=26.7
## [Tune-y] 9: mmce.test.mean=0.2000461; time: 0.0 min
## [Tune-x] 10: laplace=30
## [Tune-y] 10: mmce.test.mean=0.2000461; time: 0.0 min
## [Tune] Result: laplace=0 : mmce.test.mean=0.1923355
## [Tune] Started tuning learner classif.randomForest for parameter set:
## Type len Def Constr Req Tunable Trafo
## mtry discrete - - 2,3,4 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: mtry=2
## [Tune-y] 1: mmce.test.mean=0.1242576; time: 0.0 min
## [Tune-x] 2: mtry=3
## [Tune-y] 2: mmce.test.mean=0.1262184; time: 0.0 min
## [Tune-x] 3: mtry=4
## [Tune-y] 3: mmce.test.mean=0.1281605; time: 0.0 min
## [Tune] Result: mtry=2 : mmce.test.mean=0.1242576
## [Tune] Started tuning learner classif.kknn for parameter set:
## Type len Def Constr Req Tunable
## k discrete - - 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,... - TRUE
## Trafo
## k -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: k=2
## [Tune-y] 1: mmce.test.mean=0.1902605; time: 0.0 min
## [Tune-x] 2: k=3
## [Tune-y] 2: mmce.test.mean=0.1844162; time: 0.0 min
## [Tune-x] 3: k=4
## [Tune-y] 3: mmce.test.mean=0.1863580; time: 0.0 min
## [Tune-x] 4: k=5
## [Tune-y] 4: mmce.test.mean=0.1457120; time: 0.0 min
## [Tune-x] 5: k=6
## [Tune-y] 5: mmce.test.mean=0.1418281; time: 0.0 min
## [Tune-x] 6: k=7
## [Tune-y] 6: mmce.test.mean=0.1398864; time: 0.0 min
## [Tune-x] 7: k=8
## [Tune-y] 7: mmce.test.mean=0.1321384; time: 0.0 min
## [Tune-x] 8: k=9
## [Tune-y] 8: mmce.test.mean=0.1282549; time: 0.0 min
## [Tune-x] 9: k=10
## [Tune-y] 9: mmce.test.mean=0.1223916; time: 0.0 min
## [Tune-x] 10: k=11
## [Tune-y] 10: mmce.test.mean=0.1223726; time: 0.0 min
## [Tune-x] 11: k=12
## [Tune-y] 11: mmce.test.mean=0.1243143; time: 0.0 min
## [Tune-x] 12: k=13
## [Tune-y] 12: mmce.test.mean=0.1204118; time: 0.0 min
## [Tune-x] 13: k=14
## [Tune-y] 13: mmce.test.mean=0.1223535; time: 0.0 min
## [Tune-x] 14: k=15
## [Tune-y] 14: mmce.test.mean=0.1223158; time: 0.0 min
## [Tune-x] 15: k=16
## [Tune-y] 15: mmce.test.mean=0.1203927; time: 0.0 min
## [Tune-x] 16: k=17
## [Tune-y] 16: mmce.test.mean=0.1203927; time: 0.0 min
## [Tune-x] 17: k=18
## [Tune-y] 17: mmce.test.mean=0.1203927; time: 0.0 min
## [Tune-x] 18: k=19
## [Tune-y] 18: mmce.test.mean=0.1203927; time: 0.0 min
## [Tune-x] 19: k=20
## [Tune-y] 19: mmce.test.mean=0.1223345; time: 0.0 min
## [Tune] Result: k=17 : mmce.test.mean=0.1203927
Figure 30. Plot for the optimization of the threshold for the kNn, Naive Bayes and Random Forest classifiers trained on all 23 features.
Figure 31. AUC plot for the kNN, Naive Bayes and Random Forest classifers trained on all 23 features.
The AUC plots were similar for the kNN and Random Forest classifiers trained on all 23 features and were significantly better than the Naive Bayes classifier (Fig. 31).
| Naive Bayes | Random Forest | kNN | |
|---|---|---|---|
| mmce | 0.2117117 | 0.1486486 | 0.1711712 |
| auc | 0.5661376 | 0.6110309 | 0.5957993 |
The Raandom Forest classifier performed the best, when the models were trained on all 23 features, with a mean misclassification error of 0.149 and auv value of 0.611 (Table 5). While the KNN classifier performed slightly better than the Naive Bayes model.
## predicted
## true 0 1
## 0 167 22 tpr: 0.88 fnr: 0.12
## 1 25 8 fpr: 0.76 tnr: 0.24
## ppv: 0.87 for: 0.73 lrp: 1.17 acc: 0.79
## fdr: 0.13 npv: 0.27 lrm: 0.48 dor: 2.43
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## predicted
## true 0 1
## 0 188 1 tpr: 0.99 fnr: 0.01
## 1 32 1 fpr: 0.97 tnr: 0.03
## ppv: 0.85 for: 0.5 lrp: 1.03 acc: 0.85
## fdr: 0.15 npv: 0.5 lrm: 0.17 dor: 5.88
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## predicted
## true 0 1
## 0 183 6 tpr: 0.97 fnr: 0.03
## 1 32 1 fpr: 0.97 tnr: 0.03
## ppv: 0.85 for: 0.86 lrp: 1 acc: 0.83
## fdr: 0.15 npv: 0.14 lrm: 1.05 dor: 0.95
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
Various feature selection algorithms were then used to determine whether classifier performance could be improved by using a subset of relevant features. Features were initially selected using the spFRS feature selection algorithm (https://cran.r-project.org/web/packages/spFSR/vignettes/spFSR.html). Aksakalli and Malekipirbazai (2016)
## [1] 23
Figure 32. Measurement of kNN classifier performance with each iteration.
The plot shows mmce is dcreasing suggesting that the parameters can be optimized further.
Figure 33. kNN classifier performance for various values of k (nearest neighbours value).
The plot shows the global optimum and indicates that the best performance was obtained in about 15 iterations.
## $k
## [1] 15
## mmce.test.mean
## 0.1183613
The optimal value of k for k-Nearest Neighbours was 15 with a mean misclassification error of 0.118.
## Learner classif.kknn from package kknn
## Type: classif
## Name: k-Nearest Neighbor; Short name: kknn
## Class: classif.kknn
## Properties: twoclass,multiclass,numerics,factors,prob
## Predict-Type: prob
## Hyperparameters: k=15
## SPSA-FSR begins:
## Wrapper = kknn
## Measure = auc
## Number of selected features = 0
##
## iter value st.dev num.ft best.value
## 1 0.66652 0.08724 13 0.66652 *
## 2 0.67873 0.07876 15 0.67873 *
## 3 0.65068 0.0959 11 0.67873
## 4 0.69214 0.0815 12 0.69214 *
## 5 0.69351 0.05994 12 0.69351 *
## 6 0.70517 0.06063 14 0.70517 *
## 7 0.69378 0.09697 14 0.70517
## 8 0.70983 0.08844 14 0.70983 *
## 9 0.72798 0.07736 16 0.72798 *
## 10 0.69858 0.09898 17 0.72798
## 11 0.70915 0.09041 17 0.72798
## 12 0.6678 0.06979 16 0.72798
## 13 0.71376 0.07237 14 0.72798
## 14 0.69546 0.09024 16 0.72798
## 15 0.71057 0.11709 16 0.72798
## 16 0.72438 0.07684 15 0.72798
## 17 0.71411 0.08948 15 0.72798
## 18 0.71586 0.07034 15 0.72798
## 19 0.71477 0.07681 13 0.72798
## 20 0.67661 0.07696 14 0.72798
## 21 0.69783 0.0881 13 0.72798
## 22 0.70833 0.05328 13 0.72798
## 23 0.70509 0.0787 13 0.72798
## 24 0.70275 0.07714 13 0.72798
## 25 0.69688 0.0619 13 0.72798
## 26 0.69833 0.05764 13 0.72798
## 27 0.70386 0.04867 14 0.72798
## 28 0.70727 0.08117 14 0.72798
## 29 0.71183 0.05808 16 0.72798
##
## Best iteration = 9
## Number of selected features = 16
## Best measure value = 0.72798
## Std. dev. of best measure = 0.07736
## Run time = 0.83 minutes.
## SPSA-FSR begins:
## Wrapper = kknn
## Measure = auc
## Number of selected features = 0
##
## iter value st.dev num.ft best.value
## 1 0.68515 0.04869 14 0.68515 *
## 2 0.69978 0.07572 16 0.69978 *
## 3 0.67652 0.08389 15 0.69978
## 4 0.64775 0.0809 16 0.69978
## 5 0.67645 0.08247 15 0.69978
## 6 0.6575 0.05673 14 0.69978
## 7 0.62584 0.0885 16 0.69978
## 8 0.61483 0.08257 15 0.69978
## 9 0.70581 0.07028 14 0.70581 *
## 10 0.7031 0.06375 16 0.70581
## 11 0.69397 0.05313 15 0.70581
## 12 0.68069 0.0735 17 0.70581
## 13 0.69909 0.06205 12 0.70581
## 14 0.6537 0.102 17 0.70581
## 15 0.61696 0.09682 17 0.70581
## 16 0.66788 0.06852 17 0.70581
## 17 0.7006 0.06328 17 0.70581
## 18 0.68899 0.05968 17 0.70581
## 19 0.68768 0.06092 18 0.70581
## 20 0.69026 0.06691 16 0.70581
## 21 0.70981 0.06813 17 0.70981 *
## 22 0.71647 0.06136 17 0.71647 *
## 23 0.68643 0.07839 15 0.71647
## 24 0.71347 0.06054 15 0.71647
## 25 0.64659 0.07911 17 0.71647
## 26 0.6628 0.07949 17 0.71647
## 27 0.70956 0.08768 15 0.71647
## 28 0.7131 0.08943 14 0.71647
## 29 0.73374 0.05887 12 0.73374 *
## 30 0.71075 0.05638 12 0.73374
## 31 0.71202 0.06512 13 0.73374
## 32 0.71627 0.05382 12 0.73374
## 33 0.69943 0.09905 12 0.73374
## 34 0.71001 0.07922 11 0.73374
## 35 0.73724 0.04251 12 0.73724 *
## 36 0.73142 0.05276 12 0.73724
## 37 0.70352 0.06059 12 0.73724
## 38 0.71149 0.04781 12 0.73724
## 39 0.7304 0.06965 12 0.73724
## 40 0.72522 0.06263 12 0.73724
## 41 0.67496 0.08656 13 0.73724
## 42 0.69286 0.07266 14 0.73724
## 43 0.70409 0.07992 14 0.73724
## 44 0.67695 0.06096 15 0.73724
## 45 0.69201 0.07691 14 0.73724
## 46 0.71119 0.09074 14 0.73724
## 47 0.7261 0.05982 14 0.73724
## 48 0.71517 0.07737 14 0.73724
## 49 0.69535 0.08876 14 0.73724
## 50 0.70618 0.07714 14 0.73724
## 51 0.72119 0.08025 14 0.73724
## 52 0.70999 0.07289 15 0.73724
## 53 0.70839 0.06258 15 0.73724
## 54 0.71485 0.03701 16 0.73724
## 55 0.70488 0.08203 15 0.73724
##
## Best iteration = 35
## Number of selected features = 12
## Best measure value = 0.73724
## Std. dev. of best measure = 0.04251
## Run time = 1.3 minutes.
Figure 34. Scatter plot of mean accuracy rate for each iteration for the kNN classifier (left panel) with and (right panel) without hyperparameter tuning.
| Features | Importance | Features | Importance |
|---|---|---|---|
| Hormonal_Contraceptives | 0.65736 | STDs_HIV | 1.00000 |
| First_sexual_intercourse | 0.58572 | Dx_HPV | 1.00000 |
| Smokes_years | 0.57904 | Num_sexual_partners | 0.97502 |
| STDs_syphilis | 0.57561 | STDs_syphilis | 0.95608 |
| Dx_HPV | 0.57364 | Dx | 0.87179 |
| Num_pregnancies | 0.56038 | Dx_CIN | 0.86142 |
| STDs_Num_diagnosis | 0.55085 | Dx_Cancer | 0.81635 |
| IUD | 0.54519 | Age | 0.81357 |
| Num_sexual_partners | 0.54189 | STDs | 0.78772 |
| Smokes_packs_year | 0.54100 | STDs_Num_diagnosis | 0.76885 |
The top 10 features selected by the spFSR algorithm for the kNN classified with and without tuned parameters differed significantly indicating that tuning the hyperparameters for the kNN classifier made a significant difference.
Figures 35. Features selected by the spFRS algorithm fused with the kNN classifier and ranked according to importance.
spsa_kNN_1 <- spFSR::plotImportance(spsaMod_kNN_tuned)
Figures 36. Features selected by the spFRS algorithm fused with the kNN classifier with tuned hyperparameters and ranked according to importance.
Figure 37. AUC plot for the kNN classifer. kNN - kNN trained on all 23 features; kNN_spsa - kNN classifier trained on features selected using the spFRS algorithm; kNN_spsa_tuned - kNN classifier with tuned hyperparameters and trained on features selected using the spFRS algorithm.
The AUC plots were similar for the kNN classifier without optimization of the hyperparameters and with or without spFRS feature selection while these were both better than the kNN classifier without optimization of the hyperparameters or feature selection (Fig. 37).
| kNN | kNN_spsa | kNN_spsa_tuned | |
|---|---|---|---|
| mmce | 0.1711712 | 0.1756757 | 0.1666667 |
| auc | 0.5910694 | 0.5715889 | 0.5603656 |
The kNN classifier appeared to performed better with hyperparameter optimization and spFRS feature selection with a mean misclassification error of 0.167 and AUC value of 0.560 (Table 6).
## predicted
## true 0 1
## 0 183 6 tpr: 0.97 fnr: 0.03
## 1 32 1 fpr: 0.97 tnr: 0.03
## ppv: 0.85 for: 0.86 lrp: 1 acc: 0.83
## fdr: 0.15 npv: 0.14 lrm: 1.05 dor: 0.95
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## predicted
## true 0 1
## 0 182 7 tpr: 0.96 fnr: 0.04
## 1 32 1 fpr: 0.97 tnr: 0.03
## ppv: 0.85 for: 0.88 lrp: 0.99 acc: 0.82
## fdr: 0.15 npv: 0.12 lrm: 1.22 dor: 0.81
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## predicted
## true 0 1
## 0 182 7 tpr: 0.96 fnr: 0.04
## 1 30 3 fpr: 0.91 tnr: 0.09
## ppv: 0.86 for: 0.7 lrp: 1.06 acc: 0.83
## fdr: 0.14 npv: 0.3 lrm: 0.41 dor: 2.6
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## [1] 23
Figure 38. The plot shows that the hyperparameters can be optimized further.
Figure 39. Optimisation of mtry hyperparameter for the random forest classifier.
## $mtry
## [1] 3
## mmce.test.mean fpr.test.mean tpr.test.mean
## 0.1203221 0.9371795 0.9912088
The optimal of mtry was 3.0 with mean misclassification error of 0.120.
## SPSA-FSR begins:
## Wrapper = rf
## Measure = auc
## Number of selected features = 0
##
## iter value st.dev num.ft best.value
## 1 0.64314 0.08416 15 0.64314 *
## 2 0.64139 0.10711 15 0.64314
## 3 0.67602 0.06895 17 0.67602 *
## 4 0.64849 0.05507 13 0.67602
## 5 0.63961 0.08148 14 0.67602
## 6 0.63942 0.09523 17 0.67602
## 7 0.65477 0.08259 19 0.67602
## 8 0.62347 0.07968 17 0.67602
## 9 0.6476 0.08387 17 0.67602
## 10 0.62405 0.07156 16 0.67602
## 11 0.62688 0.07349 17 0.67602
## 12 0.62629 0.08162 17 0.67602
## 13 0.62632 0.09866 13 0.67602
## 14 0.65441 0.08219 14 0.67602
## 15 0.65822 0.06125 14 0.67602
## 16 0.65953 0.0636 14 0.67602
## 17 0.65521 0.0948 14 0.67602
## 18 0.65495 0.08406 14 0.67602
## 19 0.66435 0.07867 13 0.67602
## 20 0.6391 0.07816 15 0.67602
## 21 0.65495 0.06118 14 0.67602
## 22 0.65651 0.10895 14 0.67602
## 23 0.68126 0.07005 14 0.68126 *
## 24 0.66469 0.08364 14 0.68126
## 25 0.66549 0.08543 10 0.68126
## 26 0.63901 0.09187 10 0.68126
## 27 0.6705 0.07947 11 0.68126
## 28 0.6377 0.05653 13 0.68126
## 29 0.6766 0.08201 16 0.68126
## 30 0.66546 0.0693 16 0.68126
## 31 0.66464 0.09615 15 0.68126
## 32 0.68795 0.0841 16 0.68795 *
## 33 0.66952 0.09 15 0.68795
## 34 0.67959 0.06398 16 0.68795
## 35 0.66812 0.06954 16 0.68795
## 36 0.6759 0.05619 16 0.68795
## 37 0.68707 0.06011 16 0.68795
## 38 0.66229 0.03893 16 0.68795
## 39 0.67342 0.05064 14 0.68795
## 40 0.67957 0.08111 15 0.68795
## 41 0.67762 0.04853 15 0.68795
## 42 0.68319 0.041 15 0.68795
## 43 0.67822 0.08528 15 0.68795
## 44 0.67881 0.06689 15 0.68795
## 45 0.69086 0.08792 15 0.69086 *
## 46 0.67832 0.0948 15 0.69086
## 47 0.68419 0.07576 15 0.69086
## 48 0.68725 0.05082 15 0.69086
## 49 0.68282 0.08796 15 0.69086
## 50 0.67882 0.06557 15 0.69086
## 51 0.67906 0.07365 14 0.69086
## 52 0.67158 0.06444 14 0.69086
## 53 0.67754 0.09124 15 0.69086
## 54 0.67708 0.06279 15 0.69086
## 55 0.67912 0.08474 15 0.69086
## 56 0.68566 0.08206 15 0.69086
## 57 0.68026 0.08441 15 0.69086
## 58 0.67235 0.0693 15 0.69086
## 59 0.67183 0.08446 15 0.69086
## 60 0.66647 0.09128 15 0.69086
## 61 0.69201 0.04497 15 0.69201 *
## 62 0.67567 0.0677 15 0.69201
## 63 0.67719 0.09386 15 0.69201
## 64 0.68454 0.05488 14 0.69201
## 65 0.654 0.06834 15 0.69201
## 66 0.68736 0.05111 15 0.69201
## 67 0.69187 0.06954 16 0.69201
## 68 0.67455 0.07959 16 0.69201
## 69 0.68666 0.06106 16 0.69201
## 70 0.67714 0.12461 15 0.69201
## 71 0.681 0.07434 15 0.69201
## 72 0.6665 0.0747 15 0.69201
## 73 0.68652 0.09277 15 0.69201
## 74 0.67308 0.06382 14 0.69201
## 75 0.69196 0.08934 14 0.69201
## 76 0.67998 0.07061 13 0.69201
## 77 0.66781 0.08619 14 0.69201
## 78 0.67279 0.08624 14 0.69201
## 79 0.67488 0.08931 13 0.69201
## 80 0.67878 0.07789 14 0.69201
## 81 0.69187 0.07456 13 0.69201
##
## Best iteration = 61
## Number of selected features = 15
## Best measure value = 0.69201
## Std. dev. of best measure = 0.04497
## Run time = 10.87 minutes.
## SPSA-FSR begins:
## Wrapper = rf
## Measure = auc
## Number of selected features = 0
##
## iter value st.dev num.ft best.value
## 1 0.67558 0.08358 10 0.67558 *
## 2 0.70587 0.08111 10 0.70587 *
## 3 0.68345 0.09892 13 0.70587
## 4 0.67496 0.08472 14 0.70587
## 5 0.66672 0.07142 14 0.70587
## 6 0.64457 0.08661 14 0.70587
## 7 0.66417 0.06918 14 0.70587
## 8 0.66072 0.07851 14 0.70587
## 9 0.64308 0.07793 15 0.70587
## 10 0.69176 0.06428 13 0.70587
## 11 0.68294 0.08954 14 0.70587
## 12 0.65077 0.06557 16 0.70587
## 13 0.66067 0.0576 14 0.70587
## 14 0.66732 0.07525 14 0.70587
## 15 0.66406 0.06911 13 0.70587
## 16 0.67817 0.05951 14 0.70587
## 17 0.67513 0.08613 14 0.70587
## 18 0.6698 0.10313 15 0.70587
## 19 0.66953 0.05755 15 0.70587
## 20 0.69188 0.05346 14 0.70587
## 21 0.68372 0.08046 14 0.70587
## 22 0.69421 0.06528 16 0.70587
##
## Best iteration = 2
## Number of selected features = 10
## Best measure value = 0.70587
## Std. dev. of best measure = 0.08111
## Run time = 3 minutes.
Figure 39. Scatter plots of mean accuracy rate for each iteration for the random forest classifier (left panel) with and (right panel) without hyperparameter tuning.
| Features | Importance | Features | Importance |
|---|---|---|---|
| STDs_number | 0.50537 | Smokes_years | 0.87184 |
| First_sexual_intercourse | 0.50531 | IUD_years | 0.76944 |
| STDs_syphilis | 0.50341 | Dx_HPV | 0.76921 |
| Num_sexual_partners | 0.50288 | STDs | 0.76637 |
| Num_pregnancies | 0.50288 | Smokes | 0.74595 |
| STDs_condylomatosis | 0.50282 | Dx_CIN | 0.73105 |
| Smokes_years | 0.50276 | Age | 0.70937 |
| Hormonal_Contraceptives | 0.50276 | IUD | 0.67608 |
| Dx_HPV | 0.50257 | Num_pregnancies | 0.65964 |
| Smokes | 0.50058 | Dx | 0.65137 |
The top 10 features selected using the spFRS algorithm differed significantly for the random forest classifier with and without hyperparameter optimization and features selection (Table 7).
Figures 40. Features selected by the spFRS algorithm fused with the random forest classifier and ranked according to importance.
spFSR::plotImportance(spsaMod_rf_tuned)
Figures 41. Features selected by the spFRS algorithm fused with the random forest classifier with optimized hyperparameters and ranked according to importance.
Figure 42. AUC plot for the random forest classifer. RF - Random forest classifier trained on all 23 features; RF_spsa - Random forest classifier trained on features selected using the spFRS algorithm; RF_spsa_tuned - Random forest classifier with tuned hyperparameters and trained on features selected using the spFRS algorithm.
RF <- performance(pred_on_test_rf, measures = list(mmce, auc))
RF_spsa <- performance(pred_on_test_spsa_rf, measures = list(mmce, auc))
RF_spsa_tuned <- performance(pred_on_test_spsa_rf_tuned, measures = list(mmce, auc))
data_RF <- data.frame(RF, RF_spsa, RF_spsa_tuned)
kable(data_RF, caption = "Table 8. Performance for random forest classifier with and without tuned hyperparameters and spFRS feature selection") %>% kable_styling(full_width = F, font_size = 12)
| RF | RF_spsa | RF_spsa_tuned | |
|---|---|---|---|
| mmce | 0.1531532 | 0.1666667 | 0.1531532 |
| auc | 0.6507937 | 0.6556037 | 0.5962803 |
The random forest classifier with spFRS feature selection (but nmo hyperparameter optimization) performed the best with a mean misclassification error of 0.153 and an AUC value of 0.601 (Table 8).
## predicted
## true 0 1
## 0 187 2 tpr: 0.99 fnr: 0.01
## 1 32 1 fpr: 0.97 tnr: 0.03
## ppv: 0.85 for: 0.67 lrp: 1.02 acc: 0.85
## fdr: 0.15 npv: 0.33 lrm: 0.35 dor: 2.92
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## predicted
## true 0 1
## 0 184 5 tpr: 0.97 fnr: 0.03
## 1 32 1 fpr: 0.97 tnr: 0.03
## ppv: 0.85 for: 0.83 lrp: 1 acc: 0.83
## fdr: 0.15 npv: 0.17 lrm: 0.87 dor: 1.15
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
## predicted
## true 0 1
## 0 186 3 tpr: 0.98 fnr: 0.02
## 1 31 2 fpr: 0.94 tnr: 0.06
## ppv: 0.86 for: 0.6 lrp: 1.05 acc: 0.85
## fdr: 0.14 npv: 0.4 lrm: 0.26 dor: 4
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
Filter methods assign an importance to each feature. The feature is ranked according to importance value resulting in a feature subset. Create an object named mfv by calling generateFilterValuesData from mlr on classif.task and using the filter method randomForest.importance.
## Supervised task: dataCervicalCancer
## Type: regr
## Target: Cancer
## Observations: 737
## Features:
## numerics factors ordered functionals
## 10 12 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
Figure 43. Features selectered sing a filter selection algorithm in mlr (https://mlr-org.github.io/mlr/articles/tutorial/devel/feature_selection.html).(Bischl et al. 2016)
We can also compare this with other filter methods to identify consistencies in the selected features (i.e. information.gain and chi.squared) using the interactive mode plotFilterValuesGGVIS(FV) (Fig. 43).
Figure 44. Features selectered using a filter selection algorithm in mlr based on (Left panel) information gain and (right panel) chi squared (https://mlr-org.github.io/mlr/articles/tutorial/devel/feature_selection.html).(Bischl et al. 2016)
Using these 2 filters DX_Cancer was consistently the most important feature while IUD_years was the least important.
We now ‘fused’ the random forest classification learner with the information.gain filter to train the model.
The optimal percentage of features to keep was determined by 5-fold cross-validation. We use ‘information gain’ as an importance measure and select the 10 features with highest importance. In each resampling iteration feature selection is carried out on the corresponding training data set before fitting the learner.
## [Tune] Started tuning learner classif.randomForest.filtered for parameter set:
## Type len Def Constr Req Tunable Trafo
## fw.abs discrete - - 10 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: fw.abs=10
## [Tune-y] 1: mmce.test.mean=0.1342894; time: 0.0 min
## [Tune] Result: fw.abs=10 : mmce.test.mean=0.1342894
The optimal percentage and corresponding misclassification error are:
## $fw.abs
## [1] 10
## mmce.test.mean
## 0.1342894
We can now fuse it with fw percentage by “wrapper” the random forest learner with the chi-squared method before training the model:
Now applied getFilteredFeatures on the trained model to view the selected features.
Used a random search with ten iterations on the random forest classifier and classif.task.
## mmce.test.mean
## 0.120359
| Filter | Wrapper |
|---|---|
| IUD | First_sexual_intercourse |
| STDs | Smokes_years |
| STDs_condylomatosis | IUD_years |
| STDs_vulvo_perineal_condylomatosis | IUD |
| STDs_syphilis | STDs_condylomatosis |
| STDs_HIV | STDs_vulvo_perineal_condylomatosis |
| Dx_Cancer | STDs_syphilis |
| Dx_CIN | Dx_CIN |
| Dx_HPV | NA |
| Dx | NA |
The wrapper method selected fewer features and 2 of these were also selected by the filter method (Table 9).
By comparing the misclassification error rates, a random search wrapper method out performed the chi squared (filtered) method. We then fused the wrapper method in a learnerusing makeFeatSelWrapper together with makeFeatSelControlRandom and makeResampleDesc objects.
Obtain the confusion matrix by running calculateConfusionMatrix(pred_on_test) and get the ROC.
## predicted
## true 0 1 -err.-
## 0 184 5 5
## 1 31 2 31
## -err.- 31 5 36
Figure 22. AUC plot for the random forest classifer. RF - Random forest classifier trained on all 23 features; RF_spsa_tuned - Random forest classifier with tuned hyperparameters and trained on features selected using the spFRS algorithm; RF_wrapper - Random forest classifier with tned parameters and trained on features selected using the selectFeatures mlr algorithm.
The AUC plot show that the random forest classifier with spFRS feature selection performed the best.
| x | |
|---|---|
| mmce | 0.1621622 |
| auc | 0.6056598 |
The random forest classifier using the wrapper method for feature selection had a mean misclassification error of 0.158 and AUC value of 0.644 (Table 10).
## predicted
## true 0 1
## 0 184 5 tpr: 0.97 fnr: 0.03
## 1 31 2 fpr: 0.94 tnr: 0.06
## ppv: 0.86 for: 0.71 lrp: 1.04 acc: 0.84
## fdr: 0.14 npv: 0.29 lrm: 0.44 dor: 2.37
##
##
## Abbreviations:
## tpr - True positive rate (Sensitivity, Recall)
## fpr - False positive rate (Fall-out)
## fnr - False negative rate (Miss rate)
## tnr - True negative rate (Specificity)
## ppv - Positive predictive value (Precision)
## for - False omission rate
## lrp - Positive likelihood ratio (LR+)
## fdr - False discovery rate
## npv - Negative predictive value
## acc - Accuracy
## lrm - Negative likelihood ratio (LR-)
## dor - Diagnostic odds ratio
The best results were obtained for the Random Forest classifier with spFRS feature selection. This classifier had a precision of 0.85 and recall of 0.99 with a mmce of 0.153 and AUC value of 0.601. The next best performer was also Random Forest classifier with the feature selection algorithm provided by the CARAT R package. Features assocated with IUDs and STDs were consistently choses as beingof importance by the various feature selector algorithms. Overall the Random Forest classifier has performed reasonably well for the identifying women exhibiting clinical sign of cervical cancer. Although the AUC values suggest that the model can be improved.
In future studies other supervised machine learning algorithms such as Support Vector Machines (SVM) or ensemble methods could be tested to determine whether they may perform better than those trialed in these studies.
The random forest classifiers performed reasonably well at identifying cervical cancer patients. However, some improvement is required before these predictors can be used in a clinical setting, as several diseased patients were not identified in these studies.
Aksakalli, Vural, and Milad Malekipirbazai. 2016. “Feature Selection via Binary Simultaneous Perturbation Stochastic Approximation.” Pattern Rceognition Letters.
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research.