Consider the Cervical Cancer (Risk Factors) data set (available from UCI repository) and exploit the data using association rules (for binary variables) and clustering (for continuous variables).
Below is the code used to load and prepare the dataset (click the ‘Code’ button).
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
tidy.opts = list(width.cutoff = 60),
tidy = TRUE
)
library(readr)
library(arules)
library(dplyr)
library(plotly)
if (require("conflicted")) {
conflict_prefer("filter", "dplyr")
}
dataset <- read_csv("risk_factors_cervical_cancer.csv", na = "?")
# coerce it to data.frame, for simplicity
dataset <- data.frame(dataset)
columns <- c(
"age",
"partners",
"sexarca",
"pregnancies",
"smokes",
"smokes_years",
"smokers_pack_year",
"ohc",
"ohc_years",
"iud",
"iud_years",
"std",
"std_num",
"std_condyloma",
"std_cervical",
"std_vaginal",
"std_vulvo_perit",
"std_syphilis",
"std_pid",
"std_herpes",
"std_molluscum",
"std_aids",
"std_hiv",
"std_hepb",
"std_hpv",
"std_num_of_diag",
"std_time_since_first",
"std_time_since_last",
"dx_cancer",
"dx_cin",
"dx_hpv",
"dx",
"hinselmann",
"schiller",
"citology",
"biopsy"
)
names(dataset) <- columns
# ugly way to move the dependant variable to the end:
dataset <- cbind(dataset[, -which(columns == "dx_cancer")],
dx_cancer = dataset$dx_cancer
)
dataset$smokes <- factor(dataset$smokes,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$ohc <- factor(dataset$ohc,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$iud <- factor(dataset$iud,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std <- factor(dataset$std,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_condyloma <- factor(dataset$std_condyloma,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_cervical <- factor(dataset$std_cervical,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_vaginal <- factor(dataset$std_vaginal,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_vulvo_perit <- factor(dataset$std_vulvo_perit,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_syphilis <- factor(dataset$std_syphilis,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_pid <- factor(dataset$std_pid,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_herpes <- factor(dataset$std_herpes,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_molluscum <- factor(dataset$std_molluscum,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_aids <- factor(dataset$std_aids,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_hiv <- factor(dataset$std_hiv,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_hepb <- factor(dataset$std_hepb,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$std_hpv <- factor(dataset$std_hpv,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$dx_cancer <- factor(dataset$dx_cancer,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$dx_cin <- factor(dataset$dx_cin,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$dx_hpv <- factor(dataset$dx_hpv,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$dx <- factor(dataset$dx,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$hinselmann <- factor(dataset$hinselmann,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$schiller <- factor(dataset$schiller,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$citology <- factor(dataset$citology,
levels = c(1, 0),
labels = c("yes", "no")
)
dataset$biopsy <- factor(dataset$biopsy,
levels = c(1, 0),
labels = c("yes", "no")
)The association rules algorithm used is the apriori() from arules package.
This dataset contains several binary variables and some numeric variables like age that were discretized using the discretize function.
Some variables were not selected due to the following reasons:
std: it is the superset of the std_* variables, so it is redundant if we want to use primarily the subsets.dx: it is also a superset, and what we will try to find are some associations with dx_cancer.dx_cin: this has been removed since it is the negation of dx_cancer, so it is an obvious (dis)association.dx_hpv: has been removed due to its known association with dx_cancer.The dx_cancer variable has only about 2% of cases with the value yes. So in this case, the apriori algorithm has a hard time to find rules for these cases. There is investigation for new methods of finding rules in unbalanced datasets(Gu et al. 2003), but here we will just use a very low support, and look at the rules with higher confidence (this may be unfeasible in large datasets). Just to have it in mind, we could think in upsample or downsample our dataset. The rules would be found, but biased, and the metrics of support, confidence and lift would be meaningless.
For the discretization process, there is a known algorithm to find the best cut points, but to keep it simple, in this case, the cut points were set based in some reasoning, like ‘decades’ for age, whole numbers to partners and pregnancies, and ‘frequency’ for sexarca.
Below is the code used for the association rules algorithm (click the ‘Code’ button).
rules_dataset <- dataset %>%
select(
age, partners, sexarca,
pregnancies, smokes, ohc, iud,
std_condyloma:std_hpv, hinselmann:dx_cancer
) %>%
mutate(
age = discretize(age, method = "fixed", breaks = c(seq(0, 80, by = 10), Inf)),
partners = discretize(partners, method = "fixed", breaks = c(0, 1, 2, 3, Inf)),
sexarca = discretize(sexarca, method = "frequency", breaks = 5),
pregnancies = discretize(pregnancies, method = "fixed", breaks = c(0, 1, 2, 3, Inf))
)
transactions <- as(rules_dataset, "transactions")
rules <- apriori(transactions, parameter = list(
supp = 0.003, conf = 0.6, target = "rules",
minlen = 1, maxlen = 7
), control = list(verbose = FALSE))
rules.sub <- subset(rules, subset = rhs %in% "dx_cancer=yes")The algorithm found a set of 47568273 rules, and those with the RHS (right-hand side) equals to dx_cancer=yes were filtered and sorted by ‘confidence’ and ‘support’. The count ranged from 3 to 5 (from a dataset with 858 rows).
insp <- as_tibble(DATAFRAME(head(rules.sub, n = 20, by = c("confidence", "support")), setStart = "", setEnd = ""))
rownames(insp) <- NULL
insp <- insp %>% select(LHS, confidence, count, support, lift)
inspWe can see above some interesting rules that may be counter intuitive: almost all rules either have biopsy or citology equals to no, and in some way is associated with dx_cancer=yes. Being aware that the variable dx_hpv was removed, we may think: how the diagnose was made? It seems also that the number of partners and either using ohc or iud may also be related with dx_cancer=yes.
The clustering algorithm used is the kmeans() from stats package.
Below is the code used for the clustering algorithm (click the ‘Code’ button).
The following variables were not tested due to high amount of NA values: std_time_since_first, std_time_since_last.
For simplicity, rows containing NA values were removed.
As a result, we see that the plot shown below is not very informative. Several combinations of variables were tried, within a range of 3 to 7 variables simultaneously, and evaluated from 2D and 3D perspective, and no clear ‘cluster’ was found to be visually appealing.
Several variables are ‘time-dependent’, as we may say: ‘pregnancies’ may only increase with time; ‘partners’ may only increase with time; ‘std_num’ may increase with time… as logically, ‘age’ increases with time. So these variables frequently show a positive correlation with ‘age’ and maybe a reason for being difficult to find clusters. One can argue that for example ‘death rate’ may have 2-3 clusters (i.e.: <5 yo; >50 yo), but here it seems not to be the case. A simple transformation with the scale() function was also tried, but the results were not better.
Further work can be done with this dataset, perhaps using PCA, SCA, or other data transformation before searching for clusters.
dataset_cluster <- dataset %>%
select(
std_num, age, smokers_pack_year
# age, partners, pregnancies
# smokes_years, smokers_pack_year,
# ohc_years, iud_years, std_num, std_num_of_diag
# too much missings, std_time_since_first, std_time_since_last
) %>%
tidyr::drop_na()
k <- 2
# dataset_cluster <- scale(dataset_cluster)
# set.seed(lubridate::now())
# idxs <- sample(seq_len(ncol(dataset_cluster)), 3)
set.seed(2020)
clust <- kmeans(dataset_cluster[, idxs], centers = k)
fig <- plot_ly(
x = ~ dataset_cluster[, idxs[1]],
y = ~ dataset_cluster[, idxs[2]],
z = ~ dataset_cluster[, idxs[3]],
color = as.factor(clust$cluster)
)
fig <- fig %>% add_markers()
fig <- fig %>% plotly::layout(
title = "Clustering",
scene = list(
camera = list(eye = list(x = -1.55, y = -1.55, z = 0.5)),
xaxis = list(title = names(dataset_cluster)[idxs[1]]),
yaxis = list(title = names(dataset_cluster)[idxs[2]]),
zaxis = list(title = names(dataset_cluster)[idxs[3]])
))
figGu, Lifang, Jiuyong Li, Hongxing He, Graham Williams, Simon Hawkins, and Chris Kelman. 2003. “Association rule discovery with unbalanced class distributions.” Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2903: 221–32. https://doi.org/10.1007/978-3-540-24581-0_19.