library(arules)
library(dplyr)
library(ggplot2)
library(plotly)
library(rpart)
library(rpart.plot)
library(caret)
Loading required package: lattice
From Pima Indians Diabetes Database.
Meaning of each rows are
raw <- read.csv("diabetes.csv", header = T)
raw
What are predictive and protective factors for the outcomes (of whether having diabtes or not)?
We should see first, which columns and pair of columns might have clustering.
Histogram graph is also available on Kaggle.
ggplotly(ggplot(raw) + geom_histogram(aes(Glucose)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Single count/continuous variables here usually have Bell curve distribution, with zero values, probably in place of NA’s. Such columns include
I also plotted using scatter plots, hoping more associations will be shown. (Yes, I tried to perform all 9C2.)
Therefore, as for now, I see no real clustering, but there need to be cleaning for zero values.
First, I check how many rows have complete data.
raw_fixed <- raw %>%
filter(Insulin !=0 & Glucose != 0 & BloodPressure != 0 & SkinThickness != 0)
raw_fixed
And take a peak at the data.
ggplotly(ggplot(raw_fixed) + geom_histogram(aes(Outcome)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
data <- discretizeDF(raw_fixed)
The calculated breaks are: 0, 0, 0, 1
Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
data
We can see that outcome is broken; therefore, we should fix it.
data$Outcome <- as.factor(raw_fixed$Outcome)
data
ts <- as(data, "transactions")
Test out Apriori rules. We aim for ones with high confidence. High or low support has different meanings.
as(
apriori(ts,
parameter = list(sup = 0.1, conf = 0.9, target="rules"),
appearance = list(rhs = c("Outcome=0", "Outcome=1"))
),
"data.frame"
) %>% arrange(desc(confidence))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 39
set item appearances ...[2 item(s)] done [0.00s].
set transactions ...[26 item(s), 393 transaction(s)] done [0.00s].
sorting and recoding items ... [26 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [41 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Lower glucose, pregnancies, BMI, and insulin are important protective factors. However, no pregnancies doesn’t predict better outcome.
Therefore, I plot the data just to be sure.
ggplotly(ggplot(raw_fixed %>% filter(Outcome == 0)) + geom_histogram(aes(Pregnancies)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(ggplot(raw_fixed %>% filter(Outcome == 1)) + geom_histogram(aes(Pregnancies)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What about outcome = 1?
as(
apriori(ts,
parameter = list(sup = 0.1, conf = 0.6, target="rules"),
appearance = list(rhs = c("Outcome=1"))
),
"data.frame"
) %>% arrange(desc(support))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 39
set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[26 item(s), 393 transaction(s)] done [0.00s].
sorting and recoding items ... [26 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [6 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
It is harder to find predictive factors for diabetes with confidence. I have to decrease confidence to at least 0.7 to get some results.
Important predictive factors are higher blood glucose level, insulin, more age, and multiparity.
We want to see lower supports as well.
as(
apriori(ts,
parameter = list(sup = 0.03, conf = 0.7, maxlen = 4, target="rules"),
appearance = list(rhs = c("Outcome=0", "Outcome=1"))
),
"data.frame"
) %>% filter(support < 0.1) %>% arrange(desc(lift))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 11
set item appearances ...[2 item(s)] done [0.00s].
set transactions ...[26 item(s), 393 transaction(s)] done [0.00s].
sorting and recoding items ... [26 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4
Mining stopped (maxlen reached). Only patterns up to a length of 4 returned!
done [0.00s].
writing ... [517 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Rarer ones with decent counts still include higher blood glucose level, insulin, more age, and multiparity, which are predictive of poorer outcomes. An additional factor here is diabetes pedigree function.
What’s the distribution of diabetes pedigree function, BTW?
ggplotly(ggplot(raw_fixed) + geom_histogram(aes(DiabetesPedigreeFunction)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Therefore, more diabetes pedigree function values are predictive of diabetes, but the data supporting this is rarer.
As Outcome=0 is undermined, I have to run this again for Outcome=0.
as(
apriori(ts,
parameter = list(sup = 0.03, conf = 0.9, maxlen = 4, target="rules"),
appearance = list(rhs = c("Outcome=0"))
),
"data.frame"
) %>% filter(support < 0.1) %>% arrange(desc(lift))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 11
set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[26 item(s), 393 transaction(s)] done [0.00s].
sorting and recoding items ... [26 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4
Mining stopped (maxlen reached). Only patterns up to a length of 4 returned!
done [0.00s].
writing ... [193 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Younger ages less than 25, lower insulin, lower skin thickness, and lower glucose levels less than 105, are protective factors. But not much so for DiabetesPedigreeFunction < 0.325.
ggplotly(ggplot(raw_fixed) + geom_histogram(aes(SkinThickness)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We train the decision tree with 80% of the data, and 20% for checking.
n80 <- nrow(raw_fixed) * 0.8
training <- head(raw_fixed, n80)
training
And use the remaining 20% for proofing
Now, we create a decision tree with training data.
model <- rpart(Outcome~., data = training, method = "class",
control=rpart.control(minsplit=20, minbucket=10, cp=0.01))
prp(model)
And test the training result.
testdata$Outcome <- as.factor(testdata$Outcome)
predicted <- predict(model, testdata[, -9], type = "class")
actual <- testdata[, 9]
comp <- data.frame(predicted, actual)
comp
And, confusion matrix.
confusionMatrix(comp$predicted, comp$actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 46 7
1 8 18
Accuracy : 0.8101
95% CI : (0.7062, 0.8897)
No Information Rate : 0.6835
P-Value [Acc > NIR] : 0.008608
Kappa : 0.5658
Mcnemar's Test P-Value : 1.000000
Sensitivity : 0.8519
Specificity : 0.7200
Pos Pred Value : 0.8679
Neg Pred Value : 0.6923
Prevalence : 0.6835
Detection Rate : 0.5823
Detection Prevalence : 0.6709
Balanced Accuracy : 0.7859
'Positive' Class : 0
Both PPV and NPV are quite good, of 0.87 and 0.70.
I think, the generated model is valid for both positive and negative prediction, as well as not too complex.