The goal of this project is to explore which combinations of health-related variables are associated with the presence and with the absence of diabetes in patients. The data come from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/529/early+stage+diabetes+risk+prediction+dataset). They were collected in 2020 at the Sylhet Diabetes Hospital in Bangladesh as 520 questionnaires filled in by patients and approved by the supervising physician. The dataset records the presence of various health conditions and symptoms potentially associated with diabetes. The variables are:
Age (integer)
Gender (Categorical)
The remaining variables are binary:
Polyuria (excessive or abnormally large production or passage of urine)
Polydipsia (excessive thirst leading to high fluid intake)
Sudden weight loss
Weakness
Polyphagia (insatiable hunger often leading to overeating)
Genital thrush (irritation, rash and redness in the genitalia)
Visual blurring
Itching
Irritability
Delayed healing
Partial paresis (weakened voluntary muscle movement, not complete loss of motion)
Muscle stiffness
Alopecia (balding)
Obesity
Class (Target variable): “Positive” for the presence of diabetes, “Negative” in case of absence.
Association rule mining is an unsupervised learning technique used to discover frequently co-occurring patterns; here it is performed with the Apriori algorithm. A rule is expressed as 𝑋 ⇒ 𝑌, where 𝑋 (the antecedent) and 𝑌 (the consequent) are disjoint itemsets. The quality of a rule is evaluated using three standard measures: support, which quantifies how frequently the itemset occurs; confidence, which is the conditional probability of observing 𝑌 given 𝑋; and lift, which measures the strength of the association relative to statistical independence. A lift value greater than one indicates that the antecedent and consequent co-occur more frequently than would be expected if they were independent.
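The three measures can be illustrated with a minimal base-R sketch. The five transactions below are made up for illustration and are not taken from the Sylhet data; the helper `support()` is likewise a hypothetical name, not a function from `arules`.

```r
# Toy transactions (hypothetical data, used only to show the arithmetic)
transactions <- list(
  c("Polyuria", "Polydipsia", "Positive"),
  c("Polyuria", "Positive"),
  c("Polydipsia", "Positive"),
  c("Polyuria", "Polydipsia", "Positive"),
  c("Negative")
)

# Support: share of transactions that contain every item of the itemset
support <- function(items) {
  mean(sapply(transactions, function(tr) all(items %in% tr)))
}

X <- c("Polyuria", "Polydipsia")  # antecedent
Y <- "Positive"                   # consequent

rule_support    <- support(c(X, Y))             # P(X and Y)
rule_confidence <- rule_support / support(X)    # P(Y | X)
rule_lift       <- rule_confidence / support(Y) # P(Y | X) / P(Y)

c(support = rule_support, confidence = rule_confidence, lift = rule_lift)
```

In this toy set the rule has support 0.4, confidence 1 and lift 1.25: every transaction containing both symptoms is positive, but since 80% of all transactions are positive anyway, the rule improves on the base rate by only a quarter.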
#Loading necessary libraries and data
library(arules)
library(dplyr)
library(arulesViz)
data <- read.csv("data/diabetes_data_upload.csv", sep=",")
#Data preview
head(data, 5)
## Age Gender Polyuria Polydipsia sudden.weight.loss weakness Polyphagia
## 1 40 Male No Yes No Yes No
## 2 58 Male No No No Yes No
## 3 41 Male Yes No No Yes Yes
## 4 45 Male No No Yes Yes Yes
## 5 60 Male Yes Yes Yes Yes Yes
## Genital.thrush visual.blurring Itching Irritability delayed.healing
## 1 No No Yes No Yes
## 2 No Yes No No No
## 3 No No Yes No Yes
## 4 Yes No Yes No Yes
## 5 No Yes Yes Yes Yes
## partial.paresis muscle.stiffness Alopecia Obesity class
## 1 No Yes Yes Yes Positive
## 2 Yes No Yes No Positive
## 3 No Yes Yes No Positive
## 4 No No No No Positive
## 5 Yes Yes Yes Yes Positive
colSums(is.na(data))
## Age Gender Polyuria Polydipsia
## 0 0 0 0
## sudden.weight.loss weakness Polyphagia Genital.thrush
## 0 0 0 0
## visual.blurring Itching Irritability delayed.healing
## 0 0 0 0
## partial.paresis muscle.stiffness Alopecia Obesity
## 0 0 0 0
## class
## 0
# None of the columns has missing values.
str(data)
## 'data.frame': 520 obs. of 17 variables:
## $ Age : int 40 58 41 45 60 55 57 66 67 70 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Polyuria : chr "No" "No" "Yes" "No" ...
## $ Polydipsia : chr "Yes" "No" "No" "No" ...
## $ sudden.weight.loss: chr "No" "No" "No" "Yes" ...
## $ weakness : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Polyphagia : chr "No" "No" "Yes" "Yes" ...
## $ Genital.thrush : chr "No" "No" "No" "Yes" ...
## $ visual.blurring : chr "No" "Yes" "No" "No" ...
## $ Itching : chr "Yes" "No" "Yes" "Yes" ...
## $ Irritability : chr "No" "No" "No" "No" ...
## $ delayed.healing : chr "Yes" "No" "Yes" "Yes" ...
## $ partial.paresis : chr "No" "Yes" "No" "No" ...
## $ muscle.stiffness : chr "Yes" "No" "Yes" "No" ...
## $ Alopecia : chr "Yes" "Yes" "Yes" "No" ...
## $ Obesity : chr "Yes" "No" "No" "No" ...
## $ class : chr "Positive" "Positive" "Positive" "Positive" ...
min(data$Age)
## [1] 16
max(data$Age)
## [1] 90
Except for ‘Age’, all variables can be converted to ‘factor’ directly: Gender takes the values ‘Male’ and ‘Female’, the symptoms are recorded as ‘Yes’ or ‘No’, and class is either ‘Positive’ or ‘Negative’.
To make the data suitable for association rule mining, Age has to be converted into interpretable age brackets so that it can be included in the transactional representation. The dataset then becomes similar in structure to a set of ‘transactions’, for which association rules can be explored.
# Create the Age_group variable, then remove Age and convert all variables to factors.
data <- data %>%
  mutate(
    Age_group = cut(
      Age,
      breaks = c(15, 20, 30, 40, 50, 60, 70, Inf),
      labels = c("16-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71+"),
      right = TRUE
    )
  ) %>%
  select(-Age) %>%
  mutate(across(everything(), as.factor))
str(data)
## 'data.frame': 520 obs. of 17 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Polyuria : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 2 2 1 ...
## $ Polydipsia : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 2 2 2 2 2 ...
## $ sudden.weight.loss: Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 2 1 2 ...
## $ weakness : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Polyphagia : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 1 2 2 ...
## $ Genital.thrush : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 2 1 ...
## $ visual.blurring : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 1 2 ...
## $ Itching : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 1 2 2 2 ...
## $ Irritability : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 2 2 ...
## $ delayed.healing : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 1 1 1 ...
## $ partial.paresis : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 2 2 1 ...
## $ muscle.stiffness : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 1 2 2 1 ...
## $ Alopecia : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 2 1 1 1 2 ...
## $ Obesity : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 2 1 1 2 1 ...
## $ class : Factor w/ 2 levels "Negative","Positive": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age_group : Factor w/ 7 levels "16-20","21-30",..: 3 5 4 4 5 5 5 6 6 6 ...
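How the age brackets treat boundary values can be checked with a small base-R sketch. The ages below are illustrative values chosen to sit on or next to the break points; the `breaks`, `labels` and `right` arguments are the ones used in the transformation above.

```r
# Ages on or next to the bracket boundaries (illustrative values)
ages <- c(16, 20, 21, 40, 41, 70, 71, 90)

age_group <- cut(
  ages,
  breaks = c(15, 20, 30, 40, 50, 60, 70, Inf),
  labels = c("16-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71+"),
  right = TRUE  # intervals are closed on the right: (15,20], (20,30], ...
)

data.frame(ages, age_group)
```

With `right = TRUE`, an age of exactly 20 falls into "16-20" and an age of exactly 40 into "31-40", so the labels match the intervals they describe.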
With these changes applied, the whole dataset is converted so that each observation becomes an itemset (transaction) suitable for pattern mining.
transactional <- as(data, "transactions")
inspect(transactional[1:3])
## items transactionID
## [1] {Gender=Male,
## Polyuria=No,
## Polydipsia=Yes,
## sudden.weight.loss=No,
## weakness=Yes,
## Polyphagia=No,
## Genital.thrush=No,
## visual.blurring=No,
## Itching=Yes,
## Irritability=No,
## delayed.healing=Yes,
## partial.paresis=No,
## muscle.stiffness=Yes,
## Alopecia=Yes,
## Obesity=Yes,
## class=Positive,
## Age_group=31-40} 1
## [2] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## sudden.weight.loss=No,
## weakness=Yes,
## Polyphagia=No,
## Genital.thrush=No,
## visual.blurring=Yes,
## Itching=No,
## Irritability=No,
## delayed.healing=No,
## partial.paresis=Yes,
## muscle.stiffness=No,
## Alopecia=Yes,
## Obesity=No,
## class=Positive,
## Age_group=51-60} 2
## [3] {Gender=Male,
## Polyuria=Yes,
## Polydipsia=No,
## sudden.weight.loss=No,
## weakness=Yes,
## Polyphagia=Yes,
## Genital.thrush=No,
## visual.blurring=No,
## Itching=Yes,
## Irritability=No,
## delayed.healing=Yes,
## partial.paresis=No,
## muscle.stiffness=Yes,
## Alopecia=Yes,
## Obesity=No,
## class=Positive,
## Age_group=41-50} 3
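The naming of the items can be seen on a much smaller example. The sketch below assumes the `arules` package is installed, as elsewhere in this report; the two-patient data frame `toy` is hypothetical.

```r
library(arules)

# Hypothetical two-patient data frame with factor columns
toy <- data.frame(
  Gender   = factor(c("Male", "Female")),
  Polyuria = factor(c("Yes", "No"))
)

# Coerce the data frame to the transactions class
toy_trans <- as(toy, "transactions")

# Each factor level becomes one item named "<column>=<level>"
itemLabels(toy_trans)
```

Because every level becomes its own item, the absence of a symptom (for example `Polyuria=No`) is an item in its own right and can appear in rules just like its presence.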
itemFrequency(transactional, type="relative")
## Gender=Female Gender=Male Polyuria=No
## 0.369230769 0.630769231 0.503846154
## Polyuria=Yes Polydipsia=No Polydipsia=Yes
## 0.496153846 0.551923077 0.448076923
## sudden.weight.loss=No sudden.weight.loss=Yes weakness=No
## 0.582692308 0.417307692 0.413461538
## weakness=Yes Polyphagia=No Polyphagia=Yes
## 0.586538462 0.544230769 0.455769231
## Genital.thrush=No Genital.thrush=Yes visual.blurring=No
## 0.776923077 0.223076923 0.551923077
## visual.blurring=Yes Itching=No Itching=Yes
## 0.448076923 0.513461538 0.486538462
## Irritability=No Irritability=Yes delayed.healing=No
## 0.757692308 0.242307692 0.540384615
## delayed.healing=Yes partial.paresis=No partial.paresis=Yes
## 0.459615385 0.569230769 0.430769231
## muscle.stiffness=No muscle.stiffness=Yes Alopecia=No
## 0.625000000 0.375000000 0.655769231
## Alopecia=Yes Obesity=No Obesity=Yes
## 0.344230769 0.830769231 0.169230769
## class=Negative class=Positive Age_group=16-20
## 0.384615385 0.615384615 0.001923077
## Age_group=21-30 Age_group=31-40 Age_group=41-50
## 0.084615385 0.236538462 0.278846154
## Age_group=51-60 Age_group=61-70 Age_group=71+
## 0.244230769 0.126923077 0.026923077
Most patients in the data are male (63%). The symptom frequencies look promising for the analysis: most symptoms are reported by roughly half of the patients, the exceptions being genital thrush, irritability, muscle stiffness, alopecia and obesity, which are less common. About 62% of the recorded patients have diabetes, and about 76% of patients fall in the 31-60 age range.
itemFrequencyPlot(transactional, topN = 20)
In this chapter, association rules are extracted. ‘Class’ is the target variable, with ‘Positive’ and ‘Negative’ values indicating the presence and absence of diabetes, respectively. Rules are therefore mined separately for each of the two values, in order to outline potential patient profiles in both cases. The algorithm parameters are a minimum support of 0.05, a minimum confidence of 0.6 (to focus on reliable rules) and a minimum rule length of 3 items, counting both sides of the rule.
positive_diabetes_rules <- apriori(
  transactional,
  parameter = list(support = 0.05, confidence = 0.6, minlen = 3),
  appearance = list(rhs = "class=Positive", default = "lhs")
)
The best-performing rules by lift, confidence and support are inspected below.
inspect(sort(positive_diabetes_rules, by = "lift")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Polydipsia=Yes,
## Age_group=61-70} => {class=Positive} 0.06346154 1 0.06346154 1.625 33
## [2] {Genital.thrush=Yes,
## Irritability=Yes} => {class=Positive} 0.08269231 1 0.08269231 1.625 43
## [3] {Genital.thrush=Yes,
## muscle.stiffness=Yes} => {class=Positive} 0.06346154 1 0.06346154 1.625 33
## [4] {Polydipsia=Yes,
## Genital.thrush=Yes} => {class=Positive} 0.10576923 1 0.10576923 1.625 55
## [5] {Genital.thrush=Yes,
## visual.blurring=Yes} => {class=Positive} 0.06923077 1 0.06923077 1.625 36
inspect(sort(positive_diabetes_rules, by = "confidence")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Polydipsia=Yes,
## Age_group=61-70} => {class=Positive} 0.06346154 1 0.06346154 1.625 33
## [2] {Genital.thrush=Yes,
## Irritability=Yes} => {class=Positive} 0.08269231 1 0.08269231 1.625 43
## [3] {Genital.thrush=Yes,
## muscle.stiffness=Yes} => {class=Positive} 0.06346154 1 0.06346154 1.625 33
## [4] {Polydipsia=Yes,
## Genital.thrush=Yes} => {class=Positive} 0.10576923 1 0.10576923 1.625 55
## [5] {Genital.thrush=Yes,
## visual.blurring=Yes} => {class=Positive} 0.06923077 1 0.06923077 1.625 36
inspect(sort(positive_diabetes_rules, by = "support")[1:5])
## lhs rhs support confidence
## [1] {Alopecia=No, Obesity=No} => {class=Positive} 0.3846154 0.6993007
## [2] {Polyuria=Yes, Obesity=No} => {class=Positive} 0.3750000 0.9653465
## [3] {Genital.thrush=No, Alopecia=No} => {class=Positive} 0.3750000 0.6818182
## [4] {Polyuria=Yes, Polydipsia=Yes} => {class=Positive} 0.3711538 1.0000000
## [5] {Polyuria=Yes, Alopecia=No} => {class=Positive} 0.3596154 1.0000000
## coverage lift count
## [1] 0.5500000 1.136364 200
## [2] 0.3884615 1.568688 195
## [3] 0.5500000 1.107955 195
## [4] 0.3711538 1.625000 193
## [5] 0.3596154 1.625000 187
The top lift values are not extreme: none exceeds 1.625. Because the consequent is fixed, lift is simply confidence divided by the share of positive patients, so sorting by lift and by confidence returns the same rules, and 1.625 is the value reached by any rule with confidence 1. The top rules by confidence are nonetheless striking. For instance, all 33 patients in the 61-70 age group who report polydipsia are diabetic. The top rules by support show that polyuria and polydipsia, i.e. excessive urination and excessive thirst, appear in the most frequent high-confidence rules: all 193 patients reporting both symptoms are diabetic.
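The upper bound on lift follows directly from its definition, lift = confidence / support(consequent), and can be checked with a few lines of arithmetic. The class shares are taken from the `itemFrequency()` output above; the variable names are only illustrative.

```r
# Class shares taken from the itemFrequency() output
p_positive <- 0.615384615
p_negative <- 0.384615385

# Confidence cannot exceed 1, so lift cannot exceed 1 / support(consequent)
max_lift_positive <- 1 / p_positive  # about 1.625
max_lift_negative <- 1 / p_negative  # about 2.6

# Cross-check against a reported rule:
# {Polyuria=Yes, Obesity=No} => {class=Positive}, confidence 0.9653465
lift_check <- 0.9653465 / p_positive # about 1.5687, matching the reported lift

c(max_lift_positive, max_lift_negative, lift_check)
```

A lift of 1.625 for a positive-class rule therefore only says that its confidence is 1; rules for the rarer negative class can reach a lift of up to 2.6 for the same reason.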
It may also be helpful to visualize the association rules.
As the plot shows, there is an abundance of rules with high support (>0.2), confidence (>0.9) and lift (>1.4). The best-performing rules are explored further below.
With all rules included, the plots are hard to read, because the initial mining produced a large number of rules. To focus the interpretation on clinically meaningful and stable associations, the final analysis is restricted to high-strength rules satisfying support > 0.30, confidence > 0.90, and lift > 1.50.
positive_rules_strong <- subset(
  positive_diabetes_rules,
  subset = support > 0.3 & confidence > 0.9 & lift > 1.5
)
summary(positive_rules_strong)
## set of 15 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4
## 14 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 3.000 3.067 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.3038 Min. :0.9253 Min. :0.3173 Min. :1.504
## 1st Qu.:0.3096 1st Qu.:0.9567 1st Qu.:0.3240 1st Qu.:1.555
## Median :0.3365 Median :0.9634 Median :0.3442 Median :1.565
## Mean :0.3340 Mean :0.9657 Mean :0.3458 Mean :1.569
## 3rd Qu.:0.3500 3rd Qu.:0.9767 3rd Qu.:0.3635 3rd Qu.:1.587
## Max. :0.3750 Max. :1.0000 Max. :0.3885 Max. :1.625
## count
## Min. :158.0
## 1st Qu.:161.0
## Median :175.0
## Mean :173.7
## 3rd Qu.:182.0
## Max. :195.0
##
## mining info:
## data ntransactions support confidence
## transactional 520 0.05 0.6
## call
## apriori(data = transactional, parameter = list(support = 0.05, confidence = 0.6, minlen = 3), appearance = list(rhs = "class=Positive", default = "lhs"))
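The effect of the three thresholds can be reproduced on a plain data frame. The sketch below copies four rules from the top-by-support output shown earlier into a hypothetical data frame `rules_df` and applies the same conditions in base R; it illustrates the filter logic and is not a replacement for `subset()` on the rules object.

```r
# Four rules copied from the inspect() output sorted by support
rules_df <- data.frame(
  lhs = c("{Alopecia=No, Obesity=No}",
          "{Polyuria=Yes, Obesity=No}",
          "{Genital.thrush=No, Alopecia=No}",
          "{Polyuria=Yes, Polydipsia=Yes}"),
  support    = c(0.3846154, 0.3750000, 0.3750000, 0.3711538),
  confidence = c(0.6993007, 0.9653465, 0.6818182, 1.0000000),
  lift       = c(1.136364, 1.568688, 1.107955, 1.625000),
  stringsAsFactors = FALSE
)

# Same thresholds as in the subset() call: all three conditions must hold
strong <- rules_df[rules_df$support > 0.3 &
                     rules_df$confidence > 0.9 &
                     rules_df$lift > 1.5, ]
strong$lhs
```

The two rules built only on absent symptoms pass the support threshold but fail on confidence and lift, which is why frequent yet weak rules of this kind do not appear among the 15 strong rules.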
After keeping only the strongest rules, 15 rules remain, so the plots should reveal more meaningful information.
inspect(sort(positive_rules_strong, by = "support"))
## lhs rhs support confidence coverage lift count
## [1] {Polyuria=Yes,
## Obesity=No} => {class=Positive} 0.3750000 0.9653465 0.3884615 1.568688 195
## [2] {Polyuria=Yes,
## Polydipsia=Yes} => {class=Positive} 0.3711538 1.0000000 0.3711538 1.625000 193
## [3] {Polyuria=Yes,
## Alopecia=No} => {class=Positive} 0.3596154 1.0000000 0.3596154 1.625000 187
## [4] {Polydipsia=Yes,
## Alopecia=No} => {class=Positive} 0.3538462 0.9633508 0.3673077 1.565445 184
## [5] {Polyuria=Yes,
## Genital.thrush=No} => {class=Positive} 0.3461538 0.9424084 0.3673077 1.531414 180
## [6] {Polyuria=Yes,
## weakness=Yes} => {class=Positive} 0.3423077 0.9621622 0.3557692 1.563514 178
## [7] {Polydipsia=Yes,
## Obesity=No} => {class=Positive} 0.3403846 0.9619565 0.3538462 1.563179 177
## [8] {Polydipsia=Yes,
## weakness=Yes} => {class=Positive} 0.3365385 0.9776536 0.3442308 1.588687 175
## [9] {Polydipsia=Yes,
## Genital.thrush=No} => {class=Positive} 0.3269231 0.9550562 0.3423077 1.551966 170
## [10] {Polyuria=Yes,
## Polydipsia=Yes,
## Alopecia=No} => {class=Positive} 0.3192308 1.0000000 0.3192308 1.625000 166
## [11] {Polyuria=Yes,
## sudden.weight.loss=Yes} => {class=Positive} 0.3096154 0.9757576 0.3173077 1.585606 161
## [12] {Polyuria=Yes,
## partial.paresis=Yes} => {class=Positive} 0.3096154 0.9583333 0.3230769 1.557292 161
## [13] {partial.paresis=Yes,
## Alopecia=No} => {class=Positive} 0.3096154 0.9252874 0.3346154 1.503592 161
## [14] {Gender=Female,
## Alopecia=No} => {class=Positive} 0.3057692 0.9636364 0.3173077 1.565909 159
## [15] {Polyuria=Yes,
## Irritability=No} => {class=Positive} 0.3038462 0.9349112 0.3250000 1.519231 158
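The columns of this table are tied together by simple identities, which a short base-R check makes explicit for rule [1]. The numbers are read from the table above; the variable names are illustrative.

```r
n <- 520  # number of transactions (questionnaires)

# Rule [1]: {Polyuria=Yes, Obesity=No} => {class=Positive}
count    <- 195        # patients matching antecedent and consequent
coverage <- 0.3884615  # share of patients matching the antecedent

support    <- count / n            # share matching the whole rule
confidence <- support / coverage   # share of antecedent matches that are positive
lhs_count  <- round(coverage * n)  # patients matching the antecedent

c(support = support, confidence = confidence, lhs_count = lhs_count)
```

Read this way, 202 patients have polyuria without obesity and 195 of them are diabetic, so the confidence of about 0.965 corresponds to 7 exceptions.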
The strongest rules are dominated by polyuria and polydipsia: 13 of the 15 rules contain at least one of them. This suggests that excessive urination or thirst, combined with almost any other symptom (or even its absence), forms the central pattern for diabetes-positive patients in the dataset.
In combination with polyuria or polydipsia, weakness, sudden weight loss and partial paresis are also common characteristics of confirmed diabetics.
negative_diabetes_rules <- apriori(
  transactional,
  parameter = list(support = 0.05, confidence = 0.6, minlen = 3),
  appearance = list(rhs = "class=Negative", default = "lhs")
)
Once again, only the best performing rules will be examined.
inspect(sort(negative_diabetes_rules, by = "lift")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Polyuria=No,
## visual.blurring=No,
## Age_group=21-30} => {class=Negative} 0.05769231 1 0.05769231 2.6 30
## [2] {Gender=Male,
## Polyuria=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
## [3] {Gender=Male,
## delayed.healing=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
## [4] {Gender=Male,
## Polydipsia=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
## [5] {Gender=Male,
## Genital.thrush=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
inspect(sort(negative_diabetes_rules, by = "confidence")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Polyuria=No,
## visual.blurring=No,
## Age_group=21-30} => {class=Negative} 0.05769231 1 0.05769231 2.6 30
## [2] {Gender=Male,
## Polyuria=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
## [3] {Gender=Male,
## delayed.healing=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
## [4] {Gender=Male,
## Polydipsia=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
## [5] {Gender=Male,
## Genital.thrush=No,
## Age_group=21-30} => {class=Negative} 0.05576923 1 0.05576923 2.6 29
inspect(sort(negative_diabetes_rules, by = "support")[1:5])
## lhs rhs support confidence
## [1] {Polyuria=No, Polydipsia=No} => {class=Negative} 0.3403846 0.7972973
## [2] {Polydipsia=No, Irritability=No} => {class=Negative} 0.3384615 0.7333333
## [3] {Polyuria=No, Irritability=No} => {class=Negative} 0.3326923 0.7688889
## [4] {Gender=Male, Polydipsia=No} => {class=Negative} 0.3326923 0.7863636
## [5] {Gender=Male, Polyuria=No} => {class=Negative} 0.3192308 0.8341709
## coverage lift count
## [1] 0.4269231 2.072973 177
## [2] 0.4615385 1.906667 176
## [3] 0.4326923 1.999111 173
## [4] 0.4230769 2.044545 173
## [5] 0.3826923 2.168844 166
Lift for the top rules of non-diabetics is substantially higher than for the top rules of diabetics discussed earlier. This is largely a consequence of the lower share of non-diabetics (38.5%): a rule with confidence 1 reaches a lift of 1/0.385 = 2.6. All of the top rules by confidence involve young people from the 21-30 age bracket. The top rules by support show an expected pattern: with a support of 0.34 and a confidence of 0.80, most patients with neither polyuria nor polydipsia are diabetes-negative. This further highlights the importance of these two variables.
Because there are fewer non-diabetics, the strongest rules for this subgroup are selected with a lower support threshold (0.25) and a higher lift threshold (2.2), while the confidence threshold is unchanged.
negative_rules_strong <- subset(
  negative_diabetes_rules,
  subset = support > 0.25 & confidence > 0.9 & lift > 2.2
)
summary(negative_rules_strong)
## set of 14 rules
##
## rule length distribution (lhs + rhs):sizes
## 4 5 6
## 2 10 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 5 5 5 5 6
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.2519 Min. :0.9048 Min. :0.2635 Min. :2.352
## 1st Qu.:0.2572 1st Qu.:0.9244 1st Qu.:0.2769 1st Qu.:2.403
## Median :0.2635 Median :0.9293 Median :0.2846 Median :2.416
## Mean :0.2695 Mean :0.9339 Mean :0.2887 Mean :2.428
## 3rd Qu.:0.2750 3rd Qu.:0.9462 3rd Qu.:0.2942 3rd Qu.:2.460
## Max. :0.3038 Max. :0.9708 Max. :0.3288 Max. :2.524
## count
## Min. :131.0
## 1st Qu.:133.8
## Median :137.0
## Mean :140.1
## 3rd Qu.:143.0
## Max. :158.0
##
## mining info:
## data ntransactions support confidence
## transactional 520 0.05 0.6
## call
## apriori(data = transactional, parameter = list(support = 0.05, confidence = 0.6, minlen = 3), appearance = list(rhs = "class=Negative", default = "lhs"))
The strongest association rules for non-diabetics are inspected below.
inspect(sort(negative_rules_strong, by = "support"))
## lhs rhs support confidence coverage lift count
## [1] {Gender=Male,
## Polyuria=No,
## Polydipsia=No} => {class=Negative} 0.3038462 0.9239766 0.3288462 2.402339 158
## [2] {Gender=Male,
## Polyuria=No,
## Irritability=No} => {class=Negative} 0.2980769 0.9117647 0.3269231 2.370588 155
## [3] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## Irritability=No} => {class=Negative} 0.2826923 0.9607843 0.2942308 2.498039 147
## [4] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## Obesity=No} => {class=Negative} 0.2750000 0.9346405 0.2942308 2.430065 143
## [5] {Gender=Male,
## Polyuria=No,
## Irritability=No,
## Obesity=No} => {class=Negative} 0.2750000 0.9346405 0.2942308 2.430065 143
## [6] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## partial.paresis=No} => {class=Negative} 0.2673077 0.9328859 0.2865385 2.425503 139
## [7] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## sudden.weight.loss=No} => {class=Negative} 0.2634615 0.9256757 0.2846154 2.406757 137
## [8] {Gender=Male,
## Polyuria=No,
## Irritability=No,
## partial.paresis=No} => {class=Negative} 0.2634615 0.9256757 0.2846154 2.406757 137
## [9] {Gender=Male,
## Polydipsia=No,
## Irritability=No,
## partial.paresis=No} => {class=Negative} 0.2634615 0.9256757 0.2846154 2.406757 137
## [10] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## Irritability=No,
## Obesity=No} => {class=Negative} 0.2615385 0.9577465 0.2730769 2.490141 136
## [11] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## Genital.thrush=No} => {class=Negative} 0.2557692 0.9500000 0.2692308 2.470000 133
## [12] {Gender=Male,
## Polydipsia=No,
## sudden.weight.loss=No,
## Irritability=No} => {class=Negative} 0.2557692 0.9047619 0.2826923 2.352381 133
## [13] {Gender=Male,
## Polyuria=No,
## Polydipsia=No,
## Irritability=No,
## partial.paresis=No} => {class=Negative} 0.2557692 0.9708029 0.2634615 2.524088 133
## [14] {Gender=Male,
## Polyuria=No,
## sudden.weight.loss=No,
## Irritability=No} => {class=Negative} 0.2519231 0.9160839 0.2750000 2.381818 131
Every one of the strongest rules for the diabetes-negative class includes male gender together with the absence of polyuria or polydipsia: 12 of the 14 rules contain Polyuria=No and 10 contain Polydipsia=No. This suggests that the lack of these two classic diabetes-related symptoms forms the central pattern for diabetes-negative patients in the dataset.
In combination with the absence of polyuria and/or polydipsia, additional “low-risk” indicators such as the absence of irritability, absence of sudden weight loss, absence of partial paresis, and absence of obesity frequently appear in high-confidence rules, further reinforcing the typical profile of analyzed non-diabetic individuals.
These results are almost a mirror image of the profile of diabetic patients.
The main goal of this project was to explore patient questionnaires and identify which health conditions and symptoms (or their absence) appear in patient profiles, both with and without diabetes. To this end, association rule mining, an unsupervised learning technique, was applied using the Apriori algorithm, after the data had been transformed into a suitable transactional form.
First, association rules whose consequent was the presence of diabetes were mined. The strongest rules were then explored to describe the typical patient profile for that case. The same process was followed for non-diabetics. As for the results, polyuria and polydipsia (excessive urination and excessive thirst) appear to be the most important symptoms: almost all patients who reported them were diabetic, and most of those who reported neither were non-diabetic. The same can be said about the combination of polyuria or polydipsia with sudden weight loss or partial paresis. Younger patients, especially those aged 21-30, also tended to be non-diabetic.
While the above conclusions can be considered medically plausible, the following ones are most likely dataset-specific: the strongest rules predicting the negative class consistently involved males, while among the strongest rules predicting the positive class, the only one that included the gender variable involved females. This is notable given that males form the majority of the dataset.