Introduction

Doctors may sometimes be too subjective in their judgements: they are only human and draw mainly on their own experience. Clinical decision-making traditionally relies heavily on individual physician experience and pattern recognition. Human judgment can and should be enhanced by data-driven insights, especially when dealing with novel diseases or high-volume triage (quick assessment of patients to identify their need for further care), where time and accuracy are critical.

This project is a small contribution to that goal: I analyze association rules between patients’ symptoms and a positive COVID-19 test. During the recent pandemic, when COVID-19 cases were rising rapidly and uncontrollably, rules for identifying the highest-risk patients could have been useful.

Association rule mining is an unsupervised machine learning method used to discover interesting relationships and patterns between variables in large databases. It identifies rules of the form “If X, then Y”, where X and Y are sets of items that frequently appear together. Association rules are most commonly used in market basket analysis to find items frequently bought together, but they have other useful applications, as this project shows.

Key measures of association rules include:

  • Support - how frequently does this set (X and Y) appear together in the data? % of rows where the rule appears.
  • Confidence - how reliable is the rule? The probability that Y appears when X does.
  • Lift - how much more likely Y is to appear when X appears compared to Y appearing independently.

A lift of 5 means that Y is 5 times more likely to appear when X is in the row than when we have no information about X: P(Y|X) = 5 × P(Y). Confidence is just the conditional probability P(Y|X), but association rule mining also checks whether the pattern is frequent (support) and meaningful (lift).
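
As a minimal sketch of how the three measures relate, the base R snippet below rebuilds two logical vectors from the fever-by-result counts reported in the exploratory analysis of this report (1444, 956, 719 and 1881 patients) and computes the measures for the rule {fever} -> {positive COVID-19}. The helper name rule_measures is hypothetical and is not part of arules.

```r
# Hypothetical helper: support, confidence and lift for a rule X -> Y,
# given one logical value per patient for X and for Y.
rule_measures <- function(x, y) {
  support    <- mean(x & y)           # P(X and Y)
  confidence <- support / mean(x)     # P(Y | X)
  lift       <- confidence / mean(y)  # P(Y | X) / P(Y)
  c(support = support, confidence = confidence, lift = lift)
}

# Vectors rebuilt from the fever x covid_result counts in this report:
# negative/no fever, negative/fever, positive/no fever, positive/fever.
counts   <- c(1444, 956, 719, 1881)
fever    <- rep(c(FALSE, TRUE, FALSE, TRUE), times = counts)
positive <- rep(c(FALSE, FALSE, TRUE, TRUE), times = counts)

m <- rule_measures(fever, positive)
round(m, 4)
```

Here support is 0.3762, confidence is about 0.663 and lift is about 1.275. A lift only slightly above 1 suggests that fever on its own raises the chance of a positive result modestly, which is one reason to look at combinations of items.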

Symptom sets with high lift, support and confidence would make prioritizing testing/isolating patients easier and the whole process more effective.

Installation and loading necessary packages

library(readr)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(arules)
library(arulesViz)
library(arulesCBA)
library(knitr)
library(caret)

Exploratory data analysis

The dataset used in this project comes from Kaggle: https://www.kaggle.com/datasets/miadul/covid-19-patient-symptoms-and-diagnosis-dataset. It contains 5 000 patient cases with:

  • Demographic data: age and gender.
  • Symptom data: fever, dry cough, sore throat, fatigue, headache, shortness of breath, loss of smell, loss of taste, chest pain.
  • Simple clinical data: oxygen level, body temperature.
  • Medical history and exposure: comorbidity (Asthma, Diabetes, Heart Disease or None), contact with patient (other COVID-19 case) and travel history (whether the patient traveled before testing).
data <- read_csv("covid19_patient_symptoms_diagnosis.csv")
data[!complete.cases(data), ] # There is no missing data.
# Number of positive and negative diagnoses:
nrow(data[data$covid_result == 1, ])
## [1] 2600
nrow(data[data$covid_result == 0, ])
## [1] 2400
# Frequency of symptoms in the database divided by result, gender:
symptoms <- c("fever", "dry_cough", "sore_throat", "fatigue", 
                  "headache", "shortness_of_breath", "loss_of_smell", "loss_of_taste", "chest_pain")
data %>%
  group_by(covid_result) %>%
  summarise(across(all_of(symptoms), mean))
## # A tibble: 2 × 10
##   covid_result fever dry_cough sore_throat fatigue headache shortness_of_breath
##          <dbl> <dbl>     <dbl>       <dbl>   <dbl>    <dbl>               <dbl>
## 1            0 0.398     0.313       0.419   0.586    0.439               0.190
## 2            1 0.723     0.660       0.413   0.593    0.45                0.494
## # ℹ 3 more variables: loss_of_smell <dbl>, loss_of_taste <dbl>,
## #   chest_pain <dbl>
data %>%
  group_by(gender) %>%
  summarise(across(all_of(symptoms), mean))
## # A tibble: 2 × 10
##   gender fever dry_cough sore_throat fatigue headache shortness_of_breath
##   <chr>  <dbl>     <dbl>       <dbl>   <dbl>    <dbl>               <dbl>
## 1 Female 0.564     0.482       0.408   0.580    0.440               0.353
## 2 Male   0.571     0.504       0.424   0.600    0.450               0.344
## # ℹ 3 more variables: loss_of_smell <dbl>, loss_of_taste <dbl>,
## #   chest_pain <dbl>
data %>%
  group_by(covid_result, gender) %>%
  summarise(n = n(), .groups = "drop")
## # A tibble: 4 × 3
##   covid_result gender     n
##          <dbl> <chr>  <int>
## 1            0 Female  1220
## 2            0 Male    1180
## 3            1 Female  1294
## 4            1 Male    1306
# Age of patients by diagnosis:

hist(data$age[data$covid_result == 1],
     main = "Age Distribution of COVID-Positive Patients",
     xlab = "Age",
     breaks = 10)

hist(data$age[data$covid_result == 0],
     main = "Age Distribution of COVID-Negative Patients",
     xlab = "Age",
     breaks = 10)

# Oxygen level by diagnosis:
boxplot(data$oxygen_level ~ data$covid_result, names = c("Negative", "Positive"))

# Data issues: 
# Body temperature by diagnosis:
boxplot(data$body_temperature ~ data$covid_result, names = c("Negative", "Positive"))

# Fever symptom vs body temperature:
nrow(data[data$fever == 1, ])
## [1] 2837
nrow(data[data$fever == 0, ])
## [1] 2163
boxplot(data$body_temperature ~ data$fever)

data %>%
  group_by(covid_result, fever) %>%
  summarise(n = n(), .groups = "drop")
## # A tibble: 4 × 3
##   covid_result fever     n
##          <dbl> <dbl> <int>
## 1            0     0  1444
## 2            0     1   956
## 3            1     0   719
## 4            1     1  1881

Unfortunately, this is a synthetic dataset and, as we can see, body temperature probably wasn’t generated correctly. Its distribution stays the same across test results and, more importantly, across the fever symptom, which is illogical. For the purpose of this project body temperature is therefore excluded. Synthetic datasets may be flawed in different ways than real datasets, so careful examination of the data is advised in every case.
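
The boxplot impression can be backed by a number. The sketch below uses a small made-up data frame (toy values, not the Kaggle data) and compares the median body temperature of patients with and without the fever flag; the helper name temperature_gap is an assumption. Applied to the real data, a gap near zero would support the conclusion that the column was generated independently of the fever symptom.

```r
# Hypothetical helper: median temperature of fever == 1 minus fever == 0.
temperature_gap <- function(df) {
  med <- tapply(df$body_temperature, df$fever, median)
  unname(med[["1"]] - med[["0"]])
}

# Toy data for illustration only.
toy <- data.frame(
  fever            = c(0, 0, 0, 1, 1, 1),
  body_temperature = c(36.6, 37.9, 38.4, 36.5, 38.0, 38.3)
)

gap <- temperature_gap(toy)
gap  # a small gap: the flag barely tracks temperature in the toy data
```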

data <- subset(data, select = -c(body_temperature))
quantile(data$age)
##   0%  25%  50%  75% 100% 
##    1   22   44   66   89
quantile(data$oxygen_level)
##   0%  25%  50%  75% 100% 
##   85   88   92   96   99
# Comorbidity distribution:
data_positive <- data[data$covid_result == 1, ]
comorbidity_positive <- table(data_positive$comorbidity)
percentage_positive <- round(100 * comorbidity_positive / sum(comorbidity_positive), 1)
labels_positive <- paste0(names(comorbidity_positive), "\n", percentage_positive, "%")

data_negative <- data[data$covid_result == 0, ]
comorbidity_negative <- table(data_negative$comorbidity)
percentage_negative <- round(100 * comorbidity_negative / sum(comorbidity_negative), 1)
labels_negative <- paste0(names(comorbidity_negative), "\n", percentage_negative, "%")

pie(comorbidity_positive, labels = labels_positive)

pie(comorbidity_negative, labels = labels_negative)

data %>%
  group_by(covid_result, travel_history) %>%
  summarise(n = n(), .groups = "drop")
## # A tibble: 4 × 3
##   covid_result travel_history     n
##          <dbl>          <dbl> <int>
## 1            0              0  1814
## 2            0              1   586
## 3            1              0  1933
## 4            1              1   667
data %>%
  group_by(covid_result, contact_with_patient) %>%
  summarise(n = n(), .groups = "drop")
## # A tibble: 4 × 3
##   covid_result contact_with_patient     n
##          <dbl>                <dbl> <int>
## 1            0                    0  1863
## 2            0                    1   537
## 3            1                    0  1137
## 4            1                    1  1463

The demographic characteristics of patients are included in the association rule search, as age and gender are known to influence how COVID-19 affects a person. For example, certain symptoms (such as fever and chills) are more prevalent at the time of testing among men who test positive for COVID-19 than among women (Patel et al., 2023).

data <- subset(data, select = -c(patient_id))

This dataset is complete and clean, but in order to conduct association rules search the data needs to be transformed.

Data transformation

Recoding data can be done manually or automatically. First, let’s try the manual way:

  • the age and oxygen level columns need to be binned into ranges
  • every value needs to be assigned a descriptive label
for (symptom in symptoms) {
  data[[symptom]] <- factor(data[[symptom]], 
                            levels = c(0, 1),
                            labels = c(paste0("no ", symptom), symptom))
}

data$travel_history <- factor(data$travel_history, levels = c(0, 1), 
                            labels = c("no travel history", "travel history"))

data$contact_with_patient <- factor(data$contact_with_patient, levels = c(0, 1), 
                            labels = c("no contact with COVID-19 patient", "contact with COVID-19 patient"))

data$covid_result <- factor(data$covid_result, levels = c(0, 1), 
                     labels = c("negative COVID-19", "positive COVID-19"))

data$comorbidity[data$comorbidity == "None"] <- "no comorbidity"

Because the data is synthetic, the age range is quite wide. For the manual recoding I decided on 7 intuitive bins: child, teenager, young adult, adult, middle aged, senior, elderly.

data$age <- cut(
  data$age,
  breaks = c(0, 12, 18, 30, 45, 60, 75, 89),
  labels = c(
    "child",
    "teenager",
    "young adult",
    "adult",
    "middle aged",
    "senior",
    "elderly"
  ),
  include.lowest = TRUE,
  right = TRUE
)

For oxygen level I decided on an approximation of the medical classification (O’Driscoll et al., 2017):

data$oxygen_level <- cut(
  data$oxygen_level,
  breaks = c(85, 89, 94, 96, 99),
  labels = c(
    "very low oxygen level",
    "low oxygen level",
    "normal oxygen level",
    "high oxygen level"
  ),
  include.lowest = TRUE,
  right = TRUE
)
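
Because the bin edges decide which label a borderline patient gets, it can help to check how cut() treats them. The sketch below reuses the oxygen level breaks from above on a few hand-picked values (the vector name probe is an assumption). With right = TRUE each interval is closed on the right, and include.lowest = TRUE keeps the minimum value 85 in the first bin instead of turning it into NA.

```r
# Hand-picked values on and around the bin edges.
probe <- c(85, 89, 90, 94, 95, 96, 97, 99)

binned <- cut(
  probe,
  breaks = c(85, 89, 94, 96, 99),
  labels = c("very low oxygen level", "low oxygen level",
             "normal oxygen level", "high oxygen level"),
  include.lowest = TRUE,
  right = TRUE
)

# Show each value next to the label it receives.
data.frame(oxygen_level = probe, bin = binned)
```

If oxygen levels are recorded as whole numbers, which the quantiles suggest, only the values 95 and 96 fall into the “normal” bin, so that bin is expected to be comparatively rare.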

The manually prepared dataset is now ready for the association rule search.

write.csv(data, file="covid_data_manual.csv", row.names = FALSE)

Preparing data for automatic recoding (leaving continuous variables for automatic conversion):

auto_data <- read_csv("covid19_patient_symptoms_diagnosis.csv")
auto_data <- subset(auto_data, select = -c(patient_id))
auto_data <- subset(auto_data, select = -c(body_temperature))

for (symptom in symptoms) {
  auto_data[[symptom]] <- factor(auto_data[[symptom]], 
                            levels = c(0, 1),
                            labels = c(paste0("no ", symptom), symptom))
}

auto_data$travel_history <- factor(auto_data$travel_history, levels = c(0, 1), 
                              labels = c("no travel history", "travel history"))

auto_data$contact_with_patient <- factor(auto_data$contact_with_patient, levels = c(0, 1), 
                                    labels = c("no contact with COVID-19 patient", "contact with COVID-19 patient"))

auto_data$covid_result <- factor(auto_data$covid_result, levels = c(0, 1), 
                            labels = c("negative COVID-19", "positive COVID-19"))

auto_data$comorbidity[auto_data$comorbidity == "None"] <- "no comorbidity"

write.csv(auto_data, file="covid_data_auto.csv", row.names = FALSE)

Association rules on the manually transformed dataset

covid1 <- read.transactions("covid_data_manual.csv", format = "basket", header = TRUE, sep = ",")
covid1

# Checks:
size(covid1)
length(covid1)
round(itemFrequency(covid1), 3)
inspect(covid1[1:5])
# Everything seems to be fine.

# Cleaning too rare observations (if a symptom appears very rarely, any rule generated from it is likely a coincidence, not a pattern and it is not very helpful in this project case)
covid1 <- covid1[, itemFrequency(covid1)>0.05]
# None removed
sort(itemFrequency(covid1, type="relative"))
##                         teenager                           Asthma 
##                           0.0692                           0.0964 
##              normal oxygen level                            child 
##                           0.1346                           0.1348 
##                      young adult                          elderly 
##                           0.1428                           0.1480 
##                    Heart Disease                           senior 
##                           0.1584                           0.1644 
##                            adult                      middle aged 
##                           0.1658                           0.1750 
##                high oxygen level                         Diabetes 
##                           0.1926                           0.2002 
##                   travel history                    loss_of_taste 
##                           0.2506                           0.2928 
##                    loss_of_smell                       chest_pain 
##                           0.2994                           0.3060 
##                 low oxygen level            very low oxygen level 
##                           0.3344                           0.3384 
##              shortness_of_breath    contact with COVID-19 patient 
##                           0.3484                           0.4000 
##                       no fatigue                      sore_throat 
##                           0.4102                           0.4160 
##                         no fever                         headache 
##                           0.4326                           0.4448 
##                negative COVID-19                        dry_cough 
##                           0.4800                           0.4932 
##                             Male                           Female 
##                           0.4972                           0.5028 
##                     no dry_cough                positive COVID-19 
##                           0.5068                           0.5200 
##                   no comorbidity                      no headache 
##                           0.5450                           0.5552 
##                            fever                   no sore_throat 
##                           0.5674                           0.5840 
##                          fatigue no contact with COVID-19 patient 
##                           0.5898                           0.6000 
##           no shortness_of_breath                    no chest_pain 
##                           0.6516                           0.6940 
##                 no loss_of_smell                 no loss_of_taste 
##                           0.7006                           0.7072 
##                no travel history 
##                           0.7494

From this data and method we can extract numerous valuable insights. Who tests COVID-19 positive/negative? What are their symptoms, clinical characteristics, medical history?

1. What set of characteristics predisposes a person to test COVID-19 positive/negative?

Creating rules with the apriori method, using a minimum support of 0.1 (which is also the arules default) and a minimum confidence of 0.5 (set below the arules default of 0.8):

# positive
rules.p <- apriori(covid1, parameter = list(supp=0.1, conf=0.5), appearance = list(default="lhs", rhs = "positive COVID-19"))

# Removing redundant rules, i.e. rules for which a more general rule with the same or higher confidence exists.
rules.clean.p<-rules.p[!is.redundant(rules.p)]
rules.clean.p # 183 rules out of 546 left

# Extracting significant rules using Fisher's exact test:
rules.clean.p<-rules.clean.p[is.significant(rules.clean.p, covid1)] 
rules.clean.p # 141 rules out of 183 left

# Extracting only maximal sets (sets without supersets):
rules.clean.p<-rules.clean.p[is.maximal(rules.clean.p)]
rules.clean.p # 82 rules out of 141 left

inspectDT(rules.clean.p)

# negative
rules.n <- apriori(covid1, parameter = list(supp=0.1, conf=0.5), appearance = list(default="lhs", rhs = "negative COVID-19"))

# Removing redundant rules, i.e. rules for which a more general rule with the same or higher confidence exists.
rules.clean.n<-rules.n[!is.redundant(rules.n)]
rules.clean.n # 278 rules out of 685 left

# Extracting statistically significant rules:
rules.clean.n<-rules.clean.n[is.significant(rules.clean.n, covid1)] 
rules.clean.n # 277 rules out of 278 left

# Extracting only maximal sets (sets without supersets):
rules.clean.n<-rules.clean.n[is.maximal(rules.clean.n)]
rules.clean.n # 141 rules out of 277 left, many more than for the positive result

inspectDT(rules.clean.n)
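
To make the redundancy filter concrete, here is a minimal base R sketch of the criterion described in the comments above: a rule is dropped when some rule with a strictly smaller left-hand side (and the same right-hand side) reaches the same or higher confidence. The function name and the toy rules, including their confidence values, are made up for illustration; is.redundant() from arules remains the tool actually used in this project, and it may differ in details such as how the empty left-hand side is handled.

```r
# Hypothetical re-implementation of the redundancy criterion.
# `lhs` is a list of character vectors, `conf` the matching confidences;
# all rules are assumed to share the same right-hand side.
is_redundant_sketch <- function(lhs, conf) {
  sapply(seq_along(lhs), function(i) {
    any(sapply(seq_along(lhs), function(j) {
      j != i &&
        length(lhs[[j]]) < length(lhs[[i]]) &&  # strictly more general
        all(lhs[[j]] %in% lhs[[i]]) &&          # LHS j is a subset of LHS i
        conf[j] >= conf[i]                      # and at least as confident
    }))
  })
}

# Toy rules with made-up confidences.
lhs  <- list("fever", c("fever", "fatigue"), c("fever", "dry_cough"))
conf <- c(0.66, 0.66, 0.85)

redundant <- is_redundant_sketch(lhs, conf)
redundant
```

In this toy example only the second rule is flagged: adding fatigue to fever does not improve confidence, whereas adding dry_cough does.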

With a binary target variable, intuition might suggest that the rules for the opposite result are simply the mirror image, but this is incorrect. The result depends on the distributions of the individual variables with respect to the target (plus the frequency of each result, but this data is split 52/48, so that matters less here). The larger number of association rules for patients testing negative means that the profile of such a patient varies more than the profile of a patient testing positive. This is quite helpful and logical: positive patients have more similar symptoms and characteristics than negative ones. A few specific combinations point to this particular illness, while many combinations point to its absence. This is observable in reality: one illness typically has a set of typical symptoms, whereas people who are healthy or ill in another way (for example with an allergy or a common cold) can display these symptoms, or their absence, in many different combinations.

The scatter plot shows support vs lift for all positive COVID-19 rules, shaded by confidence.

plot(rules.clean.p, measure=c("support","lift"), shading="confidence")

All rules have low support (0.10-0.17), meaning every identified pattern applies to a relatively small subgroup of patients. The cluster of pale, low-confidence rules in the bottom left (low support, low lift, low confidence) represents weaker rules that passed the minimum thresholds but carry limited predictive value.

The graph visualizes which items appear together most frequently in the positive COVID-19 rules.

plot(rules.clean.p, method = "graph", max = 100)

Nodes represent items (symptoms, characteristics) and edges connect items that appear in the same rules. More central nodes with many connections, like fever, are the most influential predictors, appearing across many different rules. Isolated or peripheral nodes represent items that only matter in very specific combinations.

There are 10 rules with confidence over 0.9, and one reaches 1: {contact with COVID-19 patient,dry_cough,fever} -> {positive COVID-19}. In this dataset every patient with this combination tested positive. With a lift of 1.923, a person with this combination is 1.923 times more likely to be positive than the average person in the dataset. Patients who test positive for COVID-19 have usually had contact with another COVID-19 patient, have lost their sense of smell, and have a dry cough and shortness of breath. Surprisingly, they also tend to have no travel history, which may indicate local transmission.

The scatter plot shows support vs lift for all negative COVID-19 rules, shaded by confidence.

plot(rules.clean.n, measure=c("support","lift"), shading="confidence")

The graph visualizes which items appear together most frequently in the negative COVID-19 rules.

plot(rules.clean.n, method = "graph", max = 100)

Negative COVID-19 test rules: profile of a patient who is healthy or suffering from another illness

In this dataset there was not a single patient who had no contact, no fever, no loss of smell and no shortness of breath and still tested positive. The same goes for no contact, no dry cough, no loss of smell and no shortness of breath. Patients with these profiles were always negative. The best rules for a negative result seem stronger than the best rules for a positive result, with a maximum lift of 2.083, but this is only because confidence equals 1 and the share of negative results is smaller (48%), so 1/0.48 is higher than 1/0.52. No history of contact, no dry cough, no fever, no shortness of breath and no loss of smell or taste are prevalent in the rules with the highest confidence, and they seem to almost guarantee testing negative. It is important to note that this might not always hold, as there are many known cases of asymptomatic COVID-19 patients.
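
The ceiling on lift follows directly from its definition: confidence cannot exceed 1, so lift cannot exceed 1 / P(Y). A short check with the class counts from this dataset (2600 positive and 2400 negative results; the variable names are assumptions):

```r
n_positive <- 2600
n_negative <- 2400
n_total    <- n_positive + n_negative

# Largest possible lift for each target class (reached when confidence = 1).
max_lift <- c(
  positive = 1 / (n_positive / n_total),
  negative = 1 / (n_negative / n_total)
)
round(max_lift, 3)
```

This gives 1.923 for the positive class and 2.083 for the negative class, so lift values should only be compared between rules that share the same right-hand side.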

2. How contagious is COVID-19? Does contact with another COVID-19 patient increase the probability of testing positive?

Within the top 10 rules by confidence we see many examples of contact with another COVID-19 patient.

inspect(subset(rules.clean.p, lhs %pin% "contact"))
##      lhs                                 rhs                 support confidence coverage     lift count
## [1]  {contact with COVID-19 patient,                                                                   
##       loss_of_smell}                  => {positive COVID-19}  0.1126  0.9184339   0.1226 1.766219   563
## [2]  {contact with COVID-19 patient,                                                                   
##       very low oxygen level}          => {positive COVID-19}  0.1220  0.8879185   0.1374 1.707536   610
## [3]  {contact with COVID-19 patient,                                                                   
##       shortness_of_breath}            => {positive COVID-19}  0.1252  0.9152047   0.1368 1.760009   626
## [4]  {contact with COVID-19 patient,                                                                   
##       no sore_throat}                 => {positive COVID-19}  0.1698  0.7331606   0.2316 1.409924   849
## [5]  {contact with COVID-19 patient,                                                                   
##       dry_cough,                                                                                       
##       fever}                          => {positive COVID-19}  0.1108  1.0000000   0.1108 1.923077   554
## [6]  {contact with COVID-19 patient,                                                                   
##       dry_cough,                                                                                       
##       no loss_of_taste}               => {positive COVID-19}  0.1186  0.8917293   0.1330 1.714864   593
## [7]  {contact with COVID-19 patient,                                                                   
##       fever,                                                                                           
##       Male}                           => {positive COVID-19}  0.1008  0.8857645   0.1138 1.703393   504
## [8]  {contact with COVID-19 patient,                                                                   
##       Male,                                                                                            
##       no chest_pain}                  => {positive COVID-19}  0.1034  0.7525473   0.1374 1.447206   517
## [9]  {contact with COVID-19 patient,                                                                   
##       fever,                                                                                           
##       no comorbidity}                 => {positive COVID-19}  0.1128  0.8895899   0.1268 1.710750   564
## [10] {contact with COVID-19 patient,                                                                   
##       fever,                                                                                           
##       no headache}                    => {positive COVID-19}  0.1100  0.8842444   0.1244 1.700470   550
## [11] {contact with COVID-19 patient,                                                                   
##       no chest_pain,                                                                                   
##       no headache}                    => {positive COVID-19}  0.1052  0.7429379   0.1416 1.428727   526
## [12] {contact with COVID-19 patient,                                                                   
##       fatigue,                                                                                         
##       fever}                          => {positive COVID-19}  0.1198  0.8926975   0.1342 1.716726   599
## [13] {contact with COVID-19 patient,                                                                   
##       fever,                                                                                           
##       no loss_of_taste}               => {positive COVID-19}  0.1396  0.8869123   0.1574 1.705601   698
## [14] {contact with COVID-19 patient,                                                                   
##       fatigue,                                                                                         
##       no loss_of_taste}               => {positive COVID-19}  0.1228  0.7533742   0.1630 1.448797   614
## [15] {contact with COVID-19 patient,                                                                   
##       fever,                                                                                           
##       no chest_pain,                                                                                   
##       no travel history}              => {positive COVID-19}  0.1012  0.8815331   0.1148 1.695256   506

There are 15 rules (out of 82) containing this condition, and they combine high confidence with high lift (between about 1.41 and 1.92), at a support of 0.10-0.17. In every rule where contact is present, the probability of being positive is over 70%, and usually much higher. These results confirm that contact substantially increases the probability of a positive result beyond the 52% baseline.
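
The same conclusion can be cross-checked without any rule mining, using only the contact-by-result counts from the exploratory analysis. This is a rough sketch (variable names are assumptions) that compares the share of positive results with and without reported contact:

```r
# Counts from the covid_result x contact_with_patient table in this report.
tab <- matrix(
  c(1863, 537,    # negative: no contact, contact
    1137, 1463),  # positive: no contact, contact
  nrow = 2, byrow = TRUE,
  dimnames = list(result  = c("negative", "positive"),
                  contact = c("no", "yes"))
)

# Share of positive results within each contact group (column-wise proportions).
p_positive <- prop.table(tab, margin = 2)["positive", ]
p_positive

# Ratio of the two shares, and lift of {contact} -> {positive}.
share_ratio  <- p_positive[["yes"]] / p_positive[["no"]]
contact_lift <- p_positive[["yes"]] / (sum(tab["positive", ]) / sum(tab))
c(share_ratio = share_ratio, contact_lift = contact_lift)
```

Contact alone gives a positive share of about 73%, against about 38% without contact, so the single item already carries a lift of roughly 1.41 before any symptom is added.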

3. Does age matter - are seniors and the elderly at greater risk of getting sick?

inspect(subset(rules.clean.p, lhs %pin% "senior"))
inspect(subset(rules.clean.p, lhs %pin% "elderly"))

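There are no rules that include age brackets, so age does not appear to affect the probability of testing positive. A caveat: each age bracket covers at most 17.5% of patients, so at least 57% of a bracket would have to test positive for a rule containing it to reach the 0.1 support threshold. The absence of age rules may therefore partly reflect the binning and the threshold rather than a true lack of effect.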

4. Does gender matter in differentiating the symptoms leading to a positive test result?

inspect(subset(rules.clean.p, lhs %pin% "Female"))
##     lhs                        rhs                 support confidence coverage     lift count
## [1] {Female,                                                                                 
##      loss_of_smell}         => {positive COVID-19}  0.1116  0.7760779   0.1438 1.492457   558
## [2] {Female,                                                                                 
##      very low oxygen level} => {positive COVID-19}  0.1214  0.7132785   0.1702 1.371689   607
## [3] {dry_cough,                                                                              
##      Female,                                                                                 
##      fever}                 => {positive COVID-19}  0.1180  0.8489209   0.1390 1.632540   590
## [4] {dry_cough,                                                                              
##      Female,                                                                                 
##      no travel history}     => {positive COVID-19}  0.1256  0.6993318   0.1796 1.344869   628
## [5] {Female,                                                                                 
##      fever,                                                                                  
##      no sore_throat}        => {positive COVID-19}  0.1124  0.6643026   0.1692 1.277505   562
inspect(subset(rules.clean.p, lhs %pin% "Male"))
##     lhs                                 rhs                 support confidence coverage     lift count
## [1] {Male,                                                                                            
##      shortness_of_breath}            => {positive COVID-19}  0.1284  0.7517564   0.1708 1.445685   642
## [2] {contact with COVID-19 patient,                                                                   
##      fever,                                                                                           
##      Male}                           => {positive COVID-19}  0.1008  0.8857645   0.1138 1.703393   504
## [3] {contact with COVID-19 patient,                                                                   
##      Male,                                                                                            
##      no chest_pain}                  => {positive COVID-19}  0.1034  0.7525473   0.1374 1.447206   517
## [4] {dry_cough,                                                                                       
##      Male,                                                                                            
##      no chest_pain}                  => {positive COVID-19}  0.1236  0.7144509   0.1730 1.373944   618
## [5] {fever,                                                                                           
##      Male,                                                                                            
##      no comorbidity}                 => {positive COVID-19}  0.1058  0.6834625   0.1548 1.314351   529
## [6] {fever,                                                                                           
##      Male,                                                                                            
##      no headache}                    => {positive COVID-19}  0.1068  0.6750948   0.1582 1.298259   534
## [7] {fatigue,                                                                                         
##      fever,                                                                                           
##      Male}                           => {positive COVID-19}  0.1158  0.6803760   0.1702 1.308415   579
## [8] {fever,                                                                                           
##      Male,                                                                                            
##      no loss_of_taste}               => {positive COVID-19}  0.1334  0.6723790   0.1984 1.293037   667
## [9] {fever,                                                                                           
##      Male,                                                                                            
##      no chest_pain,                                                                                   
##      no travel history}              => {positive COVID-19}  0.1040  0.6887417   0.1510 1.324503   520

For male patients a positive result is tied to shortness of breath or to fever combined with contact history. The strongest gender-specific rule belongs to males: fever together with contact history gives a confidence of 88.6%. Many male rules include “negative” items (no chest_pain, no headache, no loss_of_taste), with fever being the main indicator. For female patients very low oxygen level and loss of smell seem more distinctive: for a female patient who has lost her sense of smell, the confidence of a positive result is 77.6%.

5. Does a medical history of certain illnesses put you at risk of COVID-19?

inspect(subset(rules.clean.p, lhs %pin% "Asthma"))
inspect(subset(rules.clean.p, lhs %pin% "Diabetes"))
inspect(subset(rules.clean.p, lhs %pin% "Heart Disease"))

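There are no rules that include comorbidities, so they do not appear to affect the probability of testing positive. Note that the support threshold plays a part here: Asthma, for example, occurs in only 9.64% of patients and so cannot reach a support of 0.1 at all. Perhaps comorbidities and age affect the severity of illness rather than the probability of infection.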

Based on these results we can compose a data-driven guideline for triage, for example prioritizing testing patients exhibiting cough and fever symptoms or female patients with smell loss.

Association rules on the automatically transformed dataset

covid2 <- read.csv("covid_data_auto.csv", header=TRUE, sep=",")
summary(covid2)

# Converting characters to factors, leaving numeric variables 
covid2[] <- lapply(covid2, function(x) {
  if(is.character(x)) as.factor(x) else x
})
summary(covid2)

MDLP (Minimum Description Length Principle) is a supervised discretization method that uses the class labels (here covid_result) to find the most informative split points. It is based on entropy, and it tends to keep the number of intervals small so that the model stays as simple as possible.
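
The core step of this kind of discretization can be sketched in a few lines of base R: try every midpoint between neighbouring values as a cut and keep the one with the largest information gain (entropy reduction) about the class. This is a simplified illustration on made-up values, not the MDLP implementation used by discretizeDF.supervised(). In particular it omits the recursion and the MDL stopping criterion, and the helper names are assumptions. The toy values are constructed so that the best cut lands at 91.5.

```r
# Shannon entropy (in bits) of a class vector.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical helper: best single cut point by information gain.
best_split <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between values
  gains <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    entropy(y) -
      length(left)  / length(y) * entropy(left) -
      length(right) / length(y) * entropy(right)
  })
  list(cut = cuts[which.max(gains)], gain = max(gains))
}

# Toy data: low oxygen values are all positive (1), high ones negative (0).
x <- c(86, 88, 90, 91, 92, 94, 96, 98)
y <- c(1, 1, 1, 1, 0, 0, 0, 0)

res <- best_split(x, y)
res
```

A variable for which no cut yields a worthwhile gain ends up as a single interval, which is what happens to age in the output below.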

data.disc <- discretizeDF.supervised(covid_result ~ ., data=covid2, method="mdlp")
data.disc$age # age has 1 range
data.disc$oxygen_level # 2 ranges (up to 91.5 and above)

trans.covid2 <- transactions(data.disc)
trans.covid2 <- trans.covid2[, itemFrequency(trans.covid2)>0.05]

rules.covid2 <- mineCARs(covid_result ~ ., transactions=trans.covid2, support = 0.1, confidence = 0.7)
summary(rules.covid2)

rules.covid2.clean <- rules.covid2[!is.redundant(rules.covid2)]
rules.covid2.clean

inspectDT(rules.covid2.clean)

plot(rules.covid2.clean, measure=c("support","lift"), shading="confidence")

inspect(subset(rules.covid2.clean, items %pin% "oxygen_level")[15, ])
##     lhs                                   rhs                              support confidence coverage     lift count
## [1] {oxygen_level=[-Inf,91.5),                                                                                       
##      travel_history=no travel history} => {covid_result=positive COVID-19}  0.2504   0.713797   0.3508 1.372687  1252

Automatic discretization brings more information about oxygen level into the rules. Combined with other characteristics, an oxygen level below 91.5 generally points to a positive test and a level of 91.5 or above to a negative one. In this case automatic discretization added valuable information: we now have 10 rules with 100% confidence for this dataset. These rules have high lift and fairly high support, each of them occurring in more than 10% of patients. For this dataset the data-driven split seems to work better than a split based on medical knowledge. What is more, the MDLP method ignores age (it puts all values into a single interval), as age does not help to differentiate between positive and negative test results.
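
As a sanity check, the measures reported for the oxygen-level rule shown above can be reproduced by hand from the definitions in the introduction. The sketch below uses only numbers printed in this report. The share of positive tests (0.52) and the dataset size (5000) are taken from the classifier output further below, and the variable names are illustrative.

```r
# Values copied from the inspect() output for the oxygen-level rule
support  <- 0.2504   # P(lhs and rhs)
coverage <- 0.3508   # P(lhs)

# Values taken from elsewhere in this report
n     <- 5000        # number of patients (rows) in the dataset
p_pos <- 0.52        # overall share of positive tests, P(rhs)

confidence <- support / coverage   # P(rhs | lhs)
lift       <- confidence / p_pos   # P(rhs | lhs) / P(rhs)
count      <- support * n          # absolute number of matching patients

round(c(confidence = confidence, lift = lift, count = count), 4)
```

The results should agree with the confidence, lift and count columns of the rule, up to rounding.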

Classification Based on Association

This is quite a simple concept, a natural extension of searching for association rules. Association rules can be used to predict the class of a target variable. The algorithm selects Class Association Rules (CARs) that meet minimum support and confidence thresholds and then uses them to label a transaction or, in this case, a patient. This method fits what we are trying to achieve in this project: faster and more accurate triage. Automatic MDLP discretization is commonly used when mining CARs, as it optimizes for accurate labeling of the analyzed data. Additionally, pruning is necessary, as CBA models can end up with too many rules that apply only to a few specific cases.
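
The labeling step can be sketched in a few lines of base R. This is a simplified illustration of first-match classification with a default class, not the internals of the arulesCBA package. The two hand-written rules, the patients and the helper `classify_first` are all hypothetical.

```r
# Hypothetical, hand-written rules ordered by priority. Each rule holds the
# attribute values it requires (lhs) and the class it predicts (rhs).
rules <- list(
  list(lhs = c(fever = "fever", dry_cough = "dry_cough"), rhs = "positive"),
  list(lhs = c(fever = "no fever", dry_cough = "no dry_cough"), rhs = "negative")
)
default_class <- "positive"  # used when no rule matches

# Label one patient with the first rule whose lhs is fully satisfied
classify_first <- function(patient, rules, default_class) {
  for (rule in rules) {
    if (all(patient[names(rule$lhs)] == rule$lhs)) return(rule$rhs)
  }
  default_class
}

patient_a <- c(fever = "fever",    dry_cough = "dry_cough")
patient_b <- c(fever = "no fever", dry_cough = "no dry_cough")
patient_c <- c(fever = "fever",    dry_cough = "no dry_cough")  # matches no rule

labels <- sapply(list(a = patient_a, b = patient_b, c = patient_c),
                 classify_first, rules = rules, default_class = default_class)
labels
```

Because the first matching rule wins, the order of the rules matters, which is why pruning and ranking the rules is part of building the classifier.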

The classifier is built using the default automatic discretization and default parameters, plus M1 pruning (case by case), because the dataset is small.

covid2.classification <- CBA(covid_result ~ ., data = covid2, pruning = "M1")
covid2.classification
## CBA Classifier Object
## Formula: covid_result ~ .
## Number of rules: 23
## Default Class: positive COVID-19
## Classification method: first  
## Description: CBA algorithm (Liu et al., 1998)
inspect(covid2.classification$rules)
##      lhs                                                        rhs                              support confidence coverage     lift count size coveredTransactions totalErrors
## [1]  {shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       loss_of_smell=no loss_of_smell,                                                                                                                                           
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1504  1.0000000   0.1504 2.083333   752    5                 752        1648
## [2]  {dry_cough=no dry_cough,                                                                                                                                                   
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       loss_of_smell=no loss_of_smell,                                                                                                                                           
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1394  1.0000000   0.1394 2.083333   697    5                 333        1315
## [3]  {fever=fever,                                                                                                                                                              
##       dry_cough=dry_cough,                                                                                                                                                      
##       oxygen_level=[-Inf,91.5)}                              => {covid_result=positive COVID-19}  0.1310  1.0000000   0.1310 1.923077   655    4                 655        1315
## [4]  {fever=no fever,                                                                                                                                                           
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       loss_of_smell=no loss_of_smell,                                                                                                                                           
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1186  1.0000000   0.1186 2.083333   593    5                 119        1196
## [5]  {dry_cough=no dry_cough,                                                                                                                                                   
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       loss_of_smell=no loss_of_smell,                                                                                                                                           
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19}  0.1174  1.0000000   0.1174 2.083333   587    5                 223         973
## [6]  {fever=fever,                                                                                                                                                              
##       oxygen_level=[-Inf,91.5),                                                                                                                                                 
##       contact_with_patient=contact with COVID-19 patient}    => {covid_result=positive COVID-19}  0.1144  1.0000000   0.1144 1.923077   572    4                 298         973
## [7]  {dry_cough=no dry_cough,                                                                                                                                                   
##       loss_of_smell=no loss_of_smell,                                                                                                                                           
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1134  1.0000000   0.1134 2.083333   567    5                 203         770
## [8]  {fever=fever,                                                                                                                                                              
##       dry_cough=dry_cough,                                                                                                                                                      
##       contact_with_patient=contact with COVID-19 patient}    => {covid_result=positive COVID-19}  0.1108  1.0000000   0.1108 1.923077   554    4                 280         770
## [9]  {fever=no fever,                                                                                                                                                           
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       loss_of_smell=no loss_of_smell,                                                                                                                                           
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19}  0.1066  1.0000000   0.1066 2.083333   533    5                 106         664
## [10] {dry_cough=no dry_cough,                                                                                                                                                   
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1026  1.0000000   0.1026 2.083333   513    5                 149         515
## [11] {fever=no fever,                                                                                                                                                           
##       dry_cough=no dry_cough,                                                                                                                                                   
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19}  0.1078  0.9522968   0.1132 1.983952   539    4                 167         402
## [12] {fever=no fever,                                                                                                                                                           
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       travel_history=no travel history}                      => {covid_result=negative COVID-19}  0.1082  0.9491228   0.1140 1.977339   541    5                  72         388
## [13] {dry_cough=no dry_cough,                                                                                                                                                   
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       travel_history=no travel history,                                                                                                                                         
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1154  0.9490132   0.1216 1.977111   577    5                  31         419
## [14] {fever=no fever,                                                                                                                                                           
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       loss_of_taste=no loss_of_taste,                                                                                                                                           
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19}  0.1022  0.9480519   0.1078 1.975108   511    5                  15         418
## [15] {dry_cough=no dry_cough,                                                                                                                                                   
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1496  0.9468354   0.1580 1.972574   748    4                  11         429
## [16] {fever=no fever,                                                                                                                                                           
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19}  0.1408  0.9424364   0.1494 1.963409   704    4                  14         429
## [17] {shortness_of_breath=shortness_of_breath,                                                                                                                                  
##       loss_of_smell=loss_of_smell}                           => {covid_result=positive COVID-19}  0.1022  0.9410681   0.1086 1.809746   511    3                 298         429
## [18] {fever=no fever,                                                                                                                                                           
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1302  0.9407514   0.1384 1.959899   651    4                  80         349
## [19] {fever=no fever,                                                                                                                                                           
##       dry_cough=no dry_cough,                                                                                                                                                   
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       travel_history=no travel history}                      => {covid_result=negative COVID-19}  0.1014  0.9406308   0.1078 1.959647   507    5                 151         262
## [20] {dry_cough=no dry_cough,                                                                                                                                                   
##       shortness_of_breath=no shortness_of_breath,                                                                                                                               
##       oxygen_level=[91.5, Inf],                                                                                                                                                 
##       travel_history=no travel history}                      => {covid_result=negative COVID-19}  0.1220  0.9399076   0.1298 1.958141   610    5                  39         301
## [21] {fever=no fever,                                                                                                                                                           
##       dry_cough=no dry_cough,                                                                                                                                                   
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19}  0.1230  0.9389313   0.1310 1.956107   615    4                  97         204
## [22] {fever=no fever,                                                                                                                                                           
##       dry_cough=no dry_cough,                                                                                                                                                   
##       shortness_of_breath=no shortness_of_breath}            => {covid_result=negative COVID-19}  0.1316  0.9359886   0.1406 1.949976   658    4                  34         196
## [23] {}                                                      => {covid_result=positive COVID-19}  0.5200  0.5200000   1.0000 1.000000  5000    1                 873         196

Reality check for rules on continuous variables

covid2.sorted<-arrange(covid2, oxygen_level)
covid2.sorted$ID<-1:dim(covid2.sorted)[1]

ggplot(covid2.sorted, aes(x=ID, y=oxygen_level, color=covid_result)) +    
  geom_point(size=2) + scale_color_brewer(palette="Spectral") + ggtitle("oxygen_level")

Clearly, most tests above the threshold of 91.5 are negative and most tests below it are positive. This rule seems to be correct.

covid2.sorted<-arrange(covid2, age)
covid2.sorted$ID<-1:dim(covid2.sorted)[1]

ggplot(covid2.sorted, aes(x=ID, y=age, color=covid_result)) +    
  geom_point(size=2) + scale_color_brewer(palette="Spectral") + ggtitle("age")

Age is visibly irrelevant to the test result. Normally, we would probably expect some variation with age (older people and children having weaker immune systems), but this is a synthetic dataset.

Predictions based on Class Association Rules

predictions <- predict(covid2.classification, covid2)
predictions # a list of labels

Comparison of real data and predicted labels:

comparison_table <- table(pred = predictions, true=covid2$covid_result)
kable(comparison_table, caption = "COVID-19 Prediction Results")
COVID-19 Prediction Results (rows: predicted label, columns: true label)

                    negative COVID-19   positive COVID-19
negative COVID-19                2400                 196
positive COVID-19                   0                2404

confusionMatrix(reference=covid2$covid_result, data=predictions, positive = "positive COVID-19")
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          negative COVID-19 positive COVID-19
##   negative COVID-19              2400               196
##   positive COVID-19                 0              2404
##                                            
##                Accuracy : 0.9608           
##                  95% CI : (0.955, 0.966)   
##     No Information Rate : 0.52             
##     P-Value [Acc > NIR] : < 2.2e-16        
##                                            
##                   Kappa : 0.9217           
##                                            
##  Mcnemar's Test P-Value : < 2.2e-16        
##                                            
##             Sensitivity : 0.9246           
##             Specificity : 1.0000           
##          Pos Pred Value : 1.0000           
##          Neg Pred Value : 0.9245           
##              Prevalence : 0.5200           
##          Detection Rate : 0.4808           
##    Detection Prevalence : 0.4808           
##       Balanced Accuracy : 0.9623           
##                                            
##        'Positive' Class : positive COVID-19
## 

For predictions made and tested on the whole dataset, the accuracy is impressive (96.08%). There are 2400 true negatives, so all negative cases were identified correctly. 2404 positive cases were correctly identified and only 196 were missed.

A specificity of 1 means this model never labels a healthy person as having COVID-19. A sensitivity of 0.9246 means that out of 100 people who actually have COVID-19, the model correctly identifies about 92. The P-Value [Acc > NIR] is extremely small, so the model is significantly better than always predicting the majority class (the no-information rate of 52%). Note that each patient is labeled by the first rule they match, and patients who match no rule fall into the default class (positive COVID-19).

In a medical triage scenario, labeling a sick person as healthy is more dangerous than the other way around, so it would be better to have more false alarms. While the model is highly reliable when it flags a case as positive, it misses more atypical positive cases, resulting in a 7.5% miss rate. In a triage situation, this model would work better as a primary filter and shouldn’t replace laboratory testing.
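
The figures quoted above follow directly from the four cells of the confusion matrix. The short base R sketch below recomputes them from the counts printed in this report. The variable names are illustrative.

```r
# Counts copied from the confusion matrix above
tp <- 2404  # positive patients predicted positive
fn <- 196   # positive patients predicted negative (missed cases)
tn <- 2400  # negative patients predicted negative
fp <- 0     # negative patients predicted positive (false alarms)

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of sick patients that are caught
specificity <- tn / (tn + fp)   # share of healthy patients that are cleared
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value
miss_rate   <- fn / (tp + fn)   # 1 - sensitivity

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        miss_rate = miss_rate), 4)
```

For triage, the miss rate is the number to watch, since it counts the sick patients who would be sent home.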

These results are also overly optimistic, as typically predictions would be made on a separate test dataset that wasn’t used for rule mining.

Let’s mine CARs on a training dataset part and check the confusion matrix for new predictions.

set.seed(2026) # for getting the same sample
n <- nrow(covid2)
train <- sample(1:n, size=as.integer(n*0.9)) # training set: 90% of observations

covid2.classification.train <- CBA(covid_result ~ ., data=covid2[train, ], pruning="M1")
covid2.classification.train # a smaller number of rules than before
## CBA Classifier Object
## Formula: covid_result ~ .
## Number of rules: 19
## Default Class: positive COVID-19
## Classification method: first  
## Description: CBA algorithm (Liu et al., 1998)
inspect(covid2.classification.train$rules)
##      lhs                                                        rhs                                support confidence  coverage     lift count size coveredTransactions totalErrors
## [1]  {shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       loss_of_smell=no loss_of_smell,                                                                                                                                              
##       oxygen_level=[91.5, Inf],                                                                                                                                                    
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1491111  1.0000000 0.1491111 2.066116   671    5                 671        1507
## [2]  {dry_cough=no dry_cough,                                                                                                                                                      
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       loss_of_smell=no loss_of_smell,                                                                                                                                              
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1424444  1.0000000 0.1424444 2.066116   641    5                 306        1201
## [3]  {fever=fever,                                                                                                                                                                 
##       dry_cough=dry_cough,                                                                                                                                                         
##       oxygen_level=[-Inf,91.5)}                              => {covid_result=positive COVID-19} 0.1282222  1.0000000 0.1282222 1.937984   577    4                 577        1201
## [4]  {dry_cough=no dry_cough,                                                                                                                                                      
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       loss_of_smell=no loss_of_smell,                                                                                                                                              
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19} 0.1191111  1.0000000 0.1191111 2.066116   536    5                 201        1000
## [5]  {fever=no fever,                                                                                                                                                              
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       loss_of_smell=no loss_of_smell,                                                                                                                                              
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1186667  1.0000000 0.1186667 2.066116   534    5                 110         890
## [6]  {dry_cough=no dry_cough,                                                                                                                                                      
##       loss_of_smell=no loss_of_smell,                                                                                                                                              
##       oxygen_level=[91.5, Inf],                                                                                                                                                    
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1151111  1.0000000 0.1151111 2.066116   518    5                 183         707
## [7]  {fever=fever,                                                                                                                                                                 
##       oxygen_level=[-Inf,91.5),                                                                                                                                                    
##       contact_with_patient=contact with COVID-19 patient}    => {covid_result=positive COVID-19} 0.1144444  1.0000000 0.1144444 1.937984   515    4                 272         707
## [8]  {fever=fever,                                                                                                                                                                 
##       dry_cough=dry_cough,                                                                                                                                                         
##       contact_with_patient=contact with COVID-19 patient}    => {covid_result=positive COVID-19} 0.1117778  1.0000000 0.1117778 1.937984   503    4                 260         707
## [9]  {fever=no fever,                                                                                                                                                              
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       loss_of_smell=no loss_of_smell,                                                                                                                                              
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19} 0.1062222  1.0000000 0.1062222 2.066116   478    5                  99         608
## [10] {dry_cough=no dry_cough,                                                                                                                                                      
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       oxygen_level=[91.5, Inf],                                                                                                                                                    
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1042222  1.0000000 0.1042222 2.066116   469    5                 134         474
## [11] {fever=no fever,                                                                                                                                                              
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       oxygen_level=[91.5, Inf],                                                                                                                                                    
##       travel_history=no travel history}                      => {covid_result=negative COVID-19} 0.1080000  0.9510763 0.1135556 1.965034   486    5                 102         422
## [12] {fever=no fever,                                                                                                                                                              
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       loss_of_taste=no loss_of_taste,                                                                                                                                              
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19} 0.1031111  0.9508197 0.1084444 1.964503   464    5                  21         413
## [13] {dry_cough=no dry_cough,                                                                                                                                                      
##       oxygen_level=[91.5, Inf],                                                                                                                                                    
##       travel_history=no travel history,                                                                                                                                            
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1153333  0.9505495 0.1213333 1.963945   519    5                  52         415
## [14] {fever=no fever,                                                                                                                                                              
##       dry_cough=no dry_cough,                                                                                                                                                      
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19} 0.1095556  0.9499037 0.1153333 1.962611   493    4                  91         376
## [15] {dry_cough=no dry_cough,                                                                                                                                                      
##       oxygen_level=[91.5, Inf],                                                                                                                                                    
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1515556  0.9498607 0.1595556 1.962522   682    4                   9         385
## [16] {fever=no fever,                                                                                                                                                              
##       shortness_of_breath=no shortness_of_breath,                                                                                                                                  
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19} 0.1417778  0.9437870 0.1502222 1.949973   638    4                  13         386
## [17] {fever=no fever,                                                                                                                                                              
##       loss_of_smell=no loss_of_smell,                                                                                                                                              
##       loss_of_taste=no loss_of_taste,                                                                                                                                              
##       oxygen_level=[91.5, Inf]}                              => {covid_result=negative COVID-19} 0.1080000  0.9436893 0.1144444 1.949771   486    5                  82         362
## [18] {fever=no fever,                                                                                                                                                              
##       dry_cough=no dry_cough,                                                                                                                                                      
##       contact_with_patient=no contact with COVID-19 patient} => {covid_result=negative COVID-19} 0.1235556  0.9407783 0.1313333 1.943757   556    4                 156         276
## [19] {}                                                      => {covid_result=positive COVID-19} 0.5160000  0.5160000 1.0000000 1.000000  4500    1                1161         276
# Predictions on test data:

test_predictions <- predict(covid2.classification.train, covid2[-train, ])
test_predictions # test data predicted labels (the 500 held-out rows)
# Comparison of real data and predicted labels:
comparison_table_test <- table(pred = test_predictions, true=covid2$covid_result[-train])
kable(comparison_table_test, caption = "COVID-19 Prediction Results for Test Data")
COVID-19 Prediction Results for Test Data (rows: predicted label, columns: true label)

                     negative COVID-19   positive COVID-19
negative COVID-19                  211                  25
positive COVID-19                   11                 253
confusionMatrix(reference=covid2$covid_result[-train], data=test_predictions, positive = "positive COVID-19")
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          negative COVID-19 positive COVID-19
##   negative COVID-19               211                25
##   positive COVID-19                11               253
##                                            
##                Accuracy : 0.928            
##                  95% CI : (0.9017, 0.9491) 
##     No Information Rate : 0.556            
##     P-Value [Acc > NIR] : < 2e-16          
##                                            
##                   Kappa : 0.8551           
##                                            
##  Mcnemar's Test P-Value : 0.03026          
##                                            
##             Sensitivity : 0.9101           
##             Specificity : 0.9505           
##          Pos Pred Value : 0.9583           
##          Neg Pred Value : 0.8941           
##              Prevalence : 0.5560           
##          Detection Rate : 0.5060           
##    Detection Prevalence : 0.5280           
##       Balanced Accuracy : 0.9303           
##                                            
##        'Positive' Class : positive COVID-19
## 

This is more realistic than the previous predictions. Accuracy dropped slightly, and there are now misclassified cases in both directions. Specificity dropped from 1 to 0.9505: the model misdiagnosed 11 healthy people as positive. This is expected, as rules that work perfectly on the training dataset encounter exceptions in the test dataset. Sensitivity remained similar (0.9101, with 25 positive cases missed), which suggests that the core symptom clusters are reliable indicators across the whole dataset, not just the training subset.
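
The headline metrics can be recomputed by hand from the four cells of the test confusion matrix, which makes it easier to see where each number comes from. The snippet below is only a minimal sketch in base R: the counts are copied from the output above, and the object names (`cm`, `tp`, `fp` and so on) are my own illustrative choices, not part of the original analysis.

```r
# Counts copied from the test-set confusion matrix
# (rows = predicted label, columns = true label).
# matrix() fills column by column, so the first two values form the "true negative" column.
cm <- matrix(c(211, 11, 25, 253), nrow = 2,
             dimnames = list(pred = c("negative", "positive"),
                             true = c("negative", "positive")))

tp <- cm["positive", "positive"]  # infected, predicted positive
tn <- cm["negative", "negative"]  # healthy, predicted negative
fp <- cm["positive", "negative"]  # healthy, predicted positive
fn <- cm["negative", "positive"]  # infected, predicted negative

sensitivity <- tp / (tp + fn)     # share of infected patients that were caught
specificity <- tn / (tn + fp)     # share of healthy patients that were cleared
accuracy    <- (tp + tn) / sum(cm)
ppv         <- tp / (tp + fp)     # Pos Pred Value
npv         <- tn / (tn + fn)     # Neg Pred Value

print(round(c(sensitivity = sensitivity, specificity = specificity,
              accuracy = accuracy, ppv = ppv, npv = npv), 4))
```

For triage, the false negatives (`fn`) are arguably the more costly error, since those are infected patients the rules would not flag for testing.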

Summary

This project applies association rule mining and classification based on association rules (CBA) to a synthetic COVID-19 dataset with the goal of supporting patient triage. After cleaning the data (notably dropping body temperature due to synthetic generation flaws), variables are discretized both manually and automatically (MDLP), and Apriori rules are mined separately for positive and negative diagnoses. Key findings are that contact with a COVID-19 patient, dry cough, and fever together nearly guarantee a positive result, that age and comorbidities do not predict infection in this dataset, and that gender influences which symptoms are most predictive.

Bibliography

  • O’Driscoll BR, Howard LS, Earis J, Mak V. British Thoracic Society Guideline for oxygen use in adults in healthcare and emergency settings. BMJ Open Respir Res. 2017 May 15;4(1):e000170. doi: 10.1136/bmjresp-2016-000170. PMID: 28883921; PMCID: PMC5531304.

  • Patel JR, Amick BC, Vyas KS, Bircan E, Boothe D, Nembhard WN. Gender disparities in symptomology of COVID-19 among adults in Arkansas. Prev Med Rep. 2023 Jun 23;35:102290. doi: 10.1016/j.pmedr.2023.102290. PMID: 37441188; PMCID: PMC10289819.