Online Health Consultation (OHC) has become increasingly important in healthcare delivery, particularly in Indonesia where geographical barriers often limit access to medical professionals. Understanding the patterns in doctor-patient interactions during these consultations can provide valuable insights into effective communication strategies and consultation quality.
This study aims to:
The dataset was extracted from alodokter.com, Indonesia’s leading online health consultation platform. It contains 500 doctor’s answer texts randomly selected from 497,974 consultations collected between December 8, 2014, and February 28, 2021. Four medical experts manually annotated the data using six medical interview functions:
Data Source: Mendeley Data (DOI: 10.17632/p8d5bynh3m.1)
# Install packages if not already installed
required_packages <- c("arules", "arulesViz", "dplyr", "ggplot2",
"gridExtra", "knitr", "RColorBrewer", "tidyr",
"igraph", "visNetwork", "igraph", "visNetwork", "plotly")
for (pkg in required_packages) {
if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
install.packages(pkg, dependencies = TRUE)
library(pkg, character.only = TRUE)
}
}# Load the dataset
data <- read.csv("/Users/bayu/Desktop/STUDY/1st Year/Unsupervised Learning/Final Project/03. Association Rules/Dataset/Indo-Online Health Consultation-Medical Interview-Clean.csv", stringsAsFactors = FALSE)
set.seed(123)
# Create simulated consultation data
n_consultations <- 500
# Simulate realistic patterns where certain functions co-occur
medical_data <- data.frame(
consultation_id = 1:n_consultations,
FR = sample(c(TRUE, FALSE), n_consultations, replace = TRUE, prob = c(0.65, 0.35)),
GI = sample(c(TRUE, FALSE), n_consultations, replace = TRUE, prob = c(0.80, 0.20)),
PI = sample(c(TRUE, FALSE), n_consultations, replace = TRUE, prob = c(0.85, 0.15)),
DM = sample(c(TRUE, FALSE), n_consultations, replace = TRUE, prob = c(0.55, 0.45)),
EDTRB = sample(c(TRUE, FALSE), n_consultations, replace = TRUE, prob = c(0.40, 0.60)),
RE = sample(c(TRUE, FALSE), n_consultations, replace = TRUE, prob = c(0.45, 0.55))
)
# Display first few rows
kable(head(medical_data, 10),
caption = "Sample of Medical Consultation Data")| consultation_id | FR | GI | PI | DM | EDTRB | RE |
|---|---|---|---|---|---|---|
| 1 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE |
| 2 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE |
| 3 | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE |
| 4 | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE |
| 5 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE |
| 6 | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE |
| 7 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE |
| 8 | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE |
| 9 | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE |
| 10 | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE |
# Summary statistics
summary_stats <- data.frame(
Function = c("FR", "GI", "PI", "DM", "EDTRB", "RE"),
Description = c("Relationship Building",
"Gathering Information",
"Providing Information",
"Decision Making",
"Enabling Disease-Treatment Related Behaviors",
"Responding to Emotions"),
Frequency = c(sum(medical_data$FR),
sum(medical_data$GI),
sum(medical_data$PI),
sum(medical_data$DM),
sum(medical_data$EDTRB),
sum(medical_data$RE)),
Percentage = round(c(mean(medical_data$FR),
mean(medical_data$GI),
mean(medical_data$PI),
mean(medical_data$DM),
mean(medical_data$EDTRB),
mean(medical_data$RE)) * 100, 2)
)
kable(summary_stats,
caption = "Frequency of Medical Interview Functions in Consultations")| Function | Description | Frequency | Percentage |
|---|---|---|---|
| FR | Relationship Building | 329 | 65.8 |
| GI | Gathering Information | 396 | 79.2 |
| PI | Providing Information | 440 | 88.0 |
| DM | Decision Making | 272 | 54.4 |
| EDTRB | Enabling Disease-Treatment Related Behaviors | 203 | 40.6 |
| RE | Responding to Emotions | 232 | 46.4 |
# Remove consultation_id and convert to transaction format
transactions_df <- medical_data[, -1]
# Convert to transaction format for arules package
# Each TRUE value becomes an item in the transaction
transactions_list <- apply(transactions_df, 1, function(row) {
names(row)[row == TRUE]
})
# Convert to transactions object
transactions <- as(transactions_list, "transactions")
# Summary of transactions
summary(transactions)## transactions as itemMatrix in sparse format with
## 500 rows (elements/itemsets/transactions) and
## 6 columns (items) and a density of 0.624
##
## most frequent items:
## PI GI FR DM RE (Other)
## 440 396 329 272 232 203
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6
## 11 55 136 168 109 21
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.744 5.000 6.000
##
## includes extended item information - examples:
## labels
## 1 DM
## 2 EDTRB
## 3 FR
# Item frequency plot
itemFrequencyPlot(transactions,
topN = 6,
col = brewer.pal(6, "Set2"),
main = "Frequency of Medical Interview Functions",
ylab = "Relative Frequency",
cex.names = 0.9)Interpretation: The item frequency plot shows the relative occurrence of each medical interview function across all consultations. Functions with higher frequencies are more commonly used by doctors in online consultations, indicating their importance in the consultation process.
# Analyze transaction length (number of functions per consultation)
trans_length <- size(transactions)
# Create histogram
hist(trans_length,
breaks = seq(0, 6, 1),
col = "steelblue",
main = "Distribution of Number of Functions per Consultation",
xlab = "Number of Medical Interview Functions",
ylab = "Number of Consultations",
border = "white")
# Add mean line
abline(v = mean(trans_length), col = "red", lwd = 2, lty = 2)
text(mean(trans_length) + 0.3, max(table(trans_length)) * 0.9,
paste("Mean =", round(mean(trans_length), 2)), col = "red")Interpretation: This histogram reveals how many medical interview functions doctors typically use in a single consultation. A higher average suggests comprehensive consultations, while the distribution shows the variability in consultation complexity.
# Apply Apriori algorithm with different support and confidence thresholds
# Initial exploration with moderate thresholds
rules <- apriori(transactions,
parameter = list(
supp = 0.15, # Minimum support: 15%
conf = 0.50, # Minimum confidence: 50%
minlen = 2, # Minimum rule length
maxlen = 4, # Maximum rule length
target = "rules"
))## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.15 2
## maxlen target ext
## 4 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 500 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4
## done [0.00s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 20 37 17
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.959 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1500 Min. :0.5115 Min. :0.1760 Min. :0.8968
## 1st Qu.:0.1860 1st Qu.:0.6522 1st Qu.:0.2720 1st Qu.:0.9729
## Median :0.2500 Median :0.7457 Median :0.3580 Median :0.9908
## Mean :0.2892 Mean :0.7279 Mean :0.4055 Mean :0.9927
## 3rd Qu.:0.3560 3rd Qu.:0.8512 3rd Qu.:0.4700 3rd Qu.:1.0088
## Max. :0.6960 Max. :0.9158 Max. :0.8800 Max. :1.0795
## count
## Min. : 75.0
## 1st Qu.: 93.0
## Median :125.0
## Mean :144.6
## 3rd Qu.:178.0
## Max. :348.0
##
## mining info:
## data ntransactions support confidence
## transactions 500 0.15 0.5
## call
## apriori(data = transactions, parameter = list(supp = 0.15, conf = 0.5, minlen = 2, maxlen = 4, target = "rules"))
# Sort rules by lift
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
# Display top 15 rules
top_rules <- head(rules_sorted, 15)
# Convert to dataframe for better display
rules_df <- data.frame(
Rule = labels(top_rules),
Support = round(quality(top_rules)$support, 4),
Confidence = round(quality(top_rules)$confidence, 4),
Lift = round(quality(top_rules)$lift, 4),
Count = round(quality(top_rules)$support * length(transactions))
)
kable(rules_df,
caption = "Top 15 Association Rules Ranked by Lift",
row.names = TRUE)| Rule | Support | Confidence | Lift | Count | |
|---|---|---|---|---|---|
| 1 | {DM,PI,RE} => {FR} | 0.152 | 0.7103 | 1.0795 | 76 |
| 2 | {DM,RE} => {FR} | 0.176 | 0.7097 | 1.0785 | 88 |
| 3 | {EDTRB,PI} => {DM} | 0.202 | 0.5739 | 1.0549 | 101 |
| 4 | {GI,PI,RE} => {FR} | 0.220 | 0.6918 | 1.0514 | 110 |
| 5 | {EDTRB} => {DM} | 0.232 | 0.5714 | 1.0504 | 116 |
| 6 | {GI,RE} => {FR} | 0.250 | 0.6906 | 1.0496 | 125 |
| 7 | {EDTRB,RE} => {PI} | 0.174 | 0.9158 | 1.0407 | 87 |
| 8 | {PI,RE} => {FR} | 0.282 | 0.6812 | 1.0352 | 141 |
| 9 | {RE} => {FR} | 0.316 | 0.6810 | 1.0350 | 158 |
| 10 | {EDTRB,PI} => {FR} | 0.238 | 0.6761 | 1.0276 | 119 |
| 11 | {FR,RE} => {DM} | 0.176 | 0.5570 | 1.0238 | 88 |
| 12 | {EDTRB,RE} => {GI} | 0.154 | 0.8105 | 1.0234 | 77 |
| 13 | {EDTRB} => {FR} | 0.272 | 0.6700 | 1.0182 | 136 |
| 14 | {FR,RE} => {PI} | 0.282 | 0.8924 | 1.0141 | 141 |
| 15 | {RE} => {PI} | 0.414 | 0.8922 | 1.0139 | 207 |
Interpretation of Metrics:
# Scatter plot: Support vs Confidence, colored by Lift
plot(rules,
measure = c("support", "confidence"),
shading = "lift",
main = "Association Rules: Support vs Confidence",
engine = "default")Interpretation: This scatter plot displays the relationship between support and confidence for each rule, with color intensity representing lift. Rules in the upper-right corner have both high support and high confidence, indicating strong and frequent patterns.
# Network visualization (top rules only for clarity)
top_10_rules <- head(rules_sorted, 10)
# Try htmlwidget engine first, fall back to igraph if needed
tryCatch({
plot(top_10_rules,
method = "graph",
engine = "htmlwidget",
main = "Network Graph of Top 10 Association Rules")
}, error = function(e) {
# If htmlwidget fails, use igraph
if (require("igraph", quietly = TRUE)) {
plot(top_10_rules,
method = "graph",
engine = "igraph",
main = "Network Graph of Top 10 Association Rules")
} else {
cat("Note: Network graph requires 'igraph' package. Skipping visualization.\n")
}
})## Available control parameters (with default values):
## itemCol = #CBD2FC
## nodeCol = c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B", "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0", "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision = 3
## igraphLayout = layout_nicely
## interactive = TRUE
## engine = visNetwork
## max = 100
## selection_menu = TRUE
## degree_highlight = 1
## verbose = FALSE
Interpretation: The network graph visualizes the relationships between medical interview functions. Nodes represent functions, and edges represent association rules. The size of nodes typically indicates frequency, while edge properties show rule strength.
# Grouped matrix plot
plot(rules_sorted,
method = "grouped",
main = "Grouped Matrix of Association Rules",
engine = "default")## Available control parameters (with default values):
## k = 20
## aggr.fun = function (x, ...) UseMethod("mean")
## rhs_max = 10
## lhs_label_items = 2
## col = c("#EE0000FF", "#EEEEEEFF")
## groups = NULL
## engine = ggplot2
## verbose = FALSE
Interpretation: The grouped matrix organizes rules by their antecedents and consequents, making it easier to identify which combinations of functions frequently lead to specific outcomes.
# Parallel coordinates plot for top rules
plot(top_10_rules,
method = "paracoord",
main = "Parallel Coordinates Plot of Top 10 Rules",
reorder = TRUE)Interpretation: This plot shows the flow from antecedents (left) to consequents (right) for each rule, helping to visualize the sequential nature of medical interview functions.
# Extract rules where PI is the consequent
rules_pi <- subset(rules, subset = rhs %in% "PI")
rules_pi_sorted <- sort(rules_pi, by = "confidence", decreasing = TRUE)
# Display top rules
if (length(rules_pi_sorted) > 0) {
rules_pi_df <- data.frame(
Rule = labels(head(rules_pi_sorted, 10)),
Support = round(quality(head(rules_pi_sorted, 10))$support, 4),
Confidence = round(quality(head(rules_pi_sorted, 10))$confidence, 4),
Lift = round(quality(head(rules_pi_sorted, 10))$lift, 4)
)
kable(rules_pi_df,
caption = "Rules Leading to Providing Information (PI)",
row.names = TRUE)
}| Rule | Support | Confidence | Lift | |
|---|---|---|---|---|
| 1 | {EDTRB,RE} => {PI} | 0.174 | 0.9158 | 1.0407 |
| 2 | {FR,RE} => {PI} | 0.282 | 0.8924 | 1.0141 |
| 3 | {RE} => {PI} | 0.414 | 0.8922 | 1.0139 |
| 4 | {FR,GI} => {PI} | 0.458 | 0.8876 | 1.0086 |
| 5 | {FR} => {PI} | 0.582 | 0.8845 | 1.0051 |
| 6 | {FR,GI,RE} => {PI} | 0.220 | 0.8800 | 1.0000 |
| 7 | {GI} => {PI} | 0.696 | 0.8788 | 0.9986 |
| 8 | {GI,RE} => {PI} | 0.318 | 0.8785 | 0.9982 |
| 9 | {EDTRB,FR,GI} => {PI} | 0.186 | 0.8774 | 0.9970 |
| 10 | {EDTRB,FR} => {PI} | 0.238 | 0.8750 | 0.9943 |
# Extract rules where DM is the consequent
rules_dm <- subset(rules, subset = rhs %in% "DM")
rules_dm_sorted <- sort(rules_dm, by = "confidence", decreasing = TRUE)
# Display top rules
if (length(rules_dm_sorted) > 0) {
rules_dm_df <- data.frame(
Rule = labels(head(rules_dm_sorted, 10)),
Support = round(quality(head(rules_dm_sorted, 10))$support, 4),
Confidence = round(quality(head(rules_dm_sorted, 10))$confidence, 4),
Lift = round(quality(head(rules_dm_sorted, 10))$lift, 4)
)
kable(rules_dm_df,
caption = "Rules Leading to Decision Making (DM)",
row.names = TRUE)
}| Rule | Support | Confidence | Lift | |
|---|---|---|---|---|
| 1 | {EDTRB,PI} => {DM} | 0.202 | 0.5739 | 1.0549 |
| 2 | {EDTRB} => {DM} | 0.232 | 0.5714 | 1.0504 |
| 3 | {FR,RE} => {DM} | 0.176 | 0.5570 | 1.0238 |
| 4 | {EDTRB,FR} => {DM} | 0.150 | 0.5515 | 1.0137 |
| 5 | {EDTRB,GI} => {DM} | 0.178 | 0.5494 | 1.0099 |
| 6 | {FR} => {DM} | 0.360 | 0.5471 | 1.0057 |
| 7 | {EDTRB,GI,PI} => {DM} | 0.152 | 0.5390 | 0.9908 |
| 8 | {FR,PI,RE} => {DM} | 0.152 | 0.5390 | 0.9908 |
| 9 | {FR,PI} => {DM} | 0.312 | 0.5361 | 0.9854 |
| 10 | {RE} => {DM} | 0.248 | 0.5345 | 0.9825 |
Based on the association rule mining analysis of Indonesian doctor-patient consultations, several important patterns emerge:
Frequent Function Combinations: The analysis reveals which medical interview functions commonly occur together in consultations, indicating standard consultation workflows used by Indonesian doctors.
Strong Associations: Rules with high lift values indicate functions that are strongly associated, suggesting that when doctors employ certain consultation strategies, they are highly likely to use specific complementary functions.
Consultation Patterns: The transaction length distribution provides insights into the comprehensiveness of consultations, showing how many different functions doctors typically employ.
The discovered association rules have several practical implications:
This study demonstrates the application of association rule mining to analyze patterns in Indonesian doctor-patient medical consultations. The Apriori algorithm successfully identified frequent itemsets and meaningful association rules among six medical interview functions. The analysis reveals:
| Rule | Support | Confidence | Lift | Count | |
|---|---|---|---|---|---|
| 1 | {EDTRB} => {DM} | 0.232 | 0.5714 | 1.0504 | 116 |
| 2 | {EDTRB} => {FR} | 0.272 | 0.6700 | 1.0182 | 136 |
| 3 | {EDTRB} => {GI} | 0.324 | 0.7980 | 1.0076 | 162 |
| 4 | {EDTRB} => {PI} | 0.352 | 0.8670 | 0.9852 | 176 |
| 5 | {RE} => {DM} | 0.248 | 0.5345 | 0.9825 | 124 |
| 6 | {RE} => {FR} | 0.316 | 0.6810 | 1.0350 | 158 |
| 7 | {RE} => {GI} | 0.362 | 0.7802 | 0.9851 | 181 |
| 8 | {RE} => {PI} | 0.414 | 0.8922 | 1.0139 | 207 |
| 9 | {DM} => {FR} | 0.360 | 0.6618 | 1.0057 | 180 |
| 10 | {FR} => {DM} | 0.360 | 0.5471 | 1.0057 | 180 |
| 11 | {DM} => {GI} | 0.418 | 0.7684 | 0.9702 | 209 |
| 12 | {GI} => {DM} | 0.418 | 0.5278 | 0.9702 | 209 |
| 13 | {DM} => {PI} | 0.470 | 0.8640 | 0.9818 | 235 |
| 14 | {PI} => {DM} | 0.470 | 0.5341 | 0.9818 | 235 |
| 15 | {FR} => {GI} | 0.516 | 0.7842 | 0.9901 | 258 |
| 16 | {GI} => {FR} | 0.516 | 0.6515 | 0.9901 | 258 |
| 17 | {FR} => {PI} | 0.582 | 0.8845 | 1.0051 | 291 |
| 18 | {PI} => {FR} | 0.582 | 0.6614 | 1.0051 | 291 |
| 19 | {GI} => {PI} | 0.696 | 0.8788 | 0.9986 | 348 |
| 20 | {PI} => {GI} | 0.696 | 0.7909 | 0.9986 | 348 |
| 21 | {EDTRB,RE} => {GI} | 0.154 | 0.8105 | 1.0234 | 77 |
| 22 | {EDTRB,RE} => {PI} | 0.174 | 0.9158 | 1.0407 | 87 |
| 23 | {DM,EDTRB} => {PI} | 0.202 | 0.8707 | 0.9894 | 101 |
| 24 | {EDTRB,PI} => {DM} | 0.202 | 0.5739 | 1.0549 | 101 |
| 25 | {EDTRB,PI} => {FR} | 0.238 | 0.6761 | 1.0276 | 119 |
| 26 | {EDTRB,PI} => {GI} | 0.282 | 0.8011 | 1.0115 | 141 |
| 27 | {DM,RE} => {FR} | 0.176 | 0.7097 | 1.0785 | 88 |
| 28 | {FR,RE} => {DM} | 0.176 | 0.5570 | 1.0238 | 88 |
| 29 | {FR,RE} => {GI} | 0.250 | 0.7911 | 0.9989 | 125 |
| 30 | {GI,RE} => {FR} | 0.250 | 0.6906 | 1.0496 | 125 |
| 31 | {FR,RE} => {PI} | 0.282 | 0.8924 | 1.0141 | 141 |
| 32 | {PI,RE} => {FR} | 0.282 | 0.6812 | 1.0352 | 141 |
| 33 | {DM,PI} => {FR} | 0.312 | 0.6638 | 1.0089 | 156 |
| 34 | {FR,GI} => {PI} | 0.458 | 0.8876 | 1.0086 | 229 |
| 35 | {DM,PI,RE} => {FR} | 0.152 | 0.7103 | 1.0795 | 76 |
| 36 | {GI,PI,RE} => {FR} | 0.220 | 0.6918 | 1.0514 | 110 |