This dataset contains information on patients with genetic disorders, including details on gene inheritance, blood test results, and related clinical fields.
Data Source: https://www.kaggle.com/datasets/eftekheraliefte/clean-genetic-dataset
The dataset was uploaded to Kaggle on January 18, 2025.
This report applies association rule mining to these variables to find out whether there is any potential connection between the genetic disorders and the other factors.
Because this dataset consists of medical records and this is not biological or medical research, we can only provide insights into potential associations. The results of this report cannot prove any causal relationship between gene inheritance, blood test results or the other factors and the genetic disorders.
The initial hypothesis is that some of the categories listed in the dataset are associated with the disorder. Eclat on its own only returns frequent itemsets, so the primary method is Apriori, which produces rules directly.
However, we will also apply a second method as a reference to compare the results. The report labels this step FP-Growth, although the code implements it with eclat() followed by ruleInduction() and never calls an FP-Growth routine.
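To make the three rule measures used throughout this report concrete, here is a minimal base-R sketch on five made-up transactions. The items and the helper `supp()` are hypothetical and not taken from the dataset. Support is the share of transactions containing an itemset, confidence is support(lhs and rhs together) divided by support(lhs), and lift is confidence divided by support(rhs).

```r
# Toy transactions (illustrative only, not from the Kaggle data)
toy <- list(
  c("Maternal_Yes", "Symptom_3", "Disorder_A"),
  c("Maternal_Yes", "Symptom_3", "Disorder_A"),
  c("Maternal_Yes", "Symptom_2"),
  c("Maternal_No",  "Symptom_3", "Disorder_A"),
  c("Maternal_No",  "Symptom_2")
)

# Hypothetical helper: share of transactions that contain every item in `items`
supp <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

lhs <- "Symptom_3"
rhs <- "Disorder_A"

support    <- supp(c(lhs, rhs), toy)       # 3 of 5 transactions contain both
confidence <- support / supp(lhs, toy)     # share of Symptom_3 transactions with Disorder_A
lift       <- confidence / supp(rhs, toy)  # confidence relative to the base rate of rhs

c(support = support, confidence = confidence, lift = lift)
```

A lift near 1 means the antecedent barely changes how often the consequent occurs, which is worth keeping in mind when reading the rule tables later in the report.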
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("dplyr")
install.packages("arules")
install.packages("arulesSequences")
install.packages("arulesViz")
install.packages("ggplot2")
install.packages("moments")
install.packages("stringr")
library(arules)
library(arulesViz)
library(dplyr)
library(ggplot2)
library(stringr)
library(moments)
gene_data <- read.csv('clean_train_data.csv')
summary(gene_data)
## Patient.Age Genes.in.mother.s.side Inherited.from.father
## Min. : 0.000 Length:20745 Length:20745
## 1st Qu.: 3.000 Class :character Class :character
## Median : 7.000 Mode :character Mode :character
## Mean : 6.974
## 3rd Qu.:10.000
## Max. :14.000
## Maternal.gene Paternal.gene Status
## Length:20745 Length:20745 Length:20745
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Respiratory.Rate..breaths.min. Heart.Rate..rates.min Follow.up
## Length:20745 Length:20745 Length:20745
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Gender Birth.asphyxia
## Length:20745 Length:20745
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Autopsy.shows.birth.defect..if.applicable. Place.of.birth
## Length:20745 Length:20745
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Folic.acid.details..peri.conceptional. H.O.serious.maternal.illness
## Length:20745 Length:20745
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## H.O.radiation.exposure..x.ray. H.O.substance.abuse Assisted.conception.IVF.ART
## Length:20745 Length:20745 Length:20745
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## History.of.anomalies.in.previous.pregnancies No..of.previous.abortion
## Length:20745 Min. :0
## Class :character 1st Qu.:1
## Mode :character Median :2
## Mean :2
## 3rd Qu.:3
## Max. :4
## Birth.defects Blood.test.result Genetic.Disorder Disorder.Subclass
## Length:20745 Length:20745 Length:20745 Length:20745
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Symptom.Count Total.Blood.Cell.Count Combined_disorder
## Min. :0.000 Min. : 7.302 Length:20745
## 1st Qu.:2.000 1st Qu.:10.560 Class :character
## Median :3.000 Median :12.384 Mode :character
## Mean :2.648 Mean :12.385
## 3rd Qu.:4.000 3rd Qu.:14.183
## Max. :5.000 Max. :17.536
# Check for missing values in the entire dataset
sum(is.na(gene_data))
## [1] 0
# Check values stored as "Missing"
missing_text_total <- sum(gene_data == "Missing", na.rm = TRUE)
print(paste("Total cells with the literal 'Missing':", missing_text_total))
## [1] "Total cells with the literal 'Missing': 53262"
missing_by_col <- sapply(gene_data, function(col) sum(as.character(col) == "Missing", na.rm = TRUE))
# Count total number of entries in each column (including "Missing" strings)
total_entries_by_col <- sapply(gene_data, length)
# For clarity, combine results into a data frame
results <- data.frame(Column = names(gene_data), MissingCount = missing_by_col, TotalCount = total_entries_by_col)
# Columns Missing Count Total Count
results
## Column
## Patient.Age Patient.Age
## Genes.in.mother.s.side Genes.in.mother.s.side
## Inherited.from.father Inherited.from.father
## Maternal.gene Maternal.gene
## Paternal.gene Paternal.gene
## Status Status
## Respiratory.Rate..breaths.min. Respiratory.Rate..breaths.min.
## Heart.Rate..rates.min Heart.Rate..rates.min
## Follow.up Follow.up
## Gender Gender
## Birth.asphyxia Birth.asphyxia
## Autopsy.shows.birth.defect..if.applicable. Autopsy.shows.birth.defect..if.applicable.
## Place.of.birth Place.of.birth
## Folic.acid.details..peri.conceptional. Folic.acid.details..peri.conceptional.
## H.O.serious.maternal.illness H.O.serious.maternal.illness
## H.O.radiation.exposure..x.ray. H.O.radiation.exposure..x.ray.
## H.O.substance.abuse H.O.substance.abuse
## Assisted.conception.IVF.ART Assisted.conception.IVF.ART
## History.of.anomalies.in.previous.pregnancies History.of.anomalies.in.previous.pregnancies
## No..of.previous.abortion No..of.previous.abortion
## Birth.defects Birth.defects
## Blood.test.result Blood.test.result
## Genetic.Disorder Genetic.Disorder
## Disorder.Subclass Disorder.Subclass
## Symptom.Count Symptom.Count
## Total.Blood.Cell.Count Total.Blood.Cell.Count
## Combined_disorder Combined_disorder
## MissingCount TotalCount
## Patient.Age 0 20745
## Genes.in.mother.s.side 0 20745
## Inherited.from.father 0 20745
## Maternal.gene 0 20745
## Paternal.gene 0 20745
## Status 0 20745
## Respiratory.Rate..breaths.min. 0 20745
## Heart.Rate..rates.min 0 20745
## Follow.up 0 20745
## Gender 7400 20745
## Birth.asphyxia 10629 20745
## Autopsy.shows.birth.defect..if.applicable. 14558 20745
## Place.of.birth 0 20745
## Folic.acid.details..peri.conceptional. 0 20745
## H.O.serious.maternal.illness 0 20745
## H.O.radiation.exposure..x.ray. 10568 20745
## H.O.substance.abuse 10107 20745
## Assisted.conception.IVF.ART 0 20745
## History.of.anomalies.in.previous.pregnancies 0 20745
## No..of.previous.abortion 0 20745
## Birth.defects 0 20745
## Blood.test.result 0 20745
## Genetic.Disorder 0 20745
## Disorder.Subclass 0 20745
## Symptom.Count 0 20745
## Total.Blood.Cell.Count 0 20745
## Combined_disorder 0 20745
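The counts above are easier to compare as shares. The following is a minimal sketch on a made-up data frame (the values and the 0.5 threshold are assumptions for illustration) that computes the proportion of literal "Missing" cells per column and lists the columns at or above the threshold.

```r
# Toy data frame standing in for gene_data (values are made up)
toy <- data.frame(
  Gender         = c("Male", "Missing", "Female", "Missing"),
  Status         = c("Alive", "Alive", "Deceased", "Alive"),
  Birth.asphyxia = c("Missing", "Missing", "Yes", "Missing"),
  stringsAsFactors = FALSE
)

# Share of cells per column that hold the literal string "Missing"
missing_share <- colMeans(toy == "Missing")
missing_share

# Columns whose missing share reaches the chosen threshold (0.5 is an assumption)
flagged <- names(missing_share)[missing_share >= 0.5]
flagged
```

Whether a flagged column is dropped or its incomplete rows are removed remains a judgment call, as the handling of Gender versus the other columns in this report shows.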
# Autopsy.shows.birth.defect..if.applicable. depends on Status: when the patient is alive, the autopsy field is "Missing", which is expected and would not affect the result
# However, the column cannot tell us whether a living patient has a birth defect, so it gives little insight and there is no need to keep it
# Remove Autopsy.shows.birth.defect..if.applicable.
gene_data$Autopsy.shows.birth.defect..if.applicable. <- NULL
# Our main focus is the association analysis for Genetic.Disorder, Disorder.Subclass and Combined_disorder. Genetic disorders could potentially be related to gender, hence we drop the rows without any gender data
# Remove rows where Gender equals the literal string "Missing"
gene_data <- gene_data[ as.character(gene_data$Gender) != "Missing", ]
# Around 50% of the values are still missing for Birth.asphyxia, H.O.radiation.exposure..x.ray. and H.O.substance.abuse, so we delete these columns because the share of missing values is too large
gene_data[c("Birth.asphyxia","H.O.substance.abuse","H.O.radiation.exposure..x.ray.")] <- NULL
# Recalculate missing counts for each column after removal
missing_by_col <- sapply(gene_data, function(col) sum(as.character(col) == "Missing", na.rm = TRUE))
total_entries_by_col <- sapply(gene_data, length)
# Recreate the results data frame with updated counts
results <- data.frame(Column = names(gene_data), MissingCount = missing_by_col, TotalCount = total_entries_by_col)
results
## Column
## Patient.Age Patient.Age
## Genes.in.mother.s.side Genes.in.mother.s.side
## Inherited.from.father Inherited.from.father
## Maternal.gene Maternal.gene
## Paternal.gene Paternal.gene
## Status Status
## Respiratory.Rate..breaths.min. Respiratory.Rate..breaths.min.
## Heart.Rate..rates.min Heart.Rate..rates.min
## Follow.up Follow.up
## Gender Gender
## Place.of.birth Place.of.birth
## Folic.acid.details..peri.conceptional. Folic.acid.details..peri.conceptional.
## H.O.serious.maternal.illness H.O.serious.maternal.illness
## Assisted.conception.IVF.ART Assisted.conception.IVF.ART
## History.of.anomalies.in.previous.pregnancies History.of.anomalies.in.previous.pregnancies
## No..of.previous.abortion No..of.previous.abortion
## Birth.defects Birth.defects
## Blood.test.result Blood.test.result
## Genetic.Disorder Genetic.Disorder
## Disorder.Subclass Disorder.Subclass
## Symptom.Count Symptom.Count
## Total.Blood.Cell.Count Total.Blood.Cell.Count
## Combined_disorder Combined_disorder
## MissingCount TotalCount
## Patient.Age 0 13345
## Genes.in.mother.s.side 0 13345
## Inherited.from.father 0 13345
## Maternal.gene 0 13345
## Paternal.gene 0 13345
## Status 0 13345
## Respiratory.Rate..breaths.min. 0 13345
## Heart.Rate..rates.min 0 13345
## Follow.up 0 13345
## Gender 0 13345
## Place.of.birth 0 13345
## Folic.acid.details..peri.conceptional. 0 13345
## H.O.serious.maternal.illness 0 13345
## Assisted.conception.IVF.ART 0 13345
## History.of.anomalies.in.previous.pregnancies 0 13345
## No..of.previous.abortion 0 13345
## Birth.defects 0 13345
## Blood.test.result 0 13345
## Genetic.Disorder 0 13345
## Disorder.Subclass 0 13345
## Symptom.Count 0 13345
## Total.Blood.Cell.Count 0 13345
## Combined_disorder 0 13345
# Get unique values for each column
unique_values <- lapply(gene_data, unique)
length(unique_values)
## [1] 23
# We are defining the bins as below:
gene_data$Patient.Age <- cut(gene_data$Patient.Age,
                             breaks = c(-Inf, 2, 6, 10, 14, Inf),
                             labels = c("Infant", "Early Childhood", "Middle Childhood", "Early Teens", "Other"),
                             right = TRUE)
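With `right = TRUE` each interval is closed on the right, so a boundary age belongs to the lower group. The short sketch below applies the same breaks and labels to a few hand-picked ages (the vector `ages` is illustrative) to show where the boundaries fall.

```r
# Hand-picked ages, including every boundary value used in the breaks
ages <- c(0, 2, 3, 6, 7, 10, 11, 14)

age_group <- cut(ages,
                 breaks = c(-Inf, 2, 6, 10, 14, Inf),
                 labels = c("Infant", "Early Childhood", "Middle Childhood",
                            "Early Teens", "Other"),
                 right = TRUE)

# Ages 2, 6, 10 and 14 fall into the group that ends at that value
data.frame(age = ages, group = age_group)
```

Since the maximum Patient.Age in the summary is 14, the "Other" level stays empty in this dataset.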
# $No..of.previous.abortion: 2 4 0 3 1. Since there are only 5 unique values, we will not bin this column and keep each value as its own category
# $Symptom.Count: 5 4 3 1 2 0. Since there are only 6 unique values, we will not bin this column and keep each value as its own category
# The dataset has Genetic.Disorder, Disorder.Subclass and Combined_disorder. Combined_disorder is the combination of Genetic.Disorder and Disorder.Subclass, hence we only need to keep Combined_disorder, which carries the information of both
gene_data[c("Genetic.Disorder","Disorder.Subclass")] <- NULL
# Shorten the Combined_disorder values as follows:
gene_data$Combined_disorder <- str_replace_all(gene_data$Combined_disorder, c(
"Mitochondrial_genetic_inheritance_disorders_Leber's_hereditary_optic_neuropathy" = "Mito_LHON",
"Mitochondrial_genetic_inheritance_disorders_Leigh_syndrome" = "Mito_Leigh",
"Mitochondrial_genetic_inheritance_disorders_Mitochondrial_myopathy" = "Mito_Myopathy",
"Multifactorial_genetic_inheritance_disorders_Alzheimer's" = "Multi_Alzheimer",
"Multifactorial_genetic_inheritance_disorders_Cancer" = "Multi_Cancer",
"Multifactorial_genetic_inheritance_disorders_Diabetes" = "Multi_Diabetes",
"Single-gene_inheritance_diseases_Cystic_fibrosis" = "Single_CysticFibrosis",
"Single-gene_inheritance_diseases_Hemochromatosis" = "Single_Hemochromatosis",
"Single-gene_inheritance_diseases_Tay-Sachs" = "Single_TaySachs"
))
# summary(unique_values$Total.Blood.Cell.Count)
# Number of unique values of Total.Blood.Cell.Count (it equals the number of rows, so every value is distinct)
length(unique_values$Total.Blood.Cell.Count)
## [1] 13345
# Check the skewness (symmetry) of Total.Blood.Cell.Count
skewness(gene_data$Total.Blood.Cell.Count)
## [1] 0.009567959
# The skewness is about 0.0096, so Total.Blood.Cell.Count is not skewed; it is approximately symmetric. The maximum value is close to the 3rd quartile, so we consider there to be no outliers
# Total.Blood.Cell.Count has 13345 unique values, so it needs to be converted to categorical bins
# We are defining the bins as below:
gene_data$Total.Blood.Cell.Count <- cut(
  gene_data$Total.Blood.Cell.Count,
  breaks = c(-Inf, 10.544, 12.379, 14.162, Inf),
  labels = c("Blood Low", "Blood Moderate", "Blood High", "Blood Very High"),
  right = FALSE
)
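The cut points 10.544, 12.379 and 14.162 are close to the quartiles printed in the summary, so they were presumably derived from quartiles of the filtered data; the report does not say so explicitly. Under that assumption, a sketch of deriving such breaks with `quantile()` on simulated values (not the real column) looks like this:

```r
set.seed(1)
# Simulated stand-in for Total.Blood.Cell.Count (mean and sd are assumptions)
x <- rnorm(1000, mean = 12.4, sd = 2)

# Quartile cut points computed from the data instead of hard-coded numbers
q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))

bins <- cut(x,
            breaks = c(-Inf, q, Inf),
            labels = c("Blood Low", "Blood Moderate", "Blood High", "Blood Very High"),
            right = FALSE)

# Each bin should hold roughly a quarter of the values
table(bins)
```

Computing the breaks from the data keeps the four bins balanced if the rows are filtered again, at the cost of cut points that change between runs on different subsets.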
# Iterate over columns and prefix each value with the first token of its column name,
# so that e.g. "Yes" from different columns becomes a distinct item
for (col in colnames(gene_data)) {
  # Columns containing only "Yes" and "No", e.g. Maternal.gene -> Maternal_Yes
  if (all(gene_data[[col]] %in% c("Yes", "No"), na.rm = TRUE)) {
    prefix <- strsplit(col, split = "[ .]")[[1]][1]
    gene_data[[col]] <- paste0(prefix, "_", gene_data[[col]])
  }
  # Numeric columns with integer values, e.g. Symptom.Count -> Symptom_5
  if (is.numeric(gene_data[[col]]) && all(gene_data[[col]] %% 1 == 0, na.rm = TRUE)) {
    prefix <- strsplit(col, split = "[ .]")[[1]][1]
    gene_data[[col]] <- paste0(prefix, "_", gene_data[[col]])
  }
}
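The prefix used in the loop is the part of the column name before the first space or dot. The sketch below applies the same `strsplit()` expression to four of the report's column names to show which prefixes result.

```r
cols <- c("Genes.in.mother.s.side",
          "H.O.serious.maternal.illness",
          "No..of.previous.abortion",
          "Symptom.Count")

# Same expression as in the loop: first token before a space or a dot
prefixes <- vapply(cols,
                   function(col) strsplit(col, split = "[ .]")[[1]][1],
                   character(1))
prefixes

# Example of a prefixed value as it later appears as an item
paste0(prefixes[["Genes.in.mother.s.side"]], "_", "Yes")
```

Some prefixes are terse (`H` for H.O.serious.maternal.illness, `No` for No..of.previous.abortion), which explains items such as `H_Yes` and `No_4` in the rule tables.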
# Preview the modified dataset
head(gene_data)
## Patient.Age Genes.in.mother.s.side Inherited.from.father Maternal.gene
## 2 Early Teens Genes_Yes Inherited_No Maternal_Yes
## 3 Middle Childhood Genes_Yes Inherited_Yes Maternal_Yes
## 4 Middle Childhood Genes_Yes Inherited_No Maternal_Yes
## 5 Early Teens Genes_Yes Inherited_Yes Maternal_Yes
## 6 Infant Genes_No Inherited_Yes Maternal_Yes
## 7 Infant Genes_No Inherited_Yes Maternal_Yes
## Paternal.gene Status Respiratory.Rate..breaths.min. Heart.Rate..rates.min
## 2 Paternal_No Alive Normal (30-60) Tachycardia
## 3 Paternal_Yes Deceased Normal (30-60) Tachycardia
## 4 Paternal_Yes Deceased Tachypnea Normal
## 5 Paternal_Yes Alive Normal (30-60) Tachycardia
## 6 Paternal_No Alive Tachypnea Tachycardia
## 7 Paternal_No Alive Normal (30-60) Tachycardia
## Follow.up Gender Place.of.birth Folic.acid.details..peri.conceptional.
## 2 Low Male Institute Folic_Yes
## 3 Low Female Home Folic_Yes
## 4 Low Male Institute Folic_Yes
## 5 High Female Home Folic_Yes
## 6 High Female Home Folic_Yes
## 7 Low Male Home Folic_No
## H.O.serious.maternal.illness Assisted.conception.IVF.ART
## 2 H_Yes Assisted_Yes
## 3 H_No Assisted_Yes
## 4 H_Yes Assisted_No
## 5 H_No Assisted_No
## 6 H_Yes Assisted_Yes
## 7 H_No Assisted_Yes
## History.of.anomalies.in.previous.pregnancies No..of.previous.abortion
## 2 History_No No_4
## 3 History_No No_0
## 4 History_Yes No_3
## 5 History_No No_4
## 6 History_No No_4
## 7 History_No No_4
## Birth.defects Blood.test.result Symptom.Count Total.Blood.Cell.Count
## 2 Multiple slightly abnormal Symptom_5 Blood High
## 3 Singular slightly abnormal Symptom_4 Blood Low
## 4 Multiple normal Symptom_4 Blood High
## 5 Singular slightly abnormal Symptom_4 Blood Low
## 6 Singular abnormal Symptom_5 Blood Low
## 7 Multiple abnormal Symptom_5 Blood Very High
## Combined_disorder
## 2 Mito_LHON
## 3 Mito_LHON
## 4 Mito_LHON
## 5 Mito_LHON
## 6 Mito_LHON
## 7 Mito_LHON
# Dimension of the dataset
dim(gene_data)
## [1] 13345 21
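# Uncomment the next line to (re)create modified_gene_data.csv, which is read back with read.transactions() further below
#write.csv(gene_data, file="modified_gene_data.csv", quote = TRUE, row.names = FALSE)
# The first row shows the categories present in each column: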
gene_data[1,]
## Patient.Age Genes.in.mother.s.side Inherited.from.father Maternal.gene
## 2 Early Teens Genes_Yes Inherited_No Maternal_Yes
## Paternal.gene Status Respiratory.Rate..breaths.min. Heart.Rate..rates.min
## 2 Paternal_No Alive Normal (30-60) Tachycardia
## Follow.up Gender Place.of.birth Folic.acid.details..peri.conceptional.
## 2 Low Male Institute Folic_Yes
## H.O.serious.maternal.illness Assisted.conception.IVF.ART
## 2 H_Yes Assisted_Yes
## History.of.anomalies.in.previous.pregnancies No..of.previous.abortion
## 2 History_No No_4
## Birth.defects Blood.test.result Symptom.Count Total.Blood.Cell.Count
## 2 Multiple slightly abnormal Symptom_5 Blood High
## Combined_disorder
## 2 Mito_LHON
# Factors such as Patient.Age, Genes.in.mother.s.side and Inherited.from.father are most likely the antecedents.
# Combined_disorder is the potential consequent, as a result of the expression of certain genes and some other factors
dim(gene_data)
## [1] 13345 21
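# Read the pre-processed data
# Note: with skip = 0 the CSV header line is read as a transaction. This is why the summary reports 13346 transactions (13345 patients plus the header row) and why column names such as Birth.defects appear as items with a count of 1; skip = 1 would exclude the header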
trans1 <- read.transactions("modified_gene_data.csv", format="basket", sep=",", skip=0)
summary(trans1)
## transactions as itemMatrix in sparse format with
## 13346 rows (elements/itemsets/transactions) and
## 83 columns (items) and a density of 0.253012
##
## most frequent items:
## Inherited_No Genes_Yes Paternal_No Maternal_Yes Folic_Yes (Other)
## 8086 7926 7555 7390 7335 241974
##
## element (itemset/transaction) length distribution:
## sizes
## 21
## 13346
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21 21 21 21 21 21
##
## includes extended item information - examples:
## labels
## 1 abnormal
## 2 Alive
## 3 Assisted_No
inspect(head(trans1))
## items
## [1] {Assisted.conception.IVF.ART,
## Birth.defects,
## Blood.test.result,
## Combined_disorder,
## Folic.acid.details..peri.conceptional.,
## Follow.up,
## Gender,
## Genes.in.mother.s.side,
## H.O.serious.maternal.illness,
## Heart.Rate..rates.min,
## History.of.anomalies.in.previous.pregnancies,
## Inherited.from.father,
## Maternal.gene,
## No..of.previous.abortion,
## Paternal.gene,
## Patient.Age,
## Place.of.birth,
## Respiratory.Rate..breaths.min.,
## Status,
## Symptom.Count,
## Total.Blood.Cell.Count}
## [2] {Alive,
## Assisted_Yes,
## Blood High,
## Early Teens,
## Folic_Yes,
## Genes_Yes,
## H_Yes,
## History_No,
## Inherited_No,
## Institute,
## Low,
## Male,
## Maternal_Yes,
## Mito_LHON,
## Multiple,
## No_4,
## Normal (30-60),
## Paternal_No,
## slightly abnormal,
## Symptom_5,
## Tachycardia}
## [3] {Assisted_Yes,
## Blood Low,
## Deceased,
## Female,
## Folic_Yes,
## Genes_Yes,
## H_No,
## History_No,
## Home,
## Inherited_Yes,
## Low,
## Maternal_Yes,
## Middle Childhood,
## Mito_LHON,
## No_0,
## Normal (30-60),
## Paternal_Yes,
## Singular,
## slightly abnormal,
## Symptom_4,
## Tachycardia}
## [4] {Assisted_No,
## Blood High,
## Deceased,
## Folic_Yes,
## Genes_Yes,
## H_Yes,
## History_Yes,
## Inherited_No,
## Institute,
## Low,
## Male,
## Maternal_Yes,
## Middle Childhood,
## Mito_LHON,
## Multiple,
## No_3,
## normal,
## Normal,
## Paternal_Yes,
## Symptom_4,
## Tachypnea}
## [5] {Alive,
## Assisted_No,
## Blood Low,
## Early Teens,
## Female,
## Folic_Yes,
## Genes_Yes,
## H_No,
## High,
## History_No,
## Home,
## Inherited_Yes,
## Maternal_Yes,
## Mito_LHON,
## No_4,
## Normal (30-60),
## Paternal_Yes,
## Singular,
## slightly abnormal,
## Symptom_4,
## Tachycardia}
## [6] {abnormal,
## Alive,
## Assisted_Yes,
## Blood Low,
## Female,
## Folic_Yes,
## Genes_No,
## H_Yes,
## High,
## History_No,
## Home,
## Infant,
## Inherited_Yes,
## Maternal_Yes,
## Mito_LHON,
## No_4,
## Paternal_No,
## Singular,
## Symptom_5,
## Tachycardia,
## Tachypnea}
length(trans1)
## [1] 13346
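The transaction count is one higher than the 13345 rows of `gene_data` because the header line of the CSV is read as a transaction. A minimal sketch on a made-up three-row file (the data frame `toy` and the temporary path are illustrative) shows the difference between `skip = 0` and `skip = 1`:

```r
library(arules)

# Toy stand-in for the exported file (values are made up)
toy <- data.frame(
  Gender = c("Male", "Female", "Male"),
  Status = c("Alive", "Deceased", "Alive"),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, quote = TRUE, row.names = FALSE)

# skip = 0: the header line "Gender","Status" becomes a transaction of its own
with_header <- read.transactions(path, format = "basket", sep = ",", skip = 0)

# skip = 1: the header line is skipped, one transaction per data row remains
without_header <- read.transactions(path, format = "basket", sep = ",", skip = 1)

length(with_header)
length(without_header)
itemLabels(without_header)
```

In this report the extra transaction adds 21 spurious items with a count of 1. They fall below every support threshold used, so the rules are not affected beyond a marginally larger denominator.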
# Simple statistics
head(itemFrequency(trans1, type="relative"))
## abnormal Alive
## 2.310805e-01 5.071182e-01
## Assisted_No Assisted_Yes
## 4.796194e-01 5.203057e-01
## Assisted.conception.IVF.ART Birth.defects
## 7.492882e-05 7.492882e-05
head(itemFrequency(trans1, type="absolute"))
## abnormal Alive
## 3084 6768
## Assisted_No Assisted_Yes
## 6401 6944
## Assisted.conception.IVF.ART Birth.defects
## 1 1
itemFrequencyPlot(trans1, topN = 15)
# visualize the sparse matrix for the first 5 transactions
image(trans1[1:5])
image(sample(trans1, 100))
# Support at 0.1 and confidence at 0.5, minimum length of a rule is 2 elements
gene_rules <- apriori(trans1, parameter = list(support = 0.1, confidence = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1334
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [55 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.04s].
## writing ... [6665 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
gene_rules
## set of 6665 rules
summary(gene_rules)
## set of 6665 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 707 5331 627
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.988 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1000 Min. :0.5000 Min. :0.1511 Min. :0.8670
## 1st Qu.:0.1173 1st Qu.:0.5174 1st Qu.:0.2203 1st Qu.:0.9922
## Median :0.1339 Median :0.5393 Median :0.2476 Median :1.0043
## Mean :0.1418 Mean :0.5458 Mean :0.2605 Mean :1.0073
## 3rd Qu.:0.1514 3rd Qu.:0.5614 3rd Qu.:0.2771 3rd Qu.:1.0176
## Max. :0.3666 Max. :0.6755 Max. :0.6059 Max. :1.7471
## count
## Min. :1335
## 1st Qu.:1565
## Median :1787
## Mean :1893
## 3rd Qu.:2020
## Max. :4893
##
## mining info:
## data ntransactions support confidence
## trans1 13346 0.1 0.5
## call
## apriori(data = trans1, parameter = list(support = 0.1, confidence = 0.5, minlen = 2))
inspect(gene_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {No_3} => {Genes_Yes} 0.1017533 0.5848407 0.1739847 0.9847695 1358
## [2] {No_3} => {Inherited_No} 0.1063989 0.6115418 0.1739847 1.0093540 1420
## [3] {No_1} => {Paternal_No} 0.1013787 0.5708861 0.1775813 1.0084772 1353
## [4] {No_1} => {Genes_Yes} 0.1062491 0.5983122 0.1775813 1.0074533 1418
## [5] {No_1} => {Inherited_No} 0.1100704 0.6198312 0.1775813 1.0230358 1469
## [6] {No_4} => {Paternal_No} 0.1024277 0.5669847 0.1806534 1.0015853 1367
## [7] {No_4} => {Genes_Yes} 0.1059493 0.5864786 0.1806534 0.9875276 1414
## [8] {No_4} => {Inherited_No} 0.1054998 0.5839900 0.1806534 0.9638797 1408
## [9] {No_0} => {Folic_Yes} 0.1016784 0.5591265 0.1818522 1.0173282 1357
## [10] {No_0} => {Maternal_Yes} 0.1029522 0.5661310 0.1818522 1.0224066 1374
inspect(sort(gene_rules, by = "lift")[1:5])
## lhs rhs support confidence coverage
## [1] {Mito_Myopathy} => {Symptom_2} 0.1123183 0.5088255 0.2207403
## [2] {Mito_Myopathy} => {Maternal_No} 0.1256556 0.5692464 0.2207403
## [3] {Single_CysticFibrosis} => {Maternal_Yes} 0.1411659 0.6755109 0.2089765
## [4] {Genes_No, H_No} => {Maternal_No} 0.1061741 0.5311094 0.1999101
## [5] {Single_CysticFibrosis} => {Tachycardia} 0.1154653 0.5525278 0.2089765
## lift count
## [1] 1.747051 1499
## [2] 1.275762 1677
## [3] 1.219942 1884
## [4] 1.190292 1417
## [5] 1.188594 1541
inspect(sort(gene_rules, by = "confidence")[1:5])
## lhs rhs support
## [1] {Single_CysticFibrosis} => {Maternal_Yes} 0.1411659
## [2] {Normal, Paternal_No, Singular} => {Inherited_No} 0.1040012
## [3] {Paternal_No, Symptom_3} => {Inherited_No} 0.1020530
## [4] {History_Yes, Institute, Paternal_No} => {Inherited_No} 0.1084220
## [5] {Institute, Normal, Paternal_No} => {Inherited_No} 0.1081223
## confidence coverage lift count
## [1] 0.6755109 0.2089765 1.219942 1884
## [2] 0.6754258 0.1539787 1.114795 1388
## [3] 0.6712666 0.1520306 1.107930 1362
## [4] 0.6702177 0.1617713 1.106199 1447
## [5] 0.6689847 0.1616215 1.104164 1443
inspect(sort(gene_rules, by = "support")[1:5])
## lhs rhs support confidence coverage lift
## [1] {Paternal_No} => {Inherited_No} 0.3666267 0.6476506 0.5660872 1.0689518
## [2] {Inherited_No} => {Paternal_No} 0.3666267 0.6051200 0.6058744 1.0689518
## [3] {Genes_Yes} => {Inherited_No} 0.3589840 0.6044663 0.5938858 0.9976759
## [4] {Inherited_No} => {Genes_Yes} 0.3589840 0.5925056 0.6058744 0.9976759
## [5] {Maternal_Yes} => {Genes_Yes} 0.3556871 0.6423545 0.5537240 1.0816129
## count
## [1] 4893
## [2] 4893
## [3] 4791
## [4] 4791
## [5] 4747
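The quality measures in these tables are linked by simple ratios: confidence = support / coverage and lift = confidence / support(rhs). As a consistency check, the sketch below recomputes both for {Paternal_No} => {Inherited_No} from numbers copied out of the output above; the support of {Inherited_No} is taken from the coverage of the reverse rule.

```r
# Values copied from the inspect() output for {Paternal_No} => {Inherited_No}
support_rule <- 0.3666267  # share of transactions with both items
coverage_lhs <- 0.5660872  # support of {Paternal_No}
support_rhs  <- 0.6058744  # support of {Inherited_No} (coverage of the reverse rule)

confidence <- support_rule / coverage_lhs
lift       <- confidence / support_rhs

round(c(confidence = confidence, lift = lift), 4)
```

Up to rounding this reproduces the reported confidence of 0.6476506 and lift of 1.0689518. A lift this close to 1 indicates that the highest-support rules mostly reflect items that are frequent on their own.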
plot(gene_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
# Rules for Combined_disorder
# Mito_LHON
Mito_LHON_rule <- apriori(data=trans1, parameter=list(supp=0.001,conf = 0.6),
appearance=list(default="lhs", rhs="Mito_LHON"), control=list(verbose=F))
Mito_LHON_rule_byconf <- sort(Mito_LHON_rule, by="confidence", decreasing=TRUE)
inspect(head(Mito_LHON_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Alive,
## Infant,
## Male,
## Multiple,
## Paternal_Yes,
## Symptom_5} => {Mito_LHON} 0.001123932 0.8823529 0.001273790 31.74092 15
## [2] {Folic_Yes,
## H_Yes,
## Home,
## Infant,
## Paternal_Yes,
## Symptom_5} => {Mito_LHON} 0.001123932 0.7500000 0.001498576 26.97978 15
## [3] {Assisted_No,
## H_Yes,
## Maternal_Yes,
## No_2,
## Symptom_5,
## Tachycardia} => {Mito_LHON} 0.001123932 0.7500000 0.001498576 26.97978 15
## [4] {Folic_Yes,
## Home,
## Male,
## Paternal_Yes,
## slightly abnormal,
## Symptom_5} => {Mito_LHON} 0.001123932 0.6818182 0.001648434 24.52708 15
## [5] {Home,
## Male,
## Normal,
## Paternal_Yes,
## slightly abnormal,
## Symptom_5} => {Mito_LHON} 0.001049003 0.6666667 0.001573505 23.98203 14
## [6] {Alive,
## Genes_Yes,
## Infant,
## Multiple,
## Paternal_Yes,
## Symptom_5} => {Mito_LHON} 0.001123932 0.6521739 0.001723363 23.46068 15
plot(Mito_LHON_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(Mito_LHON_rule, method = "paracoord", control = list(reorder = TRUE))
# There are 271 rules for Mito_LHON. Support and confidence levels of 0.01 were tested first, but they returned only a limited number of rules, hence a lower support level of 0.001 is used here
# Even though the support level is quite low, we can find some interesting insights. The maximum confidence observed is 0.8823529: the combination of Alive, Infant, Male, Multiple, Paternal_Yes, Symptom_5 together with Mito_LHON occurs in only 0.1123932% of the transactions (15 records), yet 88.2% of the records matching the antecedent have Mito_LHON
# The antecedent is rare, but within this sample it is strongly associated with the consequent (lift 31.7). With so few supporting records the rule should be read as tentative. This applies to the other antecedents in the top rules listed above as well
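One way to see how much weight a rule with 15 supporting records can carry is an exact binomial confidence interval for its confidence. This is an illustrative check that is not part of the original analysis, and it ignores the fact that the rule was selected from many candidates, so it is optimistic. The antecedent count of 17 is recovered from the reported coverage.

```r
n_transactions <- 13346
hits  <- 15                                     # count column of the top rule
cover <- round(0.001273790 * n_transactions)    # transactions matching the antecedent

# Exact (Clopper-Pearson) 95% interval for the rule's confidence
ci <- binom.test(hits, cover)$conf.int

hits / cover
ci
```

The interval is wide compared with the point estimate of 0.88, which is the usual price of mining at a support threshold of 0.001.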
# FP-Growth comparison (implemented here with eclat() and ruleInduction())
frequent_itemsets_1 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 13
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing ... [3042465 set(s)] done [0.92s].
## Creating S4 object ... done [0.31s].
Mito_LHON_rhs <- "Mito_LHON"
Mito_LHON_growth <- ruleInduction(frequent_itemsets_1, trans1, confidence = 0.6)
filtered_rules_1 <- subset(Mito_LHON_growth, rhs %in% Mito_LHON_rhs)
filtered_rules_1_byconf <- sort(filtered_rules_1, by="confidence", decreasing=TRUE)
inspect(filtered_rules_1_byconf)
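# This comparison returns no rules. A likely reason is maxlen = 5 in eclat(): itemsets are capped at five items, whereas the Apriori rules shown above have six antecedent items plus the consequent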
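The empty result is plausibly linked to `maxlen`: `eclat()` only returns itemsets up to that size, and `ruleInduction()` can only build rules from the itemsets it is given. A small sketch on made-up transactions (the items `a` to `e` are hypothetical) shows a four-item rule that disappears once `maxlen` is set to 3.

```r
library(arules)

# Toy transactions in which {a, b, c} => {d} always holds
toy <- as(list(c("a", "b", "c", "d"),
               c("a", "b", "c", "d"),
               c("a", "b", "c", "d"),
               c("a", "e"),
               c("b", "e")), "transactions")

sets3 <- eclat(toy, parameter = list(supp = 0.5, maxlen = 3),
               control = list(verbose = FALSE))
sets4 <- eclat(toy, parameter = list(supp = 0.5, maxlen = 4),
               control = list(verbose = FALSE))

rules3 <- ruleInduction(sets3, transactions = toy, confidence = 0.6)
rules4 <- ruleInduction(sets4, transactions = toy, confidence = 0.6)

# Largest rule (lhs + rhs) that each run can produce
max(size(rules3))
max(size(rules4))
```

If the same mechanism is at work in the report, raising `maxlen` in the `eclat()` call to at least 7 would be the first thing to try, with a correspondingly larger number of itemsets to store.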
Mito_Leigh_rule <- apriori(data=trans1, parameter=list(supp=0.02,conf = 0.5),
appearance=list(default="lhs", rhs="Mito_Leigh"), control=list(verbose=F))
Mito_Leigh_rule_byconf <- sort(Mito_Leigh_rule, by="confidence", decreasing=TRUE)
inspect(head(Mito_Leigh_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {inconclusive,
## Singular,
## Symptom_3} => {Mito_Leigh} 0.02075528 0.5918803 0.03506669 2.298294 277
## [2] {inconclusive,
## Paternal_No,
## Symptom_3} => {Mito_Leigh} 0.02412708 0.5908257 0.04083621 2.294198 322
## [3] {inconclusive,
## Inherited_No,
## Symptom_3} => {Mito_Leigh} 0.02450172 0.5902527 0.04151056 2.291973 327
## [4] {inconclusive,
## Normal,
## Symptom_3} => {Mito_Leigh} 0.02187921 0.5770751 0.03791398 2.240804 292
## [5] {Assisted_Yes,
## inconclusive,
## Symptom_3} => {Mito_Leigh} 0.02210400 0.5728155 0.03858834 2.224264 295
## [6] {Assisted_Yes,
## Normal,
## Singular,
## Symptom_3} => {Mito_Leigh} 0.02142964 0.5708583 0.03753934 2.216664 286
plot(Mito_Leigh_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(Mito_Leigh_rule, method = "paracoord", control = list(reorder = TRUE))
# There are 160 rules for Mito_Leigh. The support level was set to 0.02, which is 20 times the level used for Mito_LHON
# Even with such a high support level, many rules (160) still have a confidence above 0.5
# This suggests that Mito_Leigh occurs more frequently among the children in this dataset than rarer disorders such as Mito_LHON
# With the antecedent inconclusive, Singular, Symptom_3, the confidence for the consequent Mito_Leigh is 0.5918803, a moderate indication that the antecedent is associated with the consequent
# FP-Growth comparison (implemented here with eclat() and ruleInduction())
frequent_itemsets_6 <- eclat(trans1, parameter = list(supp = 0.02, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.02 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 266
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing ... [283863 set(s)] done [0.21s].
## Creating S4 object ... done [0.01s].
Mito_Leigh_rhs <- "Mito_Leigh"
Mito_Leigh_growth <- ruleInduction(frequent_itemsets_6, trans1, confidence = 0.5)
filtered_rules_6 <- subset(Mito_Leigh_growth, rhs %in% Mito_Leigh_rhs)
filtered_rules_6_byconf <- sort(filtered_rules_6, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_6_byconf))
## lhs rhs support confidence lift itemset
## [1] {inconclusive,
## Singular,
## Symptom_3} => {Mito_Leigh} 0.02075528 0.5918803 2.298294 58332
## [2] {inconclusive,
## Paternal_No,
## Symptom_3} => {Mito_Leigh} 0.02412708 0.5908257 2.294198 58322
## [3] {inconclusive,
## Inherited_No,
## Symptom_3} => {Mito_Leigh} 0.02450172 0.5902527 2.291973 58320
## [4] {inconclusive,
## Normal,
## Symptom_3} => {Mito_Leigh} 0.02187921 0.5770751 2.240804 58327
## [5] {Assisted_Yes,
## inconclusive,
## Symptom_3} => {Mito_Leigh} 0.02210400 0.5728155 2.224264 58330
## [6] {Assisted_Yes,
## Normal,
## Singular,
## Symptom_3} => {Mito_Leigh} 0.02142964 0.5708583 2.216664 102235
# The eclat()-based comparison (labelled FP-Growth in this report) gives the same top rules as Apriori
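The agreement between the two tables was judged by eye from the first six rules. A programmatic comparison of the full rule sets is sketched below on made-up transactions (the items and thresholds are illustrative); it compares the rule labels produced by both routes with `setequal()`. Setting `minlen = 2` in `apriori()` is an assumption made here to exclude rules with an empty antecedent, which the itemset route is not expected to produce.

```r
library(arules)

# Toy transactions (illustrative only)
toy <- as(list(c("a", "b", "c", "d"),
               c("a", "b", "c", "d"),
               c("a", "b", "c", "d"),
               c("a", "e"),
               c("b", "e")), "transactions")

# Route 1: Apriori restricted to the consequent "d"
rules_apriori <- apriori(toy,
                         parameter = list(supp = 0.5, conf = 0.6, minlen = 2, maxlen = 4),
                         appearance = list(default = "lhs", rhs = "d"),
                         control = list(verbose = FALSE))

# Route 2: Eclat itemsets, then rule induction, then filter on the consequent
itemsets <- eclat(toy, parameter = list(supp = 0.5, maxlen = 4),
                  control = list(verbose = FALSE))
rules_eclat <- subset(ruleInduction(itemsets, transactions = toy, confidence = 0.6),
                      rhs %in% "d")

# TRUE when both routes return exactly the same rules
same_rules <- setequal(labels(rules_apriori), labels(rules_eclat))
same_rules
```

Both routes enumerate the same frequent itemsets and apply the same confidence filter, so matching output is expected whenever the support, confidence and length limits are aligned.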
Mito_Myopathy_rule <- apriori(data=trans1, parameter=list(supp=0.02,conf = 0.5),
appearance=list(default="lhs", rhs="Mito_Myopathy"), control=list(verbose=F))
Mito_Myopathy_rule_byconf <- sort(Mito_Myopathy_rule, by="confidence", decreasing=TRUE)
inspect(head(Mito_Myopathy_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Multiple,
## normal,
## Symptom_2} => {Mito_Myopathy} 0.02217893 0.5481481 0.04046156 2.483226 296
## [2] {Assisted_No,
## Maternal_No,
## Multiple,
## Symptom_2} => {Mito_Myopathy} 0.02030571 0.5452716 0.03723962 2.470195 271
## [3] {Low,
## Maternal_No,
## Multiple,
## Symptom_2} => {Mito_Myopathy} 0.02083021 0.5419103 0.03843848 2.454968 278
## [4] {H_No,
## normal,
## Symptom_2} => {Mito_Myopathy} 0.02083021 0.5387597 0.03866327 2.440695 278
## [5] {Female,
## H_No,
## Multiple,
## Symptom_2} => {Mito_Myopathy} 0.02023078 0.5357143 0.03776412 2.426898 270
## [6] {Assisted_No,
## Maternal_No,
## Normal (30-60),
## Symptom_2} => {Mito_Myopathy} 0.02202907 0.5335753 0.04128578 2.417208 294
plot(Mito_Myopathy_rule)
plot(Mito_Myopathy_rule, method="paracoord", control=list(reorder=TRUE))
# There are 34 rules for Mito_Myopathy. The support level was set to 0.02, which is 20 times the level used for Mito_LHON
# This suggests that Mito_Myopathy occurs more frequently among the children in this dataset than rarer disorders such as Mito_LHON
# With the antecedent Multiple, normal, Symptom_2, the confidence for the consequent Mito_Myopathy is 0.5481481, a moderate indication that the antecedent is associated with the consequent
# FP-Growth comparison (implemented here with eclat() and ruleInduction())
frequent_itemsets_9 <- eclat(trans1, parameter = list(supp = 0.02, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.02 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 266
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing ... [283863 set(s)] done [0.21s].
## Creating S4 object ... done [0.01s].
Mito_Myopathy_rhs <- "Mito_Myopathy"
Mito_Myopathy_growth <- ruleInduction(frequent_itemsets_9, trans1, confidence = 0.5)
filtered_rules_9 <- subset(Mito_Myopathy_growth, rhs %in% Mito_Myopathy_rhs)
filtered_rules_9_byconf <- sort(filtered_rules_9, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_9_byconf))
## lhs rhs support confidence lift itemset
## [1] {Multiple,
## normal,
## Symptom_2} => {Mito_Myopathy} 0.02217893 0.5481481 2.483226 31194
## [2] {Assisted_No,
## Maternal_No,
## Multiple,
## Symptom_2} => {Mito_Myopathy} 0.02030571 0.5452716 2.470195 31867
## [3] {Low,
## Maternal_No,
## Multiple,
## Symptom_2} => {Mito_Myopathy} 0.02083021 0.5419103 2.454968 31880
## [4] {H_No,
## normal,
## Symptom_2} => {Mito_Myopathy} 0.02083021 0.5387597 2.440695 31196
## [5] {Female,
## H_No,
## Multiple,
## Symptom_2} => {Mito_Myopathy} 0.02023078 0.5357143 2.426898 32131
## [6] {Assisted_No,
## Maternal_No,
## Normal (30-60),
## Symptom_2} => {Mito_Myopathy} 0.02202907 0.5335753 2.417208 31864
# FP-Growth gives the same result as Apriori
Multi_Alzheimer_rule <- apriori(data=trans1, parameter=list(supp=0.001,conf = 0.5),
appearance=list(default="lhs", rhs="Multi_Alzheimer"), control=list(verbose=F))
Multi_Alzheimer_rule_byconf <- sort(Multi_Alzheimer_rule, by="confidence", decreasing=TRUE)
inspect(head(Multi_Alzheimer_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Early Childhood,
## Genes_Yes,
## H_No,
## Inherited_Yes,
## Singular,
## Symptom_5} => {Multi_Alzheimer} 0.001049003 0.8235294 0.001273790 106.70702 14
## [2] {Early Childhood,
## H_No,
## Inherited_Yes,
## Singular,
## Symptom_5} => {Multi_Alzheimer} 0.001049003 0.7000000 0.001498576 90.70097 14
## [3] {Early Childhood,
## Genes_Yes,
## H_No,
## Inherited_Yes,
## Symptom_5} => {Multi_Alzheimer} 0.001198861 0.5925926 0.002023078 76.78389 16
## [4] {Early Childhood,
## Genes_Yes,
## History_No,
## Inherited_Yes,
## Symptom_5} => {Multi_Alzheimer} 0.001049003 0.5833333 0.001798292 75.58414 14
## [5] {Early Childhood,
## Genes_Yes,
## Inherited_Yes,
## Singular,
## Symptom_5} => {Multi_Alzheimer} 0.001273790 0.5483871 0.002322793 71.05606 17
## [6] {Genes_Yes,
## History_No,
## Inherited_Yes,
## Paternal_Yes,
## Singular,
## Symptom_5} => {Multi_Alzheimer} 0.001123932 0.5357143 0.002098007 69.41401 15
plot(Multi_Alzheimer_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(Multi_Alzheimer_rule_byconf, method="paracoord", control=list(reorder=TRUE))
# Support level 0.01 and confidence level 0.01 were tested; however, there were 0 rules
# With the current support and confidence levels, there are 7 rules. Similar to Mito_LHON, the support level is low; however, when Early Childhood, Genes_Yes, H_No, Inherited_Yes, Singular, and Symptom_5 occur together, Multi_Alzheimer has a high confidence of 0.8235294
# This indicates that the antecedent items are rare but reliable predictors of the consequent. The top 6 rules are listed in the output
# FP-Growth
frequent_itemsets_7 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 13
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing ... [3042465 set(s)] done [0.92s].
## Creating S4 object ... done [0.17s].
Multi_Alzheimer_rhs <- "Multi_Alzheimer"
Multi_Alzheimer_growth <- ruleInduction(frequent_itemsets_7, trans1, confidence = 0.5)
filtered_rules_7 <- subset(Multi_Alzheimer_growth, rhs %in% Multi_Alzheimer_rhs)
filtered_rules_7_byconf <- sort(filtered_rules_7, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_7_byconf))
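# FP-Growth returned no rules. The eclat() call uses maxlen = 5, so induced rules can have at most 4 antecedent items, whereas the top Apriori rules above have 5 or 6 antecedent items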
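One way to see how a length cap on the frequent itemsets can remove rules is to reproduce the effect on toy data. The sketch below is only an illustration: the five baskets and the thresholds are made up, and only the `maxlen` mechanism mirrors the setup used in this report (where `eclat()` is run with `maxlen = 5` while `apriori()` keeps its default of 10).

```r
library(arules)

# Hypothetical toy baskets (not taken from the genetic dataset)
toy <- as(list(c("a", "b", "c"), c("a", "b"), c("a", "c"),
               c("b", "c"), c("a", "b", "c")), "transactions")

# Apriori with its default maxlen can return rules with 2-item antecedents
rules_ap <- apriori(toy,
                    parameter = list(supp = 0.2, conf = 0.5, minlen = 2),
                    control = list(verbose = FALSE))

# Eclat capped at 2-item itemsets: induced rules have at most 1 antecedent item
sets_ec  <- eclat(toy, parameter = list(supp = 0.2, maxlen = 2),
                  control = list(verbose = FALSE))
rules_ec <- ruleInduction(sets_ec, toy, confidence = 0.5)

print(max(size(lhs(rules_ap))))  # largest antecedent found by Apriori
print(max(size(lhs(rules_ec))))  # largest antecedent after the cap
```

Raising `maxlen` in the report's `eclat()` calls would be the analogous change, though it may be costly: the log shows about 3 million itemsets already at `maxlen = 5`.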
Multi_Cancer_rule <- apriori(data=trans1, parameter=list(supp=0.001,conf = 0.2),
appearance=list(default="lhs", rhs="Multi_Cancer"), control=list(verbose=F))
Multi_Cancer_rule_byconf <- sort(Multi_Cancer_rule, by="confidence", decreasing=TRUE)
inspect(head(Multi_Cancer_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Assisted_No,
## Genes_No,
## Maternal_No,
## Paternal_No,
## Symptom_0} => {Multi_Cancer} 0.001123932 0.4687500 0.002397722 115.85069 15
## [2] {Genes_No,
## Inherited_No,
## Maternal_No,
## Paternal_No,
## Symptom_0} => {Multi_Cancer} 0.001198861 0.4102564 0.002922224 101.39411 16
## [3] {Inherited_No,
## Maternal_No,
## Paternal_No,
## Singular,
## Symptom_0} => {Multi_Cancer} 0.001273790 0.3695652 0.003446726 91.33736 17
## [4] {Assisted_No,
## Deceased,
## Maternal_No,
## Paternal_No,
## Symptom_0} => {Multi_Cancer} 0.001123932 0.3571429 0.003147010 88.26720 15
## [5] {Deceased,
## Inherited_No,
## Maternal_No,
## Paternal_No,
## Symptom_0} => {Multi_Cancer} 0.001123932 0.3571429 0.003147010 88.26720 15
## [6] {Genes_No,
## Maternal_No,
## Paternal_No,
## Symptom_0} => {Multi_Cancer} 0.001423648 0.3518519 0.004046156 86.95953 19
plot(Multi_Cancer_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
# Support level 0.01 and confidence level 0.01 were applied; however, there were 0 rules
# With the current support and confidence levels, there are 92 rules. However, the maximum confidence is only 0.4687500, which is below 0.5
# With such low support and confidence levels, we cannot conclude that there are strong association rules for Multi_Cancer
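To put a number on how uncertain a rule backed by so few cases is, one option (not part of the original analysis) is an exact binomial interval around the rule's confidence. The sketch uses the top Multi_Cancer rule; its antecedent count of 32 is derived from the printed coverage (0.002397722 × 13346), not read from the data.

```r
# Top Multi_Cancer rule: count = 15; antecedent count derived from coverage
n_transactions <- 13346
rule_count <- 15
lhs_count  <- round(0.002397722 * n_transactions)

# Exact (Clopper-Pearson) 95% interval for the rule's confidence
ci <- binom.test(rule_count, lhs_count)$conf.int
print(round(c(confidence = rule_count / lhs_count,
              lower = ci[1], upper = ci[2]), 3))
```

The interval treats this single rule in isolation; with 92 candidate rules, the best-looking one will tend to overstate the true association, so the interval is best read as an optimistic bound.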
# FP-Growth
frequent_itemsets_8 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 13
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing ... [3042465 set(s)] done [0.92s].
## Creating S4 object ... done [0.18s].
Multi_Cancer_rhs <- "Multi_Cancer"
Multi_Cancer_growth <- ruleInduction(frequent_itemsets_8, trans1, confidence = 0.2)
filtered_rules_8 <- subset(Multi_Cancer_growth, rhs %in% Multi_Cancer_rhs)
filtered_rules_8_byconf <- sort(filtered_rules_8, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_8_byconf))
## lhs rhs support confidence lift itemset
## [1] {Genes_No,
## Maternal_No,
## Paternal_No,
## Symptom_0} => {Multi_Cancer} 0.001423648 0.3518519 86.95953 2
## [2] {Assisted_No,
## Genes_No,
## Maternal_No,
## Symptom_0} => {Multi_Cancer} 0.001198861 0.3478261 85.96457 7
## [3] {Genes_No,
## Inherited_No,
## Singular,
## Symptom_0} => {Multi_Cancer} 0.001123932 0.3191489 78.87707 11
## [4] {Genes_No,
## Inherited_No,
## Maternal_No,
## Symptom_0} => {Multi_Cancer} 0.001348719 0.3157895 78.04678 1
## [5] {Genes_No,
## Low,
## Maternal_No,
## Symptom_0} => {Multi_Cancer} 0.001049003 0.3111111 76.89053 4
## [6] {Genes_No,
## Inherited_No,
## Paternal_No,
## Symptom_0} => {Multi_Cancer} 0.001348719 0.2950820 72.92896 16
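# FP-Growth generated different rules with even lower confidence. Because eclat() was run with maxlen = 5, its rules have at most 4 antecedent items, so the 5-item antecedents found by Apriori cannot appear here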
Multi_Diabetes_rule <- apriori(data=trans1, parameter=list(supp=0.001,conf = 0.8),
appearance=list(default="lhs", rhs="Multi_Diabetes"), control=list(verbose=F))
Multi_Diabetes_rule_byconf <- sort(Multi_Diabetes_rule, by="confidence", decreasing=TRUE)
inspect(head(Multi_Diabetes_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Blood Very High,
## Folic_Yes,
## Genes_No,
## H_No,
## Singular,
## Symptom_5} => {Multi_Diabetes} 0.001049003 0.9333333 0.001123932 11.00377 14
## [2] {Blood Very High,
## Genes_No,
## H_No,
## Singular,
## Symptom_5} => {Multi_Diabetes} 0.001273790 0.8947368 0.001423648 10.54873 17
## [3] {abnormal,
## Early Teens,
## Female,
## High,
## Inherited_Yes,
## Symptom_4} => {Multi_Diabetes} 0.001198861 0.8888889 0.001348719 10.47978 16
## [4] {abnormal,
## Folic_Yes,
## H_No,
## History_Yes,
## Singular,
## Symptom_5} => {Multi_Diabetes} 0.001049003 0.8750000 0.001198861 10.31603 14
## [5] {abnormal,
## Blood High,
## Genes_Yes,
## No_2,
## Singular,
## Symptom_4} => {Multi_Diabetes} 0.001049003 0.8750000 0.001198861 10.31603 14
## [6] {abnormal,
## High,
## Institute,
## Maternal_Yes,
## Normal (30-60),
## Symptom_5} => {Multi_Diabetes} 0.001273790 0.8500000 0.001498576 10.02129 17
plot(Multi_Diabetes_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(Multi_Diabetes_rule_byconf, method="paracoord", control=list(reorder=TRUE))
# Support level 0.01 and confidence level 0.01 were applied; however, there were 0 rules
# With the current support and confidence levels, there are 21 rules. Similar to Mito_LHON, the support level is low; however, when Blood Very High, Folic_Yes, Genes_No, H_No, Singular, and Symptom_5 occur together, Multi_Diabetes has a high confidence of 0.9333333
# This indicates that the antecedent items are rare but reliable predictors of the consequent. The top 6 rules are listed in the output
# FP-Growth
frequent_itemsets_2 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 13
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing ... [3042465 set(s)] done [0.92s].
## Creating S4 object ... done [0.17s].
Multi_Diabetes_rhs <- "Multi_Diabetes"
Multi_Diabetes_growth <- ruleInduction(frequent_itemsets_2, trans1, confidence = 0.8)
filtered_rules_2 <- subset(Multi_Diabetes_growth, rhs %in% Multi_Diabetes_rhs)
filtered_rules_2_byconf <- sort(filtered_rules_2, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_2_byconf))
# FP-Growth generated 0 rules
Single_CysticFibrosis_rule <- apriori(data=trans1, parameter=list(supp=0.01,conf = 0.65),
appearance=list(default="lhs", rhs="Single_CysticFibrosis"), control=list(verbose=F))
Single_CysticFibrosis_rule_byconf <- sort(Single_CysticFibrosis_rule, by="confidence", decreasing=TRUE)
inspect(head(Single_CysticFibrosis_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Male,
## Multiple,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.7317073 0.01536041 3.501386 150
## [2] {Male,
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01161397 0.7311321 0.01588491 3.498633 155
## [3] {Alive,
## Low,
## Male,
## Multiple,
## Symptom_4} => {Single_CysticFibrosis} 0.01004046 0.6943005 0.01446126 3.322386 134
## [4] {Low,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01213847 0.6923077 0.01753334 3.312850 162
## [5] {Normal (30-60),
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01011539 0.6887755 0.01468605 3.295948 135
## [6] {Assisted_Yes,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.6880734 0.01633448 3.292588 150
plot(Single_CysticFibrosis_rule)
plot(Single_CysticFibrosis_rule_byconf, method="paracoord", control=list(reorder=TRUE))
# With the current support and confidence levels, there are 40 rules
# When Male, Multiple, slightly abnormal, and Symptom_4 occur together, the consequent Single_CysticFibrosis has a confidence of 0.7317073
# The support level used here (0.01) is much higher than for most other disorders, which indicates that Single_CysticFibrosis occurs more often than those disorders
# The top 6 rules are listed in the output
# FP-Growth
frequent_itemsets_3 <- eclat(trans1, parameter = list(supp = 0.01, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 133
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing ... [873336 set(s)] done [0.43s].
## Creating S4 object ... done [0.05s].
Single_CysticFibrosis_rhs <- "Single_CysticFibrosis"
Single_CysticFibrosis_growth <- ruleInduction(frequent_itemsets_3, trans1, confidence = 0.65)
filtered_rules_3 <- subset(Single_CysticFibrosis_growth, rhs %in% Single_CysticFibrosis_rhs)
filtered_rules_3_byconf <- sort(filtered_rules_3, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_3_byconf))
## lhs rhs support confidence lift itemset
## [1] {Male,
## Multiple,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.7317073 3.501386 154043
## [2] {Male,
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01161397 0.7311321 3.498633 154022
## [3] {Low,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01213847 0.6923077 3.312850 154054
## [4] {Normal (30-60),
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01011539 0.6887755 3.295948 154018
## [5] {Assisted_Yes,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.6880734 3.292588 154055
## [6] {H_Yes,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01161397 0.6858407 3.281904 154056
# FP-Growth generated slightly different rules; the first 2 rules are the same. Apriori's rule [3] has 5 antecedent items and is therefore excluded by maxlen = 5
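The difference between the two top-6 lists for Single_CysticFibrosis can be made explicit with a set comparison. The sketch below hard-codes the antecedents exactly as printed in the two outputs; the variable names are illustrative and do not appear in the original code.

```r
# Antecedents of the top-6 rules, copied from the two inspect() outputs
apriori_top <- c(
  "Male, Multiple, slightly abnormal, Symptom_4",
  "Male, slightly abnormal, Symptom_4, Tachycardia",
  "Alive, Low, Male, Multiple, Symptom_4",
  "Low, Male, slightly abnormal, Symptom_4",
  "Normal (30-60), slightly abnormal, Symptom_4, Tachycardia",
  "Assisted_Yes, Male, slightly abnormal, Symptom_4"
)
eclat_top <- c(
  "Male, Multiple, slightly abnormal, Symptom_4",
  "Male, slightly abnormal, Symptom_4, Tachycardia",
  "Low, Male, slightly abnormal, Symptom_4",
  "Normal (30-60), slightly abnormal, Symptom_4, Tachycardia",
  "Assisted_Yes, Male, slightly abnormal, Symptom_4",
  "H_Yes, Male, slightly abnormal, Symptom_4"
)

only_apriori <- setdiff(apriori_top, eclat_top)
only_eclat   <- setdiff(eclat_top, apriori_top)

# Number of antecedent items in the rule that only Apriori reports
n_items <- lengths(strsplit(only_apriori, ", ", fixed = TRUE))
print(list(only_apriori = only_apriori, only_eclat = only_eclat, n_items = n_items))
```

With the full rule objects available, a similar comparison could presumably be run on `labels()` of the two rule sets instead of hard-coded strings.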
Single_Hemochromatosis_rule <- apriori(data=trans1, parameter=list(supp=0.001,conf = 0.8),
appearance=list(default="lhs", rhs="Single_Hemochromatosis"), control=list(verbose=F))
Single_Hemochromatosis_rule_byconf <- sort(Single_Hemochromatosis_rule, by="confidence", decreasing=TRUE)
inspect(head(Single_Hemochromatosis_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {H_No,
## Home,
## inconclusive,
## Inherited_No,
## Symptom_0} => {Single_Hemochromatosis} 0.001049003 0.9333333 0.001123932 13.61341 14
## [2] {Blood Very High,
## Home,
## Paternal_No,
## Symptom_0,
## Tachypnea} => {Single_Hemochromatosis} 0.001049003 0.8750000 0.001198861 12.76257 14
## [3] {Assisted_No,
## H_Yes,
## History_No,
## Low,
## Normal,
## Symptom_0} => {Single_Hemochromatosis} 0.001049003 0.8750000 0.001198861 12.76257 14
## [4] {Folic_Yes,
## H_No,
## High,
## Inherited_No,
## Male,
## Symptom_0} => {Single_Hemochromatosis} 0.001049003 0.8750000 0.001198861 12.76257 14
## [5] {Assisted_No,
## Blood Very High,
## Inherited_No,
## Multiple,
## Symptom_0} => {Single_Hemochromatosis} 0.001198861 0.8421053 0.001423648 12.28277 16
## [6] {Deceased,
## Folic_Yes,
## Male,
## Paternal_Yes,
## Symptom_0} => {Single_Hemochromatosis} 0.001198861 0.8421053 0.001423648 12.28277 16
plot(Single_Hemochromatosis_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(Single_Hemochromatosis_rule_byconf, method="paracoord", control=list(reorder=TRUE))
# Support level 0.01 was tested; however, the maximum confidence was below 0.5, which would not give us any valuable rules
# With the current support and confidence levels, there are 20 rules. Similar to Mito_LHON, the support level is low; however, when H_No, Home, inconclusive, Inherited_No, and Symptom_0 occur together, Single_Hemochromatosis has a high confidence of 0.9333333
# The top 6 rules are listed in the output
# FP-Growth
frequent_itemsets_4 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 13
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing ... [3042465 set(s)] done [0.92s].
## Creating S4 object ... done [0.17s].
Single_Hemochromatosis_rhs <- "Single_Hemochromatosis"
Single_Hemochromatosis_growth <- ruleInduction(frequent_itemsets_4, trans1, confidence = 0.8)
filtered_rules_4 <- subset(Single_Hemochromatosis_growth, rhs %in% Single_Hemochromatosis_rhs)
filtered_rules_4_byconf <- sort(filtered_rules_4, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_4_byconf))
## lhs rhs support confidence lift itemset
## [1] {Maternal_Yes,
## Multiple,
## No_4,
## Symptom_0} => {Single_Hemochromatosis} 0.001198861 0.8 11.66863 64182
# Only 1 rule was generated by FP-Growth
Single_TaySachs_rule <- apriori(data=trans1, parameter=list(supp=0.001,conf = 0.8),
appearance=list(default="lhs", rhs="Single_TaySachs"), control=list(verbose=F))
Single_TaySachs_rule_byconf <- sort(Single_TaySachs_rule, by="confidence", decreasing=TRUE)
inspect(head(Single_TaySachs_rule_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Deceased,
## Infant,
## Maternal_No,
## Paternal_No,
## slightly abnormal,
## Symptom_1} => {Single_TaySachs} 0.001123932 0.8823529 0.00127379 7.369138 15
## [2] {Assisted_Yes,
## Blood High,
## Normal,
## Paternal_No,
## slightly abnormal,
## Symptom_1} => {Single_TaySachs} 0.001573505 0.8400000 0.00187322 7.015419 21
## [3] {Assisted_Yes,
## Blood Very High,
## Infant,
## Male,
## Maternal_Yes,
## Symptom_1} => {Single_TaySachs} 0.001049003 0.8235294 0.00127379 6.877862 14
## [4] {Deceased,
## H_Yes,
## High,
## Infant,
## slightly abnormal,
## Symptom_1} => {Single_TaySachs} 0.001049003 0.8235294 0.00127379 6.877862 14
## [5] {Folic_Yes,
## High,
## Infant,
## Paternal_No,
## slightly abnormal,
## Symptom_1} => {Single_TaySachs} 0.001049003 0.8235294 0.00127379 6.877862 14
## [6] {Deceased,
## Infant,
## Institute,
## Paternal_No,
## slightly abnormal,
## Symptom_1} => {Single_TaySachs} 0.001049003 0.8235294 0.00127379 6.877862 14
plot(Single_TaySachs_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(Single_TaySachs_rule_byconf, method="paracoord", control=list(reorder=TRUE))
# Support level 0.01 was tested; however, the maximum confidence was below 0.5, which would not give us any valuable rules
# With the current support and confidence levels, there are 10 rules. Similar to Mito_LHON, the support level is low; however, when Deceased, Infant, Maternal_No, Paternal_No, slightly abnormal, and Symptom_1 occur together, Single_TaySachs has a high confidence of 0.8823529
# The top 6 rules are listed in the output
# FP-Growth
frequent_itemsets_5 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 13
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing ... [3042465 set(s)] done [0.92s].
## Creating S4 object ... done [0.17s].
Single_TaySachs_rhs <- "Single_TaySachs"
Single_TaySachs_growth <- ruleInduction(frequent_itemsets_5, trans1, confidence = 0.8)
filtered_rules_5 <- subset(Single_TaySachs_growth, rhs %in% Single_TaySachs_rhs)
filtered_rules_5_byconf <- sort(filtered_rules_5, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_5_byconf))
# FP-Growth generated 0 rules
# Whole dataset
gene_rules_viz <- apriori(trans1, parameter = list(support = 0.1, confidence = 0.6, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1334
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [55 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.05s].
## writing ... [712 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
gene_rules_viz
## set of 712 rules
gene_rules_viz_byconf <- sort(gene_rules_viz, by="confidence", decreasing=TRUE)
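# Note: this inspects gene_rules, an object created earlier in the report, not the sorted gene_rules_viz_byconf from the previous line; the output below includes confidences under 0.6, so it does not come from gene_rules_viz
inspect(head(gene_rules))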
## lhs rhs support confidence coverage lift count
## [1] {No_3} => {Genes_Yes} 0.1017533 0.5848407 0.1739847 0.9847695 1358
## [2] {No_3} => {Inherited_No} 0.1063989 0.6115418 0.1739847 1.0093540 1420
## [3] {No_1} => {Paternal_No} 0.1013787 0.5708861 0.1775813 1.0084772 1353
## [4] {No_1} => {Genes_Yes} 0.1062491 0.5983122 0.1775813 1.0074533 1418
## [5] {No_1} => {Inherited_No} 0.1100704 0.6198312 0.1775813 1.0230358 1469
## [6] {No_4} => {Paternal_No} 0.1024277 0.5669847 0.1806534 1.0015853 1367
# Genetic Disorder
gene_rules_combined_disorder <- apriori(trans1,
parameter = list(support = 0.01, confidence = 0.65, minlen = 2),
appearance = list(default = "lhs", rhs = c("Mito_LHON", "Mito_Leigh", "Mito_Myopathy",
"Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes",
"Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs")))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.65 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 133
##
## set item appearances ...[9 item(s)] done [0.00s].
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [3.32s].
## writing ... [42 rule(s)] done [0.05s].
## creating S4 object ... done [0.05s].
gene_rules_combined_disorder_byconf <- sort(gene_rules_combined_disorder, by="confidence", decreasing=TRUE)
inspect(head(gene_rules_combined_disorder_byconf))
## lhs rhs support confidence coverage lift count
## [1] {Male,
## Multiple,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.7317073 0.01536041 3.501386 150
## [2] {Male,
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01161397 0.7311321 0.01588491 3.498633 155
## [3] {Alive,
## Low,
## Male,
## Multiple,
## Symptom_4} => {Single_CysticFibrosis} 0.01004046 0.6943005 0.01446126 3.322386 134
## [4] {Low,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01213847 0.6923077 0.01753334 3.312850 162
## [5] {Normal (30-60),
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01011539 0.6887755 0.01468605 3.295948 135
## [6] {Assisted_Yes,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.6880734 0.01633448 3.292588 150
plot(gene_rules_combined_disorder_byconf, method="paracoord", control=list(reorder=TRUE))
# FP-Growth
# Generate frequent itemsets with eclat() (arules implements the Eclat algorithm here; it serves as the report's FP-Growth stand-in)
frequent_itemsets <- eclat(trans1, parameter = list(supp = 0.01, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 133
##
## create itemset ...
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing ... [873336 set(s)] done [0.43s].
## Creating S4 object ... done [0.05s].
# View frequent itemsets
inspect(head(frequent_itemsets, n = 10))
## items support count
## [1] {Genes_Yes, Mito_LHON, Symptom_5} 0.01063989 142
## [2] {Maternal_Yes, Mito_LHON, Symptom_5} 0.01041511 139
## [3] {Genes_Yes, Mito_LHON, Symptom_4} 0.01041511 139
## [4] {Genes_Yes, Inherited_Yes, Mito_LHON} 0.01176382 157
## [5] {Inherited_Yes, Maternal_Yes, Mito_LHON} 0.01176382 157
## [6] {Genes_Yes, Maternal_Yes, Mito_LHON, Paternal_Yes} 0.01071482 143
## [7] {Genes_Yes, Mito_LHON, Paternal_Yes} 0.01348719 180
## [8] {Maternal_Yes, Mito_LHON, Paternal_Yes} 0.01311254 175
## [9] {Genes_Yes, Mito_LHON, Tachypnea} 0.01034018 138
## [10] {Genes_Yes, History_No, Mito_LHON} 0.01019032 136
# Define desired RHS items
desired_rhs <- c("Mito_LHON", "Mito_Leigh", "Mito_Myopathy", "Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes",
"Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs")
# Generate rules with specific RHS
rules_fp_growth <- ruleInduction(frequent_itemsets, trans1, confidence = 0.65)
# Filter rules to include only the desired RHS
filtered_rules <- subset(rules_fp_growth, rhs %in% desired_rhs)
# View the filtered rules
filtered_rules_byconf <- sort(filtered_rules, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_byconf))
## lhs rhs support confidence lift itemset
## [1] {Male,
## Multiple,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.7317073 3.501386 154043
## [2] {Male,
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01161397 0.7311321 3.498633 154022
## [3] {Low,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01213847 0.6923077 3.312850 154054
## [4] {Normal (30-60),
## slightly abnormal,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01011539 0.6887755 3.295948 154018
## [5] {Assisted_Yes,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.6880734 3.292588 154055
## [6] {H_Yes,
## Male,
## slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.01161397 0.6858407 3.281904 154056
# FP-Growth generated similar rules; only some of the Apriori rules are not included
# Plot the filtered rules
plot(filtered_rules, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
# Grouped matrix plot
plot(filtered_rules, method = "grouped")
# Parallel coordinate plot
plot(filtered_rules, method = "paracoord", control = list(reorder = TRUE))
# Define file paths
performance_apriori_path <- 'performance_apriori.rds'
performance_fp_growth_path <- 'performance_fp_growth.rds'
# Apriori Performance
if (file.exists(performance_apriori_path)) {
gene_rules_apriori <- readRDS(performance_apriori_path)
message("Loaded Apriori rules from 'performance_apriori.rds'.")
} else {
time_apriori <- system.time({
gene_rules_apriori <- apriori(
data = trans1,
parameter = list(support = 0.01, confidence = 0.5, minlen = 2),
appearance = list(default = "lhs", rhs = c(
"Mito_LHON", "Mito_Leigh", "Mito_Myopathy",
"Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes",
"Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs"
)),
control = list(verbose = FALSE)
)
saveRDS(gene_rules_apriori, file = performance_apriori_path)
})
message(paste("Time taken by Apriori:", round(time_apriori["elapsed"], 2), "seconds"))
}
## Loaded Apriori rules from 'performance_apriori.rds'.
# FP-Growth Performance
if (file.exists(performance_fp_growth_path)) {
filtered_rules_fp_growth <- readRDS(performance_fp_growth_path)
message("Loaded FP-Growth rules from 'performance_fp_growth.rds'.")
} else {
time_fp_growth <- system.time({
frequent_itemsets_fp_growth <- eclat(trans1, parameter = list(supp = 0.01, maxlen = 5))
rules_fp_growth <- ruleInduction(frequent_itemsets_fp_growth, trans1, confidence = 0.5)
filtered_rules_fp_growth <- subset(rules_fp_growth, rhs %in% c(
"Mito_LHON", "Mito_Leigh", "Mito_Myopathy",
"Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes",
"Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs"
))
saveRDS(filtered_rules_fp_growth, file = performance_fp_growth_path)
})
message(paste("Time taken by FP-Growth:", round(time_fp_growth["elapsed"], 2), "seconds"))
}
## Loaded FP-Growth rules from 'performance_fp_growth.rds'.
# Inspect top Apriori rules
inspect(head(gene_rules_apriori, n = 10))
## lhs rhs support confidence coverage lift count
## [1] {No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01768320 0.5010616 0.03529147 2.397694 236
## [2] {slightly abnormal,
## Symptom_4} => {Single_CysticFibrosis} 0.03176982 0.5880721 0.05402368 2.814059 424
## [3] {Middle Childhood,
## Symptom_4} => {Single_CysticFibrosis} 0.03004646 0.5031368 0.05971827 2.407624 401
## [4] {Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.05132624 0.5297757 0.09688296 2.535097 685
## [5] {Home,
## Symptom_4} => {Single_CysticFibrosis} 0.05140117 0.5138577 0.10002997 2.458926 686
## [6] {Multiple,
## Symptom_4} => {Single_CysticFibrosis} 0.05162596 0.5416667 0.09530946 2.591998 689
## [7] {Male,
## Symptom_4} => {Single_CysticFibrosis} 0.05627154 0.5356633 0.10505020 2.563271 751
## [8] {H_Yes,
## Symptom_4} => {Single_CysticFibrosis} 0.05155103 0.5119048 0.10070433 2.449581 688
## [9] {Assisted_Yes,
## Symptom_4} => {Single_CysticFibrosis} 0.05177581 0.5010877 0.10332684 2.397819 691
## [10] {Low,
## Symptom_4} => {Single_CysticFibrosis} 0.05312453 0.5213235 0.10190319 2.494652 709
# Inspect top FP-Growth rules
inspect(head(filtered_rules_fp_growth, n = 10))
## lhs rhs support confidence lift itemset
## [1] {Inherited_No,
## No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01049003 0.5166052 2.472073 54796
## [2] {Genes_Yes,
## No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01176382 0.5097403 2.439223 54797
## [3] {Maternal_Yes,
## No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01123932 0.5050505 2.416782 54798
## [4] {History_Yes,
## No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01034018 0.5587045 2.673528 54799
## [5] {Male,
## No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01176382 0.5528169 2.645355 54800
## [6] {Multiple,
## No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01049003 0.5645161 2.701338 54801
## [7] {No_1,
## Symptom_4,
## Tachycardia} => {Single_CysticFibrosis} 0.01026525 0.5546559 2.654155 54802
## [8] {No_1,
## Symptom_4} => {Single_CysticFibrosis} 0.01768320 0.5010616 2.397694 54926
## [9] {Low,
## No_1,
## Symptom_3} => {Mito_Leigh} 0.01288776 0.5043988 1.958600 58214
## [10] {Institute,
## No_1,
## Symptom_3} => {Mito_Leigh} 0.01296269 0.5029070 1.952807 58215
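# Compare times
# The timing messages are printed above only when the rules are recomputed; in this run both rule sets were loaded from the cached .rds files, so no times were reported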
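Because the cached branch of the performance code prints no timing, a comparison is only visible on the first run. One possible arrangement, sketched here under assumptions, is to store the elapsed time next to the result so it can be reported on later runs too. `timed_cache()` is a hypothetical helper, not part of the original report, and the workload is a stand-in for the `apriori()` or `eclat()` call.

```r
# Hypothetical helper: caches a result together with the time it took to compute
timed_cache <- function(path, compute) {
  if (file.exists(path)) {
    cached <- readRDS(path)
    message("Loaded from cache; original elapsed time: ",
            round(cached$elapsed, 2), " seconds")
    return(cached)
  }
  # Time only the computation, not the disk write
  elapsed <- system.time(result <- compute())[["elapsed"]]
  cached <- list(result = result, elapsed = elapsed)
  saveRDS(cached, file = path)
  message("Computed in ", round(elapsed, 2), " seconds")
  cached
}

# Stand-in workload; the second call must come from the cache
demo_path <- tempfile(fileext = ".rds")
first  <- timed_cache(demo_path, function() sum(sqrt(seq_len(1e6))))
second <- timed_cache(demo_path, function() stop("not recomputed"))
```

Keeping `saveRDS()` outside the timed expression also avoids counting the file write as part of the algorithm's run time.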
Based on the analysis in this report, some association rules provide interpretable insights.
Some genetic disorder rules have a low support level but a high confidence level (mostly above 60%).
Some genetic disorder rules have a higher support level (above 0.01) with a noticeable confidence level (all above 50%).
One disorder, Multi_Cancer, doesn’t have any strong rules from either the Apriori or the FP-Growth approach.
For the rules with low support but high confidence, the antecedent items are rare but reliable predictors of the consequent.
The rules with higher support describe combinations that occur more often among children (the patients in the dataset are aged up to 14); their confidence levels above 50% indicate which combinations are more likely to accompany a disorder.
As mentioned at the beginning, this report can only provide insights into potential association rules that researchers can refer to (for example, the top 5 rules for each disorder). It cannot provide biological or medical proof that these factors are correlated.
Based on the comparison, Apriori often provides more rules, and it is easier to control the desired outcome, namely finding association rules with the genetic disorders as the consequent items.
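The grouping described in this conclusion can be reproduced from the printed results. The sketch below copies the top Apriori rule for each disorder from the outputs in this part of the report (Mito_LHON and Mito_Leigh are left out because their top rules are not shown here); the group labels and the `top_rules` name are illustrative choices, not part of the original code.

```r
# Top Apriori rule per disorder, copied from the inspect() outputs
top_rules <- data.frame(
  disorder = c("Mito_Myopathy", "Multi_Alzheimer", "Multi_Cancer",
               "Multi_Diabetes", "Single_CysticFibrosis",
               "Single_Hemochromatosis", "Single_TaySachs"),
  support = c(0.02217893, 0.001049003, 0.001123932,
              0.001049003, 0.01123932, 0.001049003, 0.001123932),
  confidence = c(0.5481481, 0.8235294, 0.4687500,
                 0.9333333, 0.7317073, 0.9333333, 0.8823529)
)

# Grouping used in the conclusion: support above 0.01 versus below
top_rules$group <- ifelse(top_rules$support > 0.01,
                          "higher support", "rare but high confidence")
# Multi_Cancer is the exception: its best confidence stays below 0.5
top_rules$group[top_rules$confidence < 0.5] <- "no strong rule"

print(top_rules[order(-top_rules$confidence), ])
```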