1 Dataset description

1.1 Gene Expression of Disorder

This dataset contains records of patients with genetic disorders, including details on gene inheritance, blood test results, and related clinical and maternal-history fields.

Data Source: https://www.kaggle.com/datasets/eftekheraliefte/clean-genetic-dataset

The dataset was uploaded to Kaggle on January 18, 2025.

1.2 Aim and methods

This report applies association rule mining to these variables to find out whether there is any potential connection between the genetic disorders and the other factors in the dataset.

Because this is a data-mining exercise on medical records rather than biological or medical research, we can only offer insights into potential associations. The results cannot prove a causal or clinical relationship between gene inheritance, blood test results, or other factors and the genetic disorders.

The initial hypothesis is that some of the categories listed in the dataset are associated with the disorder. Eclat in arules returns frequent itemsets rather than rules, so it is not sufficient on its own; Apriori, which generates association rules directly, is the more suitable method.

We will also apply FP-Growth as a reference and compare its results with those of Apriori.
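To make the rule metrics concrete, here is a minimal base-R sketch of support, confidence and lift for a single rule. The data frame `toy` and its values are hypothetical and are not taken from the dataset; only the item names mimic the labels used later in this report.

```r
# Toy data: each row is a patient, each column a logical item (hypothetical values).
toy <- data.frame(
  Maternal_Yes = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
  Mito_Leigh   = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Rule under study: {Maternal_Yes} => {Mito_Leigh}
support_lhs  <- mean(toy$Maternal_Yes)                   # P(A)
support_rhs  <- mean(toy$Mito_Leigh)                     # P(B)
support_rule <- mean(toy$Maternal_Yes & toy$Mito_Leigh)  # P(A and B)

confidence <- support_rule / support_lhs  # P(B | A)
lift       <- confidence / support_rhs    # above 1 hints at a positive association

round(c(support = support_rule, confidence = confidence, lift = lift), 3)
```

A lift close to 1 means the antecedent tells us little about the consequent, which is why lift is usually read together with confidence.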

2 Load packages and libraries

options(repos = c(CRAN = "https://cloud.r-project.org"))
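install.packages("dplyr")
install.packages("arules")
install.packages("arulesSequences")
install.packages("arulesViz")
install.packages("ggplot2")
install.packages("moments")
# stringr is loaded below and used for str_replace_all(), so install it as well
install.packages("stringr")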
library(arules)
library(arulesViz)
library(dplyr)
library(ggplot2)
library(stringr)
library(moments)

3 Pre-Processing/Pre-Checking the dataset

gene_data <- read.csv('clean_train_data.csv')
summary(gene_data)
##   Patient.Age     Genes.in.mother.s.side Inherited.from.father
##  Min.   : 0.000   Length:20745           Length:20745         
##  1st Qu.: 3.000   Class :character       Class :character     
##  Median : 7.000   Mode  :character       Mode  :character     
##  Mean   : 6.974                                               
##  3rd Qu.:10.000                                               
##  Max.   :14.000                                               
##  Maternal.gene      Paternal.gene         Status         
##  Length:20745       Length:20745       Length:20745      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  Respiratory.Rate..breaths.min. Heart.Rate..rates.min  Follow.up        
##  Length:20745                   Length:20745          Length:20745      
##  Class :character               Class :character      Class :character  
##  Mode  :character               Mode  :character      Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##     Gender          Birth.asphyxia    
##  Length:20745       Length:20745      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##  Autopsy.shows.birth.defect..if.applicable. Place.of.birth    
##  Length:20745                               Length:20745      
##  Class :character                           Class :character  
##  Mode  :character                           Mode  :character  
##                                                               
##                                                               
##                                                               
##  Folic.acid.details..peri.conceptional. H.O.serious.maternal.illness
##  Length:20745                           Length:20745                
##  Class :character                       Class :character            
##  Mode  :character                       Mode  :character            
##                                                                     
##                                                                     
##                                                                     
##  H.O.radiation.exposure..x.ray. H.O.substance.abuse Assisted.conception.IVF.ART
##  Length:20745                   Length:20745        Length:20745               
##  Class :character               Class :character    Class :character           
##  Mode  :character               Mode  :character    Mode  :character           
##                                                                                
##                                                                                
##                                                                                
##  History.of.anomalies.in.previous.pregnancies No..of.previous.abortion
##  Length:20745                                 Min.   :0               
##  Class :character                             1st Qu.:1               
##  Mode  :character                             Median :2               
##                                               Mean   :2               
##                                               3rd Qu.:3               
##                                               Max.   :4               
##  Birth.defects      Blood.test.result  Genetic.Disorder   Disorder.Subclass 
##  Length:20745       Length:20745       Length:20745       Length:20745      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Symptom.Count   Total.Blood.Cell.Count Combined_disorder 
##  Min.   :0.000   Min.   : 7.302         Length:20745      
##  1st Qu.:2.000   1st Qu.:10.560         Class :character  
##  Median :3.000   Median :12.384         Mode  :character  
##  Mean   :2.648   Mean   :12.385                           
##  3rd Qu.:4.000   3rd Qu.:14.183                           
##  Max.   :5.000   Max.   :17.536
# Check for missing values in the entire dataset
sum(is.na(gene_data))
## [1] 0
# Check values stored as "Missing"
missing_text_total <- sum(gene_data == "Missing", na.rm = TRUE)
print(paste("Total cells with the literal 'Missing':", missing_text_total))
## [1] "Total cells with the literal 'Missing': 53262"
missing_by_col <- sapply(gene_data, function(col) sum(as.character(col) == "Missing", na.rm = TRUE))
# Count total number of entries in each column (including "Missing" strings)
total_entries_by_col <- sapply(gene_data, length)
# For clarity, combine results into a data frame
results <- data.frame(Column = names(gene_data), MissingCount = missing_by_col, TotalCount = total_entries_by_col)
# Columns Missing Count Total Count
results
##                                                                                    Column
## Patient.Age                                                                   Patient.Age
## Genes.in.mother.s.side                                             Genes.in.mother.s.side
## Inherited.from.father                                               Inherited.from.father
## Maternal.gene                                                               Maternal.gene
## Paternal.gene                                                               Paternal.gene
## Status                                                                             Status
## Respiratory.Rate..breaths.min.                             Respiratory.Rate..breaths.min.
## Heart.Rate..rates.min                                               Heart.Rate..rates.min
## Follow.up                                                                       Follow.up
## Gender                                                                             Gender
## Birth.asphyxia                                                             Birth.asphyxia
## Autopsy.shows.birth.defect..if.applicable.     Autopsy.shows.birth.defect..if.applicable.
## Place.of.birth                                                             Place.of.birth
## Folic.acid.details..peri.conceptional.             Folic.acid.details..peri.conceptional.
## H.O.serious.maternal.illness                                 H.O.serious.maternal.illness
## H.O.radiation.exposure..x.ray.                             H.O.radiation.exposure..x.ray.
## H.O.substance.abuse                                                   H.O.substance.abuse
## Assisted.conception.IVF.ART                                   Assisted.conception.IVF.ART
## History.of.anomalies.in.previous.pregnancies History.of.anomalies.in.previous.pregnancies
## No..of.previous.abortion                                         No..of.previous.abortion
## Birth.defects                                                               Birth.defects
## Blood.test.result                                                       Blood.test.result
## Genetic.Disorder                                                         Genetic.Disorder
## Disorder.Subclass                                                       Disorder.Subclass
## Symptom.Count                                                               Symptom.Count
## Total.Blood.Cell.Count                                             Total.Blood.Cell.Count
## Combined_disorder                                                       Combined_disorder
##                                              MissingCount TotalCount
## Patient.Age                                             0      20745
## Genes.in.mother.s.side                                  0      20745
## Inherited.from.father                                   0      20745
## Maternal.gene                                           0      20745
## Paternal.gene                                           0      20745
## Status                                                  0      20745
## Respiratory.Rate..breaths.min.                          0      20745
## Heart.Rate..rates.min                                   0      20745
## Follow.up                                               0      20745
## Gender                                               7400      20745
## Birth.asphyxia                                      10629      20745
## Autopsy.shows.birth.defect..if.applicable.          14558      20745
## Place.of.birth                                          0      20745
## Folic.acid.details..peri.conceptional.                  0      20745
## H.O.serious.maternal.illness                            0      20745
## H.O.radiation.exposure..x.ray.                      10568      20745
## H.O.substance.abuse                                 10107      20745
## Assisted.conception.IVF.ART                             0      20745
## History.of.anomalies.in.previous.pregnancies            0      20745
## No..of.previous.abortion                                0      20745
## Birth.defects                                           0      20745
## Blood.test.result                                       0      20745
## Genetic.Disorder                                        0      20745
## Disorder.Subclass                                       0      20745
## Symptom.Count                                           0      20745
## Total.Blood.Cell.Count                                  0      20745
## Combined_disorder                                       0      20745
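The counts above are easier to compare as shares. The following is a small sketch, assuming a hypothetical miniature data frame `demo` in place of `gene_data` and an arbitrary 40% threshold that is not stated in the report.

```r
# Hypothetical miniature of gene_data (the real data frame has 27 columns).
demo <- data.frame(
  Gender = c("Male", "Missing", "Female", "Missing"),
  Status = c("Alive", "Deceased", "Alive", "Alive"),
  stringsAsFactors = FALSE
)

# Share of cells per column that hold the literal string "Missing"
missing_share <- sapply(demo, function(col) mean(as.character(col) == "Missing"))

# Columns whose share exceeds an assumed 40% threshold
high_missing <- names(missing_share)[missing_share > 0.40]

missing_share
high_missing
```

Applied to `gene_data`, the same `sapply()` call would give the proportion behind each `MissingCount` value.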
# Autopsy.shows.birth.defect..if.applicable. depends on Status: when the patient is alive, the autopsy result is recorded as "Missing", which is expected and does not distort the result
# However, the column gives little insight, because it cannot tell us whether a living patient has a birth defect, so there is no need to keep it
# Remove Autopsy.shows.birth.defect..if.applicable.
gene_data$Autopsy.shows.birth.defect..if.applicable. <- NULL
# Our main focus is the association analysis for Genetic.Disorder, Disorder.Subclass and Combined_disorder. A genetic disorder could be related to gender, so we drop the rows without gender data
# Remove rows where Gender equals the literal string "Missing"
gene_data <- gene_data[ as.character(gene_data$Gender) != "Missing", ]
# Around 50% of the values are still "Missing" for Birth.asphyxia, H.O.radiation.exposure..x.ray. and H.O.substance.abuse. That share is too large to keep, so we delete these columns
gene_data[c("Birth.asphyxia","H.O.substance.abuse","H.O.radiation.exposure..x.ray.")] <- NULL
# Recalculate missing counts for each column after removal
missing_by_col <- sapply(gene_data, function(col) sum(as.character(col) == "Missing", na.rm = TRUE))
total_entries_by_col <- sapply(gene_data, length)
# Recreate the results data frame with updated counts
results <- data.frame(Column = names(gene_data), MissingCount = missing_by_col, TotalCount = total_entries_by_col)
results
##                                                                                    Column
## Patient.Age                                                                   Patient.Age
## Genes.in.mother.s.side                                             Genes.in.mother.s.side
## Inherited.from.father                                               Inherited.from.father
## Maternal.gene                                                               Maternal.gene
## Paternal.gene                                                               Paternal.gene
## Status                                                                             Status
## Respiratory.Rate..breaths.min.                             Respiratory.Rate..breaths.min.
## Heart.Rate..rates.min                                               Heart.Rate..rates.min
## Follow.up                                                                       Follow.up
## Gender                                                                             Gender
## Place.of.birth                                                             Place.of.birth
## Folic.acid.details..peri.conceptional.             Folic.acid.details..peri.conceptional.
## H.O.serious.maternal.illness                                 H.O.serious.maternal.illness
## Assisted.conception.IVF.ART                                   Assisted.conception.IVF.ART
## History.of.anomalies.in.previous.pregnancies History.of.anomalies.in.previous.pregnancies
## No..of.previous.abortion                                         No..of.previous.abortion
## Birth.defects                                                               Birth.defects
## Blood.test.result                                                       Blood.test.result
## Genetic.Disorder                                                         Genetic.Disorder
## Disorder.Subclass                                                       Disorder.Subclass
## Symptom.Count                                                               Symptom.Count
## Total.Blood.Cell.Count                                             Total.Blood.Cell.Count
## Combined_disorder                                                       Combined_disorder
##                                              MissingCount TotalCount
## Patient.Age                                             0      13345
## Genes.in.mother.s.side                                  0      13345
## Inherited.from.father                                   0      13345
## Maternal.gene                                           0      13345
## Paternal.gene                                           0      13345
## Status                                                  0      13345
## Respiratory.Rate..breaths.min.                          0      13345
## Heart.Rate..rates.min                                   0      13345
## Follow.up                                               0      13345
## Gender                                                  0      13345
## Place.of.birth                                          0      13345
## Folic.acid.details..peri.conceptional.                  0      13345
## H.O.serious.maternal.illness                            0      13345
## Assisted.conception.IVF.ART                             0      13345
## History.of.anomalies.in.previous.pregnancies            0      13345
## No..of.previous.abortion                                0      13345
## Birth.defects                                           0      13345
## Blood.test.result                                       0      13345
## Genetic.Disorder                                        0      13345
## Disorder.Subclass                                       0      13345
## Symptom.Count                                           0      13345
## Total.Blood.Cell.Count                                  0      13345
## Combined_disorder                                       0      13345
# Get unique values for each column
unique_values <- lapply(gene_data, unique)
length(unique_values)
## [1] 23
# We are defining the bins as below:
gene_data$Patient.Age <- cut(gene_data$Patient.Age,
                             breaks = c(-Inf, 2, 6, 10, 14, Inf),
                             labels = c("Infant", "Early Childhood", "Middle Childhood", "Early Teens", "Other"),
                             right = TRUE)
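As a quick check of how `right = TRUE` treats the boundary ages, here is a small sketch with a hypothetical vector `ages`. The breaks and labels are the ones used above.

```r
# Hypothetical ages chosen to sit on and just past each break point
ages <- c(0, 2, 3, 6, 7, 10, 11, 14)

age_bins <- cut(ages,
                breaks = c(-Inf, 2, 6, 10, 14, Inf),
                labels = c("Infant", "Early Childhood", "Middle Childhood", "Early Teens", "Other"),
                right = TRUE)  # intervals are closed on the right, e.g. (2, 6]

data.frame(age = ages, bin = age_bins)
table(age_bins)
```

With right-closed intervals, an age equal to a break point falls into the lower bin, and since the maximum age in the summary is 14, the "Other" bin stays empty.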
# $No..of.previous.abortion: 2 4 0 3 1. Since there are only 5 unique values, we will not bin this numerical variable
# $Symptom.Count: 5 4 3 1 2 0. Since there are only 6 unique values, we will not bin this numerical variable
# Combined_disorder is the combination of Genetic.Disorder and Disorder.Subclass, so it carries the information of both. Hence, we only keep Combined_disorder and drop the other two columns
gene_data[c("Genetic.Disorder","Disorder.Subclass")] <- NULL
# Shorten the Combined_disorder values as below:
gene_data$Combined_disorder <- str_replace_all(gene_data$Combined_disorder, c(
  "Mitochondrial_genetic_inheritance_disorders_Leber's_hereditary_optic_neuropathy" = "Mito_LHON",
  "Mitochondrial_genetic_inheritance_disorders_Leigh_syndrome" = "Mito_Leigh",
  "Mitochondrial_genetic_inheritance_disorders_Mitochondrial_myopathy" = "Mito_Myopathy",
  "Multifactorial_genetic_inheritance_disorders_Alzheimer's" = "Multi_Alzheimer",
  "Multifactorial_genetic_inheritance_disorders_Cancer" = "Multi_Cancer",
  "Multifactorial_genetic_inheritance_disorders_Diabetes" = "Multi_Diabetes",
  "Single-gene_inheritance_diseases_Cystic_fibrosis" = "Single_CysticFibrosis",
  "Single-gene_inheritance_diseases_Hemochromatosis" = "Single_Hemochromatosis",
  "Single-gene_inheritance_diseases_Tay-Sachs" = "Single_TaySachs"
))
# summary(unique_values$Total.Blood.Cell.Count)
# Number of unique Total.Blood.Cell.Count values (equal to the number of rows, so every value is distinct)
length(unique_values$Total.Blood.Cell.Count)
## [1] 13345
# Check the normality of the Total.Blood.Cell.Count
skewness(gene_data$Total.Blood.Cell.Count)
## [1] 0.009567959
# Skewness is about 0.0096, so Total.Blood.Cell.Count is approximately symmetric rather than skewed. The maximum (17.536) is not far above the 3rd quartile (14.183) in the initial summary, so we consider there to be no outliers
# Total.Blood.Cell.Count has 13345 unique values, too many to use as items, so it needs to be converted to categorical bins
# We are defining the bins as below:
gene_data$Total.Blood.Cell.Count <- cut(
  gene_data$Total.Blood.Cell.Count,
  breaks = c(-Inf, 10.544, 12.379, 14.162, Inf),
  labels = c("Blood Low", "Blood Moderate", "Blood High", "Blood Very High"),
  right = FALSE
)
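The report does not state how the three break points were chosen. One common approach, assumed here purely for illustration, is to take the quartiles so that the four bins hold roughly equal numbers of rows. The vector `x` below is simulated and only stands in for `Total.Blood.Cell.Count`.

```r
set.seed(1)
# Simulated stand-in for Total.Blood.Cell.Count (hypothetical mean and sd)
x <- rnorm(1000, mean = 12.4, sd = 2)

# Quartiles used as break points (an assumption, not the report's stated method)
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)

bins <- cut(x,
            breaks = c(-Inf, q, Inf),
            labels = c("Blood Low", "Blood Moderate", "Blood High", "Blood Very High"),
            right = FALSE)  # intervals are closed on the left, e.g. [q1, q2)

table(bins)
```

Equal-frequency bins keep every blood-count item common enough to pass a minimum support threshold, at the cost of break points that have no clinical meaning.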
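# Iterate over columns and prefix Yes/No and integer values with the first token of the column name, so that items from different columns stay distinguishable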
for (col in colnames(gene_data)) {
  # Check for columns with "Yes" and "No"
  if (all(gene_data[[col]] %in% c("Yes", "No"), na.rm = TRUE)) {
    prefix <- strsplit(col, split = "[ .]")[[1]][1]
    gene_data[[col]] <- paste0(prefix, "_", gene_data[[col]])
  }
  # Check for columns with integer values
  if (is.numeric(gene_data[[col]]) && all(gene_data[[col]] %% 1 == 0, na.rm = TRUE)) {
    prefix <- strsplit(col, split = "[ .]")[[1]][1]
    gene_data[[col]] <- paste0(prefix, "_", gene_data[[col]])
  }
}
# Preview the modified dataset
head(gene_data)
##        Patient.Age Genes.in.mother.s.side Inherited.from.father Maternal.gene
## 2      Early Teens              Genes_Yes          Inherited_No  Maternal_Yes
## 3 Middle Childhood              Genes_Yes         Inherited_Yes  Maternal_Yes
## 4 Middle Childhood              Genes_Yes          Inherited_No  Maternal_Yes
## 5      Early Teens              Genes_Yes         Inherited_Yes  Maternal_Yes
## 6           Infant               Genes_No         Inherited_Yes  Maternal_Yes
## 7           Infant               Genes_No         Inherited_Yes  Maternal_Yes
##   Paternal.gene   Status Respiratory.Rate..breaths.min. Heart.Rate..rates.min
## 2   Paternal_No    Alive                 Normal (30-60)           Tachycardia
## 3  Paternal_Yes Deceased                 Normal (30-60)           Tachycardia
## 4  Paternal_Yes Deceased                      Tachypnea                Normal
## 5  Paternal_Yes    Alive                 Normal (30-60)           Tachycardia
## 6   Paternal_No    Alive                      Tachypnea           Tachycardia
## 7   Paternal_No    Alive                 Normal (30-60)           Tachycardia
##   Follow.up Gender Place.of.birth Folic.acid.details..peri.conceptional.
## 2       Low   Male      Institute                              Folic_Yes
## 3       Low Female           Home                              Folic_Yes
## 4       Low   Male      Institute                              Folic_Yes
## 5      High Female           Home                              Folic_Yes
## 6      High Female           Home                              Folic_Yes
## 7       Low   Male           Home                               Folic_No
##   H.O.serious.maternal.illness Assisted.conception.IVF.ART
## 2                        H_Yes                Assisted_Yes
## 3                         H_No                Assisted_Yes
## 4                        H_Yes                 Assisted_No
## 5                         H_No                 Assisted_No
## 6                        H_Yes                Assisted_Yes
## 7                         H_No                Assisted_Yes
##   History.of.anomalies.in.previous.pregnancies No..of.previous.abortion
## 2                                   History_No                     No_4
## 3                                   History_No                     No_0
## 4                                  History_Yes                     No_3
## 5                                   History_No                     No_4
## 6                                   History_No                     No_4
## 7                                   History_No                     No_4
##   Birth.defects Blood.test.result Symptom.Count Total.Blood.Cell.Count
## 2      Multiple slightly abnormal     Symptom_5             Blood High
## 3      Singular slightly abnormal     Symptom_4              Blood Low
## 4      Multiple            normal     Symptom_4             Blood High
## 5      Singular slightly abnormal     Symptom_4              Blood Low
## 6      Singular          abnormal     Symptom_5              Blood Low
## 7      Multiple          abnormal     Symptom_5        Blood Very High
##   Combined_disorder
## 2         Mito_LHON
## 3         Mito_LHON
## 4         Mito_LHON
## 5         Mito_LHON
## 6         Mito_LHON
## 7         Mito_LHON
# Dimension of the dataset
dim(gene_data)
## [1] 13345    21
# Save the pre-processed data; this file is read back as transactions in the next section
write.csv(gene_data, file="modified_gene_data.csv", quote = TRUE, row.names = FALSE)
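Only the Yes/No and integer columns receive a prefix above, so values such as "Low", "High", "normal" and "Normal" enter the basket file without any column context. A possible alternative, sketched here on a hypothetical three-column frame `demo`, is to label every value with its full column name. This is not what the report does, and it would change the item labels shown in the outputs.

```r
# Hypothetical miniature with unprefixed values from three different columns
demo <- data.frame(
  Follow.up             = c("Low", "High"),
  Blood.test.result     = c("normal", "abnormal"),
  Heart.Rate..rates.min = c("Normal", "Tachycardia"),
  stringsAsFactors = FALSE
)

# Label every value as "column=value" so that no two columns can share an item
labelled <- demo
for (col in names(labelled)) {
  labelled[[col]] <- paste0(col, "=", labelled[[col]])
}

labelled
```

The longer labels are harder to read in plots, which is the trade-off against the short prefixes used in this report.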

4 Dataset Pre-analysis and Processing

# The first row shows the categories (columns) in the dataset:
gene_data[1,]
##   Patient.Age Genes.in.mother.s.side Inherited.from.father Maternal.gene
## 2 Early Teens              Genes_Yes          Inherited_No  Maternal_Yes
##   Paternal.gene Status Respiratory.Rate..breaths.min. Heart.Rate..rates.min
## 2   Paternal_No  Alive                 Normal (30-60)           Tachycardia
##   Follow.up Gender Place.of.birth Folic.acid.details..peri.conceptional.
## 2       Low   Male      Institute                              Folic_Yes
##   H.O.serious.maternal.illness Assisted.conception.IVF.ART
## 2                        H_Yes                Assisted_Yes
##   History.of.anomalies.in.previous.pregnancies No..of.previous.abortion
## 2                                   History_No                     No_4
##   Birth.defects Blood.test.result Symptom.Count Total.Blood.Cell.Count
## 2      Multiple slightly abnormal     Symptom_5             Blood High
##   Combined_disorder
## 2         Mito_LHON
# Factors such as Patient.Age, Genes.in.mother.s.side and Inherited.from.father are most likely the antecedents.
# Combined_disorder is the potential consequent, as the outcome associated with certain genes and some other factors
dim(gene_data)
## [1] 13345    21
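# Read the pre-processed data as transactions
# Note: with skip = 0 the CSV header row is also read as a transaction, which is why the summary below reports 13346 transactions (13345 data rows plus the header) and column names appear as items; skip = 1 would exclude it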
trans1 <- read.transactions("modified_gene_data.csv", format="basket", sep=",", skip=0)
summary(trans1)
## transactions as itemMatrix in sparse format with
##  13346 rows (elements/itemsets/transactions) and
##  83 columns (items) and a density of 0.253012 
## 
## most frequent items:
## Inherited_No    Genes_Yes  Paternal_No Maternal_Yes    Folic_Yes      (Other) 
##         8086         7926         7555         7390         7335       241974 
## 
## element (itemset/transaction) length distribution:
## sizes
##    21 
## 13346 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      21      21      21      21      21      21 
## 
## includes extended item information - examples:
##        labels
## 1    abnormal
## 2       Alive
## 3 Assisted_No
inspect(head(trans1))
##     items                                          
## [1] {Assisted.conception.IVF.ART,                  
##      Birth.defects,                                
##      Blood.test.result,                            
##      Combined_disorder,                            
##      Folic.acid.details..peri.conceptional.,       
##      Follow.up,                                    
##      Gender,                                       
##      Genes.in.mother.s.side,                       
##      H.O.serious.maternal.illness,                 
##      Heart.Rate..rates.min,                        
##      History.of.anomalies.in.previous.pregnancies, 
##      Inherited.from.father,                        
##      Maternal.gene,                                
##      No..of.previous.abortion,                     
##      Paternal.gene,                                
##      Patient.Age,                                  
##      Place.of.birth,                               
##      Respiratory.Rate..breaths.min.,               
##      Status,                                       
##      Symptom.Count,                                
##      Total.Blood.Cell.Count}                       
## [2] {Alive,                                        
##      Assisted_Yes,                                 
##      Blood High,                                   
##      Early Teens,                                  
##      Folic_Yes,                                    
##      Genes_Yes,                                    
##      H_Yes,                                        
##      History_No,                                   
##      Inherited_No,                                 
##      Institute,                                    
##      Low,                                          
##      Male,                                         
##      Maternal_Yes,                                 
##      Mito_LHON,                                    
##      Multiple,                                     
##      No_4,                                         
##      Normal (30-60),                               
##      Paternal_No,                                  
##      slightly abnormal,                            
##      Symptom_5,                                    
##      Tachycardia}                                  
## [3] {Assisted_Yes,                                 
##      Blood Low,                                    
##      Deceased,                                     
##      Female,                                       
##      Folic_Yes,                                    
##      Genes_Yes,                                    
##      H_No,                                         
##      History_No,                                   
##      Home,                                         
##      Inherited_Yes,                                
##      Low,                                          
##      Maternal_Yes,                                 
##      Middle Childhood,                             
##      Mito_LHON,                                    
##      No_0,                                         
##      Normal (30-60),                               
##      Paternal_Yes,                                 
##      Singular,                                     
##      slightly abnormal,                            
##      Symptom_4,                                    
##      Tachycardia}                                  
## [4] {Assisted_No,                                  
##      Blood High,                                   
##      Deceased,                                     
##      Folic_Yes,                                    
##      Genes_Yes,                                    
##      H_Yes,                                        
##      History_Yes,                                  
##      Inherited_No,                                 
##      Institute,                                    
##      Low,                                          
##      Male,                                         
##      Maternal_Yes,                                 
##      Middle Childhood,                             
##      Mito_LHON,                                    
##      Multiple,                                     
##      No_3,                                         
##      normal,                                       
##      Normal,                                       
##      Paternal_Yes,                                 
##      Symptom_4,                                    
##      Tachypnea}                                    
## [5] {Alive,                                        
##      Assisted_No,                                  
##      Blood Low,                                    
##      Early Teens,                                  
##      Female,                                       
##      Folic_Yes,                                    
##      Genes_Yes,                                    
##      H_No,                                         
##      High,                                         
##      History_No,                                   
##      Home,                                         
##      Inherited_Yes,                                
##      Maternal_Yes,                                 
##      Mito_LHON,                                    
##      No_4,                                         
##      Normal (30-60),                               
##      Paternal_Yes,                                 
##      Singular,                                     
##      slightly abnormal,                            
##      Symptom_4,                                    
##      Tachycardia}                                  
## [6] {abnormal,                                     
##      Alive,                                        
##      Assisted_Yes,                                 
##      Blood Low,                                    
##      Female,                                       
##      Folic_Yes,                                    
##      Genes_No,                                     
##      H_Yes,                                        
##      High,                                         
##      History_No,                                   
##      Home,                                         
##      Infant,                                       
##      Inherited_Yes,                                
##      Maternal_Yes,                                 
##      Mito_LHON,                                    
##      No_4,                                         
##      Paternal_No,                                  
##      Singular,                                     
##      Symptom_5,                                    
##      Tachycardia,                                  
##      Tachypnea}
length(trans1)
## [1] 13346
# Simple statistics 
head(itemFrequency(trans1, type="relative"))
##                    abnormal                       Alive 
##                2.310805e-01                5.071182e-01 
##                 Assisted_No                Assisted_Yes 
##                4.796194e-01                5.203057e-01 
## Assisted.conception.IVF.ART               Birth.defects 
##                7.492882e-05                7.492882e-05
head(itemFrequency(trans1, type="absolute"))
##                    abnormal                       Alive 
##                        3084                        6768 
##                 Assisted_No                Assisted_Yes 
##                        6401                        6944 
## Assisted.conception.IVF.ART               Birth.defects 
##                           1                           1
itemFrequencyPlot(trans1, topN = 15)

# Visualize the sparse matrix for the first 5 transactions
image(trans1[1:5])

image(sample(trans1, 100))
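The relative and absolute item frequencies printed above are linked by the number of transactions reported by length(trans1). A minimal base-R sketch of that relationship, using figures copied from the output (the variable names are illustrative and not part of the original analysis):

```r
# Figures copied from the itemFrequency() and length() output above.
n_transactions <- 13346
absolute <- c(abnormal = 3084, Alive = 6768, Assisted_No = 6401,
              Assisted_Yes = 6944, Birth.defects = 1)

# Relative frequency (the support of a single item) is count / number of transactions.
relative <- absolute / n_transactions

print(signif(relative, 7))
```

Items such as Birth.defects, which appear in a single transaction, have a relative frequency of 1 / 13346 and cannot reach any of the support thresholds used later in this report.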

5 Apriori algorithm

# Support at 0.1 and confidence at 0.5, minimum length of a rule is 2 elements
gene_rules <- apriori(trans1, parameter = list(support = 0.1, confidence = 0.5, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1334 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [55 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.04s].
## writing ... [6665 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
gene_rules
## set of 6665 rules
summary(gene_rules)
## set of 6665 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4 
##  707 5331  627 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.988   3.000   4.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.1000   Min.   :0.5000   Min.   :0.1511   Min.   :0.8670  
##  1st Qu.:0.1173   1st Qu.:0.5174   1st Qu.:0.2203   1st Qu.:0.9922  
##  Median :0.1339   Median :0.5393   Median :0.2476   Median :1.0043  
##  Mean   :0.1418   Mean   :0.5458   Mean   :0.2605   Mean   :1.0073  
##  3rd Qu.:0.1514   3rd Qu.:0.5614   3rd Qu.:0.2771   3rd Qu.:1.0176  
##  Max.   :0.3666   Max.   :0.6755   Max.   :0.6059   Max.   :1.7471  
##      count     
##  Min.   :1335  
##  1st Qu.:1565  
##  Median :1787  
##  Mean   :1893  
##  3rd Qu.:2020  
##  Max.   :4893  
## 
## mining info:
##    data ntransactions support confidence
##  trans1         13346     0.1        0.5
##                                                                                   call
##  apriori(data = trans1, parameter = list(support = 0.1, confidence = 0.5, minlen = 2))
inspect(gene_rules[1:10])
##      lhs       rhs            support   confidence coverage  lift      count
## [1]  {No_3} => {Genes_Yes}    0.1017533 0.5848407  0.1739847 0.9847695 1358 
## [2]  {No_3} => {Inherited_No} 0.1063989 0.6115418  0.1739847 1.0093540 1420 
## [3]  {No_1} => {Paternal_No}  0.1013787 0.5708861  0.1775813 1.0084772 1353 
## [4]  {No_1} => {Genes_Yes}    0.1062491 0.5983122  0.1775813 1.0074533 1418 
## [5]  {No_1} => {Inherited_No} 0.1100704 0.6198312  0.1775813 1.0230358 1469 
## [6]  {No_4} => {Paternal_No}  0.1024277 0.5669847  0.1806534 1.0015853 1367 
## [7]  {No_4} => {Genes_Yes}    0.1059493 0.5864786  0.1806534 0.9875276 1414 
## [8]  {No_4} => {Inherited_No} 0.1054998 0.5839900  0.1806534 0.9638797 1408 
## [9]  {No_0} => {Folic_Yes}    0.1016784 0.5591265  0.1818522 1.0173282 1357 
## [10] {No_0} => {Maternal_Yes} 0.1029522 0.5661310  0.1818522 1.0224066 1374
inspect(sort(gene_rules, by = "lift")[1:5])
##     lhs                        rhs            support   confidence coverage 
## [1] {Mito_Myopathy}         => {Symptom_2}    0.1123183 0.5088255  0.2207403
## [2] {Mito_Myopathy}         => {Maternal_No}  0.1256556 0.5692464  0.2207403
## [3] {Single_CysticFibrosis} => {Maternal_Yes} 0.1411659 0.6755109  0.2089765
## [4] {Genes_No, H_No}        => {Maternal_No}  0.1061741 0.5311094  0.1999101
## [5] {Single_CysticFibrosis} => {Tachycardia}  0.1154653 0.5525278  0.2089765
##     lift     count
## [1] 1.747051 1499 
## [2] 1.275762 1677 
## [3] 1.219942 1884 
## [4] 1.190292 1417 
## [5] 1.188594 1541
inspect(sort(gene_rules, by = "confidence")[1:5])
##     lhs                                      rhs            support  
## [1] {Single_CysticFibrosis}               => {Maternal_Yes} 0.1411659
## [2] {Normal, Paternal_No, Singular}       => {Inherited_No} 0.1040012
## [3] {Paternal_No, Symptom_3}              => {Inherited_No} 0.1020530
## [4] {History_Yes, Institute, Paternal_No} => {Inherited_No} 0.1084220
## [5] {Institute, Normal, Paternal_No}      => {Inherited_No} 0.1081223
##     confidence coverage  lift     count
## [1] 0.6755109  0.2089765 1.219942 1884 
## [2] 0.6754258  0.1539787 1.114795 1388 
## [3] 0.6712666  0.1520306 1.107930 1362 
## [4] 0.6702177  0.1617713 1.106199 1447 
## [5] 0.6689847  0.1616215 1.104164 1443
inspect(sort(gene_rules, by = "support")[1:5])
##     lhs               rhs            support   confidence coverage  lift     
## [1] {Paternal_No}  => {Inherited_No} 0.3666267 0.6476506  0.5660872 1.0689518
## [2] {Inherited_No} => {Paternal_No}  0.3666267 0.6051200  0.6058744 1.0689518
## [3] {Genes_Yes}    => {Inherited_No} 0.3589840 0.6044663  0.5938858 0.9976759
## [4] {Inherited_No} => {Genes_Yes}    0.3589840 0.5925056  0.6058744 0.9976759
## [5] {Maternal_Yes} => {Genes_Yes}    0.3556871 0.6423545  0.5537240 1.0816129
##     count
## [1] 4893 
## [2] 4893 
## [3] 4791 
## [4] 4791 
## [5] 4747
plot(gene_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
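The quality measures in the inspect() output above are related by simple arithmetic. The sketch below recomputes support, confidence and lift for {Paternal_No} => {Inherited_No} from the printed count and coverage values. It uses base R only, and the variable names are illustrative assumptions rather than objects from the analysis.

```r
# Values copied from the inspect(sort(gene_rules, by = "support")) output above.
n_transactions <- 13346
count_both     <- 4893       # transactions containing Paternal_No and Inherited_No
supp_paternal  <- 0.5660872  # coverage of {Paternal_No} => {Inherited_No}
supp_inherited <- 0.6058744  # coverage of {Inherited_No} => {Paternal_No}

support    <- count_both / n_transactions  # share of all transactions matching the rule
confidence <- support / supp_paternal      # P(Inherited_No | Paternal_No)
lift       <- confidence / supp_inherited  # ratio to the rate expected under independence

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

A lift close to 1, as in most of the 6665 rules summarised above, means the two sides co-occur about as often as they would if they were independent.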

6 Rules for Combined_disorder

6.1 Mito_LHON

# Rules for Combined_disorder
# Mito_LHON
Mito_LHON_rule <- apriori(data = trans1, parameter = list(supp = 0.001, conf = 0.6),
                          appearance = list(default = "lhs", rhs = "Mito_LHON"), control = list(verbose = FALSE))
Mito_LHON_rule_byconf <- sort(Mito_LHON_rule, by="confidence", decreasing=TRUE)
inspect(head(Mito_LHON_rule_byconf))
##     lhs                     rhs             support confidence    coverage     lift count
## [1] {Alive,                                                                              
##      Infant,                                                                             
##      Male,                                                                               
##      Multiple,                                                                           
##      Paternal_Yes,                                                                       
##      Symptom_5}          => {Mito_LHON} 0.001123932  0.8823529 0.001273790 31.74092    15
## [2] {Folic_Yes,                                                                          
##      H_Yes,                                                                              
##      Home,                                                                               
##      Infant,                                                                             
##      Paternal_Yes,                                                                       
##      Symptom_5}          => {Mito_LHON} 0.001123932  0.7500000 0.001498576 26.97978    15
## [3] {Assisted_No,                                                                        
##      H_Yes,                                                                              
##      Maternal_Yes,                                                                       
##      No_2,                                                                               
##      Symptom_5,                                                                          
##      Tachycardia}        => {Mito_LHON} 0.001123932  0.7500000 0.001498576 26.97978    15
## [4] {Folic_Yes,                                                                          
##      Home,                                                                               
##      Male,                                                                               
##      Paternal_Yes,                                                                       
##      slightly abnormal,                                                                  
##      Symptom_5}          => {Mito_LHON} 0.001123932  0.6818182 0.001648434 24.52708    15
## [5] {Home,                                                                               
##      Male,                                                                               
##      Normal,                                                                             
##      Paternal_Yes,                                                                       
##      slightly abnormal,                                                                  
##      Symptom_5}          => {Mito_LHON} 0.001049003  0.6666667 0.001573505 23.98203    14
## [6] {Alive,                                                                              
##      Genes_Yes,                                                                          
##      Infant,                                                                             
##      Multiple,                                                                           
##      Paternal_Yes,                                                                       
##      Symptom_5}          => {Mito_LHON} 0.001123932  0.6521739 0.001723363 23.46068    15
plot(Mito_LHON_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(Mito_LHON_rule, method = "paracoord", control = list(reorder = TRUE))

# There are 271 rules for Mito_LHON. Support and confidence levels of 0.01 were tested first but produced only a limited number of rules, so a lower support level of 0.001 is used here
# Even though the support level is low, there are some interesting insights. The maximum confidence is 0.8823529: the rule with antecedent Alive, Infant, Male, Multiple, Paternal_Yes, Symptom_5 is supported by only 0.1123932% of transactions (15 records), yet Mito_LHON follows in 88.2% of the transactions containing that antecedent
# The antecedent is rare, but within this dataset it is still a reliable predictor of the consequent. The same applies to the other antecedents listed above
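# FP-Growth comparison (note: this code uses eclat() followed by ruleInduction(), not an FP-Growth implementation)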
frequent_itemsets_1 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 13 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing  ... [3042465 set(s)] done [0.92s].
## Creating S4 object  ... done [0.31s].
Mito_LHON_rhs <- "Mito_LHON"
Mito_LHON_growth <- ruleInduction(frequent_itemsets_1, trans1, confidence = 0.6)
filtered_rules_1 <- subset(Mito_LHON_growth, rhs %in% Mito_LHON_rhs)
filtered_rules_1_byconf <- sort(filtered_rules_1, by="confidence", decreasing=TRUE)
inspect(filtered_rules_1_byconf)
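# This run returns no rules. A likely reason is maxlen = 5: the itemsets hold at most 5 items, while the Apriori rules shown above each contain 7 items (6 in the antecedent plus Mito_LHON)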
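The confidence of 0.8823529 for the top Mito_LHON rule rests on a small number of records, so it may help to see how much uncertainty surrounds it. The sketch below recovers the antecedent count from the printed coverage and computes an exact binomial interval with binom.test() from base R. Treating the matching patients as independent draws is an assumption made only for this illustration.

```r
# Counts derived from the top Mito_LHON rule in the inspect() output above.
n_transactions <- 13346
rule_count     <- 15                                   # "count" column
lhs_count      <- round(0.001273790 * n_transactions)  # coverage * n = antecedent count

confidence <- rule_count / lhs_count

# Exact (Clopper-Pearson) 95% interval for the proportion rule_count / lhs_count.
# Assumption: the matching transactions behave like independent Bernoulli trials.
ci <- binom.test(rule_count, lhs_count)$conf.int

print(lhs_count)
print(confidence)
print(ci)
```

With so few transactions matching the antecedent, the interval can be expected to be wide, which is worth keeping in mind when reading the high-confidence, low-support rules in this section.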

6.2 Mito_Leigh

Mito_Leigh_rule <- apriori(data = trans1, parameter = list(supp = 0.02, conf = 0.5),
                           appearance = list(default = "lhs", rhs = "Mito_Leigh"), control = list(verbose = FALSE))
Mito_Leigh_rule_byconf <- sort(Mito_Leigh_rule, by="confidence", decreasing=TRUE)
inspect(head(Mito_Leigh_rule_byconf))
##     lhs                rhs             support confidence   coverage     lift count
## [1] {inconclusive,                                                                 
##      Singular,                                                                     
##      Symptom_3}     => {Mito_Leigh} 0.02075528  0.5918803 0.03506669 2.298294   277
## [2] {inconclusive,                                                                 
##      Paternal_No,                                                                  
##      Symptom_3}     => {Mito_Leigh} 0.02412708  0.5908257 0.04083621 2.294198   322
## [3] {inconclusive,                                                                 
##      Inherited_No,                                                                 
##      Symptom_3}     => {Mito_Leigh} 0.02450172  0.5902527 0.04151056 2.291973   327
## [4] {inconclusive,                                                                 
##      Normal,                                                                       
##      Symptom_3}     => {Mito_Leigh} 0.02187921  0.5770751 0.03791398 2.240804   292
## [5] {Assisted_Yes,                                                                 
##      inconclusive,                                                                 
##      Symptom_3}     => {Mito_Leigh} 0.02210400  0.5728155 0.03858834 2.224264   295
## [6] {Assisted_Yes,                                                                 
##      Normal,                                                                       
##      Singular,                                                                     
##      Symptom_3}     => {Mito_Leigh} 0.02142964  0.5708583 0.03753934 2.216664   286
plot(Mito_Leigh_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(Mito_Leigh_rule, method = "paracoord", control = list(reorder = TRUE))

# There are 160 rules for Mito_Leigh. The support level was set to 0.02, which is 20 times the level used for Mito_LHON
# Even with this higher support level, 160 associations still have a confidence above 0.5
# This indicates that Mito_Leigh occurs more frequently among children compared to other disorders
# With antecedent inconclusive, Singular, Symptom_3, the confidence for consequent Mito_Leigh is 0.5918803, which is a moderate indication that the antecedent is a fairly reliable predictor of the consequent
# FP-Growth comparison (eclat() followed by ruleInduction())
frequent_itemsets_6 <- eclat(trans1, parameter = list(supp = 0.02, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.02      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 266 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing  ... [283863 set(s)] done [0.21s].
## Creating S4 object  ... done [0.01s].
Mito_Leigh_rhs <- "Mito_Leigh"
Mito_Leigh_growth <- ruleInduction(frequent_itemsets_6, trans1, confidence = 0.5)
filtered_rules_6 <- subset(Mito_Leigh_growth, rhs %in% Mito_Leigh_rhs)
filtered_rules_6_byconf <- sort(filtered_rules_6, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_6_byconf))
##     lhs                rhs             support confidence     lift itemset
## [1] {inconclusive,                                                        
##      Singular,                                                            
##      Symptom_3}     => {Mito_Leigh} 0.02075528  0.5918803 2.298294   58332
## [2] {inconclusive,                                                        
##      Paternal_No,                                                         
##      Symptom_3}     => {Mito_Leigh} 0.02412708  0.5908257 2.294198   58322
## [3] {inconclusive,                                                        
##      Inherited_No,                                                        
##      Symptom_3}     => {Mito_Leigh} 0.02450172  0.5902527 2.291973   58320
## [4] {inconclusive,                                                        
##      Normal,                                                              
##      Symptom_3}     => {Mito_Leigh} 0.02187921  0.5770751 2.240804   58327
## [5] {Assisted_Yes,                                                        
##      inconclusive,                                                        
##      Symptom_3}     => {Mito_Leigh} 0.02210400  0.5728155 2.224264   58330
## [6] {Assisted_Yes,                                                        
##      Normal,                                                              
##      Singular,                                                            
##      Symptom_3}     => {Mito_Leigh} 0.02142964  0.5708583 2.216664  102235
# This comparison gives the same top rules as Apriori, with identical support, confidence and lift

6.3 Mito_Myopathy

Mito_Myopathy_rule <- apriori(data = trans1, parameter = list(supp = 0.02, conf = 0.5),
                              appearance = list(default = "lhs", rhs = "Mito_Myopathy"), control = list(verbose = FALSE))
Mito_Myopathy_rule_byconf <- sort(Mito_Myopathy_rule, by="confidence", decreasing=TRUE)
inspect(head(Mito_Myopathy_rule_byconf))
##     lhs                  rhs                support confidence   coverage     lift count
## [1] {Multiple,                                                                          
##      normal,                                                                            
##      Symptom_2}       => {Mito_Myopathy} 0.02217893  0.5481481 0.04046156 2.483226   296
## [2] {Assisted_No,                                                                       
##      Maternal_No,                                                                       
##      Multiple,                                                                          
##      Symptom_2}       => {Mito_Myopathy} 0.02030571  0.5452716 0.03723962 2.470195   271
## [3] {Low,                                                                               
##      Maternal_No,                                                                       
##      Multiple,                                                                          
##      Symptom_2}       => {Mito_Myopathy} 0.02083021  0.5419103 0.03843848 2.454968   278
## [4] {H_No,                                                                              
##      normal,                                                                            
##      Symptom_2}       => {Mito_Myopathy} 0.02083021  0.5387597 0.03866327 2.440695   278
## [5] {Female,                                                                            
##      H_No,                                                                              
##      Multiple,                                                                          
##      Symptom_2}       => {Mito_Myopathy} 0.02023078  0.5357143 0.03776412 2.426898   270
## [6] {Assisted_No,                                                                       
##      Maternal_No,                                                                       
##      Normal (30-60),                                                                    
##      Symptom_2}       => {Mito_Myopathy} 0.02202907  0.5335753 0.04128578 2.417208   294
plot(Mito_Myopathy_rule)

plot(Mito_Myopathy_rule, method="paracoord", control=list(reorder=TRUE))

# There are 34 rules for Mito_Myopathy. The support level was set to 0.02, which is 20 times the level used for Mito_LHON
# This indicates that Mito_Myopathy occurs more frequently among children than disorders such as Mito_LHON, which needed a support level of 0.001
# With antecedent Multiple, normal, Symptom_2, the confidence for consequent Mito_Myopathy is 0.5481481, which is a moderate indication that the antecedent is a fairly reliable predictor of the consequent
# FP-Growth comparison (eclat() followed by ruleInduction())
frequent_itemsets_9 <- eclat(trans1, parameter = list(supp = 0.02, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.02      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 266 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing  ... [283863 set(s)] done [0.21s].
## Creating S4 object  ... done [0.01s].
Mito_Myopathy_rhs <- "Mito_Myopathy"
Mito_Myopathy_growth <- ruleInduction(frequent_itemsets_9, trans1, confidence = 0.5)
filtered_rules_9 <- subset(Mito_Myopathy_growth, rhs %in% Mito_Myopathy_rhs)
filtered_rules_9_byconf <- sort(filtered_rules_9, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_9_byconf))
##     lhs                  rhs                support confidence     lift itemset
## [1] {Multiple,                                                                 
##      normal,                                                                   
##      Symptom_2}       => {Mito_Myopathy} 0.02217893  0.5481481 2.483226   31194
## [2] {Assisted_No,                                                              
##      Maternal_No,                                                              
##      Multiple,                                                                 
##      Symptom_2}       => {Mito_Myopathy} 0.02030571  0.5452716 2.470195   31867
## [3] {Low,                                                                      
##      Maternal_No,                                                              
##      Multiple,                                                                 
##      Symptom_2}       => {Mito_Myopathy} 0.02083021  0.5419103 2.454968   31880
## [4] {H_No,                                                                     
##      normal,                                                                   
##      Symptom_2}       => {Mito_Myopathy} 0.02083021  0.5387597 2.440695   31196
## [5] {Female,                                                                   
##      H_No,                                                                     
##      Multiple,                                                                 
##      Symptom_2}       => {Mito_Myopathy} 0.02023078  0.5357143 2.426898   32131
## [6] {Assisted_No,                                                              
##      Maternal_No,                                                              
##      Normal (30-60),                                                           
##      Symptom_2}       => {Mito_Myopathy} 0.02202907  0.5335753 2.417208   31864
# This comparison gives the same top rules as Apriori, with identical support, confidence and lift
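Because lift is confidence divided by the support of the consequent, the overall frequency of each disorder can be recovered from the top rule in each section above. The sketch below does this in base R for the three disorders covered so far. The data frame and its column names are illustrative assumptions, and the counts are approximate because the printed values are rounded.

```r
# Confidence and lift of the top rule per disorder, copied from the output above.
top_rules <- data.frame(
  disorder   = c("Mito_LHON", "Mito_Leigh", "Mito_Myopathy"),
  confidence = c(0.8823529, 0.5918803, 0.5481481),
  lift       = c(31.74092, 2.298294, 2.483226)
)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift.
top_rules$rhs_support <- top_rules$confidence / top_rules$lift

# Approximate number of transactions that contain each disorder.
n_transactions <- 13346
top_rules$rhs_count <- round(top_rules$rhs_support * n_transactions)

print(top_rules)
```

The value recovered for Mito_Myopathy agrees with the coverage of 0.2207403 printed for {Mito_Myopathy} => {Symptom_2} in section 5. The much smaller value for Mito_LHON is one plausible reason why that disorder needed a support threshold of 0.001: no rule can have a higher support than its consequent.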

6.4 Multi_Alzheimer

Multi_Alzheimer_rule <- apriori(data = trans1, parameter = list(supp = 0.001, conf = 0.5),
                                appearance = list(default = "lhs", rhs = "Multi_Alzheimer"), control = list(verbose = FALSE))
Multi_Alzheimer_rule_byconf <- sort(Multi_Alzheimer_rule, by="confidence", decreasing=TRUE)
inspect(head(Multi_Alzheimer_rule_byconf))
##     lhs                   rhs                   support confidence    coverage      lift count
## [1] {Early Childhood,                                                                         
##      Genes_Yes,                                                                               
##      H_No,                                                                                    
##      Inherited_Yes,                                                                           
##      Singular,                                                                                
##      Symptom_5}        => {Multi_Alzheimer} 0.001049003  0.8235294 0.001273790 106.70702    14
## [2] {Early Childhood,                                                                         
##      H_No,                                                                                    
##      Inherited_Yes,                                                                           
##      Singular,                                                                                
##      Symptom_5}        => {Multi_Alzheimer} 0.001049003  0.7000000 0.001498576  90.70097    14
## [3] {Early Childhood,                                                                         
##      Genes_Yes,                                                                               
##      H_No,                                                                                    
##      Inherited_Yes,                                                                           
##      Symptom_5}        => {Multi_Alzheimer} 0.001198861  0.5925926 0.002023078  76.78389    16
## [4] {Early Childhood,                                                                         
##      Genes_Yes,                                                                               
##      History_No,                                                                              
##      Inherited_Yes,                                                                           
##      Symptom_5}        => {Multi_Alzheimer} 0.001049003  0.5833333 0.001798292  75.58414    14
## [5] {Early Childhood,                                                                         
##      Genes_Yes,                                                                               
##      Inherited_Yes,                                                                           
##      Singular,                                                                                
##      Symptom_5}        => {Multi_Alzheimer} 0.001273790  0.5483871 0.002322793  71.05606    17
## [6] {Genes_Yes,                                                                               
##      History_No,                                                                              
##      Inherited_Yes,                                                                           
##      Paternal_Yes,                                                                            
##      Singular,                                                                                
##      Symptom_5}        => {Multi_Alzheimer} 0.001123932  0.5357143 0.002098007  69.41401    15
plot(Multi_Alzheimer_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(Multi_Alzheimer_rule_byconf, method="paracoord", control=list(reorder=TRUE))

# Support level 0.01 and confidence level 0.01 were tested, but this produced 0 rules
# With the current support and confidence levels there are 7 rules. As with Mito_LHON, the support is low; however, when Early Childhood, Genes_Yes, H_No, Inherited_Yes, Singular and Symptom_5 occur together, Multi_Alzheimer has a high confidence of 0.8235294
# This indicates that the antecedent items are rare but reliable predictors of the consequent. The top 6 rules are listed in the output
# FP-Growth comparison (eclat() followed by ruleInduction())
frequent_itemsets_7 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 13 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing  ... [3042465 set(s)] done [0.92s].
## Creating S4 object  ... done [0.17s].
Multi_Alzheimer_rhs <- "Multi_Alzheimer"
Multi_Alzheimer_growth <- ruleInduction(frequent_itemsets_7, trans1, confidence = 0.5)
filtered_rules_7 <- subset(Multi_Alzheimer_growth, rhs %in% Multi_Alzheimer_rhs)
filtered_rules_7_byconf <- sort(filtered_rules_7, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_7_byconf))
# This run returns no rules
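The empty result above is consistent with the maxlen = 5 setting in the eclat() call: ruleInduction() can only build rules from itemsets that were mined, and every Multi_Alzheimer rule shown in the Apriori output is larger than that. A small base-R check of this reasoning, with the antecedent sizes counted by hand from the inspect() output (only the six displayed rules are covered):

```r
# Number of antecedent (lhs) items in the six Multi_Alzheimer rules shown above.
lhs_sizes <- c(6, 5, 5, 5, 5, 6)

# The itemset behind a rule is its lhs plus the single rhs item.
rule_sizes <- lhs_sizes + 1

# eclat() was run with maxlen = 5, so only itemsets with at most 5 items
# are available to ruleInduction().
maxlen <- 5
recoverable <- rule_sizes <= maxlen

print(rule_sizes)
print(any(recoverable))
```

Raising maxlen in the eclat() call, for example to 10 as in the apriori() parameter output, would be one way to test this explanation. At supp = 0.001 this may produce a very large number of itemsets, since the maxlen = 5 run already wrote 3042465 sets.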

6.5 Multi_Cancer

Multi_Cancer_rule <- apriori(data = trans1, parameter = list(supp = 0.001, conf = 0.2),
                             appearance = list(default = "lhs", rhs = "Multi_Cancer"), control = list(verbose = FALSE))
Multi_Cancer_rule_byconf <- sort(Multi_Cancer_rule, by="confidence", decreasing=TRUE)
inspect(head(Multi_Cancer_rule_byconf))
##     lhs                rhs                support confidence    coverage      lift count
## [1] {Assisted_No,                                                                       
##      Genes_No,                                                                          
##      Maternal_No,                                                                       
##      Paternal_No,                                                                       
##      Symptom_0}     => {Multi_Cancer} 0.001123932  0.4687500 0.002397722 115.85069    15
## [2] {Genes_No,                                                                          
##      Inherited_No,                                                                      
##      Maternal_No,                                                                       
##      Paternal_No,                                                                       
##      Symptom_0}     => {Multi_Cancer} 0.001198861  0.4102564 0.002922224 101.39411    16
## [3] {Inherited_No,                                                                      
##      Maternal_No,                                                                       
##      Paternal_No,                                                                       
##      Singular,                                                                          
##      Symptom_0}     => {Multi_Cancer} 0.001273790  0.3695652 0.003446726  91.33736    17
## [4] {Assisted_No,                                                                       
##      Deceased,                                                                          
##      Maternal_No,                                                                       
##      Paternal_No,                                                                       
##      Symptom_0}     => {Multi_Cancer} 0.001123932  0.3571429 0.003147010  88.26720    15
## [5] {Deceased,                                                                          
##      Inherited_No,                                                                      
##      Maternal_No,                                                                       
##      Paternal_No,                                                                       
##      Symptom_0}     => {Multi_Cancer} 0.001123932  0.3571429 0.003147010  88.26720    15
## [6] {Genes_No,                                                                          
##      Maternal_No,                                                                       
##      Paternal_No,                                                                       
##      Symptom_0}     => {Multi_Cancer} 0.001423648  0.3518519 0.004046156  86.95953    19
plot(Multi_Cancer_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

# Support 0.01 with confidence 0.01 was tried first, but it produced 0 rules
# With the current support and confidence levels there are 92 rules, but the maximum confidence is only 0.4687500, which is below 0.5
# With support and confidence this low, we cannot conclude that there are strong association rules for Multi_Cancer
# FP-Growth comparison (run here with eclat() and ruleInduction())
frequent_itemsets_8 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 13 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing  ... [3042465 set(s)] done [0.92s].
## Creating S4 object  ... done [0.18s].
Multi_Cancer_rhs <- "Multi_Cancer"
Multi_Cancer_growth <- ruleInduction(frequent_itemsets_8, trans1, confidence = 0.2)
filtered_rules_8 <- subset(Multi_Cancer_growth, rhs %in% Multi_Cancer_rhs)
filtered_rules_8_byconf <- sort(filtered_rules_8, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_8_byconf))
##     lhs                rhs                support confidence     lift itemset
## [1] {Genes_No,                                                               
##      Maternal_No,                                                            
##      Paternal_No,                                                            
##      Symptom_0}     => {Multi_Cancer} 0.001423648  0.3518519 86.95953       2
## [2] {Assisted_No,                                                            
##      Genes_No,                                                               
##      Maternal_No,                                                            
##      Symptom_0}     => {Multi_Cancer} 0.001198861  0.3478261 85.96457       7
## [3] {Genes_No,                                                               
##      Inherited_No,                                                           
##      Singular,                                                               
##      Symptom_0}     => {Multi_Cancer} 0.001123932  0.3191489 78.87707      11
## [4] {Genes_No,                                                               
##      Inherited_No,                                                           
##      Maternal_No,                                                            
##      Symptom_0}     => {Multi_Cancer} 0.001348719  0.3157895 78.04678       1
## [5] {Genes_No,                                                               
##      Low,                                                                    
##      Maternal_No,                                                            
##      Symptom_0}     => {Multi_Cancer} 0.001049003  0.3111111 76.89053       4
## [6] {Genes_No,                                                               
##      Inherited_No,                                                           
##      Paternal_No,                                                            
##      Symptom_0}     => {Multi_Cancer} 0.001348719  0.2950820 72.92896      16
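# The eclat()-based comparison returned only rules with at most 4 antecedent items (maxlen = 5 counts the consequent too), so its highest confidence is 0.3518519
# That top rule is the same as Apriori rule [6] above; the 5-item antecedents with higher confidence cannot appear here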
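
The interest measures in the Multi_Cancer output are linked by simple arithmetic, which helps explain why lift can exceed 100 while confidence stays below 0.5. The sketch below uses only base R and the figures printed for rule [1] of the Apriori output (count 15, coverage 0.002397722, lift 115.85069) together with the 13346 transactions reported in the eclat log. The variable names are illustrative and are not part of the original analysis.

```r
# Figures copied from rule [1] of the Multi_Cancer Apriori output
n_trans    <- 13346        # transactions in trans1 (from the eclat log)
rule_count <- 15           # transactions containing both lhs and rhs
coverage   <- 0.002397722  # support of the lhs alone
lift       <- 115.85069

support     <- rule_count / n_trans        # share of all transactions matching the rule
confidence  <- support / coverage          # share of lhs matches that also contain the rhs
lhs_count   <- round(coverage * n_trans)   # transactions matching the lhs
rhs_support <- confidence / lift           # base rate of Multi_Cancer implied by the lift
rhs_count   <- round(rhs_support * n_trans)

cat("support:    ", support, "\n")
cat("confidence: ", confidence, "\n")
cat("lhs count:  ", lhs_count, "\n")
cat("rhs count:  ", rhs_count, "\n")
```

Read this way, the rule covers 15 of the 32 transactions that match the antecedent, and the lift is large mainly because the figures imply that Multi_Cancer itself appears in only about 54 of the 13346 transactions.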

6.6 Multi_Diabetes

Multi_Diabetes_rule <- apriori(data = trans1, parameter = list(supp = 0.001, conf = 0.8),
                               appearance = list(default = "lhs", rhs = "Multi_Diabetes"), control = list(verbose = FALSE))
Multi_Diabetes_rule_byconf <- sort(Multi_Diabetes_rule, by="confidence", decreasing=TRUE)
inspect(head(Multi_Diabetes_rule_byconf))
##     lhs                   rhs                  support confidence    coverage     lift count
## [1] {Blood Very High,                                                                       
##      Folic_Yes,                                                                             
##      Genes_No,                                                                              
##      H_No,                                                                                  
##      Singular,                                                                              
##      Symptom_5}        => {Multi_Diabetes} 0.001049003  0.9333333 0.001123932 11.00377    14
## [2] {Blood Very High,                                                                       
##      Genes_No,                                                                              
##      H_No,                                                                                  
##      Singular,                                                                              
##      Symptom_5}        => {Multi_Diabetes} 0.001273790  0.8947368 0.001423648 10.54873    17
## [3] {abnormal,                                                                              
##      Early Teens,                                                                           
##      Female,                                                                                
##      High,                                                                                  
##      Inherited_Yes,                                                                         
##      Symptom_4}        => {Multi_Diabetes} 0.001198861  0.8888889 0.001348719 10.47978    16
## [4] {abnormal,                                                                              
##      Folic_Yes,                                                                             
##      H_No,                                                                                  
##      History_Yes,                                                                           
##      Singular,                                                                              
##      Symptom_5}        => {Multi_Diabetes} 0.001049003  0.8750000 0.001198861 10.31603    14
## [5] {abnormal,                                                                              
##      Blood High,                                                                            
##      Genes_Yes,                                                                             
##      No_2,                                                                                  
##      Singular,                                                                              
##      Symptom_4}        => {Multi_Diabetes} 0.001049003  0.8750000 0.001198861 10.31603    14
## [6] {abnormal,                                                                              
##      High,                                                                                  
##      Institute,                                                                             
##      Maternal_Yes,                                                                          
##      Normal (30-60),                                                                        
##      Symptom_5}        => {Multi_Diabetes} 0.001273790  0.8500000 0.001498576 10.02129    17
plot(Multi_Diabetes_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(Multi_Diabetes_rule_byconf, method="paracoord", control=list(reorder=TRUE))

# Support 0.01 with confidence 0.01 was tried first, but it produced 0 rules
# With the current support and confidence levels there are 21 rules. As with Mito_LHON, support is low, but when Blood Very High, Folic_Yes, Genes_No, H_No, Singular and Symptom_5 occur together, Multi_Diabetes has a high confidence of 0.9333333
# This indicates that the antecedent items are rare but, within this sample, fairly reliable predictors of the consequent (the top rule rests on 14 of 15 matching transactions). The top 6 rules are listed in the output
# FP-Growth comparison (run here with eclat() and ruleInduction())
frequent_itemsets_2 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 13 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing  ... [3042465 set(s)] done [0.92s].
## Creating S4 object  ... done [0.17s].
Multi_Diabetes_rhs <- "Multi_Diabetes"
Multi_Diabetes_growth <- ruleInduction(frequent_itemsets_2, trans1, confidence = 0.8)
filtered_rules_2 <- subset(Multi_Diabetes_growth, rhs %in% Multi_Diabetes_rhs)
filtered_rules_2_byconf <- sort(filtered_rules_2, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_2_byconf))
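# The eclat()-based comparison generated 0 rules: with maxlen = 5 a rule can have at most 4 antecedent items, while the Apriori rules printed above have 5 or 6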
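
One way to see why the eclat() run returns nothing for Multi_Diabetes is to compare the size of the Apriori rules with the maxlen = 5 passed to eclat(). The base-R sketch below retypes the six antecedents printed by inspect() above and counts their items; the object names are illustrative, and only the six printed rules (not all 21) are checked.

```r
# Antecedents (lhs) of the six Multi_Diabetes rules printed by inspect() above
lhs_printed <- list(
  c("Blood Very High", "Folic_Yes", "Genes_No", "H_No", "Singular", "Symptom_5"),
  c("Blood Very High", "Genes_No", "H_No", "Singular", "Symptom_5"),
  c("abnormal", "Early Teens", "Female", "High", "Inherited_Yes", "Symptom_4"),
  c("abnormal", "Folic_Yes", "H_No", "History_Yes", "Singular", "Symptom_5"),
  c("abnormal", "Blood High", "Genes_Yes", "No_2", "Singular", "Symptom_4"),
  c("abnormal", "High", "Institute", "Maternal_Yes", "Normal (30-60)", "Symptom_5")
)

eclat_maxlen <- 5                       # maxlen passed to eclat() above
rule_size <- lengths(lhs_printed) + 1   # lhs items plus the single rhs item

# A rule can only be induced from an itemset that eclat() was allowed to mine
fits_in_eclat <- rule_size <= eclat_maxlen
print(data.frame(rule = seq_along(rule_size), rule_size, fits_in_eclat))
```

Every printed rule contains 6 or 7 items, so none of them can be induced from itemsets capped at 5. Raising maxlen in eclat() towards the Apriori default of 10 should make the two runs more comparable, at the cost of a larger itemset search.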

6.7 Single_CysticFibrosis

Single_CysticFibrosis_rule <- apriori(data = trans1, parameter = list(supp = 0.01, conf = 0.65),
                                      appearance = list(default = "lhs", rhs = "Single_CysticFibrosis"), control = list(verbose = FALSE))
Single_CysticFibrosis_rule_byconf <- sort(Single_CysticFibrosis_rule, by="confidence", decreasing=TRUE)
inspect(head(Single_CysticFibrosis_rule_byconf))
##     lhs                     rhs                        support confidence   coverage     lift count
## [1] {Male,                                                                                         
##      Multiple,                                                                                     
##      slightly abnormal,                                                                            
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.7317073 0.01536041 3.501386   150
## [2] {Male,                                                                                         
##      slightly abnormal,                                                                            
##      Symptom_4,                                                                                    
##      Tachycardia}        => {Single_CysticFibrosis} 0.01161397  0.7311321 0.01588491 3.498633   155
## [3] {Alive,                                                                                        
##      Low,                                                                                          
##      Male,                                                                                         
##      Multiple,                                                                                     
##      Symptom_4}          => {Single_CysticFibrosis} 0.01004046  0.6943005 0.01446126 3.322386   134
## [4] {Low,                                                                                          
##      Male,                                                                                         
##      slightly abnormal,                                                                            
##      Symptom_4}          => {Single_CysticFibrosis} 0.01213847  0.6923077 0.01753334 3.312850   162
## [5] {Normal (30-60),                                                                               
##      slightly abnormal,                                                                            
##      Symptom_4,                                                                                    
##      Tachycardia}        => {Single_CysticFibrosis} 0.01011539  0.6887755 0.01468605 3.295948   135
## [6] {Assisted_Yes,                                                                                 
##      Male,                                                                                         
##      slightly abnormal,                                                                            
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.6880734 0.01633448 3.292588   150
plot(Single_CysticFibrosis_rule)

plot(Single_CysticFibrosis_rule_byconf, method="paracoord", control=list(reorder=TRUE))

# With the current support and confidence levels there are 40 rules
# When Male, Multiple, slightly abnormal and Symptom_4 occur together, the consequent Single_CysticFibrosis has a confidence of 0.7317073
# The support level used here (0.01) is ten times the 0.001 used for several other disorders, which indicates that these item combinations occur more often than those found for the other disorders
# The top 6 rules are listed in the output
# FP-Growth comparison (run here with eclat() and ruleInduction())
frequent_itemsets_3 <- eclat(trans1, parameter = list(supp = 0.01, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 133 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing  ... [873336 set(s)] done [0.43s].
## Creating S4 object  ... done [0.05s].
Single_CysticFibrosis_rhs <- "Single_CysticFibrosis"
Single_CysticFibrosis_growth <- ruleInduction(frequent_itemsets_3, trans1, confidence = 0.65)
filtered_rules_3 <- subset(Single_CysticFibrosis_growth, rhs %in% Single_CysticFibrosis_rhs)
filtered_rules_3_byconf <- sort(filtered_rules_3, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_3_byconf))
##     lhs                     rhs                        support confidence     lift itemset
## [1] {Male,                                                                                
##      Multiple,                                                                            
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.7317073 3.501386  154043
## [2] {Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4,                                                                           
##      Tachycardia}        => {Single_CysticFibrosis} 0.01161397  0.7311321 3.498633  154022
## [3] {Low,                                                                                 
##      Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01213847  0.6923077 3.312850  154054
## [4] {Normal (30-60),                                                                      
##      slightly abnormal,                                                                   
##      Symptom_4,                                                                           
##      Tachycardia}        => {Single_CysticFibrosis} 0.01011539  0.6887755 3.295948  154018
## [5] {Assisted_Yes,                                                                        
##      Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.6880734 3.292588  154055
## [6] {H_Yes,                                                                               
##      Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01161397  0.6858407 3.281904  154056
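# The eclat()-based comparison generated slightly different rules: the first 2 rules are the same, and 5 of the top 6 match Apriori
# The Apriori rule {Alive, Low, Male, Multiple, Symptom_4} is missing because its 5 antecedent items plus the consequent exceed maxlen = 5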
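
The overlap between the two Single_CysticFibrosis outputs can be checked with plain set operations. The base-R sketch below retypes the antecedents of the top six rules from each inspect() output as strings; the object names are illustrative, and the comparison covers only the printed rules.

```r
# Antecedents of the top six Single_CysticFibrosis rules, copied from the outputs above
apriori_top <- c(
  "Male, Multiple, slightly abnormal, Symptom_4",
  "Male, slightly abnormal, Symptom_4, Tachycardia",
  "Alive, Low, Male, Multiple, Symptom_4",
  "Low, Male, slightly abnormal, Symptom_4",
  "Normal (30-60), slightly abnormal, Symptom_4, Tachycardia",
  "Assisted_Yes, Male, slightly abnormal, Symptom_4"
)
eclat_top <- c(
  "Male, Multiple, slightly abnormal, Symptom_4",
  "Male, slightly abnormal, Symptom_4, Tachycardia",
  "Low, Male, slightly abnormal, Symptom_4",
  "Normal (30-60), slightly abnormal, Symptom_4, Tachycardia",
  "Assisted_Yes, Male, slightly abnormal, Symptom_4",
  "H_Yes, Male, slightly abnormal, Symptom_4"
)

shared       <- intersect(apriori_top, eclat_top)  # antecedents found by both runs
only_apriori <- setdiff(apriori_top, eclat_top)    # printed for Apriori only
only_eclat   <- setdiff(eclat_top, apriori_top)    # printed for the eclat() run only

cat("shared:", length(shared), "\n")
cat("only in Apriori:", only_apriori, "\n")
cat("only in eclat run:", only_eclat, "\n")
```

The single Apriori-only antecedent has five items, so the full rule has six and falls outside the maxlen = 5 used in the eclat() call; the eclat() list simply moves up the next four-item antecedent in its place.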

6.8 Single_Hemochromatosis

Single_Hemochromatosis_rule <- apriori(data = trans1, parameter = list(supp = 0.001, conf = 0.8),
                                       appearance = list(default = "lhs", rhs = "Single_Hemochromatosis"), control = list(verbose = FALSE))
Single_Hemochromatosis_rule_byconf <- sort(Single_Hemochromatosis_rule, by="confidence", decreasing=TRUE)
inspect(head(Single_Hemochromatosis_rule_byconf))
##     lhs                   rhs                          support confidence    coverage     lift count
## [1] {H_No,                                                                                          
##      Home,                                                                                          
##      inconclusive,                                                                                  
##      Inherited_No,                                                                                  
##      Symptom_0}        => {Single_Hemochromatosis} 0.001049003  0.9333333 0.001123932 13.61341    14
## [2] {Blood Very High,                                                                               
##      Home,                                                                                          
##      Paternal_No,                                                                                   
##      Symptom_0,                                                                                     
##      Tachypnea}        => {Single_Hemochromatosis} 0.001049003  0.8750000 0.001198861 12.76257    14
## [3] {Assisted_No,                                                                                   
##      H_Yes,                                                                                         
##      History_No,                                                                                    
##      Low,                                                                                           
##      Normal,                                                                                        
##      Symptom_0}        => {Single_Hemochromatosis} 0.001049003  0.8750000 0.001198861 12.76257    14
## [4] {Folic_Yes,                                                                                     
##      H_No,                                                                                          
##      High,                                                                                          
##      Inherited_No,                                                                                  
##      Male,                                                                                          
##      Symptom_0}        => {Single_Hemochromatosis} 0.001049003  0.8750000 0.001198861 12.76257    14
## [5] {Assisted_No,                                                                                   
##      Blood Very High,                                                                               
##      Inherited_No,                                                                                  
##      Multiple,                                                                                      
##      Symptom_0}        => {Single_Hemochromatosis} 0.001198861  0.8421053 0.001423648 12.28277    16
## [6] {Deceased,                                                                                      
##      Folic_Yes,                                                                                     
##      Male,                                                                                          
##      Paternal_Yes,                                                                                  
##      Symptom_0}        => {Single_Hemochromatosis} 0.001198861  0.8421053 0.001423648 12.28277    16
plot(Single_Hemochromatosis_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(Single_Hemochromatosis_rule_byconf, method="paracoord", control=list(reorder=TRUE))

# Support 0.01 was tested, but the maximum confidence was below 0.5, which would not give us any valuable rules
# With the current support and confidence levels there are 20 rules. As with Mito_LHON, support is low, but when H_No, Home, inconclusive, Inherited_No and Symptom_0 occur together, Single_Hemochromatosis has a high confidence of 0.9333333
# The top 6 rules are listed in the output
# FP-Growth comparison (run here with eclat() and ruleInduction())
frequent_itemsets_4 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 13 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing  ... [3042465 set(s)] done [0.92s].
## Creating S4 object  ... done [0.17s].
Single_Hemochromatosis_rhs <- "Single_Hemochromatosis"
Single_Hemochromatosis_growth <- ruleInduction(frequent_itemsets_4, trans1, confidence = 0.8)
filtered_rules_4 <- subset(Single_Hemochromatosis_growth, rhs %in% Single_Hemochromatosis_rhs)
filtered_rules_4_byconf <- sort(filtered_rules_4, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_4_byconf))
##     lhs                rhs                          support confidence     lift itemset
## [1] {Maternal_Yes,                                                                     
##      Multiple,                                                                         
##      No_4,                                                                             
##      Symptom_0}     => {Single_Hemochromatosis} 0.001198861        0.8 11.66863   64182
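# The eclat()-based comparison generated only 1 rule, with confidence exactly at the 0.8 threshold; it is not among the top 6 Apriori rules printed above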
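
The confidence of 0.9333333 for the top Apriori rule rests on 14 of 15 matching transactions (count 14, coverage 0.001123932 of 13346), so it is worth asking how precisely such a small sample pins down the underlying proportion. One way to gauge this, offered as an illustrative sketch and not as part of the original analysis, is an exact binomial interval from base R's binom.test(). It treats the 15 matching transactions as independent draws, which is an assumption.

```r
matches_rule <- 14  # transactions containing the lhs and Single_Hemochromatosis (count column)
matches_lhs  <- 15  # transactions containing the lhs (coverage 0.001123932 * 13346)

# Exact (Clopper-Pearson) 95% interval for the proportion 14/15
ci <- binom.test(matches_rule, matches_lhs)$conf.int

cat("confidence:  ", matches_rule / matches_lhs, "\n")
cat("95% interval:", ci[1], "to", ci[2], "\n")
```

A wide interval here would be a reminder that a high confidence on a handful of transactions is a much weaker statement than the same confidence on hundreds.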

6.9 Single_TaySachs

Single_TaySachs_rule <- apriori(data = trans1, parameter = list(supp = 0.001, conf = 0.8),
                                appearance = list(default = "lhs", rhs = "Single_TaySachs"), control = list(verbose = FALSE))
Single_TaySachs_rule_byconf <- sort(Single_TaySachs_rule, by="confidence", decreasing=TRUE)
inspect(head(Single_TaySachs_rule_byconf))
##     lhs                     rhs                   support confidence   coverage     lift count
## [1] {Deceased,                                                                                
##      Infant,                                                                                  
##      Maternal_No,                                                                             
##      Paternal_No,                                                                             
##      slightly abnormal,                                                                       
##      Symptom_1}          => {Single_TaySachs} 0.001123932  0.8823529 0.00127379 7.369138    15
## [2] {Assisted_Yes,                                                                            
##      Blood High,                                                                              
##      Normal,                                                                                  
##      Paternal_No,                                                                             
##      slightly abnormal,                                                                       
##      Symptom_1}          => {Single_TaySachs} 0.001573505  0.8400000 0.00187322 7.015419    21
## [3] {Assisted_Yes,                                                                            
##      Blood Very High,                                                                         
##      Infant,                                                                                  
##      Male,                                                                                    
##      Maternal_Yes,                                                                            
##      Symptom_1}          => {Single_TaySachs} 0.001049003  0.8235294 0.00127379 6.877862    14
## [4] {Deceased,                                                                                
##      H_Yes,                                                                                   
##      High,                                                                                    
##      Infant,                                                                                  
##      slightly abnormal,                                                                       
##      Symptom_1}          => {Single_TaySachs} 0.001049003  0.8235294 0.00127379 6.877862    14
## [5] {Folic_Yes,                                                                               
##      High,                                                                                    
##      Infant,                                                                                  
##      Paternal_No,                                                                             
##      slightly abnormal,                                                                       
##      Symptom_1}          => {Single_TaySachs} 0.001049003  0.8235294 0.00127379 6.877862    14
## [6] {Deceased,                                                                                
##      Infant,                                                                                  
##      Institute,                                                                               
##      Paternal_No,                                                                             
##      slightly abnormal,                                                                       
##      Symptom_1}          => {Single_TaySachs} 0.001049003  0.8235294 0.00127379 6.877862    14
plot(Single_TaySachs_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(Single_TaySachs_rule_byconf, method="paracoord", control=list(reorder=TRUE))

# Support 0.01 was tested, but the maximum confidence was below 0.5, which would not give us any valuable rules
# With the current support and confidence levels there are 10 rules. As with Mito_LHON, support is low, but when Deceased, Infant, Maternal_No, Paternal_No, slightly abnormal and Symptom_1 occur together, Single_TaySachs has a high confidence of 0.8823529
# The top 6 rules are listed in the output
# FP-Growth comparison (run here with eclat() and ruleInduction())
frequent_itemsets_5 <- eclat(trans1, parameter = list(supp = 0.001, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 13 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [62 item(s)] done [0.00s].
## creating bit matrix ... [62 row(s), 13346 column(s)] done [0.00s].
## writing  ... [3042465 set(s)] done [0.92s].
## Creating S4 object  ... done [0.17s].
Single_TaySachs_rhs <- "Single_TaySachs"
Single_TaySachs_growth <- ruleInduction(frequent_itemsets_5, trans1, confidence = 0.8)
filtered_rules_5 <- subset(Single_TaySachs_growth, rhs %in% Single_TaySachs_rhs)
filtered_rules_5_byconf <- sort(filtered_rules_5, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_5_byconf))
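# The eclat()-based comparison generated 0 rules: all six Apriori rules printed above have 6 antecedent items, more than maxlen = 5 allows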
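
The support thresholds used in these sections translate into transaction counts as follows. This base-R sketch uses the 13346 transactions reported in the eclat and apriori logs; the variable names are illustrative.

```r
n_trans  <- 13346
supports <- c(0.001, 0.01, 0.1)  # support thresholds used in this report

# Smallest whole number of transactions whose share reaches each threshold
min_count <- ceiling(supports * n_trans)
print(data.frame(support = supports, raw = supports * n_trans, min_count))
```

The logs print the truncated values (13, 133 and 1334), while the smallest counts that actually appear in the rule tables at the 0.001 and 0.01 thresholds are 14 and 134. At the 0.001 level, each reported rule is therefore backed by as few as 14 patients.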

7 Parallel visualisation

# Whole dataset
gene_rules_viz <- apriori(trans1, parameter = list(support = 0.1, confidence = 0.6, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1334 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [55 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.05s].
## writing ... [712 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
gene_rules_viz
## set of 712 rules
gene_rules_viz_byconf <- sort(gene_rules_viz, by="confidence", decreasing=TRUE)
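# Note: this inspects gene_rules (created earlier), not gene_rules_viz_byconf, so rules with confidence below 0.6 appear in the output
inspect(head(gene_rules))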
##     lhs       rhs            support   confidence coverage  lift      count
## [1] {No_3} => {Genes_Yes}    0.1017533 0.5848407  0.1739847 0.9847695 1358 
## [2] {No_3} => {Inherited_No} 0.1063989 0.6115418  0.1739847 1.0093540 1420 
## [3] {No_1} => {Paternal_No}  0.1013787 0.5708861  0.1775813 1.0084772 1353 
## [4] {No_1} => {Genes_Yes}    0.1062491 0.5983122  0.1775813 1.0074533 1418 
## [5] {No_1} => {Inherited_No} 0.1100704 0.6198312  0.1775813 1.0230358 1469 
## [6] {No_4} => {Paternal_No}  0.1024277 0.5669847  0.1806534 1.0015853 1367
# Genetic Disorder
gene_rules_combined_disorder <- apriori(trans1, 
                                        parameter = list(support = 0.01, confidence = 0.65, minlen = 2), 
                                        appearance = list(default = "lhs", rhs = c("Mito_LHON", "Mito_Leigh", "Mito_Myopathy", 
                                                                                   "Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes", 
                                                                                   "Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs")))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.65    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 133 
## 
## set item appearances ...[9 item(s)] done [0.00s].
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [3.32s].
## writing ... [42 rule(s)] done [0.05s].
## creating S4 object  ... done [0.05s].
gene_rules_combined_disorder_byconf <- sort(gene_rules_combined_disorder, by="confidence", decreasing=TRUE)
inspect(head(gene_rules_combined_disorder_byconf))
##     lhs                     rhs                        support confidence   coverage     lift count
## [1] {Male,                                                                                         
##      Multiple,                                                                                     
##      slightly abnormal,                                                                            
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.7317073 0.01536041 3.501386   150
## [2] {Male,                                                                                         
##      slightly abnormal,                                                                            
##      Symptom_4,                                                                                    
##      Tachycardia}        => {Single_CysticFibrosis} 0.01161397  0.7311321 0.01588491 3.498633   155
## [3] {Alive,                                                                                        
##      Low,                                                                                          
##      Male,                                                                                         
##      Multiple,                                                                                     
##      Symptom_4}          => {Single_CysticFibrosis} 0.01004046  0.6943005 0.01446126 3.322386   134
## [4] {Low,                                                                                          
##      Male,                                                                                         
##      slightly abnormal,                                                                            
##      Symptom_4}          => {Single_CysticFibrosis} 0.01213847  0.6923077 0.01753334 3.312850   162
## [5] {Normal (30-60),                                                                               
##      slightly abnormal,                                                                            
##      Symptom_4,                                                                                    
##      Tachycardia}        => {Single_CysticFibrosis} 0.01011539  0.6887755 0.01468605 3.295948   135
## [6] {Assisted_Yes,                                                                                 
##      Male,                                                                                         
##      slightly abnormal,                                                                            
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.6880734 0.01633448 3.292588   150
plot(gene_rules_combined_disorder_byconf, method="paracoord", control=list(reorder=TRUE))
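
The quality columns printed by inspect() are linked by simple arithmetic, which helps when reading the table above. The sketch below recomputes them for rule [1], using only numbers shown in the output (the transaction total comes from the "set transactions" log line). The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the output above for rule [1]
n_transactions <- 13346       # total transactions in trans1
count_rule     <- 150         # transactions containing LHS and RHS together
coverage       <- 0.01536041  # support of the LHS alone
lift           <- 3.501386

# support = share of all transactions that contain the full rule
support <- count_rule / n_transactions

# confidence = support(LHS and RHS) / support(LHS)
confidence <- support / coverage

# lift = confidence / support(RHS), so the RHS support can be backed out
rhs_support <- confidence / lift

round(c(support = support, confidence = confidence, rhs_support = rhs_support), 4)
```

A lift of about 3.5 therefore means the disorder appears roughly 3.5 times more often among transactions matching the LHS than in the dataset overall.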

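# FP-Growth
# Generate frequent itemsets for the FP-Growth comparison.
# Note: arules' eclat() implements the Eclat algorithm, not FP-Growth. It is used here as a
# stand-in: for the same support and maxlen settings both algorithms return the same frequent itemsets
frequent_itemsets <- eclat(trans1, parameter = list(supp = 0.01, maxlen = 5))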
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 133 
## 
## create itemset ... 
## set transactions ...[83 item(s), 13346 transaction(s)] done [0.01s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 13346 column(s)] done [0.00s].
## writing  ... [873336 set(s)] done [0.43s].
## Creating S4 object  ... done [0.05s].
# View frequent itemsets
inspect(head(frequent_itemsets, n = 10))
##      items                                              support    count
## [1]  {Genes_Yes, Mito_LHON, Symptom_5}                  0.01063989 142  
## [2]  {Maternal_Yes, Mito_LHON, Symptom_5}               0.01041511 139  
## [3]  {Genes_Yes, Mito_LHON, Symptom_4}                  0.01041511 139  
## [4]  {Genes_Yes, Inherited_Yes, Mito_LHON}              0.01176382 157  
## [5]  {Inherited_Yes, Maternal_Yes, Mito_LHON}           0.01176382 157  
## [6]  {Genes_Yes, Maternal_Yes, Mito_LHON, Paternal_Yes} 0.01071482 143  
## [7]  {Genes_Yes, Mito_LHON, Paternal_Yes}               0.01348719 180  
## [8]  {Maternal_Yes, Mito_LHON, Paternal_Yes}            0.01311254 175  
## [9]  {Genes_Yes, Mito_LHON, Tachypnea}                  0.01034018 138  
## [10] {Genes_Yes, History_No, Mito_LHON}                 0.01019032 136
# Define desired RHS items
desired_rhs <- c("Mito_LHON", "Mito_Leigh", "Mito_Myopathy", "Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes", 
                 "Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs")
# Induce rules from the frequent itemsets (all consequents; the desired RHS is applied in the next step)
rules_fp_growth <- ruleInduction(frequent_itemsets, trans1, confidence = 0.65)
# Filter rules to include only the desired RHS
filtered_rules <- subset(rules_fp_growth, rhs %in% desired_rhs)
# View the filtered rules
filtered_rules_byconf <- sort(filtered_rules, by="confidence", decreasing=TRUE)
inspect(head(filtered_rules_byconf))
##     lhs                     rhs                        support confidence     lift itemset
## [1] {Male,                                                                                
##      Multiple,                                                                            
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.7317073 3.501386  154043
## [2] {Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4,                                                                           
##      Tachycardia}        => {Single_CysticFibrosis} 0.01161397  0.7311321 3.498633  154022
## [3] {Low,                                                                                 
##      Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01213847  0.6923077 3.312850  154054
## [4] {Normal (30-60),                                                                      
##      slightly abnormal,                                                                   
##      Symptom_4,                                                                           
##      Tachycardia}        => {Single_CysticFibrosis} 0.01011539  0.6887755 3.295948  154018
## [5] {Assisted_Yes,                                                                        
##      Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01123932  0.6880734 3.292588  154055
## [6] {H_Yes,                                                                               
##      Male,                                                                                
##      slightly abnormal,                                                                   
##      Symptom_4}          => {Single_CysticFibrosis} 0.01161397  0.6858407 3.281904  154056
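# FP-Growth reproduces most of the Apriori rules. The missing ones, such as Apriori rule [3] above
# (5 LHS items + 1 RHS item), contain 6 items and are cut off by maxlen = 5 in eclat(),
# whereas apriori() ran with its default maxlen = 10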
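
One way to check which Apriori rules are missing from the itemset-based run is to compare the two rule sets by their labels and look at the length of the rules that differ. The sketch below does this on a small made-up transaction set (toy_trans and the other toy_* objects are hypothetical and are not the gene data), with the itemset length capped at 2 to mimic the effect of a maxlen cap.

```r
library(arules)

# Hypothetical toy transactions (not the gene dataset)
toy_trans <- as(list(
  c("A", "B", "C"),
  c("A", "B", "C"),
  c("A", "B", "C"),
  c("A", "B"),
  c("C")
), "transactions")

# Apriori with its default maxlen, so rules with 3 items are allowed
toy_apriori <- apriori(toy_trans,
                       parameter = list(support = 0.4, confidence = 0.6, minlen = 2),
                       control = list(verbose = FALSE))

# Itemset mining capped at 2 items, then rule induction at the same confidence
toy_itemsets <- eclat(toy_trans,
                      parameter = list(supp = 0.4, maxlen = 2),
                      control = list(verbose = FALSE))
toy_induced <- ruleInduction(toy_itemsets, toy_trans, confidence = 0.6)

# Rules found by Apriori but absent from the induced set, matched by label
only_apriori <- toy_apriori[!(labels(toy_apriori) %in% labels(toy_induced))]
inspect(only_apriori)

# Rule length (LHS + RHS items) of the missing rules
table(size(only_apriori))
```

Applied to the report's own objects, the same comparison would use gene_rules_combined_disorder and filtered_rules in place of the toy rule sets.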

# Plot the filtered rules
plot(filtered_rules, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")

# Grouped matrix plot
plot(filtered_rules, method = "grouped")

# Parallel coordinate plot
plot(filtered_rules, method = "paracoord", control = list(reorder = TRUE))

8 Apriori and FP-Growth performance comparison

# Define file paths
performance_apriori_path <- 'performance_apriori.rds'
performance_fp_growth_path <- 'performance_fp_growth.rds'

# Apriori Performance
if (file.exists(performance_apriori_path)) {
  gene_rules_apriori <- readRDS(performance_apriori_path)
  message("Loaded Apriori rules from 'performance_apriori.rds'.")
} else {
  time_apriori <- system.time({
    gene_rules_apriori <- apriori(
      data = trans1, 
      parameter = list(support = 0.01, confidence = 0.5, minlen = 2), 
      appearance = list(default = "lhs", rhs = c(
        "Mito_LHON", "Mito_Leigh", "Mito_Myopathy", 
        "Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes", 
        "Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs"
      )),
      control = list(verbose = FALSE)
    )
  })
  # Save outside the timed block so that disk I/O is not counted in the runtime
  saveRDS(gene_rules_apriori, file = performance_apriori_path)
  message(paste("Time taken by Apriori:", round(time_apriori["elapsed"], 2), "seconds"))
}
## Loaded Apriori rules from 'performance_apriori.rds'.
# FP-Growth Performance
if (file.exists(performance_fp_growth_path)) {
  filtered_rules_fp_growth <- readRDS(performance_fp_growth_path)
  message("Loaded FP-Growth rules from 'performance_fp_growth.rds'.")
} else {
  time_fp_growth <- system.time({
    frequent_itemsets_fp_growth <- eclat(trans1, parameter = list(supp = 0.01, maxlen = 5))
    rules_fp_growth <- ruleInduction(frequent_itemsets_fp_growth, trans1, confidence = 0.5)
    filtered_rules_fp_growth <- subset(rules_fp_growth, rhs %in% c(
      "Mito_LHON", "Mito_Leigh", "Mito_Myopathy", 
      "Multi_Alzheimer", "Multi_Cancer", "Multi_Diabetes", 
      "Single_CysticFibrosis", "Single_Hemochromatosis", "Single_TaySachs"
    ))
  })
  # Save outside the timed block so that disk I/O is not counted in the runtime
  saveRDS(filtered_rules_fp_growth, file = performance_fp_growth_path)
  message(paste("Time taken by FP-Growth:", round(time_fp_growth["elapsed"], 2), "seconds"))
}
## Loaded FP-Growth rules from 'performance_fp_growth.rds'.
# Inspect the first 10 Apriori rules (head() without sort() returns them in mining order, not ranked)
inspect(head(gene_rules_apriori, n = 10))
##      lhs                     rhs                        support confidence   coverage     lift count
## [1]  {No_1,                                                                                         
##       Symptom_4}          => {Single_CysticFibrosis} 0.01768320  0.5010616 0.03529147 2.397694   236
## [2]  {slightly abnormal,                                                                            
##       Symptom_4}          => {Single_CysticFibrosis} 0.03176982  0.5880721 0.05402368 2.814059   424
## [3]  {Middle Childhood,                                                                             
##       Symptom_4}          => {Single_CysticFibrosis} 0.03004646  0.5031368 0.05971827 2.407624   401
## [4]  {Symptom_4,                                                                                    
##       Tachycardia}        => {Single_CysticFibrosis} 0.05132624  0.5297757 0.09688296 2.535097   685
## [5]  {Home,                                                                                         
##       Symptom_4}          => {Single_CysticFibrosis} 0.05140117  0.5138577 0.10002997 2.458926   686
## [6]  {Multiple,                                                                                     
##       Symptom_4}          => {Single_CysticFibrosis} 0.05162596  0.5416667 0.09530946 2.591998   689
## [7]  {Male,                                                                                         
##       Symptom_4}          => {Single_CysticFibrosis} 0.05627154  0.5356633 0.10505020 2.563271   751
## [8]  {H_Yes,                                                                                        
##       Symptom_4}          => {Single_CysticFibrosis} 0.05155103  0.5119048 0.10070433 2.449581   688
## [9]  {Assisted_Yes,                                                                                 
##       Symptom_4}          => {Single_CysticFibrosis} 0.05177581  0.5010877 0.10332684 2.397819   691
## [10] {Low,                                                                                          
##       Symptom_4}          => {Single_CysticFibrosis} 0.05312453  0.5213235 0.10190319 2.494652   709
# Inspect the first 10 FP-Growth rules (also unsorted)
inspect(head(filtered_rules_fp_growth, n = 10))
##      lhs                rhs                        support confidence     lift itemset
## [1]  {Inherited_No,                                                                   
##       No_1,                                                                           
##       Symptom_4}     => {Single_CysticFibrosis} 0.01049003  0.5166052 2.472073   54796
## [2]  {Genes_Yes,                                                                      
##       No_1,                                                                           
##       Symptom_4}     => {Single_CysticFibrosis} 0.01176382  0.5097403 2.439223   54797
## [3]  {Maternal_Yes,                                                                   
##       No_1,                                                                           
##       Symptom_4}     => {Single_CysticFibrosis} 0.01123932  0.5050505 2.416782   54798
## [4]  {History_Yes,                                                                    
##       No_1,                                                                           
##       Symptom_4}     => {Single_CysticFibrosis} 0.01034018  0.5587045 2.673528   54799
## [5]  {Male,                                                                           
##       No_1,                                                                           
##       Symptom_4}     => {Single_CysticFibrosis} 0.01176382  0.5528169 2.645355   54800
## [6]  {Multiple,                                                                       
##       No_1,                                                                           
##       Symptom_4}     => {Single_CysticFibrosis} 0.01049003  0.5645161 2.701338   54801
## [7]  {No_1,                                                                           
##       Symptom_4,                                                                      
##       Tachycardia}   => {Single_CysticFibrosis} 0.01026525  0.5546559 2.654155   54802
## [8]  {No_1,                                                                           
##       Symptom_4}     => {Single_CysticFibrosis} 0.01768320  0.5010616 2.397694   54926
## [9]  {Low,                                                                            
##       No_1,                                                                           
##       Symptom_3}     => {Mito_Leigh}            0.01288776  0.5043988 1.958600   58214
## [10] {Institute,                                                                      
##       No_1,                                                                           
##       Symptom_3}     => {Mito_Leigh}            0.01296269  0.5029070 1.952807   58215
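# Compare times
# The timing messages are printed only when the rules are computed. In this run both rule sets
# were loaded from the cached .rds files, so no runtimes were reported above; delete the cached
# files and re-run to obtain them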
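
Because the cached branch skips system.time(), the runtimes are lost once the .rds files exist. A minimal sketch of one way around this is to store the elapsed time next to the result, so the comparison survives a re-run. The helper timed_cache() and the demo computation are hypothetical and are not part of the original report.

```r
# Hypothetical helper: compute once, cache both the result and its runtime
timed_cache <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))  # cached list(result, elapsed)
  }
  # Time only the computation itself, not the write to disk
  elapsed <- system.time(result <- compute())[["elapsed"]]
  cached <- list(result = result, elapsed = elapsed)
  saveRDS(cached, path)
  cached
}

# Demo with a stand-in computation instead of apriori()/eclat()
demo_path <- tempfile(fileext = ".rds")
first  <- timed_cache(demo_path, function() sum(sqrt(1:1e5)))
second <- timed_cache(demo_path, function() stop("not re-run when cached"))

message("Elapsed time read back from cache: ", round(second$elapsed, 2), " seconds")
```

In the report, the compute function would wrap the apriori() call or the eclat() plus ruleInduction() steps, and the two elapsed values could then be compared directly.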

9 Conclusion

Based on the results in this report, several association rules provide interpretable insights:

  1. Some genetic disorders have rules with lower support but a significant confidence level (mostly above 60%):

    • Mito_LHON
    • Multi_Alzheimer
    • Multi_Diabetes
    • Single_Hemochromatosis
    • Single_TaySachs
  2. Some genetic disorders have rules with higher support (above 0.01) and a noticeable confidence level (all above 50%):

    • Mito_Leigh
    • Mito_Myopathy
    • Single_CysticFibrosis
  3. One disorder, Multi_Cancer, has no significant rules under either the Apriori or the FP-Growth approach.

  4. Rules with lower support but significant confidence indicate that the antecedent items are rare but reliable predictors of the consequent.

  5. The rules with higher support describe combinations that occur more often among the children in the dataset (patients are aged up to 14). Their confidence levels above 50% indicate which combinations of factors are more likely to occur together with the disorder.

  6. As mentioned at the beginning, this report can only offer potential association rules for researchers to refer to (for example, the top 5 rules for each disorder). It cannot provide biological or medical proof that these factors are correlated with the disorders.

Based on the comparison, Apriori often yields more rules and makes it easier to control the desired outcome, namely finding association rules with the genetic disorders as the consequent items.
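
The "top 5 rules for each disorder" mentioned in point 6 can be extracted by ranking the rules within each consequent. The sketch below shows the idea in base R on a small data frame whose values are copied from the FP-Growth output in section 8. It assumes the rules have already been exported to a data frame with one row per rule; the function name top_n_per_rhs() is hypothetical.

```r
# A few rules copied from the section 8 output (one row per rule)
rules_df <- data.frame(
  lhs = c("{Inherited_No, No_1, Symptom_4}",
          "{History_Yes, No_1, Symptom_4}",
          "{Multiple, No_1, Symptom_4}",
          "{Low, No_1, Symptom_3}",
          "{Institute, No_1, Symptom_3}"),
  rhs = c("Single_CysticFibrosis", "Single_CysticFibrosis", "Single_CysticFibrosis",
          "Mito_Leigh", "Mito_Leigh"),
  confidence = c(0.5166052, 0.5587045, 0.5645161, 0.5043988, 0.5029070),
  lift       = c(2.472073, 2.673528, 2.701338, 1.958600, 1.952807),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n highest-confidence rules for each consequent
top_n_per_rhs <- function(df, n = 5) {
  ordered <- df[order(df$rhs, -df$confidence), ]
  do.call(rbind, lapply(split(ordered, ordered$rhs), head, n = n))
}

# n = 2 here only because the demo table is small; the report would use n = 5
top2 <- top_n_per_rhs(rules_df, n = 2)
print(top2, row.names = FALSE)
```

Ranking by lift instead of confidence only requires swapping the column used in order().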