Project3_association

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)

1. Introduction

This project applies association rule mining to analyze fathers’ knowledge about child nutrition and animal-source foods. The data come from a large survey conducted in Rwanda and focus on foods that are important for child growth and development.

The aim of the analysis is to identify common combinations of food items mentioned by fathers and to explore meaningful relationships between these items using association rules.

2. Data Description and Variable Selection

This project uses survey data from the Engaging Men study, obtained from Harvard Dataverse. The dataset includes fathers’ responses to questions about child feeding and animal-source foods. For the purpose of association rule mining, six food-related variables are selected. These variables represent key food groups that are important for child nutrition and are suitable for analyzing co-occurrence patterns.

Load the raw survey data

data_raw <- read.delim("Engaging_men_baseend_labeled.tab", sep = "\t", header = TRUE, stringsAsFactors = FALSE)

Inspect the structure of the dataset

head(data_raw)

Selected Food Variables

The following six variables are used in the analysis: Milk, Meat, Fish, Eggs, Porridge with milk, and Fruits. These variables capture both animal-source foods and complementary foods and are therefore relevant for identifying meaningful association rules.

data_raw <- read.delim(
  "Engaging_men_baseend_labeled.tab",
  sep = "\t",
  header = TRUE,
  stringsAsFactors = FALSE
)

library(dplyr)

# Select and rename variables to match the analysis description
data_food <- data_raw %>%
  select(
    Milk      = q_205_1,
    Meat      = q_205_2,
    Fish      = q_205_3,
    Eggs      = q_205_4,
    Porridge  = q_204_2,  # "Porridge with milk"
    Fruits    = q_204_6
  )

# Preview the first few rows
head(data_food)

##   Milk Meat Fish Eggs Porridge Fruits
## 1    1    1    1    1        0      1
## 2    1    1    0    1        0      1
## 3    1    0    0    0        0      0
## 4    1    1    1    0        1      1
## 5    1    1    0    0        0      0
## 6    0    1    1    1        1      1

3. Data Preparation for Association Rules

Convert the selected food variables into a transaction format. Each row represents one father, and each item represents a food mentioned.
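Before the conversion, it can help to confirm that every column is coded strictly as 0/1, because a comparison such as `x == 1` returns NA for a missing answer. The following is a minimal sketch on a hypothetical toy data frame; `toy_food` and its values are illustrative and are not survey data, and recoding missing answers as "not mentioned" is an assumption rather than the study's rule.

```r
# Hypothetical toy data with the same 0/1 coding; NA marks a missing answer
toy_food <- data.frame(
  Milk = c(1, 1, 0, NA),
  Meat = c(1, 0, 1, 1),
  Eggs = c(0, 1, 1, 1)
)

# Count missing values per column before converting
missing_per_item <- colSums(is.na(toy_food))

# Logical matrix: TRUE where the food was mentioned
toy_logical <- as.matrix(toy_food) == 1

# One possible choice (an assumption): treat a missing answer as "not mentioned"
toy_logical[is.na(toy_logical)] <- FALSE

missing_per_item
toy_logical
```

If `missing_per_item` is zero for every column in the real data, the recoding step has no effect and can be dropped.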

4. Transaction Data Creation

The binary food variables are converted into transaction data.

library(arules)

data_food_logical <- data_food == 1
food_transactions <- as(data_food_logical, "transactions")

summary(food_transactions)

## transactions as itemMatrix in sparse format with
##  298 rows (elements/itemsets/transactions) and
##  6 columns (items) and a density of 0.606264 
## 
## most frequent items:
##    Meat    Eggs  Fruits    Milk    Fish (Other) 
##     271     215     202     192     136      68 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6 
##  15  23  83 118  52   7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.638   4.000   6.000 
## 
## includes extended item information - examples:
##   labels
## 1   Milk
## 2   Meat
## 3   Fish
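
The density and mean transaction length in the summary can be reproduced from the length distribution alone. The sketch below is only a consistency check on the reported output, with the counts copied by hand from the summary.

```r
# Transaction length distribution copied from the summary output
sizes  <- c(1, 2, 3, 4, 5, 6)
counts <- c(15, 23, 83, 118, 52, 7)

n_transactions <- sum(counts)          # number of fathers
n_items_total  <- sum(sizes * counts)  # total number of food mentions
n_columns      <- 6                    # number of distinct food items

# Density: share of cells in the 298 x 6 matrix that are TRUE
density     <- n_items_total / (n_transactions * n_columns)
mean_length <- n_items_total / n_transactions

round(density, 6)      # 0.606264
round(mean_length, 3)  # 3.638
```

Both values match the summary, so on average a father mentions between three and four of the six foods.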

5. Association Rule Mining

The Apriori algorithm is applied to identify frequent food combinations.

rules <- apriori(
  food_transactions,
  parameter = list(
    support = 0.1,
    confidence = 0.6,
    minlen = 2
  )
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 298 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [64 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

length(rules)

## [1] 64

6. Rule Inspection

The rules are sorted by confidence to highlight the strongest associations.

rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(rules_conf[1:10])

##      lhs                         rhs    support   confidence coverage  lift    
## [1]  {Fish, Porridge}         => {Meat} 0.1073826 1.0000000  0.1073826 1.099631
## [2]  {Milk, Porridge}         => {Meat} 0.1308725 0.9750000  0.1342282 1.072140
## [3]  {Eggs, Porridge}         => {Meat} 0.1778523 0.9636364  0.1845638 1.059644
## [4]  {Porridge}               => {Meat} 0.2181208 0.9558824  0.2281879 1.051118
## [5]  {Porridge, Fruits}       => {Meat} 0.1442953 0.9555556  0.1510067 1.050759
## [6]  {Eggs, Porridge, Fruits} => {Meat} 0.1174497 0.9459459  0.1241611 1.040191
## [7]  {Eggs}                   => {Meat} 0.6812081 0.9441860  0.7214765 1.038256
## [8]  {Eggs, Fruits}           => {Meat} 0.4798658 0.9407895  0.5100671 1.034521
## [9]  {Fish, Eggs}             => {Meat} 0.3456376 0.9363636  0.3691275 1.029654
## [10] {Fish, Eggs, Fruits}     => {Meat} 0.2416107 0.9350649  0.2583893 1.028226
##      count
## [1]   32  
## [2]   39  
## [3]   53  
## [4]   65  
## [5]   43  
## [6]   35  
## [7]  203  
## [8]  143  
## [9]  103  
## [10]  72
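
To make the quality measures concrete, the values for rule [7], {Eggs} => {Meat}, can be recomputed from raw counts. The sketch below takes the item frequencies from the summary in Section 4 and the rule count from the table above; it illustrates the standard definitions and is not part of the original analysis.

```r
n          <- 298  # total transactions
count_eggs <- 215  # transactions containing Eggs (left-hand side)
count_meat <- 271  # transactions containing Meat (right-hand side)
count_both <- 203  # transactions containing both Eggs and Meat

support    <- count_both / n                 # P(Eggs and Meat)
coverage   <- count_eggs / n                 # P(Eggs)
confidence <- count_both / count_eggs        # P(Meat | Eggs)
lift       <- confidence / (count_meat / n)  # P(Meat | Eggs) / P(Meat)

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 6)
```

Because Meat is mentioned in 271 of 298 transactions (about 91%), any rule with Meat on the right-hand side starts from a high baseline, which is why the lift of this rule stays close to 1 despite its confidence of 0.94.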

7. Rule Visualization and Analysis

This section presents different visualizations to better understand the association rules from several perspectives.

7.1 Support–Confidence Scatter Plot

Before looking at individual rules, it is useful to examine the overall distribution of all generated rules. Support and confidence are the two main quality measures, and their relationship helps to evaluate whether the chosen parameters are reasonable.

plot(
  rules,
  measure = c("support", "confidence"),
  shading = "lift"
)

The plot shows that most rules have relatively low support but moderate to high confidence. This indicates that many food combinations are not very frequent, but when they occur, they are relatively reliable. The use of lift as shading helps to identify stronger rules among them.

7.2 Top Rules by Lift

While support and confidence describe frequency and reliability, lift measures how strong an association is compared to random co-occurrence. Therefore, examining rules with the highest lift allows us to focus on the most informative patterns.

plot(
  sort(rules, by = "lift", decreasing = TRUE)[1:10],
  measure = "lift"
)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The top rules show lift values clearly greater than 1, which suggests meaningful associations between food items. These rules are stronger than what would be expected by chance and are therefore more interesting for interpretation.

7.3 Rule Network Graph

To better understand how different food items are connected, a network graph is used. This visualization focuses on the structure of the strongest rules rather than their numerical values.

library(arulesViz)

plot(
  rules_conf[1:10],
  method = "graph",
  engine = "htmlwidget"
)

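In the network graph of the ten highest-confidence rules, Meat is the consequent of every rule and therefore forms the central node, while Eggs and Porridge appear most often on the left-hand side (six rules each). This reflects how often Meat is mentioned together with the other foods and suggests that it plays an important role in the reported child nutrition patterns.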

7.4 Rule Length Distribution

Rule length shows how many items are included in a rule. Analyzing rule length helps to understand how complex the rules are and whether they are easy to interpret.

# Calculate rule length manually
rule_length <- size(rules)

# Plot the distribution of rule length
hist(
  rule_length,
  breaks = seq(min(rule_length) - 0.5, max(rule_length) + 0.5, by = 1),
  xlab = "Rule length (number of items)",
  ylab = "Frequency",
  main = "Distribution of Rule Length"
)

7.5 Support Threshold Sensitivity Analysis

The choice of the support threshold strongly affects the number of generated rules. To understand this effect, a simple sensitivity analysis is conducted by varying the support value.

support_values <- c(0.05, 0.1, 0.15, 0.2)

rule_counts <- sapply(
  support_values,
  function(s) {
    length(
      apriori(
        food_transactions,
        parameter = list(
          support = s,
          confidence = 0.6
        )
      )
    )
  }
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 14 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 298 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [92 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 298 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [68 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.15      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 44 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 298 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [51 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5     0.2      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 298 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [43 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

plot(
  support_values,
  rule_counts,
  type = "b",
  xlab = "Support threshold",
  ylab = "Number of rules"
)

As the support threshold increases from 0.05 to 0.2, the number of rules falls from 92 to 43 (92, 68, 51, and 43 rules at thresholds of 0.05, 0.1, 0.15, and 0.2). This shows a clear trade-off between capturing more patterns and keeping only the most frequent ones. Note that these runs use the default minlen = 1, so the counts also include rules with an empty left-hand side and are not directly comparable with the 64 rules obtained in Section 5 with minlen = 2. Based on this sensitivity analysis, a support threshold of 0.1 was selected as it provides a reasonable balance between capturing a sufficient number of rules and maintaining interpretability.
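
The gap between the 68 rules reported above at a support of 0.1 and the 64 rules from Section 5 can be checked by hand. With minlen = 1, apriori also returns rules of the form {} => {item}, whose confidence equals the item's support, so every item mentioned by at least 60% of fathers adds one rule. The sketch below uses the item counts from the summary output in Section 4; reading the Porridge count of 68 from the "(Other)" column is an assumption that holds only because Porridge is the single remaining item.

```r
n <- 298  # number of transactions

# Item counts from the summary output (Porridge is the "(Other)" entry)
item_counts <- c(Meat = 271, Eggs = 215, Fruits = 202,
                 Milk = 192, Fish = 136, Porridge = 68)

item_support <- item_counts / n

# Each item whose support meets the confidence threshold of 0.6 yields
# one rule with an empty left-hand side when minlen = 1
empty_lhs_items <- names(item_support[item_support >= 0.6])

empty_lhs_items
length(empty_lhs_items)
```

Four items (Meat, Eggs, Fruits, and Milk) pass the threshold, which accounts for the difference of four rules (68 - 4 = 64).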

7.6 Parallel Coordinates Plot of Rules

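In arulesViz, the parallel coordinates plot draws each rule as a line that passes through its left-hand-side items, ordered by position, and ends at the right-hand-side item. Line width and color intensity encode rule quality measures, so the plot shows which item sequences lead to which consequents and how the quality of those rules varies.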

library(arulesViz)

plot(
  rules,
  method = "paracoord",
  control = list(
    reorder = TRUE
  )
)

The parallel plot shows that rules with higher confidence often have lower support. At the same time, rules with higher lift tend to stand out from the rest. This confirms the trade-off between frequency and strength in association rule mining.

7.7 Rule-based Interpretation Example

To further interpret the extracted association rules, we focus on a specific food item and examine which other foods are most likely to co-occur with it. In particular, rules involving meat show that it frequently appears together with eggs and milk-based foods. This suggests that fathers tend to report animal-source foods as part of broader food combinations rather than isolated items. Such patterns indicate that food choices are often structured around common dietary bundles, which may be relevant when interpreting reported child nutrition practices.
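
One way to carry out this kind of item-focused inspection is to filter the rule set by its right-hand side. The sketch below assumes the arules package and uses a small hypothetical data set; `toy`, `toy_rules`, and the thresholds are illustrative and are not taken from the survey. The same `subset()` call could be applied to the `rules` object from Section 5.

```r
library(arules)

# Hypothetical toy data: 8 respondents, TRUE = food mentioned
toy <- matrix(
  c(TRUE,  TRUE,  TRUE,
    TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  TRUE,
    FALSE, TRUE,  TRUE,
    TRUE,  FALSE, FALSE,
    FALSE, TRUE,  FALSE,
    FALSE, TRUE,  TRUE,
    TRUE,  TRUE,  TRUE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Milk", "Meat", "Eggs"))
)
toy_trans <- as(toy, "transactions")

# Mine rules quietly; the thresholds here are chosen for the toy data only
toy_rules <- apriori(
  toy_trans,
  parameter = list(support = 0.25, confidence = 0.6, minlen = 2),
  control = list(verbose = FALSE)
)

# Keep only rules that predict Meat, then order them by lift
meat_rules <- subset(toy_rules, subset = rhs %in% "Meat")
inspect(sort(meat_rules, by = "lift", decreasing = TRUE))
```

Filtering on `lhs %in% "Meat"` instead would list the foods that tend to follow a mention of meat, which is the complementary view.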

7.8 Overall Summary of Rule Analysis

Overall, the different visualizations provide a comprehensive view of the generated association rules and highlight clear trade-offs between support, confidence, and lift. The results indicate that stronger rules tend to be less frequent but offer more informative insights into food co-occurrence patterns. The network and parallel coordinate plots further help to reveal the structure and quality of the rules from multiple perspectives, making it easier to identify central food items and dominant associations. In addition, the sensitivity analysis confirms that the chosen parameter values achieve a reasonable balance between the number of extracted rules and their interpretability. Taken together, these findings suggest that fathers tend to report child nutrition not as isolated food items, but as combinations of foods that commonly appear together. Such patterns provide meaningful insights into reported dietary practices and support further interpretation in the context of child nutrition.

8. Summary

This project applies association rule mining to explore patterns in fathers’ reported child nutrition practices. Using survey data from the Engaging Men study, a set of food-related variables was selected and transformed into transaction data suitable for association rule analysis. The Apriori algorithm was used to identify frequent food combinations, and the resulting rules were evaluated using support, confidence, and lift. Multiple visualizations, including scatter plots, network graphs, parallel coordinate plots, and sensitivity analysis, were employed to examine the quality, structure, and robustness of the extracted rules. The analysis highlights clear trade-offs between rule frequency and strength and shows that meaningful rules often involve combinations of animal-source and complementary foods. Overall, the results suggest that reported food choices are structured around common food bundles rather than isolated items, providing useful insights into reported child nutrition patterns.

Project3_association_rules

LINLIN MAO

2026-01-21