Association Rules on European Social Survey (homework)

Introduction

This homework focuses on applying association rule mining to the European Social Survey (ESS) Round 11 dataset. The goal is to uncover meaningful patterns and relationships between variables, particularly focusing on the variable “Feeling about household’s income nowadays” (hincfel) as the rule consequent. By using data from a wide geographical range, the analysis aims to identify key factors influencing household income perceptions, such as economic satisfaction, employment stability, and personal happiness.

Load and Inspect the Dataset

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Checking Broadest Geographical Coverage

table(df$cntry)

## 
##   AT   BE   CH   CY   DE   ES   FI   FR   GB   GR   HR   HU   IE   IS   IT   LT 
## 2354 1594 1384  685 2420 1844 1563 1771 1684 2757 1563 2118 2017  842 2865 1365 
##   NL   NO   PL   PT   RS   SE   SI   SK 
## 1695 1337 1442 1373 1563 1230 1248 1442

Important Observations and Challenges:

The dataset contains many missing values (NAs). After inspecting the data, I filled in some of them with the most common value of the respective variable, which makes the dataset more complete and usable for analysis.
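The imputation step described above is not shown in the code, so the following is only a minimal sketch of how missing values could be replaced with the most common value of each variable. The helper name impute_mode and the toy data frame are hypothetical and not part of the original analysis; ties are resolved in favour of the value that appears first.

```r
# Hypothetical helper: replace NAs in a vector with its most frequent observed value.
impute_mode <- function(x) {
  observed <- x[!is.na(x)]
  if (length(observed) == 0) return(x)          # nothing to impute from
  candidates <- unique(observed)
  counts <- tabulate(match(observed, candidates))
  x[is.na(x)] <- candidates[which.max(counts)]  # first value wins on ties
  x
}

# Toy data for illustration only (not ESS values)
toy <- data.frame(happy = c(8, 9, NA, 9), vote = c(1, NA, 1, 2))
toy_filled <- as.data.frame(lapply(toy, impute_mode))
print(toy_filled)
```

On the real data the same helper could be applied to selected columns only, since imputing every variable this way would strengthen whichever category is already dominant.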

For the variable hincfel (feeling about household income), building rules on all of its original levels was not practical because it did not produce strong patterns. I therefore grouped the levels into three categories: Low (codes 1 and 2), Medium (codes 3 and 4), and High (codes 7, 8, and 9). This made the variable easier to work with and allowed me to generate more meaningful rules.

I also noticed that employment status (wrkctra) had too many categories, which made the patterns less clear. To fix this, I grouped them into just two categories: Stable (codes 1, 2, and 3) and Unstable (everything else, including missing values). This produced stronger and clearer relationships between employment status and the household income variable.

These changes really helped make the dataset easier to work with and allowed me to focus on finding better patterns and insights in the data.

Cleaning the dataset

df_clean <- df %>%
  mutate(across(everything(), ~ replace(., . %in% c(66, 6666, 8888, 9999, 999), NA))) %>%
  filter(!is.na(hincfel))

# Reduce the number of unique values by grouping levels

# Group age into broader bands; intervals are right-closed, so age 30 falls in "15-30"
df_clean$agea <- cut(df_clean$agea, breaks = c(15, 30, 45, 60, 75, 100),
                     labels = c("15-30", "31-45", "46-60", "61-75", "76+"), include.lowest = TRUE)

# Group education codes into general categories: [0, 300], (300, 600], (600, 9000]
df_clean$edulvlb <- cut(df_clean$edulvlb, breaks = c(0, 300, 600, 9000),
                        labels = c("Low", "Medium", "High"), include.lowest = TRUE)

# Reduce `hincfel` (feeling about household income) to three labels;
# any code not listed below becomes NA
df_clean$hincfel <- case_when(
  df_clean$hincfel %in% c(1, 2) ~ "Low",
  df_clean$hincfel %in% c(3, 4) ~ "Medium",
  df_clean$hincfel %in% c(7, 8, 9) ~ "High",
  TRUE ~ NA_character_
)

# Reduce employment status to two labels;
# %in% returns FALSE for NA, so missing values are labelled "Unstable"
df_clean$wrkctra <- ifelse(df_clean$wrkctra %in% c(1, 2, 3), "Stable", "Unstable")

sum(is.na(df_clean$hincfel))

## [1] 0
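As a quick sanity check on the grouping logic above, the same cut() and ifelse() calls can be run on a few hand-picked values. The input vectors are made up for illustration; the point is to see where boundary ages such as 30 and 45 land and how a missing contract code is labelled.

```r
# Hand-picked ages around the band boundaries (illustrative values only)
ages <- c(15, 30, 31, 45, 76)
age_band <- cut(ages, breaks = c(15, 30, 45, 60, 75, 100),
                labels = c("15-30", "31-45", "46-60", "61-75", "76+"),
                include.lowest = TRUE)
print(as.character(age_band))

# Contract codes, including a non-listed code and a missing value
contract <- c(1, 3, 6, NA)
contract_label <- ifelse(contract %in% c(1, 2, 3), "Stable", "Unstable")
print(contract_label)
```

Because the intervals are right-closed, ages 30 and 45 stay in the lower of the two adjacent bands, and both the non-listed code and the NA are labelled "Unstable".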

Convert Data for Association Rule Mining

Because the dataset contains too many variables and my goal was to generate rules specifically about hincfel, I selected only the relevant variables and then prepared them for the Apriori algorithm.

# Selecting relevant categorical variables
df_selected <- df_clean %>%
  select(hincfel, cntry, gndr, agea, marsts, edulvlb, vote, happy, 
         stfeco, gincdif, freehms, hmsacld, pdwrk, uempla, uemp12m, wrkctra) %>%
  mutate(across(everything(), as.factor))  # Convert to factors


# Convert dataset into transaction format
df_transactions <- as(df_selected, "transactions")

inspect(df_transactions[1:5])  # View first few transactions

##     items              transactionID
## [1] {hincfel=Low,                   
##      cntry=AT,                      
##      gndr=1,                        
##      agea=61-75,                    
##      edulvlb=Medium,                
##      vote=1,                        
##      happy=8,                       
##      stfeco=6,                      
##      gincdif=2,                     
##      freehms=2,                     
##      hmsacld=3,                     
##      pdwrk=0,                       
##      uempla=0,                      
##      uemp12m=6,                     
##      wrkctra=Unstable}             1
## [2] {hincfel=Low,                   
##      cntry=AT,                      
##      gndr=2,                        
##      agea=15-30,                    
##      marsts=6,                      
##      edulvlb=Medium,                
##      vote=1,                        
##      happy=9,                       
##      stfeco=2,                      
##      gincdif=1,                     
##      freehms=1,                     
##      hmsacld=1,                     
##      pdwrk=0,                       
##      uempla=0,                      
##      uemp12m=6,                     
##      wrkctra=Stable}               2
## [3] {hincfel=Low,                   
##      cntry=AT,                      
##      gndr=2,                        
##      agea=46-60,                    
##      edulvlb=High,                  
##      vote=1,                        
##      happy=9,                       
##      stfeco=6,                      
##      gincdif=1,                     
##      freehms=1,                     
##      hmsacld=1,                     
##      pdwrk=1,                       
##      uempla=0,                      
##      uemp12m=6,                     
##      wrkctra=Stable}               3
## [4] {hincfel=Low,                   
##      cntry=AT,                      
##      gndr=2,                        
##      agea=76+,                      
##      marsts=4,                      
##      edulvlb=Medium,                
##      vote=2,                        
##      happy=7,                       
##      stfeco=4,                      
##      gincdif=1,                     
##      freehms=2,                     
##      hmsacld=3,                     
##      pdwrk=0,                       
##      uempla=0,                      
##      uemp12m=6,                     
##      wrkctra=Stable}               4
## [5] {hincfel=Low,                   
##      cntry=AT,                      
##      gndr=1,                        
##      agea=61-75,                    
##      edulvlb=Medium,                
##      vote=1,                        
##      happy=9,                       
##      stfeco=6,                      
##      gincdif=2,                     
##      freehms=2,                     
##      hmsacld=2,                     
##      pdwrk=0,                       
##      uempla=0,                      
##      uemp12m=6,                     
##      wrkctra=Stable}               5

Generating Association Rules

The consequent (right-hand side) of the rules is restricted to the three levels of the hincfel variable.

rules <- apriori(df_transactions, 
                 parameter = list(supp = 0.02, conf = 0.6, maxlen = 4),
                 appearance = list(rhs = c("hincfel=Low", "hincfel=Medium", "hincfel=High")),
                 control = list(verbose = FALSE))
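To make the support threshold concrete, the sketch below converts supp = 0.02 into a minimum number of respondents. It assumes that the transaction set has the same number of rows as the country table shown earlier; the variable names are illustrative.

```r
# Country counts copied from the table(df$cntry) output
country_counts <- c(AT = 2354, BE = 1594, CH = 1384, CY = 685, DE = 2420, ES = 1844,
                    FI = 1563, FR = 1771, GB = 1684, GR = 2757, HR = 1563, HU = 2118,
                    IE = 2017, IS = 842, IT = 2865, LT = 1365, NL = 1695, NO = 1337,
                    PL = 1442, PT = 1373, RS = 1563, SE = 1230, SI = 1248, SK = 1442)

n_transactions <- sum(country_counts)   # assumed number of transactions
min_support <- 0.02

# Smallest whole number of respondents whose share reaches the threshold
min_count <- ceiling(min_support * n_transactions)
print(c(n_transactions = n_transactions, min_count = min_count))
```

Under this assumption a rule has to match roughly 800 respondents to be kept, which is why country-specific rules only appear for combinations that cover a large part of a single country's sample.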

Inspect & Filter the Best Rules

rules_sorted <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_sorted, 10))

##      lhs                                    rhs           support    confidence
## [1]  {vote=1, happy=9, stfeco=7}         => {hincfel=Low} 0.02363283 0.9703476 
## [2]  {happy=9, stfeco=7, wrkctra=Stable} => {hincfel=Low} 0.02383205 0.9647177 
## [3]  {happy=9, stfeco=7, uemp12m=6}      => {hincfel=Low} 0.02291065 0.9643606 
## [4]  {cntry=SE, vote=1, uemp12m=6}       => {hincfel=Low} 0.02009662 0.9641577 
## [5]  {stfeco=7, freehms=1, pdwrk=1}      => {hincfel=Low} 0.03297141 0.9636099 
## [6]  {happy=9, stfeco=7, uempla=0}       => {hincfel=Low} 0.02858850 0.9630872 
## [7]  {edulvlb=High, vote=1, stfeco=6}    => {hincfel=Low} 0.03673175 0.9627937 
## [8]  {cntry=DE, gndr=1, vote=1}          => {hincfel=Low} 0.02303516 0.9625390 
## [9]  {edulvlb=High, happy=9, hmsacld=1}  => {hincfel=Low} 0.02532623 0.9621570 
## [10] {edulvlb=High, stfeco=7, pdwrk=1}   => {hincfel=Low} 0.02848889 0.9621531 
##      coverage   lift     count
## [1]  0.02435502 1.202001  949 
## [2]  0.02470366 1.195027  957 
## [3]  0.02375735 1.194585  920 
## [4]  0.02084371 1.194334  807 
## [5]  0.03421656 1.193655 1324 
## [6]  0.02968423 1.193008 1148 
## [7]  0.03815121 1.192644 1475 
## [8]  0.02393167 1.192329  925 
## [9]  0.02632234 1.191855 1017 
## [10] 0.02960952 1.191851 1144

Visualizing the Association Rules Generated by Apriori

set.seed(240)

# Scatterplot 
plot(rules_sorted, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

# Graph-based visualization
plot(rules_sorted, method = "graph")

# Grouped matrix 
plot(rules_sorted, method = "grouped")

Interpretations:

Top Rules with High Confidence: The top rules all have hincfel=Low as the consequent (right-hand side) and combine attributes such as high happiness (happy=9), stable working conditions (wrkctra=Stable), and fairly high satisfaction with the economy (stfeco=7 on the 0-10 scale). Their confidence values are above 0.96, so at least 96% of the respondents who match each antecedent fall into the hincfel=Low group.

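Lift Analysis: Lift values close to 1.2 indicate a moderate improvement over the baseline. Lift is confidence divided by the overall share of the consequent, so a confidence of 0.97 with a lift of 1.20 implies that about 81% of all respondents are in the hincfel=Low group. Antecedents such as vote=1 and uemp12m=6 therefore raise the probability of hincfel=Low by roughly 20% in relative terms.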
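The relationship between these measures can be checked directly from the reported numbers. The sketch below takes the support, confidence, lift, and count of the first rule in the output above and derives the implied baseline share of hincfel=Low, the coverage, and the number of transactions. It relies only on the standard definitions lift = confidence / P(consequent) and coverage = support / confidence; the variable names are illustrative.

```r
# Values copied from rule [1]: {vote=1, happy=9, stfeco=7} => {hincfel=Low}
support    <- 0.02363283
confidence <- 0.9703476
lift       <- 1.202001
count      <- 949

baseline_low   <- confidence / lift        # implied overall share of hincfel=Low
coverage       <- support / confidence     # share of respondents matching the antecedent
n_transactions <- round(count / support)   # implied size of the transaction set

print(round(c(baseline_low = baseline_low, coverage = coverage), 4))
print(n_transactions)
```

The derived coverage agrees with the coverage column in the output, which is a useful check that the measures are being read correctly.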

Scatterplot Insights: The scatterplot of support vs. confidence shows that most rules have confidence above 0.7. Confidence should be read together with the lift shading, because a rule with lift near 1 adds little beyond the baseline. Support is low (the top rules cover roughly 2-4% of respondents), so each combination applies to only a small portion of the dataset.

Grouped Matrix Visualization: The grouped matrix highlights clusters of antecedent items that lead to hincfel=Low. Attributes such as cntry=DE, happy=9, and wrkctra=Stable frequently appear together in these rules. Larger circles (higher support) are mostly associated with hincfel=Low, while the shading indicates the strength of the association (lift).

Strong Rules by Support

rules_sorted_sup <- sort(rules, by = "support", decreasing = TRUE)


meaningful_rules_sup <- head(rules_sorted_sup, 10)


plot(meaningful_rules_sup, method = "paracoord", control = list(reorder = TRUE))

Parallel Coordinates Plot: This visualization shows how the antecedent items of the ten highest-support rules line up toward the consequent, hincfel=Low. Combinations of wrkctra=Stable, uemp12m=6, and stfeco=7 recur among these rules.

It highlights that employment-related and economic-satisfaction variables are strongly associated with perceived income sufficiency. Association rules describe co-occurrence, so this should not be read as a causal effect.

Analyzing Rules by Consequent, with a Focus on the Levels of hincfel

Rules for hincfel=Low

rules_low <- subset(rules, rhs %pin% "hincfel=Low")
inspect(head(sort(rules_low, by = "confidence", decreasing = TRUE), 10))

##      lhs                                    rhs           support    confidence
## [1]  {vote=1, happy=9, stfeco=7}         => {hincfel=Low} 0.02363283 0.9703476 
## [2]  {happy=9, stfeco=7, wrkctra=Stable} => {hincfel=Low} 0.02383205 0.9647177 
## [3]  {happy=9, stfeco=7, uemp12m=6}      => {hincfel=Low} 0.02291065 0.9643606 
## [4]  {cntry=SE, vote=1, uemp12m=6}       => {hincfel=Low} 0.02009662 0.9641577 
## [5]  {stfeco=7, freehms=1, pdwrk=1}      => {hincfel=Low} 0.03297141 0.9636099 
## [6]  {happy=9, stfeco=7, uempla=0}       => {hincfel=Low} 0.02858850 0.9630872 
## [7]  {edulvlb=High, vote=1, stfeco=6}    => {hincfel=Low} 0.03673175 0.9627937 
## [8]  {cntry=DE, gndr=1, vote=1}          => {hincfel=Low} 0.02303516 0.9625390 
## [9]  {edulvlb=High, happy=9, hmsacld=1}  => {hincfel=Low} 0.02532623 0.9621570 
## [10] {edulvlb=High, stfeco=7, pdwrk=1}   => {hincfel=Low} 0.02848889 0.9621531 
##      coverage   lift     count
## [1]  0.02435502 1.202001  949 
## [2]  0.02470366 1.195027  957 
## [3]  0.02375735 1.194585  920 
## [4]  0.02084371 1.194334  807 
## [5]  0.03421656 1.193655 1324 
## [6]  0.02968423 1.193008 1148 
## [7]  0.03815121 1.192644 1475 
## [8]  0.02393167 1.192329  925 
## [9]  0.02632234 1.191855 1017 
## [10] 0.02960952 1.191851 1144

Plots for hincfel=Low

meaningful_rules_low <- head(sort(rules_low, by = "confidence", decreasing = TRUE), 10)
plot(meaningful_rules_low, method = "graph")

Top Rules for hincfel=Low:

The top rules based on confidence suggest that individuals who are employed stably (wrkctra=Stable), perceive the economy positively (stfeco=7), and are very happy (happy=9) are more likely to fall into the hincfel=Low group.

Similarly, demographic and country-specific variables like cntry=SE and cntry=DE interact with employment and satisfaction variables to predict hincfel=Low.

Confidence is high (0.96–0.97) across these rules, although lift stays near 1.2 because hincfel=Low is already by far the most common outcome in the dataset.

Rules for hincfel=Medium

rules_medium <- subset(rules, rhs %pin% "hincfel=Medium")
inspect(head(sort(rules_medium, by = "confidence", decreasing = TRUE), 10))

Rules for hincfel=High

rules_high <- subset(rules, rhs %pin% "hincfel=High")
inspect(head(sort(rules_high, by = "confidence", decreasing = TRUE), 10))

Important Insights from the hincfel Results

Insights: The inspect() calls for hincfel=Medium and hincfel=High above print no rules, so rules for these outcomes were limited at the chosen support and confidence thresholds, indicating weaker associations than for hincfel=Low.

The findings suggest a strong connection between economic perceptions, employment stability, and subjective financial sufficiency (hincfel=Low).
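One way to see why rules for the other two outcomes are scarce is to look at the lift they would need. The sketch below assumes the baseline share of hincfel=Low implied by the first reported rule (confidence divided by lift) and treats the remainder as an upper bound on the share of either Medium or High; the variable names are illustrative.

```r
# Baseline share of hincfel=Low implied by rule [1] (confidence / lift)
baseline_low <- 0.9703476 / 1.202001

# Medium and High each make up at most the remaining share
max_share_other <- 1 - baseline_low

min_confidence <- 0.6   # threshold passed to apriori()

# lift = confidence / P(consequent), so a rule that reaches the confidence
# threshold for Medium or High needs at least this much lift
min_lift_needed <- min_confidence / max_share_other
print(round(c(max_share_other = max_share_other, min_lift_needed = min_lift_needed), 2))
```

A required lift above 3 is far higher than the 1.1 to 1.2 seen for the hincfel=Low rules, which helps explain why so few antecedents qualify. Lowering the confidence threshold for these two outcomes would be one option to explore.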

Analyzing Specific Attributes

# Filter rules where 'stfeco' or 'wrkctra' are in LHS
rules_stfeco <- subset(rules, lhs %pin% "stfeco")
rules_wrkctra <- subset(rules, lhs %pin% "wrkctra")
inspect(head(sort(rules_stfeco, by = "confidence", decreasing = TRUE), 5))

##     lhs                                    rhs           support    confidence
## [1] {vote=1, happy=9, stfeco=7}         => {hincfel=Low} 0.02363283 0.9703476 
## [2] {happy=9, stfeco=7, wrkctra=Stable} => {hincfel=Low} 0.02383205 0.9647177 
## [3] {happy=9, stfeco=7, uemp12m=6}      => {hincfel=Low} 0.02291065 0.9643606 
## [4] {stfeco=7, freehms=1, pdwrk=1}      => {hincfel=Low} 0.03297141 0.9636099 
## [5] {happy=9, stfeco=7, uempla=0}       => {hincfel=Low} 0.02858850 0.9630872 
##     coverage   lift     count
## [1] 0.02435502 1.202001  949 
## [2] 0.02470366 1.195027  957 
## [3] 0.02375735 1.194585  920 
## [4] 0.03421656 1.193655 1324 
## [5] 0.02968423 1.193008 1148

inspect(head(sort(rules_wrkctra, by = "confidence", decreasing = TRUE), 5))

##     lhs                 rhs              support confidence   coverage     lift count
## [1] {happy=9,                                                                        
##      stfeco=7,                                                                       
##      wrkctra=Stable} => {hincfel=Low} 0.02383205  0.9647177 0.02470366 1.195027   957
## [2] {edulvlb=High,                                                                   
##      happy=9,                                                                        
##      wrkctra=Stable} => {hincfel=Low} 0.04995517  0.9575179 0.05217153 1.186109  2006
## [3] {edulvlb=High,                                                                   
##      stfeco=6,                                                                       
##      wrkctra=Stable} => {hincfel=Low} 0.03683136  0.9560440 0.03852475 1.184283  1479
## [4] {cntry=SE,                                                                       
##      vote=1,                                                                         
##      wrkctra=Stable} => {hincfel=Low} 0.02360793  0.9537223 0.02475346 1.181407   948
## [5] {cntry=NL,                                                                       
##      pdwrk=1,                                                                        
##      wrkctra=Stable} => {hincfel=Low} 0.02206395  0.9526882 0.02315968 1.180126   886

General Insights from the Rules

Primary Consequent (hincfel=Low): All of the strongest rules have hincfel=Low as their consequent. Their high confidence values mean that most respondents who match the antecedent attributes fall into this group. Support is low (roughly 2-5% of respondents per rule), so each rule describes a fairly small subset of individuals who share these characteristics.

Key Antecedent Attributes: Variables such as stfeco (satisfaction with the economy) and wrkctra (employment stability) play significant roles. Fairly high satisfaction with the economy (stfeco=7 on the 0-10 scale) is strongly associated with hincfel=Low, and employment stability (wrkctra=Stable) is another prominent antecedent. Other key attributes include vote=1 (participation in voting), happy=9 (high happiness levels), and edulvlb=High (high education level).

Lift Values: Lift values above 1.18 across these rules show a positive association: hincfel=Low is about 18-20% more likely under these antecedents than in the dataset overall. Lift is not a significance test by itself, but it shows that the antecedent combinations carry information beyond the baseline.
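For readers less familiar with these measures, here is a minimal sketch of how support, coverage, confidence, and lift are computed for a single rule. The ten-row data frame is invented for illustration and does not contain ESS values; the column names stable and low stand in for wrkctra=Stable and hincfel=Low.

```r
# Toy data (hypothetical, not ESS values): one row per respondent
toy <- data.frame(
  stable = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  low    = c(TRUE, TRUE, TRUE, TRUE,  TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)
n <- nrow(toy)

# Rule under study: {stable} => {low}
support    <- sum(toy$stable & toy$low) / n  # share matching antecedent and consequent
coverage   <- sum(toy$stable) / n            # share matching the antecedent
confidence <- support / coverage             # P(low | stable)
lift       <- confidence / mean(toy$low)     # confidence relative to the baseline P(low)

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 3))
```

In this toy example the lift is above 1 because respondents with stable = TRUE have low = TRUE more often than the sample as a whole, which is the same reading applied to the ESS rules.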

Plots for Rules Filtered by stfeco and wrkctra

meaningful_rules_fitered1 <- head(sort(rules_stfeco, by = "confidence", decreasing = TRUE), 5)
meaningful_rules_fitered2 <- head(sort(rules_wrkctra, by = "confidence", decreasing = TRUE), 5)
plot(meaningful_rules_fitered1, method = "graph")

plot(meaningful_rules_fitered2, method = "graph")

Plot Interpretations

Parallel Coordinates Plot: In this plot, lines pass through antecedents such as wrkctra=Stable and stfeco=7 before reaching the consequent hincfel=Low, which shows how central these items are to the hincfel=Low rules.

Graph Visualization of Filtered Rules: The filtered graphs focus on the main antecedent combinations. For stfeco=7, nodes such as freehms=1 (strong agreement that gay men and lesbians should be free to live as they wish) and pdwrk=1 (having paid work) connect to hincfel=Low. For wrkctra=Stable, the connections to vote=1 and edulvlb=High show that high education and civic engagement often accompany stable employment in the hincfel=Low rules.

Combined Graph View: Taken together, the graphs show overlapping roles of the key antecedents. Associations between wrkctra=Stable, stfeco=7, and hincfel=Low dominate the networks, reinforcing the importance of these attributes.

Aggregate Insights by Country

Sweden

# Rules for a specific country (Sweden)
rules_country <- subset(rules, lhs %pin% "cntry=SE")
inspect(head(sort(rules_country, by = "confidence", decreasing = TRUE), 5))

##     lhs                                   rhs           support    confidence
## [1] {cntry=SE, vote=1, uemp12m=6}      => {hincfel=Low} 0.02009662 0.9641577 
## [2] {cntry=SE, uempla=0, uemp12m=6}    => {hincfel=Low} 0.02141648 0.9598214 
## [3] {cntry=SE, uemp12m=6}              => {hincfel=Low} 0.02149118 0.9578246 
## [4] {cntry=SE, vote=1, uempla=0}       => {hincfel=Low} 0.02672079 0.9546263 
## [5] {cntry=SE, vote=1, wrkctra=Stable} => {hincfel=Low} 0.02360793 0.9537223 
##     coverage   lift     count
## [1] 0.02084371 1.194334  807 
## [2] 0.02231298 1.188962  860 
## [3] 0.02243749 1.186489  863 
## [4] 0.02799084 1.182527 1073 
## [5] 0.02475346 1.181407  948

Plot for SE

meaningful_rules_SE <- head(sort(rules_country, by = "confidence", decreasing = TRUE), 5)
plot(meaningful_rules_SE, method = "graph")

Overall Interpretation SE:

In Sweden, hincfel=Low is associated with the unemployment-history item (uemp12m=6), not being currently unemployed (uempla=0), stable work (wrkctra=Stable), and voting (vote=1). All five rules have confidence above 0.95 and lift of about 1.18-1.19, so these combinations raise the likelihood of hincfel=Low only moderately above its overall share in the dataset.

Poland

# Rules for a specific country (Poland)
rules_country_pl <- subset(rules, lhs %pin% "cntry=PL")
inspect(head(sort(rules_country_pl, by = "confidence", decreasing = TRUE), 5))

##     lhs                                     rhs           support    confidence
## [1] {cntry=PL, vote=1, wrkctra=Stable}   => {hincfel=Low} 0.02089352 0.8992497 
## [2] {cntry=PL, uemp12m=6}                => {hincfel=Low} 0.02380715 0.8934579 
## [3] {cntry=PL, uempla=0, uemp12m=6}      => {hincfel=Low} 0.02373244 0.8931584 
## [4] {cntry=PL, vote=1, uempla=0}         => {hincfel=Low} 0.02579938 0.8915663 
## [5] {cntry=PL, uempla=0, wrkctra=Stable} => {hincfel=Low} 0.02408108 0.8896044 
##     coverage   lift     count
## [1] 0.02323439 1.113930  839 
## [2] 0.02664608 1.106756  956 
## [3] 0.02657137 1.106385  953 
## [4] 0.02893715 1.104412 1036 
## [5] 0.02706943 1.101982  967

meaningful_rules_PL <- head(sort(rules_country_pl, by = "confidence", decreasing = TRUE), 5)
plot(meaningful_rules_PL, method = "graph")

Interpretation PL:

In Poland, the rules for hincfel=Low are tied mainly to employment-related items: the unemployment-history item (uemp12m=6), not being currently unemployed (uempla=0), and stable work (wrkctra=Stable). Voting (vote=1) also appears in the top rules. Confidence is lower than in Sweden (about 0.89-0.90 versus 0.95-0.96) and lift is only about 1.10-1.11, so the same kinds of antecedents are weaker predictors of hincfel=Low in Poland.

Summary

In this project, I analyzed data from the European Social Survey (ESS) Round 11, focusing on the variable “Feeling about household’s income nowadays” (hincfel). Because of the large number of missing values, I replaced some NAs with the most frequent values. I reduced hincfel to three categories (“Low,” “Medium,” and “High”) and grouped employment status into “Stable” and “Unstable.” Association rules were then mined with hincfel as the rule consequent, and the most meaningful rules were extracted and visualized with scatterplots, grouped matrices, parallel coordinates plots, and network graphs. Finally, rules were analyzed by key attributes and by country to derive insights.

Key Insights: Strong rules indicate that stable employment, high happiness, and certain country-specific factors are associated with hincfel=Low. The visualizations highlight these relationships along with their lift and support.