This homework focuses on applying association rule mining to the European Social Survey (ESS) Round 11 dataset. The goal is to uncover meaningful patterns and relationships between variables, particularly focusing on the variable “Feeling about household’s income nowadays” (hincfel) as the rule consequent. By using data from a wide geographical range, the analysis aims to identify key factors influencing household income perceptions, such as economic satisfaction, employment stability, and personal happiness.
library(arules)    # apriori(), inspect(), transactions
library(dplyr)     # %>%, mutate(), across(), filter(), select(), case_when()
library(arulesViz) # plot() methods for rules
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
table(df$cntry)
##
## AT BE CH CY DE ES FI FR GB GR HR HU IE IS IT LT
## 2354 1594 1384 685 2420 1844 1563 1771 1684 2757 1563 2118 2017 842 2865 1365
## NL NO PL PT RS SE SI SK
## 1695 1337 1442 1373 1563 1230 1248 1442
While inspecting the dataset, I found many missing values (NAs), along with special numeric codes (66, 999, 6666, 8888, 9999) that I recoded as NA. I then filled in some of the missing values with the most common value of each variable. This made the dataset more complete and usable for analysis.
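The imputation step itself does not appear in the code of this report. A minimal sketch of the mode imputation described above could look like the following; the helper `impute_mode()` and the toy data frame are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: replace NAs in a vector with its most frequent observed value.
# Ties are resolved in favour of the value that appears first.
impute_mode <- function(x) {
  observed <- x[!is.na(x)]
  if (length(observed) == 0) return(x)  # nothing to impute from
  values_seen <- unique(observed)
  mode_value <- values_seen[which.max(tabulate(match(observed, values_seen)))]
  x[is.na(x)] <- mode_value
  x
}

# Toy data (made up), not the ESS file
toy <- data.frame(vote = c(1, 1, 2, NA, 1), happy = c(8, NA, 9, 9, 7))
toy_imputed <- as.data.frame(lapply(toy, impute_mode))
print(toy_imputed)
```

Mode imputation keeps every row but pulls each variable toward its most common answer, so it can inflate the support of items built from that answer.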
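For the variable hincfel (feeling about household income), building rules on every original code did not produce strong patterns. So, I simplified it by grouping the codes into three categories: Low (codes 1 and 2), Medium (codes 3 and 4), and High (codes 7, 8, and 9). In the ESS codebook, codes 1 to 4 run from “living comfortably on present income” to “finding it very difficult on present income”, while 7, 8, and 9 are refusal, don’t know, and no answer. Low is therefore best read as low financial difficulty rather than low income, and High collects the non-substantive answers. The grouping made the variable easier to work with and allowed me to generate more meaningful rules.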
I also noticed that employment status (wrkctra) had too many categories, which made the patterns less clear. To fix this, I grouped them into just two categories: Stable (codes 1, 2, and 3) and Unstable (every other value, including missing ones). This produced clearer relationships between employment status and feelings about household income.
These changes really helped make the dataset easier to work with and allowed me to focus on finding better patterns and insights in the data.
# Recode special survey codes as NA, then drop rows with a missing hincfel.
# Caution: this applies to every column, so a real value such as agea = 66 also becomes NA.
df_clean <- df %>%
  mutate(across(everything(), ~ replace(., . %in% c(66, 6666, 8888, 9999, 999), NA))) %>%
  filter(!is.na(hincfel))
# Reducing the number of unique values (Grouping Levels)
# Grouping age into broader categories
df_clean$agea <- cut(df_clean$agea, breaks = c(15, 30, 45, 60, 75, 100),
                     labels = c("15-30", "31-45", "46-60", "61-75", "76+"), include.lowest = TRUE)
# Grouping education levels into general categories
df_clean$edulvlb <- cut(df_clean$edulvlb, breaks = c(0, 300, 600, 9000),
                        labels = c("Low", "Medium", "High"), include.lowest = TRUE)
# Reducing categories for `hincfel` (Feeling about household income)
# Codes 7, 8, 9 are not recoded as NA above, so they end up in "High"
df_clean$hincfel <- case_when(
  df_clean$hincfel %in% c(1, 2) ~ "Low",
  df_clean$hincfel %in% c(3, 4) ~ "Medium",
  df_clean$hincfel %in% c(7, 8, 9) ~ "High",
  TRUE ~ NA_character_
)
# Reducing categories for employment status
# Every value outside 1-3, including NA, is labelled "Unstable"
df_clean$wrkctra <- ifelse(df_clean$wrkctra %in% c(1, 2, 3), "Stable", "Unstable")
sum(is.na(df_clean$hincfel))
## [1] 0
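To see how `cut()` treats boundary values with the age breaks used above, here is a small check on made-up ages. The breaks and labels are copied from the code above; the test ages are my own.

```r
# Same breaks and labels as the age grouping above; the ages are made-up test values.
ages <- c(15, 30, 31, 60, 76, 100, 101)
age_group <- cut(ages, breaks = c(15, 30, 45, 60, 75, 100),
                 labels = c("15-30", "31-45", "46-60", "61-75", "76+"),
                 include.lowest = TRUE)
# Intervals are closed on the right, so 30 falls in "15-30" and 60 in "46-60".
# include.lowest = TRUE keeps the lower bound 15; values above 100 become NA.
print(data.frame(age = ages, group = as.character(age_group)))
```

Any value outside the outer breaks turns into NA, which later means the respondent simply has no `agea` item in the transaction.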
Because the dataset has too many variables and my goal was to generate rules specifically about hincfel, I selected only the relevant variables and then converted them to the transaction format that apriori requires.
# Selecting relevant categorical variables
df_selected <- df_clean %>%
  select(hincfel, cntry, gndr, agea, marsts, edulvlb, vote, happy,
         stfeco, gincdif, freehms, hmsacld, pdwrk, uempla, uemp12m, wrkctra) %>%
  mutate(across(everything(), as.factor)) # Convert to factors
# Convert dataset into transaction format
df_transactions <- as(df_selected, "transactions")
inspect(df_transactions[1:5]) # View first few transactions
## items transactionID
## [1] {hincfel=Low,
## cntry=AT,
## gndr=1,
## agea=61-75,
## edulvlb=Medium,
## vote=1,
## happy=8,
## stfeco=6,
## gincdif=2,
## freehms=2,
## hmsacld=3,
## pdwrk=0,
## uempla=0,
## uemp12m=6,
## wrkctra=Unstable} 1
## [2] {hincfel=Low,
## cntry=AT,
## gndr=2,
## agea=15-30,
## marsts=6,
## edulvlb=Medium,
## vote=1,
## happy=9,
## stfeco=2,
## gincdif=1,
## freehms=1,
## hmsacld=1,
## pdwrk=0,
## uempla=0,
## uemp12m=6,
## wrkctra=Stable} 2
## [3] {hincfel=Low,
## cntry=AT,
## gndr=2,
## agea=46-60,
## edulvlb=High,
## vote=1,
## happy=9,
## stfeco=6,
## gincdif=1,
## freehms=1,
## hmsacld=1,
## pdwrk=1,
## uempla=0,
## uemp12m=6,
## wrkctra=Stable} 3
## [4] {hincfel=Low,
## cntry=AT,
## gndr=2,
## agea=76+,
## marsts=4,
## edulvlb=Medium,
## vote=2,
## happy=7,
## stfeco=4,
## gincdif=1,
## freehms=2,
## hmsacld=3,
## pdwrk=0,
## uempla=0,
## uemp12m=6,
## wrkctra=Stable} 4
## [5] {hincfel=Low,
## cntry=AT,
## gndr=1,
## agea=61-75,
## edulvlb=Medium,
## vote=1,
## happy=9,
## stfeco=6,
## gincdif=2,
## freehms=2,
## hmsacld=2,
## pdwrk=0,
## uempla=0,
## uemp12m=6,
## wrkctra=Stable} 5
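The listing above shows that each factor column becomes a `variable=level` item and that a missing value produces no item (transaction 1 has no `marsts` item). The base R sketch below imitates that encoding on a made-up two-row data frame. The helper `to_items()` is hypothetical and is not how arules stores transactions internally.

```r
# Toy data (made up); the first row has a missing marsts, like transaction 1 above
toy <- data.frame(
  hincfel = factor(c("Low", "Medium")),
  cntry   = factor(c("AT", "PL")),
  marsts  = factor(c(NA, "6"))
)

# Hypothetical helper: build "variable=level" labels, skipping missing values
to_items <- function(row) {
  present <- !is.na(row)
  paste0(names(row)[present], "=", row[present])
}

items <- lapply(seq_len(nrow(toy)), function(i) {
  to_items(vapply(toy[i, ], as.character, character(1)))
})
print(items)
```

Because every level of every variable becomes its own item, variables with many levels (such as `happy` or `stfeco`, each on an 11-point scale) spread their support thinly across items.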
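# Mine rules with hincfel on the right-hand side:
# support >= 2%, confidence >= 60%, at most 4 items per rule
rules <- apriori(df_transactions,
                 parameter = list(supp = 0.02, conf = 0.6, maxlen = 4),
                 appearance = list(rhs = c("hincfel=Low", "hincfel=Medium", "hincfel=High")),
                 control = list(verbose = FALSE))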
rules_sorted <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_sorted, 10))
## lhs rhs support confidence
## [1] {vote=1, happy=9, stfeco=7} => {hincfel=Low} 0.02363283 0.9703476
## [2] {happy=9, stfeco=7, wrkctra=Stable} => {hincfel=Low} 0.02383205 0.9647177
## [3] {happy=9, stfeco=7, uemp12m=6} => {hincfel=Low} 0.02291065 0.9643606
## [4] {cntry=SE, vote=1, uemp12m=6} => {hincfel=Low} 0.02009662 0.9641577
## [5] {stfeco=7, freehms=1, pdwrk=1} => {hincfel=Low} 0.03297141 0.9636099
## [6] {happy=9, stfeco=7, uempla=0} => {hincfel=Low} 0.02858850 0.9630872
## [7] {edulvlb=High, vote=1, stfeco=6} => {hincfel=Low} 0.03673175 0.9627937
## [8] {cntry=DE, gndr=1, vote=1} => {hincfel=Low} 0.02303516 0.9625390
## [9] {edulvlb=High, happy=9, hmsacld=1} => {hincfel=Low} 0.02532623 0.9621570
## [10] {edulvlb=High, stfeco=7, pdwrk=1} => {hincfel=Low} 0.02848889 0.9621531
## coverage lift count
## [1] 0.02435502 1.202001 949
## [2] 0.02470366 1.195027 957
## [3] 0.02375735 1.194585 920
## [4] 0.02084371 1.194334 807
## [5] 0.03421656 1.193655 1324
## [6] 0.02968423 1.193008 1148
## [7] 0.03815121 1.192644 1475
## [8] 0.02393167 1.192329 925
## [9] 0.02632234 1.191855 1017
## [10] 0.02960952 1.191851 1144
set.seed(240)
# Scatterplot
plot(rules_sorted, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
# Graph (note: `type` is not among the control parameters listed in the output below)
plot(rules_sorted, method = "graph", control = list(type = "items"))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
# Grouped matrix
plot(rules_sorted, method = "grouped")
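Top Rules with High Confidence: Rules with hincfel=Low as the consequent (right-hand side) combine attributes such as high happiness (happy=9), the Stable employment group (wrkctra=Stable), and fairly high satisfaction with the economy (stfeco=7; in the ESS codebook this scale runs from 0, extremely dissatisfied, to 10, extremely satisfied). These rules have confidence values above 0.96, meaning that more than 96% of respondents who match the antecedent fall in the Low category (hincfel codes 1 and 2, which the codebook labels as living comfortably or coping on present income).
Lift Analysis: The lift values close to 1.2 mean that these antecedents raise the share of hincfel=Low by about 20% relative to its overall frequency. Because lift is confidence divided by the support of the consequent, that overall frequency is about 0.97 / 1.20 ≈ 0.81, so the rules move the share from roughly 81% to roughly 96 to 97%. The antecedent variables (for example: vote=1, uemp12m=6) are therefore only moderately informative about hincfel=Low.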
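The reported metrics are tied together by simple identities, which can be checked by hand. A minimal sketch using the values printed for rule [1] above; the variable names are my own.

```r
# Values copied from rule [1]: {vote=1, happy=9, stfeco=7} => {hincfel=Low}
support    <- 0.02363283
confidence <- 0.9703476
lift       <- 1.202001
count      <- 949

# support = count / N, so N = count / support (about 40156 transactions)
n_transactions <- count / support
# confidence = support / coverage, so coverage = support / confidence
coverage <- support / confidence
# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift

print(round(c(n = n_transactions, coverage = coverage, rhs_support = rhs_support), 4))
```

The recovered coverage matches the `coverage` column of the output, and the recovered number of transactions matches the sum of the country counts from `table(df$cntry)`. The consequent support of about 0.81 is the baseline against which every confidence value in this report should be judged.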
Scatterplot Insights: The scatterplot of support vs. confidence shows that most rules are concentrated at confidence above 0.7, well above the 0.6 minimum set in apriori. However, the support values are relatively low, reflecting that each combination occurs in only a small portion of the dataset.
Grouped Matrix Visualization: The grouped matrix shows clusters of antecedent items that lead to hincfel=Low. Attributes like cntry=DE, happy=9, and wrkctra=Stable frequently combine in these rules. Larger balloons indicate higher support, while shading indicates the strength of the association (lift).
rules_sorted_sup <- sort(rules, by = "support", decreasing = TRUE)
meaningful_rules_sup <- head(rules_sorted_sup, 10)
plot(meaningful_rules_sup, method = "paracoord", control = list(reorder = TRUE))
Parallel Coordinates Plot: This visualization traces the ten rules with the highest support from their antecedent items to the consequent, hincfel=Low. Combinations like wrkctra=Stable, uemp12m=6, and stfeco=7 all lead to hincfel=Low.
It highlights that employment-related and economic satisfaction variables are closely associated with how respondents feel about their household income.
rules_low <- subset(rules, rhs %pin% "hincfel=Low")
inspect(head(sort(rules_low, by = "confidence", decreasing = TRUE), 10))
## lhs rhs support confidence
## [1] {vote=1, happy=9, stfeco=7} => {hincfel=Low} 0.02363283 0.9703476
## [2] {happy=9, stfeco=7, wrkctra=Stable} => {hincfel=Low} 0.02383205 0.9647177
## [3] {happy=9, stfeco=7, uemp12m=6} => {hincfel=Low} 0.02291065 0.9643606
## [4] {cntry=SE, vote=1, uemp12m=6} => {hincfel=Low} 0.02009662 0.9641577
## [5] {stfeco=7, freehms=1, pdwrk=1} => {hincfel=Low} 0.03297141 0.9636099
## [6] {happy=9, stfeco=7, uempla=0} => {hincfel=Low} 0.02858850 0.9630872
## [7] {edulvlb=High, vote=1, stfeco=6} => {hincfel=Low} 0.03673175 0.9627937
## [8] {cntry=DE, gndr=1, vote=1} => {hincfel=Low} 0.02303516 0.9625390
## [9] {edulvlb=High, happy=9, hmsacld=1} => {hincfel=Low} 0.02532623 0.9621570
## [10] {edulvlb=High, stfeco=7, pdwrk=1} => {hincfel=Low} 0.02848889 0.9621531
## coverage lift count
## [1] 0.02435502 1.202001 949
## [2] 0.02470366 1.195027 957
## [3] 0.02375735 1.194585 920
## [4] 0.02084371 1.194334 807
## [5] 0.03421656 1.193655 1324
## [6] 0.02968423 1.193008 1148
## [7] 0.03815121 1.192644 1475
## [8] 0.02393167 1.192329 925
## [9] 0.02632234 1.191855 1017
## [10] 0.02960952 1.191851 1144
meaningful_rules_low <- head(sort(rules_low, by = "confidence", decreasing = TRUE),10)
plot(meaningful_rules_low, method="graph")
Top Rules for hincfel=Low:
The top rules based on confidence suggest that individuals in the Stable employment group (wrkctra=Stable) who are fairly satisfied with the economy (stfeco=7) and very happy (happy=9) are more likely than average to fall in the Low category of hincfel (codes 1 and 2, i.e. living comfortably or coping on present income according to the ESS codebook).
Similarly, country items such as cntry=SE and cntry=DE combine with employment and voting variables in rules for hincfel=Low.
The high confidence (0.96 to 0.97) across these rules should be read against the overall share of hincfel=Low, about 0.81 (confidence divided by lift), so the gain from knowing the antecedent is real but moderate.
rules_medium <- subset(rules, rhs %pin% "hincfel=Medium")
inspect(head(sort(rules_medium, by = "confidence", decreasing = TRUE), 10))
rules_high <- subset(rules, rhs %pin% "hincfel=High")
inspect(head(sort(rules_high, by = "confidence", decreasing = TRUE), 10))
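Insights: Rules specific to hincfel=Medium or hincfel=High were limited, indicating weaker associations with these outcomes compared to hincfel=Low. This is expected given the class imbalance: hincfel=Low covers about 81% of respondents (confidence divided by lift in the rules above), so Medium and High together cover at most about 19%, and a rule for either of them would need a lift above 3 to reach the 0.6 confidence threshold.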
The findings suggest a strong connection between economic perceptions, employment stability, and subjective financial sufficiency (hincfel=Low).
# Filter rules where 'stfeco' or 'wrkctra' are in LHS
rules_stfeco <- subset(rules, lhs %pin% "stfeco")
rules_wrkctra <- subset(rules, lhs %pin% "wrkctra")
inspect(head(sort(rules_stfeco, by = "confidence", decreasing = TRUE), 5))
## lhs rhs support confidence
## [1] {vote=1, happy=9, stfeco=7} => {hincfel=Low} 0.02363283 0.9703476
## [2] {happy=9, stfeco=7, wrkctra=Stable} => {hincfel=Low} 0.02383205 0.9647177
## [3] {happy=9, stfeco=7, uemp12m=6} => {hincfel=Low} 0.02291065 0.9643606
## [4] {stfeco=7, freehms=1, pdwrk=1} => {hincfel=Low} 0.03297141 0.9636099
## [5] {happy=9, stfeco=7, uempla=0} => {hincfel=Low} 0.02858850 0.9630872
## coverage lift count
## [1] 0.02435502 1.202001 949
## [2] 0.02470366 1.195027 957
## [3] 0.02375735 1.194585 920
## [4] 0.03421656 1.193655 1324
## [5] 0.02968423 1.193008 1148
inspect(head(sort(rules_wrkctra, by = "confidence", decreasing = TRUE), 5))
## lhs rhs support confidence coverage lift count
## [1] {happy=9,
## stfeco=7,
## wrkctra=Stable} => {hincfel=Low} 0.02383205 0.9647177 0.02470366 1.195027 957
## [2] {edulvlb=High,
## happy=9,
## wrkctra=Stable} => {hincfel=Low} 0.04995517 0.9575179 0.05217153 1.186109 2006
## [3] {edulvlb=High,
## stfeco=6,
## wrkctra=Stable} => {hincfel=Low} 0.03683136 0.9560440 0.03852475 1.184283 1479
## [4] {cntry=SE,
## vote=1,
## wrkctra=Stable} => {hincfel=Low} 0.02360793 0.9537223 0.02475346 1.181407 948
## [5] {cntry=NL,
## pdwrk=1,
## wrkctra=Stable} => {hincfel=Low} 0.02206395 0.9526882 0.02315968 1.180126 886
Primary Consequent (hincfel=Low): All of the strong rules have hincfel=Low as the consequent, i.e. the grouped category built from hincfel codes 1 and 2 (living comfortably or coping on present income in the ESS codebook). The rules have high confidence, so respondents who match the antecedent attributes very often fall in this category. The support values are low (roughly 2% to 5% for the rules shown), so each rule describes a small subset of the dataset.
Key Antecedent Attributes: Variables such as stfeco (satisfaction with the economy) and wrkctra (employment stability) play significant roles. Fairly high satisfaction with the economy (stfeco=6 or 7 on a scale from 0, extremely dissatisfied, to 10, extremely satisfied) is strongly associated with hincfel=Low. Employment stability (wrkctra=Stable) is another prominent antecedent. Other key attributes include vote=1 (voted in the last national election), happy=9 (high happiness levels), and edulvlb=High (high education level).
Lift Values: Lift values above 1.18 across these rules show that the antecedents raise the share of hincfel=Low by at least 18% relative to its overall frequency. Lift measures the size of the association, not its statistical significance, so these values indicate meaningful but moderate predictors of hincfel=Low.
meaningful_rules_filtered1 <- head(sort(rules_stfeco, by = "confidence", decreasing = TRUE), 5)
meaningful_rules_filtered2 <- head(sort(rules_wrkctra, by = "confidence", decreasing = TRUE), 5)
plot(meaningful_rules_filtered1, method = "graph")
plot(meaningful_rules_filtered2, method = "graph")
Parallel Coordinates Plot: This plot illustrates the alignment of rules with specific antecedents (e.g., wrkctra=Stable and stfeco=7) leading to the consequent hincfel=Low. The lines connecting wrkctra=Stable and stfeco=7 to hincfel=Low show how central these items are among the rules for that category.
Graph Visualization of Filtered Rules: The filtered graphs focus on the main antecedent combinations. For stfeco=7, nodes such as freehms=1 (strong agreement that gays and lesbians should be free to live as they wish) and pdwrk=1 (in paid work during the last 7 days) appear alongside it in rules for hincfel=Low. For wrkctra=Stable, connections to vote=1 and edulvlb=High show that high education and voting tend to accompany the Low category.
Combined Graph View: The combined visualization captures overlapping influences of the key antecedents. Associations between wrkctra=Stable, stfeco=7, and hincfel=Low dominate the network, reinforcing the importance of these attributes.
# Rules for a specific country (Sweden)
rules_country <- subset(rules, lhs %pin% "cntry=SE")
inspect(head(sort(rules_country, by = "confidence", decreasing = TRUE), 5))
## lhs rhs support confidence
## [1] {cntry=SE, vote=1, uemp12m=6} => {hincfel=Low} 0.02009662 0.9641577
## [2] {cntry=SE, uempla=0, uemp12m=6} => {hincfel=Low} 0.02141648 0.9598214
## [3] {cntry=SE, uemp12m=6} => {hincfel=Low} 0.02149118 0.9578246
## [4] {cntry=SE, vote=1, uempla=0} => {hincfel=Low} 0.02672079 0.9546263
## [5] {cntry=SE, vote=1, wrkctra=Stable} => {hincfel=Low} 0.02360793 0.9537223
## coverage lift count
## [1] 0.02084371 1.194334 807
## [2] 0.02231298 1.188962 860
## [3] 0.02243749 1.186489 863
## [4] 0.02799084 1.182527 1073
## [5] 0.02475346 1.181407 948
meaningful_rules_SE <- head(sort(rules_country, by = "confidence", decreasing = TRUE), 5)
plot(meaningful_rules_SE , method="graph")
In Sweden, hincfel=Low (codes 1 and 2, i.e. living comfortably or coping on present income in the ESS codebook) is associated with uemp12m=6 (the “not applicable” code of the long-term unemployment question), uempla=0 (not unemployed and looking for work), the Stable employment group (wrkctra=Stable), and having voted (vote=1). With confidence around 0.95 to 0.96, these rules suggest that Swedish respondents with these characteristics overwhelmingly report managing on their household income.
# Rules for a specific country (Poland)
rules_country_pl <- subset(rules, lhs %pin% "cntry=PL")
inspect(head(sort(rules_country_pl, by = "confidence", decreasing = TRUE), 5))
## lhs rhs support confidence
## [1] {cntry=PL, vote=1, wrkctra=Stable} => {hincfel=Low} 0.02089352 0.8992497
## [2] {cntry=PL, uemp12m=6} => {hincfel=Low} 0.02380715 0.8934579
## [3] {cntry=PL, uempla=0, uemp12m=6} => {hincfel=Low} 0.02373244 0.8931584
## [4] {cntry=PL, vote=1, uempla=0} => {hincfel=Low} 0.02579938 0.8915663
## [5] {cntry=PL, uempla=0, wrkctra=Stable} => {hincfel=Low} 0.02408108 0.8896044
## coverage lift count
## [1] 0.02323439 1.113930 839
## [2] 0.02664608 1.106756 956
## [3] 0.02657137 1.106385 953
## [4] 0.02893715 1.104412 1036
## [5] 0.02706943 1.101982 967
meaningful_rules_PL <- head(sort(rules_country_pl, by = "confidence", decreasing = TRUE), 5)
plot(meaningful_rules_PL , method="graph")
Interpretation PL:
In Poland, the same employment-related items appear in the top rules for hincfel=Low: uemp12m=6 (the “not applicable” code of the long-term unemployment question), uempla=0 (not unemployed and looking for work), and wrkctra=Stable. Confidence is lower than in Sweden (about 0.89 versus 0.95 to 0.96) and lift is about 1.10 to 1.11, so these attributes separate Polish respondents less sharply. Having voted (vote=1) also appears in the antecedents, which indicates an association between electoral participation and hincfel=Low rather than a causal link.
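Lift in the country rules is measured against the overall share of hincfel=Low, not against each country's own share. A useful check is to compute the within-country baseline, which equals the confidence of the single-item rule {cntry=X} => {hincfel=Low}. The sketch below uses a made-up toy data frame, not the ESS data, and its numbers carry no empirical meaning.

```r
# Toy data (made up) with two countries and the grouped hincfel variable
toy <- data.frame(
  cntry   = c("SE", "SE", "SE", "SE", "PL", "PL", "PL", "PL"),
  hincfel = c("Low", "Low", "Low", "Medium", "Low", "Low", "Medium", "Medium")
)

# Share of hincfel=Low within each country:
# this is the confidence of {cntry=X} => {hincfel=Low}
country_baseline <- sapply(split(toy$hincfel == "Low", toy$cntry), mean)
# Overall share of hincfel=Low: the support of the consequent
overall_baseline <- mean(toy$hincfel == "Low")
# Lift of the single-item country rule
country_lift <- country_baseline / overall_baseline

print(rbind(baseline = country_baseline, lift = country_lift))
```

If a three-item country rule has a confidence close to the country's own baseline, the extra antecedent items add little beyond the country itself.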
In this project, I analyzed data from the European Social Survey (ESS) Round 11, focusing on the variable “Feeling about household’s income nowadays” (hincfel). Because of the large number of missing values, I replaced some NAs with the most frequent values. I also reduced hincfel to three categories (“Low,” “Medium,” and “High”) and employment status to two (“Stable” and “Unstable”). Association rules were mined with hincfel as the rule consequent, and the most meaningful rules were extracted and visualized with scatterplots, grouped matrices, parallel coordinates plots, and network graphs. Rules were also analyzed by country and by key attributes to derive insights.
Key Insights: Strong rules indicate that stable employment, high happiness, and certain country-specific factors are associated with hincfel=Low. The visualizations highlight these relationships together with their lift and support.