This project is focused on using association rules to analyze voting patterns of U.S. House of Representatives members from 1984, classified as Republican or Democrat. The dataset contains voting records for 435 Congressmen on 16 key votes. Each vote can be “yes” (y), “no” (n), or “unknown” (?).
The goal of this project is to find relationships and patterns in the data using the Apriori algorithm, create association rules, and understand the structure of voting patterns.
Hypothesis: Party affiliation predicts voting behavior in the 1984 House of Representatives.
Method: Apriori algorithm
Dataset: Congressional Voting Records (1984)
STEP 1: Load libraries
We load the R libraries needed for the association rules analysis.
library(arules)
## Warning: package 'arules' was built under R version 4.5.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
STEP 2: Process the data
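We prepare the data for the Apriori algorithm by converting it to a transactional format. First, every column is one-hot encoded into binary indicators, where 1 means the item is present and 0 means it is absent. Because model.matrix() drops the first level of every factor after the first one, the "unknown" (?) level of each vote gets no column of its own: each vote contributes two items (n and y) and party contributes two, giving 2 + 16 × 2 = 34 items. The binary data is then converted to a transactions object in which each row is a transaction and each binary feature is an item, for a total of 435 transactions and 34 items.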
# Read CSV
votes <- read.csv("house-votes-84.csv", sep = ",", header = FALSE)
# Add column names
colnames(votes) <- c('party', paste0('V', 1:16))
# Convert all columns to factors to ensure proper one-hot encoding
votes[] <- lapply(votes, as.factor)
# Convert to binary format
votes_binary <- model.matrix(~ . - 1, data = votes)
votes_binary <- as.data.frame(votes_binary) * 1
# Convert to transactions format
# Convert binary data to list format first
votes_list <- apply(votes_binary == 1, 1, function(x) names(votes_binary)[x])
votes_transactions <- as(votes_list, "transactions")
# Check results
print(paste("Shape (dim equivalent):", paste(dim(votes_binary), collapse = " ")))
## [1] "Shape (dim equivalent): 435 34"
print(paste("Number of transactions:", length(votes_transactions)))
## [1] "Number of transactions: 435"
print(paste("Number of items:", dim(votes_transactions)[2]))
## [1] "Number of items: 34"
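The item count of 34 follows from how model.matrix() encodes factors. As a minimal sketch on a made-up three-row data frame (the toy data frame and its values are assumptions for illustration, not part of the dataset), the first factor keeps a column for every level, while each later factor loses its first level, which here is "?":

```r
# Hypothetical toy data with the same column layout as the voting data
toy <- data.frame(
  party = factor(c("democrat", "republican", "democrat")),
  V1 = factor(c("y", "n", "?"), levels = c("?", "n", "y")),
  V2 = factor(c("?", "y", "n"), levels = c("?", "n", "y"))
)

# Same encoding call as in the analysis
enc <- model.matrix(~ . - 1, data = toy)

# party keeps both levels; V1 and V2 keep only "n" and "y"
print(colnames(enc))

# Row 3 voted "?" on V1, so it has neither V1n nor V1y
print(enc[3, c("V1n", "V1y")])
```

An unknown vote therefore shows up as the absence of both the n and y items rather than as an item of its own.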
STEP 3: Item frequency
We calculate the item frequency (support). Support is the proportion of transactions in which a particular itemset appears, so it shows how often that itemset occurs in the data. Itemsets with very low support usually do not lead to useful association rules. V6y, V16y, and partydemocrat are the three most frequent items, each appearing in more than 60% of transactions. To make this easier to see, we draw a bar plot of the 10 most frequent items.
# Item frequency (support)
item_freq <- itemFrequency(votes_transactions, type = "relative")
print("\nTop 10 most frequent items:")
## [1] "\nTop 10 most frequent items:"
print(sort(item_freq, decreasing = TRUE)[1:10])
## V6y V16y partydemocrat V11n V3y
## 0.6252874 0.6183908 0.6137931 0.6068966 0.5816092
## V14y V4n V8y V7y V1n
## 0.5701149 0.5678161 0.5563218 0.5494253 0.5425287
# Item Frequency Plot (Top 10 items)
num_items <- dim(votes_transactions)[2]
top_n_items <- min(10, num_items)
item_freq_sorted <- sort(item_freq, decreasing = TRUE)[1:top_n_items]
item_freq_df <- data.frame(
Item = names(item_freq_sorted),
Support = as.numeric(item_freq_sorted)
)
ggplot(item_freq_df, aes(x = reorder(Item, -Support), y = Support)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Items", y = "Support (Frequency)",
title = paste("Item Frequency Plot (Top", top_n_items, "Items)")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, face = "bold"))
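For single items, support is simply the share of transactions that contain the item. A minimal base-R sketch on a made-up binary matrix (the matrix values are assumptions for illustration) computes the same quantity by hand:

```r
# Hypothetical 5 x 3 binary matrix: rows are transactions, columns are items
toy_binary <- matrix(
  c(1, 1, 1,
    1, 1, 1,
    1, 0, 1,
    1, 0, 0,
    0, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("partydemocrat", "V3y", "V4n"))
)

# Relative support = column mean of the 0/1 indicators
toy_support <- colMeans(toy_binary)
print(sort(toy_support, decreasing = TRUE))
```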
STEP 3.1: Sparse matrix visualization
We plot the first 10 transactions against the first 10 items; each point marks an item that is present in a transaction.
# Visualize sparse matrix for first 10 transactions and first 10 items
num_transactions <- length(votes_transactions)
n_trans <- min(10, num_transactions)
n_items <- min(10, num_items)
# Extract sparse matrix data
sparse_data <- as(votes_transactions[1:n_trans, 1:n_items], "matrix")
# Find positions
which_items <- which(sparse_data == 1, arr.ind = TRUE)
# Plot points
plot(which_items[, 2], which_items[, 1],
pch = 16, col = "blue",
xlab = "Items", ylab = "Transactions",
main = paste("Sparse Matrix Visualization (First", n_trans, "Transactions × First", n_items, "Items)"),
xlim = c(1, n_items), ylim = c(1, n_trans),
bg = "white")
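The plotted positions come from which(..., arr.ind = TRUE), which returns one row/column pair per matching cell. A small sketch on a made-up matrix (values assumed for illustration):

```r
# Hypothetical 2 x 3 incidence matrix: rows are transactions, columns are items
toy_matrix <- matrix(
  c(1, 0, 0,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t1", "t2"), c("partydemocrat", "V1n", "V1y"))
)

# One (row, col) pair for every cell equal to 1
positions <- which(toy_matrix == 1, arr.ind = TRUE)
print(positions)
```

The "col" values become the x coordinates (items) and the "row" values the y coordinates (transactions).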
STEP 4: Run Apriori algorithm
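Now we use the Apriori algorithm to mine the data with a minimum support of 0.1. The call below does not set target = "frequent itemsets", so apriori() falls back to its defaults (target = "rules", confidence = 0.8), as the parameter specification in the output shows. The 481,443 results are therefore association rules with a confidence of at least 0.8 rather than frequent itemsets, and mining stopped at the maximum length of 10 items. Rules with single-item and multi-item left-hand sides both appear, which means there are voting patterns that can be analyzed.
For example:
{V4n} => {partydemocrat} = 0.5632184 (support), 0.9919028 (confidence), 0.5678161 (coverage), 1.616021 (lift), 245 (count)
If someone votes “no” on issue V4, they are 99.2% likely to be a Democrat, and this pattern appears in 56.3% of all voting records (245 out of 435 transactions), with a lift of 1.62, indicating a positive association.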
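The metrics in this example can be recomputed from raw counts. The sketch below assumes 247 "no" votes on V4 and 267 Democrats; these two counts are not printed in the output and are back-calculated here from the reported coverage (0.5678161 × 435) and the support of partydemocrat (0.6137931 × 435).

```r
n_total <- 435  # all transactions
n_lhs   <- 247  # assumed count of V4n (0.5678161 * 435)
n_rhs   <- 267  # assumed count of partydemocrat (0.6137931 * 435)
n_both  <- 245  # transactions with both items (the "count" column)

support    <- n_both / n_total                # how often the rule applies
coverage   <- n_lhs / n_total                 # how often the left-hand side occurs
confidence <- n_both / n_lhs                  # P(rhs | lhs)
lift       <- confidence / (n_rhs / n_total)  # confidence relative to P(rhs)

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 7))
```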
# Mine with min_support = 0.1; target is not set, so apriori() returns rules (default confidence 0.8)
frequent_itemsets <- apriori(votes_transactions,
parameter = list(support = 0.1,
minlen = 1,
maxlen = 10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 43
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[34 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [34 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(votes_transactions, parameter = list(support = 0.1, minlen =
## 1, : Mining stopped (maxlen reached). Only patterns up to a length of 10
## returned!
## done [0.03s].
## writing ... [481443 rule(s)] done [0.02s].
## creating S4 object ... done [0.13s].
print(paste("Number of frequent itemsets:", length(frequent_itemsets)))
## [1] "Number of frequent itemsets: 481443"
print("\nTop 10 frequent itemsets:")
## [1] "\nTop 10 frequent itemsets:"
inspect(sort(frequent_itemsets, by = "support", decreasing = TRUE)[1:10])
## lhs rhs support confidence coverage
## [1] {V4n} => {partydemocrat} 0.5632184 0.9919028 0.5678161
## [2] {partydemocrat} => {V4n} 0.5632184 0.9176030 0.6137931
## [3] {V3y} => {partydemocrat} 0.5310345 0.9130435 0.5816092
## [4] {partydemocrat} => {V3y} 0.5310345 0.8651685 0.6137931
## [5] {V4n} => {V3y} 0.5034483 0.8866397 0.5678161
## [6] {V3y} => {V4n} 0.5034483 0.8656126 0.5816092
## [7] {V3y, V4n} => {partydemocrat} 0.5034483 1.0000000 0.5034483
## [8] {partydemocrat, V4n} => {V3y} 0.5034483 0.8938776 0.5632184
## [9] {partydemocrat, V3y} => {V4n} 0.5034483 0.9480519 0.5310345
## [10] {V8y} => {partydemocrat} 0.5011494 0.9008264 0.5563218
## lift count
## [1] 1.616021 245
## [2] 1.616021 245
## [3] 1.487543 231
## [4] 1.487543 231
## [5] 1.524460 219
## [6] 1.524460 219
## [7] 1.629213 219
## [8] 1.536904 219
## [9] 1.669646 219
## [10] 1.467639 218
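The parameter specification above reports target rules, so the object holds rules rather than itemsets. As a hedged sketch of how frequent itemsets themselves could be mined, the example below sets target = "frequent itemsets" on a made-up five-transaction dataset (the toy transactions and the 0.3 support threshold are assumptions for illustration; results on the real data are not shown here):

```r
library(arules)

# Hypothetical toy transactions, not taken from the voting data
toy <- list(
  c("partydemocrat", "V3y", "V4n"),
  c("partydemocrat", "V3y", "V4n"),
  c("partydemocrat", "V4n"),
  c("partyrepublican", "V3n", "V4y"),
  c("partyrepublican", "V3n", "V4y")
)
toy_trans <- as(toy, "transactions")

# target = "frequent itemsets" asks for itemsets instead of rules
itemsets <- apriori(toy_trans,
                    parameter = list(support = 0.3,
                                     target = "frequent itemsets"),
                    control = list(verbose = FALSE))

print(class(itemsets))
inspect(sort(itemsets, by = "support"))
```

On the voting data the same parameter could be added to the apriori() call in this step; the number of itemsets would generally differ from the rule count reported above.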
STEP 5: Generate association rules
Now we run the Apriori algorithm again, this time with a minimum confidence of 0.5 and a minimum rule length of 2, to generate association rules. This yields 576,122 association rules in total.
For example:
{V16n} => {partyrepublican} = 0.1149425 (support), 0.8064516 (confidence), 0.1425287 (coverage), 2.088134 (lift), 50 (count)
If someone votes “no” on issue V16, they are 81% likely to be a Republican, and this pattern appears in 11.5% of all voting records (50 out of 435 transactions), with a lift of 2.09, indicating a positive association.
# Generate rules with confidence = 0.5
rules <- apriori(votes_transactions,
parameter = list(support = 0.1,
confidence = 0.5,
minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 43
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[34 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [34 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(votes_transactions, parameter = list(support = 0.1,
## confidence = 0.5, : Mining stopped (maxlen reached). Only patterns up to a
## length of 10 returned!
## done [0.03s].
## writing ... [576122 rule(s)] done [0.02s].
## creating S4 object ... done [0.12s].
print(paste("Number of rules:", length(rules)))
## [1] "Number of rules: 576122"
print("\nRules summary:")
## [1] "\nRules summary:"
inspect(rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {V16n} => {partyrepublican} 0.1149425 0.8064516 0.1425287 2.088134 50
## [2] {V16n} => {V3n} 0.1310345 0.9193548 0.1425287 2.338710 57
## [3] {V16n} => {V12y} 0.1126437 0.7903226 0.1425287 2.010470 49
## [4] {V16n} => {V4y} 0.1264368 0.8870968 0.1425287 2.180153 55
## [5] {V16n} => {V8n} 0.1356322 0.9516129 0.1425287 2.325571 59
## [6] {V16n} => {V7n} 0.1356322 0.9516129 0.1425287 2.274459 59
## [7] {V16n} => {V9n} 0.1333333 0.9354839 0.1425287 1.975415 58
## [8] {V16n} => {V13y} 0.1287356 0.9032258 0.1425287 1.879920 56
## [9] {V16n} => {V5y} 0.1356322 0.9516129 0.1425287 1.952602 59
## [10] {V16n} => {V15n} 0.1264368 0.8870968 0.1425287 1.656168 55
STEP 6: Sort and inspect rules
We sort rules by three metrics to find the most important patterns:
Lift: Finds rules with the strongest associations. Lift > 1.0 means a positive association, lift = 1.0 means the two sides are independent, and lift < 1.0 means a negative association.
Confidence: Finds the most reliable rules.
Support: Finds the most frequent rules that appear in many transactions.
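To make the lift thresholds concrete, here is a minimal sketch with made-up counts (all numbers are assumptions for illustration) in which the two sides of a rule co-occur more often than, exactly as often as, and less often than independence would predict:

```r
# Lift from raw counts: P(lhs and rhs) / (P(lhs) * P(rhs))
lift_from_counts <- function(n_both, n_lhs, n_rhs, n_total) {
  (n_both / n_total) / ((n_lhs / n_total) * (n_rhs / n_total))
}

# Hypothetical data: 100 transactions, lhs in 40 of them, rhs in 50
# Under independence we would expect 40 * 50 / 100 = 20 co-occurrences
lifts <- c(
  positive    = lift_from_counts(30, 40, 50, 100),
  independent = lift_from_counts(20, 40, 50, 100),
  negative    = lift_from_counts(10, 40, 50, 100)
)
print(lifts)
```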
# Define num_rules
num_rules <- length(rules)
# Sort by lift
rules_by_lift <- sort(rules, by = "lift", decreasing = TRUE)
print("Top 5 rules by LIFT:")
## [1] "Top 5 rules by LIFT:"
inspect(rules_by_lift[1:5])
## lhs rhs support confidence coverage lift count
## [1] {V10n,
## V14n,
## V1y,
## V4n,
## V5n,
## V7y,
## V9y} => {V6n} 0.1195402 0.962963 0.1241379 2.755848 52
## [2] {V10n,
## V14n,
## V1y,
## V4n,
## V5n,
## V7y,
## V8y,
## V9y} => {V6n} 0.1195402 0.962963 0.1241379 2.755848 52
## [3] {V10n,
## V14n,
## V1y,
## V3y,
## V4n,
## V5n,
## V7y,
## V9y} => {V6n} 0.1195402 0.962963 0.1241379 2.755848 52
## [4] {partydemocrat,
## V10n,
## V14n,
## V1y,
## V4n,
## V5n,
## V7y,
## V9y} => {V6n} 0.1195402 0.962963 0.1241379 2.755848 52
## [5] {V10n,
## V14n,
## V1y,
## V3y,
## V4n,
## V5n,
## V7y,
## V8y,
## V9y} => {V6n} 0.1195402 0.962963 0.1241379 2.755848 52
# Sort by confidence
rules_by_conf <- sort(rules, by = "confidence", decreasing = TRUE)
print("\nTop 5 rules by CONFIDENCE:")
## [1] "\nTop 5 rules by CONFIDENCE:"
inspect(rules_by_conf[1:5])
## lhs rhs support confidence coverage lift
## [1] {partyrepublican, V16n} => {V3n} 0.1149425 1 0.1149425 2.543860
## [2] {partyrepublican, V16n} => {V4y} 0.1149425 1 0.1149425 2.457627
## [3] {partyrepublican, V16n} => {V9n} 0.1149425 1 0.1149425 2.111650
## [4] {V12y, V16n} => {V14y} 0.1126437 1 0.1126437 1.754032
## [5] {V16n, V1n} => {V14y} 0.1149425 1 0.1149425 1.754032
## count
## [1] 50
## [2] 50
## [3] 50
## [4] 49
## [5] 50
# Sort by support
rules_by_supp <- sort(rules, by = "support", decreasing = TRUE)
print("\nTop 5 rules by SUPPORT:")
## [1] "\nTop 5 rules by SUPPORT:"
inspect(rules_by_supp[1:5])
## lhs rhs support confidence coverage lift
## [1] {V4n} => {partydemocrat} 0.5632184 0.9919028 0.5678161 1.616021
## [2] {partydemocrat} => {V4n} 0.5632184 0.9176030 0.6137931 1.616021
## [3] {V3y} => {partydemocrat} 0.5310345 0.9130435 0.5816092 1.487543
## [4] {partydemocrat} => {V3y} 0.5310345 0.8651685 0.6137931 1.487543
## [5] {V4n} => {V3y} 0.5034483 0.8866397 0.5678161 1.524460
## count
## [1] 245
## [2] 245
## [3] 231
## [4] 231
## [5] 219
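The histograms show how support, confidence, and lift are distributed across all rules. Most rules have low support, and only a few have high support. Confidence is skewed the other way: the Step 4 call, which used the same support threshold and the default minimum confidence of 0.8, already returned 481,443 rules, so about 84% of the 576,122 rules have a confidence of at least 0.8. The lift histogram shows how far the rules depart from independence (lift = 1).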
# Distribution of Rule Metrics
all_rules_df <- as(rules, "data.frame")
# Distribution plots
par(mfrow = c(1, 3))
hist(all_rules_df$support, breaks = 30, main = "Distribution of Support",
xlab = "Support", col = "blue", border = "black")
hist(all_rules_df$confidence, breaks = 30, main = "Distribution of Confidence",
xlab = "Confidence", col = "orange", border = "black")
hist(all_rules_df$lift, breaks = 30, main = "Distribution of Lift",
xlab = "Lift", col = "green", border = "black")
par(mfrow = c(1, 1))
STEP 6.1: Rule visualizations
We visualise all the rules in a scatter plot of support against confidence, coloured by lift. Points in the upper-right, with both high support and high confidence, are the most frequent and most reliable rules. The second scatter plot is read the same way but shows only the 10 rules with the highest support.
# Filter rules for better visualization
top_n_rules <- min(50, num_rules)
top_rules <- sort(rules, by = "lift", decreasing = TRUE)[1:top_n_rules]
# Convert rules to data frame for plotting
rules_df <- as(top_rules, "data.frame")
# Scatter Plot 1: Support vs Confidence
ggplot(all_rules_df, aes(x = support, y = confidence, color = lift)) +
geom_point(size = 1, alpha = 0.5) +
scale_color_gradient(low = "lightcoral", high = "darkred", name = "Lift") +
labs(x = "Support", y = "Confidence",
title = paste("Scatter plot for", nrow(all_rules_df), "rules")) +
theme_minimal() +
theme(legend.position = "right",
panel.background = element_rect(fill = "white", color = NA),
plot.background = element_rect(fill = "white", color = NA),
plot.title = element_text(hjust = 0.5, face = "bold"))
# Scatter Plot 2: Support vs Confidence (using top 10 rules)
top_10_supp <- sort(rules, by = c("support", "lift"), decreasing = TRUE)[1:min(10, num_rules)]
top_10_df <- as(top_10_supp, "data.frame")
ggplot(top_10_df, aes(x = support, y = confidence, color = lift)) +
geom_point(size = 3, alpha = 0.7) +
scale_color_gradient(low = "lightcoral", high = "darkred", name = "Lift") +
labs(x = "Support", y = "Confidence",
title = paste("Scatter plot for", nrow(top_10_df), "rules")) +
theme_minimal() +
theme(legend.position = "right",
panel.background = element_rect(fill = "white", color = NA),
plot.background = element_rect(fill = "white", color = NA),
plot.title = element_text(hjust = 0.5, face = "bold"))
STEP 6.2: Additional rule visualizations
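The grouped plot of the 10 highest-support rules shows that partydemocrat and V4n form the most frequent association: being a Democrat strongly predicts voting “no” on issue V4 (confidence 0.92), and voting “no” on V4 predicts being a Democrat even more strongly (confidence 0.99). This visualization helps to identify the most important relationships in the voting data.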
# Prepare top 10 rules for visualization
top_10_rules <- sort(rules, by = c("support", "lift"), decreasing = TRUE)[1:min(10, num_rules)]
# Grouped plot (top 10 rules)
plot(top_10_rules, method = "grouped")
This plot shows similar results to the previous one: partydemocrat and V4n are linked in both directions, which indicates that each predicts the other, and this pair is the most frequent pattern in the data.
# Graph plot (top 10 rules)
plot(top_10_rules, method = "graph")
STEP 7: Filter rules (rules containing “partydemocrat”)
A total of 136,091 rules involve partydemocrat, which points to strong party-based voting in 1984, with many predictable relationships between party membership and votes. Because the filter uses items %in%, it matches partydemocrat on either side of a rule, so the subset contains both rules in which party affiliation predicts a vote and rules in which votes predict party affiliation.
all_items <- itemLabels(votes_transactions)
party_items <- all_items[grep("party", all_items, ignore.case = TRUE)]
print("Party-related items found:")
## [1] "Party-related items found:"
print(party_items)
## [1] "partydemocrat" "partyrepublican"
# Filter rules involving 'partydemocrat' (on either side of the rule)
democrat_rules <- subset(rules, items %in% party_items[grep("democrat", party_items, ignore.case = TRUE)])
print(paste("Rules involving 'party_democrat':", length(democrat_rules)))
## [1] "Rules involving 'party_democrat': 136091"
inspect(democrat_rules[1:10])
## lhs rhs support confidence coverage lift
## [1] {V11y} => {partydemocrat} 0.2965517 0.8600000 0.3448276 1.401124
## [2] {V6n} => {partydemocrat} 0.3103448 0.8881579 0.3494253 1.446999
## [3] {partydemocrat} => {V6n} 0.3103448 0.5056180 0.6137931 1.446999
## [4] {V14n} => {partydemocrat} 0.3839080 0.9823529 0.3908046 1.600463
## [5] {partydemocrat} => {V14n} 0.3839080 0.6254682 0.6137931 1.600463
## [6] {V15y} => {partydemocrat} 0.3678161 0.9195402 0.4000000 1.498127
## [7] {partydemocrat} => {V15y} 0.3678161 0.5992509 0.6137931 1.498127
## [8] {V1y} => {partydemocrat} 0.3586207 0.8342246 0.4298851 1.359130
## [9] {partydemocrat} => {V1y} 0.3586207 0.5842697 0.6137931 1.359130
## [10] {V2y} => {partydemocrat} 0.2758621 0.6153846 0.4482759 1.002593
## count
## [1] 129
## [2] 135
## [3] 135
## [4] 167
## [5] 167
## [6] 160
## [7] 160
## [8] 156
## [9] 156
## [10] 120
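Since items %in% matches either side of a rule, the two directions can be separated with lhs %in% and rhs %in%. A hedged sketch on a made-up five-transaction dataset (the toy data and thresholds are assumptions for illustration, not results from the voting data):

```r
library(arules)

# Hypothetical toy transactions, not taken from the voting data
toy <- list(
  c("partydemocrat", "V3y", "V4n"),
  c("partydemocrat", "V3y", "V4n"),
  c("partydemocrat", "V4n"),
  c("partyrepublican", "V3n", "V4y"),
  c("partyrepublican", "V3n", "V4y")
)
toy_trans <- as(toy, "transactions")

toy_rules <- apriori(toy_trans,
                     parameter = list(support = 0.3,
                                      confidence = 0.5,
                                      minlen = 2),
                     control = list(verbose = FALSE))

# Votes predicting party: partydemocrat on the right-hand side
votes_to_party <- subset(toy_rules, rhs %in% "partydemocrat")
# Party predicting votes: partydemocrat on the left-hand side
party_to_votes <- subset(toy_rules, lhs %in% "partydemocrat")
# Either side, as in the filter used in this step
either_side <- subset(toy_rules, items %in% "partydemocrat")

print(c(rhs = length(votes_to_party),
        lhs = length(party_to_votes),
        either = length(either_side)))
```

Applied to the rules object, the rhs subset would isolate the rules in which votes predict party membership.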
We found patterns in how Congressmen voted in 1984. Using the Apriori algorithm, we discovered relationships among votes and between votes and party affiliation. We generated 576,122 association rules, 136,091 of which involve the Democratic party, showing strong party-based voting patterns. By turning raw votes into rules, we revealed both expected patterns and unexpected relationships between votes.