# install.packages("arules")
# install.packages("arulesViz")
library(arules)
library(arulesViz)
This project applies association rule mining to road traffic accident data from Addis Ababa, Ethiopia (Dataset) collected from sub city police departments. Unlike typical grocery basket examples, this dataset offers an opportunity to uncover actionable safety patterns with real-world policy implications. The features span driver demographics (age, sex, education, experience), environmental conditions (weather, lighting, road surface), infrastructure (junction type, lanes), and outcomes (severity, collision type). Each accident can be viewed as a “basket” of co-occurring conditions, making the rich categorical structure ideal for association analysis.
# Read the cleaned traffic accident dataset
acc <- read.csv("C:/Users/David Abraham/Downloads/cleaned.csv/cleaned.csv", stringsAsFactors = TRUE)
# Basic inspection
head(acc)
str(acc)
## 'data.frame': 12316 obs. of 15 variables:
## $ Age_band_of_driver : Factor w/ 5 levels "18-30","31-50",..: 1 2 1 1 1 2 1 1 1 1 ...
## $ Sex_of_driver : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Educational_level : Factor w/ 7 levels "Above high school",..: 1 5 5 5 5 6 5 5 5 5 ...
## $ Vehicle_driver_relation: Factor w/ 4 levels "Employee","Other",..: 1 1 1 1 1 4 1 1 1 1 ...
## $ Driving_experience : Factor w/ 8 levels "1-2yr","2-5yr",..: 1 4 1 3 2 8 2 2 4 1 ...
## $ Lanes_or_Medians : Factor w/ 7 levels "Double carriageway (median)",..: 7 6 3 3 3 7 6 3 3 6 ...
## $ Types_of_Junction : Factor w/ 8 levels "Crossing","No junction",..: 2 2 2 8 8 8 1 8 8 8 ...
## $ Road_surface_type : Factor w/ 6 levels "Asphalt roads",..: 1 1 1 3 1 6 6 1 3 1 ...
## $ Light_conditions : Factor w/ 4 levels "Darkness - lights lit",..: 4 4 4 1 1 4 4 4 4 4 ...
## $ Weather_conditions : Factor w/ 9 levels "Cloudy","Fog or mist",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Type_of_collision : Factor w/ 10 levels "Collision with animals",..: 3 9 4 9 9 9 9 9 3 3 ...
## $ Vehicle_movement : Factor w/ 13 levels "Entering a junction",..: 3 3 3 3 3 11 4 11 3 11 ...
## $ Pedestrian_movement : Factor w/ 9 levels "Crossing from driver's nearside",..: 6 6 6 6 6 6 6 6 1 6 ...
## $ Cause_of_accident : Factor w/ 20 levels "Changing lane to the left",..: 10 17 1 2 17 15 14 13 2 10 ...
## $ Accident_severity : int 2 2 1 2 2 2 2 2 2 1 ...
summary(acc)
## Age_band_of_driver Sex_of_driver Educational_level
## 18-30 :4271 Female : 701 Above high school : 362
## 31-50 :4087 Male :11437 Elementary school :2163
## Over 51 :1585 Unknown: 178 High school :1110
## Under 18: 825 Illiterate : 45
## Unknown :1548 Junior high school:7619
## Unknown : 841
## Writing & reading : 176
## Vehicle_driver_relation Driving_experience
## Employee:9627 5-10yr :3363
## Other : 123 2-5yr :2613
## Owner :1973 Above 10yr:2262
## Unknown : 593 1-2yr :1756
## Below 1yr :1342
## Unknown : 829
## (Other) : 151
## Lanes_or_Medians Types_of_Junction
## Double carriageway (median) :1020 Y Shape :4543
## One way : 845 No junction:3837
## other :1660 Crossing :2177
## Two-way (divided with broken lines road marking):4411 Unknown :1078
## Two-way (divided with solid lines road marking) : 142 Other : 445
## Undivided Two way :3796 O Shape : 164
## Unknown : 442 (Other) : 72
## Road_surface_type Light_conditions
## Asphalt roads :11296 Darkness - lights lit :3286
## Asphalt roads with some distress: 81 Darkness - lights unlit: 40
## Earth roads : 358 Darkness - no lighting : 192
## Gravel roads : 242 Daylight :8798
## Other : 167
## Unknown : 172
##
## Weather_conditions Type_of_collision
## Normal :10063 Vehicle with vehicle collision :8774
## Raining: 1331 Collision with roadside objects:1786
## Other : 296 Collision with pedestrians : 896
## Unknown: 292 Rollover : 397
## Cloudy : 125 Collision with animals : 171
## Windy : 98 Unknown : 169
## (Other): 111 (Other) : 123
## Vehicle_movement
## Going straight :8158
## Moving Backward: 985
## Other : 937
## Reversing : 563
## Turnover : 489
## Unknown : 396
## (Other) : 788
## Pedestrian_movement
## Not a Pedestrian :11390
## Crossing from nearside - masked by parked or statioNot a Pedestrianry vehicle : 337
## Unknown or other : 293
## Crossing from driver's nearside : 140
## Crossing from offside - masked by parked or statioNot a Pedestrianry vehicle : 72
## In carriageway, statioNot a Pedestrianry - not crossing (standing or playing): 46
## (Other) : 38
## Cause_of_accident Accident_severity
## No distancing :2263 Min. :0.000
## Changing lane to the right:1808 1st Qu.:2.000
## Changing lane to the left :1473 Median :2.000
## Driving carelessly :1402 Mean :1.833
## No priority to vehicle :1207 3rd Qu.:2.000
## Moving Backward :1137 Max. :2.000
## (Other) :3026
The dataset contains dimensions whose values are already categorical by nature making any transformations from continuous to discrete data no longer needed.
# Convert Accident_severity to factor
acc$Accident_severity <- as.factor(acc$Accident_severity)
# Check structure again
str(acc)
## 'data.frame': 12316 obs. of 15 variables:
## $ Age_band_of_driver : Factor w/ 5 levels "18-30","31-50",..: 1 2 1 1 1 2 1 1 1 1 ...
## $ Sex_of_driver : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Educational_level : Factor w/ 7 levels "Above high school",..: 1 5 5 5 5 6 5 5 5 5 ...
## $ Vehicle_driver_relation: Factor w/ 4 levels "Employee","Other",..: 1 1 1 1 1 4 1 1 1 1 ...
## $ Driving_experience : Factor w/ 8 levels "1-2yr","2-5yr",..: 1 4 1 3 2 8 2 2 4 1 ...
## $ Lanes_or_Medians : Factor w/ 7 levels "Double carriageway (median)",..: 7 6 3 3 3 7 6 3 3 6 ...
## $ Types_of_Junction : Factor w/ 8 levels "Crossing","No junction",..: 2 2 2 8 8 8 1 8 8 8 ...
## $ Road_surface_type : Factor w/ 6 levels "Asphalt roads",..: 1 1 1 3 1 6 6 1 3 1 ...
## $ Light_conditions : Factor w/ 4 levels "Darkness - lights lit",..: 4 4 4 1 1 4 4 4 4 4 ...
## $ Weather_conditions : Factor w/ 9 levels "Cloudy","Fog or mist",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Type_of_collision : Factor w/ 10 levels "Collision with animals",..: 3 9 4 9 9 9 9 9 3 3 ...
## $ Vehicle_movement : Factor w/ 13 levels "Entering a junction",..: 3 3 3 3 3 11 4 11 3 11 ...
## $ Pedestrian_movement : Factor w/ 9 levels "Crossing from driver's nearside",..: 6 6 6 6 6 6 6 6 1 6 ...
## $ Cause_of_accident : Factor w/ 20 levels "Changing lane to the left",..: 10 17 1 2 17 15 14 13 2 10 ...
## $ Accident_severity : Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
# Convert data frame to "transactions"
trans <- as(acc, "transactions")
# Inspect transactions
summary(trans)
## transactions as itemMatrix in sparse format with
## 12316 rows (elements/itemsets/transactions) and
## 116 columns (items) and a density of 0.1293103
##
## most frequent items:
## Sex_of_driver=Male Pedestrian_movement=Not a Pedestrian
## 11437 11390
## Road_surface_type=Asphalt roads Accident_severity=2
## 11296 10415
## Weather_conditions=Normal (Other)
## 10063 130139
##
## element (itemset/transaction) length distribution:
## sizes
## 15
## 12316
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15 15 15 15 15 15
##
## includes extended item information - examples:
## labels variables levels
## 1 Age_band_of_driver=18-30 Age_band_of_driver 18-30
## 2 Age_band_of_driver=31-50 Age_band_of_driver 31-50
## 3 Age_band_of_driver=Over 51 Age_band_of_driver Over 51
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
inspect(trans[1:5])
## items transactionID
## [1] {Age_band_of_driver=18-30,
## Sex_of_driver=Male,
## Educational_level=Above high school,
## Vehicle_driver_relation=Employee,
## Driving_experience=1-2yr,
## Lanes_or_Medians=Unknown,
## Types_of_Junction=No junction,
## Road_surface_type=Asphalt roads,
## Light_conditions=Daylight,
## Weather_conditions=Normal,
## Type_of_collision=Collision with roadside-parked vehicles,
## Vehicle_movement=Going straight,
## Pedestrian_movement=Not a Pedestrian,
## Cause_of_accident=Moving Backward,
## Accident_severity=2} 1
## [2] {Age_band_of_driver=31-50,
## Sex_of_driver=Male,
## Educational_level=Junior high school,
## Vehicle_driver_relation=Employee,
## Driving_experience=Above 10yr,
## Lanes_or_Medians=Undivided Two way,
## Types_of_Junction=No junction,
## Road_surface_type=Asphalt roads,
## Light_conditions=Daylight,
## Weather_conditions=Normal,
## Type_of_collision=Vehicle with vehicle collision,
## Vehicle_movement=Going straight,
## Pedestrian_movement=Not a Pedestrian,
## Cause_of_accident=Overtaking,
## Accident_severity=2} 2
## [3] {Age_band_of_driver=18-30,
## Sex_of_driver=Male,
## Educational_level=Junior high school,
## Vehicle_driver_relation=Employee,
## Driving_experience=1-2yr,
## Lanes_or_Medians=other,
## Types_of_Junction=No junction,
## Road_surface_type=Asphalt roads,
## Light_conditions=Daylight,
## Weather_conditions=Normal,
## Type_of_collision=Collision with roadside objects,
## Vehicle_movement=Going straight,
## Pedestrian_movement=Not a Pedestrian,
## Cause_of_accident=Changing lane to the left,
## Accident_severity=1} 3
## [4] {Age_band_of_driver=18-30,
## Sex_of_driver=Male,
## Educational_level=Junior high school,
## Vehicle_driver_relation=Employee,
## Driving_experience=5-10yr,
## Lanes_or_Medians=other,
## Types_of_Junction=Y Shape,
## Road_surface_type=Earth roads,
## Light_conditions=Darkness - lights lit,
## Weather_conditions=Normal,
## Type_of_collision=Vehicle with vehicle collision,
## Vehicle_movement=Going straight,
## Pedestrian_movement=Not a Pedestrian,
## Cause_of_accident=Changing lane to the right,
## Accident_severity=2} 4
## [5] {Age_band_of_driver=18-30,
## Sex_of_driver=Male,
## Educational_level=Junior high school,
## Vehicle_driver_relation=Employee,
## Driving_experience=2-5yr,
## Lanes_or_Medians=other,
## Types_of_Junction=Y Shape,
## Road_surface_type=Asphalt roads,
## Light_conditions=Darkness - lights lit,
## Weather_conditions=Normal,
## Type_of_collision=Vehicle with vehicle collision,
## Vehicle_movement=Going straight,
## Pedestrian_movement=Not a Pedestrian,
## Cause_of_accident=Overtaking,
## Accident_severity=2} 5
The above logs verify that one transaction contains all the information of one accident from the dataset. Keeping the concept of encapsulating each accident as combination of different “items” made tackling this analysis a little easier.
Since most of the columns in this dataset are long , shortening them will help in future visualizations.
# Relative item frequencies (top 20 for readability)
head(sort(itemFrequency(trans, type = "relative"), decreasing = TRUE), 20)
## Sex_of_driver=Male
## 0.9286294
## Pedestrian_movement=Not a Pedestrian
## 0.9248133
## Road_surface_type=Asphalt roads
## 0.9171809
## Accident_severity=2
## 0.8456479
## Weather_conditions=Normal
## 0.8170672
## Vehicle_driver_relation=Employee
## 0.7816661
## Light_conditions=Daylight
## 0.7143553
## Type_of_collision=Vehicle with vehicle collision
## 0.7124066
## Vehicle_movement=Going straight
## 0.6623904
## Educational_level=Junior high school
## 0.6186262
## Types_of_Junction=Y Shape
## 0.3688698
## Lanes_or_Medians=Two-way (divided with broken lines road marking)
## 0.3581520
## Age_band_of_driver=18-30
## 0.3467847
## Age_band_of_driver=31-50
## 0.3318448
## Types_of_Junction=No junction
## 0.3115460
## Lanes_or_Medians=Undivided Two way
## 0.3082170
## Driving_experience=5-10yr
## 0.2730594
## Light_conditions=Darkness - lights lit
## 0.2668074
## Driving_experience=2-5yr
## 0.2121630
## Cause_of_accident=No distancing
## 0.1837447
# Absolute frequencies
head(sort(itemFrequency(trans, type = "absolute"), decreasing = TRUE), 20)
## Sex_of_driver=Male
## 11437
## Pedestrian_movement=Not a Pedestrian
## 11390
## Road_surface_type=Asphalt roads
## 11296
## Accident_severity=2
## 10415
## Weather_conditions=Normal
## 10063
## Vehicle_driver_relation=Employee
## 9627
## Light_conditions=Daylight
## 8798
## Type_of_collision=Vehicle with vehicle collision
## 8774
## Vehicle_movement=Going straight
## 8158
## Educational_level=Junior high school
## 7619
## Types_of_Junction=Y Shape
## 4543
## Lanes_or_Medians=Two-way (divided with broken lines road marking)
## 4411
## Age_band_of_driver=18-30
## 4271
## Age_band_of_driver=31-50
## 4087
## Types_of_Junction=No junction
## 3837
## Lanes_or_Medians=Undivided Two way
## 3796
## Driving_experience=5-10yr
## 3363
## Light_conditions=Darkness - lights lit
## 3286
## Driving_experience=2-5yr
## 2613
## Cause_of_accident=No distancing
## 2263
# Shorten item labels for better visualization
itemLabels(trans) <- gsub("Age_band_of_driver=", "Age:", itemLabels(trans))
itemLabels(trans) <- gsub("Sex_of_driver=", "Sex:", itemLabels(trans))
itemLabels(trans) <- gsub("Educational_level=", "Edu:", itemLabels(trans))
itemLabels(trans) <- gsub("Vehicle_driver_relation=", "Relation:", itemLabels(trans))
itemLabels(trans) <- gsub("Driving_experience=", "Exp:", itemLabels(trans))
itemLabels(trans) <- gsub("Lanes_or_Medians=", "Lanes:", itemLabels(trans))
itemLabels(trans) <- gsub("Types_of_Junction=", "Junction:", itemLabels(trans))
itemLabels(trans) <- gsub("Road_surface_type=", "Road:", itemLabels(trans))
itemLabels(trans) <- gsub("Light_conditions=", "Light:", itemLabels(trans))
itemLabels(trans) <- gsub("Weather_conditions=", "Weather:", itemLabels(trans))
itemLabels(trans) <- gsub("Type_of_collision=", "Collision:", itemLabels(trans))
itemLabels(trans) <- gsub("Vehicle_movement=", "Movement:", itemLabels(trans))
itemLabels(trans) <- gsub("Pedestrian_movement=", "Pedestrian:", itemLabels(trans))
itemLabels(trans) <- gsub("Cause_of_accident=", "Cause:", itemLabels(trans))
itemLabels(trans) <- gsub("Accident_severity=", "Severity:", itemLabels(trans))
# Visualize top 15 most frequent items
itemFrequencyPlot(trans, topN = 15, type = "relative",
main = "Top 15 Most Frequent Accident Characteristics")
The above item frequency graph highlights that there exists values for different features that are highly recurring throughout the dataset like Sex_of_driver being Male and Pedestrian_movement being Not a Pedestrian having over 90% in occurrence. This means that we need to be careful in interpreting these frequent items as they might dominate the association rules due to their high prevalence.
# Apriori rules
rules_apriori <- apriori(
trans,
parameter = list(
supp = 0.05,
conf = 0.6,
minlen = 2
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 615
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[116 item(s), 12316 transaction(s)] done [0.01s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## done [0.20s].
## writing ... [77000 rule(s)] done [0.01s].
## creating S4 object ... done [0.02s].
summary(rules_apriori)
## set of 77000 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 431 3362 11012 19851 20833 13829 6079 1503 100
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 5.704 7.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05002 Min. :0.6000 Min. :0.05099 Min. :0.6627
## 1st Qu.:0.05927 1st Qu.:0.7233 1st Qu.:0.07356 1st Qu.:0.9943
## Median :0.07478 Median :0.8410 Median :0.09354 Median :1.0033
## Mean :0.09810 Mean :0.8173 Mean :0.12200 Mean :1.0121
## 3rd Qu.:0.10677 3rd Qu.:0.9188 3rd Qu.:0.13397 3rd Qu.:1.0195
## Max. :0.85888 Max. :0.9851 Max. :0.92863 Max. :1.2112
## count
## Min. : 616
## 1st Qu.: 730
## Median : 921
## Mean : 1208
## 3rd Qu.: 1315
## Max. :10578
##
## mining info:
## data ntransactions support confidence
## trans 12316 0.05 0.6
## call
## apriori(data = trans, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))
# By confidence
inspect(head(sort(rules_apriori, by = "confidence"), 1))
## lhs rhs support confidence coverage lift count
## [1] {Age:31-50,
## Junction:No junction,
## Light:Daylight,
## Collision:Vehicle with vehicle collision} => {Sex:Male} 0.05383241 0.9851412 0.05464437 1.060855 663
# By lift
inspect(head(sort(rules_apriori, by = "lift"), 1))
## lhs rhs support confidence coverage lift count
## [1] {Relation:Employee,
## Exp:Above 10yr,
## Light:Daylight,
## Weather:Normal,
## Pedestrian:Not a Pedestrian} => {Edu:Junior high school} 0.06211432 0.7492654 0.08290029 1.211176 765
# By support
inspect(head(sort(rules_apriori, by = "support"), 1))
## lhs rhs support confidence coverage
## [1] {Pedestrian:Not a Pedestrian} => {Sex:Male} 0.8588828 0.9287094 0.9248133
## lift count
## [1] 1.000086 10578
The highest confidence rule (98.5%) shows that middle-aged drivers in
daytime vehicle collisions are almost always male, while the highest
support rule (85.9%) links non-pedestrian accidents to male drivers —
however, both have lift ≈ 1.0, indicating these patterns largely reflect
dataset characteristics (93% male drivers). The highest lift rule (1.21)
reveals that experienced employee drivers (10+ years) are 21% more
likely to have junior high school education, suggesting a professional
driver demographic profile.
Overall these associations don’t give us the best metrics but it was
just an exercise to see what rules excel in each dimension. Digging
through the dataset for rules with specific outcomes might give us more
meaningful results if we sort them by lift while keeping a reasonable
support threshold.
# Remove redundant rules for cleaner visualization
rules_apriori_nonred <- rules_apriori[!is.redundant(rules_apriori)]
cat("Original rules:", length(rules_apriori),
"\nNon-redundant rules:", length(rules_apriori_nonred), "\n")
## Original rules: 77000
## Non-redundant rules: 7956
# Scatter plot: Support vs Confidence, shaded by Lift
plot(rules_apriori, measure = c("support", "confidence"),
shading = "lift", main = "Association Rules: Support vs Confidence")
The scatter plot reveals a trade-off between frequency and being interesting: rules with high support (>0.5) cluster around lift ≈ 1.0, indicating no real association beyond chance. The most interesting rules (lift ~1.2, darker red) appear only at low support (<0.25), meaning genuine patterns exist but are relatively rare. The horizontal bands at ~93% confidence reflect the male-dominated dataset rather than unique discoveries.
rules_female_inverse <- apriori(trans,
parameter = list(supp = 0.005, conf = 0.15, minlen = 2),
appearance = list(default = "lhs", rhs = "Sex:Female"),
control = list(verbose = FALSE))
# Apply both filters: non-redundant AND maximal
rules_female_filtered <- rules_female_inverse[!is.redundant(rules_female_inverse) & is.maximal(rules_female_inverse)]
cat("Female driver rules - Original:", length(rules_female_inverse),
"Non-redundant + Maximal:", length(rules_female_filtered), "\n\n")
## Female driver rules - Original: 2268 Non-redundant + Maximal: 51
# Use filtered rules for visualization and inspection
top_female <- head(sort(rules_female_filtered, by = "lift"), 10)
plot(top_female, method = "graph",
main = "Network: Factors Leading to Female Driver Involvement")
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
inspect(head(sort(rules_female_filtered, by = "lift"), 5))
## lhs rhs support confidence coverage lift count
## [1] {Age:Unknown,
## Junction:Y Shape,
## Light:Darkness - lights lit,
## Weather:Normal} => {Sex:Female} 0.005034102 0.5123967 0.009824618 9.002393 62
## [2] {Age:Unknown,
## Lanes:Two-way (divided with broken lines road marking),
## Road:Asphalt roads,
## Light:Darkness - lights lit,
## Pedestrian:Not a Pedestrian} => {Sex:Female} 0.005683664 0.4895105 0.011610913 8.600301 70
## [3] {Age:Unknown,
## Relation:Employee,
## Exp:1-2yr,
## Road:Asphalt roads,
## Weather:Normal,
## Pedestrian:Not a Pedestrian} => {Sex:Female} 0.005115297 0.4701493 0.010880156 8.260140 63
## [4] {Age:Unknown,
## Lanes:Two-way (divided with broken lines road marking),
## Light:Darkness - lights lit,
## Weather:Normal} => {Sex:Female} 0.005115297 0.4666667 0.010961351 8.198954 63
## [5] {Age:Unknown,
## Relation:Employee,
## Junction:No junction,
## Light:Darkness - lights lit,
## Pedestrian:Not a Pedestrian} => {Sex:Female} 0.005115297 0.4666667 0.010961351 8.198954 63
Female driver accidents in this dataset show a highly specific profile: they predominantly occur in urban nighttime conditions (darkness with street lighting), at Y-shaped junctions, on asphalt roads. These accidents are 9-10x more likely to involve female drivers than the 7% baseline, suggesting gender-specific exposure patterns.
# What factors lead to Severity=2 accidents?
rules_serious <- apriori(trans,
parameter = list(supp = 0.005, conf = 0.12, minlen = 2),
appearance = list(default = "lhs", rhs = "Severity:2"),
control = list(verbose = FALSE))
cat("Total rules found:", length(rules_serious), "\n\n")
## Total rules found: 244016
inspect(head(sort(rules_serious, by = "lift"), 3))
## lhs rhs support confidence coverage lift count
## [1] {Age:31-50,
## Lanes:One way,
## Junction:No junction,
## Light:Daylight,
## Pedestrian:Not a Pedestrian} => {Severity:2} 0.005034102 1 0.005034102 1.182525 62
## [2] {Age:18-30,
## Junction:Crossing,
## Road:Asphalt roads,
## Weather:Raining,
## Pedestrian:Not a Pedestrian} => {Severity:2} 0.005115297 1 0.005115297 1.182525 63
## [3] {Age:31-50,
## Sex:Male,
## Lanes:One way,
## Junction:No junction,
## Light:Daylight,
## Pedestrian:Not a Pedestrian} => {Severity:2} 0.005034102 1 0.005034102 1.182525 62
Certain specific conditions (e.g., middle-aged drivers on one-way streets during daylight) predict severity (Severity=2) with 100% confidence.
However, with modest lift values (1.18) reflecting only an 18% increase over the 85% baseline, these rules identify minor variations in already-dominant outcomes rather than critical safety interventions. So let’s try looking at Severity level 1 accidents instead.
# What factors lead to Severity=1 accidents?
rules_serious <- apriori(trans,
parameter = list(supp = 0.005, conf = 0.12, minlen = 2),
appearance = list(default = "lhs", rhs = "Severity:1"),
control = list(verbose = FALSE))
# Apply quality filters
rules_serious_nonred <- rules_serious[!is.redundant(rules_serious)]
rules_serious_filtered <- rules_serious_nonred[is.maximal(rules_serious_nonred)]
cat("Serious injury rules - Original:", length(rules_serious),
"| Non-redundant:", length(rules_serious_nonred),
"| Maximal:", length(rules_serious_filtered), "\n\n")
## Serious injury rules - Original: 18162 | Non-redundant: 1830 | Maximal: 382
inspect(head(sort(rules_serious_filtered, by = "lift"), 3))
## lhs rhs support confidence coverage lift count
## [1] {Age:Under 18,
## Sex:Male,
## Junction:No junction,
## Road:Asphalt roads} => {Severity:1} 0.005358883 0.3173077 0.01688860 2.242089 66
## [2] {Age:Under 18,
## Sex:Male,
## Relation:Employee,
## Road:Asphalt roads,
## Weather:Normal,
## Movement:Going straight} => {Severity:1} 0.006252030 0.2421384 0.02582007 1.710944 77
## [3] {Age:Under 18,
## Relation:Employee,
## Road:Asphalt roads,
## Light:Daylight,
## Movement:Going straight} => {Severity:1} 0.005034102 0.2393822 0.02102956 1.691470 62
# Plots for top serious injury rules
top_serious <- head(sort(rules_serious_filtered, by = "lift"), 10)
plot(top_serious, method = "graph",
main = "Network: Factors Leading to Serious Injuries (Severity=1)")
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
Underage drivers (<18 years) face significantly elevated serious injury risk, with the strongest association (lift=2.24) occurring when young male drivers are involved in accidents on asphalt roads at non-junction locations. These conditions result in serious injuries at 2.24 times the baseline rate, with 32% severity compared to the 14.5% overall average.
The concentration of serious underage accidents on straight road segments (non-junctions) rather than intersections suggests speed-related factors, indicating that enhanced enforcement and speed controls on open road segments may reduce serious injury risk for this vulnerable demographic.
# Find frequent itemsets (support > 5%)
freq_items_eclat <- eclat(trans,
parameter = list(supp = 0.05, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 615
##
## create itemset ...
## set transactions ...[116 item(s), 12316 transaction(s)] done [0.01s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating bit matrix ... [49 row(s), 12316 column(s)] done [0.00s].
## writing ... [11611 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
cat("Total frequent itemsets found:", length(freq_items_eclat), "\n\n")
## Total frequent itemsets found: 11611
# Top 10 most frequent itemsets
cat("=== TOP 10 MOST FREQUENT ITEMSETS ===\n")
## === TOP 10 MOST FREQUENT ITEMSETS ===
top_itemsets <- sort(freq_items_eclat, by = "support", decreasing = TRUE)[1:10]
inspect(top_itemsets)
## items support count
## [1] {Sex:Male} 0.9286294 11437
## [2] {Pedestrian:Not a Pedestrian} 0.9248133 11390
## [3] {Road:Asphalt roads} 0.9171809 11296
## [4] {Sex:Male,
## Pedestrian:Not a Pedestrian} 0.8588828 10578
## [5] {Sex:Male,
## Road:Asphalt roads} 0.8512504 10484
## [6] {Road:Asphalt roads,
## Pedestrian:Not a Pedestrian} 0.8492205 10459
## [7] {Severity:2} 0.8456479 10415
## [8] {Weather:Normal} 0.8170672 10063
## [9] {Sex:Male,
## Road:Asphalt roads,
## Pedestrian:Not a Pedestrian} 0.7879994 9705
## [10] {Sex:Male,
## Severity:2} 0.7846703 9664
cat("\n=== BOTTOM 10 LEAST FREQUENT ITEMSETS (Support ≥5%) ===\n")
##
## === BOTTOM 10 LEAST FREQUENT ITEMSETS (Support ≥5%) ===
bottom_itemsets <- sort(freq_items_eclat, by = "support", decreasing = FALSE)[1:10]
inspect(bottom_itemsets)
## items support count
## [1] {Sex:Male,
## Weather:Normal,
## Movement:Moving Backward,
## Severity:2} 0.05001624 616
## [2] {Lanes:Double carriageway (median),
## Collision:Vehicle with vehicle collision,
## Severity:2} 0.05001624 616
## [3] {Sex:Male,
## Edu:High school,
## Collision:Vehicle with vehicle collision,
## Severity:2} 0.05001624 616
## [4] {Edu:High school,
## Collision:Vehicle with vehicle collision,
## Pedestrian:Not a Pedestrian,
## Severity:2} 0.05001624 616
## [5] {Relation:Employee,
## Collision:Vehicle with vehicle collision,
## Cause:Moving Backward} 0.05001624 616
## [6] {Road:Asphalt roads,
## Movement:Going straight,
## Cause:No priority to vehicle,
## Severity:2} 0.05001624 616
## [7] {Relation:Employee,
## Road:Asphalt roads,
## Collision:Vehicle with vehicle collision,
## Cause:No priority to vehicle} 0.05001624 616
## [8] {Relation:Employee,
## Weather:Normal,
## Pedestrian:Not a Pedestrian,
## Cause:No priority to vehicle,
## Severity:2} 0.05001624 616
## [9] {Relation:Employee,
## Exp:Below 1yr,
## Road:Asphalt roads,
## Movement:Going straight,
## Pedestrian:Not a Pedestrian} 0.05001624 616
## [10] {Relation:Employee,
## Exp:Below 1yr,
## Movement:Going straight,
## Severity:2} 0.05001624 616
cat("\n=== RULES INDUCED FROM ECLAT ===\n")
##
## === RULES INDUCED FROM ECLAT ===
freq_rules_eclat <- ruleInduction(freq_items_eclat, trans, confidence = 0.6)
cat("Total rules generated:", length(freq_rules_eclat), "\n\n")
## Total rules generated: 34656
inspect(head(sort(freq_rules_eclat, by = "confidence"), 1))
## lhs rhs support confidence lift itemset
## [1] {Age:31-50,
## Junction:No junction,
## Light:Daylight,
## Collision:Vehicle with vehicle collision} => {Sex:Male} 0.05383241 0.9851412 1.060855 7893
inspect(head(sort(freq_rules_eclat, by = "lift"), 1))
## lhs rhs support confidence lift itemset
## [1] {Relation:Employee,
## Exp:Above 10yr,
## Light:Daylight,
## Weather:Normal} => {Edu:Junior high school} 0.06771679 0.7479821 1.209102 4213
inspect(head(sort(freq_rules_eclat, by = "support"), 1))
## lhs rhs support confidence lift
## [1] {Pedestrian:Not a Pedestrian} => {Sex:Male} 0.8588828 0.9287094 1.000086
## itemset
## [1] 11562
From the above results on running E-Clat, The top frequent itemsets reveal that most accidents share common characteristics: male drivers (92.9%), asphalt road surfaces (91.7%), non-pedestrian incidents (92.5%), and minor injury outcomes (84.6% Severity=2). The combination of {Male + Asphalt + Not a Pedestrian} co-occurs in 78.8% of all accidents, representing the typical accident profile in this dataset.
Notably, all bottom-ranked itemsets cluster exactly at the 5.0% threshold (616 accidents), indicating the algorithm’s detection boundary. The rules created by ECLAT align perfectly with the prior Apriori results when looking for the rules that has highest confidence, support and lift which is great sign as both algorithms despite having different approaches should yield the same results which is consistent with our test.
This analysis demonstrates that association rule mining extends effectively beyond retail applications to extract meaningful safety insights from traffic accident data. The dataset’s frequency imbalance uncovered in the early stages of the report highlighted a challenge to keep in mind while assessing created rules. However, targeted mining with adaptive thresholds revealed critical risk patterns like underage drivers face 2.24× elevated serious injury risk on non-junction segments, elderly drivers experience 3.6× higher rollover likelihood during nighttime urban driving, and female driver accidents concentrate 9-10× more at lit Y-junctions.
Interms of creating rules, both Apriori and ECLAT methods were useful with one helping us find specific answers to complex questions and the other giving us information about frequently occurring factors across accidents.
Ultimately, this work proves that imbalanced categorical data demands context-appropriate techniques. When properly adapted through targeted mining, lift-based filtering, and domain-informed parameters, association rules transform complex accident records into policy-relevant patterns for vulnerable demographics and high-risk conditions. Lastly, It was fascinating to explore how association rules can be a powerful technique to uncover patterns in data.
State of Road Traffic accidents in Ethopia: https://www.nature.com/articles/s41598-024-74483-5
Apriori Algorithm:
https://www.ibm.com/think/topics/apriori-algorithm
Eclat Algorithm:
https://quality-life.medium.com/eclat-algorithm-in-machine-learning-fe07d33fcc5b