# install.packages("arules")
# install.packages("arulesViz")
library(arules)
library(arulesViz)

Introduction

This project applies association rule mining to road traffic accident data from Addis Ababa, Ethiopia (Dataset) collected from sub city police departments. Unlike typical grocery basket examples, this dataset offers an opportunity to uncover actionable safety patterns with real-world policy implications. The features span driver demographics (age, sex, education, experience), environmental conditions (weather, lighting, road surface), infrastructure (junction type, lanes), and outcomes (severity, collision type). Each accident can be viewed as a “basket” of co-occurring conditions, making the rich categorical structure ideal for association analysis.

1. Data Loading and Exploration

# Read the cleaned traffic accident dataset
acc <- read.csv("C:/Users/David Abraham/Downloads/cleaned.csv/cleaned.csv", stringsAsFactors = TRUE)

# Basic inspection
head(acc)
str(acc)
## 'data.frame':    12316 obs. of  15 variables:
##  $ Age_band_of_driver     : Factor w/ 5 levels "18-30","31-50",..: 1 2 1 1 1 2 1 1 1 1 ...
##  $ Sex_of_driver          : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Educational_level      : Factor w/ 7 levels "Above high school",..: 1 5 5 5 5 6 5 5 5 5 ...
##  $ Vehicle_driver_relation: Factor w/ 4 levels "Employee","Other",..: 1 1 1 1 1 4 1 1 1 1 ...
##  $ Driving_experience     : Factor w/ 8 levels "1-2yr","2-5yr",..: 1 4 1 3 2 8 2 2 4 1 ...
##  $ Lanes_or_Medians       : Factor w/ 7 levels "Double carriageway (median)",..: 7 6 3 3 3 7 6 3 3 6 ...
##  $ Types_of_Junction      : Factor w/ 8 levels "Crossing","No junction",..: 2 2 2 8 8 8 1 8 8 8 ...
##  $ Road_surface_type      : Factor w/ 6 levels "Asphalt roads",..: 1 1 1 3 1 6 6 1 3 1 ...
##  $ Light_conditions       : Factor w/ 4 levels "Darkness - lights lit",..: 4 4 4 1 1 4 4 4 4 4 ...
##  $ Weather_conditions     : Factor w/ 9 levels "Cloudy","Fog or mist",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Type_of_collision      : Factor w/ 10 levels "Collision with animals",..: 3 9 4 9 9 9 9 9 3 3 ...
##  $ Vehicle_movement       : Factor w/ 13 levels "Entering a junction",..: 3 3 3 3 3 11 4 11 3 11 ...
##  $ Pedestrian_movement    : Factor w/ 9 levels "Crossing from driver's nearside",..: 6 6 6 6 6 6 6 6 1 6 ...
##  $ Cause_of_accident      : Factor w/ 20 levels "Changing lane to the left",..: 10 17 1 2 17 15 14 13 2 10 ...
##  $ Accident_severity      : int  2 2 1 2 2 2 2 2 2 1 ...
summary(acc)
##  Age_band_of_driver Sex_of_driver            Educational_level
##  18-30   :4271      Female :  701   Above high school : 362   
##  31-50   :4087      Male   :11437   Elementary school :2163   
##  Over 51 :1585      Unknown:  178   High school       :1110   
##  Under 18: 825                      Illiterate        :  45   
##  Unknown :1548                      Junior high school:7619   
##                                     Unknown           : 841   
##                                     Writing & reading : 176   
##  Vehicle_driver_relation  Driving_experience
##  Employee:9627           5-10yr    :3363    
##  Other   : 123           2-5yr     :2613    
##  Owner   :1973           Above 10yr:2262    
##  Unknown : 593           1-2yr     :1756    
##                          Below 1yr :1342    
##                          Unknown   : 829    
##                          (Other)   : 151    
##                                          Lanes_or_Medians   Types_of_Junction
##  Double carriageway (median)                     :1020    Y Shape    :4543   
##  One way                                         : 845    No junction:3837   
##  other                                           :1660    Crossing   :2177   
##  Two-way (divided with broken lines road marking):4411    Unknown    :1078   
##  Two-way (divided with solid lines road marking) : 142    Other      : 445   
##  Undivided Two way                               :3796    O Shape    : 164   
##  Unknown                                         : 442    (Other)    :  72   
##                         Road_surface_type                Light_conditions
##  Asphalt roads                   :11296   Darkness - lights lit  :3286   
##  Asphalt roads with some distress:   81   Darkness - lights unlit:  40   
##  Earth roads                     :  358   Darkness - no lighting : 192   
##  Gravel roads                    :  242   Daylight               :8798   
##  Other                           :  167                                  
##  Unknown                         :  172                                  
##                                                                          
##  Weather_conditions                       Type_of_collision
##  Normal :10063      Vehicle with vehicle collision :8774   
##  Raining: 1331      Collision with roadside objects:1786   
##  Other  :  296      Collision with pedestrians     : 896   
##  Unknown:  292      Rollover                       : 397   
##  Cloudy :  125      Collision with animals         : 171   
##  Windy  :   98      Unknown                        : 169   
##  (Other):  111      (Other)                        : 123   
##         Vehicle_movement
##  Going straight :8158   
##  Moving Backward: 985   
##  Other          : 937   
##  Reversing      : 563   
##  Turnover       : 489   
##  Unknown        : 396   
##  (Other)        : 788   
##                                                                      Pedestrian_movement
##  Not a Pedestrian                                                              :11390   
##  Crossing from nearside - masked by parked or statioNot a Pedestrianry vehicle :  337   
##  Unknown or other                                                              :  293   
##  Crossing from driver's nearside                                               :  140   
##  Crossing from offside - masked by  parked or statioNot a Pedestrianry vehicle :   72   
##  In carriageway, statioNot a Pedestrianry - not crossing  (standing or playing):   46   
##  (Other)                                                                       :   38   
##                   Cause_of_accident Accident_severity
##  No distancing             :2263    Min.   :0.000    
##  Changing lane to the right:1808    1st Qu.:2.000    
##  Changing lane to the left :1473    Median :2.000    
##  Driving carelessly        :1402    Mean   :1.833    
##  No priority to vehicle    :1207    3rd Qu.:2.000    
##  Moving Backward           :1137    Max.   :2.000    
##  (Other)                   :3026

The dataset contains dimensions whose values are already categorical by nature making any transformations from continuous to discrete data no longer needed.

# Convert Accident_severity to factor
acc$Accident_severity <- as.factor(acc$Accident_severity)

# Check structure again
str(acc)
## 'data.frame':    12316 obs. of  15 variables:
##  $ Age_band_of_driver     : Factor w/ 5 levels "18-30","31-50",..: 1 2 1 1 1 2 1 1 1 1 ...
##  $ Sex_of_driver          : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Educational_level      : Factor w/ 7 levels "Above high school",..: 1 5 5 5 5 6 5 5 5 5 ...
##  $ Vehicle_driver_relation: Factor w/ 4 levels "Employee","Other",..: 1 1 1 1 1 4 1 1 1 1 ...
##  $ Driving_experience     : Factor w/ 8 levels "1-2yr","2-5yr",..: 1 4 1 3 2 8 2 2 4 1 ...
##  $ Lanes_or_Medians       : Factor w/ 7 levels "Double carriageway (median)",..: 7 6 3 3 3 7 6 3 3 6 ...
##  $ Types_of_Junction      : Factor w/ 8 levels "Crossing","No junction",..: 2 2 2 8 8 8 1 8 8 8 ...
##  $ Road_surface_type      : Factor w/ 6 levels "Asphalt roads",..: 1 1 1 3 1 6 6 1 3 1 ...
##  $ Light_conditions       : Factor w/ 4 levels "Darkness - lights lit",..: 4 4 4 1 1 4 4 4 4 4 ...
##  $ Weather_conditions     : Factor w/ 9 levels "Cloudy","Fog or mist",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Type_of_collision      : Factor w/ 10 levels "Collision with animals",..: 3 9 4 9 9 9 9 9 3 3 ...
##  $ Vehicle_movement       : Factor w/ 13 levels "Entering a junction",..: 3 3 3 3 3 11 4 11 3 11 ...
##  $ Pedestrian_movement    : Factor w/ 9 levels "Crossing from driver's nearside",..: 6 6 6 6 6 6 6 6 1 6 ...
##  $ Cause_of_accident      : Factor w/ 20 levels "Changing lane to the left",..: 10 17 1 2 17 15 14 13 2 10 ...
##  $ Accident_severity      : Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
# Convert data frame to "transactions"
trans <- as(acc, "transactions")

# Inspect transactions
summary(trans)
## transactions as itemMatrix in sparse format with
##  12316 rows (elements/itemsets/transactions) and
##  116 columns (items) and a density of 0.1293103 
## 
## most frequent items:
##                   Sex_of_driver=Male Pedestrian_movement=Not a Pedestrian 
##                                11437                                11390 
##      Road_surface_type=Asphalt roads                  Accident_severity=2 
##                                11296                                10415 
##            Weather_conditions=Normal                              (Other) 
##                                10063                               130139 
## 
## element (itemset/transaction) length distribution:
## sizes
##    15 
## 12316 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      15      15      15      15      15      15 
## 
## includes extended item information - examples:
##                       labels          variables  levels
## 1   Age_band_of_driver=18-30 Age_band_of_driver   18-30
## 2   Age_band_of_driver=31-50 Age_band_of_driver   31-50
## 3 Age_band_of_driver=Over 51 Age_band_of_driver Over 51
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
inspect(trans[1:5])
##     items                                                        transactionID
## [1] {Age_band_of_driver=18-30,                                                
##      Sex_of_driver=Male,                                                      
##      Educational_level=Above high school,                                     
##      Vehicle_driver_relation=Employee,                                        
##      Driving_experience=1-2yr,                                                
##      Lanes_or_Medians=Unknown,                                                
##      Types_of_Junction=No junction,                                           
##      Road_surface_type=Asphalt roads,                                         
##      Light_conditions=Daylight,                                               
##      Weather_conditions=Normal,                                               
##      Type_of_collision=Collision with roadside-parked vehicles,               
##      Vehicle_movement=Going straight,                                         
##      Pedestrian_movement=Not a Pedestrian,                                    
##      Cause_of_accident=Moving Backward,                                       
##      Accident_severity=2}                                                    1
## [2] {Age_band_of_driver=31-50,                                                
##      Sex_of_driver=Male,                                                      
##      Educational_level=Junior high school,                                    
##      Vehicle_driver_relation=Employee,                                        
##      Driving_experience=Above 10yr,                                           
##      Lanes_or_Medians=Undivided Two way,                                      
##      Types_of_Junction=No junction,                                           
##      Road_surface_type=Asphalt roads,                                         
##      Light_conditions=Daylight,                                               
##      Weather_conditions=Normal,                                               
##      Type_of_collision=Vehicle with vehicle collision,                        
##      Vehicle_movement=Going straight,                                         
##      Pedestrian_movement=Not a Pedestrian,                                    
##      Cause_of_accident=Overtaking,                                            
##      Accident_severity=2}                                                    2
## [3] {Age_band_of_driver=18-30,                                                
##      Sex_of_driver=Male,                                                      
##      Educational_level=Junior high school,                                    
##      Vehicle_driver_relation=Employee,                                        
##      Driving_experience=1-2yr,                                                
##      Lanes_or_Medians=other,                                                  
##      Types_of_Junction=No junction,                                           
##      Road_surface_type=Asphalt roads,                                         
##      Light_conditions=Daylight,                                               
##      Weather_conditions=Normal,                                               
##      Type_of_collision=Collision with roadside objects,                       
##      Vehicle_movement=Going straight,                                         
##      Pedestrian_movement=Not a Pedestrian,                                    
##      Cause_of_accident=Changing lane to the left,                             
##      Accident_severity=1}                                                    3
## [4] {Age_band_of_driver=18-30,                                                
##      Sex_of_driver=Male,                                                      
##      Educational_level=Junior high school,                                    
##      Vehicle_driver_relation=Employee,                                        
##      Driving_experience=5-10yr,                                               
##      Lanes_or_Medians=other,                                                  
##      Types_of_Junction=Y Shape,                                               
##      Road_surface_type=Earth roads,                                           
##      Light_conditions=Darkness - lights lit,                                  
##      Weather_conditions=Normal,                                               
##      Type_of_collision=Vehicle with vehicle collision,                        
##      Vehicle_movement=Going straight,                                         
##      Pedestrian_movement=Not a Pedestrian,                                    
##      Cause_of_accident=Changing lane to the right,                            
##      Accident_severity=2}                                                    4
## [5] {Age_band_of_driver=18-30,                                                
##      Sex_of_driver=Male,                                                      
##      Educational_level=Junior high school,                                    
##      Vehicle_driver_relation=Employee,                                        
##      Driving_experience=2-5yr,                                                
##      Lanes_or_Medians=other,                                                  
##      Types_of_Junction=Y Shape,                                               
##      Road_surface_type=Asphalt roads,                                         
##      Light_conditions=Darkness - lights lit,                                  
##      Weather_conditions=Normal,                                               
##      Type_of_collision=Vehicle with vehicle collision,                        
##      Vehicle_movement=Going straight,                                         
##      Pedestrian_movement=Not a Pedestrian,                                    
##      Cause_of_accident=Overtaking,                                            
##      Accident_severity=2}                                                    5

The above logs verify that one transaction contains all the information of one accident from the dataset. Keeping the concept of encapsulating each accident as combination of different “items” made tackling this analysis a little easier.

Since most of the columns in this dataset are long , shortening them will help in future visualizations.

# Relative item frequencies (top 20 for readability)
head(sort(itemFrequency(trans, type = "relative"), decreasing = TRUE), 20)
##                                                Sex_of_driver=Male 
##                                                         0.9286294 
##                              Pedestrian_movement=Not a Pedestrian 
##                                                         0.9248133 
##                                   Road_surface_type=Asphalt roads 
##                                                         0.9171809 
##                                               Accident_severity=2 
##                                                         0.8456479 
##                                         Weather_conditions=Normal 
##                                                         0.8170672 
##                                  Vehicle_driver_relation=Employee 
##                                                         0.7816661 
##                                         Light_conditions=Daylight 
##                                                         0.7143553 
##                  Type_of_collision=Vehicle with vehicle collision 
##                                                         0.7124066 
##                                   Vehicle_movement=Going straight 
##                                                         0.6623904 
##                              Educational_level=Junior high school 
##                                                         0.6186262 
##                                         Types_of_Junction=Y Shape 
##                                                         0.3688698 
## Lanes_or_Medians=Two-way (divided with broken lines road marking) 
##                                                         0.3581520 
##                                          Age_band_of_driver=18-30 
##                                                         0.3467847 
##                                          Age_band_of_driver=31-50 
##                                                         0.3318448 
##                                     Types_of_Junction=No junction 
##                                                         0.3115460 
##                                Lanes_or_Medians=Undivided Two way 
##                                                         0.3082170 
##                                         Driving_experience=5-10yr 
##                                                         0.2730594 
##                            Light_conditions=Darkness - lights lit 
##                                                         0.2668074 
##                                          Driving_experience=2-5yr 
##                                                         0.2121630 
##                                   Cause_of_accident=No distancing 
##                                                         0.1837447
# Absolute frequencies
head(sort(itemFrequency(trans, type = "absolute"), decreasing = TRUE), 20)
##                                                Sex_of_driver=Male 
##                                                             11437 
##                              Pedestrian_movement=Not a Pedestrian 
##                                                             11390 
##                                   Road_surface_type=Asphalt roads 
##                                                             11296 
##                                               Accident_severity=2 
##                                                             10415 
##                                         Weather_conditions=Normal 
##                                                             10063 
##                                  Vehicle_driver_relation=Employee 
##                                                              9627 
##                                         Light_conditions=Daylight 
##                                                              8798 
##                  Type_of_collision=Vehicle with vehicle collision 
##                                                              8774 
##                                   Vehicle_movement=Going straight 
##                                                              8158 
##                              Educational_level=Junior high school 
##                                                              7619 
##                                         Types_of_Junction=Y Shape 
##                                                              4543 
## Lanes_or_Medians=Two-way (divided with broken lines road marking) 
##                                                              4411 
##                                          Age_band_of_driver=18-30 
##                                                              4271 
##                                          Age_band_of_driver=31-50 
##                                                              4087 
##                                     Types_of_Junction=No junction 
##                                                              3837 
##                                Lanes_or_Medians=Undivided Two way 
##                                                              3796 
##                                         Driving_experience=5-10yr 
##                                                              3363 
##                            Light_conditions=Darkness - lights lit 
##                                                              3286 
##                                          Driving_experience=2-5yr 
##                                                              2613 
##                                   Cause_of_accident=No distancing 
##                                                              2263
# Shorten item labels for better visualization
itemLabels(trans) <- gsub("Age_band_of_driver=", "Age:", itemLabels(trans))
itemLabels(trans) <- gsub("Sex_of_driver=", "Sex:", itemLabels(trans))
itemLabels(trans) <- gsub("Educational_level=", "Edu:", itemLabels(trans))
itemLabels(trans) <- gsub("Vehicle_driver_relation=", "Relation:", itemLabels(trans))
itemLabels(trans) <- gsub("Driving_experience=", "Exp:", itemLabels(trans))
itemLabels(trans) <- gsub("Lanes_or_Medians=", "Lanes:", itemLabels(trans))
itemLabels(trans) <- gsub("Types_of_Junction=", "Junction:", itemLabels(trans))
itemLabels(trans) <- gsub("Road_surface_type=", "Road:", itemLabels(trans))
itemLabels(trans) <- gsub("Light_conditions=", "Light:", itemLabels(trans))
itemLabels(trans) <- gsub("Weather_conditions=", "Weather:", itemLabels(trans))
itemLabels(trans) <- gsub("Type_of_collision=", "Collision:", itemLabels(trans))
itemLabels(trans) <- gsub("Vehicle_movement=", "Movement:", itemLabels(trans))
itemLabels(trans) <- gsub("Pedestrian_movement=", "Pedestrian:", itemLabels(trans))
itemLabels(trans) <- gsub("Cause_of_accident=", "Cause:", itemLabels(trans))
itemLabels(trans) <- gsub("Accident_severity=", "Severity:", itemLabels(trans))
# Visualize top 15 most frequent items
itemFrequencyPlot(trans, topN = 15, type = "relative", 
                  main = "Top 15 Most Frequent Accident Characteristics")

The above item frequency graph highlights that there exists values for different features that are highly recurring throughout the dataset like Sex_of_driver being Male and Pedestrian_movement being Not a Pedestrian having over 90% in occurrence. This means that we need to be careful in interpreting these frequent items as they might dominate the association rules due to their high prevalence.

2. Using Apriori to build association rules

# Apriori rules
rules_apriori <- apriori(
  trans,
  parameter = list(
    supp = 0.05,
    conf = 0.6,
    minlen = 2
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 615 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[116 item(s), 12316 transaction(s)] done [0.01s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
##  done [0.20s].
## writing ... [77000 rule(s)] done [0.01s].
## creating S4 object  ... done [0.02s].
summary(rules_apriori)
## set of 77000 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9    10 
##   431  3362 11012 19851 20833 13829  6079  1503   100 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   5.704   7.000  10.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.05002   Min.   :0.6000   Min.   :0.05099   Min.   :0.6627  
##  1st Qu.:0.05927   1st Qu.:0.7233   1st Qu.:0.07356   1st Qu.:0.9943  
##  Median :0.07478   Median :0.8410   Median :0.09354   Median :1.0033  
##  Mean   :0.09810   Mean   :0.8173   Mean   :0.12200   Mean   :1.0121  
##  3rd Qu.:0.10677   3rd Qu.:0.9188   3rd Qu.:0.13397   3rd Qu.:1.0195  
##  Max.   :0.85888   Max.   :0.9851   Max.   :0.92863   Max.   :1.2112  
##      count      
##  Min.   :  616  
##  1st Qu.:  730  
##  Median :  921  
##  Mean   : 1208  
##  3rd Qu.: 1315  
##  Max.   :10578  
## 
## mining info:
##   data ntransactions support confidence
##  trans         12316    0.05        0.6
##                                                                          call
##  apriori(data = trans, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))
  # By confidence
  inspect(head(sort(rules_apriori, by = "confidence"), 1))
##     lhs                                           rhs           support confidence   coverage     lift count
## [1] {Age:31-50,                                                                                             
##      Junction:No junction,                                                                                  
##      Light:Daylight,                                                                                        
##      Collision:Vehicle with vehicle collision} => {Sex:Male} 0.05383241  0.9851412 0.05464437 1.060855   663
  # By lift
  inspect(head(sort(rules_apriori, by = "lift"), 1))
##     lhs                              rhs                         support confidence   coverage     lift count
## [1] {Relation:Employee,                                                                                      
##      Exp:Above 10yr,                                                                                         
##      Light:Daylight,                                                                                         
##      Weather:Normal,                                                                                         
##      Pedestrian:Not a Pedestrian} => {Edu:Junior high school} 0.06211432  0.7492654 0.08290029 1.211176   765
  # By support
  inspect(head(sort(rules_apriori, by = "support"), 1))
##     lhs                              rhs        support   confidence coverage 
## [1] {Pedestrian:Not a Pedestrian} => {Sex:Male} 0.8588828 0.9287094  0.9248133
##     lift     count
## [1] 1.000086 10578

The highest confidence rule (98.5%) shows that middle-aged drivers in daytime vehicle collisions are almost always male, while the highest support rule (85.9%) links non-pedestrian accidents to male drivers — however, both have lift ≈ 1.0, indicating these patterns largely reflect dataset characteristics (93% male drivers). The highest lift rule (1.21) reveals that experienced employee drivers (10+ years) are 21% more likely to have junior high school education, suggesting a professional driver demographic profile.

Overall these associations don’t give us the best metrics but it was just an exercise to see what rules excel in each dimension. Digging through the dataset for rules with specific outcomes might give us more meaningful results if we sort them by lift while keeping a reasonable support threshold.

# Remove redundant rules for cleaner visualization
rules_apriori_nonred <- rules_apriori[!is.redundant(rules_apriori)]
cat("Original rules:", length(rules_apriori), 
    "\nNon-redundant rules:", length(rules_apriori_nonred), "\n")
## Original rules: 77000 
## Non-redundant rules: 7956
# Scatter plot: Support vs Confidence, shaded by Lift
plot(rules_apriori, measure = c("support", "confidence"), 
     shading = "lift", main = "Association Rules: Support vs Confidence")

The scatter plot reveals a trade-off between frequency and being interesting: rules with high support (>0.5) cluster around lift ≈ 1.0, indicating no real association beyond chance. The most interesting rules (lift ~1.2, darker red) appear only at low support (<0.25), meaning genuine patterns exist but are relatively rare. The horizontal bands at ~93% confidence reflect the male-dominated dataset rather than unique discoveries.

Some Interesting Findings using Apriori

What kind of accidents are female drivers involved in?

rules_female_inverse <- apriori(trans,
  parameter = list(supp = 0.005, conf = 0.15, minlen = 2),
  appearance = list(default = "lhs", rhs = "Sex:Female"),
  control = list(verbose = FALSE))

# Apply both filters: non-redundant AND maximal
rules_female_filtered <- rules_female_inverse[!is.redundant(rules_female_inverse) & is.maximal(rules_female_inverse)]

cat("Female driver rules - Original:", length(rules_female_inverse),
    "Non-redundant + Maximal:", length(rules_female_filtered), "\n\n")
## Female driver rules - Original: 2268 Non-redundant + Maximal: 51
# Use filtered rules for visualization and inspection
top_female <- head(sort(rules_female_filtered, by = "lift"), 10)
plot(top_female, method = "graph", 
     main = "Network: Factors Leading to Female Driver Involvement")
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

inspect(head(sort(rules_female_filtered, by = "lift"), 5))
##     lhs                                                          rhs              support confidence    coverage     lift count
## [1] {Age:Unknown,                                                                                                              
##      Junction:Y Shape,                                                                                                         
##      Light:Darkness - lights lit,                                                                                              
##      Weather:Normal}                                          => {Sex:Female} 0.005034102  0.5123967 0.009824618 9.002393    62
## [2] {Age:Unknown,                                                                                                              
##      Lanes:Two-way (divided with broken lines road marking),                                                                   
##      Road:Asphalt roads,                                                                                                       
##      Light:Darkness - lights lit,                                                                                              
##      Pedestrian:Not a Pedestrian}                             => {Sex:Female} 0.005683664  0.4895105 0.011610913 8.600301    70
## [3] {Age:Unknown,                                                                                                              
##      Relation:Employee,                                                                                                        
##      Exp:1-2yr,                                                                                                                
##      Road:Asphalt roads,                                                                                                       
##      Weather:Normal,                                                                                                           
##      Pedestrian:Not a Pedestrian}                             => {Sex:Female} 0.005115297  0.4701493 0.010880156 8.260140    63
## [4] {Age:Unknown,                                                                                                              
##      Lanes:Two-way (divided with broken lines road marking),                                                                   
##      Light:Darkness - lights lit,                                                                                              
##      Weather:Normal}                                          => {Sex:Female} 0.005115297  0.4666667 0.010961351 8.198954    63
## [5] {Age:Unknown,                                                                                                              
##      Relation:Employee,                                                                                                        
##      Junction:No junction,                                                                                                     
##      Light:Darkness - lights lit,                                                                                              
##      Pedestrian:Not a Pedestrian}                             => {Sex:Female} 0.005115297  0.4666667 0.010961351 8.198954    63

Female driver accidents in this dataset show a highly specific profile: they predominantly occur in urban nighttime conditions (darkness with street lighting), at Y-shaped junctions, on asphalt roads. These accidents are 9-10x more likely to involve female drivers than the 7% baseline, suggesting gender-specific exposure patterns.

What factors lead to Severity=2 accidents?

# What factors lead to Severity=2 accidents?
rules_serious <- apriori(trans,
  parameter = list(supp = 0.005, conf = 0.12, minlen = 2),
  appearance = list(default = "lhs", rhs = "Severity:2"),
  control = list(verbose = FALSE))

cat("Total rules found:", length(rules_serious), "\n\n")
## Total rules found: 244016
inspect(head(sort(rules_serious, by = "lift"), 3))
##     lhs                              rhs              support confidence    coverage     lift count
## [1] {Age:31-50,                                                                                    
##      Lanes:One way,                                                                                
##      Junction:No junction,                                                                         
##      Light:Daylight,                                                                               
##      Pedestrian:Not a Pedestrian} => {Severity:2} 0.005034102          1 0.005034102 1.182525    62
## [2] {Age:18-30,                                                                                    
##      Junction:Crossing,                                                                            
##      Road:Asphalt roads,                                                                           
##      Weather:Raining,                                                                              
##      Pedestrian:Not a Pedestrian} => {Severity:2} 0.005115297          1 0.005115297 1.182525    63
## [3] {Age:31-50,                                                                                    
##      Sex:Male,                                                                                     
##      Lanes:One way,                                                                                
##      Junction:No junction,                                                                         
##      Light:Daylight,                                                                               
##      Pedestrian:Not a Pedestrian} => {Severity:2} 0.005034102          1 0.005034102 1.182525    62

Certain specific conditions (e.g., middle-aged drivers on one-way streets during daylight) predict severity (Severity=2) with 100% confidence.

However, with modest lift values (1.18) reflecting only an 18% increase over the 85% baseline, these rules identify minor variations in already-dominant outcomes rather than critical safety interventions. So let’s try looking at Severity level 1 accidents instead.

What factors lead to Severity=1 accidents?

# What factors lead to Severity=1 accidents?
rules_serious <- apriori(trans,
  parameter = list(supp = 0.005, conf = 0.12, minlen = 2),
  appearance = list(default = "lhs", rhs = "Severity:1"),
  control = list(verbose = FALSE))

# Apply quality filters
rules_serious_nonred <- rules_serious[!is.redundant(rules_serious)]
rules_serious_filtered <- rules_serious_nonred[is.maximal(rules_serious_nonred)]

cat("Serious injury rules - Original:", length(rules_serious),
    "| Non-redundant:", length(rules_serious_nonred),
    "| Maximal:", length(rules_serious_filtered), "\n\n")
## Serious injury rules - Original: 18162 | Non-redundant: 1830 | Maximal: 382
inspect(head(sort(rules_serious_filtered, by = "lift"), 3))
##     lhs                          rhs              support confidence   coverage     lift count
## [1] {Age:Under 18,                                                                            
##      Sex:Male,                                                                                
##      Junction:No junction,                                                                    
##      Road:Asphalt roads}      => {Severity:1} 0.005358883  0.3173077 0.01688860 2.242089    66
## [2] {Age:Under 18,                                                                            
##      Sex:Male,                                                                                
##      Relation:Employee,                                                                       
##      Road:Asphalt roads,                                                                      
##      Weather:Normal,                                                                          
##      Movement:Going straight} => {Severity:1} 0.006252030  0.2421384 0.02582007 1.710944    77
## [3] {Age:Under 18,                                                                            
##      Relation:Employee,                                                                       
##      Road:Asphalt roads,                                                                      
##      Light:Daylight,                                                                          
##      Movement:Going straight} => {Severity:1} 0.005034102  0.2393822 0.02102956 1.691470    62
# Plots for top serious injury rules
top_serious <- head(sort(rules_serious_filtered, by = "lift"), 10)
plot(top_serious, method = "graph", 
     main = "Network: Factors Leading to Serious Injuries (Severity=1)")
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

Underage drivers (<18 years) face significantly elevated serious injury risk, with the strongest association (lift=2.24) occurring when young male drivers are involved in accidents on asphalt roads at non-junction locations. These conditions result in serious injuries at 2.24 times the baseline rate, with 32% severity compared to the 14.5% overall average.

The concentration of serious underage accidents on straight road segments (non-junctions) rather than intersections suggests speed-related factors, indicating that enhanced enforcement and speed controls on open road segments may reduce serious injury risk for this vulnerable demographic.

3. Using Eclat to check associations

# Find frequent itemsets (support > 5%)
freq_items_eclat <- eclat(trans, 
  parameter = list(supp = 0.05, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 615 
## 
## create itemset ... 
## set transactions ...[116 item(s), 12316 transaction(s)] done [0.01s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating bit matrix ... [49 row(s), 12316 column(s)] done [0.00s].
## writing  ... [11611 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].
cat("Total frequent itemsets found:", length(freq_items_eclat), "\n\n")
## Total frequent itemsets found: 11611
# Top 10 most frequent itemsets
cat("=== TOP 10 MOST FREQUENT ITEMSETS ===\n")
## === TOP 10 MOST FREQUENT ITEMSETS ===
top_itemsets <- sort(freq_items_eclat, by = "support", decreasing = TRUE)[1:10]
inspect(top_itemsets)
##      items                           support count
## [1]  {Sex:Male}                    0.9286294 11437
## [2]  {Pedestrian:Not a Pedestrian} 0.9248133 11390
## [3]  {Road:Asphalt roads}          0.9171809 11296
## [4]  {Sex:Male,                                   
##       Pedestrian:Not a Pedestrian} 0.8588828 10578
## [5]  {Sex:Male,                                   
##       Road:Asphalt roads}          0.8512504 10484
## [6]  {Road:Asphalt roads,                         
##       Pedestrian:Not a Pedestrian} 0.8492205 10459
## [7]  {Severity:2}                  0.8456479 10415
## [8]  {Weather:Normal}              0.8170672 10063
## [9]  {Sex:Male,                                   
##       Road:Asphalt roads,                         
##       Pedestrian:Not a Pedestrian} 0.7879994  9705
## [10] {Sex:Male,                                   
##       Severity:2}                  0.7846703  9664
cat("\n=== BOTTOM 10 LEAST FREQUENT ITEMSETS (Support ≥5%) ===\n")
## 
## === BOTTOM 10 LEAST FREQUENT ITEMSETS (Support ≥5%) ===
bottom_itemsets <- sort(freq_items_eclat, by = "support", decreasing = FALSE)[1:10]
inspect(bottom_itemsets)
##      items                                          support count
## [1]  {Sex:Male,                                                  
##       Weather:Normal,                                            
##       Movement:Moving Backward,                                  
##       Severity:2}                                0.05001624   616
## [2]  {Lanes:Double carriageway (median),                         
##       Collision:Vehicle with vehicle collision,                  
##       Severity:2}                                0.05001624   616
## [3]  {Sex:Male,                                                  
##       Edu:High school,                                           
##       Collision:Vehicle with vehicle collision,                  
##       Severity:2}                                0.05001624   616
## [4]  {Edu:High school,                                           
##       Collision:Vehicle with vehicle collision,                  
##       Pedestrian:Not a Pedestrian,                               
##       Severity:2}                                0.05001624   616
## [5]  {Relation:Employee,                                         
##       Collision:Vehicle with vehicle collision,                  
##       Cause:Moving Backward}                     0.05001624   616
## [6]  {Road:Asphalt roads,                                        
##       Movement:Going straight,                                   
##       Cause:No priority to vehicle,                              
##       Severity:2}                                0.05001624   616
## [7]  {Relation:Employee,                                         
##       Road:Asphalt roads,                                        
##       Collision:Vehicle with vehicle collision,                  
##       Cause:No priority to vehicle}              0.05001624   616
## [8]  {Relation:Employee,                                         
##       Weather:Normal,                                            
##       Pedestrian:Not a Pedestrian,                               
##       Cause:No priority to vehicle,                              
##       Severity:2}                                0.05001624   616
## [9]  {Relation:Employee,                                         
##       Exp:Below 1yr,                                             
##       Road:Asphalt roads,                                        
##       Movement:Going straight,                                   
##       Pedestrian:Not a Pedestrian}               0.05001624   616
## [10] {Relation:Employee,                                         
##       Exp:Below 1yr,                                             
##       Movement:Going straight,                                   
##       Severity:2}                                0.05001624   616
cat("\n=== RULES INDUCED FROM ECLAT ===\n")
## 
## === RULES INDUCED FROM ECLAT ===
freq_rules_eclat <- ruleInduction(freq_items_eclat, trans, confidence = 0.6)

cat("Total rules generated:", length(freq_rules_eclat), "\n\n")
## Total rules generated: 34656
inspect(head(sort(freq_rules_eclat, by = "confidence"), 1))
##     lhs                                           rhs           support confidence     lift itemset
## [1] {Age:31-50,                                                                                    
##      Junction:No junction,                                                                         
##      Light:Daylight,                                                                               
##      Collision:Vehicle with vehicle collision} => {Sex:Male} 0.05383241  0.9851412 1.060855    7893
inspect(head(sort(freq_rules_eclat, by = "lift"), 1))
##     lhs                     rhs                         support confidence     lift itemset
## [1] {Relation:Employee,                                                                    
##      Exp:Above 10yr,                                                                       
##      Light:Daylight,                                                                       
##      Weather:Normal}     => {Edu:Junior high school} 0.06771679  0.7479821 1.209102    4213
inspect(head(sort(freq_rules_eclat, by = "support"), 1))
##     lhs                              rhs        support   confidence lift    
## [1] {Pedestrian:Not a Pedestrian} => {Sex:Male} 0.8588828 0.9287094  1.000086
##     itemset
## [1] 11562

From the above results on running E-Clat, The top frequent itemsets reveal that most accidents share common characteristics: male drivers (92.9%), asphalt road surfaces (91.7%), non-pedestrian incidents (92.5%), and minor injury outcomes (84.6% Severity=2). The combination of {Male + Asphalt + Not a Pedestrian} co-occurs in 78.8% of all accidents, representing the typical accident profile in this dataset.

Notably, all bottom-ranked itemsets cluster exactly at the 5.0% threshold (616 accidents), indicating the algorithm’s detection boundary. The rules created by ECLAT align perfectly with the prior Apriori results when looking for the rules that has highest confidence, support and lift which is great sign as both algorithms despite having different approaches should yield the same results which is consistent with our test.

4. Conclusions

This analysis demonstrates that association rule mining extends effectively beyond retail applications to extract meaningful safety insights from traffic accident data. The dataset’s frequency imbalance uncovered in the early stages of the report highlighted a challenge to keep in mind while assessing created rules. However, targeted mining with adaptive thresholds revealed critical risk patterns like underage drivers face 2.24× elevated serious injury risk on non-junction segments, elderly drivers experience 3.6× higher rollover likelihood during nighttime urban driving, and female driver accidents concentrate 9-10× more at lit Y-junctions.

Interms of creating rules, both Apriori and ECLAT methods were useful with one helping us find specific answers to complex questions and the other giving us information about frequently occurring factors across accidents.

Ultimately, this work proves that imbalanced categorical data demands context-appropriate techniques. When properly adapted through targeted mining, lift-based filtering, and domain-informed parameters, association rules transform complex accident records into policy-relevant patterns for vulnerable demographics and high-risk conditions. Lastly, It was fascinating to explore how association rules can be a powerful technique to uncover patterns in data.