Exploratory Data Analysis (EDA) by Swastika Barua

by Swastika Barua (qwx11)

Introduction

The dataset represents a detailed record of traffic incidents, capturing various environmental, road, vehicle, and individual-related factors. Each row corresponds to a unique traffic event, providing information about the conditions under which the incident occurred. It includes weather and lighting conditions, such as weather (Wthr_Cond_ID) and light conditions (Light_Cond_ID), road characteristics like road type (Road_Type_ID), road alignment (Road_Algn_ID), and surface conditions (SurfDry). Traffic control and event-related data include traffic control type (Traffic_Cntl_ID) and the harmful event (Harm_Evnt_ID), while intersection and collision information identifies whether the event occurred at an intersection (Intrsct_Relat_ID) and the type of first harmful event collision (FHE_Collsn_ID). Demographic and vehicle details encompass vehicle body style (Veh_Body_Styl_ID), ethnicity (Prsn_Ethnicity_ID), age group (Prsn_Age), and injury severity (Prsn_Injry_Sev_ID). Additionally, traffic volume (TrafVol) and speed limit categories (Crash_Speed_LimitCat) provide further context to the incidents.

Loading libraries

setwd("C:/Users/swast/OneDrive - Texas State University/02_TXST/Fall 2024")
data <- read_excel("./AI/HW1_Data.xlsx")

Data Information

The dataset, containing 1,295 rows and 19 variables, was examined for structure and content. The first few entries were viewed using head(), while summary() provided basic statistical details. glimpse() revealed that the dataset consists of 16 categorical variables (e.g., Wthr_Cond_ID, Road_Type_ID), 1 numeric variables (TrafVol), and 2 binary variables (SurfDry, GenMale). Missing data checks showed no missing values across any columns. The dataset includes variables relevant to traffic incidents, such as environmental conditions, road characteristics, vehicle types, and person-related factors.

head(data)
## # A tibble: 6 × 19
##   Wthr_Cond_ID Light_Cond_ID   Road_Type_ID Road_Algn_ID SurfDry Traffic_Cntl_ID
##   <chr>        <chr>           <chr>        <chr>          <dbl> <chr>          
## 1 Clear        Dark, not ligh… 2 lane, 2 w… Straight, l…       1 Marked lanes   
## 2 Clear        Dark, not ligh… 2 lane, 2 w… Straight, l…       1 Center stripe/…
## 3 Clear        Daylight        2 lane, 2 w… Straight, l…       1 Marked lanes   
## 4 Clear        Daylight        2 lane, 2 w… Straight, l…       1 Center stripe/…
## 5 Clear        Dark, not ligh… 2 lane, 2 w… Straight, g…       1 None           
## 6 Clear        Daylight        Unknown      Straight, l…       1 None           
## # ℹ 13 more variables: Harm_Evnt_ID <chr>, Intrsct_Relat_ID <chr>,
## #   FHE_Collsn_ID <chr>, Road_Part_Adj_ID <chr>, Road_Cls_ID <chr>,
## #   Pop_Group_ID <chr>, Crash_Speed_LimitCat <chr>, Veh_Body_Styl_ID <chr>,
## #   Prsn_Ethnicity_ID <chr>, GenMale <dbl>, TrafVol <dbl>, Prsn_Age <chr>,
## #   Prsn_Injry_Sev_ID <chr>
summary(data)
##  Wthr_Cond_ID       Light_Cond_ID      Road_Type_ID       Road_Algn_ID      
##  Length:1295        Length:1295        Length:1295        Length:1295       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     SurfDry       Traffic_Cntl_ID    Harm_Evnt_ID       Intrsct_Relat_ID  
##  Min.   :0.0000   Length:1295        Length:1295        Length:1295       
##  1st Qu.:1.0000   Class :character   Class :character   Class :character  
##  Median :1.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.9143                                                           
##  3rd Qu.:1.0000                                                           
##  Max.   :1.0000                                                           
##  FHE_Collsn_ID      Road_Part_Adj_ID   Road_Cls_ID        Pop_Group_ID      
##  Length:1295        Length:1295        Length:1295        Length:1295       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Crash_Speed_LimitCat Veh_Body_Styl_ID   Prsn_Ethnicity_ID     GenMale      
##  Length:1295          Length:1295        Length:1295        Min.   :0.0000  
##  Class :character     Class :character   Class :character   1st Qu.:1.0000  
##  Mode  :character     Mode  :character   Mode  :character   Median :1.0000  
##                                                             Mean   :0.8842  
##                                                             3rd Qu.:1.0000  
##                                                             Max.   :1.0000  
##     TrafVol        Prsn_Age         Prsn_Injry_Sev_ID 
##  Min.   :  204   Length:1295        Length:1295       
##  1st Qu.: 7694   Class :character   Class :character  
##  Median :14950   Mode  :character   Mode  :character  
##  Mean   :14685                                        
##  3rd Qu.:21623                                        
##  Max.   :28978
glimpse(data)
## Rows: 1,295
## Columns: 19
## $ Wthr_Cond_ID         <chr> "Clear", "Clear", "Clear", "Clear", "Clear", "Cle…
## $ Light_Cond_ID        <chr> "Dark, not lighted", "Dark, not lighted", "Daylig…
## $ Road_Type_ID         <chr> "2 lane, 2 way", "2 lane, 2 way", "2 lane, 2 way"…
## $ Road_Algn_ID         <chr> "Straight, level", "Straight, level", "Straight, …
## $ SurfDry              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Traffic_Cntl_ID      <chr> "Marked lanes", "Center stripe/divider", "Marked …
## $ Harm_Evnt_ID         <chr> "Motor vehicle in transport", "Motor vehicle in t…
## $ Intrsct_Relat_ID     <chr> "Non intersection", "Non intersection", "Intersec…
## $ FHE_Collsn_ID        <chr> "Sd both going straight-rear end", "Sd both going…
## $ Road_Part_Adj_ID     <chr> "Main/proper lane", "Main/proper lane", "Main/pro…
## $ Road_Cls_ID          <chr> "Farm to market", "Us & state highways", "Farm to…
## $ Pop_Group_ID         <chr> "10,000 - 24,999 pop", "Rural", "Other", "Rural",…
## $ Crash_Speed_LimitCat <chr> "30-40 mph", "65-70 mph", "45-60 mph", "65-70 mph…
## $ Veh_Body_Styl_ID     <chr> "Farm equipment", "Farm equipment", "Farm equipme…
## $ Prsn_Ethnicity_ID    <chr> "White", "White", "White", "White", "Other", "Whi…
## $ GenMale              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1…
## $ TrafVol              <dbl> 13469, 4050, 8280, 19658, 25249, 7490, 18919, 844…
## $ Prsn_Age             <chr> "25-54 years", "25-54 years", "Other", "25-54 yea…
## $ Prsn_Injry_Sev_ID    <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",…
# Check for missing data
missing_data <- sapply(data, function(x) sum(is.na(x)))
missing_data
##         Wthr_Cond_ID        Light_Cond_ID         Road_Type_ID 
##                    0                    0                    0 
##         Road_Algn_ID              SurfDry      Traffic_Cntl_ID 
##                    0                    0                    0 
##         Harm_Evnt_ID     Intrsct_Relat_ID        FHE_Collsn_ID 
##                    0                    0                    0 
##     Road_Part_Adj_ID          Road_Cls_ID         Pop_Group_ID 
##                    0                    0                    0 
## Crash_Speed_LimitCat     Veh_Body_Styl_ID    Prsn_Ethnicity_ID 
##                    0                    0                    0 
##              GenMale              TrafVol             Prsn_Age 
##                    0                    0                    0 
##    Prsn_Injry_Sev_ID 
##                    0

Descriptive Statistics

# Load necessary packages
library(compareGroups)

# Generate descriptive statistics for all variables across Prsn_Injry_Sev_ID (or another grouping variable)
res <- compareGroups(Prsn_Injry_Sev_ID ~ ., data = data)

# Display the descriptive statistics summary
summary_res <- createTable(res)

# Print the table to view the descriptive statistics
print(summary_res)
## 
## --------Summary descriptives table by 'Prsn_Injry_Sev_ID'---------
## 
## _____________________________________________________________________________________ 
##                                           BC           KA           O       p.overall 
##                                         N=120         N=50        N=1125              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Wthr_Cond_ID:                                                                   .     
##     Clear                             98 (81.7%)   42 (84.0%)  944 (83.9%)            
##     Cloudy                            19 (15.8%)   6 (12.0%)   121 (10.8%)            
##     Fog                               1 (0.83%)    1 (2.00%)    8 (0.71%)             
##     Other                             0 (0.00%)    0 (0.00%)    10 (0.89%)            
##     Rain                              2 (1.67%)    1 (2.00%)    42 (3.73%)            
## Light_Cond_ID:                                                                  .     
##     Dark, lighted                     5 (4.17%)    0 (0.00%)    37 (3.29%)            
##     Dark, not lighted                 16 (13.3%)   12 (24.0%)  126 (11.2%)            
##     Daylight                          90 (75.0%)   36 (72.0%)  914 (81.2%)            
##     Dusk                              6 (5.00%)    2 (4.00%)    31 (2.76%)            
##     Other                             3 (2.50%)    0 (0.00%)    17 (1.51%)            
## Road_Type_ID:                                                                   .     
##     2 lane, 2 way                     50 (41.7%)   26 (52.0%)  475 (42.2%)            
##     4 or more lanes, divided          27 (22.5%)   9 (18.0%)   149 (13.2%)            
##     4 or more lanes, undivided        11 (9.17%)   6 (12.0%)    79 (7.02%)            
##     Other                             0 (0.00%)    0 (0.00%)    2 (0.18%)             
##     Unknown                           32 (26.7%)   9 (18.0%)   420 (37.3%)            
## Road_Algn_ID:                                                                   .     
##     Curve, level                      6 (5.00%)    2 (4.00%)    54 (4.80%)            
##     Other                             2 (1.67%)    2 (4.00%)    40 (3.56%)            
##     Straight, grade                   24 (20.0%)   5 (10.0%)   106 (9.42%)            
##     Straight, hillcrest               4 (3.33%)    4 (8.00%)    39 (3.47%)            
##     Straight, level                   84 (70.0%)   37 (74.0%)  886 (78.8%)            
## SurfDry                              0.93 (0.25)  0.94 (0.24)  0.91 (0.28)    0.571   
## Traffic_Cntl_ID:                                                                .     
##     Center stripe/divider             32 (26.7%)   15 (30.0%)  259 (23.0%)            
##     Marked lanes                      39 (32.5%)   19 (38.0%)  331 (29.4%)            
##     No passing zone                   12 (10.0%)   6 (12.0%)    76 (6.76%)            
##     None                              21 (17.5%)   5 (10.0%)   270 (24.0%)            
##     Other                             16 (13.3%)   5 (10.0%)   189 (16.8%)            
## Harm_Evnt_ID:                                                                   .     
##     Fixed object                      10 (8.33%)   4 (8.00%)   130 (11.6%)            
##     Motor vehicle in transport        99 (82.5%)   40 (80.0%)  867 (77.1%)            
##     Other                             4 (3.33%)    4 (8.00%)    24 (2.13%)            
##     Overturned                        6 (5.00%)    1 (2.00%)    19 (1.69%)            
##     Parked car                        1 (0.83%)    1 (2.00%)    85 (7.56%)            
## Intrsct_Relat_ID:                                                               .     
##     Driveway access                   8 (6.67%)    3 (6.00%)   134 (11.9%)            
##     Intersection                      15 (12.5%)   5 (10.0%)   150 (13.3%)            
##     Intersection related              7 (5.83%)    1 (2.00%)   104 (9.24%)            
##     Non intersection                  90 (75.0%)   41 (82.0%)  737 (65.5%)            
## FHE_Collsn_ID:                                                               <0.001   
##     Omv vehicle going straight        18 (15.0%)   9 (18.0%)   232 (20.6%)            
##     Other                             24 (20.0%)   6 (12.0%)   367 (32.6%)            
##     Sd both going straight-rear end   60 (50.0%)   27 (54.0%)  225 (20.0%)            
##     Sd both going straight-sideswipe  9 (7.50%)    3 (6.00%)   147 (13.1%)            
##     Sd one straight-one left turn     9 (7.50%)    5 (10.0%)   154 (13.7%)            
## Road_Part_Adj_ID:                                                             0.151   
##     Exit/off ramp                     1 (0.83%)    1 (2.00%)    2 (0.18%)             
##     Main/proper lane                 110 (91.7%)   49 (98.0%)  1066 (94.8%)           
##     Other                             1 (0.83%)    0 (0.00%)    3 (0.27%)             
##     Other (explain in narrative)      4 (3.33%)    0 (0.00%)    26 (2.31%)            
##     Unknown                           4 (3.33%)    0 (0.00%)    28 (2.49%)            
## Road_Cls_ID:                                                                    .     
##     City street                       10 (8.33%)   4 (8.00%)   207 (18.4%)            
##     County road                       15 (12.5%)   4 (8.00%)   172 (15.3%)            
##     Farm to market                    43 (35.8%)   14 (28.0%)  317 (28.2%)            
##     Other                             13 (10.8%)   1 (2.00%)    75 (6.67%)            
##     Us & state highways               39 (32.5%)   27 (54.0%)  354 (31.5%)            
## Pop_Group_ID:                                                                   .     
##     10,000 - 24,999 pop               4 (3.33%)    3 (6.00%)    56 (4.98%)            
##     250,000 pop. And over             9 (7.50%)    2 (4.00%)   106 (9.42%)            
##     Other                             16 (13.3%)   4 (8.00%)   173 (15.4%)            
##     Rural                             87 (72.5%)   40 (80.0%)  739 (65.7%)            
##     Town under 2,499 pop.             4 (3.33%)    1 (2.00%)    51 (4.53%)            
## Crash_Speed_LimitCat:                                                           .     
##     > 70 mph                          19 (15.8%)   10 (20.0%)  174 (15.5%)            
##     30-40 mph                         20 (16.7%)   6 (12.0%)   310 (27.6%)            
##     45-60 mph                         45 (37.5%)   14 (28.0%)  412 (36.6%)            
##     65-70 mph                         28 (23.3%)   20 (40.0%)  175 (15.6%)            
##     Other                             8 (6.67%)    0 (0.00%)    54 (4.80%)            
## Veh_Body_Styl_ID: Farm equipment      120 (100%)   50 (100%)   1125 (100%)      .     
## Prsn_Ethnicity_ID:                                                              .     
##     Black                             9 (7.50%)    2 (4.00%)    52 (4.62%)            
##     Hispanic                          36 (30.0%)   12 (24.0%)  374 (33.2%)            
##     Other                             1 (0.83%)    0 (0.00%)    38 (3.38%)            
##     Unknown                           0 (0.00%)    0 (0.00%)    83 (7.38%)            
##     White                             74 (61.7%)   36 (72.0%)  578 (51.4%)            
## GenMale                              0.94 (0.24)  0.98 (0.14)  0.87 (0.33)    0.008   
## TrafVol                              15722 (8359) 13821 (8575) 14613 (8190)   0.280   
## Prsn_Age:                                                                       .     
##     15-24 years                       18 (15.0%)   5 (10.0%)   165 (14.7%)            
##     25-54 years                       56 (46.7%)   19 (38.0%)  497 (44.2%)            
##     55-64 years                       23 (19.2%)   8 (16.0%)   201 (17.9%)            
##     65-74 years                       14 (11.7%)   10 (20.0%)   94 (8.36%)            
##     Other                             9 (7.50%)    8 (16.0%)   168 (14.9%)            
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Visualization: Plots

Barplots

The data shows that most traffic incidents occurred in clear weather and daylight conditions. Two lane, two way roadways had the highest incident count, with marked lanes and center stripes being common traffic controls. Most harmful events involved motor vehicles in transport, mainly at non-intersections. Rear-end collisions and incidents on “US & State highways” and “Farm to market” roads were frequent. Rural areas accounted for the majority of incidents, with speed limits often in the 45-60 mph range.

# List of categorical variables and corresponding colors
plots <- list(
  list(var = "Wthr_Cond_ID", color = "#707D97", title = "Weather Conditions Distribution", xlab = "Weather Condition"),
  list(var = "Light_Cond_ID", color = "#D6DEE9", title = "Light Conditions Distribution", xlab = "Light Condition"),
  list(var = "Road_Type_ID", color = "#F7D3B9", title = "Road Type Distribution", xlab = "Road Type"),
  list(var = "Traffic_Cntl_ID", color = "#D8E3D9", title = "Traffic Control Distribution", xlab = "Traffic Control"),
  list(var = "Harm_Evnt_ID", color = "#FAD4D9", title = "Harm Event Distribution", xlab = "Harm Event"),
  list(var = "Intrsct_Relat_ID", color = "#E7D7E9", title = "Intersection Relation Distribution", xlab = "Intersection Relation"),
  list(var = "FHE_Collsn_ID", color = "#E6D5C9", title = "FHE Collision Distribution", xlab = "FHE Collision"),
  list(var = "Road_Cls_ID", color = "#E4E4E9", title = "Road Class Distribution", xlab = "Road Class"),
  list(var = "Pop_Group_ID", color = "#FAD4D9", title = "Population Group Distribution", xlab = "Population Group"),
  list(var = "Crash_Speed_LimitCat", color = "#D6DEE9", title = "Speed Limit Category Distribution", xlab = "Speed Limit Category"),
  list(var = "Prsn_Ethnicity_ID", color = "#F7E7C9", title = "Person Ethnicity Distribution", xlab = "Ethnicity"),
  list(var = "Prsn_Injry_Sev_ID", color = "#D8E3D9", title = "Person Injury Severity Distribution", xlab = "Injury Severity")
)

# Loop through each plot and generate it
for (plot_info in plots) {
  ggplot(data, aes_string(x = plot_info$var)) +
    geom_bar(fill = plot_info$color) +
    labs(title = plot_info$title, x = plot_info$xlab, y = "Count") +
    theme_minimal() -> p
  
  print(p)  # Print the plot
}

numeric_vars <- data %>% select_if(is.numeric)
numeric_vars %>% 
  gather(key = "variable", value = "value") %>% 
  ggplot(aes(x = value)) + 
  facet_wrap(~ variable, scales = "free") + 
  geom_histogram(bins = 30, fill = "#A9C0C2", color = "black") + 
  theme_minimal() + 
  labs(title = "Histograms of Numeric Variables")

Heatmaps and Boxplot

The heatmaps and boxplot highlight key relationships between crash severity and various factors. From the heatmaps comparing person injury severity with weather condition, clear weather shows the highest crash frequency, especially for non-injury (“O”) crashes, while fog, rain, and other conditions are less common. Lighting conditions reveal that most crashes occur during daylight, with fewer in dark, lighted or unlighted conditions. Road type and road alignment indicate that two-lane, two-way roads and straight-level alignments are associated with higher crash counts. Traffic control devices like marked lanes and center stripes have higher crash frequencies. In terms of the manner of crash, motor vehicle-in-transport leads to the most crashes. Additionally, the population group, speed limits, and person age factors show distinct trends: rural areas and higher speed limits correlate with more crashes, and individuals aged 25-54 experience the highest crash counts. The boxplot shows that traffic volume is relatively similar across injury severities, with only moderate variability.

# Load necessary libraries
library(ggplot2)

# List of variables to compare with Prsn_Injry_Sev_ID
# Assuming the data has both categorical and numerical variables
variables_to_compare <- c("Wthr_Cond_ID", "Light_Cond_ID", "Road_Type_ID", "Road_Algn_ID",
                          "Traffic_Cntl_ID", "Harm_Evnt_ID", "FHE_Collsn_ID", 
                           "Pop_Group_ID", "Crash_Speed_LimitCat", 
                           "TrafVol", "Prsn_Age")

# Loop through each variable
for (var in variables_to_compare) {
  
  # Check if the variable is categorical or numeric
  if (is.factor(data[[var]]) || is.character(data[[var]])) {
    
    # For categorical variables: Heatmap with cross-tabulation
    cross_tab <- table(data$Prsn_Injry_Sev_ID, data[[var]])
    cross_tab_df <- as.data.frame(cross_tab)
    
    # Plot the heatmap
    p <- ggplot(cross_tab_df, aes(Var2, Var1)) +
      geom_tile(aes(fill = Freq), color = "white") +
      scale_fill_gradient(low = "#D8E3D9", high = "#707D97") +
      labs(title = paste("Heatmap of Person Injury Severity vs", var), 
           x = var, y = "Person Injury Severity", fill = "Count") +
      theme_minimal()
    
    # Print the plot
    print(p)
    
  } else if (is.numeric(data[[var]])) {
    
    # For numeric variables: Boxplot comparing distribution across injury severity levels
    p <- ggplot(data, aes(x = Prsn_Injry_Sev_ID, y = data[[var]])) +
      geom_boxplot(fill = "#D8E3D9", color = "#707D97") +
      labs(title = paste("Boxplot of", var, "by Person Injury Severity"), 
           x = "Person Injury Severity", y = var) +
      theme_minimal()
    
    # Print the plot
    print(p)
  }
}

Density plots

The density plots show that crashes with no or minor injuries are more frequent on dry surfaces, with a higher density near dry conditions. Crashes with minor or no injuries tend to occur at lower traffic volumes, while severe crashes are concentrated at moderate traffic levels. Male involvement is consistently high across all injury severities, particularly in cases with no injury.

# Load necessary libraries
library(ggplot2)

# List of numeric variables to create density plots
numeric_variables <- c("SurfDry", "TrafVol", "GenMale")  # Assuming these are numeric

# Custom palette with lighter shades
lighter_palette <- c("#FFDAB9", "#E6E6FA", "#B0E0E6", "#FFB6C1", "#FFFACD", "#E0FFFF")

# Loop through each numeric variable and generate a density plot
for (var in numeric_variables) {
  
  # Create a density plot using ggplot2 with lighter shades
  p <- ggplot(data, aes(x = data[[var]], fill = Prsn_Injry_Sev_ID)) +
    geom_density(alpha = 0.6) +  # Use alpha to control transparency
    scale_fill_manual(values = lighter_palette) +  # Apply lighter color palette
    labs(title = paste("Density Plot of", var, "by Person Injury Severity"), 
         x = var, 
         fill = "Person Injury Severity") +
    theme_minimal()
  
  # Print the plot
  print(p)
}

## Stacked Bar Plots

The stacked bar charts provide insights into relationships between different traffic variables. The first chart indicates that most crashes occurred in clear weather and daylight conditions, while the second chart shows that two-lane, two-way roads had the highest incident count, especially on farm-to-market roads. The third chart highlights a predominant occurrence of crashes in rural areas, particularly among White individuals. Overall, non-intersection crashes involving motor vehicles in transport were the most frequent harmful events across different road classes and traffic control types.

# Load necessary libraries
library(ggplot2)

# List of pairs of character (categorical) variables for stacked bar charts
categorical_pairs <- list(
  c("Wthr_Cond_ID", "Light_Cond_ID"),        # Weather Condition, Light Condition
  c("Road_Type_ID", "Road_Cls_ID"),          # Road Type, Road Class
  c("Pop_Group_ID", "Prsn_Ethnicity_ID"),    # Population Group, Person Ethnicity
  c("Harm_Evnt_ID", "Intrsct_Relat_ID"),     # Harmful Event, Intersection Relation
  c("Traffic_Cntl_ID", "FHE_Collsn_ID")      # Traffic Control, First Harmful Event Collision
)

# Define a custom sober color palette (muted and neutral tones)
sober_palette <- c("#696969", "#EDE4C4", "#9EB0A6", "#E3C9A8", "#A9C0C2", "#BC8F8F")

# Loop through each pair of character variables and generate a stacked bar chart
for (pair in categorical_pairs) {
  var1 <- pair[1]
  var2 <- pair[2]
  
  # Create the stacked bar chart with sober colors
  p <- ggplot(data, aes_string(x = var1, fill = var2)) +
    geom_bar(position = "stack") +  # Stacked bars
    scale_fill_manual(values = sober_palette) +  # Apply sober color palette
    labs(title = paste("Stacked Bar Chart of", var1, "by", var2), 
         x = var1, 
         y = "Count", 
         fill = var2) +
    theme_minimal()
  
  # Print the plot
  print(p)
}

Contingency Matrix

The contingency coefficient matrix reveals several notable associations between categorical variables in the dataset. Vehicle body style shows a strong association with the road part adjacent (0.88), suggesting that specific vehicle types are more likely to be involved in crashes in certain areas of the road, such as the main lane or shoulder. Additionally, road class and road type (0.72) are closely linked, indicating that certain road types are common to specific road classes, such as highways or rural roads. Traffic control mechanisms have a moderate association with the type of harmful event (0.6), implying that the presence or absence of controls like signals influences the nature of crashes, such as rear-end collisions. Finally, the crash speed limit is moderately associated with the harmful event (0.51)

# Load necessary packages
library(vcd)         # For computing chi-square statistic
library(reshape2)    # For reshaping data for visualization
library(ggplot2)     # For plotting

# Sample data: Replace this with your actual dataset
# The character variables from your dataset
categorical_vars <- c("Wthr_Cond_ID", "Light_Cond_ID", "Road_Type_ID", "Road_Algn_ID", 
                      "Traffic_Cntl_ID", "Harm_Evnt_ID", "Intrsct_Relat_ID", "FHE_Collsn_ID", 
                      "Road_Part_Adj_ID", "Road_Cls_ID", "Pop_Group_ID", "Crash_Speed_LimitCat", 
                      "Veh_Body_Styl_ID", "Prsn_Ethnicity_ID", "Prsn_Age")

# Filter only the character columns from your actual dataset
data_categorical <- data[categorical_vars]

# Function to calculate Contingency Coefficient for a pair of categorical variables
contingency_coefficient <- function(x, y) {
  # Create a contingency table
  tbl <- table(x, y)
  
  # Perform chi-square test
  chi2_test <- chisq.test(tbl)
  
  # Calculate Contingency Coefficient
  N <- sum(tbl)  # Total number of observations
  chi2 <- chi2_test$statistic  # Chi-square statistic
  C <- sqrt(chi2 / (chi2 + N))  # Contingency Coefficient formula
  
  return(C)
}

# Create an empty matrix to store Contingency Coefficients
contingency_matrix <- matrix(NA, ncol = length(categorical_vars), nrow = length(categorical_vars))
colnames(contingency_matrix) <- categorical_vars
rownames(contingency_matrix) <- categorical_vars

# Loop through each pair of categorical variables and calculate Contingency Coefficients
for (i in 1:length(categorical_vars)) {
  for (j in 1:length(categorical_vars)) {
    if (i != j) {
      contingency_matrix[i, j] <- contingency_coefficient(data_categorical[[categorical_vars[i]]], 
                                                          data_categorical[[categorical_vars[j]]])
    }
  }
}

# Convert the matrix to a data frame for easier viewing and plotting
contingency_matrix_df <- as.data.frame(contingency_matrix)

# Melt the matrix for use in ggplot2, ensure row and column names are set correctly
contingency_matrix_melt <- melt(as.matrix(contingency_matrix_df), varnames = c("Var1", "Var2"))

# Plot the Contingency Coefficient matrix as a heatmap with rotated text
ggplot(data = contingency_matrix_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "#EDE4C4", high = "#696969", na.value = "white") +
  geom_text(aes(label = round(value, 2)), color = "Navy blue", size = 3) +
  labs(title = "Contingency Coefficient Matrix for Categorical Variables",
       x = "Categorical Variable 1",
       y = "Categorical Variable 2") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10),  # Rotate x-axis text to 90 degrees
    axis.text.y = element_text(size = 10),  # Set size for y-axis text
    plot.title = element_text(hjust = 0.5, size = 14)  # Center the title and increase its size
  )

Violin Plot

The violin plot illustrates traffic volume across different injury severities: BC , KA , and O. The mean traffic volume for BC is slightly higher (15,586) compared to KA (14,933) and O (14,451), but the differences are not statistically significant as indicated by the p-value of 0.38.

# Continuous variables you want to plot
continuous_vars <- c("TrafVol")

# Loop through each continuous variable and create violin plots
for (var in continuous_vars) {
  print(
    ggbetweenstats(
      data = data,
      x = Prsn_Injry_Sev_ID,
      y = !!sym(var),  # Use !!sym() to handle variable names dynamically
      type = "parametric",
      plot.type = "violin",
      title = paste("Violin plot of", var, "by Injury Severity")
    )
  )
}

## Point Plot

The visualizations illustrate the influence of various factors on person injury severity in traffic incidents. Overturned vehicles in dark, unlighted conditions are linked to more severe injuries (KA) as seen in the first plot. Certain FHE collisions, particularly during dark, not lighted conditions, also show higher severity (KA) in the second plot. Highways and straight road alignments are associated with greater injury severity, as indicated in the third plot. Finally, intersections, especially in divided lanes, tend to have more severe injuries, as shown in the fourth plot.

# Plot 1: Person Injury Severity vs. Light Conditions and Harmful Event
p1 <- ggplot(data, aes(x = Light_Cond_ID, y = Harm_Evnt_ID)) +
  geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
  scale_size_discrete(range = c(2, 10)) +  # Adjust point size range for categorical variable
  scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) +  # Set a manual gradient for categorical values
  labs(
    title = "Point Plot of Person Injury Severity by Light Conditions and Harmful Event",
    x = "Light Condition",
    y = "Harmful Event",
    size = "Person Injury Severity",
    color = "Person Injury Severity"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Plot 2: Person Injury Severity vs. FHE Collision and Light Conditions
p2 <- ggplot(data, aes(x = FHE_Collsn_ID, y = Light_Cond_ID)) +
  geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
  scale_size_discrete(range = c(2, 10)) +  # Adjust point size range for categorical variable
  scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) +  # Set a manual gradient for categorical values
  labs(
    title = "Point Plot of Person Injury Severity by FHE Collision and Light Conditions",
    x = "FHE Collision",
    y = "Light Condition",
    size = "Person Injury Severity",
    color = "Person Injury Severity"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Plot 3: Person Injury Severity vs. Road Class and Road Alignment
p3 <- ggplot(data, aes(x = Road_Cls_ID, y = Road_Algn_ID)) +
  geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
  scale_size_discrete(range = c(2, 10)) +  # Adjust point size range for categorical variable
  scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) +  # Set a manual gradient for categorical values
  labs(
    title = "Point Plot of Person Injury Severity by Road Class and Road Alignment",
    x = "Road Class",
    y = "Road Alignment",
    size = "Person Injury Severity",
    color = "Person Injury Severity"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Plot 4: Person Injury Severity vs. Intersection Relation and Road Type
p4 <- ggplot(data, aes(x = Intrsct_Relat_ID, y = Road_Type_ID)) +
  geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
  scale_size_discrete(range = c(2, 10)) +  # Adjust point size range for categorical variable
  scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) +  # Set a manual gradient for categorical values
  labs(
    title = "Point Plot of Person Injury Severity by Intersection Relation and Road Type",
    x = "Intersection Relation",
    y = "Road Type",
    size = "Person Injury Severity",
    color = "Person Injury Severity"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Display all four plots
print(p1)

print(p2)

print(p3)

print(p4)

Conclusion

The analysis of this traffic incident dataset reveals a significant influence of environmental, road, and individual factors on crash severity. It was observed that crashes occurred most frequently in clear weather and rural areas, with motor vehicles in transport identified as the most common harmful event. Traffic volume was found to vary across injury severities, and associations between categorical variables, such as vehicle type and road characteristics, were highlighted.