Exploratory Data Analysis (EDA) by Swastika Barua
by Swastika Barua (qwx11)
Introduction
The dataset represents a detailed record of traffic incidents, capturing various environmental, road, vehicle, and individual-related factors. Each row corresponds to a unique traffic event, providing information about the conditions under which the incident occurred. It includes weather and lighting conditions, such as weather (Wthr_Cond_ID) and light conditions (Light_Cond_ID), road characteristics like road type (Road_Type_ID), road alignment (Road_Algn_ID), and surface conditions (SurfDry). Traffic control and event-related data include traffic control type (Traffic_Cntl_ID) and the harmful event (Harm_Evnt_ID), while intersection and collision information identifies whether the event occurred at an intersection (Intrsct_Relat_ID) and the type of first harmful event collision (FHE_Collsn_ID). Demographic and vehicle details encompass vehicle body style (Veh_Body_Styl_ID), ethnicity (Prsn_Ethnicity_ID), age group (Prsn_Age), and injury severity (Prsn_Injry_Sev_ID). Additionally, traffic volume (TrafVol) and speed limit categories (Crash_Speed_LimitCat) provide further context to the incidents.
Loading libraries
Data Information
The dataset, containing 1,295 rows and 19 variables, was examined for structure and content. The first few entries were viewed using head(), while summary() provided basic statistical details. glimpse() revealed that the dataset consists of 16 categorical variables (e.g., Wthr_Cond_ID, Road_Type_ID), 1 numeric variables (TrafVol), and 2 binary variables (SurfDry, GenMale). Missing data checks showed no missing values across any columns. The dataset includes variables relevant to traffic incidents, such as environmental conditions, road characteristics, vehicle types, and person-related factors.
## # A tibble: 6 × 19
## Wthr_Cond_ID Light_Cond_ID Road_Type_ID Road_Algn_ID SurfDry Traffic_Cntl_ID
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Clear Dark, not ligh… 2 lane, 2 w… Straight, l… 1 Marked lanes
## 2 Clear Dark, not ligh… 2 lane, 2 w… Straight, l… 1 Center stripe/…
## 3 Clear Daylight 2 lane, 2 w… Straight, l… 1 Marked lanes
## 4 Clear Daylight 2 lane, 2 w… Straight, l… 1 Center stripe/…
## 5 Clear Dark, not ligh… 2 lane, 2 w… Straight, g… 1 None
## 6 Clear Daylight Unknown Straight, l… 1 None
## # ℹ 13 more variables: Harm_Evnt_ID <chr>, Intrsct_Relat_ID <chr>,
## # FHE_Collsn_ID <chr>, Road_Part_Adj_ID <chr>, Road_Cls_ID <chr>,
## # Pop_Group_ID <chr>, Crash_Speed_LimitCat <chr>, Veh_Body_Styl_ID <chr>,
## # Prsn_Ethnicity_ID <chr>, GenMale <dbl>, TrafVol <dbl>, Prsn_Age <chr>,
## # Prsn_Injry_Sev_ID <chr>
## Wthr_Cond_ID Light_Cond_ID Road_Type_ID Road_Algn_ID
## Length:1295 Length:1295 Length:1295 Length:1295
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## SurfDry Traffic_Cntl_ID Harm_Evnt_ID Intrsct_Relat_ID
## Min. :0.0000 Length:1295 Length:1295 Length:1295
## 1st Qu.:1.0000 Class :character Class :character Class :character
## Median :1.0000 Mode :character Mode :character Mode :character
## Mean :0.9143
## 3rd Qu.:1.0000
## Max. :1.0000
## FHE_Collsn_ID Road_Part_Adj_ID Road_Cls_ID Pop_Group_ID
## Length:1295 Length:1295 Length:1295 Length:1295
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Crash_Speed_LimitCat Veh_Body_Styl_ID Prsn_Ethnicity_ID GenMale
## Length:1295 Length:1295 Length:1295 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:1.0000
## Mode :character Mode :character Mode :character Median :1.0000
## Mean :0.8842
## 3rd Qu.:1.0000
## Max. :1.0000
## TrafVol Prsn_Age Prsn_Injry_Sev_ID
## Min. : 204 Length:1295 Length:1295
## 1st Qu.: 7694 Class :character Class :character
## Median :14950 Mode :character Mode :character
## Mean :14685
## 3rd Qu.:21623
## Max. :28978
## Rows: 1,295
## Columns: 19
## $ Wthr_Cond_ID <chr> "Clear", "Clear", "Clear", "Clear", "Clear", "Cle…
## $ Light_Cond_ID <chr> "Dark, not lighted", "Dark, not lighted", "Daylig…
## $ Road_Type_ID <chr> "2 lane, 2 way", "2 lane, 2 way", "2 lane, 2 way"…
## $ Road_Algn_ID <chr> "Straight, level", "Straight, level", "Straight, …
## $ SurfDry <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Traffic_Cntl_ID <chr> "Marked lanes", "Center stripe/divider", "Marked …
## $ Harm_Evnt_ID <chr> "Motor vehicle in transport", "Motor vehicle in t…
## $ Intrsct_Relat_ID <chr> "Non intersection", "Non intersection", "Intersec…
## $ FHE_Collsn_ID <chr> "Sd both going straight-rear end", "Sd both going…
## $ Road_Part_Adj_ID <chr> "Main/proper lane", "Main/proper lane", "Main/pro…
## $ Road_Cls_ID <chr> "Farm to market", "Us & state highways", "Farm to…
## $ Pop_Group_ID <chr> "10,000 - 24,999 pop", "Rural", "Other", "Rural",…
## $ Crash_Speed_LimitCat <chr> "30-40 mph", "65-70 mph", "45-60 mph", "65-70 mph…
## $ Veh_Body_Styl_ID <chr> "Farm equipment", "Farm equipment", "Farm equipme…
## $ Prsn_Ethnicity_ID <chr> "White", "White", "White", "White", "Other", "Whi…
## $ GenMale <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1…
## $ TrafVol <dbl> 13469, 4050, 8280, 19658, 25249, 7490, 18919, 844…
## $ Prsn_Age <chr> "25-54 years", "25-54 years", "Other", "25-54 yea…
## $ Prsn_Injry_Sev_ID <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",…
## Wthr_Cond_ID Light_Cond_ID Road_Type_ID
## 0 0 0
## Road_Algn_ID SurfDry Traffic_Cntl_ID
## 0 0 0
## Harm_Evnt_ID Intrsct_Relat_ID FHE_Collsn_ID
## 0 0 0
## Road_Part_Adj_ID Road_Cls_ID Pop_Group_ID
## 0 0 0
## Crash_Speed_LimitCat Veh_Body_Styl_ID Prsn_Ethnicity_ID
## 0 0 0
## GenMale TrafVol Prsn_Age
## 0 0 0
## Prsn_Injry_Sev_ID
## 0
Descriptive Statistics
# Load necessary packages
library(compareGroups)
# Generate descriptive statistics for all variables across Prsn_Injry_Sev_ID (or another grouping variable)
res <- compareGroups(Prsn_Injry_Sev_ID ~ ., data = data)
# Display the descriptive statistics summary
summary_res <- createTable(res)
# Print the table to view the descriptive statistics
print(summary_res)##
## --------Summary descriptives table by 'Prsn_Injry_Sev_ID'---------
##
## _____________________________________________________________________________________
## BC KA O p.overall
## N=120 N=50 N=1125
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Wthr_Cond_ID: .
## Clear 98 (81.7%) 42 (84.0%) 944 (83.9%)
## Cloudy 19 (15.8%) 6 (12.0%) 121 (10.8%)
## Fog 1 (0.83%) 1 (2.00%) 8 (0.71%)
## Other 0 (0.00%) 0 (0.00%) 10 (0.89%)
## Rain 2 (1.67%) 1 (2.00%) 42 (3.73%)
## Light_Cond_ID: .
## Dark, lighted 5 (4.17%) 0 (0.00%) 37 (3.29%)
## Dark, not lighted 16 (13.3%) 12 (24.0%) 126 (11.2%)
## Daylight 90 (75.0%) 36 (72.0%) 914 (81.2%)
## Dusk 6 (5.00%) 2 (4.00%) 31 (2.76%)
## Other 3 (2.50%) 0 (0.00%) 17 (1.51%)
## Road_Type_ID: .
## 2 lane, 2 way 50 (41.7%) 26 (52.0%) 475 (42.2%)
## 4 or more lanes, divided 27 (22.5%) 9 (18.0%) 149 (13.2%)
## 4 or more lanes, undivided 11 (9.17%) 6 (12.0%) 79 (7.02%)
## Other 0 (0.00%) 0 (0.00%) 2 (0.18%)
## Unknown 32 (26.7%) 9 (18.0%) 420 (37.3%)
## Road_Algn_ID: .
## Curve, level 6 (5.00%) 2 (4.00%) 54 (4.80%)
## Other 2 (1.67%) 2 (4.00%) 40 (3.56%)
## Straight, grade 24 (20.0%) 5 (10.0%) 106 (9.42%)
## Straight, hillcrest 4 (3.33%) 4 (8.00%) 39 (3.47%)
## Straight, level 84 (70.0%) 37 (74.0%) 886 (78.8%)
## SurfDry 0.93 (0.25) 0.94 (0.24) 0.91 (0.28) 0.571
## Traffic_Cntl_ID: .
## Center stripe/divider 32 (26.7%) 15 (30.0%) 259 (23.0%)
## Marked lanes 39 (32.5%) 19 (38.0%) 331 (29.4%)
## No passing zone 12 (10.0%) 6 (12.0%) 76 (6.76%)
## None 21 (17.5%) 5 (10.0%) 270 (24.0%)
## Other 16 (13.3%) 5 (10.0%) 189 (16.8%)
## Harm_Evnt_ID: .
## Fixed object 10 (8.33%) 4 (8.00%) 130 (11.6%)
## Motor vehicle in transport 99 (82.5%) 40 (80.0%) 867 (77.1%)
## Other 4 (3.33%) 4 (8.00%) 24 (2.13%)
## Overturned 6 (5.00%) 1 (2.00%) 19 (1.69%)
## Parked car 1 (0.83%) 1 (2.00%) 85 (7.56%)
## Intrsct_Relat_ID: .
## Driveway access 8 (6.67%) 3 (6.00%) 134 (11.9%)
## Intersection 15 (12.5%) 5 (10.0%) 150 (13.3%)
## Intersection related 7 (5.83%) 1 (2.00%) 104 (9.24%)
## Non intersection 90 (75.0%) 41 (82.0%) 737 (65.5%)
## FHE_Collsn_ID: <0.001
## Omv vehicle going straight 18 (15.0%) 9 (18.0%) 232 (20.6%)
## Other 24 (20.0%) 6 (12.0%) 367 (32.6%)
## Sd both going straight-rear end 60 (50.0%) 27 (54.0%) 225 (20.0%)
## Sd both going straight-sideswipe 9 (7.50%) 3 (6.00%) 147 (13.1%)
## Sd one straight-one left turn 9 (7.50%) 5 (10.0%) 154 (13.7%)
## Road_Part_Adj_ID: 0.151
## Exit/off ramp 1 (0.83%) 1 (2.00%) 2 (0.18%)
## Main/proper lane 110 (91.7%) 49 (98.0%) 1066 (94.8%)
## Other 1 (0.83%) 0 (0.00%) 3 (0.27%)
## Other (explain in narrative) 4 (3.33%) 0 (0.00%) 26 (2.31%)
## Unknown 4 (3.33%) 0 (0.00%) 28 (2.49%)
## Road_Cls_ID: .
## City street 10 (8.33%) 4 (8.00%) 207 (18.4%)
## County road 15 (12.5%) 4 (8.00%) 172 (15.3%)
## Farm to market 43 (35.8%) 14 (28.0%) 317 (28.2%)
## Other 13 (10.8%) 1 (2.00%) 75 (6.67%)
## Us & state highways 39 (32.5%) 27 (54.0%) 354 (31.5%)
## Pop_Group_ID: .
## 10,000 - 24,999 pop 4 (3.33%) 3 (6.00%) 56 (4.98%)
## 250,000 pop. And over 9 (7.50%) 2 (4.00%) 106 (9.42%)
## Other 16 (13.3%) 4 (8.00%) 173 (15.4%)
## Rural 87 (72.5%) 40 (80.0%) 739 (65.7%)
## Town under 2,499 pop. 4 (3.33%) 1 (2.00%) 51 (4.53%)
## Crash_Speed_LimitCat: .
## > 70 mph 19 (15.8%) 10 (20.0%) 174 (15.5%)
## 30-40 mph 20 (16.7%) 6 (12.0%) 310 (27.6%)
## 45-60 mph 45 (37.5%) 14 (28.0%) 412 (36.6%)
## 65-70 mph 28 (23.3%) 20 (40.0%) 175 (15.6%)
## Other 8 (6.67%) 0 (0.00%) 54 (4.80%)
## Veh_Body_Styl_ID: Farm equipment 120 (100%) 50 (100%) 1125 (100%) .
## Prsn_Ethnicity_ID: .
## Black 9 (7.50%) 2 (4.00%) 52 (4.62%)
## Hispanic 36 (30.0%) 12 (24.0%) 374 (33.2%)
## Other 1 (0.83%) 0 (0.00%) 38 (3.38%)
## Unknown 0 (0.00%) 0 (0.00%) 83 (7.38%)
## White 74 (61.7%) 36 (72.0%) 578 (51.4%)
## GenMale 0.94 (0.24) 0.98 (0.14) 0.87 (0.33) 0.008
## TrafVol 15722 (8359) 13821 (8575) 14613 (8190) 0.280
## Prsn_Age: .
## 15-24 years 18 (15.0%) 5 (10.0%) 165 (14.7%)
## 25-54 years 56 (46.7%) 19 (38.0%) 497 (44.2%)
## 55-64 years 23 (19.2%) 8 (16.0%) 201 (17.9%)
## 65-74 years 14 (11.7%) 10 (20.0%) 94 (8.36%)
## Other 9 (7.50%) 8 (16.0%) 168 (14.9%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Visualization: Plots
Barplots
The data shows that most traffic incidents occurred in clear weather and daylight conditions. Two lane, two way roadways had the highest incident count, with marked lanes and center stripes being common traffic controls. Most harmful events involved motor vehicles in transport, mainly at non-intersections. Rear-end collisions and incidents on “US & State highways” and “Farm to market” roads were frequent. Rural areas accounted for the majority of incidents, with speed limits often in the 45-60 mph range.
# List of categorical variables and corresponding colors
plots <- list(
list(var = "Wthr_Cond_ID", color = "#707D97", title = "Weather Conditions Distribution", xlab = "Weather Condition"),
list(var = "Light_Cond_ID", color = "#D6DEE9", title = "Light Conditions Distribution", xlab = "Light Condition"),
list(var = "Road_Type_ID", color = "#F7D3B9", title = "Road Type Distribution", xlab = "Road Type"),
list(var = "Traffic_Cntl_ID", color = "#D8E3D9", title = "Traffic Control Distribution", xlab = "Traffic Control"),
list(var = "Harm_Evnt_ID", color = "#FAD4D9", title = "Harm Event Distribution", xlab = "Harm Event"),
list(var = "Intrsct_Relat_ID", color = "#E7D7E9", title = "Intersection Relation Distribution", xlab = "Intersection Relation"),
list(var = "FHE_Collsn_ID", color = "#E6D5C9", title = "FHE Collision Distribution", xlab = "FHE Collision"),
list(var = "Road_Cls_ID", color = "#E4E4E9", title = "Road Class Distribution", xlab = "Road Class"),
list(var = "Pop_Group_ID", color = "#FAD4D9", title = "Population Group Distribution", xlab = "Population Group"),
list(var = "Crash_Speed_LimitCat", color = "#D6DEE9", title = "Speed Limit Category Distribution", xlab = "Speed Limit Category"),
list(var = "Prsn_Ethnicity_ID", color = "#F7E7C9", title = "Person Ethnicity Distribution", xlab = "Ethnicity"),
list(var = "Prsn_Injry_Sev_ID", color = "#D8E3D9", title = "Person Injury Severity Distribution", xlab = "Injury Severity")
)
# Loop through each plot and generate it
for (plot_info in plots) {
ggplot(data, aes_string(x = plot_info$var)) +
geom_bar(fill = plot_info$color) +
labs(title = plot_info$title, x = plot_info$xlab, y = "Count") +
theme_minimal() -> p
print(p) # Print the plot
}numeric_vars <- data %>% select_if(is.numeric)
numeric_vars %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = value)) +
facet_wrap(~ variable, scales = "free") +
geom_histogram(bins = 30, fill = "#A9C0C2", color = "black") +
theme_minimal() +
labs(title = "Histograms of Numeric Variables")Heatmaps and Boxplot
The heatmaps and boxplot highlight key relationships between crash severity and various factors. From the heatmaps comparing person injury severity with weather condition, clear weather shows the highest crash frequency, especially for non-injury (“O”) crashes, while fog, rain, and other conditions are less common. Lighting conditions reveal that most crashes occur during daylight, with fewer in dark, lighted or unlighted conditions. Road type and road alignment indicate that two-lane, two-way roads and straight-level alignments are associated with higher crash counts. Traffic control devices like marked lanes and center stripes have higher crash frequencies. In terms of the manner of crash, motor vehicle-in-transport leads to the most crashes. Additionally, the population group, speed limits, and person age factors show distinct trends: rural areas and higher speed limits correlate with more crashes, and individuals aged 25-54 experience the highest crash counts. The boxplot shows that traffic volume is relatively similar across injury severities, with only moderate variability.
# Load necessary libraries
library(ggplot2)
# List of variables to compare with Prsn_Injry_Sev_ID
# Assuming the data has both categorical and numerical variables
variables_to_compare <- c("Wthr_Cond_ID", "Light_Cond_ID", "Road_Type_ID", "Road_Algn_ID",
"Traffic_Cntl_ID", "Harm_Evnt_ID", "FHE_Collsn_ID",
"Pop_Group_ID", "Crash_Speed_LimitCat",
"TrafVol", "Prsn_Age")
# Loop through each variable
for (var in variables_to_compare) {
# Check if the variable is categorical or numeric
if (is.factor(data[[var]]) || is.character(data[[var]])) {
# For categorical variables: Heatmap with cross-tabulation
cross_tab <- table(data$Prsn_Injry_Sev_ID, data[[var]])
cross_tab_df <- as.data.frame(cross_tab)
# Plot the heatmap
p <- ggplot(cross_tab_df, aes(Var2, Var1)) +
geom_tile(aes(fill = Freq), color = "white") +
scale_fill_gradient(low = "#D8E3D9", high = "#707D97") +
labs(title = paste("Heatmap of Person Injury Severity vs", var),
x = var, y = "Person Injury Severity", fill = "Count") +
theme_minimal()
# Print the plot
print(p)
} else if (is.numeric(data[[var]])) {
# For numeric variables: Boxplot comparing distribution across injury severity levels
p <- ggplot(data, aes(x = Prsn_Injry_Sev_ID, y = data[[var]])) +
geom_boxplot(fill = "#D8E3D9", color = "#707D97") +
labs(title = paste("Boxplot of", var, "by Person Injury Severity"),
x = "Person Injury Severity", y = var) +
theme_minimal()
# Print the plot
print(p)
}
}Density plots
The density plots show that crashes with no or minor injuries are more frequent on dry surfaces, with a higher density near dry conditions. Crashes with minor or no injuries tend to occur at lower traffic volumes, while severe crashes are concentrated at moderate traffic levels. Male involvement is consistently high across all injury severities, particularly in cases with no injury.
# Load necessary libraries
library(ggplot2)
# List of numeric variables to create density plots
numeric_variables <- c("SurfDry", "TrafVol", "GenMale") # Assuming these are numeric
# Custom palette with lighter shades
lighter_palette <- c("#FFDAB9", "#E6E6FA", "#B0E0E6", "#FFB6C1", "#FFFACD", "#E0FFFF")
# Loop through each numeric variable and generate a density plot
for (var in numeric_variables) {
# Create a density plot using ggplot2 with lighter shades
p <- ggplot(data, aes(x = data[[var]], fill = Prsn_Injry_Sev_ID)) +
geom_density(alpha = 0.6) + # Use alpha to control transparency
scale_fill_manual(values = lighter_palette) + # Apply lighter color palette
labs(title = paste("Density Plot of", var, "by Person Injury Severity"),
x = var,
fill = "Person Injury Severity") +
theme_minimal()
# Print the plot
print(p)
}
## Stacked Bar Plots
The stacked bar charts provide insights into relationships between different traffic variables. The first chart indicates that most crashes occurred in clear weather and daylight conditions, while the second chart shows that two-lane, two-way roads had the highest incident count, especially on farm-to-market roads. The third chart highlights a predominant occurrence of crashes in rural areas, particularly among White individuals. Overall, non-intersection crashes involving motor vehicles in transport were the most frequent harmful events across different road classes and traffic control types.
# Load necessary libraries
library(ggplot2)
# List of pairs of character (categorical) variables for stacked bar charts
categorical_pairs <- list(
c("Wthr_Cond_ID", "Light_Cond_ID"), # Weather Condition, Light Condition
c("Road_Type_ID", "Road_Cls_ID"), # Road Type, Road Class
c("Pop_Group_ID", "Prsn_Ethnicity_ID"), # Population Group, Person Ethnicity
c("Harm_Evnt_ID", "Intrsct_Relat_ID"), # Harmful Event, Intersection Relation
c("Traffic_Cntl_ID", "FHE_Collsn_ID") # Traffic Control, First Harmful Event Collision
)
# Define a custom sober color palette (muted and neutral tones)
sober_palette <- c("#696969", "#EDE4C4", "#9EB0A6", "#E3C9A8", "#A9C0C2", "#BC8F8F")
# Loop through each pair of character variables and generate a stacked bar chart
for (pair in categorical_pairs) {
var1 <- pair[1]
var2 <- pair[2]
# Create the stacked bar chart with sober colors
p <- ggplot(data, aes_string(x = var1, fill = var2)) +
geom_bar(position = "stack") + # Stacked bars
scale_fill_manual(values = sober_palette) + # Apply sober color palette
labs(title = paste("Stacked Bar Chart of", var1, "by", var2),
x = var1,
y = "Count",
fill = var2) +
theme_minimal()
# Print the plot
print(p)
}Contingency Matrix
The contingency coefficient matrix reveals several notable associations between categorical variables in the dataset. Vehicle body style shows a strong association with the road part adjacent (0.88), suggesting that specific vehicle types are more likely to be involved in crashes in certain areas of the road, such as the main lane or shoulder. Additionally, road class and road type (0.72) are closely linked, indicating that certain road types are common to specific road classes, such as highways or rural roads. Traffic control mechanisms have a moderate association with the type of harmful event (0.6), implying that the presence or absence of controls like signals influences the nature of crashes, such as rear-end collisions. Finally, the crash speed limit is moderately associated with the harmful event (0.51)
# Load necessary packages
library(vcd) # For computing chi-square statistic
library(reshape2) # For reshaping data for visualization
library(ggplot2) # For plotting
# Sample data: Replace this with your actual dataset
# The character variables from your dataset
categorical_vars <- c("Wthr_Cond_ID", "Light_Cond_ID", "Road_Type_ID", "Road_Algn_ID",
"Traffic_Cntl_ID", "Harm_Evnt_ID", "Intrsct_Relat_ID", "FHE_Collsn_ID",
"Road_Part_Adj_ID", "Road_Cls_ID", "Pop_Group_ID", "Crash_Speed_LimitCat",
"Veh_Body_Styl_ID", "Prsn_Ethnicity_ID", "Prsn_Age")
# Filter only the character columns from your actual dataset
data_categorical <- data[categorical_vars]
# Function to calculate Contingency Coefficient for a pair of categorical variables
contingency_coefficient <- function(x, y) {
# Create a contingency table
tbl <- table(x, y)
# Perform chi-square test
chi2_test <- chisq.test(tbl)
# Calculate Contingency Coefficient
N <- sum(tbl) # Total number of observations
chi2 <- chi2_test$statistic # Chi-square statistic
C <- sqrt(chi2 / (chi2 + N)) # Contingency Coefficient formula
return(C)
}
# Create an empty matrix to store Contingency Coefficients
contingency_matrix <- matrix(NA, ncol = length(categorical_vars), nrow = length(categorical_vars))
colnames(contingency_matrix) <- categorical_vars
rownames(contingency_matrix) <- categorical_vars
# Loop through each pair of categorical variables and calculate Contingency Coefficients
for (i in 1:length(categorical_vars)) {
for (j in 1:length(categorical_vars)) {
if (i != j) {
contingency_matrix[i, j] <- contingency_coefficient(data_categorical[[categorical_vars[i]]],
data_categorical[[categorical_vars[j]]])
}
}
}
# Convert the matrix to a data frame for easier viewing and plotting
contingency_matrix_df <- as.data.frame(contingency_matrix)
# Melt the matrix for use in ggplot2, ensure row and column names are set correctly
contingency_matrix_melt <- melt(as.matrix(contingency_matrix_df), varnames = c("Var1", "Var2"))
# Plot the Contingency Coefficient matrix as a heatmap with rotated text
ggplot(data = contingency_matrix_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "#EDE4C4", high = "#696969", na.value = "white") +
geom_text(aes(label = round(value, 2)), color = "Navy blue", size = 3) +
labs(title = "Contingency Coefficient Matrix for Categorical Variables",
x = "Categorical Variable 1",
y = "Categorical Variable 2") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10), # Rotate x-axis text to 90 degrees
axis.text.y = element_text(size = 10), # Set size for y-axis text
plot.title = element_text(hjust = 0.5, size = 14) # Center the title and increase its size
)Violin Plot
The violin plot illustrates traffic volume across different injury severities: BC , KA , and O. The mean traffic volume for BC is slightly higher (15,586) compared to KA (14,933) and O (14,451), but the differences are not statistically significant as indicated by the p-value of 0.38.
# Continuous variables you want to plot
continuous_vars <- c("TrafVol")
# Loop through each continuous variable and create violin plots
for (var in continuous_vars) {
print(
ggbetweenstats(
data = data,
x = Prsn_Injry_Sev_ID,
y = !!sym(var), # Use !!sym() to handle variable names dynamically
type = "parametric",
plot.type = "violin",
title = paste("Violin plot of", var, "by Injury Severity")
)
)
}
## Point Plot
The visualizations illustrate the influence of various factors on person injury severity in traffic incidents. Overturned vehicles in dark, unlighted conditions are linked to more severe injuries (KA) as seen in the first plot. Certain FHE collisions, particularly during dark, not lighted conditions, also show higher severity (KA) in the second plot. Highways and straight road alignments are associated with greater injury severity, as indicated in the third plot. Finally, intersections, especially in divided lanes, tend to have more severe injuries, as shown in the fourth plot.
# Plot 1: Person Injury Severity vs. Light Conditions and Harmful Event
p1 <- ggplot(data, aes(x = Light_Cond_ID, y = Harm_Evnt_ID)) +
geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
scale_size_discrete(range = c(2, 10)) + # Adjust point size range for categorical variable
scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) + # Set a manual gradient for categorical values
labs(
title = "Point Plot of Person Injury Severity by Light Conditions and Harmful Event",
x = "Light Condition",
y = "Harmful Event",
size = "Person Injury Severity",
color = "Person Injury Severity"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Plot 2: Person Injury Severity vs. FHE Collision and Light Conditions
p2 <- ggplot(data, aes(x = FHE_Collsn_ID, y = Light_Cond_ID)) +
geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
scale_size_discrete(range = c(2, 10)) + # Adjust point size range for categorical variable
scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) + # Set a manual gradient for categorical values
labs(
title = "Point Plot of Person Injury Severity by FHE Collision and Light Conditions",
x = "FHE Collision",
y = "Light Condition",
size = "Person Injury Severity",
color = "Person Injury Severity"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Plot 3: Person Injury Severity vs. Road Class and Road Alignment
p3 <- ggplot(data, aes(x = Road_Cls_ID, y = Road_Algn_ID)) +
geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
scale_size_discrete(range = c(2, 10)) + # Adjust point size range for categorical variable
scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) + # Set a manual gradient for categorical values
labs(
title = "Point Plot of Person Injury Severity by Road Class and Road Alignment",
x = "Road Class",
y = "Road Alignment",
size = "Person Injury Severity",
color = "Person Injury Severity"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Plot 4: Person Injury Severity vs. Intersection Relation and Road Type
p4 <- ggplot(data, aes(x = Intrsct_Relat_ID, y = Road_Type_ID)) +
geom_point(aes(size = Prsn_Injry_Sev_ID, color = Prsn_Injry_Sev_ID), alpha = 0.6) +
scale_size_discrete(range = c(2, 10)) + # Adjust point size range for categorical variable
scale_color_manual(values = c("#000080", "#1E90FF", "#ADD8E6")) + # Set a manual gradient for categorical values
labs(
title = "Point Plot of Person Injury Severity by Intersection Relation and Road Type",
x = "Intersection Relation",
y = "Road Type",
size = "Person Injury Severity",
color = "Person Injury Severity"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Display all four plots
print(p1)Conclusion
The analysis of this traffic incident dataset reveals a significant influence of environmental, road, and individual factors on crash severity. It was observed that crashes occurred most frequently in clear weather and rural areas, with motor vehicles in transport identified as the most common harmful event. Traffic volume was found to vary across injury severities, and associations between categorical variables, such as vehicle type and road characteristics, were highlighted.