DSA406 Final Project

DSA406_001_SP25_FP_cisrael

Author

Chloe Israel

Published

April 16, 2025

Dataset Description and Motivation

Motivation

For my final project, I chose to explore a dataset of traffic accidents to gain a better understanding of their causes. Last year, my sister was in a car accident that left her with a broken leg. She wasn’t able to walk for a while and still has a limp. I have been in a small car accident myself and have friends who have as well. I’ve always been a careful driver, but after my sister’s accident, I’ve been extra vigilant. Sometimes, traffic accidents are truly just accidents, but I think many have causes that could’ve been avoided. I wanted to look at this data to gain some awareness of the conditions that cause traffic accidents and learn how to avoid them as best as possible.

Description

The dataset has over 200,000 observations of 24 variables. Its variable types include Character, Integer, and Numerical values. The dataset describes information about traffic accidents from 2016 to 2023. The variables describe aspects regarding the accident such as the conditions, type, and outcomes. Based on the initial inspection, “crash_date” stands out as a column of interest. It is stored as a character, but could be more useful as a datetime, so that may be a value to change in the future. The author states that the data for this dataset was “obtained from the internet”. Based on the information recorded and how it was recorded, I believe the data was likely obtained from police traffic incident reports or insurance reports.

Source: Kaggle.com, https://www.kaggle.com/datasets/oktayrdeki/traffic-accidents/data

Author/Owner: Oktay Ördekçi

CSV File: traffic_accidents.csv

Data Dictionary

Variable Name	Class/Data Type	Continuity	Description	Potential R Function(s)
crash_date	Character	Continuous	The date the accident occurred.	separate(), mutate()
traffic_control_device	Character	Categorical	The type of traffic control device involved (e.g., traffic light, sign)	group_by()
weather_condition	Character	Categorical	The weather conditions at the time of the accident	group_by()
lighting_condition	Character	Categorical	The lighting conditions at the time of the accident	group_by()
first_crash_type	Character	Categorical	The initial type of the crash (e.g., head-on, rear-end)	group_by()
trafficway_type	Character	Categorical	The type of roadway involved in the accident (e.g., highway, local road)	group_by()
alignment	Character	Categorical	The alignment of the road where the accident occurred (e.g., straight, curved)	group_by()
roadway_surface_cond	Character	Categorical	The condition of the roadway surface (e.g., dry, wet, icy)	group_by()
road_defect	Character	Categorical	Any defects present on the road surface	group_by()
crash_type	Character	Categorical	The overall type of the crash	group_by()
intersection_related_i	Character	Discrete	Whether the accident was related to an intersection	filter(), group_by()
damage	Character	Categorical	The extent of the damage caused by the accident	group_by()
prim_contributory_cause	Character	Categorical	The primary cause contributing to the crash	group_by()
num_units	Numerical: Integer	Discrete	The number of vehicles involved in the accident	filter()
most_severe_injury	Character	Categorical	The most severe injury sustained in the crash	group_by()
injuries_total	Numerical	Discrete	The total number of injuries reported	filter(), summary()
injuries_fatal	Numerical	Discrete	The number of fatal injuries resulting from the accident	filter(), summary()
injuries_incapacitating	Numerical	Discrete	The number of incapacitating injuries	filter(), summary()
injuries_non_incapacitating	Numerical	Discrete	The number of non-incapacitating injuries	filter(), summary()
injuries_reported_not_evident	Numerical	Discrete	The number of injuries reported but not visibly evident	filter(), summary()
injuries_no_indication	Numerical	Discrete	The number of cases with no indication of injury	filter(), summary()
crash_hour	Numerical: Integer	Discrete	The hour the accident occurred (0-23)	filter()
crash_day_of_week	Numerical: Integer	Discrete	The day of the week the accident occurred (1-7)	filter()
crash_month	Numerical: Integer	Discrete	The month the accident occurred	filter()

Ethical Considerations

Given that the data was collected from the internet, there could be some concerns regarding how the data was collected. Its source could be unreliable or inaccurate. There is also no indication of what areas or regions this data was collected from. Because of this, information that could impact the number of traffic accidents that occur, like population or the number of drivers in the area, cannot be considered in this analysis.

Initial Hypothesis

Traffic accidents resulting in severe injuries (fatal or incapacitating) occur more often during adverse weather conditions and poor lighting, suggesting a significant positive correlation between poor driving conditions and accident-related injuries.

Data Processing

# Load the needed libraries
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(lubridate)

Warning: package 'lubridate' was built under R version 4.4.2


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(scales)

# Read in the dataset
accidents_data_raw <- read.csv("data/traffic_accidents.csv")

Dataset Inspection

# Inspect the first few rows
head(accidents_data_raw)

              crash_date traffic_control_device weather_condition
1 07/29/2023 01:00:00 PM         TRAFFIC SIGNAL             CLEAR
2 08/13/2023 12:11:00 AM         TRAFFIC SIGNAL             CLEAR
3 12/09/2021 10:30:00 AM         TRAFFIC SIGNAL             CLEAR
4 08/09/2023 07:55:00 PM         TRAFFIC SIGNAL             CLEAR
5 08/19/2023 02:55:00 PM         TRAFFIC SIGNAL             CLEAR
6 09/06/2023 12:59:00 AM            NO CONTROLS              RAIN
      lighting_condition first_crash_type trafficway_type          alignment
1               DAYLIGHT          TURNING     NOT DIVIDED STRAIGHT AND LEVEL
2 DARKNESS, LIGHTED ROAD          TURNING        FOUR WAY STRAIGHT AND LEVEL
3               DAYLIGHT         REAR END  T-INTERSECTION STRAIGHT AND LEVEL
4               DAYLIGHT            ANGLE        FOUR WAY STRAIGHT AND LEVEL
5               DAYLIGHT         REAR END  T-INTERSECTION STRAIGHT AND LEVEL
6 DARKNESS, LIGHTED ROAD     FIXED OBJECT     NOT DIVIDED STRAIGHT AND LEVEL
  roadway_surface_cond road_defect                       crash_type
1              UNKNOWN     UNKNOWN           NO INJURY / DRIVE AWAY
2                  DRY  NO DEFECTS           NO INJURY / DRIVE AWAY
3                  DRY  NO DEFECTS           NO INJURY / DRIVE AWAY
4                  DRY  NO DEFECTS INJURY AND / OR TOW DUE TO CRASH
5              UNKNOWN     UNKNOWN           NO INJURY / DRIVE AWAY
6                  WET     UNKNOWN INJURY AND / OR TOW DUE TO CRASH
  intersection_related_i        damage             prim_contributory_cause
1                      Y $501 - $1,500                 UNABLE TO DETERMINE
2                      Y   OVER $1,500          IMPROPER TURNING/NO SIGNAL
3                      Y $501 - $1,500               FOLLOWING TOO CLOSELY
4                      Y   OVER $1,500                 UNABLE TO DETERMINE
5                      Y $501 - $1,500 DRIVING SKILLS/KNOWLEDGE/EXPERIENCE
6                      N $501 - $1,500                 UNABLE TO DETERMINE
  num_units       most_severe_injury injuries_total injuries_fatal
1         2  NO INDICATION OF INJURY              0              0
2         2  NO INDICATION OF INJURY              0              0
3         3  NO INDICATION OF INJURY              0              0
4         2 NONINCAPACITATING INJURY              5              0
5         2  NO INDICATION OF INJURY              0              0
6         1 NONINCAPACITATING INJURY              2              0
  injuries_incapacitating injuries_non_incapacitating
1                       0                           0
2                       0                           0
3                       0                           0
4                       0                           5
5                       0                           0
6                       0                           2
  injuries_reported_not_evident injuries_no_indication crash_hour
1                             0                      3         13
2                             0                      2          0
3                             0                      3         10
4                             0                      0         19
5                             0                      3         14
6                             0                      0          0
  crash_day_of_week crash_month
1                 7           7
2                 1           8
3                 5          12
4                 4           8
5                 7           8
6                 4           9

# Display the column names
names(accidents_data_raw)

 [1] "crash_date"                    "traffic_control_device"       
 [3] "weather_condition"             "lighting_condition"           
 [5] "first_crash_type"              "trafficway_type"              
 [7] "alignment"                     "roadway_surface_cond"         
 [9] "road_defect"                   "crash_type"                   
[11] "intersection_related_i"        "damage"                       
[13] "prim_contributory_cause"       "num_units"                    
[15] "most_severe_injury"            "injuries_total"               
[17] "injuries_fatal"                "injuries_incapacitating"      
[19] "injuries_non_incapacitating"   "injuries_reported_not_evident"
[21] "injuries_no_indication"        "crash_hour"                   
[23] "crash_day_of_week"             "crash_month"

# Inspect dataset's structure
str(accidents_data_raw)

'data.frame':   209306 obs. of  24 variables:
 $ crash_date                   : chr  "07/29/2023 01:00:00 PM" "08/13/2023 12:11:00 AM" "12/09/2021 10:30:00 AM" "08/09/2023 07:55:00 PM" ...
 $ traffic_control_device       : chr  "TRAFFIC SIGNAL" "TRAFFIC SIGNAL" "TRAFFIC SIGNAL" "TRAFFIC SIGNAL" ...
 $ weather_condition            : chr  "CLEAR" "CLEAR" "CLEAR" "CLEAR" ...
 $ lighting_condition           : chr  "DAYLIGHT" "DARKNESS, LIGHTED ROAD" "DAYLIGHT" "DAYLIGHT" ...
 $ first_crash_type             : chr  "TURNING" "TURNING" "REAR END" "ANGLE" ...
 $ trafficway_type              : chr  "NOT DIVIDED" "FOUR WAY" "T-INTERSECTION" "FOUR WAY" ...
 $ alignment                    : chr  "STRAIGHT AND LEVEL" "STRAIGHT AND LEVEL" "STRAIGHT AND LEVEL" "STRAIGHT AND LEVEL" ...
 $ roadway_surface_cond         : chr  "UNKNOWN" "DRY" "DRY" "DRY" ...
 $ road_defect                  : chr  "UNKNOWN" "NO DEFECTS" "NO DEFECTS" "NO DEFECTS" ...
 $ crash_type                   : chr  "NO INJURY / DRIVE AWAY" "NO INJURY / DRIVE AWAY" "NO INJURY / DRIVE AWAY" "INJURY AND / OR TOW DUE TO CRASH" ...
 $ intersection_related_i       : chr  "Y" "Y" "Y" "Y" ...
 $ damage                       : chr  "$501 - $1,500" "OVER $1,500" "$501 - $1,500" "OVER $1,500" ...
 $ prim_contributory_cause      : chr  "UNABLE TO DETERMINE" "IMPROPER TURNING/NO SIGNAL" "FOLLOWING TOO CLOSELY" "UNABLE TO DETERMINE" ...
 $ num_units                    : int  2 2 3 2 2 1 2 2 2 2 ...
 $ most_severe_injury           : chr  "NO INDICATION OF INJURY" "NO INDICATION OF INJURY" "NO INDICATION OF INJURY" "NONINCAPACITATING INJURY" ...
 $ injuries_total               : num  0 0 0 5 0 2 0 1 0 0 ...
 $ injuries_fatal               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ injuries_incapacitating      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ injuries_non_incapacitating  : num  0 0 0 5 0 2 0 1 0 0 ...
 $ injuries_reported_not_evident: num  0 0 0 0 0 0 0 0 0 0 ...
 $ injuries_no_indication       : num  3 2 3 0 3 0 2 1 3 4 ...
 $ crash_hour                   : int  13 0 10 19 14 0 11 14 18 17 ...
 $ crash_day_of_week            : int  7 1 5 4 7 4 3 4 2 5 ...
 $ crash_month                  : int  7 8 12 8 8 9 12 9 6 9 ...

Handling Missing Values

The initial inspection suggests this dataset has few or no missing values. To confirm, I will use the base R function is.na() to count the missing values in each column.

# Look for missing values in each column
print(colSums(is.na(accidents_data_raw)))

                   crash_date        traffic_control_device 
                            0                             0 
            weather_condition            lighting_condition 
                            0                             0 
             first_crash_type               trafficway_type 
                            0                             0 
                    alignment          roadway_surface_cond 
                            0                             0 
                  road_defect                    crash_type 
                            0                             0 
       intersection_related_i                        damage 
                            0                             0 
      prim_contributory_cause                     num_units 
                            0                             0 
           most_severe_injury                injuries_total 
                            0                             0 
               injuries_fatal       injuries_incapacitating 
                            0                             0 
  injuries_non_incapacitating injuries_reported_not_evident 
                            0                             0 
       injuries_no_indication                    crash_hour 
                            0                             0 
            crash_day_of_week                   crash_month 
                            0                             0

Every column reports zero NA values, so no missing-value handling is needed.
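One caveat is that is.na() only counts true NA values. The first rows printed earlier show placeholder strings such as "UNKNOWN" in roadway_surface_cond and road_defect, which behave like missing data but are not flagged. A minimal sketch of counting them follows. It uses a small made-up data frame (toy_accidents is hypothetical) so that it runs on its own, and treating "UNKNOWN" as the only placeholder is an assumption.

```r
# Hypothetical toy data that mimics two character columns from the dataset
toy_accidents <- data.frame(
  roadway_surface_cond = c("UNKNOWN", "DRY", "DRY", "WET"),
  road_defect = c("UNKNOWN", "NO DEFECTS", "NO DEFECTS", "UNKNOWN"),
  num_units = c(2L, 2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Keep only the character columns, then compute the share of "UNKNOWN" in each
is_chr <- vapply(toy_accidents, is.character, logical(1))
unknown_share <- vapply(
  toy_accidents[is_chr],
  function(col) mean(col == "UNKNOWN"),
  numeric(1)
)

unknown_share
```

On the real data, the same steps could be applied to accidents_data_raw to see how much of each categorical column is effectively unknown.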

Handling Outliers

Most of the data in this dataset is categorical, and the numerical variables describe the number of cars involved in the accident, the injuries associated with the accident, and when it occurred. Given the nature of the numerical characteristics of the dataset, I’ll check the following variables for any outliers: num_units, injuries_total, injuries_fatal, injuries_incapacitating, injuries_non_incapacitating, injuries_reported_not_evident, and injuries_no_indication.

# Print the summary statistics to the described variables
accidents_data_raw %>%
  select(num_units, injuries_total, injuries_fatal, injuries_incapacitating, injuries_non_incapacitating, injuries_reported_not_evident, injuries_no_indication) %>%
  summary()

   num_units      injuries_total    injuries_fatal     injuries_incapacitating
 Min.   : 1.000   Min.   : 0.0000   Min.   :0.000000   Min.   :0.0000         
 1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.:0.000000   1st Qu.:0.0000         
 Median : 2.000   Median : 0.0000   Median :0.000000   Median :0.0000         
 Mean   : 2.063   Mean   : 0.3827   Mean   :0.001858   Mean   :0.0381         
 3rd Qu.: 2.000   3rd Qu.: 1.0000   3rd Qu.:0.000000   3rd Qu.:0.0000         
 Max.   :11.000   Max.   :21.0000   Max.   :3.000000   Max.   :7.0000         
 injuries_non_incapacitating injuries_reported_not_evident
 Min.   : 0.0000             Min.   : 0.0000              
 1st Qu.: 0.0000             1st Qu.: 0.0000              
 Median : 0.0000             Median : 0.0000              
 Mean   : 0.2212             Mean   : 0.1215              
 3rd Qu.: 0.0000             3rd Qu.: 0.0000              
 Max.   :21.0000             Max.   :15.0000              
 injuries_no_indication
 Min.   : 0.000        
 1st Qu.: 2.000        
 Median : 2.000        
 Mean   : 2.244        
 3rd Qu.: 3.000        
 Max.   :49.000

# Construct boxplots for the variables
accidents_data_raw %>%
  select(num_units, injuries_total, injuries_fatal, injuries_incapacitating, injuries_non_incapacitating, injuries_reported_not_evident, injuries_no_indication) %>%
  tidyr::gather(metric, value) %>%
  ggplot(aes(x = metric, y = value)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Distribution of Numerical Traffic Accident Variables",
       x = "Metric",
       y = "Value") +
  theme(axis.text.x = element_text(angle = 45))

Looking at the box plot, it seems that all these variables have many outliers. However, for a dataset like this, that is not unusual. There is no “normal” result of a car accident. The number of cars involved or injuries suffered depends on many outside factors, some of which can’t be predicted. For this reason, I will be keeping the outliers in the dataset, as they could be important points of information. It’s possible that some of these variables could have a relationship to be explored. For example, the scatterplot below shows a weak positive correlation between the number of units involved in the accident and the total number of injuries.
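As a side note on why the boxplots flag so many points: a boxplot marks values beyond 1.5 times the interquartile range (IQR) from the quartiles. For injury counts whose first and third quartiles are both 0, as in the summary above, the IQR is 0 and every non-zero count is flagged. The sketch below illustrates this with a made-up vector (toy_injuries and count_iqr_outliers are hypothetical names, not part of the analysis).

```r
# Count values outside the 1.5 * IQR fences used by boxplots
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Hypothetical toy counts: mostly zeros, like the injury columns
toy_injuries <- c(rep(0, 16), 1, 1, 2, 5)

# Both quartiles are 0 here, so all four non-zero values are flagged
count_iqr_outliers(toy_injuries)
```

This supports keeping the flagged points: with zero-heavy counts, "outlier" mostly means "an accident in which someone was hurt".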

# Create a scatter plot
ggplot(accidents_data_raw, aes(x = injuries_total, y = num_units)) +
  geom_point(alpha = 0.5, color = "red") +
  theme_minimal() +
  labs(title = "Number of Units vs Total Injuries",
       subtitle = "Possible Positive Correlation Among Outliers") +
  geom_smooth(method = "lm", color = "blue")

`geom_smooth()` using formula = 'y ~ x'
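The strength of the relationship described above could also be checked numerically with cor(). The following is only a sketch on made-up values (toy_units and toy_injuries are hypothetical), so the numbers it prints say nothing about the real dataset. On the real data, the two columns would be num_units and injuries_total.

```r
# Hypothetical toy values standing in for num_units and injuries_total
toy_units <- c(1, 2, 2, 2, 3, 4, 2, 2)
toy_injuries <- c(0, 0, 1, 0, 2, 3, 0, 1)

# Pearson measures linear association; Spearman uses ranks and is less
# sensitive to the extreme values seen in the boxplots
pearson_r <- cor(toy_units, toy_injuries, method = "pearson")
spearman_rho <- cor(toy_units, toy_injuries, method = "spearman")

c(pearson = pearson_r, spearman = spearman_rho)
```

Reporting both coefficients is a design choice: if they differ a lot, a few extreme accidents are probably driving the fitted line.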

Transformations

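The numerical variables are counts, and the summary statistics above show they are heavily right-skewed (most medians are 0) rather than normally distributed. Because the analysis below relies on sums and ratios, which do not assume normality, no transformations are applied.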
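One conversion that could still be useful later is turning crash_date, which is stored as character, into a datetime, as mentioned in the Description. A minimal base R sketch is shown here using two strings copied from the head() output. The UTC time zone is an assumption because the dataset does not state one, and the %p marker assumes an English-language locale.

```r
# Sample strings copied from the crash_date column shown earlier
crash_date_chr <- c("07/29/2023 01:00:00 PM", "08/13/2023 12:11:00 AM")

# %I is the 12-hour clock and %p the AM/PM marker; the UTC time zone is an
# assumption, since the dataset does not state one
crash_datetime <- as.POSIXct(
  crash_date_chr,
  format = "%m/%d/%Y %I:%M:%S %p",
  tz = "UTC"
)

# Extract the hour to compare against the existing crash_hour column
as.integer(format(crash_datetime, "%H"))
```

The extracted hours can be compared with the crash_hour values for the same rows (13 and 0) as a quick sanity check of the format string.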

Exploratory Analysis and Visualization

Question 1: Of the total number of traffic accidents recorded, how many result in severe injury?

First, it is important to establish how often a traffic accident results in a severe injury. It’s likely that most do not, which is important to put into perspective to avoid spreading unnecessary fear. For this project, an injury is “severe” if it is incapacitating or fatal.

# Create a new data frame with a column that totals all the reports of injuries and no injuries (different from "injuries_total")
accidents_reported <- accidents_data_raw %>%
  mutate(
    total_injuries_reported = injuries_fatal + injuries_incapacitating + injuries_non_incapacitating +
      injuries_reported_not_evident + injuries_no_indication)

accidents_reported %>%
  select(total_injuries_reported, injuries_total) %>%
  slice(1:10)

   total_injuries_reported injuries_total
1                        3              0
2                        2              0
3                        3              0
4                        5              5
5                        3              0
6                        2              2
7                        2              0
8                        2              1
9                        3              0
10                       4              0

# Create a dataframe with the injury to report ratio for visualization
accident_injury_ratio <- data.frame(
  injury_types=c("Fatal Injuries", "Incapacitating Injuries", "Non-Incapacitating Injuries", "Reported, Not Evident", "None"),
  ratio=c(
    sum(accidents_reported$injuries_fatal) / sum(accidents_reported$total_injuries_reported),
    sum(accidents_reported$injuries_incapacitating) / sum(accidents_reported$total_injuries_reported),
    sum(accidents_reported$injuries_non_incapacitating) / sum(accidents_reported$total_injuries_reported),
    sum(accidents_reported$injuries_reported_not_evident) / sum(accidents_reported$total_injuries_reported),
    sum(accidents_reported$injuries_no_indication) / sum(accidents_reported$total_injuries_reported)
    )
)

accident_injury_ratio

                 injury_types        ratio
1              Fatal Injuries 0.0007075455
2     Incapacitating Injuries 0.0145055912
3 Non-Incapacitating Injuries 0.0842270111
4       Reported, Not Evident 0.0462614681
5                        None 0.8542983841

# Visualize the ratios
ggplot(accident_injury_ratio, aes(x = reorder(injury_types, -ratio), y = ratio, fill = injury_types)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(
    title = "Majority of Traffic Accidents Report No Injuries",
    x = "Injury Reported",
    y = "Ratio of Report to Total Reports"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma) + scale_fill_brewer(palette="Spectral")

It’s clear that the vast majority of reports (about 85%) show no indication of injury. That’s a reassuring thing to keep in mind while exploring this dataset. However, because this dataset is so large, even a small percentage of fatal injuries still amounts to hundreds of fatalities, and non-incapacitating injuries can still have horrible consequences. It’s also important to note that many traffic accidents are minor, like sideswipes or dings. It’s likely that certain types of accidents result in more serious consequences.

Question 2: How does the type of injury relate to the accident/crash type?

As stated above, the type of accident is an important factor in the possibility and seriousness of a resulting injury. Involvement of a pedestrian, a non-vehicle object, or the crash angle are all variables to consider.

# Find the top ten most common crash types
accidents_reported %>%
  group_by(first_crash_type) %>%
  summarise(reports_made = sum(total_injuries_reported)) %>%
  arrange(-reports_made)

# A tibble: 18 × 2
   first_crash_type             reports_made
   <chr>                               <dbl>
 1 TURNING                            173763
 2 ANGLE                              148323
 3 REAR END                           112464
 4 SIDESWIPE SAME DIRECTION            52284
 5 PEDESTRIAN                          19974
 6 PEDALCYCLIST                        11829
 7 PARKED MOTOR VEHICLE                 7265
 8 FIXED OBJECT                         6790
 9 HEAD ON                              5216
10 SIDESWIPE OPPOSITE DIRECTION         4908
11 REAR TO FRONT                        2913
12 REAR TO SIDE                         2059
13 OTHER OBJECT                         1242
14 OTHER NONCOLLISION                    366
15 OVERTURNED                            160
16 REAR TO REAR                          110
17 ANIMAL                                107
18 TRAIN                                  15

# List of the top 10 most common crash types reported
top_10_crash_types <- c("TURNING", "ANGLE", "REAR END", "SIDESWIPE SAME DIRECTION", "PEDESTRIAN", "PEDALCYCLIST", "PARKED MOTOR VEHICLE", "FIXED OBJECT", "HEAD ON", "SIDESWIPE OPPOSITE DIRECTION")

# Create a dataframe with the injury to report ratio for visualization
accident_injury_crash_type <- accidents_reported %>%
  group_by(first_crash_type) %>%
  summarise(injuries_fatal_ratio = sum(injuries_fatal) / sum(total_injuries_reported),
            injuries_no_indication_ratio = sum(injuries_no_indication) / sum(total_injuries_reported)) %>%
  filter(first_crash_type %in% top_10_crash_types)

accident_injury_crash_type

# A tibble: 10 × 3
   first_crash_type             injuries_fatal_ratio injuries_no_indication_ra…¹
   <chr>                                       <dbl>                       <dbl>
 1 ANGLE                                    0.000829                       0.804
 2 FIXED OBJECT                             0.00604                        0.758
 3 HEAD ON                                  0.000575                       0.756
 4 PARKED MOTOR VEHICLE                     0.000688                       0.926
 5 PEDALCYCLIST                             0.00127                        0.651
 6 PEDESTRIAN                               0.00476                        0.568
 7 REAR END                                 0.000116                       0.911
 8 SIDESWIPE OPPOSITE DIRECTION             0.000611                       0.892
 9 SIDESWIPE SAME DIRECTION                 0.000172                       0.962
10 TURNING                                  0.000432                       0.877
# ℹ abbreviated name: ¹injuries_no_indication_ratio

# Visualize the ratios for fatal crashes
ggplot(accident_injury_crash_type, aes(x = reorder(first_crash_type, -injuries_fatal_ratio), y = injuries_fatal_ratio, fill = first_crash_type)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(
    title = "Top 10 Crashes that Result in Fatal Injury",
    x = "Crash Type",
    y = "Ratio of Fatal Injuries to Total Reports"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 13, face = "bold"),
    axis.title = element_text(size = 11),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma) + scale_fill_brewer(palette="Spectral")

# Visualize the ratios for no injury crashes
ggplot(accident_injury_crash_type, aes(x = reorder(first_crash_type, -injuries_no_indication_ratio), y = injuries_no_indication_ratio, fill = first_crash_type)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(
    title = "Top 10 Crashes that Result in No Injury",
    x = "Crash Type",
    y = "Ratio of No-Injury Reports to Total Reports"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 11),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma) + scale_fill_brewer(palette="Spectral")

By looking at the ratio of each injury type to the total reports, we can get a better idea of which crash types lead to which outcomes. The crash types with the highest no-injury ratios are the more minor ones, such as sideswipes, dinging a parked car, and rear-end collisions, which makes sense. By contrast, the highest fatal-injury ratios belong to crashes involving fixed objects and parties who are not in cars (pedestrians and pedalcyclists), which also makes sense.
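The list of the ten most common crash types above was typed out by hand. It could instead be derived from the totals, so that it stays correct if the data changes. The sketch below shows the idea in base R on a five-row subset of the totals printed earlier. The names toy_reports and top_n_types are hypothetical helpers, not part of the original analysis.

```r
# A subset of the totals printed above, used here as toy input
toy_reports <- data.frame(
  first_crash_type = c("TURNING", "ANGLE", "TRAIN", "REAR END", "ANIMAL"),
  reports_made = c(173763, 148323, 15, 112464, 107),
  stringsAsFactors = FALSE
)

# Sort by report count, largest first, and keep the names of the top n rows
top_n_types <- function(df, n) {
  ordered <- df[order(df$reports_made, decreasing = TRUE), ]
  head(ordered$first_crash_type, n)
}

top_n_types(toy_reports, 3)
```

With the full summary table as input and n = 10, the result could be passed to the %in% filter in place of the hand-typed vector.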

Question 3: How does the likelihood of a severe injury differ across different accident conditions?

There are outside factors that can cause an accident, increase the likelihood of one, and impact any resulting injuries. While they can’t be completely avoided, it’s important to know which conditions are more dangerous in order to be a more aware driver. The main conditions I want to examine are weather and lighting, as I think they have the most impact on how a driver drives.

# Create a dataframe with the severe injury to report ratio, grouped by weather conditions
# Using the ratio to get more accurate results across the weather conditions
severe_injuries_weather <- accidents_reported %>%
  group_by(weather_condition) %>%
  summarise(
    injuries_severe = (sum(injuries_fatal) + sum(injuries_incapacitating)) / sum(total_injuries_reported),
    injuries_severe_sum = sum(injuries_fatal) + sum(injuries_incapacitating)
  ) %>%
  filter(injuries_severe > 0.01) # Keep conditions where severe injuries exceed 1% of reports

severe_injuries_weather

# A tibble: 8 × 3
  weather_condition      injuries_severe injuries_severe_sum
  <chr>                            <dbl>               <dbl>
1 BLOWING SNOW                    0.0151                   5
2 CLEAR                           0.0159                6903
3 CLOUDY/OVERCAST                 0.0132                 262
4 FREEZING RAIN/DRIZZLE           0.0168                  22
5 OTHER                           0.0211                  36
6 RAIN                            0.0147                 844
7 SEVERE CROSS WIND GATE          0.0278                   2
8 SLEET/HAIL                      0.0163                  13

# Repeat the above code for lighting conditions
severe_injuries_lighting <- accidents_reported %>%
  group_by(lighting_condition) %>%
  summarise(
    injuries_severe = (sum(injuries_fatal) + sum(injuries_incapacitating)) / sum(total_injuries_reported),
    injuries_severe_sum = sum(injuries_fatal) + sum(injuries_incapacitating)
  ) %>%
  filter(injuries_severe  > 0.01)

severe_injuries_lighting

# A tibble: 5 × 3
  lighting_condition     injuries_severe injuries_severe_sum
  <chr>                            <dbl>               <dbl>
1 DARKNESS                        0.0189                 366
2 DARKNESS, LIGHTED ROAD          0.0195                2819
3 DAWN                            0.0180                 168
4 DAYLIGHT                        0.0135                4712
5 DUSK                            0.0141                 238

# Visualize both conditions to compare
ggplot(severe_injuries_weather, aes(x = reorder(weather_condition, -injuries_severe), y = injuries_severe, fill = weather_condition)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(
    title = "Severe Injury Ratio by Weather Condition",
    x = "Weather Conditions",
    y = "Ratio of Severe Injuries to Total Reports"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 13, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma) + scale_fill_brewer(palette="Spectral")

ggplot(severe_injuries_lighting, aes(x = reorder(lighting_condition, -injuries_severe), y = injuries_severe, fill = lighting_condition)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(
    title = "Severe Injury Ratio Is Higher in Poor Lighting",
    x = "Lighting Conditions",
    y = "Ratio of Severe Injuries to Total Reports"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 13, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 55, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma) + scale_fill_brewer(palette="Spectral")

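As expected, the highest severe-injury ratios appear in conditions such as severe cross winds (0.0278), freezing rain (0.0168), and sleet or hail (0.0163), and in darkness or poor lighting. Two caveats apply to the weather results. Several of these ratios rest on very few severe injuries (only 2 for cross winds), and clear weather (0.0159) is close behind and actually higher than rain (0.0147), so the weather effect looks modest. Lighting shows a clearer pattern: darkness (0.0189), lighted roads at night (0.0195), and dawn (0.0180) all exceed daylight (0.0135). What’s interesting is that the severe-injury ratio is slightly higher on lighted roads than in complete darkness. This could indicate some carelessness when driving on lit roads or that some lighted roads are poorly lit.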
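Since the initial hypothesis speaks of a "significant" relationship, one way to check whether the gap between lighted roads at night and daylight is larger than chance would explain is a two-sample test of proportions. The sketch below uses the severe injury counts and ratios printed in the lighting table. The totals are reconstructed from rounded ratios, so they are approximations, and the test assumes reports are independent, which is a simplification because several people can be involved in one crash.

```r
# Severe injury counts and ratios as printed in the lighting table above
severe_counts <- c(lighted_dark = 2819, daylight = 4712)
severe_ratios <- c(lighted_dark = 0.0195, daylight = 0.0135)

# Approximate total reports per condition; the printed ratios are rounded,
# so these totals are estimates rather than exact values
approx_totals <- round(severe_counts / severe_ratios)

# Two-sample test of equal proportions (assumes independent reports)
lighting_test <- prop.test(x = severe_counts, n = approx_totals)

lighting_test$p.value
```

In a full analysis, the exact totals would come from summing total_injuries_reported within each lighting condition rather than from the rounded ratios.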

Question 4: How does the likelihood of a traffic-related injury change as the time of day changes?

Certain times of the day are busier, resulting in more traffic and, likely, more accidents. Examining injuries by hour and by month helps show how traffic accidents relate to time.

# Get the total number of injuries by hour of the day
accidents_hours <- accidents_reported %>%
  group_by(crash_hour) %>%
  summarise(
    injuries_total = sum(injuries_total)
  ) 

accidents_hours

# A tibble: 24 × 2
   crash_hour injuries_total
        <int>          <dbl>
 1          0           2411
 2          1           1992
 3          2           1784
 4          3           1270
 5          4           1012
 6          5           1215
 7          6           1898
 8          7           3395
 9          8           3976
10          9           3350
# ℹ 14 more rows

ggplot(accidents_hours, aes(x = crash_hour, y = injuries_total)) +
  geom_line(color = "#800", linewidth = 1.25) +
  labs(
    title = "More Accident-Related Injuries Occur During Rush Hour",
    x = "Time of Day (24 hr)",
    y = "Injuries Recorded"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 13, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 55, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma)

# Get the total number of injuries by month
accidents_months <- accidents_reported %>%
  group_by(crash_month) %>%
  summarise(
    injuries_total = sum(injuries_total)
  ) 

accidents_months

# A tibble: 12 × 2
   crash_month injuries_total
         <int>          <dbl>
 1           1           5719
 2           2           4639
 3           3           5525
 4           4           5777
 5           5           6954
 6           6           7230
 7           7           7585
 8           8           7491
 9           9           7729
10          10           7918
11          11           6843
12          12           6695

ggplot(accidents_months, aes(x = crash_month, y = injuries_total)) +
  geom_line(color = "#800", linewidth = 1.25) +
  labs(
    title = "More Accident-Related Injuries Occur During the Fall",
    x = "Month",
    y = "Injuries Recorded"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 13, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 55, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = scales::pretty_breaks())

As expected, the number of accident-related injuries peaks around 4-6 pm, during the evening rush hour, the busiest part of the day. There is also a clear peak around 8 am, another busy time when most people commute to work. What is surprising is that injuries are higher during the summer and fall, with a sharp increase from April to May (5,777 to 6,954) and a peak in October (7,918). There could be many reasons for this, such as weather, holidays, or whether school is in session.
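
Question 4 asks about likelihood, while the totals above count injuries. Busy hours may show more injuries simply because they have more crashes, so a per-crash rate can be a useful complement. The following is a minimal sketch of one way to compute it, assuming each row of the data is a single crash with the crash_hour and injuries_total columns used above. The injury_rate_by_hour() helper and the toy_crashes data frame are hypothetical names introduced for illustration and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: share of crashes in each hour with at least one injury.
# Assumes one row per crash, with crash_hour (0-23) and injuries_total columns.
injury_rate_by_hour <- function(data) {
  data %>%
    group_by(crash_hour) %>%
    summarise(
      n_crashes = n(),
      n_injury_crashes = sum(injuries_total > 0, na.rm = TRUE),
      injury_rate = n_injury_crashes / n_crashes,
      .groups = "drop"
    )
}

# Toy data for illustration only (not taken from the dataset)
toy_crashes <- data.frame(
  crash_hour = c(8, 8, 8, 8, 17, 17),
  injuries_total = c(0, 0, 1, 0, 2, 0)
)

injury_rate_by_hour(toy_crashes)
```

Applied to accidents_reported, a rate like this would show whether rush hour is riskier per crash or only busier.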

Hypothesis Generation

Hypothesis: More traffic-related injuries occur when more people are present on or near the road, with the most severe occurring during inclement weather, in poor lighting, and in crashes involving non-vehicle entities.

Based on the results of this exploration, a majority of reported traffic accidents result in no injuries, but those that do occur more often during the summer and fall and peak at the times of day when the roads are busiest. Of the traffic accidents that result in injuries, the deadliest occur during inclement weather, in poor lighting, and involve pedestrians or fixed objects more often than other vehicles. This is meaningful for stakeholders because accurately predicting when an accident is more likely can help people avoid one. This will help reduce the physical, emotional, and monetary tolls that traffic accidents can cause. Data such as this can also help insurance companies adjust their rates and quotes accordingly.

Useful additional data for testing this hypothesis would include location-specific information and vehicle type information (car, truck, sedan, etc.). Vehicle type information could help narrow the analysis to determine which vehicles are more dangerous. Location-specific information would help reveal patterns by region and support better conclusions about how to address this issue. If this hypothesis holds, we could take clear and meaningful steps to help prevent injuries and deaths caused by traffic accidents.

Stakeholder Communication

The purpose of this analysis was to investigate the factors that may correlate with traffic accidents that result in injuries. A key finding is that a majority of reported traffic accidents result in no injury at all. However, about 10% result in some injury, which translates to thousands of potentially life-altering injuries and hundreds of deaths. Drivers should be extra cautious around pedestrians, cyclists, and fixed objects, as these types of crashes result in the most fatalities. Drivers should also exercise caution during inclement weather and in poor lighting conditions. The analysis also indicates that the more dangerous times to be on the road are the busiest times, when people are rushing to and from work, and later in the year, when people travel much more for the holidays. These results show that driver stakeholders should take extra care when driving in any of these conditions. Insurance stakeholders should take these conditions into account when setting rates and paying out claims.

This analysis resulted in the following hypothesis: More traffic-related injuries occur when more people are present on or near the road, with the most severe occurring during inclement weather, in poor lighting, and in crashes involving non-vehicle entities. Next steps would be to investigate further by region to examine how these initial trends hold in different areas. Steps should also be taken to spread more awareness about driver safety, with a focus on reaching the public during inclement weather and rush hour. Education about pedestrian safety should also be prioritized. Not only will promoting safety reduce injuries and deaths, it could also save drivers hundreds of thousands of dollars each year.

# Preparation code for stakeholder visualization
# Get the total injuries for each month, grouped by month and the damage in $
stakeholder_viz <- accidents_reported %>%
  group_by(crash_month, damage) %>%
  summarise(
    injuries_total = sum(injuries_total)
  )

`summarise()` has grouped output by 'crash_month'. You can override using the
`.groups` argument.

stakeholder_viz

# A tibble: 36 × 3
# Groups:   crash_month [12]
   crash_month damage        injuries_total
         <int> <chr>                  <dbl>
 1           1 $500 OR LESS             630
 2           1 $501 - $1,500            498
 3           1 OVER $1,500             4591
 4           2 $500 OR LESS             529
 5           2 $501 - $1,500            369
 6           2 OVER $1,500             3741
 7           3 $500 OR LESS             505
 8           3 $501 - $1,500            416
 9           3 OVER $1,500             4604
10           4 $500 OR LESS             453
# ℹ 26 more rows

ggplot(stakeholder_viz, aes(x = crash_month, y = injuries_total, fill = damage)) +
  geom_area(stat = "identity") +
  labs(
    title = "Thousands of Traffic Injuries Mean Thousands of Dollars Gone Each Month",
    x = "Month",
    y = "Injuries Recorded"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 13, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 55, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_fill_brewer(palette="Spectral")
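
Both monthly charts place crash_month on a continuous axis, so tick marks may not line up with individual months. One optional refinement, sketched below, is to show one labelled tick per month. It assumes crash_month holds the integers 1 to 12, as in the tibble output above. The monthly_totals data frame simply re-enters the totals printed for accidents_months so the example runs on its own, and month_plot is a name chosen for illustration.

```r
library(ggplot2)

# Monthly injury totals copied from the accidents_months output in this report
monthly_totals <- data.frame(
  crash_month = 1:12,
  injuries_total = c(5719, 4639, 5525, 5777, 6954, 7230,
                     7585, 7491, 7729, 7918, 6843, 6695)
)

month_plot <- ggplot(monthly_totals, aes(x = crash_month, y = injuries_total)) +
  geom_line(color = "#880000", linewidth = 1.25) +
  # One tick per month; month.abb is base R's built-in vector "Jan" to "Dec"
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Month", y = "Injuries Recorded") +
  theme_minimal()

month_plot
```

The same scale_x_continuous() call could replace the pretty_breaks() line in the stacked area chart if month names are preferred there as well.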