Our dataset has 500,000 rows and 46 columns. Most of the data is stored as either character or numeric, although some of the character columns could be treated as Boolean. We have a unique identifier called accident_Index. It is quite long, but with 300,000 rows it is probably best to keep it rather than create our own key.
This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks.
What are your motivations for exploring this dataset?
Although this subject was not my first choice, I haven’t been able to find something that works for what I originally wanted to do. As for this subject, I have been in a car accident before, and it shook me up for a while. I won’t say that it was hard to drive afterward, but it definitely sits in the back of my mind when I drive. Being able to understand the common causes of car accidents could help ease my mind.
What questions do you want to answer? (broad)
What attributes tend to be most associated with severe car accidents?
Hypothesis
Severe car accidents are more common at rainy junctions than in any other situation.
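One way this hypothesis could be checked is to compare the share of severe accidents across every combination of rain and junction. The sketch below uses a small made-up data frame in place of the real data. Treating any Weather_Condition containing "rain" as rainy, and Severity 3 or 4 as "severe", are assumptions made for illustration only.

```r
# Toy data standing in for the accident table (values are made up)
toy <- data.frame(
  Severity = c(2, 3, 4, 2, 2, 3, 2, 4),
  Weather_Condition = c("Light Rain", "Rain", "Fair", "Fair",
                        "Heavy Rain", "Cloudy", "Fair", "Rain"),
  Junction = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Assumption: any condition whose label contains "rain" counts as rainy
toy$Rainy <- grepl("rain", toy$Weather_Condition, ignore.case = TRUE)
# Assumption: Severity 3 or 4 counts as "severe"
toy$Severe <- toy$Severity >= 3

# Share of severe accidents in each Rainy x Junction combination
severe_share <- aggregate(Severe ~ Rainy + Junction, data = toy, FUN = mean)
print(severe_share)
```

If the hypothesis holds on the real data, the row with Rainy and Junction both TRUE should show the largest share. Comparing shares rather than raw counts keeps the most common situations from dominating just because they occur more often.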
Biases
One bias I may have comes from the fact that I have been in a car accident, so I may lean toward explanations that match my own situation. I also bring a prior understanding of driving: I already believe that rain and intersections create volatile driving situations.
Data Dictionary
A data dictionary serves as a comprehensive guide to understanding the structure and attributes of a dataset. The data dictionary for this dataset is given below:
| Variable Name | Data Type | Description |
|---|---|---|
| ID | String | Unique identifier for the accident record. |
| Source | String | Origin of the raw accident data. |
| Severity | Integer | Severity level of the accident, indicating its impact on traffic. |
| Start_Time | DateTime | Start time of the accident in local time zone. |
| End_Time | DateTime | End time when the accident’s impact on traffic was dismissed. |
| Start_Lat | Float | Latitude of the accident’s start point. |
| Start_Lng | Float | Longitude of the accident’s start point. |
| End_Lat | Float | Latitude of the accident’s end point. |
| End_Lng | Float | Longitude of the accident’s end point. |
| Distance(mi) | Float | Length of the road extent affected by the accident. |
| Description | String | Human-provided description of the accident. |
| Street | String | Street name where the accident occurred. |
| City | String | City where the accident occurred. |
| County | String | County where the accident occurred. |
| State | String | State where the accident occurred. |
| Zipcode | String | Zip code of the accident location. |
| Country | String | Country where the accident occurred. |
| Timezone | String | Timezone based on the accident’s location. |
| Airport_Code | String | Closest airport-based weather station to the accident location. |
| Weather_Timestamp | DateTime | Timestamp of the weather observation record in local time. |
| Temperature(F) | Float | Temperature at the time of the accident. |
| Wind_Chill(F) | Float | Wind chill at the time of the accident. |
| Humidity(%) | Float | Humidity percentage at the time of the accident. |
| Pressure(in) | Float | Atmospheric pressure at the time of the accident. |
| Visibility(mi) | Float | Visibility distance at the time of the accident. |
| Wind_Direction | String | Direction from which the wind was blowing. |
| Wind_Speed(mph) | Float | Wind speed at the time of the accident. |
| Precipitation(in) | Float | Precipitation amount at the time of the accident. |
| Weather_Condition | String | Weather condition during the accident. |
| Amenity | Boolean | Presence of an amenity near the accident location. |
| Bump | Boolean | Presence of a speed bump or hump near the accident location. |
| Crossing | Boolean | Presence of a crossing near the accident location. |
| Give_Way | Boolean | Presence of a give way sign near the accident location. |
| Junction | Boolean | Presence of a junction near the accident location. |
| No_Exit | Boolean | Presence of a no exit sign near the accident location. |
| Railway | Boolean | Presence of a railway near the accident location. |
| Roundabout | Boolean | Presence of a roundabout near the accident location. |
| Station | Boolean | Presence of a station near the accident location. |
| Stop | Boolean | Presence of a stop sign near the accident location. |
| Traffic_Calming | Boolean | Presence of traffic calming measures near the accident location. |
| Traffic_Signal | Boolean | Presence of a traffic signal near the accident location. |
| Turning_Loop | Boolean | Presence of a turning loop near the accident location. |
| Sunrise_Sunset | String | Period of the day based on sunrise/sunset. |
| Civil_Twilight | String | Period of the day based on civil twilight. |
| Nautical_Twilight | String | Period of the day based on nautical twilight. |
| Astronomical_Twilight | String | Period of the day based on astronomical twilight. |
Data Cleaning
Checking for null values
# Count the total NA values in each column
na_count_per_column <- colSums(is.na(data))
# Print only the columns that contain NA values
print(na_count_per_column[na_count_per_column > 0])
Given these results, I am going to delete the End_Lat, End_Lng, and Wind_Chill.F. columns, and then delete every row in which any of the remaining variables is null. I will still have over 300,000 rows of data.
# Delete unnecessary columns
data <- subset(data, select = -c(End_Lat, End_Lng, Wind_Chill.F.))
# Omit all rows with NA values
data <- na.omit(data)
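The order of these two steps matters, because na.omit() drops a row if any column in it is missing. A minimal sketch on a made-up data frame (the values and the choice of columns are illustrative assumptions, not the real data) shows how many rows survive when the mostly empty column is removed first, compared with leaving it in:

```r
# Toy data frame with missing values (illustrative only)
toy <- data.frame(
  Severity = c(2, 3, 2, 4, 2),
  End_Lat = c(NA, NA, 40.1, NA, NA),
  Temperature.F. = c(70, NA, 55, 61, 48)
)

# Share of missing values per column, to decide which columns to drop
na_share <- colMeans(is.na(toy))
print(round(na_share, 2))

# Drop the mostly empty column first, then drop rows with any remaining NA
cleaned <- na.omit(subset(toy, select = -End_Lat))

# For comparison: dropping rows without removing the column first
rows_if_column_kept <- nrow(na.omit(toy))

cat("Rows kept after column drop:", nrow(cleaned), "of", nrow(toy), "\n")
cat("Rows kept if End_Lat had stayed:", rows_if_column_kept, "\n")
```

Running nrow(data) before and after the cleaning step on the real data is a quick way to confirm the row count reported above.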
Understanding the frequency of each accident severity level provides a foundational view of the dataset and helps determine where prevention efforts may be most effective. To explore this, I created a bar chart using ggplot2 to visualize the count of incidents across severity levels 1 through 4. The results show that the majority of accidents fall under Severity Level 2. Because Severity measures an accident’s impact on traffic (see the data dictionary), this means most accidents cause moderate disruption to traffic; it does not directly tell us how dangerous they were to the people involved. These findings suggest that targeting the causes of Level 2 accidents could lead to the most widespread improvements in road safety.
library(ggplot2)
# Plot the distribution of the severity of the accidents
ggplot(data, aes(x = factor(Severity), fill = factor(Severity))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Moderate accidents are most common",
       x = "Severity Level",
       y = "Count",
       fill = "Severity Level") +
  theme_minimal()
How does accident frequency vary by hour and weekday?
Identifying when accidents are most likely to occur is key for scheduling interventions such as traffic patrols, public safety announcements, or infrastructure changes. I extracted the hour and weekday from the Start_Time field using lubridate and plotted a heat map to examine accident frequency over time. The visualization revealed clear spikes during weekday rush hours—especially between 7–9 AM and 3–6 PM—implying that commuter traffic is a major factor in accident occurrence. These time-based trends can inform better planning of city resources and suggest that interventions should be concentrated during these high-risk windows.
library(dplyr)      # provides %>%, mutate(), count()
library(lubridate)  # provides hour(), wday()

# Extract hour and weekday from Start_Time
data <- data %>%
  mutate(
    Hour = hour(Start_Time),
    Day = wday(Start_Time, label = TRUE)  # Sunday = 1, Saturday = 7
  )

# Count number of accidents for each day-hour pair
heat_data <- data %>%
  count(Day, Hour)

# Plot the heat map
ggplot(heat_data, aes(x = Hour, y = Day, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(name = "Accidents", option = "C") +
  labs(
    title = "Clear Spike in Accidents During Commute Hours",
    x = "Hour of Day",
    y = "Day of Week"
  ) +
  theme_minimal()
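To put a number on the rush-hour spike, the share of accidents that fall inside the commute windows can be compared with the share expected if accidents were spread evenly over the week. The sketch below uses base R and made-up timestamps. Mapping "7–9 AM and 3–6 PM" to the clock hours 7, 8, 15, 16, and 17 is my own assumption.

```r
# Toy start times (illustrative only); the real column is Start_Time
start_time <- as.POSIXct(
  c("2022-03-07 07:30:00", "2022-03-07 08:15:00", "2022-03-08 16:45:00",
    "2022-03-09 12:10:00", "2022-03-10 17:05:00", "2022-03-12 08:00:00",
    "2022-03-11 23:20:00", "2022-03-13 15:30:00"),
  tz = "UTC"
)

lt <- as.POSIXlt(start_time)
hour_of_day <- lt$hour
is_weekday <- !(lt$wday %in% c(0, 6))  # POSIXlt: 0 = Sunday, 6 = Saturday

# Assumed commute windows: hours 7-8 and 15-17 on weekdays
is_rush <- is_weekday & (hour_of_day %in% c(7, 8, 15, 16, 17))

# Observed share of accidents inside the commute windows
observed_share <- mean(is_rush)

# Baseline if accidents were spread evenly over all 168 hours of the week:
# 5 weekdays x 5 commute hours out of 7 x 24 hours
uniform_share <- (5 * 5) / (7 * 24)

cat("Observed share:", round(observed_share, 3), "\n")
cat("Uniform baseline:", round(uniform_share, 3), "\n")
```

An observed share well above the baseline of roughly 15% would support what the heat map shows visually.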
What happens to accident frequency when weather conditions change?
Weather is often assumed to be a major cause of traffic accidents, but it’s important to validate whether that assumption holds true in the data. To explore this, I counted the number of accidents associated with each unique weather condition and visualized the top 10 using a horizontal bar chart. Surprisingly, the vast majority of accidents happened under clear or mildly cloudy conditions like “Fair” and “Mostly Cloudy,” rather than during storms or snow. This finding challenges conventional wisdom and suggests that driver behavior and traffic density during normal weather may be more influential than the weather itself in causing accidents.
weather_counts <- data %>%
  group_by(Weather_Condition) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  top_n(10, Count)  # Select top 10 weather conditions

ggplot(weather_counts, aes(x = reorder(Weather_Condition, Count), y = Count, fill = Weather_Condition)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Top 10 Weather Conditions During Accidents",
       x = "Weather Condition",
       y = "Number of Accidents",
       fill = "Weather Condition") +
  theme_minimal()
How does average accident severity differ across cities?
While some cities may experience a high number of accidents, others may be more prone to severe incidents. This distinction is important for making localized improvements in road safety. I used aggregate() to calculate the average severity for each city, filtered to include only those with at least 100 accidents, and plotted the top 10 cities by severity using a bar chart. Cities like Saint Louis, Lansing, and Chicago ranked highest in severity, even though they don’t lead in total accident count. This indicates that certain urban environments may have underlying risk factors that lead to more severe outcomes, warranting further investigation.
# Calculate average severity per city
avg_severity <- aggregate(Severity ~ City, data = data, mean)

# Calculate count per city to filter out cities with small sample sizes
city_counts <- table(data$City)

# Merge counts into the avg_severity dataframe
# (index by name via as.character() so a factor City is not matched by its integer codes)
avg_severity$Count <- as.integer(city_counts[as.character(avg_severity$City)])

# Keep only cities with at least 100 accidents
avg_severity_filtered <- avg_severity[avg_severity$Count >= 100, ]

# Get top 10 cities by average severity
top10 <- head(avg_severity_filtered[order(-avg_severity_filtered$Severity), ], 10)

# Plot
library(ggplot2)
ggplot(top10, aes(x = reorder(City, Severity), y = Severity, fill = Severity)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c() +
  labs(
    title = "Top 10 Cities by Average Accident Severity",
    x = "City",
    y = "Average Severity"
  ) +
  theme_minimal()
city_counts <- table(data$City)

# Convert to data frame
city_counts_df <- as.data.frame(city_counts)
colnames(city_counts_df) <- c("City", "Count")

# Sort by count (descending) and take top 10
top10_cities <- head(city_counts_df[order(-city_counts_df$Count), ], 10)

# Plot
library(ggplot2)
ggplot(top10_cities, aes(x = reorder(City, Count), y = Count, fill = Count)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c() +
  labs(
    title = "Top 10 Cities by Number of Accidents",
    x = "City",
    y = "Accident Count"
  ) +
  theme_minimal()
Is there a time in the year in which we see a spike in accidents?
This seasonal analysis shows that accidents peak during the winter months, followed closely by fall, while spring and summer see noticeably fewer incidents. The elevated accident count in winter may be driven by a combination of hazardous road conditions like ice and snow, reduced daylight, and increased travel around the holidays. Fall’s higher numbers could be influenced by back-to-school traffic and early seasonal weather changes. These findings suggest that colder seasons pose greater risks for drivers, and it’s in our best interest to focus road safety campaigns, resource planning, and traffic management efforts during these times of year.
library(lubridate)  # provides month()

# Add a Month column
data$Month <- month(data$Start_Time, label = TRUE)

ggplot(data, aes(x = Month)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Monthly Distribution of Accidents",
       x = "Month",
       y = "Number of Accidents") +
  theme_minimal()
# Create seasons
data$Season <- factor(
  ifelse(month(data$Start_Time) %in% c(12, 1, 2), "Winter",
  ifelse(month(data$Start_Time) %in% c(3, 4, 5), "Spring",
  ifelse(month(data$Start_Time) %in% c(6, 7, 8), "Summer", "Fall"))),
  levels = c("Winter", "Spring", "Summer", "Fall"))

# Plot accident count by season
library(ggplot2)
ggplot(data, aes(x = Season)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Early Sunsets Could be a Factor in Accident Spikes",
       x = "Season",
       y = "Number of Accidents") +
  theme_minimal()
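One caveat when reading raw monthly or seasonal counts: the collection window runs from February 2016 to March 2023, so February and March each appear in one more year than the other months. A minimal sketch of how the counts could be put on an equal footing is shown below. The raw counts are made-up numbers, and the sketch assumes the cleaned data still covers the whole window evenly.

```r
# Calendar months covered by the collection window stated in the text
# (February 2016 through March 2023)
covered <- seq(as.Date("2016-02-01"), as.Date("2023-03-01"), by = "month")
times_covered <- table(factor(as.integer(format(covered, "%m")), levels = 1:12))

# Hypothetical raw accident counts per calendar month, Jan to Dec (made up)
raw_counts <- c(700, 800, 800, 700, 700, 700, 700, 700, 700, 700, 700, 700)

# Average accidents per observed month, which removes the coverage imbalance
per_month_avg <- raw_counts / as.vector(times_covered)
names(per_month_avg) <- month.abb
print(round(per_month_avg, 1))
```

In this toy example February and March look busier in the raw counts, but the per-month averages are identical once the extra year of coverage is accounted for.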
Hypothesis
Based on the exploratory analysis, I observed that most accidents occurred during weekday rush hours, particularly between 7–9 AM and 3–6 PM, and under fair or mildly cloudy weather conditions. Additionally, accident frequency peaked during the winter months, followed by fall, while spring and summer saw noticeably fewer incidents. This seasonal trend suggests that factors like holiday travel, shorter daylight hours, and winter road conditions may contribute to increased accident risk, but even then, most accidents still occurred during clear weather. These patterns challenge the common assumption that adverse weather is the primary cause of accidents and instead point to traffic volume and time of day as stronger contributors.
From this, I hypothesize that accident frequency is more strongly influenced by traffic patterns than by weather conditions. This hypothesis is meaningful because it can help city planners, public safety officials, and traffic engineers prioritize interventions where they will have the greatest impact, targeting congestion and peak traffic hours rather than focusing solely on weather-related responses.
To fully test this hypothesis, additional data such as hourly traffic volume, congestion levels, and road type classifications would be needed. Analytical methods like multivariate regression and time series modeling would help isolate the effects of traffic versus weather. If the hypothesis is true, efforts should focus on managing traffic flow during high-volume periods; if false, more attention should be placed on preparing for and mitigating weather-related hazards.
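As a rough illustration of the multivariate regression idea, the sketch below fits a Poisson regression to simulated hourly accident counts. Everything in it is hypothetical: the hourly table, the rush_hour and adverse_weather indicators, and the effect sizes are invented so the code can run on its own. With real data, rush_hour would ideally be replaced by measured traffic volume, which this dataset does not contain.

```r
set.seed(42)  # reproducible toy data

# Hypothetical hourly panel: one row per hour, with made-up predictors
n <- 2000
hourly <- data.frame(
  rush_hour = rbinom(n, 1, 0.2),        # 1 if the hour falls in a commute window
  adverse_weather = rbinom(n, 1, 0.15)  # 1 if rain, snow, or fog was recorded
)

# Simulate counts in which rush hour matters more than weather (by construction)
rate <- exp(1.0 + 0.8 * hourly$rush_hour + 0.2 * hourly$adverse_weather)
hourly$accidents <- rpois(n, rate)

# Poisson regression: both predictors enter together, so each coefficient
# is estimated while holding the other fixed
fit <- glm(accidents ~ rush_hour + adverse_weather,
           family = poisson, data = hourly)
print(round(coef(summary(fit)), 3))

# exp(coefficient) is the multiplicative change in the expected count
rate_ratios <- exp(coef(fit))
print(round(rate_ratios, 2))
```

On real data, a rate ratio for the traffic term that is clearly larger than the one for the weather term would be consistent with the hypothesis, although hour-level counts from observational data cannot establish causation on their own.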
Executive Summary
This analysis explores patterns and risk factors associated with vehicle accidents using a national dataset. Key exploratory findings reveal that most accidents occur during weekday rush hours—specifically between 7–9 AM and 3–6 PM—suggesting a strong relationship between traffic congestion and accident frequency. Contrary to popular assumptions, the majority of accidents take place in fair or mildly cloudy weather, not during rain, fog, or snow. Seasonal analysis further supports this insight: Winter has the highest accident count, followed by Fall, while Spring and Summer experience fewer incidents overall. This suggests that factors such as holiday travel, shorter daylight hours, and increased congestion in colder months may play a role.
Based on these patterns, we hypothesize that traffic volume and time of day are more influential in accident frequency than adverse weather conditions. This hypothesis has practical implications for traffic engineers, public safety officials, and urban planners. If accurate, it would shift the focus of safety efforts away from weather-specific interventions toward congestion management strategies such as optimized traffic signal timing, enforcement during peak hours, or improved public transportation access.
To rigorously test this hypothesis, additional data is needed—specifically, real-time or historical traffic volume, congestion levels, and road type classifications. Statistical methods such as multivariate regression and time series modeling would help isolate the effects of traffic versus environmental factors on accident frequency.
For stakeholders, the main takeaway is that predictable human patterns—such as commuting times and seasonal travel—may drive accidents more than unpredictable weather events. Focusing resources on these high-risk time windows and travel periods could lead to meaningful reductions in accident rates.
library(ggplot2)
library(lubridate)

# Create necessary time features
data$Hour <- hour(data$Start_Time)
data$Weekday <- wday(data$Start_Time, label = TRUE)
data$Is_Weekday <- !data$Weekday %in% c("Sat", "Sun")
data$Month <- month(data$Start_Time)

# Assign seasons based on month
data$Season <- factor(
  ifelse(data$Month %in% c(12, 1, 2), "Winter",
  ifelse(data$Month %in% c(3, 4, 5), "Spring",
  ifelse(data$Month %in% c(6, 7, 8), "Summer", "Fall"))),
  levels = c("Winter", "Spring", "Summer", "Fall"))

# Filter for weekdays only
weekday_data <- data[data$Is_Weekday == TRUE, ]

# Count accidents by hour and season
accidents_by_hour_season <- as.data.frame(table(weekday_data$Hour, weekday_data$Season))
colnames(accidents_by_hour_season) <- c("Hour", "Season", "Accidents")
accidents_by_hour_season$Hour <- as.numeric(as.character(accidents_by_hour_season$Hour))

# Plot (linewidth replaces the size argument for lines, deprecated in ggplot2 3.4.0)
ggplot(accidents_by_hour_season, aes(x = Hour, y = Accidents, color = Season, group = Season)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Cold Dark Commutes",
    x = "Hour of Day",
    y = "Number of Accidents",
    color = "Season"
  ) +
  theme_minimal()