For my final project in R, I decided to use the United States Traffic Accidents dataset. This set carried a lot of useful information about recorded accidents across the country allowing for versatility in my visualizations. The dataset is countrywide, including 49 of the 50 states, and records were collected from February 2016 through December 2020. The data was gathered using multiple traffic APIs, records were captured through state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors. There were 1516064 unique values that I was able to work with.
Here are some preliminary findings from analyzing my dataset.
Here is the structure and summary for my dataframe. The traffic records included a lot of quantitative data that can be used for meaningful grouping such as the start and end time of each accident, the severity, the latitude and longitude, as well as the weather and temperature during the accident.
str(trafficdf)
## 'data.frame': 1516064 obs. of 51 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID : chr "A-2716600" "A-2716601" "A-2716602" "A-2716603" ...
## $ Severity : Factor w/ 4 levels "1","2","3","4": 3 2 2 2 2 3 2 2 2 2 ...
## $ Start_Time : chr "2016-02-08 00:37:08" "2016-02-08 05:56:20" "2016-02-08 06:15:39" "2016-02-08 06:15:39" ...
## $ End_Time : chr "2016-02-08 06:37:08" "2016-02-08 11:56:20" "2016-02-08 12:15:39" "2016-02-08 12:15:39" ...
## $ Start_Lat : num 40.1 39.9 39.1 39.1 41.1 ...
## $ Start_Lng : num -83.1 -84.1 -84.5 -84.5 -81.5 ...
## $ End_Lat : num 40.1 39.9 39.1 39.1 41.1 ...
## $ End_Lng : num -83 -84 -84.5 -84.5 -81.5 ...
## $ Distance.mi. : num 3.23 0.747 0.055 0.219 0.123 ...
## $ Description : chr "Between Sawmill Rd/Exit 20 and OH-315/Olentangy Riv Rd/Exit 22 - Accident." "At OH-4/OH-235/Exit 41 - Accident." "At I-71/US-50/Exit 1 - Accident." "At I-71/US-50/Exit 1 - Accident." ...
## $ Number : int NA NA NA NA NA NA NA 1887 NA NA ...
## $ Street : chr "Outerbelt E" "I-70 E" "I-75 S" "US-50 E" ...
## $ Side : chr "R" "R" "R" "R" ...
## $ City : chr "Dublin" "Dayton" "Cincinnati" "Cincinnati" ...
## $ County : chr "Franklin" "Montgomery" "Hamilton" "Hamilton" ...
## $ State : Factor w/ 49 levels "AL","AR","AZ",..: 34 34 34 34 34 34 34 34 34 34 ...
## $ Zipcode : chr "43017" "45424" "45203" "45202" ...
## $ Country : chr "US" "US" "US" "US" ...
## $ Timezone : chr "US/Eastern" "US/Eastern" "US/Eastern" "US/Eastern" ...
## $ Airport_Code : chr "KOSU" "KFFO" "KLUK" "KLUK" ...
## $ Weather_Timestamp : chr "2016-02-08 00:53:00" "2016-02-08 05:58:00" "2016-02-08 05:53:00" "2016-02-08 05:53:00" ...
## $ Temperature.F. : num 42.1 36.9 36 36 39 37 35.6 35.6 33.8 33.1 ...
## $ Wind_Chill.F. : num 36.1 NA NA NA NA 29.8 29.2 29.2 NA 30 ...
## $ Humidity... : int 58 91 97 97 55 93 100 100 100 92 ...
## $ Pressure.in. : num 29.8 29.7 29.7 29.7 29.6 ...
## $ Visibility.mi. : num 10 10 10 10 10 10 10 10 3 0.5 ...
## $ Wind_Direction : chr "SW" "Calm" "Calm" "Calm" ...
## $ Wind_Speed.mph. : num 10.4 NA NA NA NA 10.4 8.1 8.1 2.3 3.5 ...
## $ Precipitation.in. : num 0 0.02 0.02 0.02 NA 0.01 NA NA NA 0.08 ...
## $ Weather_Condition : chr "Light Rain" "Light Rain" "Overcast" "Overcast" ...
## $ Amenity : chr "False" "False" "False" "False" ...
## $ Bump : chr "False" "False" "False" "False" ...
## $ Crossing : chr "False" "False" "False" "False" ...
## $ Give_Way : chr "False" "False" "False" "False" ...
## $ Junction : chr "False" "False" "True" "True" ...
## $ No_Exit : chr "False" "False" "False" "False" ...
## $ Railway : chr "False" "False" "False" "False" ...
## $ Roundabout : chr "False" "False" "False" "False" ...
## $ Station : chr "False" "False" "False" "False" ...
## $ Stop : chr "False" "False" "False" "False" ...
## $ Traffic_Calming : chr "False" "False" "False" "False" ...
## $ Traffic_Signal : chr "False" "False" "False" "False" ...
## $ Turning_Loop : chr "False" "False" "False" "False" ...
## $ Sunrise_Sunset : chr "Night" "Night" "Night" "Night" ...
## $ Civil_Twilight : chr "Night" "Night" "Night" "Night" ...
## $ Nautical_Twilight : chr "Night" "Night" "Night" "Night" ...
## $ Astronomical_Twilight: chr "Night" "Night" "Day" "Day" ...
## $ type : chr "None" "None" "Junction" "Junction" ...
## $ year : Factor w/ 5 levels "2016","2017",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ region : chr "Mid_West" "Mid_West" "Mid_West" "Mid_West" ...
summary(trafficdf)
## X ID Severity Start_Time
## Min. : 1 Length:1516064 1: 28178 Length:1516064
## 1st Qu.: 379017 Class :character 2:1212382 Class :character
## Median : 758032 Mode :character 3: 161052 Mode :character
## Mean : 758032 4: 114452
## 3rd Qu.:1137048
## Max. :1516064
##
## End_Time Start_Lat Start_Lng End_Lat
## Length:1516064 Min. :24.57 Min. :-124.50 Min. :24.57
## Class :character 1st Qu.:33.85 1st Qu.:-118.21 1st Qu.:33.85
## Mode :character Median :37.35 Median : -94.38 Median :37.35
## Mean :36.90 Mean : -98.60 Mean :36.90
## 3rd Qu.:40.73 3rd Qu.: -80.87 3rd Qu.:40.73
## Max. :49.00 Max. : -67.11 Max. :49.08
##
## End_Lng Distance.mi. Description Number
## Min. :-124.50 Min. : 0.0000 Length:1516064 Min. : 0
## 1st Qu.:-118.21 1st Qu.: 0.0000 Class :character 1st Qu.: 1212
## Median : -94.38 Median : 0.1780 Mode :character Median : 4000
## Mean : -98.60 Mean : 0.5873 Mean : 8908
## 3rd Qu.: -80.87 3rd Qu.: 0.5940 3rd Qu.: 10100
## Max. : -67.11 Max. :155.1860 Max. :9999997
## NA's :1046095
## Street Side City County
## Length:1516064 Length:1516064 Length:1516064 Length:1516064
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## State Zipcode Country Timezone
## CA :448833 Length:1516064 Length:1516064 Length:1516064
## FL :153007 Class :character Class :character Class :character
## OR : 87484 Mode :character Mode :character Mode :character
## TX : 75142
## NY : 60974
## MN : 52345
## (Other):638279
## Airport_Code Weather_Timestamp Temperature.F. Wind_Chill.F.
## Length:1516064 Length:1516064 Min. :-89.00 Min. :-89.0
## Class :character Class :character 1st Qu.: 47.00 1st Qu.: 40.8
## Mode :character Mode :character Median : 61.00 Median : 57.0
## Mean : 59.58 Mean : 55.1
## 3rd Qu.: 73.00 3rd Qu.: 71.0
## Max. :170.60 Max. :113.0
## NA's :43033 NA's :449316
## Humidity... Pressure.in. Visibility.mi. Wind_Direction
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Length:1516064
## 1st Qu.: 48.00 1st Qu.:29.44 1st Qu.: 10.00 Class :character
## Median : 68.00 Median :29.88 Median : 10.00 Mode :character
## Mean : 64.66 Mean :29.55 Mean : 9.13
## 3rd Qu.: 84.00 3rd Qu.:30.04 3rd Qu.: 10.00
## Max. :100.00 Max. :58.04 Max. :140.00
## NA's :45509 NA's :36274 NA's :44211
## Wind_Speed.mph. Precipitation.in. Weather_Condition Amenity
## Min. : 0.00 Min. : 0 Length:1516064 Length:1516064
## 1st Qu.: 4.60 1st Qu.: 0 Class :character Class :character
## Median : 7.00 Median : 0 Mode :character Mode :character
## Mean : 7.63 Mean : 0
## 3rd Qu.: 10.40 3rd Qu.: 0
## Max. :984.00 Max. :24
## NA's :128862 NA's :510549
## Bump Crossing Give_Way Junction
## Length:1516064 Length:1516064 Length:1516064 Length:1516064
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## No_Exit Railway Roundabout Station
## Length:1516064 Length:1516064 Length:1516064 Length:1516064
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Stop Traffic_Calming Traffic_Signal Turning_Loop
## Length:1516064 Length:1516064 Length:1516064 Length:1516064
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
## Length:1516064 Length:1516064 Length:1516064 Length:1516064
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## type year region
## Length:1516064 2016:129325 Length:1516064
## Class :character 2017:170099 Class :character
## Mode :character 2018:166936 Mode :character
## 2019:261772
## 2020:787932
##
##
Here are all the 49 states represented in this database, as well as the total accident count for each state.
unique(trafficdf$State)
## [1] OH IN KY WV MI PA CA NV MN TX MO CO OK LA KS WI IA MS NE ND WY SD MT NM AR
## [26] IL NJ GA FL NY CT RI SC NC MD MA TN VA DE DC ME AL NH VT AZ UT ID OR WA
## 49 Levels: AL AR AZ CA CO CT DC DE FL GA IA ID IL IN KS KY LA MA MD ME ... WY
counts_state <- dplyr::count(trafficdf, trafficdf$State)
counts_state
## trafficdf$State n
## 1 AL 9375
## 2 AR 4373
## 3 AZ 30185
## 4 CA 448833
## 5 CO 19809
## 6 CT 15194
## 7 DC 3788
## 8 DE 2331
## 9 FL 153007
## 10 GA 31111
## 11 IA 4780
## 12 ID 4061
## 13 IL 41709
## 14 IN 11736
## 15 KS 5146
## 16 KY 5671
## 17 LA 19250
## 18 MA 6121
## 19 MD 35320
## 20 ME 2202
## 21 MI 27775
## 22 MN 52345
## 23 MO 16949
## 24 MS 2790
## 25 MT 3306
## 26 NC 50159
## 27 ND 455
## 28 NE 2178
## 29 NH 3826
## 30 NJ 29850
## 31 NM 1467
## 32 NV 3681
## 33 NY 60974
## 34 OH 22044
## 35 OK 5592
## 36 OR 87484
## 37 PA 42844
## 38 RI 3766
## 39 SC 28090
## 40 SD 213
## 41 TN 21874
## 42 TX 75142
## 43 UT 33218
## 44 VA 51198
## 45 VT 352
## 46 WA 22999
## 47 WI 7919
## 48 WV 3242
## 49 WY 330
These are a couple of the columns I had initially looked into when searching through the dataset. I first looked into precipitation to see if it was a determining factor that could affect traffic accidents. This is the average precipitation (in inches) for the entirety of the dataset.
mean(trafficdf$Precipitation.in., na.rm = TRUE)
## [1] 0.008477855
As it turns out the average precipitation was incredibly low and therefore must have a low correlation with the number of traffic accidents that occur.
I also looked at wind speed, as this could possibly influence the occurrence of traffic accidents in the US if the mean turns out to be high enough. Here is the mean for this column, as well as the counts for different wind speeds throughout the dataset (in mph).
head(dplyr::count(trafficdf, trafficdf$Wind_Speed.mph.), 20)
## trafficdf$Wind_Speed.mph. n
## 1 0.0 202877
## 2 1.0 27
## 3 1.2 125
## 4 2.0 88
## 5 2.3 259
## 6 3.0 98509
## 7 3.5 43944
## 8 4.6 46674
## 9 5.0 96985
## 10 5.8 47224
## 11 6.0 90336
## 12 6.9 44742
## 13 7.0 83055
## 14 8.0 73414
## 15 8.1 41453
## 16 9.0 64831
## 17 9.2 37715
## 18 10.0 53178
## 19 10.4 31631
## 20 11.5 29053
mean(trafficdf$Wind_Speed.mph., na.rm = TRUE)
## [1] 7.630812
Ultimately, this information did not look as promising as I had hoped as well, which led me to my visualizations.
For my first visualization, I wanted to have a better sense of a possible variable that could give reason to occurences of traffic accidents in general. After delving deeper into the dataset, I discovered 12 columns of data regarding the presence of a different traffic landmark, each being a boolean that prints either “True” or “False” if the variable was involved in the accident. To consolidate this information, I created a new column within the traffic dataframe called “type” which contains the name of the landmark involved for each record, then created a new dataframe to hold the number of occurences of each landmark type.
## trafficdf$type n
## 1 Junction 202345
## 2 Traffic_Signal 97818
## 3 Crossing 82228
## 4 Station 16555
## 5 Stop 12987
## 6 Amenity 12403
## 7 Railway 3463
## 8 Give_Way 2082
## 9 No_Exit 928
## 10 Bump 250
## 11 Traffic_Calming 56
## 12 Roundabout 16
I used my new “counts” dataframe to create a horizontal barplot of this information, grouping each bar by landmark type, and by year. Through this visualization I am able to have a better idea of what kind of traffic landmarks are most involved in traffic accidents throughout the US, an incite that could be helpful in determining the causes of crashes.
For my next visualization, I also wanted to see during what time of the year traffic accidents were more common through the years in the dataset. To do this I created a dataframe containing accident counts grouped my year, and also by each month of the year. My original prediction for this visualization was that there would be an increase in traffic accidents during the winter months, as roads are more prone to freezing over during this time, as well as there being more hazardous driving weather conditions over the course of these months. While my findings were not entirely significant, there is the presence of a slight curve in the number of traffic accidents over these years, with the dip being during the summer months of 2017, 2018, 2019, and a drastic dip in 2020.
After my first couple visualizations, I wanted to experiment with location and visualize it against the “Severity” column of my dataset to see which areas have the most severe traffic accidents. While experimenting, I realized that grouping by the 49 individual states within the dataset would create a large amount of groups, making it difficult for visualization purposes. To combat this, I created a new column within the dataframe narrowing each state into a specific region.
## trafficdf$region n
## 1 West 452514
## 2 South_East 320029
## 3 Mid_Atlantic 195909
## 4 Mid_West 171145
## 5 South_West 165413
## 6 North_West 118180
## 7 None 61413
## 8 North_East 31461
As there seemed to be a relatively equally distributed spread of traffic accident records between all the regions, this made for a good grouping for visualization. Below is the Severity of Traffic Accidents grouped by each Region in the US. The important thing to look at from these pie charts is the pink slice in each of the pies. This slice indicates an accident of the most significant severity, and can show the trend of which regions tend to have the worst accidents. Which in this case was found to be: Other, the Mid-Atlantic, and the North-East for our top three.
plot3 <- ggplot(data = region_severity_df, aes(x="", y=n, fill=severity)) +
geom_bar(stat="identity", position = "fill") +
coord_polar(theta="y", start=0) +
labs(fill = "Severity Level", x = NULL, y = NULL, title = "Accident Severity by Region") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
facet_wrap(~myRegion, ncol = 3, nrow = 3) +
scale_fill_brewer(palette = "Set2") +
geom_text(aes(x=1.8, label=ifelse(percent_of_total>5.2, paste0(severity, ", ", percent_of_total,"%"), "")),
size=2.25,
position = position_fill(vjust = 0.4))
plot3
To go along with the previous visualization grouped by region, I created a donut plot using plotly showing five different donuts of the five years represented in my dataset (2016-2020). This plot is meant to be able to show a lot of general information about the data, splitting the total traffic accidents for each year into the percentages of the regions in which they occurred. By hovering over a section of each donut, you are able to see the corresponding year, region, accident count, and percentage of that year’s accidents.
plot4 <- plot_ly(hole=0.7) %>%
layout(title="Traffic Accidents (2016-2020)") %>%
add_trace(data = region_df[region_df$year==2020,],
labels=~myRegion,
values=~region_df[region_df$year==2020,"n"],
type="pie",
textposition="inside",
hovertemplate="Year: 2020<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>") %>%
add_trace(data = region_df[region_df$year==2019,],
labels=~myRegion,
values=~region_df[region_df$year==2019,"n"],
type="pie",
textposition="inside",
hovertemplate="Year: 2019<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
domain = list(
x = c(0.16,0.84),
y = c(0.16,0.84))) %>%
add_trace(data = region_df[region_df$year==2018,],
labels=~myRegion,
values=~region_df[region_df$year==2018,"n"],
type="pie",
textposition="inside",
hovertemplate="Year: 2018<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
domain = list(
x = c(0.27,0.73),
y = c(0.27,0.73))) %>%
add_trace(data = region_df[region_df$year==2017,],
labels=~myRegion,
values=~region_df[region_df$year==2017,"n"],
type="pie",
textposition="inside",
hovertemplate="Year: 2017<br>Region:%{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
domain = list(
x = c(0.35,0.65),
y = c(0.35,0.65))) %>%
add_trace(data = region_df[region_df$year==2016,],
labels=~myRegion,
values=~region_df[region_df$year==2016,"n"],
type="pie",
textposition="inside",
hovertemplate="Year: 2016<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
domain = list(
x = c(0.40,0.60),
y = c(0.40,0.60)))
plot4
My last two visualizations both display a Heat Map depicting occurrences of traffic accidents grouped by day of the week and by year.
Tab 1 contains the Heat Map for the number of traffic accidents by day and year with the darkest colors being in 2020 and are mainly on weekdays.
Tab 2 contains a similar Heat Map, but rather than displaying the count of traffic accidents for each box, it displays the percentages of the total occurrences on each day of the week, by each year. Taking into account each year, Tuesday had the highest average percentage of the total traffic accidents over all other days of the week, over the five year period at 17.8%. Wednesday was a close second at 17.68% averaged over the five years.
plot5 <- ggplot(days_df, aes(x = year, y = daysoftheweek, fill=n)) +
geom_tile(color="black") +
geom_text(aes(label=comma(n))) +
labs(title = "HeatMap: Accidents by Day of the Week and Year (#)",
x = "Year",
y = "Days",
fill = "Accident Count") +
theme_igray() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_discrete(limits = rev(levels(days_df$daysoftheweek))) +
scale_fill_continuous(low="white", high = "lightblue", breaks = c(seq(0, max(days_df$n), by = 20000))) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
plot5
plot6 <- ggplot(days_df, aes(x = year, y = daysoftheweek, fill=percent_of_total)) +
geom_tile(color="black") +
geom_text(aes(label=paste0(percent_of_total, "%"))) +
labs(title = "HeatMap: Accidents by Day of the Week and Year (%)",
x = "Year",
y = "Days",
fill = "Accident Percentage") +
theme_igray() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_discrete(limits = rev(levels(days_df$daysoftheweek))) +
scale_fill_continuous(low="white", high = "lightgreen", breaks = c(seq(0, max(days_df$percent_of_total), by = 2.5))) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
plot6