US Traffic Accidents

For my final project in R, I decided to use the United States Traffic Accidents dataset. This set carried a lot of useful information about recorded accidents across the country allowing for versatility in my visualizations. The dataset is countrywide, including 49 of the 50 states, and records were collected from February 2016 through December 2020. The data was gathered using multiple traffic APIs, records were captured through state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors. There were 1516064 unique values that I was able to work with.

Preliminary Findings

Here are some preliminary findings from analyzing my dataset.

Tab 1

Here is the structure and summary for my dataframe. The traffic records included a lot of quantitative data that can be used for meaningful grouping such as the start and end time of each accident, the severity, the latitude and longitude, as well as the weather and temperature during the accident.

str(trafficdf)
## 'data.frame':    1516064 obs. of  51 variables:
##  $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ID                   : chr  "A-2716600" "A-2716601" "A-2716602" "A-2716603" ...
##  $ Severity             : Factor w/ 4 levels "1","2","3","4": 3 2 2 2 2 3 2 2 2 2 ...
##  $ Start_Time           : chr  "2016-02-08 00:37:08" "2016-02-08 05:56:20" "2016-02-08 06:15:39" "2016-02-08 06:15:39" ...
##  $ End_Time             : chr  "2016-02-08 06:37:08" "2016-02-08 11:56:20" "2016-02-08 12:15:39" "2016-02-08 12:15:39" ...
##  $ Start_Lat            : num  40.1 39.9 39.1 39.1 41.1 ...
##  $ Start_Lng            : num  -83.1 -84.1 -84.5 -84.5 -81.5 ...
##  $ End_Lat              : num  40.1 39.9 39.1 39.1 41.1 ...
##  $ End_Lng              : num  -83 -84 -84.5 -84.5 -81.5 ...
##  $ Distance.mi.         : num  3.23 0.747 0.055 0.219 0.123 ...
##  $ Description          : chr  "Between Sawmill Rd/Exit 20 and OH-315/Olentangy Riv Rd/Exit 22 - Accident." "At OH-4/OH-235/Exit 41 - Accident." "At I-71/US-50/Exit 1 - Accident." "At I-71/US-50/Exit 1 - Accident." ...
##  $ Number               : int  NA NA NA NA NA NA NA 1887 NA NA ...
##  $ Street               : chr  "Outerbelt E" "I-70 E" "I-75 S" "US-50 E" ...
##  $ Side                 : chr  "R" "R" "R" "R" ...
##  $ City                 : chr  "Dublin" "Dayton" "Cincinnati" "Cincinnati" ...
##  $ County               : chr  "Franklin" "Montgomery" "Hamilton" "Hamilton" ...
##  $ State                : Factor w/ 49 levels "AL","AR","AZ",..: 34 34 34 34 34 34 34 34 34 34 ...
##  $ Zipcode              : chr  "43017" "45424" "45203" "45202" ...
##  $ Country              : chr  "US" "US" "US" "US" ...
##  $ Timezone             : chr  "US/Eastern" "US/Eastern" "US/Eastern" "US/Eastern" ...
##  $ Airport_Code         : chr  "KOSU" "KFFO" "KLUK" "KLUK" ...
##  $ Weather_Timestamp    : chr  "2016-02-08 00:53:00" "2016-02-08 05:58:00" "2016-02-08 05:53:00" "2016-02-08 05:53:00" ...
##  $ Temperature.F.       : num  42.1 36.9 36 36 39 37 35.6 35.6 33.8 33.1 ...
##  $ Wind_Chill.F.        : num  36.1 NA NA NA NA 29.8 29.2 29.2 NA 30 ...
##  $ Humidity...          : int  58 91 97 97 55 93 100 100 100 92 ...
##  $ Pressure.in.         : num  29.8 29.7 29.7 29.7 29.6 ...
##  $ Visibility.mi.       : num  10 10 10 10 10 10 10 10 3 0.5 ...
##  $ Wind_Direction       : chr  "SW" "Calm" "Calm" "Calm" ...
##  $ Wind_Speed.mph.      : num  10.4 NA NA NA NA 10.4 8.1 8.1 2.3 3.5 ...
##  $ Precipitation.in.    : num  0 0.02 0.02 0.02 NA 0.01 NA NA NA 0.08 ...
##  $ Weather_Condition    : chr  "Light Rain" "Light Rain" "Overcast" "Overcast" ...
##  $ Amenity              : chr  "False" "False" "False" "False" ...
##  $ Bump                 : chr  "False" "False" "False" "False" ...
##  $ Crossing             : chr  "False" "False" "False" "False" ...
##  $ Give_Way             : chr  "False" "False" "False" "False" ...
##  $ Junction             : chr  "False" "False" "True" "True" ...
##  $ No_Exit              : chr  "False" "False" "False" "False" ...
##  $ Railway              : chr  "False" "False" "False" "False" ...
##  $ Roundabout           : chr  "False" "False" "False" "False" ...
##  $ Station              : chr  "False" "False" "False" "False" ...
##  $ Stop                 : chr  "False" "False" "False" "False" ...
##  $ Traffic_Calming      : chr  "False" "False" "False" "False" ...
##  $ Traffic_Signal       : chr  "False" "False" "False" "False" ...
##  $ Turning_Loop         : chr  "False" "False" "False" "False" ...
##  $ Sunrise_Sunset       : chr  "Night" "Night" "Night" "Night" ...
##  $ Civil_Twilight       : chr  "Night" "Night" "Night" "Night" ...
##  $ Nautical_Twilight    : chr  "Night" "Night" "Night" "Night" ...
##  $ Astronomical_Twilight: chr  "Night" "Night" "Day" "Day" ...
##  $ type                 : chr  "None" "None" "Junction" "Junction" ...
##  $ year                 : Factor w/ 5 levels "2016","2017",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ region               : chr  "Mid_West" "Mid_West" "Mid_West" "Mid_West" ...
summary(trafficdf)
##        X                ID            Severity     Start_Time       
##  Min.   :      1   Length:1516064     1:  28178   Length:1516064    
##  1st Qu.: 379017   Class :character   2:1212382   Class :character  
##  Median : 758032   Mode  :character   3: 161052   Mode  :character  
##  Mean   : 758032                      4: 114452                     
##  3rd Qu.:1137048                                                    
##  Max.   :1516064                                                    
##                                                                     
##    End_Time           Start_Lat       Start_Lng          End_Lat     
##  Length:1516064     Min.   :24.57   Min.   :-124.50   Min.   :24.57  
##  Class :character   1st Qu.:33.85   1st Qu.:-118.21   1st Qu.:33.85  
##  Mode  :character   Median :37.35   Median : -94.38   Median :37.35  
##                     Mean   :36.90   Mean   : -98.60   Mean   :36.90  
##                     3rd Qu.:40.73   3rd Qu.: -80.87   3rd Qu.:40.73  
##                     Max.   :49.00   Max.   : -67.11   Max.   :49.08  
##                                                                      
##     End_Lng         Distance.mi.      Description            Number       
##  Min.   :-124.50   Min.   :  0.0000   Length:1516064     Min.   :      0  
##  1st Qu.:-118.21   1st Qu.:  0.0000   Class :character   1st Qu.:   1212  
##  Median : -94.38   Median :  0.1780   Mode  :character   Median :   4000  
##  Mean   : -98.60   Mean   :  0.5873                      Mean   :   8908  
##  3rd Qu.: -80.87   3rd Qu.:  0.5940                      3rd Qu.:  10100  
##  Max.   : -67.11   Max.   :155.1860                      Max.   :9999997  
##                                                          NA's   :1046095  
##     Street              Side               City              County         
##  Length:1516064     Length:1516064     Length:1516064     Length:1516064    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      State          Zipcode            Country            Timezone        
##  CA     :448833   Length:1516064     Length:1516064     Length:1516064    
##  FL     :153007   Class :character   Class :character   Class :character  
##  OR     : 87484   Mode  :character   Mode  :character   Mode  :character  
##  TX     : 75142                                                           
##  NY     : 60974                                                           
##  MN     : 52345                                                           
##  (Other):638279                                                           
##  Airport_Code       Weather_Timestamp  Temperature.F.   Wind_Chill.F.   
##  Length:1516064     Length:1516064     Min.   :-89.00   Min.   :-89.0   
##  Class :character   Class :character   1st Qu.: 47.00   1st Qu.: 40.8   
##  Mode  :character   Mode  :character   Median : 61.00   Median : 57.0   
##                                        Mean   : 59.58   Mean   : 55.1   
##                                        3rd Qu.: 73.00   3rd Qu.: 71.0   
##                                        Max.   :170.60   Max.   :113.0   
##                                        NA's   :43033    NA's   :449316  
##   Humidity...      Pressure.in.   Visibility.mi.   Wind_Direction    
##  Min.   :  1.00   Min.   : 0.00   Min.   :  0.00   Length:1516064    
##  1st Qu.: 48.00   1st Qu.:29.44   1st Qu.: 10.00   Class :character  
##  Median : 68.00   Median :29.88   Median : 10.00   Mode  :character  
##  Mean   : 64.66   Mean   :29.55   Mean   :  9.13                     
##  3rd Qu.: 84.00   3rd Qu.:30.04   3rd Qu.: 10.00                     
##  Max.   :100.00   Max.   :58.04   Max.   :140.00                     
##  NA's   :45509    NA's   :36274   NA's   :44211                      
##  Wind_Speed.mph.  Precipitation.in. Weather_Condition    Amenity         
##  Min.   :  0.00   Min.   : 0        Length:1516064     Length:1516064    
##  1st Qu.:  4.60   1st Qu.: 0        Class :character   Class :character  
##  Median :  7.00   Median : 0        Mode  :character   Mode  :character  
##  Mean   :  7.63   Mean   : 0                                             
##  3rd Qu.: 10.40   3rd Qu.: 0                                             
##  Max.   :984.00   Max.   :24                                             
##  NA's   :128862   NA's   :510549                                         
##      Bump             Crossing           Give_Way           Junction        
##  Length:1516064     Length:1516064     Length:1516064     Length:1516064    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    No_Exit            Railway           Roundabout          Station         
##  Length:1516064     Length:1516064     Length:1516064     Length:1516064    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      Stop           Traffic_Calming    Traffic_Signal     Turning_Loop      
##  Length:1516064     Length:1516064     Length:1516064     Length:1516064    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Sunrise_Sunset     Civil_Twilight     Nautical_Twilight  Astronomical_Twilight
##  Length:1516064     Length:1516064     Length:1516064     Length:1516064       
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##      type             year           region         
##  Length:1516064     2016:129325   Length:1516064    
##  Class :character   2017:170099   Class :character  
##  Mode  :character   2018:166936   Mode  :character  
##                     2019:261772                     
##                     2020:787932                     
##                                                     
## 

Tab 2

Here are all the 49 states represented in this database, as well as the total accident count for each state.

unique(trafficdf$State)
##  [1] OH IN KY WV MI PA CA NV MN TX MO CO OK LA KS WI IA MS NE ND WY SD MT NM AR
## [26] IL NJ GA FL NY CT RI SC NC MD MA TN VA DE DC ME AL NH VT AZ UT ID OR WA
## 49 Levels: AL AR AZ CA CO CT DC DE FL GA IA ID IL IN KS KY LA MA MD ME ... WY
counts_state <- dplyr::count(trafficdf, trafficdf$State)
counts_state
##    trafficdf$State      n
## 1               AL   9375
## 2               AR   4373
## 3               AZ  30185
## 4               CA 448833
## 5               CO  19809
## 6               CT  15194
## 7               DC   3788
## 8               DE   2331
## 9               FL 153007
## 10              GA  31111
## 11              IA   4780
## 12              ID   4061
## 13              IL  41709
## 14              IN  11736
## 15              KS   5146
## 16              KY   5671
## 17              LA  19250
## 18              MA   6121
## 19              MD  35320
## 20              ME   2202
## 21              MI  27775
## 22              MN  52345
## 23              MO  16949
## 24              MS   2790
## 25              MT   3306
## 26              NC  50159
## 27              ND    455
## 28              NE   2178
## 29              NH   3826
## 30              NJ  29850
## 31              NM   1467
## 32              NV   3681
## 33              NY  60974
## 34              OH  22044
## 35              OK   5592
## 36              OR  87484
## 37              PA  42844
## 38              RI   3766
## 39              SC  28090
## 40              SD    213
## 41              TN  21874
## 42              TX  75142
## 43              UT  33218
## 44              VA  51198
## 45              VT    352
## 46              WA  22999
## 47              WI   7919
## 48              WV   3242
## 49              WY    330

Tab 3

These are a couple of the columns I had initially looked into when searching through the dataset. I first looked into precipitation to see if it was a determining factor that could affect traffic accidents. This is the average precipitation (in inches) for the entirety of the dataset.

mean(trafficdf$Precipitation.in., na.rm = TRUE)
## [1] 0.008477855

As it turns out the average precipitation was incredibly low and therefore must have a low correlation with the number of traffic accidents that occur.

I also looked at wind speed, as this could possibly influence the occurrence of traffic accidents in the US if the mean turns out to be high enough. Here is the mean for this column, as well as the counts for different wind speeds throughout the dataset (in mph).

head(dplyr::count(trafficdf, trafficdf$Wind_Speed.mph.), 20)
##    trafficdf$Wind_Speed.mph.      n
## 1                        0.0 202877
## 2                        1.0     27
## 3                        1.2    125
## 4                        2.0     88
## 5                        2.3    259
## 6                        3.0  98509
## 7                        3.5  43944
## 8                        4.6  46674
## 9                        5.0  96985
## 10                       5.8  47224
## 11                       6.0  90336
## 12                       6.9  44742
## 13                       7.0  83055
## 14                       8.0  73414
## 15                       8.1  41453
## 16                       9.0  64831
## 17                       9.2  37715
## 18                      10.0  53178
## 19                      10.4  31631
## 20                      11.5  29053
mean(trafficdf$Wind_Speed.mph., na.rm = TRUE)
## [1] 7.630812

Ultimately, this information did not look as promising as I had hoped as well, which led me to my visualizations.

Beginning Visualizations

Tab 1

For my first visualization, I wanted to have a better sense of a possible variable that could give reason to occurences of traffic accidents in general. After delving deeper into the dataset, I discovered 12 columns of data regarding the presence of a different traffic landmark, each being a boolean that prints either “True” or “False” if the variable was involved in the accident. To consolidate this information, I created a new column within the traffic dataframe called “type” which contains the name of the landmark involved for each record, then created a new dataframe to hold the number of occurences of each landmark type.

##     trafficdf$type      n
## 1         Junction 202345
## 2   Traffic_Signal  97818
## 3         Crossing  82228
## 4          Station  16555
## 5             Stop  12987
## 6          Amenity  12403
## 7          Railway   3463
## 8         Give_Way   2082
## 9          No_Exit    928
## 10            Bump    250
## 11 Traffic_Calming     56
## 12      Roundabout     16

I used my new “counts” dataframe to create a horizontal barplot of this information, grouping each bar by landmark type, and by year. Through this visualization I am able to have a better idea of what kind of traffic landmarks are most involved in traffic accidents throughout the US, an incite that could be helpful in determining the causes of crashes.

Tab 2

For my next visualization, I also wanted to see during what time of the year traffic accidents were more common through the years in the dataset. To do this I created a dataframe containing accident counts grouped my year, and also by each month of the year. My original prediction for this visualization was that there would be an increase in traffic accidents during the winter months, as roads are more prone to freezing over during this time, as well as there being more hazardous driving weather conditions over the course of these months. While my findings were not entirely significant, there is the presence of a slight curve in the number of traffic accidents over these years, with the dip being during the summer months of 2017, 2018, 2019, and a drastic dip in 2020.

More Visualizations

Tab 1

After my first couple visualizations, I wanted to experiment with location and visualize it against the “Severity” column of my dataset to see which areas have the most severe traffic accidents. While experimenting, I realized that grouping by the 49 individual states within the dataset would create a large amount of groups, making it difficult for visualization purposes. To combat this, I created a new column within the dataframe narrowing each state into a specific region.

##   trafficdf$region      n
## 1             West 452514
## 2       South_East 320029
## 3     Mid_Atlantic 195909
## 4         Mid_West 171145
## 5       South_West 165413
## 6       North_West 118180
## 7             None  61413
## 8       North_East  31461

As there seemed to be a relatively equally distributed spread of traffic accident records between all the regions, this made for a good grouping for visualization. Below is the Severity of Traffic Accidents grouped by each Region in the US. The important thing to look at from these pie charts is the pink slice in each of the pies. This slice indicates an accident of the most significant severity, and can show the trend of which regions tend to have the worst accidents. Which in this case was found to be: Other, the Mid-Atlantic, and the North-East for our top three.

plot3 <- ggplot(data = region_severity_df, aes(x="", y=n, fill=severity)) +
  geom_bar(stat="identity", position = "fill") +
  coord_polar(theta="y", start=0) +
  labs(fill = "Severity Level", x = NULL, y = NULL, title = "Accident Severity by Region") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  facet_wrap(~myRegion, ncol = 3, nrow = 3) +
  scale_fill_brewer(palette = "Set2") +
  geom_text(aes(x=1.8, label=ifelse(percent_of_total>5.2, paste0(severity, ", ", percent_of_total,"%"), "")),
            size=2.25,
            position = position_fill(vjust = 0.4))
plot3

Tab 2

To go along with the previous visualization grouped by region, I created a donut plot using plotly showing five different donuts of the five years represented in my dataset (2016-2020). This plot is meant to be able to show a lot of general information about the data, splitting the total traffic accidents for each year into the percentages of the regions in which they occurred. By hovering over a section of each donut, you are able to see the corresponding year, region, accident count, and percentage of that year’s accidents.

plot4 <- plot_ly(hole=0.7) %>%
  layout(title="Traffic Accidents (2016-2020)") %>%
  add_trace(data = region_df[region_df$year==2020,],
            labels=~myRegion,
            values=~region_df[region_df$year==2020,"n"],
            type="pie",
            textposition="inside",
            hovertemplate="Year: 2020<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>") %>%
  add_trace(data = region_df[region_df$year==2019,],
            labels=~myRegion,
            values=~region_df[region_df$year==2019,"n"],
            type="pie",
            textposition="inside",
            hovertemplate="Year: 2019<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.16,0.84),
              y = c(0.16,0.84))) %>%
  add_trace(data = region_df[region_df$year==2018,],
            labels=~myRegion,
            values=~region_df[region_df$year==2018,"n"],
            type="pie",
            textposition="inside",
            hovertemplate="Year: 2018<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.27,0.73),
              y = c(0.27,0.73))) %>%
  add_trace(data = region_df[region_df$year==2017,],
            labels=~myRegion,
            values=~region_df[region_df$year==2017,"n"],
            type="pie",
            textposition="inside",
            hovertemplate="Year: 2017<br>Region:%{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.35,0.65),
              y = c(0.35,0.65))) %>%
  add_trace(data = region_df[region_df$year==2016,],
            labels=~myRegion,
            values=~region_df[region_df$year==2016,"n"],
            type="pie",
            textposition="inside",
            hovertemplate="Year: 2016<br>Region: %{label}<br>Percent: %{percent}<br>Accident Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.40,0.60),
              y = c(0.40,0.60)))
plot4

Heat Map

My last two visualizations both display a Heat Map depicting occurrences of traffic accidents grouped by day of the week and by year.

Tab 1 contains the Heat Map for the number of traffic accidents by day and year with the darkest colors being in 2020 and are mainly on weekdays.

Tab 2 contains a similar Heat Map, but rather than displaying the count of traffic accidents for each box, it displays the percentages of the total occurrences on each day of the week, by each year. Taking into account each year, Tuesday had the highest average percentage of the total traffic accidents over all other days of the week, over the five year period at 17.8%. Wednesday was a close second at 17.68% averaged over the five years.

Tab 1

plot5 <- ggplot(days_df, aes(x = year, y = daysoftheweek, fill=n)) +
  geom_tile(color="black") +
  geom_text(aes(label=comma(n))) + 
  labs(title = "HeatMap: Accidents by Day of the Week and Year (#)",
       x = "Year",
       y = "Days",
       fill = "Accident Count") +
  theme_igray() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_discrete(limits = rev(levels(days_df$daysoftheweek))) +
  scale_fill_continuous(low="white", high = "lightblue", breaks = c(seq(0, max(days_df$n), by = 20000))) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
plot5

Tab 2

plot6 <- ggplot(days_df, aes(x = year, y = daysoftheweek, fill=percent_of_total)) +
  geom_tile(color="black") +
  geom_text(aes(label=paste0(percent_of_total, "%"))) + 
  labs(title = "HeatMap: Accidents by Day of the Week and Year (%)",
       x = "Year",
       y = "Days",
       fill = "Accident Percentage") +
  theme_igray() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_discrete(limits = rev(levels(days_df$daysoftheweek))) +
  scale_fill_continuous(low="white", high = "lightgreen", breaks = c(seq(0, max(days_df$percent_of_total), by = 2.5))) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
plot6