About Traffic Crash Data

For the forecasting study, I have selected Traffic Crash Reports (Cincinnati Police Department-CPD) from Cincinnati Open Data Portal containing daily traffic incidents which includes the date when the accident occurred, place where the accident occurred, the type of crash, weather on that day, and so on.

Source : https://data.cincinnati-oh.gov/Safety/Traffic-Crash-Reports-CPD-/rvmt-pkmq

We have filtered the crash data for 2020-Present and can predict future crash data for Cincinnati for the next 3 months / 6months / yearly and so on.

Why this data set?

Traffic crash data is useful to support the development and assessment of road safety plans by identifying possible risk factors and locating hazardous regions and reduce the number of accidents occurrences. I am new to forecasting and modeling, and I found it really interesting while working on it in the class and wanted to continue working on it. Hence, I selected this data set.

I am excited to work on it!

Variation in our variables

Our response variable would be number of crashes (can be per day/month/year).I believe that the variation in the response variable can be explained by age, day, weather, road conditions, manner of crash, light conditions, road surface.

This might be a tricky data set to forecast since a lot of columns contains categorical data and not numerical data.

Installing necessary packages

#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("janitor")
#packageVersion("dplyr")
#install.packages("dplyr")
#install.packages("readr")
#install.packages("magrittr") # package installations are only needed the first time you use it
#install.packages("dplyr")    # alternative installation of the %>%

Importing the libraries that we will use throughout this project

library(readr) # Reading csvs
library(ggplot2) # Plotting
library(knitr) # Kable
library(broom) # For tidying model results
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)    # alternatively, this also loads %>%

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(gtsummary) # for creating summary tables

## #Uighur

library(dplyr) # Data wrangling
library(janitor) # Cleaning variable names

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(lubridate) # Working with dates

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Reading the data

cpd_file_path = "C:/Users/Goels/Desktop/UC/A - SEMESTER 2/Forecasting Methods - Antony - Thursday/traffic crash/Traffic_Crash_Reports__CPD_.csv"

cpd_data <- read.csv(cpd_file_path, stringsAsFactors = FALSE)

Exploring the data set

Overview of the data :

head(cpd_data)

##         ï..ADDRESS_X LATITUDE_X LONGITUDE_X AGE COMMUNITY_COUNCIL_NEIGHBORHOOD
## 1   XX W MITCHELL AV   39.16111  -84.506982  37                        CLIFTON
## 2      5XX W KING DR  39.139361  -84.525669  18                        CLIFTON
## 3      5XX W KING DR  39.139401  -84.526069  NA                        CLIFTON
## 4      1XX E KING DR  39.135456  -84.505356  28                     CORRYVILLE
## 5      1XX E KING DR  39.135816  -84.506256  22                     CORRYVILLE
## 6 34XX MONTGOMERY RD  39.142066  -84.472325  34                       EVANSTON
##   CPD_NEIGHBORHOOD              CRASHDATE CRASHLOCATION
## 1          CLIFTON 01/16/2022 03:45:00 AM              
## 2          CLIFTON 01/16/2022 05:00:00 AM              
## 3          CLIFTON 01/16/2022 05:00:00 AM              
## 4       CORRYVILLE 01/16/2022 03:36:00 AM              
## 5       CORRYVILLE 01/16/2022 03:36:00 AM              
## 6         EVANSTON 01/16/2022 03:20:02 AM              
##              CRASHSEVERITY CRASHSEVERITYID      DATECRASHREPORTED DAYOFWEEK
## 1 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 04:30:00 AM       SUN
## 2 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 04:13:45 AM       SUN
## 3 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 04:13:45 AM       SUN
## 4 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 03:49:00 AM       SUN
## 5 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 03:49:00 AM       SUN
## 6      4 - INJURY POSSIBLE          201904 01/16/2022 03:20:02 AM       SUN
##       GENDER                INJURIES                           INSTANCEID
## 1   M - MALE 5 - NO APPARENTY INJURY BEFCF82D-D869-45F7-B5F1-D0314ADA0A6C
## 2 F - FEMALE 5 - NO APPARENTY INJURY 712FD408-A8B9-411C-A4B9-86B63DBA1F72
## 3            5 - NO APPARENTY INJURY 712FD408-A8B9-411C-A4B9-86B63DBA1F72
## 4 F - FEMALE 5 - NO APPARENTY INJURY 816379F5-112B-4A4E-85D8-3C7B805BFD1A
## 5 F - FEMALE 5 - NO APPARENTY INJURY 816379F5-112B-4A4E-85D8-3C7B805BFD1A
## 6 F - FEMALE     4 - POSSIBLE INJURY 43C13A4F-9180-4E04-99D1-1F3966DA2FFB
##       LIGHTCONDITIONSPRIMARY LOCALREPORTNO
## 1               1 - DAYLIGHT     225000640
## 2 3 - DARK - LIGHTED ROADWAY     225000639
## 3 3 - DARK - LIGHTED ROADWAY     225000639
## 4               1 - DAYLIGHT     225000638
## 5               1 - DAYLIGHT     225000638
## 6               1 - DAYLIGHT     225000650
##                                               MANNEROFCRASH
## 1                             7 - SIDESWIPE, SAME DIRECTION
## 2                             7 - SIDESWIPE, SAME DIRECTION
## 3                             7 - SIDESWIPE, SAME DIRECTION
## 4                                              2 - REAR-END
## 5                                              2 - REAR-END
## 6 1 - NOT COLLISION BETWEEN TWO MOTOR VEHICLES IN TRANSPORT
##   ROADCONDITIONSPRIMARY        ROADCONTOUR                       ROADSURFACE
## 1              01 - DRY 1 - STRAIGHT LEVEL 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 2              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 3              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 4              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 5              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 6              01 - DRY 1 - STRAIGHT LEVEL                      1 - CONCRETE
##   SNA_NEIGHBORHOOD TYPEOFPERSON    WEATHER   ZIP UNITTYPE
## 1               NA   D - DRIVER  1 - CLEAR 45217       NA
## 2               NA   D - DRIVER  1 - CLEAR 45220       NA
## 3               NA   D - DRIVER  1 - CLEAR 45220       NA
## 4               NA   D - DRIVER 2 - CLOUDY 45219       NA
## 5               NA   D - DRIVER 2 - CLOUDY 45219       NA
## 6               NA   D - DRIVER  1 - CLEAR 45207       NA

Dimensions of our data set :

dim(cpd_data)

## [1] 291852     26

Features :

colnames(cpd_data)

##  [1] "ï..ADDRESS_X"                   "LATITUDE_X"                    
##  [3] "LONGITUDE_X"                    "AGE"                           
##  [5] "COMMUNITY_COUNCIL_NEIGHBORHOOD" "CPD_NEIGHBORHOOD"              
##  [7] "CRASHDATE"                      "CRASHLOCATION"                 
##  [9] "CRASHSEVERITY"                  "CRASHSEVERITYID"               
## [11] "DATECRASHREPORTED"              "DAYOFWEEK"                     
## [13] "GENDER"                         "INJURIES"                      
## [15] "INSTANCEID"                     "LIGHTCONDITIONSPRIMARY"        
## [17] "LOCALREPORTNO"                  "MANNEROFCRASH"                 
## [19] "ROADCONDITIONSPRIMARY"          "ROADCONTOUR"                   
## [21] "ROADSURFACE"                    "SNA_NEIGHBORHOOD"              
## [23] "TYPEOFPERSON"                   "WEATHER"                       
## [25] "ZIP"                            "UNITTYPE"

Number of NA’s in our data set:

colSums(is.na(cpd_data))

##                   ï..ADDRESS_X                     LATITUDE_X 
##                              0                              0 
##                    LONGITUDE_X                            AGE 
##                              0                          35475 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               CPD_NEIGHBORHOOD 
##                              0                              0 
##                      CRASHDATE                  CRASHLOCATION 
##                              0                              0 
##                  CRASHSEVERITY                CRASHSEVERITYID 
##                              0                             10 
##              DATECRASHREPORTED                      DAYOFWEEK 
##                              0                              0 
##                         GENDER                       INJURIES 
##                              0                              0 
##                     INSTANCEID         LIGHTCONDITIONSPRIMARY 
##                              0                              0 
##                  LOCALREPORTNO                  MANNEROFCRASH 
##                              0                              0 
##          ROADCONDITIONSPRIMARY                    ROADCONTOUR 
##                              0                              0 
##                    ROADSURFACE               SNA_NEIGHBORHOOD 
##                              0                         291852 
##                   TYPEOFPERSON                        WEATHER 
##                              0                              0 
##                            ZIP                       UNITTYPE 
##                              0                         291852

Since ‘UNITTYPE’ and ‘SNA_NEIGHBOURHOOD’ is empty (as number of NA’s = total number of rows in our data set), it is better that we drop the 2 columns since they are not at all useful.

drop <- c("UNITTYPE","SNA_NEIGHBORHOOD")
cpd_data = cpd_data[,!(names(cpd_data) %in% drop)]

Structure of the data set :

str(cpd_data)

## 'data.frame':    291852 obs. of  24 variables:
##  $ ï..ADDRESS_X                  : chr  "XX W MITCHELL AV" "5XX W KING DR" "5XX W KING DR" "1XX E KING DR" ...
##  $ LATITUDE_X                    : chr  "39.16111" "39.139361" "39.139401" "39.135456" ...
##  $ LONGITUDE_X                   : chr  "-84.506982" "-84.525669" "-84.526069" "-84.505356" ...
##  $ AGE                           : int  37 18 NA 28 22 34 NA 49 27 22 ...
##  $ COMMUNITY_COUNCIL_NEIGHBORHOOD: chr  "CLIFTON" "CLIFTON" "CLIFTON" "CORRYVILLE" ...
##  $ CPD_NEIGHBORHOOD              : chr  "CLIFTON" "CLIFTON" "CLIFTON" "CORRYVILLE" ...
##  $ CRASHDATE                     : chr  "01/16/2022 03:45:00 AM" "01/16/2022 05:00:00 AM" "01/16/2022 05:00:00 AM" "01/16/2022 03:36:00 AM" ...
##  $ CRASHLOCATION                 : chr  "" "" "" "" ...
##  $ CRASHSEVERITY                 : chr  "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" ...
##  $ CRASHSEVERITYID               : int  201905 201905 201905 201905 201905 201904 201905 201905 201905 201905 ...
##  $ DATECRASHREPORTED             : chr  "01/16/2022 04:30:00 AM" "01/16/2022 04:13:45 AM" "01/16/2022 04:13:45 AM" "01/16/2022 03:49:00 AM" ...
##  $ DAYOFWEEK                     : chr  "SUN" "SUN" "SUN" "SUN" ...
##  $ GENDER                        : chr  "M - MALE" "F - FEMALE" "" "F - FEMALE" ...
##  $ INJURIES                      : chr  "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" ...
##  $ INSTANCEID                    : chr  "BEFCF82D-D869-45F7-B5F1-D0314ADA0A6C" "712FD408-A8B9-411C-A4B9-86B63DBA1F72" "712FD408-A8B9-411C-A4B9-86B63DBA1F72" "816379F5-112B-4A4E-85D8-3C7B805BFD1A" ...
##  $ LIGHTCONDITIONSPRIMARY        : chr  "1 - DAYLIGHT" "3 - DARK - LIGHTED ROADWAY" "3 - DARK - LIGHTED ROADWAY" "1 - DAYLIGHT" ...
##  $ LOCALREPORTNO                 : num  2.25e+08 2.25e+08 2.25e+08 2.25e+08 2.25e+08 ...
##  $ MANNEROFCRASH                 : chr  "7 - SIDESWIPE, SAME DIRECTION" "7 - SIDESWIPE, SAME DIRECTION" "7 - SIDESWIPE, SAME DIRECTION" "2 - REAR-END" ...
##  $ ROADCONDITIONSPRIMARY         : chr  "01 - DRY" "01 - DRY" "01 - DRY" "01 - DRY" ...
##  $ ROADCONTOUR                   : chr  "1 - STRAIGHT LEVEL" "2 - STRAIGHT GRADE" "2 - STRAIGHT GRADE" "2 - STRAIGHT GRADE" ...
##  $ ROADSURFACE                   : chr  "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" ...
##  $ TYPEOFPERSON                  : chr  "D - DRIVER" "D - DRIVER" "D - DRIVER" "D - DRIVER" ...
##  $ WEATHER                       : chr  "1 - CLEAR" "1 - CLEAR" "1 - CLEAR" "2 - CLOUDY" ...
##  $ ZIP                           : chr  "45217" "45220" "45220" "45219" ...

Summary statistics

summary(cpd_data)

##  ï..ADDRESS_X        LATITUDE_X        LONGITUDE_X             AGE        
##  Length:291852      Length:291852      Length:291852      Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 34.00  
##                                                           Mean   : 37.51  
##                                                           3rd Qu.: 49.00  
##                                                           Max.   :123.00  
##                                                           NA's   :35475   
##  COMMUNITY_COUNCIL_NEIGHBORHOOD CPD_NEIGHBORHOOD    CRASHDATE        
##  Length:291852                  Length:291852      Length:291852     
##  Class :character               Class :character   Class :character  
##  Mode  :character               Mode  :character   Mode  :character  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  CRASHLOCATION      CRASHSEVERITY      CRASHSEVERITYID  DATECRASHREPORTED 
##  Length:291852      Length:291852      Min.   :     1   Length:291852     
##  Class :character   Class :character   1st Qu.:     3   Class :character  
##  Mode  :character   Mode  :character   Median :     3   Mode  :character  
##                                        Mean   : 67742                     
##                                        3rd Qu.:201904                     
##                                        Max.   :201905                     
##                                        NA's   :10                         
##   DAYOFWEEK            GENDER            INJURIES          INSTANCEID       
##  Length:291852      Length:291852      Length:291852      Length:291852     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  LIGHTCONDITIONSPRIMARY LOCALREPORTNO       MANNEROFCRASH     
##  Length:291852          Min.   :1.400e+01   Length:291852     
##  Class :character       1st Qu.:1.550e+08   Class :character  
##  Mode  :character       Median :1.750e+08   Mode  :character  
##                         Mean   :1.794e+08                     
##                         3rd Qu.:1.950e+08                     
##                         Max.   :2.201e+11                     
##                                                               
##  ROADCONDITIONSPRIMARY ROADCONTOUR        ROADSURFACE        TYPEOFPERSON      
##  Length:291852         Length:291852      Length:291852      Length:291852     
##  Class :character      Class :character   Class :character   Class :character  
##  Mode  :character      Mode  :character   Mode  :character   Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##    WEATHER              ZIP           
##  Length:291852      Length:291852     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Key inferences : 1) Age : Age cannot be 0 (min) or 123(max), which means that data has discrepancy and we will have to see if we want to keep the data or remove the outliers. Also, we observe that on an average the age of the person driving was 37 years. 2) crashdate : This field is crucial to our modeling and in order to analyse our data, we need to extract the date from the crashdate since the format of the feature is “yy-mm-dd 00:00:00”

cpd_data = read_csv(cpd_file_path) %>%
  clean_names() %>%
  mutate(crash_datetime = mdy_hms(crashdate),
         crash_date = as.Date(crash_datetime)) %>%
  dplyr::select(-crashdate)

## Rows: 291852 Columns: 26

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (18): ADDRESS_X, COMMUNITY_COUNCIL_NEIGHBORHOOD, CPD_NEIGHBORHOOD, CRASH...
## dbl  (6): LATITUDE_X, LONGITUDE_X, AGE, CRASHSEVERITYID, LOCALREPORTNO, ZIP
## lgl  (2): SNA_NEIGHBORHOOD, UNITTYPE

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(cpd_data)

## # A tibble: 6 x 27
##   address_x    latitude_x longitude_x   age community_council_~ cpd_neighborhood
##   <chr>             <dbl>       <dbl> <dbl> <chr>               <chr>           
## 1 XX W MITCHE~       39.2       -84.5    37 CLIFTON             CLIFTON         
## 2 5XX W KING ~       39.1       -84.5    18 CLIFTON             CLIFTON         
## 3 5XX W KING ~       39.1       -84.5    NA CLIFTON             CLIFTON         
## 4 1XX E KING ~       39.1       -84.5    28 CORRYVILLE          CORRYVILLE      
## 5 1XX E KING ~       39.1       -84.5    22 CORRYVILLE          CORRYVILLE      
## 6 34XX MONTGO~       39.1       -84.5    34 EVANSTON            EVANSTON        
## # ... with 21 more variables: crashlocation <chr>, crashseverity <chr>,
## #   crashseverityid <dbl>, datecrashreported <chr>, dayofweek <chr>,
## #   gender <chr>, injuries <chr>, instanceid <chr>,
## #   lightconditionsprimary <chr>, localreportno <dbl>, mannerofcrash <chr>,
## #   roadconditionsprimary <chr>, roadcontour <chr>, roadsurface <chr>,
## #   sna_neighborhood <lgl>, typeofperson <chr>, weather <chr>, zip <dbl>,
## #   unittype <lgl>, crash_datetime <dttm>, crash_date <date>

Studying the features in detail

Printing the unique values contained in Injuries column

unique(cpd_data$injuries)

##  [1] "5 - NO APPARENTY INJURY"       "4 - POSSIBLE INJURY"          
##  [3] NA                              "3 - SUSPECTED MINOR INJURY"   
##  [5] "2 - SUSPECTED SERIOUS INJURY"  "1 - FATAL"                    
##  [7] "1 - NO INJURY / NONE REPORTED" "3 - NON-INCAPACITATING"       
##  [9] "2 - POSSIBLE"                  "5 - FATAL"                    
## [11] "4 - INCAPACITATING"

Since there is overlap in numbers given to the feature injuries and can be clubbed, we clean the data set by renaming the columns

cpd_data$injuries[cpd_data$injuries=="5 - NO APPARENTY INJURY"]<- "FATAL"
cpd_data$injuries[cpd_data$injuries=="4 - POSSIBLE INJURY"]<- "POSSIBLE INJURY"
cpd_data$injuries[cpd_data$injuries=="3 - SUSPECTED MINOR INJURY"]<- "SUSPECTED MINOR INJURY"
cpd_data$injuries[cpd_data$injuries=="2 - SUSPECTED SERIOUS INJURY"]<- "SUSPECTED SERIOUS INJURY"
cpd_data$injuries[cpd_data$injuries=="1 - FATAL"] <- "FATAL"
cpd_data$injuries[cpd_data$injuries== "1 - NO INJURY / NONE REPORTED"] <- "NO INJURY / NONE REPORTED"
cpd_data$injuries[cpd_data$injuries== "3 - NON-INCAPACITATING"] <- "SUSPECTED MINOR INJURY"
cpd_data$injuries[cpd_data$injuries== "2 - POSSIBLE"] <- "SUSPECTED SERIOUS INJURY"
cpd_data$injuries[cpd_data$injuries== "5 - FATAL"] <- "FATAL"
cpd_data$injuries[cpd_data$injuries== "4 - INCAPACITATING"] <- "POSSIBLE INJURY"
cpd_data$injuries[cpd_data$injuries== " "] <- "Data Unavailable"

Similary, we do data cleaning for other features :

cpd_data$gender[cpd_data$gender=="F - FEMALE"]<- "Female"
cpd_data$gender[cpd_data$gender=="M - MALE"]<- "Male"
cpd_data$gender[cpd_data$gender=="FEMALE"]<- "Female"
cpd_data$gender[cpd_data$gender=="MALE"]<- "Male"
cpd_data$gender[cpd_data$gender=="U - UNKNOWN"] <- ""
cpd_data$gender[cpd_data$gender== 'NA'] <- ""

unique(cpd_data$gender)

## [1] "Male"   "Female" NA       ""

unique(cpd_data$weather)

##  [1] "1 - CLEAR"                            
##  [2] "2 - CLOUDY"                           
##  [3] "6 - SNOW"                             
##  [4] "99 - OTHER/UNKNOWN"                   
##  [5] "5 - SLEET, HAIL"                      
##  [6] "9 - FREEZING RAIN OR FREEZING DRIZZLE"
##  [7] "4 - RAIN"                             
##  [8] NA                                     
##  [9] "8 - BLOWING SAND, SOIL, DIRT, SNOW"   
## [10] "3 - FOG, SMOG, SMOKE"                 
## [11] "7 - SEVERE CROSSWINDS"                
## [12] "9 - OTHER/UNKNOWN"                    
## [13] "5 - SLEET,HAIL"

cpd_data$weather[cpd_data$weather=="1 - CLEAR"]<- "CLEAR"
cpd_data$weather[cpd_data$weather=="2 - CLOUDY"]<- "CLOUDY"
cpd_data$weather[cpd_data$weather=="6 - SNOW"]<- "SNOW"
cpd_data$weather[cpd_data$weather=="99 - OTHER/UNKNOWN"]<- "UNKNOWN"
cpd_data$weather[cpd_data$weather=="5 - SLEET, HAIL"] <- "SLEET, HAIL"
cpd_data$weather[cpd_data$weather== "9 - FREEZING RAIN OR FREEZING DRIZZLE"] <- "FREEZING RAIN OR FREEZING DRIZZLE"
cpd_data$weather[cpd_data$weather== "4 - RAIN"] <- "RAIN"
cpd_data$weather[cpd_data$weather== "8 - BLOWING SAND, SOIL, DIRT, SNOW"] <- "BLOWING SAND, SOIL, DIRT, SNOW"
cpd_data$weather[cpd_data$weather== "3 - FOG, SMOG, SMOKE"] <- "FOG, SMOG, SMOKE"
cpd_data$weather[cpd_data$weather== "7 - SEVERE CROSSWINDS"] <- "SEVERE CROSSWINDS"
cpd_data$weather[cpd_data$weather== "9 - OTHER/UNKNOWN"] <- "UNKNOWN"
cpd_data$weather[cpd_data$weather== "5 - SLEET,HAIL"] <- "SLEET, HAIL"

Daily Crash Data (from 1st January 2020 - Present)

daily_crashes = cpd_data %>%
  distinct(instanceid,.keep_all=TRUE) %>% # Want only one row per traffic instance
  count(crash_date,name='num_incidents') %>% # Count incidents by day
  filter(crash_date>=ymd('2020-01-01')) # Filter to post-2020

daily_crashes_plot = ggplot(daily_crashes)+
  geom_line(aes(crash_date,num_incidents))+
  theme_bw()+
  xlab("Date")+
  ylab("Number of Incidents")+
  labs(
    title = 'Number of Daily Traffic Incidents, Cincinnati Police Department',
    subtitle = 'January 2020 - Present'
  )

daily_crashes_plot

Plotting a linear trend on top of the daily crash data

daily_crashes_plot + 
  geom_smooth(aes(crash_date,num_incidents),method='lm',color='red')

## `geom_smooth()` using formula 'y ~ x'

We can observe that there is a slightly positive trend over time.

Visualizing Daily Crash Data - DAY WISE (from 1st January 2020 - Present)

cpd_data %>%
  distinct(instanceid,.keep_all=TRUE) %>% # One row per traffic incident
  count(crash_date,dayofweek,name='daily_incidents') %>% # Count Number of Incidents Per Day
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(dayofweek) %>% # Group data by day of week
  summarize(avg_daily_incidents = mean(daily_incidents)) %>% # Take the Mean Incidents by day of week
  ungroup() %>%
  mutate(dayofweek=factor(
    dayofweek,
    labels = c("MON",'TUE','WED','THU','FRI','SAT','SUN'),
    levels = c("MON",'TUE','WED','THU','FRI','SAT','SUN')
  )) %>% # Set day of week as factor so the plot orders the bars properly
  filter(!is.na(dayofweek)) %>% # Filter missing day of week values
  ggplot()+
  geom_col(aes(dayofweek,avg_daily_incidents))+
  theme_bw()+
  xlab("Day of Week")+
  ylab("Average Daily Traffic Incidents")+
  scale_fill_discrete(name = 'Day of Week')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Day of week',
    subtitle = 'January 2020 - Present')

Visualizing Daily Crash Data - GENDER WISE (from 1st January 2020 - Present)

cpd_gender <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, gender, name="incidents_by_gender")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(gender)%>%
  summarize(avg_daily_incidents = mean(incidents_by_gender))%>%
  ungroup()%>%
  filter(!is.na(gender))%>%
  filter(gender != "U")


ggplot(data = cpd_gender) +
  geom_col(mapping = aes(x=gender, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Gender")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'gender')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Gender',
    subtitle = 'January 2020 - Present'
  )

Visualizing Daily Crash Data - AGE WISE (from 1st January 2020 - Present)

cpd_age <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, age, name="incidents_by_age")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(age)%>%
  summarize(avg_daily_incidents = mean(incidents_by_age))%>%
  ungroup()%>%
  filter(!is.na(age))%>%
  filter(age != "U")


ggplot(data = cpd_age) +
  geom_col(mapping = aes(x=age, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Age")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'age')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Age',
    subtitle = 'January 2020 - Present'
  )

Visualizing Daily Crash Data - WEATHER WISE (from 1st January 2020 - Present)

cpd_weather <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, weather, name="incidents_by_weather")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(weather)%>%
  summarize(avg_daily_incidents = mean(incidents_by_weather))%>%
  ungroup()%>%
  filter(!is.na(weather))


ggplot(data = cpd_weather) +
  geom_col(mapping = aes(x=weather, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Weather")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'weather')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Weather',
    subtitle = 'January 2020 - Present'
  )

Visualizing Daily Crash Data on the basis of TYPE OF PERSON (from 1st January 2020 - Present)

cpd_typeofperson <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, typeofperson, name="incidents_by_typeofperson")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(typeofperson)%>%
  summarize(avg_daily_incidents = mean(incidents_by_typeofperson))%>%
  ungroup()%>%
  filter(!is.na(typeofperson))


ggplot(data = cpd_typeofperson) +
  geom_col(mapping = aes(x=typeofperson, y = avg_daily_incidents))+
  theme_bw()+
  xlab("typeofperson")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'typeofperson')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Type of Person',
    subtitle = 'January 2020 - Present'
  )

Visualizing Daily Crash Data - CRASH SEVERITY WISE (from 1st January 2020 - Present)

cpd_crashseverity <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, crashseverity, name="incidents_by_crashseverity")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(crashseverity)%>%
  summarize(avg_daily_incidents = mean(incidents_by_crashseverity))%>%
  ungroup()%>%
  filter(!is.na(crashseverity))


ggplot(data = cpd_crashseverity) +
  geom_col(mapping = aes(x=crashseverity, y = avg_daily_incidents))+
  theme_bw()+
  xlab("crashseverity")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'crashseverity')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Crash Severity',
    subtitle = 'January 2020 - Present'
  )

creating summary statistics table for features (Jan 2020-Present)

stats1 <- cpd_data %>% select(weather, age, injuries, typeofperson, mannerofcrash, roadcontour, dayofweek, lightconditionsprimary, roadconditionsprimary,crashlocation, crash_date)%>%
  filter(crash_date>=ymd('2020-01-01'))

table1 <- tbl_summary(stats1,
                      statistic = list(all_continuous()~"{mean} {sd} "))
table1

Characteristic	N = 62,966¹
weather
BLOWING SAND, SOIL, DIRT, SNOW	3 (<0.1%)
CLEAR	42,396 (67%)
CLOUDY	9,669 (15%)
FOG, SMOG, SMOKE	50 (<0.1%)
FREEZING RAIN OR FREEZING DRIZZLE	43 (<0.1%)
RAIN	8,917 (14%)
SEVERE CROSSWINDS	6 (<0.1%)
SLEET, HAIL	85 (0.1%)
SNOW	1,258 (2.0%)
UNKNOWN	534 (0.8%)
Unknown	5
age	37 16
Unknown	9,331
injuries
FATAL	52,431 (83%)
NO INJURY / NONE REPORTED	2 (<0.1%)
POSSIBLE INJURY	4,956 (7.9%)
SUSPECTED MINOR INJURY	4,995 (7.9%)
SUSPECTED SERIOUS INJURY	519 (0.8%)
Unknown	63
typeofperson
D - DRIVER	56,157 (89%)
O - OCCUPANT	6,107 (9.7%)
P - PEDESTRIAN	641 (1.0%)
Unknown	61
mannerofcrash
1 - NOT COLLISION BETWEEN TWO MOTOR VEHICLES IN TRANSPORT	9,926 (16%)
2 - REAR-END	16,051 (25%)
3 - HEAD-ON	1,876 (3.0%)
4 - REAR-TO-REAR	287 (0.5%)
5 - BACKING	1,913 (3.0%)
6 - ANGLE	19,359 (31%)
7 - SIDESWIPE, SAME DIRECTION	10,929 (17%)
8 - SIDESWIPE, OPPOSITE DIRECTION	1,569 (2.5%)
9 - UNKNOWN	1,051 (1.7%)
Unknown	5
roadcontour
1 - STRAIGHT LEVEL	44,762 (71%)
2 - STRAIGHT GRADE	11,755 (19%)
3 - CURVE LEVEL	2,832 (4.5%)
4 - CURVE GRADE	3,570 (5.7%)
9 - UNKNOWN	42 (<0.1%)
Unknown	5
dayofweek
FRI	10,984 (17%)
MON	8,541 (14%)
SAT	8,578 (14%)
SUN	7,198 (11%)
THU	9,516 (15%)
TUE	9,012 (14%)
WED	9,137 (15%)
lightconditionsprimary
1 - DAYLIGHT	42,077 (67%)
2 - DAWN	1,191 (1.9%)
2 - DUSK	1,488 (2.4%)
3 - DARK - LIGHTED ROADWAY	16,362 (26%)
4 - DARK – ROADWAY NOT LIGHTED	971 (1.5%)
5 - DARK – UNKNOWN ROADWAY LIGHTING	402 (0.6%)
9 - OTHER	57 (<0.1%)
9 - UNKNOWN	413 (0.7%)
Unknown	5
roadconditionsprimary
01 - DRY	47,881 (76%)
02 - WET	13,134 (21%)
03 - SNOW	887 (1.4%)
04 - ICE	502 (0.8%)
05 - SAND, MUD, DIRT, OIL, GRAVEL	21 (<0.1%)
06 - WATER (STANDING, MOVING)	24 (<0.1%)
07 - SLUSH	71 (0.1%)
09 - OTHER	16 (<0.1%)
09 - UNKNOWN	425 (0.7%)
Unknown	5
crashlocation	0 (NA%)
Unknown	62,966
crash_date	2021-01-23 213.165845961698
¹ n (%); Mean SD

# Test whether there is a true linear trend over time
mod1 = lm(num_incidents~crash_date,data=daily_crashes)
summary(mod1)

## 
## Call:
## lm(formula = num_incidents ~ crash_date, data = daily_crashes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.756  -8.192  -0.959   7.277  86.798 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.352e+02  4.077e+01  -5.770 1.16e-08 ***
## crash_date   1.498e-02  2.187e-03   6.848 1.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.92 on 746 degrees of freedom
## Multiple R-squared:  0.05915,    Adjusted R-squared:  0.05788 
## F-statistic:  46.9 on 1 and 746 DF,  p-value: 1.565e-11

When we model to test whether there has been any significant impact of a specific year on number of incidents, we observe that p vlaue is <0.05. It means there has been impact of time series on the number of incidents.

plot(mod1)

Cincinnati Police Department

Saloni Goel