About Traffic Crash Data

For this forecasting study, I selected the Traffic Crash Reports (CPD) data set published by the Cincinnati Police Department on the Cincinnati Open Data Portal. It records traffic incidents and includes the date and place of each crash, the type of crash, the weather at the time, and so on.

Source : https://data.cincinnati-oh.gov/Safety/Traffic-Crash-Reports-CPD-/rvmt-pkmq

We filter the crash data to 2020 - Present and aim to forecast the number of crashes in Cincinnati over horizons such as the next 3 months, 6 months, or year.

Why this data set?

Traffic crash data supports the development and assessment of road safety plans: it helps identify possible risk factors, locate hazardous areas, and ultimately reduce the number of accidents. I am new to forecasting and modeling, found it really interesting while working on it in class, and wanted to continue working on it. Hence, I selected this data set.

I am excited to work on it!

Variation in our variables

Our response variable is the number of crashes (per day, month, or year). I believe that the variation in the response variable can be explained by age, day of the week, weather, road conditions, manner of crash, light conditions, and road surface.

This might be a tricky data set to forecast, since many of the columns contain categorical rather than numerical data.

Installing necessary packages

#install.packages("dplyr")     # also provides the %>% pipe
#install.packages("tidyverse")
#install.packages("janitor")
#install.packages("readr")
#install.packages("magrittr")  # package installations are only needed the first time you use it
#packageVersion("dplyr")       # check which version is installed

Importing the libraries that we will use throughout this project

library(readr) # Reading csvs
library(ggplot2) # Plotting
library(knitr) # Kable
library(broom) # For tidying model results
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)    # alternatively, this also loads %>%
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtsummary) # for creating summary tables
## #Uighur
library(dplyr) # Data wrangling
library(janitor) # Cleaning variable names
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate) # Working with dates
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Reading the data

cpd_file_path = "C:/Users/Goels/Desktop/UC/A - SEMESTER 2/Forecasting Methods - Antony - Thursday/traffic crash/Traffic_Crash_Reports__CPD_.csv"
cpd_data <- read.csv(cpd_file_path, stringsAsFactors = FALSE)

Exploring the data set

Overview of the data :

head(cpd_data)
##         ï..ADDRESS_X LATITUDE_X LONGITUDE_X AGE COMMUNITY_COUNCIL_NEIGHBORHOOD
## 1   XX W MITCHELL AV   39.16111  -84.506982  37                        CLIFTON
## 2      5XX W KING DR  39.139361  -84.525669  18                        CLIFTON
## 3      5XX W KING DR  39.139401  -84.526069  NA                        CLIFTON
## 4      1XX E KING DR  39.135456  -84.505356  28                     CORRYVILLE
## 5      1XX E KING DR  39.135816  -84.506256  22                     CORRYVILLE
## 6 34XX MONTGOMERY RD  39.142066  -84.472325  34                       EVANSTON
##   CPD_NEIGHBORHOOD              CRASHDATE CRASHLOCATION
## 1          CLIFTON 01/16/2022 03:45:00 AM              
## 2          CLIFTON 01/16/2022 05:00:00 AM              
## 3          CLIFTON 01/16/2022 05:00:00 AM              
## 4       CORRYVILLE 01/16/2022 03:36:00 AM              
## 5       CORRYVILLE 01/16/2022 03:36:00 AM              
## 6         EVANSTON 01/16/2022 03:20:02 AM              
##              CRASHSEVERITY CRASHSEVERITYID      DATECRASHREPORTED DAYOFWEEK
## 1 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 04:30:00 AM       SUN
## 2 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 04:13:45 AM       SUN
## 3 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 04:13:45 AM       SUN
## 4 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 03:49:00 AM       SUN
## 5 5 - PROPERTY DAMAGE ONLY          201905 01/16/2022 03:49:00 AM       SUN
## 6      4 - INJURY POSSIBLE          201904 01/16/2022 03:20:02 AM       SUN
##       GENDER                INJURIES                           INSTANCEID
## 1   M - MALE 5 - NO APPARENTY INJURY BEFCF82D-D869-45F7-B5F1-D0314ADA0A6C
## 2 F - FEMALE 5 - NO APPARENTY INJURY 712FD408-A8B9-411C-A4B9-86B63DBA1F72
## 3            5 - NO APPARENTY INJURY 712FD408-A8B9-411C-A4B9-86B63DBA1F72
## 4 F - FEMALE 5 - NO APPARENTY INJURY 816379F5-112B-4A4E-85D8-3C7B805BFD1A
## 5 F - FEMALE 5 - NO APPARENTY INJURY 816379F5-112B-4A4E-85D8-3C7B805BFD1A
## 6 F - FEMALE     4 - POSSIBLE INJURY 43C13A4F-9180-4E04-99D1-1F3966DA2FFB
##       LIGHTCONDITIONSPRIMARY LOCALREPORTNO
## 1               1 - DAYLIGHT     225000640
## 2 3 - DARK - LIGHTED ROADWAY     225000639
## 3 3 - DARK - LIGHTED ROADWAY     225000639
## 4               1 - DAYLIGHT     225000638
## 5               1 - DAYLIGHT     225000638
## 6               1 - DAYLIGHT     225000650
##                                               MANNEROFCRASH
## 1                             7 - SIDESWIPE, SAME DIRECTION
## 2                             7 - SIDESWIPE, SAME DIRECTION
## 3                             7 - SIDESWIPE, SAME DIRECTION
## 4                                              2 - REAR-END
## 5                                              2 - REAR-END
## 6 1 - NOT COLLISION BETWEEN TWO MOTOR VEHICLES IN TRANSPORT
##   ROADCONDITIONSPRIMARY        ROADCONTOUR                       ROADSURFACE
## 1              01 - DRY 1 - STRAIGHT LEVEL 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 2              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 3              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 4              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 5              01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 6              01 - DRY 1 - STRAIGHT LEVEL                      1 - CONCRETE
##   SNA_NEIGHBORHOOD TYPEOFPERSON    WEATHER   ZIP UNITTYPE
## 1               NA   D - DRIVER  1 - CLEAR 45217       NA
## 2               NA   D - DRIVER  1 - CLEAR 45220       NA
## 3               NA   D - DRIVER  1 - CLEAR 45220       NA
## 4               NA   D - DRIVER 2 - CLOUDY 45219       NA
## 5               NA   D - DRIVER 2 - CLOUDY 45219       NA
## 6               NA   D - DRIVER  1 - CLEAR 45207       NA

Dimensions of our data set :

dim(cpd_data)
## [1] 291852     26

Features :

colnames(cpd_data)
##  [1] "ï..ADDRESS_X"                   "LATITUDE_X"                    
##  [3] "LONGITUDE_X"                    "AGE"                           
##  [5] "COMMUNITY_COUNCIL_NEIGHBORHOOD" "CPD_NEIGHBORHOOD"              
##  [7] "CRASHDATE"                      "CRASHLOCATION"                 
##  [9] "CRASHSEVERITY"                  "CRASHSEVERITYID"               
## [11] "DATECRASHREPORTED"              "DAYOFWEEK"                     
## [13] "GENDER"                         "INJURIES"                      
## [15] "INSTANCEID"                     "LIGHTCONDITIONSPRIMARY"        
## [17] "LOCALREPORTNO"                  "MANNEROFCRASH"                 
## [19] "ROADCONDITIONSPRIMARY"          "ROADCONTOUR"                   
## [21] "ROADSURFACE"                    "SNA_NEIGHBORHOOD"              
## [23] "TYPEOFPERSON"                   "WEATHER"                       
## [25] "ZIP"                            "UNITTYPE"
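
The "ï.." prefix on the first column name is what a UTF-8 byte-order mark (BOM) at the start of a file usually looks like when read.csv is not told the encoding. The sketch below is a hedged illustration rather than the code used in this report: it writes a tiny temporary file (made-up contents standing in for the real CSV) and reads it back with fileEncoding = "UTF-8-BOM", which strips the mark and leaves a clean header.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The contents are illustrative, not taken from the real file.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("ADDRESS_X,AGE\nXX W MITCHELL AV,37\n")), tmp)

# Declaring the encoding lets R strip the BOM before parsing the header.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(demo)
```

The same argument could be passed when reading the crash file, assuming that file does begin with a BOM.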

Number of NA’s in our data set:

colSums(is.na(cpd_data))
##                   ï..ADDRESS_X                     LATITUDE_X 
##                              0                              0 
##                    LONGITUDE_X                            AGE 
##                              0                          35475 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               CPD_NEIGHBORHOOD 
##                              0                              0 
##                      CRASHDATE                  CRASHLOCATION 
##                              0                              0 
##                  CRASHSEVERITY                CRASHSEVERITYID 
##                              0                             10 
##              DATECRASHREPORTED                      DAYOFWEEK 
##                              0                              0 
##                         GENDER                       INJURIES 
##                              0                              0 
##                     INSTANCEID         LIGHTCONDITIONSPRIMARY 
##                              0                              0 
##                  LOCALREPORTNO                  MANNEROFCRASH 
##                              0                              0 
##          ROADCONDITIONSPRIMARY                    ROADCONTOUR 
##                              0                              0 
##                    ROADSURFACE               SNA_NEIGHBORHOOD 
##                              0                         291852 
##                   TYPEOFPERSON                        WEATHER 
##                              0                              0 
##                            ZIP                       UNITTYPE 
##                              0                         291852

Since ‘UNITTYPE’ and ‘SNA_NEIGHBORHOOD’ are entirely empty (the number of NA’s equals the total number of rows in our data set), we drop these 2 columns, as they carry no information.

drop <- c("UNITTYPE","SNA_NEIGHBORHOOD")
cpd_data = cpd_data[,!(names(cpd_data) %in% drop)]
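
Instead of naming the empty columns by hand, they can be detected from the NA counts. This is a minimal sketch on a toy data frame (the values and the object names toy, all_na, and toy_clean are illustrative assumptions), not a replacement for the step above:

```r
# Toy data frame standing in for cpd_data (values are illustrative).
toy <- data.frame(
  AGE = c(37, NA, 28),
  UNITTYPE = c(NA, NA, NA),
  SNA_NEIGHBORHOOD = c(NA, NA, NA),
  ZIP = c("45217", "45220", "45219"),
  stringsAsFactors = FALSE
)

# A column is entirely empty when its NA count equals the number of rows.
all_na <- colSums(is.na(toy)) == nrow(toy)

# Keep every column that has at least one non-missing value.
toy_clean <- toy[, !all_na, drop = FALSE]
names(toy_clean)
```

This keeps partially missing columns such as AGE and removes only the ones with no data at all.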

Structure of the data set :

str(cpd_data)
## 'data.frame':    291852 obs. of  24 variables:
##  $ ï..ADDRESS_X                  : chr  "XX W MITCHELL AV" "5XX W KING DR" "5XX W KING DR" "1XX E KING DR" ...
##  $ LATITUDE_X                    : chr  "39.16111" "39.139361" "39.139401" "39.135456" ...
##  $ LONGITUDE_X                   : chr  "-84.506982" "-84.525669" "-84.526069" "-84.505356" ...
##  $ AGE                           : int  37 18 NA 28 22 34 NA 49 27 22 ...
##  $ COMMUNITY_COUNCIL_NEIGHBORHOOD: chr  "CLIFTON" "CLIFTON" "CLIFTON" "CORRYVILLE" ...
##  $ CPD_NEIGHBORHOOD              : chr  "CLIFTON" "CLIFTON" "CLIFTON" "CORRYVILLE" ...
##  $ CRASHDATE                     : chr  "01/16/2022 03:45:00 AM" "01/16/2022 05:00:00 AM" "01/16/2022 05:00:00 AM" "01/16/2022 03:36:00 AM" ...
##  $ CRASHLOCATION                 : chr  "" "" "" "" ...
##  $ CRASHSEVERITY                 : chr  "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" ...
##  $ CRASHSEVERITYID               : int  201905 201905 201905 201905 201905 201904 201905 201905 201905 201905 ...
##  $ DATECRASHREPORTED             : chr  "01/16/2022 04:30:00 AM" "01/16/2022 04:13:45 AM" "01/16/2022 04:13:45 AM" "01/16/2022 03:49:00 AM" ...
##  $ DAYOFWEEK                     : chr  "SUN" "SUN" "SUN" "SUN" ...
##  $ GENDER                        : chr  "M - MALE" "F - FEMALE" "" "F - FEMALE" ...
##  $ INJURIES                      : chr  "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" ...
##  $ INSTANCEID                    : chr  "BEFCF82D-D869-45F7-B5F1-D0314ADA0A6C" "712FD408-A8B9-411C-A4B9-86B63DBA1F72" "712FD408-A8B9-411C-A4B9-86B63DBA1F72" "816379F5-112B-4A4E-85D8-3C7B805BFD1A" ...
##  $ LIGHTCONDITIONSPRIMARY        : chr  "1 - DAYLIGHT" "3 - DARK - LIGHTED ROADWAY" "3 - DARK - LIGHTED ROADWAY" "1 - DAYLIGHT" ...
##  $ LOCALREPORTNO                 : num  2.25e+08 2.25e+08 2.25e+08 2.25e+08 2.25e+08 ...
##  $ MANNEROFCRASH                 : chr  "7 - SIDESWIPE, SAME DIRECTION" "7 - SIDESWIPE, SAME DIRECTION" "7 - SIDESWIPE, SAME DIRECTION" "2 - REAR-END" ...
##  $ ROADCONDITIONSPRIMARY         : chr  "01 - DRY" "01 - DRY" "01 - DRY" "01 - DRY" ...
##  $ ROADCONTOUR                   : chr  "1 - STRAIGHT LEVEL" "2 - STRAIGHT GRADE" "2 - STRAIGHT GRADE" "2 - STRAIGHT GRADE" ...
##  $ ROADSURFACE                   : chr  "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" ...
##  $ TYPEOFPERSON                  : chr  "D - DRIVER" "D - DRIVER" "D - DRIVER" "D - DRIVER" ...
##  $ WEATHER                       : chr  "1 - CLEAR" "1 - CLEAR" "1 - CLEAR" "2 - CLOUDY" ...
##  $ ZIP                           : chr  "45217" "45220" "45220" "45219" ...

Summary statistics

summary(cpd_data)
##  ï..ADDRESS_X        LATITUDE_X        LONGITUDE_X             AGE        
##  Length:291852      Length:291852      Length:291852      Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 34.00  
##                                                           Mean   : 37.51  
##                                                           3rd Qu.: 49.00  
##                                                           Max.   :123.00  
##                                                           NA's   :35475   
##  COMMUNITY_COUNCIL_NEIGHBORHOOD CPD_NEIGHBORHOOD    CRASHDATE        
##  Length:291852                  Length:291852      Length:291852     
##  Class :character               Class :character   Class :character  
##  Mode  :character               Mode  :character   Mode  :character  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  CRASHLOCATION      CRASHSEVERITY      CRASHSEVERITYID  DATECRASHREPORTED 
##  Length:291852      Length:291852      Min.   :     1   Length:291852     
##  Class :character   Class :character   1st Qu.:     3   Class :character  
##  Mode  :character   Mode  :character   Median :     3   Mode  :character  
##                                        Mean   : 67742                     
##                                        3rd Qu.:201904                     
##                                        Max.   :201905                     
##                                        NA's   :10                         
##   DAYOFWEEK            GENDER            INJURIES          INSTANCEID       
##  Length:291852      Length:291852      Length:291852      Length:291852     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  LIGHTCONDITIONSPRIMARY LOCALREPORTNO       MANNEROFCRASH     
##  Length:291852          Min.   :1.400e+01   Length:291852     
##  Class :character       1st Qu.:1.550e+08   Class :character  
##  Mode  :character       Median :1.750e+08   Mode  :character  
##                         Mean   :1.794e+08                     
##                         3rd Qu.:1.950e+08                     
##                         Max.   :2.201e+11                     
##                                                               
##  ROADCONDITIONSPRIMARY ROADCONTOUR        ROADSURFACE        TYPEOFPERSON      
##  Length:291852         Length:291852      Length:291852      Length:291852     
##  Class :character      Class :character   Class :character   Class :character  
##  Mode  :character      Mode  :character   Mode  :character   Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##    WEATHER              ZIP           
##  Length:291852      Length:291852     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

Key inferences :

1) Age : A minimum of 0 and a maximum of 123 suggest data-entry errors or placeholder values (0 could be an infant passenger, but 123 is not a plausible age), so the data has discrepancies and we will have to decide whether to keep these records or remove them as outliers. We also observe that the mean age of the people involved is about 37.5 years (median 34). Note that AGE covers everyone recorded in a crash, not only drivers.

2) crashdate : This field is crucial to our modeling. It is stored as a character string in the format “mm/dd/yyyy hh:mm:ss AM/PM”, so in order to analyse our data we need to parse it and extract the date.
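
One way to handle the suspicious ages is to flag them and set them to missing instead of dropping whole rows. The sketch below uses made-up ages, and the bounds min_age and max_age are assumptions chosen for illustration; they do not come from the data dictionary.

```r
# Illustrative ages, including the suspicious extremes seen in summary().
age <- c(37, 18, NA, 0, 123, 49)

# Assumed plausibility bounds; these cut-offs are a judgment call.
min_age <- 1
max_age <- 100

# Flag values outside the bounds; ages that are already missing are not flagged.
implausible <- !is.na(age) & (age < min_age | age > max_age)

# Set implausible values to NA rather than deleting the crash records.
age_clean <- ifelse(implausible, NA, age)

sum(implausible)               # how many values were flagged
mean(age_clean, na.rm = TRUE)  # mean of the remaining ages
```

Keeping the rows matters here because the response variable is a count of crashes, and a bad age does not make the crash itself invalid.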

cpd_data = read_csv(cpd_file_path) %>%
  clean_names() %>%
  mutate(crash_datetime = mdy_hms(crashdate),
         crash_date = as.Date(crash_datetime)) %>%
  dplyr::select(-crashdate)
## Rows: 291852 Columns: 26
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (18): ADDRESS_X, COMMUNITY_COUNCIL_NEIGHBORHOOD, CPD_NEIGHBORHOOD, CRASH...
## dbl  (6): LATITUDE_X, LONGITUDE_X, AGE, CRASHSEVERITYID, LOCALREPORTNO, ZIP
## lgl  (2): SNA_NEIGHBORHOOD, UNITTYPE
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cpd_data)
## # A tibble: 6 x 27
##   address_x    latitude_x longitude_x   age community_council_~ cpd_neighborhood
##   <chr>             <dbl>       <dbl> <dbl> <chr>               <chr>           
## 1 XX W MITCHE~       39.2       -84.5    37 CLIFTON             CLIFTON         
## 2 5XX W KING ~       39.1       -84.5    18 CLIFTON             CLIFTON         
## 3 5XX W KING ~       39.1       -84.5    NA CLIFTON             CLIFTON         
## 4 1XX E KING ~       39.1       -84.5    28 CORRYVILLE          CORRYVILLE      
## 5 1XX E KING ~       39.1       -84.5    22 CORRYVILLE          CORRYVILLE      
## 6 34XX MONTGO~       39.1       -84.5    34 EVANSTON            EVANSTON        
## # ... with 21 more variables: crashlocation <chr>, crashseverity <chr>,
## #   crashseverityid <dbl>, datecrashreported <chr>, dayofweek <chr>,
## #   gender <chr>, injuries <chr>, instanceid <chr>,
## #   lightconditionsprimary <chr>, localreportno <dbl>, mannerofcrash <chr>,
## #   roadconditionsprimary <chr>, roadcontour <chr>, roadsurface <chr>,
## #   sna_neighborhood <lgl>, typeofperson <chr>, weather <chr>, zip <dbl>,
## #   unittype <lgl>, crash_datetime <dttm>, crash_date <date>

Studying the features in detail

Printing the unique values contained in Injuries column

unique(cpd_data$injuries)
##  [1] "5 - NO APPARENTY INJURY"       "4 - POSSIBLE INJURY"          
##  [3] NA                              "3 - SUSPECTED MINOR INJURY"   
##  [5] "2 - SUSPECTED SERIOUS INJURY"  "1 - FATAL"                    
##  [7] "1 - NO INJURY / NONE REPORTED" "3 - NON-INCAPACITATING"       
##  [9] "2 - POSSIBLE"                  "5 - FATAL"                    
## [11] "4 - INCAPACITATING"

Since the injuries column mixes two coding schemes (the same number is used for different categories) and several categories can be merged, we clean the data set by recoding the values

# The two coding schemes number the categories in opposite order, so recode by label, not by number
cpd_data$injuries[cpd_data$injuries=="5 - NO APPARENTY INJURY"]<- "NO INJURY / NONE REPORTED" # no apparent injury is not a fatality
cpd_data$injuries[cpd_data$injuries=="4 - POSSIBLE INJURY"]<- "POSSIBLE INJURY"
cpd_data$injuries[cpd_data$injuries=="3 - SUSPECTED MINOR INJURY"]<- "SUSPECTED MINOR INJURY"
cpd_data$injuries[cpd_data$injuries=="2 - SUSPECTED SERIOUS INJURY"]<- "SUSPECTED SERIOUS INJURY"
cpd_data$injuries[cpd_data$injuries=="1 - FATAL"] <- "FATAL"
cpd_data$injuries[cpd_data$injuries== "1 - NO INJURY / NONE REPORTED"] <- "NO INJURY / NONE REPORTED"
cpd_data$injuries[cpd_data$injuries== "3 - NON-INCAPACITATING"] <- "SUSPECTED MINOR INJURY"
cpd_data$injuries[cpd_data$injuries== "2 - POSSIBLE"] <- "POSSIBLE INJURY"
cpd_data$injuries[cpd_data$injuries== "5 - FATAL"] <- "FATAL"
cpd_data$injuries[cpd_data$injuries== "4 - INCAPACITATING"] <- "SUSPECTED SERIOUS INJURY"
cpd_data$injuries[is.na(cpd_data$injuries)] <- "Data Unavailable" # missing values are NA, not " "
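
A lookup table is a more compact way to express this kind of recoding and makes the mapping easy to audit in one place. The sketch below is an illustration on made-up input, not the code used in this report. It assumes that the two numbering schemes are an older and a newer version of the same severity scale, and the names injury_map and recode_injuries are hypothetical.

```r
# Lookup from raw labels to harmonized categories.
# Assumption: the two numbering schemes describe the same severity scale.
injury_map <- c(
  "5 - NO APPARENTY INJURY"       = "NO INJURY / NONE REPORTED",
  "1 - NO INJURY / NONE REPORTED" = "NO INJURY / NONE REPORTED",
  "4 - POSSIBLE INJURY"           = "POSSIBLE INJURY",
  "2 - POSSIBLE"                  = "POSSIBLE INJURY",
  "3 - SUSPECTED MINOR INJURY"    = "SUSPECTED MINOR INJURY",
  "3 - NON-INCAPACITATING"        = "SUSPECTED MINOR INJURY",
  "2 - SUSPECTED SERIOUS INJURY"  = "SUSPECTED SERIOUS INJURY",
  "4 - INCAPACITATING"            = "SUSPECTED SERIOUS INJURY",
  "1 - FATAL"                     = "FATAL",
  "5 - FATAL"                     = "FATAL"
)

recode_injuries <- function(x) {
  out <- unname(injury_map[x])           # NA for missing or unmapped labels
  out[is.na(out)] <- "Data Unavailable"  # make missing values explicit
  out
}

raw <- c("5 - NO APPARENTY INJURY", "5 - FATAL", "2 - POSSIBLE", NA)
recode_injuries(raw)
```

Any label that is not in the table ends up as "Data Unavailable", so a quick table() of the result shows whether a new raw label has appeared in the source data.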

Similarly, we clean the other features :

cpd_data$gender[cpd_data$gender=="F - FEMALE"]<- "Female"
cpd_data$gender[cpd_data$gender=="M - MALE"]<- "Male"
cpd_data$gender[cpd_data$gender=="FEMALE"]<- "Female"
cpd_data$gender[cpd_data$gender=="MALE"]<- "Male"
cpd_data$gender[cpd_data$gender=="U - UNKNOWN"] <- ""
cpd_data$gender[cpd_data$gender== 'NA'] <- ""
unique(cpd_data$gender)
## [1] "Male"   "Female" NA       ""
unique(cpd_data$weather)
##  [1] "1 - CLEAR"                            
##  [2] "2 - CLOUDY"                           
##  [3] "6 - SNOW"                             
##  [4] "99 - OTHER/UNKNOWN"                   
##  [5] "5 - SLEET, HAIL"                      
##  [6] "9 - FREEZING RAIN OR FREEZING DRIZZLE"
##  [7] "4 - RAIN"                             
##  [8] NA                                     
##  [9] "8 - BLOWING SAND, SOIL, DIRT, SNOW"   
## [10] "3 - FOG, SMOG, SMOKE"                 
## [11] "7 - SEVERE CROSSWINDS"                
## [12] "9 - OTHER/UNKNOWN"                    
## [13] "5 - SLEET,HAIL"
cpd_data$weather[cpd_data$weather=="1 - CLEAR"]<- "CLEAR"
cpd_data$weather[cpd_data$weather=="2 - CLOUDY"]<- "CLOUDY"
cpd_data$weather[cpd_data$weather=="6 - SNOW"]<- "SNOW"
cpd_data$weather[cpd_data$weather=="99 - OTHER/UNKNOWN"]<- "UNKNOWN"
cpd_data$weather[cpd_data$weather=="5 - SLEET, HAIL"] <- "SLEET, HAIL"
cpd_data$weather[cpd_data$weather== "9 - FREEZING RAIN OR FREEZING DRIZZLE"] <- "FREEZING RAIN OR FREEZING DRIZZLE"
cpd_data$weather[cpd_data$weather== "4 - RAIN"] <- "RAIN"
cpd_data$weather[cpd_data$weather== "8 - BLOWING SAND, SOIL, DIRT, SNOW"] <- "BLOWING SAND, SOIL, DIRT, SNOW"
cpd_data$weather[cpd_data$weather== "3 - FOG, SMOG, SMOKE"] <- "FOG, SMOG, SMOKE"
cpd_data$weather[cpd_data$weather== "7 - SEVERE CROSSWINDS"] <- "SEVERE CROSSWINDS"
cpd_data$weather[cpd_data$weather== "9 - OTHER/UNKNOWN"] <- "UNKNOWN"
cpd_data$weather[cpd_data$weather== "5 - SLEET,HAIL"] <- "SLEET, HAIL"
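
Because every weather label follows the pattern "code - NAME", the same cleaning can be sketched with a regular expression instead of one line per label. The helper name clean_weather is hypothetical and the input vector is made up for illustration:

```r
clean_weather <- function(x) {
  x <- sub("^[0-9]+ - ", "", x)                 # drop the numeric code prefix
  x <- gsub(",\\s*", ", ", x)                   # "SLEET,HAIL" becomes "SLEET, HAIL"
  x[which(x == "OTHER/UNKNOWN")] <- "UNKNOWN"   # same label as the manual recoding
  x
}

raw <- c("1 - CLEAR", "99 - OTHER/UNKNOWN", "5 - SLEET,HAIL",
         "8 - BLOWING SAND, SOIL, DIRT, SNOW", NA)
clean_weather(raw)
```

Missing values stay NA, which matches the behaviour of the line-by-line version above.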

Daily Crash Data (from 1st January 2020 - Present)

daily_crashes = cpd_data %>%
  distinct(instanceid,.keep_all=TRUE) %>% # Want only one row per traffic instance
  count(crash_date,name='num_incidents') %>% # Count incidents by day
  filter(crash_date>=ymd('2020-01-01')) # Filter to post-2020

daily_crashes_plot = ggplot(daily_crashes)+
  geom_line(aes(crash_date,num_incidents))+
  theme_bw()+
  xlab("Date")+
  ylab("Number of Incidents")+
  labs(
    title = 'Number of Daily Traffic Incidents, Cincinnati Police Department',
    subtitle = 'January 2020 - Present'
  )

daily_crashes_plot
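
Counting rows by date only returns the dates that have at least one crash, so a day without any recorded crash would be missing from the series instead of appearing as zero. Forecasting methods generally expect a regular calendar, so it may be worth filling such gaps. A minimal sketch on toy counts (the numbers and the names daily, calendar, and daily_full are illustrative assumptions):

```r
# Toy daily counts with a gap: 2020-01-03 has no row at all.
daily <- data.frame(
  crash_date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-04")),
  num_incidents = c(12, 9, 15)
)

# Build the full calendar and left-join the counts onto it.
calendar <- data.frame(
  crash_date = seq(min(daily$crash_date), max(daily$crash_date), by = "day")
)
daily_full <- merge(calendar, daily, by = "crash_date", all.x = TRUE)

# Days absent from the counts had no recorded incidents.
daily_full$num_incidents[is.na(daily_full$num_incidents)] <- 0
daily_full
```

If the real series has no gaps, this step leaves it unchanged.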

Plotting a linear trend on top of the daily crash data

daily_crashes_plot + 
  geom_smooth(aes(crash_date,num_incidents),method='lm',color='red')
## `geom_smooth()` using formula 'y ~ x'

We can observe that there is a slightly positive trend over time.

Visualizing Daily Crash Data - DAY WISE (from 1st January 2020 - Present)

cpd_data %>%
  distinct(instanceid,.keep_all=TRUE) %>% # One row per traffic incident
  count(crash_date,dayofweek,name='daily_incidents') %>% # Count Number of Incidents Per Day
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(dayofweek) %>% # Group data by day of week
  summarize(avg_daily_incidents = mean(daily_incidents)) %>% # Take the Mean Incidents by day of week
  ungroup() %>%
  mutate(dayofweek=factor(
    dayofweek,
    labels = c("MON",'TUE','WED','THU','FRI','SAT','SUN'),
    levels = c("MON",'TUE','WED','THU','FRI','SAT','SUN')
  )) %>% # Set day of week as factor so the plot orders the bars properly
  filter(!is.na(dayofweek)) %>% # Filter missing day of week values
  ggplot()+
  geom_col(aes(dayofweek,avg_daily_incidents))+
  theme_bw()+
  xlab("Day of Week")+
  ylab("Average Daily Traffic Incidents")+
  scale_fill_discrete(name = 'Day of Week')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Day of week',
    subtitle = 'January 2020 - Present')

Visualizing Daily Crash Data - GENDER WISE (from 1st January 2020 - Present)

cpd_gender <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, gender, name="incidents_by_gender")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(gender)%>%
  summarize(avg_daily_incidents = mean(incidents_by_gender))%>%
  ungroup()%>%
  filter(!is.na(gender))%>%
  filter(gender != "") # unknown genders were recoded to "" above, so "U" no longer occurs


ggplot(data = cpd_gender) +
  geom_col(mapping = aes(x=gender, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Gender")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'gender')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Gender',
    subtitle = 'January 2020 - Present'
  )

Visualizing Daily Crash Data - AGE WISE (from 1st January 2020 - Present)

cpd_age <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, age, name="incidents_by_age")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(age)%>%
  summarize(avg_daily_incidents = mean(incidents_by_age))%>%
  ungroup()%>%
  filter(!is.na(age)) # age is numeric, so only missing values need to be dropped


ggplot(data = cpd_age) +
  geom_col(mapping = aes(x=age, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Age")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'age')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Age',
    subtitle = 'January 2020 - Present'
  )

Visualizing Daily Crash Data - WEATHER WISE (from 1st January 2020 - Present)

cpd_weather <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, weather, name="incidents_by_weather")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(weather)%>%
  summarize(avg_daily_incidents = mean(incidents_by_weather))%>%
  ungroup()%>%
  filter(!is.na(weather))


ggplot(data = cpd_weather) +
  geom_col(mapping = aes(x=weather, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Weather")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'weather')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Weather',
    subtitle = 'January 2020 - Present'
  )
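
One caveat when reading these category charts: taking the mean of the per-day counts averages only over the days on which a category occurs, because days without that category have no row. For rare categories this can differ a lot from the average over all days. The sketch below illustrates the difference on made-up counts; the object names are assumptions.

```r
# Toy counts per day and weather type over a 4-day window (illustrative values).
counts <- data.frame(
  crash_date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
                         "2020-01-02")),
  weather = c("CLEAR", "CLEAR", "CLEAR", "CLEAR", "SNOW"),
  n = c(10, 12, 8, 10, 4)
)
n_days <- length(unique(counts$crash_date))

# Mean over only the days on which the category appears.
mean_when_present <- tapply(counts$n, counts$weather, mean)

# Mean over every day in the window, counting absent days as zero.
mean_all_days <- tapply(counts$n, counts$weather, sum) / n_days

mean_when_present
mean_all_days
```

In this toy example SNOW averages 4 incidents on the one day it occurs but only 1 incident per day across the whole window, while CLEAR is the same under both definitions.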

Visualizing Daily Crash Data on the basis of TYPE OF PERSON (from 1st January 2020 - Present)

cpd_typeofperson <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, typeofperson, name="incidents_by_typeofperson")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(typeofperson)%>%
  summarize(avg_daily_incidents = mean(incidents_by_typeofperson))%>%
  ungroup()%>%
  filter(!is.na(typeofperson))


ggplot(data = cpd_typeofperson) +
  geom_col(mapping = aes(x=typeofperson, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Type of Person")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'typeofperson')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Type of Person',
    subtitle = 'January 2020 - Present'
  )

Visualizing Daily Crash Data - CRASH SEVERITY WISE (from 1st January 2020 - Present)

cpd_crashseverity <- cpd_data%>%
  distinct(instanceid,.keep_all = TRUE)%>%
  count(crash_date, crashseverity, name="incidents_by_crashseverity")%>%
  filter(crash_date>=ymd('2020-01-01'))%>%
  group_by(crashseverity)%>%
  summarize(avg_daily_incidents = mean(incidents_by_crashseverity))%>%
  ungroup()%>%
  filter(!is.na(crashseverity))


ggplot(data = cpd_crashseverity) +
  geom_col(mapping = aes(x=crashseverity, y = avg_daily_incidents))+
  theme_bw()+
  xlab("Crash Severity")+
  ylab("Average Daily Incidents")+
  scale_fill_discrete(name = 'crashseverity')+
  labs(
    title = 'Average Daily Traffic Incidents in Cincinnati by Crash Severity',
    subtitle = 'January 2020 - Present'
  )

Creating a summary statistics table for the features (Jan 2020 - Present)

stats1 <- cpd_data %>% select(weather, age, injuries, typeofperson, mannerofcrash, roadcontour, dayofweek, lightconditionsprimary, roadconditionsprimary,crashlocation, crash_date)%>%
  filter(crash_date>=ymd('2020-01-01'))

table1 <- tbl_summary(stats1,
                      statistic = list(all_continuous()~"{mean} {sd} "))
table1
Characteristic                                                N = 62,966 [1]
weather
  BLOWING SAND, SOIL, DIRT, SNOW                              3 (<0.1%)
  CLEAR                                                       42,396 (67%)
  CLOUDY                                                      9,669 (15%)
  FOG, SMOG, SMOKE                                            50 (<0.1%)
  FREEZING RAIN OR FREEZING DRIZZLE                           43 (<0.1%)
  RAIN                                                        8,917 (14%)
  SEVERE CROSSWINDS                                           6 (<0.1%)
  SLEET, HAIL                                                 85 (0.1%)
  SNOW                                                        1,258 (2.0%)
  UNKNOWN                                                     534 (0.8%)
  Unknown                                                     5
age                                                           37 16
  Unknown                                                     9,331
injuries
  FATAL                                                       52,431 (83%)
  NO INJURY / NONE REPORTED                                   2 (<0.1%)
  POSSIBLE INJURY                                             4,956 (7.9%)
  SUSPECTED MINOR INJURY                                      4,995 (7.9%)
  SUSPECTED SERIOUS INJURY                                    519 (0.8%)
  Unknown                                                     63
typeofperson
  D - DRIVER                                                  56,157 (89%)
  O - OCCUPANT                                                6,107 (9.7%)
  P - PEDESTRIAN                                              641 (1.0%)
  Unknown                                                     61
mannerofcrash
  1 - NOT COLLISION BETWEEN TWO MOTOR VEHICLES IN TRANSPORT   9,926 (16%)
  2 - REAR-END                                                16,051 (25%)
  3 - HEAD-ON                                                 1,876 (3.0%)
  4 - REAR-TO-REAR                                            287 (0.5%)
  5 - BACKING                                                 1,913 (3.0%)
  6 - ANGLE                                                   19,359 (31%)
  7 - SIDESWIPE, SAME DIRECTION                               10,929 (17%)
  8 - SIDESWIPE, OPPOSITE DIRECTION                           1,569 (2.5%)
  9 - UNKNOWN                                                 1,051 (1.7%)
  Unknown                                                     5
roadcontour
  1 - STRAIGHT LEVEL                                          44,762 (71%)
  2 - STRAIGHT GRADE                                          11,755 (19%)
  3 - CURVE LEVEL                                             2,832 (4.5%)
  4 - CURVE GRADE                                             3,570 (5.7%)
  9 - UNKNOWN                                                 42 (<0.1%)
  Unknown                                                     5
dayofweek
  FRI                                                         10,984 (17%)
  MON                                                         8,541 (14%)
  SAT                                                         8,578 (14%)
  SUN                                                         7,198 (11%)
  THU                                                         9,516 (15%)
  TUE                                                         9,012 (14%)
  WED                                                         9,137 (15%)
lightconditionsprimary
  1 - DAYLIGHT                                                42,077 (67%)
  2 - DAWN                                                    1,191 (1.9%)
  2 - DUSK                                                    1,488 (2.4%)
  3 - DARK - LIGHTED ROADWAY                                  16,362 (26%)
  4 - DARK – ROADWAY NOT LIGHTED                              971 (1.5%)
  5 - DARK – UNKNOWN ROADWAY LIGHTING                         402 (0.6%)
  9 - OTHER                                                   57 (<0.1%)
  9 - UNKNOWN                                                 413 (0.7%)
  Unknown                                                     5
roadconditionsprimary
  01 - DRY                                                    47,881 (76%)
  02 - WET                                                    13,134 (21%)
  03 - SNOW                                                   887 (1.4%)
  04 - ICE                                                    502 (0.8%)
  05 - SAND, MUD, DIRT, OIL, GRAVEL                           21 (<0.1%)
  06 - WATER (STANDING, MOVING)                               24 (<0.1%)
  07 - SLUSH                                                  71 (0.1%)
  09 - OTHER                                                  16 (<0.1%)
  09 - UNKNOWN                                                425 (0.7%)
  Unknown                                                     5
crashlocation                                                 0 (NA%)
  Unknown                                                     62,966
crash_date                                                    2021-01-23 213.165845961698

[1] n (%); Mean SD

Note on injuries : in this output the FATAL row includes the records labelled "5 - NO APPARENTY INJURY", so the 52,431 (83%) does not count fatalities alone.

# Test whether there is a true linear trend over time
mod1 = lm(num_incidents~crash_date,data=daily_crashes)
summary(mod1)
## 
## Call:
## lm(formula = num_incidents ~ crash_date, data = daily_crashes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.756  -8.192  -0.959   7.277  86.798 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.352e+02  4.077e+01  -5.770 1.16e-08 ***
## crash_date   1.498e-02  2.187e-03   6.848 1.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.92 on 746 degrees of freedom
## Multiple R-squared:  0.05915,    Adjusted R-squared:  0.05788 
## F-statistic:  46.9 on 1 and 746 DF,  p-value: 1.565e-11

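The model tests whether the number of daily incidents changes linearly with time. The coefficient on crash_date is positive (about 0.015 additional daily incidents for each day that passes, or roughly 5.5 more incidents per day after a year) and its p-value (1.56e-11) is far below 0.05, so the upward trend is statistically significant. However, the R-squared of about 0.06 shows that the linear trend alone explains only around 6% of the day-to-day variation in the number of incidents.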
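
Because crash_date is a Date, the slope is measured per day, which makes it look tiny. Converting it to a yearly change makes it easier to interpret. The sketch below is a hedged illustration on simulated data with a known trend (it does not use the crash data, and the names toy, fit, and wiggle are assumptions):

```r
# Simulated daily series with a known trend of 0.015 incidents per day (not the real data).
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 730)
day_index <- as.numeric(dates - dates[1])
wiggle <- rep(c(-1, 1), length.out = length(dates))  # deterministic stand-in for noise
toy <- data.frame(crash_date = dates,
                  num_incidents = 30 + 0.015 * day_index + wiggle)

fit <- lm(num_incidents ~ crash_date, data = toy)

# A Date predictor is measured in days, so the slope is the change per day elapsed.
slope_per_day <- unname(coef(fit)["crash_date"])
slope_per_year <- slope_per_day * 365
r_squared <- summary(fit)$r.squared

c(slope_per_day = slope_per_day, slope_per_year = slope_per_year, r_squared = r_squared)
```

Applied to mod1, the same arithmetic gives 0.01498 * 365, or about 5.5 additional daily incidents per year.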

plot(mod1)