For the forecasting study, I have selected Traffic Crash Reports (Cincinnati Police Department-CPD) from Cincinnati Open Data Portal containing daily traffic incidents which includes the date when the accident occurred, place where the accident occurred, the type of crash, weather on that day, and so on.
Source : https://data.cincinnati-oh.gov/Safety/Traffic-Crash-Reports-CPD-/rvmt-pkmq
We have filtered the crash data for 2020-Present and can predict future crash data for Cincinnati for the next 3 months / 6months / yearly and so on.
Traffic crash data is useful to support the development and assessment of road safety plans by identifying possible risk factors and locating hazardous regions and reduce the number of accidents occurrences. I am new to forecasting and modeling, and I found it really interesting while working on it in the class and wanted to continue working on it. Hence, I selected this data set.
I am excited to work on it!
Our response variable would be number of crashes (can be per day/month/year).I believe that the variation in the response variable can be explained by age, day, weather, road conditions, manner of crash, light conditions, road surface.
This might be a tricky data set to forecast since a lot of columns contains categorical data and not numerical data.
#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("janitor")
#packageVersion("dplyr")
#install.packages("dplyr")
#install.packages("readr")
#install.packages("magrittr") # package installations are only needed the first time you use it
#install.packages("dplyr") # alternative installation of the %>%
library(readr) # Reading csvs
library(ggplot2) # Plotting
library(knitr) # Kable
library(broom) # For tidying model results
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr) # alternatively, this also loads %>%
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtsummary) # for creating summary tables
## #Uighur
library(dplyr) # Data wrangling
library(janitor) # Cleaning variable names
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # Working with dates
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
cpd_file_path = "C:/Users/Goels/Desktop/UC/A - SEMESTER 2/Forecasting Methods - Antony - Thursday/traffic crash/Traffic_Crash_Reports__CPD_.csv"
cpd_data <- read.csv(cpd_file_path, stringsAsFactors = FALSE)
Overview of the data :
head(cpd_data)
## ï..ADDRESS_X LATITUDE_X LONGITUDE_X AGE COMMUNITY_COUNCIL_NEIGHBORHOOD
## 1 XX W MITCHELL AV 39.16111 -84.506982 37 CLIFTON
## 2 5XX W KING DR 39.139361 -84.525669 18 CLIFTON
## 3 5XX W KING DR 39.139401 -84.526069 NA CLIFTON
## 4 1XX E KING DR 39.135456 -84.505356 28 CORRYVILLE
## 5 1XX E KING DR 39.135816 -84.506256 22 CORRYVILLE
## 6 34XX MONTGOMERY RD 39.142066 -84.472325 34 EVANSTON
## CPD_NEIGHBORHOOD CRASHDATE CRASHLOCATION
## 1 CLIFTON 01/16/2022 03:45:00 AM
## 2 CLIFTON 01/16/2022 05:00:00 AM
## 3 CLIFTON 01/16/2022 05:00:00 AM
## 4 CORRYVILLE 01/16/2022 03:36:00 AM
## 5 CORRYVILLE 01/16/2022 03:36:00 AM
## 6 EVANSTON 01/16/2022 03:20:02 AM
## CRASHSEVERITY CRASHSEVERITYID DATECRASHREPORTED DAYOFWEEK
## 1 5 - PROPERTY DAMAGE ONLY 201905 01/16/2022 04:30:00 AM SUN
## 2 5 - PROPERTY DAMAGE ONLY 201905 01/16/2022 04:13:45 AM SUN
## 3 5 - PROPERTY DAMAGE ONLY 201905 01/16/2022 04:13:45 AM SUN
## 4 5 - PROPERTY DAMAGE ONLY 201905 01/16/2022 03:49:00 AM SUN
## 5 5 - PROPERTY DAMAGE ONLY 201905 01/16/2022 03:49:00 AM SUN
## 6 4 - INJURY POSSIBLE 201904 01/16/2022 03:20:02 AM SUN
## GENDER INJURIES INSTANCEID
## 1 M - MALE 5 - NO APPARENTY INJURY BEFCF82D-D869-45F7-B5F1-D0314ADA0A6C
## 2 F - FEMALE 5 - NO APPARENTY INJURY 712FD408-A8B9-411C-A4B9-86B63DBA1F72
## 3 5 - NO APPARENTY INJURY 712FD408-A8B9-411C-A4B9-86B63DBA1F72
## 4 F - FEMALE 5 - NO APPARENTY INJURY 816379F5-112B-4A4E-85D8-3C7B805BFD1A
## 5 F - FEMALE 5 - NO APPARENTY INJURY 816379F5-112B-4A4E-85D8-3C7B805BFD1A
## 6 F - FEMALE 4 - POSSIBLE INJURY 43C13A4F-9180-4E04-99D1-1F3966DA2FFB
## LIGHTCONDITIONSPRIMARY LOCALREPORTNO
## 1 1 - DAYLIGHT 225000640
## 2 3 - DARK - LIGHTED ROADWAY 225000639
## 3 3 - DARK - LIGHTED ROADWAY 225000639
## 4 1 - DAYLIGHT 225000638
## 5 1 - DAYLIGHT 225000638
## 6 1 - DAYLIGHT 225000650
## MANNEROFCRASH
## 1 7 - SIDESWIPE, SAME DIRECTION
## 2 7 - SIDESWIPE, SAME DIRECTION
## 3 7 - SIDESWIPE, SAME DIRECTION
## 4 2 - REAR-END
## 5 2 - REAR-END
## 6 1 - NOT COLLISION BETWEEN TWO MOTOR VEHICLES IN TRANSPORT
## ROADCONDITIONSPRIMARY ROADCONTOUR ROADSURFACE
## 1 01 - DRY 1 - STRAIGHT LEVEL 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 2 01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 3 01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 4 01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 5 01 - DRY 2 - STRAIGHT GRADE 2 - BLACKTOP, BITUMINOUS, ASPHALT
## 6 01 - DRY 1 - STRAIGHT LEVEL 1 - CONCRETE
## SNA_NEIGHBORHOOD TYPEOFPERSON WEATHER ZIP UNITTYPE
## 1 NA D - DRIVER 1 - CLEAR 45217 NA
## 2 NA D - DRIVER 1 - CLEAR 45220 NA
## 3 NA D - DRIVER 1 - CLEAR 45220 NA
## 4 NA D - DRIVER 2 - CLOUDY 45219 NA
## 5 NA D - DRIVER 2 - CLOUDY 45219 NA
## 6 NA D - DRIVER 1 - CLEAR 45207 NA
dim(cpd_data)
## [1] 291852 26
colnames(cpd_data)
## [1] "ï..ADDRESS_X" "LATITUDE_X"
## [3] "LONGITUDE_X" "AGE"
## [5] "COMMUNITY_COUNCIL_NEIGHBORHOOD" "CPD_NEIGHBORHOOD"
## [7] "CRASHDATE" "CRASHLOCATION"
## [9] "CRASHSEVERITY" "CRASHSEVERITYID"
## [11] "DATECRASHREPORTED" "DAYOFWEEK"
## [13] "GENDER" "INJURIES"
## [15] "INSTANCEID" "LIGHTCONDITIONSPRIMARY"
## [17] "LOCALREPORTNO" "MANNEROFCRASH"
## [19] "ROADCONDITIONSPRIMARY" "ROADCONTOUR"
## [21] "ROADSURFACE" "SNA_NEIGHBORHOOD"
## [23] "TYPEOFPERSON" "WEATHER"
## [25] "ZIP" "UNITTYPE"
colSums(is.na(cpd_data))
## ï..ADDRESS_X LATITUDE_X
## 0 0
## LONGITUDE_X AGE
## 0 35475
## COMMUNITY_COUNCIL_NEIGHBORHOOD CPD_NEIGHBORHOOD
## 0 0
## CRASHDATE CRASHLOCATION
## 0 0
## CRASHSEVERITY CRASHSEVERITYID
## 0 10
## DATECRASHREPORTED DAYOFWEEK
## 0 0
## GENDER INJURIES
## 0 0
## INSTANCEID LIGHTCONDITIONSPRIMARY
## 0 0
## LOCALREPORTNO MANNEROFCRASH
## 0 0
## ROADCONDITIONSPRIMARY ROADCONTOUR
## 0 0
## ROADSURFACE SNA_NEIGHBORHOOD
## 0 291852
## TYPEOFPERSON WEATHER
## 0 0
## ZIP UNITTYPE
## 0 291852
Since ‘UNITTYPE’ and ‘SNA_NEIGHBOURHOOD’ is empty (as number of NA’s = total number of rows in our data set), it is better that we drop the 2 columns since they are not at all useful.
drop <- c("UNITTYPE","SNA_NEIGHBORHOOD")
cpd_data = cpd_data[,!(names(cpd_data) %in% drop)]
str(cpd_data)
## 'data.frame': 291852 obs. of 24 variables:
## $ ï..ADDRESS_X : chr "XX W MITCHELL AV" "5XX W KING DR" "5XX W KING DR" "1XX E KING DR" ...
## $ LATITUDE_X : chr "39.16111" "39.139361" "39.139401" "39.135456" ...
## $ LONGITUDE_X : chr "-84.506982" "-84.525669" "-84.526069" "-84.505356" ...
## $ AGE : int 37 18 NA 28 22 34 NA 49 27 22 ...
## $ COMMUNITY_COUNCIL_NEIGHBORHOOD: chr "CLIFTON" "CLIFTON" "CLIFTON" "CORRYVILLE" ...
## $ CPD_NEIGHBORHOOD : chr "CLIFTON" "CLIFTON" "CLIFTON" "CORRYVILLE" ...
## $ CRASHDATE : chr "01/16/2022 03:45:00 AM" "01/16/2022 05:00:00 AM" "01/16/2022 05:00:00 AM" "01/16/2022 03:36:00 AM" ...
## $ CRASHLOCATION : chr "" "" "" "" ...
## $ CRASHSEVERITY : chr "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" "5 - PROPERTY DAMAGE ONLY" ...
## $ CRASHSEVERITYID : int 201905 201905 201905 201905 201905 201904 201905 201905 201905 201905 ...
## $ DATECRASHREPORTED : chr "01/16/2022 04:30:00 AM" "01/16/2022 04:13:45 AM" "01/16/2022 04:13:45 AM" "01/16/2022 03:49:00 AM" ...
## $ DAYOFWEEK : chr "SUN" "SUN" "SUN" "SUN" ...
## $ GENDER : chr "M - MALE" "F - FEMALE" "" "F - FEMALE" ...
## $ INJURIES : chr "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" "5 - NO APPARENTY INJURY" ...
## $ INSTANCEID : chr "BEFCF82D-D869-45F7-B5F1-D0314ADA0A6C" "712FD408-A8B9-411C-A4B9-86B63DBA1F72" "712FD408-A8B9-411C-A4B9-86B63DBA1F72" "816379F5-112B-4A4E-85D8-3C7B805BFD1A" ...
## $ LIGHTCONDITIONSPRIMARY : chr "1 - DAYLIGHT" "3 - DARK - LIGHTED ROADWAY" "3 - DARK - LIGHTED ROADWAY" "1 - DAYLIGHT" ...
## $ LOCALREPORTNO : num 2.25e+08 2.25e+08 2.25e+08 2.25e+08 2.25e+08 ...
## $ MANNEROFCRASH : chr "7 - SIDESWIPE, SAME DIRECTION" "7 - SIDESWIPE, SAME DIRECTION" "7 - SIDESWIPE, SAME DIRECTION" "2 - REAR-END" ...
## $ ROADCONDITIONSPRIMARY : chr "01 - DRY" "01 - DRY" "01 - DRY" "01 - DRY" ...
## $ ROADCONTOUR : chr "1 - STRAIGHT LEVEL" "2 - STRAIGHT GRADE" "2 - STRAIGHT GRADE" "2 - STRAIGHT GRADE" ...
## $ ROADSURFACE : chr "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" "2 - BLACKTOP, BITUMINOUS, ASPHALT" ...
## $ TYPEOFPERSON : chr "D - DRIVER" "D - DRIVER" "D - DRIVER" "D - DRIVER" ...
## $ WEATHER : chr "1 - CLEAR" "1 - CLEAR" "1 - CLEAR" "2 - CLOUDY" ...
## $ ZIP : chr "45217" "45220" "45220" "45219" ...
summary(cpd_data)
## ï..ADDRESS_X LATITUDE_X LONGITUDE_X AGE
## Length:291852 Length:291852 Length:291852 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 34.00
## Mean : 37.51
## 3rd Qu.: 49.00
## Max. :123.00
## NA's :35475
## COMMUNITY_COUNCIL_NEIGHBORHOOD CPD_NEIGHBORHOOD CRASHDATE
## Length:291852 Length:291852 Length:291852
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## CRASHLOCATION CRASHSEVERITY CRASHSEVERITYID DATECRASHREPORTED
## Length:291852 Length:291852 Min. : 1 Length:291852
## Class :character Class :character 1st Qu.: 3 Class :character
## Mode :character Mode :character Median : 3 Mode :character
## Mean : 67742
## 3rd Qu.:201904
## Max. :201905
## NA's :10
## DAYOFWEEK GENDER INJURIES INSTANCEID
## Length:291852 Length:291852 Length:291852 Length:291852
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## LIGHTCONDITIONSPRIMARY LOCALREPORTNO MANNEROFCRASH
## Length:291852 Min. :1.400e+01 Length:291852
## Class :character 1st Qu.:1.550e+08 Class :character
## Mode :character Median :1.750e+08 Mode :character
## Mean :1.794e+08
## 3rd Qu.:1.950e+08
## Max. :2.201e+11
##
## ROADCONDITIONSPRIMARY ROADCONTOUR ROADSURFACE TYPEOFPERSON
## Length:291852 Length:291852 Length:291852 Length:291852
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## WEATHER ZIP
## Length:291852 Length:291852
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Key inferences : 1) Age : Age cannot be 0 (min) or 123(max), which means that data has discrepancy and we will have to see if we want to keep the data or remove the outliers. Also, we observe that on an average the age of the person driving was 37 years. 2) crashdate : This field is crucial to our modeling and in order to analyse our data, we need to extract the date from the crashdate since the format of the feature is “yy-mm-dd 00:00:00”
cpd_data = read_csv(cpd_file_path) %>%
clean_names() %>%
mutate(crash_datetime = mdy_hms(crashdate),
crash_date = as.Date(crash_datetime)) %>%
dplyr::select(-crashdate)
## Rows: 291852 Columns: 26
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (18): ADDRESS_X, COMMUNITY_COUNCIL_NEIGHBORHOOD, CPD_NEIGHBORHOOD, CRASH...
## dbl (6): LATITUDE_X, LONGITUDE_X, AGE, CRASHSEVERITYID, LOCALREPORTNO, ZIP
## lgl (2): SNA_NEIGHBORHOOD, UNITTYPE
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cpd_data)
## # A tibble: 6 x 27
## address_x latitude_x longitude_x age community_council_~ cpd_neighborhood
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 XX W MITCHE~ 39.2 -84.5 37 CLIFTON CLIFTON
## 2 5XX W KING ~ 39.1 -84.5 18 CLIFTON CLIFTON
## 3 5XX W KING ~ 39.1 -84.5 NA CLIFTON CLIFTON
## 4 1XX E KING ~ 39.1 -84.5 28 CORRYVILLE CORRYVILLE
## 5 1XX E KING ~ 39.1 -84.5 22 CORRYVILLE CORRYVILLE
## 6 34XX MONTGO~ 39.1 -84.5 34 EVANSTON EVANSTON
## # ... with 21 more variables: crashlocation <chr>, crashseverity <chr>,
## # crashseverityid <dbl>, datecrashreported <chr>, dayofweek <chr>,
## # gender <chr>, injuries <chr>, instanceid <chr>,
## # lightconditionsprimary <chr>, localreportno <dbl>, mannerofcrash <chr>,
## # roadconditionsprimary <chr>, roadcontour <chr>, roadsurface <chr>,
## # sna_neighborhood <lgl>, typeofperson <chr>, weather <chr>, zip <dbl>,
## # unittype <lgl>, crash_datetime <dttm>, crash_date <date>
Printing the unique values contained in Injuries column
unique(cpd_data$injuries)
## [1] "5 - NO APPARENTY INJURY" "4 - POSSIBLE INJURY"
## [3] NA "3 - SUSPECTED MINOR INJURY"
## [5] "2 - SUSPECTED SERIOUS INJURY" "1 - FATAL"
## [7] "1 - NO INJURY / NONE REPORTED" "3 - NON-INCAPACITATING"
## [9] "2 - POSSIBLE" "5 - FATAL"
## [11] "4 - INCAPACITATING"
Since there is overlap in numbers given to the feature injuries and can be clubbed, we clean the data set by renaming the columns
cpd_data$injuries[cpd_data$injuries=="5 - NO APPARENTY INJURY"]<- "FATAL"
cpd_data$injuries[cpd_data$injuries=="4 - POSSIBLE INJURY"]<- "POSSIBLE INJURY"
cpd_data$injuries[cpd_data$injuries=="3 - SUSPECTED MINOR INJURY"]<- "SUSPECTED MINOR INJURY"
cpd_data$injuries[cpd_data$injuries=="2 - SUSPECTED SERIOUS INJURY"]<- "SUSPECTED SERIOUS INJURY"
cpd_data$injuries[cpd_data$injuries=="1 - FATAL"] <- "FATAL"
cpd_data$injuries[cpd_data$injuries== "1 - NO INJURY / NONE REPORTED"] <- "NO INJURY / NONE REPORTED"
cpd_data$injuries[cpd_data$injuries== "3 - NON-INCAPACITATING"] <- "SUSPECTED MINOR INJURY"
cpd_data$injuries[cpd_data$injuries== "2 - POSSIBLE"] <- "SUSPECTED SERIOUS INJURY"
cpd_data$injuries[cpd_data$injuries== "5 - FATAL"] <- "FATAL"
cpd_data$injuries[cpd_data$injuries== "4 - INCAPACITATING"] <- "POSSIBLE INJURY"
cpd_data$injuries[cpd_data$injuries== " "] <- "Data Unavailable"
Similary, we do data cleaning for other features :
cpd_data$gender[cpd_data$gender=="F - FEMALE"]<- "Female"
cpd_data$gender[cpd_data$gender=="M - MALE"]<- "Male"
cpd_data$gender[cpd_data$gender=="FEMALE"]<- "Female"
cpd_data$gender[cpd_data$gender=="MALE"]<- "Male"
cpd_data$gender[cpd_data$gender=="U - UNKNOWN"] <- ""
cpd_data$gender[cpd_data$gender== 'NA'] <- ""
unique(cpd_data$gender)
## [1] "Male" "Female" NA ""
unique(cpd_data$weather)
## [1] "1 - CLEAR"
## [2] "2 - CLOUDY"
## [3] "6 - SNOW"
## [4] "99 - OTHER/UNKNOWN"
## [5] "5 - SLEET, HAIL"
## [6] "9 - FREEZING RAIN OR FREEZING DRIZZLE"
## [7] "4 - RAIN"
## [8] NA
## [9] "8 - BLOWING SAND, SOIL, DIRT, SNOW"
## [10] "3 - FOG, SMOG, SMOKE"
## [11] "7 - SEVERE CROSSWINDS"
## [12] "9 - OTHER/UNKNOWN"
## [13] "5 - SLEET,HAIL"
cpd_data$weather[cpd_data$weather=="1 - CLEAR"]<- "CLEAR"
cpd_data$weather[cpd_data$weather=="2 - CLOUDY"]<- "CLOUDY"
cpd_data$weather[cpd_data$weather=="6 - SNOW"]<- "SNOW"
cpd_data$weather[cpd_data$weather=="99 - OTHER/UNKNOWN"]<- "UNKNOWN"
cpd_data$weather[cpd_data$weather=="5 - SLEET, HAIL"] <- "SLEET, HAIL"
cpd_data$weather[cpd_data$weather== "9 - FREEZING RAIN OR FREEZING DRIZZLE"] <- "FREEZING RAIN OR FREEZING DRIZZLE"
cpd_data$weather[cpd_data$weather== "4 - RAIN"] <- "RAIN"
cpd_data$weather[cpd_data$weather== "8 - BLOWING SAND, SOIL, DIRT, SNOW"] <- "BLOWING SAND, SOIL, DIRT, SNOW"
cpd_data$weather[cpd_data$weather== "3 - FOG, SMOG, SMOKE"] <- "FOG, SMOG, SMOKE"
cpd_data$weather[cpd_data$weather== "7 - SEVERE CROSSWINDS"] <- "SEVERE CROSSWINDS"
cpd_data$weather[cpd_data$weather== "9 - OTHER/UNKNOWN"] <- "UNKNOWN"
cpd_data$weather[cpd_data$weather== "5 - SLEET,HAIL"] <- "SLEET, HAIL"
daily_crashes = cpd_data %>%
distinct(instanceid,.keep_all=TRUE) %>% # Want only one row per traffic instance
count(crash_date,name='num_incidents') %>% # Count incidents by day
filter(crash_date>=ymd('2020-01-01')) # Filter to post-2020
daily_crashes_plot = ggplot(daily_crashes)+
geom_line(aes(crash_date,num_incidents))+
theme_bw()+
xlab("Date")+
ylab("Number of Incidents")+
labs(
title = 'Number of Daily Traffic Incidents, Cincinnati Police Department',
subtitle = 'January 2020 - Present'
)
daily_crashes_plot
Plotting a linear trend on top of the daily crash data
daily_crashes_plot +
geom_smooth(aes(crash_date,num_incidents),method='lm',color='red')
## `geom_smooth()` using formula 'y ~ x'
We can observe that there is a slightly positive trend over time.
Visualizing Daily Crash Data - DAY WISE (from 1st January 2020 - Present)
cpd_data %>%
distinct(instanceid,.keep_all=TRUE) %>% # One row per traffic incident
count(crash_date,dayofweek,name='daily_incidents') %>% # Count Number of Incidents Per Day
filter(crash_date>=ymd('2020-01-01'))%>%
group_by(dayofweek) %>% # Group data by day of week
summarize(avg_daily_incidents = mean(daily_incidents)) %>% # Take the Mean Incidents by day of week
ungroup() %>%
mutate(dayofweek=factor(
dayofweek,
labels = c("MON",'TUE','WED','THU','FRI','SAT','SUN'),
levels = c("MON",'TUE','WED','THU','FRI','SAT','SUN')
)) %>% # Set day of week as factor so the plot orders the bars properly
filter(!is.na(dayofweek)) %>% # Filter missing day of week values
ggplot()+
geom_col(aes(dayofweek,avg_daily_incidents))+
theme_bw()+
xlab("Day of Week")+
ylab("Average Daily Traffic Incidents")+
scale_fill_discrete(name = 'Day of Week')+
labs(
title = 'Average Daily Traffic Incidents in Cincinnati by Day of week',
subtitle = 'January 2020 - Present')
Visualizing Daily Crash Data - GENDER WISE (from 1st January 2020 - Present)
cpd_gender <- cpd_data%>%
distinct(instanceid,.keep_all = TRUE)%>%
count(crash_date, gender, name="incidents_by_gender")%>%
filter(crash_date>=ymd('2020-01-01'))%>%
group_by(gender)%>%
summarize(avg_daily_incidents = mean(incidents_by_gender))%>%
ungroup()%>%
filter(!is.na(gender))%>%
filter(gender != "U")
ggplot(data = cpd_gender) +
geom_col(mapping = aes(x=gender, y = avg_daily_incidents))+
theme_bw()+
xlab("Gender")+
ylab("Average Daily Incidents")+
scale_fill_discrete(name = 'gender')+
labs(
title = 'Average Daily Traffic Incidents in Cincinnati by Gender',
subtitle = 'January 2020 - Present'
)
Visualizing Daily Crash Data - AGE WISE (from 1st January 2020 - Present)
cpd_age <- cpd_data%>%
distinct(instanceid,.keep_all = TRUE)%>%
count(crash_date, age, name="incidents_by_age")%>%
filter(crash_date>=ymd('2020-01-01'))%>%
group_by(age)%>%
summarize(avg_daily_incidents = mean(incidents_by_age))%>%
ungroup()%>%
filter(!is.na(age))%>%
filter(age != "U")
ggplot(data = cpd_age) +
geom_col(mapping = aes(x=age, y = avg_daily_incidents))+
theme_bw()+
xlab("Age")+
ylab("Average Daily Incidents")+
scale_fill_discrete(name = 'age')+
labs(
title = 'Average Daily Traffic Incidents in Cincinnati by Age',
subtitle = 'January 2020 - Present'
)
Visualizing Daily Crash Data - WEATHER WISE (from 1st January 2020 - Present)
cpd_weather <- cpd_data%>%
distinct(instanceid,.keep_all = TRUE)%>%
count(crash_date, weather, name="incidents_by_weather")%>%
filter(crash_date>=ymd('2020-01-01'))%>%
group_by(weather)%>%
summarize(avg_daily_incidents = mean(incidents_by_weather))%>%
ungroup()%>%
filter(!is.na(weather))
ggplot(data = cpd_weather) +
geom_col(mapping = aes(x=weather, y = avg_daily_incidents))+
theme_bw()+
xlab("Weather")+
ylab("Average Daily Incidents")+
scale_fill_discrete(name = 'weather')+
labs(
title = 'Average Daily Traffic Incidents in Cincinnati by Weather',
subtitle = 'January 2020 - Present'
)
Visualizing Daily Crash Data on the basis of TYPE OF PERSON (from 1st January 2020 - Present)
cpd_typeofperson <- cpd_data%>%
distinct(instanceid,.keep_all = TRUE)%>%
count(crash_date, typeofperson, name="incidents_by_typeofperson")%>%
filter(crash_date>=ymd('2020-01-01'))%>%
group_by(typeofperson)%>%
summarize(avg_daily_incidents = mean(incidents_by_typeofperson))%>%
ungroup()%>%
filter(!is.na(typeofperson))
ggplot(data = cpd_typeofperson) +
geom_col(mapping = aes(x=typeofperson, y = avg_daily_incidents))+
theme_bw()+
xlab("typeofperson")+
ylab("Average Daily Incidents")+
scale_fill_discrete(name = 'typeofperson')+
labs(
title = 'Average Daily Traffic Incidents in Cincinnati by Type of Person',
subtitle = 'January 2020 - Present'
)
Visualizing Daily Crash Data - CRASH SEVERITY WISE (from 1st January 2020 - Present)
cpd_crashseverity <- cpd_data%>%
distinct(instanceid,.keep_all = TRUE)%>%
count(crash_date, crashseverity, name="incidents_by_crashseverity")%>%
filter(crash_date>=ymd('2020-01-01'))%>%
group_by(crashseverity)%>%
summarize(avg_daily_incidents = mean(incidents_by_crashseverity))%>%
ungroup()%>%
filter(!is.na(crashseverity))
ggplot(data = cpd_crashseverity) +
geom_col(mapping = aes(x=crashseverity, y = avg_daily_incidents))+
theme_bw()+
xlab("crashseverity")+
ylab("Average Daily Incidents")+
scale_fill_discrete(name = 'crashseverity')+
labs(
title = 'Average Daily Traffic Incidents in Cincinnati by Crash Severity',
subtitle = 'January 2020 - Present'
)
stats1 <- cpd_data %>% select(weather, age, injuries, typeofperson, mannerofcrash, roadcontour, dayofweek, lightconditionsprimary, roadconditionsprimary,crashlocation, crash_date)%>%
filter(crash_date>=ymd('2020-01-01'))
table1 <- tbl_summary(stats1,
statistic = list(all_continuous()~"{mean} {sd} "))
table1
| Characteristic | N = 62,9661 |
|---|---|
| weather | |
| BLOWING SAND, SOIL, DIRT, SNOW | 3 (<0.1%) |
| CLEAR | 42,396 (67%) |
| CLOUDY | 9,669 (15%) |
| FOG, SMOG, SMOKE | 50 (<0.1%) |
| FREEZING RAIN OR FREEZING DRIZZLE | 43 (<0.1%) |
| RAIN | 8,917 (14%) |
| SEVERE CROSSWINDS | 6 (<0.1%) |
| SLEET, HAIL | 85 (0.1%) |
| SNOW | 1,258 (2.0%) |
| UNKNOWN | 534 (0.8%) |
| Unknown | 5 |
| age | 37 16 |
| Unknown | 9,331 |
| injuries | |
| FATAL | 52,431 (83%) |
| NO INJURY / NONE REPORTED | 2 (<0.1%) |
| POSSIBLE INJURY | 4,956 (7.9%) |
| SUSPECTED MINOR INJURY | 4,995 (7.9%) |
| SUSPECTED SERIOUS INJURY | 519 (0.8%) |
| Unknown | 63 |
| typeofperson | |
| D - DRIVER | 56,157 (89%) |
| O - OCCUPANT | 6,107 (9.7%) |
| P - PEDESTRIAN | 641 (1.0%) |
| Unknown | 61 |
| mannerofcrash | |
| 1 - NOT COLLISION BETWEEN TWO MOTOR VEHICLES IN TRANSPORT | 9,926 (16%) |
| 2 - REAR-END | 16,051 (25%) |
| 3 - HEAD-ON | 1,876 (3.0%) |
| 4 - REAR-TO-REAR | 287 (0.5%) |
| 5 - BACKING | 1,913 (3.0%) |
| 6 - ANGLE | 19,359 (31%) |
| 7 - SIDESWIPE, SAME DIRECTION | 10,929 (17%) |
| 8 - SIDESWIPE, OPPOSITE DIRECTION | 1,569 (2.5%) |
| 9 - UNKNOWN | 1,051 (1.7%) |
| Unknown | 5 |
| roadcontour | |
| 1 - STRAIGHT LEVEL | 44,762 (71%) |
| 2 - STRAIGHT GRADE | 11,755 (19%) |
| 3 - CURVE LEVEL | 2,832 (4.5%) |
| 4 - CURVE GRADE | 3,570 (5.7%) |
| 9 - UNKNOWN | 42 (<0.1%) |
| Unknown | 5 |
| dayofweek | |
| FRI | 10,984 (17%) |
| MON | 8,541 (14%) |
| SAT | 8,578 (14%) |
| SUN | 7,198 (11%) |
| THU | 9,516 (15%) |
| TUE | 9,012 (14%) |
| WED | 9,137 (15%) |
| lightconditionsprimary | |
| 1 - DAYLIGHT | 42,077 (67%) |
| 2 - DAWN | 1,191 (1.9%) |
| 2 - DUSK | 1,488 (2.4%) |
| 3 - DARK - LIGHTED ROADWAY | 16,362 (26%) |
| 4 - DARK – ROADWAY NOT LIGHTED | 971 (1.5%) |
| 5 - DARK – UNKNOWN ROADWAY LIGHTING | 402 (0.6%) |
| 9 - OTHER | 57 (<0.1%) |
| 9 - UNKNOWN | 413 (0.7%) |
| Unknown | 5 |
| roadconditionsprimary | |
| 01 - DRY | 47,881 (76%) |
| 02 - WET | 13,134 (21%) |
| 03 - SNOW | 887 (1.4%) |
| 04 - ICE | 502 (0.8%) |
| 05 - SAND, MUD, DIRT, OIL, GRAVEL | 21 (<0.1%) |
| 06 - WATER (STANDING, MOVING) | 24 (<0.1%) |
| 07 - SLUSH | 71 (0.1%) |
| 09 - OTHER | 16 (<0.1%) |
| 09 - UNKNOWN | 425 (0.7%) |
| Unknown | 5 |
| crashlocation | 0 (NA%) |
| Unknown | 62,966 |
| crash_date | 2021-01-23 213.165845961698 |
|
1
n (%); Mean SD
|
|
# Test whether there is a true linear trend over time
mod1 = lm(num_incidents~crash_date,data=daily_crashes)
summary(mod1)
##
## Call:
## lm(formula = num_incidents ~ crash_date, data = daily_crashes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.756 -8.192 -0.959 7.277 86.798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.352e+02 4.077e+01 -5.770 1.16e-08 ***
## crash_date 1.498e-02 2.187e-03 6.848 1.56e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.92 on 746 degrees of freedom
## Multiple R-squared: 0.05915, Adjusted R-squared: 0.05788
## F-statistic: 46.9 on 1 and 746 DF, p-value: 1.565e-11
When we model to test whether there has been any significant impact of a specific year on number of incidents, we observe that p vlaue is <0.05. It means there has been impact of time series on the number of incidents.
plot(mod1)