Between the years 2015 - 2020, was there a upward or downward trend in accidents involving alcohol?

An analysis using data provided by the Victorian State Government

Wong Yi Wei s3966890

Last updated: 16 October, 2022

Introduction

There are many accidents happening everyday globally, and factors such as driving under the influence such as alcohol could increase the amount of accidents happening.

With an increased governance and more roadblocks happening from the years since 2015, we will be exploring and analyzing on whether or not there is a upward or downward trend in fatal accidents involving alcohol consumptionIntroduction Cont.

This analysis will be using the data set: “Road Crashes for five Years - Victoria”

Problem Statement

Data

Data Cont.

Data Cont.

This data set was imported into R Studio using the read.csv() function, and the following code head() shows the first 6 of the data.

crashes<-read.csv(file="C:/Users/ethan/OneDrive - RMIT University/applied analytics assignment 2/Road_Crashes_for_five_Years_Victoria.csv")

head(crashes)

Data Cont.

To preprocess the data and get it ready for analysis and visualisation, we will first check the data structure using the str() function. Then, we will make any changes necessary for the data to solve our problem statement. From the output below, we can see that there are 75320 observations with 65 variables and we need to filter and subset the variables ACCIDENT_DATE, SEVERITY, FATALITY, and ALCOHOL_RELATED.

str(crashes)
## 'data.frame':    75320 obs. of  65 variables:
##  $ X                : num  2520841 2526250 2538721 2516548 2485113 ...
##  $ Y                : num  2392194 2381737 2400595 2389749 2413935 ...
##  $ OBJECTID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ACCIDENT_NO      : chr  "T20150013821" "T20150013832" "T20150013882" "T20150013886" ...
##  $ ABS_CODE         : chr  "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" ...
##  $ ACCIDENT_STATUS  : chr  "Finished" "Finished" "Finished" "Finished" ...
##  $ ACCIDENT_DATE    : chr  "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" ...
##  $ ACCIDENT_TIME    : chr  "1899/12/30 07:00:00+00" "1899/12/30 09:55:00+00" "1899/12/30 21:50:00+00" "1899/12/30 06:00:00+00" ...
##  $ ALCOHOLTIME      : chr  "No" "No" "Yes" "Yes" ...
##  $ ACCIDENT_TYPE    : chr  "Collision with vehicle" "Collision with vehicle" "Collision with a fixed object" "Collision with a fixed object" ...
##  $ DAY_OF_WEEK      : chr  "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
##  $ DCA_CODE         : chr  "REAR END(VEHICLES IN SAME LANE)" "CROSS TRAFFIC(INTERSECTIONS ONLY)" "OFF RIGHT BEND INTO OBJECT/PARKED VEHICLE" "LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE" ...
##  $ HIT_RUN_FLAG     : chr  "No" "No" "No" "No" ...
##  $ LIGHT_CONDITION  : chr  "Dusk/Dawn" "Day" "Dark No street lights" "Dark Street lights on" ...
##  $ POLICE_ATTEND    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ ROAD_GEOMETRY    : chr  "Not at intersection" "Cross intersection" "Not at intersection" "T intersection" ...
##  $ SEVERITY         : chr  "Serious injury accident" "Serious injury accident" "Serious injury accident" "Serious injury accident" ...
##  $ SPEED_ZONE       : chr  "100 km/hr" "60 km/hr" "60 km/hr" "80 km/hr" ...
##  $ RUN_OFFROAD      : chr  "No" "No" "Yes" "Yes" ...
##  $ NODE_ID          : int  291566 35737 292361 282422 64208 291136 41584 291879 43662 291285 ...
##  $ LONGITUDE        : num  145 145 145 145 145 ...
##  $ LATITUDE         : num  -38 -38.1 -37.9 -38 -37.8 ...
##  $ NODE_TYPE        : chr  "Non-Intersection" "Intersection" "Non-Intersection" "Intersection" ...
##  $ LGA_NAME         : chr  "CASEY" "CASEY" "YARRA RANGES" "DANDENONG" ...
##  $ REGION_NAME      : chr  "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" ...
##  $ VICGRID_X        : num  2520841 2526250 2538721 2516548 2485113 ...
##  $ VICGRID_Y        : num  2392194 2381737 2400595 2389749 2413935 ...
##  $ TOTAL_PERSONS    : int  4 2 1 1 2 3 4 2 2 2 ...
##  $ INJ_OR_FATAL     : int  1 1 1 1 1 3 2 1 1 1 ...
##  $ FATALITY         : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ SERIOUSINJURY    : int  1 1 1 1 0 2 1 0 1 0 ...
##  $ OTHERINJURY      : int  0 0 0 0 1 0 0 1 0 1 ...
##  $ NONINJURED       : int  3 1 0 0 1 0 2 1 1 1 ...
##  $ MALES            : int  2 2 1 1 1 3 2 2 2 1 ...
##  $ FEMALES          : int  2 0 0 0 1 0 0 0 0 1 ...
##  $ BICYCLIST        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ PASSENGER        : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ DRIVER           : int  4 2 1 0 2 2 2 1 1 2 ...
##  $ PEDESTRIAN       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PILLION          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MOTORIST         : int  0 0 0 1 0 0 1 0 1 0 ...
##  $ UNKNOWN          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PED_CYCLIST_5_12 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PED_CYCLIST_13_18: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OLD_PEDESTRIAN   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OLD_DRIVER       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ YOUNG_DRIVER     : int  2 0 0 0 0 1 0 0 0 0 ...
##  $ ALCOHOL_RELATED  : chr  "No" "No" "No" "No" ...
##  $ UNLICENCSED      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NO_OF_VEHICLES   : int  4 2 1 1 2 2 3 2 2 2 ...
##  $ HEAVYVEHICLE     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PASSENGERVEHICLE : int  4 1 1 0 2 2 2 1 1 2 ...
##  $ MOTORCYCLE       : int  0 0 0 1 0 0 1 0 1 0 ...
##  $ PUBLICVEHICLE    : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ DEG_URBAN_NAME   : chr  "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
##  $ DEG_URBAN_ALL    : chr  "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
##  $ LGA_NAME_ALL     : chr  "CASEY" "CASEY" "YARRA RANGES" "DANDENONG" ...
##  $ REGION_NAME_ALL  : chr  "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" ...
##  $ SRNS             : chr  "M" "C" "C" "" ...
##  $ SRNS_ALL         : chr  "M" "C" "C" "" ...
##  $ RMA              : chr  "Freeway" "Arterial Other" "Arterial Other" "Arterial Highway" ...
##  $ RMA_ALL          : chr  "Freeway" "Arterial Other,Local Road" "Arterial Other" "Local Road,Arterial Highway" ...
##  $ DIVIDED          : chr  "Divided" "Divided" "Undivided" "Divided" ...
##  $ DIVIDED_ALL      : chr  "Divided" "Divided,Undivided" "Undivided" "Undivided,Divided" ...
##  $ STAT_DIV_NAME    : chr  "Metro" "Metro" "Metro" "Metro" ...

Data Cont.

As we have explained from the previous slide, we will then subset and filter the data into a new data called ‘crashes2’. From here, we can see that the column names have to be changed to make it more easier to read, and the date has to be changed to only show the year and month. Besides that, we will have to make the severity into a factor.

crashes2<-crashes[ ,c('ACCIDENT_DATE','SEVERITY','FATALITY','ALCOHOL_RELATED')]

str(crashes2)
## 'data.frame':    75320 obs. of  4 variables:
##  $ ACCIDENT_DATE  : chr  "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" ...
##  $ SEVERITY       : chr  "Serious injury accident" "Serious injury accident" "Serious injury accident" "Serious injury accident" ...
##  $ FATALITY       : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ ALCOHOL_RELATED: chr  "No" "No" "No" "No" ...
colnames(crashes2)<-c("Year","Severity","Fatality","Alcohol-Related")

crashes2$Year<-as.numeric(format(as.Date(crashes2$Year, format="%Y/%m/%d"),"%Y"))

crashes2$Severity<-factor(crashes2$Severity,levels = c("Serious injury accident","Other injury accident","Fatal accident","Non injury accident"),labels=c("Serious","Other","Fatal","No"))

Data Cont.

After formatting the correct data type for each of the variables, we now have to filter out the data and make it so that it only shows alcohol related accidents. Here, we have also sorted the year column into an ascending order using the arrange() function.

alcohol.crashes<-crashes2 %>% filter(`Alcohol-Related`=="Yes")

Data Cont.

Besides that. we have also grouped the accidents by year into a new data called ‘alcohol.crashes2’, and used the summary() function to provide the data summary for the accidents.

alcohol.crashes2<-alcohol.crashes %>% mutate(Accident=case_when(!is.na(alcohol.crashes$Year)~"1",is.na(alcohol.crashes$Year)~"0"))

alcohol.crashes2 %>% group_by(alcohol.crashes2$Year) %>% summarise_at(vars(-Year),funs(mean(alcohol.crashes2$Accident,na.rm=TRUE)))
crashes2015<-filter(alcohol.crashes2,Year=="2015") %>% group_by(Year) %>% summarise("No of Accidents" = n())


alcohol.crashes2$Accident<-as.numeric(alcohol.crashes2$Accident)

alcohol.crashes3<-alcohol.crashes %>% group_by(Year) %>% summarise("No of Accidents" = n())

summary(alcohol.crashes2)
##       Year         Severity       Fatality       Alcohol-Related       Accident
##  Min.   :2015   Serious:1317   Min.   :0.00000   Length:2256        Min.   :1  
##  1st Qu.:2016   Other  : 889   1st Qu.:0.00000   Class :character   1st Qu.:1  
##  Median :2017   Fatal  :  50   Median :0.00000   Mode  :character   Median :1  
##  Mean   :2017   No     :   0   Mean   :0.02482                      Mean   :1  
##  3rd Qu.:2019                  3rd Qu.:0.00000                      3rd Qu.:1  
##  Max.   :2020                  Max.   :2.00000                      Max.   :1

Descriptive Statistics and Visualisation

Using the ggplot() function, we have made a bar chart showing the alcohol-related accidents from the years 2015 to 2020. Surprisingly, the graph has shown a downward trend in accidents with the year 2017 having the most alcohol related accidents.

However, there was a sharp drop in alcohol-related accidents in 2020, presumably that factors such as Covid-19 lock downs were implemented at that time.

alcohol.crashes4<-data.frame(alcohol.crashes3)


ggplot(alcohol.crashes4,aes(x=Year,y=No.of.Accidents))+geom_bar(width=0.3,stat="identity",color="white") + ggtitle("Alcohol-Related Accidents (2015-2020)")

Decsriptive Statistics Cont.

A table containing the summary of the data is as shown below. Here we can see that the population mean for the number of accidents throughout the years are 376, with the highest accident count of 500.

alcohol.crashes3 %>% group_by(Year) %>%summarise(Min = min(`No of Accidents`,na.rm = TRUE),
                               Q1 = quantile(`No of Accidents`,probs = .25,na.rm = TRUE),
                               Median = median(`No of Accidents`, na.rm = TRUE),
                                           Q3 = quantile(`No of Accidents`,probs = .75,na.rm = TRUE),
                               Max = max(`No of Accidents`,na.rm = TRUE),
                               Mean = mean(`No of Accidents`, na.rm = TRUE),
                               SD = sd(`No of Accidents`, na.rm = TRUE),
                                n = n(),
                               Missing = sum(is.na(`No of Accidents`))) -> table1
knitr::kable(table1)
Year Min Q1 Median Q3 Max Mean SD n Missing
2015 247 247 247 247 247 247 NA 1 0
2016 481 481 481 481 481 481 NA 1 0
2017 500 500 500 500 500 500 NA 1 0
2018 457 457 457 457 457 457 NA 1 0
2019 430 430 430 430 430 430 NA 1 0
2020 141 141 141 141 141 141 NA 1 0

Decsriptive Statistics Cont.

print(summary(alcohol.crashes3))
##       Year      No of Accidents
##  Min.   :2015   Min.   :141.0  
##  1st Qu.:2016   1st Qu.:292.8  
##  Median :2018   Median :443.5  
##  Mean   :2018   Mean   :376.0  
##  3rd Qu.:2019   3rd Qu.:475.0  
##  Max.   :2020   Max.   :500.0

Hypothesis Testing

In this presentation, we will be using the two tailed, one-sample t-test as to test on whether or not there is a difference from the population mean. As we have seen in the descriptive statistics, we will confirm on whether or not there is a decrease in alcohol-related accidents. We would assume that the result will show that there is an decrease in alcohol-related accidents throughout the years. Since the data is <30, we will have to adjust and use a t-value test.

The null hypothesis is stated as follows:

\[H_0: \mu_1 = 376 \]The alternate hypothesis is stated as follows:

\[H_A: \mu_1 \lt 376\]

Hypthesis Testing Cont.

In this step, we have to test our hypothesis. We will use the two tailed, one-sample t-test to look for the mean difference where mu = 376.

t.test(alcohol.crashes4$No.of.Accidents,mu=376,conf.level=.95)
## 
##  One Sample t-test
## 
## data:  alcohol.crashes4$No.of.Accidents
## t = 0, df = 5, p-value = 1
## alternative hypothesis: true mean is not equal to 376
## 95 percent confidence interval:
##  221.948 530.052
## sample estimates:
## mean of x 
##       376

Hypthesis Testing Cont.

Here we will check whether or not the data follows a normal distribution. From here we can see that since there is a small sample size, we cannot determine whether if it follows a normal distribution.

alcohol.crashes4$No.of.Accidents %>% qqPlot(dist="norm")

Hypthesis Testing Cont.

A two-tailed, one sample t-test was used to determine if the mean alcohol-related accidents throughout the years were significantly different from the population mean of 376 crashes. The significance level of 0.05 was used. The sample mean of alcohol-related accidents was 376 accidents. Since p,1 >0.01, we can reject the null hypothesis.

Discussion

References