Between the years 2015 - 2020, was there a upward or downward trend in accidents involving alcohol?

An analysis using data provided by the Victorian State Government

Wong Yi Wei s3966890

Last updated: 16 October, 2022

Introduction

There are many accidents happening everyday globally, and factors such as driving under the influence such as alcohol could increase the amount of accidents happening.

With an increased governance and more roadblocks happening from the years since 2015, we will be exploring and analyzing on whether or not there is a upward or downward trend in fatal accidents involving alcohol consumptionIntroduction Cont.

This analysis will be using the data set: “Road Crashes for five Years - Victoria”

Problem Statement

This presentation aims to explore and analyze on whether there is a upward or downward trend in fatal accidents involving alcohol consumption for the years 2010 - 2015.
We will be using hypothesis testing, as well as statistical graphs such to make an decision on our conclusion.

Data

The data we will be using is “Road Crashes for five Years - Victoria” as obtained from the Data VIC website.
https://discover.data.vic.gov.au/dataset/road-crashes-for-five-years-victoria
The data is collected using reported fatal and injury crashes on Victorian roads during the years 2015-2020.

Data Cont.

In this data, there are 75320 observations and 65 variables.
The important variables that we will be using are listed as below:
- ACCIDENT_DATE
  - This would show us the year on when the accident had happened.
- SEVERITY
  - This variable shows whether or not the accident was fatal.
- FATALITY
  - This variable accounts for the fatality and lives lost in the accident.
- ALCOHOL_RELATED
  - This variable tells us whether if the accident was caused by alcohol consumption.

Data Cont.

This data set was imported into R Studio using the read.csv() function, and the following code head() shows the first 6 of the data.

crashes<-read.csv(file="C:/Users/ethan/OneDrive - RMIT University/applied analytics assignment 2/Road_Crashes_for_five_Years_Victoria.csv")

head(crashes)

Data Cont.

To preprocess the data and get it ready for analysis and visualisation, we will first check the data structure using the str() function. Then, we will make any changes necessary for the data to solve our problem statement. From the output below, we can see that there are 75320 observations with 65 variables and we need to filter and subset the variables ACCIDENT_DATE, SEVERITY, FATALITY, and ALCOHOL_RELATED.

str(crashes)

## 'data.frame':    75320 obs. of  65 variables:
##  $ X                : num  2520841 2526250 2538721 2516548 2485113 ...
##  $ Y                : num  2392194 2381737 2400595 2389749 2413935 ...
##  $ OBJECTID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ACCIDENT_NO      : chr  "T20150013821" "T20150013832" "T20150013882" "T20150013886" ...
##  $ ABS_CODE         : chr  "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" ...
##  $ ACCIDENT_STATUS  : chr  "Finished" "Finished" "Finished" "Finished" ...
##  $ ACCIDENT_DATE    : chr  "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" ...
##  $ ACCIDENT_TIME    : chr  "1899/12/30 07:00:00+00" "1899/12/30 09:55:00+00" "1899/12/30 21:50:00+00" "1899/12/30 06:00:00+00" ...
##  $ ALCOHOLTIME      : chr  "No" "No" "Yes" "Yes" ...
##  $ ACCIDENT_TYPE    : chr  "Collision with vehicle" "Collision with vehicle" "Collision with a fixed object" "Collision with a fixed object" ...
##  $ DAY_OF_WEEK      : chr  "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
##  $ DCA_CODE         : chr  "REAR END(VEHICLES IN SAME LANE)" "CROSS TRAFFIC(INTERSECTIONS ONLY)" "OFF RIGHT BEND INTO OBJECT/PARKED VEHICLE" "LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE" ...
##  $ HIT_RUN_FLAG     : chr  "No" "No" "No" "No" ...
##  $ LIGHT_CONDITION  : chr  "Dusk/Dawn" "Day" "Dark No street lights" "Dark Street lights on" ...
##  $ POLICE_ATTEND    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ ROAD_GEOMETRY    : chr  "Not at intersection" "Cross intersection" "Not at intersection" "T intersection" ...
##  $ SEVERITY         : chr  "Serious injury accident" "Serious injury accident" "Serious injury accident" "Serious injury accident" ...
##  $ SPEED_ZONE       : chr  "100 km/hr" "60 km/hr" "60 km/hr" "80 km/hr" ...
##  $ RUN_OFFROAD      : chr  "No" "No" "Yes" "Yes" ...
##  $ NODE_ID          : int  291566 35737 292361 282422 64208 291136 41584 291879 43662 291285 ...
##  $ LONGITUDE        : num  145 145 145 145 145 ...
##  $ LATITUDE         : num  -38 -38.1 -37.9 -38 -37.8 ...
##  $ NODE_TYPE        : chr  "Non-Intersection" "Intersection" "Non-Intersection" "Intersection" ...
##  $ LGA_NAME         : chr  "CASEY" "CASEY" "YARRA RANGES" "DANDENONG" ...
##  $ REGION_NAME      : chr  "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" ...
##  $ VICGRID_X        : num  2520841 2526250 2538721 2516548 2485113 ...
##  $ VICGRID_Y        : num  2392194 2381737 2400595 2389749 2413935 ...
##  $ TOTAL_PERSONS    : int  4 2 1 1 2 3 4 2 2 2 ...
##  $ INJ_OR_FATAL     : int  1 1 1 1 1 3 2 1 1 1 ...
##  $ FATALITY         : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ SERIOUSINJURY    : int  1 1 1 1 0 2 1 0 1 0 ...
##  $ OTHERINJURY      : int  0 0 0 0 1 0 0 1 0 1 ...
##  $ NONINJURED       : int  3 1 0 0 1 0 2 1 1 1 ...
##  $ MALES            : int  2 2 1 1 1 3 2 2 2 1 ...
##  $ FEMALES          : int  2 0 0 0 1 0 0 0 0 1 ...
##  $ BICYCLIST        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ PASSENGER        : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ DRIVER           : int  4 2 1 0 2 2 2 1 1 2 ...
##  $ PEDESTRIAN       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PILLION          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MOTORIST         : int  0 0 0 1 0 0 1 0 1 0 ...
##  $ UNKNOWN          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PED_CYCLIST_5_12 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PED_CYCLIST_13_18: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OLD_PEDESTRIAN   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OLD_DRIVER       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ YOUNG_DRIVER     : int  2 0 0 0 0 1 0 0 0 0 ...
##  $ ALCOHOL_RELATED  : chr  "No" "No" "No" "No" ...
##  $ UNLICENCSED      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NO_OF_VEHICLES   : int  4 2 1 1 2 2 3 2 2 2 ...
##  $ HEAVYVEHICLE     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PASSENGERVEHICLE : int  4 1 1 0 2 2 2 1 1 2 ...
##  $ MOTORCYCLE       : int  0 0 0 1 0 0 1 0 1 0 ...
##  $ PUBLICVEHICLE    : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ DEG_URBAN_NAME   : chr  "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
##  $ DEG_URBAN_ALL    : chr  "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
##  $ LGA_NAME_ALL     : chr  "CASEY" "CASEY" "YARRA RANGES" "DANDENONG" ...
##  $ REGION_NAME_ALL  : chr  "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" ...
##  $ SRNS             : chr  "M" "C" "C" "" ...
##  $ SRNS_ALL         : chr  "M" "C" "C" "" ...
##  $ RMA              : chr  "Freeway" "Arterial Other" "Arterial Other" "Arterial Highway" ...
##  $ RMA_ALL          : chr  "Freeway" "Arterial Other,Local Road" "Arterial Other" "Local Road,Arterial Highway" ...
##  $ DIVIDED          : chr  "Divided" "Divided" "Undivided" "Divided" ...
##  $ DIVIDED_ALL      : chr  "Divided" "Divided,Undivided" "Undivided" "Undivided,Divided" ...
##  $ STAT_DIV_NAME    : chr  "Metro" "Metro" "Metro" "Metro" ...

Data Cont.

As we have explained from the previous slide, we will then subset and filter the data into a new data called ‘crashes2’. From here, we can see that the column names have to be changed to make it more easier to read, and the date has to be changed to only show the year and month. Besides that, we will have to make the severity into a factor.

crashes2<-crashes[ ,c('ACCIDENT_DATE','SEVERITY','FATALITY','ALCOHOL_RELATED')]

str(crashes2)

## 'data.frame':    75320 obs. of  4 variables:
##  $ ACCIDENT_DATE  : chr  "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" ...
##  $ SEVERITY       : chr  "Serious injury accident" "Serious injury accident" "Serious injury accident" "Serious injury accident" ...
##  $ FATALITY       : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ ALCOHOL_RELATED: chr  "No" "No" "No" "No" ...

colnames(crashes2)<-c("Year","Severity","Fatality","Alcohol-Related")

crashes2$Year<-as.numeric(format(as.Date(crashes2$Year, format="%Y/%m/%d"),"%Y"))

crashes2$Severity<-factor(crashes2$Severity,levels = c("Serious injury accident","Other injury accident","Fatal accident","Non injury accident"),labels=c("Serious","Other","Fatal","No"))

Data Cont.

After formatting the correct data type for each of the variables, we now have to filter out the data and make it so that it only shows alcohol related accidents. Here, we have also sorted the year column into an ascending order using the arrange() function.

alcohol.crashes<-crashes2 %>% filter(`Alcohol-Related`=="Yes")

Data Cont.

Besides that. we have also grouped the accidents by year into a new data called ‘alcohol.crashes2’, and used the summary() function to provide the data summary for the accidents.

alcohol.crashes2<-alcohol.crashes %>% mutate(Accident=case_when(!is.na(alcohol.crashes$Year)~"1",is.na(alcohol.crashes$Year)~"0"))

alcohol.crashes2 %>% group_by(alcohol.crashes2$Year) %>% summarise_at(vars(-Year),funs(mean(alcohol.crashes2$Accident,na.rm=TRUE)))

crashes2015<-filter(alcohol.crashes2,Year=="2015") %>% group_by(Year) %>% summarise("No of Accidents" = n())


alcohol.crashes2$Accident<-as.numeric(alcohol.crashes2$Accident)

alcohol.crashes3<-alcohol.crashes %>% group_by(Year) %>% summarise("No of Accidents" = n())

summary(alcohol.crashes2)

##       Year         Severity       Fatality       Alcohol-Related       Accident
##  Min.   :2015   Serious:1317   Min.   :0.00000   Length:2256        Min.   :1  
##  1st Qu.:2016   Other  : 889   1st Qu.:0.00000   Class :character   1st Qu.:1  
##  Median :2017   Fatal  :  50   Median :0.00000   Mode  :character   Median :1  
##  Mean   :2017   No     :   0   Mean   :0.02482                      Mean   :1  
##  3rd Qu.:2019                  3rd Qu.:0.00000                      3rd Qu.:1  
##  Max.   :2020                  Max.   :2.00000                      Max.   :1

Descriptive Statistics and Visualisation

Using the ggplot() function, we have made a bar chart showing the alcohol-related accidents from the years 2015 to 2020. Surprisingly, the graph has shown a downward trend in accidents with the year 2017 having the most alcohol related accidents.

However, there was a sharp drop in alcohol-related accidents in 2020, presumably that factors such as Covid-19 lock downs were implemented at that time.

alcohol.crashes4<-data.frame(alcohol.crashes3)


ggplot(alcohol.crashes4,aes(x=Year,y=No.of.Accidents))+geom_bar(width=0.3,stat="identity",color="white") + ggtitle("Alcohol-Related Accidents (2015-2020)")

Decsriptive Statistics Cont.

A table containing the summary of the data is as shown below. Here we can see that the population mean for the number of accidents throughout the years are 376, with the highest accident count of 500.

alcohol.crashes3 %>% group_by(Year) %>%summarise(Min = min(`No of Accidents`,na.rm = TRUE),
                               Q1 = quantile(`No of Accidents`,probs = .25,na.rm = TRUE),
                               Median = median(`No of Accidents`, na.rm = TRUE),
                                           Q3 = quantile(`No of Accidents`,probs = .75,na.rm = TRUE),
                               Max = max(`No of Accidents`,na.rm = TRUE),
                               Mean = mean(`No of Accidents`, na.rm = TRUE),
                               SD = sd(`No of Accidents`, na.rm = TRUE),
                                n = n(),
                               Missing = sum(is.na(`No of Accidents`))) -> table1
knitr::kable(table1)

Year	Min	Q1	Median	Q3	Max	Mean	SD	n
2015	247	247	247	247	247	247	NA	1
2016	481	481	481	481	481	481	NA	1
2017	500	500	500	500	500	500	NA	1
2018	457	457	457	457	457	457	NA	1
2019	430	430	430	430	430	430	NA	1
2020	141	141	141	141	141	141	NA	1

Decsriptive Statistics Cont.

print(summary(alcohol.crashes3))

##       Year      No of Accidents
##  Min.   :2015   Min.   :141.0  
##  1st Qu.:2016   1st Qu.:292.8  
##  Median :2018   Median :443.5  
##  Mean   :2018   Mean   :376.0  
##  3rd Qu.:2019   3rd Qu.:475.0  
##  Max.   :2020   Max.   :500.0

Hypothesis Testing

In this presentation, we will be using the two tailed, one-sample t-test as to test on whether or not there is a difference from the population mean. As we have seen in the descriptive statistics, we will confirm on whether or not there is a decrease in alcohol-related accidents. We would assume that the result will show that there is an decrease in alcohol-related accidents throughout the years. Since the data is <30, we will have to adjust and use a t-value test.

The null hypothesis is stated as follows:

\[H_0: \mu_1 = 376 \]The alternate hypothesis is stated as follows:

\[H_A: \mu_1 \lt 376\]

Hypthesis Testing Cont.

In this step, we have to test our hypothesis. We will use the two tailed, one-sample t-test to look for the mean difference where mu = 376.

t.test(alcohol.crashes4$No.of.Accidents,mu=376,conf.level=.95)

## 
##  One Sample t-test
## 
## data:  alcohol.crashes4$No.of.Accidents
## t = 0, df = 5, p-value = 1
## alternative hypothesis: true mean is not equal to 376
## 95 percent confidence interval:
##  221.948 530.052
## sample estimates:
## mean of x 
##       376

Hypthesis Testing Cont.

Here we will check whether or not the data follows a normal distribution. From here we can see that since there is a small sample size, we cannot determine whether if it follows a normal distribution.

alcohol.crashes4$No.of.Accidents %>% qqPlot(dist="norm")

Hypthesis Testing Cont.

A two-tailed, one sample t-test was used to determine if the mean alcohol-related accidents throughout the years were significantly different from the population mean of 376 crashes. The significance level of 0.05 was used. The sample mean of alcohol-related accidents was 376 accidents. Since p,1 >0.01, we can reject the null hypothesis.

Discussion

The test result showed that there were a difference in alcohol-related accidents between 2015-2020.
We can safely say that there was an decrease in alcohol-related accidents.
However, there is a chance that the factor when the pandemic hit where lock downs were imposed, fewer people got into accidents as there were less vehicles on the road.
The limitation of this study was that due to the small sample size when showed in years, there were difficulties in the findings.
For the next study, I would propose that there would be more sample sizes based on the year for a more accurate result.

Between the years 2015 - 2020, was there a upward or downward trend in accidents involving alcohol?

An analysis using data provided by the Victorian State Government

RPubs link information

Introduction

Problem Statement

Data

Data Cont.

Data Cont.

Data Cont.

Data Cont.

Data Cont.

Data Cont.

Descriptive Statistics and Visualisation

Decsriptive Statistics Cont.

Decsriptive Statistics Cont.

Hypothesis Testing

Hypthesis Testing Cont.

Hypthesis Testing Cont.

Hypthesis Testing Cont.

Discussion

References