Wong Yi Wei s3966890
Last updated: 16 October, 2022
There are many accidents happening everyday globally, and factors such as driving under the influence such as alcohol could increase the amount of accidents happening.
With an increased governance and more roadblocks happening from the years since 2015, we will be exploring and analyzing on whether or not there is a upward or downward trend in fatal accidents involving alcohol consumptionIntroduction Cont.
This analysis will be using the data set: “Road Crashes for five Years - Victoria”
In this data, there are 75320 observations and 65 variables.
The important variables that we will be using are listed as below:
ACCIDENT_DATE
SEVERITY
FATALITY
ALCOHOL_RELATED
This data set was imported into R Studio using the read.csv() function, and the following code head() shows the first 6 of the data.
crashes<-read.csv(file="C:/Users/ethan/OneDrive - RMIT University/applied analytics assignment 2/Road_Crashes_for_five_Years_Victoria.csv")
head(crashes)To preprocess the data and get it ready for analysis and visualisation, we will first check the data structure using the str() function. Then, we will make any changes necessary for the data to solve our problem statement. From the output below, we can see that there are 75320 observations with 65 variables and we need to filter and subset the variables ACCIDENT_DATE, SEVERITY, FATALITY, and ALCOHOL_RELATED.
## 'data.frame': 75320 obs. of 65 variables:
## $ X : num 2520841 2526250 2538721 2516548 2485113 ...
## $ Y : num 2392194 2381737 2400595 2389749 2413935 ...
## $ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ACCIDENT_NO : chr "T20150013821" "T20150013832" "T20150013882" "T20150013886" ...
## $ ABS_CODE : chr "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" ...
## $ ACCIDENT_STATUS : chr "Finished" "Finished" "Finished" "Finished" ...
## $ ACCIDENT_DATE : chr "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" ...
## $ ACCIDENT_TIME : chr "1899/12/30 07:00:00+00" "1899/12/30 09:55:00+00" "1899/12/30 21:50:00+00" "1899/12/30 06:00:00+00" ...
## $ ALCOHOLTIME : chr "No" "No" "Yes" "Yes" ...
## $ ACCIDENT_TYPE : chr "Collision with vehicle" "Collision with vehicle" "Collision with a fixed object" "Collision with a fixed object" ...
## $ DAY_OF_WEEK : chr "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
## $ DCA_CODE : chr "REAR END(VEHICLES IN SAME LANE)" "CROSS TRAFFIC(INTERSECTIONS ONLY)" "OFF RIGHT BEND INTO OBJECT/PARKED VEHICLE" "LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE" ...
## $ HIT_RUN_FLAG : chr "No" "No" "No" "No" ...
## $ LIGHT_CONDITION : chr "Dusk/Dawn" "Day" "Dark No street lights" "Dark Street lights on" ...
## $ POLICE_ATTEND : chr "Yes" "Yes" "Yes" "Yes" ...
## $ ROAD_GEOMETRY : chr "Not at intersection" "Cross intersection" "Not at intersection" "T intersection" ...
## $ SEVERITY : chr "Serious injury accident" "Serious injury accident" "Serious injury accident" "Serious injury accident" ...
## $ SPEED_ZONE : chr "100 km/hr" "60 km/hr" "60 km/hr" "80 km/hr" ...
## $ RUN_OFFROAD : chr "No" "No" "Yes" "Yes" ...
## $ NODE_ID : int 291566 35737 292361 282422 64208 291136 41584 291879 43662 291285 ...
## $ LONGITUDE : num 145 145 145 145 145 ...
## $ LATITUDE : num -38 -38.1 -37.9 -38 -37.8 ...
## $ NODE_TYPE : chr "Non-Intersection" "Intersection" "Non-Intersection" "Intersection" ...
## $ LGA_NAME : chr "CASEY" "CASEY" "YARRA RANGES" "DANDENONG" ...
## $ REGION_NAME : chr "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" ...
## $ VICGRID_X : num 2520841 2526250 2538721 2516548 2485113 ...
## $ VICGRID_Y : num 2392194 2381737 2400595 2389749 2413935 ...
## $ TOTAL_PERSONS : int 4 2 1 1 2 3 4 2 2 2 ...
## $ INJ_OR_FATAL : int 1 1 1 1 1 3 2 1 1 1 ...
## $ FATALITY : int 0 0 0 0 0 1 1 0 0 0 ...
## $ SERIOUSINJURY : int 1 1 1 1 0 2 1 0 1 0 ...
## $ OTHERINJURY : int 0 0 0 0 1 0 0 1 0 1 ...
## $ NONINJURED : int 3 1 0 0 1 0 2 1 1 1 ...
## $ MALES : int 2 2 1 1 1 3 2 2 2 1 ...
## $ FEMALES : int 2 0 0 0 1 0 0 0 0 1 ...
## $ BICYCLIST : int 0 0 0 0 0 0 0 1 0 0 ...
## $ PASSENGER : int 0 0 0 0 0 1 1 0 0 0 ...
## $ DRIVER : int 4 2 1 0 2 2 2 1 1 2 ...
## $ PEDESTRIAN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PILLION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MOTORIST : int 0 0 0 1 0 0 1 0 1 0 ...
## $ UNKNOWN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PED_CYCLIST_5_12 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PED_CYCLIST_13_18: int 0 0 0 0 0 0 0 0 0 0 ...
## $ OLD_PEDESTRIAN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OLD_DRIVER : int 0 1 0 0 0 0 0 0 0 0 ...
## $ YOUNG_DRIVER : int 2 0 0 0 0 1 0 0 0 0 ...
## $ ALCOHOL_RELATED : chr "No" "No" "No" "No" ...
## $ UNLICENCSED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_VEHICLES : int 4 2 1 1 2 2 3 2 2 2 ...
## $ HEAVYVEHICLE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PASSENGERVEHICLE : int 4 1 1 0 2 2 2 1 1 2 ...
## $ MOTORCYCLE : int 0 0 0 1 0 0 1 0 1 0 ...
## $ PUBLICVEHICLE : int 0 1 0 0 0 0 0 0 0 0 ...
## $ DEG_URBAN_NAME : chr "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
## $ DEG_URBAN_ALL : chr "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
## $ LGA_NAME_ALL : chr "CASEY" "CASEY" "YARRA RANGES" "DANDENONG" ...
## $ REGION_NAME_ALL : chr "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN SOUTH EAST REGION" ...
## $ SRNS : chr "M" "C" "C" "" ...
## $ SRNS_ALL : chr "M" "C" "C" "" ...
## $ RMA : chr "Freeway" "Arterial Other" "Arterial Other" "Arterial Highway" ...
## $ RMA_ALL : chr "Freeway" "Arterial Other,Local Road" "Arterial Other" "Local Road,Arterial Highway" ...
## $ DIVIDED : chr "Divided" "Divided" "Undivided" "Divided" ...
## $ DIVIDED_ALL : chr "Divided" "Divided,Undivided" "Undivided" "Undivided,Divided" ...
## $ STAT_DIV_NAME : chr "Metro" "Metro" "Metro" "Metro" ...
As we have explained from the previous slide, we will then subset and filter the data into a new data called ‘crashes2’. From here, we can see that the column names have to be changed to make it more easier to read, and the date has to be changed to only show the year and month. Besides that, we will have to make the severity into a factor.
## 'data.frame': 75320 obs. of 4 variables:
## $ ACCIDENT_DATE : chr "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" "2015/07/01 00:00:00+00" ...
## $ SEVERITY : chr "Serious injury accident" "Serious injury accident" "Serious injury accident" "Serious injury accident" ...
## $ FATALITY : int 0 0 0 0 0 1 1 0 0 0 ...
## $ ALCOHOL_RELATED: chr "No" "No" "No" "No" ...
colnames(crashes2)<-c("Year","Severity","Fatality","Alcohol-Related")
crashes2$Year<-as.numeric(format(as.Date(crashes2$Year, format="%Y/%m/%d"),"%Y"))
crashes2$Severity<-factor(crashes2$Severity,levels = c("Serious injury accident","Other injury accident","Fatal accident","Non injury accident"),labels=c("Serious","Other","Fatal","No"))After formatting the correct data type for each of the variables, we now have to filter out the data and make it so that it only shows alcohol related accidents. Here, we have also sorted the year column into an ascending order using the arrange() function.
Besides that. we have also grouped the accidents by year into a new data called ‘alcohol.crashes2’, and used the summary() function to provide the data summary for the accidents.
alcohol.crashes2<-alcohol.crashes %>% mutate(Accident=case_when(!is.na(alcohol.crashes$Year)~"1",is.na(alcohol.crashes$Year)~"0"))
alcohol.crashes2 %>% group_by(alcohol.crashes2$Year) %>% summarise_at(vars(-Year),funs(mean(alcohol.crashes2$Accident,na.rm=TRUE)))crashes2015<-filter(alcohol.crashes2,Year=="2015") %>% group_by(Year) %>% summarise("No of Accidents" = n())
alcohol.crashes2$Accident<-as.numeric(alcohol.crashes2$Accident)
alcohol.crashes3<-alcohol.crashes %>% group_by(Year) %>% summarise("No of Accidents" = n())
summary(alcohol.crashes2)## Year Severity Fatality Alcohol-Related Accident
## Min. :2015 Serious:1317 Min. :0.00000 Length:2256 Min. :1
## 1st Qu.:2016 Other : 889 1st Qu.:0.00000 Class :character 1st Qu.:1
## Median :2017 Fatal : 50 Median :0.00000 Mode :character Median :1
## Mean :2017 No : 0 Mean :0.02482 Mean :1
## 3rd Qu.:2019 3rd Qu.:0.00000 3rd Qu.:1
## Max. :2020 Max. :2.00000 Max. :1
Using the ggplot() function, we have made a bar chart showing the alcohol-related accidents from the years 2015 to 2020. Surprisingly, the graph has shown a downward trend in accidents with the year 2017 having the most alcohol related accidents.
However, there was a sharp drop in alcohol-related accidents in 2020, presumably that factors such as Covid-19 lock downs were implemented at that time.
alcohol.crashes4<-data.frame(alcohol.crashes3)
ggplot(alcohol.crashes4,aes(x=Year,y=No.of.Accidents))+geom_bar(width=0.3,stat="identity",color="white") + ggtitle("Alcohol-Related Accidents (2015-2020)")A table containing the summary of the data is as shown below. Here we can see that the population mean for the number of accidents throughout the years are 376, with the highest accident count of 500.
alcohol.crashes3 %>% group_by(Year) %>%summarise(Min = min(`No of Accidents`,na.rm = TRUE),
Q1 = quantile(`No of Accidents`,probs = .25,na.rm = TRUE),
Median = median(`No of Accidents`, na.rm = TRUE),
Q3 = quantile(`No of Accidents`,probs = .75,na.rm = TRUE),
Max = max(`No of Accidents`,na.rm = TRUE),
Mean = mean(`No of Accidents`, na.rm = TRUE),
SD = sd(`No of Accidents`, na.rm = TRUE),
n = n(),
Missing = sum(is.na(`No of Accidents`))) -> table1
knitr::kable(table1)| Year | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| 2015 | 247 | 247 | 247 | 247 | 247 | 247 | NA | 1 | 0 |
| 2016 | 481 | 481 | 481 | 481 | 481 | 481 | NA | 1 | 0 |
| 2017 | 500 | 500 | 500 | 500 | 500 | 500 | NA | 1 | 0 |
| 2018 | 457 | 457 | 457 | 457 | 457 | 457 | NA | 1 | 0 |
| 2019 | 430 | 430 | 430 | 430 | 430 | 430 | NA | 1 | 0 |
| 2020 | 141 | 141 | 141 | 141 | 141 | 141 | NA | 1 | 0 |
## Year No of Accidents
## Min. :2015 Min. :141.0
## 1st Qu.:2016 1st Qu.:292.8
## Median :2018 Median :443.5
## Mean :2018 Mean :376.0
## 3rd Qu.:2019 3rd Qu.:475.0
## Max. :2020 Max. :500.0
In this presentation, we will be using the two tailed, one-sample t-test as to test on whether or not there is a difference from the population mean. As we have seen in the descriptive statistics, we will confirm on whether or not there is a decrease in alcohol-related accidents. We would assume that the result will show that there is an decrease in alcohol-related accidents throughout the years. Since the data is <30, we will have to adjust and use a t-value test.
The null hypothesis is stated as follows:
\[H_0: \mu_1 = 376 \]The alternate hypothesis is stated as follows:
\[H_A: \mu_1 \lt 376\]
In this step, we have to test our hypothesis. We will use the two tailed, one-sample t-test to look for the mean difference where mu = 376.
##
## One Sample t-test
##
## data: alcohol.crashes4$No.of.Accidents
## t = 0, df = 5, p-value = 1
## alternative hypothesis: true mean is not equal to 376
## 95 percent confidence interval:
## 221.948 530.052
## sample estimates:
## mean of x
## 376
Here we will check whether or not the data follows a normal distribution. From here we can see that since there is a small sample size, we cannot determine whether if it follows a normal distribution.
A two-tailed, one sample t-test was used to determine if the mean alcohol-related accidents throughout the years were significantly different from the population mean of 376 crashes. The significance level of 0.05 was used. The sample mean of alcohol-related accidents was 376 accidents. Since p,1 >0.01, we can reject the null hypothesis.