library(tidyr)
library(dplyr)
library(readr)
library(forecast)
library(stringr)
library(MVN)
library(outliers)
The data was collected for the possible analysis of fatal crashes on Victorian roads. The initial dataset chosen was the Fatal Crashes - Lives Lost, Last 5 Years to Date dataset from VicRoads. However, important information such as the speed zone of the accident, whether or not alcohol was involved, and the number of fatalities and serious injuries from the accident was missing. Therefore, the aforementioned dataset was merged with the Crashes Last Five Years dataset in order to include the relevant information.
Initially, the labels of the character variables were scanned for any values that could be renamed/binned, or changed to NA. The data types were then converted into their proper form. The conversions included integer to factor variables, character to factor variables and character to date variables. The variable explaining alcohol involvement in the crash was labelled and the variable explaining the speed zone of the area in which the crash took place was relabelled and ordered.
A new variable was created to show the tally of serious injuries and fatalities that occurred in each accident. A full scan for missing values, inconsistent values and numeric outliers then took place, and it was decided to remove any missing value or outlier from the dataset. The age of the victims in each observation was transformed using a Box-Cox transformation, with an optimal lambda value of 0.34, in an attempt to normalise the distribution. However, the Shapiro-Wilk test for normality showed that the transformation failed to normalise the data. Finally, the time variable was cleaned into the format HH:MM:SS.
Overall, the cleaned dataset is tidy, with variables in their proper type, and without any outliers and inconsistent or missing values.
The first dataset used is called Fatal Crashes - Lives Lost, Last 5 Years to Date. This dataset contains information on fatal crashes on Victorian roads from the 1st of January 2013 to the 2nd of May 2018. The data was provided by VicRoads and is available by following the link:
The dataset contains 1,361 observations of 25 variables.
The second dataset used is called Crashes Last Five Years. This dataset contains information on all crashes on Victorian roads resulting in a fatality and/or injury, from the 1st of January 2012 to the 1st of September 2017. The data was provided by VicRoads and is available by following the link:
The dataset contains 76,451 observations of 65 variables. This data contains some useful information that is not present in the other dataset, such as the speed zone in which the accident occurred, whether or not the accident was alcohol related, and the total number of fatalities and serious injuries resulting from the accident.
The variables selected to be merged from each dataset are described below (definitions taken from the data sources, provided by the links above):
ACCIDENT_NO: Unique identifier for each accident in Victoria. Used as the key on which the two datasets were joined.
SERIOUSINJURY: The number of persons seriously injured in the crash.
FATALITY: The number of persons killed in the crash.
NO_OF_VEHICLES: Number of vehicles involved in the crash.
ACCIDENT_TIME: The time of the accident in 24 hour format (HHMMSS). Commonly rounded to the nearest 5 minutes or hour.
AGE: The age of the person at the time of the accident.
ACCIDENT_DATE: Accident date (YYYY-MM-DD).
SEX: The gender of the person involved in the accident.
ROAD_USER_TYPE_DESC: The role of the person at the time of the accident (e.g. Driver, Passenger, Bicyclist, etc.).
ACCIDENT_TYPE_DESC: The basic description of the type of accident (e.g. Collision with vehicle, Struck pedestrian, etc.).
LGA_NAME: The name of the Local Government Area in which the accident took place (locations in brackets are unincorporated areas).
STAT_DIV_NAME: The Metro or Country region where the crash occurred.
NODE_TYPE_DESC: Whether the accident occurred at an intersection or non-intersection.
ROAD_NUM: Road classification number, as determined by VicRoads. 2000-2999 = Freeway & State Highways, 3000-3999 = Forest roads, 4000-4999 = Tourist roads, 5000-5999 = Main Roads, 9999 = Local Roads.
SPEED_ZONE: The speed zone of the location where the accident occurred.
ALCOHOL_RELATED: Whether or not the accident was alcohol related (BAC>0.001).
PERSON_ID: Identifier for the people in an accident. Used to differentiate the people involved in the crash when there are multiple victims.
NODE_ID: A unique identifier for each accident location.
## Data Import
# Setting the working directory
setwd("/Users/MyMac/OneDrive/Documents/Analytics/Year 2/Data Preprocessing/Assignment 3")
# Importing datasets using the read.csv function
allcrash = read.csv("All_Crashes.csv",header = T, stringsAsFactors = F)
fatalcrash = read.csv("Fatal_Crashes.csv",header=T, stringsAsFactors = F)
# Selecting relevant variables
allcrash = allcrash[,c("ACCIDENT_NO","SPEED_ZONE","ALCOHOL_RELATED","FATALITY","SERIOUSINJURY")]
head(allcrash,4)
## ACCIDENT_NO SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY
## 1 T20120013207 60 km/hr No 0 0
## 2 T20120013209 60 km/hr No 0 0
## 3 T20120013212 40 km/hr No 0 0
## 4 T20120013213 50 km/hr No 0 1
fatalcrash = fatalcrash[,c("ACCIDENT_NO","ACCIDENT_DATE","ACCIDENT_TIME",
"NODE_ID","SEX","AGE","ROAD_USER_TYPE_DESC",
"ACCIDENT_TYPE_DESC","NO_OF_VEHICLES","LGA_NAME",
"STAT_DIV_NAME","NODE_TYPE_DESC","ROAD_NUM","PERSON_ID")]
head(fatalcrash,4)
## ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007 2017-03-25T00:00:00.000Z 41400 315856 M 24
## 2 T20160003982 2016-02-22T00:00:00.000Z 171700 299896 F 28
## 3 T20150022025 2015-10-24T00:00:00.000Z 203000 69771 F 80
## 4 T20160018085 2016-08-18T00:00:00.000Z 82600 50579 M 50
## ROAD_USER_TYPE_DESC ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1 Drivers Collision with a fixed object 1
## 2 Drivers Collision with a fixed object 1
## 3 Pedestrians Struck Pedestrian 2
## 4 Drivers Collision with vehicle 2
## LGA_NAME STAT_DIV_NAME NODE_TYPE_DESC ROAD_NUM
## 1 BASS COAST Country Non Intersection 0
## 2 EAST GIPPSLAND Country Non Intersection 5560
## 3 KINGSTON Melbourne - South East Intersection 5803
## 4 COLAC OTWAY Country Intersection 9999
## PERSON_ID
## 1 A
## 2 A
## 3 01
## 4 B
# Merging data - Requirement 1
# Using inner_join function from dplyr package
crashcombined = inner_join(fatalcrash,allcrash,by = "ACCIDENT_NO")
head(crashcombined,4)
## ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007 2017-03-25T00:00:00.000Z 41400 315856 M 24
## 2 T20160003982 2016-02-22T00:00:00.000Z 171700 299896 F 28
## 3 T20150022025 2015-10-24T00:00:00.000Z 203000 69771 F 80
## 4 T20160018085 2016-08-18T00:00:00.000Z 82600 50579 M 50
## ROAD_USER_TYPE_DESC ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1 Drivers Collision with a fixed object 1
## 2 Drivers Collision with a fixed object 1
## 3 Pedestrians Struck Pedestrian 2
## 4 Drivers Collision with vehicle 2
## LGA_NAME STAT_DIV_NAME NODE_TYPE_DESC ROAD_NUM
## 1 BASS COAST Country Non Intersection 0
## 2 EAST GIPPSLAND Country Non Intersection 5560
## 3 KINGSTON Melbourne - South East Intersection 5803
## 4 COLAC OTWAY Country Intersection 9999
## PERSON_ID SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY
## 1 A 80 km/hr No 1 0
## 2 A 100 km/hr No 1 0
## 3 01 60 km/hr No 1 0
## 4 B 100 km/hr No 1 0
## Checking the dimensions of each dataset:
dim(allcrash)
## [1] 76451 5
dim(fatalcrash)
## [1] 1361 14
dim(crashcombined)
## [1] 1152 18
Since the date ranges of the two datasets are slightly different, the inner_join function was used to ensure that only crashes present in both datasets were merged. The merged dataset (crashcombined) contains 1152 observations. Hence, 209 observations from the fatalcrash dataset were omitted, as there was no corresponding information for these observations in the allcrash dataset. The dimension check confirms that all selected variables were retained in the combined dataset. The merged dataset contains 1152 observations of 18 variables.
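The omitted records can be identified explicitly if needed. The sketch below uses dplyr's anti_join, which is not part of the original workflow, purely as a check on the count quoted above.
# Sketch: fatal crash records with no match in allcrash
unmatched = anti_join(fatalcrash, allcrash, by = "ACCIDENT_NO")
nrow(unmatched) # should equal 209 (1361 - 1152), provided each matched ACCIDENT_NO is unique in allcrash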
The dataset was inspected by analysing its dimensions, structure, and character variable labels.
## Requirement 2
# Viewing the dimensions and structure of the combined data:
dim(crashcombined)
## [1] 1152 18
str(crashcombined)
## 'data.frame': 1152 obs. of 18 variables:
## $ ACCIDENT_NO : chr "T20170006007" "T20160003982" "T20150022025" "T20160018085" ...
## $ ACCIDENT_DATE : chr "2017-03-25T00:00:00.000Z" "2016-02-22T00:00:00.000Z" "2015-10-24T00:00:00.000Z" "2016-08-18T00:00:00.000Z" ...
## $ ACCIDENT_TIME : int 41400 171700 203000 82600 85500 85500 85500 85500 230000 170000 ...
## $ NODE_ID : int 315856 299896 69771 50579 48626 48626 48626 48626 264090 311008 ...
## $ SEX : chr "M" "F" "F" "M" ...
## $ AGE : int 24 28 80 50 6 2 33 37 27 58 ...
## $ ROAD_USER_TYPE_DESC: chr "Drivers" "Drivers" "Pedestrians" "Drivers" ...
## $ ACCIDENT_TYPE_DESC : chr "Collision with a fixed object" "Collision with a fixed object" "Struck Pedestrian" "Collision with vehicle" ...
## $ NO_OF_VEHICLES : int 1 1 2 2 2 2 2 2 1 1 ...
## $ LGA_NAME : chr "BASS COAST" "EAST GIPPSLAND" "KINGSTON" "COLAC OTWAY" ...
## $ STAT_DIV_NAME : chr "Country" "Country" "Melbourne - South East" "Country" ...
## $ NODE_TYPE_DESC : chr "Non Intersection" "Non Intersection" "Intersection" "Intersection" ...
## $ ROAD_NUM : int 0 5560 5803 9999 9999 9999 9999 9999 2570 9999 ...
## $ PERSON_ID : chr "A" "A" "01" "B" ...
## $ SPEED_ZONE : chr "80 km/hr" "100 km/hr" "60 km/hr" "100 km/hr" ...
## $ ALCOHOL_RELATED : chr "No" "No" "No" "No" ...
## $ FATALITY : int 1 1 1 1 4 4 4 4 1 1 ...
## $ SERIOUSINJURY : int 0 0 0 0 0 0 0 0 0 0 ...
The initial dataset contains character and integer variables. The ACCIDENT_NO, ACCIDENT_TIME, AGE, NO_OF_VEHICLES, FATALITY and SERIOUSINJURY variables will initially remain unchanged, whereas the other variables will undergo data type conversions, ordering and labelling. This process is described below.
## Unique values for each character variable
# Finding possible labels to bin
# tail used to see if NA's exist as they'll appear at the end of the list
tail(table(crashcombined$ACCIDENT_NO, useNA = "ifany"),5)
##
## T20170013947 T20170013970 T20170014068 T20170014249 T20170014368
## 1 1 1 1 1
tail(table(crashcombined$ACCIDENT_DATE, useNA = "ifany"),5)
##
## 2017-07-17T00:00:00.000Z 2017-07-18T00:00:00.000Z 2017-07-20T00:00:00.000Z
## 1 2 1
## 2017-07-22T00:00:00.000Z 2017-07-24T00:00:00.000Z
## 1 1
table(crashcombined$SEX, useNA = "ifany")
##
## F M
## 338 814
table(crashcombined$ROAD_USER_TYPE_DESC, useNA = "ifany")
##
## Bicyclists Drivers Motorcyclists
## 40 568 173
## Passengers Pedestrians Pillion Passengers
## 202 166 3
table(crashcombined$ACCIDENT_TYPE_DESC, useNA = "ifany")
##
## Collision with a fixed object Collision with some other object
## 397 15
## Collision with vehicle Fall from or in moving vehicle
## 519 7
## No collision and no object struck Other accident
## 7 2
## Struck animal Struck Pedestrian
## 7 157
## Vehicle overturned (no collision)
## 41
table(crashcombined$SPEED_ZONE, useNA = "ifany")
##
## 100 km/hr 110 km/hr
## 528 50
## 40 km/hr 50 km/hr
## 15 84
## 60 km/hr 70 km/hr
## 206 70
## 75 km/hr 80 km/hr
## 1 175
## 90 km/hr Camping grounds or off road
## 11 1
## Not known Other speed limit
## 10 1
tail(table(crashcombined$ROAD_NUM, useNA = "ifany"),5)
##
## 7430 7576 7598 7921 9999
## 1 1 1 2 304
tail(table(crashcombined$LGA_NAME, useNA = "ifany"),5)
##
## WODONGA WYNDHAM YARRA YARRA RANGES YARRIAMBIACK
## 10 27 6 36 3
table(crashcombined$STAT_DIV_NAME, useNA = "ifany")
##
## Ballarat Bendigo Country
## 62 41 528
## Hume Melbourne - Inner Melbourne - Inner East
## 1 55 27
## Melbourne - Inner South Melbourne - North East Melbourne - North West
## 45 39 47
## Melbourne - Outer East Melbourne - South East Melbourne - West
## 70 100 93
## Metro
## 44
table(crashcombined$NODE_TYPE_DESC, useNA = "ifany")
##
## Intersection Non Intersection Unknown
## 322 828 2
table(crashcombined$PERSON_ID, useNA = "ifany")
##
## 01 02 03 04 05 06 07 08 09 10 11 A B C D E
## 283 36 27 10 4 3 3 1 1 1 2 608 152 18 2 1
table(crashcombined$ALCOHOL_RELATED, useNA = "ifany")
##
## No Yes
## 1002 150
The unique values contained in each character variable were listed using contingency tables. This not only displays any unknown/other/NA values, but also reveals similar labels in each variable that can be combined into one category. For example, the No collision and no object struck and Vehicle overturned (no collision) levels from the ACCIDENT_TYPE_DESC variable can be combined into one No Collision level.
#_____________________________________________________________________________
## Converting unknown/other values into "NA" values.
crashcombined$SPEED_ZONE[
crashcombined$SPEED_ZONE %in% c("Camping grounds or off road",
"Not known","Other speed limit"
)] = NA
# Removing "km/hr" from values
crashcombined$SPEED_ZONE = chartr(old = "km/hr", new = " ", crashcombined$SPEED_ZONE)
# Removing blank space
crashcombined$SPEED_ZONE = str_trim(crashcombined$SPEED_ZONE, side = "both")
crashcombined$NODE_TYPE_DESC[
crashcombined$NODE_TYPE_DESC %in% c("Unknown")] = NA
crashcombined$ACCIDENT_TYPE_DESC[
crashcombined$ACCIDENT_TYPE_DESC %in% c("Other accident")] = NA
#____________________________________________________________________________
# Binning values into road categories
# Comparisons are made on a numeric copy, since assigning the first label coerces the column to character
road_num = crashcombined$ROAD_NUM
crashcombined$ROAD_NUM[road_num >= 2000 & road_num < 3000] = "Freeway and State Highways"
crashcombined$ROAD_NUM[road_num >= 3000 & road_num < 4000] = "Forest Roads"
crashcombined$ROAD_NUM[road_num >= 4000 & road_num < 5000] = "Tourist Roads"
crashcombined$ROAD_NUM[road_num >= 5000 & road_num < 6000] = "Main Roads"
crashcombined$ROAD_NUM[road_num >= 7000 & road_num < 8000] = "Ramps"
crashcombined$ROAD_NUM[road_num == 9999] = "Local Roads"
# There is no explanation for 0 and -1 values in the dataset documentation
crashcombined$ROAD_NUM[road_num == 0 | road_num == -1] = NA
## Binning observations into fewer categories
crashcombined$ACCIDENT_TYPE_DESC[
crashcombined$ACCIDENT_TYPE_DESC %in% c("No collision and no object struck",
"Vehicle overturned (no collision)"
)] <- "No Collision"
#____________________________________________________________________________
## Converting variables - Requirement 3
# Converting to factor
crashcombined$ALCOHOL_RELATED = as.factor(crashcombined$ALCOHOL_RELATED)
crashcombined$ROAD_NUM = as.factor(crashcombined$ROAD_NUM)
crashcombined$SEX = as.factor(crashcombined$SEX)
crashcombined$ACCIDENT_TYPE_DESC = as.factor(crashcombined$ACCIDENT_TYPE_DESC)
crashcombined$ROAD_USER_TYPE_DESC = as.factor(crashcombined$ROAD_USER_TYPE_DESC)
crashcombined$LGA_NAME = as.factor(crashcombined$LGA_NAME)
crashcombined$NODE_TYPE_DESC = as.factor(crashcombined$NODE_TYPE_DESC)
crashcombined$PERSON_ID = as.factor(crashcombined$PERSON_ID)
crashcombined$STAT_DIV_NAME = as.factor(crashcombined$STAT_DIV_NAME)
crashcombined$NODE_ID = as.factor(crashcombined$NODE_ID)
# Converting to date
# Converts to YYYY-MM-DD
crashcombined$ACCIDENT_DATE=as.Date(crashcombined$ACCIDENT_DATE)
# Check
# Dates should fall between 1st Jan 2013 and 1st Sep 2017
min(crashcombined$ACCIDENT_DATE)
## [1] "2013-01-02"
max(crashcombined$ACCIDENT_DATE)
## [1] "2017-07-24"
##___________________________________________________
## Changing names of factor levels - Requirement 4
# Labelling levels
crashcombined$ALCOHOL_RELATED =
factor(crashcombined$ALCOHOL_RELATED,
levels = c("Yes","No"),labels = c(1,0))
crashcombined$SPEED_ZONE[
crashcombined$SPEED_ZONE %in% c('30', '40','50')] <- "Low"
crashcombined$SPEED_ZONE[
crashcombined$SPEED_ZONE %in% c('60', '70','75','80')] <- "Med"
crashcombined$SPEED_ZONE[
crashcombined$SPEED_ZONE %in% c('90', '100','110')] <- "High"
# Ordering levels
crashcombined$SPEED_ZONE = ordered(crashcombined$SPEED_ZONE,
levels = c("Low","Med", "High"))
Various variables were converted into dates, factors and ordered factors. The conversions applied to the variables are described below:
ACCIDENT_DATE: Initially a character variable - converted to a date variable using the as.Date function.
SEX: Initially a character variable - converted to a factor variable using the as.factor function. This is a categorical variable since there are no measurable differences between the values.
ROAD_USER_TYPE_DESC: Initially a character variable - converted to a factor variable using the as.factor function. This is a categorical variable.
ACCIDENT_TYPE_DESC: Initially a character variable - converted to a factor variable using the as.factor function. The No collision and no object struck and Vehicle overturned (no collision) levels were combined into one No Collision level. This is a categorical variable.
LGA_NAME: Initially a character variable - converted to a factor variable using the as.factor function. This is a categorical variable.
STAT_DIV_NAME: Initially a character variable - converted to a factor variable using the as.factor function. This is a categorical variable.
NODE_TYPE_DESC: Initially a character variable - converted to a factor variable using the as.factor function. This is a categorical variable.
ROAD_NUM: Initially an integer variable - converted to a factor variable with values binned into the following categories: 2000-2999 = Freeway and State Highways, 3000-3999 = Forest Roads, 4000-4999 = Tourist Roads, 5000-5999 = Main Roads, 7000-7999 = Ramps, 9999 = Local Roads, 0 or -1 = unexplained in the documentation (converted to NA). These categories are in accordance with VicRoads classifications. This is a categorical variable.
SPEED_ZONE: Initially a character variable - relabelled and converted to an ordinal variable. The speed limits were binned into the following categories: 40-50 km/hr = Low, 60-80 km/hr = Med, 90-110 km/hr = High. These levels were binned into fewer categories in order to reduce the complexity of future modelling.
ALCOHOL_RELATED: Initially a character variable - converted to a binary factor. 1 = crash was alcohol related, 0 = crash was not alcohol related.
PERSON_ID: Initially a character variable - converted to a factor variable using the as.factor function. This is a categorical variable.
NODE_ID: Initially an integer variable - converted to a factor variable. This variable is not treated as an integer because each value represents a unique location; the order of, and differences between, the values are irrelevant, making it a categorical variable.
There were also four variables found to have unknown/other labels (ACCIDENT_TYPE_DESC, NODE_TYPE_DESC, ROAD_NUM and SPEED_ZONE); these labels were changed to NA values.
In a tidy dataset, each variable should be stored in its own column and each observation in its own row. The duplicated function was used to check that the observations are unique.
# Requirement 5
# Counting the number of observations that are, or are not duplicates.
table(duplicated(crashcombined))
##
## FALSE
## 1152
# Viewing the variables contained in the dataset
head(crashcombined,3)
## ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007 2017-03-25 41400 315856 M 24
## 2 T20160003982 2016-02-22 171700 299896 F 28
## 3 T20150022025 2015-10-24 203000 69771 F 80
## ROAD_USER_TYPE_DESC ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1 Drivers Collision with a fixed object 1
## 2 Drivers Collision with a fixed object 1
## 3 Pedestrians Struck Pedestrian 2
## LGA_NAME STAT_DIV_NAME NODE_TYPE_DESC ROAD_NUM
## 1 BASS COAST Country Non Intersection <NA>
## 2 EAST GIPPSLAND Country Non Intersection Main Roads
## 3 KINGSTON Melbourne - South East Intersection Main Roads
## PERSON_ID SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY
## 1 A Med 0 1 0
## 2 A High 0 1 0
## 3 01 Med 0 1 0
names(crashcombined)
## [1] "ACCIDENT_NO" "ACCIDENT_DATE" "ACCIDENT_TIME"
## [4] "NODE_ID" "SEX" "AGE"
## [7] "ROAD_USER_TYPE_DESC" "ACCIDENT_TYPE_DESC" "NO_OF_VEHICLES"
## [10] "LGA_NAME" "STAT_DIV_NAME" "NODE_TYPE_DESC"
## [13] "ROAD_NUM" "PERSON_ID" "SPEED_ZONE"
## [16] "ALCOHOL_RELATED" "FATALITY" "SERIOUSINJURY"
By using the duplicated function, it was confirmed that all observations are unique, as none of them are duplicates. By analysing the variables and column names, it is clear that each variable is measured in only one column. Furthermore, each cell contains only one value. Hence, it can be concluded that the dataset is tidy.
The SER_FAT variable was added to the dataset to measure the total number of serious injuries and fatalities from the accident in each observation. This provides a better picture of the severity of an accident. For example, two accidents with the same number of fatalities may have different numbers of serious injuries, indicating that one accident was more severe than the other.
# mutate function used to create new variable - Requirement 6
crashcombined = crashcombined %>% mutate(SER_FAT = FATALITY+SERIOUSINJURY)
# Check if procedure worked
dim(crashcombined)
## [1] 1152 19
head(crashcombined,2)
## ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007 2017-03-25 41400 315856 M 24
## 2 T20160003982 2016-02-22 171700 299896 F 28
## ROAD_USER_TYPE_DESC ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1 Drivers Collision with a fixed object 1
## 2 Drivers Collision with a fixed object 1
## LGA_NAME STAT_DIV_NAME NODE_TYPE_DESC ROAD_NUM PERSON_ID
## 1 BASS COAST Country Non Intersection <NA> A
## 2 EAST GIPPSLAND Country Non Intersection Main Roads A
## SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY SER_FAT
## 1 Med 0 1 0 1
## 2 High 0 1 0 1
The mutate function successfully created the new SER_FAT variable.
Since every observation should come from a fatal crash, there should not be any observations with 0 fatalities. Hence, all the values of the FATALITY variable are checked.
# Scanning for 0 or NA fatalities
table(crashcombined$FATALITY,useNA="ifany") # Frequency table of values in FATALITY
##
## 0 1 2 3 4
## 2 995 122 21 12
It was found that the FATALITY variable contained two 0 values. This is a potential inconsistency because all observations should be from fatal crashes (due to the inner join with the fatalcrash dataset), hence all values in this column should be greater than or equal to 1. A possible explanation for this occurrence is that two victims were injured in the accidents and subsequently died at a later date in hospital (after the allcrashes dataset was last updated and before the fatalcrashes dataset was last updated).
## Investigating inconsistent values
# Check if observations with 0 fatalities had serious injuries
crashcombined$SERIOUSINJURY[crashcombined$FATALITY==0]
## [1] 1 1
# Checking the dates of accidents with 0 recorded fatalities
crashcombined$ACCIDENT_DATE[crashcombined$FATALITY==0]
## [1] "2017-05-08" "2017-07-17"
The observations with 0 fatalities each recorded 1 serious injury. Furthermore, these accidents occurred on the 8th of May and the 17th of July 2017, with the allcrashes dataset last updated on the 1st of September 2017 and the fatalcrashes dataset last updated on the 2nd of May 2018. Since there is a distinct possibility that the victims were seriously injured in their accidents and later died within the eight month gap between the dataset updates, these observations are not treated as inconsistencies.
In order to find any other missing values or inconsistencies, functions were created to find finite, infinite and NaN values across the whole crashcombined dataset. These functions work by creating contingency tables of the TRUE and FALSE counts for each variable in the dataset after the is.finite, is.infinite and is.nan functions are run. If the contingency table for each variable contains only FALSE values (i.e. 0 TRUEs) or only TRUE values (i.e. 0 FALSEs), then there are no special values in the data. The is.na function was also used to find any missing values, and complete.cases was used to determine the number of observations with missing values.
# Creation of functions to find special vals over whole dataset - Requirement 7
is.finitefunc <- function(x){
temp = seq(from = 1, to=length(x[1,]))
for (i in 1:length(x[1,])) {
temp[i] = length(table(is.finite(x[,i])))
# If there are special values, the length of the table will be >1
}
# Mean of 1 means that all tables either have 0 TRUE's or 0 FALSE's
if (mean(temp)==1){
return("All values in dataset are finite")
}else {return("There are non-finite values in the dataset")}
}
is.infinitefunc <- function(x){
temp = seq(from = 1, to=length(x[1,]))
for (i in 1:length(x[1,])) {
temp[i] = length(table(is.infinite(x[,i])))
# If there are special values, the length of the table will be >1
}
if (mean(temp)==1){
return("All values in dataset are not infinite")
}else {return("There are infinite values in the dataset")}
}
is.nanfunc <- function(x){
temp = seq(from = 1, to=length(x[1,]))
for (i in 1:length(x[1,])) {
temp[i] = length(table(is.nan(x[,i])))
# If there are special values, the length of the table will be >1
}
if (mean(temp)==1){
return("There are no NaN values")
}else {return("There are one or more NaN values ")}
}
is.finitefunc(crashcombined)
## [1] "There are non-finite values in the dataset"
is.infinitefunc(crashcombined)
## [1] "All values in dataset are not infinite"
is.nanfunc(crashcombined)
## [1] "There are no NaN values"
sum(is.na(crashcombined)) # Number of NA values
## [1] 101
# Finding the number of incomplete cases (observations with at least one missing value)
numincomplete =
length(crashcombined[,1]) - length((crashcombined[complete.cases(crashcombined),])[,1])
numincomplete
## [1] 95
# Proportion of observations that are missing (%)
numincomplete*100/length(crashcombined[,1])
## [1] 8.246528
There were no infinite or NaN values found in the dataset, but there were non-finite values. This was expected since various labels were converted to NA values in the earlier processing stage. The is.na function found 101 NA values in the dataset, and the complete.cases function found 95 incomplete cases, indicating that some observations contain multiple NA values.
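To see where these missing values sit, the per-column NA counts can be checked with a one-line sketch (not part of the original workflow):
# Number of NA values in each variable
colSums(is.na(crashcombined))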
It was found that around 8% of the observations contain missing values. Since many (if not all) of these missing values came from the categorical variables that were relabelled in the earlier stages, it was decided to simply remove these observations in order to avoid the bias caused by imputing their mode. A sketch of the rejected alternative is shown below.
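For comparison only, a minimal sketch of mode imputation for a factor column is shown here; the get_mode helper is hypothetical and this step was not applied to the data.
# Illustration of mode imputation for a factor column (rejected in favour of removal)
get_mode = function(x) names(sort(table(x), decreasing = TRUE))[1]
imputed_road_num = crashcombined$ROAD_NUM
imputed_road_num[is.na(imputed_road_num)] = get_mode(imputed_road_num) # replaces NAs with the most frequent level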
## Removing missing values
crashcombined = crashcombined[complete.cases(crashcombined),]
## Checking that all missing values are removed
length(crashcombined[,1]) - length((crashcombined[complete.cases(crashcombined),])[,1])
## [1] 0
is.finitefunc(crashcombined)
## [1] "All values in dataset are finite"
is.infinitefunc(crashcombined)
## [1] "All values in dataset are not infinite"
is.nanfunc(crashcombined)
## [1] "There are no NaN values"
sum(is.na(crashcombined))
## [1] 0
There were no inconsistent or missing values after the incomplete cases were removed.
Univariate outliers were detected by finding, for each numeric variable, the observations with z-scores of absolute value greater than 3.
# Checking all numeric variables for outliers - Requirement 8
# Using outliers package
z.scores2 <- crashcombined$AGE %>% scores(type = "z")
which( abs(z.scores2) >3 )
## integer(0)
z.scores3 <- crashcombined$NO_OF_VEHICLES %>% scores(type = "z")
length(which( abs(z.scores3) >3 ))
## [1] 13
table(crashcombined$NO_OF_VEHICLES, useNA = "ifany")
##
## 0 1 2 3 4 5 6 10 11
## 1 521 438 70 14 10 1 1 1
z.scores4 <- crashcombined$SERIOUSINJURY %>% scores(type = "z")
length(which( abs(z.scores4) >3 ))
## [1] 28
table(crashcombined$SERIOUSINJURY, useNA = "ifany")
##
## 0 1 2 3 4 5 6 7 10
## 716 221 62 30 12 8 3 2 3
z.scores1 <- crashcombined$FATALITY %>% scores(type = "z")
length(which( abs(z.scores1) >3 ))
## [1] 33
table(crashcombined$FATALITY, useNA = "ifany")
##
## 0 1 2 3 4
## 2 902 120 21 12
z.scores5 <- crashcombined$SER_FAT %>% scores(type = "z")
length(which( abs(z.scores5) >3 ))
## [1] 27
table(crashcombined$SER_FAT, useNA = "ifany")
##
## 1 2 3 4 5 6 7 9 13
## 650 223 99 48 10 8 14 2 3
z.scores6 <- crashcombined$ACCIDENT_TIME %>% scores(type = "z")
which( abs(z.scores6) >3 )
## integer(0)
The contingency tables of the numeric variables with outliers give an idea of the values these outliers take. For example, it is clear that the 27 outliers from the SER_FAT variable came from accidents with 6 or more serious injuries or fatalities. These univariate outliers may be considered freak occurrences and will therefore be removed from the dataset.
# Removal of univariate outliers (unique() guards against the same row being listed more than once)
outliers = unique(c(which(abs(z.scores1) > 3), which(abs(z.scores3) > 3),
                    which(abs(z.scores4) > 3), which(abs(z.scores5) > 3)))
crashcombined = crashcombined[-outliers,]
Some algorithms work best without multivariate outliers, so these were detected using Mahalanobis distances.
# Subset of non-discrete variables
numeric.vars = crashcombined %>%
dplyr::select(AGE,ACCIDENT_TIME)
outlier.det = MVN::mvn(numeric.vars, multivariateOutlierMethod = "adj",
showOutliers = TRUE)
The mvn function did not work when the NO_OF_VEHICLES, SERIOUSINJURY and FATALITY variables were analysed. This is because these variables are discrete and can only take whole numbers. The adjusted quantile method was used to find the multivariate outliers between AGE and ACCIDENT_TIME, as this method has been found to be more robust (Filzmoser, 2004). No multivariate outliers were found using this method.
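As a cross-check on the same idea, the classical Mahalanobis distance approach can be sketched in base R. This illustrates the underlying distance calculation only; it is not the adjusted quantile method used above, and the 97.5% chi-square cutoff is an assumed convention.
# Sketch: classical Mahalanobis distances for AGE and ACCIDENT_TIME
md = mahalanobis(numeric.vars, center = colMeans(numeric.vars), cov = cov(numeric.vars))
cutoff = qchisq(0.975, df = ncol(numeric.vars)) # chi-square cutoff with 2 degrees of freedom
which(md > cutoff) # row indices flagged as potential multivariate outliers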
The AGE variable was tested for normality using the Shapiro-Wilk test. After rejecting the null hypothesis that the variable was normally distributed, a Box-Cox transformation with a lambda value of 0.336 was applied to the data in an attempt to normalise it.
# Requirement 9
# Histogram of Age variable
hist(crashcombined$AGE, main = "Age distribution before transformation")
# Testing for normality
shapiro.test(crashcombined$AGE)
##
## Shapiro-Wilk normality test
##
## data: crashcombined$AGE
## W = 0.95957, p-value = 6.644e-16
# Finding lambda value
lambda<- forecast::BoxCox.lambda(crashcombined$AGE)
lambda
## [1] 0.3360749
# Applying transformation
crashcombined$AGE=(crashcombined$AGE^lambda-1)/lambda
# Normality test after transformation
shapiro.test(crashcombined$AGE)
##
## Shapiro-Wilk normality test
##
## data: crashcombined$AGE
## W = 0.96331, p-value = 4.546e-15
# Histogram after transformation
hist(crashcombined$AGE, main= "Age distribution after transformation")
After transforming the AGE variable, the Shapiro-Wilk test on the transformed data still rejects the null hypothesis of normality, indicating that the normalisation failed.
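An optional visual diagnostic, such as a normal Q-Q plot, can be used to support the same conclusion (a sketch, not part of the required processing):
# Optional visual check of normality after the transformation
qqnorm(crashcombined$AGE, main = "Normal Q-Q plot of transformed AGE")
qqline(crashcombined$AGE)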
The raw time variable is difficult to interpret, as it stores 24-hour times as integers with leading zeros dropped (e.g. 82600 for 08:26:00). Therefore, character manipulation is performed to put the times into the HH:MM:SS format, making them easier to interpret.
# Variable before cleaning
head(crashcombined$ACCIDENT_TIME,10)
## [1] 171700 203000 82600 230000 170000 105500 131500 131500 25400 181500
# Converting to character variable
crashcombined$ACCIDENT_TIME=as.character(crashcombined$ACCIDENT_TIME)
# Adding 0's to the start of each value less than 6 characters long (str_pad is vectorised, so no loop is needed)
crashcombined$ACCIDENT_TIME = str_pad(crashcombined$ACCIDENT_TIME,
                                      width = 6, side = "left", pad = "0")
# Splitting values into three
a = substr(as.character(crashcombined$ACCIDENT_TIME),1,2)
b = substr(as.character(crashcombined$ACCIDENT_TIME),3,4)
c = substr(as.character(crashcombined$ACCIDENT_TIME),5,6)
# Recombining values with : separator
crashcombined$ACCIDENT_TIME = paste(a,b,c, sep = ":")
# Variable after cleaning
head(crashcombined$ACCIDENT_TIME,10)
## [1] "17:17:00" "20:30:00" "08:26:00" "23:00:00" "17:00:00" "10:55:00"
## [7] "13:15:00" "13:15:00" "02:54:00" "18:15:00"
# Inspecting final dataset
# Checking column names
names(crashcombined)
## [1] "ACCIDENT_NO" "ACCIDENT_DATE" "ACCIDENT_TIME"
## [4] "NODE_ID" "SEX" "AGE"
## [7] "ROAD_USER_TYPE_DESC" "ACCIDENT_TYPE_DESC" "NO_OF_VEHICLES"
## [10] "LGA_NAME" "STAT_DIV_NAME" "NODE_TYPE_DESC"
## [13] "ROAD_NUM" "PERSON_ID" "SPEED_ZONE"
## [16] "ALCOHOL_RELATED" "FATALITY" "SERIOUSINJURY"
## [19] "SER_FAT"
## Structure of Data after processing
str(crashcombined)
## 'data.frame': 989 obs. of 19 variables:
## $ ACCIDENT_NO : chr "T20160003982" "T20150022025" "T20160018085" "T20130019712" ...
## $ ACCIDENT_DATE : Date, format: "2016-02-22" "2015-10-24" ...
## $ ACCIDENT_TIME : chr "17:17:00" "20:30:00" "08:26:00" "23:00:00" ...
## $ NODE_ID : Factor w/ 1063 levels "120","298","393",..: 795 218 170 331 956 891 753 753 268 257 ...
## $ SEX : Factor w/ 2 levels "F","M": 1 1 2 2 2 2 1 1 2 2 ...
## $ AGE : num 6.14 10 8.1 6.03 8.67 ...
## $ ROAD_USER_TYPE_DESC: Factor w/ 6 levels "Bicyclists","Drivers",..: 2 5 2 2 1 4 4 2 2 2 ...
## $ ACCIDENT_TYPE_DESC : Factor w/ 7 levels "Collision with a fixed object",..: 1 7 3 1 4 3 3 3 1 1 ...
## $ NO_OF_VEHICLES : int 1 2 2 1 1 2 2 2 1 1 ...
## $ LGA_NAME : Factor w/ 81 levels "(FALLS CREEK)",..: 24 37 20 69 38 14 12 12 18 38 ...
## $ STAT_DIV_NAME : Factor w/ 13 levels "Ballarat","Bendigo",..: 3 11 3 3 10 12 2 2 11 10 ...
## $ NODE_TYPE_DESC : Factor w/ 2 levels "Intersection",..: 2 1 1 2 2 2 2 2 2 2 ...
## $ ROAD_NUM : Factor w/ 6 levels "Forest Roads",..: 4 4 3 2 3 2 2 2 2 3 ...
## $ PERSON_ID : Factor w/ 16 levels "01","02","03",..: 12 1 13 12 13 1 2 13 12 12 ...
## $ SPEED_ZONE : Ord.factor w/ 3 levels "Low"<"Med"<"High": 3 2 3 3 1 3 3 3 2 2 ...
## $ ALCOHOL_RELATED : Factor w/ 2 levels "1","0": 2 2 2 1 2 2 2 2 1 1 ...
## $ FATALITY : int 1 1 1 1 1 1 2 2 1 1 ...
## $ SERIOUSINJURY : int 0 0 0 0 0 1 2 2 0 0 ...
## $ SER_FAT : int 1 1 1 1 1 2 4 4 1 1 ...
## Dimensions of Data after processing
dim(crashcombined)
## [1] 989 19
#Data
head(crashcombined,10)
## ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 2 T20160003982 2016-02-22 17:17:00 299896 F 6.142848
## 3 T20150022025 2015-10-24 20:30:00 69771 F 10.000590
## 4 T20160018085 2016-08-18 08:26:00 50579 M 8.104610
## 9 T20130019712 2013-09-24 23:00:00 264090 M 6.032079
## 10 T20160025394 2016-10-13 17:00:00 311008 M 8.671306
## 11 T20160009999 2016-05-07 10:55:00 306973 M 6.460152
## 13 T20150025270 2015-12-06 13:15:00 296945 F 7.953637
## 14 T20150025270 2015-12-06 13:15:00 296945 F 8.870385
## 15 T20130000066 2013-01-02 02:54:00 258321 M 5.028728
## 16 T20130000366 2013-01-06 18:15:00 256415 M 5.802087
## ROAD_USER_TYPE_DESC ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 2 Drivers Collision with a fixed object 1
## 3 Pedestrians Struck Pedestrian 2
## 4 Drivers Collision with vehicle 2
## 9 Drivers Collision with a fixed object 1
## 10 Bicyclists Fall from or in moving vehicle 1
## 11 Passengers Collision with vehicle 2
## 13 Passengers Collision with vehicle 2
## 14 Drivers Collision with vehicle 2
## 15 Drivers Collision with a fixed object 1
## 16 Drivers Collision with a fixed object 1
## LGA_NAME STAT_DIV_NAME NODE_TYPE_DESC
## 2 EAST GIPPSLAND Country Non Intersection
## 3 KINGSTON Melbourne - South East Intersection
## 4 COLAC OTWAY Country Intersection
## 9 SWAN HILL Country Non Intersection
## 10 KNOX Melbourne - Outer East Non Intersection
## 11 BRIMBANK Melbourne - West Non Intersection
## 13 BENDIGO Bendigo Non Intersection
## 14 BENDIGO Bendigo Non Intersection
## 15 CASEY Melbourne - South East Non Intersection
## 16 KNOX Melbourne - Outer East Non Intersection
## ROAD_NUM PERSON_ID SPEED_ZONE ALCOHOL_RELATED
## 2 Main Roads A High 0
## 3 Main Roads 01 Med 0
## 4 Local Roads B High 0
## 9 Freeway and State Highways A High 1
## 10 Local Roads B Low 0
## 11 Freeway and State Highways 01 High 0
## 13 Freeway and State Highways 02 High 0
## 14 Freeway and State Highways B High 0
## 15 Freeway and State Highways A Med 1
## 16 Local Roads A Med 1
## FATALITY SERIOUSINJURY SER_FAT
## 2 1 0 1
## 3 1 0 1
## 4 1 0 1
## 9 1 0 1
## 10 1 0 1
## 11 1 1 2
## 13 2 2 4
## 14 2 2 4
## 15 1 0 1
## 16 1 0 1
# Categorical data labels
head(crashcombined$NODE_ID,5)
## [1] 299896 69771 50579 264090 311008
## 1063 Levels: 120 298 393 827 851 873 1317 10429 11397 12239 13567 ... 320341
levels(crashcombined$SEX)
## [1] "F" "M"
levels(crashcombined$ROAD_USER_TYPE_DESC)
## [1] "Bicyclists" "Drivers" "Motorcyclists"
## [4] "Passengers" "Pedestrians" "Pillion Passengers"
levels(crashcombined$ACCIDENT_TYPE_DESC)
## [1] "Collision with a fixed object" "Collision with some other object"
## [3] "Collision with vehicle" "Fall from or in moving vehicle"
## [5] "No Collision" "Struck animal"
## [7] "Struck Pedestrian"
levels(crashcombined$LGA_NAME)
## [1] "(FALLS CREEK)" "(FRENCH ISLAND)" "(MOUNT HOTHAM)"
## [4] "ALPINE" "ARARAT" "BALLARAT"
## [7] "BANYULE" "BASS COAST" "BAW BAW"
## [10] "BAYSIDE" "BENALLA" "BENDIGO"
## [13] "BOROONDARA" "BRIMBANK" "BULOKE"
## [16] "CAMPASPE" "CARDINIA" "CASEY"
## [19] "CENTRAL GOLDFIELDS" "COLAC OTWAY" "CORANGAMITE"
## [22] "DANDENONG" "DAREBIN" "EAST GIPPSLAND"
## [25] "FRANKSTON" "GANNAWARRA" "GEELONG"
## [28] "GLEN EIRA" "GLENELG" "GOLDEN PLAINS"
## [31] "HEPBURN" "HINDMARSH" "HOBSONS BAY"
## [34] "HORSHAM" "HUME" "INDIGO"
## [37] "KINGSTON" "KNOX" "LATROBE"
## [40] "LODDON" "MACEDON RANGES" "MANNINGHAM"
## [43] "MANSFIELD" "MARIBYRNONG" "MAROONDAH"
## [46] "MELBOURNE" "MELTON" "MILDURA"
## [49] "MITCHELL" "MOIRA" "MONASH"
## [52] "MOONEE VALLEY" "MOORABOOL" "MORELAND"
## [55] "MORNINGTON PENINSULA" "MOUNT ALEXANDER" "MOYNE"
## [58] "MURRINDINDI" "NILLUMBIK" "NORTHERN GRAMPIANS"
## [61] "PORT PHILLIP" "PYRENEES" "SHEPPARTON"
## [64] "SOUTH GIPPSLAND" "SOUTHERN GRAMPIANS" "STONNINGTON"
## [67] "STRATHBOGIE" "SURF COAST" "SWAN HILL"
## [70] "TOWONG" "WANGARATTA" "WARRNAMBOOL"
## [73] "WELLINGTON" "WEST WIMMERA" "WHITEHORSE"
## [76] "WHITTLESEA" "WODONGA" "WYNDHAM"
## [79] "YARRA" "YARRA RANGES" "YARRIAMBIACK"
levels(crashcombined$STAT_DIV_NAME)
## [1] "Ballarat" "Bendigo"
## [3] "Country" "Hume"
## [5] "Melbourne - Inner" "Melbourne - Inner East"
## [7] "Melbourne - Inner South" "Melbourne - North East"
## [9] "Melbourne - North West" "Melbourne - Outer East"
## [11] "Melbourne - South East" "Melbourne - West"
## [13] "Metro"
levels(crashcombined$NODE_TYPE_DESC)
## [1] "Intersection" "Non Intersection"
levels(crashcombined$PERSON_ID)
## [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "A" "B" "C"
## [15] "D" "E"
levels(crashcombined$ALCOHOL_RELATED)
## [1] "1" "0"
levels(crashcombined$ROAD_NUM)
## [1] "Forest Roads" "Freeway and State Highways"
## [3] "Local Roads" "Main Roads"
## [5] "Ramps" "Tourist Roads"
levels(crashcombined$SPEED_ZONE)
## [1] "Low" "Med" "High"
Inspection of the final dataset confirms that the cleaned dataset is tidy, with variables in their proper types, and without any outliers, inconsistencies or missing values.
Filzmoser, P. (2004). A multivariate outlier detection method. Belarusian State University, Minsk, 18-22.