Required packages

library(tidyr)
library(dplyr)
library(readr) 
library(forecast)
library(stringr)
library(MVN)
library(outliers)

Executive Summary

The data was collected for the possible analysis of fatal crashes on Victorian roads. The initial dataset chosen was the Fatal Crashes - Lives Lost, Last 5 Years to Date dataset from VicRoads. However, important information such the speed zone of the accident, whether or not alcohol was involved and the number of fatalities and serious injuries from the accident was missing. Therefore, the aforementioned dataset was merged with te Crashes Last Five Years dataset in order to include the relevent information.

Initially, the labels of the character variables were scanned for any values that could be renamed/binned, or changed to NA. The data types were then converted into their proper form. The conversions included integer to factor variables, character to factor variables and character to date variables. The variable explaining alcohol involvement in the crash was labelled and the variable explaining the speed zone of the area in which the crash took place was relabelled and ordered.

A new variable was created in order to show the tally of serious injuries and fatalities that occurred in each accident. A full scan of missing and inconsistent values, and numeric outliers took place. It was decided to remove any missing value or outlier from the dataset. The age of the victims in each observation was transformed using a Box-Cox transformation, with the optimal lambda value being 0.34, in order to convert the distribution into a normal distribution. However, through the use of the Shapiro-Wilk test for normality, the transformation failed to normalise the data. Finally, the time variable was cleaned into the format HH:MM:SS.

Overall, the cleaned dataset is tidy, with variables in their proper type, and without any outliers and inconsistent or missing values.

Data

The first dataset used is called Fatal Crashes - Lives Lost, Last 5 Years to Date. This dataset contains information on fatal crashes on Victorian roads from the 1st of Janurary 2013 to the 2nd of May 2018. The data was provided by VicRoads and is available by following the link:

https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/fatal-crashes-lives-lost-last-5-years-to-date?geometry=113.539%2C-42.506%2C176.689%2C-30.166&orderBy=FD_YESTERDAY_DATE_OF_YEAR.

The dataset contains 1,361 observations of 25 variables.

The second dataset used is called Crashes Last Five Years. This dataset contains information on all crashes on Victorian roads resulting in a fatality and/or injury, from from the 1st of Janurary 2012 to the 1st of Sepember 2017. The data was provided by VicRoads and is available by following the link:

https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/crashes-last-five-years?geometry=129.462%2C-39.619%2C161.037%2C-33.446.

The dataset contains 76,451 observations of 65 variables. This data contains some useful information that is not present in the other dataset, such as the speed zone in which the accident occurred, whether or not the accident was alcohol related, and the total number of fatalities and serious injuries resulting from the accident.

The variables selected to be merged from each dataset are described below (definitions taken from the data sources, provided by the links above):

## Data Import

# Setting the wokring directory
setwd("/Users/MyMac/OneDrive/Documents/Analytics/Year 2/Data Preprocessing/Assignment 3")

# Importing datasets using the read.csv function
allcrash = read.csv("All_Crashes.csv",header = T, stringsAsFactors = F)
fatalcrash = read.csv("Fatal_Crashes.csv",header=T, stringsAsFactors = F)

# Selecting relevent variables
allcrash = allcrash[,c("ACCIDENT_NO","SPEED_ZONE","ALCOHOL_RELATED","FATALITY","SERIOUSINJURY")]
head(allcrash,4)
##    ACCIDENT_NO SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY
## 1 T20120013207   60 km/hr              No        0             0
## 2 T20120013209   60 km/hr              No        0             0
## 3 T20120013212   40 km/hr              No        0             0
## 4 T20120013213   50 km/hr              No        0             1
fatalcrash = fatalcrash[,c("ACCIDENT_NO","ACCIDENT_DATE","ACCIDENT_TIME",
                        "NODE_ID","SEX","AGE","ROAD_USER_TYPE_DESC",
                        "ACCIDENT_TYPE_DESC","NO_OF_VEHICLES","LGA_NAME",
                        "STAT_DIV_NAME","NODE_TYPE_DESC","ROAD_NUM","PERSON_ID")]
head(fatalcrash,4)
##    ACCIDENT_NO            ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007 2017-03-25T00:00:00.000Z         41400  315856   M  24
## 2 T20160003982 2016-02-22T00:00:00.000Z        171700  299896   F  28
## 3 T20150022025 2015-10-24T00:00:00.000Z        203000   69771   F  80
## 4 T20160018085 2016-08-18T00:00:00.000Z         82600   50579   M  50
##   ROAD_USER_TYPE_DESC            ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1             Drivers Collision with a fixed object              1
## 2             Drivers Collision with a fixed object              1
## 3         Pedestrians             Struck Pedestrian              2
## 4             Drivers        Collision with vehicle              2
##         LGA_NAME          STAT_DIV_NAME   NODE_TYPE_DESC ROAD_NUM
## 1     BASS COAST                Country Non Intersection        0
## 2 EAST GIPPSLAND                Country Non Intersection     5560
## 3       KINGSTON Melbourne - South East     Intersection     5803
## 4    COLAC OTWAY                Country     Intersection     9999
##   PERSON_ID
## 1         A
## 2         A
## 3        01
## 4         B
# Merging data - Requirement 1 
# Using inner_join function from dplyr package
crashcombined = inner_join(fatalcrash,allcrash,by = "ACCIDENT_NO") 
head(crashcombined,4)
##    ACCIDENT_NO            ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007 2017-03-25T00:00:00.000Z         41400  315856   M  24
## 2 T20160003982 2016-02-22T00:00:00.000Z        171700  299896   F  28
## 3 T20150022025 2015-10-24T00:00:00.000Z        203000   69771   F  80
## 4 T20160018085 2016-08-18T00:00:00.000Z         82600   50579   M  50
##   ROAD_USER_TYPE_DESC            ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1             Drivers Collision with a fixed object              1
## 2             Drivers Collision with a fixed object              1
## 3         Pedestrians             Struck Pedestrian              2
## 4             Drivers        Collision with vehicle              2
##         LGA_NAME          STAT_DIV_NAME   NODE_TYPE_DESC ROAD_NUM
## 1     BASS COAST                Country Non Intersection        0
## 2 EAST GIPPSLAND                Country Non Intersection     5560
## 3       KINGSTON Melbourne - South East     Intersection     5803
## 4    COLAC OTWAY                Country     Intersection     9999
##   PERSON_ID SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY
## 1         A   80 km/hr              No        1             0
## 2         A  100 km/hr              No        1             0
## 3        01   60 km/hr              No        1             0
## 4         B  100 km/hr              No        1             0
## Checking the dimensions of each dataset:
dim(allcrash)
## [1] 76451     5
dim(fatalcrash)
## [1] 1361   14
dim(crashcombined)
## [1] 1152   18

Since the date ranges from the two data sets are slightly different, the inner_join function was used in order to ensure that only data from crashes that exist in both datasets were merged. The merged dataset (crashcombined) containts 1152 observations. Hence, 209 observations from the fatalcrash dataset were ommitted as there was no corresponding information for these observations in the allcrash dataset. The dimension check confirms that all variables were retained in the combined dataset. The merged dataset contains 1152 observations of 18 variables.

Understand

The dataset was inspected by analysing its dimensions, structure, and character variable labels.

## Requirement 2 
# Viewing the dimensions and structure of the combined data:
dim(crashcombined)
## [1] 1152   18
str(crashcombined)
## 'data.frame':    1152 obs. of  18 variables:
##  $ ACCIDENT_NO        : chr  "T20170006007" "T20160003982" "T20150022025" "T20160018085" ...
##  $ ACCIDENT_DATE      : chr  "2017-03-25T00:00:00.000Z" "2016-02-22T00:00:00.000Z" "2015-10-24T00:00:00.000Z" "2016-08-18T00:00:00.000Z" ...
##  $ ACCIDENT_TIME      : int  41400 171700 203000 82600 85500 85500 85500 85500 230000 170000 ...
##  $ NODE_ID            : int  315856 299896 69771 50579 48626 48626 48626 48626 264090 311008 ...
##  $ SEX                : chr  "M" "F" "F" "M" ...
##  $ AGE                : int  24 28 80 50 6 2 33 37 27 58 ...
##  $ ROAD_USER_TYPE_DESC: chr  "Drivers" "Drivers" "Pedestrians" "Drivers" ...
##  $ ACCIDENT_TYPE_DESC : chr  "Collision with a fixed object" "Collision with a fixed object" "Struck Pedestrian" "Collision with vehicle" ...
##  $ NO_OF_VEHICLES     : int  1 1 2 2 2 2 2 2 1 1 ...
##  $ LGA_NAME           : chr  "BASS COAST" "EAST GIPPSLAND" "KINGSTON" "COLAC OTWAY" ...
##  $ STAT_DIV_NAME      : chr  "Country" "Country" "Melbourne - South East" "Country" ...
##  $ NODE_TYPE_DESC     : chr  "Non Intersection" "Non Intersection" "Intersection" "Intersection" ...
##  $ ROAD_NUM           : int  0 5560 5803 9999 9999 9999 9999 9999 2570 9999 ...
##  $ PERSON_ID          : chr  "A" "A" "01" "B" ...
##  $ SPEED_ZONE         : chr  "80 km/hr" "100 km/hr" "60 km/hr" "100 km/hr" ...
##  $ ALCOHOL_RELATED    : chr  "No" "No" "No" "No" ...
##  $ FATALITY           : int  1 1 1 1 4 4 4 4 1 1 ...
##  $ SERIOUSINJURY      : int  0 0 0 0 0 0 0 0 0 0 ...

The initial dataset contains character and integer variables. The ACCIDENT_NO, ACCIDENT_TIME, AGE, NO_OF_VEHICLES, FATALITY and SERIOUSINJURY varialbles will initially remain unchanged, whereas the other variables will undergo data type conversions, ordering and labelling. This process is described below.

## Unique value for each character variable
# Finding possible labels to bin
# tail used to see if NA's exist as they'll appear at the end of the list
tail(table(crashcombined$ACCIDENT_NO, useNA = "ifany"),5)
## 
## T20170013947 T20170013970 T20170014068 T20170014249 T20170014368 
##            1            1            1            1            1
tail(table(crashcombined$ACCIDENT_DATE, useNA = "ifany"),5)
## 
## 2017-07-17T00:00:00.000Z 2017-07-18T00:00:00.000Z 2017-07-20T00:00:00.000Z 
##                        1                        2                        1 
## 2017-07-22T00:00:00.000Z 2017-07-24T00:00:00.000Z 
##                        1                        1
table(crashcombined$SEX, useNA = "ifany")
## 
##   F   M 
## 338 814
table(crashcombined$ROAD_USER_TYPE_DESC, useNA = "ifany")
## 
##         Bicyclists            Drivers      Motorcyclists 
##                 40                568                173 
##         Passengers        Pedestrians Pillion Passengers 
##                202                166                  3
table(crashcombined$ACCIDENT_TYPE_DESC, useNA = "ifany")
## 
##     Collision with a fixed object  Collision with some other object 
##                               397                                15 
##            Collision with vehicle    Fall from or in moving vehicle 
##                               519                                 7 
## No collision and no object struck                    Other accident 
##                                 7                                 2 
##                     Struck animal                 Struck Pedestrian 
##                                 7                               157 
## Vehicle overturned (no collision) 
##                                41
table(crashcombined$SPEED_ZONE, useNA = "ifany")
## 
##                   100 km/hr                   110 km/hr 
##                         528                          50 
##                    40 km/hr                    50 km/hr 
##                          15                          84 
##                    60 km/hr                    70 km/hr 
##                         206                          70 
##                    75 km/hr                    80 km/hr 
##                           1                         175 
##                    90 km/hr Camping grounds or off road 
##                          11                           1 
##                   Not known           Other speed limit 
##                          10                           1
tail(table(crashcombined$ROAD_NUM, useNA = "ifany"),5)
## 
## 7430 7576 7598 7921 9999 
##    1    1    1    2  304
tail(table(crashcombined$LGA_NAME, useNA = "ifany"),5)
## 
##      WODONGA      WYNDHAM        YARRA YARRA RANGES YARRIAMBIACK 
##           10           27            6           36            3
table(crashcombined$STAT_DIV_NAME, useNA = "ifany")
## 
##                Ballarat                 Bendigo                 Country 
##                      62                      41                     528 
##                    Hume       Melbourne - Inner  Melbourne - Inner East 
##                       1                      55                      27 
## Melbourne - Inner South  Melbourne - North East  Melbourne - North West 
##                      45                      39                      47 
##  Melbourne - Outer East  Melbourne - South East        Melbourne - West 
##                      70                     100                      93 
##                   Metro 
##                      44
table(crashcombined$NODE_TYPE_DESC, useNA = "ifany")
## 
##     Intersection Non Intersection          Unknown 
##              322              828                2
table(crashcombined$PERSON_ID, useNA = "ifany")
## 
##  01  02  03  04  05  06  07  08  09  10  11   A   B   C   D   E 
## 283  36  27  10   4   3   3   1   1   1   2 608 152  18   2   1
table(crashcombined$ALCOHOL_RELATED, useNA = "ifany")
## 
##   No  Yes 
## 1002  150

The unique values contained in each character variable was listed using contingency tables. This not only displays any unknown/other/NA values, but it also allows me to view similar labels in each variable, which could then be combined into one category. For example, the No collision and no object struck and Vehicle overturned (no collision) levels from the ACCIDENT_TYPE_DESC variable can be combined into one No Collision level.

#_____________________________________________________________________________
## Converting unknown/other values into "NA" values.
crashcombined$SPEED_ZONE[
  crashcombined$SPEED_ZONE %in% c("Camping grounds or off road",
                               "Not known","Other speed limit"
                               )] = NA

# Removing "km/hr" from values
crashcombined$SPEED_ZONE = chartr(old = "km/hr", new = "     ", crashcombined$SPEED_ZONE)
# Removing blank space
crashcombined$SPEED_ZONE = str_trim(crashcombined$SPEED_ZONE, side = "both")

crashcombined$NODE_TYPE_DESC[
  crashcombined$NODE_TYPE_DESC %in% c("Unknown")] = NA

crashcombined$ACCIDENT_TYPE_DESC[
  crashcombined$ACCIDENT_TYPE_DESC %in% c("Other accident")] = NA 
#____________________________________________________________________________
 # Binning values into road categories
crashcombined$ROAD_NUM[
  crashcombined$ROAD_NUM >= 2000 & crashcombined$ROAD_NUM < 3000] = "Freeway and State Highways"
crashcombined$ROAD_NUM[
  crashcombined$ROAD_NUM >= 3000 & crashcombined$ROAD_NUM < 4000] = "Forest Roads"
crashcombined$ROAD_NUM[
  crashcombined$ROAD_NUM >= 4000 & crashcombined$ROAD_NUM < 5000] = "Tourist Roads"
crashcombined$ROAD_NUM[
  crashcombined$ROAD_NUM >= 5000 & crashcombined$ROAD_NUM < 6000] = "Main Roads"
crashcombined$ROAD_NUM[
  crashcombined$ROAD_NUM >= 7000 & crashcombined$ROAD_NUM < 8000] = "Ramps"
crashcombined$ROAD_NUM[
  crashcombined$ROAD_NUM == 9999] <- "Local Roads"

# There is no explanation for 0 and -1 values in the dataset documentation
crashcombined$ROAD_NUM[
  crashcombined$ROAD_NUM == 0 | crashcombined$ROAD_NUM == -1] = NA

  ## Binning observations into less categories
crashcombined$ACCIDENT_TYPE_DESC[
  crashcombined$ACCIDENT_TYPE_DESC %in% c("No collision and no object struck",
                               "Vehicle overturned (no collision)"
                               )] <- "No Collision"
  
#____________________________________________________________________________
## Converting variables - Requirement 3 
# Converting to factor
crashcombined$ALCOHOL_RELATED = as.factor(crashcombined$ALCOHOL_RELATED)
crashcombined$ROAD_NUM = as.factor(crashcombined$ROAD_NUM)
crashcombined$SEX = as.factor(crashcombined$SEX)
crashcombined$ACCIDENT_TYPE_DESC = as.factor(crashcombined$ACCIDENT_TYPE_DESC)
crashcombined$ROAD_USER_TYPE_DESC = as.factor(crashcombined$ROAD_USER_TYPE_DESC)
crashcombined$LGA_NAME = as.factor(crashcombined$LGA_NAME)
crashcombined$NODE_TYPE_DESC = as.factor(crashcombined$NODE_TYPE_DESC)
crashcombined$PERSON_ID = as.factor(crashcombined$PERSON_ID)
crashcombined$STAT_DIV_NAME = as.factor(crashcombined$STAT_DIV_NAME)
crashcombined$NODE_ID = as.factor(crashcombined$NODE_ID)

# Chanverting to date 
# Converts to YYYY-MM-DD
crashcombined$ACCIDENT_DATE=as.Date(crashcombined$ACCIDENT_DATE)

# Check 
# Dates should fall in-between 1st Jan 2013 and 1st Sep 2017
min(crashcombined$ACCIDENT_DATE)
## [1] "2013-01-02"
max(crashcombined$ACCIDENT_DATE)
## [1] "2017-07-24"
##___________________________________________________

## Changing names of factor levels  - Requirement 4
# Labelling levels
crashcombined$ALCOHOL_RELATED =  
  factor(crashcombined$ALCOHOL_RELATED, 
             levels = c("Yes","No"),labels = c(1,0))

crashcombined$SPEED_ZONE[
crashcombined$SPEED_ZONE %in% c('30', '40','50')] <- "Low"

crashcombined$SPEED_ZONE[
crashcombined$SPEED_ZONE %in% c('60', '70','75','80')] <- "Med"

crashcombined$SPEED_ZONE[
crashcombined$SPEED_ZONE %in% c('90', '100','110')] <- "High"

# Ordering levels
crashcombined$SPEED_ZONE = ordered(crashcombined$SPEED_ZONE,
levels = c("Low","Med", "High"))

Various variables were converted into dates, factors and ordered factors. The conversions applied to the variables are described below:

There were also four variables that were found to have unknown/other labels (ACCIDENT_TYPE_DESC, NODE_TYPE_DESC, ROAD_NUM and SPEED_ZONE), these labels were changed to NA values.

Tidy & Manipulate Data I

In a tidy dataset, each variable measures should be in one column, and each different observation should be in a different row. The duplicated function was used to check if the observations were unique.

# Requirement 5
# Counting the number of observations that are, or are not duplicates.
table(duplicated(crashcombined))
## 
## FALSE 
##  1152
# Viewing the variables contained in the dataset
head(crashcombined,3)
##    ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007    2017-03-25         41400  315856   M  24
## 2 T20160003982    2016-02-22        171700  299896   F  28
## 3 T20150022025    2015-10-24        203000   69771   F  80
##   ROAD_USER_TYPE_DESC            ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1             Drivers Collision with a fixed object              1
## 2             Drivers Collision with a fixed object              1
## 3         Pedestrians             Struck Pedestrian              2
##         LGA_NAME          STAT_DIV_NAME   NODE_TYPE_DESC   ROAD_NUM
## 1     BASS COAST                Country Non Intersection       <NA>
## 2 EAST GIPPSLAND                Country Non Intersection Main Roads
## 3       KINGSTON Melbourne - South East     Intersection Main Roads
##   PERSON_ID SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY
## 1         A        Med               0        1             0
## 2         A       High               0        1             0
## 3        01        Med               0        1             0
names(crashcombined)
##  [1] "ACCIDENT_NO"         "ACCIDENT_DATE"       "ACCIDENT_TIME"      
##  [4] "NODE_ID"             "SEX"                 "AGE"                
##  [7] "ROAD_USER_TYPE_DESC" "ACCIDENT_TYPE_DESC"  "NO_OF_VEHICLES"     
## [10] "LGA_NAME"            "STAT_DIV_NAME"       "NODE_TYPE_DESC"     
## [13] "ROAD_NUM"            "PERSON_ID"           "SPEED_ZONE"         
## [16] "ALCOHOL_RELATED"     "FATALITY"            "SERIOUSINJURY"

By using the dublicated function, it was confirmed that all observations were unique, as none of them were duplicates. By analysing the variables and column names, it is clear that each variable is measured in only one column. Furthermore, each cell only contains one value. Hence, it can be concluded that the dataset is tidy.

Tidy & Manipulate Data II

The SER_FAT variable was added to the dataset to measure the total number of serious injuries and fatalities from the accident in each observation. This provides a better picture as to the severity of an accident. For example, an accident with the same number of fatalities may have a different amount of serious injuries, indicating that one accident was more severe than the other.

# mutate function used to create new variable - Requirement 6
crashcombined = crashcombined %>% mutate(SER_FAT = FATALITY+SERIOUSINJURY)
## Warning: package 'bindrcpp' was built under R version 3.4.4
# Check if procedure worked
dim(crashcombined)
## [1] 1152   19
head(crashcombined,2)
##    ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX AGE
## 1 T20170006007    2017-03-25         41400  315856   M  24
## 2 T20160003982    2016-02-22        171700  299896   F  28
##   ROAD_USER_TYPE_DESC            ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 1             Drivers Collision with a fixed object              1
## 2             Drivers Collision with a fixed object              1
##         LGA_NAME STAT_DIV_NAME   NODE_TYPE_DESC   ROAD_NUM PERSON_ID
## 1     BASS COAST       Country Non Intersection       <NA>         A
## 2 EAST GIPPSLAND       Country Non Intersection Main Roads         A
##   SPEED_ZONE ALCOHOL_RELATED FATALITY SERIOUSINJURY SER_FAT
## 1        Med               0        1             0       1
## 2       High               0        1             0       1

The mutate function was able to successfully created the new SER_FAT variable.

Scan I

There shouldn’t be any observations with 0 fatalities. Hence, all the different values in this variable are checked.

# Scanning for 0 or NA fatalities 
table(crashcombined$FATALITY,useNA="ifany") # Frequency table of values in FATALITY
## 
##   0   1   2   3   4 
##   2 995 122  21  12

It was found that the FATALITY variables contained two 0 values. This a potential inconsistency because all observations should be from fatal crashes (due to the inner join of the fatalcrashes dataset), hence all values in this column should be greater or equal to 1. A possible explanation for this occurance is that two victims were injured in the accidents and subsequently died at a later date in hospital (after the allcrashes dataset was last updated and before the fatalcrashes dataset was last updated).

## Investigating inconsistant values 
# Check if observations with 0 fatalities had serious injuries
crashcombined$SERIOUSINJURY[crashcombined$FATALITY==0] 
## [1] 1 1
# Checking the dates of accidents with 0 recorded fatalities
crashcombined$ACCIDENT_DATE[crashcombined$FATALITY==0]
## [1] "2017-05-08" "2017-07-17"

The observations with 0 fatalities each recorded 1 serious injury. Furthermore, the dates of these accidents occured on the 8th of May and the 17th of August in 2017, with the allcrashes dateset last updated on the 1st of September 2017 and the fatalcrashes dataset last being updated on the 2nd of March 2018. Since there is a distinct possibility that the victims were seriously injured in their accidents and later died within the eight month gap in which the datasets were updatated, these observations are not seen as inconsistencies.

In order to find any other missing values or inconsistencies, functions were created to find finite, infinite and NaN values accross the whole crashcombined dataset. These functions work by creating contingency tables for the number of TRUE and FALSE values from each variable in the dataset after the is.finite, is.infinite and is.nan functions are run. If the contingency tables for each variable in the dataset only contained FALSE values (ie 0 TRUE’s) or TRUE values (ie 0 FALSE’s), then there would be no special values in the data. The is.na function was also used to find any missing values and complete.cases was used to determine the number of observations with missing values.

# Creation of functions to find special vals over whole dataset - Requirement 7

is.finitefunc <- function(x){
  temp = seq(from = 1, to=length(x[1,]))
 for (i in 1:length(x[1,])) {
      temp[i] = length(table(is.finite(x[,i])))
      # If there are special values, the length of the table will be >1
 }
  # Mean of 1 means that all tables either have 0 TRUE's or 0 FALSE's
   if (mean(temp)==1){
     return("All values in dataset are finite")
   }else {return("There are non-finite values in the dataset")}
}

is.infinitefunc <- function(x){
  temp = seq(from = 1, to=length(x[1,]))
 for (i in 1:length(x[1,])) {
      temp[i] = length(table(is.infinite(x[,i])))
      # If there are special values, the length of the table will be >1
 }
   if (mean(temp)==1){
     return("All values in dataset are not infinite")
   }else {return("There are infinite values in the dataset")}
}


is.nanfunc <- function(x){
  temp = seq(from = 1, to=length(x[1,]))
 for (i in 1:length(x[1,])) {
      temp[i] = length(table(is.nan(x[,i])))
      # If there are special values, the length of the table will be >1
 }
   if (mean(temp)==1){
     return("There are no NaN values")
   }else {return("There are one or more NaN values ")}
}

is.finitefunc(crashcombined)
## [1] "There are non-finite values in the dataset"
is.infinitefunc(crashcombined)
## [1] "All values in dataset are not infinite"
is.nanfunc(crashcombined)
## [1] "There are no NaN values"
sum(is.na(crashcombined)) # Number of NA values
## [1] 101
# Finding the number of complete cases.
numcomplete = 
  length(crashcombined[,1]) - length((crashcombined[complete.cases(crashcombined),])[,1])
numcomplete
## [1] 95
# Proportion of observations that are missing (%)
numcomplete*100/length(crashcombined[,1])
## [1] 8.246528

There were no infinite or NaN values values found in the dataset, but there were non-finite values. This was expected since various labels were converted to NA values in the earlier processing stage. The is.na function found 101 NA values in the dataset and the complete.cases function showed found 95 complete cases, indicating that there were some observations with multiple NA values.

It was found that around 8% of the observations contain missing values. Since many (if not all) of these missing values came from the categorical variables that were relabelled in the ealier stages, it was decided to simply remove these observations in order to avoid the bias caused by imputing their mode.

## Removing missing values
crashcombined = crashcombined[complete.cases(crashcombined),]

## Checking that all missing values are removed
length(crashcombined[,1]) - length((crashcombined[complete.cases(crashcombined),])[,1])
## [1] 0
is.finitefunc(crashcombined)
## [1] "All values in dataset are finite"
is.infinitefunc(crashcombined)
## [1] "All values in dataset are not infinite"
is.nanfunc(crashcombined)
## [1] "There are no NaN values"
sum(is.na(crashcombined))
## [1] 0

There were no inconsistent or missing values after the incomplete cases were removed.

Scan II

Univariate outliers were found by finding the observations with z-scores of absolute value above 3, for each value of the numeric variables.

# Checking all numeric variables for outliers - Requirement 8
# Using outliers package
z.scores2 <- crashcombined$AGE %>%  scores(type = "z")
which( abs(z.scores2) >3 )
## integer(0)
z.scores3 <- crashcombined$NO_OF_VEHICLES %>%  scores(type = "z")
length(which( abs(z.scores3) >3 ))
## [1] 13
table(crashcombined$NO_OF_VEHICLES, useNA = "ifany")
## 
##   0   1   2   3   4   5   6  10  11 
##   1 521 438  70  14  10   1   1   1
z.scores4 <- crashcombined$SERIOUSINJURY %>%  scores(type = "z")
length(which( abs(z.scores4) >3 ))
## [1] 28
table(crashcombined$SERIOUSINJURY, useNA = "ifany")
## 
##   0   1   2   3   4   5   6   7  10 
## 716 221  62  30  12   8   3   2   3
z.scores1 <- crashcombined$FATALITY %>%  scores(type = "z")
length(which( abs(z.scores1) >3 ))
## [1] 33
table(crashcombined$FATALITY, useNA = "ifany")
## 
##   0   1   2   3   4 
##   2 902 120  21  12
z.scores5 <- crashcombined$SER_FAT %>%  scores(type = "z")
length(which( abs(z.scores5) >3 ))
## [1] 27
table(crashcombined$SER_FAT, useNA = "ifany")
## 
##   1   2   3   4   5   6   7   9  13 
## 650 223  99  48  10   8  14   2   3
z.scores6 <- crashcombined$ACCIDENT_TIME %>%  scores(type = "z")
which( abs(z.scores6) >3 )
## integer(0)

The contingency tables from the numeric variables with outliers provide an idea as to the value of the outliers. For example, it is clear that the 27 outliers from the SER_FAT variable came from accidents with 6+ serious injuries or fatalities. These univariate outliers may be considered as freak occurances and will therefore be removed from the dataset.

# Removal of univariate outliers
outliers = c(which(abs(z.scores1) >3),which(abs(z.scores3) >3),
             which( abs(z.scores4) >3 ,which( abs(z.scores5) >3 )))

 crashcombined = crashcombined[-outliers,]

Some algorithms work best without multivariate outliers, and these were detected by finding their Mahalanobis distance.

# Subset of non-discrete variables
numeric.vars = crashcombined %>% 
  dplyr::select(AGE,ACCIDENT_TIME)

outlier.det = MVN::mvn(numeric.vars, multivariateOutlierMethod = "adj",
                  showOutliers = TRUE)

The mvn function did not work when the NO_OF_VEHICLES, SERIOUSINJURY and FATALITY variables were analysed. This is becasue these variables are discrete and can only take whole numbers. The adjusted quantile method was used in finding the multivariate outliers between AGE and ACCIDENT_TIME, as this method has been found to be more robust (Filzmoser, 2004). There were no multivariate outliers found using this method.

Transform

The AGE variable was tested for normality using the Shapiro-Wilk test. After rejecting the null hypothesis that the variable was normally distributed, a Box-Cox transformation was performed and a lambda value of 0.336 was applied to the data in attempt to normalize the data.

# Requirement 9
# Hitstogram of Age variable
hist(crashcombined$AGE, main="Age distribution before transformation"
     ,width = "50%")

# Testing for normality
shapiro.test(crashcombined$AGE) 
## 
##  Shapiro-Wilk normality test
## 
## data:  crashcombined$AGE
## W = 0.95957, p-value = 6.644e-16
# Finding lambda value
lambda<- forecast::BoxCox.lambda(crashcombined$AGE)
lambda
## [1] 0.3360749
# Applying transformation
crashcombined$AGE=(crashcombined$AGE^lambda-1)/lambda

# Normality test after transformation
shapiro.test(crashcombined$AGE)
## 
##  Shapiro-Wilk normality test
## 
## data:  crashcombined$AGE
## W = 0.96331, p-value = 4.546e-15
# Histogram after transformation
hist(crashcombined$AGE, main= "Age distribution after transformation")

After applying a data transformation of the AGE variable, the Shapiro-Wilk test of the transformed data indicates that normalisation failed.

Optional - Cleaning Time variable

The time variable is a little bit difficult to interpret. Therefore, character manipulation is performed in order to have the time variables in the HH:MM:SS format, making them easier to interpret.

# Variable before cleaning
head(crashcombined$ACCIDENT_TIME,10)
##  [1] 171700 203000  82600 230000 170000 105500 131500 131500  25400 181500
# Converting to character variable
crashcombined$ACCIDENT_TIME=as.character(crashcombined$ACCIDENT_TIME)

# Adding 0's to the start of each value less than 6 characters long
for (i in 1:length(crashcombined[,1])){
   crashcombined$ACCIDENT_TIME[i] = str_pad(crashcombined$ACCIDENT_TIME[i], 
                                            width = 6, side = "left", pad="0")
}

# Splitting values into three
a = substr(as.character(crashcombined$ACCIDENT_TIME),1,2)
b = substr(as.character(crashcombined$ACCIDENT_TIME),3,4)
c = substr(as.character(crashcombined$ACCIDENT_TIME),5,6)

# Recombining values with : separator
crashcombined$ACCIDENT_TIME = paste(a,b,c, sep = ":")

# Variable after cleaning
head(crashcombined$ACCIDENT_TIME,10)
##  [1] "17:17:00" "20:30:00" "08:26:00" "23:00:00" "17:00:00" "10:55:00"
##  [7] "13:15:00" "13:15:00" "02:54:00" "18:15:00"

Final Inspection

# Inspecting final dataset
# Checking column names
names(crashcombined)
##  [1] "ACCIDENT_NO"         "ACCIDENT_DATE"       "ACCIDENT_TIME"      
##  [4] "NODE_ID"             "SEX"                 "AGE"                
##  [7] "ROAD_USER_TYPE_DESC" "ACCIDENT_TYPE_DESC"  "NO_OF_VEHICLES"     
## [10] "LGA_NAME"            "STAT_DIV_NAME"       "NODE_TYPE_DESC"     
## [13] "ROAD_NUM"            "PERSON_ID"           "SPEED_ZONE"         
## [16] "ALCOHOL_RELATED"     "FATALITY"            "SERIOUSINJURY"      
## [19] "SER_FAT"
## Structure of Data after processing
str(crashcombined)
## 'data.frame':    989 obs. of  19 variables:
##  $ ACCIDENT_NO        : chr  "T20160003982" "T20150022025" "T20160018085" "T20130019712" ...
##  $ ACCIDENT_DATE      : Date, format: "2016-02-22" "2015-10-24" ...
##  $ ACCIDENT_TIME      : chr  "17:17:00" "20:30:00" "08:26:00" "23:00:00" ...
##  $ NODE_ID            : Factor w/ 1063 levels "120","298","393",..: 795 218 170 331 956 891 753 753 268 257 ...
##  $ SEX                : Factor w/ 2 levels "F","M": 1 1 2 2 2 2 1 1 2 2 ...
##  $ AGE                : num  6.14 10 8.1 6.03 8.67 ...
##  $ ROAD_USER_TYPE_DESC: Factor w/ 6 levels "Bicyclists","Drivers",..: 2 5 2 2 1 4 4 2 2 2 ...
##  $ ACCIDENT_TYPE_DESC : Factor w/ 7 levels "Collision with a fixed object",..: 1 7 3 1 4 3 3 3 1 1 ...
##  $ NO_OF_VEHICLES     : int  1 2 2 1 1 2 2 2 1 1 ...
##  $ LGA_NAME           : Factor w/ 81 levels "(FALLS CREEK)",..: 24 37 20 69 38 14 12 12 18 38 ...
##  $ STAT_DIV_NAME      : Factor w/ 13 levels "Ballarat","Bendigo",..: 3 11 3 3 10 12 2 2 11 10 ...
##  $ NODE_TYPE_DESC     : Factor w/ 2 levels "Intersection",..: 2 1 1 2 2 2 2 2 2 2 ...
##  $ ROAD_NUM           : Factor w/ 6 levels "Forest Roads",..: 4 4 3 2 3 2 2 2 2 3 ...
##  $ PERSON_ID          : Factor w/ 16 levels "01","02","03",..: 12 1 13 12 13 1 2 13 12 12 ...
##  $ SPEED_ZONE         : Ord.factor w/ 3 levels "Low"<"Med"<"High": 3 2 3 3 1 3 3 3 2 2 ...
##  $ ALCOHOL_RELATED    : Factor w/ 2 levels "1","0": 2 2 2 1 2 2 2 2 1 1 ...
##  $ FATALITY           : int  1 1 1 1 1 1 2 2 1 1 ...
##  $ SERIOUSINJURY      : int  0 0 0 0 0 1 2 2 0 0 ...
##  $ SER_FAT            : int  1 1 1 1 1 2 4 4 1 1 ...
## Dimensions of Data after processing
dim(crashcombined)
## [1] 989  19
#Data
head(crashcombined,10)
##     ACCIDENT_NO ACCIDENT_DATE ACCIDENT_TIME NODE_ID SEX       AGE
## 2  T20160003982    2016-02-22      17:17:00  299896   F  6.142848
## 3  T20150022025    2015-10-24      20:30:00   69771   F 10.000590
## 4  T20160018085    2016-08-18      08:26:00   50579   M  8.104610
## 9  T20130019712    2013-09-24      23:00:00  264090   M  6.032079
## 10 T20160025394    2016-10-13      17:00:00  311008   M  8.671306
## 11 T20160009999    2016-05-07      10:55:00  306973   M  6.460152
## 13 T20150025270    2015-12-06      13:15:00  296945   F  7.953637
## 14 T20150025270    2015-12-06      13:15:00  296945   F  8.870385
## 15 T20130000066    2013-01-02      02:54:00  258321   M  5.028728
## 16 T20130000366    2013-01-06      18:15:00  256415   M  5.802087
##    ROAD_USER_TYPE_DESC             ACCIDENT_TYPE_DESC NO_OF_VEHICLES
## 2              Drivers  Collision with a fixed object              1
## 3          Pedestrians              Struck Pedestrian              2
## 4              Drivers         Collision with vehicle              2
## 9              Drivers  Collision with a fixed object              1
## 10          Bicyclists Fall from or in moving vehicle              1
## 11          Passengers         Collision with vehicle              2
## 13          Passengers         Collision with vehicle              2
## 14             Drivers         Collision with vehicle              2
## 15             Drivers  Collision with a fixed object              1
## 16             Drivers  Collision with a fixed object              1
##          LGA_NAME          STAT_DIV_NAME   NODE_TYPE_DESC
## 2  EAST GIPPSLAND                Country Non Intersection
## 3        KINGSTON Melbourne - South East     Intersection
## 4     COLAC OTWAY                Country     Intersection
## 9       SWAN HILL                Country Non Intersection
## 10           KNOX Melbourne - Outer East Non Intersection
## 11       BRIMBANK       Melbourne - West Non Intersection
## 13        BENDIGO                Bendigo Non Intersection
## 14        BENDIGO                Bendigo Non Intersection
## 15          CASEY Melbourne - South East Non Intersection
## 16           KNOX Melbourne - Outer East Non Intersection
##                      ROAD_NUM PERSON_ID SPEED_ZONE ALCOHOL_RELATED
## 2                  Main Roads         A       High               0
## 3                  Main Roads        01        Med               0
## 4                 Local Roads         B       High               0
## 9  Freeway and State Highways         A       High               1
## 10                Local Roads         B        Low               0
## 11 Freeway and State Highways        01       High               0
## 13 Freeway and State Highways        02       High               0
## 14 Freeway and State Highways         B       High               0
## 15 Freeway and State Highways         A        Med               1
## 16                Local Roads         A        Med               1
##    FATALITY SERIOUSINJURY SER_FAT
## 2         1             0       1
## 3         1             0       1
## 4         1             0       1
## 9         1             0       1
## 10        1             0       1
## 11        1             1       2
## 13        2             2       4
## 14        2             2       4
## 15        1             0       1
## 16        1             0       1
# Categorical data labels

head(crashcombined$NODE_ID,5)
## [1] 299896 69771  50579  264090 311008
## 1063 Levels: 120 298 393 827 851 873 1317 10429 11397 12239 13567 ... 320341
levels(crashcombined$SEX)
## [1] "F" "M"
levels(crashcombined$ROAD_USER_TYPE_DESC)
## [1] "Bicyclists"         "Drivers"            "Motorcyclists"     
## [4] "Passengers"         "Pedestrians"        "Pillion Passengers"
levels(crashcombined$ACCIDENT_TYPE_DESC)
## [1] "Collision with a fixed object"    "Collision with some other object"
## [3] "Collision with vehicle"           "Fall from or in moving vehicle"  
## [5] "No Collision"                     "Struck animal"                   
## [7] "Struck Pedestrian"
levels(crashcombined$LGA_NAME)
##  [1] "(FALLS CREEK)"        "(FRENCH ISLAND)"      "(MOUNT HOTHAM)"      
##  [4] "ALPINE"               "ARARAT"               "BALLARAT"            
##  [7] "BANYULE"              "BASS COAST"           "BAW BAW"             
## [10] "BAYSIDE"              "BENALLA"              "BENDIGO"             
## [13] "BOROONDARA"           "BRIMBANK"             "BULOKE"              
## [16] "CAMPASPE"             "CARDINIA"             "CASEY"               
## [19] "CENTRAL GOLDFIELDS"   "COLAC OTWAY"          "CORANGAMITE"         
## [22] "DANDENONG"            "DAREBIN"              "EAST GIPPSLAND"      
## [25] "FRANKSTON"            "GANNAWARRA"           "GEELONG"             
## [28] "GLEN EIRA"            "GLENELG"              "GOLDEN PLAINS"       
## [31] "HEPBURN"              "HINDMARSH"            "HOBSONS BAY"         
## [34] "HORSHAM"              "HUME"                 "INDIGO"              
## [37] "KINGSTON"             "KNOX"                 "LATROBE"             
## [40] "LODDON"               "MACEDON RANGES"       "MANNINGHAM"          
## [43] "MANSFIELD"            "MARIBYRNONG"          "MAROONDAH"           
## [46] "MELBOURNE"            "MELTON"               "MILDURA"             
## [49] "MITCHELL"             "MOIRA"                "MONASH"              
## [52] "MOONEE VALLEY"        "MOORABOOL"            "MORELAND"            
## [55] "MORNINGTON PENINSULA" "MOUNT ALEXANDER"      "MOYNE"               
## [58] "MURRINDINDI"          "NILLUMBIK"            "NORTHERN GRAMPIANS"  
## [61] "PORT PHILLIP"         "PYRENEES"             "SHEPPARTON"          
## [64] "SOUTH GIPPSLAND"      "SOUTHERN GRAMPIANS"   "STONNINGTON"         
## [67] "STRATHBOGIE"          "SURF COAST"           "SWAN HILL"           
## [70] "TOWONG"               "WANGARATTA"           "WARRNAMBOOL"         
## [73] "WELLINGTON"           "WEST WIMMERA"         "WHITEHORSE"          
## [76] "WHITTLESEA"           "WODONGA"              "WYNDHAM"             
## [79] "YARRA"                "YARRA RANGES"         "YARRIAMBIACK"
levels(crashcombined$STAT_DIV_NAME)
##  [1] "Ballarat"                "Bendigo"                
##  [3] "Country"                 "Hume"                   
##  [5] "Melbourne - Inner"       "Melbourne - Inner East" 
##  [7] "Melbourne - Inner South" "Melbourne - North East" 
##  [9] "Melbourne - North West"  "Melbourne - Outer East" 
## [11] "Melbourne - South East"  "Melbourne - West"       
## [13] "Metro"
levels(crashcombined$NODE_TYPE_DESC)
## [1] "Intersection"     "Non Intersection"
levels(crashcombined$PERSON_ID)
##  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "A"  "B"  "C" 
## [15] "D"  "E"
levels(crashcombined$ALCOHOL_RELATED)
## [1] "1" "0"
levels(crashcombined$ROAD_NUM)
## [1] "Forest Roads"               "Freeway and State Highways"
## [3] "Local Roads"                "Main Roads"                
## [5] "Ramps"                      "Tourist Roads"
levels(crashcombined$SPEED_ZONE)
## [1] "Low"  "Med"  "High"

By inspecting the final dataset, the cleaned dataset is tidy, with variables in their proper type, and without any outliers and inconsistent or missing values.

References:

Filzmoser, P. (2004). A multivariate outlier detection method. State University, 18-22.