Required packages

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(deductive)
library(validate)
library(outliers)
library(ggplot2)
library(lubridate)
library(Rmisc)

Executive Summary

This project will demonstrate the five major tasks in data processing using the ACCIDENTS data base which was extracted from Vicroads website.

Task1 : Get

Three data tables ACCIDENT.csv and ACCIDENT_LOCATION.csv were imported and the necessary data type conversions were done using read_csv function. Both data frames were subsetted with selected variables and then they were joined using left_join() function.

Task2 : Understand

To check and understand the structure of the data head(),dim() and str() functions were used. In addition the pdf “Crash Stats User Guide”[2] provided along with the ACCIDENT database was referred to understand the meaning of the attributes. The data frames were subsetted with selected variables before merging them together.

Task3 : Tidy and Manipulate

The three tidy data principles were checked. Three new variables were introduced which describes more meaningful information using the mutate() function.

Task4 : Scan

All the variables were scanned for NA’s and outliers. Large number of NA’s were detected for 3 variables about the accident location, and hence they were unchanged. Outliers were checked for individual numerical variables as well as grouped by relevant categorical variables.

Task5 : Transform

DISTANCE_LOCATION variable was transformed to achieve a more symmetric distribution.

Data

The ACCIDENT database was downloaded from Department of Transport Open data, Victoria State Government Website[1]. The data is provided by VicRoads for educational and research purposes. Out of 12 data tables provided in the ACCIDENT database, 2 tables were selected to conduct the preprocessing steps. The selected files are ACCIDENT.csv and ACCIDENT_LOCATION.csv. The two tables will be joined by using their common attribute and the primary key ACCIDENT_NO.

When importing the ACCIDENT data set, following initial data type conversions were made;

The two data frames were subsetted before joining. ACCIDENT_sub was created selecting only the variables ACCIDENT_NO, ACCIDENTDATE, ACCIDENTTIME, Accident Type Desc,Light Condition Desc ,NO_OF_VEHICLES,NO_PERSONS,NO_PERSONS_KILLED,SEVERITY in ACCIDENT data frame. And ACCIDENT_LOCATION_sub was created with the variables ACCIDENT_NO, ROAD_ROUTE_1, ROAD_TYPE in ACCIDENT_LOCATION data frame.

ACCIDENT_LOCATION_sub data frame was left joined to ACCIDENT_sub data frame by ACCIDENT_NO and the ACCIDENT_New data frame was labelled as ACCIDENT_New.

References

[1]Vicroadsopendatastorehouse.vicroads.vic.gov.au, 2013. [Online]. Available: https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Road_Safety/ACCIDENT.zip.

[2]“Crash Stats User Guide”, Data.vicroads.vic.gov.au, 2013. [Online]. Available: http://data.vicroads.vic.gov.au/metadata/crashstats_user_guide_and_appendices.pdf.

#Importing the accident data set

ACCIDENT <- read_csv("ACCIDENT/ACCIDENT.csv", 
            col_types = cols(`Accident Type Desc` = col_factor(levels = c("Collision with vehicle", "Struck Pedestrian", "Struck animal", "Collision with a fixed object", "collision with some other object", "Vehicle overturned (no collision)", "Fall from or in moving vehicle", "No collision and no object struck", "Other accident")),
                            `Light Condition Desc` = col_factor(levels = c("Day", "Dusk/Dawn", "Dark Street lights on", "Dark Street lights off", "Dark Street lights unknown", "Dark No street lights", "Unknown"), ordered = TRUE)))
#Converting SEVERITY variable as a labelled ordered factor.
ACCIDENT$SEVERITY <- ACCIDENT$SEVERITY %>% factor(levels = c("1", "2", "3", "4"), labels=c("Low", "Mediam", "High", "Very High"), ordered = TRUE)

#Converting ACCIDENTDATE and ACCIDENTTIME from character to Date and time
ACCIDENT$ACCIDENTDATE %>% dmy() 
ACCIDENT$ACCIDENTTIME %>% hms()
head(ACCIDENT)
#Importing ACCIDENT_LOCATION data set
ACCIDENT_LOCATION <- read_csv("ACCIDENT/ACCIDENT_LOCATION.csv")
## Parsed with column specification:
## cols(
##   ACCIDENT_NO = col_character(),
##   NODE_ID = col_double(),
##   ROAD_ROUTE_1 = col_double(),
##   ROAD_NAME = col_character(),
##   ROAD_TYPE = col_character(),
##   ROAD_NAME_INT = col_character(),
##   ROAD_TYPE_INT = col_character(),
##   DISTANCE_LOCATION = col_double(),
##   DIRECTION_LOCATION = col_character(),
##   NEAREST_KM_POST = col_logical(),
##   OFF_ROAD_LOCATION = col_character()
## )
head(ACCIDENT_LOCATION)
#Subsetting the data sets
ACCIDENT_sub <- ACCIDENT %>% select("ACCIDENT_NO","ACCIDENTDATE","ACCIDENTTIME","Accident Type Desc", "LIGHT_CONDITION", "Light Condition Desc","NO_OF_VEHICLES", "NO_PERSONS_INJ_2", "NO_PERSONS_INJ_3", "NO_PERSONS_KILLED", "NO_PERSONS_NOT_INJ","SEVERITY")
head(ACCIDENT_sub)
ACCIDENT_LOCATION_sub <- ACCIDENT_LOCATION %>% select("ACCIDENT_NO", "ROAD_ROUTE_1", "ROAD_TYPE", "DISTANCE_LOCATION")
head(ACCIDENT_LOCATION_sub)
#Joining the two datasets
ACCIDENT_New <- ACCIDENT_sub %>% left_join(ACCIDENT_LOCATION_sub, by= "ACCIDENT_NO")
head(ACCIDENT_New)

Understand

Type conversation was done during data extraction phase.

#Checking and understanding the data
#ACCIDENT Data
dim(ACCIDENT_sub)
## [1] 181606     12
colnames(ACCIDENT_sub)
##  [1] "ACCIDENT_NO"          "ACCIDENTDATE"         "ACCIDENTTIME"        
##  [4] "Accident Type Desc"   "LIGHT_CONDITION"      "Light Condition Desc"
##  [7] "NO_OF_VEHICLES"       "NO_PERSONS_INJ_2"     "NO_PERSONS_INJ_3"    
## [10] "NO_PERSONS_KILLED"    "NO_PERSONS_NOT_INJ"   "SEVERITY"
#ACCIDENT_EVENT data
dim(ACCIDENT_LOCATION_sub)
## [1] 181606      4
colnames(ACCIDENT_LOCATION_sub)
## [1] "ACCIDENT_NO"       "ROAD_ROUTE_1"      "ROAD_TYPE"        
## [4] "DISTANCE_LOCATION"
#The joined table
dim(ACCIDENT_New)
## [1] 181606     15
str(ACCIDENT_New)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 181606 obs. of  15 variables:
##  $ ACCIDENT_NO         : chr  "T20060000010" "T20060000018" "T20060000022" "T20060000023" ...
##  $ ACCIDENTDATE        : chr  "13/01/2006" "13/01/2006" "14/01/2006" "14/01/2006" ...
##  $ ACCIDENTTIME        : chr  "12.42.00" "19.10.00" "12.10.00" "11.49.00" ...
##  $ Accident Type Desc  : Factor w/ 9 levels "Collision with vehicle",..: 1 1 7 1 1 1 4 4 1 2 ...
##  $ LIGHT_CONDITION     : num  1 1 1 1 1 1 1 3 1 1 ...
##  $ Light Condition Desc: Ord.factor w/ 7 levels "Day"<"Dusk/Dawn"<..: 1 1 1 1 1 1 1 3 1 1 ...
##  $ NO_OF_VEHICLES      : num  3 2 1 2 2 2 1 1 3 1 ...
##  $ NO_PERSONS_INJ_2    : num  0 0 1 1 0 1 1 1 2 0 ...
##  $ NO_PERSONS_INJ_3    : num  1 1 0 0 3 0 0 0 2 1 ...
##  $ NO_PERSONS_KILLED   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NO_PERSONS_NOT_INJ  : num  5 3 1 1 0 1 0 0 1 1 ...
##  $ SEVERITY            : Ord.factor w/ 4 levels "Low"<"Mediam"<..: 3 3 2 2 3 2 2 2 2 3 ...
##  $ ROAD_ROUTE_1        : num  2090 5057 9999 2400 9999 ...
##  $ ROAD_TYPE           : chr  "STREET" "ROAD" "ROAD" "ROAD" ...
##  $ DISTANCE_LOCATION   : num  0 70 210 0 20 49 99 82 0 26 ...

Most of the variables are self-explanatory. How ever, variables which are mainly used and discussed in this report are defined as follows.

Dataframe: ACCIDENT

No.of observations = 181,606
No.of columns = 12

Main variable description

  • ACCIDENT_NO : (character) A 12 character field, starting with T (for example, T20060123456) where characters 2 to 5 are the year in which accident was registered, and characters 6 to 12 are a numeric sequencing numbers.

  • Accident Type Desc : (Factor) Accident category (1-Collision with vehicle, 2-Struck pedestrian, 3-Struck animal, 4-Collision with a fixed object, 5-Collision with some other object, 6-Vehicle overturned (no collision), 7-Fall from or in moving vehicle, 8-No collision and no object struck, 9-Other accident )

  • Light Condition Desc : (ordered factor) Seven light conditions as recorded by the reporting police officer. ((“Day”, “Dusk/Dawn”, “Dark Street lights on”, “Dark Street lights off”, “Dark Street lights unknown”, “Dark No street lights”, “Unknown”)

  • SEVERITY : (labelled ordered factor) Severity of individual injury (1- Fatal accident, 2- Serious injury accident, 3- Other injury accident, 4- Non injury accident

  • NO_VEHICLES, NO_PERSONS, NO_PERSONS_INJ_2, NO_PERSONS_INJ_3, NO_PERSONS_KILLED, NO_PERSONS_NOT_INJ : (numeric)

Dataframe: ACCIDENT_LOCATION

No.of observations = 181,606
No.of columns = 4

Main variable description

  • ROAD_ROUTE_1 : Group Classifications are:(2000-2999 Freeways or Highways, 3000-3999 Forest Rds, 4000-4999 Tourist Rds, 5000-5999 Main Rds, 7000-7999 Freeway ramps, >=100,000 Unclassified Roads)
Dataframe: ACCIDENT_New

No.of observations = 181,606
No.of columns = 15

Tidy & Manipulate Data I

It was checked whether the data frame ACCIDENT_New follows the tidy data principles.

# Comparing number of unique accident records and number of rows in the data frame ACCIDENT_New
length(unique(ACCIDENT_New$ACCIDENT_NO))
## [1] 181606
dim(ACCIDENT_New)
## [1] 181606     15

Tidy & Manipulate Data II

Two new variables were introduced to store the percentages of injured and killed persons count with respect to the total number of persons involved in the accident. To create these new attributes while preserving the existing variables, mutate() function was used. round() function was used to round the percentages up to two decimal places.

# Computing injured and death percentages
ACCIDENT_New <- mutate(ACCIDENT_New, 
                       TOTAL_PERSONS = NO_PERSONS_KILLED+NO_PERSONS_INJ_2 + NO_PERSONS_INJ_3 + NO_PERSONS_NOT_INJ ,
                       DEATH_PERCENT = round(NO_PERSONS_KILLED / (NO_PERSONS_KILLED+NO_PERSONS_INJ_2 + NO_PERSONS_INJ_3 + NO_PERSONS_NOT_INJ ) * 100, digits = 2),
                       INJURED_ONLY_PERCENT = round((NO_PERSONS_INJ_2 + NO_PERSONS_INJ_3) / (NO_PERSONS_KILLED+NO_PERSONS_INJ_2 + NO_PERSONS_INJ_3 + NO_PERSONS_NOT_INJ ) * 100 , digits = 2))

In the metadata file it was mentioned that the ROAD_ROUTE_1 variable was classified as : 2000-2999 Freeways or Highways, 3000-3999 Forest Rds, 4000-4999 Tourist Rds, 5000-5999 Main Rds, 7000-7999 Freeway ramps, >=100,000 Unclassified Roads . Hence for better understanding and for data manipulation convenience a new variable ROAD_ROUTE_1_Desc was introduced to store the fore mentioned classification description.

ACCIDENT_New <- mutate(ACCIDENT_New,
                       ROAD_ROUTE_1_Desc =cut(ROAD_ROUTE_1, 
                                              breaks=c(1999, 3000, 4000, 5000, 6000, 8000, Inf), 
                                              labels=c("Freeways or Highways","Forest Rds","Tourist Rds", "Main Rds", "Freeway ramps", "Unclassified Roads")))

Year, month and date separation from the ACCIDENTDATE variable

ACCIDENT_New <- ACCIDENT_New %>% separate(ACCIDENTDATE, c("DAY","MONTH","YEAR"), sep = "/")
ACCIDENT_New <- ACCIDENT_New %>% mutate(HOUR = hour(hms(ACCIDENTTIME)))

head(ACCIDENT_New)

Scan I

colSums(is.na(ACCIDENT_New))
##          ACCIDENT_NO                  DAY                MONTH 
##                    0                    0                    0 
##                 YEAR         ACCIDENTTIME   Accident Type Desc 
##                    0                    0                    0 
##      LIGHT_CONDITION Light Condition Desc       NO_OF_VEHICLES 
##                    0                    0                    0 
##     NO_PERSONS_INJ_2     NO_PERSONS_INJ_3    NO_PERSONS_KILLED 
##                    0                    0                    0 
##   NO_PERSONS_NOT_INJ             SEVERITY         ROAD_ROUTE_1 
##                    0                    0                  828 
##            ROAD_TYPE    DISTANCE_LOCATION        TOTAL_PERSONS 
##                 3079                  828                    0 
##        DEATH_PERCENT INJURED_ONLY_PERCENT    ROAD_ROUTE_1_Desc 
##                    0                    0                 1336 
##                 HOUR 
##                    0

Out of the 15 variables, 4 variables, ROAD_ROUTE_1, ROAD_ROUTE_1_Desc, ROAD_TYPE and DISTANCE_LOCATION have missing values. If the columns ROAD_ROUTE_1 and ROAD_ROUTE_1_Desc were considered, those variables should have equal number of missing values. Thus, the following test was done to investigate the causes of this mismatch.

Scan II

The numeric variables HOUR, NO_OF_VEHICLES, TOTAL_PERSONS, DEATH_PERCENT, INJURED_ONLY_PERCENT, NO_PERSONS_NOT_INJ, DEATH_PERCENT and INJURED_ONLY_PERCENT were tested for possible outliers using boxplot.

p1 <- ggplot(data = ACCIDENT_New, aes(x=`Light Condition Desc`, y=HOUR))+
            ggtitle("Boxplot of hour of the day by Light Condition Desc") +
            coord_flip() + 
            geom_boxplot()
p1

ACCIDENT_New$NO_OF_VEHICLES %>% boxplot(main = "Boxplot of no.of vehecles involved in the accident")

ACCIDENT_New$TOTAL_PERSONS %>% boxplot(main = "Boxplot of total no.of persons involved in the accident")

ACCIDENT_New$DEATH_PERCENT %>% boxplot(main = "Boxplot of death percentage of the accident")

ACCIDENT_New$INJURED_ONLY_PERCENT %>% boxplot(main = "Boxplot of injured persons percentage of the accident")

ACCIDENT_New$NO_PERSONS_NOT_INJ %>% boxplot(main = "Boxplot of unharmed persons of the accident")

p2 <- ggplot(data = ACCIDENT_New, aes(x=SEVERITY, y=DEATH_PERCENT))+
            ggtitle("Boxplot of the death percentage by severity") +
            geom_boxplot()
p2

p3 <- ggplot(data = ACCIDENT_New, aes(x=SEVERITY, y=INJURED_ONLY_PERCENT))+
            ggtitle("Boxplot of the injured percentage by severity") +
            geom_boxplot()
p3

ACCIDENT_New$DISTANCE_LOCATION %>% boxplot(main = "Boxplot of DISTANCE_LOCATION")

  1. Hour of the day by Light Condition Desc - Instead of validating the two variables separately, a grouped box plot was generated. Three outliers were detected for light condition= day, that is very early hours of the day was identified as having day light. These values were unchanged.

  2. The three boxplots for NO_OF_VEHICLES, TOTAL_PERSONS, DEATH_PERCENT, NO_PERSONS_NOT_INJ, and DISTANCE_LOCATION indicates high number of outliers, which represents a systematic pattern. Hence these outliers were kept to maintain the unbiasness of the data.

  3. No outliers were detected in INJURED_ONLY_PERCENT variable.

  4. For further validation, DEATH_PERCENT and INJURED_ONLY_PERCENT variables were investigated with respect to the severity of the accident, and no outliers were detected.

Transform

It can be observed from the histogram that the DISTANCE_LOCATION variable has a right-skewed distribution, hence it was transformed using the Log transformation, so the variable would be more symmetric distribution.

hist(ACCIDENT_New$DISTANCE_LOCATION)

ln_DISTANCE_LOCATION <- log10(ACCIDENT_New$DISTANCE_LOCATION)
## Warning: NaNs produced
hist(ln_DISTANCE_LOCATION)