Required packages

This section import the required packages

# Loading all required packages

library(readr)
library(dplyr)
library(ggplot2)
library(forecast)
library(tidyr)
library(lubridate)

Executive Summary

This report uses different preprocessing techniques to process an open-source dataset. the dataset was chosen in sourced from CrashStats data provided by VicRoads.The source contained eleven datasets, all of which could be joined to produce a single data set but processing such a huge dataset would be beyond our computational power so two suitable data sets were chosen and joined using one primary key column. The data sets were joined using Inner Join.

To examine the information all the more effectively, the structure of the dataset was analyzed. It distinguished numerical and character class types. Following this, a few factors were changed over into factors and levels were applied to give fitting names. The merged data set was checked to see if it satisfies the tidy data. Using the existing variable in the dataset, new variables were created. Apart from this, missing values were removed and outliers were handled using the boxplot method and capping function. During the process of removing the outliers, it was found that the AGE variable is skewed to the right and therefore, suitable techniques were applied to transform the variable to get rid of the skewness.

Data Description

This assignment will focus on the CrashStats data provided by VicRoads.It contains statistics on road traffic accidents that were reported to the police. To briefly introduce the data, it contains records of accidents that satisfy the following conditions :

Preliminary analysis showed eleven separate datasets, all of which could be joined together using a primary key column but that would require extensive processing power since data will be huge thus we handpicked two sets, namely Accident.csv(basic accident details, time, severity, location) and Person.csv(person based details, age, sex, etc).

Statistical comparisons of selected attributes between two different datasets will provide information on each person involved in the accident from the Person dataset with corresponding accident information from the Accident dataset. This means that each row contains data of a person involved in an accident with details joined via accident number which acts as the primary key.

Link to Dataset : https://discover.data.vic.gov.au/dataset/crash-stats-data-extract/resource/f9c7c05a-19e9-4593-aa2c-960a9c97b858

Further information on practical application of dataset can be found on: http://data.vicroads.vic.gov.au/metadata/crashstats_user_guide_and_appendices.pdf

DataSet 1 : Accident

Import Summary

#Reading the first dataset  'ACCIDENT.csv'
ACCIDENT <- read_csv("~/Downloads/2000 to 2005 ACCIDENT/ACCIDENT.csv", 
    col_types = cols(ACCIDENT_NO = col_character(), 
        `DCA Description` = col_skip(), DCA_CODE = col_skip(), 
        EDITION = col_skip(), GRID_REFERENCE_X = col_skip(), 
        GRID_REFERENCE_Y = col_skip(), NODE_ID = col_skip(), 
        PAGE = col_skip(), ROAD_GEOMETRY = col_skip(), 
        `Road Geometry Desc` = col_skip()))

#Previewing first few rows to check the sample from the complete data 
head(ACCIDENT)
#Checking dimensions of dataset for validation.
dim(ACCIDENT)
## [1] 103061     19

DataSet 2: Person

Import Summary:

#Reading the second dataset  'PERSON.csv'
PERSON <- read_csv("~/Downloads/2000 to 2005 ACCIDENT/PERSON.csv", 
    col_types = cols(ACCIDENT_NO = col_character(), 
        HELMET_BELT_WORN = col_skip(), LICENCE_STATE = col_skip(), 
        POSTCODE = col_skip()))

#Previewing first few rows to check the sample from the complete data
head(PERSON)
#Checking dimensions of dataset for validation.
dim(PERSON)
## [1] 269045     14

Merge Dataset

#Using inner join to combine two data set using common records on ACCIDENT_NO column
mergedData <-ACCIDENT %>% inner_join(PERSON, by='ACCIDENT_NO')

#Checking dimensions of merged dataset.
dim(mergedData)
## [1] 269045     32
#Previewing joined dataset for validation.
head(mergedData)
colnames(mergedData)
##  [1] "ACCIDENT_NO"          "ACCIDENTDATE"         "ACCIDENTTIME"        
##  [4] "ACCIDENT_TYPE"        "Accident Type Desc"   "DAY_OF_WEEK"         
##  [7] "Day Week Description" "DIRECTORY"            "LIGHT_CONDITION"     
## [10] "Light Condition Desc" "NO_OF_VEHICLES"       "NO_PERSONS"          
## [13] "NO_PERSONS_INJ_2"     "NO_PERSONS_INJ_3"     "NO_PERSONS_KILLED"   
## [16] "NO_PERSONS_NOT_INJ"   "POLICE_ATTEND"        "SEVERITY"            
## [19] "SPEED_ZONE"           "PERSON_ID"            "VEHICLE_ID"          
## [22] "SEX"                  "AGE"                  "Age Group"           
## [25] "INJ_LEVEL"            "Inj Level Desc"       "SEATING_POSITION"    
## [28] "ROAD_USER_TYPE"       "Road User Type Desc"  "PEDEST_MOVEMENT"     
## [31] "TAKEN_HOSPITAL"       "EJECTED_CODE"
#The columns names in the dataset was found with spaces in between which was fixed using underscore to avoid quotation marks when calling these columns.
names(mergedData) <- gsub(" ", "_", names(mergedData))

#Renaming few column names to make it easier to understand them.
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_INJ_2"] <- "Serious_Injury"
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_INJ_3"] <- "Other_Injury"
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_NOT_INJ"] <- "No_Injury"

Understand

After joining the merged dataset, we used the str() function to check the data types of different variables. Based on the output, we realised that most of the varibales were either in character or integer variables. However, almost all the variables required data conversion either to factors or date or integer.Factors were further examined for labelling and ordering.

Merged Dataset Structure

#checking structure of joined dataset
str(mergedData)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 269045 obs. of  32 variables:
##  $ ACCIDENT_NO         : chr  "12000000389" "12000000389" "12000000554" "12000000555" ...
##  $ ACCIDENTDATE        : chr  "4/01/2000" "4/01/2000" "6/01/2000" "1/01/2000" ...
##  $ ACCIDENTTIME        : chr  "22.30.00" "22.30.00" "22.15.00" "18.00.00" ...
##  $ ACCIDENT_TYPE       : num  4 4 4 1 1 1 1 4 4 1 ...
##  $ Accident_Type_Desc  : chr  "Collision with a fixed object" "Collision with a fixed object" "Collision with a fixed object" "Collision with vehicle" ...
##  $ DAY_OF_WEEK         : num  3 3 5 7 7 7 7 1 1 2 ...
##  $ Day_Week_Description: chr  "Tuesday" "Tuesday" "Thursday" "Saturday" ...
##  $ DIRECTORY           : chr  "MEL" "MEL" "MEL" "MEL" ...
##  $ LIGHT_CONDITION     : num  5 5 3 1 1 1 1 3 3 3 ...
##  $ Light_Condition_Desc: chr  "Dark No street lights" "Dark No street lights" "Dark Street lights on" "Day" ...
##  $ NO_OF_VEHICLES      : num  1 1 1 4 4 4 4 1 1 3 ...
##  $ NO_PERSONS          : num  2 2 1 4 4 4 4 2 2 3 ...
##  $ Serious_Injury      : num  1 1 1 1 1 1 1 2 2 0 ...
##  $ Other_Injury        : num  0 0 0 1 1 1 1 0 0 2 ...
##  $ NO_PERSONS_KILLED   : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ No_Injury           : num  0 0 0 2 2 2 2 0 0 1 ...
##  $ POLICE_ATTEND       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SEVERITY            : num  1 1 2 2 2 2 2 2 2 3 ...
##  $ SPEED_ZONE          : chr  "090" "090" "060" "060" ...
##  $ PERSON_ID           : chr  "01" "A" "A" "A" ...
##  $ VEHICLE_ID          : chr  "A" "A" "A" "A" ...
##  $ SEX                 : chr  "M" "M" "M" "M" ...
##  $ AGE                 : num  16 19 31 27 32 42 25 18 18 31 ...
##  $ Age_Group           : chr  "16-17" "17-21" "30-39" "26-29" ...
##  $ INJ_LEVEL           : num  2 1 2 2 3 4 4 2 2 3 ...
##  $ Inj_Level_Desc      : chr  "Serious injury" "Fatality" "Serious injury" "Serious injury" ...
##  $ SEATING_POSITION    : chr  "LF" "D" "D" "D" ...
##  $ ROAD_USER_TYPE      : num  3 2 2 2 2 2 2 3 2 2 ...
##  $ Road_User_Type_Desc : chr  "Passengers" "Drivers" "Drivers" "Drivers" ...
##  $ PEDEST_MOVEMENT     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TAKEN_HOSPITAL      : chr  "Y" NA "Y" "Y" ...
##  $ EJECTED_CODE        : num  0 0 0 0 0 0 0 0 0 0 ...

Data Type Conversion

#Using mutate function to convert the variables like 'Day Week Description','Light Condition Desc''sex' etc into factors to facilitate 
mergedData <- mutate_at(mergedData, vars("Day_Week_Description", "DIRECTORY","Light_Condition_Desc","SEX","Age_Group","Inj_Level_Desc","SEATING_POSITION","TAKEN_HOSPITAL","SPEED_ZONE"), as.factor)

#Using mutate function to convert the variables like 'NO_PERSONS','NO_PERSONS_NOT_INJ''EJECTED_CODE' etc into integers to facilitate
mergedData <- mutate_at(mergedData, vars("NO_OF_VEHICLES","NO_PERSONS","Serious_Injury","Other_Injury","NO_PERSONS_KILLED","No_Injury","POLICE_ATTEND","SEVERITY","AGE","INJ_LEVEL","ROAD_USER_TYPE","PEDEST_MOVEMENT","EJECTED_CODE"), as.integer)

#Using dmy function to convert ACCIDENTDATE into year-month-day format
mergedData$ACCIDENTDATE <- ymd(mergedData$ACCIDENTDATE)
## Warning: All formats failed to parse. No formats found.
#Using hms function to convert ACCIDENTTIME into time format for storing time-of-day values
mergedData$ACCIDENTTIME <- hms(mergedData$ACCIDENTTIME)
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
#Checking levels for factor type variables  
levels(mergedData$Day_Week_Description)
## [1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
## [7] "Wednesday"
levels(mergedData$SPEED_ZONE)
##  [1] "040" "050" "060" "070" "075" "080" "090" "100" "110" "777" "888"
## [12] "999"
levels(mergedData$SEX)
## [1] "F" "M" "U"
levels(mergedData$Age_Group)
##  [1] "0-4"     "13-15"   "16-17"   "17-21"   "22-25"   "26-29"   "30-39"  
##  [8] "40-49"   "5-12"    "50-59"   "60-64"   "64-69"   "70+"     "unknown"
#Ordering the factor values inside Day_Week_description
mergedData$Day_Week_Description <- factor(mergedData$Day_Week_Description, levels = c("Monday","Tuesday","Wednesday","Thrusday","Friday","Saturday","Sunday"), ordered = TRUE)

#Ordering and labeling data with more informative label

#note: labeling outliers(777,888,999) as unknown 
mergedData$SPEED_ZONE<- factor(mergedData$SPEED_ZONE, levels= c('030', '040', '050', '060', '070', '075', '080', '090', '100', '110', '777', '888', '999'), labels=c('30KM/HR', '40KM/HR', '50KM/HR', '60KM/HR', '70KM/HR', '75KM/HR', '80KM/HR', '90KM/HR', '100KM/HR', '110KM/HR','Unknown','Unknown', 'Unknown'),ordered = TRUE)

mergedData$SEX <- factor(mergedData$SEX, levels = c('F','M','U'), labels = c("Female",'Male','Unknown'))


mergedData$Age_Group<-factor(mergedData$Age_Group, levels = c('0-4', '5-12', '13-15', '16-17', '17-21', '22-25',  '26-29', '30-39', '40-49', '50-59', '60-64', '64-69', '70+', 'unknown'), labels=c('0-4 Years', '5-12 Years', '13-15 Years', '16-17 Years', '17-21 Years', '22-25 Years',  '26-29 Years', '30-39 Years', '40-49 Years', '50-59 Years', '60-64 Years', '64-69 Years', '70 Years+', 'Unknown'))


#Validating the changes applied to the attributes of the merged dataset
str(mergedData)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 269045 obs. of  32 variables:
##  $ ACCIDENT_NO         : chr  "12000000389" "12000000389" "12000000554" "12000000555" ...
##  $ ACCIDENTDATE        : Date, format: NA NA ...
##  $ ACCIDENTTIME        :Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  22 22 22 18 18 18 18 1 1 23 ...
##   .. ..@ minute: num  30 30 15 0 0 0 0 0 0 50 ...
##  $ ACCIDENT_TYPE       : num  4 4 4 1 1 1 1 4 4 1 ...
##  $ Accident_Type_Desc  : chr  "Collision with a fixed object" "Collision with a fixed object" "Collision with a fixed object" "Collision with vehicle" ...
##  $ DAY_OF_WEEK         : num  3 3 5 7 7 7 7 1 1 2 ...
##  $ Day_Week_Description: Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 2 2 NA 6 6 6 6 7 7 1 ...
##  $ DIRECTORY           : Factor w/ 2 levels "MEL","VCS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LIGHT_CONDITION     : num  5 5 3 1 1 1 1 3 3 3 ...
##  $ Light_Condition_Desc: Factor w/ 7 levels "Dark No street lights",..: 1 1 3 5 5 5 5 3 3 3 ...
##  $ NO_OF_VEHICLES      : int  1 1 1 4 4 4 4 1 1 3 ...
##  $ NO_PERSONS          : int  2 2 1 4 4 4 4 2 2 3 ...
##  $ Serious_Injury      : int  1 1 1 1 1 1 1 2 2 0 ...
##  $ Other_Injury        : int  0 0 0 1 1 1 1 0 0 2 ...
##  $ NO_PERSONS_KILLED   : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ No_Injury           : int  0 0 0 2 2 2 2 0 0 1 ...
##  $ POLICE_ATTEND       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SEVERITY            : int  1 1 2 2 2 2 2 2 2 3 ...
##  $ SPEED_ZONE          : Ord.factor w/ 11 levels "30KM/HR"<"40KM/HR"<..: 8 8 4 4 4 4 4 4 4 5 ...
##  $ PERSON_ID           : chr  "01" "A" "A" "A" ...
##  $ VEHICLE_ID          : chr  "A" "A" "A" "A" ...
##  $ SEX                 : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 2 2 1 2 1 ...
##  $ AGE                 : int  16 19 31 27 32 42 25 18 18 31 ...
##  $ Age_Group           : Factor w/ 14 levels "0-4 Years","5-12 Years",..: 4 5 8 7 8 9 6 5 5 8 ...
##  $ INJ_LEVEL           : int  2 1 2 2 3 4 4 2 2 3 ...
##  $ Inj_Level_Desc      : Factor w/ 4 levels "Fatality","Not injured",..: 4 1 4 4 3 2 2 4 4 3 ...
##  $ SEATING_POSITION    : Factor w/ 10 levels "CF","CR","D",..: 4 3 3 3 3 3 3 4 3 3 ...
##  $ ROAD_USER_TYPE      : int  3 2 2 2 2 2 2 3 2 2 ...
##  $ Road_User_Type_Desc : chr  "Passengers" "Drivers" "Drivers" "Drivers" ...
##  $ PEDEST_MOVEMENT     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TAKEN_HOSPITAL      : Factor w/ 2 levels "N","Y": 2 NA 2 2 1 NA NA 2 2 NA ...
##  $ EJECTED_CODE        : int  0 0 0 0 0 0 0 0 0 0 ...

Tidy & Manipulate Data I

Tidy Data Principles:

let’s look into this more closely by using an example

mergedData %>% select(ACCIDENT_NO,PERSON_ID,Serious_Injury,Other_Injury,No_Injury,INJ_LEVEL,Inj_Level_Desc) %>% filter(ACCIDENT_NO == '12000000389')

Now, let’s look into the second and third tidy data principle :

Tidy & Manipulate Data II

In this section, we have created new variables using the existing data using function mutate().

#Creating two new variables "Person_Affected_Percent" and "Average_Age"
mergedData <- mergedData %>% group_by(ACCIDENT_NO) %>% mutate(PERSON_AFFECTED_PERCENT = (Serious_Injury +Other_Injury+NO_PERSONS_KILLED)/NO_PERSONS*100, AVERAGE_AGE = mean(AGE,na.rm = TRUE))

#Created new varibale to present the output by selecting variables of interest grouped by accident No
resultedData <- mergedData %>% select(ACCIDENT_NO,PERSON_ID,PERSON_AFFECTED_PERCENT,AVERAGE_AGE) %>% group_by(ACCIDENT_NO)

#Checking the result
head(resultedData)

Scan I

In this section, we scanned for the missing values present in the merged dataset. Before any processing, we made few rules to deal with the missing values

#Checking the number of missing values column wise to have a overall look
colSums(is.na(mergedData))
##             ACCIDENT_NO            ACCIDENTDATE            ACCIDENTTIME 
##                       0                  269045                     380 
##           ACCIDENT_TYPE      Accident_Type_Desc             DAY_OF_WEEK 
##                       0                       0                       0 
##    Day_Week_Description               DIRECTORY         LIGHT_CONDITION 
##                   39444                     333                       0 
##    Light_Condition_Desc          NO_OF_VEHICLES              NO_PERSONS 
##                       0                       0                       0 
##          Serious_Injury            Other_Injury       NO_PERSONS_KILLED 
##                       0                       0                       0 
##               No_Injury           POLICE_ATTEND                SEVERITY 
##                       0                       0                       0 
##              SPEED_ZONE               PERSON_ID              VEHICLE_ID 
##                       0                       0                   10405 
##                     SEX                     AGE               Age_Group 
##                       0                   16514                       0 
##               INJ_LEVEL          Inj_Level_Desc        SEATING_POSITION 
##                       0                       0                   10425 
##          ROAD_USER_TYPE     Road_User_Type_Desc         PEDEST_MOVEMENT 
##                       0                       0                       2 
##          TAKEN_HOSPITAL            EJECTED_CODE PERSON_AFFECTED_PERCENT 
##                  182656                     921                       0 
##             AVERAGE_AGE 
##                     243
#Based on results from colSums() function, it was found that variable like ACCIDENTTIME, DIRECTORY,VEHICLE_ID,EJECTED_CODE,AVERAGE_AGE,SEATING_POSITION, PEDEST_MOVEMENT, AGE have missing values less than 10%.

#Therefore, based on the Rule 1, missing values were removed from the dataset
mergedData <- mergedData[!(is.na(mergedData$ACCIDENTTIME)),]
mergedData <- mergedData[!(is.na(mergedData$DIRECTORY)),]
mergedData <- mergedData[!(is.na(mergedData$VEHICLE_ID)),]
mergedData <- mergedData[!(is.na(mergedData$EJECTED_CODE)),]
mergedData <- mergedData[!(is.na(mergedData$AVERAGE_AGE)),]
mergedData <- mergedData[!(is.na(mergedData$SEATING_POSITION)),]
mergedData <- mergedData[!(is.na(mergedData$PEDEST_MOVEMENT)),]
mergedData <- mergedData[!(is.na(mergedData$AGE)),]


#Checking the dataset after missing values removal
colSums(is.na(mergedData))
##             ACCIDENT_NO            ACCIDENTDATE            ACCIDENTTIME 
##                       0                  241241                       0 
##           ACCIDENT_TYPE      Accident_Type_Desc             DAY_OF_WEEK 
##                       0                       0                       0 
##    Day_Week_Description               DIRECTORY         LIGHT_CONDITION 
##                   35251                       0                       0 
##    Light_Condition_Desc          NO_OF_VEHICLES              NO_PERSONS 
##                       0                       0                       0 
##          Serious_Injury            Other_Injury       NO_PERSONS_KILLED 
##                       0                       0                       0 
##               No_Injury           POLICE_ATTEND                SEVERITY 
##                       0                       0                       0 
##              SPEED_ZONE               PERSON_ID              VEHICLE_ID 
##                       0                       0                       0 
##                     SEX                     AGE               Age_Group 
##                       0                       0                       0 
##               INJ_LEVEL          Inj_Level_Desc        SEATING_POSITION 
##                       0                       0                       0 
##          ROAD_USER_TYPE     Road_User_Type_Desc         PEDEST_MOVEMENT 
##                       0                       0                       0 
##          TAKEN_HOSPITAL            EJECTED_CODE PERSON_AFFECTED_PERCENT 
##                  164162                       0                       0 
##             AVERAGE_AGE 
##                       0

Scan II

In this section, numeric variables were checked in our merged data to scan for the outliers and fixed them with suitable techniques.

To start with the process,

#Histogram to check the distribution of AGE variable
hist(mergedData$AGE ,col = "gold", xlab = "Age", ylab = "Number of Individuals", main = "Histogram of Age of the People involved in Accidents")
abline(v= mean(mergedData$AGE), col="red", lwd=2, lty=2)
abline(v=median(mergedData$AGE), col='blue', lwd=2, lty=2)

#Boxplot to detect the outliers
mergedData$AGE %>% boxplot(main = "Detecting Outliers in Age variable", ylab = "Age", col = "gold")

#Checking the summary before outlier removal
summary(mergedData$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.00   31.00   34.23   45.00   99.00
#Defining Capping function
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

#Applying the defined capping function on the data to fix the outliers
mergedData$AGE <- mergedData$AGE %>% cap()
#Getting the updated summary of the data
summary(mergedData$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.00   31.00   34.08   45.00   81.00
#Boxplot to check the data after outlier removal
mergedData$AGE %>% boxplot(main = "Age Variable After Outliers Removal", ylab = "Age", col = "lightBlue")

Transform

As from the last section, we discovered that the AGE variable in our dataset is skewed to the right and therefore, in this section different transformation techniques were applied to get rid of the skewness.

#Firstly, the log 10 transformation was applied as it's commonly used for reducing right skewness
log_age <- log10(mergedData$AGE)
hist(log_age)

#Secondly, natural logarithm was applied to check whether it's generating better result than log 10
ln_age <- log(mergedData$AGE)
hist(ln_age)

#Lastly, square root transformation was used for reducing right skewness
sqrt_age <- sqrt(mergedData$AGE)
hist(sqrt_age)

Results

Based on the generated result, square root transformation performed the best. Though, the trasnformation didn’t result in complete ridness to the skewness.The square root transformation is still slightly skewed to the left.