Data Preprocessing

Required packages

#Loading the packages 
library(tidyr)
library(lubridate)
library(ggplot2)
library(readr)
library(forecast)
library(dplyr)
library(Hmisc)
library(editrules)
library(outliers)

Executive Summary

The ‘Comma Seperated Values (csv)’ files were downloaded from open source data and imported into R to perform required data preprocessing analysis.

In this assignment, 3 dataset has been joined out of 11 datasets available in VicRoads CrashStats. Unnecessary column has been removed from all dataset and also dataset observation has been filtered out to make dataset more manageable. Each attributes are examined carefully and data type conversion is done to approximate classes like character to factor and applied reordering of factor variables. We checked for the tidiness of the dataset and found the dataset is in tidy format. New attributes has been created using mutate() for calculation of variables.

The dataset was then scanned for nan, inf, outliers and missing values. Moreover inconsistencies and obvious error were also checked for variables. An approximated transformation was done which reduces the skewedness of histogram and transformed into a normal distribution, making the analysis simpler.

Data

The dataset were downloaded as ‘Comma Separated Values (csv)’ files from the VicRoads Website. https://discover.data.vic.gov.au/dataset/crash-stats-data-extract

The VicRoads crashstats website provides with all the necessary information to analyze Victorian fatal and injury crash data based on person, vehicle, time, conditions, crash type, road user type, object hit etc. The dataset comprises of total of 11 csv files out of which accident, person and vehicle dataset files are chosen for the assignment analysis and joined based on common variables.

Dataset 1: Accident- provides information about Peron accident, no of people involved and speed zone of the vehicle.(181601 observations,28 variables)

Dataset 2: Person- provides information about person id, person age, sex, Road user type(passenger, driver,motorcyclists) and level of injury to the person.(438142 observations,17 variables)

Dataset 3: Vehicle- provides information about vehicle registration details vehicle type, owner postcode of the vehicle.(323741 observations,38 variables)

Importing

All the 3 datasets has been read into R using read_csv() function and saved as ACCIENDT,PEROSN, VEHICLE respectively.

ACCIDENT <- read_csv("C:/Users/soumya/Desktop/csv/ACCIDENT.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   ACCIDENT_NO = col_character(),
##   ACCIDENTDATE = col_character(),
##   ACCIDENTTIME = col_character(),
##   `Accident Type Desc` = col_character(),
##   `Day Week Description` = col_character(),
##   `DCA Description` = col_character(),
##   DIRECTORY = col_character(),
##   PAGE = col_character(),
##   GRID_REFERENCE_X = col_character(),
##   `Light Condition Desc` = col_character(),
##   `Road Geometry Desc` = col_character(),
##   SPEED_ZONE = col_character()
## )

## See spec(...) for full column specifications.

## Warning: 10 parsing failures.
##   row     col expected actual                                       file
## 65513 EDITION a double   ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 66295 EDITION a double   ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 67973 EDITION a double   ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 75085 EDITION a double   ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 97633 EDITION a double   ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## ..... ....... ........ ...... ..........................................
## See problems(...) for more details.

PERSON <- read_csv("C:/Users/soumya/Desktop/csv/PERSON.csv")

## Parsed with column specification:
## cols(
##   ACCIDENT_NO = col_character(),
##   PERSON_ID = col_character(),
##   VEHICLE_ID = col_character(),
##   SEX = col_character(),
##   AGE = col_double(),
##   `Age Group` = col_character(),
##   INJ_LEVEL = col_double(),
##   `Inj Level Desc` = col_character(),
##   SEATING_POSITION = col_character(),
##   HELMET_BELT_WORN = col_double(),
##   ROAD_USER_TYPE = col_double(),
##   `Road User Type Desc` = col_character(),
##   LICENCE_STATE = col_character(),
##   PEDEST_MOVEMENT = col_double(),
##   POSTCODE = col_double(),
##   TAKEN_HOSPITAL = col_character(),
##   EJECTED_CODE = col_double()
## )

VEHICLE <- read_csv("C:/Users/soumya/Desktop/csv/VEHICLE.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   VEHICLE_YEAR_MANUF = col_double(),
##   VEHICLE_DCA_CODE = col_double(),
##   ROAD_SURFACE_TYPE = col_double(),
##   VEHICLE_POWER = col_logical(),
##   VEHICLE_WEIGHT = col_double(),
##   NO_OF_WHEELS = col_double(),
##   NO_OF_CYLINDERS = col_double(),
##   SEATING_CAPACITY = col_double(),
##   TARE_WEIGHT = col_double(),
##   TOTAL_NO_OCCUPANTS = col_double(),
##   CARRY_CAPACITY = col_double(),
##   CUBIC_CAPACITY = col_number(),
##   CAUGHT_FIRE = col_double(),
##   LAMPS = col_double(),
##   LEVEL_OF_DAMAGE = col_double(),
##   OWNER_POSTCODE = col_double(),
##   TOWED_AWAY_FLAG = col_double()
## )
## See spec(...) for full column specifications.

head(ACCIDENT)

head(PERSON)

head(VEHICLE)

We merged 3 datasets together using merge() function and saved it as VICROAD.The Accident and Person datasets are merged using inner_join based on AcCIDENT_No which is saved as MERGE1. The MERGE1 is then joined with Vehicle dataset based on ACCIDENT_NO and VEHICLE_ID and saved as VICROAD.The final dataset contains 80 variables with 418448 observations totally.

#Merging the three datasets together 
MERGE1 <-ACCIDENT%>%inner_join(PERSON, by='ACCIDENT_NO')
VICROAD <-MERGE1%>%inner_join(VEHICLE, by=c("ACCIDENT_NO","VEHICLE_ID"))

head(VICROAD)

#Replcaing the blank spaces in the variables with "_". It will be easy to call up column names.
colnames(VICROAD)

##  [1] "ACCIDENT_NO"            "ACCIDENTDATE"          
##  [3] "ACCIDENTTIME"           "ACCIDENT_TYPE"         
##  [5] "Accident Type Desc"     "DAY_OF_WEEK"           
##  [7] "Day Week Description"   "DCA_CODE"              
##  [9] "DCA Description"        "DIRECTORY"             
## [11] "EDITION"                "PAGE"                  
## [13] "GRID_REFERENCE_X"       "GRID_REFERENCE_Y"      
## [15] "LIGHT_CONDITION"        "Light Condition Desc"  
## [17] "NODE_ID"                "NO_OF_VEHICLES"        
## [19] "NO_PERSONS"             "NO_PERSONS_INJ_2"      
## [21] "NO_PERSONS_INJ_3"       "NO_PERSONS_KILLED"     
## [23] "NO_PERSONS_NOT_INJ"     "POLICE_ATTEND"         
## [25] "ROAD_GEOMETRY"          "Road Geometry Desc"    
## [27] "SEVERITY"               "SPEED_ZONE"            
## [29] "PERSON_ID"              "VEHICLE_ID"            
## [31] "SEX"                    "AGE"                   
## [33] "Age Group"              "INJ_LEVEL"             
## [35] "Inj Level Desc"         "SEATING_POSITION"      
## [37] "HELMET_BELT_WORN"       "ROAD_USER_TYPE"        
## [39] "Road User Type Desc"    "LICENCE_STATE"         
## [41] "PEDEST_MOVEMENT"        "POSTCODE"              
## [43] "TAKEN_HOSPITAL"         "EJECTED_CODE"          
## [45] "VEHICLE_YEAR_MANUF"     "VEHICLE_DCA_CODE"      
## [47] "INITIAL_DIRECTION"      "ROAD_SURFACE_TYPE"     
## [49] "Road Surface Type Desc" "REG_STATE"             
## [51] "VEHICLE_BODY_STYLE"     "VEHICLE_MAKE"          
## [53] "VEHICLE_MODEL"          "VEHICLE_POWER"         
## [55] "VEHICLE_TYPE"           "Vehicle Type Desc"     
## [57] "VEHICLE_WEIGHT"         "CONSTRUCTION_TYPE"     
## [59] "FUEL_TYPE"              "NO_OF_WHEELS"          
## [61] "NO_OF_CYLINDERS"        "SEATING_CAPACITY"      
## [63] "TARE_WEIGHT"            "TOTAL_NO_OCCUPANTS"    
## [65] "CARRY_CAPACITY"         "CUBIC_CAPACITY"        
## [67] "FINAL_DIRECTION"        "DRIVER_INTENT"         
## [69] "VEHICLE_MOVEMENT"       "TRAILER_TYPE"          
## [71] "VEHICLE_COLOUR_1"       "VEHICLE_COLOUR_2"      
## [73] "CAUGHT_FIRE"            "INITIAL_IMPACT"        
## [75] "LAMPS"                  "LEVEL_OF_DAMAGE"       
## [77] "OWNER_POSTCODE"         "TOWED_AWAY_FLAG"       
## [79] "TRAFFIC_CONTROL"        "Traffic Control Desc"

names(VICROAD) <- gsub(" ", "_", names(VICROAD))
colnames(VICROAD)

##  [1] "ACCIDENT_NO"            "ACCIDENTDATE"          
##  [3] "ACCIDENTTIME"           "ACCIDENT_TYPE"         
##  [5] "Accident_Type_Desc"     "DAY_OF_WEEK"           
##  [7] "Day_Week_Description"   "DCA_CODE"              
##  [9] "DCA_Description"        "DIRECTORY"             
## [11] "EDITION"                "PAGE"                  
## [13] "GRID_REFERENCE_X"       "GRID_REFERENCE_Y"      
## [15] "LIGHT_CONDITION"        "Light_Condition_Desc"  
## [17] "NODE_ID"                "NO_OF_VEHICLES"        
## [19] "NO_PERSONS"             "NO_PERSONS_INJ_2"      
## [21] "NO_PERSONS_INJ_3"       "NO_PERSONS_KILLED"     
## [23] "NO_PERSONS_NOT_INJ"     "POLICE_ATTEND"         
## [25] "ROAD_GEOMETRY"          "Road_Geometry_Desc"    
## [27] "SEVERITY"               "SPEED_ZONE"            
## [29] "PERSON_ID"              "VEHICLE_ID"            
## [31] "SEX"                    "AGE"                   
## [33] "Age_Group"              "INJ_LEVEL"             
## [35] "Inj_Level_Desc"         "SEATING_POSITION"      
## [37] "HELMET_BELT_WORN"       "ROAD_USER_TYPE"        
## [39] "Road_User_Type_Desc"    "LICENCE_STATE"         
## [41] "PEDEST_MOVEMENT"        "POSTCODE"              
## [43] "TAKEN_HOSPITAL"         "EJECTED_CODE"          
## [45] "VEHICLE_YEAR_MANUF"     "VEHICLE_DCA_CODE"      
## [47] "INITIAL_DIRECTION"      "ROAD_SURFACE_TYPE"     
## [49] "Road_Surface_Type_Desc" "REG_STATE"             
## [51] "VEHICLE_BODY_STYLE"     "VEHICLE_MAKE"          
## [53] "VEHICLE_MODEL"          "VEHICLE_POWER"         
## [55] "VEHICLE_TYPE"           "Vehicle_Type_Desc"     
## [57] "VEHICLE_WEIGHT"         "CONSTRUCTION_TYPE"     
## [59] "FUEL_TYPE"              "NO_OF_WHEELS"          
## [61] "NO_OF_CYLINDERS"        "SEATING_CAPACITY"      
## [63] "TARE_WEIGHT"            "TOTAL_NO_OCCUPANTS"    
## [65] "CARRY_CAPACITY"         "CUBIC_CAPACITY"        
## [67] "FINAL_DIRECTION"        "DRIVER_INTENT"         
## [69] "VEHICLE_MOVEMENT"       "TRAILER_TYPE"          
## [71] "VEHICLE_COLOUR_1"       "VEHICLE_COLOUR_2"      
## [73] "CAUGHT_FIRE"            "INITIAL_IMPACT"        
## [75] "LAMPS"                  "LEVEL_OF_DAMAGE"       
## [77] "OWNER_POSTCODE"         "TOWED_AWAY_FLAG"       
## [79] "TRAFFIC_CONTROL"        "Traffic_Control_Desc"

The dataset contains 80 variables which is huge after merging the 3 datasets. Selecting few and relevant varibles required from the merged dataset for the analysis. We use select() function which allow us to select reqired variables(column names).

#Selecting few variables for the analysis
FINAL_VIC <- select(VICROAD ,ACCIDENT_NO,ACCIDENTDATE,ACCIDENTTIME,ACCIDENT_TYPE, Accident_Type_Desc,NO_OF_VEHICLES,NO_PERSONS,NO_PERSONS_INJ_2,NO_PERSONS_INJ_3,NO_PERSONS_KILLED,NO_PERSONS_NOT_INJ,Light_Condition_Desc,SPEED_ZONE,PERSON_ID, VEHICLE_ID,SEX,AGE,Age_Group,INJ_LEVEL, Inj_Level_Desc, SEATING_POSITION,HELMET_BELT_WORN,Road_User_Type_Desc,Vehicle_Type_Desc,CARRY_CAPACITY, OWNER_POSTCODE,Traffic_Control_Desc)

dim(FINAL_VIC)

## [1] 418448     27

The FINAL_VIC conatins a numerous observation totall of 418442. We use filter() function to filter out observations based on conditions,where we filtered out based on day for Light_Condition_Desc, drives for Road_User_Type_Desc and finally car for Vehicle_Type_Desc. This is just to make the interpretations work simpler and easy to understand.

The DETAILS contains about 110674 observations.

#To reduce the number of observations
DETAILS <- filter(FINAL_VIC ,Light_Condition_Desc == 'Day',Road_User_Type_Desc == 'Drivers', Vehicle_Type_Desc == 'Car' )

dim(DETAILS)

## [1] 110674     27

Understand

The dataset contains about 110674 observations with 27 variables. It contains character and numericals values.The structure of the dataset was examined and the required data type conversions were then carried out to few classes and the final dataset consists of character, numeric, factor and date variables.

str(DETAILS)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 110674 obs. of  27 variables:
##  $ ACCIDENT_NO         : chr  "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
##  $ ACCIDENTDATE        : chr  "13/01/2006" "13/01/2006" "13/01/2006" "14/01/2006" ...
##  $ ACCIDENTTIME        : chr  "12.42.00" "12.42.00" "19.10.00" "11.49.00" ...
##  $ ACCIDENT_TYPE       : num  1 1 1 1 4 2 4 1 1 1 ...
##  $ Accident_Type_Desc  : chr  "Collision with vehicle" "Collision with vehicle" "Collision with vehicle" "Collision with vehicle" ...
##  $ NO_OF_VEHICLES      : num  3 3 2 2 1 1 1 3 3 2 ...
##  $ NO_PERSONS          : num  6 6 4 2 1 2 2 4 4 4 ...
##  $ NO_PERSONS_INJ_2    : num  0 0 0 1 1 0 1 1 1 0 ...
##  $ NO_PERSONS_INJ_3    : num  1 1 1 0 0 1 0 2 2 1 ...
##  $ NO_PERSONS_KILLED   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NO_PERSONS_NOT_INJ  : num  5 5 3 1 0 1 1 1 1 3 ...
##  $ Light_Condition_Desc: chr  "Day" "Day" "Day" "Day" ...
##  $ SPEED_ZONE          : chr  "060" "060" "070" "080" ...
##  $ PERSON_ID           : chr  "A" "B" "B" "A" ...
##  $ VEHICLE_ID          : chr  "A" "B" "B" "A" ...
##  $ SEX                 : chr  "M" "F" "F" "M" ...
##  $ AGE                 : num  72 62 20 43 50 32 31 53 41 51 ...
##  $ Age_Group           : chr  "70+" "60-64" "17-21" "40-49" ...
##  $ INJ_LEVEL           : num  4 3 3 2 2 4 4 3 4 4 ...
##  $ Inj_Level_Desc      : chr  "Not injured" "Other injury" "Other injury" "Serious injury" ...
##  $ SEATING_POSITION    : chr  "D" "D" "D" "D" ...
##  $ HELMET_BELT_WORN    : num  1 1 1 1 1 1 9 1 1 1 ...
##  $ Road_User_Type_Desc : chr  "Drivers" "Drivers" "Drivers" "Drivers" ...
##  $ Vehicle_Type_Desc   : chr  "Car" "Car" "Car" "Car" ...
##  $ CARRY_CAPACITY      : num  1100 974 1200 800 1450 ...
##  $ OWNER_POSTCODE      : num  3130 3977 3805 3016 3850 ...
##  $ Traffic_Control_Desc: chr  "Stop-go lights" "Stop-go lights" "Giveway sign" "Stop-go lights" ...

The ACCIDENTDATE and ACCIDENTTIME are in character format which is been converted into date format uisng dmy() and hms() function respectively. Accident_Type_Desc, Light_Condition_Desc, SEX,Age_Group, Inj_Level_Desc, Road_User_Type_Desc, Vehicle_Type_Desc and Traffic_Control_Desc are in character format which are converted into factors. SPEED_ZONE is converted to numeric from character format.

#Converting to date and time format
DETAILS$ACCIDENTDATE <- dmy(DETAILS$ACCIDENTDATE)
DETAILS$ACCIDENTTIME <- hms(DETAILS$ACCIDENTTIME)

#Converting into factors
DETAILS[ , c(5, 12, 16, 18, 20, 23, 24, 27)]<-lapply(DETAILS[ ,c (5, 12, 16, 18, 20, 23, 24, 27)], factor)

#Converting into numeric
DETAILS$SPEED_ZONE <- as.numeric(DETAILS$SPEED_ZONE)

str(DETAILS)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 110674 obs. of  27 variables:
##  $ ACCIDENT_NO         : chr  "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
##  $ ACCIDENTDATE        : Date, format: "2006-01-13" "2006-01-13" ...
##  $ ACCIDENTTIME        :Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  12 12 19 11 10 11 11 11 11 16 ...
##   .. ..@ minute: num  42 42 10 49 55 0 53 50 50 15 ...
##  $ ACCIDENT_TYPE       : num  1 1 1 1 4 2 4 1 1 1 ...
##  $ Accident_Type_Desc  : Factor w/ 9 levels "Collision with a fixed object",..: 3 3 3 3 1 8 1 3 3 3 ...
##  $ NO_OF_VEHICLES      : num  3 3 2 2 1 1 1 3 3 2 ...
##  $ NO_PERSONS          : num  6 6 4 2 1 2 2 4 4 4 ...
##  $ NO_PERSONS_INJ_2    : num  0 0 0 1 1 0 1 1 1 0 ...
##  $ NO_PERSONS_INJ_3    : num  1 1 1 0 0 1 0 2 2 1 ...
##  $ NO_PERSONS_KILLED   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NO_PERSONS_NOT_INJ  : num  5 5 3 1 0 1 1 1 1 3 ...
##  $ Light_Condition_Desc: Factor w/ 1 level "Day": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPEED_ZONE          : num  60 60 70 80 100 60 80 80 80 60 ...
##  $ PERSON_ID           : chr  "A" "B" "B" "A" ...
##  $ VEHICLE_ID          : chr  "A" "B" "B" "A" ...
##  $ SEX                 : Factor w/ 3 levels "F","M","U": 2 1 1 2 1 1 2 1 2 1 ...
##  $ AGE                 : num  72 62 20 43 50 32 31 53 41 51 ...
##  $ Age_Group           : Factor w/ 14 levels "0-4","13-15",..: 13 11 4 8 10 7 7 10 8 10 ...
##  $ INJ_LEVEL           : num  4 3 3 2 2 4 4 3 4 4 ...
##  $ Inj_Level_Desc      : Factor w/ 4 levels "Fatality","Not injured",..: 2 3 3 4 4 2 2 3 2 2 ...
##  $ SEATING_POSITION    : chr  "D" "D" "D" "D" ...
##  $ HELMET_BELT_WORN    : num  1 1 1 1 1 1 9 1 1 1 ...
##  $ Road_User_Type_Desc : Factor w/ 1 level "Drivers": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Vehicle_Type_Desc   : Factor w/ 1 level "Car": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CARRY_CAPACITY      : num  1100 974 1200 800 1450 ...
##  $ OWNER_POSTCODE      : num  3130 3977 3805 3016 3850 ...
##  $ Traffic_Control_Desc: Factor w/ 17 levels "Flashing lights",..: 15 15 2 15 3 3 3 15 15 17 ...

checking the level of variables like SEX, Age_Group, Inj_Level_Desc using levels() function to get the order of arrangement. We have used factor() function to re-arrange in to ordered format. Also, renamed the SEX column names for better understanding.

#Levels for the factors variable
levels(DETAILS$SEX)

## [1] "F" "M" "U"

levels(DETAILS$Age_Group)

##  [1] "0-4"     "13-15"   "16-17"   "17-21"   "22-25"   "26-29"   "30-39"  
##  [8] "40-49"   "5-12"    "50-59"   "60-64"   "64-69"   "70+"     "unknown"

levels(DETAILS$Inj_Level_Desc)

## [1] "Fatality"       "Not injured"    "Other injury"   "Serious injury"

#Re-naming label and ordering the levels for factor variables

DETAILS$SEX <- factor(DETAILS$SEX, levels = c('F','M','U'), labels= c('Female','Male', 'Unknown'))

DETAILS$Age_Group<-factor(DETAILS$Age_Group, levels = c('0-4', '5-12', '13-15', '16-17', '17-21', '22-25',  '26-29', '30-39', '40-49', '50-59', '60-64', '64-69', '70+', 'unknown'), labels=c('0-4 yrs', '5-12 yrs', '13-15 yrs', '16-17 yrs', '17-21 yrs', '22-25 yrs',  '26-29 yrs', '30-39 yrs', '40-49 yrs', '50-59 yrs', '60-64 yrs', '64-69 yrs', '70 yrs or older', 'Unknown'))

DETAILS$Inj_Level_Desc <- factor(DETAILS$Inj_Level_Desc, levels = c("Fatality","Not injured","Other injury","Serious injury"), labels= c("Not injured","Other injury","Serious injury","Fatality"))

str(DETAILS)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 110674 obs. of  27 variables:
##  $ ACCIDENT_NO         : chr  "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
##  $ ACCIDENTDATE        : Date, format: "2006-01-13" "2006-01-13" ...
##  $ ACCIDENTTIME        :Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  12 12 19 11 10 11 11 11 11 16 ...
##   .. ..@ minute: num  42 42 10 49 55 0 53 50 50 15 ...
##  $ ACCIDENT_TYPE       : num  1 1 1 1 4 2 4 1 1 1 ...
##  $ Accident_Type_Desc  : Factor w/ 9 levels "Collision with a fixed object",..: 3 3 3 3 1 8 1 3 3 3 ...
##  $ NO_OF_VEHICLES      : num  3 3 2 2 1 1 1 3 3 2 ...
##  $ NO_PERSONS          : num  6 6 4 2 1 2 2 4 4 4 ...
##  $ NO_PERSONS_INJ_2    : num  0 0 0 1 1 0 1 1 1 0 ...
##  $ NO_PERSONS_INJ_3    : num  1 1 1 0 0 1 0 2 2 1 ...
##  $ NO_PERSONS_KILLED   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NO_PERSONS_NOT_INJ  : num  5 5 3 1 0 1 1 1 1 3 ...
##  $ Light_Condition_Desc: Factor w/ 1 level "Day": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPEED_ZONE          : num  60 60 70 80 100 60 80 80 80 60 ...
##  $ PERSON_ID           : chr  "A" "B" "B" "A" ...
##  $ VEHICLE_ID          : chr  "A" "B" "B" "A" ...
##  $ SEX                 : Factor w/ 3 levels "Female","Male",..: 2 1 1 2 1 1 2 1 2 1 ...
##  $ AGE                 : num  72 62 20 43 50 32 31 53 41 51 ...
##  $ Age_Group           : Factor w/ 14 levels "0-4 yrs","5-12 yrs",..: 13 11 5 9 10 8 8 10 9 10 ...
##  $ INJ_LEVEL           : num  4 3 3 2 2 4 4 3 4 4 ...
##  $ Inj_Level_Desc      : Factor w/ 4 levels "Not injured",..: 2 3 3 4 4 2 2 3 2 2 ...
##  $ SEATING_POSITION    : chr  "D" "D" "D" "D" ...
##  $ HELMET_BELT_WORN    : num  1 1 1 1 1 1 9 1 1 1 ...
##  $ Road_User_Type_Desc : Factor w/ 1 level "Drivers": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Vehicle_Type_Desc   : Factor w/ 1 level "Car": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CARRY_CAPACITY      : num  1100 974 1200 800 1450 ...
##  $ OWNER_POSTCODE      : num  3130 3977 3805 3016 3850 ...
##  $ Traffic_Control_Desc: Factor w/ 17 levels "Flashing lights",..: 15 15 2 15 3 3 3 15 15 17 ...

Tidy & Manipulate Data I

To check whether the data is in tidy format, dataset must conform certain principles which are: 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell.

The dataset is in tidy format since it has correct datatypes for all the columns and conform to tidy principle.

head(DETAILS,15)

Tidy & Manipulate Data II

Creating new varliables from the existing variables. The function mutate() is used to create a new variable by grouping based on ACCIDENT_NO using group_by() function. The new variables are the percentage of people killed, not injured which is calculated by NO_PERSONS_KILLED divided by NO_PERSONS and multiplied by 100.

#Calculating percentage of people injured
DETAILS<-DETAILS %>% group_by(ACCIDENT_NO) %>%
  mutate(PERC_FATAL=NO_PERSONS_KILLED/NO_PERSONS*100,
         PERC_INJ_2=NO_PERSONS_INJ_2/NO_PERSONS*100,
         PERC_INJ_3=NO_PERSONS_INJ_3/NO_PERSONS*100,
         PERC_NOT_INJ=NO_PERSONS_NOT_INJ/NO_PERSONS*100)

head(DETAILS)

Scan I

Firstly, the total number of missing values was calculated. Secondly, the sum of missing values for each column was calculated. We found that AGE, CARRY_CAPACITY and OWNER_POSTCODE have missing values of 3367, 94096, 2851 respectively.

#Checking for na values
sum(is.na(DETAILS))

## [1] 100314

colSums(is.na(DETAILS))

##          ACCIDENT_NO         ACCIDENTDATE         ACCIDENTTIME 
##                    0                    0                    0 
##        ACCIDENT_TYPE   Accident_Type_Desc       NO_OF_VEHICLES 
##                    0                    0                    0 
##           NO_PERSONS     NO_PERSONS_INJ_2     NO_PERSONS_INJ_3 
##                    0                    0                    0 
##    NO_PERSONS_KILLED   NO_PERSONS_NOT_INJ Light_Condition_Desc 
##                    0                    0                    0 
##           SPEED_ZONE            PERSON_ID           VEHICLE_ID 
##                    0                    0                    0 
##                  SEX                  AGE            Age_Group 
##                    0                 3367                    0 
##            INJ_LEVEL       Inj_Level_Desc     SEATING_POSITION 
##                    0                    0                    0 
##     HELMET_BELT_WORN  Road_User_Type_Desc    Vehicle_Type_Desc 
##                    0                    0                    0 
##       CARRY_CAPACITY       OWNER_POSTCODE Traffic_Control_Desc 
##                94096                 2851                    0 
##           PERC_FATAL           PERC_INJ_2           PERC_INJ_3 
##                    0                    0                    0 
##         PERC_NOT_INJ 
##                    0

AGE is numeric variable which has missing values, it can be replaced by mean or medain of the age column. Here we replaced with mean value(42.07253).

CARRY_CAPACITY is also numerical, which can be replaced wih mean or median values. Replacing it with mean of 977.2936.

OWNER_POSTCODE cannot be replaced by mean or median, since each owner has unique postcode and also the missing values are in small percentage, it seemed appropriate to remove these observations from the dataset.

#Replacing the missing values with the mean 
mean(DETAILS$AGE,na.rm = TRUE)

## [1] 42.07253

DETAILS$AGE[is.na(DETAILS$AGE)] <- mean(DETAILS$AGE, na.rm = TRUE)

mean(DETAILS$CARRY_CAPACITY,na.rm = TRUE)

## [1] 977.2936

DETAILS$CARRY_CAPACITY[is.na(DETAILS$CARRY_CAPACITY)] <- mean(DETAILS$CARRY_CAPACITY, na.rm = TRUE)

FINAL_DETAILS<- DETAILS[!is.na(DETAILS$OWNER_POSTCODE),]

colSums(is.na(FINAL_DETAILS))

##          ACCIDENT_NO         ACCIDENTDATE         ACCIDENTTIME 
##                    0                    0                    0 
##        ACCIDENT_TYPE   Accident_Type_Desc       NO_OF_VEHICLES 
##                    0                    0                    0 
##           NO_PERSONS     NO_PERSONS_INJ_2     NO_PERSONS_INJ_3 
##                    0                    0                    0 
##    NO_PERSONS_KILLED   NO_PERSONS_NOT_INJ Light_Condition_Desc 
##                    0                    0                    0 
##           SPEED_ZONE            PERSON_ID           VEHICLE_ID 
##                    0                    0                    0 
##                  SEX                  AGE            Age_Group 
##                    0                    0                    0 
##            INJ_LEVEL       Inj_Level_Desc     SEATING_POSITION 
##                    0                    0                    0 
##     HELMET_BELT_WORN  Road_User_Type_Desc    Vehicle_Type_Desc 
##                    0                    0                    0 
##       CARRY_CAPACITY       OWNER_POSTCODE Traffic_Control_Desc 
##                    0                    0                    0 
##           PERC_FATAL           PERC_INJ_2           PERC_INJ_3 
##                    0                    0                    0 
##         PERC_NOT_INJ 
##                    0

#No missing values found
sum(is.na(FINAL_DETAILS))

## [1] 0

Thirdly, checking for special values such as NaN and Inf values in the dataset. The dataset didn’t contain any such values.

#Finding Inf and NaN values
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(FINAL_DETAILS, function(x) sum( is.special(x) ))

##          ACCIDENT_NO         ACCIDENTDATE         ACCIDENTTIME 
##                    0                    0                    0 
##        ACCIDENT_TYPE   Accident_Type_Desc       NO_OF_VEHICLES 
##                    0                    0                    0 
##           NO_PERSONS     NO_PERSONS_INJ_2     NO_PERSONS_INJ_3 
##                    0                    0                    0 
##    NO_PERSONS_KILLED   NO_PERSONS_NOT_INJ Light_Condition_Desc 
##                    0                    0                    0 
##           SPEED_ZONE            PERSON_ID           VEHICLE_ID 
##                    0                    0                    0 
##                  SEX                  AGE            Age_Group 
##                    0                    0                    0 
##            INJ_LEVEL       Inj_Level_Desc     SEATING_POSITION 
##                    0                    0                    0 
##     HELMET_BELT_WORN  Road_User_Type_Desc    Vehicle_Type_Desc 
##                    0                    0                    0 
##       CARRY_CAPACITY       OWNER_POSTCODE Traffic_Control_Desc 
##                    0                    0                    0 
##           PERC_FATAL           PERC_INJ_2           PERC_INJ_3 
##                    0                    0                    0 
##         PERC_NOT_INJ 
##                    0

Finally, checking for inconsistencies and obvious error in the dataset, age cannot be negative or greater than 120 which is specified as condition and found that there exists no such observations in the dataset.

#Setting rule for age
(error <- editset(c("AGE >= 0", "AGE <= 120")))

## 
## Edit set:
## num1 : 0 <= AGE
## num2 : AGE <= 120

age_error <- violatedEdits(error, FINAL_DETAILS)
summary(age_error)

## No violations detected, 0 checks evaluated to NA

## NULL

Scan II

In this step, all the numerical variables are scanned to detect outliers. The box plot for SPEED_ZONE shows few outliers which are exceeding the SPEED_ZONE values from normal.Therefore, the outliers are removed.

#checking outliers

boxplot(FINAL_DETAILS$SPEED_ZONE,
main = "SPEED_ZONE PLOT",ylab = "SPEED_ZONE",col = "orange")

#Removing Outliers
speed_outliers <- boxplot(FINAL_DETAILS$SPEED_ZONE, plot=FALSE)$out
FINAL_DETAILS <- FINAL_DETAILS[-which(FINAL_DETAILS$SPEED_ZONE %in% speed_outliers),]
boxplot(FINAL_DETAILS$SPEED_ZONE,
main = "SPEED_ZONE PLOT",ylab = "SPEED_ZONE",col = "orange")

In the above box plot, the outliers are removed that exceeded the actual SPEED_ZONE. The box-plot for CARRY_CAPACITY shows number of outliers.

boxplot(FINAL_DETAILS$CARRY_CAPACITY,
main = "CARRY_CAPACITY-BOX PLOT",ylab = "CARRY_CAPACITY",col = "grey")

The AGE variable also contains outliers as shown in the box plot. All the outliers present exists in the age group of 70+. The approach used to remove the outliers is using z-score.The same method as AGE is been implemented for CARRY_CAPACITY to remove the outliers.

boxplot(FINAL_DETAILS$AGE~FINAL_DETAILS$Age_Group)

Tukey’s method of outlier detection - is used to detect the outliers, which are defiened as the values which are behond the range of -1.5IQR to 1.5IQR limit.

From the summary() function, the minimum z score is -1.8444, maximum z score is 3.3308. In this step, we are using which() function to get the location of z score whoes values are above 3.

z.scores <- FINAL_DETAILS$AGE %>%  scores(type = "z")
z.scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8379 -0.8679 -0.1672  0.0000  0.6951  3.3358

dim(FINAL_DETAILS)

## [1] 103693     31

length(which( abs(z.scores) >3 ))

## [1] 3

There are only 3 out of 103693, which is less than .1% of the observation. Therefore, the outliers are removed.

age_outliers<-which(abs(z.scores)>3)
FINAL_DETAILS<-FINAL_DETAILS[-age_outliers,]

Transform

Data transformation is applied on AGE variable, which depicts right skewness as seen in the histogram. We apply the Log transformation (base 10 - log10 or base e - loge), natural logarithm transformation called log base e (ln) and Square root transformation for reducing the right skweness in the data.

#AGE showing RIGHT SKEWED DISTRIBUTION
hist(FINAL_DETAILS$AGE, xlab = "Age", main = "Histogram for Age")

The Log transformation (base 10 - log10 or base e - loge) is used in order to achieve a normal distribution which may be required for further analysis.

log10_age <- log10(FINAL_DETAILS$AGE)
hist(log10_age)

log_age <- log(FINAL_DETAILS$AGE)
hist(log_age)

From this figure, we can say that the log10 transformation appears to be slightly better than the natural log transformation to reduce right skewness. We also check with square root transformation in oder to see the impact on skewness.

sqrt_age <- sqrt(FINAL_DETAILS$AGE)
hist(sqrt_age)

Square root transformation shifted the frequency and did not help in spreading out the values.

From the above histograms, we draw conclusion that log10 transformation worked better in reducing the right skweness of the variable.