#Loading the packages
library(tidyr)
library(lubridate)
library(ggplot2)
library(readr)
library(forecast)
library(dplyr)
library(Hmisc)
library(editrules)
library(outliers)
The ‘Comma Seperated Values (csv)’ files were downloaded from open source data and imported into R to perform required data preprocessing analysis.
In this assignment, 3 dataset has been joined out of 11 datasets available in VicRoads CrashStats. Unnecessary column has been removed from all dataset and also dataset observation has been filtered out to make dataset more manageable. Each attributes are examined carefully and data type conversion is done to approximate classes like character to factor and applied reordering of factor variables. We checked for the tidiness of the dataset and found the dataset is in tidy format. New attributes has been created using mutate() for calculation of variables.
The dataset was then scanned for nan, inf, outliers and missing values. Moreover inconsistencies and obvious error were also checked for variables. An approximated transformation was done which reduces the skewedness of histogram and transformed into a normal distribution, making the analysis simpler.
The dataset were downloaded as ‘Comma Separated Values (csv)’ files from the VicRoads Website. https://discover.data.vic.gov.au/dataset/crash-stats-data-extract
The VicRoads crashstats website provides with all the necessary information to analyze Victorian fatal and injury crash data based on person, vehicle, time, conditions, crash type, road user type, object hit etc. The dataset comprises of total of 11 csv files out of which accident, person and vehicle dataset files are chosen for the assignment analysis and joined based on common variables.
Dataset 1: Accident- provides information about Peron accident, no of people involved and speed zone of the vehicle.(181601 observations,28 variables)
Dataset 2: Person- provides information about person id, person age, sex, Road user type(passenger, driver,motorcyclists) and level of injury to the person.(438142 observations,17 variables)
Dataset 3: Vehicle- provides information about vehicle registration details vehicle type, owner postcode of the vehicle.(323741 observations,38 variables)
All the 3 datasets has been read into R using read_csv() function and saved as ACCIENDT,PEROSN, VEHICLE respectively.
ACCIDENT <- read_csv("C:/Users/soumya/Desktop/csv/ACCIDENT.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## ACCIDENT_NO = col_character(),
## ACCIDENTDATE = col_character(),
## ACCIDENTTIME = col_character(),
## `Accident Type Desc` = col_character(),
## `Day Week Description` = col_character(),
## `DCA Description` = col_character(),
## DIRECTORY = col_character(),
## PAGE = col_character(),
## GRID_REFERENCE_X = col_character(),
## `Light Condition Desc` = col_character(),
## `Road Geometry Desc` = col_character(),
## SPEED_ZONE = col_character()
## )
## See spec(...) for full column specifications.
## Warning: 10 parsing failures.
## row col expected actual file
## 65513 EDITION a double ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 66295 EDITION a double ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 67973 EDITION a double ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 75085 EDITION a double ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## 97633 EDITION a double ED35 'C:/Users/soumya/Desktop/csv/ACCIDENT.csv'
## ..... ....... ........ ...... ..........................................
## See problems(...) for more details.
PERSON <- read_csv("C:/Users/soumya/Desktop/csv/PERSON.csv")
## Parsed with column specification:
## cols(
## ACCIDENT_NO = col_character(),
## PERSON_ID = col_character(),
## VEHICLE_ID = col_character(),
## SEX = col_character(),
## AGE = col_double(),
## `Age Group` = col_character(),
## INJ_LEVEL = col_double(),
## `Inj Level Desc` = col_character(),
## SEATING_POSITION = col_character(),
## HELMET_BELT_WORN = col_double(),
## ROAD_USER_TYPE = col_double(),
## `Road User Type Desc` = col_character(),
## LICENCE_STATE = col_character(),
## PEDEST_MOVEMENT = col_double(),
## POSTCODE = col_double(),
## TAKEN_HOSPITAL = col_character(),
## EJECTED_CODE = col_double()
## )
VEHICLE <- read_csv("C:/Users/soumya/Desktop/csv/VEHICLE.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## VEHICLE_YEAR_MANUF = col_double(),
## VEHICLE_DCA_CODE = col_double(),
## ROAD_SURFACE_TYPE = col_double(),
## VEHICLE_POWER = col_logical(),
## VEHICLE_WEIGHT = col_double(),
## NO_OF_WHEELS = col_double(),
## NO_OF_CYLINDERS = col_double(),
## SEATING_CAPACITY = col_double(),
## TARE_WEIGHT = col_double(),
## TOTAL_NO_OCCUPANTS = col_double(),
## CARRY_CAPACITY = col_double(),
## CUBIC_CAPACITY = col_number(),
## CAUGHT_FIRE = col_double(),
## LAMPS = col_double(),
## LEVEL_OF_DAMAGE = col_double(),
## OWNER_POSTCODE = col_double(),
## TOWED_AWAY_FLAG = col_double()
## )
## See spec(...) for full column specifications.
head(ACCIDENT)
head(PERSON)
head(VEHICLE)
We merged 3 datasets together using merge() function and saved it as VICROAD.The Accident and Person datasets are merged using inner_join based on AcCIDENT_No which is saved as MERGE1. The MERGE1 is then joined with Vehicle dataset based on ACCIDENT_NO and VEHICLE_ID and saved as VICROAD.The final dataset contains 80 variables with 418448 observations totally.
#Merging the three datasets together
MERGE1 <-ACCIDENT%>%inner_join(PERSON, by='ACCIDENT_NO')
VICROAD <-MERGE1%>%inner_join(VEHICLE, by=c("ACCIDENT_NO","VEHICLE_ID"))
head(VICROAD)
#Replcaing the blank spaces in the variables with "_". It will be easy to call up column names.
colnames(VICROAD)
## [1] "ACCIDENT_NO" "ACCIDENTDATE"
## [3] "ACCIDENTTIME" "ACCIDENT_TYPE"
## [5] "Accident Type Desc" "DAY_OF_WEEK"
## [7] "Day Week Description" "DCA_CODE"
## [9] "DCA Description" "DIRECTORY"
## [11] "EDITION" "PAGE"
## [13] "GRID_REFERENCE_X" "GRID_REFERENCE_Y"
## [15] "LIGHT_CONDITION" "Light Condition Desc"
## [17] "NODE_ID" "NO_OF_VEHICLES"
## [19] "NO_PERSONS" "NO_PERSONS_INJ_2"
## [21] "NO_PERSONS_INJ_3" "NO_PERSONS_KILLED"
## [23] "NO_PERSONS_NOT_INJ" "POLICE_ATTEND"
## [25] "ROAD_GEOMETRY" "Road Geometry Desc"
## [27] "SEVERITY" "SPEED_ZONE"
## [29] "PERSON_ID" "VEHICLE_ID"
## [31] "SEX" "AGE"
## [33] "Age Group" "INJ_LEVEL"
## [35] "Inj Level Desc" "SEATING_POSITION"
## [37] "HELMET_BELT_WORN" "ROAD_USER_TYPE"
## [39] "Road User Type Desc" "LICENCE_STATE"
## [41] "PEDEST_MOVEMENT" "POSTCODE"
## [43] "TAKEN_HOSPITAL" "EJECTED_CODE"
## [45] "VEHICLE_YEAR_MANUF" "VEHICLE_DCA_CODE"
## [47] "INITIAL_DIRECTION" "ROAD_SURFACE_TYPE"
## [49] "Road Surface Type Desc" "REG_STATE"
## [51] "VEHICLE_BODY_STYLE" "VEHICLE_MAKE"
## [53] "VEHICLE_MODEL" "VEHICLE_POWER"
## [55] "VEHICLE_TYPE" "Vehicle Type Desc"
## [57] "VEHICLE_WEIGHT" "CONSTRUCTION_TYPE"
## [59] "FUEL_TYPE" "NO_OF_WHEELS"
## [61] "NO_OF_CYLINDERS" "SEATING_CAPACITY"
## [63] "TARE_WEIGHT" "TOTAL_NO_OCCUPANTS"
## [65] "CARRY_CAPACITY" "CUBIC_CAPACITY"
## [67] "FINAL_DIRECTION" "DRIVER_INTENT"
## [69] "VEHICLE_MOVEMENT" "TRAILER_TYPE"
## [71] "VEHICLE_COLOUR_1" "VEHICLE_COLOUR_2"
## [73] "CAUGHT_FIRE" "INITIAL_IMPACT"
## [75] "LAMPS" "LEVEL_OF_DAMAGE"
## [77] "OWNER_POSTCODE" "TOWED_AWAY_FLAG"
## [79] "TRAFFIC_CONTROL" "Traffic Control Desc"
names(VICROAD) <- gsub(" ", "_", names(VICROAD))
colnames(VICROAD)
## [1] "ACCIDENT_NO" "ACCIDENTDATE"
## [3] "ACCIDENTTIME" "ACCIDENT_TYPE"
## [5] "Accident_Type_Desc" "DAY_OF_WEEK"
## [7] "Day_Week_Description" "DCA_CODE"
## [9] "DCA_Description" "DIRECTORY"
## [11] "EDITION" "PAGE"
## [13] "GRID_REFERENCE_X" "GRID_REFERENCE_Y"
## [15] "LIGHT_CONDITION" "Light_Condition_Desc"
## [17] "NODE_ID" "NO_OF_VEHICLES"
## [19] "NO_PERSONS" "NO_PERSONS_INJ_2"
## [21] "NO_PERSONS_INJ_3" "NO_PERSONS_KILLED"
## [23] "NO_PERSONS_NOT_INJ" "POLICE_ATTEND"
## [25] "ROAD_GEOMETRY" "Road_Geometry_Desc"
## [27] "SEVERITY" "SPEED_ZONE"
## [29] "PERSON_ID" "VEHICLE_ID"
## [31] "SEX" "AGE"
## [33] "Age_Group" "INJ_LEVEL"
## [35] "Inj_Level_Desc" "SEATING_POSITION"
## [37] "HELMET_BELT_WORN" "ROAD_USER_TYPE"
## [39] "Road_User_Type_Desc" "LICENCE_STATE"
## [41] "PEDEST_MOVEMENT" "POSTCODE"
## [43] "TAKEN_HOSPITAL" "EJECTED_CODE"
## [45] "VEHICLE_YEAR_MANUF" "VEHICLE_DCA_CODE"
## [47] "INITIAL_DIRECTION" "ROAD_SURFACE_TYPE"
## [49] "Road_Surface_Type_Desc" "REG_STATE"
## [51] "VEHICLE_BODY_STYLE" "VEHICLE_MAKE"
## [53] "VEHICLE_MODEL" "VEHICLE_POWER"
## [55] "VEHICLE_TYPE" "Vehicle_Type_Desc"
## [57] "VEHICLE_WEIGHT" "CONSTRUCTION_TYPE"
## [59] "FUEL_TYPE" "NO_OF_WHEELS"
## [61] "NO_OF_CYLINDERS" "SEATING_CAPACITY"
## [63] "TARE_WEIGHT" "TOTAL_NO_OCCUPANTS"
## [65] "CARRY_CAPACITY" "CUBIC_CAPACITY"
## [67] "FINAL_DIRECTION" "DRIVER_INTENT"
## [69] "VEHICLE_MOVEMENT" "TRAILER_TYPE"
## [71] "VEHICLE_COLOUR_1" "VEHICLE_COLOUR_2"
## [73] "CAUGHT_FIRE" "INITIAL_IMPACT"
## [75] "LAMPS" "LEVEL_OF_DAMAGE"
## [77] "OWNER_POSTCODE" "TOWED_AWAY_FLAG"
## [79] "TRAFFIC_CONTROL" "Traffic_Control_Desc"
The dataset contains 80 variables which is huge after merging the 3 datasets. Selecting few and relevant varibles required from the merged dataset for the analysis. We use select() function which allow us to select reqired variables(column names).
#Selecting few variables for the analysis
FINAL_VIC <- select(VICROAD ,ACCIDENT_NO,ACCIDENTDATE,ACCIDENTTIME,ACCIDENT_TYPE, Accident_Type_Desc,NO_OF_VEHICLES,NO_PERSONS,NO_PERSONS_INJ_2,NO_PERSONS_INJ_3,NO_PERSONS_KILLED,NO_PERSONS_NOT_INJ,Light_Condition_Desc,SPEED_ZONE,PERSON_ID, VEHICLE_ID,SEX,AGE,Age_Group,INJ_LEVEL, Inj_Level_Desc, SEATING_POSITION,HELMET_BELT_WORN,Road_User_Type_Desc,Vehicle_Type_Desc,CARRY_CAPACITY, OWNER_POSTCODE,Traffic_Control_Desc)
dim(FINAL_VIC)
## [1] 418448 27
The FINAL_VIC conatins a numerous observation totall of 418442. We use filter() function to filter out observations based on conditions,where we filtered out based on day for Light_Condition_Desc, drives for Road_User_Type_Desc and finally car for Vehicle_Type_Desc. This is just to make the interpretations work simpler and easy to understand.
The DETAILS contains about 110674 observations.
#To reduce the number of observations
DETAILS <- filter(FINAL_VIC ,Light_Condition_Desc == 'Day',Road_User_Type_Desc == 'Drivers', Vehicle_Type_Desc == 'Car' )
dim(DETAILS)
## [1] 110674 27
The dataset contains about 110674 observations with 27 variables. It contains character and numericals values.The structure of the dataset was examined and the required data type conversions were then carried out to few classes and the final dataset consists of character, numeric, factor and date variables.
str(DETAILS)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 110674 obs. of 27 variables:
## $ ACCIDENT_NO : chr "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
## $ ACCIDENTDATE : chr "13/01/2006" "13/01/2006" "13/01/2006" "14/01/2006" ...
## $ ACCIDENTTIME : chr "12.42.00" "12.42.00" "19.10.00" "11.49.00" ...
## $ ACCIDENT_TYPE : num 1 1 1 1 4 2 4 1 1 1 ...
## $ Accident_Type_Desc : chr "Collision with vehicle" "Collision with vehicle" "Collision with vehicle" "Collision with vehicle" ...
## $ NO_OF_VEHICLES : num 3 3 2 2 1 1 1 3 3 2 ...
## $ NO_PERSONS : num 6 6 4 2 1 2 2 4 4 4 ...
## $ NO_PERSONS_INJ_2 : num 0 0 0 1 1 0 1 1 1 0 ...
## $ NO_PERSONS_INJ_3 : num 1 1 1 0 0 1 0 2 2 1 ...
## $ NO_PERSONS_KILLED : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_PERSONS_NOT_INJ : num 5 5 3 1 0 1 1 1 1 3 ...
## $ Light_Condition_Desc: chr "Day" "Day" "Day" "Day" ...
## $ SPEED_ZONE : chr "060" "060" "070" "080" ...
## $ PERSON_ID : chr "A" "B" "B" "A" ...
## $ VEHICLE_ID : chr "A" "B" "B" "A" ...
## $ SEX : chr "M" "F" "F" "M" ...
## $ AGE : num 72 62 20 43 50 32 31 53 41 51 ...
## $ Age_Group : chr "70+" "60-64" "17-21" "40-49" ...
## $ INJ_LEVEL : num 4 3 3 2 2 4 4 3 4 4 ...
## $ Inj_Level_Desc : chr "Not injured" "Other injury" "Other injury" "Serious injury" ...
## $ SEATING_POSITION : chr "D" "D" "D" "D" ...
## $ HELMET_BELT_WORN : num 1 1 1 1 1 1 9 1 1 1 ...
## $ Road_User_Type_Desc : chr "Drivers" "Drivers" "Drivers" "Drivers" ...
## $ Vehicle_Type_Desc : chr "Car" "Car" "Car" "Car" ...
## $ CARRY_CAPACITY : num 1100 974 1200 800 1450 ...
## $ OWNER_POSTCODE : num 3130 3977 3805 3016 3850 ...
## $ Traffic_Control_Desc: chr "Stop-go lights" "Stop-go lights" "Giveway sign" "Stop-go lights" ...
The ACCIDENTDATE and ACCIDENTTIME are in character format which is been converted into date format uisng dmy() and hms() function respectively. Accident_Type_Desc, Light_Condition_Desc, SEX,Age_Group, Inj_Level_Desc, Road_User_Type_Desc, Vehicle_Type_Desc and Traffic_Control_Desc are in character format which are converted into factors. SPEED_ZONE is converted to numeric from character format.
#Converting to date and time format
DETAILS$ACCIDENTDATE <- dmy(DETAILS$ACCIDENTDATE)
DETAILS$ACCIDENTTIME <- hms(DETAILS$ACCIDENTTIME)
#Converting into factors
DETAILS[ , c(5, 12, 16, 18, 20, 23, 24, 27)]<-lapply(DETAILS[ ,c (5, 12, 16, 18, 20, 23, 24, 27)], factor)
#Converting into numeric
DETAILS$SPEED_ZONE <- as.numeric(DETAILS$SPEED_ZONE)
str(DETAILS)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 110674 obs. of 27 variables:
## $ ACCIDENT_NO : chr "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
## $ ACCIDENTDATE : Date, format: "2006-01-13" "2006-01-13" ...
## $ ACCIDENTTIME :Formal class 'Period' [package "lubridate"] with 6 slots
## .. ..@ .Data : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ year : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ month : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ day : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ hour : num 12 12 19 11 10 11 11 11 11 16 ...
## .. ..@ minute: num 42 42 10 49 55 0 53 50 50 15 ...
## $ ACCIDENT_TYPE : num 1 1 1 1 4 2 4 1 1 1 ...
## $ Accident_Type_Desc : Factor w/ 9 levels "Collision with a fixed object",..: 3 3 3 3 1 8 1 3 3 3 ...
## $ NO_OF_VEHICLES : num 3 3 2 2 1 1 1 3 3 2 ...
## $ NO_PERSONS : num 6 6 4 2 1 2 2 4 4 4 ...
## $ NO_PERSONS_INJ_2 : num 0 0 0 1 1 0 1 1 1 0 ...
## $ NO_PERSONS_INJ_3 : num 1 1 1 0 0 1 0 2 2 1 ...
## $ NO_PERSONS_KILLED : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_PERSONS_NOT_INJ : num 5 5 3 1 0 1 1 1 1 3 ...
## $ Light_Condition_Desc: Factor w/ 1 level "Day": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPEED_ZONE : num 60 60 70 80 100 60 80 80 80 60 ...
## $ PERSON_ID : chr "A" "B" "B" "A" ...
## $ VEHICLE_ID : chr "A" "B" "B" "A" ...
## $ SEX : Factor w/ 3 levels "F","M","U": 2 1 1 2 1 1 2 1 2 1 ...
## $ AGE : num 72 62 20 43 50 32 31 53 41 51 ...
## $ Age_Group : Factor w/ 14 levels "0-4","13-15",..: 13 11 4 8 10 7 7 10 8 10 ...
## $ INJ_LEVEL : num 4 3 3 2 2 4 4 3 4 4 ...
## $ Inj_Level_Desc : Factor w/ 4 levels "Fatality","Not injured",..: 2 3 3 4 4 2 2 3 2 2 ...
## $ SEATING_POSITION : chr "D" "D" "D" "D" ...
## $ HELMET_BELT_WORN : num 1 1 1 1 1 1 9 1 1 1 ...
## $ Road_User_Type_Desc : Factor w/ 1 level "Drivers": 1 1 1 1 1 1 1 1 1 1 ...
## $ Vehicle_Type_Desc : Factor w/ 1 level "Car": 1 1 1 1 1 1 1 1 1 1 ...
## $ CARRY_CAPACITY : num 1100 974 1200 800 1450 ...
## $ OWNER_POSTCODE : num 3130 3977 3805 3016 3850 ...
## $ Traffic_Control_Desc: Factor w/ 17 levels "Flashing lights",..: 15 15 2 15 3 3 3 15 15 17 ...
checking the level of variables like SEX, Age_Group, Inj_Level_Desc using levels() function to get the order of arrangement. We have used factor() function to re-arrange in to ordered format. Also, renamed the SEX column names for better understanding.
#Levels for the factors variable
levels(DETAILS$SEX)
## [1] "F" "M" "U"
levels(DETAILS$Age_Group)
## [1] "0-4" "13-15" "16-17" "17-21" "22-25" "26-29" "30-39"
## [8] "40-49" "5-12" "50-59" "60-64" "64-69" "70+" "unknown"
levels(DETAILS$Inj_Level_Desc)
## [1] "Fatality" "Not injured" "Other injury" "Serious injury"
#Re-naming label and ordering the levels for factor variables
DETAILS$SEX <- factor(DETAILS$SEX, levels = c('F','M','U'), labels= c('Female','Male', 'Unknown'))
DETAILS$Age_Group<-factor(DETAILS$Age_Group, levels = c('0-4', '5-12', '13-15', '16-17', '17-21', '22-25', '26-29', '30-39', '40-49', '50-59', '60-64', '64-69', '70+', 'unknown'), labels=c('0-4 yrs', '5-12 yrs', '13-15 yrs', '16-17 yrs', '17-21 yrs', '22-25 yrs', '26-29 yrs', '30-39 yrs', '40-49 yrs', '50-59 yrs', '60-64 yrs', '64-69 yrs', '70 yrs or older', 'Unknown'))
DETAILS$Inj_Level_Desc <- factor(DETAILS$Inj_Level_Desc, levels = c("Fatality","Not injured","Other injury","Serious injury"), labels= c("Not injured","Other injury","Serious injury","Fatality"))
str(DETAILS)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 110674 obs. of 27 variables:
## $ ACCIDENT_NO : chr "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
## $ ACCIDENTDATE : Date, format: "2006-01-13" "2006-01-13" ...
## $ ACCIDENTTIME :Formal class 'Period' [package "lubridate"] with 6 slots
## .. ..@ .Data : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ year : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ month : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ day : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ hour : num 12 12 19 11 10 11 11 11 11 16 ...
## .. ..@ minute: num 42 42 10 49 55 0 53 50 50 15 ...
## $ ACCIDENT_TYPE : num 1 1 1 1 4 2 4 1 1 1 ...
## $ Accident_Type_Desc : Factor w/ 9 levels "Collision with a fixed object",..: 3 3 3 3 1 8 1 3 3 3 ...
## $ NO_OF_VEHICLES : num 3 3 2 2 1 1 1 3 3 2 ...
## $ NO_PERSONS : num 6 6 4 2 1 2 2 4 4 4 ...
## $ NO_PERSONS_INJ_2 : num 0 0 0 1 1 0 1 1 1 0 ...
## $ NO_PERSONS_INJ_3 : num 1 1 1 0 0 1 0 2 2 1 ...
## $ NO_PERSONS_KILLED : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_PERSONS_NOT_INJ : num 5 5 3 1 0 1 1 1 1 3 ...
## $ Light_Condition_Desc: Factor w/ 1 level "Day": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPEED_ZONE : num 60 60 70 80 100 60 80 80 80 60 ...
## $ PERSON_ID : chr "A" "B" "B" "A" ...
## $ VEHICLE_ID : chr "A" "B" "B" "A" ...
## $ SEX : Factor w/ 3 levels "Female","Male",..: 2 1 1 2 1 1 2 1 2 1 ...
## $ AGE : num 72 62 20 43 50 32 31 53 41 51 ...
## $ Age_Group : Factor w/ 14 levels "0-4 yrs","5-12 yrs",..: 13 11 5 9 10 8 8 10 9 10 ...
## $ INJ_LEVEL : num 4 3 3 2 2 4 4 3 4 4 ...
## $ Inj_Level_Desc : Factor w/ 4 levels "Not injured",..: 2 3 3 4 4 2 2 3 2 2 ...
## $ SEATING_POSITION : chr "D" "D" "D" "D" ...
## $ HELMET_BELT_WORN : num 1 1 1 1 1 1 9 1 1 1 ...
## $ Road_User_Type_Desc : Factor w/ 1 level "Drivers": 1 1 1 1 1 1 1 1 1 1 ...
## $ Vehicle_Type_Desc : Factor w/ 1 level "Car": 1 1 1 1 1 1 1 1 1 1 ...
## $ CARRY_CAPACITY : num 1100 974 1200 800 1450 ...
## $ OWNER_POSTCODE : num 3130 3977 3805 3016 3850 ...
## $ Traffic_Control_Desc: Factor w/ 17 levels "Flashing lights",..: 15 15 2 15 3 3 3 15 15 17 ...
To check whether the data is in tidy format, dataset must conform certain principles which are: 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell.
The dataset is in tidy format since it has correct datatypes for all the columns and conform to tidy principle.
head(DETAILS,15)
Creating new varliables from the existing variables. The function mutate() is used to create a new variable by grouping based on ACCIDENT_NO using group_by() function. The new variables are the percentage of people killed, not injured which is calculated by NO_PERSONS_KILLED divided by NO_PERSONS and multiplied by 100.
#Calculating percentage of people injured
DETAILS<-DETAILS %>% group_by(ACCIDENT_NO) %>%
mutate(PERC_FATAL=NO_PERSONS_KILLED/NO_PERSONS*100,
PERC_INJ_2=NO_PERSONS_INJ_2/NO_PERSONS*100,
PERC_INJ_3=NO_PERSONS_INJ_3/NO_PERSONS*100,
PERC_NOT_INJ=NO_PERSONS_NOT_INJ/NO_PERSONS*100)
head(DETAILS)
Firstly, the total number of missing values was calculated. Secondly, the sum of missing values for each column was calculated. We found that AGE, CARRY_CAPACITY and OWNER_POSTCODE have missing values of 3367, 94096, 2851 respectively.
#Checking for na values
sum(is.na(DETAILS))
## [1] 100314
colSums(is.na(DETAILS))
## ACCIDENT_NO ACCIDENTDATE ACCIDENTTIME
## 0 0 0
## ACCIDENT_TYPE Accident_Type_Desc NO_OF_VEHICLES
## 0 0 0
## NO_PERSONS NO_PERSONS_INJ_2 NO_PERSONS_INJ_3
## 0 0 0
## NO_PERSONS_KILLED NO_PERSONS_NOT_INJ Light_Condition_Desc
## 0 0 0
## SPEED_ZONE PERSON_ID VEHICLE_ID
## 0 0 0
## SEX AGE Age_Group
## 0 3367 0
## INJ_LEVEL Inj_Level_Desc SEATING_POSITION
## 0 0 0
## HELMET_BELT_WORN Road_User_Type_Desc Vehicle_Type_Desc
## 0 0 0
## CARRY_CAPACITY OWNER_POSTCODE Traffic_Control_Desc
## 94096 2851 0
## PERC_FATAL PERC_INJ_2 PERC_INJ_3
## 0 0 0
## PERC_NOT_INJ
## 0
AGE is numeric variable which has missing values, it can be replaced by mean or medain of the age column. Here we replaced with mean value(42.07253).
CARRY_CAPACITY is also numerical, which can be replaced wih mean or median values. Replacing it with mean of 977.2936.
OWNER_POSTCODE cannot be replaced by mean or median, since each owner has unique postcode and also the missing values are in small percentage, it seemed appropriate to remove these observations from the dataset.
#Replacing the missing values with the mean
mean(DETAILS$AGE,na.rm = TRUE)
## [1] 42.07253
DETAILS$AGE[is.na(DETAILS$AGE)] <- mean(DETAILS$AGE, na.rm = TRUE)
mean(DETAILS$CARRY_CAPACITY,na.rm = TRUE)
## [1] 977.2936
DETAILS$CARRY_CAPACITY[is.na(DETAILS$CARRY_CAPACITY)] <- mean(DETAILS$CARRY_CAPACITY, na.rm = TRUE)
FINAL_DETAILS<- DETAILS[!is.na(DETAILS$OWNER_POSTCODE),]
colSums(is.na(FINAL_DETAILS))
## ACCIDENT_NO ACCIDENTDATE ACCIDENTTIME
## 0 0 0
## ACCIDENT_TYPE Accident_Type_Desc NO_OF_VEHICLES
## 0 0 0
## NO_PERSONS NO_PERSONS_INJ_2 NO_PERSONS_INJ_3
## 0 0 0
## NO_PERSONS_KILLED NO_PERSONS_NOT_INJ Light_Condition_Desc
## 0 0 0
## SPEED_ZONE PERSON_ID VEHICLE_ID
## 0 0 0
## SEX AGE Age_Group
## 0 0 0
## INJ_LEVEL Inj_Level_Desc SEATING_POSITION
## 0 0 0
## HELMET_BELT_WORN Road_User_Type_Desc Vehicle_Type_Desc
## 0 0 0
## CARRY_CAPACITY OWNER_POSTCODE Traffic_Control_Desc
## 0 0 0
## PERC_FATAL PERC_INJ_2 PERC_INJ_3
## 0 0 0
## PERC_NOT_INJ
## 0
#No missing values found
sum(is.na(FINAL_DETAILS))
## [1] 0
Thirdly, checking for special values such as NaN and Inf values in the dataset. The dataset didn’t contain any such values.
#Finding Inf and NaN values
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(FINAL_DETAILS, function(x) sum( is.special(x) ))
## ACCIDENT_NO ACCIDENTDATE ACCIDENTTIME
## 0 0 0
## ACCIDENT_TYPE Accident_Type_Desc NO_OF_VEHICLES
## 0 0 0
## NO_PERSONS NO_PERSONS_INJ_2 NO_PERSONS_INJ_3
## 0 0 0
## NO_PERSONS_KILLED NO_PERSONS_NOT_INJ Light_Condition_Desc
## 0 0 0
## SPEED_ZONE PERSON_ID VEHICLE_ID
## 0 0 0
## SEX AGE Age_Group
## 0 0 0
## INJ_LEVEL Inj_Level_Desc SEATING_POSITION
## 0 0 0
## HELMET_BELT_WORN Road_User_Type_Desc Vehicle_Type_Desc
## 0 0 0
## CARRY_CAPACITY OWNER_POSTCODE Traffic_Control_Desc
## 0 0 0
## PERC_FATAL PERC_INJ_2 PERC_INJ_3
## 0 0 0
## PERC_NOT_INJ
## 0
Finally, checking for inconsistencies and obvious error in the dataset, age cannot be negative or greater than 120 which is specified as condition and found that there exists no such observations in the dataset.
#Setting rule for age
(error <- editset(c("AGE >= 0", "AGE <= 120")))
##
## Edit set:
## num1 : 0 <= AGE
## num2 : AGE <= 120
age_error <- violatedEdits(error, FINAL_DETAILS)
summary(age_error)
## No violations detected, 0 checks evaluated to NA
## NULL
In this step, all the numerical variables are scanned to detect outliers. The box plot for SPEED_ZONE shows few outliers which are exceeding the SPEED_ZONE values from normal.Therefore, the outliers are removed.
#checking outliers
boxplot(FINAL_DETAILS$SPEED_ZONE,
main = "SPEED_ZONE PLOT",ylab = "SPEED_ZONE",col = "orange")
#Removing Outliers
speed_outliers <- boxplot(FINAL_DETAILS$SPEED_ZONE, plot=FALSE)$out
FINAL_DETAILS <- FINAL_DETAILS[-which(FINAL_DETAILS$SPEED_ZONE %in% speed_outliers),]
boxplot(FINAL_DETAILS$SPEED_ZONE,
main = "SPEED_ZONE PLOT",ylab = "SPEED_ZONE",col = "orange")
In the above box plot, the outliers are removed that exceeded the actual SPEED_ZONE. The box-plot for CARRY_CAPACITY shows number of outliers.
boxplot(FINAL_DETAILS$CARRY_CAPACITY,
main = "CARRY_CAPACITY-BOX PLOT",ylab = "CARRY_CAPACITY",col = "grey")
The AGE variable also contains outliers as shown in the box plot. All the outliers present exists in the age group of 70+. The approach used to remove the outliers is using z-score.The same method as AGE is been implemented for CARRY_CAPACITY to remove the outliers.
boxplot(FINAL_DETAILS$AGE~FINAL_DETAILS$Age_Group)
Tukey’s method of outlier detection - is used to detect the outliers, which are defiened as the values which are behond the range of -1.5IQR to 1.5IQR limit.
From the summary() function, the minimum z score is -1.8444, maximum z score is 3.3308. In this step, we are using which() function to get the location of z score whoes values are above 3.
z.scores <- FINAL_DETAILS$AGE %>% scores(type = "z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8379 -0.8679 -0.1672 0.0000 0.6951 3.3358
dim(FINAL_DETAILS)
## [1] 103693 31
length(which( abs(z.scores) >3 ))
## [1] 3
There are only 3 out of 103693, which is less than .1% of the observation. Therefore, the outliers are removed.
age_outliers<-which(abs(z.scores)>3)
FINAL_DETAILS<-FINAL_DETAILS[-age_outliers,]
Data transformation is applied on AGE variable, which depicts right skewness as seen in the histogram. We apply the Log transformation (base 10 - log10 or base e - loge), natural logarithm transformation called log base e (ln) and Square root transformation for reducing the right skweness in the data.
#AGE showing RIGHT SKEWED DISTRIBUTION
hist(FINAL_DETAILS$AGE, xlab = "Age", main = "Histogram for Age")
The Log transformation (base 10 - log10 or base e - loge) is used in order to achieve a normal distribution which may be required for further analysis.
log10_age <- log10(FINAL_DETAILS$AGE)
hist(log10_age)
log_age <- log(FINAL_DETAILS$AGE)
hist(log_age)
From this figure, we can say that the log10 transformation appears to be slightly better than the natural log transformation to reduce right skewness. We also check with square root transformation in oder to see the impact on skewness.
sqrt_age <- sqrt(FINAL_DETAILS$AGE)
hist(sqrt_age)
Square root transformation shifted the frequency and did not help in spreading out the values.
From the above histograms, we draw conclusion that log10 transformation worked better in reducing the right skweness of the variable.