This section import the required packages
# Loading all required packages
library(readr)
library(dplyr)
library(ggplot2)
library(forecast)
library(tidyr)
library(lubridate)
This report uses different preprocessing techniques to process an open-source dataset. the dataset was chosen in sourced from CrashStats data provided by VicRoads.The source contained eleven datasets, all of which could be joined to produce a single data set but processing such a huge dataset would be beyond our computational power so two suitable data sets were chosen and joined using one primary key column. The data sets were joined using Inner Join.
To examine the information all the more effectively, the structure of the dataset was analyzed. It distinguished numerical and character class types. Following this, a few factors were changed over into factors and levels were applied to give fitting names. The merged data set was checked to see if it satisfies the tidy data. Using the existing variable in the dataset, new variables were created. Apart from this, missing values were removed and outliers were handled using the boxplot method and capping function. During the process of removing the outliers, it was found that the AGE variable is skewed to the right and therefore, suitable techniques were applied to transform the variable to get rid of the skewness.
This assignment will focus on the CrashStats data provided by VicRoads.It contains statistics on road traffic accidents that were reported to the police. To briefly introduce the data, it contains records of accidents that satisfy the following conditions :
Preliminary analysis showed eleven separate datasets, all of which could be joined together using a primary key column but that would require extensive processing power since data will be huge thus we handpicked two sets, namely Accident.csv(basic accident details, time, severity, location) and Person.csv(person based details, age, sex, etc).
Statistical comparisons of selected attributes between two different datasets will provide information on each person involved in the accident from the Person dataset with corresponding accident information from the Accident dataset. This means that each row contains data of a person involved in an accident with details joined via accident number which acts as the primary key.
Link to Dataset : https://discover.data.vic.gov.au/dataset/crash-stats-data-extract/resource/f9c7c05a-19e9-4593-aa2c-960a9c97b858
Further information on practical application of dataset can be found on: http://data.vicroads.vic.gov.au/metadata/crashstats_user_guide_and_appendices.pdf
Import Summary
#Reading the first dataset 'ACCIDENT.csv'
ACCIDENT <- read_csv("~/Downloads/2000 to 2005 ACCIDENT/ACCIDENT.csv",
col_types = cols(ACCIDENT_NO = col_character(),
`DCA Description` = col_skip(), DCA_CODE = col_skip(),
EDITION = col_skip(), GRID_REFERENCE_X = col_skip(),
GRID_REFERENCE_Y = col_skip(), NODE_ID = col_skip(),
PAGE = col_skip(), ROAD_GEOMETRY = col_skip(),
`Road Geometry Desc` = col_skip()))
#Previewing first few rows to check the sample from the complete data
head(ACCIDENT)
#Checking dimensions of dataset for validation.
dim(ACCIDENT)
## [1] 103061 19
Import Summary:
#Reading the second dataset 'PERSON.csv'
PERSON <- read_csv("~/Downloads/2000 to 2005 ACCIDENT/PERSON.csv",
col_types = cols(ACCIDENT_NO = col_character(),
HELMET_BELT_WORN = col_skip(), LICENCE_STATE = col_skip(),
POSTCODE = col_skip()))
#Previewing first few rows to check the sample from the complete data
head(PERSON)
#Checking dimensions of dataset for validation.
dim(PERSON)
## [1] 269045 14
#Using inner join to combine two data set using common records on ACCIDENT_NO column
mergedData <-ACCIDENT %>% inner_join(PERSON, by='ACCIDENT_NO')
#Checking dimensions of merged dataset.
dim(mergedData)
## [1] 269045 32
#Previewing joined dataset for validation.
head(mergedData)
colnames(mergedData)
## [1] "ACCIDENT_NO" "ACCIDENTDATE" "ACCIDENTTIME"
## [4] "ACCIDENT_TYPE" "Accident Type Desc" "DAY_OF_WEEK"
## [7] "Day Week Description" "DIRECTORY" "LIGHT_CONDITION"
## [10] "Light Condition Desc" "NO_OF_VEHICLES" "NO_PERSONS"
## [13] "NO_PERSONS_INJ_2" "NO_PERSONS_INJ_3" "NO_PERSONS_KILLED"
## [16] "NO_PERSONS_NOT_INJ" "POLICE_ATTEND" "SEVERITY"
## [19] "SPEED_ZONE" "PERSON_ID" "VEHICLE_ID"
## [22] "SEX" "AGE" "Age Group"
## [25] "INJ_LEVEL" "Inj Level Desc" "SEATING_POSITION"
## [28] "ROAD_USER_TYPE" "Road User Type Desc" "PEDEST_MOVEMENT"
## [31] "TAKEN_HOSPITAL" "EJECTED_CODE"
#The columns names in the dataset was found with spaces in between which was fixed using underscore to avoid quotation marks when calling these columns.
names(mergedData) <- gsub(" ", "_", names(mergedData))
#Renaming few column names to make it easier to understand them.
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_INJ_2"] <- "Serious_Injury"
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_INJ_3"] <- "Other_Injury"
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_NOT_INJ"] <- "No_Injury"
After joining the merged dataset, we used the str() function to check the data types of different variables. Based on the output, we realised that most of the varibales were either in character or integer variables. However, almost all the variables required data conversion either to factors or date or integer.Factors were further examined for labelling and ordering.
Merged Dataset Structure
#checking structure of joined dataset
str(mergedData)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 269045 obs. of 32 variables:
## $ ACCIDENT_NO : chr "12000000389" "12000000389" "12000000554" "12000000555" ...
## $ ACCIDENTDATE : chr "4/01/2000" "4/01/2000" "6/01/2000" "1/01/2000" ...
## $ ACCIDENTTIME : chr "22.30.00" "22.30.00" "22.15.00" "18.00.00" ...
## $ ACCIDENT_TYPE : num 4 4 4 1 1 1 1 4 4 1 ...
## $ Accident_Type_Desc : chr "Collision with a fixed object" "Collision with a fixed object" "Collision with a fixed object" "Collision with vehicle" ...
## $ DAY_OF_WEEK : num 3 3 5 7 7 7 7 1 1 2 ...
## $ Day_Week_Description: chr "Tuesday" "Tuesday" "Thursday" "Saturday" ...
## $ DIRECTORY : chr "MEL" "MEL" "MEL" "MEL" ...
## $ LIGHT_CONDITION : num 5 5 3 1 1 1 1 3 3 3 ...
## $ Light_Condition_Desc: chr "Dark No street lights" "Dark No street lights" "Dark Street lights on" "Day" ...
## $ NO_OF_VEHICLES : num 1 1 1 4 4 4 4 1 1 3 ...
## $ NO_PERSONS : num 2 2 1 4 4 4 4 2 2 3 ...
## $ Serious_Injury : num 1 1 1 1 1 1 1 2 2 0 ...
## $ Other_Injury : num 0 0 0 1 1 1 1 0 0 2 ...
## $ NO_PERSONS_KILLED : num 1 1 0 0 0 0 0 0 0 0 ...
## $ No_Injury : num 0 0 0 2 2 2 2 0 0 1 ...
## $ POLICE_ATTEND : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SEVERITY : num 1 1 2 2 2 2 2 2 2 3 ...
## $ SPEED_ZONE : chr "090" "090" "060" "060" ...
## $ PERSON_ID : chr "01" "A" "A" "A" ...
## $ VEHICLE_ID : chr "A" "A" "A" "A" ...
## $ SEX : chr "M" "M" "M" "M" ...
## $ AGE : num 16 19 31 27 32 42 25 18 18 31 ...
## $ Age_Group : chr "16-17" "17-21" "30-39" "26-29" ...
## $ INJ_LEVEL : num 2 1 2 2 3 4 4 2 2 3 ...
## $ Inj_Level_Desc : chr "Serious injury" "Fatality" "Serious injury" "Serious injury" ...
## $ SEATING_POSITION : chr "LF" "D" "D" "D" ...
## $ ROAD_USER_TYPE : num 3 2 2 2 2 2 2 3 2 2 ...
## $ Road_User_Type_Desc : chr "Passengers" "Drivers" "Drivers" "Drivers" ...
## $ PEDEST_MOVEMENT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TAKEN_HOSPITAL : chr "Y" NA "Y" "Y" ...
## $ EJECTED_CODE : num 0 0 0 0 0 0 0 0 0 0 ...
Data Type Conversion
#Using mutate function to convert the variables like 'Day Week Description','Light Condition Desc''sex' etc into factors to facilitate
mergedData <- mutate_at(mergedData, vars("Day_Week_Description", "DIRECTORY","Light_Condition_Desc","SEX","Age_Group","Inj_Level_Desc","SEATING_POSITION","TAKEN_HOSPITAL","SPEED_ZONE"), as.factor)
#Using mutate function to convert the variables like 'NO_PERSONS','NO_PERSONS_NOT_INJ''EJECTED_CODE' etc into integers to facilitate
mergedData <- mutate_at(mergedData, vars("NO_OF_VEHICLES","NO_PERSONS","Serious_Injury","Other_Injury","NO_PERSONS_KILLED","No_Injury","POLICE_ATTEND","SEVERITY","AGE","INJ_LEVEL","ROAD_USER_TYPE","PEDEST_MOVEMENT","EJECTED_CODE"), as.integer)
#Using dmy function to convert ACCIDENTDATE into year-month-day format
mergedData$ACCIDENTDATE <- ymd(mergedData$ACCIDENTDATE)
## Warning: All formats failed to parse. No formats found.
#Using hms function to convert ACCIDENTTIME into time format for storing time-of-day values
mergedData$ACCIDENTTIME <- hms(mergedData$ACCIDENTTIME)
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
#Checking levels for factor type variables
levels(mergedData$Day_Week_Description)
## [1] "Friday" "Monday" "Saturday" "Sunday" "Thursday" "Tuesday"
## [7] "Wednesday"
levels(mergedData$SPEED_ZONE)
## [1] "040" "050" "060" "070" "075" "080" "090" "100" "110" "777" "888"
## [12] "999"
levels(mergedData$SEX)
## [1] "F" "M" "U"
levels(mergedData$Age_Group)
## [1] "0-4" "13-15" "16-17" "17-21" "22-25" "26-29" "30-39"
## [8] "40-49" "5-12" "50-59" "60-64" "64-69" "70+" "unknown"
#Ordering the factor values inside Day_Week_description
mergedData$Day_Week_Description <- factor(mergedData$Day_Week_Description, levels = c("Monday","Tuesday","Wednesday","Thrusday","Friday","Saturday","Sunday"), ordered = TRUE)
#Ordering and labeling data with more informative label
#note: labeling outliers(777,888,999) as unknown
mergedData$SPEED_ZONE<- factor(mergedData$SPEED_ZONE, levels= c('030', '040', '050', '060', '070', '075', '080', '090', '100', '110', '777', '888', '999'), labels=c('30KM/HR', '40KM/HR', '50KM/HR', '60KM/HR', '70KM/HR', '75KM/HR', '80KM/HR', '90KM/HR', '100KM/HR', '110KM/HR','Unknown','Unknown', 'Unknown'),ordered = TRUE)
mergedData$SEX <- factor(mergedData$SEX, levels = c('F','M','U'), labels = c("Female",'Male','Unknown'))
mergedData$Age_Group<-factor(mergedData$Age_Group, levels = c('0-4', '5-12', '13-15', '16-17', '17-21', '22-25', '26-29', '30-39', '40-49', '50-59', '60-64', '64-69', '70+', 'unknown'), labels=c('0-4 Years', '5-12 Years', '13-15 Years', '16-17 Years', '17-21 Years', '22-25 Years', '26-29 Years', '30-39 Years', '40-49 Years', '50-59 Years', '60-64 Years', '64-69 Years', '70 Years+', 'Unknown'))
#Validating the changes applied to the attributes of the merged dataset
str(mergedData)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 269045 obs. of 32 variables:
## $ ACCIDENT_NO : chr "12000000389" "12000000389" "12000000554" "12000000555" ...
## $ ACCIDENTDATE : Date, format: NA NA ...
## $ ACCIDENTTIME :Formal class 'Period' [package "lubridate"] with 6 slots
## .. ..@ .Data : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ year : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ month : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ day : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ hour : num 22 22 22 18 18 18 18 1 1 23 ...
## .. ..@ minute: num 30 30 15 0 0 0 0 0 0 50 ...
## $ ACCIDENT_TYPE : num 4 4 4 1 1 1 1 4 4 1 ...
## $ Accident_Type_Desc : chr "Collision with a fixed object" "Collision with a fixed object" "Collision with a fixed object" "Collision with vehicle" ...
## $ DAY_OF_WEEK : num 3 3 5 7 7 7 7 1 1 2 ...
## $ Day_Week_Description: Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 2 2 NA 6 6 6 6 7 7 1 ...
## $ DIRECTORY : Factor w/ 2 levels "MEL","VCS": 1 1 1 1 1 1 1 1 1 1 ...
## $ LIGHT_CONDITION : num 5 5 3 1 1 1 1 3 3 3 ...
## $ Light_Condition_Desc: Factor w/ 7 levels "Dark No street lights",..: 1 1 3 5 5 5 5 3 3 3 ...
## $ NO_OF_VEHICLES : int 1 1 1 4 4 4 4 1 1 3 ...
## $ NO_PERSONS : int 2 2 1 4 4 4 4 2 2 3 ...
## $ Serious_Injury : int 1 1 1 1 1 1 1 2 2 0 ...
## $ Other_Injury : int 0 0 0 1 1 1 1 0 0 2 ...
## $ NO_PERSONS_KILLED : int 1 1 0 0 0 0 0 0 0 0 ...
## $ No_Injury : int 0 0 0 2 2 2 2 0 0 1 ...
## $ POLICE_ATTEND : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEVERITY : int 1 1 2 2 2 2 2 2 2 3 ...
## $ SPEED_ZONE : Ord.factor w/ 11 levels "30KM/HR"<"40KM/HR"<..: 8 8 4 4 4 4 4 4 4 5 ...
## $ PERSON_ID : chr "01" "A" "A" "A" ...
## $ VEHICLE_ID : chr "A" "A" "A" "A" ...
## $ SEX : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 2 2 1 2 1 ...
## $ AGE : int 16 19 31 27 32 42 25 18 18 31 ...
## $ Age_Group : Factor w/ 14 levels "0-4 Years","5-12 Years",..: 4 5 8 7 8 9 6 5 5 8 ...
## $ INJ_LEVEL : int 2 1 2 2 3 4 4 2 2 3 ...
## $ Inj_Level_Desc : Factor w/ 4 levels "Fatality","Not injured",..: 4 1 4 4 3 2 2 4 4 3 ...
## $ SEATING_POSITION : Factor w/ 10 levels "CF","CR","D",..: 4 3 3 3 3 3 3 4 3 3 ...
## $ ROAD_USER_TYPE : int 3 2 2 2 2 2 2 3 2 2 ...
## $ Road_User_Type_Desc : chr "Passengers" "Drivers" "Drivers" "Drivers" ...
## $ PEDEST_MOVEMENT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TAKEN_HOSPITAL : Factor w/ 2 levels "N","Y": 2 NA 2 2 1 NA NA 2 2 NA ...
## $ EJECTED_CODE : int 0 0 0 0 0 0 0 0 0 0 ...
Tidy Data Principles:
let’s look into this more closely by using an example
mergedData %>% select(ACCIDENT_NO,PERSON_ID,Serious_Injury,Other_Injury,No_Injury,INJ_LEVEL,Inj_Level_Desc) %>% filter(ACCIDENT_NO == '12000000389')
Now, let’s look into the second and third tidy data principle :
Each observation must have its own row : Yes, each observation row is unique for a person(Person_ID) involved in a particular accident(Accident_No)
Each value must have its own cell : Yes, each value has its own cell. However, if required, values in variables like data & time column could be separated into different cells. Though, We choose to keep them in a single cell for now.
In this section, we have created new variables using the existing data using function mutate().
#Creating two new variables "Person_Affected_Percent" and "Average_Age"
mergedData <- mergedData %>% group_by(ACCIDENT_NO) %>% mutate(PERSON_AFFECTED_PERCENT = (Serious_Injury +Other_Injury+NO_PERSONS_KILLED)/NO_PERSONS*100, AVERAGE_AGE = mean(AGE,na.rm = TRUE))
#Created new varibale to present the output by selecting variables of interest grouped by accident No
resultedData <- mergedData %>% select(ACCIDENT_NO,PERSON_ID,PERSON_AFFECTED_PERCENT,AVERAGE_AGE) %>% group_by(ACCIDENT_NO)
#Checking the result
head(resultedData)
In this section, we scanned for the missing values present in the merged dataset. Before any processing, we made few rules to deal with the missing values
#Checking the number of missing values column wise to have a overall look
colSums(is.na(mergedData))
## ACCIDENT_NO ACCIDENTDATE ACCIDENTTIME
## 0 269045 380
## ACCIDENT_TYPE Accident_Type_Desc DAY_OF_WEEK
## 0 0 0
## Day_Week_Description DIRECTORY LIGHT_CONDITION
## 39444 333 0
## Light_Condition_Desc NO_OF_VEHICLES NO_PERSONS
## 0 0 0
## Serious_Injury Other_Injury NO_PERSONS_KILLED
## 0 0 0
## No_Injury POLICE_ATTEND SEVERITY
## 0 0 0
## SPEED_ZONE PERSON_ID VEHICLE_ID
## 0 0 10405
## SEX AGE Age_Group
## 0 16514 0
## INJ_LEVEL Inj_Level_Desc SEATING_POSITION
## 0 0 10425
## ROAD_USER_TYPE Road_User_Type_Desc PEDEST_MOVEMENT
## 0 0 2
## TAKEN_HOSPITAL EJECTED_CODE PERSON_AFFECTED_PERCENT
## 182656 921 0
## AVERAGE_AGE
## 243
#Based on results from colSums() function, it was found that variable like ACCIDENTTIME, DIRECTORY,VEHICLE_ID,EJECTED_CODE,AVERAGE_AGE,SEATING_POSITION, PEDEST_MOVEMENT, AGE have missing values less than 10%.
#Therefore, based on the Rule 1, missing values were removed from the dataset
mergedData <- mergedData[!(is.na(mergedData$ACCIDENTTIME)),]
mergedData <- mergedData[!(is.na(mergedData$DIRECTORY)),]
mergedData <- mergedData[!(is.na(mergedData$VEHICLE_ID)),]
mergedData <- mergedData[!(is.na(mergedData$EJECTED_CODE)),]
mergedData <- mergedData[!(is.na(mergedData$AVERAGE_AGE)),]
mergedData <- mergedData[!(is.na(mergedData$SEATING_POSITION)),]
mergedData <- mergedData[!(is.na(mergedData$PEDEST_MOVEMENT)),]
mergedData <- mergedData[!(is.na(mergedData$AGE)),]
#Checking the dataset after missing values removal
colSums(is.na(mergedData))
## ACCIDENT_NO ACCIDENTDATE ACCIDENTTIME
## 0 241241 0
## ACCIDENT_TYPE Accident_Type_Desc DAY_OF_WEEK
## 0 0 0
## Day_Week_Description DIRECTORY LIGHT_CONDITION
## 35251 0 0
## Light_Condition_Desc NO_OF_VEHICLES NO_PERSONS
## 0 0 0
## Serious_Injury Other_Injury NO_PERSONS_KILLED
## 0 0 0
## No_Injury POLICE_ATTEND SEVERITY
## 0 0 0
## SPEED_ZONE PERSON_ID VEHICLE_ID
## 0 0 0
## SEX AGE Age_Group
## 0 0 0
## INJ_LEVEL Inj_Level_Desc SEATING_POSITION
## 0 0 0
## ROAD_USER_TYPE Road_User_Type_Desc PEDEST_MOVEMENT
## 0 0 0
## TAKEN_HOSPITAL EJECTED_CODE PERSON_AFFECTED_PERCENT
## 164162 0 0
## AVERAGE_AGE
## 0
In this section, numeric variables were checked in our merged data to scan for the outliers and fixed them with suitable techniques.
To start with the process,
#Histogram to check the distribution of AGE variable
hist(mergedData$AGE ,col = "gold", xlab = "Age", ylab = "Number of Individuals", main = "Histogram of Age of the People involved in Accidents")
abline(v= mean(mergedData$AGE), col="red", lwd=2, lty=2)
abline(v=median(mergedData$AGE), col='blue', lwd=2, lty=2)
#Boxplot to detect the outliers
mergedData$AGE %>% boxplot(main = "Detecting Outliers in Age variable", ylab = "Age", col = "gold")
#Checking the summary before outlier removal
summary(mergedData$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 21.00 31.00 34.23 45.00 99.00
#Defining Capping function
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
#Applying the defined capping function on the data to fix the outliers
mergedData$AGE <- mergedData$AGE %>% cap()
#Getting the updated summary of the data
summary(mergedData$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 21.00 31.00 34.08 45.00 81.00
#Boxplot to check the data after outlier removal
mergedData$AGE %>% boxplot(main = "Age Variable After Outliers Removal", ylab = "Age", col = "lightBlue")
As from the last section, we discovered that the AGE variable in our dataset is skewed to the right and therefore, in this section different transformation techniques were applied to get rid of the skewness.
#Firstly, the log 10 transformation was applied as it's commonly used for reducing right skewness
log_age <- log10(mergedData$AGE)
hist(log_age)
#Secondly, natural logarithm was applied to check whether it's generating better result than log 10
ln_age <- log(mergedData$AGE)
hist(ln_age)
#Lastly, square root transformation was used for reducing right skewness
sqrt_age <- sqrt(mergedData$AGE)
hist(sqrt_age)
Based on the generated result, square root transformation performed the best. Though, the trasnformation didn’t result in complete ridness to the skewness.The square root transformation is still slightly skewed to the left.