MATH2349 Semester 2, 2019

Required packages

This section import the required packages

# Loading all required packages

library(readr)
library(dplyr)
library(ggplot2)
library(forecast)
library(tidyr)
library(lubridate)

Executive Summary

This report uses different preprocessing techniques to process an open-source dataset. the dataset was chosen in sourced from CrashStats data provided by VicRoads.The source contained eleven datasets, all of which could be joined to produce a single data set but processing such a huge dataset would be beyond our computational power so two suitable data sets were chosen and joined using one primary key column. The data sets were joined using Inner Join.

To examine the information all the more effectively, the structure of the dataset was analyzed. It distinguished numerical and character class types. Following this, a few factors were changed over into factors and levels were applied to give fitting names. The merged data set was checked to see if it satisfies the tidy data. Using the existing variable in the dataset, new variables were created. Apart from this, missing values were removed and outliers were handled using the boxplot method and capping function. During the process of removing the outliers, it was found that the AGE variable is skewed to the right and therefore, suitable techniques were applied to transform the variable to get rid of the skewness.

Data Description

This assignment will focus on the CrashStats data provided by VicRoads.It contains statistics on road traffic accidents that were reported to the police. To briefly introduce the data, it contains records of accidents that satisfy the following conditions :

The Accident occurred after 1987.
That the accident resulted in:
Victim death happened within thirty days of the accident.
Personal injury reported to a police officer.
The accident offered on terrain like road, street, thoroughfare, footpath, railway level crossing, or any place open to the public.
The accident involved one or more road vehicles.

Preliminary analysis showed eleven separate datasets, all of which could be joined together using a primary key column but that would require extensive processing power since data will be huge thus we handpicked two sets, namely Accident.csv(basic accident details, time, severity, location) and Person.csv(person based details, age, sex, etc).

Statistical comparisons of selected attributes between two different datasets will provide information on each person involved in the accident from the Person dataset with corresponding accident information from the Accident dataset. This means that each row contains data of a person involved in an accident with details joined via accident number which acts as the primary key.

Link to Dataset : https://discover.data.vic.gov.au/dataset/crash-stats-data-extract/resource/f9c7c05a-19e9-4593-aa2c-960a9c97b858

Further information on practical application of dataset can be found on: http://data.vicroads.vic.gov.au/metadata/crashstats_user_guide_and_appendices.pdf

DataSet 1 : Accident

Import Summary

Read in the data and store it into ACCIDENT data frame and using head() function to validate the output.
The ACCIDENT dataset has 103061 rows and 19 columns.
While importing the data, only the variables we interested in were kept and rest was skipped.
Also, changed the data type of the varibale “Accident_No” from double to character as the values were alpha-numeric.

#Reading the first dataset  'ACCIDENT.csv'
ACCIDENT <- read_csv("~/Downloads/2000 to 2005 ACCIDENT/ACCIDENT.csv", 
    col_types = cols(ACCIDENT_NO = col_character(), 
        `DCA Description` = col_skip(), DCA_CODE = col_skip(), 
        EDITION = col_skip(), GRID_REFERENCE_X = col_skip(), 
        GRID_REFERENCE_Y = col_skip(), NODE_ID = col_skip(), 
        PAGE = col_skip(), ROAD_GEOMETRY = col_skip(), 
        `Road Geometry Desc` = col_skip()))

#Previewing first few rows to check the sample from the complete data 
head(ACCIDENT)

#Checking dimensions of dataset for validation.
dim(ACCIDENT)

## [1] 103061     19

DataSet 2: Person

Import Summary:

Read in the data and store it into PERSON data frame and confirm that it was read in correctly.
The Person dataset contains 269045 rows and 14 columns
While importing the data, only the variables we interested in were kept and rest was skipped.
Also,changed the data type of the varibale “Accident_No” from double to character as the values were alpha-numeric.

#Reading the second dataset  'PERSON.csv'
PERSON <- read_csv("~/Downloads/2000 to 2005 ACCIDENT/PERSON.csv", 
    col_types = cols(ACCIDENT_NO = col_character(), 
        HELMET_BELT_WORN = col_skip(), LICENCE_STATE = col_skip(), 
        POSTCODE = col_skip()))

#Previewing first few rows to check the sample from the complete data
head(PERSON)

#Checking dimensions of dataset for validation.
dim(PERSON)

## [1] 269045     14

Merge Dataset

#Using inner join to combine two data set using common records on ACCIDENT_NO column
mergedData <-ACCIDENT %>% inner_join(PERSON, by='ACCIDENT_NO')

#Checking dimensions of merged dataset.
dim(mergedData)

## [1] 269045     32

#Previewing joined dataset for validation.
head(mergedData)

colnames(mergedData)

##  [1] "ACCIDENT_NO"          "ACCIDENTDATE"         "ACCIDENTTIME"        
##  [4] "ACCIDENT_TYPE"        "Accident Type Desc"   "DAY_OF_WEEK"         
##  [7] "Day Week Description" "DIRECTORY"            "LIGHT_CONDITION"     
## [10] "Light Condition Desc" "NO_OF_VEHICLES"       "NO_PERSONS"          
## [13] "NO_PERSONS_INJ_2"     "NO_PERSONS_INJ_3"     "NO_PERSONS_KILLED"   
## [16] "NO_PERSONS_NOT_INJ"   "POLICE_ATTEND"        "SEVERITY"            
## [19] "SPEED_ZONE"           "PERSON_ID"            "VEHICLE_ID"          
## [22] "SEX"                  "AGE"                  "Age Group"           
## [25] "INJ_LEVEL"            "Inj Level Desc"       "SEATING_POSITION"    
## [28] "ROAD_USER_TYPE"       "Road User Type Desc"  "PEDEST_MOVEMENT"     
## [31] "TAKEN_HOSPITAL"       "EJECTED_CODE"

#The columns names in the dataset was found with spaces in between which was fixed using underscore to avoid quotation marks when calling these columns.
names(mergedData) <- gsub(" ", "_", names(mergedData))

#Renaming few column names to make it easier to understand them.
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_INJ_2"] <- "Serious_Injury"
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_INJ_3"] <- "Other_Injury"
colnames(mergedData)[colnames(mergedData) == "NO_PERSONS_NOT_INJ"] <- "No_Injury"

Understand

After joining the merged dataset, we used the str() function to check the data types of different variables. Based on the output, we realised that most of the varibales were either in character or integer variables. However, almost all the variables required data conversion either to factors or date or integer.Factors were further examined for labelling and ordering.

Merged Dataset Structure

#checking structure of joined dataset
str(mergedData)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 269045 obs. of  32 variables:
##  $ ACCIDENT_NO         : chr  "12000000389" "12000000389" "12000000554" "12000000555" ...
##  $ ACCIDENTDATE        : chr  "4/01/2000" "4/01/2000" "6/01/2000" "1/01/2000" ...
##  $ ACCIDENTTIME        : chr  "22.30.00" "22.30.00" "22.15.00" "18.00.00" ...
##  $ ACCIDENT_TYPE       : num  4 4 4 1 1 1 1 4 4 1 ...
##  $ Accident_Type_Desc  : chr  "Collision with a fixed object" "Collision with a fixed object" "Collision with a fixed object" "Collision with vehicle" ...
##  $ DAY_OF_WEEK         : num  3 3 5 7 7 7 7 1 1 2 ...
##  $ Day_Week_Description: chr  "Tuesday" "Tuesday" "Thursday" "Saturday" ...
##  $ DIRECTORY           : chr  "MEL" "MEL" "MEL" "MEL" ...
##  $ LIGHT_CONDITION     : num  5 5 3 1 1 1 1 3 3 3 ...
##  $ Light_Condition_Desc: chr  "Dark No street lights" "Dark No street lights" "Dark Street lights on" "Day" ...
##  $ NO_OF_VEHICLES      : num  1 1 1 4 4 4 4 1 1 3 ...
##  $ NO_PERSONS          : num  2 2 1 4 4 4 4 2 2 3 ...
##  $ Serious_Injury      : num  1 1 1 1 1 1 1 2 2 0 ...
##  $ Other_Injury        : num  0 0 0 1 1 1 1 0 0 2 ...
##  $ NO_PERSONS_KILLED   : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ No_Injury           : num  0 0 0 2 2 2 2 0 0 1 ...
##  $ POLICE_ATTEND       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SEVERITY            : num  1 1 2 2 2 2 2 2 2 3 ...
##  $ SPEED_ZONE          : chr  "090" "090" "060" "060" ...
##  $ PERSON_ID           : chr  "01" "A" "A" "A" ...
##  $ VEHICLE_ID          : chr  "A" "A" "A" "A" ...
##  $ SEX                 : chr  "M" "M" "M" "M" ...
##  $ AGE                 : num  16 19 31 27 32 42 25 18 18 31 ...
##  $ Age_Group           : chr  "16-17" "17-21" "30-39" "26-29" ...
##  $ INJ_LEVEL           : num  2 1 2 2 3 4 4 2 2 3 ...
##  $ Inj_Level_Desc      : chr  "Serious injury" "Fatality" "Serious injury" "Serious injury" ...
##  $ SEATING_POSITION    : chr  "LF" "D" "D" "D" ...
##  $ ROAD_USER_TYPE      : num  3 2 2 2 2 2 2 3 2 2 ...
##  $ Road_User_Type_Desc : chr  "Passengers" "Drivers" "Drivers" "Drivers" ...
##  $ PEDEST_MOVEMENT     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TAKEN_HOSPITAL      : chr  "Y" NA "Y" "Y" ...
##  $ EJECTED_CODE        : num  0 0 0 0 0 0 0 0 0 0 ...

Data Type Conversion

#Using mutate function to convert the variables like 'Day Week Description','Light Condition Desc''sex' etc into factors to facilitate 
mergedData <- mutate_at(mergedData, vars("Day_Week_Description", "DIRECTORY","Light_Condition_Desc","SEX","Age_Group","Inj_Level_Desc","SEATING_POSITION","TAKEN_HOSPITAL","SPEED_ZONE"), as.factor)

#Using mutate function to convert the variables like 'NO_PERSONS','NO_PERSONS_NOT_INJ''EJECTED_CODE' etc into integers to facilitate
mergedData <- mutate_at(mergedData, vars("NO_OF_VEHICLES","NO_PERSONS","Serious_Injury","Other_Injury","NO_PERSONS_KILLED","No_Injury","POLICE_ATTEND","SEVERITY","AGE","INJ_LEVEL","ROAD_USER_TYPE","PEDEST_MOVEMENT","EJECTED_CODE"), as.integer)

#Using dmy function to convert ACCIDENTDATE into year-month-day format
mergedData$ACCIDENTDATE <- ymd(mergedData$ACCIDENTDATE)

## Warning: All formats failed to parse. No formats found.

#Using hms function to convert ACCIDENTTIME into time format for storing time-of-day values
mergedData$ACCIDENTTIME <- hms(mergedData$ACCIDENTTIME)

## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs

#Checking levels for factor type variables  
levels(mergedData$Day_Week_Description)

## [1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
## [7] "Wednesday"

levels(mergedData$SPEED_ZONE)

##  [1] "040" "050" "060" "070" "075" "080" "090" "100" "110" "777" "888"
## [12] "999"

levels(mergedData$SEX)

## [1] "F" "M" "U"

levels(mergedData$Age_Group)

##  [1] "0-4"     "13-15"   "16-17"   "17-21"   "22-25"   "26-29"   "30-39"  
##  [8] "40-49"   "5-12"    "50-59"   "60-64"   "64-69"   "70+"     "unknown"

#Ordering the factor values inside Day_Week_description
mergedData$Day_Week_Description <- factor(mergedData$Day_Week_Description, levels = c("Monday","Tuesday","Wednesday","Thrusday","Friday","Saturday","Sunday"), ordered = TRUE)

#Ordering and labeling data with more informative label

#note: labeling outliers(777,888,999) as unknown 
mergedData$SPEED_ZONE<- factor(mergedData$SPEED_ZONE, levels= c('030', '040', '050', '060', '070', '075', '080', '090', '100', '110', '777', '888', '999'), labels=c('30KM/HR', '40KM/HR', '50KM/HR', '60KM/HR', '70KM/HR', '75KM/HR', '80KM/HR', '90KM/HR', '100KM/HR', '110KM/HR','Unknown','Unknown', 'Unknown'),ordered = TRUE)

mergedData$SEX <- factor(mergedData$SEX, levels = c('F','M','U'), labels = c("Female",'Male','Unknown'))


mergedData$Age_Group<-factor(mergedData$Age_Group, levels = c('0-4', '5-12', '13-15', '16-17', '17-21', '22-25',  '26-29', '30-39', '40-49', '50-59', '60-64', '64-69', '70+', 'unknown'), labels=c('0-4 Years', '5-12 Years', '13-15 Years', '16-17 Years', '17-21 Years', '22-25 Years',  '26-29 Years', '30-39 Years', '40-49 Years', '50-59 Years', '60-64 Years', '64-69 Years', '70 Years+', 'Unknown'))


#Validating the changes applied to the attributes of the merged dataset
str(mergedData)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 269045 obs. of  32 variables:
##  $ ACCIDENT_NO         : chr  "12000000389" "12000000389" "12000000554" "12000000555" ...
##  $ ACCIDENTDATE        : Date, format: NA NA ...
##  $ ACCIDENTTIME        :Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  22 22 22 18 18 18 18 1 1 23 ...
##   .. ..@ minute: num  30 30 15 0 0 0 0 0 0 50 ...
##  $ ACCIDENT_TYPE       : num  4 4 4 1 1 1 1 4 4 1 ...
##  $ Accident_Type_Desc  : chr  "Collision with a fixed object" "Collision with a fixed object" "Collision with a fixed object" "Collision with vehicle" ...
##  $ DAY_OF_WEEK         : num  3 3 5 7 7 7 7 1 1 2 ...
##  $ Day_Week_Description: Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 2 2 NA 6 6 6 6 7 7 1 ...
##  $ DIRECTORY           : Factor w/ 2 levels "MEL","VCS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LIGHT_CONDITION     : num  5 5 3 1 1 1 1 3 3 3 ...
##  $ Light_Condition_Desc: Factor w/ 7 levels "Dark No street lights",..: 1 1 3 5 5 5 5 3 3 3 ...
##  $ NO_OF_VEHICLES      : int  1 1 1 4 4 4 4 1 1 3 ...
##  $ NO_PERSONS          : int  2 2 1 4 4 4 4 2 2 3 ...
##  $ Serious_Injury      : int  1 1 1 1 1 1 1 2 2 0 ...
##  $ Other_Injury        : int  0 0 0 1 1 1 1 0 0 2 ...
##  $ NO_PERSONS_KILLED   : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ No_Injury           : int  0 0 0 2 2 2 2 0 0 1 ...
##  $ POLICE_ATTEND       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SEVERITY            : int  1 1 2 2 2 2 2 2 2 3 ...
##  $ SPEED_ZONE          : Ord.factor w/ 11 levels "30KM/HR"<"40KM/HR"<..: 8 8 4 4 4 4 4 4 4 5 ...
##  $ PERSON_ID           : chr  "01" "A" "A" "A" ...
##  $ VEHICLE_ID          : chr  "A" "A" "A" "A" ...
##  $ SEX                 : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 2 2 1 2 1 ...
##  $ AGE                 : int  16 19 31 27 32 42 25 18 18 31 ...
##  $ Age_Group           : Factor w/ 14 levels "0-4 Years","5-12 Years",..: 4 5 8 7 8 9 6 5 5 8 ...
##  $ INJ_LEVEL           : int  2 1 2 2 3 4 4 2 2 3 ...
##  $ Inj_Level_Desc      : Factor w/ 4 levels "Fatality","Not injured",..: 4 1 4 4 3 2 2 4 4 3 ...
##  $ SEATING_POSITION    : Factor w/ 10 levels "CF","CR","D",..: 4 3 3 3 3 3 3 4 3 3 ...
##  $ ROAD_USER_TYPE      : int  3 2 2 2 2 2 2 3 2 2 ...
##  $ Road_User_Type_Desc : chr  "Passengers" "Drivers" "Drivers" "Drivers" ...
##  $ PEDEST_MOVEMENT     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TAKEN_HOSPITAL      : Factor w/ 2 levels "N","Y": 2 NA 2 2 1 NA NA 2 2 NA ...
##  $ EJECTED_CODE        : int  0 0 0 0 0 0 0 0 0 0 ...

Tidy & Manipulate Data I

Tidy Data Principles:

Each variable must have its own column : At first, few of the variable in our dataset wasn’t satisfying this principle. Variables like Serious_Injury,Other_Injury,No_Injury which could be converted into values under a common variable giving information on severity of the caused injury using gather() fucntion. However, this problem got solved automatically when we merged the dataset with person dataset, as it’s already have variables which convey the severity of the injury.

let’s look into this more closely by using an example

mergedData %>% select(ACCIDENT_NO,PERSON_ID,Serious_Injury,Other_Injury,No_Injury,INJ_LEVEL,Inj_Level_Desc) %>% filter(ACCIDENT_NO == '12000000389')

Now, let’s look into the second and third tidy data principle :

Each observation must have its own row : Yes, each observation row is unique for a person(Person_ID) involved in a particular accident(Accident_No)
Each value must have its own cell : Yes, each value has its own cell. However, if required, values in variables like data & time column could be separated into different cells. Though, We choose to keep them in a single cell for now.

Tidy & Manipulate Data II

In this section, we have created new variables using the existing data using function mutate().

PERSON_AFFECTED_PERCENT : This newly created variable gives percent of person involved in an accident got affected at any level. For this, we have used values from varibales like Serious_Injury, Other_Injury and No_Persons_killed.
AVERAGE_AGE : This variable gives the mean age of the people involved in an accident.

#Creating two new variables "Person_Affected_Percent" and "Average_Age"
mergedData <- mergedData %>% group_by(ACCIDENT_NO) %>% mutate(PERSON_AFFECTED_PERCENT = (Serious_Injury +Other_Injury+NO_PERSONS_KILLED)/NO_PERSONS*100, AVERAGE_AGE = mean(AGE,na.rm = TRUE))

#Created new varibale to present the output by selecting variables of interest grouped by accident No
resultedData <- mergedData %>% select(ACCIDENT_NO,PERSON_ID,PERSON_AFFECTED_PERCENT,AVERAGE_AGE) %>% group_by(ACCIDENT_NO)

#Checking the result
head(resultedData)

Scan I

In this section, we scanned for the missing values present in the merged dataset. Before any processing, we made few rules to deal with the missing values

Rule 1 : If the missing values are under 7% (i.e.18,800 Observations) of the total available data values for a particular variable, then the observations with the missing values will be removed from the dataset. Assuming, other 90% data is sufficient to make the analysis.
Rule 2 : If the removal of missing values from a variable is affecting the any other variable more than 15%, then the actions from rules 1 will be undone and missing values will be kept untouched.
Rule 3 : If the missing values are above the 7% of the total data values, then no action will be taken on those missing values.

#Checking the number of missing values column wise to have a overall look
colSums(is.na(mergedData))

##             ACCIDENT_NO            ACCIDENTDATE            ACCIDENTTIME 
##                       0                  269045                     380 
##           ACCIDENT_TYPE      Accident_Type_Desc             DAY_OF_WEEK 
##                       0                       0                       0 
##    Day_Week_Description               DIRECTORY         LIGHT_CONDITION 
##                   39444                     333                       0 
##    Light_Condition_Desc          NO_OF_VEHICLES              NO_PERSONS 
##                       0                       0                       0 
##          Serious_Injury            Other_Injury       NO_PERSONS_KILLED 
##                       0                       0                       0 
##               No_Injury           POLICE_ATTEND                SEVERITY 
##                       0                       0                       0 
##              SPEED_ZONE               PERSON_ID              VEHICLE_ID 
##                       0                       0                   10405 
##                     SEX                     AGE               Age_Group 
##                       0                   16514                       0 
##               INJ_LEVEL          Inj_Level_Desc        SEATING_POSITION 
##                       0                       0                   10425 
##          ROAD_USER_TYPE     Road_User_Type_Desc         PEDEST_MOVEMENT 
##                       0                       0                       2 
##          TAKEN_HOSPITAL            EJECTED_CODE PERSON_AFFECTED_PERCENT 
##                  182656                     921                       0 
##             AVERAGE_AGE 
##                     243

#Based on results from colSums() function, it was found that variable like ACCIDENTTIME, DIRECTORY,VEHICLE_ID,EJECTED_CODE,AVERAGE_AGE,SEATING_POSITION, PEDEST_MOVEMENT, AGE have missing values less than 10%.

#Therefore, based on the Rule 1, missing values were removed from the dataset
mergedData <- mergedData[!(is.na(mergedData$ACCIDENTTIME)),]
mergedData <- mergedData[!(is.na(mergedData$DIRECTORY)),]
mergedData <- mergedData[!(is.na(mergedData$VEHICLE_ID)),]
mergedData <- mergedData[!(is.na(mergedData$EJECTED_CODE)),]
mergedData <- mergedData[!(is.na(mergedData$AVERAGE_AGE)),]
mergedData <- mergedData[!(is.na(mergedData$SEATING_POSITION)),]
mergedData <- mergedData[!(is.na(mergedData$PEDEST_MOVEMENT)),]
mergedData <- mergedData[!(is.na(mergedData$AGE)),]


#Checking the dataset after missing values removal
colSums(is.na(mergedData))

##             ACCIDENT_NO            ACCIDENTDATE            ACCIDENTTIME 
##                       0                  241241                       0 
##           ACCIDENT_TYPE      Accident_Type_Desc             DAY_OF_WEEK 
##                       0                       0                       0 
##    Day_Week_Description               DIRECTORY         LIGHT_CONDITION 
##                   35251                       0                       0 
##    Light_Condition_Desc          NO_OF_VEHICLES              NO_PERSONS 
##                       0                       0                       0 
##          Serious_Injury            Other_Injury       NO_PERSONS_KILLED 
##                       0                       0                       0 
##               No_Injury           POLICE_ATTEND                SEVERITY 
##                       0                       0                       0 
##              SPEED_ZONE               PERSON_ID              VEHICLE_ID 
##                       0                       0                       0 
##                     SEX                     AGE               Age_Group 
##                       0                       0                       0 
##               INJ_LEVEL          Inj_Level_Desc        SEATING_POSITION 
##                       0                       0                       0 
##          ROAD_USER_TYPE     Road_User_Type_Desc         PEDEST_MOVEMENT 
##                       0                       0                       0 
##          TAKEN_HOSPITAL            EJECTED_CODE PERSON_AFFECTED_PERCENT 
##                  164162                       0                       0 
##             AVERAGE_AGE 
##                       0

Scan II

In this section, numeric variables were checked in our merged data to scan for the outliers and fixed them with suitable techniques.

To start with the process,

AGE variable was picked from the dataset as that’s the only variable which had proper numeric values.
Then, the distribution of the AGE variables was checked using a histogram. The distribution was found skewed to the right and therefore, techniques like z-score approach can’t be applied to detect and remove the outliers as that only works with symmetric distribution.
On the other hand, Tukey’s method or Boxplot was suitable to find out outliers with such distribtuion.Therefore, boxplot approach was choosen to move ahead.
After the detection, we applied Capping or winsorising approach to deal with the outliers as it replaces the outliers with the nearest neighbours that are not outliers.

#Histogram to check the distribution of AGE variable
hist(mergedData$AGE ,col = "gold", xlab = "Age", ylab = "Number of Individuals", main = "Histogram of Age of the People involved in Accidents")
abline(v= mean(mergedData$AGE), col="red", lwd=2, lty=2)
abline(v=median(mergedData$AGE), col='blue', lwd=2, lty=2)

#Boxplot to detect the outliers
mergedData$AGE %>% boxplot(main = "Detecting Outliers in Age variable", ylab = "Age", col = "gold")

#Checking the summary before outlier removal
summary(mergedData$AGE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.00   31.00   34.23   45.00   99.00

#Defining Capping function
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

#Applying the defined capping function on the data to fix the outliers
mergedData$AGE <- mergedData$AGE %>% cap()
#Getting the updated summary of the data
summary(mergedData$AGE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.00   31.00   34.08   45.00   81.00

#Boxplot to check the data after outlier removal
mergedData$AGE %>% boxplot(main = "Age Variable After Outliers Removal", ylab = "Age", col = "lightBlue")

Transform

As from the last section, we discovered that the AGE variable in our dataset is skewed to the right and therefore, in this section different transformation techniques were applied to get rid of the skewness.

#Firstly, the log 10 transformation was applied as it's commonly used for reducing right skewness
log_age <- log10(mergedData$AGE)
hist(log_age)

#Secondly, natural logarithm was applied to check whether it's generating better result than log 10
ln_age <- log(mergedData$AGE)
hist(ln_age)

#Lastly, square root transformation was used for reducing right skewness
sqrt_age <- sqrt(mergedData$AGE)
hist(sqrt_age)

Results

Based on the generated result, square root transformation performed the best. Though, the trasnformation didn’t result in complete ridness to the skewness.The square root transformation is still slightly skewed to the left.