Provide the packages required to reproduce the report.
library(readr) #Useful for reading files with csv extension
library(magrittr) #useful for pipes operations
library(stringr) #Useful for common string operations
library(tidyr) #Useful for data tidying
library(lubridate) #USeful for date-format data manipulations
library(outliers) #Useful for outlier detection
library(dplyr) #Useful for data manipulations
library(Hmisc) #Useful for NA values imputation
As stated by US Department of Transportation (2015), traffic crashes in 2015 took the lives of 35,092 people and increased by 7.2 % from 2014, which also officially put an end to the 5-decade trend of declining traffic fatalities. Because of this, the US government publicly released the dataset of traffic fatalities to gain insights of which factors causing the crashes and raise people’s awareness about public safety.
The dataset of “US 2015 Traffic Fatalities” is chosen as the main interest for this report to perform extensive preprocessing task in R. The report will only “Accidents” as the main dataset and “Vehicle” and “Person” as the supporting ones to describe the entire picture of US traffic accidents in 2015. As there are some variables that are not necessary, to make the report more cohesive, we subset the dataset by removing them but make sure that the required data types are fulfilled.
The report follows these steps as below:
Importing the datasets to R using read_csv() function and joining them together using common key values
Examining the types and data structures in the dataset
Making the final dataset more comprehensive to readers by converting/factorising variables into their suitable classes (with the help of Fatality Analysis Reporting System (FARS) Analytical User’s Manual issued by US Department of transportation)
Checking if the dataset follows the data tidy technique developed by Hadley Wickham and Grolemund (2016)
A new variable CHECK_IF_DEAD is mutated based on casualties’ death year to mark fatalities in accident cases
Checking missing values/ inconsistencies in the dataset and imputing them with means and modes
Checking outliers in numeric- type variables and performing Capping method to get rid of them.
Performing log function on the right skewed distributed variable to transform it into normal distribution
The complete dataset of “2015 US Traffic Fatalities” issued by National Highway Traffic Safety Administration can be downloaded from Kaggle. To produce the report, only three data files “Accidents”, “Vehicle” and “Person” are used respectively as the main and two supporting data.
The main dataset is “Accidents” which includes the total statistics of all US crashing cases in 2015.
The original Accident dataset has 32166 observations of 52 variables.
setwd("~/Data preprocessing/Assignment 3")
accidents <- read_csv("accident.csv")
Variable explanation:
STATE: State Number
ST_CASE: Case Unique Number associated with States
VE_TOTAl: Number of motor vehicles (Including Not-in-transport vehicles) involved in the crashing
VE_FOMRS: Number of vehicle forms submitted
PVH_INVL: Number of parked/working vehicle involved
PEDS: Number of pedestrian/bicyclist forms submitted
PERNOTMVIT: Number of persons not in motor vehicles in Transport
PERMVIT: Number of persons in motor vehicles in Transport
PERSONS: All persons involved in the crash including motorists
COUNTY: Codes associated with counties’ names
CITY: Codes associated with cities’ names
DAY/MONTH/YEAR: Day/Month/Year of crash
DAY_WEEK: Day of week
HOUR/MINUTE: Hour/Minute of crash
NHS: Whether it is on National Highway System
RUR_URB: Whether the land use is rural or urban
FUNC_SYS: Functional system
RD_OWNER: Road ownership
ROUTE: The route where the crash occurred
TWAY_ID: Traffic way identifier
MILEPT: Mile point
LATITUDE - LONGITUD: Geographical location of the crash
SP_JUR: Special jurisdiction
HARM_EV: First harmful event causing injury or property damage
MAN_COLL: Manner of collision
RELJCT: Relation to junction
TYP_INT: Type of intersection
WRK_ZONE: Whether the crash occurred in a construction or maintenance zone
REL_ROAD: Relation to trafficway
LGT_COND: Light condition
WEATHER: Atmosphere condition
SCH_BUS: Whether a school bus or motor vehicle functioning as a school bus is related to the crash
RAIL: rail grade crossing ID
NOT_HOUR/MIN: EMS notified hour/minute
ARR_HOUR/MIN: EMS arrival hour/minute
HOSP_HOUR/MIN: EMS hospital hour/minute
CF: Crash related factor
DRUNK_DR: whether the driver is drunk
The first supporting dataset is “Vehicle” which contains data of in-transport motor vehicles and their drivers in each crashing case in 2015.
The original Vehicle dataset contains 48923 observations of 102 variables.
vehicles <- read_csv("vehicle.csv")
Filtering the first dataset
vehicles_select <- vehicles[ ,c("ST_CASE", "VEH_NO", "MAKE")]
Variable explanation
ST_CASE: Case Unique Number associated with States (Constructed by combining STATE number and 4-Digital Consecutive Number)
VEH_NO: Vehicle number involved in the crash
MAKE: Vehicle make
The second supporting dataset is “Person” which contains data of motorist and non-motorist involved in each crashing case in 2015.
The original Person dataset contains 80587 observations of 68 variables.
persons <- read_csv("person.csv")
Filtering the second dataset
persons_select <- persons[ ,c("ST_CASE", "VEH_NO", "AGE", "SEX", "DEATH_YR")]
Variable explanation
ST_CASE: Case Unique Number associated with States (Constructed by combining STATE number and 4-Digital Consecutive Number)
VEH_NO: Vehicle number involved in the crash
AGE: Person age
SEX: Person gender
DEATH_YR: Person’s death year
Common variables:
“ST_CASE” and “VEH_NO”: In each case, there are a certain number of vehicles involved and in each specific vehicle, there are motorists. That is why if we want to join these two data, only based on either of the key values, it is not enough and may lead to incorrect data.
Join function:
Full join () functions: In the Person dataset, as observed, vehicle_No = 0 is counted and it represents crashing cases in which non-motorists are involved while it is excluded in Vehicle dataset. That is why when doing full join, we can keep all observations in both Vehicle and Person datasets.
vehicles_person <- vehicles_select %>% full_join(persons_select, by= c("ST_CASE"="ST_CASE", "VEH_NO" = "VEH_NO" ))
Join function:
Accident dataset acts as the main interest one so we put more emphasis on it. By using left join () function, we join the two datasets.
The combined dataset shows us a clearer description of each crashing accidents that happened in the US in 2015. We have the total number of vehicles and persons involved in each case and their details of vehicles and non-motorists/motorists involved in it.
accidents <- accidents %>% left_join(vehicles_person, by ="ST_CASE")
We subset the dataset by eliminating some variables but still make sure that the required data types are met.
accidents <- accidents %>% select (-VE_FORMS, -FATALS, -PEDS, -PVH_INVL,-PERNOTMVIT, -CITY, -PERMVIT, -COUNTY, -TWAY_ID2, -MILEPT, -WEATHER2, -WEATHER, -RAIL, -DRUNK_DR, -CF1, -CF2, -CF3, -RD_OWNER, -NOT_HOUR, -NOT_MIN, -ARR_HOUR, -ARR_MIN, -HOSP_HR, -HOSP_MN, -RELJCT1, -RELJCT2)
print(accidents)
str(accidents)
Classes tbl_df, tbl and 'data.frame': 80804 obs. of 31 variables:
$ STATE : int 1 1 1 1 1 1 1 1 1 1 ...
$ ST_CASE : int 10001 10002 10003 10003 10004 10005 10005 10006 10006 10007 ...
$ VE_TOTAL: int 1 1 1 1 1 2 2 1 1 1 ...
$ PERSONS : int 1 1 2 2 1 2 2 2 2 2 ...
$ DAY : int 1 1 1 1 4 7 7 8 8 8 ...
$ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
$ YEAR : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ DAY_WEEK: int 5 5 5 5 1 4 4 5 5 5 ...
$ HOUR : int 2 22 1 1 0 7 7 9 9 18 ...
$ MINUTE : int 40 13 25 25 57 9 9 59 59 33 ...
$ NHS : int 0 1 0 0 0 0 0 0 0 1 ...
$ RUR_URB : int 1 1 1 1 1 2 2 1 1 2 ...
$ FUNC_SYS: int 3 1 3 3 4 3 3 5 5 3 ...
$ ROUTE : int 3 1 2 2 3 2 2 4 4 2 ...
$ TWAY_ID : chr "SR-5" "I-65" "US-SR 6" "US-SR 6" ...
$ LATITUDE: num 33.9 34.9 32.1 32.1 31.4 ...
$ LONGITUD: num -87.3 -86.9 -85.8 -85.8 -85.5 ...
$ SP_JUR : int 0 0 0 0 0 0 0 0 0 0 ...
$ HARM_EV : int 35 34 42 42 53 12 12 1 1 43 ...
$ MAN_COLL: int 0 0 0 0 0 6 6 0 0 0 ...
$ TYP_INT : int 1 1 1 1 1 3 3 1 1 1 ...
$ WRK_ZONE: int 0 0 0 0 0 0 0 0 0 0 ...
$ REL_ROAD: int 4 3 4 4 4 1 1 4 4 4 ...
$ LGT_COND: int 2 2 2 2 2 1 1 1 1 3 ...
$ WEATHER1: int 1 10 1 1 10 1 1 1 1 1 ...
$ SCH_BUS : int 0 0 0 0 0 0 0 0 0 0 ...
$ VEH_NO : int 1 1 1 1 1 1 2 1 1 1 ...
$ MAKE : int 12 49 20 20 12 69 85 12 12 23 ...
$ AGE : int 68 49 31 20 40 24 60 64 17 80 ...
$ SEX : int 1 1 1 2 1 1 1 1 1 1 ...
$ DEATH_YR: int 2015 2015 2015 8888 2015 2015 8888 2015 8888 2015 ...
Some variables are assigned with a specific number, in fact, however, it is a code representing a meaningful text. Referencing the Fatality Analysis Reporting System (FARS) Analytical User’s Manual issued by US Department of transportation, we convert them into character type to make our report more user-friendly.
#STATE
accidents<- accidents%>% mutate(STATE= factor(STATE, levels = c('1', '2', '3', '4', '5', '6', '8', '9', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45','46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56'),
labels = c("Alabama", "Alaska", "American Samoa", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina","North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island", "South Carolina", "South Dakota", "Tennessee","Texas", "Utah", "Vermont", "Virginia", "Virgin Islands", "Washington", "West Virginia", "Wisconsin", "Wyoming")))
#DAYS_OF_WEEK
accidents<- accidents%>% mutate(DAY_WEEK = factor(DAY_WEEK, levels = c('1', '2', '3', '4', '5', '6', '7'), labels = c("Sunday","Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")))
##NHS
accidents<- accidents%>% mutate(NHS = factor(NHS, levels = c('0', '1', '9'), labels = c("Not on the NHS", "On the NHS", "Unknown")))
##RUR_URB
accidents<- accidents%>% mutate(RUR_URB = factor(RUR_URB, levels = c('1', '2', '6','8','9'), labels = c("Rural", "Urban", "Trafficway Not in State Inventory", "Not Reported", "Unknown")))
#FUNC_SYS
accidents<- accidents%>% mutate(FUNC_SYS = factor(FUNC_SYS, levels = c('1', '2', '3', '4', '5', '6', '7', '96', '98', '99'), labels = c("Interstate", "Principal Arterial - Other Freeways and Expressways", "Principal Arterial - Other", "Minor Arterial","Major Collector","Minor Collector", "Local", "Trafficway Not in State Inventory", "Not Reported", "Unknown")))
#SP_JUR
accidents<- accidents%>% mutate(SP_JUR = factor(SP_JUR, levels = c('0', '1', '2', '3', '4', '5', '8', '9'), labels = c("No Special Jurisdiction (Includes National Forests Since 2008)", "National Park Service", "Military", "Indian Reservation","College/University Campus", "Other Federal Properties (Since 1977)", "Other (Since 1976)", "Unknown")))
#HARM_EV
accidents<- accidents%>% mutate(HARM_EV = factor(HARM_EV, levels = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '14', '15', '16', '17', '18', '19', '20', '21', '23', '24', '25', '26', '30', '31', '32', '33', '34', '35', '38', '39', '40', '41', '42', '43', '44', '45', '46', '48', '49', '50', '51', '52', '53', '54', '55', '57', '58', '59', '72', '73', '98', '99'), labels = c("Rollover/overturn", "Fire/explosion", "Immersion", "Gas inhalation", "Fell/jumped from vehicle", "Injured in vehicle (noncollision)", "Other noncollision","Pedestrian", "Pedalcyclist", "Railway vehicle", "Live animal", "Motor vehicle in-transport", "Parked motor vehicle (not in transport)", "Nonmotorist on personal conveyance","Thrown or falling object", "Boulder", "Other object (not fixed)", "Building", "Impact attenuator/crash cushion", "Bridge pier or support", "Bridge rail (includes parapet)","Guardrail face", "Concrete traffic barrier", "Other traffic barrier", "Utility pole / light support", "Other post, other pole or other supports", "Culvert", "Curb","Ditch", "Embankment", "Fence", "Wall", "Fire hydrant", "Shrubbery", "Tree (standing only)", "Other fixed object", "Pavement surface irregularity (pothole, grooved, grates)","Working motor vehicle (construction, maintenance or utility vehicles only)", "Traffic signal support", "Snow bank", "Ridden animal or animal-drawn conveyance", "Bridge overhead structure","Jackknife (harmful to this vehicle)", "Guardrail end", "Mail box", "Motor vehicle in-transport strikes or is struck by cargo, persons or objects set-in-motion from/by another motor vehicle in-transport","Motor vehicle in motion outside the trafficway", "Cable barrier", "Ground", "Traffic sign support", "Cargo/equipment loss or shift (harmful to this vehicle)", "Object Fell From Motor Vehicle In-Transport", "Not reported", "Unknown")))
#MAN_COLL
accidents<- accidents%>% mutate(MAN_COLL = factor(MAN_COLL, levels = c('0', '1', '2', '6', '7', '8', '9', '10', '11', '98','99'),labels = c("Not a collision with a motor vehicle in-transport", "Front-to-rear", "Front-to-front", "Angle","Sideswipe - same direction","Sideswipe - opposite direction", "Rear-to-side", "Rear-to-rear", "Other", "Not reported", "Unknown")))
#TYP_INT
accidents<- accidents%>% mutate(TYP_INT = factor(TYP_INT, levels = c('1', '2', '3', '4', '5', '6', '7', '10', '98', '99'),labels = c("Not an intersection", "Four-way intersection", "T-intersection", "Y-intersection", "Traffic circle", "Roundabout", "Five point, or more", "L Intersection","Not reported", "Unknown" )))
#WRK_ZONE
accidents<- accidents%>% mutate(WRK_ZONE = factor(WRK_ZONE, levels = c('0','1', '2', '3', '4', '8'),labels = c("None", "Construction", "Maintenance", "Utility", "Work zone, type unknown", "Not reported")))
#LGT_COND
accidents<- accidents%>% mutate(LGT_COND = factor(LGT_COND, levels = c('1', '2', '3', '4', '5', '6', '7', '8', '9'),labels = c("Daylight", "Dark, not lighted", "Dark, lighted", "Dawn", "Dusk", "Dark, unknown lighting", "Other", "Not reported", "Unknown")))
#REL_ROAD
accidents<- accidents%>% mutate(REL_ROAD = factor(REL_ROAD, levels = c('1', '2', '3', '4', '5', '6', '7','8', '10', '11', '98', '99'), labels = c("On roadway", "On shoulder", "On median", "On roadside", "Outside trafficway", "Off roadway - location unknown", "In parking lane/zone","Gore", "Separator", "Continuous left-turn lane", "Not reported", "Unknown")))
#WEATHER1
accidents<- accidents%>% mutate(WEATHER1 = factor(WEATHER1, levels = c('0','1', '2', '3', '4', '5', '6', '7', '8', '10', '11', '12', '98', '99'), labels = c("No additional atmospheric condition", "Clear", "Rain", "Sleet, hail (freezing rain or drizzle)", "Snow", "Fog, smog, smoke", "Severe crosswinds", "Blowing sand, soil, dirt", "Other", "Cloudy", "Blowing snow", "Freezing Rain or Drizzle", "Not reported", "Unknown")))
#SCH_BUS
accidents<- accidents%>% mutate(SCH_BUS = factor(SCH_BUS, levels = c('0','1'),
labels = c("No", "Yes")))
#SEX
accidents<- accidents%>% mutate(SEX = factor(SEX, levels = c('1', '2', '8', '9'),
labels = c("Male", "Female", "Not reported", "Unknown")))
#MAKE
accidents<- accidents%>% mutate(MAKE = factor(MAKE, levels = c('1', '2','3','6', '7', '8', '9', '10', '12', '13', '14', '18', '19', '20', '21', '22', '23', '24', '25', '29', '30', '31', '32','33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53','54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '69', '70', '71', '72', '73', '74', '75', '76','77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '97', '98', '99'), labels = c("American Motors", "Jeep", "AM General", "Chrysler", "Dodge", "Imperial", "Plymouth", "Eagle", "Ford", "Lincoln", "Mercury", "Buick/Opel", "Cadillac", "Chevrolet", "Oldsmobile", "Pontiac", "GMC", "Saturn", "Grumman", "Other domestic", "Volkswagen", "Alfa Romeo", "Audi", "Austin/Healey", "BMW", "Datsun/Nissan", "Fiat", "Honda", "Isuzu", "Jaguar", "Lancia","Mazda", "Mercedes-Benz", "MG", "Peugeot", "Porsche", "Renault", "Saab", "Subaru", "Toyota", "Triumph", "Volvo",
"Mitsubishi", "Suzuki", "Acura", "Hyundai", "Merkur", "Yugo", "Infiniti", "Lexus", "Daihatsu", "Sterling", "Land Rover","Kia", "Daewoo", "Smart", "Other Imports", "BSA", "Ducati", "Harley-Davidson", "Kawasaki", "Moto-Guzzi", "Norton", "Yamaha","Victory", "Other Make Moped", "Other Make Motored Cycle", "Brockway", "Diamond-Reo", "Freightliner", "FWD", "International Harvester/Navistar","Kenworth", "Mack", "Peterbilt", "Iveco/Magirus", "White/Autocar - White/GMC", "Bluebird", "Eagle Coach", "Gillag", "MCI", "Thomas Built", "Not reported", "Other make", "Unknown make")))
#Death Year
accidents<- accidents %>% mutate(DEATH_YR = (str_replace_all(DEATH_YR, pattern = "9999", replacement =NA_character_)))
#Crash Hour
accidents <- accidents %>% mutate(HOUR = (str_replace_all(HOUR, pattern = "99", replacement =NA_character_)))
#Crash Minute
accidents <- accidents %>% mutate(MINUTE = (str_replace_all(MINUTE, pattern = "99", replacement =NA_character_)))
#Age
accidents <- accidents %>% mutate(AGE = (str_replace_all(AGE, pattern = "999", replacement =NA_character_)))
accidents <- accidents %>% mutate(AGE = (str_replace_all(AGE, pattern = "998", replacement =NA_character_)))
accidents$AGE <- as.numeric(accidents$AGE)
There are some variables that contain year, month and day information separately. We will join them together to make one variable and convert them to date formate.
accidents5 <- accidents %>% unite("CRASH_DAY","DAY", "MONTH", "YEAR", sep = ".")
accidents5$CRASH_DAY <- dmy(accidents5$CRASH_DAY)
accidents6 <- accidents5 %>% unite("CRASH_TIME", "HOUR", "MINUTE", sep = ":")
In order to make our dataset more comprehensive, we will move some relevant variables near each other.
accidents_final <- accidents6[ ,c(1, 2, 3, 24, 25, 4, 26, 27, 28, 6, 5, 7:23)]
str(accidents_final)
Classes tbl_df, tbl and 'data.frame': 80804 obs. of 28 variables:
$ STATE : Factor w/ 54 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ST_CASE : int 10001 10002 10003 10003 10004 10005 10005 10006 10006 10007 ...
$ VE_TOTAL : int 1 1 1 1 1 2 2 1 1 1 ...
$ VEH_NO : int 1 1 1 1 1 1 2 1 1 1 ...
$ MAKE : Factor w/ 85 levels "American Motors",..: 9 40 14 14 9 57 73 9 9 17 ...
$ PERSONS : int 1 1 2 2 1 2 2 2 2 2 ...
$ AGE : num 68 49 31 20 40 24 60 64 17 80 ...
$ SEX : Factor w/ 4 levels "Male","Female",..: 1 1 1 2 1 1 1 1 1 1 ...
$ DEATH_YR : chr "2015" "2015" "2015" "8888" ...
$ DAY_WEEK : Factor w/ 7 levels "Sunday","Monday",..: 5 5 5 5 1 4 4 5 5 5 ...
$ CRASH_DAY : Date, format: "2015-01-01" "2015-01-01" ...
$ CRASH_TIME: chr "2:40" "22:13" "1:25" "1:25" ...
$ NHS : Factor w/ 3 levels "Not on the NHS",..: 1 2 1 1 1 1 1 1 1 2 ...
$ RUR_URB : Factor w/ 5 levels "Rural","Urban",..: 1 1 1 1 1 2 2 1 1 2 ...
$ FUNC_SYS : Factor w/ 10 levels "Interstate","Principal Arterial - Other Freeways and Expressways",..: 3 1 3 3 4 3 3 5 5 3 ...
$ ROUTE : int 3 1 2 2 3 2 2 4 4 2 ...
$ TWAY_ID : chr "SR-5" "I-65" "US-SR 6" "US-SR 6" ...
$ LATITUDE : num 33.9 34.9 32.1 32.1 31.4 ...
$ LONGITUD : num -87.3 -86.9 -85.8 -85.8 -85.5 ...
$ SP_JUR : Factor w/ 8 levels "No Special Jurisdiction (Includes National Forests Since 2008)",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HARM_EV : Factor w/ 54 levels "Rollover/overturn",..: 30 29 35 35 45 12 12 1 1 36 ...
$ MAN_COLL : Factor w/ 11 levels "Not a collision with a motor vehicle in-transport",..: 1 1 1 1 1 4 4 1 1 1 ...
$ TYP_INT : Factor w/ 10 levels "Not an intersection",..: 1 1 1 1 1 3 3 1 1 1 ...
$ WRK_ZONE : Factor w/ 6 levels "None","Construction",..: 1 1 1 1 1 1 1 1 1 1 ...
$ REL_ROAD : Factor w/ 12 levels "On roadway","On shoulder",..: 4 3 4 4 4 1 1 4 4 4 ...
$ LGT_COND : Factor w/ 9 levels "Daylight","Dark, not lighted",..: 2 2 2 2 2 1 1 1 1 3 ...
$ WEATHER1 : Factor w/ 14 levels "No additional atmospheric condition",..: 2 10 2 2 10 2 2 2 2 2 ...
$ SCH_BUS : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
The dataset is believed to be tidy, following the three principles of dataset tidy:
Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
Creating a new column CHECK_IF_DEAD to mark any persons who died from the crash.
accidents_final <- accidents_final %>% mutate(CHECK_IF_DEAD = ifelse(DEATH_YR == "8888", "Not Dead", "Dead"))
head(accidents_final$CHECK_IF_DEAD)
[1] "Dead" "Dead" "Dead" "Not Dead" "Dead" "Dead"
First, we will utilise colSums() function to scan the total missing values in each column.
colSums(is.na(accidents_final))
STATE ST_CASE VE_TOTAL VEH_NO MAKE
0 0 0 0 7254
PERSONS AGE SEX DEATH_YR DAY_WEEK
0 1885 217 476 0
CRASH_DAY CRASH_TIME NHS RUR_URB FUNC_SYS
0 0 0 0 0
ROUTE TWAY_ID LATITUDE LONGITUD SP_JUR
0 0 0 0 0
HARM_EV MAN_COLL TYP_INT WRK_ZONE REL_ROAD
0 0 0 0 0
LGT_COND WEATHER1 SCH_BUS CHECK_IF_DEAD
0 0 0 476
Our dataset has NA values in both numeric and categorical variables. Even though the size of our data is huge and we can choose to remove NA, it is not a recommended strategy which may end up with biased dataset. The Hmisc package is applied in our dataset to impute missing values with mean and mode for numeric and categorical variables respectively.
#MAKE:
accidents_final$MAKE <- impute(accidents_final$MAKE, fun= mode)
#SEX:
accidents_final$SEX <- impute(accidents_final$SEX, fun= mode)
#DEATH_YR
accidents_final$DEATH_YR <- impute(accidents_final$DEATH_YR, fun= mode)
#CHECK_IF_DEAD
accidents_final$CHECK_IF_DEAD <- impute(accidents_final$CHECK_IF_DEAD, fun= mode)
#AGE
mean_AGE <- round(mean(accidents_final$AGE, na.rm = TRUE),0)
accidents_final$AGE <- impute(accidents_final$AGE, fun = mean_AGE)
colSums(is.na(accidents_final))
STATE ST_CASE VE_TOTAL VEH_NO MAKE
0 0 0 0 0
PERSONS AGE SEX DEATH_YR DAY_WEEK
0 0 0 0 0
CRASH_DAY CRASH_TIME NHS RUR_URB FUNC_SYS
0 0 0 0 0
ROUTE TWAY_ID LATITUDE LONGITUD SP_JUR
0 0 0 0 0
HARM_EV MAN_COLL TYP_INT WRK_ZONE REL_ROAD
0 0 0 0 0
LGT_COND WEATHER1 SCH_BUS CHECK_IF_DEAD
0 0 0 0
We will leave out STATE, LATITUDE, LONGITUDE and ROUTE. Scanning outliers will be performed on key variables (“Vehicles cases”, “Vehicles numbers” and “Person”) because of the page limitation.
All numeric variables have outliers, Capping method (replacing the outliers with the nearest neighbours that are not outliers) will be applied.
accidents_final$VE_TOTAL %>% boxplot(main="Box Plot of Vehicles Involved in Traffic Accidents", ylab="Number of Vehicles", col = "grey")
accidents_final$VEH_NO %>% boxplot(main="Box Plot of Vehicles Involved in each case", ylab="Number of Vehicles", col = "grey")
accidents_final$PERSONS %>% boxplot(main="Box Plot of persons involved in Traffic Accidents", ylab="Number of persons", col = "grey")
accidents_final$AGE <- as.numeric(accidents_final$AGE)
accidents_final$AGE %>% boxplot(main="Box Plot of person age involved in Traffic Accidents", ylab="Age", col = "grey")
z.scores_VE <- accidents_final$VE_TOTAL %>% scores(type = "z")
z.scores_VE %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.42505 -0.42505 0.01305 0.00000 0.01305 24.54652
z.scores_VE_NO <- accidents_final$VEH_NO %>% scores(type = "z")
z.scores_VE_NO %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.9587 -0.2695 -0.2695 0.0000 0.4198 39.0199
z.scores_PERSONS <- accidents_final$PERSONS %>% scores(type = "z")
z.scores_PERSONS %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.7135 -0.3309 -0.1397 0.0000 0.0516 17.0741
z.scores_AGE <- accidents_final$AGE %>% scores(type = "z")
z.scores_AGE %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.9473 -0.8022 -0.1052 0.0000 0.7412 3.7285
length(which(abs(z.scores_VE) >3 ))
[1] 496
length(which(abs(z.scores_VE_NO) >3 ))
[1] 566
length(which(abs(z.scores_PERSONS) >3 ))
[1] 772
length(which(abs(z.scores_AGE) >3 ))
[1] 5
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
accident_capped_ve_total <- accidents_final$VE_TOTAL %>% cap()
accident_capped_ve_no <- accidents_final$VEH_NO %>% cap()
accident_capped_person <- accidents_final$PERSONS %>% cap()
accident_capped_age <- accidents_final$AGE %>% cap()
accident_capped_ve_total %>% boxplot(main="Box Plot of Vehicles Involved in Traffic Accidents", ylab="Number of Vehicles", col = "grey")
accident_capped_ve_no %>% boxplot(main="Box Plot of Vehicles Involved in each case", ylab="Number of Vehicles", col = "grey")
accident_capped_person <- accidents_final$PERSONS %>% cap() %>% boxplot(main="Box Plot of persons involved in Traffic Accidents", ylab="Number of persons", col = "grey")
accident_capped_age %>% boxplot(main="Box Plot of person age involved in Traffic Accidents", ylab="Age", col = "grey")
par(mfrow=c(1,2))
hist(accidents_final$PERSONS,main="Before Transformation", xlab = "Number of persons")
log_person <- log1p(accidents_final$PERSONS)
hist(log_person,main="After Transformation", xlab = "Number of persons")
Using Histogram to check the distribution of PERSONS variable, we see that it is right skewed. To tackle right skewness, the log () transformation is applied which compresses high values and spreads low values by expressing the values as orders of magnitude.
Since log cannot be used with 0 while it frequently shows up in our variable, we use log1p() which adds a constant to successfully handle our right skewed distributed variable.
Applying log1p() to the PERSONS variable helps to sinificantly reduce the skewness, making it likely to be normally distributed.