Following packagins are required for completion of this report.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(stringr)
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(knitr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(outliers)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(forecast)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
This report aims to demonstrate data-preprocessing steps using R on an open source data: UK Car Accidents 2005 - 2015 from the UK Department of Transport. Our tentative goal of pre-processing data is to obtain a clean, tidy table contains information about whether police officer attended accident scene, the age of driver, the number of vehicle involved and casualties and engine capacity for the year of 2015.
The data we are using for our investigation - car accidents, were stored in two separate csv files: Accident0515 and Vehicles0515. Our work was based on the merged table with selected variables and year. Pre-processing steps are as follows:
Data sourced from https://www.kaggle.com/silicon99/dft-accident-data. Accidents file recorded accident details of car accidents happended from 2005 to 2015 in UK,including Accident_Index (uniquely identify each accident), accident location longitude and latitude, accident date and time, accident severity and junction details. Vehicles file recorded involved vehicle details including Accidente_Index, vehicle type, vehicle manoeuvre, sex and age of the driver, etc. Accidents and Vehicle are joined on Accident_Index, through inspection these two relations have the same number of observations so use inner join to have all common rows retained.
accidents<-read_csv("Accidents0515.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Accident_Index = col_character(),
## Date = col_character(),
## Time = col_time(format = ""),
## `Local_Authority_(Highway)` = col_character(),
## LSOA_of_Accident_Location = col_character()
## )
## See spec(...) for full column specifications.
vehicles<-read_csv("Vehicles0515.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Accident_Index = col_character()
## )
## See spec(...) for full column specifications.
## Warning: 257845 parsing failures.
## row col expected actual file
## 3004426 -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004427 -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004428 -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004429 -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004430 -- 22 columns 23 columns 'Vehicles0515.csv'
## ....... ... .......... .......... ..................
## See problems(...) for more details.
merged_table<-inner_join(accidents,vehicles,by = "Accident_Index")
selected_table<-merged_table %>% select(Accident_Index, Number_of_Vehicles,Number_of_Casualties, Date,Did_Police_Officer_Attend_Scene_of_Accident,Age_of_Driver,'Engine_Capacity_(CC)')
Use str() to examine our table in interest: selected_table.
Through examination notice that Date and Did_Police_Officer_Attend_Scene_of_Accident were read incorrectly, so need to identify Date and specify Did_Police_Officer_Attend_Scene_of_Accident as a factor with understandable labels.
str(selected_table)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3262270 obs. of 7 variables:
## $ Accident_Index : chr "200501BS00001" "200501BS00002" "200501BS00003" "200501BS00003" ...
## $ Number_of_Vehicles : num 1 1 2 2 1 1 2 2 2 2 ...
## $ Number_of_Casualties : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005" ...
## $ Did_Police_Officer_Attend_Scene_of_Accident: num 1 1 1 1 1 1 1 1 1 1 ...
## $ Age_of_Driver : num 74 42 35 62 49 49 51 30 31 41 ...
## $ Engine_Capacity_(CC) : num -1 8268 8300 1762 1769 ...
selected_table$Date<-dmy(selected_table$Date)
selected_table$Did_Police_Officer_Attend_Scene_of_Accident<-selected_table$Did_Police_Officer_Attend_Scene_of_Accident %>%
factor(levels = c(-1,1,2,3),
labels = c("data_missing","Yes","No","Self_rep"))
selected_table$Did_Police_Officer_Attend_Scene_of_Accident %>% class()
## [1] "factor"
selected_table$Did_Police_Officer_Attend_Scene_of_Accident%>% levels()
## [1] "data_missing" "Yes" "No" "Self_rep"
Selected_table is already in a tidy format since the following three conditions are met. 1. Each variable has its own column 2. each observation has its own row and 3.each value has its own cell. So we do not have to tidy it up. However, the data table contains data from year 2005 to 2015 and we are only interested in the year 2015, so we will need to further subset the relation that contains information for the year we need. We will start by extracting the year ‘Date’ variable by using separate () and then filter out data for the year 2015.
selected_table<-selected_table %>% separate('Date', c("year","month","day"),sep = "-")
data_2015<-selected_table %>% filter(year=='2015')
data_2015 %>% head()
We created a new column shows the cumulative distribution of Age of driver. This will give an idea of what is the ranking for the age of the driver in an accident.
data_2015<-mutate(data_2015, "cumulative_age" = cume_dist(data_2015$Age_of_Driver))
data_2015 %>% head()
Scan the data for missing values, inconsistencies and obvious errors.
Through quick inspection, no NA is found in the data. However the data dictionary provides a guidance of coding for missing values: when a value for a column is unknonw it has been recorded as “-1”, so we start by recoding “-1” to NA and use colSums() to check what columns contain missing values.
data_2015$Number_of_Vehicles[which(data_2015$Number_of_Vehicles== "-1")]<-NA
data_2015$Number_of_Casualties[which(data_2015$Number_of_Casualties== "-1")]<-NA
data_2015$Did_Police_Officer_Attend_Scene_of_Accident[which(data_2015$Did_Police_Officer_Attend_Scene_of_Accident== "-1")]<-NA
data_2015$Age_of_Driver[which(data_2015$Age_of_Driver== "-1")]<-NA
data_2015$`Engine_Capacity_(CC)`[which(data_2015$`Engine_Capacity_(CC)`== "-1")]<-NA
data_2015$cumulative_age[which(data_2015$cumulative_age== "-1")]<-NA
colSums(is.na(data_2015))
## Accident_Index
## 0
## Number_of_Vehicles
## 0
## Number_of_Casualties
## 0
## year
## 0
## month
## 0
## day
## 0
## Did_Police_Officer_Attend_Scene_of_Accident
## 0
## Age_of_Driver
## 29828
## Engine_Capacity_(CC)
## 72289
## cumulative_age
## 0
‘Age_of_Driver’ and ‘Engine_Capacity_(CC)’ have sufficiently high number of missing value (more than 5%), this indicates missing values should not be ignored or simply removed. We will replace them by mean. Use colSums() to examine the missing valuea again.
data_2015$Age_of_Driver<-impute(data_2015$Age_of_Driver, fun = mean)
data_2015 %>%
summarise(Min = min(data_2015$`Engine_Capacity_(CC)`,na.rm = TRUE),
Q1 = quantile(data_2015$`Engine_Capacity_(CC)`,probs = .25,na.rm = TRUE),
Median = median(data_2015$`Engine_Capacity_(CC)`, na.rm = TRUE),
Q3 = quantile(data_2015$`Engine_Capacity_(CC)`,probs = .75,na.rm = TRUE),
Max = max(data_2015$`Engine_Capacity_(CC)`,na.rm = TRUE),
Mean = mean(data_2015$`Engine_Capacity_(CC)`, na.rm = TRUE),
SD = sd(data_2015$`Engine_Capacity_(CC)`, na.rm = TRUE),
n = n(),
Missing = sum(is.na(data_2015$`Engine_Capacity_(CC)`)))
data_2015$`Engine_Capacity_(CC)`<-impute(data_2015$`Engine_Capacity_(CC)`, fun = mean)
colSums(is.na(data_2015))
## Accident_Index
## 0
## Number_of_Vehicles
## 0
## Number_of_Casualties
## 0
## year
## 0
## month
## 0
## day
## 0
## Did_Police_Officer_Attend_Scene_of_Accident
## 0
## Age_of_Driver
## 0
## Engine_Capacity_(CC)
## 0
## cumulative_age
## 0
clean_data_2015<-data_2015
Similarly scan the data for special values with function is.nan() and is.infinite(). Result shows no inconsistencies found in the data.
sum(is.nan(clean_data_2015$Number_of_Vehicles))
## [1] 0
sum(is.nan(clean_data_2015$Number_of_Casualties))
## [1] 0
sum(is.nan(clean_data_2015$Age_of_Driver))
## [1] 0
sum(is.nan(clean_data_2015$`Engine_Capacity_(CC)`))
## [1] 0
sum(is.nan(clean_data_2015$cumulative_age))
## [1] 0
sum(is.infinite(clean_data_2015$Number_of_Vehicles))
## [1] 0
sum(is.infinite(clean_data_2015$Number_of_Casualties))
## [1] 0
sum(is.infinite(clean_data_2015$Age_of_Driver))
## [1] 0
sum(is.infinite(clean_data_2015$`Engine_Capacity_(CC)`))
## [1] 0
sum(is.infinite(clean_data_2015$cumulative_age))
## [1] 0
Scan the numeric variables for outliers. First variable to be inspected is ‘Number_of_Vehicle’.
boxplot(clean_data_2015$Number_of_Vehicles)
clean_data_2015 %>% filter(Number_of_Vehicles==3) %>% count()
clean_data_2015 %>% filter(Number_of_Vehicles>4) %>% count()
Based on the boxplot, it looks like other the box (in this case it’s a line) captures 2 which means the number of vehicle involved other than 2 vehicles are outside the outlier fences.However, a single dot for 3 vehicles has 33192 observations which is a significant amount of the variable. Although with 4 vehicles there was only 4563 instances. Considering the high proportion of outliers, capping would be more suitable to deal with outliers in this case.
Now we look at ‘Number_of_Casualties’.
boxplot(clean_data_2015$Number_of_Casualties)
From the outlier detetion function above, we can see that number of casualties larger than 4 can be treated as outliers. Outside of the quantile range could be treated as outliers and similarly capping is suitable.
Next is ‘Age_of_Driver’.
clean_data_2015$Age_of_Driver<-as.numeric(clean_data_2015$Age_of_Driver)
boxplot(clean_data_2015$Age_of_Driver)
clean_data_2015 %>% filter(Age_of_Driver<16) %>% count()
From the boxplot, an age of over 80 is considered an outlier and any age between 0-80 is appropriate for analysis. However this is conterintuitive as our data contains records of driver at age of 1, it is more likely these ages were recorded by entry error. We did some research on the problem: According to the UK driving law, drivers need to be age over 16 and there is no upper age limit(https://en.wikipedia.org/wiki/Driving_licence_in_the_United_Kingdom#Current_categories). When driver reach the age of 70, they can obtain permit from the government and still drive on the road(https://www.informationnow.org.uk/article/driving-as-you-get-older/).
With age below 16 we decided to treat them as outliers even though they were not detected from the boxplot. These records will be removed from further processing.
And use same detection method for ‘Engine_Capacity’.
clean_data_2015$`Engine_Capacity_(CC)`<-as.numeric(clean_data_2015$`Engine_Capacity_(CC)`)
boxplot(clean_data_2015$`Engine_Capacity_(CC)`)
clean_data_2015 %>% filter(`Engine_Capacity_(CC)`>5000|`Engine_Capacity_(CC)`<100) %>% count()
Looking at the graph in combination of the output of the summarise statistic of Engine Capacity. Engine capacity that is outside of the quartile range is treated as outliers.
Last we will also consider the new variable ‘Cumulative_Age’ created via mutate().
boxplot(clean_data_2015$cumulative_age)
This makes sense because cumulative distribution is the probability distribution of the probability of a driver is younger or equally aged at an age and works as a ranking indicator. So there’s no outlier in this variable.
In summary, our decision is to first delete the observations with driver age under 15 and then use capping method to get rid of the detected outliers.
age_over_16<- clean_data_2015 %>% filter(Age_of_Driver>=16)
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
##Get a new dataset and use capping method in there
age_over_16->data_capped
data_capped$Number_of_Vehicles<-data_capped$Number_of_Vehicles %>% cap()
data_capped$Number_of_Casualties<-data_capped$Number_of_Casualties %>% cap()
data_capped$Age_of_Driver<-data_capped$Age_of_Driver %>% cap()
data_capped$`Engine_Capacity_(CC)`<-data_capped$`Engine_Capacity_(CC)` %>% cap()
summary(data_capped)
## Accident_Index Number_of_Vehicles Number_of_Casualties
## Length:255815 Min. :1.00 Min. :1.000
## Class :character 1st Qu.:2.00 1st Qu.:1.000
## Mode :character Median :2.00 Median :1.000
## Mean :2.21 Mean :1.347
## 3rd Qu.:2.00 3rd Qu.:2.000
## Max. :4.00 Max. :3.000
## year month day
## Length:255815 Length:255815 Length:255815
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Did_Police_Officer_Attend_Scene_of_Accident Age_of_Driver
## data_missing: 16 Min. :16.00
## Yes :207613 1st Qu.:28.00
## No : 47525 Median :40.17
## Self_rep : 661 Mean :40.19
## 3rd Qu.:49.00
## Max. :80.00
## Engine_Capacity_(CC) cumulative_age
## Min. : 636 Min. :0.1157
## 1st Qu.:1391 1st Qu.:0.2603
## Median :1896 Median :0.5144
## Mean :1736 Mean :0.5171
## 3rd Qu.:1896 3rd Qu.:0.7546
## Max. :2987 Max. :1.0000
#Inspect numeric variables after capping
data_capped$Age_of_Driver <- as.numeric(data_capped$Age_of_Driver)
boxplot(data_capped$Number_of_Vehicles)
boxplot(data_capped$Number_of_Casualties)
boxplot(data_capped$Age_of_Driver)
boxplot(data_capped$`Engine_Capacity_(CC)`)
First inspect all numeric variables by plotting histogram.
hist(data_capped$Number_of_Vehicles)
hist(data_capped$Number_of_Casualties)
hist(data_capped$Age_of_Driver)
hist(data_capped$`Engine_Capacity_(CC)`)
Notice all variables are not normally distributed even through with such large quantity of observations. Because number of vehicles, number of casualties are both discrete and have a very small range. It does not make sense if we choose to standardise these variables. Instead we will focus on ‘Age_of_Driver’ and ‘Engine_Capacity_(CC)’ which have potential for further processing and analysis.
After endeavours were made to transform the data with several methods, BoxCox was tested to work best with this dataset as skewness can be significantly corrected.
BoxCox(data_capped$Age_of_Driver,lambda = "auto") %>% hist()
BoxCox(data_capped$`Engine_Capacity_(CC)`, lambda = "auto") %>% hist()