MATH2349 Semester 1, 2019

Required packages

Following packagins are required for completion of this report.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(stringr)
library(ggplot2)

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

library(knitr)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(outliers)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(forecast)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Registered S3 methods overwritten by 'forecast':
##   method             from    
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff

Executive Summary

This report aims to demonstrate data-preprocessing steps using R on an open source data: UK Car Accidents 2005 - 2015 from the UK Department of Transport. Our tentative goal of pre-processing data is to obtain a clean, tidy table contains information about whether police officer attended accident scene, the age of driver, the number of vehicle involved and casualties and engine capacity for the year of 2015.

The data we are using for our investigation - car accidents, were stored in two separate csv files: Accident0515 and Vehicles0515. Our work was based on the merged table with selected variables and year. Pre-processing steps are as follows:

Examine table structure and re-specify inccorectly read variable class.
Separate year from ‘Date’ and subset the table with the year in interest.
Create a new column ‘Cumulative Age’ to show the ranking of age of the driver for each accident.
Scan the table for missing values and inconsistencies, recode missing value for R to understand and impute missing values where the proportion is large (greater than 5%).
Scan the table for outliers using boxplot and identify any outliers not showing in the plot, explain the existence of outliers for each variable and treat with appropriate method.
Prepare numeric variables for further analysis with an appropriate transformation, normalise continuous variables.

Data

Data sourced from https://www.kaggle.com/silicon99/dft-accident-data. Accidents file recorded accident details of car accidents happended from 2005 to 2015 in UK，including Accident_Index (uniquely identify each accident), accident location longitude and latitude, accident date and time, accident severity and junction details. Vehicles file recorded involved vehicle details including Accidente_Index, vehicle type, vehicle manoeuvre, sex and age of the driver, etc. Accidents and Vehicle are joined on Accident_Index, through inspection these two relations have the same number of observations so use inner join to have all common rows retained.

accidents<-read_csv("Accidents0515.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Accident_Index = col_character(),
##   Date = col_character(),
##   Time = col_time(format = ""),
##   `Local_Authority_(Highway)` = col_character(),
##   LSOA_of_Accident_Location = col_character()
## )

## See spec(...) for full column specifications.

vehicles<-read_csv("Vehicles0515.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Accident_Index = col_character()
## )
## See spec(...) for full column specifications.

## Warning: 257845 parsing failures.
##     row col   expected     actual               file
## 3004426  -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004427  -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004428  -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004429  -- 22 columns 23 columns 'Vehicles0515.csv'
## 3004430  -- 22 columns 23 columns 'Vehicles0515.csv'
## ....... ... .......... .......... ..................
## See problems(...) for more details.

merged_table<-inner_join(accidents,vehicles,by = "Accident_Index")
selected_table<-merged_table %>% select(Accident_Index, Number_of_Vehicles,Number_of_Casualties, Date,Did_Police_Officer_Attend_Scene_of_Accident,Age_of_Driver,'Engine_Capacity_(CC)')

Understand

Use str() to examine our table in interest: selected_table.

Through examination notice that Date and Did_Police_Officer_Attend_Scene_of_Accident were read incorrectly, so need to identify Date and specify Did_Police_Officer_Attend_Scene_of_Accident as a factor with understandable labels.

str(selected_table)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3262270 obs. of  7 variables:
##  $ Accident_Index                             : chr  "200501BS00001" "200501BS00002" "200501BS00003" "200501BS00003" ...
##  $ Number_of_Vehicles                         : num  1 1 2 2 1 1 2 2 2 2 ...
##  $ Number_of_Casualties                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date                                       : chr  "04/01/2005" "05/01/2005" "06/01/2005" "06/01/2005" ...
##  $ Did_Police_Officer_Attend_Scene_of_Accident: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Age_of_Driver                              : num  74 42 35 62 49 49 51 30 31 41 ...
##  $ Engine_Capacity_(CC)                       : num  -1 8268 8300 1762 1769 ...

selected_table$Date<-dmy(selected_table$Date)

selected_table$Did_Police_Officer_Attend_Scene_of_Accident<-selected_table$Did_Police_Officer_Attend_Scene_of_Accident %>% 
  factor(levels = c(-1,1,2,3),
         labels = c("data_missing","Yes","No","Self_rep"))
selected_table$Did_Police_Officer_Attend_Scene_of_Accident %>% class()

## [1] "factor"

selected_table$Did_Police_Officer_Attend_Scene_of_Accident%>% levels()

## [1] "data_missing" "Yes"          "No"           "Self_rep"

Tidy & Manipulate Data I

Selected_table is already in a tidy format since the following three conditions are met. 1. Each variable has its own column 2. each observation has its own row and 3.each value has its own cell. So we do not have to tidy it up. However, the data table contains data from year 2005 to 2015 and we are only interested in the year 2015, so we will need to further subset the relation that contains information for the year we need. We will start by extracting the year ‘Date’ variable by using separate () and then filter out data for the year 2015.

selected_table<-selected_table %>% separate('Date', c("year","month","day"),sep = "-")
data_2015<-selected_table %>% filter(year=='2015')
data_2015 %>% head()

Tidy & Manipulate Data II

We created a new column shows the cumulative distribution of Age of driver. This will give an idea of what is the ranking for the age of the driver in an accident.

data_2015<-mutate(data_2015, "cumulative_age" = cume_dist(data_2015$Age_of_Driver))
data_2015 %>% head()

Scan I

Scan the data for missing values, inconsistencies and obvious errors.

Through quick inspection, no NA is found in the data. However the data dictionary provides a guidance of coding for missing values: when a value for a column is unknonw it has been recorded as “-1”, so we start by recoding “-1” to NA and use colSums() to check what columns contain missing values.

data_2015$Number_of_Vehicles[which(data_2015$Number_of_Vehicles== "-1")]<-NA
data_2015$Number_of_Casualties[which(data_2015$Number_of_Casualties== "-1")]<-NA
data_2015$Did_Police_Officer_Attend_Scene_of_Accident[which(data_2015$Did_Police_Officer_Attend_Scene_of_Accident== "-1")]<-NA
data_2015$Age_of_Driver[which(data_2015$Age_of_Driver== "-1")]<-NA
data_2015$`Engine_Capacity_(CC)`[which(data_2015$`Engine_Capacity_(CC)`== "-1")]<-NA
data_2015$cumulative_age[which(data_2015$cumulative_age== "-1")]<-NA

colSums(is.na(data_2015))

##                              Accident_Index 
##                                           0 
##                          Number_of_Vehicles 
##                                           0 
##                        Number_of_Casualties 
##                                           0 
##                                        year 
##                                           0 
##                                       month 
##                                           0 
##                                         day 
##                                           0 
## Did_Police_Officer_Attend_Scene_of_Accident 
##                                           0 
##                               Age_of_Driver 
##                                       29828 
##                        Engine_Capacity_(CC) 
##                                       72289 
##                              cumulative_age 
##                                           0

‘Age_of_Driver’ and ‘Engine_Capacity_(CC)’ have sufficiently high number of missing value (more than 5%), this indicates missing values should not be ignored or simply removed. We will replace them by mean. Use colSums() to examine the missing valuea again.

data_2015$Age_of_Driver<-impute(data_2015$Age_of_Driver, fun = mean) 

data_2015 %>%
  summarise(Min = min(data_2015$`Engine_Capacity_(CC)`,na.rm = TRUE),
          Q1 = quantile(data_2015$`Engine_Capacity_(CC)`,probs = .25,na.rm = TRUE),
          Median = median(data_2015$`Engine_Capacity_(CC)`, na.rm = TRUE),
          Q3 = quantile(data_2015$`Engine_Capacity_(CC)`,probs = .75,na.rm = TRUE),
          Max = max(data_2015$`Engine_Capacity_(CC)`,na.rm = TRUE),
          Mean = mean(data_2015$`Engine_Capacity_(CC)`, na.rm = TRUE),
          SD = sd(data_2015$`Engine_Capacity_(CC)`, na.rm = TRUE),
          n = n(),
          Missing = sum(is.na(data_2015$`Engine_Capacity_(CC)`)))

data_2015$`Engine_Capacity_(CC)`<-impute(data_2015$`Engine_Capacity_(CC)`, fun = mean) 

colSums(is.na(data_2015))

##                              Accident_Index 
##                                           0 
##                          Number_of_Vehicles 
##                                           0 
##                        Number_of_Casualties 
##                                           0 
##                                        year 
##                                           0 
##                                       month 
##                                           0 
##                                         day 
##                                           0 
## Did_Police_Officer_Attend_Scene_of_Accident 
##                                           0 
##                               Age_of_Driver 
##                                           0 
##                        Engine_Capacity_(CC) 
##                                           0 
##                              cumulative_age 
##                                           0

clean_data_2015<-data_2015

Similarly scan the data for special values with function is.nan() and is.infinite(). Result shows no inconsistencies found in the data.

sum(is.nan(clean_data_2015$Number_of_Vehicles))

## [1] 0

sum(is.nan(clean_data_2015$Number_of_Casualties))

## [1] 0

sum(is.nan(clean_data_2015$Age_of_Driver))

## [1] 0

sum(is.nan(clean_data_2015$`Engine_Capacity_(CC)`))

## [1] 0

sum(is.nan(clean_data_2015$cumulative_age))

## [1] 0

sum(is.infinite(clean_data_2015$Number_of_Vehicles))

## [1] 0

sum(is.infinite(clean_data_2015$Number_of_Casualties))

## [1] 0

sum(is.infinite(clean_data_2015$Age_of_Driver))

## [1] 0

sum(is.infinite(clean_data_2015$`Engine_Capacity_(CC)`))

## [1] 0

sum(is.infinite(clean_data_2015$cumulative_age))

## [1] 0

Scan II

Scan the numeric variables for outliers. First variable to be inspected is ‘Number_of_Vehicle’.

boxplot(clean_data_2015$Number_of_Vehicles)

clean_data_2015 %>% filter(Number_of_Vehicles==3) %>% count()

clean_data_2015 %>% filter(Number_of_Vehicles>4) %>% count()

Based on the boxplot, it looks like other the box (in this case it’s a line) captures 2 which means the number of vehicle involved other than 2 vehicles are outside the outlier fences.However, a single dot for 3 vehicles has 33192 observations which is a significant amount of the variable. Although with 4 vehicles there was only 4563 instances. Considering the high proportion of outliers, capping would be more suitable to deal with outliers in this case.

Now we look at ‘Number_of_Casualties’.

boxplot(clean_data_2015$Number_of_Casualties)

From the outlier detetion function above, we can see that number of casualties larger than 4 can be treated as outliers. Outside of the quantile range could be treated as outliers and similarly capping is suitable.

Next is ‘Age_of_Driver’.

clean_data_2015$Age_of_Driver<-as.numeric(clean_data_2015$Age_of_Driver)
boxplot(clean_data_2015$Age_of_Driver)

clean_data_2015 %>% filter(Age_of_Driver<16) %>% count()

From the boxplot, an age of over 80 is considered an outlier and any age between 0-80 is appropriate for analysis. However this is conterintuitive as our data contains records of driver at age of 1, it is more likely these ages were recorded by entry error. We did some research on the problem: According to the UK driving law, drivers need to be age over 16 and there is no upper age limit(https://en.wikipedia.org/wiki/Driving_licence_in_the_United_Kingdom#Current_categories). When driver reach the age of 70, they can obtain permit from the government and still drive on the road(https://www.informationnow.org.uk/article/driving-as-you-get-older/).

With age below 16 we decided to treat them as outliers even though they were not detected from the boxplot. These records will be removed from further processing.

And use same detection method for ‘Engine_Capacity’.

clean_data_2015$`Engine_Capacity_(CC)`<-as.numeric(clean_data_2015$`Engine_Capacity_(CC)`)
boxplot(clean_data_2015$`Engine_Capacity_(CC)`)

clean_data_2015 %>% filter(`Engine_Capacity_(CC)`>5000|`Engine_Capacity_(CC)`<100) %>% count()

Looking at the graph in combination of the output of the summarise statistic of Engine Capacity. Engine capacity that is outside of the quartile range is treated as outliers.

Last we will also consider the new variable ‘Cumulative_Age’ created via mutate().

boxplot(clean_data_2015$cumulative_age)

This makes sense because cumulative distribution is the probability distribution of the probability of a driver is younger or equally aged at an age and works as a ranking indicator. So there’s no outlier in this variable.

In summary, our decision is to first delete the observations with driver age under 15 and then use capping method to get rid of the detected outliers.

age_over_16<- clean_data_2015 %>% filter(Age_of_Driver>=16)
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

##Get a new dataset and use capping method in there
age_over_16->data_capped
data_capped$Number_of_Vehicles<-data_capped$Number_of_Vehicles %>% cap()
data_capped$Number_of_Casualties<-data_capped$Number_of_Casualties %>% cap()
data_capped$Age_of_Driver<-data_capped$Age_of_Driver %>% cap()
data_capped$`Engine_Capacity_(CC)`<-data_capped$`Engine_Capacity_(CC)` %>% cap()

summary(data_capped)

##  Accident_Index     Number_of_Vehicles Number_of_Casualties
##  Length:255815      Min.   :1.00       Min.   :1.000       
##  Class :character   1st Qu.:2.00       1st Qu.:1.000       
##  Mode  :character   Median :2.00       Median :1.000       
##                     Mean   :2.21       Mean   :1.347       
##                     3rd Qu.:2.00       3rd Qu.:2.000       
##                     Max.   :4.00       Max.   :3.000       
##      year              month               day           
##  Length:255815      Length:255815      Length:255815     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  Did_Police_Officer_Attend_Scene_of_Accident Age_of_Driver  
##  data_missing:    16                         Min.   :16.00  
##  Yes         :207613                         1st Qu.:28.00  
##  No          : 47525                         Median :40.17  
##  Self_rep    :   661                         Mean   :40.19  
##                                              3rd Qu.:49.00  
##                                              Max.   :80.00  
##  Engine_Capacity_(CC) cumulative_age  
##  Min.   : 636         Min.   :0.1157  
##  1st Qu.:1391         1st Qu.:0.2603  
##  Median :1896         Median :0.5144  
##  Mean   :1736         Mean   :0.5171  
##  3rd Qu.:1896         3rd Qu.:0.7546  
##  Max.   :2987         Max.   :1.0000

#Inspect numeric variables after capping
data_capped$Age_of_Driver <- as.numeric(data_capped$Age_of_Driver)
boxplot(data_capped$Number_of_Vehicles)

boxplot(data_capped$Number_of_Casualties)

boxplot(data_capped$Age_of_Driver)

boxplot(data_capped$`Engine_Capacity_(CC)`)

Transform

First inspect all numeric variables by plotting histogram.

hist(data_capped$Number_of_Vehicles)

hist(data_capped$Number_of_Casualties)

hist(data_capped$Age_of_Driver)

hist(data_capped$`Engine_Capacity_(CC)`)

Notice all variables are not normally distributed even through with such large quantity of observations. Because number of vehicles, number of casualties are both discrete and have a very small range. It does not make sense if we choose to standardise these variables. Instead we will focus on ‘Age_of_Driver’ and ‘Engine_Capacity_(CC)’ which have potential for further processing and analysis.

After endeavours were made to transform the data with several methods, BoxCox was tested to work best with this dataset as skewness can be significantly corrected.

BoxCox(data_capped$Age_of_Driver,lambda = "auto") %>% hist()

BoxCox(data_capped$`Engine_Capacity_(CC)`, lambda = "auto") %>% hist()