Required packages

knitr::opts_chunk$set(echo = TRUE, warnings=FALSE, message=FALSE)
library(readr)     # Used for reading and importing csv files
library(dplyr)     # For data wrangling including joins, gsub,

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)  # Also for pipes
library(tidyr)     # Used for the separate function

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(glue)      # For transmuting using the glue function

## 
## Attaching package: 'glue'

## The following object is masked from 'package:dplyr':
## 
##     collapse

library(outliers)  # For using scores function when dealing with outliers
library(here)      # To help import csv files

## here() starts at C:/Users/heinz/Documents/Uni/T3 Data Wrangling/Assessment 3

Data

We are about to explore a 10 year history of UK vehicular accidents and casualties.

References Source: Fedit, B 2018, UK Accidents 10 years history with many variables, viewed 6 October 2021, https://www.kaggle.com/benoit72/uk-accidents-10-years-history-with-many-variables

Loom link: https://www.loom.com/share/96ff2bd38b114d5ea45cd693941236d9

Step 1: Get two datasets

# Read and create accidents df
accidents <- read.csv(here("Accidents0514.csv"))
head(accidents)

# Read and create casualties df 
casualties <- read.csv(here("Casualties0514.csv"))
head(casualties)

Explanation for Imported csv files

Firstly, I imported the accidents data using the read.csv function and saved this to accidents. Then I did the same for the casualties data and saved this to casualties. *I called the head of each new df to show the first few rows.

Variables Explanation: * There are 46 different variables in total, so rather than explain each one, I’ll explain the ones I want to keep and subset for the purpose of this assignment in Step 3.

Step 1 cont’d: Inspect the data

names(accidents)  # Returns the column/variable names of the accidents df

##  [1] "ï..Accident_Index"                          
##  [2] "Location_Easting_OSGR"                      
##  [3] "Location_Northing_OSGR"                     
##  [4] "Longitude"                                  
##  [5] "Latitude"                                   
##  [6] "Police_Force"                               
##  [7] "Accident_Severity"                          
##  [8] "Number_of_Vehicles"                         
##  [9] "Number_of_Casualties"                       
## [10] "Date"                                       
## [11] "Day_of_Week"                                
## [12] "Time"                                       
## [13] "Local_Authority_.District."                 
## [14] "Local_Authority_.Highway."                  
## [15] "X1st_Road_Class"                            
## [16] "X1st_Road_Number"                           
## [17] "Road_Type"                                  
## [18] "Speed_limit"                                
## [19] "Junction_Detail"                            
## [20] "Junction_Control"                           
## [21] "X2nd_Road_Class"                            
## [22] "X2nd_Road_Number"                           
## [23] "Pedestrian_Crossing.Human_Control"          
## [24] "Pedestrian_Crossing.Physical_Facilities"    
## [25] "Light_Conditions"                           
## [26] "Weather_Conditions"                         
## [27] "Road_Surface_Conditions"                    
## [28] "Special_Conditions_at_Site"                 
## [29] "Carriageway_Hazards"                        
## [30] "Urban_or_Rural_Area"                        
## [31] "Did_Police_Officer_Attend_Scene_of_Accident"
## [32] "LSOA_of_Accident_Location"

names(casualties) # Returns the column/variable names of the casualties df

##  [1] "ï..Accident_Index"                  "Vehicle_Reference"                 
##  [3] "Casualty_Reference"                 "Casualty_Class"                    
##  [5] "Sex_of_Casualty"                    "Age_of_Casualty"                   
##  [7] "Age_Band_of_Casualty"               "Casualty_Severity"                 
##  [9] "Pedestrian_Location"                "Pedestrian_Movement"               
## [11] "Car_Passenger"                      "Bus_or_Coach_Passenger"            
## [13] "Pedestrian_Road_Maintenance_Worker" "Casualty_Type"                     
## [15] "Casualty_Home_Area_Type"

dim(accidents)    # Returns the row and column dimensions of the accidents df

## [1] 1640597      32

dim(casualties)   # Returns the row and column dimensions of the accidents df

## [1] 2216720      15

The accidents df has 1640597 observations and 32 variables
The casualties df has 2216720 observations and 15 variables,

Step 2: Merge the data sets

# In order to merge the data sets I'll first I fix the variable name that I want to merge
accidents %<>%
  rename("Accident_Index" = "ï..Accident_Index")

casualties %<>%
  rename("Accident_Index" = "ï..Accident_Index")

# Using inner join to merge data sets
accidents_comb <- inner_join(accidents, casualties, by="Accident_Index")
head(accidents_comb)

Step 2: Explanation * First I updated the first variable/column of each data set to share the name Accident_Index which will make it easier to join. * Used inner join to retain only the observations in both data sets. This will help to reduce the creation of missing values (which may have occurred with a left join for example). * Using this join and merging the datasets allows me to keep the information in Accidents such as the Date and Accident Severity whilst adding further details on the particular accident from Casualties such as number of casualties, class, and gender.

Understand

Step 3: Inspect the new dataset

Since there are 46 variables, I’m going to subset the merged data to only the variables that I’d prefer to go into more detail (given limited space in assignment).

Inspect the dimensions of the merged df

dim(accidents_comb) # The df has 46 variables

# Subset to the variables I want to keep
accidents_comb_sub <- accidents_comb[,c("Accident_Index", "Date", "Accident_Severity", "Number_of_Casualties",  "Casualty_Class", "Age_of_Casualty", "Sex_of_Casualty")]

# Using the attributes function to inspect elements
attributes(accidents_comb_sub) # Shows names (variables), class(data frame) and row names

Step 3 cont’d

# Inspect the structure of the new data frame
str(accidents_comb_sub)

## 'data.frame':    2216720 obs. of  7 variables:
##  $ Accident_Index      : chr  "200501BS00001" "200501BS00002" "200501BS00003" "200501BS00004" ...
##  $ Date                : chr  "04/01/2005" "05/01/2005" "06/01/2005" "07/01/2005" ...
##  $ Accident_Severity   : int  2 3 3 3 3 3 3 3 3 3 ...
##  $ Number_of_Casualties: int  1 1 1 1 1 1 1 2 2 2 ...
##  $ Casualty_Class      : int  3 2 1 3 1 1 1 3 3 1 ...
##  $ Age_of_Casualty     : int  37 37 62 30 49 30 31 13 13 35 ...
##  $ Sex_of_Casualty     : int  1 1 1 1 1 2 1 2 2 1 ...

Step 3 Explanation: * accidents_comb_sub is a data frame with 2216720 observations/rows and 7 variables/columns * Accident_Index is the Unique identifier and is a character variable. * Date is in dd/mm/yyyy format from 2005 to 2014 and is a character variable. * Accident_Severity is an integer variable: 1=Fatal, 2=Serious, 3=Slight. * Number_of_Casualties is self explanatory is an integer variable. * Casualty_Class is an integer variable: 1=Driver/rider, 2=Passenger, 3=Pedestrian. * Age_of_Casualty is self explanatory is an integer variable * Sex_of_Casualty is an integer variable: 1=Male, 2=Female

We will change the classes of some of these variables below.

Step 4: Data Type Conversions

# Convert Date from Character to Date
accidents_comb_sub$Date %<>% 
  as.Date(format = "%y/%m/%d")

# Convert Accident_Severity to factor variable
accidents_comb_sub$Accident_Severity %<>%
  as.factor()

# Convert Casualty_Class to factor variable
accidents_comb_sub$Casualty_Class %<>%
  as.factor()

# Convert Sex_of_Casualty to factor variable
accidents_comb_sub$Sex_of_Casualty %<>%
  as.factor()

#Use the class function to check each variable is converted.
class(accidents_comb_sub$Date)

## [1] "Date"

class(accidents_comb_sub$Accident_Severity)

## [1] "factor"

class(accidents_comb_sub$Casualty_Class)

## [1] "factor"

class(accidents_comb_sub$Sex_of_Casualty)

## [1] "factor"

Step 4: Explanation * Using the as.Date() function, I converted the Date variable from character to Date format * Using the as.factor() function I converted the Accident_Severity, Casualty_Class and Sex_of_Casualty variables to factor variables.

Step 5: Relabelling/reordering factor variables

# Relabel and reorder Accident_Severity 
accidents_comb_sub$Accident_Severity <- factor(accidents_comb_sub$Accident_Severity, levels = c(3,2,1), labels = c("Slight", "Serious", "Fatal"), ordered = TRUE)

# Relabel Casualty_Class (order not important)
accidents_comb_sub$Casualty_Class <- factor(accidents_comb_sub$Casualty_Class,
                                            levels = c(1,2,3),
                                            labels = c("Driver/rider", "Passenger", "Pedestrian"))

# Relabel Sex_of_Casualty  (order not important)
accidents_comb_sub$Sex_of_Casualty <- factor(accidents_comb_sub$Sex_of_Casualty,
                                             levels = c(1,2),
                                             labels = c("Male", "Female"))

# Now we will make sure our updated variables are factor character variables
as.character.factor(accidents_comb_sub$Accident_Severity) # This is ordered
as.character.factor(accidents_comb_sub$Casualty_Class)
as.character.factor(accidents_comb_sub$Sex_of_Casualty)

# Use structure function to check work above
str(accidents_comb_sub)

## 'data.frame':    2216720 obs. of  7 variables:
##  $ Accident_Index      : chr  "200501BS00001" "200501BS00002" "200501BS00003" "200501BS00004" ...
##  $ Date                : Date, format: "2004-01-20" "2005-01-20" ...
##  $ Accident_Severity   : Ord.factor w/ 3 levels "Slight"<"Serious"<..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ Number_of_Casualties: int  1 1 1 1 1 1 1 2 2 2 ...
##  $ Casualty_Class      : Factor w/ 3 levels "Driver/rider",..: 3 2 1 3 1 1 1 3 3 1 ...
##  $ Age_of_Casualty     : int  37 37 62 30 49 30 31 13 13 35 ...
##  $ Sex_of_Casualty     : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 2 1 2 2 1 ...

Step 5: Explanation * I relabeled & reordered Accident_Severity to Slight<Serious<Fatal to provide more information for the user on the levels regarding accident severity. * I relabeled Casualty_Class (“Driver/rider”, “Passenger”, “Pedestrian”) and Sex_of_Casualty (Male, Female) to provide more information the user.

Tidy & Manipulate Data I

Step 6: Tidy and Manipulate data • Each value must have its own cell. Yes each value has its own cell. • Each observation must have its own row. There are duplicates which we’ll remove, after which each observation will have its own row.
• Each variable must have its own column. Whilst each variable has its own column the date can also be separated into day, month,year. See below.

# Show if there are any duplicated entries
any(duplicated(accidents_comb_sub)) # this shows there are duplicate observations

## [1] TRUE

# Replace the df with only the unique observations
accidents_unique <- unique(accidents_comb_sub)

any(duplicated(accidents_unique))

## [1] FALSE

Step 6: Explanation * I checked if there were any duplicates by wrapping the duplicated() function with the any() function and this return TRUE. * I then removed the duplicates by using the unique() function and replaced my df with a cleaner version of accidents_comb_sub without duplicates.

# Separate the Date into 3 columns.
accidents_unique %<>%
  separate(Date, into = c("Year", "Month", "Day"), sep ="-")
head(accidents_unique)

I used the separate function in tidyr to separate the Date variable into separate Year, Month and Day columns.
This will allow the user greater filtering functionality to function by year for example.

Tidy & Manipulate Data II

Step 7: Create/mutate a new variable.

# Extract first three letters of Sex_of_Casualty and Casualty_Class
gender <- substr(accidents_unique$Sex_of_Casualty, 1, 3)
class <- substr(accidents_unique$Casualty_Class, 1, 3)

# Create new variable called Sex_Class from existing Sex_of_Casualty and Casualty_Class variables
accidents_glue <- accidents_unique %>%
  mutate(Sex_Class = glue("{gender}-{class}"))

# Show the new categories of the new class
head(accidents_glue$Sex_Class)

## Mal-Ped
## Mal-Pas
## Mal-Dri
## Mal-Ped
## Mal-Dri
## Fem-Dri

Step 7 cont’d: Explanation * I wanted to create a new variable by combining the sex of the casualty with the type. * By having the new category Mal-Dri, for example, this might allow for the user to easier identify Male Driver casualty types.

Scan I

Step 8: Scan for missing values

# Scan whole data frame for missing values
colSums(is.na(accidents_glue)) # there are 594 NA values in Sex_of_Casualty.

##       Accident_Index                 Year                Month 
##                    0                    0                    0 
##                  Day    Accident_Severity Number_of_Casualties 
##                    0                    0                    0 
##       Casualty_Class      Age_of_Casualty      Sex_of_Casualty 
##                    0                    0                  594 
##            Sex_Class 
##                    0

I used the is.na() function wrapped in the colSums() function to find the total number of missing values or NAs in each column of my df.

Step 8 cont’d: Removing observations with missing values Since there are only 594 NA values in a df of millions, we’ll go ahead and remove these as we’ll still have a large enough sample size.

# Remove missing values
accidents_isna <- accidents_glue[complete.cases(accidents_glue), ]

# Now there are no rows with NA values 
colSums(is.na(accidents_isna))

##       Accident_Index                 Year                Month 
##                    0                    0                    0 
##                  Day    Accident_Severity Number_of_Casualties 
##                    0                    0                    0 
##       Casualty_Class      Age_of_Casualty      Sex_of_Casualty 
##                    0                    0                    0 
##            Sex_Class 
##                    0

I used the complete.cases function to help me replace a clean version of accidents_comb_sub without NA values back into accidents_comb_sub.

Step 8 cont’d: Scan for inconsistencies

table(accidents_isna$Year)

## 
##  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013 
## 73726 71193 71310 72236 72571 72437 72615 73112 73623 73394 74076 73325 72421 
##  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024  2025  2026 
## 72209 72410 72232 73000 73141 73835 71768 72138 71370 71840 71955 68887 69993 
##  2027  2028  2029  2030  2031 
## 70252 71691 65927 64397 39675

table(accidents_isna$Month)

## 
##     01     02     03     04     05     06     07     08     09     10     11 
## 173017 161281 174787 170071 186454 185507 193459 183994 187101 197679 199176 
##     12 
## 180233

table(accidents_isna$Day)

## 
##      20 
## 2192759

table(accidents_isna$Accident_Severity)

## 
##  Slight Serious   Fatal 
## 1834835  318636   39288

table(accidents_isna$Number_of_Casualties)

## 
##       1       2       3       4       5       6       7       8       9      10 
## 1257689  522600  220036  103912   47002   21117    8316    4046    2166    1264 
##      11      12      13      14      15      16      17      18      19      20 
##     884     499     451     300     259     179     189     150     159      53 
##      21      22      23      24      25      26      27      28      29      32 
##      95     143      42      63      58     145      41      36      93      27 
##      33      35      36      38      40      41      42      43      45      46 
##      12      31      49      33      34      16      86      21      64      24 
##      47      48      51      54      62      63      68      70      87      93 
##      18      40      79      39      23      14      46      50      26      40

table(accidents_isna$Casualty_Class)

## 
## Driver/rider    Passenger   Pedestrian 
##      1394665       523522       274572

table(accidents_isna$Sex_of_Casualty)

## 
##    Male  Female 
## 1279648  913111

table(accidents_isna$Sex_Class)

## 
## Fem-Dri Fem-Pas Fem-Ped Mal-Dri Mal-Pas Mal-Ped 
##  483936  312031  117144  910729  211491  157428

table(accidents_isna$Age_of_Casualty) # this has 43029 "-1" values.

## 
##    -1     0     1     2     3     4     5     6     7     8     9    10    11 
## 43029  1749  4667  6416  8108  8878  9669 10568 11289 12332 13364 15362 19079 
##    12    13    14    15    16    17    18    19    20    21    22    23    24 
## 21154 19640 20174 21911 39936 62926 75619 70429 67544 60671 57344 54820 51726 
##    25    26    27    28    29    30    31    32    33    34    35    36    37 
## 53888 48227 44803 43749 40895 47956 37746 37782 36788 36229 40397 35655 34975 
##    38    39    40    41    42    43    44    45    46    47    48    49    50 
## 35964 36038 41727 35166 35788 34428 33434 35158 31281 29733 29325 27536 30222 
##    51    52    53    54    55    56    57    58    59    60    61    62    63 
## 24328 23532 22101 20934 21290 19340 18320 17768 16911 18676 14620 14327 13035 
##    64    65    66    67    68    69    70    71    72    73    74    75    76 
## 12114 12628 10186  9828  9251  8866 10489  7845  7760  7585  7403  7991  7054 
##    77    78    79    80    81    82    83    84    85    86    87    88    89 
##  6607  6490  6366  6867  5570  5398  4759  4353  4055  3206  2726  2131  1825 
##    90    91    92    93    94    95    96    97    98    99   100   101   102 
##  1465  1045   785   534   374   255   158   105   123    77     4     2     2 
##   103 
##     1

This Scatter plot also shows the inconsistencies

# Scatter showing number of casualties by age 
accidents_isna %>%
  plot(Number_of_Casualties ~ Age_of_Casualty,
       data=.,
       ylab="Casualties", xlab="Age", main="Casualties by Age")

Explanation
I scanned for inconsistencies in the values recorded by checked the instances of the values and categories for each variable using the table function.
I noticed occurrences of -1 values in the Age_of_Casualty variable.
The scatter plot was created using the base plot() function and shows the relationship between Number_of_Casualties and Age_of_Casualty

# Replace -1 age values with the mean
accidents_isna$Age_of_Casualty[accidents_isna$Age_of_Casualty == -1] <- mean(accidents_isna$Age_of_Casualty)
table(accidents_isna$Age_of_Casualty) # Now there are no -1 values

## 
##                0                1                2                3 
##             1749             4667             6416             8108 
##                4                5                6                7 
##             8878             9669            10568            11289 
##                8                9               10               11 
##            12332            13364            15362            19079 
##               12               13               14               15 
##            21154            19640            20174            21911 
##               16               17               18               19 
##            39936            62926            75619            70429 
##               20               21               22               23 
##            67544            60671            57344            54820 
##               24               25               26               27 
##            51726            53888            48227            44803 
##               28               29               30               31 
##            43749            40895            47956            37746 
##               32               33               34 34.5126974738218 
##            37782            36788            36229            43029 
##               35               36               37               38 
##            40397            35655            34975            35964 
##               39               40               41               42 
##            36038            41727            35166            35788 
##               43               44               45               46 
##            34428            33434            35158            31281 
##               47               48               49               50 
##            29733            29325            27536            30222 
##               51               52               53               54 
##            24328            23532            22101            20934 
##               55               56               57               58 
##            21290            19340            18320            17768 
##               59               60               61               62 
##            16911            18676            14620            14327 
##               63               64               65               66 
##            13035            12114            12628            10186 
##               67               68               69               70 
##             9828             9251             8866            10489 
##               71               72               73               74 
##             7845             7760             7585             7403 
##               75               76               77               78 
##             7991             7054             6607             6490 
##               79               80               81               82 
##             6366             6867             5570             5398 
##               83               84               85               86 
##             4759             4353             4055             3206 
##               87               88               89               90 
##             2726             2131             1825             1465 
##               91               92               93               94 
##             1045              785              534              374 
##               95               96               97               98 
##              255              158              105              123 
##               99              100              101              102 
##               77                4                2                2 
##              103 
##                1

Step 8: Explanation * Searched the categories under each variable using the table() function. This showed instances of -1 values which is incorrect data as Age cannot be -1. Therefore I replaced observations with -1 with the average(mean) of the Age_of_Casualty variable using the mean() function.

#The scatter plot is now cleaner
accidents_clean <- accidents_isna 
accidents_clean %>%
  plot(Number_of_Casualties ~ Age_of_Casualty,
       data=.,
       ylab="Casualties", xlab="Age", main="Casualties by Age")

Step 8: Explanation * The scatter plot shows a cleaner dataset * I chose to replace the missing instances with the mean rather than remove them as there were over 40,000+ occurrences.

Scan II

Step 9: Scan for outliers * To demonstrate this section I will apply the scores function from the outliers() package. * Our numeric variables here are Number_of_Casualties and Age_of_Casualty.

# Find outliers using z scores for Number_of_Casualties
z.scores <- accidents_clean$Number_of_Casualties %>%
  outliers::scores(type="z")
z.scores %>% summary() # Shows IQR

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4876 -0.4876 -0.4876  0.0000  0.1164 55.0832

# Find outliers using z scores for Age_of_Casualty
z.scores2 <- accidents_clean$Age_of_Casualty %>%
  outliers::scores(type="z")
z.scores2 %>% summary() #Shows IQR

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.9394 -0.7827 -0.1768  0.0000  0.5944  3.7340

# Then calculate how many outliers there are using the z scores
length(which(abs(z.scores)>3)) # There are 20403 outliers in Number_of_Casualties.

## [1] 20403

length(which(abs(z.scores2)>3)) # There are 4930 outliers in Age_of_Casualty.

## [1] 4930

I used the scores() function in the outliers package to calculate the z scores and outliers.
The outliers for Number_of_Casualties and Age_of_Casualty are returned using the length() function wrapped around the which() and abs() functions.

Removing the outliers

# Boxplot of number of casualties by severity
boxplot(accidents_clean$Number_of_Casualties ~ accidents_isna$Accident_Severity,
            main = "Number of Casualties by Severity", ylab = "Casualties", xlab = "Severity")

# Boxplot of number of Age by severity
boxplot(accidents_clean$Age_of_Casualty ~ accidents_clean$Accident_Severity,
            main = "Number of Age by Severity", ylab = "Age", xlab = "Severity")

# Since there are a few outliers we will remove these
accidents_removed <- accidents_clean$Number_of_Casualties[-which(abs(z.scores)>3)]
accidents_removed2 <- accidents_clean$Age_of_Casualty[-which(abs(z.scores2)>3)]

#Result of outliers
accidents_removed
accidents_removed2

*Since I had relatively few outliers (compared to the millions of values in the df) I removed them by using a negative ‘-’ in the which and abs functions and assigned the outliers to their own variables.

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfill the step #10.

# Let's have a look at skewness
hist(accidents_clean$Age_of_Casualty, 
     main = "Histogram of Age of Casualties", 
     xlab = "Age of Casualties")

# This histogram has a slight right-skewness

The Age_of_Casualty histogram follows a normal distribution and has some minor right skewness.

# Scaling without mean centering
scaled_age <- scale(accidents_clean$Age_of_Casualty, center = FALSE, scale = TRUE)
hist(scaled_age)

# Perform a z-score transformation
zscaled_age <- scale(accidents_clean$Age_of_Casualty, center = TRUE, scale = TRUE)
hist(zscaled_age)

# Perform mean centering
mcentre_age <- scale(accidents_clean$Age_of_Casualty, center = TRUE, scale = FALSE)
hist(mcentre_age)

Explanation * Noted: A log transformation couldn’t be used as there are 0 values.

Data Wrangling Assessment Task 3: Dataset challenge

UK Accidents 10 years history with many variables

Heinzy Johan Bernt s3923824