knitr::opts_chunk$set(echo = TRUE, warnings=FALSE, message=FALSE)
library(readr) # Used for reading and importing csv files
library(dplyr) # For data wrangling including joins, gsub,
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr) # Also for pipes
library(tidyr) # Used for the separate function
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(glue) # For transmuting using the glue function
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
library(outliers) # For using scores function when dealing with outliers
library(here) # To help import csv files
## here() starts at C:/Users/heinz/Documents/Uni/T3 Data Wrangling/Assessment 3
We are about to explore a 10 year history of UK vehicular accidents and casualties.
References Source: Fedit, B 2018, UK Accidents 10 years history with many variables, viewed 6 October 2021, https://www.kaggle.com/benoit72/uk-accidents-10-years-history-with-many-variables
Loom link: https://www.loom.com/share/96ff2bd38b114d5ea45cd693941236d9
Step 1: Get two datasets
# Read and create accidents df
accidents <- read.csv(here("Accidents0514.csv"))
head(accidents)
# Read and create casualties df
casualties <- read.csv(here("Casualties0514.csv"))
head(casualties)
Explanation for Imported csv files
Firstly, I imported the accidents data using the read.csv function and saved this to accidents. Then I did the same for the casualties data and saved this to casualties. *I called the head of each new df to show the first few rows.
Variables Explanation: * There are 46 different variables in total, so rather than explain each one, I’ll explain the ones I want to keep and subset for the purpose of this assignment in Step 3.
Step 1 cont’d: Inspect the data
names(accidents) # Returns the column/variable names of the accidents df
## [1] "ï..Accident_Index"
## [2] "Location_Easting_OSGR"
## [3] "Location_Northing_OSGR"
## [4] "Longitude"
## [5] "Latitude"
## [6] "Police_Force"
## [7] "Accident_Severity"
## [8] "Number_of_Vehicles"
## [9] "Number_of_Casualties"
## [10] "Date"
## [11] "Day_of_Week"
## [12] "Time"
## [13] "Local_Authority_.District."
## [14] "Local_Authority_.Highway."
## [15] "X1st_Road_Class"
## [16] "X1st_Road_Number"
## [17] "Road_Type"
## [18] "Speed_limit"
## [19] "Junction_Detail"
## [20] "Junction_Control"
## [21] "X2nd_Road_Class"
## [22] "X2nd_Road_Number"
## [23] "Pedestrian_Crossing.Human_Control"
## [24] "Pedestrian_Crossing.Physical_Facilities"
## [25] "Light_Conditions"
## [26] "Weather_Conditions"
## [27] "Road_Surface_Conditions"
## [28] "Special_Conditions_at_Site"
## [29] "Carriageway_Hazards"
## [30] "Urban_or_Rural_Area"
## [31] "Did_Police_Officer_Attend_Scene_of_Accident"
## [32] "LSOA_of_Accident_Location"
names(casualties) # Returns the column/variable names of the casualties df
## [1] "ï..Accident_Index" "Vehicle_Reference"
## [3] "Casualty_Reference" "Casualty_Class"
## [5] "Sex_of_Casualty" "Age_of_Casualty"
## [7] "Age_Band_of_Casualty" "Casualty_Severity"
## [9] "Pedestrian_Location" "Pedestrian_Movement"
## [11] "Car_Passenger" "Bus_or_Coach_Passenger"
## [13] "Pedestrian_Road_Maintenance_Worker" "Casualty_Type"
## [15] "Casualty_Home_Area_Type"
dim(accidents) # Returns the row and column dimensions of the accidents df
## [1] 1640597 32
dim(casualties) # Returns the row and column dimensions of the accidents df
## [1] 2216720 15
Step 2: Merge the data sets
# In order to merge the data sets I'll first I fix the variable name that I want to merge
accidents %<>%
rename("Accident_Index" = "ï..Accident_Index")
casualties %<>%
rename("Accident_Index" = "ï..Accident_Index")
# Using inner join to merge data sets
accidents_comb <- inner_join(accidents, casualties, by="Accident_Index")
head(accidents_comb)
Step 2: Explanation * First I updated the first variable/column of each data set to share the name Accident_Index which will make it easier to join. * Used inner join to retain only the observations in both data sets. This will help to reduce the creation of missing values (which may have occurred with a left join for example). * Using this join and merging the datasets allows me to keep the information in Accidents such as the Date and Accident Severity whilst adding further details on the particular accident from Casualties such as number of casualties, class, and gender.
Step 3: Inspect the new dataset
dim(accidents_comb) # The df has 46 variables
# Subset to the variables I want to keep
accidents_comb_sub <- accidents_comb[,c("Accident_Index", "Date", "Accident_Severity", "Number_of_Casualties", "Casualty_Class", "Age_of_Casualty", "Sex_of_Casualty")]
# Using the attributes function to inspect elements
attributes(accidents_comb_sub) # Shows names (variables), class(data frame) and row names
Step 3 cont’d
# Inspect the structure of the new data frame
str(accidents_comb_sub)
## 'data.frame': 2216720 obs. of 7 variables:
## $ Accident_Index : chr "200501BS00001" "200501BS00002" "200501BS00003" "200501BS00004" ...
## $ Date : chr "04/01/2005" "05/01/2005" "06/01/2005" "07/01/2005" ...
## $ Accident_Severity : int 2 3 3 3 3 3 3 3 3 3 ...
## $ Number_of_Casualties: int 1 1 1 1 1 1 1 2 2 2 ...
## $ Casualty_Class : int 3 2 1 3 1 1 1 3 3 1 ...
## $ Age_of_Casualty : int 37 37 62 30 49 30 31 13 13 35 ...
## $ Sex_of_Casualty : int 1 1 1 1 1 2 1 2 2 1 ...
Step 3 Explanation: * accidents_comb_sub is a data frame with 2216720 observations/rows and 7 variables/columns * Accident_Index is the Unique identifier and is a character variable. * Date is in dd/mm/yyyy format from 2005 to 2014 and is a character variable. * Accident_Severity is an integer variable: 1=Fatal, 2=Serious, 3=Slight. * Number_of_Casualties is self explanatory is an integer variable. * Casualty_Class is an integer variable: 1=Driver/rider, 2=Passenger, 3=Pedestrian. * Age_of_Casualty is self explanatory is an integer variable * Sex_of_Casualty is an integer variable: 1=Male, 2=Female
Step 4: Data Type Conversions
# Convert Date from Character to Date
accidents_comb_sub$Date %<>%
as.Date(format = "%y/%m/%d")
# Convert Accident_Severity to factor variable
accidents_comb_sub$Accident_Severity %<>%
as.factor()
# Convert Casualty_Class to factor variable
accidents_comb_sub$Casualty_Class %<>%
as.factor()
# Convert Sex_of_Casualty to factor variable
accidents_comb_sub$Sex_of_Casualty %<>%
as.factor()
#Use the class function to check each variable is converted.
class(accidents_comb_sub$Date)
## [1] "Date"
class(accidents_comb_sub$Accident_Severity)
## [1] "factor"
class(accidents_comb_sub$Casualty_Class)
## [1] "factor"
class(accidents_comb_sub$Sex_of_Casualty)
## [1] "factor"
Step 4: Explanation * Using the as.Date() function, I converted the Date variable from character to Date format * Using the as.factor() function I converted the Accident_Severity, Casualty_Class and Sex_of_Casualty variables to factor variables.
Step 5: Relabelling/reordering factor variables
# Relabel and reorder Accident_Severity
accidents_comb_sub$Accident_Severity <- factor(accidents_comb_sub$Accident_Severity, levels = c(3,2,1), labels = c("Slight", "Serious", "Fatal"), ordered = TRUE)
# Relabel Casualty_Class (order not important)
accidents_comb_sub$Casualty_Class <- factor(accidents_comb_sub$Casualty_Class,
levels = c(1,2,3),
labels = c("Driver/rider", "Passenger", "Pedestrian"))
# Relabel Sex_of_Casualty (order not important)
accidents_comb_sub$Sex_of_Casualty <- factor(accidents_comb_sub$Sex_of_Casualty,
levels = c(1,2),
labels = c("Male", "Female"))
# Now we will make sure our updated variables are factor character variables
as.character.factor(accidents_comb_sub$Accident_Severity) # This is ordered
as.character.factor(accidents_comb_sub$Casualty_Class)
as.character.factor(accidents_comb_sub$Sex_of_Casualty)
# Use structure function to check work above
str(accidents_comb_sub)
## 'data.frame': 2216720 obs. of 7 variables:
## $ Accident_Index : chr "200501BS00001" "200501BS00002" "200501BS00003" "200501BS00004" ...
## $ Date : Date, format: "2004-01-20" "2005-01-20" ...
## $ Accident_Severity : Ord.factor w/ 3 levels "Slight"<"Serious"<..: 2 1 1 1 1 1 1 1 1 1 ...
## $ Number_of_Casualties: int 1 1 1 1 1 1 1 2 2 2 ...
## $ Casualty_Class : Factor w/ 3 levels "Driver/rider",..: 3 2 1 3 1 1 1 3 3 1 ...
## $ Age_of_Casualty : int 37 37 62 30 49 30 31 13 13 35 ...
## $ Sex_of_Casualty : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 2 1 2 2 1 ...
Step 5: Explanation * I relabeled & reordered Accident_Severity to Slight<Serious<Fatal to provide more information for the user on the levels regarding accident severity. * I relabeled Casualty_Class (“Driver/rider”, “Passenger”, “Pedestrian”) and Sex_of_Casualty (Male, Female) to provide more information the user.
Step 6: Tidy and Manipulate data • Each value must have its own cell.
Yes each value has its own cell. • Each observation must have its own
row. There are duplicates which we’ll remove, after which each
observation will have its own row.
• Each variable must have its own column. Whilst each variable has its
own column the date can also be separated into day, month,year. See
below.
# Show if there are any duplicated entries
any(duplicated(accidents_comb_sub)) # this shows there are duplicate observations
## [1] TRUE
# Replace the df with only the unique observations
accidents_unique <- unique(accidents_comb_sub)
any(duplicated(accidents_unique))
## [1] FALSE
Step 6: Explanation * I checked if there were any duplicates by wrapping the duplicated() function with the any() function and this return TRUE. * I then removed the duplicates by using the unique() function and replaced my df with a cleaner version of accidents_comb_sub without duplicates.
# Separate the Date into 3 columns.
accidents_unique %<>%
separate(Date, into = c("Year", "Month", "Day"), sep ="-")
head(accidents_unique)
Step 7: Create/mutate a new variable.
# Extract first three letters of Sex_of_Casualty and Casualty_Class
gender <- substr(accidents_unique$Sex_of_Casualty, 1, 3)
class <- substr(accidents_unique$Casualty_Class, 1, 3)
# Create new variable called Sex_Class from existing Sex_of_Casualty and Casualty_Class variables
accidents_glue <- accidents_unique %>%
mutate(Sex_Class = glue("{gender}-{class}"))
# Show the new categories of the new class
head(accidents_glue$Sex_Class)
## Mal-Ped
## Mal-Pas
## Mal-Dri
## Mal-Ped
## Mal-Dri
## Fem-Dri
Step 7 cont’d: Explanation * I wanted to create a new variable by combining the sex of the casualty with the type. * By having the new category Mal-Dri, for example, this might allow for the user to easier identify Male Driver casualty types.
Step 8: Scan for missing values
# Scan whole data frame for missing values
colSums(is.na(accidents_glue)) # there are 594 NA values in Sex_of_Casualty.
## Accident_Index Year Month
## 0 0 0
## Day Accident_Severity Number_of_Casualties
## 0 0 0
## Casualty_Class Age_of_Casualty Sex_of_Casualty
## 0 0 594
## Sex_Class
## 0
Step 8 cont’d: Removing observations with missing values Since there are only 594 NA values in a df of millions, we’ll go ahead and remove these as we’ll still have a large enough sample size.
# Remove missing values
accidents_isna <- accidents_glue[complete.cases(accidents_glue), ]
# Now there are no rows with NA values
colSums(is.na(accidents_isna))
## Accident_Index Year Month
## 0 0 0
## Day Accident_Severity Number_of_Casualties
## 0 0 0
## Casualty_Class Age_of_Casualty Sex_of_Casualty
## 0 0 0
## Sex_Class
## 0
Step 8 cont’d: Scan for inconsistencies
table(accidents_isna$Year)
##
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 73726 71193 71310 72236 72571 72437 72615 73112 73623 73394 74076 73325 72421
## 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026
## 72209 72410 72232 73000 73141 73835 71768 72138 71370 71840 71955 68887 69993
## 2027 2028 2029 2030 2031
## 70252 71691 65927 64397 39675
table(accidents_isna$Month)
##
## 01 02 03 04 05 06 07 08 09 10 11
## 173017 161281 174787 170071 186454 185507 193459 183994 187101 197679 199176
## 12
## 180233
table(accidents_isna$Day)
##
## 20
## 2192759
table(accidents_isna$Accident_Severity)
##
## Slight Serious Fatal
## 1834835 318636 39288
table(accidents_isna$Number_of_Casualties)
##
## 1 2 3 4 5 6 7 8 9 10
## 1257689 522600 220036 103912 47002 21117 8316 4046 2166 1264
## 11 12 13 14 15 16 17 18 19 20
## 884 499 451 300 259 179 189 150 159 53
## 21 22 23 24 25 26 27 28 29 32
## 95 143 42 63 58 145 41 36 93 27
## 33 35 36 38 40 41 42 43 45 46
## 12 31 49 33 34 16 86 21 64 24
## 47 48 51 54 62 63 68 70 87 93
## 18 40 79 39 23 14 46 50 26 40
table(accidents_isna$Casualty_Class)
##
## Driver/rider Passenger Pedestrian
## 1394665 523522 274572
table(accidents_isna$Sex_of_Casualty)
##
## Male Female
## 1279648 913111
table(accidents_isna$Sex_Class)
##
## Fem-Dri Fem-Pas Fem-Ped Mal-Dri Mal-Pas Mal-Ped
## 483936 312031 117144 910729 211491 157428
table(accidents_isna$Age_of_Casualty) # this has 43029 "-1" values.
##
## -1 0 1 2 3 4 5 6 7 8 9 10 11
## 43029 1749 4667 6416 8108 8878 9669 10568 11289 12332 13364 15362 19079
## 12 13 14 15 16 17 18 19 20 21 22 23 24
## 21154 19640 20174 21911 39936 62926 75619 70429 67544 60671 57344 54820 51726
## 25 26 27 28 29 30 31 32 33 34 35 36 37
## 53888 48227 44803 43749 40895 47956 37746 37782 36788 36229 40397 35655 34975
## 38 39 40 41 42 43 44 45 46 47 48 49 50
## 35964 36038 41727 35166 35788 34428 33434 35158 31281 29733 29325 27536 30222
## 51 52 53 54 55 56 57 58 59 60 61 62 63
## 24328 23532 22101 20934 21290 19340 18320 17768 16911 18676 14620 14327 13035
## 64 65 66 67 68 69 70 71 72 73 74 75 76
## 12114 12628 10186 9828 9251 8866 10489 7845 7760 7585 7403 7991 7054
## 77 78 79 80 81 82 83 84 85 86 87 88 89
## 6607 6490 6366 6867 5570 5398 4759 4353 4055 3206 2726 2131 1825
## 90 91 92 93 94 95 96 97 98 99 100 101 102
## 1465 1045 785 534 374 255 158 105 123 77 4 2 2
## 103
## 1
This Scatter plot also shows the inconsistencies
# Scatter showing number of casualties by age
accidents_isna %>%
plot(Number_of_Casualties ~ Age_of_Casualty,
data=.,
ylab="Casualties", xlab="Age", main="Casualties by Age")
# Replace -1 age values with the mean
accidents_isna$Age_of_Casualty[accidents_isna$Age_of_Casualty == -1] <- mean(accidents_isna$Age_of_Casualty)
table(accidents_isna$Age_of_Casualty) # Now there are no -1 values
##
## 0 1 2 3
## 1749 4667 6416 8108
## 4 5 6 7
## 8878 9669 10568 11289
## 8 9 10 11
## 12332 13364 15362 19079
## 12 13 14 15
## 21154 19640 20174 21911
## 16 17 18 19
## 39936 62926 75619 70429
## 20 21 22 23
## 67544 60671 57344 54820
## 24 25 26 27
## 51726 53888 48227 44803
## 28 29 30 31
## 43749 40895 47956 37746
## 32 33 34 34.5126974738218
## 37782 36788 36229 43029
## 35 36 37 38
## 40397 35655 34975 35964
## 39 40 41 42
## 36038 41727 35166 35788
## 43 44 45 46
## 34428 33434 35158 31281
## 47 48 49 50
## 29733 29325 27536 30222
## 51 52 53 54
## 24328 23532 22101 20934
## 55 56 57 58
## 21290 19340 18320 17768
## 59 60 61 62
## 16911 18676 14620 14327
## 63 64 65 66
## 13035 12114 12628 10186
## 67 68 69 70
## 9828 9251 8866 10489
## 71 72 73 74
## 7845 7760 7585 7403
## 75 76 77 78
## 7991 7054 6607 6490
## 79 80 81 82
## 6366 6867 5570 5398
## 83 84 85 86
## 4759 4353 4055 3206
## 87 88 89 90
## 2726 2131 1825 1465
## 91 92 93 94
## 1045 785 534 374
## 95 96 97 98
## 255 158 105 123
## 99 100 101 102
## 77 4 2 2
## 103
## 1
Step 8: Explanation * Searched the categories under each variable using the table() function. This showed instances of -1 values which is incorrect data as Age cannot be -1. Therefore I replaced observations with -1 with the average(mean) of the Age_of_Casualty variable using the mean() function.
#The scatter plot is now cleaner
accidents_clean <- accidents_isna
accidents_clean %>%
plot(Number_of_Casualties ~ Age_of_Casualty,
data=.,
ylab="Casualties", xlab="Age", main="Casualties by Age")
Step 8: Explanation * The scatter plot shows a cleaner dataset * I chose to replace the missing instances with the mean rather than remove them as there were over 40,000+ occurrences.
Step 9: Scan for outliers * To demonstrate this section I will apply the scores function from the outliers() package. * Our numeric variables here are Number_of_Casualties and Age_of_Casualty.
# Find outliers using z scores for Number_of_Casualties
z.scores <- accidents_clean$Number_of_Casualties %>%
outliers::scores(type="z")
z.scores %>% summary() # Shows IQR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4876 -0.4876 -0.4876 0.0000 0.1164 55.0832
# Find outliers using z scores for Age_of_Casualty
z.scores2 <- accidents_clean$Age_of_Casualty %>%
outliers::scores(type="z")
z.scores2 %>% summary() #Shows IQR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.9394 -0.7827 -0.1768 0.0000 0.5944 3.7340
# Then calculate how many outliers there are using the z scores
length(which(abs(z.scores)>3)) # There are 20403 outliers in Number_of_Casualties.
## [1] 20403
length(which(abs(z.scores2)>3)) # There are 4930 outliers in Age_of_Casualty.
## [1] 4930
Removing the outliers
# Boxplot of number of casualties by severity
boxplot(accidents_clean$Number_of_Casualties ~ accidents_isna$Accident_Severity,
main = "Number of Casualties by Severity", ylab = "Casualties", xlab = "Severity")
# Boxplot of number of Age by severity
boxplot(accidents_clean$Age_of_Casualty ~ accidents_clean$Accident_Severity,
main = "Number of Age by Severity", ylab = "Age", xlab = "Severity")
# Since there are a few outliers we will remove these
accidents_removed <- accidents_clean$Number_of_Casualties[-which(abs(z.scores)>3)]
accidents_removed2 <- accidents_clean$Age_of_Casualty[-which(abs(z.scores2)>3)]
#Result of outliers
accidents_removed
accidents_removed2
*Since I had relatively few outliers (compared to the millions of values in the df) I removed them by using a negative ‘-’ in the which and abs functions and assigned the outliers to their own variables.
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfill the step #10.
# Let's have a look at skewness
hist(accidents_clean$Age_of_Casualty,
main = "Histogram of Age of Casualties",
xlab = "Age of Casualties")
# This histogram has a slight right-skewness
# Scaling without mean centering
scaled_age <- scale(accidents_clean$Age_of_Casualty, center = FALSE, scale = TRUE)
hist(scaled_age)
# Perform a z-score transformation
zscaled_age <- scale(accidents_clean$Age_of_Casualty, center = TRUE, scale = TRUE)
hist(zscaled_age)
# Perform mean centering
mcentre_age <- scale(accidents_clean$Age_of_Casualty, center = TRUE, scale = FALSE)
hist(mcentre_age)
Explanation * Noted: A log transformation couldn’t be used as there are
0 values.