library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.4
library(zoo)
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
library(outliers)
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
For this data preprocessing assignment, we are getting the data of road crashes in South Australia in 2016 ready for analysis. The followings are summary of the steps taken to achieve this:
First of all, two data sets were merged to join crash-related data such as nature of the crash and crash site contained in one data set to another containing information about corresponding units involved in the crashes and their occupants. After joining, the merged data set was filtered to contain only observations of the unit responsible cases.
Then a complete inspection of variables was done to understand each variables’ specification and related data. After this, appropriate type conversions were applied and missing data entries in various forms such as “Unknown”, “UNKNOWN” or “XXXX” was all converted to NA for consistency. Year and Month variables were combined and converted to a date format. A variable name was changed to make it more explanatory. A variable ‘Vehicle Age’ was mutated from two existing variables, (‘Year’ - ‘Veh Year’). After these tasks were done, upon inspection, this data set is now in tidy data format according to Hadley Wickham’s tidy data principles.
Missing values of all categorical variables and all numeric variables were dealt with by imputing with each variables’ value of mode and median respectively.
Outliers of all numeric variables were then scanned using z-scores and imputed with mean values where necessary. Summary statistics and observation inquiries using subsetting were also used to inspect suspected outliers to see if they could be plausible in the context of the problem.
Numerical variable ‘Age’ was finally transformed to decrease the skewness and convert the distribution to a more normal distribution.
Two data sets were obtained from the following [link] (https://data.sa.gov.au/data/dataset/road-crash-data/resource/446afe5b-4e01-4cdf-a281- edd25aaf3802) These data sets are comprised of the information regarding road crashes in South AUstralia in 2016, which were reported to the police where at least one person being killed or injured, vehicle towed away, or total property damage of $5000 or greater.
The first data set, “Crashtype”, concerns relevant information about the nature of the crash and crash site with the following specification of variables:
The second data set, “Unit”, concerns relevant identity information about all units (vehicles and corresponding occupants) as well as whether rollover or fire occurred. The following are variables’ specifications:
In order to import and read these datasets into R, I did the following steps: * From Environment pane, clicked import Dataset > From text(readr)… * Set name as ‘Crashtype’ and ‘Unit’. * Copied the code from Code Preview box into R script for future loading. * Used class() to check if they are imported as dataframe.
Variable ‘Postcode’ exists in both dataset. One describes postcode of where the crash occurred while the other is where the occupant resides at. General subset was used to remove ‘Postcode’ variable in the latter since we are only interested in crash-related data.
full_join() was used to merge the two dataset by key variable ‘REPORT ID’, new data set named ‘sacrash’.
There are multiple units involved in each crash and this is shown in ’Unit" dataset. However, the “Crashtype” dataset contains only information for the unit responsible in the crash. Since we are interested in only the crash-related data, filter() from dplyr was used to subset the merged dataset to include those units responsible.
Crashtype <- read_csv("C:/Users/mengh/OneDrive/Desktop/Data Preprocessing/Datapreprocessing Assignment 3/2016_DATA_SA_Crash.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Postcode = col_integer(),
## `Total Units` = col_integer(),
## `Total Cas` = col_integer(),
## `Total Fats` = col_integer(),
## `Total SI` = col_integer(),
## `Total MI` = col_integer(),
## Year = col_integer(),
## Time = col_time(format = ""),
## ACCLOC_X = col_double(),
## ACCLOC_Y = col_double()
## )
## See spec(...) for full column specifications.
Unit <- read_csv("C:/Users/mengh/OneDrive/Desktop/Data Preprocessing/Datapreprocessing Assignment 3/2016_DATA_SA_Units.csv")
## Parsed with column specification:
## cols(
## REPORT_ID = col_character(),
## `Unit No` = col_character(),
## `No Of Cas` = col_integer(),
## `Veh Reg State` = col_character(),
## `Unit Type` = col_character(),
## `Veh Year` = col_character(),
## `Direction Of Travel` = col_character(),
## Sex = col_character(),
## Age = col_character(),
## `Lic State` = col_character(),
## `Licence Class` = col_character(),
## `Licence Type` = col_character(),
## Towing = col_character(),
## `Unit Movement` = col_character(),
## `Number Occupants` = col_character(),
## Postcode = col_character(),
## Rollover = col_character(),
## Fire = col_character()
## )
head(Crashtype)
head(Unit)
class(Crashtype)
## [1] "tbl_df" "tbl" "data.frame"
class(Unit)
## [1] "tbl_df" "tbl" "data.frame"
Unit <- Unit %>% select(-Postcode)
sacrash <- full_join(Crashtype, Unit, by = "REPORT_ID")
sacrash <- sacrash %>% filter(`Unit Resp`== `Unit No`)
sacrash <- sacrash %>% select(-`Unit Resp`)
Some of the variables were imported in incorrect types. Therefore, they were converted using:
Since year, month but only day of the week are given in three separate columns, ymd() cannot be used to convert to year-month-day date format. Instead, Year and Month was pasted together with separator “-” as a new variable, then as.yearmon() from zoo package was used to convert to just year-month format and leaving day of the week variable as it is.
Upon inspection using unique(variable_name) for each variables, some missing values in some variables were entered as “Unknown”, “UNKNOWN” and “XXXX”. In this case, vector subset was used to convert these values to NA before type conversions so that these values can be grouped together with missing values to determine the total number of missing values. Unique(variable_name) was used again after this process to check if this is successful.
For variables ‘Rollover’, ‘Fire’, ‘DUI Involved’ and ‘Drugs Involved’. Upon inspection, it is assumed that data entry is only entered as “Y” when the above occurred which is a small percentage compared to the whole datset. However, before proceeding to data analysis, this would need clarification from data collector if this is the actual case. In this instance, it is assumed so. Therefore, where there is no data entry, it is substituted with “N” then convert to factor with labels “Yes” and “No”.
‘LGA’ variable was changed to ‘Local Government Area’ for better explanatory purpose using colnames().
glimpse(sacrash)
## Observations: 16,580
## Variables: 48
## $ REPORT_ID <chr> "2016-1-18/07/2017", "2016-2-18/07/2017"...
## $ `Stats Area` <chr> "1 City", "2 Metropolitan", "2 Metropoli...
## $ Suburb <chr> "ADELAIDE", "POORAKA", "GREEN FIELDS", "...
## $ Postcode <int> 5000, 5095, 5107, 5043, 5116, 5251, 5094...
## $ `LGA Name` <chr> "CITY OF ADELAIDE", "CITY OF SALISBURY",...
## $ `Total Units` <int> 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2...
## $ `Total Cas` <int> 1, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ `Total Fats` <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Total SI` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Total MI` <int> 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ Year <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016...
## $ Month <chr> "November", "December", "November", "Nov...
## $ Day <chr> "Wednesday", "Thursday", "Tuesday", "Tue...
## $ Time <time> 13:45:00, 13:00:00, 07:10:00, 10:30:00,...
## $ `Area Speed` <chr> "050", "070", "080", "060", "090", "050"...
## $ `Position Type` <chr> "Divided Road", "Divided Road", "T-Junct...
## $ `Horizontal Align` <chr> "Straight road", "Straight road", "Strai...
## $ `Vertical Align` <chr> "Level", "Level", "Level", "Level", "Lev...
## $ `Other Feat` <chr> "Not Applicable", "Not Applicable", "Not...
## $ `Road Surface` <chr> "Sealed", "Sealed", "Sealed", "Sealed", ...
## $ `Moisture Cond` <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry"...
## $ `Weather Cond` <chr> "Not Raining", "Not Raining", "Not Raini...
## $ DayNight <chr> "Daylight", "Daylight", "Daylight", "Day...
## $ `Crash Type` <chr> "Hit Pedestrian", "Rear End", "Rear End"...
## $ `Entity Code` <chr> "Pedestrian", "Driver Rider", "Driver Ri...
## $ `CSEF Severity` <chr> "4: Fatal", "1: PDO", "1: PDO", "1: PDO"...
## $ `Traffic Ctrls` <chr> "No Control", "No Control", "Traffic Sig...
## $ `DUI Involved` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `Drugs Involved` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ACCLOC_X <dbl> 1328459, 1330554, 1328689, 1324837, 1339...
## $ ACCLOC_Y <dbl> 1670878, 1681820, 1685256, 1660685, 1700...
## $ UNIQUE_LOC <chr> "13284591670878", "13305541681820", "132...
## $ `Unit No` <chr> "02", "02", "02", "02", "01", "01", "02"...
## $ `No Of Cas` <int> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ `Veh Reg State` <chr> NA, "SA", "SA", "UNKNOWN", "SA", "SA", "...
## $ `Unit Type` <chr> "Pedestrian on Road", "Station Wagon", "...
## $ `Veh Year` <chr> NA, "2008", "2013", "XXXX", "1997", "201...
## $ `Direction Of Travel` <chr> "East", "Unknown", "South", "North", "So...
## $ Sex <chr> "Male", "Male", "Male", "Unknown", "Male...
## $ Age <chr> "072", "040", "023", "XXX", "042", "035"...
## $ `Lic State` <chr> NA, "SA", "SA", "UNKNOWN", "SA", "SA", "...
## $ `Licence Class` <chr> NA, "C", "C", "XX", "C", "C", "C", "XX",...
## $ `Licence Type` <chr> NA, "Full", "Full", "Unknown", "Full", "...
## $ Towing <chr> NA, "Not Towing", "Not Towing", "Unknown...
## $ `Unit Movement` <chr> "Walking on Road", "Straight Ahead", "St...
## $ `Number Occupants` <chr> NA, "001", "001", "001", "001", "001", "...
## $ Rollover <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fire <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
dim(sacrash)
## [1] 16580 48
sacrash <- sacrash %>% mutate(Date = paste(Year,Month, sep = "-"))
sacrash$Date <- as.yearmon(sacrash$Date, "%Y-%B")
class(sacrash$Date)
## [1] "yearmon"
sacrash$`Number Occupants` <- as.numeric(sacrash$`Number Occupants`)
unique(sacrash$`Veh Reg State`)
## [1] NA "SA" "UNKNOWN" "NSW" "VIC" "QLD" "WA"
## [8] "NT" "TAS" "FEDERAL" "ACT"
sacrash$`Veh Reg State`[sacrash$`Veh Reg State` == "UNKNOWN"] <- NA
sacrash$`Veh Reg State` <- as.factor(sacrash$`Veh Reg State`)
unique(sacrash$`CSEF Severity`)
## [1] "4: Fatal" "1: PDO" "2: MI" "3: SI"
sacrash$`CSEF Severity`<- factor(sacrash$`CSEF Severity`, levels = c("1: PDO", "2: MI", "3: SI", "4: Fatal"), labels = c("Property Damage Only", "Minor Injury", "Serious Injury", "Fatal"), ordered = TRUE)
levels(sacrash$`CSEF Severity`)
## [1] "Property Damage Only" "Minor Injury" "Serious Injury"
## [4] "Fatal"
sacrash$`No Of Cas` <- as.numeric(sacrash$`No Of Cas`)
sacrash$`Unit Type` <- as.factor(sacrash$`Unit Type`)
levels(sacrash$`Unit Type`)
## [1] "Animal - Domestic - Not Ridden" "Animal - Wild"
## [3] "Animal Drawn Vehicle" "BDOUBLE - ROAD TRAIN"
## [5] "Forward Control Passenger Van" "Light Truck LT 4.5T"
## [7] "Motor Cars - Sedan" "Motor Cars - Tourer"
## [9] "Motor Cycle" "Motor Vehicle - Type Unknown"
## [11] "Motorised Wheelchair/Gopher" "OMNIBUS"
## [13] "Other Defined Special Vehicle" "Panel Van"
## [15] "Pedal Cycle" "Pedestrian on Road"
## [17] "Power Asst. Bicycle" "RIGID TRUCK LGE GE 4.5T"
## [19] "Scooter" "SEMI TRAILER"
## [21] "Small Wheel Vehicle User" "Station Wagon"
## [23] "Taxi Cab" "Utility"
## [25] "Wheelchair / Elec. Wheelchair"
sacrash$`Veh Year`[sacrash$`Veh Year` == "XXXX"] <- NA
sacrash$`Veh Year` <- as.numeric(sacrash$`Veh Year`)
sacrash$`Direction Of Travel`[sacrash$`Direction Of Travel` == "Unknown"] <- NA
sacrash$`Direction Of Travel` <- as.factor(sacrash$`Direction Of Travel`)
sacrash$Sex[sacrash$Sex == "Unknown"] <- NA
sacrash$Sex <- as.factor(sacrash$Sex)
sacrash$Age <- as.numeric(sacrash$Age)
## Warning: NAs introduced by coercion
sacrash$`Lic State` <- factor(sacrash$`Lic State`, levels = c("ACT", "NSw", "NT", "O/S", "QLD","SA","TAS", "VIC", "WA"), labels = c("Australian Capital Territory", "New South Wales", "Northern Territory", "Overseas","Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia"))
sacrash$`Licence Class` <- as.factor(sacrash$`Licence Class`)
sacrash$`Licence Type`[sacrash$`Licence Type`=="Unknown"] <- NA
sacrash$`Licence Type` <- as.factor(sacrash$`Licence Type`)
sacrash$Towing[sacrash$Towing=="Unknown"] <- NA
sacrash$Towing<- as.factor(sacrash$Towing)
sacrash$`Unit Movement` <- as.factor(sacrash$`Unit Movement`)
sacrash$Postcode <- as.character(sacrash$Postcode)
sacrash$Rollover[is.na(sacrash$Rollover)] <- "N"
sacrash$Rollover <- factor(sacrash$Rollover, levels = c("Y", "N"), labels = c("Yes","No"))
unique(sacrash$Rollover)
## [1] No Yes
## Levels: Yes No
sacrash$Fire[is.na(sacrash$Fire)] <- "N"
sacrash$Fire <- factor(sacrash$Fire, levels = c("Y","N"), labels = c("Yes", "NO"))
unique(sacrash$Fire)
## [1] NO Yes
## Levels: Yes NO
sacrash$`Stats Area` <- factor(sacrash$`Stats Area`, levels = c("1 City","2 Metropolitan", "3 Country"), labels = c("City", "Metropolitan", "Country"))
sacrash$Suburb <- as.factor(sacrash$Suburb)
colnames(sacrash)[5] <- "Local Government Area"
sacrash$`Local Government Area` <- as.factor(sacrash$`Local Government Area`)
sacrash$`Total Units` <- as.numeric(sacrash$`Total Units`)
sacrash$`Total Cas` <- as.numeric(sacrash$`Total Cas`)
sacrash$`Total Fats`<- as.numeric(sacrash$`Total Fats`)
sacrash$`Total SI` <- as.numeric(sacrash$`Total SI`)
sacrash$'Total MI' <- as.numeric(sacrash$'Total MI')
sacrash$`Area Speed` <- as.numeric(sacrash$`Area Speed`)
sacrash$`Position Type`<- as.factor(sacrash$`Position Type`)
sacrash$`Horizontal Align`[sacrash$`Horizontal Align`=="Unknown"] <- NA
sacrash$`Horizontal Align`<- as.factor(sacrash$`Horizontal Align`)
sacrash$`Vertical Align`[sacrash$`Vertical Align`=="Unknown"] <- NA
sacrash$`Vertical Align` <- as.factor(sacrash$`Vertical Align`)
sacrash$`Other Feat`[sacrash$`Other Feat`=="Unknown"] <- NA
sacrash$`Other Feat` <- as.factor(sacrash$`Other Feat`)
sacrash$`Road Surface`[sacrash$`Road Surface`=="Unknown"] <- NA
sacrash$`Road Surface` <- as.factor(sacrash$`Road Surface`)
sacrash$`Moisture Cond`[sacrash$`Moisture Cond`=="Unknown"] <- NA
sacrash$`Moisture Cond` <- as.factor(sacrash$`Moisture Cond`)
sacrash$`Weather Cond`[sacrash$`Weather Cond`=="Unknown"] <- NA
sacrash$`Weather Cond` <- as.factor(sacrash$`Weather Cond`)
sacrash$DayNight <- as.factor(sacrash$DayNight)
sacrash$`Crash Type` <- as.factor(sacrash$`Crash Type`)
sacrash$`Entity Code` <- as.factor(sacrash$`Entity Code`)
sacrash$`Traffic Ctrls`<- as.factor(sacrash$`Traffic Ctrls`)
sacrash$`Drugs Involved`[is.na(sacrash$`Drugs Involved`)] <- "N"
sacrash$`Drugs Involved` <- factor(sacrash$`Drugs Involved`, levels = c("Y", "N"), labels = c("Yes","No"))
sacrash$`DUI Involved`[is.na(sacrash$`DUI Involved`)] <- "N"
sacrash$`DUI Involved` <- factor(sacrash$`DUI Involved`, levels = c("Y", "N"), labels = c("Yes","No"))
sacrash$ACCLOC_X <- as.character(sacrash$ACCLOC_X)
sacrash$ACCLOC_Y <- as.character(sacrash$ACCLOC_Y)
sacrash$UNIQUE_LOC <- as.character(sacrash$UNIQUE_LOC)
Upon inspection using glimpse(), head() and tail(). This data is now tidy since it follows Hadley Wickham’s tidy data principles where:
glimpse(sacrash)
## Observations: 16,580
## Variables: 49
## $ REPORT_ID <chr> "2016-1-18/07/2017", "2016-2-18/07/201...
## $ `Stats Area` <fct> City, Metropolitan, Metropolitan, Metr...
## $ Suburb <fct> ADELAIDE, POORAKA, GREEN FIELDS, MITCH...
## $ Postcode <chr> "5000", "5095", "5107", "5043", "5116"...
## $ `Local Government Area` <fct> CITY OF ADELAIDE, CITY OF SALISBURY, C...
## $ `Total Units` <dbl> 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2,...
## $ `Total Cas` <dbl> 1, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ `Total Fats` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `Total SI` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `Total MI` <dbl> 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ Year <int> 2016, 2016, 2016, 2016, 2016, 2016, 20...
## $ Month <chr> "November", "December", "November", "N...
## $ Day <chr> "Wednesday", "Thursday", "Tuesday", "T...
## $ Time <time> 13:45:00, 13:00:00, 07:10:00, 10:30:0...
## $ `Area Speed` <dbl> 50, 70, 80, 60, 90, 50, 90, 60, 60, 60...
## $ `Position Type` <fct> Divided Road, Divided Road, T-Junction...
## $ `Horizontal Align` <fct> Straight road, Straight road, Straight...
## $ `Vertical Align` <fct> Level, Level, Level, Level, Level, Lev...
## $ `Other Feat` <fct> Not Applicable, Not Applicable, Not Ap...
## $ `Road Surface` <fct> Sealed, Sealed, Sealed, Sealed, Sealed...
## $ `Moisture Cond` <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry...
## $ `Weather Cond` <fct> Not Raining, Not Raining, Not Raining,...
## $ DayNight <fct> Daylight, Daylight, Daylight, Daylight...
## $ `Crash Type` <fct> Hit Pedestrian, Rear End, Rear End, Ri...
## $ `Entity Code` <fct> Pedestrian, Driver Rider, Driver Rider...
## $ `CSEF Severity` <ord> Fatal, Property Damage Only, Property ...
## $ `Traffic Ctrls` <fct> No Control, No Control, Traffic Signal...
## $ `DUI Involved` <fct> No, No, No, No, No, No, No, No, No, No...
## $ `Drugs Involved` <fct> No, No, No, No, No, No, No, No, No, No...
## $ ACCLOC_X <chr> "1328459.11", "1330553.58", "1328689.4...
## $ ACCLOC_Y <chr> "1670877.88", "1681820.29", "1685255.5...
## $ UNIQUE_LOC <chr> "13284591670878", "13305541681820", "1...
## $ `Unit No` <chr> "02", "02", "02", "02", "01", "01", "0...
## $ `No Of Cas` <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ `Veh Reg State` <fct> NA, SA, SA, NA, SA, SA, SA, SA, NA, NA...
## $ `Unit Type` <fct> Pedestrian on Road, Station Wagon, Pan...
## $ `Veh Year` <dbl> NA, 2008, 2013, NA, 1997, 2013, 1997, ...
## $ `Direction Of Travel` <fct> East, NA, South, North, South East, No...
## $ Sex <fct> Male, Male, Male, NA, Male, Female, Ma...
## $ Age <dbl> 72, 40, 23, NA, 42, 35, 37, 83, 27, NA...
## $ `Lic State` <fct> NA, South Australia, South Australia, ...
## $ `Licence Class` <fct> NA, C, C, XX, C, C, C, XX, NA, XX, XX,...
## $ `Licence Type` <fct> NA, Full, Full, NA, Full, Full, Full, ...
## $ Towing <fct> NA, Not Towing, Not Towing, NA, Not To...
## $ `Unit Movement` <fct> Walking on Road, Straight Ahead, Strai...
## $ `Number Occupants` <dbl> NA, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1...
## $ Rollover <fct> No, No, No, No, No, No, No, No, No, No...
## $ Fire <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO...
## $ Date <S3: yearmon> Nov 2016, Dec 2016, Nov 2016, ...
head(sacrash)
tail(sacrash)
sacrash$Year <- as.numeric(sacrash$Year)
sacrash$`Veh Year`<- as.numeric(sacrash$`Veh Year`)
sacrash <- sacrash %>% mutate(vehicle_age = Year - `Veh Year`)
head(sacrash$vehicle_age)
## [1] NA 8 3 NA 19 3
colSums() was used to get total number of missing values in each variables. For categoriable variables’ missing values, mode was used to replace those values. Then sum(is.na(variable_name)) is used to check if all missing values were replaced successfully.
After all the imputations for missing values, colSums(is.na(sacrash)) was checked again and as expected, now there are no missing values.
colSums(is.na(sacrash))
## REPORT_ID Stats Area Suburb
## 0 0 0
## Postcode Local Government Area Total Units
## 0 144 0
## Total Cas Total Fats Total SI
## 0 0 0
## Total MI Year Month
## 0 0 0
## Day Time Area Speed
## 0 0 0
## Position Type Horizontal Align Vertical Align
## 0 7 24
## Other Feat Road Surface Moisture Cond
## 120 3 71
## Weather Cond DayNight Crash Type
## 86 0 0
## Entity Code CSEF Severity Traffic Ctrls
## 0 0 0
## DUI Involved Drugs Involved ACCLOC_X
## 0 0 25
## ACCLOC_Y UNIQUE_LOC Unit No
## 25 25 0
## No Of Cas Veh Reg State Unit Type
## 0 1714 0
## Veh Year Direction Of Travel Sex
## 2025 303 1351
## Age Lic State Licence Class
## 1973 2828 758
## Licence Type Towing Unit Movement
## 3729 1494 268
## Number Occupants Rollover Fire
## 435 0 0
## Date vehicle_age
## 0 2025
sacrash$`Horizontal Align` <- impute(sacrash$`Horizontal Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))
## [1] 0
sacrash$`Vertical Align` <- impute(sacrash$`Vertical Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))
## [1] 0
sacrash$`Vertical Align` <- impute(sacrash$`Vertical Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))
## [1] 0
sacrash$`Other Feat` <- impute(sacrash$`Other Feat`, fun = mode)
sum(is.na(sacrash$`Other Feat`))
## [1] 0
sacrash$`Road Surface` <- impute(sacrash$`Road Surface`, fun = mode)
sum(is.na(sacrash$`Road Surface`))
## [1] 0
sacrash$`Moisture Cond` <- impute(sacrash$`Moisture Cond`, fun = mode)
sum(is.na(sacrash$`Moisture Cond`))
## [1] 0
sacrash$`Weather Cond` <- impute(sacrash$`Weather Cond`, fun = mode)
sum(is.na(sacrash$`Weather Cond`))
## [1] 0
sacrash$`Veh Reg State` <- impute(sacrash$`Veh Reg State`, fun = mode)
sum(is.na(sacrash$`Veh Reg State`))
## [1] 0
sacrash$`Unit Movement` <- impute(sacrash$`Unit Movement`, fun = mode)
sum(is.na(sacrash$`Unit Movement`))
## [1] 0
sacrash$`Direction Of Travel` <- impute(sacrash$`Direction Of Travel`, fun = mode)
sum(is.na(sacrash$`Direction Of Travel`))
## [1] 0
sacrash$Sex <- impute(sacrash$Sex, fun = mode)
sum(is.na(sacrash$Sex))
## [1] 0
sacrash$`Lic State` <- impute(sacrash$`Lic State`, fun = mode)
sum(is.na(sacrash$`Lic State`))
## [1] 0
sacrash$`Licence Class` <- impute(sacrash$`Licence Class`, fun = mode)
sum(is.na(sacrash$`Licence Class`))
## [1] 0
sacrash$`Licence Type` <- impute(sacrash$`Licence Type`, fun = mode)
sum(is.na(sacrash$`Licence Type`))
## [1] 0
sacrash$Towing <- impute(sacrash$Towing, fun = mode)
sum(is.na(sacrash$Towing))
## [1] 0
sacrash$ACCLOC_X <- impute(sacrash$ACCLOC_X, fun = mode)
sum(is.na(sacrash$ACCLOC_X))
## [1] 0
sacrash$ACCLOC_Y <- impute(sacrash$ACCLOC_Y, fun = mode)
sum(is.na(sacrash$ACCLOC_Y))
## [1] 0
sacrash$UNIQUE_LOC <- impute(sacrash$UNIQUE_LOC, fun = mode)
sum(is.na(sacrash$UNIQUE_LOC))
## [1] 0
sacrash$`Local Government Area`<- impute(sacrash$`Local Government Area`, fun = mode)
sum(is.na(sacrash$`Local Government Area`))
## [1] 0
sacrash$vehicle_age <- impute(sacrash$vehicle_age, fun = median)
sum(is.na(sacrash$vehicle_age))
## [1] 0
sacrash$`Veh Year` <- impute(sacrash$`Veh Year`, fun = median)
sum(is.na(sacrash$`Veh Year`))
## [1] 0
sacrash$`Number Occupants` <- impute(sacrash$`Number Occupants`, fun = median)
sum(is.na(sacrash$`Number Occupants`))
## [1] 0
sacrash$Age <- impute(sacrash$Age, fun = median)
sum(is.na(sacrash$Age))
## [1] 0
colSums(is.na(sacrash))
## REPORT_ID Stats Area Suburb
## 0 0 0
## Postcode Local Government Area Total Units
## 0 0 0
## Total Cas Total Fats Total SI
## 0 0 0
## Total MI Year Month
## 0 0 0
## Day Time Area Speed
## 0 0 0
## Position Type Horizontal Align Vertical Align
## 0 0 0
## Other Feat Road Surface Moisture Cond
## 0 0 0
## Weather Cond DayNight Crash Type
## 0 0 0
## Entity Code CSEF Severity Traffic Ctrls
## 0 0 0
## DUI Involved Drugs Involved ACCLOC_X
## 0 0 0
## ACCLOC_Y UNIQUE_LOC Unit No
## 0 0 0
## No Of Cas Veh Reg State Unit Type
## 0 0 0
## Veh Year Direction Of Travel Sex
## 0 0 0
## Age Lic State Licence Class
## 0 0 0
## Licence Type Towing Unit Movement
## 0 0 0
## Number Occupants Rollover Fire
## 0 0 0
## Date vehicle_age
## 0 0
For Number Occupants variable, summary() was used to inspect general statistics of the variable and its maximum value appears to be an error since 999 number of occupants contained in a vehicle is impossible. After using unique() to investigate further there are four large values and their corresponding type of vehicle is omnibus. It is possible for omnibus to contain 50 seats. So 99 and 999 is way too large and replaced with 50 instead.
For Vehicle Year variable, at first there appears to be 93 outliers after checking for outliers using z-scores. These were then imputed with mean value but there were still 28 outliers reported. However, when summary() is called, the minimum value is 1986 and it is plausible to have car from as old as 1986 to be on the road.
For variable Vehicle_age, mutated from vehicle year and year data collated which was in 2016. Since vehicle year’s missing values and outliers has been imputed with new values. We’re going to mutate this variable again then use similar process, z-scores, to check outliers and as expected now there is none.
summary(sacrash$`Number Occupants`)
##
## 435 values imputed to 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.409 1.000 999.000
unique(sacrash$`Number Occupants`)
## [1] 1 3 99 2 4 0 20 12 5 6 8 7 10 11 15 13 50
## [18] 999 32
which(sacrash$`Number Occupants` == 99)
## [1] 16
sacrash[16,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
which(sacrash$`Number Occupants` == 999)
## [1] 15073
sacrash[15073,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
which(sacrash$`Number Occupants` == 50)
## [1] 5772
sacrash[5772,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
which(sacrash$`Number Occupants` == 32)
## [1] 16323
sacrash[16323,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
sacrash$`Number Occupants`[sacrash$`Number Occupants` == 999 ] <- 50
sacrash$`Number Occupants`[sacrash$`Number Occupants` == 99 ] <- 50
summary(sacrash$`Number Occupants`)
##
## 435 values imputed to 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.349 1.000 50.000
summary(sacrash$`Veh Year`)
##
## 2025 values imputed to 2005
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1924 2001 2005 2005 2010 2016
z.scores1 <- sacrash$`Veh Year` %>% scores(type = "z")
length(which(abs(z.scores1)>3))
## [1] 93
sacrash$`Veh Year`[ which(abs(z.scores1)>3) ] <- mean(sacrash$`Veh Year`, na.rm = TRUE)
length(which(abs(z.scores1)>3))
## [1] 93
summary(sacrash$`Veh Year`)
##
## 2025 values imputed to 2005
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1986 2001 2005 2005 2010 2016
sacrash <- sacrash %>% mutate(vehicle_age = Year - `Veh Year`)
summary(sacrash$vehicle_age)
##
## 2025 values imputed to 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.00 11.00 11.05 15.00 30.00
z.scores2 <- sacrash$`Veh Year` %>% scores(type = "z")
length(which(abs(z.scores2)>3))
## [1] 28
For this step, we are attempting to transform variable Age to decrease the skewness and convert the distribution to a normal distribution. Several mathematical transformations and box-cox transformation were applied and in conclusion, log is the most effective according to histograms.
hist(sacrash$Age)
boxcox_age <- BoxCox(sacrash$Age, lambda = "auto")
hist(boxcox_age)
log_age <- log10(sacrash$Age)
hist(log_age)
ln_age <- log(sacrash$Age)
hist(ln_age)
sqrt_age <- sqrt(sacrash$Age)
hist(sqrt_age)
age_square <- sacrash$Age^2
hist(age_square)
age_reciprocal <- sacrash$Age^(-1)
hist(age_reciprocal)