MATH2349 Semester 1, 2018

Required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(stringr)

## Warning: package 'stringr' was built under R version 3.4.4

library(zoo)

## Warning: package 'zoo' was built under R version 3.4.4

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 3.4.4

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.4.4

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.4.4

library(outliers)
library(forecast)

## Warning: package 'forecast' was built under R version 3.4.4

Executive Summary

For this data preprocessing assignment, we are getting the data of road crashes in South Australia in 2016 ready for analysis. The followings are summary of the steps taken to achieve this:

First of all, two data sets were merged to join crash-related data such as nature of the crash and crash site contained in one data set to another containing information about corresponding units involved in the crashes and their occupants. After joining, the merged data set was filtered to contain only observations of the unit responsible cases.
Then a complete inspection of variables was done to understand each variables’ specification and related data. After this, appropriate type conversions were applied and missing data entries in various forms such as “Unknown”, “UNKNOWN” or “XXXX” was all converted to NA for consistency. Year and Month variables were combined and converted to a date format. A variable name was changed to make it more explanatory. A variable ‘Vehicle Age’ was mutated from two existing variables, (‘Year’ - ‘Veh Year’). After these tasks were done, upon inspection, this data set is now in tidy data format according to Hadley Wickham’s tidy data principles.
Missing values of all categorical variables and all numeric variables were dealt with by imputing with each variables’ value of mode and median respectively.
Outliers of all numeric variables were then scanned using z-scores and imputed with mean values where necessary. Summary statistics and observation inquiries using subsetting were also used to inspect suspected outliers to see if they could be plausible in the context of the problem.
Numerical variable ‘Age’ was finally transformed to decrease the skewness and convert the distribution to a more normal distribution.

Data

Two data sets were obtained from the following [link] (https://data.sa.gov.au/data/dataset/road-crash-data/resource/446afe5b-4e01-4cdf-a281- edd25aaf3802) These data sets are comprised of the information regarding road crashes in South AUstralia in 2016, which were reported to the police where at least one person being killed or injured, vehicle towed away, or total property damage of $5000 or greater.

The first data set, “Crashtype”, concerns relevant information about the nature of the crash and crash site with the following specification of variables:

REPORT_ID: Unique number assigned to records to separate individual crashes
Stats Area: A code defining whether the road crash occurred within City, Metro or Country Area
Suburb: The Suburb that the crash occurred in
Postcode: The Postcode that the crash occurred in
LGA Name: The Local Government Area that the crash occurred in
Total Units: The total number of units involved in a road crash
Total Cas: Total number of casualties as a result of a road crash
Total Fats: Total number of fatalities as a result of a road crash
Total SI: Total number of people admitted to hospital as a result of a road crash
Total MI: Total number of people treated by private doctor or treated at hospital but not admitted
Year: Year of Crash
Month: Month crash occurred
Day: The day of the week the crash occurred
Time: The reported time of the crash
Area Speed: The reported time of the crash
Position Type: Identifying if a crash location was at an intersection or midblock
HOrizontal Align: Defines the horizontal alignment of the road at the sight of the crash
Vertical Align: Defines the vertical alignment of the road at the sight of the crash
Other Feat: Defines other relevant features of the crash site locations
Road SUrface: Defines the road surface type at the crash location
Moisture Cond: Defines the pavement surface moisture condition at the crash location
Weather Cond: Defines the weather condition at the time and location of the crash
DayNight: The lighting condition at the time and location of the the crash
Crash Type: Defines the road crash type
Unit Resp: The number of the unit determined to be responsible for the road crash
Entity Code: A code defining the entity deemed to be responsible for the road crash
CSEF Severity: Defines the road crash severity (classified by the highest injury severity sustained in the crash)
Traffic Ctrls: Defines the traffic control at the time and location of the road crash
DUI Involved: Involved if at least one controller in the crash recorded an illegal Blood Alcohol Concentration level
Drugs Involved: Involved if at least one controller in the crash recorded tested positive for a prescribed drug
ACCLOC_X: The X coordinate of the crash when located
ACCLOC_Y: The Y coordinate of the crash when located
UNIQUE_LOC: A concatenation of the X and Y coordinates for the purpose of grouping crashes

The second data set, “Unit”, concerns relevant identity information about all units (vehicles and corresponding occupants) as well as whether rollover or fire occurred. The following are variables’ specifications:

REPORT_ID: Unique number assigned to records to separate individual crashes
Unit No: A number assigned to a unit involved in a road crash
No of Cas: The total number of casualties for a unit involved in a road crash
Veh Reg State: The state or Territory the unit is registered in
Unit Type: A code defining the type of unit involved in a road crash
Veh Year: The year of manufacture of a vehicle.
Direction of Travel: A code defining the direction of travel of a unit immediately prior to the road crash
Sex: A code defining the sex of the controller of the unit involved in a road crash
Age: The age of the controller of the unit involved in a road crash
Lic State: A code defining the state of issue of the licence of the driver/rider
Licence Class: A code defining the class of the licence held by the driver/rider
LIcence Type: A code defining the level of licence held by the driver/rider
Towing: A code defining if the unit was towing another unit
Unit Movement: A code defining the movement of a unit immediately prior to the road crash
Number Occupants: A code defining the number of occupants in the unit at the time of the crash
Postcode: The residential postcode of the unit in a road crash
Rollover: The residential postcode of the unit in a road crash
Fire: Vehicle caught fire at some stage in the crash

In order to import and read these datasets into R, I did the following steps: * From Environment pane, clicked import Dataset > From text(readr)… * Set name as ‘Crashtype’ and ‘Unit’. * Copied the code from Code Preview box into R script for future loading. * Used class() to check if they are imported as dataframe.

Variable ‘Postcode’ exists in both dataset. One describes postcode of where the crash occurred while the other is where the occupant resides at. General subset was used to remove ‘Postcode’ variable in the latter since we are only interested in crash-related data.

full_join() was used to merge the two dataset by key variable ‘REPORT ID’, new data set named ‘sacrash’.

There are multiple units involved in each crash and this is shown in ’Unit" dataset. However, the “Crashtype” dataset contains only information for the unit responsible in the crash. Since we are interested in only the crash-related data, filter() from dplyr was used to subset the merged dataset to include those units responsible.

Crashtype <- read_csv("C:/Users/mengh/OneDrive/Desktop/Data Preprocessing/Datapreprocessing Assignment 3/2016_DATA_SA_Crash.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Postcode = col_integer(),
##   `Total Units` = col_integer(),
##   `Total Cas` = col_integer(),
##   `Total Fats` = col_integer(),
##   `Total SI` = col_integer(),
##   `Total MI` = col_integer(),
##   Year = col_integer(),
##   Time = col_time(format = ""),
##   ACCLOC_X = col_double(),
##   ACCLOC_Y = col_double()
## )

## See spec(...) for full column specifications.

Unit <- read_csv("C:/Users/mengh/OneDrive/Desktop/Data Preprocessing/Datapreprocessing Assignment 3/2016_DATA_SA_Units.csv")

## Parsed with column specification:
## cols(
##   REPORT_ID = col_character(),
##   `Unit No` = col_character(),
##   `No Of Cas` = col_integer(),
##   `Veh Reg State` = col_character(),
##   `Unit Type` = col_character(),
##   `Veh Year` = col_character(),
##   `Direction Of Travel` = col_character(),
##   Sex = col_character(),
##   Age = col_character(),
##   `Lic State` = col_character(),
##   `Licence Class` = col_character(),
##   `Licence Type` = col_character(),
##   Towing = col_character(),
##   `Unit Movement` = col_character(),
##   `Number Occupants` = col_character(),
##   Postcode = col_character(),
##   Rollover = col_character(),
##   Fire = col_character()
## )

head(Crashtype)

head(Unit)

class(Crashtype)

## [1] "tbl_df"     "tbl"        "data.frame"

class(Unit)

## [1] "tbl_df"     "tbl"        "data.frame"

Unit <- Unit %>% select(-Postcode)
sacrash <- full_join(Crashtype, Unit, by = "REPORT_ID")
sacrash <- sacrash %>% filter(`Unit Resp`== `Unit No`)
sacrash <- sacrash %>% select(-`Unit Resp`)

Understand

Some of the variables were imported in incorrect types. Therefore, they were converted using:

as.numeric(), for example, ‘Number Occupants’
as.factor(), for example, ‘Veh Reg State’
factor() for ordered factor variables, for example, ‘CSEF Severity’

Since year, month but only day of the week are given in three separate columns, ymd() cannot be used to convert to year-month-day date format. Instead, Year and Month was pasted together with separator “-” as a new variable, then as.yearmon() from zoo package was used to convert to just year-month format and leaving day of the week variable as it is.

Upon inspection using unique(variable_name) for each variables, some missing values in some variables were entered as “Unknown”, “UNKNOWN” and “XXXX”. In this case, vector subset was used to convert these values to NA before type conversions so that these values can be grouped together with missing values to determine the total number of missing values. Unique(variable_name) was used again after this process to check if this is successful.

For variables ‘Rollover’, ‘Fire’, ‘DUI Involved’ and ‘Drugs Involved’. Upon inspection, it is assumed that data entry is only entered as “Y” when the above occurred which is a small percentage compared to the whole datset. However, before proceeding to data analysis, this would need clarification from data collector if this is the actual case. In this instance, it is assumed so. Therefore, where there is no data entry, it is substituted with “N” then convert to factor with labels “Yes” and “No”.

‘LGA’ variable was changed to ‘Local Government Area’ for better explanatory purpose using colnames().

glimpse(sacrash)

## Observations: 16,580
## Variables: 48
## $ REPORT_ID             <chr> "2016-1-18/07/2017", "2016-2-18/07/2017"...
## $ `Stats Area`          <chr> "1 City", "2 Metropolitan", "2 Metropoli...
## $ Suburb                <chr> "ADELAIDE", "POORAKA", "GREEN FIELDS", "...
## $ Postcode              <int> 5000, 5095, 5107, 5043, 5116, 5251, 5094...
## $ `LGA Name`            <chr> "CITY OF ADELAIDE", "CITY OF SALISBURY",...
## $ `Total Units`         <int> 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2...
## $ `Total Cas`           <int> 1, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ `Total Fats`          <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Total SI`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Total MI`            <int> 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ Year                  <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016...
## $ Month                 <chr> "November", "December", "November", "Nov...
## $ Day                   <chr> "Wednesday", "Thursday", "Tuesday", "Tue...
## $ Time                  <time> 13:45:00, 13:00:00, 07:10:00, 10:30:00,...
## $ `Area Speed`          <chr> "050", "070", "080", "060", "090", "050"...
## $ `Position Type`       <chr> "Divided Road", "Divided Road", "T-Junct...
## $ `Horizontal Align`    <chr> "Straight road", "Straight road", "Strai...
## $ `Vertical Align`      <chr> "Level", "Level", "Level", "Level", "Lev...
## $ `Other Feat`          <chr> "Not Applicable", "Not Applicable", "Not...
## $ `Road Surface`        <chr> "Sealed", "Sealed", "Sealed", "Sealed", ...
## $ `Moisture Cond`       <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry"...
## $ `Weather Cond`        <chr> "Not Raining", "Not Raining", "Not Raini...
## $ DayNight              <chr> "Daylight", "Daylight", "Daylight", "Day...
## $ `Crash Type`          <chr> "Hit Pedestrian", "Rear End", "Rear End"...
## $ `Entity Code`         <chr> "Pedestrian", "Driver Rider", "Driver Ri...
## $ `CSEF Severity`       <chr> "4: Fatal", "1: PDO", "1: PDO", "1: PDO"...
## $ `Traffic Ctrls`       <chr> "No Control", "No Control", "Traffic Sig...
## $ `DUI Involved`        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `Drugs Involved`      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ACCLOC_X              <dbl> 1328459, 1330554, 1328689, 1324837, 1339...
## $ ACCLOC_Y              <dbl> 1670878, 1681820, 1685256, 1660685, 1700...
## $ UNIQUE_LOC            <chr> "13284591670878", "13305541681820", "132...
## $ `Unit No`             <chr> "02", "02", "02", "02", "01", "01", "02"...
## $ `No Of Cas`           <int> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ `Veh Reg State`       <chr> NA, "SA", "SA", "UNKNOWN", "SA", "SA", "...
## $ `Unit Type`           <chr> "Pedestrian on Road", "Station Wagon", "...
## $ `Veh Year`            <chr> NA, "2008", "2013", "XXXX", "1997", "201...
## $ `Direction Of Travel` <chr> "East", "Unknown", "South", "North", "So...
## $ Sex                   <chr> "Male", "Male", "Male", "Unknown", "Male...
## $ Age                   <chr> "072", "040", "023", "XXX", "042", "035"...
## $ `Lic State`           <chr> NA, "SA", "SA", "UNKNOWN", "SA", "SA", "...
## $ `Licence Class`       <chr> NA, "C", "C", "XX", "C", "C", "C", "XX",...
## $ `Licence Type`        <chr> NA, "Full", "Full", "Unknown", "Full", "...
## $ Towing                <chr> NA, "Not Towing", "Not Towing", "Unknown...
## $ `Unit Movement`       <chr> "Walking on Road", "Straight Ahead", "St...
## $ `Number Occupants`    <chr> NA, "001", "001", "001", "001", "001", "...
## $ Rollover              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fire                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

dim(sacrash)

## [1] 16580    48

sacrash <- sacrash %>% mutate(Date = paste(Year,Month, sep = "-"))
sacrash$Date <- as.yearmon(sacrash$Date, "%Y-%B")
class(sacrash$Date)

## [1] "yearmon"

sacrash$`Number Occupants` <- as.numeric(sacrash$`Number Occupants`)

unique(sacrash$`Veh Reg State`)

##  [1] NA        "SA"      "UNKNOWN" "NSW"     "VIC"     "QLD"     "WA"     
##  [8] "NT"      "TAS"     "FEDERAL" "ACT"

sacrash$`Veh Reg State`[sacrash$`Veh Reg State` == "UNKNOWN"] <- NA
sacrash$`Veh Reg State` <- as.factor(sacrash$`Veh Reg State`)

unique(sacrash$`CSEF Severity`)

## [1] "4: Fatal" "1: PDO"   "2: MI"    "3: SI"

sacrash$`CSEF Severity`<- factor(sacrash$`CSEF Severity`, levels = c("1: PDO", "2: MI", "3: SI", "4: Fatal"), labels = c("Property Damage Only", "Minor Injury", "Serious Injury", "Fatal"), ordered = TRUE)
levels(sacrash$`CSEF Severity`)

## [1] "Property Damage Only" "Minor Injury"         "Serious Injury"      
## [4] "Fatal"

sacrash$`No Of Cas` <- as.numeric(sacrash$`No Of Cas`)

sacrash$`Unit Type` <- as.factor(sacrash$`Unit Type`)
levels(sacrash$`Unit Type`)

##  [1] "Animal - Domestic - Not Ridden" "Animal - Wild"                 
##  [3] "Animal Drawn Vehicle"           "BDOUBLE - ROAD TRAIN"          
##  [5] "Forward Control Passenger Van"  "Light Truck LT 4.5T"           
##  [7] "Motor Cars - Sedan"             "Motor Cars - Tourer"           
##  [9] "Motor Cycle"                    "Motor Vehicle - Type Unknown"  
## [11] "Motorised Wheelchair/Gopher"    "OMNIBUS"                       
## [13] "Other Defined Special Vehicle"  "Panel Van"                     
## [15] "Pedal Cycle"                    "Pedestrian on Road"            
## [17] "Power Asst. Bicycle"            "RIGID TRUCK LGE GE 4.5T"       
## [19] "Scooter"                        "SEMI TRAILER"                  
## [21] "Small Wheel Vehicle User"       "Station Wagon"                 
## [23] "Taxi Cab"                       "Utility"                       
## [25] "Wheelchair / Elec. Wheelchair"

sacrash$`Veh Year`[sacrash$`Veh Year` == "XXXX"] <- NA
sacrash$`Veh Year` <- as.numeric(sacrash$`Veh Year`)

sacrash$`Direction Of Travel`[sacrash$`Direction Of Travel` == "Unknown"] <- NA
sacrash$`Direction Of Travel` <- as.factor(sacrash$`Direction Of Travel`)

sacrash$Sex[sacrash$Sex == "Unknown"] <- NA
sacrash$Sex <- as.factor(sacrash$Sex)

sacrash$Age <- as.numeric(sacrash$Age)

## Warning: NAs introduced by coercion

sacrash$`Lic State` <- factor(sacrash$`Lic State`, levels = c("ACT", "NSw", "NT", "O/S", "QLD","SA","TAS", "VIC", "WA"), labels = c("Australian Capital Territory", "New South Wales", "Northern Territory", "Overseas","Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia"))

sacrash$`Licence Class` <- as.factor(sacrash$`Licence Class`)

sacrash$`Licence Type`[sacrash$`Licence Type`=="Unknown"] <- NA
sacrash$`Licence Type` <- as.factor(sacrash$`Licence Type`)

sacrash$Towing[sacrash$Towing=="Unknown"] <- NA
sacrash$Towing<- as.factor(sacrash$Towing)

sacrash$`Unit Movement` <- as.factor(sacrash$`Unit Movement`)

sacrash$Postcode <- as.character(sacrash$Postcode)

sacrash$Rollover[is.na(sacrash$Rollover)] <- "N"
sacrash$Rollover <- factor(sacrash$Rollover, levels = c("Y", "N"), labels = c("Yes","No"))
unique(sacrash$Rollover)

## [1] No  Yes
## Levels: Yes No

sacrash$Fire[is.na(sacrash$Fire)] <- "N"
sacrash$Fire <- factor(sacrash$Fire, levels = c("Y","N"), labels = c("Yes", "NO"))
unique(sacrash$Fire)

## [1] NO  Yes
## Levels: Yes NO

sacrash$`Stats Area` <- factor(sacrash$`Stats Area`, levels = c("1 City","2 Metropolitan", "3 Country"), labels = c("City", "Metropolitan", "Country"))

sacrash$Suburb <- as.factor(sacrash$Suburb)

colnames(sacrash)[5] <- "Local Government Area"
sacrash$`Local Government Area` <- as.factor(sacrash$`Local Government Area`)

sacrash$`Total Units` <- as.numeric(sacrash$`Total Units`)

sacrash$`Total Cas` <- as.numeric(sacrash$`Total Cas`)

sacrash$`Total Fats`<- as.numeric(sacrash$`Total Fats`)

sacrash$`Total SI` <- as.numeric(sacrash$`Total SI`)

sacrash$'Total MI' <- as.numeric(sacrash$'Total MI')

sacrash$`Area Speed` <- as.numeric(sacrash$`Area Speed`)

sacrash$`Position Type`<- as.factor(sacrash$`Position Type`)

sacrash$`Horizontal Align`[sacrash$`Horizontal Align`=="Unknown"] <- NA
sacrash$`Horizontal Align`<- as.factor(sacrash$`Horizontal Align`)

sacrash$`Vertical Align`[sacrash$`Vertical Align`=="Unknown"] <- NA
sacrash$`Vertical Align` <- as.factor(sacrash$`Vertical Align`)

sacrash$`Other Feat`[sacrash$`Other Feat`=="Unknown"] <- NA
sacrash$`Other Feat` <- as.factor(sacrash$`Other Feat`)

sacrash$`Road Surface`[sacrash$`Road Surface`=="Unknown"] <- NA
sacrash$`Road Surface` <- as.factor(sacrash$`Road Surface`)

sacrash$`Moisture Cond`[sacrash$`Moisture Cond`=="Unknown"] <- NA
sacrash$`Moisture Cond` <- as.factor(sacrash$`Moisture Cond`)

sacrash$`Weather Cond`[sacrash$`Weather Cond`=="Unknown"] <- NA
sacrash$`Weather Cond` <- as.factor(sacrash$`Weather Cond`)

sacrash$DayNight <- as.factor(sacrash$DayNight)

sacrash$`Crash Type` <- as.factor(sacrash$`Crash Type`)

sacrash$`Entity Code` <- as.factor(sacrash$`Entity Code`)

sacrash$`Traffic Ctrls`<- as.factor(sacrash$`Traffic Ctrls`)

sacrash$`Drugs Involved`[is.na(sacrash$`Drugs Involved`)] <- "N"
sacrash$`Drugs Involved` <- factor(sacrash$`Drugs Involved`, levels = c("Y", "N"), labels = c("Yes","No"))

sacrash$`DUI Involved`[is.na(sacrash$`DUI Involved`)] <- "N"
sacrash$`DUI Involved` <- factor(sacrash$`DUI Involved`, levels = c("Y", "N"), labels = c("Yes","No"))

sacrash$ACCLOC_X <- as.character(sacrash$ACCLOC_X)

sacrash$ACCLOC_Y <- as.character(sacrash$ACCLOC_Y)

sacrash$UNIQUE_LOC <- as.character(sacrash$UNIQUE_LOC)

Tidy & Manipulate Data I

Upon inspection using glimpse(), head() and tail(). This data is now tidy since it follows Hadley Wickham’s tidy data principles where:

Each variable have its own column.
Each observation have its own row.
Each value have its own cell.

glimpse(sacrash)

## Observations: 16,580
## Variables: 49
## $ REPORT_ID               <chr> "2016-1-18/07/2017", "2016-2-18/07/201...
## $ `Stats Area`            <fct> City, Metropolitan, Metropolitan, Metr...
## $ Suburb                  <fct> ADELAIDE, POORAKA, GREEN FIELDS, MITCH...
## $ Postcode                <chr> "5000", "5095", "5107", "5043", "5116"...
## $ `Local Government Area` <fct> CITY OF ADELAIDE, CITY OF SALISBURY, C...
## $ `Total Units`           <dbl> 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2,...
## $ `Total Cas`             <dbl> 1, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ `Total Fats`            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `Total SI`              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `Total MI`              <dbl> 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ Year                    <int> 2016, 2016, 2016, 2016, 2016, 2016, 20...
## $ Month                   <chr> "November", "December", "November", "N...
## $ Day                     <chr> "Wednesday", "Thursday", "Tuesday", "T...
## $ Time                    <time> 13:45:00, 13:00:00, 07:10:00, 10:30:0...
## $ `Area Speed`            <dbl> 50, 70, 80, 60, 90, 50, 90, 60, 60, 60...
## $ `Position Type`         <fct> Divided Road, Divided Road, T-Junction...
## $ `Horizontal Align`      <fct> Straight road, Straight road, Straight...
## $ `Vertical Align`        <fct> Level, Level, Level, Level, Level, Lev...
## $ `Other Feat`            <fct> Not Applicable, Not Applicable, Not Ap...
## $ `Road Surface`          <fct> Sealed, Sealed, Sealed, Sealed, Sealed...
## $ `Moisture Cond`         <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry...
## $ `Weather Cond`          <fct> Not Raining, Not Raining, Not Raining,...
## $ DayNight                <fct> Daylight, Daylight, Daylight, Daylight...
## $ `Crash Type`            <fct> Hit Pedestrian, Rear End, Rear End, Ri...
## $ `Entity Code`           <fct> Pedestrian, Driver Rider, Driver Rider...
## $ `CSEF Severity`         <ord> Fatal, Property Damage Only, Property ...
## $ `Traffic Ctrls`         <fct> No Control, No Control, Traffic Signal...
## $ `DUI Involved`          <fct> No, No, No, No, No, No, No, No, No, No...
## $ `Drugs Involved`        <fct> No, No, No, No, No, No, No, No, No, No...
## $ ACCLOC_X                <chr> "1328459.11", "1330553.58", "1328689.4...
## $ ACCLOC_Y                <chr> "1670877.88", "1681820.29", "1685255.5...
## $ UNIQUE_LOC              <chr> "13284591670878", "13305541681820", "1...
## $ `Unit No`               <chr> "02", "02", "02", "02", "01", "01", "0...
## $ `No Of Cas`             <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ `Veh Reg State`         <fct> NA, SA, SA, NA, SA, SA, SA, SA, NA, NA...
## $ `Unit Type`             <fct> Pedestrian on Road, Station Wagon, Pan...
## $ `Veh Year`              <dbl> NA, 2008, 2013, NA, 1997, 2013, 1997, ...
## $ `Direction Of Travel`   <fct> East, NA, South, North, South East, No...
## $ Sex                     <fct> Male, Male, Male, NA, Male, Female, Ma...
## $ Age                     <dbl> 72, 40, 23, NA, 42, 35, 37, 83, 27, NA...
## $ `Lic State`             <fct> NA, South Australia, South Australia, ...
## $ `Licence Class`         <fct> NA, C, C, XX, C, C, C, XX, NA, XX, XX,...
## $ `Licence Type`          <fct> NA, Full, Full, NA, Full, Full, Full, ...
## $ Towing                  <fct> NA, Not Towing, Not Towing, NA, Not To...
## $ `Unit Movement`         <fct> Walking on Road, Straight Ahead, Strai...
## $ `Number Occupants`      <dbl> NA, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1...
## $ Rollover                <fct> No, No, No, No, No, No, No, No, No, No...
## $ Fire                    <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO...
## $ Date                    <S3: yearmon> Nov 2016, Dec 2016, Nov 2016, ...

head(sacrash)

tail(sacrash)

Tidy & Manipulate Data II

sacrash$Year <- as.numeric(sacrash$Year)
sacrash$`Veh Year`<- as.numeric(sacrash$`Veh Year`)
sacrash <- sacrash %>% mutate(vehicle_age = Year - `Veh Year`)
head(sacrash$vehicle_age)

## [1] NA  8  3 NA 19  3

Scan I

colSums() was used to get total number of missing values in each variables. For categoriable variables’ missing values, mode was used to replace those values. Then sum(is.na(variable_name)) is used to check if all missing values were replaced successfully.

After all the imputations for missing values, colSums(is.na(sacrash)) was checked again and as expected, now there are no missing values.

colSums(is.na(sacrash))

##             REPORT_ID            Stats Area                Suburb 
##                     0                     0                     0 
##              Postcode Local Government Area           Total Units 
##                     0                   144                     0 
##             Total Cas            Total Fats              Total SI 
##                     0                     0                     0 
##              Total MI                  Year                 Month 
##                     0                     0                     0 
##                   Day                  Time            Area Speed 
##                     0                     0                     0 
##         Position Type      Horizontal Align        Vertical Align 
##                     0                     7                    24 
##            Other Feat          Road Surface         Moisture Cond 
##                   120                     3                    71 
##          Weather Cond              DayNight            Crash Type 
##                    86                     0                     0 
##           Entity Code         CSEF Severity         Traffic Ctrls 
##                     0                     0                     0 
##          DUI Involved        Drugs Involved              ACCLOC_X 
##                     0                     0                    25 
##              ACCLOC_Y            UNIQUE_LOC               Unit No 
##                    25                    25                     0 
##             No Of Cas         Veh Reg State             Unit Type 
##                     0                  1714                     0 
##              Veh Year   Direction Of Travel                   Sex 
##                  2025                   303                  1351 
##                   Age             Lic State         Licence Class 
##                  1973                  2828                   758 
##          Licence Type                Towing         Unit Movement 
##                  3729                  1494                   268 
##      Number Occupants              Rollover                  Fire 
##                   435                     0                     0 
##                  Date           vehicle_age 
##                     0                  2025

sacrash$`Horizontal Align` <- impute(sacrash$`Horizontal Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))

## [1] 0

sacrash$`Vertical Align` <- impute(sacrash$`Vertical Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))

## [1] 0

sacrash$`Vertical Align` <- impute(sacrash$`Vertical Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))

## [1] 0

sacrash$`Other Feat` <- impute(sacrash$`Other Feat`, fun = mode)
sum(is.na(sacrash$`Other Feat`))

## [1] 0

sacrash$`Road Surface` <- impute(sacrash$`Road Surface`, fun = mode)
sum(is.na(sacrash$`Road Surface`))

## [1] 0

sacrash$`Moisture Cond` <- impute(sacrash$`Moisture Cond`, fun = mode)
sum(is.na(sacrash$`Moisture Cond`))

## [1] 0

sacrash$`Weather Cond` <- impute(sacrash$`Weather Cond`, fun = mode)
sum(is.na(sacrash$`Weather Cond`))

## [1] 0

sacrash$`Veh Reg State` <- impute(sacrash$`Veh Reg State`, fun = mode)
sum(is.na(sacrash$`Veh Reg State`))

## [1] 0

sacrash$`Unit Movement` <- impute(sacrash$`Unit Movement`, fun = mode)
sum(is.na(sacrash$`Unit Movement`))

## [1] 0

sacrash$`Direction Of Travel` <- impute(sacrash$`Direction Of Travel`, fun = mode)
sum(is.na(sacrash$`Direction Of Travel`))

## [1] 0

sacrash$Sex <- impute(sacrash$Sex, fun = mode)
sum(is.na(sacrash$Sex))

## [1] 0

sacrash$`Lic State` <- impute(sacrash$`Lic State`, fun = mode)
sum(is.na(sacrash$`Lic State`))

## [1] 0

sacrash$`Licence Class` <- impute(sacrash$`Licence Class`, fun = mode)
sum(is.na(sacrash$`Licence Class`))

## [1] 0

sacrash$`Licence Type` <- impute(sacrash$`Licence Type`, fun = mode)
sum(is.na(sacrash$`Licence Type`))

## [1] 0

sacrash$Towing <- impute(sacrash$Towing, fun = mode)
sum(is.na(sacrash$Towing))

## [1] 0

sacrash$ACCLOC_X <- impute(sacrash$ACCLOC_X, fun = mode)
sum(is.na(sacrash$ACCLOC_X))

## [1] 0

sacrash$ACCLOC_Y <- impute(sacrash$ACCLOC_Y, fun = mode)
sum(is.na(sacrash$ACCLOC_Y))

## [1] 0

sacrash$UNIQUE_LOC <- impute(sacrash$UNIQUE_LOC, fun = mode)
sum(is.na(sacrash$UNIQUE_LOC))

## [1] 0

sacrash$`Local Government Area`<- impute(sacrash$`Local Government Area`, fun = mode)
sum(is.na(sacrash$`Local Government Area`))

## [1] 0

sacrash$vehicle_age <- impute(sacrash$vehicle_age, fun = median)
sum(is.na(sacrash$vehicle_age))

## [1] 0

sacrash$`Veh Year` <- impute(sacrash$`Veh Year`, fun = median)
sum(is.na(sacrash$`Veh Year`))

## [1] 0

sacrash$`Number Occupants` <- impute(sacrash$`Number Occupants`, fun = median)
sum(is.na(sacrash$`Number Occupants`))

## [1] 0

sacrash$Age <- impute(sacrash$Age, fun = median)
sum(is.na(sacrash$Age))

## [1] 0

colSums(is.na(sacrash))

##             REPORT_ID            Stats Area                Suburb 
##                     0                     0                     0 
##              Postcode Local Government Area           Total Units 
##                     0                     0                     0 
##             Total Cas            Total Fats              Total SI 
##                     0                     0                     0 
##              Total MI                  Year                 Month 
##                     0                     0                     0 
##                   Day                  Time            Area Speed 
##                     0                     0                     0 
##         Position Type      Horizontal Align        Vertical Align 
##                     0                     0                     0 
##            Other Feat          Road Surface         Moisture Cond 
##                     0                     0                     0 
##          Weather Cond              DayNight            Crash Type 
##                     0                     0                     0 
##           Entity Code         CSEF Severity         Traffic Ctrls 
##                     0                     0                     0 
##          DUI Involved        Drugs Involved              ACCLOC_X 
##                     0                     0                     0 
##              ACCLOC_Y            UNIQUE_LOC               Unit No 
##                     0                     0                     0 
##             No Of Cas         Veh Reg State             Unit Type 
##                     0                     0                     0 
##              Veh Year   Direction Of Travel                   Sex 
##                     0                     0                     0 
##                   Age             Lic State         Licence Class 
##                     0                     0                     0 
##          Licence Type                Towing         Unit Movement 
##                     0                     0                     0 
##      Number Occupants              Rollover                  Fire 
##                     0                     0                     0 
##                  Date           vehicle_age 
##                     0                     0

Scan II

For Number Occupants variable, summary() was used to inspect general statistics of the variable and its maximum value appears to be an error since 999 number of occupants contained in a vehicle is impossible. After using unique() to investigate further there are four large values and their corresponding type of vehicle is omnibus. It is possible for omnibus to contain 50 seats. So 99 and 999 is way too large and replaced with 50 instead.

For Vehicle Year variable, at first there appears to be 93 outliers after checking for outliers using z-scores. These were then imputed with mean value but there were still 28 outliers reported. However, when summary() is called, the minimum value is 1986 and it is plausible to have car from as old as 1986 to be on the road.

For variable Vehicle_age, mutated from vehicle year and year data collated which was in 2016. Since vehicle year’s missing values and outliers has been imputed with new values. We’re going to mutate this variable again then use similar process, z-scores, to check outliers and as expected now there is none.

summary(sacrash$`Number Occupants`)

## 
##  435 values imputed to 1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.409   1.000 999.000

unique(sacrash$`Number Occupants`)

##  [1]   1   3  99   2   4   0  20  12   5   6   8   7  10  11  15  13  50
## [18] 999  32

which(sacrash$`Number Occupants` == 99)

## [1] 16

sacrash[16,]$`Unit Type`

## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair

which(sacrash$`Number Occupants` == 999)

## [1] 15073

sacrash[15073,]$`Unit Type`

## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair

which(sacrash$`Number Occupants` == 50)

## [1] 5772

sacrash[5772,]$`Unit Type`

## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair

which(sacrash$`Number Occupants` == 32)

## [1] 16323

sacrash[16323,]$`Unit Type`

## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair

sacrash$`Number Occupants`[sacrash$`Number Occupants` == 999 ] <- 50
sacrash$`Number Occupants`[sacrash$`Number Occupants` == 99 ] <- 50
summary(sacrash$`Number Occupants`)

## 
##  435 values imputed to 1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.349   1.000  50.000

summary(sacrash$`Veh Year`)

## 
##  2025 values imputed to 2005

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1924    2001    2005    2005    2010    2016

z.scores1 <- sacrash$`Veh Year` %>% scores(type = "z")
length(which(abs(z.scores1)>3))

## [1] 93

sacrash$`Veh Year`[ which(abs(z.scores1)>3) ] <- mean(sacrash$`Veh Year`, na.rm = TRUE)
length(which(abs(z.scores1)>3))

## [1] 93

summary(sacrash$`Veh Year`)

## 
##  2025 values imputed to 2005

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1986    2001    2005    2005    2010    2016

sacrash <- sacrash %>% mutate(vehicle_age = Year - `Veh Year`)
summary(sacrash$vehicle_age)

## 
##  2025 values imputed to 11

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    6.00   11.00   11.05   15.00   30.00

z.scores2 <- sacrash$`Veh Year` %>% scores(type = "z")
length(which(abs(z.scores2)>3))

## [1] 28

Transform

For this step, we are attempting to transform variable Age to decrease the skewness and convert the distribution to a normal distribution. Several mathematical transformations and box-cox transformation were applied and in conclusion, log is the most effective according to histograms.

hist(sacrash$Age)

boxcox_age <- BoxCox(sacrash$Age, lambda = "auto")
hist(boxcox_age)

log_age <- log10(sacrash$Age)
hist(log_age)

ln_age <- log(sacrash$Age)
hist(ln_age)

sqrt_age <- sqrt(sacrash$Age)
hist(sqrt_age)

age_square <- sacrash$Age^2
hist(age_square)

age_reciprocal <- sacrash$Age^(-1)
hist(age_reciprocal)