Required packages

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.4
library(zoo)
## Warning: package 'zoo' was built under R version 3.4.4
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
library(outliers)
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4

Executive Summary

For this data preprocessing assignment, we are getting the data of road crashes in South Australia in 2016 ready for analysis. The followings are summary of the steps taken to achieve this:

Data

Two data sets were obtained from the following [link] (https://data.sa.gov.au/data/dataset/road-crash-data/resource/446afe5b-4e01-4cdf-a281- edd25aaf3802) These data sets are comprised of the information regarding road crashes in South AUstralia in 2016, which were reported to the police where at least one person being killed or injured, vehicle towed away, or total property damage of $5000 or greater.

The first data set, “Crashtype”, concerns relevant information about the nature of the crash and crash site with the following specification of variables:

The second data set, “Unit”, concerns relevant identity information about all units (vehicles and corresponding occupants) as well as whether rollover or fire occurred. The following are variables’ specifications:

In order to import and read these datasets into R, I did the following steps: * From Environment pane, clicked import Dataset > From text(readr)… * Set name as ‘Crashtype’ and ‘Unit’. * Copied the code from Code Preview box into R script for future loading. * Used class() to check if they are imported as dataframe.

Variable ‘Postcode’ exists in both dataset. One describes postcode of where the crash occurred while the other is where the occupant resides at. General subset was used to remove ‘Postcode’ variable in the latter since we are only interested in crash-related data.

full_join() was used to merge the two dataset by key variable ‘REPORT ID’, new data set named ‘sacrash’.

There are multiple units involved in each crash and this is shown in ’Unit" dataset. However, the “Crashtype” dataset contains only information for the unit responsible in the crash. Since we are interested in only the crash-related data, filter() from dplyr was used to subset the merged dataset to include those units responsible.

Crashtype <- read_csv("C:/Users/mengh/OneDrive/Desktop/Data Preprocessing/Datapreprocessing Assignment 3/2016_DATA_SA_Crash.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Postcode = col_integer(),
##   `Total Units` = col_integer(),
##   `Total Cas` = col_integer(),
##   `Total Fats` = col_integer(),
##   `Total SI` = col_integer(),
##   `Total MI` = col_integer(),
##   Year = col_integer(),
##   Time = col_time(format = ""),
##   ACCLOC_X = col_double(),
##   ACCLOC_Y = col_double()
## )
## See spec(...) for full column specifications.
Unit <- read_csv("C:/Users/mengh/OneDrive/Desktop/Data Preprocessing/Datapreprocessing Assignment 3/2016_DATA_SA_Units.csv")
## Parsed with column specification:
## cols(
##   REPORT_ID = col_character(),
##   `Unit No` = col_character(),
##   `No Of Cas` = col_integer(),
##   `Veh Reg State` = col_character(),
##   `Unit Type` = col_character(),
##   `Veh Year` = col_character(),
##   `Direction Of Travel` = col_character(),
##   Sex = col_character(),
##   Age = col_character(),
##   `Lic State` = col_character(),
##   `Licence Class` = col_character(),
##   `Licence Type` = col_character(),
##   Towing = col_character(),
##   `Unit Movement` = col_character(),
##   `Number Occupants` = col_character(),
##   Postcode = col_character(),
##   Rollover = col_character(),
##   Fire = col_character()
## )
head(Crashtype)
head(Unit)
class(Crashtype)
## [1] "tbl_df"     "tbl"        "data.frame"
class(Unit)
## [1] "tbl_df"     "tbl"        "data.frame"
Unit <- Unit %>% select(-Postcode)
sacrash <- full_join(Crashtype, Unit, by = "REPORT_ID")
sacrash <- sacrash %>% filter(`Unit Resp`== `Unit No`)
sacrash <- sacrash %>% select(-`Unit Resp`)

Understand

Some of the variables were imported in incorrect types. Therefore, they were converted using:

Since year, month but only day of the week are given in three separate columns, ymd() cannot be used to convert to year-month-day date format. Instead, Year and Month was pasted together with separator “-” as a new variable, then as.yearmon() from zoo package was used to convert to just year-month format and leaving day of the week variable as it is.

Upon inspection using unique(variable_name) for each variables, some missing values in some variables were entered as “Unknown”, “UNKNOWN” and “XXXX”. In this case, vector subset was used to convert these values to NA before type conversions so that these values can be grouped together with missing values to determine the total number of missing values. Unique(variable_name) was used again after this process to check if this is successful.

For variables ‘Rollover’, ‘Fire’, ‘DUI Involved’ and ‘Drugs Involved’. Upon inspection, it is assumed that data entry is only entered as “Y” when the above occurred which is a small percentage compared to the whole datset. However, before proceeding to data analysis, this would need clarification from data collector if this is the actual case. In this instance, it is assumed so. Therefore, where there is no data entry, it is substituted with “N” then convert to factor with labels “Yes” and “No”.

‘LGA’ variable was changed to ‘Local Government Area’ for better explanatory purpose using colnames().

glimpse(sacrash)
## Observations: 16,580
## Variables: 48
## $ REPORT_ID             <chr> "2016-1-18/07/2017", "2016-2-18/07/2017"...
## $ `Stats Area`          <chr> "1 City", "2 Metropolitan", "2 Metropoli...
## $ Suburb                <chr> "ADELAIDE", "POORAKA", "GREEN FIELDS", "...
## $ Postcode              <int> 5000, 5095, 5107, 5043, 5116, 5251, 5094...
## $ `LGA Name`            <chr> "CITY OF ADELAIDE", "CITY OF SALISBURY",...
## $ `Total Units`         <int> 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2...
## $ `Total Cas`           <int> 1, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ `Total Fats`          <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Total SI`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Total MI`            <int> 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ Year                  <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016...
## $ Month                 <chr> "November", "December", "November", "Nov...
## $ Day                   <chr> "Wednesday", "Thursday", "Tuesday", "Tue...
## $ Time                  <time> 13:45:00, 13:00:00, 07:10:00, 10:30:00,...
## $ `Area Speed`          <chr> "050", "070", "080", "060", "090", "050"...
## $ `Position Type`       <chr> "Divided Road", "Divided Road", "T-Junct...
## $ `Horizontal Align`    <chr> "Straight road", "Straight road", "Strai...
## $ `Vertical Align`      <chr> "Level", "Level", "Level", "Level", "Lev...
## $ `Other Feat`          <chr> "Not Applicable", "Not Applicable", "Not...
## $ `Road Surface`        <chr> "Sealed", "Sealed", "Sealed", "Sealed", ...
## $ `Moisture Cond`       <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry"...
## $ `Weather Cond`        <chr> "Not Raining", "Not Raining", "Not Raini...
## $ DayNight              <chr> "Daylight", "Daylight", "Daylight", "Day...
## $ `Crash Type`          <chr> "Hit Pedestrian", "Rear End", "Rear End"...
## $ `Entity Code`         <chr> "Pedestrian", "Driver Rider", "Driver Ri...
## $ `CSEF Severity`       <chr> "4: Fatal", "1: PDO", "1: PDO", "1: PDO"...
## $ `Traffic Ctrls`       <chr> "No Control", "No Control", "Traffic Sig...
## $ `DUI Involved`        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `Drugs Involved`      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ACCLOC_X              <dbl> 1328459, 1330554, 1328689, 1324837, 1339...
## $ ACCLOC_Y              <dbl> 1670878, 1681820, 1685256, 1660685, 1700...
## $ UNIQUE_LOC            <chr> "13284591670878", "13305541681820", "132...
## $ `Unit No`             <chr> "02", "02", "02", "02", "01", "01", "02"...
## $ `No Of Cas`           <int> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ `Veh Reg State`       <chr> NA, "SA", "SA", "UNKNOWN", "SA", "SA", "...
## $ `Unit Type`           <chr> "Pedestrian on Road", "Station Wagon", "...
## $ `Veh Year`            <chr> NA, "2008", "2013", "XXXX", "1997", "201...
## $ `Direction Of Travel` <chr> "East", "Unknown", "South", "North", "So...
## $ Sex                   <chr> "Male", "Male", "Male", "Unknown", "Male...
## $ Age                   <chr> "072", "040", "023", "XXX", "042", "035"...
## $ `Lic State`           <chr> NA, "SA", "SA", "UNKNOWN", "SA", "SA", "...
## $ `Licence Class`       <chr> NA, "C", "C", "XX", "C", "C", "C", "XX",...
## $ `Licence Type`        <chr> NA, "Full", "Full", "Unknown", "Full", "...
## $ Towing                <chr> NA, "Not Towing", "Not Towing", "Unknown...
## $ `Unit Movement`       <chr> "Walking on Road", "Straight Ahead", "St...
## $ `Number Occupants`    <chr> NA, "001", "001", "001", "001", "001", "...
## $ Rollover              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fire                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
dim(sacrash)
## [1] 16580    48
sacrash <- sacrash %>% mutate(Date = paste(Year,Month, sep = "-"))
sacrash$Date <- as.yearmon(sacrash$Date, "%Y-%B")
class(sacrash$Date)
## [1] "yearmon"
sacrash$`Number Occupants` <- as.numeric(sacrash$`Number Occupants`)

unique(sacrash$`Veh Reg State`)
##  [1] NA        "SA"      "UNKNOWN" "NSW"     "VIC"     "QLD"     "WA"     
##  [8] "NT"      "TAS"     "FEDERAL" "ACT"
sacrash$`Veh Reg State`[sacrash$`Veh Reg State` == "UNKNOWN"] <- NA
sacrash$`Veh Reg State` <- as.factor(sacrash$`Veh Reg State`)

unique(sacrash$`CSEF Severity`)
## [1] "4: Fatal" "1: PDO"   "2: MI"    "3: SI"
sacrash$`CSEF Severity`<- factor(sacrash$`CSEF Severity`, levels = c("1: PDO", "2: MI", "3: SI", "4: Fatal"), labels = c("Property Damage Only", "Minor Injury", "Serious Injury", "Fatal"), ordered = TRUE)
levels(sacrash$`CSEF Severity`)
## [1] "Property Damage Only" "Minor Injury"         "Serious Injury"      
## [4] "Fatal"
sacrash$`No Of Cas` <- as.numeric(sacrash$`No Of Cas`)

sacrash$`Unit Type` <- as.factor(sacrash$`Unit Type`)
levels(sacrash$`Unit Type`)
##  [1] "Animal - Domestic - Not Ridden" "Animal - Wild"                 
##  [3] "Animal Drawn Vehicle"           "BDOUBLE - ROAD TRAIN"          
##  [5] "Forward Control Passenger Van"  "Light Truck LT 4.5T"           
##  [7] "Motor Cars - Sedan"             "Motor Cars - Tourer"           
##  [9] "Motor Cycle"                    "Motor Vehicle - Type Unknown"  
## [11] "Motorised Wheelchair/Gopher"    "OMNIBUS"                       
## [13] "Other Defined Special Vehicle"  "Panel Van"                     
## [15] "Pedal Cycle"                    "Pedestrian on Road"            
## [17] "Power Asst. Bicycle"            "RIGID TRUCK LGE GE 4.5T"       
## [19] "Scooter"                        "SEMI TRAILER"                  
## [21] "Small Wheel Vehicle User"       "Station Wagon"                 
## [23] "Taxi Cab"                       "Utility"                       
## [25] "Wheelchair / Elec. Wheelchair"
sacrash$`Veh Year`[sacrash$`Veh Year` == "XXXX"] <- NA
sacrash$`Veh Year` <- as.numeric(sacrash$`Veh Year`)

sacrash$`Direction Of Travel`[sacrash$`Direction Of Travel` == "Unknown"] <- NA
sacrash$`Direction Of Travel` <- as.factor(sacrash$`Direction Of Travel`)

sacrash$Sex[sacrash$Sex == "Unknown"] <- NA
sacrash$Sex <- as.factor(sacrash$Sex)

sacrash$Age <- as.numeric(sacrash$Age)
## Warning: NAs introduced by coercion
sacrash$`Lic State` <- factor(sacrash$`Lic State`, levels = c("ACT", "NSw", "NT", "O/S", "QLD","SA","TAS", "VIC", "WA"), labels = c("Australian Capital Territory", "New South Wales", "Northern Territory", "Overseas","Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia"))

sacrash$`Licence Class` <- as.factor(sacrash$`Licence Class`)

sacrash$`Licence Type`[sacrash$`Licence Type`=="Unknown"] <- NA
sacrash$`Licence Type` <- as.factor(sacrash$`Licence Type`)

sacrash$Towing[sacrash$Towing=="Unknown"] <- NA
sacrash$Towing<- as.factor(sacrash$Towing)

sacrash$`Unit Movement` <- as.factor(sacrash$`Unit Movement`)

sacrash$Postcode <- as.character(sacrash$Postcode)

sacrash$Rollover[is.na(sacrash$Rollover)] <- "N"
sacrash$Rollover <- factor(sacrash$Rollover, levels = c("Y", "N"), labels = c("Yes","No"))
unique(sacrash$Rollover)
## [1] No  Yes
## Levels: Yes No
sacrash$Fire[is.na(sacrash$Fire)] <- "N"
sacrash$Fire <- factor(sacrash$Fire, levels = c("Y","N"), labels = c("Yes", "NO"))
unique(sacrash$Fire)
## [1] NO  Yes
## Levels: Yes NO
sacrash$`Stats Area` <- factor(sacrash$`Stats Area`, levels = c("1 City","2 Metropolitan", "3 Country"), labels = c("City", "Metropolitan", "Country"))

sacrash$Suburb <- as.factor(sacrash$Suburb)

colnames(sacrash)[5] <- "Local Government Area"
sacrash$`Local Government Area` <- as.factor(sacrash$`Local Government Area`)

sacrash$`Total Units` <- as.numeric(sacrash$`Total Units`)

sacrash$`Total Cas` <- as.numeric(sacrash$`Total Cas`)

sacrash$`Total Fats`<- as.numeric(sacrash$`Total Fats`)

sacrash$`Total SI` <- as.numeric(sacrash$`Total SI`)

sacrash$'Total MI' <- as.numeric(sacrash$'Total MI')

sacrash$`Area Speed` <- as.numeric(sacrash$`Area Speed`)

sacrash$`Position Type`<- as.factor(sacrash$`Position Type`)

sacrash$`Horizontal Align`[sacrash$`Horizontal Align`=="Unknown"] <- NA
sacrash$`Horizontal Align`<- as.factor(sacrash$`Horizontal Align`)

sacrash$`Vertical Align`[sacrash$`Vertical Align`=="Unknown"] <- NA
sacrash$`Vertical Align` <- as.factor(sacrash$`Vertical Align`)

sacrash$`Other Feat`[sacrash$`Other Feat`=="Unknown"] <- NA
sacrash$`Other Feat` <- as.factor(sacrash$`Other Feat`)

sacrash$`Road Surface`[sacrash$`Road Surface`=="Unknown"] <- NA
sacrash$`Road Surface` <- as.factor(sacrash$`Road Surface`)

sacrash$`Moisture Cond`[sacrash$`Moisture Cond`=="Unknown"] <- NA
sacrash$`Moisture Cond` <- as.factor(sacrash$`Moisture Cond`)

sacrash$`Weather Cond`[sacrash$`Weather Cond`=="Unknown"] <- NA
sacrash$`Weather Cond` <- as.factor(sacrash$`Weather Cond`)

sacrash$DayNight <- as.factor(sacrash$DayNight)

sacrash$`Crash Type` <- as.factor(sacrash$`Crash Type`)

sacrash$`Entity Code` <- as.factor(sacrash$`Entity Code`)

sacrash$`Traffic Ctrls`<- as.factor(sacrash$`Traffic Ctrls`)

sacrash$`Drugs Involved`[is.na(sacrash$`Drugs Involved`)] <- "N"
sacrash$`Drugs Involved` <- factor(sacrash$`Drugs Involved`, levels = c("Y", "N"), labels = c("Yes","No"))

sacrash$`DUI Involved`[is.na(sacrash$`DUI Involved`)] <- "N"
sacrash$`DUI Involved` <- factor(sacrash$`DUI Involved`, levels = c("Y", "N"), labels = c("Yes","No"))

sacrash$ACCLOC_X <- as.character(sacrash$ACCLOC_X)

sacrash$ACCLOC_Y <- as.character(sacrash$ACCLOC_Y)

sacrash$UNIQUE_LOC <- as.character(sacrash$UNIQUE_LOC)

Tidy & Manipulate Data I

Upon inspection using glimpse(), head() and tail(). This data is now tidy since it follows Hadley Wickham’s tidy data principles where:

glimpse(sacrash)
## Observations: 16,580
## Variables: 49
## $ REPORT_ID               <chr> "2016-1-18/07/2017", "2016-2-18/07/201...
## $ `Stats Area`            <fct> City, Metropolitan, Metropolitan, Metr...
## $ Suburb                  <fct> ADELAIDE, POORAKA, GREEN FIELDS, MITCH...
## $ Postcode                <chr> "5000", "5095", "5107", "5043", "5116"...
## $ `Local Government Area` <fct> CITY OF ADELAIDE, CITY OF SALISBURY, C...
## $ `Total Units`           <dbl> 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2,...
## $ `Total Cas`             <dbl> 1, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ `Total Fats`            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `Total SI`              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `Total MI`              <dbl> 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ Year                    <int> 2016, 2016, 2016, 2016, 2016, 2016, 20...
## $ Month                   <chr> "November", "December", "November", "N...
## $ Day                     <chr> "Wednesday", "Thursday", "Tuesday", "T...
## $ Time                    <time> 13:45:00, 13:00:00, 07:10:00, 10:30:0...
## $ `Area Speed`            <dbl> 50, 70, 80, 60, 90, 50, 90, 60, 60, 60...
## $ `Position Type`         <fct> Divided Road, Divided Road, T-Junction...
## $ `Horizontal Align`      <fct> Straight road, Straight road, Straight...
## $ `Vertical Align`        <fct> Level, Level, Level, Level, Level, Lev...
## $ `Other Feat`            <fct> Not Applicable, Not Applicable, Not Ap...
## $ `Road Surface`          <fct> Sealed, Sealed, Sealed, Sealed, Sealed...
## $ `Moisture Cond`         <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry...
## $ `Weather Cond`          <fct> Not Raining, Not Raining, Not Raining,...
## $ DayNight                <fct> Daylight, Daylight, Daylight, Daylight...
## $ `Crash Type`            <fct> Hit Pedestrian, Rear End, Rear End, Ri...
## $ `Entity Code`           <fct> Pedestrian, Driver Rider, Driver Rider...
## $ `CSEF Severity`         <ord> Fatal, Property Damage Only, Property ...
## $ `Traffic Ctrls`         <fct> No Control, No Control, Traffic Signal...
## $ `DUI Involved`          <fct> No, No, No, No, No, No, No, No, No, No...
## $ `Drugs Involved`        <fct> No, No, No, No, No, No, No, No, No, No...
## $ ACCLOC_X                <chr> "1328459.11", "1330553.58", "1328689.4...
## $ ACCLOC_Y                <chr> "1670877.88", "1681820.29", "1685255.5...
## $ UNIQUE_LOC              <chr> "13284591670878", "13305541681820", "1...
## $ `Unit No`               <chr> "02", "02", "02", "02", "01", "01", "0...
## $ `No Of Cas`             <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ `Veh Reg State`         <fct> NA, SA, SA, NA, SA, SA, SA, SA, NA, NA...
## $ `Unit Type`             <fct> Pedestrian on Road, Station Wagon, Pan...
## $ `Veh Year`              <dbl> NA, 2008, 2013, NA, 1997, 2013, 1997, ...
## $ `Direction Of Travel`   <fct> East, NA, South, North, South East, No...
## $ Sex                     <fct> Male, Male, Male, NA, Male, Female, Ma...
## $ Age                     <dbl> 72, 40, 23, NA, 42, 35, 37, 83, 27, NA...
## $ `Lic State`             <fct> NA, South Australia, South Australia, ...
## $ `Licence Class`         <fct> NA, C, C, XX, C, C, C, XX, NA, XX, XX,...
## $ `Licence Type`          <fct> NA, Full, Full, NA, Full, Full, Full, ...
## $ Towing                  <fct> NA, Not Towing, Not Towing, NA, Not To...
## $ `Unit Movement`         <fct> Walking on Road, Straight Ahead, Strai...
## $ `Number Occupants`      <dbl> NA, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1...
## $ Rollover                <fct> No, No, No, No, No, No, No, No, No, No...
## $ Fire                    <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO...
## $ Date                    <S3: yearmon> Nov 2016, Dec 2016, Nov 2016, ...
head(sacrash)
tail(sacrash)

Tidy & Manipulate Data II

sacrash$Year <- as.numeric(sacrash$Year)
sacrash$`Veh Year`<- as.numeric(sacrash$`Veh Year`)
sacrash <- sacrash %>% mutate(vehicle_age = Year - `Veh Year`)
head(sacrash$vehicle_age)
## [1] NA  8  3 NA 19  3

Scan I

colSums() was used to get total number of missing values in each variables. For categoriable variables’ missing values, mode was used to replace those values. Then sum(is.na(variable_name)) is used to check if all missing values were replaced successfully.

After all the imputations for missing values, colSums(is.na(sacrash)) was checked again and as expected, now there are no missing values.

colSums(is.na(sacrash))
##             REPORT_ID            Stats Area                Suburb 
##                     0                     0                     0 
##              Postcode Local Government Area           Total Units 
##                     0                   144                     0 
##             Total Cas            Total Fats              Total SI 
##                     0                     0                     0 
##              Total MI                  Year                 Month 
##                     0                     0                     0 
##                   Day                  Time            Area Speed 
##                     0                     0                     0 
##         Position Type      Horizontal Align        Vertical Align 
##                     0                     7                    24 
##            Other Feat          Road Surface         Moisture Cond 
##                   120                     3                    71 
##          Weather Cond              DayNight            Crash Type 
##                    86                     0                     0 
##           Entity Code         CSEF Severity         Traffic Ctrls 
##                     0                     0                     0 
##          DUI Involved        Drugs Involved              ACCLOC_X 
##                     0                     0                    25 
##              ACCLOC_Y            UNIQUE_LOC               Unit No 
##                    25                    25                     0 
##             No Of Cas         Veh Reg State             Unit Type 
##                     0                  1714                     0 
##              Veh Year   Direction Of Travel                   Sex 
##                  2025                   303                  1351 
##                   Age             Lic State         Licence Class 
##                  1973                  2828                   758 
##          Licence Type                Towing         Unit Movement 
##                  3729                  1494                   268 
##      Number Occupants              Rollover                  Fire 
##                   435                     0                     0 
##                  Date           vehicle_age 
##                     0                  2025
sacrash$`Horizontal Align` <- impute(sacrash$`Horizontal Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))
## [1] 0
sacrash$`Vertical Align` <- impute(sacrash$`Vertical Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))
## [1] 0
sacrash$`Vertical Align` <- impute(sacrash$`Vertical Align`, fun = mode)
sum(is.na(sacrash$`Horizontal Align`))
## [1] 0
sacrash$`Other Feat` <- impute(sacrash$`Other Feat`, fun = mode)
sum(is.na(sacrash$`Other Feat`))
## [1] 0
sacrash$`Road Surface` <- impute(sacrash$`Road Surface`, fun = mode)
sum(is.na(sacrash$`Road Surface`))
## [1] 0
sacrash$`Moisture Cond` <- impute(sacrash$`Moisture Cond`, fun = mode)
sum(is.na(sacrash$`Moisture Cond`))
## [1] 0
sacrash$`Weather Cond` <- impute(sacrash$`Weather Cond`, fun = mode)
sum(is.na(sacrash$`Weather Cond`))
## [1] 0
sacrash$`Veh Reg State` <- impute(sacrash$`Veh Reg State`, fun = mode)
sum(is.na(sacrash$`Veh Reg State`))
## [1] 0
sacrash$`Unit Movement` <- impute(sacrash$`Unit Movement`, fun = mode)
sum(is.na(sacrash$`Unit Movement`))
## [1] 0
sacrash$`Direction Of Travel` <- impute(sacrash$`Direction Of Travel`, fun = mode)
sum(is.na(sacrash$`Direction Of Travel`))
## [1] 0
sacrash$Sex <- impute(sacrash$Sex, fun = mode)
sum(is.na(sacrash$Sex))
## [1] 0
sacrash$`Lic State` <- impute(sacrash$`Lic State`, fun = mode)
sum(is.na(sacrash$`Lic State`))
## [1] 0
sacrash$`Licence Class` <- impute(sacrash$`Licence Class`, fun = mode)
sum(is.na(sacrash$`Licence Class`))
## [1] 0
sacrash$`Licence Type` <- impute(sacrash$`Licence Type`, fun = mode)
sum(is.na(sacrash$`Licence Type`))
## [1] 0
sacrash$Towing <- impute(sacrash$Towing, fun = mode)
sum(is.na(sacrash$Towing))
## [1] 0
sacrash$ACCLOC_X <- impute(sacrash$ACCLOC_X, fun = mode)
sum(is.na(sacrash$ACCLOC_X))
## [1] 0
sacrash$ACCLOC_Y <- impute(sacrash$ACCLOC_Y, fun = mode)
sum(is.na(sacrash$ACCLOC_Y))
## [1] 0
sacrash$UNIQUE_LOC <- impute(sacrash$UNIQUE_LOC, fun = mode)
sum(is.na(sacrash$UNIQUE_LOC))
## [1] 0
sacrash$`Local Government Area`<- impute(sacrash$`Local Government Area`, fun = mode)
sum(is.na(sacrash$`Local Government Area`))
## [1] 0
sacrash$vehicle_age <- impute(sacrash$vehicle_age, fun = median)
sum(is.na(sacrash$vehicle_age))
## [1] 0
sacrash$`Veh Year` <- impute(sacrash$`Veh Year`, fun = median)
sum(is.na(sacrash$`Veh Year`))
## [1] 0
sacrash$`Number Occupants` <- impute(sacrash$`Number Occupants`, fun = median)
sum(is.na(sacrash$`Number Occupants`))
## [1] 0
sacrash$Age <- impute(sacrash$Age, fun = median)
sum(is.na(sacrash$Age))
## [1] 0
colSums(is.na(sacrash))
##             REPORT_ID            Stats Area                Suburb 
##                     0                     0                     0 
##              Postcode Local Government Area           Total Units 
##                     0                     0                     0 
##             Total Cas            Total Fats              Total SI 
##                     0                     0                     0 
##              Total MI                  Year                 Month 
##                     0                     0                     0 
##                   Day                  Time            Area Speed 
##                     0                     0                     0 
##         Position Type      Horizontal Align        Vertical Align 
##                     0                     0                     0 
##            Other Feat          Road Surface         Moisture Cond 
##                     0                     0                     0 
##          Weather Cond              DayNight            Crash Type 
##                     0                     0                     0 
##           Entity Code         CSEF Severity         Traffic Ctrls 
##                     0                     0                     0 
##          DUI Involved        Drugs Involved              ACCLOC_X 
##                     0                     0                     0 
##              ACCLOC_Y            UNIQUE_LOC               Unit No 
##                     0                     0                     0 
##             No Of Cas         Veh Reg State             Unit Type 
##                     0                     0                     0 
##              Veh Year   Direction Of Travel                   Sex 
##                     0                     0                     0 
##                   Age             Lic State         Licence Class 
##                     0                     0                     0 
##          Licence Type                Towing         Unit Movement 
##                     0                     0                     0 
##      Number Occupants              Rollover                  Fire 
##                     0                     0                     0 
##                  Date           vehicle_age 
##                     0                     0

Scan II

For Number Occupants variable, summary() was used to inspect general statistics of the variable and its maximum value appears to be an error since 999 number of occupants contained in a vehicle is impossible. After using unique() to investigate further there are four large values and their corresponding type of vehicle is omnibus. It is possible for omnibus to contain 50 seats. So 99 and 999 is way too large and replaced with 50 instead.

For Vehicle Year variable, at first there appears to be 93 outliers after checking for outliers using z-scores. These were then imputed with mean value but there were still 28 outliers reported. However, when summary() is called, the minimum value is 1986 and it is plausible to have car from as old as 1986 to be on the road.

For variable Vehicle_age, mutated from vehicle year and year data collated which was in 2016. Since vehicle year’s missing values and outliers has been imputed with new values. We’re going to mutate this variable again then use similar process, z-scores, to check outliers and as expected now there is none.

summary(sacrash$`Number Occupants`)
## 
##  435 values imputed to 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.409   1.000 999.000
unique(sacrash$`Number Occupants`)
##  [1]   1   3  99   2   4   0  20  12   5   6   8   7  10  11  15  13  50
## [18] 999  32
which(sacrash$`Number Occupants` == 99)
## [1] 16
sacrash[16,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
which(sacrash$`Number Occupants` == 999)
## [1] 15073
sacrash[15073,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
which(sacrash$`Number Occupants` == 50)
## [1] 5772
sacrash[5772,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
which(sacrash$`Number Occupants` == 32)
## [1] 16323
sacrash[16323,]$`Unit Type`
## [1] OMNIBUS
## 25 Levels: Animal - Domestic - Not Ridden ... Wheelchair / Elec. Wheelchair
sacrash$`Number Occupants`[sacrash$`Number Occupants` == 999 ] <- 50
sacrash$`Number Occupants`[sacrash$`Number Occupants` == 99 ] <- 50
summary(sacrash$`Number Occupants`)
## 
##  435 values imputed to 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.349   1.000  50.000
summary(sacrash$`Veh Year`)
## 
##  2025 values imputed to 2005
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1924    2001    2005    2005    2010    2016
z.scores1 <- sacrash$`Veh Year` %>% scores(type = "z")
length(which(abs(z.scores1)>3))
## [1] 93
sacrash$`Veh Year`[ which(abs(z.scores1)>3) ] <- mean(sacrash$`Veh Year`, na.rm = TRUE)
length(which(abs(z.scores1)>3))
## [1] 93
summary(sacrash$`Veh Year`)
## 
##  2025 values imputed to 2005
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1986    2001    2005    2005    2010    2016
sacrash <- sacrash %>% mutate(vehicle_age = Year - `Veh Year`)
summary(sacrash$vehicle_age)
## 
##  2025 values imputed to 11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    6.00   11.00   11.05   15.00   30.00
z.scores2 <- sacrash$`Veh Year` %>% scores(type = "z")
length(which(abs(z.scores2)>3))
## [1] 28

Transform

For this step, we are attempting to transform variable Age to decrease the skewness and convert the distribution to a normal distribution. Several mathematical transformations and box-cox transformation were applied and in conclusion, log is the most effective according to histograms.

hist(sacrash$Age)

boxcox_age <- BoxCox(sacrash$Age, lambda = "auto")
hist(boxcox_age)

log_age <- log10(sacrash$Age)
hist(log_age)

ln_age <- log(sacrash$Age)
hist(ln_age)

sqrt_age <- sqrt(sacrash$Age)
hist(sqrt_age)

age_square <- sacrash$Age^2
hist(age_square)

age_reciprocal <- sacrash$Age^(-1)
hist(age_reciprocal)