Required packages
- dplyr package for
- select()
- inner_join()
- mutate()
- rename()
- lubridate for
- magrittr for
- stringr for
library(dplyr)
library(stringr)
library(knitr)
library(lubridate)
library(magrittr)
Data
Dataset Description
This assessment includes 2 datasets relate to the fatal crashes on Victorian roads during the latest five year reporting period. Road Safety data is provided by VicRoads for educational and research purposes with Creative Commons license 4.0. This data is in Web Mercator (Auxiliary Sphere) projection.(Vicroads,2021) The two datasets include:
- “Road_Crashes_for_five_Years_Victoria.csv” (https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about) (Full citation in the reference list)
This dataset is about ‘Fatal and injury crashes on Victorian roads during the latest five year reporting period.’ (Victoria Department of Transport Open Data Hub). The time period is between 25/05/2015 and 5/10/2021.
- "Fatal_Crashes_-_Lives_Lost%2C_Last_5_Years_to_Date.csv" (https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::fatal-crashes-lives-lost-last-5-years-to-date-160421/about) (Full citation in the reference list)
This dataset is about the fatal crashes on Victorian roads during the latest five complete years and current year to the 5th of August 2021. (Vicroads,2021)
Variables Description
This descriptions just provided only the variables used in the data preprocessing process.
- ACCIDENT_NO is a charector field indicates the unique ID that identify particular accident that occured. Contains 12 characters, starting with T (for example, T20060123456) Where characters 2 to 5 are the year in which accident was registered; Where characters 6 to 12 are a numeric sequencing numbers
- DATE is a field indicates the date that the accident occurred.
- TIME is a field indicating the time that the accident occurred.
- DAY_OF_WEEK is is a charector field indicates the day of the week that the accident occurred.
- ACCIDENT_TYPE is a character field indicates the type of the accident. It is a basic description of what occurred, based on nine categories as following:
- Collision with vehicle
- Struck pedestrian
- Struck animal
- Collision with a fixed object
- Collision with some other object
- Vehicle overturned (no collision)
- Fall from or in moving vehicle
- No collision and no object struck
- Other accident
- NO_OF_VEHICLES is a numeric field indicates the number of vehicles involved in the crash.
- AGE is a field indicates how old the person was at the time of the accident. It is calculated by subtracting the person’s birth date from the accident date to give the person’s age in years.
- ROAD_USER is a character field indicates what the role of the person was at the time of the accident as following:
- Pedestrian
- Driver
- Passenger
- Motorcyclist
- Pillion Passenger
- Bicyclist (incl.passengers)
- Other driver
- Other passenger
- Not known
- SPEED_ZONE is a character field indicates the speed zone (km/hr) at the location of the accident. The speed zone is generally assigned to the main vehicle involved.
- LGA_NAME is a character field contains the local government area name that the accent occured.
- ALCOHOL_RELATED is a field of value “No” and “Yes” which “Yes” means the particular accident is alcohol related which Blood/Breath Alcohol Concentration>0.001 in the road user type: driver,rider,cyclist and pedestrian.
Preparation
step 1.1 Import the datasets by using the read.csv() to read the .csv files and assign “na.string” argument with “U”,“Unknown”, " “,”Not known" to tell R R to treat these values as NA(not available) value. * Assign the “Road_Crashes_for_five_Years_Victoria.csv” to “Crashes” dataframe and "Fatal_Crashes_-_Lives_Lost%2C_Last_5_Years_to_Date.csv" to “Fatal”dataframe.
step 1.2 Check the dataset for pre-processing data.
- Check the structure of both datasets using the str() function. The result shown that:
- They both are dataframe. The results also show the names and types of the variables together with the example of some values in the variable.
- For the “Crashes” dataset, there are 75,320 observations and 65 variables.
- For the “Fatal” dataset, there are 1,264 observations and 31 variables.
step 1.3 Select the following variables that I think they are useful and drop the rest using the select() function from dplyr:
- For “Crashes” dataframe select:
- ACCIDENT_NO
- ACCIDENT_TYPE
- DAY_OF_WEEK
- SPEED_ZONE
- LGA_NAME
- NO_OF_VEHICLES
- ALCOHOL_RELATED
- For “Fatal” dataframe select:
- ACCIDENT_N
- AGE
- RD_U_DESC
- ACCIDENT_D
- ACCIDENT_T
step 1.5 Rename some of variables to make them more sensible and also rename the “ACCIDENT_N” variable to “ACCIDENT_NO” to match with the variable “ACCIDENT_NO” in “Crashes” dataframe that I am going to use it to join the two dataframe in the next step. Use the rename() function from dplyr package to rename the “ACCIDENT_N” variale into “ACCIDENT_NO”, “RD_U_DESC” into “ROAD_USER”, “ACCIDENT_D” into “DATE” and “ACCIDENT_T” into “TIME”.
step 1.6 Merge 2 datasets by keeping only the observation in both dataframe using the inner_join() function from dplyr package. Joining both datasets together by variable “ACCIDENT_NO” using the argument “by” and assign the result into “Lives_lost” dataframe.
step 1.7 Rearrange the variables order by assign the positions of the variable in order that I want to arrange by using square brackets and the c() function. Leaving the rows selection part blank to subset every rows.
Crashes<- read.csv("Road_Crashes_for_five_Years_Victoria.csv", na.strings = c("U","Unknown", " ", "Not known"))
Fatal <- read.csv("Fatal_Crashes_-_Lives_Lost%2C_Last_5_Years_to_Date.csv",na.strings = c("U","Unknown", " ","Not known"))
str(Fatal)
str(Crashes)
Crashes %<>% select(ACCIDENT_NO, ACCIDENT_TYPE, DAY_OF_WEEK, SPEED_ZONE, LGA_NAME ,NO_OF_VEHICLES, ALCOHOL_RELATED)
Fatal %<>% select(ACCIDENT_N, ACCIDENT_D, ACC_TIME, AGE, RD_U_DESC)
Fatal %<>% rename(c("ACCIDENT_NO" = ACCIDENT_N, "ROAD_USER" = RD_U_DESC, "DATE" = ACCIDENT_D, "TIME" = ACC_TIME))
Lives_lost <- inner_join(Crashes, Fatal, by = "ACCIDENT_NO")
Lives_lost <- Lives_lost[,c(1, 8, 9, 3, 2, 6, 10, 11, 4, 5, 7 )]
head(Lives_lost)
Understand
step 2.1 Check the structure of the dataset by using the str() function and the attributes. The result shown:
- The dataset is a dataframe contains 1,264 Observations and 11 variables.
- The variable “ACCIDENT_NO”, “DAY_OF_WEEK”, “ACCIDENT_TYPE”, “ROAD_USER”, “SPEED_ZONE”, “LGA_NAME”, “ALCOHOL_RELATED”, “DATE” and “TIME” are a character type. I will leave the “DATE” and “TIME” variable as they are as I am going to combine them together and convert them into POSIXct class in Step 4.
- The variable “AGE” and “NO_OF_VEHICLES” are integer type. I will leave the “NO_OF_VEHICLES” as they are because integer is the right type for it.
step 2.2 Variables conversion
- Convert the variables “ACCIDENT_NO”, “ACCIDENT_TYPE” and “ROAD_USER” to be factor using the as.factor() function.
- Convert the variables “DAY_OF_WEEK”, “AGE” and “SPEED_ZONE” into an ordered factor varialbles by using the factor() function and ordering them using the “levels” argument with the c() function and assign argument “ordered” to be “TRUE” so the levels are regarded as ordered.
- Convert the variable “ALCOHOL_RELATED” into logical type by:
- Replace values “No” with “FALSE” and values “Yes” with “TRUE”.
- Use the as.logical() function to convert them into logical type.
- Check the result of the conversion by using the str() function. The result shown that:
- The “ACCIDENT_NO”, “ACCIDENT_TYPE” and “ROAD_USER” have converted into factor type.
- The “DAY_OF_WEEK”, “AGE” and “SPEED_ZONE” have converted into ordered factor type
- The “ALCOHOL_RELATED” have converted into logical type.
Lives_lost %>% str()
Lives_lost %>% attributes()
Lives_lost$ACCIDENT_NO %<>% as.factor()
Lives_lost$ACCIDENT_TYPE %<>% as.factor()
Lives_lost$ROAD_USER %<>% as.factor()
Lives_lost$DAY_OF_WEEK %<>% factor(levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), ordered=TRUE)
Lives_lost$AGE %<>% factor(levels = c(1:99), ordered = TRUE)
Lives_lost$SPEED_ZONE %<>% factor(levels = c("Camping grounds or off road", "40 km/hr","50 km/hr", "60 km/hr", "70 km/hr","75 km/hr", "80 km/hr", "90 km/hr", "100 km/hr", "110 km/hr", "Other speed limit"), ordered = TRUE)
Lives_lost$ALCOHOL_RELATED<-gsub("No", "FALSE", Lives_lost$ALCOHOL_RELATED)
Lives_lost$ALCOHOL_RELATED<-gsub("Yes", "TRUE", Lives_lost$ALCOHOL_RELATED)
Lives_lost$ALCOHOL_RELATED %<>% as.logical()
Lives_lost %>% str()
Tidy & Manipulate Data I
Step 3 Determine whether the dataset is tidy or untidy.
To determine if my dataset is tidy or not, I used the three interrelated rules:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell. (Hadley Wickham and Garrett Grolemund 2016)
From my point of view my dataset “Live_lost” is already in a tidy format as each variable have got its own column, each Observation have got its own row and Each value have got its own cell.
Tidy & Manipulate Data II
Step 4.1 Shape the format of the values from “DATE” variable by keeping the value from the first position to 10th and drop the rest by using the substr() function. Assign the “start” argument with “1” and the “stop” argument with “10” to indicate the range of position that I want to keep.
Step 4.2 Shape “TIME” variable into the easy format in preparing to use the ymd_hms() function in step 4.4 by using the str_replace_all() function from stringr package to replace “.:” with “:” and replace “.00” with “0.00”. Use the head() function to review the result.
Step 4.3 Create the new variable “DATE” by combining variable “DATE” and “TIME” together.
- Using the mutate() function from the dplyr package to create a new variable called “DATE”.
- Using the paste() function to:
- Put the value from the old variable “DATE” and follow with the values from the variable “TIME”.
- Assign the argument “sep” with " "(blank) to indicate the separator between the values.
Step 4.4 Convert the variable “DATE” into POSIXct class using ymd_hms() function from “lubridate” package.
Step 4.5 Drop variables “TIME” by using the select() function with the c() function and the “-” operator to assign the column name “TIME” that I want to drop. Use the str() function to check the result. The result shown that:
- The “DATE” variable have convert into class POSIXct which representing calendar dates and times.
- The “DATE” variable format change into “yyyy-mm-dd hh:mm:ss”.
- The “TIME” variable has dropped.
Lives_lost$DATE %<>% substr(start = 1, stop = 10)
Lives_lost$TIME %<>% str_replace_all(".:", ":")
Lives_lost$TIME %<>% str_replace_all(".00", "0:00")
head(Lives_lost)
Lives_lost %<>% mutate (DATE = paste(DATE, TIME, sep = " "))
Lives_lost$DATE %<>% ymd_hms()
Lives_lost %<>% select(-TIME)
str(Lives_lost)
Scan I
Step 5.1 Check the amount of NA in each column by using the is.na() function to check for the NA value and the colSum() function to sum up the total of NA value in each column. The result found:
- 2 of the values in “DAY_OF_WEEK” variable are NA.
- 5 of the values in “AGE” variable are NA.
- 16 of the values in “SPEED_ZONE” are NA.
Step 5.2 Replace the missing values in “DAY_OF_WEEK” variable :
- The 2 of the value in “DAY_OF_WEEK” variable are NA but the “DATE” variable have no NA value means I can replace the missing value by using the weekdays() function to find the day of the week and assign the value back to column “DAY_OF_WEEK”.
- By doing this replacement causing the “DAY_OF_WEEK” variable to turn into a character type so I convert it back in to an ordered factor by using the same method as the step 2.2.
- Check the NA values in the column “DAY_OF_WEEK” to make sure there is none NA value left in “DAY_OF_WEEK” column by using the is.na() function together with the which() function and the length() function. The result shown that there is no NA value in “DAY_OF_WEEK” variable.
Step 5.2 Check the amount of NA in the whole dataset by:
- Use the is.na() function to check for the NA value.
- Use the which() function to identify the position of NA value.
- Use length() function to get the number of the NA value.
- Using pipes operator between the function to pass the result of one function to the other one in sequence.
Step 5.3 Subsetting the observations with NA values into the dataframe called “NA_relation_check” to be able to have a better look for the relation of the missing value with other variable.
- Use the is.na() function to find the NA value in the dataset.
- Use the rowSums() function to sum up the missing value in each observation.
- Subset just the observation with missing values (>0) and put in the dataframe “NA_relation_check”.
Step 5.4 After checking, I could not find any relation to the other variable or between the observation. The missing value look like Missing completely at random (MCAR) for me. “A rule of thumb says that when the data include less than 5% random missingness which does not depend on observed or unobserved values, complete case analysis may be an acceptable approach, too”(Hannah Roos, 2021).
- Calculate the percentage of the observation with missing value by deviding the number of the observations with the missing values by the number of all the observations in the dataset and then multiply by 100. The result shown that the observations with the missing values is just 1.66% of all observations. Because of the information above, I choose to exclude all the observation with the missing value.
Step 5.5 Exclude the observations with missing values:
- Use the complete.cases() function to identify complete row of the dataframe and the square brackets to subset those row with all column. Assign the result to dataframe called “Lives_lost2”.
- Check if there is any NA value left in the dataframe with the same function as step 5.2.
- The result shown that there is no NA value left in the dataframe “Lives_lost2”.
is.na(Lives_lost) %>% colSums()
Lives_lost$DAY_OF_WEEK <- weekdays(Lives_lost$DATE)
Lives_lost$DAY_OF_WEEK %<>% factor(levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), ordered=TRUE)
is.na(Lives_lost$DAY_OF_WEEK) %>% which() %>% length()
is.na(Lives_lost) %>% which() %>% length()
NA_relation_check <- Lives_lost[rowSums(is.na(Lives_lost)) > 0, ]
NA_relation_check %>% head
((NA_relation_check %>% nrow)/(Lives_lost%>% nrow())) * 100
Lives_lost2<-Lives_lost[complete.cases(Lives_lost), ]
is.na(Lives_lost2)%>% which() %>% length()
Scan II
I am using the univariate outlier detection method with Non-parametric method (Box plot), to approach the “NO_OF_VEHICLES” variable because I am just focusing on one particular variable and I assume that the data is not normally distribute. For the handling process, I choose to delete all the outliers because imputing with mean does not makesense as the value indicate number of vehicles which should be integer (mean is 1.651 which is not integer) and imputing with median is not going to suit with the other variables in the observation. The processes is as following:
Step 6.1 Create Boxplot of the variable “NO_OF_VEHICLES” to see the big picture of the data and the outliers by using the boxplot() function.
- Assign the argument “main” with the title of the boxplot, in this case “Box Plot of Number of vehicles”.
- Assign the argument “ylab” with the y-axis annotation, in this case “Number of vehicles”.
- Assign the argument “col” with the color preference for the bodies of the box plot, in this case I choose “orange”.
Step 6.2 Apply the univariate outlier detection by:
- Getting the Q1 and Q3 values by using the summary() function. The result shown:
- The first position: minimum value = 1.00.
- The second position: Q1 = 1.00.
- The third position: median = 1.00
- The 4th position: mean = 1.65
- The 5th position: Q3 = 2.00
- The 6th position: maximum value = 11.00
- Extract the Q1 value by using the double square brackets to subset the value from the 2nd position of the summary() function’s result and assign to “q1”.
- Extract the Q3 value by using the double square brackets to subset the value from the 5th position of the summary() function’s result and assign to “q3”.
- Calculate the interquartile range(IQR) by subtract q1 value from q3 value and assign to “iqr”.
- Calculate the upper fence by add the 1.5 times of iqr to q3 and assign the value to “upper_fence”.
- Calculate the lower fence by subtract the 1.5 times of iqr from q1 and assign the value to “lower_fence”.
- Extract the position of the outliers by using the which() function, picking just the values that is more than the “upper_fence” value and less than the “lower_fence” value. Assign the outliers over the “upper_fence” value into “up_outliers” and assign the outliers below the “lower_fence” value into “low_outliers”.
- Sum up the amount of outliers in each of the object (the “up_outliers” and the “low_outliers”) by using the “length” function. the result shown that the amount of the outlier above the “upper_fence” value is 37 outliers but there is none outlier below the “lower_fence”
Step 6.3 Handle the outliers by exclude them using the squre bracket and “-” operator to drop the “up_outlier” positions from the “Lives_lost2$NO_OF_VEHICLES” and assign the result into the “Vehicle_clean” object.
- Check the result by using the boxplot() function and assign the list into “bp” and then use the length() function to sum up the amount of the outlier in the “out” element in the “bp” list. The result showm that there is non outlier left in the object “Vehical_clean”.
Lives_lost2$NO_OF_VEHICLES %>%
boxplot(main = "Box Plot of Number of vehicle", ylab = "Number of vehicle", col = "orange")
summary(Lives_lost2$NO_OF_VEHICLES)
q1 <- summary(Lives_lost2$NO_OF_VEHICLES)[[2]]
q3 <- summary(Lives_lost2$NO_OF_VEHICLES)[[5]]
iqr <- q3 - q1
lower_fence <- q1 - (1.5 * iqr)
upper_fence <- q3 + (1.5 * iqr)
up_outliers <- which(Lives_lost2$NO_OF_VEHICLES > upper_fence)
low_outliers <- which(Lives_lost2$NO_OF_VEHICLES < lower_fence)
low_outliers %>% length()
up_outliers %>% length()
Vehicle_clean<- Lives_lost2$NO_OF_VEHICLES[-up_outliers]
bp<- boxplot(Vehicle_clean)
bp$out %>% length()
References:
Calculate Skewness in R n.d., ProgrammingR, viewed 14 October 2021, https://www.programmingr.com/statistics/skewness/
Hadley Wickham and Garrett Grolemund, R for Data Science, R4DS, viewed 18 September 2021, https://r4ds.had.co.nz/tidy-data.html
Mangiafico, S.S. 2016, Summary and Analysis of Extension Program Evaluation in R, viewed 14 October 2021, https://rcompanion.org/handbook/I_12.html
Victoria Department of Transport 2021, Crashes_Last_Five_Years - Open Data, data file, Victoria Department of Transport Open Data Hub, viewed 10 October 2021, https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about.
Victoria Department of Transport, Fatal Crashes - Lives Lost, Last 5 Years to Date, the Fatal Crashes - Lives Lost Year to Date dataset, data file, Department of Transport Open Data Hub, viewed 10 October 2021, https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about.
Hannah Roos 2021, Smart handling of missing data in R, Towards Data Science, viewed 10 October 2021, https://towardsdatascience.com/smart-handling-of-missing-data-in-r-6425f8a559f2
---
title: "Data Wrangling"
author: Naam Marshall
subtitle: "Victoria Fatal Crashes in the last 5 years"
output:
  html_notebook: default
---



## Required packages 
  
* dplyr package for  
  + select()
  + inner_join()
  + mutate()
  + rename()
  
* lubridate for 
  + ymd_hms()
  
* magrittr for
  + %>% 
  + %<>% 
  
* stringr for
  + str_replace_all()

```{r echo=TRUE, warning=FALSE}
library(dplyr)
library(stringr)
library(knitr)
library(lubridate)
library(magrittr)

```


## Data 

**Dataset Description**

This assessment includes 2 datasets relate to the fatal crashes on Victorian roads during the latest five year reporting period. Road Safety data is provided by VicRoads for educational and research purposes with Creative Commons license 4.0. This data is in Web Mercator (Auxiliary Sphere) projection.(Vicroads,2021) The two datasets include:  

1. "Road_Crashes_for_five_Years_Victoria.csv" (https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about) (Full citation in the reference list)

This dataset is about 'Fatal and injury crashes on Victorian roads during the latest five year reporting period.' (Victoria Department of Transport Open Data Hub). The time period is between 25/05/2015 and 5/10/2021.

2. "Fatal_Crashes_-_Lives_Lost%2C_Last_5_Years_to_Date.csv"
(https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::fatal-crashes-lives-lost-last-5-years-to-date-160421/about) (Full citation in the reference list)

This dataset is about the fatal crashes on Victorian roads during the latest five complete years and current year to the 5th of August 2021. (Vicroads,2021)

**Variables Description**

This descriptions just provided only the variables used in the data preprocessing process.

* **ACCIDENT_NO** is a charector field indicates the unique ID that identify particular accident that occured. Contains 12 characters, starting with T (for example, T20060123456) Where characters 2 to 5 are the year in which accident was registered; Where characters 6 to 12 are a numeric sequencing numbers
* **DATE** is a field indicates the date that the accident occurred. 
* **TIME** is a field indicating the time that the accident occurred.
* **DAY_OF_WEEK** is is a charector field indicates the day of the week that the accident occurred.
* **ACCIDENT_TYPE** is a character field indicates the type of the accident. It is a basic description of what occurred, based on nine categories as following:  
  + Collision with vehicle
  + Struck pedestrian
  + Struck animal
  + Collision with a fixed object
  + Collision with some other object
  + Vehicle overturned (no collision)
  + Fall from or in moving vehicle
  + No collision and no object struck
  + Other accident
* **NO_OF_VEHICLES** is a numeric field indicates the number of vehicles involved in the crash.
* **AGE** is a field indicates how old the person was at the time of the accident. It is calculated by subtracting the person’s birth date from the accident date to give the person’s age in years.
* **ROAD_USER** is a character field indicates what the role of the person was at the time of the accident as following:  
  + Pedestrian
  + Driver
  + Passenger
  + Motorcyclist
  + Pillion Passenger
  + Bicyclist (incl.passengers)
  + Other driver
  + Other passenger
  + Not known
* **SPEED_ZONE** is a character field indicates the speed zone (km/hr) at the location of the accident. The speed zone is generally assigned to the main vehicle involved.
* **LGA_NAME** is a character field contains the local government area name that the accent occured. 
* **ALCOHOL_RELATED** is a field of value "No" and "Yes" which "Yes" means the particular accident is alcohol related which Blood/Breath Alcohol Concentration>0.001 in the road user type: driver,rider,cyclist and pedestrian.

## Preparation
 
 **step 1.1** Import the datasets by using the read.csv() to read the .csv files and assign "na.string" argument with "U","Unknown", " ", "Not known" to tell R R to treat these values as NA(not available) value. 
* Assign the "Road_Crashes_for_five_Years_Victoria.csv" to "Crashes" dataframe and "Fatal_Crashes_-_Lives_Lost%2C_Last_5_Years_to_Date.csv" to "Fatal"dataframe.

**step 1.2** Check the dataset for pre-processing data.  

* Check the structure of both datasets using the str() function. The result shown that:
  + They both are dataframe. The results also show the names and types of the variables together with the example of some values in the variable. 
  + For the "Crashes" dataset, there are 75,320 observations and 65 variables.
  + For the "Fatal" dataset, there are 1,264 observations and 31 variables.

**step 1.3** Select the following variables that I think they are useful and drop the rest using the select() function from dplyr: 

* For "Crashes" dataframe select:
  + ACCIDENT_NO
  + ACCIDENT_TYPE
  + DAY_OF_WEEK
  + SPEED_ZONE
  + LGA_NAME
  + NO_OF_VEHICLES
  + ALCOHOL_RELATED
  
* For "Fatal" dataframe select: 
  + ACCIDENT_N
  + AGE
  + RD_U_DESC
  + ACCIDENT_D
  + ACCIDENT_T

**step 1.5** Rename some of variables to make them more sensible and also rename the "ACCIDENT_N" variable to "ACCIDENT_NO" to match with the variable "ACCIDENT_NO" in "Crashes" dataframe that I am going to use it to join the two dataframe in the next step. Use the rename() function from dplyr package to rename the "ACCIDENT_N" variale into "ACCIDENT_NO", "RD_U_DESC" into "ROAD_USER", "ACCIDENT_D" into "DATE" and "ACCIDENT_T" into "TIME".

**step 1.6** Merge 2 datasets by keeping only the observation in both dataframe using the inner_join() function from dplyr package. Joining both datasets together by variable "ACCIDENT_NO" using the argument "by" and assign the result into "Lives_lost" dataframe. 

**step 1.7** Rearrange the variables order by assign the positions of the variable in order that I want to arrange by using square brackets and the c() function. Leaving the rows selection part blank to subset every rows.

```{r}

Crashes<- read.csv("Road_Crashes_for_five_Years_Victoria.csv", na.strings = c("U","Unknown", " ", "Not known"))
Fatal <- read.csv("Fatal_Crashes_-_Lives_Lost%2C_Last_5_Years_to_Date.csv",na.strings = c("U","Unknown", " ","Not known"))

str(Fatal)
str(Crashes)


Crashes %<>% select(ACCIDENT_NO, ACCIDENT_TYPE, DAY_OF_WEEK, SPEED_ZONE, LGA_NAME ,NO_OF_VEHICLES, ALCOHOL_RELATED)
Fatal %<>% select(ACCIDENT_N, ACCIDENT_D, ACC_TIME, AGE, RD_U_DESC)
Fatal %<>% rename(c("ACCIDENT_NO" = ACCIDENT_N, "ROAD_USER" = RD_U_DESC, "DATE" = ACCIDENT_D, "TIME" = ACC_TIME)) 

Lives_lost <- inner_join(Crashes, Fatal, by = "ACCIDENT_NO")
Lives_lost <- Lives_lost[,c(1, 8, 9, 3, 2, 6, 10, 11, 4, 5, 7 )] 
head(Lives_lost)
```
## Understand 

**step 2.1** Check the structure of the dataset by using the str() function and the attributes. The result shown:  

* The dataset is a dataframe contains 1,264 Observations and 11 variables.
* The variable "ACCIDENT_NO", "DAY_OF_WEEK", "ACCIDENT_TYPE", "ROAD_USER", "SPEED_ZONE", "LGA_NAME", "ALCOHOL_RELATED", "DATE" and "TIME" are a character type. I will leave the "DATE" and "TIME" variable as they are as I am going to combine them together and convert them into POSIXct class in Step 4.
* The variable "AGE" and "NO_OF_VEHICLES" are integer type. I will leave the "NO_OF_VEHICLES" as they are because integer is the right type for it.

**step 2.2** Variables conversion  

* Convert the variables "ACCIDENT_NO", "ACCIDENT_TYPE" and "ROAD_USER" to be factor using the as.factor() function.
* Convert the variables "DAY_OF_WEEK", "AGE" and "SPEED_ZONE" into an ordered factor varialbles by using the factor() function and ordering them using the "levels" argument with the c() function and assign argument "ordered" to be "TRUE" so the levels are regarded as ordered.
* Convert the variable "ALCOHOL_RELATED" into logical type by:
  + Replace values "No" with "FALSE" and values "Yes" with "TRUE".
  + Use the as.logical() function to convert them into logical type.
* Check the result of the conversion by using the str() function. The result shown that:  
  + The "ACCIDENT_NO", "ACCIDENT_TYPE" and "ROAD_USER" have converted into factor type.
  + The "DAY_OF_WEEK", "AGE" and "SPEED_ZONE" have converted into ordered factor type
  + The "ALCOHOL_RELATED" have converted into logical type.


```{r}
Lives_lost %>% str()
Lives_lost %>% attributes()

Lives_lost$ACCIDENT_NO %<>% as.factor()
Lives_lost$ACCIDENT_TYPE %<>% as.factor()
Lives_lost$ROAD_USER %<>% as.factor()

Lives_lost$DAY_OF_WEEK %<>% factor(levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), ordered=TRUE)
Lives_lost$AGE %<>% factor(levels = c(1:99), ordered = TRUE)
Lives_lost$SPEED_ZONE %<>% factor(levels = c("Camping grounds or off road", "40 km/hr","50 km/hr", "60 km/hr", "70 km/hr","75 km/hr", "80 km/hr", "90 km/hr", "100 km/hr", "110 km/hr", "Other speed limit"), ordered = TRUE)

Lives_lost$ALCOHOL_RELATED<-gsub("No", "FALSE", Lives_lost$ALCOHOL_RELATED)
Lives_lost$ALCOHOL_RELATED<-gsub("Yes", "TRUE", Lives_lost$ALCOHOL_RELATED)
Lives_lost$ALCOHOL_RELATED %<>% as.logical() 

Lives_lost %>% str()

```


##	Tidy & Manipulate Data I 

**Step 3** Determine whether the dataset is tidy or untidy. 

To determine if my dataset is tidy or not, I used the three interrelated rules:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
(Hadley Wickham and Garrett Grolemund 2016)

From my point of view my dataset "Live_lost" is already in a tidy format as each variable have got its own column, each Observation have got its own row and Each value have got its own cell.

```{r}

```


##	Tidy & Manipulate Data II 

**Step 4.1** Shape the format of the values from "DATE" variable by keeping the value from the first position to 10th and drop the rest by using the substr() function. Assign the "start" argument with "1" and the "stop" argument with "10" to indicate the range of position that I want to keep.  

**Step 4.2** Shape "TIME" variable into the easy format in preparing to use the ymd_hms() function in step 4.4 by using the str_replace_all() function from stringr package to replace ".:" with ":" and replace ".00" with "0.00". Use the head() function to review the result. 

**Step 4.3** Create the new variable "DATE" by combining variable "DATE" and "TIME" together.  

* Using the mutate() function from the dplyr package to create a new variable called "DATE".
* Using the paste() function to:  
  + Put the value from the old variable "DATE" and follow with the values from the variable "TIME".
  + Assign the argument "sep" with " "(blank) to indicate the separator between the values.  
  
**Step 4.4** Convert the variable "DATE" into POSIXct class using ymd_hms() function from "lubridate" package.

**Step 4.5** Drop variables "TIME" by using the select() function with the c() function and the "-" operator to assign the column name "TIME" that I want to drop. Use the str() function to check the result. The result shown that:  


* The "DATE" variable have convert into class POSIXct which representing calendar dates and times.
* The "DATE" variable format change into "yyyy-mm-dd hh:mm:ss".
* The "TIME" variable has dropped.


```{r}

Lives_lost$DATE %<>% substr(start = 1, stop = 10) 

Lives_lost$TIME %<>% str_replace_all(".:", ":")
Lives_lost$TIME %<>% str_replace_all(".00", "0:00")
head(Lives_lost)

Lives_lost %<>% mutate (DATE = paste(DATE, TIME, sep = " "))
Lives_lost$DATE %<>% ymd_hms()
Lives_lost %<>% select(-TIME) 
str(Lives_lost)

```

##	Scan I 

**Step 5.1** Check the amount of NA in each column by using the is.na() function to check for the NA value and the colSum() function to sum up the total of NA value in each column. The result found:  

  * 2 of the values in "DAY_OF_WEEK" variable are NA.  
  * 5 of the values in "AGE" variable are NA.  
  * 16 of the values in "SPEED_ZONE" are NA.

**Step 5.2** Replace the missing values in "DAY_OF_WEEK" variable :  

* The 2 of the value in "DAY_OF_WEEK" variable are NA but the "DATE" variable have no NA value means I can replace the missing value by using the weekdays() function to find the day of the week and assign the value back to column "DAY_OF_WEEK". 
* By doing this replacement causing the "DAY_OF_WEEK" variable to turn into a character type so I convert it back in to an ordered factor by using the same method as the step 2.2.
* Check the NA values in the column "DAY_OF_WEEK" to make sure there is none NA value left in "DAY_OF_WEEK" column by using the is.na() function together with the which() function and the length() function. The result shown that there is no NA value in "DAY_OF_WEEK" variable.

**Step 5.2** Check the amount of NA in the whole dataset by:  

* Use the is.na() function to check for the NA value.
* Use the which() function to identify the position of NA value.
* Use length() function to get the number of the NA value.
* Using pipes operator between the function to pass the result of one function to the other one in sequence.

**Step 5.3** Subsetting the observations with NA values into the dataframe called "NA_relation_check" to be able to have a better look for the relation of the missing value with other variable.  

* Use the is.na() function to find the NA value in the dataset.
* Use the rowSums() function to sum up the missing value in each observation.
* Subset just the observation with missing values (>0) and put in the dataframe "NA_relation_check".

**Step 5.4**
After checking, I could not find any relation to the other variable or between the observation. The missing value look like Missing completely at random (MCAR) for me. "A rule of thumb says that when the data include less than 5% random missingness which does not depend on observed or unobserved values, complete case analysis may be an acceptable approach, too"(Hannah Roos, 2021). 

* Calculate the percentage of the observation with missing value by deviding the number of the observations with the missing values by the number of all the observations in the dataset and then multiply by 100. The result shown that the observations with the missing values is just 1.66% of all observations. Because of the information above, I choose to exclude all the observation with the missing value.

**Step 5.5** Exclude the observations with missing values:  

* Use the complete.cases() function to identify complete row of the dataframe and the square brackets to subset those row with all column. Assign the result to dataframe called "Lives_lost2".
* Check if there is any NA value left in the dataframe with the same function as step 5.2.
* The result shown that there is no NA value left in the dataframe "Lives_lost2".

```{r}

is.na(Lives_lost) %>% colSums()

Lives_lost$DAY_OF_WEEK <- weekdays(Lives_lost$DATE)
Lives_lost$DAY_OF_WEEK %<>% factor(levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), ordered=TRUE)
is.na(Lives_lost$DAY_OF_WEEK) %>% which() %>% length()

is.na(Lives_lost) %>% which() %>% length()
NA_relation_check <- Lives_lost[rowSums(is.na(Lives_lost)) > 0, ]
NA_relation_check %>% head

((NA_relation_check %>% nrow)/(Lives_lost%>% nrow())) * 100
Lives_lost2<-Lives_lost[complete.cases(Lives_lost), ]
is.na(Lives_lost2)%>% which() %>% length()
```

##	Scan II

I am using the univariate outlier detection method with Non-parametric method (Box plot), to approach the "NO_OF_VEHICLES" variable because I am just focusing on one particular variable and I assume that the data is not normally distribute. For the handling process, I choose to delete all the outliers because imputing with mean does not makesense as the value indicate number of vehicles which should be integer (mean is 1.651 which is not integer) and imputing with median is not going to suit with the other variables in the observation. The processes is as following:  

**Step 6.1** Create Boxplot of the variable "NO_OF_VEHICLES" to see the big picture of the data and the outliers by using the boxplot() function.  

* Assign the argument "main" with the title of the boxplot, in this case "Box Plot of Number of vehicles".  
* Assign the argument "ylab" with the y-axis annotation, in this case "Number of vehicles".
* Assign the argument "col" with the color preference for the bodies of the box plot, in this case I choose "orange".

**Step 6.2** Apply the univariate outlier detection by:  

* Getting the Q1 and Q3 values by using the summary() function. The result shown:  
  + The first position: minimum value = 1.00.
  + The second position: Q1 = 1.00.
  + The third position: median = 1.00 
  + The 4th position: mean = 1.65
  + The 5th position: Q3 = 2.00
  + The 6th position: maximum value = 11.00
* Extract the Q1 value by using the double square brackets to subset the value from the 2nd position of the summary() function's result and assign to "q1".
* Extract the Q3 value by using the double square brackets to subset the value from the 5th position of the summary() function's result and assign to "q3".
* Calculate the interquartile range(IQR) by subtract q1 value from q3 value and assign to "iqr".
* Calculate the upper fence by add the 1.5 times of iqr to q3 and assign the value to "upper_fence".
* Calculate the lower fence by subtract the 1.5 times of iqr from q1 and assign the value to "lower_fence".
* Extract the position of the outliers by using the which() function, picking just the values that is more than the "upper_fence" value and less than the "lower_fence" value. Assign the outliers over the "upper_fence" value into "up_outliers" and assign the outliers below the "lower_fence" value into "low_outliers".
* Sum up the amount of outliers in each of the object (the "up_outliers" and the "low_outliers") by using the "length" function. the result shown that the amount of the outlier above the "upper_fence" value is 37 outliers but there is none outlier below the "lower_fence"

**Step 6.3** Handle the outliers by exclude them using the squre bracket and "-" operator to drop the "up_outlier" positions from the "Lives_lost2$NO_OF_VEHICLES" and assign the result into the "Vehicle_clean" object.

* Check the result by using the boxplot() function and assign the list into "bp" and then use the length() function to sum up the amount of the outlier in the "out" element in the "bp" list. The result showm that there is non outlier left in the object "Vehical_clean".

```{r}

Lives_lost2$NO_OF_VEHICLES %>%  
  boxplot(main = "Box Plot of Number of vehicle", ylab = "Number of vehicle", col = "orange") 

summary(Lives_lost2$NO_OF_VEHICLES)

q1 <- summary(Lives_lost2$NO_OF_VEHICLES)[[2]]
q3 <- summary(Lives_lost2$NO_OF_VEHICLES)[[5]]
iqr <- q3 - q1

lower_fence <- q1 - (1.5 * iqr)
upper_fence <- q3 + (1.5 * iqr)

up_outliers <- which(Lives_lost2$NO_OF_VEHICLES > upper_fence)
low_outliers <- which(Lives_lost2$NO_OF_VEHICLES < lower_fence)

low_outliers %>% length()
up_outliers %>% length()

Vehicle_clean<- Lives_lost2$NO_OF_VEHICLES[-up_outliers]

bp<- boxplot(Vehicle_clean)
bp$out %>% length()

```

##	Transform 

**Step 7** Transformation : I choose to transform the "AGE" variable. I convert the variable from ordered factor into numeric by using the as.numeric() function to be able to plot the graph and transform. Then, extract just the "AGE" variable into "Age" object so this process does not change the type of the variable in the big dataset.

**Step 7.1** Create a histogram of the "Age" by using the hist() function.  

* Assign the argument "main" with the title of the histogram "Histogram of Age".
* Assign the argument "xlab" with the x-axis annotation "Age(years old)".
* Assign the argument "col" with the color preference for the bar. 

The result shown that the data appear to be right-skewed. 

**Step 7.2** Transform the Age data to decrease the skewness by using square root transformation as the square root transformation is used for reducing right-skewness.  

* Process the square root transformation by taking the "Age" variable values to the power of 1/2.
* Review the results of the data before and after the transformation side by side by using the par() function to set the graphical parameter of the graph. Use the "mfrow" argument to plot the graph to 2 colunms in 1 row. 
* Using the hist() function to plot both graph.
* The result shown that the square root transformation reduced the right-skewness of the "Age" data. 
* **The moments package is not in the assessment requirment but I just want to explain how I checked the skewness of the graph. I use the skewness() function from the moments package to make sure the data is less skewed. The result shown that the Age skewness was 0.34 to the right but after the transformation the skewness was 0.12 to the left which is less skewed.**
 

```{r}

Age<-as.numeric(Lives_lost2$AGE)

hist(Age,
main="Histogram of Age",
xlab="Age(years old)",
col= "darkslategrey")

par(mfrow=c(1,2))
hist(Age, col = "darkslategrey")
sqrt <- Age^(1/2)
hist(sqrt, col = "darkslategray3")

library(moments)
skewness(Age)
skewness(sqrt)

```


## References:

_Calculate Skewness in R_ n.d., ProgrammingR, viewed 14 October 2021, <https://www.programmingr.com/statistics/skewness/>

Hadley Wickham and Garrett Grolemund, _R for Data Science_, R4DS, viewed 18 September 2021, <https://r4ds.had.co.nz/tidy-data.html>

Mangiafico, S.S. 2016, Summary and Analysis of Extension Program Evaluation in R, viewed 14 October 2021, <https://rcompanion.org/handbook/I_12.html>

Victoria Department of Transport 2021, _Crashes_Last_Five_Years - Open Data_, data file, Victoria Department of Transport Open Data Hub, viewed 10 October 2021, <https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about>.

Victoria Department of Transport, _Fatal Crashes - Lives Lost, Last 5 Years to Date_, the Fatal Crashes - Lives Lost Year to Date dataset, data file, Department of Transport Open Data Hub, viewed 10 October 2021, <https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about>.

Hannah Roos 2021, _Smart handling of missing data in R_, Towards Data Science, viewed 10 October 2021,
<https://towardsdatascience.com/smart-handling-of-missing-data-in-r-6425f8a559f2> 

<br>
<br>
<br>
<br>
