Loading all the required packages for this analysis.
library(tidyr)
library(lubridate)
library(ggplot2)
library(readr)
library(forecast)
library(dplyr)
library(Hmisc)
library(editrules)
library(knitr)
library(outliers)
Data pre processing is the most important step in every analysis. If the data is properly cleaned and all the necessary pre processing steps have been applied correctly, then only the further analysis will make sense. For this analysis, open source data provided by the Victorian government about road accidents was used. The focus was looking at the accident location, demographics of persons involved in the crash and making inferences based on that.
Various data pre processing steps were applied to the data. Alongwith this data a supplementary data was used as well which maps postcode to its corresponding latitudes and longitudes. Data type conversions were implemented for the required columns. Various checks were applied to test whether the data was in tidy formmat or not. Then, missing value treatment was applied wherever necessary. Numeric data was treated for outliers. A new column was created which calculates the distance between accident location and residence of the driver. The latitudes and longitudes were used to calculate the radial distance between these locations. It was really interesting to get an idea about how far the accident took place from the driver’s home. Finally, data transformation was applied onto the age column since its distribution was positively skewed.
The data used is crash statistics data provided by VicRoads.It is an open source data available on VicRoads website: https://discover.data.vic.gov.au/dataset/crash-stats-data-extract
This data provides information about road crashes in Victoria. There are different datasets which give information specific to accidents, persons invovled in the accident, vehicles involved in the accident, atmospheric conditions at the time of accident, accident node location specifications and so on. The datasets for accidents, vehicle, person and node were combined together. The accident data has basic details of the accident, time of accident etc. Person data has information like age, gender of people invovled in the crash. Vehicle data has veicle information, type, make etc. Node has exact co-ordinates of the accident location. ‘Accident No’ is the primary key in the accidents data. It will be used to merge all the other datasets to the accidents data.
Alongwith this, postcode mapping data was used which gives latitude and longitudes for postcodes in Australia. This data too is open source available at: https://www.matthewproctor.com/australian_postcodes This is a postcode mapping data which maps latitude and longitude to postcodes in Australia.
The crash stats data has multiple csv files corresponding to different aspects of the accident. For this analysis, the data related to accidents, accident node, vehicles involved in the crash and persons invovled in the accident was used. Also, the mapping file for postcodes was used. All these 5 datasets were imported.
accident <- read.csv("ACCIDENT.csv",stringsAsFactors = FALSE)
node <- read.csv("NODE.csv",stringsAsFactors = FALSE)
vehicle <- read.csv("VEHICLE.csv",stringsAsFactors = FALSE)
person <- read.csv("PERSON.csv",stringsAsFactors = FALSE)
postcode <- read.csv("postcode.csv",stringsAsFactors = FALSE)
The data has vast amounts of data. To make interpretation simpler this analysis was restricted to accidents involving only car drivers. The person and was filtered to just include drivers. While vehicle data was filtered out just for cars.
person1 <- filter(person ,Road.User.Type.Desc == 'Drivers' )
vehicle1 <- filter(vehicle , Vehicle.Type.Desc=='Car' )
Each dataset has around 20 columns. After merging the data the number of colum ns would be extremely high. Therefore, only those columns which were relevant to this analysis were retained. The ’select’function in dplyr allows us to select the required columns by means of their names.
Accident1 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) ACCIDENTDATE - Date of accident 3) NODE_ID - ID for node of accident 4) SPEED_ZONE - Spped zone at the lication of crash
accident1 <- select(accident ,ACCIDENT_NO,ACCIDENTDATE,NODE_ID,SPEED_ZONE)
kable(head(accident1,5))
| ACCIDENT_NO | ACCIDENTDATE | NODE_ID | SPEED_ZONE |
|---|---|---|---|
| T20060000010 | 13/01/2006 | 43078 | 60 |
| T20060000018 | 13/01/2006 | 29720 | 70 |
| T20060000022 | 14/01/2006 | 203074 | 100 |
| T20060000023 | 14/01/2006 | 55462 | 80 |
| T20060000026 | 14/01/2006 | 202988 | 50 |
dim(accident1)
## [1] 176937 4
Vehicle2 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) VEHICLE_ID - Identifier for vehicles involved in an accident 3) Vehicle.Type.Desc - Type of vehile like car, motorbike. In this case only Cars are being filtered 4) CARRY_CAPACITY - Weight carrying cspacity of vehicle 5) OWNER_POSTCODE - Postcode of vehicle owner’s residence
vehicle2 <- select(vehicle1 ,
ACCIDENT_NO,VEHICLE_ID,Vehicle.Type.Desc,CARRY_CAPACITY,OWNER_POSTCODE)
kable(head(vehicle2,5))
| ACCIDENT_NO | VEHICLE_ID | Vehicle.Type.Desc | CARRY_CAPACITY | OWNER_POSTCODE |
|---|---|---|---|---|
| T20060000010 | A | Car | 1100 | 3130 |
| T20060000010 | B | Car | 974 | 3977 |
| T20060000018 | B | Car | 1200 | 3805 |
| T20060000023 | A | Car | 800 | 3016 |
| T20060000028 | A | Car | 950 | 3788 |
dim(vehicle2)
## [1] 167346 5
Person2 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) VEHICLE_ID - Identifier for vehicles involved in an accident 3) SEX - Gender of person invovled in accident 4) AGE 5) Age.Group - Age group of the person
person2 <- select(person1 ,ACCIDENT_NO,PERSON_ID,VEHICLE_ID,SEX,AGE,Age.Group)
kable(head(person2,5))
| ACCIDENT_NO | PERSON_ID | VEHICLE_ID | SEX | AGE | Age.Group |
|---|---|---|---|---|---|
| T20060000010 | A | A | M | 72 | 70+ |
| T20060000010 | B | B | F | 62 | 60-64 |
| T20060000010 | C | C | M | 39 | 30-39 |
| T20060000018 | A | A | M | 30 | 30-39 |
| T20060000018 | B | B | F | 20 | 17-21 |
dim(person2)
## [1] 256437 6
Node1 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) NODE_ID - ID for node of accident 3) Lat - Latitude of accident node 4) Long - Longitude of accident node 5) Postcode.No - Postcode of location/node of accident
node1 <- select(node ,ACCIDENT_NO,NODE_ID,Lat, Long,Postcode.No)
kable(head(node1,5))
| ACCIDENT_NO | NODE_ID | Lat | Long | Postcode.No |
|---|---|---|---|---|
| T20060000010 | 43078 | -37.98862 | 145.2181 | 3175 |
| T20060000018 | 29720 | -37.99092 | 145.2763 | 3804 |
| T20060000022 | 203074 | -38.39632 | 144.8549 | 3939 |
| T20060000023 | 55462 | -37.98918 | 145.1450 | 3173 |
| T20060000026 | 202988 | -38.37299 | 144.8716 | 3940 |
dim(node1)
## [1] 178953 5
Postcode data has the following valriables: 1) postcode - mapping postcode 2) lat_map - latitude corresponding to the postcode 3) long_map - longitude corresponding to the postcode
kable(head(postcode,5))
| postcode | lat_map | long_map |
|---|---|---|
| 6532 | -28.440886 | 115.004595 |
| 6434 | -30.820157 | 125.672818 |
| 6435 | -28.219307 | 120.44875 |
| 6436 | -29.602251 | 124.856838 |
| 6437 | -27.68439 | 120.529798 |
dim(postcode)
## [1] 3631 3
All the 5 datasets are combined together. Datasets are merged 2 at a time. First, accidents and persons data are merged by inner join. They are merged on ‘accident no’. Then, this merged data is inner joined with vehicle data on a composite key - combination of accident no and vehicle no. This is done to aviod unecessary duplications in the data. After this, the data is merged with accident node data and this as well is merged on combination of accident no and node id. In the end the resulting data is left joined on the postcode mapping data on postcode.
merge1 <- accident1 %>% inner_join(person2, by = "ACCIDENT_NO")
merge2 <- merge1 %>% inner_join(vehicle2, by = c("ACCIDENT_NO","VEHICLE_ID"))
accidents_merge <- merge2 %>% inner_join(node1, by = c("ACCIDENT_NO","NODE_ID"))
data_final <- accidents_merge %>%
left_join(postcode, by = c("OWNER_POSTCODE" = "postcode"))
The summary of the data shows that the final dataset has 17 columns and has 164467 observations. This data has combination of categorical and numeric variables. The expected type for the variables are 1) Character variables: ACCIDENT_NO, PERSON_ID, VEHICLE_ID, Vehicle.Type.Desc
Numeric variables: NODE_ID, SPEED_ZONE, AGE, CARRY_CAPACITY, OWNER_POSTCODE, Postcode.No, Lat, Long, lat_map, long_map
Factors: Age.Group, SEX
Date variables: ACCIDENTDATE
From summary of dataset it can be seen that data types have been incorrectly specified for lat_map, long_map, ACCIDENTDATE, SEX and Age.Group. Data types need to be altered for all these variables.
str(data_final)
## 'data.frame': 164467 obs. of 17 variables:
## $ ACCIDENT_NO : chr "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
## $ ACCIDENTDATE : chr "13/01/2006" "13/01/2006" "13/01/2006" "14/01/2006" ...
## $ NODE_ID : int 43078 43078 29720 55462 203045 205206 203077 43877 43877 43877 ...
## $ SPEED_ZONE : int 60 60 70 80 100 60 80 80 80 80 ...
## $ PERSON_ID : chr "A " "B " "B " "A " ...
## $ VEHICLE_ID : chr "A" "B" "B" "A" ...
## $ SEX : chr "M" "F" "F" "M" ...
## $ AGE : int 72 62 20 43 50 32 31 81 53 41 ...
## $ Age.Group : chr "70+" "60-64" "17-21" "40-49" ...
## $ Vehicle.Type.Desc: chr "Car" "Car" "Car" "Car" ...
## $ CARRY_CAPACITY : int 1100 974 1200 800 1450 NA 1141 650 300 1090 ...
## $ OWNER_POSTCODE : int 3130 3977 3805 3016 3850 3910 3805 3980 3758 3189 ...
## $ Lat : num -38 -38 -38 -38 -38.1 ...
## $ Long : num 145 145 145 145 146 ...
## $ Postcode.No : int 3175 3175 3804 3173 3820 3198 3173 3977 3977 3977 ...
## $ lat_map : chr "-37.81805" "-38.135253" "-38.037026" "-37.863743" ...
## $ long_map : chr "145.149653" "145.268888" "145.304198" "144.888461" ...
The latitude and longitudes have been read in as character data. These columns must be converted into integer.
data_final$lat_map <- as.integer(data_final$lat_map)
data_final$long_map <- as.integer(data_final$long_map)
The date of accident is in character form and it needs to be converted into date format. The ‘dmy’ function in lubridate package is used to convert this column into date format.
# convert to date format
data_final$ACCIDENTDATE <- dmy(data_final$ACCIDENTDATE)
head(data_final)
dim(data_final)
## [1] 164467 17
The column for age group needs to be converted into factor while defining the levels as well. The column for Sex also needs to be converted into factor as well.
data_final <- mutate(data_final, Age.Group = factor(data_final$Age.Group,
levels = c("unknown","0-4","5-12","13-15","16-17","17-21","22-25","26-29","30-39",
"40-49","50-59","60-64","64-69","70+"),
ordered=TRUE))
data_final <- mutate(data_final, SEX = factor(data_final$SEX, levels=c("M","F"),
ordered = TRUE))
str(data_final)
## 'data.frame': 164467 obs. of 17 variables:
## $ ACCIDENT_NO : chr "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
## $ ACCIDENTDATE : Date, format: "2006-01-13" "2006-01-13" ...
## $ NODE_ID : int 43078 43078 29720 55462 203045 205206 203077 43877 43877 43877 ...
## $ SPEED_ZONE : int 60 60 70 80 100 60 80 80 80 80 ...
## $ PERSON_ID : chr "A " "B " "B " "A " ...
## $ VEHICLE_ID : chr "A" "B" "B" "A" ...
## $ SEX : Ord.factor w/ 2 levels "M"<"F": 1 2 2 1 2 2 1 2 2 1 ...
## $ AGE : int 72 62 20 43 50 32 31 81 53 41 ...
## $ Age.Group : Ord.factor w/ 14 levels "unknown"<"0-4"<..: 14 12 6 10 11 9 9 14 11 10 ...
## $ Vehicle.Type.Desc: chr "Car" "Car" "Car" "Car" ...
## $ CARRY_CAPACITY : int 1100 974 1200 800 1450 NA 1141 650 300 1090 ...
## $ OWNER_POSTCODE : int 3130 3977 3805 3016 3850 3910 3805 3980 3758 3189 ...
## $ Lat : num -38 -38 -38 -38 -38.1 ...
## $ Long : num 145 145 145 145 146 ...
## $ Postcode.No : int 3175 3175 3804 3173 3820 3198 3173 3977 3977 3977 ...
## $ lat_map : int -37 -38 -38 -37 -38 -38 -38 -38 -37 -37 ...
## $ long_map : int 145 145 145 144 147 145 145 145 145 145 ...
As stated by Hadley Wickham the rules which make a dataset tidy 1) Each variable must have its own column 2) Each observation must have its own row 3) Each value must have its own cell
head(data_final,10)
Looking at top 10 observations form the dataset it can be seen that
Therefore, all the 3 stated rules are satisfied and this implies that the data is in tidy format.
Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.
The data is now scanned for missing values, inconsistencies and any other errors. First step is locating the missing values. Then the next step is recoding those missing values. For some variables, even removing the rows with missing values makes sense. But it completely depends on the type of data one is dealing with and this approach changes from problem to problem.
Scanning the data for missing values at a column level. It is seen that the variable carry capacity has the maximum missing values, followed by lat_map and long_map. Depending on the data type different missing value treatments were applied.
kable(colSums(is.na(data_final)), col.names = "Number of Missing Observations")
| Number of Missing Observations | |
|---|---|
| ACCIDENT_NO | 0 |
| ACCIDENTDATE | 0 |
| NODE_ID | 0 |
| SPEED_ZONE | 0 |
| PERSON_ID | 0 |
| VEHICLE_ID | 0 |
| SEX | 5012 |
| AGE | 5397 |
| Age.Group | 0 |
| Vehicle.Type.Desc | 0 |
| CARRY_CAPACITY | 139188 |
| OWNER_POSTCODE | 4324 |
| Lat | 16 |
| Long | 16 |
| Postcode.No | 0 |
| lat_map | 7132 |
| long_map | 7132 |
Missing value treatment is applied for Age, lat_map, long_map, lat, long, owner_postcode and sex columns.
Age, Carry Capacity - Impute by median (numeric variables) Usually, for numeric variables, missing values are replaced by mean or median of the column. While for categorical variables, mode is used for replacement. Here for the variables AGE and CARRY_CAPACITY, the missing values are replaced by the median of the respective column.
#Imputing Missing values in Age + Carry Capacity by replacing them with the median
data_final$AGE[is.na(data_final$AGE)] <- median(data_final$AGE, na.rm = TRUE)
data_final$CARRY_CAPACITY[is.na(data_final$CARRY_CAPACITY)] <-
median(data_final$CARRY_CAPACITY, na.rm = TRUE)
#Check the results of imputation
kable(colSums(is.na(data_final)),col.names = "Number of Missing Observations")
| Number of Missing Observations | |
|---|---|
| ACCIDENT_NO | 0 |
| ACCIDENTDATE | 0 |
| NODE_ID | 0 |
| SPEED_ZONE | 0 |
| PERSON_ID | 0 |
| VEHICLE_ID | 0 |
| SEX | 5012 |
| AGE | 0 |
| Age.Group | 0 |
| Vehicle.Type.Desc | 0 |
| CARRY_CAPACITY | 0 |
| OWNER_POSTCODE | 4324 |
| Lat | 16 |
| Long | 16 |
| Postcode.No | 0 |
| lat_map | 7132 |
| long_map | 7132 |
For Lat, Long, lat_map and long_map, these variables represent the latitude and longitudes and it absolutely doesn’t make sense to use mean or median for replacing the missing values. Therefore, the for this variables, the rows with missing values are removed.
#Subset the data by droping incomplete cases
data_imputed <- data_final[!is.na(data_final$long_map),]
data_imputed <- data_imputed[!is.na(data_imputed$Lat),]
#Check the results of imputation
kable(colSums(is.na(data_imputed)),col.names = "Number of Missing Observations")
| Number of Missing Observations | |
|---|---|
| ACCIDENT_NO | 0 |
| ACCIDENTDATE | 0 |
| NODE_ID | 0 |
| SPEED_ZONE | 0 |
| PERSON_ID | 0 |
| VEHICLE_ID | 0 |
| SEX | 2459 |
| AGE | 0 |
| Age.Group | 0 |
| Vehicle.Type.Desc | 0 |
| CARRY_CAPACITY | 0 |
| OWNER_POSTCODE | 0 |
| Lat | 0 |
| Long | 0 |
| Postcode.No | 0 |
| lat_map | 0 |
| long_map | 0 |
Now only SEX column has missing values. Since it is a categorical column, mode is used for the replacement of missing values. And now all the missing values have been taken care of.
#Impute the missing values in Sex by replacing them witht the mode
data_imputed$SEX <- impute(data_imputed$SEX, fun = mode)
#Check the results of imputation
kable(colSums(is.na(data_imputed)),col.names = "Number of Missing Observations")
| Number of Missing Observations | |
|---|---|
| ACCIDENT_NO | 0 |
| ACCIDENTDATE | 0 |
| NODE_ID | 0 |
| SPEED_ZONE | 0 |
| PERSON_ID | 0 |
| VEHICLE_ID | 0 |
| SEX | 0 |
| AGE | 0 |
| Age.Group | 0 |
| Vehicle.Type.Desc | 0 |
| CARRY_CAPACITY | 0 |
| OWNER_POSTCODE | 0 |
| Lat | 0 |
| Long | 0 |
| Postcode.No | 0 |
| lat_map | 0 |
| long_map | 0 |
In addition to missing values, some special values also need to be dealt with in R. Checking for Inf and NaN values using user-defined function There are no special values present.
#Create a function to check Inf and NaN
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
#Checking the dataset
kable(sapply(data_imputed, function(x) sum( is.special(x) )),
col.names = "Number of Special Values")
| Number of Special Values | |
|---|---|
| ACCIDENT_NO | 0 |
| ACCIDENTDATE | 0 |
| NODE_ID | 0 |
| SPEED_ZONE | 0 |
| PERSON_ID | 0 |
| VEHICLE_ID | 0 |
| SEX | 0 |
| AGE | 0 |
| Age.Group | 0 |
| Vehicle.Type.Desc | 0 |
| CARRY_CAPACITY | 0 |
| OWNER_POSTCODE | 0 |
| Lat | 0 |
| Long | 0 |
| Postcode.No | 0 |
| lat_map | 0 |
| long_map | 0 |
Next step is checking for inconsistencies or errors. These refer to impossible values of a variable which are not possible in real life. We apply the rule on AGE variable. Obviously age cannot be negative and it cannot be greater than 120 years.
#Creating an EditRule around age
(Rule1 <- editset(c("AGE >= 0", "AGE <= 120")))
##
## Edit set:
## num1 : 0 <= AGE
## num2 : AGE <= 120
Therefore, after applying this condition it was found that there are no observations which violate the condition.
#Checking the dataset for rule violations
violated <- violatedEdits(Rule1, data_imputed)
# summary of violated rules
summary(violated)
## No violations detected, 0 checks evaluated to NA
## NULL
In this step, all numeric variables were scanned for outliers. So, Speed Zone, Age, Carry Capacity, Lat, Long, lat_map, long_map were all checked.
Speed Zone - Boxplot A boxplot facilitates univariate outlier detection. The box plot for speed zone shows a few outliers, with their values being too high to be actual speed zone values.
#Checking Outliers
data_imputed$SPEED_ZONE %>% boxplot(main="Box Plot of Speed Zones",
ylab="Speed Zones", col = "grey")
Therefore, the outliers were completely removed. The new box plot shows that these outliers have been successfully removed.
#Excluding Outliers
outliers_speed <- boxplot(data_imputed$SPEED_ZONE, plot=FALSE)$out
data_imputed <- data_imputed[-which(data_imputed$SPEED_ZONE %in% outliers_speed),]
#Checking the data after outlier removal
data_imputed$SPEED_ZONE %>% boxplot(main="Box Plot of Speed Zones - Outlier Correction",
ylab="Speed Zones", col = "grey")
Age - Boxplot For the AGE variable, median was used to replace the missing values. The box plot shows that there are quite a few outliers present. But condition for inconsistency was already checked for this variables. There isn’t always a need to reomve the outliers if their value makes sense. Therefore for AGE, the outliers were kept as it is in the data.
data_imputed$AGE %>% boxplot(main="Box Plot of Age", ylab="Age", col = "grey")
Carry Capacity - Boxplot Similar to AGE, Carry capacity also has outliers.
data_imputed$CARRY_CAPACITY %>% boxplot(main="Box Plot of Carry Capacity",
ylab="Carry Capacity", col = "grey")
carry <- boxplot(data_imputed$CARRY_CAPACITY, plot=FALSE)$out
data_imputed <- data_imputed[-which(data_imputed$CARRY_CAPACITY %in% carry),]
But the outliers would be imputed for carry capacity. The same method as speed_zone is implemented. The outliers are removed based on the box plot.
Z-Score
For lat_map and long_map, distance based univariate outlier detection technique was used. Z-Score was calculated. Z-Score of greater than 3 implies an outlier. According to this condition, the number of outliers in both these variables are as follows:
#Checking z-scores
z.scores.latmap <- data_imputed$lat_map %>% scores(type = "z")
z.scores.longmap <- data_imputed$long_map %>% scores(type = "z")
#No of outliers
o_lat_map <- length(which(abs(z.scores.latmap)>3))
o_long_map <- length(which(abs(z.scores.longmap)>3))
z.scores <- data.frame(o_lat_map,o_long_map)
kable(z.scores, col.names = c("Outliers in lat_map",
"Outliers in long_map"))
| Outliers in lat_map | Outliers in long_map |
|---|---|
| 1182 | 2191 |
These outliers too are excluded from the data.
#Excluding outliers
data_imputed<- data_imputed[ - which( abs(z.scores.latmap) >3 ),]
data_imputed<- data_imputed[ - which( abs(z.scores.longmap) >3 ),]
The same technique was implemented for Lat, Long as well. Outliers were those which had the absolute value of Z-Score greater than 3. Lat, Long - Z-scores
#Checking z-scores
z.scores.lat <- data_imputed$Lat %>% scores(type = "z")
z.scores.long <- data_imputed$Long %>% scores(type = "z")
#No of outliers
o_lat <- length(which(abs(z.scores.lat)>3))
o_long <- length(which(abs(z.scores.long)>3))
z.scores2 <- data.frame(o_lat,o_long)
kable(z.scores2, col.names = c("Outliers in Lat",
"Outliers in Long"))
| Outliers in Lat | Outliers in Long |
|---|---|
| 2887 | 4221 |
These outliers were removed from the data.
#Excluding outliers
data_imputed<- data_imputed[ - which( abs(z.scores.lat) >3 ),]
data_imputed<- data_imputed[ - which( abs(z.scores.long) >3 ),]
In order to identify the distance between crash site and the car owner’s home we shall calculate the latitudinal and longitudinal distance between X,Y coordinates of the crash site and the car owner’s house. Since we did not directly have the coordinates for the owner’s place of residence, we used the postcode information provided in the dataset to extrapolate it and find out latitude and longitude pair (by merging an external source of data).
In order to calculate distances we shall use Haversine equation which is as follows:
Distance (kms) = ACOS (SIN(LAT1) * SIN(LAT2) + COS(LAT1) * COS(2) * COS(LONG2-LONG1)) * 6371
Since this formula is calculated for latitude and longitude described in radians and we have our dataset describing it in decimal degrees, we shall use the following radian conversion:
Radian Coordinates = (Degrees * pi) / 180
where pi = 22/7
#Radian Conversion Function
deg2rad <- function(deg) {(deg * pi) / (180)}
data_imputed$distance <- acos((cos(deg2rad(90-data_imputed$Lat))*
cos(deg2rad(90-data_imputed$lat_map)))+
(sin(deg2rad(90-data_imputed$Lat))*
sin(deg2rad(90-data_imputed$lat_map)))*
cos(deg2rad(data_imputed$Long -
data_imputed$long_map)))* 6371
We shall check the first few distance calculations as follows:
head(data_imputed$distance)
## [1] 15.99291 95.94852 167.30087 152.41457 24.18816 24.65230
Data transformations are applied to get normality or variance homogeneity. We shall check the distribution of Age variable in the dataset:
hist(data_imputed$AGE, xlab = "Age", main = "Histogram of Driver's Age")
This distribution appears to be right skewed in nature. We can apply log transformation in order to achieve a normal distribution which may be required for further analysis. For this purpose, we shall check the results of natural logarithm as well be log with base 10.
log_age <- log(data_imputed$AGE)
hist(log_age)
Since the Age data did not have any negative or zero values, we could easily apply the log transformation without any further processing. From the figure above, we can see that the transformation has somewhat helped in reducing the right skewness of the variable. We shall try log with the base 10 in the next step.
log10_age <- log10(data_imputed$AGE)
hist(log10_age)
From this figure, we can say that the log10 transformation appears to be slightly better than the natural log transformation to reduce right skewness. We shall also check with square root transformation in oder to see the impact on skewness.
log_age_sqrt <- sqrt(data_imputed$AGE)
hist(log_age_sqrt)
The square root transformation shifted the frequency to the right, however it did not help in spreading out the high and low values of the data.
From the result above, we can say that the log transformation with the base 10 worked the best in reducing the right skewness of the variable.