Required packages

Loading all the required packages for this analysis.

library(tidyr)
library(lubridate)
library(ggplot2)
library(readr)
library(forecast)
library(dplyr)
library(Hmisc)
library(editrules)
library(knitr)
library(outliers)

Executive Summary

Data pre processing is the most important step in every analysis. If the data is properly cleaned and all the necessary pre processing steps have been applied correctly, then only the further analysis will make sense. For this analysis, open source data provided by the Victorian government about road accidents was used. The focus was looking at the accident location, demographics of persons involved in the crash and making inferences based on that.

Various data pre processing steps were applied to the data. Alongwith this data a supplementary data was used as well which maps postcode to its corresponding latitudes and longitudes. Data type conversions were implemented for the required columns. Various checks were applied to test whether the data was in tidy formmat or not. Then, missing value treatment was applied wherever necessary. Numeric data was treated for outliers. A new column was created which calculates the distance between accident location and residence of the driver. The latitudes and longitudes were used to calculate the radial distance between these locations. It was really interesting to get an idea about how far the accident took place from the driver’s home. Finally, data transformation was applied onto the age column since its distribution was positively skewed.

Data

The data used is crash statistics data provided by VicRoads.It is an open source data available on VicRoads website: https://discover.data.vic.gov.au/dataset/crash-stats-data-extract

This data provides information about road crashes in Victoria. There are different datasets which give information specific to accidents, persons invovled in the accident, vehicles involved in the accident, atmospheric conditions at the time of accident, accident node location specifications and so on. The datasets for accidents, vehicle, person and node were combined together. The accident data has basic details of the accident, time of accident etc. Person data has information like age, gender of people invovled in the crash. Vehicle data has veicle information, type, make etc. Node has exact co-ordinates of the accident location. ‘Accident No’ is the primary key in the accidents data. It will be used to merge all the other datasets to the accidents data.

Alongwith this, postcode mapping data was used which gives latitude and longitudes for postcodes in Australia. This data too is open source available at: https://www.matthewproctor.com/australian_postcodes This is a postcode mapping data which maps latitude and longitude to postcodes in Australia.

Importing Data

The crash stats data has multiple csv files corresponding to different aspects of the accident. For this analysis, the data related to accidents, accident node, vehicles involved in the crash and persons invovled in the accident was used. Also, the mapping file for postcodes was used. All these 5 datasets were imported.

accident <- read.csv("ACCIDENT.csv",stringsAsFactors = FALSE)

node <- read.csv("NODE.csv",stringsAsFactors = FALSE)

vehicle <- read.csv("VEHICLE.csv",stringsAsFactors = FALSE)

person <- read.csv("PERSON.csv",stringsAsFactors = FALSE)

postcode <- read.csv("postcode.csv",stringsAsFactors = FALSE)

Filtering Data

The data has vast amounts of data. To make interpretation simpler this analysis was restricted to accidents involving only car drivers. The person and was filtered to just include drivers. While vehicle data was filtered out just for cars.

person1 <- filter(person ,Road.User.Type.Desc == 'Drivers' )

vehicle1 <- filter(vehicle , Vehicle.Type.Desc=='Car' )

Selecting variables

Each dataset has around 20 columns. After merging the data the number of colum ns would be extremely high. Therefore, only those columns which were relevant to this analysis were retained. The ’select’function in dplyr allows us to select the required columns by means of their names.

Accident1 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) ACCIDENTDATE - Date of accident 3) NODE_ID - ID for node of accident 4) SPEED_ZONE - Spped zone at the lication of crash

accident1 <- select(accident ,ACCIDENT_NO,ACCIDENTDATE,NODE_ID,SPEED_ZONE)

kable(head(accident1,5))
ACCIDENT_NO ACCIDENTDATE NODE_ID SPEED_ZONE
T20060000010 13/01/2006 43078 60
T20060000018 13/01/2006 29720 70
T20060000022 14/01/2006 203074 100
T20060000023 14/01/2006 55462 80
T20060000026 14/01/2006 202988 50
dim(accident1)
## [1] 176937      4

Vehicle2 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) VEHICLE_ID - Identifier for vehicles involved in an accident 3) Vehicle.Type.Desc - Type of vehile like car, motorbike. In this case only Cars are being filtered 4) CARRY_CAPACITY - Weight carrying cspacity of vehicle 5) OWNER_POSTCODE - Postcode of vehicle owner’s residence

vehicle2 <- select(vehicle1 ,
                   ACCIDENT_NO,VEHICLE_ID,Vehicle.Type.Desc,CARRY_CAPACITY,OWNER_POSTCODE)

kable(head(vehicle2,5))
ACCIDENT_NO VEHICLE_ID Vehicle.Type.Desc CARRY_CAPACITY OWNER_POSTCODE
T20060000010 A Car 1100 3130
T20060000010 B Car 974 3977
T20060000018 B Car 1200 3805
T20060000023 A Car 800 3016
T20060000028 A Car 950 3788
dim(vehicle2)
## [1] 167346      5

Person2 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) VEHICLE_ID - Identifier for vehicles involved in an accident 3) SEX - Gender of person invovled in accident 4) AGE 5) Age.Group - Age group of the person

person2 <- select(person1 ,ACCIDENT_NO,PERSON_ID,VEHICLE_ID,SEX,AGE,Age.Group)
kable(head(person2,5))
ACCIDENT_NO PERSON_ID VEHICLE_ID SEX AGE Age.Group
T20060000010 A A M 72 70+
T20060000010 B B F 62 60-64
T20060000010 C C M 39 30-39
T20060000018 A A M 30 30-39
T20060000018 B B F 20 17-21
dim(person2)
## [1] 256437      6

Node1 data has following variables: 1) ACCIDENT_NO - Identifier for each accident 2) NODE_ID - ID for node of accident 3) Lat - Latitude of accident node 4) Long - Longitude of accident node 5) Postcode.No - Postcode of location/node of accident

node1 <- select(node ,ACCIDENT_NO,NODE_ID,Lat, Long,Postcode.No)
kable(head(node1,5))
ACCIDENT_NO NODE_ID Lat Long Postcode.No
T20060000010 43078 -37.98862 145.2181 3175
T20060000018 29720 -37.99092 145.2763 3804
T20060000022 203074 -38.39632 144.8549 3939
T20060000023 55462 -37.98918 145.1450 3173
T20060000026 202988 -38.37299 144.8716 3940
dim(node1)
## [1] 178953      5

Postcode data has the following valriables: 1) postcode - mapping postcode 2) lat_map - latitude corresponding to the postcode 3) long_map - longitude corresponding to the postcode

kable(head(postcode,5))
postcode lat_map long_map
6532 -28.440886 115.004595
6434 -30.820157 125.672818
6435 -28.219307 120.44875
6436 -29.602251 124.856838
6437 -27.68439 120.529798
dim(postcode)
## [1] 3631    3

All the 5 datasets are combined together. Datasets are merged 2 at a time. First, accidents and persons data are merged by inner join. They are merged on ‘accident no’. Then, this merged data is inner joined with vehicle data on a composite key - combination of accident no and vehicle no. This is done to aviod unecessary duplications in the data. After this, the data is merged with accident node data and this as well is merged on combination of accident no and node id. In the end the resulting data is left joined on the postcode mapping data on postcode.

merge1 <- accident1 %>% inner_join(person2, by = "ACCIDENT_NO")

merge2 <- merge1 %>% inner_join(vehicle2, by = c("ACCIDENT_NO","VEHICLE_ID"))

accidents_merge <- merge2 %>% inner_join(node1, by = c("ACCIDENT_NO","NODE_ID"))

data_final <- accidents_merge %>% 
              left_join(postcode, by = c("OWNER_POSTCODE" = "postcode"))

Understand

The summary of the data shows that the final dataset has 17 columns and has 164467 observations. This data has combination of categorical and numeric variables. The expected type for the variables are 1) Character variables: ACCIDENT_NO, PERSON_ID, VEHICLE_ID, Vehicle.Type.Desc

  1. Numeric variables: NODE_ID, SPEED_ZONE, AGE, CARRY_CAPACITY, OWNER_POSTCODE, Postcode.No, Lat, Long, lat_map, long_map

  2. Factors: Age.Group, SEX

  3. Date variables: ACCIDENTDATE

From summary of dataset it can be seen that data types have been incorrectly specified for lat_map, long_map, ACCIDENTDATE, SEX and Age.Group. Data types need to be altered for all these variables.

str(data_final)
## 'data.frame':    164467 obs. of  17 variables:
##  $ ACCIDENT_NO      : chr  "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
##  $ ACCIDENTDATE     : chr  "13/01/2006" "13/01/2006" "13/01/2006" "14/01/2006" ...
##  $ NODE_ID          : int  43078 43078 29720 55462 203045 205206 203077 43877 43877 43877 ...
##  $ SPEED_ZONE       : int  60 60 70 80 100 60 80 80 80 80 ...
##  $ PERSON_ID        : chr  "A " "B " "B " "A " ...
##  $ VEHICLE_ID       : chr  "A" "B" "B" "A" ...
##  $ SEX              : chr  "M" "F" "F" "M" ...
##  $ AGE              : int  72 62 20 43 50 32 31 81 53 41 ...
##  $ Age.Group        : chr  "70+" "60-64" "17-21" "40-49" ...
##  $ Vehicle.Type.Desc: chr  "Car" "Car" "Car" "Car" ...
##  $ CARRY_CAPACITY   : int  1100 974 1200 800 1450 NA 1141 650 300 1090 ...
##  $ OWNER_POSTCODE   : int  3130 3977 3805 3016 3850 3910 3805 3980 3758 3189 ...
##  $ Lat              : num  -38 -38 -38 -38 -38.1 ...
##  $ Long             : num  145 145 145 145 146 ...
##  $ Postcode.No      : int  3175 3175 3804 3173 3820 3198 3173 3977 3977 3977 ...
##  $ lat_map          : chr  "-37.81805" "-38.135253" "-38.037026" "-37.863743" ...
##  $ long_map         : chr  "145.149653" "145.268888" "145.304198" "144.888461" ...

The latitude and longitudes have been read in as character data. These columns must be converted into integer.

data_final$lat_map <- as.integer(data_final$lat_map)
data_final$long_map <- as.integer(data_final$long_map)

The date of accident is in character form and it needs to be converted into date format. The ‘dmy’ function in lubridate package is used to convert this column into date format.

# convert to date format
data_final$ACCIDENTDATE <- dmy(data_final$ACCIDENTDATE)

head(data_final)
dim(data_final)
## [1] 164467     17

The column for age group needs to be converted into factor while defining the levels as well. The column for Sex also needs to be converted into factor as well.

data_final <- mutate(data_final, Age.Group = factor(data_final$Age.Group, 
levels = c("unknown","0-4","5-12","13-15","16-17","17-21","22-25","26-29","30-39",
           "40-49","50-59","60-64","64-69","70+"), 
                            ordered=TRUE)) 

data_final <- mutate(data_final, SEX = factor(data_final$SEX, levels=c("M","F"), 
                                              ordered = TRUE))

str(data_final)
## 'data.frame':    164467 obs. of  17 variables:
##  $ ACCIDENT_NO      : chr  "T20060000010" "T20060000010" "T20060000018" "T20060000023" ...
##  $ ACCIDENTDATE     : Date, format: "2006-01-13" "2006-01-13" ...
##  $ NODE_ID          : int  43078 43078 29720 55462 203045 205206 203077 43877 43877 43877 ...
##  $ SPEED_ZONE       : int  60 60 70 80 100 60 80 80 80 80 ...
##  $ PERSON_ID        : chr  "A " "B " "B " "A " ...
##  $ VEHICLE_ID       : chr  "A" "B" "B" "A" ...
##  $ SEX              : Ord.factor w/ 2 levels "M"<"F": 1 2 2 1 2 2 1 2 2 1 ...
##  $ AGE              : int  72 62 20 43 50 32 31 81 53 41 ...
##  $ Age.Group        : Ord.factor w/ 14 levels "unknown"<"0-4"<..: 14 12 6 10 11 9 9 14 11 10 ...
##  $ Vehicle.Type.Desc: chr  "Car" "Car" "Car" "Car" ...
##  $ CARRY_CAPACITY   : int  1100 974 1200 800 1450 NA 1141 650 300 1090 ...
##  $ OWNER_POSTCODE   : int  3130 3977 3805 3016 3850 3910 3805 3980 3758 3189 ...
##  $ Lat              : num  -38 -38 -38 -38 -38.1 ...
##  $ Long             : num  145 145 145 145 146 ...
##  $ Postcode.No      : int  3175 3175 3804 3173 3820 3198 3173 3977 3977 3977 ...
##  $ lat_map          : int  -37 -38 -38 -37 -38 -38 -38 -38 -37 -37 ...
##  $ long_map         : int  145 145 145 144 147 145 145 145 145 145 ...

Tidy & Manipulate Data I

As stated by Hadley Wickham the rules which make a dataset tidy 1) Each variable must have its own column 2) Each observation must have its own row 3) Each value must have its own cell

head(data_final,10)

Looking at top 10 observations form the dataset it can be seen that

  1. Each column represents a separate variable.
  2. Each separate row represents a separate observation
  3. Each cell has only one value. None of the cells have more than 1 value in it.

Therefore, all the 3 stated rules are satisfied and this implies that the data is in tidy format.

Scan I

Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.

The data is now scanned for missing values, inconsistencies and any other errors. First step is locating the missing values. Then the next step is recoding those missing values. For some variables, even removing the rows with missing values makes sense. But it completely depends on the type of data one is dealing with and this approach changes from problem to problem.

Missing Values

Scanning the data for missing values at a column level. It is seen that the variable carry capacity has the maximum missing values, followed by lat_map and long_map. Depending on the data type different missing value treatments were applied.

kable(colSums(is.na(data_final)), col.names = "Number of Missing Observations")
Number of Missing Observations
ACCIDENT_NO 0
ACCIDENTDATE 0
NODE_ID 0
SPEED_ZONE 0
PERSON_ID 0
VEHICLE_ID 0
SEX 5012
AGE 5397
Age.Group 0
Vehicle.Type.Desc 0
CARRY_CAPACITY 139188
OWNER_POSTCODE 4324
Lat 16
Long 16
Postcode.No 0
lat_map 7132
long_map 7132

Missing value treatment is applied for Age, lat_map, long_map, lat, long, owner_postcode and sex columns.

Age, Carry Capacity - Impute by median (numeric variables) Usually, for numeric variables, missing values are replaced by mean or median of the column. While for categorical variables, mode is used for replacement. Here for the variables AGE and CARRY_CAPACITY, the missing values are replaced by the median of the respective column.

#Imputing Missing values in Age + Carry Capacity by replacing them with the median

data_final$AGE[is.na(data_final$AGE)] <- median(data_final$AGE, na.rm = TRUE)


data_final$CARRY_CAPACITY[is.na(data_final$CARRY_CAPACITY)] <- 
  median(data_final$CARRY_CAPACITY, na.rm = TRUE)

#Check the results of imputation
kable(colSums(is.na(data_final)),col.names = "Number of Missing Observations")
Number of Missing Observations
ACCIDENT_NO 0
ACCIDENTDATE 0
NODE_ID 0
SPEED_ZONE 0
PERSON_ID 0
VEHICLE_ID 0
SEX 5012
AGE 0
Age.Group 0
Vehicle.Type.Desc 0
CARRY_CAPACITY 0
OWNER_POSTCODE 4324
Lat 16
Long 16
Postcode.No 0
lat_map 7132
long_map 7132

For Lat, Long, lat_map and long_map, these variables represent the latitude and longitudes and it absolutely doesn’t make sense to use mean or median for replacing the missing values. Therefore, the for this variables, the rows with missing values are removed.

#Subset the data by droping incomplete cases
data_imputed <- data_final[!is.na(data_final$long_map),]
data_imputed <- data_imputed[!is.na(data_imputed$Lat),]

#Check the results of imputation
kable(colSums(is.na(data_imputed)),col.names = "Number of Missing Observations")
Number of Missing Observations
ACCIDENT_NO 0
ACCIDENTDATE 0
NODE_ID 0
SPEED_ZONE 0
PERSON_ID 0
VEHICLE_ID 0
SEX 2459
AGE 0
Age.Group 0
Vehicle.Type.Desc 0
CARRY_CAPACITY 0
OWNER_POSTCODE 0
Lat 0
Long 0
Postcode.No 0
lat_map 0
long_map 0

Now only SEX column has missing values. Since it is a categorical column, mode is used for the replacement of missing values. And now all the missing values have been taken care of.

#Impute the missing values in Sex by replacing them witht the mode
data_imputed$SEX <- impute(data_imputed$SEX, fun = mode)

#Check the results of imputation
kable(colSums(is.na(data_imputed)),col.names = "Number of Missing Observations")
Number of Missing Observations
ACCIDENT_NO 0
ACCIDENTDATE 0
NODE_ID 0
SPEED_ZONE 0
PERSON_ID 0
VEHICLE_ID 0
SEX 0
AGE 0
Age.Group 0
Vehicle.Type.Desc 0
CARRY_CAPACITY 0
OWNER_POSTCODE 0
Lat 0
Long 0
Postcode.No 0
lat_map 0
long_map 0

Checking for Special Values

In addition to missing values, some special values also need to be dealt with in R. Checking for Inf and NaN values using user-defined function There are no special values present.

#Create a function to check Inf and NaN
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

#Checking the dataset
kable(sapply(data_imputed, function(x) sum( is.special(x) )), 
      col.names = "Number of Special Values")
Number of Special Values
ACCIDENT_NO 0
ACCIDENTDATE 0
NODE_ID 0
SPEED_ZONE 0
PERSON_ID 0
VEHICLE_ID 0
SEX 0
AGE 0
Age.Group 0
Vehicle.Type.Desc 0
CARRY_CAPACITY 0
OWNER_POSTCODE 0
Lat 0
Long 0
Postcode.No 0
lat_map 0
long_map 0

Checking for inconsistencies and errors

Next step is checking for inconsistencies or errors. These refer to impossible values of a variable which are not possible in real life. We apply the rule on AGE variable. Obviously age cannot be negative and it cannot be greater than 120 years.

#Creating an EditRule around age
(Rule1 <- editset(c("AGE >= 0", "AGE <= 120")))
## 
## Edit set:
## num1 : 0 <= AGE
## num2 : AGE <= 120

Therefore, after applying this condition it was found that there are no observations which violate the condition.

#Checking the dataset for rule violations
violated <- violatedEdits(Rule1, data_imputed)

# summary of violated rules
summary(violated)
## No violations detected, 0 checks evaluated to NA
## NULL

Scan II

In this step, all numeric variables were scanned for outliers. So, Speed Zone, Age, Carry Capacity, Lat, Long, lat_map, long_map were all checked.

Speed Zone - Boxplot A boxplot facilitates univariate outlier detection. The box plot for speed zone shows a few outliers, with their values being too high to be actual speed zone values.

#Checking Outliers
data_imputed$SPEED_ZONE %>%  boxplot(main="Box Plot of Speed Zones", 
                                     ylab="Speed Zones", col = "grey")

Therefore, the outliers were completely removed. The new box plot shows that these outliers have been successfully removed.

#Excluding Outliers
outliers_speed <- boxplot(data_imputed$SPEED_ZONE, plot=FALSE)$out
data_imputed <- data_imputed[-which(data_imputed$SPEED_ZONE %in% outliers_speed),]

#Checking the data after outlier removal
data_imputed$SPEED_ZONE %>%  boxplot(main="Box Plot of Speed Zones - Outlier Correction", 
                                     ylab="Speed Zones", col = "grey")

Age - Boxplot For the AGE variable, median was used to replace the missing values. The box plot shows that there are quite a few outliers present. But condition for inconsistency was already checked for this variables. There isn’t always a need to reomve the outliers if their value makes sense. Therefore for AGE, the outliers were kept as it is in the data.

data_imputed$AGE %>%  boxplot(main="Box Plot of Age", ylab="Age", col = "grey")

Carry Capacity - Boxplot Similar to AGE, Carry capacity also has outliers.

data_imputed$CARRY_CAPACITY %>%  boxplot(main="Box Plot of Carry Capacity", 
                                         ylab="Carry Capacity", col = "grey")

carry <- boxplot(data_imputed$CARRY_CAPACITY, plot=FALSE)$out
data_imputed <- data_imputed[-which(data_imputed$CARRY_CAPACITY %in% carry),]

But the outliers would be imputed for carry capacity. The same method as speed_zone is implemented. The outliers are removed based on the box plot.

Z-Score

For lat_map and long_map, distance based univariate outlier detection technique was used. Z-Score was calculated. Z-Score of greater than 3 implies an outlier. According to this condition, the number of outliers in both these variables are as follows:

#Checking z-scores
z.scores.latmap <- data_imputed$lat_map %>%  scores(type = "z")
z.scores.longmap <- data_imputed$long_map %>%  scores(type = "z")

#No of outliers
o_lat_map <- length(which(abs(z.scores.latmap)>3))
o_long_map <- length(which(abs(z.scores.longmap)>3))
z.scores <- data.frame(o_lat_map,o_long_map)
kable(z.scores, col.names = c("Outliers in lat_map",
                              "Outliers in long_map"))
Outliers in lat_map Outliers in long_map
1182 2191

These outliers too are excluded from the data.

#Excluding outliers
data_imputed<- data_imputed[ - which( abs(z.scores.latmap) >3 ),]
data_imputed<- data_imputed[ - which( abs(z.scores.longmap) >3 ),]

The same technique was implemented for Lat, Long as well. Outliers were those which had the absolute value of Z-Score greater than 3. Lat, Long - Z-scores

#Checking z-scores
z.scores.lat <- data_imputed$Lat %>%  scores(type = "z")
z.scores.long <- data_imputed$Long %>%  scores(type = "z")

#No of outliers
o_lat <- length(which(abs(z.scores.lat)>3))
o_long <- length(which(abs(z.scores.long)>3))
z.scores2 <- data.frame(o_lat,o_long)
kable(z.scores2, col.names = c("Outliers in Lat",
                              "Outliers in Long"))
Outliers in Lat Outliers in Long
2887 4221

These outliers were removed from the data.

#Excluding outliers
data_imputed<- data_imputed[ - which( abs(z.scores.lat) >3 ),]
data_imputed<- data_imputed[ - which( abs(z.scores.long) >3 ),]

Tidy & Manipulate Data II

In order to identify the distance between crash site and the car owner’s home we shall calculate the latitudinal and longitudinal distance between X,Y coordinates of the crash site and the car owner’s house. Since we did not directly have the coordinates for the owner’s place of residence, we used the postcode information provided in the dataset to extrapolate it and find out latitude and longitude pair (by merging an external source of data).

In order to calculate distances we shall use Haversine equation which is as follows:

Distance (kms) = ACOS (SIN(LAT1) * SIN(LAT2) + COS(LAT1) * COS(2) * COS(LONG2-LONG1)) * 6371

Since this formula is calculated for latitude and longitude described in radians and we have our dataset describing it in decimal degrees, we shall use the following radian conversion:

Radian Coordinates = (Degrees * pi) / 180

where pi = 22/7

#Radian Conversion Function
deg2rad <- function(deg) {(deg * pi) / (180)}

data_imputed$distance <- acos((cos(deg2rad(90-data_imputed$Lat))*
                                cos(deg2rad(90-data_imputed$lat_map)))+
                                (sin(deg2rad(90-data_imputed$Lat))*
                                sin(deg2rad(90-data_imputed$lat_map)))*
                                cos(deg2rad(data_imputed$Long -
                                            data_imputed$long_map)))* 6371

We shall check the first few distance calculations as follows:

head(data_imputed$distance)
## [1]  15.99291  95.94852 167.30087 152.41457  24.18816  24.65230

Transform

Data transformations are applied to get normality or variance homogeneity. We shall check the distribution of Age variable in the dataset:

hist(data_imputed$AGE, xlab = "Age", main = "Histogram of Driver's Age")

This distribution appears to be right skewed in nature. We can apply log transformation in order to achieve a normal distribution which may be required for further analysis. For this purpose, we shall check the results of natural logarithm as well be log with base 10.

log_age <- log(data_imputed$AGE)
hist(log_age)

Since the Age data did not have any negative or zero values, we could easily apply the log transformation without any further processing. From the figure above, we can see that the transformation has somewhat helped in reducing the right skewness of the variable. We shall try log with the base 10 in the next step.

log10_age <- log10(data_imputed$AGE)
hist(log10_age)

From this figure, we can say that the log10 transformation appears to be slightly better than the natural log transformation to reduce right skewness. We shall also check with square root transformation in oder to see the impact on skewness.

log_age_sqrt <- sqrt(data_imputed$AGE)
hist(log_age_sqrt)

The square root transformation shifted the frequency to the right, however it did not help in spreading out the high and low values of the data.

From the result above, we can say that the log transformation with the base 10 worked the best in reducing the right skewness of the variable.