library(readr)
library(xlsx)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readxl)
library(editrules)
library(forecast)

Executive Summary

In this assignment, we have used two datasets (Personal Income in Australia and Youth Detention Centres in Australia) for our analysis. Some of the pre-processing steps were performed on dataset 1 and dataset 2 separately, while others were performed after merging the two datasets. The report is organised around the following datasets:

Dataset 1 - Personal Income in Australia

Dataset 2 - Youth Detention Centres in Australia

Merged Dataset - Personal Income in Australia and Youth Detention Centres in Australia

Data

1 Dataset 1 - “Personal Income of Australia”

Data Source - https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/personal-income-australia/latest-release

Description of the dataset - This dataset provides us with year-on-year details of earners in each state for the period 2011-2017. The details of the earners provided are:

  • “Earners” (persons): This variable provides us with the total count of earners present in a state at any given time in the period 2011-2017

  • “Median age of earners”: This variable provides us with an estimate of the median age of the earning population within a state for the time period between 2011-2017

  • “Sum($)”: This variable provides us with the total money made by the earners within each state for the time period between 2011-2017

  • “Median($)”: This variable provides us with the median amount of money made by earners within each state for the time period between 2011-2017

  • “Mean($)”: This variable provides us with the average amount of money made by earners within each state for the time period between 2011-2017

Although the above dataset consists of 5 variables in wide format, we will only be considering 2 of them for our analysis: “Earners” and “Sum”.

Since the column headers under both “Earners” and “Sum” are the same (years), we will load them separately, tidy them, and then merge them together.

1.1 Loading one dataset by splitting into two

1.1.1 Dataset - “Personal Income of Australia - Total Number of Earners”

Personal_Income_in_Australia_Earners <- read_excel("/Applications/Documents/Study/RMIT Learning/Second Semester/Data Wrangling/02 Assignment/03 Data Sources/Personal Income in Australia.xls", 
    sheet = "Table 1.1", col_types = c("text", 
        "skip", "text", "text", "text", "text", 
        "text", "text", "skip", "skip", "skip", 
        "skip", "skip", "skip", "skip", "skip", 
        "skip", "skip", "skip", "skip", "skip", 
        "skip", "skip", "skip", "skip", "skip", 
        "skip", "skip", "skip", "skip", "skip", 
        "skip"), skip = 6)

head(Personal_Income_in_Australia_Earners,3)

We will only be selecting the states from the dataframe, so we will subset this dataset.

Personal_Income_in_Australia_Earners_filtered <- Personal_Income_in_Australia_Earners[c(2,5,8,11,14,17,20,23),]

head(Personal_Income_in_Australia_Earners_filtered,3)

1.1.2 Dataset - “Personal Income of Australia - Total Income”

Personal_Income_in_Australia_Sum <- read_excel("/Applications/Documents/Study/RMIT Learning/Second Semester/Data Wrangling/02 Assignment/03 Data Sources/Personal Income in Australia.xls", 
    sheet = "Table 1.1", col_types = c("text", 
        "skip", "skip", "skip", "skip", "skip", 
        "skip", "skip", "skip", "skip", "skip", 
        "skip", "skip", "skip", "numeric", 
        "numeric", "numeric", "numeric", 
        "numeric", "numeric", "skip", "skip", 
        "skip", "skip", "skip", "skip", "skip", 
        "skip", "skip", "skip", "skip", "skip"), 
    skip = 6)
head(Personal_Income_in_Australia_Sum,3)

Similar to the above step, we will be subsetting the dataset to include only the states

Personal_Income_in_Australia_Sum_filtered <- Personal_Income_in_Australia_Sum[c(2,5,8,11,14,17,20,23),]

head(Personal_Income_in_Australia_Sum_filtered,3)

1.2 Tidy & Manipulate Data

1.2.1 Untidy dataset

As we can see, our data is not in the tidy format we would want it to be.

Why is this data untidy?

  • All the years are spread across columns in wide format when they could be combined under one variable “Years”

  • For our analysis, we should have the following columns for our data to follow the tidy principles: “States”, “Year”, “Earners”, “Sum($)”

    • In this format, all the variables have a column
    • All the observations have their own row
    • Each value has its own cell

Let us convert our dataset to a tidy set!

1.2.2 Tidy dataset

To convert our datasets (Earners and Total Income) into tidy datasets, we will convert their year columns from wide format to long format, bringing them under one column “Years”. This way, each variable has its own column, each observation (a state-year combination) has its own row, and each value has its own cell.

Personal_Income_in_Australia_Earners_filtered <- Personal_Income_in_Australia_Earners_filtered%>% gather(`2011-12`, `2012-13`, `2013-14`, `2014-15`, `2015-16`, `2016-17`, key = "Years", value = "Number_of_earners")

Personal_Income_in_Australia_Sum_filtered <- Personal_Income_in_Australia_Sum_filtered%>% gather(`2011-12`, `2012-13`, `2013-14`, `2014-15`, `2015-16`, `2016-17`, key = "Years", value = "Total_Money_Earned")

head(Personal_Income_in_Australia_Earners_filtered,3)
head(Personal_Income_in_Australia_Sum_filtered,3)
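As an aside, gather() has since been superseded in tidyr; the sketch below shows the same wide-to-long reshaping with pivot_longer(), assuming tidyr >= 1.0.0 is available. It is applied to the full wide sheet here, so the non-state rows would still need to be dropped as in section 1.1.1.

## Sketch only: pivot_longer() equivalent of the gather() calls above
earners_long_sketch <- Personal_Income_in_Australia_Earners %>%
    pivot_longer(cols = c(`2011-12`, `2012-13`, `2013-14`, `2014-15`, `2015-16`, `2016-17`),
                 names_to = "Years", values_to = "Number_of_earners")

head(earners_long_sketch, 3)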

1.3 Understanding dataset

1.3.1 Merging both the datasets (dataset 1.1 and dataset 1.2) together

We will now join these two datasets into one.

Personal_Income_in_Australia_Earners_merged <- inner_join(Personal_Income_in_Australia_Earners_filtered, Personal_Income_in_Australia_Sum_filtered, by = c("GCCSA","Years"))

head(Personal_Income_in_Australia_Earners_merged,3)
str(Personal_Income_in_Australia_Earners_merged)
## tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
##  $ GCCSA             : chr [1:48] "New South Wales" "Victoria" "Queensland" "South Australia" ...
##  $ Years             : chr [1:48] "2011-12" "2011-12" "2011-12" "2011-12" ...
##  $ Number_of_earners : chr [1:48] "3824011" "3035381" "2448420" "880884" ...
##  $ Total_Money_Earned: num [1:48] 2.18e+11 1.64e+11 1.31e+11 4.48e+10 8.55e+10 ...

1.3.2 Transforming data type conversions and renaming columns

From the above table, we can observe that a few transformations of the column types need to take place (a consolidated dplyr version is sketched after the code below):

  • “GCCSA” currently stores each state as “character”, but this will be converted to “factor” since it represents categorical data. We do not need to order the levels since the states are unordered categories

  • “GCCSA” will be renamed to “State” to make it clearer

  • “Number_of_earners” is currently character, but since it represents the total number of earners in each state, we will convert it to numeric

  • “Years” will be renamed to “Year” to make it clearer

## Converting "GCCSA" to factor
Personal_Income_in_Australia_Earners_merged$GCCSA <- as.factor(Personal_Income_in_Australia_Earners_merged$GCCSA)

## Renaming "GCCSA" to "State" 
Personal_Income_in_Australia_Earners_merged <- Personal_Income_in_Australia_Earners_merged %>% rename(State = GCCSA)

## Converting "Number_of_earners" to integer
Personal_Income_in_Australia_Earners_merged$Number_of_earners <- as.numeric(Personal_Income_in_Australia_Earners_merged$Number_of_earners)

## Renaming Years to Year
Personal_Income_in_Australia_Earners_merged <- Personal_Income_in_Australia_Earners_merged %>% rename(Year = Years)
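For reference, the renames and type conversions above could also be expressed as one dplyr pipeline. The sketch below starts again from the two filtered data frames, so it reproduces the same result without depending on the chunk above.

## Sketch only: the join, renames, and conversions in a single pipeline
merged_sketch <- inner_join(Personal_Income_in_Australia_Earners_filtered,
                            Personal_Income_in_Australia_Sum_filtered,
                            by = c("GCCSA", "Years")) %>%
    rename(State = GCCSA, Year = Years) %>%                    ## clearer column names
    mutate(State = as.factor(State),                           ## categorical variable
           Number_of_earners = as.numeric(Number_of_earners))  ## character -> numeric

head(merged_sketch, 3)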

1.3.3 Summarizing the variables type and data structures

We have a dataframe “Personal_Income_in_Australia_Earners_merged” of dimensions 48x4; once the per-capita variable is added in section 1.3.4 it becomes 48x5.

Variable               Type
State                  Factor
Year                   Character
Number_of_earners      Numeric
Total_Money_Earned     Numeric
Per_Capita_Income      Numeric (added in section 1.3.4)

State has been converted to a factor since it is an unordered categorical variable. Year has been kept as a character since factoring it did not provide any additional value.

## Checking the output of the conversion
str(Personal_Income_in_Australia_Earners_merged)
## tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
##  $ State             : Factor w/ 8 levels "Australian Capital Territory",..: 2 7 4 5 8 6 3 1 2 7 ...
##  $ Year              : chr [1:48] "2011-12" "2011-12" "2011-12" "2011-12" ...
##  $ Number_of_earners : num [1:48] 3824011 3035381 2448420 880884 1349170 ...
##  $ Total_Money_Earned: num [1:48] 2.18e+11 1.64e+11 1.31e+11 4.48e+10 8.55e+10 ...

1.3.4 Creating a variable - Per Capita Income

We will now create a new variable, “Per_Capita_Income”, calculated as the total earnings per state divided by the total number of earners in the state. This gives us a better sense of how much each earner earns in each state over the years.

Personal_Income_in_Australia_Earners_merged$Per_Capita_Income <- Personal_Income_in_Australia_Earners_merged$Total_Money_Earned/Personal_Income_in_Australia_Earners_merged$Number_of_earners

head(Personal_Income_in_Australia_Earners_merged,3)
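The same derived column could also be added with dplyr::mutate(), which keeps the step in the same tidyverse style as the rest of the pipeline; a brief sketch:

## Sketch only: Per_Capita_Income via mutate() (equivalent to the base R step above)
per_capita_sketch <- Personal_Income_in_Australia_Earners_merged %>%
    mutate(Per_Capita_Income_check = Total_Money_Earned / Number_of_earners)

head(per_capita_sketch, 3)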

2 Dataset 2 - “Youth Justice Detention Data in Australia”

Data Source - https://data.gov.au/data/dataset/youth-justice-detention-data/resource/c7edfa08-7bc9-404d-8f2b-22bcd0425021

Description of the dataset - This dataset provides us with quarterly details of the average nightly population in detention centres across Australia for the period 2008-2017. The variables provided are:

  • “agegrp” (persons): This variable provides us with the age group of the people in detention in a state at a quarterly level for the time period between 2008-2017

  • “indig_status”: This variable provides us with the indigenous status of the population present in the detention centres within a state at a quarterly level for the time period between 2008-2017

  • “legal_status”: This variable provides us with the legal status of the people in the detention centres within each state at a quarterly level for the time period between 2008-2017

  • “sex”: This variable provides us with the sex of the people present in detention centres within each state at a quarterly level for the time period between 2008-2017

  • “State”: This variable provides us with the state details for the time period between 2008-2017

  • “quarter”: This variable provides us with the quarter details in the format of Month/Year for the time period between 2008-2017

  • “quart”: This variable provides us with the quarter-ending month (Mar, Jun, Sep, Dec) for the time period between 2008-2017

  • “year”: This variable provides us with the year details in the format of “year” for the time period between 2008-2017

  • “avg_nightly_pop”: This variable provides us with the average night population in the youth detention centres within each state at a quarterly level for the time period between 2008-2017

youth.justice.detention.data.2017 <- read.csv("/Applications/Documents/Study/RMIT Learning/Second Semester/Data Wrangling/02 Assignment/03 Data Sources/youth-justice-detention-data-2017.csv")

head(youth.justice.detention.data.2017,3)

2.1 Understanding Dataset 2: Youth Detention Centres in Australia

Let us have a look at the summary of the dataframe. From here, we will get a brief idea of what values are present in the dataframe and what kind of pre-processing needs to be done

We have a dataframe “youth.justice.detention.data.2017” of 42768x9 dimensions

Variable          Type     Levels
agegrp            Factor   3
indig_status      Factor   4
legal_status      Factor   3
sex               Factor   4
state             Factor   9
quarter           Factor   33
quart             Factor   4
year              Integer  -
avg_nightly_pop   Integer  -

Since we will be converting our data to the state-year level, most of these variables will be dropped. However, “state” will be kept as a factor since it is a categorical variable, consistent with dataset 1.

str(youth.justice.detention.data.2017)
## 'data.frame':    42768 obs. of  9 variables:
##  $ agegrp         : Factor w/ 3 levels "10 to 17","18+",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ indig_status   : Factor w/ 4 levels "Indigenous","Non-Indigenous",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ legal_status   : Factor w/ 3 levels "Sentenced","Total",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ sex            : Factor w/ 4 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ state          : Factor w/ 9 levels "ACT","Aust","NSW",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ quarter        : Factor w/ 33 levels "Dec qtr 2008",..: 9 26 1 18 10 27 2 19 11 28 ...
##  $ quart          : Factor w/ 4 levels "Dec","Jun","Mar",..: 2 4 1 3 2 4 1 3 2 4 ...
##  $ year           : int  2008 2008 2008 2009 2009 2009 2009 2010 2010 2010 ...
##  $ avg_nightly_pop: int  202 197 196 177 166 155 175 165 170 155 ...
dim(youth.justice.detention.data.2017)
## [1] 42768     9

2.1.1 Excluding information that is not pertinent

For the variables “agegrp”, “indig_status”, “legal_status”, and “sex”, we can observe that they each contain a level called “Total”, which gives the sum of the other levels. Hence we will be subsetting the dataset on “Total”.

Before we make that decision, let us confirm that “Total” is actually the sum of the other values in a column.

# This is the R chunk for the Tidy & Manipulate Data I 

newdata <- subset(youth.justice.detention.data.2017,
                  state == 'NSW' & quart == 'Jun' & year == 2008 &
                    indig_status == 'Total' & legal_status == 'Total' & sex == 'Total')

head(newdata)

  • From the above table we can observe that ‘Total’ is the sum of all other cases in a particular column (in the above case, we experimented with ‘agegrp’). Hence we can remove all row values except for ‘Total’.

  • Since ‘agegrp’, ‘indig_status’, ‘legal_status’, and ‘sex’ will have only ‘Total’ as their row value, we will remove these columns since they do not provide us with any additional information

  • Since we finally want information at a year level, we will also be removing “quarter” and “quart” values from our dataframe

youth_justice_detention_data_filtered <- subset(youth.justice.detention.data.2017,
                  agegrp == 'Total' & indig_status == 'Total' & legal_status == 'Total' & sex == 'Total',
                  select = c("state", "year", "avg_nightly_pop"))

head(youth_justice_detention_data_filtered,3)

2.2 Tidy & Manipulate Data

2.2.1 Grouping data together at yearly level

Since the data above is at a quarterly level, we will group it together to get it to a yearly and state level.

youth_justice_detention_data_grouped <- aggregate(youth_justice_detention_data_filtered$avg_nightly_pop, by = list(State = youth_justice_detention_data_filtered$state, Year = youth_justice_detention_data_filtered$year), FUN = sum)

youth_justice_detention_data_grouped <- youth_justice_detention_data_grouped %>% rename(avg_nightly_population = x)

head(youth_justice_detention_data_grouped,3)
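The same yearly aggregation could be written with dplyr (assuming dplyr >= 1.0.0 for the .groups argument); a sketch, keeping the assumption that summing the quarterly averages is the intended yearly measure:

## Sketch only: dplyr equivalent of the aggregate() call above
youth_grouped_sketch <- youth_justice_detention_data_filtered %>%
    group_by(State = state, Year = year) %>%
    summarise(avg_nightly_population = sum(avg_nightly_pop), .groups = "drop")

head(youth_grouped_sketch, 3)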

We will convert the “State” variable from “factor” to “character” so that changes can easily be made to the values in the column.

youth_justice_detention_data_grouped <- youth_justice_detention_data_grouped%>%mutate_if(is.factor, as.character)

str(youth_justice_detention_data_grouped)
## 'data.frame':    81 obs. of  3 variables:
##  $ State                 : chr  "ACT" "Aust" "NSW" "NT" ...
##  $ Year                  : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2009 ...
##  $ avg_nightly_population: int  45 2932 1285 78 361 195 87 410 470 49 ...

2.2.2 Creating variables that are consistent across both datasets

From the above table, we can observe the following:

  • “State”: Currently the values present in this column are abbreviations. We will convert these abbreviations to the full names of the states (a condensed lookup-based alternative is sketched after the recoding code below)

  • “Year”: We also observe that the values present in this column are in a different format from Dataset 1 (Personal Income in Australia), e.g. “Year” in Dataset 1 displays financial years such as “2011-12”, whereas “Year” in Dataset 2 displays calendar years such as “2011”. Hence we will convert the data in Dataset 2 to the Dataset 1 format to keep it consistent

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "NSW"] <- "New South Wales"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Vic"] <- "Victoria"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Qld"] <- "Queensland"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "SA"] <- "South Australia"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Tas"] <- "Tasmania"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "WA"] <- "Western Australia"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "NT"] <- "Northern Territory"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "ACT"] <- "Australian Capital Territory"

youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Aust"] <- "Australia"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2008"] <- "2008-09"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2009"] <- "2009-10"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2010"] <- "2010-11"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2011"] <- "2011-12"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2012"] <- "2012-13"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2013"] <- "2013-14"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2014"] <- "2014-15"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2015"] <- "2015-16"

youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2016"] <- "2016-17"

head(youth_justice_detention_data_grouped,3)
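As mentioned above, the block of repeated replacements could be condensed with a named lookup vector. The sketch below illustrates the pattern on a small example vector (the data frame has already been recoded at this point), along with a programmatic way to build the financial-year labels.

## Sketch only: recoding via a named lookup vector instead of repeated replacements
state_lookup <- c(NSW = "New South Wales", Vic = "Victoria", Qld = "Queensland",
                  SA = "South Australia", Tas = "Tasmania", WA = "Western Australia",
                  NT = "Northern Territory", ACT = "Australian Capital Territory",
                  Aust = "Australia")
unname(state_lookup[c("NSW", "Vic", "ACT")])

## Financial-year labels built programmatically rather than typed out one by one
paste0(2011:2016, "-", substr(2012:2017, 3, 4))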

2.2.3 Removing values that are not required

The following values will be removed from our dataframe since they are not required (a single-step alternative is sketched after the code below):

  • “Australia - State”: We will be removing “Australia” from the “State” column since we are dealing with only states

  • “Year”: We will be removing years from “2008-09” to “2010-11”, since we are dealing with years from “2011-12” onwards

youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped, !(youth_justice_detention_data_grouped$State == 'Australia'), select = c("State","Year","avg_nightly_population"))

youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped_filtered, !(youth_justice_detention_data_grouped_filtered$Year == '2008-09'), select = c("State","Year","avg_nightly_population"))

youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped_filtered, !(youth_justice_detention_data_grouped_filtered$Year == '2009-10'), select = c("State","Year","avg_nightly_population"))

youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped_filtered, !(youth_justice_detention_data_grouped_filtered$Year == '2010-11'), select = c("State","Year","avg_nightly_population"))

head(youth_justice_detention_data_grouped_filtered,3)
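As noted above, the same exclusions could be done in a single step; a sketch using %in% on the years:

## Sketch only: the removals above in one subset() call
youth_filtered_sketch <- subset(youth_justice_detention_data_grouped,
                                State != "Australia" & !(Year %in% c("2008-09", "2009-10", "2010-11")),
                                select = c("State", "Year", "avg_nightly_population"))

head(youth_filtered_sketch, 3)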

2.2.4 Converting “character” to “factor”

We will convert “State” to a factor since it is a categorical variable, keeping it consistent with the first dataset.

## Converting States to "Factor"

youth_justice_detention_data_grouped_filtered$State <- as.factor(youth_justice_detention_data_grouped_filtered$State)

str(youth_justice_detention_data_grouped_filtered)
## 'data.frame':    48 obs. of  3 variables:
##  $ State                 : Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ Year                  : chr  "2011-12" "2011-12" "2011-12" "2011-12" ...
##  $ avg_nightly_population: int  85 1474 153 517 246 92 724 743 81 1333 ...

2.3 Merging Dataset 1 (Personal Income in Australia) and Dataset 2 (Youth Detention Centres in Australia) together

Now that both datasets are in the same format, we can merge them together.

## Merging both the datasets together

Australia_Earners_Justice <- inner_join(x = Personal_Income_in_Australia_Earners_merged, y = youth_justice_detention_data_grouped_filtered, by = c("State","Year"))

head(Australia_Earners_Justice,3)

3 Scan

We will now scan our dataframe for missing values, special values, and obvious errors.

3.1 Scanning for missing values

Since we have used .csv and .xlsx files to import the data, missing values are represented as NA for both numeric and character columns. Therefore, in this case, we can use is.na() to count the missing values present in our dataframes. However, we need to be careful if we import data from other software: depending on the software, missing values can also be represented as ‘99’, ‘.’, or ‘..’, and we may need different ways to identify them.

Since is.na() works with dataframes, we will use this function to flag the NAs in our dataframe, wrapping it in sum() so that we can output the total number of missing values present in our dataset.

## Scanning for missing values

sum(is.na(Australia_Earners_Justice))
## [1] 0
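A per-column breakdown is also useful to see where any missing values would sit; a short sketch:

## Sketch only: count of missing values per column (all 0 here)
colSums(is.na(Australia_Earners_Justice))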

3.2 Scanning for special values

Here we will use the sapply function, which applies a function to each column of our dataframe. Again, we wrap the “is.specialorNA” function in sum() to count the total number of special values present in each column.

## Scanning for special values

is.specialorNA <- function(x){
  ## For numeric columns, flag Inf, -Inf, NaN, and NA values
  ## Non-numeric columns return NULL, which sum() treats as 0
  if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

sapply(Australia_Earners_Justice, function(x) sum(is.specialorNA(x)))
##                  State                   Year      Number_of_earners 
##                      0                      0                      0 
##     Total_Money_Earned      Per_Capita_Income avg_nightly_population 
##                      0                      0                      0

There are no special or missing values in our dataframe.

3.3 Scanning for obvious errors

An obvious error occurs when a data record contains a value that cannot occur in real life. For this analysis, we have defined four rules that cannot be violated:

  • Number of earners within each state cannot be negative

  • Total amount earned by each state cannot be negative

  • Per capita income within each state cannot be negative

  • Average nightly population of a detention centre has to be greater than 0

If any of these rules is violated, the count of violations will be greater than 0, and we can then take corrective action (if required).

## Scanning for any obvious errors

(Rule1 <- editset(c("Number_of_earners >= 0", "Total_Money_Earned >= 0", "Per_Capita_Income >= 0", "avg_nightly_population > 0")))
## 
## Edit set:
## num1 : 0 <= Number_of_earners
## num2 : 0 <= Total_Money_Earned
## num3 : 0 <= Per_Capita_Income
## num4 : 0 < avg_nightly_population
Violated <- violatedEdits(Rule1, Australia_Earners_Justice)

# total count of violated rule checks (0 means no violations)

sum(Violated, na.rm = TRUE)
## [1] 0
summary(Violated)
## No violations detected, 0 checks evaluated to NA
## NULL

There are no rules that are violated

4 Scan for Outliers

4.1 Z-Score Methodology

We will first try to identify outliers with the z-score methodology. However, the key assumption of the z-score methodology is that the underlying data is normally distributed, so we will first plot a histogram of the data and observe its shape.

## Scanning for outliers using "z-score" 

hist(Australia_Earners_Justice$Number_of_earners)

We can see from the above chart that the data does not follow a normal distribution, which is a key assumption of the z-score methodology, so we will not use z-scores to identify outliers. Instead we will use box plots.
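For completeness, the sketch below shows how the z-score check would look if the data were approximately normal; observations with |z| greater than 3 would be flagged as potential outliers.

## Sketch only: z-score check (appropriate only for roughly normal data)
z_earners <- scale(Australia_Earners_Justice$Number_of_earners)
which(abs(z_earners) > 3)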

4.2 Boxplot Methodology

## Scanning for outliers

Australia_Earners_Justice$Number_of_earners %>% boxplot(main="Box Plot of Number of Earners", ylab="Earners", col = "pink")

Australia_Earners_Justice$Total_Money_Earned %>% boxplot(main="Box Plot of Amount of Money Earned", ylab="Money", col = "dark blue")

Australia_Earners_Justice$Per_Capita_Income %>% boxplot(main="Box Plot of Per Capita Income", ylab="Income", col = "dark green")

Australia_Earners_Justice$avg_nightly_population %>% boxplot(main="Box Plot of Average Nightly Population", ylab="Population", col = "dark orange")

From the above box plots we can see that there are no outliers.
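The visual reading can also be confirmed programmatically; boxplot.stats() reports, in its $out component, the points lying beyond the whiskers (Tukey's 1.5 x IQR rule). A short sketch:

## Sketch only: number of box-plot outliers per variable (expected to be 0)
sapply(Australia_Earners_Justice[c("Number_of_earners", "Total_Money_Earned",
                                   "Per_Capita_Income", "avg_nightly_population")],
       function(x) length(boxplot.stats(x)$out))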

5 Transform

From the distributions observed above, none of our numerical attributes follow a normal distribution. Three out of four variables are right-skewed, while the last variable shows signs of a bi-modal distribution. We will try out various transformation techniques to decrease the skewness and bring the distribution of one of our variables (“Number of Earners”) closer to normal.

5.1 Log transformation

## Checking for variable 1: Number of Earners
log_number_of_earners <- log10(Australia_Earners_Justice$Number_of_earners)
hist(log_number_of_earners)

5.2 ln transformation

ln_number_of_earners <- log(Australia_Earners_Justice$Number_of_earners)
hist(ln_number_of_earners)

5.3 Reciprocal transformation

rec_number_of_earners <- 1/Australia_Earners_Justice$Number_of_earners
hist(rec_number_of_earners)

5.4 Square root transformation

sqrt_number_of_earners <- sqrt(Australia_Earners_Justice$Number_of_earners)
hist(sqrt_number_of_earners)

5.5 Cube root transformation

cbrt_number_of_earners <- Australia_Earners_Justice$Number_of_earners^(1/3)
hist(cbrt_number_of_earners)

5.6 Box Cox transformation

boxcox_number_of_earners <- BoxCox(Australia_Earners_Justice$Number_of_earners, lambda = 'auto')
hist(boxcox_number_of_earners)

Having compared multiple transformations, we choose the square root transformation for further analysis since it brings “Number of Earners” closest to a normal distribution.
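To back up this choice, the sample skewness of each transformed version could be compared; the sketch below uses a simple hand-rolled moment-based skewness (a hypothetical helper, since no skewness function is loaded in this report). Values closer to 0 indicate a more symmetric distribution.

## Sketch only: simple moment-based skewness to compare the transformations
skewness_simple <- function(x) mean((x - mean(x))^3) / sd(x)^3

sapply(list(original = Australia_Earners_Justice$Number_of_earners,
            log10 = log_number_of_earners,
            sqrt = sqrt_number_of_earners,
            cube_root = cbrt_number_of_earners,
            box_cox = boxcox_number_of_earners),
       skewness_simple)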