library(readr)
library(xlsx)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readxl)
library(editrules)
library(forecast)
In this assignment, we have used two datasets (Personal Income in Australia and Youth Detention centers in Australia) to do our analysis. We have done few of the pre-processing steps for dataset 1 and dataset 2 separately while some of the pre-processing steps have been done on the after merging dataset 1 and 2 together. Following are the steps followed:
Dataset 1 - Personal Income in Australia
Dataset 2 - Youth Detention Centers in Australia
Merged Dataset - Personal Income in Australia and Youth Detention Centers in Australia
Data Source - https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/personal-income-australia/latest-release
Description of the dataset - This dataset provides us with year over year details of earners in each state between the period 2011-2017. The details of the earners provided are:
“Earners” (persons): This variable provides us with the the total count of earners present in a state at any given time period between 2011-2017
“Median age of earners”: This variable provides us with an estimatation of what the median age of the earning population is within a state for time period between 2011-2017
“Sum($)”: This variable provides us with the total money made by the earners within each state for the time period between 2011-2017
“Median($)”: This variable provides us with the median amount of money made by earners within each state for the time period between 2011-2017
“Mean($)”: This variable provides us with the average amount of money made by earners within each state for the time period between 2011-2017
Although, our above dataset consists of 5 variables in the wide format, we will only be considering 2 of them for the sake of our analysis. The 2 variables that we will be considering are “Earners” and “Sum”
Since the column headers under both “Earners” and “Sum” are same (years), we will be loading them seprately, tidying them, and then merging them together
Personal_Income_in_Australia_Earners <- read_excel("/Applications/Documents/Study/RMIT Learning/Second Semester/Data Wrangling/02 Assignment/03 Data Sources/Personal Income in Australia.xls",
sheet = "Table 1.1", col_types = c("text",
"skip", "text", "text", "text", "text",
"text", "text", "skip", "skip", "skip",
"skip", "skip", "skip", "skip", "skip",
"skip", "skip", "skip", "skip", "skip",
"skip", "skip", "skip", "skip", "skip",
"skip", "skip", "skip", "skip", "skip",
"skip"), skip = 6)
head(Personal_Income_in_Australia_Earners,3)
We will only be selecting the states from the dataframe and hence we will be subsetting this dataset
Personal_Income_in_Australia_Earners_filtered <- Personal_Income_in_Australia_Earners[c(2,5,8,11,14,17,20,23),]
head(Personal_Income_in_Australia_Earners_filtered,3)
Personal_Income_in_Australia_Sum <- read_excel("/Applications/Documents/Study/RMIT Learning/Second Semester/Data Wrangling/02 Assignment/03 Data Sources/Personal Income in Australia.xls",
sheet = "Table 1.1", col_types = c("text",
"skip", "skip", "skip", "skip", "skip",
"skip", "skip", "skip", "skip", "skip",
"skip", "skip", "skip", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "skip", "skip",
"skip", "skip", "skip", "skip", "skip",
"skip", "skip", "skip", "skip", "skip"),
skip = 6)
head(Personal_Income_in_Australia_Sum,3)
Similar to the above step, we will be subsetting the dataset to include only the states
Personal_Income_in_Australia_Sum_filtered <- Personal_Income_in_Australia_Sum[c(2,5,8,11,14,17,20,23),]
head(Personal_Income_in_Australia_Sum_filtered,3)
As we can see, our data is not in the tidy format we would want it to be.
How is it a untidy data?
All the years are spread in the wide format when they could be clubbed under one variable “Years”
For our analysis, we should have the following columns for our data to follow the tidy principles: “States”, “Year”, “Earners”, “Sum($)”
Let us convert our dataset to a tidy set!
To convert our datasets (Earners and Total Income) into tidy dataset, we will convert their “Years” columns from a wide format to a long format and hence bring them below one column “Years”. This way, each variable (2011-12, 2012-13, etc.) will have its own column.
Personal_Income_in_Australia_Earners_filtered <- Personal_Income_in_Australia_Earners_filtered%>% gather(`2011-12`, `2012-13`, `2013-14`, `2014-15`, `2015-16`, `2016-17`, key = "Years", value = "Number_of_earners")
Personal_Income_in_Australia_Sum_filtered <- Personal_Income_in_Australia_Sum_filtered%>% gather(`2011-12`, `2012-13`, `2013-14`, `2014-15`, `2015-16`, `2016-17`, key = "Years", value = "Total_Money_Earned")
head(Personal_Income_in_Australia_Earners_filtered,3)
head(Personal_Income_in_Australia_Sum_filtered,3)
We will now join these two datasets to merge into one
Personal_Income_in_Australia_Earners_merged <- inner_join(Personal_Income_in_Australia_Earners_filtered, Personal_Income_in_Australia_Sum_filtered, by = c("GCCSA","Years"))
head(Personal_Income_in_Australia_Earners_merged,3)
str(Personal_Income_in_Australia_Earners_merged)
## tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
## $ GCCSA : chr [1:48] "New South Wales" "Victoria" "Queensland" "South Australia" ...
## $ Years : chr [1:48] "2011-12" "2011-12" "2011-12" "2011-12" ...
## $ Number_of_earners : chr [1:48] "3824011" "3035381" "2448420" "880884" ...
## $ Total_Money_Earned: num [1:48] 2.18e+11 1.64e+11 1.31e+11 4.48e+10 8.55e+10 ...
From the above table, we can observe that there are few transformations in the type of columns that need to take place:
“GCCSA” currently has each state has “character” but this will be converted to “factor” since they represent categorical data. Here will not need to factor it since they do not contain any levels between them
“GCCSA” will be renamed to “State” to make it clearer
“Number_of_earners” is currently character but since they represent the total number of earners in each state, we will be converting them to integer
“Years” will be renamed to “Year” to make ti clearer
## Converting "GCCSA" to factor
Personal_Income_in_Australia_Earners_merged$GCCSA <- as.factor(Personal_Income_in_Australia_Earners_merged$GCCSA)
## Renaming "GCCSA" to "State"
Personal_Income_in_Australia_Earners_merged <- Personal_Income_in_Australia_Earners_merged %>% rename(State = GCCSA)
## Converting "Number_of_earners" to integer
Personal_Income_in_Australia_Earners_merged$Number_of_earners <- as.numeric(Personal_Income_in_Australia_Earners_merged$Number_of_earners)
## Renaming Years to Year
Personal_Income_in_Australia_Earners_merged <- Personal_Income_in_Australia_Earners_merged %>% rename(Year = Years)
We have a dataframe “Personal_Income_in_Australia_Earners_merged” of dimemsions 48x5.
| Variables | Type |
|---|---|
| State | Factor |
| Year | Character |
| Number of Earners | Number |
| Total Money Earned | Number |
| Per_Capita_Income | Number |
State has been converted to factor since it is a categorical variable and is unordered. Year has been kept as a character since factoring it did not provide any additional value so we have kept the data type as it is.
## Checking the output of the conversion
str(Personal_Income_in_Australia_Earners_merged)
## tibble [48 × 4] (S3: tbl_df/tbl/data.frame)
## $ State : Factor w/ 8 levels "Australian Capital Territory",..: 2 7 4 5 8 6 3 1 2 7 ...
## $ Year : chr [1:48] "2011-12" "2011-12" "2011-12" "2011-12" ...
## $ Number_of_earners : num [1:48] 3824011 3035381 2448420 880884 1349170 ...
## $ Total_Money_Earned: num [1:48] 2.18e+11 1.64e+11 1.31e+11 4.48e+10 8.55e+10 ...
We will now be creating a new variable “Per_capita_income”. “Per_capita_income” will be calculated as the total earnings per state divided by the total number of earners in the state. This will give us a better knowledge of how each earner is earning in each state over the years.
Personal_Income_in_Australia_Earners_merged$Per_Capita_Income <- Personal_Income_in_Australia_Earners_merged$Total_Money_Earned/Personal_Income_in_Australia_Earners_merged$Number_of_earners
head(Personal_Income_in_Australia_Earners_merged,3)
Data Source - https://data.gov.au/data/dataset/youth-justice-detention-data/resource/c7edfa08-7bc9-404d-8f2b-22bcd0425021
Description of the dataset - This dataset provides us with quarterly details of average nightly population in detention centres across Australia between the period 2008-2017. The details of the earners provided are:
“agegrp” (persons): This variable provides us with the the age group of the people in detention in a state at a quarterly level for time period between 2008-2017
“indig_status”: This variable provides us with indigenous status of the population present in the detention centre within a state at a quarterly level for time period between 2008-2017
“legal_status”: This variable provides us with the legal status of the people in the detention centres within each state at a quarterly level for the time period between 2008-2017
“sex”: This variable provides us with the sex of the people present in detention centres within each state at a quarterly level for the time period between 2008-2017
“State”: This variable provides us with the state details for the time period between 2008-2017
“quarter”: This variable provides us with the quarter details in the format of Month/Year for the time period between 2008-2017
“quart”: This variable provides us with the quarter details in the format of “month” for the time period between 2008-2017
“year”: This variable provides us with the year details in the format of “year” for the time period between 2008-2017
“avg_nightly_pop”: This variable provides us with the average night population in the youth detention centres within each state at a quarterly level for the time period between 2008-2017
youth.justice.detention.data.2017 <- read.csv("/Applications/Documents/Study/RMIT Learning/Second Semester/Data Wrangling/02 Assignment/03 Data Sources/youth-justice-detention-data-2017.csv")
head(youth.justice.detention.data.2017,3)
Let us have a look at the summary of the dataframe. From here, we will get a brief idea of what values are present in the dataframe and what kind of pre-processing needs to be done
We have a dataframe “youth.justice.detention.data.2017” of 42768x9 dimensions
| Variables | Type | Levels |
|---|---|---|
| agegrp | Factor | 3 |
| indig_status | Factor | 4 |
| legal_status | Factor | 3 |
| sex | Factor | 4 |
| state | Factor | 9 |
| quarter | Factor | 33 |
| quart | Factor | 4 |
| year | integer | - |
| avg_night_population | integer | - |
Since we will be converting our data to our State-Year level, most of the variables will be dropping off. However, State will be kept as a factor since it is a categorical variable and will be consistent with dataset 1.
str(youth.justice.detention.data.2017)
## 'data.frame': 42768 obs. of 9 variables:
## $ agegrp : Factor w/ 3 levels "10 to 17","18+",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ indig_status : Factor w/ 4 levels "Indigenous","Non-Indigenous",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ legal_status : Factor w/ 3 levels "Sentenced","Total",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ sex : Factor w/ 4 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ state : Factor w/ 9 levels "ACT","Aust","NSW",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ quarter : Factor w/ 33 levels "Dec qtr 2008",..: 9 26 1 18 10 27 2 19 11 28 ...
## $ quart : Factor w/ 4 levels "Dec","Jun","Mar",..: 2 4 1 3 2 4 1 3 2 4 ...
## $ year : int 2008 2008 2008 2009 2009 2009 2009 2010 2010 2010 ...
## $ avg_nightly_pop: int 202 197 196 177 166 155 175 165 170 155 ...
dim(youth.justice.detention.data.2017)
## [1] 42768 9
For variables “agegrp”, “indig_status”, “legal_status”, and “sex”, we can observe that they have a variable called “Total” which provides us with the sum of all the other variables present in the dataframe. Hence we will be subsetting the dataset on “total”
Before we make that decision, let us confirm if “total” is actually the sum of other variables in a column
# This is the R chunk for the Tidy & Manipulate Data I
newdata <- subset(youth.justice.detention.data.2017, youth.justice.detention.data.2017$state == 'NSW' & youth.justice.detention.data.2017$quart == 'Jun' & youth.justice.detention.data.2017$indig_status == 'Total' & youth.justice.detention.data.2017$legal_status == 'Total' & youth.justice.detention.data.2017$sex=='Total' & youth.justice.detention.data.2017$state=='NSW' & youth.justice.detention.data.2017$year == 2008, select = c("agegrp", "indig_status","legal_status","sex","state","quarter","quart","year","avg_nightly_pop"))
head(newdata)
From the above table we can observe that ‘Total’ is the sum of all other cases in a particular column (in the above case, we experimented with ‘agegrp’). Hence we can remove all row values except for ‘Total’.
Since ‘agegrp, ’indig_status, ’legal_status’, and ‘sex’ will have only ‘Total’ as their row values, we will be removing these columns since they do not provide us with any additional information
Since we finally want information at a year level, we will also be removing “quarter” and “quart” values from our dataframe
youth_justice_detention_data_filtered <- subset(youth.justice.detention.data.2017, youth.justice.detention.data.2017$indig_status == 'Total' & youth.justice.detention.data.2017$legal_status == 'Total' & youth.justice.detention.data.2017$sex=='Total' & youth.justice.detention.data.2017$agegrp == 'Total', select = c("state","year","avg_nightly_pop"))
head(youth_justice_detention_data_filtered,3)
Since the data we have above is at a quarter level, we will be grouping them together to get it at a yearly and state level
youth_justice_detention_data_grouped <- aggregate(youth_justice_detention_data_filtered$avg_nightly_pop, by = list(State = youth_justice_detention_data_filtered$state, Year = youth_justice_detention_data_filtered$year), FUN = sum)
youth_justice_detention_data_grouped <- youth_justice_detention_data_grouped %>% rename(avg_nightly_population = x)
head(youth_justice_detention_data_grouped,3)
Converting “State” variable from “factor” to “character” so that changes can be made easily to the values present in the column
youth_justice_detention_data_grouped <- youth_justice_detention_data_grouped%>%mutate_if(is.factor, as.character)
str(youth_justice_detention_data_grouped)
## 'data.frame': 81 obs. of 3 variables:
## $ State : chr "ACT" "Aust" "NSW" "NT" ...
## $ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2009 ...
## $ avg_nightly_population: int 45 2932 1285 78 361 195 87 410 470 49 ...
From the above table, we can observe the following:
“State”: Currently the variables present in this column are short forms. We will be converting these shortforms to the actual names of the states
“Year”: We also observe that the values present in this column are a bit different from the format we have in Dataset 1 (Personal Income in Australia). e.g. “Year” in Dataset 1 displays “2008-09” whereas “Year” in Dataset 2 displays “2008”. Hence we will convert data in Dataset 2 to Dataset 1 format to keep it consistent
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "NSW"] <- "New South Wales"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Vic"] <- "Victoria"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Qld"] <- "Queensland"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "SA"] <- "South Australia"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Tas"] <- "Tasmania"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "WA"] <- "Western Australia"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "NT"] <- "Northern Territory"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "ACT"] <- "Australian Capital Territory"
youth_justice_detention_data_grouped$State[youth_justice_detention_data_grouped$State == "Aust"] <- "Australia"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2008"] <- "2008-09"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2009"] <- "2009-10"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2010"] <- "2010-11"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2011"] <- "2011-12"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2012"] <- "2012-13"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2013"] <- "2013-14"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2014"] <- "2014-15"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2015"] <- "2015-16"
youth_justice_detention_data_grouped$Year[youth_justice_detention_data_grouped$Year == "2016"] <- "2016-17"
head(youth_justice_detention_data_grouped,3)
The following values will be removed from our dataframe since we they are not required:
“Australia - State”: We will be removing “Australia” from the “State” column since we are dealing with only states
“Year”: We will be removing years from “2008-09” to “2010-11”, since we are dealing with years from “2011-12” onwards
youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped, !(youth_justice_detention_data_grouped$State == 'Australia'), select = c("State","Year","avg_nightly_population"))
youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped_filtered, !(youth_justice_detention_data_grouped_filtered$Year == '2008-09'), select = c("State","Year","avg_nightly_population"))
youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped_filtered, !(youth_justice_detention_data_grouped_filtered$Year == '2009-10'), select = c("State","Year","avg_nightly_population"))
youth_justice_detention_data_grouped_filtered <- subset(youth_justice_detention_data_grouped_filtered, !(youth_justice_detention_data_grouped_filtered$Year == '2010-11'), select = c("State","Year","avg_nightly_population"))
head(youth_justice_detention_data_grouped_filtered,3)
We will be converting “States” to “Factor” since they are categorical variables and will be consistent with the first dataset
## Converting States to "Factor"
youth_justice_detention_data_grouped_filtered$State <- as.factor(youth_justice_detention_data_grouped_filtered$State)
str(youth_justice_detention_data_grouped_filtered)
## 'data.frame': 48 obs. of 3 variables:
## $ State : Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
## $ Year : chr "2011-12" "2011-12" "2011-12" "2011-12" ...
## $ avg_nightly_population: int 85 1474 153 517 246 92 724 743 81 1333 ...
Now that we have got both the datasets in the same format, we can merge them together.
## Merging both the datasets together
Australia_Earners_Justice <- inner_join(x = Personal_Income_in_Australia_Earners_merged, y = youth_justice_detention_data_grouped_filtered, by = c("State","Year"))
head(Australia_Earners_Justice,3)
We will now be scanning our datafraame for missing values, special values, and obvious errors
Since we have used .csv and .xslx file to import data, the null values are represented as NA for integer reference and
Since is.na() works with dataframes, we will be using this function to count the total number of NAs in our dataframe. We will be adding a sum function outside it, so that we can out the total number of missing values present in our dataset.
## Scanning for missing values
sum(is.na(Australia_Earners_Justice))
## [1] 0
Here we will be using the sapply function that will allow us to call on a function and apply it on our dataframe. Here again we have used the sum function outside the “is.specialorNA” function to count the total of special values present in our dataframe
## Scanning for special values
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}
sapply(Australia_Earners_Justice, function(x) sum(is.specialorNA(x)))
## State Year Number_of_earners
## 0 0 0
## Total_Money_Earned Per_Capita_Income avg_nightly_population
## 0 0 0
There are no missing values in our dataframe
An obvious error occurs when a data record contains a value that can not be corresponded to in real life. For this analysis, we have added four rules here that can not be violated in real life -
Number of Earners within each state has to be greater than 0
Total amount earned by each state has to be greater than 0
Per capita income within each state has to be greater than 0
Average nightly population of a detention centre can not be a negative value
Violating any of these rules, we will lead to the count of violations being greater than 0 and then we would be able to corrective actions on it (if required)
## Scanning for any obvious errors
(Rule1 <- editset(c("Number_of_earners >= 0", "Total_Money_Earned >= 0", "Per_Capita_Income >= 0", "avg_nightly_population > 0")))
##
## Edit set:
## num1 : 0 <= Number_of_earners
## num2 : 0 <= Total_Money_Earned
## num3 : 0 <= Per_Capita_Income
## num4 : 0 < avg_nightly_population
Violated <- violatedEdits(Rule1, Australia_Earners_Justice)
# summary of violated rules
sum(isTRUE(Violated))
## [1] 0
summary(Violated)
## No violations detected, 0 checks evaluated to NA
## NULL
There are no rules that are violated
We will first try to remove outliers with the z-score methodology. However, the key assumption of z-score methodology is that the underlying data is normally distributed. We will first plot the histogram of the data and observe.
## Scanning for outliers using "z-score"
hist(Australia_Earners_Justice$Number_of_earners)
We can see from the above chart that it does not have a normal distribution which is a key assumption in the “z-score” methodology and hence we will not be using “z-score” methodology for identifying outliers. Instead we will be using box plot to identify outliers.
## Scanning for outliers
Australia_Earners_Justice$Number_of_earners %>% boxplot(main="Box Plot of Number of Earners", ylab="Earners", col = "pink")
Australia_Earners_Justice$Total_Money_Earned %>% boxplot(main="Box Plot of Amount of Money Earned", ylab="Money", col = "dark blue")
Australia_Earners_Justice$Per_Capita_Income %>% boxplot(main="Box Plot of Per Capita Income", ylab="Income", col = "dark green")
Australia_Earners_Justice$avg_nightly_population %>% boxplot(main="Box Plot of Average Nightly Population", ylab="Population", col = "dark orange")
From the above Box plot we can see that there are no outliers.
In the histograms above, we observe that the none of our numerical attributes have a normal distribution. We also observe three out four variables have right skewdness. The last variable is showing attributes of a bi-modal distribution. We will try out various transformation technique to decrease the skewness and convert the distribution to a normal distribution for one of our variables (“Number of Earners”)
## Checking for variable 1: Number of Earners
log_number_of_earners <- log10(Australia_Earners_Justice$Number_of_earners)
hist(log_number_of_earners)
ln_number_of_earners <- log(Australia_Earners_Justice$Number_of_earners)
hist(ln_number_of_earners)
rec_number_of_earners <- 1/Australia_Earners_Justice$Number_of_earners
hist(rec_number_of_earners)
sqrt_number_of_earners <- sqrt(Australia_Earners_Justice$Number_of_earners)
hist(sqrt_number_of_earners)
cbrt_number_of_earners <- Australia_Earners_Justice$Number_of_earners^1/3
hist(cbrt_number_of_earners)
boxcox_number_of_earners <- BoxCox(Australia_Earners_Justice$Number_of_earners, lambda = 'auto')
hist(boxcox_number_of_earners)
Having seen multiple transformations, we can choose square root transformations for further analysis since it is the closest to the normal distribution for “Total number of Earners”