MATH2349 Data Wrangling

Required packages

library(dplyr)
library(forecast)
library(ggplot2)
library(magrittr)
library(knitr)
library(readxl)
library(tidyr)
library(tidyverse)

Executive Summary

The crime rate in any given period is a contentious political issue. In order to explore this issue, data was gathered and preprocessed in order to create a dataset for further analysis.
Population, crime, and wage data was sourced from the Australian Bureau of Statistics website. The data was downloaded in xlsx format and saved to a local drive. From there it was imported into R using the read_excel() function from the readxl() package.
The datasets were then tidied in order to remove empty/superfluous observations. The population and earnings datasets were manipulated to combine records by year. Using tidyverse functions, the datasets were tidied according to the “Tidy Data Principles” proposed by Hadley Wickham.
After tidying, the datasets were characterised using the str() function to understand the variable types. Appropriate data type conversions were carried out using a range of functions. The datasets were then merged using the left_join() function in order to create a single dataset.
The datasets were then scanned for missing or special values. This was achieved through the colSums() function as well as a created function.
After scanning the data, a separate variable for crime rate was created using the mutate() function. Next, outliers were found through the use of bivariate boxplots with values plotted against state. Using the results of the boxplots, the 2010 observation for Australian Capital Territory was omitted. Finally, the crime rates variable was transformed in order to create a distribution for further analysis.

Data

The Average Weekly Earnings dataset was sourced from the Australian Bureau of Statistics. It was downloaded and saved as an xlxs file. The data contains 10,208 observations with 6 variables.
Reference period: Referrring to the time period the observations were recorded. Here, observations were recorded biannually in May and November and span the years May-2010-May 2020.
Earnings: The earnings variable outlines 2 levels, Average weekly cash earnings and the Standard Error of Average Weekly Cash Earnings.
Sector: The sector variable has 3 levels, Public, Private and combined Public & Private.
States and Territories: The states and territories contains 8 levels for the Australian States & Territories.
Data Item Description: Includes descriptions for observation by sex, ordinary full-time working hours, and total which includes bonus or overtime
Estimate: Contains an estimate of the gross average weekly earnings.
Further definitons and information can be found from: https://www.abs.gov.au/methodologies/average-weekly-earnings-australia-methodology/may-2020#glossary

Earnings <- read_excel("63020do016_20202.xlsx", sheet = "Table 2", skip = 5)
head(Earnings)

Population statistics were sourced from the ABS.Stat datasets. The site contains quarterly population statistics from 1981.
For the purposes of this report, only data from 2010-2019 was sourced. The downloaded data contained 3 variables organised in a table:
State: Australian states and territories
Year: The year of recording
Population: Number of persons who usually reside

Further information of definitions and descriptions can be found from: http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ERP_QUARTERLY#

Population <- read_excel("0a7e3dd4-c2d8-4a8c-a0a6-09f256a9aa37.xlsx", skip = 6)

## New names:
## * `` -> ...2

head(Population)

Crime statistics were sourced from the Australian Bureau of Statistics. The data reports on the number of crimes as reported by police. For the purposes here, the information on crime by state was used. Additionally, only the number of crimes were imported as rate values were not applicable or not reported for a lot of observations.
The argument range= was used for this.
The three variables contained in the dataset were:
State: Australian states and territories
Year: The year of recording
Offence: Te type of offence reported
Further information on the data can be found from: https://www.abs.gov.au/statistics/people/crime-and-justice/recorded-crime-victims-australia/latest-release

crime_stats <- read_excel("2. Victims of Crime, Selected offences, states and territories.xlsx", 
                          sheet = "Table 6", skip = 5, col_names = TRUE, range = "A6:K206", trim_ws = TRUE)
head(crime_stats)

Tidy & Manipulate Data I

Removing empty observations and the value for Australian population as we are only interested in the states.

Population <- Population[3:10, -2]

Removing last 3 observations as they are not relevant to the data(contain blank space and copyright mark).
Also removing the variables for rates as they are not reported for a lot of observations.

crime_stats <- crime_stats[1:200, 1:11]

The Earnings dataset has 2 levels for the earnings variable. Avg weekly cash earnings and Std. error on the average. The std. error observations were removed as they are not of relevance to this task.
The sector variable contained values for private, public and combined. Only combined values were kept as this is more useful for our purposes.
This was achieved using the filter() function.

Earnings_filter <- Earnings %>% filter(Earnings == "Average Weekly Cash Earnings", `States and territories` != "Australia",
                                       Sector == "Private & public sectors", )

As we are concerned with the yearly population statistics, quarterly populations were combined using the transmute() function.The column headings are renamed as needed.

Pop_avg <- transmute(Population, `States and Territories` = Time,
                     `2010` = (`Mar-2010`+`Jun-2010`+`Sep-2010`+`Dec-2010`)/4,
                     `2011` = (`Mar-2011`+`Jun-2011`+`Sep-2011`+`Dec-2011`)/4,
                     `2012` = (`Mar-2012`+`Jun-2012`+`Sep-2012`+`Dec-2012`)/4,
                     `2013` = (`Mar-2013`+`Jun-2013`+`Sep-2013`+`Dec-2013`)/4,
                     `2014` = (`Mar-2014`+`Jun-2014`+`Sep-2014`+`Dec-2014`)/4,
                     `2015` = (`Mar-2015`+`Jun-2015`+`Sep-2015`+`Dec-2015`)/4,
                     `2016` = (`Mar-2016`+`Jun-2016`+`Sep-2016`+`Dec-2016`)/4,
                     `2017` = (`Mar-2017`+`Jun-2017`+`Sep-2017`+`Dec-2017`)/4,
                     `2018` = (`Mar-2018`+`Jun-2018`+`Sep-2018`+`Dec-2018`)/4,
                     `2019` = (`Mar-2019`+`Jun-2019`+`Sep-2019`+`Dec-2019`)/4,
)

According to tidy data principles, variables must have individual columns, observations must be strored in rows and each value must be stored in its own cell. The crime and population datasets violate these conditions as the year values are stored as column names.
This can be corrected using the gather() function. We specify the set of columns representing values, use the key = argument for the variable name and finally the value = argument to specify the values spread across cells.

Pop_avg <- Pop_avg %>% gather(`2010`, `2011`,`2012`,`2013`,`2014`,`2015`,`2016`,`2017`,`2018`,`2019`, key = 'year', value = "population")

Before applying the gather() function on the crime dataset, additional formatiing is needed.
Using view(), we can see that there are rows containing NAs that were imported from the excel sheet. These are removed using the filter() and !is.na functions.

crime_stats1 <- crime_stats %>% filter(!is.na(Offence)|!is.na(`2010`))

The crime stats now contains data for crime by state, however as the state name was imported as a value, a new variable for state needs to be added.
Firstly, a new vector was created with the names of the states as values. This was then appended to the crime_stats data frame using the cbind() function.
In order to ensure the states were assigned to their respective statistics, the crimestats dataframe was inspected and the order of the states was used in creating the State vector.

State <- rep(c("New South Wales", "Victoria", "Queensland", "South Australia", "Western Australia", "Tasmania", "Northern Territory", "Australian Capital Territory"), each = 17)
crime_stats1 <- cbind(crime_stats1, State)

Now we have assigned the state names, we can remove the states stored as values in the 2010 column. In order to do this, the filter() function is used. As the state names stored in 2010 have corresponding NA values, we can use the !na argument.

crime_stats_filtered <- crime_stats1 %>% filter(!is.na(`2011`))
head(crime_stats_filtered)

when viewing the crime_stats_filtered dataset, we can see that the assault cases for Victoria are not reported. It is therefore decided to omit all assault cases for all states.

crime_stats_filter1 <- crime_stats_filtered %>% filter(Offence != "Assault")
head(crime_stats_filter1)

Using the gather function in order to convert the wide format into long format where year and crime number are stored in columns.

crime_stats_gathered <- crime_stats_filter1 %>% gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`, `2019`, key = "year", value = "NoCrimes")
head(crime_stats_gathered)

Filtering the earning stats further to only include those of relevance as we wish to compare average earnings by state against crime stats.

Earnings_filter1 <- Earnings_filter %>% filter(`Data Item Description` == "Earnings; Persons; Total cash earnings;")

Creating a yearly average for the earnings stats as it is recorded biannually and we wish to compare annual figures.
Firstly we convert long format to wide format using the spread function.

Test <- Earnings_filter1 %>% spread(key = `Reference Period`, value = `Estimate`)

Next, we use the transmute function to create a new variable for year. We can also drop the variables for Earning, Sector and Data Item Description as we know all observations are of the same type.

EarningsTest1 <- Test %>% transmute(`States and territories` = `States and territories`,
                                    `2010` = (`May 2010`+`November 2010`)/2,
                                    `2011` = (`May 2011`+`November 2011`)/2,
                                    `2012` = (`May 2012`+`November 2012`)/2,
                                    `2013` = (`May 2013`+`November 2013`)/2,
                                    `2014` = (`May 2014`+`November 2014`)/2,
                                    `2015` = (`May 2015`+`November 2015`)/2,
                                    `2016` = (`May 2016`+`November 2016`)/2,
                                    `2017` = (`May 2017`+`November 2017`)/2,
                                    `2018` = (`May 2018`+`November 2018`)/2,
                                    `2019` = (`May 2019`+`November 2019`)/2
                                     )

Finally, using the gather() function we convert the newly created year values to a column.

Earnings_final <- EarningsTest1 %>% gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`, `2019`, key = "year", value = "Earnings")
head(Earnings_final)

Understand

Using the str() function, we can ascertain the variable types.

str(crime_stats_gathered)

## 'data.frame':    1200 obs. of  4 variables:
##  $ Offence : chr  "Homicide and related offences" "Murder" "Attempted murder" "Manslaughter" ...
##  $ State   : chr  "New South Wales" "New South Wales" "New South Wales" "New South Wales" ...
##  $ year    : chr  "2010" "2010" "2010" "2010" ...
##  $ NoCrimes: chr  "133" "73" "45" "11" ...

str(Pop_avg)

## tibble [80 × 3] (S3: tbl_df/tbl/data.frame)
##  $ States and Territories: chr [1:80] "New South Wales" "Victoria" "Queensland" "South Australia" ...
##  $ year                  : chr [1:80] "2010" "2010" "2010" "2010" ...
##  $ population            : num [1:80] 7153816 5470174 4412724 1628517 2298478 ...

str(Earnings_final)

## tibble [80 × 3] (S3: tbl_df/tbl/data.frame)
##  $ States and territories: chr [1:80] "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
##  $ year                  : chr [1:80] "2010" "2010" "2010" "2010" ...
##  $ Earnings              : num [1:80] 1262 1031 1096 1035 923 ...

From the structure outputs, it can be seen that the crime data frame has the NoCrimes defined as a character variable when it should be a numeric variable. The year variable is stored as a character but because we have a dataset with 10 years, we wish to store this as a factor variable. We can convert these to the correct format using the as. functions. This stores the variables as vectors in the data type indicated. These vectors are then joined using the cbind() function. Finally, the unused variables are dropped and variable names assigned using colnames().

crime_stats5 <- crime_stats_gathered$NoCrimes %>% as.numeric()
crime_stats_year <- crime_stats_gathered$year %>% as.factor()
crime_stats6 <- crime_stats_gathered %>% cbind(crime_stats5, crime_stats_year)
crime_stats7 <- crime_stats6[ ,-3:-4]
colnames(crime_stats7) <- (c("Offence", "State", "No_Crimes", "Year"))
str(crime_stats7)

## 'data.frame':    1200 obs. of  4 variables:
##  $ Offence  : chr  "Homicide and related offences" "Murder" "Attempted murder" "Manslaughter" ...
##  $ State    : chr  "New South Wales" "New South Wales" "New South Wales" "New South Wales" ...
##  $ No_Crimes: num  133 73 45 11 7269 ...
##  $ Year     : Factor w/ 10 levels "2010","2011",..: 1 1 1 1 1 1 1 1 1 1 ...

As we wish to compare the crime rate by average income by state, we can combine all crime stats into a single column. In order to achive this we must forst use the spread() function to convert the crime stats data from long to wide format in order to create individual columns for the crime type.

crime_single <- crime_stats7 %>% spread(key = Offence, value = No_Crimes)
head(crime_single)

We then use the transmute function to create 3 variables: State, Year and Total_crimes.

crime_final <- crime_single %>% transmute("State" = State,
                     "Year" = Year,
                     "Total_crimes" = (`Armed robbery` + `Attempted murder`+`Blackmail/extortion`+`Homicide and related offences`+`Involving the taking of property`+`Kidnapping/abduction`+`Manslaughter`+`Motor vehicle theft`+`Murder`+`Other`+`Other theft`+`Robbery`+`Sexual assault`+`Unarmed robbery`+`Unlawful entry with intent`))
head(crime_final)

The population and earnings dataframes both have their year and states variables defined as characters. We wish to define these as factors therefore we use the mutate() function from the dplyr package. Using the function, any variables defined as characters are converted to factors.

Earnings_final1 <- Earnings_final %>% mutate(across(where(is.character), as.factor))
str(Earnings_final1)

## tibble [80 × 3] (S3: tbl_df/tbl/data.frame)
##  $ States and territories: Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ year                  : Factor w/ 10 levels "2010","2011",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ Earnings              : num [1:80] 1262 1031 1096 1035 923 ...

Using the same methodology for the population dataframe.

Pop_avg1 <- Pop_avg %>% mutate(across(where(is.character), as.factor))
str(Pop_avg1)

## tibble [80 × 3] (S3: tbl_df/tbl/data.frame)
##  $ States and Territories: Factor w/ 8 levels "Australian Capital Territory",..: 2 7 4 5 8 6 3 1 2 7 ...
##  $ year                  : Factor w/ 10 levels "2010","2011",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ population            : num [1:80] 7153816 5470174 4412724 1628517 2298478 ...

Finally we can merge the three datsets. This is achieved using the left_join() function. This allows us to combine the datasets on their common observations or “keys”. In this case the common observations are State and Year. This is specified using the " " = " " argument.

Combined <- crime_final %>% left_join(Earnings_final1, c("Year" = "year", "State" = "States and territories")) %>% left_join(Pop_avg1, c("Year" = "year", "State" = "States and Territories"))
head(Combined)

Scan I

Searching for missing values in each dataframe using the colsums() and is.na arguments().

colSums(is.na(Combined))

##        State         Year Total_crimes     Earnings   population 
##            0            0            0            0            0

As the dataframe does not contain missing values, we then check the numeric variables for special values. This is achieved by firstly writing a function.

is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

Next we apply this function to the dataframe using the sapply() function. Sapply allows us to efficiently extract the number of special values by column.

sapply(Combined, function(x) sum(is.na(x)))

##        State         Year Total_crimes     Earnings   population 
##            0            0            0            0            0

From the results, we observe that there are no special values.

Tidy & Manipulate Data II

As there are no special or missing values, we create a new variable for crime rate.
As the states have different populations, comparing total number of crimes would produce a biased output. For this reason, we will create a new variable which is crime per capita.
We use the mutate() function in order to create a new column. Mutate allows for addition of new columns while preserving existing columns. The new column is defined as the total crimes per 100 persons. This object is assigned as Combined_mutated.

Combined_mutated <- Combined %>% mutate(Crime_percapita = (Total_crimes/population)*100)
head(Combined_mutated)

Scan II

In order to search for outliers, we can use the boxplot and plot functions. These particular methods are selected in this instance because we have a small set of numerical variables. As we are testing a factor variable(State) by a numerical variable, we use the bivariate boxplot to detect outliers. Outliers are defined as those values above and below 1.5x IQR fence.
(Note that state names are abbreviated in order to improve the layout)

boxplot(Combined$population ~ Combined$State, names = c("ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA"), main = "Population by State", ylab = "Population", xlab = "State", col = "grey")

From the results for the population by state plot, we can see there are no outliers in the data.

boxplot(Combined$Total_crimes ~ Combined$State, names = c("ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA"), main = "Number of Crimes by State", ylab = "No. of Crimes", xlab = "State", col = "grey")

From the results for the crimes by state plot, we can see there are outliers for ACT, VIC and WA. As we are unable to ascertain the cause of these outliers, they are included in the dataset as they may be of significance in analysing results.

boxplot(Combined_mutated$Crime_percapita ~ Combined_mutated$State, names = c("ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA"), main = "Crimerate per 100 persons by State", ylab = "Crimerate per 100 persons", xlab = "State", col = "grey")

boxplot(Combined$Earnings ~ Combined$State, names = c("ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA"), main = "Average Weekly Earnings by State", ylab = "Average Weekly Earnings", xlab = "State", col = "grey")

From the results of the earnings by state plot we can see there is one outlier for ACT. On closer inspection of the data, we can see that this corresponds to the 2010 observation. As the 2010 observation for crimerate by state for ACT was also an outlier, this observation was removed from the dataset.

Combined_mutated1 <- Combined_mutated %>% filter(!(State == "Australian Capital Territory" & Year == 2010))
head(Combined_mutated1)

Transform

When viewing the plot of the crime rates variable, we observe a slightly bimodal distribution.

hist(Combined_mutated1$Crime_percapita)

In order to normalise this data we can use the abs() function.

transformed <- abs(Combined_mutated1$Crime_percapita - mean(Combined_mutated1$Crime_percapita))
hist(transformed)

This creates a vector of the variable as calculated from the function.