Setup

# Load the necessary packages required to reproduce the report. For example:

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.0.5

library(magrittr)

## Warning: package 'magrittr' was built under R version 4.0.5

library(ggplot2) # Useful for creating plots

## Warning: package 'ggplot2' was built under R version 4.0.5

library(dplyr)  # Useful for data manipulation

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:kableExtra':
## 
##     group_rows

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr) # Useful for creating nice tables

## Warning: package 'knitr' was built under R version 4.0.5

library(magrittr)
library(readr)

## Warning: package 'readr' was built under R version 4.0.5

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.0.5

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(readxl)

## Warning: package 'readxl' was built under R version 4.0.5

#install.packages("forecast")
library(forecast)

## Warning: package 'forecast' was built under R version 4.0.5

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

#install.packages("editrules")
#library(editrules)

Student names, numbers and percentage of contributions

Group information
Student name	Student number	Percentage of contribution
Karen Amanda Rodrigues	s3890896	50%
Kushagra Bhatnagar	s3666420	50%

Executive Summary

This assignment is used as a medium to answer a real-life question – What percentage of Australia’s youth is studying and yet is involved in crime, and to gain insight into the impact of education on the crime statistics. According to the 2010 National Strategies for Young Australians (https://www.youthpolicy.org/factsheets/country/australia), youth is defined in general as the age group between 12-24 years. We have selected 2 categories: • Age groups of 15-19 and 20-24 years. • Years between 2016 and 2020.
The data sets chosen were: • Education and work 2020, Table 21 - referred to as Education dataset in the report • 1, Offenders Australia(1) - referred to as Crime dataset in the report Both the datasets were downloaded from the Australian Bureau of Statistics. From the Education and work dataset, the study and total population of the age groups 15-19 and 20-24 were derived. The dataset 1 i.e. Offenders Australia(1) provided the population involved in crime within the two age groups. The two datasets were comprised of heaps of data. The approach adopted is that the datasets are tidied and manipulated individually before they were merged. Upon merger, the final result is presented as the percentage of youth population in the two age groups studying and involved in crime over a period of five years between 2016 and 2020.

Data

Provide explanations here.

Two data sets are obtained from Australian Bureau of Statistic. Five data frames starting from year 2016 to 2020 are created from one Education dataset. The reason for this is that each tab in excel corresponds to a year. The five data frames store study and total population of several age groups for their respective years. One data frame is created from the Crime dataset.

The two datasets are detailed and elaborate, so they need to be trimmed in order to describe the variables. The variable description and merger of the datasets is only possible after tidying and manipulating the datasets. The Tidy and Manipulate sections will contain this information.

In this section the data sets are read using the library function readr and their first fifteen lines and structure are shown.

Since the data frames sourced from Education dataset are ed2016, ed2017, ed2018, ed2019, ed2020. They are of exact structure, so any work done on one dataframe is replicated in the rest of the five datasets. Henceforth, ed2016 dataframe would be used to display results. The dfc dataframe is sourced from ‘Table 5’ tab of excel file comprising Crime dataset.

# Import the data, provide your R codes here.

ed2016 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2016")

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

ed2017 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2017")

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

ed2018 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2018")

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

ed2019 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2019")

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

ed2020 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2020")

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

df2 <- read_excel("1. Offenders, Australia (1).xls", sheet = "Table 5")

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

head(ed2016, n = 15)

str(ed2016)

## tibble [147 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:147] "Education and Work, Australia, May 2020" "Released at 11:30 am (Canberra time) 11 November 2020" "Table 21 Current study: Persons aged 15–64 years, 2016" NA ...
##  $ ...2                           : chr [1:147] NA NA NA "15–19 years" ...
##  $ ...3                           : chr [1:147] NA NA NA "20–24 years" ...
##  $ ...4                           : chr [1:147] NA NA NA "25–29 years" ...
##  $ ...5                           : chr [1:147] NA NA NA "30–34 years" ...
##  $ ...6                           : chr [1:147] NA NA NA "35–39 years" ...
##  $ ...7                           : chr [1:147] NA NA NA "40–44 years" ...
##  $ ...8                           : chr [1:147] NA NA NA "45–49 years" ...
##  $ ...9                           : chr [1:147] NA NA NA "50–54 years" ...
##  $ ...10                          : chr [1:147] NA NA NA "55–64 years" ...
##  $ ...11                          : chr [1:147] NA NA NA "18–24 years" ...
##  $ ...12                          : chr [1:147] NA NA NA "15–24 years" ...
##  $ ...13                          : chr [1:147] NA NA NA "25–64 years" ...
##  $ ...14                          : chr [1:147] NA NA NA "15–64 years" ...

head(df2,n = 15)

str(df2)

## tibble [64 x 25] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:64] "Recorded Crime – Offenders, 2019–20" "Released at 11.30am (Canberra time) Thurs 11 February 2021" "Table 5 Offenders, Age by sex, 2008–09 to 2019–20" NA ...
##  $ ...2                           : chr [1:64] NA NA NA "Number" ...
##  $ ...3                           : chr [1:64] NA NA NA NA ...
##  $ ...4                           : chr [1:64] NA NA NA NA ...
##  $ ...5                           : chr [1:64] NA NA NA NA ...
##  $ ...6                           : chr [1:64] NA NA NA NA ...
##  $ ...7                           : chr [1:64] NA NA NA NA ...
##  $ ...8                           : chr [1:64] NA NA NA NA ...
##  $ ...9                           : chr [1:64] NA NA NA NA ...
##  $ ...10                          : chr [1:64] NA NA NA NA ...
##  $ ...11                          : chr [1:64] NA NA NA NA ...
##  $ ...12                          : chr [1:64] NA NA NA NA ...
##  $ ...13                          : chr [1:64] NA NA NA NA ...
##  $ ...14                          : chr [1:64] NA NA NA "Offender rate(a)" ...
##  $ ...15                          : chr [1:64] NA NA NA NA ...
##  $ ...16                          : chr [1:64] NA NA NA NA ...
##  $ ...17                          : chr [1:64] NA NA NA NA ...
##  $ ...18                          : chr [1:64] NA NA NA NA ...
##  $ ...19                          : chr [1:64] NA NA NA NA ...
##  $ ...20                          : chr [1:64] NA NA NA NA ...
##  $ ...21                          : chr [1:64] NA NA NA NA ...
##  $ ...22                          : chr [1:64] NA NA NA NA ...
##  $ ...23                          : chr [1:64] NA NA NA NA ...
##  $ ...24                          : chr [1:64] NA NA NA NA ...
##  $ ...25                          : chr [1:64] NA NA NA NA ...

Understand

# This is the R chunk for the Understand Section
ed2016 <-slice(ed2016, c(4,22:23,42:43))
is.na(ed2016)

##      Australian Bureau of Statistics  ...2  ...3  ...4  ...5  ...6  ...7  ...8
## [1,]                            TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##       ...9 ...10 ...11 ...12 ...13 ...14
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE

sum(is.na(ed2016))

## [1] 1

ed2016[1,1]="Gender"
str(ed2016)

## tibble [5 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:5] "Gender" "Male" "Female" "Male" ...
##  $ ...2                           : chr [1:5] "15–19 years" "623.29999999999995" "597.29999999999995" "753.5" ...
##  $ ...3                           : chr [1:5] "20–24 years" "359" "382.10000000000002" "847.20000000000005" ...
##  $ ...4                           : chr [1:5] "25–29 years" "155.80000000000001" "186.80000000000001" "876.70000000000005" ...
##  $ ...5                           : chr [1:5] "30–34 years" "101.2" "129.40000000000001" "873.5" ...
##  $ ...6                           : chr [1:5] "35–39 years" "59.700000000000003" "101.5" "793.89999999999998" ...
##  $ ...7                           : chr [1:5] "40–44 years" "47.100000000000001" "84.200000000000003" "796.89999999999998" ...
##  $ ...8                           : chr [1:5] "45–49 years" "33.799999999999997" "70.900000000000006" "785.29999999999995" ...
##  $ ...9                           : chr [1:5] "50–54 years" "24.600000000000001" "48.700000000000003" "753.70000000000005" ...
##  $ ...10                          : chr [1:5] "55–64 years" "28.899999999999999" "48.899999999999999" "1361.4000000000001" ...
##  $ ...11                          : chr [1:5] "18–24 years" "561.29999999999995" "579.10000000000002" "1158.2" ...
##  $ ...12                          : chr [1:5] "15–24 years" "981" "979.5" "1601.5999999999999" ...
##  $ ...13                          : chr [1:5] "25–64 years" "452.19999999999999" "672.70000000000005" "6239.8999999999996" ...
##  $ ...14                          : chr [1:5] "15–64 years" "1434" "1652.2" "7838.3000000000002" ...

Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2016) <- ed2016[1,]# first row as column names

## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.

ed2016<-ed2016[-c(1,1), ] # deleting first row then. 
Year <- c("2016","2016","2016","2016")
ed2016$Year<-Year
ed2016 <- ed2016 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2016$`15–19 years` <- as.integer(ed2016$`15–19 years`)
ed2016$`20–24 years` <- as.integer(ed2016$`20–24 years`)
ed2016$Year <- as.integer(ed2016$Year)
ed2016

ed2017 <-slice(ed2017, c(4,22:23,42:43))
ed2017[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2017) <- ed2017[1,]# first row as column names

## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.

ed2017<-ed2017[-c(1,1), ] # deleting first row then. 
Year <- c("2017","2017","2017","2017")
ed2017$Year<-Year
ed2017 <- ed2017 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2017$`15–19 years` <- as.integer(ed2017$`15–19 years`)
ed2017$`20–24 years` <- as.integer(ed2017$`20–24 years`)
ed2017$Year <- as.integer(ed2017$Year)


ed2018 <-slice(ed2018, c(4,22:23,42:43))
ed2018[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2018) <- ed2018[1,]# first row as column names

## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.

ed2018<-ed2018[-c(1,1), ] # deleting first row then. 
Year <- c("2018","2018","2018","2018")
ed2018$Year<-Year
ed2018 <- ed2018 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2018$`15–19 years` <- as.integer(ed2018$`15–19 years`)
ed2018$`20–24 years` <- as.integer(ed2018$`20–24 years`)
ed2018$Year <- as.integer(ed2018$Year)


ed2019 <-slice(ed2019, c(4,22:23,42:43))
ed2019[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2019) <- ed2019[1,]# first row as column names

## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.

ed2019<-ed2019[-c(1,1), ] # deleting first row then. 
Year <- c("2019","2019","2019","2019")
ed2019$Year<-Year
ed2019 <- ed2019 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2019$`15–19 years` <- as.integer(ed2019$`15–19 years`)
ed2019$`20–24 years` <- as.integer(ed2019$`20–24 years`)
ed2019$Year <- as.integer(ed2019$Year)


ed2020 <-slice(ed2020, c(4,22:23,42:43))
str(ed2020)

## tibble [5 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:5] NA "Male" "Female" "Male" ...
##  $ ...2                           : chr [1:5] "15–19 years" "617.70000000000005" "623.5" "765.5" ...
##  $ ...3                           : chr [1:5] "20–24 years" "365.5" "395.60000000000002" "878.79999999999995" ...
##  $ ...4                           : chr [1:5] "25–29 years" "181.09999999999999" "182.59999999999999" "952.79999999999995" ...
##  $ ...5                           : chr [1:5] "30–34 years" "94.099999999999994" "132.59999999999999" "937.79999999999995" ...
##  $ ...6                           : chr [1:5] "35–39 years" "67.900000000000006" "94.5" "897.29999999999995" ...
##  $ ...7                           : chr [1:5] "40–44 years" "45.899999999999999" "80.400000000000006" "795.29999999999995" ...
##  $ ...8                           : chr [1:5] "45–49 years" "34.100000000000001" "67.5" "818.60000000000002" ...
##  $ ...9                           : chr [1:5] "50–54 years" "21.199999999999999" "51.399999999999999" "753.10000000000002" ...
##  $ ...10                          : chr [1:5] "55–64 years" "25.300000000000001" "46.299999999999997" "1439.7" ...
##  $ ...11                          : chr [1:5] "18–24 years" "553.89999999999998" "602.20000000000005" "1194.2" ...
##  $ ...12                          : chr [1:5] "15–24 years" "985.29999999999995" "1018.9" "1642.7" ...
##  $ ...13                          : chr [1:5] "25–64 years" "466.69999999999999" "656.29999999999995" "6590" ...
##  $ ...14                          : chr [1:5] "15–64 years" "1450.9000000000001" "1675.0999999999999" "8235.3999999999996" ...

ed2020[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2020) <- ed2020[1,]# first row as column names

## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.

ed2020<-ed2020[-c(1,1), ] # deleting first row then. 
Year <- c("2020","2020","2020","2020")
ed2020$Year<-Year
ed2020 <- ed2020 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2020$`15–19 years` <- as.integer(ed2020$`15–19 years`)
ed2020$`20–24 years` <- as.integer(ed2020$`20–24 years`)
ed2020$Year <- as.integer(ed2020$Year)

dfc <- slice(df2, c(5,8:9,22:25))
names(dfc) <- dfc[1,]

## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.

dfc<-dfc %>% select(1,9,10,11,12,13)
dfc<-dfc[-c(1,4,5), ]
is.na(dfc)

##        Age 2015–16 2016–17 2017–18 2018–19(b) 2019–20
## [1,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE
## [2,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE
## [3,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE
## [4,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE

sum(is.na(dfc))

## [1] 0

Gender <- factor(c("Male","Male","Female","Female"), levels=c("Male","Female"))
dfc$Gender<-Gender
Status <-c ("Crime","Crime","Crime","Crime")
dfc$Status <- Status
dfc

names(dfc)[1]<-paste("Age_gp")
names(dfc)[2]<-paste("2016")
names(dfc)[3]<-paste("2017")
names(dfc)[4]<-paste("2018")
names(dfc)[5]<-paste("2019")
names(dfc)[6]<-paste("2020")

The size and content of six data sets are trimmed to a manageable size to understand the data. This means that only those rows and columns are kept which are necessary to meet our objective i.e. variables and observations for the selected age groups for five years. Following the surgically obtained data frames, a check for missing values is conducted. The Slice function is used in all the dataframes to remove the irrelevant bits.

Education dataset: For the ed2016, the is.na function was run and it produced just one missing value, which is a blankspace that is used to naming column comprising nominal variables - male and female. Running the str command showed that the study and population figures for different age groups were in character data type. They were converted to integer data type. The reason they were converted to integer type was because a new variable would be created to calculate the percentage. A column comprising year was added to the dataframe. All of the above mentioned steps were repeated for the dataframes ed2017,ed2018,ed2019 and ed2020.

Crime dataset: For df2, running the command is.na yielded no missing values. Gender variable was created with values ‘Male’ and ‘Female’ as factors and levels. A new variable ‘Status’ was created to hold value as ‘Crime’. This status variable is created to differentiate between the study and crime population figures when the dataframes are merged.

The dataframe df2 has years represented in fiscal year format such as 2016-17, compared to ed2016 which has calendar year format 2016. For the data merger, the year needs to be in a consistent format. Therefore, in df2 the year is converted to calendar year format by changing 2016-17 to 2016. This is followed for the subsequent years as well.

Tidy & Manipulate Data I

# This is the R chunk for the Tidy & Manipulate Data I 


ed2016 <- ed2016 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2016)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2016$Status<-Status
ed2016

ed2017 <- ed2017 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2017)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2017$Status<-Status
ed2017

ed2018 <- ed2018 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2018)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2018$Status<-Status
ed2018

ed2019 <- ed2019 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2019)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2019$Status<-Status
ed2019

ed2020 <- ed2020 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2020)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2020$Status<-Status
ed2020

dfc <- dfc %>% gather(`2020`, `2019`,`2018`, `2017`, `2016`, key = "Year", value = "Numbers")
dfc<-dfc %>% select(2,4,1,5,3)
dfc$Numbers <- as.numeric(dfc$Numbers)
dfc$Year <- as.integer(dfc$Year)
knitr::kable(dfc)

Gender	Year	Age_gp	Numbers	Status
Male	2020	15–19 years	41792	Crime
Male	2020	20–24 years	43207	Crime
Female	2020	15–19 years	14619	Crime
Female	2020	20–24 years	13229	Crime
Male	2019	15–19 years	44687	Crime
Male	2019	20–24 years	48605	Crime
Female	2019	15–19 years	16111	Crime
Female	2019	20–24 years	14342	Crime
Male	2018	15–19 years	48185	Crime
Male	2018	20–24 years	52506	Crime
Female	2018	15–19 years	16651	Crime
Female	2018	20–24 years	14687	Crime
Male	2017	15–19 years	50518	Crime
Male	2017	20–24 years	55737	Crime
Female	2017	15–19 years	17372	Crime
Female	2017	20–24 years	15653	Crime
Male	2016	15–19 years	52884	Crime
Male	2016	20–24 years	59770	Crime
Female	2016	15–19 years	17685	Crime
Female	2016	20–24 years	15925	Crime

The data in ed2016 is not in tidy form because it does not follow the tidy principle - Each variable must have its own column. In the table 15-19 years and 20-24 years are variables, but they are represented as column names. When they are rearranged under the variable name - ‘Age_gp’, it represents a tidier version of the dame data. Gather function is used to achieve this, followed by creating variable name ‘Age_gp’ for the column. An additional variable “Status” is created to differentiate the numbers between studying and population.

The years in dfc from 2016 to 2020 are variables but are currently column names. They are in violation of tidy principle that each variable should have one column. Therefore, the gather function is implemented to bring them in line with the tidy principle.

Tidy & Manipulate Data II

# This is the R chunk for the Tidy & Manipulate Data II 


study <- union(union(union(union(ed2020,ed2019),ed2018),ed2017),ed2016)
Year <- as.integer(study$Year)
study$Numbers <- as.integer(study$Numbers)
study$Numbers <- study$Numbers*1000
knitr::kable(study)

Gender	Year	Age_gp	Numbers	Status
Male	2020	15–19 years	617000	Studying
Female	2020	15–19 years	623000	Studying
Male	2020	15–19 years	765000	Population
Female	2020	15–19 years	724000	Population
Male	2020	20–24 years	365000	Studying
Female	2020	20–24 years	395000	Studying
Male	2020	20–24 years	878000	Population
Female	2020	20–24 years	835000	Population
Male	2019	15–19 years	624000	Studying
Female	2019	15–19 years	610000	Studying
Male	2019	15–19 years	764000	Population
Female	2019	15–19 years	728000	Population
Male	2019	20–24 years	395000	Studying
Female	2019	20–24 years	416000	Studying
Male	2019	20–24 years	892000	Population
Female	2019	20–24 years	851000	Population
Male	2018	15–19 years	613000	Studying
Female	2018	15–19 years	617000	Studying
Male	2018	15–19 years	756000	Population
Female	2018	15–19 years	720000	Population
Male	2018	20–24 years	390000	Studying
Female	2018	20–24 years	394000	Studying
Male	2018	20–24 years	870000	Population
Female	2018	20–24 years	843000	Population
Male	2017	15–19 years	620000	Studying
Female	2017	15–19 years	613000	Studying
Male	2017	15–19 years	755000	Population
Female	2017	15–19 years	718000	Population
Male	2017	20–24 years	362000	Studying
Female	2017	20–24 years	384000	Studying
Male	2017	20–24 years	856000	Population
Female	2017	20–24 years	823000	Population
Male	2016	15–19 years	623000	Studying
Female	2016	15–19 years	597000	Studying
Male	2016	15–19 years	753000	Population
Female	2016	15–19 years	710000	Population
Male	2016	20–24 years	359000	Studying
Female	2016	20–24 years	382000	Studying
Male	2016	20–24 years	847000	Population
Female	2016	20–24 years	813000	Population

db <- union(study,dfc)
db <- db %>% spread(key = "Status", value = "Numbers")

Crime_Prop <- (db$Crime/db$Population)*100
db$Crime_Prop <- Crime_Prop
round(db$Crime_Prop,digits=2)

##  [1] 2.49 1.96 2.42 1.90 2.31 1.74 2.21 1.69 2.02 1.58 7.02 7.06 6.69 6.51 6.37
## [16] 6.04 5.85 5.45 5.46 4.92

Study_Prop <- (db$Studying/db$Population)*100
db$Study_Prop <- Study_Prop
round(db$Study_Prop,digits=2)

##  [1] 84.08 46.99 85.38 46.66 85.69 46.74 83.79 48.88 86.05 47.31 82.74 42.38
## [13] 82.12 42.29 81.08 44.83 81.68 44.28 80.65 41.57

db <-db %>% select(2,1,3,7,8)
db <- db %>% arrange(Year, Gender)
knitr::kable(db)

Year	Gender	Age_gp	Crime_Prop	Study_Prop
2016	Female	15–19 years	2.490845	84.08451
2016	Female	20–24 years	1.958795	46.98647
2016	Male	15–19 years	7.023108	82.73572
2016	Male	20–24 years	7.056671	42.38489
2017	Female	15–19 years	2.419499	85.37604
2017	Female	20–24 years	1.901944	46.65857
2017	Male	15–19 years	6.691126	82.11921
2017	Male	20–24 years	6.511332	42.28972
2018	Female	15–19 years	2.312639	85.69444
2018	Female	20–24 years	1.742230	46.73784
2018	Male	15–19 years	6.373677	81.08466
2018	Male	20–24 years	6.035172	44.82759
2019	Female	15–19 years	2.213049	83.79121
2019	Female	20–24 years	1.685311	48.88367
2019	Male	15–19 years	5.849084	81.67539
2019	Male	20–24 years	5.448991	44.28251
2020	Female	15–19 years	2.019199	86.04972
2020	Female	20–24 years	1.584311	47.30539
2020	Male	15–19 years	5.463006	80.65359
2020	Male	20–24 years	4.921071	41.57175

The datasets are merged in this step. A dataframe ‘study’ is created where ed2016, ed2017,2018,ed2019, ed2020 are unioned. As the name suggests, this dataframe has study and population of two age groups from 2016 to 2020. The ‘Year’ and ‘Number’ variables are converted into integer variables. The reason being ‘Number’ variable needs to be multiplied by 1000 to get the absolute figures.

A dataframe db is the final stage of achieving a tidy data set. The dataframe db is created which unions study and dfc dataframes. The dataframe db is not a tidy set, because are looking for gender based age groups’study and crime percentage. Therefore, Crime and Study percentage become variables and the values as observation in each row. Hence, spread function is used to achieve this. Then, Crime and study proportion representing variables are created from the existing variables. The dataframe db is further trimmed by using the select function. The final dataframe db represents five variables - Gender, Year, Age_gp, Crime_Prop, Study_Prop.

Scan I

# This is the R chunk for the Scan I
is.na(db)

##        Year Gender Age_gp Crime_Prop Study_Prop
##  [1,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [2,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [3,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [4,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [5,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [6,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [7,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [8,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [9,] FALSE  FALSE  FALSE      FALSE      FALSE
## [10,] FALSE  FALSE  FALSE      FALSE      FALSE
## [11,] FALSE  FALSE  FALSE      FALSE      FALSE
## [12,] FALSE  FALSE  FALSE      FALSE      FALSE
## [13,] FALSE  FALSE  FALSE      FALSE      FALSE
## [14,] FALSE  FALSE  FALSE      FALSE      FALSE
## [15,] FALSE  FALSE  FALSE      FALSE      FALSE
## [16,] FALSE  FALSE  FALSE      FALSE      FALSE
## [17,] FALSE  FALSE  FALSE      FALSE      FALSE
## [18,] FALSE  FALSE  FALSE      FALSE      FALSE
## [19,] FALSE  FALSE  FALSE      FALSE      FALSE
## [20,] FALSE  FALSE  FALSE      FALSE      FALSE

sum(is.na(db))

## [1] 0

sapply(db, is.infinite)

##        Year Gender Age_gp Crime_Prop Study_Prop
##  [1,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [2,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [3,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [4,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [5,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [6,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [7,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [8,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [9,] FALSE  FALSE  FALSE      FALSE      FALSE
## [10,] FALSE  FALSE  FALSE      FALSE      FALSE
## [11,] FALSE  FALSE  FALSE      FALSE      FALSE
## [12,] FALSE  FALSE  FALSE      FALSE      FALSE
## [13,] FALSE  FALSE  FALSE      FALSE      FALSE
## [14,] FALSE  FALSE  FALSE      FALSE      FALSE
## [15,] FALSE  FALSE  FALSE      FALSE      FALSE
## [16,] FALSE  FALSE  FALSE      FALSE      FALSE
## [17,] FALSE  FALSE  FALSE      FALSE      FALSE
## [18,] FALSE  FALSE  FALSE      FALSE      FALSE
## [19,] FALSE  FALSE  FALSE      FALSE      FALSE
## [20,] FALSE  FALSE  FALSE      FALSE      FALSE

sum(sapply(db, is.infinite))

## [1] 0

library(editrules)

## Warning: package 'editrules' was built under R version 4.0.5

## Loading required package: igraph

## Warning: package 'igraph' was built under R version 4.0.5

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following objects are masked from 'package:tidyr':
## 
##     contains, separate

## The following object is masked from 'package:dplyr':
## 
##     contains

(Rule_Study <- editset(c("Study_Prop >= 0", "Study_Prop <= 100")))

## 
## Edit set:
## num1 : 0 <= Study_Prop
## num2 : Study_Prop <= 100

violatedEdits(Rule_Study, db)

##       edit
## record  num1  num2
##     1  FALSE FALSE
##     2  FALSE FALSE
##     3  FALSE FALSE
##     4  FALSE FALSE
##     5  FALSE FALSE
##     6  FALSE FALSE
##     7  FALSE FALSE
##     8  FALSE FALSE
##     9  FALSE FALSE
##     10 FALSE FALSE
##     11 FALSE FALSE
##     12 FALSE FALSE
##     13 FALSE FALSE
##     14 FALSE FALSE
##     15 FALSE FALSE
##     16 FALSE FALSE
##     17 FALSE FALSE
##     18 FALSE FALSE
##     19 FALSE FALSE
##     20 FALSE FALSE

(Rule_Crime <- editset(c("Crime_Prop >= 0", "Crime_Prop <= 100")))

## 
## Edit set:
## num1 : 0 <= Crime_Prop
## num2 : Crime_Prop <= 100

violatedEdits(Rule_Study, db)

##       edit
## record  num1  num2
##     1  FALSE FALSE
##     2  FALSE FALSE
##     3  FALSE FALSE
##     4  FALSE FALSE
##     5  FALSE FALSE
##     6  FALSE FALSE
##     7  FALSE FALSE
##     8  FALSE FALSE
##     9  FALSE FALSE
##     10 FALSE FALSE
##     11 FALSE FALSE
##     12 FALSE FALSE
##     13 FALSE FALSE
##     14 FALSE FALSE
##     15 FALSE FALSE
##     16 FALSE FALSE
##     17 FALSE FALSE
##     18 FALSE FALSE
##     19 FALSE FALSE
##     20 FALSE FALSE

The steps undertaken here are to identify missing values, special values and obvious inconsistencies in the dataset db. is.na function is used to determine missing values. This is followed by sum function to determine number of missing values. There was no missing values found.

Sapply function is used to find if there are ny infinite values. No infinite value was found.

The variables study_prop and crime_prop are checked for obvious inconsistencies or errors. Given that they are perecentage figures so the values should be 0 and 100. Both the variables are checked by using edit rules.

Scan II

# This is the R chunk for the Scan II
#db$Study_Prop %>%  boxplot(main="Boxplot of Study_Prop", ylab="Study_Prop", col = "grey")
boxplot(db$Crime_Prop~db$Age_gp,main="Boxplot 1 - Crime Percentage by Age Group", ylab="Crime Percentage", xlab = "Age Group")

boxplot(db$Study_Prop ~db$Age_gp,main="Boxplot 2 - Study Percentage by Age Group", ylab="Study Percentage", xlab = "Age Group")

boxplot(db$Crime_Prop~db$Gender,main="Boxplot 3 - Crime Percentage by Gender", ylab="Crime Percentage", xlab = "Gender")

boxplot(db$Study_Prop~db$Gender,main="Boxplot 4 - Study Percentage by Gender", ylab="Study Percentage", xlab = "Gender")

boxplot(db$Crime_Prop~db$Year,main="Boxplot 5 - Crime Percentage by Year", ylab="Crime Percentage", xlab = "Year")

boxplot(db$Study_Prop~db$Year,main="Boxplot 6 - Study Percentage by Year", ylab="Study Percentage", xlab = "Year")

Provide explanations here.

Based on the boxplots generated for the datasets in consideration, it is safe to deduce that the 15-19 age group had a higher percentage of involvement in crimes(Boxplot 1). Out of these, the number of females involved in crime is lower as compared to the number of males. Alternatively, more females are involved in education as compared to males from the same age groups.

Transform

# This is the R chunk for the Transform Section
center_df <-scale(db$Crime_Prop, center = TRUE, scale = FALSE)
center_df

##             [,1]
##  [1,] -1.5942079
##  [2,] -2.1262584
##  [3,]  2.9380546
##  [4,]  2.9716176
##  [5,] -1.6655544
##  [6,] -2.1831089
##  [7,]  2.6060728
##  [8,]  2.4262788
##  [9,] -1.7724141
## [10,] -2.3428229
## [11,]  2.2886243
## [12,]  1.9501194
## [13,] -1.8720035
## [14,] -2.3997416
## [15,]  1.7640308
## [16,]  1.3639380
## [17,] -2.0658541
## [18,] -2.5007416
## [19,]  1.3779535
## [20,]  0.8360176
## attr(,"scaled:center")
## [1] 4.085053

Provide explanations here. The scale function was used to apply mean-centering to the column Crime_Prop. After this function was applied, the new centered values for Crime_Prop have been displayed.

References

2010 National Strategies for Young Australians. Retried October 3, 2021, from Youth Policy website: (https://www.youthpolicy.org/factsheets/country/australia)
Recorded Crime - Offenders (2019-20). Offenders, Australia. Retrieved October 3, 2021, from Australian Bureau of Statistics website: https://www.abs.gov.au/statistics/people/crime-and-justice/recorded-crime-offenders/2019-20#data-download
Education and Work, Australia (2019-20). Table 21 Current study: Persons aged 15-64 years – 2004 to 2020, Excel Data Table. Retrieved October 3, 2021, from Australian Bureau of Statistics website: https://www.abs.gov.au/statistics/people/education/education-and-work-australia/latest-release#data-download

Data Wrangling (Data Preprocessing)

Practical assessment 2

Kushagra Bhatnagar

03.10.21