# Load the necessary packages required to reproduce the report. For example:
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.5
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.0.5
library(ggplot2) # Useful for creating plots
## Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr) # Useful for data manipulation
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr) # Useful for creating nice tables
## Warning: package 'knitr' was built under R version 4.0.5
library(magrittr)
library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(readxl)
## Warning: package 'readxl' was built under R version 4.0.5
#install.packages("forecast")
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
#install.packages("editrules")
#library(editrules)
| Student name | Student number | Percentage of contribution |
|---|---|---|
| Karen Amanda Rodrigues | s3890896 | 50% |
| Kushagra Bhatnagar | s3666420 | 50% |
This assignment is used as a medium to answer a real-life question – What percentage of Australia’s youth is studying and yet is involved in crime, and to gain insight into the impact of education on the crime statistics. According to the 2010 National Strategies for Young Australians (https://www.youthpolicy.org/factsheets/country/australia), youth is defined in general as the age group between 12-24 years. We have selected 2 categories: • Age groups of 15-19 and 20-24 years. • Years between 2016 and 2020.
The data sets chosen were: • Education and work 2020, Table 21 - referred to as Education dataset in the report • 1, Offenders Australia(1) - referred to as Crime dataset in the report Both the datasets were downloaded from the Australian Bureau of Statistics. From the Education and work dataset, the study and total population of the age groups 15-19 and 20-24 were derived. The dataset 1 i.e. Offenders Australia(1) provided the population involved in crime within the two age groups. The two datasets were comprised of heaps of data. The approach adopted is that the datasets are tidied and manipulated individually before they were merged. Upon merger, the final result is presented as the percentage of youth population in the two age groups studying and involved in crime over a period of five years between 2016 and 2020.
Provide explanations here.
Two data sets are obtained from Australian Bureau of Statistic. Five data frames starting from year 2016 to 2020 are created from one Education dataset. The reason for this is that each tab in excel corresponds to a year. The five data frames store study and total population of several age groups for their respective years. One data frame is created from the Crime dataset.
The two datasets are detailed and elaborate, so they need to be trimmed in order to describe the variables. The variable description and merger of the datasets is only possible after tidying and manipulating the datasets. The Tidy and Manipulate sections will contain this information.
In this section the data sets are read using the library function readr and their first fifteen lines and structure are shown.
Since the data frames sourced from Education dataset are ed2016, ed2017, ed2018, ed2019, ed2020. They are of exact structure, so any work done on one dataframe is replicated in the rest of the five datasets. Henceforth, ed2016 dataframe would be used to display results. The dfc dataframe is sourced from ‘Table 5’ tab of excel file comprising Crime dataset.
# Import the data, provide your R codes here.
ed2016 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2016")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2017 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2017")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2018 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2018")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2019 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2019")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2020 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2020")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
df2 <- read_excel("1. Offenders, Australia (1).xls", sheet = "Table 5")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
head(ed2016, n = 15)
str(ed2016)
## tibble [147 x 14] (S3: tbl_df/tbl/data.frame)
## $ Australian Bureau of Statistics: chr [1:147] "Education and Work, Australia, May 2020" "Released at 11:30 am (Canberra time) 11 November 2020" "Table 21 Current study: Persons aged 15–64 years, 2016" NA ...
## $ ...2 : chr [1:147] NA NA NA "15–19 years" ...
## $ ...3 : chr [1:147] NA NA NA "20–24 years" ...
## $ ...4 : chr [1:147] NA NA NA "25–29 years" ...
## $ ...5 : chr [1:147] NA NA NA "30–34 years" ...
## $ ...6 : chr [1:147] NA NA NA "35–39 years" ...
## $ ...7 : chr [1:147] NA NA NA "40–44 years" ...
## $ ...8 : chr [1:147] NA NA NA "45–49 years" ...
## $ ...9 : chr [1:147] NA NA NA "50–54 years" ...
## $ ...10 : chr [1:147] NA NA NA "55–64 years" ...
## $ ...11 : chr [1:147] NA NA NA "18–24 years" ...
## $ ...12 : chr [1:147] NA NA NA "15–24 years" ...
## $ ...13 : chr [1:147] NA NA NA "25–64 years" ...
## $ ...14 : chr [1:147] NA NA NA "15–64 years" ...
head(df2,n = 15)
str(df2)
## tibble [64 x 25] (S3: tbl_df/tbl/data.frame)
## $ Australian Bureau of Statistics: chr [1:64] "Recorded Crime – Offenders, 2019–20" "Released at 11.30am (Canberra time) Thurs 11 February 2021" "Table 5 Offenders, Age by sex, 2008–09 to 2019–20" NA ...
## $ ...2 : chr [1:64] NA NA NA "Number" ...
## $ ...3 : chr [1:64] NA NA NA NA ...
## $ ...4 : chr [1:64] NA NA NA NA ...
## $ ...5 : chr [1:64] NA NA NA NA ...
## $ ...6 : chr [1:64] NA NA NA NA ...
## $ ...7 : chr [1:64] NA NA NA NA ...
## $ ...8 : chr [1:64] NA NA NA NA ...
## $ ...9 : chr [1:64] NA NA NA NA ...
## $ ...10 : chr [1:64] NA NA NA NA ...
## $ ...11 : chr [1:64] NA NA NA NA ...
## $ ...12 : chr [1:64] NA NA NA NA ...
## $ ...13 : chr [1:64] NA NA NA NA ...
## $ ...14 : chr [1:64] NA NA NA "Offender rate(a)" ...
## $ ...15 : chr [1:64] NA NA NA NA ...
## $ ...16 : chr [1:64] NA NA NA NA ...
## $ ...17 : chr [1:64] NA NA NA NA ...
## $ ...18 : chr [1:64] NA NA NA NA ...
## $ ...19 : chr [1:64] NA NA NA NA ...
## $ ...20 : chr [1:64] NA NA NA NA ...
## $ ...21 : chr [1:64] NA NA NA NA ...
## $ ...22 : chr [1:64] NA NA NA NA ...
## $ ...23 : chr [1:64] NA NA NA NA ...
## $ ...24 : chr [1:64] NA NA NA NA ...
## $ ...25 : chr [1:64] NA NA NA NA ...
# This is the R chunk for the Understand Section
ed2016 <-slice(ed2016, c(4,22:23,42:43))
is.na(ed2016)
## Australian Bureau of Statistics ...2 ...3 ...4 ...5 ...6 ...7 ...8
## [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## ...9 ...10 ...11 ...12 ...13 ...14
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is.na(ed2016))
## [1] 1
ed2016[1,1]="Gender"
str(ed2016)
## tibble [5 x 14] (S3: tbl_df/tbl/data.frame)
## $ Australian Bureau of Statistics: chr [1:5] "Gender" "Male" "Female" "Male" ...
## $ ...2 : chr [1:5] "15–19 years" "623.29999999999995" "597.29999999999995" "753.5" ...
## $ ...3 : chr [1:5] "20–24 years" "359" "382.10000000000002" "847.20000000000005" ...
## $ ...4 : chr [1:5] "25–29 years" "155.80000000000001" "186.80000000000001" "876.70000000000005" ...
## $ ...5 : chr [1:5] "30–34 years" "101.2" "129.40000000000001" "873.5" ...
## $ ...6 : chr [1:5] "35–39 years" "59.700000000000003" "101.5" "793.89999999999998" ...
## $ ...7 : chr [1:5] "40–44 years" "47.100000000000001" "84.200000000000003" "796.89999999999998" ...
## $ ...8 : chr [1:5] "45–49 years" "33.799999999999997" "70.900000000000006" "785.29999999999995" ...
## $ ...9 : chr [1:5] "50–54 years" "24.600000000000001" "48.700000000000003" "753.70000000000005" ...
## $ ...10 : chr [1:5] "55–64 years" "28.899999999999999" "48.899999999999999" "1361.4000000000001" ...
## $ ...11 : chr [1:5] "18–24 years" "561.29999999999995" "579.10000000000002" "1158.2" ...
## $ ...12 : chr [1:5] "15–24 years" "981" "979.5" "1601.5999999999999" ...
## $ ...13 : chr [1:5] "25–64 years" "452.19999999999999" "672.70000000000005" "6239.8999999999996" ...
## $ ...14 : chr [1:5] "15–64 years" "1434" "1652.2" "7838.3000000000002" ...
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2016) <- ed2016[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2016<-ed2016[-c(1,1), ] # deleting first row then.
Year <- c("2016","2016","2016","2016")
ed2016$Year<-Year
ed2016 <- ed2016 %>% select('Gender','15–19 years','20–24 years','Year')
ed2016$`15–19 years` <- as.integer(ed2016$`15–19 years`)
ed2016$`20–24 years` <- as.integer(ed2016$`20–24 years`)
ed2016$Year <- as.integer(ed2016$Year)
ed2016
ed2017 <-slice(ed2017, c(4,22:23,42:43))
ed2017[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2017) <- ed2017[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2017<-ed2017[-c(1,1), ] # deleting first row then.
Year <- c("2017","2017","2017","2017")
ed2017$Year<-Year
ed2017 <- ed2017 %>% select('Gender','15–19 years','20–24 years','Year')
ed2017$`15–19 years` <- as.integer(ed2017$`15–19 years`)
ed2017$`20–24 years` <- as.integer(ed2017$`20–24 years`)
ed2017$Year <- as.integer(ed2017$Year)
ed2018 <-slice(ed2018, c(4,22:23,42:43))
ed2018[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2018) <- ed2018[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2018<-ed2018[-c(1,1), ] # deleting first row then.
Year <- c("2018","2018","2018","2018")
ed2018$Year<-Year
ed2018 <- ed2018 %>% select('Gender','15–19 years','20–24 years','Year')
ed2018$`15–19 years` <- as.integer(ed2018$`15–19 years`)
ed2018$`20–24 years` <- as.integer(ed2018$`20–24 years`)
ed2018$Year <- as.integer(ed2018$Year)
ed2019 <-slice(ed2019, c(4,22:23,42:43))
ed2019[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2019) <- ed2019[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2019<-ed2019[-c(1,1), ] # deleting first row then.
Year <- c("2019","2019","2019","2019")
ed2019$Year<-Year
ed2019 <- ed2019 %>% select('Gender','15–19 years','20–24 years','Year')
ed2019$`15–19 years` <- as.integer(ed2019$`15–19 years`)
ed2019$`20–24 years` <- as.integer(ed2019$`20–24 years`)
ed2019$Year <- as.integer(ed2019$Year)
ed2020 <-slice(ed2020, c(4,22:23,42:43))
str(ed2020)
## tibble [5 x 14] (S3: tbl_df/tbl/data.frame)
## $ Australian Bureau of Statistics: chr [1:5] NA "Male" "Female" "Male" ...
## $ ...2 : chr [1:5] "15–19 years" "617.70000000000005" "623.5" "765.5" ...
## $ ...3 : chr [1:5] "20–24 years" "365.5" "395.60000000000002" "878.79999999999995" ...
## $ ...4 : chr [1:5] "25–29 years" "181.09999999999999" "182.59999999999999" "952.79999999999995" ...
## $ ...5 : chr [1:5] "30–34 years" "94.099999999999994" "132.59999999999999" "937.79999999999995" ...
## $ ...6 : chr [1:5] "35–39 years" "67.900000000000006" "94.5" "897.29999999999995" ...
## $ ...7 : chr [1:5] "40–44 years" "45.899999999999999" "80.400000000000006" "795.29999999999995" ...
## $ ...8 : chr [1:5] "45–49 years" "34.100000000000001" "67.5" "818.60000000000002" ...
## $ ...9 : chr [1:5] "50–54 years" "21.199999999999999" "51.399999999999999" "753.10000000000002" ...
## $ ...10 : chr [1:5] "55–64 years" "25.300000000000001" "46.299999999999997" "1439.7" ...
## $ ...11 : chr [1:5] "18–24 years" "553.89999999999998" "602.20000000000005" "1194.2" ...
## $ ...12 : chr [1:5] "15–24 years" "985.29999999999995" "1018.9" "1642.7" ...
## $ ...13 : chr [1:5] "25–64 years" "466.69999999999999" "656.29999999999995" "6590" ...
## $ ...14 : chr [1:5] "15–64 years" "1450.9000000000001" "1675.0999999999999" "8235.3999999999996" ...
ed2020[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2020) <- ed2020[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2020<-ed2020[-c(1,1), ] # deleting first row then.
Year <- c("2020","2020","2020","2020")
ed2020$Year<-Year
ed2020 <- ed2020 %>% select('Gender','15–19 years','20–24 years','Year')
ed2020$`15–19 years` <- as.integer(ed2020$`15–19 years`)
ed2020$`20–24 years` <- as.integer(ed2020$`20–24 years`)
ed2020$Year <- as.integer(ed2020$Year)
dfc <- slice(df2, c(5,8:9,22:25))
names(dfc) <- dfc[1,]
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
dfc<-dfc %>% select(1,9,10,11,12,13)
dfc<-dfc[-c(1,4,5), ]
is.na(dfc)
## Age 2015–16 2016–17 2017–18 2018–19(b) 2019–20
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is.na(dfc))
## [1] 0
Gender <- factor(c("Male","Male","Female","Female"), levels=c("Male","Female"))
dfc$Gender<-Gender
Status <-c ("Crime","Crime","Crime","Crime")
dfc$Status <- Status
dfc
names(dfc)[1]<-paste("Age_gp")
names(dfc)[2]<-paste("2016")
names(dfc)[3]<-paste("2017")
names(dfc)[4]<-paste("2018")
names(dfc)[5]<-paste("2019")
names(dfc)[6]<-paste("2020")
The size and content of six data sets are trimmed to a manageable size to understand the data. This means that only those rows and columns are kept which are necessary to meet our objective i.e. variables and observations for the selected age groups for five years. Following the surgically obtained data frames, a check for missing values is conducted. The Slice function is used in all the dataframes to remove the irrelevant bits.
Education dataset: For the ed2016, the is.na function was run and it produced just one missing value, which is a blankspace that is used to naming column comprising nominal variables - male and female. Running the str command showed that the study and population figures for different age groups were in character data type. They were converted to integer data type. The reason they were converted to integer type was because a new variable would be created to calculate the percentage. A column comprising year was added to the dataframe. All of the above mentioned steps were repeated for the dataframes ed2017,ed2018,ed2019 and ed2020.
Crime dataset: For df2, running the command is.na yielded no missing values. Gender variable was created with values ‘Male’ and ‘Female’ as factors and levels. A new variable ‘Status’ was created to hold value as ‘Crime’. This status variable is created to differentiate between the study and crime population figures when the dataframes are merged.
The dataframe df2 has years represented in fiscal year format such as 2016-17, compared to ed2016 which has calendar year format 2016. For the data merger, the year needs to be in a consistent format. Therefore, in df2 the year is converted to calendar year format by changing 2016-17 to 2016. This is followed for the subsequent years as well.
# This is the R chunk for the Tidy & Manipulate Data I
ed2016 <- ed2016 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2016)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2016$Status<-Status
ed2016
ed2017 <- ed2017 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2017)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2017$Status<-Status
ed2017
ed2018 <- ed2018 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2018)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2018$Status<-Status
ed2018
ed2019 <- ed2019 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2019)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2019$Status<-Status
ed2019
ed2020 <- ed2020 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2020)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2020$Status<-Status
ed2020
dfc <- dfc %>% gather(`2020`, `2019`,`2018`, `2017`, `2016`, key = "Year", value = "Numbers")
dfc<-dfc %>% select(2,4,1,5,3)
dfc$Numbers <- as.numeric(dfc$Numbers)
dfc$Year <- as.integer(dfc$Year)
knitr::kable(dfc)
| Gender | Year | Age_gp | Numbers | Status |
|---|---|---|---|---|
| Male | 2020 | 15–19 years | 41792 | Crime |
| Male | 2020 | 20–24 years | 43207 | Crime |
| Female | 2020 | 15–19 years | 14619 | Crime |
| Female | 2020 | 20–24 years | 13229 | Crime |
| Male | 2019 | 15–19 years | 44687 | Crime |
| Male | 2019 | 20–24 years | 48605 | Crime |
| Female | 2019 | 15–19 years | 16111 | Crime |
| Female | 2019 | 20–24 years | 14342 | Crime |
| Male | 2018 | 15–19 years | 48185 | Crime |
| Male | 2018 | 20–24 years | 52506 | Crime |
| Female | 2018 | 15–19 years | 16651 | Crime |
| Female | 2018 | 20–24 years | 14687 | Crime |
| Male | 2017 | 15–19 years | 50518 | Crime |
| Male | 2017 | 20–24 years | 55737 | Crime |
| Female | 2017 | 15–19 years | 17372 | Crime |
| Female | 2017 | 20–24 years | 15653 | Crime |
| Male | 2016 | 15–19 years | 52884 | Crime |
| Male | 2016 | 20–24 years | 59770 | Crime |
| Female | 2016 | 15–19 years | 17685 | Crime |
| Female | 2016 | 20–24 years | 15925 | Crime |
The data in ed2016 is not in tidy form because it does not follow the tidy principle - Each variable must have its own column. In the table 15-19 years and 20-24 years are variables, but they are represented as column names. When they are rearranged under the variable name - ‘Age_gp’, it represents a tidier version of the dame data. Gather function is used to achieve this, followed by creating variable name ‘Age_gp’ for the column. An additional variable “Status” is created to differentiate the numbers between studying and population.
The years in dfc from 2016 to 2020 are variables but are currently column names. They are in violation of tidy principle that each variable should have one column. Therefore, the gather function is implemented to bring them in line with the tidy principle.
# This is the R chunk for the Tidy & Manipulate Data II
study <- union(union(union(union(ed2020,ed2019),ed2018),ed2017),ed2016)
Year <- as.integer(study$Year)
study$Numbers <- as.integer(study$Numbers)
study$Numbers <- study$Numbers*1000
knitr::kable(study)
| Gender | Year | Age_gp | Numbers | Status |
|---|---|---|---|---|
| Male | 2020 | 15–19 years | 617000 | Studying |
| Female | 2020 | 15–19 years | 623000 | Studying |
| Male | 2020 | 15–19 years | 765000 | Population |
| Female | 2020 | 15–19 years | 724000 | Population |
| Male | 2020 | 20–24 years | 365000 | Studying |
| Female | 2020 | 20–24 years | 395000 | Studying |
| Male | 2020 | 20–24 years | 878000 | Population |
| Female | 2020 | 20–24 years | 835000 | Population |
| Male | 2019 | 15–19 years | 624000 | Studying |
| Female | 2019 | 15–19 years | 610000 | Studying |
| Male | 2019 | 15–19 years | 764000 | Population |
| Female | 2019 | 15–19 years | 728000 | Population |
| Male | 2019 | 20–24 years | 395000 | Studying |
| Female | 2019 | 20–24 years | 416000 | Studying |
| Male | 2019 | 20–24 years | 892000 | Population |
| Female | 2019 | 20–24 years | 851000 | Population |
| Male | 2018 | 15–19 years | 613000 | Studying |
| Female | 2018 | 15–19 years | 617000 | Studying |
| Male | 2018 | 15–19 years | 756000 | Population |
| Female | 2018 | 15–19 years | 720000 | Population |
| Male | 2018 | 20–24 years | 390000 | Studying |
| Female | 2018 | 20–24 years | 394000 | Studying |
| Male | 2018 | 20–24 years | 870000 | Population |
| Female | 2018 | 20–24 years | 843000 | Population |
| Male | 2017 | 15–19 years | 620000 | Studying |
| Female | 2017 | 15–19 years | 613000 | Studying |
| Male | 2017 | 15–19 years | 755000 | Population |
| Female | 2017 | 15–19 years | 718000 | Population |
| Male | 2017 | 20–24 years | 362000 | Studying |
| Female | 2017 | 20–24 years | 384000 | Studying |
| Male | 2017 | 20–24 years | 856000 | Population |
| Female | 2017 | 20–24 years | 823000 | Population |
| Male | 2016 | 15–19 years | 623000 | Studying |
| Female | 2016 | 15–19 years | 597000 | Studying |
| Male | 2016 | 15–19 years | 753000 | Population |
| Female | 2016 | 15–19 years | 710000 | Population |
| Male | 2016 | 20–24 years | 359000 | Studying |
| Female | 2016 | 20–24 years | 382000 | Studying |
| Male | 2016 | 20–24 years | 847000 | Population |
| Female | 2016 | 20–24 years | 813000 | Population |
db <- union(study,dfc)
db <- db %>% spread(key = "Status", value = "Numbers")
Crime_Prop <- (db$Crime/db$Population)*100
db$Crime_Prop <- Crime_Prop
round(db$Crime_Prop,digits=2)
## [1] 2.49 1.96 2.42 1.90 2.31 1.74 2.21 1.69 2.02 1.58 7.02 7.06 6.69 6.51 6.37
## [16] 6.04 5.85 5.45 5.46 4.92
Study_Prop <- (db$Studying/db$Population)*100
db$Study_Prop <- Study_Prop
round(db$Study_Prop,digits=2)
## [1] 84.08 46.99 85.38 46.66 85.69 46.74 83.79 48.88 86.05 47.31 82.74 42.38
## [13] 82.12 42.29 81.08 44.83 81.68 44.28 80.65 41.57
db <-db %>% select(2,1,3,7,8)
db <- db %>% arrange(Year, Gender)
knitr::kable(db)
| Year | Gender | Age_gp | Crime_Prop | Study_Prop |
|---|---|---|---|---|
| 2016 | Female | 15–19 years | 2.490845 | 84.08451 |
| 2016 | Female | 20–24 years | 1.958795 | 46.98647 |
| 2016 | Male | 15–19 years | 7.023108 | 82.73572 |
| 2016 | Male | 20–24 years | 7.056671 | 42.38489 |
| 2017 | Female | 15–19 years | 2.419499 | 85.37604 |
| 2017 | Female | 20–24 years | 1.901944 | 46.65857 |
| 2017 | Male | 15–19 years | 6.691126 | 82.11921 |
| 2017 | Male | 20–24 years | 6.511332 | 42.28972 |
| 2018 | Female | 15–19 years | 2.312639 | 85.69444 |
| 2018 | Female | 20–24 years | 1.742230 | 46.73784 |
| 2018 | Male | 15–19 years | 6.373677 | 81.08466 |
| 2018 | Male | 20–24 years | 6.035172 | 44.82759 |
| 2019 | Female | 15–19 years | 2.213049 | 83.79121 |
| 2019 | Female | 20–24 years | 1.685311 | 48.88367 |
| 2019 | Male | 15–19 years | 5.849084 | 81.67539 |
| 2019 | Male | 20–24 years | 5.448991 | 44.28251 |
| 2020 | Female | 15–19 years | 2.019199 | 86.04972 |
| 2020 | Female | 20–24 years | 1.584311 | 47.30539 |
| 2020 | Male | 15–19 years | 5.463006 | 80.65359 |
| 2020 | Male | 20–24 years | 4.921071 | 41.57175 |
The datasets are merged in this step. A dataframe ‘study’ is created where ed2016, ed2017,2018,ed2019, ed2020 are unioned. As the name suggests, this dataframe has study and population of two age groups from 2016 to 2020. The ‘Year’ and ‘Number’ variables are converted into integer variables. The reason being ‘Number’ variable needs to be multiplied by 1000 to get the absolute figures.
A dataframe db is the final stage of achieving a tidy data set. The dataframe db is created which unions study and dfc dataframes. The dataframe db is not a tidy set, because are looking for gender based age groups’study and crime percentage. Therefore, Crime and Study percentage become variables and the values as observation in each row. Hence, spread function is used to achieve this. Then, Crime and study proportion representing variables are created from the existing variables. The dataframe db is further trimmed by using the select function. The final dataframe db represents five variables - Gender, Year, Age_gp, Crime_Prop, Study_Prop.
# This is the R chunk for the Scan I
is.na(db)
## Year Gender Age_gp Crime_Prop Study_Prop
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE FALSE
sum(is.na(db))
## [1] 0
sapply(db, is.infinite)
## Year Gender Age_gp Crime_Prop Study_Prop
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE FALSE
sum(sapply(db, is.infinite))
## [1] 0
library(editrules)
## Warning: package 'editrules' was built under R version 4.0.5
## Loading required package: igraph
## Warning: package 'igraph' was built under R version 4.0.5
##
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
##
## Attaching package: 'editrules'
## The following objects are masked from 'package:igraph':
##
## blocks, normalize
## The following objects are masked from 'package:tidyr':
##
## contains, separate
## The following object is masked from 'package:dplyr':
##
## contains
(Rule_Study <- editset(c("Study_Prop >= 0", "Study_Prop <= 100")))
##
## Edit set:
## num1 : 0 <= Study_Prop
## num2 : Study_Prop <= 100
violatedEdits(Rule_Study, db)
## edit
## record num1 num2
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE FALSE
## 6 FALSE FALSE
## 7 FALSE FALSE
## 8 FALSE FALSE
## 9 FALSE FALSE
## 10 FALSE FALSE
## 11 FALSE FALSE
## 12 FALSE FALSE
## 13 FALSE FALSE
## 14 FALSE FALSE
## 15 FALSE FALSE
## 16 FALSE FALSE
## 17 FALSE FALSE
## 18 FALSE FALSE
## 19 FALSE FALSE
## 20 FALSE FALSE
(Rule_Crime <- editset(c("Crime_Prop >= 0", "Crime_Prop <= 100")))
##
## Edit set:
## num1 : 0 <= Crime_Prop
## num2 : Crime_Prop <= 100
violatedEdits(Rule_Study, db)
## edit
## record num1 num2
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE FALSE
## 6 FALSE FALSE
## 7 FALSE FALSE
## 8 FALSE FALSE
## 9 FALSE FALSE
## 10 FALSE FALSE
## 11 FALSE FALSE
## 12 FALSE FALSE
## 13 FALSE FALSE
## 14 FALSE FALSE
## 15 FALSE FALSE
## 16 FALSE FALSE
## 17 FALSE FALSE
## 18 FALSE FALSE
## 19 FALSE FALSE
## 20 FALSE FALSE
The steps undertaken here are to identify missing values, special values and obvious inconsistencies in the dataset db. is.na function is used to determine missing values. This is followed by sum function to determine number of missing values. There was no missing values found.
Sapply function is used to find if there are ny infinite values. No infinite value was found.
The variables study_prop and crime_prop are checked for obvious inconsistencies or errors. Given that they are perecentage figures so the values should be 0 and 100. Both the variables are checked by using edit rules.
# This is the R chunk for the Scan II
#db$Study_Prop %>% boxplot(main="Boxplot of Study_Prop", ylab="Study_Prop", col = "grey")
boxplot(db$Crime_Prop~db$Age_gp,main="Boxplot 1 - Crime Percentage by Age Group", ylab="Crime Percentage", xlab = "Age Group")
boxplot(db$Study_Prop ~db$Age_gp,main="Boxplot 2 - Study Percentage by Age Group", ylab="Study Percentage", xlab = "Age Group")
boxplot(db$Crime_Prop~db$Gender,main="Boxplot 3 - Crime Percentage by Gender", ylab="Crime Percentage", xlab = "Gender")
boxplot(db$Study_Prop~db$Gender,main="Boxplot 4 - Study Percentage by Gender", ylab="Study Percentage", xlab = "Gender")
boxplot(db$Crime_Prop~db$Year,main="Boxplot 5 - Crime Percentage by Year", ylab="Crime Percentage", xlab = "Year")
boxplot(db$Study_Prop~db$Year,main="Boxplot 6 - Study Percentage by Year", ylab="Study Percentage", xlab = "Year")
Provide explanations here.
Based on the boxplots generated for the datasets in consideration, it is safe to deduce that the 15-19 age group had a higher percentage of involvement in crimes(Boxplot 1). Out of these, the number of females involved in crime is lower as compared to the number of males. Alternatively, more females are involved in education as compared to males from the same age groups.
# This is the R chunk for the Transform Section
center_df <-scale(db$Crime_Prop, center = TRUE, scale = FALSE)
center_df
## [,1]
## [1,] -1.5942079
## [2,] -2.1262584
## [3,] 2.9380546
## [4,] 2.9716176
## [5,] -1.6655544
## [6,] -2.1831089
## [7,] 2.6060728
## [8,] 2.4262788
## [9,] -1.7724141
## [10,] -2.3428229
## [11,] 2.2886243
## [12,] 1.9501194
## [13,] -1.8720035
## [14,] -2.3997416
## [15,] 1.7640308
## [16,] 1.3639380
## [17,] -2.0658541
## [18,] -2.5007416
## [19,] 1.3779535
## [20,] 0.8360176
## attr(,"scaled:center")
## [1] 4.085053
Provide explanations here. The scale function was used to apply mean-centering to the column Crime_Prop. After this function was applied, the new centered values for Crime_Prop have been displayed.
2010 National Strategies for Young Australians. Retried October 3, 2021, from Youth Policy website: (https://www.youthpolicy.org/factsheets/country/australia)
Recorded Crime - Offenders (2019-20). Offenders, Australia. Retrieved October 3, 2021, from Australian Bureau of Statistics website: https://www.abs.gov.au/statistics/people/crime-and-justice/recorded-crime-offenders/2019-20#data-download
Education and Work, Australia (2019-20). Table 21 Current study: Persons aged 15-64 years – 2004 to 2020, Excel Data Table. Retrieved October 3, 2021, from Australian Bureau of Statistics website: https://www.abs.gov.au/statistics/people/education/education-and-work-australia/latest-release#data-download