Setup

# Load the necessary packages required to reproduce the report. For example:

library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.5
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.0.5
library(ggplot2) # Useful for creating plots
## Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)  # Useful for data manipulation
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr) # Useful for creating nice tables
## Warning: package 'knitr' was built under R version 4.0.5
library(magrittr)
library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(readxl)
## Warning: package 'readxl' was built under R version 4.0.5
#install.packages("forecast")
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
#install.packages("editrules")
#library(editrules)

Student names, numbers and percentage of contributions

Group information
Student name Student number Percentage of contribution
Karen Amanda Rodrigues s3890896 50%
Kushagra Bhatnagar s3666420 50%



Executive Summary

This assignment is used as a medium to answer a real-life question – What percentage of Australia’s youth is studying and yet is involved in crime, and to gain insight into the impact of education on the crime statistics. According to the 2010 National Strategies for Young Australians (https://www.youthpolicy.org/factsheets/country/australia), youth is defined in general as the age group between 12-24 years. We have selected 2 categories: • Age groups of 15-19 and 20-24 years. • Years between 2016 and 2020.
The data sets chosen were: • Education and work 2020, Table 21 - referred to as Education dataset in the report • 1, Offenders Australia(1) - referred to as Crime dataset in the report Both the datasets were downloaded from the Australian Bureau of Statistics. From the Education and work dataset, the study and total population of the age groups 15-19 and 20-24 were derived. The dataset 1 i.e. Offenders Australia(1) provided the population involved in crime within the two age groups. The two datasets were comprised of heaps of data. The approach adopted is that the datasets are tidied and manipulated individually before they were merged. Upon merger, the final result is presented as the percentage of youth population in the two age groups studying and involved in crime over a period of five years between 2016 and 2020.



Data

Provide explanations here.

Two data sets are obtained from Australian Bureau of Statistic. Five data frames starting from year 2016 to 2020 are created from one Education dataset. The reason for this is that each tab in excel corresponds to a year. The five data frames store study and total population of several age groups for their respective years. One data frame is created from the Crime dataset.

The two datasets are detailed and elaborate, so they need to be trimmed in order to describe the variables. The variable description and merger of the datasets is only possible after tidying and manipulating the datasets. The Tidy and Manipulate sections will contain this information.

In this section the data sets are read using the library function readr and their first fifteen lines and structure are shown.

Since the data frames sourced from Education dataset are ed2016, ed2017, ed2018, ed2019, ed2020. They are of exact structure, so any work done on one dataframe is replicated in the rest of the five datasets. Henceforth, ed2016 dataframe would be used to display results. The dfc dataframe is sourced from ‘Table 5’ tab of excel file comprising Crime dataset.

# Import the data, provide your R codes here.

ed2016 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2016")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2017 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2017")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2018 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2018")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2019 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2019")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
ed2020 <- read_excel("Education and work, 2020, Table 21.xls", sheet = "2020")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
df2 <- read_excel("1. Offenders, Australia (1).xls", sheet = "Table 5")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
head(ed2016, n = 15)
str(ed2016)
## tibble [147 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:147] "Education and Work, Australia, May 2020" "Released at 11:30 am (Canberra time) 11 November 2020" "Table 21 Current study: Persons aged 15–64 years, 2016" NA ...
##  $ ...2                           : chr [1:147] NA NA NA "15–19 years" ...
##  $ ...3                           : chr [1:147] NA NA NA "20–24 years" ...
##  $ ...4                           : chr [1:147] NA NA NA "25–29 years" ...
##  $ ...5                           : chr [1:147] NA NA NA "30–34 years" ...
##  $ ...6                           : chr [1:147] NA NA NA "35–39 years" ...
##  $ ...7                           : chr [1:147] NA NA NA "40–44 years" ...
##  $ ...8                           : chr [1:147] NA NA NA "45–49 years" ...
##  $ ...9                           : chr [1:147] NA NA NA "50–54 years" ...
##  $ ...10                          : chr [1:147] NA NA NA "55–64 years" ...
##  $ ...11                          : chr [1:147] NA NA NA "18–24 years" ...
##  $ ...12                          : chr [1:147] NA NA NA "15–24 years" ...
##  $ ...13                          : chr [1:147] NA NA NA "25–64 years" ...
##  $ ...14                          : chr [1:147] NA NA NA "15–64 years" ...
head(df2,n = 15)
str(df2)
## tibble [64 x 25] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:64] "Recorded Crime – Offenders, 2019–20" "Released at 11.30am (Canberra time) Thurs 11 February 2021" "Table 5 Offenders, Age by sex, 2008–09 to 2019–20" NA ...
##  $ ...2                           : chr [1:64] NA NA NA "Number" ...
##  $ ...3                           : chr [1:64] NA NA NA NA ...
##  $ ...4                           : chr [1:64] NA NA NA NA ...
##  $ ...5                           : chr [1:64] NA NA NA NA ...
##  $ ...6                           : chr [1:64] NA NA NA NA ...
##  $ ...7                           : chr [1:64] NA NA NA NA ...
##  $ ...8                           : chr [1:64] NA NA NA NA ...
##  $ ...9                           : chr [1:64] NA NA NA NA ...
##  $ ...10                          : chr [1:64] NA NA NA NA ...
##  $ ...11                          : chr [1:64] NA NA NA NA ...
##  $ ...12                          : chr [1:64] NA NA NA NA ...
##  $ ...13                          : chr [1:64] NA NA NA NA ...
##  $ ...14                          : chr [1:64] NA NA NA "Offender rate(a)" ...
##  $ ...15                          : chr [1:64] NA NA NA NA ...
##  $ ...16                          : chr [1:64] NA NA NA NA ...
##  $ ...17                          : chr [1:64] NA NA NA NA ...
##  $ ...18                          : chr [1:64] NA NA NA NA ...
##  $ ...19                          : chr [1:64] NA NA NA NA ...
##  $ ...20                          : chr [1:64] NA NA NA NA ...
##  $ ...21                          : chr [1:64] NA NA NA NA ...
##  $ ...22                          : chr [1:64] NA NA NA NA ...
##  $ ...23                          : chr [1:64] NA NA NA NA ...
##  $ ...24                          : chr [1:64] NA NA NA NA ...
##  $ ...25                          : chr [1:64] NA NA NA NA ...



Understand

# This is the R chunk for the Understand Section
ed2016 <-slice(ed2016, c(4,22:23,42:43))
is.na(ed2016)
##      Australian Bureau of Statistics  ...2  ...3  ...4  ...5  ...6  ...7  ...8
## [1,]                            TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,]                           FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##       ...9 ...10 ...11 ...12 ...13 ...14
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is.na(ed2016))
## [1] 1
ed2016[1,1]="Gender"
str(ed2016)
## tibble [5 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:5] "Gender" "Male" "Female" "Male" ...
##  $ ...2                           : chr [1:5] "15–19 years" "623.29999999999995" "597.29999999999995" "753.5" ...
##  $ ...3                           : chr [1:5] "20–24 years" "359" "382.10000000000002" "847.20000000000005" ...
##  $ ...4                           : chr [1:5] "25–29 years" "155.80000000000001" "186.80000000000001" "876.70000000000005" ...
##  $ ...5                           : chr [1:5] "30–34 years" "101.2" "129.40000000000001" "873.5" ...
##  $ ...6                           : chr [1:5] "35–39 years" "59.700000000000003" "101.5" "793.89999999999998" ...
##  $ ...7                           : chr [1:5] "40–44 years" "47.100000000000001" "84.200000000000003" "796.89999999999998" ...
##  $ ...8                           : chr [1:5] "45–49 years" "33.799999999999997" "70.900000000000006" "785.29999999999995" ...
##  $ ...9                           : chr [1:5] "50–54 years" "24.600000000000001" "48.700000000000003" "753.70000000000005" ...
##  $ ...10                          : chr [1:5] "55–64 years" "28.899999999999999" "48.899999999999999" "1361.4000000000001" ...
##  $ ...11                          : chr [1:5] "18–24 years" "561.29999999999995" "579.10000000000002" "1158.2" ...
##  $ ...12                          : chr [1:5] "15–24 years" "981" "979.5" "1601.5999999999999" ...
##  $ ...13                          : chr [1:5] "25–64 years" "452.19999999999999" "672.70000000000005" "6239.8999999999996" ...
##  $ ...14                          : chr [1:5] "15–64 years" "1434" "1652.2" "7838.3000000000002" ...
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2016) <- ed2016[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2016<-ed2016[-c(1,1), ] # deleting first row then. 
Year <- c("2016","2016","2016","2016")
ed2016$Year<-Year
ed2016 <- ed2016 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2016$`15–19 years` <- as.integer(ed2016$`15–19 years`)
ed2016$`20–24 years` <- as.integer(ed2016$`20–24 years`)
ed2016$Year <- as.integer(ed2016$Year)
ed2016
ed2017 <-slice(ed2017, c(4,22:23,42:43))
ed2017[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2017) <- ed2017[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2017<-ed2017[-c(1,1), ] # deleting first row then. 
Year <- c("2017","2017","2017","2017")
ed2017$Year<-Year
ed2017 <- ed2017 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2017$`15–19 years` <- as.integer(ed2017$`15–19 years`)
ed2017$`20–24 years` <- as.integer(ed2017$`20–24 years`)
ed2017$Year <- as.integer(ed2017$Year)


ed2018 <-slice(ed2018, c(4,22:23,42:43))
ed2018[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2018) <- ed2018[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2018<-ed2018[-c(1,1), ] # deleting first row then. 
Year <- c("2018","2018","2018","2018")
ed2018$Year<-Year
ed2018 <- ed2018 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2018$`15–19 years` <- as.integer(ed2018$`15–19 years`)
ed2018$`20–24 years` <- as.integer(ed2018$`20–24 years`)
ed2018$Year <- as.integer(ed2018$Year)


ed2019 <-slice(ed2019, c(4,22:23,42:43))
ed2019[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2019) <- ed2019[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2019<-ed2019[-c(1,1), ] # deleting first row then. 
Year <- c("2019","2019","2019","2019")
ed2019$Year<-Year
ed2019 <- ed2019 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2019$`15–19 years` <- as.integer(ed2019$`15–19 years`)
ed2019$`20–24 years` <- as.integer(ed2019$`20–24 years`)
ed2019$Year <- as.integer(ed2019$Year)


ed2020 <-slice(ed2020, c(4,22:23,42:43))
str(ed2020)
## tibble [5 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Australian Bureau of Statistics: chr [1:5] NA "Male" "Female" "Male" ...
##  $ ...2                           : chr [1:5] "15–19 years" "617.70000000000005" "623.5" "765.5" ...
##  $ ...3                           : chr [1:5] "20–24 years" "365.5" "395.60000000000002" "878.79999999999995" ...
##  $ ...4                           : chr [1:5] "25–29 years" "181.09999999999999" "182.59999999999999" "952.79999999999995" ...
##  $ ...5                           : chr [1:5] "30–34 years" "94.099999999999994" "132.59999999999999" "937.79999999999995" ...
##  $ ...6                           : chr [1:5] "35–39 years" "67.900000000000006" "94.5" "897.29999999999995" ...
##  $ ...7                           : chr [1:5] "40–44 years" "45.899999999999999" "80.400000000000006" "795.29999999999995" ...
##  $ ...8                           : chr [1:5] "45–49 years" "34.100000000000001" "67.5" "818.60000000000002" ...
##  $ ...9                           : chr [1:5] "50–54 years" "21.199999999999999" "51.399999999999999" "753.10000000000002" ...
##  $ ...10                          : chr [1:5] "55–64 years" "25.300000000000001" "46.299999999999997" "1439.7" ...
##  $ ...11                          : chr [1:5] "18–24 years" "553.89999999999998" "602.20000000000005" "1194.2" ...
##  $ ...12                          : chr [1:5] "15–24 years" "985.29999999999995" "1018.9" "1642.7" ...
##  $ ...13                          : chr [1:5] "25–64 years" "466.69999999999999" "656.29999999999995" "6590" ...
##  $ ...14                          : chr [1:5] "15–64 years" "1450.9000000000001" "1675.0999999999999" "8235.3999999999996" ...
ed2020[1,1]="Gender"
Gender <- factor( c("Male", "Female", "Male", "Female"), levels = c("Male", "Female") )
names(ed2020) <- ed2020[1,]# first row as column names
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
ed2020<-ed2020[-c(1,1), ] # deleting first row then. 
Year <- c("2020","2020","2020","2020")
ed2020$Year<-Year
ed2020 <- ed2020 %>%  select('Gender','15–19 years','20–24 years','Year')
ed2020$`15–19 years` <- as.integer(ed2020$`15–19 years`)
ed2020$`20–24 years` <- as.integer(ed2020$`20–24 years`)
ed2020$Year <- as.integer(ed2020$Year)

dfc <- slice(df2, c(5,8:9,22:25))
names(dfc) <- dfc[1,]
## Warning: The `value` argument of `names<-` must be a character vector as of
## tibble 3.0.0.
dfc<-dfc %>% select(1,9,10,11,12,13)
dfc<-dfc[-c(1,4,5), ]
is.na(dfc)
##        Age 2015–16 2016–17 2017–18 2018–19(b) 2019–20
## [1,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE
## [2,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE
## [3,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE
## [4,] FALSE   FALSE   FALSE   FALSE      FALSE   FALSE
sum(is.na(dfc))
## [1] 0
Gender <- factor(c("Male","Male","Female","Female"), levels=c("Male","Female"))
dfc$Gender<-Gender
Status <-c ("Crime","Crime","Crime","Crime")
dfc$Status <- Status
dfc
names(dfc)[1]<-paste("Age_gp")
names(dfc)[2]<-paste("2016")
names(dfc)[3]<-paste("2017")
names(dfc)[4]<-paste("2018")
names(dfc)[5]<-paste("2019")
names(dfc)[6]<-paste("2020")

The size and content of six data sets are trimmed to a manageable size to understand the data. This means that only those rows and columns are kept which are necessary to meet our objective i.e. variables and observations for the selected age groups for five years. Following the surgically obtained data frames, a check for missing values is conducted. The Slice function is used in all the dataframes to remove the irrelevant bits.

Education dataset: For the ed2016, the is.na function was run and it produced just one missing value, which is a blankspace that is used to naming column comprising nominal variables - male and female. Running the str command showed that the study and population figures for different age groups were in character data type. They were converted to integer data type. The reason they were converted to integer type was because a new variable would be created to calculate the percentage. A column comprising year was added to the dataframe. All of the above mentioned steps were repeated for the dataframes ed2017,ed2018,ed2019 and ed2020.

Crime dataset: For df2, running the command is.na yielded no missing values. Gender variable was created with values ‘Male’ and ‘Female’ as factors and levels. A new variable ‘Status’ was created to hold value as ‘Crime’. This status variable is created to differentiate between the study and crime population figures when the dataframes are merged.

The dataframe df2 has years represented in fiscal year format such as 2016-17, compared to ed2016 which has calendar year format 2016. For the data merger, the year needs to be in a consistent format. Therefore, in df2 the year is converted to calendar year format by changing 2016-17 to 2016. This is followed for the subsequent years as well.



Tidy & Manipulate Data I

# This is the R chunk for the Tidy & Manipulate Data I 


ed2016 <- ed2016 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2016)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2016$Status<-Status
ed2016
ed2017 <- ed2017 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2017)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2017$Status<-Status
ed2017
ed2018 <- ed2018 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2018)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2018$Status<-Status
ed2018
ed2019 <- ed2019 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2019)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2019$Status<-Status
ed2019
ed2020 <- ed2020 %>% gather('15–19 years','20–24 years', key = "year", value = "Numbers")
names(ed2020)[3]<-paste("Age_gp")
Status <- c("Studying","Studying","Population","Population","Studying","Studying","Population","Population")
ed2020$Status<-Status
ed2020
dfc <- dfc %>% gather(`2020`, `2019`,`2018`, `2017`, `2016`, key = "Year", value = "Numbers")
dfc<-dfc %>% select(2,4,1,5,3)
dfc$Numbers <- as.numeric(dfc$Numbers)
dfc$Year <- as.integer(dfc$Year)
knitr::kable(dfc)
Gender Year Age_gp Numbers Status
Male 2020 15–19 years 41792 Crime
Male 2020 20–24 years 43207 Crime
Female 2020 15–19 years 14619 Crime
Female 2020 20–24 years 13229 Crime
Male 2019 15–19 years 44687 Crime
Male 2019 20–24 years 48605 Crime
Female 2019 15–19 years 16111 Crime
Female 2019 20–24 years 14342 Crime
Male 2018 15–19 years 48185 Crime
Male 2018 20–24 years 52506 Crime
Female 2018 15–19 years 16651 Crime
Female 2018 20–24 years 14687 Crime
Male 2017 15–19 years 50518 Crime
Male 2017 20–24 years 55737 Crime
Female 2017 15–19 years 17372 Crime
Female 2017 20–24 years 15653 Crime
Male 2016 15–19 years 52884 Crime
Male 2016 20–24 years 59770 Crime
Female 2016 15–19 years 17685 Crime
Female 2016 20–24 years 15925 Crime

The data in ed2016 is not in tidy form because it does not follow the tidy principle - Each variable must have its own column. In the table 15-19 years and 20-24 years are variables, but they are represented as column names. When they are rearranged under the variable name - ‘Age_gp’, it represents a tidier version of the dame data. Gather function is used to achieve this, followed by creating variable name ‘Age_gp’ for the column. An additional variable “Status” is created to differentiate the numbers between studying and population.

The years in dfc from 2016 to 2020 are variables but are currently column names. They are in violation of tidy principle that each variable should have one column. Therefore, the gather function is implemented to bring them in line with the tidy principle.



Tidy & Manipulate Data II

# This is the R chunk for the Tidy & Manipulate Data II 


study <- union(union(union(union(ed2020,ed2019),ed2018),ed2017),ed2016)
Year <- as.integer(study$Year)
study$Numbers <- as.integer(study$Numbers)
study$Numbers <- study$Numbers*1000
knitr::kable(study)
Gender Year Age_gp Numbers Status
Male 2020 15–19 years 617000 Studying
Female 2020 15–19 years 623000 Studying
Male 2020 15–19 years 765000 Population
Female 2020 15–19 years 724000 Population
Male 2020 20–24 years 365000 Studying
Female 2020 20–24 years 395000 Studying
Male 2020 20–24 years 878000 Population
Female 2020 20–24 years 835000 Population
Male 2019 15–19 years 624000 Studying
Female 2019 15–19 years 610000 Studying
Male 2019 15–19 years 764000 Population
Female 2019 15–19 years 728000 Population
Male 2019 20–24 years 395000 Studying
Female 2019 20–24 years 416000 Studying
Male 2019 20–24 years 892000 Population
Female 2019 20–24 years 851000 Population
Male 2018 15–19 years 613000 Studying
Female 2018 15–19 years 617000 Studying
Male 2018 15–19 years 756000 Population
Female 2018 15–19 years 720000 Population
Male 2018 20–24 years 390000 Studying
Female 2018 20–24 years 394000 Studying
Male 2018 20–24 years 870000 Population
Female 2018 20–24 years 843000 Population
Male 2017 15–19 years 620000 Studying
Female 2017 15–19 years 613000 Studying
Male 2017 15–19 years 755000 Population
Female 2017 15–19 years 718000 Population
Male 2017 20–24 years 362000 Studying
Female 2017 20–24 years 384000 Studying
Male 2017 20–24 years 856000 Population
Female 2017 20–24 years 823000 Population
Male 2016 15–19 years 623000 Studying
Female 2016 15–19 years 597000 Studying
Male 2016 15–19 years 753000 Population
Female 2016 15–19 years 710000 Population
Male 2016 20–24 years 359000 Studying
Female 2016 20–24 years 382000 Studying
Male 2016 20–24 years 847000 Population
Female 2016 20–24 years 813000 Population
db <- union(study,dfc)
db <- db %>% spread(key = "Status", value = "Numbers")

Crime_Prop <- (db$Crime/db$Population)*100
db$Crime_Prop <- Crime_Prop
round(db$Crime_Prop,digits=2)
##  [1] 2.49 1.96 2.42 1.90 2.31 1.74 2.21 1.69 2.02 1.58 7.02 7.06 6.69 6.51 6.37
## [16] 6.04 5.85 5.45 5.46 4.92
Study_Prop <- (db$Studying/db$Population)*100
db$Study_Prop <- Study_Prop
round(db$Study_Prop,digits=2)
##  [1] 84.08 46.99 85.38 46.66 85.69 46.74 83.79 48.88 86.05 47.31 82.74 42.38
## [13] 82.12 42.29 81.08 44.83 81.68 44.28 80.65 41.57
db <-db %>% select(2,1,3,7,8)
db <- db %>% arrange(Year, Gender)
knitr::kable(db)
Year Gender Age_gp Crime_Prop Study_Prop
2016 Female 15–19 years 2.490845 84.08451
2016 Female 20–24 years 1.958795 46.98647
2016 Male 15–19 years 7.023108 82.73572
2016 Male 20–24 years 7.056671 42.38489
2017 Female 15–19 years 2.419499 85.37604
2017 Female 20–24 years 1.901944 46.65857
2017 Male 15–19 years 6.691126 82.11921
2017 Male 20–24 years 6.511332 42.28972
2018 Female 15–19 years 2.312639 85.69444
2018 Female 20–24 years 1.742230 46.73784
2018 Male 15–19 years 6.373677 81.08466
2018 Male 20–24 years 6.035172 44.82759
2019 Female 15–19 years 2.213049 83.79121
2019 Female 20–24 years 1.685311 48.88367
2019 Male 15–19 years 5.849084 81.67539
2019 Male 20–24 years 5.448991 44.28251
2020 Female 15–19 years 2.019199 86.04972
2020 Female 20–24 years 1.584311 47.30539
2020 Male 15–19 years 5.463006 80.65359
2020 Male 20–24 years 4.921071 41.57175

The datasets are merged in this step. A dataframe ‘study’ is created where ed2016, ed2017,2018,ed2019, ed2020 are unioned. As the name suggests, this dataframe has study and population of two age groups from 2016 to 2020. The ‘Year’ and ‘Number’ variables are converted into integer variables. The reason being ‘Number’ variable needs to be multiplied by 1000 to get the absolute figures.

A dataframe db is the final stage of achieving a tidy data set. The dataframe db is created which unions study and dfc dataframes. The dataframe db is not a tidy set, because are looking for gender based age groups’study and crime percentage. Therefore, Crime and Study percentage become variables and the values as observation in each row. Hence, spread function is used to achieve this. Then, Crime and study proportion representing variables are created from the existing variables. The dataframe db is further trimmed by using the select function. The final dataframe db represents five variables - Gender, Year, Age_gp, Crime_Prop, Study_Prop.



Scan I

# This is the R chunk for the Scan I
is.na(db)
##        Year Gender Age_gp Crime_Prop Study_Prop
##  [1,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [2,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [3,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [4,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [5,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [6,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [7,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [8,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [9,] FALSE  FALSE  FALSE      FALSE      FALSE
## [10,] FALSE  FALSE  FALSE      FALSE      FALSE
## [11,] FALSE  FALSE  FALSE      FALSE      FALSE
## [12,] FALSE  FALSE  FALSE      FALSE      FALSE
## [13,] FALSE  FALSE  FALSE      FALSE      FALSE
## [14,] FALSE  FALSE  FALSE      FALSE      FALSE
## [15,] FALSE  FALSE  FALSE      FALSE      FALSE
## [16,] FALSE  FALSE  FALSE      FALSE      FALSE
## [17,] FALSE  FALSE  FALSE      FALSE      FALSE
## [18,] FALSE  FALSE  FALSE      FALSE      FALSE
## [19,] FALSE  FALSE  FALSE      FALSE      FALSE
## [20,] FALSE  FALSE  FALSE      FALSE      FALSE
sum(is.na(db))
## [1] 0
sapply(db, is.infinite)
##        Year Gender Age_gp Crime_Prop Study_Prop
##  [1,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [2,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [3,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [4,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [5,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [6,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [7,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [8,] FALSE  FALSE  FALSE      FALSE      FALSE
##  [9,] FALSE  FALSE  FALSE      FALSE      FALSE
## [10,] FALSE  FALSE  FALSE      FALSE      FALSE
## [11,] FALSE  FALSE  FALSE      FALSE      FALSE
## [12,] FALSE  FALSE  FALSE      FALSE      FALSE
## [13,] FALSE  FALSE  FALSE      FALSE      FALSE
## [14,] FALSE  FALSE  FALSE      FALSE      FALSE
## [15,] FALSE  FALSE  FALSE      FALSE      FALSE
## [16,] FALSE  FALSE  FALSE      FALSE      FALSE
## [17,] FALSE  FALSE  FALSE      FALSE      FALSE
## [18,] FALSE  FALSE  FALSE      FALSE      FALSE
## [19,] FALSE  FALSE  FALSE      FALSE      FALSE
## [20,] FALSE  FALSE  FALSE      FALSE      FALSE
sum(sapply(db, is.infinite))
## [1] 0
library(editrules)
## Warning: package 'editrules' was built under R version 4.0.5
## Loading required package: igraph
## Warning: package 'igraph' was built under R version 4.0.5
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
## 
## Attaching package: 'editrules'
## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize
## The following objects are masked from 'package:tidyr':
## 
##     contains, separate
## The following object is masked from 'package:dplyr':
## 
##     contains
(Rule_Study <- editset(c("Study_Prop >= 0", "Study_Prop <= 100")))
## 
## Edit set:
## num1 : 0 <= Study_Prop
## num2 : Study_Prop <= 100
violatedEdits(Rule_Study, db)
##       edit
## record  num1  num2
##     1  FALSE FALSE
##     2  FALSE FALSE
##     3  FALSE FALSE
##     4  FALSE FALSE
##     5  FALSE FALSE
##     6  FALSE FALSE
##     7  FALSE FALSE
##     8  FALSE FALSE
##     9  FALSE FALSE
##     10 FALSE FALSE
##     11 FALSE FALSE
##     12 FALSE FALSE
##     13 FALSE FALSE
##     14 FALSE FALSE
##     15 FALSE FALSE
##     16 FALSE FALSE
##     17 FALSE FALSE
##     18 FALSE FALSE
##     19 FALSE FALSE
##     20 FALSE FALSE
(Rule_Crime <- editset(c("Crime_Prop >= 0", "Crime_Prop <= 100")))
## 
## Edit set:
## num1 : 0 <= Crime_Prop
## num2 : Crime_Prop <= 100
violatedEdits(Rule_Study, db)
##       edit
## record  num1  num2
##     1  FALSE FALSE
##     2  FALSE FALSE
##     3  FALSE FALSE
##     4  FALSE FALSE
##     5  FALSE FALSE
##     6  FALSE FALSE
##     7  FALSE FALSE
##     8  FALSE FALSE
##     9  FALSE FALSE
##     10 FALSE FALSE
##     11 FALSE FALSE
##     12 FALSE FALSE
##     13 FALSE FALSE
##     14 FALSE FALSE
##     15 FALSE FALSE
##     16 FALSE FALSE
##     17 FALSE FALSE
##     18 FALSE FALSE
##     19 FALSE FALSE
##     20 FALSE FALSE

The steps undertaken here are to identify missing values, special values and obvious inconsistencies in the dataset db. is.na function is used to determine missing values. This is followed by sum function to determine number of missing values. There was no missing values found.

Sapply function is used to find if there are ny infinite values. No infinite value was found.

The variables study_prop and crime_prop are checked for obvious inconsistencies or errors. Given that they are perecentage figures so the values should be 0 and 100. Both the variables are checked by using edit rules.



Scan II

# This is the R chunk for the Scan II
#db$Study_Prop %>%  boxplot(main="Boxplot of Study_Prop", ylab="Study_Prop", col = "grey")
boxplot(db$Crime_Prop~db$Age_gp,main="Boxplot 1 - Crime Percentage by Age Group", ylab="Crime Percentage", xlab = "Age Group")

boxplot(db$Study_Prop ~db$Age_gp,main="Boxplot 2 - Study Percentage by Age Group", ylab="Study Percentage", xlab = "Age Group")

boxplot(db$Crime_Prop~db$Gender,main="Boxplot 3 - Crime Percentage by Gender", ylab="Crime Percentage", xlab = "Gender")

boxplot(db$Study_Prop~db$Gender,main="Boxplot 4 - Study Percentage by Gender", ylab="Study Percentage", xlab = "Gender")

boxplot(db$Crime_Prop~db$Year,main="Boxplot 5 - Crime Percentage by Year", ylab="Crime Percentage", xlab = "Year")

boxplot(db$Study_Prop~db$Year,main="Boxplot 6 - Study Percentage by Year", ylab="Study Percentage", xlab = "Year")

Provide explanations here.

Based on the boxplots generated for the datasets in consideration, it is safe to deduce that the 15-19 age group had a higher percentage of involvement in crimes(Boxplot 1). Out of these, the number of females involved in crime is lower as compared to the number of males. Alternatively, more females are involved in education as compared to males from the same age groups.

Transform

# This is the R chunk for the Transform Section
center_df <-scale(db$Crime_Prop, center = TRUE, scale = FALSE)
center_df
##             [,1]
##  [1,] -1.5942079
##  [2,] -2.1262584
##  [3,]  2.9380546
##  [4,]  2.9716176
##  [5,] -1.6655544
##  [6,] -2.1831089
##  [7,]  2.6060728
##  [8,]  2.4262788
##  [9,] -1.7724141
## [10,] -2.3428229
## [11,]  2.2886243
## [12,]  1.9501194
## [13,] -1.8720035
## [14,] -2.3997416
## [15,]  1.7640308
## [16,]  1.3639380
## [17,] -2.0658541
## [18,] -2.5007416
## [19,]  1.3779535
## [20,]  0.8360176
## attr(,"scaled:center")
## [1] 4.085053

Provide explanations here. The scale function was used to apply mean-centering to the column Crime_Prop. After this function was applied, the new centered values for Crime_Prop have been displayed.

References