# This is the R chunk for the required packages
library(dplyr)
library(tidyr)
library(readxl)
library(outliers)
library(forecast)
The dataset used in this assignment was created by merging two datasets that were originally untidy: in both, the column headers were values rather than variable names. Before tidying the data, some data manipulation techniques were applied, such as the removal of certain rows and columns. The first dataset holds the total number of victims for selected offences in Australia, so only the victim counts were kept and other information such as the victimisation rate was deleted. The second dataset holds the total number of victims for selected offences categorised by states and territories; since we are only interested in the number of victims in NSW, all other data was eliminated. Both datasets were tidied using gather() and merged using left_join() to produce the final dataset required for this assignment. A new variable was also created from existing variables using mutate(). The dataset was then scanned for missing values, obvious errors and outliers; the outliers were handled using an appropriate technique, and a transformation was applied to two numeric variables to change their scale and standardise the values for easier interpretation.
Two datasets are used to create the dataset for this assignment.

Dataset 1: Book_2. Source: https://www.abs.gov.au/statistics/people/crime-and-justice/recorded-crime-victims-australia/latest-release. File name: Victims of Crime, Australia. This dataset contains the number of victims in Australia for recorded offences from 2010 to 2019.

Dataset 2: Book_3. Source: https://www.abs.gov.au/statistics/people/crime-and-justice/recorded-crime-victims-australia/latest-release#data-download. File name: Victims of Crime, Selected offences, states and territories. This dataset contains the number of victims for selected offences by states and territories from 2010 to 2019.
# This is the R chunk for the Data Section
# Importing the first dataset
Book_2 <- read_excel("Book2.xlsx")
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ...
head(Book_2)
# Importing the second dataset
Book_3 <- read_excel("Book3.xlsx")
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ...
head(Book_3)
Both datasets, Book_2 and Book_3, are messy. Before tidying the data, we clean it by deleting unwanted rows and columns. In Book_2 we only need the number of victims for the selected offences, so we eliminate the victimisation rate from the dataset. We delete the unwanted rows and rename the column headers with rename(). The dataset is messy because the column headers are values, not variable names, so we tidy the data using gather().

In Book_3 we only need the number of victims for selected offences in NSW. Therefore we delete the data for the remaining states and territories and keep only the data for NSW. We rename the column headers using rename(). Again, the data is messy because the column headers are values and not variable names, so we tidy it using gather().

After tidying the data, we join the two datasets using left_join() to create our assignment data.
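As a small illustration of the reshape that gather() performs (a toy table made up for this explanation, not the assignment data): each year column becomes a row, with the old header stored in the key column and the cell contents in the value column.
# Toy wide table: the year headers are values, not variable names
toy <- tibble(Offence = c("A", "B"), `2010` = c(1, 2), `2011` = c(3, 4))
# gather() stacks the year columns into a Year/Victims pair of columns
toy %>% gather('2010', '2011', key = "Year", value = "Victims")
left_join() then matches rows of two such long tables on their shared key columns, here Principal Offence and Year.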
# This is the R chunk for the Tidy & Manipulate Data I
#Deleting unwanted rows
Book_2 <- Book_2[-c(1,2,3,4,5),]
#Renaming columns
Book_2 <- Book_2 %>% rename('Principal Offence' = 'Australian Bureau of Statistics', '2010' = '...2',
'2011' = '...3', '2012' = '...4', '2013' = '...5',
'2014' = '...6', '2015' = '...7', '2016' = '...8',
'2017' = '...9', '2018' = '...10', '2019' = '...11')
# Deleting unwanted rows
Book_2 <- Book_2[-c(5,7,9,13,15,19,21,23:34),]
# Tidying the data
Book_2 <- Book_2 %>% gather('2010','2011','2012','2013','2014','2015','2016','2017','2018','2019',
key = "Year", value = "Victims_in_Australia")
head(Book_2)
#For dataset Book_3
# Deleting unwanted rows
Book_3 <- Book_3[-c(1,2,3,4,5),]
# Deleting unwanted columns (no comma inside [ ], so negative indices drop columns)
Book_3 <- Book_3[-c(12:21)]
Book_3 <- Book_3 %>% rename('Principal Offence' = 'Australian Bureau of Statistics', '2010' = '...2',
'2011' = '...3', '2012' = '...4', '2013' = '...5',
'2014' = '...6', '2015' = '...7', '2016' = '...8',
'2017' = '...9', '2018' = '...10', '2019' = '...11')
Book_3 <- Book_3[-c(1,6,8,10,12,16,18,22,24,26:203),]
head(Book_3)
Book_3 <- Book_3 %>% gather('2010','2011','2012','2013','2014','2015','2016','2017','2018','2019',
key = "Year", value = "Victims_in_NSW")
# Joining the datasets on Principal Offence and Year using left_join()
Victim <- left_join(Book_2,Book_3, by = c("Principal Offence","Year"))
head(Victim)
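One quick way to sanity-check the join (a sketch, not required by the brief): anti_join() returns the rows of Book_2 whose Principal Offence/Year key has no match in Book_3, so an empty result confirms that every offence-year pair was matched.
# Rows of Book_2 with no matching key in Book_3 (we expect zero rows)
anti_join(Book_2, Book_3, by = c("Principal Offence", "Year"))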
Variable description:
1. Principal Offence: the type of offence recorded.
2. Year: the year in which the offence was recorded.
3. Victims_in_Australia: the total number of victims in Australia.
4. Victims_in_NSW: the total number of victims in NSW.
str() is used to inspect the dataset. All four variables are currently of type character; two of them contain numeric values, so their data types are changed to numeric using as.numeric().
# This is the R chunk for the Understanding
str(Victim)
tibble [150 × 4] (S3: tbl_df/tbl/data.frame)
$ Principal Offence : chr [1:150] "Homicide and related offences" "Murder" "Attempted murder" "Manslaughter" ...
$ Year : chr [1:150] "2010" "2010" "2010" "2010" ...
$ Victims_in_Australia: chr [1:150] "460" "233" "204" "29" ...
$ Victims_in_NSW : chr [1:150] "133" "73" "45" "11" ...
# Converting the victim counts from character to numeric
Victim$Victims_in_Australia <- as.numeric(Victim$Victims_in_Australia)
Victim$Victims_in_NSW <- as.numeric(Victim$Victims_in_NSW)
summary(Victim)
Principal Offence Year Victims_in_Australia Victims_in_NSW
Length:150 Length:150 Min. : 23.0 Min. : 0.0
Class :character Class :character 1st Qu.: 444.2 1st Qu.: 108.5
Mode :character Mode :character Median : 6026.5 Median : 2138.5
Mean : 66217.9 Mean : 17730.5
3rd Qu.: 59994.8 3rd Qu.: 16421.0
Max. :569404.0 Max. :154335.0
A new variable called NSW_percent is created using mutate(); it indicates the percentage of Australian victims who are from NSW.
# This is the R chunk for the Tidy & Manipulate Data II
# Percentage of Australian victims who are from NSW
Victim <- mutate(Victim, NSW_percent = Victims_in_NSW / Victims_in_Australia * 100)
head(Victim)
The dataset is checked for missing and special values by applying a small helper function to each column with sapply(). No such values are found in the dataset.
# This is the R chunk for the Scan I
# Count infinite, NaN and NA entries in each numeric column;
# non-numeric columns return NULL and are effectively skipped
is.special <- function(x){
if (is.numeric(x)) sum(is.infinite(x) | is.nan(x) | is.na(x))}
sapply(Victim, is.special)
$`Principal Offence`
NULL
$Year
NULL
$Victims_in_Australia
[1] 0
$Victims_in_NSW
[1] 0
$NSW_percent
[1] 0
The numeric variables are scanned for outliers using Tukey's method of outlier detection, which flags values lying more than 1.5 × IQR beyond the first or third quartile. A few outliers are found in all three variables; they are handled using the capping method.
# This is the R chunk for the Scan II
Victim$Victims_in_Australia %>% boxplot(main="Box Plot of Victims", ylab="Victims in Australia", col = "grey")
Victim$Victims_in_NSW %>% boxplot(main="Box Plot of Victims", ylab="Victims in NSW", col = "grey")
Victim$NSW_percent %>% boxplot(main="Box Plot of Victims", ylab=" % Victims in NSW", col = "grey")
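# As a numeric check on what the boxplots show, base R's boxplot.stats()
# (which uses the same 1.5 * IQR rule) lists the points beyond the Tukey fences:
length(boxplot.stats(Victim$Victims_in_Australia)$out)
length(boxplot.stats(Victim$Victims_in_NSW)$out)
length(boxplot.stats(Victim$NSW_percent)$out)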
# Capping: values below the lower Tukey fence are replaced with the 5th
# percentile, values above the upper fence with the 95th percentile
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
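# A quick sanity check of cap() on a made-up vector (illustrative only):
# 1000 lies far above the upper fence, so it is replaced with the vector's
# 95th percentile, while the well-behaved values pass through unchanged
cap(c(1, 2, 3, 4, 5, 1000))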
# Keeping only the three numeric variables for outlier treatment
Victim_sub <- select(Victim, Victims_in_Australia, Victims_in_NSW, NSW_percent)
summary(Victim_sub)
Victims_in_Australia Victims_in_NSW NSW_percent
Min. : 23.0 Min. : 0.0 Min. : 0.00
1st Qu.: 444.2 1st Qu.: 108.5 1st Qu.:22.99
Median : 6026.5 Median : 2138.5 Median :27.87
Mean : 66217.9 Mean : 17730.5 Mean :29.41
3rd Qu.: 59994.8 3rd Qu.: 16421.0 3rd Qu.:35.07
Max. :569404.0 Max. :154335.0 Max. :60.50
# Applying the capping function to each column; sapply() returns a matrix
Victim_capped <- sapply(Victim_sub, FUN = cap)
summary(Victim_capped)
Victims_in_Australia Victims_in_NSW NSW_percent
Min. : 23.0 Min. : 0.0 Min. : 8.38
1st Qu.: 444.2 1st Qu.: 108.5 1st Qu.:22.99
Median : 6026.5 Median : 2138.5 Median :27.87
Mean : 84712.3 Mean : 21078.2 Mean :29.44
3rd Qu.: 59994.8 3rd Qu.: 16421.0 3rd Qu.:35.07
Max. :489865.3 Max. :133325.4 Max. :51.17
A transformation is applied in the chunk below to change the scale of the variables and standardise the values for easier interpretation. Since there was a huge variation in the values of Victims_in_Australia and Victims_in_NSW, several transformation techniques were tried and the one with the best result was selected for each variable.

Log transformation compresses the high values and spreads out the low values by expressing each value as an order of magnitude. The Box-Cox transformation is a power transformation that brings non-normal data closer to a normal distribution; it is very effective at turning skewed distributions into roughly symmetric ones.
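To see the compression concretely, here is a tiny arithmetic illustration (made-up numbers, not the assignment data): values spanning four orders of magnitude collapse onto a small, evenly spaced scale.
log10(c(10, 1000, 100000))
[1] 1 3 5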
# This is the R chunk for the Transform Section
# Histogram before applying transformation on Victims_in_Australia.
hist(Victim_sub$Victims_in_Australia)
# Histogram after applying BoxCox transformation.
box_cox_victim <- BoxCox(Victim_sub$Victims_in_Australia, lambda = "auto")
hist(box_cox_victim)
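# What BoxCox() computes, as a sketch: for lambda != 0 the transform is
# roughly (x^lambda - 1) / lambda, and log(x) when lambda = 0. With
# lambda = "auto" the value is chosen automatically; a comparable estimate
# can be inspected directly with BoxCox.lambda() from the forecast package:
BoxCox.lambda(Victim_sub$Victims_in_Australia)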
# Histogram before applying transformation on Victims_in_NSW.
hist(Victim_sub$Victims_in_NSW)
# Histogram after applying the log10 transformation.
# Victims_in_NSW contains zeros (its minimum in the summary above is 0) and
# log10(0) is -Inf, so 1 is added first to keep every value finite for hist()
log_victim_NSW <- log10(Victim_sub$Victims_in_NSW + 1)
hist(log_victim_NSW)