Setup
# Load the necessary packages required to reproduce the report. For example:
library(kableExtra)
library(magrittr)
library(readr)
library(readxl)
library(dplyr)
Student names, numbers and percentage of
contributions
Group information
| Student name |
Student number |
Percentage of contribution |
| Brayden Smith |
S3717835 |
100% |
Executive Summary
After both data sets were imported, due to the way that the second
data set “lga_data” was formatted, one of the columns had to be renamed
before I could merge them, as there was no mutual variable to join them
by.
- The formatting of this data set continued to be problematic with all
of the usable columns needing to be renamed as their headers were
imported as cells in the data frame.
- After this, several conversions were made to change the variable
types of some variables.
- I then got rid of any unnecessary columns as the formatting of the
data set made it quite untidy and messy and was very limiting in its
workable capabilities
- We also then created a new variable describing the rate of crimes by
area in kilometers squared
- We then got rid of any N/A or missing values, followed by replacing
any outliers with their respective means
- Lastly we transformed one of the variables in an attempt to
normalize the data
Data
The first data set was downloaded from the Crime Statistics Agency :
https://www.crimestatistics.vic.gov.au/crime-statistics/latest-victorian-crime-data/download-data
The data set was originally titled “Data Tables LGA Recorded Offences
Year Ending June 2023”. In this very descriptive report, the details of
all recorded criminal offences are kept and seperated by many
categories, but mainly focused around the regional division by Local
Government Area (LGA)
The second data set was downloaded from the Australian Bureau of
Statistics: https://www.abs.gov.au/statistics/people/population/regional-population/latest-release
The data set was orginally titled “Population estimates and components
by LGA 2021 to 2022 - Revised. This data set describes different
demographic features of all Local Government Areas across Australia,
however we only used”Table 2” as that was the table used to described
Victorian LGAs.
# Import the data, provide your R codes here.
crime_data<- read_excel("Data_Tables_LGA_Recorded_Offences_Year_Ending_June_2023.xlsx",sheet = "Table 01")
lga_data<-read_excel("32180DS0002_2021-22r.xlsx",sheet = "Table 2")
lga_data<-lga_data %>%
rename("Local Government Area" = one_of("...2"))
all_data<-left_join(crime_data,lga_data, by = "Local Government Area")
head(all_data)
Understand
# This is the R chunk for the Understand Section
all_data<- all_data %>%
rename(
"2021 ERP" = "...3",
"2022 ERP" = "...4",
"ERP Change" = "...6",
"ERP Change Ratio" = "...7",
"Natural Increase" = "...9",
"Internal Migration" = "...10",
"External Migration" = "...11",
"Area (km2)" = "...13",
"Persons per km2" = "...14"
)
# str(all_data)
# Here there are some character variables that we should change to number
all_data$`2021 ERP`<-as.numeric(all_data$`2021 ERP`)
all_data$`2022 ERP`<-as.numeric(all_data$`2022 ERP`)
all_data$`ERP Change`<-as.numeric(all_data$`ERP Change`)
all_data$`ERP Change Ratio`<-as.numeric(all_data$`ERP Change Ratio`)
all_data$`Natural Increase`<-as.numeric(all_data$`Natural Increase`)
all_data$`Internal Migration`<-as.numeric(all_data$`Internal Migration`)
all_data$`External Migration`<-as.numeric(all_data$`External Migration`)
all_data$`Area (km2)`<-as.numeric(all_data$`Area (km2)`)
all_data$`Persons per km2`<-as.numeric(all_data$`Persons per km2`)
# We will also convert Police Region to factor variable
all_data$`Police Region`<-as.factor(all_data$`Police Region`)
str(all_data)
tibble [870 × 19] (S3: tbl_df/tbl/data.frame)
$ Year : num [1:870] 2023 2023 2023 2023 2023 ...
$ Year ending : chr [1:870] "June" "June" "June" "June" ...
$ Police Region : Factor w/ 6 levels "1 North West Metro",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Local Government Area : chr [1:870] "Banyule" "Brimbank" "Darebin" "Hobsons Bay" ...
$ Offence Count : num [1:870] 7862 17122 13244 5647 18257 ...
$ Rate per 100,000 population : num [1:870] 6112 8732 8639 6079 7049 ...
$ Australian Bureau of Statistics: chr [1:870] "20660" "21180" "21890" "23110" ...
$ 2021 ERP : num [1:870] 127370 196631 150296 92267 246920 ...
$ 2022 ERP : num [1:870] 127348 193256 150483 91803 252987 ...
$ ...5 : logi [1:870] NA NA NA NA NA NA ...
$ ERP Change : num [1:870] -22 -3375 187 -464 6067 ...
$ ERP Change Ratio : num [1:870] 0 -1.7 0.1 -0.5 2.5 1.3 4.5 6.5 NA 0.1 ...
$ ...8 : logi [1:870] NA NA NA NA NA NA ...
$ Natural Increase : num [1:870] 435 784 556 480 2688 ...
$ Internal Migration : num [1:870] -1446 -6400 -2536 -1750 971 ...
$ External Migration : num [1:870] 989 2241 2167 806 2408 ...
$ ...12 : logi [1:870] NA NA NA NA NA NA ...
$ Area (km2) : num [1:870] 62.5 123.4 53.5 64.2 503.8 ...
$ Persons per km2 : num [1:870] 2036 1566 2814 1429 502 ...
After initially running the str() function we could see that almost
all of the variables had been read in as character variables. However
most of them were numeric which were changed using the as.numeric()
function as well as adjusting the “Police Region” to be a Factor
Variable. There were also several empty columns that were classified as
logical variables however these will all be removed in the next section
as they are all containing NA values.
Tidy & Manipulate Data I
# This is the R chunk for the Tidy & Manipulate Data I
# This data also represents crime from all years between 2014-2023, however we are only interested in 2021/2022.
# Since we have columns for 2021 data we will delete all rows except for those containing 2022 data
all_data<-all_data[all_data$Year == "2022",]
# All of these recordings were taken in June as well so we will eliminate the month column as it is no longer needed
all_data<- all_data%>%
select(-"Year ending",-"Australian Bureau of Statistics",-"...5",-"...8",-"...12")
# We will also delete any row that has "Total" as it does not represent a single Local Government Area
all_data<-all_data[!(all_data$`Local Government Area` %in% "Total"),]
Both of these datasets were initially quite messy when imported. To
begin with the first data set had a different row for each yearly
observation of the same data set where they should have been stored in
columns. Since we were only looking at 2021 and 2022 years which were
already stored in their own columns in the second data set, we deleted
all rows except for 2022. We also deleted many columns that were
designed to be line breaks in between the original data sheet in excel.
We then deleted the “Year ending” variable as every year had ended in
June so it was redundant to have that value in there. There were also
some rows that had “Total” which grouped the data by descriptive
statistics but was not needed and created more missing values in
adjacent columns so these were taken out as well.
Tidy & Manipulate Data II
# This is the R chunk for the Tidy & Manipulate Data II
# Here we will create a new column for the offence rate by area
all_data$"Offence Rate per KM2" <- numeric(nrow(all_data))
for (i in 1:nrow(all_data)){
all_data$`Offence Rate per KM2`[i]<-all_data$`Offence Count`[i]/all_data$`Area (km2)`[i]
}
Here to create a new variable we decided to look at the offence rate
by Area in squared kilometeres. I think that this gives valuable insight
when you look at larger regions having higher crime per citizen but
lower crime per km2 as they covered a lot of land that is not settled or
un-utilised.
Scan I
# This is the R chunk for the Scan I
empty_rows_df<-data.frame()
na_rows<-list()
for (i in 1:nrow(all_data)){
if (any(is.na(all_data[i,]))){
empty_rows_df<-rbind(empty_rows_df,all_data[i,])
na_rows<-c(na_rows,i)
}
}
print(empty_rows_df)
# Due to all of these rows containing unique data that is location specific, we should not substitute with other values
# As such we will delete each of these rows
all_data<-na.omit(all_data)
To identify that there are N/A values we search through and each row
to find rows that contain these NA or missing values. Since they are not
being used for calculations that we durastically affect any further
calculations, I have opted to omit these from the data set using
na.omit() function.
Scan II
# This is the R chunk for the Scan II
# Since none of these variables will alter later calculations durastically, we will imput the outliers with the mean value of the column
for (col_name in names(all_data)) {
if (col_name != "Local Government Area" && is.numeric(all_data[[col_name]])) {
col_data <- all_data[[col_name]]
col_mean <- mean(col_data)
col_sd <- sd(col_data)
threshold <- 3
outliers <- abs((col_data - col_mean) / col_sd) > threshold
col_data[outliers] <- col_mean
all_data[[col_name]] <- col_data
}
}
To identify outliers, I ran a for loop that scans through each column
and calculates the mean and standard deviation of each column. We then
calculate a z score and see if the absolute value of the z score is
greater than the outlier by checking it against a threshold of 3. The
most efficient way of dealing with outliers in this data set that has
very minimal instances of outliers and is not being used for intense
calculations, is to use the mean of each column to replace the outlier.
This is all stored in a smaller column and replaces the original column
in the data set at the end of the loop.
Link to Presentation
# This is the R chunk to provide the direct link to your video presentation recorded in Studio
# Please see below link to video
# https://rmit-arc.instructuremedia.com/embed/68dfbb5b-455d-4be0-9df9-6e017a85a5eb
---
title: "Data Wrangling (Data Preprocessing)"
author: "Brayden Smith s3717835" 
subtitle: Practical assessment 2
date: "15/10/2023"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---

## **Setup**

```{r}

# Load the necessary packages required to reproduce the report. For example:
library(kableExtra)
library(magrittr)
library(readr)
library(readxl)
library(dplyr)

```

## **Student names, numbers and percentage of contributions**
```{r, echo=FALSE}

# Add your names, numbers and percentage of your contribution here.
na<- c("Brayden Smith")
no<- c(" S3717835")
pc<- c("100%")

s<- data.frame(cbind(na,no,pc))
colnames(s)<- c("Student name", "Student number", "Percentage of contribution")

s %>% kbl(caption = "Group information") %>%
  kable_classic(full_width = F, html_font = "Cambria")

```



## **Executive Summary**
After both data sets were imported, due to the way that the second data set "lga_data" was formatted, one of the columns had to be renamed before I could merge them, as there was no mutual variable to join them by. 

* The formatting of this data set continued to be problematic with all of the usable columns needing to be renamed as their headers were imported as cells in the data frame.
* After this, several conversions were made to change the variable types of some variables.
* I then got rid of any unnecessary columns as the formatting of the data set made it quite untidy and messy and was very limiting in its workable capabilities
* We also then created a new variable describing the rate of crimes by area in kilometers squared
* We then got rid of any N/A or missing values, followed by replacing any outliers with their respective means
* Lastly we transformed one of the variables in an attempt to normalize the data
<br>
<br>

## **Data**

The first data set was downloaded from the Crime Statistics Agency : https://www.crimestatistics.vic.gov.au/crime-statistics/latest-victorian-crime-data/download-data
The data set was originally titled "Data Tables LGA Recorded Offences Year Ending June 2023". In this very descriptive report, the details of all recorded criminal offences are kept and seperated by many categories, but mainly focused around the regional division by Local Government Area (LGA)

The second data set was downloaded from the Australian Bureau of Statistics:
https://www.abs.gov.au/statistics/people/population/regional-population/latest-release 
The data set was orginally titled "Population estimates and components by LGA 2021 to 2022 - Revised. This data set describes different demographic features of all Local Government Areas across Australia, however we only used "Table 2" as that was the table used to described Victorian LGAs.


```{r, results=FALSE}

# Import the data, provide your R codes here.
crime_data<- read_excel("Data_Tables_LGA_Recorded_Offences_Year_Ending_June_2023.xlsx",sheet = "Table 01")
lga_data<-read_excel("32180DS0002_2021-22r.xlsx",sheet = "Table 2")
lga_data<-lga_data %>%
  rename("Local Government Area" = one_of("...2"))
all_data<-left_join(crime_data,lga_data, by = "Local Government Area")

head(all_data)
```

<br>
<br>

## **Understand** 

```{r}
# This is the R chunk for the Understand Section
all_data<- all_data %>%
  rename(
    "2021 ERP" = "...3",
    "2022 ERP" = "...4",
    "ERP Change" = "...6",
    "ERP Change Ratio" = "...7",
    "Natural Increase" = "...9",
    "Internal Migration" = "...10",
    "External Migration" = "...11",
    "Area (km2)" = "...13",
    "Persons per km2" = "...14"
  )
# str(all_data)

# Here there are some character variables that we should change to number
all_data$`2021 ERP`<-as.numeric(all_data$`2021 ERP`)
all_data$`2022 ERP`<-as.numeric(all_data$`2022 ERP`)
all_data$`ERP Change`<-as.numeric(all_data$`ERP Change`)
all_data$`ERP Change Ratio`<-as.numeric(all_data$`ERP Change Ratio`)
all_data$`Natural Increase`<-as.numeric(all_data$`Natural Increase`)
all_data$`Internal Migration`<-as.numeric(all_data$`Internal Migration`)
all_data$`External Migration`<-as.numeric(all_data$`External Migration`)
all_data$`Area (km2)`<-as.numeric(all_data$`Area (km2)`)
all_data$`Persons per km2`<-as.numeric(all_data$`Persons per km2`)

# We will also convert Police Region to factor variable
all_data$`Police Region`<-as.factor(all_data$`Police Region`)

str(all_data)
```
After initially running the str() function we could see that almost all of the variables had been read in as character variables. However most of them were numeric which were changed using the as.numeric() function as well as adjusting the "Police Region" to be a Factor Variable. There were also several empty columns that were classified as logical variables however these will all be removed in the next section as they are all containing NA values.

<br>
<br>

##	**Tidy & Manipulate Data I **


```{r}
# This is the R chunk for the Tidy & Manipulate Data I 

# This data also represents crime from all years between 2014-2023, however we are only interested in 2021/2022.
# Since we have columns for 2021 data we will delete all rows except for those containing 2022 data
all_data<-all_data[all_data$Year == "2022",]

# All of these recordings were taken in June as well so we will eliminate the month column as it is no longer needed
all_data<- all_data%>%
  select(-"Year ending",-"Australian Bureau of Statistics",-"...5",-"...8",-"...12")

# We will also delete any row that has "Total" as it does not represent a single Local Government Area
all_data<-all_data[!(all_data$`Local Government Area` %in% "Total"),]

```

Both of these datasets were initially quite messy when imported. To begin with the first data set had a different row for each yearly observation of the same data set where they should have been stored in columns. Since we were only looking at 2021 and 2022 years which were already stored in their own columns in the second data set, we deleted all rows except for 2022. We also deleted many columns that were designed to be line breaks in between the original data sheet in excel. We then deleted the "Year ending" variable as every year had ended in June so it was redundant to have that value in there. There were also some rows that had "Total" which grouped the data by descriptive statistics but was not needed and created more missing values in adjacent columns so these were taken out as well.

<br>
<br>

## **Tidy & Manipulate Data II** 

```{r}
# This is the R chunk for the Tidy & Manipulate Data II 

# Here we will create a new column for the offence rate by area
all_data$"Offence Rate per KM2" <- numeric(nrow(all_data))

for (i in 1:nrow(all_data)){
  all_data$`Offence Rate per KM2`[i]<-all_data$`Offence Count`[i]/all_data$`Area (km2)`[i]
}

```

Here to create a new variable we decided to look at the offence rate by Area in squared kilometeres. I think that this gives valuable insight when you look at larger regions having higher crime per citizen but lower crime per km2 as they covered a lot of land that is not settled or un-utilised.

<br>
<br>

##	**Scan I **

```{r}
# This is the R chunk for the Scan I
empty_rows_df<-data.frame()
na_rows<-list()

for (i in 1:nrow(all_data)){
  if (any(is.na(all_data[i,]))){
    empty_rows_df<-rbind(empty_rows_df,all_data[i,])
    na_rows<-c(na_rows,i)
  }
}
print(empty_rows_df)


# Due to all of these rows containing unique data that is location specific, we should not substitute with other values
# As such we will delete each of these rows
all_data<-na.omit(all_data)
```

To identify that there are N/A values we search through and each row to find rows that contain these NA or missing values. Since they are not being used for calculations that we durastically affect any further calculations, I have opted to omit these from the data set using na.omit() function.
<br>
<br>

##	**Scan II**

```{r}
# This is the R chunk for the Scan II

# Since none of these variables will alter later calculations durastically, we will imput the outliers with the mean value of the column

for (col_name in names(all_data)) {
  if (col_name != "Local Government Area" && is.numeric(all_data[[col_name]])) {
    col_data <- all_data[[col_name]]
    col_mean <- mean(col_data)
    col_sd <- sd(col_data)
    threshold <- 3

    outliers <- abs((col_data - col_mean) / col_sd) > threshold
    col_data[outliers] <- col_mean
    
    all_data[[col_name]] <- col_data
  }
}



```

To identify outliers, I ran a for loop that scans through each column and calculates the mean and standard deviation of each column. We then calculate a z score and see if the absolute value of the z score is greater than the outlier by checking it against a threshold of 3. The most efficient way of dealing with outliers in this data set that has very minimal instances of outliers and is not being used for intense calculations, is to use the mean of each column to replace the outlier. This is all stored in a smaller column and replaces the original column in the data set at the end of the loop.

<br>
<br>

##	**Transform **

```{r}
# This is the R chunk for the Transform Section
# The data that we would like to see is the graph of the Rate per 100,000
hist(all_data$`Rate per 100,000 population`, main = "Rate of crime per 100,000", xlab="Orginal Values")

# Since the data is right skewed, we will use the sqrt function to normalise the data
all_data$`Rate per 100,000 population`<-sqrt(all_data$`Rate per 100,000 population`)
hist(all_data$`Rate per 100,000 population`, main = "Rate of crime per 100,000 - Normalised", xlab = "Normalised Values")
```
Here we wish to plot a histogram of the rate of offences per 100,000 citizens in the local government area. We can see from the original visual that the data is right skewed and needs to be corrected. To do this we initially used the log() function however it did not give us the desired outcome, so instead we have opted with the square root function sqrt() which gives us a very normalised data set which takes the shape of a typical bell curve.

<br>
<br>

##	**Link to Presentation **

```{r}
# This is the R chunk to provide the direct link to your video presentation recorded in Studio
# Please see below link to video
# https://rmit-arc.instructuremedia.com/embed/68dfbb5b-455d-4be0-9df9-6e017a85a5eb 

```

<br>
<br>