library(tidyverse)
library(readxl)
library(openxlsx)
library(stringr)
library(editrules)
library(forecast)
This preprocessing task involved data for crime and income statistics in Victoria (Crime Statistics Agency 2020a, Australian Bureau of Statistics 2019a). The second section imports the data and defines each variable. The third section tidies the income dataset and joins it the crime dataset. It then creates a new column in the dataframe representing income values divided by criminal incident values. The fourth section scans the data for missing or special values, or obvious inconsistencies, and none were found. It then checks for outliers. Some outliers were found for income data. These were deemed to be legitimate observstions, and the rationale for leaving them in the dataset was given. The final section applies a logarithmic transformation to a crime variable that was positively skewed. The transformation successfully normalised the distribution of this variable. The data is now ready for a statistical analysis that examines the relationship between personal income and crime in Victoria.
Provided by the Crime Statistics Agency (CSA) (2020a), this data gives the number of criminal incidents recorded and the rate of occurrence per 100,000 people in Victoria, grouped by area and ranging over the years 2010 to 2020. This data was obtained from the CSA website, the link for which can be found in the references. The variables are,
Year (Interval): Year values ranging from 2011 to 2020.
Year ending (Interval): Paired with the year column to indicate the reference period for each observation. Reporting concerns the period between 1 July of the previous year to 30 June of the year in the same row. The column thus exclusively contains the string value ‘June’.
Police Region (Nominal): Broad, regional policing boundaries in Victoria, including ‘North West Metro’, ‘Southern Metro’, ‘Eastern’ and ‘Western’.
Local Government Area (Nominal): The 79 distinct local government areas (LGA) in Victoria. Each council is located in one, and only one, police region.
Incidents Record (Ratio): Number of criminal incidents reported in each council. Incidents are described by the CSA (2020b) as, “a criminal event that may include multiple offences, alleged offenders and/or victims that is recorded on the LEAP database on a single date and at one location.”
Incident Rate (Ratio): Proportion of incidents in a council’s population per 100,000 people. This is calculated by the CSA (2020b) as,
"Offence rate = (Offence count/ERP count) *100,000"
where ERP stands for “Estimated Resident Population.”
Provided by the Australian Bureau of Statistics (ABS) (2019a), this data provides details on personal income for people in Victoria, grouped by area and ranging over the years 2011 to 2017. It was obtained from the ABS website, the link for which can be found in the references. The variables are,
Year (Interval): The year values here are recorded differently than in the crime dataset. Each value represents a financial year, which aligns perfectly with the reporting periods for crime statistics, as the ABS (2019b) reporting period is from 1 July to 30 June. Where the year values in this data set are ‘2016-17’, for example, this is equivalent to the ‘2017’ year value in the crime data set. Data is available for the years ending from 2012 to 2017.
LGA (Nominal): Unique code for each LGA in Australia.
LGA name (Nominal): Names of the LGAs in Australia.
Earners (Ratio): Defined by the ABS (2019b) as, “Persons (including children) who receive income (either positive or negative) from an income type.”
Median age of earners (Ratio): Median age of income earners in each area.
Sum (Ratio): Summed income of all earners by council area.
Median (Ratio): Median income of all earners by council area.
Mean (Ratio): Mean income of all earners by council area.
head(crime <- read_excel("crime.xlsx", sheet = 2))
head(income <- read_excel("income.xls", sheet = 6, skip = 6))
New names:
* `2011-12` -> `2011-12...3`
* `2012-13` -> `2012-13...4`
* `2013-14` -> `2013-14...5`
* `2014-15` -> `2014-15...6`
* `2015-16` -> `2015-16...7`
* ...
While the crime dataset conforms to tidy data principles, the income dataset does not, as year values form column headings, repeating a sequence from ‘2011-12’ to ‘2016-17’. Each sequence represents an actual variable, which are “Earners”, “Median age of earners”, “Sum”, “Median” and “Mean” in that order. It was a challenge to stretch out a year column because of these sequences. Using dplyr::pivot_longer() repeatedly on the same table resulted in Cartesian products, i.e. duplicating row entries. The solution involved pivoting the years into a new table for each variable, then joining them all at the end.
Firstly, rows were filtered for the LGAs in Victoria only, and saved to an object called income_vic:
income_vic <- income %>% filter(LGA %in% c(20110:27630))
Then, new objects were created for each variable. As can be seen in the chunk below, the pivot_longer() function took the year columns containing the values for the “Earners” variable, selected only the LGA, Year and earners columns, and saved it to an object called earners. A string manipulation using stringr::str_sub() was also performed on the Year column. If the value in the Year column was “2011-12…3”, this manipulation replaced it with “12”, the 6th and 7th characters, i.e. the year ending the reporting period. This will be important for joining later. The first six values of the vector were subset and printed out below:
head(earners <- income_vic %>%
pivot_longer(`2011-12...3`:`2016-17...8`, names_to = "Year", values_to = "earners") %>%
select(LGA, Year, earners))
(earners$Year = earners$Year %>%
str_sub(6, 7))[1:6]
[1] "12" "13" "14" "15" "16" "17"
The exact same process was repeated for the other four variables:
earners_age <- income_vic %>%
pivot_longer(`2011-12...9`:`2016-17...14`, names_to = "Year", values_to = "earners.median.age") %>%
select(LGA, Year, earners.median.age)
earners_age$Year = earners_age$Year %>%
str_sub(6, 7)
sum_earned <- income_vic %>%
pivot_longer(`2011-12...15`:`2016-17...20`, names_to = "Year", values_to = "sum.earned") %>%
select(LGA, Year, sum.earned)
sum_earned$Year = sum_earned$Year %>%
str_sub(6, 7)
median_earned <- income_vic %>%
pivot_longer(`2011-12...21`:`2016-17...26`, names_to = "Year", values_to = "median.earned") %>%
select(LGA, Year, median.earned)
median_earned$Year = median_earned$Year %>%
str_sub(6, 7)
mean_earned <- income_vic %>%
pivot_longer(`2011-12...27`:`2016-17...32`, names_to = "Year", values_to = "mean.earned") %>%
select(LGA, Year, mean.earned)
mean_earned$Year = mean_earned$Year %>%
str_sub(6, 7)
All these new tables were joined to an object called all on LGA and Year:
head(all <- earners %>%
left_join(earners_age) %>%
left_join(sum_earned) %>%
left_join(median_earned) %>%
left_join(mean_earned))
Joining, by = c("LGA", "Year")
Joining, by = c("LGA", "Year")
Joining, by = c("LGA", "Year")
Joining, by = c("LGA", "Year")
A “20” was pasted onto the front of the values in the Year column, making them identical to those in crime dataset. The first six values in the vector were subset and printed out below:
(all$Year <- paste("20", all$Year, sep = "") %>% as.double(all$Year))[1:6]
[1] 2012 2013 2014 2015 2016 2017
Finally, the first and second columns were subset in the original income_vic dataset, then joined with all by LGA:
(income_vic <- income_vic[,1:2] %>% right_join(all))[1:10,]
Joining, by = "LGA"
The cleaned income dataset was joined to the crime dataset by the year and the name of the LGA:
income_crime <- income_vic %>% left_join(crime, by = c("Year" = "Year", "LGA NAME" = "Local Government Area"))
Using the str() function, it can be seen that the majority of variables were automatically classed incorrectly:
str(income_crime)
tibble [474 x 12] (S3: tbl_df/tbl/data.frame)
$ LGA : chr [1:474] "20110" "20110" "20110" "20110" ...
$ LGA NAME : chr [1:474] "Alpine" "Alpine" "Alpine" "Alpine" ...
$ Year : num [1:474] 2012 2013 2014 2015 2016 ...
$ earners : chr [1:474] "6795" "6829" "6880" "7028" ...
$ earners.median.age : chr [1:474] "49" "50" "50" "50" ...
$ sum.earned : chr [1:474] "263000000" "273000000" "290000000" "307000000" ...
$ median.earned : chr [1:474] "32690" "33533" "34681" "36195" ...
$ mean.earned : chr [1:474] "38704" "39948" "42147" "43753" ...
$ Year ending : chr [1:474] "June" "June" "June" "June" ...
$ Police Region : chr [1:474] "2 Eastern" "2 Eastern" "2 Eastern" "2 Eastern" ...
$ Incidents Recorded : num [1:474] 330 316 324 327 295 279 914 879 765 884 ...
$ Rate per 100,000 population: num [1:474] 2709 2573 2622 2631 2345 ...
The columns representing quantitative variables (earners, earners.median.age, sum.earned, median.earned, mean.earned) were converted to numerical types using as.double():
income_crime$mean.earned <- income_crime$mean.earned %>% as.double()
income_crime$earners <- income_crime$earners %>% as.double()
income_crime$earners.median.age <- income_crime$earners.median.age %>% as.double()
income_crime$median.earned <- income_crime$median.earned %>% as.double()
income_crime$sum.earned <- income_crime$sum.earned %>% as.double()
The value in Police Region were factored and relabeled:
income_crime$`Police Region` <- income_crime$`Police Region` %>%
factor(levels = c("1 North West Metro", "2 Eastern", "3 Southern Metro", "4 Western"),
labels = c("North West Metro", "Eastern", "Southern Metro", "Western"),
ordered = FALSE)
levels(income_crime$`Police Region`)
[1] "North West Metro" "Eastern" "Southern Metro" "Western"
The values in LGA and LGA NAME were factored:
income_crime$`LGA NAME` <- income_crime$`LGA NAME` %>% factor(ordered = FALSE)
income_crime$`LGA` <- income_crime$`LGA` %>% factor(ordered = FALSE)
levels(income_crime$`LGA NAME`)
[1] "Alpine" "Ararat" "Ballarat" "Banyule" "Bass Coast" "Baw Baw" "Bayside"
[8] "Benalla" "Boroondara" "Brimbank" "Buloke" "Campaspe" "Cardinia" "Casey"
[15] "Central Goldfields" "Colac-Otway" "Corangamite" "Darebin" "East Gippsland" "Frankston" "Gannawarra"
[22] "Glen Eira" "Glenelg" "Golden Plains" "Greater Bendigo" "Greater Dandenong" "Greater Geelong" "Greater Shepparton"
[29] "Hepburn" "Hindmarsh" "Hobsons Bay" "Horsham" "Hume" "Indigo" "Kingston"
[36] "Knox" "Latrobe" "Loddon" "Macedon Ranges" "Manningham" "Mansfield" "Maribyrnong"
[43] "Maroondah" "Melbourne" "Melton" "Mildura" "Mitchell" "Moira" "Monash"
[50] "Moonee Valley" "Moorabool" "Moreland" "Mornington Peninsula" "Mount Alexander" "Moyne" "Murrindindi"
[57] "Nillumbik" "Northern Grampians" "Port Phillip" "Pyrenees" "Queenscliffe" "South Gippsland" "Southern Grampians"
[64] "Stonnington" "Strathbogie" "Surf Coast" "Swan Hill" "Towong" "Wangaratta" "Warrnambool"
[71] "Wellington" "West Wimmera" "Whitehorse" "Whittlesea" "Wodonga" "Wyndham" "Yarra"
[78] "Yarra Ranges" "Yarriambiack"
levels(income_crime$`LGA`)
[1] "20110" "20260" "20570" "20660" "20740" "20830" "20910" "21010" "21110" "21180" "21270" "21370" "21450" "21610" "21670" "21750" "21830" "21890" "22110" "22170" "22250"
[22] "22310" "22410" "22490" "22620" "22670" "22750" "22830" "22910" "22980" "23110" "23190" "23270" "23350" "23430" "23670" "23810" "23940" "24130" "24210" "24250" "24330"
[43] "24410" "24600" "24650" "24780" "24850" "24900" "24970" "25060" "25150" "25250" "25340" "25430" "25490" "25620" "25710" "25810" "25900" "25990" "26080" "26170" "26260"
[64] "26350" "26430" "26490" "26610" "26670" "26700" "26730" "26810" "26890" "26980" "27070" "27170" "27260" "27350" "27450" "27630"
The column Year ending is not useful for this instance of the dataset, as the end of the reporting period is known to be 30 June. It was removed:
income_crime <- income_crime %>% select(-`Year ending`)
Now, dplyr::rename() was used to give a consistent format to the column names, and the str() function demonstrates that each variable is of the correct class:
str(income_crime <- income_crime %>% rename(lga.code = LGA,
lga.name = `LGA NAME`,
year = Year,
police.region = `Police Region`,
criminal.incidents = `Incidents Recorded`,
incident.rate = `Rate per 100,000 population`))
tibble [474 x 11] (S3: tbl_df/tbl/data.frame)
$ lga.code : Factor w/ 79 levels "20110","20260",..: 1 1 1 1 1 1 2 2 2 2 ...
$ lga.name : Factor w/ 79 levels "Alpine","Ararat",..: 1 1 1 1 1 1 2 2 2 2 ...
$ year : num [1:474] 2012 2013 2014 2015 2016 ...
$ earners : num [1:474] 6795 6829 6880 7028 7154 ...
$ earners.median.age: num [1:474] 49 50 50 50 49 50 46 47 47 47 ...
$ sum.earned : num [1:474] 2.63e+08 2.73e+08 2.90e+08 3.07e+08 3.17e+08 ...
$ median.earned : num [1:474] 32690 33533 34681 36195 36669 ...
$ mean.earned : num [1:474] 38704 39948 42147 43753 44247 ...
$ police.region : Factor w/ 4 levels "North West Metro",..: 2 2 2 2 2 2 4 4 4 4 ...
$ criminal.incidents: num [1:474] 330 316 324 327 295 279 914 879 765 884 ...
$ incident.rate : num [1:474] 2709 2573 2622 2631 2345 ...
A new column was created by using a log10() transformation on sum.earned and incidents and dividing the first by the second. Given these were both raw values for income and crime, they should be proportionate. The new column gives values representing the relationship between income and crime. The logarithm was applied because both the original variables were positively skewed. The names of the LGAs and the relevant quantitative variables were subset and printed out below:
(income_crime <- income_crime %>% mutate(earned.per.incident = log10(sum.earned)/log10(incident.rate)))[1:10, c(2, 6, 10, 12)]
Using sum(is.na()), no missing values were found. Using sapply() and function(x) sum(is.infinite(x)|is.nan(x)), no infinite or NaN values were found:
income_crime %>% is.na() %>% sum()
[1] 0
income_crime %>% sapply(function(x) sum(is.infinite(x)|is.nan(x)))
lga.code lga.name year earners earners.median.age sum.earned median.earned mean.earned
0 0 0 0 0 0 0 0
police.region criminal.incidents incident.rate earned.per.incident
0 0 0 0
The data was checked for obvious inconsistencies using editrules::editset(c). The years were in the correct range and all the ratio variables were at least greater than or equal to zero.
Rules <- editset(c("year >= 2012",
"year <= 2017",
"earners >= 0",
"earners.median.age >= 0",
"sum.earned >= 0",
"median.earned >= 0",
"mean.earned >= 0",
"criminal.incidents >= 0",
"incident.rate >= 0",
"earned.per.incident >= 0"))
Violated <- violatedEdits(Rules, income_crime)
summary(Violated)
No violations detected, 0 checks evaluated to NA
NULL
Boxplots were created for the ratio variables to identify outliers. The following insights were obtained:
boxplot(income_crime$earners.median.age, main = "Earners Median Age")
boxplot(income_crime$median.earned, main = "Median Earned")
boxplot(income_crime$incident.rate, main = "Incidents Rate (Per 100,000 People)")
boxplot(income_crime$earned.per.incident, main = "Amount Earned Per Incident")
It is not surprising that there are a few areas with exceptionally high or low age and income demographics. The functions summary()and filter() were used to identify which LGAs breach the outlier fences for “Earners Median Age” and “Median Earned”:
(summary_age <- summary(income_crime$earners.median.age))
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.00 42.00 45.00 44.71 48.00 59.00
(summary_earned <- summary(income_crime$median.earned))
Min. 1st Qu. Median Mean 3rd Qu. Max.
28980 37425 41845 42480 46231 61847
(age_upper <- summary_age[5] + (1.5 * (summary_age[5] - summary_age[2])))
3rd Qu.
57
(age_lower <- summary_age[2] - (1.5 * (summary_age[5] - summary_age[2])))
1st Qu.
33
(earned_upper <- summary_earned[5] + (1.5 * (summary_earned[5] - summary_earned[2])))
3rd Qu.
59441.38
Breaches of upper fence for earners’ median age:
income_crime %>%
filter(earners.median.age > age_upper)
Breaches of lower fence for earners’ median age:
income_crime %>%
filter(earners.median.age < age_lower)
Breaches of upper fence for median earned (no lower fence breaches):
income_crime %>%
filter(median.earned > earned_upper)
It is unlikely that any of these outliers were caused by errors. There are a number of plausible explanations for this variance. Queenscliffe is a popular coastal location for retirees, while the Melbourne CBD houses many students and young workers. The Bayside, Port Phillip, Stonnington and Yarra councils are known to be relatively wealthy areas. If one were concerned about the effect they might have on a statistical analysis, they might consider to capping or removing them. Otherwise, they are valuable and legitimate observations that could be included in an analysis.
The variables “Incident Rate” was suspected to be positively skewed. Its histogram reinforced this suspicion:
hist(income_crime$incident.rate)
A log10() transformation was found to be the most effective. The histogram of a new column with the log10(incident.rate) values is presented below.
income_crime <- income_crime %>%
mutate(log.rate = log10(incident.rate))
hist(income_crime$log.rate)
Australian Bureau of Statistics 2019a, Income (including Government Allowances), ASGS and LGA, 2011, 2014-2019, data file, Australian Government, Australian Bureau of Statistics, ACT, viewed 19 October 2020 https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1410.02014-19?OpenDocument
Australian Bureau of Statistics 2019b, Personal Income in Australia methodology, Australian Bureau of Statistics, viewed 19 October 2020, https://www.abs.gov.au/methodologies/personal-income-australia-methodology/2011-12-2016-17
Crime Statistics Agency 2020a, Data Tables - LGA Criminal Incidents Visualisation - year ending June 2020, data file, Australian Government, Melbourne, viewed 19 October 2020, https://www.crimestatistics.vic.gov.au/crime-statistics/latest-victorian-crime-data/download-data
Crime Statistics Agency 2020b, Explanatory Notes, Crime Statistics Agency, viewed 19 October 2020, https://www.crimestatistics.vic.gov.au/about-the-data/explanatory-notes#Reference%20periods