# Load all packages required
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
library(jsonlite) # approved by Sona so I can read a JSON file
library(stringr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readxl)
library(forecast) # for boxcox
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(rmarkdown)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
I have taken data sets from three different sources and read them into R: the Queensland Government Open Data Portal (crime by postcode for April 2020), the Australian Bureau of Statistics portal (Index of Economic Resources (IER), Socio-Economic Indexes for Areas (SEIFA) 2016 data by postcode) and the Australian Taxation Office (2017-18 individual income or loss by postcode).
Preprocessing included data type conversions (some with ordered, levelled factors), tidying of the data (although most of it was already tidy), mutating several variables, scanning for missing data and inconsistencies, scanning for outliers, a transformation of one right-skewed variable to an approximately normal distribution, and some sorting and rearranging of columns. Careful attention was given to the merging of the data sets to make sure that any dropped data (and there was some) was analysed, understood and correct.
This data set is now ready to have statistical analysis performed on it. Of interest would be the relationship between crime and earnings by postcode and/or crime and SEIFA by postcode in Brisbane for April 2020.
I have chosen to include three data sets:
Crime offence data by boundary (Brisbane LGA) and offence type for the period 1 April 2020 through 30 April 2020. Information on this dataset can be found at https://www.data.qld.gov.au/dataset/crime-locations-2000-present. It is provided by the Queensland Government on their Open Data Portal.
The dataset was downloaded as a JSON file using the following URL: https://a5c7zwf7e5.execute-api.ap-southeast-2.amazonaws.com/dev/offences?locationType=Lga&startDate=04-01-2020&locationName=Brisbane&endDate=04-30-2020&format=JSON
License: Creative Commons Attribution 3.0 (Open Data)
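As an aside, jsonlite can read the JSON straight from the API URL; a minimal sketch (the analysis below uses the locally saved copy instead):
# Sketch: read the offence data directly from the API rather than from a saved file
api_url <- paste0("https://a5c7zwf7e5.execute-api.ap-southeast-2.amazonaws.com/dev/offences",
                  "?locationType=Lga&startDate=04-01-2020",
                  "&locationName=Brisbane&endDate=04-30-2020&format=JSON")
crime_raw <- fromJSON(api_url)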
VARIABLES
The dataset contains five variables, of which I am only interested in the following four for this exercise:
Type: A string denoting the type of criminal offence.
Date: The date recorded for the offence. Note that multiple offences at the same postcode and time can be recorded.
Postcode: Postal area code for the specific offence location in question (a four-digit number).
Area of Interest: A string denoting the region covered. Data is for the Brisbane LGA, which means other South East Queensland LGAs will not have offence data included.
The second data set ranks areas in Australia according to relative socio-economic advantage and disadvantage. The index used is based on information from the five-yearly Census (Postal Area, Index of Economic Resources (IER), SEIFA 2016). The data will be subset so that only Brisbane LGA postcode information is kept. Information on this data can be found on the Australian Bureau of Statistics page at: https://www.abs.gov.au/AUSSTATS/abs@.nsf/Lookup/2033.0.55.001Main+Features12016?OpenDocument (2033.0.55.001)
The dataset was downloaded from: https://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&2033055001%20-%20poa%20indexes.xls&2033.0.55.001&Data%20Cubes&DC124D1DAC3D9FDDCA25825D000F9267&0&2016&27.03.2018&Latest
License: Creative Commons Attribution 4.0 International licence
VARIABLES
The dataset contains multiple tables. I am only interested in Table 4 (Postal Area (POA) Index of Economic Resources, 2016) and, out of its 17 variables, just the following two:
2016 Postal Area (POA) Code: Postal area code for the relative socio-economic advantage and disadvantage rank (a four-digit number).
Ranking within State or Territory - Decile: A decile rank between 1 and 10 describing the level of socio-economic advantage and disadvantage, where 1 denotes the lowest (most disadvantaged) 10%, 2 the next lowest 10%, and so on through to 10 for the highest (most advantaged) 10%.
Table 4 is being used because it relates to economic resources, which complements the income data set used next. While the dataset contains many other variables, they are not relevant for this exercise and would only burden processing unnecessarily.
Lastly, this next data set provides individuals' income data (2017-2018) from the Australian Taxation Office for the postcodes in question. It specifically relates to "Individuals - Table 25". More information on the dataset can be found at: https://data.gov.au/data/dataset/taxation-statistics-2017-18/resource/343f1d18-067b-44ee-b7b3-1b04c4872b86?inner_span=True
The actual dataset was downloaded from: https://data.gov.au/data/dataset/23b8c299-a85b-4fc0-a07d-5ed14e23a103/resource/343f1d18-067b-44ee-b7b3-1b04c4872b86/download/ts18individual25countaveragemedianbypostcode.csv
License: Creative Commons Attribution 2.5 Australia
VARIABLES
The dataset contains 34 variables, of which I am only interested in three for the purpose of this exercise:
Postcode: Postal area code for the relative income area (a four-digit number).
Count total income or loss: The number of people living in that postcode who submitted a tax return.
Average total income or loss: The average income in dollars per person living in that postcode.
# Read in April 2020 crime data from QLD Government Open Data Portal
crimeApril2 <- fromJSON(txt = "crimeApril2020.json")
json_df <- as.data.frame(crimeApril2)
head(json_df) # view the first few observations in the data set
str(json_df) # check the structure of the data set for obvious anomalies
## 'data.frame': 6777 obs. of 5 variables:
## $ Type : chr "Fraud" "Other Theft (excl. Unlawful Entry)" "Drug Offences" "Handling Stolen Goods" ...
## $ Date : chr "2020-04-01 00:00:00" "2020-04-01 00:00:00" "2020-04-01 00:00:00" "2020-04-01 00:00:00" ...
## $ Postcode : chr "4116" "4179" "4000" "4000" ...
## $ Area of Interest: chr "Brisbane" "Brisbane" "Brisbane" "Brisbane" ...
## $ ABS Meshblock : chr "30044610000" "30563881800" "30565480000" "30565480000" ...
# Read in 2016 socio economic status file from the Australian Bureau of Statistics
socio <- read_excel("2033055001 - poa indexes.xls", sheet = "Table 4", skip = 4)
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in A2637 / R2637C1: got '© Commonwealth of Australia 2018'
## New names:
## * `` -> ...3
## * `` -> ...4
## * `` -> ...6
## * `` -> ...7
## * `` -> ...8
## * ...
The warnings alert us to two things: several columns have blank names, which readxl renames automatically (e.g. `...3`), and a copyright footer row ('© Commonwealth of Australia 2018') was read in where a numeric value was expected. The automatic renaming will suffice for this exercise, and the footer row is dealt with during tidying.
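If we wanted to avoid the coercion warning altogether, one option would be to stop reading just before the footer; a minimal sketch (the n_max value is an assumption derived from the warning pointing at sheet row 2637, given 4 skipped rows and a header row):
# Sketch: stop before the copyright footer row to avoid the coercion warning
socio_trimmed <- read_excel("2033055001 - poa indexes.xls", sheet = "Table 4",
                            skip = 4, n_max = 2631)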
head(socio) # view the first few observations in the data set
str(socio) # check the structure of the data set for obvious anomalies
## tibble [2,632 x 17] (S3: tbl_df/tbl/data.frame)
## $ 2016 Postal Area (POA) Code : num [1:2632] NA 2000 2007 2008 2009 ...
## $ Usual Resident Population : num [1:2632] NA 27411 8845 11712 12813 ...
## $ ...3 : chr [1:2632] "Score" "804" "737" "784" ...
## $ ...4 : logi [1:2632] NA NA NA NA NA NA ...
## $ Ranking within Australia : chr [1:2632] "Rank" "40" "27" "35" ...
## $ ...6 : chr [1:2632] "Decile" "1" "1" "1" ...
## $ ...7 : chr [1:2632] "Percentile" "2" "2" "2" ...
## $ ...8 : logi [1:2632] NA NA NA NA NA NA ...
## $ Ranking within State or Territory : chr [1:2632] "State" "NSW" "NSW" "NSW" ...
## $ ...10 : chr [1:2632] "Rank" "8" "3" "6" ...
## $ ...11 : chr [1:2632] "Decile" "1" "1" "1" ...
## $ ...12 : chr [1:2632] "Percentile" "2" "1" "1" ...
## $ Minimum score for SA1s in area : num [1:2632] NA 632 618 540 765 515 598 696 498 501 ...
## $ Maximum score for SA1s in area : num [1:2632] NA 1169 918 1008 1107 ...
## $ % Usual Resident Population without an SA1 level score : num [1:2632] NA 0.00555 0 0 0 ...
## $ Data should be used with caution - area not well represented by SA1s: chr [1:2632] NA NA NA NA ...
## $ POA crosses state or territory boundaries : chr [1:2632] NA NA NA NA ...
# Read in 2017-2018 individual income by postcode data
earnings <- read_csv("ts18individual25countaveragemedianbypostcode.csv")
## Parsed with column specification:
## cols(
## .default = col_double()
## )
## See spec(...) for full column specifications.
head(earnings) # view the first few observations in the data set
str(earnings) # check the structure of the data set for obvious anomalies
## tibble [2,470 x 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Postcode : num [1:2470] 800 810 812 820 822 828 829 830 832 835 ...
## $ Number of individuals lodging an income tax return: num [1:2470] 5168 21449 11170 13186 6279 ...
## $ Count taxable income or loss : num [1:2470] 5168 21449 11170 13186 6279 ...
## $ Average taxable income or loss : num [1:2470] 74990 66197 64626 80947 48181 ...
## $ Median taxable income or loss : num [1:2470] 56363 55311 56496 63161 36292 ...
## $ Proportion with salary or wages : num [1:2470] 0.92 0.87 0.88 0.89 0.91 0.86 0.96 0.92 0.94 0.84 ...
## $ Count salary or wages : num [1:2470] 4777 18675 9797 11690 5707 ...
## $ Average salary or wages : num [1:2470] 76428 67931 67487 77029 46538 ...
## $ Median salary or wages : num [1:2470] 59845 59999 60989 65904 35992 ...
## $ Proportion with net rent : num [1:2470] 0.14 0.2 0.2 0.23 0.1 0.2 0.08 0.17 0.16 0.26 ...
## $ Count net rent : num [1:2470] 741 4228 2266 3047 642 ...
## $ Average net rent : num [1:2470] -7109 -4989 -5063 -4251 -3954 ...
## $ Median net rent : num [1:2470] -5354 -3723 -3785 -3420 -2660 ...
## $ Count total income or loss : num [1:2470] 5168 21449 11170 13186 6279 ...
## $ Average total income or loss : num [1:2470] 77465 68729 66892 84033 49652 ...
## $ Median total income or loss : num [1:2470] 58802 57663 58560 65872 36858 ...
## $ Count total deductions : num [1:2470] 5168 21449 11170 13186 6279 ...
## $ Average total deductions : num [1:2470] 2423 2484 2214 2864 1380 ...
## $ Median total deductions : num [1:2470] 917 1047 1036 1228 340 ...
## $ Proportion with total business income : num [1:2470] 0.05 0.08 0.07 0.07 0.04 0.08 0.02 0.06 0.05 0.1 ...
## $ Count total business income : num [1:2470] 246 1792 769 912 264 ...
## $ Average total business income : num [1:2470] 76899 87283 79961 90982 75433 ...
## $ Median total business income : num [1:2470] 22887 30475 33599 25532 31241 ...
## $ Proportion with total business expenses : num [1:2470] 0.05 0.08 0.07 0.07 0.04 0.08 0.02 0.05 0.05 0.1 ...
## $ Count total business expenses : num [1:2470] 238 1692 740 871 255 ...
## $ Average total business expenses : num [1:2470] 55931 58678 56098 60852 55751 ...
## $ Median total business expenses : num [1:2470] 10500 12751 16228 13101 16282 ...
## $ Proportion with net tax : num [1:2470] 0.87 0.82 0.83 0.85 0.68 0.84 0.94 0.84 0.87 0.83 ...
## $ Count net tax : num [1:2470] 4475 17484 9226 11271 4256 ...
## $ Average net tax : num [1:2470] 22583 18948 17661 25007 13117 ...
## $ Median net tax : num [1:2470] 12806 13302 13313 15371 7785 ...
## $ Count super total accounts balance : num [1:2470] 7324 25039 13262 15877 9351 ...
## $ Average super total accounts balance : num [1:2470] 74077 119697 126783 143697 49967 ...
## $ Median super total accounts balance : num [1:2470] 11624 42545 53552 46748 10471 ...
## - attr(*, "spec")=
## .. cols(
## .. Postcode = col_double(),
## .. `Number of individuals lodging an income tax return` = col_double(),
## .. `Count taxable income or loss` = col_double(),
## .. `Average taxable income or loss` = col_double(),
## .. `Median taxable income or loss` = col_double(),
## .. `Proportion with salary or wages` = col_double(),
## .. `Count salary or wages` = col_double(),
## .. `Average salary or wages` = col_double(),
## .. `Median salary or wages` = col_double(),
## .. `Proportion with net rent` = col_double(),
## .. `Count net rent` = col_double(),
## .. `Average net rent` = col_double(),
## .. `Median net rent` = col_double(),
## .. `Count total income or loss` = col_double(),
## .. `Average total income or loss` = col_double(),
## .. `Median total income or loss` = col_double(),
## .. `Count total deductions` = col_double(),
## .. `Average total deductions` = col_double(),
## .. `Median total deductions` = col_double(),
## .. `Proportion with total business income` = col_double(),
## .. `Count total business income` = col_double(),
## .. `Average total business income` = col_double(),
## .. `Median total business income` = col_double(),
## .. `Proportion with total business expenses` = col_double(),
## .. `Count total business expenses` = col_double(),
## .. `Average total business expenses` = col_double(),
## .. `Median total business expenses` = col_double(),
## .. `Proportion with net tax` = col_double(),
## .. `Count net tax` = col_double(),
## .. `Average net tax` = col_double(),
## .. `Median net tax` = col_double(),
## .. `Count super total accounts balance` = col_double(),
## .. `Average super total accounts balance` = col_double(),
## .. `Median super total accounts balance` = col_double()
## .. )
As I want to perform some of the other analysis and tidying first, I will not be merging the data sets just yet.
This section should fulfil the minimum requirements of items 2-4 of the task sheet; where it doesn't, it refers to a later section that does.
# Subset crime data to just keep the required variables
crimeApril <- subset(crimeApril2, select = c(Type, Date, Postcode, `Area of Interest`))
# Make sure variables are of appropriate data types for manipulation
str(crimeApril$Type) # of data type chr. Can factor this.
## chr [1:6777] "Fraud" "Other Theft (excl. Unlawful Entry)" "Drug Offences" ...
length(unique(crimeApril$Type)) # 18 different types of offences
## [1] 18
levels <- sort(unique(crimeApril$Type))
factor(levels) # display the 18 offence types as alphabetically sorted levels
## [1] Arson Assault
## [3] Drug Offences Fraud
## [5] Good Order Offences Handling Stolen Goods
## [7] Liquor (excl. Drunkenness) Miscellaneous Offences
## [9] Other Offences Against the Person Other Property Damage
## [11] Other Theft (excl. Unlawful Entry) Prostitution Offences
## [13] Robbery Traffic and Related Offences
## [15] Trespassing and Vagrancy Unlawful Entry
## [17] Unlawful Use of Motor Vehicle Weapons Act Offences
## 18 Levels: Arson Assault Drug Offences Fraud ... Weapons Act Offences
crimeApril$Type <- factor(crimeApril$Type, levels = levels)
# Create new columns splitting out the date so that the info can be better used
crimeApril$Year <- year(crimeApril$Date) # Split out year
crimeApril$Month <- month(crimeApril$Date) # Split out month
crimeApril$Day <- day(crimeApril$Date) # Split out day
crimeApril$Hours <- hour(crimeApril$Date) # Split out hours
crimeApril$Minutes <- minute(crimeApril$Date) # Split out minutes
crimeApril$Seconds <- second(crimeApril$Date) # Split out seconds
# Create a new column for Month showing the month abbreviation so it can be used as a look up if necessary.
crimeApril$Month <- as.integer(crimeApril$Month) # make Month an integer
mymonths <- c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec") # create month abbrev. vector
# Now create a new column with the correctly mapped month abbreviation. In this case as we only have data for April, it will always be Apr
crimeApril$`Month Abbreviation` <- mymonths[crimeApril$Month]
# Can now factor MonthsAbb = 12 levels. Although we only have one month in the dataset, if we read in a dataset with more than one month, this code would work perfectly fine on that too.
crimeApril$`Month Abbreviation` <- factor(crimeApril$`Month Abbreviation`,
ordered = TRUE,
levels = c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"))
crimeApril$Date <- as.Date(crimeApril$Date) # Convert Date to type Date
crimeApril$Postcode <- as.numeric(crimeApril$Postcode) # Need a numeric postcode for when we merge
# Area of Interest is already a chr data type which we can leave.
str(crimeApril$`Area of Interest`) # Check data type of Area. Chr. All good
## chr [1:6777] "Brisbane" "Brisbane" "Brisbane" "Brisbane" "Brisbane" ...
str(crimeApril) # Looks good
## 'data.frame': 6777 obs. of 11 variables:
## $ Type : Factor w/ 18 levels "Arson","Assault",..: 4 11 3 6 4 11 4 10 14 5 ...
## $ Date : Date, format: "2020-04-01" "2020-04-01" ...
## $ Postcode : num 4116 4179 4000 4000 4070 ...
## $ Area of Interest : chr "Brisbane" "Brisbane" "Brisbane" "Brisbane" ...
## $ Year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ Month : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Hours : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Minutes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Seconds : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Month Abbreviation: Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 4 4 4 4 4 4 4 4 4 4 ...
head(crimeApril) # View the head of the data frame
Refer to the Tidy & Manipulate Data I section: the socio dataset is untidy, so I will clean it up there and perform its subsetting and data type conversions at the same time.
# Just select the columns we are interested in.
earnings2 <- select(earnings, `Postcode`,`Count total income or loss`,
`Average total income or loss`)
summary(earnings2) # Note the differences between max and min values for the count and average income or loss variables. Interesting!
## Postcode Count total income or loss Average total income or loss
## Min. : 800 Min. : 53 Min. : 21322
## 1st Qu.:2840 1st Qu.: 502 1st Qu.: 48503
## Median :3870 Median : 2284 Median : 55179
## Mean :4094 Mean : 5698 Mean : 59390
## 3rd Qu.:5164 3rd Qu.: 8360 3rd Qu.: 64440
## Max. :7470 Max. :63270 Max. :249199
str(earnings2) # Check data types. All should be numeric. Yes, they are.
## tibble [2,470 x 3] (S3: tbl_df/tbl/data.frame)
## $ Postcode : num [1:2470] 800 810 812 820 822 828 829 830 832 835 ...
## $ Count total income or loss : num [1:2470] 5168 21449 11170 13186 6279 ...
## $ Average total income or loss: num [1:2470] 77465 68729 66892 84033 49652 ...
head(earnings2) # Looks good
The SEIFA socio data set has an extra row at the top, which came in as part of the subheadings in the ABS spreadsheet and could not be removed during the initial read without making the data set incomprehensible. This observation does not conform to the other rows. Conveniently, because I need to subset just the QLD data based on the Ranking within State or Territory column, that untidy row disappears during that step.
I also check the max and min values in the income or loss data frame; they turn out to belong to Victorian and New South Wales postcodes, which will be removed when the data sets are merged, as I am only interested in Brisbane postcodes.
Other than that, the data being kept was all relatively tidy.
# Work on socio data set - tidying and performing data type conversions, etc.
head(socio) # see untidy first row
# Extract rows that match QLD Ranking within State or Territory
socio2 <- subset(socio, socio$`Ranking within State or Territory` == "QLD") # untidy row disappeared during this step
socio3 <- socio2[c(1,11)] # Only keep Postcode and Decile Rank column as they are the only columns I need to perform the intended analysis over the data
socio3 <- rename(socio3, Postcode = `2016 Postal Area (POA) Code`) # Rename postcode column for when it will be used as the merging key.
socio3 <- rename(socio3, SEIFA_Decile = `...11`) # rename SEIFA Decile column
socio3$Postcode <- as.numeric(socio3$Postcode) # Need a numeric postcode for when we merge
# Now factor the SEIFA_Decile variable so that it is ordered and levelled
socio3$SEIFA_Decile <- factor(socio3$SEIFA_Decile,
ordered = TRUE,
levels = c("1","2","3","4","5","6",
"7","8","9","10"))
str(socio3) # All good
## tibble [434 x 2] (S3: tbl_df/tbl/data.frame)
## $ Postcode : num [1:434] 4000 4005 4006 4007 4008 ...
## $ SEIFA_Decile: Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 3 4 8 6 5 8 ...
head(socio3) # Looks good
# Perform some basic checks on the earnings data set to make sure there is nothing else that is untidy.
summarise(earnings2,
          meanCount = mean(`Count total income or loss`, na.rm = TRUE),
          minCount = min(`Count total income or loss`, na.rm = TRUE),
          maxCount = max(`Count total income or loss`, na.rm = TRUE),
          IQRCount = IQR(`Count total income or loss`, na.rm = TRUE),
          meanInc = mean(`Average total income or loss`, na.rm = TRUE),
          minInc = min(`Average total income or loss`, na.rm = TRUE),
          maxInc = max(`Average total income or loss`, na.rm = TRUE),
          IQRInc = IQR(`Average total income or loss`, na.rm = TRUE))
which(earnings2$`Count total income or loss` == 53) # find the row where the count is only 53
## [1] 1269
print(earnings2[1269,]) # print it. This is postcode 3921; looking it up externally shows it is Victorian, so it will be removed once the data is joined as we are only interested in Brisbane.
## # A tibble: 1 x 3
## Postcode `Count total income or loss` `Average total income or loss`
## <dbl> <dbl> <dbl>
## 1 3921 53 34410
which(earnings2$`Count total income or loss` == 63270)
## [1] 680
print(earnings2[680,]) # Postcode 3030 - another Victorian one which will be removed on join by QLD postcodes.
## # A tibble: 1 x 3
## Postcode `Count total income or loss` `Average total income or loss`
## <dbl> <dbl> <dbl>
## 1 3030 63270 60426
which(earnings2$`Average total income or loss` == 21322)
## [1] 267
which(earnings2$`Average total income or loss` == 249199)
## [1] 50
print(earnings2[267,]) # postcode 2308 - NSW - will be removed on merge
## # A tibble: 1 x 3
## Postcode `Count total income or loss` `Average total income or loss`
## <dbl> <dbl> <dbl>
## 1 2308 158 21322
print(earnings2[50,]) # postcode 2028 - NSW - will also be removed on merge
## # A tibble: 1 x 3
## Postcode `Count total income or loss` `Average total income or loss`
## <dbl> <dbl> <dbl>
## 1 2028 3517 249199
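Incidentally, the same rows can be located without hard-coding the extreme values by using which.min() and which.max(); a small sketch:
# Locate the extreme rows programmatically rather than by value
earnings2[which.min(earnings2$`Count total income or loss`), ]
earnings2[which.max(earnings2$`Count total income or loss`), ]
earnings2[which.min(earnings2$`Average total income or loss`), ]
earnings2[which.max(earnings2$`Average total income or loss`), ]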
Crime Data Set
First, I scanned every retained field for missing data. Although I could have put this into a simple loop, I prefer the readability of doing it individually; for a bigger data set I would use a loop or a vectorised scan (see the sketch after these checks).
sum(is.na(crimeApril$Type)) #0 missing
## [1] 0
sum(is.na(crimeApril$Date)) #0 missing
## [1] 0
sum(is.na(crimeApril$Postcode)) #0 missing
## [1] 0
sum(is.na(crimeApril$`Area of Interest`)) #0 missing
## [1] 0
sum(is.na(crimeApril$Year)) #0 missing
## [1] 0
sum(is.na(crimeApril$Month)) #0 missing
## [1] 0
sum(is.na(crimeApril$Day)) #0 missing
## [1] 0
sum(is.na(crimeApril$Hours)) #0 missing
## [1] 0
sum(is.na(crimeApril$Minutes)) #0 missing
## [1] 0
sum(is.na(crimeApril$Seconds)) #0 missing
## [1] 0
sum(is.na(crimeApril$`Month Abbreviation`)) #0 missing
## [1] 0
# The original crime data was actually missing some values in ABS Meshblock (as follows), but given that variable is not required, the whole column was removed and the missing values are irrelevant. The rest of the columns were complete.
sum(is.na(crimeApril2$`ABS Meshblock`))
## [1] 6
crimeApril2[!complete.cases(crimeApril2), ]
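As flagged above, a vectorised scan is a compact alternative to the individual checks; a sketch using base R:
# Count missing values in every column at once
colSums(is.na(crimeApril))
colSums(is.na(crimeApril2)) # raw data: shows the 6 NAs in ABS Meshblock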
Socio Data Set
sum(is.na(socio3$Postcode)) #0 missing
## [1] 0
sum(is.na(socio3$SEIFA_Decile)) #0 missing
## [1] 0
Income Data Set
sum(is.na(earnings2$Postcode)) #0
## [1] 0
sum(is.na(earnings2$`Average total income or loss`)) #0
## [1] 0
sum(is.na(earnings2$`Count total income or loss`)) #0
## [1] 0
earnings2[!complete.cases(earnings2), ] # 0 too so no missing data
# Scan Average Total Income or Loss for Outliers
earnings2$`Average total income or loss` %>% boxplot(main="Box Plot of Average Income or Loss", ylab="Income", col = "green")
We can see a number of outliers so let’s check its distribution.
# Check distribution
hist(earnings2$`Average total income or loss`, xlab = "Average Income or Loss", main = "Histogram of average total income or loss")
The total income or loss data is right-skewed and therefore does not follow a normal distribution, so we can't use z-scores.
# Do a qqplot
qqPlot(earnings2$`Average total income or loss`, ylab="Average Income or Loss", main="qqPlot of Average Total Income or Loss")
## [1] 50 2055
The QQ plot supports the same conclusion: the data shows outliers and non-normality, concentrated in the right tail of the distribution.
Not all outliers actually mean there is a problem and so we have to think about the data set. What this demonstrates is something known as “Pareto’s law of income distribution” (Rodd, 1996). It forms the basis of the 80/20 rule; that a small proportion of (in this case) income earners are responsible for a large share of income. As such, it makes no sense to manipulate those outliers.
I will box-plot the count of income earners variable too; I expect this to further demonstrate Pareto's Law, and in fact it does: the distribution is skewed and outliers exist.
# Scan Count total income or loss for outliers
earnings2$`Count total income or loss` %>% boxplot(main="Box Plot of Count of Income Earners", ylab="Count", col = "orange")
Although these observations are outliers, they are genuine and important to the data set, so they are retained; we simply keep their presence in mind during later analysis.
Now scan the income postcodes, which we expect to be fine. Although the postcode data is numeric at the moment, we can turn it into factored levels later if we wish.
earnings2$Postcode %>% boxplot(main="Box Plot of Postcodes (Income)", ylab="Postcodes", col = "purple")
There are no outliers in these postcodes. Now do the same for the postcodes in the crimeApril data set.
crimeApril$Postcode %>% boxplot(main="Box Plot of Postcodes (Crime)", ylab="Postcodes", col = "purple")
# one outlier so look for the maximum postcode
summary(crimeApril$Postcode) # 4306
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4000 4011 4066 4067 4109 4306
# Now look up a postcode chart to determine where 4306 is. It is in the city of Ipswich, which does not fall within the Brisbane LGA, so we can determine that this observation does not belong in the data set. View how many observations there are for 4306.
pc_4306 <- subset(crimeApril, crimeApril$Postcode == 4306)
View(pc_4306)
# There are six observations that we can remove. This should take our 6777 observations down to 6771
crimeApril <- anti_join(crimeApril, pc_4306)
## Joining, by = c("Type", "Date", "Postcode", "Area of Interest", "Year", "Month", "Day", "Hours", "Minutes", "Seconds", "Month Abbreviation")
nrow(crimeApril) # Yes - now we have 6771 obs
## [1] 6771
There was one outlier: postcode 4306, which belongs to a different Local Government Area (Ipswich). We can put that down to incorrect data entry and remove those observations from the data set.
Now check the postcodes in the socio data too.
socio3$Postcode %>% boxplot(main="Box Plot of Postcodes (SEIFA)", ylab="Postcodes", col = "purple")
There are no postcode outliers in this data set and that concludes looking at outliers.
Given that we already identified that income is right-skewed, and although we have said we should leave that data as it is to tell the correct story, for the purposes of this exercise I will perform further analysis, including a transformation:
# Shapiro-Wilk test - another test for normality (Karadimitriou, n.d.)
shapiro.test(earnings2$`Average total income or loss`)
##
## Shapiro-Wilk normality test
##
## data: earnings2$`Average total income or loss`
## W = 0.74972, p-value < 2.2e-16
The null hypothesis is that the data is normally distributed. If the p-value is very small (< 0.05) then we reject the null hypothesis. In this case, the p-value is much less than 0.05, so we reject the null hypothesis and conclude there is strong statistical evidence that the distribution of 'Average total income or loss' is not normal.
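A programmatic version of that decision rule, as a small sketch (alpha = 0.05 is the conventional threshold):
# Extract the p-value and apply the 0.05 decision rule
sw <- shapiro.test(earnings2$`Average total income or loss`)
sw$p.value < 0.05 # TRUE here, so we reject the null hypothesis of normality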
# Perform a BoxCox transformation
earnings2$BoxCox_Income <- BoxCox(earnings2$`Average total income or loss`,lambda = "auto")
head(earnings2$BoxCox_Income)
## [1] 1.522455 1.522379 1.522360 1.522504 1.522137 1.522420
hist(earnings2$BoxCox_Income, xlab="BoxCox Transformed Average Income or Loss", main = "Histogram of Normalised Average Income or Loss")
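The lambda chosen by the automatic selection can be inspected afterwards; a quick sketch (the str() output later in this report shows it stored as an attribute, approximately -0.656):
# Inspect the lambda chosen by lambda = "auto"
attr(earnings2$BoxCox_Income, "lambda")
BoxCox.lambda(earnings2$`Average total income or loss`) # or estimate it directly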
Given that I am using Postcode as the key, I want to be careful that each join produces the appropriate result, so I perform a reconciliation exercise as follows:
# Check unique postcodes in crimeApril, the 'master data set' to which we want to attach both the SEIFA and income data.
length(unique(crimeApril$Postcode)) # 76 unique Postcodes because I am only looking at crime in Brisbane. Let's put these into a vector.
## [1] 76
crime_pc <- unique(crimeApril$Postcode)
# Check unique postcodes in SEIFA
length(unique(socio3$Postcode)) # 434 Postcodes but this includes many postcodes for Queensland areas that are not in the Brisbane LGA. Let's put these into a vector.
## [1] 434
socio_pc <- unique(socio3$Postcode)
head(socio_pc)
## [1] 4000 4005 4006 4007 4008 4010
# Then compare the two data set postcodes.
diff_pc <- unique(crime_pc[!crime_pc %in% socio_pc]) # (Stack Overflow, n.d.)
head(diff_pc)
## [1] 4009 4072
# This produces 4009, which is Eagle Farm, a commercial/industrial centre that makes no sense to include in the SEIFA info, and 4072, which covers the University of QLD PO boxes, so it also makes no sense in the SEIFA data.
# Now find the difference between socio_pc and crime_pc
diff_pc2 <- unique(socio_pc[!socio_pc %in% crime_pc])
head(diff_pc2)
## [1] 4019 4020 4021 4022 4037 4114
# 359 postcodes but these are all in areas outside the region of the original crime dataset and we are not interested in them
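For the record, base R's setdiff() expresses the same comparisons more directly; a sketch (base:: is spelled out because dplyr and lubridate both mask setdiff):
# Same set comparisons using base R
base::setdiff(crime_pc, socio_pc) # crime postcodes with no SEIFA data (4009, 4072)
base::setdiff(socio_pc, crime_pc) # SEIFA postcodes outside the crime data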
# This means we can safely perform an inner join on crime now.
socio_crime <- inner_join(crimeApril, socio3, by = "Postcode") # (Parker, 2009)
# Now confirm the observations have decreased by 21 records: 6771 - 6750 = 21
diff_df <- anti_join(crimeApril, socio_crime)
## Joining, by = c("Type", "Date", "Postcode", "Area of Interest", "Year", "Month", "Day", "Hours", "Minutes", "Seconds", "Month Abbreviation")
head(diff_df) # Yes
nrow(crimeApril) # number of rows in the crimeApril data set
## [1] 6771
nrow(socio_crime) # number of rows in the merged socio_crime data set, with the 4009 and 4072 observations removed
## [1] 6750
# Analysing the entries we see that they arise from crime acts all within postcodes 4009 (Eagle Farm/Industrial) and 4072 (Uni QLD) which are not required for the intended purpose of analysing crime with SEIFA in households.
head(socio_crime)
# Excellent. The data is as expected. Now to join with the income data.
# Get QLD postcodes and check length too
qld_pc <- subset(earnings2, Postcode > 3999 & Postcode < 4999) # numeric comparison; quoted bounds would compare as strings
length(unique(qld_pc$Postcode)) # 417
## [1] 417
# Now compare to crime_pc (comparing against the Postcode column, not the whole data frame)
diff_pc3 <- unique(crime_pc[!crime_pc %in% qld_pc$Postcode])
diff_pc3 # crime postcodes missing from the earnings data; the join below shows that, of the postcodes remaining after the SEIFA merge, only 4029 is affected
# Now perform the join of socio_crime with earnings
socio_earnings_crime <- inner_join(socio_crime, earnings2, by="Postcode")
# We've lost another 20 obs in the join. Let's see why.
sediff_df <- anti_join(socio_crime, socio_earnings_crime)
## Joining, by = c("Type", "Date", "Postcode", "Area of Interest", "Year", "Month", "Day", "Hours", "Minutes", "Seconds", "Month Abbreviation", "SEIFA_Decile")
head(sediff_df)
# The data all relates to postcode 4029, which is the Royal Brisbane Hospital and therefore would also not be part of a crime study by SEIFA and household earnings.
head(socio_earnings_crime)
One of our columns is Count total income or loss (the count of income earners who filed a tax return for that fiscal period) and another is Average total income or loss (the average income or loss per person in that postcode). I can create a new column that is the product of these two, which will tell us the total earnings per postcode.
Note that a number of other columns were created during this exercise, such as splitting the date out into Year, Month, Day, Hours, Minutes and Seconds; refer to the Understand section where the crime data set is processed.
# Create a new column which is the product of two other columns
socio_earnings_crime <- socio_earnings_crime %>%
mutate(Total_postcode_earnings = `Count total income or loss` * `Average total income or loss`)
head(socio_earnings_crime)
# factor and level the postcodes
length(unique(socio_earnings_crime$Postcode)) # 73 different postcodes
## [1] 73
levels <- sort(unique(socio_earnings_crime$Postcode))
factor(levels) # display the sorted postcode levels
## [1] 4000 4005 4006 4007 4008 4010 4011 4012 4013 4014 4017 4018 4025 4030 4031
## [16] 4032 4034 4035 4036 4051 4053 4054 4055 4059 4060 4061 4064 4065 4066 4067
## [31] 4068 4069 4070 4073 4074 4075 4076 4077 4078 4101 4102 4103 4104 4105 4106
## [46] 4107 4108 4109 4110 4111 4112 4113 4115 4116 4117 4120 4121 4122 4123 4151
## [61] 4152 4153 4154 4155 4156 4169 4170 4171 4172 4173 4174 4178 4179
## 73 Levels: 4000 4005 4006 4007 4008 4010 4011 4012 4013 4014 4017 4018 ... 4179
socio_earnings_crime$Postcode <- factor(socio_earnings_crime$Postcode, levels = levels)
str(socio_earnings_crime)
## 'data.frame': 6730 obs. of 16 variables:
## $ Type : Factor w/ 18 levels "Arson","Assault",..: 4 11 3 6 4 11 4 10 14 5 ...
## $ Date : Date, format: "2020-04-01" "2020-04-01" ...
## $ Postcode : Factor w/ 73 levels "4000","4005",..: 54 73 1 1 33 56 24 25 38 38 ...
## $ Area of Interest : chr "Brisbane" "Brisbane" "Brisbane" "Brisbane" ...
## $ Year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ Month : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Hours : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Minutes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Seconds : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Month Abbreviation : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SEIFA_Decile : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 9 9 1 1 10 3 2 10 1 1 ...
## $ Count total income or loss : num 16079 12099 13233 13233 7224 ...
## $ Average total income or loss: num 52250 69579 66122 66122 70670 ...
## $ BoxCox_Income : num 1.52 1.52 1.52 1.52 1.52 ...
## ..- attr(*, "lambda")= num -0.656
## $ Total_postcode_earnings : num 8.40e+08 8.42e+08 8.75e+08 8.75e+08 5.11e+08 ...
# Some rows are duplicated because the crime data can record the same crime type at two different locations within a suburb at the same time. I will create a unique row identifier that forces all observations to be unique: a Daily offence ID by Postcode, which numbers the offences within each postcode, month and day and could be handy anyway.
socio_earnings_crime2 <- socio_earnings_crime %>%
group_by(Postcode, Month, Day) %>%
mutate(`Daily offence ID by Postcode` = row_number())
head(socio_earnings_crime2) # (mnel, 2012)
# Then re-arrange the data set to have relevant columns next to each other.
length(socio_earnings_crime2) # 17 variables
## [1] 17
crime_socio_earnings <- socio_earnings_crime2[, c("Postcode", "Daily offence ID by Postcode", "Type", "SEIFA_Decile", "Average total income or loss", "BoxCox_Income", "Count total income or loss", "Total_postcode_earnings", "Year", "Month", "Month Abbreviation", "Day", "Hours", "Minutes", "Seconds", "Date", "Area of Interest")]
length(crime_socio_earnings) # Should also be 17 variables
## [1] 17
# Now sort by Postcode, month, day
crime_socio_earnings <- arrange(crime_socio_earnings, Postcode, Month, Day)
head(crime_socio_earnings)
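As an optional final step, the preprocessed data set can be written out for the analysis stage; a sketch (the file name is illustrative):
# Save the final merged, sorted data set
write_csv(crime_socio_earnings, "crime_socio_earnings_april2020.csv")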
# The End
Karadimitriou, S.M. (n.d.). Checking Normality for Parametric Tests in R. [online] StatsTutor. Available at: https://www.sheffield.ac.uk/polopoly_fs/1.885202!/file/95_Normality_Check.pdf [Accessed 17 Oct. 2020].
mnel (2012). Numbering rows within groups in a data frame. [online] StackOverflow. Available at: https://stackoverflow.com/questions/12925063/numbering-rows-within-groups-in-a-data-frame [Accessed 16 Oct. 2020].
Parker, M. (2009). r - How to join (merge) data frames (inner, outer, left, right). [online] Stack Overflow. Available at: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right [Accessed 18 Oct. 2020].
Rodd, J. (1996). Pareto’s law of income distribution, or the 80/20 rule. International Journal of Nonprofit and Voluntary Sector Marketing, 1(1), pp.77–89.
Stack Overflow. (n.d.). Difference between two vectors in R. [online] Available at: https://stackoverflow.com/questions/31573087/difference-between-two-vectors-in-r [Accessed 18 Oct. 2020].