MATH2349 Data Wrangling

Required packages

library(tidyverse)
library(forecast)
library(car)

Executive Summary

I imported the data set crime, which details dates and locations (suburb) of criminal offences in Adelaide (South Australia) for the 2019-2020 financial year. I described the variables before dropping some of them. I grouped and summarised the data to give the number of offences against persons and property in each suburb for the year. I then reshaped the data to a tidy format.

Next I imported the data set suburbs, which gives more detail of the suburbs of Adelaide, including population. I described the variables. I then merged the two data sets using left_join. I filtered the data to include only inner suburbs and dropped some of the variables. I then checked all remaining variables and changed their data type where appropriate.

I scanned for missing and special values and dealt with them. I created two new variables using mutate. These were offence rates per capita, for property and person offences. I scanned for missing and special values again.

I scanned all the numeric data for outliers. As the data was not normal I used non-parametric methods, namely box plots and scatter plots. I dealt with outliers as I thought appropriate, including removing some anomalous observations.

Finally, I attempted to transform the heavily skewed population data to a normal distribution. I used the Box-Cox method and displayed histograms and QQ plots to indicate how successful the transformation had been.

Data

The first data set is from the South Australian Government Data Directory (data.sa.gov.au). The data set is ‘Crime Statistics 2019-20’. It is available for download at https://data.sa.gov.au/data/dataset/crime-statistics/resource/590083cd-be2f-4a6c-871e-0ec4c717717b. The data consists of suburb-based crime statistics for South Australia.

My second data set is from a wikipedia page ‘List of Adelaide Suburbs’ (https://en.wikipedia.org/wiki/List_of_Adelaide_suburbs). It contains data on Adelaide suburbs, including population data from the 2016 national Census. I converted the main table to a csv file using an online tool https://wikitable2csv.ggor.de/.

Data set 1

I start by importing the crime statistics data set.

crime <- read_csv("C:/Users/Steven/OneDrive/Desktop/crime.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   `Reported Date` = col_character(),
##   `Suburb - Incident` = col_character(),
##   `Postcode - Incident` = col_character(),
##   `Offence Level 1 Description` = col_character(),
##   `Offence Level 2 Description` = col_character(),
##   `Offence Level 3 Description` = col_character(),
##   `Offence count` = col_double()
## )

head(crime)

dim(crime)

## [1] 95702     7

This data set has 7 variables. They are:

Reported date. This is the date the offence was reported. All dates are between 01/07/2019 and 30/06/2020.
Suburb - Incident. This is the suburb where the incident occured. All suburbs and towns in South Australia are included.
Postcode - Incident. The postcode of the suburb where the incident occurred.
Offence Level 1 description. There are 2 broad categories of offence here: offences against property and offences against the person. Offences against property include crimes such as property damage, fraud and theft. Offences against the person are mostly assault. Sexual offences are not included, as the locations of these crimes are not disclosed for privacy reasons.
Offence Level 2 description. This is a description of the offence at level 2.
Offence level 3 description. The most detailed description of the offence.
Offence count. The number of offences (level 3) for the date and suburb.

This data is not in the form I want it at the moment. I would like statistics for the whole year (2019-2020) for each suburb, with just the offence level 2 description.

I remove the date, postcode, offence level 2, and offence level 3 columns.

crime <- crime[c(2:4,7)]

Now I group and summarise the data by suburb and offence (level 1).

crime <- crime %>% group_by(`Suburb - Incident`,
  `Offence Level 1 Description`) %>%
  summarise(offence_count = sum(`Offence count`, na.rm = TRUE))

## `summarise()` regrouping output by 'Suburb - Incident' (override with `.groups` argument)

head(crime)

Tidy & Manipulate Data I

This data is not tidy. It does not conform to the tidy data principles, specifically because each variable does not have its own column. To tidy this data, I need ‘offences against the person’ and ‘offences against property’ to each have their own columns. I use the pivot_wider function.

crime <- crime %>% 
  pivot_wider(names_from = `Offence Level 1 Description`,
  values_from = offence_count)
head(crime)

dim(crime)

## [1] 1322    3

This data is now in the desired format. For every suburb, the number of offences against person and property for the financial year are given. There are 3 variables and the variable types are suitable. I rename the columns to make things easier.

crime <- crime %>% 
  rename(
    Suburb = `Suburb - Incident`,
    Offences_property = `OFFENCES AGAINST PROPERTY`,
    Offences_person = `OFFENCES AGAINST THE PERSON`
    )

Second Dataset

I now import the second dataset, which contains population and other data on Adelaide suburbs.

suburbs <- read_csv("C:/Users/Steven/OneDrive/Desktop/suburbs.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Suburb = col_character(),
##   PostCode = col_double(),
##   LGA = col_character(),
##   `YearEstab.[citation needed]` = col_character(),
##   `Dist.[4](km)[citation needed]` = col_character(),
##   `Area(ha)[citation needed]` = col_number(),
##   `Population[citation needed]` = col_number()
## )

head(suburbs)

dim(suburbs)

## [1] 402   7

Some of the variable names are not correct, so I rename them before proceeding.

suburbs <- suburbs %>% 
  rename(
    YearEstab = `YearEstab.[citation needed]`,
    Distance = `Dist.[4](km)[citation needed]`,
    Area = `Area(ha)[citation needed]`,
    Population = `Population[citation needed]`
    )

This data set contains 7 variables. They are:

Suburb. The name of the suburb. Suburbs included are all those in the Greater Adelaide area, but do not include the Adelaide Hills or other South Australian towns.
Postcode. The postcode of the suburb.
LGA. The Local Government Area of the suburb.
YearEstab. The year the suburb was established
Distance. The distance (km) of the suburb from the General Post Office in Adelaide City Centre.
Area. The area (ha) covered by the suburb.
Population. The population of the suburb.

Merging datasets

The variable I want to join on - Suburb - is in capitals for the crime data set, but not for suburbs. I need to make them the same, so I capitalize the Suburb column in the suburbs data set.

suburbs$Suburb <- toupper(suburbs$Suburb)

I use left_join to join suburbs with crime to create adelaide.

adelaide <- left_join(suburbs,crime)

## Joining, by = "Suburb"

head(adelaide)

dim(adelaide)

## [1] 402   9

Because there is missing data in the suburbs dataset, I decide to filter my data to include only the local government areas of inner Adelaide.

adelaide <- adelaide %>% filter(
  LGA=='City of Campbelltown' | 
  LGA=='City of Burnside' |
  LGA=='City of Adelaide' |
  LGA=='City of Charles Sturt' |
  LGA=='City of Marion' |
  LGA=='City of Mitcham' |
  LGA=='City of Norwood Payneham St Peters' |
  LGA=='City of Port Adelaide Enfield' |
  LGA=='City of Prospect' |
  LGA=='City of Salisbury' |
  LGA=='City of Tea Tree Gully' |
  LGA=='City of Unley' |
  LGA=='Corporation of the Town of Walkerville' |
  LGA=='City of West Torrens')
dim(adelaide)

## [1] 300   9

Understand

I check the structure of adelaide

glimpse(adelaide)

## Rows: 300
## Columns: 9
## $ Suburb            <chr> "ADELAIDE", "NORTH ADELAIDE", "AULDANA", "BEAUMON...
## $ PostCode          <dbl> 5000, 5006, 5072, 5066, 5067, 5066, 5065, 5063, 5...
## $ LGA               <chr> "City of Adelaide", "City of Adelaide", "City of ...
## $ YearEstab         <chr> "1837", "1837", "1847 [7]", "1870", "1941[10]", "...
## $ Distance          <chr> "-", "0.5", "9", "5.9", "5", "5.75", "2.4", "3", ...
## $ Area              <dbl> 1005, 420, 312, 158, 60, 300, 112, 148, 118, 116,...
## $ Population        <dbl> 15115, 6950, 625, 2557, 1602, 2930, 1678, 764, 11...
## $ Offences_property <dbl> 5465, 520, 12, 21, 43, 28, 56, 115, 42, 97, 144, ...
## $ Offences_person   <dbl> 1334, 92, 3, 3, 6, 3, 7, 24, 1, 13, 15, 41, 6, 5,...

I am mainly interested in crime and population data, so I drop the variables YearEstab, Distance and Area

adelaide <- select(adelaide,-(YearEstab:Area))
glimpse(adelaide)

## Rows: 300
## Columns: 6
## $ Suburb            <chr> "ADELAIDE", "NORTH ADELAIDE", "AULDANA", "BEAUMON...
## $ PostCode          <dbl> 5000, 5006, 5072, 5066, 5067, 5066, 5065, 5063, 5...
## $ LGA               <chr> "City of Adelaide", "City of Adelaide", "City of ...
## $ Population        <dbl> 15115, 6950, 625, 2557, 1602, 2930, 1678, 764, 11...
## $ Offences_property <dbl> 5465, 520, 12, 21, 43, 28, 56, 115, 42, 97, 144, ...
## $ Offences_person   <dbl> 1334, 92, 3, 3, 6, 3, 7, 24, 1, 13, 15, 41, 6, 5,...

There are now 6 variables. Checking the data types, some of them are not correct. Postcode should really be a factor variable, so I change it.

adelaide$PostCode <- as.factor(adelaide$PostCode)

LGA should be a factor variable.

adelaide$LGA <- factor(adelaide$LGA,
  levels=c('City of Adelaide','City of Burnside',
          'City of Campbelltown', 'City of Charles Sturt',
          'City of Marion','City of Mitcham',
          'City of Norwood Payneham St Peters',
          'City of Port Adelaide Enfield',
          'City of Prospect', 'City of Salisbury', 
          'City of Tea Tree Gully','City of Unley',
          'Corporation of the Town of Walkerville',
          'City of West Torrens'),
  labels=c('Adelaide','Burnside','Campbelltown','Charles Sturt',
           'Marion','Mitcham','Norwood','Port Adelaide',
           'Prospect', 'Salisbury','Tea Tree Gully', 'Unley',
           'Walkerville','West Torrens')) 
levels(adelaide$LGA)

##  [1] "Adelaide"       "Burnside"       "Campbelltown"   "Charles Sturt" 
##  [5] "Marion"         "Mitcham"        "Norwood"        "Port Adelaide" 
##  [9] "Prospect"       "Salisbury"      "Tea Tree Gully" "Unley"         
## [13] "Walkerville"    "West Torrens"

The remaining 3 variables are numeric.

Note that this data set includes numeric, character and factor variables. I have so far applied conversions to several different data types, and one of the variables is a labelled factor variable.

Scan I

Before creating new variables, I first scan the data for missing and special values.

I check the columns for missing values.

colSums(is.na(adelaide))

##            Suburb          PostCode               LGA        Population 
##                 0                 0                 0                 0 
## Offences_property   Offences_person 
##                 1                10

There are NA values in offences_person and offences_property. These values occurred in the crime data where suburbs had no offences of these type. The appropriate values for these observations is 0, so I change them.

adelaide$Offences_person[is.na(adelaide$Offences_person)] <- 0
adelaide$Offences_property[is.na(adelaide$Offences_property)] <- 0
colSums(is.na(adelaide))

##            Suburb          PostCode               LGA        Population 
##                 0                 0                 0                 0 
## Offences_property   Offences_person 
##                 0                 0

There are now no missing values. I check for other special values (infinite or NaN)

is.special <-function(x){if
(is.numeric(x)) (is.infinite(x) | is.nan(x))}
sapply(adelaide, function(x) sum(is.special(x)))

##            Suburb          PostCode               LGA        Population 
##                 0                 0                 0                 0 
## Offences_property   Offences_person 
##                 0                 0

There are no special values (infinite or NaN).

Tidy & Manipulate Data II

I create 2 new variables. They are offence rates for each suburb, for both property and person offences. These rates are measured as number of offences for every 100 residents of a suburb. I round these rates to 2 decimal places.

adelaide <- mutate(adelaide,
  Offences_property_rate = round((Offences_property/Population)*100,2),
  Offences_person_rate = round((Offences_person/Population)*100,2))
head(adelaide)

I check for special and NA values again.

colSums(is.na(adelaide))

##                 Suburb               PostCode                    LGA 
##                      0                      0                      0 
##             Population      Offences_property        Offences_person 
##                      0                      0                      0 
## Offences_property_rate   Offences_person_rate 
##                      0                      0

sapply(adelaide, function(x) sum(is.special(x)))

##                 Suburb               PostCode                    LGA 
##                      0                      0                      0 
##             Population      Offences_property        Offences_person 
##                      0                      0                      0 
## Offences_property_rate   Offences_person_rate 
##                      2                      2

There are no NA values, but there are 2 infinite values in the rate columns. This has occurred because of 0 values in population. There are 2 ‘suburbs’ with 0 population, namely Adelaide Airport and Keswick Terminal (railway terminal). These places had offences occur despite no population, hence giving ‘infinite’ rates per person. As these are not ‘real’ suburbs, I will remove them from the data set and check for special values again.

adelaide <- filter(adelaide, Population>0)
sapply(adelaide, function(x) sum(is.special(x)))

##                 Suburb               PostCode                    LGA 
##                      0                      0                      0 
##             Population      Offences_property        Offences_person 
##                      0                      0                      0 
## Offences_property_rate   Offences_person_rate 
##                      0                      0

There are now no infinite or NaN values.

Scan II

To check for outliers, I start by plotting some box plots of numeric variables. I begin with population.

pop <- adelaide$Population %>% 
  boxplot(main = "Box Plot of Adelaide Population",
  ylab="Population", col ="grey")

pop$out

##  [1] 15115  8229  9289 10365 12301 13280  8825 13297 16945 16530  8205  8972
## [13]  9891 10235  8988

The box plot shows that this data is not symmetric and therefore not normal. There are many values that could be classed as outliers according to ‘Tukey’s method’. However, these are correct population figures, and I don’t think they should be changed or removed yet.

Next I look at the property offences for each suburb.

prop <- adelaide$Offences_property %>%
  boxplot(main = "Box Plot of Adelaide property offences",
  ylab = "property offences", col ="grey")

prop$out

##  [1] 5465  520  476  611  569  556 1082  630  901  666 1017  572  928  768  802
## [16]  618 1251  770  525  767  557

There are outliers again, but one of them is quite extreme. I check which suburb it is.

filter(adelaide, Offences_property>4000)

The suburb of Adelaide itself (Adelaide City) has a very large number of property offences, even given its large population. I check the boxplot for person offences.

person <- adelaide$Offences_person %>% 
  boxplot(main = "Box Plot of Adelaide person offences",
  ylab = "offences against person", col ="grey")

person$out

##  [1] 1334   92  125   85  156  135   90  179  175  126  108  167  122  120   87
## [16]  196  187  250   92   81  201  120

There is another extreme outlier and I confirm which suburb it is.

filter(adelaide, Offences_person>800)

The suburb is Adelaide again. I am now interested to see a scatter plot of offences_property vs population.

adelaide %>% plot(Offences_property ~ Population,
    data = ., ylab="property offences", xlab="population",
    main="Property offences by suburb population")

The scatter plot shows that the suburb of Adelaide is an anomaly. I am sure a similar scatter plot of person offences would show the same thing. I think it is appropriate to remove this observation, given that it is the city centre and not really a residential suburb.

I remove the suburb of Adelaide from the data set.

adelaide <- filter(adelaide, Suburb != 'ADELAIDE' )

I check the box plots and scatter plots again (output not shown). Although outliers remain, I see no reason to remove or alter them.

Note that the box plots show this data is not normal, so it is not appropriate to use z-scores or any other parametric methods of detecting outliers.

I now check the box plots of the rate variables.

prorate <- adelaide$Offences_property_rate %>%
  boxplot(main = "Box Plot of Adelaide property offences rate",
  ylab = "offences per 100 inhabitants", col ="grey")

prorate$out

##  [1]  15.05  23.71  75.76  18.80  14.93  15.69  28.17  14.99  15.70  17.89
## [11]  36.28  44.94  16.59  18.62  54.41  52.99  48.12  43.01 112.27 352.50
## [21] 108.33  32.47 204.76  15.25  48.48  15.71  31.11 500.00

perrate <- adelaide$Offences_person_rate %>%
  boxplot(main = "Box Plot of Adelaide person offences rate",
  ylab = "offences per 100 inhabitants", col ="grey")

perrate$out

##  [1]   3.14   2.85   5.63   3.67   4.06   3.00   2.92   3.79  11.24   3.30
## [11]  14.30   6.72   4.08   3.56   4.09 100.00   5.26   4.41   4.13   3.90
## [21]  14.29   3.05   3.03   4.29  16.67

There are many outliers in both box plots. I check the more extreme ones.

filter (adelaide, Offences_property_rate >= 30 )

filter (adelaide, Offences_person_rate >= 3.8 )

The suburbs with the highest offence rates are mostly those with very few residents

I think these ‘rates’ are not really meaningful for small populations. They are too sensitive to small changes in the number of offences committed. I think one way to deal with small suburbs, would be to use average offence rates of neighbouring suburbs instead (impute the mean/median).

A simpler way to handle small suburbs is to simply exclude them from our data. I pick 500 as a cut-off point for population, as a way of eliminating anomalies without losing too much data. I create the new data set, adelaide_large, which consists of inner adelaide suburbs of more than 500 residents.

adelaide_large <- filter(adelaide, Population > 500)
dim(adelaide_large)

## [1] 274   8

New box plots for offence rates are created, using the adelaide_large data,

prorate <- adelaide_large$Offences_property_rate %>%
  boxplot(main = "Box Plot of Adelaide property offences rate",
  ylab = "offences per 100 inhabitants", col ="grey")

prorate$out

##  [1] 15.05 23.71 18.80 14.93 15.69 28.17 14.99 15.70 17.89 36.28 16.59 18.62
## [13] 54.41 14.30 15.25 14.62 31.11

perrate <- adelaide_large$Offences_person_rate %>%
  boxplot(main = "Box Plot of Adelaide person offences rate",
  ylab = "offences per 100 inhabitants", col ="grey")

perrate$out

##  [1]  3.14  2.85  3.67  4.06  3.00  2.92  3.79  3.30 14.30  4.08  4.13  3.05

There are still outliers, but fewer and less extreme. I check which is the most extreme outlier in both box plots.

filter (adelaide_large, Offences_property_rate > 50 |
        Offences_person_rate > 12)

In both cases the most extreme outlier is the suburb of Port Adelaide. However, I see no reason to exclude this observation from our data.

Transform

I now attempt to transform the distribution of the Population variable to a normal distribution. The histogram for this variable (using the original adelaide data set) is given.

hist(adelaide$Population)

This data is clearly right-skewed and not normal. I try a Box-Cox transformation. The histogram and QQ plot of this transformed variable are displayed .

boxcox_pop <- BoxCox (adelaide$Population, lambda = "auto")
hist(boxcox_pop)

car::qqPlot(boxcox_pop)

## [1] 183 284

The shape of the histogram is closer to normal, but the QQ plot does not follow a straight line and has many values outside the confidence intervals.

I try the same steps using the adelaide_large data set. The population distribution with smaller suburbs removed is shown.

hist(adelaide_large$Population)

I apply a Box-Cox transformation again

boxcox_large <- BoxCox(adelaide_large$Population,lambda ="auto")
hist(boxcox_large)

car::qqPlot(boxcox_large)

## [1] 121 134

Although the distribution looks a bit left-skewed, this transformation seems better than the last one, and the points on the QQ plot almost lie within the confidence intervals of the line. I add the transformed variable to the adelaide_large data set.

adelaide_large$Pop_normal <- boxcox_large
adelaide_large <- adelaide_large[ ,c(1:4,9,5:8)]
glimpse(adelaide_large)

## Rows: 274
## Columns: 9
## $ Suburb                 <chr> "NORTH ADELAIDE", "AULDANA", "BEAUMONT", "BE...
## $ PostCode               <fct> 5006, 5072, 5066, 5067, 5066, 5065, 5063, 50...
## $ LGA                    <fct> Adelaide, Burnside, Burnside, Burnside, Burn...
## $ Population             <dbl> 6950, 625, 2557, 1602, 2930, 1678, 764, 1186...
## $ Pop_normal             <dbl> 3.460962, 3.125170, 3.346447, 3.281786, 3.36...
## $ Offences_property      <dbl> 520, 12, 21, 43, 28, 56, 115, 42, 97, 144, 1...
## $ Offences_person        <dbl> 92, 3, 3, 6, 3, 7, 24, 1, 13, 15, 41, 6, 5, ...
## $ Offences_property_rate <dbl> 7.48, 1.92, 0.82, 2.68, 0.96, 3.34, 15.05, 3...
## $ Offences_person_rate   <dbl> 1.32, 0.48, 0.12, 0.37, 0.10, 0.42, 3.14, 0....