Required packages

The following packages are required for importing, manipulating and visualising the data.

RSocrata was was used to gain direct access to the City of Melbourne Open Data repository.

#install.packages("RSocrata")
library(RSocrata)

# Standard package list

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(outliers)
library(lubridate)

# Charts

library(ggplot2)

# R Markdown packages

library(kableExtra)

Executive Summary

City of Melbourne’s population is set to grow beyond 250,000 by 2037 placing increased pressure on local council services. Actual population data was combined with population forecast data for each suburb within the municipality to create population timeline from 2014 to 2037. This was combined with service request volumes for a small subset of council services (7 service categories) from 2014-2017. The combined data set was used to forecast future service request volumes by suburb to 2037. This analysis used a very small proportion of total annual council service requests, and a significant number of requests were unusable because suburb was not recorded. Nevertheless, the forecast shows an almost three-fold increase in service request demand with the greatest growth in the Melbourne central business district. This project serves as a proof of concept for long-range forecasting of customer service demand once a complete catalogue of service request types becomes available.

Data

City of Melbourne publishes almost 200 data sets on the demographics, business, economics, employment and performance of the central city municpality.

This project will combine three data sets available online or published under creative commons:

The objective of the project is to forecast future demand for council services based on population projections.

The three data sets will be joined by two similar, but not identical geographical concepts:

Suburbs are a familiar concept - gazetted areas typically used in addresses.

The Census of Land Use and Employment (CLUE) is a survey of commercial property that City of Mebourne has conducted sporadically since the 1960s; and systematically every two years since 2000.

The CLUE Small Areas are subsets of suburbs based on in industrial zones and natural land use regions within the municipality.

The CLUE Small Areas will need to be combined and mutated to match Suburbs, for data joining purposes.

The first section accesses the data.melbourne.vic.gov.au open data repository directly in RStudio to access the two open data sets for this project.

Explore the CoM Open Data Repository

CoM has published almost 200 data sets including demographics, civic assets, maps, and performance information.

The repository can be accessed directly within R using the RSocrata package.

# Explore the CoM Open Data Repository

ls.socrata("https://data.melbourne.vic.gov.au")
com_data <- ls.socrata("https://data.melbourne.vic.gov.au")

head(com_data)
names(com_data)
##  [1] "accessLevel"  "landingPage"  "issued"       "@type"       
##  [5] "modified"     "keyword"      "contactPoint" "publisher"   
##  [9] "identifier"   "description"  "title"        "distribution"
## [13] "license"      "theme"

Import Data

Import population forecast data

The first data set to be used is forcast population by CLUE Small Areas.

# Get data - Population Forecasts

# display the details of the data set.
t(com_data[,c(11,14, 2, 13)] %>% 
  filter(title  == "Population Forecasts by Small Area")) %>%
  kable() %>%
  kable_styling()
title Population Forecasts by Small Area
theme People & Events
landingPage https://data.melbourne.vic.gov.au/d/4ajz-qakh
license http://creativecommons.org/licenses/by/4.0/legalcode

Import the data directly from the open data repository.

# import the popultaion forecast data
com_pop_fc <- 
  read.socrata("https://data.melbourne.vic.gov.au/d/4ajz-qakh")

# check the imported data
head(com_pop_fc)
# check the dimensions and data types
str(com_pop_fc)
## 'data.frame':    273 obs. of  3 variables:
##  $ Geography : chr  "Melbourne (CBD)" "Melbourne (CBD)" "Melbourne (CBD)" "Melbourne (CBD)" ...
##  $ Year      : int  2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 ...
##  $ Population: int  42589 44087 46442 51136 53795 55854 58336 61086 63802 65529 ...
# standardise on lowercase column names
names(com_pop_fc) <- 
  tolower(names(com_pop_fc))

# convert the Small Areas to factors 
com_pop_fc$geography <- 
  as.factor(com_pop_fc$geography)

# display the years as factors to convert them to ordered factors
levels(as.factor(com_pop_fc$year))
##  [1] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "2024" "2025" "2026"
## [11] "2027" "2028" "2029" "2030" "2031" "2032" "2033" "2034" "2035" "2036"
## [21] "2037"
# convert the years to ordered factors
com_pop_fc$year <- 
  factor(com_pop_fc$year, levels=c("2017","2018","2019","2020","2021","2022","2023","2024","2025","2026","2027","2028","2029","2030","2031","2032","2033","2034","2035","2036","2037"))

#check the resulting data frame has correct data types
str(com_pop_fc)
## 'data.frame':    273 obs. of  3 variables:
##  $ geography : Factor w/ 13 levels "Carlton","Docklands",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ year      : Factor w/ 21 levels "2017","2018",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ population: int  42589 44087 46442 51136 53795 55854 58336 61086 63802 65529 ...

Import cutomer request data

# Get data - Customer Service Requests

# Look for customer data
com_data[,c(11,14)] %>% 
  filter(theme == "Finance & Operations", str_detect(title,"Customer"))
# Display the desired data set
t(com_data[,c(11,14, 2, 13)] %>% 
  filter(title  == "Customer service requests, with resolution time")) %>%
  kable() %>%
  kable_styling()
title Customer service requests, with resolution time
theme Finance & Operations
landingPage https://data.melbourne.vic.gov.au/d/ht4h-vqbu
license http://creativecommons.org/licenses/by/4.0/legalcode

Import the desired data set direct from the open data repository

# import data
com_cust <- 
  read.socrata("https://data.melbourne.vic.gov.au/d/ht4h-vqbu")

# check data size and types
str(com_cust)
## 'data.frame':    78639 obs. of  7 variables:
##  $ REQUEST_STATUS  : chr  "CLOSED    " "CLOSED    " "CLOSED    " "CLOSED    " ...
##  $ DATE_RECEIVED   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ DATE_COMPLETED  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ SUBURB          : chr  "Carlton" "Melbourne" "Carlton" "Melbourne" ...
##  $ CATEGORY        : chr  "Graffiti" "Parks and Trees" "Waste, Street Cleaning and Litter " "Waste, Street Cleaning and Litter " ...
##  $ SERVICE_DESC    : chr  "Graffiti Removal " "Tree Maintenance Services" "Public Litter Bin " "Missed Bin Collection" ...
##  $ DAYS_TO_COMPLETE: int  23 28 5 3 3 3 3 2 4 10 ...
names(com_cust)
## [1] "REQUEST_STATUS"   "DATE_RECEIVED"    "DATE_COMPLETED"  
## [4] "SUBURB"           "CATEGORY"         "SERVICE_DESC"    
## [7] "DAYS_TO_COMPLETE"
# convert select variables to factors
cols <- 
  c("REQUEST_STATUS","SUBURB","CATEGORY","SERVICE_DESC")

com_cust[cols] <- 
  lapply(com_cust[cols], factor)

# standardise on lowercase 
names(com_cust) <- 
  tolower(names(com_cust))

str(com_cust)
## 'data.frame':    78639 obs. of  7 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 15 levels "","Carlton","Carlton North",..: 2 8 2 8 14 8 9 8 5 15 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...
levels(com_cust$suburb)
##  [1] ""                "Carlton"         "Carlton North"  
##  [4] "Docklands"       "East Melbourne"  "Flemington"     
##  [7] "Kensington"      "Melbourne"       "North Melbourne"
## [10] "Parkville"       "Port Melbourne"  "South Wharf"    
## [13] "South Yarra"     "Southbank"       "West Melbourne"
names(com_cust)
## [1] "request_status"   "date_received"    "date_completed"  
## [4] "suburb"           "category"         "service_desc"    
## [7] "days_to_complete"
count(com_cust %>% 
        group_by(suburb))

Actual population data

The next data set was provided by CoM. This data was derived from three separate open data respositories and spatially joined by an analyst at City of Melbourne and provided as a csv file.

# ACTUAL POPULATION DATA FROM THE PAST

# import CSV file
com_pop_act <- 
  read_csv("com_population.csv")
## Parsed with column specification:
## cols(
##   `Small Area` = col_character(),
##   Year = col_integer(),
##   Population = col_integer()
## )
# check the shape of the file
str(com_pop_act)
## Classes 'tbl_df', 'tbl' and 'data.frame':    91 obs. of  3 variables:
##  $ Small Area: chr  "Carlton" "Carlton" "Carlton" "Carlton" ...
##  $ Year      : int  2011 2012 2013 2014 2015 2016 2017 2011 2012 2013 ...
##  $ Population: int  15126 15609 16723 17846 18986 20245 21797 6196 6802 7988 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ Small Area: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Population: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
# match the headings to the forecast data
names(com_pop_act) <- 
  c("geography","year","population")

# check the year column
levels(as.factor(com_pop_act$year))
## [1] "2011" "2012" "2013" "2014" "2015" "2016" "2017"
# convert the year column to an ordered factor
com_pop_act$year <- 
  factor(com_pop_act$year, levels=c("2011","2012","2013","2014","2015","2016","2017"))

# convert the geography column to factor 
com_pop_act$geography <- 
  factor(com_pop_act$geography)

Understand

Three data sets have been imported:

The objective of this project is to combine these three data sets with a view to forecasting future customer request volumes based on population forecasts.

Population Data

To begin with lets compare the two population data sets.

# Comparing population data

# compare columns
names(com_pop_act)
## [1] "geography"  "year"       "population"
names (com_pop_fc)
## [1] "geography"  "year"       "population"
# compare geography

levels(com_pop_act$geography)
##  [1] "Carlton"                      "Docklands"                   
##  [3] "East Melbourne"               "Kensington"                  
##  [5] "Melbourne (CBD)"              "Melbourne (Remainder)"       
##  [7] "North Melbourne"              "Parkville"                   
##  [9] "Port Melbourne"               "South Yarra"                 
## [11] "Southbank"                    "West Melbourne (Industrial)" 
## [13] "West Melbourne (Residential)"
levels(com_pop_fc$geography)
##  [1] "Carlton"                      "Docklands"                   
##  [3] "East Melbourne"               "Kensington"                  
##  [5] "Melbourne (CBD)"              "Melbourne (Remainder)"       
##  [7] "North Melbourne"              "Parkville"                   
##  [9] "Port Melbourne"               "South Yarra"                 
## [11] "Southbank"                    "West Melbourne (Industrial)" 
## [13] "West Melbourne (Residential)"
# compare years

levels(com_pop_act$year)
## [1] "2011" "2012" "2013" "2014" "2015" "2016" "2017"
levels(com_pop_fc$year)
##  [1] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "2024" "2025" "2026"
## [11] "2027" "2028" "2029" "2030" "2031" "2032" "2033" "2034" "2035" "2036"
## [21] "2037"

The following features are noted:

  • The geography variable relates to CLUE Small Area
    • these are similar to suburb
    • some suburbs have been subdivided reflecting different land uses (residential, commercial, industrial)
    • these will need to be recombined into suburbs to match the customer service data
  • In relation to the year variable
    • the 2017 observation for forecast population will need to be dropped due to the overlap
    • the 2011-2013 actual population observations will need to be dropped because they don’t overlap the customer service data

Customer Service Data

The City of Melbourne is in the process of converting from a legacy customer relationship management system to a modern cloud-based solution.

The data available here is derived from the new system.

head(com_cust)
dim(com_cust)
## [1] 78639     7
str(com_cust)
## 'data.frame':    78639 obs. of  7 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 15 levels "","Carlton","Carlton North",..: 2 8 2 8 14 8 9 8 5 15 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...

Because the data is derived from the new system only a few customer service types have been converted. Only the following 7 service request categories are available in this data set. This probably represents around 5% of the total number of different categories.

As a result the volumes of service requests are probably less than 20% of total service requests recorded as most are still recorded in the legacy system at the time of this analysis.

levels(com_cust$category)
## [1] "Asset maintenance"                 
## [2] "Graffiti"                          
## [3] "Parking"                           
## [4] "Parks and Trees"                   
## [5] "Roads and Traffic"                 
## [6] "Waste, Street Cleaning and Litter" 
## [7] "Waste, Street Cleaning and Litter "
count(com_cust %>%
  group_by(category)) %>%
  kable() %>%
  kable_styling()
category n
Asset maintenance 1950
Graffiti 14454
Parking 9025
Parks and Trees 8847
Roads and Traffic 6659
Waste, Street Cleaning and Litter 2989
Waste, Street Cleaning and Litter 34715

Tidy & Manipulate Data I

The City of Melbourne research team has adopted the Tidy principles so the data sets used here already comply.

head(com_cust) %>%
  kable() %>%
  kable_styling()
request_status date_received date_completed suburb category service_desc days_to_complete
CLOSED 2015-12-20 2016-01-12 Carlton Graffiti Graffiti Removal 23
CLOSED 2016-07-28 2016-08-25 Melbourne Parks and Trees Tree Maintenance Services 28
CLOSED 2016-07-28 2016-08-02 Carlton Waste, Street Cleaning and Litter Public Litter Bin 5
CLOSED 2016-09-02 2016-09-05 Melbourne Waste, Street Cleaning and Litter Missed Bin Collection 3
CLOSED 2016-09-02 2016-09-05 Southbank Waste, Street Cleaning and Litter Missed Bin Collection 3
CLOSED 2016-08-08 2016-08-11 Melbourne Graffiti Graffiti Removal 3

For the purposes of this project the following fields wont be needed:

A clean year field will need to be created from the date_recieved variable.

Scan I

In this step a signifcant number of missing values are dealt with.

Customer Requests - Missing values

The com_cust data set is around 78,000 observations, however ~50,000 have no suburb recorded. This is because the new system built by council encourages customers to provide GPS coordinates from their phone, rather than type in an address (Lat/Long is not available in the data set).

The null values in the com_cust$suburb variable were marked as “” and not picked up by is.na().

Nonetheless, they were discovered and those observations (unfortunately) omitted.

# now I have to deal with the profoundly disappointing number of missing values

com_cust$suburb[com_cust$suburb == ""] <- NA

# have no choice but to omit the NA records, there's no way to impute the missing values from this data

com_cust <- 
  na.omit(com_cust)

com_cust$suburb <- 
  droplevels(com_cust$suburb)

str(com_cust)
## 'data.frame':    27936 obs. of  7 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 14 levels "Carlton","Carlton North",..: 1 7 1 7 13 7 8 7 4 14 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:50703] 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 ...
##   .. ..- attr(*, "names")= chr [1:50703] "6253" "6254" "6255" "6256" ...

Double check for other missing values

# Double check there are no NA values in other data sets
# there are 1301 in the data_completed variable

sum(is.na(com_cust$request_status))
## [1] 0
sum(is.na(com_cust$date_received))
## [1] 0
sum(is.na(com_cust$date_completed))
## [1] 0
sum(is.na(com_cust$suburb))
## [1] 0
sum(is.na(com_cust$category))
## [1] 0
sum(is.na(com_cust$service_desc))
## [1] 0
sum(is.na(com_cust$days_to_complete))
## [1] 0
names(com_cust)
## [1] "request_status"   "date_received"    "date_completed"  
## [4] "suburb"           "category"         "service_desc"    
## [7] "days_to_complete"
# Check population forecast data

names(com_pop_fc)
## [1] "geography"  "year"       "population"
sum(is.na(com_pop_fc$geography))
## [1] 0
sum(is.na(com_pop_fc$year))
## [1] 0
sum(is.na(com_pop_fc$population))
## [1] 0
head(com_pop_fc)
# check actual population data for null values

sum(is.na(com_pop_act$population))
## [1] 0
sum(is.na(com_pop_act$year))
## [1] 0
sum(is.na(com_pop_act$geography))
## [1] 0

Tidy & Manipulate Data II

The output of this section will be a single combined data set comprised of:

The steps taken will be:

  1. Combine actual and forecast population data (rbind)
  2. Create a suburb varable from CLUE small areas (mutate)
  3. Create a year variable in customer request data (lubridate)
  4. Summarise customer request data by year (group_by)
  5. Combine suburbs in customer request data to match population data (mutate)
  6. Join population and customer request data (full_join)

Combine actual and forecast population data (Step 1 of 6)

com_pop_fc[com_pop_fc$year != "2017",]
# note I cant really use population data from 2011-13, so I'll drop it

com_pop_act_sub <- 
  com_pop_act %>% 
  filter(year %in% c("2014","2015","2016","2017"))

com_pop_act_sub
com_pop_act_sub$year <- 
  droplevels(com_pop_act_sub$year)

# combine the two data sets
com_pop <- 
  rbind(com_pop_act_sub, com_pop_fc[com_pop_fc$year != "2017",])

# check the result
str(com_pop)
## Classes 'tbl_df', 'tbl' and 'data.frame':    312 obs. of  3 variables:
##  $ geography : Factor w/ 13 levels "Carlton","Docklands",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ year      : Factor w/ 24 levels "2014","2015",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ population: int  17846 18986 20245 21797 9170 10444 11832 13675 5164 5197 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ Small Area: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Population: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
levels(com_pop$year)
##  [1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
## [11] "2024" "2025" "2026" "2027" "2028" "2029" "2030" "2031" "2032" "2033"
## [21] "2034" "2035" "2036" "2037"
levels (com_pop$geography)
##  [1] "Carlton"                      "Docklands"                   
##  [3] "East Melbourne"               "Kensington"                  
##  [5] "Melbourne (CBD)"              "Melbourne (Remainder)"       
##  [7] "North Melbourne"              "Parkville"                   
##  [9] "Port Melbourne"               "South Yarra"                 
## [11] "Southbank"                    "West Melbourne (Industrial)" 
## [13] "West Melbourne (Residential)"

Create a suburb varable from CLUE small areas (step 2 of 6)

com_pop <- 
  com_pop %>% 
  mutate(suburb = 
                     ifelse(geography %in% 
                              c("Melbourne (CBD)","Melbourne (Remainder)"), "Melbourne",
                     ifelse(geography %in% 
                              c("West Melbourne (Industrial)","West Melbourne (Residential)"), "West Melbourne",
                     as.character(geography))))

head(com_pop) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
geography year population suburb
Carlton 2014 17846 Carlton
Carlton 2015 18986 Carlton
Carlton 2016 20245 Carlton
Carlton 2017 21797 Carlton
Docklands 2014 9170 Docklands
Docklands 2015 10444 Docklands
# collate the population data from duplicate suburbs
com_pop <- 
  com_pop %>% 
  group_by(year, suburb) %>% 
  summarise(total_pop = sum(population))

# tidy up the column names (change total_pop back to population)
names(com_pop)
## [1] "year"      "suburb"    "total_pop"
names(com_pop) <- 
  c("year","suburb","population")

str(com_pop)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  264 obs. of  3 variables:
##  $ year      : Factor w/ 24 levels "2014","2015",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ suburb    : chr  "Carlton" "Docklands" "East Melbourne" "Kensington" ...
##  $ population: int  17846 9170 5164 11119 35936 14613 7447 11 4301 17351 ...
##  - attr(*, "vars")= chr "year"
##  - attr(*, "drop")= logi TRUE
com_pop$suburb <- 
  factor(com_pop$suburb)
com_pop

Create a year variable in customer request data (step 3 of 6)

# check the format of the date_recieved field
head(com_cust$date_received)
## [1] "2015-12-20 AEDT" "2016-07-28 AEST" "2016-07-28 AEST" "2016-09-02 AEST"
## [5] "2016-09-02 AEST" "2016-08-08 AEST"
# use lubridate to generate year value

com_cust <-
  com_cust %>%
    mutate(year = year(date_completed))

# check it worked - new variables created
names(com_cust)
## [1] "request_status"   "date_received"    "date_completed"  
## [4] "suburb"           "category"         "service_desc"    
## [7] "days_to_complete" "year"
head(com_cust$year)
## [1] 2016 2016 2016 2016 2016 2016
str(com_cust)
## 'data.frame':    27936 obs. of  8 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 14 levels "Carlton","Carlton North",..: 1 7 1 7 13 7 8 7 4 14 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...
##  $ year            : num  2016 2016 2016 2016 2016 ...
# check the years created
levels(as.factor(com_cust$year))
## [1] "2014" "2015" "2016"

Summarise customer request data by year (step 4 of 6)

# convert the year variable to a factor
com_cust$year <- 
  as.factor(com_cust$year)

# strip out the unneccessary columns and summarise service requests by year
cust_sum <- 
  com_cust[,c(8,4)] %>% 
  group_by(year, suburb) %>% 
  summarise(sr_count=n())

# fix the column names
names(cust_sum) <- 
  c("year","suburb","sr_count")

# check it worked
head(cust_sum)

Combine suburbs in customer request data to match population data (step 5 of 6)

# I will combine the following
#  - Carlton and Carlton North
#  - Kensington and Flemington
#  - South Wharf and Docklands

cust_sum <- 
  cust_sum %>% 
  mutate(suburb = ifelse(suburb %in% 
                           c("Carlton","Carlton North"), "Carlton", 
                  ifelse(suburb %in% 
                           c("Flemington","Kensington"), "Kensington", 
                  ifelse(suburb %in% 
                           c("South Wharf","Docklands"), "Docklands", as.character(suburb)))))

# combine the total service requests in the now duplicate suburbs
cust_sum <- 
  cust_sum %>% 
  group_by(year, suburb) %>% 
  summarise(tot_srs = sum(sr_count))

# tidy up the column names
names(cust_sum) <- 
  c("year","suburb","sr_count")

# convert the suburb variable back to a factor
cust_sum$suburb <- 
  factor(cust_sum$suburb)

Join population and customer request data (step 6 of 6)

# join the population data to the customer servie data
pop_cust <- 
  com_pop %>% 
  full_join(cust_sum, by= c("year", "suburb"))
## Warning: Column `year` joining factors with different levels, coercing to
## character vector
# check it worked
str(pop_cust)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  264 obs. of  4 variables:
##  $ year      : chr  "2014" "2014" "2014" "2014" ...
##  $ suburb    : Factor w/ 11 levels "Carlton","Docklands",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ population: int  17846 9170 5164 11119 35936 14613 7447 11 4301 17351 ...
##  $ sr_count  : int  184 38 102 183 424 114 81 10 75 39 ...
##  - attr(*, "vars")= chr "year"
##  - attr(*, "drop")= logi TRUE
# fix the year variable
pop_cust$year <- 
  factor(pop_cust$year, levels = c("2014","2015","2016","2017","2018","2019","2020","2021","2022","2023","2024","2025","2026","2027","2028","2029","2030","2031","2032","2033","2034","2035","2036","2037"))

levels(pop_cust$year)
##  [1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
## [11] "2024" "2025" "2026" "2027" "2028" "2029" "2030" "2031" "2032" "2033"
## [21] "2034" "2035" "2036" "2037"
# fix the suburb variable
pop_cust$suburb <- 
  factor(pop_cust$suburb)

Overview of output to this point

Now I have a combined population and service request volume data set.

Population data

kable(pop_cust[,c(1:3)] %>% 
        spread(key = suburb, value = population)) %>% 
        kable_styling(bootstrap_options = c("striped", "hover"))
year Carlton Docklands East Melbourne Kensington Melbourne North Melbourne Parkville Port Melbourne South Yarra Southbank West Melbourne
2014 17846 9170 5164 11119 35936 14613 7447 11 4301 17351 5031
2015 18986 10444 5197 11381 39500 15299 7682 11 4284 18623 5480
2016 20245 11832 5251 11621 42557 16197 7898 10 4304 20285 5912
2017 21797 13675 5427 11765 47724 17251 8276 10 4553 22181 6404
2018 21214 13113 5550 11865 46181 17310 7982 10 4699 21894 5709
2019 21961 14282 5598 12192 48741 18761 8105 10 4705 22569 5744
2020 22680 15893 5752 12492 53728 20126 8349 10 4719 23359 5749
2021 23392 16916 5872 12948 56483 21327 8500 11 4727 24262 5938
2022 23580 17909 5948 13377 58587 22060 8538 11 4732 25905 6135
2023 23769 18801 6010 13792 61109 22803 8658 321 4739 27749 6383
2024 23920 19754 6062 14198 63896 23355 8729 670 4753 29747 6698
2025 24061 20349 6100 14594 66647 23865 8844 1062 4774 31681 7046
2026 24148 20817 6155 14855 68404 24290 9006 1494 4790 33209 7314
2027 24261 21294 6209 15123 70205 24710 9091 1972 4804 34721 7568
2028 24341 21764 6252 15397 71617 25117 9248 2254 4816 36123 7675
2029 24338 22178 6287 15689 72553 25428 9273 2800 4827 37422 7844
2030 24371 22557 6307 15975 73580 25735 9316 2905 4840 38764 8201
2031 24385 22874 6354 16264 74567 25998 9346 3488 4851 40122 9038
2032 24439 23194 6398 16564 75493 26257 9398 4009 4859 41489 10080
2033 24494 23515 6463 16869 76474 26526 9461 4684 4864 42876 11292
2034 24524 23819 6541 17178 77390 26779 9485 5450 4867 44277 12596
2035 24547 24116 6629 17491 78318 27033 9518 6267 4866 45708 13233
2036 24559 24406 6651 17810 79238 27282 9563 7110 4861 47159 13890
2037 24548 24694 6677 17848 80095 27290 9581 7945 4854 48610 14313

Customer request data

The customer request dta is NUL for all future years. In the next steps I will forecast future service request volumes.

pop_cust[,c(1,2,4)] %>%
  spread(key = suburb, value = sr_count) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
year Carlton Docklands East Melbourne Kensington Melbourne North Melbourne Parkville Port Melbourne South Yarra Southbank West Melbourne
2014 184 38 102 183 424 114 81 10 75 39 67
2015 2095 351 753 1446 4009 1556 773 76 502 354 748
2016 2068 467 917 1585 4475 1620 855 69 462 537 901
2017 NA NA NA NA NA NA NA NA NA NA NA
2018 NA NA NA NA NA NA NA NA NA NA NA
2019 NA NA NA NA NA NA NA NA NA NA NA
2020 NA NA NA NA NA NA NA NA NA NA NA
2021 NA NA NA NA NA NA NA NA NA NA NA
2022 NA NA NA NA NA NA NA NA NA NA NA
2023 NA NA NA NA NA NA NA NA NA NA NA
2024 NA NA NA NA NA NA NA NA NA NA NA
2025 NA NA NA NA NA NA NA NA NA NA NA
2026 NA NA NA NA NA NA NA NA NA NA NA
2027 NA NA NA NA NA NA NA NA NA NA NA
2028 NA NA NA NA NA NA NA NA NA NA NA
2029 NA NA NA NA NA NA NA NA NA NA NA
2030 NA NA NA NA NA NA NA NA NA NA NA
2031 NA NA NA NA NA NA NA NA NA NA NA
2032 NA NA NA NA NA NA NA NA NA NA NA
2033 NA NA NA NA NA NA NA NA NA NA NA
2034 NA NA NA NA NA NA NA NA NA NA NA
2035 NA NA NA NA NA NA NA NA NA NA NA
2036 NA NA NA NA NA NA NA NA NA NA NA
2037 NA NA NA NA NA NA NA NA NA NA NA

Scan II

In this section I generate the forecast service request volumes then check for anomalies and outliers.

Generate estimated service request volumes

First I generate estimates service request volumes for future years based on population forecasts.

This can be done in one step. But some wild outliers are generated which will need to be cleaned up later. So I’ll do it in two steps.

# generate a service requests per population variable
pop_cust <- 
  pop_cust %>% 
  group_by(suburb) %>% 
  mutate(srs_per_pop = mean(sr_count/population, na.rm=TRUE))

# generate an estimate of service reuqest volumes per year for each suburb
pop_cust <- 
  pop_cust %>% 
  group_by(suburb) %>% 
  mutate(est_srs = population * srs_per_pop)

head(pop_cust)

Outliers and anomalies

First lets compare the estimated service request volumes to the real ones in the first three years.

pop_cust_g <-
  pop_cust %>% 
  gather(key = "source", value = "srs", sr_count, est_srs, -year,-suburb, -population, -srs_per_pop, na.rm = TRUE)

ggplot(pop_cust_g, aes(x = source, y = srs)) +
  geom_boxplot() + 
  facet_wrap(~suburb) + 
  ggtitle("SR Count vs Source for various Suburbs") +
  theme(plot.title = element_text(size=14, face="bold"))

You can see that the estimates for Port Melbourne are totally wrong, by a factor of 50-500. This is because Port Melbourne is almost entirely made up of the port and currently has a very low population (~10). But when the population estamates progress, Port Melbourne is envisaged to have a growing population and the multiplyer effect throws out the service request volume estimates badly.

# revisit the pop_cust data frame
head(pop_cust)
# identify the major outliers in the srs_per_pop variable
boxplot(pop_cust$srs_per_pop,main="Service Requests per capita", 
     ylab="SRs per capita")

# identify them numerically
z.scores <- 
  pop_cust$srs_per_pop %>%  
  scores(type = "z")

z.scores %>% 
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3558 -0.3171 -0.3139  0.0000 -0.2949  3.1557
# use the z-score method to find and replace with the mean (excluding Port Melb)
pop_cust$srs_per_pop[ which( abs(z.scores) >3 )] <- 
  mean(pop_cust$srs_per_pop[-which( abs(z.scores) >3 )], na.rm = TRUE)

# recalculate estimated service request volumes

pop_cust <- 
  pop_cust %>% 
  group_by(suburb) %>% 
  mutate(est_srs = population * srs_per_pop)

# redraw the plot to confirm Port Melbourne is no longer an outlier

pop_cust_g2 <- 
  pop_cust %>% 
  gather(key = "source", value = "srs", sr_count, est_srs, -year,-suburb, -population, -srs_per_pop, na.rm = TRUE)

ggplot(pop_cust_g2, aes(x = source, y = srs)) +
  geom_boxplot() + 
  facet_wrap(~suburb) + 
  ggtitle("SR Count vs Source for various Suburbs") +
  theme(plot.title = element_text(size=14, face="bold"))

Final outcome

The final outcome of the forecasting is best represented as a chart.

ggplot() +
  geom_line(data = pop_cust, aes(x = as.numeric(paste(year)), y = est_srs, color=suburb), size = .1) +
  ggtitle("Forecast Service Request Volumes\n\n")+
  xlab("year") +
  ylab("service requests")

Transform

You can see in the chart above that the forecast service request volume for Melbourne is a curve. Lets apply a transform to make this linear.

# filter out the Melbourne forecast
melb <-
  pop_cust %>%
    filter(suburb == "Melbourne")

# cube the estimated service request volumes
melb <-
  melb %>%
    mutate(cube_srs = (est_srs)^3)
# check it worked
head(melb)
# plot it to see if its more linear
ggplot() +
  geom_line(data = melb, aes(x = as.numeric(paste(year)), y = melb$cube_srs, color=suburb), size = .1) +
  ggtitle("Forecast Service Request Volumes\n\n")+
  xlab("year") +
  ylab("cube of service requests")