MATH2349 Semester 2, 2018

Required packages

The following packages are required for importing, manipulating and visualising the data.

RSocrata was was used to gain direct access to the City of Melbourne Open Data repository.

#install.packages("RSocrata")
library(RSocrata)

# Standard package list

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(outliers)
library(lubridate)

# Charts

library(ggplot2)

# R Markdown packages

library(kableExtra)

Executive Summary

City of Melbourne’s population is set to grow beyond 250,000 by 2037 placing increased pressure on local council services. Actual population data was combined with population forecast data for each suburb within the municipality to create population timeline from 2014 to 2037. This was combined with service request volumes for a small subset of council services (7 service categories) from 2014-2017. The combined data set was used to forecast future service request volumes by suburb to 2037. This analysis used a very small proportion of total annual council service requests, and a significant number of requests were unusable because suburb was not recorded. Nevertheless, the forecast shows an almost three-fold increase in service request demand with the greatest growth in the Melbourne central business district. This project serves as a proof of concept for long-range forecasting of customer service demand once a complete catalogue of service request types becomes available.

Data

City of Melbourne publishes almost 200 data sets on the demographics, business, economics, employment and performance of the central city municpality.

This project will combine three data sets available online or published under creative commons:

Actual population figures by Small Area
- spatially joined and converted to csv by Daniel Booth (Daniel.Booth@melbourne.vic.gov.au) from the following sources:
- http://www.qgso.qld.gov.au/subjects/demography/population-estimates/tables/erp-sa1-aus-consult/index.php
- http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1270.0.55.003July%202016?OpenDocument
- https://data.melbourne.vic.gov.au/Economy/Small-Areas-for-Census-of-Land-Use-and-Employment-/gei8-3w86
Population Forecasts by Small Area
- https://data.melbourne.vic.gov.au/d/4ajz-qakh
Customer service requests, with resolution time
- https://data.melbourne.vic.gov.au/d/ht4h-vqbu

The objective of the project is to forecast future demand for council services based on population projections.

The three data sets will be joined by two similar, but not identical geographical concepts:

Suburb
CLUE Small Area

Suburbs are a familiar concept - gazetted areas typically used in addresses.

The Census of Land Use and Employment (CLUE) is a survey of commercial property that City of Mebourne has conducted sporadically since the 1960s; and systematically every two years since 2000.

The CLUE Small Areas are subsets of suburbs based on in industrial zones and natural land use regions within the municipality.

The CLUE Small Areas will need to be combined and mutated to match Suburbs, for data joining purposes.

The first section accesses the data.melbourne.vic.gov.au open data repository directly in RStudio to access the two open data sets for this project.

Explore the CoM Open Data Repository

CoM has published almost 200 data sets including demographics, civic assets, maps, and performance information.

The repository can be accessed directly within R using the RSocrata package.

# Explore the CoM Open Data Repository

ls.socrata("https://data.melbourne.vic.gov.au")

com_data <- ls.socrata("https://data.melbourne.vic.gov.au")

head(com_data)

names(com_data)

##  [1] "accessLevel"  "landingPage"  "issued"       "@type"       
##  [5] "modified"     "keyword"      "contactPoint" "publisher"   
##  [9] "identifier"   "description"  "title"        "distribution"
## [13] "license"      "theme"

Import Data

Import population forecast data

The first data set to be used is forcast population by CLUE Small Areas.

# Get data - Population Forecasts

# display the details of the data set.
t(com_data[,c(11,14, 2, 13)] %>% 
  filter(title  == "Population Forecasts by Small Area")) %>%
  kable() %>%
  kable_styling()

title	Population Forecasts by Small Area
theme	People & Events
landingPage	https://data.melbourne.vic.gov.au/d/4ajz-qakh
license	http://creativecommons.org/licenses/by/4.0/legalcode

Import the data directly from the open data repository.

# import the popultaion forecast data
com_pop_fc <- 
  read.socrata("https://data.melbourne.vic.gov.au/d/4ajz-qakh")

# check the imported data
head(com_pop_fc)

# check the dimensions and data types
str(com_pop_fc)

## 'data.frame':    273 obs. of  3 variables:
##  $ Geography : chr  "Melbourne (CBD)" "Melbourne (CBD)" "Melbourne (CBD)" "Melbourne (CBD)" ...
##  $ Year      : int  2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 ...
##  $ Population: int  42589 44087 46442 51136 53795 55854 58336 61086 63802 65529 ...

# standardise on lowercase column names
names(com_pop_fc) <- 
  tolower(names(com_pop_fc))

# convert the Small Areas to factors 
com_pop_fc$geography <- 
  as.factor(com_pop_fc$geography)

# display the years as factors to convert them to ordered factors
levels(as.factor(com_pop_fc$year))

##  [1] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "2024" "2025" "2026"
## [11] "2027" "2028" "2029" "2030" "2031" "2032" "2033" "2034" "2035" "2036"
## [21] "2037"

# convert the years to ordered factors
com_pop_fc$year <- 
  factor(com_pop_fc$year, levels=c("2017","2018","2019","2020","2021","2022","2023","2024","2025","2026","2027","2028","2029","2030","2031","2032","2033","2034","2035","2036","2037"))

#check the resulting data frame has correct data types
str(com_pop_fc)

## 'data.frame':    273 obs. of  3 variables:
##  $ geography : Factor w/ 13 levels "Carlton","Docklands",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ year      : Factor w/ 21 levels "2017","2018",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ population: int  42589 44087 46442 51136 53795 55854 58336 61086 63802 65529 ...

Import cutomer request data

# Get data - Customer Service Requests

# Look for customer data
com_data[,c(11,14)] %>% 
  filter(theme == "Finance & Operations", str_detect(title,"Customer"))

# Display the desired data set
t(com_data[,c(11,14, 2, 13)] %>% 
  filter(title  == "Customer service requests, with resolution time")) %>%
  kable() %>%
  kable_styling()

title	Customer service requests, with resolution time
theme	Finance & Operations
landingPage	https://data.melbourne.vic.gov.au/d/ht4h-vqbu
license	http://creativecommons.org/licenses/by/4.0/legalcode

Import the desired data set direct from the open data repository

# import data
com_cust <- 
  read.socrata("https://data.melbourne.vic.gov.au/d/ht4h-vqbu")

# check data size and types
str(com_cust)

## 'data.frame':    78639 obs. of  7 variables:
##  $ REQUEST_STATUS  : chr  "CLOSED    " "CLOSED    " "CLOSED    " "CLOSED    " ...
##  $ DATE_RECEIVED   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ DATE_COMPLETED  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ SUBURB          : chr  "Carlton" "Melbourne" "Carlton" "Melbourne" ...
##  $ CATEGORY        : chr  "Graffiti" "Parks and Trees" "Waste, Street Cleaning and Litter " "Waste, Street Cleaning and Litter " ...
##  $ SERVICE_DESC    : chr  "Graffiti Removal " "Tree Maintenance Services" "Public Litter Bin " "Missed Bin Collection" ...
##  $ DAYS_TO_COMPLETE: int  23 28 5 3 3 3 3 2 4 10 ...

names(com_cust)

## [1] "REQUEST_STATUS"   "DATE_RECEIVED"    "DATE_COMPLETED"  
## [4] "SUBURB"           "CATEGORY"         "SERVICE_DESC"    
## [7] "DAYS_TO_COMPLETE"

# convert select variables to factors
cols <- 
  c("REQUEST_STATUS","SUBURB","CATEGORY","SERVICE_DESC")

com_cust[cols] <- 
  lapply(com_cust[cols], factor)

# standardise on lowercase 
names(com_cust) <- 
  tolower(names(com_cust))

str(com_cust)

## 'data.frame':    78639 obs. of  7 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 15 levels "","Carlton","Carlton North",..: 2 8 2 8 14 8 9 8 5 15 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...

levels(com_cust$suburb)

##  [1] ""                "Carlton"         "Carlton North"  
##  [4] "Docklands"       "East Melbourne"  "Flemington"     
##  [7] "Kensington"      "Melbourne"       "North Melbourne"
## [10] "Parkville"       "Port Melbourne"  "South Wharf"    
## [13] "South Yarra"     "Southbank"       "West Melbourne"

names(com_cust)

## [1] "request_status"   "date_received"    "date_completed"  
## [4] "suburb"           "category"         "service_desc"    
## [7] "days_to_complete"

count(com_cust %>% 
        group_by(suburb))

Actual population data

The next data set was provided by CoM. This data was derived from three separate open data respositories and spatially joined by an analyst at City of Melbourne and provided as a csv file.

# ACTUAL POPULATION DATA FROM THE PAST

# import CSV file
com_pop_act <- 
  read_csv("com_population.csv")

## Parsed with column specification:
## cols(
##   `Small Area` = col_character(),
##   Year = col_integer(),
##   Population = col_integer()
## )

# check the shape of the file
str(com_pop_act)

## Classes 'tbl_df', 'tbl' and 'data.frame':    91 obs. of  3 variables:
##  $ Small Area: chr  "Carlton" "Carlton" "Carlton" "Carlton" ...
##  $ Year      : int  2011 2012 2013 2014 2015 2016 2017 2011 2012 2013 ...
##  $ Population: int  15126 15609 16723 17846 18986 20245 21797 6196 6802 7988 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ Small Area: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Population: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

# match the headings to the forecast data
names(com_pop_act) <- 
  c("geography","year","population")

# check the year column
levels(as.factor(com_pop_act$year))

## [1] "2011" "2012" "2013" "2014" "2015" "2016" "2017"

# convert the year column to an ordered factor
com_pop_act$year <- 
  factor(com_pop_act$year, levels=c("2011","2012","2013","2014","2015","2016","2017"))

# convert the geography column to factor 
com_pop_act$geography <- 
  factor(com_pop_act$geography)

Understand

Three data sets have been imported:

Population Data
- Population forcast data from 2017-2037
- Actual Population data from 2011-2017
Customer Request Data
- 78,000 customer requests from 2014-2016

The objective of this project is to combine these three data sets with a view to forecasting future customer request volumes based on population forecasts.

Population Data

To begin with lets compare the two population data sets.

# Comparing population data

# compare columns
names(com_pop_act)

## [1] "geography"  "year"       "population"

names (com_pop_fc)

## [1] "geography"  "year"       "population"

# compare geography

levels(com_pop_act$geography)

##  [1] "Carlton"                      "Docklands"                   
##  [3] "East Melbourne"               "Kensington"                  
##  [5] "Melbourne (CBD)"              "Melbourne (Remainder)"       
##  [7] "North Melbourne"              "Parkville"                   
##  [9] "Port Melbourne"               "South Yarra"                 
## [11] "Southbank"                    "West Melbourne (Industrial)" 
## [13] "West Melbourne (Residential)"

levels(com_pop_fc$geography)

##  [1] "Carlton"                      "Docklands"                   
##  [3] "East Melbourne"               "Kensington"                  
##  [5] "Melbourne (CBD)"              "Melbourne (Remainder)"       
##  [7] "North Melbourne"              "Parkville"                   
##  [9] "Port Melbourne"               "South Yarra"                 
## [11] "Southbank"                    "West Melbourne (Industrial)" 
## [13] "West Melbourne (Residential)"

# compare years

levels(com_pop_act$year)

## [1] "2011" "2012" "2013" "2014" "2015" "2016" "2017"

levels(com_pop_fc$year)

##  [1] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "2024" "2025" "2026"
## [11] "2027" "2028" "2029" "2030" "2031" "2032" "2033" "2034" "2035" "2036"
## [21] "2037"

The following features are noted:

The geography variable relates to CLUE Small Area
- these are similar to suburb
- some suburbs have been subdivided reflecting different land uses (residential, commercial, industrial)
- these will need to be recombined into suburbs to match the customer service data
In relation to the year variable
- the 2017 observation for forecast population will need to be dropped due to the overlap
- the 2011-2013 actual population observations will need to be dropped because they don’t overlap the customer service data

Customer Service Data

The City of Melbourne is in the process of converting from a legacy customer relationship management system to a modern cloud-based solution.

The data available here is derived from the new system.

head(com_cust)

dim(com_cust)

## [1] 78639     7

str(com_cust)

## 'data.frame':    78639 obs. of  7 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 15 levels "","Carlton","Carlton North",..: 2 8 2 8 14 8 9 8 5 15 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...

Because the data is derived from the new system only a few customer service types have been converted. Only the following 7 service request categories are available in this data set. This probably represents around 5% of the total number of different categories.

As a result the volumes of service requests are probably less than 20% of total service requests recorded as most are still recorded in the legacy system at the time of this analysis.

levels(com_cust$category)

## [1] "Asset maintenance"                 
## [2] "Graffiti"                          
## [3] "Parking"                           
## [4] "Parks and Trees"                   
## [5] "Roads and Traffic"                 
## [6] "Waste, Street Cleaning and Litter" 
## [7] "Waste, Street Cleaning and Litter "

count(com_cust %>%
  group_by(category)) %>%
  kable() %>%
  kable_styling()

category	n
Asset maintenance	1950
Graffiti	14454
Parking	9025
Parks and Trees	8847
Roads and Traffic	6659
Waste, Street Cleaning and Litter	2989
Waste, Street Cleaning and Litter	34715

Tidy & Manipulate Data I

The City of Melbourne research team has adopted the Tidy principles so the data sets used here already comply.

head(com_cust) %>%
  kable() %>%
  kable_styling()

request_status	date_received	date_completed	suburb	category	service_desc	days_to_complete
CLOSED	2015-12-20	2016-01-12	Carlton	Graffiti	Graffiti Removal	23
CLOSED	2016-07-28	2016-08-25	Melbourne	Parks and Trees	Tree Maintenance Services	28
CLOSED	2016-07-28	2016-08-02	Carlton	Waste, Street Cleaning and Litter	Public Litter Bin	5
CLOSED	2016-09-02	2016-09-05	Melbourne	Waste, Street Cleaning and Litter	Missed Bin Collection	3
CLOSED	2016-09-02	2016-09-05	Southbank	Waste, Street Cleaning and Litter	Missed Bin Collection	3
CLOSED	2016-08-08	2016-08-11	Melbourne	Graffiti	Graffiti Removal	3

For the purposes of this project the following fields wont be needed:

request_status
date_completed
category
service_desc
days_to_complete

A clean year field will need to be created from the date_recieved variable.

Scan I

In this step a signifcant number of missing values are dealt with.

Customer Requests - Missing values

The com_cust data set is around 78,000 observations, however ~50,000 have no suburb recorded. This is because the new system built by council encourages customers to provide GPS coordinates from their phone, rather than type in an address (Lat/Long is not available in the data set).

The null values in the com_cust$suburb variable were marked as “” and not picked up by is.na().

Nonetheless, they were discovered and those observations (unfortunately) omitted.

# now I have to deal with the profoundly disappointing number of missing values

com_cust$suburb[com_cust$suburb == ""] <- NA

# have no choice but to omit the NA records, there's no way to impute the missing values from this data

com_cust <- 
  na.omit(com_cust)

com_cust$suburb <- 
  droplevels(com_cust$suburb)

str(com_cust)

## 'data.frame':    27936 obs. of  7 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 14 levels "Carlton","Carlton North",..: 1 7 1 7 13 7 8 7 4 14 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:50703] 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 ...
##   .. ..- attr(*, "names")= chr [1:50703] "6253" "6254" "6255" "6256" ...

Double check for other missing values

# Double check there are no NA values in other data sets
# there are 1301 in the data_completed variable

sum(is.na(com_cust$request_status))

## [1] 0

sum(is.na(com_cust$date_received))

## [1] 0

sum(is.na(com_cust$date_completed))

## [1] 0

sum(is.na(com_cust$suburb))

## [1] 0

sum(is.na(com_cust$category))

## [1] 0

sum(is.na(com_cust$service_desc))

## [1] 0

sum(is.na(com_cust$days_to_complete))

## [1] 0

names(com_cust)

## [1] "request_status"   "date_received"    "date_completed"  
## [4] "suburb"           "category"         "service_desc"    
## [7] "days_to_complete"

# Check population forecast data

names(com_pop_fc)

## [1] "geography"  "year"       "population"

sum(is.na(com_pop_fc$geography))

## [1] 0

sum(is.na(com_pop_fc$year))

## [1] 0

sum(is.na(com_pop_fc$population))

## [1] 0

head(com_pop_fc)

# check actual population data for null values

sum(is.na(com_pop_act$population))

## [1] 0

sum(is.na(com_pop_act$year))

## [1] 0

sum(is.na(com_pop_act$geography))

## [1] 0

Tidy & Manipulate Data II

The output of this section will be a single combined data set comprised of:

suburb
year
population (actual and forecast)
volume of service requests per year

The steps taken will be:

Combine actual and forecast population data (rbind)
Create a suburb varable from CLUE small areas (mutate)
Create a year variable in customer request data (lubridate)
Summarise customer request data by year (group_by)
Combine suburbs in customer request data to match population data (mutate)
Join population and customer request data (full_join)

Combine actual and forecast population data (Step 1 of 6)

com_pop_fc[com_pop_fc$year != "2017",]

# note I cant really use population data from 2011-13, so I'll drop it

com_pop_act_sub <- 
  com_pop_act %>% 
  filter(year %in% c("2014","2015","2016","2017"))

com_pop_act_sub

com_pop_act_sub$year <- 
  droplevels(com_pop_act_sub$year)

# combine the two data sets
com_pop <- 
  rbind(com_pop_act_sub, com_pop_fc[com_pop_fc$year != "2017",])

# check the result
str(com_pop)

## Classes 'tbl_df', 'tbl' and 'data.frame':    312 obs. of  3 variables:
##  $ geography : Factor w/ 13 levels "Carlton","Docklands",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ year      : Factor w/ 24 levels "2014","2015",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ population: int  17846 18986 20245 21797 9170 10444 11832 13675 5164 5197 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ Small Area: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Population: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

levels(com_pop$year)

##  [1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
## [11] "2024" "2025" "2026" "2027" "2028" "2029" "2030" "2031" "2032" "2033"
## [21] "2034" "2035" "2036" "2037"

levels (com_pop$geography)

##  [1] "Carlton"                      "Docklands"                   
##  [3] "East Melbourne"               "Kensington"                  
##  [5] "Melbourne (CBD)"              "Melbourne (Remainder)"       
##  [7] "North Melbourne"              "Parkville"                   
##  [9] "Port Melbourne"               "South Yarra"                 
## [11] "Southbank"                    "West Melbourne (Industrial)" 
## [13] "West Melbourne (Residential)"

Create a suburb varable from CLUE small areas (step 2 of 6)

com_pop <- 
  com_pop %>% 
  mutate(suburb = 
                     ifelse(geography %in% 
                              c("Melbourne (CBD)","Melbourne (Remainder)"), "Melbourne",
                     ifelse(geography %in% 
                              c("West Melbourne (Industrial)","West Melbourne (Residential)"), "West Melbourne",
                     as.character(geography))))

head(com_pop) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

geography	year	population	suburb
Carlton	2014	17846	Carlton
Carlton	2015	18986	Carlton
Carlton	2016	20245	Carlton
Carlton	2017	21797	Carlton
Docklands	2014	9170	Docklands
Docklands	2015	10444	Docklands

# collate the population data from duplicate suburbs
com_pop <- 
  com_pop %>% 
  group_by(year, suburb) %>% 
  summarise(total_pop = sum(population))

# tidy up the column names (change total_pop back to population)
names(com_pop)

## [1] "year"      "suburb"    "total_pop"

names(com_pop) <- 
  c("year","suburb","population")

str(com_pop)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  264 obs. of  3 variables:
##  $ year      : Factor w/ 24 levels "2014","2015",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ suburb    : chr  "Carlton" "Docklands" "East Melbourne" "Kensington" ...
##  $ population: int  17846 9170 5164 11119 35936 14613 7447 11 4301 17351 ...
##  - attr(*, "vars")= chr "year"
##  - attr(*, "drop")= logi TRUE

com_pop$suburb <- 
  factor(com_pop$suburb)
com_pop

Create a year variable in customer request data (step 3 of 6)

# check the format of the date_recieved field
head(com_cust$date_received)

## [1] "2015-12-20 AEDT" "2016-07-28 AEST" "2016-07-28 AEST" "2016-09-02 AEST"
## [5] "2016-09-02 AEST" "2016-08-08 AEST"

# use lubridate to generate year value

com_cust <-
  com_cust %>%
    mutate(year = year(date_completed))

# check it worked - new variables created
names(com_cust)

## [1] "request_status"   "date_received"    "date_completed"  
## [4] "suburb"           "category"         "service_desc"    
## [7] "days_to_complete" "year"

head(com_cust$year)

## [1] 2016 2016 2016 2016 2016 2016

str(com_cust)

## 'data.frame':    27936 obs. of  8 variables:
##  $ request_status  : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date_received   : POSIXct, format: "2015-12-20" "2016-07-28" ...
##  $ date_completed  : POSIXct, format: "2016-01-12" "2016-08-25" ...
##  $ suburb          : Factor w/ 14 levels "Carlton","Carlton North",..: 1 7 1 7 13 7 8 7 4 14 ...
##  $ category        : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
##  $ service_desc    : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
##  $ days_to_complete: int  23 28 5 3 3 3 3 2 4 10 ...
##  $ year            : num  2016 2016 2016 2016 2016 ...

# check the years created
levels(as.factor(com_cust$year))

## [1] "2014" "2015" "2016"

Summarise customer request data by year (step 4 of 6)

# convert the year variable to a factor
com_cust$year <- 
  as.factor(com_cust$year)

# strip out the unneccessary columns and summarise service requests by year
cust_sum <- 
  com_cust[,c(8,4)] %>% 
  group_by(year, suburb) %>% 
  summarise(sr_count=n())

# fix the column names
names(cust_sum) <- 
  c("year","suburb","sr_count")

# check it worked
head(cust_sum)

Combine suburbs in customer request data to match population data (step 5 of 6)

# I will combine the following
#  - Carlton and Carlton North
#  - Kensington and Flemington
#  - South Wharf and Docklands

cust_sum <- 
  cust_sum %>% 
  mutate(suburb = ifelse(suburb %in% 
                           c("Carlton","Carlton North"), "Carlton", 
                  ifelse(suburb %in% 
                           c("Flemington","Kensington"), "Kensington", 
                  ifelse(suburb %in% 
                           c("South Wharf","Docklands"), "Docklands", as.character(suburb)))))

# combine the total service requests in the now duplicate suburbs
cust_sum <- 
  cust_sum %>% 
  group_by(year, suburb) %>% 
  summarise(tot_srs = sum(sr_count))

# tidy up the column names
names(cust_sum) <- 
  c("year","suburb","sr_count")

# convert the suburb variable back to a factor
cust_sum$suburb <- 
  factor(cust_sum$suburb)

Join population and customer request data (step 6 of 6)

# join the population data to the customer servie data
pop_cust <- 
  com_pop %>% 
  full_join(cust_sum, by= c("year", "suburb"))

## Warning: Column `year` joining factors with different levels, coercing to
## character vector

# check it worked
str(pop_cust)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  264 obs. of  4 variables:
##  $ year      : chr  "2014" "2014" "2014" "2014" ...
##  $ suburb    : Factor w/ 11 levels "Carlton","Docklands",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ population: int  17846 9170 5164 11119 35936 14613 7447 11 4301 17351 ...
##  $ sr_count  : int  184 38 102 183 424 114 81 10 75 39 ...
##  - attr(*, "vars")= chr "year"
##  - attr(*, "drop")= logi TRUE

# fix the year variable
pop_cust$year <- 
  factor(pop_cust$year, levels = c("2014","2015","2016","2017","2018","2019","2020","2021","2022","2023","2024","2025","2026","2027","2028","2029","2030","2031","2032","2033","2034","2035","2036","2037"))

levels(pop_cust$year)

##  [1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
## [11] "2024" "2025" "2026" "2027" "2028" "2029" "2030" "2031" "2032" "2033"
## [21] "2034" "2035" "2036" "2037"

# fix the suburb variable
pop_cust$suburb <- 
  factor(pop_cust$suburb)

Overview of output to this point

Now I have a combined population and service request volume data set.

Population data

kable(pop_cust[,c(1:3)] %>% 
        spread(key = suburb, value = population)) %>% 
        kable_styling(bootstrap_options = c("striped", "hover"))

year	Carlton	Docklands	East Melbourne	Kensington	Melbourne	North Melbourne	Parkville	Port Melbourne	South Yarra	Southbank	West Melbourne
2014	17846	9170	5164	11119	35936	14613	7447	11	4301	17351	5031
2015	18986	10444	5197	11381	39500	15299	7682	11	4284	18623	5480
2016	20245	11832	5251	11621	42557	16197	7898	10	4304	20285	5912
2017	21797	13675	5427	11765	47724	17251	8276	10	4553	22181	6404
2018	21214	13113	5550	11865	46181	17310	7982	10	4699	21894	5709
2019	21961	14282	5598	12192	48741	18761	8105	10	4705	22569	5744
2020	22680	15893	5752	12492	53728	20126	8349	10	4719	23359	5749
2021	23392	16916	5872	12948	56483	21327	8500	11	4727	24262	5938
2022	23580	17909	5948	13377	58587	22060	8538	11	4732	25905	6135
2023	23769	18801	6010	13792	61109	22803	8658	321	4739	27749	6383
2024	23920	19754	6062	14198	63896	23355	8729	670	4753	29747	6698
2025	24061	20349	6100	14594	66647	23865	8844	1062	4774	31681	7046
2026	24148	20817	6155	14855	68404	24290	9006	1494	4790	33209	7314
2027	24261	21294	6209	15123	70205	24710	9091	1972	4804	34721	7568
2028	24341	21764	6252	15397	71617	25117	9248	2254	4816	36123	7675
2029	24338	22178	6287	15689	72553	25428	9273	2800	4827	37422	7844
2030	24371	22557	6307	15975	73580	25735	9316	2905	4840	38764	8201
2031	24385	22874	6354	16264	74567	25998	9346	3488	4851	40122	9038
2032	24439	23194	6398	16564	75493	26257	9398	4009	4859	41489	10080
2033	24494	23515	6463	16869	76474	26526	9461	4684	4864	42876	11292
2034	24524	23819	6541	17178	77390	26779	9485	5450	4867	44277	12596
2035	24547	24116	6629	17491	78318	27033	9518	6267	4866	45708	13233
2036	24559	24406	6651	17810	79238	27282	9563	7110	4861	47159	13890
2037	24548	24694	6677	17848	80095	27290	9581	7945	4854	48610	14313

Customer request data

The customer request dta is NUL for all future years. In the next steps I will forecast future service request volumes.

pop_cust[,c(1,2,4)] %>%
  spread(key = suburb, value = sr_count) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

year	Carlton	Docklands	East Melbourne	Kensington	Melbourne	North Melbourne	Parkville	Port Melbourne	South Yarra	Southbank	West Melbourne
2014	184	38	102	183	424	114	81	10	75	39	67
2015	2095	351	753	1446	4009	1556	773	76	502	354	748
2016	2068	467	917	1585	4475	1620	855	69	462	537	901
2017	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2018	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2019	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2020	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2021	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2022	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2023	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2024	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2025	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2026	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2027	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2028	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2029	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2030	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2031	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2032	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2033	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2034	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2035	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2036	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2037	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Scan II

In this section I generate the forecast service request volumes then check for anomalies and outliers.

Generate estimated service request volumes

First I generate estimates service request volumes for future years based on population forecasts.

This can be done in one step. But some wild outliers are generated which will need to be cleaned up later. So I’ll do it in two steps.

# generate a service requests per population variable
pop_cust <- 
  pop_cust %>% 
  group_by(suburb) %>% 
  mutate(srs_per_pop = mean(sr_count/population, na.rm=TRUE))

# generate an estimate of service reuqest volumes per year for each suburb
pop_cust <- 
  pop_cust %>% 
  group_by(suburb) %>% 
  mutate(est_srs = population * srs_per_pop)

head(pop_cust)

Outliers and anomalies

First lets compare the estimated service request volumes to the real ones in the first three years.

pop_cust_g <-
  pop_cust %>% 
  gather(key = "source", value = "srs", sr_count, est_srs, -year,-suburb, -population, -srs_per_pop, na.rm = TRUE)

ggplot(pop_cust_g, aes(x = source, y = srs)) +
  geom_boxplot() + 
  facet_wrap(~suburb) + 
  ggtitle("SR Count vs Source for various Suburbs") +
  theme(plot.title = element_text(size=14, face="bold"))

You can see that the estimates for Port Melbourne are totally wrong, by a factor of 50-500. This is because Port Melbourne is almost entirely made up of the port and currently has a very low population (~10). But when the population estamates progress, Port Melbourne is envisaged to have a growing population and the multiplyer effect throws out the service request volume estimates badly.

# revisit the pop_cust data frame
head(pop_cust)

# identify the major outliers in the srs_per_pop variable
boxplot(pop_cust$srs_per_pop,main="Service Requests per capita", 
     ylab="SRs per capita")

# identify them numerically
z.scores <- 
  pop_cust$srs_per_pop %>%  
  scores(type = "z")

z.scores %>% 
  summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3558 -0.3171 -0.3139  0.0000 -0.2949  3.1557

# use the z-score method to find and replace with the mean (excluding Port Melb)
pop_cust$srs_per_pop[ which( abs(z.scores) >3 )] <- 
  mean(pop_cust$srs_per_pop[-which( abs(z.scores) >3 )], na.rm = TRUE)

# recalculate estimated service request volumes

pop_cust <- 
  pop_cust %>% 
  group_by(suburb) %>% 
  mutate(est_srs = population * srs_per_pop)

# redraw the plot to confirm Port Melbourne is no longer an outlier

pop_cust_g2 <- 
  pop_cust %>% 
  gather(key = "source", value = "srs", sr_count, est_srs, -year,-suburb, -population, -srs_per_pop, na.rm = TRUE)

ggplot(pop_cust_g2, aes(x = source, y = srs)) +
  geom_boxplot() + 
  facet_wrap(~suburb) + 
  ggtitle("SR Count vs Source for various Suburbs") +
  theme(plot.title = element_text(size=14, face="bold"))

Final outcome

The final outcome of the forecasting is best represented as a chart.

ggplot() +
  geom_line(data = pop_cust, aes(x = as.numeric(paste(year)), y = est_srs, color=suburb), size = .1) +
  ggtitle("Forecast Service Request Volumes\n\n")+
  xlab("year") +
  ylab("service requests")

Transform

You can see in the chart above that the forecast service request volume for Melbourne is a curve. Lets apply a transform to make this linear.

# filter out the Melbourne forecast
melb <-
  pop_cust %>%
    filter(suburb == "Melbourne")

# cube the estimated service request volumes
melb <-
  melb %>%
    mutate(cube_srs = (est_srs)^3)
# check it worked
head(melb)

# plot it to see if its more linear
ggplot() +
  geom_line(data = melb, aes(x = as.numeric(paste(year)), y = melb$cube_srs, color=suburb), size = .1) +
  ggtitle("Forecast Service Request Volumes\n\n")+
  xlab("year") +
  ylab("cube of service requests")