The following packages are required for importing, manipulating and visualising the data.
RSocrata was was used to gain direct access to the City of Melbourne Open Data repository.
#install.packages("RSocrata")
library(RSocrata)
# Standard package list
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(outliers)
library(lubridate)
# Charts
library(ggplot2)
# R Markdown packages
library(kableExtra)
City of Melbourne’s population is set to grow beyond 250,000 by 2037 placing increased pressure on local council services. Actual population data was combined with population forecast data for each suburb within the municipality to create population timeline from 2014 to 2037. This was combined with service request volumes for a small subset of council services (7 service categories) from 2014-2017. The combined data set was used to forecast future service request volumes by suburb to 2037. This analysis used a very small proportion of total annual council service requests, and a significant number of requests were unusable because suburb was not recorded. Nevertheless, the forecast shows an almost three-fold increase in service request demand with the greatest growth in the Melbourne central business district. This project serves as a proof of concept for long-range forecasting of customer service demand once a complete catalogue of service request types becomes available.
City of Melbourne publishes almost 200 data sets on the demographics, business, economics, employment and performance of the central city municpality.
This project will combine three data sets available online or published under creative commons:
The objective of the project is to forecast future demand for council services based on population projections.
The three data sets will be joined by two similar, but not identical geographical concepts:
Suburbs are a familiar concept - gazetted areas typically used in addresses.
The Census of Land Use and Employment (CLUE) is a survey of commercial property that City of Mebourne has conducted sporadically since the 1960s; and systematically every two years since 2000.
The CLUE Small Areas are subsets of suburbs based on in industrial zones and natural land use regions within the municipality.
The CLUE Small Areas will need to be combined and mutated to match Suburbs, for data joining purposes.
The first section accesses the data.melbourne.vic.gov.au open data repository directly in RStudio to access the two open data sets for this project.
CoM has published almost 200 data sets including demographics, civic assets, maps, and performance information.
The repository can be accessed directly within R using the RSocrata package.
# Explore the CoM Open Data Repository
ls.socrata("https://data.melbourne.vic.gov.au")
com_data <- ls.socrata("https://data.melbourne.vic.gov.au")
head(com_data)
names(com_data)
## [1] "accessLevel" "landingPage" "issued" "@type"
## [5] "modified" "keyword" "contactPoint" "publisher"
## [9] "identifier" "description" "title" "distribution"
## [13] "license" "theme"
The first data set to be used is forcast population by CLUE Small Areas.
# Get data - Population Forecasts
# display the details of the data set.
t(com_data[,c(11,14, 2, 13)] %>%
filter(title == "Population Forecasts by Small Area")) %>%
kable() %>%
kable_styling()
| title | Population Forecasts by Small Area |
| theme | People & Events |
| landingPage | https://data.melbourne.vic.gov.au/d/4ajz-qakh |
| license | http://creativecommons.org/licenses/by/4.0/legalcode |
Import the data directly from the open data repository.
# import the popultaion forecast data
com_pop_fc <-
read.socrata("https://data.melbourne.vic.gov.au/d/4ajz-qakh")
# check the imported data
head(com_pop_fc)
# check the dimensions and data types
str(com_pop_fc)
## 'data.frame': 273 obs. of 3 variables:
## $ Geography : chr "Melbourne (CBD)" "Melbourne (CBD)" "Melbourne (CBD)" "Melbourne (CBD)" ...
## $ Year : int 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 ...
## $ Population: int 42589 44087 46442 51136 53795 55854 58336 61086 63802 65529 ...
# standardise on lowercase column names
names(com_pop_fc) <-
tolower(names(com_pop_fc))
# convert the Small Areas to factors
com_pop_fc$geography <-
as.factor(com_pop_fc$geography)
# display the years as factors to convert them to ordered factors
levels(as.factor(com_pop_fc$year))
## [1] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "2024" "2025" "2026"
## [11] "2027" "2028" "2029" "2030" "2031" "2032" "2033" "2034" "2035" "2036"
## [21] "2037"
# convert the years to ordered factors
com_pop_fc$year <-
factor(com_pop_fc$year, levels=c("2017","2018","2019","2020","2021","2022","2023","2024","2025","2026","2027","2028","2029","2030","2031","2032","2033","2034","2035","2036","2037"))
#check the resulting data frame has correct data types
str(com_pop_fc)
## 'data.frame': 273 obs. of 3 variables:
## $ geography : Factor w/ 13 levels "Carlton","Docklands",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ year : Factor w/ 21 levels "2017","2018",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ population: int 42589 44087 46442 51136 53795 55854 58336 61086 63802 65529 ...
# Get data - Customer Service Requests
# Look for customer data
com_data[,c(11,14)] %>%
filter(theme == "Finance & Operations", str_detect(title,"Customer"))
# Display the desired data set
t(com_data[,c(11,14, 2, 13)] %>%
filter(title == "Customer service requests, with resolution time")) %>%
kable() %>%
kable_styling()
| title | Customer service requests, with resolution time |
| theme | Finance & Operations |
| landingPage | https://data.melbourne.vic.gov.au/d/ht4h-vqbu |
| license | http://creativecommons.org/licenses/by/4.0/legalcode |
Import the desired data set direct from the open data repository
# import data
com_cust <-
read.socrata("https://data.melbourne.vic.gov.au/d/ht4h-vqbu")
# check data size and types
str(com_cust)
## 'data.frame': 78639 obs. of 7 variables:
## $ REQUEST_STATUS : chr "CLOSED " "CLOSED " "CLOSED " "CLOSED " ...
## $ DATE_RECEIVED : POSIXct, format: "2015-12-20" "2016-07-28" ...
## $ DATE_COMPLETED : POSIXct, format: "2016-01-12" "2016-08-25" ...
## $ SUBURB : chr "Carlton" "Melbourne" "Carlton" "Melbourne" ...
## $ CATEGORY : chr "Graffiti" "Parks and Trees" "Waste, Street Cleaning and Litter " "Waste, Street Cleaning and Litter " ...
## $ SERVICE_DESC : chr "Graffiti Removal " "Tree Maintenance Services" "Public Litter Bin " "Missed Bin Collection" ...
## $ DAYS_TO_COMPLETE: int 23 28 5 3 3 3 3 2 4 10 ...
names(com_cust)
## [1] "REQUEST_STATUS" "DATE_RECEIVED" "DATE_COMPLETED"
## [4] "SUBURB" "CATEGORY" "SERVICE_DESC"
## [7] "DAYS_TO_COMPLETE"
# convert select variables to factors
cols <-
c("REQUEST_STATUS","SUBURB","CATEGORY","SERVICE_DESC")
com_cust[cols] <-
lapply(com_cust[cols], factor)
# standardise on lowercase
names(com_cust) <-
tolower(names(com_cust))
str(com_cust)
## 'data.frame': 78639 obs. of 7 variables:
## $ request_status : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ date_received : POSIXct, format: "2015-12-20" "2016-07-28" ...
## $ date_completed : POSIXct, format: "2016-01-12" "2016-08-25" ...
## $ suburb : Factor w/ 15 levels "","Carlton","Carlton North",..: 2 8 2 8 14 8 9 8 5 15 ...
## $ category : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
## $ service_desc : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
## $ days_to_complete: int 23 28 5 3 3 3 3 2 4 10 ...
levels(com_cust$suburb)
## [1] "" "Carlton" "Carlton North"
## [4] "Docklands" "East Melbourne" "Flemington"
## [7] "Kensington" "Melbourne" "North Melbourne"
## [10] "Parkville" "Port Melbourne" "South Wharf"
## [13] "South Yarra" "Southbank" "West Melbourne"
names(com_cust)
## [1] "request_status" "date_received" "date_completed"
## [4] "suburb" "category" "service_desc"
## [7] "days_to_complete"
count(com_cust %>%
group_by(suburb))
The next data set was provided by CoM. This data was derived from three separate open data respositories and spatially joined by an analyst at City of Melbourne and provided as a csv file.
# ACTUAL POPULATION DATA FROM THE PAST
# import CSV file
com_pop_act <-
read_csv("com_population.csv")
## Parsed with column specification:
## cols(
## `Small Area` = col_character(),
## Year = col_integer(),
## Population = col_integer()
## )
# check the shape of the file
str(com_pop_act)
## Classes 'tbl_df', 'tbl' and 'data.frame': 91 obs. of 3 variables:
## $ Small Area: chr "Carlton" "Carlton" "Carlton" "Carlton" ...
## $ Year : int 2011 2012 2013 2014 2015 2016 2017 2011 2012 2013 ...
## $ Population: int 15126 15609 16723 17846 18986 20245 21797 6196 6802 7988 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 3
## .. ..$ Small Area: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Population: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# match the headings to the forecast data
names(com_pop_act) <-
c("geography","year","population")
# check the year column
levels(as.factor(com_pop_act$year))
## [1] "2011" "2012" "2013" "2014" "2015" "2016" "2017"
# convert the year column to an ordered factor
com_pop_act$year <-
factor(com_pop_act$year, levels=c("2011","2012","2013","2014","2015","2016","2017"))
# convert the geography column to factor
com_pop_act$geography <-
factor(com_pop_act$geography)
Three data sets have been imported:
The objective of this project is to combine these three data sets with a view to forecasting future customer request volumes based on population forecasts.
To begin with lets compare the two population data sets.
# Comparing population data
# compare columns
names(com_pop_act)
## [1] "geography" "year" "population"
names (com_pop_fc)
## [1] "geography" "year" "population"
# compare geography
levels(com_pop_act$geography)
## [1] "Carlton" "Docklands"
## [3] "East Melbourne" "Kensington"
## [5] "Melbourne (CBD)" "Melbourne (Remainder)"
## [7] "North Melbourne" "Parkville"
## [9] "Port Melbourne" "South Yarra"
## [11] "Southbank" "West Melbourne (Industrial)"
## [13] "West Melbourne (Residential)"
levels(com_pop_fc$geography)
## [1] "Carlton" "Docklands"
## [3] "East Melbourne" "Kensington"
## [5] "Melbourne (CBD)" "Melbourne (Remainder)"
## [7] "North Melbourne" "Parkville"
## [9] "Port Melbourne" "South Yarra"
## [11] "Southbank" "West Melbourne (Industrial)"
## [13] "West Melbourne (Residential)"
# compare years
levels(com_pop_act$year)
## [1] "2011" "2012" "2013" "2014" "2015" "2016" "2017"
levels(com_pop_fc$year)
## [1] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "2024" "2025" "2026"
## [11] "2027" "2028" "2029" "2030" "2031" "2032" "2033" "2034" "2035" "2036"
## [21] "2037"
The following features are noted:
The City of Melbourne is in the process of converting from a legacy customer relationship management system to a modern cloud-based solution.
The data available here is derived from the new system.
head(com_cust)
dim(com_cust)
## [1] 78639 7
str(com_cust)
## 'data.frame': 78639 obs. of 7 variables:
## $ request_status : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ date_received : POSIXct, format: "2015-12-20" "2016-07-28" ...
## $ date_completed : POSIXct, format: "2016-01-12" "2016-08-25" ...
## $ suburb : Factor w/ 15 levels "","Carlton","Carlton North",..: 2 8 2 8 14 8 9 8 5 15 ...
## $ category : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
## $ service_desc : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
## $ days_to_complete: int 23 28 5 3 3 3 3 2 4 10 ...
Because the data is derived from the new system only a few customer service types have been converted. Only the following 7 service request categories are available in this data set. This probably represents around 5% of the total number of different categories.
As a result the volumes of service requests are probably less than 20% of total service requests recorded as most are still recorded in the legacy system at the time of this analysis.
levels(com_cust$category)
## [1] "Asset maintenance"
## [2] "Graffiti"
## [3] "Parking"
## [4] "Parks and Trees"
## [5] "Roads and Traffic"
## [6] "Waste, Street Cleaning and Litter"
## [7] "Waste, Street Cleaning and Litter "
count(com_cust %>%
group_by(category)) %>%
kable() %>%
kable_styling()
| category | n |
|---|---|
| Asset maintenance | 1950 |
| Graffiti | 14454 |
| Parking | 9025 |
| Parks and Trees | 8847 |
| Roads and Traffic | 6659 |
| Waste, Street Cleaning and Litter | 2989 |
| Waste, Street Cleaning and Litter | 34715 |
The City of Melbourne research team has adopted the Tidy principles so the data sets used here already comply.
head(com_cust) %>%
kable() %>%
kable_styling()
| request_status | date_received | date_completed | suburb | category | service_desc | days_to_complete |
|---|---|---|---|---|---|---|
| CLOSED | 2015-12-20 | 2016-01-12 | Carlton | Graffiti | Graffiti Removal | 23 |
| CLOSED | 2016-07-28 | 2016-08-25 | Melbourne | Parks and Trees | Tree Maintenance Services | 28 |
| CLOSED | 2016-07-28 | 2016-08-02 | Carlton | Waste, Street Cleaning and Litter | Public Litter Bin | 5 |
| CLOSED | 2016-09-02 | 2016-09-05 | Melbourne | Waste, Street Cleaning and Litter | Missed Bin Collection | 3 |
| CLOSED | 2016-09-02 | 2016-09-05 | Southbank | Waste, Street Cleaning and Litter | Missed Bin Collection | 3 |
| CLOSED | 2016-08-08 | 2016-08-11 | Melbourne | Graffiti | Graffiti Removal | 3 |
For the purposes of this project the following fields wont be needed:
A clean year field will need to be created from the date_recieved variable.
In this step a signifcant number of missing values are dealt with.
The com_cust data set is around 78,000 observations, however ~50,000 have no suburb recorded. This is because the new system built by council encourages customers to provide GPS coordinates from their phone, rather than type in an address (Lat/Long is not available in the data set).
The null values in the com_cust$suburb variable were marked as “” and not picked up by is.na().
Nonetheless, they were discovered and those observations (unfortunately) omitted.
# now I have to deal with the profoundly disappointing number of missing values
com_cust$suburb[com_cust$suburb == ""] <- NA
# have no choice but to omit the NA records, there's no way to impute the missing values from this data
com_cust <-
na.omit(com_cust)
com_cust$suburb <-
droplevels(com_cust$suburb)
str(com_cust)
## 'data.frame': 27936 obs. of 7 variables:
## $ request_status : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ date_received : POSIXct, format: "2015-12-20" "2016-07-28" ...
## $ date_completed : POSIXct, format: "2016-01-12" "2016-08-25" ...
## $ suburb : Factor w/ 14 levels "Carlton","Carlton North",..: 1 7 1 7 13 7 8 7 4 14 ...
## $ category : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
## $ service_desc : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
## $ days_to_complete: int 23 28 5 3 3 3 3 2 4 10 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:50703] 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 ...
## .. ..- attr(*, "names")= chr [1:50703] "6253" "6254" "6255" "6256" ...
# Double check there are no NA values in other data sets
# there are 1301 in the data_completed variable
sum(is.na(com_cust$request_status))
## [1] 0
sum(is.na(com_cust$date_received))
## [1] 0
sum(is.na(com_cust$date_completed))
## [1] 0
sum(is.na(com_cust$suburb))
## [1] 0
sum(is.na(com_cust$category))
## [1] 0
sum(is.na(com_cust$service_desc))
## [1] 0
sum(is.na(com_cust$days_to_complete))
## [1] 0
names(com_cust)
## [1] "request_status" "date_received" "date_completed"
## [4] "suburb" "category" "service_desc"
## [7] "days_to_complete"
# Check population forecast data
names(com_pop_fc)
## [1] "geography" "year" "population"
sum(is.na(com_pop_fc$geography))
## [1] 0
sum(is.na(com_pop_fc$year))
## [1] 0
sum(is.na(com_pop_fc$population))
## [1] 0
head(com_pop_fc)
# check actual population data for null values
sum(is.na(com_pop_act$population))
## [1] 0
sum(is.na(com_pop_act$year))
## [1] 0
sum(is.na(com_pop_act$geography))
## [1] 0
The output of this section will be a single combined data set comprised of:
The steps taken will be:
com_pop_fc[com_pop_fc$year != "2017",]
# note I cant really use population data from 2011-13, so I'll drop it
com_pop_act_sub <-
com_pop_act %>%
filter(year %in% c("2014","2015","2016","2017"))
com_pop_act_sub
com_pop_act_sub$year <-
droplevels(com_pop_act_sub$year)
# combine the two data sets
com_pop <-
rbind(com_pop_act_sub, com_pop_fc[com_pop_fc$year != "2017",])
# check the result
str(com_pop)
## Classes 'tbl_df', 'tbl' and 'data.frame': 312 obs. of 3 variables:
## $ geography : Factor w/ 13 levels "Carlton","Docklands",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ year : Factor w/ 24 levels "2014","2015",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ population: int 17846 18986 20245 21797 9170 10444 11832 13675 5164 5197 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 3
## .. ..$ Small Area: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Population: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
levels(com_pop$year)
## [1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
## [11] "2024" "2025" "2026" "2027" "2028" "2029" "2030" "2031" "2032" "2033"
## [21] "2034" "2035" "2036" "2037"
levels (com_pop$geography)
## [1] "Carlton" "Docklands"
## [3] "East Melbourne" "Kensington"
## [5] "Melbourne (CBD)" "Melbourne (Remainder)"
## [7] "North Melbourne" "Parkville"
## [9] "Port Melbourne" "South Yarra"
## [11] "Southbank" "West Melbourne (Industrial)"
## [13] "West Melbourne (Residential)"
com_pop <-
com_pop %>%
mutate(suburb =
ifelse(geography %in%
c("Melbourne (CBD)","Melbourne (Remainder)"), "Melbourne",
ifelse(geography %in%
c("West Melbourne (Industrial)","West Melbourne (Residential)"), "West Melbourne",
as.character(geography))))
head(com_pop) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| geography | year | population | suburb |
|---|---|---|---|
| Carlton | 2014 | 17846 | Carlton |
| Carlton | 2015 | 18986 | Carlton |
| Carlton | 2016 | 20245 | Carlton |
| Carlton | 2017 | 21797 | Carlton |
| Docklands | 2014 | 9170 | Docklands |
| Docklands | 2015 | 10444 | Docklands |
# collate the population data from duplicate suburbs
com_pop <-
com_pop %>%
group_by(year, suburb) %>%
summarise(total_pop = sum(population))
# tidy up the column names (change total_pop back to population)
names(com_pop)
## [1] "year" "suburb" "total_pop"
names(com_pop) <-
c("year","suburb","population")
str(com_pop)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 264 obs. of 3 variables:
## $ year : Factor w/ 24 levels "2014","2015",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ suburb : chr "Carlton" "Docklands" "East Melbourne" "Kensington" ...
## $ population: int 17846 9170 5164 11119 35936 14613 7447 11 4301 17351 ...
## - attr(*, "vars")= chr "year"
## - attr(*, "drop")= logi TRUE
com_pop$suburb <-
factor(com_pop$suburb)
com_pop
# check the format of the date_recieved field
head(com_cust$date_received)
## [1] "2015-12-20 AEDT" "2016-07-28 AEST" "2016-07-28 AEST" "2016-09-02 AEST"
## [5] "2016-09-02 AEST" "2016-08-08 AEST"
# use lubridate to generate year value
com_cust <-
com_cust %>%
mutate(year = year(date_completed))
# check it worked - new variables created
names(com_cust)
## [1] "request_status" "date_received" "date_completed"
## [4] "suburb" "category" "service_desc"
## [7] "days_to_complete" "year"
head(com_cust$year)
## [1] 2016 2016 2016 2016 2016 2016
str(com_cust)
## 'data.frame': 27936 obs. of 8 variables:
## $ request_status : Factor w/ 6 levels "ACTIONED","CLOSED",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ date_received : POSIXct, format: "2015-12-20" "2016-07-28" ...
## $ date_completed : POSIXct, format: "2016-01-12" "2016-08-25" ...
## $ suburb : Factor w/ 14 levels "Carlton","Carlton North",..: 1 7 1 7 13 7 8 7 4 14 ...
## $ category : Factor w/ 7 levels "Asset maintenance",..: 2 4 7 7 7 2 2 2 2 2 ...
## $ service_desc : Factor w/ 44 levels "Bike pod services",..: 12 40 25 15 15 12 12 12 12 12 ...
## $ days_to_complete: int 23 28 5 3 3 3 3 2 4 10 ...
## $ year : num 2016 2016 2016 2016 2016 ...
# check the years created
levels(as.factor(com_cust$year))
## [1] "2014" "2015" "2016"
# convert the year variable to a factor
com_cust$year <-
as.factor(com_cust$year)
# strip out the unneccessary columns and summarise service requests by year
cust_sum <-
com_cust[,c(8,4)] %>%
group_by(year, suburb) %>%
summarise(sr_count=n())
# fix the column names
names(cust_sum) <-
c("year","suburb","sr_count")
# check it worked
head(cust_sum)
# I will combine the following
# - Carlton and Carlton North
# - Kensington and Flemington
# - South Wharf and Docklands
cust_sum <-
cust_sum %>%
mutate(suburb = ifelse(suburb %in%
c("Carlton","Carlton North"), "Carlton",
ifelse(suburb %in%
c("Flemington","Kensington"), "Kensington",
ifelse(suburb %in%
c("South Wharf","Docklands"), "Docklands", as.character(suburb)))))
# combine the total service requests in the now duplicate suburbs
cust_sum <-
cust_sum %>%
group_by(year, suburb) %>%
summarise(tot_srs = sum(sr_count))
# tidy up the column names
names(cust_sum) <-
c("year","suburb","sr_count")
# convert the suburb variable back to a factor
cust_sum$suburb <-
factor(cust_sum$suburb)
# join the population data to the customer servie data
pop_cust <-
com_pop %>%
full_join(cust_sum, by= c("year", "suburb"))
## Warning: Column `year` joining factors with different levels, coercing to
## character vector
# check it worked
str(pop_cust)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 264 obs. of 4 variables:
## $ year : chr "2014" "2014" "2014" "2014" ...
## $ suburb : Factor w/ 11 levels "Carlton","Docklands",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ population: int 17846 9170 5164 11119 35936 14613 7447 11 4301 17351 ...
## $ sr_count : int 184 38 102 183 424 114 81 10 75 39 ...
## - attr(*, "vars")= chr "year"
## - attr(*, "drop")= logi TRUE
# fix the year variable
pop_cust$year <-
factor(pop_cust$year, levels = c("2014","2015","2016","2017","2018","2019","2020","2021","2022","2023","2024","2025","2026","2027","2028","2029","2030","2031","2032","2033","2034","2035","2036","2037"))
levels(pop_cust$year)
## [1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
## [11] "2024" "2025" "2026" "2027" "2028" "2029" "2030" "2031" "2032" "2033"
## [21] "2034" "2035" "2036" "2037"
# fix the suburb variable
pop_cust$suburb <-
factor(pop_cust$suburb)
Now I have a combined population and service request volume data set.
kable(pop_cust[,c(1:3)] %>%
spread(key = suburb, value = population)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| year | Carlton | Docklands | East Melbourne | Kensington | Melbourne | North Melbourne | Parkville | Port Melbourne | South Yarra | Southbank | West Melbourne |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 17846 | 9170 | 5164 | 11119 | 35936 | 14613 | 7447 | 11 | 4301 | 17351 | 5031 |
| 2015 | 18986 | 10444 | 5197 | 11381 | 39500 | 15299 | 7682 | 11 | 4284 | 18623 | 5480 |
| 2016 | 20245 | 11832 | 5251 | 11621 | 42557 | 16197 | 7898 | 10 | 4304 | 20285 | 5912 |
| 2017 | 21797 | 13675 | 5427 | 11765 | 47724 | 17251 | 8276 | 10 | 4553 | 22181 | 6404 |
| 2018 | 21214 | 13113 | 5550 | 11865 | 46181 | 17310 | 7982 | 10 | 4699 | 21894 | 5709 |
| 2019 | 21961 | 14282 | 5598 | 12192 | 48741 | 18761 | 8105 | 10 | 4705 | 22569 | 5744 |
| 2020 | 22680 | 15893 | 5752 | 12492 | 53728 | 20126 | 8349 | 10 | 4719 | 23359 | 5749 |
| 2021 | 23392 | 16916 | 5872 | 12948 | 56483 | 21327 | 8500 | 11 | 4727 | 24262 | 5938 |
| 2022 | 23580 | 17909 | 5948 | 13377 | 58587 | 22060 | 8538 | 11 | 4732 | 25905 | 6135 |
| 2023 | 23769 | 18801 | 6010 | 13792 | 61109 | 22803 | 8658 | 321 | 4739 | 27749 | 6383 |
| 2024 | 23920 | 19754 | 6062 | 14198 | 63896 | 23355 | 8729 | 670 | 4753 | 29747 | 6698 |
| 2025 | 24061 | 20349 | 6100 | 14594 | 66647 | 23865 | 8844 | 1062 | 4774 | 31681 | 7046 |
| 2026 | 24148 | 20817 | 6155 | 14855 | 68404 | 24290 | 9006 | 1494 | 4790 | 33209 | 7314 |
| 2027 | 24261 | 21294 | 6209 | 15123 | 70205 | 24710 | 9091 | 1972 | 4804 | 34721 | 7568 |
| 2028 | 24341 | 21764 | 6252 | 15397 | 71617 | 25117 | 9248 | 2254 | 4816 | 36123 | 7675 |
| 2029 | 24338 | 22178 | 6287 | 15689 | 72553 | 25428 | 9273 | 2800 | 4827 | 37422 | 7844 |
| 2030 | 24371 | 22557 | 6307 | 15975 | 73580 | 25735 | 9316 | 2905 | 4840 | 38764 | 8201 |
| 2031 | 24385 | 22874 | 6354 | 16264 | 74567 | 25998 | 9346 | 3488 | 4851 | 40122 | 9038 |
| 2032 | 24439 | 23194 | 6398 | 16564 | 75493 | 26257 | 9398 | 4009 | 4859 | 41489 | 10080 |
| 2033 | 24494 | 23515 | 6463 | 16869 | 76474 | 26526 | 9461 | 4684 | 4864 | 42876 | 11292 |
| 2034 | 24524 | 23819 | 6541 | 17178 | 77390 | 26779 | 9485 | 5450 | 4867 | 44277 | 12596 |
| 2035 | 24547 | 24116 | 6629 | 17491 | 78318 | 27033 | 9518 | 6267 | 4866 | 45708 | 13233 |
| 2036 | 24559 | 24406 | 6651 | 17810 | 79238 | 27282 | 9563 | 7110 | 4861 | 47159 | 13890 |
| 2037 | 24548 | 24694 | 6677 | 17848 | 80095 | 27290 | 9581 | 7945 | 4854 | 48610 | 14313 |
The customer request dta is NUL for all future years. In the next steps I will forecast future service request volumes.
pop_cust[,c(1,2,4)] %>%
spread(key = suburb, value = sr_count) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| year | Carlton | Docklands | East Melbourne | Kensington | Melbourne | North Melbourne | Parkville | Port Melbourne | South Yarra | Southbank | West Melbourne |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 184 | 38 | 102 | 183 | 424 | 114 | 81 | 10 | 75 | 39 | 67 |
| 2015 | 2095 | 351 | 753 | 1446 | 4009 | 1556 | 773 | 76 | 502 | 354 | 748 |
| 2016 | 2068 | 467 | 917 | 1585 | 4475 | 1620 | 855 | 69 | 462 | 537 | 901 |
| 2017 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2018 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2019 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2020 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2021 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2022 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2023 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2024 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2025 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2026 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2027 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2028 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2029 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2030 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2031 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2032 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2033 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2034 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2035 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2036 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2037 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
In this section I generate the forecast service request volumes then check for anomalies and outliers.
First I generate estimates service request volumes for future years based on population forecasts.
This can be done in one step. But some wild outliers are generated which will need to be cleaned up later. So I’ll do it in two steps.
# generate a service requests per population variable
pop_cust <-
pop_cust %>%
group_by(suburb) %>%
mutate(srs_per_pop = mean(sr_count/population, na.rm=TRUE))
# generate an estimate of service reuqest volumes per year for each suburb
pop_cust <-
pop_cust %>%
group_by(suburb) %>%
mutate(est_srs = population * srs_per_pop)
head(pop_cust)
First lets compare the estimated service request volumes to the real ones in the first three years.
pop_cust_g <-
pop_cust %>%
gather(key = "source", value = "srs", sr_count, est_srs, -year,-suburb, -population, -srs_per_pop, na.rm = TRUE)
ggplot(pop_cust_g, aes(x = source, y = srs)) +
geom_boxplot() +
facet_wrap(~suburb) +
ggtitle("SR Count vs Source for various Suburbs") +
theme(plot.title = element_text(size=14, face="bold"))
You can see that the estimates for Port Melbourne are totally wrong, by a factor of 50-500. This is because Port Melbourne is almost entirely made up of the port and currently has a very low population (~10). But when the population estamates progress, Port Melbourne is envisaged to have a growing population and the multiplyer effect throws out the service request volume estimates badly.
# revisit the pop_cust data frame
head(pop_cust)
# identify the major outliers in the srs_per_pop variable
boxplot(pop_cust$srs_per_pop,main="Service Requests per capita",
ylab="SRs per capita")
# identify them numerically
z.scores <-
pop_cust$srs_per_pop %>%
scores(type = "z")
z.scores %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3558 -0.3171 -0.3139 0.0000 -0.2949 3.1557
# use the z-score method to find and replace with the mean (excluding Port Melb)
pop_cust$srs_per_pop[ which( abs(z.scores) >3 )] <-
mean(pop_cust$srs_per_pop[-which( abs(z.scores) >3 )], na.rm = TRUE)
# recalculate estimated service request volumes
pop_cust <-
pop_cust %>%
group_by(suburb) %>%
mutate(est_srs = population * srs_per_pop)
# redraw the plot to confirm Port Melbourne is no longer an outlier
pop_cust_g2 <-
pop_cust %>%
gather(key = "source", value = "srs", sr_count, est_srs, -year,-suburb, -population, -srs_per_pop, na.rm = TRUE)
ggplot(pop_cust_g2, aes(x = source, y = srs)) +
geom_boxplot() +
facet_wrap(~suburb) +
ggtitle("SR Count vs Source for various Suburbs") +
theme(plot.title = element_text(size=14, face="bold"))
The final outcome of the forecasting is best represented as a chart.
ggplot() +
geom_line(data = pop_cust, aes(x = as.numeric(paste(year)), y = est_srs, color=suburb), size = .1) +
ggtitle("Forecast Service Request Volumes\n\n")+
xlab("year") +
ylab("service requests")
You can see in the chart above that the forecast service request volume for Melbourne is a curve. Lets apply a transform to make this linear.
# filter out the Melbourne forecast
melb <-
pop_cust %>%
filter(suburb == "Melbourne")
# cube the estimated service request volumes
melb <-
melb %>%
mutate(cube_srs = (est_srs)^3)
# check it worked
head(melb)
# plot it to see if its more linear
ggplot() +
geom_line(data = melb, aes(x = as.numeric(paste(year)), y = melb$cube_srs, color=suburb), size = .1) +
ggtitle("Forecast Service Request Volumes\n\n")+
xlab("year") +
ylab("cube of service requests")