Background

The London Fire Brigade (LFB) attends a range of non-fire incidents (which we call ‘special services’). These ‘special services’ include assistance to animals that may be trapped or in distress. This document will have a look at the file supplied on London Datastore and draw some insights from it.

Setup

## load package
library(tidyverse)
library(lubridate)
library(Boruta)
library(caret)

The csv file can be downloaded from https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb

## drop incident number as it cause issue on parsing the number
dataset <- read_csv("https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/e6211993-9ea2-4ed4-9378-7344653e9c31/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv", 
                    col_types = cols(IncidentNumber = col_skip()),
                    na = "NULL")

We notice the issue with column names for HourlyNotionalCost and IncidentNotionalCost, this is due to the (£) in the file not being correctly read, so we will change those two column names.

colnames(dataset)[c(7, 8)] <- c("HourlyCost", "IncidentCost")

Now, we can have a quick inspection of the dataset, to see if the factors are correctly labeled.

glimpse(dataset)
## Rows: 7,266
## Columns: 30
## $ DateTimeOfCall             <chr> "01/01/2009 03:01", "01/01/2009 08:51", ...
## $ CalYear                    <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009...
## $ FinYear                    <chr> "2008/09", "2008/09", "2008/09", "2008/0...
## $ TypeOfIncident             <chr> "Special Service", "Special Service", "S...
## $ PumpCount                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PumpHoursTotal             <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ HourlyCost                 <dbl> 255, 255, 255, 255, 255, 255, 255, 255, ...
## $ IncidentCost               <dbl> 510, 255, 255, 255, 255, 255, 255, 255, ...
## $ FinalDescription           <chr> "DOG WITH JAW TRAPPED IN MAGAZINE RACK,B...
## $ AnimalGroupParent          <chr> "Dog", "Fox", "Dog", "Horse", "Rabbit", ...
## $ OriginofCall               <chr> "Person (land line)", "Person (land line...
## $ PropertyType               <chr> "House - single occupancy", "Railings", ...
## $ PropertyCategory           <chr> "Dwelling", "Outdoor Structure", "Outdoo...
## $ SpecialServiceTypeCategory <chr> "Other animal assistance", "Other animal...
## $ SpecialServiceType         <chr> "Animal assistance involving livestock -...
## $ WardCode                   <chr> "E05011467", "E05000169", "E05000558", "...
## $ Ward                       <chr> "Crystal Palace & Upper Norwood", "Woods...
## $ BoroughCode                <chr> "E09000008", "E09000008", "E09000029", "...
## $ Borough                    <chr> "Croydon", "Croydon", "Sutton", "Hilling...
## $ StnGroundName              <chr> "Norbury", "Woodside", "Wallington", "Ru...
## $ UPRN                       <dbl> NA, NA, NA, 100021000000, NA, NA, NA, NA...
## $ Street                     <chr> "Waddington Way", "Grasmere Road", "Mill...
## $ USRN                       <dbl> 20500146, NA, NA, 21401484, 21300122, 19...
## $ PostcodeDistrict           <chr> "SE19", "SE25", "SM5", "UB9", "RM3", "RM...
## $ Easting_m                  <dbl> NA, 534785, 528041, 504689, NA, NA, 5390...
## $ Northing_m                 <dbl> NA, 167546, 164923, 190685, NA, NA, 1861...
## $ Easting_rounded            <dbl> 532350, 534750, 528050, 504650, 554650, ...
## $ Northing_rounded           <dbl> 170050, 167550, 164950, 190650, 192350, ...
## $ Latitude                   <dbl> NA, 51.39095, 51.36894, 51.60528, NA, NA...
## $ Longitude                  <dbl> NA, -0.064166887, -0.161985191, -0.48968...

First we notice DateTimeOfCall is saved as character, we will need to change it to data-time and drop the next column as they only encode the yearly information.

dataset_clean_year <- dataset %>% 
  mutate(DateTimeOfCall = dmy_hm(DateTimeOfCall)) %>% 
  select(-c(CalYear, FinYear))

We will drop FinalDescription and for the geographic information, we will just keep Borough, StnGroundName and PostcodeDistrict for this analysis and drop the others.

dataset_clean_geo <- dataset_clean_year %>% 
  select(-c(FinalDescription, WardCode, Ward, BoroughCode, UPRN, Street, USRN, Easting_m:Longitude)) %>% 
  mutate_if(is_character, as.factor) ## convert rest character column to factor

glimpse(dataset_clean_geo)
## Rows: 7,266
## Columns: 15
## $ DateTimeOfCall             <dttm> 2009-01-01 03:01:00, 2009-01-01 08:51:0...
## $ TypeOfIncident             <fct> Special Service, Special Service, Specia...
## $ PumpCount                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PumpHoursTotal             <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ HourlyCost                 <dbl> 255, 255, 255, 255, 255, 255, 255, 255, ...
## $ IncidentCost               <dbl> 510, 255, 255, 255, 255, 255, 255, 255, ...
## $ AnimalGroupParent          <fct> Dog, Fox, Dog, Horse, Rabbit, Unknown - ...
## $ OriginofCall               <fct> Person (land line), Person (land line), ...
## $ PropertyType               <fct> "House - single occupancy", "Railings", ...
## $ PropertyCategory           <fct> Dwelling, Outdoor Structure, Outdoor Str...
## $ SpecialServiceTypeCategory <fct> Other animal assistance, Other animal as...
## $ SpecialServiceType         <fct> Animal assistance involving livestock - ...
## $ Borough                    <fct> Croydon, Croydon, Sutton, Hillingdon, Ha...
## $ StnGroundName              <fct> Norbury, Woodside, Wallington, Ruislip, ...
## $ PostcodeDistrict           <fct> SE19, SE25, SM5, UB9, RM3, RM10, E11, E1...

Cost terms

Some of the numerical has been converted to factor, let’s have a look why.

summary(dataset_clean_geo[, 2:6])
##          TypeOfIncident   PumpCount     PumpHoursTotal     HourlyCost   
##  Special Service:7266   Min.   :1.000   Min.   : 0.000   Min.   :255.0  
##                         1st Qu.:1.000   1st Qu.: 1.000   1st Qu.:260.0  
##                         Median :1.000   Median : 1.000   Median :295.0  
##                         Mean   :1.021   Mean   : 1.184   Mean   :299.4  
##                         3rd Qu.:1.000   3rd Qu.: 1.000   3rd Qu.:333.0  
##                         Max.   :4.000   Max.   :12.000   Max.   :346.0  
##                         NA's   :48      NA's   :49                      
##   IncidentCost   
##  Min.   :   0.0  
##  1st Qu.: 260.0  
##  Median : 298.0  
##  Mean   : 354.3  
##  3rd Qu.: 339.0  
##  Max.   :3912.0  
##  NA's   :49

We will try to predict the incident cost given some input, we will have a look at the effect of different variables. From the summary above, we can already see for PumpCount and PumpHoursTotal, vast majority of the cases involve 1 pump and resolved within 1hrs. For the hourly cost, £300/hr is the average going rate.

IncidentCost is calculated as the product of PumpCount, PumpHoursTotal and HourlyCost, hence the most common cost is £260 (i.e. 1 pump * 1 hour * £260). We won’t need to use those factors in the predication as they are encoded in the dependent variable. Given each of those three variables has a distinct mode value, the possible interaction affect on the IncidentCost will be limited.

We also notice TypeOfIncident has only single level, so we can drop this factor.

dataset_na_0 <- dataset_clean_geo %>% 
  select(-TypeOfIncident) %>% 
  mutate(PumpCount = ifelse(is.na(PumpCount), 0, PumpCount),
         PumpHoursTotal = ifelse(is.na(PumpHoursTotal), 0, PumpHoursTotal))

## check possible link before we drop them
cor(dataset_na_0$PumpHoursTotal, dataset_na_0$HourlyCost)
## [1] -0.007081184

There is not correlation between PumpHoursTotal and HourlyCost, indicate the time spent to get the job done is without much conscious of affect on the total cost.

## convert IncidentCost into numerical
dataset_clean_cost <- dataset_na_0 %>% 
  select(-c(PumpCount:HourlyCost) ) %>% 
  mutate(cost_dummy = as.character(IncidentCost),
         IncidentCost = as.numeric(cost_dummy)) %>% ## warning message indicate NA introduced
  select(-cost_dummy)

We replace all the NA’s in the IncidentCost by the median cost. Due to the missing value in the corresponding column of PumpCount and PumpHoursTotal, we cannot deduce what the correct value should be, we used median value as not to be affected by large max value.

cost_na <- which(is.na(dataset_clean_cost$IncidentCost))
dataset_clean_cost$IncidentCost[cost_na] <- median(dataset_clean_cost$IncidentCost, na.rm = T)
## There is a single incident labeled with 0 PumpHoursTotal, which cause IncidentCost to be 0, we change that
dataset_clean_cost[dataset_clean_cost$IncidentCost == 0, "IncidentCost"] <- 333

Borough terms

There are also 8 NA’s in Borough, we will use PostcodeDistrict to fix them.

## check which one's in Borough 
borough_na <- which(is.na(dataset_clean_cost$Borough))

## compare with postcode
dataset_clean_cost[borough_na, c("Borough", "PostcodeDistrict")]
## # A tibble: 0 x 2
## # ... with 2 variables: Borough <fct>, PostcodeDistrict <fct>

After checking against the postcode of London, we replace correspondingly.

dataset_clean_cost$Borough[c(4138, 4165, 4246, 4709)] <- "Redbridge"
## ME3, SL3,  are not a London borough postcode
dataset_clean_cost$Borough[6768] <- "Kingston upon Thames"
dataset_clean_cost$Borough[7072] <- "Sutton"
## Drop the last two postcode which is not in London.
dataset_postcode_1 <- dataset_clean_cost %>% 
  filter(PostcodeDistrict != "ME3", PostcodeDistrict != "SL3")

summary(dataset_postcode_1$Borough)
##                          Barking and Dagenham   BARKING AND DAGENHAM 
##                      0                    116                     64 
##                 Barnet                 BARNET                 Bexley 
##                    151                    139                    106 
##                 BEXLEY                  Brent                  BRENT 
##                     92                    104                     90 
##              Brentwood                Bromley                BROMLEY 
##                      1                    135                    124 
##             Broxbourne                 Camden                 CAMDEN 
##                      1                    114                    104 
##         City of London         CITY OF LONDON                Croydon 
##                      8                      5                    170 
##                CROYDON                 Ealing                 EALING 
##                    134                    125                    114 
##                Enfield                ENFIELD          Epping Forest 
##                    164                    163                      3 
##              Greenwich              GREENWICH                Hackney 
##                    120                    117                    142 
##                HACKNEY Hammersmith and Fulham HAMMERSMITH AND FULHAM 
##                    114                     97                     84 
##               Haringey               HARINGEY                 Harrow 
##                    137                    134                     67 
##                 HARROW               Havering               HAVERING 
##                     62                    121                    103 
##             Hillingdon             HILLINGDON               Hounslow 
##                    130                    107                    109 
##               HOUNSLOW              Islington              ISLINGTON 
##                    102                    143                     91 
## Kensington and Chelsea KENSINGTON AND CHELSEA   Kingston upon Thames 
##                     83                     89                     76 
##   KINGSTON UPON THAMES                Lambeth                LAMBETH 
##                     70                    159                    105 
##               Lewisham               LEWISHAM                 Merton 
##                    132                    101                     79 
##                 MERTON                 Newham                 NEWHAM 
##                     63                    137                    128 
##              Redbridge              REDBRIDGE   Richmond upon Thames 
##                    122                    119                    107 
##   RICHMOND UPON THAMES              Southwark              SOUTHWARK 
##                     88                    160                    126 
##                 Sutton                 SUTTON              Tandridge 
##                     89                     61                      1 
##          Tower Hamlets          TOWER HAMLETS         Waltham Forest 
##                    156                    110                    153 
##         WALTHAM FOREST             Wandsworth             WANDSWORTH 
##                    110                    120                    109 
##            Westminster            WESTMINSTER 
##                    103                    100

Few more issue to solve, Brentwood, Broxbourne, Tandridge and Epping Forest are not London Borough, so they need to be checked against their postcode to group them into a London Borough (or nearest possible Borough). Then we will convert all character to the same format.

## The following locations has ambiguous postcode, which may or may-not be in a London Borough 
## this entry belong to Epping Forest(non-London), but share close proximity with 3 London Boroughs
dataset_postcode_1[dataset_postcode_1$Borough == "Epping Forest" & dataset_postcode_1$PostcodeDistrict == "IG10", "Borough"] <- "Redbridge"
dataset_postcode_1[dataset_postcode_1$Borough == "Epping Forest" & dataset_postcode_1$PostcodeDistrict == "EN9", "Borough"] <- "Enfield"
## Brentwood is not in London
dataset_postcode_1[dataset_postcode_1$Borough == "Broxbourne" & dataset_postcode_1$PostcodeDistrict == "EN7", "Borough"] <- "Enfield"
dataset_postcode_1[dataset_postcode_1$Borough == "Tandridge" & dataset_postcode_1$PostcodeDistrict == "TN16", "Borough"] <- "Bromley"

## Brentwood one cannot be easily grouped.
dataset_borough <- dataset_postcode_1 %>% 
  filter(Borough != "Brentwood") %>% 
  mutate(Borough = as.factor(str_to_title(Borough)))

Other categorical variables

Some of the factors have a lot of levels, so we will lump the low count levels together for easier analysis. categorical

dataset_factor_lump <-dataset_borough %>% 
  mutate(AnimalGroupParent = fct_lump_min(AnimalGroupParent, 50),
         OriginofCall = fct_lump_min(OriginofCall, 100),
         PropertyType = fct_lump_min(PropertyType, 100),
         PropertyCategory = fct_lump_min(PropertyCategory, 200),
         SpecialServiceType = fct_lump_min(SpecialServiceType, 200))

summary(dataset_factor_lump)
##  DateTimeOfCall                 IncidentCost   
##  Min.   :2009-01-01 03:01:00   Min.   : 255.0  
##  1st Qu.:2012-01-11 21:32:15   1st Qu.: 260.0  
##  Median :2015-02-23 21:09:30   Median : 298.0  
##  Mean   :2015-02-22 21:27:43   Mean   : 353.9  
##  3rd Qu.:2018-05-15 21:06:00   3rd Qu.: 339.0  
##  Max.   :2021-01-31 13:48:00   Max.   :3912.0  
##                                                
##                         AnimalGroupParent             OriginofCall 
##  Cat                             :3519    Person (land line):3038  
##  Bird                            :1454    Person (mobile)   :4036  
##  Dog                             :1162    Police            : 131  
##  Fox                             : 333    Other             :  57  
##  Horse                           : 189                             
##  Unknown - Domestic Animal Or Pet: 188                             
##  (Other)                         : 417                             
##                                             PropertyType 
##  House - single occupancy                         :1875  
##  Other                                            :1695  
##  Purpose Built Flats/Maisonettes - Up to 3 storeys: 602  
##  Purpose Built Flats/Maisonettes - 4 to 9 storeys : 590  
##  Tree scrub                                       : 324  
##  Animal harm outdoors                             : 281  
##  (Other)                                          :1895  
##           PropertyCategory                   SpecialServiceTypeCategory
##  Dwelling         :3690    Animal rescue from below ground: 741        
##  Non Residential  : 754    Animal rescue from height      :2618        
##  Outdoor          :1951    Animal rescue from water       : 387        
##  Outdoor Structure: 558    Other animal assistance        :3516        
##  Road Vehicle     : 284                                                
##  Other            :  25                                                
##                                                                        
##                                             SpecialServiceType
##  Animal rescue from height - Domestic pet            :1634    
##  Assist trapped domestic animal                      :1487    
##  Animal rescue from height - Bird                    : 896    
##  Other                                               : 787    
##  Animal assistance involving livestock - Other action: 633    
##  Assist trapped wild animal                          : 591    
##  (Other)                                             :1234    
##           Borough       StnGroundName  PostcodeDistrict
##  Enfield      : 330   Enfield  : 133   CR0    : 122    
##  Croydon      : 304   Ilford   : 129   E17    : 100    
##  Barnet       : 290   Tottenham: 126   N1     :  91    
##  Southwark    : 286   Edmonton : 125   E4     :  87    
##  Haringey     : 271   Dagenham : 123   E14    :  84    
##  Tower Hamlets: 266   Hornsey  : 118   NW1    :  83    
##  (Other)      :5515   (Other)  :6508   (Other):6695

After grouping the minority levels, PropertyType still have a lot levels and other has a dominate size, as it won’t help describe with this unbalance, we will drop it and keep PropertyCategory.

SpecialServiceType gives more detailed level against SpecialServiceTypeCategory, but we will only keep the latter to simplify the modeling.

StnGroundName and PostcodeDistrict are useful when analysis at a particular location, but here we are only going to calculate for London as well, so we will keep at Borough level granularity

dataset_simplified <- dataset_factor_lump %>% 
  select(-c(PropertyType, SpecialServiceType, StnGroundName, PostcodeDistrict))

Feature analysis

First plot IncidentCost itself.

dataset_simplified %>% 
  ggplot(aes(x = log(IncidentCost))) + ##
  geom_histogram(binwidth = 0.05, alpha = 0.5) +
  xlab("Incident Cost") +
  labs(title = "Distribution of log scale Incident Cost") +
  theme_bw()

We notice the price is highly skewed even after log transformation, majority of the cases are below the value at 6 (~£400), indicates majority of the cases are 1 pump and resolved within 1 hour as noted before. We can separate them into a classification problem as well.

dataset_simplified <- dataset_simplified %>% 
  mutate(log_cost = log1p(IncidentCost),
         case = ifelse(IncidentCost >400, "Hard", "Easy"),
         case = as.factor(case))

We will log_cost as the case and check how the other factors affect it.

Time series

First we check any trend with respect to the time component.

dataset_simplified %>%
  mutate(year = year(DateTimeOfCall)) %>% 
  ggplot(aes(x = year, fill = case)) +
  geom_bar() +
  labs(title = "Yearly trend for # of incidents by LFB ",
       x = "") +
  theme_bw()

Noticeably there is an increase in 2020 for the easy response cases, it is probably due to the lockdown measure that forces people to stay at home. This increases the cases when people have more interaction with the animals and reporting.

dataset_simplified %>%
  mutate(month = month(DateTimeOfCall, label = TRUE, abbr = TRUE)) %>% 
  ggplot(aes(x = month, fill = case)) +
  geom_bar() +
  labs(title = "Monthly trend for # of incidents by LFB ",
       x = "") +
  theme_bw()

From the yearly trend, we can see the number of cases is about 600 per year. When break down into monthly trend, we see summer has much higher count than winter.

dataset_simplified %>%
  mutate(weekday = wday(DateTimeOfCall, label = TRUE, abbr = TRUE)) %>% 
  ggplot(aes(x = weekday, fill = case)) +
  geom_bar() +
  labs(title = "Weekly trend for # of incidents by LFB ",
       x = "") +
  theme_bw()

Again, we see the weekend has a slight higher count, probably due to people off work and be with the animals. The difference is not strong.

dataset_simplified %>%
  mutate(day = day(DateTimeOfCall)) %>% 
  ggplot(aes(x = day, fill = case)) +
  geom_bar() +
  ## try break the tick every 7 days
  scale_x_continuous(breaks = seq(1,31, 7)) +
  labs(title = "Daily trend for # of incidents by LFB ",
       x = "") +
  theme_bw()

There is not significant trend within the month with respect to each day. For the x-axis break gap, there is no noticeable trend with respect to the weekend. The dip at the end of the month are due to leap year (dataset covers 12 years), hence less count on the 29th, fewer on the 30th and least on the 31st (half as much as other dates).

dataset_simplified %>%
  mutate(hour = hour(DateTimeOfCall)) %>% 
  ggplot(aes(x = hour, fill = case)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(0,23, 6)) +
  labs(title = "Hourly trend for # of incidents by LFB ",
       x = "") +
  theme_bw()

Understandably there is minimum incidents called during the resting hour. There is peak during the midday followed by a descent into the night.

Overall, we will create month and hour from the DateTimeOfCall for our prediction. Yearly trend is likely to return normal after people return to work, while the weekly trend is weak, so we don’t include it for the simplicity of prediction model.

Othe categorical variables

dummy <- dataset_simplified %>% count(OriginofCall)

dataset_simplified %>% 
  ggplot(aes(x = OriginofCall, y = log_cost)) +
  geom_boxplot() +
  geom_text(data = dummy, aes(y = 5.3, label = paste("n=", n))) +
  labs(title = "Incident Cost with respect to Origin of Calls",
       x = "",
       y = "Cost (log scale)") +
  theme_bw()

The median cost for person using mobile is slight higher than land line, given the large sample size, the difference matters. This is probably due to person calling from outside which may indicate it may be a slightly more complicated. While the call from police may be more complicated case as well(though sample size is small here)

dataset_simplified %>% 
  ## delete unknown in the text
  mutate(animal_remove_unknown = str_remove(AnimalGroupParent, "Unknown - ")) %>% 
   ggplot(aes(x = animal_remove_unknown, y = log_cost, colour = case)) +
  geom_boxplot() +
  labs(title = "Incident Cost with respect to Animal Group",
       x = "",
       y = "Cost (log scale)") +
  scale_x_discrete(label = function (x) str_trunc(x, 18)) +
  theme_bw() +
  coord_flip()

There is no significant difference between different animals, overall cost will be similar.

dummy <- dataset_simplified %>% count(PropertyCategory)

dataset_simplified %>% 
  ggplot(aes(x = PropertyCategory, y = log_cost)) +
  geom_boxplot() +
  geom_text(data = dummy, aes(y = 5.3, label = paste("n=", n))) +
  labs(title = "Incident Cost with respect to Property Type",
       x = "",
       y = "Cost (log scale)") +
  theme_bw()

Slight higher median value for road vehicle, while similar for other groups expect other.

dummy <- dataset_simplified %>% count(SpecialServiceTypeCategory)

dataset_simplified %>% 
  ggplot(aes(x = SpecialServiceTypeCategory, y = log_cost)) +
  geom_boxplot() +
  geom_text(data = dummy, aes(y = 5.3, label = paste("n=", n))) +
  labs(title = "Incident Cost with respect to Service Type",
       x = "",
       y = "Cost (log scale)") +
  theme_bw()

The rescue of animal from waterhas higher IQR, likely due to trickier cases requires longer time to resolve. But median cost is similar.

dummy <- dataset_simplified %>% count(Borough)

dataset_simplified %>% 
  ggplot(aes(x = fct_reorder(Borough, log_cost, mean), y = log_cost, colour = case)) +
  geom_boxplot() +
  labs(title = "Incident Cost with respect to Borough",
       x = "",
       y = "Cost (log scale)") +
  scale_x_discrete(label = function (x) str_trunc(x, 15)) +
  theme_bw() +
  coord_flip()

There is noticeable difference between different Borough, sorted by mean cost, we notice the outer boroughs has few more outlier values, likely due to the borough size of the borough takes longer to travel in between station and target locations which increases the hours needed.

Another noticeable point is City of London listed last on the list with minimum cost. This is due to low residential number which implies less domestic animal or pet, and no farm animals, hence the minimum reporting.

modeling

Clean the variable list

dataset_model <- dataset_simplified %>% 
  mutate(month = month(DateTimeOfCall, label = TRUE, abbr = TRUE),
         hour = hour(DateTimeOfCall)) %>% 
  select(-c(DateTimeOfCall, IncidentCost))

Regression

Check the factors we have selected is good.

set.seed(1234)
boruta_dataset <- Boruta(log_cost ~ . - case - Borough, data = dataset_model) 

## copied code for displace xlab vertically
plot(boruta_dataset, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(boruta_dataset$ImpHistory),function(i)
boruta_dataset$ImpHistory[is.finite(boruta_dataset$ImpHistory[,i]),i])
names(lz) <- colnames(boruta_dataset$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta_dataset$ImpHistory), cex.axis = 0.5)

Only AnimalGroupParent shows high importance, while month shows none and hour low. So despite the time factor has significant effect on the amount of incidents (also the total cost), they do not effect the individual incident cost.

dataset_model <- dataset_model %>% select(-c(month, hour))

Split into training and testing set.

set.seed(1234)
split_index <- createDataPartition(dataset_model$log_cost, p = 0.8, list = FALSE)
training <- dataset_model[split_index, ]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
testing <- dataset_model[-split_index,]

Setup trainControl

fitControl <- trainControl(method = "repeatedcv",
                           number = 10, 
                           repeats = 5)

Regularized regression

glmnet_grid <- expand.grid(alpha = seq(0,1, length = 10),
                           lambda = seq(0.00001, 1, length = 10))

model_glmnet <- train(log_cost ~ . - case, 
                   data = training, 
                   method = "glmnet",
                   trControl = fitControl, 
                   tuneGrid = glmnet_grid,
                   metric = "RMSE"
                   )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

Random Forest

rf_grid <- expand.grid(cp = seq(0.01, 1, length = 10))

model_rf <- train(log_cost ~ . - case, 
                   data = training, 
                   method = "rpart",
                   trControl = fitControl, 
                   tuneGrid = rf_grid,
                   metric = "RMSE",
                   set.seed(1234))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

Bayestion

model_bayes <- train(log_cost ~ . - case, 
                   data = training, 
                   method = "bayesglm",
                   trControl = fitControl,
                   metric = "RMSE",
                   set.seed(1234))

Let’s compare the models

model_compare <- resamples(list(glmnet = model_glmnet, rf = model_rf, bayes = model_bayes))

summary(model_compare)
## 
## Call:
## summary.resamples(object = model_compare)
## 
## Models: glmnet, rf, bayes 
## Number of resamples: 50 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glmnet 0.1718179 0.1829452 0.1872155 0.1877690 0.1922706 0.2064283    0
## rf     0.1683735 0.1788171 0.1855652 0.1852953 0.1911182 0.2051243    0
## bayes  0.1749146 0.1821412 0.1873531 0.1878967 0.1929526 0.2129091    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glmnet 0.2730637 0.2934897 0.2994892 0.3021037 0.3089118 0.3414669    0
## rf     0.2700709 0.2909711 0.3078350 0.3050891 0.3156118 0.3505814    0
## bayes  0.2753297 0.2900420 0.3038291 0.3023369 0.3122834 0.3424512    0
## 
## Rsquared 
##              Min.    1st Qu.     Median       Mean   3rd Qu.      Max. NA's
## glmnet 0.01181742 0.07941158 0.12163099 0.11846410 0.1516690 0.2085770    0
## rf     0.01485295 0.06874424 0.09755847 0.09962986 0.1194157 0.2332070    0
## bayes  0.05411532 0.08239707 0.10874015 0.11519437 0.1392231 0.2406679    0
bwplot(model_compare, scales = list(x = list(relation = "free")))

The three models performs similarly, but their R-squared is really low, as the models don’t make a good prediction regarding the incident cost.

varImp(model_rf)
## rpart variable importance
## 
##   only 20 most important variables shown (out of 55)
## 
##                                                     Overall
## AnimalGroupParentHorse                              100.000
## PropertyCategoryOutdoor                              30.280
## SpecialServiceTypeCategoryAnimal rescue from water   28.559
## OriginofCallPerson (mobile)                          25.566
## SpecialServiceTypeCategoryAnimal rescue from height  10.269
## AnimalGroupParentDeer                                 4.178
## BoroughBrent                                          0.000
## BoroughWestminster                                    0.000
## `BoroughWaltham Forest`                               0.000
## BoroughBarnet                                         0.000
## BoroughLewisham                                       0.000
## `AnimalGroupParentUnknown - Domestic Animal Or Pet`   0.000
## `AnimalGroupParentUnknown - Wild Animal`              0.000
## BoroughBexley                                         0.000
## BoroughHarrow                                         0.000
## AnimalGroupParentDog                                  0.000
## BoroughHounslow                                       0.000
## BoroughCamden                                         0.000
## BoroughHavering                                       0.000
## AnimalGroupParentCat                                  0.000
varImp(model_glmnet)
## glmnet variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                                                    Overall
## AnimalGroupParentHorse                             100.000
## PropertyCategoryOther                               25.598
## SpecialServiceTypeCategoryAnimal rescue from water  18.009
## AnimalGroupParentDeer                               16.829
## OriginofCallOther                                   16.598
## OriginofCallPerson (mobile)                         11.992
## BoroughTower Hamlets                                11.185
## AnimalGroupParentUnknown - Wild Animal              10.749
## BoroughBromley                                      10.063
## PropertyCategoryOutdoor                              9.479
## AnimalGroupParentOther                               9.255
## BoroughHounslow                                      8.960
## BoroughEnfield                                       8.828
## BoroughSutton                                        8.792
## AnimalGroupParentSquirrel                            7.928
## AnimalGroupParentUnknown - Domestic Animal Or Pet    7.829
## BoroughRedbridge                                     7.708
## BoroughWestminster                                   7.280
## BoroughBexley                                        7.241
## AnimalGroupParentDog                                 6.644

Response regarding horse has the strongest impact on the predication and no other animal has the impact (Deer comes in distance second). Otherwise Animal rescue from water has some effect in predicating the cost. All other factors shows low and mixed impact.

Let’s dig deep about horse, why it stands out.

dataset_simplified %>% 
  group_by(AnimalGroupParent) %>% 
  summarise(average_cost = mean(IncidentCost),
            median_cost = median(IncidentCost)) %>% 
  arrange(desc(average_cost))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 3
##    AnimalGroupParent                average_cost median_cost
##    <fct>                                   <dbl>       <dbl>
##  1 Horse                                    733.         590
##  2 Unknown - Wild Animal                    412.         326
##  3 Deer                                     411.         333
##  4 Other                                    377.         295
##  5 Fox                                      363.         328
##  6 Dog                                      345.         298
##  7 Cat                                      340.         298
##  8 Bird                                     339.         326
##  9 Unknown - Domestic Animal Or Pet         324.         295
## 10 Squirrel                                 311.         326

Given the max hourly rate is £395, anything above £400 indicate a hard incident to resolve, where we can see horse is the most distinct. For the cases of wild animal and deer, although their average is above £400, their median value is clearly below, so the mean value is skewed due to extreme cases. Yet the median cost for horse is still far above £400. Therefore we can conclude that almost any call LFB receives for special service regarding animals are quite easy to handle. Majority cases involves domestic setting, hence either cats or dogs. Do beware of the horses though.

Classification

Let’s try classify case, to see if it does good job.

table(dataset_model$case)
## 
## Easy Hard 
## 6412  850

Though there is unbalance in the class, we will skip the resampling, keep previous training and testing set.

fitControl_class <- trainControl(method = "repeatedcv",
                           number = 10, 
                           repeats = 5, 
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE,
                           search = "random")

Logistic

model_log_class <- train(case ~ . - log_cost, 
                   data = training, 
                   method = "glm",
                   family = "binomial",
                   trControl = fitControl_class,
                   metric = "ROC"
                   )

Random Forest

model_rf_class <- train(case ~ . - log_cost, 
                   data = training, 
                   method = "rpart",
                   trControl = fitControl_class, 
                   tuneLength = 20,
                   metric = "ROC",
                   set.seed(1234))

Support vector machine

model_bayes_class <- train(case ~ . - log_cost, 
                   data = training, 
                   method = "bayesglm",
                   trControl = fitControl_class, 
                   metric = "ROC",
                   set.seed(1234))
varImp(model_log_class)
## glm variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                                                       Overall
## AnimalGroupParentHorse                                100.000
## PropertyCategoryOutdoor                                63.101
## PropertyCategoryOther                                  28.725
## AnimalGroupParentOther                                 27.625
## `PropertyCategoryNon Residential`                      26.794
## AnimalGroupParentDeer                                  25.411
## `PropertyCategoryRoad Vehicle`                         24.592
## `SpecialServiceTypeCategoryOther animal assistance`    23.542
## OriginofCallOther                                      22.216
## `SpecialServiceTypeCategoryAnimal rescue from height`  19.683
## AnimalGroupParentCat                                   19.675
## BoroughBromley                                         16.872
## BoroughCroydon                                         15.114
## BoroughHounslow                                        13.429
## BoroughSutton                                          12.332
## `BoroughTower Hamlets`                                 12.255
## `SpecialServiceTypeCategoryAnimal rescue from water`   11.582
## AnimalGroupParentSquirrel                              11.548
## BoroughSouthwark                                       10.866
## BoroughWestminster                                      9.949
varImp(model_rf_class)
## rpart variable importance
## 
##   only 20 most important variables shown (out of 60)
## 
##                                                       Overall
## AnimalGroupParentHorse                              1.000e+02
## PropertyCategoryOutdoor                             5.118e+01
## SpecialServiceTypeCategoryAnimal rescue from water  4.400e+01
## OriginofCallOther                                   1.668e+01
## SpecialServiceTypeCategoryAnimal rescue from height 6.760e+00
## AnimalGroupParentOther                              6.713e+00
## BoroughBromley                                      5.758e+00
## SpecialServiceTypeCategoryOther animal assistance   3.701e+00
## PropertyCategoryRoad Vehicle                        3.247e+00
## AnimalGroupParentDeer                               2.556e+00
## AnimalGroupParentCat                                2.130e+00
## BoroughEnfield                                      1.403e+00
## BoroughTower Hamlets                                1.289e+00
## BoroughHounslow                                     1.228e+00
## BoroughHillingdon                                   1.020e+00
## OriginofCallPerson (mobile)                         1.008e+00
## BoroughIslington                                    8.028e-01
## PropertyCategoryNon Residential                     2.250e-01
## PropertyCategoryOutdoor Structure                   1.638e-01
## AnimalGroupParentDog                                8.153e-03

Compare the models

## confusion matrix
confusionMatrix(predict(model_log_class, data = training), training$case)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Easy Hard
##       Easy 5086  576
##       Hard   46  103
##                                           
##                Accuracy : 0.893           
##                  95% CI : (0.8847, 0.9008)
##     No Information Rate : 0.8832          
##     P-Value [Acc > NIR] : 0.009883        
##                                           
##                   Kappa : 0.2158          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9910          
##             Specificity : 0.1517          
##          Pos Pred Value : 0.8983          
##          Neg Pred Value : 0.6913          
##              Prevalence : 0.8832          
##          Detection Rate : 0.8752          
##    Detection Prevalence : 0.9744          
##       Balanced Accuracy : 0.5714          
##                                           
##        'Positive' Class : Easy            
## 
confusionMatrix(predict(model_rf_class, data = training), training$case)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Easy Hard
##       Easy 5083  572
##       Hard   49  107
##                                          
##                Accuracy : 0.8931         
##                  95% CI : (0.8849, 0.901)
##     No Information Rate : 0.8832         
##     P-Value [Acc > NIR] : 0.008831       
##                                          
##                   Kappa : 0.2223         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9905         
##             Specificity : 0.1576         
##          Pos Pred Value : 0.8989         
##          Neg Pred Value : 0.6859         
##              Prevalence : 0.8832         
##          Detection Rate : 0.8747         
##    Detection Prevalence : 0.9732         
##       Balanced Accuracy : 0.5740         
##                                          
##        'Positive' Class : Easy           
## 
confusionMatrix(predict(model_bayes_class, data = training), training$case)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Easy Hard
##       Easy 5087  578
##       Hard   45  101
##                                           
##                Accuracy : 0.8928          
##                  95% CI : (0.8846, 0.9006)
##     No Information Rate : 0.8832          
##     P-Value [Acc > NIR] : 0.01104         
##                                           
##                   Kappa : 0.2123          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9912          
##             Specificity : 0.1487          
##          Pos Pred Value : 0.8980          
##          Neg Pred Value : 0.6918          
##              Prevalence : 0.8832          
##          Detection Rate : 0.8754          
##    Detection Prevalence : 0.9749          
##       Balanced Accuracy : 0.5700          
##                                           
##        'Positive' Class : Easy            
## 

With respect to the no information rate, the model is not a significant improvement of it. Three models performed similarly, they struggle to identify the hard case, all had 570+ misidentifications, hence they all have specificity of around 15%. This indicate we may need to go back to resample the date, as hard case is under sampled, to improve the accuracy. Though given our results from regression, the few variables here may not be sufficient to improve the model in the first place.

model_compare_class <- resamples(list(log = model_log_class, rf = model_rf_class, bayes = model_bayes_class))

summary(model_compare_class)
## 
## Call:
## summary.resamples(object = model_compare_class)
## 
## Models: log, rf, bayes 
## Number of resamples: 50 
## 
## ROC 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## log   0.5831613 0.6460877 0.6718175 0.6684388 0.6964403 0.7304589    0
## rf    0.5532588 0.6161675 0.6404440 0.6401840 0.6662868 0.7293028    0
## bayes 0.5747334 0.6453603 0.6648463 0.6698299 0.6991098 0.7741801    0
## 
## Sens 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## log   0.9824561 0.9883041 0.9902534 0.9905306 0.9922141 0.9980507    0
## rf    0.9746589 0.9844055 0.9863680 0.9866326 0.9902534 0.9961014    0
## bayes 0.9785575 0.9883041 0.9922027 0.9906856 0.9941520 0.9980507    0
## 
## Spec 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## log   0.07352941 0.1176471 0.1343284 0.1434416 0.1617647 0.2500000    0
## rf    0.07352941 0.1176471 0.1406936 0.1390035 0.1617647 0.2205882    0
## bayes 0.08823529 0.1176471 0.1323529 0.1440386 0.1617647 0.2352941    0
bwplot(model_compare_class, scales = list(x = list(relation = "free")))

Logistic regression and Bayes glm performed better than Random Forest.

Final check on the testing set.

pred_log_test <- predict(model_log_class, testing)
confusionMatrix(pred_log_test, testing$case)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Easy Hard
##       Easy 1265  146
##       Hard   15   25
##                                           
##                Accuracy : 0.889           
##                  95% CI : (0.8717, 0.9047)
##     No Information Rate : 0.8822          
##     P-Value [Acc > NIR] : 0.2208          
##                                           
##                   Kappa : 0.2013          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9883          
##             Specificity : 0.1462          
##          Pos Pred Value : 0.8965          
##          Neg Pred Value : 0.6250          
##              Prevalence : 0.8822          
##          Detection Rate : 0.8718          
##    Detection Prevalence : 0.9724          
##       Balanced Accuracy : 0.5672          
##                                           
##        'Positive' Class : Easy            
## 

Model is not good at predicting when the case recieved will be a hard case, it’s easier just to treat call are easy case.