The London Fire Brigade (LFB) attends a range of non-fire incidents, known as ‘special services’, which include assisting animals that may be trapped or in distress. This document explores the dataset supplied on the London Datastore and draws some insights from it.
## load packages
library(tidyverse)
library(lubridate)
library(Boruta)
library(caret)
The CSV file can be downloaded from https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb
## skip IncidentNumber as it causes parsing issues
dataset <- read_csv("https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/e6211993-9ea2-4ed4-9378-7344653e9c31/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv",
col_types = cols(IncidentNumber = col_skip()),
na = "NULL")
We notice an issue with the column names HourlyNotionalCost and IncidentNotionalCost: the (£) symbol in the file is not read correctly, so we rename those two columns.
colnames(dataset)[c(7, 8)] <- c("HourlyCost", "IncidentCost")
Now we can quickly inspect the dataset to see whether the columns were parsed correctly.
glimpse(dataset)
## Rows: 7,266
## Columns: 30
## $ DateTimeOfCall <chr> "01/01/2009 03:01", "01/01/2009 08:51", ...
## $ CalYear <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009...
## $ FinYear <chr> "2008/09", "2008/09", "2008/09", "2008/0...
## $ TypeOfIncident <chr> "Special Service", "Special Service", "S...
## $ PumpCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PumpHoursTotal <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ HourlyCost <dbl> 255, 255, 255, 255, 255, 255, 255, 255, ...
## $ IncidentCost <dbl> 510, 255, 255, 255, 255, 255, 255, 255, ...
## $ FinalDescription <chr> "DOG WITH JAW TRAPPED IN MAGAZINE RACK,B...
## $ AnimalGroupParent <chr> "Dog", "Fox", "Dog", "Horse", "Rabbit", ...
## $ OriginofCall <chr> "Person (land line)", "Person (land line...
## $ PropertyType <chr> "House - single occupancy", "Railings", ...
## $ PropertyCategory <chr> "Dwelling", "Outdoor Structure", "Outdoo...
## $ SpecialServiceTypeCategory <chr> "Other animal assistance", "Other animal...
## $ SpecialServiceType <chr> "Animal assistance involving livestock -...
## $ WardCode <chr> "E05011467", "E05000169", "E05000558", "...
## $ Ward <chr> "Crystal Palace & Upper Norwood", "Woods...
## $ BoroughCode <chr> "E09000008", "E09000008", "E09000029", "...
## $ Borough <chr> "Croydon", "Croydon", "Sutton", "Hilling...
## $ StnGroundName <chr> "Norbury", "Woodside", "Wallington", "Ru...
## $ UPRN <dbl> NA, NA, NA, 100021000000, NA, NA, NA, NA...
## $ Street <chr> "Waddington Way", "Grasmere Road", "Mill...
## $ USRN <dbl> 20500146, NA, NA, 21401484, 21300122, 19...
## $ PostcodeDistrict <chr> "SE19", "SE25", "SM5", "UB9", "RM3", "RM...
## $ Easting_m <dbl> NA, 534785, 528041, 504689, NA, NA, 5390...
## $ Northing_m <dbl> NA, 167546, 164923, 190685, NA, NA, 1861...
## $ Easting_rounded <dbl> 532350, 534750, 528050, 504650, 554650, ...
## $ Northing_rounded <dbl> 170050, 167550, 164950, 190650, 192350, ...
## $ Latitude <dbl> NA, 51.39095, 51.36894, 51.60528, NA, NA...
## $ Longitude <dbl> NA, -0.064166887, -0.161985191, -0.48968...
First, we notice DateTimeOfCall is stored as character, so we convert it to a date-time. We also drop the next two columns (CalYear and FinYear), as they only encode yearly information already contained in the date.
dataset_clean_year <- dataset %>%
mutate(DateTimeOfCall = dmy_hm(DateTimeOfCall)) %>%
select(-c(CalYear, FinYear))
We will drop FinalDescription, and of the geographic information we keep only Borough, StnGroundName and PostcodeDistrict for this analysis, dropping the rest.
dataset_clean_geo <- dataset_clean_year %>%
select(-c(FinalDescription, WardCode, Ward, BoroughCode, UPRN, Street, USRN, Easting_m:Longitude)) %>%
mutate_if(is_character, as.factor) ## convert rest character column to factor
glimpse(dataset_clean_geo)
## Rows: 7,266
## Columns: 15
## $ DateTimeOfCall <dttm> 2009-01-01 03:01:00, 2009-01-01 08:51:0...
## $ TypeOfIncident <fct> Special Service, Special Service, Specia...
## $ PumpCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PumpHoursTotal <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ HourlyCost <dbl> 255, 255, 255, 255, 255, 255, 255, 255, ...
## $ IncidentCost <dbl> 510, 255, 255, 255, 255, 255, 255, 255, ...
## $ AnimalGroupParent <fct> Dog, Fox, Dog, Horse, Rabbit, Unknown - ...
## $ OriginofCall <fct> Person (land line), Person (land line), ...
## $ PropertyType <fct> "House - single occupancy", "Railings", ...
## $ PropertyCategory <fct> Dwelling, Outdoor Structure, Outdoor Str...
## $ SpecialServiceTypeCategory <fct> Other animal assistance, Other animal as...
## $ SpecialServiceType <fct> Animal assistance involving livestock - ...
## $ Borough <fct> Croydon, Croydon, Sutton, Hillingdon, Ha...
## $ StnGroundName <fct> Norbury, Woodside, Wallington, Ruislip, ...
## $ PostcodeDistrict <fct> SE19, SE25, SM5, UB9, RM3, RM10, E11, E1...
Next, let's summarise TypeOfIncident and the numeric pump/cost columns to check for anything unusual.
summary(dataset_clean_geo[, 2:6])
## TypeOfIncident PumpCount PumpHoursTotal HourlyCost
## Special Service:7266 Min. :1.000 Min. : 0.000 Min. :255.0
## 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.:260.0
## Median :1.000 Median : 1.000 Median :295.0
## Mean :1.021 Mean : 1.184 Mean :299.4
## 3rd Qu.:1.000 3rd Qu.: 1.000 3rd Qu.:333.0
## Max. :4.000 Max. :12.000 Max. :346.0
## NA's :48 NA's :49
## IncidentCost
## Min. : 0.0
## 1st Qu.: 260.0
## Median : 298.0
## Mean : 354.3
## 3rd Qu.: 339.0
## Max. :3912.0
## NA's :49
We will try to predict the incident cost from the other variables, so let's look at their effects. From the summary above we can already see that, for PumpCount and PumpHoursTotal, the vast majority of cases involve one pump and are resolved within one hour. The hourly cost averages around £300/hr.
IncidentCost is calculated as PumpHoursTotal times HourlyCost (PumpHoursTotal already aggregates the hours across pumps), hence the most common cost is £260 (i.e. one pump for one hour at £260/hr). We won't use these columns as predictors, since they are encoded in the dependent variable. Given each of these variables has one strongly dominant value, any interaction effect on IncidentCost will be limited.
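As a quick sanity check (an added step, not part of the original write-up), we can verify this relationship on the complete rows; the count of mismatches should be zero or close to it:
## verify IncidentCost = PumpHoursTotal * HourlyCost on complete rows
dataset_clean_geo %>%
  filter(!is.na(PumpHoursTotal), !is.na(HourlyCost), !is.na(IncidentCost)) %>%
  summarise(mismatches = sum(IncidentCost != PumpHoursTotal * HourlyCost))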
We also notice TypeOfIncident has only a single level, so we can drop this factor.
dataset_na_0 <- dataset_clean_geo %>%
select(-TypeOfIncident) %>%
mutate(PumpCount = ifelse(is.na(PumpCount), 0, PumpCount),
PumpHoursTotal = ifelse(is.na(PumpHoursTotal), 0, PumpHoursTotal))
## check for a possible link before we drop these columns
cor(dataset_na_0$PumpHoursTotal, dataset_na_0$HourlyCost)
## [1] -0.007081184
There is no correlation between PumpHoursTotal and HourlyCost, indicating that the time taken to resolve an incident does not depend on the hourly rate charged.
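If we wanted to formalise this, a correlation test could be run (an added optional check, output not shown):
## optional: test whether the near-zero correlation is statistically significant
cor.test(dataset_na_0$PumpHoursTotal, dataset_na_0$HourlyCost)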
## make sure IncidentCost is numeric
dataset_clean_cost <- dataset_na_0 %>%
select(-c(PumpCount:HourlyCost) ) %>%
mutate(cost_dummy = as.character(IncidentCost),
IncidentCost = as.numeric(cost_dummy)) %>% ## a warning here indicates NAs were introduced
select(-cost_dummy)
We replace all the NA's in IncidentCost with the median cost. Since PumpCount and PumpHoursTotal were also missing for these rows, we cannot deduce the true values; we use the median rather than the mean so the imputation is not skewed by the large maximum.
cost_na <- which(is.na(dataset_clean_cost$IncidentCost))
dataset_clean_cost$IncidentCost[cost_na] <- median(dataset_clean_cost$IncidentCost, na.rm = T)
## one incident has PumpHoursTotal of 0, which makes IncidentCost 0; we replace it
dataset_clean_cost[dataset_clean_cost$IncidentCost == 0, "IncidentCost"] <- 333
There are also 8 NA's in Borough; we will use PostcodeDistrict to fill them in.
## find the rows with a missing Borough
borough_na <- which(is.na(dataset_clean_cost$Borough))
## compare with postcode
dataset_clean_cost[borough_na, c("Borough", "PostcodeDistrict")]
## # A tibble: 0 x 2
## # ... with 2 variables: Borough <fct>, PostcodeDistrict <fct>
After checking each postcode against the London boroughs, we fill in the missing values accordingly.
dataset_clean_cost$Borough[c(4138, 4165, 4246, 4709)] <- "Redbridge"
## ME3 and SL3 are not London borough postcodes
dataset_clean_cost$Borough[6768] <- "Kingston upon Thames"
dataset_clean_cost$Borough[7072] <- "Sutton"
## drop the two rows whose postcodes are outside London
dataset_postcode_1 <- dataset_clean_cost %>%
filter(PostcodeDistrict != "ME3", PostcodeDistrict != "SL3")
summary(dataset_postcode_1$Borough)
## Barking and Dagenham BARKING AND DAGENHAM
## 0 116 64
## Barnet BARNET Bexley
## 151 139 106
## BEXLEY Brent BRENT
## 92 104 90
## Brentwood Bromley BROMLEY
## 1 135 124
## Broxbourne Camden CAMDEN
## 1 114 104
## City of London CITY OF LONDON Croydon
## 8 5 170
## CROYDON Ealing EALING
## 134 125 114
## Enfield ENFIELD Epping Forest
## 164 163 3
## Greenwich GREENWICH Hackney
## 120 117 142
## HACKNEY Hammersmith and Fulham HAMMERSMITH AND FULHAM
## 114 97 84
## Haringey HARINGEY Harrow
## 137 134 67
## HARROW Havering HAVERING
## 62 121 103
## Hillingdon HILLINGDON Hounslow
## 130 107 109
## HOUNSLOW Islington ISLINGTON
## 102 143 91
## Kensington and Chelsea KENSINGTON AND CHELSEA Kingston upon Thames
## 83 89 76
## KINGSTON UPON THAMES Lambeth LAMBETH
## 70 159 105
## Lewisham LEWISHAM Merton
## 132 101 79
## MERTON Newham NEWHAM
## 63 137 128
## Redbridge REDBRIDGE Richmond upon Thames
## 122 119 107
## RICHMOND UPON THAMES Southwark SOUTHWARK
## 88 160 126
## Sutton SUTTON Tandridge
## 89 61 1
## Tower Hamlets TOWER HAMLETS Waltham Forest
## 156 110 153
## WALTHAM FOREST Wandsworth WANDSWORTH
## 110 120 109
## Westminster WESTMINSTER
## 103 100
A few more issues to solve: Brentwood, Broxbourne, Tandridge and Epping Forest are not London boroughs, so we check their postcodes to reassign them to the nearest possible London borough. We also see each borough appears twice, once in title case and once in upper case, so we will normalise the text to a single format.
## the following locations have ambiguous postcodes that may or may not fall within a London borough
## these entries belong to Epping Forest (non-London) but are in close proximity to 3 London boroughs
dataset_postcode_1[dataset_postcode_1$Borough == "Epping Forest" & dataset_postcode_1$PostcodeDistrict == "IG10", "Borough"] <- "Redbridge"
dataset_postcode_1[dataset_postcode_1$Borough == "Epping Forest" & dataset_postcode_1$PostcodeDistrict == "EN9", "Borough"] <- "Enfield"
## Broxbourne (EN7) borders Enfield; Tandridge (TN16) borders Bromley
dataset_postcode_1[dataset_postcode_1$Borough == "Broxbourne" & dataset_postcode_1$PostcodeDistrict == "EN7", "Borough"] <- "Enfield"
dataset_postcode_1[dataset_postcode_1$Borough == "Tandridge" & dataset_postcode_1$PostcodeDistrict == "TN16", "Borough"] <- "Bromley"
## the Brentwood entry cannot be cleanly mapped to a London borough, so we drop it
dataset_borough <- dataset_postcode_1 %>%
filter(Borough != "Brentwood") %>%
mutate(Borough = as.factor(str_to_title(Borough)))
Some of the factors have many levels, so we lump the low-count levels together for easier analysis.
dataset_factor_lump <- dataset_borough %>%
mutate(AnimalGroupParent = fct_lump_min(AnimalGroupParent, 50),
OriginofCall = fct_lump_min(OriginofCall, 100),
PropertyType = fct_lump_min(PropertyType, 100),
PropertyCategory = fct_lump_min(PropertyCategory, 200),
SpecialServiceType = fct_lump_min(SpecialServiceType, 200))
summary(dataset_factor_lump)
## DateTimeOfCall IncidentCost
## Min. :2009-01-01 03:01:00 Min. : 255.0
## 1st Qu.:2012-01-11 21:32:15 1st Qu.: 260.0
## Median :2015-02-23 21:09:30 Median : 298.0
## Mean :2015-02-22 21:27:43 Mean : 353.9
## 3rd Qu.:2018-05-15 21:06:00 3rd Qu.: 339.0
## Max. :2021-01-31 13:48:00 Max. :3912.0
##
## AnimalGroupParent OriginofCall
## Cat :3519 Person (land line):3038
## Bird :1454 Person (mobile) :4036
## Dog :1162 Police : 131
## Fox : 333 Other : 57
## Horse : 189
## Unknown - Domestic Animal Or Pet: 188
## (Other) : 417
## PropertyType
## House - single occupancy :1875
## Other :1695
## Purpose Built Flats/Maisonettes - Up to 3 storeys: 602
## Purpose Built Flats/Maisonettes - 4 to 9 storeys : 590
## Tree scrub : 324
## Animal harm outdoors : 281
## (Other) :1895
## PropertyCategory SpecialServiceTypeCategory
## Dwelling :3690 Animal rescue from below ground: 741
## Non Residential : 754 Animal rescue from height :2618
## Outdoor :1951 Animal rescue from water : 387
## Outdoor Structure: 558 Other animal assistance :3516
## Road Vehicle : 284
## Other : 25
##
## SpecialServiceType
## Animal rescue from height - Domestic pet :1634
## Assist trapped domestic animal :1487
## Animal rescue from height - Bird : 896
## Other : 787
## Animal assistance involving livestock - Other action: 633
## Assist trapped wild animal : 591
## (Other) :1234
## Borough StnGroundName PostcodeDistrict
## Enfield : 330 Enfield : 133 CR0 : 122
## Croydon : 304 Ilford : 129 E17 : 100
## Barnet : 290 Tottenham: 126 N1 : 91
## Southwark : 286 Edmonton : 125 E4 : 87
## Haringey : 271 Dagenham : 123 E14 : 84
## Tower Hamlets: 266 Hornsey : 118 NW1 : 83
## (Other) :5515 (Other) :6508 (Other):6695
After lumping the minority levels, PropertyType still has many levels and its Other level dominates; with this imbalance it adds little descriptive power, so we drop it and keep PropertyCategory.
SpecialServiceType is a more detailed breakdown of SpecialServiceTypeCategory, but we keep only the latter to simplify the modelling.
StnGroundName and PostcodeDistrict are useful when analysing a particular location, but since we are looking at London as a whole, we keep only borough-level granularity.
dataset_simplified <- dataset_factor_lump %>%
select(-c(PropertyType, SpecialServiceType, StnGroundName, PostcodeDistrict))
First, plot the distribution of IncidentCost itself.
dataset_simplified %>%
ggplot(aes(x = log(IncidentCost))) +
geom_histogram(binwidth = 0.05, alpha = 0.5) +
xlab("Incident Cost (log scale)") +
labs(title = "Distribution of log scale Incident Cost") +
theme_bw()
We notice the cost is highly skewed even after log transformation; the majority of cases fall below a log value of 6 (~£400), which again indicates that most incidents involve one pump and are resolved within an hour. This threshold also lets us frame the problem as a binary classification.
dataset_simplified <- dataset_simplified %>%
mutate(log_cost = log1p(IncidentCost),
case = ifelse(IncidentCost >400, "Hard", "Easy"),
case = as.factor(case))
We will use log_cost as the response and check how the other factors affect it.
First we check any trend with respect to the time component.
dataset_simplified %>%
mutate(year = year(DateTimeOfCall)) %>%
ggplot(aes(x = year, fill = case)) +
geom_bar() +
labs(title = "Yearly trend for # of incidents by LFB ",
x = "") +
theme_bw()
Noticeably, there is an increase in easy cases in 2020, probably due to lockdown measures keeping people at home: spending more time around animals leads to more incidents being noticed and reported.
dataset_simplified %>%
mutate(month = month(DateTimeOfCall, label = TRUE, abbr = TRUE)) %>%
ggplot(aes(x = month, fill = case)) +
geom_bar() +
labs(title = "Monthly trend for # of incidents by LFB ",
x = "") +
theme_bw()
From the yearly trend, we can see there are about 600 cases per year. Breaking this down by month, summer has a much higher count than winter.
dataset_simplified %>%
mutate(weekday = wday(DateTimeOfCall, label = TRUE, abbr = TRUE)) %>%
ggplot(aes(x = weekday, fill = case)) +
geom_bar() +
labs(title = "Weekly trend for # of incidents by LFB ",
x = "") +
theme_bw()
Again, the weekend has a slightly higher count, probably because people are off work and around their animals, though the difference is not strong.
dataset_simplified %>%
mutate(day = day(DateTimeOfCall)) %>%
ggplot(aes(x = day, fill = case)) +
geom_bar() +
## place an x-axis tick every 7 days
scale_x_continuous(breaks = seq(1,31, 7)) +
labs(title = "Daily trend for # of incidents by LFB ",
x = "") +
theme_bw()
There is no significant trend across the days of the month, and no weekend pattern visible at the 7-day tick spacing. The dips at the end of the month are a calendar artefact: the 29th occurs in February only during leap years, the 30th never occurs in February, and the 31st exists in only seven months (hence roughly half the count of the other dates).
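To illustrate the artefact (an added check, assuming the roughly twelve full years 2009-2020 covered by the data), we can count how often each day-of-month actually occurs in the calendar:
## how many times each day-of-month occurs over the period
tibble(date = seq(as.Date("2009-01-01"), as.Date("2020-12-31"), by = "day")) %>%
  count(day = day(date)) %>%
  filter(day >= 28)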
dataset_simplified %>%
mutate(hour = hour(DateTimeOfCall)) %>%
ggplot(aes(x = hour, fill = case)) +
geom_bar() +
scale_x_continuous(breaks = seq(0,23, 6)) +
labs(title = "Hourly trend for # of incidents by LFB ",
x = "") +
theme_bw()
Understandably, few incidents are called in during the resting hours. There is a peak around midday followed by a gradual decline into the night.
Overall, we will derive month and hour from DateTimeOfCall for our prediction. The yearly trend is likely to return to normal once people go back to work, and the weekly trend is weak, so we leave both out to keep the prediction model simple.
dummy <- dataset_simplified %>% count(OriginofCall)
dataset_simplified %>%
ggplot(aes(x = OriginofCall, y = log_cost)) +
geom_boxplot() +
geom_text(data = dummy, aes(y = 5.3, label = paste("n=", n))) +
labs(title = "Incident Cost with respect to Origin of Calls",
x = "",
y = "Cost (log scale)") +
theme_bw()
The median cost for calls from a mobile is slightly higher than from a land line, and given the large sample sizes the difference matters. A mobile call suggests the caller is outdoors, which may indicate a slightly more complicated incident. Calls from the police may likewise involve more complicated cases, though the sample size there is small.
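To check whether the mobile vs land-line difference is statistically meaningful, a rank-sum test could be run (an added check, output not shown):
## compare log cost between the two person call origins
mobile_vs_landline <- dataset_simplified %>%
  filter(OriginofCall %in% c("Person (mobile)", "Person (land line)")) %>%
  droplevels()
wilcox.test(log_cost ~ OriginofCall, data = mobile_vs_landline)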
dataset_simplified %>%
## strip the "Unknown - " prefix from the labels
mutate(animal_remove_unknown = str_remove(AnimalGroupParent, "Unknown - ")) %>%
ggplot(aes(x = animal_remove_unknown, y = log_cost, colour = case)) +
geom_boxplot() +
labs(title = "Incident Cost with respect to Animal Group",
x = "",
y = "Cost (log scale)") +
scale_x_discrete(label = function (x) str_trunc(x, 18)) +
theme_bw() +
coord_flip()
Costs look broadly similar across most animal groups; horses stand out somewhat, which we will return to later.
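For a formal check across groups (an added test, output not shown), a Kruskal-Wallis test is one option:
## non-parametric test of log cost across animal groups
kruskal.test(log_cost ~ AnimalGroupParent, data = dataset_simplified)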
dummy <- dataset_simplified %>% count(PropertyCategory)
dataset_simplified %>%
ggplot(aes(x = PropertyCategory, y = log_cost)) +
geom_boxplot() +
geom_text(data = dummy, aes(y = 5.3, label = paste("n=", n))) +
labs(title = "Incident Cost with respect to Property Type",
x = "",
y = "Cost (log scale)") +
theme_bw()
Road Vehicle has a slightly higher median value, while the remaining groups are similar, except for Other.
dummy <- dataset_simplified %>% count(SpecialServiceTypeCategory)
dataset_simplified %>%
ggplot(aes(x = SpecialServiceTypeCategory, y = log_cost)) +
geom_boxplot() +
geom_text(data = dummy, aes(y = 5.3, label = paste("n=", n))) +
labs(title = "Incident Cost with respect to Service Type",
x = "",
y = "Cost (log scale)") +
theme_bw()
Animal rescue from water has a wider IQR, likely because trickier cases take longer to resolve, but the median cost is similar to the other categories.
dummy <- dataset_simplified %>% count(Borough)
dataset_simplified %>%
ggplot(aes(x = fct_reorder(Borough, log_cost, mean), y = log_cost, colour = case)) +
geom_boxplot() +
labs(title = "Incident Cost with respect to Borough",
x = "",
y = "Cost (log scale)") +
scale_x_discrete(label = function (x) str_trunc(x, 15)) +
theme_bw() +
coord_flip()
There are noticeable differences between boroughs. Sorting by mean cost, the outer boroughs show a few more outlier values, likely because their larger size means longer travel between station and incident location, which increases the hours needed.
Another noticeable point is that the City of London comes last with the minimum cost. Its small residential population implies few domestic animals or pets, and no farm animals, hence minimal reporting.
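As a quick numerical companion to the plot (an added check), we can list the boroughs with the lowest mean cost together with their incident counts:
## boroughs ranked by mean incident cost, lowest first
dataset_simplified %>%
  group_by(Borough) %>%
  summarise(n = n(), mean_cost = mean(IncidentCost)) %>%
  arrange(mean_cost) %>%
  head()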
Clean up the variable list.
dataset_model <- dataset_simplified %>%
mutate(month = month(DateTimeOfCall, label = TRUE, abbr = TRUE),
hour = hour(DateTimeOfCall)) %>%
select(-c(DateTimeOfCall, IncidentCost))
Check that the factors we have selected are informative, using the Boruta feature-selection algorithm.
set.seed(1234)
boruta_dataset <- Boruta(log_cost ~ . - case - Borough, data = dataset_model)
## adapted snippet to draw the x-axis labels vertically
plot(boruta_dataset, xlab = "", xaxt = "n")
lz <- lapply(1:ncol(boruta_dataset$ImpHistory), function(i)
boruta_dataset$ImpHistory[is.finite(boruta_dataset$ImpHistory[, i]), i])
names(lz) <- colnames(boruta_dataset$ImpHistory)
Labels <- sort(sapply(lz, median))
axis(side = 1, las = 2, labels = names(Labels),
at = 1:ncol(boruta_dataset$ImpHistory), cex.axis = 0.5)
Only AnimalGroupParent shows high importance, while month shows none and hour only a little. So although the time factors strongly affect the number of incidents (and hence the total cost), they do not affect the individual incident cost.
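For a tabular view of the same result, Boruta provides attStats(), which summarises the importance history and the decision per attribute (an added peek, output not shown):
## median importance and accept/reject decision for each attribute
attStats(boruta_dataset)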
dataset_model <- dataset_model %>% select(-c(month, hour))
Split the data into training and testing sets.
set.seed(1234)
split_index <- createDataPartition(dataset_model$log_cost, p = 0.8, list = FALSE)
## createDataPartition returns a one-column matrix; convert to a vector for tibble indexing
training <- dataset_model[as.vector(split_index), ]
testing <- dataset_model[-as.vector(split_index), ]
Set up the trainControl.
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5)
Regularized regression
glmnet_grid <- expand.grid(alpha = seq(0,1, length = 10),
lambda = seq(0.00001, 1, length = 10))
model_glmnet <- train(log_cost ~ . - case,
data = training,
method = "glmnet",
trControl = fitControl,
tuneGrid = glmnet_grid,
metric = "RMSE"
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
Decision tree (method "rpart" fits a single CART tree, not a random forest)
rf_grid <- expand.grid(cp = seq(0.01, 1, length = 10))
set.seed(1234)
model_rf <- train(log_cost ~ . - case,
data = training,
method = "rpart",
trControl = fitControl,
tuneGrid = rf_grid,
metric = "RMSE")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
Bayesian GLM
set.seed(1234)
model_bayes <- train(log_cost ~ . - case,
data = training,
method = "bayesglm",
trControl = fitControl,
metric = "RMSE")
Let’s compare the models
model_compare <- resamples(list(glmnet = model_glmnet, rf = model_rf, bayes = model_bayes))
summary(model_compare)
##
## Call:
## summary.resamples(object = model_compare)
##
## Models: glmnet, rf, bayes
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.1718179 0.1829452 0.1872155 0.1877690 0.1922706 0.2064283 0
## rf 0.1683735 0.1788171 0.1855652 0.1852953 0.1911182 0.2051243 0
## bayes 0.1749146 0.1821412 0.1873531 0.1878967 0.1929526 0.2129091 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.2730637 0.2934897 0.2994892 0.3021037 0.3089118 0.3414669 0
## rf 0.2700709 0.2909711 0.3078350 0.3050891 0.3156118 0.3505814 0
## bayes 0.2753297 0.2900420 0.3038291 0.3023369 0.3122834 0.3424512 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.01181742 0.07941158 0.12163099 0.11846410 0.1516690 0.2085770 0
## rf 0.01485295 0.06874424 0.09755847 0.09962986 0.1194157 0.2332070 0
## bayes 0.05411532 0.08239707 0.10874015 0.11519437 0.1392231 0.2406679 0
bwplot(model_compare, scales = list(x = list(relation = "free")))
The three models perform similarly, but their R-squared values are very low: none of them predicts the incident cost well.
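Before looking at variable importance, it can be useful to see which hyper-parameters cross-validation selected (an added peek, output not shown):
## best tuning parameters chosen by repeated CV
model_glmnet$bestTune
model_rf$bestTune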
varImp(model_rf)
## rpart variable importance
##
## only 20 most important variables shown (out of 55)
##
## Overall
## AnimalGroupParentHorse 100.000
## PropertyCategoryOutdoor 30.280
## SpecialServiceTypeCategoryAnimal rescue from water 28.559
## OriginofCallPerson (mobile) 25.566
## SpecialServiceTypeCategoryAnimal rescue from height 10.269
## AnimalGroupParentDeer 4.178
## BoroughBrent 0.000
## BoroughWestminster 0.000
## `BoroughWaltham Forest` 0.000
## BoroughBarnet 0.000
## BoroughLewisham 0.000
## `AnimalGroupParentUnknown - Domestic Animal Or Pet` 0.000
## `AnimalGroupParentUnknown - Wild Animal` 0.000
## BoroughBexley 0.000
## BoroughHarrow 0.000
## AnimalGroupParentDog 0.000
## BoroughHounslow 0.000
## BoroughCamden 0.000
## BoroughHavering 0.000
## AnimalGroupParentCat 0.000
varImp(model_glmnet)
## glmnet variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## AnimalGroupParentHorse 100.000
## PropertyCategoryOther 25.598
## SpecialServiceTypeCategoryAnimal rescue from water 18.009
## AnimalGroupParentDeer 16.829
## OriginofCallOther 16.598
## OriginofCallPerson (mobile) 11.992
## BoroughTower Hamlets 11.185
## AnimalGroupParentUnknown - Wild Animal 10.749
## BoroughBromley 10.063
## PropertyCategoryOutdoor 9.479
## AnimalGroupParentOther 9.255
## BoroughHounslow 8.960
## BoroughEnfield 8.828
## BoroughSutton 8.792
## AnimalGroupParentSquirrel 7.928
## AnimalGroupParentUnknown - Domestic Animal Or Pet 7.829
## BoroughRedbridge 7.708
## BoroughWestminster 7.280
## BoroughBexley 7.241
## AnimalGroupParentDog 6.644
Incidents involving horses have by far the strongest impact on the prediction; no other animal comes close (deer is a distant second). Animal rescue from water also has some effect on the predicted cost. All other factors show low and mixed importance.
Let's dig deeper into why horses stand out.
dataset_simplified %>%
group_by(AnimalGroupParent) %>%
summarise(average_cost = mean(IncidentCost),
median_cost = median(IncidentCost)) %>%
arrange(desc(average_cost))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 3
## AnimalGroupParent average_cost median_cost
## <fct> <dbl> <dbl>
## 1 Horse 733. 590
## 2 Unknown - Wild Animal 412. 326
## 3 Deer 411. 333
## 4 Other 377. 295
## 5 Fox 363. 328
## 6 Dog 345. 298
## 7 Cat 340. 298
## 8 Bird 339. 326
## 9 Unknown - Domestic Animal Or Pet 324. 295
## 10 Squirrel 311. 326
Given the maximum hourly rate is £346, any cost above £400 requires more than one pump-hour and therefore indicates a harder incident to resolve, and horses are clearly the most distinct group. For Unknown - Wild Animal and Deer, the average is above £400 but the median is clearly below it, so their means are skewed by a few extreme cases. The median cost for horses, by contrast, sits far above £400. We can conclude that almost any animal special-service call the LFB receives is fairly easy to handle; the majority arise in domestic settings, hence mostly cats and dogs. Do beware of the horses though.
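Another added view of the same point: the share of Hard (above £400) incidents per animal group, which should put horses well clear of the rest:
## proportion of incidents classed as Hard, by animal group
dataset_simplified %>%
  group_by(AnimalGroupParent) %>%
  summarise(hard_share = mean(case == "Hard")) %>%
  arrange(desc(hard_share))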
Let's now try to classify case, to see whether a classifier does a better job.
table(dataset_model$case)
##
## Easy Hard
## 6412 850
Though the classes are imbalanced, we will skip resampling for now and keep the previous training and testing sets.
fitControl_class <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
search = "random")
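For reference, if we did want to address the imbalance, caret can resample within each cross-validation fold; a minimal sketch using up-sampling (an added illustration, not used below):
## sketch only: up-sample the minority class inside each CV fold
fitControl_up <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
sampling = "up")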
Logistic regression
model_log_class <- train(case ~ . - log_cost,
data = training,
method = "glm",
family = "binomial",
trControl = fitControl_class,
metric = "ROC"
)
Decision tree (rpart)
set.seed(1234)
model_rf_class <- train(case ~ . - log_cost,
data = training,
method = "rpart",
trControl = fitControl_class,
tuneLength = 20,
metric = "ROC")
Bayesian GLM
set.seed(1234)
model_bayes_class <- train(case ~ . - log_cost,
data = training,
method = "bayesglm",
trControl = fitControl_class,
metric = "ROC")
varImp(model_log_class)
## glm variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## AnimalGroupParentHorse 100.000
## PropertyCategoryOutdoor 63.101
## PropertyCategoryOther 28.725
## AnimalGroupParentOther 27.625
## `PropertyCategoryNon Residential` 26.794
## AnimalGroupParentDeer 25.411
## `PropertyCategoryRoad Vehicle` 24.592
## `SpecialServiceTypeCategoryOther animal assistance` 23.542
## OriginofCallOther 22.216
## `SpecialServiceTypeCategoryAnimal rescue from height` 19.683
## AnimalGroupParentCat 19.675
## BoroughBromley 16.872
## BoroughCroydon 15.114
## BoroughHounslow 13.429
## BoroughSutton 12.332
## `BoroughTower Hamlets` 12.255
## `SpecialServiceTypeCategoryAnimal rescue from water` 11.582
## AnimalGroupParentSquirrel 11.548
## BoroughSouthwark 10.866
## BoroughWestminster 9.949
varImp(model_rf_class)
## rpart variable importance
##
## only 20 most important variables shown (out of 60)
##
## Overall
## AnimalGroupParentHorse 1.000e+02
## PropertyCategoryOutdoor 5.118e+01
## SpecialServiceTypeCategoryAnimal rescue from water 4.400e+01
## OriginofCallOther 1.668e+01
## SpecialServiceTypeCategoryAnimal rescue from height 6.760e+00
## AnimalGroupParentOther 6.713e+00
## BoroughBromley 5.758e+00
## SpecialServiceTypeCategoryOther animal assistance 3.701e+00
## PropertyCategoryRoad Vehicle 3.247e+00
## AnimalGroupParentDeer 2.556e+00
## AnimalGroupParentCat 2.130e+00
## BoroughEnfield 1.403e+00
## BoroughTower Hamlets 1.289e+00
## BoroughHounslow 1.228e+00
## BoroughHillingdon 1.020e+00
## OriginofCallPerson (mobile) 1.008e+00
## BoroughIslington 8.028e-01
## PropertyCategoryNon Residential 2.250e-01
## PropertyCategoryOutdoor Structure 1.638e-01
## AnimalGroupParentDog 8.153e-03
Compare the models
## confusion matrix
confusionMatrix(predict(model_log_class, newdata = training), training$case)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Easy Hard
## Easy 5086 576
## Hard 46 103
##
## Accuracy : 0.893
## 95% CI : (0.8847, 0.9008)
## No Information Rate : 0.8832
## P-Value [Acc > NIR] : 0.009883
##
## Kappa : 0.2158
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9910
## Specificity : 0.1517
## Pos Pred Value : 0.8983
## Neg Pred Value : 0.6913
## Prevalence : 0.8832
## Detection Rate : 0.8752
## Detection Prevalence : 0.9744
## Balanced Accuracy : 0.5714
##
## 'Positive' Class : Easy
##
confusionMatrix(predict(model_rf_class, newdata = training), training$case)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Easy Hard
## Easy 5083 572
## Hard 49 107
##
## Accuracy : 0.8931
## 95% CI : (0.8849, 0.901)
## No Information Rate : 0.8832
## P-Value [Acc > NIR] : 0.008831
##
## Kappa : 0.2223
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9905
## Specificity : 0.1576
## Pos Pred Value : 0.8989
## Neg Pred Value : 0.6859
## Prevalence : 0.8832
## Detection Rate : 0.8747
## Detection Prevalence : 0.9732
## Balanced Accuracy : 0.5740
##
## 'Positive' Class : Easy
##
confusionMatrix(predict(model_bayes_class, newdata = training), training$case)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Easy Hard
## Easy 5087 578
## Hard 45 101
##
## Accuracy : 0.8928
## 95% CI : (0.8846, 0.9006)
## No Information Rate : 0.8832
## P-Value [Acc > NIR] : 0.01104
##
## Kappa : 0.2123
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9912
## Specificity : 0.1487
## Pos Pred Value : 0.8980
## Neg Pred Value : 0.6918
## Prevalence : 0.8832
## Detection Rate : 0.8754
## Detection Prevalence : 0.9749
## Balanced Accuracy : 0.5700
##
## 'Positive' Class : Easy
##
Relative to the no-information rate, the models are not a significant improvement. All three performed similarly and struggle to identify the hard cases: each made 570+ misidentifications, giving a specificity of around 15%. This suggests we may need to go back and resample the data, since hard cases are under-represented. Though given our regression results, the few variables available may not be sufficient to improve the model in the first place.
model_compare_class <- resamples(list(log = model_log_class, rf = model_rf_class, bayes = model_bayes_class))
summary(model_compare_class)
##
## Call:
## summary.resamples(object = model_compare_class)
##
## Models: log, rf, bayes
## Number of resamples: 50
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## log 0.5831613 0.6460877 0.6718175 0.6684388 0.6964403 0.7304589 0
## rf 0.5532588 0.6161675 0.6404440 0.6401840 0.6662868 0.7293028 0
## bayes 0.5747334 0.6453603 0.6648463 0.6698299 0.6991098 0.7741801 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## log 0.9824561 0.9883041 0.9902534 0.9905306 0.9922141 0.9980507 0
## rf 0.9746589 0.9844055 0.9863680 0.9866326 0.9902534 0.9961014 0
## bayes 0.9785575 0.9883041 0.9922027 0.9906856 0.9941520 0.9980507 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## log 0.07352941 0.1176471 0.1343284 0.1434416 0.1617647 0.2500000 0
## rf 0.07352941 0.1176471 0.1406936 0.1390035 0.1617647 0.2205882 0
## bayes 0.08823529 0.1176471 0.1323529 0.1440386 0.1617647 0.2352941 0
bwplot(model_compare_class, scales = list(x = list(relation = "free")))
Logistic regression and the Bayesian GLM performed slightly better than the decision tree.
Final check on the testing set.
pred_log_test <- predict(model_log_class, testing)
confusionMatrix(pred_log_test, testing$case)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Easy Hard
## Easy 1265 146
## Hard 15 25
##
## Accuracy : 0.889
## 95% CI : (0.8717, 0.9047)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 0.2208
##
## Kappa : 0.2013
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9883
## Specificity : 0.1462
## Pos Pred Value : 0.8965
## Neg Pred Value : 0.6250
## Prevalence : 0.8822
## Detection Rate : 0.8718
## Detection Prevalence : 0.9724
## Balanced Accuracy : 0.5672
##
## 'Positive' Class : Easy
##
The model is not good at predicting whether an incoming case will be hard; one might as well treat every call as an easy case.
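As a final added comparison, the trivial baseline of predicting Easy for every call achieves accuracy equal to the no-information rate:
## accuracy of always predicting "Easy" on the test set
mean(testing$case == "Easy")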