In this document we will examine, through Empirical Bayes estimation, the rates at which rodent baiting service requests were completed in the city of Chicago for the years 2013 through 2015. For each service request, we will check whether the service was completed within one week (7 days). I chose a one-week window because it is a natural, commonly used period for judging how quickly work gets done. We will look at requests made by the public that were reportedly resolved with actual inspections and baiting.
The inspiration for this project comes from a series of posts at varianceexplained.org. I would highly recommend the entries there on the Beta distribution and on Empirical Bayes estimation.
The data are from the City of Chicago Data Portal at https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting/97t6-zrhs. The data were downloaded on March 23, 2016 at 1:06 P.M. as a .csv file and were filtered to include only records meeting the following stipulations:
The following columns were included in the data file:
Not all of the columns will be used in this analysis, but some may be used in future projects.
We will now load the data into memory, attach the additional libraries needed for the project, and format the columns to suit our purposes. In addition, three new variables will be created: Elapsed.Time, which measures the time elapsed between the Creation Date and Completion Date in days, and month and year variables indicating the month and year when the request was created.
library(dplyr) # Data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # Graphics
mice <- read.csv('311_Service_Requests_-_Rodent_Baiting.csv',header=T,sep=',') # Load data into memory
mice$Creation.Date <- as.Date(as.character(mice$Creation.Date),"%m/%d/%Y") # Format date
mice$Completion.Date <- as.Date(as.character(mice$Completion.Date),"%m/%d/%Y") # Same as above
mice$Elapsed.Time <- as.numeric(difftime(mice$Completion.Date,mice$Creation.Date,units='days')) # Time between request and completion of service
mice$month <- months(mice$Creation.Date,abbreviate=FALSE) # Month variable
mice$year <- format(mice$Creation.Date,"%Y") # Year variable (format() already returns character)
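One quick sanity check, not part of the original analysis: if any requests were still open when the data were downloaded, their Completion.Date, and therefore Elapsed.Time, would be missing and would silently affect the monthly tallies. A minimal sketch:
sum(is.na(mice$Elapsed.Time)) # Hypothetical check: requests with no recorded completion
summary(mice$Elapsed.Time) # Negative values here would indicate date-entry errors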
For a prior distribution, we will use observations from the year 2013, data we had the opportunity to work with in a previous project. From these observations we will construct a model describing how the completion rates are distributed.
To examine the data from 2013, we will create a subset of the data frame and reshape it so it can be summarized by month. Two new variables will also be constructed: one showing the total number of requests made in each month, and another showing the number of those requests completed within seven days.
mice2 <- subset(mice,mice$year == '2013') # Subset of data for requests from 2013
mice2a <- select(mice2,month,Elapsed.Time) # An additional subset containing month and Elapsed.Time
tableA <- mice2a %>% group_by(month) %>% summarize(total = n()) # One table to total all requests by month
tableB <- mice2a %>% group_by(month) %>% tally(Elapsed.Time <= 7) # Another to count requests done in one week
tableC <- left_join(tableA,tableB,by='month') # The tables are merged and...
tableC$pct <- round(tableC$n/tableC$total,3) # percentages of requests done in one week are calculated
tableC
## Source: local data frame [12 x 4]
##
## month total n pct
## (chr) (int) (int) (dbl)
## 1 April 1456 1340 0.920
## 2 August 3239 318 0.098
## 3 December 728 326 0.448
## 4 February 717 700 0.976
## 5 January 1105 1064 0.963
## 6 July 2555 230 0.090
## 7 June 1980 789 0.398
## 8 March 1014 951 0.938
## 9 May 1778 1567 0.881
## 10 November 1233 126 0.102
## 11 October 2448 224 0.092
## 12 September 2702 211 0.078
From the printed table above, you may be able to discern that most of the months fall into two distinct groups of completion rates, with June and December sitting in between. The following barplot may make this clearer:
# To place the months in order of occurrence, they will have to be converted to the factor data type
# mlevels will add the order we will need
mlevels <- c('January','February','March','April','May','June','July','August','September','October','November','December')
ggplot(tableC,aes(x=factor(month,levels=mlevels),y=pct))+geom_bar(stat='identity')+xlab('Months')+ylab('Percentage')+ggtitle('Percentages of Service Requests completed in one week by month')
With this evidence from the year 2013, we will split up the data into three distinct groups by month:
tableC1 <- subset(tableC,tableC$month %in% c('January','February','March','April','May'))
tableC2 <- subset(tableC,tableC$month %in% c('July','August','September','October','November'))
tableC3 <- subset(tableC,tableC$month %in% c('June','December'))
The data can be modeled through the binomial distribution, treating requests completed within a week as successes out of the total number of requests for the month. We can then represent the prior expectations of each group through the beta distribution, the conjugate prior of the binomial. For each group we will fit a distribution to the monthly rates using the fitdistr() function in the MASS package, and record the fitted parameters \(\alpha\) and \(\beta\).
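In model terms: for each month, \(n \sim \text{Binomial}(\text{total},\, p)\), with the completion rate \(p\) drawn from a \(\text{Beta}(\alpha, \beta)\) prior, so the fitted \(\alpha\) and \(\beta\) encode what we expect \(p\) to look like before seeing new data.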
group1 <- MASS::fitdistr(tableC1$pct,dbeta,start=list(shape1=9,shape2=1)) # Fit a beta to the winter/spring months
alpha1 <- group1$estimate[1]
beta1 <- group1$estimate[2]
group2 <- MASS::fitdistr(tableC2$pct,dbeta,start=list(shape1=4,shape2=60)) # Summer/fall months
alpha2 <- group2$estimate[1]
beta2 <- group2$estimate[2]
group3 <- MASS::fitdistr(tableC3$pct,dbeta,start=list(shape1=109,shape2=127)) # June and December
alpha3 <- group3$estimate[1]
beta3 <- group3$estimate[2]
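As a visual check, not part of the original write-up, we can overlay the three fitted priors; the exact shapes depend on the \(\alpha\) and \(\beta\) values that fitdistr() returns:
x <- seq(0.001,0.999,length.out=500) # Avoid the endpoints, where beta densities can be infinite
d1 <- dbeta(x,alpha1,beta1); d2 <- dbeta(x,alpha2,beta2); d3 <- dbeta(x,alpha3,beta3)
plot(x,d1,type='l',col='blue',ylim=c(0,max(d1,d2,d3)),xlab='Completion rate',ylab='Density',main='Fitted beta priors by group')
lines(x,d2,col='red')
lines(x,d3,col='darkgreen')
legend('top',c('Jan-May','Jul-Nov','Jun & Dec'),col=c('blue','red','darkgreen'),lty=1)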
Now that we have parameters for our prior distributions, we will take the full dataset and prepare the data as we did previously, splitting it into the three groups referred to earlier.
mice3 <- select(mice,month,Elapsed.Time) # Data from all three years
mice3a <- mice3 %>% group_by(month) %>% summarize(total=n()) # Number of requests in each month
mice3b <- mice3 %>% group_by(month) %>% tally(Elapsed.Time <= 7) # Number of requests done in a week
mice3c <- left_join(mice3a,mice3b,by='month') # The data are brought back together..
# ... and now separated by the groups of months described earlier
table3yr1 <- subset(mice3c,mice3c$month %in% c('January','February','March','April','May'))
table3yr2 <- subset(mice3c,mice3c$month %in% c('July','August','September','October','November'))
table3yr3 <- subset(mice3c,mice3c$month %in% c('June','December'))
table3yr1$pct <- round(table3yr1$n/table3yr1$total,3) # Actual percentage of "successes" for each month in each group
table3yr2$pct <- round(table3yr2$n/table3yr2$total,3)
table3yr3$pct <- round(table3yr3$n/table3yr3$total,3)
Now, for the months in each group, we will calculate an estimated rate based on the group's \(\alpha\) and \(\beta\), the total number of requests (total), and the number of requests completed within a week (n). The formula is \(\text{estimate} = (\alpha + n)/(\alpha + \beta + \text{total})\).
table_group1 <- table3yr1 %>% mutate(eb_estimate = (alpha1+n)/(alpha1+beta1+total))
table_group2 <- table3yr2 %>% mutate(eb_estimate = (alpha2+n)/(alpha2+beta2+total))
table_group3 <- table3yr3 %>% mutate(eb_estimate = (alpha3+n)/(alpha3+beta3+total))
table_group1$eb_estimate <- round(table_group1$eb_estimate,3) # Rounding results for clarity
table_group2$eb_estimate <- round(table_group2$eb_estimate,3)
table_group3$eb_estimate <- round(table_group3$eb_estimate,3)
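As an optional check, not in the original analysis, we can compare each month's raw rate with its estimate and the group's prior mean \(\alpha/(\alpha+\beta)\); the estimates sit between the two, and months with fewer requests are pulled further toward the prior. A sketch for the first group (the other groups work the same way):
table_group1 %>% mutate(prior_mean = round(alpha1/(alpha1+beta1),3)) %>% select(month,total,pct,eb_estimate,prior_mean) # Shrinkage check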
We can now update the parameters for each of the months in each group with the collected data. Since the beta distribution is conjugate to the binomial, combining a Beta(\(\alpha\), \(\beta\)) prior with n successes out of total requests gives a Beta(\(\alpha + n\), \(\beta + \text{total} - n\)) posterior. In other words, \(\alpha\) is updated by adding the number of requests completed in one week for each month, and \(\beta\) by adding the number of requests less the successes. These new parameters are the parameters of the posterior distribution for each of the months.
table_group1 <- table_group1 %>% mutate(alpha1new = alpha1 + n, beta1new = total - n + beta1)
table_group2 <- table_group2 %>% mutate(alpha2new = alpha2 + n, beta2new = total - n + beta2)
table_group3 <- table_group3 %>% mutate(alpha3new = alpha3 + n, beta3new = total - n + beta3)
With the updated parameters, we can compute the limits of a credible interval for each of the months using the qbeta() function. Here, we will create 95% intervals, within which each month's true completion rate should fall with 95% posterior probability.
table_group1 <- table_group1 %>% mutate(low = qbeta(.025,alpha1new,beta1new), high = qbeta(.975,alpha1new,beta1new))
table_group2 <- table_group2 %>% mutate(low = qbeta(.025,alpha2new,beta2new), high = qbeta(.975,alpha2new,beta2new))
table_group3 <- table_group3 %>% mutate(low = qbeta(.025,alpha3new,beta3new), high = qbeta(.975,alpha3new,beta3new))
est_group1 <- table_group1 %>% select(-alpha1new,-beta1new) # Remove the new alphas and betas for
est_group2 <- table_group2 %>% select(-alpha2new,-beta2new) # readability of the table
est_group3 <- table_group3 %>% select(-alpha3new,-beta3new) # below
est_group1$low <- round(est_group1$low,3);est_group1$high <- round(est_group1$high,3) # Round the credible interval
est_group2$low <- round(est_group2$low,3);est_group2$high <- round(est_group2$high,3) # limits for display,
est_group3$low <- round(est_group3$low,3);est_group3$high <- round(est_group3$high,3) # for each month
est_final <- rbind(est_group1,est_group2,est_group3) # the data are brought back together and displayed
est_final
## Source: local data frame [12 x 7]
##
## month total n pct eb_estimate low high
## (chr) (int) (int) (dbl) (dbl) (dbl) (dbl)
## 1 April 4875 3277 0.672 0.675 0.662 0.688
## 2 February 1423 1182 0.831 0.834 0.815 0.853
## 3 January 2188 1810 0.827 0.830 0.814 0.845
## 4 March 3187 2522 0.791 0.794 0.780 0.807
## 5 May 5657 3438 0.608 0.611 0.598 0.623
## 6 August 9986 1103 0.110 0.109 0.103 0.115
## 7 July 10044 1042 0.104 0.103 0.097 0.108
## 8 November 4771 1226 0.257 0.228 0.217 0.239
## 9 October 8012 827 0.103 0.102 0.096 0.108
## 10 September 8670 781 0.090 0.090 0.085 0.096
## 11 December 3824 2167 0.567 0.554 0.539 0.569
## 12 June 7177 2032 0.283 0.290 0.280 0.300
Now, we can create a barplot for the months incorporating the estimates from the additional data. Error bars will also be added to the plot to demonstrate the credible intervals.
ggplot(est_final,aes(x=factor(month,levels=mlevels),y=eb_estimate))+geom_bar(stat='identity',color='red',fill='gray')+xlab('Months')+ylab('Estimated Percentage')+geom_errorbar(aes(ymin=low,ymax=high,width=.2))+ggtitle('Monthly estimated percentages of rodent baiting\nservice requests completed within one week')
Overall, the data tell us that the estimates fit the actual percentages well when we compare the actual percentage (pct) to the Empirical Bayes estimate (eb_estimate). However, the month of November exceeded our estimate, with an actual percentage of .257 against an estimated .228, and its raw rate even falls outside the credible interval. In 2013, November's one-week completion rate was a little over 10%; the following two Novembers showed marked improvement, with a rate close to 31%. In addition, April and May no longer look as similar to January, February, and March as they did in 2013, though their rates remain strikingly different from those of the summer and fall months.
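As a quick follow-up, not in the original analysis, we can ask which months' raw percentages fall outside their 95% credible intervals; from the table above, November is the only such month:
est_final %>% filter(pct < low | pct > high) # Months whose raw rate lies outside its interval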
In conclusion, the prior distributions served well as a starting point for estimating how the data were distributed. Empirical Bayes provided generally good estimates of the monthly percentages. Improvements might be found, however, by including information on weather or by employing a mixture model.