In this document we will examine, through Empirical Bayes estimation, the rates at which rodent baiting service requests were completed in the city of Chicago for the years 2013 through 2015. For each service request, we will check whether the service was completed within one week (7 days). I chose a one-week window because it is a natural, commonly used period for judging how quickly work gets done. We will look at requests made by the public that were reportedly resolved with actual inspections and baiting.
The inspiration for this project comes from a series of posts at varianceexplained.org. I would highly recommend the entries there on the Beta distribution and on Empirical Bayes estimation.
The data are from the City of Chicago Data Portal at https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting/97t6-zrhs. The data were downloaded on March 23, 2016 at 1:06 P.M. as a .csv file and were filtered to include only records meeting the following stipulations:
The following columns were included in the data file:
Not all of the columns will be used in this analysis, but some may be used in future projects.
We will now load the data into memory, attach the additional libraries needed for the project, and format the columns to suit our purposes. In addition, three new variables will be created: Elapsed.Time, which measures the time elapsed between the Creation Date and Completion Date in days, and month and year variables indicating the month and year when the request was created.
library(dplyr) # Data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # Graphics
mice <- read.csv('311_Service_Requests_-_Rodent_Baiting.csv',header=T,sep=',') # Load data into memory
mice$Creation.Date <- as.Date(as.character(mice$Creation.Date),"%m/%d/%Y") # Format date
mice$Completion.Date <- as.Date(as.character(mice$Completion.Date),"%m/%d/%Y") # Same as above
mice$Elapsed.Time <- as.numeric(difftime(mice$Completion.Date,mice$Creation.Date,units='days')) # Time between request and completion of service
mice$month <- months(mice$Creation.Date,abbreviate=FALSE) # Month variable
mice$year <- format(mice$Creation.Date,"%Y") # Year variable (format() already returns character)
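One quick sanity check, not part of the original analysis: if any requests were still open when the data were downloaded, their Completion.Date, and therefore Elapsed.Time, would be missing and would silently affect the monthly tallies. A minimal sketch:
sum(is.na(mice$Elapsed.Time)) # Hypothetical check: requests with no recorded completion
summary(mice$Elapsed.Time) # Negative values here would indicate date-entry errors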
For a prior distribution, we will use observations from the year 2013, data we had the opportunity to work with in a previous project. From these observations we will construct a model describing how the completion rates are distributed.
To examine the data from 2013, we will create a subset of the data frame and reshape it so it can be summarized by month. Two new variables will also be constructed: one showing the total number of requests made in each month, and another showing the number of those requests completed within seven days.
mice2 <- subset(mice,mice$year == '2013') # Subset of data for requests from 2013
mice2a <- select(mice2,month,Elapsed.Time) # An additional subset containing month and Elapsed.Time
tableA <- mice2a %>% group_by(month) %>% summarize(total = n()) # One table to total all requests by month
tableB <- mice2a %>% group_by(month) %>% tally(Elapsed.Time <= 7) # Another to count requests done in one week
tableC <- left_join(tableA,tableB,by='month') # The tables are merged and...
tableC$pct <- round(tableC$n/tableC$total,3) # percentages of requests done in one week are calculated
tableC
## Source: local data frame [12 x 4]
##
## month total n pct
## (chr) (int) (int) (dbl)
## 1 April 1456 1340 0.920
## 2 August 3239 318 0.098
## 3 December 728 326 0.448
## 4 February 717 700 0.976
## 5 January 1105 1064 0.963
## 6 July 2555 230 0.090
## 7 June 1980 789 0.398
## 8 March 1014 951 0.938
## 9 May 1778 1567 0.881
## 10 November 1233 126 0.102
## 11 October 2448 224 0.092
## 12 September 2702 211 0.078
From the printed table above, you may be able to discern that most of the months fall into two distinct groups of completion rates, with June and December sitting in between. The following barplot may make this clearer:
# To place the months in order of occurrence, they will have to be converted to the factor data type
# mlevels will add the order we will need
mlevels <- c('January','February','March','April','May','June','July','August','September','October','November','December')
ggplot(tableC,aes(x=factor(month,levels=mlevels),y=pct))+geom_bar(stat='identity')+xlab('Months')+ylab('Percentage')+ggtitle('Percentages of Service Requests completed in one week by month')
With this evidence from the year 2013, we will split up the data into three distinct groups by month:
tableC1 <- subset(tableC,tableC$month %in% c('January','February','March','April','May'))
tableC2 <- subset(tableC,tableC$month %in% c('July','August','September','October','November'))
tableC3 <- subset(tableC,tableC$month %in% c('June','December'))
The data can be modeled through the binomial distribution, treating requests completed within a week as successes out of the total number of requests for the month. We can then represent the prior expectations of each group through the beta distribution, the conjugate prior of the binomial. For each group we will fit a distribution to the monthly rates using the fitdistr() function in the MASS package, and record the fitted parameters \(\alpha\) and \(\beta\).
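In model terms: for each month, \(n \sim \text{Binomial}(\text{total},\, p)\), with the completion rate \(p\) drawn from a \(\text{Beta}(\alpha, \beta)\) prior, so the fitted \(\alpha\) and \(\beta\) encode what we expect \(p\) to look like before seeing new data.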
group1 <- MASS::fitdistr(tableC1$pct,dbeta,start=list(shape1=9,shape2=1)) # Fit a beta to the winter/spring months
alpha1 <- group1$estimate[1]
beta1 <- group1$estimate[2]
group2 <- MASS::fitdistr(tableC2$pct,dbeta,start=list(shape1=4,shape2=60)) # Summer/fall months
alpha2 <- group2$estimate[1]
beta2 <- group2$estimate[2]
group3 <- MASS::fitdistr(tableC3$pct,dbeta,start=list(shape1=109,shape2=127)) # June and December
alpha3 <- group3$estimate[1]
beta3 <- group3$estimate[2]
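As a visual check, not part of the original write-up, we can overlay the three fitted priors; the exact shapes depend on the \(\alpha\) and \(\beta\) values that fitdistr() returns:
x <- seq(0.001,0.999,length.out=500) # Avoid the endpoints, where beta densities can be infinite
d1 <- dbeta(x,alpha1,beta1); d2 <- dbeta(x,alpha2,beta2); d3 <- dbeta(x,alpha3,beta3)
plot(x,d1,type='l',col='blue',ylim=c(0,max(d1,d2,d3)),xlab='Completion rate',ylab='Density',main='Fitted beta priors by group')
lines(x,d2,col='red')
lines(x,d3,col='darkgreen')
legend('top',c('Jan-May','Jul-Nov','Jun & Dec'),col=c('blue','red','darkgreen'),lty=1)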
Now that we have parameters for our prior distributions, we will take the full dataset and prepare the data as we did previously, splitting it into the three groups referred to earlier.
mice3 <- select(mice,month,Elapsed.Time) # Data from all three years
mice3a <- mice3 %>% group_by(month) %>% summarize(total=n()) # Number of requests in each month
mice3b <- mice3 %>% group_by(month) %>% tally(Elapsed.Time <= 7) # Number of requests done in a week
mice3c <- left_join(mice3a,mice3b,by='month') # The data are brought back together..
# ... and now separated by the groups of months described earlier
table3yr1 <- subset(mice3c,mice3c$month %in% c('January','February','March','April','May'))
table3yr2 <- subset(mice3c,mice3c$month %in% c('July','August','September','October','November'))
table3yr3 <- subset(mice3c,mice3c$month %in% c('June','December'))
table3yr1$pct <- round(table3yr1$n/table3yr1$total,3) # Actual percentage of "successes" for each month in each group
table3yr2$pct <- round(table3yr2$n/table3yr2$total,3)
table3yr3$pct <- round(table3yr3$n/table3yr3$total,3)
Now, for the months in each group, we will calculate an estimated rate based on the group's \(\alpha\) and \(\beta\), the total number of requests (total), and the number of requests completed within a week (n). The formula is \(\text{estimate} = (\alpha + n)/(\alpha + \beta + \text{total})\).
table_group1 <- table3yr1 %>% mutate(eb_estimate = (alpha1+n)/(alpha1+beta1+total))
table_group2 <- table3yr2 %>% mutate(eb_estimate = (alpha2+n)/(alpha2+beta2+total))
table_group3 <- table3yr3 %>% mutate(eb_estimate = (alpha3+n)/(alpha3+beta3+total))
table_group1$eb_estimate <- round(table_group1$eb_estimate,3) # Rounding results for clarity
table_group2$eb_estimate <- round(table_group2$eb_estimate,3)
table_group3$eb_estimate <- round(table_group3$eb_estimate,3)
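As an optional check, not in the original analysis, we can compare each month's raw rate with its estimate and the group's prior mean \(\alpha/(\alpha+\beta)\); the estimates sit between the two, and months with fewer requests are pulled further toward the prior. A sketch for the first group (the other groups work the same way):
table_group1 %>% mutate(prior_mean = round(alpha1/(alpha1+beta1),3)) %>% select(month,total,pct,eb_estimate,prior_mean) # Shrinkage check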
We can now update the parameters for each of the months in each group with the collected data. Since the beta distribution is conjugate to the binomial, combining a Beta(\(\alpha\), \(\beta\)) prior with n successes out of total requests gives a Beta(\(\alpha + n\), \(\beta + \text{total} - n\)) posterior. In other words, \(\alpha\) is updated by adding the number of requests completed in one week for each month, and \(\beta\) by adding the number of requests less the successes. These new parameters are the parameters of the posterior distribution for each of the months.
table_group1 <- table_group1 %>% mutate(alpha1new = alpha1 + n, beta1new = total - n + beta1)
table_group2 <- table_group2 %>% mutate(alpha2new = alpha2 + n, beta2new = total - n + beta2)
table_group3 <- table_group3 %>% mutate(alpha3new = alpha3 + n, beta3new = total - n + beta3)
With the updated parameters, we can compute the limits of a credible interval for each of the months using the qbeta() function. Here, we will create 95% intervals, within which each month's true completion rate should fall with 95% posterior probability.
table_group1 <- table_group1 %>% mutate(low = qbeta(.025,alpha1new,beta1new), high = qbeta(.975,alpha1new,beta1new))
table_group2 <- table_group2 %>% mutate(low = qbeta(.025,alpha2new,beta2new), high = qbeta(.975,alpha2new,beta2new))
table_group3 <- table_group3 %>% mutate(low = qbeta(.025,alpha3new,beta3new), high = qbeta(.975,alpha3new,beta3new))
est_group1 <- table_group1 %>% select(-alpha1new,-beta1new) # Remove the new alphas and betas for
est_group2 <- table_group2 %>% select(-alpha2new,-beta2new) # readability of the table
est_group3 <- table_group3 %>% select(-alpha3new,-beta3new) # below
est_group1$low <- round(est_group1$low,3);est_group1$high <- round(est_group1$high,3) # Round the credible interval
est_group2$low <- round(est_group2$low,3);est_group2$high <- round(est_group2$high,3) # limits for display,
est_group3$low <- round(est_group3$low,3);est_group3$high <- round(est_group3$high,3) # for each month
est_final <- rbind(est_group1,est_group2,est_group3) # the data are brought back together and displayed
est_final
## Source: local data frame [12 x 7]
##
## month total n pct eb_estimate low high
## (chr) (int) (int) (dbl) (dbl) (dbl) (dbl)
## 1 April 4875 3277 0.672 0.675 0.662 0.688
## 2 February 1423 1182 0.831 0.834 0.815 0.853
## 3 January 2188 1810 0.827 0.830 0.814 0.845
## 4 March 3187 2522 0.791 0.794 0.780 0.807
## 5 May 5657 3438 0.608 0.611 0.598 0.623
## 6 August 9986 1103 0.110 0.109 0.103 0.115
## 7 July 10044 1042 0.104 0.103 0.097 0.108
## 8 November 4771 1226 0.257 0.228 0.217 0.239
## 9 October 8012 827 0.103 0.102 0.096 0.108
## 10 September 8670 781 0.090 0.090 0.085 0.096
## 11 December 3824 2167 0.567 0.554 0.539 0.569
## 12 June 7177 2032 0.283 0.290 0.280 0.300
Now, we can create a barplot for the months incorporating the estimates from the additional data. Error bars will also be added to the plot to demonstrate the credible intervals.
ggplot(est_final,aes(x=factor(month,levels=mlevels),y=eb_estimate))+geom_bar(stat='identity',color='red',fill='gray')+xlab('Months')+ylab('Estimated Percentage')+geom_errorbar(aes(ymin=low,ymax=high,width=.2))+ggtitle('Monthly estimated percentages of rodent baiting\nservice requests completed within one week')
Overall, the data tell us that the estimates fit the actual percentages well when we compare the actual percentage (pct) to the Empirical Bayes estimate (eb_estimate). However, the month of November exceeded our estimate, with an actual percentage of .257 against an estimated .228, and its raw rate even falls outside the credible interval. In 2013, November's one-week completion rate was a little over 10%; the following two Novembers showed marked improvement, with a rate close to 31%. In addition, April and May no longer look as similar to January, February, and March as they did in 2013, though their rates remain strikingly different from those of the summer and fall months.
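As a quick follow-up, not in the original analysis, we can ask which months' raw percentages fall outside their 95% credible intervals; from the table above, November is the only such month:
est_final %>% filter(pct < low | pct > high) # Months whose raw rate lies outside its interval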
In conclusion, the prior distributions served well as a starting point for estimating how the data were distributed. Empirical Bayes provided generally good estimates of the monthly percentages. Improvements might be found, however, by including information on weather or by employing a mixture model.