Invite-to-quote rate is a key metric for Thumbtack’s business, but its trend is confounded by other variables like category and location. In this analysis, I remove the impact of these covariates with a logistic regression, then analyze the invite-to-quote rate series over time. I found that the adjusted series gradually increased for the first 42 days, then dropped sharply afterwards, which suggests that some anomaly may have occurred in the product, causing this site-wide shift in quoting behavior. My code is on GitHub.
After reading the tables from the SQLite file, I join them by request_id and invite_id to get a dataset with 24622 rows, which is the number of invites in the database. I exclude requests with zero invites from this analysis, since they are not relevant to the invite-to-quote rate.
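A minimal sketch of this step, assuming the SQLite file contains tables named requests, invites, and quotes (the file name and table names are illustrative, not taken from the original database):

library(DBI)
library(RSQLite)
library(dplyr)

con      <- dbConnect(SQLite(), "thumbtack.sqlite")   # placeholder file name
requests <- dbReadTable(con, "requests")              # table names are assumptions
invites  <- dbReadTable(con, "invites")
quotes   <- dbReadTable(con, "quotes")
dbDisconnect(con)

# One row per invite; quote_id is NA when the invite never resulted in a quote
allData <- invites %>%
  inner_join(requests, by = "request_id") %>%
  left_join(quotes, by = "invite_id")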
Then I create a dummy variable quo_dum indicating whether an invite resulted in a quote, and extract the request date req_date from the timestamp. Over the period from 2013-07-01 to 2013-08-31, the average quote rate is 0.520632. At the request level, each request results in 5.005 invites on average, and the unweighted mean quote rate is 0.5196.
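A sketch of these two derived columns; the name of the timestamp column is an assumption, and depending on how the timestamp is stored, as.Date() may need an origin or format argument:

# Quote indicator: 1 if the invite resulted in a quote, 0 otherwise
allData$quo_dum  <- as.numeric(!is.na(allData$quote_id))
# Request date extracted from the request timestamp (column name is an assumption)
allData$req_date <- as.Date(allData$request_timestamp)

mean(allData$quo_dum)   # overall invite-to-quote rate, about 0.52 here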
I then compute the daily quote rate. From the plot, the quote rate shows a moderate increasing trend and levels off near the end of the period.
# Compute daily invite-to-quote rate, ignoring category and location
daily_quoR <- allData %>%
  group_by(req_date) %>%
  summarise( quo_rate = sum(!is.na(quote_id)) / length(invite_id) )
daily_quoR <- as.data.frame(daily_quoR)
daily_quoR <- daily_quoR[order(daily_quoR$req_date), ]
plot(daily_quoR$quo_rate, type = 'l', xlab = 'Time', ylab = 'Invite-to-Quote Rate',
     main = 'Daily Invite-to-Quote Rates')
Next, I compute the quote rate for each category and location. Across categories, the quote rate ranges from 0.08069164 to 0.94130435; across locations, it ranges from 0.3207547 to 0.6607143. In the tables below, n is the number of requests.
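The per-group computation is not shown above, so here is a minimal sketch, assuming columns named category and request_id as printed in the tables (the location table is built the same way):

# Quote rate and request count by category
cat_quoR <- allData %>%
  group_by(category) %>%
  summarise( quo_rate = sum(!is.na(quote_id)) / length(invite_id),
             n        = n_distinct(request_id) ) %>%
  arrange(desc(quo_rate))
cat_quoR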
## Source: local data frame [113 x 3]
##
## category quo_rate n
## (chr) (dbl) (int)
## 1 Personal Training 0.9413043 93
## 2 Lawncare 0.9361702 37
## 3 Personal Chef Services 0.8996416 50
## 4 Balloon Artistry 0.8764535 124
## 5 Architectural Services 0.8461538 20
## 6 Wedding Planning 0.8291317 67
## 7 Boudoir Photography 0.8235294 35
## 8 Karaoke 0.8235294 8
## 9 Upholstery and Furniture Cleaning 0.8163265 13
## 10 Exterior Painting 0.8115942 15
## .. ... ... ...
## Source: local data frame [100 x 3]
##
## location quo_rate n
## (chr) (dbl) (int)
## 1 Albany-Schenectady-Troy, NY 0.6607143 20
## 2 Raleigh, NC 0.6575342 13
## 3 Baltimore-Columbia-Towson, MD 0.6533333 45
## 4 Kansas City, MO-KS 0.6382979 24
## 5 Pittsburgh, PA 0.6349206 39
## 6 Des Moines-West Des Moines, IA 0.6341463 9
## 7 Indianapolis-Carmel-Anderson, IN 0.6326531 31
## 8 San Jose-Sunnyvale-Santa Clara, CA 0.6225166 32
## 9 Cleveland-Elyria, OH 0.6197183 35
## 10 Bridgeport-Stamford-Norwalk, CT 0.6190476 15
## .. ... ... ...
As I mentioned, the trend of the quote rate is confounded by other variables like category and location. The following two stacked area plots show the “quote rate contribution” of each category and location over time (the number of quotes from each class divided by the number of invites that day). Although they look messy, we can still see that each category and location has a different impact on the quote rate, and that these impacts change over time. We want to remove these impacts to get a better picture of the product itself. I also notice that in the quote rate by location plot, the brown area in the middle contributes the most at all times. This brown area represents the quotes of New York-Newark-Jersey City, NY-NJ-PA, which has the most invites (4939) and the most quotes (2499) in our dataset.
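A sketch of how the location plot can be built with ggplot2; the column name location is an assumption, and the category plot is analogous:

# "Quote rate contribution" of each location per day:
# quotes from that location / total invites sent that day
library(ggplot2)

loc_contrib <- allData %>%
  group_by(req_date) %>%
  mutate(day_invites = length(invite_id)) %>%
  group_by(req_date, location) %>%
  summarise( contribution = sum(!is.na(quote_id)) / first(day_invites) )

ggplot(loc_contrib, aes(x = req_date, y = contribution, fill = location)) +
  geom_area() +                          # areas stack, so they sum to the daily quote rate
  theme(legend.position = "none") +      # 100 locations make the legend unreadable
  labs(x = "Time", y = "Invite-to-Quote Rate", title = "Daily Quote Rate by Location")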
In this part, I remove the impact of location and category. I use category_id, location_id and their interactions as the predictors and the quoting indicator quo_dum as the dependent variable of a logistic regression model, then take the residuals as the adjusted series.
Since the predictors are categorical variables with many levels, I first create the design matrix by dummy coding and convert it to a sparse matrix for computational efficiency. The design matrix has 24622 rows and 11300 columns.
# Design matrix: dummy-code category, location and their interactions (no intercept)
library(Matrix)   # provides sparse.model.matrix()

mod_data <- allData[, c("invite_id", "quo_dum", "category_id", "location_id")]
mod_data <- as.data.frame(lapply(mod_data, as.factor))   # keep every column as a factor
X <- sparse.model.matrix(~ category_id*location_id - 1, mod_data)
y <- as.factor(mod_data$quo_dum)
Next, I fit a logistic regression model. Since the number of predictors is very large and they are highly correlated because of the interactions, I use elastic net regularization, which learns a sparse model where only a few of the weights are non-zero, like the lasso, while still keeping the regularization properties of ridge regression.
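Concretely, with the quoting indicator \(y_i\in\{0,1\}\) and dummy-coded predictors \(x_i\), the binomial elastic net in glmnet minimizes the penalized negative log-likelihood
\[
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\, x_i^{\top}\beta - \log\big(1+e^{x_i^{\top}\beta}\big) \Big] \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right],
\]
where \(\alpha=1\) gives the pure lasso penalty and \(\alpha=0\) gives ridge; the intercept is omitted here to match the intercept=F setting used below.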
To determine the values of \(\alpha\) and \(\lambda\) for the elastic net, I perform 20-fold cross-validation and use AUC as the measure of model performance. To speed up the computation, I fit the folds in parallel with foreach.
# Choose alpha by 20-fold cross-validation, maximizing AUC
library(glmnet)
library(doParallel)   # also attaches parallel and foreach

out_compare <- NULL
foldid <- sample(1:20, size = length(y), replace = TRUE)

# Run the folds of each cv.glmnet call in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
for (alpha in seq(0, 1, by = 0.1)) {
  cvfit <- cv.glmnet(X, y, family = "binomial", type.measure = "auc",
                     foldid = foldid, alpha = alpha, intercept = F, parallel = TRUE)
  out_compare <- rbind(out_compare, c(alpha, cvfit$lambda.min, max(cvfit$cvm)))
}
colnames(out_compare) <- c("alpha", "lambda.min", "cvm")
stopCluster(cl)
print(out_compare)
## alpha lambda.min cvm
## [1,] 0.0 43.992813746 0.7072362
## [2,] 0.1 0.047172052 0.7243424
## [3,] 0.2 0.023586026 0.7239566
## [4,] 0.3 0.018939635 0.7241591
## [5,] 0.4 0.012942817 0.7243655
## [6,] 0.5 0.010354254 0.7243045
## [7,] 0.6 0.008628545 0.7241687
## [8,] 0.7 0.007395896 0.7242055
## [9,] 0.8 0.005896507 0.7242834
## [10,] 0.9 0.005241339 0.7242481
## [11,] 1.0 0.005177127 0.7243126
# Fit model with chosen alpha and lambda, generate probabilities
alpha_fin <- out_compare[which.max(out_compare[,3]),1]
lambda_fin <- out_compare[which.max(out_compare[,3]),2]
fit <- glmnet(X, y, family = "binomial", alpha = alpha_fin, intercept=F)
yhat <- predict(fit, newx = X, s = lambda_fin ,type = "response")
colnames(yhat) <- "prob"
From the table above, I choose \(\alpha=\) 0.4 and \(\lambda=\) 0.0129428, which maximize the AUC. Finally, I fit the regression model on all the data with the chosen parameters and predict the probability of quoting for each row, which is the estimated invite-to-quote rate. This model explains 0.1288969 of the null deviance.
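The explained-deviance figure can be read off the fitted glmnet object; a short sketch, assuming lambda_fin lies on (or close to) the fitted lambda path:

# Fraction of the null deviance explained at (approximately) the chosen lambda
idx <- which.min(abs(fit$lambda - lambda_fin))
fit$dev.ratio[idx]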
After getting the residuals from the logistic regression model, let’s plot the adjusted quote rate series. From the following graph, we can see that the adjusted quote rate oscillated around the same level for the first 40 days, then dropped afterwards.
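The exact construction of the adjusted series and of the daily_ts object used below is not shown in the output, so here is a minimal sketch of one reasonable construction: the daily mean residual, turned into a ts with weekly frequency 7, which matches the breakdate notation such as 6(7) in the BFAST output.

# Residuals = observed quoting indicator minus fitted probability
allData$resid <- allData$quo_dum - yhat[, "prob"]

# Aggregate residuals by request date to get the adjusted daily series
daily_adj <- allData %>%
  group_by(req_date) %>%
  summarise( adj_rate = mean(resid) ) %>%
  arrange(req_date)

plot(daily_adj$req_date, daily_adj$adj_rate, type = 'l',
     xlab = 'Time', ylab = 'Adjusted Quote Rate', main = 'Adjusted Daily Series')

# Weekly frequency so that bfast() can estimate a seasonal component
daily_ts <- ts(daily_adj$adj_rate, frequency = 7)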
To learn more about the change point in the adjusted series, I apply the BFAST method to the data. BFAST (Breaks For Additive Season and Trend) integrates the decomposition of a time series into trend, season, and remainder components with methods for detecting and characterizing change within the series. First, it decomposes the series into trend and seasonal components with the STL method, then it runs an OLS-MOSUM test on each component to check whether there is a significant break point. Next, BFAST fits the two components and the detected break points with linear regression. It iteratively estimates the time and number of changes, and characterizes each change by its magnitude and direction, until the number and position of the breakpoints no longer change.
From the following results, we can see that the change point is the 42nd day (2013-08-11), with a 95% confidence interval of days 40-48 (2013-08-09 to 2013-08-17). Only one break point is detected in the trend of the series, and there is no break point in the seasonal component.
# Change point detection in the trend and seasonal components
library(bfast)
bpfit <- bfast(daily_ts, h = .25, season = "harmonic", max.iter = 100)
bpfit
##
## TREND BREAKPOINTS
## Confidence intervals for breakpoints
## of optimal 2-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp.Vt, het.err = FALSE)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 40 42 48
##
## Corresponding to breakdates:
## 2.5 % breakpoints 97.5 %
## 1 6(5) 6(7) 7(6)
##
## SEASONAL BREAKPOINTS: None
Lastly, the ACF and PACF of the noise component and the following Ljung-Box test show that the remainder is a white noise series.
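A sketch of these diagnostics, assuming the remainder can be pulled from the last iteration of the bfast output as Nt; the lag choice in the Ljung-Box test is illustrative and need not match the call that produced the output below:

# Pull the remainder (noise) component from the final BFAST iteration
remainder <- bpfit$output[[length(bpfit$output)]]$Nt

acf(remainder)
pacf(remainder)

# Ljung-Box test for residual autocorrelation (lag choice is illustrative)
Box.test(remainder, lag = 14, type = "Ljung-Box")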
##
## Box-Ljung test
##
## data: remainder
## X-squared = 14.002, df = 12.4, p-value = 0.3293
Besides category and location, it is natural to guess that individual requesters and providers affect the invite-to-quote rate as well. For example, a provider who was on vacation during this period would not send any quotes. Therefore, I add req_uid and prov_uid to the predictors and fit the previous model again. The design matrix now contains 17211 columns.
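A sketch of the expanded design matrix; whether req_uid and prov_uid enter as main effects only (rather than in interactions) is my assumption:

# Add requester and provider IDs to the predictors (main effects only, an assumption)
mod_data2 <- allData[, c("quo_dum", "category_id", "location_id", "req_uid", "prov_uid")]
mod_data2 <- as.data.frame(lapply(mod_data2, as.factor))
X2 <- sparse.model.matrix(~ category_id*location_id + req_uid + prov_uid - 1, mod_data2)

# The cross-validation and glmnet fit are then repeated exactly as before with X2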
The explained deviance ratio is 0.2133727, higher than in the previous model. The adjusted quote rate series looks similar to the previous one, but the 95% confidence interval of the estimated break point narrows from days 40-48 to days 40-46. That’s an improvement! Finally, the ACF and PACF of the noise component and the Ljung-Box test again show that the remainder is a white noise series, suggesting the model is valid.
##
## TREND BREAKPOINTS
## Confidence intervals for breakpoints
## of optimal 2-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp.Vt, het.err = FALSE)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 40 42 46
##
## Corresponding to breakdates:
## 2.5 % breakpoints 97.5 %
## 1 6(5) 6(7) 7(4)
##
## SEASONAL BREAKPOINTS: None
To analyze the site-wide shift in quoting behavior, I remove the impact of category, location and users by fitting a logistic regression model with an elastic net penalty. Using the BFAST method, I detect a change point on 2013-08-11, with a 95% CI from 2013-08-09 to 2013-08-15. The quote rate dropped suddenly after the change point, which suggests that some anomaly may have occurred in the product before or during the period from 2013-08-09 to 2013-08-15.
To further improve the model, we could also model the time interval from invite to quote, which may give us more clues about what caused the product change.
All the code I used can be found here.