The Door-to-Door Caravan Sampling Problem

Michael Andre Abellon | Nico Angelo Menodiado

2023-07-07

Introduction

Globe is considering to hire a marketing agency to do a door-to-door sampling caravan project to give out free prepaid SIM cards to households. The objective is to increase the subscriber base of the company while obtaining a net profit from this project, by hoping that the free SIM card samples will eventually be activated in the future and revenue will flow from the activated SIMs.

As such, for this case study, we utilized the PSA Barangay Census data to analyze and determine the feasibility of conducting the door-to-door caravan project.

Sampling Strategy Assumptions

The marketing agency proposes to do the following sampling system:

  • Ten samplers will be deployed per day to a specific barangay. These samplers are paid minimum wage by the marketing agency. A limit of one barangay per day is to be visited by the team due to capacity and planning restrictions.

  • A work day is 8 hours from 8am to 5pm. From TGT, the ten samplers will be assembled and at 8am will leave BGC to travel to the selected barangay for that day as a Globe Door-to-Door Caravan. They have to be back by 5pm at TGT. Lunch break is from 12-1pm.

  • When at the Barangay for the day, the ten samplers will start walking around the Barangay knocking on houses. Assume that the ten samplers go to different households in different streets within a Barangay within a day.

  • When a sampler calls on a house and it is answered, the standard spiel to be said is: “Good morning(afternoon) po, andito po ako ngayon para magbigay ng libreng prepaid SIM from Globe. May libre po siyang PhP 50.00 load for call, text at saka data. Interested po ba sila?” If the one that answered the door is interested, the sampler hands a free SIM card. Additionally, the sampler asks for a signature indicating the SIM has been given away. Maximum of one SIM card per house.

  • The interaction with a house is composed of 3 parts: 1) Waiting to Answer the Door which usually takes on average 20 seconds, 2) Spiel Talk Time and 3) Signing time which takes around 10 seconds.

  • If not interested or there is no answer after 30 seconds, then the sampler will move to the next household. Assume that the Sampler walks on average 4 km/hr. Only 10% of the houses don’t answer.

  • If the sampler was able to provide a SIM to the household, that is considered as a hit. Usually the success rate of the Samplers claimed by the Marketing Agency is 75%. Which means that 7.5 out of 10 households that answered the door, have accepted the free SIM.

  • Only 25% of the SIMs given away are activated after the sampling and on average will provide a revenue return of PhP 200.00 per year. The 25% value is called conversion rate, and is estimated by the Marketing Agency.

  • During the entire sampling project, assume that the Caravan has enough SIMs for the day for all 10 samplers.

  • A SIM card costs PhP 40.00 to manufacture which includes the initial cost of the PhP 50 credit load.

  • The agency is proposing a PhP 5,000,000 contract with Globe to do the 100 days of sampling which includes the salaries of the ten samplers, as well as the company’s van and a driver. (Globe’s estimates that the agency spends 10,000 per day for the van’s utilization which includes gas, van rental and driver salary). Note further that the PhP 5,000,000 contract does not include the costs of the SIM cards.

Algorithms

1. Loading the dataset

To begin, the main dataset was first loaded which contained the different attributes of the possible barangays where Globe could implement the door-to-door caravan project. Additionally, the longitude and latitude of Globe’s headquarters (i.e., TGT) were defined as this will serve as the starting point of the caravan for all days of the project, which will be needed to calculate travel time later on.

library(geosphere)
dat = read.csv('h2h.csv')
tgt = c(121.0499,14.5535)
dat = na.omit(dat)

2. Defining the main function

In order to compute for the potential hits (i.e., when a sampler is able to provide a SIM to a household), the ‘comp’ function was defined which contains multiple sections of calculations that will eventually result in a forecast of each barangay’s number of potential hits. The inputs of the function are estimates or assumptions that the project team will have to make regarding various aspects of the project:

  • answerRate: Rate that the household will answer the door when a sampler calls on them
  • acceptRate: Rate that a household will accept the SIM provided to them
  • conversionRate: Rate that the SIMs given away are activated after the sampling
  • numofsamplers: Number of samplers that will be deployed for the project
  • numofdays: Duration of the door-to-door caravan project (in days)
comp <- function(answerRate = 0.9, acceptRate = 0.75, conversionRate = 0.25, numofsamplers = 10, numofdays = 100){

2.1. Computing for travel time

In order to compute the travel time from TGT, the ‘distm’ function under the geosphere package was used which computed for the distance between TGT and each barangay using their respective longitude and latitude coordinates found in the dataset. Travel time was then calculated by dividing the result by the assumed speed of the caravan (e.g., For this case, the van was assumed to travel at a speed of 60 kph).

# Getting the travel distance and travel time from TGT
    for(i in 1:dim(dat)[1]){
        latlong = c((dat$long[i]), (dat$lat[i]))
        p = matrix(c(tgt, latlong), nrow = 2)
        #d = dist(p, fun = distGeo)
        x = distm(x = tgt, y = latlong, fun = distGeo)
        dat$distancefromtgt[i] <<- x
        dat$traveltimeh[i] <<- (x/1000)/60
    }

2.2. Computing for house-to-house walk time

The second section of the function aims to calculate the time it would take a sampler to walk from house to house within each barangay.

# Getting the time it takes to go from house to house
    for(i in 1:dim(dat)[1]){
        x = sqrt(dat$AreaBarangay[i]/dat$NHouseholds[i])
        dat$housedim[i] <<- x
        dat$timetocover[i] <<- (x/4)
    }

2.3. Computing for number of houses that can be covered

Given the new attributes created on the traveltimeh and timetocover for each barangay, the third section of the function now computes for the maximum number of households that can be visited for each barangay.

# Getting the maximum number of households that can be visited
    hit = 60/3600
    rej = 50/3600
    dnan = 30/3600
    hitRate = answerRate * acceptRate
    rejRate = answerRate * (1-acceptRate)
    dnanRate = 1 - answerRate
    
    for(i in 1:dim(dat)[1]){
        x = (8-(2*dat$traveltimeh[i])) / (dat$timetocover[i]+(hitRate*hit)+(rejRate*rej)+(dnanRate*dnan))
        if ((x*numofsamplers)>dat$NHouseholds[i]){
            dat$coveredhouse[i] <<- dat$NHouseholds[i]
        }
        else {
            dat$coveredhouse[i] <<- x*numofsamplers
        }
    }

2.4. Computing for potential hits

The number of potential hits for each barangay can then be calculated by multiplying the number of houses that can be covered by the hit rate.

# Getting the number of potential hits
    dat$potentialhits <<- dat$coveredhouse*hitRate

2.5. Computing for sampler work breakdown

In the fifth section of the function, for each barangay, we computed the percentage that a sampler’s time is spent on traveling, walking, a hit interaction, a rejection interaction, and a ‘did not answer’ interaction as these values will be needed for analysis later on.

    dat$pertraveltime <<- ((2*dat$traveltimeh)/9)*100
    dat$perhittime <<- (((dat$potentialhits/numofsamplers) / 60)/9 )*100
    dat$perrejtime <<- ((dat$coveredhouse*rejRate/numofsamplers)*rej)/9*100
    dat$perdnantime <<- ((dat$coveredhouse*dnanRate/numofsamplers)*dnan)/9*100
    dat$perwalktime <<- ((dat$coveredhouse*dat$timetocover/numofsamplers)/9)*100
    
    for(i in 1:dim(dat)[1]){
        dat$perlunchtime[i] <<- 1/9
    }

2.6. Sorting the data by potential hits

The data is ordered by decreasing number of potential hits and subsetted to only include the top 100 to determine which 100 barangays to focus on for each day of the door-to-door caravan project.

    # Sorting top 100 of potential hits
    s <<- dat[order(dat$potentialhits, decreasing = TRUE),]
    s <<- head(s, numofdays)

2.7. Computing for profit of the project

Finally, the potential revenue and cost of the project was calculated in order to come up with the overall earnings, earn, of the project.

    potentialsimcard <<- sum(s$potentialhits)*conversionRate*200
    earn <<- potentialsimcard - (sum(s$potentialhits)*40)
}

comp()

Results

1. If Globe is to do the planning, which Barangays are to be visited to maximize the hits per day for all l00 days?

Based on the bar graph below, the top 5 barangays to be visited in order to maximize the number of hits are Barangay 649, Barangay 183, Addition Hills, Barangay 20, and Barangay 201.

library(ggplot2)
library(treemap)
library(treemapify)

par(mai=c(1,2,1,0.5))
barplot(sort(s$potentialhits[1:20],decreasing=FALSE),names.arg=s$Barangay[(20:1)],xlab='Number of Potential Hits',
        ylab='',col='dodgerblue',border='blue',main='Number of Hits of Top 20 Barangays',
        las=1,cex.names=0.8,xlim=c(3000,max(s$potentialhits[1:10])),horiz = TRUE,xpd=FALSE)
title(ylab="Barangay", mgp=c(9,1,0))

In order to view list of all 100 barangays to be visited for the 100 days of the project, the following code may also be run.

s$Barangay

2. How are the hits calculated/estimated? How many hits can be realistically obtained for 100 days of sampling? What is the breakdown of these hits by City? By Region? By Barangay Size? Can the hits reach 500,000? Can we go to more Barangays to get more hits?

Based on the calculations generated, a total of 299,173.8 hits can be realistically obtained for 100 days of sampling.

sum(s$potentialhits)
## [1] 299173.8

The code below generates the treemap plot that will display the breakdown of these 299,173.8 hits by Region.

#By Region
for (i in 1:nrow(s)) {
    if (s[i,'Region'] == 'NATIONAL CAPITAL REGION') {
        s[i,'Region'] = s[i,'Region']
    }
    else {
        s[i,'Region'] = 'REGION IV-A CALABARZON'
    }
}

ggplot(s, aes(area = potentialhits, fill = Region, label = paste(Barangay,round(potentialhits,3),sep="\n"),
              subgroup = Region)) +
    geom_treemap(layout='squarified') +
    geom_treemap_text(colour = "white", place = "center", size = 12) +
    labs(title="Top 100 Barangays (in Potential Hits) per Region") +
    theme(plot.title = element_text(face='bold'))

Based on the treemap plot, we can see that a majority (i.e., around 90%) of the top 100 barangays in terms of most potential hits belong to the National Capital Region, while the remaining belong to Region IV-A CALABARZON.

Next, the code below generates the treemap plot of the breakdown of the total hits by City.

s$ProvinceCity = ifelse(startsWith(s$ProvinceCity,'CITY OF LAS PI'),'CITY OF LAS PINAS',s$ProvinceCity)

ggplot(s, aes(area = potentialhits, fill = ProvinceCity, label = paste(Barangay,round(potentialhits,3),sep="\n"),
              subgroup = ProvinceCity)) +
    geom_treemap(layout='squarified') +
    geom_treemap_text(colour = "white", place = "center", size = 12) +
    labs(title="Top 100 Barangays (in Potential Hits) per City/Province") +
    theme(plot.title = element_text(face='bold'))

In line with the treemap of the hits by Region, most of the barangays in the top 100 in terms of most potential hits belong to cities that are located in NCR. Notably, the cities that contributed the most number of hits were Caloocan City, Quezon City, and City of Makati.

Finally, the code below generates the treemap plot of the breakdown of the total hits by Barangay Size.

ggplot(s, aes(area = potentialhits, fill = Size, label = paste(Barangay,round(potentialhits,3),sep="\n"),
              subgroup = Size)) +
    geom_treemap(layout='squarified') +
    geom_treemap_text(colour = "white", place = "center", size = 12) +
    labs(title="Top 100 Barangays (in Potential Hits) per Barangay Size") +
    theme(plot.title = element_text(face='bold'))

Based on the treemap plot, all of the barangays in the top 100 in potential hits have a Large barangay size. This signifies that it is more beneficial for the door-to-door caravan to travel to large barangays for the entire duration of the project.

sub = dat[dat$potentialhits > 0,]
sum(sub$potentialhits)
## [1] 5328986

With all these being said, we can see from the value above that we can realistically reach and exceed 500,000 hits, but this would require for the number of days of the project to be increased in order to accommodate more barangays that can be visited.

3. What is the hourly breakdown/utilization of the samplers? How much is interacting with a potential hit? How much is travelling by van? How much is walking? What sampling strategy changes can be implemented to increase the number of hits by the samplers?

The following code produces the treemap plot of the hourly breakdown or utilization of the samplers.

activity = c('Travel Time','Hit Time','Rej Time','DNAn Time','Walk Time')
activity_per = c(mean(s$pertraveltime),mean(s$perhittime),mean(s$perrejtime),mean(s$perdnantime),mean(s$perwalktime))
util_bdown = data.frame(cbind(activity,activity_per))
util_bdown$activity_per = as.numeric(util_bdown$activity_per)

ggplot(util_bdown, aes(area = activity_per, fill = activity, label = paste(activity,round(activity_per,3),sep="\n"))) +
    geom_treemap() +
    geom_treemap_text(colour = "white", place = "center", grow=TRUE) +
    labs(title="Sampler Utilization Breakdown (in %)") +
    theme(plot.title = element_text(face='bold'))

According to the graphic, around 55% of a sampler’s work hour is spent on interacting with a potential hit, which is reasonable as not much time is spent on interacting with a rejection or other extraneous activities in comparison.

However, improvements can still be made in the sampling strategy in order to reduce the 10% walk time estimate. The main recommendation here is to target barangays with higher household density as this will minimize walk time, while maximizing hit time in the process.

4. Is it worthwhile to go to provincial barangays outside Metro Manila if the sampling strategy is to maximize sampling reach?

Determining the value of going to provincial barangays for maximizing sampling reach can be answered by plotting the potential hits for each region versus their average distance from The Globe Tower.

The first dataframe variable grpdist refers to the aggregate mean of the barangays distance from TGT grouped together per region which generalizes the distance between TGT and a certain barangay in a region. Next, the variable grphits refers to the aggregate sum of potential hits per barangay grouped together per region. These two variables can be plotted in order to see the trend between potential hits and distance of each region. The resulting plot will look something like:

grpdist = aggregate(dat$distancefromtgt, list(dat$Region), mean)
grphits = aggregate(dat$potentialhits, list(dat$Region), sum)

plot(grpdist$x, grphits$x, pch = 19, col = factor(grpdist$Group.1),
     xlab = "Distance from TGT (m)",
     ylab = "Potential Hits",
     lwd=8,
     main = "Potential Hits vs Distance from TGT")
legend("bottomleft",
       #cex=0.8,
       legend = unique(grpdist$Group.1),
       col = factor(grpdist$Group.1),
       pch = 19)
abline(h=0, col = "red")

The results show that out of all the regions, only National Capital Region, Region IV-A and Region III are the only worthwhile regions to launch the house-to-house campaign. This can be attributed to their proximity from TGT which greatly influences the potential hits due to the samplers spending more time going around the barangay rather than traveling to and from the barangay.

5. What would be a good target market profile to hand out SIMs in terms of demographics?

One way to understand the potential customers is to be able to learn the demographics of the ‘profitable’ barangays. This can be explored by looking at the household density and distance of the barangays from The Globe tower.

t = data.frame()
t = s
t$householddensity = t$NHouseholds/t$AreaBarangay

Household density for each barangay can be computed as the number of households per unit area of the barangay. This determines how far each household would be from one another which could affect the potential hits of a barangay. This is also a good socioeconomic indicator for the potential market of the campaign. Low household density often indicates wealthier communities as households are farther apart and might have detached houses. On the other hand, high household density would indicate poorer communities as households would have been right next to each other or even sometimes on top of each other. Plotting the household density versus distance to/from TGT would give a glimpse to how the market profiles would look like:

plot(t$distancefromtgt, t$householddensity, pch = 19, col = factor(t$Region),
     xlab = "Distance from TGT (m)",
     ylab = "Household Density",
     ylim = c(0,50000),
     lwd = 8,
     main = "Household Density vs Distance from TGT")
legend("topright",
       legend = unique(t$Region),
       col = factor(t$Region),
       pch = 19)

The results show that the top target barangays are often those that are near from TGT and exhibits high household density. Upon closer inspection, most barangays are from the slums of Metro Manila while some are from the barangays of Rizal, Cavity, and Laguna.

6. Given that the acceptance rates and conversion rates are based on the Agency’s estimates, how sensitive is this to changes? If the acceptance or conversion rates change, how will the number of hits change? Will these sensitivities have an effect on the feasibility of the project?

In order to determine sensitivity of the results with regards to the acceptance and conversion rates, different values are tested as parameters to the comp() function created at the beginning of this report. This was achieved by creating a function elastic() which takes in three arguments for the starting activation, answer, and activation rates. A sequence was created based on the input arguments with increments of 0.1 and an upper limit of 1. These parameters are passed through the comp() function and the resulting computations are saved on a separate data frame e. This function, in summary, simulate the effect of changing arguments of the comp() function with regards to how the acceptance, answer, and activation rates often change in the real world.

elastic <- function(ac = 0.75, an = 0.90, acti = 0.25){
  
  x <<- seq(ac, 1, 0.1)
  y <<- seq(an, 1, 0.1)
  z <<- seq(acti, 1, 0.1)
  pot = length(x) * length(y) * length(z)
  
  e = matrix(c(0,0,0,0),ncol = 4, nrow = pot)
  e = as.data.frame(e)
  c1 = 1
  
  for (i in x){
    for (h in y){
      for (j in z){
        comp(acceptRate = i, answerRate = h, conversionRate = j)
        e[c1,] = c(i, h, j, sum(s$potentialhits) )
        c1 = c1+1
       
      }
    }
  }
  
  e <<- e
  
}

elastic()

As for the visualization of the effect, three linear regression models were created based on the effect of acceptance, answer, and activation rates on the calculation of the potential hits. The data used for the models and this analysis comes from the pre-requisite calculation using elastic() function. With this, plotting the calculated potential hits versus coefficient values shows how each value for each coefficient affects the computed potential hits. The resulting plot looks something as follows:

plot(e[,1], e[,4], col = "red", pch = 4,
     xlab = "Coefficient Values",
     ylab = "Potential Hits",
     lwd = 8,
     main = "Effect of Varying Rates on the Potential Hits")
points(e[,2], e[,4], col = "green", pch = 2, lwd = 8)
points(e[,3], e[,4], col = "blue", pch = 0, lwd = 8)
mdl1 = lm(V4~V1, data = e)
mdl2 = lm(V4~V2, data = e)
mdl3 = lm(V4~V3, data = e)
legend("top",
       legend = c("Acceptance Rate", "Answer Rate", "Activation Rate"),
       col = c("red", "green", "blue"),
       pch = c(4,2,0)
)
abline(mdl1, col="red")
abline(mdl2, col="green")
abline(mdl3, col="blue")

The results of the plot shows that changing the activation rate does not influence the calculation of the potential hits as it exhibits a horizontal line when plotted. Answer rate, on the other hand, shows a positive relationship with the potential hits. Lastly, acceptance rate influences the calculated potential hits the most compared to the other variables.

7. What is the Market Feasibility, Technical Feasibility and Socio/Economic Feasibility of this Project? What is the return of investment of this endeavor? Is the PhP 5,000,000 a worthwhile investment? If yes, why? If no, what counter proposal would you provide the Marketing Agency?

sev <- function(samp = 20, day = 600){
  x <<- seq(10, samp, 5)
  y <<- seq(100, day, 100)
  pot = length(x) * length(y) 
  
  sv = matrix(c(0,0,0),ncol = 3, nrow = pot)
  sv = as.data.frame(sv)
  c1 = 1
  
  for (i in x){
    for (h in y){
      comp(numofsamplers = i, numofdays = h)
      sv[c1,] = c(i, h, earn)
      c1 = c1+1
    }
  }
  
  
  colnames(sv) = c("Number of Samplers", "Number of Days", "Total Estimated Earning")
  sv <<- sv
}
sev()

In order to determine the return on investment of the campaign, we can simulate different values of the number of samplers and the number of campaign days. This was done by creating a sev() function that reiterates through a sequence of values of number of samplers and number of days. By default, the functions starts with number of samplers equal to 10 with increments of 5 and number of campaign days starts at 100 with increments of 100. With this, the function takes in arguments which pertains to the maximum values of samplers and campaign days. The results for each iteration is then saved into a separate data frame. The resulting data frame is as follows:

print(sv)
##    Number of Samplers Number of Days Total Estimated Earning
## 1                  10            100                 2991738
## 2                  10            200                 5740337
## 3                  10            300                 8288176
## 4                  10            400                10547096
## 5                  10            500                12530967
## 6                  10            600                14254126
## 7                  15            100                 4437905
## 8                  15            200                 8397659
## 9                  15            300                11745563
## 10                 15            400                14458504
## 11                 15            500                16710421
## 12                 15            600                18614065
## 13                 20            100                 5817506
## 14                 20            200                10698193
## 15                 20            300                14374847
## 16                 20            400                17267951
## 17                 20            500                19587236
## 18                 20            600                21544466

The results show that at 10 samplers and 100 days and an investment of 5 million, the whole campaign would be operating at a loss. Increasing the number of samplers and number of days increases the potential profit to some varying degree. The breakeven conditions based on the tested arguments to the function shows that a counter proposal of double the number of samplers or campaign days are in order.

Conclusion

Overall, we recommend that Globe does not accept the current contract offer of Php 5,000,000 because as mentioned earlier, the company would not gain any profit from the door-to-door caravan contract at its current conditions. As a counter proposal, we suggest that the same samplers and number of days of the project be maintained but the contract offer would have to be lowered to Php 2,000,000. With a forecasted ROI of Php 2,991,738, both Globe and the Marketing Agency can earn a net profit from this offer value. In addition to this, an alternative counter proposal is to keep the same contract offer value but increase either the number of samplers and/or the number of days of the project.