Halo Effect vs. Cannibalization Effect
author: Soheil (Sol) Sadeghi
date: 6-17-2015
width: 1400
height: 700
autosize: true
Problem Overview
Promotion on specific item(s) for a specific period
- Example: 10% off on peanut butter in the first week of March
This promotion might also affect the sales of other items:
- Halo effect: increase in sales
- Cannibalization effect: decrease in sales
Goal: Detect the items with possible halo/cannibalization effect!
Probabilistic View of Classification
- H: Halo Set
- C: Cannibalization Set
- N: Neutral Set
Different Perceptions:
- Classification: each item belongs to one set
- Probabilistic: each item has a probability of belonging to each set
Probabilities to Ranking to Classification
Estimate the probability of each item belonging to each set
Rank all the items based on some criterion
The top items belong to H, the bottom items belong to C, the rest belongs to N
Questions
How to estimate the probabilities?
What criterion should be used to rank the items?
What thresholding should be used to separate the items?
Model Overview
Item level model (data points are items)
For each item, define: Response variable and features
Each promotion is treated as a query
Item ranking is developed within queries
- Each query has its own ordered list of items
Response Variable
Within each query, for each item
S: Total sale for the promotion period
ES: Expected sale based on historical data
S/ES: Normalized sale
- (S-ES)/ES is essentially the same concept
- ES/S is an alternative
Response Variable: 3 labels based on the values of the normalized sale (a labeling sketch follows this list)
- (5, 1, 0) for normalized sale: halo is of main interest
- (5, 1, 0) for reverse normalized sale: cannibalization is of main interest
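As a minimal sketch, assuming hypothetical cutoffs on the normalized sale (the actual thresholds are a modeling choice the slides leave open), the labeling might look like:

```python
# Minimal sketch of the (5, 1, 0) labeling from the normalized sale S/ES.
# HALO_CUT and CANNIB_CUT are hypothetical; the real thresholds are a
# tuning choice not specified in the slides.

HALO_CUT = 1.10    # assumed: normalized sale above this suggests halo
CANNIB_CUT = 0.90  # assumed: normalized sale below this suggests cannibalization

def label_item(sale: float, expected_sale: float) -> int:
    """Map an item's normalized sale S/ES to a relevance label.

    5 = halo (of main interest here), 1 = neutral, 0 = cannibalization.
    For the reverse normalized sale ES/S, swap the roles of 5 and 0.
    """
    normalized = sale / expected_sale
    if normalized > HALO_CUT:
        return 5
    if normalized < CANNIB_CUT:
        return 0
    return 1

print(label_item(sale=130.0, expected_sale=100.0))  # 5: likely halo
print(label_item(sale=80.0, expected_sale=100.0))   # 0: likely cannibalization
```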
Motivation for Features
- Consider 4 different baskets
- Basket 1: peanut butter, nutella, bread, milk
- Basket 2: bread, jelly, butter, milk
- Basket 3: cheese, nuts, tea, bread
- Basket 4: cheese, peanut butter, bread, milk
- Peanut butter matched with bread/milk twice, nutella/cheese once
- It has four common items with bread/milk, two common items with the rest
- (match, common) takes values in {(2,4), (1,2), (0,2)}
- The match variable has a direct effect on the halo effect, but only an indirect effect on the cannibalization effect (through the common variable)
Motivation for Features - continued
The (normalized.match, normalized.common) values are as follows (a short sketch computing normalized.match follows the list):
- Nutella: (1 , 2/3)
- Milk: (2/3 , 2/3)
- Bread: (1/2 , 1/2)
- Cheese: (1/2 , 2/5)
- Jelly: (0 , 2/3)
- Butter: (0 , 2/3)
- Nuts: (0 , 2/3)
- Tea: (0 , 2/3)
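A minimal sketch computing match and normalized.match from the four baskets above, where normalized.match = (baskets containing both the item and the promoted item) / (baskets containing the item); this reproduces the values listed. The common feature admits several similar definitions and is omitted here for brevity.

```python
# Baskets from the example above; the promoted item is peanut butter.
baskets = [
    {"peanut butter", "nutella", "bread", "milk"},
    {"bread", "jelly", "butter", "milk"},
    {"cheese", "nuts", "tea", "bread"},
    {"cheese", "peanut butter", "bread", "milk"},
]

def normalized_match(item: str, promoted: str, baskets) -> float:
    """Fraction of the item's baskets that also contain the promoted item."""
    with_item = [b for b in baskets if item in b]
    matches = sum(1 for b in with_item if promoted in b)
    return matches / len(with_item)

for item in ["nutella", "milk", "bread", "cheese", "jelly"]:
    print(item, normalized_match(item, "peanut butter", baskets))
# nutella 1.0, milk 0.667, bread 0.5, cheese 0.5, jelly 0.0 -- as on the slide
```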
Features - continued
For each query:
- time
- time-related features
- promotion-related features
- promoted item-related features
- ?
Ranking Approach
Using the reverse normalized sale and defining the response as follows:
- 5: The item with cannibalization effect
- 1: The item with neutral effect
- 0: The item with halo effect
The labels are motivated by the NDCG criterion, which is used by the Bing search engine
What is Normalized Discounted Cumulative Gain (NDCG)?
For an ordered query \(j\):
\[
\small
DCG_j = \sum_{i=1}^{n_j} \frac{2^{Y_i} - 1}{\log(i+1)}
\]
IDCG: the ideal DCG for a given query
\[
\small
NDCG_j = \frac{DCG_j}{IDCG_j}
\]
and the final NDCG for the whole dataset (a short sketch computing these quantities follows the formula):
\[
\small
NDCG_F = \frac{1}{N_Q} \sum_{j=1}^{N_Q} NDCG_j
\]
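A small sketch of these formulas, assuming the common base-2 logarithm (the slide leaves the base unspecified):

```python
import math

# `labels` is the list of relevance labels (5, 1, 0) in the order the
# model ranked the items within one query.

def dcg(labels):
    # Position i (0-based) is discounted by log2(i + 2), so the top item
    # divides by log2(2) = 1, matching the formula above.
    return sum((2**y - 1) / math.log2(i + 2) for i, y in enumerate(labels))

def ndcg(labels):
    ideal = dcg(sorted(labels, reverse=True))  # IDCG: best possible ordering
    return dcg(labels) / ideal if ideal > 0 else 0.0

# One query: a perfect ranking scores 1.0, a shuffled one scores less.
print(ndcg([5, 5, 1, 1, 0]))  # 1.0
print(ndcg([0, 1, 5, 1, 5]))  # < 1.0

# Final NDCG: average of per-query NDCG over all queries.
queries = [[5, 1, 0], [1, 5, 0, 1]]
print(sum(ndcg(q) for q in queries) / len(queries))
```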
Motivation for NDCG using Classification Learning
- Perfect classifications naturally result in perfect DCG scores
- DCG errors are bounded by classification errors
- Even an improvement of 1% NDCG is often considered significant
BUT
- DCG criterion is non-convex and non-smooth
Classification to Ranking to Classification
- First learn the class probabilities by some soft classification algorithm
- I am using the Boosting Tree Algorithm for learning class probabilities
- Then score each data point according to the Expected Relevance:
\[
S_i = \sum_{k=0,1,5} T(k) \hat{p}_{i,k}
\]
where \(T(k)\) is a monotone (increasing) function, e.g., \(T(k)=k\) or \(T(k)=2^k\)
Within each query, rank the items based on \(S_i\); a short sketch of this scoring step follows
Top items belong to C, bottom items belong to H, and the rest belongs to N
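A minimal sketch of this scoring step, using scikit-learn's GradientBoostingClassifier as a stand-in for the boosting tree classifier; `X` and `y` are placeholders for the item-level features and the (0, 1, 5) labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data: item-level features and (0, 1, 5) relevance labels.
X = np.random.rand(200, 4)
y = np.random.choice([0, 1, 5], size=200)

# Step 1: learn class probabilities with a soft classifier.
clf = GradientBoostingClassifier().fit(X, y)
proba = clf.predict_proba(X)  # columns ordered as clf.classes_ = [0, 1, 5]

# Step 2: expected relevance S_i = sum_k T(k) * p_hat[i, k], with T(k) = 2^k.
T = {k: 2**k for k in (0, 1, 5)}  # T(k) = k would also be a valid choice
weights = np.array([T[k] for k in clf.classes_])
scores = proba @ weights

# Step 3: within a query, rank items by score, descending.
ranking = np.argsort(-scores)
```

Sorting in descending order puts the items with the highest expected relevance (label 5, cannibalization under the reverse normalized sale) at the top, matching the "top items belong to C" rule.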
Stochastic Gradient Tree Boost Algorithm
- \(F_0(x) = \arg \min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)\)
- \(\text{For } m = 1 \text{ to } M \text{ do:}\)
- \(\{\pi(i)\}_1^N = \text{rand-perm}\{i\}_1^N\)
- \(\tilde{y}_{\pi(i)m} = - \biggl[ \frac{\partial L(y_{\pi(i)}, F(x_{\pi(i)}))}{\partial F(x_{\pi(i)})} \biggr]_{F(x)=F_{m-1}(x)}; \quad i=1,\ldots,\tilde{N}\)
- \(\{R_{lm}\}_{1}^{L} = L\text{-terminal node tree}(\{\tilde{y}_{\pi(i)m}, x_{\pi(i)}\}_1^{\tilde{N}})\)
- \(\rho_{lm} = \arg \min_{\rho} \sum_{x_{\pi(i)} \in R_{lm}} L(y_{\pi(i)}, F_{m-1}(x_{\pi(i)}) + \rho); \quad l=1,\ldots,L\)
- \(F_m(x) = F_{m-1}(x) + \nu \cdot \rho_{lm} 1(x \in R_{lm})\)
- \(\text{End For}\)
- \(\text{End Algorithm}\)
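As a rough illustration (not the exact implementation used here), the loop specializes neatly to squared-error loss, where the negative gradient is the residual and the optimal leaf value \(\rho_{lm}\) is the leaf mean; the sketch below uses scikit-learn's DecisionTreeRegressor as the L-terminal-node tree, with illustrative hyperparameters:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_gtb(X, y, M=100, subsample=0.5, nu=0.1, max_leaf_nodes=8):
    """Stochastic gradient tree boosting for L(y, F) = (y - F)^2 / 2."""
    base = y.mean()                  # F_0: constant minimizer of the loss
    F = np.full(len(y), base)
    trees = []
    rng = np.random.default_rng(0)
    for _ in range(M):
        # Random permutation; keep the first N~ = subsample * N points.
        idx = rng.permutation(len(y))[: int(subsample * len(y))]
        residuals = y[idx] - F[idx]  # pseudo-responses (negative gradient)
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X[idx], residuals)  # the L-terminal-node regression tree
        # For squared error, the fitted leaf means already equal rho_lm,
        # so the line-search step is implicit in the tree's predictions.
        F += nu * tree.predict(X)    # F_m = F_{m-1} + nu * rho_lm 1(x in R_lm)
        trees.append(tree)
    return base, trees

def predict(base, trees, X, nu=0.1):
    return base + nu * sum(t.predict(X) for t in trees)

# Toy usage on synthetic data.
X = np.random.rand(300, 3)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(300)
base, trees = stochastic_gtb(X, y)
```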
Computational Cost Improvement
Questions?
Thank You