Understanding Business Report: Canal+

Introduction

Our task was assigned by Canal+. Its objective is, broadly speaking, to propose an optimal, cost-effective algorithm for scheduling TV programs. The task relies on the following measures.

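library(magrittr)    # provides the %>% pipe
library(kableExtra)  # provides kbl() and kable_styling()

# Measures used in the task: abbreviation, full name, and description
df <- data.frame(
  Measure = c("AMR", "total AMR", "SHR"),
  `Full Name` = c("Average Minute Rating", "Total Average Minute Rating", "Audience Share"),
  Description = c("% of target group watching the program", "% of target group watching TV", "% of people watching the program out of people watching TV at that time"),
  check.names = FALSE  # keep the space in "Full Name"
)

# Render the measures as a styled HTML table
df %>%
  kbl() %>%
  kable_styling()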

Measure	Full Name	Description
AMR	Average Minute Rating	% of target group watching the program
total AMR	Total Average Minute Rating	% of target group watching TV
SHR	Audience Share	% of people watching the program out of people watching TV at that time

AMR is captured minute by minute, and the average over the program duration is calculated.

The model given by Canal+ can be summarized as:

\[ \widehat{AMR}(c)=f(metadata(c),performance\_history(c),schedule\_slot(c)) \] and our objective is to maximize viewership, i.e. select \(schedule\_slot(c)\) that maximizes:

\[ \sum_{c \in C} \widehat{AMR}(c) \] where \(C\) is the set of available content items \(c\). The problem is to find an algorithm that is faster than exhaustively evaluating all \(N!\) possible orderings of the \(N\) content items.
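
To give a sense of why exhaustive search is impractical, the sketch below counts the orderings a brute-force approach would have to evaluate. The values of N are illustrative and are not taken from the Canal+ task.

```r
# Number of possible orderings of N content items (brute-force search space).
n_contents <- c(5, 10, 20)            # illustrative sizes, not from the task
orderings  <- factorial(n_contents)   # N! for each N

print(data.frame(N = n_contents, orderings = orderings))
```

Even at 20 items the count exceeds 10^18, so any practical algorithm has to avoid enumerating orderings.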

Solution

Our proposed solution is to use an ML model to predict SHR and to put the content with the highest Predicted SHR into the best slot. It is important to state that this solution is a framework: it does not specify every detail, but provides a general idea of how to proceed. The details should be specified after consultations with SMEs and after obtaining real-life data for validation. This framework consists of 4 steps:

1. Find the best ML model for predicting SHR based on past content metadata

1.1. First, all available content metadata should be gathered. Metadata should not contain missing observations in any column, since these are hard to model. If some observations are missing, consider removing the column or cautiously choosing a data imputation method. One variable that should always be included in the model is time: the hour within the week when the content was presented.
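
A minimal sketch of this preparation step is shown here. The data frame and its column names (`genre`, `score`, `weekday`, `hour`) are hypothetical, and dropping incomplete columns is only one of the two options mentioned above.

```r
# Hypothetical metadata; column names and values are illustrative only.
metadata <- data.frame(
  title   = c("A", "B", "C", "D"),
  genre   = c("comedy", NA, "drama", "action"),
  score   = c(6.1, 7.3, 5.0, 5.8),
  weekday = c(1, 3, 5, 7),   # 1 = Monday ... 7 = Sunday
  hour    = c(20, 9, 23, 0)  # hour of day, 0-23
)

# Hour within the week: 0 = Monday 00:00, 167 = Sunday 23:00.
metadata$hour_of_week <- (metadata$weekday - 1) * 24 + metadata$hour

# Share of missing observations per column.
missing_share <- colMeans(is.na(metadata))
print(missing_share)

# Option used here: keep only columns without missing observations.
complete_cols <- metadata[, missing_share == 0, drop = FALSE]
```

Imputation would replace the last step; which option is appropriate depends on how much data is missing and should be agreed with SMEs.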

1.2. Then, the ML models to be validated should be chosen. We suggest using a variety of models in order to properly check different types of relationship between the dependent and independent variables (linear, e.g. Linear Regression or SVR with a linear kernel; non-linear, e.g. SVR with a sigmoid kernel or Random Forest). After choosing the model types, we should choose the set of parameters to be tested for each model (this depends on the choice of models). Last but not least, we should agree on a performance metric for evaluating the models. We suggest RMSE as a standard, but another metric may be chosen after consultations with SMEs.
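
The sketch below shows one way to organize candidate models and the RMSE metric. To stay within base R it uses two `lm()` variants as stand-ins; SVR or Random Forest fits from add-on packages could be added to the same list. The data, the column names, and the `candidates` list are assumptions made for illustration.

```r
# Root mean squared error between observed and predicted SHR.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Candidate models as a named list of fit functions (base R stand-ins only).
# Each takes a training data frame and returns a model usable with predict().
candidates <- list(
  linear    = function(train) lm(shr ~ hour_of_week + score, data = train),
  quadratic = function(train) lm(shr ~ poly(hour_of_week, 2) + score, data = train)
)

# Toy data, simulated for illustration only.
set.seed(1)
toy <- data.frame(
  hour_of_week = sample(0:167, 200, replace = TRUE),
  score        = rnorm(200, mean = 6, sd = 1.5)
)
toy$shr <- 5 + 0.02 * toy$hour_of_week + 0.5 * toy$score + rnorm(200)

# Fit every candidate and compute its in-sample RMSE.
fits <- lapply(candidates, function(fit) fit(toy))
in_sample_rmse <- sapply(fits, function(m) rmse(toy$shr, predict(m, toy)))
print(in_sample_rmse)
```

In-sample RMSE always favors the more flexible model, which is why the comparison should be made on held-out data.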

1.3. Then, K-fold stratified validation should be conducted. We suggest running it at least 10 times for each model (the more runs, the safer the estimate) with a 50/50 sample split stratified by SHR (stratification may be achieved by dividing the target variable into bins). Then, we should calculate the average of the chosen performance metric and see which model provided the best results. Depending on the number of model variations and the initial results, we may select a few of the best-performing variations and run the validation again, testing each variation more times than before. This should allow us to find the model and parameters that best fit our metadata.
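
The validation loop described in 1.3 can be sketched as repeated stratified 50/50 hold-out splits. Everything below is an assumption made for illustration: the simulated data, the two base R candidate models, the choice of 5 quantile bins, and the helper names `stratified_split` and `repeated_validation`.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# One stratified 50/50 split: bin the target, then sample half of each bin.
stratified_split <- function(y, n_bins = 5) {
  breaks <- quantile(y, probs = seq(0, 1, length.out = n_bins + 1))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  idx_by_bin <- split(seq_along(y), bins)
  unlist(lapply(idx_by_bin, function(idx) {
    idx[sample.int(length(idx), floor(length(idx) / 2))]
  }), use.names = FALSE)
}

# Average hold-out RMSE of one candidate over several random splits.
repeated_validation <- function(data, fit_fun, n_repeats = 10) {
  scores <- replicate(n_repeats, {
    train_idx <- stratified_split(data$shr)
    model <- fit_fun(data[train_idx, ])
    test <- data[-train_idx, ]
    rmse(test$shr, predict(model, test))
  })
  mean(scores)
}

# Toy data with a daily cycle in SHR, simulated for illustration only.
set.seed(42)
toy <- data.frame(hour_of_week = sample(0:167, 300, replace = TRUE))
toy$shr <- 10 + 5 * sin(2 * pi * toy$hour_of_week / 24) + rnorm(300)

candidates <- list(
  linear      = function(d) lm(shr ~ hour_of_week, data = d),
  daily_cycle = function(d) lm(shr ~ sin(2 * pi * hour_of_week / 24) +
                                     cos(2 * pi * hour_of_week / 24), data = d)
)

avg_rmse <- sapply(candidates, function(f) repeated_validation(toy, f))
best <- names(which.min(avg_rmse))
print(avg_rmse)
```

Re-running the best few candidates with a larger `n_repeats` corresponds to the second validation round suggested above.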

2. Schedule content that has to be shown live

2.1. Some content types are always shown live, for example news. If there is any such content, schedule it first to see which slots remain available for other content.
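
As an illustration of step 2.1, the sketch below blocks out live content on a weekly grid of one-hour slots and lists what remains free. The `live` table, its titles, and its start times are hypothetical.

```r
# Weekly grid of 168 one-hour slots; NA marks a free slot.
schedule <- rep(NA_character_, 168)

# Hypothetical live content: start as hour of week (0 = Monday 00:00)
# and duration in hours.
live <- data.frame(
  title    = c("news_mon_9", "news_mon_19", "match_sat"),
  start    = c(9, 19, 5 * 24 + 15),
  duration = c(1, 1, 2)
)

for (i in seq_len(nrow(live))) {
  # Convert the 0-based start hour to 1-based positions in `schedule`.
  slots <- live$start[i] + seq_len(live$duration[i])
  stopifnot(all(is.na(schedule[slots])))  # live items must not overlap
  schedule[slots] <- live$title[i]
}

# Hours of the week (0-based) still available for other content.
free_slots <- which(is.na(schedule)) - 1
print(length(free_slots))
```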

3. Select slot(s) with the highest Total AMR, calculate Predicted SHR, and put the content with the highest values into that slot. Repeat this step until all content is scheduled.

3.1. First, use historical data (consult with SMEs to select a meaningful sample - e.g. last year) to calculate the average Total AMR for each hour of the week. It should be calculated for different time windows, depending on the duration of content that you are presenting.
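
A possible implementation of step 3.1 is sketched below on simulated history. The 52-week sample, the evening peak, and the helper `window_avg` are assumptions; real data would replace the simulated `history` table.

```r
set.seed(7)

# Simulated history: 52 weeks x 168 hours of Total AMR (illustrative values).
history <- expand.grid(hour_of_week = 0:167, week = 1:52)
is_evening <- (history$hour_of_week %% 24) %in% 19:22
history$total_amr <- 10 + 8 * is_evening + rnorm(nrow(history))

# Average Total AMR for each hour of the week (1-hour windows).
avg_1h <- tapply(history$total_amr, history$hour_of_week, mean)

# Average Total AMR for windows of `len` consecutive hours,
# one value per possible start hour.
window_avg <- function(avg, len) {
  starts <- seq_len(length(avg) - len + 1)
  sapply(starts, function(s) mean(avg[s:(s + len - 1)]))
}

avg_2h <- window_avg(avg_1h, 2)

# Best start hour (0-based hour of week) for a 2-hour content item.
best_2h_start <- which.max(avg_2h) - 1
print(best_2h_start)
```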

3.2. Select the slot(s) that have the highest Total AMR for each time window. Then combine each content item's metadata with the time from the slot (in some cases the time might be different, depending on the content's duration) and run the ML model to obtain Predicted SHR for all content.

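3.3. Select the content with the highest Predicted SHR and put it into the selected slot. Since SHR is the share of TV viewers watching the program and Total AMR is the share of the target group watching TV, AMR is approximately the product of the two. This way, AMR is maximized: content with the highest SHR is put into the slots with the highest Total AMR, resulting in maximization of our content's AMR.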

3.4. Repeat steps 3.2 and 3.3 until all content is scheduled. The remaining slots should be monitored so that the remaining content can still be scheduled (e.g. there might be a 2h content item to schedule but only 1h slots left). In such a case, longer content should be scheduled first even though its Predicted SHR might be lower.
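
Steps 3.2 to 3.4 can be sketched as a greedy loop. This is a simplified illustration, not a full implementation: it always processes durations from longest to shortest, which is stricter than step 3.4 requires, and `predict_shr` is a hypothetical stand-in for the trained ML model.

```r
# Greedy scheduling sketch. `total_amr` holds the average Total AMR per slot,
# `content` needs a `duration` column (in slots), and `predict_shr(items, start)`
# returns one Predicted SHR per row of `items` for the given start slot.
greedy_schedule <- function(content, total_amr, predict_shr) {
  free <- rep(TRUE, length(total_amr))
  content$start <- NA_integer_
  # Simplification: longer content first, so that it still fits.
  for (len in sort(unique(content$duration), decreasing = TRUE)) {
    repeat {
      todo <- which(content$duration == len & is.na(content$start))
      if (length(todo) == 0) break
      # Start positions where `len` consecutive slots are still free.
      candidates <- seq_len(max(0, length(free) - len + 1))
      fits <- vapply(candidates, function(s) all(free[s:(s + len - 1)]), logical(1))
      starts <- candidates[fits]
      if (length(starts) == 0) break
      # Step 3.2: window with the highest average Total AMR.
      win_amr <- sapply(starts, function(s) mean(total_amr[s:(s + len - 1)]))
      s <- starts[which.max(win_amr)]
      # Step 3.3: content with the highest Predicted SHR goes into that window.
      shr <- predict_shr(content[todo, ], s)
      pick <- todo[which.max(shr)]
      content$start[pick] <- s
      free[s:(s + len - 1)] <- FALSE
    }
  }
  content
}

# Toy example: 6 slots, one 2-slot item and two 1-slot items.
total_amr <- c(2, 9, 8, 3, 7, 1)
content <- data.frame(title = c("A", "B", "C"),
                      duration = c(2, 1, 1),
                      quality = c(0.3, 0.9, 0.5))
stub_model <- function(items, start) items$quality  # ignores the slot time

result <- greedy_schedule(content, total_amr, stub_model)
print(result)
```

In the toy run, item A takes slots 2 and 3 (the best 2-slot window), B takes slot 5, and C takes slot 4. Each iteration calls the model once, so the number of predictions grows with the number of content items rather than with the number of orderings.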

Example

This example illustrates how our algorithm would work in a simplified form. Since real-world data was not available, we did not implement all features, as the results would not be meaningful.

Data

First, we had to set up a data-generating process. We generated 3 files: (i) content to schedule for next week and its metadata, (ii) historical total AMR, and (iii) AMR and SHR of past content and its metadata.

Content to schedule for next week and its metadata

There are 4 types of content: news, sport, movie and series.
Content of the same type always has the same duration: 1 hour for news and series; 2 hours for sport and movie.
News is played 14 times a week (14h), sport 5 times (10h), movies 30 times (60h) and series 84 times (84h), which together fill all 168 hours of the week.
News and sport are played live and not included in the ML model, and therefore most metadata is available only for movie and series.
- News is played daily at 9am and 7pm
- Sport starts randomly throughout the week, but never at the same time as news
Movie and series have several attributes included in the metadata. Their description and distribution are given below.
- character: categorical distribution - action (20%), comedy (20%), documentary (15%), thriller (15%), romance (10%), drama (10%), horror (5%), fantasy (5%)
- production decade: normal distribution with mean = 2005 and sd = 10 (bounded to the range between 1950 and 2020)
- oscar: Bernoulli distribution - 1 (5%), 0 (95%) [available only for movies]
- other awards: Bernoulli distribution - 1 (20%), 0 (80%)
- IMDb score: normal distribution with mean = 6 and sd = 1.5 (bounded to the range between 0 and 100)
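
The attribute distributions listed above could be simulated roughly as follows. This is a sketch under stated assumptions, not the generator actually used: clipping is assumed as the bounding method, flooring the year to a decade is assumed, and all column and helper names are hypothetical.

```r
set.seed(123)

n_movies <- 30
n_series <- 84
n <- n_movies + n_series

content <- data.frame(type = rep(c("movie", "series"), c(n_movies, n_series)))
content$duration <- ifelse(content$type == "movie", 2, 1)  # hours

# Character (genre) drawn with the stated probabilities.
genres <- c(action = 0.20, comedy = 0.20, documentary = 0.15, thriller = 0.15,
            romance = 0.10, drama = 0.10, horror = 0.05, fantasy = 0.05)
content$character <- sample(names(genres), n, replace = TRUE, prob = genres)

# Normal draws clipped to a range (assumption: the source only says "bounded").
bounded_normal <- function(n, mean, sd, lower, upper) {
  pmin(pmax(rnorm(n, mean, sd), lower), upper)
}

# Assumption: the decade is the production year floored to a multiple of 10.
year <- bounded_normal(n, mean = 2005, sd = 10, lower = 1950, upper = 2020)
content$decade <- floor(year / 10) * 10

# Awards as 0/1 draws; oscar is available only for movies.
content$oscar <- ifelse(content$type == "movie", rbinom(n, 1, 0.05), NA)
content$other_awards <- rbinom(n, 1, 0.20)

# Score bounds are taken from the text as stated.
content$imdb_score <- bounded_normal(n, mean = 6, sd = 1.5, lower = 0, upper = 100)

print(head(content))
```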