Our task has been assigned by Canal+ and its objective is, broadly speaking, to propose an optimal, cost-effective algorithm for scheduling TV programs. A few useful measures are utilized in the task and they are as follows.
a <- c("AMR", "total AMR","SHR")
b <- c("Average Minute Rating", "Total Average Minute Rating","Audience Share")
c <- c("% of target group watching the program", "% of target group watching TV", "% of people watching the program out of people watching TV at that time")
df <- data.frame(a,b,c)
colnames(df) <- c("Measure", "Full Name","Description")
df %>%
kbl() %>%
kable_styling()
| Measure | Full Name | Description |
|---|---|---|
| AMR | Average Minute Rating | % of target group watching the program |
| total AMR | Total Average Minute Rating | % of target group watching TV |
| SHR | Audience Share | % of people watching the program out of people watching TV at that time |
AMR is captured minute by minute and and average over the program duration is calculated.
The model given by Canal+ can be summarized as:
\[ \widehat{AMR}(c)=f(metadata(c),performance\_history(c),schedule\_slot(c)) \] and our objective is to maximize viewership, i.e. select \(schedule\_slot(c)\) that maximizes:
\[ \sum_{c=C} \widehat{AMR}(c) \] where \(C\) is a set of available contents \(c\). The problem is to find an algorithm which is faster than simple \(N!\) selection.
Our proposed solution is to use the ML Model to predict the SHR and put content with highest values of Predicted SHR into the best slot. It is important to state that this solution is a framework that does not specify all details, but provides a general idea on how to proceed. The details should be specified after consultations with SMEs and getting real-life data for validation. This framework consist of 4 steps:
Find the best ML Model for Predicting SHR based on past content metdata
1.1. First, all available content metadata should be gathered. Metadata should not contain missing observations in any column, since these are hard to be modelled - in case there are some missing observation, consider removing this column or cautiosly choose a data imputation method. One variable that should always be included in the model is time - the hour within the week when the content was presented.
1.2. Then, ML models that will be validated should be chosen. We suggest to use a variety of models in order to be able to properly check different types of relationship between the dependent and independent variables (linear - e.g. Linear Regression, SVR with linear kernel, non-linear - e.g. SVR with sigmoid kernel, Random Forest). After choosing the model types, we should choose the set of parameters that will be tested for each model (depends on the choice of models). Last, but not least, we should agree on a performance metric that will be used to evalute the performance of models - we suggest using RMSE as a standard, but after consulations with SMEs, other metric may be chosen.
1.3. Then, K-fold stratified validation should be conducted - we suggest to run it at least 10 times (but the more the better/safer) for each model with a 50/50 sample split stratified by Predicted SHR (stratification may be achieved by dividing the target variable into bins). Then, we should calculate the average of chosen performance metrics and see which model provided the best results. Depending on the amount of models variations and inital results, we may choose to select a few best performing variations and run the validation again, testing each variation more times than before. This should allow us to find a model and its paramteres that has the best fit for our metadata.
Schedule content that has to be shown live
2.1. Some content types are always shown live - for example news. If there is any such content - schedule it first to see which slots are still available for other content.
Select slot(s) with the highest Total AMR, calculate Predicted SHR, and put content with highest values into that slot. Repeat this step until all content is schedulded.
3.1. First, use historical data (consult with SMEs to select a meaningful sample - e.g. last year) to calculate the average Total AMR for each hour of the week. It should be calculated for different time windows, depending on the duration of content that you are presenting.
3.2. Select slot(s) that have the highest Total AMR for each time window. Then combine content’s metdata with the time from the slot (in some cases the time might be different, depending on the content’s duration) and run the ML Model to obtain Predicted SHR for all content.
3.3. Select content with highest Predicted SHR and put in into selected slot. This way, AMR is maximixed - content with the highest SHR is put into slots with highest Total AMR, resulting in maximization of our content’s AMR.
3.4. Repeat steps 3.2 and 3.3. until all content is schedulded. The slots which are left should be controlled so that the content that is left can be schedulded (e.g. there might be a 2h content to schedule but only 1h slots left). In such case, longer content should be scheduled first even though its Predicted SHR might be lower.
This examples illustrates how our algorithm would work in a simplified form - since real-world data was not available, we did not implement all features as the results was not meaningful.
First, we had to set-up a data generating process. We generated 3 files - (i) content to schedule for next week and its metadata, (ii) historical total AMR & (iii) AMR and SHR of past content and its metadata,
Content to schedule for next week and its metadata
Code used to generate content to schedule data is below (hidden by default).
# NEWS
start_time_news <- sort(c(seq(9, 168, 24), seq(19, 168, 24)))
metadata_news <- data.frame(
title = paste(rep('News', 14), seq(1, 14, 1)),
content_type = rep('news', 14),
duration = rep(1, 14),
start_time = start_time_news,
character = rep('news', 14),
production_decade = rep('2020', 14),
oscar = rep(NA, 14),
other_awards = rep(NA, 14),
imbd_score = rep(NA, 14),
stringsAsFactors = F
)
# SPORT
set.seed(456100)
start_time_sport <- sort(round(runif(5, min = 0, max = 168)))
for (time in start_time_sport) {
if (is.element(time, start_time_news)) {
start_time_sport[start_time_sport==time] <- time + 1
} else if (is.element(time+1, start_time_news)) {
start_time_sport[start_time_sport==time] <- time + 2
}
}
set.seed(456101)
metadata_sport <- data.frame(
title = paste(rep('Sport', 5), seq(1, 5, 1)),
content_type = rep('sport', 5),
duration = rep(2, 5),
start_time = start_time_sport,
character = rep('sport', 5),
production_decade = rep('2020', 5),
oscar = rep(NA, 5),
other_awards = rep(NA, 5),
imbd_score = rep(NA, 5),
stringsAsFactors = F
)
# MOVIE
set.seed(456102)
character_movie <- runif(30, 0, 100)
character_movie <- dplyr::case_when(
character_movie 80 ~ 'action',
character_movie 60 & character_movie <= 80 ~ 'comedy',
character_movie 45 & character_movie <= 60 ~ 'documentary',
character_movie 30 & character_movie <= 45 ~ 'thriller',
character_movie 20 & character_movie <= 30 ~ 'romance',
character_movie 10 & character_movie <= 20 ~ 'drama',
character_movie 5 & character_movie <= 10 ~ 'horror',
character_movie <= 5 ~ 'fantasy'
)
set.seed(456103)
production_decade_movie <- round(rnorm(30, 2000, 10)/10) * 10
production_decade_movie <- ifelse(production_decade_movie<1950, 1950, ifelse(production_decade_movie2020, 2020, production_decade_movie))
set.seed(456104)
oscar_movie <- runif(30, 0, 100)
oscar_movie <- ifelse(oscar_movie<=5, 1, 0)
set.seed(456105)
other_awards_movie <- runif(30, 0, 100)
other_awards_movie <- ifelse(other_awards_movie<=20, 1, 0)
set.seed(456106)
imbd_score_movie <- round(rnorm(30, 6, 2)*10)
imbd_score_movie <- ifelse(imbd_score_movie<0, 0, ifelse(imbd_score_movie100, 100, imbd_score_movie))
metadata_movie <- data.frame(
title = paste(rep('Movie', 30), seq(1, 30, 1)),
content_type = rep('movie', 30),
duration = rep(2, 30),
start_time = rep(NA, 30),
character = character_movie,
production_decade = production_decade_movie,
oscar = oscar_movie,
other_awards = other_awards_movie,
imbd_score = imbd_score_movie,
stringsAsFactors = F
)
# SERIES
set.seed(456107)
character_series <- runif(84, 0, 100)
character_series <- dplyr::case_when(
character_series 80 ~ 'action',
character_series 60 & character_series <= 80 ~ 'comedy',
character_series 45 & character_series <= 60 ~ 'documentary',
character_series 30 & character_series <= 45 ~ 'thriller',
character_series 20 & character_series <= 30 ~ 'romance',
character_series 10 & character_series <= 20 ~ 'drama',
character_series 5 & character_series <= 10 ~ 'horror',
character_series <= 5 ~ 'fantasy'
)
set.seed(456108)
production_decade_series <- round(rnorm(84, 2005, 10)/10) * 10
production_decade_series <- ifelse(production_decade_series<1950, 1950, ifelse(production_decade_series2020, 2020, production_decade_series))
set.seed(456109)
other_awards_series <- runif(84, 0, 100)
other_awards_series <- ifelse(other_awards_series<=20, 1, 0)
set.seed(456110)
imbd_score_series <- round(rnorm(84, 6, 2)*10)
imbd_score_series <- ifelse(imbd_score_series<0, 0, ifelse(imbd_score_series100, 100, imbd_score_series))
metadata_series <- data.frame(
title = paste(rep('Series', 84), seq(1, 84, 1)),
content_type = rep('series', 84),
duration = rep(1, 84),
start_time = rep(NA, 84),
character = character_series,
production_decade = production_decade_series,
oscar = rep(0, 84),
other_awards = other_awards_series,
imbd_score = imbd_score_series,
stringsAsFactors = F
)
# ALL
metadata <- rbind(
metadata_news,
metadata_sport,
metadata_movie,
metadata_series
)
Historical total AMR
Code used to generate content to schedule data is below (hidden by default).
total_amr <- data.frame(
week = rep(1:52, each = 168),
time = rep(0:167, times = 52),
total_amr = rep(0, times = 52 * 168)
)
for (week in 1:52) {
for (time in 0:167){
if (time %in% c(23, 0, 1, 2, 3, 4, 5)) {
amr_vector <- rnorm(60, 5, 5)
} else if (time %in% c(22, 6)) {
amr_vector <- rnorm(60, 10, 5)
} else if (time %in% c(19, 20, 21, 22, 7, 8, 9) | i 120) {
amr_vector <- rnorm(60, 25, 10)
} else {
amr_vector <- rnorm(60, 15, 10)
}
amr_vector <- ifelse(mean(amr_vector)<1, 1, ifelse(mean(amr_vector)70, 70, mean(amr_vector)))
total_amr[total_amr$week == week & total_amr$time == time, 'total_amr'] = amr_vector
}
}
total_amr <- total_amr %%
group_by(time) %%
summarise(
total_amr = mean(total_amr)
) %%
mutate(
total_amr_2h = coalesce((total_amr + lead(total_amr)) / 2, 0)
) %%
as.data.frame()
AMR and SHR of past content and its metadata
Code used to generate content to schedule data is below (hidden by default).
# MOVIE
set.seed(456101)
character_movie <- runif(10000, 0, 100)
character_movie <- case_when(
character_movie 80 ~ 'action',
character_movie 60 & character_movie <= 80 ~ 'comedy',
character_movie 45 & character_movie <= 60 ~ 'documentary',
character_movie 30 & character_movie <= 45 ~ 'thriller',
character_movie 20 & character_movie <= 30 ~ 'romance',
character_movie 10 & character_movie <= 20 ~ 'drama',
character_movie 5 & character_movie <= 10 ~ 'horror',
character_movie <= 5 ~ 'fantasy'
)
set.seed(456102)
production_decade_movie <- round(rnorm(10000, 2000, 10)/10) * 10
production_decade_movie <- ifelse(production_decade_movie<1950, 1950, ifelse(production_decade_movie2020, 2020, production_decade_movie))
set.seed(456103)
oscar_movie <- runif(10000, 0, 100)
oscar_movie <- ifelse(oscar_movie<=5, 1, 0)
set.seed(456104)
other_awards_movie <- runif(10000, 0, 100)
other_awards_movie <- ifelse(other_awards_movie<=20, 1, 0)
set.seed(456105)
imbd_score_movie <- round(rnorm(10000, 6, 2)*10)
imbd_score_movie <- ifelse(imbd_score_movie<0, 0, ifelse(imbd_score_movie100, 100, imbd_score_movie))
set.seed(456106)
start_time_movie <- round(runif(10000, 0, 166))
set.seed(456107)
amr_movie_vector <- c()
for (time in start_time_movie){
time <- time%%24
if (time %in% c(23, 0, 1, 2, 3, 4, 5)) {
amr_vector <- rnorm(120, 1, 1)
} else if (time %in% c(22, 6)) {
amr_vector <- rnorm(120, 2.5, 1.5)
} else if (time %in% c(19, 20, 21, 22, 7, 8, 9) | i 120) {
amr_vector <- rnorm(120, 7.5, 5)
} else {
amr_vector <- rnorm(120, 5, 2.5)
}
amr_vector <- mean(amr_vector)
amr_vector <- ifelse(amr_vector < 0.1, 0.1, ifelse(amr_vector 10, 10, amr_vector))
amr_movie_vector <- append(amr_movie_vector, amr_vector)
}
amr_movie <- data.frame(
title = paste(rep('Movie S', 10000), seq(1, 10000, 1)),
content_type = rep('movie', 10000),
duration = rep(2, 10000),
start_time = start_time_movie,
character = character_movie,
production_decade = production_decade_movie,
oscar = oscar_movie,
other_awards = other_awards_movie,
imbd_score = imbd_score_movie,
amr = amr_movie_vector,
stringsAsFactors = F
)
# SERIES
set.seed(456107)
character_series <- runif(20000, 0, 100)
character_series <- case_when(
character_movie 80 ~ 'action',
character_movie 60 & character_movie <= 80 ~ 'comedy',
character_movie 45 & character_movie <= 60 ~ 'documentary',
character_movie 30 & character_movie <= 45 ~ 'thriller',
character_movie 20 & character_movie <= 30 ~ 'romance',
character_movie 10 & character_movie <= 20 ~ 'drama',
character_movie 5 & character_movie <= 10 ~ 'horror',
character_movie <= 5 ~ 'fantasy'
)
set.seed(456108)
production_decade_series <- round(rnorm(20000, 2005, 10)/10) * 10
production_decade_series <- ifelse(production_decade_series<1950, 1950, ifelse(production_decade_series2020, 2020, production_decade_series))
set.seed(456109)
other_awards_series <- runif(20000, 0, 100)
other_awards_series <- ifelse(other_awards_series<=20, 1, 0)
set.seed(456110)
imbd_score_series <- round(rnorm(20000, 6, 2)*10)
imbd_score_series <- ifelse(imbd_score_series<0, 0, ifelse(imbd_score_series100, 100, imbd_score_series))
set.seed(456111)
start_time_series <- round(runif(20000, 0, 167))
set.seed(456112)
amr_series_vector <- c()
for (time in start_time_series){
time <- time%%24
if (time %in% c(23, 0, 1, 2, 3, 4, 5)) {
amr_vector <- rnorm(60, 1, 1)
} else if (time %in% c(22, 6)) {
amr_vector <- rnorm(60, 2.5, 1.5)
} else if (time %in% c(19, 20, 21, 7, 8, 9) | i 120) {
amr_vector <- rnorm(60, 7.5, 5)
} else {
amr_vector <- rnorm(60, 5, 2.5)
}
amr_vector <- mean(amr_vector)
amr_vector <- ifelse(amr_vector < 0.1, 0.1, ifelse(amr_vector 10, 10, amr_vector))
amr_series_vector <- append(amr_series_vector, amr_vector)
}
amr_series <- data.frame(
title = paste(rep('Series S', 20000), seq(1, 20000, 1)),
content_type = rep('series', 20000),
duration = rep(1, 20000),
start_time = start_time_series,
character = character_series,
production_decade = production_decade_series,
oscar = rep(0, 20000),
other_awards = other_awards_series,
imbd_score = imbd_score_series,
amr = amr_series_vector,
stringsAsFactors = F
)
amr <- rbind(
amr_movie,
amr_series
)
for (col in c('production_decade', 'oscar', 'other_awards')) {
amr[, col] <- as.character(amr[, col])
}
amr_movie %%
arrange(start_time)
amr <- amr %%
merge(total_amr, by.x='start_time', by.y = 'time') %%
mutate(
amr_vs_total = ifelse(content_type == 'movie', amr / total_amr_2h, amr / total_amr)
) %%
select(
title,
content_type,
duration,
start_time,
character,
production_decade,
oscar,
other_awards,
imbd_score,
amr,
amr_vs_total
)
We did not implement the K Stratified Validation since the data we use is random, and therefore the model do not have any predictive power. For our example, we have chosen to use Random Forest with 100 trees, 6 variables used at each split, minimum 50 observation in a terminal node, and without replacement. We trained the model with all past metadata.
First, we scheduled all news and sports. Then, we scheduled all movies and series consecutively using the results of RF model. Please note that not all features described in the proposed framework were implemented in the exemplary algorithm since the data was random and the final result would not be meaningful anyway. Nevertheless, we have the know-how to implement all the rules, but that was not the main purpose of our task.
Code used to generate content to schedule data is below (hidden by default).
form <- amr_vs_total ~ content_type + character + production_decade + oscar + other_awards + imbd_score
set.seed(456100)
rf <- randomForest(form, data=amr, ntree = 100, mtry = 6, replace = F, nodesize = 50)
schedule <- data.frame(
title = character(),
content_type = character(),
start_time = integer(),
end_time = integer()
)
for (t in metadata[metadata$content_type %in% c('news', 'sport'), 'title']) {
obs <- metadata %%
filter(title == t) %%
mutate(
end_time = start_time + duration
) %%
select(
title,
content_type,
character,
start_time,
end_time
)
schedule <- rbind(schedule, obs)
}
free_slots <- data.frame(
time = 0:167,
slot_1h = rep(0, 168),
slot_2h = rep(0, 168)
)
free_slots[168, 'slot_2h'] <- 1
free_slots <- free_slots %%
merge(schedule, all.x=T, by.x='time', by.y='start_time') %%
mutate(
slot_1h = case_when(
content_type %in% c('news', 'sport') ~ 1,
lag(content_type, 1) == 'sport' ~ 1,
T ~ 0
),
slot_2h = case_when(
slot_1h == 1 ~ 1,
lead(slot_1h) == 1 ~ 1,
T ~ 0
)
) %%
merge(total_amr, by='time')
metadata_schedule <- metadata[metadata$content_type %in% c('movie', 'series'),]
for (col in c('production_decade', 'oscar', 'other_awards')) {
metadata_schedule[, col] <- as.character(metadata_schedule[, col])
}
metadata_schedule$amr_vs_total_pred <- predict(rf, metadata_schedule)
metadata_schedule$planned <- 0
metadata_schedule <- metadata_schedule %%
arrange(desc(amr_vs_total_pred))
for (i in 1:nrow(metadata_schedule)) {
obs <- metadata_schedule[i,]
if (obs$duration == 1) {
best_slot <- which(free_slots$total_amr == max(free_slots[free_slots$slot_1h == 0, 'total_amr']))
} else if (obs$duration == 2) {
best_slot <- which(free_slots$total_amr_2h == max(free_slots[free_slots$slot_2h == 0, 'total_amr_2h']))
}
print(i)
print(best_slot)
print('')
free_slots[best_slot, 'title'] <- obs$title
free_slots[best_slot, 'content_type'] <- obs$content_type
free_slots[best_slot, 'character'] <- obs$character
free_slots[best_slot, 'end_time'] <- free_slots[best_slot, 'time'] + obs$duration
free_slots[best_slot, 'slot_1h'] <- 1
free_slots[best_slot, 'slot_2h'] <- 1
free_slots[best_slot - 1, 'slot_2h'] <- 1
if (obs$duration == 2) {
free_slots[best_slot + 1, 'slot_1h'] <- 1
free_slots[best_slot + 1, 'slot_2h'] <- 1
}
metadata_schedule[i, 'planned'] <- 1
}
Out last step was preparation of a graphic schedule for the next week. We have tweaked the data to be fit into a ggplot figure.
Code used to generate content to schedule data is below (hidden by default).
# create 2 rows if show spans over multiple days
for (i in which(free_slots$content_type %in% c('sport', 'movie') & free_slots$time%%24 == 23)) {
free_slots[i, 'end_time'] <- free_slots[i, 'end_time'] - 1
free_slots[i+1, 'title'] <- paste(free_slots[i, 'title'], 'cont.')
free_slots[i+1, 'content_type'] <- free_slots[i, 'content_type']
free_slots[i+1, 'character'] <- free_slots[i, 'character']
free_slots[i+1, 'end_time'] <- free_slots[i, 'end_time'] + 1
}
plot_data <- free_slots %%
mutate(
day_start = floor(time/24),
day_end = day_start + 1,
day = (day_start + day_end) / 2,
hour_start = time%%24,
hour_end = ifelse(end_time%%24==0, 24, end_time%%24),
hour_title = ifelse(content_type %in% c('series', 'movie'), (hour_start + hour_end) / 2 - 0.2, (hour_start + hour_end) / 2),
hour_character = (hour_start + hour_end) / 2 + 0.2
) %%
filter(title != '<NA')
png('schedule.png', width = 3508, height = 2480, res=300)
ggplot(plot_data) +
geom_rect(
aes(xmin = day_start, xmax = day_end, ymin = hour_start, ymax = hour_end, fill = content_type),
alpha = 0.5, color = 'black') +
geom_text(
aes(x = day, y = hour_title, label = title),
size = 3, color = 'black', fontface = 'bold') +
geom_text(
data = plot_data[plot_data$content_type %in% c('series', 'movie'),],
aes(x = day, y = hour_character, label = paste('(', character, ')', sep='')),
size = 2, color = 'black') +
scale_y_reverse(breaks = seq(0, 23, 2),
labels = paste(ifelse(nchar(seq(0, 23, 2)) == 1, paste(0, seq(0, 23, 2), sep=''), seq(0, 23, 2)), ':00', sep='')) +
scale_x_continuous(breaks = 0:6 + 0.5,
labels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'),
position = "top") +
scale_fill_viridis(discrete=T) +
theme_classic() +
theme(
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text.x = element_text(size = 14, face = 'bold', color = 'black'),
axis.text.y = element_text(size = 11, face = 'bold', color = 'black'),
plot.margin = unit(c(1,1,1,1), "cm"),
panel.spacing = element_blank(),
legend.position = 'none'
)
dev.off()
We argue that the proposed solution is optimal and efficient for several reasons.
It is time- and cost-effective. The ML algorithm needs to be trained only once, and then, it can be applied to new data (e.g. each new week) quickly and seamlessly.
It is easily adaptable to new data. While being trained on one TV channel, the algorithm can be applied to data from another channel without effort.
Our solution captures viewing habits due to the inclusion of the time a person was watching a specific content.
New features may be easily incorporated. Currently our metadata is limited to 11 features. However, new features can be added after consulting an SME.
Work with SMEs to identify which features may be added to the metadata and establish a set of rules that the algorithm should follow (e.g. time frames for kids’ content).
Adjust the algorithm to work with other possible content types (multiple ML models may be used).
Use real-world data to run the algorithm and evaluate its performance.