This project develops a predictive model to estimate the probability that a user will click on an online advertisement.
The dataset used in this project is provided as an RData file
containing two objects: a training dataset (ClickTraining)
and a prediction dataset (ClickPrediction).
load("DeliveryAdClick.RData")
ClickTraining <- .standardize_target(ClickTraining)
# ClickPrediction has no target column, so no standardization is needed
# Basic checks
dim(ClickTraining)
## [1] 18000 9
dim(ClickPrediction)
## [1] 2000 8
The training dataset contains 18,000 observations and 9 variables capturing different aspects of user context, behavior, and content exposure:

- Region: the user's region of France
- Daytime: time of day, scaled to the unit interval [0, 1]
- Carrier: the user's mobile carrier (Bouygues, Free, Orange, SFR)
- Time_On_Previous_Website: time spent on the previously visited website (likely seconds)
- Weekday: day of the week
- Social_Network: the platform on which the ad is served (Facebook, Instagram, Twitter)
- Number_of_Previous_Orders: the user's number of previous orders
- Restaurant_Type: the type of restaurant advertised (e.g., Burger, French, Sushi)
- Clicks_Conversion: the target variable

The target variable is Clicks_Conversion, where 1 indicates that the user clicked on the advertisement and 0 indicates no click.
To better understand variable types and distributions, we inspect the structure and summary statistics of the training dataset.
## Rows: 18,000
## Columns: 9
## $ Region <chr> "Alsace and East France", "West France", "Pa…
## $ Daytime <dbl> 0.22325383, 0.73439324, 0.78131697, 0.310846…
## $ Carrier <chr> "SFR", "Bouygues", "SFR", "SFR", "SFR", "Bou…
## $ Time_On_Previous_Website <dbl> 8.066935, 257.567079, 1427.640826, 1606.2621…
## $ Weekday <fct> Saturday, Saturday, Wednesday, Wednesday, Su…
## $ Social_Network <chr> "Instagram", "Twitter", "Instagram", "Instag…
## $ Number_of_Previous_Orders <dbl> 3, 3, 6, 4, 2, 7, 2, 6, 4, 4, 3, 2, 2, 6, 2,…
## $ Clicks_Conversion <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,…
## $ Restaurant_Type <chr> "French", "French", "French", "Burger", "Gro…
## Region Daytime Carrier
## Length:18000 Min. :0.000121 Length:18000
## Class :character 1st Qu.:0.244788 Class :character
## Mode :character Median :0.492348 Mode :character
## Mean :0.495189
## 3rd Qu.:0.740757
## Max. :0.999992
##
## Time_On_Previous_Website Weekday Social_Network
## Min. : 5.033 Friday :2593 Length:18000
## 1st Qu.: 454.132 Monday :2695 Class :character
## Median : 902.860 Saturday :2522 Mode :character
## Mean : 902.603 Sunday :2554
## 3rd Qu.:1352.777 Thursday :2517
## Max. :1799.875 Tuesday :2599
## Wednesday:2520
## Number_of_Previous_Orders Clicks_Conversion Restaurant_Type
## Min. : 0.000 Min. :0.0000 Length:18000
## 1st Qu.: 2.000 1st Qu.:1.0000 Class :character
## Median : 3.000 Median :1.0000 Mode :character
## Mean : 3.082 Mean :0.8446
## 3rd Qu.: 4.000 3rd Qu.:1.0000
## Max. :11.000 Max. :1.0000
##
We first examine the distribution of the target variable (Clicks_Conversion) and compute the baseline click rate. This provides a reference point for evaluating predictive models.
##
## 0 1
## 2798 15202
##
## 0 1
## 0.155 0.845
## [1] 0.8445556
Interpretation. The baseline click rate is
0.8446 (≈84.5%).
Because the outcome is highly skewed toward clicks, accuracy
alone can be misleading, so we emphasize AUC
and probability-based ranking.
We ensure consistent preprocessing across training and prediction datasets to avoid factor-level mismatch:
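The preprocessing chunk itself is not reproduced in this extract. Below is a minimal sketch of the steps implied by the output that follows, assuming missing restaurant types are recoded to an explicit "Unknown" level (the column selection and loop structure are illustrative):
# Recode missing restaurant types to an explicit "Unknown" level (assumed step)
ClickTraining$Restaurant_Type[is.na(ClickTraining$Restaurant_Type)] <- "Unknown"
ClickPrediction$Restaurant_Type[is.na(ClickPrediction$Restaurant_Type)] <- "Unknown"
# Convert character predictors to factors with levels shared across both datasets
char_cols <- c("Region", "Carrier", "Social_Network", "Restaurant_Type")
for (col in char_cols) {
  shared_levels <- sort(union(unique(ClickTraining[[col]]), unique(ClickPrediction[[col]])))
  ClickTraining[[col]] <- factor(ClickTraining[[col]], levels = shared_levels)
  ClickPrediction[[col]] <- factor(ClickPrediction[[col]], levels = shared_levels)
}
# Encode the target as a factor for classification models
ClickTraining$Clicks_Conversion <- factor(ClickTraining$Clicks_Conversion, levels = c(0, 1))
str(ClickTraining)
colSums(is.na(ClickTraining))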
"Unknown"## 'data.frame': 18000 obs. of 9 variables:
## $ Region : Factor w/ 6 levels "Alsace and East France",..: 1 6 3 3 2 6 3 2 2 5 ...
## $ Daytime : num 0.223 0.734 0.781 0.311 0.91 ...
## $ Carrier : Factor w/ 4 levels "Bouygues","Free",..: 4 1 4 4 4 1 2 4 3 1 ...
## $ Time_On_Previous_Website : num 8.07 257.57 1427.64 1606.26 480.23 ...
## $ Weekday : Factor w/ 7 levels "Friday","Monday",..: 3 3 7 7 4 3 2 3 3 2 ...
## $ Social_Network : Factor w/ 3 levels "Facebook","Instagram",..: 2 3 2 2 1 2 1 3 2 1 ...
## $ Number_of_Previous_Orders: num 3 3 6 4 2 7 2 6 4 4 ...
## $ Clicks_Conversion : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ Restaurant_Type : Factor w/ 6 levels "Burger","French",..: 2 2 2 1 3 2 1 4 2 4 ...
## Region Daytime Carrier
## 0 0 0
## Time_On_Previous_Website Weekday Social_Network
## 0 0 0
## Number_of_Previous_Orders Clicks_Conversion Restaurant_Type
## 0 0 0
After preprocessing, all categorical predictors are properly encoded as factors, and missing values are treated consistently. The training and prediction datasets now share the same feature space and are ready for model development.
We split the 18,000 observations into a training set (70%, 12,600 observations) used for model fitting and a test set (30%, 5,400 observations) held out for evaluation.
We evaluate model performance using predicted probabilities, confusion matrices at a 0.5 threshold, and ROC curves with AUC, which measure ranking quality independently of any single cutoff. A sketch of the split follows.
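Assuming a simple random 70/30 partition (the seed value is arbitrary and not from the original):
set.seed(123)  # assumed; the original seed is not shown
train_idx <- sample(seq_len(nrow(ClickTraining)), size = 0.7 * nrow(ClickTraining))
train_data <- ClickTraining[train_idx, ]
test_data <- ClickTraining[-train_idx, ]
dim(train_data)
dim(test_data)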
## [1] 12600 9
## [1] 5400 9
We first estimate a logistic regression model as an interpretable benchmark for predicting ad click behavior.
Because the binomial logistic regression in glm()
requires a numeric response variable coded as 0/1, we create a numeric
copy of the target variable for model fitting while keeping the original
factor version for classification tasks.
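The fitting code is not shown in this extract; below is a minimal sketch consistent with the call reported in the summary (logit_model is an assumed object name):
# Numeric 0/1 copy of the target for glm(), keeping the factor version intact
train_data_glm <- train_data
train_data_glm$y_num <- as.numeric(as.character(train_data_glm$Clicks_Conversion))
features <- setdiff(names(train_data), "Clicks_Conversion")
# Binomial logistic regression on all predictors
logit_model <- glm(y_num ~ ., family = binomial(link = "logit"),
                   data = train_data_glm[, c(features, "y_num")])
summary(logit_model)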
##
## Call:
## glm(formula = y_num ~ ., family = binomial(link = "logit"), data = train_data_glm[,
## c(features, "y_num")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.839e+00 2.808e-01 17.235 < 2e-16 ***
## RegionNorth France -1.049e-01 1.501e-01 -0.699 0.484581
## RegionParis -9.479e-02 1.494e-01 -0.635 0.525750
## RegionRest of France 8.252e-02 1.506e-01 0.548 0.583797
## RegionSouth France -2.279e-01 1.463e-01 -1.558 0.119291
## RegionWest France -2.463e-01 1.488e-01 -1.656 0.097812 .
## Daytime 5.055e+00 1.909e-01 26.483 < 2e-16 ***
## CarrierFree -5.433e+00 1.694e-01 -32.068 < 2e-16 ***
## CarrierOrange 1.068e+00 1.492e-01 7.158 8.18e-13 ***
## CarrierSFR 1.103e+00 1.443e-01 7.642 2.14e-14 ***
## Time_On_Previous_Website 2.157e-03 9.738e-05 22.148 < 2e-16 ***
## WeekdayMonday -6.415e+00 2.333e-01 -27.495 < 2e-16 ***
## WeekdaySaturday 6.985e-01 2.369e-01 2.949 0.003192 **
## WeekdaySunday -9.494e-01 2.169e-01 -4.377 1.21e-05 ***
## WeekdayThursday -1.308e+00 2.121e-01 -6.167 6.97e-10 ***
## WeekdayTuesday -6.397e+00 2.334e-01 -27.406 < 2e-16 ***
## WeekdayWednesday -1.258e+00 2.095e-01 -6.007 1.89e-09 ***
## Social_NetworkInstagram -2.266e+00 1.246e-01 -18.187 < 2e-16 ***
## Social_NetworkTwitter -2.884e+00 1.292e-01 -22.322 < 2e-16 ***
## Number_of_Previous_Orders 1.792e-01 2.854e-02 6.277 3.44e-10 ***
## Restaurant_TypeFrench 2.032e+00 1.670e-01 12.167 < 2e-16 ***
## Restaurant_TypeGroceries -1.995e+00 1.347e-01 -14.811 < 2e-16 ***
## Restaurant_TypeKebab 5.203e-02 1.390e-01 0.374 0.708147
## Restaurant_TypeSushi 2.135e+00 1.655e-01 12.901 < 2e-16 ***
## Restaurant_TypeUnknown -6.547e-01 1.759e-01 -3.723 0.000197 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10786.6 on 12599 degrees of freedom
## Residual deviance: 3580.1 on 12575 degrees of freedom
## AIC: 3630.1
##
## Number of Fisher Scoring iterations: 8
We evaluate the logistic regression model on the test set using predicted probabilities, a confusion matrix at threshold 0.5, and ROC/AUC.
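A sketch of this evaluation, assuming the pROC package and the logit_model and test_data objects from above:
library(pROC)
# Predicted click probabilities on the held-out test set
test_probs_logit <- predict(logit_model, newdata = test_data, type = "response")
# Confusion matrix at the 0.5 threshold
pred_class <- ifelse(test_probs_logit > 0.5, 1, 0)
table(Predicted = pred_class, Actual = test_data$Clicks_Conversion)
# ROC curve and AUC
roc_logit <- roc(test_data$Clicks_Conversion, test_probs_logit)
auc_logit <- as.numeric(auc(roc_logit))
auc_logit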
## Actual
## Predicted 0 1
## 0 640 132
## 1 229 4399
## [1] 0.9732049
Logistic regression achieves excellent discrimination, with AUC = 0.9732. In this marketing setting, such strong ranking ability supports probability-based targeting even when the baseline click rate is high.
To capture potential non-linear patterns and interactions among predictors, we estimate tree-based models and compare their performance with the logistic regression baseline.
We first fit a classification decision tree, which provides an interpretable set of if–then rules and can naturally capture non-linear effects and interactions.
##
## Classification tree:
## rpart(formula = as.formula(paste(target, "~", paste(features,
## collapse = " + "))), data = train_data, method = "class",
## control = rpart.control(cp = 0.01))
##
## Variables actually used in tree construction:
## [1] Carrier Daytime Restaurant_Type Social_Network
## [5] Weekday
##
## Root node error: 1929/12600 = 0.1531
##
## n= 12600
##
## CP nsplit rel error xerror xstd
## 1 0.156039 0 1.00000 1.00000 0.020953
## 2 0.030845 2 0.68792 0.68792 0.017862
## 3 0.024538 5 0.58217 0.58528 0.016620
## 4 0.010000 8 0.50855 0.51322 0.015657
We evaluate the decision tree using predicted probabilities on the test set and compute the ROC curve and AUC.
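A sketch of the tree evaluation (tree_model is the assumed name of the rpart fit above):
# Class-1 probabilities from the tree, then ROC/AUC on the test set
tree_probs <- predict(tree_model, newdata = test_data, type = "prob")[, "1"]
roc_tree <- roc(test_data$Clicks_Conversion, tree_probs)
as.numeric(auc(roc_tree))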
## [1] 0.9394472
Interpretation.
The decision tree adds flexibility relative to logistic regression by
modeling non-linearities and interactions directly.
However, its AUC (0.9394) falls short of the logistic regression
benchmark (0.9732), reflecting the higher variance typical of
single-tree models.
We fit a random forest model to improve predictive accuracy by averaging across many decision trees, which typically reduces variance and captures complex non-linear relationships.
##
## Call:
## randomForest(formula = as.formula(paste(target, "~", paste(features, collapse = " + "))), data = train_data, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.48%
## Confusion matrix:
## 0 1 class.error
## 0 1461 468 0.24261275
## 1 222 10449 0.02080405
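The AUC reported next comes from the forest's probabilities on the held-out test set; a minimal sketch (rf_model is the assumed object name):
# Class-1 probabilities on the test set, then ROC/AUC
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")[, "1"]
roc_rf <- roc(test_data$Clicks_Conversion, rf_probs)
auc_rf <- as.numeric(auc(roc_rf))
auc_rf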
## [1] 0.9832841
Interpretation.
The random forest achieves strong out-of-sample performance with an AUC
of 0.9833. The variable importance plot highlights the key
drivers of click probability, which can be translated into actionable
targeting insights (e.g., engagement signals, timing, and platform
effects).
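The importance plot referenced above is not reproduced in this extract; it can be generated along these lines:
# Mean decrease in accuracy / Gini for each predictor
varImpPlot(rf_model, main = "Random Forest Variable Importance")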
We compare the predictive performance of all models using AUC and ROC curves on the test set.
| Model | AUC |
|---|---|
| Logistic Regression | 0.9732 |
| Decision Tree | 0.9394 |
| Random Forest | 0.9833 |
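A sketch of the ROC comparison plot, assuming the roc objects computed above:
# Overlay the three test-set ROC curves
plot(roc_logit, col = "black", legacy.axes = TRUE)
plot(roc_tree, col = "blue", add = TRUE)
plot(roc_rf, col = "red", add = TRUE)
legend("bottomright",
       legend = c("Logistic (AUC 0.9732)", "Tree (AUC 0.9394)", "Random Forest (AUC 0.9833)"),
       col = c("black", "blue", "red"), lwd = 2)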
Interpretation.
All three models substantially outperform random classification,
indicating strong predictive power.
The random forest achieves the highest AUC, reflecting its ability to
capture complex non-linear relationships and interactions among
predictors.
However, the performance gain of the random forest over logistic
regression is relatively modest.
Given its strong accuracy and superior interpretability, logistic
regression remains an attractive baseline model, especially in
managerial settings where transparency and ease of explanation are
important.
With an AUC above 0.98, the predictive models can reliably rank users
by their likelihood of clicking on an advertisement.
Instead of exposing all users uniformly, managers can prioritize users
with higher predicted click probabilities, raising expected
click-through rates and avoiding wasted impressions on users who are
unlikely to respond.
This probability-based targeting strategy enables a shift from volume-driven advertising toward performance-oriented budget allocation, where marketing resources are concentrated on users with the highest expected marginal returns.
A probability threshold converts predicted click probabilities into a
binary targeting decision.
The optimal threshold depends on business objectives and cost
considerations, including the cost of serving an impression, the
expected value of a click, and the relative cost of targeting a
non-clicker versus skipping a likely clicker. A stylized calculation is
sketched below.
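The figures here are hypothetical assumptions, not values from the data: if serving an impression costs c and a click is worth v, targeting a user with predicted probability p pays off in expectation when p·v > c, i.e. when p > c/v.
# Hypothetical economics: cost per targeted impression and value per click
c_impression <- 0.10
v_click <- 0.50
c_impression / v_click  # break-even threshold: 0.2
# Realized profit on the test set across candidate thresholds
actual_click <- as.numeric(as.character(test_data$Clicks_Conversion))
thresholds <- seq(0.05, 0.95, by = 0.05)
profit <- sapply(thresholds, function(t) {
  sum((actual_click * v_click - c_impression)[rf_probs >= t])
})
thresholds[which.max(profit)]  # profit-maximizing cutoff under these assumptions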
Therefore, ROC/AUC is a particularly important evaluation metric, as it assesses the model’s ranking quality across all possible thresholds rather than locking decisions into a single cutoff point.
We score the 2,000 new users using the selected random forest model and rank them from highest to lowest predicted click probability.
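A sketch of the scoring step (ranked_users is an illustrative name):
# Score new users with the random forest and sort by predicted probability
ClickPrediction$predicted_prob <- predict(rf_model, newdata = ClickPrediction,
                                          type = "prob")[, "1"]
ranked_users <- ClickPrediction[order(-ClickPrediction$predicted_prob), ]
rownames(ranked_users) <- NULL
head(ranked_users, 10)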
## Region Daytime Carrier Time_On_Previous_Website Weekday
## 1 South France 0.7271450 Bouygues 1073.690 Wednesday
## 2 North France 0.9649928 Orange 1636.194 Wednesday
## 3 Rest of France 0.2988972 Orange 1632.236 Friday
## 4 Rest of France 0.8550760 Bouygues 1394.258 Saturday
## 5 North France 0.9042812 Orange 1262.653 Friday
## 6 Rest of France 0.1888440 Orange 1352.549 Sunday
## 7 Alsace and East France 0.9636936 SFR 1425.138 Saturday
## 8 North France 0.7608732 Orange 1707.208 Sunday
## 9 Paris 0.7688666 Bouygues 1599.074 Saturday
## 10 Rest of France 0.5145126 Orange 1464.624 Saturday
## Social_Network Number_of_Previous_Orders Restaurant_Type predicted_prob
## 1 Facebook 3 Kebab 1
## 2 Facebook 6 Sushi 1
## 3 Twitter 2 Sushi 1
## 4 Facebook 4 Sushi 1
## 5 Instagram 2 Burger 1
## 6 Facebook 6 French 1
## 7 Facebook 5 Kebab 1
## 8 Instagram 6 Sushi 1
## 9 Facebook 4 Burger 1
## 10 Facebook 5 Kebab 1
We examine the distribution of predicted click probabilities to understand how users are differentiated by the model.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0080 0.8640 0.9860 0.8548 0.9980 1.0000
## 1% 5% 10% 50% 90% 95% 99%
## 0.05198 0.16400 0.40540 0.98600 1.00000 1.00000 1.00000
Interpretation.
The predicted probabilities are heavily concentrated near 1 (Median ≈
0.986; 3rd Quartile ≈ 0.998), consistent with the high baseline click
rate (0.845). However, the lower tail still shows
meaningful separation (the bottom 10% fall below ≈0.41), which is
valuable for excluding low-return users and improving efficiency.
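The segment table below can be produced along these lines (a sketch building on ranked_users from above):
# Summarize the highest-probability targeting segments
segment_summary <- function(label, frac) {
  seg <- ranked_users[seq_len(round(frac * nrow(ranked_users))), ]
  data.frame(Segment = label, Users = nrow(seg),
             Avg_Prob = mean(seg$predicted_prob),
             Min_Prob = min(seg$predicted_prob))
}
rbind(segment_summary("Top 10%", 0.10), segment_summary("Top 20%", 0.20))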
| Segment | Users | Avg_Prob | Min_Prob |
|---|---|---|---|
| Top 10% | 200 | 1.000000 | 1.000 |
| Top 20% | 400 | 0.999895 | 0.998 |
Business application (based on model output).
The top decile of scored users (200 users) consists almost entirely of
near-certain clickers (average predicted probability 1.000), and the top
20% (400 users) still averages 0.9999. Concentrating impressions on
these segments maximizes expected clicks per impression, while the
bottom decile (predicted probabilities below roughly 0.41) is a natural
candidate for exclusion or reduced bidding.
Recommendation.
Use the random forest scores to rank users and allocate advertising
budget from the top of the ranking downward, choosing the exact cutoff
according to the campaign's cost and value parameters discussed above.
This project demonstrates that ad click behavior can be predicted
with very high accuracy using user
context, channel, engagement, and past behavior features.
Logistic regression provides a highly interpretable baseline
(AUC = 0.973),
while the random forest achieves the best predictive performance
(AUC = 0.983).
The final model’s probabilities enable probability-based
targeting and budget allocation decisions,
with top-segment strategies illustrating how analytics
can directly support
ROI-driven marketing actions.