This project develops a predictive model to estimate the probability that a user will click on an online advertisement.
The dataset used in this project is provided as an RData file
containing two objects: a training dataset (ClickTraining)
and a prediction dataset (ClickPrediction).
load("DeliveryAdClick.RData")
ClickTraining <- .standardize_target(ClickTraining)
# ClickPrediction has no target column, so no standardization is needed
# Basic checks
dim(ClickTraining)
## [1] 18000 9
dim(ClickPrediction)
## [1] 2000 8
The training dataset contains 18,000 observations and 9 variables capturing different aspects of user context, behavior, and content exposure:

- Region: the user's region of France
- Daytime: time of day, scaled to the unit interval [0, 1]
- Carrier: the user's mobile carrier (Bouygues, Free, Orange, SFR)
- Time_On_Previous_Website: time spent on the previously visited website (likely seconds)
- Weekday: day of the week
- Social_Network: the platform on which the ad is served (Facebook, Instagram, Twitter)
- Number_of_Previous_Orders: the user's number of previous orders
- Restaurant_Type: the type of restaurant advertised (e.g., Burger, French, Sushi)
- Clicks_Conversion: the target variable

The target variable is Clicks_Conversion, where 1 indicates that the user clicked on the advertisement and 0 indicates no click.
To better understand variable types and distributions, we inspect the structure and summary statistics of the training dataset.
## Rows: 18,000
## Columns: 9
## $ Region <chr> "Alsace and East France", "West France", "Pa…
## $ Daytime <dbl> 0.22325383, 0.73439324, 0.78131697, 0.310846…
## $ Carrier <chr> "SFR", "Bouygues", "SFR", "SFR", "SFR", "Bou…
## $ Time_On_Previous_Website <dbl> 8.066935, 257.567079, 1427.640826, 1606.2621…
## $ Weekday <fct> Saturday, Saturday, Wednesday, Wednesday, Su…
## $ Social_Network <chr> "Instagram", "Twitter", "Instagram", "Instag…
## $ Number_of_Previous_Orders <dbl> 3, 3, 6, 4, 2, 7, 2, 6, 4, 4, 3, 2, 2, 6, 2,…
## $ Clicks_Conversion <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,…
## $ Restaurant_Type <chr> "French", "French", "French", "Burger", "Gro…
## Region Daytime Carrier
## Length:18000 Min. :0.000121 Length:18000
## Class :character 1st Qu.:0.244788 Class :character
## Mode :character Median :0.492348 Mode :character
## Mean :0.495189
## 3rd Qu.:0.740757
## Max. :0.999992
##
## Time_On_Previous_Website Weekday Social_Network
## Min. : 5.033 Friday :2593 Length:18000
## 1st Qu.: 454.132 Monday :2695 Class :character
## Median : 902.860 Saturday :2522 Mode :character
## Mean : 902.603 Sunday :2554
## 3rd Qu.:1352.777 Thursday :2517
## Max. :1799.875 Tuesday :2599
## Wednesday:2520
## Number_of_Previous_Orders Clicks_Conversion Restaurant_Type
## Min. : 0.000 Min. :0.0000 Length:18000
## 1st Qu.: 2.000 1st Qu.:1.0000 Class :character
## Median : 3.000 Median :1.0000 Mode :character
## Mean : 3.082 Mean :0.8446
## 3rd Qu.: 4.000 3rd Qu.:1.0000
## Max. :11.000 Max. :1.0000
##
We first examine the distribution of the target variable (Clicks_Conversion) and compute the baseline click rate. This provides a reference point for evaluating predictive models.
##
## 0 1
## 2798 15202
##
## 0 1
## 0.155 0.845
## [1] 0.8445556
Interpretation. The baseline click rate is
0.8446 (≈84.5%).
Because the outcome is highly skewed toward clicks, accuracy
alone can be misleading, so we emphasize AUC
and probability-based ranking.
We ensure consistent preprocessing across training and prediction datasets to avoid factor-level mismatch:
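The preprocessing chunk itself is not reproduced in this extract. Below is a minimal sketch of the steps implied by the output that follows, assuming missing restaurant types are recoded to an explicit "Unknown" level (the column selection and loop structure are illustrative):
# Recode missing restaurant types to an explicit "Unknown" level (assumed step)
ClickTraining$Restaurant_Type[is.na(ClickTraining$Restaurant_Type)] <- "Unknown"
ClickPrediction$Restaurant_Type[is.na(ClickPrediction$Restaurant_Type)] <- "Unknown"
# Convert character predictors to factors with levels shared across both datasets
char_cols <- c("Region", "Carrier", "Social_Network", "Restaurant_Type")
for (col in char_cols) {
  shared_levels <- sort(union(unique(ClickTraining[[col]]), unique(ClickPrediction[[col]])))
  ClickTraining[[col]] <- factor(ClickTraining[[col]], levels = shared_levels)
  ClickPrediction[[col]] <- factor(ClickPrediction[[col]], levels = shared_levels)
}
# Encode the target as a factor for classification models
ClickTraining$Clicks_Conversion <- factor(ClickTraining$Clicks_Conversion, levels = c(0, 1))
str(ClickTraining)
colSums(is.na(ClickTraining))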
"Unknown"## 'data.frame': 18000 obs. of 9 variables:
## $ Region : Factor w/ 6 levels "Alsace and East France",..: 1 6 3 3 2 6 3 2 2 5 ...
## $ Daytime : num 0.223 0.734 0.781 0.311 0.91 ...
## $ Carrier : Factor w/ 4 levels "Bouygues","Free",..: 4 1 4 4 4 1 2 4 3 1 ...
## $ Time_On_Previous_Website : num 8.07 257.57 1427.64 1606.26 480.23 ...
## $ Weekday : Factor w/ 7 levels "Friday","Monday",..: 3 3 7 7 4 3 2 3 3 2 ...
## $ Social_Network : Factor w/ 3 levels "Facebook","Instagram",..: 2 3 2 2 1 2 1 3 2 1 ...
## $ Number_of_Previous_Orders: num 3 3 6 4 2 7 2 6 4 4 ...
## $ Clicks_Conversion : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
## $ Restaurant_Type : Factor w/ 6 levels "Burger","French",..: 2 2 2 1 3 2 1 4 2 4 ...
## Region Daytime Carrier
## 0 0 0
## Time_On_Previous_Website Weekday Social_Network
## 0 0 0
## Number_of_Previous_Orders Clicks_Conversion Restaurant_Type
## 0 0 0
After preprocessing, all categorical predictors are properly encoded as factors, and missing values are treated consistently. The training and prediction datasets now share the same feature space and are ready for model development.
We split the 18,000 observations into a training set (70%, 12,600 observations) used for model fitting and a test set (30%, 5,400 observations) held out for evaluation.
We evaluate model performance using predicted probabilities, confusion matrices at a 0.5 threshold, and ROC curves with AUC, which measure ranking quality independently of any single cutoff. A sketch of the split follows.
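Assuming a simple random 70/30 partition (the seed value is arbitrary and not from the original):
set.seed(123)  # assumed; the original seed is not shown
train_idx <- sample(seq_len(nrow(ClickTraining)), size = 0.7 * nrow(ClickTraining))
train_data <- ClickTraining[train_idx, ]
test_data <- ClickTraining[-train_idx, ]
dim(train_data)
dim(test_data)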
## [1] 12600 9
## [1] 5400 9
We first estimate a logistic regression model as an interpretable benchmark for predicting ad click behavior.
Because the binomial logistic regression in glm()
requires a numeric response variable coded as 0/1, we create a numeric
copy of the target variable for model fitting while keeping the original
factor version for classification tasks.
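The fitting code is not shown in this extract; below is a minimal sketch consistent with the call reported in the summary (logit_model is an assumed object name):
# Numeric 0/1 copy of the target for glm(), keeping the factor version intact
train_data_glm <- train_data
train_data_glm$y_num <- as.numeric(as.character(train_data_glm$Clicks_Conversion))
features <- setdiff(names(train_data), "Clicks_Conversion")
# Binomial logistic regression on all predictors
logit_model <- glm(y_num ~ ., family = binomial(link = "logit"),
                   data = train_data_glm[, c(features, "y_num")])
summary(logit_model)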
##
## Call:
## glm(formula = y_num ~ ., family = binomial(link = "logit"), data = train_data_glm[,
## c(features, "y_num")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.839e+00 2.808e-01 17.235 < 2e-16 ***
## RegionNorth France -1.049e-01 1.501e-01 -0.699 0.484581
## RegionParis -9.479e-02 1.494e-01 -0.635 0.525750
## RegionRest of France 8.252e-02 1.506e-01 0.548 0.583797
## RegionSouth France -2.279e-01 1.463e-01 -1.558 0.119291
## RegionWest France -2.463e-01 1.488e-01 -1.656 0.097812 .
## Daytime 5.055e+00 1.909e-01 26.483 < 2e-16 ***
## CarrierFree -5.433e+00 1.694e-01 -32.068 < 2e-16 ***
## CarrierOrange 1.068e+00 1.492e-01 7.158 8.18e-13 ***
## CarrierSFR 1.103e+00 1.443e-01 7.642 2.14e-14 ***
## Time_On_Previous_Website 2.157e-03 9.738e-05 22.148 < 2e-16 ***
## WeekdayMonday -6.415e+00 2.333e-01 -27.495 < 2e-16 ***
## WeekdaySaturday 6.985e-01 2.369e-01 2.949 0.003192 **
## WeekdaySunday -9.494e-01 2.169e-01 -4.377 1.21e-05 ***
## WeekdayThursday -1.308e+00 2.121e-01 -6.167 6.97e-10 ***
## WeekdayTuesday -6.397e+00 2.334e-01 -27.406 < 2e-16 ***
## WeekdayWednesday -1.258e+00 2.095e-01 -6.007 1.89e-09 ***
## Social_NetworkInstagram -2.266e+00 1.246e-01 -18.187 < 2e-16 ***
## Social_NetworkTwitter -2.884e+00 1.292e-01 -22.322 < 2e-16 ***
## Number_of_Previous_Orders 1.792e-01 2.854e-02 6.277 3.44e-10 ***
## Restaurant_TypeFrench 2.032e+00 1.670e-01 12.167 < 2e-16 ***
## Restaurant_TypeGroceries -1.995e+00 1.347e-01 -14.811 < 2e-16 ***
## Restaurant_TypeKebab 5.203e-02 1.390e-01 0.374 0.708147
## Restaurant_TypeSushi 2.135e+00 1.655e-01 12.901 < 2e-16 ***
## Restaurant_TypeUnknown -6.547e-01 1.759e-01 -3.723 0.000197 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10786.6 on 12599 degrees of freedom
## Residual deviance: 3580.1 on 12575 degrees of freedom
## AIC: 3630.1
##
## Number of Fisher Scoring iterations: 8
We evaluate the logistic regression model on the test set using predicted probabilities, a confusion matrix at threshold 0.5, and ROC/AUC.
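A sketch of this evaluation, assuming the pROC package and the logit_model and test_data objects from above:
library(pROC)
# Predicted click probabilities on the held-out test set
test_probs_logit <- predict(logit_model, newdata = test_data, type = "response")
# Confusion matrix at the 0.5 threshold
pred_class <- ifelse(test_probs_logit > 0.5, 1, 0)
table(Predicted = pred_class, Actual = test_data$Clicks_Conversion)
# ROC curve and AUC
roc_logit <- roc(test_data$Clicks_Conversion, test_probs_logit)
auc_logit <- as.numeric(auc(roc_logit))
auc_logit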
## Actual
## Predicted 0 1
## 0 640 132
## 1 229 4399
## [1] 0.9732049
Logistic regression achieves excellent discrimination, with AUC = 0.9732. In this marketing setting, such strong ranking ability supports probability-based targeting even when the baseline click rate is high.
To capture potential non-linear patterns and interactions among predictors, we estimate tree-based models and compare their performance with the logistic regression baseline.
We first fit a classification decision tree, which provides an interpretable set of if–then rules and can naturally capture non-linear effects and interactions.
##
## Classification tree:
## rpart(formula = as.formula(paste(target, "~", paste(features,
## collapse = " + "))), data = train_data, method = "class",
## control = rpart.control(cp = 0.01))
##
## Variables actually used in tree construction:
## [1] Carrier Daytime Restaurant_Type Social_Network
## [5] Weekday
##
## Root node error: 1929/12600 = 0.1531
##
## n= 12600
##
## CP nsplit rel error xerror xstd
## 1 0.156039 0 1.00000 1.00000 0.020953
## 2 0.030845 2 0.68792 0.68792 0.017862
## 3 0.024538 5 0.58217 0.58528 0.016620
## 4 0.010000 8 0.50855 0.51322 0.015657
We evaluate the decision tree using predicted probabilities on the test set and compute the ROC curve and AUC.
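A sketch of the tree evaluation (tree_model is the assumed name of the rpart fit above):
# Class-1 probabilities from the tree, then ROC/AUC on the test set
tree_probs <- predict(tree_model, newdata = test_data, type = "prob")[, "1"]
roc_tree <- roc(test_data$Clicks_Conversion, tree_probs)
as.numeric(auc(roc_tree))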
## [1] 0.9394472
Interpretation.
The decision tree adds flexibility relative to logistic regression by
modeling non-linearities and interactions directly.
However, its AUC (0.9394) falls short of the logistic regression
benchmark (0.9732), reflecting the higher variance typical of
single-tree models.
We fit a random forest model to improve predictive accuracy by averaging across many decision trees, which typically reduces variance and captures complex non-linear relationships.
##
## Call:
## randomForest(formula = as.formula(paste(target, "~", paste(features, collapse = " + "))), data = train_data, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.48%
## Confusion matrix:
## 0 1 class.error
## 0 1461 468 0.24261275
## 1 222 10449 0.02080405
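The AUC reported next comes from the forest's probabilities on the held-out test set; a minimal sketch (rf_model is the assumed object name):
# Class-1 probabilities on the test set, then ROC/AUC
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")[, "1"]
roc_rf <- roc(test_data$Clicks_Conversion, rf_probs)
auc_rf <- as.numeric(auc(roc_rf))
auc_rf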
## [1] 0.9832841
Interpretation.
The random forest achieves strong out-of-sample performance with an AUC
of 0.9833. The variable importance plot highlights the key
drivers of click probability, which can be translated into actionable
targeting insights (e.g., engagement signals, timing, and platform
effects).
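The importance plot referenced above is not reproduced in this extract; it can be generated along these lines:
# Mean decrease in accuracy / Gini for each predictor
varImpPlot(rf_model, main = "Random Forest Variable Importance")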
We compare the predictive performance of all models using AUC and ROC curves on the test set.
| Model | AUC |
|---|---|
| Logistic Regression | 0.9732 |
| Decision Tree | 0.9394 |
| Random Forest | 0.9833 |
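A sketch of the ROC comparison plot, assuming the roc objects computed above:
# Overlay the three test-set ROC curves
plot(roc_logit, col = "black", legacy.axes = TRUE)
plot(roc_tree, col = "blue", add = TRUE)
plot(roc_rf, col = "red", add = TRUE)
legend("bottomright",
       legend = c("Logistic (AUC 0.9732)", "Tree (AUC 0.9394)", "Random Forest (AUC 0.9833)"),
       col = c("black", "blue", "red"), lwd = 2)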
Interpretation.
All three models substantially outperform random classification,
indicating strong predictive power.
The random forest achieves the highest AUC, reflecting its ability to
capture complex non-linear relationships and interactions among
predictors.
However, the performance gain of the random forest over logistic
regression is relatively modest.
Given its strong accuracy and superior interpretability, logistic
regression remains an attractive baseline model, especially in
managerial settings where transparency and ease of explanation are
important.
With an AUC above 0.98, the predictive models can reliably rank users
by their likelihood of clicking on an advertisement.
Instead of exposing all users uniformly, managers can prioritize users
with higher predicted click probabilities, raising expected
click-through rates and avoiding wasted impressions on users who are
unlikely to respond.
This probability-based targeting strategy enables a shift from volume-driven advertising toward performance-oriented budget allocation, where marketing resources are concentrated on users with the highest expected marginal returns.
A probability threshold converts predicted click probabilities into a
binary targeting decision.
The optimal threshold depends on business objectives and cost
considerations, including the cost of serving an impression, the
expected value of a click, and the relative cost of targeting a
non-clicker versus skipping a likely clicker. A stylized calculation is
sketched below.
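The figures here are hypothetical assumptions, not values from the data: if serving an impression costs c and a click is worth v, targeting a user with predicted probability p pays off in expectation when p·v > c, i.e. when p > c/v.
# Hypothetical economics: cost per targeted impression and value per click
c_impression <- 0.10
v_click <- 0.50
c_impression / v_click  # break-even threshold: 0.2
# Realized profit on the test set across candidate thresholds
actual_click <- as.numeric(as.character(test_data$Clicks_Conversion))
thresholds <- seq(0.05, 0.95, by = 0.05)
profit <- sapply(thresholds, function(t) {
  sum((actual_click * v_click - c_impression)[rf_probs >= t])
})
thresholds[which.max(profit)]  # profit-maximizing cutoff under these assumptions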
Therefore, ROC/AUC is a particularly important evaluation metric, as it assesses the model’s ranking quality across all possible thresholds rather than locking decisions into a single cutoff point.
We score the 2,000 new users using the selected random forest model and rank them from highest to lowest predicted click probability.
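A sketch of the scoring step (ranked_users is an illustrative name):
# Score new users with the random forest and sort by predicted probability
ClickPrediction$predicted_prob <- predict(rf_model, newdata = ClickPrediction,
                                          type = "prob")[, "1"]
ranked_users <- ClickPrediction[order(-ClickPrediction$predicted_prob), ]
rownames(ranked_users) <- NULL
head(ranked_users, 10)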
## Region Daytime Carrier Time_On_Previous_Website Weekday
## 1 South France 0.7271450 Bouygues 1073.690 Wednesday
## 2 North France 0.9649928 Orange 1636.194 Wednesday
## 3 Rest of France 0.2988972 Orange 1632.236 Friday
## 4 Rest of France 0.8550760 Bouygues 1394.258 Saturday
## 5 North France 0.9042812 Orange 1262.653 Friday
## 6 Rest of France 0.1888440 Orange 1352.549 Sunday
## 7 Alsace and East France 0.9636936 SFR 1425.138 Saturday
## 8 North France 0.7608732 Orange 1707.208 Sunday
## 9 Paris 0.7688666 Bouygues 1599.074 Saturday
## 10 Rest of France 0.5145126 Orange 1464.624 Saturday
## Social_Network Number_of_Previous_Orders Restaurant_Type predicted_prob
## 1 Facebook 3 Kebab 1
## 2 Facebook 6 Sushi 1
## 3 Twitter 2 Sushi 1
## 4 Facebook 4 Sushi 1
## 5 Instagram 2 Burger 1
## 6 Facebook 6 French 1
## 7 Facebook 5 Kebab 1
## 8 Instagram 6 Sushi 1
## 9 Facebook 4 Burger 1
## 10 Facebook 5 Kebab 1
We examine the distribution of predicted click probabilities to understand how users are differentiated by the model.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0080 0.8640 0.9860 0.8548 0.9980 1.0000
## 1% 5% 10% 50% 90% 95% 99%
## 0.05198 0.16400 0.40540 0.98600 1.00000 1.00000 1.00000
Interpretation.
The predicted probabilities are heavily concentrated near 1 (Median ≈
0.986; 3rd Quartile ≈ 0.998), consistent with the high baseline click
rate (0.845). However, the lower tail still shows
meaningful separation (the bottom 10% fall below ≈0.41), which is
valuable for excluding low-return users and improving efficiency.
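The segment table below can be produced along these lines (a sketch building on ranked_users from above):
# Summarize the highest-probability targeting segments
segment_summary <- function(label, frac) {
  seg <- ranked_users[seq_len(round(frac * nrow(ranked_users))), ]
  data.frame(Segment = label, Users = nrow(seg),
             Avg_Prob = mean(seg$predicted_prob),
             Min_Prob = min(seg$predicted_prob))
}
rbind(segment_summary("Top 10%", 0.10), segment_summary("Top 20%", 0.20))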
| Segment | Users | Avg_Prob | Min_Prob |
|---|---|---|---|
| Top 10% | 200 | 1.000000 | 1.000 |
| Top 20% | 400 | 0.999895 | 0.998 |
Business application (based on model output).
The top decile of scored users (200 users) consists almost entirely of
near-certain clickers (average predicted probability 1.000), and the top
20% (400 users) still averages 0.9999. Concentrating impressions on
these segments maximizes expected clicks per impression, while the
bottom decile (predicted probabilities below roughly 0.41) is a natural
candidate for exclusion or reduced bidding.
Recommendation.
Use the random forest scores to rank users and allocate advertising
budget from the top of the ranking downward, choosing the exact cutoff
according to the campaign's cost and value parameters discussed above.
This project demonstrates that ad click behavior can be predicted
with very high accuracy using user
context, channel, engagement, and past behavior features.
Logistic regression provides a highly interpretable baseline
(AUC = 0.973),
while the random forest achieves the best predictive performance
(AUC = 0.983).
The final model’s probabilities enable probability-based
targeting and budget allocation decisions,
with top-segment strategies illustrating how analytics
can directly support
ROI-driven marketing actions.