Online advertising companies deal with huge volumes of ad clicks, some of which are fraudulent. This project investigates users' click behavior and download rate, with the goal of predicting whether a user will download an app after clicking a mobile app ad. Examining the download events and click rates makes it clear that downloads are very rare and the data is highly imbalanced: users mostly click the ads but do not download the app afterwards. The project highlights the importance of exploratory data analysis, feature engineering, and the use of algorithms suited to imbalanced data.
Click fraud is a type of fraud that occurs on the Internet in pay-per-click (PPC) online advertising. In this setting there are three parties: the advertising commissioner, the advertisers, and the content publishers. The commissioner acts as a broker between the advertisers and the content publishers. An advertiser plans a budget and agrees with the commissioner on a commission for every customer click on an ad. A content publisher then contracts with the commissioner to display advertisements on its websites and earns commissions based on the traffic it drives to the advertisers. This model can tempt publishers to generate fraudulent clicks, which in turn generate more income; this major issue is known as click fraud. Moreover, in order to weaken their competitors, some companies use software that simply clicks on competitors' ads, depriving them of the opportunity to show their products to their target audience once their limited budget is exhausted. Thus, a reliable click fraud detection system is needed to help commissioners proactively prevent click fraud and assure advertisers that their dollars have been well spent.
TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio and flag IP addresses that produce many clicks but never end up installing apps. With this information, they have built an IP blacklist and a device blacklist.
While this approach is successful, they always want to be one step ahead of fraudsters. The challenge is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
The raw data supplied by TalkingData was collected over 4 days (Monday to Thursday) and consists of 8 variables and about 185 million rows. TalkingData also supplied a sampled CSV file with 100K rows, which is used in this project as the full data set. The training set contains 80,000 observations; the split was stratified on the response variable (is_attributed), meaning that 80% of the non-event observations and 80% of the event observations were sampled into the training set. The validation set contains the remaining 20,000 observations. In addition, there is a CSV file with 1 million observations from the first day only (Monday) that is held out and used as a test set.
| Variable | Description |
|---|---|
| ip | ip address of click. |
| app | app id for marketing. |
| device | device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.) |
| os | os version id of user mobile phone |
| channel | channel id of mobile ad publisher |
| click_time | timestamp of click (UTC) |
| attributed_time | if the user downloaded the app after clicking an ad, this is the time of the app download |
| is_attributed | the target to be predicted, indicating whether the app was downloaded |
The data set is highly imbalanced, with only 227 download events, so evaluation has to concentrate on how well the observations with an event are handled.
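The sampling scheme above can be reproduced with a stratified split. The sketch below assumes pandas and scikit-learn and a file named `train_sample.csv`; the file name is an assumption, not something specified in the text.

```python
# Minimal sketch: load the 100K-row sample and reproduce the stratified 80/20 split.
import pandas as pd
from sklearn.model_selection import train_test_split

clicks = pd.read_csv("train_sample.csv", parse_dates=["click_time"])

# Stratify on the response so that ~80% of both events and non-events
# land in the training set, as described above.
train, valid = train_test_split(
    clicks,
    test_size=0.2,
    stratify=clicks["is_attributed"],
    random_state=42,
)

print(train["is_attributed"].value_counts())  # illustrates how rare the events are
```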
First, I extracted the day and hour of each click from the click_time variable. The summary in Table 2 quickly shows that observations from certain groups are repeated far more often than others, so it is worth finding groups that appear frequently but have no downloads at all. Such groups can indicate potential click fraud: for example, app number 12 appears 10,535 times in the data set, yet none of its observations resulted in a download. A sketch of this aggregation appears after the table.
| app | downloads (countP) | clicks (count) | download rate (relcount) |
|---|---|---|---|
| 18 | 3 | 6659 | 0.0004505 |
| 3 | 3 | 14628 | 0.0002051 |
| 15 | 1 | 6902 | 0.0001449 |
| 1 | 0 | 2458 | 0.0000000 |
| 2 | 0 | 9396 | 0.0000000 |
| 4 | 0 | 46 | 0.0000000 |
| 6 | 0 | 1048 | 0.0000000 |
| 7 | 0 | 771 | 0.0000000 |
| 12 | 0 | 10535 | 0.0000000 |
| 13 | 0 | 1953 | 0.0000000 |
| 14 | 0 | 4318 | 0.0000000 |
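Continuing the sketch above, the per-app summary behind Table 2 can be built with a simple group-by; the column names `downloads`, `clicks` and `rate` are illustrative stand-ins for countP, count and relcount.

```python
# Extract day and hour from the click timestamp, then summarize downloads per app.
train["hour"] = train["click_time"].dt.hour
train["day"] = train["click_time"].dt.day_name()

app_summary = (
    train.groupby("app")["is_attributed"]
    .agg(downloads="sum", clicks="count")                 # countP and count in Table 2
    .assign(rate=lambda d: d["downloads"] / d["clicks"])  # relcount in Table 2
    .sort_values("clicks", ascending=False)
)
print(app_summary.head(10))
```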
Using this information, we can see there is a lot of noise in the data, so it is worthwhile to label observations that come from groups with no downloads. In that way, information from the whole data set is summarized into each observation. This was done for each of the following variables: ip, app, device, os, channel, and hour.
| | ip | app | device | os | channel | is_attributed | newdevice | newip | hour | day | newhour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 70837 | 89078 | 18 | 1 | 19 | 134 | 0 | 1 | 0 | 3 | Wed | 1 |
| 26512 | 93192 | 3 | 1 | 13 | 280 | 0 | 1 | 0 | 11 | Wed | 1 |
| 59440 | 34684 | 11 | 1 | 17 | 319 | 0 | 1 | 0 | 12 | Wed | 1 |
| 48146 | 178851 | 2 | 1 | 13 | 205 | 0 | 1 | 0 | 8 | Tue | 1 |
| 26513 | 68591 | 18 | 1 | 14 | 439 | 0 | 1 | 0 | 6 | Wed | 1 |
For example, without looking at the is_attributed variable, the newip value of 0 for the first observation tells us that its ip belongs to a group with no downloads, so we can immediately determine that this observation will not download the app. The same procedure was applied to the validation and test sets, using the frequency tables computed from the training set.
The new training and validation sets contain six additional variables.
| New variable | Description |
|---|---|
| newip | 0 if the observation's ip belongs to a group with no downloads, 1 otherwise |
| newapp | 0 if the observation's app belongs to a group with no downloads, 1 otherwise |
| newdevice | 0 if the observation's device belongs to a group with no downloads, 1 otherwise |
| newos | 0 if the observation's os belongs to a group with no downloads, 1 otherwise |
| newchannel | 0 if the observation's channel belongs to a group with no downloads, 1 otherwise |
| newhour | 0 if the observation's click hour belongs to a group with no downloads, 1 otherwise |
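A minimal sketch of how these flags could be constructed, continuing from the frames above. Groups with zero downloads in the training set are flagged 0, everything else 1; mapping groups unseen in the validation set to 1 is an assumption, since the text does not say how such cases are handled.

```python
# Build the "newX" flags from training-set group statistics and apply them
# to the validation set using the same (training-derived) mapping.
valid["hour"] = valid["click_time"].dt.hour

flag_vars = ["ip", "app", "device", "os", "channel", "hour"]
for col in flag_vars:
    # 1 if the group had at least one download in the training set, else 0
    has_download = train.groupby(col)["is_attributed"].max().astype(int)
    train["new" + col] = train[col].map(has_download)
    valid["new" + col] = valid[col].map(has_download).fillna(1).astype(int)  # unseen groups -> 1 (assumption)
```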
The algorithms used in this project are simple decision trees, random forests, and boosting. Random forests and boosting outperformed the simple decision tree, so the latter is not presented here.
A random forest is an improvement over bagged trees that decorrelates the individual trees. The algorithm builds a number of decision trees on bootstrapped training samples, but each time a split is considered, only a random sample of the predictors is considered as split candidates. In other words, the algorithm does not consider all of the predictors at each split. Unlike in bagging, the trees therefore look different from one another, since most or all of them do not use the strongest predictor in the top split. This matters because averaging many highly correlated quantities does not reduce variance as much as averaging many uncorrelated quantities (G. James, D. Witten, T. Hastie and R. Tibshirani 2017); by decorrelating the trees, the random forest averages over less correlated, more reliable trees.
The variables used in the final random forest model are: newchannel, newdevice, newos, newapp, app, device, os, channel, hour, newhour.
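A minimal random forest sketch using scikit-learn, under the simplifying assumption that the id columns are used as numeric codes; the hyperparameters and the class_weight setting are illustrative, not the values tuned in the project.

```python
# Illustrative random forest on the engineered variables listed above.
from sklearn.ensemble import RandomForestClassifier

rf_features = ["newchannel", "newdevice", "newos", "newapp",
               "app", "device", "os", "channel", "hour", "newhour"]

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",       # random subset of predictors at each split
    class_weight="balanced",   # one common way to handle the heavy class imbalance (assumption)
    n_jobs=-1,
    random_state=42,
)
rf.fit(train[rf_features], train["is_attributed"])

# Variable importance, discussed in the results section
for name, imp in sorted(zip(rf_features, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```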
Boosting is a machine learning technique for classification and regression problems that produces a prediction model as an ensemble of weak prediction models, typically decision trees. It builds an additive model in a forward stage-wise fashion and allows the optimization of arbitrary differentiable loss functions. At each stage, classification trees are fit to the negative gradient of the binomial deviance loss function. When using boosting, several parameters must be tuned until the model stabilizes and performs well.
The variables used in the final boosting model are: newdevice, newapp, app, device, os, channel, hour.
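A corresponding boosting sketch using scikit-learn's GradientBoostingClassifier, whose default log-loss objective fits trees to the negative gradient of the binomial deviance as described above; the hyperparameter values are placeholders, not the tuned ones.

```python
# Illustrative gradient boosting model on the variables listed above.
from sklearn.ensemble import GradientBoostingClassifier

gbm_features = ["newdevice", "newapp", "app", "device", "os", "channel", "hour"]

gbm = GradientBoostingClassifier(
    n_estimators=300,    # placeholder values, not the project's tuned parameters
    learning_rate=0.05,
    max_depth=3,
    random_state=42,
)
gbm.fit(train[gbm_features], train["is_attributed"])
```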
app and channel appear to be the most important variables, ranking highest in both mean decrease in accuracy and mean decrease in node impurity. Although newhour is the least important variable, omitting it from the model reduces the performance measures.
For evaluation I relied mainly on AUC, since it reflects the overall performance of the model across thresholds and, unlike accuracy, is well suited to a highly imbalanced problem.
Random forest confusion matrix on the validation set (rows: predicted class, columns: actual class):

| | 0 | 1 |
|---|---|---|
| 0 | 19609 | 5 |
| 1 | 346 | 40 |
Boosting confusion matrix on the validation set (rows: predicted class, columns: actual class):

| | 0 | 1 |
|---|---|---|
| 0 | 19731 | 8 |
| 1 | 224 | 37 |
Performance on the validation set:

| | Random Forest | Boosting |
|---|---|---|
| Sensitivity | 0.8888889 | 0.8222222 |
| Specificity | 0.9826610 | 0.9887747 |
| Precision | 0.1036269 | 0.1417625 |
| Recall | 0.8888889 | 0.8222222 |
| AUC | 0.9357749 | 0.9054985 |
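A sketch of how the validation metrics above could be computed from the fitted models in the earlier sketches; the 0.5 classification threshold is an assumption, since the threshold used in the project is not stated.

```python
# Compute sensitivity, specificity, precision and AUC on the validation set.
from sklearn.metrics import confusion_matrix, roc_auc_score

def report(model, features, threshold=0.5):
    prob = model.predict_proba(valid[features])[:, 1]
    pred = (prob >= threshold).astype(int)       # threshold is illustrative
    tn, fp, fn, tp = confusion_matrix(valid["is_attributed"], pred).ravel()
    print("sensitivity:", tp / (tp + fn))
    print("specificity:", tn / (tn + fp))
    print("precision:  ", tp / (tp + fp))
    print("AUC:        ", roc_auc_score(valid["is_attributed"], prob))

report(rf, rf_features)
report(gbm, gbm_features)
```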
We can see that random forest has better recall but worse precision than boosting. Recall and precision usually trade off against each other, so these results are not surprising. On the validation set, random forest clearly outperformed boosting with respect to AUC. Building a model that performs well on highly imbalanced data is a difficult task; both models fit the data well, mainly thanks to the data preprocessing and analysis performed before modeling. Now let us try both models on the test set with 1 million rows.
Performance on the test set (1 million rows):

| | Random Forest | Boosting |
|---|---|---|
| Sensitivity | 0.6580035 | 0.6633196 |
| Specificity | 0.9929290 | 0.9865492 |
| Precision | 0.1363025 | 0.0771768 |
| Recall | 0.6580035 | 0.6633196 |
| AUC | 0.8254663 | 0.8249344 |
There is a large drop in performance for both models, mainly in AUC. This is probably not a failure of the models themselves but a consequence of the size of the data set: the 100K-row sample is only about 0.05% of the original 185 million rows, so the fitted models miss a great deal of information and do not generalize well to the full data. Using a larger data set when building the models would probably overcome this problem, but due to computational limitations I was not able to use the full data set.
Tree-based algorithms performed well on this type of data set, while algorithms like SVMs and neural networks were not computationally feasible. Although it was not the goal of the project, flagging observations as suspected fraudulent clicks helped the algorithms predict whether a user will eventually download the app after clicking the mobile ad.
Journal of Machine Learning Research 15 (2014) 99-140.
G. James, D. Witten, T. Hastie and R. Tibshirani (2017). An Introduction to Statistical Learning. Springer. Pages 319-321.