Abstract

Online advertising companies handle enormous volumes of ad clicks, some of which are fraudulent. This project investigates users' click behavior and download rates, with the goal of predicting whether a user will download an app after clicking a mobile app ad. This was done by examining download events alongside click activity. The examination made it clear that download events are very rare and the data are highly imbalanced: users mostly click ads but do not download the app afterwards. The project highlights the importance of exploratory data analysis, feature engineering, and the use of algorithms appropriate for imbalanced data.

Introduction

Click fraud is a type of fraud that occurs on the Internet in pay-per-click (PPC) online advertising. In this setting there are three parties: the advertising commissioner, the advertisers, and the content publishers. The commissioner acts as a broker between the advertisers and the content publishers. An advertiser plans a budget and agrees with the commissioner on a commission for every customer click on an ad. A content publisher then contracts with the commissioner to display advertisements on its websites, and receives commissions based on the traffic it drives to the advertisers. This model may tempt publishers to generate fraudulent clicks, which in turn generate more income; this is a major issue known as click fraud. Moreover, in order to eliminate their competitors, companies use special software to click on competitors' ads, thereby depriving them of the opportunity to show their products to their target audience once their limited budget is exhausted. Thus, a reliable click fraud detection system is needed to help commissioners proactively prevent click fraud and assure advertisers that their dollars have been well spent.

TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio and flag IP addresses that produce many clicks but never end up installing apps. With this information, they have built an IP blacklist and a device blacklist.

While this approach has been successful, they always want to be one step ahead of the fraudsters. The challenge is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.

Data

The raw data supplied by TalkingData were collected over 4 days (Monday through Thursday) and consist of 8 variables and 185 million rows. TalkingData also supplied a sampled training CSV file with 100K rows, which is used in this project as the full data set. The training set contains 80,000 observations, and the split was stratified with respect to the response variable (is_attributed): 80% of the non-event observations and 80% of the event observations were sampled into the training set. The validation set contains the remaining 20,000 observations. In addition, there is a CSV file with 1 million observations, taken only from the first day (Monday), which is held out and used as a test set.
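As a rough sketch of the splitting step, assuming Python with pandas and scikit-learn and a sampled file named train_sample.csv (the report does not state the original tooling or file name), the stratified 80/20 split could look like this:

    # Stratified 80/20 train/validation split of the sampled data. The file name
    # train_sample.csv is an assumption; stratify keeps the event/non-event
    # proportions equal in both splits.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("train_sample.csv", parse_dates=["click_time"])

    train, valid = train_test_split(
        data,
        test_size=0.20,
        stratify=data["is_attributed"],
        random_state=42,
    )

    # The event rate should now be (almost) identical in both splits.
    print(train["is_attributed"].mean(), valid["is_attributed"].mean())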

Fields in the data set
Variable Description
ip ip address of click.
app app id for marketing.
device device type id of the user's mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
os os version id of the user's mobile phone
channel channel id of the mobile ad publisher
click_time timestamp of the click (UTC)
attributed_time if the user downloads the app after clicking an ad, this is the time of the download
is_attributed the target to be predicted, indicating whether the app was downloaded

The data set is highly imbalanced, with only 227 events, which means that evaluation should concentrate on the observations with an event.
First, I extracted the day and hour of each click from the click_time variable. Then, by looking at frequency tables such as the one below, we can quickly see that observations from certain groups are repeated much more often than others, so it is worth finding which groups appear frequently but have no downloads at all. Such groups can indicate potential fraud clicks: for example, app number 12 appears 10535 times in the data set, yet none of these observations resulted in a download.

Partial frequency table for app (countP = number of downloads, count = number of clicks, relcount = download rate)
app countP count relcount
18 3 6659 0.0004505
3 3 14628 0.0002051
15 1 6902 0.0001449
1 0 2458 0.0000000
2 0 9396 0.0000000
4 0 46 0.0000000
6 0 1048 0.0000000
7 0 771 0.0000000
12 0 10535 0.0000000
13 0 1953 0.0000000
14 0 4318 0.0000000
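A frequency table like the one above can be reproduced along the following lines; this is a sketch rather than the project's original code, and it reuses the hypothetical train_sample.csv file from the previous snippet:

    # Per-app download counts, click counts and download rate, computed on the
    # training split only. countP = downloads, count = clicks, relcount = rate.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("train_sample.csv", parse_dates=["click_time"])
    data["hour"] = data["click_time"].dt.hour       # hour of the click
    data["day"] = data["click_time"].dt.day_name()  # day of the click
    train, valid = train_test_split(
        data, test_size=0.20, stratify=data["is_attributed"], random_state=42
    )

    app_freq = (
        train.groupby("app")["is_attributed"]
        .agg(countP="sum", count="size")
        .assign(relcount=lambda t: t["countP"] / t["count"])
        .sort_values("relcount", ascending=False)
    )
    print(app_freq.head(15))

The same group-by can be repeated for ip, device, os, channel and hour to obtain the other frequency tables used below.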

Using this information, we can see there is a lot of noise in the data, so it is worthwhile to label observations that come from groups with no downloads. That way, information from the whole data set is summarized into each observation. This was done for each of the following variables: ip, app, device, os, channel, hour.

Partial training set (the first column is the row index of the observation)
ip app device os channel is_attributed newdevice newip hour day newhour
70837 89078 18 1 19 134 0 1 0 3 Wed 1
26512 93192 3 1 13 280 0 1 0 11 Wed 1
59440 34684 11 1 17 319 0 1 0 12 Wed 1
48146 178851 2 1 13 205 0 1 0 8 Tue 1
26513 68591 18 1 14 439 0 1 0 6 Wed 1

Now, for example, without looking at the is_attributed variable, we can see from the newip variable that the first observation belongs to an ip group with no downloads, so we can immediately determine that this observation will not result in a download. The same procedure was applied to the validation and test sets, using the frequency tables computed from the training set.

The new training and validation sets contain six additional variables, described below; a sketch of how they can be constructed follows the table.

New fields in the data set
New variable Description
newip 0 if the observation belongs to an ip with no downloads, 1 otherwise
newapp 0 if the observation belongs to an app with no downloads, 1 otherwise
newdevice 0 if the observation belongs to a device with no downloads, 1 otherwise
newos 0 if the observation belongs to an os with no downloads, 1 otherwise
newchannel 0 if the observation belongs to a channel with no downloads, 1 otherwise
newhour 0 if the observation belongs to an hour with no downloads, 1 otherwise
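A sketch of how these indicator features can be built; the helper names, and the handling of levels that never appear in the training set (here defaulted to 1), are assumptions, since the original implementation is not shown in this report:

    # Build new<col> flags from training-set frequency tables and apply them to
    # any other split. Helper names are illustrative; levels unseen in training
    # default to 1 ("not known to be download-free"), which is an assumption.
    GROUP_COLS = ["ip", "app", "device", "os", "channel", "hour"]

    def fit_no_download_flags(train):
        """Record, for each grouping column, the levels with zero downloads."""
        flags = {}
        for col in GROUP_COLS:
            downloads = train.groupby(col)["is_attributed"].sum()
            flags[col] = set(downloads[downloads == 0].index)
        return flags

    def add_no_download_flags(df, flags):
        """Add new<col> = 0 if the level had no downloads in training, else 1."""
        out = df.copy()
        for col, zero_levels in flags.items():
            out["new" + col] = (~out[col].isin(zero_levels)).astype(int)
        return out

    # train and valid are the splits (with hour/day added) from the earlier sketches.
    flags = fit_no_download_flags(train)
    train = add_no_download_flags(train, flags)
    valid = add_no_download_flags(valid, flags)

Because the flags are fitted on the training split only and then reused unchanged for the validation and test sets, no information leaks from those sets into the new features.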

Methods

The algorithms used in this project are simple decision trees, random forests and boosting. Random forests and boosting outperformed simple decision trees, so results for the latter are not presented here.

Random forests improve over bagged trees by decorrelating the trees. The algorithm builds a number of decision trees on bootstrapped training samples, but each time a split in a tree is considered, only a random sample of the predictors is considered as split candidates. In other words, when building a random forest, the algorithm does not consider all of the predictors at each split. Unlike bagging, each decision tree therefore looks different from the others, since most or all of the trees do not use the strongest predictor in the top split. Averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities (G. James, D. Witten, T. Hastie and R. Tibshirani 2017); random forests overcome this problem by averaging over decorrelated, and hence more reliable, trees.
The final variables used for prediction are: newchannel, newdevice, newos, newapp, app, device, os, channel, hour, newhour.
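As an illustration only (the report does not state which implementation or hyperparameters were used), a random forest over these variables could be fit as follows; max_features is the knob that restricts each split to a random subset of predictors, and all values shown are placeholders:

    # Illustrative random forest fit on the final predictor list. Library choice,
    # hyperparameter values and class weighting are assumptions, not the
    # project's exact settings. train and valid come from the earlier sketches.
    from sklearn.ensemble import RandomForestClassifier

    RF_FEATURES = ["newchannel", "newdevice", "newos", "newapp",
                   "app", "device", "os", "channel", "hour", "newhour"]

    rf = RandomForestClassifier(
        n_estimators=500,
        max_features="sqrt",      # random subset of predictors at each split
        class_weight="balanced",  # counteracts the extreme class imbalance
        n_jobs=-1,
        random_state=42,
    )
    rf.fit(train[RF_FEATURES], train["is_attributed"])
    rf_scores = rf.predict_proba(valid[RF_FEATURES])[:, 1]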

Boosting is a machine learning technique for classification and regression problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Boosting builds an additive model in a forward stage-wise fashion and allows the optimization of arbitrary differentiable loss functions. In each stage, classification trees are fit to the negative gradient of the binomial deviance loss function. When using boosting, several parameters need to be tuned until the model stabilizes and performs well.
The final variables used for prediction are: newdevice, newapp, app, device, os, channel, hour.
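An analogous sketch for the boosting model; scikit-learn's GradientBoostingClassifier is used here purely for illustration (its default loss is the binomial deviance, so each stage fits trees to the negative gradient, matching the description above), and the hyperparameter values are again placeholders:

    # Illustrative gradient boosting fit; the default loss is the binomial
    # deviance, so each stage fits trees to its negative gradient.
    # Hyperparameter values are placeholders; train and valid come from the
    # earlier sketches.
    from sklearn.ensemble import GradientBoostingClassifier

    GBM_FEATURES = ["newdevice", "newapp", "app", "device", "os", "channel", "hour"]

    gbm = GradientBoostingClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=3,
        random_state=42,
    )
    gbm.fit(train[GBM_FEATURES], train["is_attributed"])
    gbm_scores = gbm.predict_proba(valid[GBM_FEATURES])[:, 1]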

Results, discussion and conclusion

app and channel appear to be the most important variables, contributing the most to accuracy and to the decrease in node impurity. Although newhour is the least important variable, omitting it from the model reduces the performance measures.
For evaluation I relied mainly on AUC, since it reflects the overall performance of the model and is better suited to this imbalanced problem.
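The reported metrics can be derived from the validation predictions along these lines; the 0.5 classification threshold is an assumption, since the report does not state the cut-off used:

    # Confusion-matrix metrics and AUC for a set of predicted scores. The 0.5
    # threshold is an assumption; valid, rf_scores and gbm_scores come from
    # the earlier sketches.
    from sklearn.metrics import confusion_matrix, roc_auc_score

    def summarize(y_true, scores, threshold=0.5):
        preds = (scores >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
        return {
            "sensitivity": tp / (tp + fn),  # recall on the event class
            "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp),
            "auc": roc_auc_score(y_true, scores),
        }

    print(summarize(valid["is_attributed"], rf_scores))   # random forest
    print(summarize(valid["is_attributed"], gbm_scores))  # boosting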

Confusion table Random Forest (rows: predicted, columns: actual)
0 1
0 19609 5
1 346 40
Confusion table Boosting (rows: predicted, columns: actual)
0 1
0 19731 8
1 224 37
Validation set performance table
Random Forest Boosting
Sensitivity 0.8888889 0.8222222
Specificity 0.9826610 0.9887747
Precision 0.1036269 0.1417625
Recall 0.8888889 0.8222222
AUC 0.9357749 0.9054985

We can see that Random Forest has better recall but worse precision than Boosting. Recall and precision usually trade off against each other, so these results are not surprising. On the validation set, Random Forest clearly outperformed Boosting with respect to AUC. Building a model that performs well on highly imbalanced data is a difficult task; both models fit the data well, mainly thanks to the data preprocessing and analysis performed before modeling. Next, both models are evaluated on the test set of 1 million rows.

Test set performance table
Random Forest Boosting
Sensitivity 0.6580035 0.6633196
Specificity 0.9929290 0.9865492
Precision 0.1363025 0.0771768
Recall 0.6580035 0.6633196
AUC 0.8254663 0.8249344

There is a large drop in performance for both models, mainly in the AUC measure. This is probably not a flaw of the models themselves but a consequence of the size of the data set: a sample of roughly 0.05% of the original data set is likely to miss a lot of information, so models built on it do not generalize well to the full data. Using a larger data set when building the models would probably mitigate this problem. Unfortunately, due to computational limitations, I was not able to use the full data set.

Tree-based algorithms performed well for this type of data set, while algorithms such as SVMs and neural networks were not computationally feasible. Although it was not the goal of the project, flagging observations as suspicious fraud clicks helped the algorithms predict whether a user will eventually download the app after clicking the mobile ad.

References

Journal of Machine Learning Research 15 (2014) 99-140.

G. James, D. Witten, T. Hastie and R. Tibshirani (2017). An Introduction to Statistical Learning, pages 319-321.