O risco de fraude está em toda parte, mas para as empresas que anunciam online, a fraude de cliques pode acontecer em um volume avassalador, resultando em dados de cliques enganosos e dinheiro desperdiçado. Os canais de anúncios podem aumentar os custos simplesmente quando pessoas ou bots clicam nos anúncios em grande escala, o que na prática não gera o resultado esperado. Com mais de 1 bilhão de dispositivos móveis em uso todos os meses, a China é o maior mercado móvel do mundo e, portanto, sofre com grandes volumes de tráfego fraudulento.
A TalkingData, a maior plataforma de Big Data independente da China, cobre mais de 70% dos dispositivos móveis ativos em todo o país lidam com 3 bilhões de cliques por dia, dos quais 90% são potencialmente fraudulentos. Sua abordagem atual para impedir fraudes de cliques para desenvolvedores de aplicativos é medir a jornada do clique de um usuário em todo o portfólio e sinalizar endereços IP que produzem muitos cliques, mas nunca acabam instalando aplicativos. Com essas informações, eles criaram uma lista negra de IPs e uma lista negra de dispositivos.
O objetivo aqui é criar um algoritmo que possa prever se um usuário fará o download de um aplicativo depois de clicar em um anúncio de aplicativo para dispositivos móveis.
O dataset aqui utilizado faz parte de uma das competições do kaggle e está disponível para download através do link: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
Observação: ip, app, device, os e channel possuem encoding.
Utilização da Linguagem R, Rstúdio e algoritmos de Machine Learning de metódos de classificação.
Efetua o carregamento dos pacotes que serão utilizados no projeto e os armazena na memória do Rstúdio.
pacman::p_load(tidyverse, caTools, corrplot, caret,
data.table,knitr, gridExtra,gmodels,
class,e1071,ROCR,DMwR,fasttime,lubridate,randomForest,pROC,C50,fastAdaboost)
Carrega o dataset na memória do Rstudio e transforma para o tipo tibble.
df <- as_tibble(fread("train.csv"))
glimpse(df)
## Rows: 184,903,890
## Columns: 8
## $ ip <int> 83230, 17357, 35810, 45745, 161007, 18787, 103022, ...
## $ app <int> 3, 3, 3, 14, 3, 3, 3, 3, 3, 64, 3, 3, 3, 3, 3, 3, 3...
## $ device <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ os <int> 13, 19, 13, 13, 13, 16, 23, 19, 13, 22, 25, 13, 19,...
## $ channel <int> 379, 379, 379, 478, 379, 379, 379, 379, 379, 459, 3...
## $ click_time <chr> "2017-11-06 14:32:21", "2017-11-06 14:33:34", "2017...
## $ attributed_time <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ is_attributed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
kable(head(df))
| ip | app | device | os | channel | click_time | attributed_time | is_attributed |
|---|---|---|---|---|---|---|---|
| 83230 | 3 | 1 | 13 | 379 | 2017-11-06 14:32:21 | 0 | |
| 17357 | 3 | 1 | 19 | 379 | 2017-11-06 14:33:34 | 0 | |
| 35810 | 3 | 1 | 13 | 379 | 2017-11-06 14:34:12 | 0 | |
| 45745 | 14 | 1 | 13 | 478 | 2017-11-06 14:34:52 | 0 | |
| 161007 | 3 | 1 | 13 | 379 | 2017-11-06 14:35:08 | 0 | |
| 18787 | 3 | 1 | 16 | 379 | 2017-11-06 14:36:26 | 0 |
O dataset possui variáveis numéricas e categóricas. Será realizado em seguida, o processo de data wrangling para ajuste das variáveis.
Realização de coleta aleatória de dados amostrais de uma massa de dados de 10% do dataset original.
Neste processo, também será realizado os ajustes e criação das variáveis temporais.
df <- df %>%
sample_frac(0.10) %>% # Coleta amostra aleatória de 10% dos dados.
mutate(click_time = fastPOSIXct(click_time),
hora_click = hour(click_time),
dia_click = weekdays(click_time),
is_attributed = factor(is_attributed, levels = c("0","1"), labels = c("no","yes")),
attributed_time = ifelse(attributed_time == "",NA,attributed_time),
attributed_time = fastPOSIXct(attributed_time),
hora_attributed = hour(attributed_time),
dia_attributed = weekdays(attributed_time)) %>%
select(ip,app,device,os,channel,click_time,hora_click,dia_click,
attributed_time,hora_attributed,dia_attributed,is_attributed)
p1 <- df %>% filter(is_attributed == "yes") %>% group_by(os) %>% summarize(n = n()) %>% arrange(desc(n)) %>%
head() %>%
ggplot(aes(x=reorder(os,n),y=n))+
geom_col(fill = "steelblue")+theme_minimal()+labs(x="OS",y="Count",title="Top downloads por OS")+
geom_text(aes(label = round(n / sum(n), 2)), vjust = 1.6, color = "white", size=3.5)
## `summarise()` ungrouping output (override with `.groups` argument)
p2 <- df %>% filter(is_attributed == "yes") %>% group_by(ip) %>% summarize(n = n()) %>% arrange(desc(n)) %>%
head() %>%
ggplot(aes(x=reorder(ip,n),y=n))+
geom_col(fill = "steelblue")+theme_minimal()+labs(x="IP",y="Count",title="Top downloads por IP")+
geom_text(aes(label=round(n/sum(n),2)),vjust=1.6,color="white",size = 3.5)
## `summarise()` ungrouping output (override with `.groups` argument)
p3 <- df %>% filter(is_attributed == "yes") %>% group_by(app) %>% summarize(n = n()) %>% arrange(desc(n)) %>%
head() %>%
ggplot(aes(x=reorder(app,n),y=n))+
geom_col(fill = "steelblue")+theme_minimal()+labs(x="APP",y="Count",title="Top downloads por APP")+
geom_text(aes(label=round(n/sum(n),2)),vjust=1.6,color="white",size = 3.5)
## `summarise()` ungrouping output (override with `.groups` argument)
p4 <- df %>% filter(is_attributed == "yes") %>% group_by(device) %>% summarize(n = n()) %>% arrange(desc(n)) %>%
head() %>%
ggplot(aes(x=reorder(device,n),y=n))+
geom_col(fill="steelblue")+theme_minimal()+labs(x="Device",y="Count",title="Top downloads por Device")+
geom_text(aes(label=round(n/sum(n),2)),vjust=1.6,color="white",size=3.5)
## `summarise()` ungrouping output (override with `.groups` argument)
gridExtra::grid.arrange(p1,p2,p3,p4,ncol=2,nrow=2)
30% dos downloads são efetuados por usuários que possuem aparelhos com sistema operacional da classe 19.
Os 4 maiores downloads por classe de sistema operacional correspondem a 85% do total de downloads efetuados.
74% dos downloads efetuados são de aparelhos da classe 1.
Os 2 maiores downloads por classe de aparelho correspondem a 98% do total de downloads.
options(scipen = 999)
p5 <- df %>% filter(is_attributed == "yes") %>% group_by(dia_attributed) %>%
summarize(n=n()) %>% arrange(desc(n)) %>%
ggplot(aes(x=reorder(dia_attributed,n),y=n))+
geom_col(fill="steelblue")+theme_minimal()+labs(x="Dia",y="Count",title="Taxa de downloads por dia da semana")+
geom_text(aes(label=round(n/sum(n),2)),vjust=1.6,color="white",size=3.5)
## `summarise()` ungrouping output (override with `.groups` argument)
p6 <- df %>% filter(is_attributed == "no") %>% group_by(dia_click) %>% summarize(n=n()) %>% arrange(desc(n)) %>%
ggplot(aes(x=reorder(dia_click,n),y=n))+
geom_col(fill="steelblue")+theme_minimal()+labs(x="Dia",y="Count",title="Taxa de clicks por dia da semana")+
geom_text(aes(label=round(n/sum(n),2)),vjust=1.6,color="white",size=3.5)
## `summarise()` ungrouping output (override with `.groups` argument)
gridExtra::grid.arrange(p5,p6,nrow=2)
Os downloads quando efetuados pelos usuários, seguem o mesmo padrão de dias. Ou seja, quando o usuário clica na aplicação, o download geralmente é realizado no mesmo dia;
Segunda-feira é o pior dia quando se fala em taxa de cliques ou downloads.
p7 <- df %>% filter(is_attributed == "no") %>% group_by(hora_click) %>%
summarize(n=n()) %>% arrange(desc(n)) %>%
ggplot(aes(x=hora_click,y=n))+
geom_col(fill="steelblue")+theme_minimal()+labs(x="hora",y="Count", title="Clicks por hora")
## `summarise()` ungrouping output (override with `.groups` argument)
p8 <- df %>% filter(is_attributed == "yes") %>%
group_by(hora_attributed) %>%
summarize(n=n()) %>% arrange(desc(n)) %>%
ggplot(aes(x=hora_attributed,y=n))+
geom_col(fill="steelblue")+theme_minimal()+labs(x="hora",y="Count",title="Download por hora")
## `summarise()` ungrouping output (override with `.groups` argument)
gridExtra::grid.arrange(p7,p8,nrow=2)
Construção de modelos de aprendizagem de máquina de classificação para a variável is_attributed (variável que indica se o download foi ou não realizado após o clique no anúncio).
Faz a remoção dos objetos carregados anteriormente na memória do Rstudio.
rm(list = ls(all.names = TRUE));gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2646433 141.4 4596620 245.5 4596620 245.5
## Vcells 96633737 737.3 1527913692 11657.1 1717604735 13104.3
pacman::p_load(tidyverse, caTools, corrplot, caret, data.table,knitr, gridExtra,gmodels,
class,e1071,ROCR,DMwR,fasttime,lubridate,xgboost)
df <- as_tibble(fread("train.csv"))
table(df$is_attributed)
##
## 0 1
## 184447044 456846
df2 <- df %>% dplyr::filter(is_attributed == 1) %>% sample_n(456846);df2
df <- df %>% dplyr::filter(is_attributed == 0) %>% sample_n(456846);df
df <- rbind(df,df2)
rm(df2)
Como há grande variação nos dados quando olhamos a taxa de cliques x taxa de downloads realizados, se faz necessário utilizar uma técnica de balanceamento de classes.
Isso evita que o modelo se torne pouco genérico ou venha a ter problemas de instabilidade.
Pré-processamento dos dados para inicialização da construção do modelo preditivo.
df <- df %>%
mutate(click_time = fastPOSIXct(click_time),
hora_click = hour(click_time),
dia_click = day(click_time),
is_attributed = factor(is_attributed, levels = c("0","1"), labels = c("no","yes"))) %>%
dplyr::select(ip,app,device,os,channel,hora_click,dia_click,is_attributed)
## Salva os dados ajustados
write.csv(df, "dados_ajustados.csv")
Sample de 70% para dados de treino e 30% para dados de teste.
treino <- df %>% sample_frac(0.7);treino
teste <- df %>% sample_frac(0.3);teste
prop.table(table(treino$is_attributed))
##
## no yes
## 0.5001329 0.4998671
prop.table(table(teste$is_attributed))
##
## no yes
## 0.4996972 0.5003028
Dados balanceados.
set.seed(123)
naive1 <- naiveBayes(x=treino[-8],y=treino$is_attributed)
previsao_naive <- predict(naive1,teste[-8])
conf.matrix <- table(data =previsao_naive,reference = teste$is_attributed)
confusionMatrix(conf.matrix)
## Confusion Matrix and Statistics
##
## reference
## data no yes
## no 122933 55109
## yes 14038 82028
##
## Accuracy : 0.7477
## 95% CI : (0.7461, 0.7494)
## No Information Rate : 0.5003
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4956
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.8975
## Specificity : 0.5981
## Pos Pred Value : 0.6905
## Neg Pred Value : 0.8539
## Prevalence : 0.4997
## Detection Rate : 0.4485
## Detection Prevalence : 0.6495
## Balanced Accuracy : 0.7478
##
## 'Positive' Class : no
##
set.seed(123)
controle <- trainControl(method = "repeatedcv",number = 10,repeats = 2)
regr1 <- train(is_attributed~.,data=treino,method="glm",trControl=controle)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
importance <- varImp(regr1,scale=FALSE);plot(importance)
previsoes_reg <- predict(regr1,teste[-8])
conf.matrix <- table(data = previsoes_reg, reference = teste$is_attributed)
confusionMatrix(conf.matrix)
## Confusion Matrix and Statistics
##
## reference
## data no yes
## no 110527 35153
## yes 26444 101984
##
## Accuracy : 0.7753
## 95% CI : (0.7737, 0.7768)
## No Information Rate : 0.5003
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5506
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.8069
## Specificity : 0.7437
## Pos Pred Value : 0.7587
## Neg Pred Value : 0.7941
## Prevalence : 0.4997
## Detection Rate : 0.4032
## Detection Prevalence : 0.5315
## Balanced Accuracy : 0.7753
##
## 'Positive' Class : no
##
treino2 <- treino %>% select(-device,-os)
teste2 <- teste %>% select(-device,-os)
controle <- trainControl(method = "repeatedcv",number = 10,repeats = 2)
regr2 <- train(is_attributed~.,data=treino2,method="glm",trControl=controle)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
previsoes_reg2 <- predict(regr2,teste2[-6])
conf.matrix <- table(data = previsoes_reg2,reference = teste2$is_attributed)
confusionMatrix(conf.matrix)
## Confusion Matrix and Statistics
##
## reference
## data no yes
## no 110447 35095
## yes 26524 102042
##
## Accuracy : 0.7752
## 95% CI : (0.7736, 0.7768)
## No Information Rate : 0.5003
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5504
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.8064
## Specificity : 0.7441
## Pos Pred Value : 0.7589
## Neg Pred Value : 0.7937
## Prevalence : 0.4997
## Detection Rate : 0.4029
## Detection Prevalence : 0.5310
## Balanced Accuracy : 0.7752
##
## 'Positive' Class : no
##
set.seed(123)
treino3 <- treino %>% mutate(is_attributed = factor(is_attributed, levels = c("no","yes"), labels = c("0","1")))
teste3 <- teste %>% mutate(is_attributed = factor(is_attributed, levels = c("no","yes"), labels = c("0","1")))
model_xgboost <- xgboost(
data = as.matrix(treino3 %>% select(-is_attributed)),
label = as.matrix(treino3$is_attributed),
max.depth = 20,
eta = 1,
nthread = 4,
nrounds = 100,
objective = "binary:logistic",
verbose = F
)
previsoes <- predict(model_xgboost,as.matrix(teste3 %>% select(-is_attributed)));length(previsoes);head(previsoes)
## [1] 274108
## [1] 0.99999821186 0.08277809620 0.02115340158 0.01214443520 0.00003448338
## [6] 0.99988245964
previsoes <- as.numeric(previsoes>0.5);head(previsoes)
## [1] 1 0 0 0 0 1
erros <- mean(previsoes!=teste3$is_attributed)
print(erros)
## [1] 0.02780291
confusionMatrix(table(data = previsoes, reference = teste3$is_attributed))
## Confusion Matrix and Statistics
##
## reference
## data 0 1
## 0 133933 4583
## 1 3038 132554
##
## Accuracy : 0.9722
## 95% CI : (0.9716, 0.9728)
## No Information Rate : 0.5003
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9444
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.9778
## Specificity : 0.9666
## Pos Pred Value : 0.9669
## Neg Pred Value : 0.9776
## Prevalence : 0.4997
## Detection Rate : 0.4886
## Detection Prevalence : 0.5053
## Balanced Accuracy : 0.9722
##
## 'Positive' Class : 0
##
O modelo 4 com XGBoost foi de longe o que apresentou melhor desempenho, com 97% de acurácia e com baixissímo número de falsos postivos e falsos negativos.
Carregamento de um novo dataset para realização da previsão utilizando o modelo cujo teve a melhor performance.
df <- as_tibble(fread("test.csv"));glimpse(df)
## Rows: 18,790,469
## Columns: 7
## $ click_id <int> 0, 1, 2, 3, 4, 5, 6, 7, 9, 8, 10, 11, 12, 13, 14, 15, 16...
## $ ip <int> 5744, 119901, 72287, 78477, 123080, 110769, 12540, 88637...
## $ app <int> 9, 9, 21, 15, 12, 18, 3, 27, 18, 12, 26, 3, 12, 9, 12, 1...
## $ device <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ os <int> 3, 3, 19, 13, 13, 13, 1, 19, 10, 53, 19, 19, 13, 13, 25,...
## $ channel <int> 107, 466, 128, 111, 328, 107, 137, 153, 107, 424, 477, 1...
## $ click_time <chr> "2017-11-10 04:00:00", "2017-11-10 04:00:00", "2017-11-1...
df <- df %>%
mutate(click_time = fastPOSIXct(click_time),
hora_click = hour(click_time),
dia_click = day(click_time)) %>%
dplyr::select(ip,app,device,os,channel,hora_click,dia_click);glimpse(df)
## Rows: 18,790,469
## Columns: 7
## $ ip <int> 5744, 119901, 72287, 78477, 123080, 110769, 12540, 88637...
## $ app <int> 9, 9, 21, 15, 12, 18, 3, 27, 18, 12, 26, 3, 12, 9, 12, 1...
## $ device <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ os <int> 3, 3, 19, 13, 13, 13, 1, 19, 10, 53, 19, 19, 13, 13, 25,...
## $ channel <int> 107, 466, 128, 111, 328, 107, 137, 153, 107, 424, 477, 1...
## $ hora_click <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ dia_click <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, ...
previsoes <- predict(model_xgboost,as.matrix(df));length(previsoes)
## [1] 18790469
previsoes <- as.numeric(previsoes>0.5)
Fim!
sandropenha.com