Statistics on high-frequency IPs with no impressions over three days

A large number of users come from these IPs, yet they generate very little activity.

library(readr)
ip0911 <- read_csv('/Users/milin/Downloads/ip0911.csv')
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   network_name = col_character(),
##   device_id = col_character(),
##   ip = col_character(),
##   impre = col_integer(),
##   times = col_integer(),
##   score = col_double()
## )
ip0911

View the distribution of IP occurrence counts

library(tidyverse)
## ─ Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.1     ✔ stringr 1.3.0
## ✔ ggplot2 2.2.1     ✔ forcats 0.3.0
## ─ Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
ip09110 <- filter(ip0911,impre<=1)
ggplot(data = ip09110,aes(times))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(ip09110$times)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    3.25    5.00  105.00
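
The `stat_bin()` message above asks for an explicit `binwidth`. Since `times` is an integer count with a long right tail (median 2, max 105), one bin per value is a natural choice. The sketch below uses a made-up `toy` data frame in place of `ip09110`, and the zoom range of 0 to 20 is an assumption chosen only for illustration.

```r
library(ggplot2)

# Toy data standing in for ip09110 (values are made up for illustration)
toy <- data.frame(times = c(1, 1, 1, 2, 2, 3, 5, 5, 8, 105))

# times is an integer count, so use one bin per value instead of the default 30 bins
p <- ggplot(toy, aes(times)) +
  geom_histogram(binwidth = 1) +
  coord_cartesian(xlim = c(0, 20))  # zoom in on the bulk; the tail is kept in the data, only hidden
p
```

`coord_cartesian()` only changes the visible range, so the outlying values still contribute to the binning rather than being dropped.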

Break the data down by channel (network_name) and view the IP occurrence counts

ggplot(data = ip09110,aes(times))+geom_histogram()+facet_wrap(~network_name,nrow = 4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For records with no activity, or very little activity, the more often an IP appears, the more likely the data is fake.
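
One way to act on this rule is to list the low-activity records whose IP frequency reaches a cutoff. This is a minimal sketch, not the analysis itself: `toy` is made-up data with the same column names as `ip0911`, and `times_cutoff` is a hypothetical threshold.

```r
library(dplyr)

# Toy data with the same column names as ip0911 (values are made up)
toy <- tibble(
  network_name = c("a", "a", "a", "b", "b", "b"),
  ip    = c("1.1.1.1", "1.1.1.2", "1.1.1.3", "2.2.2.1", "2.2.2.2", "2.2.2.3"),
  impre = c(0L, 1L, 5L, 0L, 0L, 3L),
  times = c(1L, 6L, 9L, 2L, 12L, 1L)
)

times_cutoff <- 5  # hypothetical threshold, not taken from the analysis

# Keep records with at most one impression whose IP appears at least times_cutoff times
suspicious <- toy %>%
  filter(impre <= 1, times >= times_cutoff) %>%
  arrange(desc(times))
suspicious
```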

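# Per channel: count the records whose IP frequency reaches each threshold.
# sum() on the logical vector replaces table(...)[2], which returns NA when
# a group has no record (or only records) at or above the threshold.
q90 <- quantile(ip0911$times, 0.9)  # overall 90th percentile of times
ip0911 %>% group_by(network_name) %>% summarise(
  qu9 = sum(times >= q90),
  qu3 = sum(times >= 4),
  qu2 = sum(times >= 2.893),
  n = n()
) %>% mutate(pre9 = qu9/n, pre3 = qu3/n, pre2 = qu2/n)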

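qu9 counts the records at or above the 90th percentile (0.9 quantile) of times. Which occurrence count should be treated as anomalous has to be decided together with the business context.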
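
To support that decision, it can help to see how the flagged share per channel moves as the cutoff changes. The sketch below is an illustration under stated assumptions: `toy` is made-up data, and the 0.5, 0.75 and 0.9 quantiles are candidate cutoffs chosen for the example, not values from the analysis.

```r
library(dplyr)

# Toy data with the same column names as ip0911 (values are made up)
toy <- tibble(
  network_name = rep(c("a", "b"), each = 5),
  times = c(1L, 1L, 2L, 3L, 10L, 1L, 4L, 6L, 8L, 20L)
)

# Candidate cutoffs: overall quantiles of times across all channels
cutoffs <- quantile(toy$times, probs = c(0.5, 0.75, 0.9))

# Share of each channel's records at or above each cutoff
share_flagged <- toy %>%
  group_by(network_name) %>%
  summarise(
    n = n(),
    share_q50 = mean(times >= cutoffs[["50%"]]),
    share_q75 = mean(times >= cutoffs[["75%"]]),
    share_q90 = mean(times >= cutoffs[["90%"]])
  )
share_flagged
```

A channel whose share stays high even at the strictest cutoff is a candidate for closer review. Where to draw the line remains a business call.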