A large number of users come from the same IP, yet they generate very little activity.
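As a rough illustration of that pattern, the sketch below flags IPs that carry many distinct devices but almost no activity. It runs on invented toy data whose column names mirror the file loaded below. The cut-offs `min_devices` and `max_impre` are hypothetical, and treating `impre` as an activity count is an assumption.

```r
library(dplyr)

# Toy data: column names mirror ip0911, values are invented for illustration.
toy <- data.frame(
  ip        = c("1.1.1.1", "1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"),
  device_id = c("a", "b", "c", "d", "e"),
  impre     = c(0L, 1L, 0L, 12L, 7L),
  stringsAsFactors = FALSE
)

# Hypothetical cut-offs, chosen only for this example.
min_devices <- 3
max_impre   <- 1

suspicious <- toy %>%
  group_by(ip) %>%
  summarise(
    devices     = n_distinct(device_id),  # distinct devices seen on this IP
    total_impre = sum(impre)              # total activity recorded on this IP
  ) %>%
  filter(devices >= min_devices, total_impre <= max_impre)

suspicious
```

Both cut-offs are placeholders and would need tuning against real traffic.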
library(readr)
ip0911 <- read_csv('/Users/milin/Downloads/ip0911.csv')
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## network_name = col_character(),
## device_id = col_character(),
## ip = col_character(),
## impre = col_integer(),
## times = col_integer(),
## score = col_double()
## )
ip0911
library(tidyverse)
## ─ Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.1 ✔ stringr 1.3.0
## ✔ ggplot2 2.2.1 ✔ forcats 0.3.0
## ─ Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ip09110 <- filter(ip0911,impre<=1)
ggplot(data = ip09110,aes(times))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(ip09110$times)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 3.25 5.00 105.00
Split the data by channel (network_name) and look at how often each IP appears.
ggplot(data = ip09110,aes(times))+geom_histogram()+facet_wrap(~network_name,nrow = 4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
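ip0911 %>%
  group_by(network_name) %>%
  summarise(
    # sum() counts the TRUE values and gives 0 when none match; table(...)[2] would give NA
    qu9 = sum(times >= quantile(ip0911$times, 0.9), na.rm = TRUE),  # overall 90th percentile
    qu3 = sum(times >= 4, na.rm = TRUE),
    qu2 = sum(times >= 2.893, na.rm = TRUE),
    n = n()
  ) %>%
  mutate(pre9 = qu9 / n, pre3 = qu3 / n, pre2 = qu2 / n)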
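qu9 is based on the 90th percentile (the ninth decile) of times. How many occurrences should count as anomalous has to be decided in light of the business context.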
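One way to support that decision is to see how the flagged share changes as the cut-off moves. The sketch below uses invented counts rather than the real `times` column, and the candidate probabilities in `probs` are arbitrary choices for illustration.

```r
# Toy occurrence counts; the values are invented for illustration.
times <- c(1, 1, 1, 2, 2, 3, 5, 5, 8, 105)

# Candidate cut-offs taken from quantiles of the counts.
probs   <- c(0.5, 0.9, 0.99)
cutoffs <- quantile(times, probs)

# Share of rows at or above each cut-off.
flagged_share <- sapply(cutoffs, function(k) mean(times >= k))

data.frame(
  prob          = probs,
  cutoff        = unname(cutoffs),
  flagged_share = unname(flagged_share)
)
```

With a heavy right tail like this one, the higher quantiles are pulled towards the single extreme value, so it can help to compare quantile-based cut-offs with fixed ones before settling on a rule.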