This analysis covers the data from March 1 to March 31. The data are loaded and previewed below.
library(tidyverse)
load('/Users/milin/evidence31331.Rdata')
evidence31331 <- mydata
head(mydata)
## # A tibble: 6 x 17
## adid install_time click_time1 action1 network1
## <chr> <dttm> <dttm> <chr> <chr>
## 1 00003335-d4b3-… 2018-03-06 11:28:54 2018-03-06 11:28:54 click AppLift
## 2 00005b6c-f4db-… 2018-03-10 12:48:36 2018-03-10 12:48:36 click AppLift
## 3 000226d9-fd07-… 2018-03-13 16:49:44 2018-03-13 16:49:44 click AppLift
## 4 0002308c-8af9-… 2018-03-05 16:01:12 2018-03-05 16:01:12 click AppLift
## 5 0003733d-7c00-… 2018-03-08 17:09:22 2018-03-08 17:09:22 click AppLift
## 6 00040fec-a742-… 2018-03-06 10:33:08 2018-03-06 10:33:08 click AppLift
## # ... with 12 more variables: click_time2 <dttm>, action2 <chr>,
## # network2 <chr>, click_time3 <dttm>, action3 <chr>, network3 <chr>,
## # click_time4 <dttm>, action4 <chr>, network4 <chr>, click_time5 <dttm>,
## # action5 <chr>, network5 <chr>
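We start with descriptive statistics of the time gaps, in this order: install time minus the most recent click (click_time1), then the gap between each click and the one before it, that is click_time1 minus click_time2, click_time2 minus click_time3, click_time3 minus click_time4, and click_time4 minus click_time5.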
evidence31331 <- as_tibble(evidence31331)
# Last click to install
summary(as.numeric(evidence31331$install_time - evidence31331$click_time1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 7667 0 2544420
# On average, the install follows the last click by about 7667 s
# 2-1
summary(as.numeric(evidence31331$click_time1 - evidence31331$click_time2))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 2 7 57676 50 2543909 220113
# Mean gap between the 2nd and the 1st most recent click: about 57676 s
# 3-2
summary(as.numeric(evidence31331$click_time2 - evidence31331$click_time3))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 2 4 47855 31 2541397 266368
# 4-3
summary(as.numeric(evidence31331$click_time3 - evidence31331$click_time4))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 2 3 44466 13 2502780 282980
# 5-4
summary(as.numeric(evidence31331$click_time4 - evidence31331$click_time5))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 1 3 36151 10 2465742 290589
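The results show that the install follows the last click by 7667 seconds on average, with a maximum of 2544420 seconds. The mean says little on its own, however, because it is heavily influenced by outliers: the median and the third quartile are both 0.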
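To illustrate how a single extreme value distorts the mean, here is a minimal sketch on made-up gaps. The vector `gaps` is hypothetical and only mimics the shape of the real data (many tiny values, one very large one). The median and upper quantiles are shown as more robust alternatives.

```r
# Hypothetical click-to-install gaps in seconds: mostly tiny, one extreme value.
gaps <- c(0, 0, 0, 0, 1, 2, 3, 5, 8, 2544420)

mean_gap   <- mean(gaps)                          # pulled up by the single outlier
median_gap <- median(gaps)                        # not affected by the outlier
tail_gaps  <- quantile(gaps, c(0.5, 0.9, 0.99))   # summary that exposes the tail

print(mean_gap)
print(median_gap)
print(tail_gaps)
```

On data shaped like this, reporting the median together with a few upper quantiles describes the typical row and the tail separately, which a single mean cannot do.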
The next question is, for example: what share of the rows have a gap of 0 between the install and the last click?
# Install time minus the last action
head(sort(table(as.numeric(evidence31331$install_time - evidence31331$click_time1)),decreasing = T))
##
## 0 1 2 3 4 5
## 246937 3858 2649 1951 1515 1109
# Last action minus the second-to-last action
head(sort(table(as.numeric(evidence31331$click_time1 - evidence31331$click_time2)),decreasing = T))
##
## 1 2 3 4 5 0
## 10087 9836 6353 4386 3220 2460
# 3-2
head(sort(table(as.numeric(evidence31331$click_time2 - evidence31331$click_time3)),decreasing = T))
##
## 1 2 3 4 0 5
## 5259 4627 2911 2073 1465 1359
# 4-3
head(sort(table(as.numeric(evidence31331$click_time3 - evidence31331$click_time4)),decreasing = T))
##
## 1 2 3 4 0 5
## 3014 2614 1693 1069 907 721
# 5-4
head(sort(table(as.numeric(evidence31331$click_time4 - evidence31331$click_time5)),decreasing = T))
##
## 1 2 3 0 4 5
## 1840 1424 850 630 516 356
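The install and the last action are 0 s apart in 246937 rows, or 82.5% of the data. The next most common gap is 1 s, with 3858 rows, or about 1.3%.
If a click-to-install gap of 0 is treated as anomalous, then at least 82.5% of the data is anomalous.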
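A share like the one quoted above can also be computed directly, without reading it off a frequency table. The sketch below assumes a data frame with the same `install_time` and `click_time1` columns as the source; the `toy` rows are made up for illustration.

```r
# Toy data with the source's column names; the timestamps are invented.
toy <- data.frame(
  install_time = as.POSIXct(c("2018-03-06 11:28:54", "2018-03-10 12:48:36",
                              "2018-03-13 16:49:50", "2018-03-05 16:01:12"),
                            tz = "UTC"),
  click_time1  = as.POSIXct(c("2018-03-06 11:28:54", "2018-03-10 12:48:36",
                              "2018-03-13 16:49:44", "2018-03-05 16:01:12"),
                            tz = "UTC")
)

# Gap in seconds, with the unit stated explicitly.
gap <- as.numeric(difftime(toy$install_time, toy$click_time1, units = "secs"))

# The mean of a logical vector is the share of TRUE values.
zero_share <- mean(gap == 0, na.rm = TRUE)
print(zero_share)  # 0.75
```

Using `mean(gap == 0)` gives the share over all rows, whereas `head(sort(table(...)))` only shows the counts of the most frequent values.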
library(tidyverse)
load('/Users/milin/AffleMobvista.Rdata')
First, look at the Mobvista data.
head(Mobvista2931)
## # A tibble: 6 x 17
## adid install_time click_time1 action1 network1
## <chr> <dttm> <dttm> <chr> <chr>
## 1 00060d0a-3740-… 2018-03-29 10:51:50 2018-03-29 10:51:50 click Mobvista
## 2 000a89fc-23fa-… 2018-03-29 10:34:41 2018-03-29 10:34:41 click Mobvista
## 3 001c5462-6d4f-… 2018-03-29 02:51:59 2018-03-29 02:51:59 click Mobvista
## 4 002a3127-b9bb-… 2018-03-30 02:46:27 2018-03-30 02:46:27 click Mobvista
## 5 002b77a5-f388-… 2018-03-30 00:29:10 2018-03-30 00:29:10 click Mobvista
## 6 00319a59-5148-… 2018-03-29 02:50:47 2018-03-29 02:50:47 click Mobvista
## # ... with 12 more variables: click_time2 <dttm>, action2 <chr>,
## # network2 <chr>, click_time3 <dttm>, action3 <chr>, network3 <chr>,
## # click_time4 <dttm>, action4 <chr>, network4 <chr>, click_time5 <dttm>,
## # action5 <chr>, network5 <chr>
Mobvista2931 <- as_tibble(Mobvista2931)
# Most recent click to install
summary(as.numeric(Mobvista2931$install_time - Mobvista2931$click_time1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 98.58 0.00 190041.00
# On average, the install follows the last click by about 98.58 s
# 2-1
summary(as.numeric(Mobvista2931$click_time1 - Mobvista2931$click_time2))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 5797 222687 190402 2499278 4095
# Mean gap between the 2nd and the 1st most recent click: about 222687 s
# 3-2
summary(as.numeric(Mobvista2931$click_time2 - Mobvista2931$click_time3))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 117 42020 252399 167924 2084081 4625
# 4-3
summary(as.numeric(Mobvista2931$click_time3 - Mobvista2931$click_time4))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.0 572.5 16158.0 248712.6 183138.0 1981539.0 4697
# 5-4
summary(as.numeric(Mobvista2931$click_time4 - Mobvista2931$click_time5))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3 7 157 49590 34048 381850 4722
The mean click-to-install time is about 98.6 seconds.
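One caveat about units: subtracting two `POSIXct` values gives a `difftime` whose unit R chooses automatically, and `as.numeric()` returns the number in that unit. For a vector the choice is driven by the smallest absolute gap, so the figures in this analysis read as seconds because each vector contains gaps under a minute. The sketch below, on two made-up timestamps, shows how the automatic choice can differ and how `difftime(..., units = "secs")` pins it down.

```r
# Two invented timestamps exactly two hours apart.
later   <- as.POSIXct("2018-03-29 12:00:00", tz = "UTC")
earlier <- as.POSIXct("2018-03-29 10:00:00", tz = "UTC")

auto_gap <- later - earlier                           # unit picked automatically
secs_gap <- difftime(later, earlier, units = "secs")  # unit fixed to seconds

print(units(auto_gap))       # "hours"
print(as.numeric(auto_gap))  # 2
print(as.numeric(secs_gap))  # 7200
```

Fixing the unit explicitly keeps summaries comparable across data sets, even if one of them happens to contain no short gaps.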
# Install time minus the last action
head(sort(table(as.numeric(Mobvista2931$install_time - Mobvista2931$click_time1)),decreasing = T))
##
## 0 49119 63079 164256 190041
## 4728 1 1 1 1
# Last action minus the second-to-last action
head(sort(table(as.numeric(Mobvista2931$click_time1 - Mobvista2931$click_time2)),decreasing = T))
##
## 0 1 10 3 7 4
## 160 14 10 8 8 7
# 3-2
head(sort(table(as.numeric(Mobvista2931$click_time2 - Mobvista2931$click_time3)),decreasing = T))
##
## 61 28538 66826 76100 121504 195798
## 4 4 4 4 4 4
# 4-3
head(sort(table(as.numeric(Mobvista2931$click_time3 - Mobvista2931$click_time4)),decreasing = T))
##
## 5647 183138 3 11 19 33
## 4 4 1 1 1 1
# 5-4
head(sort(table(as.numeric(Mobvista2931$click_time4 - Mobvista2931$click_time5)),decreasing = T))
##
## 3 19 36 278 930 45087
## 3 1 1 1 1 1
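From March 29 to March 31, Mobvista has 4728 rows in which the gap between the install time and the last click time is 0. That is 4728 of 4732 rows, or about 99.9%.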
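The frequency table for the install-to-last-click gap allows a quick cross-check of both the zero-gap share and the mean reported by `summary()`. The sketch assumes the table holds only the five distinct values printed, which is consistent with `head()` returning five entries rather than its default six. The names `gap_values` and `gap_counts` are introduced here for illustration.

```r
# Gap values (seconds) and their row counts, copied from the printed table.
gap_values <- c(0, 49119, 63079, 164256, 190041)
gap_counts <- c(4728, 1, 1, 1, 1)

total_rows <- sum(gap_counts)                             # 4732 rows in total
zero_share <- gap_counts[gap_values == 0] / total_rows    # share of zero gaps
mean_gap   <- sum(gap_values * gap_counts) / total_rows   # count-weighted mean

print(round(100 * zero_share, 1))  # 99.9
print(round(mean_gap, 2))          # 98.58
```

The weighted mean reproduces the 98.58 shown by `summary()`, which supports the assumption that the table is complete. It also shows that the whole mean comes from just four rows.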
Next, look at the Affle data.
### Distribution of the time gaps
head(Affle2931)
## # A tibble: 6 x 17
## adid install_time click_time1 action1 network1
## <chr> <dttm> <dttm> <chr> <chr>
## 1 00071257-7827-… 2018-03-29 08:46:38 2018-03-29 08:46:38 click AppLift
## 2 001a4eec-27cc-… 2018-03-30 21:44:43 2018-03-30 21:44:43 click Affle
## 3 002fa32e-885b-… 2018-03-29 02:39:37 2018-03-29 02:39:37 click Affle
## 4 00412192-1104-… 2018-03-30 14:58:42 2018-03-30 14:58:02 click AppLift
## 5 004b2e49-2a43-… 2018-03-30 16:33:58 2018-03-30 16:33:58 click Affle
## 6 0072175f-270c-… 2018-03-30 13:45:25 2018-03-30 13:45:25 click Affle
## # ... with 12 more variables: click_time2 <dttm>, action2 <chr>,
## # network2 <chr>, click_time3 <dttm>, action3 <chr>, network3 <chr>,
## # click_time4 <dttm>, action4 <chr>, network4 <chr>, click_time5 <dttm>,
## # action5 <chr>, network5 <chr>
Affle2931 <- as_tibble(Affle2931)
summary(as.numeric(Affle2931$install_time - Affle2931$click_time1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 45150 11 2534860
# On average, the install follows the last click by about 45150 s
# 2-1
summary(as.numeric(Affle2931$click_time1 - Affle2931$click_time2))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 1.0 4.0 104069.8 314.8 2456969.0 2295
# Mean gap between the 2nd and the 1st most recent click: about 104070 s
# 3-2
summary(as.numeric(Affle2931$click_time2 - Affle2931$click_time3))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 1.0 3.0 106138.0 46.5 2453998.0 2871
# 4-3
summary(as.numeric(Affle2931$click_time3 - Affle2931$click_time4))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 1 3 105291 33 2448227 3058
# 5-4
summary(as.numeric(Affle2931$click_time4 - Affle2931$click_time5))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 2.0 4.5 145679.4 263247.0 2521674.0 3171
The mean click-to-install time is 45150 seconds.
# Install time minus the last action
head(sort(table(as.numeric(Affle2931$install_time - Affle2931$click_time1)),decreasing = T))
##
## 0 1 2 3 4 5
## 2030 158 107 67 28 26
# Last action minus the second-to-last action
head(sort(table(as.numeric(Affle2931$click_time1 - Affle2931$click_time2)),decreasing = T))
##
## 1 0 2 3 4 5
## 200 103 102 88 57 29
# 3-2
head(sort(table(as.numeric(Affle2931$click_time2 - Affle2931$click_time3)),decreasing = T))
##
## 1 2 3 0 4 6
## 91 76 40 35 24 10
# 4-3
head(sort(table(as.numeric(Affle2931$click_time3 - Affle2931$click_time4)),decreasing = T))
##
## 1 2 3 4 7 5
## 72 34 23 11 10 6
# 5-4
head(sort(table(as.numeric(Affle2931$click_time4 - Affle2931$click_time5)),decreasing = T))
##
## 302955 2 1 3 4 5
## 27 21 18 15 13 6
From March 29 to March 31, Affle2931 has 2030 rows in which the gap between the install time and the last click time is 0, or 61% of the data.
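The `head(Affle2931)` preview shows both `Affle` and `AppLift` in `network1`, so a single share for the whole table may mix networks. A minimal sketch of a per-network breakdown follows. The `toy` data frame and its `gap_secs` column are hypothetical stand-ins for the real install-to-last-click gaps.

```r
# Invented rows: network of the last click and the gap to install in seconds.
toy <- data.frame(
  network1 = c("Affle", "Affle", "AppLift", "AppLift", "Affle"),
  gap_secs = c(0, 0, 40, 0, 3)
)

# Share of zero gaps within each network.
share_by_network <- tapply(toy$gap_secs == 0, toy$network1, mean)
print(share_by_network)
```

Splitting by `network1` makes it possible to see whether the zero-gap rows are concentrated in one network or spread across both.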