随着电信行业用户增长趋于缓和,存量客户运营与维系逐渐成为重点,预防与减少用户流失与价值提升是存量用户维系的两大重点,价值提升中收入保有也是关注的重要指标。收入预测模型通过预测收入及收保,通过预测收入及收保可以更及时的观测该考核指标的变化,为各地市维系工作者提供数据支撑。
在做预测之前,我们首先要明确一个假设,我们假设未来的结果是与过去的结果相关的。相关度越高,我们越能根据过去来预测未来的结果。如果关于未来的结果的波动是比较无规律的,那么预测未来的结果则是比较困难的。
第一个思路是直接根据流失的历史数据来进行预测,在这里选取的数据集是2019年1月份-20201年4月份的月流失数据。
# 这个数据和平台上面的数据是能够对得上的
library(readxl)
library(lubridate)
library(tidyverse)
runoff2019 <- read.csv("/Users/milin/Downloads/2019流失数据.csv")
runoff2020 <- read.csv("/Users/milin/Downloads/2020流失数据.csv")
runoff2021 <- read.csv("/Users/milin/Downloads/2021流失数据.csv")
# 关注2020年11月份
# 重新命名
names(runoff2019) <- c("地市",as.character(ym("201812")+months(1:12)))
names(runoff2020) <- c("地市",as.character(ym("201912")+months(1:12)))
names(runoff2021) <- c("地市",as.character(ym("202012")+months(1:4)))
# 数据准备好了,然后将宽数据转变成为长数据。
runoff2019 <- runoff2019 %>% gather("时间","流失量",2:13)
runoff2020 <- runoff2020 %>% gather("时间","流失量",2:13)
runoff2021 <- runoff2021 %>% gather("时间","流失量",2:5)
# 将这三年的数据合并成为同一份数据
runoff20192021 <- runoff2019 %>% bind_rows(runoff2020) %>% bind_rows(runoff2021)
第一步,对于月流失数据进行可视化,分地市绘制关于流失的时间序列图。
# 到这里数据就处理好了,接下来要做的是进行数据可视化。
library(tidyverse)
library(plotly)
library(showtext)
showtext_auto()
runoff20192021$时间 <- date(runoff20192021$时间)
p <- ggplot(data = runoff20192021,aes(x = 时间,y = (流失量),color = 地市)) + geom_line()
p
ggplotly(p)
可以看到,流失量的波动还是比较大的,这意味着流失序列不稳定,对于不稳定或者趋势不明显的时间序列进行建模,即使模型将将历史数据拟合比较好,在预测的时候,预测结果和真实结果的误差还是会比较大。
对流失量进行取对数,然后分地市进行可视化.取对数可以让数据之间的差异变得更小。
# 对流失量进行取对数
p <- ggplot(data = runoff20192021,aes(x = 时间,y = log(流失量),color = 地市)) + geom_line()
p
ggplotly(p)
对月流失序列取对数之后,可以看到,序列变得稍微平稳一些。对取对数之后的流失量构建prophet模型。
Prophet 是以时间序列分解和曲线拟合思想为基础建立的模型。Prophet 也是一个时间复杂度较低、建模简单的模型。与 ARIMA 模型相比,Prophet的优点是对节假日和突发事件造成的数据波动有很好的拟合效果,并能在一定程度上拟合非线性数据,比较适用于长期的周期明显的时间序列预测;但是其在趋势、周期不明显的时间序列上效果不佳。
这里首先筛选全省的数据,然后对全省的数据使用prophet进行建模。
library(prophet)
# 筛选出全省的数据
prorunoff <- runoff20192021 %>% filter(地市== "全省")
# 真理记录一下真实结果
truvalue <- prorunoff
history <- data.frame(ds = prorunoff$时间[1:24],
y = log(prorunoff$流失量[1:24]))
m <- prophet(history,interval.width = 0.95)
future <- make_future_dataframe(m, periods = 4,freq = "month")
forecast <- predict(m, future)
plot(m, forecast)
result <- prorunoff %>% mutate(pre = forecast$yhat,lowpre = forecast$yhat_lower,upperpre = forecast$yhat_upper)
tail(result)
## 地市 时间 流失量 pre lowpre upperpre
## 23 全省 2020-11-01 1515336 14.13161 14.03473 14.22911
## 24 全省 2020-12-01 1321730 14.12619 14.03102 14.22061
## 25 全省 2021-01-01 1174921 14.12058 14.02842 14.20941
## 26 全省 2021-02-01 1243171 14.11497 14.02350 14.21026
## 27 全省 2021-03-01 1108165 14.10990 14.00952 14.21385
## 28 全省 2021-04-01 1195862 14.10430 13.97599 14.23096
在这里,我们是使用2019年1月份到2020年12月份的数据训练模型,然后使用2021年1月到4月份的数据作为测算。 从数据中可以看到,模型其实是很好的拟合了数据的趋势的。接下来查看模型的误差。
options(scipen = 10)
result <- result %>% mutate(log流失量 = log(流失量),logerrer = abs((log(流失量)-pre))/log(流失量)*100,errer = abs(((流失量)-exp(pre)))/(流失量)*100,loweexp = exp(lowpre),uppexp = exp(upperpre),preexp=exp(pre))
tail(result)
## 地市 时间 流失量 pre lowpre upperpre log流失量 logerrer
## 23 全省 2020-11-01 1515336 14.13161 14.03473 14.22911 14.23115 0.6994210
## 24 全省 2020-12-01 1321730 14.12619 14.03102 14.22061 14.09445 0.2251458
## 25 全省 2021-01-01 1174921 14.12058 14.02842 14.20941 13.97671 1.0293246
## 26 全省 2021-02-01 1243171 14.11497 14.02350 14.21026 14.03318 0.5828571
## 27 全省 2021-03-01 1108165 14.10990 14.00952 14.21385 13.91822 1.3772457
## 28 全省 2021-04-01 1195862 14.10430 13.97599 14.23096 13.99438 0.7854465
## errer loweexp uppexp preexp
## 23 9.474232 1245100 1512246 1371770
## 24 3.224193 1240490 1499456 1364345
## 25 15.472905 1237273 1482758 1356715
## 26 8.523153 1231197 1484009 1349128
## 27 21.129257 1214109 1489351 1342312
## 28 11.618693 1174079 1515055 1334806
结果中,pre标识的预测值,lowpre标识的是预测值的下界,upperpre表示的是预测值的上界,log流失量标识的是对真实的流失量取对数,logerrer标识的是取对数之后的误差,需要注意的是这里误差是百分比 。errer标识的是没有取对计算的误差。loweexp表示预测结果不取对数时候的下界,uppexp表示预测结果不取对数时候的上界,preexp表示预测结果不取对数时候的预测值。
从结果可以看到,如果对数据取了对数,也就是序列比较平稳的时候,不管是训练误差还是测试误差都是比较小的。如果不取对数,模型的训练误差还是比较小,说明模型还是能拟合现有的数据,但是测试误差比较大,说明数据波动还是太大,对于新的一个月,模型无法基于历史结果给出比较准确的预测。
根据这种方法预测5-6月份的数据
truvalue <- prorunoff
history <- data.frame(ds = prorunoff$时间,
y = log(prorunoff$流失量))
m <- prophet(history,interval.width = 0.95)
future <- make_future_dataframe(m, periods = 2,freq = "month")
forecast <- predict(m, future)
plot(m, forecast)
tail(forecast)
## ds trend additive_terms additive_terms_lower additive_terms_upper
## 25 2021-01-01 14.02105 -0.04433636 -0.04433636 -0.04433636
## 26 2021-02-01 14.01894 0.01423223 0.01423223 0.01423223
## 27 2021-03-01 14.01704 -0.09882714 -0.09882714 -0.09882714
## 28 2021-04-01 14.01494 -0.02056124 -0.02056124 -0.02056124
## 29 2021-05-01 14.01290 -0.11991164 -0.11991164 -0.11991164
## 30 2021-06-01 14.01080 -0.09220867 -0.09220867 -0.09220867
## yearly yearly_lower yearly_upper multiplicative_terms
## 25 -0.04433636 -0.04433636 -0.04433636 0
## 26 0.01423223 0.01423223 0.01423223 0
## 27 -0.09882714 -0.09882714 -0.09882714 0
## 28 -0.02056124 -0.02056124 -0.02056124 0
## 29 -0.11991164 -0.11991164 -0.11991164 0
## 30 -0.09220867 -0.09220867 -0.09220867 0
## multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 25 0 0 13.97671 13.97671
## 26 0 0 14.03318 14.03318
## 27 0 0 13.91822 13.91822
## 28 0 0 13.99438 13.99438
## 29 0 0 13.80422 13.97013
## 30 0 0 13.66727 14.13222
## trend_lower trend_upper yhat
## 25 14.02105 14.02105 13.97671
## 26 14.01894 14.01894 14.03318
## 27 14.01704 14.01704 13.91822
## 28 14.01494 14.01494 13.99438
## 29 13.92413 14.09004 13.89299
## 30 13.75948 14.22443 13.91859
通过前几天的日流失数据,来预测本月末的流失数据。这一部分是获取数据以及对数据进行基本的处理。这里所使用的数据的2021那年1月份到2021年5月份每一天的日流失数据。
# 首先还是获取数据,这里获取的数据是2021年1-5月份的数据
runoff202101 <- read.csv("/Users/milin/Downloads/202101流失.csv")
runoff202102 <- read.csv("/Users/milin/Downloads/202102流失.csv")
runoff202103 <- read.csv("/Users/milin/Downloads/202103流失.csv")
runoff202104 <- read.csv("/Users/milin/Downloads/202104流失.csv")
runoff202105 <- read.csv("/Users/milin/Downloads/202105流失.csv")
names(runoff202101) <- c("地市",as.character(ym("2021-01")+days(0:30)))
names(runoff202102) <- c("地市",as.character(ym("2021-02")+days(0:27)))
names(runoff202103) <- c("地市",as.character(ym("2021-03")+days(0:30)))
names(runoff202104) <- c("地市",as.character(ym("2021-04")+days(0:29)))
names(runoff202105) <- c("地市",as.character(ym("2021-05")+days(0:9)))
runoff202101 <- runoff202101 %>% gather("时间","流失量",2:32)
runoff202102 <- runoff202102 %>% gather("时间","流失量",2:29)
runoff202103 <- runoff202103 %>% gather("时间","流失量",2:32)
runoff202104 <- runoff202104 %>% gather("时间","流失量",2:31)
runoff202105 <- runoff202105 %>% gather("时间","流失量",2:11)
runoff2021_dayofdata <- runoff202101 %>% bind_rows(runoff202102) %>% bind_rows(runoff202103) %>% bind_rows(runoff202104) %>% bind_rows(runoff202105)
对每一天的日流失数据进行可视化,观察数据。
runoff2021_dayofdata$时间 <- date(runoff2021_dayofdata$时间)
p <- ggplot(data = runoff2021_dayofdata,aes(x = 时间,y = 流失量,color = 地市)) + geom_line()
p
ggplotly(p)
从数据中可以看出,每一个月月初和月末之间,流失数量的变化趋势是非常类似的,因此再每一个月初的时候,就可以尝试预测月末的数据。这里使用的是使用6号到10号的日数据来预测月末的数据。
尝试使用2月份的数据来验证。
# 筛选出全省的数据
prorunoff <- runoff2021_dayofdata %>% filter(地市== "全省") %>% filter(month(时间)==2)
history <- data.frame(ds = prorunoff$时间[c(6:10)],
y = prorunoff$流失量[c(6:10)])
m <- prophet(history)
future <- make_future_dataframe(m, periods = 18,freq = "day")
forecast <- predict(m, future)
plot(m,forecast)
tail(forecast)
## ds trend zeros zeros_lower zeros_upper additive_terms
## 18 2021-02-23 1367426 0 0 0 0
## 19 2021-02-24 1346986 0 0 0 0
## 20 2021-02-25 1326546 0 0 0 0
## 21 2021-02-26 1306106 0 0 0 0
## 22 2021-02-27 1285666 0 0 0 0
## 23 2021-02-28 1265226 0 0 0 0
## additive_terms_lower additive_terms_upper multiplicative_terms
## 18 0 0 0
## 19 0 0 0
## 20 0 0 0
## 21 0 0 0
## 22 0 0 0
## 23 0 0 0
## multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 18 0 0 1001136.2 1788683
## 19 0 0 943472.3 1813645
## 20 0 0 881063.2 1845929
## 21 0 0 817795.7 1880120
## 22 0 0 748898.7 1908654
## 23 0 0 668565.5 1939543
## trend_lower trend_upper yhat
## 18 1001136.2 1788683 1367426
## 19 943472.3 1813645 1346986
## 20 881063.2 1845929 1326546
## 21 817795.7 1880120 1306106
## 22 748898.7 1908654 1285666
## 23 668565.5 1939543 1265226
预测结果显示,月流失数据为:1265226 ,而真实的结果是:1243171,差异为:
abs(1265226-1243171)/1265226
## [1] 0.01743167
用3月份的数据测算一下.
prorunoff <- runoff2021_dayofdata %>% filter(地市== "全省") %>% filter(month(时间)==3)
history <- data.frame(ds = prorunoff$时间[c(6:10)],
y = prorunoff$流失量[c(6:10)])
m <- prophet(history)
future <- make_future_dataframe(m, periods = 21,freq = "day")
forecast <- predict(m, future)
tail(forecast)
## ds trend zeros zeros_lower zeros_upper additive_terms
## 21 2021-03-26 1169246 0 0 0 0
## 22 2021-03-27 1147851 0 0 0 0
## 23 2021-03-28 1126456 0 0 0 0
## 24 2021-03-29 1105061 0 0 0 0
## 25 2021-03-30 1083666 0 0 0 0
## 26 2021-03-31 1062271 0 0 0 0
## additive_terms_lower additive_terms_upper multiplicative_terms
## 21 0 0 0
## 22 0 0 0
## 23 0 0 0
## 24 0 0 0
## 25 0 0 0
## 26 0 0 0
## multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 21 0 0 850747.3 1499738
## 22 0 0 801435.9 1503280
## 23 0 0 753262.9 1509113
## 24 0 0 696819.0 1514419
## 25 0 0 632986.9 1524099
## 26 0 0 580258.5 1534287
## trend_lower trend_upper yhat
## 21 850747.3 1499738 1169246
## 22 801435.9 1503280 1147851
## 23 753262.9 1509113 1126456
## 24 696819.0 1514419 1105061
## 25 632986.9 1524099 1083666
## 26 580258.5 1534287 1062271
plot(m,forecast)
# 计算一下误差
abs(truvalue$流失量[27]-forecast$yhat[26])/truvalue$流失量[27]
## [1] 0.04141441
从结果中可以看到,预测结果为1062271,真实结果为 1108165 .误差为0.04141441 。
接下来使用4月份的数据来做测算
prorunoff <- runoff2021_dayofdata %>% filter(地市== "全省") %>% filter(month(时间)==4)
history <- data.frame(ds = prorunoff$时间[c(6:10)],
y = prorunoff$流失量[c(6:10)])
m <- prophet(history)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## n.changepoints greater than number of observations. Using 3
future <- make_future_dataframe(m, periods = 20,freq = "day")
forecast <- predict(m, future)
tail(forecast)
## ds trend zeros zeros_lower zeros_upper additive_terms
## 20 2021-04-25 1271413 0 0 0 0
## 21 2021-04-26 1248883 0 0 0 0
## 22 2021-04-27 1226353 0 0 0 0
## 23 2021-04-28 1203823 0 0 0 0
## 24 2021-04-29 1181293 0 0 0 0
## 25 2021-04-30 1158763 0 0 0 0
## additive_terms_lower additive_terms_upper multiplicative_terms
## 20 0 0 0
## 21 0 0 0
## 22 0 0 0
## 23 0 0 0
## 24 0 0 0
## 25 0 0 0
## multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 20 0 0 984644.3 1539789
## 21 0 0 929644.6 1537820
## 22 0 0 871387.1 1542808
## 23 0 0 814410.6 1561913
## 24 0 0 763535.9 1577149
## 25 0 0 708271.6 1583362
## trend_lower trend_upper yhat
## 20 984644.3 1539789 1271413
## 21 929644.6 1537820 1248883
## 22 871387.1 1542808 1226353
## 23 814410.6 1561913 1203823
## 24 763535.9 1577149 1181293
## 25 708271.6 1583362 1158763
plot(m,forecast)
# 计算一下误差
abs(truvalue$流失量[28]-1158763)/truvalue$流失量[28]
## [1] 0.03102281
使用4月份来做测算,预测结果为1158763,真是结果为1195862 ,误差0.03102281 。测算了三个月,初步可以判断,如果使用每个月6号到10号的数据来测算月度流失,误差约为3%左右。另外,如果模型所使用的时间越长例如是6号到15号的数据,那么误差可能会更小。
如果希望测算T+1 ,也就是知道在5月份,希望测算6月份的流失。思路可以是通过5月份以及之前的月份的6到10号的日流失数据预测出6月份6-10号的日流失数据,然后根据6月份所测算的日流失数据来预测6月份的月流失数据。
预测T+1 这里也尝试了很多的方法,第一种方法是根据历史的6号到10号的数据预测T+1的6号的数据,然后使用prophet模型预测根据6月6号到10号的数据预测6月末的流失情况.
# 先筛选出省份的数据
allpriv_runoff <- runoff2021_dayofdata %>% filter(地市 == "全省")
# 还需要将数据进行转化,转化得出一个月中每一天的结果
allpriv_runoff <- allpriv_runoff %>% mutate(月份 = month(时间),日期=day(时间))
allpriv_runoff <- allpriv_runoff %>% select(-时间)%>% spread(key = 日期,value=流失量)
allpriv_runoff <- allpriv_runoff%>% arrange(月份)
# 将5月的数据填充进去
allpriv_runoff[5,13] <- 1675392
allpriv_runoff[5,14] <- 165600
# 预测4月份
mydata <- allpriv_runoff
tmp <- mydata[c(2,3),c(8:12)]
tmp <- (tmp[1,]+tmp[2,])/2
# 这里预测的是四月份的日流失,预测结果的误差是
# -0.02384465 -0.02358408 -0.02161841 -0.02366546 -0.02299481
history <- data.frame(ds = date("2021-04-01")+days(5:9),
y =as.numeric(tmp))
m <- prophet(history)
future <- make_future_dataframe(m, periods = 20,freq = "day")
forecast <- predict(m, future)
tail(forecast)
f <- function(lang=0,type=1){
mydata <- allpriv_runoff
tmp <- mydata[c(2,3),c(8:(12+lang))] #
if(type==1){
tmp <- (tmp[1,]+tmp[2,])/2 # 这里是通过算数平均得出结果,我们尝试使用几何平均
}else{
tmp <- sqrt(as.numeric(tmp[1,])*as.numeric(tmp[2,])) # 几何平均
}
history <- data.frame(ds = date("2021-04-01")+days(5:(9+lang)),
y =as.numeric(tmp))
m <- prophet(history)
future <- make_future_dataframe(m, periods = 20-lang,freq = "day")
forecast <- predict(m, future)
return(forecast)
}
tmp <- f(lang = 10,type=1)
print(abs(tmp[dim(tmp)[1],dim(tmp)[2]] - 1195862)/1195862)
tmp <- f(lang = 7,type=2)
abs(tmp[dim(tmp)[1],dim(tmp)[2]] - 1195862)/1195862
# 0 :0.03654574 几何 0: 0.03500069 算数
# 1 :0.01141181 几何 1: 0.009230093
# 2: 0.01699695 几何 2: 0.02007152
# 五月的预测 1280000
mydata <- allpriv_runoff
tmp <- mydata[c(3,4),c(8:(12+6))]
#tmp <- (tmp[1,]+tmp[2,])/2 # 使用几何平均
tmp <- sqrt(as.numeric(tmp[1,])*as.numeric(tmp[2,])) ## 使用几何平均
history <- data.frame(ds = date("2021-05-01")+days(5:(9+6)),
y =as.numeric(tmp))
m <- prophet(history)
future <- make_future_dataframe(m, periods = 21-6,freq = "day")
forecast <- predict(m, future)
abs(forecast[dim(forecast)[1],dim(forecast)[2]] -1280000)/1280000
# 预测六月份
mydata <- allpriv_runoff
tmp <- mydata[c(4,5),c(8:(12+2))]
#tmp <- (tmp[1,]+tmp[2,])/2 # 使用几何平均
tmp <- sqrt(as.numeric(tmp[1,])*as.numeric(tmp[2,])) ## 使用几何平均
history <- data.frame(ds = date("2021-06-01")+days(5:(9+2)),
y =as.numeric(tmp))
m <- prophet(history)
future <- make_future_dataframe(m, periods = 20-2,freq = "day")
forecast <- predict(m, future)
forecast[dim(forecast)[1],dim(forecast)[2]]
1183242 # 到10号,几何
1183486 # 到10号,算数
1242005 # 到11号几何
1242187 # 算数
1242187 # 12 号算数
1242005 # 12 号几何
用这种方法测算3月份的数据,误差大约是:0.03654574 。测算5月份的数据,预测结果是1254241,测算六月份的数据,测算结果是 1183242。
还有一个思路是用发展用户数,出账用户数,在网用户数,活动新发展用户数。流失量与以往的发展量相关,我们通过流失量与其他变量质检的关系来预测流失。
这里使用到的数据包括2019年1月到2021年4月份的,月流失量,月新发展,月出账用户量。
## 下载数据
develop2019 <- read.csv("/Users/milin/Downloads/2019发展.csv")
develop2020 <- read.csv("/Users/milin/Downloads/2020发展.csv")
develop2021 <- read.csv("/Users/milin/Downloads/2021发展.csv")
expaccount2019 <- read.csv("/Users/milin/Downloads/2019年出账.csv")
expaccount2020 <- read.csv("/Users/milin/Downloads/2020年出账.csv")
expaccount2021 <- read.csv("/Users/milin/Downloads/2021年出账.csv")
## 对数据进行重新命名
names(develop2019) <- c("地市",as.character(ym("201812")+months(1:12)))
names(develop2020) <- c("地市",as.character(ym("201912")+months(1:12)))
names(develop2021) <- c("地市",as.character(ym("202012")+months(1:4)))
names(expaccount2019) <- c("地市",as.character(ym("201812")+months(1:12)))
names(expaccount2020) <- c("地市",as.character(ym("201912")+months(1:12)))
names(expaccount2021) <- c("地市",as.character(ym("202012")+months(1:4)))
# 将宽数据转换成为长数据
develop2019 <- develop2019 %>% gather("时间","发展量用户数",2:13)
develop2020 <- develop2020 %>% gather("时间","发展量用户数",2:13)
develop2021 <- develop2021 %>% gather("时间","发展量用户数",2:5)
expaccount2019 <- expaccount2019 %>% gather("时间","出账用户数",2:13)
expaccount2020 <- expaccount2020 %>% gather("时间","出账用户数",2:13)
expaccount2021 <- expaccount2021 %>% gather("时间","出账用户数",2:5)
develop20192021 <- develop2019 %>% bind_rows(develop2020) %>% bind_rows(develop2021)
expaccount20192021 <- expaccount2019 %>% bind_rows(expaccount2020) %>% bind_rows(expaccount2021)
tail(runoff20192021)
## 地市 时间 流失量
## 611 茂名 2021-04-01 19931
## 612 河源 2021-04-01 10375
## 613 汕尾 2021-04-01 9102
## 614 阳江 2021-04-01 9684
## 615 梅州 2021-04-01 10490
## 616 云浮 2021-04-01 8933
develop20192021$时间 <- as.Date(develop20192021$时间)
expaccount20192021$时间 <- as.Date(expaccount20192021$时间)
de_exp_run <- develop20192021 %>% left_join(y = expaccount20192021,by = c("地市","时间")) %>% left_join(y = runoff20192021,by = c("地市","时间"))
需要注意爹是,当月的新发展不会影响当月的流失,当月的流失是受到之前月份的影响。因此,构建3个新变量,分别标识上月分的新发展,上上个月新发展,上上上个月的新发展数量。
prode_exp_run <- de_exp_run %>% filter(地市=="全省")
prode_exp_run <- prode_exp_run %>% mutate(lag1dev = lag(发展量用户数)) %>% mutate(lag2dev = lag(发展量用户数,n = 2)) %>% mutate(lag3dev = lag(发展量用户数,n = 3),lag4dev = lag(发展量用户数,n = 4),lag5dev = lag(发展量用户数,n = 5),lag6dev = lag(发展量用户数,n = 6),lag7dev = lag(发展量用户数,n = 7),lag8dev = lag(发展量用户数,n = 8),lag9dev = lag(发展量用户数,n = 9))
tail(prode_exp_run)
## 地市 时间 发展量用户数 出账用户数 流失量 lag1dev lag2dev lag3dev
## 23 全省 2020-11-01 815640 26560339 1515336 929258 937946 1028718
## 24 全省 2020-12-01 783999 26283171 1321730 815640 929258 937946
## 25 全省 2021-01-01 917653 26285874 1174921 783999 815640 929258
## 26 全省 2021-02-01 805985 25972420 1243171 917653 783999 815640
## 27 全省 2021-03-01 1301092 26537959 1108165 805985 917653 783999
## 28 全省 2021-04-01 1074871 26734399 1195862 1301092 805985 917653
## lag4dev lag5dev lag6dev lag7dev lag8dev lag9dev
## 23 925356 911425 973507 1075020 1151719 454177
## 24 1028718 925356 911425 973507 1075020 1151719
## 25 937946 1028718 925356 911425 973507 1075020
## 26 929258 937946 1028718 925356 911425 973507
## 27 815640 929258 937946 1028718 925356 911425
## 28 783999 815640 929258 937946 1028718 925356
数据中lag1dev标识上一个月的新发展,lag2dev表示的是上上个月的新发展。然后先简单分析一下数据之间的关系,例如流失与新发展用户之间的关系.
首先查看流失量与上一个月新发展用户数质检的关系
p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag1dev,color="yellow"))
p + labs(color="发展量用户数") + theme_bw()
可以看到,上一个月的新发展用户数量和流失量之间的关系并不明显。进一步可以计算相关系数.
cor(prode_exp_run$lag1dev[-c(1:3)],prode_exp_run$流失量[-c(1:3)])
## [1] 0.3140291
可以看到,流失量和上一个月新发展用户数之间的相关系数是31%,是存在一定的相关性的。接下来分析流失量和上上个月的新发展用户数量之间的关系。
p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag2dev,color="yellow"))
p + labs(color="发展量用户数") + theme_bw()
从这个图似乎可以看出来,流失与新发展之间的关系了,似乎是新发展的量一旦多了起来,在未来流失量的会增加。如果新发展的量减少,那么再未来一段时间,流失量也会减少。
同样,计算两者之间的相关性 。
cor(prode_exp_run$lag2dev[-c(1:3)],prode_exp_run$流失量[-c(1:3)])
## [1] 0.4631888
可以看到,相关系数,46%,是比较强的相关性了。进一步分析流失量与上上个月之间的相关关系。
p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag3dev,color="yellow"))
p + labs(color="发展量用户数") + theme_bw()
进一步计算二者之间的相关系数。
cor(prode_exp_run$lag3dev[-c(1:3)],prode_exp_run$流失量[-c(1:3)])
## [1] 0.6195624
可以看到相关系数是61% ,这也是比较强的相关性。我们还可以分析流失量与上第4个月之间的关系以及之后的关系,当我们分析到流失量与前第九个月之间的关系的时候,会显示下图。
p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag(发展量用户数,n = 9),color="yellow"))
p + labs(color="发展量用户数") + theme_bw()
可以看到,发展量与流失量之间的变化趋势已经是基本一致了 。进一步计算相关性.
cor(lag(prode_exp_run$发展量用户数,n = 9)[-c(1:9)],prode_exp_run$流失量[-c(1:9)])
## [1] 0.6759469
相关性依然还是比较大的,然后还可以分析一下出账用户数和流失量之间的关系。
p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=出账用户数,color="yellow"))
p + labs(color="出账用户数量") + theme_bw()
然后计算相关性.
cor(prode_exp_run$流失量,prode_exp_run$出账用户数)
## [1] 0.6163892
可以看到,流失量与出账用户数量依然是比较高的。相关性还是比较高的。
通过这样的变换之后,数据之间的量纲变得比较接近了。然后开始构建模型,使用2019年1月到2020年12月的数据训练,使用2021年1月到4月份的数据来做预测。
# 首先要创建训练集和测试集合
train <- prode_exp_run[c(1:24),]
test <- prode_exp_run[c(25:28),]
# 使用所有的变量来建模
lmodel <- lm(流失量~出账用户数+lag1dev+lag2dev+lag3dev,data = na.omit(train))
summary(lmodel)
##
## Call:
## lm(formula = 流失量 ~ 出账用户数 + lag1dev + lag2dev +
## lag3dev, data = na.omit(train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -143707 -102297 -33110 70584 255388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1768278.47359 1299201.51997 -1.361 0.2034
## 出账用户数 0.11851 0.05632 2.104 0.0616 .
## lag1dev -0.03983 0.22487 -0.177 0.8629
## lag2dev -0.24853 0.22730 -1.093 0.2999
## lag3dev 0.26332 0.21557 1.221 0.2499
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 151200 on 10 degrees of freedom
## Multiple R-squared: 0.6847, Adjusted R-squared: 0.5585
## F-statistic: 5.428 on 4 and 10 DF, p-value: 0.01379
stepmodel <- step(lmodel)
## Start: AIC=361.7
## 流失量 ~ 出账用户数 + lag1dev + lag2dev + lag3dev
##
## Df Sum of Sq RSS AIC
## - lag1dev 1 717087579 229242794437 359.75
## - lag2dev 1 27321451882 255847158740 361.40
## <none> 228525706858 361.70
## - lag3dev 1 34096304265 262622011123 361.79
## - 出账用户数 1 101194984356 329720691214 365.20
##
## Step: AIC=359.75
## 流失量 ~ 出账用户数 + lag2dev + lag3dev
##
## Df Sum of Sq RSS AIC
## <none> 229242794437 359.75
## - lag2dev 1 34634822568 263877617005 359.86
## - lag3dev 1 40222093672 269464888109 360.17
## - 出账用户数 1 122630820347 351873614784 364.18
summary(stepmodel)
##
## Call:
## lm(formula = 流失量 ~ 出账用户数 + lag2dev + lag3dev,
## data = na.omit(train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -145350 -98524 -33993 67633 253613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1667959.00511 1116605.79073 -1.494 0.1634
## 出账用户数 0.11361 0.04683 2.426 0.0337 *
## lag2dev -0.26248 0.20361 -1.289 0.2238
## lag3dev 0.27418 0.19736 1.389 0.1922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 144400 on 11 degrees of freedom
## Multiple R-squared: 0.6837, Adjusted R-squared: 0.5974
## F-statistic: 7.925 on 3 and 11 DF, p-value: 0.004299
pre <- predict(stepmodel,newdata = test)
abs((exp(pre) - exp(test$流失量)))/exp(test$流失量)
## 25 26 27 28
## NaN NaN NaN NaN
从结果来看,模型的效果并不是很好。
library(randomForest)
ran <- randomForest(流失量~.,data = na.omit(train[,-c(1,2)]))
ran
##
## Call:
## randomForest(formula = 流失量 ~ ., data = na.omit(train[, -c(1, 2)]))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 18321137768
## % Var explained: 62.08
plot(ran)
pre <- predict(ran,newdata = test)
abs((exp(pre) - exp(test$流失量)))/exp(test$流失量)
## 25 26 27 28
## NaN NaN NaN NaN
效果依然不是很理想。
这个思路的第一步是将客户划分成为很多细分群体。
每一个月的2I新发展用户,非2I新发展用户,3个月的存量用户,6个月的存量用户,12个月的存量用户,24个月的存量用户,24个月以上的存量用户。
例如以2020年12月为时间间隔,要计算5月份的流失量。则可以这么算。
计算2020年12月2I新发展用户量在2021年1月份流失的比例,以及2021年2月份流失的比例,到22021年4月份流失的比例,通过这些历史的比例推测出这批用户在5月份流失的比例,通过这个比例计算出这批用户在5月份的流失量。
然后继续计算2021年1月份新发展用户,在2021年2月份流失比例以及到2021年4月份流失比例,通过历史比例计算出这一批人在5月份的流失比例并进一步计算这一批用户的流失量。 以此类推。
非2I新发展用户计算方式是一样的。
对于3个月存量用户,则计算到2020年12月为止网龄为3个月以内的入网用户(也就是2020年9月到11月入网的用户),计算这一批用户在之后每一个月的流失率,通过历史的流失率推算5月份的流失率,并进一步得出这一批人在5月份流失的量。 一次类推
library(readxl)
toI <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 1)
notoI <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 2)
age3 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 3)
age6 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 4)
age12 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 5)
age24 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 6)
age24m <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 7)
然后要将具体的数值转化成为流失率
f <- function(x){
return(x/toI$发展规模)
}
toI1 <- toI %>% select(-发展月份,-发展规模) %>% mutate_all(f)
toI1 <- toI %>% select(发展月份,发展规模) %>% bind_cols(toI1)
f <- function(x){
return(x/notoI$发展规模)
}
notoI1 <- notoI %>% select(-发展月份,-发展规模) %>% mutate_all(f)
notoI1 <- notoI %>% select(发展月份,发展规模) %>% bind_cols(notoI1)
f <- function(x){
return(x/age3$出账用户)
}
age31 <- age3 %>% select(-`网龄1-3个月`,-出账用户) %>% mutate_all(f)
age31 <- age3 %>% select(`网龄1-3个月`,出账用户) %>% bind_cols(age31)
f <- function(x){
return(x/age6$拍照出账用户)
}
age61 <- age6 %>% select(-`网龄4-6个月`,-拍照出账用户) %>% mutate_all(f)
age61 <- age6 %>% select(`网龄4-6个月`,拍照出账用户) %>% bind_cols(age61)
f <- function(x){
return(x/age12$拍照出账用户)
}
age121 <- age12 %>% select(-`网龄7-12个月`,-拍照出账用户) %>% mutate_all(f)
age121 <- age12 %>% select(`网龄7-12个月`,拍照出账用户) %>% bind_cols(age121)
f <- function(x){
return(x/age24$拍照出账用户)
}
age241 <- age24 %>% select(-网龄1年到2年,-拍照出账用户) %>% mutate_all(f)
age241 <- age24 %>% select(网龄1年到2年,拍照出账用户) %>% bind_cols(age241)
f <- function(x){
return(x/age24m$拍照出账用户)
}
age24m1 <- age24m %>% select(-网龄2年以上,-拍照出账用户) %>% mutate_all(f)
age24m1 <- age24m %>% select(网龄2年以上,拍照出账用户) %>% bind_cols(age24m1)
到这里,数据就初步整理好了,就下来就是要做几件事情,通过这个方法,有两个数据需要推算,一个是流失率,一个是T+1月的用户量。
修改一下命名
names(age24m1)[29] <- "T+27月流失"
names(age241)[29] <- "T+27月流失"
names(age121)[29] <- "T+27月流失"
names(age61)[29] <- "T+27月流失"
names(age31)[29] <- "T+27月流失"
names(notoI1)[29] <- "T+27月流失"
names(toI1)[29] <- "T+27月流失"
编写函数,来计算T月份的
# T = 0 or 1
predictRunoff <- function(T=0){
require(tidyverse)
year = 21
month = 4
time = 27
lang <- dim(age24)[2]
deep <- dim(age24)[1]
for (i in 2:lang) {
if(i==lang){
time <- time+1
toI1 <- toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
toI1[,lang+1][-1,1] <- NA
notoI1 <- notoI1 %>%data.frame()%>% mutate(a = notoI1[1,lang]*0.9)
names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
notoI1[,lang+1][-1,1] <- NA
age31 <- age31 %>%data.frame()%>% mutate(a = age31[1,lang]*0.9)
names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "")
age31[,lang+1][-1,1] <- NA
age61 <- age61 %>%data.frame()%>% mutate(a = age61[1,lang]*0.9)
names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "")
age61[,lang+1][-1,1] <- NA
age121 <- age121 %>%data.frame()%>% mutate(a = age121[1,lang]*0.9)
names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "")
age121[,lang+1][-1,1] <- NA
age241 <- age241 %>%data.frame()%>% mutate(a = age241[1,lang]*0.9)
names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "")
age241[,lang+1][-1,1] <- NA
age24m1 <- age24m1 %>%data.frame()%>% mutate(a = age24m1[1,lang]*0.9)
names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "")
age24m1[,lang+1][-1,1] <- NA
# 倒数第二行
toI1[2,lang] <- toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
notoI1[2,lang] <- notoI1[1,lang]*runif(1,min = 0.9,max = 1.1)
age31[2,lang] <- age31[1,lang]*runif(1,min = 0.9,max = 1.1)
age61[2,lang] <- age61[1,lang]*runif(1,min = 0.9,max = 1.1)
age121[2,lang] <- age121[1,lang]*runif(1,min = 0.9,max = 1.1)
age241[2,lang] <- age241[1,lang]*runif(1,min = 0.9,max = 1.1)
age24m1[2,lang] <- age24m1[1,lang]*runif(1,min = 0.9,max = 1.1)
toI1[2,lang] <- toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
}else if(i==2){
# 如果月份超过12,修改年和月份
if(month+1==12){
year = year+1
month=1
}else{
month=month+1
}
# 1
value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 2
value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 3
value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 4
value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 5
value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 6
value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 7
value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
}else{
value <- toI1[deep-(i-2),i]*0.6+toI1[deep-(i-1),i]*0.4
toI1[deep-(i-3),i] <- value
# 2
value <- notoI1[deep-(i-2),i]*0.6+notoI1[deep-(i-1),i]*0.4
notoI1[deep-(i-3),i] <- value
# 3
value <- age31[deep-(i-2),i]*0.6+age31[deep-(i-1),i]*0.4
age31[deep-(i-3),i] <- value
# 4
value <- age61[deep-(i-2),i]*0.6+age61[deep-(i-1),i]*0.4
age61[deep-(i-3),i] <- value
# 5
value <- age121[deep-(i-2),i]*0.6+age121[deep-(i-1),i]*0.4
age121[deep-(i-3),i] <- value
# 6
value <- age241[deep-(i-2),i]*0.6+age241[deep-(i-1),i]*0.4
age241[deep-(i-3),i] <- value
# 7
value <- age24m1[deep-(i-2),i]*0.6+age24m1[deep-(i-1),i]*0.4
age24m1[deep-(i-3),i] <- value
}
}
if(T==0){
return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
}else if(T==1){
lang <- dim(toI1)[2]
deep <- dim(toI1)[1]
for (i in 2:lang) {
if(i==lang){
time <- time+1
toI1 <- toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
toI1[,lang+1][-1] <- NA
notoI1 <- notoI1 %>%data.frame() %>% mutate(a = notoI1[1,lang]*0.9)
names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
notoI1[,lang+1][-1] <- NA
age31 <- age31 %>%data.frame() %>% mutate(a = age31[1,lang]*0.9)
names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "")
age31[,lang+1][-1] <- NA
age61 <- age61 %>%data.frame() %>% mutate(a = age61[1,lang]*0.9)
names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "")
age61[,lang+1][-1] <- NA
age121 <- age121%>%data.frame() %>% mutate(a = age121[1,lang]*0.9)
names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "")
age121[,lang+1][-1] <- NA
age241 <- age241 %>%data.frame() %>% mutate(a = age241[1,lang]*0.9)
names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "")
age241[,lang+1][-1] <- NA
age24m1 <- age24m1 %>% mutate(a = age24m1[1,lang]*0.9)
names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "")
age24m1[,lang+1][-1] <- NA
toI1[2,lang][[1]] <- toI1[1,lang][[1]]#*runif(1,0.9,1.1)
# print(toI1[1,lang][[1]])
notoI1[2,lang][[1]] <- notoI1[1,lang][[1]]*runif(1,0.9,1.1)
age31[2,lang][[1]] <- age31[1,lang][[1]]*runif(1,0.9,1.1)
age61[2,lang][[1]] <- age61[1,lang][[1]]*runif(1,0.9,1.1)
age121[2,lang][[1]] <- age121[1,lang][[1]]*runif(1,0.9,1.1)
age241[2,lang][[1]] <- age241[1,lang][[1]]*runif(1,0.9,1.1)
age24m1[2,lang][[1]] <- age24m1[1,lang][[1]]*runif(1,0.9,1.1)
toI1[2,lang][[1]] <- toI1[1,lang][[1]]*runif(1,0.9,1.1)
}else if(i==2){
# 如果月份超过12,修改年和月份
if(month+1==12){
year = year+1
month=1
}else{
month=month+1
}
# 1
value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 2
value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 3
value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 4
value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 5
value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 6
value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 7
value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
}else{
value <- toI1[deep-(i-2),i]*0.6+toI1[deep-(i-1),i]*0.4
toI1[deep-(i-3),i] <- value
# 2
value <- notoI1[deep-(i-2),i]*0.6+notoI1[deep-(i-1),i]*0.4
notoI1[deep-(i-3),i] <- value
# 3
value <- age31[deep-(i-2),i]*0.6+age31[deep-(i-1),i]*0.4
age31[deep-(i-3),i] <- value
# 4
value <- age61[deep-(i-2),i]*0.6+age61[deep-(i-1),i]*0.4
age61[deep-(i-3),i] <- value
# 5
value <- age121[deep-(i-2),i]*0.6+age121[deep-(i-1),i]*0.4
age121[deep-(i-3),i] <- value
# 6
value <- age241[deep-(i-2),i]*0.6+age241[deep-(i-1),i]*0.4
age241[deep-(i-3),i] <- value
# 7
value <- age24m1[deep-(i-2),i]*0.6+age24m1[deep-(i-1),i]*0.4
age24m1[deep-(i-3),i] <- value
}
}
return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
}
##return(list(toI1[,-dim(toI1)[2]],notoI1[,-dim(notoI1)[2]],age31[,-dim(age31)[2]],
# age61[,-dim(age61)[2]],age121[,-dim(age121)[2]],age241[,-dim(age241)[2]],age24m1[,-dim(age241)]))
}
tmp <- predictRunoff(T = 1)
上面这个函数构造出了缺失率和发展量,下面要计算对应的一个流失 。
# 这个函数的输入是 predictRunoff 函数的输出
# month 等于24 对应的是 2020 年12月份
predict_funoff_num <- function(tmp,month=24){
# month == 24 表示 看24行以及以后的新发展
# 看23行的拍照用户
result <- c()
for (i in 1:2) {
o.value <- tmp[[i]][month:(dim(tmp[[1]])[1]-1),2] %>% as.numeric()
odd <- c()
n <- 1
for (j in (dim(tmp[[3]])[2]-month+1):3) {
odd[n] <- tmp[[i]][dim(tmp[[3]])[2]-(j-1),j][[1]][[1]]
n <- n+1
}
result[i] <- sum(o.value*odd)
}
for (i in 3:length(tmp)) {
result[i] <- as.numeric(tmp[[3]][month-1,2])* tmp[[3]][month-1,dim(tmp[[3]])[2]-24+2]
}
return(result)
}
result <- predict_funoff_num(tmp = tmp,month = 24)
sum(result)
## [1] 1010535
# write_csv(tmp[[1]],file = "2i.csv")
# write_csv(tmp[[2]],file = "非2i.csv")
# write_csv(tmp[[3]],file = "3个月.csv")
# write_csv(tmp[[4]],file = "6个月.csv")
# write_csv(tmp[[5]],file = "12个月.csv")
# write_csv(tmp[[6]],file = "24个月.csv")
# write_csv(tmp[[7]],file = "大于24个月.csv")
#library(xlsx)
计算平均
predictRunoffMean <- function(T=0,rate=1){
require(tidyverse)
year = 21
month = 4
time = 27
lang <- dim(age24)[2]
deep <- dim(age24)[1]
for (i in 2:lang) {
if(i==lang){
time <- time+1
toI1 <- toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
toI1[,lang+1][-1,1] <- NA
notoI1 <- notoI1 %>%data.frame()%>% mutate(a = notoI1[1,lang]*0.9)
names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
notoI1[,lang+1][-1,1] <- NA
age31 <- age31 %>%data.frame()%>% mutate(a = age31[1,lang]*0.9)
names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "")
age31[,lang+1][-1,1] <- NA
age61 <- age61 %>%data.frame()%>% mutate(a = age61[1,lang]*0.9)
names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "")
age61[,lang+1][-1,1] <- NA
age121 <- age121 %>%data.frame()%>% mutate(a = age121[1,lang]*0.9)
names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "")
age121[,lang+1][-1,1] <- NA
age241 <- age241 %>%data.frame()%>% mutate(a = age241[1,lang]*0.9)
names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "")
age241[,lang+1][-1,1] <- NA
age24m1 <- age24m1 %>%data.frame()%>% mutate(a = age24m1[1,lang]*0.9)
names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "")
age24m1[,lang+1][-1,1] <- NA
# 倒数第二行
toI1[2,lang] <- toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
notoI1[2,lang] <- notoI1[1,lang]*runif(1,min = 0.9,max = 1.1)
age31[2,lang] <- age31[1,lang]*runif(1,min = 0.9,max = 1.1)
age61[2,lang] <- age61[1,lang]*runif(1,min = 0.9,max = 1.1)
age121[2,lang] <- age121[1,lang]*runif(1,min = 0.9,max = 1.1)
age241[2,lang] <- age241[1,lang]*runif(1,min = 0.9,max = 1.1)
age24m1[2,lang] <- age24m1[1,lang]*runif(1,min = 0.9,max = 1.1)
toI1[2,lang] <- toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
}else if(i==2){
# 如果月份超过12,修改年和月份
if(month+1==12){
year = year+1
month=1
}else{
month=month+1
}
# 1
value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 2
value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 3
value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 4
value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 5
value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 6
value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 7
value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
}else{
value <- toI1[deep-(i-2),i]*0.5+toI1[deep-(i-1),i]*0.5
toI1[deep-(i-3),i] <- as.numeric(value)*rate
# 2
value <- notoI1[deep-(i-2),i]*0.5+notoI1[deep-(i-1),i]*0.5
notoI1[deep-(i-3),i] <- as.numeric(value)*rate
# 3
value <- age31[deep-(i-2),i]*0.5+age31[deep-(i-1),i]*0.5
age31[deep-(i-3),i] <- as.numeric(value)*rate
# 4
value <- age61[deep-(i-2),i]*0.5+age61[deep-(i-1),i]*0.5
age61[deep-(i-3),i] <- as.numeric(value)*rate
# 5
value <- age121[deep-(i-2),i]*0.5+age121[deep-(i-1),i]*0.5
age121[deep-(i-3),i] <- as.numeric(value)*rate
# 6
value <- age241[deep-(i-2),i]*0.5+age241[deep-(i-1),i]*0.5
age241[deep-(i-3),i] <- as.numeric(value)*rate
# 7
value <- age24m1[deep-(i-2),i]*0.5+age24m1[deep-(i-1),i]*0.5
age24m1[deep-(i-3),i] <- as.numeric(value)*rate
}
}
if(T==0){
return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
}else if(T==1){
lang <- dim(toI1)[2]
deep <- dim(toI1)[1]
for (i in 2:lang) {
if(i==lang){
time <- time+1
toI1 <- toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
toI1[,lang+1][-1] <- NA
notoI1 <- notoI1 %>%data.frame() %>% mutate(a = notoI1[1,lang]*0.9)
names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
notoI1[,lang+1][-1] <- NA
age31 <- age31 %>%data.frame() %>% mutate(a = age31[1,lang]*0.9)
names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "")
age31[,lang+1][-1] <- NA
age61 <- age61 %>%data.frame() %>% mutate(a = age61[1,lang]*0.9)
names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "")
age61[,lang+1][-1] <- NA
age121 <- age121%>%data.frame() %>% mutate(a = age121[1,lang]*0.9)
names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "")
age121[,lang+1][-1] <- NA
age241 <- age241 %>%data.frame() %>% mutate(a = age241[1,lang]*0.9)
names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "")
age241[,lang+1][-1] <- NA
age24m1 <- age24m1 %>% mutate(a = age24m1[1,lang]*0.9)
names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "")
age24m1[,lang+1][-1] <- NA
toI1[2,lang][[1]] <- toI1[1,lang][[1]]#*runif(1,0.9,1.1)
# print(toI1[1,lang][[1]])
notoI1[2,lang][[1]] <- notoI1[1,lang][[1]]*runif(1,0.9,1.1)
age31[2,lang][[1]] <- age31[1,lang][[1]]*runif(1,0.9,1.1)
age61[2,lang][[1]] <- age61[1,lang][[1]]*runif(1,0.9,1.1)
age121[2,lang][[1]] <- age121[1,lang][[1]]*runif(1,0.9,1.1)
age241[2,lang][[1]] <- age241[1,lang][[1]]*runif(1,0.9,1.1)
age24m1[2,lang][[1]] <- age24m1[1,lang][[1]]*runif(1,0.9,1.1)
toI1[2,lang][[1]] <- toI1[1,lang][[1]]*runif(1,0.9,1.1)
}else if(i==2){
# 如果月份超过12,修改年和月份
if(month+1==12){
year = year+1
month=1
}else{
month=month+1
}
# 1
value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 2
value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 3
value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 4
value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 5
value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 6
value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 7
value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
}else{
value <- toI1[deep-(i-2),i]*0.5+toI1[deep-(i-1),i]*0.5
toI1[deep-(i-3),i] <- as.numeric(value)*rate
# 2
value <- notoI1[deep-(i-2),i]*0.5+notoI1[deep-(i-1),i]*0.5
notoI1[deep-(i-3),i] <- as.numeric(value)*rate
# 3
value <- age31[deep-(i-2),i]*0.5+age31[deep-(i-1),i]*0.5
age31[deep-(i-3),i] <- as.numeric(value)*rate
# 4
value <- age61[deep-(i-2),i]*0.5+age61[deep-(i-1),i]*0.5
age61[deep-(i-3),i] <- as.numeric(value)*rate
# 5
value <- age121[deep-(i-2),i]*0.5+age121[deep-(i-1),i]*0.5
age121[deep-(i-3),i] <- as.numeric(value)*rate
# 6
value <- age241[deep-(i-2),i]*0.5+age241[deep-(i-1),i]*0.5
age241[deep-(i-3),i] <- as.numeric(value)*rate
# 7
value <- age24m1[deep-(i-2),i]*0.5+age24m1[deep-(i-1),i]*0.5
age24m1[deep-(i-3),i] <- as.numeric(value)*rate
}
}
return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
}
##return(list(toI1[,-dim(toI1)[2]],notoI1[,-dim(notoI1)[2]],age31[,-dim(age31)[2]],
# age61[,-dim(age61)[2]],age121[,-dim(age121)[2]],age241[,-dim(age241)[2]],age24m1[,-dim(age241)]))
}
tmp <- predictRunoffMean(T = 0,rate = 1)
result <- predict_funoff_num(tmp = tmp,month = 19)
sum(result)
## [1] 1301473
计算最大
predictRunoffMAX <- function(T=0,rate=1){
require(tidyverse)
year = 21
month = 4
time = 27
lang <- dim(age24)[2]
deep <- dim(age24)[1]
for (i in 2:lang) {
if(i==lang){
time <- time+1
toI1 <- toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
toI1[,lang+1][-1,1] <- NA
notoI1 <- notoI1 %>%data.frame()%>% mutate(a = notoI1[1,lang]*0.9)
names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
notoI1[,lang+1][-1,1] <- NA
age31 <- age31 %>%data.frame()%>% mutate(a = age31[1,lang]*0.9)
names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "")
age31[,lang+1][-1,1] <- NA
age61 <- age61 %>%data.frame()%>% mutate(a = age61[1,lang]*0.9)
names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "")
age61[,lang+1][-1,1] <- NA
age121 <- age121 %>%data.frame()%>% mutate(a = age121[1,lang]*0.9)
names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "")
age121[,lang+1][-1,1] <- NA
age241 <- age241 %>%data.frame()%>% mutate(a = age241[1,lang]*0.9)
names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "")
age241[,lang+1][-1,1] <- NA
age24m1 <- age24m1 %>%data.frame()%>% mutate(a = age24m1[1,lang]*0.9)
names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "")
age24m1[,lang+1][-1,1] <- NA
# 倒数第二行
toI1[2,lang] <- toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
notoI1[2,lang] <- notoI1[1,lang]*runif(1,min = 0.9,max = 1.1)
age31[2,lang] <- age31[1,lang]*runif(1,min = 0.9,max = 1.1)
age61[2,lang] <- age61[1,lang]*runif(1,min = 0.9,max = 1.1)
age121[2,lang] <- age121[1,lang]*runif(1,min = 0.9,max = 1.1)
age241[2,lang] <- age241[1,lang]*runif(1,min = 0.9,max = 1.1)
age24m1[2,lang] <- age24m1[1,lang]*runif(1,min = 0.9,max = 1.1)
toI1[2,lang] <- toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
}else if(i==2){
# 如果月份超过12,修改年和月份
if(month+1==12){
year = year+1
month=1
}else{
month=month+1
}
# 1
value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 2
value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 3
value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 4
value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 5
value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 6
value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 7
value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
}else{
value <- ifelse(toI1[deep-(i-2),i]>toI1[deep-(i-1),i],toI1[deep-(i-2),i],toI1[deep-(i-1),i])
toI1[deep-(i-3),i] <- as.numeric(value)*rate
# 2
value <- ifelse(notoI1[deep-(i-2),i]>notoI1[deep-(i-1),i],notoI1[deep-(i-2),i],notoI1[deep-(i-1),i])
notoI1[deep-(i-3),i] <- as.numeric(value)*rate
# 3
value <- ifelse(age31[deep-(i-2),i]>age31[deep-(i-1),i],age31[deep-(i-2),i],age31[deep-(i-1),i])
age31[deep-(i-3),i] <- as.numeric(value)*rate
# 4
value <- ifelse(age61[deep-(i-2),i]>age61[deep-(i-1),i],age61[deep-(i-2),i],age61[deep-(i-1),i])
age61[deep-(i-3),i] <- as.numeric(value)*rate
# 5
value <- ifelse(age121[deep-(i-2),i]>age121[deep-(i-1),i],age121[deep-(i-2),i],age121[deep-(i-1),i])
age121[deep-(i-3),i] <- as.numeric(value)*rate
# 6
value <- ifelse(age241[deep-(i-2),i]>age241[deep-(i-1),i],age241[deep-(i-2),i],age241[deep-(i-1),i])
age241[deep-(i-3),i] <- as.numeric(value)*rate
# 7
value <- ifelse(age24m1[deep-(i-2),i]>age24m1[deep-(i-1),i],age24m1[deep-(i-2),i],age24m1[deep-(i-1),i])
age24m1[deep-(i-3),i] <- as.numeric(value)*rate
}
}
if(T==0){
return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
}else if(T==1){
lang <- dim(toI1)[2]
deep <- dim(toI1)[1]
for (i in 2:lang) {
if(i==lang){
time <- time+1
toI1 <- toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
toI1[,lang+1][-1] <- NA
notoI1 <- notoI1 %>%data.frame() %>% mutate(a = notoI1[1,lang]*0.9)
names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "")
notoI1[,lang+1][-1] <- NA
age31 <- age31 %>%data.frame() %>% mutate(a = age31[1,lang]*0.9)
names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "")
age31[,lang+1][-1] <- NA
age61 <- age61 %>%data.frame() %>% mutate(a = age61[1,lang]*0.9)
names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "")
age61[,lang+1][-1] <- NA
age121 <- age121%>%data.frame() %>% mutate(a = age121[1,lang]*0.9)
names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "")
age121[,lang+1][-1] <- NA
age241 <- age241 %>%data.frame() %>% mutate(a = age241[1,lang]*0.9)
names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "")
age241[,lang+1][-1] <- NA
age24m1 <- age24m1 %>% mutate(a = age24m1[1,lang]*0.9)
names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "")
age24m1[,lang+1][-1] <- NA
toI1[2,lang][[1]] <- toI1[1,lang][[1]]#*runif(1,0.9,1.1)
# print(toI1[1,lang][[1]])
notoI1[2,lang][[1]] <- notoI1[1,lang][[1]]*runif(1,0.9,1.1)
age31[2,lang][[1]] <- age31[1,lang][[1]]*runif(1,0.9,1.1)
age61[2,lang][[1]] <- age61[1,lang][[1]]*runif(1,0.9,1.1)
age121[2,lang][[1]] <- age121[1,lang][[1]]*runif(1,0.9,1.1)
age241[2,lang][[1]] <- age241[1,lang][[1]]*runif(1,0.9,1.1)
age24m1[2,lang][[1]] <- age24m1[1,lang][[1]]*runif(1,0.9,1.1)
toI1[2,lang][[1]] <- toI1[1,lang][[1]]*runif(1,0.9,1.1)
}else if(i==2){
# 如果月份超过12,修改年和月份
if(month+1==12){
year = year+1
month=1
}else{
month=month+1
}
# 1
value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 2
value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 3
value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 4
value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 5
value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 6
value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
# 7
value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
}else{
value <- ifelse(toI1[deep-(i-2),i]>toI1[deep-(i-1),i],toI1[deep-(i-2),i],toI1[deep-(i-1),i])
toI1[deep-(i-3),i] <- as.numeric(value)*rate
# 2
value <- ifelse(notoI1[deep-(i-2),i]>notoI1[deep-(i-1),i],notoI1[deep-(i-2),i],notoI1[deep-(i-1),i])
notoI1[deep-(i-3),i] <- as.numeric(value)*rate
# 3
value <- ifelse(age31[deep-(i-2),i]>age31[deep-(i-1),i],age31[deep-(i-2),i],age31[deep-(i-1),i])
age31[deep-(i-3),i] <- as.numeric(value)*rate
# 4
value <- ifelse(age61[deep-(i-2),i]>age61[deep-(i-1),i],age61[deep-(i-2),i],age61[deep-(i-1),i])
age61[deep-(i-3),i] <- as.numeric(value)*rate
# 5
value <- ifelse(age121[deep-(i-2),i]>age121[deep-(i-1),i],age121[deep-(i-2),i],age121[deep-(i-1),i])
age121[deep-(i-3),i] <- as.numeric(value)*rate
# 6
value <- ifelse(age241[deep-(i-2),i]>age241[deep-(i-1),i],age241[deep-(i-2),i],age241[deep-(i-1),i])
age241[deep-(i-3),i] <- as.numeric(value)*rate
# 7
value <- ifelse(age24m1[deep-(i-2),i]>age24m1[deep-(i-1),i],age24m1[deep-(i-2),i],age24m1[deep-(i-1),i])
age24m1[deep-(i-3),i] <- as.numeric(value)*rate
}
}
return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
}
##return(list(toI1[,-dim(toI1)[2]],notoI1[,-dim(notoI1)[2]],age31[,-dim(age31)[2]],
# age61[,-dim(age61)[2]],age121[,-dim(age121)[2]],age241[,-dim(age241)[2]],age24m1[,-dim(age241)]))
}
# [1009886,1028324]
# [1000749,1039770]
通过这种方式计算的关键是计算流失率,但是根据历史的流失率推算未来的流失率会偏低。
测算结果是,5月份的预测区间是:[1000749,1039770] 6月份的预测区间是[1009886,1028324]
根据每一个月每一天的流失量,计算每一天流失的变化,这个变化表示的就是环比(例如,今天相比于昨天的一个百分比),流失量变化了多少 。例如2020年1月11号的环比为:1943312/1962916 .其中1962916标识的是10号的流失量,1943312表示的是11号的流失量。
这样计算出每一天的一个环比变化。如果每一天的环比波动非常小。那么当我们想要推算5月份的流失量的时候,只需要通过历史的环比推算5月份每一天的环比变化,然后根据5月10号的流失结果即可得出5月末的流失。
final <- read.csv("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/2020to2021runoff.csv")
final1 <- final
for (i in 11:31) {
final1[,i] <- final[,i]/final[,i-1]
}
names(final1)[6:31] <- 6:31
head(final1)
## X.月份 月流失 月环比 准流失 日累计环比 6 7 8 9 10 11
## 1 1月 175.2 NA NA NA NA NA NA NA 1962916 0.9900128
## 2 2月 186.7 11.48 NA NA NA NA NA NA 2060212 0.9899156
## 3 3月 157.0 -29.68 NA NA NA NA NA NA 1856768 0.9888365
## 4 4月 168.2 11.18 163.14 NA NA NA NA NA 1922137 0.9900127
## 5 5月 141.3 -26.92 137.68 -25.46 NA NA NA NA 1743174 0.9875910
## 6 6月 140.7 -0.60 137.47 -0.21 NA NA NA NA 1703864 0.9858627
## 12 13 14 15 16 17 18
## 1 0.9913735 0.9922665 0.9913766 0.9917453 0.9920256 0.9919755 0.9929995
## 2 0.9910608 0.9926366 0.9929862 0.9931073 0.9940208 0.9933767 0.9932573
## 3 0.9895814 0.9898223 0.9914219 0.9923084 0.9916661 0.9921529 0.9927388
## 4 0.9911269 0.9917479 0.9924052 0.9927797 0.9922648 0.9926458 0.9935356
## 5 0.9891464 0.9885851 0.9865530 0.9935043 0.9767151 0.9925179 0.9915893
## 6 0.9872423 0.9894533 0.9910272 0.9858849 0.9905762 0.9909444 0.9910084
## 19 20 21 22 23 24 25
## 1 0.9928750 0.9891875 0.9931712 0.9945694 0.9952373 0.9915564 0.9953615
## 2 0.9935253 0.9891235 0.9924880 0.9935946 0.9943276 0.9898323 0.9916594
## 3 0.9934958 0.9887022 0.9925042 0.9938597 0.9936067 0.9892385 0.9933024
## 4 0.9944473 0.9856108 0.9921165 0.9926161 0.9891535 0.9868224 0.9928489
## 5 0.9912683 0.9862671 0.9918407 0.9925987 0.9938641 0.9878027 0.9882346
## 6 0.9903241 0.9833436 0.9913013 0.9899536 0.9873684 0.9809115 0.9882617
## 26 27 28 29 30 31
## 1 0.9977196 0.9971429 0.9995532 0.9978396 0.9975205 0.9973764
## 2 0.9961605 0.9936199 0.9942384 0.9944099 NA NA
## 3 0.9946201 0.9962245 0.9946631 0.9953910 0.9948715 NA
## 4 0.9951281 0.9886437 0.9946226 0.9943022 0.9939077 NA
## 5 0.9923664 0.9851440 0.9863745 0.9784311 0.9892487 0.9959389
## 6 0.9943065 0.9901178 0.9902620 0.9941325 0.9943250 NA
然后对环比数据进行可视化.
final1 <- gather(data = final1,"时间","环比",11:31)
p <- ggplot(data = final1,aes(x = as.numeric(时间),y=环比,color = X.月份))+geom_line()
p +labs(color="月份") +theme_bw()
## Warning: Removed 473 row(s) containing missing values (geom_path).
从结果中可以看到,环比的变化可以是比较稳定,通常而言变化是再0.0005以内的。可以尝试某一月份的日环比等于近几个月份的日环比的平均值。
我们测算一下,4月份,4月10号的日流失为1609363,2月和3月的平均环比为0.99007和0.99021 ,那么4月份的流失的预测结果为:
1475966 * ((0.99007+0.99021)/2)^20
## [1] 1210622
4月份的真实流失为:1195862 ,那么计算的误差是:
abs(1195862-1210622)/1210622
## [1] 0.01219208
我们测算一下,3月份,3月10号的日流失为1373942,2月和3月的平均环比为0.99007和0.99021 ,那么4月份的流失的预测结果为:
1373942 * (0.99007)^21
## [1] 1114172
3月份的真实流失为:1195862 ,那么计算的误差是:
abs(1114172-1108165)/1108165
## [1] 0.005420673
假如我们要测算5月份的流失量,那么5月10号的日流失量为1694129,4月和3月的平均环比为0.99021 和 0.99038,那么5月份的月流失是:
1694129 * ((0.99021+0.99038)/2)^21
## [1] 1380393
如果希望通过这种方法来测算T+1 ,那么问题又变成了测算T+1月份的10号的流失量。这里假设,每一个月的流失量与前两个月是强相关的,用4月和5月10号的流失的平均值标识6月10号的流失量,进一步推算6月份的流失。推算6月份的流失量为:
((1609363+1694129)/2)* (((0.99021+0.99038)/2+0.99021 )/2)^2
## [1] 1619702
通过这种方式,测算结果为 :1357883