随着电信行业用户增长趋于缓和,存量客户运营与维系逐渐成为重点,预防与减少用户流失与价值提升是存量用户维系的两大重点,价值提升中收入保有也是关注的重要指标。收入预测模型通过预测收入及收保,通过预测收入及收保可以更及时的观测该考核指标的变化,为各地市维系工作者提供数据支撑。

在做预测之前,我们首先要明确一个假设,我们假设未来的结果是与过去的结果相关的。相关度越高,我们越能根据过去来预测未来的结果。如果关于未来的结果的波动是比较无规律的,那么预测未来的结果则是比较困难的。

思路1

第一个思路是直接根据流失的历史数据来进行预测,在这里选取的数据集是2019年1月份-20201年4月份的月流失数据。

# 这个数据和平台上面的数据是能够对得上的

library(readxl)
library(lubridate)
library(tidyverse)
runoff2019 <- read.csv("/Users/milin/Downloads/2019流失数据.csv")

runoff2020 <- read.csv("/Users/milin/Downloads/2020流失数据.csv")

runoff2021 <- read.csv("/Users/milin/Downloads/2021流失数据.csv")



# 关注2020年11月份

# 重新命名


names(runoff2019) <- c("地市",as.character(ym("201812")+months(1:12)))

names(runoff2020) <- c("地市",as.character(ym("201912")+months(1:12)))

names(runoff2021) <- c("地市",as.character(ym("202012")+months(1:4)))
# 数据准备好了,然后将宽数据转变成为长数据。

runoff2019 <- runoff2019 %>% gather("时间","流失量",2:13)

runoff2020 <- runoff2020 %>% gather("时间","流失量",2:13)

runoff2021 <- runoff2021 %>% gather("时间","流失量",2:5)
# 将这三年的数据合并成为同一份数据
runoff20192021 <- runoff2019 %>% bind_rows(runoff2020) %>% bind_rows(runoff2021)

第一步,对于月流失数据进行可视化,分地市绘制关于流失的时间序列图。

# 到这里数据就处理好了,接下来要做的是进行数据可视化。
library(tidyverse)
library(plotly)
library(showtext)
showtext_auto()

runoff20192021$时间 <- date(runoff20192021$时间)

p <- ggplot(data = runoff20192021,aes(x = 时间,y = (流失量),color = 地市)) + geom_line()
p

ggplotly(p)

可以看到,流失量的波动还是比较大的,这意味着流失序列不稳定,对于不稳定或者趋势不明显的时间序列进行建模,即使模型将将历史数据拟合比较好,在预测的时候,预测结果和真实结果的误差还是会比较大。

对流失量进行取对数,然后分地市进行可视化.取对数可以让数据之间的差异变得更小。

# 对流失量进行取对数
p <- ggplot(data = runoff20192021,aes(x = 时间,y = log(流失量),color = 地市)) + geom_line()
p

ggplotly(p)

对月流失序列取对数之后,可以看到,序列变得稍微平稳一些。对取对数之后的流失量构建prophet模型。

Prophet 是以时间序列分解和曲线拟合思想为基础建立的模型。Prophet 也是一个时间复杂度较低、建模简单的模型。与 ARIMA 模型相比,Prophet的优点是对节假日和突发事件造成的数据波动有很好的拟合效果,并能在一定程度上拟合非线性数据,比较适用于长期的周期明显的时间序列预测;但是其在趋势、周期不明显的时间序列上效果不佳。

这里首先筛选全省的数据,然后对全省的数据使用prophet进行建模。

library(prophet)

# 筛选出全省的数据 

prorunoff <- runoff20192021 %>% filter(地市== "全省")

# 真理记录一下真实结果

truvalue <- prorunoff

history <- data.frame(ds = prorunoff$时间[1:24],
                      y = log(prorunoff$流失量[1:24]))
m <- prophet(history,interval.width = 0.95)

 
future <- make_future_dataframe(m, periods = 4,freq = "month")
forecast <- predict(m, future)
plot(m, forecast)

result <- prorunoff %>% mutate(pre = forecast$yhat,lowpre = forecast$yhat_lower,upperpre = forecast$yhat_upper)

tail(result)
##    地市       时间  流失量      pre   lowpre upperpre
## 23 全省 2020-11-01 1515336 14.13161 14.03473 14.22911
## 24 全省 2020-12-01 1321730 14.12619 14.03102 14.22061
## 25 全省 2021-01-01 1174921 14.12058 14.02842 14.20941
## 26 全省 2021-02-01 1243171 14.11497 14.02350 14.21026
## 27 全省 2021-03-01 1108165 14.10990 14.00952 14.21385
## 28 全省 2021-04-01 1195862 14.10430 13.97599 14.23096

在这里,我们是使用2019年1月份到2020年12月份的数据训练模型,然后使用2021年1月到4月份的数据作为测算。 从数据中可以看到,模型其实是很好的拟合了数据的趋势的。接下来查看模型的误差。

options(scipen = 10)
result <- result %>% mutate(log流失量 = log(流失量),logerrer = abs((log(流失量)-pre))/log(流失量)*100,errer = abs(((流失量)-exp(pre)))/(流失量)*100,loweexp = exp(lowpre),uppexp = exp(upperpre),preexp=exp(pre))


tail(result)
##    地市       时间  流失量      pre   lowpre upperpre log流失量  logerrer
## 23 全省 2020-11-01 1515336 14.13161 14.03473 14.22911  14.23115 0.6994210
## 24 全省 2020-12-01 1321730 14.12619 14.03102 14.22061  14.09445 0.2251458
## 25 全省 2021-01-01 1174921 14.12058 14.02842 14.20941  13.97671 1.0293246
## 26 全省 2021-02-01 1243171 14.11497 14.02350 14.21026  14.03318 0.5828571
## 27 全省 2021-03-01 1108165 14.10990 14.00952 14.21385  13.91822 1.3772457
## 28 全省 2021-04-01 1195862 14.10430 13.97599 14.23096  13.99438 0.7854465
##        errer loweexp  uppexp  preexp
## 23  9.474232 1245100 1512246 1371770
## 24  3.224193 1240490 1499456 1364345
## 25 15.472905 1237273 1482758 1356715
## 26  8.523153 1231197 1484009 1349128
## 27 21.129257 1214109 1489351 1342312
## 28 11.618693 1174079 1515055 1334806

结果中,pre标识的预测值,lowpre标识的是预测值的下界,upperpre表示的是预测值的上界,log流失量标识的是对真实的流失量取对数,logerrer标识的是取对数之后的误差,需要注意的是这里误差是百分比 。errer标识的是没有取对计算的误差。loweexp表示预测结果不取对数时候的下界,uppexp表示预测结果不取对数时候的上界,preexp表示预测结果不取对数时候的预测值。

从结果可以看到,如果对数据取了对数,也就是序列比较平稳的时候,不管是训练误差还是测试误差都是比较小的。如果不取对数,模型的训练误差还是比较小,说明模型还是能拟合现有的数据,但是测试误差比较大,说明数据波动还是太大,对于新的一个月,模型无法基于历史结果给出比较准确的预测。

根据这种方法预测5-6月份的数据

truvalue <- prorunoff

history <- data.frame(ds = prorunoff$时间,
                      y = log(prorunoff$流失量))
m <- prophet(history,interval.width = 0.95)

 
future <- make_future_dataframe(m, periods = 2,freq = "month")
forecast <- predict(m, future)
plot(m, forecast)

tail(forecast)
##            ds    trend additive_terms additive_terms_lower additive_terms_upper
## 25 2021-01-01 14.02105    -0.04433636          -0.04433636          -0.04433636
## 26 2021-02-01 14.01894     0.01423223           0.01423223           0.01423223
## 27 2021-03-01 14.01704    -0.09882714          -0.09882714          -0.09882714
## 28 2021-04-01 14.01494    -0.02056124          -0.02056124          -0.02056124
## 29 2021-05-01 14.01290    -0.11991164          -0.11991164          -0.11991164
## 30 2021-06-01 14.01080    -0.09220867          -0.09220867          -0.09220867
##         yearly yearly_lower yearly_upper multiplicative_terms
## 25 -0.04433636  -0.04433636  -0.04433636                    0
## 26  0.01423223   0.01423223   0.01423223                    0
## 27 -0.09882714  -0.09882714  -0.09882714                    0
## 28 -0.02056124  -0.02056124  -0.02056124                    0
## 29 -0.11991164  -0.11991164  -0.11991164                    0
## 30 -0.09220867  -0.09220867  -0.09220867                    0
##    multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 25                          0                          0   13.97671   13.97671
## 26                          0                          0   14.03318   14.03318
## 27                          0                          0   13.91822   13.91822
## 28                          0                          0   13.99438   13.99438
## 29                          0                          0   13.80422   13.97013
## 30                          0                          0   13.66727   14.13222
##    trend_lower trend_upper     yhat
## 25    14.02105    14.02105 13.97671
## 26    14.01894    14.01894 14.03318
## 27    14.01704    14.01704 13.91822
## 28    14.01494    14.01494 13.99438
## 29    13.92413    14.09004 13.89299
## 30    13.75948    14.22443 13.91859

思路2

通过前几天的日流失数据,来预测本月末的流失数据。这一部分是获取数据以及对数据进行基本的处理。这里所使用的数据的2021那年1月份到2021年5月份每一天的日流失数据。

# 首先还是获取数据,这里获取的数据是2021年1-5月份的数据
runoff202101 <- read.csv("/Users/milin/Downloads/202101流失.csv")
runoff202102 <- read.csv("/Users/milin/Downloads/202102流失.csv")
runoff202103 <- read.csv("/Users/milin/Downloads/202103流失.csv")
runoff202104 <- read.csv("/Users/milin/Downloads/202104流失.csv")
runoff202105 <- read.csv("/Users/milin/Downloads/202105流失.csv")


names(runoff202101) <- c("地市",as.character(ym("2021-01")+days(0:30)))
names(runoff202102) <- c("地市",as.character(ym("2021-02")+days(0:27)))
names(runoff202103) <- c("地市",as.character(ym("2021-03")+days(0:30)))
names(runoff202104) <- c("地市",as.character(ym("2021-04")+days(0:29)))
names(runoff202105) <- c("地市",as.character(ym("2021-05")+days(0:9)))




runoff202101 <- runoff202101 %>% gather("时间","流失量",2:32)
runoff202102 <- runoff202102 %>% gather("时间","流失量",2:29)
runoff202103 <- runoff202103 %>% gather("时间","流失量",2:32)
runoff202104 <- runoff202104 %>% gather("时间","流失量",2:31)
runoff202105 <- runoff202105 %>% gather("时间","流失量",2:11)


runoff2021_dayofdata <- runoff202101 %>% bind_rows(runoff202102) %>% bind_rows(runoff202103) %>% bind_rows(runoff202104) %>% bind_rows(runoff202105)   

对每一天的日流失数据进行可视化,观察数据。

runoff2021_dayofdata$时间 <- date(runoff2021_dayofdata$时间)

p <- ggplot(data = runoff2021_dayofdata,aes(x = 时间,y = 流失量,color = 地市)) + geom_line()
p

ggplotly(p)

从数据中可以看出,每一个月月初和月末之间,流失数量的变化趋势是非常类似的,因此再每一个月初的时候,就可以尝试预测月末的数据。这里使用的是使用6号到10号的日数据来预测月末的数据。

尝试使用2月份的数据来验证。

# 筛选出全省的数据 

prorunoff <- runoff2021_dayofdata %>% filter(地市== "全省") %>% filter(month(时间)==2)



history <- data.frame(ds = prorunoff$时间[c(6:10)],
                      y = prorunoff$流失量[c(6:10)])
m <- prophet(history)

 
future <- make_future_dataframe(m, periods = 18,freq = "day")
forecast <- predict(m, future)

plot(m,forecast)

tail(forecast)
##            ds   trend zeros zeros_lower zeros_upper additive_terms
## 18 2021-02-23 1367426     0           0           0              0
## 19 2021-02-24 1346986     0           0           0              0
## 20 2021-02-25 1326546     0           0           0              0
## 21 2021-02-26 1306106     0           0           0              0
## 22 2021-02-27 1285666     0           0           0              0
## 23 2021-02-28 1265226     0           0           0              0
##    additive_terms_lower additive_terms_upper multiplicative_terms
## 18                    0                    0                    0
## 19                    0                    0                    0
## 20                    0                    0                    0
## 21                    0                    0                    0
## 22                    0                    0                    0
## 23                    0                    0                    0
##    multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 18                          0                          0  1001136.2    1788683
## 19                          0                          0   943472.3    1813645
## 20                          0                          0   881063.2    1845929
## 21                          0                          0   817795.7    1880120
## 22                          0                          0   748898.7    1908654
## 23                          0                          0   668565.5    1939543
##    trend_lower trend_upper    yhat
## 18   1001136.2     1788683 1367426
## 19    943472.3     1813645 1346986
## 20    881063.2     1845929 1326546
## 21    817795.7     1880120 1306106
## 22    748898.7     1908654 1285666
## 23    668565.5     1939543 1265226

预测结果显示,月流失数据为:1265226 ,而真实的结果是:1243171,差异为:

abs(1265226-1243171)/1265226
## [1] 0.01743167

用3月份的数据测算一下.

prorunoff <- runoff2021_dayofdata %>% filter(地市== "全省") %>% filter(month(时间)==3)


history <- data.frame(ds = prorunoff$时间[c(6:10)],
                      y = prorunoff$流失量[c(6:10)])
m <- prophet(history)

 
future <- make_future_dataframe(m, periods = 21,freq = "day")
forecast <- predict(m, future)


tail(forecast)
##            ds   trend zeros zeros_lower zeros_upper additive_terms
## 21 2021-03-26 1169246     0           0           0              0
## 22 2021-03-27 1147851     0           0           0              0
## 23 2021-03-28 1126456     0           0           0              0
## 24 2021-03-29 1105061     0           0           0              0
## 25 2021-03-30 1083666     0           0           0              0
## 26 2021-03-31 1062271     0           0           0              0
##    additive_terms_lower additive_terms_upper multiplicative_terms
## 21                    0                    0                    0
## 22                    0                    0                    0
## 23                    0                    0                    0
## 24                    0                    0                    0
## 25                    0                    0                    0
## 26                    0                    0                    0
##    multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 21                          0                          0   850747.3    1499738
## 22                          0                          0   801435.9    1503280
## 23                          0                          0   753262.9    1509113
## 24                          0                          0   696819.0    1514419
## 25                          0                          0   632986.9    1524099
## 26                          0                          0   580258.5    1534287
##    trend_lower trend_upper    yhat
## 21    850747.3     1499738 1169246
## 22    801435.9     1503280 1147851
## 23    753262.9     1509113 1126456
## 24    696819.0     1514419 1105061
## 25    632986.9     1524099 1083666
## 26    580258.5     1534287 1062271
plot(m,forecast)

# 计算一下误差

abs(truvalue$流失量[27]-forecast$yhat[26])/truvalue$流失量[27]
## [1] 0.04141441

从结果中可以看到,预测结果为1062271,真实结果为 1108165 .误差为0.04141441 。

接下来使用4月份的数据来做测算

prorunoff <- runoff2021_dayofdata %>% filter(地市== "全省") %>% filter(month(时间)==4)


history <- data.frame(ds = prorunoff$时间[c(6:10)],
                      y = prorunoff$流失量[c(6:10)])
m <- prophet(history)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## n.changepoints greater than number of observations. Using 3
future <- make_future_dataframe(m, periods = 20,freq = "day")
forecast <- predict(m, future)


tail(forecast)
##            ds   trend zeros zeros_lower zeros_upper additive_terms
## 20 2021-04-25 1271413     0           0           0              0
## 21 2021-04-26 1248883     0           0           0              0
## 22 2021-04-27 1226353     0           0           0              0
## 23 2021-04-28 1203823     0           0           0              0
## 24 2021-04-29 1181293     0           0           0              0
## 25 2021-04-30 1158763     0           0           0              0
##    additive_terms_lower additive_terms_upper multiplicative_terms
## 20                    0                    0                    0
## 21                    0                    0                    0
## 22                    0                    0                    0
## 23                    0                    0                    0
## 24                    0                    0                    0
## 25                    0                    0                    0
##    multiplicative_terms_lower multiplicative_terms_upper yhat_lower yhat_upper
## 20                          0                          0   984644.3    1539789
## 21                          0                          0   929644.6    1537820
## 22                          0                          0   871387.1    1542808
## 23                          0                          0   814410.6    1561913
## 24                          0                          0   763535.9    1577149
## 25                          0                          0   708271.6    1583362
##    trend_lower trend_upper    yhat
## 20    984644.3     1539789 1271413
## 21    929644.6     1537820 1248883
## 22    871387.1     1542808 1226353
## 23    814410.6     1561913 1203823
## 24    763535.9     1577149 1181293
## 25    708271.6     1583362 1158763
plot(m,forecast)

# 计算一下误差

abs(truvalue$流失量[28]-1158763)/truvalue$流失量[28]
## [1] 0.03102281

使用4月份来做测算,预测结果为1158763,真是结果为1195862 ,误差0.03102281 。测算了三个月,初步可以判断,如果使用每个月6号到10号的数据来测算月度流失,误差约为3%左右。另外,如果模型所使用的时间越长例如是6号到15号的数据,那么误差可能会更小。

测算T+1

如果希望测算T+1 ,也就是知道在5月份,希望测算6月份的流失。思路可以是通过5月份以及之前的月份的6到10号的日流失数据预测出6月份6-10号的日流失数据,然后根据6月份所测算的日流失数据来预测6月份的月流失数据。

预测T+1 这里也尝试了很多的方法,第一种方法是根据历史的6号到10号的数据预测T+1的6号的数据,然后使用prophet模型预测根据6月6号到10号的数据预测6月末的流失情况.

# 先筛选出省份的数据

allpriv_runoff <- runoff2021_dayofdata %>% filter(地市 == "全省")
 
# 还需要将数据进行转化,转化得出一个月中每一天的结果


allpriv_runoff <- allpriv_runoff %>% mutate(月份 = month(时间),日期=day(时间))


allpriv_runoff <- allpriv_runoff %>% select(-时间)%>% spread(key = 日期,value=流失量) 

allpriv_runoff <- allpriv_runoff%>% arrange(月份)


# 将5月的数据填充进去
allpriv_runoff[5,13] <- 1675392 
allpriv_runoff[5,14] <- 165600

# 预测4月份
mydata <- allpriv_runoff
  tmp <- mydata[c(2,3),c(8:12)]
  tmp <- (tmp[1,]+tmp[2,])/2
  # 这里预测的是四月份的日流失,预测结果的误差是
  # -0.02384465 -0.02358408 -0.02161841 -0.02366546 -0.02299481
  
  history <- data.frame(ds = date("2021-04-01")+days(5:9),
                      y =as.numeric(tmp))
  m <- prophet(history)

 
future <- make_future_dataframe(m, periods = 20,freq = "day")
forecast <- predict(m, future)


tail(forecast)

f <- function(lang=0,type=1){
  mydata <- allpriv_runoff
  tmp <- mydata[c(2,3),c(8:(12+lang))] #
 

  if(type==1){
     tmp <- (tmp[1,]+tmp[2,])/2  # 这里是通过算数平均得出结果,我们尝试使用几何平均
  }else{
     tmp <- sqrt(as.numeric(tmp[1,])*as.numeric(tmp[2,])) # 几何平均
  }
  
  history <- data.frame(ds = date("2021-04-01")+days(5:(9+lang)),
                      y =as.numeric(tmp))
  m <- prophet(history)

 
future <- make_future_dataframe(m, periods = 20-lang,freq = "day")
forecast <- predict(m, future)


return(forecast)
}


tmp <- f(lang = 10,type=1)
print(abs(tmp[dim(tmp)[1],dim(tmp)[2]] - 1195862)/1195862)
  


tmp <- f(lang = 7,type=2)
abs(tmp[dim(tmp)[1],dim(tmp)[2]] - 1195862)/1195862


# 0 :0.03654574 几何  0: 0.03500069 算数
# 1 :0.01141181 几何  1: 0.009230093 
# 2:  0.01699695 几何  2: 0.02007152



# 五月的预测 1280000
  mydata <- allpriv_runoff
  tmp <- mydata[c(3,4),c(8:(12+6))]
  #tmp <- (tmp[1,]+tmp[2,])/2 # 使用几何平均
  tmp <- sqrt(as.numeric(tmp[1,])*as.numeric(tmp[2,])) ## 使用几何平均
  
  
  history <- data.frame(ds = date("2021-05-01")+days(5:(9+6)),
                      y =as.numeric(tmp))
  m <- prophet(history)

 
future <- make_future_dataframe(m, periods = 21-6,freq = "day")
forecast <- predict(m, future)


abs(forecast[dim(forecast)[1],dim(forecast)[2]] -1280000)/1280000
 


# 预测六月份
mydata <- allpriv_runoff
  tmp <- mydata[c(4,5),c(8:(12+2))]
  #tmp <- (tmp[1,]+tmp[2,])/2 # 使用几何平均
  tmp <- sqrt(as.numeric(tmp[1,])*as.numeric(tmp[2,])) ## 使用几何平均
  
  
  history <- data.frame(ds = date("2021-06-01")+days(5:(9+2)),
                      y =as.numeric(tmp))
  m <- prophet(history)

 
future <- make_future_dataframe(m, periods = 20-2,freq = "day")
forecast <- predict(m, future)


forecast[dim(forecast)[1],dim(forecast)[2]]


1183242 # 到10号,几何
1183486 # 到10号,算数

1242005 # 到11号几何
1242187 # 算数

1242187 # 12 号算数
1242005 # 12 号几何

用这种方法测算3月份的数据,误差大约是:0.03654574 。测算5月份的数据,预测结果是1254241,测算六月份的数据,测算结果是 1183242。

思路3

还有一个思路是用发展用户数,出账用户数,在网用户数,活动新发展用户数。流失量与以往的发展量相关,我们通过流失量与其他变量质检的关系来预测流失。

这里使用到的数据包括2019年1月到2021年4月份的,月流失量,月新发展,月出账用户量。

## 下载数据
develop2019 <- read.csv("/Users/milin/Downloads/2019发展.csv")

develop2020 <- read.csv("/Users/milin/Downloads/2020发展.csv")

develop2021 <- read.csv("/Users/milin/Downloads/2021发展.csv")



expaccount2019 <- read.csv("/Users/milin/Downloads/2019年出账.csv")

expaccount2020 <- read.csv("/Users/milin/Downloads/2020年出账.csv")

expaccount2021 <- read.csv("/Users/milin/Downloads/2021年出账.csv")


## 对数据进行重新命名

names(develop2019) <- c("地市",as.character(ym("201812")+months(1:12)))

names(develop2020) <- c("地市",as.character(ym("201912")+months(1:12)))

names(develop2021) <- c("地市",as.character(ym("202012")+months(1:4)))


names(expaccount2019) <- c("地市",as.character(ym("201812")+months(1:12)))

names(expaccount2020) <- c("地市",as.character(ym("201912")+months(1:12)))

names(expaccount2021) <- c("地市",as.character(ym("202012")+months(1:4)))



# 将宽数据转换成为长数据

develop2019 <- develop2019 %>% gather("时间","发展量用户数",2:13)

develop2020 <- develop2020 %>% gather("时间","发展量用户数",2:13)

develop2021 <- develop2021 %>% gather("时间","发展量用户数",2:5)


expaccount2019 <- expaccount2019 %>% gather("时间","出账用户数",2:13)

expaccount2020 <- expaccount2020 %>% gather("时间","出账用户数",2:13)

expaccount2021 <- expaccount2021 %>% gather("时间","出账用户数",2:5)



develop20192021 <- develop2019 %>% bind_rows(develop2020) %>% bind_rows(develop2021)

expaccount20192021 <- expaccount2019 %>% bind_rows(expaccount2020) %>% bind_rows(expaccount2021)
tail(runoff20192021)
##     地市       时间 流失量
## 611 茂名 2021-04-01  19931
## 612 河源 2021-04-01  10375
## 613 汕尾 2021-04-01   9102
## 614 阳江 2021-04-01   9684
## 615 梅州 2021-04-01  10490
## 616 云浮 2021-04-01   8933
develop20192021$时间 <- as.Date(develop20192021$时间)
expaccount20192021$时间 <- as.Date(expaccount20192021$时间)

de_exp_run <- develop20192021 %>% left_join(y = expaccount20192021,by = c("地市","时间")) %>% left_join(y = runoff20192021,by = c("地市","时间"))

需要注意爹是,当月的新发展不会影响当月的流失,当月的流失是受到之前月份的影响。因此,构建3个新变量,分别标识上月分的新发展,上上个月新发展,上上上个月的新发展数量。

prode_exp_run <- de_exp_run %>% filter(地市=="全省")

prode_exp_run <- prode_exp_run %>% mutate(lag1dev = lag(发展量用户数)) %>% mutate(lag2dev = lag(发展量用户数,n = 2)) %>% mutate(lag3dev = lag(发展量用户数,n = 3),lag4dev = lag(发展量用户数,n = 4),lag5dev = lag(发展量用户数,n = 5),lag6dev = lag(发展量用户数,n = 6),lag7dev = lag(发展量用户数,n = 7),lag8dev = lag(发展量用户数,n = 8),lag9dev = lag(发展量用户数,n = 9))

tail(prode_exp_run)
##    地市       时间 发展量用户数 出账用户数  流失量 lag1dev lag2dev lag3dev
## 23 全省 2020-11-01       815640   26560339 1515336  929258  937946 1028718
## 24 全省 2020-12-01       783999   26283171 1321730  815640  929258  937946
## 25 全省 2021-01-01       917653   26285874 1174921  783999  815640  929258
## 26 全省 2021-02-01       805985   25972420 1243171  917653  783999  815640
## 27 全省 2021-03-01      1301092   26537959 1108165  805985  917653  783999
## 28 全省 2021-04-01      1074871   26734399 1195862 1301092  805985  917653
##    lag4dev lag5dev lag6dev lag7dev lag8dev lag9dev
## 23  925356  911425  973507 1075020 1151719  454177
## 24 1028718  925356  911425  973507 1075020 1151719
## 25  937946 1028718  925356  911425  973507 1075020
## 26  929258  937946 1028718  925356  911425  973507
## 27  815640  929258  937946 1028718  925356  911425
## 28  783999  815640  929258  937946 1028718  925356

数据中lag1dev标识上一个月的新发展,lag2dev表示的是上上个月的新发展。然后先简单分析一下数据之间的关系,例如流失与新发展用户之间的关系.

首先查看流失量与上一个月新发展用户数质检的关系

p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag1dev,color="yellow"))
p + labs(color="发展量用户数") + theme_bw()

可以看到,上一个月的新发展用户数量和流失量之间的关系并不明显。进一步可以计算相关系数.

cor(prode_exp_run$lag1dev[-c(1:3)],prode_exp_run$流失量[-c(1:3)])
## [1] 0.3140291

可以看到,流失量和上一个月新发展用户数之间的相关系数是31%,是存在一定的相关性的。接下来分析流失量和上上个月的新发展用户数量之间的关系。

p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag2dev,color="yellow"))
p + labs(color="发展量用户数") + theme_bw()

从这个图似乎可以看出来,流失与新发展之间的关系了,似乎是新发展的量一旦多了起来,在未来流失量的会增加。如果新发展的量减少,那么再未来一段时间,流失量也会减少。

同样,计算两者之间的相关性 。

cor(prode_exp_run$lag2dev[-c(1:3)],prode_exp_run$流失量[-c(1:3)])
## [1] 0.4631888

可以看到,相关系数,46%,是比较强的相关性了。进一步分析流失量与上上个月之间的相关关系。

p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag3dev,color="yellow"))
p + labs(color="发展量用户数") + theme_bw()

进一步计算二者之间的相关系数。

cor(prode_exp_run$lag3dev[-c(1:3)],prode_exp_run$流失量[-c(1:3)])
## [1] 0.6195624

可以看到相关系数是61% ,这也是比较强的相关性。我们还可以分析流失量与上第4个月之间的关系以及之后的关系,当我们分析到流失量与前第九个月之间的关系的时候,会显示下图。

p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=lag(发展量用户数,n = 9),color="yellow"))
p + labs(color="发展量用户数") + theme_bw()

可以看到,发展量与流失量之间的变化趋势已经是基本一致了 。进一步计算相关性.

cor(lag(prode_exp_run$发展量用户数,n = 9)[-c(1:9)],prode_exp_run$流失量[-c(1:9)])
## [1] 0.6759469

相关性依然还是比较大的,然后还可以分析一下出账用户数和流失量之间的关系。

p <- ggplot(data = prode_exp_run,aes(x=时间,y = 流失量))+geom_line()
p <- p + geom_line(data = prode_exp_run,aes(x = 时间,y=出账用户数,color="yellow"))
p + labs(color="出账用户数量") + theme_bw()

然后计算相关性.

cor(prode_exp_run$流失量,prode_exp_run$出账用户数)
## [1] 0.6163892

可以看到,流失量与出账用户数量依然是比较高的。相关性还是比较高的。

通过这样的变换之后,数据之间的量纲变得比较接近了。然后开始构建模型,使用2019年1月到2020年12月的数据训练,使用2021年1月到4月份的数据来做预测。

# 首先要创建训练集和测试集合

train <- prode_exp_run[c(1:24),]
test <- prode_exp_run[c(25:28),]


# 使用所有的变量来建模



lmodel <- lm(流失量~出账用户数+lag1dev+lag2dev+lag3dev,data = na.omit(train))
summary(lmodel)
## 
## Call:
## lm(formula = 流失量 ~ 出账用户数 + lag1dev + lag2dev + 
##     lag3dev, data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -143707 -102297  -33110   70584  255388 
## 
## Coefficients:
##                   Estimate     Std. Error t value Pr(>|t|)  
## (Intercept) -1768278.47359  1299201.51997  -1.361   0.2034  
## 出账用户数         0.11851        0.05632   2.104   0.0616 .
## lag1dev           -0.03983        0.22487  -0.177   0.8629  
## lag2dev           -0.24853        0.22730  -1.093   0.2999  
## lag3dev            0.26332        0.21557   1.221   0.2499  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 151200 on 10 degrees of freedom
## Multiple R-squared:  0.6847, Adjusted R-squared:  0.5585 
## F-statistic: 5.428 on 4 and 10 DF,  p-value: 0.01379
stepmodel <- step(lmodel)
## Start:  AIC=361.7
## 流失量 ~ 出账用户数 + lag1dev + lag2dev + lag3dev
## 
##              Df    Sum of Sq          RSS    AIC
## - lag1dev     1    717087579 229242794437 359.75
## - lag2dev     1  27321451882 255847158740 361.40
## <none>                       228525706858 361.70
## - lag3dev     1  34096304265 262622011123 361.79
## - 出账用户数  1 101194984356 329720691214 365.20
## 
## Step:  AIC=359.75
## 流失量 ~ 出账用户数 + lag2dev + lag3dev
## 
##              Df    Sum of Sq          RSS    AIC
## <none>                       229242794437 359.75
## - lag2dev     1  34634822568 263877617005 359.86
## - lag3dev     1  40222093672 269464888109 360.17
## - 出账用户数  1 122630820347 351873614784 364.18
summary(stepmodel)
## 
## Call:
## lm(formula = 流失量 ~ 出账用户数 + lag2dev + lag3dev, 
##     data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -145350  -98524  -33993   67633  253613 
## 
## Coefficients:
##                   Estimate     Std. Error t value Pr(>|t|)  
## (Intercept) -1667959.00511  1116605.79073  -1.494   0.1634  
## 出账用户数         0.11361        0.04683   2.426   0.0337 *
## lag2dev           -0.26248        0.20361  -1.289   0.2238  
## lag3dev            0.27418        0.19736   1.389   0.1922  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 144400 on 11 degrees of freedom
## Multiple R-squared:  0.6837, Adjusted R-squared:  0.5974 
## F-statistic: 7.925 on 3 and 11 DF,  p-value: 0.004299
pre <- predict(stepmodel,newdata = test)

abs((exp(pre) - exp(test$流失量)))/exp(test$流失量)
##  25  26  27  28 
## NaN NaN NaN NaN

从结果来看,模型的效果并不是很好。

library(randomForest)

ran <- randomForest(流失量~.,data = na.omit(train[,-c(1,2)]))

ran
## 
## Call:
##  randomForest(formula = 流失量 ~ ., data = na.omit(train[,      -c(1, 2)])) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 18321137768
##                     % Var explained: 62.08
plot(ran)

pre <- predict(ran,newdata = test)

abs((exp(pre) - exp(test$流失量)))/exp(test$流失量)
##  25  26  27  28 
## NaN NaN NaN NaN

效果依然不是很理想。

思路4

这个思路的第一步是将客户划分成为很多细分群体。

每一个月的2I新发展用户,非2I新发展用户,3个月的存量用户,6个月的存量用户,12个月的存量用户,24个月的存量用户,24个月以上的存量用户。

例如以2020年12月为时间间隔,要计算5月份的流失量。则可以这么算。

计算2020年12月2I新发展用户量在2021年1月份流失的比例,以及2021年2月份流失的比例,到22021年4月份流失的比例,通过这些历史的比例推测出这批用户在5月份流失的比例,通过这个比例计算出这批用户在5月份的流失量。

然后继续计算2021年1月份新发展用户,在2021年2月份流失比例以及到2021年4月份流失比例,通过历史比例计算出这一批人在5月份的流失比例并进一步计算这一批用户的流失量。 以此类推。

非2I新发展用户计算方式是一样的。

对于3个月存量用户,则计算到2020年12月为止网龄为3个月以内的入网用户(也就是2020年9月到11月入网的用户),计算这一批用户在之后每一个月的流失率,通过历史的流失率推算5月份的流失率,并进一步得出这一批人在5月份流失的量。 一次类推

library(readxl)

toI <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 1)

notoI <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 2)


age3 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 3)



age6 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 4)

age12 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 5)

age24 <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 6)

age24m <- read_xlsx("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/按照不同类型拍照用户的用户量来预测流失量.xlsx",sheet = 7)

然后要将具体的数值转化成为流失率

f <- function(x){
  return(x/toI$发展规模)
}
toI1 <-  toI %>% select(-发展月份,-发展规模) %>% mutate_all(f) 
toI1 <-  toI %>% select(发展月份,发展规模) %>% bind_cols(toI1) 

f <- function(x){
  return(x/notoI$发展规模)
}
notoI1 <-  notoI %>% select(-发展月份,-发展规模) %>% mutate_all(f) 
notoI1 <-  notoI %>% select(发展月份,发展规模) %>% bind_cols(notoI1) 

f <- function(x){
  return(x/age3$出账用户)
}
age31 <-  age3 %>% select(-`网龄1-3个月`,-出账用户) %>% mutate_all(f) 
age31 <-  age3 %>% select(`网龄1-3个月`,出账用户) %>% bind_cols(age31) 

f <- function(x){
  return(x/age6$拍照出账用户)
}

age61 <-  age6 %>% select(-`网龄4-6个月`,-拍照出账用户) %>% mutate_all(f)  
age61 <-  age6 %>% select(`网龄4-6个月`,拍照出账用户) %>% bind_cols(age61) 

f <- function(x){
  return(x/age12$拍照出账用户)
}

age121 <-  age12 %>% select(-`网龄7-12个月`,-拍照出账用户) %>% mutate_all(f) 
age121 <-  age12 %>% select(`网龄7-12个月`,拍照出账用户) %>% bind_cols(age121) 

f <- function(x){
  return(x/age24$拍照出账用户)
}

age241 <-  age24 %>% select(-网龄1年到2年,-拍照出账用户) %>% mutate_all(f)  
age241 <-  age24 %>% select(网龄1年到2年,拍照出账用户) %>% bind_cols(age241) 


f <- function(x){
  return(x/age24m$拍照出账用户)
}
age24m1 <-  age24m %>% select(-网龄2年以上,-拍照出账用户) %>% mutate_all(f)  
age24m1 <-  age24m %>% select(网龄2年以上,拍照出账用户) %>% bind_cols(age24m1) 

到这里,数据就初步整理好了,就下来就是要做几件事情,通过这个方法,有两个数据需要推算,一个是流失率,一个是T+1月的用户量。

修改一下命名

names(age24m1)[29] <- "T+27月流失"

names(age241)[29] <- "T+27月流失"

names(age121)[29] <- "T+27月流失"

names(age61)[29] <- "T+27月流失"

names(age31)[29] <- "T+27月流失"


names(notoI1)[29] <- "T+27月流失"

names(toI1)[29] <- "T+27月流失"

编写函数,来计算T月份的

# T = 0 or 1
predictRunoff <- function(T=0){
  require(tidyverse)
  year = 21
  month = 4
  time = 27
  lang <- dim(age24)[2]
  deep <- dim(age24)[1]

    for (i in 2:lang) {
      if(i==lang){
        time <- time+1
        
        toI1 <-  toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)  
        
        names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        toI1[,lang+1][-1,1] <- NA
        
        notoI1 <-  notoI1 %>%data.frame()%>% mutate(a = notoI1[1,lang]*0.9)
        names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        notoI1[,lang+1][-1,1] <- NA
        
        age31 <-  age31 %>%data.frame()%>% mutate(a = age31[1,lang]*0.9)
        names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age31[,lang+1][-1,1] <- NA
        
        age61 <-  age61 %>%data.frame()%>% mutate(a = age61[1,lang]*0.9)
        names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age61[,lang+1][-1,1] <- NA
        
        age121 <-  age121 %>%data.frame()%>% mutate(a = age121[1,lang]*0.9)
        names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age121[,lang+1][-1,1] <- NA
        
        age241 <-  age241 %>%data.frame()%>% mutate(a = age241[1,lang]*0.9)
        names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age241[,lang+1][-1,1] <- NA
        
        age24m1 <-  age24m1 %>%data.frame()%>% mutate(a = age24m1[1,lang]*0.9)
        names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age24m1[,lang+1][-1,1] <- NA
        
        # 倒数第二行
        toI1[2,lang] <-   toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        notoI1[2,lang] <-   notoI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        age31[2,lang] <-   age31[1,lang]*runif(1,min = 0.9,max = 1.1)
        age61[2,lang] <-   age61[1,lang]*runif(1,min = 0.9,max = 1.1)
        age121[2,lang] <-   age121[1,lang]*runif(1,min = 0.9,max = 1.1)
        age241[2,lang] <-   age241[1,lang]*runif(1,min = 0.9,max = 1.1)
        age24m1[2,lang] <-   age24m1[1,lang]*runif(1,min = 0.9,max = 1.1)
        toI1[2,lang] <-   toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        
      }else if(i==2){
        # 如果月份超过12,修改年和月份
        if(month+1==12){
          year = year+1
          month=1
        }else{
          month=month+1
        }
        
        # 1
        value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
        toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 2
        value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
        notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 3
        value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
        age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 4
        value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
        age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 5
        value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
        age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 6
        value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
        age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 7
        value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
        age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
      }else{
        value <- toI1[deep-(i-2),i]*0.6+toI1[deep-(i-1),i]*0.4
        toI1[deep-(i-3),i] <- value
        # 2
        value <- notoI1[deep-(i-2),i]*0.6+notoI1[deep-(i-1),i]*0.4
        notoI1[deep-(i-3),i] <- value
        # 3
        value <- age31[deep-(i-2),i]*0.6+age31[deep-(i-1),i]*0.4
        age31[deep-(i-3),i] <- value

        # 4
        value <- age61[deep-(i-2),i]*0.6+age61[deep-(i-1),i]*0.4
        age61[deep-(i-3),i] <- value
        
        # 5
        value <- age121[deep-(i-2),i]*0.6+age121[deep-(i-1),i]*0.4
        age121[deep-(i-3),i] <- value
        
        # 6
        value <- age241[deep-(i-2),i]*0.6+age241[deep-(i-1),i]*0.4
        age241[deep-(i-3),i] <- value

        # 7
        value <- age24m1[deep-(i-2),i]*0.6+age24m1[deep-(i-1),i]*0.4
        age24m1[deep-(i-3),i] <- value
      }
      
    }

  
  if(T==0){
    return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
  }else if(T==1){
      lang <- dim(toI1)[2]
      
      deep <- dim(toI1)[1]
      for (i in 2:lang) {
        
      if(i==lang){
        
        
        time <- time+1
        toI1 <-  toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
        
        names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        toI1[,lang+1][-1] <- NA

        
        notoI1 <-  notoI1 %>%data.frame() %>% mutate(a = notoI1[1,lang]*0.9)
        names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        notoI1[,lang+1][-1] <- NA
        
        age31 <-  age31 %>%data.frame() %>% mutate(a = age31[1,lang]*0.9)
        names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age31[,lang+1][-1] <- NA
        
        age61 <-  age61 %>%data.frame() %>% mutate(a = age61[1,lang]*0.9)
        names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age61[,lang+1][-1] <- NA
        
        age121 <-  age121%>%data.frame()  %>% mutate(a = age121[1,lang]*0.9)
        names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age121[,lang+1][-1] <- NA
        
        age241 <-  age241 %>%data.frame() %>% mutate(a = age241[1,lang]*0.9)
        names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age241[,lang+1][-1] <- NA
        
        age24m1 <-  age24m1 %>% mutate(a = age24m1[1,lang]*0.9)
        names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age24m1[,lang+1][-1] <- NA
        
        toI1[2,lang][[1]] <-   toI1[1,lang][[1]]#*runif(1,0.9,1.1)
        # print(toI1[1,lang][[1]])
        notoI1[2,lang][[1]] <-   notoI1[1,lang][[1]]*runif(1,0.9,1.1)
        age31[2,lang][[1]] <-   age31[1,lang][[1]]*runif(1,0.9,1.1) 
        age61[2,lang][[1]] <-   age61[1,lang][[1]]*runif(1,0.9,1.1) 
        age121[2,lang][[1]] <-   age121[1,lang][[1]]*runif(1,0.9,1.1) 
        age241[2,lang][[1]] <-   age241[1,lang][[1]]*runif(1,0.9,1.1) 
        age24m1[2,lang][[1]] <-   age24m1[1,lang][[1]]*runif(1,0.9,1.1) 
        toI1[2,lang][[1]] <-   toI1[1,lang][[1]]*runif(1,0.9,1.1) 
        
      }else if(i==2){
        # 如果月份超过12,修改年和月份
        if(month+1==12){
          year = year+1
          month=1
        }else{
          month=month+1
        }
        
        # 1
        value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
        toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 2
        value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
        notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 3
        value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
        age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 4
        value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
        age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 5
        value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
        age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 6
        value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
        age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 7
        value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
        age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
      }else{
        value <- toI1[deep-(i-2),i]*0.6+toI1[deep-(i-1),i]*0.4
        toI1[deep-(i-3),i] <- value
        # 2
        value <- notoI1[deep-(i-2),i]*0.6+notoI1[deep-(i-1),i]*0.4
        notoI1[deep-(i-3),i] <- value
        # 3
        value <- age31[deep-(i-2),i]*0.6+age31[deep-(i-1),i]*0.4
        age31[deep-(i-3),i] <- value

        # 4
        value <- age61[deep-(i-2),i]*0.6+age61[deep-(i-1),i]*0.4
        age61[deep-(i-3),i] <- value
        
        # 5
        value <- age121[deep-(i-2),i]*0.6+age121[deep-(i-1),i]*0.4
        age121[deep-(i-3),i] <- value
        
        # 6
        value <- age241[deep-(i-2),i]*0.6+age241[deep-(i-1),i]*0.4
        age241[deep-(i-3),i] <- value

        # 7
        value <- age24m1[deep-(i-2),i]*0.6+age24m1[deep-(i-1),i]*0.4
        age24m1[deep-(i-3),i] <- value
      }
      
      }
      return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
  }
   
  ##return(list(toI1[,-dim(toI1)[2]],notoI1[,-dim(notoI1)[2]],age31[,-dim(age31)[2]],
             # age61[,-dim(age61)[2]],age121[,-dim(age121)[2]],age241[,-dim(age241)[2]],age24m1[,-dim(age241)]))

}


tmp <- predictRunoff(T = 1)

上面这个函数构造出了缺失率和发展量,下面要计算对应的一个流失 。

# 这个函数的输入是 predictRunoff 函数的输出
# month 等于24 对应的是 2020 年12月份
predict_funoff_num <- function(tmp,month=24){
  # month == 24 表示 看24行以及以后的新发展
  # 看23行的拍照用户
  
    result <- c()

    for (i in 1:2) {
      
    o.value <- tmp[[i]][month:(dim(tmp[[1]])[1]-1),2] %>% as.numeric()
    odd <- c()
    n <- 1
    for (j in (dim(tmp[[3]])[2]-month+1):3) {
      odd[n] <- tmp[[i]][dim(tmp[[3]])[2]-(j-1),j][[1]][[1]]
      n <- n+1
    }
    result[i] <- sum(o.value*odd)
    }
 
  
  for (i in 3:length(tmp)) {
   result[i] <- as.numeric(tmp[[3]][month-1,2])* tmp[[3]][month-1,dim(tmp[[3]])[2]-24+2]
   
  }
  
    return(result)


  
  }


result <- predict_funoff_num(tmp = tmp,month = 24)

sum(result)
## [1] 1010535
#  write_csv(tmp[[1]],file = "2i.csv")
# write_csv(tmp[[2]],file = "非2i.csv")
#  write_csv(tmp[[3]],file = "3个月.csv")
#   write_csv(tmp[[4]],file = "6个月.csv")
#    write_csv(tmp[[5]],file = "12个月.csv")
#     write_csv(tmp[[6]],file = "24个月.csv")
#      write_csv(tmp[[7]],file = "大于24个月.csv")


#library(xlsx)

尝试计算一个范围

计算平均

predictRunoffMean <- function(T=0,rate=1){
  require(tidyverse)
  year = 21
  month = 4
  time = 27
  lang <- dim(age24)[2]
  deep <- dim(age24)[1]

    for (i in 2:lang) {
      if(i==lang){
        time <- time+1
        
        toI1 <-  toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)  
        
        names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        toI1[,lang+1][-1,1] <- NA
        
        notoI1 <-  notoI1 %>%data.frame()%>% mutate(a = notoI1[1,lang]*0.9)
        names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        notoI1[,lang+1][-1,1] <- NA
        
        age31 <-  age31 %>%data.frame()%>% mutate(a = age31[1,lang]*0.9)
        names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age31[,lang+1][-1,1] <- NA
        
        age61 <-  age61 %>%data.frame()%>% mutate(a = age61[1,lang]*0.9)
        names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age61[,lang+1][-1,1] <- NA
        
        age121 <-  age121 %>%data.frame()%>% mutate(a = age121[1,lang]*0.9)
        names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age121[,lang+1][-1,1] <- NA
        
        age241 <-  age241 %>%data.frame()%>% mutate(a = age241[1,lang]*0.9)
        names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age241[,lang+1][-1,1] <- NA
        
        age24m1 <-  age24m1 %>%data.frame()%>% mutate(a = age24m1[1,lang]*0.9)
        names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age24m1[,lang+1][-1,1] <- NA
        
        # 倒数第二行
        toI1[2,lang] <-   toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        notoI1[2,lang] <-   notoI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        age31[2,lang] <-   age31[1,lang]*runif(1,min = 0.9,max = 1.1)
        age61[2,lang] <-   age61[1,lang]*runif(1,min = 0.9,max = 1.1)
        age121[2,lang] <-   age121[1,lang]*runif(1,min = 0.9,max = 1.1)
        age241[2,lang] <-   age241[1,lang]*runif(1,min = 0.9,max = 1.1)
        age24m1[2,lang] <-   age24m1[1,lang]*runif(1,min = 0.9,max = 1.1)
        toI1[2,lang] <-   toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        
      }else if(i==2){
        # 如果月份超过12,修改年和月份
        if(month+1==12){
          year = year+1
          month=1
        }else{
          month=month+1
        }
        
        # 1
        value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
        toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 2
        value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
        notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 3
        value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
        age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 4
        value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
        age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 5
        value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
        age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 6
        value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
        age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 7
        value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
        age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
      }else{
        value <- toI1[deep-(i-2),i]*0.5+toI1[deep-(i-1),i]*0.5
        toI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 2
        value <- notoI1[deep-(i-2),i]*0.5+notoI1[deep-(i-1),i]*0.5
        notoI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 3
        value <- age31[deep-(i-2),i]*0.5+age31[deep-(i-1),i]*0.5
        age31[deep-(i-3),i] <- as.numeric(value)*rate

        # 4
        value <- age61[deep-(i-2),i]*0.5+age61[deep-(i-1),i]*0.5
        age61[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 5
        value <- age121[deep-(i-2),i]*0.5+age121[deep-(i-1),i]*0.5
        age121[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 6
        value <- age241[deep-(i-2),i]*0.5+age241[deep-(i-1),i]*0.5
        age241[deep-(i-3),i] <- as.numeric(value)*rate

        # 7
        value <- age24m1[deep-(i-2),i]*0.5+age24m1[deep-(i-1),i]*0.5
        age24m1[deep-(i-3),i] <- as.numeric(value)*rate
      }
      
    }

  
  if(T==0){
    return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
  }else if(T==1){
      lang <- dim(toI1)[2]
      
      deep <- dim(toI1)[1]
      for (i in 2:lang) {
        
      if(i==lang){
        
        
        time <- time+1
        toI1 <-  toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
        
        names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        toI1[,lang+1][-1] <- NA

        
        notoI1 <-  notoI1 %>%data.frame() %>% mutate(a = notoI1[1,lang]*0.9)
        names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        notoI1[,lang+1][-1] <- NA
        
        age31 <-  age31 %>%data.frame() %>% mutate(a = age31[1,lang]*0.9)
        names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age31[,lang+1][-1] <- NA
        
        age61 <-  age61 %>%data.frame() %>% mutate(a = age61[1,lang]*0.9)
        names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age61[,lang+1][-1] <- NA
        
        age121 <-  age121%>%data.frame()  %>% mutate(a = age121[1,lang]*0.9)
        names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age121[,lang+1][-1] <- NA
        
        age241 <-  age241 %>%data.frame() %>% mutate(a = age241[1,lang]*0.9)
        names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age241[,lang+1][-1] <- NA
        
        age24m1 <-  age24m1 %>% mutate(a = age24m1[1,lang]*0.9)
        names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age24m1[,lang+1][-1] <- NA
        
        toI1[2,lang][[1]] <-   toI1[1,lang][[1]]#*runif(1,0.9,1.1)
        # print(toI1[1,lang][[1]])
        notoI1[2,lang][[1]] <-   notoI1[1,lang][[1]]*runif(1,0.9,1.1)
        age31[2,lang][[1]] <-   age31[1,lang][[1]]*runif(1,0.9,1.1) 
        age61[2,lang][[1]] <-   age61[1,lang][[1]]*runif(1,0.9,1.1) 
        age121[2,lang][[1]] <-   age121[1,lang][[1]]*runif(1,0.9,1.1) 
        age241[2,lang][[1]] <-   age241[1,lang][[1]]*runif(1,0.9,1.1) 
        age24m1[2,lang][[1]] <-   age24m1[1,lang][[1]]*runif(1,0.9,1.1) 
        toI1[2,lang][[1]] <-   toI1[1,lang][[1]]*runif(1,0.9,1.1) 
        
      }else if(i==2){
        # 如果月份超过12,修改年和月份
        if(month+1==12){
          year = year+1
          month=1
        }else{
          month=month+1
        }
        
        # 1
        value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
        toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 2
        value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
        notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 3
        value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
        age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 4
        value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
        age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 5
        value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
        age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 6
        value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
        age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 7
        value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
        age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
      }else{
        value <- toI1[deep-(i-2),i]*0.5+toI1[deep-(i-1),i]*0.5
        toI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 2
        value <- notoI1[deep-(i-2),i]*0.5+notoI1[deep-(i-1),i]*0.5
        notoI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 3
        value <- age31[deep-(i-2),i]*0.5+age31[deep-(i-1),i]*0.5
        age31[deep-(i-3),i] <- as.numeric(value)*rate

        # 4
        value <- age61[deep-(i-2),i]*0.5+age61[deep-(i-1),i]*0.5
        age61[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 5
        value <- age121[deep-(i-2),i]*0.5+age121[deep-(i-1),i]*0.5
        age121[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 6
        value <- age241[deep-(i-2),i]*0.5+age241[deep-(i-1),i]*0.5
        age241[deep-(i-3),i] <- as.numeric(value)*rate

        # 7
        value <- age24m1[deep-(i-2),i]*0.5+age24m1[deep-(i-1),i]*0.5
        age24m1[deep-(i-3),i] <- as.numeric(value)*rate
      }
      
      }
      return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
  }
   
  ##return(list(toI1[,-dim(toI1)[2]],notoI1[,-dim(notoI1)[2]],age31[,-dim(age31)[2]],
             # age61[,-dim(age61)[2]],age121[,-dim(age121)[2]],age241[,-dim(age241)[2]],age24m1[,-dim(age241)]))

}
tmp <- predictRunoffMean(T = 0,rate = 1)
result <- predict_funoff_num(tmp = tmp,month = 19)

sum(result)
## [1] 1301473

计算最大

predictRunoffMAX <- function(T=0,rate=1){
  require(tidyverse)
  year = 21
  month = 4
  time = 27
  lang <- dim(age24)[2]
  deep <- dim(age24)[1]

    for (i in 2:lang) {
      if(i==lang){
        time <- time+1
        
        toI1 <-  toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)  
        
        names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        toI1[,lang+1][-1,1] <- NA
        
        notoI1 <-  notoI1 %>%data.frame()%>% mutate(a = notoI1[1,lang]*0.9)
        names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        notoI1[,lang+1][-1,1] <- NA
        
        age31 <-  age31 %>%data.frame()%>% mutate(a = age31[1,lang]*0.9)
        names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age31[,lang+1][-1,1] <- NA
        
        age61 <-  age61 %>%data.frame()%>% mutate(a = age61[1,lang]*0.9)
        names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age61[,lang+1][-1,1] <- NA
        
        age121 <-  age121 %>%data.frame()%>% mutate(a = age121[1,lang]*0.9)
        names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age121[,lang+1][-1,1] <- NA
        
        age241 <-  age241 %>%data.frame()%>% mutate(a = age241[1,lang]*0.9)
        names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age241[,lang+1][-1,1] <- NA
        
        age24m1 <-  age24m1 %>%data.frame()%>% mutate(a = age24m1[1,lang]*0.9)
        names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age24m1[,lang+1][-1,1] <- NA
        
        # 倒数第二行
        toI1[2,lang] <-   toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        notoI1[2,lang] <-   notoI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        age31[2,lang] <-   age31[1,lang]*runif(1,min = 0.9,max = 1.1)
        age61[2,lang] <-   age61[1,lang]*runif(1,min = 0.9,max = 1.1)
        age121[2,lang] <-   age121[1,lang]*runif(1,min = 0.9,max = 1.1)
        age241[2,lang] <-   age241[1,lang]*runif(1,min = 0.9,max = 1.1)
        age24m1[2,lang] <-   age24m1[1,lang]*runif(1,min = 0.9,max = 1.1)
        toI1[2,lang] <-   toI1[1,lang]*runif(1,min = 0.9,max = 1.1)
        
      }else if(i==2){
        # 如果月份超过12,修改年和月份
        if(month+1==12){
          year = year+1
          month=1
        }else{
          month=month+1
        }
        
        # 1
        value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
        toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 2
        value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
        notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 3
        value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
        age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 4
        value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
        age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 5
        value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
        age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 6
        value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
        age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 7
        value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
        age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
      }else{
        value <- ifelse(toI1[deep-(i-2),i]>toI1[deep-(i-1),i],toI1[deep-(i-2),i],toI1[deep-(i-1),i])
        toI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 2
        value <- ifelse(notoI1[deep-(i-2),i]>notoI1[deep-(i-1),i],notoI1[deep-(i-2),i],notoI1[deep-(i-1),i])
        notoI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 3
        value <- ifelse(age31[deep-(i-2),i]>age31[deep-(i-1),i],age31[deep-(i-2),i],age31[deep-(i-1),i])
        age31[deep-(i-3),i] <- as.numeric(value)*rate

        # 4
        value <- ifelse(age61[deep-(i-2),i]>age61[deep-(i-1),i],age61[deep-(i-2),i],age61[deep-(i-1),i])
        age61[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 5
        value <- ifelse(age121[deep-(i-2),i]>age121[deep-(i-1),i],age121[deep-(i-2),i],age121[deep-(i-1),i])
        age121[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 6
        value <- ifelse(age241[deep-(i-2),i]>age241[deep-(i-1),i],age241[deep-(i-2),i],age241[deep-(i-1),i])
        age241[deep-(i-3),i] <- as.numeric(value)*rate

        # 7
        value <- ifelse(age24m1[deep-(i-2),i]>age24m1[deep-(i-1),i],age24m1[deep-(i-2),i],age24m1[deep-(i-1),i])
        age24m1[deep-(i-3),i] <- as.numeric(value)*rate
      }
      
    }

  
  if(T==0){
    return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
  }else if(T==1){
      lang <- dim(toI1)[2]
      
      deep <- dim(toI1)[1]
      for (i in 2:lang) {
        
      if(i==lang){
        
        
        time <- time+1
        toI1 <-  toI1 %>%data.frame() %>% mutate(a = toI1[1,lang]*0.9)
        
        names(toI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        toI1[,lang+1][-1] <- NA

        
        notoI1 <-  notoI1 %>%data.frame() %>% mutate(a = notoI1[1,lang]*0.9)
        names(notoI1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        notoI1[,lang+1][-1] <- NA
        
        age31 <-  age31 %>%data.frame() %>% mutate(a = age31[1,lang]*0.9)
        names(age31)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age31[,lang+1][-1] <- NA
        
        age61 <-  age61 %>%data.frame() %>% mutate(a = age61[1,lang]*0.9)
        names(age61)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age61[,lang+1][-1] <- NA
        
        age121 <-  age121%>%data.frame()  %>% mutate(a = age121[1,lang]*0.9)
        names(age121)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age121[,lang+1][-1] <- NA
        
        age241 <-  age241 %>%data.frame() %>% mutate(a = age241[1,lang]*0.9)
        names(age241)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age241[,lang+1][-1] <- NA
        
        age24m1 <-  age24m1 %>% mutate(a = age24m1[1,lang]*0.9)
        names(age24m1)[lang+1] <- paste("T+",time,"月流失",sep = "") 
        age24m1[,lang+1][-1] <- NA
        
        toI1[2,lang][[1]] <-   toI1[1,lang][[1]]#*runif(1,0.9,1.1)
        # print(toI1[1,lang][[1]])
        notoI1[2,lang][[1]] <-   notoI1[1,lang][[1]]*runif(1,0.9,1.1)
        age31[2,lang][[1]] <-   age31[1,lang][[1]]*runif(1,0.9,1.1) 
        age61[2,lang][[1]] <-   age61[1,lang][[1]]*runif(1,0.9,1.1) 
        age121[2,lang][[1]] <-   age121[1,lang][[1]]*runif(1,0.9,1.1) 
        age241[2,lang][[1]] <-   age241[1,lang][[1]]*runif(1,0.9,1.1) 
        age24m1[2,lang][[1]] <-   age24m1[1,lang][[1]]*runif(1,0.9,1.1) 
        toI1[2,lang][[1]] <-   toI1[1,lang][[1]]*runif(1,0.9,1.1) 
        
      }else if(i==2){
        # 如果月份超过12,修改年和月份
        if(month+1==12){
          year = year+1
          month=1
        }else{
          month=month+1
        }
        
        # 1
        value <- toI1[deep,2]*0.6+toI1[deep-1,2]*0.4
        toI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 2
        value <- notoI1[deep,2]*0.6+notoI1[deep-1,2]*0.4
        notoI1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 3
        value <- age31[deep,2]*0.6+age31[deep-1,2]*0.4
        age31[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 4
        value <- age61[deep,2]*0.6+age61[deep-1,2]*0.4
        age61[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 5
        value <- age121[deep,2]*0.6+age121[deep-1,2]*0.4
        age121[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 6
        value <- age241[deep,2]*0.6+age241[deep-1,2]*0.4
        age241[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
        # 7
        value <- age24m1[deep,2]*0.6+age24m1[deep-1,2]*0.4
        age24m1[deep+1,c(1,2)] <- c(paste(year,"年",month,"月",sep=""),value)
      }else{
        value <- ifelse(toI1[deep-(i-2),i]>toI1[deep-(i-1),i],toI1[deep-(i-2),i],toI1[deep-(i-1),i])
        toI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 2
        value <- ifelse(notoI1[deep-(i-2),i]>notoI1[deep-(i-1),i],notoI1[deep-(i-2),i],notoI1[deep-(i-1),i])
        notoI1[deep-(i-3),i] <- as.numeric(value)*rate
        # 3
        value <- ifelse(age31[deep-(i-2),i]>age31[deep-(i-1),i],age31[deep-(i-2),i],age31[deep-(i-1),i])
        age31[deep-(i-3),i] <- as.numeric(value)*rate

        # 4
        value <- ifelse(age61[deep-(i-2),i]>age61[deep-(i-1),i],age61[deep-(i-2),i],age61[deep-(i-1),i])
        age61[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 5
        value <- ifelse(age121[deep-(i-2),i]>age121[deep-(i-1),i],age121[deep-(i-2),i],age121[deep-(i-1),i])
        age121[deep-(i-3),i] <- as.numeric(value)*rate
        
        # 6
        value <- ifelse(age241[deep-(i-2),i]>age241[deep-(i-1),i],age241[deep-(i-2),i],age241[deep-(i-1),i])
        age241[deep-(i-3),i] <- as.numeric(value)*rate

        # 7
        value <- ifelse(age24m1[deep-(i-2),i]>age24m1[deep-(i-1),i],age24m1[deep-(i-2),i],age24m1[deep-(i-1),i])
        age24m1[deep-(i-3),i] <- as.numeric(value)*rate
        
      }
      
      }
      return(list(toI1,notoI1,age31,age61,age121,age241,age24m1))
  }
   
  ##return(list(toI1[,-dim(toI1)[2]],notoI1[,-dim(notoI1)[2]],age31[,-dim(age31)[2]],
             # age61[,-dim(age61)[2]],age121[,-dim(age121)[2]],age241[,-dim(age241)[2]],age24m1[,-dim(age241)]))

}

# [1009886,1028324]
# [1000749,1039770]

通过这种方式计算的关键是计算流失率,但是根据历史的流失率推算未来的流失率会偏低。

测算结果是,5月份的预测区间是:[1000749,1039770] 6月份的预测区间是[1009886,1028324]

思路5

根据每一个月每一天的流失量,计算每一天流失的变化,这个变化表示的就是环比(例如,今天相比于昨天的一个百分比),流失量变化了多少 。例如2020年1月11号的环比为:1943312/1962916 .其中1962916标识的是10号的流失量,1943312表示的是11号的流失量。

这样计算出每一天的一个环比变化。如果每一天的环比波动非常小。那么当我们想要推算5月份的流失量的时候,只需要通过历史的环比推算5月份每一天的环比变化,然后根据5月10号的流失结果即可得出5月末的流失。

final <- read.csv("/Users/milin/Library/Containers/com.tencent.xinWeChat/Data/Library/Application\ Support/com.tencent.xinWeChat/2.0b4.0.9/53566e35a70cea7efea3dd23a98b7c78/Message/MessageTemp/bed530d2f94e07b9fe9a38669031e423/File/2020to2021runoff.csv")
final1 <- final

for (i in 11:31) {
  final1[,i] <- final[,i]/final[,i-1]
  
}

names(final1)[6:31] <- 6:31

head(final1)
##   X.月份 月流失 月环比 准流失 日累计环比  6  7  8  9      10        11
## 1    1月  175.2     NA     NA         NA NA NA NA NA 1962916 0.9900128
## 2    2月  186.7  11.48     NA         NA NA NA NA NA 2060212 0.9899156
## 3    3月  157.0 -29.68     NA         NA NA NA NA NA 1856768 0.9888365
## 4    4月  168.2  11.18 163.14         NA NA NA NA NA 1922137 0.9900127
## 5    5月  141.3 -26.92 137.68     -25.46 NA NA NA NA 1743174 0.9875910
## 6    6月  140.7  -0.60 137.47      -0.21 NA NA NA NA 1703864 0.9858627
##          12        13        14        15        16        17        18
## 1 0.9913735 0.9922665 0.9913766 0.9917453 0.9920256 0.9919755 0.9929995
## 2 0.9910608 0.9926366 0.9929862 0.9931073 0.9940208 0.9933767 0.9932573
## 3 0.9895814 0.9898223 0.9914219 0.9923084 0.9916661 0.9921529 0.9927388
## 4 0.9911269 0.9917479 0.9924052 0.9927797 0.9922648 0.9926458 0.9935356
## 5 0.9891464 0.9885851 0.9865530 0.9935043 0.9767151 0.9925179 0.9915893
## 6 0.9872423 0.9894533 0.9910272 0.9858849 0.9905762 0.9909444 0.9910084
##          19        20        21        22        23        24        25
## 1 0.9928750 0.9891875 0.9931712 0.9945694 0.9952373 0.9915564 0.9953615
## 2 0.9935253 0.9891235 0.9924880 0.9935946 0.9943276 0.9898323 0.9916594
## 3 0.9934958 0.9887022 0.9925042 0.9938597 0.9936067 0.9892385 0.9933024
## 4 0.9944473 0.9856108 0.9921165 0.9926161 0.9891535 0.9868224 0.9928489
## 5 0.9912683 0.9862671 0.9918407 0.9925987 0.9938641 0.9878027 0.9882346
## 6 0.9903241 0.9833436 0.9913013 0.9899536 0.9873684 0.9809115 0.9882617
##          26        27        28        29        30        31
## 1 0.9977196 0.9971429 0.9995532 0.9978396 0.9975205 0.9973764
## 2 0.9961605 0.9936199 0.9942384 0.9944099        NA        NA
## 3 0.9946201 0.9962245 0.9946631 0.9953910 0.9948715        NA
## 4 0.9951281 0.9886437 0.9946226 0.9943022 0.9939077        NA
## 5 0.9923664 0.9851440 0.9863745 0.9784311 0.9892487 0.9959389
## 6 0.9943065 0.9901178 0.9902620 0.9941325 0.9943250        NA

然后对环比数据进行可视化.

final1 <- gather(data = final1,"时间","环比",11:31)

p <- ggplot(data = final1,aes(x = as.numeric(时间),y=环比,color = X.月份))+geom_line()
p +labs(color="月份") +theme_bw()
## Warning: Removed 473 row(s) containing missing values (geom_path).

从结果中可以看到,环比的变化可以是比较稳定,通常而言变化是再0.0005以内的。可以尝试某一月份的日环比等于近几个月份的日环比的平均值。

我们测算一下,4月份,4月10号的日流失为1609363,2月和3月的平均环比为0.99007和0.99021 ,那么4月份的流失的预测结果为:

1475966 * ((0.99007+0.99021)/2)^20
## [1] 1210622

4月份的真实流失为:1195862 ,那么计算的误差是:

abs(1195862-1210622)/1210622
## [1] 0.01219208

我们测算一下,3月份,3月10号的日流失为1373942,2月和3月的平均环比为0.99007和0.99021 ,那么4月份的流失的预测结果为:

1373942 * (0.99007)^21
## [1] 1114172

3月份的真实流失为:1195862 ,那么计算的误差是:

abs(1114172-1108165)/1108165
## [1] 0.005420673

假如我们要测算5月份的流失量,那么5月10号的日流失量为1694129,4月和3月的平均环比为0.99021 和 0.99038,那么5月份的月流失是:

1694129 * ((0.99021+0.99038)/2)^21
## [1] 1380393

如果希望通过这种方法来测算T+1 ,那么问题又变成了测算T+1月份的10号的流失量。这里假设,每一个月的流失量与前两个月是强相关的,用4月和5月10号的流失的平均值标识6月10号的流失量,进一步推算6月份的流失。推算6月份的流失量为:

((1609363+1694129)/2)* (((0.99021+0.99038)/2+0.99021 )/2)^2
## [1] 1619702

通过这种方式,测算结果为 :1357883