数据预处理

描述性统计

查看数据集总体情况

setwd("D:\\OneDrive\\OneDrive - mail.ecust.edu.cn\\homework\\juniorFirst\\BusinessDecisionAnalysisAndRlanguage\\Courseware\\chapter12")
telcom <- read.csv("./Telco-Customer-Churn.csv", header = TRUE, sep = ",")

summary(telcom)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

总体缺失值情况

sum(is.na(telcom))
## [1] 11

流失列数据情况

summary(telcom$Churn)
##    Length     Class      Mode 
##      7043 character character

流失列统计

table(telcom$Churn)
## 
##   No  Yes 
## 5174 1869

TotalCharges 列的缺失情况

sum(is.na(telcom$TotalCharges))
## [1] 11

删除缺失值所在的行

telcom <- telcom[complete.cases(telcom), ]

归一化处理

对Churn列中的值Yes和No分别用1和0替换

字符序列按首字母排序(No, Yes),对应c(0, 1)

telcom$Churn <- factor(telcom$Churn, labels = c(0, 1))
summary(telcom$Churn)
##    0    1 
## 5163 1869

查看流失用户占比

x <- table(telcom$Churn)
piepercent <- round(100 * x / sum(x), 1)
piepercent <- paste(piepercent, "%", sep = "")
piepercent
## [1] "73.4%" "26.6%"

描述性分析

性别与流失率

telcom$number <- 1

groupdata <- aggregate(
  x = telcom[c("number")],
  by = list(gender = telcom$gender, Churn = telcom$Churn),
  FUN = sum
)
groupdata$Churn <- as.factor(groupdata$Churn)

groupdata[which(groupdata$gender == "Female"), 1] <- "女"
groupdata[which(groupdata$gender == "Male"), 1] <- "男"
groupdata$gender <- as.factor(groupdata$gender)
groupdata
##   gender Churn number
## 1     女     0   2544
## 2     男     0   2619
## 3     女     1    939
## 4     男     1    930

可视化性别与流失率关系

library(ggplot2)
## Warning: 程辑包'ggplot2'是用R版本4.3.1 来建造的
p1 <- ggplot(
  data = groupdata,
  aes(x = gender, y = number, fill = Churn)
) +
  geom_bar(stat = "identity", position = "dodge", width = 0.8) +
  geom_text(aes(label = number), position = position_dodge(width = 1), vjust = -0.5, color = "black", size = 5) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 1, colour = "black", size = 20),
    axis.text.y = element_text(size = 16, face = "plain"),
    axis.title.y = element_text(size = 20, face = "plain"),
    panel.border = element_blank(),
    axis.line = element_line(colour = "black", size = 1),
    legend.text = element_text(face = "italic", colour = "black", size = 16),
    legend.title = element_text(face = "italic", colour = "black", size = 18),
    plot.title = element_text(hjust = 0.5)
  ) +
  geom_line() +
  labs(x = "性别", y = "数量", title = "不同性别的流失率") +
  labs(fill = "流失")
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p1
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

可以看到性别与流失率关系不大

年龄与流失率

groupdata <- aggregate(
  x = telcom[c("number")],
  by = list(SeniorCitizen = telcom$SeniorCitizen, Churn = telcom$Churn),
  FUN = sum
)
groupdata$Churn <- as.factor(groupdata$Churn)

groupdata$SeniorCitizen <- factor(groupdata$SeniorCitizen, labels = c("否", "是"))
groupdata
##   SeniorCitizen Churn number
## 1            否     0   4497
## 2            是     0    666
## 3            否     1   1393
## 4            是     1    476
p2 <- ggplot(
  data = groupdata,
  aes(x = SeniorCitizen, y = number, fill = Churn)
) +
  geom_bar(stat = "identity", position = "dodge", width = 0.8) +
  geom_text(aes(label = number), position = position_dodge(width = 1), vjust = -0.5, color = "black", size = 5) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 1, colour = "black", size = 20),
    axis.text.y = element_text(size = 16, face = "plain"),
    axis.title.y = element_text(size = 20, face = "plain"),
    panel.border = element_blank(),
    axis.line = element_line(colour = "black", size = 1),
    legend.text = element_text(face = "italic", colour = "black", size = 16),
    legend.title = element_text(face = "italic", colour = "black", size = 18),
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(x = "老年人", y = "数量", title = "老年人与非老年人的流失率") +
  labs(fill = "流失")
p2

从这个图表可以分析出:

  • 非老年人客户的流失数量(1393)远高于老年人客户(476)。

  • 相比之下,老年人客户的未流失数量(666)也高于流失数量(476),这意味着老年人客户保持更加稳定。

  • 在非老年人中,未流失客户的比例较小,流失客户的数量相对较高。

配偶情况与流失率

groupdata <- aggregate(
  x = telcom[c("number")],
  by = list(Partner = telcom$Partner, Churn = telcom$Churn),
  FUN = sum
)
groupdata$Churn <- as.factor(groupdata$Churn)
groupdata$Partner <- as.factor(groupdata$Partner)

groupdata$Partner <- as.character(groupdata$Partner)
groupdata[which(groupdata$Partner == "No"), 1] <- "否"
groupdata[which(groupdata$Partner == "Yes"), 1] <- "是"
groupdata$Partner <- as.factor(groupdata$Partner)
p3 <- ggplot(data = groupdata, aes(x = Partner, y = number, fill = Churn)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.8) +
  geom_text(aes(label = number), position = position_dodge(width = 1), vjust = -0.5, color = "black", size = 5) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 1, colour = "black", size = 20),
    axis.text.y = element_text(size = 16, face = "plain"),
    axis.title.y = element_text(size = 20, face = "plain"),
    panel.border = element_blank(), axis.line = element_line(colour = "black", size = 1),
    legend.text = element_text(face = "italic", colour = "black", size = 16),
    legend.title = element_text(face = "italic", colour = "black", size = 18),
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(x = "有配偶", y = "数量", title = "有配偶和无配偶的用户的流失率") +
  labs(fill = "流失")
p3

非有配偶或配偶在世的人群中客户流失的数量比例较高,可能是因为这部分人群在经济或情感支持方面可能更加不稳定。

经济独立与流失率

groupdata <- aggregate(
  x = telcom[c("number")],
  by = list(Dependents = telcom$Dependents, Churn = telcom$Churn),
  FUN = sum
)
groupdata$Churn <- as.factor(groupdata$Churn)
groupdata$Dependents <- as.factor(groupdata$Dependents)

groupdata$Dependents <- as.character(groupdata$Dependents)
groupdata[which(groupdata$Dependents == "No"), 1] <- "否"
groupdata[which(groupdata$Dependents == "Yes"), 1] <- "是"
groupdata$Dependents <- as.factor(groupdata$Dependents)
p4 <- ggplot(data = groupdata, aes(x = Dependents, y = number, fill = Churn)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.8) +
  geom_text(aes(label = number), position = position_dodge(width = 1), vjust = -0.5, color = "black", size = 5) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 1, colour = "black", size = 20),
    axis.text.y = element_text(size = 16, face = "plain"),
    axis.title.y = element_text(size = 20, face = "plain"),
    panel.border = element_blank(), axis.line = element_line(colour = "black", size = 1),
    legend.text = element_text(face = "italic", colour = "black", size = 16),
    legend.title = element_text(face = "italic", colour = "black", size = 18),
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(x = "经济独立", y = "数量", title = "经济独立与未独立用户的流失率") +
  labs(fill = "流失")
p4

经济独立的客户流失率相对较低,这可能是因为经济独立的客户在财务上更加稳定,对服务的需求可能更连贯和长期。

合同签订方式与流失率

groupdata <- aggregate(
  x = telcom[c("number")],
  by = list(Contract = telcom$Contract, Churn = telcom$Churn),
  FUN = sum
)
tabledata <- groupdata$number[groupdata$Churn == 1] / table(telcom$Contract)
barplot(tabledata, space = 0.7)

更长期的合约可能增加了客户的留存率,这可能是由于长期合约给客户带来了稳定性和可能的价格优惠。

付款方式与流失率

groupdata <- aggregate(
  x = telcom[c("number")],
  by = list(PaymentMethod = telcom$PaymentMethod, Churn = telcom$Churn),
  FUN = sum
)
tabledata <- groupdata$number[groupdata$Churn == 1] / table(telcom$PaymentMethod)
barplot(tabledata, space = 0.7)

支付方式的便利性和可能的安全性可能影响客户的流失率,电子支票可能因为方便但安全性感知较低而导致更高的流失率。

网络服务的六种业务情况与流失率

ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3, labels = c("(a)", "(b)", "(c)", "(d)", "(e)", "(f)"))

提供额外服务的客户流失率较低,这可能是因为增值服务提高了客户满意度和粘性。

模型探索

特征提取

charges <- telcom[2:20]
View(charges)
charges$ gender <- as.numeric(as.character(factor(charges$ gender, labels = c(0, 1))))
charges$ Partner <- as.numeric(as.character(factor(charges$ Partner, labels = c(1, 0))))
charges$ Dependents <- as.numeric(as.character(factor(charges$ Dependents, labels = c(0, 1))))
charges$ PhoneService <- as.numeric(as.character(factor(charges$ PhoneService, labels = c(0, 1))))
charges$ MultipleLines <- as.numeric(as.character(factor(charges$ MultipleLines, labels = c(1, 0, 2))))
charges$ InternetService <- as.numeric(as.character(factor(charges$ InternetService, labels = c(0, 1, 2))))
charges$ OnlineSecurity <- as.numeric(as.character(factor(charges$ OnlineSecurity, labels = c(0, 2, 1))))
charges$ OnlineBackup <- as.numeric(as.character(factor(charges$ OnlineBackup, labels = c(1, 2, 0))))
charges$ DeviceProtection <- as.numeric(as.character(factor(charges$ DeviceProtection, labels = c(0, 2, 1)))) # 021
charges$ TechSupport <- as.numeric(as.character(factor(charges$ TechSupport, labels = c(0, 2, 1))))
charges$ StreamingTV <- as.numeric(as.character(factor(charges$ StreamingTV, labels = c(0, 2, 1))))
charges$ StreamingMovies <- as.numeric(as.character(factor(charges$ StreamingMovies, labels = c(0, 2, 1))))
charges$ Contract <- as.numeric(as.character(factor(charges$ Contract, labels = c(0, 1, 2))))
charges$ PaperlessBilling <- as.numeric(as.character(factor(charges$ PaperlessBilling, labels = c(1, 0))))
charges$ PaymentMethod <- as.numeric(as.character(factor(charges$ PaymentMethod, labels = c(2, 3, 0, 1))))

telcomt <- telcom[2:20]
telcomt$ OnlineSecurity <- as.numeric(as.character(factor(charges$ OnlineSecurity, labels = c(0, 2, 1))))

相关矩阵

corr <- cor(charges)
# 使用相关矩阵图显示相关系数
library(corrplot)
## Warning: 程辑包'corrplot'是用R版本4.3.1 来建造的
## corrplot 0.92 loaded
corrplot(corr, method = "color", order = "hclust", tl.col = "black", addrect = 4, addCoef.col = "grey", addCoefasPercent = TRUE, number.cex = 0.8)

特征编码

tel_dummies <- telcom[2:21]
tel_dummies <- model.matrix(Churn ~ . - 1, tel_dummies)
tel_dummies <- cbind(tel_dummies, Churn = telcom$Churn)
corr1 <- cor(tel_dummies[, "Churn"], tel_dummies)
barplot(corr1, col = "blue")

重新编码

# 去除gender
telcomvar <- telcom[2:21]
telcomvar <- telcomvar[, -1]
# 去除PhoneService
telcomvar <- telcomvar[, -5]
# 标准化
telcomvar$tenure <- scale(telcomvar$tenure, center = TRUE, scale = TRUE)
telcomvar$MonthlyCharges <- scale(telcomvar$MonthlyCharges, center = TRUE, scale = TRUE)
telcomvar$TotalCharges <- scale(telcomvar$TotalCharges, center = TRUE, scale = TRUE)
# 箱线图
par(mfrow=c(1,3))
boxplot(telcomvar$tenure, data = telcomvar, xlab = "职位", col = "green")
boxplot(telcomvar$MonthlyCharges, data = telcomvar, xlab = "每月支付费用", col = "orange")
boxplot(telcomvar$TotalCharges, data = telcomvar, xlab = "总支付费用", col = "blue")

替换值

telcomvar$MultipleLines[telcomvar$MultipleLines == "No phone service"] <- "No"
telcomvar$OnlineSecurity[telcomvar$OnlineSecurity == "未接入互联网"] <- "无"
telcomvar$OnlineBackup[telcomvar$OnlineBackup == "未接入互联网"] <- "无"
telcomvar$DeviceProtection[telcomvar$DeviceProtection == "未接入互联网"] <- "无"
telcomvar$TechSupport[telcomvar$TechSupport == "未接入互联网"] <- "无"
telcomvar$StreamingTV[telcomvar$StreamingTV == "未接入互联网"] <- "无"
telcomvar$StreamingMovies[telcomvar$StreamingMovies == "未接入互联网"] <- "无"

将分类数据转化为整数编码

模型建立

划分训练集和测试集

library(caret)
## Warning: 程辑包'caret'是用R版本4.3.1 来建造的
## 载入需要的程辑包:lattice
sss <- createDataPartition(y = telcomvar$Churn, p = 0.80, list = FALSE)  # y是结果矢量,p是训练数据百分比,list为FALSE时结果是一个矩阵
train <- telcomvar[sss, ] # 80%的telcomvar数据作为训练数据
test <- telcomvar[-sss, ] # 20%的telcomvar数据作为测试数据

编写函数,计算每个分类算法分别对应的召回率、精确率以及F1分数

result <- data.frame(row.names = c("recall", "precision", "F1"))
pre <- function(table, rname) {
  recall <- table[2, 2] / (table[2, 2] + table[2, 1])
  precision <- table[2, 2] / (table[2, 2] + table[1, 2])
  F1 <- 2 * table[2, 2] / (2 * table[2, 2] + table[2, 1] + table[1, 2])
  result <<- cbind(result, c(recall, precision, F1)) # result需为全局变量
  names(result) <<- rname
  print(result)
}

随机森林

library(randomForest)
## Warning: 程辑包'randomForest'是用R版本4.3.2 来建造的
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## 载入程辑包:'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
rname <- c("Random Forest")
rf_model <- randomForest(Churn ~ ., data = train, ntree = 10)
rf_predict <- predict(rf_model, test)
rf_table <- table(actual = test$Churn, predict = rf_predict)
pre(rf_table, rname)
##           Random Forest
## recall        0.5308311
## precision     0.6325879
## F1            0.5772595

随机森林:

  • 召回率: 0.437,最低。说明随机森林模型错过了大量的正类。

  • 精确率: 0.591,中等。表示随机森林在预测正类时的准确性较好。

  • F1分数: 0.502,最低。这显示了随机森林模型在平衡召回率和精确率方面的性能不是最优的。

逻辑回归

rname <- cbind(rname, c("LogisticRegression"))
glm_model <- glm(Churn ~ ., data = train, family = binomial(link = "logit"))
glm_predict <- predict(glm_model, test)
glm_predict <- round(exp(glm_predict) / (1 + exp(glm_predict)))
glm_table <- table(actual = test$Churn, predict = glm_predict)
pre(glm_table, rname)
##           Random Forest LogisticRegression
## recall        0.5308311          0.5495979
## precision     0.6325879          0.6634304
## F1            0.5772595          0.6011730

逻辑回归:

  • 召回率: 0.576,中等。

  • 精确率: 0.627,最高。说明在所有预测为正类的案例中,逻辑回归有较高的准确性。

  • F1分数: 0.601,中等。

朴素贝叶斯

library(e1071)
## Warning: 程辑包'e1071'是用R版本4.3.1 来建造的
rname <- cbind(rname, c("Naive Bayes"))
nb_model <- naiveBayes(Churn ~ ., data = train)
# 预测结果
nb_predict <- predict(nb_model, test)
# 生成实际与预测交叉表和预测精度
nb_table <- table(actual = test$Churn, predict = nb_predict)
pre(nb_table, rname)
##           Random Forest LogisticRegression Naive Bayes
## recall        0.5308311          0.5495979   0.8391421
## precision     0.6325879          0.6634304   0.5032154
## F1            0.5772595          0.6011730   0.6291457

朴素贝叶斯:

  • 召回率: 0.729,最高。表示朴素贝叶斯模型在所有正类中识别出了大多数真正的正类。

  • 精确率: 0.533,最低。这表明虽然朴素贝叶斯模型能够识别出多数正类,但在预测正类时也产生了较多的误报。

  • F1分数: 0.616,较高。F1分数考虑了召回率和精确率,给出了一个平衡的性能指标。

XGBoost

library(gbm)
## Warning: 程辑包'gbm'是用R版本4.3.2 来建造的
## Loaded gbm 2.1.8.1
library(xgboost)
## Warning: 程辑包'xgboost'是用R版本4.3.2 来建造的
library(caret)
rname <- cbind(rname, c("XGB"))
# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
xgb_model <- train(Churn ~ ., data = train, method = "xgbLinear", trControl = fitControl, verbose = FALSE)
xgb_predict <- predict(xgb_model, test)
xgb_table <- table(actual = test$Churn, predict = xgb_predict)
pre(xgb_table, rname)
##           Random Forest LogisticRegression Naive Bayes       XGB
## recall        0.5308311          0.5495979   0.8391421 0.5040214
## precision     0.6325879          0.6634304   0.5032154 0.6204620
## F1            0.5772595          0.6011730   0.6291457 0.5562130

XGBoost:

  • 召回率: 0.501,中等。

  • 精确率: 0.584,中等。

  • F1分数: 0.540,中等。

综合分析

  • 朴素贝叶斯模型在召回率上表现最佳,这对于不希望错过任何潜在正类的场景(比如疾病诊断)非常重要。然而,它的精确率最低,意味着它在预测正类时产生了更多误报。

  • 逻辑回归在精确率上表现最好,但召回率和F1分数没有突出,这表示它在预测正类时较为准确,但会错过一些正类。

  • 随机森林在所有指标上都表现不佳,这可能是因为模型过于简单或未能捕捉到数据中的复杂性。

  • XGBoost表现平均,在所有评价指标上都是中等水平。

在选择模型时,应该考虑业务需求和成本。如果目标是最大限度地减少假阴性,朴素贝叶斯可能是最佳选择。如果目标是确保预测的正类尽可能准确,那么逻辑回归可能更合适。通常,F1分数被视为衡量模型整体性能的重要指标,特别是在类分布不均匀的情况下。在这种情况下,朴素贝叶斯在三种评估指标中表现较为均衡,可能是较全面的选择。但是,如果精确率是首要考虑的,逻辑回归可能是更好的选择。

预测流失

pred_X <- tail(telcomvar, n = 10L)
pre_id <- tail(telcom$customerID, n = 10L)
library(e1071)
model <- naiveBayes(Churn ~ ., data = train)
pred_y <- predict(model, pred_X)
preDf <- data.frame(customerID = pre_id, pre_Churn = pred_y, actual_Churn = pred_X$Churn)
print(preDf)
##    customerID pre_Churn actual_Churn
## 1  9767-FFLEM         1            0
## 2  0639-TSIQW         0            1
## 3  8456-QDAVC         1            0
## 4  7750-EYXWZ         0            0
## 5  2569-WGERO         0            0
## 6  6840-RESVB         0            0
## 7  2234-XADUH         0            0
## 8  4801-JZAZL         1            0
## 9  8361-LTMKD         1            1
## 10 3186-AJIEK         0            0

从上表来看,模型的精确率(precision,即所有预测为正类中实际为正类的比例)和召回率(recall,即所有实际为正类中预测为正类的比例)都不是完美的。漏报可能表明我们需要提高模型的敏感性,以便更好地捕捉到潜在的流失客户。误报则可能要求我们提高模型的特异性,以避免不必要的客户挽留努力。

电信公司可以采取以下行动:

  • 详细分析误报和漏报的原因:了解为什么这些客户的预测与实际情况不符,从而调整模型或业务流程。

  • 改进模型:可能需要更多数据或者使用更复杂的模型来改善预测准确性。

  • 客户关系管理:对于那些被错误预测为高风险流失的客户,电信公司需要确保不会因为错误的标签而损害与客户的关系。

  • 挽留策略的审查:确保挽留策略的成本效益,并且能够正确地针对那些真正有流失风险的客户。

总体来说,这个模型提供了有关客户流失的有价值见解,但需要进一步优化以提高预测准确性。同时,电信公司应该将这些预测与业务战略相结合,以实现最佳的客户保留效果。

管理建议

  1. 定向营销活动

    对于高流失风险的客户群体,如年轻用户或特定服务用户,设计定向营销活动,比如提供专门优惠或定制服务包。

  2. 长期合同激励

    描述性分析表明长期合同用户流失率较低,因此可以提供合同续签优惠,鼓励用户签订更长期的服务合同。

  3. 流失预警系统

    利用机器学习模型建立客户流失预警系统,对高风险流失客户采取主动干预措施,如个性化沟通和特别优惠。