Email : brigita.melantika@student.matanauniversity.ac.id
RPubs : https://rpubs.com/brigitatiaraem/
Jurusan : Statistika
Address : ARA Center, Matana University Tower
Jl. CBD Barat Kav, RT.1, Curug Sangereng, Kelapa Dua, Tangerang, Banten 15810.
Goal:
End to end analysis on Telco customer churn data main focus being: - Identify what features contribute to customer churn.
- Magnitude of each feature’s contribution and uncertainties associated. - Business conclusion that we can share with Telco.
I added references at the end of the notebook that all composed the bits and pieces of this notebook. Please go check them out.
library(tidyverse) # imports ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats
# Normality Test
library(car) # for qqplot
library(nortest) # Anderson-Darling normality test
# Correspondence Analysis
library(FactoMineR)
library(factoextra)
# time-to-event (survival) analysis packages
library(KMsurv)
library(survival)
library(survminer)
# correlation
library(corrr)
# contingency table
library(MASS)
# plotting
library(gridExtra) # grid.arrange
# library(glmnet)
# library(factoextra)
# library(ggfortify)
# library(devtools)
# # library(ggbiplot)
# library(factoextra)
# library(MASS)
# library(cluster)
# library(gridExtra)
# library(party)
# set the max number of columns displayed in Jupyter Notebook without truncation
options(repr.matrix.max.cols=30, repr.matrix.max.rows=100)# load data
telco_df <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")customerIDs unique? Check if is there only one row (data point) per customerID?row_count = nrow(telco_df)
uniqueID_count =length(unique(telco_df$customerID))
print(paste0("Number of Rows : ", row_count))## [1] "Number of Rows : 7043"
print(paste0("Number of Unique CustomerID : ", uniqueID_count))## [1] "Number of Unique CustomerID : 7043"
if (row_count == uniqueID_count) {
print("Customer ID is unique; ther is one data point per Customer ID.")
} else {
print("Customer ID is not unique.")
}## [1] "Customer ID is unique; ther is one data point per Customer ID."
OBSERVATIONS: - There is only one row per Customer ID. Each Customer ID is unique!
summary(telco_df)## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Length:7043
## Class :character
## Mode :character
##
##
##
##
OBSERVATIONS: - No N/As in all columns!
Let’s start with the simpler observations and move onto the ones that require a bit more efforts (e.g., data distribution, correlation).
options(repr.plot.width = 14, repr.plot.height = 8)
# check out distribution of numeric feature(s)
telco_df %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) + theme_minimal() +
facet_wrap(~ key, scales = "free") +
geom_histogram() +
theme(axis.title=element_text(size=16,face="bold")) +
theme(text = element_text(size = 20)) # more on tenure distribution (Number of Months)
min(telco_df$tenure)## [1] 0
max(telco_df$tenure)## [1] 72
sprintf('Max tenure is %i years.', max(telco_df$tenure)/12)## [1] "Max tenure is 6 years."
OBSERVATIONS: - Again, we see that the values of SeniorCitizen is using numeric 0 and 1 as binary label for True and False, respectively. - MonthlyCharges column has a high concentration around 20, and the rest the values are somewhat a normal distribution with skew to the left. - TotalCharges column seem to follow an exponential distribution.
ref: https://stackoverflow.com/questions/38184288/how-to-plot-multiple-factor-columns-with-ggplot
colnames(telco_df)## [1] "customerID" "gender" "SeniorCitizen" "Partner"
## [5] "Dependents" "tenure" "PhoneService" "MultipleLines"
## [9] "InternetService" "OnlineSecurity" "OnlineBackup" "DeviceProtection"
## [13] "TechSupport" "StreamingTV" "StreamingMovies" "Contract"
## [17] "PaperlessBilling" "PaymentMethod" "MonthlyCharges" "TotalCharges"
## [21] "Churn"
options(repr.plot.width = 14, repr.plot.height = 10)
telco_df %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) + theme_minimal() +
facet_wrap(~ key, scales = "free") +
geom_histogram() +
theme(axis.title=element_text(size=16,face="bold")) +
theme(text = element_text(size = 20)) OBSERVATIONS: - We have a unique key (customerID), 16 categorical columns, and 4 numerical columns. - SeniorCitizen, one of the numerical columns, uses 1 and 0 as a binary label (wonder it is not in True/False like many other binary categorical columns). - 5 columns are only composef of Yes or No. - 6 columns have Yes, No, and No internet service option. - 1 column, MultipleLines, has Yes, No, and No phone service option. - The relaitonships of “Yes/No + third option” columns can be visualized like the following:
PhoneService |
MultipleLines |
|---|---|
| No | No phone service |
| Yes | Yes |
| Yes | No |
InteretService |
OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovie |
|---|---|
| No | No internet service |
| Yes | Yes |
| Yes | No |
The column values on the right are dependent on the column values on the left, thus 7 columns alone can explain the 2 columns on the left.
MonthlyCharges distributions separated by categorical features.MonthlyCharges seem to be composed of multiple distributions (maybe of a narrow tall normal distribution with mean ~20 + a shallow wide normal distribution with mean ~ 90)InternetService and PhoneService which could determine the distribution of the MonthlyCharges. So that is what we will be seeing here.ggplot(telco_df, aes(x=MonthlyCharges, color=InternetService)) +
geom_histogram(fill="white", alpha=0.5, position="identity") + theme_minimal() +
ggtitle("`MonthlyCharges` by `InternetService`") +
theme(axis.title=element_text(size=12,face="bold")) +
theme(text = element_text(size = 14)) +
labs(color="InternetService")OBSERVATIONS:
MonthlyCharges could be highly determined by the type of InternetServices. Fiber optics with average ~90, DSL ~60, and no service ~20.ggplot(telco_df, aes(x=MonthlyCharges, color=PhoneService)) +
geom_histogram(fill="white", alpha=0.5, position="identity") + theme_minimal() +
ggtitle("`MonthlyCharges` by `PhoneService`") +
theme(axis.title=element_text(size=12,face="bold")) +
theme(text = element_text(size = 14)) +
labs(color="PhoneService")OBSERVATIONS:
MonthlyCharges by InternetService” plot, we see there is no overlap between the customers without (No) PhoneService and fiber optics internet service.# How about the combination of the two - Contingency Table
telco_df$InternetPhoneServices <- paste('Phone: ', telco_df$PhoneService, ' | ',
'Internet: ', telco_df$InternetService
)
ggplot(telco_df, aes(x=MonthlyCharges, color=InternetPhoneServices)) +
geom_histogram(fill="white", alpha=0.5, position="identity") + theme_minimal() +
ggtitle("`MonthlyCharges` by `PhoneService` and `InternetService`") +
theme(axis.title=element_text(size=12,face="bold")) +
theme(text = element_text(size = 14)) +
labs(color="Phone | InternetService")ggplot(telco_df, aes(x = MonthlyCharges)) +
geom_histogram(fill="white", aes(color=InternetPhoneServices)) + theme_minimal() +
facet_wrap( ~ InternetPhoneServices) +
theme(axis.title=element_text(size=12,face="bold")) +
theme(text = element_text(size = 14)) +
labs(color="Phone | InternetService")# TODO: statistical check on whether the distributions are coming from different populations.TotalCharges equal to Tenrue*MonthlyCharges?MonthlyCharges * Tenure (in number of months) is identical to the actual TotalCharges in the data.total_charges_theoretical = MonthlyCharges * Tenure.total_charges_theoretical and TotalCharges. Save to total_charges_diff.total_charges_diff.# 1. Compute theoretical total charges total_charges_theoretical = MonthlyCharges * Tenure.
telco_df <- telco_df %>% mutate(total_charges_theoretical = tenure*MonthlyCharges)
# 2. Take the difference between total_charges_theoretical and TotalCharges. Save to total_charges_diff.
telco_df <- telco_df %>% mutate(total_charges_diff = TotalCharges - total_charges_theoretical)
# 3. Analyze the distribution of total_charges_diff - let's start with a histogram
ggplot(telco_df, aes(total_charges_diff)) + geom_histogram(bins = 200) + theme_minimal()OBSERVATION:
- We have a sharp vertical bar at 0. - The distribution of the rest of the values look close to a normal distribuion (let’s check this just for fun!).
# remove 0 and plot histogram
diff_hist_data <- telco_df %>%
filter(total_charges_diff != 0)
ggplot(diff_hist_data, aes(total_charges_diff)) + geom_histogram(bins = 200) + theme_minimal()OBSERVATION:
- Looks roughly normal? - This is a good time to test out normality testing methods. - Method 1: QQ-Plot - Method 2: Kolmogorov-Smirnov test - Method 3: Anderson-Darling test
# Method 1: QQ-Plot
qqPlot(diff_hist_data$total_charges_diff)## [1] 1297 1600
# Method 2: Kolmogorov-Smirnov test
# ref: https://stackoverflow.com/questions/26715843/kolmogorov-smirnov-test-in-r
ks.test(diff_hist_data$total_charges_diff, 'pnorm', 0, sd(diff_hist_data$total_charges_diff))##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: diff_hist_data$total_charges_diff
## D = 0.070153, p-value < 2.2e-16
## alternative hypothesis: two-sided
# Method 3: Anderson-Darling test
ad.test(diff_hist_data$total_charges_diff)##
## Anderson-Darling normality test
##
## data: diff_hist_data$total_charges_diff
## A = 74.325, p-value < 2.2e-16
OBSERVATIONS: - QQ-plot indicates that the distribution has light tails. - ref: - https://www.youtube.com/watch?v=vMaKx9fmJHE - https://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot - Both KS and AD normality tests indicate that there is not enough evidence to concluse the distribution is not normal.
(TODO) other side question(s): - What features are related to the difference in the theoretical and actual total charges? 1. Divide the difference in theoretical - actual total charges by tenure to obtain monthly difference (how much more or less one paid compared to the mean). 2. Conduct regression method to see which feature(s) are related to the difference. Something like a regression y(difference) ~ x1(tenure) + x2(PhoneService) + … + xn 3. potential conclusion: Is tenure negatively correlated with monthly charge difference? Longer tenure result in a larger discount?
Clearly independent categorical features:
Fisher’s exact test is more accurate than the chi-square test or G–test of independence when the expected numbers are small. I recommend you use Fisher’s exact test when the total sample size is less than 1000, and use the chi-square or G–test for larger sample sizes. ref: http://www.biostathandbook.com/fishers.html#:~:text=Fisher’s%20exact%20test%20is%20more,test%20for%20larger%20sample%20sizes.
table(telco_df$InternetService, telco_df$MultipleLines)##
## No No phone service Yes
## DSL 1048 682 691
## Fiber optic 1158 0 1938
## No 1184 0 342
# %% [code]
table(telco_df$MultipleLines, telco_df$PhoneService)##
## No Yes
## No 0 3390
## No phone service 682 0
## Yes 0 2971
table(telco_df$OnlineSecurity, telco_df$OnlineBackup)##
## No No internet service Yes
## No 2195 0 1303
## No internet service 0 1526 0
## Yes 893 0 1126
table(telco_df$Partner, telco_df$Dependents)##
## No Yes
## No 3280 361
## Yes 1653 1749
fisher.test(telco_df$Partner, telco_df$Dependents)##
## Fisher's Exact Test for Count Data
##
## data: telco_df$Partner and telco_df$Dependents
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 8.447856 10.950901
## sample estimates:
## odds ratio
## 9.607857
OBSERVATION: At significance level of 5%, we reject the null hypothesis that there is significant relationship between Partner and Dependents.
table(telco_df$Partner, telco_df$MultipleLines)##
## No No phone service Yes
## No 1981 371 1289
## Yes 1409 311 1682
fisher.test(telco_df$Partner, telco_df$MultipleLines)##
## Fisher's Exact Test for Count Data
##
## data: telco_df$Partner and telco_df$MultipleLines
## p-value < 2.2e-16
## alternative hypothesis: two.sided
OBSERVATION: At significance level of 5%, we reject the null hypothesis that there is significant relationship between Partner and Dependents.
table(telco_df$Dependents, telco_df$MultipleLines)##
## No No phone service Yes
## No 2337 476 2120
## Yes 1053 206 851
fisher.test(telco_df$Dependents, telco_df$MultipleLines)##
## Fisher's Exact Test for Count Data
##
## data: telco_df$Dependents and telco_df$MultipleLines
## p-value = 0.1082
## alternative hypothesis: two.sided
OBSERVATION: At significance level of 5%, we fail to reject the null hypothesis that there is significant relationship between Dependents and MultipleLines.
Which makes sense as having dependents likely means there is a need for multiple lines.
OBSERVATION: At significance level of 5%, we fail to reject the null hypothesis that there is significant relationship between Dependents and MultipleLines.
Which makes sense as having dependents likely means there is a need for multiple lines.
TODO: Correspondence Analysis
(TODO): Add Explanation
The null hypothesis in Logrank test states that there is no difference between the populations in the probability of an event (here a churn) ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403858/#:~:text=The%20logrank%20test%20is%20used,of%20events%20(here%20deaths).
# add 'is_churn' column to conduct survival anlaysis
telco_df$is_churn <- ifelse(telco_df$Churn == 'Yes', 1, 0)# check N/A in each column.
plot(survfit(Surv(tenure, is_churn) ~ 1, data = telco_df),
xlab = "Days",
ylab = "Overall survival probability")surv_object <- Surv(time = telco_df$tenure, event = telco_df$is_churn)fit <- survfit(surv_object ~ gender, data = telco_df)
ggsurvplot(fit, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE
)OBSERVATION:
The evidence is not sufficient to reject the null hypothesis.
Null hypothesis: Gender is not a factor that distinguishes the probability to churn.
fit1 <- survfit(surv_object ~ SeniorCitizen, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION:
The evidence is sufficient to reject the null hypothesis.
Null hypothesis: SeniorCitizen is not a factor that distinguishes the probability to churn.
Non-senior citizens (SeniorCitizen = 0) are more likely to stay longer with the Telco service. The difference in the probabilities between the Senior and non-Senior Citizens staying with Telco service becomes more different as the tenure gets longer. Staying with the firm for 60 weeks is ~75% for non-senior citizens vs. ~50% for senior citiznes.
fit1 <- survfit(surv_object ~ Partner, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION:
The evidence is sufficient to reject the null hypothesis. Null hypothesis: Being a Partner is not a factor that distinguishes the probability to churn. Customer that are partners (Partner = 1) are more likely to stay longer with the Telco service. Different from SeniorCitizen, the difference in survival probability for Partner diverged quickly in the earlier tenure, and the difference in probability roughly identical (roughly parallel) throughout the tenures.
fit1 <- survfit(surv_object ~ Dependents, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION:
The evidence is sufficient to reject the null hypothesis. Null hypothesis: Dependents is not a factor that distinguishes the probability to churn. Having Dependents is correlated with staying longer with the Telco service, meaning less likely to churn keeping the tenure identical.
fit1 <- survfit(surv_object ~ PhoneService, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION:
The evidence is not sufficient to reject the null hypothesis. Null hypothesis: PhoneService is not a factor that distinguishes the probability to churn. Telco should be aware that customer having a PhoneService is not a contributing factor to the customer’s staying with the service.
fit1 <- survfit(surv_object ~ MultipleLines, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION:
The evidence is sufficient to reject the null hypothesis. Null hypothesis: MultipleLines is not a factor that distinguishes the probability to churn. What was interesting here was that the churn probability was in the following order: (most likely to get churned earlier) Single Line of phone service - No phone service - Multiple lines of phone service (least likely to get churned earlier)
fit1 <- survfit(surv_object ~ InternetService, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION (perhaps the most interesting): The evidence is sufficient to reject the null hypothesis. Null hypothesis: InternetSrervice is not a factor that distinguishes the probability to churn. The magnitude of differences across groups were the very large for InternetService. Interestingly, customers without InternetService (in this data, meaning those only with phone service) had the largest survival rate. We had the largest churn rate for customers with Fiber Optics for the internet service.
More similar looking plots
plotting ref: https://rpkgs.datanovia.com/survminer/reference/arrange_ggsurvplots.html
options(repr.plot.width = 14, repr.plot.height = 10)
splots <- list()
fit <- survfit(surv_object ~ OnlineSecurity, data = telco_df)
online_security_plt <- ggsurvplot(fit, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)
splots[[1]] <- online_security_plt
fit <- survfit(surv_object ~ OnlineBackup, data = telco_df)
online_backup_plt <- ggsurvplot(fit, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)
splots[[2]] <- online_backup_plt
fit <- survfit(surv_object ~ DeviceProtection, data = telco_df)
device_protection_plt <- ggsurvplot(fit, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)
splots[[3]] <- device_protection_plt
fit <- survfit(surv_object ~ TechSupport, data = telco_df)
tech_support_plt <- ggsurvplot(fit, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)
splots[[4]] <- tech_support_plt
arrange_ggsurvplots(splots, print = TRUE, ncol = 2, nrow = 1)OBSERVATION:
The above four plots look very similar. Need to check if the four services are highly correlated (e.g, the majority of customers have no OnlineSecurity also does not have OnlineBackup, DeviceProtection, and TechSupport), which sounds very likely.
options(repr.plot.width = 14, repr.plot.height = 10)
splots <- list()
fit <- survfit(surv_object ~ StreamingTV, data = telco_df)
streamingTV <- ggsurvplot(fit, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)
splots[[1]] <- streamingTV
fit <- survfit(surv_object ~ StreamingMovies, data = telco_df)
streamingMovie <- ggsurvplot(fit, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)
splots[[2]] <- streamingMovie
arrange_ggsurvplots(splots, print = TRUE, ncol = 2, nrow = 1)OBSERVATION:
The above two plots look very similar. Need to check if the two services are highly correlated (e.g, the majority of customers who have StreaminbTV also have StreamingMovie), which sounds very likely.
fit1 <- survfit(surv_object ~ Contract, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION: The evidence is sufficient to reject the null hypothesis. Null hypothesis: Contract is not a factor that distinguishes the probability to churn. This result is hardly surprising given one year ~= 52 weeks and two years ~= 104 weeks, so do not see any significant drop before that many weeks for the One year and Two year contracts.
fit1 <- survfit(surv_object ~ PaperlessBilling, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION: The evidence is sufficient to reject the null hypothesis. Null hypothesis: PaperlessBilling is not a factor that distinguishes the probability to churn. This was interesting as PaperlessBilling having correlation to time-to-churn was not very intuitive; customers who rolled in for PaperlessBilling was more likely to churn quicker compared to those without PaperlessBilling.
fit1 <- survfit(surv_object ~ PaymentMethod, data = telco_df)
ggsurvplot(fit1, data = telco_df,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE)OBSERVATION: The evidence is sufficient to reject the null hypothesis. Null hypothesis: PaymentMethod is not a factor that distinguishes the probability to churn. Those who signed up for automatic payment services (bank transfer and credit card) were more likely to stay longer with the service.Telco could assist the customers with setting up the automated payments or have a promotion that could give enough motivation for the customers to sign up for the automated payments.
head(telco_df)## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
## InternetPhoneServices total_charges_theoretical
## 1 Phone: No | Internet: DSL 29.85
## 2 Phone: Yes | Internet: DSL 1936.30
## 3 Phone: Yes | Internet: DSL 107.70
## 4 Phone: No | Internet: DSL 1903.50
## 5 Phone: Yes | Internet: Fiber optic 141.40
## 6 Phone: Yes | Internet: Fiber optic 797.20
## total_charges_diff is_churn
## 1 0.00 0
## 2 -46.80 0
## 3 0.45 1
## 4 -62.75 0
## 5 10.25 1
## 6 23.30 1
Since we have features that are hierarchical, we can stratify customers based on columns:
Where, a subset of features are only applicable for a sertain subset of groups (e.g., OnlineSecurity is not an applicable column for those who does NOT have InternetService) 1. w/ only PhoneService 2. w/ only InternetService 3. w/ both InternetService and PhoneService
# there is no true or false column for internet service
telco_df$has_InternetService <- ifelse(telco_df$InternetService != "No", "Yes", "No")
telco_df$has_InternetService <- as.factor(telco_df$has_InternetService)
telco_df %>% count(PhoneService, has_InternetService)## PhoneService has_InternetService n
## 1 No Yes 682
## 2 Yes No 1526
## 3 Yes Yes 4835
# observation: we have descent numbers of data points per PhoneService and has_InternetService combinations
# subset data based on internet and phone services.
only_phone_service_df <- telco_df %>%
filter(PhoneService == "Yes" & has_InternetService == "No") %>%
dplyr::select(-c(Churn, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, TotalCharges, InternetPhoneServices, total_charges_theoretical, total_charges_diff))
only_internet_service_df <- telco_df %>%
filter(PhoneService == "No" & has_InternetService == "Yes") %>%
dplyr::select(-c(Churn, MultipleLines, PhoneService, TotalCharges, InternetPhoneServices, total_charges_theoretical, total_charges_diff))
both_services_df <- telco_df %>%
filter(PhoneService == "Yes" & has_InternetService == "Yes") %>%
dplyr::select(-c(Churn, TotalCharges, InternetPhoneServices, total_charges_theoretical, total_charges_diff))# only phone service
only_phone_fit <- coxph(Surv(tenure, is_churn) ~
gender + factor(SeniorCitizen) +
MultipleLines + Partner + Dependents + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges,
data = only_phone_service_df)
only_phone_ftest <- cox.zph(only_phone_fit)
only_phone_ftest## chisq df p
## gender 0.3955 1 0.529
## factor(SeniorCitizen) 0.3407 1 0.559
## MultipleLines 3.9501 1 0.047
## Partner 0.0411 1 0.839
## Dependents 0.0212 1 0.884
## Contract 3.1695 2 0.205
## PaperlessBilling 1.6147 1 0.204
## PaymentMethod 6.0521 3 0.109
## MonthlyCharges 1.7684 1 0.184
## GLOBAL 16.0418 12 0.189
head(only_internet_service_df)## customerID gender SeniorCitizen Partner Dependents tenure InternetService
## 1 7590-VHVEG Female 0 Yes No 1 DSL
## 2 7795-CFOCW Male 0 No No 45 DSL
## 3 6713-OKOMC Female 0 No No 10 DSL
## 4 8779-QRDMV Male 1 No No 1 DSL
## 5 8665-UTDHZ Male 0 Yes Yes 1 DSL
## 6 0526-SXDJP Male 0 Yes No 72 DSL
## OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV
## 1 No Yes No No No
## 2 Yes No Yes Yes No
## 3 Yes No No No No
## 4 No No Yes No No
## 5 No Yes No No No
## 6 Yes Yes Yes No No
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 1 No Month-to-month Yes Electronic check
## 2 No One year No Bank transfer (automatic)
## 3 No Month-to-month No Mailed check
## 4 Yes Month-to-month Yes Electronic check
## 5 No Month-to-month No Electronic check
## 6 No Two year No Bank transfer (automatic)
## MonthlyCharges is_churn has_InternetService
## 1 29.85 0 Yes
## 2 42.30 0 Yes
## 3 29.75 0 Yes
## 4 39.65 1 Yes
## 5 30.20 1 Yes
## 6 42.10 0 Yes