1 Importing Libraries and Dataset

1.1 Importing Libraries

library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(reshape2)
library(ggpubr)
library(onewaytests)
library(car)
library(effectsize)

1.2 Setting the path

setwd("C:_homeworks_2023")

1.3 Importing the dataset

1.3.1 First 5 rows

##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes
kbl(df[1:5, ]) %>%
  kable_paper("hover", full_width = FALSE)
customerID gender SeniorCitizen Partner Dependents tenure PhoneService
7590-VHVEG Female             0     Yes         No      1           No
5575-GNVDE   Male             0      No         No     34          Yes
3668-QPYBK   Male             0      No         No      2          Yes
7795-CFOCW   Male             0      No         No     45           No
9237-HQITU Female             0      No         No      2          Yes

   MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
No phone service             DSL             No          Yes               No
              No             DSL            Yes           No              Yes
              No             DSL            Yes          Yes               No
No phone service             DSL            Yes           No              Yes
              No     Fiber optic             No           No               No

TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
         No          No              No Month-to-month              Yes
         No          No              No       One year               No
         No          No              No Month-to-month              Yes
        Yes          No              No       One year               No
         No          No              No Month-to-month              Yes

            PaymentMethod MonthlyCharges TotalCharges Churn
         Electronic check          29.85        29.85    No
             Mailed check          56.95      1889.50    No
             Mailed check          53.85       108.15   Yes
Bank transfer (automatic)          42.30      1840.75    No
         Electronic check          70.70       151.65   Yes
str(df)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
summary(df)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
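
The summary above reports 11 missing values in TotalCharges. A minimal sketch of how missing values per column could be counted and the affected rows inspected is given below. The data frame `toy` and its values are made up for illustration and are not taken from the dataset.

```r
# Toy data frame standing in for df (values are made up for illustration)
toy <- data.frame(
  tenure         = c(0, 12, 24, 0),
  MonthlyCharges = c(52.55, 70.10, 89.90, 20.25),
  TotalCharges   = c(NA, 841.20, 2157.60, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)
```

With the real data, the same two calls would be applied to df instead of toy.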

2 About Dataset

The dataset contains customer records from a telecommunications company, so the unit of observation is a single customer. It includes 21 variables for 7043 customers.

2.1 Description of variables

The variables are described below.

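# Note: read.csv() treats the first line of the file as a header by default,
# which is why customerID appears as a column name in the output below
variables <- read.csv("variables.csv")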
variables
##          customerID
## 1            gender
## 2     SeniorCitizen
## 3           Partner
## 4        Dependents
## 5            tenure
## 6      PhoneService
## 7     MultipleLines
## 8   InternetService
## 9    OnlineSecurity
## 10     OnlineBackup
## 11 DeviceProtection
## 12      TechSupport
## 13      StreamingTV
## 14  StreamingMovies
## 15         Contract
## 16 PaperlessBilling
## 17    PaymentMethod
## 18   MonthlyCharges
## 19     TotalCharges
## 20            Churn
##                                                                                                       ID.of.the.Customer
## 1                                                                             Whether the customer is a male or a female
## 2                                                                 Whether the customer is a senior citizen or not (1, 0)
## 3                                                                    Whether the customer has a partner or not (Yes, No)
## 4                                                                   Whether the customer has dependents or not (Yes, No)
## 5                                                              Number of months the customer has stayed with the company
## 6                                                              Whether the customer has a phone service or not (Yes, No)
## 7                                             Whether the customer has multiple lines or not (Yes, No, No phone service)
## 8                                                            Customer's internet service provider (DSL, Fiber optic, No)
## 9                                         Whether the customer has online security or not (Yes, No, No internet service)
## 10                                          Whether the customer has online backup or not (Yes, No, No internet service)
## 11                                      Whether the customer has device protection or not (Yes, No, No internet service)
## 12                                           Whether the customer has tech support or not (Yes, No, No internet service)
## 13                                           Whether the customer has streaming TV or not (Yes, No, No internet service)
## 14                                       Whether the customer has streaming movies or not (Yes, No, No internet service)
## 15                                                The contract term of the customer (Month-to-month, One year, Two year)
## 16                                                           Whether the customer has paperless billing or not (Yes, No)
## 17    The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
## 18                                                                        The amount charged to the customer monthly ($)
## 19                                                                          The total amount charged to the customer ($)
## 20                                                                       Whether the customer churned or not (Yes or No)

3 Main Goal

The main goal of this study is to build a model that predicts which customers are likely to churn from the network. The second goal is to conduct a descriptive analysis to identify trends and factors associated with customer churn.

4 Hypothesis Testing

With data processing complete, the next phase is hypothesis testing based on several assumed claims.

4.1 Hypothesis 01

4.1.1 Hypotheses

  • Testing whether the average monthly charge is less than $100.
  • H0: mu = 100
  • H1: mu < 100

4.1.2 Testing the normality of the data

# The Q-Q plot
ggqqplot(df$MonthlyCharges)

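# Applying the Shapiro-Wilk test
# shapiro.test() accepts at most 5000 observations, so a random subsample of 4500 rows is used
df2 <- df[sample(seq_len(nrow(df)), 4500, replace = FALSE), ]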
shapiro.test(df2$MonthlyCharges)
## 
##  Shapiro-Wilk normality test
## 
## data:  df2$MonthlyCharges
## W = 0.92165, p-value < 2.2e-16
  • The Q-Q plot shows that the data are not normally distributed.
  • The Shapiro-Wilk test also gives a p-value less than 0.001, so the null hypothesis of normality is rejected.
  • As the data are not normally distributed, a non-parametric test, the One-Sample Wilcoxon Signed Rank Test, is used to test the hypothesis.
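
The subsampling step can be made reproducible by fixing a random seed before drawing the sample. The sketch below uses simulated right-skewed values in place of df$MonthlyCharges; the seed, the exponential distribution, and the names `x` and `x_sub` are assumptions made only for illustration.

```r
set.seed(123)  # arbitrary seed, chosen only to make the subsample reproducible

# Simulated right-skewed values standing in for df$MonthlyCharges
x <- rexp(7043, rate = 1 / 65)

# shapiro.test() accepts between 3 and 5000 observations,
# so a random subsample is drawn without replacement
x_sub <- sample(x, size = 4500, replace = FALSE)

res <- shapiro.test(x_sub)
print(res)
```

A small p-value here leads to the same decision as above: normality is rejected for the sampled values.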

4.1.3 Applying One-Sample Wilcoxon Signed Rank Test

wilcox.test(df$MonthlyCharges, mu = 100, alternative = "less")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  df$MonthlyCharges
## V = 909748, p-value < 2.2e-16
## alternative hypothesis: true location is less than 100

4.1.4 Conclusion

There is sufficient evidence to conclude that the average monthly charge is less than $100.

4.1.5 Explanation:

The average monthly charge per customer is significantly less than $100; the overall mean monthly charge is $64.76. Promotions and packages should therefore be designed around customers' monthly charges. However, a deeper analysis should be conducted to group customers by their monthly usage and understand their behaviour.

The p-value obtained from the One-Sample Wilcoxon Signed Rank Test is less than 0.05 (p < 2.2e-16). Therefore the null hypothesis is rejected.
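
Strictly speaking, the Wilcoxon signed rank test concerns the location (pseudomedian) of the distribution rather than its mean, so it can help to report a location estimate alongside the p-value. The sketch below shows one way to request it with `conf.int = TRUE`. It runs on simulated values, not on the real data, and the seed and the uniform distribution are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated charges standing in for df$MonthlyCharges (not the real data)
charges <- runif(500, min = 18, max = 119)

# One-sided test of H0: location = 100 against H1: location < 100,
# also requesting the pseudomedian estimate and its confidence interval
res <- wilcox.test(charges, mu = 100, alternative = "less",
                   conf.int = TRUE, conf.level = 0.95)

print(res$p.value)      # p-value of the test
print(res$estimate)     # pseudomedian estimate
print(res$conf.int)     # one-sided interval for the location
print(median(charges))  # sample median, for comparison
```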

4.2 Hypothesis 02

4.2.1 Hypotheses

  • Testing if the average monthly charges differ between males and females
  • H0: mu1 = mu2
  • H1: mu1 != mu2

4.2.2 Visualization

e <- ggplot(df, aes(x = gender , y = MonthlyCharges))
e + geom_boxplot()

4.2.3 Testing the normality of the data

# The Q-Q plots for females and males
ggqqplot(df$MonthlyCharges[df$gender == "Female"])

ggqqplot(df$MonthlyCharges[df$gender == "Male"])

4.2.4 Applying Shapiro Wilk test

nor.test(df$MonthlyCharges~gender, data = df)
## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : df$MonthlyCharges and gender 
## 
##    Level Statistic      p.value   Normality
## 1 Female 0.9215085 2.615068e-39      Reject
## 2   Male 0.9201174 7.152397e-40      Reject
## --------------------------------------------------

  • The Q-Q plots show that the data are not normally distributed in either group.
  • The Shapiro-Wilk test also gives a p-value less than 0.001 for both males and females, so normality is rejected in both groups. As the data are not normally distributed, a non-parametric test, the Mann-Whitney U Test, is used to test the hypothesis.
wilcox.test(MonthlyCharges~gender, data = df, exact = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  MonthlyCharges by gender
## W = 6298265, p-value = 0.249
## alternative hypothesis: true location shift is not equal to 0

The p-value obtained from the two-sample Mann-Whitney test is 0.249.

Therefore the null hypothesis is not rejected.
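
A non-significant p-value does not by itself show how small the difference is, so an effect size can be reported next to it. The sketch below computes the rank-biserial correlation from the W statistic that wilcox.test() returns. This effect size is not part of the original analysis, and the data are simulated under the assumption that both groups share one distribution.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated two-group data (not the real dataset); both groups share one distribution
toy <- data.frame(
  gender = rep(c("Female", "Male"), each = 200),
  MonthlyCharges = runif(400, min = 18, max = 119)
)

res <- wilcox.test(MonthlyCharges ~ gender, data = toy, exact = FALSE)

# Group sizes
n_female <- sum(toy$gender == "Female")
n_male   <- sum(toy$gender == "Male")

# Rank-biserial correlation: 2W / (n1 * n2) - 1, ranging from -1 to 1
r_rb <- unname(2 * res$statistic / (n_female * n_male) - 1)

print(res$p.value)
print(r_rb)
```

Values of `r_rb` close to 0 indicate that neither group tends to have larger charges than the other.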

4.2.5 Conclusion

There is not sufficient evidence to conclude that the average monthly charge differs between males and females.

4.2.6 Explanation:

When customers are segmented, segmenting by gender is not recommended unless the promotion is personalized, because the monthly charges of males and females follow a similar pattern and do not differ significantly.

4.3 Hypothesis 03

4.3.1 Hypotheses

  • Testing if the total charge differs by Payment Method
  • H0: mu1 = mu2 = mu3 = mu4
  • H1: at least one mean is different

4.3.2 Visualization

e <- ggplot(df, aes(x = PaymentMethod, y = TotalCharges))
e + geom_boxplot()

4.3.3 Testing the normality of the data and Applying Shapiro Wilk test

The Shapiro-Wilk test is applied to the total charges within each payment method to check normality.

nor.test(df$TotalCharges~PaymentMethod, data = df)
## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : df$TotalCharges and PaymentMethod 
## 
##                       Level Statistic      p.value   Normality
## 1 Bank transfer (automatic) 0.9228526 2.123728e-27      Reject
## 2   Credit card (automatic) 0.9175288 5.073903e-28      Reject
## 3          Electronic check 0.8496637 9.382990e-43      Reject
## 4              Mailed check 0.7103853 3.209453e-46      Reject
## --------------------------------------------------

The Shapiro-Wilk test gives p-values below 0.05 for all categories, so normality is rejected in every group. Therefore, the normality assumption is not met for the data.

4.3.4 Testing for homogeneity

# Applying Levene's test
leveneTest(df$TotalCharges ~ df$PaymentMethod, data = df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    3  218.93 < 2.2e-16 ***
##       7028                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Levene's test shows that the variances of the four groups are not equal, as the p-value is less than 0.001.
  • The homogeneity assumption is therefore also not met for this data.
  • Therefore, instead of the parametric ANOVA, the non-parametric Kruskal-Wallis test will be used.
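
Levene's test with `center = median` is equivalent to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that computation in base R on simulated groups with deliberately unequal spreads; the data, the seed, and the names `toy` and `abs_dev` are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated groups with clearly different spreads (not the real data)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 100)),
  y = c(rnorm(100, sd = 1), rnorm(100, sd = 3), rnorm(100, sd = 6))
)

# Absolute deviation of each observation from its own group median
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$group, FUN = median))

# One-way ANOVA on the absolute deviations
levene_tab <- anova(lm(abs_dev ~ group, data = toy))
print(levene_tab)

levene_p <- levene_tab[["Pr(>F)"]][1]
```

Running `leveneTest(y ~ group, data = toy)` from the car package on the same data should give the same F value.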

4.3.5 Applying the Kruskal-Wallis test

kruskal.test(df$TotalCharges ~ df$PaymentMethod, data = df)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$TotalCharges by df$PaymentMethod
## Kruskal-Wallis chi-squared = 1077, df = 3, p-value < 2.2e-16

The p-value reported by the Kruskal-Wallis test is less than 0.001; therefore the null hypothesis is rejected, and it is concluded that at least one mean total charge differs from the others.
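
With about seven thousand observations even small differences become significant, so an effect size helps to judge the magnitude. The worked arithmetic below computes epsilon-squared, one of several possible effect sizes for the Kruskal-Wallis test, from the numbers printed above. It is not part of the original analysis; the sample size is inferred from the residual degrees of freedom in the Levene's test output.

```r
# Values read off the test outputs above
H <- 1077      # Kruskal-Wallis chi-squared
k <- 4         # number of payment methods
n <- 7028 + k  # residual Df from Levene's test plus the number of groups

# Epsilon-squared: H / ((n^2 - 1) / (n + 1)), which simplifies to H / (n - 1)
eps_sq <- H / (n - 1)
print(round(eps_sq, 3))  # 0.153
```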

4.3.6 Posthoc Analysis

To identify which categories differ in their mean total charges, a pairwise post hoc analysis is applied.

pairwise.wilcox.test(df$TotalCharges, df$PaymentMethod, p.adjust.method = "BH")
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  df$TotalCharges and df$PaymentMethod 
## 
##                         Bank transfer (automatic) Credit card (automatic)
## Credit card (automatic) 0.7                       -                      
## Electronic check        <2e-16                    <2e-16                 
## Mailed check            <2e-16                    <2e-16                 
##                         Electronic check
## Credit card (automatic) -               
## Electronic check        -               
## Mailed check            <2e-16          
## 
## P value adjustment method: BH
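
The "BH" option applies the Benjamini-Hochberg adjustment to the set of pairwise p-values. The sketch below shows what that adjustment does, first through p.adjust() and then by hand. The raw p-values are made up for illustration and are not taken from the analysis.

```r
# Illustrative raw p-values for six pairwise comparisons (made up, not from the analysis)
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.700)

# Benjamini-Hochberg adjustment, as selected by p.adjust.method = "BH"
p_bh <- p.adjust(p_raw, method = "BH")

# The same adjustment by hand: p * m / rank,
# then enforce monotonicity starting from the largest p-value
m <- length(p_raw)
ord <- order(p_raw, decreasing = TRUE)
p_manual <- numeric(m)
p_manual[ord] <- pmin(1, cummin(p_raw[ord] * m / (m:1)))

print(round(p_bh, 4))
print(all.equal(p_bh, p_manual))
```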

4.3.7 Conclusion

The post hoc analysis shows that the mean total charges differ between all pairs of categories except bank transfers and credit card payments.

4.3.8 Explanation:

Total charges are highest among bank transfer and credit card users, and lowest in the mailed check group. Segmentation by payment method can be recommended, as behaviour and usage differ markedly across groups, except between bank transfer and credit card users. Furthermore, mailed check customers can be encouraged to move to the other three methods to increase digitization.

4.4 Hypothesis 04

4.4.1 Hypotheses

  • Testing if the total charge differs by Contract Type
  • H0: mu1 = mu2 = mu3
  • H1: at least one mean is different

4.4.2 Visualization

e <- ggplot(df, aes(x = Contract, y = TotalCharges))
e + geom_boxplot()

4.4.3 Testing the normality of the data and Applying Shapiro Wilk test

The Shapiro-Wilk test is applied to the total charges within each contract type to check normality.

nor.test(df$TotalCharges~Contract, data = df)
## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : df$TotalCharges and Contract 
## 
##            Level Statistic      p.value   Normality
## 1 Month-to-month 0.7964279 8.671991e-57      Reject
## 2       One year 0.9318958 2.286757e-25      Reject
## 3       Two year 0.9151809 1.023804e-29      Reject
## --------------------------------------------------

The Shapiro-Wilk test gives p-values below 0.05 for all categories, so normality is rejected in every group. Therefore, the normality assumption is not met for the data.

4.4.4 Testing for homogeneity

# Applying Levene's test
leveneTest(df$TotalCharges ~ df$Contract, data = df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    2  554.55 < 2.2e-16 ***
##       7029                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Levene's test shows that the variances of the three groups are not equal, as the p-value is less than 0.001. The homogeneity assumption is therefore also not met for this data.

  • Therefore, instead of the parametric ANOVA, the non-parametric Kruskal-Wallis test will be used.

4.4.5 Applying the Kruskal-Wallis test

kruskal.test(df$TotalCharges ~ df$Contract, data = df)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$TotalCharges by df$Contract
## Kruskal-Wallis chi-squared = 1565.8, df = 2, p-value < 2.2e-16

The p-value reported by the Kruskal-Wallis test is less than 0.001; therefore the null hypothesis is rejected, and it is concluded that the mean total charge of at least one contract type differs from the others.

4.4.6 Posthoc Analysis

To identify which categories differ in their mean total charges, a pairwise post hoc analysis is applied.

pairwise.wilcox.test(df$TotalCharges, df$Contract, p.adjust.method = "BH")
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  df$TotalCharges and df$Contract 
## 
##          Month-to-month One year
## One year < 2e-16        -       
## Two year < 2e-16        1.9e-14 
## 
## P value adjustment method: BH

4.4.7 Conclusion

The post hoc analysis shows that the mean total charges differ between all three categories.

4.4.8 Explanation:

As the length of the contract period increases, the total charge also increases. Therefore, longer contracts should be promoted, which may also help reduce customer churn.
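
The rank-based tests show that the groups differ but not in which direction, so a per-group summary is needed to support a statement about increasing charges. The sketch below computes group medians with aggregate(). The toy values are made up for illustration and are not taken from the dataset.

```r
# Toy data standing in for df (values are made up for illustration)
toy <- data.frame(
  Contract = c("Month-to-month", "Month-to-month", "Month-to-month",
               "One year", "One year", "One year",
               "Two year", "Two year", "Two year"),
  TotalCharges = c(150, 400, NA, 1800, 2500, 3100, 2900, 3700, 5200)
)

# Median total charge and number of non-missing rows per contract type
# (the formula interface drops rows with missing TotalCharges by default)
med_by_contract <- aggregate(TotalCharges ~ Contract, data = toy, FUN = median)
n_by_contract   <- aggregate(TotalCharges ~ Contract, data = toy, FUN = length)

print(med_by_contract)
print(n_by_contract)
```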

4.5 Hypothesis 05

4.5.1 Hypotheses

  • Testing if the payment method and the contract type are independent
  • H0: payment method and the contract type are independent.
  • H1: payment method and the contract type are not independent.

4.5.2 Creating the contingency table

obs = table(df$PaymentMethod, df$Contract)
obs
##                            
##                             Month-to-month One year Two year
##   Bank transfer (automatic)            589      391      564
##   Credit card (automatic)              543      398      581
##   Electronic check                    1850      347      168
##   Mailed check                         893      337      382

4.5.3 Applying Chi-Square Test

chisq.test(obs)
## 
##  Pearson's Chi-squared test
## 
## data:  obs
## X-squared = 1001.6, df = 6, p-value < 2.2e-16

4.5.4 Interpreting the test result

The p-value obtained from the chi squared test for independence is less than 0.001.

Therefore the null hypothesis is rejected.

4.5.5 Assumption Checking

chisq.test(obs)$expected
##                            
##                             Month-to-month One year Two year
##   Bank transfer (automatic)       849.4960 322.9181 371.5860
##   Credit card (automatic)         837.3917 318.3169 366.2914
##   Electronic check               1301.2033 494.6252 569.1715
##   Mailed check                    886.9090 337.1399 387.9512

The expected count is greater than 5 in every cell. Therefore, the assumption that each expected count is at least 5 is met for the data.
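
The expected counts above follow from the formula (row total × column total) / grand total. A minimal sketch that recomputes them from the observed counts and checks the assumption programmatically is shown below; the matrix is typed in by hand from the contingency table in this section.

```r
# Observed counts copied from the contingency table above
# (rows: Bank transfer, Credit card, Electronic check, Mailed check;
#  columns: Month-to-month, One year, Two year)
obs <- matrix(c( 589, 391, 564,
                 543, 398, 581,
                1850, 347, 168,
                 893, 337, 382),
              nrow = 4, byrow = TRUE)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(round(expected, 4))

# Rule of thumb used above: every expected count should be at least 5
print(all(expected >= 5))
```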

4.5.6 Conclusion

There is sufficient evidence to conclude that the payment method and the contract type are not independent.

4.5.7 Explanation:

The customers' payment method is associated with their contract type. Therefore, an extended analysis is suggested to identify how customers in each contract type use the payment methods.
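
As a possible starting point for that extended analysis, the sketch below computes Cramér's V as a measure of the strength of the association and the standardized residuals to show which cells drive it. Neither quantity is part of the original analysis; the matrix is typed in by hand from the contingency table in this section.

```r
# Observed counts copied from the contingency table in this section
obs <- matrix(
  c( 589, 391, 564,
     543, 398, 581,
    1850, 347, 168,
     893, 337, 382),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Bank transfer (automatic)", "Credit card (automatic)",
      "Electronic check", "Mailed check"),
    c("Month-to-month", "One year", "Two year")
  )
)

res <- chisq.test(obs)

# Cramer's V: sqrt(X^2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(unname(res$statistic) / (sum(obs) * (min(dim(obs)) - 1)))
print(round(cramers_v, 3))

# Standardized residuals: cells far from 0 (roughly beyond +/- 2)
# contribute most to the association
print(round(res$stdres, 1))
```

Positive residuals mark combinations that occur more often than expected under independence, and negative residuals mark combinations that occur less often.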