Group 3 Members

Student ID	Name
22108694	Panji Winata
23101520	Chan Yung Her
23093746	Joyce Chua Xin Jie
17207213	Kan Sok Wah
23108481	Wong Mei Kait

1.0 Introduction

Telecommunications technology has changed drastically. Long-distance communication used to require days or weeks-long letters. Technology allows us to interact instantaneously via mobile phones and the internet. The digital age has made communication fast and efficient, enabling people to interact regardless of time or place. As technology advances, telecommunications devices are utilized for entertainment as well as communication. Netflix offers movies and TV shows, while TikTok and Instagram as social media allow self-expression. Telecommunications have become a hub of digital activity, facilitating communication, entertainment, and leisure. Telecom companies compete to provide features and perks to attract and keep consumers. Showcase a fast network, high data quota, and affordable service bundles. Customers have the choice to select the best service despite this severe competition. They will terminate their memberships and transfer providers if the service is poor. Customer turnover may hurt a provider’s company. The company’s income might plummet when clients quit, but satellite management and network infrastructure expenditures remain. Thus, suppliers must satisfy customers and comprehend subscription cancellations. Manual data analysis has several downsides. For complicated and massive datasets, this method is time-consuming and need many people. Because of human mistake like neglect or subjective prejudice, manual analysis is typically unreliable. In data-based decision making, these restrictions make manual methods less efficient and complex. Data science provides quicker, more accurate, and more efficient analytical techniques. Regression and classification models as part of data science process are useful in many fields. Data science process use complex algorithms and approaches to spot trends, anticipate, and give deep insights for management to make data-driven decisions.

As background stated, therefore below are identified case study problem questions:
1. What is the factors that could lead customer to cancel their subscription?
2. Is there any specific pattern of customer which cancel their subscription?
3. How accurate is the data science model implemented to this case study?

2.0 Objectives

To answer the problem questions, objective of this case study stated as follows:
1. To find the relationship between variables that affect customer subscription cancellation using a regression approach
2. To find patterns of customers who make subscription cancellation based on parameters defined in the dataset using classification method
3. To obtain the accuracy of the model results on clients at risk of subscription cancellation

3.0 Data Introduction

Before we introduce contain of dataset, csv file of dataset loaded into the R environment using command below:

df_telco_intro=read.csv("telco_customer_churn.csv")

Furthermore we examine the column and return 6 row value of each columns using head command.

head(df_telco_intro)

##   Age Avg.Monthly.GB.Download Avg.Monthly.Long.Distance.Charges  Churn.Category
## 1  72                       4                             19.44                
## 2  27                      59                             45.62                
## 3  59                       0                             16.07                
## 4  25                      27                              0.00                
## 5  31                      21                             17.22 Dissatisfaction
## 6  32                      30                             11.94      Competitor
##                    Churn.Reason Churn.Score         City CLTV       Contract
## 1                                        51    San Mateo 4849       Two Year
## 2                                        27 Sutter Creek 3715 Month-to-Month
## 3                                        59   Santa Cruz 5092 Month-to-Month
## 4                                        49         Brea 2068       One Year
## 5           Network reliability          88     San Jose 4026       One Year
## 6 Competitor had better devices          81     Stockton 5351 Month-to-Month
##         Country Customer.ID Customer.Status Dependents Device.Protection.Plan
## 1 United States  4526-ZJJTM          Stayed          0                      1
## 2 United States  5302-BDJNT          Stayed          0                      1
## 3 United States  5468-BPMMO          Stayed          0                      0
## 4 United States  2212-LYASK          Stayed          0                      1
## 5 United States  0378-XSZPU         Churned          0                      1
## 6 United States  6023-YEBUP         Churned          0                      1
##   Gender Internet.Service Internet.Type               Lat.Long Latitude
## 1 Female                1   Fiber Optic 37.538309, -122.305109 37.53831
## 2   Male                1   Fiber Optic  38.432145, -120.77069 38.43214
## 3   Male                0               37.007882, -122.065975 37.00788
## 4   Male                1           DSL  33.924143, -117.79387 33.92414
## 5   Male                1         Cable 37.311088, -121.961786 37.31109
## 6   Male                1         Cable 38.033219, -121.297433 38.03322
##   Longitude Married Monthly.Charge Multiple.Lines Number.of.Dependents
## 1 -122.3051       1          88.40              0                    0
## 2 -120.7707       0          95.50              1                    0
## 3 -122.0660       1          19.60              0                    0
## 4 -117.7939       1          45.85              0                    0
## 5 -121.9618       1          60.30              0                    0
## 6 -121.2974       0         100.95              1                    0
##   Number.of.Referrals   Offer Online.Backup Online.Security Paperless.Billing
## 1                   1                     1               0                 0
## 2                   0                     0               0                 0
## 3                   3                     0               0                 0
## 4                   3                     1               0                 1
## 5                   1 Offer B             1               1                 0
## 6                   0 Offer E             0               0                 1
##   Partner  Payment.Method Phone.Service Population Premium.Tech.Support Quarter
## 1       1 Bank Withdrawal             1      37926                    0      Q3
## 2       0 Bank Withdrawal             1       4610                    1      Q3
## 3       1 Bank Withdrawal             1       4563                    0      Q3
## 4       1     Credit Card             0       1408                    0      Q3
## 5       1     Credit Card             1      29914                    0      Q3
## 6       0 Bank Withdrawal             1      40611                    0      Q3
##   Referred.a.Friend Satisfaction.Score Senior.Citizen      State
## 1                 1                  3              1 California
## 2                 0                  3              0 California
## 3                 1                  5              0 California
## 4                 1                  4              0 California
## 5                 1                  2              0 California
## 6                 0                  1              0 California
##   Streaming.Movies Streaming.Music Streaming.TV Tenure.in.Months Total.Charges
## 1                1               1            0               25       2191.15
## 2                0               0            1               35       3418.20
## 3                0               0            0               46        851.20
## 4                1               1            0               27       1246.40
## 5                0               0            0               58       3563.80
## 6                1               1            1                3        329.95
##   Total.Extra.Data.Charges Total.Long.Distance.Charges Total.Refunds
## 1                        0                      486.00             0
## 2                        0                     1596.70             0
## 3                        0                      739.22             0
## 4                       30                        0.00             0
## 5                        0                      998.76             0
## 6                       10                       35.82             0
##   Total.Revenue Under.30 Unlimited.Data Zip.Code Churn
## 1       2677.15        0              1    94403     0
## 2       5014.90        1              1    95685     0
## 3       1590.42        0              0    95064     0
## 4       1276.40        1              0    92823     0
## 5       4562.56        0              1    95117     1
## 6        375.77        0              0    95210     1

Below are the columns list and explanation of each column content:
1. Age: Age of telecomunication customer.
2. Avg Monthly GB Download: Average in GigaByte of data downloaded by customer.
3. Avg Monthly Long Distance Charges: Average monthly long distance charges that customer has to pay.
4. Churn Category: Category of reasons for canceling subscription.
5. Churn Reason: Specific explanation for canceling subscription.
6. Churn Score: A value that describes the possibility of a customer canceling a subscription
7. City: The city where the customer lives
8. CLTV (Customer Lifetime Value): Projection of the total value of the customer’s contribution during the subscription
9. Contract: Type of customer contract duration
10. Country: City where the customer lives
11. Customer ID: Unique ID for each customer
12. Customer Status: Current customer status, whether still subscribed, just joined or has canceled the subscription
13. Dependents: Does the customer have dependents
14. Device Protection Plan: Does the customer subscribe to gadget damage protection insurance
15. Gender: Customer gender
16. Internet Service: Does the customer subscribe to a home internet service
17. Internet Type: Type of home internet service subscribed to by the customer
18. Lat Long: Latitude and Longitude of the customer’s residence
19. Latitude: Latitude of the customer’s residence
20. Longitude: Longitude of the customer’s residence
21. Married: Is the customer married
22. Monthly Charge: Monthly fee that must be paid by the customer
23. Multiple Lines: Does the customer have more than 1 telephone line
24. Number of Dependents: Number of customer dependents
25. Number of Referrals: Number of new customers generated from customer referral code
26. Offer: Category of offer received by customer
27. Online Backup: Does customer subscribe to online data backup service
28. Online Security: Does customer subscribe to online security service
29. Paperless Billing: Does customer utilize paperless billing system
30. Partner: Does customer have a partner
31. Payment Method: Payment method used by customer
32. Phone Service: Does customer subscribe to phone service
33. Population: Population in customer’s area of residence
34. Premium Tech Support: Does customer subscribe to premium support service
35. Quarter: Quarter of period data service is taken by customer
36. Referred a Friend: Did customer recommend the product to a friend
37. Satisfaction Score: Customer satisfaction score
38. Senior Citizen: Does customer fall into senior citizen category
39. State: State where customer resides
40. Streaming Movies: Does customer use streaming service
41. Streaming Music: Does customer use streaming music service
42. Streaming TV: Does customer use streaming TV service
43. Tenure in Months: How many months customer has subscribed
44. Total Charges: Total amount paid by customer
45. Total Extra Data Charges: Total amount paid by customer for additional data charges
46. Total Long Distance Charges: Total long distance usage charges paid by the customer
47. Total Refunds: Total refunds given to the customer
48. Total Revenue: Total revenue generated from the customer
49. Under 30: Is the customer under 30 years old
50. Unlimited Data: Does the customer subscribe to an unlimited data plan
51. Zip Code: The postal code where the customer lives
52. Churn: Did the customer cancel the subscription

4.0 Data Cleansing

telco_customer_churn <- read.csv("telco_customer_churn.csv", stringsAsFactors = TRUE)

# Count empty strings
empty_counts <- colSums(telco_customer_churn == "")

# Count NA values
na_counts <- colSums(is.na(telco_customer_churn))

# Combine results for clarity
data.frame(
  Column = names(telco_customer_churn),
  EmptyStrings = empty_counts,
  NAValues = na_counts
)

##                                                              Column
## Age                                                             Age
## Avg.Monthly.GB.Download                     Avg.Monthly.GB.Download
## Avg.Monthly.Long.Distance.Charges Avg.Monthly.Long.Distance.Charges
## Churn.Category                                       Churn.Category
## Churn.Reason                                           Churn.Reason
## Churn.Score                                             Churn.Score
## City                                                           City
## CLTV                                                           CLTV
## Contract                                                   Contract
## Country                                                     Country
## Customer.ID                                             Customer.ID
## Customer.Status                                     Customer.Status
## Dependents                                               Dependents
## Device.Protection.Plan                       Device.Protection.Plan
## Gender                                                       Gender
## Internet.Service                                   Internet.Service
## Internet.Type                                         Internet.Type
## Lat.Long                                                   Lat.Long
## Latitude                                                   Latitude
## Longitude                                                 Longitude
## Married                                                     Married
## Monthly.Charge                                       Monthly.Charge
## Multiple.Lines                                       Multiple.Lines
## Number.of.Dependents                           Number.of.Dependents
## Number.of.Referrals                             Number.of.Referrals
## Offer                                                         Offer
## Online.Backup                                         Online.Backup
## Online.Security                                     Online.Security
## Paperless.Billing                                 Paperless.Billing
## Partner                                                     Partner
## Payment.Method                                       Payment.Method
## Phone.Service                                         Phone.Service
## Population                                               Population
## Premium.Tech.Support                           Premium.Tech.Support
## Quarter                                                     Quarter
## Referred.a.Friend                                 Referred.a.Friend
## Satisfaction.Score                               Satisfaction.Score
## Senior.Citizen                                       Senior.Citizen
## State                                                         State
## Streaming.Movies                                   Streaming.Movies
## Streaming.Music                                     Streaming.Music
## Streaming.TV                                           Streaming.TV
## Tenure.in.Months                                   Tenure.in.Months
## Total.Charges                                         Total.Charges
## Total.Extra.Data.Charges                   Total.Extra.Data.Charges
## Total.Long.Distance.Charges             Total.Long.Distance.Charges
## Total.Refunds                                         Total.Refunds
## Total.Revenue                                         Total.Revenue
## Under.30                                                   Under.30
## Unlimited.Data                                       Unlimited.Data
## Zip.Code                                                   Zip.Code
## Churn                                                         Churn
##                                   EmptyStrings NAValues
## Age                                          0        0
## Avg.Monthly.GB.Download                      0        0
## Avg.Monthly.Long.Distance.Charges            0        0
## Churn.Category                            3104        0
## Churn.Reason                              3104        0
## Churn.Score                                  0        0
## City                                         0        0
## CLTV                                         0        0
## Contract                                     0        0
## Country                                      0        0
## Customer.ID                                  0        0
## Customer.Status                              0        0
## Dependents                                   0        0
## Device.Protection.Plan                       0        0
## Gender                                       0        0
## Internet.Service                             0        0
## Internet.Type                              886        0
## Lat.Long                                     0        0
## Latitude                                     0        0
## Longitude                                    0        0
## Married                                      0        0
## Monthly.Charge                               0        0
## Multiple.Lines                               0        0
## Number.of.Dependents                         0        0
## Number.of.Referrals                          0        0
## Offer                                     2324        0
## Online.Backup                                0        0
## Online.Security                              0        0
## Paperless.Billing                            0        0
## Partner                                      0        0
## Payment.Method                               0        0
## Phone.Service                                0        0
## Population                                   0        0
## Premium.Tech.Support                         0        0
## Quarter                                      0        0
## Referred.a.Friend                            0        0
## Satisfaction.Score                           0        0
## Senior.Citizen                               0        0
## State                                        0        0
## Streaming.Movies                             0        0
## Streaming.Music                              0        0
## Streaming.TV                                 0        0
## Tenure.in.Months                             0        0
## Total.Charges                                0        0
## Total.Extra.Data.Charges                     0        0
## Total.Long.Distance.Charges                  0        0
## Total.Refunds                                0        0
## Total.Revenue                                0        0
## Under.30                                     0        0
## Unlimited.Data                               0        0
## Zip.Code                                     0        0
## Churn                                        0        0

# Replace empty strings with NA
telco_customer_churn[telco_customer_churn == ""] <- NA

# Calculate the percentage of missing values for each column
missing_percentage <- colSums(is.na(telco_customer_churn)) / nrow(telco_customer_churn) * 100

# Filter columns with missing values
missing_percentage <- missing_percentage[missing_percentage > 0]

# Display the columns with missing values and their percentages 
for (col_name in names(missing_percentage)) {
  cat(col_name, "    ", format(missing_percentage[col_name], digits = 2), "\n")
}

## Churn.Category      73 
## Churn.Reason      73 
## Internet.Type      21 
## Offer      55

# Remove columns with missing values >50%
telco_customer_churn <- telco_customer_churn[, colSums(is.na(telco_customer_churn)) / nrow(telco_customer_churn) < 0.5]

# Drop irrelevant columns
telco_customer_churn <- telco_customer_churn %>% select(-Customer.ID, -Lat.Long, -Zip.Code,-Latitude,-Longitude)

# Check if Internet.Type is NA when Internet.Service is "0"
no_internet <- is.na(telco_customer_churn$Internet.Type) & telco_customer_churn$Internet.Service == "0"
sum(no_internet, na.rm = TRUE)

## [1] 886

# Replace NA in Internet.Type with "No Internet" where Internet.Service == "0"
telco_customer_churn$Internet.Type <- as.character(telco_customer_churn$Internet.Type)
telco_customer_churn$Internet.Type[is.na(telco_customer_churn$Internet.Type) & telco_customer_churn$Internet.Service == "0"] <- "No Internet"
sum(is.na(telco_customer_churn$Internet.Type))

## [1] 0

# Binary encoding (Male = 1, Female = 0)
telco_customer_churn$Gender <- ifelse(telco_customer_churn$Gender == "Male", 1, 0)

# One-hot encoding using model.matrix
telco_encoded1 <- model.matrix(~ Internet.Type - 1, data = telco_customer_churn)
telco_encoded2 <- model.matrix(~ Contract - 1, data = telco_customer_churn)
telco_encoded3 <- model.matrix(~ Payment.Method - 1, data = telco_customer_churn)

# Combine the encoded data with the original dataset
telco_customer_churn <- cbind(telco_customer_churn, telco_encoded1, telco_encoded2, telco_encoded3)

# Remove the original categorical columns
telco_customer_churn <- subset(telco_customer_churn, select = -c(Internet.Type, Contract, Payment.Method))

# Save the cleaned dataset
write.csv(telco_customer_churn, file = "telco_customer_churn_clean.csv")

To handle the missing data, we first count empty strings and NA values for each column. Then, the empty strings are substituted with NA values to ensure consistency. Irrelevant columns or columns with more than 50% missing data are removed. Next, we conduct a data validation check to ensure that all missing values in the Internet Type column are consistent with customers who have “No Internet” in the Internet Service column. This is necessary to confirm that missing values in the Internet Type field are due to customers having no internet service. This validation ensures that the missing data is not a result of human error or other factors and that the dataset is reliable for analysis. Furthermore, binary encoding is done for the gender column, ‘1’ indicating male, ‘0’ indicating female. Lastly, the One-Hot Encoding method is applied to the categorical columns such as Internet Type, Contract, and Payment Method, generating new columns for each category in each column. The original columns were then removed.

5.0 Exploratory Data Analysis

5.1 Univariate Analysis

# Summary Statistics
summary(telco_customer_churn)

##       Age        Avg.Monthly.GB.Download Avg.Monthly.Long.Distance.Charges
##  Min.   :19.00   Min.   : 0.00           Min.   : 0.00                    
##  1st Qu.:32.00   1st Qu.: 4.00           1st Qu.: 9.05                    
##  Median :46.00   Median :17.00           Median :22.57                    
##  Mean   :46.45   Mean   :20.74           Mean   :22.77                    
##  3rd Qu.:60.00   3rd Qu.:27.00           3rd Qu.:36.17                    
##  Max.   :80.00   Max.   :85.00           Max.   :49.99                    
##                                                                           
##   Churn.Score               City           CLTV               Country    
##  Min.   : 5.00   Los Angeles  : 175   Min.   :2003   United States:4225  
##  1st Qu.:40.00   San Diego    : 168   1st Qu.:3493                       
##  Median :61.00   San Francisco:  61   Median :4531                       
##  Mean   :58.28   San Jose     :  61   Mean   :4410                       
##  3rd Qu.:75.00   Sacramento   :  60   3rd Qu.:5381                       
##  Max.   :96.00   Long Beach   :  42   Max.   :6500                       
##                  (Other)      :3658                                      
##  Customer.Status   Dependents     Device.Protection.Plan     Gender      
##  Churned:1121    Min.   :0.0000   Min.   :0.000          Min.   :0.0000  
##  Joined : 272    1st Qu.:0.0000   1st Qu.:0.000          1st Qu.:0.0000  
##  Stayed :2832    Median :0.0000   Median :0.000          Median :1.0000  
##                  Mean   :0.2331   Mean   :0.347          Mean   :0.5027  
##                  3rd Qu.:0.0000   3rd Qu.:1.000          3rd Qu.:1.0000  
##                  Max.   :1.0000   Max.   :1.000          Max.   :1.0000  
##                                                                          
##  Internet.Service    Married       Monthly.Charge   Multiple.Lines  
##  Min.   :0.0000   Min.   :0.0000   Min.   : 18.25   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.: 38.55   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median : 70.20   Median :0.0000  
##  Mean   :0.7903   Mean   :0.4833   Mean   : 64.91   Mean   :0.4161  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 89.75   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :118.75   Max.   :1.0000  
##                                                                     
##  Number.of.Dependents Number.of.Referrals Online.Backup    Online.Security 
##  Min.   :0.0000       Min.   : 0.000      Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000       1st Qu.: 0.000      1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000       Median : 0.000      Median :0.0000   Median :0.0000  
##  Mean   :0.4727       Mean   : 1.997      Mean   :0.3505   Mean   :0.2925  
##  3rd Qu.:0.0000       3rd Qu.: 3.000      3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :8.0000       Max.   :11.000      Max.   :1.0000   Max.   :1.0000  
##                                                                            
##  Paperless.Billing    Partner       Phone.Service      Population    
##  Min.   :0.0000    Min.   :0.0000   Min.   :0.0000   Min.   :    11  
##  1st Qu.:0.0000    1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:  2346  
##  Median :1.0000    Median :0.0000   Median :1.0000   Median : 17554  
##  Mean   :0.5924    Mean   :0.4833   Mean   :0.8994   Mean   : 22009  
##  3rd Qu.:1.0000    3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 35737  
##  Max.   :1.0000    Max.   :1.0000   Max.   :1.0000   Max.   :105285  
##                                                                      
##  Premium.Tech.Support Quarter   Referred.a.Friend Satisfaction.Score
##  Min.   :0.0000       Q3:4225   Min.   :0.0000    Min.   :1.000     
##  1st Qu.:0.0000                 1st Qu.:0.0000    1st Qu.:3.000     
##  Median :0.0000                 Median :0.0000    Median :3.000     
##  Mean   :0.2966                 Mean   :0.4599    Mean   :3.242     
##  3rd Qu.:1.0000                 3rd Qu.:1.0000    3rd Qu.:4.000     
##  Max.   :1.0000                 Max.   :1.0000    Max.   :5.000     
##                                                                     
##  Senior.Citizen          State      Streaming.Movies Streaming.Music 
##  Min.   :0.0000   California:4225   Min.   :0.000    Min.   :0.0000  
##  1st Qu.:0.0000                     1st Qu.:0.000    1st Qu.:0.0000  
##  Median :0.0000                     Median :0.000    Median :0.0000  
##  Mean   :0.1647                     Mean   :0.383    Mean   :0.3496  
##  3rd Qu.:0.0000                     3rd Qu.:1.000    3rd Qu.:1.0000  
##  Max.   :1.0000                     Max.   :1.000    Max.   :1.0000  
##                                                                      
##   Streaming.TV    Tenure.in.Months Total.Charges    Total.Extra.Data.Charges
##  Min.   :0.0000   Min.   : 1.00    Min.   :  18.8   Min.   :  0.000         
##  1st Qu.:0.0000   1st Qu.: 9.00    1st Qu.: 401.5   1st Qu.:  0.000         
##  Median :0.0000   Median :30.00    Median :1424.6   Median :  0.000         
##  Mean   :0.3837   Mean   :32.68    Mean   :2306.1   Mean   :  6.966         
##  3rd Qu.:1.0000   3rd Qu.:56.00    3rd Qu.:3846.8   3rd Qu.:  0.000         
##  Max.   :1.0000   Max.   :72.00    Max.   :8672.5   Max.   :150.000         
##                                                                             
##  Total.Long.Distance.Charges Total.Refunds    Total.Revenue         Under.30  
##  Min.   :   0.00             Min.   : 0.000   Min.   :   21.36   Min.   :0.0  
##  1st Qu.:  67.68             1st Qu.: 0.000   1st Qu.:  592.75   1st Qu.:0.0  
##  Median : 396.64             Median : 0.000   Median : 2151.47   Median :0.0  
##  Mean   : 754.72             Mean   : 1.954   Mean   : 3065.81   Mean   :0.2  
##  3rd Qu.:1200.00             3rd Qu.: 0.000   3rd Qu.: 4845.75   3rd Qu.:0.0  
##  Max.   :3564.00             Max.   :49.790   Max.   :11979.34   Max.   :1.0  
##                                                                               
##  Unlimited.Data       Churn        Internet.TypeCable Internet.TypeDSL
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000     Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000     1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.0000     Median :0.0000  
##  Mean   :0.6779   Mean   :0.2653   Mean   :0.1188     Mean   :0.2386  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000     3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000     Max.   :1.0000  
##                                                                       
##  Internet.TypeFiber Optic Internet.TypeNo Internet ContractMonth-to-Month
##  Min.   :0.0000           Min.   :0.0000           Min.   :0.0000        
##  1st Qu.:0.0000           1st Qu.:0.0000           1st Qu.:0.0000        
##  Median :0.0000           Median :0.0000           Median :1.0000        
##  Mean   :0.4329           Mean   :0.2097           Mean   :0.5191        
##  3rd Qu.:1.0000           3rd Qu.:0.0000           3rd Qu.:1.0000        
##  Max.   :1.0000           Max.   :1.0000           Max.   :1.0000        
##                                                                          
##  ContractOne Year ContractTwo Year Payment.MethodBank Withdrawal
##  Min.   :0.000    Min.   :0.000    Min.   :0.0000               
##  1st Qu.:0.000    1st Qu.:0.000    1st Qu.:0.0000               
##  Median :0.000    Median :0.000    Median :1.0000               
##  Mean   :0.214    Mean   :0.267    Mean   :0.5503               
##  3rd Qu.:0.000    3rd Qu.:1.000    3rd Qu.:1.0000               
##  Max.   :1.000    Max.   :1.000    Max.   :1.0000               
##                                                                 
##  Payment.MethodCredit Card Payment.MethodMailed Check
##  Min.   :0.0000            Min.   :0.00000           
##  1st Qu.:0.0000            1st Qu.:0.00000           
##  Median :0.0000            Median :0.00000           
##  Mean   :0.3976            Mean   :0.05207           
##  3rd Qu.:1.0000            3rd Qu.:0.00000           
##  Max.   :1.0000            Max.   :1.00000           
##

# Histogram for monthly charge
Monthly_Charge_Plot <- ggplot(telco_customer_churn, aes(x = Monthly.Charge)) +
  geom_histogram(bins = 30, fill = "purple", color = "black") +
  theme_minimal() + labs(title="Distribution of Monthly Charges")

# Histogram for age
Age_Plot <- ggplot(telco_customer_churn, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "purple", color = "black") +
  theme_minimal() + labs(title="Distribution of Age")

# Histogram for CLTV
CLTV_Plot <- ggplot(telco_customer_churn, aes(x = CLTV)) +
  geom_histogram(bins = 30, fill = "purple", color = "black") +
  theme_minimal() + labs(title="Distribution of CLTV")

# Histogram for tenure
Tenure_Plot <- ggplot(telco_customer_churn, aes(x = Tenure.in.Months)) +
  geom_histogram(bins = 30, fill = "purple", color = "black") +
  theme_minimal() + labs(title="Distribution of Tenure in Months")

# Combine the four plots into one
grid.arrange(Monthly_Charge_Plot, Age_Plot, CLTV_Plot, Tenure_Plot, nrow = 1)

We first summarized the dataset using summary statistics for each column. Then, we have created some visualizations to examine the key variables. Histograms are generated to show the distributions for numerical columns, while bar charts are plotted for the categorical columns. Additionally, a churn distribution is visualized to picture the proportion of the customers who churned and do not churn.

# Bar chart for contract
contract_counts <- data.frame(
  Contract_Type = c("Month-to-Month", "One Year", "Two Year"),
  Count = c(sum(telco_customer_churn$`ContractMonth-to-Month` == 1),
            sum(telco_customer_churn$`ContractOne Year` == 1),
            sum(telco_customer_churn$`ContractTwo Year` == 1))
)

contract_plot <- ggplot(contract_counts, aes(x = Contract_Type, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Count of Contract Types", x = "Contract Type", y = "Count")

# Bar chart for churn
ggplot(telco_customer_churn, aes(Churn)) + geom_bar(fill = "pink", color = "black") + theme_minimal() +
  labs(title = "Churn Count", x = "Churn", y = "Count")

In bivariate analysis, we explore the relationships between various factors and churn using visualization and statistical tests. The chi-square test is used for categorical variables, while the T-test is used for quantitative variables.

5.2 Bivariate Analysis

# Contract VS Churn
contract_churn_table <- table(
  Churn = telco_customer_churn$Churn,
  Contract = factor(
    apply(telco_customer_churn[, c("ContractMonth-to-Month", "ContractOne Year", "ContractTwo Year")], 1, function(x) {
      if (x[1] == 1) return("Month-to-Month")
      if (x[2] == 1) return("One Year")
      if (x[3] == 1) return("Two Year")
    }),
    levels = c("Month-to-Month", "One Year", "Two Year")
  )
)

contract_churn_df <- as.data.frame(contract_churn_table)
colnames(contract_churn_df) <- c("Churn", "Contract", "Count")

# To create a grouped bar chart
ggplot(contract_churn_df, aes(x = Contract, y = Count, fill = as.factor(Churn))) +
  geom_bar(stat = "identity", position = "dodge") +  
  theme_minimal() +
  labs(title = "Churn vs. Contract Type",
       x = "Contract Type",
       y = "Count",
       fill = "Churn") +  
  scale_fill_manual(values = c("skyblue", "orange"), labels = c("No Churn", "Churn"))

# Chi-square test 
chi_square_result <- chisq.test(contract_churn_table)
chi_square_result

## 
##  Pearson's Chi-squared test
## 
## data:  contract_churn_table
## X-squared = 856.16, df = 2, p-value < 2.2e-16

## Internet Type VS Churn
# Create a contingency table
internet_churn_table <- table(
  Churn = telco_customer_churn$Churn,
  Internet_Type = factor(
    apply(telco_customer_churn[, c("Internet.TypeCable", "Internet.TypeDSL", "Internet.TypeFiber Optic", "Internet.TypeNo Internet")], 1, function(x) {
      if (x[1] == 1) return("Cable")
      if (x[2] == 1) return("DSL")
      if (x[3] == 1) return("Fiber Optic")
      if (x[4] == 1) return("No Internet")
    }),
    levels = c("Cable", "DSL", "Fiber Optic", "No Internet")
  )
)

internet_churn_df <- as.data.frame(internet_churn_table)
colnames(internet_churn_df) <- c("Churn", "Internet_Type", "Count")

# To create a grouped bar chart
ggplot(internet_churn_df, aes(x = Internet_Type, y = Count, fill = as.factor(Churn))) +
  geom_bar(stat = "identity", position = "dodge") +  
  theme_minimal() +
  labs(title = "Churn vs. Internet Type",
       x = "Internet Type",
       y = "Count",
       fill = "Churn") +  
  scale_fill_manual(values = c("skyblue", "orange"), labels = c("No Churn", "Churn"))

# Chi-square test 
chi_square_internet_result <- chisq.test(internet_churn_table)
chi_square_internet_result

## 
##  Pearson's Chi-squared test
## 
## data:  internet_churn_table
## X-squared = 365.6, df = 3, p-value < 2.2e-16

## Monthly Charges VS Churn

telco_customer_churn$Churn <- factor(telco_customer_churn$Churn, labels = c("No Churn", "Churn"))

# To create a boxplot
ggplot(telco_customer_churn, aes(x = Churn, y = Monthly.Charge, fill = Churn)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Comparison of Monthly Charge Across Churn",
       x = "Churn",
       y = "Monthly Charge") +
  scale_fill_manual(values = c("skyblue", "orange"))

# T-test
t_test_result <- t.test(Monthly.Charge ~ Churn, data = telco_customer_churn)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  Monthly.Charge by Churn
## t = -13.27, df = 2480.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No Churn and group Churn is not equal to 0
## 95 percent confidence interval:
##  -14.03326 -10.41992
## sample estimates:
## mean in group No Churn    mean in group Churn 
##               61.66244               73.88903

## Tenure VS Churn

# To create a boxplot
ggplot(telco_customer_churn, aes(x = Churn, y = Tenure.in.Months, fill = Churn)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Comparison of Tenure Across Churn",
       x = "Churn",
       y = "Tenure (in Months)") +
  scale_fill_manual(values = c("skyblue", "orange"))

# T-test 
t_test_tenure_result <- t.test(Tenure.in.Months ~ Churn, data = telco_customer_churn)
t_test_tenure_result

## 
##  Welch Two Sample t-test
## 
## data:  Tenure.in.Months by Churn
## t = 27.634, df = 2457.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No Churn and group Churn is not equal to 0
## 95 percent confidence interval:
##  18.54804 21.38153
## sample estimates:
## mean in group No Churn    mean in group Churn 
##               37.97906               18.01427

Result of chi-square test : X-squared = 856.16, df = 2, p-value < 2.2e-16 The p-value < 0.05, indicates a statistically significant relationship between contract type and churn.

Result of chi-square test : 365.6, df = 3, p-value < 2.2e-16 The p-value < 0.05, indicates a statistically significant relationship between internet type and churn.

Result of T-test : t = -13.27, df = 2480.7, p-value < 2.2e-16 The p-value < 0.05, thus we reject the null hypothesis. The true difference in means between group No Churn and group Churn is not equal to 0.

Result of T-test : t = 27.634, df = 2457.8, p-value < 2.2e-16 The p-value < 0.05, thus we reject the null hypothesis. The true difference in means between group No Churn and group Churn is not equal to 0.

6.0 Data Preprocessing

#Load the cleaned dataset
telco_customer_churn <- read.csv("telco_customer_churn_clean.csv")

6.1 Preprocessing for Logistic Regression and Decision Tree

# Set Churn as a factor
telco_customer_churn$Churn <- as.factor(telco_customer_churn$Churn)
class(telco_customer_churn$Churn)

## [1] "factor"

# Set a seed for reproducibility
set.seed(42)

# Stratified Train-Test Split
#p = 0.75 means that 75% of the data is allocated to the training set, and 25% is allocated to the testing set.
trainIndex <- createDataPartition(telco_customer_churn$Churn, p = 0.75, 
                                  list = FALSE, times = 1)
train_Telco <- telco_customer_churn[trainIndex, ]
test_Telco <- telco_customer_churn[-trainIndex, ]

# Verify proportions
cat("Proportions in Training Set:\n")

## Proportions in Training Set:

print(table(train_Telco$Churn) / nrow(train_Telco)) # Check distribution in training set

## 
##         0         1 
## 0.7346166 0.2653834

cat("Proportions in Testing Set:\n")

## Proportions in Testing Set:

print(table(test_Telco$Churn) / nrow(test_Telco)) # Check distribution in testing set

## 
##         0         1 
## 0.7348485 0.2651515

train_Telco: Contains 75% of the data (used for model training). test_Telco: Contains the remaining 25% of the data (used for model evaluation).

6.2 Preprocessing for XGBoost

# Define features and target
features <- c(
  "Churn.Score", "Internet.TypeFiber.Optic", "ContractMonth.to.Month", "Dependents", 
  "Number.of.Referrals", "Satisfaction.Score", "Tenure.in.Months", "ContractTwo.Year"
)

target <- "Churn"

# Subset data with selected features and target
df_subset <- telco_customer_churn[, c(target, features)]

# Train-Test Split
set.seed(42)
total_rows <- nrow(df_subset)
train_idx <- sample(total_rows, total_rows * 0.8)
test_idx <- setdiff(1:total_rows, train_idx)

train_df <- df_subset[train_idx, ]
test_df <- df_subset[test_idx, ]

# One-hot encode categorical variables and convert to matrix
train_x <- model.matrix(~ . - 1, data = train_df[, -1])  # Exclude target column
test_x <- model.matrix(~ . - 1, data = test_df[, -1])

# Convert target variable to numeric (0 and 1)
train_y <- as.numeric(as.character(train_df[, 1]))
test_y <- as.numeric(as.character(test_df[, 1]))

# Create XGBoost DMatrix
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dtest <- xgb.DMatrix(data = test_x, label = test_y)

7.0 Modeling

We will use Logistic regression, Decision Tree, and XGBoosts to predict the churn.

7.1 Logistic regression

As the bar char shown in EDA,the number of “1” (Churn) is three times higher than “0” (No Churn), so class weights is used to increase importance to the minority class, and thus, balance the impact of each class.

  # Calculate class weights
  class_weights <- ifelse(train_Telco$Churn == 1, 
                          sum(train_Telco$Churn == 0) / sum(train_Telco$Churn == 1), 
                          1)
  # Train Logistic Regression Model with Weights
LR_model <- glm(Churn ~ Churn.Score + Internet.TypeFiber.Optic+ ContractMonth.to.Month + Dependents + Number.of.Referrals + Satisfaction.Score + Tenure.in.Months + ContractTwo.Year, 
  data = train_Telco, 
  family = binomial, 
  weights = class_weights)

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Summary of the logistic regression model
summary(LR_model)

## 
## Call:
## glm(formula = Churn ~ Churn.Score + Internet.TypeFiber.Optic + 
##     ContractMonth.to.Month + Dependents + Number.of.Referrals + 
##     Satisfaction.Score + Tenure.in.Months + ContractTwo.Year, 
##     family = binomial, data = train_Telco, weights = class_weights)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.781e+01  1.694e+03   0.028 0.977490    
## Churn.Score               1.811e-01  1.333e-02  13.586  < 2e-16 ***
## Internet.TypeFiber.Optic  8.844e-01  2.127e-01   4.158 3.22e-05 ***
## ContractMonth.to.Month    1.153e+00  3.154e-01   3.657 0.000255 ***
## Dependents               -1.310e+00  3.366e-01  -3.893 9.90e-05 ***
## Number.of.Referrals      -4.081e-01  7.675e-02  -5.317 1.05e-07 ***
## Satisfaction.Score       -2.032e+01  5.648e+02  -0.036 0.971299    
## Tenure.in.Months         -2.047e-02  6.198e-03  -3.303 0.000956 ***
## ContractTwo.Year         -7.844e-01  3.742e-01  -2.096 0.036053 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6454.59  on 3168  degrees of freedom
## Residual deviance:  622.97  on 3160  degrees of freedom
## AIC: 666.04
## 
## Number of Fisher Scoring iterations: 21

Churn.Score, Internet.TypeFiber.Optic, ContractMonth.to.Month, Dependents, Number.of.Referrals, Tenure.in.Months, and ContractTwo.Year are all statistically significant (p < 0.05). Positive coefficients (Churn.Score, Internet.TypeFiber.Optic, ContractMonth.to.Month) increase the likelihood of churn. Negative coefficients (Dependents, Number.of.Referrals, Tenure.in.Months, ContractTwo.Year) decrease the likelihood of churn. The large reduction in deviance and low AIC indicate that the model fits the data well.

7.2. Decision Tree

DT_model <- rpart(Churn ~ Churn.Score + Internet.TypeFiber.Optic + 
                    ContractMonth.to.Month + Dependents + Number.of.Referrals + 
                    Satisfaction.Score + Tenure.in.Months + ContractTwo.Year, 
                  data = train_Telco, method = "class")
summary(DT_model)

## Call:
## rpart(formula = Churn ~ Churn.Score + Internet.TypeFiber.Optic + 
##     ContractMonth.to.Month + Dependents + Number.of.Referrals + 
##     Satisfaction.Score + Tenure.in.Months + ContractTwo.Year, 
##     data = train_Telco, method = "class")
##   n= 3169 
## 
##          CP nsplit rel error    xerror       xstd
## 1 0.7633769      0 1.0000000 1.0000000 0.02955510
## 2 0.1284185      1 0.2366231 0.2366231 0.01623856
## 3 0.0100000      2 0.1082045 0.1082045 0.01117887
## 
## Variable importance
## Satisfaction.Score        Churn.Score 
##                 62                 38 
## 
## Node number 1: 3169 observations,    complexity param=0.7633769
##   predicted class=0  expected loss=0.2653834  P(node) =1
##     class counts:  2328   841
##    probabilities: 0.735 0.265 
##   left son=2 (2527 obs) right son=3 (642 obs)
##   Primary splits:
##       Satisfaction.Score     < 2.5  to the right, improve=868.9674, (0 missing)
##       Churn.Score            < 80.5 to the left,  improve=598.5973, (0 missing)
##       ContractMonth.to.Month < 0.5  to the left,  improve=243.6513, (0 missing)
##       Tenure.in.Months       < 17.5 to the right, improve=136.0844, (0 missing)
##       ContractTwo.Year       < 0.5  to the right, improve=134.6416, (0 missing)
##   Surrogate splits:
##       Churn.Score < 80.5 to the left,  agree=0.878, adj=0.399, (0 split)
## 
## Node number 2: 2527 observations,    complexity param=0.1284185
##   predicted class=0  expected loss=0.07874951  P(node) =0.7974124
##     class counts:  2328   199
##    probabilities: 0.921 0.079 
##   left son=4 (2419 obs) right son=5 (108 obs)
##   Primary splits:
##       Churn.Score            < 80.5 to the left,  improve=191.50430, (0 missing)
##       Satisfaction.Score     < 3.5  to the right, improve= 32.84084, (0 missing)
##       ContractMonth.to.Month < 0.5  to the left,  improve= 24.95558, (0 missing)
##       Tenure.in.Months       < 2.5  to the right, improve= 22.86974, (0 missing)
##       Number.of.Referrals    < 1.5  to the right, improve= 15.43606, (0 missing)
## 
## Node number 3: 642 observations
##   predicted class=1  expected loss=0  P(node) =0.2025876
##     class counts:     0   642
##    probabilities: 0.000 1.000 
## 
## Node number 4: 2419 observations
##   predicted class=0  expected loss=0.03761885  P(node) =0.7633323
##     class counts:  2328    91
##    probabilities: 0.962 0.038 
## 
## Node number 5: 108 observations
##   predicted class=1  expected loss=0  P(node) =0.03408015
##     class counts:     0   108
##    probabilities: 0.000 1.000

#Decision Tree Visualization
rpart.plot(DT_model, main = "Decision Tree")

Satisfaction.Score (62%) and Churn.Score (38%) are the most influential variables in predicting churn. The tree primarily splits on these variables.

The Decision Tree chart shows that 100% of the customers with Satisfaction.Score < 3 would churn. For customers with Satisfaction.Score >= 3) and Churn.Score < 81, they are predicted not to churn with a probability of 0.04 (76%). For customers with Satisfaction.Score >= 3) and Churn.Score >= 81, they are predicted to churn with a probability of 1.0 (100%), but this applies to only 3% of the total data.

7.3 XGBoost

# Calculate scale_pos_weight for class imbalance
ratio <- sum(train_y == 0) / sum(train_y == 1)

params <- list(
  max_depth = 6,
  eta = 0.3,
  objective = "binary:logistic",
  eval_metric = "auc",
  scale_pos_weight = ratio
)

final_model <- xgboost(
  data = dtrain,
  params = params,
  nrounds = 100,
  verbose = 0
)
importance <- xgb.importance(model = final_model)
print(importance)

##                     Feature        Gain      Cover  Frequency
##                      <char>       <num>      <num>      <num>
## 1:       Satisfaction.Score 0.664155415 0.23335940 0.03417635
## 2:              Churn.Score 0.253474645 0.35137528 0.33424470
## 3:         Tenure.in.Months 0.034530450 0.20475469 0.41285031
## 4:      Number.of.Referrals 0.016455623 0.09078490 0.06835270
## 5:   ContractMonth.to.Month 0.015972345 0.02319221 0.03075871
## 6: Internet.TypeFiber.Optic 0.008693037 0.04202293 0.08133971
## 7:               Dependents 0.003901611 0.03167210 0.02187286
## 8:         ContractTwo.Year 0.002816874 0.02283850 0.01640465

Satisfaction.Score is the most important feature (Gain = 0.664) increase the likelihood of churn. Churn.Score is the second most important feature (Gain = 0.253) significant in increasing the likelihood of churn. ContractMonth.to.Month, ContractTwo.Year, Internet.TypeFiber.Optic, and Dependents have smaller gain value but will still contribute to the churn.

8.0 Evaluation Measures

To evaluate the performances of the classification models in churn prediction, we have used accuracy, precision, recall, F1-score, and ROC-AUC, calculated from contents of confusion matrix.

Convert probabilities to binary predictions based on a threshold (default = 0.3) If predicted probability >= 0.5, prediction will be labeled as “1”; If predicted probability < 0.5, prediction will be labeled as “0”.

evaluate_model <- function(actual, predicted, predicted_prob = NULL, threshold = 0.5, model_name = "") {
  if (!is.null(predicted_prob)) {
    predicted <- ifelse(predicted_prob > threshold, 1, 0)
  }
  
  # Convert to factors with matching levels
  actual <- factor(actual, levels = c(0, 1))
  predicted <- factor(predicted, levels = c(0, 1))
  
  # Confusion Matrix
  library(caret)
  cm <- confusionMatrix(predicted, actual)
  
  # Metrics
  accuracy <- cm$overall["Accuracy"]
  precision <- cm$byClass["Precision"]
  recall <- cm$byClass["Recall"]
  f1_score <- cm$byClass["F1"]
  
  # Initialize empty ROC plot
  roc_plot <- NULL
  auc <- NA
  
  # ROC-AUC (if probabilities are provided)
  if (!is.null(predicted_prob)) {
    library(pROC)
    roc_obj <- roc(actual, predicted_prob)
    auc <- auc(roc_obj)
    
    # Create ROC Curve plot
    roc_plot <- ggroc(roc_obj) +
      ggtitle(paste("ROC Curve -", model_name)) +
      theme_minimal()
    
    cat("AUC for", model_name, ":", auc, "\n")
  }
  
  # Visualize Confusion Matrix
  cm_table <- as.data.frame(cm$table)
  colnames(cm_table) <- c("Reference", "Prediction", "Freq")
  library(ggplot2)
  cm_plot <- ggplot(cm_table, aes(Prediction, Reference, fill = Freq)) +
    geom_tile(color = "white") +
    scale_fill_gradient(low = "white", high = "orange") +
    geom_text(aes(label = Freq), vjust = 1.5) +
    labs(
      title = paste("Confusion Matrix -", model_name),
      x = "Predicted Class",
      y = "Actual Class"
    ) +
    theme_minimal()
  
  # Arrange plots side by side if both are available
  if (!is.null(roc_plot)) {
    library(gridExtra)
    grid.arrange(cm_plot, roc_plot, nrow = 2)
  } else {
    print(cm_plot)
  }
  
  # Return metrics
  list(
    ConfusionMatrix = cm,
    Metrics = data.frame(
      Accuracy = accuracy,
      Precision = precision,
      Recall = recall,
      F1_Score = f1_score,
      ROC_AUC = auc
    )
  )
}

8.1 Logistic Regression (LR) Evaluation

LR_pred_prob <- predict(LR_model, newdata = test_Telco, type = "response")
lr_results <- evaluate_model(test_Telco$Churn, NULL, LR_pred_prob, model_name = "Logistic Regression")

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## AUC for Logistic Regression : 0.9976758

cat("\nLogistic Regression Metrics:\n")

## 
## Logistic Regression Metrics:

print(lr_results$Metrics)

##           Accuracy Precision    Recall  F1_Score   ROC_AUC
## Accuracy 0.9725379  0.994702 0.9677835 0.9810581 0.9976758

Confusion Matrix:
Class 0: No churn,
Class 1: Churn

True Negatives (Bottom-left, 751): The model correctly predicted 751 instances as “No churn”.

True Positives (Top-right, 276): The model correctly predicted 276 instances as “Churn”.

False Negatives (Top-left, 25): The model incorrectly predicted 25 instances as “No churn” when the actual label was “Churn”.

False Positives (Bottom-right, 4): The model incorrectly predicted 4 instances as “Churn” when the actual label was “No churn”.

Thus, Logistic Regression model performs well in identifying “No churn” cases and “Churn” cases. However, False Negatives should be minimized to ensure fewer churners are missed.

The ROC curve is very close to the top-left corner, indicating that the Logistic Regression model has excellent discriminative ability. AUC (0.997) indicates outstanding discriminative power, nearly perfect for distinguishing between churners and non-churners.

8.2 Decision Tree (DT) Evaluation

# Get predicted probabilities instead of class predictions
dt_pred_prob <- predict(DT_model, newdata = test_Telco, type = "prob")[, 2] # Probability for the positive class (e.g., '1')

# Evaluate the model
dt_results <- evaluate_model(test_Telco$Churn, NULL, dt_pred_prob, model_name = "Decision Tree")

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## AUC for Decision Tree : 0.9553571

# Print metrics
cat("\nDecision Tree Metrics:\n")

## 
## Decision Tree Metrics:

print(dt_results$Metrics)

##           Accuracy Precision Recall  F1_Score   ROC_AUC
## Accuracy 0.9763258  0.968789      1 0.9841471 0.9553571

Confusion Matrix:
Class 0: No churn,
Class 1: Churn

True Negatives (Bottom-left, 776): The model correctly predicted 776 instances as “No churn”.

True Positives (Top-right, 255): The model correctly predicted 255 instances as “Churn”.

False Negative (Top-left, 0): The model did not incorrectly predicted any instances as “No churn” when they were actually “Churn”.

False Positives (Bottom-right, 25): 25 actual “No churn” cases were misclassified as “Churn”.

Thus, DT model has high accuracy, as the majority of predictions are correct. Decision Tree model performs well in predicting churners. However, the presence of false positive indicates room for further refinement.

High value of AUC (0.955) and relatively steep rise in the ROC curve suggest that the Decision Tree model has good discriminative ability. An AUC closer to 1.0 reflects strong performance in distinguishing between churners (1) and non-churners (0). While the AUC is not perfect, it indicates that the model is effective at classification. The curve moves reasonably close to the top-left corner, reflecting high sensitivity (true positive rate) at low false positive rates.

8.3 XGBoost (XGB) Evaluation

xgb_pred_prob <- predict(final_model, dtest)
xgb_results <- evaluate_model(test_y, NULL, xgb_pred_prob, model_name = "XGBoost")

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## AUC for XGBoost : 0.9991505

cat("\nXGBoost Metrics:\n")

## 
## XGBoost Metrics:

print(xgb_results$Metrics)

##           Accuracy Precision   Recall F1_Score   ROC_AUC
## Accuracy 0.9763314 0.9967585 0.971564    0.984 0.9991505

Confusion Matrix:
Class 0: No churn,
Class 1: Churn

True Negatives (Bottom-left, 615): The model correctly predicted 615 instances as “No churn”.

True Positives (Top-right, 210): The model correctly predicted 210 instances as “Churn”.

False Negatives (Top-left, 18): The model incorrectly predicted 18 instances as “No churn” when the actual label was “Churn”.

False Positives (Bottom-right, 2): The model incorrectly predicted 2 instances as “Churn” when the actual label was “No Churn”.

The model performs well and produces few false positives, ensuring effective resource utilization. However, as churners are essential for strategies for retention and revenue protection, it is important to address the False Negatives.

The ROC curve is nearly touching the top-left corner, signifying excellent model performance. The curve is far above the diagonal line, confirming that the XGBoost model performs significantly better than random guessing. The AUC (0.999) indicates exceptionally strong discriminative power, almost ideal for distinguishing between churners and non-churners across all thresholds.

Summary Table

# Convert AUC values to numeric in each model's results
lr_results$Metrics$ROC_AUC <- as.numeric(lr_results$Metrics$ROC_AUC)
dt_results$Metrics$ROC_AUC <- as.numeric(dt_results$Metrics$ROC_AUC)
xgb_results$Metrics$ROC_AUC <- as.numeric(xgb_results$Metrics$ROC_AUC)

# Combine the metrics into a summary table
library(dplyr)
summary_table <- bind_rows(
  cbind(Model = "Logistic Regression", lr_results$Metrics),
  cbind(Model = "Decision Tree", dt_results$Metrics),
  cbind(Model = "XGBoost", xgb_results$Metrics)
)
# Print the summary table
print(summary_table)

##                            Model  Accuracy Precision    Recall  F1_Score
## Accuracy...1 Logistic Regression 0.9725379 0.9947020 0.9677835 0.9810581
## Accuracy...2       Decision Tree 0.9763258 0.9687890 1.0000000 0.9841471
## Accuracy...3             XGBoost 0.9763314 0.9967585 0.9715640 0.9840000
##                ROC_AUC
## Accuracy...1 0.9976758
## Accuracy...2 0.9553571
## Accuracy...3 0.9991505

#round off to 4 decimal places
summary_table <- summary_table %>%
  mutate_if(is.numeric, ~ round(., 4))

cat("\nSummary Table:\n")

## 
## Summary Table:

print(summary_table)

##                            Model Accuracy Precision Recall F1_Score ROC_AUC
## Accuracy...1 Logistic Regression   0.9725    0.9947 0.9678   0.9811  0.9977
## Accuracy...2       Decision Tree   0.9763    0.9688 1.0000   0.9841  0.9554
## Accuracy...3             XGBoost   0.9763    0.9968 0.9716   0.9840  0.9992

In term of Accuracy, all three models are highly effective in classifying the data. They perform similarly with Logistic Regression slightly lower at 97.25%, and Decision Tree and XGBoost achieving 97.63%.

In term of Precision, XGBoost achieves the highest precision (99.68%), followed by Logistic Regression (99.47%) and Decision Tree (96.88%).

In term of Recall (Sensitivity), Decision Tree achieves the highest recall (100%), while Logistic Regression and XGBoost have slightly lower recall at 96.78% and 97.16% respectively.

In term of F1-Score, the performance of XGBoost (98.40%) and Decision Tree (98.41%) is very close, while logistic regression (98.11%) lags behind by a small margin.

In term of ROC-AUC, XGBoost achieves the highest ROC-AUC of 99.92%, indicating near-perfect classification performance, followed by Logistic Regression at 99.77%, and Decision Tree at 95.54%.

In overall, XGBoost is the strongest performer, excelling in accuracy, precision, and ROC-AUC.

9.0 Conclusion

Among all, in general, XGBoost has proved to be the most reliable and efficient model a choice can fall upon, which gives the highest accuracy 97.63%, precision 99.68%, and ROC_AUC 0.9992, for which it was suitable whenever balanced and robust performance metric was required; Decision Tree, on one side has perfect recall: 100%, reflecting hence its very strong point by identifying all relevant instances-a requirement for some applications where fraud needs to be caught. However, its precision is a bit lower, 96.88%, as compared to XGBoost, and the lack of a valid ROC_AUC metric restricts its overall comparison. The Logistic Regression model is simpler and has the lowest accuracy 97.25%, recall 96.78%, and F1-Score 98.11%, among the three models, though it still does well with a strong ROC_AUC of 0.9977. In summary, the overall recommendation goes to XGBoost for its balanced high performance, and a Decision Tree for those recall-critical applications. Where computational simplicity or interpretability is a higher priority, logistic regression could be considered.

WQD7004 OCC3: Group 3 - Classification Model for Telco Clients at Risk of Subscription Cancellation

Group 3 Members

1.0 Introduction

2.0 Objectives

3.0 Data Introduction

4.0 Data Cleansing

5.0 Exploratory Data Analysis

5.1 Univariate Analysis

5.2 Bivariate Analysis

6.0 Data Preprocessing

6.1 Preprocessing for Logistic Regression and Decision Tree

6.2 Preprocessing for XGBoost

7.0 Modeling

7.1 Logistic regression

7.2. Decision Tree

7.3 XGBoost

8.0 Evaluation Measures

8.1 Logistic Regression (LR) Evaluation

8.2 Decision Tree (DT) Evaluation

8.3 XGBoost (XGB) Evaluation

Summary Table

9.0 Conclusion