| Student ID | Name |
|---|---|
| 22108694 | Panji Winata |
| 23101520 | Chan Yung Her |
| 23093746 | Joyce Chua Xin Jie |
| 17207213 | Kan Sok Wah |
| 23108481 | Wong Mei Kait |
Telecommunications technology has changed drastically. Long-distance
communication used to require days or weeks-long letters. Technology
allows us to interact instantaneously via mobile phones and the
internet. The digital age has made communication fast and efficient,
enabling people to interact regardless of time or place. As technology
advances, telecommunications devices are utilized for entertainment as
well as communication. Netflix offers movies and TV shows, while TikTok
and Instagram as social media allow self-expression. Telecommunications
have become a hub of digital activity, facilitating communication,
entertainment, and leisure. Telecom companies compete to provide
features and perks to attract and keep consumers. Showcase a fast
network, high data quota, and affordable service bundles. Customers have
the choice to select the best service despite this severe competition.
They will terminate their memberships and transfer providers if the
service is poor. Customer turnover may hurt a provider’s company. The
company’s income might plummet when clients quit, but satellite
management and network infrastructure expenditures remain. Thus,
suppliers must satisfy customers and comprehend subscription
cancellations. Manual data analysis has several downsides. For
complicated and massive datasets, this method is time-consuming and need
many people. Because of human mistake like neglect or subjective
prejudice, manual analysis is typically unreliable. In data-based
decision making, these restrictions make manual methods less efficient
and complex. Data science provides quicker, more accurate, and more
efficient analytical techniques. Regression and classification models as
part of data science process are useful in many fields. Data science
process use complex algorithms and approaches to spot trends,
anticipate, and give deep insights for management to make data-driven
decisions.
As background stated, therefore below are identified case study
problem questions:
1. What is the factors that could lead customer
to cancel their subscription?
2. Is there any specific pattern of
customer which cancel their subscription?
3. How accurate is the
data science model implemented to this case study?
To answer the problem questions, objective of this case study stated
as follows:
1. To find the relationship between variables that
affect customer subscription cancellation using a regression approach
2. To find patterns of customers who make subscription cancellation
based on parameters defined in the dataset using classification
method
3. To obtain the accuracy of the model results on clients at
risk of subscription cancellation
Before we introduce contain of dataset, csv file of dataset loaded
into the R environment using command below:
df_telco_intro=read.csv("telco_customer_churn.csv")
Furthermore we examine the column and return 6 row value of each
columns using head command.
head(df_telco_intro)
## Age Avg.Monthly.GB.Download Avg.Monthly.Long.Distance.Charges Churn.Category
## 1 72 4 19.44
## 2 27 59 45.62
## 3 59 0 16.07
## 4 25 27 0.00
## 5 31 21 17.22 Dissatisfaction
## 6 32 30 11.94 Competitor
## Churn.Reason Churn.Score City CLTV Contract
## 1 51 San Mateo 4849 Two Year
## 2 27 Sutter Creek 3715 Month-to-Month
## 3 59 Santa Cruz 5092 Month-to-Month
## 4 49 Brea 2068 One Year
## 5 Network reliability 88 San Jose 4026 One Year
## 6 Competitor had better devices 81 Stockton 5351 Month-to-Month
## Country Customer.ID Customer.Status Dependents Device.Protection.Plan
## 1 United States 4526-ZJJTM Stayed 0 1
## 2 United States 5302-BDJNT Stayed 0 1
## 3 United States 5468-BPMMO Stayed 0 0
## 4 United States 2212-LYASK Stayed 0 1
## 5 United States 0378-XSZPU Churned 0 1
## 6 United States 6023-YEBUP Churned 0 1
## Gender Internet.Service Internet.Type Lat.Long Latitude
## 1 Female 1 Fiber Optic 37.538309, -122.305109 37.53831
## 2 Male 1 Fiber Optic 38.432145, -120.77069 38.43214
## 3 Male 0 37.007882, -122.065975 37.00788
## 4 Male 1 DSL 33.924143, -117.79387 33.92414
## 5 Male 1 Cable 37.311088, -121.961786 37.31109
## 6 Male 1 Cable 38.033219, -121.297433 38.03322
## Longitude Married Monthly.Charge Multiple.Lines Number.of.Dependents
## 1 -122.3051 1 88.40 0 0
## 2 -120.7707 0 95.50 1 0
## 3 -122.0660 1 19.60 0 0
## 4 -117.7939 1 45.85 0 0
## 5 -121.9618 1 60.30 0 0
## 6 -121.2974 0 100.95 1 0
## Number.of.Referrals Offer Online.Backup Online.Security Paperless.Billing
## 1 1 1 0 0
## 2 0 0 0 0
## 3 3 0 0 0
## 4 3 1 0 1
## 5 1 Offer B 1 1 0
## 6 0 Offer E 0 0 1
## Partner Payment.Method Phone.Service Population Premium.Tech.Support Quarter
## 1 1 Bank Withdrawal 1 37926 0 Q3
## 2 0 Bank Withdrawal 1 4610 1 Q3
## 3 1 Bank Withdrawal 1 4563 0 Q3
## 4 1 Credit Card 0 1408 0 Q3
## 5 1 Credit Card 1 29914 0 Q3
## 6 0 Bank Withdrawal 1 40611 0 Q3
## Referred.a.Friend Satisfaction.Score Senior.Citizen State
## 1 1 3 1 California
## 2 0 3 0 California
## 3 1 5 0 California
## 4 1 4 0 California
## 5 1 2 0 California
## 6 0 1 0 California
## Streaming.Movies Streaming.Music Streaming.TV Tenure.in.Months Total.Charges
## 1 1 1 0 25 2191.15
## 2 0 0 1 35 3418.20
## 3 0 0 0 46 851.20
## 4 1 1 0 27 1246.40
## 5 0 0 0 58 3563.80
## 6 1 1 1 3 329.95
## Total.Extra.Data.Charges Total.Long.Distance.Charges Total.Refunds
## 1 0 486.00 0
## 2 0 1596.70 0
## 3 0 739.22 0
## 4 30 0.00 0
## 5 0 998.76 0
## 6 10 35.82 0
## Total.Revenue Under.30 Unlimited.Data Zip.Code Churn
## 1 2677.15 0 1 94403 0
## 2 5014.90 1 1 95685 0
## 3 1590.42 0 0 95064 0
## 4 1276.40 1 0 92823 0
## 5 4562.56 0 1 95117 1
## 6 375.77 0 0 95210 1
Below are the columns list and explanation of each column
content:
1. Age: Age of telecomunication customer.
2. Avg
Monthly GB Download: Average in GigaByte of data downloaded by
customer.
3. Avg Monthly Long Distance Charges: Average monthly long
distance charges that customer has to pay.
4. Churn Category:
Category of reasons for canceling subscription.
5. Churn Reason:
Specific explanation for canceling subscription.
6. Churn Score: A
value that describes the possibility of a customer canceling a
subscription
7. City: The city where the customer lives
8. CLTV
(Customer Lifetime Value): Projection of the total value of the
customer’s contribution during the subscription
9. Contract: Type of
customer contract duration
10. Country: City where the customer
lives
11. Customer ID: Unique ID for each customer
12. Customer
Status: Current customer status, whether still subscribed, just joined
or has canceled the subscription
13. Dependents: Does the customer
have dependents
14. Device Protection Plan: Does the customer
subscribe to gadget damage protection insurance
15. Gender: Customer
gender
16. Internet Service: Does the customer subscribe to a home
internet service
17. Internet Type: Type of home internet service
subscribed to by the customer
18. Lat Long: Latitude and Longitude
of the customer’s residence
19. Latitude: Latitude of the customer’s
residence
20. Longitude: Longitude of the customer’s residence
21. Married: Is the customer married
22. Monthly Charge: Monthly fee
that must be paid by the customer
23. Multiple Lines: Does the
customer have more than 1 telephone line
24. Number of Dependents:
Number of customer dependents
25. Number of Referrals: Number of new
customers generated from customer referral code
26. Offer: Category
of offer received by customer
27. Online Backup: Does customer
subscribe to online data backup service
28. Online Security: Does
customer subscribe to online security service
29. Paperless Billing:
Does customer utilize paperless billing system
30. Partner: Does
customer have a partner
31. Payment Method: Payment method used by
customer
32. Phone Service: Does customer subscribe to phone
service
33. Population: Population in customer’s area of
residence
34. Premium Tech Support: Does customer subscribe to
premium support service
35. Quarter: Quarter of period data service
is taken by customer
36. Referred a Friend: Did customer recommend
the product to a friend
37. Satisfaction Score: Customer
satisfaction score
38. Senior Citizen: Does customer fall into
senior citizen category
39. State: State where customer resides
40. Streaming Movies: Does customer use streaming service
41.
Streaming Music: Does customer use streaming music service
42.
Streaming TV: Does customer use streaming TV service
43. Tenure in
Months: How many months customer has subscribed
44. Total Charges:
Total amount paid by customer
45. Total Extra Data Charges: Total
amount paid by customer for additional data charges
46. Total Long
Distance Charges: Total long distance usage charges paid by the
customer
47. Total Refunds: Total refunds given to the customer
48. Total Revenue: Total revenue generated from the customer
49.
Under 30: Is the customer under 30 years old
50. Unlimited Data:
Does the customer subscribe to an unlimited data plan
51. Zip Code:
The postal code where the customer lives
52. Churn: Did the customer
cancel the subscription
telco_customer_churn <- read.csv("telco_customer_churn.csv", stringsAsFactors = TRUE)
# Count empty strings
empty_counts <- colSums(telco_customer_churn == "")
# Count NA values
na_counts <- colSums(is.na(telco_customer_churn))
# Combine results for clarity
data.frame(
Column = names(telco_customer_churn),
EmptyStrings = empty_counts,
NAValues = na_counts
)
## Column
## Age Age
## Avg.Monthly.GB.Download Avg.Monthly.GB.Download
## Avg.Monthly.Long.Distance.Charges Avg.Monthly.Long.Distance.Charges
## Churn.Category Churn.Category
## Churn.Reason Churn.Reason
## Churn.Score Churn.Score
## City City
## CLTV CLTV
## Contract Contract
## Country Country
## Customer.ID Customer.ID
## Customer.Status Customer.Status
## Dependents Dependents
## Device.Protection.Plan Device.Protection.Plan
## Gender Gender
## Internet.Service Internet.Service
## Internet.Type Internet.Type
## Lat.Long Lat.Long
## Latitude Latitude
## Longitude Longitude
## Married Married
## Monthly.Charge Monthly.Charge
## Multiple.Lines Multiple.Lines
## Number.of.Dependents Number.of.Dependents
## Number.of.Referrals Number.of.Referrals
## Offer Offer
## Online.Backup Online.Backup
## Online.Security Online.Security
## Paperless.Billing Paperless.Billing
## Partner Partner
## Payment.Method Payment.Method
## Phone.Service Phone.Service
## Population Population
## Premium.Tech.Support Premium.Tech.Support
## Quarter Quarter
## Referred.a.Friend Referred.a.Friend
## Satisfaction.Score Satisfaction.Score
## Senior.Citizen Senior.Citizen
## State State
## Streaming.Movies Streaming.Movies
## Streaming.Music Streaming.Music
## Streaming.TV Streaming.TV
## Tenure.in.Months Tenure.in.Months
## Total.Charges Total.Charges
## Total.Extra.Data.Charges Total.Extra.Data.Charges
## Total.Long.Distance.Charges Total.Long.Distance.Charges
## Total.Refunds Total.Refunds
## Total.Revenue Total.Revenue
## Under.30 Under.30
## Unlimited.Data Unlimited.Data
## Zip.Code Zip.Code
## Churn Churn
## EmptyStrings NAValues
## Age 0 0
## Avg.Monthly.GB.Download 0 0
## Avg.Monthly.Long.Distance.Charges 0 0
## Churn.Category 3104 0
## Churn.Reason 3104 0
## Churn.Score 0 0
## City 0 0
## CLTV 0 0
## Contract 0 0
## Country 0 0
## Customer.ID 0 0
## Customer.Status 0 0
## Dependents 0 0
## Device.Protection.Plan 0 0
## Gender 0 0
## Internet.Service 0 0
## Internet.Type 886 0
## Lat.Long 0 0
## Latitude 0 0
## Longitude 0 0
## Married 0 0
## Monthly.Charge 0 0
## Multiple.Lines 0 0
## Number.of.Dependents 0 0
## Number.of.Referrals 0 0
## Offer 2324 0
## Online.Backup 0 0
## Online.Security 0 0
## Paperless.Billing 0 0
## Partner 0 0
## Payment.Method 0 0
## Phone.Service 0 0
## Population 0 0
## Premium.Tech.Support 0 0
## Quarter 0 0
## Referred.a.Friend 0 0
## Satisfaction.Score 0 0
## Senior.Citizen 0 0
## State 0 0
## Streaming.Movies 0 0
## Streaming.Music 0 0
## Streaming.TV 0 0
## Tenure.in.Months 0 0
## Total.Charges 0 0
## Total.Extra.Data.Charges 0 0
## Total.Long.Distance.Charges 0 0
## Total.Refunds 0 0
## Total.Revenue 0 0
## Under.30 0 0
## Unlimited.Data 0 0
## Zip.Code 0 0
## Churn 0 0
# Replace empty strings with NA
telco_customer_churn[telco_customer_churn == ""] <- NA
# Calculate the percentage of missing values for each column
missing_percentage <- colSums(is.na(telco_customer_churn)) / nrow(telco_customer_churn) * 100
# Filter columns with missing values
missing_percentage <- missing_percentage[missing_percentage > 0]
# Display the columns with missing values and their percentages
for (col_name in names(missing_percentage)) {
cat(col_name, " ", format(missing_percentage[col_name], digits = 2), "\n")
}
## Churn.Category 73
## Churn.Reason 73
## Internet.Type 21
## Offer 55
# Remove columns with missing values >50%
telco_customer_churn <- telco_customer_churn[, colSums(is.na(telco_customer_churn)) / nrow(telco_customer_churn) < 0.5]
# Drop irrelevant columns
telco_customer_churn <- telco_customer_churn %>% select(-Customer.ID, -Lat.Long, -Zip.Code,-Latitude,-Longitude)
# Check if Internet.Type is NA when Internet.Service is "0"
no_internet <- is.na(telco_customer_churn$Internet.Type) & telco_customer_churn$Internet.Service == "0"
sum(no_internet, na.rm = TRUE)
## [1] 886
# Replace NA in Internet.Type with "No Internet" where Internet.Service == "0"
telco_customer_churn$Internet.Type <- as.character(telco_customer_churn$Internet.Type)
telco_customer_churn$Internet.Type[is.na(telco_customer_churn$Internet.Type) & telco_customer_churn$Internet.Service == "0"] <- "No Internet"
sum(is.na(telco_customer_churn$Internet.Type))
## [1] 0
# Binary encoding (Male = 1, Female = 0)
telco_customer_churn$Gender <- ifelse(telco_customer_churn$Gender == "Male", 1, 0)
# One-hot encoding using model.matrix
telco_encoded1 <- model.matrix(~ Internet.Type - 1, data = telco_customer_churn)
telco_encoded2 <- model.matrix(~ Contract - 1, data = telco_customer_churn)
telco_encoded3 <- model.matrix(~ Payment.Method - 1, data = telco_customer_churn)
# Combine the encoded data with the original dataset
telco_customer_churn <- cbind(telco_customer_churn, telco_encoded1, telco_encoded2, telco_encoded3)
# Remove the original categorical columns
telco_customer_churn <- subset(telco_customer_churn, select = -c(Internet.Type, Contract, Payment.Method))
# Save the cleaned dataset
write.csv(telco_customer_churn, file = "telco_customer_churn_clean.csv")
To handle the missing data, we first count empty strings and NA values for each column. Then, the empty strings are substituted with NA values to ensure consistency. Irrelevant columns or columns with more than 50% missing data are removed. Next, we conduct a data validation check to ensure that all missing values in the Internet Type column are consistent with customers who have “No Internet” in the Internet Service column. This is necessary to confirm that missing values in the Internet Type field are due to customers having no internet service. This validation ensures that the missing data is not a result of human error or other factors and that the dataset is reliable for analysis. Furthermore, binary encoding is done for the gender column, ‘1’ indicating male, ‘0’ indicating female. Lastly, the One-Hot Encoding method is applied to the categorical columns such as Internet Type, Contract, and Payment Method, generating new columns for each category in each column. The original columns were then removed.
# Summary Statistics
summary(telco_customer_churn)
## Age Avg.Monthly.GB.Download Avg.Monthly.Long.Distance.Charges
## Min. :19.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:32.00 1st Qu.: 4.00 1st Qu.: 9.05
## Median :46.00 Median :17.00 Median :22.57
## Mean :46.45 Mean :20.74 Mean :22.77
## 3rd Qu.:60.00 3rd Qu.:27.00 3rd Qu.:36.17
## Max. :80.00 Max. :85.00 Max. :49.99
##
## Churn.Score City CLTV Country
## Min. : 5.00 Los Angeles : 175 Min. :2003 United States:4225
## 1st Qu.:40.00 San Diego : 168 1st Qu.:3493
## Median :61.00 San Francisco: 61 Median :4531
## Mean :58.28 San Jose : 61 Mean :4410
## 3rd Qu.:75.00 Sacramento : 60 3rd Qu.:5381
## Max. :96.00 Long Beach : 42 Max. :6500
## (Other) :3658
## Customer.Status Dependents Device.Protection.Plan Gender
## Churned:1121 Min. :0.0000 Min. :0.000 Min. :0.0000
## Joined : 272 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Stayed :2832 Median :0.0000 Median :0.000 Median :1.0000
## Mean :0.2331 Mean :0.347 Mean :0.5027
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000
##
## Internet.Service Married Monthly.Charge Multiple.Lines
## Min. :0.0000 Min. :0.0000 Min. : 18.25 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 38.55 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median : 70.20 Median :0.0000
## Mean :0.7903 Mean :0.4833 Mean : 64.91 Mean :0.4161
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 89.75 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :118.75 Max. :1.0000
##
## Number.of.Dependents Number.of.Referrals Online.Backup Online.Security
## Min. :0.0000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median : 0.000 Median :0.0000 Median :0.0000
## Mean :0.4727 Mean : 1.997 Mean :0.3505 Mean :0.2925
## 3rd Qu.:0.0000 3rd Qu.: 3.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :8.0000 Max. :11.000 Max. :1.0000 Max. :1.0000
##
## Paperless.Billing Partner Phone.Service Population
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 11
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.: 2346
## Median :1.0000 Median :0.0000 Median :1.0000 Median : 17554
## Mean :0.5924 Mean :0.4833 Mean :0.8994 Mean : 22009
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 35737
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :105285
##
## Premium.Tech.Support Quarter Referred.a.Friend Satisfaction.Score
## Min. :0.0000 Q3:4225 Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
## Median :0.0000 Median :0.0000 Median :3.000
## Mean :0.2966 Mean :0.4599 Mean :3.242
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :5.000
##
## Senior.Citizen State Streaming.Movies Streaming.Music
## Min. :0.0000 California:4225 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.1647 Mean :0.383 Mean :0.3496
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000
##
## Streaming.TV Tenure.in.Months Total.Charges Total.Extra.Data.Charges
## Min. :0.0000 Min. : 1.00 Min. : 18.8 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 401.5 1st Qu.: 0.000
## Median :0.0000 Median :30.00 Median :1424.6 Median : 0.000
## Mean :0.3837 Mean :32.68 Mean :2306.1 Mean : 6.966
## 3rd Qu.:1.0000 3rd Qu.:56.00 3rd Qu.:3846.8 3rd Qu.: 0.000
## Max. :1.0000 Max. :72.00 Max. :8672.5 Max. :150.000
##
## Total.Long.Distance.Charges Total.Refunds Total.Revenue Under.30
## Min. : 0.00 Min. : 0.000 Min. : 21.36 Min. :0.0
## 1st Qu.: 67.68 1st Qu.: 0.000 1st Qu.: 592.75 1st Qu.:0.0
## Median : 396.64 Median : 0.000 Median : 2151.47 Median :0.0
## Mean : 754.72 Mean : 1.954 Mean : 3065.81 Mean :0.2
## 3rd Qu.:1200.00 3rd Qu.: 0.000 3rd Qu.: 4845.75 3rd Qu.:0.0
## Max. :3564.00 Max. :49.790 Max. :11979.34 Max. :1.0
##
## Unlimited.Data Churn Internet.TypeCable Internet.TypeDSL
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.6779 Mean :0.2653 Mean :0.1188 Mean :0.2386
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Internet.TypeFiber Optic Internet.TypeNo Internet ContractMonth-to-Month
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.4329 Mean :0.2097 Mean :0.5191
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## ContractOne Year ContractTwo Year Payment.MethodBank Withdrawal
## Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.000 Median :1.0000
## Mean :0.214 Mean :0.267 Mean :0.5503
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.000 Max. :1.0000
##
## Payment.MethodCredit Card Payment.MethodMailed Check
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.3976 Mean :0.05207
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
##
# Histogram for monthly charge
Monthly_Charge_Plot <- ggplot(telco_customer_churn, aes(x = Monthly.Charge)) +
geom_histogram(bins = 30, fill = "purple", color = "black") +
theme_minimal() + labs(title="Distribution of Monthly Charges")
# Histogram for age
Age_Plot <- ggplot(telco_customer_churn, aes(x = Age)) +
geom_histogram(bins = 30, fill = "purple", color = "black") +
theme_minimal() + labs(title="Distribution of Age")
# Histogram for CLTV
CLTV_Plot <- ggplot(telco_customer_churn, aes(x = CLTV)) +
geom_histogram(bins = 30, fill = "purple", color = "black") +
theme_minimal() + labs(title="Distribution of CLTV")
# Histogram for tenure
Tenure_Plot <- ggplot(telco_customer_churn, aes(x = Tenure.in.Months)) +
geom_histogram(bins = 30, fill = "purple", color = "black") +
theme_minimal() + labs(title="Distribution of Tenure in Months")
# Combine the four plots into one
grid.arrange(Monthly_Charge_Plot, Age_Plot, CLTV_Plot, Tenure_Plot, nrow = 1)
We first summarized the dataset using summary statistics for each column. Then, we have created some visualizations to examine the key variables. Histograms are generated to show the distributions for numerical columns, while bar charts are plotted for the categorical columns. Additionally, a churn distribution is visualized to picture the proportion of the customers who churned and do not churn.
# Bar chart for contract
contract_counts <- data.frame(
Contract_Type = c("Month-to-Month", "One Year", "Two Year"),
Count = c(sum(telco_customer_churn$`ContractMonth-to-Month` == 1),
sum(telco_customer_churn$`ContractOne Year` == 1),
sum(telco_customer_churn$`ContractTwo Year` == 1))
)
contract_plot <- ggplot(contract_counts, aes(x = Contract_Type, y = Count)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "Count of Contract Types", x = "Contract Type", y = "Count")
# Bar chart for churn
ggplot(telco_customer_churn, aes(Churn)) + geom_bar(fill = "pink", color = "black") + theme_minimal() +
labs(title = "Churn Count", x = "Churn", y = "Count")
In bivariate analysis, we explore the relationships between various factors and churn using visualization and statistical tests. The chi-square test is used for categorical variables, while the T-test is used for quantitative variables.
# Contract VS Churn
contract_churn_table <- table(
Churn = telco_customer_churn$Churn,
Contract = factor(
apply(telco_customer_churn[, c("ContractMonth-to-Month", "ContractOne Year", "ContractTwo Year")], 1, function(x) {
if (x[1] == 1) return("Month-to-Month")
if (x[2] == 1) return("One Year")
if (x[3] == 1) return("Two Year")
}),
levels = c("Month-to-Month", "One Year", "Two Year")
)
)
contract_churn_df <- as.data.frame(contract_churn_table)
colnames(contract_churn_df) <- c("Churn", "Contract", "Count")
# To create a grouped bar chart
ggplot(contract_churn_df, aes(x = Contract, y = Count, fill = as.factor(Churn))) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Churn vs. Contract Type",
x = "Contract Type",
y = "Count",
fill = "Churn") +
scale_fill_manual(values = c("skyblue", "orange"), labels = c("No Churn", "Churn"))
# Chi-square test
chi_square_result <- chisq.test(contract_churn_table)
chi_square_result
##
## Pearson's Chi-squared test
##
## data: contract_churn_table
## X-squared = 856.16, df = 2, p-value < 2.2e-16
## Internet Type VS Churn
# Create a contingency table
internet_churn_table <- table(
Churn = telco_customer_churn$Churn,
Internet_Type = factor(
apply(telco_customer_churn[, c("Internet.TypeCable", "Internet.TypeDSL", "Internet.TypeFiber Optic", "Internet.TypeNo Internet")], 1, function(x) {
if (x[1] == 1) return("Cable")
if (x[2] == 1) return("DSL")
if (x[3] == 1) return("Fiber Optic")
if (x[4] == 1) return("No Internet")
}),
levels = c("Cable", "DSL", "Fiber Optic", "No Internet")
)
)
internet_churn_df <- as.data.frame(internet_churn_table)
colnames(internet_churn_df) <- c("Churn", "Internet_Type", "Count")
# To create a grouped bar chart
ggplot(internet_churn_df, aes(x = Internet_Type, y = Count, fill = as.factor(Churn))) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Churn vs. Internet Type",
x = "Internet Type",
y = "Count",
fill = "Churn") +
scale_fill_manual(values = c("skyblue", "orange"), labels = c("No Churn", "Churn"))
# Chi-square test
chi_square_internet_result <- chisq.test(internet_churn_table)
chi_square_internet_result
##
## Pearson's Chi-squared test
##
## data: internet_churn_table
## X-squared = 365.6, df = 3, p-value < 2.2e-16
## Monthly Charges VS Churn
telco_customer_churn$Churn <- factor(telco_customer_churn$Churn, labels = c("No Churn", "Churn"))
# To create a boxplot
ggplot(telco_customer_churn, aes(x = Churn, y = Monthly.Charge, fill = Churn)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Comparison of Monthly Charge Across Churn",
x = "Churn",
y = "Monthly Charge") +
scale_fill_manual(values = c("skyblue", "orange"))
# T-test
t_test_result <- t.test(Monthly.Charge ~ Churn, data = telco_customer_churn)
t_test_result
##
## Welch Two Sample t-test
##
## data: Monthly.Charge by Churn
## t = -13.27, df = 2480.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No Churn and group Churn is not equal to 0
## 95 percent confidence interval:
## -14.03326 -10.41992
## sample estimates:
## mean in group No Churn mean in group Churn
## 61.66244 73.88903
## Tenure VS Churn
# To create a boxplot
ggplot(telco_customer_churn, aes(x = Churn, y = Tenure.in.Months, fill = Churn)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Comparison of Tenure Across Churn",
x = "Churn",
y = "Tenure (in Months)") +
scale_fill_manual(values = c("skyblue", "orange"))
# T-test
t_test_tenure_result <- t.test(Tenure.in.Months ~ Churn, data = telco_customer_churn)
t_test_tenure_result
##
## Welch Two Sample t-test
##
## data: Tenure.in.Months by Churn
## t = 27.634, df = 2457.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No Churn and group Churn is not equal to 0
## 95 percent confidence interval:
## 18.54804 21.38153
## sample estimates:
## mean in group No Churn mean in group Churn
## 37.97906 18.01427
Result of chi-square test : X-squared = 856.16, df = 2, p-value < 2.2e-16 The p-value < 0.05, indicates a statistically significant relationship between contract type and churn.
Result of chi-square test : 365.6, df = 3, p-value < 2.2e-16 The p-value < 0.05, indicates a statistically significant relationship between internet type and churn.
Result of T-test : t = -13.27, df = 2480.7, p-value < 2.2e-16 The p-value < 0.05, thus we reject the null hypothesis. The true difference in means between group No Churn and group Churn is not equal to 0.
Result of T-test : t = 27.634, df = 2457.8, p-value < 2.2e-16 The p-value < 0.05, thus we reject the null hypothesis. The true difference in means between group No Churn and group Churn is not equal to 0.
#Load the cleaned dataset
telco_customer_churn <- read.csv("telco_customer_churn_clean.csv")
# Set Churn as a factor
telco_customer_churn$Churn <- as.factor(telco_customer_churn$Churn)
class(telco_customer_churn$Churn)
## [1] "factor"
# Set a seed for reproducibility
set.seed(42)
# Stratified Train-Test Split
#p = 0.75 means that 75% of the data is allocated to the training set, and 25% is allocated to the testing set.
trainIndex <- createDataPartition(telco_customer_churn$Churn, p = 0.75,
list = FALSE, times = 1)
train_Telco <- telco_customer_churn[trainIndex, ]
test_Telco <- telco_customer_churn[-trainIndex, ]
# Verify proportions
cat("Proportions in Training Set:\n")
## Proportions in Training Set:
print(table(train_Telco$Churn) / nrow(train_Telco)) # Check distribution in training set
##
## 0 1
## 0.7346166 0.2653834
cat("Proportions in Testing Set:\n")
## Proportions in Testing Set:
print(table(test_Telco$Churn) / nrow(test_Telco)) # Check distribution in testing set
##
## 0 1
## 0.7348485 0.2651515
train_Telco: Contains 75% of the data (used for model training). test_Telco: Contains the remaining 25% of the data (used for model evaluation).
# Define features and target
features <- c(
"Churn.Score", "Internet.TypeFiber.Optic", "ContractMonth.to.Month", "Dependents",
"Number.of.Referrals", "Satisfaction.Score", "Tenure.in.Months", "ContractTwo.Year"
)
target <- "Churn"
# Subset data with selected features and target
df_subset <- telco_customer_churn[, c(target, features)]
# Train-Test Split
set.seed(42)
total_rows <- nrow(df_subset)
train_idx <- sample(total_rows, total_rows * 0.8)
test_idx <- setdiff(1:total_rows, train_idx)
train_df <- df_subset[train_idx, ]
test_df <- df_subset[test_idx, ]
# One-hot encode categorical variables and convert to matrix
train_x <- model.matrix(~ . - 1, data = train_df[, -1]) # Exclude target column
test_x <- model.matrix(~ . - 1, data = test_df[, -1])
# Convert target variable to numeric (0 and 1)
train_y <- as.numeric(as.character(train_df[, 1]))
test_y <- as.numeric(as.character(test_df[, 1]))
# Create XGBoost DMatrix
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dtest <- xgb.DMatrix(data = test_x, label = test_y)
We will use Logistic regression, Decision Tree, and XGBoosts to predict the churn.
As the bar char shown in EDA,the number of “1” (Churn) is three times higher than “0” (No Churn), so class weights is used to increase importance to the minority class, and thus, balance the impact of each class.
# Calculate class weights
class_weights <- ifelse(train_Telco$Churn == 1,
sum(train_Telco$Churn == 0) / sum(train_Telco$Churn == 1),
1)
# Train Logistic Regression Model with Weights
LR_model <- glm(Churn ~ Churn.Score + Internet.TypeFiber.Optic+ ContractMonth.to.Month + Dependents + Number.of.Referrals + Satisfaction.Score + Tenure.in.Months + ContractTwo.Year,
data = train_Telco,
family = binomial,
weights = class_weights)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Summary of the logistic regression model
summary(LR_model)
##
## Call:
## glm(formula = Churn ~ Churn.Score + Internet.TypeFiber.Optic +
## ContractMonth.to.Month + Dependents + Number.of.Referrals +
## Satisfaction.Score + Tenure.in.Months + ContractTwo.Year,
## family = binomial, data = train_Telco, weights = class_weights)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.781e+01 1.694e+03 0.028 0.977490
## Churn.Score 1.811e-01 1.333e-02 13.586 < 2e-16 ***
## Internet.TypeFiber.Optic 8.844e-01 2.127e-01 4.158 3.22e-05 ***
## ContractMonth.to.Month 1.153e+00 3.154e-01 3.657 0.000255 ***
## Dependents -1.310e+00 3.366e-01 -3.893 9.90e-05 ***
## Number.of.Referrals -4.081e-01 7.675e-02 -5.317 1.05e-07 ***
## Satisfaction.Score -2.032e+01 5.648e+02 -0.036 0.971299
## Tenure.in.Months -2.047e-02 6.198e-03 -3.303 0.000956 ***
## ContractTwo.Year -7.844e-01 3.742e-01 -2.096 0.036053 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6454.59 on 3168 degrees of freedom
## Residual deviance: 622.97 on 3160 degrees of freedom
## AIC: 666.04
##
## Number of Fisher Scoring iterations: 21
Churn.Score, Internet.TypeFiber.Optic, ContractMonth.to.Month, Dependents, Number.of.Referrals, Tenure.in.Months, and ContractTwo.Year are all statistically significant (p < 0.05). Positive coefficients (Churn.Score, Internet.TypeFiber.Optic, ContractMonth.to.Month) increase the likelihood of churn. Negative coefficients (Dependents, Number.of.Referrals, Tenure.in.Months, ContractTwo.Year) decrease the likelihood of churn. The large reduction in deviance and low AIC indicate that the model fits the data well.
DT_model <- rpart(Churn ~ Churn.Score + Internet.TypeFiber.Optic +
ContractMonth.to.Month + Dependents + Number.of.Referrals +
Satisfaction.Score + Tenure.in.Months + ContractTwo.Year,
data = train_Telco, method = "class")
summary(DT_model)
## Call:
## rpart(formula = Churn ~ Churn.Score + Internet.TypeFiber.Optic +
## ContractMonth.to.Month + Dependents + Number.of.Referrals +
## Satisfaction.Score + Tenure.in.Months + ContractTwo.Year,
## data = train_Telco, method = "class")
## n= 3169
##
## CP nsplit rel error xerror xstd
## 1 0.7633769 0 1.0000000 1.0000000 0.02955510
## 2 0.1284185 1 0.2366231 0.2366231 0.01623856
## 3 0.0100000 2 0.1082045 0.1082045 0.01117887
##
## Variable importance
## Satisfaction.Score Churn.Score
## 62 38
##
## Node number 1: 3169 observations, complexity param=0.7633769
## predicted class=0 expected loss=0.2653834 P(node) =1
## class counts: 2328 841
## probabilities: 0.735 0.265
## left son=2 (2527 obs) right son=3 (642 obs)
## Primary splits:
## Satisfaction.Score < 2.5 to the right, improve=868.9674, (0 missing)
## Churn.Score < 80.5 to the left, improve=598.5973, (0 missing)
## ContractMonth.to.Month < 0.5 to the left, improve=243.6513, (0 missing)
## Tenure.in.Months < 17.5 to the right, improve=136.0844, (0 missing)
## ContractTwo.Year < 0.5 to the right, improve=134.6416, (0 missing)
## Surrogate splits:
## Churn.Score < 80.5 to the left, agree=0.878, adj=0.399, (0 split)
##
## Node number 2: 2527 observations, complexity param=0.1284185
## predicted class=0 expected loss=0.07874951 P(node) =0.7974124
## class counts: 2328 199
## probabilities: 0.921 0.079
## left son=4 (2419 obs) right son=5 (108 obs)
## Primary splits:
## Churn.Score < 80.5 to the left, improve=191.50430, (0 missing)
## Satisfaction.Score < 3.5 to the right, improve= 32.84084, (0 missing)
## ContractMonth.to.Month < 0.5 to the left, improve= 24.95558, (0 missing)
## Tenure.in.Months < 2.5 to the right, improve= 22.86974, (0 missing)
## Number.of.Referrals < 1.5 to the right, improve= 15.43606, (0 missing)
##
## Node number 3: 642 observations
## predicted class=1 expected loss=0 P(node) =0.2025876
## class counts: 0 642
## probabilities: 0.000 1.000
##
## Node number 4: 2419 observations
## predicted class=0 expected loss=0.03761885 P(node) =0.7633323
## class counts: 2328 91
## probabilities: 0.962 0.038
##
## Node number 5: 108 observations
## predicted class=1 expected loss=0 P(node) =0.03408015
## class counts: 0 108
## probabilities: 0.000 1.000
#Decision Tree Visualization
rpart.plot(DT_model, main = "Decision Tree")
Satisfaction.Score (62%) and Churn.Score (38%) are the most influential variables in predicting churn. The tree primarily splits on these variables.
The Decision Tree chart shows that 100% of the customers with Satisfaction.Score < 3 would churn. For customers with Satisfaction.Score >= 3) and Churn.Score < 81, they are predicted not to churn with a probability of 0.04 (76%). For customers with Satisfaction.Score >= 3) and Churn.Score >= 81, they are predicted to churn with a probability of 1.0 (100%), but this applies to only 3% of the total data.
# Calculate scale_pos_weight for class imbalance
ratio <- sum(train_y == 0) / sum(train_y == 1)
params <- list(
max_depth = 6,
eta = 0.3,
objective = "binary:logistic",
eval_metric = "auc",
scale_pos_weight = ratio
)
final_model <- xgboost(
data = dtrain,
params = params,
nrounds = 100,
verbose = 0
)
importance <- xgb.importance(model = final_model)
print(importance)
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: Satisfaction.Score 0.664155415 0.23335940 0.03417635
## 2: Churn.Score 0.253474645 0.35137528 0.33424470
## 3: Tenure.in.Months 0.034530450 0.20475469 0.41285031
## 4: Number.of.Referrals 0.016455623 0.09078490 0.06835270
## 5: ContractMonth.to.Month 0.015972345 0.02319221 0.03075871
## 6: Internet.TypeFiber.Optic 0.008693037 0.04202293 0.08133971
## 7: Dependents 0.003901611 0.03167210 0.02187286
## 8: ContractTwo.Year 0.002816874 0.02283850 0.01640465
Satisfaction.Score is the most important feature (Gain = 0.664) increase the likelihood of churn. Churn.Score is the second most important feature (Gain = 0.253) significant in increasing the likelihood of churn. ContractMonth.to.Month, ContractTwo.Year, Internet.TypeFiber.Optic, and Dependents have smaller gain value but will still contribute to the churn.
To evaluate the performances of the classification models in churn prediction, we have used accuracy, precision, recall, F1-score, and ROC-AUC, calculated from contents of confusion matrix.
Convert probabilities to binary predictions based on a threshold (default = 0.3) If predicted probability >= 0.5, prediction will be labeled as “1”; If predicted probability < 0.5, prediction will be labeled as “0”.
evaluate_model <- function(actual, predicted, predicted_prob = NULL, threshold = 0.5, model_name = "") {
if (!is.null(predicted_prob)) {
predicted <- ifelse(predicted_prob > threshold, 1, 0)
}
# Convert to factors with matching levels
actual <- factor(actual, levels = c(0, 1))
predicted <- factor(predicted, levels = c(0, 1))
# Confusion Matrix
library(caret)
cm <- confusionMatrix(predicted, actual)
# Metrics
accuracy <- cm$overall["Accuracy"]
precision <- cm$byClass["Precision"]
recall <- cm$byClass["Recall"]
f1_score <- cm$byClass["F1"]
# Initialize empty ROC plot
roc_plot <- NULL
auc <- NA
# ROC-AUC (if probabilities are provided)
if (!is.null(predicted_prob)) {
library(pROC)
roc_obj <- roc(actual, predicted_prob)
auc <- auc(roc_obj)
# Create ROC Curve plot
roc_plot <- ggroc(roc_obj) +
ggtitle(paste("ROC Curve -", model_name)) +
theme_minimal()
cat("AUC for", model_name, ":", auc, "\n")
}
# Visualize Confusion Matrix
cm_table <- as.data.frame(cm$table)
colnames(cm_table) <- c("Reference", "Prediction", "Freq")
library(ggplot2)
cm_plot <- ggplot(cm_table, aes(Prediction, Reference, fill = Freq)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "orange") +
geom_text(aes(label = Freq), vjust = 1.5) +
labs(
title = paste("Confusion Matrix -", model_name),
x = "Predicted Class",
y = "Actual Class"
) +
theme_minimal()
# Arrange plots side by side if both are available
if (!is.null(roc_plot)) {
library(gridExtra)
grid.arrange(cm_plot, roc_plot, nrow = 2)
} else {
print(cm_plot)
}
# Return metrics
list(
ConfusionMatrix = cm,
Metrics = data.frame(
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score,
ROC_AUC = auc
)
)
}
LR_pred_prob <- predict(LR_model, newdata = test_Telco, type = "response")
lr_results <- evaluate_model(test_Telco$Churn, NULL, LR_pred_prob, model_name = "Logistic Regression")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC for Logistic Regression : 0.9976758
cat("\nLogistic Regression Metrics:\n")
##
## Logistic Regression Metrics:
print(lr_results$Metrics)
## Accuracy Precision Recall F1_Score ROC_AUC
## Accuracy 0.9725379 0.994702 0.9677835 0.9810581 0.9976758
Confusion Matrix:
Class 0: No churn,
Class 1: Churn
True Negatives (Bottom-left, 751): The model correctly predicted 751 instances as “No churn” when the actual label was “No churn.”
True Positives (Top-left, 25): The model correctly predicted 25 instances as “Churn” when the actual label was “Churn.”
False Positives (Top-right, 276): The model incorrectly predicted 276 instances as “Churn” when the actual label was “No churn.”
False Negatives (Bottom-right, 4): The model incorrectly predicted 4 instances as “No churn” when the actual label was “Churn.” Thus, LR model performs well in identifying “No churn” cases (high True Negatives), but there is a tendency to misclassify “No churn” cases as “Churn” (high False Positives), which could lead to unnecessary alerts.
The ROC curve is very close to the top-left corner, indicating that the Logistic Regression model has excellent discriminative ability. AUC (0.997) indicates outstanding discriminative power, nearly perfect for distinguishing between churners and non-churners.
# Get predicted probabilities instead of class predictions
dt_pred_prob <- predict(DT_model, newdata = test_Telco, type = "prob")[, 2] # Probability for the positive class (e.g., '1')
# Evaluate the model
dt_results <- evaluate_model(test_Telco$Churn, NULL, dt_pred_prob, model_name = "Decision Tree")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC for Decision Tree : 0.9553571
# Print metrics
cat("\nDecision Tree Metrics:\n")
##
## Decision Tree Metrics:
print(dt_results$Metrics)
## Accuracy Precision Recall F1_Score ROC_AUC
## Accuracy 0.9763258 0.968789 1 0.9841471 0.9553571
Confusion Matrix:
Class 0: No churn,
Class 1: Churn
True Negatives (Bottom-left, 776): The model correctly predicted 776 instances as “No churn.”
True Positives (Top-left, 255): The model correctly predicted 255 instances as “Churn.”
False Positives (Top-right, 0): The model did not incorrectly predicted any instances as “Churn” when they were actually “No churn.”
False Negatives (Bottom-right, 25): 25 actual “Churn” cases were misclassified as “No churn.”
Thus, DT model has high accuracy, as the majority of predictions are correct.
High value of AUC (0.955) and relatively steep rise in the ROC curve suggest that the Decision Tree model has good discriminative ability. An AUC closer to 1.0 reflects strong performance in distinguishing between churners (1) and non-churners (0). While the AUC is not perfect, it indicates that the model is effective at classification. The curve moves reasonably close to the top-left corner, reflecting high sensitivity (true positive rate) at low false positive rates.
Thus, the Decision Tree model performs well in predicting churners while minimizing the number of non-churners being misclassified as churners. However, the presence of false negatives indicates room for further refinement.
xgb_pred_prob <- predict(final_model, dtest)
xgb_results <- evaluate_model(test_y, NULL, xgb_pred_prob, model_name = "XGBoost")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC for XGBoost : 0.9991505
cat("\nXGBoost Metrics:\n")
##
## XGBoost Metrics:
print(xgb_results$Metrics)
## Accuracy Precision Recall F1_Score ROC_AUC
## Accuracy 0.9763314 0.9967585 0.971564 0.984 0.9991505
Confusion Matrix:
Class 0: No churn,
Class 1: Churn
True Negatives (Bottom-left, 615): The model correctly predicted 615 instances as “No churn” when the actual label was “No churn.”
True Positives (Top-left, 18): The model correctly predicted 18 instances as “Churn” when the actual label was “Churn.”
False Positives (Top-right, 210): The model incorrectly predicted 210 instances as “Churn” when the actual label was “No churn.”
False Negatives (Bottom-right, 2): The model incorrectly predicted 2 instances as “No churn” when the actual label was “Churn.”
Thus, XGB model The XGBoost model performs very well in identifying “Churn” cases (Class 1) with a high True Positives count (18) and only 2 False Negatives, indicating strong sensitivity for churn prediction. However, the model struggles with “No churn” cases. High False Positives (210) indicate that many “No churn” instances are incorrectly classified as “Churn,” potentially causing over-alerting.
The ROC curve is nearly touching the top-left corner, signifying excellent model performance. The curve is far above the diagonal line, confirming that the XGBoost model performs significantly better than random guessing. The AUC (0.999) indicates exceptionally strong discriminative power, almost ideal for distinguishing between churners and non-churners across all thresholds.
# Convert AUC values to numeric in each model's results
lr_results$Metrics$ROC_AUC <- as.numeric(lr_results$Metrics$ROC_AUC)
dt_results$Metrics$ROC_AUC <- as.numeric(dt_results$Metrics$ROC_AUC)
xgb_results$Metrics$ROC_AUC <- as.numeric(xgb_results$Metrics$ROC_AUC)
# Combine the metrics into a summary table
library(dplyr)
summary_table <- bind_rows(
cbind(Model = "Logistic Regression", lr_results$Metrics),
cbind(Model = "Decision Tree", dt_results$Metrics),
cbind(Model = "XGBoost", xgb_results$Metrics)
)
# Print the summary table
print(summary_table)
## Model Accuracy Precision Recall F1_Score
## Accuracy...1 Logistic Regression 0.9725379 0.9947020 0.9677835 0.9810581
## Accuracy...2 Decision Tree 0.9763258 0.9687890 1.0000000 0.9841471
## Accuracy...3 XGBoost 0.9763314 0.9967585 0.9715640 0.9840000
## ROC_AUC
## Accuracy...1 0.9976758
## Accuracy...2 0.9553571
## Accuracy...3 0.9991505
#round off to 4 decimal places
summary_table <- summary_table %>%
mutate_if(is.numeric, ~ round(., 4))
cat("\nSummary Table:\n")
##
## Summary Table:
print(summary_table)
## Model Accuracy Precision Recall F1_Score ROC_AUC
## Accuracy...1 Logistic Regression 0.9725 0.9947 0.9678 0.9811 0.9977
## Accuracy...2 Decision Tree 0.9763 0.9688 1.0000 0.9841 0.9554
## Accuracy...3 XGBoost 0.9763 0.9968 0.9716 0.9840 0.9992
In term of Accuracy, all three models are highly effective in classifying the data. They perform similarly with Logistic Regression slightly lower at 97.25%, and Decision Tree and XGBoost achieving 97.63%.
In term of Precision, XGBoost achieves the highest precision (99.68%), followed by Logistic Regression (99.47%) and Decision Tree (96.88%).
In term of Recall (Sensitivity), Decision Tree achieves the highest recall (100%), while Logistic Regression and XGBoost have slightly lower recall at 96.78% and 97.16% respectively.
In term of F1-Score, the performance of XGBoost (98.40%) and Decision Tree (98.41%) is very close, while logistic regression (98.11%) lags behind by a small margin.
In term of ROC-AUC, XGBoost achieves the highest ROC-AUC of 99.92%, indicating near-perfect classification performance, followed by Logistic Regression at 99.77%, and Decision Tree at 95.54%.
In overall, XGBoost is the strongest performer, excelling in precision, recall, F1-Score, and ROC-AUC.
As it is, the statistical test revealed a significant relationship between the contract type with churn and internet type with churning of customers. That is, using either a chi-square or T-test provides results that there is some significance in the variables affecting a customer to cancel their subscription. These provide a sound basis for using a regression approach in order to predict churn by identifying the most influential factors causing cancellations. Among all, in general, XGBoost has proved to be the most reliable and efficient model a choice can fall upon, which gives accuracy of 0.9763, precision of 0.9968, F1 Score of 0.9840 and ROC_AUC of 0.9992, for which it was suitable whenever balanced and robust performance metric was required. Decision Tree, on one side has perfect recall of 1.0000, reflecting hence its very strong point by identifying all relevant instances like a requirement for some applications where fraud needs to be caught. However, its precision is a bit lower, 0.9688, as compared to XGBoost, and the lack of a valid ROC_AUC metric restricts its overall comparison. Logistic Regression is the simplest and has the minimum accuracy of 0.9725, recall of 0.9678, and F1-Score of 0.9811 among the three models, though it is still doing great with a strong ROC_AUC of 0.9977. The highest overall recommendation would go to XGBoost because of its balanced high performance. A decision tree can be recommended when applications are critical in their recall. Logistic regression could be considered where either the simplicity of computation or the interpretability is the priority over performance. Business users of the model can improve retention by identifying at-risk customers, targeting retention, optimizing resource allocations, and improving their products based on the insights. It will result in better customer satisfaction, efficiency in operations, and improved revenue forecasts.