# Libraries Used:
library(dplyr)
library(data.table)
library(DescTools)
library(scales)
library(ggplot2)
library(lubridate)
library(ggthemes)
library(RColorBrewer)
library(plyr)
library(plotly)
library(cowplot)
library(ggpubr)
library(flexdashboard)
library(ggrepel)

Introduction

In this project, we take a look at an IBM sample customer churn data set for a fictional Telecommunications company. Analysis of this data can be used to gain a better understanding of customer behavior and what characteristics might indicate that a customer is at high risk of leaving. This information can then be used to develop more focused and effective customer retention programs.

Source: https://www.kaggle.com/blastchar/telco-customer-churn

Dataset

The data set contains 7,043 rows and 21 columns which represent individual customers and customer attributes, respectively. The customer attributes include information about churn (identifies customers that have left in the last month), customer demographics, customer account information, and the services the customer is signed up for.

For a complete list of the customer attributes, summary statistics for each one, and the data structure, see the “Descriptive Statistics” tab under “Findings”.

setwd("/Users/jackberk/Documents/Data Visualization/R/R_Data_Project")
df <- read.csv("Telco_Customer_Data.csv")

Findings

Descriptive statistics for the overall data set to keep in mind for comparison:

  • Tenure
    • Mean: 32.37 Months
  • Monthly Charges
    • Mean: $64.76
    • Max: $118.75
    • Min: $18.25

Churn Rate

Here we see that the churn rate for the month that the data was collected is 26.5%, meaning 26.5% of the customers in the data set left in the last month. As previously discussed, our goal in analyzing the data will be to find ways to improve customer retention programs and reduce this rate.

# Customer Churn:
churn_count <- count(df, "Churn")
# Building Churn Pie Chart:
p4 <- plot_ly(churn_count, labels = ~Churn, values = ~freq, type = "pie",
              marker = list(colors=c("steelblue", "lightblue")), 
              textposition = "outside", textinfo = "label + percent") %>%
  layout(title="Customer Churn")
p4

Gender of Customers

With only 0.5% more males, the data set is essentially an even split of male and female customers. This greatly reduces the risk of bias from one gender being disproportionately represented in the data and allows us to more confidently draw conclusions from any observed correlation, or lack there of, between gender and churn.

# Gender:
gender_count <- count(df, "gender")

# Building Gender Pie Chart:
p5 <- plot_ly(gender_count, labels = ~gender, values = ~freq, type = "pie",
              marker = list(colors=c("steelblue", "lightblue")),
        textposition = "outside", textinfo = "label + percent") %>%
  layout(title="Gender of Customers")
p5

Churn by Gender

Just as seen in the pie chart, the number of customers is very similar for each gender. This plot also shows that churn is nearly the same between male and female customers with a difference of less than 1%.

  • Key Takeaway: Gender likely does not have any impact on whether or not a customer leaves.
#DF for churn counts by gender:
new_df <- df %>% 
  select(gender, Churn) %>%
  group_by(gender, Churn) %>%
  summarise(n = length(gender), .groups = 'keep') %>%
  data.frame()

# DF for Aggregate Total of each Gender for Stacked Bar Chart label:
agg_tot <- new_df %>%
  select(gender, n) %>%
  group_by(gender) %>%
  summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()

# Churn Totals by Gender:
churn_tot <- new_df %>%
  filter(Churn == "Yes") %>%
  select(gender, n) %>%
  group_by(gender) %>%
  summarise(churn_tot = sum(n), .groups = 'keep') %>%
  data.frame()

# Percent Churn by Gender:
male_tot <- agg_tot[agg_tot$gender == "Male", "tot"]
male_churn_tot <- churn_tot[churn_tot$gender == "Male", "churn_tot"]
male_churn_rate <- (round((male_churn_tot/male_tot)*100,2))
male_churn_rate_lab <- paste0(male_churn_rate, "%")
male_nonchurn_rate <- (round((100-(male_churn_tot/male_tot)*100),2))
male_nonchurn_rate_lab <- paste0(male_nonchurn_rate, "%")

female_tot <- agg_tot[agg_tot$gender == "Female", "tot"]
female_churn_tot <- churn_tot[churn_tot$gender == "Female", "churn_tot"]
female_churn_rate <- (round((female_churn_tot/female_tot)*100,2))
female_churn_rate_lab <- paste0(female_churn_rate, "%")
female_nonchurn_rate <- (round((100-(female_churn_tot/female_tot)*100),2))
female_nonchurn_rate_lab <- paste0(female_nonchurn_rate, "%")

# Building Stacked Bar Plot:
max_y <- round_any(max(agg_tot$tot), 1000, ceiling)

p2 <- ggplot(new_df, aes(x = gender, y = n, fill = Churn)) +
  geom_bar(stat="identity", position = position_stack(reverse = TRUE)) + 
  labs(title = "Churn by Gender", x = "Gender", y = "Customer Count", fill = "Churn") + 
  theme_light() + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_fill_brewer(palette="Paired", guide = guide_legend(reverse = TRUE)) + 
  geom_text(data = agg_tot, aes(x = gender, y = tot, label = scales::comma(tot), fill = NULL), vjust = -0.4, size = 4) +
  scale_y_continuous(labels = comma, 
                     breaks = seq(0, max_y, by = 500),
                     limits= c(0, max_y)) + 
  annotate("text", x = "Female", y = 2500/2, label = female_nonchurn_rate_lab, size = 4) + 
  annotate("text", x = "Female", y = 3000, label = female_churn_rate_lab, size = 4) +
  annotate("text", x = "Male", y = 2625/2, label = male_nonchurn_rate_lab, size = 4) + 
  annotate("text", x = "Male", y = 3100, label = male_churn_rate_lab, size = 4)
p2

Tenure by Churn

This density plot illustrates the distribution of tenure length for retained and lost customers. Tenure distribution for retained customers is relatively uniform with only a slightly higher concentration of around 1.75% of retained customers having a longer tenure of about 69-70 months. On the other hand, the distribution is right skewed with a high concentration of customers having short tenures of about 2-4 months for customers lost in the last month. The dashed lines represent mean tenures of about 37-38 months for retained customers and about 18 months for customers that left in the last month, substantially lower than the mean of 32.37 months for the overall data set.

  • Key Takeaway: Customers who have left in the last month tend to be newer or short-term customers while retained customers seem to be made up of nearly equal amounts of short, mid, and long-term customers.
mu <- ddply(df, "Churn", summarise, grp.mean=mean(tenure))

# Building Density Plot:
p1 <- ggplot(df, aes(x=tenure, fill = Churn)) +
  geom_density(alpha=0.4) +
  labs(title = "Density Plot: Customer Tenure by Churn", x= "Tenure (Months)", y = "Density") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette="Set1") + 
  scale_x_continuous(labels=comma, breaks = seq(0, max(df$tenure), by = 12)) +
  geom_vline(data = mu, aes(xintercept=grp.mean, color=Churn),
             linetype="dashed")
p1

Churn, Monthly Charges, and Customer Count by Contract Type

This plot provides a closer look at the relationship between churn and contract type as well as the total monthly charges for each contract type. Over 50% of the company’s customers have month-to-month contracts. Significantly less customers have two-year contracts which is closely followed by the number of customers who have one-year contracts. Churn is also far higher for month-to-month contract holders at nearly 50%. Month-to-month contract holders also account for the largest total amount of monthly charges at about $156,000, followed by only about $64,000 to $65,000 for two-year and $56,000 for one-year contract holders.

  • Key Takeaway: Month-to-month contract holders account for the majority of the customers in the data set and the largest total amount of monthly charges. However, the group of customers with this contract type also have the highest churn rate falling just short of 50%.
# Axis 1: Customer Count by Contract Type by Churn:
contract_df <- df %>% 
  select(Contract, Churn) %>%
  group_by(Contract, Churn) %>%
  summarise(n = length(Contract), .groups = 'keep') %>%
  data.frame()

# Axis 1: Aggregate Customer Count Total by Contract Type for Stacked Bar Chart:
contract_tot <- contract_df %>%
  select(Contract, n) %>%
  group_by(Contract) %>%
  summarise(total = sum(n), .groups = 'keep') %>%
  data.frame()

# Axis 2: Total Monthly Charges by Contract Type:
monthly_charges_df <- df %>%
  select(Contract, MonthlyCharges) %>%
  group_by(Contract) %>%
  summarise(total_monthly_charges = sum(MonthlyCharges)) %>%
  data.frame()

# Changing Axis 2 Scale:
ylab <- seq(0, max(monthly_charges_df$total_monthly_charges)/1e3, 25)
# 1e3 = 1,000 (scientific notation)
my_labels <- paste0("$", ylab, "K")

# Increase axis range to fit labels:
max_y1 <- round_any(max(contract_tot$total)*2,1000, ceiling)

p6 <- ggplot(contract_df, aes(x = reorder(Contract, n, sum), y = n, fill = Churn)) +
  geom_bar(stat="identity", position = position_stack(reverse = TRUE)) + 
  coord_flip() +
  labs(title = "Churn, Total Monthly Charges, and Customer Count by Contract Type", x = "Contract Type", y = "Customer Count", fill = "Churn") +
  theme_light() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Paired", guide = guide_legend(reverse = TRUE)) +
  geom_line(inherit.aes = FALSE, data = monthly_charges_df,
            aes(x = Contract, y = total_monthly_charges/40, color = "Total Monthly Charges", group=1), size=1) +
  scale_color_manual(NULL, values="black") + 
  scale_y_continuous(labels=comma,
                     breaks = seq(0, max_y1, by = 500),
                     sec.axis = sec_axis(~.*40, name = "Total Monthly Charges", labels = my_labels,
                                         breaks = ylab*1e3)) +
  geom_point(inherit.aes = FALSE, data=monthly_charges_df,
             aes(x = Contract, y = total_monthly_charges/40, group = 1),
                 size = 3, shape = 21, fill = "white", color = "black") + 
  theme(legend.background = element_rect(fill = "transparent"),
        legend.box.background = element_rect(fill = "transparent", color = NA),
        legend.spacing = unit(-1, "lines")) +
  geom_text(data = contract_tot, aes(x = Contract, y = total, label = scales::comma(total), fill = NULL), hjust = -0.1, size = 3)
p6

Customer Count by Monthly Charges by Churn

This heatmap shows the distribution of churn by range of monthly charges. The number in the box tells us the amount of customers with the churn status for that column who have monthly charges in that range. Box color also indicates customer count with darker shades of blue representing a higher count and lighter shades representing a lower count. From this plot we learn that customers lost in the last month tend to have monthly charges at the higher end, with the highest concentration of 353 customers spending $78.25 - $88.25 per month. Retained customers are concentrated more at the lower end of spending with 1,458 customers spending only $18.25-$28.25.

  • Key Takeaway: Customers lost in the last month tend to have higher monthly charges while retained customers seem to spend less and have generally lower monthly charges.
# Data Frame w/ buckets for different levels of MonthlyPayments and avg charges per bucket by Churn
charges_df <- df %>%
  select(Churn, MonthlyCharges) %>%
  mutate(MonthlyChargesBucket = case_when(
    MonthlyCharges <= 28.25 ~ "$18.25 - $28.25",
    MonthlyCharges <= 38.25 ~ "$28.25 - $38.25",
    MonthlyCharges <= 48.25 ~ "$38.25 - $48.25",
    MonthlyCharges <= 58.25 ~ "$48.25 - $58.25",
    MonthlyCharges <= 68.25 ~ "$58.25 - $68.25",
    MonthlyCharges <= 78.25 ~ "$68.25 - $78.25",
    MonthlyCharges <= 88.25 ~ "$78.25 - $88.25",
    MonthlyCharges <= 98.25 ~ "$88.25 - $98.25",
    MonthlyCharges <= 108.25 ~ "$98.25 - $108.25",
    MonthlyCharges <= 118.75 ~ "$108.25 - $118.75")) %>%
  group_by(Churn, MonthlyChargesBucket) %>%
  summarise(Customer_Count = length(Churn), .groups='keep') %>%
  data.frame()

# Building Heatmap:
my_levels <- c("$18.25 - $28.25", "$28.25 - $38.25","$38.25 - $48.25", "$48.25 - $58.25", "$58.25 - $68.25","$68.25 - $78.25", "$78.25 - $88.25", "$88.25 - $98.25", "$98.25 - $108.25", "$108.25 - $118.75")
charges_df$MonthlyChargesBucket <- factor(charges_df$MonthlyChargesBucket, level = my_levels)

breaks <- c(seq(0, max(charges_df$Customer_Count), by=100))

p3 <- ggplot(charges_df, aes(x = Churn, y = MonthlyChargesBucket, fill = Customer_Count)) +
  geom_tile(color = "black") + 
  geom_text(aes(label=comma(Customer_Count))) + 
  coord_equal(ratio=1) + 
  labs(title = "Heatmap: Customer Count by Monthly Charges by Churn",
       x = "Churn",
       y = "Monthly Charges",
       fill = "Customer Count") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_y_discrete(limits = rev(levels(charges_df$MonthlyChargesBucket))) + 
  scale_fill_continuous(low="white", high="steelblue", breaks = breaks) + 
  guides(fill = guide_legend(reverse = TRUE, override.aes=list(color="black")))
p3

Descriptive Statistics

There were 11 NA’s in the TotalCharges column but this was the only column with NA’s. 11 NA’s out of 7,043 is not a very significant amount. Additionally, all of the customers with NA’s in this column have Tenure = 0 and Churn = No but have data recorded for MonthlyCharges. This could mean that these rows are customers that have just been acquired in the last month. Therefore, the missing data may be due to total charges not being calculated for new customers yet or to data collection issues. For these reasons, the rows with missing data for TotalCharges were kept in the data set.

summary(df)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
str(df)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

Conclusion

These plots provide several key takeaways for customer retention efforts. The tenure density plot tells us that customers who leave tend to be newer or short-term customers. This is reinforced by the churn by contract type bar chart which shows that month-to-month contract holders account for the majority of the Telecommunications customers and generate the most revenue with the highest total monthly charges, but they also have the highest churn rate. Furthermore, the heatmap shows us that customers who leave generally have higher individual monthly charges compared to retained customers.

Given this information, customer retention efforts may need to be refocused to prioritize newer customers with short term month-to-month contracts and high monthly bills. The data indicates that these types of customers make up a significant level of churn resulting in lost revenue.

Questions to Consider

These visualizations focused on identifying and examining relationships between churn and gender, tenure, contract type, and monthly charges. While these visualizations provide value for the effort to optimize customer retention programs, there are still several other customer attributes that provoke questions and should be investigated. Some relationships of interest include:

  • Churn vs Senior Citizen Status (it may be of greater value to look at customer age in years if the data can be obtained in the future)
  • Churn vs Payment Method
  • Churn vs Paperless Billing Status (does billing type seem to impact risk of customer leaving, does it indicate other information about the customer?)
  • Churn vs Services the Customer Pays For (phone, internet, streaming TV and/or movies)
  • Churn vs Dependent Status (does a customer having dependents seem to influence the likelihood of them leaving?)
  • Churn vs Partner Status (are single customers more or less likely to leave than customers with a partner?)
  • Churn vs Protective Measures (is paying for safeguards like online security, online backup, and/or device protection associated with a lower or higher churn rate?)

A Word About the Data

Overall, the IBM sample telecommunications customer churn data set is very usable and had very few, if any, issues. Very little data cleaning was necessary as I did not come across any bad data and the small number of NA’s were not concerning once investigated (see “Descriptive Statistics” tab). This makes the data set a good choice for someone looking to learn and develop data visualization skills. The methods of data analysis applied to this data set and the value they produced for identifying areas of focus for customer retention efforts are also applicable to real world data despite this being a generated sample data set.