# Libraries Used:
library(dplyr)
library(data.table)
library(DescTools)
library(scales)
library(ggplot2)
library(lubridate)
library(ggthemes)
library(RColorBrewer)
library(plyr)
library(plotly)
library(cowplot)
library(ggpubr)
library(flexdashboard)
library(ggrepel)
In this project, we take a look at an IBM sample customer churn data set for a fictional Telecommunications company. Analysis of this data can be used to gain a better understanding of customer behavior and what characteristics might indicate that a customer is at high risk of leaving. This information can then be used to develop more focused and effective customer retention programs.
Source: https://www.kaggle.com/blastchar/telco-customer-churn
The data set contains 7,043 rows and 21 columns which represent individual customers and customer attributes, respectively. The customer attributes include information about churn (identifies customers that have left in the last month), customer demographics, customer account information, and the services the customer is signed up for.
For a complete list of the customer attributes, summary statistics for each one, and the data structure, see the “Descriptive Statistics” tab under “Findings”.
setwd("/Users/jackberk/Documents/Data Visualization/R/R_Data_Project")
df <- read.csv("Telco_Customer_Data.csv")
Descriptive statistics for the overall data set to keep in mind for comparison:
Here we see that the churn rate for the month that the data was collected is 26.5%, meaning 26.5% of the customers in the data set left in the last month. As previously discussed, our goal in analyzing the data will be to find ways to improve customer retention programs and reduce this rate.
# Customer Churn:
churn_count <- count(df, "Churn")
# Building Churn Pie Chart:
p4 <- plot_ly(churn_count, labels = ~Churn, values = ~freq, type = "pie",
marker = list(colors=c("steelblue", "lightblue")),
textposition = "outside", textinfo = "label + percent") %>%
layout(title="Customer Churn")
p4
With only 0.5% more males, the data set is essentially an even split of male and female customers. This greatly reduces the risk of bias from one gender being disproportionately represented in the data and allows us to more confidently draw conclusions from any observed correlation, or lack there of, between gender and churn.
# Gender:
gender_count <- count(df, "gender")
# Building Gender Pie Chart:
p5 <- plot_ly(gender_count, labels = ~gender, values = ~freq, type = "pie",
marker = list(colors=c("steelblue", "lightblue")),
textposition = "outside", textinfo = "label + percent") %>%
layout(title="Gender of Customers")
p5
Just as seen in the pie chart, the number of customers is very similar for each gender. This plot also shows that churn is nearly the same between male and female customers with a difference of less than 1%.
#DF for churn counts by gender:
new_df <- df %>%
select(gender, Churn) %>%
group_by(gender, Churn) %>%
summarise(n = length(gender), .groups = 'keep') %>%
data.frame()
# DF for Aggregate Total of each Gender for Stacked Bar Chart label:
agg_tot <- new_df %>%
select(gender, n) %>%
group_by(gender) %>%
summarise(tot = sum(n), .groups = 'keep') %>%
data.frame()
# Churn Totals by Gender:
churn_tot <- new_df %>%
filter(Churn == "Yes") %>%
select(gender, n) %>%
group_by(gender) %>%
summarise(churn_tot = sum(n), .groups = 'keep') %>%
data.frame()
# Percent Churn by Gender:
male_tot <- agg_tot[agg_tot$gender == "Male", "tot"]
male_churn_tot <- churn_tot[churn_tot$gender == "Male", "churn_tot"]
male_churn_rate <- (round((male_churn_tot/male_tot)*100,2))
male_churn_rate_lab <- paste0(male_churn_rate, "%")
male_nonchurn_rate <- (round((100-(male_churn_tot/male_tot)*100),2))
male_nonchurn_rate_lab <- paste0(male_nonchurn_rate, "%")
female_tot <- agg_tot[agg_tot$gender == "Female", "tot"]
female_churn_tot <- churn_tot[churn_tot$gender == "Female", "churn_tot"]
female_churn_rate <- (round((female_churn_tot/female_tot)*100,2))
female_churn_rate_lab <- paste0(female_churn_rate, "%")
female_nonchurn_rate <- (round((100-(female_churn_tot/female_tot)*100),2))
female_nonchurn_rate_lab <- paste0(female_nonchurn_rate, "%")
# Building Stacked Bar Plot:
max_y <- round_any(max(agg_tot$tot), 1000, ceiling)
p2 <- ggplot(new_df, aes(x = gender, y = n, fill = Churn)) +
geom_bar(stat="identity", position = position_stack(reverse = TRUE)) +
labs(title = "Churn by Gender", x = "Gender", y = "Customer Count", fill = "Churn") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette="Paired", guide = guide_legend(reverse = TRUE)) +
geom_text(data = agg_tot, aes(x = gender, y = tot, label = scales::comma(tot), fill = NULL), vjust = -0.4, size = 4) +
scale_y_continuous(labels = comma,
breaks = seq(0, max_y, by = 500),
limits= c(0, max_y)) +
annotate("text", x = "Female", y = 2500/2, label = female_nonchurn_rate_lab, size = 4) +
annotate("text", x = "Female", y = 3000, label = female_churn_rate_lab, size = 4) +
annotate("text", x = "Male", y = 2625/2, label = male_nonchurn_rate_lab, size = 4) +
annotate("text", x = "Male", y = 3100, label = male_churn_rate_lab, size = 4)
p2
This density plot illustrates the distribution of tenure length for retained and lost customers. Tenure distribution for retained customers is relatively uniform with only a slightly higher concentration of around 1.75% of retained customers having a longer tenure of about 69-70 months. On the other hand, the distribution is right skewed with a high concentration of customers having short tenures of about 2-4 months for customers lost in the last month. The dashed lines represent mean tenures of about 37-38 months for retained customers and about 18 months for customers that left in the last month, substantially lower than the mean of 32.37 months for the overall data set.
mu <- ddply(df, "Churn", summarise, grp.mean=mean(tenure))
# Building Density Plot:
p1 <- ggplot(df, aes(x=tenure, fill = Churn)) +
geom_density(alpha=0.4) +
labs(title = "Density Plot: Customer Tenure by Churn", x= "Tenure (Months)", y = "Density") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette="Set1") +
scale_x_continuous(labels=comma, breaks = seq(0, max(df$tenure), by = 12)) +
geom_vline(data = mu, aes(xintercept=grp.mean, color=Churn),
linetype="dashed")
p1
This plot provides a closer look at the relationship between churn and contract type as well as the total monthly charges for each contract type. Over 50% of the company’s customers have month-to-month contracts. Significantly less customers have two-year contracts which is closely followed by the number of customers who have one-year contracts. Churn is also far higher for month-to-month contract holders at nearly 50%. Month-to-month contract holders also account for the largest total amount of monthly charges at about $156,000, followed by only about $64,000 to $65,000 for two-year and $56,000 for one-year contract holders.
# Axis 1: Customer Count by Contract Type by Churn:
contract_df <- df %>%
select(Contract, Churn) %>%
group_by(Contract, Churn) %>%
summarise(n = length(Contract), .groups = 'keep') %>%
data.frame()
# Axis 1: Aggregate Customer Count Total by Contract Type for Stacked Bar Chart:
contract_tot <- contract_df %>%
select(Contract, n) %>%
group_by(Contract) %>%
summarise(total = sum(n), .groups = 'keep') %>%
data.frame()
# Axis 2: Total Monthly Charges by Contract Type:
monthly_charges_df <- df %>%
select(Contract, MonthlyCharges) %>%
group_by(Contract) %>%
summarise(total_monthly_charges = sum(MonthlyCharges)) %>%
data.frame()
# Changing Axis 2 Scale:
ylab <- seq(0, max(monthly_charges_df$total_monthly_charges)/1e3, 25)
# 1e3 = 1,000 (scientific notation)
my_labels <- paste0("$", ylab, "K")
# Increase axis range to fit labels:
max_y1 <- round_any(max(contract_tot$total)*2,1000, ceiling)
p6 <- ggplot(contract_df, aes(x = reorder(Contract, n, sum), y = n, fill = Churn)) +
geom_bar(stat="identity", position = position_stack(reverse = TRUE)) +
coord_flip() +
labs(title = "Churn, Total Monthly Charges, and Customer Count by Contract Type", x = "Contract Type", y = "Customer Count", fill = "Churn") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Paired", guide = guide_legend(reverse = TRUE)) +
geom_line(inherit.aes = FALSE, data = monthly_charges_df,
aes(x = Contract, y = total_monthly_charges/40, color = "Total Monthly Charges", group=1), size=1) +
scale_color_manual(NULL, values="black") +
scale_y_continuous(labels=comma,
breaks = seq(0, max_y1, by = 500),
sec.axis = sec_axis(~.*40, name = "Total Monthly Charges", labels = my_labels,
breaks = ylab*1e3)) +
geom_point(inherit.aes = FALSE, data=monthly_charges_df,
aes(x = Contract, y = total_monthly_charges/40, group = 1),
size = 3, shape = 21, fill = "white", color = "black") +
theme(legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent", color = NA),
legend.spacing = unit(-1, "lines")) +
geom_text(data = contract_tot, aes(x = Contract, y = total, label = scales::comma(total), fill = NULL), hjust = -0.1, size = 3)
p6
This heatmap shows the distribution of churn by range of monthly charges. The number in the box tells us the amount of customers with the churn status for that column who have monthly charges in that range. Box color also indicates customer count with darker shades of blue representing a higher count and lighter shades representing a lower count. From this plot we learn that customers lost in the last month tend to have monthly charges at the higher end, with the highest concentration of 353 customers spending $78.25 - $88.25 per month. Retained customers are concentrated more at the lower end of spending with 1,458 customers spending only $18.25-$28.25.
# Data Frame w/ buckets for different levels of MonthlyPayments and avg charges per bucket by Churn
charges_df <- df %>%
select(Churn, MonthlyCharges) %>%
mutate(MonthlyChargesBucket = case_when(
MonthlyCharges <= 28.25 ~ "$18.25 - $28.25",
MonthlyCharges <= 38.25 ~ "$28.25 - $38.25",
MonthlyCharges <= 48.25 ~ "$38.25 - $48.25",
MonthlyCharges <= 58.25 ~ "$48.25 - $58.25",
MonthlyCharges <= 68.25 ~ "$58.25 - $68.25",
MonthlyCharges <= 78.25 ~ "$68.25 - $78.25",
MonthlyCharges <= 88.25 ~ "$78.25 - $88.25",
MonthlyCharges <= 98.25 ~ "$88.25 - $98.25",
MonthlyCharges <= 108.25 ~ "$98.25 - $108.25",
MonthlyCharges <= 118.75 ~ "$108.25 - $118.75")) %>%
group_by(Churn, MonthlyChargesBucket) %>%
summarise(Customer_Count = length(Churn), .groups='keep') %>%
data.frame()
# Building Heatmap:
my_levels <- c("$18.25 - $28.25", "$28.25 - $38.25","$38.25 - $48.25", "$48.25 - $58.25", "$58.25 - $68.25","$68.25 - $78.25", "$78.25 - $88.25", "$88.25 - $98.25", "$98.25 - $108.25", "$108.25 - $118.75")
charges_df$MonthlyChargesBucket <- factor(charges_df$MonthlyChargesBucket, level = my_levels)
breaks <- c(seq(0, max(charges_df$Customer_Count), by=100))
p3 <- ggplot(charges_df, aes(x = Churn, y = MonthlyChargesBucket, fill = Customer_Count)) +
geom_tile(color = "black") +
geom_text(aes(label=comma(Customer_Count))) +
coord_equal(ratio=1) +
labs(title = "Heatmap: Customer Count by Monthly Charges by Churn",
x = "Churn",
y = "Monthly Charges",
fill = "Customer Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_discrete(limits = rev(levels(charges_df$MonthlyChargesBucket))) +
scale_fill_continuous(low="white", high="steelblue", breaks = breaks) +
guides(fill = guide_legend(reverse = TRUE, override.aes=list(color="black")))
p3
There were 11 NA’s in the TotalCharges column but this was the only column with NA’s. 11 NA’s out of 7,043 is not a very significant amount. Additionally, all of the customers with NA’s in this column have Tenure = 0 and Churn = No but have data recorded for MonthlyCharges. This could mean that these rows are customers that have just been acquired in the last month. Therefore, the missing data may be due to total charges not being calculated for new customers yet or to data collection issues. For these reasons, the rows with missing data for TotalCharges were kept in the data set.
summary(df)
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Length:7043
## Class :character
## Mode :character
##
##
##
##
str(df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
These plots provide several key takeaways for customer retention efforts. The tenure density plot tells us that customers who leave tend to be newer or short-term customers. This is reinforced by the churn by contract type bar chart which shows that month-to-month contract holders account for the majority of the Telecommunications customers and generate the most revenue with the highest total monthly charges, but they also have the highest churn rate. Furthermore, the heatmap shows us that customers who leave generally have higher individual monthly charges compared to retained customers.
Given this information, customer retention efforts may need to be refocused to prioritize newer customers with short term month-to-month contracts and high monthly bills. The data indicates that these types of customers make up a significant level of churn resulting in lost revenue.
These visualizations focused on identifying and examining relationships between churn and gender, tenure, contract type, and monthly charges. While these visualizations provide value for the effort to optimize customer retention programs, there are still several other customer attributes that provoke questions and should be investigated. Some relationships of interest include:
Overall, the IBM sample telecommunications customer churn data set is very usable and had very few, if any, issues. Very little data cleaning was necessary as I did not come across any bad data and the small number of NA’s were not concerning once investigated (see “Descriptive Statistics” tab). This makes the data set a good choice for someone looking to learn and develop data visualization skills. The methods of data analysis applied to this data set and the value they produced for identifying areas of focus for customer retention efforts are also applicable to real world data despite this being a generated sample data set.