Import libraries

library(tidyverse)
library(readr)
library(ggplot2)
library(dplyr)
library(kableExtra)
library(GGally)
library(corrplot)
library(RColorBrewer)

Load the required libraries, including readr and tidyverse.

Load data

data <- read_csv("customer churn dataset.csv")
View(data)

# check column names of the data
colnames(data)
##  [1] "CustomerID"        "Age"               "Gender"           
##  [4] "Tenure"            "Usage Frequency"   "Support Calls"    
##  [7] "Payment Delay"     "Subscription Type" "Contract Length"  
## [10] "Total Spend"       "Last Interaction"  "Churn"
# check the structure of the data
str(data)
## spc_tbl_ [64,374 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ CustomerID       : num [1:64374] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Age              : num [1:64374] 22 41 47 35 53 30 47 54 36 65 ...
##  $ Gender           : chr [1:64374] "Female" "Female" "Male" "Male" ...
##  $ Tenure           : num [1:64374] 25 28 27 9 58 41 37 36 20 8 ...
##  $ Usage Frequency  : num [1:64374] 14 28 10 12 24 14 15 11 5 4 ...
##  $ Support Calls    : num [1:64374] 4 7 2 5 9 10 9 0 10 2 ...
##  $ Payment Delay    : num [1:64374] 27 13 29 17 2 10 28 18 8 23 ...
##  $ Subscription Type: chr [1:64374] "Basic" "Standard" "Premium" "Premium" ...
##  $ Contract Length  : chr [1:64374] "Monthly" "Monthly" "Annual" "Quarterly" ...
##  $ Total Spend      : num [1:64374] 598 584 757 232 533 500 574 323 687 995 ...
##  $ Last Interaction : num [1:64374] 9 20 21 18 18 29 14 16 8 10 ...
##  $ Churn            : num [1:64374] 1 0 0 0 0 0 1 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   CustomerID = col_double(),
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Tenure = col_double(),
##   ..   `Usage Frequency` = col_double(),
##   ..   `Support Calls` = col_double(),
##   ..   `Payment Delay` = col_double(),
##   ..   `Subscription Type` = col_character(),
##   ..   `Contract Length` = col_character(),
##   ..   `Total Spend` = col_double(),
##   ..   `Last Interaction` = col_double(),
##   ..   Churn = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
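# attach() is left disabled: every later step reaches columns through data$... or dplyr verbs,
# and an attached copy would not pick up the columns added or converted below
# attach(data)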

Data Description

This dataset contains customer information related to demographics, subscription details, and service usage. Key features include age, gender, tenure, usage frequency, payment behavior, and contract length, along with a churn indicator showing whether a customer left the service. The goal of this analysis is to explore patterns behind customer churn through descriptive statistics and visualizations, providing insights into retention and engagement strategies.

The data was loaded with the read_csv function from the readr package. It contains 64,374 observations and 12 columns. The data was sourced from Kaggle, a widely used platform for data science and data analytics datasets.

About the Data

Customer churn refers to the phenomenon where customers discontinue their relationship or subscription with a company or service provider. It represents the rate at which customers stop using a company’s products or services within a specific period. Churn is an important metric for businesses as it directly impacts revenue, growth, and customer retention.

In the context of the Churn dataset, the churn label indicates whether a customer has churned or not. A churned customer is one who has decided to discontinue their subscription or usage of the company’s services. On the other hand, a non-churned customer is one who continues to remain engaged and retains their relationship with the company.
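As a small worked illustration of the rate described above, the churn rate is the share of customers whose label is 1. The sketch below uses only the first ten Churn values shown in the str() output, so the result says nothing about the full dataset; `churn_flag` and the other variable names are hypothetical and chosen for this example.

```r
# First ten Churn values from the str() output: 1 = churned, 0 = retained
churn_flag <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0)

# Churn rate = churned customers / all customers
n_churned  <- sum(churn_flag == 1)
n_total    <- length(churn_flag)
churn_rate <- n_churned / n_total

# Because the label is coded 0/1, the mean of the label gives the same value
stopifnot(isTRUE(all.equal(churn_rate, mean(churn_flag))))

# Express the rate as a percentage with one decimal place
churn_rate_pct <- round(churn_rate * 100, 1)
churn_rate_pct
```

The same calculation on the full column would be mean(data$Churn) * 100, provided it runs before Churn is converted to a factor.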

Understanding customer churn is crucial for businesses to identify patterns, factors, and indicators that contribute to customer attrition. By analyzing churn behavior and its associated features, companies can develop strategies to retain existing customers, improve customer satisfaction, and reduce customer turnover. Predictive modeling techniques can also be applied to forecast potential churn, so that companies can act early to retain at-risk customers. The data set can be downloaded from Kaggle (Customer Churn Dataset).

Check Missing Values in the data

na_counts <- sapply(data, function(y) sum(is.na(y)))
na_counts <- data.frame(na_counts)
na_counts
##                   na_counts
## CustomerID                0
## Age                       0
## Gender                    0
## Tenure                    0
## Usage Frequency           0
## Support Calls             0
## Payment Delay             0
## Subscription Type         0
## Contract Length           0
## Total Spend               0
## Last Interaction          0
## Churn                     0
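Since every column in this dataset reports zero missing values, the check is easier to follow on data that does have gaps. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the dataset), using base R's colSums() as one possible alternative way to get per-column counts.

```r
# Toy data frame with a few deliberately missing values (made-up data)
toy <- data.frame(
  Age    = c(22, NA, 47, 35),
  Gender = c("Female", "Female", NA, NA),
  Tenure = c(25, 28, 27, 9)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
n_complete
```

If a real column did show missing values, the per-column counts would indicate where to look before deciding whether to drop or impute rows.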

Create Age Group from Age

data <- data %>%
  mutate(AgeGroup = case_when(
    Age < 25 ~ "18-24",
    Age >= 25 & Age <= 34 ~ "25-34",
    Age >= 35 & Age <= 44 ~ "35-44",
    Age >= 45 & Age <= 54 ~ "45-54",
    Age >= 55 ~ "55+"
  ))
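The same binning can also be sketched with base R's cut(), which returns an ordered factor instead of a character vector, so the groups keep their natural order in tables and plots. This is an alternative under stated assumptions, not the method used above: the `age` vector holds the first ten Age values from the str() output, and `age_group` is a hypothetical name.

```r
# First ten Age values from the str() output
age <- c(22, 41, 47, 35, 53, 30, 47, 54, 36, 65)

# right = FALSE makes intervals left-closed: [-Inf,25), [25,35), [35,45), [45,55), [55,Inf)
# which mirrors the conditions Age < 25, 25-34, 35-44, 45-54 and Age >= 55
age_group <- cut(
  age,
  breaks = c(-Inf, 25, 35, 45, 55, Inf),
  labels = c("18-24", "25-34", "35-44", "45-54", "55+"),
  right = FALSE,
  ordered_result = TRUE
)

# Count customers per group; levels appear in age order, not alphabetical order
table(age_group)
```

One difference worth noting: cut() assigns every value to a bin, whereas the case_when() version would return NA for a non-integer age such as 34.5.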

Changing the type of the variables

data$`Churn` = factor(data$Churn, 
                     levels = c(0, 1), 
                     labels = c("No", "Yes"))
data$`Gender` = as.factor(data$`Gender`)
data$`Contract Length` = factor(data$`Contract Length`, 
                              levels = c("Monthly", "Quarterly", "Annual"))
data$`Subscription Type` = factor(data$`Subscription Type`, 
                                levels = c("Basic", "Standard", "Premium"))

# check the structure after conversion
str(data)
## tibble [64,374 × 13] (S3: tbl_df/tbl/data.frame)
##  $ CustomerID       : num [1:64374] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Age              : num [1:64374] 22 41 47 35 53 30 47 54 36 65 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 1 1 2 2 1 2 1 1 2 2 ...
##  $ Tenure           : num [1:64374] 25 28 27 9 58 41 37 36 20 8 ...
##  $ Usage Frequency  : num [1:64374] 14 28 10 12 24 14 15 11 5 4 ...
##  $ Support Calls    : num [1:64374] 4 7 2 5 9 10 9 0 10 2 ...
##  $ Payment Delay    : num [1:64374] 27 13 29 17 2 10 28 18 8 23 ...
##  $ Subscription Type: Factor w/ 3 levels "Basic","Standard",..: 1 2 3 3 2 3 1 2 1 1 ...
##  $ Contract Length  : Factor w/ 3 levels "Monthly","Quarterly",..: 1 1 3 2 3 1 2 1 1 3 ...
##  $ Total Spend      : num [1:64374] 598 584 757 232 533 500 574 323 687 995 ...
##  $ Last Interaction : num [1:64374] 9 20 21 18 18 29 14 16 8 10 ...
##  $ Churn            : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 1 1 ...
##  $ AgeGroup         : chr [1:64374] "18-24" "35-44" "45-54" "35-44" ...

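Some columns have to be converted to a more appropriate data type. Gender, Contract Length, and Subscription Type are read in as character columns, and Churn is read in as a numeric 0/1 column. All four are categorical variables, so they are converted to factors, with explicit level order for Contract Length and Subscription Type and the labels "No" and "Yes" for Churn.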
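To see what the levels and labels arguments do, here is a minimal sketch on the first few raw values from the str() output. The variable names (`churn_raw`, `contract_raw`, and so on) are hypothetical and exist only for this illustration.

```r
# Raw values as they appear before conversion (taken from the str() output)
churn_raw    <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0)
contract_raw <- c("Monthly", "Monthly", "Annual", "Quarterly")

# levels says which raw values to expect; labels renames them in the same order
churn_f <- factor(churn_raw, levels = c(0, 1), labels = c("No", "Yes"))

# Explicit levels fix the order used by tables, plots, and model contrasts
contract_f <- factor(contract_raw, levels = c("Monthly", "Quarterly", "Annual"))

# Without explicit levels, factor() sorts the distinct values alphabetically
default_levels <- levels(factor(contract_raw))

levels(churn_f)         # "No" "Yes"
as.integer(contract_f)  # 1 1 3 2, matching the codes in the str() output
default_levels          # "Annual" "Monthly" "Quarterly"
```

The alphabetical default would put "Annual" first on a plot axis, which is why the level order is spelled out for Contract Length and Subscription Type.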

Customer Churn Distribution

# Summarize churn counts and percentages
churn_counts <- data %>%
  count(Churn) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

# Bar plot
ggplot(churn_counts, aes(x = Churn, y = n, fill = Churn)) +
  geom_col() +
  geom_text(aes(label = paste0(Percent, "%")), vjust = -0.5) +
  labs(title = "Customer Churn Distribution",
       x = "Churn",
       y = "Frequency") +
  scale_fill_manual(values = c("No" = "#8E27F5", "Yes" = "#27F5B7")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) 

Churn by Gender

# Summarize churn counts by Gender
churn_Gender <- data %>%
  count(Gender, Churn) %>%
  group_by(Gender) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

# Bar chart: Gender vs Churn
ggplot(churn_Gender, aes(x = Gender, y = n, fill = Churn)) +
  geom_col(position = "dodge") +  
  geom_text(aes(label = paste0(Percent, "%")), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5) +
  labs(
    title = "Churn by Gender",
    x = "Gender",
    y = "Count of Customers",
    fill = "Churn"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("No" = "#E23FEB", "Yes" = "#298A6D")) 
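The percentages on the bars are computed within each gender, so the "No" and "Yes" shares of one gender add up to 100. As a cross-check of that logic, the sketch below reproduces the calculation in base R with prop.table() on made-up data; `toy`, `counts`, and `row_pct` are hypothetical names, and the values are not from the dataset.

```r
# Made-up toy data: 4 female customers (1 churned), 5 male customers (2 churned)
toy <- data.frame(
  Gender = c(rep("Female", 4), rep("Male", 5)),
  Churn  = c("Yes", "No", "No", "No", "Yes", "Yes", "No", "No", "No")
)

# Cross-tabulate counts: rows are genders, columns are churn labels
counts <- table(toy$Gender, toy$Churn)

# margin = 1 divides each cell by its row total, i.e. percent within gender
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)
row_pct

# Each row should add up to 100 percent
rowSums(row_pct)
```

Dropping group_by(Gender) from the dplyr pipeline would instead correspond to prop.table(counts) with no margin, where all four cells together add up to 100.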

Churn by Age Group

# Summarize churn counts by AgeGroup
churn_AgeGroup <- data %>%
  count(AgeGroup, Churn) %>%
  group_by(AgeGroup) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

# Bar chart: AgeGroup vs Churn
ggplot(churn_AgeGroup, aes(x = AgeGroup, y = n, fill = Churn)) +
  geom_col(position = "dodge") +  
  geom_text(aes(label = paste0(Percent, "%")), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5) +
  labs(
    title = "Churn by Age Group",
    x = "Age Group",
    y = "Count of Customers",
    fill = "Churn"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("No" = "#DD27F5", "Yes" = "#27F55E"))

The Distribution of Subscription Type

# Summarize subscription counts and percentages
subscription_counts <- data %>%
  count(`Subscription Type`) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

# Bar plot
ggplot(subscription_counts, aes(x = `Subscription Type`, y = n, fill = `Subscription Type`)) +
  geom_col() +
  geom_text(aes(label = paste0(Percent, "%")), vjust = -0.5) +
  labs(title = "Distribution of Subscription Types",
       x = "Subscription Type",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("Basic"   = "tomato",
                             "Premium" = "skyblue",
                             "Standard"    = "seagreen"))

The Distribution of Contract Length

# Summarize contract length counts and percentages
contract_counts <- data %>%
  count(`Contract Length`) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

# Bar plot
ggplot(contract_counts, aes(x = `Contract Length`, y = n, fill = `Contract Length`)) +
  geom_col() +
  geom_text(aes(label = paste0(Percent, "%")), vjust = -0.5) +
  labs(title = "Distribution of Contract Length",
       x = "Contract Length",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_fill_manual(values = c("Monthly"   = "#2A27F5",
                             "Quarterly" = "#27F5F2",
                             "Annual"    = "#27F549"))

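# Numeric variables for the correlation matrix
numeric_variable = data %>%
  select(Age, Tenure, `Usage Frequency`, `Support Calls`, 
         `Payment Delay`, `Total Spend`, `Last Interaction`)
# Data for Distribution and Correlation analysis:
# a random sample of 1,000 rows keeps the pairs plot readable and quick to draw
set.seed(123)  # arbitrary seed so the same sample is drawn on every run
df <- data %>%
  select(Age, `Subscription Type`, Tenure, `Usage Frequency`, 
         `Payment Delay`, `Total Spend`, `Last Interaction`) %>%
  sample_n(1000)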

Distribution and Correlation Plot

ggpairs(df, ggplot2::aes(colour=`Subscription Type`)) 

# Calculate the correlation matrix
cor_matrix = cor(numeric_variable)

# Create the correlogram 
corrplot(cor_matrix, 
         type = "upper", 
         method = "square", 
         addCoef.col = "black", 
         tl.col = "black", tl.srt = 45,
         col = brewer.pal(n = 8, name = "YlOrRd"))  

# Same correlogram drawn with a custom color palette
corrplot(cor_matrix,
         type = "upper",
         method = "square",
         addCoef.col = "black",
         tl.col = "black", tl.srt = 45,
         col = colorRampPalette(c("deeppink", "orange", "gold", "lightgreen", "skyblue"))(200))
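A correlogram is easy to scan visually, but it can also help to list the variable pairs ranked by the strength of their correlation. The sketch below shows one way to do that in base R on simulated data; the data frame `toy`, its columns `a`, `b`, `c`, and the seed are all assumptions made for illustration. The same steps should apply to cor_matrix.

```r
set.seed(42)  # arbitrary seed so the simulated data is reproducible
n <- 200
x <- rnorm(n)
toy <- data.frame(
  a = x,
  b = x + rnorm(n, sd = 0.1),  # built to be strongly correlated with a
  c = rnorm(n)                 # independent noise
)

# cor() computes Pearson correlations by default
cor_toy <- cor(toy)

# Keep each pair once: positions in the upper triangle, excluding the diagonal
idx <- which(upper.tri(cor_toy), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_toy)[idx[, "row"]],
  var2 = colnames(cor_toy)[idx[, "col"]],
  r    = cor_toy[idx]
)

# Sort by absolute correlation, strongest pair first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```

Sorting on the absolute value keeps strong negative correlations near the top, which a plain descending sort would push to the bottom.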