Introduction

Marketing analytics refers to the study of customer data to evaluate and devise marketing activities, and it has been widely adopted by businesses across the globe, particularly in the telecommunications industry. In-depth analysis of customer data is used to identify the main drivers of consumer behaviour, helping companies enhance their marketing strategies and maximise the return on investment of their marketing spend (Cote, 2021).

In this project, a real corporate dataset from a pilot campaign launched by a telecommunications company, herein referred to as ZenTel (a pseudonym used for confidentiality), was used. ZenTel is one of the largest telecommunications companies in Malaysia and employs customer-centric solutions to facilitate seamless, consistent, and excellent customer experiences by delivering value-for-money offerings and rate plans to its customers.

Problem Statement

ZenTel had recently proposed migrating its existing telco platform to a more advanced platform to improve user and customer experiences. To optimise the platform migration, ZenTel launched an initiative called “Right Planning”, which aimed to migrate every customer’s old rate plan to a new rate plan at a lower cost and with better benefits. With “Right Planning”, ZenTel could enhance the customer experience through new rate plans offering better benefits, while retiring outdated rate plans and standardising the rate plan information stored in the new platform.

To evaluate the efficacy of the proposed campaign, the Base Management Team at ZenTel conducted a trial run with a selected group of customers before officially launching “Right Planning” to its seven million customers. The goal of this study was to evaluate the pilot campaign’s effectiveness in introducing new rate plans with superior benefits and offers, by understanding the behavioural patterns of customers with a propensity to take up the offer, while eliminating obsolete rate plans and standardising the rate plan information stored in the new platform.

Research Questions

  1. EDA: How effective is the “Right Planning” pilot campaign in terms of customer participation?
  2. EDA: What are the demographic and behavioural characteristics of customers who are more likely to opt-in for the “Right Planning” pilot campaign?
  3. Classification: How can we accurately predict which customers will opt-in for the “Right Planning” campaign based on their usage and revenue behaviours?
  4. Regression: How does the revenue generated before the “Right Planning” campaign correlate with the customers’ behaviours in terms of usage?

Research Objectives

  1. To evaluate the effectiveness of the “Right Planning” pilot campaign.
  2. To identify the campaign offer takers’ profile.
  3. To develop a classification model which accurately predicts if the customers will opt-in for the “Right Planning” campaign based on usage and revenue behaviours.
  4. To identify the relationship between customers’ behaviours in terms of usage and revenue generated before the campaign using multiple linear regression.

Dataset Description

  • A real-world dataset provided by the Base Management Team of one of the largest telecommunications companies in Malaysia.
  • A total of 7,272 records from a campaign that ran between 5th October 2022 and 11th October 2022.
  • The target base was a group of active prepaid Malaysian customers.
  • The dataset contained demographic information as well as usage and revenue activity before and after the campaign.

Methodology

This section displays the steps of data understanding, data cleaning, data pre-processing and data transformation before conducting data analysis and predictive modelling.

Part 1: Import and Load Libraries

# Install required packages
install.packages(c("readxl", "knitr", "caret", "pryr", "corrplot", "dplyr", "ggplot2", "coefplot", "infotheo", "lubridate", "randomForest", "Rcpp", "devtools", "tidyr", "reshape2", "cowplot", "e1071"))
# Load installed packages
library(readxl)
library(knitr)
library(caret)
library(pryr)
library(corrplot)
library(dplyr)
library(ggplot2)
library(coefplot)
library(infotheo)
library(lubridate)
library(randomForest)
library(Rcpp)
library(devtools)
library(tidyr)
library(reshape2)
library(cowplot)
library(e1071)

Part 2: Metadata & Dataset Overview (Before Data Cleaning)

The table below shows the metadata of the collected dataset.

# Load metadata 
metadata <- read_excel("/Users/kharshin/Campaign_Dataset_Description.xlsx")
kable(metadata)
Variable Description  DataType Unit
ID Customer ID String N/A
TENURE Customer duration with ZenTel since registration date Integer Months
AGE Customer age Integer Years
GENDER Customer gender String N/A
NATIONALITY Customer nationality String N/A
STATE Customer hometown (state) String N/A
STATUS_BEFORE Customer status before campaign launched. String N/A
STATUS_AFTER Customer status after campaign ended. String N/A
OFFER_TAKER Indicator for customers who opted-in the migration plan. Boolean N/A
OFFER_TAKE_UP_DT Date for customers who opted-in the migration plan. Date N/A
DATA_PURC_BEFORE Indicator for customer who purchased data before campaign launched. Boolean N/A
DATA_PURC_AFTER Indicator for customer who purchased data after campaign ended. Boolean N/A
DATA_CHRG_BEFORE  Total amount of data charged before campaign launched. Float RM
DATA_CHRG_AFTER Total amount of data charged after campaign ended. Float RM
DATA_USG_BEFORE Data usage before campaign launched. Float MB
DATA_USG_AFTER Data usage after campaign ended. Float MB
VOICE_USG_BEFORE Voice usage before campaign launched. Float Minutes
VOICE_USG_AFTER Voice usage after campaign ended. Float Minutes
RLD_IND_BEFORE Indicator for customer who reload before campaign launched. Boolean N/A
RLD_IND_AFTER Indicator for customer who reload after campaign ended. Boolean N/A
RLD_AMT_BEFORE Total of reload amount before campaign launched. Float RM
RLD_AMT_AFTER Total of reload amount after campaign ended. Float RM
ARPU_BEFORE ARPU before campaign launched. Float RM
CPA_RVN_BEFORE Total added value service before campaign launched. Float RM
CPA_RVN_AFTER Total added value service after campaign ended. Float RM
ARPU_AFTER ARPU after campaign ended. Float RM
ACTIVITY_DAYS_AFTER  Silent days after campaign ended. Integer Days
ACTIVITY_STATUS_AFTER Customer activity status after campaign ended. String N/A

The table below shows the first five rows of the collected dataset.

# Load dataset
dataset <- read_excel("/Users/kharshin/Dataset.xlsx")
head(dataset, n = 5)
## # A tibble: 5 × 27
##   TENURE   AGE GENDER NATIONALITY STATE   STATUS_BEFORE STATUS_AFTER OFFER_TAKER
##    <dbl> <dbl> <chr>  <chr>       <chr>   <chr>         <chr>        <chr>      
## 1     90    28 Female Malaysia    JOHORE  Active        Active       Y          
## 2    204    81 Male   Malaysia    PAHANG  Active        Active       Y          
## 3    120    84 Male   Malaysia    KLANG … Active        Active       N          
## 4    199    82 Male   Malaysia    PAHANG  Active        Active       Y          
## 5     30 -9999 ?      Malaysia    SABAH   Active        Active       N          
## # ℹ 19 more variables: OFFER_TAKE_UP_DT <chr>, DATA_PURC_BEFORE <chr>,
## #   DATA_PURC_AFTER <chr>, DATA_CHRG_BEFORE <chr>, DATA_CHRG_AFTER <chr>,
## #   DATA_USG_BEFORE <dbl>, DATA_USG_AFTER <dbl>, VOICE_USG_BEFORE <dbl>,
## #   VOICE_USG_AFTER <dbl>, RLD_IND_BEFORE <chr>, RLD_IND_AFTER <chr>,
## #   RLD_AMT_BEFORE <chr>, RLD_AMT_AFTER <chr>, CPA_RVN_BEFORE <chr>,
## #   CPA_RVN_AFTER <chr>, ARPU_BEFORE <chr>, ARPU_AFTER <chr>,
## #   ACTVIITY_DAYS_AFTER <dbl>, ACTIVITY_STATUS_AFTER <chr>


This section displays the information of:

  1. Number of columns and rows

cat("Number of columns:", ncol(dataset), "\nNumber of rows:", nrow(dataset))
## Number of columns: 27 
## Number of rows: 7272

A total of 27 columns and 7,272 records.


  2. Dataset Structure
str(dataset)
## tibble [7,272 × 27] (S3: tbl_df/tbl/data.frame)
##  $ TENURE               : num [1:7272] 90 204 120 199 30 16 86 34 12 35 ...
##  $ AGE                  : num [1:7272] 28 81 84 82 -9999 ...
##  $ GENDER               : chr [1:7272] "Female" "Male" "Male" "Male" ...
##  $ NATIONALITY          : chr [1:7272] "Malaysia" "Malaysia" "Malaysia" "Malaysia" ...
##  $ STATE                : chr [1:7272] "JOHORE" "PAHANG" "KLANG VALLEY" "PAHANG" ...
##  $ STATUS_BEFORE        : chr [1:7272] "Active" "Active" "Active" "Active" ...
##  $ STATUS_AFTER         : chr [1:7272] "Active" "Active" "Active" "Active" ...
##  $ OFFER_TAKER          : chr [1:7272] "Y" "Y" "N" "Y" ...
##  $ OFFER_TAKE_UP_DT     : chr [1:7272] "44691" "44691" "?" "44722" ...
##  $ DATA_PURC_BEFORE     : chr [1:7272] "Y" "N" "N" "Y" ...
##  $ DATA_PURC_AFTER      : chr [1:7272] "N" "N" "N" "N" ...
##  $ DATA_CHRG_BEFORE     : chr [1:7272] "15" "0" "0" "15" ...
##  $ DATA_CHRG_AFTER      : chr [1:7272] "0" "0" "0" "0" ...
##  $ DATA_USG_BEFORE      : num [1:7272] 0 0 0 0 6.53 ...
##  $ DATA_USG_AFTER       : num [1:7272] 0 0 0 0 1.02 ...
##  $ VOICE_USG_BEFORE     : num [1:7272] 732.4 49.9 17.1 16.2 14.2 ...
##  $ VOICE_USG_AFTER      : num [1:7272] 1535.917 0.667 0 8.317 0 ...
##  $ RLD_IND_BEFORE       : chr [1:7272] "N" "N" "N" "N" ...
##  $ RLD_IND_AFTER        : chr [1:7272] "Y" "N" "N" "N" ...
##  $ RLD_AMT_BEFORE       : chr [1:7272] "?" "?" "?" "?" ...
##  $ RLD_AMT_AFTER        : chr [1:7272] "5" "?" "?" "?" ...
##  $ CPA_RVN_BEFORE       : chr [1:7272] "?" "?" "?" "?" ...
##  $ CPA_RVN_AFTER        : chr [1:7272] "?" "?" "?" "?" ...
##  $ ARPU_BEFORE          : chr [1:7272] "?" "13.65" "4.32" "4.68" ...
##  $ ARPU_AFTER           : chr [1:7272] "5.15" "0.26" "?" "2.73" ...
##  $ ACTVIITY_DAYS_AFTER  : num [1:7272] 0 1 5 5 0 9 0 33 11 1 ...
##  $ ACTIVITY_STATUS_AFTER: chr [1:7272] "DURING & AFTER CAMP" "DURING & AFTER CAMP" "DURING & AFTER CAMP" "DURING & AFTER CAMP" ...

There were a total of 7 numerical columns and 20 categorical columns. However, several numerical columns were misclassified as categorical because blank or missing values were recorded as the string ?, so R read the entire column as character. The columns affected were RLD_AMT_BEFORE, RLD_AMT_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, ARPU_BEFORE, ARPU_AFTER, DATA_CHRG_BEFORE and DATA_CHRG_AFTER. Meanwhile, OFFER_TAKE_UP_DT was stored as a character column of Excel serial numbers although it was a date variable.


  3. Count of missing values within the columns
# Check missing values in each column and print the names of columns that contain NA values
missing_counts <- sapply(dataset, function(x) sum(is.na(x)))
for (col_name in names(missing_counts)) {
  if (missing_counts[col_name] > 0) {
    cat("Column:", col_name, "has", missing_counts[col_name], "missing value(s).\n")
  }
}

No NA values were found. However, based on the dataset structure shown above, missing values were encoded as the string ? within several columns. Hence, data cleaning was required.
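As a quick sketch (not part of the original workflow), the ? placeholders could be counted per column to confirm where they occur:

# Illustrative check: count cells holding the string "?" in each column
question_mark_counts <- sapply(dataset, function(x) sum(x == "?", na.rm = TRUE))
question_mark_counts[question_mark_counts > 0]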


Part 3: Exploratory Data Analysis

This section displays the distribution plots for all columns of the dataset. Violin plots were used for numerical columns and bar charts for categorical columns.

# Create function to perform data profiling on a data frame
# Input: x - data frame
# Output: Prints plots for numeric columns and categorical columns
data_profiling = function(x) {
  
  # Iterate over column names
  for (col_name in colnames(x)) {
    
    # Check if column is numeric
    if (is.numeric(x[[col_name]])) {
      
      # Generate violin plot for numeric columns
      p = ggplot(x, aes(x = col_name, y = x[[col_name]], fill = col_name)) +
        geom_violin() +
        labs(x = "Column", y = "Range", title = paste("Violin Plot of", col_name)) +
        theme_minimal() 
      print(p)
      
    }
    
    # Column is not numeric (assumed categorical)
    else {
      # Calculate counts for each category
      counts = table(x[[col_name]])
      
      # Create a data frame for plotting
      df_counts = data.frame(category = names(counts), count = as.numeric(counts))
      
      # Generate bar plot for categorical columns
      p = ggplot(df_counts, aes(x = count, y = category)) +
        geom_col(fill = "steelblue") +
        labs(x = "Count", y = "Category", title = paste("Bar Plot of", col_name)) +
        theme_minimal() 
      print(p)
    }
  }
}

# Display graphs
data_profiling(dataset)

Based on the graphs shown, it could be observed that:

- AGE contained a minimum value of -9999, which is abnormal since human age should fall between 0 and 100 with no negative values. This occurred because missing customer ages, caused by system or human error, were filled in with -9999.

- ARPU_BEFORE, ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, DATA_CHRG_BEFORE, DATA_CHRG_AFTER, RLD_AMT_BEFORE, and RLD_AMT_AFTER were confirmed to be numerical rather than categorical. Upon checking, these columns were filled with ? instead of 0, which caused them to be wrongly tagged as class variables.

- STATE contained inconsistent values. Malaysia comprises 13 states and 3 federal territories, but the graph showed 20 unique values.

- GENDER should have 2 unique values, male and female, but there were 4 different values.

- OFFER_TAKE_UP_DT consisted of both null values and date values.

AGE, GENDER, STATE, ARPU_BEFORE, ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, DATA_CHRG_BEFORE, DATA_CHRG_AFTER, RLD_AMT_BEFORE, RLD_AMT_AFTER and OFFER_TAKE_UP_DT therefore required data cleaning and pre-processing to ensure clean data were used for further analysis and modelling.
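The anomalies listed above could also be verified numerically; the following is an illustrative sketch (assumed helper code, not part of the original analysis):

# Quantify the anomalies observed in the profiling plots
sum(dataset$AGE == -9999)       # count of records with the -9999 age placeholder
unique(dataset$GENDER)          # expected "Male" and "Female" only
length(unique(dataset$STATE))   # expected at most 16 (13 states + 3 federal territories)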


Part 4: Data Cleaning

This section displays the data cleaning steps for the four data quality issues (incomplete data, noisy data, inconsistent data, and intentional data) found in the dataset, as mentioned in the Exploratory Data Analysis section.

  1. Modify intentional data - AGE
# Remove records of 'AGE' with value '-9999'
dataset <- subset(dataset, AGE != -9999)

Records with a value of -9999 represented missing customer ages. These records were removed as they did not provide any useful information.


  2. Modify incomplete data - Replace ? with NA, then fill with 0
# replace "?" to "NA"
dataset <- replace(dataset, dataset=='?', NA)

# replace "NA" with 0
dataset$ARPU_BEFORE = replace(dataset$ARPU_BEFORE, is.na(dataset$ARPU_BEFORE), 0)
dataset$ARPU_AFTER = replace(dataset$ARPU_AFTER, is.na(dataset$ARPU_AFTER), 0)
dataset$CPA_RVN_BEFORE = replace(dataset$CPA_RVN_BEFORE, is.na(dataset$CPA_RVN_BEFORE), 0)
dataset$CPA_RVN_AFTER = replace(dataset$CPA_RVN_AFTER, is.na(dataset$CPA_RVN_AFTER), 0)
dataset$RLD_AMT_BEFORE = replace(dataset$RLD_AMT_BEFORE, is.na(dataset$RLD_AMT_BEFORE), 0)
dataset$RLD_AMT_AFTER = replace(dataset$RLD_AMT_AFTER, is.na(dataset$RLD_AMT_AFTER), 0)
dataset$DATA_CHRG_BEFORE = replace(dataset$DATA_CHRG_BEFORE, is.na(dataset$DATA_CHRG_BEFORE), 0)
dataset$DATA_CHRG_AFTER = replace(dataset$DATA_CHRG_AFTER, is.na(dataset$DATA_CHRG_AFTER), 0)

Columns mistagged as categorical due to ? values had those values replaced first with NA and then with 0; the columns were later converted to a numeric data type. The value 0 was chosen to replace the unknown values based on the business logic that a null appears when there is no mapping result.
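An equivalent, more compact form of the same replacement (a sketch only, not the code used above) could rely on dplyr::across() with tidyr::replace_na(); the columns are still of character type at this point, so the fill value is the string "0" and the numeric conversion happens in a later step:

# Sketch: fill NA values in the affected character columns with "0" in one call
cols_to_fill <- c("ARPU_BEFORE", "ARPU_AFTER", "CPA_RVN_BEFORE", "CPA_RVN_AFTER",
                  "RLD_AMT_BEFORE", "RLD_AMT_AFTER", "DATA_CHRG_BEFORE", "DATA_CHRG_AFTER")
dataset <- dataset %>%
  mutate(across(all_of(cols_to_fill), ~ replace_na(.x, "0")))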


  3. Modify inconsistent data - STATE
# Standardize the name for the states
dataset$STATE[dataset$STATE == "JOHORE"] = "JOHOR"
dataset$STATE[dataset$STATE == "KLANG VALLEY"] = "WILAYAH PERSEKUTUAN"
dataset$STATE[dataset$STATE == "MALACCA"] = "MELAKA"
dataset$STATE[dataset$STATE == "N SEMBILAN"] = "NEGERI SEMBILAN"
dataset$STATE[dataset$STATE == "PULAU PINANG"] = "PENANG"
dataset$STATE[dataset$STATE == "SEREMBAN/MELAKA"] = "MELAKA"

State names were inconsistent because Malay and English names were mixed, creating duplicate values in the variable. The names were standardised so that each state appears under a single name.


  4. Modify inconsistent data - GENDER
dataset$GENDER = replace(dataset$GENDER, is.na(dataset$GENDER), "Unspecified")
dataset <- subset(dataset, GENDER != "Unspecified")

The values of Unspecified and ? (which had been replaced with NA) were grouped together as they both represented unknown gender. These records were removed from the dataset so that the GENDER variable contained only the values Male and Female.


  5. Modify incomplete data - OFFER_TAKE_UP_DT

    OFFER_TAKE_UP_DT, which was supposed to be in date format, was kept as a string data type: when the corresponding OFFER_TAKER value was No, the customer did not take up the offer and naturally no take-up date was recorded. This variable was removed as the date did not bring any relevant information to this study.
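For reference, if the take-up dates were ever needed, the Excel serial numbers stored as text (e.g. "44691") could be converted to calendar dates; this is a hypothetical conversion that was not applied in this study:

# Hypothetical conversion of Excel serial numbers to dates (not applied here)
take_up_dates <- as.Date(as.numeric(dataset$OFFER_TAKE_UP_DT), origin = "1899-12-30")
head(take_up_dates)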


  6. Standardise the numeric values to 2 decimal places
# Format the numeric into 2 decimal points
dataset$ARPU_BEFORE = round(as.numeric(dataset$ARPU_BEFORE), 2)
dataset$ARPU_AFTER = round(as.numeric(dataset$ARPU_AFTER), 2)
dataset$CPA_RVN_BEFORE = round(as.numeric(dataset$CPA_RVN_BEFORE), 2)
dataset$CPA_RVN_AFTER = round(as.numeric(dataset$CPA_RVN_AFTER), 2)
dataset$RLD_AMT_BEFORE = round(as.numeric(dataset$RLD_AMT_BEFORE), 2)
dataset$RLD_AMT_AFTER = round(as.numeric(dataset$RLD_AMT_AFTER), 2)
dataset$DATA_CHRG_BEFORE = round(as.numeric(dataset$DATA_CHRG_BEFORE), 2)
dataset$DATA_CHRG_AFTER = round(as.numeric(dataset$DATA_CHRG_AFTER), 2)


  7. Modify noisy data - Outliers

    The variables associated with this data quality issue were ARPU_BEFORE, ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, DATA_CHRG_BEFORE, DATA_CHRG_AFTER, RLD_AMT_BEFORE, RLD_AMT_AFTER, DATA_USG_BEFORE, DATA_USG_AFTER, VOICE_USG_BEFORE, VOICE_USG_AFTER, AGE and TENURE. Based on domain knowledge, these extreme values carried informative insights: they are legitimate observations rather than errors, so the outliers were kept in the dataset.
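To gauge how many extreme observations were being retained, an interquartile-range check could be run on any of these variables; the snippet below is an illustrative sketch (assumed helper code) using DATA_USG_BEFORE:

# Count values lying outside 1.5 * IQR for DATA_USG_BEFORE
q <- quantile(dataset$DATA_USG_BEFORE, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_value <- q[2] - q[1]
sum(dataset$DATA_USG_BEFORE < q[1] - 1.5 * iqr_value |
    dataset$DATA_USG_BEFORE > q[2] + 1.5 * iqr_value, na.rm = TRUE)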


  8. Remove columns that were meaningless and would not be used in the analysis

# Drop the "meaningless" columns 
dataset = subset(dataset, select = -c(NATIONALITY, STATUS_BEFORE, OFFER_TAKE_UP_DT))

# Backup the cleaned dataset for analysis 
cleaned_dataset <- dataset

# Identify the "AFTER" columns
cols_to_drop = grep("AFTER$", names(dataset), value = TRUE)

# Drop the "AFTER" columns
dataset[, cols_to_drop] = NULL

As mentioned previously, the campaign was launched between specific dates and the customers involved were active prepaid Malaysian customers. Hence, NATIONALITY, STATUS_BEFORE and OFFER_TAKE_UP_DT were irrelevant to the analysis. All variables with the suffix AFTER were removed, as they would only be needed when comparing pre-campaign and post-campaign performance.


Part 5: Dataset Overview (After Data Cleaning)

This section displays the data overview after data cleaning.

# Check missing values
colSums(is.na(dataset))
##           TENURE              AGE           GENDER            STATE 
##                0                0                0                0 
##      OFFER_TAKER DATA_PURC_BEFORE DATA_CHRG_BEFORE  DATA_USG_BEFORE 
##                0                0                0                0 
## VOICE_USG_BEFORE   RLD_IND_BEFORE   RLD_AMT_BEFORE   CPA_RVN_BEFORE 
##                0                0                0                0 
##      ARPU_BEFORE 
##                0
# Check data structure
str(dataset)
## tibble [7,157 × 13] (S3: tbl_df/tbl/data.frame)
##  $ TENURE          : num [1:7157] 90 204 120 199 16 86 34 12 35 56 ...
##  $ AGE             : num [1:7157] 28 81 84 82 82 82 85 85 88 86 ...
##  $ GENDER          : chr [1:7157] "Female" "Male" "Male" "Male" ...
##  $ STATE           : chr [1:7157] "JOHOR" "PAHANG" "WILAYAH PERSEKUTUAN" "PAHANG" ...
##  $ OFFER_TAKER     : chr [1:7157] "Y" "Y" "N" "Y" ...
##  $ DATA_PURC_BEFORE: chr [1:7157] "Y" "N" "N" "Y" ...
##  $ DATA_CHRG_BEFORE: num [1:7157] 15 0 0 15 15 0 0 0 0 0 ...
##  $ DATA_USG_BEFORE : num [1:7157] 0 0 0 0 0.0185 ...
##  $ VOICE_USG_BEFORE: num [1:7157] 732.4 49.9 17.1 16.2 13.6 ...
##  $ RLD_IND_BEFORE  : chr [1:7157] "N" "N" "N" "N" ...
##  $ RLD_AMT_BEFORE  : num [1:7157] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CPA_RVN_BEFORE  : num [1:7157] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ARPU_BEFORE     : num [1:7157] 0 13.65 4.32 4.68 4.4 ...
# Check first 5 rows of dataset
head(dataset, n = 5)
## # A tibble: 5 × 13
##   TENURE   AGE GENDER STATE        OFFER_TAKER DATA_PURC_BEFORE DATA_CHRG_BEFORE
##    <dbl> <dbl> <chr>  <chr>        <chr>       <chr>                       <dbl>
## 1     90    28 Female JOHOR        Y           Y                              15
## 2    204    81 Male   PAHANG       Y           N                               0
## 3    120    84 Male   WILAYAH PER… N           N                               0
## 4    199    82 Male   PAHANG       Y           Y                              15
## 5     16    82 Male   PERAK        Y           Y                              15
## # ℹ 6 more variables: DATA_USG_BEFORE <dbl>, VOICE_USG_BEFORE <dbl>,
## #   RLD_IND_BEFORE <chr>, RLD_AMT_BEFORE <dbl>, CPA_RVN_BEFORE <dbl>,
## #   ARPU_BEFORE <dbl>

There were a total of 7,157 records (after removing the intentional age values and unknown genders) with 13 selected variables for further analysis: TENURE, AGE, GENDER, STATE, DATA_PURC_BEFORE, DATA_CHRG_BEFORE, DATA_USG_BEFORE, VOICE_USG_BEFORE, RLD_IND_BEFORE, RLD_AMT_BEFORE, CPA_RVN_BEFORE, ARPU_BEFORE and OFFER_TAKER. No missing values were found and the variable types were now correct.


Part 6: Data Pre-Processing and Transformation

This section explains how the data were pre-processed before analysis and modelling. Categorical variables were transformed using one-hot encoding, and a heatmap was used to check the correlation between variables. Highly correlated variables were dropped, and the dataset was split into training (80%) and testing (20%) sets.

  1. Categorical variables such as GENDER and STATE were converted to numeric values by one-hot encoding, whereas DATA_PURC_BEFORE and RLD_IND_BEFORE were converted to numeric values using a remapping method.
# Replace column values of "Y" with 1 and "N" with 0 
dataset$DATA_PURC_BEFORE <- as.integer(ifelse(dataset$DATA_PURC_BEFORE == "Y", 1, ifelse(dataset$DATA_PURC_BEFORE == "N", 0, dataset$DATA_PURC_BEFORE)))
dataset$RLD_IND_BEFORE <- as.integer(ifelse(dataset$RLD_IND_BEFORE == "Y", 1, ifelse(dataset$RLD_IND_BEFORE == "N", 0, dataset$RLD_IND_BEFORE)))

# Apply one-hot encoding to the categorical variables
cols_to_encode = c("GENDER", "STATE")
one_hot = dummyVars(" ~ .", data = dataset[, cols_to_encode])
encoded_df = data.frame(predict(one_hot, newdata = dataset))

# Preview the first 5 rows
head(encoded_df, 5)
##   GENDERFemale GENDERMale STATEJOHOR STATEKEDAH STATEKELANTAN STATEMELAKA
## 1            1          0          1          0             0           0
## 2            0          1          0          0             0           0
## 3            0          1          0          0             0           0
## 4            0          1          0          0             0           0
## 5            0          1          0          0             0           0
##   STATENEGERI.SEMBILAN STATEPAHANG STATEPENANG STATEPERAK STATEPERLIS
## 1                    0           0           0          0           0
## 2                    0           1           0          0           0
## 3                    0           0           0          0           0
## 4                    0           1           0          0           0
## 5                    0           0           0          1           0
##   STATESABAH STATESARAWAK STATESELANGOR STATETERENGGANU
## 1          0            0             0               0
## 2          0            0             0               0
## 3          0            0             0               0
## 4          0            0             0               0
## 5          0            0             0               0
##   STATEWILAYAH.PERSEKUTUAN
## 1                        0
## 2                        0
## 3                        1
## 4                        0
## 5                        0

Although GENDER and STATE had undergone data transformation, they were not included in the modelling because demographic variables would bias the prediction output. For instance, it would be unreasonable for ZenTel to launch a campaign targeting specifically males or a specific state such as Sabah; the campaign should target every customer in the population.


  2. Select numerical variables and plot a heatmap to investigate the correlation between numerical variables
# Select numeric variables 
non_encoded_cols = setdiff(colnames(dataset), cols_to_encode)
non_encoded_list = dataset[, non_encoded_cols]
non_encoded_list = subset(non_encoded_list, select = -OFFER_TAKER)

# Create function to extract the upper triangle of a correlation matrix
# Input: cormat - correlation matrix
# Output: upper triangle of the correlation matrix with lower triangle set to NA
get_upper_tri = function(cormat){
  
  # Set lower triangle of the correlation matrix to NA
  cormat[lower.tri(cormat)] <- NA
  
  # Return the modified correlation matrix
  return(cormat)
}

# Plot heatmap 
cormat = round(cor(non_encoded_list),2)
upper_tri = get_upper_tri(cormat)
melted_cormat = melt(upper_tri, na.rm = TRUE)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))+
  coord_fixed()

Based on the correlation plot, CPA_RVN_BEFORE and RLD_AMT_BEFORE were each highly correlated with ARPU_BEFORE, and DATA_CHRG_BEFORE was highly correlated with DATA_PURC_BEFORE. Hence, CPA_RVN_BEFORE, RLD_AMT_BEFORE and DATA_CHRG_BEFORE were dropped. CPA revenue and reload amount both contribute to the value of ARPU: when these values increase, ARPU increases. In addition, based on domain knowledge, ARPU_BEFORE reflects the average spending of a customer and therefore carries the most meaningful information.
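The same selection could be automated; the snippet below is a sketch (not used in this study) based on caret::findCorrelation(), where the 0.7 cutoff is an assumption chosen for illustration:

# Sketch: flag columns to drop so that no pairwise correlation exceeds the cutoff
findCorrelation(cor(non_encoded_list), cutoff = 0.7, names = TRUE)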


  3. Remove highly correlated variables
# Remove high correlation columns 
final_df = subset(non_encoded_list, select=c(-CPA_RVN_BEFORE, -RLD_AMT_BEFORE,-DATA_CHRG_BEFORE))


  4. Assign target variable
# Assign target 
final_df$OFFER_TAKER = ifelse(dataset$OFFER_TAKER == "Y", 1, 0)


  5. Split the dataset into train and test set
# Set the random seed
set.seed(123)

# Split the data into train and test set 
train_idx = createDataPartition(final_df$OFFER_TAKER, p = 0.8, list = FALSE)
train_data = final_df[train_idx,]
test_data = final_df[-train_idx,]

# Check dimensions of train and test set 
cat(" Dimensions of train data:", dim(train_data), "\n", "Dimensions of test data:", dim(test_data), "\n")
##  Dimensions of train data: 5726 8 
##  Dimensions of test data: 1431 8

After data pre-processing, data transformation, feature selection and the train-test split, a total of 7 independent variables were used to train the models.


Results and Findings

This section displays the results and discussion for all four research objectives.

Objective 1: Effectiveness of “Right Planning” Pilot Campaign

The effectiveness of the “Right Planning” pilot campaign was evaluated from two aspects: the total count of offer takers among all customers, and the count of offer takers who stayed active after the pilot campaign.

# Count the frequencies of "Y" and "N" values in the column
value_counts <- table(cleaned_dataset$OFFER_TAKER)
value <- value_counts["Y"]
# Calculate the percentage of offer takers only - "Y" 
percentage_y <- (value_counts["Y"] / sum(value_counts)) * 100

# Count the frequencies of customers who remain active after opted-in for the pilot campaign
offertaker <- subset(cleaned_dataset, OFFER_TAKER == "Y")
value_counts <- table(offertaker$ACTIVITY_STATUS_AFTER)
# Calculate the sum of frequencies for the desired categories
sum_active <- sum(value_counts[c("DURING & AFTER CAMP", "BEFORE & AFTER CAMP", "DURING CAMP THEN REMAIN")])
# Calculate the percentage of active offer takers
percentage_active <- sum_active / sum(value_counts) * 100

# Display results
message <- paste(" Number of offer takers:", value, "\n", "Percentage of offer takers:", round(percentage_y, 2), "%\n", "Number of customers who remain active after opted-in for the pilot campaign:", sum_active, "\n", "Percentage of customers who remain active after opted-in for the pilot campaign:", round(percentage_active, 2), "%\n")
cat(message)
##  Number of offer takers: 4499 
##  Percentage of offer takers: 62.86 %
##  Number of customers who remain active after opted-in for the pilot campaign: 3047 
##  Percentage of customers who remain active after opted-in for the pilot campaign: 67.73 %

The above statistics indicated that the pilot campaign was moderately successful because:

- 4,499 out of 7,157 customers opted in for the new rate plans, giving an opt-in rate of 63%.

- 3,047 out of the 4,499 opted-in customers remained active after the campaign, giving an active rate of 68%.


Objective 2: Offer Takers’ Profile

The profile of offer takers was identified by observing the distribution plots grouped by offer taker. A more precise analysis could be carried out in the future by constructing pivot tables; a brief sketch of such a summary is shown after the EDA findings below.

data_profiling_after <- function(x, group_by_col) {
  
  # Iterate over column names
  for (col_name in colnames(x)) {
    
    # Check if column is numeric
    if (is.numeric(x[[col_name]])) {
      
      # Generate violin plot for numeric columns grouped by OFFER_TAKER
      p <- ggplot(x, aes(x = col_name, y = x[[col_name]], fill = col_name)) +
        geom_violin() +
        labs(x = "Column", y = "Range", title = paste("Violin Plot of", col_name)) +
        theme_minimal() +
        facet_wrap(~get(group_by_col))
      print(p)
      
    }
    
    # Column is not numeric (assumed categorical)
    else {
      # Calculate counts for each category grouped by OFFER_TAKER
      counts <- table(x[[col_name]], x[[group_by_col]])
      
      # Create a data frame for plotting
      df_counts <- data.frame(category = rownames(counts), count = as.numeric(counts), OFFER_TAKER = rep(colnames(counts), each = nrow(counts)))
      
      # Generate bar plot for categorical columns grouped by OFFER_TAKER
      p <- ggplot(df_counts, aes(x = count, y = category, fill = OFFER_TAKER)) +
        geom_col() +
        labs(x = "Count", y = "Category", title = paste("Bar Plot of", col_name)) +
        theme_minimal() +
        facet_wrap(~get(group_by_col))
      print(p)
    }
  }
}

# Display graphs grouped by OFFER_TAKER
data_profiling_after(cleaned_dataset, "OFFER_TAKER")

Based on the EDA, the customers who tended to take up the offer had the following characteristics:

- Campaign offer takers showed a higher opt-in rate among males compared to females.

- Campaign offer takers were mostly from the age group between 22 and 36 years old, with tenure of more than 1 year.

- Most of the campaign takers were from the Klang Valley, followed by Sabah and Sarawak.

- Most of the campaign takers had higher ARPU, higher CPA revenue, higher reload amounts, higher voice and data usage, and tended to purchase data plans.

- The majority of the customers remained active after the campaign, but 11 takers terminated their lines.
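As mentioned above, a pivot-style summary would make these observations more precise. The snippet below is a minimal dplyr sketch (not part of the original analysis) showing the opt-in rate by gender; the same pattern could be applied to state or age group:

# Sketch: opt-in rate by gender
cleaned_dataset %>%
  group_by(GENDER) %>%
  summarise(customers = n(),
            opt_in_rate = round(mean(OFFER_TAKER == "Y") * 100, 2)) %>%
  arrange(desc(opt_in_rate))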


Objective 3: Offer Takers’ Predictive Modelling

Three widely used classification models were chosen to predict which customers would likely opt in for the “Right Planning” campaign based on their usage and revenue behaviours: Random Forest Classifier, K-Nearest Neighbours (KNN) and Support Vector Machine (SVM). The best-performing model would be chosen as the prediction model for this study.

Accuracy was used as the evaluation metric since it is the most common metric for assessing the performance of a classification model, and the dataset was reasonably balanced between the two classes.
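As a quick sanity check (a sketch, not part of the original write-up), the class balance behind this choice could be inspected directly:

# Proportion of offer takers (1) versus non-takers (0) in the modelling dataset
prop.table(table(final_df$OFFER_TAKER))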

  1. Random Forest Classifier
# Set the random seed
set.seed(123)

# Train the Random Forest Classifier
classifier_RF = randomForest(x = train_data[, -ncol(train_data)],
                             y = as.factor(train_data$OFFER_TAKER),
                             ntree = 500,
                             importance = TRUE)

# Make predictions on the test data
preds = predict(classifier_RF, newdata = test_data[, -ncol(test_data)])

# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_RF = table(preds, test_data$OFFER_TAKER)
accuracy_RF = sum(diag(conf_mat_RF)) / sum(conf_mat_RF)
# Display output of confusion matrix and accuracy 
outputRF <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_RF)), collapse = "\n"), "\n", "Accuracy of Random Forest Classifier:", round(accuracy_RF * 100, 2), "%")
cat(outputRF)
## Confusion Matrix (Prediction):      
## preds   0   1
##     0 462  90
##     1 113 766 
##  Accuracy of Random Forest Classifier: 85.81 %

The Random Forest model achieved an accuracy of 85.81%.


  2. K-Nearest Neighbours (KNN)
# Set the random seed
set.seed(123)

# Train the KNN Classifier
classifier_KNN <- train(x = train_data[, -ncol(train_data)],
                        y = as.factor(train_data$OFFER_TAKER),
                        method = "knn",
                        trControl = trainControl(method = "none"),
                        preProcess = c("center", "scale"))

# Make predictions on the test data
preds_KNN <- predict(classifier_KNN, newdata = test_data[, -ncol(test_data)])

# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_KNN <- table(preds_KNN, test_data$OFFER_TAKER)
accuracy_KNN <- sum(diag(conf_mat_KNN)) / sum(conf_mat_KNN)
# Display output of confusion matrix and accuracy 
outputKNN <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_KNN)), collapse = "\n"), "\n", "Accuracy of K-Nearest Neighbour:", round(accuracy_KNN * 100, 2), "%")
cat(outputKNN)
## Confusion Matrix (Prediction):          
## preds_KNN   0   1
##         0 438 124
##         1 137 732 
##  Accuracy of K-Nearest Neighbour: 81.76 %

K-Nearest Neighbours model achieved an accuracy of 81.76%.


  3. Support Vector Machine (SVM)
# Set the random seed
set.seed(123)

# Train the SVM Classifier
classifier_SVM <- svm(x = train_data[, -ncol(train_data)],
                      y = as.factor(train_data$OFFER_TAKER),
                      kernel = "radial")

# Make predictions on the test data
preds_SVM <- predict(classifier_SVM, newdata = test_data[, -ncol(test_data)])

# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_SVM <- table(preds_SVM, test_data$OFFER_TAKER)
accuracy_SVM <- sum(diag(conf_mat_SVM)) / sum(conf_mat_SVM)
# Display output of confusion matrix and accuracy 
outputSVM <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_SVM)), collapse = "\n"), "\n", "Accuracy of Support Vector Machine", round(accuracy_SVM * 100, 2), "%")
cat(outputSVM)
## Confusion Matrix (Prediction):          
## preds_SVM   0   1
##         0 468 108
##         1 107 748 
##  Accuracy of Support Vector Machine 84.98 %

Support Vector Machine model achieved an accuracy of 84.98%.
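Accuracy could also be complemented with class-level metrics; the snippet below is a sketch (an assumed extension, not part of the original study) applying caret::confusionMatrix() to the Random Forest predictions:

# Sketch: sensitivity, specificity and kappa for the Random Forest predictions
confusionMatrix(preds, as.factor(test_data$OFFER_TAKER), positive = "1")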


Feature Importance for Random Forest Classifier

# Feature importance
imp_RF <- importance(classifier_RF)
features <- paste("Random Forest Classifier Feature Importance:\n", paste(capture.output(print(imp_RF)), collapse = "\n"))
cat(features)
## Random Forest Classifier Feature Importance:
##                            0           1 MeanDecreaseAccuracy MeanDecreaseGini
## TENURE             8.820628  24.3849919             24.13703        206.39523
## AGE                5.278590   9.2401210             11.41115        149.28534
## DATA_PURC_BEFORE 104.606113 157.6559667            135.19721       1087.75961
## DATA_USG_BEFORE   32.503856 -11.7541345             18.30836        233.48710
## VOICE_USG_BEFORE  16.258360   7.1245668             20.95345        104.26227
## RLD_IND_BEFORE    22.069320   2.9338180             21.51774         51.02151
## ARPU_BEFORE       52.064695  -0.1964525             50.16256        253.43854

Comparing the three classifiers, the Random Forest Classifier had the highest accuracy and was therefore selected as the best classification model for this study. A higher Mean Decrease Accuracy value indicates that a feature has a stronger influence on the accuracy of the model. The top 5 important features were DATA_PURC_BEFORE, ARPU_BEFORE, RLD_IND_BEFORE, TENURE and DATA_USG_BEFORE. These variables could be used to predict whether a customer is likely to opt in for the new offer.

A customer tended to take up the campaign offer if he or she had the following characteristics: a longer tenure, high average revenue, a reload before the campaign, a data plan purchase before the campaign and high data usage.
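The importance table could also be visualised; the call below is a sketch (not included in the original output) using randomForest::varImpPlot():

# Sketch: plot Mean Decrease Accuracy and Mean Decrease Gini for each feature
varImpPlot(classifier_RF, main = "Feature Importance - Random Forest Classifier")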


Objective 4: Relationship between Customers’ Behaviour (Usage) and Revenue Generated

This section displays the relationship between the customers’ behaviours in terms of usage and revenue generated before the campaign using multiple linear regression.

  1. Select usage features and revenue feature
# Select relevant feature
reg_final_df = subset(final_df, select = c(DATA_USG_BEFORE, ARPU_BEFORE, VOICE_USG_BEFORE))


  2. Split the dataset into train and test set
# Set the random seed
set.seed(123)

# Split the data into train and test set 
train_idx = createDataPartition(reg_final_df$ARPU_BEFORE, p = 0.8, list = FALSE)
reg_train_data = reg_final_df[train_idx,]
reg_test_data = reg_final_df[-train_idx,]

# Check dimensions of train and test set 
cat(" Dimensions of train data:", dim(reg_train_data), "\n", "Dimensions of test data:", dim(reg_test_data), "\n")
##  Dimensions of train data: 5726 3 
##  Dimensions of test data: 1431 3


  3. Perform scaling on the train and test set and assign target for regression model
# Set the random seed
set.seed(123)

# Scaling
train_data_scaled = reg_train_data %>%
  select(-ARPU_BEFORE) %>%
  scale() %>%
  as.data.frame() %>%
  cbind(reg_train_data$ARPU_BEFORE)

colnames(train_data_scaled)[3] = "target"


  4. Linear Regression
# Train the Regression Model
lm_model = lm(target ~ ., data = train_data_scaled)

# Prepare the test data with the same scaling and column naming as the train data
test_data_scaled = reg_test_data %>%
  select(-ARPU_BEFORE) %>%
  scale() %>%
  as.data.frame() %>%
  cbind(reg_test_data$ARPU_BEFORE)
colnames(test_data_scaled)[3] = "target"

# Make predictions on the test data
predictions = predict(lm_model, newdata = test_data_scaled)


Multiple Linear Regression Model Performance

# Result
summary(lm_model)
## 
## Call:
## lm(formula = target ~ ., data = train_data_scaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -63.14   -9.51   -9.51   -6.06 1495.77 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.2951     0.8876  11.599  < 2e-16 ***
## DATA_USG_BEFORE    2.4578     0.8926   2.754  0.00591 ** 
## VOICE_USG_BEFORE   0.2941     0.8926   0.329  0.74180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.17 on 5723 degrees of freedom
## Multiple R-squared:  0.00139,    Adjusted R-squared:  0.001041 
## F-statistic: 3.984 on 2 and 5723 DF,  p-value: 0.01866

Coefficients: Since the predictors were standardised, a one standard-deviation increase in data usage (DATA_USG_BEFORE) was associated with an expected increase of about RM2.46 in revenue (ARPU_BEFORE). DATA_USG_BEFORE was statistically significant as its p-value was below the 0.05 alpha level, whereas VOICE_USG_BEFORE was not statistically significant as its p-value was above 0.05.

R-squared: The R-squared of 0.00139 indicated that only about 0.14% of the variability in the target variable was explained by the predictors.

F-statistic and p-value: The F-statistic of 3.984 with a p-value of 0.01866 suggested that there was some, albeit weak, evidence of a linear relationship between the predictors and the target variable.


Multiple Linear Regression Model Performance on Test Set

# Result
RMSE = sqrt(mean((predictions - test_data_scaled$target)^2))
R2 = cor(predictions, test_data_scaled$target)^2
cat("Performance of Linear Regression:\n\nRMSE:", round(RMSE, 2), "\nR-squared:", round(R2, 5), "\n")
## Performance of Linear Regression:
## 
## RMSE: 75.65 
## R-squared: 0.00095

RMSE: The RMSE of 75.65 suggested that, on average, the model's predictions were off by approximately RM75.65.

R-squared: The R-squared value of 0.00095 suggested that the independent variables explained a very small portion (approximately 0.095%) of the variance in the dependent variable for the test set.

In other words, this model had very little power to explain the variability of the data, as the R-squared was very low and VOICE_USG_BEFORE was not statistically significant. This suggests that customers' usage behaviour alone is not a determinant of the revenue generated, which is plausible because revenue comes not only from usage but also from other sources such as roaming passes and value-added services. When the Base Management Team designs a new offer, its structure should not rely solely on data usage and voice usage to drive revenue. Further analysis might be required to improve the model's performance or to explore additional predictors.
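As a possible follow-up (a sketch, not part of the original study), the base-R diagnostic plots for the fitted model could be inspected; they would highlight the heavy right tail in the residuals noted in the summary above:

# Sketch: residual diagnostic plots for the multiple linear regression model
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))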


Conclusion

In conclusion, the research objectives of this study were achieved by using several statistical methods.

First, the effectiveness of the pilot campaign was evaluated: the statistics showed that the pilot campaign was moderately successful, with a 63% take-up rate and a 68% active rate among the offer takers.

Secondly, the campaign offer takers' profile was identified. Offer takers were mainly male, from the age group between 22 and 36 years old, and had a tenure of more than 1 year. Most of them were from the Klang Valley, followed by Sabah and Sarawak. The majority of the offer takers had higher ARPU, higher CPA revenue, higher reload amounts, higher voice and data usage, and tended to purchase data plans.

Thirdly, a classification model was developed to predict whether customers would opt in for the “Right Planning” campaign based on usage and revenue behaviours. The Random Forest Classifier, with an accuracy of 85.81%, was chosen as the predictive model because it had the highest accuracy of the three models compared. The top 5 important features were DATA_PURC_BEFORE, ARPU_BEFORE, RLD_IND_BEFORE, TENURE and DATA_USG_BEFORE.

Last but not least, the relationship between customers' usage behaviour and the revenue generated before the campaign was investigated using multiple linear regression. The results showed that revenue was affected by data usage but not voice usage, and that the usage variables explained only a small portion of the variance in revenue. Other factors, such as roaming pass purchases and value-added services, should be considered if the Base Management Team would like to increase revenue.


Citation

Cote, C. (2021, January 21). What is Marketing Analytics? Harvard Business School Online. Retrieved May 20, 2023, from https://online.hbs.edu/blog/post/what-is-marketing-analytics