Marketing analytics refers to the study of customer data to evaluate and devise marketing activities which have been widely incorporated by businesses across the globe especially in the telecommunications industry. In depth, analysis of the customer data is used to determine the main factors of consumer action, leading to enhancing the company marketing strategies and maximising return on investments from the wonders of their marketing analytics (Cote, 2021).
In this project, a real corporate dataset of a pilot campaign launched by a telecommunication company was used, herein referred to as ZenTel (pseudonym due to confidentiality). ZenTel is one of the largest telecommunication companies in Malaysia which has been employing customer-centric solutions to facilitate seamless, consistent, and excellent customer experiences by delivering the best value for money offerings and rate plans to its customers.
ZenTel had recently proposed to migrate their existing telco platform to a more advanced platform to improve user and customer experiences. To optimise the platform migration, ZenTel had decided to come up with an initiative called “Right Planning” that aimed in migrating all customers’ old rate plan to a new rate plan at a lower cost and better benefits. With “Right Planning”, ZenTel could enlighten customer experience with new rate plans bringing better benefits and offers while withdrawing the outdated rate plans and standardising the rate plans information stored in the new platform.
To evaluate the efficacy of the proposed campaign, the Base Management Team at ZenTel conducted a trial run by selecting a group of customers before officially launching the “Right Planning” to their seven million customers.The goal of this study was to evaluate the pilot campaign effectiveness by introducing new rate plans which provided superior benefits and offers through understanding patterns of customer behaviour with propensity to taking up the offer, while eliminating obsolete rate plans and establishing uniformity in the information regarding rate plans stored in the new platform.
This section displays the steps of data understanding, data cleaning, data pre-processing and data transformation before conducting data analysis and predictive modelling.
# Install required packages
install.packages(c("readxl", "caret", "pryr", "corrplot", "dplyr", "ggplot2", "coefplot", "infotheo", "lubridate", "randomForest", "Rcpp", "devtools", "tidyr", "reshape2", "cowplot","e1071"))
# Load installed packages
library(readxl)
library(knitr)
library(caret)
library(pryr)
library(corrplot)
library(dplyr)
library(ggplot2)
library(coefplot)
library(infotheo)
library(lubridate)
library(randomForest)
library(Rcpp)
library(devtools)
library(tidyr)
library(reshape2)
library(cowplot)
library(e1071)
Table below shows the metadata of the collected dataset.
# Load metadata
metadata <- read_excel("/Users/kharshin/Campaign_Dataset_Description.xlsx")
kable(metadata)
| Variable | Description | DataType | Unit |
|---|---|---|---|
| ID | Customer ID | String | N/A |
| TENURE | Customer duration with Kation since registration date | Integer | Months |
| AGE | Customer age | Integer | Years |
| GENDER | Customer gender | String | N/A |
| NATIONALITY | Customer nationality | String | N/A |
| STATE | Customer hometown (state) | String | N/A |
| STATUS_BEFORE | Customer status before campaign launched. | String | N/A |
| STATUS_AFTER | Customer status after campaign ended. | String | N/A |
| OFFER_TAKER | Indicator for customers who opted-in the migration plan. | Boolean | N/A |
| OFFER_TAKE_UP_DT | Date for customers who opted-in the migration plan. | Date | N/A |
| DATA_PURC_BEFORE | Indicator for customer who purchased data before campaign launched. | Boolean | N/A |
| DATA_PURC_AFTER | Indicator for customer who purchased data after campaign ended. | Boolean | N/A |
| DATA_CHRG_BEFORE | Total amount of data charged before campaign launched. | Float | RM |
| DATA_CHRG_AFTER | Total amount of data charged after campaign ended. | Float | RM |
| DATA_USG_BEFORE | Data usage before campaign launched. | Float | MB |
| DATA_USG_AFTER | Data usage after campaign ended. | Float | MB |
| VOICE_USG_BEFORE | Voice usage before campaign launched. | Float | Minutes |
| VOICE_USG_AFTER | Voice usage after campaign ended. | Float | Minutes |
| RLD_IND_BEFORE | Indicator for customer who reload before campaign launched. | Boolean | N/A |
| RLD_IND_AFTER | Indicator for customer who reload after campaign ended. | Boolean | N/A |
| RLD_AMT_BEFORE | Total of reload amount before campaign launched. | Float | RM |
| RLD_AMT_AFTER | Total of reload amount after campaign ended. | Float | RM |
| ARPU_BEFORE | ARPU before campaign launched. | Float | RM |
| CPA_RVN_BEFORE | Total added value service before campaign launched. | Float | RM |
| CPA_RVN_AFTER | Total added value service after campaign ended. | Float | RM |
| ARPU_AFTER | ARPU after campaign ended. | Float | RM |
| ACTIVITY_DAYS_AFTER | Silent days after campaign ended. | Integer | Days |
| ACTIVITY_STATUS_AFTER | Customer activity status after campaign ended. | String | N/A |
Table below shows the first five rows of the collected dataset.
# Load dataset
dataset <- read_excel("/Users/kharshin/Dataset.xlsx")
head(dataset, n = 5)
## # A tibble: 5 × 27
## TENURE AGE GENDER NATIONALITY STATE STATUS_BEFORE STATUS_AFTER OFFER_TAKER
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 90 28 Female Malaysia JOHORE Active Active Y
## 2 204 81 Male Malaysia PAHANG Active Active Y
## 3 120 84 Male Malaysia KLANG … Active Active N
## 4 199 82 Male Malaysia PAHANG Active Active Y
## 5 30 -9999 ? Malaysia SABAH Active Active N
## # ℹ 19 more variables: OFFER_TAKE_UP_DT <chr>, DATA_PURC_BEFORE <chr>,
## # DATA_PURC_AFTER <chr>, DATA_CHRG_BEFORE <chr>, DATA_CHRG_AFTER <chr>,
## # DATA_USG_BEFORE <dbl>, DATA_USG_AFTER <dbl>, VOICE_USG_BEFORE <dbl>,
## # VOICE_USG_AFTER <dbl>, RLD_IND_BEFORE <chr>, RLD_IND_AFTER <chr>,
## # RLD_AMT_BEFORE <chr>, RLD_AMT_AFTER <chr>, CPA_RVN_BEFORE <chr>,
## # CPA_RVN_AFTER <chr>, ARPU_BEFORE <chr>, ARPU_AFTER <chr>,
## # ACTVIITY_DAYS_AFTER <dbl>, ACTIVITY_STATUS_AFTER <chr>
This section displays the information of:
i. Number of
columns and rows
cat("Number of columns:", ncol(dataset), "\nNumber of rows:", nrow(dataset))
## Number of columns: 27
## Number of rows: 7272
A total of 27 columns and 7,272 records.
str(dataset)
## tibble [7,272 × 27] (S3: tbl_df/tbl/data.frame)
## $ TENURE : num [1:7272] 90 204 120 199 30 16 86 34 12 35 ...
## $ AGE : num [1:7272] 28 81 84 82 -9999 ...
## $ GENDER : chr [1:7272] "Female" "Male" "Male" "Male" ...
## $ NATIONALITY : chr [1:7272] "Malaysia" "Malaysia" "Malaysia" "Malaysia" ...
## $ STATE : chr [1:7272] "JOHORE" "PAHANG" "KLANG VALLEY" "PAHANG" ...
## $ STATUS_BEFORE : chr [1:7272] "Active" "Active" "Active" "Active" ...
## $ STATUS_AFTER : chr [1:7272] "Active" "Active" "Active" "Active" ...
## $ OFFER_TAKER : chr [1:7272] "Y" "Y" "N" "Y" ...
## $ OFFER_TAKE_UP_DT : chr [1:7272] "44691" "44691" "?" "44722" ...
## $ DATA_PURC_BEFORE : chr [1:7272] "Y" "N" "N" "Y" ...
## $ DATA_PURC_AFTER : chr [1:7272] "N" "N" "N" "N" ...
## $ DATA_CHRG_BEFORE : chr [1:7272] "15" "0" "0" "15" ...
## $ DATA_CHRG_AFTER : chr [1:7272] "0" "0" "0" "0" ...
## $ DATA_USG_BEFORE : num [1:7272] 0 0 0 0 6.53 ...
## $ DATA_USG_AFTER : num [1:7272] 0 0 0 0 1.02 ...
## $ VOICE_USG_BEFORE : num [1:7272] 732.4 49.9 17.1 16.2 14.2 ...
## $ VOICE_USG_AFTER : num [1:7272] 1535.917 0.667 0 8.317 0 ...
## $ RLD_IND_BEFORE : chr [1:7272] "N" "N" "N" "N" ...
## $ RLD_IND_AFTER : chr [1:7272] "Y" "N" "N" "N" ...
## $ RLD_AMT_BEFORE : chr [1:7272] "?" "?" "?" "?" ...
## $ RLD_AMT_AFTER : chr [1:7272] "5" "?" "?" "?" ...
## $ CPA_RVN_BEFORE : chr [1:7272] "?" "?" "?" "?" ...
## $ CPA_RVN_AFTER : chr [1:7272] "?" "?" "?" "?" ...
## $ ARPU_BEFORE : chr [1:7272] "?" "13.65" "4.32" "4.68" ...
## $ ARPU_AFTER : chr [1:7272] "5.15" "0.26" "?" "2.73" ...
## $ ACTVIITY_DAYS_AFTER : num [1:7272] 0 1 5 5 0 9 0 33 11 1 ...
## $ ACTIVITY_STATUS_AFTER: chr [1:7272] "DURING & AFTER CAMP" "DURING & AFTER CAMP" "DURING & AFTER CAMP" "DURING & AFTER CAMP" ...
There were a total of 7 numerical columns and 20 categorical columns.
However, several numerical columns were misclassified as categorical
because the blank or missing values were automatically detected as a
string of ? in R. Columns affected were
RLD_AMT_BEFORE, RLD_AMT_AFTER,
CPA_RVN_BEFORE, CPA_RVN_AFTER,
ARPU_BEFORE, ARPU_AFTER,
DATA_PURC_BEFORE and DATA_PURC_AFTER.
Meanwhile, OFFER_TAKE_UP_DT was misclassified as numerical
but it was a date variable.
# Check missing values in categorical columns and print column names if there are null values
cat_missing <- sapply(dataset, function(x) {
if (is.factor(x) && sum(is.na(x)) > 0) {
cat("Column:", names(x), "has", sum(is.na(x)), "missing value(s).\n")
}
sum(is.na(x))
})
# Check missing values in numerical columns and print column names if there are null values
num_missing <- sapply(dataset, function(x) {
if (is.numeric(x) && sum(is.na(x)) > 0) {
cat("Column:", names(x), "has", sum(is.na(x)), "missing value(s).\n")
}
sum(is.na(x))
})
No missing values were found but based on the dataset structure shown,
there were ? missing values within the columns. Hence, data
cleaning was required.
This section displays the graph distributions for all columns of the dataset. Violin plot and bar chart were used.
# Create function to perform data profiling on a data frame
# Input: x - data frame
# Output: Prints plots for numeric columns and categorical columns
data_profiling = function(x) {
# Iterate over column names
for (col_name in colnames(x)) {
# Check if column is numeric
if (is.numeric(x[[col_name]])) {
# Generate violin plot for numeric columns
p = ggplot(x, aes(x = col_name, y = x[[col_name]], fill = col_name)) +
geom_violin() +
labs(x = "Column", y = "Range", title = paste("Violin Plot of", col_name)) +
theme_minimal()
print(p)
}
# Column is not numeric (assumed categorical)
else {
# Calculate counts for each category
counts = table(x[[col_name]])
# Create a data frame for plotting
df_counts = data.frame(category = names(counts), count = as.numeric(counts))
# Generate bar plot for categorical columns
p = ggplot(df_counts, aes(x = count, y = category)) +
geom_col(fill = "steelblue") +
labs(x = "Count", y = "Category", title = paste("Bar Plot of", col_name)) +
theme_minimal()
print(p)
}
}
}
# Display graphs
data_profiling(dataset)
Based on the graphs shown,it could be observed that:
-AGE contained a minimum age of -9999, which was abnormal
as human age should fall between a range of 0 to 100 with no negative
value. This phenomenon occurred because there were missing customers’
age information due to system or human error thus filling in with -9999.
-ARPU_BEFORE,ARPU_AFTER,
CPA_RVN_BEFORE, CPA_RVN_AFTER,
DATA_CHRG_BEFORE,DATA_CHRG_AFTER,
RLD_AMT_BEFORE, and RLD_AMT_AFTER were further
justified as numerical instead of categorical. Upon checking, these
columns were filled with ? instead of 0 value
hence wrongly tagged them as class variables.
-STATE contained inconsistent values. Malaysia comprises 13
states and 3 federal territories but the graph showed 20 unique values.
-GENDER shall have 2 unique values - male and female but
there were 4 different values.
-OFFER_TAKE_UP_DT consisted of null values and date values.
AGE, GENDER, STATE,
ARPU_BEFORE, ARPU_AFTER,
CPA_RVN_BEFORE, CPA_RVN_AFTER,
DATA_CHRG_BEFORE, DATA_CHRG_AFTER,
RLD_AMT_BEFORE, RLD_AMT_AFTER and
OFFER_TAKE_UP_DT required data cleaning and pre-processing
to ensure clean data were used for further analysis and modelling.
This section displays data cleaning for all four data quality issues (incomplete data, noisy data, inconsistent data, and intentional data) found in the dataset as mentioned in Exploratory Data Analysis session.
# Remove records of 'AGE' with value '-9999'
dataset <- subset(dataset, AGE != -9999)
Records with value of -9999 represented missing values for
customer age. Thus, these records were removed as they did not bring any
useful information.
? to NA
then fill with 0# replace "?" to "NA"
dataset <- replace(dataset, dataset=='?', NA)
# replace "NA" with 0
dataset$ARPU_BEFORE = replace(dataset$ARPU_BEFORE, is.na(dataset$ARPU_BEFORE), 0)
dataset$ARPU_AFTER = replace(dataset$ARPU_AFTER, is.na(dataset$ARPU_AFTER), 0)
dataset$CPA_RVN_BEFORE = replace(dataset$CPA_RVN_BEFORE, is.na(dataset$CPA_RVN_BEFORE), 0)
dataset$CPA_RVN_AFTER = replace(dataset$CPA_RVN_AFTER, is.na(dataset$CPA_RVN_AFTER), 0)
dataset$RLD_AMT_BEFORE = replace(dataset$RLD_AMT_BEFORE, is.na(dataset$RLD_AMT_BEFORE), 0)
dataset$RLD_AMT_AFTER = replace(dataset$RLD_AMT_AFTER, is.na(dataset$RLD_AMT_AFTER), 0)
dataset$DATA_CHRG_BEFORE = replace(dataset$DATA_CHRG_BEFORE, is.na(dataset$DATA_CHRG_BEFORE), 0)
dataset$DATA_CHRG_AFTER = replace(dataset$DATA_CHRG_AFTER, is.na(dataset$DATA_CHRG_AFTER), 0)
Columns mistagged as categorical with values of ? were
replaced with NA then 0 and corrected to
numeric data type. 0 was chosen to replace the unknown
value based on the business logic in which null is shown when there is
no mapping result.
# Standardize the name for the states
dataset$STATE[dataset$STATE == "JOHORE"] = "JOHOR"
dataset$STATE[dataset$STATE == "KLANG VALLEY"] = "WILAYAH PERSEKUTUAN"
dataset$STATE[dataset$STATE == "MALACCA"] = "MELAKA"
dataset$STATE[dataset$STATE == "N SEMBILAN"] = "NEGERI SEMBILAN"
dataset$STATE[dataset$STATE == "PULAU PINANG"] = "PENANG"
dataset$STATE[dataset$STATE == "SEREMBAN/MELAKA"] = "MELAKA"
The inconsistency in the naming of the states where names were mixed with Malay and English names that caused the duplicate values in the variable were standardised using English naming convention.
dataset$GENDER = replace(dataset$GENDER, is.na(dataset$GENDER), "Unspecified")
dataset <- subset(dataset, GENDER != "Unspecified")
The values of Unspecified and ? (which then
replaced with NA) were grouped together as they represented
unknown gender.These groups of people were removed from the dataset so
that the GENDER variable encompassed only values of Male and Female.
OFFER_TAKE_UP_DT which was supposed to be in date format
was kept as string data type as if the corresponding cell of the
OFFER_TAKER attribute was No, it suggested
that the customer did not take up the offer and naturally there would
not be any offer take-up date being recorded. This variable was removed
as the date did not bring any relevant information to this study.
# Format the numeric into 2 decimal points
dataset$ARPU_BEFORE = round(as.numeric(dataset$ARPU_BEFORE), 2)
dataset$ARPU_AFTER = round(as.numeric(dataset$ARPU_AFTER), 2)
dataset$CPA_RVN_BEFORE = round(as.numeric(dataset$CPA_RVN_BEFORE), 2)
dataset$CPA_RVN_AFTER = round(as.numeric(dataset$CPA_RVN_AFTER), 2)
dataset$RLD_AMT_BEFORE = round(as.numeric(dataset$RLD_AMT_BEFORE), 2)
dataset$RLD_AMT_AFTER = round(as.numeric(dataset$RLD_AMT_AFTER), 2)
dataset$DATA_CHRG_BEFORE = round(as.numeric(dataset$DATA_CHRG_BEFORE), 2)
dataset$DATA_CHRG_AFTER = round(as.numeric(dataset$DATA_CHRG_AFTER), 2)
Modify noisy data - Outliers
The variables associated with this data quality issue were
ARPU_BEFORE, ARPU_AFTER,
CPA_RVN_BEFORE, CPA_RVN_AFTER,
DATA_CHRG_BEFORE, DATA_CHRG_AFTER,
RLD_AMT_BEFORE, RLD_AMT_AFTER,
DATA_USG_BEFORE, DATA_USG_AFTER,
VOICE_USG_BEFORE, VOICE_USG_AFTER,
AGE and TENURE. Based on the domain knowledge,
these extreme values carried informative insights. Outliers detected in
these variables were not removed as the extreme values are legitimate
observations which could be included as a part of the sample, thus
keeping them in the dataset.
Remove columns which were meaningless and would not be used for the analysis from the study
# Drop the "meaningless" columns
dataset = subset(dataset, select = -c(NATIONALITY, STATUS_BEFORE, OFFER_TAKE_UP_DT))
# Backup the cleaned dataset for analysis
cleaned_dataset <- dataset
# Select and print out the "AFTER" columns
cols_to_drop = grep("AFTER$", names(dataset), value = TRUE)
# Drop the "AFTER" columns
dataset[, cols_to_drop] = NULL
As mentioned previously, the campaign was launched between specific
dates and the group of customers involved were active Prepaid Malaysian
customers. Hence, NATIONALITY, STATUS_BEFORE
and OFFER_TAKE_UP_DT were irrelevant to the analysis. All
the variables with prefix of AFTER were removed as these
variables would be needed only when there was a need to compare the
performance between pre-campaign and post-campaign.
This section displays the data overview after data cleaning.
# Check missing values
colSums(is.na(dataset))
## TENURE AGE GENDER STATE
## 0 0 0 0
## OFFER_TAKER DATA_PURC_BEFORE DATA_CHRG_BEFORE DATA_USG_BEFORE
## 0 0 0 0
## VOICE_USG_BEFORE RLD_IND_BEFORE RLD_AMT_BEFORE CPA_RVN_BEFORE
## 0 0 0 0
## ARPU_BEFORE
## 0
# Check data structure
str(dataset)
## tibble [7,157 × 13] (S3: tbl_df/tbl/data.frame)
## $ TENURE : num [1:7157] 90 204 120 199 16 86 34 12 35 56 ...
## $ AGE : num [1:7157] 28 81 84 82 82 82 85 85 88 86 ...
## $ GENDER : chr [1:7157] "Female" "Male" "Male" "Male" ...
## $ STATE : chr [1:7157] "JOHOR" "PAHANG" "WILAYAH PERSEKUTUAN" "PAHANG" ...
## $ OFFER_TAKER : chr [1:7157] "Y" "Y" "N" "Y" ...
## $ DATA_PURC_BEFORE: chr [1:7157] "Y" "N" "N" "Y" ...
## $ DATA_CHRG_BEFORE: num [1:7157] 15 0 0 15 15 0 0 0 0 0 ...
## $ DATA_USG_BEFORE : num [1:7157] 0 0 0 0 0.0185 ...
## $ VOICE_USG_BEFORE: num [1:7157] 732.4 49.9 17.1 16.2 13.6 ...
## $ RLD_IND_BEFORE : chr [1:7157] "N" "N" "N" "N" ...
## $ RLD_AMT_BEFORE : num [1:7157] 0 0 0 0 0 0 0 0 0 0 ...
## $ CPA_RVN_BEFORE : num [1:7157] 0 0 0 0 0 0 0 0 0 0 ...
## $ ARPU_BEFORE : num [1:7157] 0 13.65 4.32 4.68 4.4 ...
# Check first 5 rows of dataset
head(dataset, n = 5)
## # A tibble: 5 × 13
## TENURE AGE GENDER STATE OFFER_TAKER DATA_PURC_BEFORE DATA_CHRG_BEFORE
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 90 28 Female JOHOR Y Y 15
## 2 204 81 Male PAHANG Y N 0
## 3 120 84 Male WILAYAH PER… N N 0
## 4 199 82 Male PAHANG Y Y 15
## 5 16 82 Male PERAK Y Y 15
## # ℹ 6 more variables: DATA_USG_BEFORE <dbl>, VOICE_USG_BEFORE <dbl>,
## # RLD_IND_BEFORE <chr>, RLD_AMT_BEFORE <dbl>, CPA_RVN_BEFORE <dbl>,
## # ARPU_BEFORE <dbl>
There were a total of 7,157 records (after removing the intentional age
and unknown gender) with 13 selected variables to continue our analysis
ranging from TENURE, AGE, GENDER,
STATE, DATA_PURC_BEFORE,
DATA_CHRG_BEFORE, DATA_USG_BEFORE,
VOICE_USG_BEFORE, RLD_IND_BEFORE,
RLD_AMT_BEFORE, CPA_RVN_BEFORE,
ARPU_BEFORE and OFFER_TAKER. No missing values
were found and the type of variables were now correct.
This section explains how data were pre-processed before proceeding to analysis and modelling. Categorical variables were transformed using one-hot encoding and heatmap was used to check the correlation between variables. Highly correlated variables would be dropped and dataset would be split into training (80%) and testing (20%) set.
GENDER and
STATE were converted to numeric values by one-hot encoding,
whereas DATA_PURC_BEFORE and RLD_IND_BEFORE
were converted to numeric values using remapping method.# Replace column values of "Y" with 1 and "N" with 0
dataset$DATA_PURC_BEFORE <- as.integer(ifelse(dataset$DATA_PURC_BEFORE == "Y", 1, ifelse(dataset$DATA_PURC_BEFORE == "N", 0, dataset$DATA_PURC_BEFORE)))
dataset$RLD_IND_BEFORE <- as.integer(ifelse(dataset$RLD_IND_BEFORE == "Y", 1, ifelse(dataset$RLD_IND_BEFORE == "N", 0, dataset$RLD_IND_BEFORE)))
# Apply one-hot encoding to the categorical variables
cols_to_encode = c("GENDER", "STATE")
one_hot = dummyVars(" ~ .", data = dataset[, cols_to_encode])
encoded_df = data.frame(predict(one_hot, newdata = dataset))
# Preview the first 5 rows
head(encoded_df, 5)
## GENDERFemale GENDERMale STATEJOHOR STATEKEDAH STATEKELANTAN STATEMELAKA
## 1 1 0 1 0 0 0
## 2 0 1 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 1 0 0 0 0
## STATENEGERI.SEMBILAN STATEPAHANG STATEPENANG STATEPERAK STATEPERLIS
## 1 0 0 0 0 0
## 2 0 1 0 0 0
## 3 0 0 0 0 0
## 4 0 1 0 0 0
## 5 0 0 0 1 0
## STATESABAH STATESARAWAK STATESELANGOR STATETERENGGANU
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## STATEWILAYAH.PERSEKUTUAN
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
Although variables GENDER and STATE had
undergone data transformation, they were not included in the modelling
as demographic variables will create bias towards the prediction output.
For instance, it is unreasonable for Zentel to launch a campaign
targeting specifically male and specific state such as Sabah. The
campaign launched shall target every customer from the population.
# Select numeric variables
non_encoded_cols = setdiff(colnames(dataset), cols_to_encode)
non_encoded_list = dataset[, non_encoded_cols]
non_encoded_list = subset(non_encoded_list, select = -OFFER_TAKER)
# Create function to extract the upper triangle of a correlation matrix
# Input: cormat - correlation matrix
# Output: upper triangle of the correlation matrix with lower triangle set to NA
get_upper_tri = function(cormat){
# Set lower triangle of the correlation matrix to NA
cormat[lower.tri(cormat)] <- NA
# Return the modified correlation matrix
return(cormat)
}
# Plot heatmap
cormat = round(cor(non_encoded_list),2)
upper_tri = get_upper_tri(cormat)
melted_cormat = melt(upper_tri, na.rm = TRUE)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))+
coord_fixed()
Based on the correlation plot, it could be observed that the pairs of
CPA_RVN_BEFORE, RLD_AMT_BEFORE and
ARPU_BEFORE, and DATA_CHRG_BEFORE and
DATA_PURC_BEFORE were highly correlated with each other.
Hence,CPA_RVN_BEFORE, RLD_AMT_BEFORE and
DATA_CHRG_BEFORE were dropped as these two variables are
contributing to the value of ARPU. When these values increase, ARPU
increases. On top of that, based on domain knowledge,
ARPU_BEFORE refers to the average spending of a customer
that brings meaningful information.
# Remove high correlation columns
final_df = subset(non_encoded_list, select=c(-CPA_RVN_BEFORE, -RLD_AMT_BEFORE,-DATA_CHRG_BEFORE))
# Assign target
final_df$OFFER_TAKER = ifelse(dataset$OFFER_TAKER == "Y", 1, 0)
# Set the random seed
set.seed(123)
# Split the data into train and test set
train_idx = createDataPartition(final_df$OFFER_TAKER, p = 0.8, list = FALSE)
train_data = final_df[train_idx,]
test_data = final_df[-train_idx,]
# Check dimensions of train and test set
cat(" Dimensions of train data:", dim(train_data), "\n", "Dimensions of test data:", dim(test_data), "\n")
## Dimensions of train data: 5726 8
## Dimensions of test data: 1431 8
After performing the data pre-processing, data transformation, feature selection and train-test split, a total of 7 independent variables were used to train the model.
This section displays the results and discussion for all four research objectives.
Effectiveness of “Right Planning pilot campaign was evaluated by 2 aspects - the total count of offer takers among all customers and the total count of offer takers among all customers who stayed active after the pilot campaign.
# Count the frequencies of "Y" and "N" values in the column
value_counts <- table(cleaned_dataset$OFFER_TAKER)
value <- value_counts["Y"]
# Calculate the percentage of offer takers only - "Y"
percentage_y <- (value_counts["Y"] / sum(value_counts)) * 100
# Count the frequencies of customers who remain active after opted-in for the pilot campaign
offertaker <- subset(cleaned_dataset, OFFER_TAKER == "Y")
value_counts <- table(offertaker$ACTIVITY_STATUS_AFTER)
# Calculate the sum of frequencies for the desired categories
sum_active <- sum(value_counts[c("DURING & AFTER CAMP", "BEFORE & AFTER CAMP", "DURING CAMP THEN REMAIN")])
# Calculate the percentage of active offer takers
percentage_active <- sum_active / sum(value_counts) * 100
# Display results
message <- paste(" Number of offer takers:", value, "\n", "Percentage of offer takers:", round(percentage_y, 2), "%\n", "Number of customers who remain active after opted-in for the pilot campaign:", sum_active, "\n", "Percentage of customers who remain active after opted-in for the pilot campaign:", round(percentage_active, 2), "%\n")
cat(message)
## Number of offer takers: 4499
## Percentage of offer takers: 62.86 %
## Number of customers who remain active after opted-in for the pilot campaign: 3047
## Percentage of customers who remain active after opted-in for the pilot campaign: 67.73 %
The above statistics indicated that the pilot campaign launched was moderately successful because:
-The pilot campaign launched had a total number of 4,499 customers out of 7,157 who opted in for the new rate plans. The opt in rate for the campaign was 63%.
-There were 3,047 customers out of 4,499 opted-in customers who remained active after the campaign. The active rate was 68%.
The profile of offer takers was identified by observing the graph distribution grouped by offer taker. A more precise analysis could be done by conducting pivot table in the future.
data_profiling_after <- function(x, group_by_col) {
# Iterate over column names
for (col_name in colnames(x)) {
# Check if column is numeric
if (is.numeric(x[[col_name]])) {
# Generate violin plot for numeric columns grouped by OFFER_TAKER
p <- ggplot(x, aes(x = col_name, y = x[[col_name]], fill = col_name)) +
geom_violin() +
labs(x = "Column", y = "Range", title = paste("Violin Plot of", col_name)) +
theme_minimal() +
facet_wrap(~get(group_by_col))
print(p)
}
# Column is not numeric (assumed categorical)
else {
# Calculate counts for each category grouped by OFFER_TAKER
counts <- table(x[[col_name]], x[[group_by_col]])
# Create a data frame for plotting
df_counts <- data.frame(category = rownames(counts), count = as.numeric(counts), OFFER_TAKER = rep(colnames(counts), each = nrow(counts)))
# Generate bar plot for categorical columns grouped by OFFER_TAKER
p <- ggplot(df_counts, aes(x = count, y = category, fill = OFFER_TAKER)) +
geom_col() +
labs(x = "Count", y = "Category", title = paste("Bar Plot of", col_name)) +
theme_minimal() +
facet_wrap(~get(group_by_col))
print(p)
}
}
}
# Display graphs grouped by OFFER_TAKER
data_profiling_after(cleaned_dataset, "OFFER_TAKER")
Based on the EDA, the customers who tend to take up the offer had the following characteristics:
-Campaign offer takers showed a higher numbers of opt in rate among male compared to female.
-Campaign offer takers were mostly from age group between 22 and 36 years old with tenure more than 1 year.
-Most of the campaign takers were from Klang Valley, followed by Sabah and Sarawak.
-Most of the campaign takers had higher ARPU, higher CPA revenue, higher reload amount, higher voice and data usage and tend to purchase data plan.
-Majority of the customers remained active after the campaign but 11 takers terminated their lines.
Three famous classification models were chosen to predict which customers would likely opt-in for the “Right Planning” campaign based on their usage and revenue behaviours, which were Random Forest Classifier, K-Nearest Neighbor (KNN) and Support Vector machine (SVM). The best performed model would be chosen as the prediction model for this study.
Evaluation metrics of accuracy were used to evaluate the model performance since it is the most common metric used to evaluate the performance of a classification predictive model. On top of that, the dataset provided was considered a balanced dataset thus accuracy was chosen.
# Set the random seed
set.seed(123)
# Train the Random Forest Classifier
classifier_RF = randomForest(x = train_data[, -ncol(train_data)],
y = as.factor(train_data$OFFER_TAKER),
ntree = 500,
importance = TRUE)
# Make predictions on the test data
preds = predict(classifier_RF, newdata = test_data[, -ncol(test_data)])
# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_RF = table(preds, test_data$OFFER_TAKER)
accuracy_RF = sum(diag(conf_mat_RF)) / sum(conf_mat_RF)
# Display output of confusion matrix and accuracy
outputRF <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_RF)), collapse = "\n"), "\n", "Accuracy of Random Forest Classifier:", round(accuracy_RF * 100, 2), "%")
cat(outputRF)
## Confusion Matrix (Prediction):
## preds 0 1
## 0 462 90
## 1 113 766
## Accuracy of Random Forest Classifier: 85.81 %
Random Forest model achieved an accuracy of 85.51%.
# Set the random seed
set.seed(123)
# Train the KNN Classifier
classifier_KNN <- train(x = train_data[, -ncol(train_data)],
y = as.factor(train_data$OFFER_TAKER),
method = "knn",
trControl = trainControl(method = "none"),
preProcess = c("center", "scale"))
# Make predictions on the test data
preds_KNN <- predict(classifier_KNN, newdata = test_data[, -ncol(test_data)])
# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_KNN <- table(preds_KNN, test_data$OFFER_TAKER)
accuracy_KNN <- sum(diag(conf_mat_KNN)) / sum(conf_mat_KNN)
# Display output of confusion matrix and accuracy
outputKNN <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_KNN)), collapse = "\n"), "\n", "Accuracy of K-Nearest Neighbour:", round(accuracy_KNN * 100, 2), "%")
cat(outputKNN)
## Confusion Matrix (Prediction):
## preds_KNN 0 1
## 0 438 124
## 1 137 732
## Accuracy of K-Nearest Neighbour: 81.76 %
K-Nearest Neighbours model achieved an accuracy of 81.76%.
# Set the random seed
set.seed(123)
# Train the SVM Classifier
classifier_SVM <- svm(x = train_data[, -ncol(train_data)],
y = as.factor(train_data$OFFER_TAKER),
kernel = "radial")
# Make predictions on the test data
preds_SVM <- predict(classifier_SVM, newdata = test_data[, -ncol(test_data)])
# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_SVM <- table(preds_SVM, test_data$OFFER_TAKER)
accuracy_SVM <- sum(diag(conf_mat_SVM)) / sum(conf_mat_SVM)
# Display output of confusion matrix and accuracy
outputSVM <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_SVM)), collapse = "\n"), "\n", "Accuracy of Support Vector Machine", round(accuracy_SVM * 100, 2), "%")
cat(outputSVM)
## Confusion Matrix (Prediction):
## preds_SVM 0 1
## 0 468 108
## 1 107 748
## Accuracy of Support Vector Machine 84.98 %
Support Vector Machine model achieved an accuracy of 84.98%.
# Feature importance
imp_RF <- importance(classifier_RF)
features <- paste("Random Forest Classifier Feature Importance:\n", paste(capture.output(print(imp_RF)), collapse = "\n"))
cat(features)
## Random Forest Classifier Feature Importance:
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## TENURE 8.820628 24.3849919 24.13703 206.39523
## AGE 5.278590 9.2401210 11.41115 149.28534
## DATA_PURC_BEFORE 104.606113 157.6559667 135.19721 1087.75961
## DATA_USG_BEFORE 32.503856 -11.7541345 18.30836 233.48710
## VOICE_USG_BEFORE 16.258360 7.1245668 20.95345 104.26227
## RLD_IND_BEFORE 22.069320 2.9338180 21.51774 51.02151
## ARPU_BEFORE 52.064695 -0.1964525 50.16256 253.43854
By comparing performances for the three classifiers, Random
Forest Classifier had the highest accuracy hence it was
selected as the best classification model / predictive model in this
study. A higher value in Mean Decrease Accuracy indicated
that the features had stronger influences on the accuracy of the model.
The top 5 important features were DATA_PURC_BEFORE,
ARPU_BEFORE, RLD_IND_BEFORE,
TENURE and DATA_USG_BEFORE. These variables
could be used to predict if a customer will likely opt in for the new
offer.
A customer tend to take up the offer of the campaign if he or she had the following characteristics: A longer tenure, high average revenue, performing reload before the campaign, purchasing any data plan before the campaign and high data usage.
This section displays the relationship between the customers’ behaviours in terms of usage and revenue generated before the campaign using multiple linear regression.
# Select relevant feature
reg_final_df = subset(final_df, select = c(DATA_USG_BEFORE, ARPU_BEFORE, VOICE_USG_BEFORE))
# Set the random seed
set.seed(123)
# Split the data into train and test set
train_idx = createDataPartition(reg_final_df$ARPU_BEFORE, p = 0.8, list = FALSE)
reg_train_data = reg_final_df[train_idx,]
reg_test_data = reg_final_df[-train_idx,]
# Check dimensions of train and test set
cat(" Dimensions of train data:", dim(reg_train_data), "\n", "Dimensions of test data:", dim(reg_test_data), "\n")
## Dimensions of train data: 5726 3
## Dimensions of test data: 1431 3
# Set the random seed
set.seed(123)
# Scaling
train_data_scaled = reg_train_data %>%
select(-ARPU_BEFORE) %>%
scale() %>%
as.data.frame() %>%
cbind(reg_train_data$ARPU_BEFORE)
colnames(train_data_scaled)[3] = "target"
# Train the Regression Model
lm_model = lm(target ~ ., data = train_data_scaled)
# Make predictions on the test data
test_data_scaled = reg_test_data %>%
select(-ARPU_BEFORE) %>%
scale() %>%
as.data.frame() %>%
cbind(reg_test_data$ARPU_BEFORE)
predictions = predict(lm_model, newdata = test_data_scaled)
colnames(test_data_scaled)[3] = "target"
# Result
summary(lm_model)
##
## Call:
## lm(formula = target ~ ., data = train_data_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.14 -9.51 -9.51 -6.06 1495.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.2951 0.8876 11.599 < 2e-16 ***
## DATA_USG_BEFORE 2.4578 0.8926 2.754 0.00591 **
## VOICE_USG_BEFORE 0.2941 0.8926 0.329 0.74180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.17 on 5723 degrees of freedom
## Multiple R-squared: 0.00139, Adjusted R-squared: 0.001041
## F-statistic: 3.984 on 2 and 5723 DF, p-value: 0.01866
Coefficient: For a one-unit increase in data usage
(DATA_USG_BEFORE), the revenue generated
(ARPU_BEFORE) was expected to increase by 2.4578 units.
DATA_USG_BEFORE was statistically
significant as its p-value was less than the alpha value of
0.05. However, VOICE_USG_BEFORE was not
statistically significant as its p-value was larger than the
alpha value of 0.05.
R-squared: The R-squared of 0.00139 indicated that only a small portion of the target variable’s variability was explained by the predictors.
F-statistic and p-value: The F-statistic of 3.984 with a p-value of 0.01866, suggesting that there was some evidence of linear relationship between the predictors and the target variable.
# Result
RMSE = sqrt(mean((predictions - test_data_scaled$target)^2))
R2 = cor(predictions, test_data_scaled$target)^2
## Performance of Linear Regression:
##
## RMSE: 75.65
## R-squared: 0.00095
RMSE: RMSE value of 75.65 suggested that on average, the predictions of the regression model had an error of approximately 75.65 units.
R-squared: The R-squared value of 0.00095 suggested that the independent variables in the model explained a very small portion (approximately 0.095%) of the variance in the dependent variable for test set.
In other words, this model had very little predictive
power in explaining the variability of the data as the R
squared was very low and VOICE_USG_BEFORE was not
statistically significant. This meant that customers’ behaviour in terms
of usage was not a determinant of the revenue generated. This is indeed
true because revenue generated is not just coming from usage but other
factors such as purchasing roaming pass and added-value service. When
Base Management Team intended to design a new offer, the structure
content should not merely consider data usage and voice usage hoping to
bring more revenue.In fact, further analysis and
consideration might be required to improve the model’s
performance or explore additional predictors.
In conclusion, the research objectives of this study were achieved by using several statistical methods.
First of all, the effectiveness of the pilot campaign was evaluated in which the statistics showed that the pilot campaign launched was moderately successful with 63% of take up rate and 68% of active rate among the offer takers.
Secondly, the campaign offer taker’s profile was identified. Campaign offer takers were mainly from age group between 22 and 36 years old with tenure more than 1 year and gender male. Most of them were from Klang Valley, followed by Sabah and Sarawak. Majority of the offer takers had higher ARPU, higher CPA revenue, higher reload amount, higher voice and data usage and tend to purchase data plan.
Thirdly, a classification model was developed to
predict if the customers would opt-in for the “Right Planning” campaign
based on usage and revenue behaviours. Random Forest Classifier with
accuracy of 85.51% was chosen as the predictive model because it had the
highest value of accuracy as compared to the other two models. The top 5
important features were DATA_PURC_BEFORE,
ARPU_BEFORE, RLD_IND_BEFORE,
TENURE and DATA_USG_BEFORE.
Last but not least, the relationship between the customers’ behaviour in terms of usage and revenue generated before the campaign using multiple linear regression was investigated. It was clearly proven that revenue generated was affected by the data usage but not voice usage. Other factors such as purchasing roaming pass and added-value service should be considered if the Base Management Team would like to increase revenue.
Cote, C. (2021, January 21). What is Marketing Analytics? Harvard Business School Online. Retrieved May 20, 2023, from https://online.hbs.edu/blog/post/what-is-marketing-analytics