Description of the data

I was commissioned to analyse the customer database of a mobile phone provider and derive recommendations for actions to improve customer loyalty. In addition, you should provide a dashboard based on your analysis. The dashboard should be calibrated on the provided historical data. You got a 5% random sample of the customer activities over the past 3 months.

Variable definition in customer data

cid: an artificially generated customer identifier for this analysis

gender: indicating the customer’s gender.

age: indicating the age in years when the current contract started.

citizenship: indicating the citizenship status when first contract started.

tenure: number of months since start of the first contract.

contract type: type of contract referring to pay-as-you-go, or contract length.

data allowance: free GB data download allowance per month.

payment method: method of payment including direct debit, bank transfer and credit card.

monthly charges: average monthly payment in last three months, in £.

payment arrears: indicating the number of months the costumer did not pay on time.

contract terminated: indicating if the contract had been terminated in the last three months.

phone number transferred: indicating if a request was received to transfer the phone number to another provider in the previous three months.

Variable definition in satisfaction score survey data

cid: an artificially generated customer identifier for this analysis score: the satisfaction score between 0 = totally unsatisfied to 10 totally satisfied, customers have been invited via a text message to assess the quality of their service.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## Warning: package 'readr' was built under R version 4.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(caTools)
library(gbm)
## Loaded gbm 2.1.8
library(e1071)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
customer = read.table('/Users/shuaibismail/Downloads/customer.csv', head = TRUE, sep = ',')
score = read.table('/Users/shuaibismail/Downloads/score.csv', head = TRUE, sep = ',')
head(customer)
head(score)

Exploratory Data Analysis

A bar chart showing the distribution of customer score ratings

ggplot(data = score) +
  geom_bar(mapping = aes(x = score))

The most popular monthly charge accros all plans is 50 pounds.

ggplot(data = customer) +
  geom_histogram(mapping = aes(x = monthly_charges), binwidth = 9)

The 2-year contract plans is at least 5 times as popular than the one-year plan.

ggplot(data = customer, mapping = aes(x = tenure, colour = contract_type)) +
  geom_freqpoly(binwidth = 5)

The number of customers paying via Bank Transfers is almost thrice as those paying via Credit Cards

ggplot(data = customer, mapping = aes(x = monthly_charges, colour = payment_method)) +
  geom_freqpoly(binwidth = 9)

Median monthly charge for unlimited plan and 10gb is similar (overing around 50), with quite a number of people paying over 100 for both plans.

ggplot(data = customer) +
  geom_boxplot(mapping = aes(x = reorder(data_allowance, monthly_charges, FUN = median), y = monthly_charges))

Median age for Male Customers is around 40 years, for Female it is slightly lower at around 38.

ggplot(data = customer) +
  geom_boxplot(mapping = aes(x = reorder(gender, age, FUN = median), y = age))
## Warning: Removed 39 rows containing non-finite values (stat_boxplot).

A bar chart of gender distribution

ggplot(data = customer) +
  geom_bar(mapping = aes(x = gender))

Data Cleaning and Transformation

Replacing unusual values in the Citizenship column with null

customer2 <- customer %>% 
  mutate(citizenship = ifelse(citizenship == -1 | citizenship == 5 | citizenship == 0, NA, citizenship))

Joining the Data

customer_join <- merge(customer2,score)
names(customer_join)
##  [1] "cid"                 "gender"              "age"                
##  [4] "citizenship"         "tenure"              "contract_type"      
##  [7] "data_allowance"      "payment_method"      "monthly_charges"    
## [10] "payment_arrears"     "contract_terminated" "number_transferred" 
## [13] "score"

Here we’re saying if a score rating is less than or equal to 6, mark as negative, else positive. We did this because the score rating is right skewed.

customer_join1 <- customer_join %>% 
  mutate(score = ifelse(score <= 7 , 0, 1))

Encoding Categorical Data

customer_join1$gender = factor(customer_join1$gender)
customer_join1$contract_type = factor(customer_join1$contract_type)
customer_join1$citizenship = factor(customer_join1$citizenship)
customer_join1$payment_method = factor(customer_join1$payment_method)
customer_join1$contract_type = factor(customer_join1$contract_type)
customer_join1$number_transferred = factor(customer_join1$data_allowance)
customer_join1$contract_terminated = factor(customer_join1$contract_terminated)
customer_join1$data_allowance = factor(customer_join1$data_allowance)

Separating the data into a training and testing set. Training will be used to train the model and Testing will be for predictions.

set.seed(123)
cs_split = sample.split(customer_join1$score, SplitRatio=.75)
train_cs = subset(customer_join1, cs_split == TRUE)
test_cs = subset(customer_join1, cs_split == FALSE)

c(nrow(customer_join1), nrow(train_cs), nrow(test_cs))
## [1] 1782 1336  446
customer_join2=subset(train_cs, select = c(score, monthly_charges, tenure))
plot_margins <- ggplot(data = customer_join2, aes(x = monthly_charges, y = tenure, color = factor(score))) + geom_point() + 
    scale_color_manual(values = c("red", "blue"))
#display plot 
plot_margins

Building a gradient boosting classification model to classify positive and negative ratings

set.seed(123)
boosting_model <- gbm(score ~ .-cid, 
                       distribution = "bernoulli", 
                       train_cs,
                       n.trees = 25,
                       cv.folds = 3)
opt_tree=gbm.perf(boosting_model, method="cv")

print(boosting_model)
## gbm(formula = score ~ . - cid, distribution = "bernoulli", data = train_cs, 
##     n.trees = 25, cv.folds = 3)
## A gradient boosted model with bernoulli loss function.
## 25 iterations were performed.
## The best cross-validation iteration was 25.
## There were 11 predictors of which 3 had non-zero influence.

The model shows that Tenure has the most influence in score ratings. Citizenship also had some influence.

summary(boosting_model)
#Obtaining Predictions and Computing the Test Error Rate
pred_boost=predict(boosting_model, newdata=test_cs,n.trees=25, distribution="bernoulli", type="response")

head(pred_boost)
## [1] 0.8488104 0.7003576 0.7003576 0.7129922 0.6740130 0.5084145
boost_pred=ifelse(pred_boost < 0.6, 0, 1)
head(boost_pred)
## [1] 1 1 1 1 1 0

The model has 73.7% prediction accuracy

accuracy(test_cs[,"score"], boost_pred)
## [1] 0.7376682
table(test_cs[,"score"], (boost_pred))
##    
##       0   1
##   0  49  83
##   1  34 280

This model has 70% Prediction Accuracy

svm_cs<- svm(score ~ .-cid, data = train_cs, type = "C-classification", kernel = "radial", scale = TRUE)

pred_svm=predict(svm_cs,test_cs)

mean(pred_svm==test_cs$score)
## Warning in `==.default`(pred_svm, test_cs$score): longer object length is not a
## multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## [1] 0.7017937

We used an almost extreme case here that says only ratings above 7 are positive, the rest are negative. Also we told our model to only predict yes when it’s at least 60% sure (the standard is 50%), yet our models are able to achieve 70% accuracy.

When we made the positive rating greater than or equal to 5, the model had 97% accuracy, but this would have been an incorrect estimate, as the underlying data had more ratings above 5 (distribution is not normal but right skewed).