Description of the data
I was commissioned to analyse the customer database of a mobile phone provider and derive recommendations for actions to improve customer loyalty. In addition, you should provide a dashboard based on your analysis. The dashboard should be calibrated on the provided historical data. You got a 5% random sample of the customer activities over the past 3 months.
Variable definition in customer data
cid: an artificially generated customer identifier for this analysis
gender: indicating the customer’s gender.
age: indicating the age in years when the current contract started.
citizenship: indicating the citizenship status when first contract started.
tenure: number of months since start of the first contract.
contract type: type of contract referring to pay-as-you-go, or contract length.
data allowance: free GB data download allowance per month.
payment method: method of payment including direct debit, bank transfer and credit card.
monthly charges: average monthly payment in last three months, in £.
payment arrears: indicating the number of months the costumer did not pay on time.
contract terminated: indicating if the contract had been terminated in the last three months.
phone number transferred: indicating if a request was received to transfer the phone number to another provider in the previous three months.
Variable definition in satisfaction score survey data
cid: an artificially generated customer identifier for this analysis score: the satisfaction score between 0 = totally unsatisfied to 10 totally satisfied, customers have been invited via a text message to assess the quality of their service.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## Warning: package 'readr' was built under R version 4.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(caTools)
library(gbm)
## Loaded gbm 2.1.8
library(e1071)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
customer = read.table('/Users/shuaibismail/Downloads/customer.csv', head = TRUE, sep = ',')
score = read.table('/Users/shuaibismail/Downloads/score.csv', head = TRUE, sep = ',')
head(customer)
head(score)
Exploratory Data Analysis
A bar chart showing the distribution of customer score ratings
ggplot(data = score) +
geom_bar(mapping = aes(x = score))
The most popular monthly charge accros all plans is 50 pounds.
ggplot(data = customer) +
geom_histogram(mapping = aes(x = monthly_charges), binwidth = 9)
The 2-year contract plans is at least 5 times as popular than the one-year plan.
ggplot(data = customer, mapping = aes(x = tenure, colour = contract_type)) +
geom_freqpoly(binwidth = 5)
The number of customers paying via Bank Transfers is almost thrice as those paying via Credit Cards
ggplot(data = customer, mapping = aes(x = monthly_charges, colour = payment_method)) +
geom_freqpoly(binwidth = 9)
Median monthly charge for unlimited plan and 10gb is similar (overing around 50), with quite a number of people paying over 100 for both plans.
ggplot(data = customer) +
geom_boxplot(mapping = aes(x = reorder(data_allowance, monthly_charges, FUN = median), y = monthly_charges))
Median age for Male Customers is around 40 years, for Female it is slightly lower at around 38.
ggplot(data = customer) +
geom_boxplot(mapping = aes(x = reorder(gender, age, FUN = median), y = age))
## Warning: Removed 39 rows containing non-finite values (stat_boxplot).
A bar chart of gender distribution
ggplot(data = customer) +
geom_bar(mapping = aes(x = gender))
Data Cleaning and Transformation
Replacing unusual values in the Citizenship column with null
customer2 <- customer %>%
mutate(citizenship = ifelse(citizenship == -1 | citizenship == 5 | citizenship == 0, NA, citizenship))
Joining the Data
customer_join <- merge(customer2,score)
names(customer_join)
## [1] "cid" "gender" "age"
## [4] "citizenship" "tenure" "contract_type"
## [7] "data_allowance" "payment_method" "monthly_charges"
## [10] "payment_arrears" "contract_terminated" "number_transferred"
## [13] "score"
Here we’re saying if a score rating is less than or equal to 6, mark as negative, else positive. We did this because the score rating is right skewed.
customer_join1 <- customer_join %>%
mutate(score = ifelse(score <= 7 , 0, 1))
Encoding Categorical Data
customer_join1$gender = factor(customer_join1$gender)
customer_join1$contract_type = factor(customer_join1$contract_type)
customer_join1$citizenship = factor(customer_join1$citizenship)
customer_join1$payment_method = factor(customer_join1$payment_method)
customer_join1$contract_type = factor(customer_join1$contract_type)
customer_join1$number_transferred = factor(customer_join1$data_allowance)
customer_join1$contract_terminated = factor(customer_join1$contract_terminated)
customer_join1$data_allowance = factor(customer_join1$data_allowance)
Separating the data into a training and testing set. Training will be used to train the model and Testing will be for predictions.
set.seed(123)
cs_split = sample.split(customer_join1$score, SplitRatio=.75)
train_cs = subset(customer_join1, cs_split == TRUE)
test_cs = subset(customer_join1, cs_split == FALSE)
c(nrow(customer_join1), nrow(train_cs), nrow(test_cs))
## [1] 1782 1336 446
customer_join2=subset(train_cs, select = c(score, monthly_charges, tenure))
plot_margins <- ggplot(data = customer_join2, aes(x = monthly_charges, y = tenure, color = factor(score))) + geom_point() +
scale_color_manual(values = c("red", "blue"))
#display plot
plot_margins
Building a gradient boosting classification model to classify positive and negative ratings
set.seed(123)
boosting_model <- gbm(score ~ .-cid,
distribution = "bernoulli",
train_cs,
n.trees = 25,
cv.folds = 3)
opt_tree=gbm.perf(boosting_model, method="cv")
print(boosting_model)
## gbm(formula = score ~ . - cid, distribution = "bernoulli", data = train_cs,
## n.trees = 25, cv.folds = 3)
## A gradient boosted model with bernoulli loss function.
## 25 iterations were performed.
## The best cross-validation iteration was 25.
## There were 11 predictors of which 3 had non-zero influence.
The model shows that Tenure has the most influence in score ratings. Citizenship also had some influence.
summary(boosting_model)
#Obtaining Predictions and Computing the Test Error Rate
pred_boost=predict(boosting_model, newdata=test_cs,n.trees=25, distribution="bernoulli", type="response")
head(pred_boost)
## [1] 0.8488104 0.7003576 0.7003576 0.7129922 0.6740130 0.5084145
boost_pred=ifelse(pred_boost < 0.6, 0, 1)
head(boost_pred)
## [1] 1 1 1 1 1 0
The model has 73.7% prediction accuracy
accuracy(test_cs[,"score"], boost_pred)
## [1] 0.7376682
table(test_cs[,"score"], (boost_pred))
##
## 0 1
## 0 49 83
## 1 34 280
This model has 70% Prediction Accuracy
svm_cs<- svm(score ~ .-cid, data = train_cs, type = "C-classification", kernel = "radial", scale = TRUE)
pred_svm=predict(svm_cs,test_cs)
mean(pred_svm==test_cs$score)
## Warning in `==.default`(pred_svm, test_cs$score): longer object length is not a
## multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## [1] 0.7017937
We used an almost extreme case here that says only ratings above 7 are positive, the rest are negative. Also we told our model to only predict yes when it’s at least 60% sure (the standard is 50%), yet our models are able to achieve 70% accuracy.
When we made the positive rating greater than or equal to 5, the model had 97% accuracy, but this would have been an incorrect estimate, as the underlying data had more ratings above 5 (distribution is not normal but right skewed).