Telco Segmentation

Business Objective

A US based fixed-line telecom service provider wants to roll out bundled combo minute plans eg: STD call plans, Local call plans and internet plans.

Segmentation

Customer Segmentation is the process of splitting your customer database into smaller groups. By focusing on specific customer types, we can maximize customer lifetime value and better understand who they are and what they need.

Procedure for Segmentation

Data Preparation
- Missing Value
- Outliers
Factor Analysis
Variable Standardization
Clustering
Profiling

Reading the Data

telco = read.csv("D:\\Business Analytics\\R\\Factor Analysis\\telco_csv.csv")
head(telco)

##   region tenure age marital address income ed employ retire gender reside
## 1      2     13  44       1       9     64  4      5      0      0      2
## 2      3     11  33       1       7    136  5      5      0      0      6
## 3      3     68  52       1      24    116  1     29      0      1      2
## 4      2     33  33       0      12     33  2      0      0      1      1
## 5      2     23  30       1       9     30  1      2      0      0      4
## 6      2     41  39       0      17     78  2     16      0      1      1
##   tollfree equip callcard wireless longmon tollmon equipmon cardmon wiremon
## 1        0     0        1        0    3.70    0.00        0    7.50     0.0
## 2        1     0        1        1    4.40   20.75        0   15.25    35.7
## 3        1     0        1        0   18.15   18.00        0   30.25     0.0
## 4        0     0        0        0    9.45    0.00        0    0.00     0.0
## 5        0     0        0        0    6.30    0.00        0    0.00     0.0
## 6        1     0        1        0   11.80   19.25        0   13.50     0.0
##   multline voice pager internet callid callwait forward confer ebill custcat
## 1        0     0     0        0      0        0       1      0     0       1
## 2        0     1     1        0      1        1       1      1     0       4
## 3        0     0     0        0      1        1       0      1     0       3
## 4        0     0     0        0      0        0       0      0     0       1
## 5        0     0     0        0      1        0       1      1     0       3
## 6        0     0     0        0      1        1       0      0     0       3

Data Preparation

Creating a user defined function to have a better understanding of the data.

mystats <- function(x) {
  nmiss<-sum(is.na(x))
  a <- x[!is.na(x)]
  m <- mean(a)
  n <- length(a)
  s <- sd(a)
  min <- min(a)
  pctls<-quantile(a,probs=c(0.01, 0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99))
  max <- max(a)
  return(c(n=n, nmiss=nmiss,  mean=m, stdev=s,min = min, pctls=pctls,max=max))
}

# Applying udf on the data
diag_stats = t(data.frame(sapply(telco,mystats)))
head(diag_stats)

##            n nmiss   mean       stdev min pctls.1% pctls.5% pctls.10% pctls.25%
## region  1000     0  2.022   0.8161998   1     1.00        1         1         1
## tenure  1000     0 35.526  21.3598119   1     1.00        4         7        17
## age     1000     0 41.684  12.5588163  18    20.00       23        26        32
## marital 1000     0  0.495   0.5002252   0     0.00        0         0         0
## address 1000     0 11.551  10.0866813   0     0.00        0         1         3
## income  1000     0 77.535 107.0441648   9    10.99       18        21        29
##         pctls.50% pctls.75% pctls.90% pctls.95% pctls.99%  max
## region          2         3       3.0      3.00      3.00    3
## tenure         34        54      66.0     70.00     72.00   72
## age            40        51      59.0     64.00     70.01   77
## marital         0         1       1.0      1.00      1.00    1
## address         9        18      26.1     31.00     43.00   55
## income         47        83     155.4    232.25    460.48 1668

Outlier Treatment

An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error. An outlier can cause serious problems in statistical analysis.

telco$longmon[telco$longmon>42.81355889]<- 42.81355889
telco$tollmon[telco$tollmon>63.98036623]<- 63.98036623
telco$equipmon[telco$equipmon>71.42541641]<- 71.42541641
telco$cardmon[telco$cardmon>56.03448901]<- 56.03448901
telco$wiremon[telco$wiremon>70.74217684]<- 70.74217684
telco$income[telco$income>232.25]<- 232.25

Here we will inclue only those variables which make business sense.

telco0 = subset(telco, select = -c(custcat,ed,region,address,retire))

Factor Analysis

Factor analysis is a technique that is used to reduce a large number of variables into fewer numbers of factors. This technique extracts maximum common variance from all variables and puts them into a common score. Factor analysis is part of general linear model (GLM) and this method also assumes several assumptions: - There is linear relationship - There is no multicollinearity - It includes relevant variables into analysis - There is true correlation between variables and factors.

Correlation Matrix

A correlation matrix is simply a table which displays the correlation coefficients for different variables. The matrix depicts the correlation between all the possible pairs of values in a table. It is a powerful tool to summarize a large dataset and to identify and visualize patterns in the given data.

Factor Analysis is based on Correlation, it uses correlation as an input. So here we are creating a Correlation Matrix.

corrm = cor(telco0)
head(corrm)

##             tenure         age      marital      income       employ
## tenure  1.00000000  0.49021726  0.154171373  0.30738310  0.520288419
## age     0.49021726  1.00000000 -0.014432866  0.40982414  0.670046986
## marital 0.15417137 -0.01443287  1.000000000 -0.01459662 -0.002890873
## income  0.30738310  0.40982414 -0.014596621  1.00000000  0.642244671
## employ  0.52028842  0.67004699 -0.002890873  0.64224467  1.000000000
## gender  0.02353288 -0.01141912  0.008345242  0.03151291  0.046016477
##               gender      reside    tollfree         equip    callcard
## tenure   0.023532879 -0.03659250  0.06873054 -0.1531774649 0.462181989
## age     -0.011419117 -0.25499327  0.09282143 -0.1724809789 0.280004174
## marital  0.008345242  0.62599878  0.01349894 -0.0002875888 0.057317799
## income   0.031512913 -0.10627573  0.16430605 -0.0230135214 0.205451559
## employ   0.046016477 -0.13468330  0.14272735 -0.2042438059 0.292414769
## gender   1.000000000  0.01655645 -0.01626339 -0.0023101471 0.002030097
##             wireless     longmon     tollmon     equipmon     cardmon
## tenure   0.003006869 0.774445432 0.239455329 -0.060847694  0.44628090
## age     -0.010202930 0.411953567 0.182580063 -0.127170596  0.28821617
## marital  0.032773372 0.132626798 0.063959849  0.013231727  0.10920756
## income   0.128952031 0.258607035 0.220614310  0.024525137  0.18377626
## employ  -0.013076295 0.463511630 0.247872449 -0.161196790  0.28915557
## gender  -0.000140281 0.006040312 0.001086482  0.001846442 -0.02405169
##            wiremon   multline        voice       pager    internet      callid
## tenure  0.09606562 0.40962974 -0.005590909 -0.02072139 -0.14088848 0.058310909
## age     0.03659999 0.12492786 -0.032375768 -0.05487644 -0.14102545 0.080997688
## marital 0.05195499 0.09161917  0.019653900  0.04465364  0.02422044 0.035647528
## income  0.16858727 0.12055072  0.111907587  0.13891397 -0.01006946 0.171492238
## employ  0.03589811 0.11867028 -0.031076538 -0.01572814 -0.19861355 0.143052572
## gender  0.01191728 0.02173975  0.016671138 -0.03160897  0.01968536 0.001293682
##            callwait     forward       confer        ebill
## tenure   0.08677702  0.08588377  0.133584783 -0.157094763
## age      0.07368522  0.08059528  0.121192741 -0.144425326
## marital  0.02371186  0.04386649  0.006040350 -0.056495385
## income   0.14854184  0.13814952  0.164903713 -0.004472019
## employ   0.15255484  0.11580105  0.168607527 -0.182043430
## gender  -0.03101188 -0.03554803 -0.002137253 -0.015770729

Selecting Number of Factors

Factor is a group of variables which are related to each other. In order to know the number of factors to create, we have a concept of Eigen Values.

Eigen values is also called characteristic roots. Eigen values shows variance explained by that particular factor out of the total variance. From the commonality column, we can know how much variance is explained by the first factor out of the total variance. For example, if our first factor explains 68% variance out of the total, this means that 32% variance will be explained by the other factor.

eigen(corrm)$values

##  [1] 6.58247489 4.30989791 2.66208726 1.75475707 1.16920837 0.99946833
##  [7] 0.86450147 0.78940688 0.59399790 0.57325397 0.54115089 0.48055277
## [13] 0.46061830 0.44191731 0.41658882 0.38898210 0.36220645 0.35387059
## [19] 0.32833103 0.29190714 0.21994438 0.20004349 0.11555017 0.07104984
## [25] 0.02823265

To have better understanding we will create a table with eigen values, cumulative eigen values, percent variance and cumulative percent variance.If eigen value > 1 that means it is forcing variables into the group and if eigen value < 1 then there is overlap across the group.

Number of factors to be decided on:

Eigen values should be close to 1, preferably just below 1
Cumulative percent variance should be close to 70%

eigen_values = mutate(data.frame(eigen(corrm)$values)
                      , cum_sum_eigen = cumsum(eigen.corrm..values)
                      , pct_var = eigen.corrm..values/sum(eigen.corrm..values)
                      , cum_pct_var = cum_sum_eigen/sum(eigen.corrm..values))

Factor Loading

Factor Loadings are basically the shadows that each and every variable will drop on that particular factor.

fa = fa(r = corrm, 7, rotate = "varimax", fm = "ml")

Here we will sort it in decending order so that the maximum loading come on the top and minimum loading go to the bottom.

Fa_sort = fa.sort(fa)
#ls(Fa_sort)

Here we will see what is happening with each and every variable with each and every factor.

#Fa_sort$loadings
loadings = data.frame(Fa_sort$loadings[1:ncol(telco0),])

We’ll get the output in the form of csv file.

write.csv(loadings,"D:\\Business Analytics\\R\\Factor Analysis\\loadingsf1.csv")

vars <- c( "region","tenure" ,"age" ,"marital" ,"address" , "income","ed" ,"employ",
           "retire","gender","reside","tollfree", "equip","callcard", "wireless", "longmon",
           "tollmon",  "equipmon", "cardmon",  "wiremon",  "multline", "voice", "pager" ,
           "internet","callid", "callwait", "forward", "confer", "ebill","custcat")

inputdata_final = telco[vars]

Standardization

Standardization is the process of putting different variables on the same scale. This process allows you to compare scores between different types of variables.

inputdata_final = data.frame(scale(inputdata_final))

These variables are selected on the basis of Factor Analysis.

clus <- c("tollmon","callwait","forward","pager","voice","equipmon","internet","tenure","longmon","multline","income","wiremon","cardmon")

inputdata_clus <- inputdata_final[clus]

Clustering

Clustering is the process of finding similarities in customers so that they can be grouped, and therefore segmented. We will use k-means clustering to divide the data into 3,4,5 and 6 cluster.

#building cluster using k-means clustering

cluster_three = kmeans(inputdata_clus,3)
cluster_four = kmeans(inputdata_clus,4)
cluster_five = kmeans(inputdata_clus,5)
cluster_six = kmeans(inputdata_clus,6)

telco_new = cbind(telco,km_cluster_three = cluster_three$cluster,km_cluster_four = cluster_four$cluster,km_cluster_five = cluster_five$cluster,km_cluster_six = cluster_six$cluster)
head(telco_new)

##   region tenure age marital address income ed employ retire gender reside
## 1      2     13  44       1       9     64  4      5      0      0      2
## 2      3     11  33       1       7    136  5      5      0      0      6
## 3      3     68  52       1      24    116  1     29      0      1      2
## 4      2     33  33       0      12     33  2      0      0      1      1
## 5      2     23  30       1       9     30  1      2      0      0      4
## 6      2     41  39       0      17     78  2     16      0      1      1
##   tollfree equip callcard wireless longmon tollmon equipmon cardmon wiremon
## 1        0     0        1        0    3.70    0.00        0    7.50     0.0
## 2        1     0        1        1    4.40   20.75        0   15.25    35.7
## 3        1     0        1        0   18.15   18.00        0   30.25     0.0
## 4        0     0        0        0    9.45    0.00        0    0.00     0.0
## 5        0     0        0        0    6.30    0.00        0    0.00     0.0
## 6        1     0        1        0   11.80   19.25        0   13.50     0.0
##   multline voice pager internet callid callwait forward confer ebill custcat
## 1        0     0     0        0      0        0       1      0     0       1
## 2        0     1     1        0      1        1       1      1     0       4
## 3        0     0     0        0      1        1       0      1     0       3
## 4        0     0     0        0      0        0       0      0     0       1
## 5        0     0     0        0      1        0       1      1     0       3
## 6        0     0     0        0      1        1       0      0     0       3
##   km_cluster_three km_cluster_four km_cluster_five km_cluster_six
## 1                3               4               5              6
## 2                1               1               1              3
## 3                2               3               3              2
## 4                3               4               5              6
## 5                3               4               5              6
## 6                2               2               1              5

Ploting a clusplot to visualize how data is divided into different cluster.

#graph based on k-means
clusplot(inputdata_clus,cluster_four$cluster, color = TRUE, lines = 6, labels = 2)

Converting the datatype of clusters to factor

# converting to factors

telco_new$km_cluster_three = as.factor(telco_new$km_cluster_three)
telco_new$km_cluster_four  = as.factor(telco_new$km_cluster_four)
telco_new$km_cluster_five = as.factor(telco_new$km_cluster_five)
telco_new$km_cluster_six = as.factor(telco_new$km_cluster_six)

Profiling

In order to understand what does these groups like and don’t like we carry out the Profiling exercise, where for each segment of my N segment solution, we are trying to see how are they behaving on different variables.

In order to analyse the segmentation, we will convert the clusters to tabular form.

profile = tabular(1+tenure+age+marital+address+income+ed+employ+retire+gender+reside+tollfree+
                    equip+callcard+wireless+longmon+tollmon+equipmon+cardmon+wiremon+multline+
                    voice+pager+internet+callid+callwait+forward+confer+ebill ~ mean + (mean*km_cluster_three)+
                    (mean*km_cluster_four)+(mean*km_cluster_five)+(mean*km_cluster_six), data = telco_new)

profile1 = as.matrix(profile)
profile1 = data.frame(profile1)
head(profile1)

##       X1     X2               X3       X4       X5              X6       X7
## 1                           mean                              mean         
## 2               km_cluster_three                   km_cluster_four         
## 3          mean                1        2        3               1        2
## 4    All     NA               NA       NA       NA              NA       NA
## 5 tenure 35.526          35.8874 47.98013 27.29336        37.32883 30.58605
## 6    age 41.684          40.4242 47.26821 38.69593        40.91441 41.46512
##         X8       X9             X10      X11       X12     X13      X14
## 1                              mean                                    
## 2                   km_cluster_five                                    
## 3        3        4               1        2         3       4        5
## 4       NA       NA              NA       NA        NA      NA       NA
## 5 63.04945 24.11549        30.42051 28.41053  65.32609 37.6782 27.52727
## 6 50.84615 37.87927        41.88205 38.60000  52.11594 40.8416 39.05818
##              X15      X16       X17      X18      X19      X20
## 1           mean                                              
## 2 km_cluster_six                                              
## 3              1        2         3        4        5        6
## 4             NA       NA        NA       NA       NA       NA
## 5       55.78610  64.6875 21.854962 53.52128 29.29064 19.93115
## 6       46.71658  54.1750 35.480916 46.38298 41.00000 36.99344

Here we will convert the segmentation to csv file for further analysis.

#write.csv(profile1, "profilef1.csv",row.names = F)
#write.csv(profile2, "profilef2.csv",row.names = F)