A US based fixed-line telecom service provider wants to roll out bundled combo minute plans eg: STD call plans, Local call plans and internet plans.
Customer Segmentation is the process of splitting your customer database into smaller groups. By focusing on specific customer types, we can maximize customer lifetime value and better understand who they are and what they need.
telco = read.csv("D:\\Business Analytics\\R\\Factor Analysis\\telco_csv.csv")
head(telco)
## region tenure age marital address income ed employ retire gender reside
## 1 2 13 44 1 9 64 4 5 0 0 2
## 2 3 11 33 1 7 136 5 5 0 0 6
## 3 3 68 52 1 24 116 1 29 0 1 2
## 4 2 33 33 0 12 33 2 0 0 1 1
## 5 2 23 30 1 9 30 1 2 0 0 4
## 6 2 41 39 0 17 78 2 16 0 1 1
## tollfree equip callcard wireless longmon tollmon equipmon cardmon wiremon
## 1 0 0 1 0 3.70 0.00 0 7.50 0.0
## 2 1 0 1 1 4.40 20.75 0 15.25 35.7
## 3 1 0 1 0 18.15 18.00 0 30.25 0.0
## 4 0 0 0 0 9.45 0.00 0 0.00 0.0
## 5 0 0 0 0 6.30 0.00 0 0.00 0.0
## 6 1 0 1 0 11.80 19.25 0 13.50 0.0
## multline voice pager internet callid callwait forward confer ebill custcat
## 1 0 0 0 0 0 0 1 0 0 1
## 2 0 1 1 0 1 1 1 1 0 4
## 3 0 0 0 0 1 1 0 1 0 3
## 4 0 0 0 0 0 0 0 0 0 1
## 5 0 0 0 0 1 0 1 1 0 3
## 6 0 0 0 0 1 1 0 0 0 3
Creating a user defined function to have a better understanding of the data.
mystats <- function(x) {
nmiss<-sum(is.na(x))
a <- x[!is.na(x)]
m <- mean(a)
n <- length(a)
s <- sd(a)
min <- min(a)
pctls<-quantile(a,probs=c(0.01, 0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99))
max <- max(a)
return(c(n=n, nmiss=nmiss, mean=m, stdev=s,min = min, pctls=pctls,max=max))
}
# Applying udf on the data
diag_stats = t(data.frame(sapply(telco,mystats)))
head(diag_stats)
## n nmiss mean stdev min pctls.1% pctls.5% pctls.10% pctls.25%
## region 1000 0 2.022 0.8161998 1 1.00 1 1 1
## tenure 1000 0 35.526 21.3598119 1 1.00 4 7 17
## age 1000 0 41.684 12.5588163 18 20.00 23 26 32
## marital 1000 0 0.495 0.5002252 0 0.00 0 0 0
## address 1000 0 11.551 10.0866813 0 0.00 0 1 3
## income 1000 0 77.535 107.0441648 9 10.99 18 21 29
## pctls.50% pctls.75% pctls.90% pctls.95% pctls.99% max
## region 2 3 3.0 3.00 3.00 3
## tenure 34 54 66.0 70.00 72.00 72
## age 40 51 59.0 64.00 70.01 77
## marital 0 1 1.0 1.00 1.00 1
## address 9 18 26.1 31.00 43.00 55
## income 47 83 155.4 232.25 460.48 1668
An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error. An outlier can cause serious problems in statistical analysis.
telco$longmon[telco$longmon>42.81355889]<- 42.81355889
telco$tollmon[telco$tollmon>63.98036623]<- 63.98036623
telco$equipmon[telco$equipmon>71.42541641]<- 71.42541641
telco$cardmon[telco$cardmon>56.03448901]<- 56.03448901
telco$wiremon[telco$wiremon>70.74217684]<- 70.74217684
telco$income[telco$income>232.25]<- 232.25
Here we will inclue only those variables which make business sense.
telco0 = subset(telco, select = -c(custcat,ed,region,address,retire))
Factor analysis is a technique that is used to reduce a large number of variables into fewer numbers of factors. This technique extracts maximum common variance from all variables and puts them into a common score. Factor analysis is part of general linear model (GLM) and this method also assumes several assumptions: - There is linear relationship - There is no multicollinearity - It includes relevant variables into analysis - There is true correlation between variables and factors.
A correlation matrix is simply a table which displays the correlation coefficients for different variables. The matrix depicts the correlation between all the possible pairs of values in a table. It is a powerful tool to summarize a large dataset and to identify and visualize patterns in the given data.
Factor Analysis is based on Correlation, it uses correlation as an input. So here we are creating a Correlation Matrix.
corrm = cor(telco0)
head(corrm)
## tenure age marital income employ
## tenure 1.00000000 0.49021726 0.154171373 0.30738310 0.520288419
## age 0.49021726 1.00000000 -0.014432866 0.40982414 0.670046986
## marital 0.15417137 -0.01443287 1.000000000 -0.01459662 -0.002890873
## income 0.30738310 0.40982414 -0.014596621 1.00000000 0.642244671
## employ 0.52028842 0.67004699 -0.002890873 0.64224467 1.000000000
## gender 0.02353288 -0.01141912 0.008345242 0.03151291 0.046016477
## gender reside tollfree equip callcard
## tenure 0.023532879 -0.03659250 0.06873054 -0.1531774649 0.462181989
## age -0.011419117 -0.25499327 0.09282143 -0.1724809789 0.280004174
## marital 0.008345242 0.62599878 0.01349894 -0.0002875888 0.057317799
## income 0.031512913 -0.10627573 0.16430605 -0.0230135214 0.205451559
## employ 0.046016477 -0.13468330 0.14272735 -0.2042438059 0.292414769
## gender 1.000000000 0.01655645 -0.01626339 -0.0023101471 0.002030097
## wireless longmon tollmon equipmon cardmon
## tenure 0.003006869 0.774445432 0.239455329 -0.060847694 0.44628090
## age -0.010202930 0.411953567 0.182580063 -0.127170596 0.28821617
## marital 0.032773372 0.132626798 0.063959849 0.013231727 0.10920756
## income 0.128952031 0.258607035 0.220614310 0.024525137 0.18377626
## employ -0.013076295 0.463511630 0.247872449 -0.161196790 0.28915557
## gender -0.000140281 0.006040312 0.001086482 0.001846442 -0.02405169
## wiremon multline voice pager internet callid
## tenure 0.09606562 0.40962974 -0.005590909 -0.02072139 -0.14088848 0.058310909
## age 0.03659999 0.12492786 -0.032375768 -0.05487644 -0.14102545 0.080997688
## marital 0.05195499 0.09161917 0.019653900 0.04465364 0.02422044 0.035647528
## income 0.16858727 0.12055072 0.111907587 0.13891397 -0.01006946 0.171492238
## employ 0.03589811 0.11867028 -0.031076538 -0.01572814 -0.19861355 0.143052572
## gender 0.01191728 0.02173975 0.016671138 -0.03160897 0.01968536 0.001293682
## callwait forward confer ebill
## tenure 0.08677702 0.08588377 0.133584783 -0.157094763
## age 0.07368522 0.08059528 0.121192741 -0.144425326
## marital 0.02371186 0.04386649 0.006040350 -0.056495385
## income 0.14854184 0.13814952 0.164903713 -0.004472019
## employ 0.15255484 0.11580105 0.168607527 -0.182043430
## gender -0.03101188 -0.03554803 -0.002137253 -0.015770729
Factor is a group of variables which are related to each other. In order to know the number of factors to create, we have a concept of Eigen Values.
Eigen values is also called characteristic roots. Eigen values shows variance explained by that particular factor out of the total variance. From the commonality column, we can know how much variance is explained by the first factor out of the total variance. For example, if our first factor explains 68% variance out of the total, this means that 32% variance will be explained by the other factor.
eigen(corrm)$values
## [1] 6.58247489 4.30989791 2.66208726 1.75475707 1.16920837 0.99946833
## [7] 0.86450147 0.78940688 0.59399790 0.57325397 0.54115089 0.48055277
## [13] 0.46061830 0.44191731 0.41658882 0.38898210 0.36220645 0.35387059
## [19] 0.32833103 0.29190714 0.21994438 0.20004349 0.11555017 0.07104984
## [25] 0.02823265
To have better understanding we will create a table with eigen values, cumulative eigen values, percent variance and cumulative percent variance.If eigen value > 1 that means it is forcing variables into the group and if eigen value < 1 then there is overlap across the group.
Number of factors to be decided on:
eigen_values = mutate(data.frame(eigen(corrm)$values)
, cum_sum_eigen = cumsum(eigen.corrm..values)
, pct_var = eigen.corrm..values/sum(eigen.corrm..values)
, cum_pct_var = cum_sum_eigen/sum(eigen.corrm..values))
Factor Loadings are basically the shadows that each and every variable will drop on that particular factor.
fa = fa(r = corrm, 7, rotate = "varimax", fm = "ml")
Here we will sort it in decending order so that the maximum loading come on the top and minimum loading go to the bottom.
Fa_sort = fa.sort(fa)
#ls(Fa_sort)
Here we will see what is happening with each and every variable with each and every factor.
#Fa_sort$loadings
loadings = data.frame(Fa_sort$loadings[1:ncol(telco0),])
We’ll get the output in the form of csv file.
write.csv(loadings,"D:\\Business Analytics\\R\\Factor Analysis\\loadingsf1.csv")
vars <- c( "region","tenure" ,"age" ,"marital" ,"address" , "income","ed" ,"employ",
"retire","gender","reside","tollfree", "equip","callcard", "wireless", "longmon",
"tollmon", "equipmon", "cardmon", "wiremon", "multline", "voice", "pager" ,
"internet","callid", "callwait", "forward", "confer", "ebill","custcat")
inputdata_final = telco[vars]
Standardization is the process of putting different variables on the same scale. This process allows you to compare scores between different types of variables.
inputdata_final = data.frame(scale(inputdata_final))
These variables are selected on the basis of Factor Analysis.
clus <- c("tollmon","callwait","forward","pager","voice","equipmon","internet","tenure","longmon","multline","income","wiremon","cardmon")
inputdata_clus <- inputdata_final[clus]
Clustering is the process of finding similarities in customers so that they can be grouped, and therefore segmented. We will use k-means clustering to divide the data into 3,4,5 and 6 cluster.
#building cluster using k-means clustering
cluster_three = kmeans(inputdata_clus,3)
cluster_four = kmeans(inputdata_clus,4)
cluster_five = kmeans(inputdata_clus,5)
cluster_six = kmeans(inputdata_clus,6)
telco_new = cbind(telco,km_cluster_three = cluster_three$cluster,km_cluster_four = cluster_four$cluster,km_cluster_five = cluster_five$cluster,km_cluster_six = cluster_six$cluster)
head(telco_new)
## region tenure age marital address income ed employ retire gender reside
## 1 2 13 44 1 9 64 4 5 0 0 2
## 2 3 11 33 1 7 136 5 5 0 0 6
## 3 3 68 52 1 24 116 1 29 0 1 2
## 4 2 33 33 0 12 33 2 0 0 1 1
## 5 2 23 30 1 9 30 1 2 0 0 4
## 6 2 41 39 0 17 78 2 16 0 1 1
## tollfree equip callcard wireless longmon tollmon equipmon cardmon wiremon
## 1 0 0 1 0 3.70 0.00 0 7.50 0.0
## 2 1 0 1 1 4.40 20.75 0 15.25 35.7
## 3 1 0 1 0 18.15 18.00 0 30.25 0.0
## 4 0 0 0 0 9.45 0.00 0 0.00 0.0
## 5 0 0 0 0 6.30 0.00 0 0.00 0.0
## 6 1 0 1 0 11.80 19.25 0 13.50 0.0
## multline voice pager internet callid callwait forward confer ebill custcat
## 1 0 0 0 0 0 0 1 0 0 1
## 2 0 1 1 0 1 1 1 1 0 4
## 3 0 0 0 0 1 1 0 1 0 3
## 4 0 0 0 0 0 0 0 0 0 1
## 5 0 0 0 0 1 0 1 1 0 3
## 6 0 0 0 0 1 1 0 0 0 3
## km_cluster_three km_cluster_four km_cluster_five km_cluster_six
## 1 3 4 5 6
## 2 1 1 1 3
## 3 2 3 3 2
## 4 3 4 5 6
## 5 3 4 5 6
## 6 2 2 1 5
Ploting a clusplot to visualize how data is divided into different cluster.
#graph based on k-means
clusplot(inputdata_clus,cluster_four$cluster, color = TRUE, lines = 6, labels = 2)
Converting the datatype of clusters to factor
# converting to factors
telco_new$km_cluster_three = as.factor(telco_new$km_cluster_three)
telco_new$km_cluster_four = as.factor(telco_new$km_cluster_four)
telco_new$km_cluster_five = as.factor(telco_new$km_cluster_five)
telco_new$km_cluster_six = as.factor(telco_new$km_cluster_six)
In order to understand what does these groups like and don’t like we carry out the Profiling exercise, where for each segment of my N segment solution, we are trying to see how are they behaving on different variables.
In order to analyse the segmentation, we will convert the clusters to tabular form.
profile = tabular(1+tenure+age+marital+address+income+ed+employ+retire+gender+reside+tollfree+
equip+callcard+wireless+longmon+tollmon+equipmon+cardmon+wiremon+multline+
voice+pager+internet+callid+callwait+forward+confer+ebill ~ mean + (mean*km_cluster_three)+
(mean*km_cluster_four)+(mean*km_cluster_five)+(mean*km_cluster_six), data = telco_new)
profile1 = as.matrix(profile)
profile1 = data.frame(profile1)
head(profile1)
## X1 X2 X3 X4 X5 X6 X7
## 1 mean mean
## 2 km_cluster_three km_cluster_four
## 3 mean 1 2 3 1 2
## 4 All NA NA NA NA NA NA
## 5 tenure 35.526 35.8874 47.98013 27.29336 37.32883 30.58605
## 6 age 41.684 40.4242 47.26821 38.69593 40.91441 41.46512
## X8 X9 X10 X11 X12 X13 X14
## 1 mean
## 2 km_cluster_five
## 3 3 4 1 2 3 4 5
## 4 NA NA NA NA NA NA NA
## 5 63.04945 24.11549 30.42051 28.41053 65.32609 37.6782 27.52727
## 6 50.84615 37.87927 41.88205 38.60000 52.11594 40.8416 39.05818
## X15 X16 X17 X18 X19 X20
## 1 mean
## 2 km_cluster_six
## 3 1 2 3 4 5 6
## 4 NA NA NA NA NA NA
## 5 55.78610 64.6875 21.854962 53.52128 29.29064 19.93115
## 6 46.71658 54.1750 35.480916 46.38298 41.00000 36.99344
Here we will convert the segmentation to csv file for further analysis.
#write.csv(profile1, "profilef1.csv",row.names = F)
#write.csv(profile2, "profilef2.csv",row.names = F)