Call Detail Record Analysis Using K-means Clustering

Overview

A Call Detail Record (CDR) is the information most telecom companies capture about a customer’s call, SMS, and Internet activity. Combined with customer demographics, it gives deeper insight into the customer’s needs. Telecom companies commonly use CDR data to detect fraud by clustering user profiles, to reduce customer churn by analyzing usage activity, and to target profitable customers through RFM analysis. RFM (recency, frequency, monetary) analysis is a marketing technique that quantitatively identifies the best customers by examining how recently a customer has purchased (recency), how often they purchase (frequency), and how much they spend (monetary). It rests on the marketing axiom that “80% of your business comes from 20% of your customers.”
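To make the three RFM quantities concrete, here is a minimal sketch in plain Python. The purchase log, the customer IDs, and the `rfm_table` helper are all hypothetical and exist only for illustration; they are not part of the CDR data set or of the original analysis.

```python
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date, amount).
purchases = [
    ("A", date(2014, 1, 3), 20.0),
    ("A", date(2014, 1, 20), 35.0),
    ("B", date(2013, 11, 5), 120.0),
    ("C", date(2014, 1, 28), 5.0),
    ("C", date(2014, 1, 29), 7.5),
    ("C", date(2014, 1, 30), 4.0),
]

def rfm_table(records, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in records:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (today - day).days
        if recency is None or days_ago < recency:
            recency = days_ago  # keep the most recent purchase
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table

rfm = rfm_table(purchases, today=date(2014, 2, 1))
for customer, (recency, frequency, monetary) in sorted(rfm.items()):
    print(customer, recency, frequency, monetary)
```

A customer with low recency, high frequency, and high monetary value would rank as one of the “best” customers; how the three values are combined into a single score is a separate design choice that this sketch leaves open.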

In this report, we cluster customer activity over a 24-hour period using the unsupervised K-means clustering algorithm. The goal is to understand customer segments in terms of their usage by hour. For example, a segment with high activity may generate more revenue, while a segment whose activity is concentrated in the night hours may warrant a fraud investigation.

The data and code used for this project can be downloaded from GitHub.

Data Description

A daily activity file from the Dandelion API is used as the data source. The file contains CDRs generated by the Telecom Italia cellular network over the city of Milano, with information for 10,000 grid squares. The data set consists of 107,809 observations and 8 numerical variables: the ID of the square grid where the activity took place, a timestamp marking when the activity started, the country code, SMS-in and SMS-out activity, call-in and call-out activity, and Internet traffic activity. Let’s take a look:

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : number of items read is not a multiple of the number of columns

## 'data.frame':    107809 obs. of  8 variables:
##  $ square_id                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ time_interval            : num  1.39e+12 1.39e+12 1.39e+12 1.39e+12 1.39e+12 ...
##  $ country_code             : int  0 33 359 39 41 0 33 39 48 0 ...
##  $ sms_in_activity          : num  0.237 NA 0.0273 3.8694 NA ...
##  $ sms_out_activity         : num  NA NA 0.0807 5.7099 NA ...
##  $ call_in_activity         : num  NA 0.0261 NA 2.0738 0.0273 ...
##  $ call_out_activity        : num  0.0546 NA NA 2.752 NA ...
##  $ internet_traffic_activity: num  NA NA NA 11.4 NA ...

The summary of the K-means model fitted to this data (discussed under Call Detail Record Clustering) is:

##              Length Class  Mode   
## cluster      107809 -none- numeric
## centers          20 -none- numeric
## totss             1 -none- numeric
## withinss         10 -none- numeric
## tot.withinss      1 -none- numeric
## betweenss         1 -none- numeric
## size             10 -none- numeric
## iter              1 -none- numeric
## ifault            1 -none- numeric

Interpreting the Visualizations

Our first bar plot shows total activity by activity hour. Activity peaks in hour 23 and is lowest in hour 06.
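The aggregation behind a plot like this can be sketched in plain Python. The sketch rests on several assumptions that the report does not confirm: that `time_interval` is in milliseconds since the Unix epoch, that the hour is taken in UTC (the original may have used local time), and that missing (NA) activity values count as zero. The sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Column names follow the data description above.
ACTIVITY_COLUMNS = (
    "sms_in_activity", "sms_out_activity",
    "call_in_activity", "call_out_activity",
    "internet_traffic_activity",
)

def activity_hour(time_interval_ms):
    """Hour of day (0-23), assuming milliseconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(time_interval_ms / 1000, tz=timezone.utc).hour

def total_activity(row):
    """Sum the five activity columns, counting missing (None) values as zero."""
    return sum(row.get(col) or 0.0 for col in ACTIVITY_COLUMNS)

def activity_by_hour(rows):
    """Return {hour: summed total activity} over all rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[activity_hour(row["time_interval"])] += total_activity(row)
    return dict(totals)

# Hypothetical sample rows (not taken from the real file).
MIDNIGHT_MS = 16040 * 86_400_000  # a UTC midnight; any whole number of days works
HOUR_MS = 3_600_000
rows = [
    {"time_interval": MIDNIGHT_MS + 23 * HOUR_MS, "sms_in_activity": 0.5, "call_out_activity": 0.25},
    {"time_interval": MIDNIGHT_MS + 23 * HOUR_MS, "internet_traffic_activity": 2.0, "call_in_activity": 1.0},
    {"time_interval": MIDNIGHT_MS + 6 * HOUR_MS, "sms_out_activity": 0.25, "call_in_activity": None},
    {"time_interval": MIDNIGHT_MS + 12 * HOUR_MS, "call_in_activity": 1.5},
]

hourly = activity_by_hour(rows)
busiest = max(hourly, key=hourly.get)
quietest = min(hourly, key=hourly.get)
print(hourly, busiest, quietest)
```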

Our second bar plot shows total activity for the top 25 square grids. Square grid ID 147 has the highest activity, followed by square grid ID 48.

Our third visualization, a pie chart, shows the top 10 countries by total activity. Country code 39 (Italy) has the highest activity.
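Rankings such as “top 25 square grids” and “top 10 countries” come down to summing total activity per key and keeping the largest totals. A minimal sketch in plain Python follows; the `top_n` helper, the `total_activity` field, and the sample records are hypothetical and do not reproduce the real figures.

```python
from collections import Counter

def top_n(records, key, n):
    """Sum total_activity per value of `key` and return the n largest totals."""
    totals = Counter()
    for record in records:
        totals[record[key]] += record["total_activity"]
    return totals.most_common(n)

# Hypothetical records (not taken from the real file).
records = [
    {"square_id": 147, "country_code": 39, "total_activity": 5.0},
    {"square_id": 147, "country_code": 39, "total_activity": 4.0},
    {"square_id": 48, "country_code": 39, "total_activity": 3.0},
    {"square_id": 48, "country_code": 33, "total_activity": 1.0},
    {"square_id": 1, "country_code": 0, "total_activity": 2.0},
    {"square_id": 1, "country_code": 41, "total_activity": 1.0},
]

top_grids = top_n(records, "square_id", 2)
top_countries = top_n(records, "country_code", 1)

# Share of the overall activity held by the leading country (a pie-chart slice).
grand_total = sum(record["total_activity"] for record in records)
leading_share = top_countries[0][1] / grand_total
print(top_grids, top_countries, leading_share)
```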

Call Detail Record Clustering

K-means is a popular unsupervised clustering algorithm for finding patterns in data. We chose 10 clusters because the Sum of Squared Error (SSE) decreases only marginally beyond 10 clusters, with no unexpected increase in the error. The summary of the fitted CDR K-means model is shown above.
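The SSE-based choice of the number of clusters can be illustrated with a small, self-contained sketch in plain Python. It is not the code used for the report. The summary fields (`tot.withinss`, `withinss`) suggest R’s `kmeans`, which defaults to the Hartigan-Wong algorithm; the sketch instead uses Lloyd’s algorithm with a deterministic farthest-first seeding so that the result is reproducible. The 20 entries in `centers` are consistent with 10 centers over two features, so the sketch also clusters two-dimensional points, but the points themselves are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def farthest_first(points, k):
    """Deterministic seeding: repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centers, labels, total within-cluster SSE)."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    tot_withinss = sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))
    return centers, labels, tot_withinss

# Hypothetical data: three tight, well-separated groups of four points each.
blobs = [(0, 0), (10, 10), (20, 0)]
points = [(bx + dx, by + dy) for bx, by in blobs for dx in (0, 1) for dy in (0, 1)]

sse_by_k = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this toy data the SSE drops sharply up to k = 3 and changes little afterwards, which is the “elbow” pattern the report describes at k = 10 for the CDR data.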

The fourth and final visualization is a heat map of total activity by cluster and activity hour. It shows that clusters 2 and 7 are active in all 24 hours and are the most revenue-generating clusters. Clusters 2, 3, 5, 7, and 8 all have activity in the late hours. Cluster 8 is active only in hours 19-23 (at night), which can be investigated further for possible fraudulent activity. Cluster 5 is the second most revenue-generating cluster.
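The readings taken from the heat map (clusters active around the clock, clusters active only at night, clusters ranked by total activity) can be derived from a cluster-by-hour matrix. The following is a minimal sketch in plain Python; the cluster labels "a", "b", and "c", the activity values, and the 19-23 night window used as a threshold are illustrative assumptions, not the report’s results.

```python
from collections import defaultdict

NIGHT_HOURS = set(range(19, 24))  # hours 19-23, as in the text above

def cluster_hour_matrix(rows):
    """rows: (cluster, hour, total_activity) triples -> {cluster: [24 hourly totals]}."""
    matrix = defaultdict(lambda: [0.0] * 24)
    for cluster, hour, activity in rows:
        matrix[cluster][hour] += activity
    return dict(matrix)

def active_hours(hourly_totals):
    """Set of hours with non-zero activity."""
    return {hour for hour, total in enumerate(hourly_totals) if total > 0}

def is_night_only(hourly_totals):
    """True when there is some activity and all of it falls in NIGHT_HOURS."""
    hours = active_hours(hourly_totals)
    return bool(hours) and hours <= NIGHT_HOURS

# Hypothetical clusters "a", "b", "c" (not the clusters from the report).
rows = (
    [("a", h, 1.0) for h in range(24)]        # active around the clock
    + [("b", h, 2.0) for h in range(19, 24)]  # active only at night
    + [("c", h, 0.5) for h in range(9, 18)]   # active during the day
)

matrix = cluster_hour_matrix(rows)
all_day = [c for c in sorted(matrix) if len(active_hours(matrix[c])) == 24]
night_only = [c for c in sorted(matrix) if is_night_only(matrix[c])]
ranked = sorted(matrix, key=lambda c: sum(matrix[c]), reverse=True)
print(all_day, night_only, ranked)
```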

Conclusion

This clustering approach identifies the clusters that contribute the most traffic to the telecom network, measured by total activity. Similarly, square grid and country code information can be used to find the square grids likely to generate more revenue and traffic, and to target customers based on their geolocation.