1 Overview

1.1 Getting Started

packages = c('funModeling', 'haven', 'corrplot', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

1.2 The haven Package

haven is a part of the tidyverse, an ecosystem of packages designed with common APIs and a shared philosophy. It enables R to read and write various data formats used by other statistical packages by wrapping the fantastic ReadStat C library written by Evan Miller.

Currently it offers:

  • read_sas() reads .sas7bdat + .sas7bcat files of SAS and read_xpt() reads SAS transport files (version 5 and version 8).

  • read_sav() reads .sav files of SPSS and read_por() reads the older .por files. write_sav() writes .sav files.

  • read_dta() reads .dta files (up to version 15) of Stata. write_dta() writes .dta files (versions 8-15).

1.3 Importing case data

In this exercise, telecom.sas7bdat used in Lab Exercise 7 of ISSS602 Data Analytics Lab will be used. The code chunk below shows how a SAS’s sas7bdat (i.e. telecom.sas7bdat) file can be imported using read_sas() function of haven package.

telecom <- read_sas("data/telecom.sas7bdat")

Notice that the function return tibble. It uses underscores instead of dots. Labelled vectors are returned as a new labelled class. Character vectors are not converted to factors. See vignette (“semantics”) for more details.

2 Getting to Known the Data Frame

Before any analysis can be perform, it is important for us to explore the data. The purpose of initial exploration are: - to understand the structure of the data, and - to identify data quality issues such as messy data, dirty data, missing data, etc.

2.1 Basic way to view the data

One of the easy way to view the data is by typing the name of the data frame directly on R Console. The code chunk below will be used to display telecom data.

telecom

Tibble or tbl_df wraps a local data frame. The main advantage of using a tbl_df over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen, describing the rest of it as text.

Notice that only the first ten records will be displayed. Furthermore, only the columns that can be displayed on the view will appear. This is the characteristics of tibble. To learn more about tibble, please refer to Overview and Chapter 10: Tibbles.

2.2 Alternative way to view the data: head()

head() returns the first part of user defined records. For example, the code below displays the first six records.

head(telecom)
## # A tibble: 6 x 46
##   GENDER_CD EDUCATION_CD SUBS_TENURE TOT_IB_CALL_DUR TOT_IB_CALL_CNT
##   <chr>     <chr>              <dbl>           <dbl>           <dbl>
## 1 M         " 2"                 198            54.2             192
## 2 M         " 1"                 114            13.3              51
## 3 M         " 2"                 114            12.3              94
## 4 M         " 1"                 228            49.7             171
## 5 F         " 4"                 168            16.3             536
## 6 M         " 2"                 132            40.7             656
## # ... with 41 more variables: AVG_OB_CALL_CNT <dbl>,
## #   TOT_OB_CALL_NAT_ROAM_CNT <dbl>, TOT_OB_CALL_INTL_CNT <dbl>,
## #   TOT_OB_CALL_LOC_CNT <dbl>, TOT_OB_CALL_NAT_CNT <dbl>,
## #   TOT_OB_CALL_INTL_ROAM_CNT <dbl>, TOT_DAY_LAST_COMPLAINT_CNT <dbl>,
## #   TOT_DAY_LAST_OB_BARRED_CNT <dbl>, TOT_DAY_LAST_SUSPENDED_CNT <dbl>,
## #   TOT_EMAIL_QUERY_CNT <dbl>, MTH_TO_SUBS_END_CNT <dbl>,
## #   TOT_SRV_DROPPED_CNT <dbl>, TOT_SRV_ADDED_CNT <dbl>,
## #   TOT_OUTSTAND_60_90_DAY_AMT <dbl>, TOT_REV_FIX_AMT <dbl>,
## #   TOT_REV_GPRS_AMT <dbl>, TOT_REV_INET_AMT <dbl>,
## #   TOT_COMPLAINT_1_MTH_CNT <dbl>, TOT_MTH_LAST_SUSPENDED_CNT <dbl>,
## #   LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl>, MTH_SINCE_DATA_ACTVN <dbl>,
## #   MTH_SINCE_VM_ACTVN <dbl>, BARRING_REASON_CD <chr>, TOT_OB_CALL_CNT <dbl>,
## #   TOT_ACTV_SRV_CNT <dbl>, REV_AMT_BASE_1 <dbl>, REV_AMT_BASE_2 <dbl>,
## #   REV_AMT_BASE_3 <dbl>, REV_AMT_BASE_4 <dbl>, REV_AMT_BASE_5 <dbl>,
## #   REV_AMT_BASE_6 <dbl>, CUST_AGE <dbl>, PCT_CHNG_IB_SMS_CNT <dbl>,
## #   PCT_CHNG_SUSPENDED_CNT <dbl>, PCT_CHNG_BILL_AMT <dbl>, CUST_SUBS_ID <dbl>,
## #   TOT_REV_AMT <dbl>, TOT_PROF_AMT <dbl>, CUST_ID <dbl>, name <chr>,
## #   CHURN_FLG <chr>

2.3 Alternative way to view the data: tail()

tail() returns the last part of user defined records. For example, the code below displays the last five records.

tail(telecom)
## # A tibble: 6 x 46
##   GENDER_CD EDUCATION_CD SUBS_TENURE TOT_IB_CALL_DUR TOT_IB_CALL_CNT
##   <chr>     <chr>              <dbl>           <dbl>           <dbl>
## 1 M         " 1"                 132            47               105
## 2 F         " 3"                 114            63.2             797
## 3 M         " ."                 120            55.3             231
## 4 M         " 1"                 132            52               158
## 5 F         " 3"                 156            49.8            1279
## 6 M         " 1"                 132            46.3             168
## # ... with 41 more variables: AVG_OB_CALL_CNT <dbl>,
## #   TOT_OB_CALL_NAT_ROAM_CNT <dbl>, TOT_OB_CALL_INTL_CNT <dbl>,
## #   TOT_OB_CALL_LOC_CNT <dbl>, TOT_OB_CALL_NAT_CNT <dbl>,
## #   TOT_OB_CALL_INTL_ROAM_CNT <dbl>, TOT_DAY_LAST_COMPLAINT_CNT <dbl>,
## #   TOT_DAY_LAST_OB_BARRED_CNT <dbl>, TOT_DAY_LAST_SUSPENDED_CNT <dbl>,
## #   TOT_EMAIL_QUERY_CNT <dbl>, MTH_TO_SUBS_END_CNT <dbl>,
## #   TOT_SRV_DROPPED_CNT <dbl>, TOT_SRV_ADDED_CNT <dbl>,
## #   TOT_OUTSTAND_60_90_DAY_AMT <dbl>, TOT_REV_FIX_AMT <dbl>,
## #   TOT_REV_GPRS_AMT <dbl>, TOT_REV_INET_AMT <dbl>,
## #   TOT_COMPLAINT_1_MTH_CNT <dbl>, TOT_MTH_LAST_SUSPENDED_CNT <dbl>,
## #   LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl>, MTH_SINCE_DATA_ACTVN <dbl>,
## #   MTH_SINCE_VM_ACTVN <dbl>, BARRING_REASON_CD <chr>, TOT_OB_CALL_CNT <dbl>,
## #   TOT_ACTV_SRV_CNT <dbl>, REV_AMT_BASE_1 <dbl>, REV_AMT_BASE_2 <dbl>,
## #   REV_AMT_BASE_3 <dbl>, REV_AMT_BASE_4 <dbl>, REV_AMT_BASE_5 <dbl>,
## #   REV_AMT_BASE_6 <dbl>, CUST_AGE <dbl>, PCT_CHNG_IB_SMS_CNT <dbl>,
## #   PCT_CHNG_SUSPENDED_CNT <dbl>, PCT_CHNG_BILL_AMT <dbl>, CUST_SUBS_ID <dbl>,
## #   TOT_REV_AMT <dbl>, TOT_PROF_AMT <dbl>, CUST_ID <dbl>, name <chr>,
## #   CHURN_FLG <chr>

2.4 Understanding the structure of a data frame

In this code chunk below, str() is used to view the data type of each column.

str(telecom)
## tibble [13,196 x 46] (S3: tbl_df/tbl/data.frame)
##  $ GENDER_CD                   : chr [1:13196] "M" "M" "M" "M" ...
##   ..- attr(*, "label")= chr "Gender code"
##   ..- attr(*, "format.sas")= chr "$"
##  $ EDUCATION_CD                : chr [1:13196] " 2" " 1" " 2" " 1" ...
##   ..- attr(*, "label")= chr "Education code"
##  $ SUBS_TENURE                 : num [1:13196] 198 114 114 228 168 132 120 138 126 144 ...
##   ..- attr(*, "label")= chr "Tenure on network"
##  $ TOT_IB_CALL_DUR             : num [1:13196] 54.2 13.3 12.3 49.7 16.3 ...
##   ..- attr(*, "label")= chr "Total duration of incoming calls over 6 months"
##  $ TOT_IB_CALL_CNT             : num [1:13196] 192 51 94 171 536 656 283 148 179 156 ...
##   ..- attr(*, "label")= chr "Total number of incoming calls over 6 months"
##  $ AVG_OB_CALL_CNT             : num [1:13196] 897 252 1016 1534 730 ...
##   ..- attr(*, "label")= chr "Average number of outgoing calls over 6 months"
##  $ TOT_OB_CALL_NAT_ROAM_CNT    : num [1:13196] 738 25 43 39 16 144 28 49 18 17 ...
##   ..- attr(*, "label")= chr "Total  number of roaming calls over 6 months"
##  $ TOT_OB_CALL_INTL_CNT        : num [1:13196] 43 5568 72 23 16 ...
##   ..- attr(*, "label")= chr "Total  number of international calls over 6 months"
##  $ TOT_OB_CALL_LOC_CNT         : num [1:13196] 300 221 221 302 517 ...
##   ..- attr(*, "label")= chr "Total  number of local calls over 6 months"
##  $ TOT_OB_CALL_NAT_CNT         : num [1:13196] 125 216 79 108 570 85 73 446 116 94 ...
##   ..- attr(*, "label")= chr "Total  number of national calls over 6 months"
##  $ TOT_OB_CALL_INTL_ROAM_CNT   : num [1:13196] 0.9 0.0376 0.2927 0.6904 1.7252 ...
##   ..- attr(*, "label")= chr "Total  number of international roaming calls over 6 months"
##  $ TOT_DAY_LAST_COMPLAINT_CNT  : num [1:13196] 15 8 19 14 4 22 4 1 19 5 ...
##   ..- attr(*, "label")= chr "No. of days since last complaint made"
##  $ TOT_DAY_LAST_OB_BARRED_CNT  : num [1:13196] 11 5 2 4 3 8 13 11 9 14 ...
##   ..- attr(*, "label")= chr "No. of days since last outbound barred"
##  $ TOT_DAY_LAST_SUSPENDED_CNT  : num [1:13196] 3 23 25 7 18 20 10 18 16 1 ...
##   ..- attr(*, "label")= chr "No. of days since last suspension"
##  $ TOT_EMAIL_QUERY_CNT         : num [1:13196] 17 16 19 4 20 25 17 11 5 3 ...
##   ..- attr(*, "label")= chr "Total no. of email queries in last 6 months"
##  $ MTH_TO_SUBS_END_CNT         : num [1:13196] 4 2 3 4 2 3 3 2 3 4 ...
##   ..- attr(*, "label")= chr "Months to subscription end"
##  $ TOT_SRV_DROPPED_CNT         : num [1:13196] 2 3 4 0 2 0 0 1 1 1 ...
##   ..- attr(*, "label")= chr "No. of services dropped in last 6 months"
##  $ TOT_SRV_ADDED_CNT           : num [1:13196] 1 0 0 1 1 2 0 2 0 2 ...
##   ..- attr(*, "label")= chr "No. of services added in last 6 months"
##  $ TOT_OUTSTAND_60_90_DAY_AMT  : num [1:13196] 0 0 0 0 0 483 0 0 0 0 ...
##   ..- attr(*, "label")= chr "  Outstanding amount in 60 to 90 days delinquency period"
##  $ TOT_REV_FIX_AMT             : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Total revenue due to calls to fixline"
##  $ TOT_REV_GPRS_AMT            : num [1:13196] 0 729 574 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Total revenue due to GPRS"
##  $ TOT_REV_INET_AMT            : num [1:13196] 0 0 0 0 0 508 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Total revenue due to internet"
##  $ TOT_COMPLAINT_1_MTH_CNT     : num [1:13196] 0 0 1 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Total no. of complaints in last month"
##  $ TOT_MTH_LAST_SUSPENDED_CNT  : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "No. of months since last suspension"
##  $ LAST_PRICE_PLAN_CHNG_DAY_CNT: num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "No. of days since last price plan change"
##  $ MTH_SINCE_DATA_ACTVN        : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Months since data activation service"
##  $ MTH_SINCE_VM_ACTVN          : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Months since voice mail activation service"
##  $ BARRING_REASON_CD           : chr [1:13196] "000" "000" "002" "000" ...
##   ..- attr(*, "label")= chr "Barring type code"
##  $ TOT_OB_CALL_CNT             : num [1:13196] 5382 1512 6096 9204 4380 ...
##   ..- attr(*, "label")= chr "Total number of outgoing calls over 6 months"
##  $ TOT_ACTV_SRV_CNT            : num [1:13196] 0 0 0 4 3 3 4 2 0 3 ...
##   ..- attr(*, "label")= chr "Total no. of active services"
##  $ REV_AMT_BASE_1              : num [1:13196] 870 570 632 638 1162 ...
##   ..- attr(*, "label")= chr " Revenue amount in base 1 month"
##  $ REV_AMT_BASE_2              : num [1:13196] 548 1024 547 466 394 ...
##   ..- attr(*, "label")= chr " Revenue amount in base 2 month"
##  $ REV_AMT_BASE_3              : num [1:13196] 970 938 851 810 1141 ...
##   ..- attr(*, "label")= chr " Revenue amount in base 3 month"
##  $ REV_AMT_BASE_4              : num [1:13196] 392 602 821 655 244 325 568 645 427 659 ...
##   ..- attr(*, "label")= chr " Revenue amount in base 4 month"
##  $ REV_AMT_BASE_5              : num [1:13196] 358 729 574 402 423 787 830 241 901 437 ...
##   ..- attr(*, "label")= chr " Revenue amount in base 5 month"
##  $ REV_AMT_BASE_6              : num [1:13196] 1248 1129 1027 642 618 ...
##   ..- attr(*, "label")= chr " Revenue amount in base 6 month"
##  $ CUST_AGE                    : num [1:13196] 32 24 24 44 21 52 34 44 39 43 ...
##   ..- attr(*, "label")= chr "Customer age"
##  $ PCT_CHNG_IB_SMS_CNT         : num [1:13196] 1.155 1.265 0.902 1.87 1.267 ...
##   ..- attr(*, "label")= chr "Percent change of latest 2 months incoming SMS wrt previous 4 months incoming SMS"
##  $ PCT_CHNG_SUSPENDED_CNT      : num [1:13196] 0 0 5000 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Percent change of latest 2 months suspended calls wrt previous 4 months suspended calls"
##  $ PCT_CHNG_BILL_AMT           : num [1:13196] 0.956 0.938 0.72 0.88 1.283 ...
##   ..- attr(*, "label")= chr "Percent change of latest 2 months bill amount wrt previous 4 months bill amount"
##  $ CUST_SUBS_ID                : num [1:13196] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "label")= chr "Subscription identifier"
##  $ TOT_REV_AMT                 : num [1:13196] NA 99 47 NA 97 NA NA NA 10 3 ...
##   ..- attr(*, "label")= chr "Customer Value"
##  $ TOT_PROF_AMT                : num [1:13196] NA 99 47 NA 97 NA NA NA 10 3 ...
##   ..- attr(*, "label")= chr "Total profitability of a subscription"
##  $ CUST_ID                     : num [1:13196] 5198 752 3501 5406 6115 ...
##   ..- attr(*, "label")= chr "Customer identifier"
##  $ name                        : chr [1:13196] "Amy H Thomas" "Ignatius T Lyod" "Leo C Wood" "Augustine X Turner" ...
##   ..- attr(*, "label")= chr "Subscriber Name"
##  $ CHURN_FLG                   : chr [1:13196] "0" "1" "1" "0" ...
##   ..- attr(*, "label")= chr "Churn flag"
##  - attr(*, "label")= chr "CHURN_TELECOM"

Since telecom is tibble data frame, it is advisable to use glimpse() of dplyr package to display the data.

glimpse(telecom)
## Rows: 13,196
## Columns: 46
## $ GENDER_CD                    <chr> "M", "M", "M", "M", "F", "M", "M", "M"...
## $ EDUCATION_CD                 <chr> " 2", " 1", " 2", " 1", " 4", " 2", " ...
## $ SUBS_TENURE                  <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ TOT_IB_CALL_DUR              <dbl> 54.167, 13.333, 12.333, 49.667, 16.333...
## $ TOT_IB_CALL_CNT              <dbl> 192, 51, 94, 171, 536, 656, 283, 148, ...
## $ AVG_OB_CALL_CNT              <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT     <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT         <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT          <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT          <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT    <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT   <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT   <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT   <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT          <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT          <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT          <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT            <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT   <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT             <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT             <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD            <chr> "000", "000", "002", "000", "002", "00...
## $ TOT_OB_CALL_CNT              <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT             <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1               <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2               <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3               <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4               <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5               <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6               <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE                     <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT          <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT       <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT            <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ CUST_SUBS_ID                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
## $ TOT_REV_AMT                  <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ TOT_PROF_AMT                 <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ CUST_ID                      <dbl> 5198, 752, 3501, 5406, 6115, 5478, 514...
## $ name                         <chr> "Amy H Thomas", "Ignatius T Lyod", "Le...
## $ CHURN_FLG                    <chr> "0", "1", "1", "0", "1", "0", "0", "0"...

After reviewing the report above, it is clear that fields name, CUST_ID, and CUST_SUBS_ID are not required for the subsequent analysis. In view of this, we will exclude these three fields from telecom data frame by using the code chunk below.

telecom <- telecom %>% 
  select(-name, -CUST_SUBS_ID, -CUST_ID)

Notice that telecom data frame only have 43 variables now.

3 Data Exploration and Understanding

3.1 Why Data Profiling?

Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.

  • To find out whether existing data can be easily used for other purposes.
  • To improve the ability to search data by tagging it with keywords, descriptions, or assigning it to a category.
  • To assess data quality, including whether the data conforms to particular standards or patterns.
  • To assess the risk involved in integrating data in new applications, including the challenges of joins.
  • To discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies.
  • To assess whether known metadata accurately describes the actual values in the source database.
  • To understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
  • To have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality.

3.2 Introducing funModeling package

In order to perform data profiling, funModeling package will be used. It’s available in R cran. A quick getting started vignettes is available. An online book entitle Data Science Live Book provides a comprehensive tutorial on the package.

3.3 Checking Data Health Status using status()

In this section, we will use status() to check for missing data, zeros, data type of telecom data frame.

status(telecom)
##                        variable q_zeros      p_zeros q_na      p_na q_inf p_inf
## 1                     GENDER_CD       0 0.000000e+00    0 0.0000000     0     0
## 2                  EDUCATION_CD       0 0.000000e+00    0 0.0000000     0     0
## 3                   SUBS_TENURE       0 0.000000e+00    0 0.0000000     0     0
## 4               TOT_IB_CALL_DUR       0 0.000000e+00    0 0.0000000     0     0
## 5               TOT_IB_CALL_CNT       0 0.000000e+00    0 0.0000000     0     0
## 6               AVG_OB_CALL_CNT       0 0.000000e+00    0 0.0000000     0     0
## 7      TOT_OB_CALL_NAT_ROAM_CNT       0 0.000000e+00    0 0.0000000     0     0
## 8          TOT_OB_CALL_INTL_CNT       0 0.000000e+00    0 0.0000000     0     0
## 9           TOT_OB_CALL_LOC_CNT       0 0.000000e+00    0 0.0000000     0     0
## 10          TOT_OB_CALL_NAT_CNT       0 0.000000e+00    0 0.0000000     0     0
## 11    TOT_OB_CALL_INTL_ROAM_CNT       0 0.000000e+00    0 0.0000000     0     0
## 12   TOT_DAY_LAST_COMPLAINT_CNT     285 2.159745e-02    0 0.0000000     0     0
## 13   TOT_DAY_LAST_OB_BARRED_CNT     249 1.886935e-02    0 0.0000000     0     0
## 14   TOT_DAY_LAST_SUSPENDED_CNT     258 1.955138e-02    0 0.0000000     0     0
## 15          TOT_EMAIL_QUERY_CNT     234 1.773265e-02    0 0.0000000     0     0
## 16          MTH_TO_SUBS_END_CNT       0 0.000000e+00    0 0.0000000     0     0
## 17          TOT_SRV_DROPPED_CNT    3206 2.429524e-01    0 0.0000000     0     0
## 18            TOT_SRV_ADDED_CNT    2712 2.055168e-01    0 0.0000000     0     0
## 19   TOT_OUTSTAND_60_90_DAY_AMT   11778 8.925432e-01    0 0.0000000     0     0
## 20              TOT_REV_FIX_AMT   10768 8.160048e-01    0 0.0000000     0     0
## 21             TOT_REV_GPRS_AMT   10313 7.815247e-01    0 0.0000000     0     0
## 22             TOT_REV_INET_AMT    9693 7.345408e-01    0 0.0000000     0     0
## 23      TOT_COMPLAINT_1_MTH_CNT   12924 9.793877e-01    0 0.0000000     0     0
## 24   TOT_MTH_LAST_SUSPENDED_CNT   12878 9.759018e-01    0 0.0000000     0     0
## 25 LAST_PRICE_PLAN_CHNG_DAY_CNT   12891 9.768869e-01    0 0.0000000     0     0
## 26         MTH_SINCE_DATA_ACTVN   12799 9.699151e-01    0 0.0000000     0     0
## 27           MTH_SINCE_VM_ACTVN   12200 9.245226e-01    0 0.0000000     0     0
## 28            BARRING_REASON_CD       0 0.000000e+00    0 0.0000000     0     0
## 29              TOT_OB_CALL_CNT       0 0.000000e+00    0 0.0000000     0     0
## 30             TOT_ACTV_SRV_CNT    1953 1.479994e-01    0 0.0000000     0     0
## 31               REV_AMT_BASE_1       0 0.000000e+00    0 0.0000000     0     0
## 32               REV_AMT_BASE_2       0 0.000000e+00    0 0.0000000     0     0
## 33               REV_AMT_BASE_3       0 0.000000e+00    0 0.0000000     0     0
## 34               REV_AMT_BASE_4       0 0.000000e+00    0 0.0000000     0     0
## 35               REV_AMT_BASE_5       0 0.000000e+00    0 0.0000000     0     0
## 36               REV_AMT_BASE_6       0 0.000000e+00    0 0.0000000     0     0
## 37                     CUST_AGE       0 0.000000e+00    0 0.0000000     0     0
## 38          PCT_CHNG_IB_SMS_CNT       1 7.578054e-05    0 0.0000000     0     0
## 39       PCT_CHNG_SUSPENDED_CNT   12730 9.646863e-01    0 0.0000000     0     0
## 40            PCT_CHNG_BILL_AMT       0 0.000000e+00    0 0.0000000     0     0
## 41                  TOT_REV_AMT     176 1.333737e-02 9026 0.6839952     0     0
## 42                 TOT_PROF_AMT     176 1.333737e-02 9026 0.6839952     0     0
## 43                    CHURN_FLG   12105 9.173234e-01    0 0.0000000     0     0
##         type unique
## 1  character      3
## 2  character      8
## 3    numeric     37
## 4    numeric    412
## 5    numeric   1933
## 6    numeric   3312
## 7    numeric    780
## 8    numeric    808
## 9    numeric   2573
## 10   numeric   1710
## 11   numeric  13196
## 12   numeric     33
## 13   numeric     32
## 14   numeric     35
## 15   numeric     33
## 16   numeric      4
## 17   numeric      5
## 18   numeric      4
## 19   numeric    251
## 20   numeric    302
## 21   numeric    401
## 22   numeric    352
## 23   numeric      2
## 24   numeric      2
## 25   numeric      2
## 26   numeric      2
## 27   numeric      2
## 28 character      4
## 29   numeric   3312
## 30   numeric      8
## 31   numeric   1001
## 32   numeric   1201
## 33   numeric   1101
## 34   numeric    901
## 35   numeric    951
## 36   numeric   1051
## 37   numeric     43
## 38   numeric   8125
## 39   numeric      7
## 40   numeric  13113
## 41   numeric    101
## 42   numeric    101
## 43 character      2
  • q_zeros: quantity of zeros
  • p_zeros: percentage of zeros
  • q_na: quantity of NA
  • p_na: percentage of NA
  • type: data type, factor or numeric
  • unique: quantity of unique values, useful for categorical data

How to interpret these metrics important?

  • Zeros: Variables with lots of zeros may be not useful for modeling, and in some cases it may dramatically bias the model.
  • NA: Several models automatically exclude rows with NA (random forest, for example). As a result, the final model can be biased due to several missing rows because of only one variable. For example, if the data contains only one out of 100 variables with 90% of NAs, the model will be training with only 10% of original rows.
  • Type: Some variables are encoded as numbers, but they are codes or categories, and the models don’t handle them in the same way.
  • Unique: Factor/categorical variables with a high number of different values (~30), tend to do overfitting if categories have low representative, (decision tree, for example).

3.3.1 Ordering report by percentage of zeros

Instead of displaying all the measures, you can also focus on only one or a selected measures such as zeros. This is achieved by using arrange() and select() of dplyr package.

my_data_status <- status(telecom)
arrange(my_data_status, -p_zeros) %>%
  select(variable, q_zeros, p_zeros)
##                        variable q_zeros      p_zeros
## 1       TOT_COMPLAINT_1_MTH_CNT   12924 9.793877e-01
## 2  LAST_PRICE_PLAN_CHNG_DAY_CNT   12891 9.768869e-01
## 3    TOT_MTH_LAST_SUSPENDED_CNT   12878 9.759018e-01
## 4          MTH_SINCE_DATA_ACTVN   12799 9.699151e-01
## 5        PCT_CHNG_SUSPENDED_CNT   12730 9.646863e-01
## 6            MTH_SINCE_VM_ACTVN   12200 9.245226e-01
## 7                     CHURN_FLG   12105 9.173234e-01
## 8    TOT_OUTSTAND_60_90_DAY_AMT   11778 8.925432e-01
## 9               TOT_REV_FIX_AMT   10768 8.160048e-01
## 10             TOT_REV_GPRS_AMT   10313 7.815247e-01
## 11             TOT_REV_INET_AMT    9693 7.345408e-01
## 12          TOT_SRV_DROPPED_CNT    3206 2.429524e-01
## 13            TOT_SRV_ADDED_CNT    2712 2.055168e-01
## 14             TOT_ACTV_SRV_CNT    1953 1.479994e-01
## 15   TOT_DAY_LAST_COMPLAINT_CNT     285 2.159745e-02
## 16   TOT_DAY_LAST_SUSPENDED_CNT     258 1.955138e-02
## 17   TOT_DAY_LAST_OB_BARRED_CNT     249 1.886935e-02
## 18          TOT_EMAIL_QUERY_CNT     234 1.773265e-02
## 19                  TOT_REV_AMT     176 1.333737e-02
## 20                 TOT_PROF_AMT     176 1.333737e-02
## 21          PCT_CHNG_IB_SMS_CNT       1 7.578054e-05
## 22                    GENDER_CD       0 0.000000e+00
## 23                 EDUCATION_CD       0 0.000000e+00
## 24                  SUBS_TENURE       0 0.000000e+00
## 25              TOT_IB_CALL_DUR       0 0.000000e+00
## 26              TOT_IB_CALL_CNT       0 0.000000e+00
## 27              AVG_OB_CALL_CNT       0 0.000000e+00
## 28     TOT_OB_CALL_NAT_ROAM_CNT       0 0.000000e+00
## 29         TOT_OB_CALL_INTL_CNT       0 0.000000e+00
## 30          TOT_OB_CALL_LOC_CNT       0 0.000000e+00
## 31          TOT_OB_CALL_NAT_CNT       0 0.000000e+00
## 32    TOT_OB_CALL_INTL_ROAM_CNT       0 0.000000e+00
## 33          MTH_TO_SUBS_END_CNT       0 0.000000e+00
## 34            BARRING_REASON_CD       0 0.000000e+00
## 35              TOT_OB_CALL_CNT       0 0.000000e+00
## 36               REV_AMT_BASE_1       0 0.000000e+00
## 37               REV_AMT_BASE_2       0 0.000000e+00
## 38               REV_AMT_BASE_3       0 0.000000e+00
## 39               REV_AMT_BASE_4       0 0.000000e+00
## 40               REV_AMT_BASE_5       0 0.000000e+00
## 41               REV_AMT_BASE_6       0 0.000000e+00
## 42                     CUST_AGE       0 0.000000e+00
## 43            PCT_CHNG_BILL_AMT       0 0.000000e+00

DIY: Write a code chunk to display the quantity and percentage of NA for each variables. Sort the result ascendingly.

3.3.2 Removing variables with a high number of zeros

In this section, you will learn how to remove variables with a high number of zeros

my_data_status = status(telecom)
vars_to_remove = filter(my_data_status, p_zeros > 30) %>%
  .$variable
vars_to_remove <- vars_to_remove[-11]
vars_to_remove
## character(0)

The list above shows variables with more than 30% zeroes values.

Notice CHURN_FLG variable has been excluded from the list.

3.3.3 Keeping all columns except the ones present in vars_to_remove vector

telecom_retained <- telecom %>%
  select(-one_of(vars_to_remove), -TOT_REV_AMT, -TOT_PROF_AMT)

It is a good practice the review the output after the selection.

my_data_status <- status(telecom_retained)
arrange(my_data_status, -p_zeros) 
##                        variable q_zeros      p_zeros q_na p_na q_inf p_inf
## 1       TOT_COMPLAINT_1_MTH_CNT   12924 9.793877e-01    0    0     0     0
## 2  LAST_PRICE_PLAN_CHNG_DAY_CNT   12891 9.768869e-01    0    0     0     0
## 3    TOT_MTH_LAST_SUSPENDED_CNT   12878 9.759018e-01    0    0     0     0
## 4          MTH_SINCE_DATA_ACTVN   12799 9.699151e-01    0    0     0     0
## 5        PCT_CHNG_SUSPENDED_CNT   12730 9.646863e-01    0    0     0     0
## 6            MTH_SINCE_VM_ACTVN   12200 9.245226e-01    0    0     0     0
## 7                     CHURN_FLG   12105 9.173234e-01    0    0     0     0
## 8    TOT_OUTSTAND_60_90_DAY_AMT   11778 8.925432e-01    0    0     0     0
## 9               TOT_REV_FIX_AMT   10768 8.160048e-01    0    0     0     0
## 10             TOT_REV_GPRS_AMT   10313 7.815247e-01    0    0     0     0
## 11             TOT_REV_INET_AMT    9693 7.345408e-01    0    0     0     0
## 12          TOT_SRV_DROPPED_CNT    3206 2.429524e-01    0    0     0     0
## 13            TOT_SRV_ADDED_CNT    2712 2.055168e-01    0    0     0     0
## 14             TOT_ACTV_SRV_CNT    1953 1.479994e-01    0    0     0     0
## 15   TOT_DAY_LAST_COMPLAINT_CNT     285 2.159745e-02    0    0     0     0
## 16   TOT_DAY_LAST_SUSPENDED_CNT     258 1.955138e-02    0    0     0     0
## 17   TOT_DAY_LAST_OB_BARRED_CNT     249 1.886935e-02    0    0     0     0
## 18          TOT_EMAIL_QUERY_CNT     234 1.773265e-02    0    0     0     0
## 19          PCT_CHNG_IB_SMS_CNT       1 7.578054e-05    0    0     0     0
## 20                    GENDER_CD       0 0.000000e+00    0    0     0     0
## 21                 EDUCATION_CD       0 0.000000e+00    0    0     0     0
## 22                  SUBS_TENURE       0 0.000000e+00    0    0     0     0
## 23              TOT_IB_CALL_DUR       0 0.000000e+00    0    0     0     0
## 24              TOT_IB_CALL_CNT       0 0.000000e+00    0    0     0     0
## 25              AVG_OB_CALL_CNT       0 0.000000e+00    0    0     0     0
## 26     TOT_OB_CALL_NAT_ROAM_CNT       0 0.000000e+00    0    0     0     0
## 27         TOT_OB_CALL_INTL_CNT       0 0.000000e+00    0    0     0     0
## 28          TOT_OB_CALL_LOC_CNT       0 0.000000e+00    0    0     0     0
## 29          TOT_OB_CALL_NAT_CNT       0 0.000000e+00    0    0     0     0
## 30    TOT_OB_CALL_INTL_ROAM_CNT       0 0.000000e+00    0    0     0     0
## 31          MTH_TO_SUBS_END_CNT       0 0.000000e+00    0    0     0     0
## 32            BARRING_REASON_CD       0 0.000000e+00    0    0     0     0
## 33              TOT_OB_CALL_CNT       0 0.000000e+00    0    0     0     0
## 34               REV_AMT_BASE_1       0 0.000000e+00    0    0     0     0
## 35               REV_AMT_BASE_2       0 0.000000e+00    0    0     0     0
## 36               REV_AMT_BASE_3       0 0.000000e+00    0    0     0     0
## 37               REV_AMT_BASE_4       0 0.000000e+00    0    0     0     0
## 38               REV_AMT_BASE_5       0 0.000000e+00    0    0     0     0
## 39               REV_AMT_BASE_6       0 0.000000e+00    0    0     0     0
## 40                     CUST_AGE       0 0.000000e+00    0    0     0     0
## 41            PCT_CHNG_BILL_AMT       0 0.000000e+00    0    0     0     0
##         type unique
## 1    numeric      2
## 2    numeric      2
## 3    numeric      2
## 4    numeric      2
## 5    numeric      7
## 6    numeric      2
## 7  character      2
## 8    numeric    251
## 9    numeric    302
## 10   numeric    401
## 11   numeric    352
## 12   numeric      5
## 13   numeric      4
## 14   numeric      8
## 15   numeric     33
## 16   numeric     35
## 17   numeric     32
## 18   numeric     33
## 19   numeric   8125
## 20 character      3
## 21 character      8
## 22   numeric     37
## 23   numeric    412
## 24   numeric   1933
## 25   numeric   3312
## 26   numeric    780
## 27   numeric    808
## 28   numeric   2573
## 29   numeric   1710
## 30   numeric  13196
## 31   numeric      4
## 32 character      4
## 33   numeric   3312
## 34   numeric   1001
## 35   numeric   1201
## 36   numeric   1101
## 37   numeric    901
## 38   numeric    951
## 39   numeric   1051
## 40   numeric     43
## 41   numeric  13113

3.4 Checking Data Health Status using data_integrity()

Another useful function to check data health status is data_integrity(). It is a handy function to return different vectors of variable names aimed to quickly filter NA, categorical (factor / character), numerical and other types (boolean, date, posix). It also returns a vector of variables which have high cardinality.

data_integrity(telecom)
## $vars_num_with_NA
##       variable q_na      p_na
## 1  TOT_REV_AMT 9026 0.6839952
## 2 TOT_PROF_AMT 9026 0.6839952
## 
## $vars_cat_with_NA
## [1] variable q_na     p_na    
## <0 rows> (or 0-length row.names)
## 
## $vars_cat_high_card
## [1] variable unique  
## <0 rows> (or 0-length row.names)
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## character(0)
## 
## $vars_cat
## [1] "GENDER_CD"         "EDUCATION_CD"      "BARRING_REASON_CD"
## [4] "CHURN_FLG"        
## 
## $vars_num
##  [1] "SUBS_TENURE"                  "TOT_IB_CALL_DUR"             
##  [3] "TOT_IB_CALL_CNT"              "AVG_OB_CALL_CNT"             
##  [5] "TOT_OB_CALL_NAT_ROAM_CNT"     "TOT_OB_CALL_INTL_CNT"        
##  [7] "TOT_OB_CALL_LOC_CNT"          "TOT_OB_CALL_NAT_CNT"         
##  [9] "TOT_OB_CALL_INTL_ROAM_CNT"    "TOT_DAY_LAST_COMPLAINT_CNT"  
## [11] "TOT_DAY_LAST_OB_BARRED_CNT"   "TOT_DAY_LAST_SUSPENDED_CNT"  
## [13] "TOT_EMAIL_QUERY_CNT"          "MTH_TO_SUBS_END_CNT"         
## [15] "TOT_SRV_DROPPED_CNT"          "TOT_SRV_ADDED_CNT"           
## [17] "TOT_OUTSTAND_60_90_DAY_AMT"   "TOT_REV_FIX_AMT"             
## [19] "TOT_REV_GPRS_AMT"             "TOT_REV_INET_AMT"            
## [21] "TOT_COMPLAINT_1_MTH_CNT"      "TOT_MTH_LAST_SUSPENDED_CNT"  
## [23] "LAST_PRICE_PLAN_CHNG_DAY_CNT" "MTH_SINCE_DATA_ACTVN"        
## [25] "MTH_SINCE_VM_ACTVN"           "TOT_OB_CALL_CNT"             
## [27] "TOT_ACTV_SRV_CNT"             "REV_AMT_BASE_1"              
## [29] "REV_AMT_BASE_2"               "REV_AMT_BASE_3"              
## [31] "REV_AMT_BASE_4"               "REV_AMT_BASE_5"              
## [33] "REV_AMT_BASE_6"               "CUST_AGE"                    
## [35] "PCT_CHNG_IB_SMS_CNT"          "PCT_CHNG_SUSPENDED_CNT"      
## [37] "PCT_CHNG_BILL_AMT"            "TOT_REV_AMT"                 
## [39] "TOT_PROF_AMT"                
## 
## $vars_char
## [1] "GENDER_CD"         "EDUCATION_CD"      "BARRING_REASON_CD"
## [4] "CHURN_FLG"        
## 
## $vars_factor
## character(0)
## 
## $vars_other
## character(0)

It returns an ‘integrity’ object, which has: ‘status_now’ (comes from status function), and ‘results’ list, following elements can be found: vars_cat, vars_num, vars_num_with_NA, etc.

4 Descriptive Statistics

4.1 summary() method

To view the descriptive statistics of each column, summary() is commonly used as shown in the code chunk below.

summary(telecom_retained)
##   GENDER_CD         EDUCATION_CD        SUBS_TENURE    TOT_IB_CALL_DUR 
##  Length:13196       Length:13196       Min.   : 24.0   Min.   : 3.333  
##  Class :character   Class :character   1st Qu.:102.0   1st Qu.:39.500  
##  Mode  :character   Mode  :character   Median :120.0   Median :46.750  
##                                        Mean   :119.9   Mean   :44.853  
##                                        3rd Qu.:138.0   3rd Qu.:53.500  
##                                        Max.   :240.0   Max.   :78.667  
##  TOT_IB_CALL_CNT     AVG_OB_CALL_CNT   TOT_OB_CALL_NAT_ROAM_CNT
##  Min.   :      8.0   Min.   :    190   Min.   :    11.0        
##  1st Qu.:    119.8   1st Qu.:    407   1st Qu.:    21.0        
##  Median :    180.0   Median :    607   Median :    32.0        
##  Mean   :    904.8   Mean   :   3987   Mean   :   130.7        
##  3rd Qu.:    366.0   3rd Qu.:   1209   3rd Qu.:    62.0        
##  Max.   :1136214.0   Max.   :5988140   Max.   :225286.0        
##  TOT_OB_CALL_INTL_CNT TOT_OB_CALL_LOC_CNT TOT_OB_CALL_NAT_CNT
##  Min.   :    11.0     Min.   :    121.0   Min.   :    49.0   
##  1st Qu.:    21.0     1st Qu.:    245.0   1st Qu.:   103.0   
##  Median :    31.0     Median :    370.0   Median :   154.0   
##  Mean   :   189.4     Mean   :   1834.0   Mean   :   625.9   
##  3rd Qu.:    62.0     3rd Qu.:    725.2   3rd Qu.:   308.0   
##  Max.   :410842.0     Max.   :1474931.0   Max.   :328972.0   
##  TOT_OB_CALL_INTL_ROAM_CNT TOT_DAY_LAST_COMPLAINT_CNT
##  Min.   :   0.0002         Min.   : 0.00             
##  1st Qu.:   0.1935         1st Qu.: 6.00             
##  Median :   0.4401         Median :12.00             
##  Mean   :   2.1869         Mean   :11.78             
##  3rd Qu.:   0.9029         3rd Qu.:17.00             
##  Max.   :2158.1174         Max.   :32.00             
##  TOT_DAY_LAST_OB_BARRED_CNT TOT_DAY_LAST_SUSPENDED_CNT TOT_EMAIL_QUERY_CNT
##  Min.   : 0.00              Min.   : 0.00              Min.   : 0.00      
##  1st Qu.: 6.00              1st Qu.: 7.00              1st Qu.: 7.00      
##  Median :13.00              Median :13.00              Median :14.00      
##  Mean   :12.67              Mean   :13.36              Mean   :14.07      
##  3rd Qu.:19.00              3rd Qu.:20.00              3rd Qu.:21.00      
##  Max.   :31.00              Max.   :34.00              Max.   :32.00      
##  MTH_TO_SUBS_END_CNT TOT_SRV_DROPPED_CNT TOT_SRV_ADDED_CNT
##  Min.   :1.000       Min.   :0.000       Min.   :0.00     
##  1st Qu.:2.000       1st Qu.:1.000       1st Qu.:1.00     
##  Median :3.000       Median :1.000       Median :1.00     
##  Mean   :2.506       Mean   :1.045       Mean   :1.41     
##  3rd Qu.:3.000       3rd Qu.:2.000       3rd Qu.:2.00     
##  Max.   :4.000       Max.   :4.000       Max.   :3.00     
##  TOT_OUTSTAND_60_90_DAY_AMT TOT_REV_FIX_AMT TOT_REV_GPRS_AMT TOT_REV_INET_AMT
##  Min.   :  0.00             Min.   :  0     Min.   :  0.0    Min.   :  0.0   
##  1st Qu.:  0.00             1st Qu.:  0     1st Qu.:  0.0    1st Qu.:  0.0   
##  Median :  0.00             Median :  0     Median :  0.0    Median :  0.0   
##  Mean   : 61.94             Mean   :119     Mean   :120.3    Mean   :166.3   
##  3rd Qu.:  0.00             3rd Qu.:  0     3rd Qu.:  0.0    3rd Qu.:471.0   
##  Max.   :700.00             Max.   :800     Max.   :750.0    Max.   :800.0   
##  TOT_COMPLAINT_1_MTH_CNT TOT_MTH_LAST_SUSPENDED_CNT
##  Min.   :0.00000         Min.   :0.0000            
##  1st Qu.:0.00000         1st Qu.:0.0000            
##  Median :0.00000         Median :0.0000            
##  Mean   :0.02061         Mean   :0.0241            
##  3rd Qu.:0.00000         3rd Qu.:0.0000            
##  Max.   :1.00000         Max.   :1.0000            
##  LAST_PRICE_PLAN_CHNG_DAY_CNT MTH_SINCE_DATA_ACTVN MTH_SINCE_VM_ACTVN
##  Min.   :0.00000              Min.   :0.00000      Min.   :0.00000   
##  1st Qu.:0.00000              1st Qu.:0.00000      1st Qu.:0.00000   
##  Median :0.00000              Median :0.00000      Median :0.00000   
##  Mean   :0.02311              Mean   :0.03008      Mean   :0.07548   
##  3rd Qu.:0.00000              3rd Qu.:0.00000      3rd Qu.:0.00000   
##  Max.   :1.00000              Max.   :1.00000      Max.   :1.00000   
##  BARRING_REASON_CD  TOT_OB_CALL_CNT    TOT_ACTV_SRV_CNT REV_AMT_BASE_1  
##  Length:13196       Min.   :    1140   Min.   :0.000    Min.   : 200.0  
##  Class :character   1st Qu.:    2442   1st Qu.:1.000    1st Qu.: 457.0  
##  Mode  :character   Median :    3642   Median :2.000    Median : 681.0  
##                     Mean   :   23920   Mean   :2.441    Mean   : 694.1  
##                     3rd Qu.:    7256   3rd Qu.:4.000    3rd Qu.: 938.0  
##                     Max.   :35928840   Max.   :7.000    Max.   :1200.0  
##  REV_AMT_BASE_2   REV_AMT_BASE_3   REV_AMT_BASE_4   REV_AMT_BASE_5  
##  Min.   : 200.0   Min.   : 200.0   Min.   : 200.0   Min.   : 200.0  
##  1st Qu.: 492.0   1st Qu.: 491.0   1st Qu.: 473.0   1st Qu.: 440.0  
##  Median : 772.5   Median : 690.0   Median : 647.0   Median : 625.0  
##  Mean   : 788.0   Mean   : 732.9   Mean   : 648.6   Mean   : 648.4  
##  3rd Qu.:1081.0   3rd Qu.: 996.0   3rd Qu.: 825.0   3rd Qu.: 852.0  
##  Max.   :1400.0   Max.   :1300.0   Max.   :1100.0   Max.   :1150.0  
##  REV_AMT_BASE_6      CUST_AGE     PCT_CHNG_IB_SMS_CNT PCT_CHNG_SUSPENDED_CNT
##  Min.   : 200.0   Min.   :20.00   Min.   :0.0000      Min.   :    0.0       
##  1st Qu.: 501.0   1st Qu.:30.00   1st Qu.:0.8571      1st Qu.:    0.0       
##  Median : 672.0   Median :40.00   Median :1.1720      Median :    0.0       
##  Mean   : 697.9   Mean   :40.72   Mean   :1.2452      Mean   :  181.5       
##  3rd Qu.: 893.0   3rd Qu.:51.00   3rd Qu.:1.5479      3rd Qu.:    0.0       
##  Max.   :1250.0   Max.   :62.00   Max.   :6.1667      Max.   :10000.0       
##  PCT_CHNG_BILL_AMT  CHURN_FLG        
##  Min.   :0.2454    Length:13196      
##  1st Qu.:0.8239    Class :character  
##  Median :1.0875    Mode  :character  
##  Mean   :1.1341                      
##  3rd Qu.:1.3738                      
##  Max.   :5.0696

4.2 profiling_num() method

We can also use funModeling to retrieve several statistics for numerical variables. The function used is called profiling_num. Read about the function before running the code below.

profiling_num(telecom_retained)
##                        variable         mean      std_dev variation_coef
## 1                   SUBS_TENURE 1.198550e+02 2.713962e+01      0.2264372
## 2               TOT_IB_CALL_DUR 4.485265e+01 1.322877e+01      0.2949384
## 3               TOT_IB_CALL_CNT 9.048331e+02 1.227397e+04     13.5648987
## 4               AVG_OB_CALL_CNT 3.986687e+03 8.095119e+04     20.3053799
## 5      TOT_OB_CALL_NAT_ROAM_CNT 1.307292e+02 2.099494e+03     16.0598782
## 6          TOT_OB_CALL_INTL_CNT 1.893939e+02 4.265169e+03     22.5200964
## 7           TOT_OB_CALL_LOC_CNT 1.833966e+03 2.273202e+04     12.3950019
## 8           TOT_OB_CALL_NAT_CNT 6.258938e+02 5.794708e+03      9.2582928
## 9     TOT_OB_CALL_INTL_ROAM_CNT 2.186854e+00 2.746952e+01     12.5612037
## 10   TOT_DAY_LAST_COMPLAINT_CNT 1.177918e+01 6.993986e+00      0.5937585
## 11   TOT_DAY_LAST_OB_BARRED_CNT 1.266588e+01 7.355303e+00      0.5807177
## 12   TOT_DAY_LAST_SUSPENDED_CNT 1.336178e+01 7.721855e+00      0.5779063
## 13          TOT_EMAIL_QUERY_CNT 1.407464e+01 8.042835e+00      0.5714414
## 14          MTH_TO_SUBS_END_CNT 2.505684e+00 9.581059e-01      0.3823731
## 15          TOT_SRV_DROPPED_CNT 1.044711e+00 7.615037e-01      0.7289136
## 16            TOT_SRV_ADDED_CNT 1.410352e+00 9.811001e-01      0.6956422
## 17   TOT_OUTSTAND_60_90_DAY_AMT 6.193589e+01 1.800469e+02      2.9069889
## 18              TOT_REV_FIX_AMT 1.189946e+02 2.533148e+02      2.1287922
## 19             TOT_REV_GPRS_AMT 1.202774e+02 2.339171e+02      1.9448143
## 20             TOT_REV_INET_AMT 1.662537e+02 2.814179e+02      1.6927011
## 21      TOT_COMPLAINT_1_MTH_CNT 2.061231e-02 1.420879e-01      6.8933524
## 22   TOT_MTH_LAST_SUSPENDED_CNT 2.409821e-02 1.533599e-01      6.3639551
## 23 LAST_PRICE_PLAN_CHNG_DAY_CNT 2.311306e-02 1.502683e-01      6.5014442
## 24         MTH_SINCE_DATA_ACTVN 3.008487e-02 1.708274e-01      5.6781809
## 25           MTH_SINCE_VM_ACTVN 7.547742e-02 2.641701e-01      3.4999892
## 26              TOT_OB_CALL_CNT 2.392012e+04 4.857071e+05     20.3053799
## 27             TOT_ACTV_SRV_CNT 2.441194e+00 1.641850e+00      0.6725600
## 28               REV_AMT_BASE_1 6.941080e+02 2.846927e+02      0.4101562
## 29               REV_AMT_BASE_2 7.879706e+02 3.407361e+02      0.4324223
## 30               REV_AMT_BASE_3 7.329165e+02 3.063096e+02      0.4179325
## 31               REV_AMT_BASE_4 6.486079e+02 2.367476e+02      0.3650087
## 32               REV_AMT_BASE_5 6.483597e+02 2.548069e+02      0.3930024
## 33               REV_AMT_BASE_6 6.978585e+02 2.686731e+02      0.3849965
## 34                     CUST_AGE 4.071885e+01 1.216377e+01      0.2987258
## 35          PCT_CHNG_IB_SMS_CNT 1.245212e+00 5.546390e-01      0.4454174
## 36       PCT_CHNG_SUSPENDED_CNT 1.815002e+02 1.042496e+03      5.7437730
## 37            PCT_CHNG_BILL_AMT 1.134070e+00 4.325255e-01      0.3813922
##            p_01         p_05         p_25         p_50        p_75         p_95
## 1  6.000000e+01 7.800000e+01  102.0000000  120.0000000  138.000000   168.000000
## 2  8.333000e+00 1.233300e+01   39.5000000   46.7500000   53.500000    62.333000
## 3  2.600000e+01 6.000000e+01  119.7500000  180.0000000  366.000000  1779.000000
## 4  2.740000e+02 3.160000e+02  407.0000000  607.0000000 1209.250000  6305.250000
## 5  1.500000e+01 1.700000e+01   21.0000000   32.0000000   62.000000   309.250000
## 6  1.500000e+01 1.700000e+01   21.0000000   31.0000000   62.000000   323.000000
## 7  1.670000e+02 1.900000e+02  245.0000000  370.0000000  725.250000  3441.750000
## 8  7.000000e+01 7.900000e+01  103.0000000  154.0000000  308.000000  1411.250000
## 9  6.440805e-03 3.769725e-02    0.1934530    0.4401406    0.902894     4.504552
## 10 0.000000e+00 1.000000e+00    6.0000000   12.0000000   17.000000    22.000000
## 11 0.000000e+00 1.000000e+00    6.0000000   13.0000000   19.000000    24.000000
## 12 0.000000e+00 1.000000e+00    7.0000000   13.0000000   20.000000    25.000000
## 13 0.000000e+00 1.000000e+00    7.0000000   14.0000000   21.000000    26.000000
## 14 1.000000e+00 1.000000e+00    2.0000000    3.0000000    3.000000     4.000000
## 15 0.000000e+00 0.000000e+00    1.0000000    1.0000000    2.000000     2.000000
## 16 0.000000e+00 0.000000e+00    1.0000000    1.0000000    2.000000     3.000000
## 17 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000   585.000000
## 18 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000   711.000000
## 19 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000   659.000000
## 20 0.000000e+00 0.000000e+00    0.0000000    0.0000000  471.000000   736.000000
## 21 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000     0.000000
## 22 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000     0.000000
## 23 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000     0.000000
## 24 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000     0.000000
## 25 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000     1.000000
## 26 1.644000e+03 1.896000e+03 2442.0000000 3642.0000000 7255.500000 37831.500000
## 27 0.000000e+00 0.000000e+00    1.0000000    2.0000000    4.000000     5.000000
## 28 2.100000e+02 2.520000e+02  457.0000000  681.0000000  938.000000  1151.000000
## 29 2.140000e+02 2.650000e+02  492.0000000  772.5000000 1081.000000  1332.000000
## 30 2.120000e+02 2.600000e+02  491.0000000  690.0000000  996.000000  1238.000000
## 31 2.090000e+02 2.560000e+02  473.0000000  647.0000000  825.000000  1044.000000
## 32 2.130000e+02 2.590000e+02  440.0000000  625.0000000  852.000000  1086.000000
## 33 2.140000e+02 2.720000e+02  501.0000000  672.0000000  893.000000  1177.000000
## 34 2.100000e+01 2.300000e+01   30.0000000   40.0000000   51.000000    60.000000
## 35 3.171752e-01 4.833140e-01    0.8571429    1.1719745    1.547856     2.247514
## 36 0.000000e+00 0.000000e+00    0.0000000    0.0000000    0.000000     0.000000
## 37 3.867880e-01 5.267272e-01    0.8239360    1.0875377    1.373761     1.920271
##            p_99     skewness     kurtosis          iqr
## 1  1.860000e+02  0.200868371     3.053959   36.0000000
## 2  6.833300e+01 -1.035160678     4.124150   14.0000000
## 3  9.918750e+03 68.183975169  5816.311117  246.2500000
## 4  2.944850e+04 57.637894382  3680.205010  802.2500000
## 5  1.301100e+03 94.970870778 10047.402135   41.0000000
## 6  1.663900e+03 79.841704526  7110.635151   41.0000000
## 7  1.730505e+04 43.310012570  2249.853341  480.2500000
## 8  6.451150e+03 39.790518507  1931.029211  205.0000000
## 9  2.542337e+01 55.069678976  3686.333455    0.7094410
## 10 2.900000e+01  0.188118939     2.282798   11.0000000
## 11 2.900000e+01  0.072639431     1.983575   13.0000000
## 12 3.100000e+01  0.072901315     2.077751   13.0000000
## 13 2.900000e+01 -0.020861369     1.872841   14.0000000
## 14 4.000000e+00 -0.008684363     2.056079    1.0000000
## 15 3.000000e+00  0.345059808     3.110976    1.0000000
## 16 3.000000e+00  0.089853673     1.994601    1.0000000
## 17 6.780000e+02  2.616443060     8.013632    0.0000000
## 18 7.820000e+02  1.712514550     4.066626    0.0000000
## 19 7.330500e+02  1.551849203     3.675855    0.0000000
## 20 7.870000e+02  1.175947811     2.541470  471.0000000
## 21 1.000000e+00  6.748018376    46.535752    0.0000000
## 22 1.000000e+00  6.206573017    39.521549    0.0000000
## 23 1.000000e+00  6.347380065    41.289234    0.0000000
## 24 1.000000e+00  5.501846305    31.270313    0.0000000
## 25 1.000000e+00  3.214130571    11.330635    0.0000000
## 26 1.766910e+05 57.637894382  3680.205010 4813.5000000
## 27 6.000000e+00  0.227985186     2.303311    3.0000000
## 28 1.189000e+03  0.054913095     1.854334  481.0000000
## 29 1.385000e+03  0.071659635     1.821046  589.0000000
## 30 1.290000e+03  0.144701374     1.906482  505.0000000
## 31 1.088000e+03  0.015064005     2.107570  352.0000000
## 32 1.136000e+03  0.215500669     2.016043  412.0000000
## 33 1.236000e+03  0.246976090     2.230875  392.0000000
## 34 6.200000e+01  0.072910336     1.734184   21.0000000
## 35 2.929189e+00  1.107734891     6.091433    0.6907130
## 36 5.000000e+03  6.287175713    46.092121    0.0000000
## 37 2.401325e+00  0.907429141     4.893899    0.5498254
##                                   range_98
## 1                                [60, 186]
## 2                          [8.333, 68.333]
## 3                   [26, 9918.74999999993]
## 4                  [274, 29448.4999999993]
## 5                             [15, 1301.1]
## 6                   [15, 1663.89999999999]
## 7                  [167, 17305.0499999999]
## 8                            [70, 6451.15]
## 9  [0.00644080526029729, 25.4233652560673]
## 10                                 [0, 29]
## 11                                 [0, 29]
## 12                                 [0, 31]
## 13                                 [0, 29]
## 14                                  [1, 4]
## 15                                  [0, 3]
## 16                                  [0, 3]
## 17                                [0, 678]
## 18                                [0, 782]
## 19                   [0, 733.049999999999]
## 20                                [0, 787]
## 21                                  [0, 1]
## 22                                  [0, 1]
## 23                                  [0, 1]
## 24                                  [0, 1]
## 25                                  [0, 1]
## 26                [1644, 176690.999999996]
## 27                                  [0, 6]
## 28                             [210, 1189]
## 29                             [214, 1385]
## 30                             [212, 1290]
## 31                             [209, 1088]
## 32                             [213, 1136]
## 33                             [214, 1236]
## 34                                [21, 62]
## 35   [0.317175244439669, 2.92918879326131]
## 36                               [0, 5000]
## 37   [0.386788006123978, 2.40132548684723]
##                                  range_80
## 1                               [84, 156]
## 2                          [29.5, 59.167]
## 3                             [88, 899.5]
## 4                           [339, 3054.5]
## 5                               [18, 155]
## 6                               [18, 159]
## 7                             [204, 1768]
## 8                               [85, 759]
## 9  [0.0722201996015945, 2.30620032698903]
## 10                                [2, 21]
## 11                                [3, 22]
## 12                                [3, 23]
## 13                                [3, 25]
## 14                                 [1, 4]
## 15                                 [0, 2]
## 16                                 [0, 3]
## 17                               [0, 470]
## 18                               [0, 637]
## 19                               [0, 571]
## 20                               [0, 670]
## 21                                 [0, 0]
## 22                                 [0, 0]
## 23                                 [0, 0]
## 24                                 [0, 0]
## 25                                 [0, 0]
## 26                          [2034, 18327]
## 27                                 [0, 5]
## 28                            [304, 1099]
## 29                            [327, 1271]
## 30                          [320, 1176.5]
## 31                           [313, 988.5]
## 32                            [320, 1027]
## 33                            [342, 1108]
## 34                               [25, 58]
## 35  [0.608695652173913, 1.95046012269939]
## 36                                 [0, 0]
## 37  [0.627196976959703, 1.69636993607128]

The summary statistics generated by profiling_num() are: mean, std_dev: standard deviation, all the p_XX:percentile at XX number, skewness, kurtosis, iqr: inter quartile range, variation_coef: the ratio ofsd/mean, range_98 is the limit for which the 98.

Instead of showing the complete list of summary statistics, you can also display selected summary statistics by using the code chunk below.

my_profiling_report <- profiling_num(telecom_retained)
select(my_profiling_report, -c(5:11))
##                        variable         mean      std_dev variation_coef
## 1                   SUBS_TENURE 1.198550e+02 2.713962e+01      0.2264372
## 2               TOT_IB_CALL_DUR 4.485265e+01 1.322877e+01      0.2949384
## 3               TOT_IB_CALL_CNT 9.048331e+02 1.227397e+04     13.5648987
## 4               AVG_OB_CALL_CNT 3.986687e+03 8.095119e+04     20.3053799
## 5      TOT_OB_CALL_NAT_ROAM_CNT 1.307292e+02 2.099494e+03     16.0598782
## 6          TOT_OB_CALL_INTL_CNT 1.893939e+02 4.265169e+03     22.5200964
## 7           TOT_OB_CALL_LOC_CNT 1.833966e+03 2.273202e+04     12.3950019
## 8           TOT_OB_CALL_NAT_CNT 6.258938e+02 5.794708e+03      9.2582928
## 9     TOT_OB_CALL_INTL_ROAM_CNT 2.186854e+00 2.746952e+01     12.5612037
## 10   TOT_DAY_LAST_COMPLAINT_CNT 1.177918e+01 6.993986e+00      0.5937585
## 11   TOT_DAY_LAST_OB_BARRED_CNT 1.266588e+01 7.355303e+00      0.5807177
## 12   TOT_DAY_LAST_SUSPENDED_CNT 1.336178e+01 7.721855e+00      0.5779063
## 13          TOT_EMAIL_QUERY_CNT 1.407464e+01 8.042835e+00      0.5714414
## 14          MTH_TO_SUBS_END_CNT 2.505684e+00 9.581059e-01      0.3823731
## 15          TOT_SRV_DROPPED_CNT 1.044711e+00 7.615037e-01      0.7289136
## 16            TOT_SRV_ADDED_CNT 1.410352e+00 9.811001e-01      0.6956422
## 17   TOT_OUTSTAND_60_90_DAY_AMT 6.193589e+01 1.800469e+02      2.9069889
## 18              TOT_REV_FIX_AMT 1.189946e+02 2.533148e+02      2.1287922
## 19             TOT_REV_GPRS_AMT 1.202774e+02 2.339171e+02      1.9448143
## 20             TOT_REV_INET_AMT 1.662537e+02 2.814179e+02      1.6927011
## 21      TOT_COMPLAINT_1_MTH_CNT 2.061231e-02 1.420879e-01      6.8933524
## 22   TOT_MTH_LAST_SUSPENDED_CNT 2.409821e-02 1.533599e-01      6.3639551
## 23 LAST_PRICE_PLAN_CHNG_DAY_CNT 2.311306e-02 1.502683e-01      6.5014442
## 24         MTH_SINCE_DATA_ACTVN 3.008487e-02 1.708274e-01      5.6781809
## 25           MTH_SINCE_VM_ACTVN 7.547742e-02 2.641701e-01      3.4999892
## 26              TOT_OB_CALL_CNT 2.392012e+04 4.857071e+05     20.3053799
## 27             TOT_ACTV_SRV_CNT 2.441194e+00 1.641850e+00      0.6725600
## 28               REV_AMT_BASE_1 6.941080e+02 2.846927e+02      0.4101562
## 29               REV_AMT_BASE_2 7.879706e+02 3.407361e+02      0.4324223
## 30               REV_AMT_BASE_3 7.329165e+02 3.063096e+02      0.4179325
## 31               REV_AMT_BASE_4 6.486079e+02 2.367476e+02      0.3650087
## 32               REV_AMT_BASE_5 6.483597e+02 2.548069e+02      0.3930024
## 33               REV_AMT_BASE_6 6.978585e+02 2.686731e+02      0.3849965
## 34                     CUST_AGE 4.071885e+01 1.216377e+01      0.2987258
## 35          PCT_CHNG_IB_SMS_CNT 1.245212e+00 5.546390e-01      0.4454174
## 36       PCT_CHNG_SUSPENDED_CNT 1.815002e+02 1.042496e+03      5.7437730
## 37            PCT_CHNG_BILL_AMT 1.134070e+00 4.325255e-01      0.3813922
##        skewness     kurtosis          iqr
## 1   0.200868371     3.053959   36.0000000
## 2  -1.035160678     4.124150   14.0000000
## 3  68.183975169  5816.311117  246.2500000
## 4  57.637894382  3680.205010  802.2500000
## 5  94.970870778 10047.402135   41.0000000
## 6  79.841704526  7110.635151   41.0000000
## 7  43.310012570  2249.853341  480.2500000
## 8  39.790518507  1931.029211  205.0000000
## 9  55.069678976  3686.333455    0.7094410
## 10  0.188118939     2.282798   11.0000000
## 11  0.072639431     1.983575   13.0000000
## 12  0.072901315     2.077751   13.0000000
## 13 -0.020861369     1.872841   14.0000000
## 14 -0.008684363     2.056079    1.0000000
## 15  0.345059808     3.110976    1.0000000
## 16  0.089853673     1.994601    1.0000000
## 17  2.616443060     8.013632    0.0000000
## 18  1.712514550     4.066626    0.0000000
## 19  1.551849203     3.675855    0.0000000
## 20  1.175947811     2.541470  471.0000000
## 21  6.748018376    46.535752    0.0000000
## 22  6.206573017    39.521549    0.0000000
## 23  6.347380065    41.289234    0.0000000
## 24  5.501846305    31.270313    0.0000000
## 25  3.214130571    11.330635    0.0000000
## 26 57.637894382  3680.205010 4813.5000000
## 27  0.227985186     2.303311    3.0000000
## 28  0.054913095     1.854334  481.0000000
## 29  0.071659635     1.821046  589.0000000
## 30  0.144701374     1.906482  505.0000000
## 31  0.015064005     2.107570  352.0000000
## 32  0.215500669     2.016043  412.0000000
## 33  0.246976090     2.230875  392.0000000
## 34  0.072910336     1.734184   21.0000000
## 35  1.107734891     6.091433    0.6907130
## 36  6.287175713    46.092121    0.0000000
## 37  0.907429141     4.893899    0.5498254
##                                   range_98
## 1                                [60, 186]
## 2                          [8.333, 68.333]
## 3                   [26, 9918.74999999993]
## 4                  [274, 29448.4999999993]
## 5                             [15, 1301.1]
## 6                   [15, 1663.89999999999]
## 7                  [167, 17305.0499999999]
## 8                            [70, 6451.15]
## 9  [0.00644080526029729, 25.4233652560673]
## 10                                 [0, 29]
## 11                                 [0, 29]
## 12                                 [0, 31]
## 13                                 [0, 29]
## 14                                  [1, 4]
## 15                                  [0, 3]
## 16                                  [0, 3]
## 17                                [0, 678]
## 18                                [0, 782]
## 19                   [0, 733.049999999999]
## 20                                [0, 787]
## 21                                  [0, 1]
## 22                                  [0, 1]
## 23                                  [0, 1]
## 24                                  [0, 1]
## 25                                  [0, 1]
## 26                [1644, 176690.999999996]
## 27                                  [0, 6]
## 28                             [210, 1189]
## 29                             [214, 1385]
## 30                             [212, 1290]
## 31                             [209, 1088]
## 32                             [213, 1136]
## 33                             [214, 1236]
## 34                                [21, 62]
## 35   [0.317175244439669, 2.92918879326131]
## 36                               [0, 5000]
## 37   [0.386788006123978, 2.40132548684723]
##                                  range_80
## 1                               [84, 156]
## 2                          [29.5, 59.167]
## 3                             [88, 899.5]
## 4                           [339, 3054.5]
## 5                               [18, 155]
## 6                               [18, 159]
## 7                             [204, 1768]
## 8                               [85, 759]
## 9  [0.0722201996015945, 2.30620032698903]
## 10                                [2, 21]
## 11                                [3, 22]
## 12                                [3, 23]
## 13                                [3, 25]
## 14                                 [1, 4]
## 15                                 [0, 2]
## 16                                 [0, 3]
## 17                               [0, 470]
## 18                               [0, 637]
## 19                               [0, 571]
## 20                               [0, 670]
## 21                                 [0, 0]
## 22                                 [0, 0]
## 23                                 [0, 0]
## 24                                 [0, 0]
## 25                                 [0, 0]
## 26                          [2034, 18327]
## 27                                 [0, 5]
## 28                            [304, 1099]
## 29                            [327, 1271]
## 30                          [320, 1176.5]
## 31                           [313, 988.5]
## 32                            [320, 1027]
## 33                            [342, 1108]
## 34                               [25, 58]
## 35  [0.608695652173913, 1.95046012269939]
## 36                                 [0, 0]
## 37  [0.627196976959703, 1.69636993607128]

5 Exploratory Data Analysis (EDA)

Tukey defined EDA in 1961 as:

“Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”

EDA is often the first step to getting to know about your data.

  • It is iterative in nature:

    • Visualise and generate questions about your data.
    • Search for answers by visualising, transforming, and modeling your data.
    • Use what you learn to refine your questions and generate new questions.
  • Rinse and repeat until you communicate the findings.

5.1 Tips on doing EDA

  • EDA is fundamentally a data sensing process - it is not an exact science. It requires knowledge of your data and a lot of patience.
  • At the most basic level, it involves answering two questions:
    • What type of variation occurs within my variables?
    • What type of covariation occurs between my variables?
  • EDA relies heavily on statistical graphics to reveal patterns and anomelies of your data.

5.2 Getting to know ggplot2

  • ggplot2 (http://ggplot2.tidyverse.org/index.html) is a R package for declaratively creating statistical graphics.
  • It is designed and developed by Hadley Wickham in 2005.
  • It is an implementation of Leland Wilkinson’s Grammar of Graphics, a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.
  • It can serve as a replacement for the base graphics in R and contains a number of defaults for web and print display of common scales.

5.2.1 A Layered Grammar of Graphics - Design principles of ggplot2

  • Data: The dataset being ploted.
  • Aesthetics: Take attributes of the data and use them to influence visual characteristics, such as position, colours, size, shape, or transparency.
  • Geometrics: The visual elements used for our data, such as point, bar or line.
  • Facets: split the data into subsets to create multiple variations of the same graph (paneling, multiple plots).
  • Statistics: statiscal transformations that summarise data (e.g. mean, confidence intervals).
  • Coordinate systems: define the plane on which data are mapped on the graphic.
  • Themes: modify all non-data components of a plot, such as main title, sub-title, y-aixs title, or legend background.

5.2.2 Plotting a simple histogram

ggplot(data=telecom_retained, aes(x= TOT_IB_CALL_DUR)) +
  geom_histogram(bins=50, color="black", fill="light blue")

5.3 Profiling continuous variables

Instead of using ggplot2 to display the distribution of the continuous variables, the plot_num() of funModeling package can be used.

plot_num(telecom_retained)

The output of plot_num() is a histogram with default bin=10. It can be changed by providing different number at the bin argument as shown in the code chunk below.

telecom_selected <- telecom_retained %>%
  select(c(3:4,12:18))
plot_num(telecom_selected, bins = 12)

The code chunk above also shows how to combine select() of dplyr package.

5.4 Getting frequency distributions for categoric variables

To display the frequency distribution of a categorical variable, the freq() of funModeling will be used.

freq(telecom_retained)

##   GENDER_CD frequency percentage cumulative_perc
## 1         M      8534      64.67           64.67
## 2         F      4566      34.60           99.27
## 3                  96       0.73          100.00

##   EDUCATION_CD frequency percentage cumulative_perc
## 1            2      4870      36.91           36.91
## 2            1      4817      36.50           73.41
## 3            .      2901      21.98           95.39
## 4            3       330       2.50           97.89
## 5            4       143       1.08           98.97
## 6            6        73       0.55           99.52
## 7            5        60       0.45           99.97
## 8            7         2       0.02          100.00

##   BARRING_REASON_CD frequency percentage cumulative_perc
## 1               000      3403      25.79           25.79
## 2               002      3358      25.45           51.24
## 3               003      3223      24.42           75.66
## 4               001      3212      24.34          100.00

##   CHURN_FLG frequency percentage cumulative_perc
## 1         0     12105      91.73           91.73
## 2         1      1091       8.27          100.00
## [1] "Variables processed: GENDER_CD, EDUCATION_CD, BARRING_REASON_CD, CHURN_FLG"

Notes:

  • freq() only processes factor and character, excluding non-categorical variables.
  • It returns the distribution table as a data frame.
  • If input is empty, then it runs for all categorical variables.
  • path_out indicates the path directory; if it has a value, then the plot is exported in jpeg.
  • To save in current directory path must dot: “.”
  • na.rm indicates if NA values should be excluded (FALSE by default).

For more information, refer to 1.1.2 Profiling categorical variables of Data Science Live Book.

6 Exploring the relationship between the target variable and the input variables

One of the task during the data exploration stage is to investigate the relationship between the target variable and the inputs variables. In this section, we are going to explore how this task can be perform by using appropriate functions of ggplot2 and funModeling packages.

6.1 Correlation plot using ggplot2 package

The code chunk below uses appropriate ggplot2 functions to plot a histogram with the CHURN_FLG overplot on it.

ggplot(data=telecom_retained, aes(x= TOT_IB_CALL_CNT, fill = CHURN_FLG)) +
  geom_histogram(bins=50, color="grey30")

6.2 Correlation plot using funModeling package

The plotar() of funModeling package is specially designed to visualise the relationship between the target variable and the input variables. In fact, it is designed using ggplot2. However, it is capable to display only the continous input variables from a given data frame. The function only support binary or multinominal values as target.

6.2.1 Boxplot

The code chunk below is used to visualise the relationship of CHURN_FLG and the input variables with boxplot.

plotar(telecom_retained, target = 'CHURN_FLG', plot_type = "boxplot")

6.2.2 Probability density plot

The code chunk below is used to visualise the relationship of CHURN_FLG and the input variables with probability density plot.

plotar(telecom_retained, target = 'CHURN_FLG', plot_type = "histdens")

6.3 Quantitative analysis for binary outcome - categ_analysis()

When we need to analyse the relationship between categorical and binary target variable, the categ_analysis() come in very handy.

It profile a binary target based on a categorical input variable, the representativeness (perc_rows) and the accuracy (perc_target) for each value of the input variable; for example, the churn cases per education categories.

categ_analysis(data = telecom_retained, input = "EDUCATION_CD", target = "CHURN_FLG")
##   EDUCATION_CD mean_target sum_target perc_target q_rows perc_rows
## 1            6       0.123          9       0.008     73     0.006
## 2            4       0.098         14       0.013    143     0.011
## 3            1       0.085        410       0.376   4817     0.365
## 4            5       0.083          5       0.005     60     0.005
## 5            2       0.082        400       0.367   4870     0.369
## 6            .       0.079        230       0.211   2901     0.220
## 7            3       0.070         23       0.021    330     0.025
## 8            7       0.000          0       0.000      2     0.000

Note:

  • The input variable must be categorical.
  • The target variable must be binary (two-value).
  • The first column automatically adjusts its name based on input variable.
  • CHURN_FLAG variable has values 0 and 1, categ_analysis() assigns internally the number 1 to the less representative class, 1 in this case, in order to calculate the mean, sum and percentage.

These are the metrics returned by categ_analysis():

  • EDUCATION_CD: name of each category in input variable.
  • mean_target: sum_target/q_rows, average number of CHURN_FLG=“1” for that category. This is the likelihood.
  • sum_target: quantity of CHRUN_FLG=“1” values are in each category.
  • perc_target: the same as sum_target but in percentage, sum_target of each category / total sum_target. This column sums 1.00.
  • q_rows: quantity of rows that, regardless of the CHURN_FLG variable, fell in that category. It’s the distribution of input. This column sums the total rows analyzed.
  • perc_rows: related to q_rows it represents the share or percentage of each category. This column sums 1.00

This function is used to analyze data when we need to reduce variable cardinality in predictive modeling.

7 Data Preparation

7.1 Discretizing Numerical Variables

Discretization is the process of transforming continuous variables into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and modelling.

7.1.1 Equal range

The range is commonly found in histograms looking at distribution, but is highly susceptible to outliers. To create, for example, four bins, requires the min and max values divided by 4.

telecom_retained$REV_AMT_BASE_2_eq_range <- cut_interval(telecom_retained$REV_AMT_BASE_2, n = 4)
describe(telecom_retained$REV_AMT_BASE_2_eq_range)
## telecom_retained$REV_AMT_BASE_2_eq_range 
##        n  missing distinct 
##    13196        0        4 
##                                                                 
## Value              [200,500]         (500,800]     (800,1.1e+03]
## Frequency               3391              3490              3225
## Proportion             0.257             0.264             0.244
##                             
## Value      (1.1e+03,1.4e+03]
## Frequency               3090
## Proportion             0.234
freq(telecom_retained$REV_AMT_BASE_2_eq_range)

##                 var frequency percentage cumulative_perc
## 1         (500,800]      3490      26.45           26.45
## 2         [200,500]      3391      25.70           52.15
## 3     (800,1.1e+03]      3225      24.44           76.59
## 4 (1.1e+03,1.4e+03]      3090      23.42          100.00

7.1.2 Equal frequency

This technique groups the same number of observations using criteria based on percentiles. The funModeling package includes the equal_freq() function to create bins based on these criteria:

telecom_retained$REV_AMT_BASE_2_eq_freq <- equal_freq(telecom_retained$REV_AMT_BASE_2, n = 4)
describe(telecom_retained$REV_AMT_BASE_2_eq_freq)
## telecom_retained$REV_AMT_BASE_2_eq_freq :  Revenue amount in base 2 month 
##        n  missing distinct 
##    13196        0        4 
##                                                           
## Value      [ 200, 493) [ 493, 773) [ 773,1082) [1082,1400]
## Frequency         3301        3297        3304        3294
## Proportion        0.25        0.25        0.25        0.25
freq(telecom_retained$REV_AMT_BASE_2_eq_freq)

##           var frequency percentage cumulative_perc
## 1 [ 773,1082)      3304      25.04           25.04
## 2 [ 200, 493)      3301      25.02           50.06
## 3 [ 493, 773)      3297      24.98           75.04
## 4 [1082,1400]      3294      24.96          100.00

7.1.3 Custom bins

If we already have the points for which we want the segments, we can use the cut() function.

telecom_retained$REV_AMT_BASE_2_manual <- cut(telecom_retained$REV_AMT_BASE_2, breaks = c(0, 500, 800, 1000, 2000))
describe(telecom_retained$REV_AMT_BASE_2_manual)
## telecom_retained$REV_AMT_BASE_2_manual 
##        n  missing distinct 
##    13196        0        4 
##                                                                   
## Value            (0,500]     (500,800]   (800,1e+03] (1e+03,2e+03]
## Frequency           3391          3490          2142          4173
## Proportion         0.257         0.264         0.162         0.316
freq(telecom_retained$REV_AMT_BASE_2_manual)

##             var frequency percentage cumulative_perc
## 1 (1e+03,2e+03]      4173      31.62           31.62
## 2     (500,800]      3490      26.45           58.07
## 3       (0,500]      3391      25.70           83.77
## 4   (800,1e+03]      2142      16.23          100.00

For further discussion, refer to section 2.1.6, 2.1.7 and 2.1.8 of Data Science Live Book.