packages = c('funModeling', 'haven', 'corrplot', 'tidyverse')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
haven is a part of the tidyverse, an ecosystem of packages designed with common APIs and a shared philosophy. It enables R to read and write various data formats used by other statistical packages by wrapping the fantastic ReadStat C library written by Evan Miller.
Currently it offers:
read_sas() reads .sas7bdat + .sas7bcat files of SAS and read_xpt() reads SAS transport files (version 5 and version 8).
read_sav() reads .sav files of SPSS and read_por() reads the older .por files. write_sav() writes .sav files.
read_dta() reads .dta files (up to version 15) of Stata. write_dta() writes .dta files (versions 8-15).
In this exercise, telecom.sas7bdat used in Lab Exercise 7 of ISSS602 Data Analytics Lab will be used. The code chunk below shows how a SAS’s sas7bdat (i.e. telecom.sas7bdat) file can be imported using read_sas() function of haven package.
telecom <- read_sas("data/telecom.sas7bdat")
Notice that the function return tibble. It uses underscores instead of dots. Labelled vectors are returned as a new labelled class. Character vectors are not converted to factors. See vignette (“semantics”) for more details.
Before any analysis can be perform, it is important for us to explore the data. The purpose of initial exploration are: - to understand the structure of the data, and - to identify data quality issues such as messy data, dirty data, missing data, etc.
One of the easy way to view the data is by typing the name of the data frame directly on R Console. The code chunk below will be used to display telecom data.
telecom
Tibble or tbl_df wraps a local data frame. The main advantage of using a tbl_df over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen, describing the rest of it as text.
Notice that only the first ten records will be displayed. Furthermore, only the columns that can be displayed on the view will appear. This is the characteristics of tibble. To learn more about tibble, please refer to Overview and Chapter 10: Tibbles.
head() returns the first part of user defined records. For example, the code below displays the first six records.
head(telecom)
## # A tibble: 6 x 46
## GENDER_CD EDUCATION_CD SUBS_TENURE TOT_IB_CALL_DUR TOT_IB_CALL_CNT
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 M " 2" 198 54.2 192
## 2 M " 1" 114 13.3 51
## 3 M " 2" 114 12.3 94
## 4 M " 1" 228 49.7 171
## 5 F " 4" 168 16.3 536
## 6 M " 2" 132 40.7 656
## # ... with 41 more variables: AVG_OB_CALL_CNT <dbl>,
## # TOT_OB_CALL_NAT_ROAM_CNT <dbl>, TOT_OB_CALL_INTL_CNT <dbl>,
## # TOT_OB_CALL_LOC_CNT <dbl>, TOT_OB_CALL_NAT_CNT <dbl>,
## # TOT_OB_CALL_INTL_ROAM_CNT <dbl>, TOT_DAY_LAST_COMPLAINT_CNT <dbl>,
## # TOT_DAY_LAST_OB_BARRED_CNT <dbl>, TOT_DAY_LAST_SUSPENDED_CNT <dbl>,
## # TOT_EMAIL_QUERY_CNT <dbl>, MTH_TO_SUBS_END_CNT <dbl>,
## # TOT_SRV_DROPPED_CNT <dbl>, TOT_SRV_ADDED_CNT <dbl>,
## # TOT_OUTSTAND_60_90_DAY_AMT <dbl>, TOT_REV_FIX_AMT <dbl>,
## # TOT_REV_GPRS_AMT <dbl>, TOT_REV_INET_AMT <dbl>,
## # TOT_COMPLAINT_1_MTH_CNT <dbl>, TOT_MTH_LAST_SUSPENDED_CNT <dbl>,
## # LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl>, MTH_SINCE_DATA_ACTVN <dbl>,
## # MTH_SINCE_VM_ACTVN <dbl>, BARRING_REASON_CD <chr>, TOT_OB_CALL_CNT <dbl>,
## # TOT_ACTV_SRV_CNT <dbl>, REV_AMT_BASE_1 <dbl>, REV_AMT_BASE_2 <dbl>,
## # REV_AMT_BASE_3 <dbl>, REV_AMT_BASE_4 <dbl>, REV_AMT_BASE_5 <dbl>,
## # REV_AMT_BASE_6 <dbl>, CUST_AGE <dbl>, PCT_CHNG_IB_SMS_CNT <dbl>,
## # PCT_CHNG_SUSPENDED_CNT <dbl>, PCT_CHNG_BILL_AMT <dbl>, CUST_SUBS_ID <dbl>,
## # TOT_REV_AMT <dbl>, TOT_PROF_AMT <dbl>, CUST_ID <dbl>, name <chr>,
## # CHURN_FLG <chr>
tail() returns the last part of user defined records. For example, the code below displays the last five records.
tail(telecom)
## # A tibble: 6 x 46
## GENDER_CD EDUCATION_CD SUBS_TENURE TOT_IB_CALL_DUR TOT_IB_CALL_CNT
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 M " 1" 132 47 105
## 2 F " 3" 114 63.2 797
## 3 M " ." 120 55.3 231
## 4 M " 1" 132 52 158
## 5 F " 3" 156 49.8 1279
## 6 M " 1" 132 46.3 168
## # ... with 41 more variables: AVG_OB_CALL_CNT <dbl>,
## # TOT_OB_CALL_NAT_ROAM_CNT <dbl>, TOT_OB_CALL_INTL_CNT <dbl>,
## # TOT_OB_CALL_LOC_CNT <dbl>, TOT_OB_CALL_NAT_CNT <dbl>,
## # TOT_OB_CALL_INTL_ROAM_CNT <dbl>, TOT_DAY_LAST_COMPLAINT_CNT <dbl>,
## # TOT_DAY_LAST_OB_BARRED_CNT <dbl>, TOT_DAY_LAST_SUSPENDED_CNT <dbl>,
## # TOT_EMAIL_QUERY_CNT <dbl>, MTH_TO_SUBS_END_CNT <dbl>,
## # TOT_SRV_DROPPED_CNT <dbl>, TOT_SRV_ADDED_CNT <dbl>,
## # TOT_OUTSTAND_60_90_DAY_AMT <dbl>, TOT_REV_FIX_AMT <dbl>,
## # TOT_REV_GPRS_AMT <dbl>, TOT_REV_INET_AMT <dbl>,
## # TOT_COMPLAINT_1_MTH_CNT <dbl>, TOT_MTH_LAST_SUSPENDED_CNT <dbl>,
## # LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl>, MTH_SINCE_DATA_ACTVN <dbl>,
## # MTH_SINCE_VM_ACTVN <dbl>, BARRING_REASON_CD <chr>, TOT_OB_CALL_CNT <dbl>,
## # TOT_ACTV_SRV_CNT <dbl>, REV_AMT_BASE_1 <dbl>, REV_AMT_BASE_2 <dbl>,
## # REV_AMT_BASE_3 <dbl>, REV_AMT_BASE_4 <dbl>, REV_AMT_BASE_5 <dbl>,
## # REV_AMT_BASE_6 <dbl>, CUST_AGE <dbl>, PCT_CHNG_IB_SMS_CNT <dbl>,
## # PCT_CHNG_SUSPENDED_CNT <dbl>, PCT_CHNG_BILL_AMT <dbl>, CUST_SUBS_ID <dbl>,
## # TOT_REV_AMT <dbl>, TOT_PROF_AMT <dbl>, CUST_ID <dbl>, name <chr>,
## # CHURN_FLG <chr>
In this code chunk below, str() is used to view the data type of each column.
str(telecom)
## tibble [13,196 x 46] (S3: tbl_df/tbl/data.frame)
## $ GENDER_CD : chr [1:13196] "M" "M" "M" "M" ...
## ..- attr(*, "label")= chr "Gender code"
## ..- attr(*, "format.sas")= chr "$"
## $ EDUCATION_CD : chr [1:13196] " 2" " 1" " 2" " 1" ...
## ..- attr(*, "label")= chr "Education code"
## $ SUBS_TENURE : num [1:13196] 198 114 114 228 168 132 120 138 126 144 ...
## ..- attr(*, "label")= chr "Tenure on network"
## $ TOT_IB_CALL_DUR : num [1:13196] 54.2 13.3 12.3 49.7 16.3 ...
## ..- attr(*, "label")= chr "Total duration of incoming calls over 6 months"
## $ TOT_IB_CALL_CNT : num [1:13196] 192 51 94 171 536 656 283 148 179 156 ...
## ..- attr(*, "label")= chr "Total number of incoming calls over 6 months"
## $ AVG_OB_CALL_CNT : num [1:13196] 897 252 1016 1534 730 ...
## ..- attr(*, "label")= chr "Average number of outgoing calls over 6 months"
## $ TOT_OB_CALL_NAT_ROAM_CNT : num [1:13196] 738 25 43 39 16 144 28 49 18 17 ...
## ..- attr(*, "label")= chr "Total number of roaming calls over 6 months"
## $ TOT_OB_CALL_INTL_CNT : num [1:13196] 43 5568 72 23 16 ...
## ..- attr(*, "label")= chr "Total number of international calls over 6 months"
## $ TOT_OB_CALL_LOC_CNT : num [1:13196] 300 221 221 302 517 ...
## ..- attr(*, "label")= chr "Total number of local calls over 6 months"
## $ TOT_OB_CALL_NAT_CNT : num [1:13196] 125 216 79 108 570 85 73 446 116 94 ...
## ..- attr(*, "label")= chr "Total number of national calls over 6 months"
## $ TOT_OB_CALL_INTL_ROAM_CNT : num [1:13196] 0.9 0.0376 0.2927 0.6904 1.7252 ...
## ..- attr(*, "label")= chr "Total number of international roaming calls over 6 months"
## $ TOT_DAY_LAST_COMPLAINT_CNT : num [1:13196] 15 8 19 14 4 22 4 1 19 5 ...
## ..- attr(*, "label")= chr "No. of days since last complaint made"
## $ TOT_DAY_LAST_OB_BARRED_CNT : num [1:13196] 11 5 2 4 3 8 13 11 9 14 ...
## ..- attr(*, "label")= chr "No. of days since last outbound barred"
## $ TOT_DAY_LAST_SUSPENDED_CNT : num [1:13196] 3 23 25 7 18 20 10 18 16 1 ...
## ..- attr(*, "label")= chr "No. of days since last suspension"
## $ TOT_EMAIL_QUERY_CNT : num [1:13196] 17 16 19 4 20 25 17 11 5 3 ...
## ..- attr(*, "label")= chr "Total no. of email queries in last 6 months"
## $ MTH_TO_SUBS_END_CNT : num [1:13196] 4 2 3 4 2 3 3 2 3 4 ...
## ..- attr(*, "label")= chr "Months to subscription end"
## $ TOT_SRV_DROPPED_CNT : num [1:13196] 2 3 4 0 2 0 0 1 1 1 ...
## ..- attr(*, "label")= chr "No. of services dropped in last 6 months"
## $ TOT_SRV_ADDED_CNT : num [1:13196] 1 0 0 1 1 2 0 2 0 2 ...
## ..- attr(*, "label")= chr "No. of services added in last 6 months"
## $ TOT_OUTSTAND_60_90_DAY_AMT : num [1:13196] 0 0 0 0 0 483 0 0 0 0 ...
## ..- attr(*, "label")= chr " Outstanding amount in 60 to 90 days delinquency period"
## $ TOT_REV_FIX_AMT : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Total revenue due to calls to fixline"
## $ TOT_REV_GPRS_AMT : num [1:13196] 0 729 574 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Total revenue due to GPRS"
## $ TOT_REV_INET_AMT : num [1:13196] 0 0 0 0 0 508 0 0 0 0 ...
## ..- attr(*, "label")= chr "Total revenue due to internet"
## $ TOT_COMPLAINT_1_MTH_CNT : num [1:13196] 0 0 1 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Total no. of complaints in last month"
## $ TOT_MTH_LAST_SUSPENDED_CNT : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "No. of months since last suspension"
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT: num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "No. of days since last price plan change"
## $ MTH_SINCE_DATA_ACTVN : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Months since data activation service"
## $ MTH_SINCE_VM_ACTVN : num [1:13196] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Months since voice mail activation service"
## $ BARRING_REASON_CD : chr [1:13196] "000" "000" "002" "000" ...
## ..- attr(*, "label")= chr "Barring type code"
## $ TOT_OB_CALL_CNT : num [1:13196] 5382 1512 6096 9204 4380 ...
## ..- attr(*, "label")= chr "Total number of outgoing calls over 6 months"
## $ TOT_ACTV_SRV_CNT : num [1:13196] 0 0 0 4 3 3 4 2 0 3 ...
## ..- attr(*, "label")= chr "Total no. of active services"
## $ REV_AMT_BASE_1 : num [1:13196] 870 570 632 638 1162 ...
## ..- attr(*, "label")= chr " Revenue amount in base 1 month"
## $ REV_AMT_BASE_2 : num [1:13196] 548 1024 547 466 394 ...
## ..- attr(*, "label")= chr " Revenue amount in base 2 month"
## $ REV_AMT_BASE_3 : num [1:13196] 970 938 851 810 1141 ...
## ..- attr(*, "label")= chr " Revenue amount in base 3 month"
## $ REV_AMT_BASE_4 : num [1:13196] 392 602 821 655 244 325 568 645 427 659 ...
## ..- attr(*, "label")= chr " Revenue amount in base 4 month"
## $ REV_AMT_BASE_5 : num [1:13196] 358 729 574 402 423 787 830 241 901 437 ...
## ..- attr(*, "label")= chr " Revenue amount in base 5 month"
## $ REV_AMT_BASE_6 : num [1:13196] 1248 1129 1027 642 618 ...
## ..- attr(*, "label")= chr " Revenue amount in base 6 month"
## $ CUST_AGE : num [1:13196] 32 24 24 44 21 52 34 44 39 43 ...
## ..- attr(*, "label")= chr "Customer age"
## $ PCT_CHNG_IB_SMS_CNT : num [1:13196] 1.155 1.265 0.902 1.87 1.267 ...
## ..- attr(*, "label")= chr "Percent change of latest 2 months incoming SMS wrt previous 4 months incoming SMS"
## $ PCT_CHNG_SUSPENDED_CNT : num [1:13196] 0 0 5000 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Percent change of latest 2 months suspended calls wrt previous 4 months suspended calls"
## $ PCT_CHNG_BILL_AMT : num [1:13196] 0.956 0.938 0.72 0.88 1.283 ...
## ..- attr(*, "label")= chr "Percent change of latest 2 months bill amount wrt previous 4 months bill amount"
## $ CUST_SUBS_ID : num [1:13196] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "label")= chr "Subscription identifier"
## $ TOT_REV_AMT : num [1:13196] NA 99 47 NA 97 NA NA NA 10 3 ...
## ..- attr(*, "label")= chr "Customer Value"
## $ TOT_PROF_AMT : num [1:13196] NA 99 47 NA 97 NA NA NA 10 3 ...
## ..- attr(*, "label")= chr "Total profitability of a subscription"
## $ CUST_ID : num [1:13196] 5198 752 3501 5406 6115 ...
## ..- attr(*, "label")= chr "Customer identifier"
## $ name : chr [1:13196] "Amy H Thomas" "Ignatius T Lyod" "Leo C Wood" "Augustine X Turner" ...
## ..- attr(*, "label")= chr "Subscriber Name"
## $ CHURN_FLG : chr [1:13196] "0" "1" "1" "0" ...
## ..- attr(*, "label")= chr "Churn flag"
## - attr(*, "label")= chr "CHURN_TELECOM"
Since telecom is tibble data frame, it is advisable to use glimpse() of dplyr package to display the data.
glimpse(telecom)
## Rows: 13,196
## Columns: 46
## $ GENDER_CD <chr> "M", "M", "M", "M", "F", "M", "M", "M"...
## $ EDUCATION_CD <chr> " 2", " 1", " 2", " 1", " 4", " 2", " ...
## $ SUBS_TENURE <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ TOT_IB_CALL_DUR <dbl> 54.167, 13.333, 12.333, 49.667, 16.333...
## $ TOT_IB_CALL_CNT <dbl> 192, 51, 94, 171, 536, 656, 283, 148, ...
## $ AVG_OB_CALL_CNT <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD <chr> "000", "000", "002", "000", "002", "00...
## $ TOT_OB_CALL_CNT <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1 <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2 <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3 <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4 <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5 <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6 <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ CUST_SUBS_ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
## $ TOT_REV_AMT <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ TOT_PROF_AMT <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ CUST_ID <dbl> 5198, 752, 3501, 5406, 6115, 5478, 514...
## $ name <chr> "Amy H Thomas", "Ignatius T Lyod", "Le...
## $ CHURN_FLG <chr> "0", "1", "1", "0", "1", "0", "0", "0"...
After reviewing the report above, it is clear that fields name, CUST_ID, and CUST_SUBS_ID are not required for the subsequent analysis. In view of this, we will exclude these three fields from telecom data frame by using the code chunk below.
telecom <- telecom %>%
select(-name, -CUST_SUBS_ID, -CUST_ID)
Notice that telecom data frame only have 43 variables now.
Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.
In order to perform data profiling, funModeling package will be used. It’s available in R cran. A quick getting started vignettes is available. An online book entitle Data Science Live Book provides a comprehensive tutorial on the package.
In this section, we will use status() to check for missing data, zeros, data type of telecom data frame.
status(telecom)
## variable q_zeros p_zeros q_na p_na q_inf p_inf
## 1 GENDER_CD 0 0.000000e+00 0 0.0000000 0 0
## 2 EDUCATION_CD 0 0.000000e+00 0 0.0000000 0 0
## 3 SUBS_TENURE 0 0.000000e+00 0 0.0000000 0 0
## 4 TOT_IB_CALL_DUR 0 0.000000e+00 0 0.0000000 0 0
## 5 TOT_IB_CALL_CNT 0 0.000000e+00 0 0.0000000 0 0
## 6 AVG_OB_CALL_CNT 0 0.000000e+00 0 0.0000000 0 0
## 7 TOT_OB_CALL_NAT_ROAM_CNT 0 0.000000e+00 0 0.0000000 0 0
## 8 TOT_OB_CALL_INTL_CNT 0 0.000000e+00 0 0.0000000 0 0
## 9 TOT_OB_CALL_LOC_CNT 0 0.000000e+00 0 0.0000000 0 0
## 10 TOT_OB_CALL_NAT_CNT 0 0.000000e+00 0 0.0000000 0 0
## 11 TOT_OB_CALL_INTL_ROAM_CNT 0 0.000000e+00 0 0.0000000 0 0
## 12 TOT_DAY_LAST_COMPLAINT_CNT 285 2.159745e-02 0 0.0000000 0 0
## 13 TOT_DAY_LAST_OB_BARRED_CNT 249 1.886935e-02 0 0.0000000 0 0
## 14 TOT_DAY_LAST_SUSPENDED_CNT 258 1.955138e-02 0 0.0000000 0 0
## 15 TOT_EMAIL_QUERY_CNT 234 1.773265e-02 0 0.0000000 0 0
## 16 MTH_TO_SUBS_END_CNT 0 0.000000e+00 0 0.0000000 0 0
## 17 TOT_SRV_DROPPED_CNT 3206 2.429524e-01 0 0.0000000 0 0
## 18 TOT_SRV_ADDED_CNT 2712 2.055168e-01 0 0.0000000 0 0
## 19 TOT_OUTSTAND_60_90_DAY_AMT 11778 8.925432e-01 0 0.0000000 0 0
## 20 TOT_REV_FIX_AMT 10768 8.160048e-01 0 0.0000000 0 0
## 21 TOT_REV_GPRS_AMT 10313 7.815247e-01 0 0.0000000 0 0
## 22 TOT_REV_INET_AMT 9693 7.345408e-01 0 0.0000000 0 0
## 23 TOT_COMPLAINT_1_MTH_CNT 12924 9.793877e-01 0 0.0000000 0 0
## 24 TOT_MTH_LAST_SUSPENDED_CNT 12878 9.759018e-01 0 0.0000000 0 0
## 25 LAST_PRICE_PLAN_CHNG_DAY_CNT 12891 9.768869e-01 0 0.0000000 0 0
## 26 MTH_SINCE_DATA_ACTVN 12799 9.699151e-01 0 0.0000000 0 0
## 27 MTH_SINCE_VM_ACTVN 12200 9.245226e-01 0 0.0000000 0 0
## 28 BARRING_REASON_CD 0 0.000000e+00 0 0.0000000 0 0
## 29 TOT_OB_CALL_CNT 0 0.000000e+00 0 0.0000000 0 0
## 30 TOT_ACTV_SRV_CNT 1953 1.479994e-01 0 0.0000000 0 0
## 31 REV_AMT_BASE_1 0 0.000000e+00 0 0.0000000 0 0
## 32 REV_AMT_BASE_2 0 0.000000e+00 0 0.0000000 0 0
## 33 REV_AMT_BASE_3 0 0.000000e+00 0 0.0000000 0 0
## 34 REV_AMT_BASE_4 0 0.000000e+00 0 0.0000000 0 0
## 35 REV_AMT_BASE_5 0 0.000000e+00 0 0.0000000 0 0
## 36 REV_AMT_BASE_6 0 0.000000e+00 0 0.0000000 0 0
## 37 CUST_AGE 0 0.000000e+00 0 0.0000000 0 0
## 38 PCT_CHNG_IB_SMS_CNT 1 7.578054e-05 0 0.0000000 0 0
## 39 PCT_CHNG_SUSPENDED_CNT 12730 9.646863e-01 0 0.0000000 0 0
## 40 PCT_CHNG_BILL_AMT 0 0.000000e+00 0 0.0000000 0 0
## 41 TOT_REV_AMT 176 1.333737e-02 9026 0.6839952 0 0
## 42 TOT_PROF_AMT 176 1.333737e-02 9026 0.6839952 0 0
## 43 CHURN_FLG 12105 9.173234e-01 0 0.0000000 0 0
## type unique
## 1 character 3
## 2 character 8
## 3 numeric 37
## 4 numeric 412
## 5 numeric 1933
## 6 numeric 3312
## 7 numeric 780
## 8 numeric 808
## 9 numeric 2573
## 10 numeric 1710
## 11 numeric 13196
## 12 numeric 33
## 13 numeric 32
## 14 numeric 35
## 15 numeric 33
## 16 numeric 4
## 17 numeric 5
## 18 numeric 4
## 19 numeric 251
## 20 numeric 302
## 21 numeric 401
## 22 numeric 352
## 23 numeric 2
## 24 numeric 2
## 25 numeric 2
## 26 numeric 2
## 27 numeric 2
## 28 character 4
## 29 numeric 3312
## 30 numeric 8
## 31 numeric 1001
## 32 numeric 1201
## 33 numeric 1101
## 34 numeric 901
## 35 numeric 951
## 36 numeric 1051
## 37 numeric 43
## 38 numeric 8125
## 39 numeric 7
## 40 numeric 13113
## 41 numeric 101
## 42 numeric 101
## 43 character 2
How to interpret these metrics important?
Instead of displaying all the measures, you can also focus on only one or a selected measures such as zeros. This is achieved by using arrange() and select() of dplyr package.
my_data_status <- status(telecom)
arrange(my_data_status, -p_zeros) %>%
select(variable, q_zeros, p_zeros)
## variable q_zeros p_zeros
## 1 TOT_COMPLAINT_1_MTH_CNT 12924 9.793877e-01
## 2 LAST_PRICE_PLAN_CHNG_DAY_CNT 12891 9.768869e-01
## 3 TOT_MTH_LAST_SUSPENDED_CNT 12878 9.759018e-01
## 4 MTH_SINCE_DATA_ACTVN 12799 9.699151e-01
## 5 PCT_CHNG_SUSPENDED_CNT 12730 9.646863e-01
## 6 MTH_SINCE_VM_ACTVN 12200 9.245226e-01
## 7 CHURN_FLG 12105 9.173234e-01
## 8 TOT_OUTSTAND_60_90_DAY_AMT 11778 8.925432e-01
## 9 TOT_REV_FIX_AMT 10768 8.160048e-01
## 10 TOT_REV_GPRS_AMT 10313 7.815247e-01
## 11 TOT_REV_INET_AMT 9693 7.345408e-01
## 12 TOT_SRV_DROPPED_CNT 3206 2.429524e-01
## 13 TOT_SRV_ADDED_CNT 2712 2.055168e-01
## 14 TOT_ACTV_SRV_CNT 1953 1.479994e-01
## 15 TOT_DAY_LAST_COMPLAINT_CNT 285 2.159745e-02
## 16 TOT_DAY_LAST_SUSPENDED_CNT 258 1.955138e-02
## 17 TOT_DAY_LAST_OB_BARRED_CNT 249 1.886935e-02
## 18 TOT_EMAIL_QUERY_CNT 234 1.773265e-02
## 19 TOT_REV_AMT 176 1.333737e-02
## 20 TOT_PROF_AMT 176 1.333737e-02
## 21 PCT_CHNG_IB_SMS_CNT 1 7.578054e-05
## 22 GENDER_CD 0 0.000000e+00
## 23 EDUCATION_CD 0 0.000000e+00
## 24 SUBS_TENURE 0 0.000000e+00
## 25 TOT_IB_CALL_DUR 0 0.000000e+00
## 26 TOT_IB_CALL_CNT 0 0.000000e+00
## 27 AVG_OB_CALL_CNT 0 0.000000e+00
## 28 TOT_OB_CALL_NAT_ROAM_CNT 0 0.000000e+00
## 29 TOT_OB_CALL_INTL_CNT 0 0.000000e+00
## 30 TOT_OB_CALL_LOC_CNT 0 0.000000e+00
## 31 TOT_OB_CALL_NAT_CNT 0 0.000000e+00
## 32 TOT_OB_CALL_INTL_ROAM_CNT 0 0.000000e+00
## 33 MTH_TO_SUBS_END_CNT 0 0.000000e+00
## 34 BARRING_REASON_CD 0 0.000000e+00
## 35 TOT_OB_CALL_CNT 0 0.000000e+00
## 36 REV_AMT_BASE_1 0 0.000000e+00
## 37 REV_AMT_BASE_2 0 0.000000e+00
## 38 REV_AMT_BASE_3 0 0.000000e+00
## 39 REV_AMT_BASE_4 0 0.000000e+00
## 40 REV_AMT_BASE_5 0 0.000000e+00
## 41 REV_AMT_BASE_6 0 0.000000e+00
## 42 CUST_AGE 0 0.000000e+00
## 43 PCT_CHNG_BILL_AMT 0 0.000000e+00
DIY: Write a code chunk to display the quantity and percentage of NA for each variables. Sort the result ascendingly.
In this section, you will learn how to remove variables with a high number of zeros
my_data_status = status(telecom)
vars_to_remove = filter(my_data_status, p_zeros > 30) %>%
.$variable
vars_to_remove <- vars_to_remove[-11]
vars_to_remove
## character(0)
The list above shows variables with more than 30% zeroes values.
Notice CHURN_FLG variable has been excluded from the list.
telecom_retained <- telecom %>%
select(-one_of(vars_to_remove), -TOT_REV_AMT, -TOT_PROF_AMT)
It is a good practice the review the output after the selection.
my_data_status <- status(telecom_retained)
arrange(my_data_status, -p_zeros)
## variable q_zeros p_zeros q_na p_na q_inf p_inf
## 1 TOT_COMPLAINT_1_MTH_CNT 12924 9.793877e-01 0 0 0 0
## 2 LAST_PRICE_PLAN_CHNG_DAY_CNT 12891 9.768869e-01 0 0 0 0
## 3 TOT_MTH_LAST_SUSPENDED_CNT 12878 9.759018e-01 0 0 0 0
## 4 MTH_SINCE_DATA_ACTVN 12799 9.699151e-01 0 0 0 0
## 5 PCT_CHNG_SUSPENDED_CNT 12730 9.646863e-01 0 0 0 0
## 6 MTH_SINCE_VM_ACTVN 12200 9.245226e-01 0 0 0 0
## 7 CHURN_FLG 12105 9.173234e-01 0 0 0 0
## 8 TOT_OUTSTAND_60_90_DAY_AMT 11778 8.925432e-01 0 0 0 0
## 9 TOT_REV_FIX_AMT 10768 8.160048e-01 0 0 0 0
## 10 TOT_REV_GPRS_AMT 10313 7.815247e-01 0 0 0 0
## 11 TOT_REV_INET_AMT 9693 7.345408e-01 0 0 0 0
## 12 TOT_SRV_DROPPED_CNT 3206 2.429524e-01 0 0 0 0
## 13 TOT_SRV_ADDED_CNT 2712 2.055168e-01 0 0 0 0
## 14 TOT_ACTV_SRV_CNT 1953 1.479994e-01 0 0 0 0
## 15 TOT_DAY_LAST_COMPLAINT_CNT 285 2.159745e-02 0 0 0 0
## 16 TOT_DAY_LAST_SUSPENDED_CNT 258 1.955138e-02 0 0 0 0
## 17 TOT_DAY_LAST_OB_BARRED_CNT 249 1.886935e-02 0 0 0 0
## 18 TOT_EMAIL_QUERY_CNT 234 1.773265e-02 0 0 0 0
## 19 PCT_CHNG_IB_SMS_CNT 1 7.578054e-05 0 0 0 0
## 20 GENDER_CD 0 0.000000e+00 0 0 0 0
## 21 EDUCATION_CD 0 0.000000e+00 0 0 0 0
## 22 SUBS_TENURE 0 0.000000e+00 0 0 0 0
## 23 TOT_IB_CALL_DUR 0 0.000000e+00 0 0 0 0
## 24 TOT_IB_CALL_CNT 0 0.000000e+00 0 0 0 0
## 25 AVG_OB_CALL_CNT 0 0.000000e+00 0 0 0 0
## 26 TOT_OB_CALL_NAT_ROAM_CNT 0 0.000000e+00 0 0 0 0
## 27 TOT_OB_CALL_INTL_CNT 0 0.000000e+00 0 0 0 0
## 28 TOT_OB_CALL_LOC_CNT 0 0.000000e+00 0 0 0 0
## 29 TOT_OB_CALL_NAT_CNT 0 0.000000e+00 0 0 0 0
## 30 TOT_OB_CALL_INTL_ROAM_CNT 0 0.000000e+00 0 0 0 0
## 31 MTH_TO_SUBS_END_CNT 0 0.000000e+00 0 0 0 0
## 32 BARRING_REASON_CD 0 0.000000e+00 0 0 0 0
## 33 TOT_OB_CALL_CNT 0 0.000000e+00 0 0 0 0
## 34 REV_AMT_BASE_1 0 0.000000e+00 0 0 0 0
## 35 REV_AMT_BASE_2 0 0.000000e+00 0 0 0 0
## 36 REV_AMT_BASE_3 0 0.000000e+00 0 0 0 0
## 37 REV_AMT_BASE_4 0 0.000000e+00 0 0 0 0
## 38 REV_AMT_BASE_5 0 0.000000e+00 0 0 0 0
## 39 REV_AMT_BASE_6 0 0.000000e+00 0 0 0 0
## 40 CUST_AGE 0 0.000000e+00 0 0 0 0
## 41 PCT_CHNG_BILL_AMT 0 0.000000e+00 0 0 0 0
## type unique
## 1 numeric 2
## 2 numeric 2
## 3 numeric 2
## 4 numeric 2
## 5 numeric 7
## 6 numeric 2
## 7 character 2
## 8 numeric 251
## 9 numeric 302
## 10 numeric 401
## 11 numeric 352
## 12 numeric 5
## 13 numeric 4
## 14 numeric 8
## 15 numeric 33
## 16 numeric 35
## 17 numeric 32
## 18 numeric 33
## 19 numeric 8125
## 20 character 3
## 21 character 8
## 22 numeric 37
## 23 numeric 412
## 24 numeric 1933
## 25 numeric 3312
## 26 numeric 780
## 27 numeric 808
## 28 numeric 2573
## 29 numeric 1710
## 30 numeric 13196
## 31 numeric 4
## 32 character 4
## 33 numeric 3312
## 34 numeric 1001
## 35 numeric 1201
## 36 numeric 1101
## 37 numeric 901
## 38 numeric 951
## 39 numeric 1051
## 40 numeric 43
## 41 numeric 13113
Another useful function to check data health status is data_integrity(). It is a handy function to return different vectors of variable names aimed to quickly filter NA, categorical (factor / character), numerical and other types (boolean, date, posix). It also returns a vector of variables which have high cardinality.
data_integrity(telecom)
## $vars_num_with_NA
## variable q_na p_na
## 1 TOT_REV_AMT 9026 0.6839952
## 2 TOT_PROF_AMT 9026 0.6839952
##
## $vars_cat_with_NA
## [1] variable q_na p_na
## <0 rows> (or 0-length row.names)
##
## $vars_cat_high_card
## [1] variable unique
## <0 rows> (or 0-length row.names)
##
## $MAX_UNIQUE
## [1] 35
##
## $vars_one_value
## character(0)
##
## $vars_cat
## [1] "GENDER_CD" "EDUCATION_CD" "BARRING_REASON_CD"
## [4] "CHURN_FLG"
##
## $vars_num
## [1] "SUBS_TENURE" "TOT_IB_CALL_DUR"
## [3] "TOT_IB_CALL_CNT" "AVG_OB_CALL_CNT"
## [5] "TOT_OB_CALL_NAT_ROAM_CNT" "TOT_OB_CALL_INTL_CNT"
## [7] "TOT_OB_CALL_LOC_CNT" "TOT_OB_CALL_NAT_CNT"
## [9] "TOT_OB_CALL_INTL_ROAM_CNT" "TOT_DAY_LAST_COMPLAINT_CNT"
## [11] "TOT_DAY_LAST_OB_BARRED_CNT" "TOT_DAY_LAST_SUSPENDED_CNT"
## [13] "TOT_EMAIL_QUERY_CNT" "MTH_TO_SUBS_END_CNT"
## [15] "TOT_SRV_DROPPED_CNT" "TOT_SRV_ADDED_CNT"
## [17] "TOT_OUTSTAND_60_90_DAY_AMT" "TOT_REV_FIX_AMT"
## [19] "TOT_REV_GPRS_AMT" "TOT_REV_INET_AMT"
## [21] "TOT_COMPLAINT_1_MTH_CNT" "TOT_MTH_LAST_SUSPENDED_CNT"
## [23] "LAST_PRICE_PLAN_CHNG_DAY_CNT" "MTH_SINCE_DATA_ACTVN"
## [25] "MTH_SINCE_VM_ACTVN" "TOT_OB_CALL_CNT"
## [27] "TOT_ACTV_SRV_CNT" "REV_AMT_BASE_1"
## [29] "REV_AMT_BASE_2" "REV_AMT_BASE_3"
## [31] "REV_AMT_BASE_4" "REV_AMT_BASE_5"
## [33] "REV_AMT_BASE_6" "CUST_AGE"
## [35] "PCT_CHNG_IB_SMS_CNT" "PCT_CHNG_SUSPENDED_CNT"
## [37] "PCT_CHNG_BILL_AMT" "TOT_REV_AMT"
## [39] "TOT_PROF_AMT"
##
## $vars_char
## [1] "GENDER_CD" "EDUCATION_CD" "BARRING_REASON_CD"
## [4] "CHURN_FLG"
##
## $vars_factor
## character(0)
##
## $vars_other
## character(0)
It returns an ‘integrity’ object, which has: ‘status_now’ (comes from status function), and ‘results’ list, following elements can be found: vars_cat, vars_num, vars_num_with_NA, etc.
To view the descriptive statistics of each column, summary() is commonly used as shown in the code chunk below.
summary(telecom_retained)
## GENDER_CD EDUCATION_CD SUBS_TENURE TOT_IB_CALL_DUR
## Length:13196 Length:13196 Min. : 24.0 Min. : 3.333
## Class :character Class :character 1st Qu.:102.0 1st Qu.:39.500
## Mode :character Mode :character Median :120.0 Median :46.750
## Mean :119.9 Mean :44.853
## 3rd Qu.:138.0 3rd Qu.:53.500
## Max. :240.0 Max. :78.667
## TOT_IB_CALL_CNT AVG_OB_CALL_CNT TOT_OB_CALL_NAT_ROAM_CNT
## Min. : 8.0 Min. : 190 Min. : 11.0
## 1st Qu.: 119.8 1st Qu.: 407 1st Qu.: 21.0
## Median : 180.0 Median : 607 Median : 32.0
## Mean : 904.8 Mean : 3987 Mean : 130.7
## 3rd Qu.: 366.0 3rd Qu.: 1209 3rd Qu.: 62.0
## Max. :1136214.0 Max. :5988140 Max. :225286.0
## TOT_OB_CALL_INTL_CNT TOT_OB_CALL_LOC_CNT TOT_OB_CALL_NAT_CNT
## Min. : 11.0 Min. : 121.0 Min. : 49.0
## 1st Qu.: 21.0 1st Qu.: 245.0 1st Qu.: 103.0
## Median : 31.0 Median : 370.0 Median : 154.0
## Mean : 189.4 Mean : 1834.0 Mean : 625.9
## 3rd Qu.: 62.0 3rd Qu.: 725.2 3rd Qu.: 308.0
## Max. :410842.0 Max. :1474931.0 Max. :328972.0
## TOT_OB_CALL_INTL_ROAM_CNT TOT_DAY_LAST_COMPLAINT_CNT
## Min. : 0.0002 Min. : 0.00
## 1st Qu.: 0.1935 1st Qu.: 6.00
## Median : 0.4401 Median :12.00
## Mean : 2.1869 Mean :11.78
## 3rd Qu.: 0.9029 3rd Qu.:17.00
## Max. :2158.1174 Max. :32.00
## TOT_DAY_LAST_OB_BARRED_CNT TOT_DAY_LAST_SUSPENDED_CNT TOT_EMAIL_QUERY_CNT
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 7.00 1st Qu.: 7.00
## Median :13.00 Median :13.00 Median :14.00
## Mean :12.67 Mean :13.36 Mean :14.07
## 3rd Qu.:19.00 3rd Qu.:20.00 3rd Qu.:21.00
## Max. :31.00 Max. :34.00 Max. :32.00
## MTH_TO_SUBS_END_CNT TOT_SRV_DROPPED_CNT TOT_SRV_ADDED_CNT
## Min. :1.000 Min. :0.000 Min. :0.00
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00
## Median :3.000 Median :1.000 Median :1.00
## Mean :2.506 Mean :1.045 Mean :1.41
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :4.000 Max. :4.000 Max. :3.00
## TOT_OUTSTAND_60_90_DAY_AMT TOT_REV_FIX_AMT TOT_REV_GPRS_AMT TOT_REV_INET_AMT
## Min. : 0.00 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0 Median : 0.0 Median : 0.0
## Mean : 61.94 Mean :119 Mean :120.3 Mean :166.3
## 3rd Qu.: 0.00 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:471.0
## Max. :700.00 Max. :800 Max. :750.0 Max. :800.0
## TOT_COMPLAINT_1_MTH_CNT TOT_MTH_LAST_SUSPENDED_CNT
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.02061 Mean :0.0241
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## LAST_PRICE_PLAN_CHNG_DAY_CNT MTH_SINCE_DATA_ACTVN MTH_SINCE_VM_ACTVN
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.02311 Mean :0.03008 Mean :0.07548
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## BARRING_REASON_CD TOT_OB_CALL_CNT TOT_ACTV_SRV_CNT REV_AMT_BASE_1
## Length:13196 Min. : 1140 Min. :0.000 Min. : 200.0
## Class :character 1st Qu.: 2442 1st Qu.:1.000 1st Qu.: 457.0
## Mode :character Median : 3642 Median :2.000 Median : 681.0
## Mean : 23920 Mean :2.441 Mean : 694.1
## 3rd Qu.: 7256 3rd Qu.:4.000 3rd Qu.: 938.0
## Max. :35928840 Max. :7.000 Max. :1200.0
## REV_AMT_BASE_2 REV_AMT_BASE_3 REV_AMT_BASE_4 REV_AMT_BASE_5
## Min. : 200.0 Min. : 200.0 Min. : 200.0 Min. : 200.0
## 1st Qu.: 492.0 1st Qu.: 491.0 1st Qu.: 473.0 1st Qu.: 440.0
## Median : 772.5 Median : 690.0 Median : 647.0 Median : 625.0
## Mean : 788.0 Mean : 732.9 Mean : 648.6 Mean : 648.4
## 3rd Qu.:1081.0 3rd Qu.: 996.0 3rd Qu.: 825.0 3rd Qu.: 852.0
## Max. :1400.0 Max. :1300.0 Max. :1100.0 Max. :1150.0
## REV_AMT_BASE_6 CUST_AGE PCT_CHNG_IB_SMS_CNT PCT_CHNG_SUSPENDED_CNT
## Min. : 200.0 Min. :20.00 Min. :0.0000 Min. : 0.0
## 1st Qu.: 501.0 1st Qu.:30.00 1st Qu.:0.8571 1st Qu.: 0.0
## Median : 672.0 Median :40.00 Median :1.1720 Median : 0.0
## Mean : 697.9 Mean :40.72 Mean :1.2452 Mean : 181.5
## 3rd Qu.: 893.0 3rd Qu.:51.00 3rd Qu.:1.5479 3rd Qu.: 0.0
## Max. :1250.0 Max. :62.00 Max. :6.1667 Max. :10000.0
## PCT_CHNG_BILL_AMT CHURN_FLG
## Min. :0.2454 Length:13196
## 1st Qu.:0.8239 Class :character
## Median :1.0875 Mode :character
## Mean :1.1341
## 3rd Qu.:1.3738
## Max. :5.0696
We can also use funModeling to retrieve several statistics for numerical variables. The function used is called profiling_num. Read about the function before running the code below.
profiling_num(telecom_retained)
## variable mean std_dev variation_coef
## 1 SUBS_TENURE 1.198550e+02 2.713962e+01 0.2264372
## 2 TOT_IB_CALL_DUR 4.485265e+01 1.322877e+01 0.2949384
## 3 TOT_IB_CALL_CNT 9.048331e+02 1.227397e+04 13.5648987
## 4 AVG_OB_CALL_CNT 3.986687e+03 8.095119e+04 20.3053799
## 5 TOT_OB_CALL_NAT_ROAM_CNT 1.307292e+02 2.099494e+03 16.0598782
## 6 TOT_OB_CALL_INTL_CNT 1.893939e+02 4.265169e+03 22.5200964
## 7 TOT_OB_CALL_LOC_CNT 1.833966e+03 2.273202e+04 12.3950019
## 8 TOT_OB_CALL_NAT_CNT 6.258938e+02 5.794708e+03 9.2582928
## 9 TOT_OB_CALL_INTL_ROAM_CNT 2.186854e+00 2.746952e+01 12.5612037
## 10 TOT_DAY_LAST_COMPLAINT_CNT 1.177918e+01 6.993986e+00 0.5937585
## 11 TOT_DAY_LAST_OB_BARRED_CNT 1.266588e+01 7.355303e+00 0.5807177
## 12 TOT_DAY_LAST_SUSPENDED_CNT 1.336178e+01 7.721855e+00 0.5779063
## 13 TOT_EMAIL_QUERY_CNT 1.407464e+01 8.042835e+00 0.5714414
## 14 MTH_TO_SUBS_END_CNT 2.505684e+00 9.581059e-01 0.3823731
## 15 TOT_SRV_DROPPED_CNT 1.044711e+00 7.615037e-01 0.7289136
## 16 TOT_SRV_ADDED_CNT 1.410352e+00 9.811001e-01 0.6956422
## 17 TOT_OUTSTAND_60_90_DAY_AMT 6.193589e+01 1.800469e+02 2.9069889
## 18 TOT_REV_FIX_AMT 1.189946e+02 2.533148e+02 2.1287922
## 19 TOT_REV_GPRS_AMT 1.202774e+02 2.339171e+02 1.9448143
## 20 TOT_REV_INET_AMT 1.662537e+02 2.814179e+02 1.6927011
## 21 TOT_COMPLAINT_1_MTH_CNT 2.061231e-02 1.420879e-01 6.8933524
## 22 TOT_MTH_LAST_SUSPENDED_CNT 2.409821e-02 1.533599e-01 6.3639551
## 23 LAST_PRICE_PLAN_CHNG_DAY_CNT 2.311306e-02 1.502683e-01 6.5014442
## 24 MTH_SINCE_DATA_ACTVN 3.008487e-02 1.708274e-01 5.6781809
## 25 MTH_SINCE_VM_ACTVN 7.547742e-02 2.641701e-01 3.4999892
## 26 TOT_OB_CALL_CNT 2.392012e+04 4.857071e+05 20.3053799
## 27 TOT_ACTV_SRV_CNT 2.441194e+00 1.641850e+00 0.6725600
## 28 REV_AMT_BASE_1 6.941080e+02 2.846927e+02 0.4101562
## 29 REV_AMT_BASE_2 7.879706e+02 3.407361e+02 0.4324223
## 30 REV_AMT_BASE_3 7.329165e+02 3.063096e+02 0.4179325
## 31 REV_AMT_BASE_4 6.486079e+02 2.367476e+02 0.3650087
## 32 REV_AMT_BASE_5 6.483597e+02 2.548069e+02 0.3930024
## 33 REV_AMT_BASE_6 6.978585e+02 2.686731e+02 0.3849965
## 34 CUST_AGE 4.071885e+01 1.216377e+01 0.2987258
## 35 PCT_CHNG_IB_SMS_CNT 1.245212e+00 5.546390e-01 0.4454174
## 36 PCT_CHNG_SUSPENDED_CNT 1.815002e+02 1.042496e+03 5.7437730
## 37 PCT_CHNG_BILL_AMT 1.134070e+00 4.325255e-01 0.3813922
## p_01 p_05 p_25 p_50 p_75 p_95
## 1 6.000000e+01 7.800000e+01 102.0000000 120.0000000 138.000000 168.000000
## 2 8.333000e+00 1.233300e+01 39.5000000 46.7500000 53.500000 62.333000
## 3 2.600000e+01 6.000000e+01 119.7500000 180.0000000 366.000000 1779.000000
## 4 2.740000e+02 3.160000e+02 407.0000000 607.0000000 1209.250000 6305.250000
## 5 1.500000e+01 1.700000e+01 21.0000000 32.0000000 62.000000 309.250000
## 6 1.500000e+01 1.700000e+01 21.0000000 31.0000000 62.000000 323.000000
## 7 1.670000e+02 1.900000e+02 245.0000000 370.0000000 725.250000 3441.750000
## 8 7.000000e+01 7.900000e+01 103.0000000 154.0000000 308.000000 1411.250000
## 9 6.440805e-03 3.769725e-02 0.1934530 0.4401406 0.902894 4.504552
## 10 0.000000e+00 1.000000e+00 6.0000000 12.0000000 17.000000 22.000000
## 11 0.000000e+00 1.000000e+00 6.0000000 13.0000000 19.000000 24.000000
## 12 0.000000e+00 1.000000e+00 7.0000000 13.0000000 20.000000 25.000000
## 13 0.000000e+00 1.000000e+00 7.0000000 14.0000000 21.000000 26.000000
## 14 1.000000e+00 1.000000e+00 2.0000000 3.0000000 3.000000 4.000000
## 15 0.000000e+00 0.000000e+00 1.0000000 1.0000000 2.000000 2.000000
## 16 0.000000e+00 0.000000e+00 1.0000000 1.0000000 2.000000 3.000000
## 17 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 585.000000
## 18 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 711.000000
## 19 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 659.000000
## 20 0.000000e+00 0.000000e+00 0.0000000 0.0000000 471.000000 736.000000
## 21 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 0.000000
## 22 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 0.000000
## 23 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 0.000000
## 24 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 0.000000
## 25 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 1.000000
## 26 1.644000e+03 1.896000e+03 2442.0000000 3642.0000000 7255.500000 37831.500000
## 27 0.000000e+00 0.000000e+00 1.0000000 2.0000000 4.000000 5.000000
## 28 2.100000e+02 2.520000e+02 457.0000000 681.0000000 938.000000 1151.000000
## 29 2.140000e+02 2.650000e+02 492.0000000 772.5000000 1081.000000 1332.000000
## 30 2.120000e+02 2.600000e+02 491.0000000 690.0000000 996.000000 1238.000000
## 31 2.090000e+02 2.560000e+02 473.0000000 647.0000000 825.000000 1044.000000
## 32 2.130000e+02 2.590000e+02 440.0000000 625.0000000 852.000000 1086.000000
## 33 2.140000e+02 2.720000e+02 501.0000000 672.0000000 893.000000 1177.000000
## 34 2.100000e+01 2.300000e+01 30.0000000 40.0000000 51.000000 60.000000
## 35 3.171752e-01 4.833140e-01 0.8571429 1.1719745 1.547856 2.247514
## 36 0.000000e+00 0.000000e+00 0.0000000 0.0000000 0.000000 0.000000
## 37 3.867880e-01 5.267272e-01 0.8239360 1.0875377 1.373761 1.920271
## p_99 skewness kurtosis iqr
## 1 1.860000e+02 0.200868371 3.053959 36.0000000
## 2 6.833300e+01 -1.035160678 4.124150 14.0000000
## 3 9.918750e+03 68.183975169 5816.311117 246.2500000
## 4 2.944850e+04 57.637894382 3680.205010 802.2500000
## 5 1.301100e+03 94.970870778 10047.402135 41.0000000
## 6 1.663900e+03 79.841704526 7110.635151 41.0000000
## 7 1.730505e+04 43.310012570 2249.853341 480.2500000
## 8 6.451150e+03 39.790518507 1931.029211 205.0000000
## 9 2.542337e+01 55.069678976 3686.333455 0.7094410
## 10 2.900000e+01 0.188118939 2.282798 11.0000000
## 11 2.900000e+01 0.072639431 1.983575 13.0000000
## 12 3.100000e+01 0.072901315 2.077751 13.0000000
## 13 2.900000e+01 -0.020861369 1.872841 14.0000000
## 14 4.000000e+00 -0.008684363 2.056079 1.0000000
## 15 3.000000e+00 0.345059808 3.110976 1.0000000
## 16 3.000000e+00 0.089853673 1.994601 1.0000000
## 17 6.780000e+02 2.616443060 8.013632 0.0000000
## 18 7.820000e+02 1.712514550 4.066626 0.0000000
## 19 7.330500e+02 1.551849203 3.675855 0.0000000
## 20 7.870000e+02 1.175947811 2.541470 471.0000000
## 21 1.000000e+00 6.748018376 46.535752 0.0000000
## 22 1.000000e+00 6.206573017 39.521549 0.0000000
## 23 1.000000e+00 6.347380065 41.289234 0.0000000
## 24 1.000000e+00 5.501846305 31.270313 0.0000000
## 25 1.000000e+00 3.214130571 11.330635 0.0000000
## 26 1.766910e+05 57.637894382 3680.205010 4813.5000000
## 27 6.000000e+00 0.227985186 2.303311 3.0000000
## 28 1.189000e+03 0.054913095 1.854334 481.0000000
## 29 1.385000e+03 0.071659635 1.821046 589.0000000
## 30 1.290000e+03 0.144701374 1.906482 505.0000000
## 31 1.088000e+03 0.015064005 2.107570 352.0000000
## 32 1.136000e+03 0.215500669 2.016043 412.0000000
## 33 1.236000e+03 0.246976090 2.230875 392.0000000
## 34 6.200000e+01 0.072910336 1.734184 21.0000000
## 35 2.929189e+00 1.107734891 6.091433 0.6907130
## 36 5.000000e+03 6.287175713 46.092121 0.0000000
## 37 2.401325e+00 0.907429141 4.893899 0.5498254
## range_98
## 1 [60, 186]
## 2 [8.333, 68.333]
## 3 [26, 9918.74999999993]
## 4 [274, 29448.4999999993]
## 5 [15, 1301.1]
## 6 [15, 1663.89999999999]
## 7 [167, 17305.0499999999]
## 8 [70, 6451.15]
## 9 [0.00644080526029729, 25.4233652560673]
## 10 [0, 29]
## 11 [0, 29]
## 12 [0, 31]
## 13 [0, 29]
## 14 [1, 4]
## 15 [0, 3]
## 16 [0, 3]
## 17 [0, 678]
## 18 [0, 782]
## 19 [0, 733.049999999999]
## 20 [0, 787]
## 21 [0, 1]
## 22 [0, 1]
## 23 [0, 1]
## 24 [0, 1]
## 25 [0, 1]
## 26 [1644, 176690.999999996]
## 27 [0, 6]
## 28 [210, 1189]
## 29 [214, 1385]
## 30 [212, 1290]
## 31 [209, 1088]
## 32 [213, 1136]
## 33 [214, 1236]
## 34 [21, 62]
## 35 [0.317175244439669, 2.92918879326131]
## 36 [0, 5000]
## 37 [0.386788006123978, 2.40132548684723]
## range_80
## 1 [84, 156]
## 2 [29.5, 59.167]
## 3 [88, 899.5]
## 4 [339, 3054.5]
## 5 [18, 155]
## 6 [18, 159]
## 7 [204, 1768]
## 8 [85, 759]
## 9 [0.0722201996015945, 2.30620032698903]
## 10 [2, 21]
## 11 [3, 22]
## 12 [3, 23]
## 13 [3, 25]
## 14 [1, 4]
## 15 [0, 2]
## 16 [0, 3]
## 17 [0, 470]
## 18 [0, 637]
## 19 [0, 571]
## 20 [0, 670]
## 21 [0, 0]
## 22 [0, 0]
## 23 [0, 0]
## 24 [0, 0]
## 25 [0, 0]
## 26 [2034, 18327]
## 27 [0, 5]
## 28 [304, 1099]
## 29 [327, 1271]
## 30 [320, 1176.5]
## 31 [313, 988.5]
## 32 [320, 1027]
## 33 [342, 1108]
## 34 [25, 58]
## 35 [0.608695652173913, 1.95046012269939]
## 36 [0, 0]
## 37 [0.627196976959703, 1.69636993607128]
The summary statistics generated by profiling_num() are: mean, std_dev: standard deviation, all the p_XX:percentile at XX number, skewness, kurtosis, iqr: inter quartile range, variation_coef: the ratio ofsd/mean, range_98 is the limit for which the 98.
Instead of showing the complete list of summary statistics, you can also display selected summary statistics by using the code chunk below.
my_profiling_report <- profiling_num(telecom_retained)
select(my_profiling_report, -c(5:11))
## variable mean std_dev variation_coef
## 1 SUBS_TENURE 1.198550e+02 2.713962e+01 0.2264372
## 2 TOT_IB_CALL_DUR 4.485265e+01 1.322877e+01 0.2949384
## 3 TOT_IB_CALL_CNT 9.048331e+02 1.227397e+04 13.5648987
## 4 AVG_OB_CALL_CNT 3.986687e+03 8.095119e+04 20.3053799
## 5 TOT_OB_CALL_NAT_ROAM_CNT 1.307292e+02 2.099494e+03 16.0598782
## 6 TOT_OB_CALL_INTL_CNT 1.893939e+02 4.265169e+03 22.5200964
## 7 TOT_OB_CALL_LOC_CNT 1.833966e+03 2.273202e+04 12.3950019
## 8 TOT_OB_CALL_NAT_CNT 6.258938e+02 5.794708e+03 9.2582928
## 9 TOT_OB_CALL_INTL_ROAM_CNT 2.186854e+00 2.746952e+01 12.5612037
## 10 TOT_DAY_LAST_COMPLAINT_CNT 1.177918e+01 6.993986e+00 0.5937585
## 11 TOT_DAY_LAST_OB_BARRED_CNT 1.266588e+01 7.355303e+00 0.5807177
## 12 TOT_DAY_LAST_SUSPENDED_CNT 1.336178e+01 7.721855e+00 0.5779063
## 13 TOT_EMAIL_QUERY_CNT 1.407464e+01 8.042835e+00 0.5714414
## 14 MTH_TO_SUBS_END_CNT 2.505684e+00 9.581059e-01 0.3823731
## 15 TOT_SRV_DROPPED_CNT 1.044711e+00 7.615037e-01 0.7289136
## 16 TOT_SRV_ADDED_CNT 1.410352e+00 9.811001e-01 0.6956422
## 17 TOT_OUTSTAND_60_90_DAY_AMT 6.193589e+01 1.800469e+02 2.9069889
## 18 TOT_REV_FIX_AMT 1.189946e+02 2.533148e+02 2.1287922
## 19 TOT_REV_GPRS_AMT 1.202774e+02 2.339171e+02 1.9448143
## 20 TOT_REV_INET_AMT 1.662537e+02 2.814179e+02 1.6927011
## 21 TOT_COMPLAINT_1_MTH_CNT 2.061231e-02 1.420879e-01 6.8933524
## 22 TOT_MTH_LAST_SUSPENDED_CNT 2.409821e-02 1.533599e-01 6.3639551
## 23 LAST_PRICE_PLAN_CHNG_DAY_CNT 2.311306e-02 1.502683e-01 6.5014442
## 24 MTH_SINCE_DATA_ACTVN 3.008487e-02 1.708274e-01 5.6781809
## 25 MTH_SINCE_VM_ACTVN 7.547742e-02 2.641701e-01 3.4999892
## 26 TOT_OB_CALL_CNT 2.392012e+04 4.857071e+05 20.3053799
## 27 TOT_ACTV_SRV_CNT 2.441194e+00 1.641850e+00 0.6725600
## 28 REV_AMT_BASE_1 6.941080e+02 2.846927e+02 0.4101562
## 29 REV_AMT_BASE_2 7.879706e+02 3.407361e+02 0.4324223
## 30 REV_AMT_BASE_3 7.329165e+02 3.063096e+02 0.4179325
## 31 REV_AMT_BASE_4 6.486079e+02 2.367476e+02 0.3650087
## 32 REV_AMT_BASE_5 6.483597e+02 2.548069e+02 0.3930024
## 33 REV_AMT_BASE_6 6.978585e+02 2.686731e+02 0.3849965
## 34 CUST_AGE 4.071885e+01 1.216377e+01 0.2987258
## 35 PCT_CHNG_IB_SMS_CNT 1.245212e+00 5.546390e-01 0.4454174
## 36 PCT_CHNG_SUSPENDED_CNT 1.815002e+02 1.042496e+03 5.7437730
## 37 PCT_CHNG_BILL_AMT 1.134070e+00 4.325255e-01 0.3813922
## skewness kurtosis iqr
## 1 0.200868371 3.053959 36.0000000
## 2 -1.035160678 4.124150 14.0000000
## 3 68.183975169 5816.311117 246.2500000
## 4 57.637894382 3680.205010 802.2500000
## 5 94.970870778 10047.402135 41.0000000
## 6 79.841704526 7110.635151 41.0000000
## 7 43.310012570 2249.853341 480.2500000
## 8 39.790518507 1931.029211 205.0000000
## 9 55.069678976 3686.333455 0.7094410
## 10 0.188118939 2.282798 11.0000000
## 11 0.072639431 1.983575 13.0000000
## 12 0.072901315 2.077751 13.0000000
## 13 -0.020861369 1.872841 14.0000000
## 14 -0.008684363 2.056079 1.0000000
## 15 0.345059808 3.110976 1.0000000
## 16 0.089853673 1.994601 1.0000000
## 17 2.616443060 8.013632 0.0000000
## 18 1.712514550 4.066626 0.0000000
## 19 1.551849203 3.675855 0.0000000
## 20 1.175947811 2.541470 471.0000000
## 21 6.748018376 46.535752 0.0000000
## 22 6.206573017 39.521549 0.0000000
## 23 6.347380065 41.289234 0.0000000
## 24 5.501846305 31.270313 0.0000000
## 25 3.214130571 11.330635 0.0000000
## 26 57.637894382 3680.205010 4813.5000000
## 27 0.227985186 2.303311 3.0000000
## 28 0.054913095 1.854334 481.0000000
## 29 0.071659635 1.821046 589.0000000
## 30 0.144701374 1.906482 505.0000000
## 31 0.015064005 2.107570 352.0000000
## 32 0.215500669 2.016043 412.0000000
## 33 0.246976090 2.230875 392.0000000
## 34 0.072910336 1.734184 21.0000000
## 35 1.107734891 6.091433 0.6907130
## 36 6.287175713 46.092121 0.0000000
## 37 0.907429141 4.893899 0.5498254
## range_98
## 1 [60, 186]
## 2 [8.333, 68.333]
## 3 [26, 9918.74999999993]
## 4 [274, 29448.4999999993]
## 5 [15, 1301.1]
## 6 [15, 1663.89999999999]
## 7 [167, 17305.0499999999]
## 8 [70, 6451.15]
## 9 [0.00644080526029729, 25.4233652560673]
## 10 [0, 29]
## 11 [0, 29]
## 12 [0, 31]
## 13 [0, 29]
## 14 [1, 4]
## 15 [0, 3]
## 16 [0, 3]
## 17 [0, 678]
## 18 [0, 782]
## 19 [0, 733.049999999999]
## 20 [0, 787]
## 21 [0, 1]
## 22 [0, 1]
## 23 [0, 1]
## 24 [0, 1]
## 25 [0, 1]
## 26 [1644, 176690.999999996]
## 27 [0, 6]
## 28 [210, 1189]
## 29 [214, 1385]
## 30 [212, 1290]
## 31 [209, 1088]
## 32 [213, 1136]
## 33 [214, 1236]
## 34 [21, 62]
## 35 [0.317175244439669, 2.92918879326131]
## 36 [0, 5000]
## 37 [0.386788006123978, 2.40132548684723]
## range_80
## 1 [84, 156]
## 2 [29.5, 59.167]
## 3 [88, 899.5]
## 4 [339, 3054.5]
## 5 [18, 155]
## 6 [18, 159]
## 7 [204, 1768]
## 8 [85, 759]
## 9 [0.0722201996015945, 2.30620032698903]
## 10 [2, 21]
## 11 [3, 22]
## 12 [3, 23]
## 13 [3, 25]
## 14 [1, 4]
## 15 [0, 2]
## 16 [0, 3]
## 17 [0, 470]
## 18 [0, 637]
## 19 [0, 571]
## 20 [0, 670]
## 21 [0, 0]
## 22 [0, 0]
## 23 [0, 0]
## 24 [0, 0]
## 25 [0, 0]
## 26 [2034, 18327]
## 27 [0, 5]
## 28 [304, 1099]
## 29 [327, 1271]
## 30 [320, 1176.5]
## 31 [313, 988.5]
## 32 [320, 1027]
## 33 [342, 1108]
## 34 [25, 58]
## 35 [0.608695652173913, 1.95046012269939]
## 36 [0, 0]
## 37 [0.627196976959703, 1.69636993607128]
Tukey defined EDA in 1961 as:
“Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
EDA is often the first step to getting to know about your data.
It is iterative in nature:
Rinse and repeat until you communicate the findings.
ggplot(data=telecom_retained, aes(x= TOT_IB_CALL_DUR)) +
geom_histogram(bins=50, color="black", fill="light blue")
Instead of using ggplot2 to display the distribution of the continuous variables, the plot_num() of funModeling package can be used.
plot_num(telecom_retained)
The output of plot_num() is a histogram with default bin=10. It can be changed by providing different number at the bin argument as shown in the code chunk below.
telecom_selected <- telecom_retained %>%
select(c(3:4,12:18))
plot_num(telecom_selected, bins = 12)
The code chunk above also shows how to combine select() of dplyr package.
To display the frequency distribution of a categorical variable, the freq() of funModeling will be used.
freq(telecom_retained)
## GENDER_CD frequency percentage cumulative_perc
## 1 M 8534 64.67 64.67
## 2 F 4566 34.60 99.27
## 3 96 0.73 100.00
## EDUCATION_CD frequency percentage cumulative_perc
## 1 2 4870 36.91 36.91
## 2 1 4817 36.50 73.41
## 3 . 2901 21.98 95.39
## 4 3 330 2.50 97.89
## 5 4 143 1.08 98.97
## 6 6 73 0.55 99.52
## 7 5 60 0.45 99.97
## 8 7 2 0.02 100.00
## BARRING_REASON_CD frequency percentage cumulative_perc
## 1 000 3403 25.79 25.79
## 2 002 3358 25.45 51.24
## 3 003 3223 24.42 75.66
## 4 001 3212 24.34 100.00
## CHURN_FLG frequency percentage cumulative_perc
## 1 0 12105 91.73 91.73
## 2 1 1091 8.27 100.00
## [1] "Variables processed: GENDER_CD, EDUCATION_CD, BARRING_REASON_CD, CHURN_FLG"
Notes:
For more information, refer to 1.1.2 Profiling categorical variables of Data Science Live Book.
One of the task during the data exploration stage is to investigate the relationship between the target variable and the inputs variables. In this section, we are going to explore how this task can be perform by using appropriate functions of ggplot2 and funModeling packages.
The code chunk below uses appropriate ggplot2 functions to plot a histogram with the CHURN_FLG overplot on it.
ggplot(data=telecom_retained, aes(x= TOT_IB_CALL_CNT, fill = CHURN_FLG)) +
geom_histogram(bins=50, color="grey30")
The plotar() of funModeling package is specially designed to visualise the relationship between the target variable and the input variables. In fact, it is designed using ggplot2. However, it is capable to display only the continous input variables from a given data frame. The function only support binary or multinominal values as target.
The code chunk below is used to visualise the relationship of CHURN_FLG and the input variables with boxplot.
plotar(telecom_retained, target = 'CHURN_FLG', plot_type = "boxplot")
The code chunk below is used to visualise the relationship of CHURN_FLG and the input variables with probability density plot.
plotar(telecom_retained, target = 'CHURN_FLG', plot_type = "histdens")
When we need to analyse the relationship between categorical and binary target variable, the categ_analysis() come in very handy.
It profile a binary target based on a categorical input variable, the representativeness (perc_rows) and the accuracy (perc_target) for each value of the input variable; for example, the churn cases per education categories.
categ_analysis(data = telecom_retained, input = "EDUCATION_CD", target = "CHURN_FLG")
## EDUCATION_CD mean_target sum_target perc_target q_rows perc_rows
## 1 6 0.123 9 0.008 73 0.006
## 2 4 0.098 14 0.013 143 0.011
## 3 1 0.085 410 0.376 4817 0.365
## 4 5 0.083 5 0.005 60 0.005
## 5 2 0.082 400 0.367 4870 0.369
## 6 . 0.079 230 0.211 2901 0.220
## 7 3 0.070 23 0.021 330 0.025
## 8 7 0.000 0 0.000 2 0.000
Note:
These are the metrics returned by categ_analysis():
This function is used to analyze data when we need to reduce variable cardinality in predictive modeling.
Discretization is the process of transforming continuous variables into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and modelling.
The range is commonly found in histograms looking at distribution, but is highly susceptible to outliers. To create, for example, four bins, requires the min and max values divided by 4.
telecom_retained$REV_AMT_BASE_2_eq_range <- cut_interval(telecom_retained$REV_AMT_BASE_2, n = 4)
describe(telecom_retained$REV_AMT_BASE_2_eq_range)
## telecom_retained$REV_AMT_BASE_2_eq_range
## n missing distinct
## 13196 0 4
##
## Value [200,500] (500,800] (800,1.1e+03]
## Frequency 3391 3490 3225
## Proportion 0.257 0.264 0.244
##
## Value (1.1e+03,1.4e+03]
## Frequency 3090
## Proportion 0.234
freq(telecom_retained$REV_AMT_BASE_2_eq_range)
## var frequency percentage cumulative_perc
## 1 (500,800] 3490 26.45 26.45
## 2 [200,500] 3391 25.70 52.15
## 3 (800,1.1e+03] 3225 24.44 76.59
## 4 (1.1e+03,1.4e+03] 3090 23.42 100.00
This technique groups the same number of observations using criteria based on percentiles. The funModeling package includes the equal_freq() function to create bins based on these criteria:
telecom_retained$REV_AMT_BASE_2_eq_freq <- equal_freq(telecom_retained$REV_AMT_BASE_2, n = 4)
describe(telecom_retained$REV_AMT_BASE_2_eq_freq)
## telecom_retained$REV_AMT_BASE_2_eq_freq : Revenue amount in base 2 month
## n missing distinct
## 13196 0 4
##
## Value [ 200, 493) [ 493, 773) [ 773,1082) [1082,1400]
## Frequency 3301 3297 3304 3294
## Proportion 0.25 0.25 0.25 0.25
freq(telecom_retained$REV_AMT_BASE_2_eq_freq)
## var frequency percentage cumulative_perc
## 1 [ 773,1082) 3304 25.04 25.04
## 2 [ 200, 493) 3301 25.02 50.06
## 3 [ 493, 773) 3297 24.98 75.04
## 4 [1082,1400] 3294 24.96 100.00
If we already have the points for which we want the segments, we can use the cut() function.
telecom_retained$REV_AMT_BASE_2_manual <- cut(telecom_retained$REV_AMT_BASE_2, breaks = c(0, 500, 800, 1000, 2000))
describe(telecom_retained$REV_AMT_BASE_2_manual)
## telecom_retained$REV_AMT_BASE_2_manual
## n missing distinct
## 13196 0 4
##
## Value (0,500] (500,800] (800,1e+03] (1e+03,2e+03]
## Frequency 3391 3490 2142 4173
## Proportion 0.257 0.264 0.162 0.316
freq(telecom_retained$REV_AMT_BASE_2_manual)
## var frequency percentage cumulative_perc
## 1 (1e+03,2e+03] 4173 31.62 31.62
## 2 (500,800] 3490 26.45 58.07
## 3 (0,500] 3391 25.70 83.77
## 4 (800,1e+03] 2142 16.23 100.00
For further discussion, refer to section 2.1.6, 2.1.7 and 2.1.8 of Data Science Live Book.