Trigger Analysis
ID, Phone and User

Project of IT Fizo Team

1 Data Wrangling

1.1 Import Library
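
The chunk that loads the packages is not shown in this rendered report. Judging only from the functions called in later sections, a setup chunk along the following lines would probably be enough. The exact list the team used is an assumption.

```r
# Assumed package list, inferred from the functions used later in this report
library(dplyr)    # %>%, count(), filter(), mutate(), rowwise(), glimpse()
library(tidyr)    # drop_na()
library(ggplot2)  # ggplot(), geom_col()
library(forcats)  # fct_unique() and other factor helpers
library(stringr)  # str_length()
library(janitor)  # clean_names()
library(skimr)    # skim()
library(naniar)   # vis_miss(), gg_miss_var(), miss_var_summary(), gg_miss_upset()
```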

1.2 Import Data

1.3 Inspecting data

  • A brief overview of the data set
## Rows: 4,612
## Columns: 96
## $ vnpostUserName                <chr> "dungnt07", "59.0057", "anhpdt", "22.006…
## $ vnpostname                    <chr> "Nguyễn Thị Dung", "MAI THỊ XUÂN HƯỜNG",…
## $ vnpostprovincename            <chr> "Bưu điện TP Hà Nội", "Bưu điện Tỉnh Bìn…
## $ vnpostdistrictname            <chr> "BĐH Chương Mỹ", "BĐH Tuy Phước", "BĐH N…
## $ vnpostorganizationname        <chr> "Chương Mỹ", "VHX Phước Sơn", "VHX Hiệp …
## $ Ekyc                          <chr> "Đã OTP thành công", "Đã OTP thành công"…
## $ userGroup                     <dbl> 2, 2, 2, 2, 2, 6, 2, 5, 5, 2, 2, 2, 2, 1…
## $ flowGroup                     <dbl> 1, 1, 1, 1, 1, 0, 1, 3, 3, 1, 1, 1, 1, 1…
## $ Status                        <chr> NA, NA, NA, NA, NA, "Lock", NA, NA, NA, …
## $ Total_app                     <dbl> 26, 18, 18, 18, 17, 14, 13, 13, 13, 13, …
## $ Total_app7                    <dbl> 44, 29, 62, 31, 46, 51, 22, 67, 31, 47, …
## $ Total_app30                   <dbl> 198, 67, 155, 31, 49, 51, 55, 262, 138, …
## $ Total_app_cancel              <dbl> 5, 0, 1, 2, 6, 0, 5, 3, 3, 1, 0, 1, 2, 1…
## $ Total_app_cancel_7Day         <dbl> 7, 3, 8, 7, 14, 0, 5, 17, 4, 4, 3, 1, 2,…
## $ Total_app_cancel_30Day        <dbl> 32, 6, 25, 7, 15, 0, 15, 60, 37, 12, 15,…
## $ cancelRatio                   <dbl> 0.19230769, 0.00000000, 0.05555556, 0.11…
## $ cancelRatio7DDay              <dbl> 0.15909091, 0.10344828, 0.12903226, 0.22…
## $ cancelRatio30Day              <dbl> 0.16161616, 0.08955224, 0.16129032, 0.22…
## $ fpd30_his_TT                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ fpd30_base_TT                 <dbl> 0, 0, 36, 0, 0, 0, 4, 0, 12, 0, 0, 0, 2,…
## $ fpd30_TT                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Total_app_reject              <dbl> 19, 11, 11, 14, 10, 11, 4, 7, 9, 10, 10,…
## $ Total_app_reject_7Day         <dbl> 30, 18, 38, 21, 26, 47, 9, 38, 23, 32, 1…
## $ Total_app_reject_30Day        <dbl> 133, 34, 93, 21, 27, 47, 26, 132, 69, 49…
## $ rejectRatio                   <dbl> 0.7307692, 0.6111111, 0.6111111, 0.77777…
## $ rejectRatio7day               <dbl> 0.6818182, 0.6206897, 0.6129032, 0.67741…
## $ rejectRatio30Day              <dbl> 0.6717172, 0.5074627, 0.6000000, 0.67741…
## $ Total_app_approve             <dbl> 1, 1, 4, 0, 1, 2, 0, 0, 0, 1, 1, 0, 0, 0…
## $ Total_app_approve_7Day        <dbl> 6, 2, 14, 1, 5, 3, 4, 9, 3, 10, 3, 2, 1,…
## $ Total_app_approve_30Day       <dbl> 32, 21, 35, 1, 6, 3, 10, 67, 31, 18, 12,…
## $ approveRatio                  <dbl> 0.03846154, 0.05555556, 0.22222222, 0.00…
## $ approveRatio7day              <dbl> 0.13636364, 0.06896552, 0.22580645, 0.03…
## $ approveRatio30Day             <dbl> 0.16161616, 0.31343284, 0.22580645, 0.03…
## $ Total_app_disbursed           <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ Total_app_disbursed_7Day      <dbl> 3, 1, 12, 1, 3, 1, 4, 9, 3, 9, 3, 2, 1, …
## $ Total_app_disbursed_30Day     <dbl> 29, 20, 33, 1, 4, 1, 10, 67, 31, 17, 12,…
## $ disbursedRatio                <dbl> 0.00000000, 0.00000000, 0.11111111, 0.00…
## $ disbursedRatio7day            <dbl> 0.06818182, 0.03448276, 0.19354839, 0.03…
## $ disbursedRatio30Day           <dbl> 0.14646465, 0.29850746, 0.21290323, 0.03…
## $ outsideApp                    <dbl> 5, 0, 0, 0, 17, 1, 0, 0, 0, 0, 0, 2, 0, …
## $ outsideApp7Day                <dbl> 6, 1, 1, 0, 46, 4, 0, 1, 0, 0, 0, 2, 0, …
## $ outsideApp30Day               <dbl> 11, 1, 3, 0, 46, 4, 0, 3, 4, 0, 0, 15, 2…
## $ outside_ratio                 <dbl> 0.19230769, 0.00000000, 0.00000000, 0.00…
## $ outsideRatio7Day              <dbl> 0.13636364, 0.03448276, 0.01612903, 0.00…
## $ outsideRatio30Day             <dbl> 0.05555556, 0.01492537, 0.01935484, 0.00…
## $ outSidePro                    <dbl> 4, 0, 0, 0, 9, 1, 0, 0, 0, 0, 0, 2, 0, 1…
## $ outSidePro7Day                <dbl> 5, 1, 1, 0, 16, 3, 0, 1, 0, 0, 0, 2, 0, …
## $ outSidePro30Day               <dbl> 6, 1, 1, 0, 16, 3, 0, 3, 3, 0, 0, 8, 1, …
## $ upLoadRatio                   <dbl> 0.1153846, 1.0000000, 0.2777778, 0.11111…
## $ upLoadRatio7Day               <dbl> 0.18181818, 0.65517241, 0.17204301, 0.06…
## $ upLoadRatio30Day              <dbl> 0.60942761, 0.28358209, 0.39139785, 0.06…
## $ IDHitApp                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IDHitApp7Day                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0…
## $ IDHitApp30Day                 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 1, 3, 1, 0, 0, 0…
## $ IDHitAppRatio                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ IDHitAppRatio7Day             <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00…
## $ IDHitAppRatio30Day            <dbl> 0.000000000, 0.000000000, 0.000000000, 0…
## $ phoneHitApp                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ phoneHitApp7Day               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ phoneHitApp30Day              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0…
## $ phoneHitAppRatio              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ phoneHitAppRatio7Day          <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00…
## $ phoneHitAppRatio30Day         <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00…
## $ userHitApp                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ userHitApp7Day                <dbl> 0, 0, 2, 1, 11, 2, 1, 2, 2, 5, 6, 1, 4, …
## $ userHitApp30Day               <dbl> 9, 0, 3, 1, 12, 2, 3, 7, 13, 6, 10, 6, 4…
## $ userHitAppRatio               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ userHitAppRatio7Day           <dbl> 0.00000000, 0.00000000, 0.03225806, 0.03…
## $ userHitAppRatio30Day          <dbl> 0.04545455, 0.00000000, 0.01935484, 0.03…
## $ Total_locked                  <dbl> 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1…
## $ Total_device                  <dbl> 9, 2, 2, 3, 3, 2, 1, 9, 1, 3, 2, 8, 2, 2…
## $ oldReferenceTrigger           <dbl> 0, 1, 0, 1, 0, 0, 2, 0, 1, 5, 5, 1, 1, 0…
## $ oldReferenceTrigger7Day       <dbl> 0, 1, 3, 1, 1, 0, 6, 2, 13, 18, 11, 1, 2…
## $ oldReferenceTrigger30Day      <dbl> 4, 7, 5, 1, 1, 0, 14, 28, 56, 19, 21, 3,…
## $ oldReferenceRatio             <dbl> 0.00000000, 0.05555556, 0.00000000, 0.05…
## $ oldReferenceRatio7Day         <dbl> 0.00000000, 0.03448276, 0.04838710, 0.03…
## $ oldReferenceRatio30Day        <dbl> 0.02020202, 0.10447761, 0.03225806, 0.03…
## $ groupFpd30TT                  <chr> "0", "0", "0", "0", "0", "0", "0", "0", …
## $ groupHitOutside               <chr> "Hit", "NoneHit", "NoneHit", "NoneHit", …
## $ groupHitOutside7Day           <chr> "Hit", "Hit", "Hit", "NoneHit", "Hit", "…
## $ groupHitOutside30Day          <chr> "Hit", "Hit", "Hit", "NoneHit", "Hit", "…
## $ groupHitUpLoadRatio           <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "None…
## $ groupHitUpLoadRatio7Day       <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "Hit"…
## $ groupHitUpLoadRatio30Day      <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "Hit"…
## $ groupHitOldReferenceRatio     <chr> "NoneHit", "Hit", "NoneHit", "Hit", "Non…
## $ groupHitOldReferenceRatio7Day <chr> "NoneHit", "Hit", "Hit", "Hit", "Hit", "…
## $ groupHitOldReference30Day     <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "None…
## $ groupHitIDApp                 <chr> "NoneHit", "NoneHit", "NoneHit", "NoneHi…
## $ groupHitIDApp7Day             <chr> "NoneHit", "NoneHit", "NoneHit", "NoneHi…
## $ groupHitIDApp30Day            <chr> "NoneHit", "NoneHit", "NoneHit", "NoneHi…
## $ groupHitPhoneApp              <chr> "NoneHit", "NoneHit", "NoneHit", "NoneHi…
## $ groupHitPhoneApp7Day          <chr> "NoneHit", "NoneHit", "NoneHit", "NoneHi…
## $ groupHitPhoneApp30Day         <chr> "NoneHit", "NoneHit", "NoneHit", "NoneHi…
## $ groupHitUserApp               <chr> "NoneHit", "NoneHit", "NoneHit", "NoneHi…
## $ groupHitUserApp7Day           <chr> "NoneHit", "NoneHit", "Hit", "Hit", "Hit…
## $ groupHitUserApp30Day          <chr> "Hit", "NoneHit", "Hit", "Hit", "Hit", "…
Data summary

Name                     TriggerOverview
Number of rows           4612
Number of columns        96

Column type frequency:
  character              26
  numeric                70

Group variables          None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
vnpostUserName 0 1.00 4 15 0 4612 0
vnpostname 0 1.00 4 28 0 4281 0
vnpostprovincename 0 1.00 12 29 0 64 0
vnpostdistrictname 0 1.00 6 30 0 612 0
vnpostorganizationname 0 1.00 3 41 0 2844 0
Ekyc 181 0.96 8 21 0 4 0
Status 4060 0.12 4 4 0 1 0
groupFpd30TT 0 1.00 1 10 0 5 0
groupHitOutside 3676 0.20 3 7 0 2 0
groupHitOutside7Day 2107 0.54 3 7 0 2 0
groupHitOutside30Day 758 0.84 3 7 0 2 0
groupHitUpLoadRatio 3676 0.20 3 7 0 2 0
groupHitUpLoadRatio7Day 2107 0.54 3 7 0 2 0
groupHitUpLoadRatio30Day 758 0.84 3 7 0 2 0
groupHitOldReferenceRatio 3676 0.20 3 7 0 2 0
groupHitOldReferenceRatio7Day 2107 0.54 3 7 0 2 0
groupHitOldReference30Day 758 0.84 3 7 0 2 0
groupHitIDApp 3676 0.20 7 7 0 1 0
groupHitIDApp7Day 2107 0.54 3 7 0 2 0
groupHitIDApp30Day 758 0.84 3 7 0 2 0
groupHitPhoneApp 3676 0.20 7 7 0 1 0
groupHitPhoneApp7Day 2107 0.54 3 7 0 2 0
groupHitPhoneApp30Day 758 0.84 3 7 0 2 0
groupHitUserApp 3676 0.20 7 7 0 1 0
groupHitUserApp7Day 2107 0.54 3 7 0 2 0
groupHitUserApp30Day 758 0.84 3 7 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
userGroup 0 1.00 2.89 1.54 1 2.0 2.00 5.00 7 ▇▁▁▂▁
flowGroup 0 1.00 1.25 0.84 0 1.0 1.00 1.00 3 ▁▇▁▁▂
Total_app 0 1.00 0.47 1.41 0 0.0 0.00 0.00 26 ▇▁▁▁▁
Total_app7 0 1.00 3.16 6.18 0 0.0 1.00 4.00 67 ▇▁▁▁▁
Total_app30 0 1.00 10.60 20.20 0 1.0 4.00 12.00 351 ▇▁▁▁▁
Total_app_cancel 0 1.00 0.06 0.33 0 0.0 0.00 0.00 6 ▇▁▁▁▁
Total_app_cancel_7Day 0 1.00 0.54 1.36 0 0.0 0.00 0.00 20 ▇▁▁▁▁
Total_app_cancel_30Day 0 1.00 2.18 4.42 0 0.0 1.00 3.00 85 ▇▁▁▁▁
cancelRatio 3676 0.20 0.10 0.24 0 0.0 0.00 0.00 1 ▇▁▁▁▁
cancelRatio7DDay 2107 0.54 0.15 0.23 0 0.0 0.00 0.25 1 ▇▂▁▁▁
cancelRatio30Day 758 0.84 0.21 0.24 0 0.0 0.16 0.33 1 ▇▃▂▁▁
fpd30_his_TT 0 1.00 0.12 1.18 0 0.0 0.00 0.00 54 ▇▁▁▁▁
fpd30_base_TT 0 1.00 2.39 7.62 0 0.0 0.00 2.00 206 ▇▁▁▁▁
fpd30_TT 0 1.00 0.01 0.09 0 0.0 0.00 0.00 1 ▇▁▁▁▁
Total_app_reject 0 1.00 0.28 0.95 0 0.0 0.00 0.00 19 ▇▁▁▁▁
Total_app_reject_7Day 0 1.00 1.92 4.09 0 0.0 0.00 2.00 47 ▇▁▁▁▁
Total_app_reject_30Day 0 1.00 6.06 12.54 0 1.0 2.00 7.00 252 ▇▁▁▁▁
rejectRatio 3676 0.20 0.59 0.41 0 0.0 0.67 1.00 1 ▅▁▃▂▇
rejectRatio7day 2107 0.54 0.62 0.34 0 0.4 0.67 1.00 1 ▃▂▅▅▇
rejectRatio30Day 758 0.84 0.58 0.30 0 0.4 0.57 0.80 1 ▃▃▇▆▆
Total_app_approve 0 1.00 0.06 0.26 0 0.0 0.00 0.00 4 ▇▁▁▁▁
Total_app_approve_7Day 0 1.00 0.62 1.41 0 0.0 0.00 1.00 15 ▇▁▁▁▁
Total_app_approve_30Day 0 1.00 2.28 4.54 0 0.0 1.00 3.00 72 ▇▁▁▁▁
approveRatio 3676 0.20 0.14 0.30 0 0.0 0.00 0.00 1 ▇▁▁▁▁
approveRatio7day 2107 0.54 0.20 0.27 0 0.0 0.10 0.33 1 ▇▂▁▁▁
approveRatio30Day 758 0.84 0.20 0.22 0 0.0 0.17 0.31 1 ▇▅▁▁▁
Total_app_disbursed 0 1.00 0.03 0.18 0 0.0 0.00 0.00 2 ▇▁▁▁▁
Total_app_disbursed_7Day 0 1.00 0.56 1.31 0 0.0 0.00 1.00 15 ▇▁▁▁▁
Total_app_disbursed_30Day 0 1.00 2.22 4.47 0 0.0 1.00 2.00 71 ▇▁▁▁▁
disbursedRatio 3676 0.20 0.08 0.23 0 0.0 0.00 0.00 1 ▇▁▁▁▁
disbursedRatio7day 2107 0.54 0.18 0.25 0 0.0 0.00 0.27 1 ▇▂▁▁▁
disbursedRatio30Day 758 0.84 0.20 0.22 0 0.0 0.17 0.30 1 ▇▃▁▁▁
outsideApp 0 1.00 0.04 0.42 0 0.0 0.00 0.00 17 ▇▁▁▁▁
outsideApp7Day 0 1.00 0.29 1.73 0 0.0 0.00 0.00 46 ▇▁▁▁▁
outsideApp30Day 0 1.00 1.08 5.01 0 0.0 0.00 0.00 130 ▇▁▁▁▁
outside_ratio 3676 0.20 0.07 0.23 0 0.0 0.00 0.00 1 ▇▁▁▁▁
outsideRatio7Day 2107 0.54 0.07 0.20 0 0.0 0.00 0.00 1 ▇▁▁▁▁
outsideRatio30Day 758 0.84 0.08 0.20 0 0.0 0.00 0.03 1 ▇▁▁▁▁
outSidePro 0 1.00 0.03 0.25 0 0.0 0.00 0.00 9 ▇▁▁▁▁
outSidePro7Day 0 1.00 0.15 0.67 0 0.0 0.00 0.00 16 ▇▁▁▁▁
outSidePro30Day 0 1.00 0.49 1.52 0 0.0 0.00 0.00 39 ▇▁▁▁▁
upLoadRatio 3676 0.20 0.22 0.38 0 0.0 0.00 0.33 1 ▇▁▁▁▂
upLoadRatio7Day 2107 0.54 0.29 0.38 0 0.0 0.00 0.58 1 ▇▁▁▁▂
upLoadRatio30Day 758 0.84 0.39 0.39 0 0.0 0.32 0.75 1 ▇▂▂▂▅
IDHitApp 0 1.00 0.00 0.00 0 0.0 0.00 0.00 0 ▁▁▇▁▁
IDHitApp7Day 0 1.00 0.03 0.21 0 0.0 0.00 0.00 7 ▇▁▁▁▁
IDHitApp30Day 0 1.00 0.10 0.51 0 0.0 0.00 0.00 16 ▇▁▁▁▁
IDHitAppRatio 3676 0.20 0.00 0.00 0 0.0 0.00 0.00 0 ▁▁▇▁▁
IDHitAppRatio7Day 2107 0.54 0.01 0.06 0 0.0 0.00 0.00 1 ▇▁▁▁▁
IDHitAppRatio30Day 758 0.84 0.01 0.05 0 0.0 0.00 0.00 1 ▇▁▁▁▁
phoneHitApp 0 1.00 0.00 0.00 0 0.0 0.00 0.00 0 ▁▁▇▁▁
phoneHitApp7Day 0 1.00 0.01 0.13 0 0.0 0.00 0.00 3 ▇▁▁▁▁
phoneHitApp30Day 0 1.00 0.05 0.30 0 0.0 0.00 0.00 8 ▇▁▁▁▁
phoneHitAppRatio 3676 0.20 0.00 0.00 0 0.0 0.00 0.00 0 ▁▁▇▁▁
phoneHitAppRatio7Day 2107 0.54 0.00 0.05 0 0.0 0.00 0.00 1 ▇▁▁▁▁
phoneHitAppRatio30Day 758 0.84 0.01 0.05 0 0.0 0.00 0.00 1 ▇▁▁▁▁
userHitApp 0 1.00 0.00 0.00 0 0.0 0.00 0.00 0 ▁▁▇▁▁
userHitApp7Day 0 1.00 0.31 1.05 0 0.0 0.00 0.00 19 ▇▁▁▁▁
userHitApp30Day 0 1.00 1.03 2.51 0 0.0 0.00 1.00 42 ▇▁▁▁▁
userHitAppRatio 3676 0.20 0.00 0.00 0 0.0 0.00 0.00 0 ▁▁▇▁▁
userHitAppRatio7Day 2107 0.54 0.09 0.20 0 0.0 0.00 0.07 1 ▇▁▁▁▁
userHitAppRatio30Day 758 0.84 0.09 0.18 0 0.0 0.00 0.12 1 ▇▁▁▁▁
Total_locked 0 1.00 0.37 0.48 0 0.0 0.00 1.00 1 ▇▁▁▁▅
Total_device 0 1.00 1.46 0.93 0 1.0 1.00 2.00 13 ▇▁▁▁▁
oldReferenceTrigger 0 1.00 0.06 0.34 0 0.0 0.00 0.00 6 ▇▁▁▁▁
oldReferenceTrigger7Day 0 1.00 0.43 1.24 0 0.0 0.00 0.00 18 ▇▁▁▁▁
oldReferenceTrigger30Day 0 1.00 1.54 3.59 0 0.0 0.00 2.00 62 ▇▁▁▁▁
oldReferenceRatio 3676 0.20 0.14 0.31 0 0.0 0.00 0.00 1 ▇▁▁▁▁
oldReferenceRatio7Day 2107 0.54 0.14 0.25 0 0.0 0.00 0.18 1 ▇▁▁▁▁
oldReferenceRatio30Day 758 0.84 0.14 0.23 0 0.0 0.00 0.20 1 ▇▂▁▁▁
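
The two listings above have the layout produced by dplyr's glimpse() and skimr's skim(). The chunk that produced them is not shown, so the following is only a sketch on a small made-up tibble; the name toy and its values are hypothetical.

```r
library(dplyr)
library(skimr)

# Toy stand-in for TriggerOverview (values are made up for illustration)
toy <- tibble(
  userName    = c("a01", "b02", "c03", "d04"),
  Total_app   = c(26, 18, 0, 0),
  cancelRatio = c(0.19, 0.00, NA, NA)
)

# One line per column: name, type and the first few values
glimpse(toy)

# Per-type summary, including n_missing and complete_rate for every column
toy_summary <- skim(toy)
toy_summary
```

The n_missing and complete_rate columns are the quickest way to spot variables, such as the same-day ratio columns here, that are empty for most users.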

1.4 Cleaning column names: Call the janitor
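
The code for this step is not shown in the report. A minimal sketch, assuming the team used janitor's clean_names() with its default settings, is given here on a toy tibble that reuses a few of the original column names (the values are made up).

```r
library(dplyr)
library(janitor)

# Toy data frame reusing a few of the original, inconsistently cased names
raw <- tibble(
  vnpostUserName        = "dungnt07",
  Total_app_cancel_7Day = 7,
  cancelRatio7DDay      = 0.159,
  IDHitApp              = 0
)

# clean_names() rewrites every column name to lower-case snake_case
cleaned <- raw %>% clean_names()
names(cleaned)
```

Consistent snake_case names are what appear in the listings of the following sections, for example vnpost_user_name and id_hit_app.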

1.5 Data types:

R distinguishes the following data types:

* character / <chr>: Textual data, for example the text of a tweet.

* factor / <fct>: Categorical data with a finite number of categories with no particular order.

* ordered / <ord>: Categorical data with a finite number of categories with a particular order.

* double / <dbl>: Numerical data with decimal places.

* integer / <int>: Numerical data with whole numbers only (i.e. no decimals).

* logical / <lgl>: Logical data, which only consists of values TRUE and FALSE.

* date / <date>: Data which consists of dates, e.g. 2021-08-05.

* date-time / <dttm>: Data which consists of dates and times, e.g. 2021-08-05 16:29:25 BST.

To convert several variables in one go, we can put all the conversions into a single function call:
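
The conversion chunk itself is not shown. One way to do it, sketched here on a toy tibble with made-up values, is a single mutate() that uses across() for columns sharing a target type. Which columns the team actually converted, and how, is an assumption.

```r
library(dplyr)

# Toy data: two categorical columns stored as text and one numeric code
toy <- tibble(
  group_hit_outside     = c("Hit", "NoneHit", NA),
  group_hit_outside7day = c("Hit", "Hit", "NoneHit"),
  user_group            = c(2, 6, 5)
)

toy_typed <- toy %>%
  mutate(
    # character columns that are really categories become factors
    across(c(group_hit_outside, group_hit_outside7day), as.factor),
    # a whole-number code can be stored as an integer
    user_group = as.integer(user_group)
  )

glimpse(toy_typed)
```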

## Rows: 4,612
## Columns: 96
## $ vnpost_user_name                  <chr> "dungnt07", "59.0057", "anhpdt", "22…
## $ vnpostname                        <chr> "Nguyễn Thị Dung", "MAI THỊ XUÂN HƯỜ…
## $ vnpostprovincename                <fct> Bưu điện TP Hà Nội, Bưu điện Tỉnh Bì…
## $ vnpostdistrictname                <fct> BĐH Chương Mỹ, BĐH Tuy Phước, BĐH Nh…
## $ vnpostorganizationname            <fct> "Chương Mỹ", "VHX Phước Sơn", "VHX H…
## $ ekyc                              <chr> "Đã OTP thành công", "Đã OTP thành c…
## $ user_group                        <dbl> 2, 2, 2, 2, 2, 6, 2, 5, 5, 2, 2, 2, …
## $ flow_group                        <dbl> 1, 1, 1, 1, 1, 0, 1, 3, 3, 1, 1, 1, …
## $ status                            <chr> NA, NA, NA, NA, NA, "Lock", NA, NA, …
## $ total_app                         <dbl> 26, 18, 18, 18, 17, 14, 13, 13, 13, …
## $ total_app7                        <dbl> 44, 29, 62, 31, 46, 51, 22, 67, 31, …
## $ total_app30                       <dbl> 198, 67, 155, 31, 49, 51, 55, 262, 1…
## $ total_app_cancel                  <dbl> 5, 0, 1, 2, 6, 0, 5, 3, 3, 1, 0, 1, …
## $ total_app_cancel_7day             <dbl> 7, 3, 8, 7, 14, 0, 5, 17, 4, 4, 3, 1…
## $ total_app_cancel_30day            <dbl> 32, 6, 25, 7, 15, 0, 15, 60, 37, 12,…
## $ cancel_ratio                      <dbl> 0.19230769, 0.00000000, 0.05555556, …
## $ cancel_ratio7d_day                <dbl> 0.15909091, 0.10344828, 0.12903226, …
## $ cancel_ratio30day                 <dbl> 0.16161616, 0.08955224, 0.16129032, …
## $ fpd30_his_tt                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fpd30_base_tt                     <dbl> 0, 0, 36, 0, 0, 0, 4, 0, 12, 0, 0, 0…
## $ fpd30_tt                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_app_reject                  <dbl> 19, 11, 11, 14, 10, 11, 4, 7, 9, 10,…
## $ total_app_reject_7day             <dbl> 30, 18, 38, 21, 26, 47, 9, 38, 23, 3…
## $ total_app_reject_30day            <dbl> 133, 34, 93, 21, 27, 47, 26, 132, 69…
## $ reject_ratio                      <dbl> 0.7307692, 0.6111111, 0.6111111, 0.7…
## $ reject_ratio7day                  <dbl> 0.6818182, 0.6206897, 0.6129032, 0.6…
## $ reject_ratio30day                 <dbl> 0.6717172, 0.5074627, 0.6000000, 0.6…
## $ total_app_approve                 <dbl> 1, 1, 4, 0, 1, 2, 0, 0, 0, 1, 1, 0, …
## $ total_app_approve_7day            <dbl> 6, 2, 14, 1, 5, 3, 4, 9, 3, 10, 3, 2…
## $ total_app_approve_30day           <dbl> 32, 21, 35, 1, 6, 3, 10, 67, 31, 18,…
## $ approve_ratio                     <dbl> 0.03846154, 0.05555556, 0.22222222, …
## $ approve_ratio7day                 <dbl> 0.13636364, 0.06896552, 0.22580645, …
## $ approve_ratio30day                <dbl> 0.16161616, 0.31343284, 0.22580645, …
## $ total_app_disbursed               <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
## $ total_app_disbursed_7day          <dbl> 3, 1, 12, 1, 3, 1, 4, 9, 3, 9, 3, 2,…
## $ total_app_disbursed_30day         <dbl> 29, 20, 33, 1, 4, 1, 10, 67, 31, 17,…
## $ disbursed_ratio                   <dbl> 0.00000000, 0.00000000, 0.11111111, …
## $ disbursed_ratio7day               <dbl> 0.06818182, 0.03448276, 0.19354839, …
## $ disbursed_ratio30day              <dbl> 0.14646465, 0.29850746, 0.21290323, …
## $ outside_app                       <dbl> 5, 0, 0, 0, 17, 1, 0, 0, 0, 0, 0, 2,…
## $ outside_app7day                   <dbl> 6, 1, 1, 0, 46, 4, 0, 1, 0, 0, 0, 2,…
## $ outside_app30day                  <dbl> 11, 1, 3, 0, 46, 4, 0, 3, 4, 0, 0, 1…
## $ outside_ratio                     <dbl> 0.19230769, 0.00000000, 0.00000000, …
## $ outside_ratio7day                 <dbl> 0.13636364, 0.03448276, 0.01612903, …
## $ outside_ratio30day                <dbl> 0.05555556, 0.01492537, 0.01935484, …
## $ out_side_pro                      <dbl> 4, 0, 0, 0, 9, 1, 0, 0, 0, 0, 0, 2, …
## $ out_side_pro7day                  <dbl> 5, 1, 1, 0, 16, 3, 0, 1, 0, 0, 0, 2,…
## $ out_side_pro30day                 <dbl> 6, 1, 1, 0, 16, 3, 0, 3, 3, 0, 0, 8,…
## $ up_load_ratio                     <dbl> 0.1153846, 1.0000000, 0.2777778, 0.1…
## $ up_load_ratio7day                 <dbl> 0.18181818, 0.65517241, 0.17204301, …
## $ up_load_ratio30day                <dbl> 0.60942761, 0.28358209, 0.39139785, …
## $ id_hit_app                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ id_hit_app7day                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, …
## $ id_hit_app30day                   <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 1, 3, 1, 0, …
## $ id_hit_app_ratio                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ id_hit_app_ratio7day              <dbl> 0.00000000, 0.00000000, 0.00000000, …
## $ id_hit_app_ratio30day             <dbl> 0.000000000, 0.000000000, 0.00000000…
## $ phone_hit_app                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ phone_hit_app7day                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ phone_hit_app30day                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, …
## $ phone_hit_app_ratio               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ phone_hit_app_ratio7day           <dbl> 0.00000000, 0.00000000, 0.00000000, …
## $ phone_hit_app_ratio30day          <dbl> 0.00000000, 0.00000000, 0.00000000, …
## $ user_hit_app                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ user_hit_app7day                  <dbl> 0, 0, 2, 1, 11, 2, 1, 2, 2, 5, 6, 1,…
## $ user_hit_app30day                 <dbl> 9, 0, 3, 1, 12, 2, 3, 7, 13, 6, 10, …
## $ user_hit_app_ratio                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ user_hit_app_ratio7day            <dbl> 0.00000000, 0.00000000, 0.03225806, …
## $ user_hit_app_ratio30day           <dbl> 0.04545455, 0.00000000, 0.01935484, …
## $ total_locked                      <dbl> 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
## $ total_device                      <dbl> 9, 2, 2, 3, 3, 2, 1, 9, 1, 3, 2, 8, …
## $ old_reference_trigger             <dbl> 0, 1, 0, 1, 0, 0, 2, 0, 1, 5, 5, 1, …
## $ old_reference_trigger7day         <dbl> 0, 1, 3, 1, 1, 0, 6, 2, 13, 18, 11, …
## $ old_reference_trigger30day        <dbl> 4, 7, 5, 1, 1, 0, 14, 28, 56, 19, 21…
## $ old_reference_ratio               <dbl> 0.00000000, 0.05555556, 0.00000000, …
## $ old_reference_ratio7day           <dbl> 0.00000000, 0.03448276, 0.04838710, …
## $ old_reference_ratio30day          <dbl> 0.02020202, 0.10447761, 0.03225806, …
## $ group_fpd30tt                     <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ group_hit_outside                 <fct> Hit, NoneHit, NoneHit, NoneHit, Hit,…
## $ group_hit_outside7day             <fct> Hit, Hit, Hit, NoneHit, Hit, Hit, No…
## $ group_hit_outside30day            <fct> Hit, Hit, Hit, NoneHit, Hit, Hit, No…
## $ group_hit_up_load_ratio           <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_up_load_ratio7day       <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_up_load_ratio30day      <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_old_reference_ratio     <chr> "NoneHit", "Hit", "NoneHit", "Hit", …
## $ group_hit_old_reference_ratio7day <chr> "NoneHit", "Hit", "Hit", "Hit", "Hit…
## $ group_hit_old_reference30day      <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_id_app                  <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_id_app7day              <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_id_app30day             <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_phone_app               <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_phone_app7day           <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_phone_app30day          <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_user_app                <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_user_app7day            <chr> "NoneHit", "NoneHit", "Hit", "Hit", …
## $ group_hit_user_app30day           <chr> "Hit", "NoneHit", "Hit", "Hit", "Hit…
## # A tibble: 4,612 × 97
##    vnpost…¹ vnpos…² vnpos…³ vnpos…⁴ vnpos…⁵ ekyc  user_…⁶ flow_…⁷ status total…⁸
##    <chr>    <chr>   <fct>   <fct>   <fct>   <chr> <chr>     <dbl> <chr>    <dbl>
##  1 dungnt07 Nguyễn… Bưu đi… BĐH Ch… Chương… Đã O… 2             1 <NA>        26
##  2 59.0057  MAI TH… Bưu đi… BĐH Tu… VHX Ph… Đã O… 2             1 <NA>        18
##  3 anhpdt   PHAN D… Bưu đi… BĐH Nh… VHX Hi… Đã O… 2             1 <NA>        18
##  4 22.0067  Trịnh … Bưu đi… BĐH Qu… Quế Võ  Đã O… 2             1 <NA>        18
##  5 43.0415   Đinh … Bưu đi… BĐTP N… Ninh B… Đã O… 2             1 <NA>        17
##  6 10.0912  NGỌC T… Bưu đi… BĐ Tru… Cầu Di… Đã O… 6             0 Lock        14
##  7 hant10   Nguyễn… Bưu đi… BĐH Ph… Phù Yên Đã O… 2             1 <NA>        13
##  8 41.0824  NGUYỄN… Bưu đi… BĐH Ki… Kiến X… Đã O… 5             3 <NA>        13
##  9 53.0426  Lê Thị… Bưu đi… BĐH A … VHX Ph… Đã O… 5             3 <NA>        13
## 10 83.0048  ĐẶNG T… Bưu đi… BĐH Bù… Bù Gia… Đã O… 2             1 <NA>        13
## # … with 4,602 more rows, 87 more variables: total_app7 <dbl>,
## #   total_app30 <dbl>, total_app_cancel <dbl>, total_app_cancel_7day <dbl>,
## #   total_app_cancel_30day <dbl>, cancel_ratio <dbl>, cancel_ratio7d_day <dbl>,
## #   cancel_ratio30day <dbl>, fpd30_his_tt <dbl>, fpd30_base_tt <dbl>,
## #   fpd30_tt <dbl>, total_app_reject <dbl>, total_app_reject_7day <dbl>,
## #   total_app_reject_30day <dbl>, reject_ratio <dbl>, reject_ratio7day <dbl>,
## #   reject_ratio30day <dbl>, total_app_approve <dbl>, …

1.6 Handling factors

Factors are an essential way to classify observations in our data in different ways. In terms of data wrangling, there are usually at least two steps we take to prepare them for analysis:

  • Recoding factors, and

  • Reordering factor levels.

1.6.1 Recoding factors

TriggerOverview_clean %>%
  count(group_hit_outside, sort = TRUE) 
## # A tibble: 3 × 2
##   group_hit_outside     n
##   <fct>             <int>
## 1 <NA>               3676
## 2 NoneHit             829
## 3 Hit                 107
TriggerOverview_clean %>%
    count(group_hit_outside) %>%
    ggplot(aes(group_hit_outside, n))+
    geom_col()
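
The chunk above only counts the existing levels. Recoding itself could look like the following sketch, which uses forcats' fct_recode() on a toy factor. The new label "No hit" is purely illustrative and not a label used by the team.

```r
library(dplyr)
library(forcats)

# Toy factor with the same levels as group_hit_outside, plus a missing value
toy <- tibble(group_hit_outside = factor(c("Hit", "NoneHit", NA, "NoneHit")))

toy_recoded <- toy %>%
  mutate(
    # fct_recode() takes pairs of the form "new label" = "old label"
    group_hit_outside = fct_recode(group_hit_outside, "No hit" = "NoneHit")
  )

toy_recoded %>%
  count(group_hit_outside, sort = TRUE)
```

Missing values stay missing after fct_recode(); they are handled separately in the section on missing data.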

1.6.2 Reordering factor levels

# user_group is not a factor at this point, so fct_unique() returns no levels
fct_unique(TriggerOverview_clean$user_group)
## factor(0)
## Levels:
TriggerOverview_clean %>%
  filter(str_length(vnpostprovincename) >= 15) %>%
  count(vnpostprovincename)
## # A tibble: 63 × 2
##    vnpostprovincename               n
##    <fct>                        <int>
##  1 Bưu điện TP Hà Nội             129
##  2 Bưu điện Tỉnh Bình Định        141
##  3 Bưu điện Tỉnh Đồng Nai         112
##  4 Bưu điện Tỉnh Bắc Ninh          38
##  5 Bưu điện Tỉnh Ninh Bình         57
##  6 Bưu điện Tỉnh Sơn La            98
##  7 Bưu điện Tỉnh Thái Bình         66
##  8 Bưu điện Tỉnh Thừa Thiên Huế    77
##  9 Bưu điện Tỉnh Bình Phước       146
## 10 Bưu điện Tỉnh Quảng Ngãi        79
## # … with 53 more rows
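
The output above lists the provinces but does not change their level order. A small sketch of actual reordering, using forcats' fct_infreq() on a toy factor (the column name province and its values are hypothetical):

```r
library(dplyr)
library(forcats)

# Toy factor: B occurs three times, C twice, A once
toy <- tibble(province = factor(c("B", "A", "B", "C", "B", "C")))

# By default, levels are in alphabetical order
levels(toy$province)

toy_ordered <- toy %>%
  mutate(province = fct_infreq(province))  # most frequent level first

levels(toy_ordered$province)
```

ggplot2 draws discrete axes in level order, so reordering the levels this way also sorts the bars of a geom_col() or geom_bar() plot by frequency.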

1.7 Dealing with missing data

1.7.1 Mapping missing data

library(naniar)
vis_miss(TriggerOverview_clean)

gg_miss_var(TriggerOverview_clean)

# Summarise the missing values in each variable
miss_var_summary(TriggerOverview_clean)
## # A tibble: 96 × 3
##    variable            n_miss pct_miss
##    <chr>                <int>    <dbl>
##  1 status                4060     88.0
##  2 cancel_ratio          3676     79.7
##  3 reject_ratio          3676     79.7
##  4 approve_ratio         3676     79.7
##  5 disbursed_ratio       3676     79.7
##  6 outside_ratio         3676     79.7
##  7 up_load_ratio         3676     79.7
##  8 id_hit_app_ratio      3676     79.7
##  9 phone_hit_app_ratio   3676     79.7
## 10 user_hit_app_ratio    3676     79.7
## # … with 86 more rows
gg_miss_upset(TriggerOverview_clean)

1.7.2 Replacing or removing missing data

* Users active on 12-12

# Keep only users with a value in group_hit_outside, then plot what is still missing
TriggerOverview_clean1 <- TriggerOverview_clean %>%
  drop_na(group_hit_outside)
vis_miss(TriggerOverview_clean1)

* Users active in the last 7 days

# Keep only users with a value in group_hit_outside7day, then plot what is still missing
TriggerOverview_clean7 <- TriggerOverview_clean %>%
  drop_na(group_hit_outside7day)
vis_miss(TriggerOverview_clean7)

* Users active in the last 30 days

# Keep only users with a value in group_hit_outside30day, then plot what is still missing
TriggerOverview_clean30 <- TriggerOverview_clean %>%
  drop_na(group_hit_outside30day)
vis_miss(TriggerOverview_clean30)
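
When a single column should decide which rows are kept, the difference between tidyr's drop_na() and base R's na.omit() matters: drop_na() can be restricted to named columns, while na.omit() always drops rows with a missing value in any column. The sketch below shows this on a toy tibble with made-up values, and also shows replacement with replace_na(). The replacement label "Active" is hypothetical.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  user                  = c("u1", "u2", "u3"),
  status                = c(NA, "Lock", NA),
  group_hit_outside7day = c("Hit", NA, "NoneHit")
)

# drop_na() with a column name only looks at that column
by_column <- toy %>% drop_na(group_hit_outside7day)

# na.omit() drops every row that has an NA in any column
any_column <- na.omit(toy)

# Replacing instead of removing: fill missing status with a placeholder label
replaced <- toy %>% mutate(status = replace_na(status, "Active"))

nrow(by_column)
nrow(any_column)
```

Because status is missing for most users in this data set, removing rows on any missing value would discard far more than the inactive users.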

1.8 Latent constructs and their reliability

  • Compute the mean() of all related items
  • Use rowwise(), because each row represents one participant
TriggerOverview_clean1 <- TriggerOverview_clean %>%
  rowwise() %>%
  mutate(user_ratio1 = mean(c(user_hit_app_ratio,
                        user_hit_app_ratio7day,
                        user_hit_app_ratio30day
                        )
                      ),
         phone_ratio1 = mean(c(phone_hit_app_ratio,
                        phone_hit_app_ratio7day,
                        phone_hit_app_ratio30day
                        )
                      ),
         id_ratio1 = mean(c(id_hit_app_ratio,
                        id_hit_app_ratio7day,
                        id_hit_app_ratio30day
                        )
                      )
         )
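
Note that mean() returns NA as soon as one of its inputs is NA, and the same-day ratio columns are missing for 3,676 of the 4,612 users. The sketch below, on a toy tibble with made-up values, contrasts that strict behaviour with na.rm = TRUE and ends with ungroup() so that later steps are no longer evaluated row by row. The column names user_ratio_strict and user_ratio_avail are hypothetical; whether ignoring missing windows is appropriate is a modelling decision.

```r
library(dplyr)

toy <- tibble(
  user_hit_app_ratio      = c(0,    NA,   NA),
  user_hit_app_ratio7day  = c(0.03, 0.10, NA),
  user_hit_app_ratio30day = c(0.05, 0.20, 0.30)
)

toy_scored <- toy %>%
  rowwise() %>%
  mutate(
    # strict: NA whenever any of the three windows is missing
    user_ratio_strict = mean(c(user_hit_app_ratio,
                               user_hit_app_ratio7day,
                               user_hit_app_ratio30day)),
    # lenient: average over the windows that are available
    user_ratio_avail = mean(c(user_hit_app_ratio,
                              user_hit_app_ratio7day,
                              user_hit_app_ratio30day),
                            na.rm = TRUE)
  ) %>%
  ungroup()  # drop the rowwise grouping again

toy_scored
```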

glimpse(TriggerOverview_clean1)
## Rows: 4,612
## Columns: 99
## Rowwise: 
## $ vnpost_user_name                  <chr> "dungnt07", "59.0057", "anhpdt", "22…
## $ vnpostname                        <chr> "Nguyễn Thị Dung", "MAI THỊ XUÂN HƯỜ…
## $ vnpostprovincename                <fct> Bưu điện TP Hà Nội, Bưu điện Tỉnh Bì…
## $ vnpostdistrictname                <fct> BĐH Chương Mỹ, BĐH Tuy Phước, BĐH Nh…
## $ vnpostorganizationname            <fct> "Chương Mỹ", "VHX Phước Sơn", "VHX H…
## $ ekyc                              <chr> "Đã OTP thành công", "Đã OTP thành c…
## $ user_group                        <dbl> 2, 2, 2, 2, 2, 6, 2, 5, 5, 2, 2, 2, …
## $ flow_group                        <dbl> 1, 1, 1, 1, 1, 0, 1, 3, 3, 1, 1, 1, …
## $ status                            <chr> NA, NA, NA, NA, NA, "Lock", NA, NA, …
## $ total_app                         <dbl> 26, 18, 18, 18, 17, 14, 13, 13, 13, …
## $ total_app7                        <dbl> 44, 29, 62, 31, 46, 51, 22, 67, 31, …
## $ total_app30                       <dbl> 198, 67, 155, 31, 49, 51, 55, 262, 1…
## $ total_app_cancel                  <dbl> 5, 0, 1, 2, 6, 0, 5, 3, 3, 1, 0, 1, …
## $ total_app_cancel_7day             <dbl> 7, 3, 8, 7, 14, 0, 5, 17, 4, 4, 3, 1…
## $ total_app_cancel_30day            <dbl> 32, 6, 25, 7, 15, 0, 15, 60, 37, 12,…
## $ cancel_ratio                      <dbl> 0.19230769, 0.00000000, 0.05555556, …
## $ cancel_ratio7d_day                <dbl> 0.15909091, 0.10344828, 0.12903226, …
## $ cancel_ratio30day                 <dbl> 0.16161616, 0.08955224, 0.16129032, …
## $ fpd30_his_tt                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fpd30_base_tt                     <dbl> 0, 0, 36, 0, 0, 0, 4, 0, 12, 0, 0, 0…
## $ fpd30_tt                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_app_reject                  <dbl> 19, 11, 11, 14, 10, 11, 4, 7, 9, 10,…
## $ total_app_reject_7day             <dbl> 30, 18, 38, 21, 26, 47, 9, 38, 23, 3…
## $ total_app_reject_30day            <dbl> 133, 34, 93, 21, 27, 47, 26, 132, 69…
## $ reject_ratio                      <dbl> 0.7307692, 0.6111111, 0.6111111, 0.7…
## $ reject_ratio7day                  <dbl> 0.6818182, 0.6206897, 0.6129032, 0.6…
## $ reject_ratio30day                 <dbl> 0.6717172, 0.5074627, 0.6000000, 0.6…
## $ total_app_approve                 <dbl> 1, 1, 4, 0, 1, 2, 0, 0, 0, 1, 1, 0, …
## $ total_app_approve_7day            <dbl> 6, 2, 14, 1, 5, 3, 4, 9, 3, 10, 3, 2…
## $ total_app_approve_30day           <dbl> 32, 21, 35, 1, 6, 3, 10, 67, 31, 18,…
## $ approve_ratio                     <dbl> 0.03846154, 0.05555556, 0.22222222, …
## $ approve_ratio7day                 <dbl> 0.13636364, 0.06896552, 0.22580645, …
## $ approve_ratio30day                <dbl> 0.16161616, 0.31343284, 0.22580645, …
## $ total_app_disbursed               <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
## $ total_app_disbursed_7day          <dbl> 3, 1, 12, 1, 3, 1, 4, 9, 3, 9, 3, 2,…
## $ total_app_disbursed_30day         <dbl> 29, 20, 33, 1, 4, 1, 10, 67, 31, 17,…
## $ disbursed_ratio                   <dbl> 0.00000000, 0.00000000, 0.11111111, …
## $ disbursed_ratio7day               <dbl> 0.06818182, 0.03448276, 0.19354839, …
## $ disbursed_ratio30day              <dbl> 0.14646465, 0.29850746, 0.21290323, …
## $ outside_app                       <dbl> 5, 0, 0, 0, 17, 1, 0, 0, 0, 0, 0, 2,…
## $ outside_app7day                   <dbl> 6, 1, 1, 0, 46, 4, 0, 1, 0, 0, 0, 2,…
## $ outside_app30day                  <dbl> 11, 1, 3, 0, 46, 4, 0, 3, 4, 0, 0, 1…
## $ outside_ratio                     <dbl> 0.19230769, 0.00000000, 0.00000000, …
## $ outside_ratio7day                 <dbl> 0.13636364, 0.03448276, 0.01612903, …
## $ outside_ratio30day                <dbl> 0.05555556, 0.01492537, 0.01935484, …
## $ out_side_pro                      <dbl> 4, 0, 0, 0, 9, 1, 0, 0, 0, 0, 0, 2, …
## $ out_side_pro7day                  <dbl> 5, 1, 1, 0, 16, 3, 0, 1, 0, 0, 0, 2,…
## $ out_side_pro30day                 <dbl> 6, 1, 1, 0, 16, 3, 0, 3, 3, 0, 0, 8,…
## $ up_load_ratio                     <dbl> 0.1153846, 1.0000000, 0.2777778, 0.1…
## $ up_load_ratio7day                 <dbl> 0.18181818, 0.65517241, 0.17204301, …
## $ up_load_ratio30day                <dbl> 0.60942761, 0.28358209, 0.39139785, …
## $ id_hit_app                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ id_hit_app7day                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, …
## $ id_hit_app30day                   <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 1, 3, 1, 0, …
## $ id_hit_app_ratio                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ id_hit_app_ratio7day              <dbl> 0.00000000, 0.00000000, 0.00000000, …
## $ id_hit_app_ratio30day             <dbl> 0.000000000, 0.000000000, 0.00000000…
## $ phone_hit_app                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ phone_hit_app7day                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ phone_hit_app30day                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, …
## $ phone_hit_app_ratio               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ phone_hit_app_ratio7day           <dbl> 0.00000000, 0.00000000, 0.00000000, …
## $ phone_hit_app_ratio30day          <dbl> 0.00000000, 0.00000000, 0.00000000, …
## $ user_hit_app                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ user_hit_app7day                  <dbl> 0, 0, 2, 1, 11, 2, 1, 2, 2, 5, 6, 1,…
## $ user_hit_app30day                 <dbl> 9, 0, 3, 1, 12, 2, 3, 7, 13, 6, 10, …
## $ user_hit_app_ratio                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ user_hit_app_ratio7day            <dbl> 0.00000000, 0.00000000, 0.03225806, …
## $ user_hit_app_ratio30day           <dbl> 0.04545455, 0.00000000, 0.01935484, …
## $ total_locked                      <dbl> 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
## $ total_device                      <dbl> 9, 2, 2, 3, 3, 2, 1, 9, 1, 3, 2, 8, …
## $ old_reference_trigger             <dbl> 0, 1, 0, 1, 0, 0, 2, 0, 1, 5, 5, 1, …
## $ old_reference_trigger7day         <dbl> 0, 1, 3, 1, 1, 0, 6, 2, 13, 18, 11, …
## $ old_reference_trigger30day        <dbl> 4, 7, 5, 1, 1, 0, 14, 28, 56, 19, 21…
## $ old_reference_ratio               <dbl> 0.00000000, 0.05555556, 0.00000000, …
## $ old_reference_ratio7day           <dbl> 0.00000000, 0.03448276, 0.04838710, …
## $ old_reference_ratio30day          <dbl> 0.02020202, 0.10447761, 0.03225806, …
## $ group_fpd30tt                     <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ group_hit_outside                 <fct> Hit, NoneHit, NoneHit, NoneHit, Hit,…
## $ group_hit_outside7day             <fct> Hit, Hit, Hit, NoneHit, Hit, Hit, No…
## $ group_hit_outside30day            <fct> Hit, Hit, Hit, NoneHit, Hit, Hit, No…
## $ group_hit_up_load_ratio           <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_up_load_ratio7day       <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_up_load_ratio30day      <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_old_reference_ratio     <chr> "NoneHit", "Hit", "NoneHit", "Hit", …
## $ group_hit_old_reference_ratio7day <chr> "NoneHit", "Hit", "Hit", "Hit", "Hit…
## $ group_hit_old_reference30day      <chr> "Hit", "Hit", "Hit", "Hit", "Hit", "…
## $ group_hit_id_app                  <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_id_app7day              <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_id_app30day             <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_phone_app               <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_phone_app7day           <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_phone_app30day          <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_user_app                <chr> "NoneHit", "NoneHit", "NoneHit", "No…
## $ group_hit_user_app7day            <chr> "NoneHit", "NoneHit", "Hit", "Hit", …
## $ group_hit_user_app30day           <chr> "Hit", "NoneHit", "Hit", "Hit", "Hit…
## $ user_ratio1                       <dbl> 0.01515152, 0.00000000, 0.01720430, …
## $ phone_ratio1                      <dbl> 0.00000000, 0.00000000, 0.00000000, …
## $ id_ratio1                         <dbl> 0.000000000, 0.000000000, 0.00000000…
TriggerOverview_clean %>%
  select(user_hit_app_ratio, user_hit_app_ratio7day, user_hit_app_ratio30day ) %>%
  psych::alpha()
## 
## Reliability analysis   
## Call: psych::alpha(x = .)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase  mean   sd median_r
##       0.76      0.76    0.62      0.62 3.3 0.0069 0.062 0.13     0.62
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.75  0.76  0.78
## Duhachek  0.75  0.76  0.78
## 
##  Reliability if an item is dropped:
##                         raw_alpha std.alpha G6(smc) average_r S/N alpha se
## user_hit_app_ratio7day       0.69      0.62    0.38      0.62 1.6       NA
## user_hit_app_ratio30day      0.56      0.62    0.38      0.62 1.6       NA
##                         var.r med.r
## user_hit_app_ratio7day      0  0.62
## user_hit_app_ratio30day     0  0.62
## 
##  Item statistics 
##                            n raw.r std.r r.cor r.drop  mean   sd
## user_hit_app_ratio7day  2505  0.95   0.9  0.71   0.62 0.089 0.20
## user_hit_app_ratio30day 3854  0.95   0.9  0.71   0.62 0.095 0.18
TriggerOverview_clean %>%
  select(id_hit_app_ratio, id_hit_app_ratio7day, id_hit_app_ratio30day ) %>%
  psych::alpha()
## 
## Reliability analysis   
## Call: psych::alpha(x = .)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase   mean    sd median_r
##       0.66      0.67     0.5       0.5   2 0.0098 0.0054 0.035      0.5
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.64  0.66  0.68
## Duhachek  0.64  0.66  0.68
## 
##  Reliability if an item is dropped:
##                       raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r
## id_hit_app_ratio7day       0.60       0.5    0.25       0.5   1       NA     0
## id_hit_app_ratio30day      0.42       0.5    0.25       0.5   1       NA     0
##                       med.r
## id_hit_app_ratio7day    0.5
## id_hit_app_ratio30day   0.5
## 
##  Item statistics 
##                          n raw.r std.r r.cor r.drop   mean   sd
## id_hit_app_ratio7day  2505  0.93  0.87  0.61    0.5 0.0074 0.06
## id_hit_app_ratio30day 3854  0.93  0.87  0.61    0.5 0.0086 0.05
TriggerOverview_clean %>%
  select(phone_hit_app_ratio, phone_hit_app_ratio7day, phone_hit_app_ratio30day ) %>%
  psych::alpha()
## 
## Reliability analysis   
## Call: psych::alpha(x = .)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase   mean    sd median_r
##       0.68      0.68    0.52      0.52 2.1 0.0094 0.0033 0.031     0.52
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.66  0.68   0.7
## Duhachek  0.66  0.68   0.7
## 
##  Reliability if an item is dropped:
##                          raw_alpha std.alpha G6(smc) average_r S/N alpha se
## phone_hit_app_ratio7day       0.51      0.52    0.27      0.52 1.1       NA
## phone_hit_app_ratio30day      0.52      0.52    0.27      0.52 1.1       NA
##                          var.r med.r
## phone_hit_app_ratio7day      0  0.52
## phone_hit_app_ratio30day     0  0.52
## 
##  Item statistics 
##                             n raw.r std.r r.cor r.drop   mean    sd
## phone_hit_app_ratio7day  2505  0.94  0.87  0.63   0.52 0.0042 0.045
## phone_hit_app_ratio30day 3854  0.96  0.87  0.63   0.52 0.0051 0.046

1.8.1 Confirmatory factor analysis

Confirmatory factor analysis (CFA): This approach is used to confirm whether a set of items truly reflects a latent variable that we defined ex ante.

#1: Define the model which explains how items relate to latent variables
model <- 'latent_today =~ total_app +id_hit_app30day+ phone_hit_app30day+ user_hit_app30day 

          latent_7day =~  total_app7 +id_hit_app30day + phone_hit_app30day+ user_hit_app30day

          latent_30day =~ total_app30+ id_hit_app30day + phone_hit_app30day+ user_hit_app30day'

#2: Run the CFA to see how well this model fits our data
fit <- cfa(model, data = TriggerOverview_clean)

#3a: Extract the performance indicators
fit_indices <- fitmeasures(fit)

#3b: We tidy the results with enframe() and
#    pick only those indices we are most interested in
fit_indices %>%
  enframe() %>%
  filter(name == "cfi" |
         name == "srmr" |
         name == "rmsea") %>%
  mutate(value = round(value, 3))   # Round to 3 decimal places
## # A tibble: 3 × 2
##   name  value     
##   <chr> <lvn.vctr>
## 1 cfi   1.000     
## 2 rmsea 0.000     
## 3 srmr  0.002
enframe(fit_indices)
## # A tibble: 42 × 2
##    name            value       
##    <chr>           <lvn.vctr>  
##  1 npar            2.100000e+01
##  2 fmin            4.025079e-05
##  3 chisq           3.712732e-01
##  4 df              0.000000e+00
##  5 pvalue                    NA
##  6 baseline.chisq  8.037557e+03
##  7 baseline.df     1.500000e+01
##  8 baseline.pvalue 0.000000e+00
##  9 cfi             9.999537e-01
## 10 tli             1.000000e+00
## # … with 32 more rows

The column names name and value were generated when we called the function enframe(). I often work through chains of analytical steps iteratively to see what the intermediate steps produce, which also makes it easier to spot mistakes early on. I therefore recommend building up your dplyr chains of function calls slowly, especially when you have just started learning R and the tidyverse approach to data analysis.

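The results of our CFA look promising at first glance:

  • The cfi (Comparative Fit Index) is 1.000, above the conventional cutoff of 0.95,

  • The rmsea (Root Mean Square Error of Approximation) is 0.000, below the commonly used cutoff of 0.06, and

  • The srmr (Standardised Root Mean Square Residual) is 0.002, well below 0.08 (Hu & Bentler, 1999; West et al., 2012).

However, the full list of indices shows df = 0, which means the model is just-identified: it estimates as many parameters as there are unique variances and covariances in the data, so a near-perfect fit is expected by construction and says little about how well the latent variables are defined. Overall, the indices are consistent with a good fit, but they should be read with this caveat in mind. Combined with the computed Cronbach’s \(\alpha\) (0.76, 0.66 and 0.68), we can be cautiously confident in our latent variables and perform further analytical steps.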
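
The comparison of fit indices against cutoffs can be sketched in a few lines of R. This is a minimal illustration, not part of the original analysis: the helper check_fit() is hypothetical, the cutoffs are the commonly cited rules of thumb from Hu & Bentler (1999), and the numbers are copied from the output above. It also counts the degrees of freedom by hand, assuming the six observed variables named in the model and no mean structure.

```r
# Hypothetical helper: compare fit indices against rule-of-thumb cutoffs
# and derive the degrees of freedom from the number of observed variables.
check_fit <- function(indices, n_observed) {
  # Unique variances and covariances available: p * (p + 1) / 2
  n_moments <- n_observed * (n_observed + 1) / 2
  df <- n_moments - indices[["npar"]]
  list(
    df        = df,
    saturated = df == 0,                       # just-identified model
    cfi_ok    = indices[["cfi"]]   >= 0.95,
    rmsea_ok  = indices[["rmsea"]] <= 0.06,
    srmr_ok   = indices[["srmr"]]  <= 0.08
  )
}

# Values copied from the fitmeasures() output above
reported <- c(npar = 21, cfi = 1.000, rmsea = 0.000, srmr = 0.002)

# Six observed variables enter the model: total_app, total_app7, total_app30,
# id_hit_app30day, phone_hit_app30day and user_hit_app30day
result <- check_fit(reported, n_observed = 6)
str(result)
```

With 6 observed variables there are 21 unique variances and covariances, and the model estimates 21 parameters, so no degrees of freedom remain to test the fit.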

2 Analysis

2.1 Central tendency measures: Mean, Median, Mode

2.1.1 Mean

TriggerOverview_clean1 %>%
  filter(total_app>0) %>%
  group_by(vnpostprovincename) %>%
  summarise(mean_total_app = mean(total_app, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(vnpostprovincename, mean_total_app), y = mean_total_app)) +
  geom_col() +
  coord_flip()

# Total ID hits over the last 7 days per province
TriggerOverview_clean %>%
  filter(!is.na(id_hit_app), id_hit_app7day > 0) %>%
  group_by(vnpostprovincename) %>%
  summarise(sum_id_hit_app7day = sum(id_hit_app7day)) %>%
  ggplot(aes(x = vnpostprovincename, y = sum_id_hit_app7day)) +
  geom_col() +
  coord_flip()

# Total ID hits over the last 30 days per province
TriggerOverview_clean %>%
  filter(!is.na(id_hit_app), id_hit_app30day > 0) %>%
  group_by(vnpostprovincename) %>%
  summarise(sum_id_hit_app30day = sum(id_hit_app30day)) %>%
  ggplot(aes(x = vnpostprovincename, y = sum_id_hit_app30day)) +
  geom_col() +
  coord_flip()

2.2 Indicators and visualisations to examine the spread of data

# Distribution of the 7-day ID hit ratio among users with at least one 30-day ID hit
TriggerOverview_clean %>%
  filter(id_hit_app30day > 0) %>%
  ggplot(aes(id_hit_app_ratio7day)) +
  geom_histogram(bins = 30, fill = "red")

# Distribution of the 30-day ID hit ratio among the same users
TriggerOverview_clean %>%
  filter(id_hit_app30day > 0) %>%
  ggplot(aes(id_hit_app_ratio30day)) +
  geom_histogram(bins = 30, fill = "red")

2.3 Packages to compute descriptive statistics

2.3.1 The psych package for descriptive statistics

2.3.2 The skimr package for descriptive statistics

2.4 Sources of bias: Outliers, normality and other ‘conundrums’

2.4.1 Linearity and additivity

2.4.2 Independence

2.4.3 Normality

2.4.4 Homogeneity of variance (homoscedasticity)

2.4.5 Outliers and how to deal with them

2.4.5.1 Detecting outliers using the standard deviation

A very frequently used approach to detecting outliers relies on the standard deviation. Usually, scholars use multiples of the standard deviation to determine thresholds. For example, a value that lies 3 standard deviations above or below the mean could be categorised as an outlier. Unfortunately, there is considerable variability in how many multiples of the standard deviation count as an outlier. Some authors use 3, and others settle for 2 (see also Leys et al. (2013)). Let’s stick with the definition of 3 standard deviations to get us started. We can revisit our previous plot of id_hit_app7day (see Figure 8.5) and add lines that show the thresholds above and below the mean. As before, I will create a base plot outlier_plot first so that we do not have to repeat the same code over and over again. We then use outlier_plot and add more layers as we see fit.
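
The thresholds described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the vector hits is made-up data rather than a column of TriggerOverview_clean, and flag_sd_outliers() is a hypothetical helper, not a function from any package.

```r
# Hypothetical 7-day hit counts for 20 users (not taken from the data set)
hits <- c(rep(0, 12), rep(1, 4), rep(2, 3), 25)

# Hypothetical helper: flag values more than k standard deviations from the mean
flag_sd_outliers <- function(x, k = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  upper  <- centre + k * spread
  lower  <- centre - k * spread
  list(lower = lower, upper = upper, outlier = x > upper | x < lower)
}

sd3 <- flag_sd_outliers(hits, k = 3)   # the 3-SD definition used in the text
sd2 <- flag_sd_outliers(hits, k = 2)   # the stricter 2-SD alternative

hits[sd3$outlier]                      # values flagged under the 3-SD rule
c(lower = sd3$lower, upper = sd3$upper)
```

In this toy example only the value 25 exceeds the upper threshold. The same two thresholds could be drawn on a plot such as outlier_plot with geom_hline(yintercept = c(sd3$lower, sd3$upper)). Because the mean and the standard deviation are themselves pulled upwards by extreme values, this rule can miss outliers in heavily skewed data.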

2.4.5.2 Detecting outliers using the interquartile range (IQR)

# Compute the quartiles of the 7-day ID hit counts
(TO <- quantile(TriggerOverview_clean$id_hit_app7day, na.rm = TRUE))
# Compute the thresholds: 1.5 * IQR beyond the first and third quartiles
iqr_7day  <- IQR(TriggerOverview_clean$id_hit_app7day, na.rm = TRUE)
iqr_upper <- TO[["75%"]] + 1.5 * iqr_7day
iqr_lower <- TO[["25%"]] - 1.5 * iqr_7day

# Flag every observation that lies outside the thresholds
TriggerOverview_clean <-
  TriggerOverview_clean %>%
  mutate(outlier = id_hit_app7day > iqr_upper |
                     id_hit_app7day < iqr_lower)
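
One caveat of the IQR rule is worth illustrating with a small, self-contained example. The vector below is made-up data, chosen only because the hit counts shown in the overview are mostly zeros. When at least three quarters of the values are zero, both quartiles and the IQR are zero, so every non-zero value is flagged.

```r
# Hypothetical zero-inflated hit counts for 20 users (not taken from the data set)
hits <- c(rep(0, 16), 1, 1, 2, 5)

q   <- quantile(hits, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- IQR(hits, na.rm = TRUE)

upper <- q[["75%"]] + 1.5 * iqr
lower <- q[["25%"]] - 1.5 * iqr

outlier <- hits > upper | hits < lower
table(outlier)
```

Here both thresholds collapse to 0 and all four non-zero values are marked as outliers. If the real columns behave similarly, it may be more informative to apply the rule only to users with at least one hit, or to the ratio columns instead.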
TriggerOverview_clean %>%
  ggplot(aes(x = reorder(vnpostprovincename, id_hit_app7day),
             y = id_hit_app7day)
         ) +
  geom_point(size = 1) +
  theme(panel.background = element_blank()) +
  coord_flip()

TriggerOverview_clean %>%
  ggplot(aes(x = reorder(vnpostprovincename, phone_hit_app7day),
             y = phone_hit_app7day)
         ) +
  geom_point(size = 1) +
  theme(panel.background = element_blank()) +
  coord_flip()

TriggerOverview_clean %>%
  ggplot(aes(x = reorder(vnpostprovincename, phone_hit_app30day),
             y = phone_hit_app30day)
         ) +
  geom_point(size = 1) +
  theme(panel.background = element_blank()) +
  coord_flip()

TriggerOverview_clean %>%
  ggplot(aes(x = reorder(vnpostprovincename, user_hit_app7day),
             y = user_hit_app7day)
         ) +
  geom_point(size = 1) +
  theme(panel.background = element_blank()) +
  coord_flip()

TriggerOverview_clean %>%
  ggplot(aes(x = reorder(vnpostprovincename, user_hit_app30day),
             y = user_hit_app30day)
         ) +
  geom_point(size = 1) +
  theme(panel.background = element_blank()) +
  coord_flip()