Analysis Plan

YIFANG, JAY

2019-09-27

Set up

Data structure and overview

The structure and summary output below show that several variables contain many missing values (for example, 20,562 in EDUCATION_CD and 7,951 in MARRIAGE_CD), so our first step is to deal with them. (Only a subset of the variables is shown here.)

## 'data.frame':    100000 obs. of  10 variables:
##  $ CUS_ID            : int  3418 4302 5545 7207 7213 8818 9681 9743 9839 10246 ...
##  $ GENDER            : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ AGE               : Factor w/ 4 levels "中","中高","低",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ CHARGE_CITY_CD    : Factor w/ 8 levels "A1","A2","B1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ CONTACT_CITY_CD   : Factor w/ 8 levels "A1","A2","B1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ EDUCATION_CD      : int  NA NA 1 NA 1 NA NA NA 1 NA ...
##  $ MARRIAGE_CD       : int  NA NA 0 0 0 0 0 0 0 0 ...
##  $ LAST_A_CCONTACT_DT: Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 1 2 1 1 ...
##  $ L1YR_A_ISSUE_CNT  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ LAST_A_ISSUE_DT   : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 1 1 1 1 ...
##      CUS_ID         GENDER        AGE        CHARGE_CITY_CD 
##  Min.   :     12   F   :52944   中  :27148   B1     :51618  
##  1st Qu.: 627540   M   :46373   中高:21737   A1     :16513  
##  Median :1261358   NA's:  683   低  :27213   C2     : 8601  
##  Mean   :1267299                高  :23902   A2     : 8431  
##  3rd Qu.:1903561                             B2     : 7574  
##  Max.   :2551470                             C1     : 4820  
##                                              (Other): 2443  
##  CONTACT_CITY_CD  EDUCATION_CD    MARRIAGE_CD    LAST_A_CCONTACT_DT
##  A1     :27066   Min.   :1.000   Min.   :0.000   N:64595           
##  B1     :18680   1st Qu.:1.000   1st Qu.:0.000   Y:35405           
##  A2     :14489   Median :2.000   Median :0.000                     
##  C2     :14454   Mean   :2.169   Mean   :0.315                     
##  B2     :13027   3rd Qu.:3.000   3rd Qu.:1.000                     
##  C1     : 7961   Max.   :4.000   Max.   :2.000                     
##  (Other): 4323   NA's   :20562   NA's   :7951                      
##  L1YR_A_ISSUE_CNT  LAST_A_ISSUE_DT
##  Min.   : 0.0000   N:88631        
##  1st Qu.: 0.0000   Y:11369        
##  Median : 0.0000                  
##  Mean   : 0.1182                  
##  3rd Qu.: 0.0000                  
##  Max.   :22.0000                  
## 
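The missing-value counts reported in the summary can also be tabulated directly per column. The following is a minimal sketch, assuming a small made-up data frame named `toy`. It borrows three column names from the output above, but its values are invented and are not the real data.

```r
# Toy data frame for illustration only; the values are invented.
toy <- data.frame(
  GENDER       = factor(c("M", "F", NA, "F", "M")),
  EDUCATION_CD = c(NA, 1L, 2L, NA, 4L),
  MARRIAGE_CD  = c(0L, NA, 1L, 0L, 2L)
)

# Number of missing values in each column.
na_count <- colSums(is.na(toy))

# Share of rows that are missing, per column.
na_share <- colMeans(is.na(toy))

print(na_count)
print(round(na_share, 2))
```

Applied to the full data set, the same two calls would give a compact view of which variables need imputation.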

Missing Value

Continuous variables

We use two methods to replace missing values:

2. Missing as median

We replace each missing value with the median of the variable's observed (non-missing) values.
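Median replacement can be sketched in a few lines of base R. The helper name `impute_median` and the sample values are hypothetical and are not taken from the original analysis.

```r
# Replace NA entries of a numeric vector with the median of its observed values.
# `impute_median` is a hypothetical helper name used only for illustration.
impute_median <- function(x) {
  med <- median(x, na.rm = TRUE)  # median of the non-missing values
  x[is.na(x)] <- med              # overwrite only the missing positions
  x
}

# Invented values for illustration.
education_cd <- c(NA, 1, 2, NA, 4)
education_filled <- impute_median(education_cd)
print(education_filled)
```

The median is computed once from the observed values, so the imputed entries do not influence it.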

For example: