The benefits and pitfalls of binning data

Having detailed numeric data for an explanatory variable offers continuous and quantitative information, and allows the usual avenues of generating summary statistics to understand the variable better. However, one technique for understanding or simplifying a model takes a numeric variable and breaks it into a handful of ranges to consider, called binning or discretizing. This technique can offer some intelligibility to a variable by breaking it down into two or more categorical variables. However, binning has the potential to give up a lot of information about a dataset by eliminating the prospect of a per-unit quantified effect that would be gained from the regression coefficient.

Binning is at its most useful when there are already some apparent categorical features of the dataset: the variable in question tends to cluster around several values, there’s a known stepwise relationship about the variable in question, or there are other categorical variables in the dataset already. Also, when the variable has a substantial amount of missing values, this too can be made into a distinct category along with numerical ranges, and may add information if the values are not missing at random. Finally, binning effectively removes outliers by grouping it with nearby values.

In the worst-case scenario, incorrect selection of the numeric ranges for bins may attenuate the true correlation between two variables by limiting the amount of information gained from the variable. Pivotal to developing a discretized variable is selecting the ranges used. This could be done using several methods. The simplest is to assign ranges according to quantiles over the data range. From there, the exact boundaries could be changed to reflect the underlying distribution.

For the example today, I’m going to look at the dataset used for HW4, which looked at predicting insurance claims based off characteristics of the driver and car.

The original dataset needed some cleaning up, but ended up looking like this.

summary(insurance_fix)
##   TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
##  Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##  Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
##  Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.79  
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                     NA's   :6      
##     HOMEKIDS           YOJ           INCOME       PARENT1       HOME_VAL     
##  Min.   :0.0000   Min.   : 0.0   Min.   :     0   No :7084   Min.   :     0  
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097   Yes:1077   1st Qu.:     0  
##  Median :0.0000   Median :11.0   Median : 54028              Median :161160  
##  Mean   :0.7212   Mean   :10.5   Mean   : 61898              Mean   :154867  
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986              3rd Qu.:238724  
##  Max.   :5.0000   Max.   :23.0   Max.   :367030              Max.   :885282  
##                   NA's   :454    NA's   :445                 NA's   :464     
##  MSTATUS    SEX                      EDUCATION              JOB      
##  No :3267   F:4375   Bachelors            :2242   Blue Collar :1825  
##  Yes:4894   M:3786   High School          :2330   Clerical    :1271  
##                      Less than High School:1203   Professional:1117  
##                      Masters              :1658   Manager     : 988  
##                      PhD                  : 728   Lawyer      : 835  
##                                                   Student     : 712  
##                                                   (Other)     :1413  
##     TRAVTIME            CAR_USE        BLUEBOOK          TIF        
##  Min.   :  5.00   Commercial:3029   Min.   : 1500   Min.   : 1.000  
##  1st Qu.: 22.00   Private   :5132   1st Qu.: 9280   1st Qu.: 1.000  
##  Median : 33.00                     Median :14440   Median : 4.000  
##  Mean   : 33.49                     Mean   :15710   Mean   : 5.351  
##  3rd Qu.: 44.00                     3rd Qu.:20850   3rd Qu.: 7.000  
##  Max.   :142.00                     Max.   :69740   Max.   :25.000  
##                                                                     
##         CAR_TYPE    RED_CAR       OLDCLAIM        CLM_FREQ      REVOKED   
##  Minivan    :2145   no :5783   Min.   :    0   Min.   :0.0000   No :7161  
##  Panel Truck: 676   yes:2378   1st Qu.:    0   1st Qu.:0.0000   Yes:1000  
##  Pickup     :1389              Median :    0   Median :0.0000             
##  Sports Car : 907              Mean   : 4037   Mean   :0.7986             
##  SUV        :2294              3rd Qu.: 4636   3rd Qu.:2.0000             
##  Van        : 750              Max.   :57037   Max.   :5.0000             
##                                                                           
##     MVR_PTS          CAR_AGE                     URBANICITY  
##  Min.   : 0.000   Min.   : 1.000   Highly Rural/ Rural:1669  
##  1st Qu.: 0.000   1st Qu.: 1.000   Highly Urban/ Urban:6492  
##  Median : 1.000   Median : 8.000                             
##  Mean   : 1.696   Mean   : 8.329                             
##  3rd Qu.: 3.000   3rd Qu.:12.000                             
##  Max.   :13.000   Max.   :28.000                             
##                   NA's   :510

The binned dataset I used for the final project looked at 7 previously numeric variables and changed them to factors or binary features. I selected these because some of them had many NA values or zero as a model. Also, there were some such as the number of kids which didn’t offer a clear relationship of how an additional child past the first would indicate anything about the claimant. For the intervals, I used the quartiles and generated the new categories by using the cut function, mutating new variables, and dropping the old variables.

insurance_bins1 <- insurance_fix %>%
  mutate(CAR_AGE_BIN=cut(CAR_AGE, breaks=c(-Inf, 1, 3, 12, Inf), labels=c("New","Like New","Average", 'Old'))) %>% #four level fator for car age
  mutate(HOME_VAL_BIN=cut(HOME_VAL, breaks=c(-Inf, 0, 50000, 150000, 250000, Inf), labels=c("Zero", "$0-$50k", "$50k-$150k","$150k-$250k", 'Over $250k'))) %>% #bins for zero, plus four other price ranges
  mutate(HAS_HOME_KIDS = as.factor(case_when(HOMEKIDS == 0 ~ 'No kids', HOMEKIDS > 0 ~ ('Has kids')))) %>% #binary variable for whether family has kids
  mutate(HAS_KIDSDRIV = as.factor(case_when(KIDSDRIV == 0 ~ 'No kids driving', KIDSDRIV > 0 ~ 'Has kids driving'))) %>% #binary variable for whether family has kids driving
  mutate(OLDCLAIM_BIN =cut(OLDCLAIM, breaks=c(-Inf, 0, 3000, 6000, 9000, Inf), labels=c("Zero","$0-$3k", "$3k-$6k", "$6k-$9k",'Over $9k'))) %>% #bins for zero, plus four other price ranges based on quartiles
  mutate(TIF_BIN =cut(TIF, breaks=c(-Inf, 0, 1, 4, 7, Inf), labels=c("Zero","Less than 1 year", "1-4 years", "4-7 years",'Over 7 years'))) %>% #bins for zero, plus four other price ranges based on quartiles
  mutate(YOJ_BIN =cut(YOJ, breaks=c(-Inf, 0, 10, 15, Inf), labels=c("Zero","Less than 10 years", 'Between 10-15 years', 'Over 15 years'))) %>% #bins for zero, plus three other categories based on quartiles
  dplyr::select(-c(CAR_AGE, HOME_VAL, HOMEKIDS, KIDSDRIV, OLDCLAIM, TIF, YOJ)) #drop the binned features

summary(insurance_bins1)
##   TARGET_FLAG       TARGET_AMT          AGE            INCOME       PARENT1   
##  Min.   :0.0000   Min.   :     0   Min.   :16.00   Min.   :     0   No :7084  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:39.00   1st Qu.: 28097   Yes:1077  
##  Median :0.0000   Median :     0   Median :45.00   Median : 54028             
##  Mean   :0.2638   Mean   :  1504   Mean   :44.79   Mean   : 61898             
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:51.00   3rd Qu.: 85986             
##  Max.   :1.0000   Max.   :107586   Max.   :81.00   Max.   :367030             
##                                    NA's   :6       NA's   :445                
##  MSTATUS    SEX                      EDUCATION              JOB      
##  No :3267   F:4375   Bachelors            :2242   Blue Collar :1825  
##  Yes:4894   M:3786   High School          :2330   Clerical    :1271  
##                      Less than High School:1203   Professional:1117  
##                      Masters              :1658   Manager     : 988  
##                      PhD                  : 728   Lawyer      : 835  
##                                                   Student     : 712  
##                                                   (Other)     :1413  
##     TRAVTIME            CAR_USE        BLUEBOOK            CAR_TYPE   
##  Min.   :  5.00   Commercial:3029   Min.   : 1500   Minivan    :2145  
##  1st Qu.: 22.00   Private   :5132   1st Qu.: 9280   Panel Truck: 676  
##  Median : 33.00                     Median :14440   Pickup     :1389  
##  Mean   : 33.49                     Mean   :15710   Sports Car : 907  
##  3rd Qu.: 44.00                     3rd Qu.:20850   SUV        :2294  
##  Max.   :142.00                     Max.   :69740   Van        : 750  
##                                                                       
##  RED_CAR       CLM_FREQ      REVOKED       MVR_PTS      
##  no :5783   Min.   :0.0000   No :7161   Min.   : 0.000  
##  yes:2378   1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000  
##             Median :0.0000              Median : 1.000  
##             Mean   :0.7986              Mean   : 1.696  
##             3rd Qu.:2.0000              3rd Qu.: 3.000  
##             Max.   :5.0000              Max.   :13.000  
##                                                         
##                URBANICITY     CAR_AGE_BIN        HOME_VAL_BIN   HAS_HOME_KIDS 
##  Highly Rural/ Rural:1669   New     :1938   Zero       :2294   Has kids:2872  
##  Highly Urban/ Urban:6492   Like New:  66   $0-$50k    :   0   No kids :5289  
##                             Average :3775   $50k-$150k :1274                  
##                             Old     :1872   $150k-$250k:2445                  
##                             NA's    : 510   Over $250k :1684                  
##                                             NA's       : 464                  
##                                                                               
##            HAS_KIDSDRIV    OLDCLAIM_BIN              TIF_BIN    
##  Has kids driving: 981   Zero    :5009   Zero            :   0  
##  No kids driving :7180   $0-$3k  : 584   Less than 1 year:2533  
##                          $3k-$6k : 970   1-4 years       :1672  
##                          $6k-$9k : 720   4-7 years       :2013  
##                          Over $9k: 878   Over 7 years    :1943  
##                                                                 
##                                                                 
##                 YOJ_BIN    
##  Zero               : 625  
##  Less than 10 years :2313  
##  Between 10-15 years:4425  
##  Over 15 years      : 344  
##  NA's               : 454  
##                            
## 

If there were many variables to consider, or perhaps time constraints, then R has a bin function offered in the ‘OneR’ library. This automates the discretization of numeric data and offers additional arguments to control the number of bins, whether the new categories should be equally sized versus 1D k-means clustering. Lets see how this would look using the insurance dataset from before. I will be using the ‘cluster’ method as so many of the affected numerical values are skewed because of many zero-low values.

insurance_bins2 <- insurance_fix %>%
  mutate(CAR_AGE_BIN = bin(CAR_AGE, nbins = 4, labels=c("1st Quartile", "2nd Quartile", "3rd Quartile",  "4th Quartile"), na.omit = FALSE)) %>% #four level fator for car age
  mutate(HOME_VAL_BIN = bin(HOME_VAL, method = 'cluster', nbins = 5, labels=c("Nearly Zero", "1st Quartile", "2nd Quartile", "3rd Quartile",  "4th Quartile"), na.omit = FALSE)) %>% #five categories
  mutate(HAS_HOME_KIDS = bin(HOMEKIDS, method = 'cluster',  labels=c('Fewer Kids', 'More Kids'),  nbins = 2,na.omit = FALSE)) %>% #binary variable for whether family has kids
  mutate(HAS_KIDSDRIV = bin(KIDSDRIV, method = 'cluster', labels=c('Fewer Kids Driving', 'More Kids Driving'), nbins = 2, na.omit = FALSE)) %>%  #binary variable for whether family has kids driving
  mutate(OLDCLAIM_BIN =bin(OLDCLAIM, method = 'cluster',  nbins = 5, labels=c("Nearly Zero", "1st Quartile", "2nd Quartile", "3rd Quartile",  "4th Quartile"), na.omit = FALSE)) %>% #five divisions
  mutate(TIF_BIN =bin(TIF, nbins = 5, method = 'cluster', labels=c("Nearly Zero","1st Quartile", "2nd Quartile", "3rd Quartile",  "4th Quartile"), na.omit = FALSE)) %>%  #five categories
  mutate(YOJ_BIN =bin(YOJ,nbins = 5, method = 'cluster', labels=c("Nearly Zero", "1st Quartile", "2nd Quartile", "3rd Quartile",  "4th Quartile"), na.omit = FALSE)) %>% #five categories
  dplyr::select(-c(CAR_AGE, HOME_VAL, HOMEKIDS, KIDSDRIV, OLDCLAIM, TIF, YOJ)) #drop the binned features



summary(insurance_bins2)
##   TARGET_FLAG       TARGET_AMT          AGE            INCOME       PARENT1   
##  Min.   :0.0000   Min.   :     0   Min.   :16.00   Min.   :     0   No :7084  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:39.00   1st Qu.: 28097   Yes:1077  
##  Median :0.0000   Median :     0   Median :45.00   Median : 54028             
##  Mean   :0.2638   Mean   :  1504   Mean   :44.79   Mean   : 61898             
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:51.00   3rd Qu.: 85986             
##  Max.   :1.0000   Max.   :107586   Max.   :81.00   Max.   :367030             
##                                    NA's   :6       NA's   :445                
##  MSTATUS    SEX                      EDUCATION              JOB      
##  No :3267   F:4375   Bachelors            :2242   Blue Collar :1825  
##  Yes:4894   M:3786   High School          :2330   Clerical    :1271  
##                      Less than High School:1203   Professional:1117  
##                      Masters              :1658   Manager     : 988  
##                      PhD                  : 728   Lawyer      : 835  
##                                                   Student     : 712  
##                                                   (Other)     :1413  
##     TRAVTIME            CAR_USE        BLUEBOOK            CAR_TYPE   
##  Min.   :  5.00   Commercial:3029   Min.   : 1500   Minivan    :2145  
##  1st Qu.: 22.00   Private   :5132   1st Qu.: 9280   Panel Truck: 676  
##  Median : 33.00                     Median :14440   Pickup     :1389  
##  Mean   : 33.49                     Mean   :15710   Sports Car : 907  
##  3rd Qu.: 44.00                     3rd Qu.:20850   SUV        :2294  
##  Max.   :142.00                     Max.   :69740   Van        : 750  
##                                                                       
##  RED_CAR       CLM_FREQ      REVOKED       MVR_PTS      
##  no :5783   Min.   :0.0000   No :7161   Min.   : 0.000  
##  yes:2378   1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000  
##             Median :0.0000              Median : 1.000  
##             Mean   :0.7986              Mean   : 1.696  
##             3rd Qu.:2.0000              3rd Qu.: 3.000  
##             Max.   :5.0000              Max.   :13.000  
##                                                         
##                URBANICITY         CAR_AGE_BIN         HOME_VAL_BIN 
##  Highly Rural/ Rural:1669   1st Quartile:3419   Nearly Zero :2359  
##  Highly Urban/ Urban:6492   2nd Quartile:3027   1st Quartile:1772  
##                             3rd Quartile:1140   2nd Quartile:2026  
##                             4th Quartile:  65   3rd Quartile:1180  
##                             NA          : 510   4th Quartile: 360  
##                                                 NA          : 464  
##                                                                    
##     HAS_HOME_KIDS              HAS_KIDSDRIV        OLDCLAIM_BIN 
##  Fewer Kids:6191   Fewer Kids Driving:7816   Nearly Zero :5697  
##  More Kids :1970   More Kids Driving : 345   1st Quartile:1840  
##                                              2nd Quartile: 204  
##                                              3rd Quartile: 262  
##                                              4th Quartile: 158  
##                                                                 
##                                                                 
##          TIF_BIN             YOJ_BIN    
##  Nearly Zero :4205   Nearly Zero : 682  
##  1st Quartile:2073   1st Quartile:1507  
##  2nd Quartile:1570   2nd Quartile:3926  
##  3rd Quartile: 281   3rd Quartile:1452  
##  4th Quartile:  32   4th Quartile: 140  
##                      NA          : 454  
## 

Compared to the freehand bins, the automated function is much less likely to put values in the 4th quartile. For the binary variables as well, the ‘has’ columns are also more likely to be selective. To see if this selection has any impact on modeling, lets consider the perforance of a logistical model using both binning methods.

get_cv_performance(insurance_fix, glm(data = insurance_fix, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 857 204
##          1  74 163
##                                           
##                Accuracy : 0.7858          
##                  95% CI : (0.7625, 0.8079)
##     No Information Rate : 0.7173          
##     P-Value [Acc > NIR] : 1.049e-08       
##                                           
##                   Kappa : 0.4085          
##                                           
##  Mcnemar's Test P-Value : 1.019e-14       
##                                           
##             Sensitivity : 0.9205          
##             Specificity : 0.4441          
##          Pos Pred Value : 0.8077          
##          Neg Pred Value : 0.6878          
##              Prevalence : 0.7173          
##          Detection Rate : 0.6602          
##    Detection Prevalence : 0.8174          
##       Balanced Accuracy : 0.6823          
##                                           
##        'Positive' Class : 0               
## 
get_roc(insurance_fix, glm(data = insurance_fix, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

get_cv_performance(insurance_bins1, glm(data = insurance_bins1, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 849 196
##          1  65 176
##                                          
##                Accuracy : 0.797          
##                  95% CI : (0.774, 0.8187)
##     No Information Rate : 0.7107         
##     P-Value [Acc > NIR] : 1.020e-12      
##                                          
##                   Kappa : 0.4489         
##                                          
##  Mcnemar's Test P-Value : 8.499e-16      
##                                          
##             Sensitivity : 0.9289         
##             Specificity : 0.4731         
##          Pos Pred Value : 0.8124         
##          Neg Pred Value : 0.7303         
##              Prevalence : 0.7107         
##          Detection Rate : 0.6602         
##    Detection Prevalence : 0.8126         
##       Balanced Accuracy : 0.7010         
##                                          
##        'Positive' Class : 0              
## 
get_roc(insurance_bins1, glm(data = insurance_bins1, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

get_cv_performance(insurance_bins2, glm(data = insurance_bins2, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1067  211
##          1   85  173
##                                           
##                Accuracy : 0.8073          
##                  95% CI : (0.7867, 0.8267)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 5.820e-08       
##                                           
##                   Kappa : 0.423           
##                                           
##  Mcnemar's Test P-Value : 3.717e-13       
##                                           
##             Sensitivity : 0.9262          
##             Specificity : 0.4505          
##          Pos Pred Value : 0.8349          
##          Neg Pred Value : 0.6705          
##              Prevalence : 0.7500          
##          Detection Rate : 0.6947          
##    Detection Prevalence : 0.8320          
##       Balanced Accuracy : 0.6884          
##                                           
##        'Positive' Class : 0               
## 
get_roc(insurance_bins2, glm(data = insurance_bins2, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The results don’t appear significantly different, supporting the decision to automate the binning process. However, neither binning method appears to benefit compared to the original numerical variables.

Sources: https://www.rdocumentation.org/packages/OneR/versions/2.2/topics/bin