Having detailed numeric data for an explanatory variable offers continuous and quantitative information, and allows the usual avenues of generating summary statistics to understand the variable better. However, one technique for understanding or simplifying a model takes a numeric variable and breaks it into a handful of ranges to consider, called binning or discretizing. This technique can offer some intelligibility to a variable by breaking it down into two or more categorical variables. However, binning has the potential to give up a lot of information about a dataset by eliminating the prospect of a per-unit quantified effect that would be gained from the regression coefficient.
Binning is at its most useful when there are already some apparent categorical features of the dataset: the variable in question tends to cluster around several values, there’s a known stepwise relationship about the variable in question, or there are other categorical variables in the dataset already. Also, when the variable has a substantial amount of missing values, this too can be made into a distinct category along with numerical ranges, and may add information if the values are not missing at random. Finally, binning effectively removes outliers by grouping it with nearby values.
In the worst-case scenario, incorrect selection of the numeric ranges for bins may attenuate the true correlation between two variables by limiting the amount of information gained from the variable. Pivotal to developing a discretized variable is selecting the ranges used. This could be done using several methods. The simplest is to assign ranges according to quantiles over the data range. From there, the exact boundaries could be changed to reflect the underlying distribution.
For the example today, I’m going to look at the dataset used for HW4, which looked at predicting insurance claims based off characteristics of the driver and car.
The original dataset needed some cleaning up, but ended up looking like this.
summary(insurance_fix)
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. :0.0000 Min. : 0 Min. :0.0000 Min. :16.00
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
## Median :0.0000 Median : 0 Median :0.0000 Median :45.00
## Mean :0.2638 Mean : 1504 Mean :0.1711 Mean :44.79
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. :1.0000 Max. :107586 Max. :4.0000 Max. :81.00
## NA's :6
## HOMEKIDS YOJ INCOME PARENT1 HOME_VAL
## Min. :0.0000 Min. : 0.0 Min. : 0 No :7084 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097 Yes:1077 1st Qu.: 0
## Median :0.0000 Median :11.0 Median : 54028 Median :161160
## Mean :0.7212 Mean :10.5 Mean : 61898 Mean :154867
## 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986 3rd Qu.:238724
## Max. :5.0000 Max. :23.0 Max. :367030 Max. :885282
## NA's :454 NA's :445 NA's :464
## MSTATUS SEX EDUCATION JOB
## No :3267 F:4375 Bachelors :2242 Blue Collar :1825
## Yes:4894 M:3786 High School :2330 Clerical :1271
## Less than High School:1203 Professional:1117
## Masters :1658 Manager : 988
## PhD : 728 Lawyer : 835
## Student : 712
## (Other) :1413
## TRAVTIME CAR_USE BLUEBOOK TIF
## Min. : 5.00 Commercial:3029 Min. : 1500 Min. : 1.000
## 1st Qu.: 22.00 Private :5132 1st Qu.: 9280 1st Qu.: 1.000
## Median : 33.00 Median :14440 Median : 4.000
## Mean : 33.49 Mean :15710 Mean : 5.351
## 3rd Qu.: 44.00 3rd Qu.:20850 3rd Qu.: 7.000
## Max. :142.00 Max. :69740 Max. :25.000
##
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED
## Minivan :2145 no :5783 Min. : 0 Min. :0.0000 No :7161
## Panel Truck: 676 yes:2378 1st Qu.: 0 1st Qu.:0.0000 Yes:1000
## Pickup :1389 Median : 0 Median :0.0000
## Sports Car : 907 Mean : 4037 Mean :0.7986
## SUV :2294 3rd Qu.: 4636 3rd Qu.:2.0000
## Van : 750 Max. :57037 Max. :5.0000
##
## MVR_PTS CAR_AGE URBANICITY
## Min. : 0.000 Min. : 1.000 Highly Rural/ Rural:1669
## 1st Qu.: 0.000 1st Qu.: 1.000 Highly Urban/ Urban:6492
## Median : 1.000 Median : 8.000
## Mean : 1.696 Mean : 8.329
## 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :13.000 Max. :28.000
## NA's :510
The binned dataset I used for the final project looked at 7 previously numeric variables and changed them to factors or binary features. I selected these because some of them had many NA values or zero as a model. Also, there were some such as the number of kids which didn’t offer a clear relationship of how an additional child past the first would indicate anything about the claimant. For the intervals, I used the quartiles and generated the new categories by using the cut function, mutating new variables, and dropping the old variables.
insurance_bins1 <- insurance_fix %>%
mutate(CAR_AGE_BIN=cut(CAR_AGE, breaks=c(-Inf, 1, 3, 12, Inf), labels=c("New","Like New","Average", 'Old'))) %>% #four level fator for car age
mutate(HOME_VAL_BIN=cut(HOME_VAL, breaks=c(-Inf, 0, 50000, 150000, 250000, Inf), labels=c("Zero", "$0-$50k", "$50k-$150k","$150k-$250k", 'Over $250k'))) %>% #bins for zero, plus four other price ranges
mutate(HAS_HOME_KIDS = as.factor(case_when(HOMEKIDS == 0 ~ 'No kids', HOMEKIDS > 0 ~ ('Has kids')))) %>% #binary variable for whether family has kids
mutate(HAS_KIDSDRIV = as.factor(case_when(KIDSDRIV == 0 ~ 'No kids driving', KIDSDRIV > 0 ~ 'Has kids driving'))) %>% #binary variable for whether family has kids driving
mutate(OLDCLAIM_BIN =cut(OLDCLAIM, breaks=c(-Inf, 0, 3000, 6000, 9000, Inf), labels=c("Zero","$0-$3k", "$3k-$6k", "$6k-$9k",'Over $9k'))) %>% #bins for zero, plus four other price ranges based on quartiles
mutate(TIF_BIN =cut(TIF, breaks=c(-Inf, 0, 1, 4, 7, Inf), labels=c("Zero","Less than 1 year", "1-4 years", "4-7 years",'Over 7 years'))) %>% #bins for zero, plus four other price ranges based on quartiles
mutate(YOJ_BIN =cut(YOJ, breaks=c(-Inf, 0, 10, 15, Inf), labels=c("Zero","Less than 10 years", 'Between 10-15 years', 'Over 15 years'))) %>% #bins for zero, plus three other categories based on quartiles
dplyr::select(-c(CAR_AGE, HOME_VAL, HOMEKIDS, KIDSDRIV, OLDCLAIM, TIF, YOJ)) #drop the binned features
summary(insurance_bins1)
## TARGET_FLAG TARGET_AMT AGE INCOME PARENT1
## Min. :0.0000 Min. : 0 Min. :16.00 Min. : 0 No :7084
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:39.00 1st Qu.: 28097 Yes:1077
## Median :0.0000 Median : 0 Median :45.00 Median : 54028
## Mean :0.2638 Mean : 1504 Mean :44.79 Mean : 61898
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:51.00 3rd Qu.: 85986
## Max. :1.0000 Max. :107586 Max. :81.00 Max. :367030
## NA's :6 NA's :445
## MSTATUS SEX EDUCATION JOB
## No :3267 F:4375 Bachelors :2242 Blue Collar :1825
## Yes:4894 M:3786 High School :2330 Clerical :1271
## Less than High School:1203 Professional:1117
## Masters :1658 Manager : 988
## PhD : 728 Lawyer : 835
## Student : 712
## (Other) :1413
## TRAVTIME CAR_USE BLUEBOOK CAR_TYPE
## Min. : 5.00 Commercial:3029 Min. : 1500 Minivan :2145
## 1st Qu.: 22.00 Private :5132 1st Qu.: 9280 Panel Truck: 676
## Median : 33.00 Median :14440 Pickup :1389
## Mean : 33.49 Mean :15710 Sports Car : 907
## 3rd Qu.: 44.00 3rd Qu.:20850 SUV :2294
## Max. :142.00 Max. :69740 Van : 750
##
## RED_CAR CLM_FREQ REVOKED MVR_PTS
## no :5783 Min. :0.0000 No :7161 Min. : 0.000
## yes:2378 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000
## Median :0.0000 Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## URBANICITY CAR_AGE_BIN HOME_VAL_BIN HAS_HOME_KIDS
## Highly Rural/ Rural:1669 New :1938 Zero :2294 Has kids:2872
## Highly Urban/ Urban:6492 Like New: 66 $0-$50k : 0 No kids :5289
## Average :3775 $50k-$150k :1274
## Old :1872 $150k-$250k:2445
## NA's : 510 Over $250k :1684
## NA's : 464
##
## HAS_KIDSDRIV OLDCLAIM_BIN TIF_BIN
## Has kids driving: 981 Zero :5009 Zero : 0
## No kids driving :7180 $0-$3k : 584 Less than 1 year:2533
## $3k-$6k : 970 1-4 years :1672
## $6k-$9k : 720 4-7 years :2013
## Over $9k: 878 Over 7 years :1943
##
##
## YOJ_BIN
## Zero : 625
## Less than 10 years :2313
## Between 10-15 years:4425
## Over 15 years : 344
## NA's : 454
##
##
If there were many variables to consider, or perhaps time constraints, then R has a bin function offered in the ‘OneR’ library. This automates the discretization of numeric data and offers additional arguments to control the number of bins, whether the new categories should be equally sized versus 1D k-means clustering. Lets see how this would look using the insurance dataset from before. I will be using the ‘cluster’ method as so many of the affected numerical values are skewed because of many zero-low values.
insurance_bins2 <- insurance_fix %>%
mutate(CAR_AGE_BIN = bin(CAR_AGE, nbins = 4, labels=c("1st Quartile", "2nd Quartile", "3rd Quartile", "4th Quartile"), na.omit = FALSE)) %>% #four level fator for car age
mutate(HOME_VAL_BIN = bin(HOME_VAL, method = 'cluster', nbins = 5, labels=c("Nearly Zero", "1st Quartile", "2nd Quartile", "3rd Quartile", "4th Quartile"), na.omit = FALSE)) %>% #five categories
mutate(HAS_HOME_KIDS = bin(HOMEKIDS, method = 'cluster', labels=c('Fewer Kids', 'More Kids'), nbins = 2,na.omit = FALSE)) %>% #binary variable for whether family has kids
mutate(HAS_KIDSDRIV = bin(KIDSDRIV, method = 'cluster', labels=c('Fewer Kids Driving', 'More Kids Driving'), nbins = 2, na.omit = FALSE)) %>% #binary variable for whether family has kids driving
mutate(OLDCLAIM_BIN =bin(OLDCLAIM, method = 'cluster', nbins = 5, labels=c("Nearly Zero", "1st Quartile", "2nd Quartile", "3rd Quartile", "4th Quartile"), na.omit = FALSE)) %>% #five divisions
mutate(TIF_BIN =bin(TIF, nbins = 5, method = 'cluster', labels=c("Nearly Zero","1st Quartile", "2nd Quartile", "3rd Quartile", "4th Quartile"), na.omit = FALSE)) %>% #five categories
mutate(YOJ_BIN =bin(YOJ,nbins = 5, method = 'cluster', labels=c("Nearly Zero", "1st Quartile", "2nd Quartile", "3rd Quartile", "4th Quartile"), na.omit = FALSE)) %>% #five categories
dplyr::select(-c(CAR_AGE, HOME_VAL, HOMEKIDS, KIDSDRIV, OLDCLAIM, TIF, YOJ)) #drop the binned features
summary(insurance_bins2)
## TARGET_FLAG TARGET_AMT AGE INCOME PARENT1
## Min. :0.0000 Min. : 0 Min. :16.00 Min. : 0 No :7084
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:39.00 1st Qu.: 28097 Yes:1077
## Median :0.0000 Median : 0 Median :45.00 Median : 54028
## Mean :0.2638 Mean : 1504 Mean :44.79 Mean : 61898
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:51.00 3rd Qu.: 85986
## Max. :1.0000 Max. :107586 Max. :81.00 Max. :367030
## NA's :6 NA's :445
## MSTATUS SEX EDUCATION JOB
## No :3267 F:4375 Bachelors :2242 Blue Collar :1825
## Yes:4894 M:3786 High School :2330 Clerical :1271
## Less than High School:1203 Professional:1117
## Masters :1658 Manager : 988
## PhD : 728 Lawyer : 835
## Student : 712
## (Other) :1413
## TRAVTIME CAR_USE BLUEBOOK CAR_TYPE
## Min. : 5.00 Commercial:3029 Min. : 1500 Minivan :2145
## 1st Qu.: 22.00 Private :5132 1st Qu.: 9280 Panel Truck: 676
## Median : 33.00 Median :14440 Pickup :1389
## Mean : 33.49 Mean :15710 Sports Car : 907
## 3rd Qu.: 44.00 3rd Qu.:20850 SUV :2294
## Max. :142.00 Max. :69740 Van : 750
##
## RED_CAR CLM_FREQ REVOKED MVR_PTS
## no :5783 Min. :0.0000 No :7161 Min. : 0.000
## yes:2378 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000
## Median :0.0000 Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## URBANICITY CAR_AGE_BIN HOME_VAL_BIN
## Highly Rural/ Rural:1669 1st Quartile:3419 Nearly Zero :2359
## Highly Urban/ Urban:6492 2nd Quartile:3027 1st Quartile:1772
## 3rd Quartile:1140 2nd Quartile:2026
## 4th Quartile: 65 3rd Quartile:1180
## NA : 510 4th Quartile: 360
## NA : 464
##
## HAS_HOME_KIDS HAS_KIDSDRIV OLDCLAIM_BIN
## Fewer Kids:6191 Fewer Kids Driving:7816 Nearly Zero :5697
## More Kids :1970 More Kids Driving : 345 1st Quartile:1840
## 2nd Quartile: 204
## 3rd Quartile: 262
## 4th Quartile: 158
##
##
## TIF_BIN YOJ_BIN
## Nearly Zero :4205 Nearly Zero : 682
## 1st Quartile:2073 1st Quartile:1507
## 2nd Quartile:1570 2nd Quartile:3926
## 3rd Quartile: 281 3rd Quartile:1452
## 4th Quartile: 32 4th Quartile: 140
## NA : 454
##
Compared to the freehand bins, the automated function is much less likely to put values in the 4th quartile. For the binary variables as well, the ‘has’ columns are also more likely to be selective. To see if this selection has any impact on modeling, lets consider the perforance of a logistical model using both binning methods.
get_cv_performance(insurance_fix, glm(data = insurance_fix, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 857 204
## 1 74 163
##
## Accuracy : 0.7858
## 95% CI : (0.7625, 0.8079)
## No Information Rate : 0.7173
## P-Value [Acc > NIR] : 1.049e-08
##
## Kappa : 0.4085
##
## Mcnemar's Test P-Value : 1.019e-14
##
## Sensitivity : 0.9205
## Specificity : 0.4441
## Pos Pred Value : 0.8077
## Neg Pred Value : 0.6878
## Prevalence : 0.7173
## Detection Rate : 0.6602
## Detection Prevalence : 0.8174
## Balanced Accuracy : 0.6823
##
## 'Positive' Class : 0
##
get_roc(insurance_fix, glm(data = insurance_fix, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
get_cv_performance(insurance_bins1, glm(data = insurance_bins1, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 849 196
## 1 65 176
##
## Accuracy : 0.797
## 95% CI : (0.774, 0.8187)
## No Information Rate : 0.7107
## P-Value [Acc > NIR] : 1.020e-12
##
## Kappa : 0.4489
##
## Mcnemar's Test P-Value : 8.499e-16
##
## Sensitivity : 0.9289
## Specificity : 0.4731
## Pos Pred Value : 0.8124
## Neg Pred Value : 0.7303
## Prevalence : 0.7107
## Detection Rate : 0.6602
## Detection Prevalence : 0.8126
## Balanced Accuracy : 0.7010
##
## 'Positive' Class : 0
##
get_roc(insurance_bins1, glm(data = insurance_bins1, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
get_cv_performance(insurance_bins2, glm(data = insurance_bins2, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1067 211
## 1 85 173
##
## Accuracy : 0.8073
## 95% CI : (0.7867, 0.8267)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 5.820e-08
##
## Kappa : 0.423
##
## Mcnemar's Test P-Value : 3.717e-13
##
## Sensitivity : 0.9262
## Specificity : 0.4505
## Pos Pred Value : 0.8349
## Neg Pred Value : 0.6705
## Prevalence : 0.7500
## Detection Rate : 0.6947
## Detection Prevalence : 0.8320
## Balanced Accuracy : 0.6884
##
## 'Positive' Class : 0
##
get_roc(insurance_bins2, glm(data = insurance_bins2, formula = TARGET_FLAG ~.-TARGET_AMT, family = 'binomial'))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
The results don’t appear significantly different, supporting the decision to automate the binning process. However, neither binning method appears to benefit compared to the original numerical variables.
Sources: https://www.rdocumentation.org/packages/OneR/versions/2.2/topics/bin