Setting bins in R is not particularly difficult, and because the process is usually reasonably standardized, it can easily be automated for many situations.

library(PKPDdatasets)
library(PKPDmisc)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
data <- capitalize_names(sd_oral_richpk)
sid_data <- data %>% filter(!duplicated(ID))

Given a traditional PK dataset (data) and a version filtered to one observation per individual (sid_data), which look like:

head(data) %>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE
1 0.00 5000 0.000000 56.09591 94.19649 Male Hispanic 5000
1 0.25 0 8.612809 56.09591 94.19649 Male Hispanic 5000
1 0.50 0 19.436818 56.09591 94.19649 Male Hispanic 5000
1 1.00 0 34.006699 56.09591 94.19649 Male Hispanic 5000
1 2.00 0 30.228800 56.09591 94.19649 Male Hispanic 5000
1 3.00 0 31.299610 56.09591 94.19649 Male Hispanic 5000
head(sid_data) %>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000
3 0 5000 0 50.74503 67.89058 Male Other 5000
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000

Here are a couple of situations in which bins are appropriate:

  1. Given a breakpoint weight of 70 kg, flag any individuals who are above the breakpoint weight.

Historically, this has been accomplished using ifelse statements like so:

binned_data <- sid_data %>% mutate(WTBIN = ifelse(WEIGHT > 70, 1, 0))
head(binned_data) %>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE WTBIN
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 1
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 0
3 0 5000 0 50.74503 67.89058 Male Other 5000 0
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 0
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 1
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 0

The same result can be obtained by binning with the set_bins function from PKPDmisc:

binned_data <- binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = 70))

head(binned_data)%>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE WTBIN WTBIN2
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 1 1
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 0 0
3 0 5000 0 50.74503 67.89058 Male Other 5000 0 0
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 0 0
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 1 1
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 0 0
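As a point of comparison, the single-breakpoint flag can also be sketched in base R without ifelse or PKPDmisc, since a logical comparison coerces directly to 0/1. The weights vector below is copied from the six rows shown above. The findInterval comparison only illustrates how boundary handling can differ between approaches in base R; it is not a statement about how set_bins treats a weight of exactly 70.

```r
# First six WEIGHT values from sid_data, copied from the table above
weights <- c(94.19649, 64.17279, 67.89058, 62.47354, 73.76395, 69.89467)

# A logical comparison coerced to integer gives the same 0/1 flag as ifelse()
flag_logical <- as.integer(weights > 70)

# findInterval() counts how many breakpoints are <= each value
flag_interval <- findInterval(weights, 70)

# The two agree for these rows, but differ for a weight of exactly 70:
# 70 > 70 is FALSE (flag 0), while findInterval(70, 70) counts the break (1)
boundary_logical <- as.integer(70 > 70)
boundary_interval <- findInterval(70, 70)
```

Whichever approach is used, it is worth checking a value that sits exactly on the breakpoint before relying on the flag.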

The ‘problem’ with the ifelse statement is that it does not scale well under more complex circumstances.

For example, given a more complex set of criteria such as weight less than 50, between 50 and 70, and 70 and above, the above example must be expanded to:

binned_data <- sid_data %>% mutate(WTBIN = ifelse(WEIGHT < 50, 0, 
                                              ifelse(WEIGHT >= 50 & WEIGHT < 70, 1,
                                              ifelse(WEIGHT >= 70, 2, NA))))
head(binned_data)%>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE WTBIN
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 2
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 1
3 0 5000 0 50.74503 67.89058 Male Other 5000 1
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 1
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 2
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 1

However, with the set_bins function and the breaks argument this is simple:

binned_data <- binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(50, 70)))
head(binned_data)%>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE WTBIN WTBIN2
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 2 2
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 1 1
3 0 5000 0 50.74503 67.89058 Male Other 5000 1 1
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 1 1
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 2 2
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 1 1
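For readers who want to see the idea without PKPDmisc installed, the three-level coding can be sketched with base R's findInterval, which takes a vector of breakpoints much like the breaks argument. This is only an illustration of break-based binning under that assumption, not a description of how set_bins is implemented.

```r
# First six WEIGHT values from sid_data, copied from the table above
weights <- c(94.19649, 64.17279, 67.89058, 62.47354, 73.76395, 69.89467)

# Nested ifelse() version from the text
bin_ifelse <- ifelse(weights < 50, 0,
              ifelse(weights >= 50 & weights < 70, 1,
              ifelse(weights >= 70, 2, NA)))

# findInterval() with two breaks: bins are [-Inf, 50), [50, 70), [70, Inf)
bin_interval <- findInterval(weights, c(50, 70))

# Adding a level only means extending the breaks vector,
# rather than nesting another ifelse()
bin_four <- findInterval(weights, c(50, 70, 90))
```

The point of the comparison is that the number of bins grows with the length of the breaks vector, while the call itself stays the same.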

set_bins can also report the range of each bin when quiet=FALSE is set:

binned_data %>% 
  mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(50, 70), quiet = FALSE)) %>% 
  head %>% kable()
## there were 3bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: -Inf - 50
## BIN: 1 range: 50 - 70
## BIN: 2 range: 70 - Inf
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE WTBIN WTBIN2
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 2 2
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 1 1
3 0 5000 0 50.74503 67.89058 Male Other 5000 1 1
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 1 1
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 2 2
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 1 1

As you can see, by default the first and last bins are automatically extended to -Inf and Inf, so they capture everything below the smallest break and above the largest. This can be controlled using lower_bound and upper_bound: the lower and upper bounds can be set explicitly, and any values exceeding those bounds will be returned as NA. This might be useful if you are programmatically calculating the break points and already have a lower bound specified.

For example, by manually specifying a lower bound of 0 in the breaks:

binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(0, 50, 70), quiet = FALSE)) %>% head %>% kable()
## there were 4bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: -Inf - 0
## BIN: 1 range: 0 - 50
## BIN: 2 range: 50 - 70
## BIN: 3 range: 70 - Inf
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE WTBIN WTBIN2
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 2 3
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 1 2
3 0 5000 0 50.74503 67.89058 Male Other 5000 1 2
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 1 2
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 2 3
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 1 2

This adds an unnecessary bin from -Inf to 0, which can be turned off via lower_bound=NULL:

binned_data %>% 
  mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(0, 50, 70), lower_bound = NULL, quiet = FALSE)) %>% 
  head %>% kable()
## there were 3bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: 0 - 50
## BIN: 1 range: 50 - 70
## BIN: 2 range: 70 - Inf
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE WTBIN WTBIN2
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 2 2
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 1 1
3 0 5000 0 50.74503 67.89058 Male Other 5000 1 1
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 1 1
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 2 2
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 1 1
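The idea of returning NA for values that fall outside explicit bounds can be sketched in base R as follows. bin_with_bounds is a hypothetical helper written for this illustration and is not part of PKPDmisc; treating the bounds as inclusive is an assumption made here, not something the text above confirms for set_bins.

```r
# Hypothetical helper (not part of PKPDmisc): bin x by breaks, and return NA
# for any value below `lower` or above `upper` (bounds treated as inclusive)
bin_with_bounds <- function(x, breaks, lower = -Inf, upper = Inf) {
  bins <- findInterval(x, sort(breaks))
  bins[x < lower | x > upper] <- NA_integer_
  bins
}

# Made-up weights, including two implausible values outside 0 to 200
weights <- c(-5, 45, 50, 64.17279, 70, 94.19649, 250)
bounded <- bin_with_bounds(weights, breaks = c(50, 70), lower = 0, upper = 200)
```

With these inputs the first and last values come back as NA, which makes out-of-range records easy to spot rather than silently folding them into the end bins.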

Getting away from quantile and cut

A common way of stratifying is to use quantile and the cut function to establish bins. The cut function, however, has poor defaults, such as not including the lowest value, and it returns a factor column whose labels describe the range of each cut, which must be further coerced to arrive at a numeric bin.

sid_data %>% mutate(AGECUTS = cut(AGE, breaks = quantile(AGE), include.lowest = TRUE),
                    AGEBINS = as.numeric(AGECUTS)) %>% head %>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE AGECUTS AGEBINS
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 (54.5,60.6] 4
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 [38.3,46.7] 1
3 0 5000 0 50.74503 67.89058 Male Other 5000 (46.7,50.8] 2
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 (46.7,50.8] 2
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 (50.8,54.5] 3
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 [38.3,46.7] 1

set_bins simplifies this again, especially as the breaks argument defaults to the quantiles of the input:

sid_data %>% mutate(AGEBINS = set_bins(AGE)) %>% head %>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE AGEBINS
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 4
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 1
3 0 5000 0 50.74503 67.89058 Male Other 5000 2
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 2
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 3
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 1

For more fine-tuned levels, we can specify additional breaks:

sid_data %>% mutate(AGEBINS = set_bins(AGE, breaks = 
                                         quantile(AGE, seq(0, 1, length.out = 10)),
                                       quiet = FALSE)) %>% head %>% kable()
## there were 11bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: -Inf - 38.29745189
## BIN: 1 range: 38.29745189 - 43.5622634533333
## BIN: 2 range: 43.5622634533333 - 46.38955
## BIN: 3 range: 46.38955 - 48.0166795233333
## BIN: 4 range: 48.0166795233333 - 50.5045815844444
## BIN: 5 range: 50.5045815844444 - 51.5338098388889
## BIN: 6 range: 51.5338098388889 - 53.32920805
## BIN: 7 range: 53.32920805 - 54.7564958733333
## BIN: 8 range: 54.7564958733333 - 56.6070799066667
## BIN: 9 range: 56.6070799066667 - 60.55971096
## BIN: 10 range: 60.55971096 - Inf
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE AGEBINS
1 0 5000 0 56.09591 94.19649 Male Hispanic 5000 8
2 0 5000 0 45.07672 64.17279 Male Caucasian 5000 2
3 0 5000 0 50.74503 67.89058 Male Other 5000 5
4 0 5000 0 46.87347 62.47354 Female Caucasian 5000 3
5 0 5000 0 50.86722 73.76395 Female Caucasian 5000 5
6 0 5000 0 40.77630 69.89467 Male Hispanic 5000 1
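The count of 11 bins in the output above follows from the arithmetic of the probability grid, which can be checked in base R. Whether deciles were the intent is not stated in the text, so the second grid below is offered only as an assumption about what a reader might want instead.

```r
# Probability grid used above: 10 evenly spaced points from 0 to 1
probs_used <- seq(0, 1, length.out = 10)
n_breaks <- length(probs_used)

# 10 breaks enclose 9 inner intervals; the -Inf and Inf end bins make 11
n_inner <- n_breaks - 1
n_total <- n_inner + 2

# For steps of exactly 0.1 (deciles), 11 points are needed
probs_deciles <- seq(0, 1, length.out = 11)
decile_step <- diff(probs_deciles)[1]
```

Here probs_used steps by 1/9 rather than 1/10, which is why the breaks in the printed ranges are ninths of the distribution rather than deciles.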

BETWEEN

A special case arises when there is a desire to test whether a value falls in some inclusive range, such as a specified therapeutic window. set_bins can also handle this situation with the between argument.

In this case, a range can be specified, and all values inside the range (inclusive) will be assigned to bin 1, with all values outside in either bin 0 (below range) or bin 2 (above range).
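The three-way coding described here can be sketched in base R. window_bin is a hypothetical helper written for this illustration; it mirrors the 0/1/2 coding and the inclusive range described above, but it is not the PKPDmisc implementation.

```r
# Hypothetical helper (not part of PKPDmisc):
# 0 = below the window, 1 = inside [lower, upper], 2 = above the window
window_bin <- function(x, lower, upper) {
  as.integer(x >= lower) + as.integer(x > upper)
}

# A few concentrations from the dataset plus the two window edges (20 and 100)
conc <- c(8.612809, 19.436818, 20, 34.006699, 100, 100.1783)
tw <- window_bin(conc, lower = 20, upper = 100)
```

Both edge values are coded as 1 because the range is inclusive, while 100.1783 falls just above the window and is coded as 2.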

For example, to understand the non-zero concentration measurements given a therapeutic window of 20-100, the following can be done:

tw_data <- data %>% filter(CONC > 0) %>% mutate(TW = set_bins(CONC, between=c(20, 100)))

head(tw_data) %>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE TW
1 0.25 0 8.612809 56.09591 94.19649 Male Hispanic 5000 0
1 0.50 0 19.436818 56.09591 94.19649 Male Hispanic 5000 0
1 1.00 0 34.006699 56.09591 94.19649 Male Hispanic 5000 1
1 2.00 0 30.228800 56.09591 94.19649 Male Hispanic 5000 1
1 3.00 0 31.299610 56.09591 94.19649 Male Hispanic 5000 1
1 4.00 0 24.979117 56.09591 94.19649 Male Hispanic 5000 1

The resulting bins can then easily be summarized or filtered:

tw_data %>% group_by(TW) %>% summarize(n= n()) %>% kable()
TW n
0 163
1 378
2 9
tw_data %>% filter(TW ==2) %>% kable()
ID TIME AMT CONC AGE WEIGHT GENDER RACE DOSE TW
2 2 0 100.1783 45.07672 64.17279 Male Caucasian 5000 2
16 1 0 104.6390 54.32461 75.68308 Female Caucasian 5000 2
16 2 0 101.3737 54.32461 75.68308 Female Caucasian 5000 2
26 2 0 101.9166 41.26571 56.59549 Female Black 5000 2
27 2 0 116.1278 53.45380 71.09299 Male Asian 5000 2
36 1 0 118.8456 60.55971 81.15454 Female Hispanic 5000 2
36 2 0 112.7284 60.55971 81.15454 Female Hispanic 5000 2
36 3 0 117.5580 60.55971 81.15454 Female Hispanic 5000 2
36 4 0 130.6603 60.55971 81.15454 Female Hispanic 5000 2
devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Los_Angeles         
##  date     2015-11-30
## Packages ------------------------------------------------------------------
##  package      * version date       source                                
##  assertthat     0.1     2013-12-06 CRAN (R 3.2.0)                        
##  DBI            0.3.1   2014-09-24 CRAN (R 3.2.0)                        
##  devtools       1.9.1   2015-09-11 CRAN (R 3.2.0)                        
##  digest         0.6.8   2014-12-31 CRAN (R 3.2.0)                        
##  dplyr        * 0.4.3   2015-09-01 CRAN (R 3.2.0)                        
##  evaluate       0.8     2015-09-18 CRAN (R 3.2.0)                        
##  highr          0.5.1   2015-09-18 CRAN (R 3.2.0)                        
##  htmltools      0.2.6   2014-09-08 CRAN (R 3.2.0)                        
##  knitr        * 1.11    2015-08-14 CRAN (R 3.2.2)                        
##  lazyeval       0.1.10  2015-01-02 CRAN (R 3.2.0)                        
##  magrittr       1.5     2014-11-22 CRAN (R 3.2.0)                        
##  memoise        0.2.1   2014-04-22 CRAN (R 3.2.0)                        
##  PKPDdatasets * 0.1.0   2015-11-11 Github (dpastoor/PKPDdatasets@52880fa)
##  PKPDmisc     * 0.4     2015-11-11 Github (dpastoor/PKPDmisc@a0680b9)    
##  R6             2.1.1   2015-08-19 CRAN (R 3.2.0)                        
##  Rcpp           0.12.1  2015-09-10 CRAN (R 3.2.0)                        
##  rmarkdown      0.8.1   2015-10-10 CRAN (R 3.2.2)                        
##  stringi        1.0-1   2015-10-22 CRAN (R 3.2.0)                        
##  stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)                        
##  yaml           2.1.13  2014-06-12 CRAN (R 3.2.0)