setting-bins

Setting bins in R is not particularly difficult, and because the task is usually fairly standardized it can be easily automated in many situations.

library(PKPDdatasets)
library(PKPDmisc)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)

data <- capitalize_names(sd_oral_richpk)
sid_data <- data %>% filter(!duplicated(ID))

Given a traditional PK dataset (data) and a version filtered to one row per individual (sid_data), which look like:

head(data) %>% kable()

ID	TIME	AMT	CONC	AGE	WEIGHT	GENDER	RACE	DOSE
1	0.00	5000	0.000000	56.09591	94.19649	Male	Hispanic	5000
1	0.25	0	8.612809	56.09591	94.19649	Male	Hispanic	5000
1	0.50	0	19.436818	56.09591	94.19649	Male	Hispanic	5000
1	1.00	0	34.006699	56.09591	94.19649	Male	Hispanic	5000
1	2.00	0	30.228800	56.09591	94.19649	Male	Hispanic	5000
1	3.00	0	31.299610	56.09591	94.19649	Male	Hispanic	5000

head(sid_data) %>% kable()

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE
1	5000	56.09591	94.19649	Male	Hispanic	5000
2	5000	45.07672	64.17279	Male	Caucasian	5000
3	5000	50.74503	67.89058	Male	Other	5000
4	5000	46.87347	62.47354	Female	Caucasian	5000
5	5000	50.86722	73.76395	Female	Caucasian	5000
6	5000	40.77630	69.89467	Male	Hispanic	5000

Here are a couple of situations in which bins are appropriate.

Given a breakpoint weight of 70 kg, flag any individuals who are above the breakpoint weight.

Historically, this has been accomplished using ifelse statements like so:

binned_data <- sid_data %>% mutate(WTBIN = ifelse(WEIGHT > 70, 1, 0))
head(binned_data) %>% kable()

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	WTBIN
1	5000	56.09591	94.19649	Male	Hispanic	5000	1
2	5000	45.07672	64.17279	Male	Caucasian	5000	0
3	5000	50.74503	67.89058	Male	Other	5000	0
4	5000	46.87347	62.47354	Female	Caucasian	5000	0
5	5000	50.86722	73.76395	Female	Caucasian	5000	1
6	5000	40.77630	69.89467	Male	Hispanic	5000	0

The same result can be obtained with the set_bins function from PKPDmisc:

binned_data <- binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = 70))

head(binned_data)%>% kable()

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	WTBIN	WTBIN2
1	5000	56.09591	94.19649	Male	Hispanic	5000	1	1
2	5000	45.07672	64.17279	Male	Caucasian	5000	0	0
3	5000	50.74503	67.89058	Male	Other	5000	0	0
4	5000	46.87347	62.47354	Female	Caucasian	5000	0	0
5	5000	50.86722	73.76395	Female	Caucasian	5000	1	1
6	5000	40.77630	69.89467	Male	Hispanic	5000	0	0
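
As a rough base-R analogue (a sketch, not the PKPDmisc implementation), findInterval can produce the same 0/1 flag. The weights below are the six values from the table above, rounded. The left.open = TRUE argument (available in R 3.3.0 and later) is an assumption of this sketch, chosen so that a weight of exactly 70 stays in bin 0, matching WEIGHT > 70; how set_bins treats a value that falls exactly on a break is not shown in this document.

```r
# Six weights from the table above, rounded to two decimals
weight <- c(94.20, 64.17, 67.89, 62.47, 73.76, 69.89)

# ifelse version: 1 if strictly above 70 kg, otherwise 0
wtbin_ifelse <- ifelse(weight > 70, 1, 0)

# findInterval version: with left.open = TRUE the intervals are (a, b],
# so the result counts how many breaks lie strictly below each value
wtbin_interval <- findInterval(weight, 70, left.open = TRUE)

wtbin_ifelse
wtbin_interval
```

Both vectors flag the first and fifth individuals, as in the WTBIN and WTBIN2 columns above.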

The ‘problem’ with the ifelse approach is that it does not scale well to more complex criteria.

For example, given a more complex set of criteria such as weight less than 50, between 50 and 70, and 70 and above, the above example must be expanded to:

binned_data <- sid_data %>% mutate(WTBIN = ifelse(WEIGHT < 50, 0, 
                                              ifelse(WEIGHT >= 50 & WEIGHT < 70, 1,
                                              ifelse(WEIGHT >= 70, 2, NA))))
head(binned_data)%>% kable()

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	WTBIN
1	5000	56.09591	94.19649	Male	Hispanic	5000	2
2	5000	45.07672	64.17279	Male	Caucasian	5000	1
3	5000	50.74503	67.89058	Male	Other	5000	1
4	5000	46.87347	62.47354	Female	Caucasian	5000	1
5	5000	50.86722	73.76395	Female	Caucasian	5000	2
6	5000	40.77630	69.89467	Male	Hispanic	5000	1

However, with the set_bins function and the breaks argument this is simple:

binned_data <- binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(50, 70)))
head(binned_data)%>% kable()

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	WTBIN	WTBIN2
1	5000	56.09591	94.19649	Male	Hispanic	5000	2	2
2	5000	45.07672	64.17279	Male	Caucasian	5000	1	1
3	5000	50.74503	67.89058	Male	Other	5000	1	1
4	5000	46.87347	62.47354	Female	Caucasian	5000	1	1
5	5000	50.86722	73.76395	Female	Caucasian	5000	2	2
6	5000	40.77630	69.89467	Male	Hispanic	5000	1	1
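
To illustrate why a vector of breaks scales better than nested ifelse calls, here is a minimal base-R sketch using findInterval in place of set_bins. The last three weights (45, 50, 70) are hypothetical values added to exercise the lowest bin and the break points themselves; the boundary handling shown is that of findInterval, which may or may not match set_bins.

```r
# First six weights are from the table above (rounded);
# 45, 50 and 70 are hypothetical edge cases
weight <- c(94.20, 64.17, 67.89, 62.47, 73.76, 69.89, 45, 50, 70)
breaks <- c(50, 70)

# Nested ifelse: one extra level of nesting per additional break
nested <- ifelse(weight < 50, 0, ifelse(weight < 70, 1, 2))

# findInterval: intervals are [a, b), matching the >= / < logic above
binned <- findInterval(weight, breaks)

# Adding a further break only means extending the vector
binned_4 <- findInterval(weight, c(50, 70, 90))

all(nested == binned)
binned_4
```

The nested version needs a new ifelse for every break, while the interval version only needs a longer breaks vector.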

You can also print the range covered by each bin by setting quiet = FALSE:

 binned_data %>% 
  mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(50, 70), quiet=F)) %>% 
  head %>% kable()

## there were 3bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: -Inf - 50
## BIN: 1 range: 50 - 70
## BIN: 2 range: 70 - Inf

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	WTBIN	WTBIN2
1	5000	56.09591	94.19649	Male	Hispanic	5000	2	2
2	5000	45.07672	64.17279	Male	Caucasian	5000	1	1
3	5000	50.74503	67.89058	Male	Other	5000	1	1
4	5000	46.87347	62.47354	Female	Caucasian	5000	1	1
5	5000	50.86722	73.76395	Female	Caucasian	5000	2	2
6	5000	40.77630	69.89467	Male	Hispanic	5000	1	1

As shown, by default the first and last bins extend to -Inf and Inf. This can be controlled with the lower_bound and upper_bound arguments: when finite bounds are set, any values beyond them are returned as NA. This may be useful if you calculate the break points programmatically and already have a lower bound specified.

For example, by manually including a lower bound of 0 in the breaks:

 binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(0, 50, 70), quiet=F)) %>% head %>% kable()

## there were 4bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: -Inf - 0
## BIN: 1 range: 0 - 50
## BIN: 2 range: 50 - 70
## BIN: 3 range: 70 - Inf

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	WTBIN	WTBIN2
1	5000	56.09591	94.19649	Male	Hispanic	5000	2	3
2	5000	45.07672	64.17279	Male	Caucasian	5000	1	2
3	5000	50.74503	67.89058	Male	Other	5000	1	2
4	5000	46.87347	62.47354	Female	Caucasian	5000	1	2
5	5000	50.86722	73.76395	Female	Caucasian	5000	2	3
6	5000	40.77630	69.89467	Male	Hispanic	5000	1	2

This adds an unnecessary bin from -Inf to 0. The extra bin can be turned off via lower_bound=NULL:

 binned_data %>% 
  mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(0, 50, 70), lower_bound=NULL,  quiet=F)) %>% 
  head %>% kable()

## there were 3bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: 0 - 50
## BIN: 1 range: 50 - 70
## BIN: 2 range: 70 - Inf

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	WTBIN	WTBIN2
1	5000	56.09591	94.19649	Male	Hispanic	5000	2	2
2	5000	45.07672	64.17279	Male	Caucasian	5000	1	1
3	5000	50.74503	67.89058	Male	Other	5000	1	1
4	5000	46.87347	62.47354	Female	Caucasian	5000	1	1
5	5000	50.86722	73.76395	Female	Caucasian	5000	2	2
6	5000	40.77630	69.89467	Male	Hispanic	5000	1	1
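
The effect of finite bounds (values outside them become NA) can be sketched in base R with cut. This is only an illustration of the idea, not how set_bins is implemented; the upper bound of 120 and the out-of-range values -5 and 150 are hypothetical.

```r
# -5 and 150 are hypothetical out-of-range values; the rest are from the data above
weight <- c(-5, 45, 64.17, 94.20, 150)

# Finite lower (0) and upper (120) bounds; 120 is an illustrative choice
breaks <- c(0, 50, 70, 120)

# labels = FALSE returns integer codes 1..k, so subtract 1 for zero-based bins.
# right = FALSE makes the intervals [a, b).
bin <- cut(weight, breaks = breaks, labels = FALSE, right = FALSE) - 1

bin  # values outside [0, 120) are NA
```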

Getting away from `quantile` and `cut`

A common way of stratifying is to use quantile together with the cut function to establish bins. The cut function, however, has awkward defaults: it does not include the lowest value, and it returns a factor labelled with the range of each cut, which must be further coerced to arrive at a numeric bin.

sid_data %>% mutate(AGECUTS = cut(AGE, breaks = quantile(AGE), include.lowest=T),
                    AGEBINS = as.numeric(AGECUTS)) %>% head %>% kable()

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	AGECUTS	AGEBINS
1	5000	56.09591	94.19649	Male	Hispanic	5000	(54.5,60.6]	4
2	5000	45.07672	64.17279	Male	Caucasian	5000	[38.3,46.7]	1
3	5000	50.74503	67.89058	Male	Other	5000	(46.7,50.8]	2
4	5000	46.87347	62.47354	Female	Caucasian	5000	(46.7,50.8]	2
5	5000	50.86722	73.76395	Female	Caucasian	5000	(50.8,54.5]	3
6	5000	40.77630	69.89467	Male	Hispanic	5000	[38.3,46.7]	1
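
The default behaviour described above is easy to reproduce in base R. The sketch below uses the six ages from the table, rounded to one decimal, so its quantiles differ from those of the full dataset.

```r
# Ages of the six individuals shown above, rounded
age <- c(56.1, 45.1, 50.7, 46.9, 50.9, 40.8)

# Default: intervals are (a, b], so the minimum falls outside the first interval
default_cut <- cut(age, breaks = quantile(age))

# include.lowest = TRUE closes the first interval on the left
fixed_cut <- cut(age, breaks = quantile(age), include.lowest = TRUE)

is.na(default_cut)     # TRUE only for the youngest individual
as.integer(fixed_cut)  # integer bin codes, starting at 1
```

Without include.lowest = TRUE the individual with the minimum age silently receives NA, which is the kind of default that is easy to miss.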

With set_bins this is again simpler, since the breaks argument defaults to the quantiles of the input:

sid_data %>% mutate(AGEBINS = set_bins(AGE)) %>% head %>% kable()

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	AGEBINS
1	5000	56.09591	94.19649	Male	Hispanic	5000	4
2	5000	45.07672	64.17279	Male	Caucasian	5000	1
3	5000	50.74503	67.89058	Male	Other	5000	2
4	5000	46.87347	62.47354	Female	Caucasian	5000	2
5	5000	50.86722	73.76395	Female	Caucasian	5000	3
6	5000	40.77630	69.89467	Male	Hispanic	5000	1

For more fine-tuned levels, we can specify additional breaks:

sid_data %>% mutate(AGEBINS = set_bins(AGE, breaks = 
                                         quantile(AGE, seq(0, 1, length.out = 10)),
                                       quiet = F)) %>% head %>% kable()

## there were 11bins calculated, with the following
##                    range for each bin: 
## BIN: 0 range: -Inf - 38.29745189
## BIN: 1 range: 38.29745189 - 43.5622634533333
## BIN: 2 range: 43.5622634533333 - 46.38955
## BIN: 3 range: 46.38955 - 48.0166795233333
## BIN: 4 range: 48.0166795233333 - 50.5045815844444
## BIN: 5 range: 50.5045815844444 - 51.5338098388889
## BIN: 6 range: 51.5338098388889 - 53.32920805
## BIN: 7 range: 53.32920805 - 54.7564958733333
## BIN: 8 range: 54.7564958733333 - 56.6070799066667
## BIN: 9 range: 56.6070799066667 - 60.55971096
## BIN: 10 range: 60.55971096 - Inf

ID	AMT	AGE	WEIGHT	GENDER	RACE	DOSE	AGEBINS
1	5000	56.09591	94.19649	Male	Hispanic	5000	8
2	5000	45.07672	64.17279	Male	Caucasian	5000	2
3	5000	50.74503	67.89058	Male	Other	5000	5
4	5000	46.87347	62.47354	Female	Caucasian	5000	3
5	5000	50.86722	73.76395	Female	Caucasian	5000	5
6	5000	40.77630	69.89467	Male	Hispanic	5000	1
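
A note on the probabilities used above: seq(0, 1, length.out = 10) yields 10 quantile points and therefore 9 intervals between them, which together with the two open-ended bins gives the 11 bins reported. If true deciles are wanted, 11 points are needed. A short base-R check:

```r
probs_9  <- seq(0, 1, length.out = 10)  # 10 points, 9 intervals between them
probs_10 <- seq(0, 1, length.out = 11)  # 11 points, 10 intervals (deciles)

length(probs_9) - 1
length(probs_10) - 1
diff(probs_10)  # each step is 0.1
```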

BETWEEN

A special case arises when you want to test whether a value falls within some inclusive range, such as a specified therapeutic window. set_bins can also handle this situation with the between argument.

In this case, a range can be specified, and all values inside the range (inclusive) will be assigned to bin 1, with all values outside in either bin 0 (below range) or bin 2 (above range).

For example, to understand the non-zero concentration measurements given a therapeutic window of 20-100, the following can be done:

tw_data <- data %>% filter(CONC > 0) %>% mutate(TW = set_bins(CONC, between=c(20, 100)))

head(tw_data) %>% kable()

ID	TIME	CONC	AGE	WEIGHT	GENDER	RACE	DOSE	TW
1	0.25	8.612809	56.09591	94.19649	Male	Hispanic	5000	0
1	0.50	19.436818	56.09591	94.19649	Male	Hispanic	5000	0
1	1.00	34.006699	56.09591	94.19649	Male	Hispanic	5000	1
1	2.00	30.228800	56.09591	94.19649	Male	Hispanic	5000	1
1	3.00	31.299610	56.09591	94.19649	Male	Hispanic	5000	1
1	4.00	24.979117	56.09591	94.19649	Male	Hispanic	5000	1
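
The 0/1/2 coding described above can be mimicked in base R by summing two comparisons. This is a sketch of the logic only, not the set_bins implementation; the concentrations are taken from the tables in this section and rounded.

```r
# Concentrations from the tables in this section, rounded to two decimals
conc <- c(8.61, 19.44, 34.01, 30.23, 31.30, 24.98, 100.18)
window <- c(20, 100)  # therapeutic window, inclusive on both ends

# Each comparison contributes 0 or 1, so the sum is
# 0 (below the window), 1 (inside, inclusive) or 2 (above the window)
tw <- (conc >= window[1]) + (conc > window[2])

tw
table(tw)
```

The first two concentrations fall below the window, the next four fall inside it, and the last exceeds it, in line with the TW column above.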

This can easily be further examined:

tw_data %>% group_by(TW) %>% summarize(n= n()) %>% kable()

TW	n
0	163
1	378
2	9

tw_data %>% filter(TW ==2) %>% kable()

ID	TIME	CONC	AGE	WEIGHT	GENDER	RACE	DOSE	TW
2	2	100.1783	45.07672	64.17279	Male	Caucasian	5000	2
16	1	104.6390	54.32461	75.68308	Female	Caucasian	5000	2
16	2	101.3737	54.32461	75.68308	Female	Caucasian	5000	2
26	2	101.9166	41.26571	56.59549	Female	Black	5000	2
27	2	116.1278	53.45380	71.09299	Male	Asian	5000	2
36	1	118.8456	60.55971	81.15454	Female	Hispanic	5000	2
36	2	112.7284	60.55971	81.15454	Female	Hispanic	5000	2
36	3	117.5580	60.55971	81.15454	Female	Hispanic	5000	2
36	4	130.6603	60.55971	81.15454	Female	Hispanic	5000	2

devtools::session_info()

## Session info --------------------------------------------------------------

##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Los_Angeles         
##  date     2015-11-30

## Packages ------------------------------------------------------------------

##  package      * version date       source                                
##  assertthat     0.1     2013-12-06 CRAN (R 3.2.0)                        
##  DBI            0.3.1   2014-09-24 CRAN (R 3.2.0)                        
##  devtools       1.9.1   2015-09-11 CRAN (R 3.2.0)                        
##  digest         0.6.8   2014-12-31 CRAN (R 3.2.0)                        
##  dplyr        * 0.4.3   2015-09-01 CRAN (R 3.2.0)                        
##  evaluate       0.8     2015-09-18 CRAN (R 3.2.0)                        
##  highr          0.5.1   2015-09-18 CRAN (R 3.2.0)                        
##  htmltools      0.2.6   2014-09-08 CRAN (R 3.2.0)                        
##  knitr        * 1.11    2015-08-14 CRAN (R 3.2.2)                        
##  lazyeval       0.1.10  2015-01-02 CRAN (R 3.2.0)                        
##  magrittr       1.5     2014-11-22 CRAN (R 3.2.0)                        
##  memoise        0.2.1   2014-04-22 CRAN (R 3.2.0)                        
##  PKPDdatasets * 0.1.0   2015-11-11 Github (dpastoor/PKPDdatasets@52880fa)
##  PKPDmisc     * 0.4     2015-11-11 Github (dpastoor/PKPDmisc@a0680b9)    
##  R6             2.1.1   2015-08-19 CRAN (R 3.2.0)                        
##  Rcpp           0.12.1  2015-09-10 CRAN (R 3.2.0)                        
##  rmarkdown      0.8.1   2015-10-10 CRAN (R 3.2.2)                        
##  stringi        1.0-1   2015-10-22 CRAN (R 3.2.0)                        
##  stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)                        
##  yaml           2.1.13  2014-06-12 CRAN (R 3.2.0)

devin

November 15th, 2015
