Setting bins in R, while not particularly difficult, is usually reasonably standardized and can therefore be easily automated for many situations.
library(PKPDdatasets)
library(PKPDmisc)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
data <- capitalize_names(sd_oral_richpk)
sid_data <- data %>% filter(!duplicated(ID))
Given a traditional PK dataset (data) and a version filtered to one observation per individual (sid_data), which look like:
head(data) %>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.00 | 5000 | 0.000000 | 56.09591 | 94.19649 | Male | Hispanic | 5000 |
| 1 | 0.25 | 0 | 8.612809 | 56.09591 | 94.19649 | Male | Hispanic | 5000 |
| 1 | 0.50 | 0 | 19.436818 | 56.09591 | 94.19649 | Male | Hispanic | 5000 |
| 1 | 1.00 | 0 | 34.006699 | 56.09591 | 94.19649 | Male | Hispanic | 5000 |
| 1 | 2.00 | 0 | 30.228800 | 56.09591 | 94.19649 | Male | Hispanic | 5000 |
| 1 | 3.00 | 0 | 31.299610 | 56.09591 | 94.19649 | Male | Hispanic | 5000 |
head(sid_data) %>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 |
Here are a couple situations in which bins are appropriate
Historically, this has been accomplished using ifelse statements like so:
binned_data <- sid_data %>% mutate(WTBIN = ifelse(WEIGHT > 70, 1, 0))
head(binned_data) %>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | WTBIN |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 1 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 0 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 0 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 0 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 1 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 0 |
The same bins can be obtained with the set_bins function from PKPDmisc:
binned_data <- binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = 70))
head(binned_data)%>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | WTBIN | WTBIN2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 1 | 1 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 0 | 0 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 0 | 0 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 0 | 0 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 1 | 1 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 0 | 0 |
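For readers without PKPDmisc installed, the same single-break binning can be sketched in base R with findInterval. This is only an illustration of the idea, not a claim about how set_bins is implemented; the weight vector is copied from the six rows shown above.

```r
# Weights of the first six individuals, copied from the table above
weight <- c(94.19649, 64.17279, 67.89058, 62.47354, 73.76395, 69.89467)

# ifelse approach: 1 if strictly above 70, otherwise 0
wtbin_ifelse <- ifelse(weight > 70, 1, 0)

# findInterval counts how many break points are <= each value,
# so a single break at 70 yields bins 0 and 1
wtbin_interval <- findInterval(weight, 70)

wtbin_ifelse
wtbin_interval
```

The two approaches agree on these data and differ only for a weight of exactly 70: the comparison weight > 70 places it in bin 0, while findInterval uses left-closed intervals and places it in bin 1.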
The ‘problem’ with the ifelse statement is that it does not scale well to more complex criteria.
For example, given a more complex set of criteria such as weight less than 50, between 50 and 70, and 70 and above, the above example must be expanded to:
binned_data <- sid_data %>% mutate(WTBIN = ifelse(WEIGHT < 50, 0,
                                          ifelse(WEIGHT >= 50 & WEIGHT < 70, 1,
                                          ifelse(WEIGHT >= 70, 2, NA))))
head(binned_data)%>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | WTBIN |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 2 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 1 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 1 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 1 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 2 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 1 |
However, with the set_bins function and the breaks argument this is simple:
binned_data <- binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(50, 70)))
head(binned_data)%>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | WTBIN | WTBIN2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 2 | 2 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 1 | 1 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 1 | 1 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 1 | 1 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 2 | 2 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 1 | 1 |
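To see why a vector of breaks scales better than nested conditions, the sketch below reproduces the three-level weight bins with base R's findInterval. It is an illustrative stand-in rather than the set_bins implementation, and the helper name bin_by_breaks is hypothetical.

```r
# Hypothetical helper: bin a numeric vector by an arbitrary set of breaks.
# Bin 0 is below the first break; bin k starts at the k-th break.
bin_by_breaks <- function(x, breaks) {
  findInterval(x, sort(breaks))
}

# Weights of the first six individuals, copied from the table above
weight <- c(94.19649, 64.17279, 67.89058, 62.47354, 73.76395, 69.89467)

# Nested ifelse version from the text
nested <- ifelse(weight < 50, 0,
          ifelse(weight >= 50 & weight < 70, 1,
          ifelse(weight >= 70, 2, NA)))

# Break-vector version: adding a level means adding one number
by_breaks <- bin_by_breaks(weight, c(50, 70))

all(nested == by_breaks)
```

With this shape, a fourth or fifth level only requires another value in the breaks vector rather than another layer of nesting.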
You can even find additional information about the ranges of each bin by setting quiet=FALSE
binned_data %>%
  mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(50, 70), quiet = FALSE)) %>%
  head %>% kable()
## there were 3bins calculated, with the following
## range for each bin:
## BIN: 0 range: -Inf - 50
## BIN: 1 range: 50 - 70
## BIN: 2 range: 70 - Inf
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | WTBIN | WTBIN2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 2 | 2 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 1 | 1 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 1 | 1 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 1 | 1 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 2 | 2 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 1 | 1 |
As you can see, by default the first and last bins extend to -Inf and Inf, so they capture everything below the first break and above the last. This can be controlled using lower_bound and upper_bound: the lower and upper bounds can either be set explicitly or turned off, and any values exceeding those bounds will be returned as NA. This might be useful if you are programmatically calculating the break points and already have a lower bound specified.
For example, manually specifying a lower break of 0:
binned_data %>% mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(0, 50, 70), quiet=F)) %>% head %>% kable()
## there were 4bins calculated, with the following
## range for each bin:
## BIN: 0 range: -Inf - 0
## BIN: 1 range: 0 - 50
## BIN: 2 range: 50 - 70
## BIN: 3 range: 70 - Inf
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | WTBIN | WTBIN2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 2 | 3 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 1 | 2 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 1 | 2 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 1 | 2 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 2 | 3 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 1 | 2 |
This adds an unnecessary bin from -Inf to 0. It can be turned off via lower_bound=NULL:
binned_data %>%
  mutate(WTBIN2 = set_bins(WEIGHT, breaks = c(0, 50, 70), lower_bound = NULL, quiet = FALSE)) %>%
  head %>% kable()
## there were 3bins calculated, with the following
## range for each bin:
## BIN: 0 range: 0 - 50
## BIN: 1 range: 50 - 70
## BIN: 2 range: 70 - Inf
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | WTBIN | WTBIN2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 2 | 2 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 1 | 1 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 1 | 1 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 1 | 1 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 2 | 2 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 1 | 1 |
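The effect of a finite lower bound can be sketched with base R's cut, which returns NA for values outside the supplied breaks. This mirrors the behaviour described above under the assumption that out-of-range values become NA; it is not the set_bins source, and the example values are made up for illustration.

```r
# Made-up weights, including one impossible negative value
weight <- c(-5, 10, 50, 69.9, 70, 200)

# Breaks with an explicit lower bound of 0 and an open upper bound.
# right = FALSE makes the intervals left-closed: [0, 50), [50, 70), [70, Inf)
# labels = FALSE returns integer codes starting at 1, so subtract 1
# to get zero-based bins
bins <- cut(weight, breaks = c(0, 50, 70, Inf),
            right = FALSE, labels = FALSE) - 1

bins
```

The negative value falls below the lower bound and comes back as NA instead of being placed in a bin.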
quantile and cut

A common way of stratifying is to use quantile together with the cut function to establish bins. The cut function, however, has poor defaults, such as not including the lowest value, and it returns a factor labelled with the range of each cut, which must be further coerced to arrive at a numeric bin.
sid_data %>% mutate(AGECUTS = cut(AGE, breaks = quantile(AGE), include.lowest = TRUE),
                    AGEBINS = as.numeric(AGECUTS)) %>% head %>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | AGECUTS | AGEBINS |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | (54.5,60.6] | 4 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | [38.3,46.7] | 1 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | (46.7,50.8] | 2 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | (46.7,50.8] | 2 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | (50.8,54.5] | 3 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | [38.3,46.7] | 1 |
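The pitfall with cut's defaults is easy to reproduce on a toy vector. The sketch below uses only base R; the values 1 to 8 are arbitrary and chosen so the quantile breaks are easy to follow.

```r
x <- 1:8

# Default cut: intervals are open on the left, so the minimum is dropped
default_cut <- cut(x, breaks = quantile(x))

# include.lowest = TRUE closes the first interval on the left
fixed_cut <- cut(x, breaks = quantile(x), include.lowest = TRUE)

# cut returns a factor; as.integer recovers the numeric bin codes
sum(is.na(default_cut))
as.integer(fixed_cut)
```

Without include.lowest = TRUE the smallest observation silently becomes NA, which is easy to miss in a larger dataset.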
set_bins simplifies the quantile-based case as well, since by default the breaks argument is calculated from the quantiles of the input:
sid_data %>% mutate(AGEBINS = set_bins(AGE)) %>% head %>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | AGEBINS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 4 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 1 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 2 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 2 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 3 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 1 |
For more fine-tuned levels, we can specify additional breaks
sid_data %>% mutate(AGEBINS = set_bins(AGE,
                                       breaks = quantile(AGE, seq(0, 1, length.out = 10)),
                                       quiet = FALSE)) %>% head %>% kable()
## there were 11bins calculated, with the following
## range for each bin:
## BIN: 0 range: -Inf - 38.29745189
## BIN: 1 range: 38.29745189 - 43.5622634533333
## BIN: 2 range: 43.5622634533333 - 46.38955
## BIN: 3 range: 46.38955 - 48.0166795233333
## BIN: 4 range: 48.0166795233333 - 50.5045815844444
## BIN: 5 range: 50.5045815844444 - 51.5338098388889
## BIN: 6 range: 51.5338098388889 - 53.32920805
## BIN: 7 range: 53.32920805 - 54.7564958733333
## BIN: 8 range: 54.7564958733333 - 56.6070799066667
## BIN: 9 range: 56.6070799066667 - 60.55971096
## BIN: 10 range: 60.55971096 - Inf
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | AGEBINS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5000 | 0 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 8 |
| 2 | 0 | 5000 | 0 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 2 |
| 3 | 0 | 5000 | 0 | 50.74503 | 67.89058 | Male | Other | 5000 | 5 |
| 4 | 0 | 5000 | 0 | 46.87347 | 62.47354 | Female | Caucasian | 5000 | 3 |
| 5 | 0 | 5000 | 0 | 50.86722 | 73.76395 | Female | Caucasian | 5000 | 5 |
| 6 | 0 | 5000 | 0 | 40.77630 | 69.89467 | Male | Hispanic | 5000 | 1 |
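The count of 11 bins follows from the number of break points: seq(0, 1, length.out = 10) yields 10 probabilities, hence 10 quantile breaks, 9 intervals between them, and one open-ended bin on each side. A small base R check with simulated ages (the data and seed are assumptions for illustration only, not the PK dataset):

```r
set.seed(1)
age <- runif(50, min = 38, max = 61)  # simulated ages, not the PK dataset

probs <- seq(0, 1, length.out = 10)
breaks <- quantile(age, probs)

length(breaks)        # number of break points
length(breaks) + 1    # bins once -Inf and Inf are added as outer bounds
```

Because the first and last quantile breaks are the minimum and maximum of the data, the outermost bins are empty or nearly so; for nine bins in total, the outer bounds would need to be turned off.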
A special case is when there is a desire to test whether a value falls in some inclusive range, such as a specified therapeutic window. set_bins can also handle this situation with the between argument.
In this case, a range can be specified, and all values inside the range (inclusive) will be assigned to bin 1, with all values outside in either bin 0 (below range) or bin 2 (above range).
For example, to understand the non-zero concentration measurements given a therapeutic window of 20-100, the following can be done:
tw_data <- data %>% filter(CONC > 0) %>% mutate(TW = set_bins(CONC, between=c(20, 100)))
head(tw_data) %>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | TW |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.25 | 0 | 8.612809 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 0 |
| 1 | 0.50 | 0 | 19.436818 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 0 |
| 1 | 1.00 | 0 | 34.006699 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 1 |
| 1 | 2.00 | 0 | 30.228800 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 1 |
| 1 | 3.00 | 0 | 31.299610 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 1 |
| 1 | 4.00 | 0 | 24.979117 | 56.09591 | 94.19649 | Male | Hispanic | 5000 | 1 |
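The three-way classification described above (below, inside, or above an inclusive window) can be written in base R as a sum of two comparisons. This is a sketch of the logic only, not the set_bins implementation; the helper name window_bin is hypothetical, and the concentrations are taken from tables in this document plus one boundary value of exactly 100.

```r
# Hypothetical helper: 0 = below window, 1 = inside (inclusive), 2 = above
window_bin <- function(x, lower, upper) {
  # (x >= lower) contributes 1 once the value reaches the window,
  # (x > upper) contributes another 1 once it exceeds it
  (x >= lower) + (x > upper)
}

conc <- c(8.612809, 19.436818, 34.006699, 100, 100.1783)
tw <- window_bin(conc, lower = 20, upper = 100)

tw
table(tw)
```

Using >= for the lower edge and > for the upper edge is what makes both ends of the window inclusive.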
The binned concentrations in tw_data can then easily be examined further:
tw_data %>% group_by(TW) %>% summarize(n= n()) %>% kable()
| TW | n |
|---|---|
| 0 | 163 |
| 1 | 378 |
| 2 | 9 |
tw_data %>% filter(TW ==2) %>% kable()
| ID | TIME | AMT | CONC | AGE | WEIGHT | GENDER | RACE | DOSE | TW |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 0 | 100.1783 | 45.07672 | 64.17279 | Male | Caucasian | 5000 | 2 |
| 16 | 1 | 0 | 104.6390 | 54.32461 | 75.68308 | Female | Caucasian | 5000 | 2 |
| 16 | 2 | 0 | 101.3737 | 54.32461 | 75.68308 | Female | Caucasian | 5000 | 2 |
| 26 | 2 | 0 | 101.9166 | 41.26571 | 56.59549 | Female | Black | 5000 | 2 |
| 27 | 2 | 0 | 116.1278 | 53.45380 | 71.09299 | Male | Asian | 5000 | 2 |
| 36 | 1 | 0 | 118.8456 | 60.55971 | 81.15454 | Female | Hispanic | 5000 | 2 |
| 36 | 2 | 0 | 112.7284 | 60.55971 | 81.15454 | Female | Hispanic | 5000 | 2 |
| 36 | 3 | 0 | 117.5580 | 60.55971 | 81.15454 | Female | Hispanic | 5000 | 2 |
| 36 | 4 | 0 | 130.6603 | 60.55971 | 81.15454 | Female | Hispanic | 5000 | 2 |
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Los_Angeles
## date 2015-11-30
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
## devtools 1.9.1 2015-09-11 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.0)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
## highr 0.5.1 2015-09-18 CRAN (R 3.2.0)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## knitr * 1.11 2015-08-14 CRAN (R 3.2.2)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## PKPDdatasets * 0.1.0 2015-11-11 Github (dpastoor/PKPDdatasets@52880fa)
## PKPDmisc * 0.4 2015-11-11 Github (dpastoor/PKPDmisc@a0680b9)
## R6 2.1.1 2015-08-19 CRAN (R 3.2.0)
## Rcpp 0.12.1 2015-09-10 CRAN (R 3.2.0)
## rmarkdown 0.8.1 2015-10-10 CRAN (R 3.2.2)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)