Overview

  1. Getting Started
  2. General R Use
  3. Data Handling for 'Pharmacometricians'
  4. Plotting
  5. What else?

— .class #id

Getting Started

While we start the overview of R we wanted to make sure everyone has the necessary components set-up for R use. Please open the setup.R file and we will show everyone how to run the code. If you have any problems please put up a RED sticky and a helper will come assist


R Vocabulary


Components of an R Setup


Rstudio

The R workspace in RStudio


Let's Look at RStudio


Self-help


Self-help (2)

Dose <- c(0.1, 0.1, 0.1)
Weight <- c(52, "12", 38)
# Trying to calculate amount to give for weight-based dosing
Dose * Weight
## Error: non-numeric argument to binary operator

In this case, someone could replicate it, allowing them to notice that the '12' is a character rather than numeric and point it out.


R As A Calculator

2 + 2  ## add numbers
2 * pi  #multiply by a constant
7 + runif(1, min = 0, max = 1)  #add a random variable
4^4  # powers
sqrt(4^4)  # functions

Arithmetic Operators

2 + 2
2/2
2 * 2
2^2
2 == 2
23%/%2
23%%2

Other Key Symbols

ID <- 3
ID
## [1] 3
TRT <- "placebo"
TRT
## [1] "placebo"

Other Key Symbols (II)

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# it increments by one
a <- 100:120
a
##  [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
## [18] 117 118 119 120

Sequence and Repeat

Can have more fine control

# seq(from #, to #, by increment)
seq(0, 24, 0.5)
##  [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5
## [15]  7.0  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5
## [29] 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5
## [43] 21.0 21.5 22.0 22.5 23.0 23.5 24.0

# rep (values to repeat, # of times to repeat, ...)
rep(c("a", "b"), 3)
## [1] "a" "b" "a" "b" "a" "b"

Comments in R


# weight of female adults patients in kg
weight <- c(46.3, 51, 42)


R Advanced Math


Using the Workspace


Using the Workspace (2)

x<-5 #store a variable with <-
x    #print the variable
z<-3 
ls() #list all variables
ls.str() #list and describe variables
rm(x)    # delete a variable
ls()

You can also also visually see what is in your workspace in Rstudio


R as a Language

a <- 3
A <- 4
print(c(a, A))
## [1] 3 4

aA 2. What happens if I type print(a,A)?


Concatenating with c

A <- c(3, 4)
print(A)
## [1] 3 4

Unintended Consequences

R is smart - it will often infer what you are trying to do, but this can sometimes get you into trouble

f <- c(3, "a")
class(f)
## [1] "character"
f[2] <- 5
f
## [1] "3" "5"
class(f)
## [1] "character"

Some Other Language Bugs Features


Special Operators

big <- c(9, 12, 15, 25)
small <- c(9, 3, 4, 2)
# Give us a nice vector of logical values
big > small
## [1] FALSE  TRUE  TRUE  TRUE
big = small
# Oops--don't do this, reassigns big to small
print(big)
## [1] 9 3 4 2

Special Operators II

big <- c(9, 12, 15, 25)
big[big == small]
## [1] 9
# Returns values where the logical vector is true
big[big > small]
## [1] 12 15 25
big[big < small]  # Returns an empty set
## numeric(0)

Special operators (III)

foo <- c(NA, 4, 9, 8.7)
!is.na(foo)  # Returns TRUE for non-NA
## [1] FALSE  TRUE  TRUE  TRUE
a <- foo[!is.na(foo)]
a
## [1] 4.0 9.0 8.7

Data Structures in R


R for Pharmacometricians

The rest of this presenation will be example-driven, and will often draw on the quinidine dataset provided or the built-in R datasets


Reading in Data

Data from a variety of formats an be read into R - two of the most common are .tab and .csv

Lets read in the dataset:

data <- read.csv("./data/PopPKIVdata.csv", stringsAsFactors = F, skip = 3, header = T)
head(data)
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR
## 1   1 0.00  0.00 100  100   1 34.82 38.21 1.113   0 42.63
## 2   1 0.25 13.03   0  100   0 34.82 38.21 1.113   0 42.63
## 3   1 0.50 14.98   0  100   0 34.82 38.21 1.113   0 42.63
## 4   1 0.75 14.16   0  100   0 34.82 38.21 1.113   0 42.63
## 5   1 1.00 19.32   0  100   0 34.82 38.21 1.113   0 42.63
## 6   1 1.50 13.15   0  100   0 34.82 38.21 1.113   0 42.63

We can quickly find additional information about the dataset

# What are the column names?
names(data)
##  [1] "CID"  "TIME" "CONC" "AMT"  "DOSE" "MDV"  "AGE"  "WT"   "SCR"  "ISM" 
## [11] "CLCR"
# how long is the dataset - did everything read in correctly?
nrow(data)
## [1] 1600
# numer of columns
ncol(data)
## [1] 11

# What is the structure of each column
str(data)
## 'data.frame':    1600 obs. of  11 variables:
##  $ CID : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ TIME: num  0 0.25 0.5 0.75 1 1.5 2 2.5 3 4 ...
##  $ CONC: num  0 13 15 14.2 19.3 ...
##  $ AMT : int  100 0 0 0 0 0 0 0 0 0 ...
##  $ DOSE: int  100 100 100 100 100 100 100 100 100 100 ...
##  $ MDV : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ AGE : num  34.8 34.8 34.8 34.8 34.8 ...
##  $ WT  : num  38.2 38.2 38.2 38.2 38.2 ...
##  $ SCR : num  1.11 1.11 1.11 1.11 1.11 ...
##  $ ISM : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CLCR: num  42.6 42.6 42.6 42.6 42.6 ...

We can get some summary statistics

summary(data)
##       CID             TIME             CONC             AMT       
##  Min.   :  1.0   Min.   : 0.000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 25.8   1st Qu.: 0.938   1st Qu.:  6.93   1st Qu.:  0.0  
##  Median : 50.5   Median : 2.750   Median : 13.49   Median :  0.0  
##  Mean   : 50.5   Mean   : 6.344   Mean   : 18.11   Mean   : 10.9  
##  3rd Qu.: 75.2   3rd Qu.: 9.000   3rd Qu.: 24.95   3rd Qu.:  0.0  
##  Max.   :100.0   Max.   :24.000   Max.   :117.37   Max.   :250.0  
##       DOSE          MDV              AGE             WT       
##  Min.   :100   Min.   :0.0000   Min.   :19.2   Min.   : 20.4  
##  1st Qu.:100   1st Qu.:0.0000   1st Qu.:33.8   1st Qu.: 40.1  
##  Median :175   Median :0.0000   Median :41.3   Median : 49.8  
##  Mean   :175   Mean   :0.0625   Mean   :43.0   Mean   : 51.9  
##  3rd Qu.:250   3rd Qu.:0.0000   3rd Qu.:48.7   3rd Qu.: 59.2  
##  Max.   :250   Max.   :1.0000   Max.   :79.3   Max.   :113.9  
##       SCR             ISM            CLCR      
##  Min.   :0.760   Min.   :0.00   Min.   : 19.0  
##  1st Qu.:0.976   1st Qu.:0.00   1st Qu.: 39.9  
##  Median :1.114   Median :0.00   Median : 53.8  
##  Mean   :1.143   Mean   :0.47   Mean   : 58.1  
##  3rd Qu.:1.267   3rd Qu.:1.00   3rd Qu.: 72.9  
##  Max.   :1.720   Max.   :1.00   Max.   :129.0

Summaries with table()

table(data$ISM)
## 
##   0   1 
## 848 752
table(data$ISM, data$DOSE)
##    
##     100 250
##   0 448 400
##   1 352 400

Working with data frames

class(data)

data[2, ]
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR
## 2   1 0.25 13.03   0  100   0 34.82 38.21 1.113   0 42.63
head(data[, 2], n = 16)
##  [1]  0.00  0.25  0.50  0.75  1.00  1.50  2.00  2.50  3.00  4.00  6.00
## [12]  8.00 12.00 16.00 20.00 24.00

Columns using $

Can easily call on or manipulate a single column via dataframe$colname

head(data$AMT)
## [1] 100   0   0   0   0   0
unique(data$TIME)
##  [1]  0.00  0.25  0.50  0.75  1.00  1.50  2.00  2.50  3.00  4.00  6.00
## [12]  8.00 12.00 16.00 20.00 24.00

POP QUIZ

How can easily check the number of unique individuals in the dataset using what we've learned so far?


ANSWER

length(unique(data$CID))
## [1] 100

Subsetting

Can easily subset rows of the data frame using the [ ] in combination with $

ID1 <- data[data$CID == 1, ]
head(ID1, n = 3)
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR
## 1   1 0.00  0.00 100  100   1 34.82 38.21 1.113   0 42.63
## 2   1 0.25 13.03   0  100   0 34.82 38.21 1.113   0 42.63
## 3   1 0.50 14.98   0  100   0 34.82 38.21 1.113   0 42.63
tail(ID1, n = 3)
##    CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR
## 14   1   16 2.418   0  100   0 34.82 38.21 1.113   0 42.63
## 15   1   20 4.759   0  100   0 34.82 38.21 1.113   0 42.63
## 16   1   24 2.365   0  100   0 34.82 38.21 1.113   0 42.63

Subsetting (II)

Or can take only a subset of columns

datasmall <- data[, c(1:4)]

or using the column names

keepcols <- c("CID", "TIME", "CONC", "AMT")
datasmall <- data[, keepcols]

Using both together


keepcols <- c("CID", "TIME", "CONC", "AMT")
ID1 <- data[data$CID == 1, keepcols]
head(ID1)
##   CID TIME  CONC AMT
## 1   1 0.00  0.00 100
## 2   1 0.25 13.03   0
## 3   1 0.50 14.98   0
## 4   1 0.75 14.16   0
## 5   1 1.00 19.32   0
## 6   1 1.50 13.15   0

Multiple subsets

You can also use the & and | operators to do multiple subsets at once

Let's make a dataset with only males less than 40 years old

youngmales <- data[data$ISM != 0 & data$AGE <= 40, ]
head(youngmales[!duplicated(youngmales$CID), ])
##     CID TIME CONC AMT DOSE MDV   AGE    WT    SCR ISM   CLCR
## 17    2    0    0 100  100   1 32.77 74.84 0.8846   1 126.00
## 33    3    0    0 100  100   1 35.97 37.30 1.1004   1  48.98
## 49    4    0    0 100  100   1 38.21 32.97 1.1972   1  38.93
## 257  17    0    0 100  100   1 33.80 49.64 1.0998   1  66.57
## 337  22    0    0 100  100   1 33.14 52.33 1.4894   1  52.15
## 401  26    0    0 100  100   1 31.03 27.41 0.9585   1  43.28

Subsetting to assign values

What if we needed to change all the 0's in the AMT column to NA values

# so we can keep the original dataset
changedata <- data
head(changedata, n = 3)
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR
## 1   1 0.00  0.00 100  100   1 34.82 38.21 1.113   0 42.63
## 2   1 0.25 13.03   0  100   0 34.82 38.21 1.113   0 42.63
## 3   1 0.50 14.98   0  100   0 34.82 38.21 1.113   0 42.63
changedata[changedata$AMT == 0, "AMT"] <- "NA"
head(changedata, n = 3)
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR
## 1   1 0.00  0.00 100  100   1 34.82 38.21 1.113   0 42.63
## 2   1 0.25 13.03  NA  100   0 34.82 38.21 1.113   0 42.63
## 3   1 0.50 14.98  NA  100   0 34.82 38.21 1.113   0 42.63

Adding Columns to a DF with $

# find the mean age and add it to new column
data$MEANAGE <- mean(data$AGE)
head(data, n = 3)
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR MEANAGE
## 1   1 0.00  0.00 100  100   1 34.82 38.21 1.113   0 42.63   42.99
## 2   1 0.25 13.03   0  100   0 34.82 38.21 1.113   0 42.63   42.99
## 3   1 0.50 14.98   0  100   0 34.82 38.21 1.113   0 42.63   42.99

Transforming data

What if we wanted time to be in minutes rather than hours

data$TIMEMIN <- data$TIME * 60
head(data)
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR TIMEMIN
## 1   1 0.00  0.00 100  100   1 34.82 38.21 1.113   0 42.63       0
## 2   1 0.25 13.03   0  100   0 34.82 38.21 1.113   0 42.63      15
## 3   1 0.50 14.98   0  100   0 34.82 38.21 1.113   0 42.63      30
## 4   1 0.75 14.16   0  100   0 34.82 38.21 1.113   0 42.63      45
## 5   1 1.00 19.32   0  100   0 34.82 38.21 1.113   0 42.63      60
## 6   1 1.50 13.15   0  100   0 34.82 38.21 1.113   0 42.63      90
data$TIMEMIN <- NULL

Create some Data to Merge

tableOne <- data.frame(ID = c(1:3), PERIOD = c("a", "b", "c"))
tableTwo <- data.frame(ID = c(seq(1, 5, 2)), PERIOD = c("a", "c", "f"), ISM = c(0, 
    1, 0))
tableOne
##   ID PERIOD
## 1  1      a
## 2  2      b
## 3  3      c
tableTwo
##   ID PERIOD ISM
## 1  1      a   0
## 2  3      c   1
## 3  5      f   0

Merging Options

Can merge by similarities in specific column, specific index, or can combine all

merge(tableOne, tableTwo, by = "ID")
##   ID PERIOD.x PERIOD.y ISM
## 1  1        a        a   0
## 2  3        c        c   1
merge(tableOne, tableTwo, by = "ID", all = T)
##   ID PERIOD.x PERIOD.y ISM
## 1  1        a        a   0
## 2  2        b     <NA>  NA
## 3  3        c        c   1
## 4  5     <NA>        f   0

Merging Options (II)

merge(tableOne, tableTwo, by = "ID", all.x = T)
##   ID PERIOD.x PERIOD.y ISM
## 1  1        a        a   0
## 2  2        b     <NA>  NA
## 3  3        c        c   1
merge(tableOne, tableTwo, by = "ID", all.y = T)
##   ID PERIOD.x PERIOD.y ISM
## 1  1        a        a   0
## 2  3        c        c   1
## 3  5     <NA>        f   0

Merging Options (III)

Can also bind together next to one another

cbind(tableOne, tableTwo)
##   ID PERIOD ID PERIOD ISM
## 1  1      a  1      a   0
## 2  2      b  3      c   1
## 3  3      c  5      f   0

Merging Options (IV)

Can stack two data frames using rbind but data frames must have equal number of columns

tableOne <- data.frame(ID = c(1:3), PERIOD = c("a", "b", "c"))
tableTwo <- data.frame(ID = c(seq(1, 5, 2)), PERIOD = c("a", "c", "f"))
rbind(tableOne, tableTwo)
##   ID PERIOD
## 1  1      a
## 2  2      b
## 3  3      c
## 4  1      a
## 5  3      c
## 6  5      f

Reshaping Data

There is an excellent package (library) called reshape2

Let's load it and get some data to play with

library(reshape2)
keepcols <- c("CID", "TIME", "CONC")
concTime <- data[data$CID < 10, keepcols]
tail(concTime)
##     CID TIME   CONC
## 139   9    6 20.458
## 140   9    8 17.128
## 141   9   12 12.192
## 142   9   16  5.068
## 143   9   20 10.035
## 144   9   24  8.961

How can we make a table of concentrations by time for all ID's? (long to wide conversion)


molten <- melt(concTime, id.vars = c("TIME", "CID"), measure.vars = c("CONC"))
head(molten)
##   TIME CID variable value
## 1 0.00   1     CONC  0.00
## 2 0.25   1     CONC 13.03
## 3 0.50   1     CONC 14.98
## 4 0.75   1     CONC 14.16
## 5 1.00   1     CONC 19.32
## 6 1.50   1     CONC 13.15
concTimeWide <- dcast(molten, TIME + variable ~ CID)
head(concTimeWide, n = 4)
##   TIME variable     1      2     3     4     5      6    7     8     9
## 1 0.00     CONC  0.00  0.000  0.00  0.00  0.00  0.000  0.0  0.00  0.00
## 2 0.25     CONC 13.03  7.420 12.80 24.69 11.33  7.453 12.1 30.88 37.13
## 3 0.50     CONC 14.98  4.987 15.15 34.27 12.68 12.971 11.1 30.39 35.43
## 4 0.75     CONC 14.16 13.630 12.30 32.88 12.16  9.782 11.3 25.95 30.43

PLYR

Lets introduce another powerful library plyr

It is based around the paradigm SPLIT-APPLY-COMBINE

library(plyr)

cmax <- ddply(data, .(CID), summarize, CMAXobs = max(CONC))
head(cmax)
##   CID CMAXobs
## 1   1   19.32
## 2   2   14.52
## 3   3   15.15
## 4   4   34.27
## 5   5   13.41
## 6   6   17.07

Lets merge the CMAX column back into the dataset

data <- merge(data, cmax)
head(data, n = 3)
##   CID TIME  CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR CMAXobs
## 1   1 0.00  0.00 100  100   1 34.82 38.21 1.113   0 42.63   19.32
## 2   1 0.25 13.03   0  100   0 34.82 38.21 1.113   0 42.63   19.32
## 3   1 0.50 14.98   0  100   0 34.82 38.21 1.113   0 42.63   19.32
tail(data, n = 3)
##      CID TIME   CONC AMT DOSE MDV   AGE    WT   SCR ISM  CLCR CMAXobs
## 1598 100   16  3.969   0  250   0 40.15 58.73 1.248   0 55.49   24.05
## 1599 100   20 11.177   0  250   0 40.15 58.73 1.248   0 55.49   24.05
## 1600 100   24  6.903   0  250   0 40.15 58.73 1.248   0 55.49   24.05

Sorting Data

Lets sort the data by time

keep <- c("CID", "TIME", "CONC", "DOSE")
sorted <- data[, keep]
sorted <- sorted[order(sorted$TIME), ]
head(sorted)
##    CID TIME CONC DOSE
## 1    1    0    0  100
## 17   2    0    0  100
## 33   3    0    0  100
## 49   4    0    0  100
## 65   5    0    0  100
## 81   6    0    0  100

Multiple Sorts at Once

First let's 'mess up' the dataset - sort by descending concentrations

concSort <- sort[order(-sort$CONC), ]
## Error: object of type 'closure' is not subsettable
head(concSort)
## Error: object 'concSort' not found

##Mutliple Sorts at Once (II)

concSort <- concSort[order(concSort$CID), ]
## Error: object 'concSort' not found
head(concSort)
## Error: object 'concSort' not found

Oh no the times aren't in order!


Multiple Sorts at Once (III)

Thankfully we can fix this by sorting ID and time within ID

concSort <- concSort[order(concSort$CID, concSort$TIME), ]
## Error: object 'concSort' not found
head(concSort)
## Error: object 'concSort' not found

Sample Dosing Sheet Creation

conc <- c(0)
nSubj <- 40
times <- c(0, 0.5, 1, 2, 3, 4, 6, 8, 12, 16, 24)
t <- expand.grid(TIME = times, CID = 1:nSubj, CONC = conc)
t <- as.data.frame(t)
democols <- c("CID", "AGE", "WT", "CLCR")
demo <- unique(data[, democols])
res <- merge(t, demo, by = "CID")
head(res)
##   CID TIME CONC   AGE    WT  CLCR
## 1   1  0.0    0 34.82 38.21 42.63
## 2   1  0.5    0 34.82 38.21 42.63
## 3   1  1.0    0 34.82 38.21 42.63
## 4   1  2.0    0 34.82 38.21 42.63
## 5   1  3.0    0 34.82 38.21 42.63
## 6   1  4.0    0 34.82 38.21 42.63

Plotting

ggplot is the newest to the 'game' but has risen to be the most popular package due to its ease of use, consistent style, and quality graphics it produces.


Intro to ggplot

# lets take a subset of some individuals - make sure get some from 250 mg
# and 100 mg doses
PKdata100 <- data[data$CID <= 6 & data$DOSE == 100, ]
PKdata250 <- data[data$CID >= 95 & data$DOSE == 250, ]
PKdata <- rbind(PKdata100, PKdata250)
PKdataID <- PKdata[!duplicated(PKdata$CID), ]
# lets check to make sure we have males and females at each dose group (for
# later examples)
table(PKdataID$ISM, PKdataID$DOSE)
##    
##     100 250
##   0   3   2
##   1   3   4

qplot for quick visualizations

library(ggplot2)
qplot(data = PKdata, x = TIME, y = CONC)

plot of chunk unnamed-chunk-42


qplot for quick visualizations

qplot(data = PKdata, x = TIME, y = CONC)

plot of chunk unnamed-chunk-43


Additional customization


Let's Add Some Color

qplot(data = PKdata, x = TIME, y = CONC, color = as.factor(CID))

plot of chunk unnamed-chunk-44


# Add lines as well as a horizontal line at a y-intercept of 40
qplot(data = PKdata, x = TIME, y = CONC, color = as.factor(CID), geom = c("point", 
    "line")) + geom_hline(aes(yintercept = 40))

plot of chunk unnamed-chunk-45


Statistical Plots Are Easy Too!

IDinfo <- data[!duplicated(data$CID), ]
qplot(data = IDinfo, x = AGE)

plot of chunk unnamed-chunk-46


# change the binwidth and add a red vertical line dotted line @ median age
qplot(data = IDinfo, x = AGE, binwidth = 5) + geom_vline(aes(xintercept = median(AGE), 
    color = "red"), linetype = 2)

plot of chunk unnamed-chunk-47


# lets look at the distribution of males and females
qplot(data = IDinfo, x = AGE, fill = as.factor(ISM))

plot of chunk unnamed-chunk-48


What About Facetting

qplot(data = IDinfo, x = AGE, binwidth = 5) + facet_wrap(~ISM)

plot of chunk unnamed-chunk-49


Boxplots

qplot(data = IDinfo, x = as.factor(ISM), y = AGE, geom = "boxplot")

plot of chunk unnamed-chunk-50


More Boxplot Options

qplot(data = IDinfo, x = as.factor(ISM), y = AGE, geom = c("boxplot", "point"))

plot of chunk unnamed-chunk-51


'ggplot'


What's Missing?

What we've shown has only scratched the surface of R's capabilities - so what else can be done?