— .class #id
While we start the overview of R we wanted to make sure everyone has the necessary components set-up for R use. Please open the setup.R file and we will show everyone how to run the code. If you have any problems please put up a RED sticky and a helper will come assist
The R workspace in RStudio
?summary??regressionDose <- c(0.1, 0.1, 0.1)
Weight <- c(52, "12", 38)
# Trying to calculate amount to give for weight-based dosing
Dose * Weight
## Error: non-numeric argument to binary operator
In this case, someone could replicate it, allowing them to notice that the '12' is a character rather than numeric and point it out.
2 + 2 ## add numbers
2 * pi #multiply by a constant
7 + runif(1, min = 0, max = 1) #add a random variable
4^4 # powers
sqrt(4^4) # functions
+ - = / * and exponential ^, there is also integer division %/% and remainder in integer division (known as modulo arithmetic) %%2 + 2
2/2
2 * 2
2^2
2 == 2
23%/%2
23%%2
<- is the assignment operator, it declares something is something elseID <- 3
ID
## [1] 3
TRT <- "placebo"
TRT
## [1] "placebo"
: is the sequence operator1:10
## [1] 1 2 3 4 5 6 7 8 9 10
# it increments by one
a <- 100:120
a
## [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
## [18] 117 118 119 120
Can have more fine control
# seq(from #, to #, by increment)
seq(0, 24, 0.5)
## [1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5
## [15] 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5
## [29] 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5
## [43] 21.0 21.5 22.0 22.5 23.0 23.5 24.0
# rep (values to repeat, # of times to repeat, ...)
rep(c("a", "b"), 3)
## [1] "a" "b" "a" "b" "a" "b"
# weight of female adults patients in kg
weight <- c(46.3, 51, 42)
Easiest of all, R can generate distributions of data very easily
rnorm(100) or rbinom(100)This comes in handy when writing examples and building analyses because it is trivial to generate a synthetic piece of data to use as an example
Try typing hist(rnorm(10000))
x<-5 #store a variable with <-
x #print the variable
z<-3
ls() #list all variables
ls.str() #list and describe variables
rm(x) # delete a variable
ls()
You can also also visually see what is in your workspace in Rstudio
a <- 3
A <- 4
print(c(a, A))
## [1] 3 4
a ≠ A 2. What happens if I type print(a,A)?
cA <- c(3, 4)
print(A)
## [1] 3 4
c stands for concatenate and allows vectors to have multiple elementsc, which is one of the most used functions you will ever usec is important to put any vector together, but remember that objects within a vector must all be of the same typeR is smart - it will often infer what you are trying to do, but this can sometimes get you into trouble
f <- c(3, "a")
class(f)
## [1] "character"
f[2] <- 5
f
## [1] "3" "5"
class(f)
## [1] "character"
camelCase; others are.dot.separated; others use_underscoresbase, grid, lattice, and ggplot2)<, >, <=, >=, ==, and != are used to compare values across vectorsbig <- c(9, 12, 15, 25)
small <- c(9, 3, 4, 2)
# Give us a nice vector of logical values
big > small
## [1] FALSE TRUE TRUE TRUE
big = small
# Oops--don't do this, reassigns big to small
print(big)
## [1] 9 3 4 2
= or == to assign anything, always use <-[] to avoid confusionbig <- c(9, 12, 15, 25)
big[big == small]
## [1] 9
# Returns values where the logical vector is true
big[big > small]
## [1] 12 15 25
big[big < small] # Returns an empty set
## numeric(0)
| (or) and & (and) can be used to combine two logical values and produce another logical value as the result. The operator ! (not) negates a logical value. These operators allow complex conditions to be constructed.foo <- c(NA, 4, 9, 8.7)
!is.na(foo) # Returns TRUE for non-NA
## [1] FALSE TRUE TRUE TRUE
a <- foo[!is.na(foo)]
a
## [1] 4.0 9.0 8.7
The rest of this presenation will be example-driven, and will often draw on the quinidine dataset provided or the built-in R datasets
Data from a variety of formats an be read into R - two of the most common are .tab and .csv
Lets read in the dataset:
data <- read.csv("./data/PopPKIVdata.csv", stringsAsFactors = F, skip = 3, header = T)
head(data)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR
## 1 1 0.00 0.00 100 100 1 34.82 38.21 1.113 0 42.63
## 2 1 0.25 13.03 0 100 0 34.82 38.21 1.113 0 42.63
## 3 1 0.50 14.98 0 100 0 34.82 38.21 1.113 0 42.63
## 4 1 0.75 14.16 0 100 0 34.82 38.21 1.113 0 42.63
## 5 1 1.00 19.32 0 100 0 34.82 38.21 1.113 0 42.63
## 6 1 1.50 13.15 0 100 0 34.82 38.21 1.113 0 42.63
We can quickly find additional information about the dataset
# What are the column names?
names(data)
## [1] "CID" "TIME" "CONC" "AMT" "DOSE" "MDV" "AGE" "WT" "SCR" "ISM"
## [11] "CLCR"
# how long is the dataset - did everything read in correctly?
nrow(data)
## [1] 1600
# numer of columns
ncol(data)
## [1] 11
# What is the structure of each column
str(data)
## 'data.frame': 1600 obs. of 11 variables:
## $ CID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ TIME: num 0 0.25 0.5 0.75 1 1.5 2 2.5 3 4 ...
## $ CONC: num 0 13 15 14.2 19.3 ...
## $ AMT : int 100 0 0 0 0 0 0 0 0 0 ...
## $ DOSE: int 100 100 100 100 100 100 100 100 100 100 ...
## $ MDV : int 1 0 0 0 0 0 0 0 0 0 ...
## $ AGE : num 34.8 34.8 34.8 34.8 34.8 ...
## $ WT : num 38.2 38.2 38.2 38.2 38.2 ...
## $ SCR : num 1.11 1.11 1.11 1.11 1.11 ...
## $ ISM : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CLCR: num 42.6 42.6 42.6 42.6 42.6 ...
We can get some summary statistics
summary(data)
## CID TIME CONC AMT
## Min. : 1.0 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 25.8 1st Qu.: 0.938 1st Qu.: 6.93 1st Qu.: 0.0
## Median : 50.5 Median : 2.750 Median : 13.49 Median : 0.0
## Mean : 50.5 Mean : 6.344 Mean : 18.11 Mean : 10.9
## 3rd Qu.: 75.2 3rd Qu.: 9.000 3rd Qu.: 24.95 3rd Qu.: 0.0
## Max. :100.0 Max. :24.000 Max. :117.37 Max. :250.0
## DOSE MDV AGE WT
## Min. :100 Min. :0.0000 Min. :19.2 Min. : 20.4
## 1st Qu.:100 1st Qu.:0.0000 1st Qu.:33.8 1st Qu.: 40.1
## Median :175 Median :0.0000 Median :41.3 Median : 49.8
## Mean :175 Mean :0.0625 Mean :43.0 Mean : 51.9
## 3rd Qu.:250 3rd Qu.:0.0000 3rd Qu.:48.7 3rd Qu.: 59.2
## Max. :250 Max. :1.0000 Max. :79.3 Max. :113.9
## SCR ISM CLCR
## Min. :0.760 Min. :0.00 Min. : 19.0
## 1st Qu.:0.976 1st Qu.:0.00 1st Qu.: 39.9
## Median :1.114 Median :0.00 Median : 53.8
## Mean :1.143 Mean :0.47 Mean : 58.1
## 3rd Qu.:1.267 3rd Qu.:1.00 3rd Qu.: 72.9
## Max. :1.720 Max. :1.00 Max. :129.0
table(data$ISM)
##
## 0 1
## 848 752
table(data$ISM, data$DOSE)
##
## 100 250
## 0 448 400
## 1 352 400
class(data)
data[2, ]
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR
## 2 1 0.25 13.03 0 100 0 34.82 38.21 1.113 0 42.63
head(data[, 2], n = 16)
## [1] 0.00 0.25 0.50 0.75 1.00 1.50 2.00 2.50 3.00 4.00 6.00
## [12] 8.00 12.00 16.00 20.00 24.00
Can easily call on or manipulate a single column via dataframe$colname
head(data$AMT)
## [1] 100 0 0 0 0 0
unique(data$TIME)
## [1] 0.00 0.25 0.50 0.75 1.00 1.50 2.00 2.50 3.00 4.00 6.00
## [12] 8.00 12.00 16.00 20.00 24.00
How can easily check the number of unique individuals in the dataset using what we've learned so far?
length(unique(data$CID))
## [1] 100
Can easily subset rows of the data frame using the [ ] in combination with $
ID1 <- data[data$CID == 1, ]
head(ID1, n = 3)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR
## 1 1 0.00 0.00 100 100 1 34.82 38.21 1.113 0 42.63
## 2 1 0.25 13.03 0 100 0 34.82 38.21 1.113 0 42.63
## 3 1 0.50 14.98 0 100 0 34.82 38.21 1.113 0 42.63
tail(ID1, n = 3)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR
## 14 1 16 2.418 0 100 0 34.82 38.21 1.113 0 42.63
## 15 1 20 4.759 0 100 0 34.82 38.21 1.113 0 42.63
## 16 1 24 2.365 0 100 0 34.82 38.21 1.113 0 42.63
Or can take only a subset of columns
datasmall <- data[, c(1:4)]
or using the column names
keepcols <- c("CID", "TIME", "CONC", "AMT")
datasmall <- data[, keepcols]
keepcols <- c("CID", "TIME", "CONC", "AMT")
ID1 <- data[data$CID == 1, keepcols]
head(ID1)
## CID TIME CONC AMT
## 1 1 0.00 0.00 100
## 2 1 0.25 13.03 0
## 3 1 0.50 14.98 0
## 4 1 0.75 14.16 0
## 5 1 1.00 19.32 0
## 6 1 1.50 13.15 0
You can also use the & and | operators to do multiple subsets at once
Let's make a dataset with only males less than 40 years old
youngmales <- data[data$ISM != 0 & data$AGE <= 40, ]
head(youngmales[!duplicated(youngmales$CID), ])
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR
## 17 2 0 0 100 100 1 32.77 74.84 0.8846 1 126.00
## 33 3 0 0 100 100 1 35.97 37.30 1.1004 1 48.98
## 49 4 0 0 100 100 1 38.21 32.97 1.1972 1 38.93
## 257 17 0 0 100 100 1 33.80 49.64 1.0998 1 66.57
## 337 22 0 0 100 100 1 33.14 52.33 1.4894 1 52.15
## 401 26 0 0 100 100 1 31.03 27.41 0.9585 1 43.28
What if we needed to change all the 0's in the AMT column to NA values
# so we can keep the original dataset
changedata <- data
head(changedata, n = 3)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR
## 1 1 0.00 0.00 100 100 1 34.82 38.21 1.113 0 42.63
## 2 1 0.25 13.03 0 100 0 34.82 38.21 1.113 0 42.63
## 3 1 0.50 14.98 0 100 0 34.82 38.21 1.113 0 42.63
changedata[changedata$AMT == 0, "AMT"] <- "NA"
head(changedata, n = 3)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR
## 1 1 0.00 0.00 100 100 1 34.82 38.21 1.113 0 42.63
## 2 1 0.25 13.03 NA 100 0 34.82 38.21 1.113 0 42.63
## 3 1 0.50 14.98 NA 100 0 34.82 38.21 1.113 0 42.63
# find the mean age and add it to new column
data$MEANAGE <- mean(data$AGE)
head(data, n = 3)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR MEANAGE
## 1 1 0.00 0.00 100 100 1 34.82 38.21 1.113 0 42.63 42.99
## 2 1 0.25 13.03 0 100 0 34.82 38.21 1.113 0 42.63 42.99
## 3 1 0.50 14.98 0 100 0 34.82 38.21 1.113 0 42.63 42.99
What if we wanted time to be in minutes rather than hours
data$TIMEMIN <- data$TIME * 60
head(data)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR TIMEMIN
## 1 1 0.00 0.00 100 100 1 34.82 38.21 1.113 0 42.63 0
## 2 1 0.25 13.03 0 100 0 34.82 38.21 1.113 0 42.63 15
## 3 1 0.50 14.98 0 100 0 34.82 38.21 1.113 0 42.63 30
## 4 1 0.75 14.16 0 100 0 34.82 38.21 1.113 0 42.63 45
## 5 1 1.00 19.32 0 100 0 34.82 38.21 1.113 0 42.63 60
## 6 1 1.50 13.15 0 100 0 34.82 38.21 1.113 0 42.63 90
data$TIMEMIN <- NULL
tableOne <- data.frame(ID = c(1:3), PERIOD = c("a", "b", "c"))
tableTwo <- data.frame(ID = c(seq(1, 5, 2)), PERIOD = c("a", "c", "f"), ISM = c(0,
1, 0))
tableOne
## ID PERIOD
## 1 1 a
## 2 2 b
## 3 3 c
tableTwo
## ID PERIOD ISM
## 1 1 a 0
## 2 3 c 1
## 3 5 f 0
Can merge by similarities in specific column, specific index, or can combine all
merge(tableOne, tableTwo, by = "ID")
## ID PERIOD.x PERIOD.y ISM
## 1 1 a a 0
## 2 3 c c 1
merge(tableOne, tableTwo, by = "ID", all = T)
## ID PERIOD.x PERIOD.y ISM
## 1 1 a a 0
## 2 2 b <NA> NA
## 3 3 c c 1
## 4 5 <NA> f 0
merge(tableOne, tableTwo, by = "ID", all.x = T)
## ID PERIOD.x PERIOD.y ISM
## 1 1 a a 0
## 2 2 b <NA> NA
## 3 3 c c 1
merge(tableOne, tableTwo, by = "ID", all.y = T)
## ID PERIOD.x PERIOD.y ISM
## 1 1 a a 0
## 2 3 c c 1
## 3 5 <NA> f 0
Can also bind together next to one another
cbind(tableOne, tableTwo)
## ID PERIOD ID PERIOD ISM
## 1 1 a 1 a 0
## 2 2 b 3 c 1
## 3 3 c 5 f 0
Can stack two data frames using rbind but data frames must have equal number of columns
tableOne <- data.frame(ID = c(1:3), PERIOD = c("a", "b", "c"))
tableTwo <- data.frame(ID = c(seq(1, 5, 2)), PERIOD = c("a", "c", "f"))
rbind(tableOne, tableTwo)
## ID PERIOD
## 1 1 a
## 2 2 b
## 3 3 c
## 4 1 a
## 5 3 c
## 6 5 f
There is an excellent package (library) called reshape2
Let's load it and get some data to play with
library(reshape2)
keepcols <- c("CID", "TIME", "CONC")
concTime <- data[data$CID < 10, keepcols]
tail(concTime)
## CID TIME CONC
## 139 9 6 20.458
## 140 9 8 17.128
## 141 9 12 12.192
## 142 9 16 5.068
## 143 9 20 10.035
## 144 9 24 8.961
How can we make a table of concentrations by time for all ID's? (long to wide conversion)
molten <- melt(concTime, id.vars = c("TIME", "CID"), measure.vars = c("CONC"))
head(molten)
## TIME CID variable value
## 1 0.00 1 CONC 0.00
## 2 0.25 1 CONC 13.03
## 3 0.50 1 CONC 14.98
## 4 0.75 1 CONC 14.16
## 5 1.00 1 CONC 19.32
## 6 1.50 1 CONC 13.15
concTimeWide <- dcast(molten, TIME + variable ~ CID)
head(concTimeWide, n = 4)
## TIME variable 1 2 3 4 5 6 7 8 9
## 1 0.00 CONC 0.00 0.000 0.00 0.00 0.00 0.000 0.0 0.00 0.00
## 2 0.25 CONC 13.03 7.420 12.80 24.69 11.33 7.453 12.1 30.88 37.13
## 3 0.50 CONC 14.98 4.987 15.15 34.27 12.68 12.971 11.1 30.39 35.43
## 4 0.75 CONC 14.16 13.630 12.30 32.88 12.16 9.782 11.3 25.95 30.43
Lets introduce another powerful library plyr
It is based around the paradigm SPLIT-APPLY-COMBINE
library(plyr)
cmax <- ddply(data, .(CID), summarize, CMAXobs = max(CONC))
head(cmax)
## CID CMAXobs
## 1 1 19.32
## 2 2 14.52
## 3 3 15.15
## 4 4 34.27
## 5 5 13.41
## 6 6 17.07
Lets merge the CMAX column back into the dataset
data <- merge(data, cmax)
head(data, n = 3)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR CMAXobs
## 1 1 0.00 0.00 100 100 1 34.82 38.21 1.113 0 42.63 19.32
## 2 1 0.25 13.03 0 100 0 34.82 38.21 1.113 0 42.63 19.32
## 3 1 0.50 14.98 0 100 0 34.82 38.21 1.113 0 42.63 19.32
tail(data, n = 3)
## CID TIME CONC AMT DOSE MDV AGE WT SCR ISM CLCR CMAXobs
## 1598 100 16 3.969 0 250 0 40.15 58.73 1.248 0 55.49 24.05
## 1599 100 20 11.177 0 250 0 40.15 58.73 1.248 0 55.49 24.05
## 1600 100 24 6.903 0 250 0 40.15 58.73 1.248 0 55.49 24.05
Lets sort the data by time
keep <- c("CID", "TIME", "CONC", "DOSE")
sorted <- data[, keep]
sorted <- sorted[order(sorted$TIME), ]
head(sorted)
## CID TIME CONC DOSE
## 1 1 0 0 100
## 17 2 0 0 100
## 33 3 0 0 100
## 49 4 0 0 100
## 65 5 0 0 100
## 81 6 0 0 100
First let's 'mess up' the dataset - sort by descending concentrations
concSort <- sort[order(-sort$CONC), ]
## Error: object of type 'closure' is not subsettable
head(concSort)
## Error: object 'concSort' not found
##Mutliple Sorts at Once (II)
concSort <- concSort[order(concSort$CID), ]
## Error: object 'concSort' not found
head(concSort)
## Error: object 'concSort' not found
Oh no the times aren't in order!
Thankfully we can fix this by sorting ID and time within ID
concSort <- concSort[order(concSort$CID, concSort$TIME), ]
## Error: object 'concSort' not found
head(concSort)
## Error: object 'concSort' not found
conc <- c(0)
nSubj <- 40
times <- c(0, 0.5, 1, 2, 3, 4, 6, 8, 12, 16, 24)
t <- expand.grid(TIME = times, CID = 1:nSubj, CONC = conc)
t <- as.data.frame(t)
democols <- c("CID", "AGE", "WT", "CLCR")
demo <- unique(data[, democols])
res <- merge(t, demo, by = "CID")
head(res)
## CID TIME CONC AGE WT CLCR
## 1 1 0.0 0 34.82 38.21 42.63
## 2 1 0.5 0 34.82 38.21 42.63
## 3 1 1.0 0 34.82 38.21 42.63
## 4 1 2.0 0 34.82 38.21 42.63
## 5 1 3.0 0 34.82 38.21 42.63
## 6 1 4.0 0 34.82 38.21 42.63
ggplot is the newest to the 'game' but has risen to be the most popular package due to its ease of use, consistent style, and quality graphics it produces.
# lets take a subset of some individuals - make sure get some from 250 mg
# and 100 mg doses
PKdata100 <- data[data$CID <= 6 & data$DOSE == 100, ]
PKdata250 <- data[data$CID >= 95 & data$DOSE == 250, ]
PKdata <- rbind(PKdata100, PKdata250)
PKdataID <- PKdata[!duplicated(PKdata$CID), ]
# lets check to make sure we have males and females at each dose group (for
# later examples)
table(PKdataID$ISM, PKdataID$DOSE)
##
## 100 250
## 0 3 2
## 1 3 4
library(ggplot2)
qplot(data = PKdata, x = TIME, y = CONC)
qplot(data = PKdata, x = TIME, y = CONC)
qplot(data = PKdata, x = TIME, y = CONC, color = as.factor(CID))
# Add lines as well as a horizontal line at a y-intercept of 40
qplot(data = PKdata, x = TIME, y = CONC, color = as.factor(CID), geom = c("point",
"line")) + geom_hline(aes(yintercept = 40))
IDinfo <- data[!duplicated(data$CID), ]
qplot(data = IDinfo, x = AGE)
# change the binwidth and add a red vertical line dotted line @ median age
qplot(data = IDinfo, x = AGE, binwidth = 5) + geom_vline(aes(xintercept = median(AGE),
color = "red"), linetype = 2)
# lets look at the distribution of males and females
qplot(data = IDinfo, x = AGE, fill = as.factor(ISM))
qplot(data = IDinfo, x = AGE, binwidth = 5) + facet_wrap(~ISM)
qplot(data = IDinfo, x = as.factor(ISM), y = AGE, geom = "boxplot")
qplot(data = IDinfo, x = as.factor(ISM), y = AGE, geom = c("boxplot", "point"))
What we've shown has only scratched the surface of R's capabilities - so what else can be done?