Outline R - The Basics
Objective: Identify core features and functions essential to working with R; include sources and further learning references.
Help
help.start() - see intro materials installed with R
?help - understand the built-in help system
?median - learn about the median (or any R function); explanations end with helpful examples
example(mean) - run the examples for the function in parentheses; use on any function in this document
help.search('median') - search across documentation for median
apropos('termorregex') - finds objects in the search list matching expression
vignette() - list available vignettes for installed packages
RSiteSearch('some term') - submits search to r-project.org
Quick-R - Fab learning site by R.I. Kabacoff
R Seek Search Site - try it out with ‘child welfare’
sos Package - search through help files in packages; usually more detailed than R Seek.
CRAN Task Views - Social Science view, a good start.
Getting Help page - notes from Jeffrey Leek (internal only)
Bottom up: R Results thru CRAN
Conceptual organizing framework for this outline.
R produces
RESULTS through
ANALYSIS using
PACKAGES (including sample data) running
FUNCTIONS with
ARGUMENTS on
R OBJECTS of
DATA STRUCTURES created by
SIMULATION or loaded from
DATA IMPORTS (or connections) into a
PROJECT running in an
R ENVIRONMENT(WORKSPACE) installed on a
LOCAL MACHINE (or a key, or remote server) after
CRAN (or GitHub or Bioconductor) download
CRAN sources, Local Machine, R Environment/workspace
CRAN - source code, instructions
RStudio - helpful IDE/GUI for R
getwd() - display current working directory
setwd() - sets working directory; windows syntax clumsy; easier within RStudio > Session >
Sys.getenv() - local computing environment important to R processing
Sys.time() - current system time; check before doing date / time analysis
options() - view (and set) global options affecting R computations and displays
installed.packages() - display list of installed packages
.libPaths() - display path to library of installed packages
ls() - list objects in the workspace
lsf.str('package:foo') - list all functions in package foo
list.files() - list files in the working directory
rm(x, y, z) - remove specified objects from the workspace
dir() - display files and folders in current directory
list.dirs() - display directories under current one
sessionInfo() - R version, platform, loaded packages, computer locale settings; nice to add to end of knitr doc
R.Version() - reports installed version of R
q() - quit R
Updating R
# installing/loading the package:
if(!require(installr)) {
install.packages("installr"); require(installr)} #load / install+load installr
# using the package:
updateR() # this will start the updating process of your R installation. It will check for newer versions, and if one is available, will guide you through the decisions you'd need to make.
Projects - organization
Steps: define question; define ideal data set; identify available sources; obtain; clean; explore; predict/model; interpret results; challenge; synthesize; create reproducible code
RStudio - create project integrated with version control system
Project Template - http://projecttemplate.net/ tutorial and config details
library(‘ProjectTemplate’) - run in dir where proj files desired
create.project(‘projectname’, minimal=TRUE) - creates min set of proj folders in current dir
setwd(‘C:/[path2projectname]’) - change to new proj dir
See Letters-Minimal, a quick implementation of project template
Data - I/O Summary
| data type | read | write |
| tabular | read.table, read.csv | write.table |
| text file | readLines | writeLines |
| R code files | source (inverse of dump) | dump |
| deparsed R code files | dget (inverse of dput) | dput |
| saved workspace | load | save |
| binary R objects | unserialize | serialize |
write.table(df, 'childyouth.txt', col.names = NA) - or row.names = FALSE for spreadsheet compatibility; sep = '\t' for tab prevents split of POSIX data into date and time
dump and dput preserve metadata
dump(c('x', 'y'), file = 'data.R') # for R list obj x and y
dput(x, file = 'data.r') # for R data frame
source(‘data.R’) - recreates data object in file
Data - Imports(Read) and Sources
read.table('sourcefile.csv', header=TRUE|FALSE, sep=',', row.names='rowidheader', skip=n, stringsAsFactors=TRUE|FALSE) - importing from CSV file; see help file for large data hints; stringsAsFactors=FALSE usually better; comment.char='' when the file has no comments; use colClasses as in the example following
initial <- read.table('data.txt', nrows = 100) # get classes from 1st 100 rows of large table
classes <- sapply(initial, class)
tabAll <- read.table('datatable.txt', colClasses = classes)
xpartial <- read.table(‘datatable.txt’, nrows = 100) - create DF from 1st 100 rows
classes <- sapply(xpartial, class) - save class info by applying class function over x
xfull <- read.table(‘datatable.txt’, colClasses = classes) - use colClasses to read full source and apply classes extracted from partial table.
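A minimal sketch of the colClasses approach above, run against a throwaway temp file. The file and its columns (id, score, grp) are made up for illustration only.

```r
# Hypothetical data written to a temp file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
sim <- data.frame(id = 1:500,
                  score = runif(500),
                  grp = sample(letters[1:3], 500, replace = TRUE))
write.table(sim, tmp, row.names = FALSE)

# Read only the first 100 rows and record the class of each column
xpartial <- read.table(tmp, header = TRUE, nrows = 100, stringsAsFactors = FALSE)
classes <- sapply(xpartial, class)

# Read the full file, telling read.table the classes up front
xfull <- read.table(tmp, header = TRUE, colClasses = classes)
unlink(tmp)
```

Supplying colClasses spares read.table from guessing column types, which is where the speed-up on large files comes from.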
odbcConnectExcel() - read Excel file with CSV or DIF export; requires RODBC
spss.get(‘spsssourcefile.por’, use.value.labels=TRUE) - load and convert value labels to factors;requires package Hmisc
read.dta('statasourcedatafile.dta') - read in STATA file; requires library(foreign)
source('data.r') - reads in and recreates objects in the file
readLines('filename.txt') - read lines and store in character vector
scan() - read file into vector
File Connections
file(description = ‘’, open = ’r|w|a|rb|wb|ab’, blocking = TRUE, encoding = getOption(‘encoding’), raw = FALSE)
con <- file('foo.txt', 'r') - text file abstracted connection
con <- gzfile('words.gz') - gzipped file connection
con <- bzfile('words.bz') - bzip2 compressed file connection
con <- url('http://www.torontocas.ca', 'r') - abstract connection to web address
x <- readLines(con, 10) - read first 10 lines of the connection
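A small sketch of reading through a connection, assuming a temp file stands in for 'foo.txt'. It shows that an open connection keeps its position between reads.

```r
# Temp file standing in for a real text file (illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:25), tmp)

con <- file(tmp, "r")       # open a read-only text connection
x <- readLines(con, 10)     # first 10 lines
nxt <- readLines(con, 5)    # next 5 lines, continuing where the last read stopped
close(con)                  # always close connections you open
unlink(tmp)
```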
Data - Simulation Functions
No data? Create your own.
set.seed(1234) - set random generator with same seed number for reproducibility
rep(x, n) - replicates object x, n times
seq(a, b, by = n) - create numbers a to b by interval n.
Distributions
d for density; r for random number generation; p for cumulative distribution; q for quantile function
Normal distribution
rnorm(x, mean, sd) - generate x number of random Normal variates (particular outcome of a random variable) with a given mean and standard deviation.
dnorm(x, mean, sd, log=TRUE|FALSE) - returns height of Normal probability density (with a given mean and sd) at a point or vector of points x; log=FALSE is the default, log=TRUE returns the log density.
pnorm(q, mean, sd, lower.tail=TRUE|FALSE, log.p) - probability that a normal random number will be less than q; lower.tail=FALSE for upper tail
qnorm(p, mean, sd) - evaluate the quantiles for the normal distribution with given mean/SD; for the standard normal, returns the Z score for probability (or probabilities) p.
If \(\Phi\) is the cumulative distribution function for a standard normal distribution, then pnorm(q) = \(\Phi(q)\) and qnorm(p) = \(\Phi^{-1}(p)\), that is, the inverse of \(\Phi\).
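A quick check of the d/p/q relationships above for the standard normal. The value 1.96 is just a convenient illustration.

```r
p <- pnorm(1.96)                          # P(Z <= 1.96), roughly 0.975
q <- qnorm(p)                             # inverse CDF recovers 1.96
upper <- pnorm(1.96, lower.tail = FALSE)  # P(Z > 1.96), the upper tail
d <- dnorm(0)                             # density height at 0, equal to 1/sqrt(2*pi)
d_log <- dnorm(0, log = TRUE)             # same height on the log scale
```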
Poisson distribution
rpois(x, rate) - generate x random Poisson variates with a given rate; mean will be approx equal to rate
ppois(x, rate) - probability that a Poisson variable with the given rate is less than or equal to x
Other distributions
rbeta()
rbinom(1000, 1, 0.52) - gen 1000 observations, 0-1, slightly higher chance .52 vs .5, eg, sex;
rcauchy()
rchisq()
rexp()
rf()
rgamma()
rgeom()
rhyper()
rlogis()
rlnorm()
rnbinom()
rt()
runif(1000, 0, 21) - gen 1000, min 0, max 21, eg ages
rweibull()
Densities
dbeta()
dbinom()
dcauchy()
dchisq()
dexp()
df()
dgamma()
dgeom()
dhyper()
dlogis()
dlnorm()
dnbinom()
dnorm()
dpois()
dt()
dunif()
dweibull()
Sampling
sample(1:100, 10, FALSE) - random 10 from 100, no replacement (repeats), equal selection probability; use replacement when bootstrapping
sample(1:10) - generates permutation
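A hedged sketch of sampling with and without replacement. The simulated 'ages' and the 200 bootstrap replicates are arbitrary choices for illustration.

```r
set.seed(1234)
obs <- runif(50, 0, 21)                 # simulated ages, as in the runif example

perm <- sample(1:10)                    # a permutation: every value exactly once
pick <- sample(1:100, 10, FALSE)        # 10 distinct values, no replacement

# Bootstrap: resample the data WITH replacement and recompute the statistic
boot_means <- replicate(200, mean(sample(obs, replace = TRUE)))
boot_se <- sd(boot_means)               # bootstrap estimate of the standard error
```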
Simulate Linear Model \(y=\beta_0 + \beta_1x + \epsilon\) w/ Random Num
where random noise is \(\epsilon \sim N(0,2^2)\).
Outcome is generated using the assumptions that
x is a standard normal distribution \(x \sim N(0, 1^2)\)
intercept is \(\beta_0 = 0.5\)
and coefficient \(\beta_1 = 2\)
set.seed(20)
x <- rnorm(100) # if x is binary, use rbinom(100, 1, 0.5)
e <- rnorm(100, 0, 2) # generate random err
y <- 0.5 + 2*x + e # add together after multiplying regression coefficients
summary(y) # ranges from -6 to +6
plot(x, y) # show linear relationship
Simulate Poisson Model \(Y \sim Poisson(\mu)\)
Log of mu follows a linear model where \(log(\mu) = \beta_0 + \beta_1*x\)
\(\beta_0 = 0.5\) and \(\beta_1 = 0.3\)
Use rpois - the error distribution is Poisson, not Normal.
set.seed(1)
x <- rnorm(100)
log.mu <- 0.5 + 0.3 * x # generate linear predictor intercept and coefficient of x
y <- rpois(100, exp(log.mu)) # exponentiate the linear predictor to get the mean, then simulate 100 Poisson counts
summary(y) # mean is 1.5 range is 0 to 6
plot(x,y) # plot shows linear relationship as x increases so does count for y
Data - Create/Outputs/Write
(see also Results)
dput(y) - creates ASCII representation of object y
dput(y, file = 'y.r') - output the ASCII representation to a text file
dget('y.r') - reconstructs the saved object
dump(c('x', 'y'), file = 'data.R') - like dput but works on multiple objects
writeLines(x, 'filename.txt', sep = '\n') - output lines from character vector x, writing each element to the text file
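A round-trip sketch for the functions above, using temp files. The objects x and y are invented for illustration, and the dumped code is sourced into a separate environment so nothing in the workspace is overwritten.

```r
x <- list(a = 1:3, b = "text")
y <- data.frame(id = 1:2, grp = c("a", "b"), stringsAsFactors = FALSE)

f1 <- tempfile(fileext = ".R")
f2 <- tempfile(fileext = ".R")

dput(y, file = f1)            # one object, written as deparsed R code
y2 <- dget(f1)                # read it back into a new name

dump(c("x", "y"), file = f2)  # several objects, by name
e <- new.env()
source(f2, local = e)         # recreates x and y inside environment e

unlink(c(f1, f2))
```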
Data - Structures(Objects), Modes (Types), Attributes
str(objectname) - most important command; displays structure of named object; especially useful for nested lists; use on functions, factors, data structures, datasets, matrices
s <- split(airquality, airquality$Month) - split data frame into a list of data frames by month
str(s) - display structure, that is, first few elements of each split list (object s)
Vector List
vector - basic object; set of elements from same atomic class; everything in R stored as vector; has two components, length and type
vector() - create an empty vector
c(1,5:10) - numeric vector
c(‘charlie’, ‘owen’, ‘rowan’, ‘cole’, ‘olivia’) - character vector
c(TRUE,TRUE,FALSE) - logical vector
a[2] - element 2 of vector a
a[c(2,4,5)] - elements 2,4,5 of vector a
List - ordered set of elements of different classes; can combine any type of data or structures, including a list of lists; contents indexed by double bracket [[]]
Array Matrix
Array - multi-dimensional matrix
Matrix - a special case of two-dimensional array; all cols same mode and same length; a vector with a dimension attribute
matrix(1:100, 25,4, TRUE) - create 25 row, 4 col matrix using 1 to 100, fill by row to override the byrow=FALSE default
cbind(1:3, 10:12) - create 2 col matrix
rbind(1:3, 10:12) - create 2 row matrix
dimnames(x) <- list(char_vector_rownames, char_vector_colnames) - add row and column names to matrix; list(c(‘r1’,‘r2’), c(‘c1’,‘c2’,‘c3’))
Data Frame Data Table
Data Frame - set of vectors of equal length with different classes; special type of list; each element of the list is like a column with length equal to number of rows; usually created by read.table() or read.csv(); convert to matrix with data.matrix(), but beware of effects of coercion
Data Table - special type of data frame; more like SQL; better for large, complex tasks
dim(x) - retrieve or set dimension of object x (matrix, array, data frame)
nrow(x) ncol(x) - number of rows or cols in x
Factor
Factors - character data internally represented as numeric; categorical data; can be unordered/unranked or ordered/ranked; factors with labels easier than numerical representations; receive special treatment by lm() and glm()
factor(x, levels = c(‘bad’, ‘not too bad’, ‘ok’, ‘better’, ‘best of all’)) - creates factors from specified vector and orders alphabetically, unless ordered with levels as in this example
Note: best to avoid stringsAsFactors = TRUE on data import.
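A short sketch of how levels control factor ordering. The ratings vector is made up, and the level labels reuse the example above.

```r
x <- c('ok', 'bad', 'best of all', 'ok')

f_alpha <- factor(x)   # levels default to sorted (alphabetical) order
f_set <- factor(x, levels = c('bad', 'not too bad', 'ok', 'better', 'best of all'))

levels(f_alpha)        # only the values present, alphabetically
as.integer(f_set)      # internal integer codes follow the supplied level order
table(f_set)           # unused levels still appear, with count 0
```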
as.character(x) - coerce x to character mode
as.numeric(x) - coerce x to numeric (reals/decimals including rational and irrational)
as.integer(x) - coerce x to integer (subset of reals)
1L - store numeric literal as integer; 5 is a numeric, 5L is an integer
NaN - Not a number; 0/0; missing value; class(NaN) # => [1] “numeric”
is.nan(x) - is x a number (FALSE) or not a number (TRUE)
as.complex(x) - coerce x to complex number (a + bi)
as.logical(x) - coerce x to logical
as.raw(x) - coerce x to raw bytes
is.finite(pi/0) - returns FALSE
is.infinite(pi/0) - returns TRUE
Dates and Times
- Dates represented by the Date class
- Times represented by the POSIXct or POSIXlt class
- POSIX classes keep track of leap years, leap seconds, DST, and time zones
- POSIXct - seconds since start of 1970.
- POSIXlt - list of vectors for sec, min, hour, mday, mon, year, wday, yday, isdst.
- Both output strings but store internally as date/time objects
- Dates stored internally as the number of days since 1970-01-01
- Times stored internally as the number of seconds since 1970-01-01
- Plot functions recognize and plot date / time differently
date() returns now in character string, “Wed Jul 24 14:51:45 2013”
format(Sys.time(), ‘%a %b %d %H:%M:%S %Y’) - same as above with customizable format
as.Date(x) - coerce string x to date
unclass(as.Date(‘1970-01-01’)) - shows zero because this is index date
as.POSIXct(x) - coerce x as time data; uses a class that is useful when times stored in something like a data frame
as.POSIXlt(x) - coerce x as a list plus other useful information like day of week, day of year, month, day of month
x$sec, x$min, x$hour, x$mday, x$mon, x$year, x$wday, x$yday, x$isdst - components of POSIXlt object x
x <- strptime(datestring, '%B %d, %Y %H:%M') - applies specified format in converting character vector 'datestring' into a POSIXlt object
?strptime - full details on date / time format syntax.
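A minimal sketch of the date/time classes above. The timestamp is arbitrary, and a numeric format with tz = 'UTC' is used here so the result does not depend on locale or local time zone.

```r
d <- as.Date('1970-01-02')
days <- unclass(d)                       # days since 1970-01-01

x <- strptime('2012-01-10 10:40', '%Y-%m-%d %H:%M', tz = 'UTC')  # POSIXlt
mins <- x$min                            # list-style access to components
yr <- x$year + 1900                      # year is stored as years since 1900
mon <- x$mon + 1                         # month is stored 0-11

ct <- as.POSIXct(x)                      # compact form: seconds since 1970-01-01
secs <- as.numeric(ct)
```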
Data Missing Values - NA NaN
NaN is NA; but NA is NOT a NaN
summary(x[, c(‘col1’, ‘col3’, ‘col7’)]) - total NA in specified columns of dataset x
sum(is.na(x)) - number of missing values in dataset x
nrow(x[!complete.cases(x), ]) - number of rows with missing data
is.na() - test if object is NA
is.nan() - test if literal numeric is not a number
Data Missing Values - Remove Replace
x <- c(1, 2, NA, 4, NA, 5) - numeric vector with missing values
y <- c('a', 'b', NA, 'd', NA, 'f') - character vector with matching values
x[!is.na(x)] - retrieve those that are not NA
complete.cases(x, y) - logical vector showing which positions in both x and y have complete data
data$col[data$col==''] <- NA # replace '' with NA
data <- data[!is.na(data$col),] # remove rows with NA in a column
newdata <- na.omit(data) # create new dataset without the missing data
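The pieces above, put together on the small x and y vectors. The data frame df is introduced here only for illustration.

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c('a', 'b', NA, 'd', NA, 'f')

good <- complete.cases(x, y)     # TRUE where both x and y are present
x_ok <- x[good]

df <- data.frame(x, y, stringsAsFactors = FALSE)
n_missing <- sum(is.na(df))                  # total NA cells
n_bad_rows <- nrow(df[!complete.cases(df), ]) # rows with any NA
clean <- na.omit(df)                          # drop those rows

# NaN counts as NA, but NA is not NaN
c(is.na(NaN), is.nan(NA))
```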
Data - Commands, Operators, Initial Explorations
attach(datasetname) - use to set dataset default; avoids typing datasetname with variables; prone to error in use
detach(datasetname) - use to remove specified dataset from search path
class(x) - display class of object x ; character, numeric, integer, logical
unclass(x) - strip out the class; reduce to integer
length(x) - length of vector x
names(x) - get / set names of object x
attributes(x) - show attributes of object x; include names, dimnames, class, length, and other user defined attribs and metadata; permits alterations of attributes
summary(x) - summary stats about columns (default) in x; shows tot NA in each col
summary(x[1:5,]) - summary stats about rows 1 to 5 in x; used to show row col index
unique(dataframe$colname) - what are the coded values for the named column
length(unique(dataframe$colname)) - how many unique values in the named column
LOGICAL OPERATORS
==, !=, >, >=, <, <=, &, |
= is NOT a logical operator
# symbol => means 'results in' or 'produces'
class(TRUE) # => [1] "logical"
class(FALSE) # => [1] "logical"
# Behavior is normal
TRUE == TRUE # => [1] TRUE
TRUE == FALSE # => [1] FALSE
FALSE != FALSE # => [1] FALSE
FALSE != TRUE # => [1] TRUE
# Missing data (NA) is logical, too
class(NA) # => [1] "logical"
MATH OPERATORS
+, -, *, / - when applied to vectors or arrays of same dimensions, math operation performed on each corresponding (i), or (i,j), or (i,j,…n) element
10 + 66 # => [1] 76
53.2 - 4 # => [1] 49.2
2 * 2.0 # => [1] 4
3L / 4 # => [1] 0.75
3 %*% 2 # => true matrix multiplication
Data Operations - Sorting and Subsetting
dfname[order(dfname$colname1, dfname$colname2, -dfname$colname3), ] - sort dfname by col1, col2 in ascending order; by col3 in descending (the minus sign works for numeric columns)
Subset a Vector
x[1] - extract 1st element of vector x
x[2:6] - extract elements 2 through 6
x[x > ‘a’] - extract all elements greater than ‘a’
u <- x > ‘a’ - create logical vector of those elements greater than ‘a’
x[u] - display those elements using the logical index
Subset a Matrix
x <- matrix(1:6, 2, 3) - create 2r 3c matrix fill 1 thru 6
x[1, 2] - get element in r1, c2; a single element is returned as a vector of length 1
x[2, 1, drop = FALSE] - get element in r2, c1; preserve dimensionality; returns 1x1 matrix
x[1, ] - get all of r1; returns vector
x[, 2, drop = FALSE] - get all of c2; returns matrix
Subset a List
x <- list(foo = 1:4, bar = 0.6) - create list with 2 elements named ‘foo’ and ‘bar’
x[1] - list name and sequence
x[[1]] - returns sequence, no name
x$bar - get specific element named ‘bar’ from the list
x[[‘bar’]] - return the value of ‘bar’
x[‘bar’] - single bracket returns obj of same class; use of name removes requirement for index position
x[c(1, 3)] - pass numeric vector to show elements 1 and 3 of list x
x[[c(1, 3)]] - recurses the list to get the 3rd element of the first element in list x
x[[c(2, 1)]] - as above, gets 1st element of the second element in the list
x[[2]][[1]] - alternative approach, double subsetting, same result
x$f - partial match, finds list element x dollar foo; fails if more than one begins with f
x[[‘f’, exact = FALSE]] - adding argument removes exact match requirement of [[]]
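A sketch of the list subsetting rules above. A third, nested element 'baz' is added here (an assumption, not in the original list) so the recursive and partial-match cases have something to act on.

```r
x <- list(foo = 1:4, bar = 0.6, baz = list(10, 20))

a <- x[1]              # single bracket: still a list, keeps the name
b <- x[[1]]            # double bracket: the element itself
cc <- x[c(1, 3)]       # list holding elements 1 and 3
d <- x[[c(1, 3)]]      # recursive: 3rd element of the 1st element
e <- x[[c(3, 2)]]      # 2nd element of the 3rd element, same as x[[3]][[2]]

p1 <- x$f                      # partial match: only 'foo' starts with f
p2 <- x$ba                     # ambiguous (bar, baz): returns NULL
p3 <- x[['f']]                 # [[ ]] needs an exact name: NULL
p4 <- x[['f', exact = FALSE]]  # partial matching switched on
```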
Data - Vectorized Operations
head(vec, 1) - first item in vector vec
tail(vec, 1) - last item in vector vec
length(vec) - length of vector vec
vec * 5 - multiply every item in vec by 5
mean(vec), var(vec), sd(vec), max(vec), min(vec), sum(vec) - summary stats on vector vec
x + y - add vectors x and y
x > 2 - returns vector of logical result for each element
x * y - element-wise product of vectors x and y
Matrix
Must have same class or will coerce into same class
z <- cbind(1:4, c(‘dog’, ‘cat’, ‘bird’, ‘dog’)) # 2 cols each character class
x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)
t(x) - transpose matrix x
x[1, ] - get first row of matrix x
3 * x[, 1] - multiply col 1 of matrix x by 3
x[2, 2] - get item in row 2, col 2 from matrix x
x * y - element wise multiplication
x / y - element wise division
x %*% y - true matrix multiplication, dot product rows x by cols y
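A quick contrast of element-wise and true matrix multiplication, using the x and y matrices defined above.

```r
x <- matrix(1:4, 2, 2)          # columns are (1, 2) and (3, 4)
y <- matrix(rep(10, 4), 2, 2)

elementwise <- x * y            # each cell multiplied by its counterpart
ratio <- x / y                  # each cell divided by its counterpart
product <- x %*% y              # rows of x dotted with columns of y
tx <- t(x)                      # rows and columns swapped
```

For example, product[1, 1] is 1*10 + 3*10 = 40, while elementwise[1, 1] is just 1*10 = 10.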
Dataframe
For combining cols of different classes.
df <- data.frame(c(5,2,1,4), c(‘dog’, ‘cat’, ‘bird’, ‘dog’)) - 2 cols each diff class
names(df) - see columns in dataframe
names(df) <- c('number', 'species') - assign col names
class(df$number) - check out the class of the first column
head(df) - see first few rows of dataframe; optionally add number of rows to show
df[,c(‘case’,‘dob’,‘race’)] - select columns by name
df[375:552, c(‘case’,‘dob’,‘race’)] - select specific rows and specific columns
df[df$age > 3, ] - select rows meeting a specific criterion
df[grep('Legacy', df$desc, ignore.case = TRUE), ] - select rows with 'Legacy' in the description col.
Data Operations - Tables
table(x) - frequency table of factor x
Data Calc Columns
DF$z <- with(DF, (x == 'A') & (y == 1)) - if both conditions hold in cols x & y, then col z is TRUE, else FALSE
lapply(x, function(zzz) zzz[, 1]) - extract first col from each matrix in list x using anonymous function zzz existing in context of lapply only
Packages
library() - list installed packages
library(packagename) - load named package; throws error if missing; processing stops
require(packagename) - load named package; returns FALSE with a warning if missing; processing continues
search() - display search path for packages available in the workspace
detach(package:packagename) - detach named package from search path
Must Have Packages - plyr or better dplyr, ggplot2, lme, knitr, RODBC, sqldf, stringr, lubridate, qcc, reshape2, randomForest
Other Data I/O
httr - working with http connections
RMySQL - interface with mySQL
bigmemory - handle datasets larger than RAM
RHadoop - interface R and Hadoop
foreign - data from other stats programs (SPSS, Stata, etc)
shiny - create interactive web pages
Basic Functions
Not to be confused with functions in base package. Some of these come from others.
?apply Functions
rnorm(100, 0, 1) - random normal distribution; observations, mean, sd
lapply - loop over a list and apply a function on each element
sapply - same as lapply but tries to simplify result
apply - apply functions over the margins of an array
tapply - apply function over subsets of a vector (table apply)
mapply - multivariate version of lapply
mapply function
Takes multiple args, unlike sapply and lapply
mapply(rep, 1:4, 4:1) - same as list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
mapply(noise, 1:5, 1:5, 2) - same as list(noise(1, 1, 2), noise(2,2,2), noise(3,3,2), noise(4,4,2), noise(5,5,2))
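A sketch of both mapply calls above. The original does not define noise, so the definition here is a hypothetical stand-in that draws n normal values with the given mean and sd.

```r
# Hypothetical helper, assumed for illustration
noise <- function(n, mean, sd) rnorm(n, mean, sd)

reps <- mapply(rep, 1:4, 4:1)        # pairs up the i-th elements of each argument

set.seed(1)
res <- mapply(noise, 1:5, 1:5, 2)    # n and mean vary together; sd = 2 is recycled
lengths(res)                         # 1, 2, 3, 4, 5 draws
```

Because the results have different lengths, mapply cannot simplify them and returns a list.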
tapply function
Apply a function over subsets of a vector
tapply(x, f, mean) - group mean of x grouped by factor f
tapply(x, f, range) - get range of x grouped by factor f
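A small worked case for the two tapply calls above. The vector x and three-level factor f are invented for illustration.

```r
x <- c(10, 20, 30, 40, 50, 60)
f <- gl(3, 2, labels = c('a', 'b', 'c'))   # a a b b c c

grp_mean <- tapply(x, f, mean)     # one number per group: simplified to a named vector
grp_range <- tapply(x, f, range)   # two numbers per group: returned as a list
```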
split w lapply and sapply
s <- split(airquality, airquality$Month) - split by month
lapply(s, function(x) colMeans(x[, c('Ozone', 'Solar.R', 'Wind')])) - get means of 3 specified columns with anon FUN function(x)
sapply(s, function(x) colMeans(x[, c(‘Ozone’, ‘Solar.R’, ‘Wind’)], na.rm = TRUE)) - simplified result matrix with missing values removed
str(split(x, list(f1, f2), drop = TRUE)) - split with drop removes interactions w/out elements; returns list by factor with observations
apply function
Use to evaluate a function over the margins of an array
apply(x, 1, sum) - calc sum of rows in matrix x; rowSums
apply(x, 1, mean) - calc mean of rows in matrix x; rowMeans
apply(x, 2, sum) - calc sum of cols in matrix x; colSums
apply(x, 2, mean) - calc mean of cols in matrix x; colMeans
apply(x, 1, quantile, probs = c(0.25, 0.75)) - get matrix of 25th and 75th quantile for each row in matrix x
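A sketch checking apply against the dedicated row/column functions. The 3 x 4 matrix is arbitrary.

```r
x <- matrix(1:12, 3, 4)

row_tot <- apply(x, 1, sum)      # margin 1 = rows
col_avg <- apply(x, 2, mean)     # margin 2 = columns

# Extra arguments (probs) are passed through to the function
q <- apply(x, 1, quantile, probs = c(0.25, 0.75))
dim(q)                           # one column per row of x, one row per quantile
```

rowSums, rowMeans, colSums and colMeans give the same answers as the first four apply calls and are much faster on large matrices.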
lapply function
Always returns list object
lapply(x, mean) - returns single list with mean for each element in list x
lapply(x, runif, min = 0, max =1) - generates runif(x) for each value of list x
sapply function
Variant of lapply with simplified result
sapply(x, mean) - returns single vector with mean for each element in list x
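lapply and sapply side by side on the same list. The list x is invented for illustration.

```r
x <- list(a = 1:5, b = c(2, 4, 6))

as_list <- lapply(x, mean)   # always a list: $a 3, $b 4
as_vec <- sapply(x, mean)    # simplified to a named numeric vector

# lapply passes extra arguments on to the function
draws <- lapply(1:3, runif, min = 0, max = 1)   # runif(1), runif(2), runif(3)
```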
Arguments
args(‘nameofFUN’) - argument name and corresponding defaults of FUN
argument name - use when position not known; if no exact match, checks partial match; if no partial match, checks position
… - specifies generic functions; variable number of arguments usually passed to other functions; necessary when that number is not known in advance; arguments after … named explicitly w/out partial match
Lexical, Static Scoping
.GlobalEnv and package:base - always first and last on the search list; order of packages in search list matters; functions and non-functions have separate namespaces; can have same name
ls(environment(functionname)) - list objects in the environment of the named function
# CAUTION removes all variables in this session to support the example.
rm(list=ls(all=TRUE))
g <- function(x) {
a <- 3
x + a + y
}
g(2)
# => Error in g(2) : object 'y' not found
# but works if y defined.
y <- 3
g(2)
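A further sketch of lexical scoping: a function defined inside another function looks up free variables in the environment where it was defined. The names make.power and cube are illustrative, not from the original.

```r
make.power <- function(n) {
  function(x) x^n        # n is a free variable, found in make.power's environment
}

cube <- make.power(3)
square <- make.power(2)

cube(2)                          # 8
square(2)                        # 4
ls(environment(cube))            # "n"
get("n", environment(cube))      # 3
```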
Data Results Presentation and Output Devices
?Devices - list of graphical devices
dev.cur() - determine current output device
dev.list() - list of open (active) graphics devices
dev.next - switch control to next graphics device on the list
dev.set - set control to specific graphics device
dev.copy - copy plot to another device; not exact; check integrity of results
dev.copy2pdf - copy plot to pdf file
dev.off - close the current graphics device
Sys.sleep(1) - introduce pause
Base 2D Graphics
Parameters can be added to existing plot via graphics device.
par() specifies global params that affect all plots in the R session. Although set globally, these params can often be overridden as args to specific plotting functions. par('parameter') specifies specific
pch - the plotting symbol (def is open circle)
lty - line type (def is solid). can be dashed, dotted, etc
lwd - line width, specified as integer multiple
col - plotting color, specified as number, string, or hex; the colors function gives a vector of colors by name. R has 657 built in colors.
las - orientation of the axis labels on the plot
bg - background color (def is none)
mar - margin size; goes clockwise from bottom
oma - the outer margin size (def = 0 for all sides); use with multiple plots
mfrow - number of plots per row, col (plots filled row-wise)
mfcol - number of plots per row, col (plots filled col-wise)
Base Plot Functions
plot(x) - makes scatterplot, or other type depending on class of the object being plotted; the annotation functions (abline, lines, points, text, title, mtext, axis) are add-ons to a plot and won't work unless a plot has already been created.
hist(x) - makes histogram
boxplot(x) - boxplot
abline(linearmodel|intercept and slope) - add line to plot
lines - add lines to a plot, given a vector of x values and a corresponding vector of y values (or 2 col matrix); this FUN just connects the dots.
points - add points to a plot
text - add text labels to a plot using specified x, y coordinates
title - add annotations to x, y axis labels, title, subtitle, outer margin
mtext - add arbitrary text to the margins (inner or outer) of the plot
axis - add axis ticks/labels
Base Plot EG
par(mfrow = c(1, 1)) # reset display to 1 x 1
x <- rnorm(100) # create two simulated groups male and females
y <- x + rnorm(100)
g <- gl(2, 50, labels = c('Male', 'Female'))
plot(x, y, type = 'n') # data by gender set up plot, but do not draw
points(x[g == 'Male'], y[g == 'Male'], col = 'green', pch = 18) # subset the vectors for male and female
points(x[g == 'Female'], y[g == 'Female'], col = 'blue', pch = 19)
Lattice/Grid Graphics
All parameters specified at once; return object of class trellis; easier to write R Script then run. Produce multi panel displays for specified conditions.
plotFUN(y ~ x | f * g) - general format; y plotted against x, conditioned on every combination of the levels of factors f and g
Lattice Plot Functions
xyplot - most important; creates scatterplots
bwplot - box and whiskers plots (‘boxplots’)
histogram - histograms
stripplot - like boxplot but with actual points
dotplot - plot dots on ‘violin strings’
splom - scatterplot matrix; like pairs in base graphics
levelplot, contourplot - for plotting ‘image’ data
Multiple Factor Conditioning Lattice EG
Illustrates interaction effects of multiple variables.
require(lattice)
# establish 4 levels (lattice shingles) for the continuous vars temperature and wind; use plot(temp.cut) to see overlapping ranges
temp.cut <- equal.count(environmental$temperature, 4)
wind.cut <- equal.count(environmental$wind, 4)
# scatterplots for each combination of temperature and wind
xyplot(ozone ~ radiation | temp.cut * wind.cut, data = environmental, as.table = TRUE, pch = 20, main = 'Ozone vs Radiation by Temp & Wind',
panel = function(x,y, ...){
panel.xyplot(x, y, ...)
panel.loess(x, y)
}, xlab = 'Solar Radiation', ylab = 'Ozone(ppb)')
Splom and Histogram EGs
Additional examples of illustrating variable interaction
# scatterplot matrix of all interactions
splom(~ environmental)
# distribution of temperature as wind changes; temps a bit higher when wind lower
histogram(~ temperature | wind.cut, data = environmental)
# distribution of ozone as wind changes; greater distribution in lower range with more wind
# addition of layout improves readability
histogram(~ ozone | wind.cut, layout = c(1, 4),data = environmental)
# add all wind and temp combinations; low temp and high wind have lowest level concentrations
histogram(~ ozone | temp.cut * wind.cut, data = environmental)
Color in R
colors() - vector of available colors; can ref by name
rainbow(n) - generates rainbow palette of n colors
plot(x, y, col = rgb(0, 0, 0, 0.2), pch = 19) - apply RGB values directly to plot colors
heat.colors(n, alpha=1) - create vector of contiguous colors with no transparency; alpha parameter (0-1) controls transparency
topo.colors(), terrain.colors(), cm.colors() - variants of above
pal <- colorRamp(c(‘red’,‘blue’)) - takes palette of colors and returns a function that takes values between 0 and 1 indicating the extremes of the color palette
pal(0), pal(0.5), pal(1), pal(seq(0,1, len=10)) - outputs the colorRamp RGB values
pal <- colorRampPalette(c('red', 'yellow')) - takes palette of colors and returns function with integer arguments that returns a vector of colors interpolating the palette
pal(2), pal(10) - show the 2 colours; show the steps between the 2 colors assigned to pal
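A sketch of the two interpolators above. Separate names (ramp, ramp_pal) are used here, instead of reusing pal, so both can be inspected together.

```r
ramp <- colorRamp(c('red', 'blue'))          # function of values in [0, 1]
lo <- ramp(0)                                # 1 x 3 matrix of R, G, B: pure red
hi <- ramp(1)                                # pure blue
mid <- ramp(0.5)                             # halfway between the two

ramp_pal <- colorRampPalette(c('red', 'yellow'))  # function of an integer count
two <- ramp_pal(2)                           # just the endpoints, as hex strings
ten <- ramp_pal(10)                          # 10 steps from red to yellow
```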
RColorBrewer 3 types of palettes
- Sequential (ordered, low / high, continuous or not); ordered data, light for low, dark for high
- Diverging (data that moves away from a centre; neg and positive); equal emphasis on mid-range values with light colors, low and high extremes with dark
- Qualitative (not ordered but diff values); no magnitude differences; best for nominal or categorical data
brewer.pal(n, 'PaletteName') - create color palette of n colors using built in 'PaletteName'
display.brewer.pal(3, 'BuGn') - displays the palette in a graphics window
display.brewer.all(type='seq'|'div'|'qual') - displays palettes of type simultaneously in a graphics window
See library(colorspace) for modifying/creating custom versions of the 3 types of RColorBrewer palettes
Palette information can be used with colorRamp() and colorRampPalette()
pal <- colorRampPalette(brewer.pal(3, 'BuGn')) - create graduated palette
image(volcano, col = pal(20)) - apply to volcano topography
smoothScatter(x, y, colramp = pal) - smoothScatter can use RColorBrewer palettes via colramp
Math Annotation
LaTeX-like math annotation on plots in any package where args take text
?plotmath - full syntax list
expression(mathexpressions) - wrap notation in expression FUN within plot
Main, ylab, xlab EG
Illustrating use of math expressions on graph output
# title, y and x labels with symbols and formulas
plot(0, 0, main = expression(theta == 0), ylab = expression(hat(gamma) == 0),
xlab = expression(sum(x[i] * y[i], i == 1, n)))
# alt version concatenating strings with expressions
plot(0, 0, main = expression(theta == 0), ylab = expression(hat(gamma) == 0),
xlab = expression('The mean (' * bar(x) * ') is ' * sum(x[i]/n, i ==
1, n)))
# Use 'substitute' when the expression is a computation.
# Following labels the axes with the mean of x and y.
plot(x, y, xlab = substitute(bar(x) == k, list(k = mean(x))),
ylab = substitute(bar(y) == k, list(k = mean(y))))
Types of Analysis
Descriptive - first data analysis; describes a set of data; census; no generalization w/out stats modelling
Exploratory - find relationships; suggest hypothesis; assess inference assumption; basis for getting more data; support best fit and techniques
Inferential - use a sample to say something about a population; modelling goal; estimate quantities of concern and uncertainty of estimate; depends on population and sampling
Predictive - more data, simple model works best for determining if X predicts Y; X predicts Y does not mean X causes Y
Causal - usu requires randomized studies, non randomized sensitive to assumptions; find what happens to one variable when changing another; gold standard for data analysis; apply as average effects, not to every individual
Mechanistic - understand exact changes in variables that lead to changes in other variables for individual objects; deterministic equations; random component is measurement err; unknown parameters for known equations may be inferred with data analysis.
Define the Question
Practice/business - What are you trying to know, understand, apply, analyze, synthesize, or evaluate?
Theoretical assumptions - what theory / assumptions do the above include?
Scientific basis - what known science supports the questions?
Analytic approach - see types of analysis above
Ideal evidence - what would the ideal evidence look like?
Available evidence - what is the available evidence?
Interpretation - what does this mean for the original question?
Challenge - everything: question, data source, processing, analysis conclusions, measures of uncertainty, terms in the model; what are possible alternative interpretations?
Synthesis - summary and conclusions based on above
Consider schematic representations for different types of studies applying frequentist or Bayesian analysis.
Coding Basics
Create READABLE,
DEBUGGED,
TEST-DRIVEN,
RE-USABLE code using
consistent STYLE
names() - if the output is not self explanatory, naming is sub-optimal
Reserved Words
if, else, repeat, while, function, for, in, next, break
TRUE, FALSE, NULL, Inf, NaN, NA
NA_integer_ , NA_real_, NA_complex_ and NA_character_
Control Structures
Following is primarily for programs; for the interactive command line, the ?apply functions are generally more useful.
if, else if, else
if(condition1) {
## do something
} else if(condition2) { # any number of else if
## do something different
} else { # else must come last; optional; if omitted, nothing happens when no condition is met
## do something different
}
for loops
x <- c('a', 'b', 'c', 'd')
for(i in 1:4) { # standard for loop
print(x[i])
}
for(i in seq_along(x)) { # using seq_along function makes index equal to elements in x
print(x[i])
}
for(letter in x) { # iterate through letter data in x
print(letter)
}
for(i in 1:4) print(x[i]) # for single expression curly bracket not reqd
nested for loop
x <- matrix(1:6, 2, 3) # create 2r by 3 col matrix
for(i in seq_len(nrow(x))) { # starting at r1, get row index
for(j in seq_len(ncol(x))) { # next at c1, get col index
print(x[i,j]) # print the value at the row, col intersect
} # efficient but not easy to understand
}
while loop / repeat
count <- 0
while(count < 10) {
print(count)
count <- count + 1
}
x0 <- 1
tol <- 1e-8
repeat { # useful for converging algorithms; loops with hard limit usually better
x1 <- computeEstimate() # assumes a user-defined function 'computeEstimate'
if(abs(x1 - x0) < tol) {
break # required to avoid endless loop
} else {
x0 <- x1
}
}
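A runnable variant of the convergence loop with a hard iteration limit, as the comment above recommends. computeEstimate is not defined in this outline, so the version here is a hypothetical stand-in (a Newton step toward sqrt(2)), and max_iter is an assumed cap.

```r
# Hypothetical estimator: one Newton step for the square root of 2
computeEstimate <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
max_iter <- 100                     # hard limit guards against an endless loop
converged <- FALSE

for (iter in seq_len(max_iter)) {
  x1 <- computeEstimate(x0)
  if (abs(x1 - x0) < tol) {         # successive estimates agree: stop
    converged <- TRUE
    break
  }
  x0 <- x1
}
```

If the loop ends with converged still FALSE, the limit was reached without meeting the tolerance.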
next loop
for(i in 1:100) {
if(i <= 20) { # skip the first 20 iterations
next
}
## do something
}
Classes and Methods
?Classes and ?Methods (long) and help pages for methods package - primary documentation
?setClass, ?setMethod and ?setGeneric (other related details) - technical, aimed at programmer/developer, not end user; makes sense with use. Casual user will not define classes/methods
(Chambers, John), Programming with Data: A Guide to the S language - (green book) user -> programmer philosophy
S3 (old) and S4 (new) Coexistence
- separate systems new (and future), but can be mixed to a degree
- each can be used fairly independently of each other
- developers of new projects encouraged to use S4 style
- S3 continued use driven by ease of its ‘quick and dirty’ style
- lecture focus is S4
- code for implementing S4 is in the methods package
OOP in R
class(x) - get class of any R object x
* a class - a description of a thing; class defined by setClass() in methods package.
* an object - an instance of a class. Objects created using new().
* a method - a function that only operates on a certain class of objects.
* a generic function is an R function which dispatches methods. A generic function typically encapsulates a generic concept (plot, mean, predict,…); does not do any computation.
* a method is the implementation of a generic function for an object of a particular class.
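The terms above in a minimal S4 sketch. The class 'Person', its slots, and the generic 'describe' are invented names used only to illustrate setClass, new, setGeneric and setMethod.

```r
library(methods)

# A class: a description of a thing, with typed slots
setClass('Person', representation(name = 'character', age = 'numeric'))

# A generic: does no computation itself, only dispatches to a method
setGeneric('describe', function(obj, ...) standardGeneric('describe'))

# A method: the implementation of the generic for one class
setMethod('describe', 'Person', function(obj, ...) {
  paste(obj@name, 'is', obj@age)
})

# An object: an instance of the class
p <- new('Person', name = 'Ada', age = 36)
out <- describe(p)
```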
Places to Learn S4 style
- best way to learn is to look at examples and do the exercises
- http://www.bioconductor.org - rich resource even if you know nothing about bioinformatics
- some CRAN packages - SparseM, gpclib, flexmix, its, lme4, orientlib, pixmap
- stats4 package has classes/methods for doing maximum likelihood analysis