1 Outline R - The Basics

Objective: Identify core features and functions essential to working with R; include sources and further learning references.

1.1 Help

help.start() - see intro materials installed with R
?help - understand the built-in help system
?median - learn about the median (or any R function); explanations end with helpful examples
example(mean) - run the examples for the function in parentheses; use on any function in this document
help.search('median') - search across documentation for median
apropos('termorregex') - finds objects in the search list matching expression
vignette() - list available vignettes for installed packages
RSiteSearch('some term') - submits search to r-project.org
Quick-R - Fab learning site by Robert I. Kabacoff
R Seek Search Site - try it out with ‘child welfare’
sos Package - search through help files in packages; usually more detailed than R Seek.
CRAN Task Views - Social Science view, a good start.
Getting Help page - notes from Jeffrey Leek (internal only)

1.1.1 Bottom up: R Results thru CRAN

Conceptual organizing framework for this outline.

R produces
  RESULTS through
    ANALYSIS using
      PACKAGES (including sample data) running
        FUNCTIONS with
          ARGUMENTS on
            R OBJECTS of
              DATA STRUCTURES created by
                SIMULATION or loaded from
                  DATA IMPORTS (or connections) into a
                    PROJECT running in an
                      R ENVIRONMENT (WORKSPACE) installed on a
                        LOCAL MACHINE (or a key, or remote server) after
                          CRAN (or GitHub or Bioconductor) download

1.1.2 Top down: CRAN through R Results

CRAN/Machine/Workspace > Projects > I/O Sum > Data Read/Sources > Simulation > Data Out > R Objects > Data Structures > Functions > Arguments > Packages > Analysis > Results

This partially inverted view of the above framework organizes the sections of this document.

1.2 CRAN sources, Local Machine, R Environment/workspace

CRAN - source code, instructions
RStudio - helpful IDE/GUI for R
getwd() - display current working directory
setwd() - sets working directory; windows syntax clumsy; easier within RStudio > Session >
Sys.getenv() - local computing environment important to R processing
Sys.time() - current system time; check before doing date / time analysis
options() - view (and set) global options affecting R computations and displays
installed.packages() - display list of installed packages
.libPaths() - display path to library of installed packages
ls() - list objects in the workspace
ls.str('package:foo') - list objects in attached package foo with their structure; lsf.str('package:foo') lists only the functions
list.files() - list files in the working directory
rm(x, y, z) - remove specified objects from the workspace
dir() - display files and folders in current directory
list.dirs() - display directories under current one
sessionInfo() - R version, platform, loaded packages, computer locale settings; nice to add to end of knitr doc
R.version - reports installed version of R; R.Version() returns the same details as a list
q() - quit R

1.2.1 Updating R

# installing/loading the package:
if (!require(installr)) {
  install.packages("installr")
  require(installr)
} # load installr, installing it first if needed

# using the package:
# updateR() starts the updating process of your R installation. It checks for
# newer versions and, if one is available, guides you through the decisions
# you'd need to make.
updateR()

1.3 Projects - organization

Steps: define question; define ideal data set; identify available sources; obtain; clean; explore; predict/model; interpret results; challenge; synthesize; create reproducible code
RStudio - create project integrated with version control system
Project Template - http://projecttemplate.net/ tutorial and config details
library(‘ProjectTemplate’) - run in dir where proj files desired
create.project(‘projectname’, minimal=TRUE) - creates min set of proj folders in current dir
setwd(‘C:/[path2projectname]’) - change to new proj dir
See Letters-Minimal, a quick implementation of project template

1.4 Data - I/O Summary

Type of data	Input to R	Output from R
tabular	read.table, read.csv	write.table
text file	readLines	writeLines
R code files	source (inverse dump)	dump
deparsed R code files	dget (inverse dput)	dput
saved workspace	load	save
binary R objects	unserialize	serialize

write.table(df, 'childyouth.txt', col.names = NA) - or row.names = FALSE for spreadsheet compatibility; sep = '\t' for tab prevents split of POSIX data into date and time
dump and dput preserve metadata
dump(c('x', 'y'), file = 'data.R') # for R list obj x and y
dput(x, file = 'data.r') # for R data frame
source(‘data.R’) - recreates data object in file
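
As a minimal sketch of the round trips described above, the following writes to temporary files (the object names and contents are made up for illustration) and reads the objects back:

```r
# Round trip one object with dput()/dget().
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
f1 <- tempfile(fileext = ".R")
dput(df, file = f1)          # write a deparsed text representation
df2 <- dget(f1)              # rebuild the object from that text
all.equal(df, df2)

# Round trip several objects with dump()/source().
x <- 1:5
y <- list(a = "foo", b = 2.5)
f2 <- tempfile(fileext = ".R")
dump(c("x", "y"), file = f2) # objects are named as a character vector
rm(x, y)
source(f2)                   # recreates x and y
```

Unlike dget(), source() restores the objects under their original names rather than returning a value.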

1.5 Data - Imports(Read) and Sources

read.table('sourcefile.csv', header=TRUE|FALSE, sep=',', row.names='rowidheader', skip=n, stringsAsFactors=TRUE|FALSE) - importing from CSV file; see help file for large data hints; stringsAsFactors=FALSE is usually better; set comment.char='' when the file has no comments; for large files use colClasses as in the following example

initial <- read.table('datatable.txt', nrows = 100) # get classes from 1st 100 rows of large table
classes <- sapply(initial, class)  
tabAll <- read.table('datatable.txt', colClasses = classes)

xpartial <- read.table(‘datatable.txt’, nrows = 100) - create DF from 1st 100 rows
classes <- sapply(xpartial, class) - save class info by applying class function over x
xfull <- read.table(‘datatable.txt’, colClasses = classes) - use colClasses to read full source and apply classes extracted from partial table.
odbcConnectExcel() - open an ODBC connection to an Excel file (alternatively, export to CSV or DIF and read that); requires RODBC
spss.get('spsssourcefile.por', use.value.labels=TRUE) - load and convert value labels to factors; requires package Hmisc
read.dta('statasourcedatafile.dta') - read in Stata file; requires library(foreign)
source('data.r') - reads in and recreates objects in the file
readLines('filename.txt') - read lines and store in character vector
scan() - read file into vector
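
The colClasses approach above can be tried end to end on a small stand-in file. This sketch assumes a made-up three-column CSV written to a temporary location; with a truly large file the benefit is that read.table does not have to guess column types.

```r
# Write a small CSV to a temporary file (stand-in for a large source file).
src <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:200, score = runif(200),
                     grp = rep(c("a", "b"), 100)),
          src, row.names = FALSE)

# Read the first 100 rows, record the column classes, then read everything.
xpartial <- read.table(src, header = TRUE, sep = ",", nrows = 100,
                       stringsAsFactors = FALSE)
classes  <- sapply(xpartial, class)
xfull    <- read.table(src, header = TRUE, sep = ",", colClasses = classes)
str(xfull)
```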

1.5.1 Sources

data() - displays data sets available in loaded packages
machine learning - http://archive.ics.uci.edu/ml/datasets.html for machine learning datasets

1.6 File Connections

file(description = '', open = 'r|w|a|rb|wb|ab', blocking = TRUE, encoding = getOption('encoding'), raw = FALSE)
con <- file('foo.txt', 'r') - text file abstracted connection
con <- gzfile('words.gz') - gzipped file connection
con <- bzfile('words.bz') - bzip2 compressed file connection
con <- url('http://www.torontocas.ca', 'r') - abstract connection to web address
x <- readLines(con, 10) - read first 10 lines of the connection
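
A short sketch of the connection workflow, using a temporary text file created on the spot (the file contents are invented for the example):

```r
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:20), tmp)

con <- file(tmp, "r")          # open a read-only text connection
first10 <- readLines(con, 10)  # read the first 10 lines
next5   <- readLines(con, 5)   # an open connection remembers its position
close(con)                     # close connections you open
```

Reading from an explicitly opened connection is incremental, which is why the second call continues at line 11.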

1.7 Data - Simulation Functions

No data? Create your own.
set.seed(1234) - set random generator with same seed number for reproducibility
rep(x, n) - replicates object x, n times
seq(a, b, by = n) - create numbers a to b by interval n.

1.7.1 Distributions

d for density; r for random number generation; p for cumulative distribution; q for quantile function

1.7.1.1 Normal distribution

rnorm(n, mean, sd) - generate n random Normal variates (particular outcomes of a random variable) with a given mean and standard deviation.
dnorm(x, mean, sd, log=TRUE|FALSE) - returns height of Normal probability density (with a given mean and sd) at a point or vector of points x; the default is log=FALSE, and log=TRUE returns the log density.
pnorm(q, mean, sd, lower.tail=TRUE|FALSE, log.p) - probability that a normal random number will be less than or equal to q; lower.tail=FALSE for upper tail
qnorm(p, mean, sd) - evaluate the quantiles for the normal distribution with given mean/SD; returns the quantile (a Z score for the standard normal) for probability p.
If $\Phi$ is the cumulative distribution function for a standard normal distribution, then pnorm(q) = $\Phi(q)$ and qnorm(p) = $\Phi^{-1}(p)$, that is, the inverse of $\Phi$.
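
The inverse relationship between pnorm and qnorm can be checked directly. A minimal sketch for the standard normal:

```r
p <- pnorm(1.96)                 # P(Z <= 1.96), about 0.975
q <- qnorm(p)                    # inverse of pnorm, recovers 1.96
all.equal(q, 1.96)
pnorm(1.96, lower.tail = FALSE)  # upper tail, about 0.025
dnorm(0)                         # density at the mean, 1 / sqrt(2 * pi)
```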

1.7.1.2 Poisson distribution

rpois(x, rate) - generate x random Poisson variates with a given rate; mean will be approx equal to rate
ppois(x, rate) - probability that a Poisson variable with the given rate is less than or equal to x

1.7.2 Other distributions

rbeta()
rbinom(1000, 1, 0.52) - gen 1000 observations, 0-1, slightly higher chance .52 vs .5, eg, sex
rcauchy()
rchisq()
rexp()
rf()
rgamma()
rgeom()
rhyper()
rlogis()
rlnorm()
rnbinom()
rt()
runif(1000, 0, 21) - gen 1000, min 0, max 21, eg ages
rweibull()

1.7.3 Densities

dbeta()
dbinom()
dcauchy()
dchisq()
dexp()
df()
dgamma()
dgeom()
dhyper()
dlogis()
dlnorm()
dnbinom()
dnorm()
dpois()
dt()
dunif()
dweibull()

1.7.4 Sampling

sample(1:100, 10, FALSE) - random 10 from 100, no replacement (repeats), equal selection probability; use replacement when bootstrapping
sample(1:10) - generates permutation
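
To illustrate sampling with replacement for bootstrapping, here is a small sketch on simulated data (the sample size, mean, and number of resamples are arbitrary choices):

```r
set.seed(1234)
x <- rnorm(50, mean = 10, sd = 2)   # simulated data
# 1000 bootstrap resamples of x, each drawn with replacement
boot_means <- replicate(1000, mean(sample(x, length(x), replace = TRUE)))
sd(boot_means)        # bootstrap estimate of the standard error of the mean
perm <- sample(1:10)  # a random permutation of 1 to 10
```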

1.7.5 Simulate Linear Model $y=\beta_0 + \beta_1x + \epsilon$ w/ Random Num

where random noise is $\epsilon \sim N(0,2^2)$.
The outcome is generated using the assumptions that
x is a standard normal distribution $x \sim N(0, 1^2)$
intercept is $\beta_0 = 0.5$
and coefficient $\beta_1 = 2$

set.seed(20)
x <- rnorm(100) # if x is binary, use rbinom(100, 1, 0.5)
e <- rnorm(100, 0, 2) # generate random err
y <- 0.5 + 2*x + e # add together after multiplying regression coefficients
summary(y) # ranges from -6 to +6
plot(x, y) # show linear relationship
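
One way to check a simulation like this is to fit the model back to the simulated data and compare the estimates with the true values. This sketch repeats the simulation and uses lm(), which is not part of the original example:

```r
set.seed(20)
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
fit <- lm(y ~ x)   # ordinary least squares fit
coef(fit)          # estimates should land near 0.5 (intercept) and 2 (slope)
```

With only 100 observations and noise sd of 2, the estimates will not match the true coefficients exactly.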

1.7.6 Simulate Poisson Model $Y \sim Poisson(\mu)$

Log of mu follows a linear model where $\log(\mu) = \beta_0 + \beta_1 x$
$\beta_0 = 0.5$ and $\beta_1 = 0.3$
Use rpois - the outcome follows a Poisson, not a normal, distribution.

set.seed(1)
x <- rnorm(100)
log.mu <- 0.5 + 0.3 * x  # linear predictor: intercept plus coefficient times x
y <- rpois(100, exp(log.mu))  # exponentiate the linear predictor to get the mean, then simulate 100 Poisson variates
summary(y)  # mean is 1.5 range is 0 to 6
plot(x, y) # plot shows the count for y tending to increase as x increases
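
As with the linear model, the simulated counts can be fed back into a Poisson regression to see whether the coefficients are recovered. The glm() call here is an addition for illustration, not part of the original notes:

```r
set.seed(1)
x <- rnorm(100)
y <- rpois(100, exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)  # Poisson family uses a log link by default
coef(fit)                            # compare with 0.5 and 0.3
```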

1.8 Data - Create/Outputs/Write

(see also Results)
dput(y) - creates ASCII representation of object y
dput(y, file = 'y.r') - output to text file
dget('y.r') - reconstructs the saved object
dump(c('x', 'y'), file = 'data.R') - like dput but works on multiple objects
writeLines(x, 'filename.txt', sep = '\n') - output lines from character vector x, writing each element as a line of the text file

1.9 Data - Structures(Objects), Modes (Types), Attributes

str(objectname) - most important command; displays structure of named object; especially useful for nested lists; use on functions, factors, data structures, datasets, matrices
s <- split(airquality, airquality$Month) - split data frame into a list by month
str(s) - display structure, that is, first few elements of each split list (object s)

1.9.1 Vector List

vector - basic object; set of elements from same atomic class; everything in R stored as vector; has two components, length and type
vector() - create an empty vector
c(1,5:10) - numeric vector
c(‘charlie’, ‘owen’, ‘rowan’, ‘cole’, ‘olivia’) - character vector
c(TRUE,TRUE,FALSE) - logical vector
a[2] - element 2 of vector a
a[c(2,4,5)] - elements 2,4,5 of vector a
List - ordered set of elements of different classes; can combine any type of data or structures, including a list of lists; contents indexed by double bracket [[]]

1.9.2 Array Matrix

Array - multi-dimensional matrix
Matrix - a special case of two-dimensional array; all cols same mode and same length; a vector with a dimension attribute
matrix(1:100, 25,4, TRUE) - create 25 row, 4 col matrix using 1 to 100, fill by row to override the byrow=FALSE default
cbind(1:3, 10:12) - create 2 col matrix
rbind(1:3, 10:12) - create 2 row matrix
dimnames(x) <- list(char_vector_rownames, char_vector_colnames) - add row and column names to matrix; list(c(‘r1’,‘r2’), c(‘c1’,‘c2’,‘c3’))

1.9.3 Data Frame Data Table

Data Frame - set of vectors of equal length with different classes; special type of list; each element of the list is like a column with length equal to number of rows; usually created by read.table() or read.csv(); convert to matrix with data.matrix(), but beware of effects of coercion
Data Table - special type of data frame; more like SQL; better for large, complex tasks

dim(x) - retrieve or set dimension of object x (matrix, array, data frame)
nrow(x) ncol(x) - number of rows or cols in x

1.9.4 Factor

Factors - character data internally represented as numeric; categorical data; can be unordered/unranked or ordered/ranked; factors with labels easier than numerical representations; receive special treatment by lm() and glm()
factor(x, levels = c(‘bad’, ‘not too bad’, ‘ok’, ‘better’, ‘best of all’)) - creates factors from specified vector and orders alphabetically, unless ordered with levels as in this example
Note: best to avoid stringsAsFactors = TRUE on data import.
as.character(x) - coerce x to character mode
as.numeric(x) - coerce x to numeric (reals/decimals including rational and irrational)
as.integer(x) - coerce x to integer (subset of reals)
1L - store numeric literal as integer; 5 is a numeric, 5L is an integer
NaN - Not a number; 0/0; missing value; class(NaN) # => [1] “numeric”
is.nan(x) - is x a number (FALSE) or not a number (TRUE)
as.complex(x) - coerce x to complex number (a+bi)
as.logical(x) - coerce x to logical
as.raw(x) - coerce x to raw bytes

is.finite(pi/0) - returns FALSE
is.infinite(pi/0) - returns TRUE
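
A brief sketch of factor levels and the coercion functions listed above, using made-up values:

```r
x <- c("ok", "bad", "best of all", "ok", "bad")
f <- factor(x, levels = c("bad", "ok", "best of all"))
levels(f)        # order given by the levels argument, not alphabetical
as.integer(f)    # underlying integer codes: 2 1 3 2 1
table(f)         # counts per level

as.numeric("3.7")                      # 3.7
as.integer(3.7)                        # truncates toward zero: 3
as.logical(c("TRUE", "false", "yes"))  # TRUE FALSE NA
is.nan(0 / 0)                          # TRUE
```

Coercion that cannot succeed yields NA rather than an error, as the "yes" entry shows.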

1.10 Dates and Times

Dates represented by the Date class
Times represented by the POSIXct or POSIXlt class
POSIX classes keep track of leap years, leap seconds, DST, and time zones
POSIXct - seconds since start of 1970.
POSIXlt - list of vectors for sec, min, hour, mday, mon, year, wday, yday, isdst.
Both output strings but store internally as date/time objects
Dates stored internally as the number of days since 1970-01-01
Times stored internally as the number of seconds since 1970-01-01
Plot functions recognize and plot date / time differently

date() returns now in character string, “Wed Jul 24 14:51:45 2013”
format(Sys.time(), ‘%a %b %d %H:%M:%S %Y’) - same as above with customizable format
as.Date(x) - coerce string x to date
unclass(as.Date(‘1970-01-01’)) - shows zero because this is index date
as.POSIXct(x) - coerce x as time data; uses a class that is useful when times stored in something like a data frame
as.POSIXlt(x) - coerce x as a list plus other useful information like day of week, day of year, month, day of month
x$sec, x$min, x$hour, x$mday, x$mon, x$year, x$wday, x$yday, x$isdst - value of specified POSIXlt object
x <- strptime(datestring, '%B %d, %Y %H:%M') - applies specified format in converting character vector 'datestring' into a POSIXlt object
?strptime - full details on date / time format syntax.
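
The internal representations described above can be inspected directly. This sketch uses a numeric date format and an explicit UTC time zone (both choices are assumptions made so the result does not depend on locale):

```r
d <- as.Date("1970-01-10")
unclass(d)                      # 9: days after the index date

t_lt <- strptime("2013-07-24 14:51", "%Y-%m-%d %H:%M", tz = "UTC")
class(t_lt)                     # "POSIXlt" "POSIXt"
t_lt$hour                       # 14
t_lt$year + 1900                # year is stored as years since 1900
t_lt$wday                       # 3: days since Sunday, so Wednesday
t_ct <- as.POSIXct(t_lt)
as.numeric(t_ct)                # seconds since 1970-01-01 00:00:00 UTC
```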

1.11 Data Missing Values - NA NaN

NaN is NA; but NA is NOT a NaN
summary(x[, c(‘col1’, ‘col3’, ‘col7’)]) - total NA in specified columns of dataset x
sum(is.na(x)) - number of missing values in dataset x
nrow(x[!complete.cases(x), ]) - number of rows with missing data
is.na() - test if object is NA
is.nan() - test if literal numeric is not a number

1.12 Data Missing Values - Remove Replace

x <- c(1, 2, NA, 4, NA, 5) - numeric vector with missing values
y <- c('a', 'b', NA, 'd', NA, 'f') - character vector with missing values in the same positions
x[!is.na(x)] - retrieve those that are not NA
complete.cases(x, y) - logical vector showing which positions in both x and y have complete data

data$col[data$col==''] <- NA # replace '' with NA
data <- data[!is.na(data$col),] # remove rows with NA in a column
newdata <- na.omit(data) # create new dataset without the missing data
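
A worked sketch of the same steps on a small invented data frame, so the effect of each call is visible:

```r
data <- data.frame(id = 1:5, col = c("a", "", "c", NA, "e"),
                   val = c(1.5, 2, NA, 4, 5), stringsAsFactors = FALSE)
data$col[!is.na(data$col) & data$col == ""] <- NA  # empty strings become NA
sum(is.na(data))             # total missing values: 3
complete.cases(data)         # TRUE only for rows 1 and 5
newdata <- na.omit(data)     # keeps the complete rows
newdata
```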

1.13 Data - Commands, Operators, Initial Explorations

attach(datasetname) - use to set dataset default; avoids typing datasetname with variables; prone to error in use
detach(datasetname) - use to remove specified dataset from search path
class(x) - display class of object x ; character, numeric, integer, logical
unclass(x) - strip out the class attribute; e.g. reduces a factor or Date to its underlying numbers
length(x) - length of vector x
names(x) - get / set names of object x
attributes(x) - show attributes of object x; include names, dimnames, class, length, and other user defined attribs and metadata; permits alterations of attributes
summary(x) - summary stats about columns (default) in x; shows tot NA in each col
summary(x[1:5,]) - summary stats about rows 1 to 5 in x; used to show row col index
unique(dataframe$colname) - what are the coded values for the named column

length(unique(dataframe$colname)) - how many unique values in the named column

LOGICAL OPERATORS
==, !=, >, >=, <, <=, &, |
= is NOT a logical operator

# symbol => means 'results in' or 'produces'
class(TRUE) # => [1] "logical"
class(FALSE) # => [1] "logical"
# Behavior is normal
TRUE == TRUE # => [1] TRUE
TRUE == FALSE # => [1] FALSE
FALSE != FALSE # => [1] FALSE
FALSE != TRUE # => [1] TRUE
# Missing data (NA) is logical, too
class(NA) # => [1] "logical"

MATH OPERATORS
+, -, /, * - when applied to vector or arrays of same dimensions, math operation performed on each corresponding (i), or (i,j), or (i,j,…n) element

10 + 66 # => [1] 76
53.2 - 4 # => [1] 49.2
2 * 2.0 # => [1] 4
3L / 4 # => [1] 0.75
3 %*% 2 # => 1x1 matrix containing 6; %*% is true matrix multiplication

1.14 Data Operations - Sorting and Subsetting

dfname[with(dfname, order(colname1, colname2, -colname3)), ] - sort dfname by col1, col2 in ascending order; by col3 in descending (the minus sign works for numeric columns)
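
A small sketch of sorting with order(), on an invented data frame with one grouping column and one numeric column:

```r
df <- data.frame(grp = c("b", "a", "b", "a"), age = c(3, 5, 1, 2),
                 score = c(10, 20, 30, 40))
# sort by grp ascending, then by score descending within each group
sorted <- df[order(df$grp, -df$score), ]
sorted
rownames(sorted)   # original row positions: "4" "2" "3" "1"
```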

1.14.1 Subset a Vector

x[1] - extract 1st element of vector x
x[2:6] - extract elements 2 through 6
x[x > ‘a’] - extract all elements greater than ‘a’
u <- x > ‘a’ - create logical vector of those elements greater than ‘a’
x[u] - display those elements using the logical index

1.14.2 Subset a Matrix

x <- matrix(1:6, 2, 3) - create 2r 3c matrix fill 1 thru 6
x[1, 2] - get element in r1, c2; a single element returns a vector of length 1
x[2, 1, drop = FALSE] - get element in r2, c1; preserve dimensionality; returns 1x1 matrix
x[1, ] - get all of r1; returns vector
x[, 2, drop = FALSE] - get all of c2; returns matrix

1.14.3 Subset a List

x <- list(foo = 1:4, bar = 0.6) - create list with 2 elements named ‘foo’ and ‘bar’
x[1] - returns a list holding the first element: name and sequence
x[[1]] - returns sequence, no name

x$bar - get specific element named ‘bar’ from the list

x[[‘bar’]] - return the value of ‘bar’

x[‘bar’] - single bracket returns obj of same class; use of name removes requirement for index position
x[c(1, 3)] - pass numeric vector to show elements 1 and 3 of list x (assumes x has at least 3 elements)
x[[c(1, 3)]] - recurses the list to get the 3rd element of the first element in list x
x[[c(2, 1)]] - as above, gets 1st element of the second element in the list
x[[2]][[1]] - alternative approach, double subsetting, same result

x$f - partial match, finds list element x dollar foo; fails if more than one begins with f

x[[‘f’, exact = FALSE]] - adding argument removes exact match requirement of [[]]
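
The differences between the subsetting operators are easiest to see side by side. This sketch assumes a three-element list (the third element, baz, is added here so that index 3 exists and partial matching can be ambiguous):

```r
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
class(x[1])        # "list": single bracket keeps the list wrapper
class(x[[1]])      # "integer": double bracket extracts the element
x[c(1, 3)]         # list with foo and baz
x[[c(1, 3)]]       # 3rd element of the 1st element: 3
x$ba               # NULL: partial match is ambiguous (bar, baz)
x$f                # 1 2 3 4: unique partial match on foo
x[["f", exact = FALSE]]
```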

1.15 Data - Vectorized Operations

head(vec, 1) - first item in vector vec
tail(vec, 1) - last item in vector vec
length(vec) - length of vector vec
vec * 5 - multiply every item in vec by 5
mean(vec), var(vec), sd(vec), max(vec), min(vec), sum(vec) - summary stats on vector vec

x + y - add vectors x and y
x > 2 - returns vector of logical result for each element
x * y - element-wise product

1.15.1 Matrix

Must have same class or will coerce into same class
z <- cbind(1:4, c(‘dog’, ‘cat’, ‘bird’, ‘dog’)) # 2 cols each character class
x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)

t(x) - transpose matrix x
x[1, ] - get first row of matrix x
3 * x[, 1] - multiply col 1 of matrix x by 3
x[2, 1] - get item in row 2, col 1 from matrix x
x * y - element wise multiplication
x / y - element wise division
x %*% y - true matrix multiplication, dot product rows x by cols y
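
Using the same x and y as above, a quick sketch contrasting element-wise and true matrix multiplication:

```r
x <- matrix(1:4, 2, 2)          # columns are (1, 2) and (3, 4)
y <- matrix(rep(10, 4), 2, 2)
x * y     # element-wise: rows are (10, 30) and (20, 40)
x / y     # element-wise division
x %*% y   # matrix product: rows are (40, 40) and (60, 60)
t(x)      # transpose: rows are (1, 2) and (3, 4)
```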

1.15.2 Dataframe

For combining cols of different classes.
df <- data.frame(c(5,2,1,4), c(‘dog’, ‘cat’, ‘bird’, ‘dog’)) - 2 cols each diff class
names(df) - see columns in dataframe
names(df) <- c('number', 'species') - assign col names

class(df$number) - check out the class of the first column

head(df) - see first few rows of dataframe; optionally add number of rows to show
df[,c(‘case’,‘dob’,‘race’)] - select columns by name
df[375:552, c(‘case’,‘dob’,‘race’)] - select specific rows and specific columns
df[df$age > 3, ] - select rows meeting a specific criterion (note the comma: rows before it, columns after)
df[grep('Legacy', df$desc, ignore.case = TRUE), ] - select rows with 'Legacy' in the description col.
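
A runnable sketch of these selections, on an invented data frame whose column names (case, age, desc) are chosen only to mirror the lines above:

```r
df <- data.frame(case = 1:5, age = c(2, 7, 4, 1, 9),
                 desc = c("Legacy file", "new", "legacy import", "new", "other"),
                 stringsAsFactors = FALSE)
df[, c("case", "age")]                             # columns by name
older <- df[df$age > 3, ]                          # rows where age > 3
df[2:4, c("case", "desc")]                         # rows 2 to 4, two columns
hits <- df[grep("legacy", df$desc, ignore.case = TRUE), ]  # rows 1 and 3
```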

1.16 Data Operations - Tables

table(x) - frequency table of factor x

1.17 Data Calc Columns

DF$z <- with(DF, (x == 'A') & (y == 1)) - if both conditions hold in cols x & y, then col z is TRUE, else FALSE
lapply(x, function(zzz) zzz[, 1]) - extract first col from each matrix in list x using anonymous function zzz existing in context of lapply only

1.18 Packages

library() - list installed packages
library(packagename) - load named package; throws error if missing; processing stops
require(packagename) - load named package; returns FALSE with a warning if missing; processing continues
search() - display search path for packages available in the workspace
detach(package:packagename) - detach named package from search path
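
The different failure behavior of require() and library() can be seen with a package name that does not exist. 'notARealPackage123' is a made-up name used only to trigger the missing-package path:

```r
ok <- suppressWarnings(require("notARealPackage123", character.only = TRUE))
ok   # FALSE: require() warns and returns FALSE, so the script continues

res <- tryCatch(library("notARealPackage123", character.only = TRUE),
                error = function(e) "library() raised an error")
res  # the error handler's value, because library() stops with an error
```

This is why the install-if-missing idiom tests the return value of require().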

Must Have Packages - plyr or better dplyr, ggplot2, lme, knitr, RODBC, sqldf, stringr, lubridate, qcc, reshape2, randomForest
Other Data I/O
httr - working with http connections
RMySQL - interface with mySQL
bigmemory - handle datasets larger than RAM
RHadoop - interface R and Hadoop
foreign - data from other stats programs (SPSS, Stata, etc)
shiny - create interactive web pages

1.19 Basic Functions

Not to be confused with functions in base package. Some of these come from others.

1.19.1 ?apply Functions

rnorm(100, 0, 1) - random normal distribution; observations, mean, sd
lapply - loop over a list and apply a function on each element
sapply - same as lapply but tries to simplify result
apply - apply functions over the margins of an array
tapply - apply function over subsets of a vector (table apply)
mapply - multivariate version of lapply

1.19.2 mapply function

Takes multiple args, unlike sapply and lapply
mapply(rep, 1:4, 4:1) - same as list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
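mapply(noise, 1:5, 1:5, 2) - same as list(noise(1, 1, 2), noise(2,2,2), noise(3,3,2), noise(4,4,2), noise(5,5,2)); noise is not defined in this outline and stands for a user-defined function of three arguments, e.g. function(n, mean, sd) rnorm(n, mean, sd)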
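
A runnable sketch of both calls. The noise() helper here is a hypothetical definition (n draws from a normal with the given mean and sd), supplied only so the example executes:

```r
# Hypothetical helper: n normal draws with the given mean and sd.
noise <- function(n, mean, sd) rnorm(n, mean, sd)

set.seed(1)
out <- mapply(noise, 1:5, 1:5, 2)  # n and mean vary together; sd is recycled
lengths(out)                       # 1 2 3 4 5

reps <- mapply(rep, 1:4, 4:1)      # rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
reps
```

Because the results have different lengths, mapply cannot simplify them and returns a list.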

1.19.3 tapply function

Apply a function over subsets of a vector
tapply(x, f, mean) - group mean of x grouped by factor f
tapply(x, f, range) - get range of x grouped by factor f
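
For instance, with three simulated groups of ten observations each (the distributions are arbitrary), a sketch might look like:

```r
set.seed(10)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)                 # factor with 3 levels, 10 observations each
grp_means  <- tapply(x, f, mean)   # one mean per group, named by level
grp_ranges <- tapply(x, f, range)  # list-like result: two values per group
grp_means
```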

1.19.4 split w lapply and sapply

s <- split(airquality, airquality$Month) - split by month
lapply(s, function(x) colMeans(x[, c('Ozone', 'Solar.R', 'Wind')])) - get means of 3 specified columns with anon FUN function(x)
sapply(s, function(x) colMeans(x[, c(‘Ozone’, ‘Solar.R’, ‘Wind’)], na.rm = TRUE)) - simplified result matrix with missing values removed
str(split(x, list(f1, f2), drop = TRUE)) - split with drop removes interactions w/out elements; returns list by factor with observations
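
Put together, the split-then-apply pattern on the built-in airquality data can be sketched as:

```r
s <- split(airquality, airquality$Month)   # list of 5 data frames, months 5 to 9
means <- sapply(s, function(d) {
  colMeans(d[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})
dim(means)        # 3 rows (variables) by 5 columns (months)
colnames(means)   # "5" "6" "7" "8" "9"
```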

1.19.5 apply function

Use to evaluate function over the margins of an array
apply(x, 1, sum) - calc sum of rows in matrix x; rowSums
apply(x, 1, mean) - calc mean of rows in matrix x; rowMeans
apply(x, 2, sum) - calc sum of cols in matrix x; colSums
apply(x, 2, mean) - calc mean of cols in matrix x; colMeans
apply(x, 1, quantile, probs = c(0.25, 0.75)) - get matrix of 25th and 75th percentiles for each row in matrix x
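
A small sketch on a 2 x 3 matrix shows how the margin argument selects rows (1) or columns (2):

```r
x <- matrix(1:6, 2, 3)     # rows are (1, 3, 5) and (2, 4, 6)
apply(x, 1, sum)           # row sums: 9 12
rowSums(x)                 # same values from the optimized shortcut
apply(x, 2, mean)          # column means: 1.5 3.5 5.5
qs <- apply(x, 1, quantile, probs = c(0.25, 0.75))
dim(qs)                    # 2 x 2: one column per row of x
```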

1.19.6 lapply function

Always returns list object
lapply(x, mean) - returns single list with mean for each element in list x
lapply(x, runif, min = 0, max =1) - generates runif(x) for each value of list x

1.19.7 sapply function

Variant of lapply with simplified result
sapply(x, mean) - returns single vector with mean for each element in list x
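
The difference in return type between lapply and sapply, sketched on a two-element list:

```r
x <- list(a = 1:5, b = c(10, 20, 30))
l <- lapply(x, mean)   # list: $a is 3, $b is 20
v <- sapply(x, mean)   # named numeric vector: a = 3, b = 20
class(l)
class(v)
```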

1.20 Arguments

args(‘nameofFUN’) - argument name and corresponding defaults of FUN
argument name - use when position not known; if not exact, checks partial; if no partial, checks position
… - used in generic functions and wrappers; carries a variable number of arguments, usually passed on to other functions; necessary when that number is not known in advance; arguments after … must be named explicitly, without partial matching
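
A minimal sketch of argument matching and the ... mechanism. describe() is a hypothetical helper written for this example:

```r
# Hypothetical helper: rounded mean; extra arguments pass through ... to mean().
describe <- function(x, digits = 2, ...) {
  round(mean(x, ...), digits)
}
a <- describe(c(1, NA, 4), na.rm = TRUE)  # na.rm travels through ... to mean()
b <- describe(c(1.26, 3.26), dig = 1)     # 'dig' partially matches 'digits'
args(describe)                            # shows argument names and defaults
```

Partial matching works for digits because it comes before ... in the definition.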

1.21 Lexical, Static Scoping

.GlobalEnv and package:base - always first and last in the search list; order of packages in search list matters; functions and non-functions have separate namespaces; can have same name
ls(environment(functionname)) - list objects in the environment of the named function

# CAUTION removes all variables in this session to support the example. 
rm(list=ls(all=TRUE))  
g <- function(x) {  
  a <- 3  
  x + a + y  
}  
g(2)
# => Error in g(2) : object 'y' not found
# but works if y defined.  
y <- 3
g(2) # => [1] 8
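
Lexical scoping also means a function remembers the environment in which it was defined. A common illustration is a function that returns another function; make.power is an illustrative name, not something defined elsewhere in this outline:

```r
make.power <- function(n) {
  function(x) x^n    # n is a free variable, found where the function was defined
}
cube <- make.power(3)
cube(2)                       # 8
ls(environment(cube))         # "n"
get("n", environment(cube))   # 3
```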

1.22 Data Results Presentation and Output Devices

?Devices - list of graphical devices
dev.cur() - determine current output device
dev.list() - list of open (active) graphics devices
dev.next - switch control to next graphics device on the list
dev.set - set control to specific graphics device
dev.copy - copy plot to another device; not exact; check integrity of results
dev.copy2pdf - copy plot to pdf file
dev.off - close the current graphics device
Sys.sleep(1) - introduce pause
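
A short sketch of the open, plot, close cycle for a file device, writing to a temporary PDF (the plot contents are arbitrary):

```r
out <- tempfile(fileext = ".pdf")
pdf(out)                          # open a PDF file device; it becomes current
plot(rnorm(50), main = "Example")
dev.cur()                         # name and number of the current device
dev.off()                         # close it so the file is written
file.exists(out)
```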

1.22.1 Base 2D Graphics

Parameters can be added to existing plot via graphics device.

par() specifies global params that affect all plots in the R session. Although set globally, these params can often be overridden as args to specific plotting functions. par('parameter') returns the current value of a specific parameter.
pch - the plotting symbol (def is open circle)
lty - line type (def is solid). can be dashed, dotted, etc
lwd - line width, specified as integer multiple
col - plotting color, specified as number, string, or hex; the colors function gives a vector of colors by name. R has 657 built in colors.
las - orientation of the axis labels on the plot
bg - background color (def is none)
mar - margin size; goes clockwise from bottom
oma - the outer margin size (def = 0 for all sides); use with multiple plots
mfrow - number of plots per row, col (plots filled row-wise)
mfcol - number of plots per row, col (plots filled col-wise)

1.22.1.1 Base Plot Functions

plot(x) - makes scatterplot, or other type depending on class of the object being plotted
hist(x) - makes histogram
boxplot(x) - boxplot
The remaining functions are add-ons to a plot; they won't work unless a plot has already been created.
abline(linearmodel|intercept and slope) - add line to plot
lines - add lines to a plot, given a vector of x values and a corresponding vector of y values (or 2 col matrix); this FUN just connects the dots.
points - add points to a plot
text - add text labels to a plot using specified x, y coordinates
title - add annotations to x, y axis labels, title, subtitle, outer margin
mtext - add arbitrary text to the margins (inner or outer) of the plot
axis - adding axis ticks/labels

1.22.1.2 Base Plot EG

par(mfrow = c(1, 1))  # reset display to 1 x 1  
x <- rnorm(100) # create two simulated groups male and females  
y <- x + rnorm(100)  
g <- gl(2, 50, labels = c('Male', 'Female'))  
plot(x, y, type = 'n') # data by gender set up plot, but do not draw  
points(x[g == 'Male'], y[g == 'Male'], col = 'green', pch = 18) # subset the vectors for male and female    
points(x[g == 'Female'], y[g == 'Female'], col = 'blue', pch = 19)

1.22.2 Lattice/Grid Graphics

All parameters specified at once; return object of class trellis; easier to write R Script then run. Produce multi panel displays for specified conditions.
plotFUN(y ~ x | f * g) - general format; y versus x, conditioned on every combination of the levels of factors f and g

1.22.2.1 Lattice Plot Functions

xyplot - most important; creates scatterplots
bwplot - box and whiskers plots (‘boxplots’)
histogram - histograms
stripplot - like boxplot but with actual points
dotplot - plot dots on ‘violin strings’
splom - scatterplot matrix; like pairs in base graphics
levelplot, contourplot - for plotting ‘image’ data

1.22.2.2 Multiple Factor Conditioning Lattice EG

Illustrates interaction effects of multiple variables.

require(lattice)
# establish 4 levels (lattice shingles) for the continuous vars temperature and wind; use plot(temp.cut) to see overlapping ranges  
temp.cut <- equal.count(environmental$temperature, 4) 
wind.cut <- equal.count(environmental$wind, 4) 
# scatterplots for each combination of temperature and wind 
xyplot(ozone ~ radiation | temp.cut * wind.cut, data = environmental, as.table = TRUE, pch = 20, main = 'Ozone vs Radiation by Temp & Wind',   
       panel = function(x,y, ...){  
         panel.xyplot(x, y, ...)  
         panel.loess(x, y)  
         }, xlab = 'Solar Radiation', ylab = 'Ozone(ppb)')

1.22.2.3 Splom and Histogram EGs

Additional examples of illustrating variable interaction

# scatterplot matrix of all interactions
splom(~ environmental)   
# distribution of temperature as wind changes; temps a bit higher when wind lower  
histogram(~ temperature | wind.cut, data = environmental)  
# distribution of ozone as wind changes; greater distribution in lower range with more wind  
# addition of layout improves readability
histogram(~ ozone | wind.cut, layout = c(1, 4),data = environmental)  
# add all wind and temp combinations; low temp and high wind have lowest level concentrations  
histogram(~ ozone | temp.cut * wind.cut, data = environmental)

1.22.3 Color in R

colors() - vector of available colors; can ref by name
rainbow(n) - generates rainbow palette of n colors
plot(x, y, col = rgb(0, 0, 0, 0.2), pch = 19) - apply RGB values directly to plot colors
heat.colors(n, alpha=1) - create vector of contiguous colors with no transparency; alpha parameter (0-1) controls transparency
topo.colors(), terrain.colors(), cm.colors() - variants of above
pal <- colorRamp(c(‘red’,‘blue’)) - takes palette of colors and returns a function that takes values between 0 and 1 indicating the extremes of the color palette
pal(0), pal(0.5), pal(1), pal(seq(0,1, len=10)) - outputs the colorRamp RGB values
pal <- colorRampPalette(c('red', 'yellow')) - takes palette of colors and returns function with integer arguments that returns a vector of colors interpolating the palette
pal(2), pal(10) - show the 2 colours; show the steps between the 2 colors assigned to pal
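
The two palette builders return different kinds of functions, which this sketch makes concrete (the second palette is named pal2 here only to keep both available at once):

```r
pal <- colorRamp(c("red", "blue"))
pal(0)      # 1 x 3 matrix of RGB values: 255 0 0
pal(1)      # 0 0 255
pal(0.5)    # midpoint between red and blue

pal2 <- colorRampPalette(c("red", "yellow"))
pal2(2)     # "#FF0000" "#FFFF00"
pal2(5)     # five hex colors interpolated from red to yellow
```

colorRamp maps numbers in [0, 1] to RGB rows, while colorRampPalette maps a count to a vector of hex color strings that can be passed straight to col.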

1.22.3.1 RColorBrewer 3 types of palettes

Sequential (ordered, low / high, continuous or not); ordered data, light for low, dark for high
Diverging (data that moves away from a centre; neg and positive); equal emphasis on mid-range values with light colors, low and high extremes with dark
Qualitative (not ordered but diff values); no magnitude differences; best for nominal or categorical data

brewer.pal(n, 'PaletteName') - create color palette of n colors using built in 'PaletteName'
display.brewer.pal(3, 'BuGn') - displays the palette in a graphics window
display.brewer.all(type='seq'|'div'|'qual') - displays palettes of type simultaneously in a graphics window

See library(colorspace) for modifying/creating custom versions of the 3 types of RColorBrewer palettes

Palette information can be used with colorRamp() and colorRampPalette()
pal <- colorRampPalette(brewer.pal(3, 'BuGn')) - create graduated palette
image(volcano, col = pal(20)) - apply to volcano topography
smoothScatter(x, y, colramp = pal) - smoothScatter accepts a palette function such as pal through colramp

1.23 Math Annotation

LaTeX-like math annotation on plots in any package where args take text
?plotmath - full syntax list
expression(mathexpressions) - wrap notation in expression FUN within plot

1.23.0.1 Main, ylab, xlab EG

Illustrating use of math expressions on graph output

# title, y and x labels with symbols and formulas
plot(0, 0, main = expression(theta == 0), ylab = expression(hat(gamma) == 0), 
    xlab = expression(sum(x[i] * y[i], i == 1, n))) 
# alt version concatenating strings with expressions
plot(0, 0, main = expression(theta == 0), ylab = expression(hat(gamma) == 0), 
    xlab = expression('The mean (' * bar(x) * ') is ' * sum(x[i]/n, i == 
    1, n))) 
# Use 'substitute' when the expression is a computation. 
# Following labels the axes with the mean of x and y.
plot(x, y, xlab = substitute(bar(x) == k, list(k = mean(x))), 
     ylab = substitute(bar(y) == k, list(k = mean(y))))

1.24 Types of Analysis

Descriptive - first data analysis; describes a set of data (e.g. a census); no generalization without statistical modelling
Exploratory - find relationships; suggest hypotheses; assess inference assumptions; basis for collecting more data; supports choice of best fit and techniques
Inferential - use a sample to say something about a population; modelling goal; estimate quantities of concern and the uncertainty of the estimate; depends on population and sampling
Predictive - use data on X to predict Y; more data with a simple model works best; X predicting Y does not imply that X causes Y
Causal - find what happens to one variable when changing another; usually requires randomized studies (non-randomized studies are sensitive to assumptions); gold standard for data analysis; effects apply as averages, not to every individual
Mechanistic - understand exact changes in variables that lead to changes in other variables for individual objects; deterministic equations; random component is measurement error; unknown parameters of known equations may be inferred with data analysis.

1.25 Define the Question

Practice/business - What are you trying to know, understand, apply, analyze, synthesize, or evaluate?
Theoretical assumptions - what theory / assumptions do the above include?
Scientific basis - what known science supports the questions?
Analytic approach - see types of analysis above
Ideal evidence - what would the ideal evidence look like?
Available evidence - what is the available evidence?
Interpretation - what does this mean for the original question?
Challenge - everything: question, data source, processing, analysis conclusions, measures of uncertainty, terms in the model; what are possible alternative interpretations?
Synthesis - summary and conclusions based on above

Consider schematic representations for different types of studies applying frequentist or Bayesian analysis.

2 Coding Basics

Create READABLE,
    DEBUGGED,
      TEST-DRIVEN, 
            RE-USABLE code using 
                consistent STYLE

names() - if the output is not self explanatory, naming is sub-optimal

2.1 Debugging Tools

message - generic notification/diagnostic message produced by message function; execution continues
warning - indication something is wrong, but not necessarily a problem or fatal; got something different; execution continues, warning appears at the end
error - indication that a fatal problem occurred; execution stops. Produced by the stop function
condition - generic concept for indicating that something unexpected can occur; programmers can create their own conditions

traceback - prints the function call stack after an error occurs; does nothing if no error exists. Because functions call other functions, the error may be very deep in the nesting
debug - flags a function for debug mode, which allows stepping through the function one line at a time, starting at the top
browser - suspends the execution of a function wherever it is called and puts the function in debug mode. Stops execution at that line
trace - allows insertion of debugging code into a function at specific places; especially useful for troubleshooting other people's code and core R code
recover - allows modification of error behaviour in order to browse the function call stack. Related to traceback. recover is an error handler that freezes execution so the call stack can be browsed
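A short sketch of the three condition types and how they can be intercepted. The helper safe_log() is hypothetical, written only for this illustration; tryCatch() and conditionMessage() are base R.

```r
# Hypothetical helper used only for illustration
safe_log <- function(x) {
  if (!is.numeric(x)) stop('x must be numeric')   # error: execution stops
  if (x < 0) {
    warning('negative input; returning NA')       # warning: execution continues
    return(NA_real_)
  }
  message('computing log of ', x)                 # message: diagnostic only
  log(x)
}

# tryCatch() intercepts a condition and returns the handler's value instead
res_ok   <- tryCatch(safe_log(10),  error   = function(e) 'error caught')
res_warn <- tryCatch(safe_log(-1),  warning = function(w) conditionMessage(w))
res_err  <- tryCatch(safe_log('a'), error   = function(e) conditionMessage(e))

# Interactive tools (run at the console rather than in a script):
# debug(safe_log)           # step through safe_log() on its next call
# undebug(safe_log)         # turn stepping off again
# traceback()               # show the call stack of the most recent error
# options(error = recover)  # browse the call stack whenever an error occurs
```

The interactive tools are left commented out because they pause execution and wait for console input.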

2.2 Reserved Words

if, else, repeat, while, function, for, in, next, break
TRUE, FALSE, NULL, Inf, NaN, NA
NA_integer_ , NA_real_, NA_complex_ and NA_character_
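A small sketch, using only base R, of how the typed NA constants differ and of what happens when a reserved word is used as a name. The variable names are illustrative.

```r
# Each typed NA carries the storage type of its vector type
na_types <- c(
  plain     = typeof(NA),             # logical
  integer   = typeof(NA_integer_),
  real      = typeof(NA_real_),       # stored as double
  complex   = typeof(NA_complex_),
  character = typeof(NA_character_)
)
print(na_types)

# A reserved word cannot be used as a variable name: the parser rejects it.
# try() captures the parse error so the script keeps running.
bad <- try(parse(text = 'if <- 1'), silent = TRUE)
is_rejected <- inherits(bad, 'try-error')
```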

2.3 Control Structures

The following is primarily for programs; for interactive command-line work, the ?apply functions are generally more useful.
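As a rough illustration of that point, the same computation is sketched here with an explicit loop and with sapply(). The data and variable names are made up for the example.

```r
x <- c(a = 1, b = 4, c = 9)

# Loop version: preallocate the result, then fill one element at a time
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# sapply version: apply the function to each element and simplify to a vector
roots_apply <- sapply(x, sqrt)
```

The sapply() call is shorter and keeps the element names, which is part of why the apply family suits interactive use.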

2.3.1 if, else if, else

if(condition1) {  
  ## do something  
} else if(condition2) {   # any number of else if branches  
  ## do something different  
} else {                  # else must come last; optional; without it nothing runs when no condition is TRUE  
  ## do something different  
}

2.3.2 for loops

x <- c('a', 'b', 'c', 'd')  
for(i in 1:4) {        # standard for loop   
    print(x[i])  
}  

for(i in seq_along(x)) {   # seq_along(x) generates the indices 1 to length(x)  
    print(x[i])  
}  

for(letter in x) {  # iterate directly over the elements of x  
    print(letter)  
}  

for(i in 1:4) print(x[i])  # for single expression curly bracket not reqd

2.3.3 nested for loop

x <- matrix(1:6, 2, 3)  # create 2r by 3 col matrix  
for(i in seq_len(nrow(x))) {  # starting at r1, get row index
     for(j in seq_len(ncol(x))) {  # next at c1, get col index
        print(x[i,j])  # print the value at the row, col intersect
     }  # efficient but not easy to understand
}

2.3.4 while loop / repeat

count <- 0  
while(count < 10) {  
    print(count)  
    count <- count + 1  
}  

x0 <- 1  
tol <- 1e-8  
repeat {  # useful for converging algorithms; a loop with a hard iteration limit is usually safer  
    x1 <- computeEstimate()  # assumes a user-defined function 'computeEstimate'  
    if(abs(x1 - x0) < tol) {  
        break  # required to avoid an endless loop  
    } else {  
        x0 <- x1  
    }  
}
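A runnable sketch of the same convergence pattern with a hard iteration limit. The function compute_estimate() is a hypothetical stand-in for computeEstimate(); here it performs one Newton step toward the square root of 2.

```r
# Hypothetical stand-in for computeEstimate(): one Newton step toward sqrt(2)
compute_estimate <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
max_iter <- 100   # hard limit so the loop cannot run forever

for (iter in seq_len(max_iter)) {
  x1 <- compute_estimate(x0)
  if (abs(x1 - x0) < tol) break   # converged: successive estimates agree
  x0 <- x1
}
```

Using for() with a cap instead of repeat means a non-converging estimate ends after max_iter steps rather than hanging the session.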

2.3.5 next loop

for(i in 1:100) {  
     if(i <= 20) {  # skip the first 20 iterations  
      next  
     }  
     ## do something  
}

2.4 Classes and Methods

?Classes and ?Methods (long) and help pages for methods package - primary documentation
?setClass, ?setMethod and ?setGeneric (other related details) - technical, aimed at the programmer/developer rather than the end user; makes sense with use. Casual users will not define classes/methods
(Chambers, John), Programming with Data: A Guide to the S language - (green book) user -> programmer philosophy

2.4.1 S3 (old) and S4 (new) Coexistence

separate systems now (and for the foreseeable future), but can be mixed to a degree
each can be used fairly independently of each other
developers of new projects encouraged to use S4 style
S3 continued use driven by ease of its ‘quick and dirty’ style
lecture focus is S4
code for implementing S4 is in the methods package
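For contrast with S4, a minimal sketch of the S3 'quick and dirty' style. The class name point2d and its fields are hypothetical, chosen only for illustration.

```r
# S3: a class is just an attribute; no formal definition is required
p <- structure(list(x = 1, y = 2), class = 'point2d')

# An S3 method is an ordinary function named generic.class
print.point2d <- function(x, ...) {
  cat('point at (', x$x, ',', x$y, ')\n')
  invisible(x)
}

print(p)   # the print generic dispatches to print.point2d
```

Nothing checks that the list really has x and y fields, which is both the convenience and the risk of the S3 style.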

2.4.2 OOP in R

class(x) - get class of any R object x
* a class - a description of a thing; class defined by setClass() in methods package.
* an object - an instance of a class. Objects created using new().
* a method - a function that only operates on a certain class of objects.
* a generic function is an R function which dispatches methods. A generic function typically encapsulates a generic concept (plot, mean, predict,…); does not do any computation.
* a method is the implementation of a generic function for an object of a particular class.
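The terms above can be sketched with a small S4 example. The class Circle and the generic area are hypothetical names; setClass(), setGeneric(), setMethod() and new() come from the methods package.

```r
library(methods)   # attached by default in interactive R; explicit for scripts

# A class: a description of a thing (here, one numeric slot r)
setClass('Circle', representation(r = 'numeric'))

# A generic function: dispatches to methods, does no computation itself
setGeneric('area', function(shape) standardGeneric('area'))

# A method: the implementation of the generic for one class
setMethod('area', 'Circle', function(shape) pi * shape@r^2)

# An object: an instance of the class, created with new()
circ <- new('Circle', r = 2)
circ_area <- area(circ)   # dispatches to the Circle method
```

Unlike S3, new() validates the slot types, so new('Circle', r = 'two') would fail with an error.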

2.4.3 Places to Learn S4 style

best way to learn is to look at examples and do the exercises
http://www.bioconductor.org - rich resource even if you know nothing about bioinformatics
some CRAN packages - SparseM, gpclib, flexmix, its, lme4, orientlib, pixmap
stats4 package has classes/methods for doing maximum likelihood analysis

2.5 Regular Expressions - literals + metacharacters

Regex great for extraction from unstructured data, not from CSV.

2.5.1 Beginning ^ and end of line $

^ - metacharacter represents the start of a line  
^i think - matches lines beginning with literal i space think

$ - metacharacter represents the end of a line
morning$ - matches lines ending with the literal morning

2.5.2 Character Classes with []

Use to set list of characters to accept at given point in the match.
Specifies a variety of letters and letter case to match.

[Bb][Uu][Ss][Hh]  - matches any line containing bush in any mix of upper and lower case, anywhere in the line

[a-z] - match any lower case letter in alphabet  
[A-Z] - match any upper case letter in alphabet  
[a-zA-Z] - match any letter, any case
[^a-c] - caret at beginning of character class is the negation operator; matches any character except a, b or c.

2.5.2.1 Combining metacharacters

^[Ii] am # matches any line that begins with i am or I am  
^[0-9][a-zA-Z] # matches any line beginning with a digit 0-9 followed by a letter in either case  
[^?.]$ # matches any line NOT ending with ? or a period; for example:  
  - 'under water? hmm' (ok because ? not at eol)  
  - 'die anyway!' (ok because ! not in character class)  
. # outside a character class, the period is a metacharacter that matches any single character  
9.11 # matches any line with a 9, then any one character, then 11 (e.g. 9-11, 9/11, 9:11)
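A brief sketch of how anchors and character classes behave in R, using grepl() from base R. The test strings are invented for illustration.

```r
txt <- c('i think so', 'well, i think', 'good morning', 'morning person',
         'I am here', '9-11 report', '9 to 11')

m_start <- grepl('^i think', txt)   # ^ anchors the match to the start of the line
m_end   <- grepl('morning$', txt)   # $ anchors the match to the end of the line
m_iam   <- grepl('^[Ii] am', txt)   # character class right after an anchor
m_dot   <- grepl('9.11', txt)       # . stands for exactly one character

# Negated class at end of line: TRUE unless the line ends in ? or .
m_not <- grepl('[^?.]$', c('really?', 'die anyway!', 'under water? hmm'))
```

Only the first string matches '^i think', because the second contains the words but not at the start.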

2.5.2.2 Subexpressions/alternatives with | (logical or)

The | symbol translates to 'or'; it is not a pipe, although it is the same character. It combines two subexpressions (alternatives) with a logical or.

flood|fire # matches lines containing either literal
flood|earthquake|hurricane|coldfire # any number of alternatives; matches lines containing any one of them
^[Gg]ood|[Bb]ad # alternatives can be regular expressions - matches lines starting with Good or good, or containing Bad or bad anywhere  
^([Gg]ood|[Bb]ad) # as above, but with () the ^ applies to both alternatives: the line must start with Good, good, Bad or bad
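A quick check of how parentheses change the scope of the ^ anchor in an alternation, sketched with grepl() on made-up strings.

```r
x <- c('flood warning', 'good day', 'not so good', 'this is bad', 'Bad start')

m_alt    <- grepl('flood|fire', x)
m_loose  <- grepl('^[Gg]ood|[Bb]ad', x)     # ^ applies only to the first alternative
m_strict <- grepl('^([Gg]ood|[Bb]ad)', x)   # ^ applies to both alternatives
```

The string 'this is bad' matches the loose pattern but not the strict one, since bad does not start the line.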

2.5.2.3 optional character(s) with ? and escape with \

[Gg]eorge( [Ww]\.)? [Bb]ush
# the ? makes the preceding group ( [Ww]\.) optional; note the \ that escapes the period so it is read as a literal period rather than the 'any single character' metacharacter. Use \ to escape any metacharacter.
# matches:  
george bush, George W. Bush, george bushes
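A small sketch of the optional group in R. Note that inside an R string the backslash itself must be escaped, so the regex \. is written '\\.'. The last test string is an invented non-match.

```r
# '\\.' in the R string becomes \. in the regex: a literal period
pat <- '[Gg]eorge( [Ww]\\.)? [Bb]ush'

x <- c('george bush', 'George W. Bush', 'george bushes', 'George X. Bush')
m_opt <- grepl(pat, x)
```

The middle initial is matched only when it is W or w followed by a period, so the X. variant is rejected.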

2.5.2.4 repetition with metacharacters * and +

* and + are metacharacters indicating repetition. 
* means 'any number including none, of the item' 
+ means 'at least one of the item'  

\(.*\)
# escaped parentheses are literal characters; matches any parenthesized text, including empty parentheses:
(24, m, germany) chat?
likes ... (drives + east area)
()

[0-9]+ (.*)[0-9]+
# matches any line that has at least one digit, followed by a space, followed by any number of characters, followed by at least one more digit.
time 4 me 2 go 2 bed
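A sketch of the two repetition patterns in R. It assumes the first pattern is meant to match literal parentheses, so they are escaped; the non-matching strings are invented.

```r
x <- c('(24, m, germany) chat?', 'likes ... ()', 'no parens here',
       'time 4 me 2 go 2 bed', 'only 1 number')

# Escaped parentheses are literals: ( then anything, including nothing, then )
m_paren <- grepl('\\(.*\\)', x)

# Unescaped parentheses only group: digits, a space, anything, then digits
m_digits <- grepl('[0-9]+ (.*)[0-9]+', x)
```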

2.5.2.5 interval qualifiers

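{} Curly brackets specify the min and max number of repetitions of an expression

[Bb]ush( +[^ ]+ +){1,5} debate
intended to match lines with Bush followed by 1 to 5 words before the word debate
Interpreted as:
Bush or bush, followed by between 1 and 5 repetitions of the pattern in the parentheses (one or more spaces, then one or more non-space characters, then one or more spaces), followed by a space and debate
# Any Word Pattern
( +[^ ]+ +) # is the PATTERN FOR ANY WORD surrounded by spaces
# Caution: each repetition both starts and ends with ' +', so consecutive words need at least two spaces between them; [Bb]ush( +[^ ]+){1,5} +debate handles single-spaced text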

2.5.2.5.1 {m, n}

{m,n} means at least m but not more than n
{m} means exactly m
{m,} means at least m
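A sketch of the three interval forms applied to a digit class, followed by a word-counting pattern. The second pattern is a variant written for this example (an assumption, not the pattern given in the outline) so that it works on single-spaced text.

```r
# {m}, {m,} and {m,n} applied to a digit class
x <- c('7', '42', '2015', '123456')
exactly_4   <- grepl('^[0-9]{4}$', x)     # exactly four digits
at_least_2  <- grepl('^[0-9]{2,}$', x)    # two or more digits
two_to_four <- grepl('^[0-9]{2,4}$', x)   # between two and four digits

# Variant pattern: Bush, then 1 to 5 single-spaced words, then debate
pat <- '[Bb]ush( +[^ ]+){1,5} +debate'
s <- c('Bush has historically won all major debates',
       'bush did not go to the debate',
       'Bush one two three four five six debate')
words_1_to_5 <- grepl(pat, s)
```

The third sentence has six words between Bush and debate, one more than the interval allows, so it does not match.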

2.5.2.6 metacharacters ( and ) revisited

in most implementations of regex, parentheses () not only limit the scope of alternatives divided by '|', but can also be used to 'remember' text matched by the enclosed subexpression
these parenthesized subexpressions can be referred to with \1, \2, etc.

 +([a-zA-Z]+) +\1 +
# interpret above as: one or more spaces; one or more letters (remembered as \1); one or more spaces; the same remembered text again; one or more spaces
# matches lines where a word is immediately repeated, that is, whatever it matched, just repeat it again.
# finds repeated words inside lines, such as:
so so itchy
blah blah blah blah

The asterisk * is 'greedy'; it matches the longest possible string that satisfies the regex.

^s(.*)s # matches any string starting with s; followed by anything repeated any number of times followed by another s  
# for example
sitting at starbucks, rather than sitting at s
setting up mysql and rails rather than setting up mys

# unless this behaviour is turned off with the ? question mark character
^s(.*?)s$ # the question mark makes the quantifier lazy, that is, it prefers the shortest match; note that the trailing $ still forces the match to run to the end of the line

Also useful for substituting characters in strings.
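A sketch of backreferences, greedy versus lazy matching, and substitution, using base R functions (grepl, regexpr, regmatches, gsub). The test strings are illustrative; perl = TRUE is used where lazy quantifiers or \b word boundaries appear.

```r
# Backreference: \\1 repeats whatever the first group matched
x <- c('my tattoo is so so itchy today', 'blah blah blah blah', 'no repeats here')
m_repeat <- grepl(' +([a-zA-Z]+) +\\1 +', x)

# Greedy versus lazy: extract what each pattern actually matched
s <- 'sitting at starbucks'
greedy <- regmatches(s, regexpr('^s(.*)s', s))                 # runs to the last s
lazy   <- regmatches(s, regexpr('^s(.*?)s', s, perl = TRUE))   # stops at the next s

# Substitution: collapse an immediately repeated word into one copy
dedup <- gsub('\\b([a-zA-Z]+) +\\1\\b', '\\1', 'so so itchy', perl = TRUE)
```

In the replacement string, '\\1' inserts the text remembered by the first group, which is what makes gsub() useful for restructuring text and not only for deleting it.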

2.6 Programming Style Guides

Programming cautions - http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
http://4dpiecharts.com/r-code-style-guide
http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
http://wiki.fhcrc.org/bioc/Coding_Standards

3 References

[R 2009] R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Jared E. Knowles Tutorials - http://jaredknowles.com/r-bootcamp/
Hadley Wickham Vocabulary - http://adv-r.had.co.nz/Vocabulary.html
Hadley Wickham devtools - https://github.com/hadley/devtools/wiki
Revolution Analytics R Language Features - http://www.revolutionanalytics.com/what-is-open-source-r/r-language-features/
Olivia Lau R Tips - http://www.olivialau.org/software/Rtips.pdf
Robert I. Kabacoff Quick R - http://www.statmethods.net
OpenIntro Online Stat Book - http://www.openintro.org
R Data Import/Export Manual - http://cran.r-project.org/doc/manuals/R-data.pdf
Roger Peng Coursera Lectures -
Jeff Leek Coursera Lectures -
Woods and Pagano Course Lectures and Materials Edx/Harvard PH207x
IDRE UCLA FAQ pages - http://statistics.ats.ucla.edu/stat/r/faq/
Top 10 R Packages -
Stackoverflow - http://stackoverflow.com/questions/tagged/r
UPenn R Study Group - http://www.ling.upenn.edu/~joseff/rstudy/

4 Learning R

ComputerWorld Overview Series (START HERE) - http://www.computerworld.com/s/article/9239625/Beginner_s_guide_to_R_Introduction - First of 6 parts.
Books related to R - http://www.r-project.org/doc/bib/R-books.html - deep dive into introductory and specialized R topics.
2 minute Tutorials - http://www.twotorials.com/ - introductions and refreshers for specific techniques.
Learn R in Y Minutes - http://learnxinyminutes.com/docs/r/ - intro for user with programming background.
Jared Knowles R Bootcamp - first session covers R installation.

5 Dev Notes

Notes: This outline uses single quotes because R markdown converts double quotes to curly quotes. Copying and pasting curly quotes to the command line generates an 'unexpected input' error. Examples including the syntax x$y sometimes require blank lines above and below to produce accurate markdown output. If anyone knows why, please share.

sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.2     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.7   knitr_1.10.5   
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7

May you do good and not evil.
May you find forgiveness for yourself and forgive others.
May you share freely, never taking more than you give.

Outline of R

ASM mielniczuk at outlook dot com

2015-12-11