R Workshop

Basics, Visualization, Publication-Quality Tables

Yphtach Lelkes

Why use R?

  • R is free and open, and can be installed on any environment
  • Incredibly flexible
  • Actively developed
  • Amazing Graphics
  • Reproducible Research

Publication Quality Tables

a <- lm(obama ~ age, data = egss)
b <- lm(obama ~ age + pid, data = egss)
apsrtable(a, b, caption = "OLS Predicting Obama Favorability")

Awesome Graphics

Reproducible Research

Installing R

Installing a frontend

Commands and code

  • You can work directly in the console.

  • But working with source files lets you save your work.

  • History window will show you the list of commands.

  • Cmd-Enter on a Mac (Ctrl-Enter on Windows or Linux) will run the current line.

Shortcuts

R is a calculator

  • Try some simple math.

  • +,- #addition, subtraction

  • *,/ #multiplication, division

  • Normal rules of arithmetic apply.

    2 + 2
    
    ## [1] 4
    
    2 + 2 * 3
    
    ## [1] 8
    
    (2 + 2) * 3
    
    ## [1] 12
    

R Operators

log (10) # Natural logarithm with base e=2.7182
log10(5) # Common logarithm with base 10
5^2 # 5 raised to the second power
sqrt (16) # Square root
abs (3-7) # Absolute value
pi # 3.14
exp(2) # Exponential function
floor(15.9) # Rounds down
ceiling(15.1) # Rounds up
cos(.5) # Cosine Function
sin(.5) # Sine Function
tan(.5) # Tangent Function
acos(0.8775826) # Inverse Cosine
asin(0.4794255) # Inverse Sine
atan(0.5463025) # Inverse Tangent

Advanced Math

  • Can integrate functions.
  • Can create random samples from distributions (e.g., normal, binomial, Poisson). Useful for writing examples and testing new tools.

rnorm(10, 0, 1)
rbinom(5, 1, 0.5)
rpois(3, 2)

Or if you're teaching statistics

hist(rnorm(10, 0, 1))
hist(rnorm(1e+05, 0, 1))
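The slide mentions integration but shows only random draws. A minimal sketch of both, using base R's integrate(); the call to set.seed() and the seed value are additions here, not part of the original slides, and simply make the random draws repeatable:

```r
# Area under the standard normal density between -1.96 and 1.96
area <- integrate(dnorm, lower = -1.96, upper = 1.96)
area$value  # close to 0.95

# Fixing the seed makes random draws repeatable (42 is an arbitrary choice)
set.seed(42)
draw1 <- rnorm(5, mean = 0, sd = 1)
set.seed(42)
draw2 <- rnorm(5, mean = 0, sd = 1)
identical(draw1, draw2)  # TRUE: same seed, same draws
```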

R is a language

John Chambers on S, the precursor to R:

[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important

R works with objects.

  • Lots of things are objects: data sets, plots, variables.

  • Let's start with variables.

  • <- is the assignment operator: x <- 2 stores the value 2 under the name x

x <- 2
x
## [1] 2
x = 2  # = also assigns at the top level
x
## [1] 2

<- is preferable to =
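One practical reason for the preference, sketched below: inside a function call, = names an argument, while <- really assigns. The variable names are illustrative only.

```r
x <- 2

# Inside a call, = only names the argument x of mean(); the workspace x is untouched
mean(x = c(10, 20, 30))
x_after_equals <- x   # still 2

# Inside a call, <- assigns in the workspace first, then passes the value on
mean(x <- c(10, 20, 30))
x_after_arrow <- x    # now the vector 10, 20, 30
```

Assigning inside a call is rarely a good idea; the point is only that the two operators are not interchangeable everywhere.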

Manipulating Variables.

x * x
## [1] 4
y = x + x
y
## [1] 4

Manipulating Variables.

  • c() to combine into a vector
    • A vector is the basic object in R
  • : sequence operator
o1 <- c(1, 2, 3, 4)
o2 <- c(4:1)
o1
o2
o1 * o2

Manipulating Vectors.

weight <- c(77, 62, 54, 82, 72, 76, 75.34)
height <- c(1.75, 1.65, 1.56, 1.77, 1.82, 1.69, 1.72)
bmi <- weight/height^2
bmi
sum(weight)
length(weight)
sum(weight)/length(weight)
mean(weight)

R works with objects that it stores in its memory.

  • Datasets
data(cars)
cars
  • Vectors
cars$speed
  • Plots
speedhist <- hist(cars$speed, breaks = 10)
plot(speedhist)
  • Functions
usefulfunction <- function(x) {
    rep("ASCOR", x)
}
usefulfunction(5)

These objects are in R's workspace.

ls()
usefulfunction
rm(usefulfunction)
usefulfunction
ls.str()  #list and describe

Logical Operators

  • >, <, >=, <=, ==, !=, &, |
a <- c(1, 2, 3, 4)
b <- c(1, 3, 2, 5)
a > b
a == b
a != b

Often combined with brackets

time1 <- sample(1:100, 25)
time2 <- sample(1:100, 25)
time1
time2
time1[time1 > 75]
time1[time1 > mean(time2)]
length(time1[time1 < time2])
time1[time1 < 10 | time1 > 90]
time1[time1 > 10 & time1 < 20]
time1[time1 >= time2] = NA
time1
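What goes inside the brackets is itself a logical vector. A small sketch with fixed values (the vector scores is made up for illustration) so the results do not depend on a random sample:

```r
scores <- c(12, 95, 40, 8, 77)

over50 <- scores > 50   # one TRUE/FALSE per element
over50                  # FALSE TRUE FALSE FALSE TRUE
scores[over50]          # keeps elements where over50 is TRUE: 95 77
sum(over50)             # TRUE counts as 1, so this counts matches: 2
which(over50)           # positions of the matches: 2 5
mean(over50)            # share of elements above 50: 0.4
```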

Classes of data

  • Generally, there are five classes of data: Numeric, Character, Logical, Factor, Dates
numericvector <- c(1:10)
is.numeric(numericvector)
charactervector <- c("ascor", "r", "political communication")
is.character(charactervector)
logicalvector <- c(TRUE, FALSE, T, F, T, T)
is.logical(logicalvector)
is.numeric(logicalvector)

Factor class

factorvector <- as.factor(sample(c("High Knowledge", "Low Knowledge"), 10, replace = TRUE))
factorvector
levels(factorvector)
  • A factor has a built-in order of levels (alphabetical by default), stored as integer codes.
  • As a dummy variable in a regression, the first level will be the omitted (reference) category.
dv <- sample(1:10, 10)
lm(dv ~ factorvector)
factorvectornew <- relevel(factorvector, ref = "Low Knowledge")
levels(factorvectornew)
lm(dv ~ factorvectornew)
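Because a factor stores integer codes behind its labels, converting one to numeric can surprise you. A minimal sketch with a made-up factor f:

```r
f <- factor(c("10", "20", "10", "50"))

levels(f)                     # "10" "20" "50"
as.numeric(f)                 # the level codes: 1 2 1 3
as.numeric(as.character(f))   # the labels as numbers: 10 20 10 50
```

Use the first form when you want the position of each level and the second when the labels themselves are the numbers you need.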

Dates

day1 <- as.Date("10/2/2012", format = "%m/%d/%Y")
day1 + 10
day2 <- as.Date("10/2/2013", format = "%m/%d/%Y")
day2 - day1

Why do we care so much about classes?

  • Determines what R can and can't do with data.
  • Can't take the mean of a character vector.
  • E.g., A variable needs to be numeric for OLS.
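A short sketch of the point about character vectors, using a made-up vector of numbers stored as text:

```r
answers <- c("3", "5", "4")   # numbers stored as text
class(answers)                # "character"

# mean() cannot use a character vector: it returns NA (with a warning)
bad <- suppressWarnings(mean(answers))

# Convert first, then summarize
good <- mean(as.numeric(answers))   # 4
```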

Data Structures in R

  • Vectors: The basic data structure. A single column of data of one class.
  • Matrices and Arrays: combination of k vectors of length n, all of the same class.
  • Dataframe: combination of k vectors of length n, possibly of different classes.
  • Lists: combination of objects that may differ in both class and length.

Vectors

LETTERS
# What is the 16th element of the vector LETTERS
LETTERS[16]
# What are the 1st through 3rd and the 10th elements of the vector LETTERS
LETTERS[c(1:3, 10)]
# How long is the vector LETTERS
length(LETTERS)

Matrices

Combination of vectors of the same length and type.

## I have a vector of numbers 1:36, I want to put them in a square 6 x 6
## matrix.
mymat <- matrix(1:36, nrow = 6, ncol = 6)
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]
mymat

X[n, k]: n indexes rows, k indexes columns.

What is the number in the 4th row, 2nd column of matrix mymat?
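A sketch of the indexing forms, rebuilding the same matrix so the block runs on its own. matrix() fills column by column, which is why the second column holds 7 through 12:

```r
mymat <- matrix(1:36, nrow = 6, ncol = 6)
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]

mymat[4, 2]       # row 4, column 2: 10
mymat["D", "H"]   # the same cell, addressed by name
mymat[4, ]        # all of row 4
mymat[, 2]        # all of column 2
```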

Arrays

  • A generalized matrix
  • >2 dimensional matrices
i <- array(1:18, dim = c(3, 3, 2))  # 3 * 3 * 2 = 18 cells
i[, , 1]
i[, , 2]

Lists

  • An ordered collection of possibly unrelated objects.
list(LETTERS, 1:10, 12)
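Naming the elements makes a list easier to work with. A minimal sketch; the element names are chosen here for illustration:

```r
mylist <- list(letters = LETTERS, numbers = 1:10, single = 12)

mylist$numbers               # an element by name
mylist[["numbers"]]          # the same element
mylist[[1]][16]              # 16th item of the first element: "P"
class(mylist["numbers"])     # single brackets return a (shorter) list
class(mylist[["numbers"]])   # double brackets return the element itself
```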

Data frames

  • Combinations of vectors of the same length, which can be of different types.

  • Most likely you'll spend most of your time working with dataframes.

data(ChickWeight)
## Look at the top N (default=6) rows of the dataframe
head(ChickWeight, 10)
## Look at the bottom 6 rows of the dataframe
tail(ChickWeight)
## Names of the variables in the dataframe
names(ChickWeight)
## Number of ways to call a variable in a dataframe.
ChickWeight$Time
ChickWeight[, 2]
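A few more ways to inspect and subset a data frame, sketched with the same built-in dataset. The object names (diet1, day0 and so on) are illustrative:

```r
data(ChickWeight)

dim(ChickWeight)   # number of rows and columns
str(ChickWeight)   # class and first values of each column

# Two more ways to pull out one column
time_by_name <- ChickWeight[["Time"]]
time_by_col  <- ChickWeight[, "Time"]

# Rows that meet a condition (note the comma: rows first, then columns)
diet1 <- ChickWeight[ChickWeight$Diet == 1, ]

# subset() does the same and can also pick columns
day0 <- subset(ChickWeight, Time == 0, select = c(Chick, weight))
head(day0)
```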

Many built in functions.

Some descriptive statistics.

mean(data$variable)
mean(data[,N])
with(data, mean(variable))
median()
summary()
max()
min()
range()
var()
var(ChickWeight$weight)
var(ChickWeight[, 1])

  • Find the mean and median weight of the chicks, and the standard error of the mean.
  • Hint: The standard error is the standard deviation (the square root of the variance) divided by the square root of the sample size.
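One possible way to work the exercise, following the hint directly. This is a sketch, not the only solution; w, n, and se are names chosen here:

```r
data(ChickWeight)
w <- ChickWeight$weight
n <- length(w)

# Standard error: square root of the variance over square root of n
se <- sqrt(var(w)) / sqrt(n)

# sd() is the square root of the variance, so this is equivalent
se_alt <- sd(w) / sqrt(n)

c(mean = mean(w), median = median(w), se = se)
```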

Aggregate Command

aggregate(measuredvariable ~ clusteringvariable, data = dataframe, FUN = function)

Let's say you want the mean for each level of a variable.

aggregate(weight ~ Diet, FUN = mean, ChickWeight)
##   Diet weight
## 1    1  102.6
## 2    2  122.6
## 3    3  142.9
## 4    4  135.3

In a bit, we'll use the plyr package, which does this better.
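Base R can also go a little further than the one-variable case. A sketch comparing aggregate() with tapply(), and grouping by two variables at once:

```r
data(ChickWeight)

# Mean weight per diet, two ways
by_diet      <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
tapply_means <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)

# Mean weight for every Diet-by-Time combination
by_diet_time <- aggregate(weight ~ Diet + Time, data = ChickWeight, FUN = mean)
head(by_diet_time)
```

aggregate() returns a data frame, which is convenient for further work; tapply() returns a named vector, which is convenient for a quick look.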

Some tables.

table(data$x)
table(data$x,data$y)
table(data[,N])
table(data[,N1],data[,N2])
with(data, table(x))
prop.table(table(x))
prop.table(table(x,y),1) # Row Proportions
prop.table(table(x,y),2) # Column Proportions
data(ToothGrowth)
names(ToothGrowth)
  • How many guinea pigs received each type of supplement?
  • How many guinea pigs received each dose by type of supplement?
  • What are the row percentages? What are the column percentages?
tab <- table(ToothGrowth$dose, ToothGrowth$supp)
write.csv(tab, "toothtable.csv")
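A sketch of how the table functions above answer these questions. Naming the dimensions (dose =, supp =) is optional and only labels the output:

```r
data(ToothGrowth)

# Counts for one variable
table(ToothGrowth$supp)

# Two-way table: dose in rows, supplement in columns
tab <- table(dose = ToothGrowth$dose, supp = ToothGrowth$supp)
tab

# Row and column percentages
row_pct <- 100 * prop.table(tab, 1)   # each row sums to 100
col_pct <- 100 * prop.table(tab, 2)   # each column sums to 100
round(row_pct, 1)
round(col_pct, 1)

# Add row and column totals to the counts
addmargins(tab)
```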

There are packages which will print out SPSS-, SAS-, and Stata-style tables.

Packages and functions.

  • There are many, many, many R functions.
  • If you need it, it probably exists.
  • Functions live in packages.
  • Install a package once with install.packages("packagename")
  • Once you've installed a package, load it in each session with library(packagename)
  • Try it with the plyr package
install.packages("plyr")
library(plyr)
library(help = "plyr")
?ddply
ddply(ToothGrowth, .(supp, dose), summarise, mean = mean(len))

Now on to the thick of things.

  • 1. Project
  • 2. Create Project
  • 3. I do almost everything in dropbox and github.

Let's put a dataset into the workspace: the ANES 2006 data, documented in the ANES 2006 codebook.

install.packages("foreign")
library(foreign)
## Type read. and hit the tab key to see the types of data that can be
## read into R using the foreign package
nes06 <- read.spss("NESPIL06.por", to.data.frame = TRUE)

Significance Tests

t.test
chisq.test or summary(table())
wilcox.test
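A sketch of the three tests on a built-in dataset, since the ANES file may not be at hand. ToothGrowth stands in for the survey data here; the formula outcome ~ group is the usual way to compare two groups:

```r
data(ToothGrowth)

# Two-sample t-test (Welch by default): tooth length by supplement type
tt <- t.test(len ~ supp, data = ToothGrowth)
tt$p.value

# Rank-based alternative; exact = FALSE because the data contain ties
wt <- wilcox.test(len ~ supp, data = ToothGrowth, exact = FALSE)
wt$p.value

# Chi-squared test of independence on a two-way table.
# This design is balanced (10 per cell), so the statistic is 0.
ct <- chisq.test(table(ToothGrowth$supp, ToothGrowth$dose))
ct$statistic
```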

Are men significantly more conservative than women?

Gender: V06P005
Political Interest: V06P630

levels(nes06$V06P630)

Hint: Political Interest must be made numeric. Have to remove levels 8 and 9.

nes06$interest <- as.numeric(nes06$V06P630)
nes06$interest[nes06$interest > 5] <- NA

Linear Models

lmmodel <- lm(y~iv1+iv2,data)
anova(lmmodel)

What is the effect of gender on political interest, controlling for age (V06P006)?

Want to see significance?

summary(lmmodel)
or
install.packages("arm")
library(arm)
display(lmmodel,detail=T)
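A runnable sketch of the same workflow on a built-in dataset, in case the ANES file is not loaded. ChickWeight and the model below are stand-ins chosen for illustration:

```r
data(ChickWeight)

# Weight as a function of age in days, controlling for diet
fit <- lm(weight ~ Time + Diet, data = ChickWeight)

coef(fit)                     # point estimates only
coefs <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Pr(>|t|)"]           # just the p-values
confint(fit)                  # 95% confidence intervals
```

Diet is a factor, so lm() creates the dummy variables itself and omits the first level as the reference category.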

Looking more closely at your model.

plot(lmmodel)

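lmmodel <- lm(interest ~ age * gender, data = nes06)  # assumes age and gender were created in nes06 from V06P006 and V06P005
library(effects)
## effect() comes from the effects package; stats::effects() does something else
effect("age*gender", lmmodel)
plot(effect("age*gender", lmmodel))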

Models for categorical and limited dependent variables

?glm #Generalized Linear Models

e.g., glm(formula,family=)

  • family can be: binomial, Gamma, gaussian, inverse.gaussian, poisson, quasi, quasibinomial, quasipoisson.

  • link can be: "logit", "probit", "cauchit", "cloglog", "identity", "log", "sqrt", "1/mu^2", "inverse"

  • polr() in MASS package for ordered logit (the default) or ordered probit (method = "probit")

  • multinom in nnet package for multinomial logit

  • lots of other stuff out there for count models, censored outcomes, survival models, hazard models, time series
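A minimal sketch of glm() with a binary outcome, using the built-in mtcars data as a stand-in (am is 0 for automatic, 1 for manual; wt is weight in 1000 lbs). The family and link are set explicitly to show where they go:

```r
data(mtcars)

# Logistic regression: probability of a manual transmission by car weight
logit_fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Same model with a probit link
probit_fit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

coef(logit_fit)

# Predicted probabilities for two hypothetical weights
newcars <- data.frame(wt = c(2.5, 3.5))
pred <- predict(logit_fit, newdata = newcars, type = "response")
pred
```

type = "response" returns probabilities; the default returns values on the link (log-odds) scale.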

Factor Analysis, Principal Components, SEM

Factor analysis with Schwartz Values Items

schwartz <- nes06[,154:163]
levels(schwartz[,1])
schwartz <- data.matrix(schwartz)
schwartz[schwartz>5]=NA

#OR 

nes06[,154:163] <- data.matrix(nes06[,154:163])
nes06[,154:163][nes06[,154:163]>5]=NA
##try
factanal(schwartz)
out <- factanal(na.omit(schwartz),factors=4,scores="regression")
names(out)
out$scores[,1]
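factanal() has no default for the number of factors and does not accept missing values, which is why the call above supplies factors = and wraps the data in na.omit(). A self-contained sketch on simulated data; the items, the seed, and the two-factor structure are all made up for illustration:

```r
set.seed(1)   # arbitrary seed, so the simulated data are repeatable
n  <- 300
f1 <- rnorm(n)   # first latent factor
f2 <- rnorm(n)   # second latent factor

# Three noisy items per factor
items <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.6),
  a2 = f1 + rnorm(n, sd = 0.6),
  a3 = f1 + rnorm(n, sd = 0.6),
  b1 = f2 + rnorm(n, sd = 0.6),
  b2 = f2 + rnorm(n, sd = 0.6),
  b3 = f2 + rnorm(n, sd = 0.6)
)

fa <- factanal(items, factors = 2, scores = "regression")
print(fa$loadings, cutoff = 0.3)   # hide small loadings to see the pattern
head(fa$scores)                    # one score per respondent per factor
```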

OpenMx

On to graphics

Three popular choices.

  • R's base plots
  • lattice
  • ggplot2: The Grammar of Graphics