R Workshop

Basics, Visualization, Publication-Quality Tables

Yphtach Lelkes

Why use R?

R is free and open source, and can be installed on Windows, macOS, and Linux
Incredibly flexible
Actively developed
Amazing Graphics
Reproducible Research

Publication Quality Tables

library(apsrtable)  # assumes the data frame egss is already loaded
a <- lm(obama ~ age, data = egss)
b <- lm(obama ~ age + pid, data = egss)
apsrtable(a, b, caption = "OLS Predicting Obama Favorability")

Awesome Graphics

Reproducible Research

Installing R

Installing a frontend

Commands and code

You can work directly in the console.
But working with source files lets you save your work.
History window will show you the list of commands.
Cmd-Enter (Mac) or Ctrl-Enter (Windows/Linux) will run the current line.

Shortcuts

R is a calculator

Try some simple math.
+,- #addition, subtraction
*,/ #multiplication, division

Normal rules of arithmetic apply.

2 + 2

## [1] 4

2 + 2 * 3

## [1] 8

(2 + 2) * 3

## [1] 12

R Operators

log(10) # Natural logarithm, base e (about 2.7183)
log10(5) # Common logarithm with base 10
5^2 # 5 raised to the second power
sqrt(16) # Square root
abs(3 - 7) # Absolute value
pi # 3.14
exp(2) # Exponential function
floor(15.9) # Rounds down
ceiling(15.1) # Rounds up
cos(.5) # Cosine Function
sin(.5) # Sine Function
tan(.5) # Tangent Function
acos(0.8775826) # Inverse Cosine
asin(0.4794255) # Inverse Sine
atan(0.5463025) # Inverse Tangent

Advanced Math

Can integrate functions.
Create random samples from distributions (e.g., normal, binomial, Poisson).
Useful for writing examples and testing new tools.

rnorm(10, 0, 1)
rbinom(5, 1, 0.5)
rpois(3, 2)

Or if you're teaching statistics

hist(rnorm(10, 0, 1))
hist(rnorm(1e+05, 0, 1))
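Integration is mentioned above but not shown. A minimal sketch using base R's integrate(), plus set.seed() so that random draws can be repeated. The seed value and the object names area and draws are arbitrary choices for illustration.

```r
# Area under the standard normal density between -1.96 and 1.96
area <- integrate(dnorm, lower = -1.96, upper = 1.96)
area$value  # close to 0.95

# set.seed() makes the random draws below repeatable
set.seed(42)  # arbitrary seed
draws <- rnorm(10, mean = 0, sd = 1)
draws
```

Running the last three lines again gives the same ten numbers, which is handy when writing examples.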

R is a language

John Chambers on S, the precursor to R:

[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important

R works with objects.

Lots of things are objects: data sets, plots, variables.
Let's start with variables.
<- is the assignment operator: it gives a name to a value

x <- 2
x

## [1] 2

<- is preferable to =

Manipulating Variables.

x * x

## [1] 4

y = x + x
y

## [1] 4

Manipulating Variables.

c() to combine into a vector
- A vector is the basic object in R
: sequence operator

o1 <- c(1, 2, 3, 4)
o2 <- c(4:1)
o1
o2
o1 * o2

Manipulating Vectors.

weight <- c(77, 62, 54, 82, 72, 76, 75.34)
height <- c(1.75, 1.65, 1.56, 1.77, 1.82, 1.69, 1.72)
bmi <- weight/height^2
bmi
sum(weight)
length(weight)
sum(weight)/length(weight)
mean(weight)

R works with objects that it stores in its memory.

Datasets

data(cars)
cars

Vectors

cars$speed

Plots

speedhist <- hist(cars$speed, breaks = 10)
plot(speedhist)

Functions

usefulfunction <- function(x) {
    rep("ASCOR", x)
}
usefulfunction(5)

These objects are in R's workspace.

ls()
usefulfunction
rm(usefulfunction)
usefulfunction
ls.str()  #list and describe

Logical Operators

<, >, <=, >=, ==, !=, &, |

a <- c(1, 2, 3, 4)
b <- c(1, 3, 2, 5)
a > b
a == b
a != b

Often combined with brackets

time1 <- sample(1:100, 25)
time2 <- sample(1:100, 25)
time1
time2
time1[time1 > 75]
time1[time1 > mean(time2)]
length(time1[time1 < time2])
time1[time1 < 10 | time1 > 90]
time1[time1 > 10 & time1 < 20]
time1[time1 >= time2] = NA
time1
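Logical vectors can also be counted and located, not just used inside brackets. A small sketch with base R only; scores and the seed are made-up names and values for illustration.

```r
set.seed(3)                   # arbitrary seed
scores <- sample(1:100, 25)   # hypothetical data
sum(scores > 75)              # how many values exceed 75 (TRUE counts as 1)
which(scores > 75)            # positions of those values
mean(scores > 50)             # proportion of values above 50
```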

Classes of data

Generally, there are five classes of data: Numeric, Character, Logical, Factor, Dates

numericvector <- c(1:10)
is.numeric(numericvector)
charactervector <- c("ascor", "r", "political communication")
is.character(charactervector)
logicalvector <- c(TRUE, FALSE, T, F, T, T)
is.logical(logicalvector)
is.numeric(logicalvector)

Factor class

factorvector <- as.factor(sample(c("High Knowledge", "Low Knowledge"), 10, replace = TRUE))
factorvector
levels(factorvector)

A factor stores its levels in a fixed order (alphabetical by default).
As a dummy variable in a regression, the first level will be the omitted (reference) category.

dv <- sample(1:10, 10)
lm(dv ~ factorvector)
factorvectornew <- relevel(factorvector, ref = "Low Knowledge")
levels(factorvectornew)
lm(dv ~ factorvectornew)
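To see the order that is built in to a factor, look at its integer codes. A short sketch; the vector knowledge is a made-up example.

```r
knowledge <- factor(c("Low Knowledge", "High Knowledge", "Low Knowledge"))
levels(knowledge)      # "High Knowledge" "Low Knowledge" (alphabetical)
as.integer(knowledge)  # 2 1 2: each value is stored as the index of its level
```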

Dates

day1 <- as.Date("10/2/2012", format = "%m/%d/%Y")
day1 + 10
day2 <- as.Date("10/2/2013", format = "%m/%d/%Y")
day2 - day1
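A few more things base R can do with dates, sketched with the same two days. Nothing here needs an extra package.

```r
day1 <- as.Date("10/2/2012", format = "%m/%d/%Y")
day2 <- as.Date("10/2/2013", format = "%m/%d/%Y")
as.numeric(day2 - day1)                  # 365 days, as a plain number
difftime(day2, day1, units = "weeks")    # the same gap in weeks
seq(day1, by = "month", length.out = 3)  # three monthly dates starting at day1
```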

Why do we care so much about classes?

Determines what R can and can't do with data.
Can't take the mean of a character vector.
E.g., A variable needs to be numeric for OLS.
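A quick illustration of why class matters. The vector ages is hypothetical: numbers that were read in as text. Calling mean(ages) directly would return NA with a warning.

```r
ages <- c("23", "35", "41")    # hypothetical data stored as character
class(ages)                    # "character"
ages_num <- as.numeric(ages)   # convert to numeric
class(ages_num)                # "numeric"
mean(ages_num)                 # 33
```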

Data Structures in R

Vectors: The basic data structure. A single column of data of one class.
Matrices and Arrays: combination of k vectors of length n of the same class.
Data frames: combination of k vectors of length n that can be of different classes.
Lists: combination of objects that can differ in both class and length.

Vectors

LETTERS
# What is the 16th element of the vector LETTERS
LETTERS[16]
# What are the 1st through 3rd and the 10th elements of the vector LETTERS
LETTERS[c(1:3, 10)]
# How long is the vector LETTERS
length(LETTERS)

Matrices

Combination of vectors of the same length and type.

## I have a vector of numbers 1:36, I want to put them in a square 6 x 6
## matrix.
mymat <- matrix(1:36, nrow = 6, ncol = 6)
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]
mymat

X[n,k]
n indexes rows
k indexes columns
What is the number in the 4th row, 2nd column of matrix mymat?
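One way to answer, sketched with the same 6 x 6 matrix. Cells can be picked by position or by row and column name.

```r
mymat <- matrix(1:36, nrow = 6, ncol = 6)  # filled column by column
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]
mymat[4, 2]      # by position: 10
mymat["D", "H"]  # by name: the same cell
mymat[4, ]       # leave k empty to get the whole 4th row
```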

Arrays

A generalized matrix
Matrices with more than two dimensions

i <- array(1:18, dim = c(3, 3, 2))  # 18 values fill the 3 x 3 x 2 array exactly
i[, , 1]
i[, , 2]

Lists

An ordered collection of possibly unrelated objects.

list(LETTERS, 1:10, 12)
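Getting things back out of a list, as a minimal sketch. The element names (letters, numbers, single) are made up for illustration.

```r
mylist <- list(letters = LETTERS, numbers = 1:10, single = 12)
mylist[[2]]     # double brackets return the element itself
mylist$numbers  # the same element, by name
mylist[2]       # single brackets return a list of length 1
```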

Data frames

Combinations of vectors of the same length that can be of different types.
Most likely you'll spend most of your time working with dataframes.

data(ChickWeight)
## Look at the top N (default=6) rows of the dataframe
head(ChickWeight, 10)
## Look at the bottom 6 rows of the dataframe
tail(ChickWeight)
## Names of the variables in the dataframe
names(ChickWeight)
## Number of ways to call a variable in a dataframe.
ChickWeight$Time
ChickWeight[, 2]

Many built in functions.

Some descriptive statistics.

mean(data$variable)
mean(data[,N])
with(data, mean(variable))
median()
summary()
max()
min()
range()
var()

var(ChickWeight$weight)
var(ChickWeight[, 1])

Find the mean, median, and standard error of the average weight of the chicks.
Hint: The standard error of the mean is the standard deviation (the square root of the variance) divided by the square root of the sample size.
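One possible solution, using only base R. The object names w, n, and se are arbitrary.

```r
w <- ChickWeight$weight
mean(w)
median(w)
n <- length(w)                # sample size
se <- sqrt(var(w)) / sqrt(n)  # same as sd(w) / sqrt(n)
se
```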

Aggregate Command

aggregate(measuredvariable ~ groupingvariable, data = dataframe, FUN = function)

Let's say you want the mean for each level of a variable.

aggregate(weight ~ Diet, FUN = mean, ChickWeight)

##   Diet weight
## 1    1  102.6
## 2    2  122.6
## 3    3  142.9
## 4    4  135.3

We'll soon use the plyr package, which does this better.

Some tables.

table(data$x)
table(data$x,data$y)
table(data[,N])
table(data[,N1],data[,N2])
with(data, table(x))
prop.table(table(x))
prop.table(table(x,y),1) # Row Proportions
prop.table(table(x,y),2) # Column Proportions
data(ToothGrowth)
names(ToothGrowth)

How many guinea pigs received each type of supplement?
How many guinea pigs received each dose by type of supplement?
What are the row percentages? What are the column percentages?

tab <- table(ToothGrowth$dose, ToothGrowth$supp)
write.csv(tab, "toothtable.csv")
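A sketch of how the questions above could be answered with table() and prop.table(). The object names supp_tab and dose_supp are arbitrary.

```r
supp_tab <- table(ToothGrowth$supp)                      # count per supplement
supp_tab
dose_supp <- table(ToothGrowth$dose, ToothGrowth$supp)   # dose by supplement
dose_supp
prop.table(dose_supp, 1)  # row proportions (each row sums to 1)
prop.table(dose_supp, 2)  # column proportions (each column sums to 1)
```

Multiply by 100 to turn proportions into percentages.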

There are packages that will print out SPSS-, SAS-, or Stata-style tables.

Packages and functions.

There are many, many, many R functions.
If you need it, it probably exists.
Functions live in packages.
install.packages("packagename")
Once you've installed a package, load it with library(packagename)
Try it with the plyr package

install.packages("plyr")
library(plyr)
library(help = "plyr")
?ddply
ddply(ToothGrowth, .(supp, dose), summarise, mean = mean(len))
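For comparison, roughly the same cell means can be had from base R's aggregate() with two grouping variables. This is an alternative sketch, not the plyr approach itself; cellmeans is an arbitrary name.

```r
# Mean tooth length for each combination of supplement and dose
cellmeans <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
cellmeans
```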

Now on to the thick of things.

1. Project
2. Create Project
3. I do almost everything in Dropbox and GitHub.

Let's put a dataset into the workspace: the ANES 2006 data, documented in the ANES 2006 codebook.

install.packages("foreign")
library(foreign)
## Type read. and hit the tab key to see the types of data that can be
## read into R using the foreign package
nes06 <- read.spss("NESPIL06.por", to.data.frame = TRUE)

Significance Tests

t.test
chisq.test or summary(table())
wilcox.test

Are men significantly more interested in politics than women?

Gender: V06P005
Political Interest: V06P630

levels(nes06$V06P630)

Hint: Political Interest must be made numeric. Have to remove levels 8 and 9.

nes06$interest <- as.numeric(nes06$V06P630)
nes06$interest[nes06$interest > 5] <- NA
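A sketch of what the test itself might look like. Since the ANES file is not bundled with R, this uses simulated data: the data frame toy, its columns, and its values are all made up and only stand in for the real variables.

```r
set.seed(1)  # arbitrary seed
toy <- data.frame(
  gender   = factor(rep(c("Male", "Female"), each = 50)),  # hypothetical
  interest = sample(1:5, 100, replace = TRUE)              # hypothetical
)
tt <- t.test(interest ~ gender, data = toy)  # Welch two-sample t-test
tt$estimate  # mean interest in each group
tt$p.value
```

With the real data, the same call would use the recoded interest variable and the gender variable from nes06.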

Linear Models

lmmodel <- lm(y~iv1+iv2,data)
anova(lmmodel)

What is the effect of gender on political interest, controlling for age (V06P006)?
Want to see significance?

summary(lmmodel)
## or, using the arm package:
install.packages("arm")
library(arm)
display(lmmodel, detail = TRUE)
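A self-contained sketch of the same workflow on simulated data. The data frame toy and its variables are made up; they only mimic the shape of the gender, age, and interest question.

```r
set.seed(2)  # arbitrary seed
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = 50)),  # hypothetical
  age    = sample(18:90, 100, replace = TRUE)            # hypothetical
)
toy$interest <- 3 + 0.01 * toy$age + rnorm(100)          # simulated outcome

fit <- lm(interest ~ gender + age, data = toy)
coef(summary(fit))  # estimates, standard errors, t values, p values
```

The row labeled genderMale compares men to women (the reference level), holding age constant.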

Looking more closely at your model.

plot(lmmodel)

## assumes nes06 already contains age and gender variables
lmmodel <- lm(interest ~ age * gender, data = nes06)
library(effects)
effect("age*gender", lmmodel)
plot(effect("age*gender", lmmodel))

Models for categorical and limited dependent variables

?glm #Generalized Linear Models

e.g., glm(formula,family=)

* family can be: binomial, Gamma, gaussian, inverse.gaussian, poisson, quasi, quasibinomial, quasipoisson.

* link can be: "logit", "probit", "cauchit", "cloglog", "identity", "log", "sqrt", "1/mu^2", "inverse"
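A minimal logit sketch, using the built-in mtcars data rather than the ANES file. Here am (0 = automatic, 1 = manual) is the binary outcome and wt is car weight; the name logit_fit is arbitrary.

```r
logit_fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
summary(logit_fit)
# Predicted probability of a manual transmission for a car with wt = 3
predict(logit_fit, newdata = data.frame(wt = 3), type = "response")
```

Swapping link = "logit" for link = "probit" fits a probit model with the same call.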

polr() in the MASS package for ordered logit and ordered probit
multinom() in the nnet package for multinomial logit
Lots of other stuff out there for count models, censored outcomes, survival models, hazard models, time series

Factor Analysis, Principal Components, SEM

Factor analysis with Schwartz Values Items

schwartz <- nes06[,154:163]
levels(schwartz[,1])
schwartz <- data.matrix(schwartz)
schwartz[schwartz>5]=NA

#OR 

nes06[,154:163] <- data.matrix(nes06[,154:163])
nes06[,154:163][nes06[,154:163]>5]=NA
##try
factanal(schwartz)
out <- factanal(na.omit(schwartz),factors=4,scores="regression")
names(out)
out$scores[,1]
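Principal components work in a similar way. A sketch with base R's prcomp() on the built-in USArrests data (not the Schwartz items, which need the ANES file); pca is an arbitrary name.

```r
pca <- prcomp(USArrests, scale. = TRUE)  # standardize variables first
summary(pca)                             # variance explained by each component
pca$rotation                             # loadings
head(pca$x[, 1])                         # scores on the first component
```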

OpenMx

On to graphics

Three popular choices.

R's base plots
lattice
ggplot2: The Grammar of Graphics
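As a starting point, a base-graphics sketch with the built-in cars data: a scatterplot with a fitted regression line. The axis labels and title are illustrative choices.

```r
fit <- lm(dist ~ speed, data = cars)
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Speed and stopping distance")
abline(fit)  # add the fitted regression line
```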