Intro to R
Author: John Voorheis
Date: 5/2/2014
There are four basic data objects: vectors, matrices, data.frames and lists. Forget about matrices for now.
thisIsAVector <- c("#Butts", "#LOL", "#Inappropriate")
thisIsADataFrame <- data.frame(hastags = thisIsAVector, randomNumber = rnorm(3,
0, 1))
thisIsAList <- list(hastags = thisIsAVector, randomNumber = rnorm(3, 0, 1),
supermansHome = "Krypton")
thisIsAVector
## [1] "#Butts" "#LOL" "#Inappropriate"
thisIsADataFrame
## hastags randomNumber
## 1 #Butts 0.1034
## 2 #LOL 0.9748
## 3 #Inappropriate -0.2178
thisIsAList
## $hastags
## [1] "#Butts" "#LOL" "#Inappropriate"
##
## $randomNumber
## [1] -0.08646 0.71146 1.03897
##
## $supermansHome
## [1] "Krypton"
List items need not have the same length, but all columns of a data.frame must have the same length. You can call or assign items in a data.frame or list using the $ notation:
try(thisIsADataFrame$supermansHome <- c("Krypton", "Earth"), silent = F)
thisIsADataFrame$supermansHome <- "Krypton"
thisIsADataFrame
## hastags randomNumber supermansHome
## 1 #Butts 0.1034 Krypton
## 2 #LOL 0.9748 Krypton
## 3 #Inappropriate -0.2178 Krypton
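For comparison, list elements can be added or replaced with the same $ notation, and a new element need not match the length of the existing ones (the planets element below is made up purely for illustration):
thisIsAList$planets <- c("Krypton", "Earth")  # two elements; the other items have three
thisIsAList$planets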
As we've seen, data.frames and lists can hold multiple types (integer, numeric, character, factor, etc.). We can coerce data from one type to another, but sometimes that has bad consequences:
as.character(thisIsADataFrame$randomNumber)
## [1] "0.103403841923095" "0.974818583615347" "-0.217842314945977"
as.numeric(thisIsADataFrame$supermansHome)
## Warning: NAs introduced by coercion
## [1] NA NA NA
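One common bad consequence (a quick sketch, not part of the output above): converting a factor straight to numeric returns the underlying level codes rather than the labels, so you need to go through as.character() first.
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # level codes: 1 2 3
as.numeric(as.character(f))   # the intended values: 10 20 30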
Finally, R has a powerful subsetting notation:
# We can pull down columns:
thisIsADataFrame[, 1]
## [1] #Butts #LOL #Inappropriate
## Levels: #Butts #Inappropriate #LOL
# Or subsets (rows that meet a condition):
thisIsADataFrame[thisIsADataFrame$randomNumber < 0, ]
## hastags randomNumber supermansHome
## 3 #Inappropriate -0.2178 Krypton
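A few more subsetting patterns, as a quick sketch on the same data frame:
thisIsADataFrame[, c("hastags", "randomNumber")]   # select columns by name
thisIsADataFrame[-1, ]                             # drop the first row
thisIsADataFrame$randomNumber[thisIsADataFrame$hastags == "#LOL"]  # combine $ and logical subsetting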
There's a lot of built-in functionality in R (and even more in the CRAN libraries). Often we want to do something specific, over and over again. Example: I often want to calculate the 90-10 ratio for an empirical income distribution. First, let's generate some fake data.
library(GB2)
fakeIncomes <- rgb2(1000, 2.5, 40000, 1, 1)
head(fakeIncomes)
## [1] 40308 59552 28519 25332 56210 31019
quantile(fakeIncomes, prob = 0.9)
## 90%
## 94454
as.numeric(quantile(fakeIncomes, probs = 0.9))
## [1] 94454
ratio9010 <- function(x) {
ratio <- as.numeric(quantile(x, probs = 0.9))/as.numeric(quantile(x, probs = 0.1))
return(ratio)
}
ratio9010(fakeIncomes)
## [1] 5.279
fakeIncomes2 <- rgb2(1000, 2.5, 40000, 1, 2)
ratio9010(fakeIncomes2)
## [1] 4.271
You can use the standard for() loops and if() statements. But often, if you are operating on a vector (or a column of a data.frame), there's a vectorized function that's better (sometimes you have to write your own).
# Make a dumb function that 'topcodes' incomes
truncateIncomes <- function(x) {
if (x > 1e+05) {
return(1e+05)
} else {
return(x)
}
}
# Bad:
truncIncomes <- c()
for (i in fakeIncomes) {
truncIncomes <- append(truncIncomes, truncateIncomes(i))
}
# better:
truncIncomes2 <- sapply(fakeIncomes, truncateIncomes)
# best
truncIncomes3 <- ifelse(fakeIncomes > 1e+05, 1e+05, fakeIncomes)
head(truncIncomes)
## [1] 40308 59552 28519 25332 56210 31019
head(truncIncomes2)
## [1] 40308 59552 28519 25332 56210 31019
head(truncIncomes3)
## [1] 40308 59552 28519 25332 56210 31019
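To see how much the vectorized version buys you, system.time() gives a rough (machine-dependent) comparison; a quick sketch:
system.time({
    truncLoop <- c()
    for (i in fakeIncomes) {
        truncLoop <- append(truncLoop, truncateIncomes(i))
    }
})
system.time(ifelse(fakeIncomes > 1e+05, 1e+05, fakeIncomes))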
R has a lot of statistics and econometric functionality built in; with the addition of some outside packages, you can do pretty much anything Stata can do.
Generate some fake wage data:
wages <- data.frame(age = runif(100, 15, 64), exp_fact = runif(100, 0, 0.7),
educ = rlnorm(100, log(8), 0.75))
wages$exp <- wages$age * wages$exp_fact
wages$wage <- exp(0.5 * wages$exp + 0.75 * wages$age + 0.25 * wages$educ + rnorm(100,
0, 2))
lm() is the built-in function for running linear regressions:
model1 <- lm(log(wage) ~ exp + age + educ, data = wages)
summary(model1)
##
## Call:
## lm(formula = log(wage) ~ exp + age + educ, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.216 -0.915 0.131 1.296 4.436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4726 0.5587 0.85 0.4
## exp 0.5079 0.0262 19.35 <2e-16 ***
## age 0.7235 0.0153 47.28 <2e-16 ***
## educ 0.2499 0.0171 14.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.96 on 96 degrees of freedom
## Multiple R-squared: 0.981, Adjusted R-squared: 0.98
## F-statistic: 1.66e+03 on 3 and 96 DF, p-value: <2e-16
One thing base R doesn't do particularly well is alternative variance-covariance matrix estimators. Robust standard errors are available through add-on packages (lmtest and sandwich).
library(AER)
## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## Attaching package: 'lmtest'
##
## The following object is masked _by_ '.GlobalEnv':
##
## wages
##
## Loading required package: sandwich
## Loading required package: survival
## Loading required package: splines
coeftest(model1, vcovHC(model1))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4726 0.5767 0.82 0.41
## exp 0.5079 0.0255 19.90 <2e-16 ***
## age 0.7235 0.0152 47.60 <2e-16 ***
## educ 0.2499 0.0213 11.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
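If you want standard errors that match Stata's , robust option exactly, vcovHC() takes a type argument; HC1 is the small-sample adjustment Stata uses. A sketch on the same model:
coeftest(model1, vcovHC(model1, type = "HC1"))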
Base R also ships with functionality to estimate most generalized linear models (e.g. logit, probit, Poisson) via the glm() function:
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial(link = logit))
summary(mylogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.580 -0.885 -0.638 1.157 2.173
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.44955 1.13285 -3.05 0.0023 **
## gre 0.00229 0.00109 2.10 0.0356 *
## gpa 0.77701 0.32748 2.37 0.0177 *
## rank -0.56003 0.12714 -4.40 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 459.44 on 396 degrees of freedom
## AIC: 467.4
##
## Number of Fisher Scoring iterations: 4
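Fitted probabilities (rather than coefficients on the log-odds scale) come from predict() with type = "response"; a quick sketch, with the newdata values chosen purely for illustration:
head(predict(mylogit, type = "response"))
predict(mylogit, newdata = data.frame(gre = 600, gpa = 3.5, rank = 2), type = "response")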
More sophisticated estimators can be found in add-on packages; I'll highlight two examples. We often work with panel data of some sort. In Stata, we'd use xtreg or xtivreg2; equivalent functionality is available in R through the lfe package, which also supports clustered (even multiway clustered) standard errors and IV estimation.
library(lfe)
## Loading required package: Matrix
# Load health + inequality data
load("/home/john/Dropbox/PUMA_Health_Ineq/PUMA_MSA_ineq_GB2.rda")
# Create a panel variable
try3$unit <- sapply(1:4188, function(x) paste(as.character(try3$st_PUMA[x]),
as.character(try3$MSA[x]), sep = "_"))
try3$crude_rate <- try3$Deaths/try3$Population
# winsorize crude rate at 99%
try3 <- try3[which(try3$crude_rate < quantile(try3$crude_rate, probs = 0.99)),
]
head(try3$unit)
## [1] "0_1000" "0_1000" "0_1000" "0_1000" "0_1000" "0_1000"
# Model with just county-group FE, with county-group clustered SE
model_FE <- felm(crude_rate ~ ACS_Gini | unit | unit, data = try3)
summary(model_FE)
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini | unit | unit, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.850730 -0.005858 -0.000104 0.005612 0.644550
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 0.1088 0.0574 1.9 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0456 on 3448 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.904
## F-statistic(normal s.e.):57.3 on 697 and 3448 DF, p-value: <2e-16
## F-statistic(proj): 0.00515 on 697 and 3448 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
# Model with county-group and year FE, with county-group clustered SE
model_FE2 <- felm(crude_rate ~ ACS_Gini | unit + year | unit, data = try3)
summary(model_FE2)
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini | unit + year | unit, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.48e-01 -5.90e-03 -5.13e-06 5.91e-03 6.45e-01
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 0.1301 0.0592 2.2 0.028 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0456 on 3443 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.904
## F-statistic(normal s.e.):56.9 on 702 and 3443 DF, p-value: <2e-16
## F-statistic(proj): 0.00687 on 702 and 3443 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
# Above, but with covariates:
model_FE3 <- felm(crude_rate ~ ACS_Gini + black + latino + median_inc + under19 +
over65 + college | unit + year | unit, data = try3)
summary(model_FE3)
## Warning: the matrix is either rank-deficient or indefinite
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini + black + latino + median_inc + under19 + over65 + college | unit + year | unit, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.847440 -0.006316 -0.000108 0.006208 0.643483
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 1.07e-01 5.21e-02 2.06 0.04 *
## black -6.36e-02 8.77e-02 -0.73 0.47
## latino -1.95e-01 1.26e-01 -1.54 0.12
## median_inc -3.80e-07 3.19e-07 -1.19 0.23
## under19 8.26e-02 8.86e-02 0.93 0.35
## over65 1.32e-01 1.36e-01 0.97 0.33
## college -2.91e-02 6.99e-02 -0.42 0.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0455 on 3437 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.905
## F-statistic(normal s.e.):56.5 on 708 and 3437 DF, p-value: <2e-16
## F-statistic(proj): 0.0125 on 708 and 3437 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
# With county-group and year two way clustering:
model_FE4 <- felm(crude_rate ~ ACS_Gini + black + latino + median_inc + under19 +
over65 + college | unit + year | unit + year, data = try3)
summary(model_FE4)
## Warning: the matrix is either rank-deficient or indefinite
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini + black + latino + median_inc + under19 + over65 + college | unit + year | unit + year, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.847440 -0.006316 -0.000108 0.006208 0.643483
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 1.07e-01 7.79e-02 1.38 0.169
## black -6.36e-02 8.98e-02 -0.71 0.479
## latino -1.95e-01 1.07e-01 -1.82 0.069 .
## median_inc -3.80e-07 3.42e-07 -1.11 0.267
## under19 8.26e-02 6.81e-02 1.21 0.226
## over65 1.32e-01 1.16e-01 1.13 0.257
## college -2.91e-02 6.76e-02 -0.43 0.667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0455 on 3437 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.905
## F-statistic(normal s.e.):56.5 on 708 and 3437 DF, p-value: <2e-16
## F-statistic(proj): 0.0111 on 708 and 3437 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
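lfe can also do IV alongside high-dimensional fixed effects, but its formula syntax for instruments has changed across versions, so here is a simpler sketch of the plain single-equation IV case using ivreg() from the AER package loaded earlier (all data here are simulated purely for illustration):
# Simulate an endogenous regressor x_endog, instrumented by z
n <- 1000
z <- rnorm(n)
u <- rnorm(n)                        # unobserved confounder
x_endog <- z + u + rnorm(n)          # correlated with the error through u
y <- 1 + 2 * x_endog + u + rnorm(n)  # true coefficient on x_endog is 2
iv.model <- ivreg(y ~ x_endog | z)   # instruments go after the |
summary(iv.model)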
The rdd package gives us the functionality to estimate regression discontinuity (RD) designs, along with the standard diagnostic tests, optimal bandwidth selection, etc.
library(rdd)
## Loading required package: Formula
x <- runif(1000, -1, 1)
cov <- rnorm(1000)
y <- 3 + 2 * x + 3 * cov + 10 * (x >= 0) + rnorm(1000)
model.rd <- RDestimate(y ~ x | cov, cluster = x)
summary(model.rd)
##
## Call:
## RDestimate(formula = y ~ x | cov, cluster = x)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value
## LATE 0.359 356 9.74 0.203 48.0
## Half-BW 0.180 180 9.60 0.265 36.2
## Double-BW 0.719 731 9.72 0.152 64.0
## Pr(>|z|)
## LATE 0.00e+00 ***
## Half-BW 1.99e-287 ***
## Double-BW 0.00e+00 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 2997 5 350 0
## Half-BW 1608 5 174 0
## Double-BW 6140 5 725 0
plot(model.rd)
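The package's diagnostics include the McCrary density test for sorting on the running variable; a sketch on the simulated data above (we'd expect not to reject, since x was drawn uniformly):
DCdensity(x, cutpoint = 0)  # McCrary (2008) test: returns the p-value and plots the estimated density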
One final estimation example: for problems where we need to model discrete choices (e.g. transportation mode choice or recreation demand), a variant of the multinomial logit model is used. In particular, a state-of-the-art approach at the moment is to allow for heterogeneity by estimating a random parameters (mixed) logit model. The best R implementation of the mixed logit model is in the mlogit package.
library(mlogit)
## Loading required package: maxLik
## Loading required package: miscTools
##
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
##
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
mlogit.df <- read.table("/home/john/Dropbox/clogit.dat", quote = "\"")
colnames(mlogit.df) <- c("mode", "ttme", "invc", "invt", "gc", "chair", "hinc",
"psize", "indj", "indi", "aasc", "tasc", "basc", "casc", "hinca", "psizea",
"z", "nij", "ni")
mlogit.df$mode.ids <- factor(rep(1:4, 210))
mlogit.df$mode.ids <- factor(rep(1:4, 210), labels = c("air", "train", "bus",
"car"))
mlogit.df$indivs <- factor(rep(1:210, each = 4))
mlogit.df <- mlogit.data(mlogit.df, shape = "long", choice = "mode", alt.var = "mode.ids")
system.time(mlogit.model <- mlogit(mode ~ ttme + gc, data = mlogit.df, reflevel = "car",
rpar = c(`air:(intercept)` = "n", `bus:(intercept)` = "n", `train:(intercept)` = "n"),
R = 500, halton = NA, print.level = 1))
## Initial value of the function : 200.021471160969
## iteration 1, step = 0.00390625, lnL = 200.00700181, chi2 = 42.89405615
## iteration 2, step = 0.015625, lnL = 200.00530375, chi2 = 0.23847745
## iteration 3, step = 0.125, lnL = 199.73904009, chi2 = 0.77798228
## iteration 4, step = 0.25, lnL = 199.71467864, chi2 = 0.25353518
## iteration 5, step = 0.5, lnL = 199.62259894, chi2 = 2.54825399
## iteration 6, step = 1, lnL = 198.92335441, chi2 = 0.9368018
## iteration 7, step = 1, lnL = 198.21614612, chi2 = 1.17814832
## iteration 8, step = 1, lnL = 197.78031543, chi2 = 0.73589678
## iteration 9, step = 1, lnL = 197.54249151, chi2 = 0.36853162
## iteration 10, step = 1, lnL = 197.48342838, chi2 = 0.0944876
## iteration 11, step = 1, lnL = 197.47108299, chi2 = 0.0189415
## iteration 12, step = 1, lnL = 197.46807699, chi2 = 0.00507095
## iteration 13, step = 1, lnL = 197.46781515, chi2 = 0.00046435
## iteration 14, step = 1, lnL = 197.46780495, chi2 = 1.855e-05
## iteration 15, step = 1, lnL = 197.46780438, chi2 = 8.2e-07
## user system elapsed
## 11.414 0.024 11.435
summary(mlogit.model)
##
## Call:
## mlogit(formula = mode ~ ttme + gc, data = mlogit.df, reflevel = "car",
## rpar = c(`air:(intercept)` = "n", `bus:(intercept)` = "n",
## `train:(intercept)` = "n"), R = 500, halton = NA, print.level = 1)
##
## Frequencies of alternatives:
## car air train bus
## 0.281 0.276 0.300 0.143
##
## bfgs method
## 15 iterations, 0h:0m:11s
## g'(-H)^-1g = 8.22E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## air:(intercept) 6.15924 1.13869 5.41 6.3e-08 ***
## train:(intercept) 5.10960 0.81745 6.25 4.1e-10 ***
## bus:(intercept) 4.14873 0.90422 4.59 4.5e-06 ***
## ttme -0.11444 0.01990 -5.75 8.9e-09 ***
## gc -0.03084 0.00775 -3.98 6.9e-05 ***
## sd.air:(intercept) 2.93805 0.91640 3.21 0.0013 **
## sd.train:(intercept) 0.03826 21.45898 0.00 0.9986
## sd.bus:(intercept) 0.00166 76.69203 0.00 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -197
## McFadden R^2: 0.304
## Likelihood ratio test : chisq = 173 (p.value = <2e-16)
##
## random coefficients
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## air:(intercept) -Inf 4.178 6.159 6.159 8.141 Inf
## bus:(intercept) -Inf 4.148 4.149 4.149 4.150 Inf
## train:(intercept) -Inf 5.084 5.110 5.110 5.135 Inf
The equivalent Stata implementation (via the mixlogit ado file) takes almost four times as long (~32 seconds).
One thing I want to do fairly often is to split my data into subsets (stratifying by race, for instance), apply a function to each of the subsets, and then combine these results into a new dataset. This process (split-apply-combine) is best accomplished through either the plyr or dplyr package, both by Hadley Wickham. I'll use the earlier (and slightly slower) plyr syntax here. Our example will be a simplified task from my field paper: generate income inequality measures for each state on an annual basis from the pooled CPS microdata.
# Split Apply Combine Suppose we need to apply a function repeatedly to
# different parts of our dataset. An efficient algorithm is
# split-apply-combine. Example: Annual State level inequality. Using older
# plyr syntax:
library(ggplot2)
library(plyr)
library(reldist)
## reldist: Relative Distribution Methods
## Version 1.6-2 created on 2013-03-03.
## copyright (c) 2003, Mark S. Handcock, University of California-Los Angeles
## For citation information, type citation("reldist").
## Type help(package="reldist") to get started.
library(ineq)
##
## Attaching package: 'ineq'
##
## The following object is masked from 'package:reldist':
##
## entropy
load("/media/john/Shared Linux_Windows Files/MSA Level Inequality/Data/CPS_ind_baseline.rda")
top1share <- function(x) {
top1 <- sum(x[which(x >= quantile(x, probs = 0.99))])
total <- sum(x)
return(top1/total)
}
# The Theil index is only defined for strictly positive incomes, so
# bottom-code incomes at 1
CPS.work.hh.baseline$cellmean_equivinc <- ifelse(CPS.work.hh.baseline$cellmean_equivinc <=
1, 1, CPS.work.hh.baseline$cellmean_equivinc)
CPS.work.hh.baseline <- CPS.work.hh.baseline[is.na(CPS.work.hh.baseline$cellmean_equivinc) ==
F, ]
system.time(State.ineq <- ddply(CPS.work.hh.baseline, .variables = c("State",
"year"), function(x) data.frame(CPS_Gini = gini(x$cellmean_equivinc, w = x$wtsupp),
CPS_theil = Theil(x$cellmean_equivinc), CPS_top1 = top1share(x$cellmean_equivinc))))
## user system elapsed
## 3.383 0.076 3.459
head(State.ineq)
## State year CPS_Gini CPS_theil CPS_top1
## 1 Alabama 1992 0.3969 0.2597 0.05355
## 2 Alabama 1993 0.4086 0.2946 0.06184
## 3 Alabama 1994 0.4019 0.2838 0.06681
## 4 Alabama 1995 0.4096 0.2938 0.06127
## 5 Alabama 1996 0.4436 0.3721 0.08811
## 6 Alabama 1997 0.4235 0.3311 0.08092
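The newer dplyr syntax does the same split-apply-combine with group_by() and summarise(); here's a sketch assuming the same data and inequality helpers (note that loading dplyr after plyr masks a few functions):
library(dplyr)
State.ineq.dplyr <- summarise(group_by(CPS.work.hh.baseline, State, year),
    CPS_Gini = gini(cellmean_equivinc, w = wtsupp),
    CPS_theil = Theil(cellmean_equivinc),
    CPS_top1 = top1share(cellmean_equivinc))
head(State.ineq.dplyr)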
One nice thing about plyr is that it will utilize more than one core if you have a “parallel backend” registered (more on that in a minute).
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores = detectCores())
system.time(State.ineq <- ddply(CPS.work.hh.baseline, .variables = c("State",
"year"), function(x) data.frame(CPS_Gini = gini(x$cellmean_equivinc, w = x$wtsupp),
CPS_theil = Theil(x$cellmean_equivinc), CPS_top1 = top1share(x$cellmean_equivinc)),
.parallel = T))
## user system elapsed
## 4.935 1.120 2.074
Not too shabby, but far less than linear.
Hadley Wickham also wrote one of the best graphics toolkits in R: the ggplot2 package. If we wanted to graph the time series of Oregon's Gini coefficient, we could do so easily:
ggplot(State.ineq[State.ineq$State == "Oregon", ], aes(x = year)) + geom_line(aes(y = CPS_Gini,
colour = "CPS_Gini"))
You can do almost any data visualization task with ggplot, but one of the neatest is the ability to make facet graphs:
ggplot(State.ineq[State.ineq$State != "District of Columbia", ], aes(x = year)) +
geom_line(aes(y = CPS_Gini, colour = "CPS_Gini")) + facet_wrap(~State, ncol = 10)
Thus far, we've done stuff that Stata can do perfectly fine. Let's look at a couple of things where R excels. First: parallelization and bootstrapping. Bootstrapping is a very useful statistical technique. It's especially useful when conventional inference doesn't perform particularly well. I'll use an example from my own research. Imagine we have two Lorenz curves, and we want to perform a test of Lorenz dominance. One (rough) way of doing this would be to perform a difference-in-Lorenz-ordinates test for some finite set of ordinates. Problem: the standard way of doing this (a t-test) doesn't perform very well in practice.
# Bootstrapping example: difference in Lorenz ordinates confidence intervals.
# First, generate some fake income distributions
library(GB2)
library(reldist)
distribution1 <- rgb2(10000, 2.5, 50000, 1, 1)
distribution2 <- rgb2(10000, 2.4, 50000, 1, 1)
L_x2 <- pgb2(qgb2(0.5, 2.4, 50000, 1, 1), 2.4, 50000, 1 + (1/2.4), 1 - (1/2.4))
L_x1 <- pgb2(qgb2(0.5, 2.5, 50000, 1, 1), 2.5, 50000, 1 + (1/2.5), 1 - (1/2.5))
L_x2 - L_x1
## [1] -0.009679
# standard inference
diff1 <- Lorenz_KB(distribution2, 0.5) - Lorenz_KB(distribution1, 0.5)
var1 <- (Lorenz_KB(distribution2, 0.5, type = "variance") + Lorenz_KB(distribution1,
0.5, type = "variance"))
diff1/sqrt(var1) #t-stat
## [1] -0.02726
c(diff1 - 1.96 * sqrt(var1), diff1 + 1.96 * sqrt(var1)) #95% CI
## [1] -0.7180 0.6983
So let's see if we can't do better with the bootstrap. We'll be doing a very simple form of bootstrap inference - we'll bootstrap the difference in Lorenz ordinates directly, and use the bootstrap confidence interval to perform inference:
# Bootstrap inference, slow but straightforward way:
bootreps <- numeric(500)
ptm <- proc.time()
for (i in 1:500) {
samples <- sample(1:10000, replace = T)
bootreps[i] <- Lorenz_KB(distribution2[samples], 0.5) - Lorenz_KB(distribution1[samples],
0.5)
}
seq.time <- proc.time() - ptm
quantile(bootreps, c(0.025, 0.975))
## 2.5% 97.5%
## -0.016394 -0.002671
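The boot package wraps this resample-and-recompute pattern (and provides several flavors of bootstrap confidence interval); a sketch assuming the same Lorenz_KB helper used above:
library(boot)
lorenz.diff <- function(data, idx) {
    Lorenz_KB(data[idx, 2], 0.5) - Lorenz_KB(data[idx, 1], 0.5)
}
boot.out <- boot(data = cbind(distribution1, distribution2), statistic = lorenz.diff, R = 500)
boot.ci(boot.out, type = "perc")  # percentile interval, comparable to the quantile() approach above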
Bootstrapping is a computational task that is classified as "embarrassingly parallel": each bootstrap repetition is independent, so we can distribute the repetitions across as many cores as we want (or have available), and performance will scale roughly linearly. R makes parallelization pretty easy. You just need to set up a parallel backend (on Linux or Mac, doMC is the way to go; on Windows, the doParallel package appears to be the way to go).
# We can do better! First step, set up a parallel backend
library(doMC)
registerDoMC(cores = detectCores())
# The foreach package lets us perform parallel for loops. This will just
# perform a (dumb) for loop, sequentially:
library(foreach)
try1 <- foreach(i = 1:100, .combine = "c") %do% {
rnorm(1, 0, i)
}
# If we use the %dopar% operator instead:
library(foreach)
try1 <- foreach(i = 1:100, .combine = "c") %dopar% {
rnorm(1, 0, i)
}
# foreach loops work best when each iteration is independent.
ptm1 <- proc.time()
bootreps.par <- foreach(i = 1:500, .combine = "c") %dopar% {
samples <- sample(1:10000, replace = T)
Lorenz_KB(distribution2[samples], 0.5) - Lorenz_KB(distribution1[samples],
0.5)
}
par.time <- proc.time() - ptm1
quantile(bootreps.par, c(0.025, 0.975))
## 2.5% 97.5%
## -0.015833 -0.003265
seq.time/par.time
## user system elapsed
## 168.54438 0.08696 3.77862
R has a lot of functionality for mapping and GIS tasks. One thing we often need to do is create choropleths to visualize our data. There's a new package (choroplethr, developed, oddly enough, by Trulia) that makes this super easy. Here's a version of the inequality maps I made a couple months ago, which took me a few minutes last night (much less time than the originals):
# Mapping Stuff!
library(choroplethr)
load("/media/john/Shared Linux_Windows Files/MSA Level Inequality/Data/baseline_multi_ineq_ind_State.rda")
multi_ineq_means$State <- as.character(multi_ineq_means$State)
ineq.df <- multi_ineq_means[multi_ineq_means$year == 2011, ][, c(1, 3)]
colnames(ineq.df) <- c("region", "value")
choroplethr(ineq.df, "state", title = "State Gini, 2011")
If we want a quick and dirty visualization from any of the ACS 5-year tables, we can do that easily too:
choroplethr_acs(tableId = "B19301", lod = "county")
R has built-in capabilities for basic web-facing tasks, and with CRAN libraries you can access pretty much anything on the web. In particular, we can access APIs offered by many companies. I'll use two examples: the Twitter API (which you heard me mumble about a couple weeks ago) and the MapQuest OpenStreetMap API (which powers the mqtime.ado file I wrote).
Here's an example of a function that queries the OSM API to calculate driving distances:
# API access via R
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:lmtest':
##
## reset
library(rjson)
mqdirections <- function(address1, address2) {
if (address1 == "Georgia") {
address1 = "Atlanta,Georgia"
}
if (address2 == "Georgia") {
address2 = "Atlanta,Georgia"
}
URL2 = paste("http://open.mapquestapi.com/directions/v1/route?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz&from=",
address1, "&to=", address2, "&outFormat='json'", "boundingBox=24,-85,50,-125",
sep = "")
URL2 <- gsub(" ", "+", URL2)
x = getURL(URL2)
x1 <- fromJSON(x)
if (length(x1$route$distance) == 0) {
return(NA)
} else {
return(x1$route$distance)
}
}
mqdirections("Eugene,OR", "Corvallis,OR")
## [1] 40.26
Similarly, we can use the OSM API to geocode text addresses:
geocode_attempt <- function(address) {
URL2 = paste("http://open.mapquestapi.com/geocoding/v1/address?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz",
"&location=", address, "&outFormat='json'", "boundingBox=24,-85,50,-125",
sep = "")
# print(URL2)
URL2 <- gsub(" ", "+", URL2)
x = getURL(URL2)
x1 <- fromJSON(x)
if (length(x1$results[[1]]$locations) == 0) {
return(NA)
} else {
return(c(x1$results[[1]]$locations[[1]]$displayLatLng$lat, x1$results[[1]]$locations[[1]]$displayLatLng$lng))
}
}
geocode_attempt("1241 Kincaid St, Eugene,OR")
## [1] 44.05 -123.08
Twitter also provides APIs that allow us to access (some) tweets. There are two: the streaming API, which gives us access to a 5% sample of publicly available tweets, and the REST API, which allows us to request tweets from specific users.
library("twitteR")
## Loading required package: ROAuth
## Loading required package: digest
##
## Attaching package: 'twitteR'
##
## The following object is masked from 'package:plyr':
##
## id
library("ROAuth")
library(streamR)
load("~/credentials.RData")
filterStream("USA_tweets1130.json", locations = c(-120, 30, -65, 49), timeout = 30,
oauth = Credentials)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 1012 tweets downloaded.
tweets.df <- parseTweets("USA_tweets950.json", verbose = FALSE)
library(ggplot2)
library(grid)
map.data <- map_data("state")
points <- data.frame(x = as.numeric(tweets.df$lon), y = as.numeric(tweets.df$lat))
points <- points[points$y > 25, ]
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "white",
color = "grey20", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
labs(title = "9:50am Friday Tweets") + theme(axis.line = element_blank(),
axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(),
panel.background = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank(),
plot.background = element_blank(), plot.margin = unit(0 * c(-1.5, -1.5,
-1.5, -1.5), "lines")) + geom_point(data = points, aes(x = x, y = y),
size = 1, alpha = 1/5, color = "darkblue")
## Warning: Removed 1300 rows containing missing values (geom_point).
Here's the same thing a couple hours later:
tweets.df <- parseTweets("USA_tweets1130.json", verbose = FALSE)
library(ggplot2)
library(grid)
map.data <- map_data("state")
points <- data.frame(x = as.numeric(tweets.df$lon), y = as.numeric(tweets.df$lat))
points <- points[points$y > 25, ]
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "white",
color = "grey20", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
labs(title = "11:30am Friday Tweets") + theme(axis.line = element_blank(),
axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(),
panel.background = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank(),
plot.background = element_blank(), plot.margin = unit(0 * c(-1.5, -1.5,
-1.5, -1.5), "lines")) + geom_point(data = points, aes(x = x, y = y),
size = 1, alpha = 1/5, color = "darkblue")
## Warning: Removed 1659 rows containing missing values (geom_point).
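filterStream() can also filter the stream by keyword instead of a bounding box; a quick sketch with the same credentials (the filename and keyword here are just placeholders):
filterStream("econ_tweets.json", track = "economics", timeout = 30, oauth = Credentials)
econ.df <- parseTweets("econ_tweets.json", verbose = FALSE)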
We can also access the last <3200 tweets from a given user via the REST API, accessible through the twitteR library:
registerTwitterOAuth(Credentials)
## [1] TRUE
mark <- userTimeline("MarkThoma", n = 100)
head(mark)
## [[1]]
## [1] "MarkThoma: 'Varian: Big Data: New Tricks for Econometrics' http://t.co/RorvwANhkG"
##
## [[2]]
## [1] "MarkThoma: 'Economy Adds 288,000 Jobs in April, Sharp Drop in Labor Force Leads to Plunge in Unemployment' http://t.co/XTclSsZVUb"
##
## [[3]]
## [1] "MarkThoma: Paul Krugman: Why Economics Failed http://t.co/LIiwwh39Jp"
##
## [[4]]
## [1] "MarkThoma: Links for 5-02-14 http://t.co/ILqJsfw9kj"
##
## [[5]]
## [1] "MarkThoma: Thomas Piketty and the Ghost of Joan Robinson - Dean Baker\n http://t.co/Ln4vWnSBpK"
##
## [[6]]
## [1] "MarkThoma: Paul Krugman Really Blows It - EconoSpeak http://t.co/avZ3sisSFx"
Or access the current trends on Twitter:
place <- closestTrendLocations(44.04624, -123.07854)$woeid
getTrends(place)
## name url query
## 1 Starbucks http://twitter.com/search?q=Starbucks Starbucks
## 2 Portland http://twitter.com/search?q=Portland Portland
## 3 #RipCity http://twitter.com/search?q=%23RipCity %23RipCity
## 4 Spider-Man http://twitter.com/search?q=Spider-Man Spider-Man
## 5 Taco Bell http://twitter.com/search?q=%22Taco+Bell%22 %22Taco+Bell%22
## 6 Game 6 http://twitter.com/search?q=%22Game+6%22 %22Game+6%22
## 7 Oregon http://twitter.com/search?q=Oregon Oregon
## 8 Snapchat http://twitter.com/search?q=Snapchat Snapchat
## 9 Star Wars http://twitter.com/search?q=%22Star+Wars%22 %22Star+Wars%22
## 10 Tumblr http://twitter.com/search?q=Tumblr Tumblr
## woeid
## 1 2475687
## 2 2475687
## 3 2475687
## 4 2475687
## 5 2475687
## 6 2475687
## 7 2475687
## 8 2475687
## 9 2475687
## 10 2475687
And there are about 1100 other things you can do with R.