Intro to R
Author: John Voorheis
Date: 5/2/2014
There are four basic data objects: vectors, matrices, data.frames and lists. Forget about matrices for now.
thisIsAVector <- c("#Butts", "#LOL", "#Inappropriate")
thisIsADataFrame <- data.frame(hastags = thisIsAVector, randomNumber = rnorm(3,
0, 1))
thisIsAList <- list(hastags = thisIsAVector, randomNumber = rnorm(3, 0, 1),
supermansHome = "Krypton")
thisIsAVector
## [1] "#Butts" "#LOL" "#Inappropriate"
thisIsADataFrame
## hastags randomNumber
## 1 #Butts 0.1034
## 2 #LOL 0.9748
## 3 #Inappropriate -0.2178
thisIsAList
## $hastags
## [1] "#Butts" "#LOL" "#Inappropriate"
##
## $randomNumber
## [1] -0.08646 0.71146 1.03897
##
## $supermansHome
## [1] "Krypton"
List items need not have the same length, but all columns of a data.frame must have the same length. You can call or assign items in a data.frame or list using the $ notation:
try(thisIsADataFrame$supermansHome <- c("Krypton", "Earth"), silent = F)
thisIsADataFrame$supermansHome <- "Krypton"
thisIsADataFrame
## hastags randomNumber supermansHome
## 1 #Butts 0.1034 Krypton
## 2 #LOL 0.9748 Krypton
## 3 #Inappropriate -0.2178 Krypton
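For comparison, list elements can be added or replaced with the same $ notation, and a new element need not match the length of the existing ones (the planets element below is made up purely for illustration):
thisIsAList$planets <- c("Krypton", "Earth")  # two elements; the other items have three
thisIsAList$planets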
As we've seen, data.frames and lists can hold multiple types (integer, numeric, character, factor, etc.). We can coerce data from one type to another, but sometimes that has bad consequences:
as.character(thisIsADataFrame$randomNumber)
## [1] "0.103403841923095" "0.974818583615347" "-0.217842314945977"
as.numeric(thisIsADataFrame$supermansHome)
## Warning: NAs introduced by coercion
## [1] NA NA NA
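One common bad consequence (a quick sketch, not part of the output above): converting a factor straight to numeric returns the underlying level codes rather than the labels, so you need to go through as.character() first.
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # level codes: 1 2 3
as.numeric(as.character(f))   # the intended values: 10 20 30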
Finally, R has a powerful subsetting notation:
# We can pull down columns:
thisIsADataFrame[, 1]
## [1] #Butts #LOL #Inappropriate
## Levels: #Butts #Inappropriate #LOL
# Or subsets (rows that meet a condition):
thisIsADataFrame[thisIsADataFrame$randomNumber < 0, ]
## hastags randomNumber supermansHome
## 3 #Inappropriate -0.2178 Krypton
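A few more subsetting patterns, as a quick sketch on the same data frame:
thisIsADataFrame[, c("hastags", "randomNumber")]   # select columns by name
thisIsADataFrame[-1, ]                             # drop the first row
thisIsADataFrame$randomNumber[thisIsADataFrame$hastags == "#LOL"]  # combine $ and logical subsetting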
There's a lot of built-in functionality in R (and even more in the CRAN libraries). Often we want to do something specific, over and over again. Example: I often want to calculate the 90-10 ratio for an empirical income distribution. First, let's generate some fake data.
library(GB2)
fakeIncomes <- rgb2(1000, 2.5, 40000, 1, 1)
head(fakeIncomes)
## [1] 40308 59552 28519 25332 56210 31019
quantile(fakeIncomes, prob = 0.9)
## 90%
## 94454
as.numeric(quantile(fakeIncomes, probs = 0.9))
## [1] 94454
ratio9010 <- function(x) {
ratio <- as.numeric(quantile(x, probs = 0.9))/as.numeric(quantile(x, probs = 0.1))
return(ratio)
}
ratio9010(fakeIncomes)
## [1] 5.279
fakeIncomes2 <- rgb2(1000, 2.5, 40000, 1, 2)
ratio9010(fakeIncomes2)
## [1] 4.271
You can use the standard for() loops and if() statements. But often, if you are operating on a vector (or a column of a data.frame), there's a vectorized function that's better (sometimes you have to write your own).
# Make a dumb function that 'topcodes' incomes
truncateIncomes <- function(x) {
if (x > 1e+05) {
return(1e+05)
} else {
return(x)
}
}
# Bad:
truncIncomes <- c()
for (i in fakeIncomes) {
truncIncomes <- append(truncIncomes, truncateIncomes(i))
}
# better:
truncIncomes2 <- sapply(fakeIncomes, truncateIncomes)
# best
truncIncomes3 <- ifelse(fakeIncomes > 1e+05, 1e+05, fakeIncomes)
head(truncIncomes)
## [1] 40308 59552 28519 25332 56210 31019
head(truncIncomes2)
## [1] 40308 59552 28519 25332 56210 31019
head(truncIncomes3)
## [1] 40308 59552 28519 25332 56210 31019
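To see how much the vectorized version buys you, system.time() gives a rough (machine-dependent) comparison; a quick sketch:
system.time({
    truncLoop <- c()
    for (i in fakeIncomes) {
        truncLoop <- append(truncLoop, truncateIncomes(i))
    }
})
system.time(ifelse(fakeIncomes > 1e+05, 1e+05, fakeIncomes))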
R has a lot of statistics and econometric functionality built in; with the addition of some outside packages, you can do pretty much anything Stata can do.
Generate some fake wage data:
wages <- data.frame(age = runif(100, 15, 64), exp_fact = runif(100, 0, 0.7),
educ = rlnorm(100, log(8), 0.75))
wages$exp <- wages$age * wages$exp_fact
wages$wage <- exp(0.5 * wages$exp + 0.75 * wages$age + 0.25 * wages$educ + rnorm(100,
0, 2))
lm() is the built-in function for running linear regressions:
model1 <- lm(log(wage) ~ exp + age + educ, data = wages)
summary(model1)
##
## Call:
## lm(formula = log(wage) ~ exp + age + educ, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.216 -0.915 0.131 1.296 4.436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4726 0.5587 0.85 0.4
## exp 0.5079 0.0262 19.35 <2e-16 ***
## age 0.7235 0.0153 47.28 <2e-16 ***
## educ 0.2499 0.0171 14.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.96 on 96 degrees of freedom
## Multiple R-squared: 0.981, Adjusted R-squared: 0.98
## F-statistic: 1.66e+03 on 3 and 96 DF, p-value: <2e-16
One thing base R doesn't do particularly well is alternative variance-covariance matrix estimators. Robust standard errors are available through add-on packages (lmtest and sandwich).
library(AER)
## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## Attaching package: 'lmtest'
##
## The following object is masked _by_ '.GlobalEnv':
##
## wages
##
## Loading required package: sandwich
## Loading required package: survival
## Loading required package: splines
coeftest(model1, vcovHC(model1))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4726 0.5767 0.82 0.41
## exp 0.5079 0.0255 19.90 <2e-16 ***
## age 0.7235 0.0152 47.60 <2e-16 ***
## educ 0.2499 0.0213 11.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
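If you want standard errors that match Stata's , robust option exactly, vcovHC() takes a type argument; HC1 is the small-sample adjustment Stata uses. A sketch on the same model:
coeftest(model1, vcovHC(model1, type = "HC1"))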
Base R also ships with functionality to estimate most generalized linear models (e.g. logit, probit, Poisson) via the glm() function:
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial(link = logit))
summary(mylogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = logit),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.580 -0.885 -0.638 1.157 2.173
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.44955 1.13285 -3.05 0.0023 **
## gre 0.00229 0.00109 2.10 0.0356 *
## gpa 0.77701 0.32748 2.37 0.0177 *
## rank -0.56003 0.12714 -4.40 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 459.44 on 396 degrees of freedom
## AIC: 467.4
##
## Number of Fisher Scoring iterations: 4
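Fitted probabilities (rather than coefficients on the log-odds scale) come from predict() with type = "response"; a quick sketch, with the newdata values chosen purely for illustration:
head(predict(mylogit, type = "response"))
predict(mylogit, newdata = data.frame(gre = 600, gpa = 3.5, rank = 2), type = "response")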
More sophisticated estimators can be found in add-on packages; I'll highlight two examples. We often work with panel data of some sort. In Stata, we'd use xtreg or xtivreg2; equivalent functionality is available in R through the lfe package, which also supports clustered (even multiway clustered) standard errors and IV estimation.
library(lfe)
## Loading required package: Matrix
# Load health + inequality data
load("/home/john/Dropbox/PUMA_Health_Ineq/PUMA_MSA_ineq_GB2.rda")
# Create a panel variable
try3$unit <- sapply(1:4188, function(x) paste(as.character(try3$st_PUMA[x]),
as.character(try3$MSA[x]), sep = "_"))
try3$crude_rate <- try3$Deaths/try3$Population
# winsorize crude rate at 99%
try3 <- try3[which(try3$crude_rate < quantile(try3$crude_rate, probs = 0.99)),
]
head(try3$unit)
## [1] "0_1000" "0_1000" "0_1000" "0_1000" "0_1000" "0_1000"
# Model with just county-group FE, with county-group clustered SE
model_FE <- felm(crude_rate ~ ACS_Gini | unit | unit, data = try3)
summary(model_FE)
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini | unit | unit, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.850730 -0.005858 -0.000104 0.005612 0.644550
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 0.1088 0.0574 1.9 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0456 on 3448 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.904
## F-statistic(normal s.e.):57.3 on 697 and 3448 DF, p-value: <2e-16
## F-statistic(proj): 0.00515 on 697 and 3448 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
# Model with county-group and year FE, with county-group clustered SE
model_FE2 <- felm(crude_rate ~ ACS_Gini | unit + year | unit, data = try3)
summary(model_FE2)
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini | unit + year | unit, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.48e-01 -5.90e-03 -5.13e-06 5.91e-03 6.45e-01
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 0.1301 0.0592 2.2 0.028 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0456 on 3443 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.904
## F-statistic(normal s.e.):56.9 on 702 and 3443 DF, p-value: <2e-16
## F-statistic(proj): 0.00687 on 702 and 3443 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
# Above, but with covariates:
model_FE3 <- felm(crude_rate ~ ACS_Gini + black + latino + median_inc + under19 +
over65 + college | unit + year | unit, data = try3)
summary(model_FE3)
## Warning: the matrix is either rank-deficient or indefinite
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini + black + latino + median_inc + under19 + over65 + college | unit + year | unit, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.847440 -0.006316 -0.000108 0.006208 0.643483
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 1.07e-01 5.21e-02 2.06 0.04 *
## black -6.36e-02 8.77e-02 -0.73 0.47
## latino -1.95e-01 1.26e-01 -1.54 0.12
## median_inc -3.80e-07 3.19e-07 -1.19 0.23
## under19 8.26e-02 8.86e-02 0.93 0.35
## over65 1.32e-01 1.36e-01 0.97 0.33
## college -2.91e-02 6.99e-02 -0.42 0.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0455 on 3437 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.905
## F-statistic(normal s.e.):56.5 on 708 and 3437 DF, p-value: <2e-16
## F-statistic(proj): 0.0125 on 708 and 3437 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
# With county-group and year two way clustering:
model_FE4 <- felm(crude_rate ~ ACS_Gini + black + latino + median_inc + under19 +
over65 + college | unit + year | unit + year, data = try3)
summary(model_FE4)
## Warning: the matrix is either rank-deficient or indefinite
##
## Call:
## felm(formula = crude_rate ~ ACS_Gini + black + latino + median_inc + under19 + over65 + college | unit + year | unit + year, data = try3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.847440 -0.006316 -0.000108 0.006208 0.643483
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## ACS_Gini 1.07e-01 7.79e-02 1.38 0.169
## black -6.36e-02 8.98e-02 -0.71 0.479
## latino -1.95e-01 1.07e-01 -1.82 0.069 .
## median_inc -3.80e-07 3.42e-07 -1.11 0.267
## under19 8.26e-02 6.81e-02 1.21 0.226
## over65 1.32e-01 1.16e-01 1.13 0.257
## college -2.91e-02 6.76e-02 -0.43 0.667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0455 on 3437 degrees of freedom
## Multiple R-squared: 0.921 Adjusted R-squared: 0.905
## F-statistic(normal s.e.):56.5 on 708 and 3437 DF, p-value: <2e-16
## F-statistic(proj): 0.0111 on 708 and 3437 DF, p-value: 1
## *** F-test for robust/clustered standard errors is unreliable
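lfe can also do IV alongside high-dimensional fixed effects, but its formula syntax for instruments has changed across versions, so here is a simpler sketch of the plain single-equation IV case using ivreg() from the AER package loaded earlier (all data here are simulated purely for illustration):
# Simulate an endogenous regressor x_endog, instrumented by z
n <- 1000
z <- rnorm(n)
u <- rnorm(n)                        # unobserved confounder
x_endog <- z + u + rnorm(n)          # correlated with the error through u
y <- 1 + 2 * x_endog + u + rnorm(n)  # true coefficient on x_endog is 2
iv.model <- ivreg(y ~ x_endog | z)   # instruments go after the |
summary(iv.model)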
The rdd package gives us the functionality to estimate regression discontinuity (RD) designs, along with the standard diagnostic tests, optimal bandwidth selection, etc.
library(rdd)
## Loading required package: Formula
x <- runif(1000, -1, 1)
cov <- rnorm(1000)
y <- 3 + 2 * x + 3 * cov + 10 * (x >= 0) + rnorm(1000)
model.rd <- RDestimate(y ~ x | cov, cluster = x)
summary(model.rd)
##
## Call:
## RDestimate(formula = y ~ x | cov, cluster = x)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value
## LATE 0.359 356 9.74 0.203 48.0
## Half-BW 0.180 180 9.60 0.265 36.2
## Double-BW 0.719 731 9.72 0.152 64.0
## Pr(>|z|)
## LATE 0.00e+00 ***
## Half-BW 1.99e-287 ***
## Double-BW 0.00e+00 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 2997 5 350 0
## Half-BW 1608 5 174 0
## Double-BW 6140 5 725 0
plot(model.rd)
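The package's diagnostics include the McCrary density test for sorting on the running variable; a sketch on the simulated data above (we'd expect not to reject, since x was drawn uniformly):
DCdensity(x, cutpoint = 0)  # McCrary (2008) test: returns the p-value and plots the estimated density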
One final estimation example: for problems where we need to model discrete choices (e.g. transportation mode choice or recreation demand), a variant of the multinomial logit model is used. In particular, a state-of-the-art approach at the moment is to allow for heterogeneity by estimating a random parameters (mixed) logit model. The best R implementation of the mixed logit model is in the mlogit package.
library(mlogit)
## Loading required package: maxLik
## Loading required package: miscTools
##
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
##
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
mlogit.df <- read.table("/home/john/Dropbox/clogit.dat", quote = "\"")
colnames(mlogit.df) <- c("mode", "ttme", "invc", "invt", "gc", "chair", "hinc",
"psize", "indj", "indi", "aasc", "tasc", "basc", "casc", "hinca", "psizea",
"z", "nij", "ni")
mlogit.df$mode.ids <- factor(rep(1:4, 210))
mlogit.df$mode.ids <- factor(rep(1:4, 210), labels = c("air", "train", "bus",
"car"))
mlogit.df$indivs <- factor(rep(1:210, each = 4))
mlogit.df <- mlogit.data(mlogit.df, shape = "long", choice = "mode", alt.var = "mode.ids")
system.time(mlogit.model <- mlogit(mode ~ ttme + gc, data = mlogit.df, reflevel = "car",
rpar = c(`air:(intercept)` = "n", `bus:(intercept)` = "n", `train:(intercept)` = "n"),
R = 500, halton = NA, print.level = 1))
## Initial value of the function : 200.021471160969
## iteration 1, step = 0.00390625, lnL = 200.00700181, chi2 = 42.89405615
## iteration 2, step = 0.015625, lnL = 200.00530375, chi2 = 0.23847745
## iteration 3, step = 0.125, lnL = 199.73904009, chi2 = 0.77798228
## iteration 4, step = 0.25, lnL = 199.71467864, chi2 = 0.25353518
## iteration 5, step = 0.5, lnL = 199.62259894, chi2 = 2.54825399
## iteration 6, step = 1, lnL = 198.92335441, chi2 = 0.9368018
## iteration 7, step = 1, lnL = 198.21614612, chi2 = 1.17814832
## iteration 8, step = 1, lnL = 197.78031543, chi2 = 0.73589678
## iteration 9, step = 1, lnL = 197.54249151, chi2 = 0.36853162
## iteration 10, step = 1, lnL = 197.48342838, chi2 = 0.0944876
## iteration 11, step = 1, lnL = 197.47108299, chi2 = 0.0189415
## iteration 12, step = 1, lnL = 197.46807699, chi2 = 0.00507095
## iteration 13, step = 1, lnL = 197.46781515, chi2 = 0.00046435
## iteration 14, step = 1, lnL = 197.46780495, chi2 = 1.855e-05
## iteration 15, step = 1, lnL = 197.46780438, chi2 = 8.2e-07
## user system elapsed
## 11.414 0.024 11.435
summary(mlogit.model)
##
## Call:
## mlogit(formula = mode ~ ttme + gc, data = mlogit.df, reflevel = "car",
## rpar = c(`air:(intercept)` = "n", `bus:(intercept)` = "n",
## `train:(intercept)` = "n"), R = 500, halton = NA, print.level = 1)
##
## Frequencies of alternatives:
## car air train bus
## 0.281 0.276 0.300 0.143
##
## bfgs method
## 15 iterations, 0h:0m:11s
## g'(-H)^-1g = 8.22E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## air:(intercept) 6.15924 1.13869 5.41 6.3e-08 ***
## train:(intercept) 5.10960 0.81745 6.25 4.1e-10 ***
## bus:(intercept) 4.14873 0.90422 4.59 4.5e-06 ***
## ttme -0.11444 0.01990 -5.75 8.9e-09 ***
## gc -0.03084 0.00775 -3.98 6.9e-05 ***
## sd.air:(intercept) 2.93805 0.91640 3.21 0.0013 **
## sd.train:(intercept) 0.03826 21.45898 0.00 0.9986
## sd.bus:(intercept) 0.00166 76.69203 0.00 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -197
## McFadden R^2: 0.304
## Likelihood ratio test : chisq = 173 (p.value = <2e-16)
##
## random coefficients
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## air:(intercept) -Inf 4.178 6.159 6.159 8.141 Inf
## bus:(intercept) -Inf 4.148 4.149 4.149 4.150 Inf
## train:(intercept) -Inf 5.084 5.110 5.110 5.135 Inf
The equivalent Stata implementation (via the mixlogit ado file) takes almost four times as long (~32 seconds).
One thing I want to do fairly often is to split my data into subsets (stratifying by race, for instance), apply a function to each of the subsets, and then combine these results into a new dataset. This process (split-apply-combine) is best accomplished through either the plyr or dplyr package, both by Hadley Wickham. I'll use the earlier (and slightly slower) plyr syntax here. Our example will be a simplified task from my field paper: generate income inequality measures for each state on an annual basis from the pooled CPS microdata.
# Split Apply Combine Suppose we need to apply a function repeatedly to
# different parts of our dataset. An efficient algorithm is
# split-apply-combine. Example: Annual State level inequality. Using older
# plyr syntax:
library(ggplot2)
library(plyr)
library(reldist)
## reldist: Relative Distribution Methods
## Version 1.6-2 created on 2013-03-03.
## copyright (c) 2003, Mark S. Handcock, University of California-Los Angeles
## For citation information, type citation("reldist").
## Type help(package="reldist") to get started.
library(ineq)
##
## Attaching package: 'ineq'
##
## The following object is masked from 'package:reldist':
##
## entropy
load("/media/john/Shared Linux_Windows Files/MSA Level Inequality/Data/CPS_ind_baseline.rda")
top1share <- function(x) {
top1 <- sum(x[which(x >= quantile(x, probs = 0.99))])
total <- sum(x)
return(top1/total)
}
# The Theil index is only defined for strictly positive incomes, so
# bottom-code incomes at 1
CPS.work.hh.baseline$cellmean_equivinc <- ifelse(CPS.work.hh.baseline$cellmean_equivinc <=
1, 1, CPS.work.hh.baseline$cellmean_equivinc)
CPS.work.hh.baseline <- CPS.work.hh.baseline[is.na(CPS.work.hh.baseline$cellmean_equivinc) ==
F, ]
system.time(State.ineq <- ddply(CPS.work.hh.baseline, .variables = c("State",
"year"), function(x) data.frame(CPS_Gini = gini(x$cellmean_equivinc, w = x$wtsupp),
CPS_theil = Theil(x$cellmean_equivinc), CPS_top1 = top1share(x$cellmean_equivinc))))
## user system elapsed
## 3.383 0.076 3.459
head(State.ineq)
## State year CPS_Gini CPS_theil CPS_top1
## 1 Alabama 1992 0.3969 0.2597 0.05355
## 2 Alabama 1993 0.4086 0.2946 0.06184
## 3 Alabama 1994 0.4019 0.2838 0.06681
## 4 Alabama 1995 0.4096 0.2938 0.06127
## 5 Alabama 1996 0.4436 0.3721 0.08811
## 6 Alabama 1997 0.4235 0.3311 0.08092
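The newer dplyr syntax does the same split-apply-combine with group_by() and summarise(); here's a sketch assuming the same data and inequality helpers (note that loading dplyr after plyr masks a few functions):
library(dplyr)
State.ineq.dplyr <- summarise(group_by(CPS.work.hh.baseline, State, year),
    CPS_Gini = gini(cellmean_equivinc, w = wtsupp),
    CPS_theil = Theil(cellmean_equivinc),
    CPS_top1 = top1share(cellmean_equivinc))
head(State.ineq.dplyr)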
One nice thing about plyr is that it will utilize more than one core if you have a “parallel backend” registered (more on that in a minute).
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores = detectCores())
system.time(State.ineq <- ddply(CPS.work.hh.baseline, .variables = c("State",
"year"), function(x) data.frame(CPS_Gini = gini(x$cellmean_equivinc, w = x$wtsupp),
CPS_theil = Theil(x$cellmean_equivinc), CPS_top1 = top1share(x$cellmean_equivinc)),
.parallel = T))
## user system elapsed
## 4.935 1.120 2.074
Not too shabby, but far less than linear.
Hadley Wickham also wrote one of the best graphics toolkits in R: the ggplot2 package. If we wanted to graph the time series of Oregon's Gini coefficient, we could do so easily:
ggplot(State.ineq[State.ineq$State == "Oregon", ], aes(x = year)) + geom_line(aes(y = CPS_Gini,
colour = "CPS_Gini"))
You can do almost any data visualization task with ggplot, but one of the neatest is the ability to make facet graphs:
ggplot(State.ineq[State.ineq$State != "District of Columbia", ], aes(x = year)) +
geom_line(aes(y = CPS_Gini, colour = "CPS_Gini")) + facet_wrap(~State, ncol = 10)
Thus far, we've done stuff that Stata can do perfectly fine. Let's look at a couple of things where R excels. First: parallelization and bootstrapping. Bootstrapping is a very useful statistical technique. It's especially useful when conventional inference doesn't perform particularly well. I'll use an example from my own research. Imagine we have two Lorenz curves, and we want to perform a test of Lorenz dominance. One (rough) way of doing this would be to perform a difference-in-Lorenz-ordinates test for some finite set of ordinates. Problem: the standard way of doing this (a t-test) doesn't perform very well in practice.
# Bootstrapping example: difference in Lorenz ordinates confidence intervals.
# First, generate some fake income distributions
library(GB2)
library(reldist)
distribution1 <- rgb2(10000, 2.5, 50000, 1, 1)
distribution2 <- rgb2(10000, 2.4, 50000, 1, 1)
L_x2 <- pgb2(qgb2(0.5, 2.4, 50000, 1, 1), 2.4, 50000, 1 + (1/2.4), 1 - (1/2.4))
L_x1 <- pgb2(qgb2(0.5, 2.5, 50000, 1, 1), 2.5, 50000, 1 + (1/2.5), 1 - (1/2.5))
L_x2 - L_x1
## [1] -0.009679
# standard inference
diff1 <- Lorenz_KB(distribution2, 0.5) - Lorenz_KB(distribution1, 0.5)
var1 <- (Lorenz_KB(distribution2, 0.5, type = "variance") + Lorenz_KB(distribution1,
0.5, type = "variance"))
diff1/sqrt(var1) #t-stat
## [1] -0.02726
c(diff1 - 1.96 * sqrt(var1), diff1 + 1.96 * sqrt(var1)) #95% CI
## [1] -0.7180 0.6983
So let's see if we can't do better with the bootstrap. We'll be doing a very simple form of bootstrap inference - we'll bootstrap the difference in Lorenz ordinates directly, and use the bootstrap confidence interval to perform inference:
# Bootstrap inference, slow but straightforward way:
bootreps <- numeric(500)
ptm <- proc.time()
for (i in 1:500) {
samples <- sample(1:10000, replace = T)
bootreps[i] <- Lorenz_KB(distribution2[samples], 0.5) - Lorenz_KB(distribution1[samples],
0.5)
}
seq.time <- proc.time() - ptm
quantile(bootreps, c(0.025, 0.975))
## 2.5% 97.5%
## -0.016394 -0.002671
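The boot package wraps this resample-and-recompute pattern (and provides several flavors of bootstrap confidence interval); a sketch assuming the same Lorenz_KB helper used above:
library(boot)
lorenz.diff <- function(data, idx) {
    Lorenz_KB(data[idx, 2], 0.5) - Lorenz_KB(data[idx, 1], 0.5)
}
boot.out <- boot(data = cbind(distribution1, distribution2), statistic = lorenz.diff, R = 500)
boot.ci(boot.out, type = "perc")  # percentile interval, comparable to the quantile() approach above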
Bootstrapping is a computational task that is classified as "embarrassingly parallel": each bootstrap repetition is independent, so we can distribute the repetitions across as many cores as we want (or have available), and performance will scale roughly linearly. R makes parallelization pretty easy. You just need to set up a parallel backend (on Linux or Mac, doMC is the way to go; on Windows, the doParallel package appears to be the way to go).
# We can do better! First step, set up a parallel backend
library(doMC)
registerDoMC(cores = detectCores())
# The foreach package lets us perform parallel for loops. This will just
# perform a (dumb) for loop, sequentially:
library(foreach)
try1 <- foreach(i = 1:100, .combine = "c") %do% {
rnorm(1, 0, i)
}
# If we use the %dopar% operator instead:
library(foreach)
try1 <- foreach(i = 1:100, .combine = "c") %dopar% {
rnorm(1, 0, i)
}
# foreach loops work best when each iteration is independent.
ptm1 <- proc.time()
bootreps.par <- foreach(i = 1:500, .combine = "c") %dopar% {
samples <- sample(1:10000, replace = T)
Lorenz_KB(distribution2[samples], 0.5) - Lorenz_KB(distribution1[samples],
0.5)
}
par.time <- proc.time() - ptm1
quantile(bootreps.par, c(0.025, 0.975))
## 2.5% 97.5%
## -0.015833 -0.003265
seq.time/par.time
## user system elapsed
## 168.54438 0.08696 3.77862
R has a lot of functionality for mapping and GIS tasks. One thing we often need to do is create choropleths to visualize our data. There's a new package (choroplethr, developed, oddly enough, by Trulia) that makes this super easy. Here's a version of the inequality maps I made a couple months ago, which took me a few minutes last night (much less time than the originals):
# Mapping Stuff!
library(choroplethr)
load("/media/john/Shared Linux_Windows Files/MSA Level Inequality/Data/baseline_multi_ineq_ind_State.rda")
multi_ineq_means$State <- as.character(multi_ineq_means$State)
ineq.df <- multi_ineq_means[multi_ineq_means$year == 2011, ][, c(1, 3)]
colnames(ineq.df) <- c("region", "value")
choroplethr(ineq.df, "state", title = "State Gini, 2011")
If we want a quick and dirty visualization from any of the ACS 5-year tables, we can do that easily too:
choroplethr_acs(tableId = "B19301", lod = "county")
R has built-in capabilities for basic web-facing tasks, and with CRAN libraries you can access pretty much anything on the web. In particular, we can access APIs offered by many companies. I'll use two examples: the Twitter API (which you heard me mumble about a couple weeks ago) and the MapQuest OpenStreetMap API (which powers the mqtime.ado file I wrote).
Here's an example of a function that queries the OSM API to calculate driving distances:
# API access via R
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:lmtest':
##
## reset
library(rjson)
mqdirections <- function(address1, address2) {
if (address1 == "Georgia") {
address1 = "Atlanta,Georgia"
}
if (address2 == "Georgia") {
address2 = "Atlanta,Georgia"
}
URL2 = paste("http://open.mapquestapi.com/directions/v1/route?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz&from=",
address1, "&to=", address2, "&outFormat='json'", "boundingBox=24,-85,50,-125",
sep = "")
URL2 <- gsub(" ", "+", URL2)
x = getURL(URL2)
x1 <- fromJSON(x)
if (length(x1$route$distance) == 0) {
return(NA)
} else {
return(x1$route$distance)
}
}
mqdirections("Eugene,OR", "Corvallis,OR")
## [1] 40.26
Similarly, we can use the OSM API to geocode text addresses:
geocode_attempt <- function(address) {
URL2 = paste("http://open.mapquestapi.com/geocoding/v1/address?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz",
"&location=", address, "&outFormat='json'", "boundingBox=24,-85,50,-125",
sep = "")
# print(URL2)
URL2 <- gsub(" ", "+", URL2)
x = getURL(URL2)
x1 <- fromJSON(x)
if (length(x1$results[[1]]$locations) == 0) {
return(NA)
} else {
return(c(x1$results[[1]]$locations[[1]]$displayLatLng$lat, x1$results[[1]]$locations[[1]]$displayLatLng$lng))
}
}
geocode_attempt("1241 Kincaid St, Eugene,OR")
## [1] 44.05 -123.08
Twitter also provides APIs that allow us to access (some) tweets. There are two: the streaming API, which gives us access to a 5% sample of publicly available tweets, and the REST API, which allows us to request tweets from specific users.
library("twitteR")
## Loading required package: ROAuth
## Loading required package: digest
##
## Attaching package: 'twitteR'
##
## The following object is masked from 'package:plyr':
##
## id
library("ROAuth")
library(streamR)
load("~/credentials.RData")
filterStream("USA_tweets1130.json", locations = c(-120, 30, -65, 49), timeout = 30,
oauth = Credentials)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 1012 tweets downloaded.
tweets.df <- parseTweets("USA_tweets950.json", verbose = FALSE)
library(ggplot2)
library(grid)
map.data <- map_data("state")
points <- data.frame(x = as.numeric(tweets.df$lon), y = as.numeric(tweets.df$lat))
points <- points[points$y > 25, ]
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "white",
color = "grey20", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
labs(title = "9:50am Friday Tweets") + theme(axis.line = element_blank(),
axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(),
panel.background = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank(),
plot.background = element_blank(), plot.margin = unit(0 * c(-1.5, -1.5,
-1.5, -1.5), "lines")) + geom_point(data = points, aes(x = x, y = y),
size = 1, alpha = 1/5, color = "darkblue")
## Warning: Removed 1300 rows containing missing values (geom_point).
Here's the same thing a couple hours later:
tweets.df <- parseTweets("USA_tweets1130.json", verbose = FALSE)
library(ggplot2)
library(grid)
map.data <- map_data("state")
points <- data.frame(x = as.numeric(tweets.df$lon), y = as.numeric(tweets.df$lat))
points <- points[points$y > 25, ]
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "white",
color = "grey20", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
labs(title = "11:30am Friday Tweets") + theme(axis.line = element_blank(),
axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(),
panel.background = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank(),
plot.background = element_blank(), plot.margin = unit(0 * c(-1.5, -1.5,
-1.5, -1.5), "lines")) + geom_point(data = points, aes(x = x, y = y),
size = 1, alpha = 1/5, color = "darkblue")
## Warning: Removed 1659 rows containing missing values (geom_point).
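filterStream() can also filter the stream by keyword instead of a bounding box; a quick sketch with the same credentials (the filename and keyword here are just placeholders):
filterStream("econ_tweets.json", track = "economics", timeout = 30, oauth = Credentials)
econ.df <- parseTweets("econ_tweets.json", verbose = FALSE)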
We can also access the last <3200 tweets from a given user via the REST API, accessible through the twitteR library:
registerTwitterOAuth(Credentials)
## [1] TRUE
mark <- userTimeline("MarkThoma", n = 100)
head(mark)
## [[1]]
## [1] "MarkThoma: 'Varian: Big Data: New Tricks for Econometrics' http://t.co/RorvwANhkG"
##
## [[2]]
## [1] "MarkThoma: 'Economy Adds 288,000 Jobs in April, Sharp Drop in Labor Force Leads to Plunge in Unemployment' http://t.co/XTclSsZVUb"
##
## [[3]]
## [1] "MarkThoma: Paul Krugman: Why Economics Failed http://t.co/LIiwwh39Jp"
##
## [[4]]
## [1] "MarkThoma: Links for 5-02-14 http://t.co/ILqJsfw9kj"
##
## [[5]]
## [1] "MarkThoma: Thomas Piketty and the Ghost of Joan Robinson - Dean Baker\n http://t.co/Ln4vWnSBpK"
##
## [[6]]
## [1] "MarkThoma: Paul Krugman Really Blows It - EconoSpeak http://t.co/avZ3sisSFx"
Or access the current trends on Twitter:
place <- closestTrendLocations(44.04624, -123.07854)$woeid
getTrends(place)
## name url query
## 1 Starbucks http://twitter.com/search?q=Starbucks Starbucks
## 2 Portland http://twitter.com/search?q=Portland Portland
## 3 #RipCity http://twitter.com/search?q=%23RipCity %23RipCity
## 4 Spider-Man http://twitter.com/search?q=Spider-Man Spider-Man
## 5 Taco Bell http://twitter.com/search?q=%22Taco+Bell%22 %22Taco+Bell%22
## 6 Game 6 http://twitter.com/search?q=%22Game+6%22 %22Game+6%22
## 7 Oregon http://twitter.com/search?q=Oregon Oregon
## 8 Snapchat http://twitter.com/search?q=Snapchat Snapchat
## 9 Star Wars http://twitter.com/search?q=%22Star+Wars%22 %22Star+Wars%22
## 10 Tumblr http://twitter.com/search?q=Tumblr Tumblr
## woeid
## 1 2475687
## 2 2475687
## 3 2475687
## 4 2475687
## 5 2475687
## 6 2475687
## 7 2475687
## 8 2475687
## 9 2475687
## 10 2475687
And there are about 1100 other things you can do with R.