Lab 1: Introduction to R
Introductions
POLSC-UH 3312J: Social Media and Politics
Email: aae322@nyu.edu
Office: 19W4th Room 322 (3rd floor)
Lab Overviews
The stated Program Learning Outcomes for this course are the
following:
- Learn theories and literature related to political behavior
- Learn theories and literature related to social media and
politics
- Learn to download social media data
- Learn how to analyze social media data using the R programming
language
- Learn how to prepare a professional presentation analyzing social
media data
The purpose of these lab sessions will be primarily to develop points
3 and 4 such that you may successfully accomplish goal 5.
The course has five lab sessions that we will complete together. All
labs are from 2:00pm-4:00pm here in 19 W 4th St, 3rd Floor Computer Lab.
Sessions will have three primary components: overview of topic,
interactive R session, in-class exercises/quiz.
- Wednesday January 8th: Introduction to R
- Thursday January 9th: Introduction to statistical
analysis
- Monday January 13th: Collecting Social Media Data
Using APIs/scraping
- Tuesday January 14th: Using R to Analyze Twitter
Data
- Thursday January 16th: Using APIs to Access
ChatGPT
Getting Started in R
A primary objective of these labs is to get you working and comfortable with the R Language for Statistical Computing and Graphics. You can download R here and RStudio here if you don’t already have them installed on your own computer (all computers in the lab should have them pre-installed).
Once everything is installed, you should be able to download the code underlying this RMarkdown document by clicking the “Code” button at the top of the web page. Once downloaded, you will be able to open the file within RStudio and follow/play along!
Before jumping into coding essentials, I thought it would be worthwhile to point you towards some useful and free resources for getting into the R language. Foremost among these are likely R for Data Science by Hadley Wickham and Garrett Grolemund and Hands-On Programming with R by Garrett Grolemund. Both books provide excellent overviews of the essentials you need to get working with R as quickly as possible.
If you already have some experience with R, you might find Hadley Wickham’s Advanced R, ggplot2, or R Markdown: The Definitive Guide by Allaire and Grolemund useful. These books cover more advanced aspects of the programming language, provide an authoritative take on creating graphics in R, and give a detailed overview of R Markdown (the typesetting approach used to create these lab documents). Once you get these things down, you should easily be able to shift into more advanced applications than will be covered in this course, such as Deep Learning with R, or dive right into exploring the various CRAN Task Views, which collect a large number of packages relevant to tasks from Bayesian inference to Natural Language Processing and Machine Learning.
But before we can think of doing any of that, we need to pound out
some basics of the R Programming Language.
Some Basics
First thing is first; let’s install/load all of the packages that we
will be using, and clear out the environment.
packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
"stargazer","tidyselect")
for(i in packages){
if(!require(i,character.only = T, quietly = T)){
install.packages(i,repos = "http://cran.us.r-project.org")
}
library(i, character.only = T, quietly = T)
}
rm(list=ls())
If you haven’t seen R before, that is a bunch of hieroglyphics. Let’s break it down and get a feel for the most basic operations and entities used in the language. First, R is an object-oriented programming language. This means that we will create named objects which exist within our working environment and which we can then access or manipulate. To start, let’s create an object named packages which contains a vector of package names:
packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
"stargazer")
This is an example of the most basic operation within R, object assignment. In general, the syntax looks like <object name> <- <stuff>, where <- is the most commonly used assignment operator. Suppose we wanted to access only the first three elements of this vector; to do so we can specify the indices we want like so:
packages[1:3]
## [1] "haven" "dplyr" "ggplot2"
Note that unlike some other languages, indices in R start at 1 rather than 0. Suppose we wanted the 2nd, 5th, and 8th elements of the vector instead. We can do so by supplying a vector of indices:
packages[c(2,5,8)]
## [1] "dplyr" "tidyr" "stargazer"
There are a few particularly important classes within R. Above we have a character vector, which is composed of strings. We can check the class of an object with the class function:
class(packages)
## [1] "character"
Other particularly important classes are numeric and factor variables, the former being self-descriptive while the latter is the name for categorical data in R. We are able to convert between classes with the as.whatever family of functions (as.numeric, as.character, and so on). For an example, let’s draw 15 random normal deviates with mean 5 and standard deviation 10 after setting a seed (this ensures that we draw the same pseudo-random samples every time):
set.seed(1234)
num <- rnorm(n = 15, mean = 5, sd = 10)
class(num)
## [1] "numeric"
As expected, our numeric data is, well, numeric. We can convert it to a factor variable with the as.factor function.
fac <- as.factor(num)
class(fac)
## [1] "factor"
Of note, to convert a factor variable back to numeric an additional step is required. Let’s create a data.frame to see what happens and to learn how to conduct basic modifications of such objects.
dat <- data.frame(num,fac)
dat$wrong <- as.numeric(dat$fac)
dat$right <- as.numeric(as.character(dat$fac))
dat
## num fac wrong right
## 1 -7.0706575 -7.07065749385421 2 -7.0706575
## 2 7.7742924 7.7742924211066 11 7.7742924
## 3 15.8444118 15.8444117668306 15 15.8444118
## 4 -18.4569770 -18.4569770262935 1 -18.4569770
## 5 9.2912469 9.2912468881105 12 9.2912469
## 6 10.0605589 10.0605589215757 13 10.0605589
## 7 -0.7473996 -0.747399601346488 6 -0.7473996
## 8 -0.4663186 -0.466318557841871 8 -0.4663186
## 9 -0.6445200 -0.64451999093283 7 -0.6445200
## 10 -3.9003783 -3.90037829044104 4 -3.9003783
## 11 0.2280730 0.22807300246453 9 0.2280730
## 12 -4.9838644 -4.98386444859704 3 -4.9838644
## 13 -2.7625389 -2.7625389463799 5 -2.7625389
## 14 5.6445882 5.64458817276269 10 5.6445882
## 15 14.5949406 14.5949405897077 14 14.5949406
What we did in the above was create a data.frame with two columns, num and fac. Note that converting from a factor variable directly to numeric returns the factor level rather than the value itself, while converting to a character in-between gives us back the correct information.
This is an example of why it is so important for beginners in R, or any language for that matter, to read the documentation and avoid exactly this kind of mistake. To access the documentation for a function we can simply ask for help:
help(as.numeric)
R has great documentation and you should always read about functions
you are unfamiliar with. Scrolling down a bit we can see under the
warning header that “If x is a factor, as.numeric will return the
underlying numeric (integer) representation, which is often meaningless
as it may not correspond to the factor levels.”
Since we have a data.frame handy, let’s learn how to interact with it. Using the $ operator we can access columns of the data in a straightforward manner:
dat$num
## [1] -7.0706575 7.7742924 15.8444118 -18.4569770 9.2912469 10.0605589
## [7] -0.7473996 -0.4663186 -0.6445200 -3.9003783 0.2280730 -4.9838644
## [13] -2.7625389 5.6445882 14.5949406
There are two other useful ways of extracting information from data.frames. First, we can use indices in a way very similar to the above, except that now we have both row AND column indices. For example, we can access the first 3 rows of columns 2 and 4 like so:
dat[1:3,c(2,4)]
## fac right
## 1 -7.07065749385421 -7.070657
## 2 7.7742924211066 7.774292
## 3 15.8444117668306 15.844412
Alternatively, we can also call variables by their names like
this:
dat[,c("wrong","right")]
## wrong right
## 1 2 -7.0706575
## 2 11 7.7742924
## 3 15 15.8444118
## 4 1 -18.4569770
## 5 12 9.2912469
## 6 13 10.0605589
## 7 6 -0.7473996
## 8 8 -0.4663186
## 9 7 -0.6445200
## 10 4 -3.9003783
## 11 9 0.2280730
## 12 3 -4.9838644
## 13 5 -2.7625389
## 14 10 5.6445882
## 15 14 14.5949406
Note that when you leave an index blank you get all of those elements
back – in the above we got all of the rows for the two selected columns.
Alternatively we could get all of the columns for a particular subset of
rows like so:
dat[1:2,]
## num fac wrong right
## 1 -7.070657 -7.07065749385421 2 -7.070657
## 2 7.774292 7.7742924211066 11 7.774292
Of particular importance is that the columns of data.frames can be different classes.
c(class(dat$num),class(dat$fac),class(dat$wrong),class(dat$right))
## [1] "numeric" "factor" "numeric" "numeric"
This is distinct from the matrix class of object, generally only used in particular machine learning libraries or to do matrix algebra in R, but we won’t talk about those things in detail here. Note what happens when we coerce our data.frame to a matrix (accessing elements of matrices is almost identical to data.frames, except that the $ no longer works):
head(as.matrix(dat))
## num fac wrong right
## [1,] " -7.0706575" "-7.07065749385421" " 2" " -7.0706575"
## [2,] " 7.7742924" "7.7742924211066" "11" " 7.7742924"
## [3,] " 15.8444118" "15.8444117668306" "15" " 15.8444118"
## [4,] "-18.4569770" "-18.4569770262935" " 1" "-18.4569770"
## [5,] " 9.2912469" "9.2912468881105" "12" " 9.2912469"
## [6,] " 10.0605589" "10.0605589215757" "13" " 10.0605589"
They are all characters now! We get the same behavior with vectors
when combining various classes:
numz <- c(1,2,3)
chaz <- c("a","b","c")
c(numz,chaz)
## [1] "1" "2" "3" "a" "b" "c"
To see why, check out the “Details” section of the help file for the c function.
?c
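In short, c() coerces everything to the most flexible type present (roughly logical → integer → double → character). A quick illustration, added here as an aside:
c(TRUE, 1L, 2.5)
## [1] 1.0 1.0 2.5
c(TRUE, "a")
## [1] "TRUE" "a"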
The final main object type I want to introduce you to is my favorite: lists! They are kind of like a mash between data.frames and vectors in that they are one-dimensional but can have elements of any class.
a_list <- list(packages,dat,chaz)
a_list
## [[1]]
## [1] "haven" "dplyr" "ggplot2" "countrycode" "tidyr"
## [6] "gridExtra" "grid" "stargazer"
##
## [[2]]
## num fac wrong right
## 1 -7.0706575 -7.07065749385421 2 -7.0706575
## 2 7.7742924 7.7742924211066 11 7.7742924
## 3 15.8444118 15.8444117668306 15 15.8444118
## 4 -18.4569770 -18.4569770262935 1 -18.4569770
## 5 9.2912469 9.2912468881105 12 9.2912469
## 6 10.0605589 10.0605589215757 13 10.0605589
## 7 -0.7473996 -0.747399601346488 6 -0.7473996
## 8 -0.4663186 -0.466318557841871 8 -0.4663186
## 9 -0.6445200 -0.64451999093283 7 -0.6445200
## 10 -3.9003783 -3.90037829044104 4 -3.9003783
## 11 0.2280730 0.22807300246453 9 0.2280730
## 12 -4.9838644 -4.98386444859704 3 -4.9838644
## 13 -2.7625389 -2.7625389463799 5 -2.7625389
## 14 5.6445882 5.64458817276269 10 5.6445882
## 15 14.5949406 14.5949405897077 14 14.5949406
##
## [[3]]
## [1] "a" "b" "c"
To access their elements we use “double bracket” notation like so:
a_list[[2]]
## num fac wrong right
## 1 -7.0706575 -7.07065749385421 2 -7.0706575
## 2 7.7742924 7.7742924211066 11 7.7742924
## 3 15.8444118 15.8444117668306 15 15.8444118
## 4 -18.4569770 -18.4569770262935 1 -18.4569770
## 5 9.2912469 9.2912468881105 12 9.2912469
## 6 10.0605589 10.0605589215757 13 10.0605589
## 7 -0.7473996 -0.747399601346488 6 -0.7473996
## 8 -0.4663186 -0.466318557841871 8 -0.4663186
## 9 -0.6445200 -0.64451999093283 7 -0.6445200
## 10 -3.9003783 -3.90037829044104 4 -3.9003783
## 11 0.2280730 0.22807300246453 9 0.2280730
## 12 -4.9838644 -4.98386444859704 3 -4.9838644
## 13 -2.7625389 -2.7625389463799 5 -2.7625389
## 14 5.6445882 5.64458817276269 10 5.6445882
## 15 14.5949406 14.5949405897077 14 14.5949406
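One more list trick worth knowing (my own aside, not in the original lab code): list elements can be named, in which case the $ operator works just like it does for data.frames:
named_list <- list(pkgs = packages, chars = chaz)
named_list$chars
## [1] "a" "b" "c"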
Beautiful. Now back to where we began. Packages are user-created collections of functions which enhance the capabilities of R. “Official” packages are available via CRAN, a network of ftp and web servers around the world which store identical, up-to-date versions of code and documentation for R. To access the functions stored within a package, we first need to install.packages it if it is not already installed and then load it into our current session with library.
Now let’s look at the initial chunk of code you were confronted
with:
packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
"stargazer")
for(i in packages){
if(!require(i,character.only = T, quietly = T)){
install.packages(i,repos = "http://cran.us.r-project.org")
}
library(i, character.only = T, quietly = T)
}
rm(list=ls())
First, we create a vector of package names. Then, for each element of that vector we
- Check to see if it is installed with require (see ?require)
- If the package is not installed, install it with install.packages
- Load the package into our current session with library
Finally, once the loop is done, rm(list=ls()) removes all objects from working memory so we start from a clean environment.
By the way – if you’d like to install only a single package you might
do something like:
install.packages("gamlss")
library(gamlss)
But sometimes starting with something hard and breaking it down is a
better way to learn than building up from basics.
Now that we have those basics in our heads we can start putting R to
use.
Loading Data and Setting Paths
We will be using data from Peterson (2017): Export Diversity and
Human Rights. You can download the replication archive by clicking here
or download the data directly by running the following chunk.
d <- read_dta("https://www.dropbox.com/s/st8ugyfld4se1a5/JCR_final.dta?dl=1")
d
## # A tibble: 5,188 × 18
## ccode year twoway inhhi comper polity2 physint conflictonlocation lnpop
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 1981 0.986 0.990 0.997 10 8 0 19.3
## 2 2 1982 0.988 0.989 0.998 10 8 0 19.3
## 3 2 1983 0.982 0.989 0.994 10 8 1 19.3
## 4 2 1984 0.987 0.987 1 10 8 0 19.3
## 5 2 1985 0.985 0.987 0.998 10 7 0 19.3
## 6 2 1986 0.980 0.987 0.994 10 7 0 19.3
## 7 2 1987 0.982 0.987 0.995 10 8 0 19.3
## 8 2 1988 0.977 0.988 0.989 10 7 0 19.3
## 9 2 1989 0.980 0.988 0.992 10 7 1 19.3
## 10 2 1990 0.983 0.988 0.995 10 8 0 19.4
## # ℹ 5,178 more rows
## # ℹ 9 more variables: lngdppc <dbl>, gdppc <dbl>, expdep <dbl>,
## # gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>
The read_dta function comes from the haven package and is useful for reading in datasets from other statistical software like SPSS, Stata, or SAS.
Usually you’ll be loading data from your computer rather than from a
link. For this it is important to get a feel for how file paths work on
your computer and how to use working directories.
To check your working directory, you can run:
getwd()
## [1] "/Users/aliaelkattan/Documents/1- PhD/1- SMaPP jterm"
To introduce you quickly to a few useful functions, let’s have R
- Make us a new folder just off of your current working directory
- Save the Peterson data as a .csv
- Load that .csv into memory as a different object
- Delete that object from memory
dir <- getwd()
path <- paste(dir,"example_folder",sep="/")
dir.create(path)
setwd(path)
write.csv(d,"peterson_2017.csv",row.names = F)
list.files()
## [1] "peterson_2017.csv"
Boom, there it is! Now if we wanted to read in the data we could:
dat_path <- paste(path,"peterson_2017.csv",sep="/")
dat <- read.csv(dat_path)
And boom there it is. Now let’s remove just that:
rm(dat)
In your own time I suggest looking through the documentation for the functions used above: getwd, paste, dir.create, setwd, write.csv, list.files, and rm.
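As an aside (not part of the original lab code), base R’s file.path function builds paths with the correct separator for your operating system, which is a slightly more portable habit than pasting "/" together by hand, and unlink can clean up the example folder once you are done with it:
file.path(path, "peterson_2017.csv")  # equivalent to paste(path, "peterson_2017.csv", sep="/")
setwd(dir)                            # step back out of the folder before deleting it
unlink(path, recursive = TRUE)        # removes example_folder and its contents -- only if you want it gone!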
Finally, before we get our hands dirty, let’s see how to take a first look at our data:
summary(d)
## ccode year twoway inhhi
## Min. : 2.0 Min. :1981 Min. :0.0000 Min. :0.0000
## 1st Qu.:232.0 1st Qu.:1989 1st Qu.:0.1989 1st Qu.:0.6856
## Median :450.0 Median :1996 Median :0.4499 Median :0.8636
## Mean :457.3 Mean :1996 Mean :0.4855 Mean :0.7784
## 3rd Qu.:670.0 3rd Qu.:2003 3rd Qu.:0.7919 3rd Qu.:0.9545
## Max. :990.0 Max. :2010 Max. :0.9882 Max. :0.9932
##
## comper polity2 physint conflictonlocation
## Min. :0.00161 Min. :-10.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.32156 1st Qu.: -6.000 1st Qu.:3.000 1st Qu.:0.0000
## Median :0.58347 Median : 4.000 Median :5.000 Median :0.0000
## Mean :0.57874 Mean : 1.746 Mean :4.838 Mean :0.1714
## 3rd Qu.:0.86127 3rd Qu.: 9.000 3rd Qu.:7.000 3rd Qu.:0.0000
## Max. :1.00000 Max. : 10.000 Max. :8.000 Max. :1.0000
## NA's :808 NA's :701
## lnpop lngdppc gdppc expdep
## Min. :10.61 Min. : 4.889 Min. : 132.8 Min. :0.0007
## 1st Qu.:14.67 1st Qu.: 7.455 1st Qu.: 1729.2 1st Qu.:0.0581
## Median :15.84 Median : 8.507 Median : 4951.4 Median :0.1164
## Mean :15.67 Mean : 8.473 Mean : 9711.5 Mean :0.1684
## 3rd Qu.:16.90 3rd Qu.: 9.453 3rd Qu.: 12743.0 3rd Qu.:0.2103
## Max. :21.00 Max. :11.541 Max. :102804.8 Max. :3.7119
## NA's :731 NA's :731 NA's :731 NA's :731
## gdpgrowth lib_HK meanarab meanpop
## Min. :-0.6532 Min. :-0.0142 Min. : 0.000 Min. : 0.1193
## 1st Qu.:-0.0032 1st Qu.:-0.0016 1st Qu.: 7.661 1st Qu.: 6.8294
## Median : 0.0347 Median :-0.0002 Median :13.599 Median : 17.0428
## Mean : 0.0377 Mean : 0.0002 Mean :15.963 Mean : 62.5552
## 3rd Qu.: 0.0742 3rd Qu.: 0.0016 3rd Qu.:24.007 3rd Qu.: 34.3939
## Max. : 1.8620 Max. : 0.0572 Max. :73.389 Max. :1216.1071
## NA's :762 NA's :1059 NA's :1209 NA's :1186
## sdarab_manual sdpop_manual
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 1.423 1st Qu.: 0.6023
## Median : 6.301 Median : 8.5648
## Mean : 7.074 Mean : 40.1050
## 3rd Qu.:11.619 3rd Qu.: 26.2434
## Max. :30.083 Max. :629.2051
## NA's :1209 NA's :1186
Of particular importance are the NA counts representing missing data. These are not only worth examining to get a better sense of your data, but also alert you to how functions like mean and sum react to missing data.
c(mean(d$polity2),
sum(d$polity2))
## [1] NA NA
Checking the documentation with ?mean or help(mean), you’ll note that the argument na.rm defaults to FALSE. To compute these quantities while omitting the missing values, you would specify:
c(mean(d$polity2, na.rm=T),
sum(d$polity2, na.rm=TRUE))
## [1] 1.745662 7646.000000
where either T or TRUE can be used to indicate, well, true.
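A small caveat worth flagging (my own aside): TRUE is a reserved word, but T is just an ordinary object that starts out equal to TRUE, so it can be overwritten. Spelling out TRUE in scripts you plan to share is the safer habit:
T <- FALSE   # perfectly legal, and now T no longer means true!
# TRUE <- FALSE would be a syntax error, by contrast
rm(T)        # remove the masking object so T means TRUE again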
Basic Data Wrangling
When dealing with data, especially text data, certain data wrangling
skills are important. Perhaps the most basic task you’ll need to know
how to do is select cases and subset data. As with most
things in R, there are multiple ways of accomplishing the same goal
(base R vs packages, etc).
To get indices which satisfy logical statements you can use the which function:
which(d$gdppc > 50000)
## [1] 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 2116 2117 2118
## [16] 2119 2120 2121 2122 2123 2124 2125 2126 3981 3982 3983 3984 3985 3986 4017
## [31] 4018 4019 4020 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046
## [46] 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816
## [61] 4817 4818 4820 4821 4823 4824 4825 4826 4827 4828 4830
which(d$gdppc < 2000 & d$polity2 == 10)
## [1] 4277 4278 4279 4280
which(d$gdpgrowth < -.5 | d$gdppc < 100)
## [1] 3040 3369 3717 3805 3967
We can combine this with indexing to subset down the data. We can
also call columns in a variety of ways. Remember that you can create
objects carrying this information to modularize your code, which might
be helpful in particular situations to keep everything clear.
inds <- which(d$gdppc > 50000)
cols <- c("ccode","year","physint","lnpop")
sub1 <- d[inds,cols]
sub1
## # A tibble: 71 × 4
## ccode year physint lnpop
## <dbl> <dbl> <dbl> <dbl>
## 1 212 1999 8 13.0
## 2 212 2000 8 13.0
## 3 212 2001 8 13.0
## 4 212 2002 8 13.0
## 5 212 2003 8 13.0
## 6 212 2004 8 13.0
## 7 212 2005 8 13.0
## 8 212 2006 8 13.1
## 9 212 2007 8 13.1
## 10 212 2008 8 13.1
## # ℹ 61 more rows
Alternatively, one could use the subset function from base R to get the same result.
sub2 <- subset(d,d$gdppc > 50000,cols)
sub2
## # A tibble: 71 × 4
## ccode year physint lnpop
## <dbl> <dbl> <dbl> <dbl>
## 1 212 1999 8 13.0
## 2 212 2000 8 13.0
## 3 212 2001 8 13.0
## 4 212 2002 8 13.0
## 5 212 2003 8 13.0
## 6 212 2004 8 13.0
## 7 212 2005 8 13.0
## 8 212 2006 8 13.1
## 9 212 2007 8 13.1
## 10 212 2008 8 13.1
## # ℹ 61 more rows
identical(sub1,sub2)
## [1] TRUE
Another great option is the dplyr package, which is part of the tidyverse alongside the packages haven and ggplot2. One of the best things about the tidyverse family of packages is that they are very well documented, including a variety of cheatsheets and books. One thing that makes the wrangling tools particularly powerful is that they leverage a pipe (%>%) from the magrittr package which says, in pseudo-code, that x %>% f(y) is the same as f(x,y). This can create a nice work flow. For example, to get the same subset yet again:
d %>% filter(gdppc > 50000) %>% dplyr::select(all_of(cols)) -> sub3
all(sub1 == sub3, na.rm=T)
## [1] TRUE
Neat. What we did was take our dataframe, filter the rows that we wanted, and then select the columns of interest. (The dplyr:: prefix on select is there because several other packages also define a function called select; the prefix guarantees we get dplyr’s version.)
If we want to sort data, there is the base R sort function for vectors:
head(sort(d$gdpgrowth))
## [1] -0.6532034 -0.6011391 -0.5537450 -0.5498320 -0.5225903 -0.4939937
For dataframes you have to use order, which produces index numbers that can be used as before:
d[order(d$gdpgrowth),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 × 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 645 1991 -0.653
## 2 572 2003 -0.601
## 3 517 1994 -0.554
## 4 690 1991 -0.550
## 5 660 1989 -0.523
## 6 92 2007 -0.494
## 7 475 1986 -0.480
## 8 450 1990 -0.475
## 9 373 1993 -0.440
## 10 411 1990 -0.423
## # ℹ 5,178 more rows
We can also switch the ordering around by setting decreasing = T:
d[order(d$gdpgrowth,decreasing = T),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 × 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 690 1992 1.86
## 2 572 2005 1.48
## 3 450 1997 1.39
## 4 411 1997 1.38
## 5 92 2009 1.13
## 6 411 2002 0.874
## 7 345 1996 0.854
## 8 411 1999 0.788
## 9 552 2010 0.752
## 10 411 1992 0.731
## # ℹ 5,178 more rows
Or we could use the handy %>%. In this case we have to use the placeholder . for the input, a trick that comes in handy for more complicated functions:
order(d$gdpgrowth, decreasing = T) %>%
  d[., c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 × 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 690 1992 1.86
## 2 572 2005 1.48
## 3 450 1997 1.39
## 4 411 1997 1.38
## 5 92 2009 1.13
## 6 411 2002 0.874
## 7 345 1996 0.854
## 8 411 1999 0.788
## 9 552 2010 0.752
## 10 411 1992 0.731
## # ℹ 5,178 more rows
There is also a handy arrange function in dplyr for accomplishing the same sort of task while allowing you to sort on multiple columns.
d %>%
  dplyr::select(c("ccode","year","gdpgrowth")) %>%
  arrange(ccode, gdpgrowth)
## # A tibble: 5,188 × 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 2 2009 -0.0359
## 2 2 1982 -0.0233
## 3 2 2008 -0.00629
## 4 2 1991 -0.00469
## 5 2 2001 0.00829
## 6 2 2002 0.0129
## 7 2 1990 0.0150
## 8 2 2007 0.0195
## 9 2 2003 0.0238
## 10 2 2010 0.0256
## # ℹ 5,178 more rows
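To sort any one column in descending order within the same arrange call, wrap it in desc() -- a quick aside, not from the original lab:
d %>%
  dplyr::select(c("ccode","year","gdpgrowth")) %>%
  arrange(ccode, desc(gdpgrowth))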
Another basic task you’ll want to know how to do is merge datasets together. You may have noticed that the ccode variable isn’t particularly descriptive of which country it refers to. At the start we loaded the countrycode package, which contains additional information.
codes <- countrycode::codelist_panel
Let’s see what they have.
colnames(codes)
## [1] "country.name.en" "year"
## [3] "ar5" "cctld"
## [5] "continent" "country.name.de"
## [7] "country.name.de.regex" "country.name.en.regex"
## [9] "country.name.fr" "country.name.fr.regex"
## [11] "country.name.it" "country.name.it.regex"
## [13] "cowc" "cown"
## [15] "currency" "dhs"
## [17] "ecb" "eu28"
## [19] "eurocontrol_pru" "eurocontrol_statfor"
## [21] "eurostat" "fao"
## [23] "fips" "gaul"
## [25] "genc2c" "genc3c"
## [27] "genc3n" "gwc"
## [29] "gwn" "icao.region"
## [31] "imf" "ioc"
## [33] "iso2c" "iso3c"
## [35] "iso3n" "iso4217c"
## [37] "iso4217n" "p4c"
## [39] "p4n" "p5c"
## [41] "p5n" "region"
## [43] "region23" "un"
## [45] "un.region.code" "un.regionintermediate.code"
## [47] "un.regionsub.code" "unhcr"
## [49] "unhcr.region" "unicode.symbol"
## [51] "unpd" "vdem"
## [53] "wb" "wb_api2c"
## [55] "wb_api3c" "wvs"
The country codes we are currently using are cown. Let’s grab iso3c and region to add to the dataset. We also know that the dataset we are working with only has years from 1981 to 2010, so let’s practice our subsetting skillz:
codes <- codes[codes$year %in% 1981:2010,c("cown","year","iso3c","country.name.en","region")]
One thing to pay attention to is losing or gaining observations
during a merge. For a great overview, check out this handy NYU Data
Services guide.
nrow(d)
## [1] 5188
out1 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"))
nrow(out1)
## [1] 5125
out2 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.x=T)
nrow(out2)
## [1] 5188
out3 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.y=T)
nrow(out3)
## [1] 5621
out4 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all=T)
nrow(out4)
## [1] 5684
And, of course, we can do the same merges using dplyr with inner_join, left_join, right_join, and full_join respectively. Going forward we will keep out2 as the working dataset.
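For instance, the dplyr equivalent of the out2 merge above would look something like this (a sketch; note how differing key names are handled inside by):
out2_dplyr <- left_join(d, codes, by = c("ccode" = "cown", "year"))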
Another basic task you’ll want to know how to do is calculate aggregates and summaries. There are a number of great things you can do with the apply family of functions (a quick taste is sketched just below), including easily going parallel with the pbapply package. If you are interested in more details on this you should check out this tutorial and this taskview. We will focus on using dplyr to calculate summaries of interest.
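As promised, a taste of the apply family (my own aside): sapply applies a function to each column of a data frame and simplifies the result to a vector.
sapply(d[, c("gdppc", "polity2", "physint")], mean, na.rm = TRUE)
## roughly 9711.5, 1.746, and 4.838 -- matching the summary output above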
One reason for this is that it is super easy to calculate summaries grouping on another variable. For example, if we wanted to think about regional variation in gdppc we could:
out2 %>%
  group_by(region) %>%
  summarize(mean = mean(gdppc, na.rm = T),
            sd = sd(gdppc, na.rm = T),
            sum = sum(gdppc, na.rm = T))
## # A tibble: 8 × 4
## region mean sd sum
## <chr> <dbl> <dbl> <dbl>
## 1 East Asia & Pacific 13583. 15207. 6071507.
## 2 Europe & Central Asia 16667. 11640. 18734205.
## 3 Latin America & Caribbean 7137. 4608. 6194691.
## 4 Middle East & North Africa 13735. 17320. 6881093.
## 5 North America 32391. 5598. 1943448.
## 6 South Asia 2619. 2070. 549891.
## 7 Sub-Saharan Africa 2048. 2652. 2523069.
## 8 <NA> 25760. 24903. 386395.
We can also use the mutate function to add this information to our dataframe. In base R this would take merge-ing the output of aggregate, so it can certainly be done, but dplyr makes it somewhat more straightforward and scalable.
out2 %>%
  group_by(region) %>%
  mutate(mean_gdppc = mean(gdppc, na.rm = T),
         sd_gdppc = sd(gdppc, na.rm = T)) -> out2
out2
## # A tibble: 5,188 × 23
## # Groups: region [8]
## ccode year twoway inhhi comper polity2 physint conflictonlocation lnpop
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 1981 0.986 0.990 0.997 10 8 0 19.3
## 2 2 1982 0.988 0.989 0.998 10 8 0 19.3
## 3 2 1983 0.982 0.989 0.994 10 8 1 19.3
## 4 2 1984 0.987 0.987 1 10 8 0 19.3
## 5 2 1985 0.985 0.987 0.998 10 7 0 19.3
## 6 2 1986 0.980 0.987 0.994 10 7 0 19.3
## 7 2 1987 0.982 0.987 0.995 10 8 0 19.3
## 8 2 1988 0.977 0.988 0.989 10 7 0 19.3
## 9 2 1989 0.980 0.988 0.992 10 7 1 19.3
## 10 2 1990 0.983 0.988 0.995 10 8 0 19.4
## # ℹ 5,178 more rows
## # ℹ 14 more variables: lngdppc <dbl>, gdppc <dbl>, expdep <dbl>,
## # gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## # country.name.en <chr>, region <chr>, mean_gdppc <dbl>, sd_gdppc <dbl>
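Notice the # Groups: region [8] line in the printout: after group_by, the grouping sticks to the tibble and will silently shape any later summarize or mutate calls. Since we only needed it for the group means, it is worth dropping (a small aside, not in the original lab):
out2 <- ungroup(out2)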
A base R version of the above might be
a1 <- aggregate(out2$gdppc,by=list(out2$region),mean,na.rm=T)
a1
## Group.1 x
## 1 East Asia & Pacific 13582.789
## 2 Europe & Central Asia 16667.442
## 3 Latin America & Caribbean 7136.741
## 4 Middle East & North Africa 13734.717
## 5 North America 32390.804
## 6 South Asia 2618.530
## 7 Sub-Saharan Africa 2047.946
colnames(a1) <- c("region","mean_gdppc")
a2 <- aggregate(out2$gdppc,by=list(out2$region),sd,na.rm=T)
colnames(a2) <- c("region","sd_gdppc")
t1 <- merge(out2,a1,by="region")
t2 <- merge(t1,a2,by="region")
as_tibble(t2)
## # A tibble: 5,125 × 25
## region ccode year twoway inhhi comper polity2 physint conflictonlocation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 East Asia… 712 1995 0.187 0.781 0.240 9 6 0
## 2 East Asia… 710 2009 0.974 0.976 0.998 -7 0 0
## 3 East Asia… 712 1994 0.218 0.858 0.254 9 7 0
## 4 East Asia… 712 2003 0.293 0.802 0.366 10 6 0
## 5 East Asia… 712 1996 0.222 0.747 0.297 10 7 0
## 6 East Asia… 712 1997 0.248 0.760 0.326 10 7 0
## 7 East Asia… 712 2002 0.294 0.796 0.369 10 7 0
## 8 East Asia… 740 1983 0.940 0.963 0.976 10 8 0
## 9 East Asia… 712 2004 0.335 0.782 0.428 10 5 0
## 10 East Asia… 712 1991 0.192 0.873 0.220 2 8 0
## # ℹ 5,115 more rows
## # ℹ 16 more variables: lnpop <dbl>, lngdppc <dbl>, gdppc <dbl>, expdep <dbl>,
## # gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## # country.name.en <chr>, mean_gdppc.x <dbl>, sd_gdppc.x <dbl>,
## # mean_gdppc.y <dbl>, sd_gdppc.y <dbl>
but the dplyr approach really is quite nice.
Basic Plotting
We will focus on using ggplot2
for graphics in R,
although base R has nice capabilities on its own. ggplot
is
all about the `grammar of graphics’ which follows a layered approach to
describe and construct graphics in a structured manner. To begin, we
will always initialize a plot:
p1 <- ggplot(out2[which(out2$region == "North America"),], aes(x=log(gdppc)))
To get different plots, we will add layers. For example, if we wanted
a dot plot
p1 + geom_dotplot(binwidth=0.03)

or a histogram
p1 + geom_histogram(binwidth=0.03)

or a density plot
p1 + geom_density()

we can just add a different layer to the same underlying plot. The order in which you add most components (themes, labels, scales) does not matter, though geoms are drawn in the order they are added, and there are a bunch more customizations available.
p1 + geom_histogram(color = "black", fill = "darkblue", binwidth = 0.03) +
  xlab("Natural Log of Per Capita GDP") +
  ylab("Frequency") +
  ggtitle('North American GDPPC') +
  theme_bw() -> g1
g1

You can also add multiple geometries to the same underlying plot.
p2 <- ggplot(out2[which(out2$region == "South Asia"),],aes(x=year,y=log(gdppc),color=iso3c))
p2 + geom_point(na.rm = T) +
  geom_line(na.rm = T) +
  labs(color = "Country") +
  scale_color_brewer(palette = "Spectral") -> g2
g2

You can even add some smoothers if you want.
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
geom_smooth(color ="gray", method = "lm", se = TRUE,na.rm=T, formula=y~x)

p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
geom_smooth(color ="gray", method = "loess", se = TRUE,formula=y~x, na.rm=T) -> g3
g3

Two last notes on plots – faceting and adding plots together into a
larger image.
Faceting can be a nice way to break up a continuous variable by
category.
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Europe & Central Asia","Middle East & North Africa")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) +
facet_grid(region ~ .)

p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Europe & Central Asia","Middle East & North Africa")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) +
facet_grid(. ~ region)

Once we do all that, we might want to add multiple plots together into a larger multi-panel graphic. The gridExtra package is great for this.
grid.arrange(g1,g2,g3,textGrob("Spiffy!"),ncol=2,nrow=2)

You should take a look at the ggplot2 book linked above, as well as other resources like ggpubr; some examples can be found here.
Notes on Typesetting
In R it is relatively easy to create tables that are presentable in
Word, LaTex, or HTML. A large number of packages –
apsrtable
, xtable
, texreg
,
memisc
, outreg
and others – cater to LaTex
users. If you’re using word to prepare your documents,
stargazer
is an option that can work nicely. One thing we
might want to do from the above is greate a table of descriptive
statistics.
vars <- c("polity2","physint","conflictonlocation","lnpop")
path <- getwd()
stargazer(as.data.frame(out2[, vars]), type = "html",
          title = "Descriptive Statistics",
          covariate.labels = c("Democracy", "Human Rights", "Any Conflict?",
                               "Population (Logged)"),
          out = paste0(path, "/summary1.doc"))
As you might have gathered from the code above, what we are going to do is use the ability of Word to render HTML by saving the output with a .doc extension.
Otherwise, the table looks like this:
Descriptive Statistics

| Statistic           | N     | Mean   | St. Dev. | Min    | Max    |
|---------------------|-------|--------|----------|--------|--------|
| Democracy           | 4,380 | 1.746  | 7.203    | -10    | 10     |
| Human Rights        | 4,487 | 4.838  | 2.322    | 0      | 8      |
| Any Conflict?       | 5,188 | 0.171  | 0.377    | 0      | 1      |
| Population (Logged) | 4,457 | 15.674 | 1.925    | 10.615 | 21.000 |
If you’re interested in what other summary statistics are available natively from stargazer, check out omit.summary.stat in the documentation:
?stargazer
Something a bit more interesting would be to take a look at regression tables. First, let’s run some models. To replicate the Peterson (2017) results, we need to define a function for leading a variable.
tscslead <- function(x, cs, ts){
  # Build "unit::year" keys, then for each row find the row belonging to the
  # same cross-sectional unit (cs) one period ahead (ts + 1)
  lagobs <- match(paste(cs, ts + 1, sep = "::"), paste(cs, ts, sep = "::"))
  # Return x from the matched rows; NA wherever a unit has no ts + 1 observation
  x[lagobs]
}
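A quick sanity check on made-up toy data (my own addition, not part of the replication) shows the function behaving as intended: each row receives next year’s value within the same unit, and the final year of each unit gets NA.
toy <- data.frame(id = c(1,1,1,2,2), yr = c(2000,2001,2002,2000,2001), x = 1:5)
tscslead(toy$x, toy$id, toy$yr)
## [1]  2  3 NA  5 NA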
The next thing that we will do is create the dependent variable with a one-period lead:
out2$physint_lead <- tscslead(out2$physint, out2$ccode, out2$year)
And then we will replicate models 1-3
m1 <- lm(physint_lead ~ twoway + lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)
m2 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)
m3 <- lm(physint_lead ~ twoway + expdep + lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)
stargazer(m1, m2, m3, type = "html", out = paste0(path, "/regressions.doc"),
          covariate.labels = c("Export Diversification", "Liberalization",
                               "Export Dependence", "Democracy", "Conflict",
                               "Log Income", "Growth", "Log Population",
                               "Linear Time Trend", "Lagged DV"),
          title = "Replication of Peterson (2017) Models 1-3",
          dep.var.caption = "All States",
          dep.var.labels = "Human Rights")
The output looks like this:
Replication of Peterson (2017) Models 1-3

Dependent variable (all states): Human Rights

|                        | (1)       | (2)       | (3)        |
|------------------------|-----------|-----------|------------|
| Export Diversification | 0.276**   | 0.242**   | 0.226**    |
|                        | (0.114)   | (0.113)   | (0.114)    |
| Liberalization         | -10.638   |           | -57.858*** |
|                        | (8.896)   |           | (13.994)   |
| Export Dependence      |           | 0.202**   | 0.707***   |
|                        |           | (0.101)   | (0.162)    |
| Democracy              | 0.026***  | 0.026***  | 0.027***   |
|                        | (0.004)   | (0.004)   | (0.004)    |
| Conflict               | -0.583*** | -0.555*** | -0.611***  |
|                        | (0.062)   | (0.061)   | (0.062)    |
| Log Income             | 0.126***  | 0.118***  | 0.131***   |
|                        | (0.023)   | (0.022)   | (0.023)    |
| Growth                 | -0.154    | -0.207    | -0.140     |
|                        | (0.193)   | (0.191)   | (0.192)    |
| Log Population         | -0.190*** | -0.174*** | -0.226***  |
|                        | (0.020)   | (0.018)   | (0.022)    |
| Linear Time Trend      | -0.014*** | -0.014*** | -0.016***  |
|                        | (0.002)   | (0.002)   | (0.002)    |
| Lagged DV              | 0.675***  | 0.679***  | 0.666***   |
|                        | (0.013)   | (0.012)   | (0.013)    |
| Constant               | 31.817*** | 31.360*** | 35.327***  |
|                        | (4.914)   | (4.876)   | (4.967)    |
| Observations           | 3,479     | 3,562     | 3,479      |
| R2                     | 0.751     | 0.754     | 0.753      |
| Adjusted R2            | 0.751     | 0.753     | 0.752      |
| Residual Std. Error    | 1.118 (df = 3469) | 1.121 (df = 3552) | 1.115 (df = 3468) |
| F Statistic            | 1,164.520*** (df = 9; 3469) | 1,208.221*** (df = 9; 3552) | 1,055.422*** (df = 10; 3468) |

Note: *p<0.1; **p<0.05; ***p<0.01
Stargazer has a number of useful “style” options which format output close to the styles of a number of major journals, which may be worth looking at or playing around with.
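For example, asking for the American Journal of Political Science layout takes just one extra argument (a sketch; see ?stargazer for the full list of supported styles):
stargazer(m1, m2, m3, type = "text", style = "ajps")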
That’s all for today! Tomorrow we will go a bit deeper and start
thinking about how to use R for statistical analysis!