A primary objective of these recitations is to get you working and comfortable with the R Language for Statistical Computing and Graphics. You can download R here and RStudio here if you don’t already have these things installed on your own computer (all computers in the lab should have these pre-installed).
Once these things are installed, you should be able to download the code underlying this RMarkdown document by clicking the “code” button at the top of the web-page. Once downloaded, you will be able to open the file within RStudio and follow/play along!
Before jumping into coding essentials, I want to point you towards some useful and free resources for getting into the R language. Foremost among these are R for Data Science by Hadley Wickham and Garrett Grolemund and Hands-On Programming with R by Garrett Grolemund. Both books provide excellent overviews of the essentials you need to get working with R as quickly as possible.
If you already have some experience with R, you might find Hadley Wickham’s Advanced R, ggplot2, or R Markdown: The Definitive Guide by Xie, Allaire, and Grolemund useful. These books cover more advanced aspects of the programming language, provide an authoritative take on creating graphics in R, and give a detailed overview of R Markdown (the typesetting approach used to create these lab documents). Once you get these things down, you should be able to shift easily into more advanced applications than will be covered in this course, such as Deep Learning with R, or dive right into exploring the various CRAN Task Views, which collect a large number of packages relevant to tasks from Bayesian inference to Natural Language Processing and Machine Learning.
But before we can think of doing any of that, we need to pound out some basics of the R Programming Language.
First things first: let’s install/load all of the packages that we will be using, and clear out the environment.
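# Packages used throughout this lab
packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
              "stargazer","tidyselect")
for(i in packages){
  # Install the package only if it cannot be loaded
  if(!require(i, character.only = TRUE, quietly = TRUE)){
    install.packages(i, repos = "http://cran.us.r-project.org")
  }
  library(i, character.only = TRUE, quietly = TRUE)
}
# Clear all objects from the working environment
rm(list=ls())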
If you haven’t seen R before, that probably looks like a bunch of hieroglyphics. Let’s break it down and get a feel for the most basic operations and entities used in the language. First, R is an object oriented programming language. This means that we will create named objects which exist within our working environment that we can then access or manipulate. To start, let’s create an object named packages which contains a vector of package names:
packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
"stargazer")
This is an example of the most basic operation within R, object assignment. In general, the syntax looks like
<object name> <- <stuff>
where <- is the most commonly used assignment operator. Suppose we wanted to access only the first three elements of this vector; to do so we can specify the indices we want like so:
packages[1:3]
## [1] "haven" "dplyr" "ggplot2"
Note that unlike some other languages, indices in R start at 1 rather than 0. Suppose we wanted the 2nd, 5th, and 8th elements of the vector instead. We can do so by supplying a vector thusly:
packages[c(2,5,8)]
## [1] "dplyr" "tidyr" "stargazer"
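Two other indexing idioms are worth knowing alongside positive indices. The sketch below uses a small made-up vector (pkgs is a hypothetical name, not an object used elsewhere in this lab) to show negative indices, which drop elements, and logical indices, which keep the elements where a condition is TRUE.

```r
# A small illustrative vector (hypothetical; not used elsewhere in the lab)
pkgs <- c("haven", "dplyr", "ggplot2", "countrycode")

# Negative indices drop elements: everything except the first
without_first <- pkgs[-1]

# Logical indices keep elements where the condition is TRUE:
# here, names longer than five characters
long_names <- pkgs[nchar(pkgs) > 5]

print(without_first)
print(long_names)
```

Logical indexing is the idea that row subsetting of data builds on later, so it is worth getting comfortable with it on simple vectors first.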
There are a few particularly important classes within R. Above we have a character vector, which is composed of strings. We can check the class of an object with the class function:
class(packages)
## [1] "character"
Other particularly important classes are numeric and factor variables, the former being self-descriptive while the latter is the name for categorical data in R. We are able to convert between classes with the as.whatever family of functions. For an example, let’s draw 15 random normal deviates with mean 5 and standard deviation 10 after setting a seed (this ensures that we draw the same pseudo-random samples every time):
set.seed(1234)
num <- rnorm(n = 15, mean = 5, sd = 10)
class(num)
## [1] "numeric"
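To see what the seed buys us, here is a minimal sketch (the object names draw_a, draw_b, and draw_c are made up for illustration). Resetting the seed reproduces the same draws, while continuing without a reset moves on to new pseudo-random numbers.

```r
# Same seed, same draws
set.seed(1234)
draw_a <- rnorm(n = 3, mean = 5, sd = 10)

set.seed(1234)
draw_b <- rnorm(n = 3, mean = 5, sd = 10)

# No reset here, so the generator simply continues
draw_c <- rnorm(n = 3, mean = 5, sd = 10)

same_after_reset <- identical(draw_a, draw_b)
same_without_reset <- identical(draw_a, draw_c)

print(same_after_reset)
print(same_without_reset)
```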
As expected, our numeric data is, well, numeric. We can convert it to a character vector with the as.character function.
fac <- as.character(num)
class(fac)
## [1] "character"
Of note, converting such a variable back to numeric can require an additional step, most importantly when it is stored as a factor. Let’s create a data.frame to see what happens and to learn how to conduct basic modifications of such objects.
dat <- data.frame(num,fac)
dat$wrong <- as.numeric(dat$fac)
dat$right <- as.numeric(as.character(dat$fac))
dat
## num fac wrong right
## 1 -7.0706575 -7.07065749385421 -7.0706575 -7.0706575
## 2 7.7742924 7.7742924211066 7.7742924 7.7742924
## 3 15.8444118 15.8444117668306 15.8444118 15.8444118
## 4 -18.4569770 -18.4569770262935 -18.4569770 -18.4569770
## 5 9.2912469 9.2912468881105 9.2912469 9.2912469
## 6 10.0605589 10.0605589215757 10.0605589 10.0605589
## 7 -0.7473996 -0.747399601346488 -0.7473996 -0.7473996
## 8 -0.4663186 -0.466318557841871 -0.4663186 -0.4663186
## 9 -0.6445200 -0.64451999093283 -0.6445200 -0.6445200
## 10 -3.9003783 -3.90037829044104 -3.9003783 -3.9003783
## 11 0.2280730 0.22807300246453 0.2280730 0.2280730
## 12 -4.9838644 -4.98386444859704 -4.9838644 -4.9838644
## 13 -2.7625389 -2.7625389463799 -2.7625389 -2.7625389
## 14 5.6445882 5.64458817276269 5.6445882 5.6445882
## 15 14.5949406 14.5949405897077 14.5949406 14.5949406
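What we did in the above was create a data.frame with two columns, num and fac, and then add two more, wrong and right. Because fac is stored as a character vector here (since R 4.0.0, data.frame no longer converts strings to factors by default), both conversions recover the original values. Had fac been a factor, converting directly to numeric would have returned the integer level codes rather than the values themselves, while converting to a character in-between gives us back the correct information.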
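To make the factor pitfall concrete, here is a small sketch on a made-up factor (f and the other names are hypothetical and separate from the dat object above). It contrasts the direct conversion with the two-step conversion, and adds the levels-based idiom suggested in the factor help page.

```r
# A factor whose labels happen to be numbers (illustrative data)
f <- factor(c("10", "20", "20", "5"))

# Levels are sorted as strings: "10", "20", "5"
print(levels(f))

# Direct conversion returns the integer level codes, not the values
codes <- as.numeric(f)

# Going through character first recovers the values
values <- as.numeric(as.character(f))

# Equivalent idiom: convert the levels once, then index by the factor
values_alt <- as.numeric(levels(f))[f]

print(codes)
print(values)
```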
This is an example of why it is so important for beginners in the R programming language, or any language for that matter, to read the documentation so that mistakes are not made. To access the documentation for a function we can simply ask for help:
help(as.numeric)
R has great documentation and you should always read about functions you are unfamiliar with. Scrolling down a bit we can see under the warning header that “If x is a factor, as.numeric will return the underlying numeric (integer) representation, which is often meaningless as it may not correspond to the factor levels.”
Since we have a data.frame handy, let’s learn how to interact with it. Using the $ operator we can access columns of the data in a straightforward manner:
dat$num
## [1] -7.0706575 7.7742924 15.8444118 -18.4569770 9.2912469 10.0605589
## [7] -0.7473996 -0.4663186 -0.6445200 -3.9003783 0.2280730 -4.9838644
## [13] -2.7625389 5.6445882 14.5949406
There are two other useful ways of extracting information from data.frames. First, we can use indices in a way very similar to the above, except that now we have both row AND column indices. For example, we can access the first three rows of columns 2 and 4 like so:
dat[1:3,c(2,4)]
## fac right
## 1 -7.07065749385421 -7.070657
## 2 7.7742924211066 7.774292
## 3 15.8444117668306 15.844412
Alternatively, we can also call variables by their names like this:
dat[,c("num","fac")]
## num fac
## 1 -7.0706575 -7.07065749385421
## 2 7.7742924 7.7742924211066
## 3 15.8444118 15.8444117668306
## 4 -18.4569770 -18.4569770262935
## 5 9.2912469 9.2912468881105
## 6 10.0605589 10.0605589215757
## 7 -0.7473996 -0.747399601346488
## 8 -0.4663186 -0.466318557841871
## 9 -0.6445200 -0.64451999093283
## 10 -3.9003783 -3.90037829044104
## 11 0.2280730 0.22807300246453
## 12 -4.9838644 -4.98386444859704
## 13 -2.7625389 -2.7625389463799
## 14 5.6445882 5.64458817276269
## 15 14.5949406 14.5949405897077
Note that when you leave an index blank you get all of those elements back – in the above we got all of the rows for the two selected columns. Alternatively we could get all of the columns for a particular subset of rows like so:
dat[1:2,]
## num fac wrong right
## 1 -7.070657 -7.07065749385421 -7.070657 -7.070657
## 2 7.774292 7.7742924211066 7.774292 7.774292
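Row and column indices can also be logical conditions and names at the same time. The sketch below uses a small made-up data.frame (toy and its columns are hypothetical) to filter rows by a condition while picking columns by name, and shows two ways of pulling out a single column.

```r
# Illustrative data (not part of the lab's dataset)
toy <- data.frame(id = 1:4,
                  score = c(2.5, 7.1, 4.8, 9.0),
                  group = c("a", "b", "a", "b"))

# Rows where score exceeds 4, keeping only two named columns
high <- toy[toy$score > 4, c("id", "group")]

# [[ ]] extracts one column as a plain vector
score_vec <- toy[["score"]]

# drop = FALSE keeps a single column as a data.frame
one_col_df <- toy[, "score", drop = FALSE]

print(high)
```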
Of particular importance is that the columns of data.frames can be different classes.
c(class(dat$num),class(dat$fac),class(dat$wrong),class(dat$right))
## [1] "numeric" "character" "numeric" "numeric"
This is distinct from the matrix class of object, generally only used in particular machine learning libraries or to do matrix algebra in R, but we won’t talk about those things in detail here. Note what happens when we coerce our data.frame to a matrix (note, accessing elements of matrices is almost identical to data.frames except that the $ no longer works):
head(as.matrix(dat))
## num fac wrong right
## [1,] " -7.0706575" "-7.07065749385421" " -7.0706575" " -7.0706575"
## [2,] " 7.7742924" "7.7742924211066" " 7.7742924" " 7.7742924"
## [3,] " 15.8444118" "15.8444117668306" " 15.8444118" " 15.8444118"
## [4,] "-18.4569770" "-18.4569770262935" "-18.4569770" "-18.4569770"
## [5,] " 9.2912469" "9.2912468881105" " 9.2912469" " 9.2912469"
## [6,] " 10.0605589" "10.0605589215757" " 10.0605589" " 10.0605589"
They are all characters now! We get the same behavior with vectors when combining various classes:
numz <- c(1,2,3)
chaz <- c("a","b","c")
c(numz,chaz)
## [1] "1" "2" "3" "a" "b" "c"
To see why, check out the “Details” section of the help file for the c function.
?c
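As a quick illustration of that coercion behavior, the sketch below combines values of different types and checks the class of each result (the object names are made up). When types are mixed, c moves everything to the most flexible type present, roughly logical, then integer, then numeric, then character.

```r
# Mixing types: the result takes the most flexible type present
lgl_int <- c(TRUE, 2L)      # logical + integer
int_dbl <- c(2L, 3.5)       # integer + double
dbl_chr <- c(3.5, "a")      # double + character

print(class(lgl_int))
print(class(int_dbl))
print(class(dbl_chr))

# A handy consequence: TRUE counts as 1 and FALSE as 0
n_true <- sum(c(TRUE, FALSE, TRUE))
print(n_true)
```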
The final main object type I want to introduce you to is my favorite: lists! They are kind of like a mash between data.frames and vectors in that they are one dimensional but can have elements of any class.
a_list <- list(packages,dat,chaz)
a_list
## [[1]]
## [1] "haven" "dplyr" "ggplot2" "countrycode" "tidyr"
## [6] "gridExtra" "grid" "stargazer"
##
## [[2]]
## num fac wrong right
## 1 -7.0706575 -7.07065749385421 -7.0706575 -7.0706575
## 2 7.7742924 7.7742924211066 7.7742924 7.7742924
## 3 15.8444118 15.8444117668306 15.8444118 15.8444118
## 4 -18.4569770 -18.4569770262935 -18.4569770 -18.4569770
## 5 9.2912469 9.2912468881105 9.2912469 9.2912469
## 6 10.0605589 10.0605589215757 10.0605589 10.0605589
## 7 -0.7473996 -0.747399601346488 -0.7473996 -0.7473996
## 8 -0.4663186 -0.466318557841871 -0.4663186 -0.4663186
## 9 -0.6445200 -0.64451999093283 -0.6445200 -0.6445200
## 10 -3.9003783 -3.90037829044104 -3.9003783 -3.9003783
## 11 0.2280730 0.22807300246453 0.2280730 0.2280730
## 12 -4.9838644 -4.98386444859704 -4.9838644 -4.9838644
## 13 -2.7625389 -2.7625389463799 -2.7625389 -2.7625389
## 14 5.6445882 5.64458817276269 5.6445882 5.6445882
## 15 14.5949406 14.5949405897077 14.5949406 14.5949406
##
## [[3]]
## [1] "a" "b" "c"
To access their elements we use “double bracket” notation like so:
a_list[[2]]
## num fac wrong right
## 1 -7.0706575 -7.07065749385421 -7.0706575 -7.0706575
## 2 7.7742924 7.7742924211066 7.7742924 7.7742924
## 3 15.8444118 15.8444117668306 15.8444118 15.8444118
## 4 -18.4569770 -18.4569770262935 -18.4569770 -18.4569770
## 5 9.2912469 9.2912468881105 9.2912469 9.2912469
## 6 10.0605589 10.0605589215757 10.0605589 10.0605589
## 7 -0.7473996 -0.747399601346488 -0.7473996 -0.7473996
## 8 -0.4663186 -0.466318557841871 -0.4663186 -0.4663186
## 9 -0.6445200 -0.64451999093283 -0.6445200 -0.6445200
## 10 -3.9003783 -3.90037829044104 -3.9003783 -3.9003783
## 11 0.2280730 0.22807300246453 0.2280730 0.2280730
## 12 -4.9838644 -4.98386444859704 -4.9838644 -4.9838644
## 13 -2.7625389 -2.7625389463799 -2.7625389 -2.7625389
## 14 5.6445882 5.64458817276269 5.6445882 5.6445882
## 15 14.5949406 14.5949405897077 14.5949406 14.5949406
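Lists become easier to work with once their elements have names. The following sketch uses a made-up list (named_list is a hypothetical name) to show access by name and the difference between single and double brackets: single brackets return a smaller list, while double brackets return the element itself.

```r
# A named list with elements of different classes (illustrative)
named_list <- list(letters3 = c("a", "b", "c"),
                   numbers  = 1:3)

# Access by name with $ or with double brackets
nums <- named_list$numbers
second_letter <- named_list[["letters3"]][2]

# Single brackets keep the list wrapper
still_a_list <- named_list["numbers"]

print(class(nums))
print(class(still_a_list))
```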
By the way – if you’d like to install only a single package you might do something like:
install.packages("gamlss")
library(gamlss)
Now that we have those basics in our heads we can start putting R to use.
We will be using data from Peterson (2017): Export Diversity and Human Rights. You can download the replication archive by clicking here or download the data directly by running the following chunk.
d <- read_dta("https://www.dropbox.com/s/st8ugyfld4se1a5/JCR_final.dta?dl=1")
d
## # A tibble: 5,188 × 18
## ccode year twoway inhhi comper polity2 physint conflictonlocation lnpop
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 1981 0.986 0.990 0.997 10 8 0 19.3
## 2 2 1982 0.988 0.989 0.998 10 8 0 19.3
## 3 2 1983 0.982 0.989 0.994 10 8 1 19.3
## 4 2 1984 0.987 0.987 1 10 8 0 19.3
## 5 2 1985 0.985 0.987 0.998 10 7 0 19.3
## 6 2 1986 0.980 0.987 0.994 10 7 0 19.3
## 7 2 1987 0.982 0.987 0.995 10 8 0 19.3
## 8 2 1988 0.977 0.988 0.989 10 7 0 19.3
## 9 2 1989 0.980 0.988 0.992 10 7 1 19.3
## 10 2 1990 0.983 0.988 0.995 10 8 0 19.4
## # ℹ 5,178 more rows
## # ℹ 9 more variables: lngdppc <dbl>, gdppc <dbl>, expdep <dbl>,
## # gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>
The read_dta function reads Stata files and comes from the haven package, which is useful for reading in datasets from other statistical software like SPSS, Stata, or SAS.
Usually you’ll be loading data from your computer rather than from a link. For this it is important to get a feel for how file paths work on your computer and how to use working directories.
To check your working directory, you can run:
getwd()
## [1] "/Users/aliaelkattan/Documents/1- PhD/Honors Thesis"
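To introduce you quickly to a few useful functions, let’s have R create a new folder inside the working directory, move into it, write the data out as a CSV file, and list the folder’s contents: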
dir <- getwd()
path <- paste(dir,"example_folder",sep="/")
dir.create(path)
setwd(path)
write.csv(d,"peterson_2017.csv",row.names = F)
list.files()
## [1] "peterson_2017.csv"
Boom, there it is! Now if we wanted to read in the data we could:
dat_path <- paste(path,"peterson_2017.csv",sep="/")
dat <- read.csv(dat_path)
And boom, there it is. Now let’s remove just that object:
rm(dat)
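If you would rather not change your working directory, one alternative is to build full paths and pass them straight to the read and write functions. The sketch below is only an illustration: it uses a made-up data.frame and R’s temporary directory so that it runs anywhere without touching your files, and file.path is used in place of paste to join path pieces.

```r
# Build a folder path inside the session's temporary directory
tmp_dir <- file.path(tempdir(), "example_folder")
dir.create(tmp_dir, showWarnings = FALSE)

# Illustrative data (not the Peterson data)
toy <- data.frame(id = 1:3, value = c(2.5, NA, 7))

# Write and read using the full path; the working directory is untouched
csv_path <- file.path(tmp_dir, "toy.csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_back <- read.csv(csv_path)

print(list.files(tmp_dir))
print(toy_back)
```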
Finally, before we get our hands dirty, let’s see how to take a first look at our data:
summary(d)
## ccode year twoway inhhi
## Min. : 2.0 Min. :1981 Min. :0.0000 Min. :0.0000
## 1st Qu.:232.0 1st Qu.:1989 1st Qu.:0.1989 1st Qu.:0.6856
## Median :450.0 Median :1996 Median :0.4499 Median :0.8636
## Mean :457.3 Mean :1996 Mean :0.4855 Mean :0.7784
## 3rd Qu.:670.0 3rd Qu.:2003 3rd Qu.:0.7919 3rd Qu.:0.9545
## Max. :990.0 Max. :2010 Max. :0.9882 Max. :0.9932
##
## comper polity2 physint conflictonlocation
## Min. :0.00161 Min. :-10.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.32156 1st Qu.: -6.000 1st Qu.:3.000 1st Qu.:0.0000
## Median :0.58347 Median : 4.000 Median :5.000 Median :0.0000
## Mean :0.57874 Mean : 1.746 Mean :4.838 Mean :0.1714
## 3rd Qu.:0.86127 3rd Qu.: 9.000 3rd Qu.:7.000 3rd Qu.:0.0000
## Max. :1.00000 Max. : 10.000 Max. :8.000 Max. :1.0000
## NA's :808 NA's :701
## lnpop lngdppc gdppc expdep
## Min. :10.61 Min. : 4.889 Min. : 132.8 Min. :0.0007
## 1st Qu.:14.67 1st Qu.: 7.455 1st Qu.: 1729.2 1st Qu.:0.0581
## Median :15.84 Median : 8.507 Median : 4951.4 Median :0.1164
## Mean :15.67 Mean : 8.473 Mean : 9711.5 Mean :0.1684
## 3rd Qu.:16.90 3rd Qu.: 9.453 3rd Qu.: 12743.0 3rd Qu.:0.2103
## Max. :21.00 Max. :11.541 Max. :102804.8 Max. :3.7119
## NA's :731 NA's :731 NA's :731 NA's :731
## gdpgrowth lib_HK meanarab meanpop
## Min. :-0.6532 Min. :-0.0142 Min. : 0.000 Min. : 0.1193
## 1st Qu.:-0.0032 1st Qu.:-0.0016 1st Qu.: 7.661 1st Qu.: 6.8294
## Median : 0.0347 Median :-0.0002 Median :13.599 Median : 17.0428
## Mean : 0.0377 Mean : 0.0002 Mean :15.963 Mean : 62.5552
## 3rd Qu.: 0.0742 3rd Qu.: 0.0016 3rd Qu.:24.007 3rd Qu.: 34.3939
## Max. : 1.8620 Max. : 0.0572 Max. :73.389 Max. :1216.1071
## NA's :762 NA's :1059 NA's :1209 NA's :1186
## sdarab_manual sdpop_manual
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 1.423 1st Qu.: 0.6023
## Median : 6.301 Median : 8.5648
## Mean : 7.074 Mean : 40.1050
## 3rd Qu.:11.619 3rd Qu.: 26.2434
## Max. :30.083 Max. :629.2051
## NA's :1209 NA's :1186
Of particular importance are the NA counts representing missing data. These are not only important for getting a better sense of your data, but also useful for alerting you to how functions like mean and sum react to missing data.
c(mean(d$polity2),
sum(d$polity2))
## [1] NA NA
Checking the documentation with ?mean or help(mean), you’ll note that the argument na.rm defaults to FALSE. To compute these things omitting the missing values, you would specify:
c(mean(d$polity2, na.rm=T),
sum(d$polity2, na.rm=TRUE))
## [1] 1.745662 7646.000000
where either T or TRUE can be used to indicate, well, true (TRUE is the safer habit, since T is an ordinary variable that can be overwritten).
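A few base R helpers make it easy to inspect missingness directly. The sketch below uses made-up objects (x and toy are hypothetical) to count missing values with is.na, tally them by column, and flag rows with no missing values.

```r
# A vector with missing values (illustrative)
x <- c(4, NA, 10, NA, 1)

n_missing <- sum(is.na(x))          # how many values are missing
mean_obs  <- mean(x, na.rm = TRUE)  # mean of the observed values only

# The same idea applied column by column to a data.frame
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
missing_by_col <- colSums(is.na(toy))

# Rows with no missing values at all
complete_rows <- complete.cases(toy)

print(n_missing)
print(mean_obs)
print(missing_by_col)
```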
When dealing with data, especially text data, certain data wrangling skills are important. Perhaps the most basic task you’ll need to know how to do is select cases and subset data. As with most things in R, there are multiple ways of accomplishing the same goal (base R vs packages, etc).
To get the indices which satisfy logical statements you can use the which function:
which(d$gdppc > 50000)
## [1] 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 2116 2117 2118
## [16] 2119 2120 2121 2122 2123 2124 2125 2126 3981 3982 3983 3984 3985 3986 4017
## [31] 4018 4019 4020 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046
## [46] 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816
## [61] 4817 4818 4820 4821 4823 4824 4825 4826 4827 4828 4830
which(d$gdppc < 2000 & d$polity2 == 10)
## [1] 4277 4278 4279 4280
which(d$gdpgrowth < -.5 | d$gdppc < 100)
## [1] 3040 3369 3717 3805 3967
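One reason to wrap a condition in which is how it treats missing values. In the sketch below (toy is a made-up data.frame), indexing with the raw logical vector returns an extra all-NA row for the missing comparison, whereas which silently drops it.

```r
# Illustrative data with a missing value
toy <- data.frame(id = 1:3, gdp = c(100, NA, 60000))

# The comparison itself is NA wherever gdp is NA
cond <- toy$gdp > 50000
print(cond)

# Logical indexing keeps a row of NAs for the NA comparison
by_logical <- toy[cond, ]

# which() returns only the positions that are TRUE
by_which <- toy[which(cond), ]

print(nrow(by_logical))
print(nrow(by_which))
```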
We can combine this with indexing to subset down the data. We can also call columns in a variety of ways. Remember that you can create objects carrying this information to modularize your code, which might be helpful in particular situations to keep everything clear.
inds <- which(d$gdppc > 50000)
cols <- c("ccode","year","physint","lnpop")
sub1 <- d[inds,cols]
sub1
## # A tibble: 71 × 4
## ccode year physint lnpop
## <dbl> <dbl> <dbl> <dbl>
## 1 212 1999 8 13.0
## 2 212 2000 8 13.0
## 3 212 2001 8 13.0
## 4 212 2002 8 13.0
## 5 212 2003 8 13.0
## 6 212 2004 8 13.0
## 7 212 2005 8 13.0
## 8 212 2006 8 13.1
## 9 212 2007 8 13.1
## 10 212 2008 8 13.1
## # ℹ 61 more rows
Alternatively, one could use the subset function from base R to get the same result.
sub2 <- subset(d,d$gdppc > 50000,cols)
sub2
## # A tibble: 71 × 4
## ccode year physint lnpop
## <dbl> <dbl> <dbl> <dbl>
## 1 212 1999 8 13.0
## 2 212 2000 8 13.0
## 3 212 2001 8 13.0
## 4 212 2002 8 13.0
## 5 212 2003 8 13.0
## 6 212 2004 8 13.0
## 7 212 2005 8 13.0
## 8 212 2006 8 13.1
## 9 212 2007 8 13.1
## 10 212 2008 8 13.1
## # ℹ 61 more rows
identical(sub1,sub2)
## [1] TRUE
Another great option is the dplyr package, which is part of the tidyverse alongside the packages haven and ggplot2. One of the best things about the tidyverse family of packages is that they are very well documented, including a variety of cheatsheets and books. One thing that makes the wrangling tools particularly powerful is that they leverage a pipe (%>%) from the magrittr package which says, in pseudo-code, that x %>% f(y) is the same as f(x,y). This can create a nice workflow. For example, to get the same subset yet again:
d %>% filter(gdppc > 50000) %>% dplyr::select(all_of(cols)) -> sub3
all(sub1 == sub3, na.rm=T)
## [1] TRUE
Neat. What we did was take our dataframe, filter the rows that we wanted, and then select the columns of interest.
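The piping idea can be tried without loading any packages, assuming R 4.1 or later, which ships a native pipe written |>. It is not the magrittr pipe used in this lab, but for simple cases like the sketch below it behaves the same way: the left-hand side becomes the first argument of the function on the right.

```r
x <- c(1, 4, 9)

# Nested calls read inside-out
nested <- sum(sqrt(x))

# Piped calls read left to right (native pipe, assumes R >= 4.1)
piped <- x |> sqrt() |> sum()

# Extra arguments stay where they are: this is round(3.14159, 2)
rounded <- 3.14159 |> round(2)

print(piped)
print(rounded)
```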
If we want to sort data, there is a base R approach for vectors.
head(sort(d$gdpgrowth))
## [1] -0.6532034 -0.6011391 -0.5537450 -0.5498320 -0.5225903 -0.4939937
For dataframes you have to use order, which produces index numbers that can be used as before:
d[order(d$gdpgrowth),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 × 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 645 1991 -0.653
## 2 572 2003 -0.601
## 3 517 1994 -0.554
## 4 690 1991 -0.550
## 5 660 1989 -0.523
## 6 92 2007 -0.494
## 7 475 1986 -0.480
## 8 450 1990 -0.475
## 9 373 1993 -0.440
## 10 411 1990 -0.423
## # ℹ 5,178 more rows
We can also switch the ordering around by setting decreasing = T:
d[order(d$gdpgrowth,decreasing = T),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 × 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 690 1992 1.86
## 2 572 2005 1.48
## 3 450 1997 1.39
## 4 411 1997 1.38
## 5 92 2009 1.13
## 6 411 2002 0.874
## 7 345 1996 0.854
## 8 411 1999 0.788
## 9 552 2010 0.752
## 10 411 1992 0.731
## # ℹ 5,178 more rows
Or we could use the handy %>%. In this case we have to use the placeholder . for the input, which is handy to know about for more complicated functions.
order(d$gdpgrowth,decreasing = T) %>%
d[.,c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 × 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 690 1992 1.86
## 2 572 2005 1.48
## 3 450 1997 1.39
## 4 411 1997 1.38
## 5 92 2009 1.13
## 6 411 2002 0.874
## 7 345 1996 0.854
## 8 411 1999 0.788
## 9 552 2010 0.752
## 10 411 1992 0.731
## # ℹ 5,178 more rows
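The order function also accepts several keys, which is useful for sorting within groups. As a sketch on made-up data (toy and its columns are hypothetical), the example below sorts by region alphabetically and then by growth from largest to smallest within each region, using a minus sign to reverse a numeric key.

```r
# Illustrative data
toy <- data.frame(region = c("b", "a", "b", "a"),
                  growth = c(0.10, 0.30, -0.20, 0.05))

# First key: region ascending; second key: growth descending
idx <- order(toy$region, -toy$growth)
sorted <- toy[idx, ]

print(idx)
print(sorted)
```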
Another basic task you’ll want to know how to do is merge datasets together. You may have noticed that the ccode variable isn’t particularly descriptive about which country it refers to. At the start we loaded in the countrycode package, which contains additional information.
codes <- countrycode::codelist_panel
Let’s see what they have.
colnames(codes)
## [1] "country.name.en" "year"
## [3] "ar5" "cctld"
## [5] "continent" "country.name.de"
## [7] "country.name.de.regex" "country.name.en.regex"
## [9] "country.name.fr" "country.name.fr.regex"
## [11] "country.name.it" "country.name.it.regex"
## [13] "cowc" "cown"
## [15] "currency" "dhs"
## [17] "ecb" "eu28"
## [19] "eurocontrol_pru" "eurocontrol_statfor"
## [21] "eurostat" "fao"
## [23] "fips" "gaul"
## [25] "genc2c" "genc3c"
## [27] "genc3n" "gwc"
## [29] "gwn" "icao.region"
## [31] "imf" "ioc"
## [33] "iso2c" "iso3c"
## [35] "iso3n" "iso4217c"
## [37] "iso4217n" "p4c"
## [39] "p4n" "p5c"
## [41] "p5n" "region"
## [43] "region23" "un"
## [45] "un.region.code" "un.regionintermediate.code"
## [47] "un.regionsub.code" "unhcr"
## [49] "unhcr.region" "unicode.symbol"
## [51] "unpd" "vdem"
## [53] "wb" "wb_api2c"
## [55] "wb_api3c" "wvs"
The country codes we are currently using are cown. Let’s grab iso3c, country.name.en, and region to add to the dataset. We also know that the dataset we are working with only has years from 1981 to 2010, so let’s practice our subsetting skillz:
codes <- codes[codes$year %in% 1981:2010,c("cown","year","iso3c","country.name.en","region")]
One thing to pay attention to is losing or gaining observations during a merge. For a great overview, check out this handy NYU Data Services guide.
nrow(d)
## [1] 5188
out1 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"))
nrow(out1)
## [1] 5125
out2 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.x=T)
nrow(out2)
## [1] 5188
out3 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.y=T)
nrow(out3)
## [1] 5621
out4 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all=T)
nrow(out4)
## [1] 5684
And, of course, we can do the same merges using dplyr with inner_join, left_join, right_join, and full_join respectively. Going forward we will keep out2 as the working dataset.
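The way row counts change across the four merge types is easiest to see on tiny inputs. The sketch below uses two made-up data.frames (left_df and right_df are hypothetical) that share only some ids, and counts the rows each type of merge returns.

```r
# Two small tables sharing ids 2 and 3 (illustrative data)
left_df  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_df <- data.frame(id = c(2, 3, 4, 5), y = c(TRUE, FALSE, TRUE, TRUE))

n_inner <- nrow(merge(left_df, right_df, by = "id"))               # matches only
n_left  <- nrow(merge(left_df, right_df, by = "id", all.x = TRUE)) # all of left_df
n_right <- nrow(merge(left_df, right_df, by = "id", all.y = TRUE)) # all of right_df
n_full  <- nrow(merge(left_df, right_df, by = "id", all = TRUE))   # everything

print(c(inner = n_inner, left = n_left, right = n_right, full = n_full))
```

Comparing these counts with nrow of the inputs, as done above for d, is a cheap check that a merge did what you expected.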
Another basic task you’ll want to know how to do is calculate aggregates and summaries. There are a number of great things you can do with the apply family of functions, including easily going in parallel with the pbapply package. If you are interested in more details on this you should check out this tutorial and this taskview. We will focus on using dplyr to calculate summaries of interest.
One reason for this is that it is super easy to calculate summaries grouping on another variable. For example, if we wanted to think about regional variation in gdppc we could:
out2 %>%
group_by(region) %>%
summarize(mean=mean(gdppc,na.rm=T),
sd=sd(gdppc,na.rm=T),
sum=sum(gdppc,na.rm=T))
## # A tibble: 8 × 4
## region mean sd sum
## <chr> <dbl> <dbl> <dbl>
## 1 East Asia & Pacific 13583. 15207. 6071507.
## 2 Europe & Central Asia 16667. 11640. 18734205.
## 3 Latin America & Caribbean 7137. 4608. 6194691.
## 4 Middle East & North Africa 13735. 17320. 6881093.
## 5 North America 32391. 5598. 1943448.
## 6 South Asia 2619. 2070. 549891.
## 7 Sub-Saharan Africa 2048. 2652. 2523069.
## 8 <NA> 25760. 24903. 386395.
We can also use the mutate function to add this information to our dataframe. In base R this would take merging the output of aggregate, so it can certainly be done, but dplyr makes it somewhat more straightforward and scalable.
out2 %>%
group_by(region) %>%
mutate(mean_gdppc=mean(gdppc,na.rm=T),
sd_gdppc=sd(gdppc,na.rm=T)) -> out2
out2
## # A tibble: 5,188 × 23
## # Groups: region [8]
## ccode year twoway inhhi comper polity2 physint conflictonlocation lnpop
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 1981 0.986 0.990 0.997 10 8 0 19.3
## 2 2 1982 0.988 0.989 0.998 10 8 0 19.3
## 3 2 1983 0.982 0.989 0.994 10 8 1 19.3
## 4 2 1984 0.987 0.987 1 10 8 0 19.3
## 5 2 1985 0.985 0.987 0.998 10 7 0 19.3
## 6 2 1986 0.980 0.987 0.994 10 7 0 19.3
## 7 2 1987 0.982 0.987 0.995 10 8 0 19.3
## 8 2 1988 0.977 0.988 0.989 10 7 0 19.3
## 9 2 1989 0.980 0.988 0.992 10 7 1 19.3
## 10 2 1990 0.983 0.988 0.995 10 8 0 19.4
## # ℹ 5,178 more rows
## # ℹ 14 more variables: lngdppc <dbl>, gdppc <dbl>, expdep <dbl>,
## # gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## # country.name.en <chr>, region <chr>, mean_gdppc <dbl>, sd_gdppc <dbl>
A base R version of the above might be
a1 <- aggregate(out2$gdppc,by=list(out2$region),mean,na.rm=T)
a1
## Group.1 x
## 1 East Asia & Pacific 13582.789
## 2 Europe & Central Asia 16667.442
## 3 Latin America & Caribbean 7136.741
## 4 Middle East & North Africa 13734.717
## 5 North America 32390.804
## 6 South Asia 2618.530
## 7 Sub-Saharan Africa 2047.946
colnames(a1) <- c("region","mean_gdppc")
a2 <- aggregate(out2$gdppc,by=list(out2$region),sd,na.rm=T)
colnames(a2) <- c("region","sd_gdppc")
t1 <- merge(out2,a1,by="region")
t2 <- merge(t1,a2,by="region")
tibble::as_tibble(t2)
## # A tibble: 5,125 × 25
## region ccode year twoway inhhi comper polity2 physint conflictonlocation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 East Asia… 712 1995 0.187 0.781 0.240 9 6 0
## 2 East Asia… 710 2009 0.974 0.976 0.998 -7 0 0
## 3 East Asia… 712 1994 0.218 0.858 0.254 9 7 0
## 4 East Asia… 712 2003 0.293 0.802 0.366 10 6 0
## 5 East Asia… 712 1996 0.222 0.747 0.297 10 7 0
## 6 East Asia… 712 1997 0.248 0.760 0.326 10 7 0
## 7 East Asia… 712 2002 0.294 0.796 0.369 10 7 0
## 8 East Asia… 740 1983 0.940 0.963 0.976 10 8 0
## 9 East Asia… 712 2004 0.335 0.782 0.428 10 5 0
## 10 East Asia… 712 1991 0.192 0.873 0.220 2 8 0
## # ℹ 5,115 more rows
## # ℹ 16 more variables: lnpop <dbl>, lngdppc <dbl>, gdppc <dbl>, expdep <dbl>,
## # gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## # country.name.en <chr>, mean_gdppc.x <dbl>, sd_gdppc.x <dbl>,
## # mean_gdppc.y <dbl>, sd_gdppc.y <dbl>
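but the dplyr approach really is quite nice. Note that the base R result has only 5,125 rows, because merge drops the rows whose region is NA, and that the .x and .y suffixes appear because out2 already carried the mean_gdppc and sd_gdppc columns we added with mutate.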
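For completeness, base R can also attach a group-level summary to every row without a merge by using the ave function. The sketch below runs on made-up data (toy is hypothetical, with column names borrowed from the example above) and is meant as an illustration rather than a replacement for the dplyr code.

```r
# Illustrative data with one missing value
toy <- data.frame(region = c("a", "a", "b", "b", "b"),
                  gdppc  = c(1, 3, NA, 4, 8))

# ave() applies FUN within each group and returns a vector
# aligned with the original rows
toy$mean_gdppc <- ave(toy$gdppc, toy$region,
                      FUN = function(z) mean(z, na.rm = TRUE))

print(toy)
```

Note that FUN has to be passed by name, since any unnamed arguments after the first are treated as grouping variables.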
We will focus on using ggplot2 for graphics in R, although base R has nice capabilities on its own. ggplot is all about the ‘grammar of graphics’, which follows a layered approach to describe and construct graphics in a structured manner. To begin, we will always initialize a plot:
p1 <- ggplot(out2[which(out2$region == "North America"),], aes(x=log(gdppc)))
To get different plots, we will add layers. For example, if we wanted a dot plot
p1 + geom_dotplot(binwidth=0.1)
or a histogram
p1 + geom_histogram(binwidth=0.1)
or a density plot
p1 + geom_density()
we can just add a different layer to the same underlying plot.
Layers are drawn in the order they are added, but labels, themes, and other settings can be added in any order, and there are a bunch more customizations available.
p1 + geom_histogram(color="red",fill="red",binwidth = 0.03) +
xlab("Natural Log of Per Capita GDP") +
ylab("Frequency") +
ggtitle('North American GDPPC') +
theme_bw() -> g1
g1
You can also add multiple geometries to the same underlying plot.
p2 <- ggplot(out2[which(out2$region == "South Asia"),],aes(x=year,y=log(gdppc),color=iso3c))
p2 + geom_point(na.rm=T) +
geom_line(na.rm=T) +
labs(color="Country") +
scale_color_brewer(palette="Spectral") -> g2
g2
You can even add some smoothers if you want.
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
geom_smooth(color ="gray", method = "lm", se = TRUE,na.rm=T, formula=y~x)
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
geom_smooth(color ="gray", method = "loess", se = TRUE,formula=y~x, na.rm=T) -> g3
g3
Two last notes on plots – faceting and adding plots together into a larger image.
Faceting can be a nice way to break up a continuous variable by category.
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Europe & Central Asia","Middle East & North Africa")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) +
facet_grid(region ~ .)
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Europe & Central Asia","Middle East & North Africa")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) +
facet_grid(. ~ region)
Once we do all that, we might want to add multiple plots together into a larger multi-panel graphic. The gridExtra package is great for this.
grid.arrange(g1,g2,g3,textGrob("Spiffy!"),ncol=2,nrow=2)