Setting up

# Set your working directory
setwd("/Users/corybelden/Documents/my-stuff/PAMR")

# Four libraries needed today (plus "base" package)

library(MASS)
library(foreign)
library(plyr)
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2

Becoming a Troubleshooting Master

Troubleshooting is the most important skill in applied programming.

A few tips to start:

• How do you get better at stuff like lists? PRACTICE, PRACTICE, PRACTICE.
• Load only the packages you need, in order of how often you use them in the script.
• Read the R package “vignettes” to familiarize yourself.
• Check the “Help” file, using “?” for a function you know by name or “??” to search all help pages.
• Start simple when you de-bug (e.g., take a slice of your data to work with).
• Check your code line by line when you’re de-bugging.
• Ask for help, but only when you’ve exhausted the list of possible checks.
• Become an expert in Stack Overflow and Google (so much information out there!).

Troubleshooting checklist:

• Specify your syntax with the double colon (package::function), so R knows which package a function comes from.
• Check your R/RStudio/package versions and reinstall if you have old versions.
• Restart your session (don’t save workspace!).
• Use a debugger (for more advanced programming).
• Get familiar with errors and warnings.
• Warnings mean something might be wrong, but R carried out the task anyway.
• “subscript out of bounds” means the element does not exist.
• “replacement has length zero”: the value you are assigning is empty. The similar “replacement has X rows, data has Y” means your variables are not the same length of obs.
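
To get familiar with these messages, it can help to trigger them on purpose. The sketch below is only an illustration: the objects `small_list` and `x` are made up, and `tryCatch()` is used to capture each message as text instead of stopping the script.

```r
# A warning: R carries out the task (here, returning NA) but flags a problem
warn_msg <- tryCatch(as.numeric("abc"),
                     warning = function(w) conditionMessage(w))

# "subscript out of bounds": asking for an element that does not exist
small_list <- list(a = 1, b = 2)
oob_msg <- tryCatch(small_list[[5]],
                    error = function(e) conditionMessage(e))

# "replacement has length zero": the right-hand side is empty
x <- c(1, 2, 3)
zero_msg <- tryCatch({
  x[2] <- numeric(0)
  "no error"
}, error = function(e) conditionMessage(e))

print(warn_msg)
print(oob_msg)
print(zero_msg)
```

Reading the captured text next to the line that caused it is a quick way to learn what each message is pointing at.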

# Example of specifying syntax
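
A minimal sketch of the double-colon syntax, using only functions that ship with R. The vector `values` is made up for illustration; the same pattern applies to any installed package (for example, `plyr::rbind.fill`).

```r
# Hypothetical data for illustration
values <- c(4, 8, 15, 16, 23, 42)

# package::function makes explicit which package each function comes from
mean_explicit <- base::mean(values)
median_explicit <- stats::median(values)

print(mean_explicit)
print(median_explicit)
```

This matters most when two loaded packages define a function with the same name, because R otherwise uses whichever package was attached last.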

# Checking what's loaded and versions
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] data.table_1.10.4-2 plyr_1.8.4          foreign_0.8-69
## [4] MASS_7.3-47
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12    digest_0.6.12   rprojroot_1.2   backports_1.1.1
##  [5] magrittr_1.5    evaluate_0.10.1 stringi_1.1.5   curl_2.8.1
##  [9] xml2_1.1.1      rmarkdown_1.6   tools_3.4.1     stringr_1.2.0
## [13] yaml_2.1.14     compiler_3.4.1  htmltools_0.3.6 knitr_1.17

Some review on objects in R:

Matrix objects are a nice way of considering data in R since our datasets are essentially large matrices! They also help us to connect what we know about matrices to thinking about network data. Recall that we can create matrix objects in a number of ways.

Three main ways of creating a matrix object:

• Use vector objects to create a matrix.
• Use matrix() rules: similar to above but we can fill the matrix with a single scalar.
• Combine multiple vectors as a matrix using “rbind” or “cbind”. Remember that length matters!
#  Vectors
vector_1 <- c(1,2,3)
matrix_1 <- matrix(vector_1, nrow=3, ncol=2, byrow=F)
matrix_1
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
# Matrix rules
matrix_2 <- matrix(1,3,2)
matrix_2
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1
# Combination (using rbind)
row_1 <- c(1, 2, 3)
row_2 <- c(4, 5, 6)
matrix_3 <- rbind(row_1, row_2)
matrix_3
##       [,1] [,2] [,3]
## row_1    1    2    3
## row_2    4    5    6
# Combination (using cbind)
column_1 <- c(1, 2)
column_2 <- c(3, 4)
matrix_4 <- cbind(column_1, column_2)
matrix_4
##      column_1 column_2
## [1,]        1        3
## [2,]        2        4
# If we ever forget the dimensions of our matrix, we can ask R for help
dim(matrix_3)
## [1] 2 3
# Remember, we can turn matrices into dataframes!
my_matrix <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=F)
my_matrix
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
class(my_matrix)
## [1] "matrix"
my_dataframe <- as.data.frame(my_matrix)
my_dataframe
##   V1 V2
## 1  1  4
## 2  2  5
## 3  3  6
class(my_dataframe)
## [1] "data.frame"
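
On "length matters": when the vectors passed to rbind() or cbind() differ in length, R recycles the shorter one and issues a warning rather than stopping. A small sketch with made-up vectors (`short_row` and `long_row` are hypothetical names):

```r
short_row <- c(1, 2)
long_row <- c(3, 4, 5)

# Capture the warning text that rbind() raises for mismatched lengths
recycle_warning <- tryCatch(rbind(short_row, long_row),
                            warning = function(w) conditionMessage(w))

# The matrix is still built: short_row is recycled to fill the third column
recycled <- suppressWarnings(rbind(short_row, long_row))

print(recycle_warning)
print(recycled)
dim(recycled)
```

Because the result looks like a valid matrix, this kind of mistake is easy to miss unless you read the warning or check dim().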

Simple Steps Dealing with Real Data

There’s a myth that we can’t manage or manipulate data with the same ease in R as we can in Stata. To some extent, this is true. However, most of the more common functions can be performed in R with relative ease (emphasizing “relative” since managing data is never easy). I’m going to cover some BASIC tools we can use to manipulate data in R. Later, we can cover more complex approaches.

# Read data
qog <- read.csv("dataPersonal/qog_std_cs_jan18.csv", header=T)

We might want to look more closely at our data or at our data’s variable names. The latter is especially useful if you’re relying on someone else’s data and codebook. Replication data, for example, often does not include the full set of variables.

# Summary
# summary(qog) -- commented out because the output is so long
head(names(qog))
## [1] "ccode"    "cname"    "ccodealp" "ccodecow" "ccodewb"  "version"

Note that these commands tell us about the variable names, but not about our actual data. Below are a number of manipulations we may want to make to this existing master data. But first, let’s save this as our sample data and work from there so we don’t overwrite the primary data.

# Write data to disk
write.csv(qog, "new_qog.csv")

# Work from a copy so the primary data stay untouched
new_qog <- qog

# Create a new variable
new_qog$random_var <- 1
new_qog$random_var
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Transform existing variable
new_qog$wvs_trust_rescale <- new_qog$wvs_trust/2

# Compare original and new
new_qog$wvs_trust
##   [1]         NA         NA 0.17928633         NA         NA         NA
##   [7] 0.15280272 0.19854753 0.51814008         NA         NA 0.34183672
##  [13]         NA 0.11022963         NA         NA         NA         NA
##  [19]         NA         NA 0.07116177         NA         NA         NA
##  [25]         NA         NA         NA 0.35224298         NA         NA
##  [31]         NA         NA         NA         NA         NA 0.12770340
##  [37] 0.63131458 0.30887952 0.04130579         NA         NA         NA
##  [43]         NA         NA         NA 0.07605544         NA         NA
##  [49]         NA         NA         NA 0.07166667         NA         NA
##  [55]         NA         NA 0.40074742         NA         NA         NA
##  [61]         NA         NA 0.08885164         NA 0.45346335 0.05030944
##  [67]         NA         NA         NA         NA         NA         NA
##  [73]         NA         NA         NA         NA 0.33223999         NA
##  [79]         NA 0.31971580         NA         NA         NA         NA
##  [85]         NA         NA 0.38332745 0.13249999         NA         NA
##  [91]         NA 0.30000001 0.38041958         NA 0.10905730         NA
##  [97]         NA         NA 0.10643163         NA         NA         NA
## [103]         NA         NA 0.08538461         NA         NA         NA
## [109]         NA         NA 0.12424850         NA         NA         NA
## [115]         NA 0.12531753         NA         NA         NA         NA
## [121]         NA 0.67416936         NA 0.56776559         NA         NA
## [127] 0.15012437         NA         NA         NA         NA 0.23090559
## [133]         NA         NA         NA 0.08466440 0.03167078 0.22705126
## [139]         NA         NA         NA         NA 0.07747335 0.29615441
## [145] 0.16633923         NA         NA         NA         NA         NA
## [151]         NA         NA         NA         NA         NA 0.37390605
## [157]         NA         NA 0.20113315         NA 0.23412248 0.08294227
## [163] 0.19565730         NA         NA         NA         NA 0.61769235
## [169]         NA         NA         NA 0.32491252         NA         NA
## [175] 0.03219316         NA 0.15996578 0.12295500         NA         NA
## [181]         NA 0.24750616         NA 0.21500890         NA         NA
## [187] 0.35139450         NA 0.15248619 0.14092141         NA         NA
## [193] 0.40398741         NA
new_qog$wvs_trust_rescale
##   [1]         NA         NA 0.08964316         NA         NA         NA
##   [7] 0.07640136 0.09927376 0.25907004         NA         NA 0.17091836
##  [13]         NA 0.05511481         NA         NA         NA         NA
##  [19]         NA         NA 0.03558088         NA         NA         NA
##  [25]         NA         NA         NA 0.17612149         NA         NA
##  [31]         NA         NA         NA         NA         NA 0.06385170
##  [37] 0.31565729 0.15443976 0.02065290         NA         NA         NA
##  [43]         NA         NA         NA 0.03802772         NA         NA
##  [49]         NA         NA         NA 0.03583333         NA         NA
##  [55]         NA         NA 0.20037371         NA         NA         NA
##  [61]         NA         NA 0.04442582         NA 0.22673167 0.02515472
##  [67]         NA         NA         NA         NA         NA         NA
##  [73]         NA         NA         NA         NA 0.16611999         NA
##  [79]         NA 0.15985790         NA         NA         NA         NA
##  [85]         NA         NA 0.19166373 0.06625000         NA         NA
##  [91]         NA 0.15000000 0.19020979         NA 0.05452865         NA
##  [97]         NA         NA 0.05321581         NA         NA         NA
## [103]         NA         NA 0.04269231         NA         NA         NA
## [109]         NA         NA 0.06212425         NA         NA         NA
## [115]         NA 0.06265877         NA         NA         NA         NA
## [121]         NA 0.33708468         NA 0.28388280         NA         NA
## [127] 0.07506219         NA         NA         NA         NA 0.11545279
## [133]         NA         NA         NA 0.04233220 0.01583539 0.11352563
## [139]         NA         NA         NA         NA 0.03873667 0.14807720
## [145] 0.08316962         NA         NA         NA         NA         NA
## [151]         NA         NA         NA         NA         NA 0.18695302
## [157]         NA         NA 0.10056658         NA 0.11706124 0.04147113
## [163] 0.09782865         NA         NA         NA         NA 0.30884618
## [169]         NA         NA         NA 0.16245626         NA         NA
## [175] 0.01609658         NA 0.07998289 0.06147750         NA         NA
## [181]         NA 0.12375308         NA 0.10750445         NA         NA
## [187] 0.17569725         NA 0.07624309 0.07046070         NA         NA
## [193] 0.20199370         NA

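Also, note the NAs representing missing data. In Stata, we use a “.” to denote missingness. In R, missing values are shown as NA, and a blank cell in a numeric column of the source file is read in as NA. So if you leave a cell blank instead of entering “0” for a true “0”, R treats the observation as missing.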
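
A small sketch of how blank cells and true zeros are read. The inline CSV text and its column names are made up for illustration; `read.csv(text = ...)` reads from a string instead of a file.

```r
# Hypothetical three-row file: country B has a blank cell, country C a true zero
csv_text <- "country,trust\nA,0.18\nB,\nC,0"
csv_data <- read.csv(text = csv_text)

# The blank cell becomes NA; the zero stays a zero
print(csv_data)
n_missing <- sum(is.na(csv_data$trust))

# Most summary functions return NA unless told to drop missing values
mean_with_na <- mean(csv_data$trust)
mean_without_na <- mean(csv_data$trust, na.rm = TRUE)

print(n_missing)
print(mean_with_na)
print(mean_without_na)
```

Counting NAs with sum(is.na(...)) right after reading data is a cheap check that zeros and blanks were entered as intended.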

# Look at 20 variables only
head(names(qog), 20)
##  [1] "ccode"        "cname"        "ccodealp"     "ccodecow"
##  [5] "ccodewb"      "version"      "aid_cpnc"     "aid_cpsc"
##  [9] "aid_crnc"     "aid_crnio"    "aid_crsc"     "aid_crsio"
## [13] "ajr_settmort" "al_ethnic"    "al_language"  "al_religion"
## [17] "bci_bci"      "bci_bcistd"   "bi_a_total"   "bi_p_total"

Now, onto dropping and keeping variables (or, subsetting). An important comment about this: If dropping variables, first create a new data set. Also, it’s better to use “select” in order to keep variables. That way, we indicate specifically what we plan to use, instead of dropping variables and therefore risking dropping the wrong ones. Unlike Stata, these functions allow us to hold onto both the subset and the original data in the global environment. This is a HUGE advantage in R.

# Selecting your subset:
selected <- c("ccode", "cname", "ajr_settmort")
new_qog_subset <- new_qog[selected]

# Dropping variables to create a subset a different way
dropped <- names(new_qog) %in% c("wr_nonautocracy", "wr_regtype")
new_qog_subset_2 <- new_qog[!dropped]
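
The same keep-versus-drop logic can be tried on a toy data frame before running it on the full data. Everything in the sketch below is made up for illustration (the data frame `toy` and its values); `subset()` with `select =` is a base R alternative to bracket indexing.

```r
# Hypothetical miniature version of the data
toy <- data.frame(ccode = c(1, 2, 3),
                  cname = c("A", "B", "C"),
                  ajr_settmort = c(4.5, NA, 4.4),
                  wr_regtype = c(NA, 1, 2),
                  stringsAsFactors = FALSE)

# Keeping: name exactly what you plan to use
kept_bracket <- toy[c("ccode", "cname")]
kept_subset <- subset(toy, select = c(ccode, cname))

# Dropping: everything except the named columns
dropped_version <- toy[setdiff(names(toy), "wr_regtype")]

names(kept_subset)
names(dropped_version)
# The original is still intact in the environment
names(toy)
```

Keeping by name fails loudly if a column is misspelled, whereas dropping a misspelled name silently drops nothing, which is one reason to prefer keeping.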

Slightly More Complicated Steps Dealing with Real Data

We often want to merge datasets together. This may seem surprising for a dataset with 1883 variables, but usually we’re working with multiple datasets from different sources. Let’s start by making a smaller subset of our “new_qog” data and merging our subsets as if they originated from different sources.

# More complex
selected2 <- c("ccode", "cname", "wvs_trust","wvs_trust_rescale", "random_var")
new_qog_subset_3 <- new_qog[selected2]

# And here, I started getting errors because
# I didn't create a new name for my newly selected data.
# I had to clear my history and start over. DUMMY.

# Merge
working_data <- merge(new_qog_subset, new_qog_subset_3, by=c("ccode", "cname"))

In this case, we don’t lose observations because these subsets came from the same original dataset. In reality, we often have to add some code to avoid dropping observations that don’t merge. We use “all” in the merge function to specify whether we want to drop these observations or not.

# Another merge
working_data <- merge(new_qog_subset, new_qog_subset_3, by=c("ccode", "cname"), all=TRUE)
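
To see what “all” changes, it helps to merge two small data frames that only partly overlap. The data frames `left_data` and `right_data` below are made up for illustration; `all`, `all.x`, and `all.y` are arguments of base R's merge().

```r
# Hypothetical sources: country 1 is only in left_data, country 4 only in right_data
left_data <- data.frame(ccode = c(1, 2, 3),
                        cname = c("A", "B", "C"),
                        settmort = c(4.5, 5.0, 4.4),
                        stringsAsFactors = FALSE)
right_data <- data.frame(ccode = c(2, 3, 4),
                         cname = c("B", "C", "D"),
                         trust = c(0.2, 0.5, 0.1),
                         stringsAsFactors = FALSE)

# Default: keep only observations found in both sources
matched_only <- merge(left_data, right_data, by = c("ccode", "cname"))

# all = TRUE: keep everything, filling gaps with NA
keep_all <- merge(left_data, right_data, by = c("ccode", "cname"), all = TRUE)

# all.x = TRUE: keep every row of the first dataset only
keep_left <- merge(left_data, right_data, by = c("ccode", "cname"), all.x = TRUE)

nrow(matched_only)
nrow(keep_all)
nrow(keep_left)
```

Comparing nrow() before and after a merge is a quick way to notice observations that failed to match.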

Appending data is very simple, assuming the data are formatted correctly (i.e., each dataset has the same number of columns and the same column names). The “rbind()” command from the “base” package will suffice. If the datasets do not share the same columns, we can use “rbind.fill()” from the “plyr” package, which fills the missing columns with NAs.

# Using rbind.fill when column names are different
tog_qog_subsets <- rbind.fill(new_qog_subset, new_qog_subset_3)
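
To make the idea of filling with NAs concrete, here is a base R sketch that mimics the behavior on two tiny data frames. It is not how plyr implements rbind.fill(); the helper `fill_missing()` and the data frames `first_part` and `second_part` are hypothetical names introduced for this illustration only.

```r
# Hypothetical data frames with only one column in common
first_part <- data.frame(ccode = c(1, 2), trust = c(0.2, 0.4))
second_part <- data.frame(ccode = c(3, 4), settmort = c(5.1, 6.3))

# Hypothetical helper: add any absent columns as NA, then order columns consistently
fill_missing <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- NA
  }
  df[cols]
}

all_cols <- union(names(first_part), names(second_part))

# Once both pieces have identical columns, plain rbind() works
stacked <- rbind(fill_missing(first_part, all_cols),
                 fill_missing(second_part, all_cols))
print(stacked)
```

With real data, rbind.fill() does this in one call; the sketch only shows where the NAs in its result come from.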

Finally, we may want to turn time series data (i.e. “long data”, with one row per country-year) like the QOG time series data into “wide data” (with one row per country).

# Let's take the time series data from QOG (qog_ts) and try to reshape it from long to wide

# This is a big dataset, so it takes a while!
wide_qog <- reshape(qog_ts,
                    # time variable
                    timevar = "year",
                    # variables not to change
                    idvar = "ccode",
                    # unit-year observations
                    new.row.names = 1:211, # NULL is trick!
                    # direction of reshape
                    direction = "wide")

# Subset to the variable you're interested in
year <- as.vector(qog_ts$year)
country <- as.vector(qog_ts$cname)
regtype <- as.vector(qog_ts$ht_regtype)
# Build the data frame directly: cbind() would coerce every column to character
regtype_all <- data.frame(country, year, regtype, stringsAsFactors = FALSE)
regtype_all$regtype <- as.numeric(regtype_all$regtype)

# This is a big dataset, so it takes a while!
wide_regtype_qog <- reshape(regtype_all,
                            # time variable
                            timevar = "year",
                            # variables not to change
                            idvar = "country",
                            # unit-year observations
                            new.row.names = 1:211, # NULL is trick!
                            # direction of reshape
                            direction = "wide")

# Now we can sum together by country (every column except the first is a year)
wide_regtype_qog$total <- rowSums(wide_regtype_qog[, -1], na.rm = TRUE)
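
Because the full reshape is slow, it can be worth checking the call on a toy long dataset first. The data frame `long_data` and its values are made up for illustration; the reshape() arguments are the same ones used above.

```r
# Hypothetical long data: two countries observed over three years
long_data <- data.frame(country = rep(c("A", "B"), each = 3),
                        year = rep(2001:2003, times = 2),
                        regtype = c(1, 1, 2, 3, 3, 3),
                        stringsAsFactors = FALSE)

# Long to wide: one row per country, one regtype column per year
wide_data <- reshape(long_data,
                     timevar = "year",
                     idvar = "country",
                     direction = "wide")
print(wide_data)

# Row-wise sum across the year columns (everything except the id column)
row_totals <- unname(rowSums(wide_data[, -1], na.rm = TRUE))
print(row_totals)
```

The new columns are named by pasting the variable name and the time value together, which is worth knowing when selecting year columns afterwards.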