Programming & Methods Resources Workshop

Setting up

# Set your working directory
setwd("/Users/corybelden/Documents/my-stuff/PAMR")

# Four libraries needed today (plus the "base" package)

library(MASS)
library(foreign)
library(plyr)
library(data.table)

## Warning: package 'data.table' was built under R version 3.4.2

Becoming a Troubleshooting Master

Troubleshooting is the most important skill in applied programming.

A few tips to start:

How do you get better at stuff like lists? PRACTICE, PRACTICE, PRACTICE.
Load only the packages you need, in order of how often you use them in the script.
Read the R package “vignettes” to familiarize yourself.
Check the “Help” file, using “?” for a function you know by name (e.g., ?merge) or “??” to search the help pages by keyword.
Start simple when you de-bug (e.g., take a slice of your data to work with).
Check your code line by line when you’re de-bugging.
Ask for help, but only when you’ve exhausted the list of possible checks.
Become an expert in Stack Overflow and Google (so much information out there!).
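
The "start simple" tip can be sketched as follows. This is a minimal example on made-up data (the object names big_data, small_data, and sampled_data are hypothetical), showing two ways to take a small slice to debug with.

```r
# Toy data standing in for a large dataset (hypothetical values)
big_data <- data.frame(id = 1:1000, value = seq(0.5, 500, by = 0.5))

# Work on the first 10 rows while you de-bug
small_data <- head(big_data, 10)

# Or take a reproducible random sample of 10 rows
set.seed(123)
sampled_data <- big_data[sample(nrow(big_data), 10), ]

dim(small_data)
dim(sampled_data)
```

Once the code runs cleanly on the slice, swap the full data back in.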

Troubleshooting checklist:

Specify which package a function comes from with the double colon (package::function).
Check your paths if you’re loading data.
Check your object types.
Check your R/RStudio/package versions and reinstall if you have old versions.
Restart your session (don’t save workspace!).
Use a debugger (for more advanced programming).
Get familiar with errors and warnings.
- Warnings mean something might be wrong, but R carried out the task anyway.
- “subscript out of bounds” means the element does not exist.
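- “replacement has length zero” means the value you are assigning is empty; similar messages (e.g., “replacement has 2 rows, data has 3”) mean your variables do not have the same number of observations.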
- “object ‘x’ not found” means the object doesn’t exist.
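
To get familiar with these messages, it can help to trigger them on purpose. The sketch below uses a small helper, show_problem (a hypothetical name, not part of any package), that captures the message instead of stopping the script.

```r
# Hypothetical helper: return the error or warning message as text
show_problem <- function(expr) {
  tryCatch(expr,
           error = function(e) conditionMessage(e),
           warning = function(w) conditionMessage(w))
}

m <- matrix(1:6, nrow = 2)

# Row 3 does not exist in a 2-row matrix
msg_bounds <- show_problem(m[3, 1])

# This object was never created
msg_missing <- show_problem(not_created_yet + 1)

# The right-hand side of the assignment is empty
msg_zero <- show_problem({ x <- 1:3; x[[2]] <- integer(0); x })

# A warning, not an error: R would normally carry on and return NA
msg_warning <- show_problem(as.numeric("abc"))

msg_bounds
msg_missing
msg_zero
msg_warning
```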

A couple of helpful code snippets:

# Example of specifying syntax
members = xml2::read_xml("http://clerk.house.gov/xml/lists/MemberData.xml")
    
# Checking what's loaded and versions
sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.4-2 plyr_1.8.4          foreign_0.8-69     
## [4] MASS_7.3-47        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12    digest_0.6.12   rprojroot_1.2   backports_1.1.1
##  [5] magrittr_1.5    evaluate_0.10.1 stringi_1.1.5   curl_2.8.1     
##  [9] xml2_1.1.1      rmarkdown_1.6   tools_3.4.1     stringr_1.2.0  
## [13] yaml_2.1.14     compiler_3.4.1  htmltools_0.3.6 knitr_1.17
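
If the full sessionInfo() output is more than you need, a few narrower checks may be enough. This is a sketch using base R functions only; "stats" and "MASS" are just example package names.

```r
# R version as a single string
r_version <- R.version.string
r_version

# Version of one installed package ("stats" ships with R)
stats_version <- packageVersion("stats")
as.character(stats_version)

# Is a package currently loaded? TRUE only if MASS has been loaded
mass_loaded <- "MASS" %in% loadedNamespaces()
mass_loaded

# Everything attached to the search path, in lookup order
search()
```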

Some review on objects in R:

Matrix objects are a nice way of considering data in R since our datasets are essentially large matrices! They also help us to connect what we know about matrices to thinking about network data. Recall that we can create matrix objects in a number of ways.

Three main ways of creating a matrix object:

Use vector objects to create a matrix.
Use matrix() rules: similar to the above, but we can fill the matrix with a single scalar.
Combine multiple vectors as a matrix using “rbind” or “cbind”. Remember that length matters!

#  Vectors
vector_1 <- c(1,2,3)
matrix_1 <- matrix(vector_1, nrow=3, ncol=2, byrow=F)
matrix_1

##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3

# Matrix rules  
matrix_2 <- matrix(1,3,2)
matrix_2

##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1

# Combination (using rbind)
row_1 <- c(1, 2, 3)
row_2 <- c(4, 5, 6)
matrix_3 <- rbind(row_1, row_2)
matrix_3

##       [,1] [,2] [,3]
## row_1    1    2    3
## row_2    4    5    6

# Combination (using cbind)
column_1 <- c(1, 2)
column_2 <- c(3, 4)
matrix_4 <- cbind(column_1, column_2)
matrix_4

##      column_1 column_2
## [1,]        1        3
## [2,]        2        4

# If we ever forget the dimensions of our matrix, we can ask R for help
dim(matrix_3)

## [1] 2 3
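
Why does length matter when combining vectors? A small sketch with made-up vectors: when lengths differ, R recycles the shorter vector, silently if the lengths divide evenly and with a warning if they do not.

```r
short_row <- c(1, 2)
long_row  <- c(3, 4, 5, 6)

# 2 divides 4, so short_row is recycled to 1 2 1 2 without any warning
recycled <- rbind(short_row, long_row)
recycled

# 3 does not divide 4, so R recycles anyway but warns
# (the warning is suppressed here only to keep the example quiet)
uneven <- suppressWarnings(rbind(c(1, 2, 3), long_row))
uneven
```

Silent recycling is easy to miss, so it is worth checking dim() after combining.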

# Remember, we can turn matrices into dataframes!
my_matrix <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=F)
my_matrix

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

class(my_matrix)

## [1] "matrix"

my_dataframe <-as.data.frame(my_matrix)
my_dataframe

##   V1 V2
## 1  1  4
## 2  2  5
## 3  3  6

class(my_dataframe)

## [1] "data.frame"
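
After converting, we will usually want to rename the default V1, V2 columns and confirm the object types. A minimal sketch, with hypothetical object and column names:

```r
example_matrix <- matrix(1:6, nrow = 3, ncol = 2)
example_df <- as.data.frame(example_matrix)

# Replace the default V1, V2 names with something meaningful
names(example_df) <- c("first_var", "second_var")

# Check the object types
is.matrix(example_matrix)
is.data.frame(example_df)

# Check the type of each column
column_types <- sapply(example_df, class)
column_types
```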

Simple Steps Dealing with Real Data

There’s a myth that we can’t manage or manipulate data with the same ease in R as we can in Stata. To some extent, this is true. However, most of the more common operations can be performed in R with relative ease (emphasizing “relative” since managing data is never easy). I’m going to cover some BASIC tools we can use to manipulate data in R. Later, we can cover more complex approaches.

# Read data
qog <- read.csv("dataPersonal/qog_std_cs_jan18.csv", header=T)
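
If read.csv() fails, the path is the first thing to check. The sketch below reuses the path from the chunk above; the names data_path and qog_check are hypothetical.

```r
data_path <- "dataPersonal/qog_std_cs_jan18.csv"

# Where is R looking? Relative paths start from the working directory
getwd()

# Only read the file if it is really there
if (file.exists(data_path)) {
  qog_check <- read.csv(data_path, header = TRUE)
} else {
  message("File not found: ", normalizePath(data_path, mustWork = FALSE))
}
```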

We might want to look more closely at our data or at our data’s variable names. The latter is especially useful if you’re relying on someone else’s data and codebook. Replication data, for example, often does not include the full set of variables.

# Summary 
# summary(qog) -- commented out because the output is so long
head(names(qog))

## [1] "ccode"    "cname"    "ccodealp" "ccodecow" "ccodewb"  "version"

Note that this code tells us about the variable names, but not about our actual data. Below are a number of manipulations we may want to make to this existing master data. But first, let’s save this as our sample data and work from there so we don’t overwrite the primary data.

# Write data to disk
write.csv(qog, "new_qog.csv", row.names = FALSE)  # avoids an extra row-number column when re-read

# Read in "copy" data
new_qog <- read.csv("new_qog.csv")

# Add new variable
new_qog$random_var <- 1
new_qog$random_var

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

# Transform existing variable
new_qog$wvs_trust_rescale <- new_qog$wvs_trust/2

# Compare original and new 
new_qog$wvs_trust

##   [1]         NA         NA 0.17928633         NA         NA         NA
##   [7] 0.15280272 0.19854753 0.51814008         NA         NA 0.34183672
##  [13]         NA 0.11022963         NA         NA         NA         NA
##  [19]         NA         NA 0.07116177         NA         NA         NA
##  [25]         NA         NA         NA 0.35224298         NA         NA
##  [31]         NA         NA         NA         NA         NA 0.12770340
##  [37] 0.63131458 0.30887952 0.04130579         NA         NA         NA
##  [43]         NA         NA         NA 0.07605544         NA         NA
##  [49]         NA         NA         NA 0.07166667         NA         NA
##  [55]         NA         NA 0.40074742         NA         NA         NA
##  [61]         NA         NA 0.08885164         NA 0.45346335 0.05030944
##  [67]         NA         NA         NA         NA         NA         NA
##  [73]         NA         NA         NA         NA 0.33223999         NA
##  [79]         NA 0.31971580         NA         NA         NA         NA
##  [85]         NA         NA 0.38332745 0.13249999         NA         NA
##  [91]         NA 0.30000001 0.38041958         NA 0.10905730         NA
##  [97]         NA         NA 0.10643163         NA         NA         NA
## [103]         NA         NA 0.08538461         NA         NA         NA
## [109]         NA         NA 0.12424850         NA         NA         NA
## [115]         NA 0.12531753         NA         NA         NA         NA
## [121]         NA 0.67416936         NA 0.56776559         NA         NA
## [127] 0.15012437         NA         NA         NA         NA 0.23090559
## [133]         NA         NA         NA 0.08466440 0.03167078 0.22705126
## [139]         NA         NA         NA         NA 0.07747335 0.29615441
## [145] 0.16633923         NA         NA         NA         NA         NA
## [151]         NA         NA         NA         NA         NA 0.37390605
## [157]         NA         NA 0.20113315         NA 0.23412248 0.08294227
## [163] 0.19565730         NA         NA         NA         NA 0.61769235
## [169]         NA         NA         NA 0.32491252         NA         NA
## [175] 0.03219316         NA 0.15996578 0.12295500         NA         NA
## [181]         NA 0.24750616         NA 0.21500890         NA         NA
## [187] 0.35139450         NA 0.15248619 0.14092141         NA         NA
## [193] 0.40398741         NA

new_qog$wvs_trust_rescale

##   [1]         NA         NA 0.08964316         NA         NA         NA
##   [7] 0.07640136 0.09927376 0.25907004         NA         NA 0.17091836
##  [13]         NA 0.05511481         NA         NA         NA         NA
##  [19]         NA         NA 0.03558088         NA         NA         NA
##  [25]         NA         NA         NA 0.17612149         NA         NA
##  [31]         NA         NA         NA         NA         NA 0.06385170
##  [37] 0.31565729 0.15443976 0.02065290         NA         NA         NA
##  [43]         NA         NA         NA 0.03802772         NA         NA
##  [49]         NA         NA         NA 0.03583333         NA         NA
##  [55]         NA         NA 0.20037371         NA         NA         NA
##  [61]         NA         NA 0.04442582         NA 0.22673167 0.02515472
##  [67]         NA         NA         NA         NA         NA         NA
##  [73]         NA         NA         NA         NA 0.16611999         NA
##  [79]         NA 0.15985790         NA         NA         NA         NA
##  [85]         NA         NA 0.19166373 0.06625000         NA         NA
##  [91]         NA 0.15000000 0.19020979         NA 0.05452865         NA
##  [97]         NA         NA 0.05321581         NA         NA         NA
## [103]         NA         NA 0.04269231         NA         NA         NA
## [109]         NA         NA 0.06212425         NA         NA         NA
## [115]         NA 0.06265877         NA         NA         NA         NA
## [121]         NA 0.33708468         NA 0.28388280         NA         NA
## [127] 0.07506219         NA         NA         NA         NA 0.11545279
## [133]         NA         NA         NA 0.04233220 0.01583539 0.11352563
## [139]         NA         NA         NA         NA 0.03873667 0.14807720
## [145] 0.08316962         NA         NA         NA         NA         NA
## [151]         NA         NA         NA         NA         NA 0.18695302
## [157]         NA         NA 0.10056658         NA 0.11706124 0.04147113
## [163] 0.09782865         NA         NA         NA         NA 0.30884618
## [169]         NA         NA         NA 0.16245626         NA         NA
## [175] 0.01609658         NA 0.07998289 0.06147750         NA         NA
## [181]         NA 0.12375308         NA 0.10750445         NA         NA
## [187] 0.17569725         NA 0.07624309 0.07046070         NA         NA
## [193] 0.20199370         NA

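Also, note the NAs representing missing data. In Stata, we use a “.” to denote missingness. In R, missing values appear as NA; when we read in a file such as a CSV, blank numeric cells become NA. So if you don’t enter “0” for a true “0” and leave the cell blank instead, R treats the observation as missing.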
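
A small sketch of how this plays out, using a made-up three-row CSV (the names csv_text and toy_na are hypothetical). Country B has a blank cell, while country C has a true zero.

```r
# A tiny CSV typed inline: B is blank (missing), C is a true 0
csv_text <- "country,trust\nA,0.2\nB,\nC,0"
toy_na <- read.csv(text = csv_text)
toy_na

# Which observations are missing?
missing_flags <- is.na(toy_na$trust)
missing_flags

# Many functions return NA unless told to drop missing values
mean_with_na <- mean(toy_na$trust)
mean_without_na <- mean(toy_na$trust, na.rm = TRUE)
mean_with_na
mean_without_na
```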

# Look at the first 20 variable names only
head(names(qog), 20)

##  [1] "ccode"        "cname"        "ccodealp"     "ccodecow"    
##  [5] "ccodewb"      "version"      "aid_cpnc"     "aid_cpsc"    
##  [9] "aid_crnc"     "aid_crnio"    "aid_crsc"     "aid_crsio"   
## [13] "ajr_settmort" "al_ethnic"    "al_language"  "al_religion" 
## [17] "bci_bci"      "bci_bcistd"   "bi_a_total"   "bi_p_total"

Now, onto dropping and keeping variables (or, subsetting). An important comment about this: If dropping variables, first create a new data set. Also, it’s better to select the variables you want to keep. That way, we indicate specifically what we plan to use, instead of dropping variables and therefore risking dropping the wrong ones. Unlike Stata, these approaches allow us to hold onto both the subset and the original data in the global environment. This is a HUGE advantage in R.

# Selecting your subset:
selected <- c("ccode", "cname", "ajr_settmort")
new_qog_subset <- new_qog[selected]

# Dropping variables to create a subset a different way
dropped <- names(new_qog) %in% c("wr_nonautocracy", "wr_regtype")
new_qog_subset_2 <- new_qog[!dropped]
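
Base R also has a subset() function with a select argument that does the same job. Here is a sketch on a made-up data frame (toy and its columns are hypothetical stand-ins for the QOG data):

```r
toy <- data.frame(ccode = 1:3,
                  cname = c("A", "B", "C"),
                  x = c(10, 20, 30),
                  y = c(5, 6, 7))

# Keep only the named variables
kept <- subset(toy, select = c(ccode, cname))

# Drop one variable with a minus sign
dropped_y <- subset(toy, select = -y)

# Keep rows that meet a condition and a few variables at the same time
rows_kept <- subset(toy, x > 10, select = c(cname, x))

names(kept)
names(dropped_y)
rows_kept
```

The original toy data frame is untouched throughout, so both the subsets and the source stay available.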

Slightly more complicated steps dealing with real data

We often want to merge datasets together. This may seem surprising for a dataset with 1883 variables, but usually we’re working with multiple datasets from different sources. Let’s start by making a smaller subset of our “new_qog” data and merging our subsets as if they originated from different sources.

# More complex
selected2 <- c("ccode", "cname", "wvs_trust","wvs_trust_rescale", "random_var")
new_qog_subset_3 <- new_qog[selected2]

# And here, I started getting errors because 
# I didn't create a new name for my newly selected data. 
# I had to clear my history and start over. DUMMY.

# Merge 
working_data <- merge(new_qog_subset, new_qog_subset_3, by=c("ccode", "cname"))

In this case, we don’t lose observations because these subsets came from the same original dataset. In reality, we often have to add some code to avoid dropping observations that don’t merge. We use “all” in the merge function to specify whether we want to drop these observations or not.

# Another merge
working_data <- merge(new_qog_subset, new_qog_subset_3, by=c("ccode", "cname"), all=TRUE)
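
To see what “all” changes, it helps to merge two datasets that only partly overlap. A sketch with made-up data (left_data and right_data are hypothetical):

```r
left_data  <- data.frame(ccode = c(1, 2, 3), gdp = c(10, 20, 30))
right_data <- data.frame(ccode = c(2, 3, 4), trust = c(0.2, 0.3, 0.4))

# Default: keep only observations found in both datasets
inner <- merge(left_data, right_data, by = "ccode")

# all = TRUE: keep everything, filling gaps with NA
outer <- merge(left_data, right_data, by = "ccode", all = TRUE)

# all.x = TRUE: keep every observation from the first dataset only
left_only <- merge(left_data, right_data, by = "ccode", all.x = TRUE)

nrow(inner)
nrow(outer)
nrow(left_only)
```

Comparing nrow() before and after a merge is a quick way to catch observations that were dropped.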

Appending data is very simple, assuming the data are formatted correctly (i.e., each dataset has the same columns, with the same names). The “rbind()” function from the “base” package will suffice. If the datasets do not share the same columns, we can use “rbind.fill()” from the “plyr” package, which fills the missing columns with NAs.

# Using rbind.fill when column names are different
tog_qog_subsets = rbind.fill(new_qog_subset, new_qog_subset_3)
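
To see what filling with NAs amounts to, here is a base-R sketch of the same idea on made-up data. It is not how rbind.fill() is implemented, only an illustration of the result; all object names are hypothetical.

```r
first_data  <- data.frame(ccode = c(1, 2), gdp = c(10, 20))
second_data <- data.frame(ccode = c(3, 4), trust = c(0.3, 0.4))

# Every column that appears in either dataset
all_cols <- union(names(first_data), names(second_data))

# Add the columns each dataset lacks, filled with NA
first_data[setdiff(all_cols, names(first_data))] <- NA
second_data[setdiff(all_cols, names(second_data))] <- NA

# Now the columns match, so plain rbind() works
stacked <- rbind(first_data, second_data[names(first_data)])
stacked
```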

Finally, we may want to reshape long data (one row per country-year, like the QOG time-series data) into wide data (one row per country, with a column for each variable-year).

# Let's load in the time series data from QOG and try to reshape it into wide format
qog_ts <- read.csv("dataPersonal/qog_std_ts_jan18.csv", header=T)

# This is a big dataset, so it takes a while!
wide_qog <- reshape(qog_ts,
                       # time variable
                       timevar="year",
                       # variable that identifies each unit (country)
                       idvar="ccode", 
                       # row names for the wide data: 1:211 assumes 211 countries;
                       # NULL (the default) keeps the existing row names
                       new.row.names=1:211,
                       # direction of reshape
                       direction = "wide")

# Subset to the variable you're interested in.
# Build the data frame directly: cbind() would coerce every column to character
# (and, before R 4.0, as.data.frame() would then turn them into factors,
# so as.numeric() would return level codes rather than the original values).
regtype_all <- data.frame(country = qog_ts$cname,
                          year = qog_ts$year,
                          regtype = as.numeric(qog_ts$ht_regtype))

# Reshape the three-variable subset (much faster than the full dataset)
wide_regtype_qog <- reshape(regtype_all,
                       # time variable
                       timevar="year",
                       # variable that identifies each unit (country)
                       idvar="country", 
                       # row names for the wide data: 1:211 assumes 211 countries;
                       # NULL (the default) keeps the existing row names
                       new.row.names=1:211,
                       # direction of reshape
                       direction = "wide")
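
Because the real reshape is slow, it can help to try the same call on a toy dataset first. A sketch with made-up values (long_toy is hypothetical; two countries, three years):

```r
long_toy <- data.frame(country = rep(c("A", "B"), each = 3),
                       year = rep(2000:2002, times = 2),
                       regtype = c(1, 1, 2, 3, 3, 3))

# Long to wide: one row per country, one regtype column per year
wide_toy <- reshape(long_toy,
                    timevar = "year",
                    idvar = "country",
                    direction = "wide")
wide_toy
names(wide_toy)

# reshape() remembers how it got here, so going back needs no extra arguments
back_to_long <- reshape(wide_toy, direction = "long")
nrow(back_to_long)
```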

# Now we can sum across years by country (column 1 is the country id)
wide_regtype_qog$total <- rowSums(wide_regtype_qog[, -1], na.rm = TRUE)