# Set your working directory
setwd("/Users/corybelden/Documents/my-stuff/PAMR")
# Four libraries needed today (plus the "base" package)
library(MASS)
library(foreign)
library(plyr)
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2
Troubleshooting is the most important skill in applied programming.
A few tips to start:
Troubleshooting checklist:
A couple of helpful commands:
# Example of naming the package explicitly with the :: syntax
members = xml2::read_xml("http://clerk.house.gov/xml/lists/MemberData.xml")
# Checking what's loaded and versions
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.10.4-2 plyr_1.8.4 foreign_0.8-69
## [4] MASS_7.3-47
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 digest_0.6.12 rprojroot_1.2 backports_1.1.1
## [5] magrittr_1.5 evaluate_0.10.1 stringi_1.1.5 curl_2.8.1
## [9] xml2_1.1.1 rmarkdown_1.6 tools_3.4.1 stringr_1.2.0
## [13] yaml_2.1.14 compiler_3.4.1 htmltools_0.3.6 knitr_1.17
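The `::` call above runs a function from a package without attaching that package with `library()`. Here is a minimal sketch of the same idea using only packages that ship with R, so nothing needs to be downloaded. The helper name `pkg_available` is made up for illustration.

```r
# Call a function from a specific package without attaching it.
# 'stats' and 'utils' ship with every R installation.
med <- stats::median(c(1, 5, 9))

# Check whether a package is installed before trying to use it.
# pkg_available() is a hypothetical helper name, not part of any package.
pkg_available <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}
has_stats <- pkg_available("stats")

# Look up the installed version of a single package,
# which is quicker to read than the full sessionInfo() output.
stats_version <- utils::packageVersion("stats")

print(med)
print(has_stats)
print(stats_version)
```

Naming the package explicitly is also a handy troubleshooting step when two attached packages define a function with the same name.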
Matrix objects are a nice way of considering data in R since our datasets are essentially large matrices! They also help us to connect what we know about matrices to thinking about network data. Recall that we can create matrix objects in a number of ways.
Three main ways of creating a matrix object:
# Vectors
vector_1 <- c(1,2,3)
matrix_1 <- matrix(vector_1, nrow=3, ncol=2, byrow=F)
matrix_1
## [,1] [,2]
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
# Matrix rules
matrix_2 <- matrix(1,3,2)
matrix_2
## [,1] [,2]
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
# Combination (using rbind)
row_1 <- c(1, 2, 3)
row_2 <- c(4, 5, 6)
matrix_3 <- rbind(row_1, row_2)
matrix_3
## [,1] [,2] [,3]
## row_1 1 2 3
## row_2 4 5 6
# Combination (using cbind)
column_1 <- c(1, 2)
column_2 <- c(3, 4)
matrix_4 <- cbind(column_1, column_2)
matrix_4
## column_1 column_2
## [1,] 1 3
## [2,] 2 4
# If we ever forget the dimensions of our matrix, we can ask R for help
dim(matrix_3)
## [1] 2 3
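Two details of `matrix()` are easy to miss in the examples above: a short vector is recycled to fill the matrix, and `byrow` controls the fill direction. The sketch below uses made-up object names to show both.

```r
# Recycling: a 3-element vector fills a 3 x 2 matrix column by column,
# so the vector is reused for the second column.
recycled <- matrix(c(1, 2, 3), nrow = 3, ncol = 2, byrow = FALSE)

# byrow = FALSE fills down columns; byrow = TRUE fills across rows.
by_col <- matrix(1:6, nrow = 2, ncol = 3, byrow = FALSE)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# nrow() and ncol() return the two parts of dim() separately.
n_rows <- nrow(by_row)
n_cols <- ncol(by_row)

# Index a single cell with [row, column].
cell <- by_row[2, 1]

print(by_col)
print(by_row)
print(cell)
```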
# Remember, we can turn matrices into dataframes!
my_matrix <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=F)
my_matrix
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
class(my_matrix)
## [1] "matrix"
my_dataframe <- as.data.frame(my_matrix)
my_dataframe
## V1 V2
## 1 1 4
## 2 2 5
## 3 3 6
class(my_dataframe)
## [1] "data.frame"
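The default column names `V1` and `V2` are not very informative, so we usually rename them right after converting. A small sketch follows; the names "id" and "score" are invented for illustration. Note that `class()` on a matrix returns both "matrix" and "array" in R 4.0 and later, so the sketch checks types with `is.matrix()` and `is.data.frame()` instead.

```r
# Start from the same 3 x 2 matrix used above.
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = FALSE)

# Convert and replace the default V1/V2 names with descriptive ones.
# The names "id" and "score" are made up for illustration.
df <- as.data.frame(m)
names(df) <- c("id", "score")

# A data frame with only numeric columns converts back to a numeric matrix.
m_back <- as.matrix(df)

print(df)
print(is.data.frame(df))
print(is.matrix(m_back))
```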
There’s a myth that we can’t manage or manipulate data with the same ease in R as we can in Stata. To some extent, this is true. However, most of the more common tasks can be performed in R with relative ease (emphasizing “relative” since managing data is never easy). I’m going to cover some BASIC tools we can use to manipulate data in R. Later, we can cover more complex approaches.
# Read data
qog <- read.csv("dataPersonal/qog_std_cs_jan18.csv", header=T)
We might want to look more closely at our data or at our data’s variable names. The latter is especially useful if you’re relying on someone else’s data and codebook. Replication data, for example, often does not include the full set of variables.
# Summary
# summary(qog) -- commenting this out because the output is so long
head(names(qog))
## [1] "ccode" "cname" "ccodealp" "ccodecow" "ccodewb" "version"
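A few other base R functions help when inspecting a dataset. The sketch below runs them on a tiny made-up data frame (the `toy` object and its values are assumptions, not QOG data) so the output stays short.

```r
# Toy data standing in for a country-level dataset (values are made up).
toy <- data.frame(
  ccode = c(4, 8, 12),
  cname = c("Afghanistan", "Albania", "Algeria"),
  trust = c(NA, 0.5, 0.25),
  stringsAsFactors = FALSE
)

dim(toy)            # number of rows and columns
names(toy)          # variable names only
str(toy)            # variable names, types, and the first few values
summary(toy$trust)  # distribution of one variable, including the NA count
```

Running `summary()` on a single column is a reasonable compromise when the full summary is too long to read.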
Note that these commands tell us about variable names in different ways, but not about our actual data. Below are a number of manipulations we may want to make to this existing master data. But first, let’s save a copy as our sample data and work from there so we don’t overwrite the primary data.
# Write data to disk
# (row.names = FALSE stops R from writing the row numbers as an extra column)
write.csv(qog, "new_qog.csv", row.names = FALSE)
# Read in "copy" data
new_qog <- read.csv("new_qog.csv")
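One detail of a round trip through disk: by default `write.csv()` also writes the row names, so the copy gains an extra column unless `row.names = FALSE` is set. A small sketch with made-up data and a temporary file (the object names are assumptions):

```r
# Toy data standing in for the master dataset (values are made up).
master <- data.frame(ccode = c(4, 8), cname = c("Afghanistan", "Albania"),
                     stringsAsFactors = FALSE)

# Write to a temporary file so nothing in the working directory is touched.
path <- tempfile(fileext = ".csv")

# Default behavior: row names are written as an extra, unnamed first column,
# which read.csv() brings back as a column called "X".
write.csv(master, path)
with_rownames <- read.csv(path, stringsAsFactors = FALSE)

# row.names = FALSE writes only the actual variables.
write.csv(master, path, row.names = FALSE)
clean_copy <- read.csv(path, stringsAsFactors = FALSE)

names(with_rownames)
names(clean_copy)
```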
# Add new variable
new_qog$random_var <- 1
new_qog$random_var
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Transform existing variable
new_qog$wvs_trust_rescale <- new_qog$wvs_trust/2
# Compare original and new
new_qog$wvs_trust
## [1] NA NA 0.17928633 NA NA NA
## [7] 0.15280272 0.19854753 0.51814008 NA NA 0.34183672
## [13] NA 0.11022963 NA NA NA NA
## [19] NA NA 0.07116177 NA NA NA
## [25] NA NA NA 0.35224298 NA NA
## [31] NA NA NA NA NA 0.12770340
## [37] 0.63131458 0.30887952 0.04130579 NA NA NA
## [43] NA NA NA 0.07605544 NA NA
## [49] NA NA NA 0.07166667 NA NA
## [55] NA NA 0.40074742 NA NA NA
## [61] NA NA 0.08885164 NA 0.45346335 0.05030944
## [67] NA NA NA NA NA NA
## [73] NA NA NA NA 0.33223999 NA
## [79] NA 0.31971580 NA NA NA NA
## [85] NA NA 0.38332745 0.13249999 NA NA
## [91] NA 0.30000001 0.38041958 NA 0.10905730 NA
## [97] NA NA 0.10643163 NA NA NA
## [103] NA NA 0.08538461 NA NA NA
## [109] NA NA 0.12424850 NA NA NA
## [115] NA 0.12531753 NA NA NA NA
## [121] NA 0.67416936 NA 0.56776559 NA NA
## [127] 0.15012437 NA NA NA NA 0.23090559
## [133] NA NA NA 0.08466440 0.03167078 0.22705126
## [139] NA NA NA NA 0.07747335 0.29615441
## [145] 0.16633923 NA NA NA NA NA
## [151] NA NA NA NA NA 0.37390605
## [157] NA NA 0.20113315 NA 0.23412248 0.08294227
## [163] 0.19565730 NA NA NA NA 0.61769235
## [169] NA NA NA 0.32491252 NA NA
## [175] 0.03219316 NA 0.15996578 0.12295500 NA NA
## [181] NA 0.24750616 NA 0.21500890 NA NA
## [187] 0.35139450 NA 0.15248619 0.14092141 NA NA
## [193] 0.40398741 NA
new_qog$wvs_trust_rescale
## [1] NA NA 0.08964316 NA NA NA
## [7] 0.07640136 0.09927376 0.25907004 NA NA 0.17091836
## [13] NA 0.05511481 NA NA NA NA
## [19] NA NA 0.03558088 NA NA NA
## [25] NA NA NA 0.17612149 NA NA
## [31] NA NA NA NA NA 0.06385170
## [37] 0.31565729 0.15443976 0.02065290 NA NA NA
## [43] NA NA NA 0.03802772 NA NA
## [49] NA NA NA 0.03583333 NA NA
## [55] NA NA 0.20037371 NA NA NA
## [61] NA NA 0.04442582 NA 0.22673167 0.02515472
## [67] NA NA NA NA NA NA
## [73] NA NA NA NA 0.16611999 NA
## [79] NA 0.15985790 NA NA NA NA
## [85] NA NA 0.19166373 0.06625000 NA NA
## [91] NA 0.15000000 0.19020979 NA 0.05452865 NA
## [97] NA NA 0.05321581 NA NA NA
## [103] NA NA 0.04269231 NA NA NA
## [109] NA NA 0.06212425 NA NA NA
## [115] NA 0.06265877 NA NA NA NA
## [121] NA 0.33708468 NA 0.28388280 NA NA
## [127] 0.07506219 NA NA NA NA 0.11545279
## [133] NA NA NA 0.04233220 0.01583539 0.11352563
## [139] NA NA NA NA 0.03873667 0.14807720
## [145] 0.08316962 NA NA NA NA NA
## [151] NA NA NA NA NA 0.18695302
## [157] NA NA 0.10056658 NA 0.11706124 0.04147113
## [163] 0.09782865 NA NA NA NA 0.30884618
## [169] NA NA NA 0.16245626 NA NA
## [175] 0.01609658 NA 0.07998289 0.06147750 NA NA
## [181] NA 0.12375308 NA 0.10750445 NA NA
## [187] 0.17569725 NA 0.07624309 0.07046070 NA NA
## [193] 0.20199370 NA
Also, note the NAs representing missing data. In Stata, we use a “.” to denote missingness. In R, missing values appear as NA, and a blank cell in a numeric column of a CSV file is read in as NA. So if you leave a cell blank instead of entering “0” for a true “0”, R treats the observation as missing.
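To see how R treats blanks versus true zeros, here is a small sketch that reads three made-up rows from text instead of a file. The column names and values are assumptions for illustration.

```r
# A blank numeric cell in a CSV is read as NA; an explicit 0 stays 0.
# The three rows below are made-up data.
csv_text <- paste("country,trust", "A,0.25", "B,", "C,0", sep = "\n")
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

toy$trust                      # observed, missing, true zero
is.na(toy$trust)               # flags the missing cell only
mean(toy$trust)                # NA, because one value is missing
mean(toy$trust, na.rm = TRUE)  # mean of the two observed values

# Arithmetic on NA stays NA, which is why a rescaled variable
# is missing wherever the original is missing.
toy$trust_rescale <- toy$trust / 2
toy
```

Most summary functions need `na.rm = TRUE` before they will return a number when any value is missing.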
# Look at the first 20 variable names only
head(names(qog), 20)
## [1] "ccode" "cname" "ccodealp" "ccodecow"
## [5] "ccodewb" "version" "aid_cpnc" "aid_cpsc"
## [9] "aid_crnc" "aid_crnio" "aid_crsc" "aid_crsio"
## [13] "ajr_settmort" "al_ethnic" "al_language" "al_religion"
## [17] "bci_bci" "bci_bcistd" "bi_a_total" "bi_p_total"
Now, on to dropping and keeping variables (that is, subsetting). An important point: if you are dropping variables, first create a new dataset. It is also better to select the variables you want to keep than to drop the ones you don’t. That way, we state exactly what we plan to use instead of risking dropping the wrong variables. Unlike Stata, R lets us hold both the subset and the original data in the global environment at the same time. This is a HUGE advantage in R.
# Selecting your subset:
selected <- c("ccode", "cname", "ajr_settmort")
new_qog_subset <- new_qog[selected]
# Dropping variables to create a subset a different way
dropped <- names(new_qog) %in% c("wr_nonautocracy", "wr_regtype")
new_qog_subset_2 <- new_qog[!dropped]
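Why is keeping safer than dropping? A sketch on made-up data (the `toy` object and the misspelled names are assumptions): a typo in a keep list stops with an error, while a typo in a drop list silently drops nothing.

```r
# Toy data (values are made up).
toy <- data.frame(ccode = c(4, 8), cname = c("Afghanistan", "Albania"),
                  trust = c(NA, 0.5), extra = c(1, 1),
                  stringsAsFactors = FALSE)

# Keep: name the variables you want.
keep <- c("ccode", "cname", "trust")
kept <- toy[keep]

# Drop: build a logical flag for the unwanted names and negate it.
drop_flag <- names(toy) %in% c("extra")
dropped <- toy[!drop_flag]

# Both routes give the same subset here, and `toy` itself is unchanged.
identical(kept, dropped)
ncol(toy)

# A misspelled name in a keep list fails loudly...
# toy[c("ccode", "trsut")]   # Error: undefined columns selected
# ...while a misspelled name in a drop list silently drops nothing.
ncol(toy[!(names(toy) %in% "exrta")])
```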
We often want to merge datasets together. This may seem surprising for a dataset with 1883 variables, but usually we’re working with multiple datasets from different sources. Let’s start by making a second, smaller subset of our “new_qog” data and merging our subsets as if they originated from different sources.
# A second subset with more variables
selected2 <- c("ccode", "cname", "wvs_trust","wvs_trust_rescale", "random_var")
new_qog_subset_3 <- new_qog[selected2]
# And here, I started getting errors because
# I didn't create a new name for my newly selected data.
# I had to clear my history and start over. DUMMY.
# Merge
working_data <- merge(new_qog_subset, new_qog_subset_3, by=c("ccode", "cname"))
In this case, we don’t lose observations because these subsets came from the same original dataset. In reality, we often have to add some code to avoid dropping observations that don’t merge. We use “all” in the merge function to specify whether we want to drop these observations or not.
# Another merge, this time keeping observations that don't match (all=TRUE)
working_data <- merge(new_qog_subset, new_qog_subset_3, by=c("ccode", "cname"), all=TRUE)
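Because our two subsets contain the same countries, the effect of “all” is invisible here. The sketch below uses two made-up datasets that only partly overlap (object names and values are assumptions) to show what changes.

```r
# Two toy datasets that share only some countries (values are made up).
left  <- data.frame(ccode = c(4, 8, 12), mort = c(93.7, NA, 78.2))
right <- data.frame(ccode = c(8, 12, 24), trust = c(0.5, 0.25, 0.1))

# Default merge keeps only rows found in both datasets.
inner <- merge(left, right, by = "ccode")

# all = TRUE keeps every row and fills the gaps with NA.
full <- merge(left, right, by = "ccode", all = TRUE)

# all.x = TRUE keeps every row of the first dataset only.
left_only <- merge(left, right, by = "ccode", all.x = TRUE)

nrow(inner)      # 2
nrow(full)       # 4
nrow(left_only)  # 3
```

Comparing `nrow()` before and after a merge is a quick check that no observations were dropped by accident.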
Appending data is very simple, assuming the data are formatted correctly (i.e., each dataset has the same number of columns and the same column names). The “rbind()” function from the “base” package will suffice. If the datasets do not share the same columns, we can use “rbind.fill()” from the “plyr” package, which fills the missing columns with NAs.
# Using rbind.fill when column names are different
tog_qog_subsets = rbind.fill(new_qog_subset, new_qog_subset_3)
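To see what filling with NAs means, here is a base R sketch of the same idea on made-up data. This is not how “plyr” implements “rbind.fill()”, just an illustration of the result; the object names are assumptions.

```r
# Two toy datasets with partly different columns (values are made up).
a <- data.frame(ccode = c(4, 8), mort = c(93.7, 50.1))
b <- data.frame(ccode = c(12, 24), trust = c(0.25, 0.1))

# rbind(a, b) would stop with an error because the names do not match.
# Base-R sketch of the fill idea: add each missing column as NA first.
all_cols <- union(names(a), names(b))
for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
for (col in setdiff(all_cols, names(b))) b[[col]] <- NA

# With identical column names, rbind() works (it matches columns by name).
stacked <- rbind(a, b[all_cols])
stacked
```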
Finally, we may want to reshape data between “long” format (one row per country-year, like the QOG time-series data) and “wide” format (one row per country, with a separate column for each variable-year combination).
# Let's load in the time series data from QOG and try to reshape it from long to wide
qog_ts <- read.csv("dataPersonal/qog_std_ts_jan18.csv", header=T)
# This is a big dataset, so this takes a while!
wide_qog <- reshape(qog_ts,
                    # time variable
                    timevar = "year",
                    # variable identifying each unit (country)
                    idvar = "ccode",
                    # row names for the wide data, one per country; the default (NULL) also works
                    new.row.names = 1:211,
                    # direction of reshape
                    direction = "wide")
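# Subset to the variable you're interested in.
# Build the data frame directly: cbind() coerces every column to character, and in
# R versions before 4.0 as.data.frame() then turns them into factors, so
# as.numeric() would return level codes rather than the actual values.
regtype_all <- data.frame(country = as.character(qog_ts$cname),
                          year = qog_ts$year,
                          regtype = as.numeric(qog_ts$ht_regtype),
                          stringsAsFactors = FALSE)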
# Reshape the subset from long to wide
wide_regtype_qog <- reshape(regtype_all,
                            # time variable
                            timevar = "year",
                            # variable identifying each unit (country)
                            idvar = "country",
                            # row names for the wide data, one per country; the default (NULL) also works
                            new.row.names = 1:211,
                            # direction of reshape
                            direction = "wide")
# Now we can sum across years by country (column 1 is the country name, so we drop it)
wide_regtype_qog$regtype_sum <- rowSums(wide_regtype_qog[, -1], na.rm = TRUE)
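The QOG files are large, so it helps to try `reshape()` on a tiny dataset first. The sketch below uses four made-up rows (object names and values are assumptions) and leaves `new.row.names` at its default, so no row count needs to be hard-coded.

```r
# Toy long data: one row per country-year (values are made up).
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  regtype = c(1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Long to wide: one row per country, one regtype column per year.
wide <- reshape(long, timevar = "year", idvar = "country",
                direction = "wide")
names(wide)   # "country" "regtype.2000" "regtype.2001"

# Row-wise summaries run over the year columns only (drop column 1).
wide$regtype_sum <- rowSums(wide[, -1], na.rm = TRUE)
wide
```

The new columns are named by pasting the variable name and the time value together, which makes it easy to pick out the year columns later.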