Programming & Methods Resources Workshop

Setting up

A few things we’ll scratch the surface of today:

The apply family
stringr/stringi packages
Temporal independence and networks

# Three libraries need today 
library(sna)
library(stringr)
library(stringi)
library(network)
library(igraph)
library(survival)
library(data.table)

# Working directory
setwd("~/Documents/my-stuff/PAMR")

The Apply Family

First, we’ll start with the family of “apply” functions (in base R). The general advice to use apply or its variants (lapply, mapply, tapply, etc.) instead of “for loop” functions when you can. The variations of “apply” correspond with data structures, such as list (lapply), matrix (mapply), and so on. These functions allow you to repeat the same function on multiple vectors, arrays, and dataframes. Things we might use these for:

Calculating the mean (or other quantities) across multiple vectors
Subsetting data or tranforming it into another structure
Gathering only specific types of information from a complicated list

The apply family functions can be used on their own, or in a function we create.

# Create a silly matrix 
m = matrix(c(2,4,3,"elephant",8),
           nrow=4,
           ncol=5)

# Make a dataframe
m = as.data.frame(m)

# Replace elephant with 0
mAdj = as.data.frame(lapply(m, function(x) {
  as.numeric(gsub("elephant", 0, x))})) # x is the matrix, as required in gsub
mAdj

##   V1 V2 V3 V4 V5
## 1  2  8  0  3  4
## 2  4  2  8  0  3
## 3  3  4  2  8  0
## 4  0  3  4  2  8

# Now let's pull the means
mAve = lapply(mAdj, mean)
mAve

## $V1
## [1] 2.25
## 
## $V2
## [1] 4.25
## 
## $V3
## [1] 3.5
## 
## $V4
## [1] 3.25
## 
## $V5
## [1] 3.75

  # Note a vector the same length of the number of columns is returned

# Let's find the 0s 
lapply(mAdj, function(x) {
  length(x[x==0])})

## $V1
## [1] 1
## 
## $V2
## [1] 0
## 
## $V3
## [1] 1
## 
## $V4
## [1] 1
## 
## $V5
## [1] 1

# Square the values are return a df
mAdjSq = as.data.frame(lapply(mAdj, function(x) {
  x^2}))
mAdjSq

##   V1 V2 V3 V4 V5
## 1  4 64  0  9 16
## 2 16  4 64  0  9
## 3  9 16  4 64  0
## 4  0  9 16  4 64

# An example of a more complicated lapply function...

String functions

Working with string vectors (in the character class) is incredibly easy (and exciting!) in the R environment. A brief list of tasks you might want to perform using the stringr and stringi packages:

Identify observations based on a set of characters (e.g., which are “extension of remarks?”)
Match multiple instances of these characters
Pull part of a string based on common attributes (called “regular expressions”)
Substitute a set of characters based on another set of characters
Extract the first, second, etc. word in a character vector

It’s important to emphasize that we need to be extremely careful with manipulating string data. It’s not always easy to see what we’ve grabbed or dropped, and we certainly don’t want to be changing the data in the ways we don’t intend. Moreoever, we need to have strong justifications for our choices (such as subsetting on keywords), just as we do for numerical data.

usDrought = data.frame(readRDS("dataPersonal/us_speech_ex.rds"), stringsAsFactors = F)

# NOTE: I switch between packages here (including base)! When in doubt, use the double-colon!


# Maybe I want to create an indicator that the title includes "drought"

# str_match and str_extract are similar
usDrought$droughtTitle = str_match(usDrought$title, "drought")
head(usDrought$droughtTitle)

##      [,1]
## [1,] NA  
## [2,] NA  
## [3,] NA  
## [4,] NA  
## [5,] NA  
## [6,] NA

  # nope, caps sensitive!

# With str_match
usDrought$droughtTitle = str_match(usDrought$title, "DROUGHT")
head(usDrought$droughtTitle)

##      [,1]     
## [1,] "DROUGHT"
## [2,] "DROUGHT"
## [3,] "DROUGHT"
## [4,] "DROUGHT"
## [5,] "DROUGHT"
## [6,] "DROUGHT"

# With str_extract
usDrought$droughtTitle = str_extract(usDrought$title, "DROUGHT")
head(usDrought$droughtTitle)

## [1] "DROUGHT" "DROUGHT" "DROUGHT" "DROUGHT" "DROUGHT" "DROUGHT"

# Now I could make that string binary (use in analysis, or as easy way to subset)
usDrought$droughtTitle = as.numeric(gsub("DROUGHT", "1", usDrought$droughtTitle))
head(usDrought$droughtTitle)

## [1] 1 1 1 1 1 1

# Or we could match all instances of a particular string for a frequency measure
agDroughtWords = c("drought", "irrigation", "farmer")
usDrought$droughtMatches = str_match_all(usDrought$speech, paste(agDroughtWords, collapse="|"))
  # str_extract_all exists too 

# Use count to measure frequency of a word
usDrought$numDroughtRefs = str_count(usDrought$speech, "drought")
head(usDrought$numDroughtRefs)

## [1]  7 16 40  7  5  5

# Also particularly useful for identifying particular observations (for example, in links)
usDrought$ext = str_match(usDrought$link, "ext")
  # I use "ext" in the url link to identify which records are "extension of remarks"

# Maybe the number of characters is informative
usDrought$nchars = nchar(usDrought$speech)
head(usDrought$nchars, 10)

##  [1]  3963 12977 11380  7502  1621  4634  4145  1662  4172  6055

  # The 10,000+ records probably have bills embedded

# Could subset on them if we wanted
usDrought = subset(usDrought, nchars < 10000)
  # We lose around 20 obs.

# Change titles to lowercase (useful for merging when cases are different)
usDrought$title = base::tolower(usDrought$title)
head(usDrought$title)

## [1] "drought"                           "drought and wildfires"            
## [3] "addressing the california drought" "drought relief"                   
## [5] "water and drought in california"   "drought in california"

  # Note I changed the original variable

# Often times, whitepsace causes problems (especially when merging) -- we can get rid of it!
usDrought$title = base::trimws(usDrought$title)

# We can use word to pull first/second/third words
usDrought$titleFirst = word(usDrought$title, 1)

# Or the last word
usDrought$titleLast = word(usDrought$title, -1)
  # Or any variation of that, really

# Maybe we make those where the first word is the last word (meaning one word title) = NA
usDrought$titleFirst[usDrought$titleLast==usDrought$titleFirst] = NA
usDrought$titleLast[usDrought$titleLast==usDrought$titleFirst] = NA

  # Let's look at the dataframe...

# You can use str_split to divide strings
bills = as.character(c("H.R. 200; H.R. 300; H.R. 400; H.R. 576; S. 763"))
bills

## [1] "H.R. 200; H.R. 300; H.R. 400; H.R. 576; S. 763"

str_split(bills, ";")

## [[1]]
## [1] "H.R. 200"  " H.R. 300" " H.R. 400" " H.R. 576" " S. 763"

# Create dataframe based on these splits
indBills = as.data.frame(str_split(bills, ";"), stringsAsFactors = F)
names(indBills) = "billNo"
head(indBills)

##      billNo
## 1  H.R. 200
## 2  H.R. 300
## 3  H.R. 400
## 4  H.R. 576
## 5    S. 763

# You can also specify n to restrict to certain number of matches
str_split_fixed(bills, ";", 2)

##      [,1]       [,2]                                   
## [1,] "H.R. 200" " H.R. 300; H.R. 400; H.R. 576; S. 763"

Other ways strings can be used involve identifying patterns that have character variation. That is, we might want to drop certain observations based on a set of string patterns, instead of a precise character representation. These are called “regular expressions”, which are part of R’s base; “sub”, “gsub”, “grep”, “grepl”, and others are members of a family of functions that search for patterns and extract them or replace them. Regular expressions can be tough to write and require a lot of trial and error. Here are two great resources on regular expressions:

Here are just a couple examples where I’ve used regular expressions. We might cover more of this in the next workshop on text as data.

# We can use grepl to drop bills (with or without regular expressions)
newDrought <- subset(usDrought, !grepl("TEXT OF AMENDMENT|Daily Digest|INTRODUCTION OF BILLS AND JOINT RESOLUTIONS", usDrought$speech), drop=TRUE)
  # Remember the "!" means "anything not that"
 
# Use "sub" and regular expressions to get rid of potentially problematic data
newDrought$speech[67] # here, I want to remove any false positive of a state mention

## [1] "        \n\n[Page S5446]\nFrom the Congressional Record Online through the Government Publishing Office [www.gpo.gov]\n\n\n\n\n                            TEXAS WILDFIRES\n\n  Mrs. HUTCHISON. Mr. President, I rise in morning business to talk \nabout a situation in Texas, the wildfires and the drought.\n  Since we were mostly home during the August recess, I saw the floods \nin the Midwest and on the Missouri and Mississippi Rivers. I saw the \nhurricane that hit New York and all along the East Coast. At the same \ntime, with all the extra water in the East, we have had as much as 60 \ndays in parts of Texas with no rain whatsoever. The drought is killing \nlivestock. It is killing land. It is a sad situation. What has \nhappened, of course, is, from that, the wildfires have been able to go \nfarther than we have ever seen in Texas before.\n  Just in the past 7 days, the Texas Forest Service has responded to \n176 fires, destroying nearly 130,000 acres. This year alone, over 2,000 \nfires have burned more than 2 million acres in Texas. We have high \nwinds and drought conditions, which are a terrible combination in this \ninstance.\n  Yesterday, the Texas Forest Service responded to 20 new fires, which \nconsumed nearly 1,500 more acres. One of the hardest hit areas is \nBastrop County, which is near Austin. I was talking to some of my \nconstituents in Houston, which is not near Austin, and they were \ntalking about seeing and smelling the smoke in Houston from these fires \nin Bastrop.\n  An assessment has been completed as of now that says 785 homes were \ncompletely destroyed, 238 homes have been reported lost as a result of \nother fires over the past 3 days, and the fires are so big that they \nare being photographed from space.\n  Senator Cornyn and I have asked the President to add the recent \nwildfires from just this last week to his previous disaster declaration \nfrom this spring, which did include wildfires. I want the people of \nTexas to know that Senator Cornyn and I are working together to get all \nthe Federal help they need. I have been in contact with the State \nrepresentatives from the area, the mayors, and the county judges to get \nthe reports. So far they feel they have gotten the help they have \nneeded. But now, in the aftermath, we will need to be part of any kind \nof disaster bill that goes through this Senate or is declared by the \nPresident.\n  It is my hope we can work through that next week and make sure we \ninclude these most recent fires along with the flood disaster relief \nthat supposedly will come to the floor next week. So we are going to \nwork on it and try to help these people. We can't replace the \ngraduation pictures and the wedding pictures and the children's \npictures that are lost. This is the human loss you see in this type of \na situation. But we can certainly help these people rebuild, and that \nis what we want to do.\n  We are going to be on the job trying to help in every way we can, \nknowing there will not be a 100-percent replacement because the \nphotographs and the personal items and grandmother's wedding ring may \nnot be recovered, but we are going to do what we can, as Americans \nalways do.\n  I yield the floor.\n  The PRESIDING OFFICER. The Senator from Illinois.\n\n                          ____________________\n\n\n\n    "

newDrought$speech = sub("The PRESIDING OFFICER..*", "", newDrought$speech) # removes everything after this statement
newDrought$speech[67]

## [1] "        \n\n[Page S5446]\nFrom the Congressional Record Online through the Government Publishing Office [www.gpo.gov]\n\n\n\n\n                            TEXAS WILDFIRES\n\n  Mrs. HUTCHISON. Mr. President, I rise in morning business to talk \nabout a situation in Texas, the wildfires and the drought.\n  Since we were mostly home during the August recess, I saw the floods \nin the Midwest and on the Missouri and Mississippi Rivers. I saw the \nhurricane that hit New York and all along the East Coast. At the same \ntime, with all the extra water in the East, we have had as much as 60 \ndays in parts of Texas with no rain whatsoever. The drought is killing \nlivestock. It is killing land. It is a sad situation. What has \nhappened, of course, is, from that, the wildfires have been able to go \nfarther than we have ever seen in Texas before.\n  Just in the past 7 days, the Texas Forest Service has responded to \n176 fires, destroying nearly 130,000 acres. This year alone, over 2,000 \nfires have burned more than 2 million acres in Texas. We have high \nwinds and drought conditions, which are a terrible combination in this \ninstance.\n  Yesterday, the Texas Forest Service responded to 20 new fires, which \nconsumed nearly 1,500 more acres. One of the hardest hit areas is \nBastrop County, which is near Austin. I was talking to some of my \nconstituents in Houston, which is not near Austin, and they were \ntalking about seeing and smelling the smoke in Houston from these fires \nin Bastrop.\n  An assessment has been completed as of now that says 785 homes were \ncompletely destroyed, 238 homes have been reported lost as a result of \nother fires over the past 3 days, and the fires are so big that they \nare being photographed from space.\n  Senator Cornyn and I have asked the President to add the recent \nwildfires from just this last week to his previous disaster declaration \nfrom this spring, which did include wildfires. I want the people of \nTexas to know that Senator Cornyn and I are working together to get all \nthe Federal help they need. I have been in contact with the State \nrepresentatives from the area, the mayors, and the county judges to get \nthe reports. So far they feel they have gotten the help they have \nneeded. But now, in the aftermath, we will need to be part of any kind \nof disaster bill that goes through this Senate or is declared by the \nPresident.\n  It is my hope we can work through that next week and make sure we \ninclude these most recent fires along with the flood disaster relief \nthat supposedly will come to the floor next week. So we are going to \nwork on it and try to help these people. We can't replace the \ngraduation pictures and the wedding pictures and the children's \npictures that are lost. This is the human loss you see in this type of \na situation. But we can certainly help these people rebuild, and that \nis what we want to do.\n  We are going to be on the job trying to help in every way we can, \nknowing there will not be a 100-percent replacement because the \nphotographs and the personal items and grandmother's wedding ring may \nnot be recovered, but we are going to do what we can, as Americans \nalways do.\n  I yield the floor.\n  "

  # I only made this decision after reviewing many speeches to ensure I wasn't dropping important data!

# A slightly more complicated extraction
newDrought$name = str_extract_all(newDrought$speech, "Mr.\\s*\\w+(\\s[a-z]+)*|Ms.\\s*\\w+(\\s[a-z]+)*|Mrs.\\s*\\w+(\\s[a-z]+)*")
  # Could use without "all" to get the first instance only
head(newDrought$name)

## [[1]]
## [1] "Mr. DURBIN"
## 
## [[2]]
## [1] "Mr. WYDEN"     "Mr. President"
## 
## [[3]]
## [1] "Ms. LORETTA" "Ms. LORETTA"
## 
## [[4]]
## [1] "Mr. CORNYN"      "Mr. President"   "Mr. President"   "Mr. \nPresident"
## 
## [[5]]
## [1] "Ms. Loretta" "Ms. LORETTA" "Mr. Speaker"
## 
## [[6]]
## [1] "Mr. COSTA asked and was given permission to address the"
## [2] "Mr. COSTA"

# Without regular expressions, drop data I know I don't want with grepl
toMatch = c("Israel", "Iran", "Afghanistan")
newDrought = subset(newDrought, !grepl(paste(toMatch, collapse="|"), newDrought$speech), drop=T) 
  # Drops one obs.

# Could also use grepl with an index as opposed to subset
farmDrought = newDrought[grepl("farm", newDrought$title),]

Non-Independence

Often we want to utilize methods that help us to deal with different kinds of “non-independence”. In other words, we usually know that theoretically our observations are correlated, but addressing this methodologically can be challenging so we ignore it. Today we are going to talk about two different kinds of non-independence and scratch the surface on how to address these analytically. The first type is unit-based non-independence and the programming related solution we’ll explore is social network analysis. The second type is temporal non-independence. We will explore a number of data management strategies for getting your data into a structure that will allow you to deal with this.

Alliance_1995 <- read.csv("dataPersonal/1995_alliances.csv", header=FALSE)
ccodes <- read.csv("dataPersonal/1995_ccodes.csv")

# We have automatically brought in the data as a data.frame but what we want is to make sure that
# R knows that this matrix should be treated as a network.

net_1995 <- as.network(x = Alliance_1995, # the network object
                       directed = TRUE, # specify whether the network is directed
                       loops = FALSE, # do we allow self ties (should not allow them)
                       matrix.type = "adjacency" # the type of input
                       )

network.vertex.names(net_1995) <- ccodes # add an i.d. to our vertices with the ccode

# We can plot this network, but as you'll see in a moment, visualizing networks in R is not great.
plot.network(net_1995, # our network object
             vertex.col = "blue", # color nodes by gender
             displaylabels = T, # show the node names
             label.pos = 5 # display the names directly over nodes
)

# Alternatively,
gplot(net_1995, displaylabels=TRUE)

# There is a TON you can do with gplot to improve your network graphics -- check it out! 

# We may also want some basic characteristics of our network before extracting statistics.

summary.network(net_1995, # the network we want to look at
                print.adj = FALSE # if TRUE then this will print out the whole adjacency matrix.
)

## Network attributes:
##   vertices = 187
##   directed = TRUE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 4086 
##    missing edges = 0 
##    non-missing edges = 4086 
##  density = 0.1174746 
## 
## Vertex attributes:
##   vertex.names:
##    character valued attribute
##    3927 valid vertex names
## 
## No edge attributes

# OR....

# Number of dyads
network.dyadcount(net_1995)

## [1] 34782

# Number of edges
network.edgecount(net_1995)

## [1] 4086

# Size of the network
network.size(net_1995)

## [1] 187

Finally, we might want statistics from the network.

# Measuring Centrality. Note that in- and out-degree should be the same in this network
degree_cent_1995 <- sna::degree(net_1995)
btw_cent_1995 <- sna::betweenness(net_1995)
close_cent_1995 <- sna::closeness(net_1995)

# eigen_1995 <- centralization(net_1995, evcent) # Measurese Eigenvector Centrality of the network

# We can plot the degree distribution here:
hist(degree_cent_1995, xlab="Total Degree", main="Total Degree Distribution", prob=TRUE)

# Other valuable measures:

# Network Density:
density_1995 <- sna::gden(net_1995)

# Reciprocity
recip_1995 <- sna::grecip(net_1995)

# Transitivity:
# trans_1995 <- sna::gtrans(net_1995)

# Strong and Weak Component Counts:
# comps_1995 = sna::components(net_1995)
weak_comps_1995 <- sna::components(net_1995, connected="weak")
head(weak_comps_1995)

## [1] 33

For more complicated tasks, see A User’s Guide to Network in Analysis in R.

Temporal Dependence (duration analysis)

There are many ways of dealing with temporal dependence issues. The typical way is to lag or lead variables. This seems intuitive but lets talk about some potential drawbacks:

Leading and Lagging drops observations
Deciding on the appropriate unit of time is not intuitive
Cases covering your lag/lead unit are lost
This still may not solve your endogeneity issues

BUT, it is often useful so lets look at how to implement this with some data. The DV in this dataset is “activesup”; this data was designed to look at when foreign states will actively support a non-state armed group. Say we want to do a simple logit but we need some of our variables to be lagged to account for temporal endogeneity. Specifically, let’s see if conflict between the target government and the foriegn supporter in the previous year influences support for the NAG.

NAG_Support <- read.csv("dataPersonal/NAG_Support_Data.csv")

NAG_Support_dt <- data.table(NAG_Support)
NAG_Support_dt[,lagConfHist := lag(ConflictHistory), by = c("dyadno")]

# Now lets run our SIMPLE logit model:
model1 <- glm(activesup~lagConfHist, family=binomial(link=logit), data=NAG_Support_dt)
summary(model1)

## 
## Call:
## glm(formula = activesup ~ lagConfHist, family = binomial(link = logit), 
##     data = NAG_Support_dt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5527  -0.5527  -0.1366  -0.1366   3.0591  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.6698     0.2009  -23.24   <2e-16 ***
## lagConfHist   2.8683     0.2160   13.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1652.8  on 4004  degrees of freedom
## Residual deviance: 1355.0  on 4003  degrees of freedom
## AIC: 1359
## 
## Number of Fisher Scoring iterations: 7

Instead, we might want to account for the time until active support is provided. For this we have to set up our data properly for duration analysis. We need to consider the following:

Need for lagged covariates
Time invariant covariates
Grouping variable (id…in our case, triad id)

I’m not going to deal with the time invariant problems. This is something you will learn about extensively in other classes. However, I will show you how to properly set up your data.

model2 <- coxph(formula = Surv(year, activesup) ~ alliance + contig + cinc_tar + cinc_ps +
        ConflictHistory, data = NAG_Support)
model2

## Call:
## coxph(formula = Surv(year, activesup) ~ alliance + contig + cinc_tar + 
##     cinc_ps + ConflictHistory, data = NAG_Support)
## 
##                     coef exp(coef) se(coef)     z       p
## alliance          0.3330    1.3951   0.0438  7.61 2.8e-14
## contig           -0.2242    0.7992   0.0335 -6.70 2.1e-11
## cinc_tar          3.2553   25.9267   4.8446  0.67     0.5
## cinc_ps           6.1764  481.2463   1.2025  5.14 2.8e-07
## ConflictHistory   2.5188   12.4138   0.2256 11.16 < 2e-16
## 
## Likelihood ratio test=357.9  on 5 df, p=<2e-16
## n= 3875, number of events= 199 
##    (130 observations deleted due to missingness)

And remember:

Use the help window and vignettes when you’re navigating R!