INTRODUCTION


1. Workshop objectives

  • Introduce applied programming (intensively)
  • Improve our troubleshooting skills
  • Get exposure to what’s possible (so much is possible, especially in open source!)
  • Assist in what’s relevant to you all


2. Where are we right now?

  • In a document format called R Markdown.
  • Basically a notebook-style document, edited in RStudio, that combines Markdown text (and LaTeX, for math and PDF output) with R code.
  • It didn’t exist when we started grad school, but it’s a great way to take notes and write papers.
  • Documents can be conveniently converted to HTML, PDF, etc.
  • We’ll see some basic functionality of this thing another day.


3. First, let’s get to know each other.

  • Who we are.
  • Who you are (interests, data, methods you might use).
  • What programming issues are eating your soul?


GETTING STARTED


1. Following along

  1. Go to your web browser and type this URL: http://rpubs.com/crbelden/363736
  2. Open RStudio (if it isn’t open yet) and open a new script.
  3. We’ll annotate this script as we go, too (and you’ll have access to the URL indefinitely).
  4. Create a folder to store the data and your script (follow our folder format).
  5. Install the following packages: “XML”, “xml2”, “igraph”, “network”


2. Staying organized

Let’s review “working directory” and paths. Most people are moving away from setting highly specific working directories. Still, we think it’s a good idea to set a working directory to the main folder in which you’re working, and use relative paths to load and save data.

# Set your own working directory for this workshop
setwd("/Users/corybelden/Documents/my-stuff/PAMR")

# To see where you are
getwd()
## [1] "/Users/corybelden/Documents/my-stuff/PAMR"
# A manual trick...

# A relative path to read in data
dataDemoEx = readRDS("dataDemos/myVar_data.rds")

# Another relative path to read in data 
dataPersonalEx = readRDS("dataPersonal/ausR_geos_2-2018.rds")

# What's in my folder?
list.files()
## [1] "dataDemos"            "dataPersonal"         "PAMR_workshop_1.html"
## [4] "PAMR_workshop_1.Rmd"  "PAMR_workshop_2.Rmd"  "PARM_Week1_Notes.R"  
## [7] "pictures"             "rsconnect"            "syllabus"
# Which are my folders?
list.dirs()
##  [1] "."                                                        
##  [2] "./dataDemos"                                              
##  [3] "./dataPersonal"                                           
##  [4] "./pictures"                                               
##  [5] "./rsconnect"                                              
##  [6] "./rsconnect/documents"                                    
##  [7] "./rsconnect/documents/PAMR_workshop_1.Rmd"                
##  [8] "./rsconnect/documents/PAMR_workshop_1.Rmd/rpubs.com"      
##  [9] "./rsconnect/documents/PAMR_workshop_1.Rmd/rpubs.com/rpubs"
## [10] "./syllabus"
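
If typing out paths by hand feels error-prone, a small sketch like the one below may help. It builds a relative path from pieces with file.path(). The file name "example.rds" is made up, and tempdir() stands in for your main workshop folder so the sketch does not touch your real files.

```r
# Sketch: build a relative path from pieces (base R only).
# "example.rds" is a hypothetical file name; tempdir() stands in for your main folder.
projectDir = tempdir()
relPath = file.path("dataDemos", "example.rds")   # "dataDemos/example.rds"

# Create the subfolder inside the stand-in project folder
dir.create(file.path(projectDir, "dataDemos"), showWarnings = FALSE)

# Save and re-load an object using the combined path
fullPath = file.path(projectDir, relPath)
saveRDS(1:3, fullPath)
exampleBack = readRDS(fullPath)

# In a real script, check that a file exists before reading it
file.exists(fullPath)
```

In your own script, setwd() to the main folder would replace projectDir, and relPath alone would be enough for readRDS() and saveRDS().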


How to organize scripts?

  • At top, a note about the purpose of the script.
  • After that, set the working directory.
  • Then load your libraries.
  • Annotate, annotate, annotate!
    • Tell yourself what each line does (until you’re really familiar).
    • Make a note of the number of observations if you’re changing data.
  • Name your variables systematically.
    • Use capital letters (myVar) or an underscore (my_var) to separate words; a hyphen will not work because R reads it as a minus sign.
    • Have a logic to naming your variables (don’t just use “x”).
    • Have “test” names to remind yourself which are trial runs and which are not.
    • Don’t use names that are also functions in R!
    • Be careful about when you overwrite a variable by reusing its name.
  • Keep scripts fairly concise.
    • Have separate “collect”, “analyze”, and “results” scripts.
    • Scripts that do too many tasks get overwhelming.
    • Have “play” and “final” scripts (basic way to do version control).
  • Don’t overwrite your raw data!
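
One possible way to put the checklist above into practice is sketched below. The data and every object name (rawScores, cleanScores, testMeanScore) are invented for illustration; treat it as a starting template rather than a required layout.

```r
# PURPOSE: Demo skeleton showing one way to lay out a script (hypothetical example)

# 1. Working directory (commented out here; point it at your own main folder)
# setwd("path/to/your/main/folder")

# 2. Libraries
library(stats)  # ships with R; swap in the packages you actually need

# 3. Raw data (made up here); never overwrite this object or its file
rawScores = data.frame(id = 1:5, score = c(10, NA, 7, 3, 8))
nrow(rawScores)    # 5 observations before cleaning

# 4. Work on a copy with a new, systematic name
cleanScores = subset(rawScores, !is.na(score))
nrow(cleanScores)  # 4 observations after dropping the missing score

# 5. Trial runs get a "test" prefix so they are easy to spot later
testMeanScore = mean(cleanScores$score)
testMeanScore      # 7
```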


3. Understanding the Most Relevant R “Objects”

Some basics in applied programming:

  • R objects can live in global and local (e.g., within a function) environments.
  • Dataframes are only one type of object.
  • NA (a missing value inside a vector) versus NULL (an empty object, i.e., nothing is there at all).
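
A short sketch of these basics, using made-up objects (ages, nothing, addOne), might look like this:

```r
# NA is a placeholder inside a vector: the slot exists but its value is unknown
ages = c(31, NA, 45)
length(ages)             # 3
is.na(ages)              # FALSE TRUE FALSE

# NULL is the absence of an object: it has length 0 and drops out of c()
nothing = NULL
length(nothing)          # 0
length(c(31, NULL, 45))  # 2

# Local versus global: objects created inside a function stay inside it
addOne = function(x) {
  localResult = x + 1    # exists only while the function runs
  localResult
}
globalResult = addOne(1)
globalResult             # 2
exists("localResult")    # FALSE, because it never reached the global environment
```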

Key R Objects:

  • Variable (or vector, i.e., contiguous cells containing data) classes
    • Numeric: Decimals.
    • Integer: No decimals.
    • Character: Words and letters (strings).
    • Logical: Evaluation (TRUE or FALSE).
    • Factor: A vector of integer values with a corresponding set of character values, i.e., a categorical variable (R versions before 4.0.0 turn strings into factors by default when building dataframes).
  • Blocks of data
    • Lists: An ordered collection of objects, often vectors (can be different modes and different lengths).
    • Matrices: vectors with the same class (e.g., numeric) and length.
    • Arrays: Similar to matrices, but with more than two dimensions.
    • Dataframes: vectors with the same length.
    • Specialized types, like XML nodesets
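
To make the list above concrete, here is a small sketch that builds one toy object of each kind. All names starting with "ex" are invented for this illustration.

```r
# One toy object per vector class
exNumeric = c(1.5, 2.25)
exInteger = 1:3
exCharacter = c("a", "b")
exLogical = c(TRUE, FALSE)
exFactor = factor(c("low", "high", "low"))

class(exNumeric)    # "numeric"
class(exInteger)    # "integer"
class(exFactor)     # "factor"
typeof(exFactor)    # "integer": factors are stored as integer codes

# One toy object per "block" type
exList = list(exNumeric, exCharacter)      # different modes and lengths are fine
exMatrix = matrix(1:6, nrow = 2)           # one class, two dimensions
exArray = array(1:24, dim = c(2, 3, 4))    # one class, three dimensions
exDf = data.frame(id = 1:2, label = c("a", "b"))

is.matrix(exMatrix) # TRUE
dim(exArray)        # 2 3 4
is.list(exDf)       # TRUE: a dataframe is a list of equal-length vectors
```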


Get in the habit of checking object types with “class”. There are several related functions, including “typeof”, but “class” will probably work for most of our purposes.

# Create integer variable
myVar = 1:3 
xmyVar <- 1:3  # "<-" also assigns; here it does the same job as "="
class(myVar)
## [1] "integer"
# Overwrite variable as character (c = combine)
myVar = c("apple", "banana", "cornbread")
class(myVar)
## [1] "character"
# You can save any R object as an RDS file (or other file types)
saveRDS(myVar, "dataDemos/myVar_data.rds")

# To read back in
myVar = readRDS("dataDemos/myVar_data.rds")

# Create another character variable
myVar2 = c("grapefruit", "orange")

# Add vectors together
myVarTog = c(myVar, myVar2)
myVarTog
## [1] "apple"      "banana"     "cornbread"  "grapefruit" "orange"
class(myVarTog)
## [1] "character"
# Turn into dataframe
myVarTogDf = as.data.frame(myVarTog)
myVarTogDf
##     myVarTog
## 1      apple
## 2     banana
## 3  cornbread
## 4 grapefruit
## 5     orange
# Add name to variable (syntax only works for single vector)
names(myVarTogDf) = "fruit"

# Now block has class and so does variable
class(myVarTogDf)
## [1] "data.frame"
class(myVarTogDf$fruit)
## [1] "factor"
# Darn factors! A key aggravation in dfs
# (In R 4.0.0 and later, as.data.frame() keeps strings as character by default)

# Change back to character and check
myVarTogDf$fruit = as.character(myVarTogDf$fruit)
class(myVarTogDf$fruit)
## [1] "character"
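
A related factor pitfall, sketched with made-up data (yearFactor and yearDf are hypothetical names): converting a factor of number-like strings straight to numeric returns the internal level codes, not the values you see printed.

```r
# A factor whose labels look like numbers
yearFactor = factor(c("1980", "1990", "1980"))

# Direct conversion returns the level codes
as.numeric(yearFactor)                 # 1 2 1

# Going through character first recovers the printed values
as.numeric(as.character(yearFactor))   # 1980 1990 1980

# Asking for characters up front avoids the problem in any R version
yearDf = data.frame(year = c("1980", "1990"), stringsAsFactors = FALSE)
class(yearDf$year)                     # "character"
```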


Most data that is now useful to us does not come in a rectangular box. It comes in lists. As a reminder, lists have elements, each of which can contain any type of R object. They can also be nested.

To understand lists, it’s helpful to learn how to index. Most indexing involves trial and error—that’s okay! The three general index properties below work for simpler lists, but indexing gets more complex as the list structure gets more complex. Note that sometimes looking at the list “tree” in the Global Environment can help you visualize the nested structure.

  • Single-bracket: Retrieve a list “slice” (sometimes list slices are nested within elements).
  • Double-bracket: Retrieve an element from that list slice (sometimes many nested elements within one slice).
  • Single-bracket after double-bracket: Retrieve a member of that element.
# Let's make our character vector a list
myVarTog = list(myVarTog)
myVarTog
## [[1]]
## [1] "apple"      "banana"     "cornbread"  "grapefruit" "orange"
# Cornbread isn't a fruit! Let's change it to NA

# Identifying "cornbread" with index
myVarTog[[1]][3]
## [1] "cornbread"
# Same output because there's only one "list slice"
myVarTog[1][[1]][3]
## [1] "cornbread"
# Change "cornbread" to missing
myVarTog[1][[1]][3] = NA
myVarTog
## [[1]]
## [1] "apple"      "banana"     NA           "grapefruit" "orange"
# Could also assign name to list
names(myVarTog) = "fruit"

# Now we can look at the new NA by name and position
myVarTog$fruit[3]
## [1] NA
# Use logical to confirm that it changed
is.na(myVarTog$fruit)
## [1] FALSE FALSE  TRUE FALSE FALSE
# Can make the logical variable another list element
# (careful: despite its name, isFruit is TRUE where the value is missing)
myVarTog$isFruit = is.na(myVarTog$fruit)
myVarTog
## $fruit
## [1] "apple"      "banana"     NA           "grapefruit" "orange"    
## 
## $isFruit
## [1] FALSE FALSE  TRUE FALSE FALSE
# Some basic list examples
y = c(7, 2, 9, 10) # numeric values
z = c("aa", "bb", "cc", "zz") # character values
x = c(TRUE, FALSE, TRUE, FALSE, FALSE) # logical values
listTog = list(y, z, x) # list of all three vectors (or list "slices")
listTog
## [[1]]
## [1]  7  2  9 10
## 
## [[2]]
## [1] "aa" "bb" "cc" "zz"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE FALSE FALSE
  # Can't turn listTog into a dataframe! (Different lengths)

# Retrieve 2nd list slice with single square bracket
listTog[2]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
# Retrieve slice containing the second and first elements of listTog
listTog[c(2, 1)]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
## 
## [[2]]
## [1]  7  2  9 10
listTog[c(1, 2)]
## [[1]]
## [1]  7  2  9 10
## 
## [[2]]
## [1] "aa" "bb" "cc" "zz"
# Double square bracket gets us the element of the list directly (note missing [[]] in output)
listTog[[2]]
## [1] "aa" "bb" "cc" "zz"
# Another single bracket gets us the member of that element
listTog[[2]][1]
## [1] "aa"
# Note that this is not equivalent!
listTog[2][1]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
# This IS equivalent (to listTog[[2]][1] above), but more precise--you can do it either way
# Says: go within slice 2, give me element 1, and the first member of element 1
listTog[2][[1]][1]
## [1] "aa"
# Another example being precise: find the "2"
listTog
## [[1]]
## [1]  7  2  9 10
## 
## [[2]]
## [1] "aa" "bb" "cc" "zz"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE FALSE FALSE
# listTog[1][[2]] 
  # Oops, this code won't work because there's no second element of the first list slice

# Need to add the member of the element
listTog[1][[1]][2]
## [1] 2
# Now we can change the "2" to "8"
listTog[1][[1]][2] = 8
listTog
## [[1]]
## [1]  7  8  9 10
## 
## [[2]]
## [1] "aa" "bb" "cc" "zz"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE FALSE FALSE
# A more complex example, lists upon lists (turtles all the way down)

# Create new vectors and make into list
newVarText = c("i", "am", "learning")
newVarNum = c(1, 50)
newListTog = list(newVarText, newVarNum)

# Make a list of both lists (this isn't what you'd be creating, but how your data might arrive)
# Note that "complexList" is nested one level deeper because it is a list made of two lists
complexList = list(listTog, newListTog)
complexList
## [[1]]
## [[1]][[1]]
## [1]  7  8  9 10
## 
## [[1]][[2]]
## [1] "aa" "bb" "cc" "zz"
## 
## [[1]][[3]]
## [1]  TRUE FALSE  TRUE FALSE FALSE
## 
## 
## [[2]]
## [[2]][[1]]
## [1] "i"        "am"       "learning"
## 
## [[2]][[2]]
## [1]  1 50
# Now identify the element retrieved earlier with listTog[[2]]
# Get first list slice, then grab first element and second nested element
complexList[1][[1]][[2]]
## [1] "aa" "bb" "cc" "zz"
# This code fails because of list's nested structure
# complexList[1][[2]]

# Note there are also slices within elements
complexList[1][[1]][2]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
# To retrieve first element from second list slice
complexList[2][[1]][[1]]
## [1] "i"        "am"       "learning"
# And now, its best member!
complexList[2][[1]][[1]][3]
## [1] "learning"
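
The long index chains above can often be shortened. The sketch below rebuilds a smaller nested list (nestedList, innerA, and innerB are made-up names, with values borrowed from the example) and shows three shortcuts that may be worth trying: chained double brackets, a vector inside double brackets, and names.

```r
# A smaller nested list in the spirit of complexList
innerA = list(c(7, 8, 9, 10), c("aa", "bb", "cc", "zz"))
innerB = list(c("i", "am", "learning"), c(1, 50))
nestedList = list(innerA, innerB)

# Chained double brackets skip the single-bracket "slice" step
nestedList[[2]][[1]][3]      # "learning"

# A vector inside [[ ]] walks down the list levels: [[c(2, 1)]] means [[2]][[1]]
nestedList[[c(2, 1)]][3]     # "learning"

# Naming the elements makes long paths easier to read
names(nestedList) = c("demo", "notes")
names(nestedList$notes) = c("words", "numbers")
nestedList$notes$words[3]    # "learning"

# str() prints the "tree" so you can see the nesting before indexing
str(nestedList)
```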


Dataframes are more intuitive and can make your life easier (sometimes).

  • Especially when data is rectangular and you’re learning.
  • And when you’re subsetting or removing elements.
  • You have to be careful of changes in variable types (“stringsAsFactors” can help.)
  • Note that the “$” sign indicates an attribute/name in a list and a variable in a dataframe; basically equivalent.
# Using dataframe, can subset to get rid of the NA we no longer want
myVarTog
## $fruit
## [1] "apple"      "banana"     NA           "grapefruit" "orange"    
## 
## $isFruit
## [1] FALSE FALSE  TRUE FALSE FALSE
myVarTogDf = as.data.frame(myVarTog, stringsAsFactors = FALSE)
myVarFixed = subset(myVarTogDf, !is.na(fruit))
myVarFixed
##        fruit isFruit
## 1      apple   FALSE
## 2     banana   FALSE
## 4 grapefruit   FALSE
## 5     orange   FALSE
# Can make similar value changes in dataframe
myVarFixed$fruit[myVarFixed$fruit == "banana"] = "bananas"
myVarFixed
##        fruit isFruit
## 1      apple   FALSE
## 2    bananas   FALSE
## 4 grapefruit   FALSE
## 5     orange   FALSE
# Note that NA needs its own syntax (is.na()), while ordinary values can be matched with ==.
myVarApple = subset(myVarFixed, fruit=="apple")
myVarApple
##   fruit isFruit
## 1 apple   FALSE
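
Why does NA need its own syntax? A minimal sketch with a made-up dataframe (fruitDf) shows what happens when you compare against a missing value, and how bracket subsetting differs from subset().

```r
fruitDf = data.frame(
  fruit = c("apple", "banana", NA, "orange"),
  stringsAsFactors = FALSE
)

# Comparing with == gives NA (not FALSE) wherever the value is missing
fruitDf$fruit == "apple"     # TRUE FALSE NA FALSE

# is.na() is the reliable test for missingness
is.na(fruitDf$fruit)         # FALSE FALSE TRUE FALSE

# subset() silently drops rows where the condition is NA...
appleSubset = subset(fruitDf, fruit == "apple")
nrow(appleSubset)            # 1

# ...but bracket subsetting keeps them as an all-NA row
appleBracket = fruitDf[fruitDf$fruit == "apple", , drop = FALSE]
nrow(appleBracket)           # 2

# "$" and [[ ]] reach the same column
identical(fruitDf$fruit, fruitDf[["fruit"]])   # TRUE
```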


4. Why Not Just Stay in the Rectangular World?

Most data in the social sciences (and in geography) has a nested structure and often arrives as nested lists.

  • Networks
  • Text (including XML and HTML)
  • Spatial

We’ll show you two examples of why learning to “speak” in lists is useful.


DATA IN NETWORK FORM

Some network tips:

  • You can run a basic regression model without network statistics.
  • Usually bring in edgelist and attributes separately.
  • Edgelists are inherently dyadic, even if network analysis/stats are evaluated at other levels of analysis.
  • Many of the network packages in R are buggy!
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(network)
## network: Classes for Relational Data
## Version 1.13.0 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##                     Mark S. Handcock, University of California -- Los Angeles
##                     David R. Hunter, Penn State University
##                     Martina Morris, University of Washington
##                     Skye Bender-deMoll, University of Washington
##  For citation information, type citation("network").
##  Type help("network-package") to get started.
## 
## Attaching package: 'network'
## The following objects are masked from 'package:igraph':
## 
##     %c%, %s%, add.edges, add.vertices, delete.edges,
##     delete.vertices, get.edge.attribute, get.edges,
##     get.vertex.attribute, is.bipartite, is.directed,
##     list.edge.attributes, list.vertex.attributes,
##     set.edge.attribute, set.vertex.attribute
# Reading in a single network as dyadic edgelist with attributes
Ally_80 <- read.delim("dataPersonal/1980dyadicattributes.csv")
data <- data.frame(Ally_80)

# Basic model (without network statistics)
basic_model <- glm(dichtrade ~ dichenmyofenemy + joindem + atopally, data = data, family = "binomial")
summary(basic_model) 
## 
## Call:
## glm(formula = dichtrade ~ dichenmyofenemy + joindem + atopally, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8242  -0.5964  -0.5964  -0.5964   1.9049  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.63647    0.01852 -88.385  < 2e-16 ***
## dichenmyofenemy  1.19663    0.16089   7.438 1.02e-13 ***
## joindem          1.51080    0.06839  22.091  < 2e-16 ***
## atopally         1.57958    0.04721  33.455  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24752  on 24179  degrees of freedom
## Residual deviance: 22871  on 24176  degrees of freedom
## AIC: 22879
## 
## Number of Fisher Scoring iterations: 3
# Edgelist represents dichotomous trade in 1980, 
  # binary based on a threshold rule.
elist1 <- read.delim("dataPersonal/1980edgelist.csv")
net1980 <- network(elist1, matrix.type="edgelist")

#### You won't cover this until advanced networks, but you may want this code.
#### Using edgelist from 1980 dichotomous trade, run ergm model to get NS values
# Ergm1 <- ergm(net1980 ~ triangle + density + twopath)
# summary(Ergm1)

# Adding attributes to edgelist
node_attr <- read.csv("dataPersonal/1980attributes.csv")
head(network.vertex.names(net1980), 30)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30
net1980%v%"alliance" <- node_attr$atopally
net1980%v%"enemy" <- node_attr$dichenmyofenemy
net1980%v%"joint democracy" <- node_attr$joindem

# Let's explore our network components:
list.vertex.attributes(net1980)
## [1] "alliance"        "enemy"           "joint democracy" "na"             
## [5] "vertex.names"
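
If the packages misbehave, it can help to see what an edgelist holds using base R alone. The sketch below uses a made-up directed edgelist (toyEdges) and a made-up attribute table (toyAttr); it is not the 1980 trade data, and it is only one of several ways to build an adjacency matrix.

```r
# Toy directed edgelist: each row is one dyad, sender -> receiver
toyEdges = data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("B", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# Every node that appears on either side of a dyad
toyNodes = sort(unique(c(toyEdges$from, toyEdges$to)))

# Empty adjacency matrix with node names on rows and columns
toyAdj = matrix(0, nrow = length(toyNodes), ncol = length(toyNodes),
                dimnames = list(toyNodes, toyNodes))

# A two-column character matrix picks out one cell per edge
toyAdj[cbind(toyEdges$from, toyEdges$to)] = 1
toyAdj

# Node attributes live in a separate table, one row per node
toyAttr = data.frame(node = toyNodes, democracy = c(1, 0, 1))

# Row sums give each node's out-degree
rowSums(toyAdj)
```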


There are other ways we can handle these types of data that can be explored in more depth. Below is an example of how we leverage loops to handle matrix algebra with lists that represent “layers” of networks. Note that this is very advanced network analysis, but the example highlights the following things:

  • Using lists of arrays (in this case labelled “dat”)
  • Using lists containing multiple adjacency matrices
  • Using notation like [,,1], [,,2] to call specific layers of matrices stored in a third dimension…that’s right, a THIRD dimension.
  • Using functions like “mapply” in place of the basic operators you would usually use to do simple matrix algebra.
# Empty list for cooperate matrices
cooperate <- list() 

# Empty list for conflict matrices
conflict <- list() 

# Looping over matrices
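for(i in seq_along(dat)){
  # mapply(sum, ...) adds the layers cell by cell; matrix() then stores the result as a single column
  cooperate[[i]] <-  matrix(mapply(sum, dat[[i]][,,1],dat[[i]][,,2],dat[[i]][,,3]))
  conflict[[i]] <-  matrix(mapply(sum, dat[[i]][,,4],dat[[i]][,,5]))
}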
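
Since “dat” is not included in the workshop files, here is a toy stand-in (toyDat, built from made-up numbers) to show what the [,,k] notation does. Note that this sketch swaps mapply for plain “+” and apply(), which is an alternative that keeps each result as a matrix with its original rows and columns.

```r
# Two made-up 2 x 2 x 5 arrays: 2 rows, 2 columns, 5 layers each
toyArr1 = array(1:20, dim = c(2, 2, 5))
toyArr2 = array(rep(1, 20), dim = c(2, 2, 5))
toyDat = list(toyArr1, toyArr2)

# [,,1] pulls the first layer as an ordinary 2 x 2 matrix
toyArr1[, , 1]

# Pre-sized lists to hold one result per array
toyCooperate = vector("list", length(toyDat))
toyConflict = vector("list", length(toyDat))

for (i in seq_along(toyDat)) {
  # Layers 1-3 added cell by cell with "+"
  toyCooperate[[i]] = toyDat[[i]][, , 1] + toyDat[[i]][, , 2] + toyDat[[i]][, , 3]
  # Layers 4-5 summed over the third dimension with apply()
  toyConflict[[i]] = apply(toyDat[[i]][, , 4:5], c(1, 2), sum)
}

dim(toyCooperate[[1]])   # 2 2
toyCooperate[[1]][1, 1]  # 15 (1 + 5 + 9)
toyConflict[[1]][2, 2]   # 36 (16 + 20)
```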


DATA IN TEXT FORM

Basics:

  • XML (extensible markup language) is a flexible way to share structured data via the Internet.
  • HTML (hypertext markup language) is the language used to create webpages. It’s usually messier than XML.
  • You need to understand indexing in order to extract information en masse from these formats.
# XML data 
library(XML)
library(xml2)

# Read xml data
members = read_xml("http://clerk.house.gov/xml/lists/MemberData.xml")

# Identify top node of interest
mem_tags = xml_find_all(members, "//member")
head(mem_tags)
## {xml_nodeset (6)}
## [1] <member>\n  <statedistrict>AK00</statedistrict>\n  <member-info>\n   ...
## [2] <member>\n  <statedistrict>AL01</statedistrict>\n  <member-info>\n   ...
## [3] <member>\n  <statedistrict>AL02</statedistrict>\n  <member-info>\n   ...
## [4] <member>\n  <statedistrict>AL03</statedistrict>\n  <member-info>\n   ...
## [5] <member>\n  <statedistrict>AL04</statedistrict>\n  <member-info>\n   ...
## [6] <member>\n  <statedistrict>AL05</statedistrict>\n  <member-info>\n   ...
# Trial and error until you identify information you want
lastName = as.character(xml_find_all(members, "//lastname/text()"))
head(lastName)
## [1] "Young"    "Byrne"    "Roby"     "Rogers"   "Aderholt" "Brooks"
firstName = as.character(xml_find_all(members, "//firstname/text()"))
head(firstName)
## [1] "Don"     "Bradley" "Martha"  "Mike"    "Robert"  "Mo"
party = as.character(xml_find_all(members, "//party/text()"))
head(party)
## [1] "R" "R" "R" "R" "R" "R"
# Make a dataframe of equal lengths
currentMems = as.data.frame(cbind(firstName, lastName, party))
head(currentMems)
##   firstName lastName party
## 1       Don    Young     R
## 2   Bradley    Byrne     R
## 3    Martha     Roby     R
## 4      Mike   Rogers     R
## 5    Robert Aderholt     R
## 6        Mo   Brooks     R
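
One thing to watch with document-wide searches like "//party/text()": if any record lacks a tag, the resulting vectors will have different lengths and will no longer line up. The sketch below uses made-up XML (toyXml, which only imitates the tag names above) to show a per-member search with xml_find_first(), which returns one value per member and NA where a tag is missing.

```r
library(xml2)

# Made-up XML; the second member has no <party> tag
toyXml = read_xml(
  "<roster>
     <member><lastname>Alpha</lastname><firstname>Ann</firstname><party>D</party></member>
     <member><lastname>Beta</lastname><firstname>Bo</firstname></member>
     <member><lastname>Gamma</lastname><firstname>Cy</firstname><party>R</party></member>
   </roster>"
)

toyMembers = xml_find_all(toyXml, "//member")
length(toyMembers)                        # 3

# A document-wide search only returns tags that exist
length(xml_find_all(toyXml, "//party"))   # 2

# Searching inside each member keeps one value per member
toyDf = data.frame(
  firstName = xml_text(xml_find_first(toyMembers, "./firstname")),
  lastName = xml_text(xml_find_first(toyMembers, "./lastname")),
  party = xml_text(xml_find_first(toyMembers, "./party")),
  stringsAsFactors = FALSE
)
toyDf
```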
# Google results data (mimics url source)
ausGeos = readRDS("dataPersonal/ausR_geos_2-2018.rds")

# Pulling elements based on attribute names and index
ausGeos[[1]]$orig_loc
## [1] "Australia"
ausGeos[[1]]$res$geometry$location$lat
## [1] -25.2744
ausGeos[[1]]$res$address_components[[1]]$types[[2]]
## [1] "political"
# Could find the same elements via pure index, but boy these get long! 
ausGeos[1][[1]][1]
## $orig_loc
## [1] "Australia"
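
If you don't have the ausGeos file, the mock list below can be used to practice the same moves. Its structure only loosely imitates the geocoding results above, and every value in it (mockGeos, "Somewhere", the coordinates) is made up.

```r
# Mock geocoding results: a list with one nested element per location
mockGeos = list(
  list(
    orig_loc = "Somewhere",
    res = list(
      geometry = list(location = list(lat = 1.5, lng = 2.5)),
      address_components = list(list(types = list("country", "political")))
    )
  ),
  list(
    orig_loc = "Elsewhere",
    res = list(
      geometry = list(location = list(lat = -3, lng = 4)),
      address_components = list(list(types = list("country", "political")))
    )
  )
)

# By name: readable, and it survives changes in element order
mockGeos[[1]]$res$geometry$location$lat    # 1.5

# By pure index: same element, but hard to read
mockGeos[[1]][[2]][[1]][[1]][[1]]          # 1.5

# Pull the same field from every element at once
mockLats = sapply(mockGeos, function(g) g$res$geometry$location$lat)
mockLats                                   # 1.5 -3.0

# str() with max.level shows the top of the tree without flooding the console
str(mockGeos, max.level = 3)
```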