Let’s review “working directory” and paths. Most people are moving away from setting highly specific working directories. Still, we think it’s a good idea to set a working directory to the main folder in which you’re working, and use relative paths to load and save data.
# Set your own working directory for this workshop
setwd("/Users/corybelden/Documents/my-stuff/PAMR")
# To see where you are
getwd()
## [1] "/Users/corybelden/Documents/my-stuff/PAMR"
# A manual trick...
# A relative path to read in data
dataDemoEx = readRDS("dataDemos/myVar_data.rds")
# Another relative path to read in data
dataPersonalEx = readRDS("dataPersonal/ausR_geos_2-2018.rds")
# What's in my folder?
list.files()
## [1] "dataDemos" "dataPersonal" "PAMR_workshop_1.html"
## [4] "PAMR_workshop_1.Rmd" "PAMR_workshop_2.Rmd" "PARM_Week1_Notes.R"
## [7] "pictures" "rsconnect" "syllabus"
# Which are my folders?
list.dirs()
## [1] "."
## [2] "./dataDemos"
## [3] "./dataPersonal"
## [4] "./pictures"
## [5] "./rsconnect"
## [6] "./rsconnect/documents"
## [7] "./rsconnect/documents/PAMR_workshop_1.Rmd"
## [8] "./rsconnect/documents/PAMR_workshop_1.Rmd/rpubs.com"
## [9] "./rsconnect/documents/PAMR_workshop_1.Rmd/rpubs.com/rpubs"
## [10] "./syllabus"
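Here is a minimal sketch of the relative-path habit described above. It uses a temporary folder as a stand-in for your project folder so nothing on disk gets overwritten. The folder name `dataDemos` follows the workshop layout; `projDir`, `demoPath`, and `fruitBack` are hypothetical names introduced only for this example.

```r
# Stand-in for the project folder (assumption: a temp folder, not your real one)
projDir = file.path(tempdir(), "PAMR_demo")
dir.create(file.path(projDir, "dataDemos"), recursive = TRUE, showWarnings = FALSE)

# setwd() returns the previous working directory, so we can go back later
oldDir = setwd(projDir)

# Build the relative path piece by piece; file.path() inserts the separator
demoPath = file.path("dataDemos", "myVar_data.rds")
saveRDS(c("apple", "banana"), demoPath)

# Check before reading: a wrong working directory is a common cause of
# "cannot open file" errors
file.exists(demoPath)
fruitBack = readRDS(demoPath)

# Return to where we started
setwd(oldDir)
```

Because the path is relative, the same script should run on anyone's machine once the working directory points at their copy of the project folder.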
How to organize scripts?
Some basics in applied programming:
Key R Objects:
Get in the habit of checking object types with “class”. There are several variations of this code, including “typeof”, but “class” will probably work for most of our purposes.
# Create integer variable
myVar = 1:3
xmyVar <- 1:3
class(myVar)
## [1] "integer"
# Overwrite variable as character (c = combine)
myVar = c("apple", "banana", "cornbread")
class(myVar)
## [1] "character"
# You can save any R object as an RDS file (or as other file types)
saveRDS(myVar, "dataDemos/myVar_data.rds")
# To read back in
myVar = readRDS("dataDemos/myVar_data.rds")
# Create another character variable
myVar2 = c("grapefruit", "orange")
# Add vectors together
myVarTog = c(myVar, myVar2)
myVarTog
## [1] "apple" "banana" "cornbread" "grapefruit" "orange"
class(myVarTog)
## [1] "character"
# Turn into dataframe
myVarTogDf = as.data.frame(myVarTog)
myVarTogDf
## myVarTog
## 1 apple
## 2 banana
## 3 cornbread
## 4 grapefruit
## 5 orange
# Add name to variable (syntax only works for single vector)
names(myVarTogDf) = "fruit"
# Now the dataframe has a class and so does the variable (column) inside it
class(myVarTogDf)
## [1] "data.frame"
class(myVarTogDf$fruit)
## [1] "factor"
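# Darn factors! A key aggravation in dataframes
# (In R 4.0.0 and later, stringsAsFactors defaults to FALSE, so this column would already be character)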
# Change back to character and check
myVarTogDf$fruit = as.character(myVarTogDf$fruit)
class(myVarTogDf$fruit)
## [1] "character"
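To make the difference between “class” and “typeof” concrete, here is a small sketch. The object names (`intVar`, `numVar`, `facVar`) are made up for illustration. It also shows why factors can surprise you: underneath the labels, a factor is stored as integer codes.

```r
intVar = 1:3                            # integer
numVar = c(7, 2, 9)                     # numeric: stored as double, even for whole numbers
facVar = factor(c("apple", "banana"))   # factor: labels on top of integer codes

class(intVar); typeof(intVar)   # "integer" and "integer"
class(numVar); typeof(numVar)   # "numeric" and "double"
class(facVar); typeof(facVar)   # "factor" and "integer"

# Converting a factor: as.character() gives the labels, as.integer() gives the codes
as.character(facVar)
as.integer(facVar)
```

For most of what we do, “class” tells you what you need; “typeof” is useful when you want to know how R is actually storing the values.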
Most data that is now useful to us does not come in a rectangular box. It comes in lists. As a reminder, lists have elements, each of which can contain any type of R object. They can also be nested.
To understand lists, it’s helpful to learn how to index. Most indexing involves trial and error—that’s okay! The three general index properties below work for simpler lists, but indexing gets more complex as the list structure gets more complex. Note that sometimes looking at the list “tree” in the Global Environment can help you visualize the nested structure.
# Let's make our character vector a list
myVarTog = list(myVarTog)
myVarTog
## [[1]]
## [1] "apple" "banana" "cornbread" "grapefruit" "orange"
# Cornbread isn't a fruit! Let's change it to NA
# Identifying "cornbread" with index
myVarTog[[1]][3]
## [1] "cornbread"
# Same output because there's only one "list slice"
myVarTog[1][[1]][3]
## [1] "cornbread"
# Change "cornbread" to missing
myVarTog[1][[1]][3] = NA
myVarTog
## [[1]]
## [1] "apple" "banana" NA "grapefruit" "orange"
# Could also assign name to list
names(myVarTog) = "fruit"
# Now we can look at the new NA by name and position
myVarTog$fruit[3]
## [1] NA
# Use logical to confirm that it changed
is.na(myVarTog$fruit)
## [1] FALSE FALSE TRUE FALSE FALSE
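# Can make the logical variable another list element
# (careful: despite its name, isFruit is TRUE where the value is missing, i.e. where the item is NOT a fruit)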
myVarTog$isFruit = is.na(myVarTog$fruit)
myVarTog
## $fruit
## [1] "apple" "banana" NA "grapefruit" "orange"
##
## $isFruit
## [1] FALSE FALSE TRUE FALSE FALSE
# Some basic list examples
y = c(7, 2, 9, 10) # numeric values (stored as doubles; write 7L, 2L, ... for true integers)
z = c("aa", "bb", "cc", "zz") # character values
x = c(TRUE, FALSE, TRUE, FALSE, FALSE) # logical values
listTog = list(y, z, x) # a list of all three vectors (or list "slices")
listTog
## [[1]]
## [1] 7 2 9 10
##
## [[2]]
## [1] "aa" "bb" "cc" "zz"
##
## [[3]]
## [1] TRUE FALSE TRUE FALSE FALSE
# Can't turn listTog into a dataframe! (Different lengths)
# Retrieve 2nd list slice with single square bracket
listTog[2]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
# Retrieve a slice containing the second and first elements of listTog (in that order)
listTog[c(2, 1)]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
##
## [[2]]
## [1] 7 2 9 10
listTog[c(1, 2)]
## [[1]]
## [1] 7 2 9 10
##
## [[2]]
## [1] "aa" "bb" "cc" "zz"
# Double square bracket gets us the element of the list directly (note missing [[]] in output)
listTog[[2]]
## [1] "aa" "bb" "cc" "zz"
# Another single bracket gets us the member of that element
listTog[[2]][1]
## [1] "aa"
# Note that this is not equivalent!
listTog[2][1]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
# This IS equivalent (to listTog[[2]][1] above), but more precise--you can do it either way
# Says: go within slice 2, give me element 1, and the first member of element 1
listTog[2][[1]][1]
## [1] "aa"
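If the single versus double bracket distinction still feels slippery, you can ask R directly what each form hands back. This is a small sketch on a made-up list (`demoList`, `sliceTwo`, and `elementTwo` are hypothetical names), using `is.list()` and `identical()` as checks.

```r
demoList = list(c(7, 2, 9, 10), c("aa", "bb", "cc", "zz"))

sliceTwo = demoList[2]      # single bracket: still a list (of length 1)
elementTwo = demoList[[2]]  # double bracket: the character vector itself

is.list(sliceTwo)     # TRUE
is.list(elementTwo)   # FALSE
length(sliceTwo)      # 1
length(elementTwo)    # 4

# A slice of a slice is still a slice, which is why demoList[2][1] returns the whole vector
identical(demoList[2][1], demoList[2])             # TRUE

# The short and the long route to the first member agree
identical(demoList[[2]][1], demoList[2][[1]][1])   # TRUE
```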
# Another example being precise: find the "2"
listTog
## [[1]]
## [1] 7 2 9 10
##
## [[2]]
## [1] "aa" "bb" "cc" "zz"
##
## [[3]]
## [1] TRUE FALSE TRUE FALSE FALSE
# listTog[1][[2]]
# Oops, this code won't work because there's no second element of the first list slice
# Need to add the member of the element
listTog[1][[1]][2]
## [1] 2
# Now we can change the 2 to an 8
listTog[1][[1]][2] = 8
listTog
## [[1]]
## [1] 7 8 9 10
##
## [[2]]
## [1] "aa" "bb" "cc" "zz"
##
## [[3]]
## [1] TRUE FALSE TRUE FALSE FALSE
# A more complex example: lists upon lists (turtles all the way down)
# Create new vectors and make into list
newVarText = c("i", "am", "learning")
newVarNum = c(1, 50)
newListTog = list(newVarText, newVarNum)
# Make a list of both lists (this isn't what you'd be creating, but how your data might arrive)
# Note that "complexList" has two top-level elements, each of which is itself a list, so every index needs one more level
complexList = list(listTog, newListTog)
complexList
## [[1]]
## [[1]][[1]]
## [1] 7 8 9 10
##
## [[1]][[2]]
## [1] "aa" "bb" "cc" "zz"
##
## [[1]][[3]]
## [1] TRUE FALSE TRUE FALSE FALSE
##
##
## [[2]]
## [[2]][[1]]
## [1] "i" "am" "learning"
##
## [[2]][[2]]
## [1] 1 50
# Now retrieve the same character vector we pulled earlier with listTog[[2]]
# Get first list slice, then grab first element and second nested element
complexList[1][[1]][[2]]
## [1] "aa" "bb" "cc" "zz"
# This code fails because of list's nested structure
# complexList[1][[2]]
# Note there are also slices within elements
complexList[1][[1]][2]
## [[1]]
## [1] "aa" "bb" "cc" "zz"
# To retrieve the first element from the second list slice
complexList[2][[1]][[1]]
## [1] "i" "am" "learning"
# And now, its best member!
complexList[2][[1]][[1]][3]
## [1] "learning"
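When the brackets start piling up, two things can help: printing the “tree” with `str()`, and giving the elements names. The sketch below rebuilds a nested list like the one above under a hypothetical name (`nestedDemo`), and the element names are invented for illustration.

```r
nestedDemo = list(
  list(c(7, 8, 9, 10), c("aa", "bb", "cc", "zz")),
  list(c("i", "am", "learning"), c(1, 50))
)

# str() prints the nested structure, much like the tree in the Global Environment
str(nestedDemo)

# Double brackets all the way down give a shorter route to the same member
nestedDemo[[2]][[1]][3]
identical(nestedDemo[[2]][[1]][3], nestedDemo[2][[1]][[1]][3])   # TRUE

# Naming the elements lets us replace positions with names
names(nestedDemo) = c("first", "second")
names(nestedDemo$second) = c("text", "num")
nestedDemo$second$text[3]
```

Names do not change the structure, so the positional indexes keep working after you add them.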
Dataframes are more intuitive and can make your life easier (sometimes).
# Using dataframe, can subset to get rid of the NA we no longer want
myVarTog
## $fruit
## [1] "apple" "banana" NA "grapefruit" "orange"
##
## $isFruit
## [1] FALSE FALSE TRUE FALSE FALSE
myVarTogDf = as.data.frame(myVarTog, stringsAsFactors = FALSE)
myVarFixed = subset(myVarTogDf, !is.na(fruit))
myVarFixed
## fruit isFruit
## 1 apple FALSE
## 2 banana FALSE
## 4 grapefruit FALSE
## 5 orange FALSE
# Can make similar value changes in dataframe
myVarFixed$fruit[myVarFixed$fruit == "banana"] = "bananas"
myVarFixed
## fruit isFruit
## 1 apple FALSE
## 2 bananas FALSE
## 4 grapefruit FALSE
## 5 orange FALSE
# Note that NA needs its own syntax: test for it with is.na(), while other values use ==
myVarApple = subset(myVarFixed, fruit=="apple")
myVarApple
## fruit isFruit
## 1 apple FALSE
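Here is a short sketch of why NA gets its own syntax. The vector `fruitDemo` and the dataframe `fruitDf` are made up for this example. Comparing anything to NA with `==` returns NA rather than TRUE or FALSE, so `is.na()` is the tool for finding missing values.

```r
fruitDemo = c("apple", NA, "orange")

fruitDemo == "apple"   # TRUE NA FALSE: the comparison with NA is itself NA
fruitDemo == NA        # NA NA NA: this never finds the missing value
is.na(fruitDemo)       # FALSE TRUE FALSE: this does

# Bracket subsetting keeps an NA wherever the condition is NA
fruitDemo[fruitDemo == "apple"]

# subset() treats an NA condition as FALSE, so the NA row drops out
fruitDf = data.frame(fruit = fruitDemo, stringsAsFactors = FALSE)
subset(fruitDf, fruit == "apple")
```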
Most data in the social sciences (and in geography) has a nested structure, and it often arrives as nested lists.
We’ll show you two examples of why learning to “speak” in lists is useful.
DATA IN NETWORK FORM
Some network tips:
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(network)
## network: Classes for Relational Data
## Version 1.13.0 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## Mark S. Handcock, University of California -- Los Angeles
## David R. Hunter, Penn State University
## Martina Morris, University of Washington
## Skye Bender-deMoll, University of Washington
## For citation information, type citation("network").
## Type help("network-package") to get started.
##
## Attaching package: 'network'
## The following objects are masked from 'package:igraph':
##
## %c%, %s%, add.edges, add.vertices, delete.edges,
## delete.vertices, get.edge.attribute, get.edges,
## get.vertex.attribute, is.bipartite, is.directed,
## list.edge.attributes, list.vertex.attributes,
## set.edge.attribute, set.vertex.attribute
# Reading in a single network as dyadic edgelist with attributes
Ally_80 <- read.delim("dataPersonal/1980dyadicattributes.csv")
data <- data.frame(Ally_80)
# Basic model (without network statistics)
basic_model <- glm(dichtrade ~ dichenmyofenemy + joindem + atopally, data = data, family = "binomial")
summary(basic_model)
##
## Call:
## glm(formula = dichtrade ~ dichenmyofenemy + joindem + atopally,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8242 -0.5964 -0.5964 -0.5964 1.9049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.63647 0.01852 -88.385 < 2e-16 ***
## dichenmyofenemy 1.19663 0.16089 7.438 1.02e-13 ***
## joindem 1.51080 0.06839 22.091 < 2e-16 ***
## atopally 1.57958 0.04721 33.455 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24752 on 24179 degrees of freedom
## Residual deviance: 22871 on 24176 degrees of freedom
## AIC: 22879
##
## Number of Fisher Scoring iterations: 3
# Edgelist represents dichotomous trade in 1980,
# binary based on a threshold rule.
elist1 <- read.delim("dataPersonal/1980edgelist.csv")
net1980 <- network(elist1, matrix.type="edgelist")
#### You won't cover this until advanced networks, but you may want this code.
#### Using edgelist from 1980 dichotomous trade, run ergm model to get NS values
# Ergm1 <- ergm(net1980 ~ triangle + density + twopath)
# summary(Ergm1)
# Adding attributes to edgelist
node_attr <- read.csv("dataPersonal/1980attributes.csv")
head(network.vertex.names(net1980), 30)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30
net1980%v%"alliance" <- node_attr$atopally
net1980%v%"enemy" <- node_attr$dichenmyofenemy
net1980%v%"joint democracy" <- node_attr$joindem
# Let's explore our network components:
list.vertex.attributes(net1980)
## [1] "alliance" "enemy" "joint democracy" "na"
## [5] "vertex.names"
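Vertex attributes are attached by position, so this step assumes the attribute file has one row per vertex, in the same order as the vertex names. The toy network below is a sketch of how to check that before assigning. The edgelist, the attribute values, and the names `toyEdges`, `toyNet`, and `toyRegion` are all invented for illustration.

```r
library(network)

# A made-up directed edgelist among three vertices
toyEdges = matrix(c(1, 2,
                    2, 3,
                    3, 1), ncol = 2, byrow = TRUE)
toyNet = network(toyEdges, matrix.type = "edgelist", directed = TRUE)

# One value per vertex, in vertex order
toyRegion = c("north", "south", "south")

# Check the lengths match before attaching the attribute
length(toyRegion) == network.size(toyNet)
toyNet %v% "region" <- toyRegion

# Read the attribute back and list everything attached to the vertices
toyNet %v% "region"
list.vertex.attributes(toyNet)
```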
There are other ways we can handle these types of data that can be explored in more depth. Below is an example of how we leverage loops to handle matrix algebra with lists that represent “layers” of networks. Note that this is very advanced network analysis, but the example highlights the following things:
# Empty list for cooperate matrices
cooperate <- list()
# Empty list for conflict matrices
conflict <- list()
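# Looping over the list; dat is assumed to be a list of 3-D arrays (row x column x layer)
# Note: matrix() with no dimensions returns a single column;
# add nrow = nrow(dat[[i]]) to keep the original shape
for (i in seq_along(dat)) {
  # Layers 1-3 summed cell by cell
  cooperate[[i]] <- matrix(mapply(sum, dat[[i]][, , 1], dat[[i]][, , 2], dat[[i]][, , 3]))
  # Layers 4-5 summed cell by cell
  conflict[[i]] <- matrix(mapply(sum, dat[[i]][, , 4], dat[[i]][, , 5]))
}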
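Since `dat` is not created in this handout, here is a self-contained sketch of the same idea on simulated data. Everything in it is an assumption made for illustration: `datDemo` is a made-up list of two 4 x 4 x 5 arrays, layers 1 to 3 are treated as cooperative and layers 4 to 5 as conflictual (mirroring the loop above), and plain matrix addition is swapped in for `mapply(sum, ...)` so that each result keeps its 4 x 4 shape.

```r
set.seed(1)

# Hypothetical stand-in for dat: two time periods, each a 4 x 4 x 5 array of counts
datDemo = lapply(1:2, function(i) array(rpois(4 * 4 * 5, lambda = 2), dim = c(4, 4, 5)))

# Pre-sized lists to hold one matrix per time period
cooperateDemo = vector("list", length(datDemo))
conflictDemo = vector("list", length(datDemo))

for (i in seq_along(datDemo)) {
  layers = datDemo[[i]]
  # Adding matrices works cell by cell and keeps the dimensions
  cooperateDemo[[i]] = layers[, , 1] + layers[, , 2] + layers[, , 3]
  conflictDemo[[i]] = layers[, , 4] + layers[, , 5]
}

dim(cooperateDemo[[1]])   # 4 4

# The same sum written with apply(), as a cross-check
all(cooperateDemo[[2]] == apply(datDemo[[2]][, , 1:3], c(1, 2), sum))
```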
DATA IN TEXT FORM
Basics:
# XML data
library(XML)
library(xml2)
# Read xml data
members = read_xml("http://clerk.house.gov/xml/lists/MemberData.xml")
# Identify top node of interest
mem_tags = xml_find_all(members, "//member")
head(mem_tags)
## {xml_nodeset (6)}
## [1] <member>\n <statedistrict>AK00</statedistrict>\n <member-info>\n ...
## [2] <member>\n <statedistrict>AL01</statedistrict>\n <member-info>\n ...
## [3] <member>\n <statedistrict>AL02</statedistrict>\n <member-info>\n ...
## [4] <member>\n <statedistrict>AL03</statedistrict>\n <member-info>\n ...
## [5] <member>\n <statedistrict>AL04</statedistrict>\n <member-info>\n ...
## [6] <member>\n <statedistrict>AL05</statedistrict>\n <member-info>\n ...
# Trial and error until you identify information you want
lastName = as.character(xml_find_all(members, "//lastname/text()"))
head(lastName)
## [1] "Young" "Byrne" "Roby" "Rogers" "Aderholt" "Brooks"
firstName = as.character(xml_find_all(members, "//firstname/text()"))
head(firstName)
## [1] "Don" "Bradley" "Martha" "Mike" "Robert" "Mo"
party = as.character(xml_find_all(members, "//party/text()"))
head(party)
## [1] "R" "R" "R" "R" "R" "R"
# Make a dataframe of equal lengths
currentMems = as.data.frame(cbind(firstName, lastName, party))
head(currentMems)
## firstName lastName party
## 1 Don Young R
## 2 Bradley Byrne R
## 3 Martha Roby R
## 4 Mike Rogers R
## 5 Robert Aderholt R
## 6 Mo Brooks R
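Building the dataframe from three separate vectors works when every member has every tag. If a tag is missing for some member, the vectors end up with different lengths and the rows can slip out of alignment. Below is a sketch of one way to guard against that, using a tiny made-up XML string rather than the House file. The names `miniXml`, `memNodes`, and `miniMems` are hypothetical.

```r
library(xml2)

# Hypothetical miniature of a member file; the second member has no <party> tag
miniXml = read_xml("<members>
  <member><firstname>Ann</firstname><lastname>Lee</lastname><party>D</party></member>
  <member><firstname>Bo</firstname><lastname>Ruiz</lastname></member>
</members>")

# One node per member
memNodes = xml_find_all(miniXml, "//member")

# xml_find_first() returns exactly one result per member,
# so a missing tag becomes NA instead of shortening the vector
miniMems = data.frame(
  firstName = xml_text(xml_find_first(memNodes, "./firstname")),
  lastName = xml_text(xml_find_first(memNodes, "./lastname")),
  party = xml_text(xml_find_first(memNodes, "./party")),
  stringsAsFactors = FALSE
)
miniMems
```

Searching within each member node, rather than across the whole document, is what keeps each row tied to one person.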
# Google results data (mimics url source)
ausGeos = readRDS("dataPersonal/ausR_geos_2-2018.rds")
# Pulling elements based on attribute names and index
ausGeos[[1]]$orig_loc
## [1] "Australia"
ausGeos[[1]]$res$geometry$location$lat
## [1] -25.2744
ausGeos[[1]]$res$address_components[[1]]$types[[2]]
## [1] "political"
# Could find the same elements via pure index, but boy these get long!
ausGeos[1][[1]][1]
## $orig_loc
## [1] "Australia"
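To practice the name-versus-position idea without the workshop file, here is a sketch on a made-up nested list. The object `geoDemo` and all of its values are invented; only the general shape (a list of results, each with an original location and a nested result) is borrowed from the example above.

```r
# Hypothetical structure loosely mimicking two geocoding results; values are made up
geoDemo = list(
  list(orig_loc = "Somewhere",
       res = list(geometry = list(location = list(lat = 10.5, lng = -20.25)))),
  list(orig_loc = "Elsewhere",
       res = list(geometry = list(location = list(lat = -3, lng = 40))))
)

# By name: readable, and it keeps working if the order of elements changes
geoDemo[[1]]$res$geometry$location$lat

# By pure position: same value, but hard to read
geoDemo[[1]][[2]][[1]][[1]][[1]]

# Pull the same nested value out of every result at once
sapply(geoDemo, function(g) g$res$geometry$location$lat)
```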