Programming and Methods Resources Workshop

Setting up

This session is about getting comfortable with thinking creatively about data in R. R allows a user to slice, dice and manipulate data in a variety of ways, especially through efficient and user-friendly packages. Package growth in R continues to expand with new versions of R, so we’ll probably see things become even more efficient over time. In this session, we’re going to show several different ways of doing similar things.

# Three libraries need today 

library(tidyr)
library(dplyr)
library(data.table)
library(lubridate)

# Working directory
setwd("/Users/corybelden/Documents/my-stuff/PAMR")

BASE & STATS

Let’s go over a few things in the basic packages, like base and stats. These oldschool packages often offer similar functionality to newer packages, but they often take more lines of code than newer ones.

# Remember the package problem?
??aggregate
  # We're uing the "stats" package (in base R)

# Using data on how many times an MC references their home district in speeches
dtmUSBack = readRDS("dataPersonal/dtm_ex.rds")
head(dtmUSBack)

##    doc_id home_jur national_jur record.number       date
## 1:  62586        0            1            65 2010-07-30
## 2:  62587        0            3           104 2012-02-27
## 3:  62588        0            0           300 2011-02-09
## 4:  62589        0            1           176 2014-04-30
## 5:  62590        1            0            57 2015-10-07
## 6:  62591        0            3           193 2015-01-06
##                                                                           title
## 1: HONORING KIM EBERT-COLELLA FOR RECEIVING THE 2010 GREATER TACOMA PEACE PRIZE
## 2:                              RECOGNIZING THE MT. VIEW EDGEWOOD WATER COMPANY
## 3:                                         HONORING THE CAREER OF RHETT BICKLEY
## 4:                                                       IN MEMORY OF PETE HALL
## 5:                    HEROES MAKING A DIFFERENCE IN THE SOUTH CAROLINA FLOODING
## 6:                                  IN MEMORY OF GOVERNOR JAMES BURROWS EDWARDS
##    year lastName          state firstName            name       fullName
## 1: 2010    smith     Washington      adam     smith, adam     adam smith
## 2: 2012    smith     Washington      adam     smith, adam     adam smith
## 3: 2011   wilson South Carolina   addison wilson, addison addison wilson
## 4: 2014   wilson South Carolina   addison wilson, addison addison wilson
## 5: 2015   wilson South Carolina   addison wilson, addison addison wilson
## 6: 2015   wilson South Carolina   addison wilson, addison addison wilson
##    party chamber district status
## 1:   100       1        9      0
## 2:   100       1        9      0
## 3:   200       1        2      0
## 4:   200       1        2      0
## 5:   200       1        2      0
## 6:   200       1        2      0

# Summing ("FUN") home references by legislator
legsCountHome = aggregate(dtmUSBack$home_jur, by=list(dtmUSBack$fullName), FUN=sum, na.rm=TRUE, na.action=NULL)
  
  # Note that we always need to provide "list" even with one variable.
  # Remember to check for "NA" functionality when performing these tasks.

# Summing home references by legislator by year
legsCountHomeYr = aggregate(dtmUSBack$home_jur, by=list(dtmUSBack$fullName, dtmUSBack$year), FUN=sum, na.rm=TRUE, na.action=NULL)
head(legsCountHomeYr)

##         Group.1 Group.2 x
## 1    adam smith    2010 0
## 2  adrian smith    2010 0
## 3    allen boyd    2010 0
## 4 barbara boxer    2010 3
## 5  betsy markey    2010 0
## 6   brian baird    2010 3

names(legsCountHomeYr)[names(legsCountHomeYr)=="Group.1"] = "fullName"
head(legsCountHomeYr)

##        fullName Group.2 x
## 1    adam smith    2010 0
## 2  adrian smith    2010 0
## 3    allen boyd    2010 0
## 4 barbara boxer    2010 3
## 5  betsy markey    2010 0
## 6   brian baird    2010 3

colnames(legsCountHomeYr) = c("fullName", "year", "sum")
head(legsCountHomeYr)

##        fullName year sum
## 1    adam smith 2010   0
## 2  adrian smith 2010   0
## 3    allen boyd 2010   0
## 4 barbara boxer 2010   3
## 5  betsy markey 2010   0
## 6   brian baird 2010   3

# Getting the mean by legislator
legsFreqAve = aggregate(dtmUSBack$home_jur, by=list(dtmUSBack$fullName), FUN=mean, na.rm=TRUE, na.action=NULL)
colnames(legsFreqAve) = c("fullName", "ave")
head(legsFreqAve)

##         fullName       ave
## 1     adam smith 0.0000000
## 2 addison wilson 0.2500000
## 3   adrian smith 0.1111111
## 4 alcee hastings 0.0000000
## 5     allen boyd 0.0000000
## 6     alma adams 0.0000000

# Rounding the long decimal
legsFreqAve$ave = round(legsFreqAve$ave, 2)
head(legsFreqAve)

##         fullName  ave
## 1     adam smith 0.00
## 2 addison wilson 0.25
## 3   adrian smith 0.11
## 4 alcee hastings 0.00
## 5     allen boyd 0.00
## 6     alma adams 0.00

# Getting the mean of all legislators
mean(legsFreqAve$ave)

## [1] 0.5889222

LUBRIDATE

Often times, we want to extract parts of dates from a date variable. Lubdridate allows us to extract information from dates. Before we extract information, we need to first change the format to “date” (be careful to specify the right format).

# Raw speech data after web-scraping
us_drought = data.frame(readRDS("dataPersonal/us_speech_ex.rds"), stringsAsFactors = F)
head(us_drought$date, 1) # Factors, indicated by "levels" print out

## [1] August 2, 2012 
## 705 Levels: April 1, 2009  April 1, 2014  ... September 9, 2016

head(us_drought$speech, 1) # See, no factors!

## [1] "        \n\n[Page S5956]\nFrom the Congressional Record Online through the Government Publishing Office [www.gpo.gov]\n\n\n\n\n                                DROUGHT\n\n  Mr. DURBIN. About 2 weeks ago, I visited a farm near my home town of \nSpringfield, IL to see the impact of the ongoing drought.\n  From the road, I couldn't tell there was anything wrong with the \ncrop.\n  But as we went into the field, it quickly became clear that the crop \nwas in poor shape.\n  Following that visit, I met with the Illinois corn growers and the \nsoybean growers and farmers from across the state.\n  The message I heard was straightforward; it is as bad or worse than \nit has been in decades.\n  Since that visit to a Springfield farm, drought conditions have only \ngotten worse.\n  100 percent of Illinois and 64 percent of the country is facing \nsevere or harsher drought conditions.\n  Today, USDA announced 66 additional Illinois counties as primary \ndisaster counties.\n  With this announcement, all but four counties, Will, Cook, Kane, \nDuPage--in Illinois qualify for disaster assistance\n  Very little rain, combined with abnormally high temperatures, is \ndecimating many of the primary crop-growing areas of the country.\n  71 percent of the corn crop and 56 percent of the soybean crop in \nIllinois is rated as poor or very poor.\n  This is in a State that regularly ranks as a top producer for both of \nthese commodities.\n  That means feed prices for livestock and eventually food prices for \nthe rest of us are increasing.\n  Everyone is going to feel the impact of this historic drought\n  In response to conditions on the ground, Governor Quinn created a \nmulti-agency drought task force in Illinois.\n  The task force is coordinating State and Federal resources to ensure \nproducers and communities are receiving the timely assistance.\n  President Obama and Secretary Vilsack have done a commendable job of \ntaking steps to help provide assistance to impacted producers and \ncommunities.\n  They have sped up the disaster declaration process helping producers \nmore quickly gain access to the limited disaster programs currently \navailable.\n  They have reduced interest rates on emergency loans.\n  They have made it easier for land that is in conservation to open \nearlier for haying and grazing for livestock producers.\n  And the administration is working with crop insurance companies to \ntry to give producers more time to make premium payments.\n  But we can do more.\n  And since we can't make rain, the single most important step Congress \ncan take is to pass a farm bill.\n  Most farmers will tell you they can survive one bad year.\n  But right now farmers can't even plan for future years.\n  More than a month ago, the Senate passed the Local Food, Farms, and \nJobs Act, more commonly known as the farm bill, with a 64--35 \nbipartisan vote.\n  The bill would reauthorize several expired disaster programs to \nimmediately help producers.\n  Equally, if not more important, the bill would provide certainty for \nproducers--allowing them to make long-term plans for getting through \nthis drought and recovering from a bad year.\n  Unfortunately the House has failed to act.\n  In the roughly 40 days since the Senate passed a bill, the House has \nnot even brought a companion measure to the House floor. During those \n40 days another 20 percent of the country has developed drought \nconditions. During those 40 days, 98 of 102 counties in Illinois \nqualified for disaster assistance. During those 40 days, many farmers \nin Illinois have lost their crops.\n  It is well past time for the House to take up and pass a farm bill \nthat includes robust disaster assistance paired with the long-term \npolicy farmers need.\n  I will repeat something I said 2 weeks ago.\n  Our producers and rural America already face a natural disaster. I \ndon't think it is too much that we spare them a manmade disaster by \nfailing to pass a farm bill.\n\n                          ____________________\n\n\n\n    "

# Class 
class(us_drought$date)

## [1] "factor"

head(us_drought$date)

## [1] August 2, 2012    August 2, 2012    February 3, 2014  June 2, 2015     
## [5] July 13, 2016     August 6, 2009   
## 705 Levels: April 1, 2009  April 1, 2014  ... September 9, 2016

# Change to a date
us_drought$date = mdy(us_drought$date)
head(us_drought$date)

## [1] "2012-08-02" "2012-08-02" "2014-02-03" "2015-06-02" "2016-07-13"
## [6] "2009-08-06"

# Grab Year, Month
us_drought$year <- year(us_drought$date)
us_drought$month <- month(us_drought$date)
head(us_drought$year)

## [1] 2012 2012 2014 2015 2016 2009

DATA.TABLE

Data.table is an enhanced version of “data.frames”.

Characters are never converted into factors.
Subsetting is easier when you have complex subsetting needs (select on columns in many different ways).
Sorting is easier using the “base” package in R (dealing with characters).
Use “by” to group columns, even lists with character vectors.

Below, we create time intervals using data.table.

## Using data on Australian Ministerial positions
min_df = readRDS("dataPersonal/time_ex.rds")

# Transform (generic function for manipulating columns in dataframes) the dates
min_df = transform(min_df, fromdate = as.Date(fromdate, format="%Y-%m-%d"), todate = as.Date(todate, format="%Y-%m-%d"))

# Extract years from each variable
min_df$fromyear = year(min_df$fromdate)
min_df$toyear = year(min_df$todate)

# Create data table
min_dt <- data.table(min_df)

# Create a sequence of dates for each row
exp_min <- min_dt[, list(name, position, year = seq(fromyear, toyear)), 1:nrow(min_dt)]
exp_min$nrow = NULL

  # "list" ensures that a data.table is returned (not just a dataframe)

# Sort acsending
exp_min = exp_min[order(name, year, position),]
head(exp_min, 5)

##                   name
## 1: Alan Glyndwr Cadman
## 2: Alan Glyndwr Cadman
## 3:  Alan Peter Griffin
## 4:  Alan Peter Griffin
## 5:  Alan Peter Griffin
##                                                                              position
## 1: Parliamentary Secretary to the Minister for Workplace Relations and Small Business
## 2: Parliamentary Secretary to the Minister for Workplace Relations and Small Business
## 3:                          Parliamentary Secretary to the Shadow Minister for Health
## 4:                          Parliamentary Secretary to the Shadow Minister for Health
## 5:                          Parliamentary Secretary to the Shadow Minister for Health
##    year
## 1: 1997
## 2: 1998
## 3: 1998
## 4: 1999
## 5: 2000

# Sort descending
exp_min = exp_min[order(-name, year, position),]
head(exp_min, 5)

##                  name
## 1:        Zed Seselja
## 2:        Zed Seselja
## 3:        Zed Seselja
## 4:        Zed Seselja
## 5: Yvette Maree D'Ath
##                                                            position year
## 1:              Assistant Minister for Science, Jobs and Innovation 2016
## 2: Assistant Minister for Social Services and Multicultural Affairs 2016
## 3:              Assistant Minister for Science, Jobs and Innovation 2017
## 4: Assistant Minister for Social Services and Multicultural Affairs 2017
## 5: Parliamentary Secretary for Climate Change and Energy Efficiency 2013

# Easy computing (using home speech data)
dtmUSBack = data.table(dtmUSBack)
dtmUSBack[, mean(home_jur), by = chamber]

##    chamber        V1
## 1:       1 0.4887640
## 2:       2 0.5208333

DPLYR & TIDYR

As illustrated, we often we have “messy” data specifically with respect to its temporal unit. We can use dplyr and tidyr (often used in conjunction) to perform tasks to restructure data for merges with other datasets. (Remember that altough your instinct is to just start clearning, think ahead to final goals!)

Before we start, what’s that funny symbol (%>%)? It’s the pipe operator! All programming softwares have pipe operations. The pipe operator makes it so the user can easily chain a sequence of calculations, instead of doing them separately.

#vignette("dplyr")

# "Template"" data -- international system membership data (country-year)
dyad_template <- read.csv("dataPersonal/states2016.csv")
head(dyad_template)

##   stateabb ccode                 statenme styear stmonth stday endyear
## 1      USA     2 United States of America   1816       1     1    2016
## 2      CAN    20                   Canada   1920       1    10    2016
## 3      BHM    31                  Bahamas   1973       7    10    2016
## 4      CUB    40                     Cuba   1902       5    20    1906
## 5      CUB    40                     Cuba   1909       1    23    2016
## 6      HAI    41                    Haiti   1859       1     1    1915
##   endmonth endday version
## 1       12     31    2016
## 2       12     31    2016
## 3       12     31    2016
## 4        9     25    2016
## 5       12     31    2016
## 6        7     28    2016

# Keeping variables we need
dyad_template <- dyad_template %>%  select(ccode, styear, endyear)

# Expanding data to create dyad-year instead of country-year
dyad_template <- dyad_template %>%  expand(statea=ccode, stateb=ccode, year=seq(1816,2016)) %>%
  filter(statea!=stateb) %>% # remove same-country dyads
  left_join(., dyad_template, by=c("statea"="ccode")) %>% # create all possible country-country-year pairs
  filter(year >= styear & year <= endyear) %>%  # remove if one dyad member doesn't exist in year
  select(-styear,-endyear) %>% 
  left_join(., dyad_template, by=c("stateb"="ccode")) %>%
  filter(year >= styear & year <= endyear) %>%
  select(-styear,-endyear)

head(dyad_template)

## # A tibble: 6 x 3
##   statea stateb  year
##    <int>  <int> <int>
## 1      2     20  1920
## 2      2     20  1921
## 3      2     20  1922
## 4      2     20  1923
## 5      2     20  1924
## 6      2     20  1925

# Check out the jump in observations!
nrow(dyad_template)

## [1] 1912350

Next, let’s bring in data we want to tranform and merge with the template data. Here, we’re using the Threat and Imposition of Economic Sanctions data (TIES) which is often used in IR/Foreign Policy. These dyads are not always “dyads”. Some case-ids have multiple senders. For this example we will only look at the primary sender. Since there are many variables in this dataset, let’s simplify it first so we only look at the one or two variables that typically matter to us.

# Bring in merging data
sanctions <- read.csv("dataPersonal/TIESv4.csv")

# Select variables we want to keep
sanctions_subset <- select(sanctions, caseid, startyear, endyear, primarysender, targetstate, threat, imposition)

# Creating dyads on sanctions (similar process to above)
sanctions_expand <- sanctions_subset %>% expand(caseid, year=seq(1945,2012)) %>%
  right_join(., sanctions_subset, by="caseid")%>%
  filter(year>= startyear & year<=endyear)%>%
  select(-startyear,-endyear)

# Create a dataframe to edit so we don't transform our expanded data
sanctions_using <- sanctions_expand

# Rename variables in order to merge
colnames(sanctions_using)[colnames(sanctions_using)=="primarysender"] <- "statea"
colnames(sanctions_using)[colnames(sanctions_using)=="targetstate"] <- "stateb"

# Merge
working_data <- merge(dyad_template, sanctions_using, by=c("statea", "stateb", "year"), all=TRUE)
head(working_data)

##   statea stateb year caseid threat imposition
## 1      2     20 1920     NA     NA         NA
## 2      2     20 1921     NA     NA         NA
## 3      2     20 1922     NA     NA         NA
## 4      2     20 1923     NA     NA         NA
## 5      2     20 1924     NA     NA         NA
## 6      2     20 1925     NA     NA         NA

# Another way to arrange the data (negative sign indicates descending order)
working_data = arrange(working_data, -threat, imposition, -year, statea, stateb)
head(working_data)

##   statea stateb year     caseid threat imposition
## 1      2    140 2010 2005100601      1          0
## 2      2    712 2010 2003112203      1          0
## 3      2    750 2010 2003112201      1          0
## 4      2    140 2009 2005100601      1          0
## 5      2    223 2009 2001100201      1          0
## 6      2    225 2009 2001101201      1          0

# Count number of years 
dplyr::count(working_data, year)

## # A tibble: 201 x 2
##     year     n
##    <int> <int>
##  1  1816   506
##  2  1817   506
##  3  1818   506
##  4  1819   506
##  5  1820   506
##  6  1821   506
##  7  1822   552
##  8  1823   552
##  9  1824   552
## 10  1825   600
## # ... with 191 more rows

Note a few important things here.

We don’t actually want to keep all unmerged pairs. For example, we want to drop those dyad-year observations that are in both data sets.
We would also want to limit the time frame to 1945-2012 since there will be no observations prior to 1945 in the merged sanctions data. This is typical in IR data.
Finally, this data is not ready for analysis. We would want to bring in other dyadic data for a full model. But this is the starting point for building the correct data.

Now let’s use “gather” and “spread” (complements to one another), which are also part of the tidyr package. “Gather” takes muliple columns and transforms them into key-value pairs (making “wide” data long). “Spread” takes two columns (a key and a value) and spreads them into multiple columns (making “long” data wider). We often want wide data when we’re going counts by group.

#vignette("tidyr")

head(sanctions_using)

## # A tibble: 6 x 6
##       caseid  year statea stateb threat imposition
##        <int> <int>  <int>  <int>  <int>      <int>
## 1 1945121601  1945      2    710      1          1
## 2 1945121601  1946      2    710      1          1
## 3 1945121601  1947      2    710      1          1
## 4 1946020801  1946    365    305      0          1
## 5 1946020801  1947    365    305      0          1
## 6 1946020801  1948    365    305      0          1

# Gathering sanction data, where "sanct_level" is our new variable indicates threat or imposition
  # and value indicates whether it occured or not
sanctions_count = sanctions_using %>%
  gather(sanct_level, value, -statea, -stateb, -year, -caseid) # negative sign prevents false gathers
head(sanctions_count)

## # A tibble: 6 x 6
##       caseid  year statea stateb sanct_level value
##        <int> <int>  <int>  <int>       <chr> <int>
## 1 1945121601  1945      2    710      threat     1
## 2 1945121601  1946      2    710      threat     1
## 3 1945121601  1947      2    710      threat     1
## 4 1946020801  1946    365    305      threat     0
## 5 1946020801  1947    365    305      threat     0
## 6 1946020801  1948    365    305      threat     0

  # A nice feature of these tools is you specify the variable names in the function

# Another way to do it
sanctions_count2 = gather(sanctions_using, "sanct_level", "value", 5:6)

# Another count, this time getting the number of impositions over time using "group_by"
imp_count = sanctions_using %>%
  group_by(statea, stateb)%>% # getting count for each dyad
  tally(imposition) # tally means count
head(imp_count)

## # A tibble: 6 x 3
## # Groups:   statea [1]
##   statea stateb     n
##    <int>  <int> <int>
## 1      2     20    28
## 2      2     40     7
## 3      2     41    12
## 4      2     42     2
## 5      2     52    10
## 6      2     53     8

# Returning to legislators, let's spread these by year
head(legsCountHomeYr)

##        fullName year sum
## 1    adam smith 2010   0
## 2  adrian smith 2010   0
## 3    allen boyd 2010   0
## 4 barbara boxer 2010   3
## 5  betsy markey 2010   0
## 6   brian baird 2010   3

# Make year the column
legHomeYears = spread(legsCountHomeYr, year, sum) # year is the key and sum is the value
head(legHomeYears)

##         fullName 2010 2011 2012 2013 2014 2015 2016
## 1     adam smith    0   NA    0   NA   NA   NA   NA
## 2 addison wilson   NA    0   NA   NA    0    1   NA
## 3   adrian smith    0   NA   NA    1    0    0   NA
## 4 alcee hastings   NA    0    0    0    0   NA    0
## 5     allen boyd    0   NA   NA   NA   NA   NA   NA
## 6     alma adams   NA   NA   NA   NA   NA    0    0

CLOSING REMARKS

There are many other packages and functions that may be of use for data manipulation and management. Of course, there are also packages for data collection, visualization, analytics, and more complex data management (we’ll go over some of these in the next sessions.)

stringi and stringr (we’ll definitely review some of this in another session!)
broom
zoo
tidyverse – CHECK IT OUT! A set of packages that work in harmony, all related to tidying data.
The apply family: apply, mapply, sapply (looping over multiple lists, matrices, and so on.)
This is a great resource for exploring packages.
Grolemund & Wickman’s R for Data Science book.