This session is about getting comfortable with thinking creatively about data in R. R allows a user to slice, dice and manipulate data in a variety of ways, especially through efficient and user-friendly packages. Package growth in R continues to expand with new versions of R, so we’ll probably see things become even more efficient over time. In this session, we’re going to show several different ways of doing similar things.
# Three libraries need today
library(tidyr)
library(dplyr)
library(data.table)
library(lubridate)
# Working directory
setwd("/Users/corybelden/Documents/my-stuff/PAMR")
Let’s go over a few things in the basic packages, like base and stats. These oldschool packages often offer similar functionality to newer packages, but they often take more lines of code than newer ones.
# Remember the package problem?
??aggregate
# We're uing the "stats" package (in base R)
# Using data on how many times an MC references their home district in speeches
dtmUSBack = readRDS("dataPersonal/dtm_ex.rds")
head(dtmUSBack)
## doc_id home_jur national_jur record.number date
## 1: 62586 0 1 65 2010-07-30
## 2: 62587 0 3 104 2012-02-27
## 3: 62588 0 0 300 2011-02-09
## 4: 62589 0 1 176 2014-04-30
## 5: 62590 1 0 57 2015-10-07
## 6: 62591 0 3 193 2015-01-06
## title
## 1: HONORING KIM EBERT-COLELLA FOR RECEIVING THE 2010 GREATER TACOMA PEACE PRIZE
## 2: RECOGNIZING THE MT. VIEW EDGEWOOD WATER COMPANY
## 3: HONORING THE CAREER OF RHETT BICKLEY
## 4: IN MEMORY OF PETE HALL
## 5: HEROES MAKING A DIFFERENCE IN THE SOUTH CAROLINA FLOODING
## 6: IN MEMORY OF GOVERNOR JAMES BURROWS EDWARDS
## year lastName state firstName name fullName
## 1: 2010 smith Washington adam smith, adam adam smith
## 2: 2012 smith Washington adam smith, adam adam smith
## 3: 2011 wilson South Carolina addison wilson, addison addison wilson
## 4: 2014 wilson South Carolina addison wilson, addison addison wilson
## 5: 2015 wilson South Carolina addison wilson, addison addison wilson
## 6: 2015 wilson South Carolina addison wilson, addison addison wilson
## party chamber district status
## 1: 100 1 9 0
## 2: 100 1 9 0
## 3: 200 1 2 0
## 4: 200 1 2 0
## 5: 200 1 2 0
## 6: 200 1 2 0
# Summing ("FUN") home references by legislator
legsCountHome = aggregate(dtmUSBack$home_jur, by=list(dtmUSBack$fullName), FUN=sum, na.rm=TRUE, na.action=NULL)
# Note that we always need to provide "list" even with one variable.
# Remember to check for "NA" functionality when performing these tasks.
# Summing home references by legislator by year
legsCountHomeYr = aggregate(dtmUSBack$home_jur, by=list(dtmUSBack$fullName, dtmUSBack$year), FUN=sum, na.rm=TRUE, na.action=NULL)
head(legsCountHomeYr)
## Group.1 Group.2 x
## 1 adam smith 2010 0
## 2 adrian smith 2010 0
## 3 allen boyd 2010 0
## 4 barbara boxer 2010 3
## 5 betsy markey 2010 0
## 6 brian baird 2010 3
names(legsCountHomeYr)[names(legsCountHomeYr)=="Group.1"] = "fullName"
head(legsCountHomeYr)
## fullName Group.2 x
## 1 adam smith 2010 0
## 2 adrian smith 2010 0
## 3 allen boyd 2010 0
## 4 barbara boxer 2010 3
## 5 betsy markey 2010 0
## 6 brian baird 2010 3
colnames(legsCountHomeYr) = c("fullName", "year", "sum")
head(legsCountHomeYr)
## fullName year sum
## 1 adam smith 2010 0
## 2 adrian smith 2010 0
## 3 allen boyd 2010 0
## 4 barbara boxer 2010 3
## 5 betsy markey 2010 0
## 6 brian baird 2010 3
# Getting the mean by legislator
legsFreqAve = aggregate(dtmUSBack$home_jur, by=list(dtmUSBack$fullName), FUN=mean, na.rm=TRUE, na.action=NULL)
colnames(legsFreqAve) = c("fullName", "ave")
head(legsFreqAve)
## fullName ave
## 1 adam smith 0.0000000
## 2 addison wilson 0.2500000
## 3 adrian smith 0.1111111
## 4 alcee hastings 0.0000000
## 5 allen boyd 0.0000000
## 6 alma adams 0.0000000
# Rounding the long decimal
legsFreqAve$ave = round(legsFreqAve$ave, 2)
head(legsFreqAve)
## fullName ave
## 1 adam smith 0.00
## 2 addison wilson 0.25
## 3 adrian smith 0.11
## 4 alcee hastings 0.00
## 5 allen boyd 0.00
## 6 alma adams 0.00
# Getting the mean of all legislators
mean(legsFreqAve$ave)
## [1] 0.5889222
Often times, we want to extract parts of dates from a date variable. Lubdridate allows us to extract information from dates. Before we extract information, we need to first change the format to “date” (be careful to specify the right format).
# Raw speech data after web-scraping
us_drought = data.frame(readRDS("dataPersonal/us_speech_ex.rds"), stringsAsFactors = F)
head(us_drought$date, 1) # Factors, indicated by "levels" print out
## [1] August 2, 2012
## 705 Levels: April 1, 2009 April 1, 2014 ... September 9, 2016
head(us_drought$speech, 1) # See, no factors!
## [1] " \n\n[Page S5956]\nFrom the Congressional Record Online through the Government Publishing Office [www.gpo.gov]\n\n\n\n\n DROUGHT\n\n Mr. DURBIN. About 2 weeks ago, I visited a farm near my home town of \nSpringfield, IL to see the impact of the ongoing drought.\n From the road, I couldn't tell there was anything wrong with the \ncrop.\n But as we went into the field, it quickly became clear that the crop \nwas in poor shape.\n Following that visit, I met with the Illinois corn growers and the \nsoybean growers and farmers from across the state.\n The message I heard was straightforward; it is as bad or worse than \nit has been in decades.\n Since that visit to a Springfield farm, drought conditions have only \ngotten worse.\n 100 percent of Illinois and 64 percent of the country is facing \nsevere or harsher drought conditions.\n Today, USDA announced 66 additional Illinois counties as primary \ndisaster counties.\n With this announcement, all but four counties, Will, Cook, Kane, \nDuPage--in Illinois qualify for disaster assistance\n Very little rain, combined with abnormally high temperatures, is \ndecimating many of the primary crop-growing areas of the country.\n 71 percent of the corn crop and 56 percent of the soybean crop in \nIllinois is rated as poor or very poor.\n This is in a State that regularly ranks as a top producer for both of \nthese commodities.\n That means feed prices for livestock and eventually food prices for \nthe rest of us are increasing.\n Everyone is going to feel the impact of this historic drought\n In response to conditions on the ground, Governor Quinn created a \nmulti-agency drought task force in Illinois.\n The task force is coordinating State and Federal resources to ensure \nproducers and communities are receiving the timely assistance.\n President Obama and Secretary Vilsack have done a commendable job of \ntaking steps to help provide assistance to impacted producers and \ncommunities.\n They have sped up the disaster declaration process helping producers \nmore quickly gain access to the limited disaster programs currently \navailable.\n They have reduced interest rates on emergency loans.\n They have made it easier for land that is in conservation to open \nearlier for haying and grazing for livestock producers.\n And the administration is working with crop insurance companies to \ntry to give producers more time to make premium payments.\n But we can do more.\n And since we can't make rain, the single most important step Congress \ncan take is to pass a farm bill.\n Most farmers will tell you they can survive one bad year.\n But right now farmers can't even plan for future years.\n More than a month ago, the Senate passed the Local Food, Farms, and \nJobs Act, more commonly known as the farm bill, with a 64--35 \nbipartisan vote.\n The bill would reauthorize several expired disaster programs to \nimmediately help producers.\n Equally, if not more important, the bill would provide certainty for \nproducers--allowing them to make long-term plans for getting through \nthis drought and recovering from a bad year.\n Unfortunately the House has failed to act.\n In the roughly 40 days since the Senate passed a bill, the House has \nnot even brought a companion measure to the House floor. During those \n40 days another 20 percent of the country has developed drought \nconditions. During those 40 days, 98 of 102 counties in Illinois \nqualified for disaster assistance. During those 40 days, many farmers \nin Illinois have lost their crops.\n It is well past time for the House to take up and pass a farm bill \nthat includes robust disaster assistance paired with the long-term \npolicy farmers need.\n I will repeat something I said 2 weeks ago.\n Our producers and rural America already face a natural disaster. I \ndon't think it is too much that we spare them a manmade disaster by \nfailing to pass a farm bill.\n\n ____________________\n\n\n\n "
# Class
class(us_drought$date)
## [1] "factor"
head(us_drought$date)
## [1] August 2, 2012 August 2, 2012 February 3, 2014 June 2, 2015
## [5] July 13, 2016 August 6, 2009
## 705 Levels: April 1, 2009 April 1, 2014 ... September 9, 2016
# Change to a date
us_drought$date = mdy(us_drought$date)
head(us_drought$date)
## [1] "2012-08-02" "2012-08-02" "2014-02-03" "2015-06-02" "2016-07-13"
## [6] "2009-08-06"
# Grab Year, Month
us_drought$year <- year(us_drought$date)
us_drought$month <- month(us_drought$date)
head(us_drought$year)
## [1] 2012 2012 2014 2015 2016 2009
Data.table is an enhanced version of “data.frames”.
Below, we create time intervals using data.table.
## Using data on Australian Ministerial positions
min_df = readRDS("dataPersonal/time_ex.rds")
# Transform (generic function for manipulating columns in dataframes) the dates
min_df = transform(min_df, fromdate = as.Date(fromdate, format="%Y-%m-%d"), todate = as.Date(todate, format="%Y-%m-%d"))
# Extract years from each variable
min_df$fromyear = year(min_df$fromdate)
min_df$toyear = year(min_df$todate)
# Create data table
min_dt <- data.table(min_df)
# Create a sequence of dates for each row
exp_min <- min_dt[, list(name, position, year = seq(fromyear, toyear)), 1:nrow(min_dt)]
exp_min$nrow = NULL
# "list" ensures that a data.table is returned (not just a dataframe)
# Sort acsending
exp_min = exp_min[order(name, year, position),]
head(exp_min, 5)
## name
## 1: Alan Glyndwr Cadman
## 2: Alan Glyndwr Cadman
## 3: Alan Peter Griffin
## 4: Alan Peter Griffin
## 5: Alan Peter Griffin
## position
## 1: Parliamentary Secretary to the Minister for Workplace Relations and Small Business
## 2: Parliamentary Secretary to the Minister for Workplace Relations and Small Business
## 3: Parliamentary Secretary to the Shadow Minister for Health
## 4: Parliamentary Secretary to the Shadow Minister for Health
## 5: Parliamentary Secretary to the Shadow Minister for Health
## year
## 1: 1997
## 2: 1998
## 3: 1998
## 4: 1999
## 5: 2000
# Sort descending
exp_min = exp_min[order(-name, year, position),]
head(exp_min, 5)
## name
## 1: Zed Seselja
## 2: Zed Seselja
## 3: Zed Seselja
## 4: Zed Seselja
## 5: Yvette Maree D'Ath
## position year
## 1: Assistant Minister for Science, Jobs and Innovation 2016
## 2: Assistant Minister for Social Services and Multicultural Affairs 2016
## 3: Assistant Minister for Science, Jobs and Innovation 2017
## 4: Assistant Minister for Social Services and Multicultural Affairs 2017
## 5: Parliamentary Secretary for Climate Change and Energy Efficiency 2013
# Easy computing (using home speech data)
dtmUSBack = data.table(dtmUSBack)
dtmUSBack[, mean(home_jur), by = chamber]
## chamber V1
## 1: 1 0.4887640
## 2: 2 0.5208333
As illustrated, we often we have “messy” data specifically with respect to its temporal unit. We can use dplyr and tidyr (often used in conjunction) to perform tasks to restructure data for merges with other datasets. (Remember that altough your instinct is to just start clearning, think ahead to final goals!)
Before we start, what’s that funny symbol (%>%)? It’s the pipe operator! All programming softwares have pipe operations. The pipe operator makes it so the user can easily chain a sequence of calculations, instead of doing them separately.
#vignette("dplyr")
# "Template"" data -- international system membership data (country-year)
dyad_template <- read.csv("dataPersonal/states2016.csv")
head(dyad_template)
## stateabb ccode statenme styear stmonth stday endyear
## 1 USA 2 United States of America 1816 1 1 2016
## 2 CAN 20 Canada 1920 1 10 2016
## 3 BHM 31 Bahamas 1973 7 10 2016
## 4 CUB 40 Cuba 1902 5 20 1906
## 5 CUB 40 Cuba 1909 1 23 2016
## 6 HAI 41 Haiti 1859 1 1 1915
## endmonth endday version
## 1 12 31 2016
## 2 12 31 2016
## 3 12 31 2016
## 4 9 25 2016
## 5 12 31 2016
## 6 7 28 2016
# Keeping variables we need
dyad_template <- dyad_template %>% select(ccode, styear, endyear)
# Expanding data to create dyad-year instead of country-year
dyad_template <- dyad_template %>% expand(statea=ccode, stateb=ccode, year=seq(1816,2016)) %>%
filter(statea!=stateb) %>% # remove same-country dyads
left_join(., dyad_template, by=c("statea"="ccode")) %>% # create all possible country-country-year pairs
filter(year >= styear & year <= endyear) %>% # remove if one dyad member doesn't exist in year
select(-styear,-endyear) %>%
left_join(., dyad_template, by=c("stateb"="ccode")) %>%
filter(year >= styear & year <= endyear) %>%
select(-styear,-endyear)
head(dyad_template)
## # A tibble: 6 x 3
## statea stateb year
## <int> <int> <int>
## 1 2 20 1920
## 2 2 20 1921
## 3 2 20 1922
## 4 2 20 1923
## 5 2 20 1924
## 6 2 20 1925
# Check out the jump in observations!
nrow(dyad_template)
## [1] 1912350
Next, let’s bring in data we want to tranform and merge with the template data. Here, we’re using the Threat and Imposition of Economic Sanctions data (TIES) which is often used in IR/Foreign Policy. These dyads are not always “dyads”. Some case-ids have multiple senders. For this example we will only look at the primary sender. Since there are many variables in this dataset, let’s simplify it first so we only look at the one or two variables that typically matter to us.
# Bring in merging data
sanctions <- read.csv("dataPersonal/TIESv4.csv")
# Select variables we want to keep
sanctions_subset <- select(sanctions, caseid, startyear, endyear, primarysender, targetstate, threat, imposition)
# Creating dyads on sanctions (similar process to above)
sanctions_expand <- sanctions_subset %>% expand(caseid, year=seq(1945,2012)) %>%
right_join(., sanctions_subset, by="caseid")%>%
filter(year>= startyear & year<=endyear)%>%
select(-startyear,-endyear)
# Create a dataframe to edit so we don't transform our expanded data
sanctions_using <- sanctions_expand
# Rename variables in order to merge
colnames(sanctions_using)[colnames(sanctions_using)=="primarysender"] <- "statea"
colnames(sanctions_using)[colnames(sanctions_using)=="targetstate"] <- "stateb"
# Merge
working_data <- merge(dyad_template, sanctions_using, by=c("statea", "stateb", "year"), all=TRUE)
head(working_data)
## statea stateb year caseid threat imposition
## 1 2 20 1920 NA NA NA
## 2 2 20 1921 NA NA NA
## 3 2 20 1922 NA NA NA
## 4 2 20 1923 NA NA NA
## 5 2 20 1924 NA NA NA
## 6 2 20 1925 NA NA NA
# Another way to arrange the data (negative sign indicates descending order)
working_data = arrange(working_data, -threat, imposition, -year, statea, stateb)
head(working_data)
## statea stateb year caseid threat imposition
## 1 2 140 2010 2005100601 1 0
## 2 2 712 2010 2003112203 1 0
## 3 2 750 2010 2003112201 1 0
## 4 2 140 2009 2005100601 1 0
## 5 2 223 2009 2001100201 1 0
## 6 2 225 2009 2001101201 1 0
# Count number of years
dplyr::count(working_data, year)
## # A tibble: 201 x 2
## year n
## <int> <int>
## 1 1816 506
## 2 1817 506
## 3 1818 506
## 4 1819 506
## 5 1820 506
## 6 1821 506
## 7 1822 552
## 8 1823 552
## 9 1824 552
## 10 1825 600
## # ... with 191 more rows
Note a few important things here.
We don’t actually want to keep all unmerged pairs. For example, we want to drop those dyad-year observations that are in both data sets.
We would also want to limit the time frame to 1945-2012 since there will be no observations prior to 1945 in the merged sanctions data. This is typical in IR data.
Finally, this data is not ready for analysis. We would want to bring in other dyadic data for a full model. But this is the starting point for building the correct data.
Now let’s use “gather” and “spread” (complements to one another), which are also part of the tidyr package. “Gather” takes muliple columns and transforms them into key-value pairs (making “wide” data long). “Spread” takes two columns (a key and a value) and spreads them into multiple columns (making “long” data wider). We often want wide data when we’re going counts by group.
#vignette("tidyr")
head(sanctions_using)
## # A tibble: 6 x 6
## caseid year statea stateb threat imposition
## <int> <int> <int> <int> <int> <int>
## 1 1945121601 1945 2 710 1 1
## 2 1945121601 1946 2 710 1 1
## 3 1945121601 1947 2 710 1 1
## 4 1946020801 1946 365 305 0 1
## 5 1946020801 1947 365 305 0 1
## 6 1946020801 1948 365 305 0 1
# Gathering sanction data, where "sanct_level" is our new variable indicates threat or imposition
# and value indicates whether it occured or not
sanctions_count = sanctions_using %>%
gather(sanct_level, value, -statea, -stateb, -year, -caseid) # negative sign prevents false gathers
head(sanctions_count)
## # A tibble: 6 x 6
## caseid year statea stateb sanct_level value
## <int> <int> <int> <int> <chr> <int>
## 1 1945121601 1945 2 710 threat 1
## 2 1945121601 1946 2 710 threat 1
## 3 1945121601 1947 2 710 threat 1
## 4 1946020801 1946 365 305 threat 0
## 5 1946020801 1947 365 305 threat 0
## 6 1946020801 1948 365 305 threat 0
# A nice feature of these tools is you specify the variable names in the function
# Another way to do it
sanctions_count2 = gather(sanctions_using, "sanct_level", "value", 5:6)
# Another count, this time getting the number of impositions over time using "group_by"
imp_count = sanctions_using %>%
group_by(statea, stateb)%>% # getting count for each dyad
tally(imposition) # tally means count
head(imp_count)
## # A tibble: 6 x 3
## # Groups: statea [1]
## statea stateb n
## <int> <int> <int>
## 1 2 20 28
## 2 2 40 7
## 3 2 41 12
## 4 2 42 2
## 5 2 52 10
## 6 2 53 8
# Returning to legislators, let's spread these by year
head(legsCountHomeYr)
## fullName year sum
## 1 adam smith 2010 0
## 2 adrian smith 2010 0
## 3 allen boyd 2010 0
## 4 barbara boxer 2010 3
## 5 betsy markey 2010 0
## 6 brian baird 2010 3
# Make year the column
legHomeYears = spread(legsCountHomeYr, year, sum) # year is the key and sum is the value
head(legHomeYears)
## fullName 2010 2011 2012 2013 2014 2015 2016
## 1 adam smith 0 NA 0 NA NA NA NA
## 2 addison wilson NA 0 NA NA 0 1 NA
## 3 adrian smith 0 NA NA 1 0 0 NA
## 4 alcee hastings NA 0 0 0 0 NA 0
## 5 allen boyd 0 NA NA NA NA NA NA
## 6 alma adams NA NA NA NA NA 0 0
There are many other packages and functions that may be of use for data manipulation and management. Of course, there are also packages for data collection, visualization, analytics, and more complex data management (we’ll go over some of these in the next sessions.)