This document walks through a little cleanup of the 8/5 Mayday data dump. The functional components are extracted in this gist: https://gist.github.com/ec45778abc6386ae8ec3

Let’s grab the data and make the headers stop shouting:

library(plyr)
library(dplyr)
donors = read.csv("donors.csv", header=TRUE, stringsAsFactors=FALSE,
                  na.strings=c("", " "))
names(donors) = tolower(names(donors))

We have seven variables; let’s learn about them.

donor_id

donor_id seems to be a three-letter prefix and a numerical suffix. What prefixes do we have?

donors$donor_prefix = sapply(donors$donor_id, function(x) substr(x,1,3))
donors %>% group_by(donor_prefix) %>% summarize(n=n())
## Source: local data frame [2 x 2]
## 
##   donor_prefix     n
## 1          MDA 60067
## 2          MDS     7

There’s a small handful of MDS rows with larger donation amounts; let’s assume they’re not different and treat donor_id as an opaque identifier.

donor_id is not unique, so let’s infer a one-to-many relationship between donors and donations.

donors[donors$donor_id == 'MDA1',]
##    donor_id first_name last_name       donation_time     x amount
## 80     MDA1       <NA>      <NA> 2014-05-10T00:19:22 28557     25
## 81     MDA1       <NA>      <NA> 2014-06-19T23:07:03 28557     25
##         targeting donor_prefix
## 80 Democrats Only          MDA
## 81 Democrats Only          MDA

donation_time

donation_time is mostly a series of dates in ISO 8601 formats, except when they aren’t.

is_iso = sapply(donors$donation_time, function(x) substr(x, 1, 4) == "2014")
unique(head(donors[is_iso == FALSE, "donation_time"], 30))
##  [1] NA                          "N/A WIRE"                 
##  [3] "40312"                     "40317"                    
##  [5] "40309"                     "40332"                    
##  [7] "40376"                     "40325"                    
##  [9] "5/15/2014 2:29:04 PM CST"  "5/20/2014 11:34:05 AM CST"

“N/A WIRE” is an interesting value! More hazardously, some of these dates look like they came from strftime("%D %T %Z")… and some of these look like… zip codes?? But probably aren’t?

Let’s let lubridate sort this out.

library(lubridate)
donors$donation_time = parse_date_time(donors$donation_time,
                                       orders=c("ymdhms", "mdyImsp"),
                                       tz="America/Chicago")
## Warning: 245 failed to parse.

One of the donation times is very early: 2014-01-24. Let’s change that up.

donors[which.min(donors$donation_time), "donation_time"] = as.POSIXct("2014/05/01 00:00", tz="America/Chicago")

Zip codes

There’s an unnamed column that contains a lot of 5 digit numbers. Let’s boldly assume it contains zip codes. They don’t all match, though:

donors = rename(donors, replace=c("x"="zip"))
is_zip = grepl("\\d{5}", donors$zip)
head(donors$zip[!is_zip], 100)
##   [1] NA      NA      NA      NA      NA      NA      NA      NA     
##   [9] NA      NA      NA      NA      NA      NA      NA      NA     
##  [17] NA      NA      NA      NA      "N168B" "M1N1R" "L6H 6" "624"  
##  [25] "Kahul" "N1H 3" "Kihei" "V6E2A" "Haver" "0Kent" "San F" "470-0"
##  [33] "T3L 2" "W1G 8" "W1G 8" "W1G 8" "Delan" "Rosev" "0"     "Santa"
##  [41] "Kings" "EH178" "Stoug" "*1564" "*4532" "907"   "N2M5E" "N1 4B"
##  [49] "V3Y 2" "0"     "Chels" "M6P 3" "h2t2e" "B4A3Y" "2518J" "mobil"
##  [57] "lS7 4" "Bount" "Mount" "K7P2G" "OR"    "Apple" "woods" "woods"
##  [65] "670-8" "T4C1A" "0"     "Santa" "V8N3W" "Oakla" "M4P1Y" "115"  
##  [73] "cb8 0" "0SN25" "V6R1A" "874-0" "South" "willi" "918"   "Bruns"
##  [81] "BS162" "0"     "Lanha" "883-0" "T2V3N" "M4K3K" "M4K3K" "M4K3K"
##  [89] "M4K 3" "W3"    "SP35E" "M6C 1" "0"     "peter" "APO"   "47"   
##  [97] "47"    "Oakla" "T3C 1" "Cambr"

Some of these are clearly misentries; others look like truncated Canadian, English, or otherwise foreign postal designations. Let’s just retain the five-digited ones.

donors$zip[!is_zip] = NA

amount

This column is well-behaved, but it’s entertaining to note that there are some zero-amount rows:

head(donors[donors$amount == 0,])
##       donor_id first_name last_name       donation_time   zip amount
## 2701  MDA12023       <NA>      <NA> 2014-06-30 19:11:35 94107      0
## 7554  MDA15811       <NA>      <NA> 2014-05-05 22:27:46 01355      0
## 8517   MDA1656       <NA>      <NA> 2014-05-23 03:05:53 97405      0
## 10265 MDA17933       <NA>      <NA> 2014-06-21 20:07:08 94122      0
## 10311 MDA17970       <NA>      <NA> 2014-07-05 01:33:41 04064      0
## 17788 MDA23825       <NA>      <NA> 2014-06-21 00:10:46 04064      0
##            targeting donor_prefix
## 2701  Democrats Only          MDA
## 7554  Democrats Only          MDA
## 8517  Whatever Helps          MDA
## 10265 Democrats Only          MDA
## 10311 Whatever Helps          MDA
## 17788 Whatever Helps          MDA

targeting

This one is interesting… too interesting.

donors %>% group_by(targeting) %>% summarize(n=n(), sum=sum(amount))
## Source: local data frame [7 x 3]
## 
##                        targeting     n     sum
## 1                 Democrats Only 15289  930158
## 2        Jim Rubens/NH-Sen/R-Pri     4      35
## 3                 Only Democrats     2      70
## 4               Republicans Only   376   27606
## 5 Staci Appel/IA-Whatever Helps3     2      35
## 6                 Whatever Helps 44327 3468121
## 7                             NA    74 3395438

There’s only a few edge cases so let’s get rid of all of the interestingness.

donors$targeting = revalue(donors$targeting,
                           c("Jim Rubens/NH-Sen/R-Pri"="Republicans Only",
                             "Only Democrats"="Democrats Only",
                             "Staci Appel/IA-Whatever Helps3"="Whatever Helps"))
donors$targeting = as.factor(donors$targeting)

And now we’re ready to tango.