While we’re waiting for Mayday to release their full data dump to the public, we can take a peek at some of the data they’ve already submitted to the FEC for the period ending 6/30 (before the end of the second round). The data were published a couple weeks ago and got some attention then.

You can see the filing here or download the filing here. Interestingly, these files natively use ASCII-delimited text, which I’d never seen in the wild before. I’ll cowardly download the .csv instead.
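If you'd rather not be a coward, the native format can be split by hand. This is a minimal sketch that assumes the delimiter is the ASCII file-separator control character (0x1C), which is worth confirming against the file itself. `read_fec_raw` is a hypothetical helper name, not something from the FEC tooling.

```r
# Hypothetical helper: read a raw .fec file and group its rows by record type.
# Assumes fields are separated by the ASCII file-separator character (0x1C).
read_fec_raw = function(path, sep = "\x1c") {
  lines = readLines(path, warn = FALSE)
  # fixed = TRUE treats the separator as a literal byte, not a regex
  fields = strsplit(lines, sep, fixed = TRUE)
  # the first field of each row is the record type (HDR, F3XN, SA11AI, ...)
  record_type = vapply(fields, function(f) f[1], character(1))
  split(fields, record_type)
}

# Usage (not run here): recs = read_fec_raw('939397.fec'); sapply(recs, length)
```

Working from a list of character vectors rather than a data frame sidesteps the fact that different record types have different numbers of fields.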

Opening up the file in vim, it looks like the first column of each row describes the record type. Let’s see how many of each type we have:

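library(plyr); library(dplyr); library(ggplot2)  # load plyr first so dplyr's verbs take precedence
df = read.csv('939397.fec', header=FALSE, stringsAsFactors=FALSE)
df %>% group_by(V1) %>% summarize(n=n())  # count rows per record type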
## Source: local data frame [4 x 2]
## 
##       V1    n
## 1   F3XN    1
## 2    HDR    1
## 3 SA11AI 2449
## 4  SB21B   72

That’s not so bad. That corresponds to one header record, one summary record from Form 3X (“REPORT OF RECEIPTS AND DISBURSEMENTS FOR OTHER THAN AN AUTHORIZED COMMITTEE”), 2,449 donor records from Schedule A line 11 (“Itemized Receipts”), and 72 records from Schedule B line 21 (“Itemized Disbursements”).

We can find definitions for each record type in the FEC Vendor Pack. I dumped the donor record headers into a file; let’s pair them up with the data, clear out any columns that are empty, and see what data we have to work with.

donors = df %>% filter(V1 == 'SA11AI')
donors = donors[, 1:45] # drop extra columns
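# column names come from the Vendor Pack header list, with spaces turned into dots
names(donors) = gsub(' ', '.', scan('fec_donor_cols.csv', what='character', sep='\n'))
# keep only the columns that contain at least one non-NA value
donors = donors[, colwise(function(x) !all(is.na(x)))(donors) %>% as.logical()]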
names(donors)
##  [1] "FORM.TYPE"                     "FILER.COMMITTEE.ID.NUMBER"    
##  [3] "TRANSACTION.ID."               "BACK.REFERENCE.TRAN.ID.NUMBER"
##  [5] "BACK.REFERENCE.SCHED.NAME"     "ENTITY.TYPE"                  
##  [7] "CONTRIBUTOR.ORGANIZATION.NAME" "CONTRIBUTOR.LAST.NAME"        
##  [9] "CONTRIBUTOR.FIRST.NAME"        "CONTRIBUTOR.MIDDLE.NAME"      
## [11] "CONTRIBUTOR.SUFFIX"            "CONTRIBUTOR.STREET..1"        
## [13] "CONTRIBUTOR.STREET..2"         "CONTRIBUTOR.CITY"             
## [15] "CONTRIBUTOR.STATE"             "CONTRIBUTOR.ZIP"              
## [17] "ELECTION.CODE"                 "CONTRIBUTION.DATE"            
## [19] "CONTRIBUTION.AMOUNT"           "CONTRIBUTION.AGGREGATE"       
## [21] "CONTRIBUTION.PURPOSE.DESCRIP"  "CONTRIBUTOR.EMPLOYER"         
## [23] "CONTRIBUTOR.OCCUPATION"
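The "clear out any columns that are empty" step can also be done without plyr. Here is a sketch in base R on a made-up data frame (`toy` and its values are purely illustrative):

```r
# Toy stand-in for the donor records; column b is entirely NA.
toy = data.frame(a = c(1, NA), b = c(NA, NA), c = c("x", "y"),
                 stringsAsFactors = FALSE)

# TRUE for each column that has at least one non-NA value
keep = colSums(!is.na(toy)) > 0

# drop = FALSE keeps the result a data frame even if one column survives
kept = toy[, keep, drop = FALSE]
names(kept)
```

Note that this only catches columns that are `NA` throughout; a column of empty strings would need an extra check.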

It’s odd that we have so few records; I expected more rows. I see some of my friends in this data set (some fun celeb-spotting opportunities, as well) but not myself. I wonder what the distribution of donation amounts looks like – maybe I didn’t give enough?

There’s at least one very large donation, but most of the donations look much smaller:

summary(donors$CONTRIBUTION.AMOUNT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     100     250     854     500  250000

There are even some very small donations, but maybe those are from people who gave larger amounts in toto.
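One way to check that guess is to compare each small donation against the donor's CONTRIBUTION.AGGREGATE field. A sketch on made-up numbers (the `toy` data frame and the $25 cutoff are assumptions for illustration, not values from the filing):

```r
# Toy stand-in for `donors`; amounts and aggregates are invented.
toy = data.frame(
  CONTRIBUTION.AMOUNT    = c(1, 5, 250, 10),
  CONTRIBUTION.AGGREGATE = c(251, 5, 250, 1010)
)

# "tiny" is an arbitrary cutoff chosen for this example
tiny = toy[toy$CONTRIBUTION.AMOUNT < 25, ]

# share of tiny donations where the donor's aggregate exceeds that one donation
share_repeat = mean(tiny$CONTRIBUTION.AGGREGATE > tiny$CONTRIBUTION.AMOUNT)
share_repeat
```

A share close to 1 on the real data would support the idea that the tiny donations come from people who gave more in total.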

Anyway, let’s take a look at the distribution of the bottom 99% of donations.

small_donors = donors[donors$CONTRIBUTION.AMOUNT < quantile(donors$CONTRIBUTION.AMOUNT, 0.99),]
ggplot(small_donors, aes(CONTRIBUTION.AMOUNT)) + geom_density(adjust=0.25)

[Figure: density plot of CONTRIBUTION.AMOUNT for the bottom 99% of donations]

We see peaks near round numbers – $100, $200, $250, $500, $1,000. Donations of $1,000 or less make up 96.6% of all donations but account for only 24.5% of total money raised.
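Those two percentages are a share of the count and a share of the dollars. A sketch of the arithmetic on invented amounts (the `amounts` vector is illustrative, not the filing data):

```r
# Made-up donation amounts, including one very large gift
amounts = c(100, 250, 500, 1000, 250000)

is_small = amounts <= 1000

# fraction of donations that are $1,000 or less
share_of_donations = mean(is_small)

# fraction of the money that those donations account for
share_of_money = sum(amounts[is_small]) / sum(amounts)

c(share_of_donations, share_of_money)
```

On the real data, the same two lines would be run against `donors$CONTRIBUTION.AMOUNT`.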

Let’s look at how much Mayday raised over time:

by_date = donors %>% group_by(CONTRIBUTION.DATE) %>% summarize(amount=sum(CONTRIBUTION.AMOUNT)) %>%
  arrange(CONTRIBUTION.DATE)  # cumsum below needs chronological order
by_date$date = as.Date(as.character(by_date$CONTRIBUTION.DATE), "%Y%m%d")  # Date rather than POSIXlt
by_date$cumulative = cumsum(by_date$amount)
ggplot(by_date, aes(date, cumulative)) + geom_line() + ylab("Amount raised")

[Figure: cumulative amount raised by contribution date]

Interesting that we’re so far shy of $1 million by 5/15; it must mean that we’re missing information about a lot of small donations. That is plausible, since committees generally only have to itemize donors whose contributions total more than $200 in a calendar year. So take all of this with a grain of salt. We’ll see below that tech dominates the reported donations, but the reported donations may not be representative of all the smaller ones.
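To put a number on "how far shy", you can read the running total off at a cutoff date. A sketch on made-up data (the `toy` rows, the year, and the cutoff are all illustrative assumptions):

```r
# Toy daily totals; dates and amounts are invented for the example.
toy = data.frame(
  date   = as.Date(c("2014-05-20", "2014-05-01", "2014-05-10")),
  amount = c(500, 100, 250)
)

# cumsum is only meaningful once the rows are in date order
toy = toy[order(toy$date), ]
toy$cumulative = cumsum(toy$amount)

cutoff = as.Date("2014-05-15")

# largest running total on or before the cutoff; 0 if nothing precedes it
raised_by_cutoff = max(toy$cumulative[toy$date <= cutoff], 0)
raised_by_cutoff
```

The gap between that figure and the publicly announced total is a rough estimate of what the itemized records leave out.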

The employer and occupation fields aren’t normalized. I’ll do a little bit of cleanup that I won’t show you, then see if we can find employers with generous donors (using the bottom-99% subset from above). I’m arbitrarily choosing companies with at least 5 donations to increase the odds we’ve heard of them. (Notably, “donations” is not the same as “donors”; many people gave more than once.)

small_donors %>%
  group_by(CONTRIBUTOR.EMPLOYER) %>%
  summarize(total=sum(CONTRIBUTION.AMOUNT), donation_count=n()) %>%
  mutate(mean=total/donation_count) %>%
  filter(donation_count >= 5) %>%
  arrange(desc(total)) %>%
  head(10)
## Source: local data frame [10 x 4]
## 
##    CONTRIBUTOR.EMPLOYER  total donation_count  mean
## 1         Self-employed 286944            838 342.4
## 2                Google  75806            131 578.7
## 3                        17434             30 581.1
## 4                 Apple   9700             23 421.7
## 5               Twitter   5850             11 531.8
## 6                 Yahoo   4396             10 439.6
## 7              Facebook   3149              7 449.9
## 8             Microsoft   3050             11 277.3
## 9                VMware   2500              6 416.7
## 10           Amazon.com   2264              9 251.6

So that’s… pretty solidly tech-oriented!
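Since donations and donors aren't the same thing, it can be worth counting both per employer. This sketch uses base R on a made-up data frame; keying a donor on last name, first name, and ZIP is an assumption for illustration and would merge or split some real people incorrectly.

```r
# Toy stand-in for the donor records; names and employers are invented.
toy = data.frame(
  CONTRIBUTOR.LAST.NAME  = c("Doe", "Doe", "Roe", "Poe"),
  CONTRIBUTOR.FIRST.NAME = c("Jane", "Jane", "Rick", "Pat"),
  CONTRIBUTOR.ZIP        = c("94110", "94110", "10001", "02139"),
  CONTRIBUTOR.EMPLOYER   = c("Acme", "Acme", "Acme", "Initech"),
  stringsAsFactors = FALSE
)

# crude donor identity: name plus ZIP
donor_key = paste(toy$CONTRIBUTOR.LAST.NAME, toy$CONTRIBUTOR.FIRST.NAME,
                  toy$CONTRIBUTOR.ZIP)

# rows per employer vs. distinct donor keys per employer
donation_count = tapply(donor_key, toy$CONTRIBUTOR.EMPLOYER, length)
donor_count    = tapply(donor_key, toy$CONTRIBUTOR.EMPLOYER,
                        function(k) length(unique(k)))

rbind(donation_count, donor_count)
```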

Occupation data is free-text entry, but even without any normalization it gives a surprisingly clear answer:

small_donors %>%
  group_by(CONTRIBUTOR.OCCUPATION) %>%
  summarize(total=sum(CONTRIBUTION.AMOUNT), donation_count=n()) %>%
  mutate(mean=total/donation_count) %>%
  filter(donation_count >= 5) %>%
  arrange(desc(total)) %>%
  head(10)
## Source: local data frame [10 x 4]
## 
##    CONTRIBUTOR.OCCUPATION  total donation_count  mean
## 1       Software Engineer 127808            310 412.3
## 2                 Retired 105195            318 330.8
## 3                Engineer  46868            146 321.0
## 4                Attorney  24300             57 426.3
## 5      Software Developer  23962             83 288.7
## 6                          17734             32 554.2
## 7               Physician  15088             35 431.1
## 8            Not Employed  13565             33 411.1
## 9            Entrepreneur  13386             32 418.3
## 10             Programmer  12676             53 239.2

There are also 49 students, who gave an average of $159.86.
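Several rows in the occupation table look like variants of the same job (Software Engineer, Software Developer, Programmer). Here is a sketch of what a light normalization pass might look like. It is not the cleanup used for this post, and both `normalize_occupation` and the grouping it applies are hypothetical.

```r
# Hypothetical normalizer: lower-case, tidy whitespace, fold a few variants.
normalize_occupation = function(x) {
  x = tolower(trimws(x))
  x = gsub("\\s+", " ", x)  # collapse runs of whitespace
  # which titles count as "software" is a judgment call made for this example
  software = c("software engineer", "software developer", "programmer")
  ifelse(x %in% software, "software", x)
}

occ = c("Software Engineer", " software  developer", "Programmer", "Attorney")
normalized = normalize_occupation(occ)
table(normalized)
```

Grouping on the normalized value instead of the raw field would merge those rows before summing.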

That’s all for now; I’ll come back to this later and practice some geographic visualization techniques, but, spoiler alert, California’s over-represented. :P