While we’re waiting for Mayday to release their full data dump to the public, we can take a peek at some of the data they’ve already submitted to the FEC for the period ending 6/30 (before the end of the second round). The data were published a couple weeks ago and got some attention then.
You can see the filing here or download the filing here. Interestingly, these files natively use ASCII-delimited text, which I’d never seen in the wild before. I’ll cowardly download the .csv instead.
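If you’d rather not be cowardly, the native format is less scary than it sounds. Here’s a minimal sketch, assuming the delimiter is ASCII 28 (the “file separator” control character); that’s my understanding of the format, but check it against the FEC’s documentation. The two records are made up for illustration, not taken from the filing.

```r
# Assumption: fields are separated by ASCII 28 (0x1C), the "file separator".
fs <- "\x1c"

# Two made-up records standing in for lines of a real .fec file.
sample_lines <- c(
  paste("HDR", "FEC", "8.1", sep = fs),
  paste("SA11AI", "C00000000", "Doe", "250.00", sep = fs)
)

# Split each line on the separator; fixed = TRUE avoids regex interpretation.
fields <- strsplit(sample_lines, fs, fixed = TRUE)

# The first field of each record is its record type.
record_types <- vapply(fields, function(f) f[1], character(1))
print(table(record_types))
```

For a real file, `readLines()` followed by the same `strsplit()` call should get you most of the way there.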
Opening up the file in vim, it looks like the first column of each row describes the record type. Let’s see how many of each type we have:
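# Load plyr (for colwise) before dplyr so dplyr's summarize() isn't masked.
library(plyr); library(dplyr); library(ggplot2)
df = read.csv('939397.fec', header=FALSE, stringsAsFactors=FALSE)
df %>% group_by(V1) %>% summarize(n=n())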
## Source: local data frame [4 x 2]
##
##       V1    n
## 1   F3XN    1
## 2    HDR    1
## 3 SA11AI 2449
## 4  SB21B   72
That’s not so bad. That corresponds to one header record, one summary record from Form 3X (“REPORT OF RECEIPTS AND DISBURSEMENTS FOR OTHER THAN AN AUTHORIZED COMMITTEE”), 2,449 donor records from Schedule A line 11 (“Itemized Receipts”), and 72 records from Schedule B line 21 (“Itemized Disbursements”).
We can find definitions for each record type in the FEC Vendor Pack. I dumped the donor record headers into a file; let’s pair them up with the data, clear out any columns that are empty, and see what data we have to work with.
donors = df %>% filter(V1 == 'SA11AI')
donors = donors[, 1:45] # drop extra columns
names(donors) = gsub(' ', '.', scan('fec_donor_cols.csv', what='character', sep='\n'))
donors = donors[, colwise(function(x) !all(is.na(x)))(donors) %>% as.logical()]
names(donors)
## [1] "FORM.TYPE" "FILER.COMMITTEE.ID.NUMBER"
## [3] "TRANSACTION.ID." "BACK.REFERENCE.TRAN.ID.NUMBER"
## [5] "BACK.REFERENCE.SCHED.NAME" "ENTITY.TYPE"
## [7] "CONTRIBUTOR.ORGANIZATION.NAME" "CONTRIBUTOR.LAST.NAME"
## [9] "CONTRIBUTOR.FIRST.NAME" "CONTRIBUTOR.MIDDLE.NAME"
## [11] "CONTRIBUTOR.SUFFIX" "CONTRIBUTOR.STREET..1"
## [13] "CONTRIBUTOR.STREET..2" "CONTRIBUTOR.CITY"
## [15] "CONTRIBUTOR.STATE" "CONTRIBUTOR.ZIP"
## [17] "ELECTION.CODE" "CONTRIBUTION.DATE"
## [19] "CONTRIBUTION.AMOUNT" "CONTRIBUTION.AGGREGATE"
## [21] "CONTRIBUTION.PURPOSE.DESCRIP" "CONTRIBUTOR.EMPLOYER"
## [23] "CONTRIBUTOR.OCCUPATION"
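The “clear out any columns that are empty” step is the least obvious line above, so here is the same idea in base R on a toy data frame. This is a sketch, not the code used for the results in this post: the column names `ALL.NA` and `ALL.BLANK` are made up, and I’m assuming “empty” should cover blank strings as well as `NA`.

```r
# Toy stand-in for the donor records; only the shape matters here.
toy <- data.frame(
  FORM.TYPE = c("SA11AI", "SA11AI"),
  ALL.NA = c(NA, NA),                 # hypothetical column with no data
  ALL.BLANK = c("", ""),              # hypothetical column of empty strings
  CONTRIBUTION.AMOUNT = c(100, 250),
  stringsAsFactors = FALSE
)

# A column is "empty" if every value is NA or a blank string.
is_empty <- function(x) all(is.na(x) | x == "")

# Keep only the columns that have at least one real value.
keep <- !vapply(toy, is_empty, logical(1))
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```

`drop = FALSE` keeps the result a data frame even if only one column survives.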
It’s odd that we have only 2,449 donor records; I expected far more. I see some of my friends in this data set (and some fun celeb-spotting opportunities) but not myself. I wonder what the distribution of donation amounts looks like – maybe I didn’t give enough?
There’s at least one very large donation, but most of the donations look much smaller:
summary(donors$CONTRIBUTION.AMOUNT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1     100     250     854     500  250000
There are even some very small donations, but maybe those are from people who gave larger amounts in toto.
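One way to check that hunch is to total up each donor’s gifts and see whether the tiny ones belong to people who gave more overall. Here’s a rough sketch on made-up rows; the column names are simplified, the $10 cutoff for “tiny” is arbitrary, and matching donors on first and last name alone is an assumption that would merge different people who share a name.

```r
# Made-up donations; none of these come from the filing.
toy <- data.frame(
  last = c("Doe", "Doe", "Roe", "Poe"),
  first = c("Jane", "Jane", "Richard", "Edgar"),
  amount = c(1, 500, 250, 5),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the sum of amounts within that row's donor group.
toy$donor_total <- ave(toy$amount, toy$last, toy$first, FUN = sum)

# Tiny gifts (under an arbitrary $10) from donors whose total is larger than that gift.
tiny_but_generous <- toy[toy$amount < 10 & toy$donor_total > toy$amount, ]
print(tiny_but_generous)
```

The real data also has a `CONTRIBUTION.AGGREGATE` column, which presumably carries a running per-donor total and might answer this without any grouping.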
Anyway, let’s take a look at the distribution of the bottom 99% of donations.
small_donors = donors[donors$CONTRIBUTION.AMOUNT < quantile(donors$CONTRIBUTION.AMOUNT, 0.99),]
ggplot(small_donors, aes(CONTRIBUTION.AMOUNT)) + geom_density(adjust=0.25)
We see peaks near round numbers – $100, $200, $250, $500, $1,000. Across all donations, 96.6% are $1,000 or less, and together they account for 24.5% of total money raised.
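Those two percentages are a count share and a dollar share, and they’re easy to mix up. A tiny sketch with invented amounts (not the filing’s) shows how each could be computed:

```r
# Invented amounts, including one very large gift.
amounts <- c(50, 100, 250, 1000, 5000, 250000)

# Flag donations of $1,000 or less.
is_small <- amounts <= 1000

# Share of donations: the mean of a logical vector is the fraction that is TRUE.
share_of_donations <- mean(is_small)

# Share of money: dollars from small donations over all dollars.
share_of_money <- sum(amounts[is_small]) / sum(amounts)

print(round(100 * c(donations = share_of_donations, money = share_of_money), 1))
```

With one huge gift in the mix, the two shares end up far apart, which is the same pattern the real numbers show.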
Let’s look at how much Mayday raised over time:
by_date = donors %>% group_by(CONTRIBUTION.DATE) %>% summarize(amount=sum(CONTRIBUTION.AMOUNT))
by_date$date = as.Date(as.character(by_date$CONTRIBUTION.DATE), "%Y%m%d") # Date is safer in a data frame than strptime's POSIXlt
by_date$cumulative = cumsum(by_date$amount)
ggplot(by_date, aes(date, cumulative)) + geom_line() + ylab("Amount raised")
Interestingly, the reported total is far shy of $1 million by 5/15, so the filing must be missing a lot of small donations. Take everything here with a grain of salt: we’ll see below that tech dominates the reported donations, but the reported donations may not be representative of all the smaller ones.
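To read a number like “raised by 5/15” off the data rather than off the chart, you can sort by date, take a running sum, and look up the last value on or before the cutoff. A minimal sketch with made-up dates and amounts (the year is invented too):

```r
# Made-up donations, deliberately out of order, with dates as YYYYMMDD strings.
toy <- data.frame(
  date = as.Date(c("20140501", "20140520", "20140510", "20140501"), format = "%Y%m%d"),
  amount = c(100, 300, 50, 25)
)

# cumsum() only makes sense once the rows are in date order.
toy <- toy[order(toy$date), ]
toy$cumulative <- cumsum(toy$amount)

# Running total as of a hypothetical cutoff date.
cutoff <- as.Date("2014-05-15")
raised_by_cutoff <- max(toy$cumulative[toy$date <= cutoff])
print(raised_by_cutoff)
```

The sort is the step that’s easy to forget; a running sum over unsorted rows gives a plausible-looking but wrong curve.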
The employer and occupation fields aren’t normalized. I did a little cleanup that I won’t show here; with that done, let’s see if we can find employers with generous donors. I’m arbitrarily restricting this to employers with at least 5 donations to increase the odds we’ve heard of them. (Notably, “donations” is not the same as “donors”; many people gave more than once.)
small_donors %>%
  group_by(CONTRIBUTOR.EMPLOYER) %>%
  summarize(total=sum(CONTRIBUTION.AMOUNT), donation_count=n()) %>%
  mutate(mean=total/donation_count) %>%
  filter(donation_count >= 5) %>%
  arrange(desc(total)) %>%
  head(10)
## Source: local data frame [10 x 4]
##
##    CONTRIBUTOR.EMPLOYER  total donation_count  mean
## 1         Self-employed 286944            838 342.4
## 2                Google  75806            131 578.7
## 3                        17434             30 581.1
## 4                 Apple   9700             23 421.7
## 5               Twitter   5850             11 531.8
## 6                 Yahoo   4396             10 439.6
## 7              Facebook   3149              7 449.9
## 8             Microsoft   3050             11 277.3
## 9                VMware   2500              6 416.7
## 10           Amazon.com   2264              9 251.6
So that’s… pretty solidly tech-oriented!
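The employer cleanup isn’t shown in this post, so here is a guess at what that kind of normalization might look like. The function `normalize_employer` and its rules are hypothetical and the sample strings are made up; real data would need a longer list of special cases.

```r
# Made-up variants of the kind free-text employer fields tend to contain.
raw <- c("Google", "google inc.", " Google, Inc ", "Self", "self-employed", "Apple")

# Hypothetical normalizer: trim, lowercase, strip punctuation and corporate suffixes.
normalize_employer <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[.,]", "", x)                 # drop periods and commas
  x <- sub(" +(inc|llc|corp)$", "", x)     # drop a trailing corporate suffix
  x[x %in% c("self", "self employed", "self-employed")] <- "self-employed"
  x
}

clean <- normalize_employer(raw)
print(table(clean))
```

The output is all lowercase, so you’d want to map the cleaned keys back to display names before printing a table like the one above.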
Occupation data is free-text entry but even without any normalization it gives a surprisingly clear answer:
small_donors %>%
  group_by(CONTRIBUTOR.OCCUPATION) %>%
  summarize(total=sum(CONTRIBUTION.AMOUNT), donation_count=n()) %>%
  mutate(mean=total/donation_count) %>%
  filter(donation_count >= 5) %>%
  arrange(desc(total)) %>%
  head(10)
## Source: local data frame [10 x 4]
##
##    CONTRIBUTOR.OCCUPATION  total donation_count  mean
## 1       Software Engineer 127808            310 412.3
## 2                 Retired 105195            318 330.8
## 3                Engineer  46868            146 321.0
## 4                Attorney  24300             57 426.3
## 5      Software Developer  23962             83 288.7
## 6                          17734             32 554.2
## 7               Physician  15088             35 431.1
## 8            Not Employed  13565             33 411.1
## 9            Entrepreneur  13386             32 418.3
## 10             Programmer  12676             53 239.2
There are also 49 students, who gave an average of $159.86.
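The code behind the student figure isn’t shown, so the following is only one plausible way to get a number like it: match “student” anywhere in the free-text occupation, ignoring case. The rows are made up, and note that this counts donations rather than distinct people.

```r
# Made-up rows with a few spellings of "student".
toy <- data.frame(
  occupation = c("Student", "Graduate Student", "Software Engineer", "student"),
  amount = c(100, 250, 500, 25),
  stringsAsFactors = FALSE
)

# Case-insensitive substring match on the free-text occupation field.
is_student <- grepl("student", toy$occupation, ignore.case = TRUE)

n_students <- sum(is_student)                 # number of matching donations
mean_gift <- mean(toy$amount[is_student])     # average size of those donations
print(c(n = n_students, mean = mean_gift))
```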
That’s all for now; I’ll come back to this later and practice some geographic visualization techniques, but, spoiler alert, California’s over-represented. :P