Contributors of Bernie Sanders

1. Read the data

library(dplyr)
library(data.table)
library(ggplot2)
data <- fread('P60007168-ALL.csv', header = FALSE)

## 
Read 0.0% of 2063097 rows
Read 16.0% of 2063097 rows
Read 25.7% of 2063097 rows
Read 35.4% of 2063097 rows
Read 48.0% of 2063097 rows
Read 63.0% of 2063097 rows
Read 76.1% of 2063097 rows
Read 80.9% of 2063097 rows
Read 2063097 rows and 19 (of 19) columns from 0.413 GB file in 00:00:14

data <- data[, 1:18, with=FALSE]
nameStr <- 
"cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,
contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,
contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp"

nameStr <- strsplit(nameStr, ',')[[1]]
newName <- gsub('\n', '', nameStr)
names(data) <- newName

# change the column type
new_data <- data %>%
  mutate(contbr_employer = factor(contbr_employer), 
         contbr_occupation = factor(contbr_occupation),
         contbr_st = factor(contbr_st),
         contb_receipt_dt = as.Date(contb_receipt_dt, format='%d-%b-%y'))

In the original data set, the last column is NA. I removed the last column so that the rest of each column has correct name.
For the columns we are interested it, change them to correct types.

2. Most significant contributors by Occupation

q1 <- new_data %>% 
  group_by(contbr_occupation) %>%
  filter(contb_receipt_amt > 0) %>%
  summarise(total_amount = sum(contb_receipt_amt), total_contbr = n(), avg_donate=total_amount/total_contbr) %>%
  arrange(desc(total_contbr))
q1[1:10, ]

## # A tibble: 10 × 4
##    contbr_occupation total_amount total_contbr avg_donate
##               <fctr>        <dbl>        <int>      <dbl>
## 1       NOT EMPLOYED   25628469.6       567712   45.14343
## 2            RETIRED    4052546.4        76700   52.83633
## 3            TEACHER    1823146.9        50043   36.43161
## 4           ENGINEER    1628144.6        29752   54.72387
## 5  SOFTWARE ENGINEER    1822173.0        29517   61.73300
## 6           ATTORNEY    2050315.7        29184   70.25479
## 7          PROFESSOR    1405766.2        26096   53.86903
## 8      SELF EMPLOYED    1213079.1        23479   51.66656
## 9          PHYSICIAN    1924241.7        23257   82.73817
## 10        CONSULTANT     900961.4        17543   51.35732

p1 <- ggplot(q1[1:10,], aes(x=reorder(contbr_occupation, total_contbr), y=total_contbr)) + 
  geom_bar(stat = 'identity') + 
  coord_flip() + 
  labs(title = 'Most Significant Contributors by Occupation (Top 10)', 
       y = 'Number of Contributors',
       x = 'Occupation')
p1

For each occupation, I computed the total contribution amount and the total number of contributors. Notice that I found many records in which the contribution amount is negative, looks like that’s refunds to individuals. In order to compute the number of contributors, I just removed the refunds records.
As the plot shows, the top three occupations support Bernie Sanders most are Not Employed, Retired and Teacher, which is consistent with the politics proposal of Sanders.

3. Most significant contributors by Employer

q2 <- new_data %>% 
  group_by(contbr_employer) %>%
  filter(contb_receipt_amt > 0) %>%
  summarise(total_amount = sum(contb_receipt_amt), total_contbr = n(), avg_donate=total_amount/total_contbr) %>%
  arrange(desc(total_contbr))
q2[1:10, ]

## # A tibble: 10 × 4
##          contbr_employer total_amount total_contbr avg_donate
##                   <fctr>        <dbl>        <int>      <dbl>
## 1                   NONE   15713103.0       429918   36.54907
## 2           NOT EMPLOYED   14380900.1       223208   64.42825
## 3                   SELF    5937895.4       139408   42.59365
## 4          SELF EMPLOYED    6792103.7       122804   55.30849
## 5                RETIRED     671612.3        16960   39.59978
## 6          SELF-EMPLOYED     719114.3        15533   46.29590
## 7                           3099513.5        15277  202.88758
## 8  INFORMATION REQUESTED    1967702.3         9323  211.05891
## 9                    N/A     417217.1         7564   55.15827
## 10                   MS.     166044.5         4902   33.87280

p2 <- ggplot(q2[1:10,], aes(x=reorder(contbr_employer, total_contbr), y=total_contbr)) + 
  geom_bar(stat = 'identity') + 
  coord_flip() + 
  labs(title = 'Most Significant Contributors by Employer (Top 10)', 
       y = 'Number of Contributors',
       x = 'Employer')
p2

The analysis of contributors by Employer is highly consistent with that of Occupation. I believe the employer type NONE, NOT Employed and SELF are the same or have large overlapping.
To get more accurate analysis, we need correct the Employer names to the same format. E.g. SELF, SELF EMPLOYED, SELF-EMPLOYED are obviously the same. This can be done in future work.

4. Most significant contributors over State

q3 <- new_data %>% 
  group_by(contbr_st) %>%
  filter(contb_receipt_amt > 0) %>%
  summarise(total_amount = sum(contb_receipt_amt), total_contbr = n(), avg_donate=total_amount/total_contbr) %>%
  arrange(desc(total_contbr))
q3[1:10, ]

## # A tibble: 10 × 4
##    contbr_st total_amount total_contbr avg_donate
##       <fctr>        <dbl>        <int>      <dbl>
## 1         CA     20679154       404069   51.17728
## 2         NY      8457321       173387   48.77713
## 3         WA      5744186       120828   47.54019
## 4         MA      7882637       113686   69.33692
## 5         FL      3483095        81950   42.50268
## 6         TX      3568233        79487   44.89077
## 7         IL      3453610        76015   45.43327
## 8         OR      3120169        72475   43.05165
## 9         PA      2526101        59255   42.63101
## 10        VT      2537746        55706   45.55607

p3 <- ggplot(q3[1:10,], aes(x=reorder(contbr_st, total_contbr), y=total_contbr)) + 
  geom_bar(stat = 'identity') + 
  coord_flip() + 
  labs(title = 'Most Significant Contributors by State (Top 10)', 
       y = 'Number of Contributors',
       x = 'State')
p3

The top three States are CA, NY, WA, which is consistent with our prior knowledge. These states supports democratic party more often.

5. Most significant contributors over time

q4 <- new_data %>%
  group_by(Year_Month = format(contb_receipt_dt, '%y-%m')) %>%
  filter(contb_receipt_amt > 0) %>%
  summarise(total_amount = sum(contb_receipt_amt), total_contbr = n()) %>%
  arrange(Year_Month)

p4 <- ggplot(q4, aes(x=Year_Month, y=total_contbr)) + geom_line(group=1) +
  labs(title = 'Number of Contributors Over Time', 
       y = 'Number of Contributors',
       x = 'Year_Month')
p4

Since the records are the donation of a contributor for each day, I extracted the Year and Month of the records, then aggregate the number of contributors by each month.
From the time series plot, we can see at the very beginning, Sanders didn’t draw attention of people, only few of them donated to him. While as the compaign proceeded, Sanders got more support. And in March 2016, the largest number of contributors donated to Sanders.

Contributors of Bernie Sanders

Xiang Cao

February 22, 2017

1. Read the data

2. Most significant contributors by Occupation

3. Most significant contributors by Employer

4. Most significant contributors over State

5. Most significant contributors over time