Introduction

In this post I will be diving into parsing and looking at filings that 527 political organizations submit to the IRS. I am looking into this as the start of a larger project to understand, measure, and predict the effects of financial influence on politics.

If that sounds interesting to you, follow me on twitter so you can see the next one!

https://twitter.com/isaac_flath

The main thing you that may be nice to know (but not required) when reading this article is what a 527 organization is. All of the data we are looking at will be regarding 527 organizations, so if you don’t know what they are nothing will have much context.

Here’s what a 527 organization is, courtesy of wikipedia.

A 527 organization or 527 group is a type of U.S. tax-exempt organization organized under Section 527 of the U.S. Internal Revenue Code (26 U.S.C. § 527). A 527 group is created primarily to influence the selection, nomination, election, appointment or defeat of candidates to federal, state or local public office.

Technically, almost all political committees, including state, local, and federal candidate committees, traditional political action committees (PACs), “Super PACs”, and political parties are “527s”. However, in common practice the term is usually applied only to such organizations that are not regulated under state or federal campaign finance laws because they do not “expressly advocate” for the election or defeat of a candidate or party.

There are no upper limits on contributions to 527s and no restrictions on who may contribute. There are no spending limits imposed on these organizations. The organizations must register with the Internal Revenue Service (IRS), publicly disclose their donors and file periodic reports of contributions and expenditures.\[1\]

Because they may not expressly advocate for specific candidates or coordinate with any candidate’s campaign, many 527s are used to raise money to spend on issue advocacy and voter mobilization. Examples of 527s are Swift Boat Veterans for Truth, The Media Fund, America Coming Together, the Progress for America Voter Fund, and the Secretary of State Project.\[2\]

Let’s get started!

Setup

Imports

These are all standard imports that I use. I haven’t filtered out ones not used in this particular report, but these are my standard imports I start with on projects.

library(DBI)

# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('patchwork') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
library('ggthemes') # visualisation
library('viridis') # visualisation
library('gt') # table styling

# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('vroom') # input/output
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('janitor') # cleaning
library('lubridate') # Date wrangling


# specific
library('glue') # encoding
library('naniar') # missing values
#library('gganimate') # visualisation
library('ggrepel') # visualisation
library('grid') # visualisation
library('GGally') # visualisation
library('ggforce') # visualisation
#library('ggtext') # visualisation
library(rlang) # eval string as symbols
library(lazyeval) # eval string as symbols

Paths

output_path <- "~/code_outputs/irs-527-filings/"
unzip_path = file.path(output_path,"var")
dest_path = file.path(output_path,"irs527_fullData.zip")

Reset Code

do.call(file.remove, list(list.files(output_path, full.names = TRUE)))
unlink(unzip_path,recursive = T)

UDFs

`%notin%` <- Negate(`%in%`)

categorical_plot_cnt <- function(df,column,title,top=0){
  out <- df %>% 
            filter_(!is.na(column)) %>% 
            group_by_(column) %>% 
            summarise(n=n()) %>%
            arrange(desc(n))
  
  if(top > 0){out <- head(out,top)}
  
  out[[column]] <- factor(out[[column]], levels = out[[column]])

  out <- out %>%
            ggplot(aes_string(x=column, y="n",fill=column)) +
            geom_bar(stat='identity') + 
            theme_minimal() +
            theme(legend.position = "none",
                  axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
            labs(x = "", title = title)
  return(out)}

Download and Unzip

The Data is available from the IRS website at this location http://forms.irs.gov/app/pod/dataDownload/fullData.

We need to download, unzip, and then pull the file out of the nested directory path.

Download and unzip the file to FullDataFile.txt

url = "http://forms.irs.gov/app/pod/dataDownload/fullData"

if (!file.exists(dest_path)) {
    download.file(url, dest_path)
}
if (!file.exists(unzip_path)) {
    unzip(dest_path, exdir = output_path)
}

from = file.path(output_path, "/var/IRS/data/scripts/pofd/download/FullDataFile.txt")
file_path = file.path(output_path, "FullDataFile.txt")
file.copy(from, file_path)

File Head

From there we can look at the first few lines of the file. We can see a few things:

  1. Not every row has the same number of fields.
  2. The first cell has what type of row it is

This tells us that we are probably going to need to parse this row by row in some way, rather than using an out of the box read csv function.

lines <- readLines(file_path, n=7)
writeLines(lines,sep="\n\n")
H|20220129|0311|F|

1|8871|8|0|0|0|912121950|Dan Swecker for Senate Campaign|10420 - 173rd Ave SW||Rochester|WA|98579||wfga@localaccess.com||Dan Swecker|10420 - 173rd Ave SW||Rochester|WA|98579||Dan Swecker|10420 - 173rd Ave. SW||Rochester|WA|98579||same||Rochester|WA|98579|||||Tax exempt political organization - Political campaign||2001-05-13 21:20:54|0|1

D|8|26174|Dan Swecker for Senate Campaign|912121950|Dan Swecker|Candidate|10420 - 173rd Ave. SW||Rochester|WA|98579||

1|8871|9|0|0|0|954857244|FRIENDS OF TOM CALDERON|728 W. EDNA PLACE||COVINA|CA|91722||no@email||YOLANDA MIRANDA|728 W. EDNA PLACE||COVINA|CA|91722||YOLANDA MIRANDA|728 W. EDNA PLACE||COVINA|CA|91722||728 W. EDNA PLACE||COVINA|CA|91722|||||RAISE FUNDS TO ELECT CANDIDATE||2001-05-14 15:05:31|1|1

D|9|269|FRIENDS OF TOM CALDERON|954857244|THOMAS M. CALDERON|CANDIDATE|412 N. 10TH STREET||MONTEBELLO|CA|90640||

D|9|270|FRIENDS OF TOM CALDERON|954857244|YOLANDA MIRAND|TREASURER|728 W. EDNA PLACE||COVINA|CA|91722||

1|8871|10|0|0|0|061596525|ASGM PAC|WILLIAM S WEBB CO INC|377 OAK ST - CS601|GARDEN CITY|NY|11530|0601|no@email||RICHARD C BLIVEN|377 OAK ST - CS601||GARDEN CITY|NY|11530|0601|RICHARD C BLIVEN|377 OAK ST - CS601||GARDEN CITY|NY|11530|0601|WILLIAM S WEBB CO INC|377 OAK ST - CS601|GARDEN CITY|NY|11530|0601||||POLITICAL ACTION COMMITTEE TO PROMOTE LEGISLATION FAVORABLE TO WORKERS COMPENSATION SAFETY GROUPS IN NEW YORK STATE.||2001-05-14 17:24:52|1|1

Structure Data

The structure of the file has different types of data indicated by the first letter of each row. In order to read this, we need to break it out into each files. There are several ways to do this. I have chosen to do this as a 2 step process:

  1. Split file into multiple files, each containing all the rows of that type
  2. Read each of those files in individually and load into a sqlite database

Some records span multiple lines and so that needs to be dealt with also.

Split Files

This is step 1 from above. Because we know what a row of a particular type will start with, we can loop through the file and break it out row by row.

For records that span multiple lines, we just keep appending rows until we get to a new record row.

By doing this we have turned this unusually formatted data into a series of CSVs we can handle like a typical dataset.

# This is so we can track progress
nlines <- read.table(pipe(paste("wc -l",file_path)))[[1]]

# If a row starts with one of this, it's a new record
line_starts <- list('H|','F|','1|','2|','A|','B|','D|','E|','R|')

# open files we will be writing data to (1 per each type)
out_files <- list()
for(line in line_starts){
  fpath <- file.path(output_path,paste(line,'.txt'))
  if (file.exists(fpath)) {
    next
    } else {
    out_files[[line]] <- file(fpath,'w')
    }
}


if(length(out_files) > 0){
  
  con <- file(file_path,"r")
  
  for(i in 0:nlines) {
    # Get rid of line breaks
    line <- str_replace_all(str_replace_all(readLines(con, n = 1), '\r', ''), '\n', '')
    line_type <- substr(line,1,2) # This dictates what type of record the row is
    
    # Handle end of file
    if(identical(line,character(0))){
      writeLines(full_line, out_files[[out_type]])
      break}

    # Handle start of file
    if(i==0){
      full_line <- line
      out_type <- line_type
      next
    }
  
    # For new lines write data, if not new line concatenate and go to next row
    if(line_type %in% line_starts){
      if(out_type %in% names(out_files)){writeLines(full_line, out_files[[out_type]])}
      full_line <- line
      out_type <- line_type
    } else {
      full_line <- paste(full_line,line)
    }
    
    # Progress tracking
    if(i%%100000 == 0){print(paste(Sys.time(),'|',i,'of',nlines))}
  }
  # close all files
  close(con)
  for (f in out_files) {close(f)}
}

Column Map

We need to give headers to the data when we load it in so we can work with it. These come straight from the data dictionaries provided by the IRS.

a <- c('record_type','form_id_number','sched_a_id','org_name','ein','contributor_name',
       'contributor_address_1','contributor_address_2','contributor_address_city',
       'contributor_address_state','contributor_address_zip_code',
       'contributor_address_zip_ext','contributor_employer','contribution_amount',
       'contributor_occupation','agg_contribution_ytd','contribution_date')

one <- c('record_type','form_type','form_id_number','initial_report_indicator',
         'amended_report_indicator','final_report_indicator','ein','organization_name',
         'mailing_address_1','mailing_address_2','mailing_address_city',
         'mailing_address_state','mailing_address_zip_code','mailing_address_zip_ext',
         'e_mail_address','established_date','custodian_name','custodian_address_1',
         'custodian_address_2','custodian_address_city','custodian_address_state',
         'custodian_address_zip_code','custodian_address_zip_ext','contact_person_name',
         'contact_address_1','contact_address_2','contact_address_city',
         'contact_address_state','contact_address_zip_code','contact_address_zip_ext',
         'business_address_1','business_address_2','business_address_city',
         'business_address_state','business_address_zip_code','business_address_zip_ext',
         'exempt_8872_indicator','exempt_state','exempt_990_indicator','purpose',
         'material_change_date','insert_datetime','related_entity_bypass','eain_bypass')

two <- c('record_type','form_type','form_id_number','period_begin_date','period_end_date',
         'initial_report_indicator','amended_report_indicator','final_report_indicator',
         'change_of_address_indicator','organization_name','ein','mailing_address_1',
         'mailing_address_2','mailing_address_city','mailing_address_state',
         'mailing_address_zip_code','mailing_address_zip_ext','e_mail_address',
         'org_formation_date','custodian_name','custodian_address_1','custodian_address_2',
         'custodian_address_city','custodian_address_state','custodian_address_zip_code',
         'custodian_address_zip_ext','contact_person_name','contact_address_1',
         'contact_address_2','contact_address_city','contact_address_state',
         'contact_address_zip_code','contact_address_zip_ext','business_address_1',
         'business_address_2','business_address_city','business_address_state',
         'business_address_zip_code','business_address_zip_ext','qtr_indicator',
         'monthly_rpt_month','pre_elect_type','pre_or_post_elect_date',
         'pre_or_post_elect_state','sched_a_ind','total_sched_a','sched_b_ind',
         'total_sched_b','insert_datetime')

b <- c('record_type','form_id_number','sched_b_id','org_name','ein','recipient_name',
       'recipient_address_1','recipient_address_2','recipient_address_city',
       'recipient_address_st','recipient_address_zip_code','recipient_address_zip_ext',
       'recipient_employer','expenditure_amount','recipient_occupation','expenditure_date',
       'expenditure_purpose')

d <- c('record_type','form_id_number','director_id','org_name','ein','entity_name',
       'entity_title','entity_address_1','entity_address_2','entity_address_city',
       'entity_address_st','entity_address_zip_code','entity_address_zip_code_ext')

e <- c('record_type','form_id_number','eain_id','election_authority_id_number','state_issued')

r <- c('record_type','form_id_number','entity_id','org_name','ein','entity_name',
       'entity_relationship','entity_address_1','entity_address_2','entity_address_city',
       'entity_address_st','entity_address_zip_code','entity_address_zip_ext')

h <- c('record_type','transmission_date','transmission_time','file_id_modifier')
f <- c('record_type','transmission_date','transmission_time','record_count')

column_map <- list('H|'=h,
                   'F|'=f,
                   '1|'=one,
                   '2|'=two,
                   'A|'=a,
                   'B|'=b,
                   'D|'=d,
                   'E|'=e,
                   'R|'=r)         

Load into DB

Our data is no longer in an unusual format because we have split it into a collection of normal CSVs. Because of this we can easily load all our data into a sqlite database.

con <- dbConnect(RSQLite::SQLite(), file.path(output_path,"irs.sqlite"))
tables <- dbListTables(con)
load_map <- list("1|"="main_8871",
                 'D|'="directors_officers_8871",
                 'R|'="related_entities_8871",
                 'E|'="eain_8871",
                 "2|"="main_8872",
                 "A|"="schedule_a_8872",
                 "B|"="schedule_b_8872")

for(file_pref in names(load_map)){
  table_name <- load_map[[file_pref]]
  if(table_name %notin% tables){
    df <- vroom(file.path(output_path,paste(file_pref,".txt")),
                col_names=column_map[[file_pref]])
    dbWriteTable(con, table_name, df, overwrite=TRUE)
  }
}

dbListTables(con)

File Metadata

Each file from the IRS has a header and footer row that gives some metadata about the particular data dump. I am ignoring them for now, but they could be useful when a more proper pipeline is built out.

Header Row

The file starts with a header row that just have general data about this particular data file.

file_pref <- 'H|'
df <- vroom(file.path(output_path,paste(file_pref,".txt")),col_names=column_map[[file_pref]],n_max=1000)
gt(df %>% select(-X5))
record_type transmission_date transmission_time file_id_modifier
H 20220129 0311 FALSE

8871 Data

527 political organizations submit an 8871 for to the IRS to report data about the organization. This is general form that covers all the general data about the company, related entities, directors and officers, ID numbers, and the like.

8871 Main Form

The 8871 main form has all the information that is 1:1 with the company. In this header form we see a lot of basic information you would expect on any company record such as addresses, contact information, purpose, and ID numbers.

Query DB

df <- dbGetQuery(con, 'SELECT * FROM main_8871')

df <- df %>%
  mutate(across(c(exempt_state, business_address_state,contact_address_state,custodian_address_state,
                  mailing_address_state), as.factor)) %>%
  mutate(across(c(established_date,material_change_date), ymd))

names(df)
##  [1] "record_type"                "form_type"                 
##  [3] "form_id_number"             "initial_report_indicator"  
##  [5] "amended_report_indicator"   "final_report_indicator"    
##  [7] "ein"                        "organization_name"         
##  [9] "mailing_address_1"          "mailing_address_2"         
## [11] "mailing_address_city"       "mailing_address_state"     
## [13] "mailing_address_zip_code"   "mailing_address_zip_ext"   
## [15] "e_mail_address"             "established_date"          
## [17] "custodian_name"             "custodian_address_1"       
## [19] "custodian_address_2"        "custodian_address_city"    
## [21] "custodian_address_state"    "custodian_address_zip_code"
## [23] "custodian_address_zip_ext"  "contact_person_name"       
## [25] "contact_address_1"          "contact_address_2"         
## [27] "contact_address_city"       "contact_address_state"     
## [29] "contact_address_zip_code"   "contact_address_zip_ext"   
## [31] "business_address_1"         "business_address_2"        
## [33] "business_address_city"      "business_address_state"    
## [35] "business_address_zip_code"  "business_address_zip_ext"  
## [37] "exempt_8872_indicator"      "exempt_state"              
## [39] "exempt_990_indicator"       "purpose"                   
## [41] "material_change_date"       "insert_datetime"           
## [43] "related_entity_bypass"      "eain_bypass"

Identifiers

  • Each form submission gets a unique form_id.

  • I think a unique organization can be identified by the EIN number. It’s certainly very close to that - this will be useful in figuring out how to collapse down amendments.

p1 <- df %>% 
  select(ein) %>%
  count(ein) %>%
  ggplot(aes(n)) +
  geom_histogram(fill = "violet") + 
  xlim(0,5) + 
  labs(title="EINs")  

p2 <- df %>% 
  select(organization_name) %>%
  count(organization_name) %>%
  ggplot(aes(n)) +
  geom_histogram(fill = "orange") + 
  xlim(0,5) + 
  labs(title="Organization Names")

p3 <- df %>% 
  mutate(organization_name=tolower(organization_name)) %>%
  select(ein,organization_name) %>%
  group_by(ein) %>%
  summarise(n=n_distinct(organization_name)) %>%
  ggplot(aes(x=n)) +
  geom_histogram(fill = 'darkgreen') +
  labs(title="Organization Names Per EIN")

(p1 + p2) / p3

Null Values

Null values are a good place to start. While I can get an idea of what my missing data is like I can generally look at what columns I even have. There isn’t a ton to be concerned about here, but this is what I noticed:

  • Most of the columns with heavy nulls surround zip extension and address 2 fields.

  • There are also null values around exemption and change date fields. I would expect that not every form has these.

  • It is a bit interesting that established date is null, and I would be curious when that value isn’t required.

df %>% 
  sample_frac(0.2) %>% 
  naniar::vis_miss()

Amendments and Bypasses

Thoughts:

  • The majority of reports are not final reports. We will need to figure out what needs to be done about amendments. Can we just use the final report? Or is there value in what was changed? Do the final reports have all data, or can the exclude things that were completed on previous versions?

  • Exemptions and bypasses are much more common than I thought and it may be interesting to look at when these apply.

p1 <- categorical_plot_cnt(df,"initial_report_indicator","Initial Report Indicator")
p2 <- categorical_plot_cnt(df,"amended_report_indicator","Amended Report Indicator")
p3 <- categorical_plot_cnt(df,"final_report_indicator","Final Report Indicator")
p4 <- categorical_plot_cnt(df,"exempt_8872_indicator","Exempt 8872 Indicator")
p5 <- categorical_plot_cnt(df,"related_entity_bypass","Related Entity Bypass")
p6 <- categorical_plot_cnt(df,"eain_bypass","EAIN Bypass")

(p1 + p2 + p3) / (p4 + p5 + p6)

Purpose

I don’t see anything that surprises me from a simple word count on the purpose fields, but this is a very simple look so more digging could reveal something more interesting.

library(qdap)

term_count <- freq_terms(df$purpose, 50,stopwords=tm::stopwords("english"))

term_count %>%
  arrange(FREQ) %>%
  mutate(WORD=factor(WORD,levels=WORD)) %>%
  ggplot(aes(x=WORD,y=FREQ,fill=FREQ)) +
  geom_bar(stat='identity') +
  coord_flip() + 
  theme_minimal() +
  theme(legend.position = "none") + 
  labs(title="Top 50 Words in Purpose")

Entities

Olson Hagel & Fishburn LLP is dominating this list. They are a firm specializing in political law.

I will be doing some more research on exactly what services they provide and what their level of involvement is.

no@email is a very common email address.

p1 <- categorical_plot_cnt(df,"custodian_name","Custodian Name",10)
p2 <- categorical_plot_cnt(df,"contact_person_name","Contact Name",10)
p3 <- categorical_plot_cnt(df,"e_mail_address","Email",20)

(p1 + p2) / p3

Location

We see that a lot of the geographic files line up here, which is good. A much deeper analysis on organization locations is warranted, but this is just about getting a quick peek.

p1 <- categorical_plot_cnt(df,"business_address_state","Business Adress State",10)
p2 <- categorical_plot_cnt(df,"business_address_city","Business Adress City",10)

p3 <- categorical_plot_cnt(df,"custodian_address_state","Custodian Adress State",10)
p4 <- categorical_plot_cnt(df,"custodian_address_city","Custodian Adress City",10)

p5 <- categorical_plot_cnt(df,"contact_address_state","Contact Adress State",10)
p6 <- categorical_plot_cnt(df,"contact_address_city","Contact Adress City",10)

p7 <- categorical_plot_cnt(df,"mailing_address_state","Mailing Adress State",10)
p8 <- categorical_plot_cnt(df,"mailing_address_city","Mailing Adress City",10)

(p1 + p2) / (p3 + p4) / (p5 + p6) / (p7 + p8)

Correlations

  • If the report is an amended report, it’s negatively correlated with related entity bypasses, so it may be they can bypass the form because the information was previously submitted. This indicates we can’t simply discard early revisions in favor of the final report.

  • We do however wee that these bypasses are also correlated with the state listed on the form, which is interesting. This could be related to state regulations, and would be interesting to see if there are higher volumes of entities in states that allow for exemptions and bypasses.

df %>% 
  select(initial_report_indicator, amended_report_indicator,final_report_indicator,
         exempt_8872_indicator,related_entity_bypass,eain_bypass,exempt_state,
         business_address_state,contact_address_state,custodian_address_state,
         mailing_address_state) %>% 
  replace_na(list(initial_report_indicator=-1, amended_report_indicator=-1,
        final_report_indicator=-1,exempt_8872_indicator=-1,related_entity_bypass=-1,
        eain_bypass=-1,exempt_state=-1,business_address_state=-1,
        contact_address_state=-1,custodian_address_state=-1,
        mailing_address_state=-1)) %>%
  mutate(across(everything(), as.numeric)) %>% 
  cor(use = "pairwise.complete.obs", method = "spearman") %>% 
  corrplot(type = "upper", diag = FALSE, method = "square")

Directors/Officers (8871)

Companies are required to submit their Directors and Officers along side the 8871 form. Of course there can be many directors or officers in one organization.

Query DB

df <- dbGetQuery(con, 'SELECT * FROM directors_officers_8871')
df <- df %>% select(-X14)

names(df)
##  [1] "record_type"                 "form_id_number"             
##  [3] "director_id"                 "org_name"                   
##  [5] "ein"                         "entity_name"                
##  [7] "entity_title"                "entity_address_1"           
##  [9] "entity_address_2"            "entity_address_city"        
## [11] "entity_address_st"           "entity_address_zip_code"    
## [13] "entity_address_zip_code_ext"

Null Values

df %>% 
  sample_frac(0.2) %>% 
  naniar::vis_miss()

Titles

  • Most frequent job title is treasurer

  • The second most common is candidate, which means candidates are directly involved in the organization. Certainly not surprising.

  • I am interested in how Director/Officer is defined, as I would not have thought Secretary would qualify but it’s a big category.

df %>% 
  mutate(entity_title=tolower(entity_title)) %>%
  group_by(entity_title) %>%
  summarise(n=n_distinct(ein)) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  mutate(entity_title=factor(entity_title,levels=entity_title)) %>%
  ggplot(aes(entity_title, n, fill = entity_title)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "", title = "Director/Officer Titles (top 20)")

Names

  • top 10 are directors/officers in over 100 entities. This seems to be strong evidence that these entities are highly related in some way (since it’s unlikely these people are truly significantly involved in over 100 truly distinct organizations.

  • It is possible that these really highly involved members are consultants of some kind. It would be interesting to understand how involved they are and what level of influence they have if that is the case.

df %>% 
  mutate(entity_name=tolower(entity_name)) %>%
  group_by(entity_name) %>%
  summarise(n=n_distinct(ein)) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  mutate(entity_name=factor(entity_name,levels=entity_name)) %>%
  ggplot(aes(entity_name, n, fill = entity_name)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "", title = "Director/Officer Names (top 20)")

EAIN (8871)

EAIN are the Election Authority Identification numbers listed on the 8871 form.

Query DB

df <- dbGetQuery(con, 'SELECT * FROM eain_8871')
df <- df %>% select(-X6)

names(df)
## [1] "record_type"                  "form_id_number"              
## [3] "eain_id"                      "election_authority_id_number"
## [5] "state_issued"

Null Values

No null values!

df %>% 
  sample_frac(0.2) %>% 
  naniar::vis_miss()

Geographic

For EAIN’s we see largely the state issued is CA. This is a bit surprising since most of the contribution and expenditures happen in DC, and the the majority of Mailing States is DC for contributions and expenditures.

Organizations are founded in CA, but the money flows through DC. Is there an advantage to getting the EAIN issues in CA? Or are there more amendments if you get your EAIN issued in CA? For now I am looking at files individually, but will explore that more later.

df %>% 
  select(state_issued) %>%
  count(state_issued) %>%
  arrange(desc(n)) %>%
  # head(5) %>%
  mutate(state_issued=factor(state_issued,levels=state_issued)) %>%
  ggplot(aes(state_issued, n, fill = state_issued)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "", title = "State Issued")

8872 Data

While the 8871 form is about general information about the organization, the 8872 is about reporting spending and expenditures. There are three parts, the main form which covers company information, schedule A’s which cover contributions, and schedule B’s which cover expenditures.

8872 Main Form

Query DB

df <- dbGetQuery(con, 'SELECT * FROM main_8872')
df <- df %>% select(-X50)

names(df)
##  [1] "record_type"                 "form_type"                  
##  [3] "form_id_number"              "period_begin_date"          
##  [5] "period_end_date"             "initial_report_indicator"   
##  [7] "amended_report_indicator"    "final_report_indicator"     
##  [9] "change_of_address_indicator" "organization_name"          
## [11] "ein"                         "mailing_address_1"          
## [13] "mailing_address_2"           "mailing_address_city"       
## [15] "mailing_address_state"       "mailing_address_zip_code"   
## [17] "mailing_address_zip_ext"     "e_mail_address"             
## [19] "org_formation_date"          "custodian_name"             
## [21] "custodian_address_1"         "custodian_address_2"        
## [23] "custodian_address_city"      "custodian_address_state"    
## [25] "custodian_address_zip_code"  "custodian_address_zip_ext"  
## [27] "contact_person_name"         "contact_address_1"          
## [29] "contact_address_2"           "contact_address_city"       
## [31] "contact_address_state"       "contact_address_zip_code"   
## [33] "contact_address_zip_ext"     "business_address_1"         
## [35] "business_address_2"          "business_address_city"      
## [37] "business_address_state"      "business_address_zip_code"  
## [39] "business_address_zip_ext"    "qtr_indicator"              
## [41] "monthly_rpt_month"           "pre_elect_type"             
## [43] "pre_or_post_elect_date"      "pre_or_post_elect_state"    
## [45] "sched_a_ind"                 "total_sched_a"              
## [47] "sched_b_ind"                 "total_sched_b"              
## [49] "insert_datetime"

Null Values

df %>% 
  sample_n(10000) %>%
  naniar::vis_miss()

Categoricals

We see here there are a lot of form amendments, which we will need to dig in more to understand the proper way to merge these.

df <- df %>%
  select(period_begin_date,period_end_date,initial_report_indicator,amended_report_indicator,final_report_indicator,change_of_address_indicator,ein,org_formation_date,mailing_address_state,qtr_indicator,monthly_rpt_month,pre_elect_type,pre_or_post_elect_date,pre_or_post_elect_state,sched_a_ind,total_sched_a,sched_b_ind,total_sched_b)

p1 <- df %>% 
  filter(!is.na(initial_report_indicator)) %>% 
  count(initial_report_indicator) %>% 
  ggplot(aes(initial_report_indicator, n, fill = initial_report_indicator)) +
  geom_col() + theme_minimal() + theme(legend.position = "none") +
  labs(x = "", title = "Initial Report Indicator")

p2 <- df %>% 
  filter(!is.na(amended_report_indicator)) %>% 
  count(amended_report_indicator) %>% 
  ggplot(aes(amended_report_indicator, n, fill = amended_report_indicator)) +
  geom_col() + theme_minimal() + theme(legend.position = "none") +
  labs(x = "", title = "Amended Report Indicator")

p3 <- df %>% 
  filter(!is.na(final_report_indicator)) %>% 
  count(final_report_indicator) %>% 
  ggplot(aes(final_report_indicator, n, fill = final_report_indicator)) +
  geom_col() + theme_minimal() + theme(legend.position = "none") +
  labs(x = "", title = "Final Report Indicator")

p4 <- df %>% 
  filter(!is.na(change_of_address_indicator)) %>% 
  count(change_of_address_indicator) %>% 
  ggplot(aes(change_of_address_indicator, n, fill = change_of_address_indicator)) +
  geom_col() + theme_minimal() + theme(legend.position = "none") +
  labs(x = "", title = "Change of Address Indicator")

p6 <- df %>% 
  select(mailing_address_state) %>%
  count(mailing_address_state) %>%
  arrange(desc(n)) %>%
  mutate(mailing_address_state=factor(mailing_address_state,levels=mailing_address_state)) %>%
  ggplot(aes(mailing_address_state, n, fill = mailing_address_state)) +
  geom_col() + theme_minimal() +
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "", title = "Mailing State")

p7 <- df %>% 
  filter(!is.na(sched_a_ind)) %>% 
  count(sched_a_ind) %>% 
  ggplot(aes(sched_a_ind, n, fill = sched_a_ind)) +
  geom_col() + theme_minimal() + theme(legend.position = "none") +
  labs(x = "", title = "Schedule B Indicator")

p8 <- df %>% 
  filter(!is.na(sched_b_ind)) %>% 
  count(sched_b_ind) %>% 
  ggplot(aes(sched_b_ind, n, fill = sched_b_ind)) +
  geom_col() + theme_minimal() + theme(legend.position = "none") +
  labs(x = "", title = "Schedule A Indicator")

(p1 + p2 + p3) / (p4 + p7 + p8) / (p6)

Schedule A’s (8872)

Schedule A form is where political organizations report contributions.

Query DB

df <- dbGetQuery(con, 'SELECT * FROM schedule_a_8872')
df <- df %>% select(-X18)

Null Values

df %>%
  sample_n(10000) %>%
  naniar::vis_miss()

Contributions

We see that the majority of contributions come from DC. You would expect that not to be the case if contributions are coming from the people, since DC doesn’t have near the population as other states. This indicates that the majority of contributions probably come from other organizations.

Perhaps when people donate, that money gets passed around so it’s hard to tell where your donation money truly went.

p1 <- df %>%
  select(contributor_employer,contribution_amount) %>%
  mutate(contributor_employer=tolower(contributor_employer)) %>%
  group_by(contributor_employer) %>%
  summarise(contribution_amount=sum(contribution_amount)) %>%
  arrange(desc(contribution_amount)) %>%
  head(5) %>%
  mutate(contributor_employer=factor(contributor_employer,levels=contributor_employer)) %>%
  ggplot(aes(contributor_employer, contribution_amount, fill = contributor_employer)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Contribution Amount", title = "Contributor Employer")

p2 <- df %>%
  select(contributor_name,contribution_amount) %>%
  mutate(contributor_name=tolower(contributor_name)) %>%
  group_by(contributor_name) %>%
  summarise(contribution_amount=sum(contribution_amount)) %>%
  arrange(desc(contribution_amount)) %>%
  head(10) %>%
  mutate(contributor_name=factor(contributor_name,levels=contributor_name)) %>%
  ggplot(aes(contributor_name, contribution_amount, fill = contributor_name)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Contribution Amount", title = "Contributor Name")

p3 <- df %>%
  select(org_name,contribution_amount) %>%
  mutate(org_name=tolower(org_name)) %>%
  group_by(org_name) %>%
  summarise(contribution_amount=sum(contribution_amount)) %>%
  arrange(desc(contribution_amount)) %>%
  head(10) %>%
  mutate(org_name=factor(org_name,levels=org_name)) %>%
  ggplot(aes(org_name, contribution_amount, fill = org_name)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Contribution Amount", title = "Org Name")

p4 <- df %>%
  select(contributor_address_state,contribution_amount) %>%
  mutate(contributor_address_state=tolower(contributor_address_state)) %>%
  group_by(contributor_address_state) %>%
  summarise(contribution_amount=sum(contribution_amount)) %>%
  arrange(desc(contribution_amount)) %>%
  head(5) %>%
  mutate(contributor_address_state=factor(contributor_address_state,levels=contributor_address_state)) %>%
  ggplot(aes(contributor_address_state, contribution_amount, fill = contributor_address_state)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Contribution Amount", title = "Contributor State")

p5 <- df %>%
  select(contributor_occupation,contribution_amount) %>%
  mutate(contributor_occupation=tolower(contributor_occupation)) %>%
  group_by(contributor_occupation) %>%
  summarise(contribution_amount=sum(contribution_amount)) %>%
  arrange(desc(contribution_amount)) %>%
  head(5) %>%
  mutate(contributor_occupation=factor(contributor_occupation,levels=contributor_occupation)) %>%
  ggplot(aes(contributor_occupation, contribution_amount, fill = contributor_occupation)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Contribution Amount", title = "Contributor Occupation")



(p1 + p4 + p5) / (p2 + p3)

Time

We see large spikes in contributions. I suspect these will correlate with election cycles but will be good to verify that. I would also like to see what else could drive contribution spikes, such as Supreme Court confirmation fights, or high impact bill votes.

df %>%
  select(contribution_date,contribution_amount) %>%
  filter(!is.na(contribution_date)) %>% 
  mutate(contribution_date=ymd(contribution_date)) %>%
  group_by(contribution_date) %>%
  summarise(contribution_amount=sum(contribution_amount)) %>%
  ggplot(aes(contribution_date, contribution_amount,color=contribution_amount)) +
  geom_point() + theme_minimal() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Contribution Amount", title = "Contribution by Date")

Schedule B’s (8872)

The Schedule B form is where political organizations report expenditures.

Query DB

df <- dbGetQuery(con, 'SELECT * FROM schedule_b_8872')
df <- df %>% select(-X18)

names(df)
##  [1] "record_type"                "form_id_number"            
##  [3] "sched_b_id"                 "org_name"                  
##  [5] "ein"                        "recipient_name"            
##  [7] "recipient_address_1"        "recipient_address_2"       
##  [9] "recipient_address_city"     "recipient_address_st"      
## [11] "recipient_address_zip_code" "recipient_address_zip_ext" 
## [13] "recipient_employer"         "expenditure_amount"        
## [15] "recipient_occupation"       "expenditure_date"          
## [17] "expenditure_purpose"

Null Values

The number of null values on recipient employer and recipient occupation is aggravating. Clearly this is easy information to get regarding the person you are giving money to, but it would be nice to have a cleaner money trail. Of course, that’s likely not what political actors want!

df %>%
  sample_n(10000) %>%
  naniar::vis_miss()

Expenditures

Here we see some evidence that supports our previous guess. The recipient of expenditures is also DC in addition to contributions. Money flows to DC into these political organizations and gets passed around, and the top purpose is a contribution - which far outweighs media buy, political contributions, payroll, CC processing fees, and other costs that would be associated with direct use of funds.

A lot of the recipients of the money are consultants, though the next 2 largest are “information requested” and “na”. You mean to tell me that they requested but were unable to get the job title of the person they were sending money to? That seems unlikely.

p1 <- df %>%
  select(recipient_name,expenditure_amount) %>%
  mutate(recipient_name=tolower(recipient_name)) %>%
  group_by(recipient_name) %>%
  summarise(expenditure_amount=sum(expenditure_amount)) %>%
  arrange(desc(expenditure_amount)) %>%
  head(15) %>%
  mutate(recipient_name=factor(recipient_name,levels=recipient_name)) %>%
  ggplot(aes(recipient_name, expenditure_amount, fill = recipient_name)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "recipient Name")

p2 <- df %>%
  select(expenditure_purpose,expenditure_amount) %>%
  mutate(expenditure_purpose=tolower(expenditure_purpose)) %>%
  group_by(expenditure_purpose) %>%
  summarise(expenditure_amount=sum(expenditure_amount)) %>%
  arrange(desc(expenditure_amount)) %>%
  head(15) %>%
  mutate(expenditure_purpose=factor(expenditure_purpose,levels=expenditure_purpose)) %>%
  ggplot(aes(expenditure_purpose, expenditure_amount, fill = expenditure_purpose)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "Expenditure Purpose")


p3 <- df %>%
  select(org_name,expenditure_amount) %>%
  mutate(org_name=tolower(org_name)) %>%
  group_by(org_name) %>%
  summarise(expenditure_amount=sum(expenditure_amount)) %>%
  arrange(desc(expenditure_amount)) %>%
  head(15) %>%
  mutate(org_name=factor(org_name,levels=org_name)) %>%
  ggplot(aes(org_name, expenditure_amount, fill = org_name)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "Org Name")


p4 <- df %>%
  select(recipient_address_st,expenditure_amount) %>%
  mutate(recipient_address_st=tolower(recipient_address_st)) %>%
  group_by(recipient_address_st) %>%
  summarise(expenditure_amount=sum(expenditure_amount)) %>%
  arrange(desc(expenditure_amount)) %>%
  head(15) %>%
  mutate(recipient_address_st=factor(recipient_address_st,levels=recipient_address_st)) %>%
  ggplot(aes(recipient_address_st, expenditure_amount, fill = recipient_address_st)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "recipient Address State")


tmp <- df %>%
  select(recipient_employer,expenditure_amount) %>%
  mutate(recipient_employer=tolower(recipient_employer)) %>%
  group_by(recipient_employer) %>%
  summarise(expenditure_amount=sum(expenditure_amount)) %>%
  arrange(desc(expenditure_amount)) %>%
  mutate(recipient_employer=factor(recipient_employer,levels=recipient_employer)) %>%
  head(20) 

p5a <- tmp %>%
  head(15) %>%
  ggplot(aes(recipient_employer, expenditure_amount, fill = recipient_employer)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "recipient Employer")

p5b <- tmp %>%
  filter(recipient_employer %notin% c('n/a','n a')) %>%
  head(15) %>%
  ggplot(aes(recipient_employer, expenditure_amount, fill = recipient_employer)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "recipient Employer (na excluded)")

tmp <- df %>%
  select(recipient_occupation,expenditure_amount) %>%
  mutate(recipient_occupation=tolower(recipient_occupation)) %>%
  group_by(recipient_occupation) %>%
  summarise(expenditure_amount=sum(expenditure_amount)) %>%
  arrange(desc(expenditure_amount)) %>%
  head(20) %>%
  mutate(recipient_occupation=factor(recipient_occupation,levels=recipient_occupation)) 

p6a <- tmp %>%
  head(15) %>%
  ggplot(aes(recipient_occupation, expenditure_amount, fill = recipient_occupation)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "recipient Occupation")

p6b <- tmp %>%
  filter(recipient_occupation %notin% c('n/a','n a')) %>%
  head(15) %>%
  ggplot(aes(recipient_occupation, expenditure_amount, fill = recipient_occupation)) +
  geom_col() + theme_minimal() + 
  theme(legend.position = "none",axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "recipient Occupation (na excluded)")


(p2 + p4) / (p5a + p5b) / (p6a +p6b) / (p1 + p3)

Time

When we look at expenditures over time we also see a pattern where there are spikes, seeing what they are correlated with will be interesting.

df %>%
  select(expenditure_date,expenditure_amount) %>%
  filter(!is.na(expenditure_date)) %>% 
  mutate(expenditure_date=ymd(expenditure_date)) %>%
  group_by(expenditure_date) %>%
  summarise(expenditure_amount=sum(expenditure_amount)) %>%
  ggplot(aes(expenditure_date, expenditure_amount,color=expenditure_amount)) +
  geom_point() + theme_minimal() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 1.1, hjust=1)) +
  labs(x = "",y = "Expenditure Amount", title = "Expenditure by Date")

Additional Analysis

This post covered parsing the data and a quick peak at each type of data.

Stay tuned for the next post where we will cover:

  1. Deep dive into amendments. What gets amended? How to handle amendments? What are amendments correlated with? etc.

  2. Analysis looking across files. In this we looked at each piece of data by itself, but we can do much more by looking at these different types of data together.

Thanks for reading. To see when the next post is published, follow me on twitter at https://twitter.com/isaac_flath