1 Introduction

1.1 Context

Compared to the rest of many other democracies in the world, elections in the United States tend to last a very long time. For presidential elections, nominees typically announce their candidacy over a year (and sometimes multiple years) in advance, and US citizens are exposed to political advertising and showmanship on what seems like a nearly constant basis. As such, in order for candidates to be able to have a shot at competing in an election they need a lot of one thing: money. Candidates have a multitude of fundraising options when it comes to financing their campaigns including, but not limited to individual contributors, political action committees (PACs), and publicly funded state/federal grants. Candidates with more money seemingly have a huge advantage in that they have more to spend on ads, staff, events, etc. In 2018 a staggering $5.7 billion dollars was spent on campaigns for midterm elections, making it at the time the most expensive midterm election cycle ever. This kind of extraordinary spending on U.S. political elections makes citizens question: are our elections fair?

1.2 Purpose/Research Question

The purpose of this project is to determine the influence of campaign spending on the outcomes of the 2018 congressional midterms. To achieve this purpose, campaign finance data has been sourced from the publicly available Federal Election Commission (FEC) API, and election result data for both the U.S. Senate and U.S. House of Representatives has been retrieved from Kaggle via web scraping of The Cook Political Report.

2 Importing/Collecting Data

2.1 Campaign Finance Data

The FEC open data platform only allows a single API call to retrieve 100 results. As such, the code chunk below creates a function that will pull all results from a API call, given a valid API URL and election year.

#' Function: get_api_results
#' Returns all api results from FEC open data api and stores as a dataframe.
#' - api_url: should be one listed from https://api.open.fec.gov/developers/
#' - api_key: should be the user api_key for the FEC open data API. Users can 
#'   sign up for an api key at https://api.open.fec.gov/developers.
#' - year: the election year for which you'd like to retrieve data.

get_api_results <- function(api_url, api_key, year){

  print(paste("Pulling data from", api_url))
  api_call <- paste(api_url, '?api_key=', api_key, '&election_year=', year,
                    '&per_page=1&page=1', sep='')
  
  pre_result <- fromJSON(api_call, flatten = TRUE) %>% data.frame 
  num_cols <- ncol(pre_result)
  columns <- colnames(pre_result)
  num_results <- pre_result$pagination.count[[1]]
  
  i = 0
  result <- data.frame(matrix(ncol = num_cols, nrow = 0))
  colnames(result) <- columns

  while (i * 100 < num_results){
    api_call <- paste(api_url, '?api_key=', api_key, '&election_year=', year,
                      '&per_page=100&page=', i + 1, sep='')
    tmp <- fromJSON(api_call, flatten = TRUE) %>% data.frame
    result <- rbind(result, tmp)
    print(
      paste('On  result ', i*100, '-', (i+1)*100, ' out of ', num_results,
            sep = '')
    )
    i = i + 1
  }
  return(result)
}

In this case, we will use the API to pull data from two different sources, the first provides genral information about the candidates, while the second pulls each candidate’s campaign finanace data. The code chunk below utilizes the get_api_results function to pull this information for all of the 2018 congressional elections.

election_year <- 2018
api_key <- params$key
# Un-comment below line for manual entry.
#api_key <- 'dummy-api-key'

cand_info <- get_api_results(
  api_url = 'https://api.open.fec.gov/v1/candidates/',
  api_key = api_key,
  year = election_year
)

financial_info <- get_api_results(
  api_url = 'https://api.open.fec.gov/v1/candidates/totals/',
  api_key = api_key,
  year = election_year
)

The candidate and finance data is stored in the cand_info and financial_info dataframes, respectively.

2.2 Election Result Data

The election result data has been uploaded to Github as .csv files (Senate data, House data), which are imported in the code chunk below:

link <- getURL('https://raw.githubusercontent.com/williamzjasmine/CUNY_SPS_DS/master/DATA_607/Projects/Final_Project/data/1976-2018_senate_election_ratings_results.csv')
senate_results <- read_csv(link)
## Rows: 756 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): state, cook_rating, winner_name, winner_party
## dbl (2): year, winner_pct
## lgl (1): special
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
link <- getURL('https://raw.githubusercontent.com/williamzjasmine/CUNY_SPS_DS/master/DATA_607/Projects/Final_Project/data/2002-2018_house_election_ratings_results.csv')
house_results <- read_csv(link)
## Rows: 3915 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): district, winner_name, cook_rating, winner
## dbl (4): year, dem, repub, other
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The above chunk has created two new dataframes to hold all the election results: senate_results and house_results.

3 Cleaning Data

Now that all the data has been properly imported, the section below outlines all the data cleaning steps that were taken to manipulate it into a usable format for data analysis. The main goal is to unite all four of the dataframes that currently store the data (cand_info, financial_info, senate_results, house_results).

3.1 Election Result Data

This subsection is dedicated to cleaning the two dataframes containing the information about election results (senate_results, house_results) in an effort to combine them.

3.1.1 Clean Individual Dataframes

The .csv files containing the election result data actually included results for multiple election cycles. As such, the cell below filters the data to include only those elections that occurred in 2018:

senate_results <- senate_results %>% filter(year == '2018')
house_results <- house_results %>% filter(year == '2018')

Next, the code chunk below creates a separate state and district field using the district column of the house_results dataframe (currently the values in that column are of form STATE-DISTICT# i.e. AL-01):

house_results <- house_results %>%
  mutate(
    state = substr(district, 1, 2), 
    district = substr(district, 4, 5))

Because the senate_results dataframe doesn’t include district (senate elections take participation from the entire state), the code cell below adds a district placeholder column to senate_results that contains the value “00”. This matches the methodology that will later be seen in the API data. The code below also removes the special column, which is not included in the house_results dataframe.

senate_results <- senate_results %>%
  mutate(district = '00') %>%
    select(-special)

The senate_results dataframe has only has the voter percentages for the candidate who won, and as such, the cell below modifies the house_results dataframe in order to match this format:

# Only winning parties in house results are Republican and Democratic
# unique(house_results$winner)

house_results <- house_results %>%
  mutate(winner_party = ifelse(winner == 'r', 'Republican', 'Democrat'),
         winner_pct = ifelse(winner == 'r', repub, dem)) %>%
    select(-winner, -dem, -repub, -other)

Next, we can remove the cook_rating and year columns from each of the dataframes: the Cook’s rating is not pertinent to answering the project’s research question, and the year column provides no additional information as we know all election data is from 2018.

house_results <- select(house_results, -cook_rating, -year)
senate_results <- select(senate_results, -cook_rating, -year)

3.1.2 Join Together

Now that the senate_results and house_results dataframes have the same format, they can be appended together. This is performed in the code chunk below:

house_results <- house_results %>%
  select(winner_name, winner_party, winner_pct, state, district) %>%
    mutate(house = 'House')

senate_results <- senate_results %>%
  select(winner_name, winner_party, winner_pct, state, district) %>%
    mutate(house = 'Senate')

election_results <- rbind(house_results, senate_results)

Note that in the above code, a house column is added that contains one of either “Senate” or “House” in order to know the source of each observation.

This election_results dataframe represents the final, cleaned version of the data from the .csv files.

3.2 API Data

This subsection is dedicated to cleaning the two dataframes containing the information from the FEC (cand_info, financial_info) in an effort to combine them.

3.2.1 Clean Individual Dataframes

The first step in cleaning the cand_info and financial_info dataframes is to replace the nested field names that were a result of the api call. For example, the field containing the candidate’s name is results.candidate as opposed to just candidate. The code below fixes this:

colnames(cand_info) <- 
  str_replace_all(colnames(cand_info), 'results.', '')
colnames(financial_info) <- 
  str_replace_all(colnames(financial_info), 'results.', '')

Before any further cleaning takes place, the cell below checks to make sure that there exists only 1 row per candidate_id in the cand_info dataframe, ensuring that the API did not output any duplicate rows.

cand_info %>%
  group_by(candidate_id) %>%
    summarise(num_cand_rows = n()) %>%
      filter(num_cand_rows > 1)
## # A tibble: 0 × 2
## # … with 2 variables: candidate_id <chr>, num_cand_rows <int>

Unfortunately, the district field in the cand_info dataframe only provides the number of the district that the candidate most recently ran for office in. For example, Tom Alciere ran for New Hampshire’s 2nd district in 2018, but because he ran again in New Hampshire’s 1st district in 2022, the district column states:

cand_info %>%
  filter(name == "ALCIERE, TOM") %>% 
    select(district)
##   district
## 1       01

To get the actual district each candidate ran in, we have to parse the elections and districts columns, which provide lists of the the election years that the candidate ran and the districts that they ran in, respectively. This is done in the cell below to determine the district the candidate ran in in 2018:

i = 0 
districts <- c()

while(i < nrow(cand_info)) {
  tmp <- cand_info$election_years[[i+1]]
  tmp <- match(election_year, tmp)[[1]]
  district <- cand_info$election_districts[[i+1]][tmp]
  districts <- append(districts, district)
  i <- i + 1
}

cand_info$district <- districts

Next, the code chunk below selects out only the pertinent fields from the cand_info dataframe:

cand_info <- cand_info %>% 
  select(
    candidate_id, name, party_full, state, district, has_raised_funds,
    office_full, election_years, election_districts, incumbent_challenge_full
  )

The cell below creates three new fields from this reduced dataframe that might be interesting for use in future analysis:

  • num_runs: The number of times the candidate has run for office up to and including the 2018 election cycle.
  • houses_run: Can be either “house”, “senate”, or “both”, indicating what offices the candidate has run for up to and including the 2018 election cycle.
  • num_districts: The number of house districts the candidate has run in seats for the House of Representatives up to and including the 2018 election.

It also cleans the election_districts and election_years columns to only include those entries up until 2018.

i = 0
num_runs_vec <- c()
houses_run_vec <- c()
num_districts_vec <- c()

while(i < nrow(cand_info)) {
  tmp <- cand_info$election_years[[i+1]]
  tmp <- match(election_year, tmp)[[1]]
  num_runs_vec <- append(num_runs_vec, tmp)
  
  districts <- unique(cand_info$election_districts[[i+1]][1:tmp])
  ran_senate <- FALSE
  
  if ("00" %in% districts){
    ran_senate <- TRUE
  }
  districts <- districts[districts != "00"]
  num_districts <- length(districts)

  if (ran_senate == FALSE) {
    houses_run <- 'house'  
  }
  else {
    if (num_districts > 1) {
      houses_run <- 'both'
    }
    else {
      houses_run <- 'senate'
    }
  }
  
  houses_run_vec <- append(houses_run_vec, houses_run)
  num_districts_vec <- append(num_districts_vec, num_districts)
  cand_info$election_years[[i+1]] <- 
    cand_info$election_years[[i+1]][1:tmp]
  cand_info$election_districts[[i+1]] <-
    cand_info$election_districts[[i+1]][1:tmp]
  i = i+1
}

cand_info$num_runs <- num_runs_vec
cand_info$houses_run <- houses_run_vec 
cand_info$num_districts <- num_districts_vec

Unlike the cand_info dataframe, the cell below shows that in financial_info there are a number of candidates that appear more than once:

financial_info %>%
  group_by(candidate_id) %>%
    summarise(num_cand_rows = n()) %>%
      filter(num_cand_rows > 1) %>% nrow()
## [1] 754

After further inspection, it appears that this is due to the presence of duplicates in the dataframe. They are removed in the cell below:

financial_info <- financial_info %>%
  select(-contains("pagination")) %>%
    distinct()

3.2.2 Join Together

Now that the cand_info and financial_info dataframes have been cleaned, they can be joined together using the FEC’s candidate_id field. This is done in the cell below, which stores the combined data in the campaign_finance dataframe.

campaign_finance <- cand_info %>%
  left_join(financial_info, by = 'candidate_id', suffix = c("", ".y")) %>%
    select(-contains('.y')) 

Finally, we select only the pertinent columns from the campaign_finance dataframe.

campaign_finance <- campaign_finance %>%
  select(
    candidate_id,
    name,
    party_full,
    state,
    district,
    has_raised_funds,
    office_full,
    election_years,
    election_districts,
    incumbent_challenge_full,
    houses_run,
    num_runs,
    num_districts,
    cash_on_hand_end_period,
    federal_funds_flag,
    individual_itemized_contributions,
    other_political_committee_contributions,
    disbursements
  )

3.3 Bringing it All Together

The final data cleaning step needed before analysis can be performed is to combine the campaign_finance and election_results dataframes. The code chunk below attempts to do this via a left join on state and district:

df <- campaign_finance %>%
  left_join(
    election_results, by = c('state', 'district'), 
    suffix = c('', '.y')
  ) %>%
    select(-contains('.y'))

After the above join, all of the originally imported data now exists in a single dataframe df. However, looking at the output of this join reveals that it produced more rows than are in the campaign_finance dataframe. This means that our join key is not unique, and we are duplicating rows. The cell below shows the reason why:

election_results %>%
  group_by(state, district) %>%
    summarise(count = n()) %>%
      filter(count > 1)
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 3
## # Groups:   state [2]
##   state district count
##   <chr> <chr>    <int>
## 1 MN    00           2
## 2 MS    00           2

Because senate races don’t include districts, the FEC data uses “00” as a placeholder value in the district column if a candidate ran for senate. Based on this information and the output above, we can see that in 2018 both the state of Minnesota and Mississippi had two senate races thanks to special elections. As such, the state, district combination is not a unique key in these cases, and all candidates who ran in these races will be counted twice. Because our data has no way to determine which candidates ran against each winner, we remove them from the data in the cell below:

df <- df %>%
  filter(!(district == '00' & (state %in% c('MN', 'MS'))))

# check to make sure join key is now unique
nrow(df) == nrow(
  filter(campaign_finance, !(district == '00' & (state %in% c('MN', 'MS')))))
## [1] TRUE

The output above reveals that the combination of district, and state is now a unique key thanks to removing these duplicated values.

The cell below modifies the party fields from our two original dataframes to be the same format. Namely, the party and winner_party fields will only contain the values “DEMOCRATIC PARTY”, “REPUBLICAN PARTY”, or “OTHER”.

df <- df %>%
  mutate(
    winner_party = case_when(
      winner_party == 'Democrat' ~ "DEMOCRATIC PARTY",
      winner_party == 'Republican' ~ "REPUBLICAN PARTY",
      TRUE ~ "OTHER"),
    party_full = ifelse(party_full != "DEMOCRATIC PARTY" &
                         party_full != "REPUBLICAN PARTY",
                         "OTHER", party_full)
  )

Now that the data has been joined together, we can create combine the information in each. Because the candidate names have different formats in the election_results and campaign_finance dataframes, the cell below attempts to transform the winner_name field from election_results the FEC’s naming format used in campaign_finance’s name field (i.e. John Smith –> SMITH, JOHN):

mapped_names <- c()

for (i in 1:nrow(df)){
  name <- toupper(df$winner_name[[i]])
  split_name <- strsplit(name, split = " ", fixed = TRUE)[[1]]
  last_name <- tail(split_name, 1)
  first_name <- paste(split_name[1:length(split_name) - 1], collapse = ' ')
  mapped_name <- paste(last_name, ', ', first_name, sep = '')
  mapped_names <- append(mapped_names, mapped_name)
}

df$winner_name_clean <- mapped_names

We create a column, winner that compares the new winner_name_clean field to name. More specifically, the value of this field will be 1 if the names match, and 0 if not.

df <- df %>%
  mutate(winner = ifelse(winner_name_clean == name, 1, 0))

Because this name matching will not be perfect (i.e. it doesn’t take into account different spellings, prefixes, suffixes, titles, etc.) the cell below checks to see for which of the races we were actually able to determine who the winner was:

df %>% 
  group_by(winner) %>%
    summarise(count_classes = n())
## # A tibble: 2 × 2
##   winner count_classes
##    <dbl>         <int>
## 1      0          3552
## 2      1           185

Due to the difficult nature of matching names, it looks as though only 185 of our 470 elections results are able to be used. While this is a dramatic decrease, it still provides a robust sampling of the 2018 midterm results. The cell below removes rows from those elections in which a winner could not be determined:

filter_df <- df %>% 
  filter(winner == 1) %>%
    select(state, district) %>%
      group_by(state, district)
    
df <- df %>%
  inner_join(filter_df, by = c('state', 'district'))

Because not every candidate raised funds in the 2018 midterm elections, a number of the financial fields remain null after the join (i.e. there was information for a candidate in cand_info, but not in financial_info). The cell below transforms these values to 0:

df$cash_on_hand_end_period[
  is.na(df$cash_on_hand_end_period)] = 0

df$individual_itemized_contributions[
  is.na(df$individual_itemized_contributions)] = 0

df$other_political_committee_contributions[
  is.na(df$other_political_committee_contributions)] = 0

df$disbursements[is.na(df$disbursements)] = 0

Answering this project’s research question requires knowing the candidates who spent the most in their respective elections. Because the data is available, we can also determine who received the most itemized individual contributions and political committee contributions as well. Fields containing this information are added to df in the cell below:

max_spent_df <- df %>%
  group_by(state, district) %>%
    summarise(
      max_disbursements = max(disbursements),
      max_iic = max(individual_itemized_contributions),
      max_pac = max(other_political_committee_contributions),
      num_candidates = n()
    )
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
df <- df %>% 
  left_join(max_spent_df, by = c("state", "district")) %>%
    mutate(
      iic_most = ifelse(individual_itemized_contributions == max_iic, 1, 0),
      pac_most = 
        ifelse(other_political_committee_contributions == max_pac, 1, 0),
      spent_most = ifelse(disbursements == max_disbursements, 1, 0))

Lastly, the cell below creates an additional three binary columns to track if those candidates who spent the most, received the most individual itemized contributions, or received the most political committee contributions also won their elections.

df <- df %>%
  mutate(
    winner_iic_most = ifelse(iic_most & winner == 1, 1, 0),
    winner_pac_most = ifelse(pac_most & winner == 1, 1, 0),
    winner_spent_most = ifelse(spent_most == 1 & winner == 1, 1, 0))

For the concluding data cleaning step, the cell below picks out the final set of columns from df and names them appropriately:

df <- df %>%
  select(candidate_id, name, winner_name, winner_name_clean, party_full,
         winner_party, state, district, election_years, 
         election_districts, incumbent_challenge_full, houses_run, num_runs,
         num_districts, has_raised_funds, federal_funds_flag,
         cash_on_hand_end_period, individual_itemized_contributions,
         other_political_committee_contributions, disbursements,
         max_iic, max_pac, max_disbursements, num_candidates, winner,
         iic_most, pac_most, spent_most, winner_iic_most, winner_pac_most,
         winner_spent_most)

colnames(df) <- c(
  "candidate_id", "candidate_name", "winner_name", "winner_name_clean",
  "party", "winner_party", "state", "district", "election_years",
  "election_districts", "candidate_type", "houses_run", "num_runs", 
  "num_districts", "raised_funds", "raised_federal_funds", "coh_end_period",
  "iic", "pac", "disbursements", "max_iic", "max_pac", "max_disbursements",
  "num_candidates", "winner", "iic_most", "pac_most", "spent_most",
  "winner_iic_most", "winner_pac_most", "winner_spent_most")

The df dataframe represents the final, cleaned dataset that will be used in the following analysis. For reference, a list of its fields along with their descriptions is provided below:

Column Name Column Type Description
candidate_id id_field The id number for each candidate, provided by the FEC.
candidate_name text The name of the candidate as it appears in the FEC databases.
winner_name text The name of the candidate who won the 2018 midterm election that the candidate ran in. Name is as it appears in the Cook Political Report.
winner_name_clean text An attempt to match the format of the names in winner_name to those in candidate_name. (Matching Cook political report format to FEC format.)
party categorical The party affiliation of the candidate. Can be one of “DEMOCRATIC PARTY”, “REPUBLICAN PARTY”, or “OTHER”.
winner_party categorical The party affiliation of the candidate who won the 2018 midterm election that the candidate ran in. Can be one of “DEMOCRATIC PARTY”, “REPUBLICAN PARTY”, or “OTHER”.
state categorical The state of the 2018 midterm election that the candidate ran in. Uses U.S. postal codes.
district text The congressional district of the 2018 midterm election that the candidate ran in. If the candidate ran in a senate race, the value in this field is equal to “00”.
election_years vector A vector field containing all the election years that the candidate ran in, up to and including the 2018 election cycle. (i.e. c(“2014”, “2016”, “2018”).)
election_districts vector A vector field containing all the congressional districts that the candidate has run in, up to and including the 2018 election cycle. This field aligns with the election_years field in that the \(n\)th entry of election_districts is the district the candidate ran in during the year contained in the \(n\)th entry of election_years.
candidate_type categorical Describes the type of election the candidate participated in. Can be one of “Challenger”, “Open seat”, or “Incumbent”.
houses_run categorical Describes which houses of congress the candidate has ran in historically, up to and including the 2018 election cycle. Can be one of “House”, “Senate”, or “Both”.
num_runs numerical (discrete) The number of times the candidate has run for congressional political office, up to and including the 2018 election cycle.
num_districts numerical (discrete) The number of congressional districts the candidate has run in for congressional political office, up to and including the 2018 election cycle.
raised_funds binary TRUE if the candidate has raised any money for their political campaign, FALSE if not.
raised_federal_funds binary TRUE if the candidate has received any federal grants to fund their campaign, FALSE is not.
coh_end_period numerical (continuous) The amount of cash on hand in USD that the candidate had remaining on their books at the end of the pertinent accounting period.
iic numerical (continuous) The amount of money in USD that the candidate has raised from individual itemized contributions. These are contributions from individuals that exceed $200.
pac numerical (continuous) The amount of money in USD that the candidate has raised from political committee contributions, such as political action committees.
disbursements numerical (continuous) The amount of money in USD that the candidate has spent to cover the expenditures of their campaign. This field is the closest estimate of the amount of total money spent by the candidate.
max_iic numerical (continuous) The total from the candidate with the largest total individual itemized contributions in the 2018 midterm election race that the candidate participated in.
max_pac numerical (continuous) The total from the candidate with the largest total political committee contributions in the 2018 midterm election race that the candidate participated in.
max_disbursements numerical (continuous) The disbursements from the candidate who spent the most in the 2018 midterm election race that the candidate participated in.
num_candidates numerical (discrete) The number of candidates that competed in the 2018 midterm election the candidate participated in (inclusive of their own participation).
winner binary 1 if the candidate won the 2018 midterm election they participated in, 0 if not.
iic_most binary 1 if the candidate had the max_iic amount in the 2018 midterm election they participated in, 0 if not.
pac_most binary 1 if the candidate had the max_pac amount in the 2018 midterm election that they participated in, 0 if not.
spent_most binary 1 if the candidate had the max_disbursement amount in the midterm election they participated in, 0 if not.
winner_iic_most binary 1 if the candidate won and had the max_iic amount, 0 if not.
winner_pac_most binary 1 if the candidate won and had the max_pac amount, 0 if not.
winner_spent_most binary 1 if the candidate won and had the max_disbursement amount, 0 if not.

4 Data Analysis

First, we can take a look to see how much money was spent by either party in the races we have data for:

plt_data1 <- df %>%
  filter(!is.na(party)) %>%
    group_by(party) %>%
      summarise(party_disbursements = sum(disbursements)) %>%
        arrange(desc(party_disbursements))

ggplot(data <- plt_data1, 
       aes(x=reorder(party, party_disbursements),
           y=party_disbursements)
      ) +
  geom_bar(stat = 'identity') +
    labs(
      x = '',
      y = 'Money Spent on Campaigns'
    )
Fig 1: Total campaign spending by party. Democrats clearly spent more in the 2018 midterm elections.

Fig 1: Total campaign spending by party. Democrats clearly spent more in the 2018 midterm elections.

As is clear in the plot above, the Democratic Party as a whole put up a lot more money, which makes sense given that they were trying to get back control of a Republican controlled congress.

The code chunk below creates a similar histogram to the one above, but this time compares how many races were won by each party:

plt_data2 <- df %>%
  filter(winner == 1) %>% 
    group_by(winner_party) %>%
      summarise(party_election_wins = n())

ggplot(data <- plt_data2, 
       aes(x=reorder(winner_party, party_election_wins),
           y=party_election_wins)
      ) +
  geom_bar(stat = 'identity') +
    labs(
      x = '',
      y = 'Elections Won'
    )
Fig 2: Results by party for the 2018 midterm elections. It is clear Democrats outperformed the republicans.

Fig 2: Results by party for the 2018 midterm elections. It is clear Democrats outperformed the republicans.

The bar chart makes it clear that the democrats peroformed quite well in the elections contained by our dataset.

If we combine the information from the previous two histograms into one, we see that the proportion of races that the democrats won is similar to the proportion of total campaign funding they put up:

plt_data1 <- plt_data1 %>%
  mutate(pct = party_disbursements / sum(plt_data1$party_disbursements))
plt_data1$type <- 'Election Spending'

plt_data2 <- plt_data2 %>% 
  mutate(pct = party_election_wins / sum(plt_data2$party_election_wins))

plt_data2[nrow(plt_data2) + 1, ] <- list("OTHER", 0, 0)
plt_data2$type <- 'Elections Won'
colnames(plt_data2)[1] <- 'party'

plt_data <- rbind(
  select(plt_data1, party, pct, type),
  select(plt_data2, party, pct, type)
)

ggplot(data <- plt_data, aes(x=reorder(party, pct), y=pct, fill = type)) + 
  geom_bar(stat = "identity", position = position_dodge()) +
    labs(
      x = '',
      y = 'Fraction of Total'
    )
Fig 3: Compares information from the results shown in Fig 1. and Fig 2. The percetage of total money spent by each party is nearly equal to the percentage of elections that they won.

Fig 3: Compares information from the results shown in Fig 1. and Fig 2. The percetage of total money spent by each party is nearly equal to the percentage of elections that they won.

The output above gives the first hint that spending money on political campaigns can have a real impact.

Next, the code chunk below plots the distribution of dispursements amounts:

ggplot(data = df, aes(x = disbursements)) +
  geom_histogram() +
    labs(
      x = 'Amount of Money Spent by Candidate ($)',
      y = 'Count'
    ) + scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Fig 4: Distrubution of disbursement amounts for all candidates. x-axis is a log scale.

Fig 4: Distrubution of disbursement amounts for all candidates. x-axis is a log scale.

The plot shown above is shown on a log scale due to the fact that there is a large discrepancy between the median and maximum disbursement amounts. The plot shows that while there are a number of races that have campaign spending in the 10s of millions of dollars, we see that about half of them spend less than $100,000. If we look at the top spenders, we can see names from some pretty recognizable races, most of which were at the senate level:

df %>%
  arrange(desc(disbursements)) %>% 
    select(candidate_name, winner_name, state, district) %>%
      head(10)
##        candidate_name      winner_name state district
## 1       HUGIN, ROBERT  Robert Menendez    NJ       00
## 2      BALDWIN, TAMMY    Tammy Baldwin    WI       00
## 3      BROWN, SHERROD    Sherrod Brown    OH       00
## 4     SINEMA, KYRSTEN   Kyrsten Sinema    AZ       00
## 5   FEINSTEIN, DIANNE Dianne Feinstein    CA       00
## 6     MCSALLY, MARTHA   Kyrsten Sinema    AZ       00
## 7         BRAUN, MIKE       Mike Braun    IN       00
## 8        TRONE, DAVID     Jamie Raskin    MD       08
## 9    STABENOW, DEBBIE  Debbie Stabenow    MI       00
## 10 DONNELLY, JOSEPH S       Mike Braun    IN       00

To get a better idea of just how big an impact campaign spending has on election results, the cell below calculates and outputs a number of metrics:

avg_cands <- round(mean(max_spent_df$num_candidates), digits = 2)
avg_win_rate <- round(1 / avg_cands, digits = 2)
pct_winner_spent_most <- round(
  sum(df$winner_spent_most) / nrow(max_spent_df), digits = 2)

Average number of candidates in each race: 8.26
Average win probability for a candidate to win their election, if chosen at random: 0.12
Percentage of races in which the candidate who spent the most money also won the election: 0.68

The results are pretty staggering: close to 70% of those candidates who spent the most on their campaigns won their election. This number actually increases when looking at who received the most funding from political committees:

pct_winner_iic_most <- round(
  sum(df$winner_iic_most) / nrow(max_spent_df), digits = 2)
pct_winner_pac_most <- round(
  sum(df$winner_pac_most) / nrow(max_spent_df), digits = 2)

Percentage of races in which the candidate who received the most individual itemized contributions also won the election: 0.67`
Percentage of races in which the candidate who received the most money from political committees (i.e. PACs, super-PACs) also won the election: 0.7

While it’s probably not necessary to do so, we can perform a Chi-Squared test in order to statistically confirm the dependence winning an election (winner) on if you spend the most money during your campaign (spent_most). Given \(P_x\) = the probability a candidate wins their election, \(P_y\) = the probability a candidate spent the most money on their election, and \(P_{xy}\) = \(P(X \text{and} Y)\), the null (\(H_0\)) and alternative hypotheses (\(H_1\)) are:

\[ \begin{align} H_0: P_{xy} &= P_x \cdot P_y \\ H_1: P_{xy} &\neq P_x \cdot P_y \end{align} \] First, the cell below creates a table of the two categorical variables we are testing:

table(df$winner, df$spent_most)
##    
##        0    1
##   0 1256   88
##   1   59  126

And finally, the cell below performs the Chi-Squared test.

chisq.test(df$winner, df$spent_most)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$winner and df$spent_most
## X-squared = 506.87, df = 1, p-value < 2.2e-16

As is probably expected, the \(p\)-value is well below 0.05 meaning that we reject the null hypothesis and conclude that the two variables have a dependence upon one each other.

5 Conclusion

While it is not necessarily surprising to think that campaign finance plays a role in the results of U.S. political elections, the scale which it is seen here is quite staggering: the fact that ~70% of elections that were included in the data were won by the candidate who spent the most money seems quite high.

The work presented here also has a number of opportunities for potential improvements and possible extensions. The most glaring improvement that could be made is in the fuzzy matching of the candidate names, which as present only allowed us to analyze 185 of the 470 2018 congressional midterm elections. Having data from most or all of the elections as well as from additional election cycles would provide greater validity to the results presented. Additionally, there are a number of other analyses that completed using fields that have not yet been taken into account.