DATA 606: Final Project

Part 1 - Introduction

Research question

In this project we will examine campaign expenditures for the 2016 US presidential election. We will concentrate on the two major party nominees, and will look at expenditures by the month, starting in Feb 2016.

Because of the nuances of campaign finance law, while we will be able to see the intake for the campaigns, we will not be able to see fund intake for the independent PACs. As a result, we will look at expenditures.

The summary stats by month for each campaign are already available. We will tidy the independant PAC data to create monthy summary expenditures for groups that favor or oppose the two major party nominees.

Our goal is to compare monthly expenditures for the campaign and for PACs across the two candidates. We will further attempt to use the “sup_opp” variable in the PAC data to divide PAC expenditures across “negative” or “positive” efforts.

We will attempt to determine if there is a difference in average monthly expenditure across the 6 categories.

Environment Prep

if (!require('dplyr')) install.packages('dplyr')
if (!require('stringr')) install.packages('stringr')
if (!require('ggplot2')) install.packages('ggplot2')
if (!require('DT')) install.packages('DT')

Part 2 - Data

# load data
camp <- read.csv("FEC_campaign_summary.csv")
pac <- read.csv("FEC_independent-expenditure.csv")

Cases

In our raw data, there are 239 observations of the campaign data and 168703 of the PAC data.

To effectively conduct our analysis, we’ll only want the monthly observations for our two candidates so once the data is cleaned and merged into a long format dataset, I would expect the number of cases = n months * 6, where 6 represents ClintonCampaign, ProClintonPAC, AntiClintonPAC, TrumpCampaign, ProTrumpPAC, AntiTrumpPAC.

Data collection

The data is delivered to the FEC and redistributed to the public as part of disclosure law.

“The Federal Election Commission (FEC) is an independent regulatory agency established in 1975 to administer and enforce the Federal Election Campaign Act (FECA). That statute limits the sources and amounts of the contributions used to finance federal elections, requires public disclosure of campaign finance information and–in tandem with the Primary Matching Payment Act and the Presidential Election Campaign Fund Act–provides for the public funding of Presidential elections. For more information on the FEC’s role in regulating federal elections, see the brochure”The FEC and the Federal Campaign Finance Law." - From the FEC Site

Type of study

This is an observational data study.

Data Source

Data is source from is publicly available datasets from the Federal Election Commission. The data we will examine includes summary data on the 2016 US Presidential election from both the campaigns and independant groups or PACs.

Response

The response variable will be the expenditure by the month. This will be a numerical variable.

Explanatory

We have a few explanatory variables: candidate + campaign/PAC type (Categorical) and month (Categorical).

Part 3 - Exploratory data analysis

Campaign Data Prep

#subset to DT and HC
campsub <- filter(camp, CMTE_NM =="HILLARY FOR AMERICA" | CMTE_NM =="DONALD J. TRUMP FOR PRESIDENT, INC.") %>% 
    filter(str_detect(RPT_TP, "M")) %>% 
    arrange(CMTE_NM, RPT_TP) %>% 
    select(CMTE_NM, RPT_TP, TTL_DISB_SUM_PAGE_PER) %>% 
    droplevels()

#change vars
campsub$CMTE_NM <- as.character(campsub$CMTE_NM)
campsub$CMTE_NM[campsub$CMTE_NM == "HILLARY FOR AMERICA"] <- "HRCCamp"
campsub$CMTE_NM[campsub$CMTE_NM == "DONALD J. TRUMP FOR PRESIDENT, INC."] <- "DJTCamp"

#change to month
campsub$RPT_TP <- as.character(campsub$RPT_TP)
campsub$RPT_TP[campsub$RPT_TP == "M2"] <- "Feb"
campsub$RPT_TP[campsub$RPT_TP == "M3"] <- "Mar"
campsub$RPT_TP[campsub$RPT_TP == "M4"] <- "Apr"
campsub$RPT_TP[campsub$RPT_TP == "M5"] <- "May"
campsub$RPT_TP[campsub$RPT_TP == "M6"] <- "Jun"
campsub$RPT_TP[campsub$RPT_TP == "M7"] <- "Jul"
campsub$RPT_TP[campsub$RPT_TP == "M8"] <- "Aug"
campsub$RPT_TP[campsub$RPT_TP == "M9"] <- "Sep"
campsub$RPT_TP[campsub$RPT_TP == "M10"] <- "Oct"
campsub$RPT_TP <- factor(campsub$RPT_TP, level=
                             c("Feb", "Mar", "Apr", "May", "Jun",
                               "Jul", "Aug", "Sep", "Oct"))

campsub$direction[campsub$CMTE_NM =="DJTCamp"] <- "Trump"
campsub$direction[campsub$CMTE_NM =="HRCCamp"] <- "Clinton"

Exploring Campaign Data

ggplot(campsub, aes(x=RPT_TP, y=TTL_DISB_SUM_PAGE_PER,
                    group=CMTE_NM, colour=CMTE_NM)) + 
    geom_line() + ylab("Disbursements") + xlab("") +
    ggtitle("Total Disbursements by Campaigns")

PAC Data Prep

Initial clean, subset of PAC data

Below we will begin to clean, subset, and rename to find PACs associated with HRC and DJT, noting whether they support or oppose a candidate.

# ready numeric vector
pac$expenditure <- str_replace_all(as.character(pac$exp_amo), fixed("$"), "")
pac$expenditure <- as.numeric(str_replace_all(pac$expenditure, ',', ''))

# there are dupes if we leave in amn_ind (amended filing) vars, let's remove them.  while imprecise, just reviewing the original filing will give us what we need for this study.

pac <- pac[pac$amn_ind == "N",]

# droplevels
pac$can_nam <- as.character(pac$can_nam)
pac$can_nam <- tolower(pac$can_nam)
pac$spe_nam <- tolower(pac$spe_nam)

# subset only those that refer to HRC
pacsubHC <- pac[grepl("inton|hil", pac$can_nam), ]
pacsubHC <- pacsubHC[!grepl("kirk|james|george", pacsubHC$can_nam), ]
pacsubHC$org[pacsubHC$sup_opp == "Oppose"] <- "HRC_Oppose"
pacsubHC$org[pacsubHC$sup_opp == "Support"] <- "HRC_Support"

# what's missing?
rmhrc <- pacsubHC[!complete.cases(pacsubHC$org),]
pacsubHC <- pacsubHC[complete.cases(pacsubHC$org),]
rmhrc$org[rmhrc$can_par_aff == "Dem"] <- "HRC_Support"
rmhrc <- rmhrc[complete.cases(rmhrc$org),]
pacsubHC <- rbind(pacsubHC, rmhrc)

# subset only those that refer to DJT
pacsubDT <- pac[grepl("trump", pac$can_nam),]
pacsubDT$org[pacsubDT$sup_opp == "Oppose"] <- "DJT_Oppose"
pacsubDT$org[pacsubDT$sup_opp == "Support"] <- "DJT_Support"

# what's missing?
rmdjt <- pacsubDT[!complete.cases(pacsubDT$org),]
pacsubDT <- pacsubDT[complete.cases(pacsubDT$org),]
rmdjt$org[rmdjt$can_par_aff == "Rep"] <- "DJT_Support"
rmdjt <- rmdjt[complete.cases(rmdjt$org),]
pacsubDT <- rbind(pacsubDT, rmdjt)

# combine, subset
pacsub <- rbind(pacsubHC, pacsubDT)
pacsub <- select(pacsub, org, spe_nam, rec_dat, expenditure)

# clean type
pacsub$org <- as.factor(pacsub$org)
pacsub$spe_nam <- as.character(pacsub$spe_nam)
pacsub$rec_dat <- as.Date(pacsub$rec_dat, "%m/%d/%Y")

# remove obs < feb 2016 > nov (same as campaign data summary)
pacsub <- pacsub[pacsub$rec_dat >= "2016-02-01" & pacsub$rec_dat < "2016-11-01",]

#clean house 
rm(pacsubHC, pacsubDT, rmdjt, rmhrc)

Exploring PAC Data

length(unique(pacsub$spe_nam))

## [1] 247

While we see there about 247 unique Super PACs associated with HRC and DJT, apparently many of them use multiple submitting parties (who can’t spell) to send reports to the FEC. There are a number of duplicate records in here, but a quick scan suggests maybe 20-40 are dupes. Cleaning this field is beyond the scope of this project.

Prep PAC Month

pacsub$month[pacsub$rec_dat >= "2016/02/01" & pacsub$rec_dat < "2016/03/01"] <- "Feb"
pacsub$month[pacsub$rec_dat >= "2016/03/01" & pacsub$rec_dat < "2016/04/01"] <- "Mar"
pacsub$month[pacsub$rec_dat >= "2016/04/01" & pacsub$rec_dat < "2016/05/01"] <- "Apr"
pacsub$month[pacsub$rec_dat >= "2016/05/01" & pacsub$rec_dat < "2016/06/01"] <- "May"
pacsub$month[pacsub$rec_dat >= "2016/06/01" & pacsub$rec_dat < "2016/07/01"] <- "Jun"
pacsub$month[pacsub$rec_dat >= "2016/07/01" & pacsub$rec_dat < "2016/08/01"] <- "Jul"
pacsub$month[pacsub$rec_dat >= "2016/08/01" & pacsub$rec_dat < "2016/09/01"] <- "Aug"
pacsub$month[pacsub$rec_dat >= "2016/09/01" & pacsub$rec_dat < "2016/10/01"] <- "Sep"
pacsub$month[pacsub$rec_dat >= "2016/10/01" & pacsub$rec_dat < "2016/11/01"] <- "Oct"

Prep PAC Totals

HRCOp <- pacsub[pacsub$org=="HRC_Oppose", ]
DJTOp <- pacsub[pacsub$org=="DJT_Oppose", ]
HRCSu <- pacsub[pacsub$org=="HRC_Support", ]
DJTSu <- pacsub[pacsub$org=="DJT_Support", ]

monthsum <- function(df){
    df %>% 
        group_by(month) %>% 
        summarize(sum(expenditure))
}

HOp <- monthsum(HRCOp)
DOp <- monthsum(DJTOp)
HSu <- monthsum(HRCSu)
DSu <- monthsum(DJTSu)

HOp$org <- "HRC_Oppose"
DOp$org <- "DJT_Oppose"
HSu$org <- "HRC_Support"
DSu$org <- "DJT_Support"

pacsummary <- rbind(HOp, DOp, HSu, DSu)
pacsummary$month <- factor(pacsummary$month, level=
                             c("Feb", "Mar", "Apr", "May", "Jun",
                               "Jul", "Aug", "Sep", "Oct"))
names(pacsummary) <- c("RPT_TP", "TTL_DISB_SUM_PAGE_PER", "CMTE_NM")
pacsummary <- pacsummary[,c(3,1,2)]
# create summary var for direction of effort
pacsummary$direction[pacsummary$CMTE_NM =="DJT_Support" | pacsummary$CMTE_NM =="HRC_Oppose"] <- "Trump"
pacsummary$direction[pacsummary$CMTE_NM =="DJT_Oppose" | pacsummary$CMTE_NM =="HRC_Support"] <- "Clinton"

rm(HOp, DOp, HSu, DSu, HRCOp, DJTOp, HRCSu)

Exploring PAC Summary Data

ggplot(pacsummary, aes(x=RPT_TP, y=TTL_DISB_SUM_PAGE_PER,
                    group=CMTE_NM, colour=CMTE_NM)) + 
    geom_line() + ylab("Disbursements") + xlab("") +
    ggtitle("Total Disbursements by PACS")

# why are June expenditures so high for Trump supporting PACs?
review <- DJTSu[DJTSu$month=="Jun",]
review <- review[order(-review$expenditure),]
datatable(review, options=list(pageLength=5))

# what was this $50mil expenditure from Jun 14?
whatonearth <- pac[pac$expenditure==50000000,]
whatonearth <- select(whatonearth, spe_nam, pur, pay, exp_amo)
knitr::kable(whatonearth)

	spe_nam	pur	pay	exp_amo
15924	get our jobs back, inc	Digital Media Marketing, Revenue Sharing	Statware, Inc.	$50,000,000.00

rm(DJTSu, review, whatonearth)

Ok, this is quite something. This is a very very strange outlier. More information on this particular observation can be found here. We should leave this in, but note it is an outlier likely to skew statistical analysis.

Combine all

final <- rbind(pacsummary, campsub)
names(final) <- c("category", "month", "expenditure", "direction")
final$category <- as.factor(final$category)
final$direction <- as.factor(final$direction)

ggplot(final, aes(x=month, y=expenditure,
                    group=category, colour=category)) + 
    geom_line() + ylab("Disbursements") + xlab("") +
    ggtitle("Total Disbursements by Campaigns and PACs")

ggplot(final, aes(x=direction, y=expenditure, fill=category)) +
    geom_bar(position="dodge", stat="identity") + 
    ylab("Total Disbursements") + xlab("") + 
    ggtitle("Total Disbursements by Org Type")

Part 4 - Inference

Hypothesis

$H_0: \mu_{TCamp} = \mu_{HCamp} = \mu_{TPACOp} = \mu_{HPACOp} = \mu_{TPACSu} = \mu_{HPACSu}$. The avg monthly expenditures does not vary across all types.

$H_A:$ The avg monthly expenditures across some or all groups does vary.

Check Conditions

ggplot(final, aes(factor(category), expenditure)) + geom_boxplot(aes(fill=category)) + xlab("") +  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

finalsum <- group_by(final, category) 
finalsum2 <- summarise(finalsum, mean = round(mean(expenditure),0), stdev = round(sd(expenditure), 0))
knitr::kable(finalsum2)

category	mean	stdev
DJT_Oppose	24138816	23916610
DJT_Support	9915562	17023951
DJTCamp	19692503	20249352
HRC_Oppose	7111931	10329706
HRC_Support	5238978	6447687
HRCCamp	35879179	20373035

We believe the data from each category is independant (it is unlikely that a similar expenditure shows up in each category). We also estimate that while the variability and distributions are not ideal, they should also be adequate for conducting a simple ANOVA test.

ANOVA

anova <- aov(final$expenditure ~ final$category)
summary(anova)

##                Df    Sum Sq   Mean Sq F value  Pr(>F)   
## final$category  5 6.308e+15 1.262e+15   4.125 0.00339 **
## Residuals      48 1.468e+16 3.059e+14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tukey <- TukeyHSD(anova, ordered=TRUE)
tukey

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = final$expenditure ~ final$category)
## 
## $`final$category`
##                             diff       lwr      upr     p adj
## HRC_Oppose-HRC_Support   1872953 -22595545 26341451 0.9999114
## DJT_Support-HRC_Support  4676583 -19791914 29145081 0.9926928
## DJTCamp-HRC_Support     14453525 -10014973 38922023 0.5047741
## DJT_Oppose-HRC_Support  18899837  -5568661 43368335 0.2171664
## HRCCamp-HRC_Support     30640200   6171702 55108698 0.0066044
## DJT_Support-HRC_Oppose   2803630 -21664867 27272128 0.9993597
## DJTCamp-HRC_Oppose      12580572 -11887926 37049070 0.6496221
## DJT_Oppose-HRC_Oppose   17026884  -7441614 41495382 0.3224373
## HRCCamp-HRC_Oppose      28767247   4298749 53235745 0.0126490
## DJTCamp-DJT_Support      9776941 -14691556 34245439 0.8413586
## DJT_Oppose-DJT_Support  14223254 -10245244 38691751 0.5224657
## HRCCamp-DJT_Support     25963617   1495119 50432114 0.0316406
## DJT_Oppose-DJTCamp       4446312 -20022185 28914810 0.9942179
## HRCCamp-DJTCamp         16186675  -8281822 40655173 0.3779452
## HRCCamp-DJT_Oppose      11740363 -12728135 36208861 0.7124701

We can see from our test that the p value is extremely low = 0.0033878. This causes us to reject $H_0$ and conclude that there are some differences in monthly expenditures across categories.

To help determine where the biggest differences in mean might be occurring, we conducted a Tukey range test. The lowest ‘p adj’ values indicate the biggest difference in mean and often refer to the Clinton Campaign (HRCCamp) and any of the PAC groups.

Part 5 - Conclusion

Our study of the 2016 presidential election campaign revealed a bit about the nature of expenditures related to the two major party candidates. We looked at subgroups representing the campaigns and PACs that spent to either support or oppose a candidate.

The majority of our lessons learned about the nuances of the data came from our exploratory data preparation process. Finding outliers like the $50mil donation caused us to find additional duplicates (amended records), learn more about the many things that could be considered a campaign expenditure, and illuminate the inconsistent nature of the self-reported data.

We set out to show expenditures across PACs and campaigns and determine whether there are differences in the average monthly expenditures for each group type we reviewed. Using ANOVA and a Tukey test were able to determine which means were most alike and which were most different.

A few other approaches that could be used to analyze the same problem include:

Conducting a similar analysis after using a normalization technique to flatten the values so that we’re looking at a ratio each month instead of summary dollar value.
Pulling full campaign expenditure data instead of summary data, and conducting a similar ANOVA test to determine if the mean of expenditures overall were similar across group types.