Data Preparation

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

Introduction

In this project we are going to use the Federal Election Commission data set to research the campaign contributions and expenditures for election 2016 vs 2012.

We are going to analyze the data of 4 major party candidates, 2 Democrats and 2 Republicans. Would like to see how the Democrats and Republicans have spent in both the election cycle.

Further would like to do some analysis on how the Political Action Committes have spend their amounts in 2012 and 2016. Would like to investigate how the PACs have sepnd for/against a particular candidates.

Would like to present a heat map with the statewise contributions or expenditure and represent in plot_ly or choropleth.

Explanatory Data Analysis.

Data Loading and Data Munging.

Reading the Data from FEC Website. We have downloaded the contribution files of Hillary Clinton and Donald J Trump. Owing to pretty huge data for 2012, we are skipping the data for 2012 and going to compare the contributions for 2016 between two major party candidates.

First we are loading the data HRC data contributions and we are cleaning the data by removing unneccessary data attributes.

# The HRC DataSet to data.frame

hrccontributions=fread('HRC_Cont.csv',stringsAsFactors = TRUE)
## Warning in fread("HRC_Cont.csv", stringsAsFactors = TRUE):
## Starting data input on line 2 and discarding line 1 because
## it has too few or too many items to be column names or data:
## cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
## 
Read 2.3% of 3506081 rows
Read 14.0% of 3506081 rows
Read 23.1% of 3506081 rows
Read 31.7% of 3506081 rows
Read 41.9% of 3506081 rows
Read 53.9% of 3506081 rows
Read 67.9% of 3506081 rows
Read 85.6% of 3506081 rows
Read 86.1% of 3506081 rows
Read 3506081 rows and 19 (of 19) columns from 0.621 GB file in 00:00:14
hrccontributions=data.frame(hrccontributions)


hrccontributions = subset(hrccontributions, hrccontributions$V10 > 0)
hrccontributions = subset(hrccontributions, hrccontributions$V4 != "HILLARY VICTORY FUND - UNITEMIZED")

hrccontributions=subset(hrccontributions,select = c(V3:V13))

Now having loaded the HRC data set the dimensions are 3471105, 11

Next we will be loading the data set for DJT and cleaning the data by removing unneccessary values.

djtcontributions=fread('DJT_Cont.csv',stringsAsFactors = TRUE )
## Warning in fread("DJT_Cont.csv", stringsAsFactors = TRUE):
## Starting data input on line 2 and discarding line 1 because
## it has too few or too many items to be column names or data:
## cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
djtcontributions=data.frame(djtcontributions)

djtcontributions = subset(djtcontributions, djtcontributions$V10 > 0)
djtcontributions=subset(djtcontributions,select = c(V3:V13))
hrccontributions$V11=str_replace_all(hrccontributions$V11, "[:digit:]","")
hrccontributions$V11=str_replace_all(hrccontributions$V11, "-","")

djtcontributions$V11=str_replace_all(djtcontributions$V11, "[:digit:]","")
djtcontributions$V11=str_replace_all(djtcontributions$V11, "-","")

Loaded the DJT data set the dimensions are 762871, 11

Now for the two data set we will do some summary statistics using summary. The only variable of interest in this data set is that of contributions.

# Will do a box plot to see how the contributions are. We could see a outlier in DJT contributions, which is way out of reach.

boxplot(hrccontributions$V10 , djtcontributions$V10)

We will see who are the top contributors for both the campaigns.

#HRC Campaign.
top_fifty = hrccontributions %>%
  filter(rank(dplyr::desc(hrccontributions$V10))<=100)


# Top Fifty Individual Contributor's to HRC Campaign.
knitr::kable(head(plyr::arrange(top_fifty,desc(top_fifty$V10)), n = 20))   
V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
Clinton, Hillary Rodham GOCKE, THOMAS JUPITER FL 334691584 SELF-EMPLOYED PHYSICIAN 20000 JUN
Clinton, Hillary Rodham YOUNG, SAMUEL J. TEHACHAPI CA 935618652 SELF-EMPLOYED REAL ESTATE INVESTOR 10000 JUN X
Clinton, Hillary Rodham YOUNG, SAMUEL J. TEHACHAPI CA 935618652 INFORMATION REQUESTED INFORMATION REQUESTED 7300 JUN
Clinton, Hillary Rodham HILLARY ACTION FUND - UNITEMIZED NEW YORK NY 101855256 5620 SEP X
Clinton, Hillary Rodham PROPPER, GREG LOS ANGELES CA 900691429 PROPPER DALEY PHILANTHROPIC CONSULTING 5400 JUN
Clinton, Hillary Rodham PICKER, MICHAEL SACRAMENTO CA 958413111 STATE OF CALIFORNIA COMMISSIONER 5400 NOV
Clinton, Hillary Rodham AUSTIN, ALAN ATHERTON CA 940275458 N/A RETIRED 5400 AUG
Clinton, Hillary Rodham FOX, ALAN STUDIO CITY CA 916042407 ACF PROPERTY MANAGEMENT PRESIDENT 5400 MAR
Clinton, Hillary Rodham BEAUBIEN, JAMES SANTA MONICA CA 904023024 LATHAM & WATKINS LLP ATTORNEY 5400 APR
Clinton, Hillary Rodham CARROLL, DANIEL ASHTON SAN FRANCISCO CA 941151125 TPG CAPITAL INVESTMENT MANAGER 5400 MAR
Clinton, Hillary Rodham HOLZMAN, WINNIE BURBANK CA 915054005 SELF-EMPLOYED WRITER 5400 JUN
Clinton, Hillary Rodham ITKIN, MARK LOS ANGELES CA 900691128 WILLIAM MORRIS ENDEAVOR TALENT AGENT 5400 APR
Clinton, Hillary Rodham FENG, JIONG CLAREMONT CA 917116500 W. CALIFORNIA ART ACADEMY TEACHER 5400 MAY
Clinton, Hillary Rodham WALLACE, RICHARD SHERMAN OAKS CA 914032915 WARNER BROS TV WARNER BROTHERS TV 5400 APR
Clinton, Hillary Rodham HANNA, MONA Z. VILLA PARK CA 928615322 MICHELMAN & ROBINSON, LLP ATTORNEY 5400 JUN
Clinton, Hillary Rodham BARSZCZ, MICHAEL WINTER PARK FL 327893347 SELF-EMPLOYED ATTORNEY 5400 MAY
Clinton, Hillary Rodham GHAZVINI, MEHRAN TALLAHASSEE FL 323081547 PREMIER HEALTH CLINIC & REHAB OF TALLA DOCTOR OF CHIRPORACTIC 5400 MAY
Clinton, Hillary Rodham FINK KOHL, BINA WESTON FL 333262726 SELF-EMPLOYED PUBLIC RELATIONS 5400 MAY
Clinton, Hillary Rodham SCHRAGIE, GOLDBLATT MIAMI BEACH FL 331403429 PROMED MANAGEMENT, INC. PRESIDENT 5400 APR X
Clinton, Hillary Rodham TOLL, ROBERT MIAMI BEACH FL 331404226 TOLL BROTHERS, INC EXECUTIVE CHAIRMAN 5400 JUN X
# DJT Campaign.
top_fifty_djt = djtcontributions %>%
         filter(rank(desc(djtcontributions$V10))<=100) 

    
# Top Fifty Individual Contributor's to DJT Campaign.
knitr::kable(head(plyr::arrange(top_fifty_djt,desc(top_fifty_djt$V10)), n = 20))   
V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
Trump, Donald J. BOCH, ERNIE NORWOOD MA 02062 BOCH AUTOMOTIVE GROUP EXECUTIVE 86936.80 AUG
Trump, Donald J. FERRERO, LOUIS P MR. CANTON GA 30115 INFORMATION REQUESTED INFORMATION REQUESTED 12500.00 JUL
Trump, Donald J. CONSERVATIVE ACTION FUND ALEXANDRIA VA 22314 10030.24 DEC X
Trump, Donald J. COBB, ROBERT BIRMINGHAM AL 35209 COBB THEATERS OWNER 10000.00 NOV
Trump, Donald J. DOBSKI, ROBERT BLOOMINGTON IL 61704 INFORMATION REQUESTED INFORMATION REQUESTED 10000.00 NOV
Trump, Donald J. GORMAN, L.D. MR. HAZARD KY 41702 INFORMATION REQUESTED INFORMATION REQUESTED 10000.00 JUL
Trump, Donald J. ROVT, ALEXANDER MR. BROOKLYN NY 11234 INFORMATION REQUESTED INFORMATION REQUESTED 10000.00 NOV
Trump, Donald J. GIGANTE, PETER BELLINGHAM WA 98225 SELF-EMPLOYED INTERNATIONAL TRADE 10000.00 AUG
Trump, Donald J. TANZER, LEONARD J MR. SCARSDALE NY 10583 PATIENT CARE ASSOCIATES PRESIDENT 7300.00 NOV
Trump, Donald J. NORTHCUTT, JOHN D MR. III FAIRHOPE AL 36532 INFORMATION REQUESTED INFORMATION REQUESTED 5400.00 JUL
Trump, Donald J. HARBERT, NORMAN C MR. SOTTSDALE AZ 85262 RETIRED RETIRED 5400.00 SEP
Trump, Donald J. DAY, TIMOTHY T MR. PHOENIX AZ 85018 INFORMATION REQUESTED INFORMATION REQUESTED 5400.00 JUL
Trump, Donald J. STERN, MARC MR. MALIBU CA 90265 THE TCW GROUP, INC CHAIRMAN 5400.00 JUL
Trump, Donald J. LITTLEFAIR, ANDREW NEW PORT BEACH CA 92663 CLEAN ENERGY FUELD PRESIDENT 5400.00 OCT
Trump, Donald J. SCHUMANN, ROBERT MANHATTAN BEACH CA 90266 REAL ESTATE WEST INC. BROKER 5400.00 NOV
Trump, Donald J. LISENKO, ROSE H MS. SAN DIEGO CA 92107 INFORMATION REQUESTED INFORMATION REQUESTED 5400.00 AUG
Trump, Donald J. NEE, ALICE MS. NEWPORT COAST CA 92657 INFORMATION REQUESTED INFORMATION REQUESTED 5400.00 AUG
Trump, Donald J. MIZEL, CAROL MS. DENVER CO 80237 HOMEMAKER HOMEMAKER 5400.00 SEP
Trump, Donald J. MIZEL, LARRY A MR. DENVER CO 80237 M.D.C HOLDINGS INVESTMENT 5400.00 SEP
Trump, Donald J. TRAVIS, CHARLOTTE MS. BRIGHTON CO 80601 INFORMATION REQUESTED INFORMATION REQUESTED 5400.00 AUG

Group By Profession.

We will look at who were the top contributor for both the campaigns based on the profession.

top_contributor_prof <- hrccontributions %>%  
          group_by(V9)  %>%
            summarise(V10=n()) 

top_contributor_djt <- djtcontributions %>%  
          group_by(V9)  %>%
            summarise(V10=n()) 

knitr::kable(head(plyr::arrange(top_contributor_prof,desc(top_contributor_prof$V10)), n = 10))  
V9 V10
RETIRED 862272
ATTORNEY 145237
INFORMATION REQUESTED 80238
TEACHER 79455
PROFESSOR 73324
PHYSICIAN 67938
HOMEMAKER 60355
NOT EMPLOYED 56072
CONSULTANT 54765
LAWYER 43813
knitr::kable(head(plyr::arrange(top_contributor_djt,desc(top_contributor_djt$V10)), n = 10))  
V9 V10
RETIRED 292655
INFORMATION REQUESTED 153128
SELF-EMPLOYED 11118
SALES 10047
PHYSICIAN 8837
HOMEMAKER 8188
BUSINESS OWNER 7438
ENGINEER 6915
ATTORNEY 6071
OWNER 5401

Group By State

We will look at who were the top contributor for both the campaigns based on the state were in they reside.

top_state_hrc <- hrccontributions %>%  
          group_by(V6)  %>%
            summarise(V10=n()) 

top_state_djt <- djtcontributions %>%  
          group_by(V6)  %>%
            summarise(V10=n()) 

knitr::kable(head(plyr::arrange(top_state_hrc,desc(top_state_hrc$V10)), n = 10))  
V6 V10
CA 681518
NY 394653
TX 201805
FL 182113
MA 152570
WA 125099
IL 121487
PA 118255
NJ 113906
VA 109987
knitr::kable(head(plyr::arrange(top_state_djt,desc(top_state_djt$V10)), n = 10))  
V6 V10
CA 83741
FL 76961
TX 76143
NY 35762
PA 30046
GA 29055
OH 26281
VA 23630
IL 22748
AZ 21429

We are going to plot the states where these top contributors are residing by using a Plotly Map.

top_state_hrc$hover = with(top_state_hrc, paste(top_state_hrc$V6, '<br>', "Total contributions", top_state_hrc$V10))

l=list(color = toRGB("white"), width = 2)
g=list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

p=plot_geo(top_state_hrc, locationmode = 'USA-states') %>%
  add_trace(
    z = ~V10, text = ~hover, locations = ~V6,
    color = ~V10, colors = 'Blues'
  ) %>%
  colorbar(title = "Contributions") %>%
  layout(
    title = 'Total HRC Contributions by State<br>(Hover for breakdown)',
    geo = g
  )

top_state_djt$hover=with(top_state_djt, paste(top_state_djt$V6, '<br>', "Total contributions", top_state_djt$V10))

p_dgt=plot_geo(top_state_djt, locationmode = 'USA-states') %>%
  add_trace(
    z = ~V10, text = ~hover, locations = ~V6,
    color = ~V10, colors = 'Reds'
  ) %>%
  colorbar(title = "Contributions") %>%
  layout(
    title = 'Total DJT Contributions by State<br>(Hover for breakdown)',
    geo = g
  )

#Sys.setenv("plotly_username"="data607g3")
#Sys.setenv("plotly_api_key"="dDZhymyplVxhzUuXv2MX")
#chart_link = api_create(p, filename="HRCStateOne",id="HRC")
#chart_link_djt = api_create(p_dgt,filename = "DJTStateOne")
#chart_link_djt
p
p_dgt

Inference for Contributions.

\(H_0\) Null Hypothesis : The average Contributions during last four months does not vary across the campaigns.

\(H_A\) Alternative Hypothesis : The average Contributions during last four months varies across the campaigns.

For this first we are cleaning the data and picking data only those we are interested in.

So we are picking contributions only for the month of JUN, JUL, AUG, SET and OCT for both the campaigns.

hrcfinal = data.frame(hrccontributions)
hrcfinal=subset(hrcfinal,select = c(V10,V11))
hrclast=subset(hrcfinal, hrcfinal$V11 == "JUN" | hrcfinal$V11 == "JUL" |  hrcfinal$V11 == "AUG" |hrcfinal$V11 == "SEP" |hrcfinal$V11 == "OCT" )

djtfinal = data.frame(djtcontributions)
djtfinal=subset(djtfinal,select = c(V10,V11))
djtlast=subset(djtfinal, djtfinal$V11 == "JUN" | djtfinal$V11 == "JUL" |  djtfinal$V11 == "AUG" |djtfinal$V11 == "SEP" |djtfinal$V11 == "OCT" )

Box Plot of last four months contributions.

bwplot(hrclast$V10 ~ hrclast$V11 ,main="Contributions across final four Months",xlab="Month",ylab="Contributions")

bwplot(djtlast$V10 ~ djtlast$V11 ,main="Contributions across final four Months",xlab="Month",ylab="Contributions")

Since the box plot is not able to give a clear picture, we are going to compare the means across four months.

hrc_summary_four = aggregate(hrclast$V10 ~ hrclast$V11, data = hrclast, function(x) c(Total = sum(x), Mean = mean(x), SD = sd(x) )) 
hrc_summary_four=cbind(hrc_summary_four[-ncol(hrc_summary_four)], hrc_summary_four[[ncol(hrc_summary_four)]])
knitr::kable(hrc_summary_four)
hrclast$V11 Total Mean SD
AUG 52298582 134.17462 408.1672
JUL 45677073 119.42655 366.8881
JUN 41466446 170.44040 509.5670
OCT 72668987 91.65413 269.8099
SEP 71749414 125.26522 364.4244
djt_summary_four <- aggregate(djtlast$V10 ~ djtlast$V11, data = djtlast, function(x) c(Total = sum(x), Mean = mean(x), SD = sd(x) )) 
djt_summary_four<- cbind(djt_summary_four[-ncol(djt_summary_four)], djt_summary_four[[ncol(djt_summary_four)]])
knitr::kable(djt_summary_four)
djtlast$V11 Total Mean SD
AUG 24049608 128.28373 372.4752
JUL 25068072 95.85821 248.8110
JUN 16474648 145.63096 367.7656
OCT 25179304 340.62451 399.2310
SEP 25577068 377.37647 517.3588

From the above analysis it is pretty clear that average monthly contributions across the last four between two candidates are not same. As a result we are going to reject the Null Hypothesis and we are going to accept the Alternative Hypothesis.

Plots:

These are plots of Contributions across all the months.

ggplot(hrcfinal, aes(x=hrcfinal$V11, y=hrcfinal$V10 ,group=hrcfinal$V11, colour=hrcfinal$V10)) + 
    geom_point() + ylab("Total Contributions") + 
    xlab("Month")

ggplot(djtfinal, aes(x=djtfinal$V11, y=djtfinal$V10 ,group=djtfinal$V11, colour=djtfinal$V10)) + 
    geom_point() + ylab("Total Contributions") + 
    xlab("Month")

Super PACs Expenditure:

Having worked on the campaign contributions from individuals of both the campaigns, now we are going to work on expenditures associated with PAC’s PACs spend a huge amount of money for and against the candidates. In this experiment, we are going to see how PACs have done their monthly expenditure heading up to the election.

pacexpenditure=fread('FEC_independent-expenditure.csv')
## Warning in fread("FEC_independent-expenditure.csv"): Bumped column 7 to
## type character on data row 12046, field contains 'AG'. Coercing previously
## read values in this column from logical, integer or numeric back to
## character which may not be lossless; e.g., if '00' and '000' occurred
## before they will now be just '0', and there may be inconsistencies with
## treatment of ',,' and ',NA,' too (if they occurred in this column before
## the bump). If this matters please rerun and set 'colClasses' to 'character'
## for this column. Please note that column type detection uses a sample of
## 1,000 rows (100 rows at 10 points) so hopefully this message should be very
## rare. If reporting to datatable-help, please rerun and include the output
## from verbose=TRUE.
pacexpenditure=data.frame(pacexpenditure)


# ready numeric vector
pacexpenditure$expenditure=str_replace_all(as.character(pacexpenditure$exp_amo), fixed("$"), "")
pacexpenditure$expenditure = as.numeric(str_replace_all(pacexpenditure$expenditure, ',', ''))
pacexpenditure=pacexpenditure[pacexpenditure$amn_ind == "N",]
pacexpenditure$can_nam <- as.character(pacexpenditure$can_nam)
pacexpenditure$can_nam <- tolower(pacexpenditure$can_nam)
pacexpenditure$spe_nam <- tolower(pacexpenditure$spe_nam)

# HRC PAC
hrcpac=pacexpenditure[grepl("inton|hil", pacexpenditure$can_nam), ]
hrcpac=hrcpac[!grepl("kirk|james|george", hrcpac$can_nam), ]
hrcpac$org = "HRC"

# DJT PAC
djtpac=pacexpenditure[grepl("trump", pacexpenditure$can_nam),]
djtpac$org = "DJT"

# combine, subset
pactot=rbind(hrcpac, djtpac)
pactot=select(pactot,org, spe_nam, rec_dat, expenditure)

# clean type
pactot$org <- as.factor(pactot$org)
pactot$spe_nam <- as.character(pactot$spe_nam)
pactot$rec_dat <- as.Date(pactot$rec_dat, "%m/%d/%Y")
dateval=as.month(pactot$rec_dat)
pactot$rec_dat=dateval$month

Now having the two PAC Contributions sorted out. Next we will try to Summarize the monthly expenditures across both the PAC and then we can try to compare the average monthly expenditure across both the PAC’s

hrcmonthlyexp = pactot %>%
                filter(pactot$org == "HRC") %>%
                group_by(rec_dat) %>% 
                summarize(sum(expenditure))
hrcmonthlyexp$org="HRC"

djtpacmonthlyepx = pactot %>%
                filter(pactot$org == "DJT") %>%
                group_by(rec_dat) %>% 
                summarize(sum(expenditure))
djtpacmonthlyepx$org = "DJT"

totpacmnthexp = rbind(hrcmonthlyexp, djtpacmonthlyepx)

Plot of monthly expenditure.

ggplot(totpacmnthexp, aes(x=rec_dat, y=`sum(expenditure)`,
                    group=org, colour=org)) + 
    geom_line() + ylab("Expenditure") + xlab("Month") +
    ggtitle("PAC Expenditure across both the campaigns") 

Inference

\(H_O\): Null Hypothesis : Average monthly expenditure across both the PAC’s does not vary

\(H_A\): Alternative Hypothesis : Average monthly expenditure across both the PAC’s varies

ggplot(totpacmnthexp, aes(factor(org), `sum(expenditure)`)) + geom_boxplot(aes(fill=org)) + xlab("") +  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

pacmonexpfinal = aggregate(totpacmnthexp$`sum(expenditure)` ~ totpacmnthexp$org, data = totpacmnthexp, function(x) c(Total = sum(x), Mean = mean(x), SD = sd(x) )) 
pacmonexpfinal=cbind(pacmonexpfinal[-ncol(pacmonexpfinal)], pacmonexpfinal[[ncol(pacmonexpfinal)]])
knitr::kable(pacmonexpfinal)
totpacmnthexp$org Total Mean SD
DJT 349184499 29098708 30902356
HRC 160661431 13388453 17769403

From the box plots and our calculations above it is pretty clear that both the averages across both the PACs are not same, so we can reject Null Hypothesis and Accept the Alternative Hypothesis.

Conclusion.

For this project as per my initial proposal we wanted to compare the contributions and expenditure of presidential elections between 2012 and 2016. Since the volume of data involved in both the campaigns was so huge to even load in for our project, so we decided to go ahead with only the analysis of 2016.

As a first experiment, we read the individual contributions across both the major party candidates in the presidential election. We did some initial analysis of data like top contributors, top contributors across the states and we even plot the campaign contributors in the map of USA to see where the maximum contributions are from.

We also did a hypothesis testing by comparing the average contributions of the last five months heading up to election, to see whether the average contributions across both the candidates are same. From our testing we rejected out Null Hypothesis and determined that average contributions between two candidates in the last four months were different.

Once having finished working on contributions, then we loaded the expenditure of PACs. We wanted to see how the PACs associated to HRC and DJT have spent the money during this election. We did some munging of data for both the candidates associated PAC’s.

For PAC expenditure we did a hypothesis testing to see whether the monthly expenditures are same across both the candidates. Finaly the means were way off limit so we concluded by rejecting null hypothesis and accepting the alternative hypothesis.

Overall this was a great experience in working on this project and DATA606 class as a whole. For a new comer as me in Statistics was a great learning experience.

Reference : Research analysis inspired from Kaggle 2016 Election Analysis