DATA 606: Final Project

Data Preparation

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

Introduction

In this project we are going to use the Federal Election Commission data set to research the campaign contributions and expenditures for election 2016 vs 2012.

We are going to analyze the data of 4 major party candidates, 2 Democrats and 2 Republicans. Would like to see how the Democrats and Republicans have spent in both the election cycle.

Further would like to do some analysis on how the Political Action Committes have spend their amounts in 2012 and 2016. Would like to investigate how the PACs have sepnd for/against a particular candidates.

Would like to present a heat map with the statewise contributions or expenditure and represent in plot_ly or choropleth.

Explanatory Data Analysis.

Data Loading and Data Munging.

Reading the Data from FEC Website. We have downloaded the contribution files of Hillary Clinton and Donald J Trump. Owing to pretty huge data for 2012, we are skipping the data for 2012 and going to compare the contributions for 2016 between two major party candidates.

First we are loading the data HRC data contributions and we are cleaning the data by removing unneccessary data attributes.

# The HRC DataSet to data.frame

hrccontributions=fread('HRC_Cont.csv',stringsAsFactors = TRUE)

## Warning in fread("HRC_Cont.csv", stringsAsFactors = TRUE):
## Starting data input on line 2 and discarding line 1 because
## it has too few or too many items to be column names or data:
## cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp

## 
Read 2.3% of 3506081 rows
Read 14.0% of 3506081 rows
Read 23.1% of 3506081 rows
Read 31.7% of 3506081 rows
Read 41.9% of 3506081 rows
Read 53.9% of 3506081 rows
Read 67.9% of 3506081 rows
Read 85.6% of 3506081 rows
Read 86.1% of 3506081 rows
Read 3506081 rows and 19 (of 19) columns from 0.621 GB file in 00:00:14

hrccontributions=data.frame(hrccontributions)


hrccontributions = subset(hrccontributions, hrccontributions$V10 > 0)
hrccontributions = subset(hrccontributions, hrccontributions$V4 != "HILLARY VICTORY FUND - UNITEMIZED")

hrccontributions=subset(hrccontributions,select = c(V3:V13))

Now having loaded the HRC data set the dimensions are 3471105, 11

Next we will be loading the data set for DJT and cleaning the data by removing unneccessary values.

djtcontributions=fread('DJT_Cont.csv',stringsAsFactors = TRUE )

## Warning in fread("DJT_Cont.csv", stringsAsFactors = TRUE):
## Starting data input on line 2 and discarding line 1 because
## it has too few or too many items to be column names or data:
## cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp

djtcontributions=data.frame(djtcontributions)

djtcontributions = subset(djtcontributions, djtcontributions$V10 > 0)
djtcontributions=subset(djtcontributions,select = c(V3:V13))

hrccontributions$V11=str_replace_all(hrccontributions$V11, "[:digit:]","")
hrccontributions$V11=str_replace_all(hrccontributions$V11, "-","")

djtcontributions$V11=str_replace_all(djtcontributions$V11, "[:digit:]","")
djtcontributions$V11=str_replace_all(djtcontributions$V11, "-","")

Loaded the DJT data set the dimensions are 762871, 11

Now for the two data set we will do some summary statistics using summary. The only variable of interest in this data set is that of contributions.

# Will do a box plot to see how the contributions are. We could see a outlier in DJT contributions, which is way out of reach.

boxplot(hrccontributions$V10 , djtcontributions$V10)

We will see who are the top contributors for both the campaigns.

#HRC Campaign.
top_fifty = hrccontributions %>%
  filter(rank(dplyr::desc(hrccontributions$V10))<=100)


# Top Fifty Individual Contributor's to HRC Campaign.
knitr::kable(head(plyr::arrange(top_fifty,desc(top_fifty$V10)), n = 20))

V3	V4	V5	V6	V7	V8	V9	V10	V11	V13
Clinton, Hillary Rodham	GOCKE, THOMAS	JUPITER	FL	334691584	SELF-EMPLOYED	PHYSICIAN	20000	JUN
Clinton, Hillary Rodham	YOUNG, SAMUEL J.	TEHACHAPI	CA	935618652	SELF-EMPLOYED	REAL ESTATE INVESTOR	10000	JUN	X
Clinton, Hillary Rodham	YOUNG, SAMUEL J.	TEHACHAPI	CA	935618652	INFORMATION REQUESTED	INFORMATION REQUESTED	7300	JUN
Clinton, Hillary Rodham	HILLARY ACTION FUND - UNITEMIZED	NEW YORK	NY	101855256			5620	SEP	X
Clinton, Hillary Rodham	PROPPER, GREG	LOS ANGELES	CA	900691429	PROPPER DALEY	PHILANTHROPIC CONSULTING	5400	JUN
Clinton, Hillary Rodham	PICKER, MICHAEL	SACRAMENTO	CA	958413111	STATE OF CALIFORNIA	COMMISSIONER	5400	NOV
Clinton, Hillary Rodham	AUSTIN, ALAN	ATHERTON	CA	940275458	N/A	RETIRED	5400	AUG
Clinton, Hillary Rodham	FOX, ALAN	STUDIO CITY	CA	916042407	ACF PROPERTY MANAGEMENT	PRESIDENT	5400	MAR
Clinton, Hillary Rodham	BEAUBIEN, JAMES	SANTA MONICA	CA	904023024	LATHAM & WATKINS LLP	ATTORNEY	5400	APR
Clinton, Hillary Rodham	CARROLL, DANIEL ASHTON	SAN FRANCISCO	CA	941151125	TPG CAPITAL	INVESTMENT MANAGER	5400	MAR
Clinton, Hillary Rodham	HOLZMAN, WINNIE	BURBANK	CA	915054005	SELF-EMPLOYED	WRITER	5400	JUN
Clinton, Hillary Rodham	ITKIN, MARK	LOS ANGELES	CA	900691128	WILLIAM MORRIS ENDEAVOR	TALENT AGENT	5400	APR
Clinton, Hillary Rodham	FENG, JIONG	CLAREMONT	CA	917116500	W. CALIFORNIA ART ACADEMY	TEACHER	5400	MAY
Clinton, Hillary Rodham	WALLACE, RICHARD	SHERMAN OAKS	CA	914032915	WARNER BROS TV	WARNER BROTHERS TV	5400	APR
Clinton, Hillary Rodham	HANNA, MONA Z.	VILLA PARK	CA	928615322	MICHELMAN & ROBINSON, LLP	ATTORNEY	5400	JUN
Clinton, Hillary Rodham	BARSZCZ, MICHAEL	WINTER PARK	FL	327893347	SELF-EMPLOYED	ATTORNEY	5400	MAY
Clinton, Hillary Rodham	GHAZVINI, MEHRAN	TALLAHASSEE	FL	323081547	PREMIER HEALTH CLINIC & REHAB OF TALLA	DOCTOR OF CHIRPORACTIC	5400	MAY
Clinton, Hillary Rodham	FINK KOHL, BINA	WESTON	FL	333262726	SELF-EMPLOYED	PUBLIC RELATIONS	5400	MAY
Clinton, Hillary Rodham	SCHRAGIE, GOLDBLATT	MIAMI BEACH	FL	331403429	PROMED MANAGEMENT, INC.	PRESIDENT	5400	APR	X
Clinton, Hillary Rodham	TOLL, ROBERT	MIAMI BEACH	FL	331404226	TOLL BROTHERS, INC	EXECUTIVE CHAIRMAN	5400	JUN	X

# DJT Campaign.
top_fifty_djt = djtcontributions %>%
         filter(rank(desc(djtcontributions$V10))<=100) 

    
# Top Fifty Individual Contributor's to DJT Campaign.
knitr::kable(head(plyr::arrange(top_fifty_djt,desc(top_fifty_djt$V10)), n = 20))

V3	V4	V5	V6	V7	V8	V9	V10	V11	V13
Trump, Donald J.	BOCH, ERNIE	NORWOOD	MA	02062	BOCH AUTOMOTIVE GROUP	EXECUTIVE	86936.80	AUG
Trump, Donald J.	FERRERO, LOUIS P MR.	CANTON	GA	30115	INFORMATION REQUESTED	INFORMATION REQUESTED	12500.00	JUL
Trump, Donald J.	CONSERVATIVE ACTION FUND	ALEXANDRIA	VA	22314			10030.24	DEC	X
Trump, Donald J.	COBB, ROBERT	BIRMINGHAM	AL	35209	COBB THEATERS	OWNER	10000.00	NOV
Trump, Donald J.	DOBSKI, ROBERT	BLOOMINGTON	IL	61704	INFORMATION REQUESTED	INFORMATION REQUESTED	10000.00	NOV
Trump, Donald J.	GORMAN, L.D. MR.	HAZARD	KY	41702	INFORMATION REQUESTED	INFORMATION REQUESTED	10000.00	JUL
Trump, Donald J.	ROVT, ALEXANDER MR.	BROOKLYN	NY	11234	INFORMATION REQUESTED	INFORMATION REQUESTED	10000.00	NOV
Trump, Donald J.	GIGANTE, PETER	BELLINGHAM	WA	98225	SELF-EMPLOYED	INTERNATIONAL TRADE	10000.00	AUG
Trump, Donald J.	TANZER, LEONARD J MR.	SCARSDALE	NY	10583	PATIENT CARE ASSOCIATES	PRESIDENT	7300.00	NOV
Trump, Donald J.	NORTHCUTT, JOHN D MR. III	FAIRHOPE	AL	36532	INFORMATION REQUESTED	INFORMATION REQUESTED	5400.00	JUL
Trump, Donald J.	HARBERT, NORMAN C MR.	SOTTSDALE	AZ	85262	RETIRED	RETIRED	5400.00	SEP
Trump, Donald J.	DAY, TIMOTHY T MR.	PHOENIX	AZ	85018	INFORMATION REQUESTED	INFORMATION REQUESTED	5400.00	JUL
Trump, Donald J.	STERN, MARC MR.	MALIBU	CA	90265	THE TCW GROUP, INC	CHAIRMAN	5400.00	JUL
Trump, Donald J.	LITTLEFAIR, ANDREW	NEW PORT BEACH	CA	92663	CLEAN ENERGY FUELD	PRESIDENT	5400.00	OCT
Trump, Donald J.	SCHUMANN, ROBERT	MANHATTAN BEACH	CA	90266	REAL ESTATE WEST INC.	BROKER	5400.00	NOV
Trump, Donald J.	LISENKO, ROSE H MS.	SAN DIEGO	CA	92107	INFORMATION REQUESTED	INFORMATION REQUESTED	5400.00	AUG
Trump, Donald J.	NEE, ALICE MS.	NEWPORT COAST	CA	92657	INFORMATION REQUESTED	INFORMATION REQUESTED	5400.00	AUG
Trump, Donald J.	MIZEL, CAROL MS.	DENVER	CO	80237	HOMEMAKER	HOMEMAKER	5400.00	SEP
Trump, Donald J.	MIZEL, LARRY A MR.	DENVER	CO	80237	M.D.C HOLDINGS	INVESTMENT	5400.00	SEP
Trump, Donald J.	TRAVIS, CHARLOTTE MS.	BRIGHTON	CO	80601	INFORMATION REQUESTED	INFORMATION REQUESTED	5400.00	AUG

Group By Profession.

We will look at who were the top contributor for both the campaigns based on the profession.

top_contributor_prof <- hrccontributions %>%  
          group_by(V9)  %>%
            summarise(V10=n()) 

top_contributor_djt <- djtcontributions %>%  
          group_by(V9)  %>%
            summarise(V10=n()) 

knitr::kable(head(plyr::arrange(top_contributor_prof,desc(top_contributor_prof$V10)), n = 10))

V9	V10
RETIRED	862272
ATTORNEY	145237
INFORMATION REQUESTED	80238
TEACHER	79455
PROFESSOR	73324
PHYSICIAN	67938
HOMEMAKER	60355
NOT EMPLOYED	56072
CONSULTANT	54765
LAWYER	43813

knitr::kable(head(plyr::arrange(top_contributor_djt,desc(top_contributor_djt$V10)), n = 10))

V9	V10
RETIRED	292655
INFORMATION REQUESTED	153128
SELF-EMPLOYED	11118
SALES	10047
PHYSICIAN	8837
HOMEMAKER	8188
BUSINESS OWNER	7438
ENGINEER	6915
ATTORNEY	6071
OWNER	5401

Group By State

We will look at who were the top contributor for both the campaigns based on the state were in they reside.

top_state_hrc <- hrccontributions %>%  
          group_by(V6)  %>%
            summarise(V10=n()) 

top_state_djt <- djtcontributions %>%  
          group_by(V6)  %>%
            summarise(V10=n()) 

knitr::kable(head(plyr::arrange(top_state_hrc,desc(top_state_hrc$V10)), n = 10))

V6	V10
CA	681518
NY	394653
TX	201805
FL	182113
MA	152570
WA	125099
IL	121487
PA	118255
NJ	113906
VA	109987

knitr::kable(head(plyr::arrange(top_state_djt,desc(top_state_djt$V10)), n = 10))

V6	V10
CA	83741
FL	76961
TX	76143
NY	35762
PA	30046
GA	29055
OH	26281
VA	23630
IL	22748
AZ	21429

We are going to plot the states where these top contributors are residing by using a Plotly Map.

top_state_hrc$hover = with(top_state_hrc, paste(top_state_hrc$V6, '<br>', "Total contributions", top_state_hrc$V10))

l=list(color = toRGB("white"), width = 2)
g=list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

p=plot_geo(top_state_hrc, locationmode = 'USA-states') %>%
  add_trace(
    z = ~V10, text = ~hover, locations = ~V6,
    color = ~V10, colors = 'Blues'
  ) %>%
  colorbar(title = "Contributions") %>%
  layout(
    title = 'Total HRC Contributions by State<br>(Hover for breakdown)',
    geo = g
  )

top_state_djt$hover=with(top_state_djt, paste(top_state_djt$V6, '<br>', "Total contributions", top_state_djt$V10))

p_dgt=plot_geo(top_state_djt, locationmode = 'USA-states') %>%
  add_trace(
    z = ~V10, text = ~hover, locations = ~V6,
    color = ~V10, colors = 'Reds'
  ) %>%
  colorbar(title = "Contributions") %>%
  layout(
    title = 'Total DJT Contributions by State<br>(Hover for breakdown)',
    geo = g
  )

#Sys.setenv("plotly_username"="data607g3")
#Sys.setenv("plotly_api_key"="dDZhymyplVxhzUuXv2MX")
#chart_link = api_create(p, filename="HRCStateOne",id="HRC")
#chart_link_djt = api_create(p_dgt,filename = "DJTStateOne")
#chart_link_djt
p

p_dgt

Inference for Contributions.

$H_0$ Null Hypothesis : The average Contributions during last four months does not vary across the campaigns.

$H_A$ Alternative Hypothesis : The average Contributions during last four months varies across the campaigns.

For this first we are cleaning the data and picking data only those we are interested in.

So we are picking contributions only for the month of JUN, JUL, AUG, SET and OCT for both the campaigns.

hrcfinal = data.frame(hrccontributions)
hrcfinal=subset(hrcfinal,select = c(V10,V11))
hrclast=subset(hrcfinal, hrcfinal$V11 == "JUN" | hrcfinal$V11 == "JUL" |  hrcfinal$V11 == "AUG" |hrcfinal$V11 == "SEP" |hrcfinal$V11 == "OCT" )

djtfinal = data.frame(djtcontributions)
djtfinal=subset(djtfinal,select = c(V10,V11))
djtlast=subset(djtfinal, djtfinal$V11 == "JUN" | djtfinal$V11 == "JUL" |  djtfinal$V11 == "AUG" |djtfinal$V11 == "SEP" |djtfinal$V11 == "OCT" )

Box Plot of last four months contributions.

bwplot(hrclast$V10 ~ hrclast$V11 ,main="Contributions across final four Months",xlab="Month",ylab="Contributions")

bwplot(djtlast$V10 ~ djtlast$V11 ,main="Contributions across final four Months",xlab="Month",ylab="Contributions")

Since the box plot is not able to give a clear picture, we are going to compare the means across four months.

hrc_summary_four = aggregate(hrclast$V10 ~ hrclast$V11, data = hrclast, function(x) c(Total = sum(x), Mean = mean(x), SD = sd(x) )) 
hrc_summary_four=cbind(hrc_summary_four[-ncol(hrc_summary_four)], hrc_summary_four[[ncol(hrc_summary_four)]])
knitr::kable(hrc_summary_four)

hrclast$V11	Total	Mean	SD
AUG	52298582	134.17462	408.1672
JUL	45677073	119.42655	366.8881
JUN	41466446	170.44040	509.5670
OCT	72668987	91.65413	269.8099
SEP	71749414	125.26522	364.4244

djt_summary_four <- aggregate(djtlast$V10 ~ djtlast$V11, data = djtlast, function(x) c(Total = sum(x), Mean = mean(x), SD = sd(x) )) 
djt_summary_four<- cbind(djt_summary_four[-ncol(djt_summary_four)], djt_summary_four[[ncol(djt_summary_four)]])
knitr::kable(djt_summary_four)

djtlast$V11	Total	Mean	SD
AUG	24049608	128.28373	372.4752
JUL	25068072	95.85821	248.8110
JUN	16474648	145.63096	367.7656
OCT	25179304	340.62451	399.2310
SEP	25577068	377.37647	517.3588

From the above analysis it is pretty clear that average monthly contributions across the last four between two candidates are not same. As a result we are going to reject the Null Hypothesis and we are going to accept the Alternative Hypothesis.

Plots:

These are plots of Contributions across all the months.

ggplot(hrcfinal, aes(x=hrcfinal$V11, y=hrcfinal$V10 ,group=hrcfinal$V11, colour=hrcfinal$V10)) + 
    geom_point() + ylab("Total Contributions") + 
    xlab("Month")

ggplot(djtfinal, aes(x=djtfinal$V11, y=djtfinal$V10 ,group=djtfinal$V11, colour=djtfinal$V10)) + 
    geom_point() + ylab("Total Contributions") + 
    xlab("Month")

Super PACs Expenditure:

Having worked on the campaign contributions from individuals of both the campaigns, now we are going to work on expenditures associated with PAC’s PACs spend a huge amount of money for and against the candidates. In this experiment, we are going to see how PACs have done their monthly expenditure heading up to the election.

pacexpenditure=fread('FEC_independent-expenditure.csv')

## Warning in fread("FEC_independent-expenditure.csv"): Bumped column 7 to
## type character on data row 12046, field contains 'AG'. Coercing previously
## read values in this column from logical, integer or numeric back to
## character which may not be lossless; e.g., if '00' and '000' occurred
## before they will now be just '0', and there may be inconsistencies with
## treatment of ',,' and ',NA,' too (if they occurred in this column before
## the bump). If this matters please rerun and set 'colClasses' to 'character'
## for this column. Please note that column type detection uses a sample of
## 1,000 rows (100 rows at 10 points) so hopefully this message should be very
## rare. If reporting to datatable-help, please rerun and include the output
## from verbose=TRUE.

pacexpenditure=data.frame(pacexpenditure)


# ready numeric vector
pacexpenditure$expenditure=str_replace_all(as.character(pacexpenditure$exp_amo), fixed("$"), "")
pacexpenditure$expenditure = as.numeric(str_replace_all(pacexpenditure$expenditure, ',', ''))
pacexpenditure=pacexpenditure[pacexpenditure$amn_ind == "N",]
pacexpenditure$can_nam <- as.character(pacexpenditure$can_nam)
pacexpenditure$can_nam <- tolower(pacexpenditure$can_nam)
pacexpenditure$spe_nam <- tolower(pacexpenditure$spe_nam)

# HRC PAC
hrcpac=pacexpenditure[grepl("inton|hil", pacexpenditure$can_nam), ]
hrcpac=hrcpac[!grepl("kirk|james|george", hrcpac$can_nam), ]
hrcpac$org = "HRC"

# DJT PAC
djtpac=pacexpenditure[grepl("trump", pacexpenditure$can_nam),]
djtpac$org = "DJT"

# combine, subset
pactot=rbind(hrcpac, djtpac)
pactot=select(pactot,org, spe_nam, rec_dat, expenditure)

# clean type
pactot$org <- as.factor(pactot$org)
pactot$spe_nam <- as.character(pactot$spe_nam)
pactot$rec_dat <- as.Date(pactot$rec_dat, "%m/%d/%Y")
dateval=as.month(pactot$rec_dat)
pactot$rec_dat=dateval$month

Now having the two PAC Contributions sorted out. Next we will try to Summarize the monthly expenditures across both the PAC and then we can try to compare the average monthly expenditure across both the PAC’s

hrcmonthlyexp = pactot %>%
                filter(pactot$org == "HRC") %>%
                group_by(rec_dat) %>% 
                summarize(sum(expenditure))
hrcmonthlyexp$org="HRC"

djtpacmonthlyepx = pactot %>%
                filter(pactot$org == "DJT") %>%
                group_by(rec_dat) %>% 
                summarize(sum(expenditure))
djtpacmonthlyepx$org = "DJT"

totpacmnthexp = rbind(hrcmonthlyexp, djtpacmonthlyepx)

Plot of monthly expenditure.

ggplot(totpacmnthexp, aes(x=rec_dat, y=`sum(expenditure)`,
                    group=org, colour=org)) + 
    geom_line() + ylab("Expenditure") + xlab("Month") +
    ggtitle("PAC Expenditure across both the campaigns")

Inference

$H_O$: Null Hypothesis : Average monthly expenditure across both the PAC’s does not vary

$H_A$: Alternative Hypothesis : Average monthly expenditure across both the PAC’s varies

ggplot(totpacmnthexp, aes(factor(org), `sum(expenditure)`)) + geom_boxplot(aes(fill=org)) + xlab("") +  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

pacmonexpfinal = aggregate(totpacmnthexp$`sum(expenditure)` ~ totpacmnthexp$org, data = totpacmnthexp, function(x) c(Total = sum(x), Mean = mean(x), SD = sd(x) )) 
pacmonexpfinal=cbind(pacmonexpfinal[-ncol(pacmonexpfinal)], pacmonexpfinal[[ncol(pacmonexpfinal)]])
knitr::kable(pacmonexpfinal)

totpacmnthexp$org	Total	Mean	SD
DJT	349184499	29098708	30902356
HRC	160661431	13388453	17769403

From the box plots and our calculations above it is pretty clear that both the averages across both the PACs are not same, so we can reject Null Hypothesis and Accept the Alternative Hypothesis.

Conclusion.

For this project as per my initial proposal we wanted to compare the contributions and expenditure of presidential elections between 2012 and 2016. Since the volume of data involved in both the campaigns was so huge to even load in for our project, so we decided to go ahead with only the analysis of 2016.

As a first experiment, we read the individual contributions across both the major party candidates in the presidential election. We did some initial analysis of data like top contributors, top contributors across the states and we even plot the campaign contributors in the map of USA to see where the maximum contributions are from.

We also did a hypothesis testing by comparing the average contributions of the last five months heading up to election, to see whether the average contributions across both the candidates are same. From our testing we rejected out Null Hypothesis and determined that average contributions between two candidates in the last four months were different.

Once having finished working on contributions, then we loaded the expenditure of PACs. We wanted to see how the PACs associated to HRC and DJT have spent the money during this election. We did some munging of data for both the candidates associated PAC’s.

For PAC expenditure we did a hypothesis testing to see whether the monthly expenditures are same across both the candidates. Finaly the means were way off limit so we concluded by rejecting null hypothesis and accepting the alternative hypothesis.

Overall this was a great experience in working on this project and DATA606 class as a whole. For a new comer as me in Statistics was a great learning experience.

Reference : Research analysis inspired from Kaggle 2016 Election Analysis