[note]: the project instructions describe the dataset as covering the 2012 presidential campaign, but the file I downloaded from the link actually covers the 2016 campaign. This can be seen from “election_tp” (G2016/P2016) and from “contb_receipt_dt”. I have updated all names to 2016 to reflect the correct election year.
## [1] 3211 18
## [1] "cmte_id" "cand_id" "cand_nm"
## [4] "contbr_nm" "contbr_city" "contbr_st"
## [7] "contbr_zip" "contbr_employer" "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt" "receipt_desc"
## [13] "memo_cd" "memo_text" "form_tp"
## [16] "file_num" "tran_id" "election_tp"
## 'data.frame': 3211 obs. of 18 variables:
## $ cmte_id : Factor w/ 13 levels "C00458844","C00500587",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ cand_id : Factor w/ 13 levels "P00003392","P20003281",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ cand_nm : Factor w/ 14 levels "Bush, Jeb","Carson, Benjamin S.",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ contbr_nm : Factor w/ 1535 levels "AARONSON, REBECCA",..: 103 1303 855 1357 339 804 1332 675 75 31 ...
## $ contbr_city : Factor w/ 221 levels "","AIRWAY HEIGHTS",..: 166 168 91 168 138 168 58 168 119 168 ...
## $ contbr_st : Factor w/ 1 level "WA": 1 1 1 1 1 1 1 1 1 1 ...
## $ contbr_zip : int 980620396 981037022 980311337 981154925 982779649 981193204 980264228 981173014 980402117 981011778 ...
## $ contbr_employer : Factor w/ 551 levels "","1000","110 CONSULTING",..: 305 212 333 47 142 414 414 148 261 230 ...
## $ contbr_occupation: Factor w/ 486 levels "","ACCOUNT EXECUTIVE",..: 366 328 377 106 423 27 28 142 236 209 ...
## $ contb_receipt_amt: num 100 100 50 100 2700 ...
## $ contb_receipt_dt : Factor w/ 120 levels "1-Apr-15","1-Jun-15",..: 60 62 72 48 17 105 73 94 26 113 ...
## $ receipt_desc : Factor w/ 10 levels "","REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC)",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ memo_cd : Factor w/ 2 levels "","X": 1 1 1 1 1 1 1 1 1 1 ...
## $ memo_text : Factor w/ 15 levels "","* EARMARKED CONTRIBUTION: SEE BELOW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ form_tp : Factor w/ 3 levels "SA17A","SA18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ file_num : int 1015585 1015585 1015585 1015585 1015585 1015585 1015585 1015585 1015585 1015585 ...
## $ tran_id : Factor w/ 3208 levels "A027F86C1700D48FCBA3",..: 210 371 1046 335 918 480 792 874 637 263 ...
## $ election_tp : Factor w/ 2 levels "G2016","P2016": 2 2 2 2 2 2 2 2 2 2 ...
## cmte_id cand_id cand_nm
## C00577130:924 P60007168:924 Sanders, Bernard :924
## C00575795:921 P00003392:921 Clinton, Hillary Rodham :921
## C00573519:530 P60005915:530 Carson, Benjamin S. :530
## C00574624:359 P60006111:359 Cruz, Rafael Edward 'Ted':341
## C00575449:195 P40003576:195 Paul, Rand :195
## C00458844:187 P60006723:187 Rubio, Marco :187
## (Other) : 95 (Other) : 95 (Other) :113
## contbr_nm contbr_city contbr_st
## NIEMAN, TYLER : 16 SEATTLE : 834 WA:3211
## PRIEBE, WOLFGANG G. MR.: 16 BELLEVUE : 165
## PARKER, DOROTHY : 15 VANCOUVER: 100
## LIANG, THOMAS MR. : 13 OLYMPIA : 80
## WENTZEL, CATHY : 12 KIRKLAND : 79
## DESCHAMPS, ROBERT : 11 TACOMA : 77
## (Other) :3128 (Other) :1876
## contbr_zip contbr_employer
## Min. : 98001 RETIRED : 624
## 1st Qu.:980720587 NOT EMPLOYED : 360
## Median :981361327 SELF-EMPLOYED: 301
## Mean :947850345 N/A : 225
## 3rd Qu.:984024023 SELF : 97
## Max. :994039792 (Other) :1597
## NA's :3 NA's : 7
## contbr_occupation contb_receipt_amt
## RETIRED : 758 Min. :-3300.0
## NOT EMPLOYED : 321 1st Qu.: 50.0
## ATTORNEY : 81 Median : 100.0
## INFORMATION REQUESTED : 75 Mean : 412.6
## INFORMATION REQUESTED PER BEST EFFORTS: 74 3rd Qu.: 250.0
## (Other) :1900 Max. : 5400.0
## NA's : 2
## contb_receipt_dt
## 30-Jun-15: 262
## 29-Jun-15: 110
## 30-Apr-15: 109
## 12-Apr-15: 85
## 26-May-15: 79
## 16-Jun-15: 71
## (Other) :2495
## receipt_desc memo_cd
## :3163 :3165
## Refund : 14 X: 46
## REDESIGNATION FROM PRIMARY : 8
## REDESIGNATION TO GENERAL : 8
## REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC): 4
## REATTRIBUTION FROM SPOUSE : 4
## (Other) : 10
## memo_text form_tp
## :2307 SA17A:3181
## * EARMARKED CONTRIBUTION: SEE BELOW : 813 SA18 : 16
## EARMARKED FROM MAKE DC LISTEN : 49 SB28A: 14
## REDESIGNATION FROM PRIMARY : 8
## REDESIGNATION TO GENERAL : 8
## REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC): 4
## (Other) : 22
## file_num tran_id election_tp
## Min. :1003942 SA17.113790 : 2 G2016: 40
## 1st Qu.:1015044 SA17.15219 : 2 P2016:3171
## Median :1015538 SA17.200406 : 2
## Mean :1015205 A027F86C1700D48FCBA3: 1
## 3rd Qu.:1015585 A02E8299728EB46FA8B6: 1
## Max. :1015715 A04553564DD6D419D947: 1
## (Other) :3202
I immediately noticed from the data structure that there are 13 levels of cand_id but 14 levels of cand_nm, although candidate IDs should match candidate names one to one. A table of just these two variables showed that “Cruz, Rafael Edward ‘Ted’” and “CRUZ, RAFAEL EDWARD TED” are the same person, so the two names need to be consolidated into one. I changed both to ‘Cruz, Rafael Edward Ted’.
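The consolidation described above can be sketched in base R as follows. This is a minimal illustration on a toy factor, not the original chunk (which is not preserved in this output); the toy vector is an assumption made only for the example.

```r
# Toy stand-in for the cand_nm column (assumed data, for illustration only).
cand_nm <- factor(c("Cruz, Rafael Edward 'Ted'",
                    "CRUZ, RAFAEL EDWARD TED",
                    "Rubio, Marco",
                    "Cruz, Rafael Edward 'Ted'"))

# Assigning the same label to several levels merges them into one level.
dup <- levels(cand_nm) %in% c("Cruz, Rafael Edward 'Ted'",
                              "CRUZ, RAFAEL EDWARD TED")
levels(cand_nm)[dup] <- "Cruz, Rafael Edward Ted"

nlevels(cand_nm)  # number of distinct candidate names after the merge
table(cand_nm)    # contributions per candidate name
```

Merging at the level of the factor avoids touching individual rows, so every record that carried either spelling ends up under the single consolidated name.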
Now cand_nm has 13 levels, and cand_id P60006111 corresponds only to ‘Cruz, Rafael Edward Ted’. The mismatch is fixed.
I also noticed that contbr_zip is an int in the original dataset and that many values are a five-digit zip code followed by a four-digit sub-zip code. I removed the four-digit sub-zip code where present and converted the variable to a factor.
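One way to do this cleanup, sketched on a few toy values: keep only the first five characters of each zip code and convert the result to a factor. Keeping the leading five digits (rather than stripping the trailing four) is an assumption of this sketch; it has the advantage of leaving plain five-digit codes untouched.

```r
# Toy integer zip codes: two ZIP+4 values, one plain five-digit value, one NA.
contbr_zip <- c(980620396L, 98001L, 981037022L, NA)

# Convert to text, keep the leading five digits, then store as a factor.
zip_chr <- as.character(contbr_zip)
zip5 <- factor(substr(zip_chr, 1, 5))

levels(zip5)  # "98001" "98062" "98103"
```

Missing values stay missing: `substr()` returns `NA` for an `NA` input, and `factor()` does not create a level for it.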
It makes sense to me to add three variables, “gender”, “party” and “age”, for each candidate, since contributors plausibly take all of them into account when making a contribution. I did some online research on each candidate and added their “gender”, political “party” and “age” to the original dataset.
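A lookup table is one straightforward way to attach these attributes. The sketch below is an assumption about how this could be done, not the original chunk: it covers only three of the 13 candidates, and the data frame names `cand_info` and `contrib` are hypothetical.

```r
# Hypothetical lookup table with one row per candidate (three shown here;
# a full version would need all 13 candidates).
cand_info <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Sanders, Bernard", "Rubio, Marco"),
  gender  = c("F", "M", "M"),
  party   = c("Democratic", "Democratic", "Republican"),
  age     = c(67, 73, 44),
  stringsAsFactors = FALSE
)

# Toy contribution records (assumed data, for illustration only).
contrib <- data.frame(
  cand_nm = c("Sanders, Bernard", "Rubio, Marco",
              "Clinton, Hillary Rodham", "Sanders, Bernard"),
  contb_receipt_amt = c(50, 2700, 100, 25),
  stringsAsFactors = FALSE
)

# match() gives, for each record, the row of its candidate in the lookup table.
idx <- match(contrib$cand_nm, cand_info$cand_nm)
contrib$gender <- factor(cand_info$gender[idx])
contrib$party  <- factor(cand_info$party[idx])
contrib$age    <- cand_info$age[idx]
```

Using `match()` keeps the original row order of the contribution records, which a `merge()` would not guarantee.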
Here is the summary of the updated dataset.
## cmte_id cand_id cand_nm
## C00577130:924 P60007168:924 Sanders, Bernard :924
## C00575795:921 P00003392:921 Clinton, Hillary Rodham:921
## C00573519:530 P60005915:530 Carson, Benjamin S. :530
## C00574624:359 P60006111:359 Cruz, Rafael Edward Ted:359
## C00575449:195 P40003576:195 Paul, Rand :195
## C00458844:187 P60006723:187 Rubio, Marco :187
## (Other) : 95 (Other) : 95 (Other) : 95
## contbr_nm contbr_city contbr_st contbr_zip
## NIEMAN, TYLER : 16 SEATTLE : 834 WA:3211 98112 : 81
## PRIEBE, WOLFGANG G. MR.: 16 BELLEVUE : 165 98004 : 68
## PARKER, DOROTHY : 15 VANCOUVER: 100 98115 : 68
## LIANG, THOMAS MR. : 13 OLYMPIA : 80 98122 : 63
## WENTZEL, CATHY : 12 KIRKLAND : 79 98119 : 57
## DESCHAMPS, ROBERT : 11 TACOMA : 77 (Other):2871
## (Other) :3128 (Other) :1876 NA's : 3
## contbr_employer contbr_occupation
## RETIRED : 624 RETIRED : 758
## NOT EMPLOYED : 360 NOT EMPLOYED : 321
## SELF-EMPLOYED: 301 ATTORNEY : 81
## N/A : 225 INFORMATION REQUESTED : 75
## SELF : 97 INFORMATION REQUESTED PER BEST EFFORTS: 74
## (Other) :1597 (Other) :1900
## NA's : 7 NA's : 2
## contb_receipt_amt contb_receipt_dt
## Min. :-3300.0 30-Jun-15: 262
## 1st Qu.: 50.0 29-Jun-15: 110
## Median : 100.0 30-Apr-15: 109
## Mean : 412.6 12-Apr-15: 85
## 3rd Qu.: 250.0 26-May-15: 79
## Max. : 5400.0 16-Jun-15: 71
## (Other) :2495
## receipt_desc memo_cd
## :3163 :3165
## Refund : 14 X: 46
## REDESIGNATION FROM PRIMARY : 8
## REDESIGNATION TO GENERAL : 8
## REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC): 4
## REATTRIBUTION FROM SPOUSE : 4
## (Other) : 10
## memo_text form_tp
## :2307 SA17A:3181
## * EARMARKED CONTRIBUTION: SEE BELOW : 813 SA18 : 16
## EARMARKED FROM MAKE DC LISTEN : 49 SB28A: 14
## REDESIGNATION FROM PRIMARY : 8
## REDESIGNATION TO GENERAL : 8
## REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC): 4
## (Other) : 22
## file_num tran_id election_tp gender
## Min. :1003942 SA17.113790 : 2 G2016: 40 F: 960
## 1st Qu.:1015044 SA17.15219 : 2 P2016:3171 M:2251
## Median :1015538 SA17.200406 : 2
## Mean :1015205 A027F86C1700D48FCBA3: 1
## 3rd Qu.:1015585 A02E8299728EB46FA8B6: 1
## Max. :1015715 A04553564DD6D419D947: 1
## (Other) :3202
## party age
## Democratic:1847 Min. :44.00
## Republican:1364 1st Qu.:62.00
## Median :67.00
## Mean :63.05
## 3rd Qu.:73.00
## Max. :73.00
##
Once the data is cleaned and tidy, it’s time to plot the distributions of the variables I’m interested in. I start with the contribution distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 50.0 100.0 433.2 250.0 5400.0
Most contributions fall between $1 and $250, with a median of $100. Beyond that range, the three common values are $500, $1000 and $2700, and it’s interesting that there are almost no other values in between. The maximum contribution is $5400, made by 8 outliers.
[Note] The FEC raised contribution caps for 2016. Under the new FEC limits, which are adjusted for inflation in odd-numbered years, individuals can give up to $5,400 to a candidate: $2,700 for the primary campaign and another $2,700 for the general election.
I’d like to understand the negative contributions. The table below lists the net contribution (real_contri) per donor and candidate for donors that have at least one negative contribution.
## Source: local data frame [28 x 3]
## Groups: contbr_nm
##
## contbr_nm cand_nm real_contri
## 1 BARBER, DAVID H. MR. Rubio, Marco 5400.00
## 2 BOYAJIAN, POLLY Sanders, Bernard 172.03
## 3 CONNORS, KATHY MARIE MRS. Rubio, Marco 2700.00
## 4 ERWIN, GERALD Carson, Benjamin S. 7700.00
## 5 FEENEY, SUSAN Clinton, Hillary Rodham 2700.00
## 6 GAMORAN, SAUL Cruz, Rafael Edward Ted 5400.00
## 7 GOLITZIN, ALEXANDER Sanders, Bernard 2700.00
## 8 GOLITZIN, JEANNETTE Sanders, Bernard 2700.00
## 9 GREEN, JEFF Clinton, Hillary Rodham -100.00
## 10 HOLM, TERESA MS. Paul, Rand 150.00
## .. ... ... ...
There are some interesting issues revealed in this table.
I may contact the FEC or WA State to investigate these issues further. In this project I won’t clean up this type of issue until it is clarified by the FEC or WA State.
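The lookup of donors with at least one negative contribution can be sketched in base R as below. The original output suggests dplyr was used; base R is used here only to keep the example self-contained. Donor and candidate names are hypothetical placeholders, and treating `real_contri` as the net sum per donor and candidate is an assumption based on the table above.

```r
# Toy contribution records (hypothetical donors and candidates).
contrib <- data.frame(
  contbr_nm = c("DONOR A", "DONOR A", "DONOR B", "DONOR C", "DONOR C"),
  cand_nm   = c("Candidate X", "Candidate X", "Candidate X",
                "Candidate Y", "Candidate Y"),
  contb_receipt_amt = c(2700, -2700, 100, 50, -150),
  stringsAsFactors = FALSE
)

# Donors that have at least one negative record.
neg_donors <- unique(contrib$contbr_nm[contrib$contb_receipt_amt < 0])

# Keep all records of those donors, then sum per donor and candidate.
neg_rows <- contrib[contrib$contbr_nm %in% neg_donors, ]
net <- aggregate(contb_receipt_amt ~ contbr_nm + cand_nm,
                 data = neg_rows, FUN = sum)
names(net)[names(net) == "contb_receipt_amt"] <- "real_contri"
net
```

A net value of zero means the refund cancelled the contribution exactly, while a negative net value flags a donor whose refunds exceed the recorded contributions.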
I’d like to see the contribution distribution across cities in WA State.
Seattle is the largest city in WA, so it’s no surprise that it has the most contributions, about five times as many as the second city, Bellevue, which is also the largest city east of Seattle.
I’d like to check the contribution distribution across contributors’ employers and occupations.
I’d like to see the contribution distribution across gender, political party and age.
There are only three Democratic candidates, and the other 10 candidates are all Republican. It’s surprising that the Democratic candidates actually drew more donations than the Republicans. We’ll take a closer look in the following analysis to figure this out.
The top two candidates by number of contributions are “Sanders, Bernard” and “Clinton, Hillary Rodham”, both Democratic, so it is no wonder that Democratic candidates get more contributions than Republican ones. This also suggests that WA is a blue state leaning toward the Democratic Party in the 2016 election.
This plot shows the log10 of the contribution amount by contribution source (SA17A: individual, SA18: committee, SB28A: refund).
There are 3211 transactions (contributions by individual, committee or refund) in the dataset with 18 variables (cmte_id, cand_id, cand_nm, contbr_nm, contbr_city, contbr_st, contbr_zip, contbr_employer, contbr_occupation, contb_receipt_amt, contb_receipt_dt, receipt_desc, memo_cd, memo_text, form_tp, file_num, tran_id, election_tp). Two variables are integers (“contbr_zip” and “file_num”), one is numeric (“contb_receipt_amt”), and all the rest are factors.
Of the 3211 transactions, 3182 are positive contributions; the remaining 29 are refunds or redesignations.
I’d like to see what factors affect contributors when they decide which candidate to support. I put these factors into two categories. One category is related to contributors, this includes:
The other category is directly related to each candidate, including:
“contbr_zip” can be used to see the contributors’ living area in more detail. From the summary of this feature, we can see that wealthy areas such as 98112, 98004, 98115, 98122 and 98119 top the contribution list. Another feature, “form_tp”, shows the source of contributions: most contributions come from individuals. Last but not least, “contbr_nm” can be used to identify outliers so I can do more research where applicable.
I created three new variables, “gender”, “party” and “age”, for each candidate because I think these are important factors when people decide whom to support. There are only two female candidates: “Clinton, Hillary Rodham” (Democratic) and “Fiorina, Carly” (Republican). The other 11 candidates are all male. Republicans outnumber Democrats in terms of candidates: there are only three Democratic candidates, “Clinton, Hillary Rodham”, “O’Malley, Martin Joseph” and “Sanders, Bernard”, leaving the other ten all Republican. Candidates’ ages range from 44 up to 73. The youngest are “Rubio, Marco” (Republican), “Cruz, Rafael Edward Ted” (Republican) and “Jindal, Bobby” (Republican), and the eldest is “Sanders, Bernard” (Democratic).
I immediately noticed from the data structure that there are 13 levels of cand_id but 14 levels of cand_nm, although candidate IDs should match candidate names one to one. Further investigation showed that “Cruz, Rafael Edward ‘Ted’” and “CRUZ, RAFAEL EDWARD TED” are duplicate values of cand_nm, so I consolidated the two names into one, ‘Cruz, Rafael Edward Ted’.
I also noticed that contbr_zip is an int in the original dataset and that many values are a five-digit zip code followed by a four-digit sub-zip code. I removed the four-digit sub-zip code where present and converted the variable to a factor.
When I investigated the donors with at least one negative contribution, I found some interesting issues in this data set: one donor’s contributions total $7,700, which exceeds the FEC limit, and five donors end up with negative net contributions.
These issues need to be clarified with the FEC or WA State before any cleanup. That is out of the scope of this project, so I leave these records as is.
First, I take a look at pairwise relationships by applying ggpairs. [note] Since ggpairs runs dramatically slower with factor variables that contain a large number of levels, I exclude “contbr_city”, “contbr_employer” and “contbr_occupation” from ggpairs. These three variables have already been examined individually in the ‘Univariate Analysis’ section.
The only continuous variable, “age”, has a correlation coefficient of 0.0364 with contb_receipt_amt, which indicates a very weak linear relationship.
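A coefficient like this can be cross-checked without ggpairs using base R’s `cor()`, which computes the Pearson correlation by default. The sketch below uses a small toy data frame (assumed values, not the real dataset) and also recomputes the coefficient from its definition, covariance divided by the product of the standard deviations.

```r
# Toy data: candidate age attached to each contribution (assumed values).
toy <- data.frame(
  age               = c(44, 44, 67, 73, 67, 73),
  contb_receipt_amt = c(100, 2700, 50, 250, 1000, 25)
)

# Pearson correlation, the default method of cor().
r <- cor(toy$age, toy$contb_receipt_amt)

# Same quantity from the definition: cov(x, y) / (sd(x) * sd(y)).
r_manual <- cov(toy$age, toy$contb_receipt_amt) /
  (sd(toy$age) * sd(toy$contb_receipt_amt))

c(r = r, r_manual = r_manual)
```

Because age is constant within each candidate, this coefficient mostly reflects differences between candidates rather than any per-donor effect.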
I’m mostly interested in the distribution of contributions for each candidate. The plot below shows the contribution distribution faceted by candidate, with sqrt(y = contribution count) and log10(x = contb_receipt_amt).
Here is another way to see the contribution distribution, using a box plot with log10(y = contb_receipt_amt).
I’d also like to check the total funds raised by each candidate. To do that, a separate data frame that includes cand_nm, total_transaction and total_contribution is required.
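Such a per-candidate summary can be built in base R roughly as follows. This is a sketch on toy records with hypothetical candidate names; the column names follow the ones mentioned above, and the name `cand_totals` is an assumption.

```r
# Toy contribution records (hypothetical candidates and amounts).
contrib <- data.frame(
  cand_nm = c("Candidate X", "Candidate Y", "Candidate X",
              "Candidate Y", "Candidate X"),
  contb_receipt_amt = c(100, 50, 2700, 25, -100),
  stringsAsFactors = FALSE
)

# Sum and count of transactions per candidate.
total_contribution <- tapply(contrib$contb_receipt_amt, contrib$cand_nm, sum)
total_transaction  <- tapply(contrib$contb_receipt_amt, contrib$cand_nm, length)

cand_totals <- data.frame(
  cand_nm            = names(total_contribution),
  total_transaction  = as.integer(total_transaction),
  total_contribution = as.numeric(total_contribution),
  stringsAsFactors   = FALSE
)

# Largest fundraiser first.
cand_totals <- cand_totals[order(-cand_totals$total_contribution), ]
cand_totals
```

Note that refunds (negative amounts) are included in the sum here, so `total_contribution` is a net figure; filtering to positive amounts first would give gross receipts instead.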
Is Hillary going to win the campaign? Her contributions almost equal the combined total of all the other candidates. At least in WA State she is the most promising.
Let’s see which party wins in terms of total contribution value.
Democratic candidates raised more than double the funds that Republican candidates did in WA State.
Here is the plot of total contribution values obtained by male vs. female candidates.
Due to Hillary’s exceptional performance, female candidates drew more donations than male candidates in WA State, although there are only two female candidates vs. 11 male.
Below is a plot of the top contributors, i.e. those who made the largest single-transaction donations.
Of these top donors, there are one chairman, two CEOs, one president, four investors, two consultants and one homemaker. Most of them have top-end titles or careers, indicating exceptionally high earners. It appears that rich people are willing to contribute more; maybe they believe that the more impact they can have on this country, the more benefit they can get in return.
I’d like to check the popularity of each candidate in the major cities. The major cities are those with 50 or more contributions, as shown in city_50.
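The 50-contribution threshold can be applied with a frequency table. The sketch below uses hypothetical city names and counts; only the name city_50 and the threshold come from the text above.

```r
# Toy records: 60 from CITY A, 50 from CITY B, 10 from CITY C (assumed data).
contrib <- data.frame(
  contbr_city = c(rep("CITY A", 60), rep("CITY B", 50), rep("CITY C", 10)),
  stringsAsFactors = FALSE
)

# Count contributions per city and keep cities with 50 or more.
city_counts <- table(contrib$contbr_city)
city_50 <- names(city_counts)[city_counts >= 50]

# Restrict the records to those major cities for per-city plots.
major <- contrib[contrib$contbr_city %in% city_50, , drop = FALSE]

city_50      # "CITY A" "CITY B"
nrow(major)  # 110
```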
Clinton wins Seattle and Bellevue (the two cities where most contributions come from) as well as Kirkland. She’s followed by Bernard Sanders, who is 2nd in Seattle and Bellevue and 1st in Olympia, Tacoma and Bellingham. Both are Democratic.
I’d also like to check each party’s popularity in these major cities.
Clearly, the Democratic Party wins almost all major cities in WA, except Vancouver and Redmond.
Since most features in the dataset are factors rather than continuous variables, only one correlation coefficient can be calculated, between age and the contribution amount. That coefficient is 0.0364, which is very weak.
Hillary raised the most funds in WA and wins Seattle and Bellevue, which are the most important cities. Next is Bernard Sanders, who wins Olympia, Tacoma and Bellingham. Both are Democratic and lead the way to victory over the Republicans.
Almost all of the outlier donors have top-level titles or careers and high earnings. In the next section I will take a closer look and see which candidates and parties they favor.
Due to Hillary’s exceptional performance, female candidates drew more donations than male candidates in WA State, although there are only two female candidates vs. 11 male.
Hillary raised the most funds, almost as much as all the other candidates combined. She is clearly the winner in WA. She also wins Seattle and Bellevue, the two most important cities in WA.
Democratic candidates beat Republican candidates in almost every major city in WA and raised more than double the funds of the Republicans.
I’d like to see the funds raised by each candidate with age factored in.
Hillary, Bernard and Benjamin are the eldest, yet they raised more funds than the younger candidates. It appears that Washingtonians favor more experienced candidates.
Below, two plots are placed side by side to compare the funds raised per candidate, categorized by party and gender.
The next plot compares the total funds raised by the two parties and also shows each candidate’s funds as blocks within their party’s column.
This plot supports the following conclusions reached in the previous analysis:
In section 2 I checked the information (employer and occupation) of the top donors; in the last plot I’d like to investigate further and find out which party and candidates these contributions favor.
All candidates in this list are Republican; this suggests that the Republican Party is still favored by rich people in WA State.
In this section I investigated in more detail the impact of a candidate’s age, gender and political party on the donations they raised. I also checked the relationship between the top donors and the candidates they favor. In summary:
There are two female candidates: “Clinton, Hillary Rodham” (Democratic) and “Fiorina, Carly” (Republican). Hillary raised the most funds across all parties and candidates, while Carly ranks only 8th of 13 candidates. It appears that gender is not a key factor in WA State and that people focus more on candidates’ experience, reputation and political vision.
The features in this dataset are mostly categorical (factor) variables, and the only continuous variable, “age”, has a very weak relationship with the contribution amount. Hence, I don’t think a linear model fits this analysis.
I’d like to check the popularity of each candidate in the major cities. The major cities are those with 50 or more contributions, as shown in city_50.
Clinton wins Seattle, Bellevue and Kirkland. She’s followed by Bernard Sanders, who is 2nd in Seattle and Bellevue and 1st in Olympia, Tacoma and Bellingham.
Both are Democratic. Thanks to these two candidates, the Democratic Party wins almost all major cities in WA, except Vancouver and Redmond.
The second plot compares the total funds raised by the two parties and also shows each candidate’s funds as blocks within their party’s column.
This plot supports the following conclusions reached in the previous analysis:
This plot shows the top contributors who made the highest single-transaction donations and the candidates who received them.
Of these top donors, there are one chairman, two CEOs, one president, four investors, two consultants and one homemaker. Most of them have top-end titles or careers, indicating exceptionally high earners. It appears that rich people are willing to contribute more; maybe they believe that the more impact they can have on this country, the more benefit they can get in return.
The candidates that received these highest single donations are all Republican; this suggests that the Republican Party is still favored by rich people in WA State.
The dataset of financial contributions to the 2016 presidential campaign in WA State contains 3211 contribution records with 18 variables. I started by querying the dataset to familiarize myself with its structure and features. I immediately noticed a duplicated candidate name, which had to be consolidated before any further analysis. Some further data cleanup of contbr_zip, contbr_employer and contbr_occupation followed to make the analysis more accurate. Three new features, “gender”, “party” and “age”, were added to the dataset, and they all proved to be important factors in the subsequent analysis.
The initial univariate analysis shows that most donations fall in the range of $1 to $250, with a median of $100. 8 donors made the highest contribution of $5,400, which is also the maximum individual contribution allowed by the FEC. Democratic candidate “Clinton, Hillary Rodham” is the winner across all parties: she raised $738,489.6, vs. $94,205.16 for the top Republican. Democratic candidates win most of the major cities, including Seattle and Bellevue, while only two cities favor the Republicans, which suggests WA is a blue state leaning Democratic in the 2016 election. It appears that Washingtonians prefer more experienced candidates, given that the three eldest candidates (aged 73, 67 and 65) are also the top three in the fundraising list. On the other hand, gender is a less important factor compared to experience, reputation and political party.
Taking a closer look at Plot Two of section 3, I realized that having 10 Republican candidates vs. 3 Democratic ones is actually a disadvantage. Contributions to Republicans are spread across many candidates, while Democratic supporters can easily focus on one candidate. This may be one of the reasons that no Republican candidate comes close to Hillary. Every cloud has a silver lining.
The rich donors who make larger contributions still favor the Republicans; this could be even more important when we factor in super PACs.
The first challenge came when loading the csv data. There are many duplicated entries in the first column of the original dataset, which stopped read.csv. I first tried adding row.names = NULL; that made the load succeed, but all headers shifted one column to the right. What I did instead was insert one extra column as the first column, containing row numbers, and use [, -1] to remove it after reading. Another thing I learned is how to apply an appropriate transformation to show more detail. For example, in the plot of the contribution distribution faceted by candidate, some of the panels seemed empty. I tried log10(y), but they were still invisible. I took a closer look at the dataset and realized that some candidates, like “Perry, James R. (Rick)” and “Jindal, Bobby”, have only one contribution; in this case log10(y) didn’t work, since it transforms the count to 0 (log10(1) = 0), which is still invisible. So I used sqrt(y) instead, and now they all show up clearly in the final plot.
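The difference between the two transformations is easy to check numerically. The counts below are arbitrary illustrative values, not taken from the dataset.

```r
# Bar heights for candidates with 1, 10 and 100 contributions.
counts <- c(1, 10, 100)

log10(counts)  # 0 1 2: a count of 1 maps to height 0, so the bar vanishes
sqrt(counts)   # 1.000000 3.162278 10.000000: a count of 1 keeps height 1
```

Both transformations compress large counts, but only the square root keeps a single contribution visible as a bar of non-zero height.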
The investigation of negative contributions (refunds or redesignations) reveals some unusual information. One donor has donated $7,700, which exceeds the FEC limit. Five donors actually end up with negative net contributions. These issues need to be clarified with the FEC or WA State so that we can make the dataset more accurate and improve the reliability of the analysis.