Exploration of Financial Contributions to 2016 Presidential Campaigns in Washington State by Zai Feng Wang

[note]: The project instructions describe this dataset as covering the 2012 presidential campaign, but the file I downloaded from the provided link is actually for the 2016 campaign. This can be seen from “election_tp” (G2016/P2016) and from “contb_receipt_dt”. I have updated all names to 2016 to reflect the correct election year.

Univariate Plots Section

## [1] 3211   18

##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"

## 'data.frame':    3211 obs. of  18 variables:
##  $ cmte_id          : Factor w/ 13 levels "C00458844","C00500587",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ cand_id          : Factor w/ 13 levels "P00003392","P20003281",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cand_nm          : Factor w/ 14 levels "Bush, Jeb","Carson, Benjamin S.",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ contbr_nm        : Factor w/ 1535 levels "AARONSON, REBECCA",..: 103 1303 855 1357 339 804 1332 675 75 31 ...
##  $ contbr_city      : Factor w/ 221 levels "","AIRWAY HEIGHTS",..: 166 168 91 168 138 168 58 168 119 168 ...
##  $ contbr_st        : Factor w/ 1 level "WA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : int  980620396 981037022 980311337 981154925 982779649 981193204 980264228 981173014 980402117 981011778 ...
##  $ contbr_employer  : Factor w/ 551 levels "","1000","110 CONSULTING",..: 305 212 333 47 142 414 414 148 261 230 ...
##  $ contbr_occupation: Factor w/ 486 levels "","ACCOUNT EXECUTIVE",..: 366 328 377 106 423 27 28 142 236 209 ...
##  $ contb_receipt_amt: num  100 100 50 100 2700 ...
##  $ contb_receipt_dt : Factor w/ 120 levels "1-Apr-15","1-Jun-15",..: 60 62 72 48 17 105 73 94 26 113 ...
##  $ receipt_desc     : Factor w/ 10 levels "","REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_text        : Factor w/ 15 levels "","* EARMARKED CONTRIBUTION: SEE BELOW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ file_num         : int  1015585 1015585 1015585 1015585 1015585 1015585 1015585 1015585 1015585 1015585 ...
##  $ tran_id          : Factor w/ 3208 levels "A027F86C1700D48FCBA3",..: 210 371 1046 335 918 480 792 874 637 263 ...
##  $ election_tp      : Factor w/ 2 levels "G2016","P2016": 2 2 2 2 2 2 2 2 2 2 ...

##       cmte_id         cand_id                         cand_nm   
##  C00577130:924   P60007168:924   Sanders, Bernard         :924  
##  C00575795:921   P00003392:921   Clinton, Hillary Rodham  :921  
##  C00573519:530   P60005915:530   Carson, Benjamin S.      :530  
##  C00574624:359   P60006111:359   Cruz, Rafael Edward 'Ted':341  
##  C00575449:195   P40003576:195   Paul, Rand               :195  
##  C00458844:187   P60006723:187   Rubio, Marco             :187  
##  (Other)  : 95   (Other)  : 95   (Other)                  :113  
##                    contbr_nm       contbr_city   contbr_st
##  NIEMAN, TYLER          :  16   SEATTLE  : 834   WA:3211  
##  PRIEBE, WOLFGANG G. MR.:  16   BELLEVUE : 165            
##  PARKER, DOROTHY        :  15   VANCOUVER: 100            
##  LIANG, THOMAS MR.      :  13   OLYMPIA  :  80            
##  WENTZEL, CATHY         :  12   KIRKLAND :  79            
##  DESCHAMPS, ROBERT      :  11   TACOMA   :  77            
##  (Other)                :3128   (Other)  :1876            
##    contbr_zip             contbr_employer
##  Min.   :    98001   RETIRED      : 624  
##  1st Qu.:980720587   NOT EMPLOYED : 360  
##  Median :981361327   SELF-EMPLOYED: 301  
##  Mean   :947850345   N/A          : 225  
##  3rd Qu.:984024023   SELF         :  97  
##  Max.   :994039792   (Other)      :1597  
##  NA's   :3           NA's         :   7  
##                               contbr_occupation contb_receipt_amt
##  RETIRED                               : 758    Min.   :-3300.0  
##  NOT EMPLOYED                          : 321    1st Qu.:   50.0  
##  ATTORNEY                              :  81    Median :  100.0  
##  INFORMATION REQUESTED                 :  75    Mean   :  412.6  
##  INFORMATION REQUESTED PER BEST EFFORTS:  74    3rd Qu.:  250.0  
##  (Other)                               :1900    Max.   : 5400.0  
##  NA's                                  :   2                     
##   contb_receipt_dt
##  30-Jun-15: 262   
##  29-Jun-15: 110   
##  30-Apr-15: 109   
##  12-Apr-15:  85   
##  26-May-15:  79   
##  16-Jun-15:  71   
##  (Other)  :2495   
##                                               receipt_desc  memo_cd 
##                                                     :3163    :3165  
##  Refund                                             :  14   X:  46  
##  REDESIGNATION FROM PRIMARY                         :   8           
##  REDESIGNATION TO GENERAL                           :   8           
##  REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC):   4           
##  REATTRIBUTION FROM SPOUSE                          :   4           
##  (Other)                                            :  10           
##                                                memo_text     form_tp    
##                                                     :2307   SA17A:3181  
##  * EARMARKED CONTRIBUTION: SEE BELOW                : 813   SA18 :  16  
##  EARMARKED FROM MAKE DC LISTEN                      :  49   SB28A:  14  
##  REDESIGNATION FROM PRIMARY                         :   8               
##  REDESIGNATION TO GENERAL                           :   8               
##  REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC):   4               
##  (Other)                                            :  22               
##     file_num                       tran_id     election_tp 
##  Min.   :1003942   SA17.113790         :   2   G2016:  40  
##  1st Qu.:1015044   SA17.15219          :   2   P2016:3171  
##  Median :1015538   SA17.200406         :   2               
##  Mean   :1015205   A027F86C1700D48FCBA3:   1               
##  3rd Qu.:1015585   A02E8299728EB46FA8B6:   1               
##  Max.   :1015715   A04553564DD6D419D947:   1               
##                    (Other)             :3202

From the data structure I immediately noticed that there are 13 levels of cand_id but 14 levels of cand_nm, although candidate IDs should match candidate names one to one. A table of just these two variables showed that “Cruz, Rafael Edward ‘Ted’” and “CRUZ, RAFAEL EDWARD TED” are the same person, so the two names need to be consolidated into one. I changed both to ‘Cruz, Rafael Edward Ted’.
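The original code for this step is not shown in the report. A minimal sketch of one way to do it in base R follows, using a three-row toy data frame named `contrib` (a hypothetical stand-in for the real data). Assigning the same label to several factor levels merges them.

```r
# Toy stand-in for the two candidate columns (hypothetical rows).
contrib <- data.frame(
  cand_id = factor(c("P60006111", "P60006111", "P40003576")),
  cand_nm = factor(c("Cruz, Rafael Edward 'Ted'",
                     "CRUZ, RAFAEL EDWARD TED",
                     "Paul, Rand"))
)

# Cross-tabulate IDs against names; an ID with more than one
# non-zero cell is spelled in more than one way.
id_name <- table(contrib$cand_id, contrib$cand_nm)
names_per_id <- rowSums(id_name > 0)

# Give both spellings the same label; R merges the duplicate levels.
variants <- c("Cruz, Rafael Edward 'Ted'", "CRUZ, RAFAEL EDWARD TED")
levels(contrib$cand_nm)[levels(contrib$cand_nm) %in% variants] <-
  "Cruz, Rafael Edward Ted"

nlevels(contrib$cand_nm)
```

Renaming levels, rather than converting to character and back, keeps the column a factor throughout.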

Now cand_nm has 13 levels, and cand_id P60006111 corresponds to the single name ‘Cruz, Rafael Edward Ted’. The mismatch is fixed.

I also noticed that contbr_zip is an int in the original dataset and that many values are a five-digit zip code followed by a four-digit sub-zip code. I dropped the trailing four digits of those values and converted the variable to a factor.
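A minimal sketch of the zip clean-up, assuming every value is either a five-digit or a nine-digit integer (the sample values are taken from the structure and summary output above). The vector names `zip_raw` and `zip5` are illustrative.

```r
# Sample values: two nine-digit codes, one five-digit code, one missing.
zip_raw <- c(980620396L, 981037022L, 98001L, NA)

# Keep the first five characters; five-digit codes pass through unchanged
# and NA stays NA.
zip5 <- factor(substr(as.character(zip_raw), 1, 5))

levels(zip5)
```

Taking the first five characters handles both lengths in one step, which integer division by 10000 would not.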

It makes sense to me to add three variables, “gender”, “party” and “age”, for each candidate, since contributors may take all of them into account when making a contribution. I researched each candidate online and added their gender, political party and age to the original dataset.
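One way to attach candidate-level attributes is a small lookup table joined by candidate name. The sketch below is an assumption about the approach, not the original code; the data frames `contrib` and `cand_info` are hypothetical and the attribute values for the three candidates shown are illustrative.

```r
# Toy contribution rows (hypothetical).
contrib <- data.frame(
  cand_nm = c("Sanders, Bernard", "Clinton, Hillary Rodham",
              "Rubio, Marco", "Sanders, Bernard"),
  contb_receipt_amt = c(50, 2700, 100, 25),
  stringsAsFactors = FALSE
)

# One row per candidate; values here are illustrative.
cand_info <- data.frame(
  cand_nm = c("Sanders, Bernard", "Clinton, Hillary Rodham", "Rubio, Marco"),
  gender  = c("M", "F", "M"),
  party   = c("Democratic", "Democratic", "Republican"),
  age     = c(73, 67, 44),
  stringsAsFactors = FALSE
)

# match() keeps the original row order, unlike merge().
idx <- match(contrib$cand_nm, cand_info$cand_nm)
contrib$gender <- factor(cand_info$gender[idx])
contrib$party  <- factor(cand_info$party[idx])
contrib$age    <- cand_info$age[idx]
```

A candidate missing from the lookup table would get NA in all three columns, which makes gaps easy to spot with `summary()`.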

Here is the summary of the updated dataset.

##       cmte_id         cand_id                       cand_nm   
##  C00577130:924   P60007168:924   Sanders, Bernard       :924  
##  C00575795:921   P00003392:921   Clinton, Hillary Rodham:921  
##  C00573519:530   P60005915:530   Carson, Benjamin S.    :530  
##  C00574624:359   P60006111:359   Cruz, Rafael Edward Ted:359  
##  C00575449:195   P40003576:195   Paul, Rand             :195  
##  C00458844:187   P60006723:187   Rubio, Marco           :187  
##  (Other)  : 95   (Other)  : 95   (Other)                : 95  
##                    contbr_nm       contbr_city   contbr_st   contbr_zip  
##  NIEMAN, TYLER          :  16   SEATTLE  : 834   WA:3211   98112  :  81  
##  PRIEBE, WOLFGANG G. MR.:  16   BELLEVUE : 165             98004  :  68  
##  PARKER, DOROTHY        :  15   VANCOUVER: 100             98115  :  68  
##  LIANG, THOMAS MR.      :  13   OLYMPIA  :  80             98122  :  63  
##  WENTZEL, CATHY         :  12   KIRKLAND :  79             98119  :  57  
##  DESCHAMPS, ROBERT      :  11   TACOMA   :  77             (Other):2871  
##  (Other)                :3128   (Other)  :1876             NA's   :   3  
##       contbr_employer                              contbr_occupation
##  RETIRED      : 624   RETIRED                               : 758   
##  NOT EMPLOYED : 360   NOT EMPLOYED                          : 321   
##  SELF-EMPLOYED: 301   ATTORNEY                              :  81   
##  N/A          : 225   INFORMATION REQUESTED                 :  75   
##  SELF         :  97   INFORMATION REQUESTED PER BEST EFFORTS:  74   
##  (Other)      :1597   (Other)                               :1900   
##  NA's         :   7   NA's                                  :   2   
##  contb_receipt_amt  contb_receipt_dt
##  Min.   :-3300.0   30-Jun-15: 262   
##  1st Qu.:   50.0   29-Jun-15: 110   
##  Median :  100.0   30-Apr-15: 109   
##  Mean   :  412.6   12-Apr-15:  85   
##  3rd Qu.:  250.0   26-May-15:  79   
##  Max.   : 5400.0   16-Jun-15:  71   
##                    (Other)  :2495   
##                                               receipt_desc  memo_cd 
##                                                     :3163    :3165  
##  Refund                                             :  14   X:  46  
##  REDESIGNATION FROM PRIMARY                         :   8           
##  REDESIGNATION TO GENERAL                           :   8           
##  REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC):   4           
##  REATTRIBUTION FROM SPOUSE                          :   4           
##  (Other)                                            :  10           
##                                                memo_text     form_tp    
##                                                     :2307   SA17A:3181  
##  * EARMARKED CONTRIBUTION: SEE BELOW                : 813   SA18 :  16  
##  EARMARKED FROM MAKE DC LISTEN                      :  49   SB28A:  14  
##  REDESIGNATION FROM PRIMARY                         :   8               
##  REDESIGNATION TO GENERAL                           :   8               
##  REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC):   4               
##  (Other)                                            :  22               
##     file_num                       tran_id     election_tp  gender  
##  Min.   :1003942   SA17.113790         :   2   G2016:  40   F: 960  
##  1st Qu.:1015044   SA17.15219          :   2   P2016:3171   M:2251  
##  Median :1015538   SA17.200406         :   2                        
##  Mean   :1015205   A027F86C1700D48FCBA3:   1                        
##  3rd Qu.:1015585   A02E8299728EB46FA8B6:   1                        
##  Max.   :1015715   A04553564DD6D419D947:   1                        
##                    (Other)             :3202                        
##         party           age       
##  Democratic:1847   Min.   :44.00  
##  Republican:1364   1st Qu.:62.00  
##                    Median :67.00  
##                    Mean   :63.05  
##                    3rd Qu.:73.00  
##                    Max.   :73.00  
##

Once the data is clean and tidy, it’s time to plot the distribution of each variable I’m interested in. I start with the contribution amounts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    50.0   100.0   433.2   250.0  5400.0

Most contributions are between $1 and $250, with a median of $100. Beyond that range, the values cluster at three points: $500, $1,000 and $2,700. It’s interesting that there are almost no other values in between. The maximum contribution is $5,400, made by 8 outliers.

[Note] The FEC raised the contribution caps for 2016. Under the new limits, which are adjusted for inflation in odd-numbered years, individuals can give up to $5,400 to a candidate: $2,700 for the primary campaign and another $2,700 for the general election.

I’d like to understand the negative contributions. The table below lists the records of donors who have at least one negative contribution.
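A minimal sketch of how such a table could be built in base R, assuming `real_contri` is the net amount per donor and candidate (the report does not show its definition). The donor names below are hypothetical.

```r
# Toy rows with hypothetical donors.
contrib <- data.frame(
  contbr_nm = c("DONOR, A", "DONOR, A", "DONOR, B", "DONOR, C", "DONOR, C"),
  cand_nm   = c("Rubio, Marco", "Rubio, Marco", "Paul, Rand",
                "Sanders, Bernard", "Sanders, Bernard"),
  contb_receipt_amt = c(5400, -2700, 250, 100, -200),
  stringsAsFactors = FALSE
)

# Donors with at least one negative transaction.
neg_donors <- unique(contrib$contbr_nm[contrib$contb_receipt_amt < 0])

# All of their records, summed per donor and candidate.
neg_rows <- contrib[contrib$contbr_nm %in% neg_donors, ]
real_contri <- aggregate(contb_receipt_amt ~ contbr_nm + cand_nm,
                         data = neg_rows, FUN = sum)
names(real_contri)[3] <- "real_contri"
real_contri
```

Summing all of a donor's records, not only the negative ones, is what lets refunds offset the original contributions.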

## Source: local data frame [28 x 3]
## Groups: contbr_nm
## 
##                    contbr_nm                 cand_nm real_contri
## 1       BARBER, DAVID H. MR.            Rubio, Marco     5400.00
## 2            BOYAJIAN, POLLY        Sanders, Bernard      172.03
## 3  CONNORS, KATHY MARIE MRS.            Rubio, Marco     2700.00
## 4              ERWIN, GERALD     Carson, Benjamin S.     7700.00
## 5              FEENEY, SUSAN Clinton, Hillary Rodham     2700.00
## 6              GAMORAN, SAUL Cruz, Rafael Edward Ted     5400.00
## 7        GOLITZIN, ALEXANDER        Sanders, Bernard     2700.00
## 8        GOLITZIN, JEANNETTE        Sanders, Bernard     2700.00
## 9                GREEN, JEFF Clinton, Hillary Rodham     -100.00
## 10          HOLM, TERESA MS.              Paul, Rand      150.00
## ..                       ...                     ...         ...

This table reveals some interesting issues.

ERWIN, GERALD donated $7,700.00, which exceeds the FEC limit.
These donors have net negative contributions: GREEN, JEFF; MCCAW, CRAIG O.; MCCAW, SUSAN R.; NEUPERT and SHERYL S.
One donor, HOLM, TERESA MS., contributed to two candidates, Cruz, Rafael Edward Ted and Paul, Rand, both Republican.

I may contact the FEC or WA State to investigate these issues further. In this project I leave these records as they are until the FEC or WA State clarifies them.

I’d like to see how contributions are distributed across cities in WA State.

Seattle is the largest city in WA, so it is no surprise that it leads by a wide margin: the summary above shows 834 contributions from Seattle against 165 from second-place Bellevue, the largest city east of Seattle.

I’d like to check the contribution distribution over contributors’ employers and occupations.

I’d like to see the contribution distribution over gender, political party and age.

There are only three Democratic candidates; the other 10 candidates are all Republican. It’s surprising that the Democratic candidates nevertheless drew more donations than the Republican ones. The following analysis takes a closer look at this.

The top two candidates by number of contributions are “Sanders, Bernard” and “Clinton, Hillary Rodham”, both Democratic, so it is no wonder that Democratic candidates received more contributions than Republican ones. This also suggests that WA is a blue state favoring the Democratic Party in the 2016 election.

This plot shows the log10 of the contribution amount by contribution source (SA17A: individual, SA18: committee, SB28A: refund).

Univariate Analysis

What is the structure of your dataset?

There are 3211 transactions (contributions by individuals or committees, or refunds) in the dataset with 18 variables (cmte_id, cand_id, cand_nm, contbr_nm, contbr_city, contbr_st, contbr_zip, contbr_employer, contbr_occupation, contb_receipt_amt, contb_receipt_dt, receipt_desc, memo_cd, memo_text, form_tp, file_num, tran_id, election_tp). Two variables are integers (“contbr_zip” and “file_num”), one is numeric (“contb_receipt_amt”), and all the rest are factors.

Of the 3211 transactions, 3182 are positive contributions; the remaining 29 are refunds or redesignations.

What is/are the main feature(s) of interest in your dataset?

I’d like to see what factors affect contributors when they decide whom to support. I put these factors into two categories. One category relates to the contributors and includes:

“contbr_city”
“contbr_employer”
“contbr_occupation”
“contb_receipt_amt”

The other category relates directly to the candidates and includes:

“cand_nm”
“gender”
“party”
“age”

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

“contbr_zip” can be used to look at contributors’ home areas in more detail. The summary of this feature shows that affluent areas such as 98112, 98004, 98115, 98122 and 98119 top the contribution list. Another feature, “form_tp”, shows the source of contributions; most come from individuals. Last but not least, “contbr_nm” can be used to identify outliers so I can research them further if needed.

Did you create any new variables from existing variables in the dataset?

I created three new variables, “gender”, “party” and “age”, for each candidate because I think these are important factors when people decide whom to support. There are only two female candidates: “Clinton, Hillary Rodham” (Democratic) and “Fiorina, Carly” (Republican). The other 11 candidates are all male. Republicans outnumber Democrats in terms of number of candidates: there are only three Democratic candidates, “Clinton, Hillary Rodham”, “O’Malley, Martin Joseph” and “Sanders, Bernard”, leaving the other ten all Republican. Candidates’ ages range from 44 up to 73. The youngest include “Rubio, Marco” (Republican), “Cruz, Rafael Edward Ted” (Republican) and “Jindal, Bobby” (Republican), and the eldest is “Sanders, Bernard” (Democratic).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

From the data structure I immediately noticed that there are 13 levels of cand_id but 14 levels of cand_nm, although candidate IDs should match candidate names one to one. Further investigation showed that “Cruz, Rafael Edward ‘Ted’” and “CRUZ, RAFAEL EDWARD TED” are duplicate cand_nm values, so I consolidated them into one: ‘Cruz, Rafael Edward Ted’.

I also noticed that contbr_zip is an int in the original dataset and that many values are a five-digit zip code followed by a four-digit sub-zip code. I dropped the trailing four digits of those values and converted the variable to a factor.

When I investigated the donors with at least one negative contribution, I found some interesting issues in this dataset:

ERWIN, GERALD donated $7,700.00, which exceeds the FEC limit.
These donors have net negative contributions: GREEN, JEFF; MCCAW, CRAIG O.; MCCAW, SUSAN R.; NEUPERT and SHERYL S.

These issues need to be clarified with the FEC or WA State before any cleanup. That is outside the scope of this project, so I leave these records as they are.

Bivariate Plots Section

First, I look at pairwise relationships using ggpairs. [note] Since ggpairs runs dramatically slower with factor variables that contain a large number of levels, I exclude “contbr_city”, “contbr_employer” and “contbr_occupation” from ggpairs. These three variables were examined individually in the ‘Univariate Analysis’ section.

The only continuous variable, “age”, has a correlation coefficient of 0.0364 with contb_receipt_amt, which indicates a very weak linear relationship.
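For reference, a minimal sketch of the Pearson correlation computation on toy data. The values below are made up and will not reproduce the 0.0364 reported above; note that age is a candidate-level attribute, so it is constant across all contributions to the same candidate.

```r
# Toy data: age repeats for every contribution to the same candidate.
contrib <- data.frame(
  age               = c(73, 73, 67, 67, 44, 44),
  contb_receipt_amt = c(50, 100, 2700, 250, 100, 500)
)

# Pearson correlation (the default method of cor()).
r <- cor(contrib$age, contrib$contb_receipt_amt)
r
```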

I’m mostly interested in how contributions are distributed for each candidate. The plot below shows the contribution distribution faceted by candidate, with sqrt(y = contribution count) and log10(x = contb_receipt_amt).
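The choice of a square-root scale for the counts matters for candidates with very few contributions. A short sketch with illustrative counts shows why: a bin holding a single contribution has height 0 on a log10 scale but height 1 on a sqrt scale.

```r
# Illustrative bin counts, from a single contribution up to hundreds.
n_contrib <- c(1, 10, 100, 900)

log_height  <- log10(n_contrib)  # the count of 1 maps to 0 (invisible bar)
sqrt_height <- sqrt(n_contrib)   # the count of 1 maps to 1 (visible bar)

data.frame(n_contrib, log_height, sqrt_height)
```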

Here is another way to see the contribution distribution, using a box plot with log10(y = contb_receipt_amt).

I’d also like to check the total funds raised by each candidate. This requires a separate data frame that includes cand_nm, total_transaction and total_contribution.
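A minimal base R sketch of such a per-candidate summary, using hypothetical toy rows. The data frame name `cand_totals` is an assumption; the column names follow the text above.

```r
# Toy rows (hypothetical amounts).
contrib <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
              "Sanders, Bernard", "Sanders, Bernard", "Sanders, Bernard",
              "Paul, Rand"),
  contb_receipt_amt = c(2700, 100, 50, 25, 100, 250),
  stringsAsFactors = FALSE
)

# Count and sum per candidate; both tapply() calls return the same order.
n_by_cand   <- tapply(contrib$contb_receipt_amt, contrib$cand_nm, length)
sum_by_cand <- tapply(contrib$contb_receipt_amt, contrib$cand_nm, sum)

cand_totals <- data.frame(
  cand_nm            = names(sum_by_cand),
  total_transaction  = as.vector(n_by_cand),
  total_contribution = as.vector(sum_by_cand),
  stringsAsFactors   = FALSE
)

# Largest fundraiser first.
cand_totals <- cand_totals[order(-cand_totals$total_contribution), ]
cand_totals
```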

Is Hillary going to win the campaign? She raised more than all the other candidates combined. At least in WA State, she is the most promising.

Let’s see which party wins in terms of total contribution value.

Democratic candidates raised more than double the funds that Republican candidates did in WA State.

Here is the plot of total contribution values obtained by male vs. female candidates.

Due to Hillary’s exceptional performance, female candidates drew more donations than male candidates in WA State, although there are only two female candidates vs. 11 male.

Below is a plot of the top contributors, ranked by their largest single-transaction donations.

Among these top donors there are one chairman, two CEOs, one president, four investors, two consultants and one homemaker. Most of them hold top-level titles or careers, which suggests exceptionally high incomes. It appears that rich people are willing to contribute more; maybe they believe that the more influence they have on the country, the more benefit they get in return.

I’d like to check the popularity of each candidate in the major cities. The major cities are those with 50 or more contributions, as shown in city_50.

Clinton wins Seattle and Bellevue (where most contributions come from) as well as Kirkland. She is followed by Bernard Sanders, who is 2nd in Seattle and Bellevue and 1st in Olympia, Tacoma and Bellingham. Both are Democratic.
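A minimal sketch of how a `city_50` vector could be derived, assuming it holds the names of cities with at least 50 contribution records. The toy data and the threshold variable `min_n` are illustrative.

```r
# Toy data: two cities reach the threshold, one does not.
contrib <- data.frame(
  contbr_city = rep(c("SEATTLE", "BELLEVUE", "YAKIMA"), times = c(60, 50, 5)),
  cand_nm     = rep(c("Clinton, Hillary Rodham", "Sanders, Bernard"),
                    length.out = 115),
  stringsAsFactors = FALSE
)

min_n <- 50  # threshold for a "major" city

# Count contributions per city and keep the cities at or above the threshold.
city_counts <- table(contrib$contbr_city)
city_50 <- names(city_counts)[as.vector(city_counts) >= min_n]

# Restrict to major cities, then count contributions per city and candidate.
major <- contrib[contrib$contbr_city %in% city_50, ]
by_city_cand <- table(major$contbr_city, major$cand_nm)
by_city_cand
```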

I’d also like to check each party’s popularity in these major cities.

The Democratic Party wins almost all major cities in WA; the exceptions are Vancouver and Redmond.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Since most features in the dataset are factors rather than continuous variables, the only correlation coefficient computed is between age and contribution amount. At 0.0364, it is very weak.

Hillary raised the most funds in WA and wins Seattle and Bellevue, the most important cities. Next is Bernard Sanders, who wins Olympia, Tacoma and Bellingham. Both are Democratic and put their party well ahead of the Republicans.

Almost all of the outlying donors hold top-level titles or careers with high earnings. In the next section I take a closer look at which candidates and parties they favor.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Due to Hillary’s exceptional performance, female candidates drew more donations than male candidates in WA State, although there are only two female candidates vs. 11 male.

What was the strongest relationship you found?

Hillary raised the most funds, more than all the other candidates combined. She is clearly the winner in WA. She also wins Seattle and Bellevue, the two most important cities in WA.

The Democratic Party beats the Republican Party in almost every major city in WA and raised more than double the Republican funds.

Multivariate Plots Section

I’d like to see the funds raised by each candidate with their ages factored in.

Hillary, Bernard and Benjamin are the eldest candidates and raised more funds than the younger ones. It appears that Washingtonians favor more experienced candidates.

The two plots below are placed side by side to compare the funds raised per candidate, categorized by party and by gender.

The next plot compares the total funds raised by the two parties and also shows each candidate’s funds as a block within their party’s column.

This plot supports the following conclusions reached in the previous analysis:

WA State is a “blue” state: Democratic candidates raised more than double the funds that Republican candidates did.
Hillary is the winner across all parties and all candidates. Her funds exceed the combined funds of all the other candidates.
The runner-up is also Democratic, and the top two candidates raised almost all the funds donated to Democratic candidates. In contrast, the funds raised by Republicans are spread fairly evenly across most Republican candidates, so no Republican comes even close to Hillary.

In section 2 I checked the employer and occupation of the top donors; in the last plot I’d like to find out which parties and candidates these top contributions favor.

All candidates in this list are Republican; it appears that the Republican Party is still favored by rich donors in WA State.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this section I investigated in more detail the impact of a candidate’s age, gender and political party on the donations they raised. I also checked the relationship between the top donors and the candidates they favor. In summary:

Age means experience. The three eldest candidates drew the most contributions, which suggests that Washingtonians favor more experienced candidates.
WA State is a “blue” state: Democratic candidates raised more than double the funds that Republican candidates did.
Hillary is the winner across all parties and all candidates. Her funds exceed the combined funds of all the other candidates. No Republican candidate comes even close to Hillary.
Although the Democratic Party beats the Republican Party in almost all major cities and in WA as a whole, the top single-transaction contributions show that the Republican Party is still favored by rich donors in WA State.

Were there any interesting or surprising interactions between features?

There are two female candidates: “Clinton, Hillary Rodham” (Democratic) and “Fiorina, Carly” (Republican). Hillary raised the most funds across all parties and candidates, while Carly ranks only 8th of 13 candidates. It appears that gender is not a key factor in WA State and that people focus more on candidates’ experience, reputation and political vision.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

The features in this dataset are mostly categorical (factor) variables, and the only continuous variable, “age”, has a very weak relationship with the contribution amount. Hence, I don’t think a linear model fits this analysis.

Final Plots and Summary

Plot One

I’d like to check the popularity of each candidate in the major cities. The major cities are those with 50 or more contributions, as shown in city_50.

Description One

Clinton wins:

Seattle (contribution counts 430 and total funds raised $412,776.14)
Bellevue (contribution counts 61 and total funds raised $68,117.10)
Kirkland (contribution counts 38 and total funds raised $7,426.64).

She’s followed by Bernard Sanders who is the 2nd in:

Seattle (contribution counts 267 and total funds raised $55,718.00)
Bellevue (contribution counts 38 and total funds raised $8,107.00)

and the 1st in:

Bellingham (contribution counts 39 and total funds raised $7,297.77)
Tacoma (contribution counts 33 and total funds raised $4,362.99)
Olympia (contribution counts 30 and total funds raised $4,356.16)

Both are Democratic. Thanks to these two candidates, the Democratic Party wins almost all major cities in WA; the exceptions are Vancouver and Redmond.

Plot Two

The second plot compares the total funds raised by the two parties and also shows each candidate’s funds as a block within their party’s column.

Description Two

This plot supports the following conclusions reached in the previous analysis:

WA State is a “blue” state: Democratic candidates raised $929,125, more than double the $395,879 that Republican candidates raised.
Hillary is the winner across all parties and all candidates. Her funds of $738,489.57 exceed the combined $586,515 raised by all the other candidates.
The runner-up is also Democratic, and the top two candidates raised almost all the funds donated to Democratic candidates. In contrast, the funds raised by Republicans are spread fairly evenly across most Republican candidates, so no Republican candidate comes even close to Hillary.
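The figures quoted in this description can be cross-checked with a few lines of arithmetic. The party totals are the rounded values from the text, so the remainder is approximate; the variable names are illustrative.

```r
dem_total     <- 929125     # Democratic total, as quoted above
rep_total     <- 395879     # Republican total, as quoted above
clinton_total <- 738489.57  # Clinton's total, as quoted above

# Ratio of Democratic to Republican funds.
party_ratio <- dem_total / rep_total

# Funds raised by every candidate other than Clinton.
rest_total <- dem_total + rep_total - clinton_total

round(party_ratio, 2)
round(rest_total)
```

The ratio comes out above 2 and the remainder rounds to the $586,515 quoted in the text, so both "more than double" and "more than the rest combined" hold.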

Plot Three

This plot shows the top contributors who made the highest single-transaction donations and the candidates who received them.

Description Three

The candidates who received these highest single donations are all Republican; it appears that the Republican Party is still favored by rich donors in WA State.

Reflection

The dataset of financial contributions to the 2016 presidential campaigns in WA State contains 3211 contribution records with 18 variables. I started by querying the dataset to familiarize myself with its structure and features. I immediately noticed a duplicated candidate name, which had to be consolidated before any further analysis. Some other data cleanup for contbr_zip, contbr_employer and contbr_occupation followed to make the analysis more accurate. Three new features, “gender”, “party” and “age”, were added to the dataset and were used throughout the later analysis.

The initial univariate analysis shows that most donations fall in the range of $1 to $250, with a median of $100. 8 donors made the highest contribution of $5,400, which is also the maximum individual contribution allowed by the FEC. Democratic candidate “Clinton, Hillary Rodham” is the winner across all parties: she raised $738,489.6, against $94,205.16 for the highest-raising Republican. The Democratic Party wins most major cities, including Seattle and Bellevue, and only two cities favor the Republican Party, so WA State looks like a blue state favoring the Democratic Party in the 2016 election.

It appears that Washingtonians prefer more experienced candidates, based on the fact that the three eldest candidates (aged 73, 67 and 65) are also the top three in the fundraising list. On the other hand, gender is a less important factor compared to experience, reputation and political party.

Taking a closer look at Plot Two of section 3, I realized that having 10 Republican candidates vs. 3 Democratic ones is actually a disadvantage. Contributions to Republicans are spread evenly across the candidates, while Democratic supporters can easily focus on one candidate. This may be one of the reasons that no Republican candidate comes close to Hillary. Every cloud has a silver lining, though: the rich donors who make larger contributions still favor the Republican Party, and this could be even more important once super PACs are factored in.

The first challenge actually came when I loaded the csv data. There are many duplicated entries in the first column of the original dataset, which stopped read.csv. I first tried adding row.names = NULL; that made the load succeed, but all headers shifted one column to the right. What I did was insert one extra column as the 1st column and fill it with row numbers; after reading, I used [, -1] to remove it. Another thing I learned is how to apply an appropriate transformation to show more detail. For example, in the plot of the contribution distribution faceted by candidate, some of the panels seemed empty. I tried log10(y) but they were still invisible. I took a closer look at the dataset and realized that some candidates, like “Perry, James R. (Rick)” and “Jindal, Bobby”, have only one contribution; in this case log10(y) didn’t work since it transformed them to 0 (log10(1) = 0), which is still invisible. So I used sqrt(y) instead, and now they all show up clearly in the final plot.

The investigation of negative contributions (refunds or redesignations) reveals some unusual information. One donor donated $7,700, which exceeds the FEC limit. Five donors actually end up with net negative contributions. These issues need to be clarified with the FEC or WA State so that the dataset can be made more accurate and the analysis more reliable.
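An alternative to editing the file by hand is to repair the shifted header after reading. The sketch below assumes the cause is a trailing comma on every data row, which gives the rows one more field than the header and makes read.csv treat the first column as row names; that cause is an assumption, not something the report confirms. The sample lines and the `trailing_empty` name are made up.

```r
# Three made-up data rows, each ending in a trailing comma.
csv_lines <- c(
  'cmte_id,cand_nm,contb_receipt_amt',
  'C00577130,"Sanders, Bernard",50,',
  'C00577130,"Sanders, Bernard",100,',
  'C00575795,"Clinton, Hillary Rodham",2700,'
)
csv_text <- paste(csv_lines, collapse = "\n")

# row.names = NULL avoids the duplicate row.names error, but the first
# column is now called "row.names" and every header sits one column right.
contrib <- read.csv(text = csv_text, row.names = NULL,
                    stringsAsFactors = FALSE)

# Shift the names back by one and drop the empty trailing column.
names(contrib) <- c(names(contrib)[-1], "trailing_empty")
contrib$trailing_empty <- NULL

str(contrib)
```

This keeps the source file untouched, so the load step stays reproducible from the original download.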