setwd("/Users/michaelreinhard/nano/R/finalProject")
I load the data in from the raw csv file, which is assumed to be in the same directory as the Rmd file.
To streamline the presentation I am going to add a few variables that I construct from outside information, such as the party of the candidates, the total votes they received in the primary, and whether a contribution was made before or after the primary. I will plot them to make sure they are working correctly.
Create a ‘gen’ (for general election) dummy variable for future use.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.8658 1.0000 1.0000
##
## 0 1
## 35911 231746
## [1] "numeric"
## [1] "primary" "general"
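A minimal sketch of how such a dummy could be built, assuming a `Date` column named `date_con` (the name that appears in later output) and a primary date of 2012-03-20, which is the date that the value 15419 in `primary_day.n` corresponds to. The toy data frame `ill_toy` and the choice to count the primary day itself as "primary" are assumptions, not the original code.

```r
# Toy data standing in for the real contributions file (hypothetical values).
ill_toy <- data.frame(
  date_con = as.Date(c("2011-08-24", "2012-03-20", "2012-03-21", "2012-10-01"))
)

# Primary day; as.numeric() of this date is 15419 days since 1970-01-01.
primary_day <- as.Date("2012-03-20")

# gen = 1 for contributions made after the primary, 0 otherwise.
# Treating the primary day itself as "primary" is an assumption.
ill_toy$gen <- as.numeric(ill_toy$date_con > primary_day)

# Labelled factor version of the same dummy.
ill_toy$gen.f <- factor(ill_toy$gen, levels = c(0, 1),
                        labels = c("primary", "general"))

table(ill_toy$gen)
levels(ill_toy$gen.f)
```

Keeping both a numeric and a factor version means the numeric one can go into summaries and correlations while the factor one gives readable plot labels.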
A variable I want to add right off is the party of the candidate.
## [1] Republican Democrat Democrat Democrat Democrat
## Levels: Democrat Republican Socialist
## [1] "factor"
##
## Democrat Republican Socialist
## 201064 66538 55
I am first going to examine some of the variables and their distributions. While doing this I will add some variables to the data set, either through recoding or by adding information from outside sources. The first I will look at is the date the contribution was made. I will convert the date values to a date format recognized by R and add a dummy variable for the date of the primary.
There are six main variables to examine. The primary variable of interest is the amount of contributions. This is the thing we are interested in explaining.
The variables that can help explain the amounts contributed are the characteristics of the contributors, the location of the contributors and the timing of the contributions. Another set of features that can explain the contributions are characteristics of the candidates. The names of the candidates are in the data set, and other characteristics such as their party affiliation and the number of votes they received in the primary contests can be added to it.

### Date

Dates are sometimes troublesome in R because they are recorded as characters. You have to convert them to dates. Then, if you want to use them on an axis and add, say, a vertical line to that axis, you have to convert the date where the line should go back to a numeric object. So it is an involved and somewhat counterintuitive process: you take them from characters to dates and then, depending on how you want to plot them, to numbers.
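The two conversion steps can be sketched as follows. This is an illustration under assumptions: the raw strings follow the `24-AUG-11` pattern seen in `contb_receipt_dt`, the session uses an English (or C) locale so that `%b` recognises the month abbreviations, and the variable names simply mirror the column names that appear in the output.

```r
# Raw dates as they appear in the file (stored as text, not as dates).
raw_dt <- c("24-AUG-11", "19-SEP-11", "28-SEP-11")

# Step 1: character -> Date. %d = day, %b = abbreviated month name,
# %y = two-digit year. Assumes an English or C locale for %b.
date_con <- as.Date(raw_dt, format = "%d-%b-%y")

# Step 2: Date -> numeric (days since 1970-01-01), the form some plotting
# calls expect when positioning a reference line on a date axis.
date_con.n <- as.numeric(date_con)
primary_day.n <- as.numeric(as.Date("2012-03-20"))

date_con
date_con.n
primary_day.n
```

The numeric value is what would be passed as the position of a vertical line, for example as the `xintercept` of a ggplot2 `geom_vline()`.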
## [1] "factor"
## [1] "2011-12-06" "2011-09-30" "2011-09-27" "2011-09-30" "2011-08-10"
## [6] "2011-09-27"
The variables for occupation and employer are factors with too many distinct values and no intrinsic ordering, so a bar graph of them takes a few minutes to render and doesn’t tell you anything. The best I can do with these variables, absent extensive recoding, is note the most common values. This requires reordering the levels, whether they are names of cities, occupations or employers, by their frequency. Oh yeah, the contributors have names, too, and the same approach applies to them.
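The reordering by frequency can be sketched with base R as follows. The toy data frame and its values are hypothetical; only the column name `contbr_nm` comes from the data set.

```r
# Toy column standing in for a high-cardinality factor (hypothetical values).
toy <- data.frame(contbr_nm = factor(c("A", "B", "A", "C", "A", "B")))

# table() counts each level; sort() puts the most frequent first.
freq <- sort(table(toy$contbr_nm), decreasing = TRUE)

# Most common values (use 20 instead of 2 on the real data).
head(freq, 2)

# The same counts as a data frame with columns Var1 and Freq,
# ordered from most to least frequent.
freq_df <- as.data.frame(table(toy$contbr_nm))
freq_df <- freq_df[order(freq_df$Freq, decreasing = TRUE), ]
head(freq_df, 2)
```

The data frame form keeps the original row numbers, which is why the ordered listings show row names that are out of sequence.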
## [1] "integer"
## [1] "factor"
## [1] 267657
## [1] 58897
## FALLSGRAFF, TOBY MITCHELL, CAITLIN HUGHES, ELLEN
## 292 248 213
## MUIR, MARY LUCAS, WILLIAM SELZ, MARIA CRISTINA
## 162 149 139
## DFHDFH, DFHDFH TRICE, YOHANCE RUSH, KYLE
## 113 109 106
## ZMRHAL, TOBY MR. MILES, LUCAS MILLER, ED
## 97 91 91
## WEISS, GREG BROWN, MARTHA LIBER, DAVID
## 91 89 87
## FINE, PHYLLIS MCKINLEY, CHERYL WOODING, JOHN
## 86 86 85
## ZAR, LEON JACOBSON, LINDA
## 85 84
## Named int [1:100] 292 248 213 162 149 139 113 109 106 97 ...
## - attr(*, "names")= chr [1:100] "FALLSGRAFF, TOBY" "MITCHELL, CAITLIN" "HUGHES, ELLEN" "MUIR, MARY" ...
## Var1 Freq
## 1 AARNOUDSE, ANTHONIE 4
## 2 AARON, SUSAN S. 2
## 3 AARONSON-KRELL, PAULA 7
## 4 AARONSON, RICHARD 2
## 5 AASEBY, JOEL DAVID 6
## 6 ABA, JAIME R. 2
## Var1 Freq
## 14973 FALLSGRAFF, TOBY 292
## 35968 MITCHELL, CAITLIN 248
## 24063 HUGHES, ELLEN 213
## 37015 MUIR, MARY 162
## 31824 LUCAS, WILLIAM 149
## 47516 SELZ, MARIA CRISTINA 139
## Var1 Freq
## 14973 FALLSGRAFF, TOBY 292
## 35968 MITCHELL, CAITLIN 248
## 24063 HUGHES, ELLEN 213
## 37015 MUIR, MARY 162
## 31824 LUCAS, WILLIAM 149
## 47516 SELZ, MARIA CRISTINA 139
## 12342 DFHDFH, DFHDFH 113
## 53416 TRICE, YOHANCE 109
## 45404 RUSH, KYLE 106
## 58795 ZMRHAL, TOBY MR. 97
## 35521 MILES, LUCAS 91
## 35587 MILLER, ED 91
## 55928 WEISS, GREG 91
## 6129 BROWN, MARTHA 89
## 30858 LIBER, DAVID 87
## 15564 FINE, PHYLLIS 86
## 34516 MCKINLEY, CHERYL 86
## 57782 WOODING, JOHN 85
## 58536 ZAR, LEON 85
## 24819 JACOBSON, LINDA 84
So this Toby guy made almost 300 contributions. Maybe he is a bundler? Or maybe he contributes to a lot of people? Or maybe he contributes in really small amounts.
## cmte_id cand_id cand_nm contbr_nm contbr_city
## 21 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY CHICAGO
## 1214 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY CHICAGO
## 1323 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY CHICAGO
## 1573 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY CHICAGO
## 1582 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY CHICAGO
## 1648 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY CHICAGO
## contbr_st contbr_zip contbr_employer contbr_occupation
## 21 IL 606472810 OBAMA FOR AMERICA SENIOR WRITER
## 1214 IL 606472810 OBAMA FOR AMERICA SENIOR WRITER
## 1323 IL 606472810 OBAMA FOR AMERICA SENIOR WRITER
## 1573 IL 606472810 OBAMA FOR AMERICA SENIOR WRITER
## 1582 IL 606472810 OBAMA FOR AMERICA SENIOR WRITER
## 1648 IL 606472810 OBAMA FOR AMERICA SENIOR WRITER
## contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text
## 21 0.4 24-AUG-11
## 1214 0.4 19-SEP-11
## 1323 5.0 28-SEP-11
## 1573 0.4 21-AUG-11
## 1582 5.0 29-SEP-11
## 1648 0.4 16-SEP-11
## form_tp file_num tran_id election_tp date_con gen gen.f
## 21 SA17A 756218 C11912258 P2012 2011-08-24 0 primary
## 1214 SA17A 756218 C12144091 P2012 2011-09-19 0 primary
## 1323 SA17A 756218 C12264232 P2012 2011-09-28 0 primary
## 1573 SA17A 756218 C11896062 P2012 2011-08-21 0 primary
## 1582 SA17A 756218 C12310253 P2012 2011-09-29 0 primary
## 1648 SA17A 756218 C12088742 P2012 2011-09-16 0 primary
## party date_con.n primary_day.n
## 21 Democrat 15210 15419
## 1214 Democrat 15236 15419
## 1323 Democrat 15245 15419
## 1573 Democrat 15207 15419
## 1582 Democrat 15246 15419
## 1648 Democrat 15233 15419
## [1] "cmte_id" "cand_id" "cand_nm"
## [4] "contbr_nm" "contbr_city" "contbr_st"
## [7] "contbr_zip" "contbr_employer" "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt" "receipt_desc"
## [13] "memo_cd" "memo_text" "form_tp"
## [16] "file_num" "tran_id" "election_tp"
## [19] "date_con" "gen" "gen.f"
## [22] "party" "date_con.n" "primary_day.n"
## cand_nm contbr_employer contbr_occupation contb_receipt_amt
## 21 Obama, Barack OBAMA FOR AMERICA SENIOR WRITER 0.4
## 1214 Obama, Barack OBAMA FOR AMERICA SENIOR WRITER 0.4
## 1323 Obama, Barack OBAMA FOR AMERICA SENIOR WRITER 5.0
## 1573 Obama, Barack OBAMA FOR AMERICA SENIOR WRITER 0.4
## 1582 Obama, Barack OBAMA FOR AMERICA SENIOR WRITER 5.0
## 1648 Obama, Barack OBAMA FOR AMERICA SENIOR WRITER 0.4
## NA <NA> <NA> <NA> NA
## NA.1 <NA> <NA> <NA> NA
## NA.2 <NA> <NA> <NA> NA
## NA.3 <NA> <NA> <NA> NA
OK, so it looks like Toby made a bunch of contributions that had to be taken back or something, since after the first few contributions there is nothing but NAs for the balance of his nearly 300 contributions. For a serious analysis we should probably remove Toby and take a close look at the other top contributors to see whether some of them are in the upper range of contributors for some anomalous reason.

### Occupations

About 11,000 different occupations can’t really be graphed. They would have to be boiled down into a smaller number of categories to be useful. Employers will have too many categories to plot usefully as well.
## [1] 11216
## [1] "factor"
##
## .NET PROGRAMMER (RETIRED SECRETARY)
## 2189 14 1
## *LONG TERM DISABILITY ~ 1ST DEPUTY COMMISSIONER
## 1 1 1
## 'data.frame': 11215 obs. of 2 variables:
## $ Var1: Factor w/ 11215 levels "",".NET PROGRAMMER",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Freq: int 2189 14 1 1 1 1 2 7 1 1 ...
## Var1 Freq
## 1 2189
## 2 .NET PROGRAMMER 14
## 3 (RETIRED SECRETARY) 1
## 4 *LONG TERM DISABILITY 1
## 5 ~ 1
## 6 1ST DEPUTY COMMISSIONER 1
## [1] RETIRED
## [2] ATTORNEY
## [3] HOMEMAKER
## [4] INFORMATION REQUESTED PER BEST EFFORTS
## [5] PHYSICIAN
## [6] TEACHER
## [7] INFORMATION REQUESTED
## [8] PROFESSOR
## [9] CONSULTANT
## [10] SALES
## [11] NOT EMPLOYED
## [12] ENGINEER
## [13] LAWYER
## [14] MANAGER
## [15]
## [16] NONE
## [17] PRESIDENT
## [18] WRITER
## [19] OWNER
## [20] SELF-EMPLOYED
## 11215 Levels: .NET PROGRAMMER (RETIRED SECRETARY) ... ZONE MANAGER
## Var1 Freq
## 1 2189
## 2 .NET PROGRAMMER 14
## 3 (RETIRED SECRETARY) 1
## 4 *LONG TERM DISABILITY 1
## 5 ~ 1
## 6 1ST DEPUTY COMMISSIONER 1
## 7 1ST GRADE TEACHER 2
## 8 270 POLLING ANALYST 7
## 9 2ND LIEUTENANT, ACTIVE DUTY 1
## 10 2ND. AD - FILM/TV/COMMERCIALS 1
## 11 3D ARTIST 1
## 12 3RD GENERATION FAMILY BUSINESS OWNER 1
## 13 3RD GRADE TEACHER 1
## 14 4TH GRADE TEACHER 2
## 15 5TH GRADE DUAL LANGUAGE TEACHER 2
## 16 6TH GRADE TEACHER 7
## 17 7TH GRADE TEACHER 4
## 18 80% MEDICALLY DISABLED VETERAN 1
## 19 911 DISPATCHER 4
## 20 911 SUPERVISOR 6
## Var1 Freq
## 8535 RETIRED 54133
## 795 ATTORNEY 10911
## 4744 HOMEMAKER 7480
## 5028 INFORMATION REQUESTED PER BEST EFFORTS 6573
## 7237 PHYSICIAN 5892
## 10184 TEACHER 5733
## 5026 INFORMATION REQUESTED 5349
## 7726 PROFESSOR 4550
## 2167 CONSULTANT 4349
## 8826 SALES 3197
## 6635 NOT EMPLOYED 3070
## 3525 ENGINEER 2599
## 5535 LAWYER 2542
## 5849 MANAGER 2461
## 1 2189
## 6611 NONE 1931
## 7462 PRESIDENT 1918
## 11147 WRITER 1665
## 6891 OWNER 1582
## 9081 SELF-EMPLOYED 1532
## [1] "data.frame"
Trying to look at all occupations takes forever for R to run and doesn’t tell you anything because the labels all run together and there isn’t enough variation. But I would still like to see the distribution of all occupations on a log scale to see whether the frequency of contributions declines geometrically across occupations taken as a whole. So I want to plot the frequency of contributions by occupation on a logarithmic scale with the labels suppressed. How is that possible?
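One way this might be done, as a sketch rather than the approach actually used here: plot the sorted counts against their rank, so that no level names are ever drawn. The simulated `occ` vector is a hypothetical stand-in for `contbr_occupation`.

```r
# Hypothetical occupation column; the real one has about 11,000 levels.
set.seed(1)
occ <- sample(c("RETIRED", "ATTORNEY", "TEACHER", "WRITER", "OWNER"),
              size = 500, replace = TRUE,
              prob = c(0.6, 0.2, 0.1, 0.07, 0.03))

# Counts per occupation, largest first; as.vector() drops the labels.
occ_freq <- as.vector(sort(table(occ), decreasing = TRUE))

# Rank on the x axis, count on a log y axis; no level names are plotted.
plot(seq_along(occ_freq), occ_freq, log = "y", type = "h",
     xlab = "Occupation rank",
     ylab = "Number of contributions (log scale)")
```

If the decline were geometric, the tops of the bars would fall roughly on a straight line, because a constant ratio between successive counts becomes a constant step on a log axis.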
This is not really the distribution of occupations since this is a factor variable. We are just ordering by the frequencies of the various occupations. Also, the categories come from the respondents themselves, so what we see is really the way people have chosen to classify their occupations. Some people have chosen a very narrow and specific description of their occupation and some have chosen a more general one. The most general description, ‘retired’ (since you can be retired from anything), is not surprisingly the largest category.
Now I want to look at the top 20 employers. This is a factor variable so it should follow the same template I have used for other factors.
## [1] 23176
## [1] "factor"
##
##
## 2181
## FORSYTHE SOLUTIONS GROUP
## 3
## NORTHERN ILLINOIS CONFERENCE UNITED M
## 4
## @ PROPERTIES
## 3
## @ UIC
## 2
## @PROPERTIES
## 32
## 'data.frame': 23175 obs. of 2 variables:
## $ Var1: Factor w/ 23175 levels ""," FORSYTHE SOLUTIONS GROUP",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Freq: int 2181 3 4 3 2 32 2 1 2 1 ...
## Var1 Freq
## 1 2181
## 2 FORSYTHE SOLUTIONS GROUP 3
## 3 NORTHERN ILLINOIS CONFERENCE UNITED M 4
## 4 @ PROPERTIES 3
## 5 @ UIC 2
## 6 @PROPERTIES 32
So I took the factor and made it into a data frame. Now I want to order the employers by frequency and get a chart of the top employers by frequency.
## [1] RETIRED
## [2] SELF-EMPLOYED
## [3] NOT EMPLOYED
## [4] INFORMATION REQUESTED PER BEST EFFORTS
## [5] INFORMATION REQUESTED
## [6] HOMEMAKER
## [7] NONE
## [8] OBAMA FOR AMERICA
## [9]
## [10] UNIVERSITY OF CHICAGO
## [11] NORTHWESTERN UNIVERSITY
## [12] SELF
## [13] UNIVERSITY OF ILLINOIS
## [14] OFA
## [15] STATE OF ILLINOIS
## [16] CHICAGO PUBLIC SCHOOLS
## [17] CITY OF CHICAGO
## [18] UNITED AIRLINES
## [19] AT&T
## [20] DEPAUL UNIVERSITY
## 23175 Levels: ... ZYCUS
## Var1 Freq
## 1 2181
## 2 FORSYTHE SOLUTIONS GROUP 3
## 3 NORTHERN ILLINOIS CONFERENCE UNITED M 4
## 4 @ PROPERTIES 3
## 5 @ UIC 2
## 6 @PROPERTIES 32
## 7 #2 C.I. 2
## 8 ~ 1
## 9 08131952 2
## 10 1030 HUBBARD PLACE 1
## 11 10TH MAGNITUDE LLC 1
## 12 11 COMMUNICATIONS 1
## 13 1154 LILL STUDIO 11
## 14 12 INTERACTIVE 1
## 15 16TH JUDICIAL CIRCUIT OF ILLINOIS 2
## 16 180 PROPERTIES 1
## 17 19TH JUDICIAL CIRCUIT 14
## 18 1ST ADVANTAGE 12
## 19 1ST CIRCUIT PROBATION 48
## 20 1ST MIDWEST BANK 4
## [1] "data.frame"
Actually, there is a lot of interesting stuff in there. The top 20 employers were very revealing. But there is no way to do a good graph from this without a lot of work, I should think.

### Cities

While the cities have no intrinsic ordering, any more than the other factor variables do, we can impose an order and plot by that order to get a better idea of the distribution of the cities. We can then see if all cities are more or less equal in the proportion of contributions that come from them or if the distribution is skewed toward a subset of cities.
## [1] "factor"
##
## 123 E. CO BLACKTOP 1308 N ASTOR ST C
## 1 2 1
## 1308 N ASTOR ST CHICAGO 60154-5908 ABBOTT PARK
## 1 2 1
## 'data.frame': 1402 obs. of 2 variables:
## $ Var1: Factor w/ 1402 levels "","123 E. CO BLACKTOP",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Freq: int 1 2 1 1 2 1 3 15 1 245 ...
## Var1 Freq
## 1 1
## 2 123 E. CO BLACKTOP 2
## 3 1308 N ASTOR ST C 1
## 4 1308 N ASTOR ST CHICAGO 1
## 5 60154-5908 2
## 6 ABBOTT PARK 1
So I took the factor and made it into a dataframe. Now I want to order the cities by frequency and get a chart of the top cities by frequency.
## [1] CHICAGO EVANSTON OAK PARK
## [4] NAPERVILLE SPRINGFIELD CHAMPAIGN
## [7] WILMETTE HIGHLAND PARK BLOOMINGTON
## [10] LAKE FOREST PEORIA GLENVIEW
## [13] WINNETKA WHEATON NORTHBROOK
## [16] ROCKFORD AURORA ARLINGTON HEIGHTS
## [19] DOWNERS GROVE HINSDALE
## 1402 Levels: 123 E. CO BLACKTOP ... ZION
## Var1 Freq
## 1 1
## 2 123 E. CO BLACKTOP 2
## 3 1308 N ASTOR ST C 1
## 4 1308 N ASTOR ST CHICAGO 1
## 5 60154-5908 2
## 6 ABBOTT PARK 1
## 7 ABINGDON 3
## 8 ADAIR 15
## 9 ADDIEVILLE 1
## 10 ADDISON 245
## 11 ALBANY 51
## 12 ALBERS 10
## 13 ALBION 17
## 14 ALEDO 36
## 15 ALEXIS 12
## 16 ALGONQUIN 432
## 17 ALLENDALE 3
## 18 ALMA 5
## 19 ALPHA 7
## 20 ALSIP 101
## [1] "data.frame"
### Zip Codes

The long zip codes are not much use to us so I will shorten them to 5 digits. The zip codes are also unordered factors: even though they are numbers, the order of the zip codes has no real intrinsic meaning within a state. We do know that some zip codes in the data fall outside the state of Illinois, whose zip codes lie between 60000 and 63000.
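A sketch of the shortening and the range check, under assumptions: the raw values are text (or can be coerced to text), `zip_short` is the name that later shows up in the column list, and the 60000 to 63000 bounds are the approximate ones quoted in the text. The example values are hypothetical.

```r
# Hypothetical raw values: ZIP+4, plain 5-digit, too short, too long, missing.
zip_raw <- c("606472810", "60618", "1267", "100000", NA)

# Keep the first five characters, then convert to numbers for range checks.
zip_short <- substr(as.character(zip_raw), 1, 5)
zip_num <- as.numeric(zip_short)

# Flag values inside the approximate Illinois range; NA counts as outside.
in_illinois <- !is.na(zip_num) & zip_num >= 60000 & zip_num < 63000

data.frame(zip_raw, zip_short, in_illinois)
```

Keeping the flag as its own column makes it easy to exclude the out-of-range rows only from the analyses that actually use zip codes.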
## [1] "character"
## [1] 60618 60640 60645 60521 60422 60586
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1267 60190 60600 60670 60650 100000 30
We could go further and get the top zip codes but I think I will leave the zip codes for the map program.
There is one more thing to do, which is to use visuals to check for outliers and mistakes. There seem to be some outliers, but few compared to the values that are coded correctly, that is, inside the expected range for Illinois zip codes.
Now we see that there are some values below the correct range. I will exclude these values from the analyses that use zip codes. It would be a big job to go in and repair them unless I could figure out a way to do that programmatically. I suppose I could get the city and streets and narrow down the possible zip codes that way, but at some point it would have to be done by hand.
## [1] Paul, Ron Obama, Barack Obama, Barack Obama, Barack Obama, Barack
## [6] Obama, Barack
## 14 Levels: Bachmann, Michele Cain, Herman Gingrich, Newt ... Stein, Jill
## [1] "integer"
## [1] "factor"
Massive lead for Barack Obama. No real surprise in his home state. It seems like the number of contributions might even be nearly equal if we combined all the Republican contributions together. Still, since Obama did not face a primary it is a little incongruous.
## [1] 25 1000 500 500 200 25
## [1] "double"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25000 25 50 455 112 16390000
So a roughly normal distribution shows up by examining the data on a log scale.
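A minimal sketch of the log-scale view, assuming that zero and negative amounts (refunds, as the minimum of -25000 in the summary suggests) are dropped first, since the logarithm is undefined for them. The amounts below are hypothetical.

```r
# Hypothetical amounts, including a refund (negative) and a zero.
amt <- c(-25, 0, 5, 25, 25, 50, 100, 250, 1000, 2500)

# log10() is undefined for zero and negative values, so keep positives only.
amt_pos <- amt[amt > 0]

# Histogram of log10(amount): each unit on the x axis is a factor of ten.
h <- hist(log10(amt_pos), breaks = 8,
          main = "Contribution amounts",
          xlab = "log10(amount in dollars)")
```

On this scale 1 corresponds to 10 dollars, 2 to 100 dollars and 3 to 1000 dollars, which is why a heavily right-skewed variable can look roughly bell-shaped.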
Create Party variable first.
## [1] "cmte_id" "cand_id" "cand_nm"
## [4] "contbr_nm" "contbr_city" "contbr_st"
## [7] "contbr_zip" "contbr_employer" "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt" "receipt_desc"
## [13] "memo_cd" "memo_text" "form_tp"
## [16] "file_num" "tran_id" "election_tp"
## [19] "date_con" "gen" "gen.f"
## [22] "party" "date_con.n" "primary_day.n"
## [25] "zip_short"
## [1] Republican Democrat Democrat Democrat Democrat
## Levels: Democrat Republican Socialist
## [1] Republican Democrat Democrat Democrat Democrat Democrat
## Levels: Democrat Republican Socialist
##
## Democrat Republican Socialist
## 201064 66538 55
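A party variable like the one tabulated above could be built with a named lookup vector. This is a sketch, not the original code: only three candidates are mapped, and `party_lookup` is a hypothetical name.

```r
# Candidate names as a factor, as in cand_nm.
cand_nm <- factor(c("Paul, Ron", "Obama, Barack", "Romney, Mitt", "Obama, Barack"))

# Named vector used as a lookup table. Only three candidates are shown;
# the remaining candidates would need entries of their own.
party_lookup <- c("Obama, Barack" = "Democrat",
                  "Paul, Ron"     = "Republican",
                  "Romney, Mitt"  = "Republican")

# Index by name. as.character() matters: indexing with a factor directly
# would use its integer codes instead of its labels.
party <- factor(unname(party_lookup[as.character(cand_nm)]))

table(party)
```

Any candidate missing from the lookup comes back as `NA`, which makes omissions easy to spot with `table(party, useNA = "ifany")`.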
Now use it to evaluate candidate contributions.
Now look at the vote totals
## $`Paul, Ron`
## [1] 86605
##
## $`Obama, Barack`
## [1] NA
##
## $`Obama, Barack`
## [1] NA
##
## $`Obama, Barack`
## [1] NA
##
## $`Obama, Barack`
## [1] NA
##
## $`Obama, Barack`
## [1] NA
## [1] 86605 NA NA NA NA NA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 433700 433700 381800 433700 433700 201119
The first thing we do is the big correlation analysis, throwing everything in and seeing what is related to what. Correlation is best suited to numeric variables and we don’t have many of those. The counts of contributions and the amounts are the most obvious variables to test for correlation, but it is not clear how to do that. We would have to group the two variables by a third for the idea of correlation to become meaningful. That is, we would have to bin or group the observations together, say by candidate, and ask whether candidates that get a lot of contributions also get a lot of money. But that would be a question for later, multivariate analysis.
Taking the variables as they are, we could ask whether amounts, party (Republican or Democrat), pre or post primary, date (later or earlier, ending at the general election) or votes (in the primary at least) correlate with one another. Still, these variables aren’t meaningful on the same time scales, so I don’t think it is appropriate to just throw them all into a big correlation matrix and see what shakes out. Or at least it is not very informative. But just to say that we have done it, here goes.
## 'data.frame': 267602 obs. of 5 variables:
## $ contb_receipt_amt: num 25 1000 500 500 200 25 100 250 100 250 ...
## $ date_con : num 15314 15247 15244 15247 15196 ...
## $ gen : num 0 0 0 0 0 0 0 0 0 0 ...
## $ party : num 1 0 0 0 0 0 0 0 0 0 ...
## $ date_con.n : num 15314 15247 15244 15247 15196 ...
## contb_receipt_amt date_con gen party date_con.n
## contb_receipt_amt 1 0.00 0.00 0.00 0.00
## date_con 0 1.00 0.84 -0.04 1.00
## gen 0 0.84 1.00 -0.07 0.84
## party 0 -0.04 -0.07 1.00 -0.04
## date_con.n 0 1.00 0.84 -0.04 1.00
There just aren’t enough straightforward numeric variables for a bunch of correlations to make sense.
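For reference, a matrix of this kind can be produced with `cor()` on the numeric columns. The toy values below are hypothetical; the 0/1 coding of `party` (1 for Republican) follows what the `str()` output above suggests.

```r
# Hypothetical numeric columns mirroring the ones in the matrix above.
toy <- data.frame(
  contb_receipt_amt = c(25, 1000, 500, 200, 50, 100),
  date_con.n        = c(15210, 15236, 15245, 15430, 15500, 15600),
  gen               = c(0, 0, 0, 1, 1, 1),
  party             = c(1, 0, 0, 0, 1, 0)   # 1 = Republican, 0 = Democrat
)

# Pearson correlations on complete rows, rounded to two decimals.
cor_mat <- round(cor(toy, use = "complete.obs"), 2)
cor_mat
```

In the matrix above, `date_con` and `date_con.n` are the same quantity stored twice, which is why they correlate at exactly 1.00 with each other; the sketch keeps only one of them.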
First I am going to investigate the relationship between the amount of money donated and the other variables of interest in the data set. I will use the amount of money donated as the dependent variable.
I want to see a bit more in the contribution amount histograms. There is so much space between the amounts that it sort of distorts the central message of the data, which is that there are different tiers of contributions. I think what I want to show in this chart is the spread of the distribution. The important fact about a contribution is its general size, not the fact that there are a lot of contributions of 5000 dollars and very few of 4200. So, I think a couple of histograms with really small numbers of bins might be interesting.
So I didn’t get much out of the pre-post primary distinction on the first cut. Let’s look at Republicans and Democrats.

### Republicans vs. Democrats
So it looks like Republican contributions could be larger on average than Democratic contributions, but that Democrats have more outliers. Let’s look more closely at the structure of contribution amounts by putting the Republican and Democratic histograms in a grid. Also, I am going to make a data set that is limited to individual contributions by subsetting the data to Democrats and Republicans and to contributions above 0 and below 5000.
I am choosing 5000 as the limit instead of 3000 because, although the election laws say that 2700 is the limit for individual contributions, groups can contribute up to 5000. I have seen that there are a lot of contributions that are over 2700 and I don’t know why.
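The subsetting just described can be sketched in base R as follows. The toy rows are hypothetical, and treating both bounds as strict (above 0, below 5000) is a reading of the text rather than a confirmed detail.

```r
# Hypothetical slice of the data with the two columns the filter needs.
ill_toy <- data.frame(
  party = factor(c("Democrat", "Republican", "Socialist",
                   "Democrat", "Republican")),
  contb_receipt_amt = c(50, 2500, 250, -100, 16000)
)

# Keep the two major parties and amounts strictly between 0 and 5000.
keep <- ill_toy$party %in% c("Democrat", "Republican") &
  ill_toy$contb_receipt_amt > 0 &
  ill_toy$contb_receipt_amt < 5000

# droplevels() removes the now-empty "Socialist" level from the factor.
ill_ind <- droplevels(ill_toy[keep, ])
ill_ind
```

Dropping the unused level keeps empty categories out of later tables and plot legends.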
## [1] 371
## OBAMA VICTORY FUND 2012 - UNITEMIZED FREIDHEIM, CYRUS F. MR. JR.
## 21 6
## LINCOLN PAC BRINCAT, JEFF MR.
## 4 3
## RAITT, JOHN R. MR. RAUSCHERT, KARL A. MR.
## 3 3
## RICE, YVONNE MRS. ABROE, MARY J. MRS.
## 3 2
## ADRIDGE, KENNETH MR. ALLEN, MARCI MRS.
## 2 2
## ALLEN, ROGER ALLEN, RUSSELL L. MR.
## 2 2
## ALLEN, SHELLEY MRS. ALLGOOD, JASON MR.
## 2 2
## ANDERSON, CARL P. MR. B&R SUPPLY GROUP LLC
## 2 2
## BAXMEYER, K. MR. BLAKLEY, JAMES
## 2 2
## BLOCK, JOHN G. MR. BUFORD, ROBERT
## 2 2
## BURTIS, ERIK MR. BURTIS, MELISSA G. MRS.
## 2 2
## C.N.A. CITIZENS FOR GOOD GOVERNMENT CANNING, JOHN A. MR. JR.
## 2 2
## CAVANAGH, WILLIAM CHAMNESS, JOHANNA MRS.
## 2 2
## CHAMNESS, MICHAEL MR. CHANG, JAE MR.
## 2 2
## CROFT, BARBARA CUBITT, GEOFF
## 2 2
## DIXON, SCOTT MR. DIXTON, GRANT M. MR.
## 2 2
## DRAFT, HOWARD MR. DUNCAN, BRUCE
## 2 2
## FESMIRE, ROBERT H. MR. FESSLER, CAROL A. MS.
## 2 2
## FITCH, DENNIS C. MR. FRYE, MEGAN MS.
## 2 2
## FULTON, THOMAS M. MR. GETCO PAC
## 2 2
## GREEN, JEFFREY MR. GRISHAM, LARRY MR.
## 2 2
## HAGENBUCH, LEROY G. MR. HANSON, TERRY J. MR.
## 2 2
## HERRING, KERRY MR. HINES, CHARLES L. MR.
## 2 2
## HOFFMAN, MARY JO MS. HORNER, JANICE E. MRS.
## 2 2
## HUSMAN, MICHAEL ISRINGHAUSEN, GEOFFREY MR.
## 2 2
## ISRINGHAUSEN, SUSAN J. MRS. JASPER, PAUL T. MR.
## 2 2
## KANE, DIANE MRS. KATSIAVELOS, HARRY MR.
## 2 2
## KUGLER, LARRY MR. KUNZ, PETER MR.
## 2 2
## LEWIS, ALISHA MRS. LEWIS, JAMIE MRS.
## 2 2
## LEWIS, TIMOTHY R. MR. JR. LEWIS, WILLIAM MR.
## 2 2
## MASON, DAVID MR. MASSEY, E. DAVISON MR.
## 2 2
## MAZZETTA, THOMAS J. MR. MILLER, JOSEPH G. MR. JR.
## 2 2
## MILLER, PAUL MR. MILLER, PHYLLIS MRS.
## 2 2
## MILLIGAN, DAVID MORGAN, MICHAEL MR.
## 2 2
## MORRIS, JOE MR. NORD, CONNIE MRS.
## 2 2
## PEACOCK, HENRY STAFFORD MR. RAKOW, THOMAS MR.
## 2 2
## RAUNER, BRUCE VINCENT MR. RICE, DEBBIE MRS.
## 2 2
## RICE, EDDIE MR. RICE, TERRY MR.
## 2 2
## ROBERTS, SCOTT RUZICH, RICHARD
## 2 2
## SAIA, ALBERT S. MR. SAPIENTE, JOHN MR.
## 2 2
## SAVAIANO, MAUREEN R. MRS. SHAFFER, JOHN E. MR.
## 2 2
## SHAPIRO, DAN MR. SHAPIRO, NATHAN MR.
## 2 2
## SHAPIRO, STEVEN MR. SILVERMAN, MORRIS MR.
## 2 2
## SILVERMAN, PEARL MRS. SKELLY, LINDA
## 2 2
## SMITH, MICHAEL G. MR. SMITHBURG, WILLIAM D. MR.
## 2 2
## SPECTOR, MARCY L. MRS. SPRENZEL, MATTHEW MR.
## 2 2
## STAFFORD, ROBERT STEIN, EVAN
## 2 2
## STORTO, KELLY MS. STORTO, TRESA MS.
## 2 2
## THE DUCHOSSOIS GROUP PAC THOMAS, MARK
## 2 2
## TIMBERLAKE, JIM (Other)
## 2 144
I don’t know. After a cursory inspection of the names of contributors, there seem to be mostly individual names on the list. There are a lot of names that have 2 contributions. Perhaps they are contributing as husband and wife? There were a few institutions such as Obama for America and a couple of lobbying groups, but since most of the names were individuals I will keep them in for now.
When we make the scale of y logarithmic we see a lot more information in the lower end of the distribution. There are a lot more outliers in the Democratic contribution amounts at both the upper and lower end of the scale of individual contributions. But with the violin plots the distribution of amounts among the Republicans seems to be about as broad as that among the Democrats.
Now I will compare the parties by displaying them side by side. I am going to limit the contributions to those under 5000 to capture individual contributions, and I will exclude the Socialist since they did not play much of a role in the general election. I will call the data set ill_ind, for Illinois Individuals. What this graph makes me want to see is a comparison of the Republican and Democratic totals, along with stacked bars showing how much of each party’s money came from small versus large donors.
So even though there were about four times as many contributions to the Democratic candidate the Republicans raised almost as much money. I would like to find a way to combine this information in a single chart.
## [1] 26170642 26034372
## # A tibble: 2 × 5
## party mean median total n
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 Democrat 131.5914 50 26170642 198878
## 2 Republican 398.6826 100 26034372 65301
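A summary table of this shape can also be built without any packages. The sketch below uses hypothetical amounts chosen so that one party has four times as many contributions but the same total, echoing the pattern in the text; `amt` is a shortened stand-in for `contb_receipt_amt`.

```r
# Hypothetical contributions: four small ones versus one large one.
toy <- data.frame(
  party = c("Democrat", "Democrat", "Democrat", "Democrat", "Republican"),
  amt   = c(25, 50, 50, 75, 200)
)

# One row per party: mean, median, total and number of contributions.
by_party <- do.call(rbind, lapply(split(toy$amt, toy$party), function(x) {
  data.frame(mean = mean(x), median = median(x), total = sum(x), n = length(x))
}))
by_party
```

Seeing `n`, `total` and `mean` side by side is what shows how a party with far fewer contributions can still raise about as much money.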
To do what I want to do I am going to have to group by party. I want to take the parties, get the sum of their total contributions, graph that as bars, and then make the fill of each bar a stacked histogram of the amounts, with the amounts on some sort of scale, such as the higher the amount the darker the color.
Every contribution is the same height no matter how large or small it is, but the color, hue or alpha of the contribution should change. A clear pattern emerges: the Republicans get a larger proportion from donors making 1000 or 2500 dollar donations and a smaller proportion from those contributing amounts under $200 or so.
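One way to get at the same comparison, sketched under assumptions: bin the amounts into tiers with `cut()` and stack each tier's share of the party's dollars. The tier break points and the toy amounts are illustrative choices, not values taken from the analysis.

```r
# Hypothetical contributions for the two parties.
toy <- data.frame(
  party = c("Democrat", "Democrat", "Democrat", "Republican", "Republican"),
  amt   = c(25, 50, 1000, 100, 2500)
)

# Bin amounts into tiers; right = FALSE makes intervals like [0, 200).
toy$tier <- cut(toy$amt, breaks = c(0, 200, 1000, 5000),
                labels = c("under 200", "200 to 999", "1000 and up"),
                right = FALSE)

# Dollars per tier and party, then each tier's share of the party total.
dollars <- xtabs(amt ~ tier + party, data = toy)
share <- prop.table(dollars, margin = 2)
round(share, 2)

# Stacked bars: one bar per party, darker segments for larger tiers.
barplot(share, col = gray(c(0.8, 0.5, 0.2)),
        legend.text = rownames(share),
        ylab = "Share of party total")
```

Plotting shares rather than raw dollars puts both parties on the same 0 to 1 scale, so the difference in where the money comes from is visible even when the totals are close.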
For the last two histograms, I am trying to get a comparison of the counts of contributions at various amounts by party. It is confusing for a couple of reasons. First of all, the Republicans appear to have more contributions at both the higher and lower levels. Moreover, in the last graph there are what appear to be negative values in the histogram.
I am going to look at counts of contributions by time and party (is that a bivariate or a multivariate?). So this is nice. You see that the Socialist contributions are more evenly spread out and evenly divided between the general and primary election. Since they don’t typically have any real prospect of winning the general election, it may be that there is no particularly greater impulse to contribute during the general election. Comparing the Republicans and Democrats directly…
We can see that the Democratic contributions are more ‘spiky’, showing more ups and downs within a given period. I can’t really say why that would be the case.
First, since I am not looking at party anymore I am going to put the Socialists (actually, the Socialist) back in. Now we look at the count of contributions versus the amount of contributions, or the average amount of contributions. This is really a new variable, namely, the average. We will make the average contribution into a variable in itself and see how it is distributed over various other variables. Of course, the mean doesn’t have any actual meaning independent of the count of things you are averaging over. So to get means we have to use some sort of grouping function. The most basic way to look at the mean contribution size is to look at it for the whole data set. From there you can look at how the mean changes for all the factor variables in the data set. I am going to work through all the variables and examine graphically how the distribution of the mean changes when the data is grouped by party, candidate, before and after the primary, months, days, occupation (not a very good variable, since people code it themselves, so the variation is driven more by how people choose to describe their occupation than by actual differences in occupations), employers and zip codes. This is going to require a lot of use of the dplyr functions. It will be good practice.
## [1] 454.7985
## # A tibble: 3 × 4
## party mean median n
## <fctr> <dbl> <dbl> <int>
## 1 Democrat 476.2116 50 201064
## 2 Republican 390.2304 100 66538
## 3 Socialist 288.2951 250 55
So we have an interesting set of numbers. First we can do the comparisons by single bar graphs. This is pretty interesting. I think this would be clearer if we could present these as distributions with those overlapping density thingies. Then we could plot the means and medians as vertical lines.
Now I want to add the mean and median as vertical lines going from y=0 to the top of the relevant density curve. Then I would like to reassign the colors, making the Democrats blue and the Republicans red to conform to the current convention. Contributions are the continuous variable and the parties are the categorical variable.
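A base R sketch of that idea, under assumptions: the amounts are simulated, the densities are drawn on the log10 scale, and the lines mark the mean and median of the log10 amounts (not the log of the raw mean). `height_at` is a hypothetical helper that reads the curve height at a given x by interpolation.

```r
set.seed(42)
# Hypothetical positive amounts for the two parties (illustration only).
dem <- rlnorm(300, meanlog = log(50),  sdlog = 1)
gop <- rlnorm(300, meanlog = log(100), sdlog = 1)

# Kernel densities of the log10 amounts.
d_dem <- density(log10(dem))
d_gop <- density(log10(gop))

# Height of a density curve at x, by linear interpolation between grid points.
height_at <- function(d, x) approx(d$x, d$y, xout = x)$y

plot(d_dem, col = "blue", main = "Contribution amounts by party",
     xlab = "log10(amount)", xlim = range(d_dem$x, d_gop$x),
     ylim = c(0, max(d_dem$y, d_gop$y)))
lines(d_gop, col = "red")

# Segments from y = 0 up to the curve: solid = mean, dashed = median.
for (p in list(list(x = log10(dem), d = d_dem, col = "blue"),
               list(x = log10(gop), d = d_gop, col = "red"))) {
  segments(mean(p$x), 0, mean(p$x), height_at(p$d, mean(p$x)), col = p$col)
  segments(median(p$x), 0, median(p$x), height_at(p$d, median(p$x)),
           col = p$col, lty = 2)
}
```

Stopping each line at the curve, rather than running it the full height of the plot, keeps it visually attached to the party it belongs to.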
First I make a basic plot of the contribution amounts with the fill differing by party. Now I try it with a density plot.
Now I want to compare densities with the Republicans and Democrats data.
Now I can do the same thing with boxplots.
### Contributions and Votes

Finally, we look at contributions and votes. First we see how many votes each candidate got in the primary. That is, we group the data by candidate.
# Bivariate Analysis
Here I am going to do the amounts by count separately for each candidate and map the size of the dot to the candidate’s vote total in the primary. This will only be for the Republicans. I may do a separate one for the general election but that will have to wait. Also, since there are only two candidates in the general election, the dot plot format would probably not be very informative. Maybe I could add the date when they dropped out, as a way to get in some information about the candidates that didn’t make it to the primary election?

## Other multivariate plots

There are a few other relationships I would like to explore. I would like to look at how the amount of contributions varies by party, candidate and votes. Let’s look at the relationship between votes and campaign contributions for the two contestants in the general election, Mitt Romney and Barack Obama. What can we do? We can look at the data with the dates, the pre and post primary election periods and, well, what else? We can’t look at party, since there is no way to separate out the effects of party when you have two parties and two candidates. There are no degrees of freedom.
First we have to create a month variable. Over the two-year period it takes up to 24 values, since each calendar month occurs twice.
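A sketch of how month, year and a combined key might be derived from the contribution date. The `month`, `year` and `y_m` names match the columns in the output below; `month_idx` and its 1 to 24 numbering are assumptions added for illustration.

```r
# Hypothetical contribution dates spanning the two-year period.
date_con <- as.Date(c("2011-01-15", "2011-03-02", "2012-03-20", "2012-11-05"))

# Two-digit month and four-digit year as character columns.
month <- format(date_con, "%m")
year  <- format(date_con, "%Y")

# Combined key such as "2011_01"; as text it sorts chronologically.
y_m <- paste(year, month, sep = "_")

# Running month index, 1 to 24 over 2011 and 2012 (numbering is an assumption).
month_idx <- (as.integer(year) - 2011L) * 12L + as.integer(month)

data.frame(date_con, month, year, y_m, month_idx)
```

Grouping on `y_m` rather than on `month` alone keeps January 2011 and January 2012 from being pooled together.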
## Source: local data frame [192 x 8]
## Groups: month, year, party [57]
##
## month year party cand_nm contb avg_contb
## <chr> <chr> <fctr> <fctr> <dbl> <dbl>
## 1 01 2011 Republican Cain, Herman 250 250.0000
## 2 02 2011 Republican Cain, Herman 250 250.0000
## 3 03 2011 Republican Cain, Herman 5000 2500.0000
## 4 03 2011 Republican Gingrich, Newt 7200 720.0000
## 5 03 2011 Republican Pawlenty, Timothy 5750 1437.5000
## 6 03 2011 Republican Roemer, Charles E. 'Buddy' III 255 51.0000
## 7 04 2011 Democrat Obama, Barack 903414 755.9950
## 8 04 2011 Republican Cain, Herman 500 250.0000
## 9 04 2011 Republican Gingrich, Newt 8675 619.6429
## 10 04 2011 Republican Paul, Ron 250 250.0000
## # ... with 182 more rows, and 2 more variables: n <int>, y_m <chr>
Now we can see which party outraised which by month for the two year period.
## [1] "month" "year" "party" "cand_nm" "contb" "avg_contb"
## [7] "n" "y_m"
## Source: local data frame [57 x 5]
## Groups: y_m [?]
##
## y_m party contb avg n
## <chr> <fctr> <dbl> <dbl> <int>
## 1 2011_01 Republican 250.0 250.0 1
## 2 2011_02 Republican 250.0 250.0 1
## 3 2011_03 Republican 18205.0 18205.0 4
## 4 2011_04 Democrat 903414.0 903414.0 1
## 5 2011_04 Republican 79575.0 79575.0 6
## 6 2011_05 Democrat 196260.0 196260.0 1
## 7 2011_05 Republican 523750.3 523750.3 7
## 8 2011_06 Democrat 495531.9 495531.9 1
## 9 2011_06 Republican 622169.3 622169.3 10
## 10 2011_07 Democrat 373218.9 373218.9 1
## # ... with 47 more rows
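The monthly party totals can be sketched in base R as follows, using the six candidate-level rows for January through March 2011 printed earlier. Note that in the table above the `avg` column equals `contb` in every visible row, which is what one would see if the mean were taken over a single already-summed value. The sketch instead takes the mean over the candidate totals in each cell, which is only one possible reading of what was intended.

```r
# Candidate-level monthly totals copied from the first six rows printed earlier
monthly <- data.frame(
  y_m   = c("2011_01", "2011_02", "2011_03", "2011_03", "2011_03", "2011_03"),
  party = rep("Republican", 6),
  contb = c(250, 250, 5000, 7200, 5750, 255),
  stringsAsFactors = FALSE
)

# Sum, mean, and count of the candidate totals within each month/party cell
by_party <- aggregate(
  contb ~ y_m + party,
  data = monthly,
  FUN  = function(x) c(total = sum(x), avg = mean(x), n = length(x))
)

# aggregate() returns a matrix column here; flatten it into ordinary columns
by_party <- do.call(data.frame, by_party)
names(by_party) <- c("y_m", "party", "contb", "avg", "n")
by_party
```

For March 2011 the total of 18205 matches the printed table, while the mean over the four candidates differs from the printed `avg`.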
Well, I am not entirely sure what is going on here with the row of dots along the bottom. They are colored as Republican, but I don’t see how I would get two sets of Republican observations here.
Now that I have votes per candidate I think it would be interesting to compare the amount collected by the candidates to the number of votes they got. So I am going to subset the data and get the candidates that survived till the primary and divide the number of votes they received by the amount of money they collected.
## [1] 73993 86605 5541 3704 433700 325488
## [1] 0.22836100 0.13416782 0.01421918 0.39909493 0.01841580 0.93320332
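The two vectors above are the vote counts and the votes-per-dollar ratios. A minimal sketch of that division, using the vote counts above together with the per-candidate totals reported in this analysis, might look like this. The data frame name `grp` and the `dollars_per_vote` column are made up for the example.

```r
# Vote counts and money raised per candidate, taken from the figures in this report
grp <- data.frame(
  cand_nm = c("Gingrich, Newt", "Paul, Ron", "Perry, Rick",
              "Roemer, Charles E. 'Buddy' III", "Romney, Mitt", "Santorum, Rick"),
  votes = c(73993, 86605, 5541, 3704, 433700, 325488),
  total_spent = c(324017.7, 645497.6, 389685.0, 9281.0, 23550435.2, 348785.7),
  stringsAsFactors = FALSE
)

# Votes received per dollar raised (the ratio printed above)
grp$price_vote <- grp$votes / grp$total_spent

# The inverse, dollars raised per vote, is sometimes easier to read
grp$dollars_per_vote <- grp$total_spent / grp$votes

grp[order(-grp$price_vote), c("cand_nm", "price_vote", "dollars_per_vote")]
```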
Now I want to make a bar chart from this. Buddy Roemer’s name is causing a lot more trouble than it is worth so I am going to shorten it to ‘Roemer, Charles’, to make it fit better with the rest of the data.
## [1] "cand_nm" "total_spent" "n" "votes" "price_vote"
## [1] Gingrich, Newt Paul, Ron
## [3] Perry, Rick Roemer, Charles E. 'Buddy' III
## [5] Romney, Mitt Santorum, Rick
## 14 Levels: Bachmann, Michele Cain, Herman Gingrich, Newt ... Stein, Jill
## [1] "Bachmann, Michele" "Cain, Herman" "Gingrich, Newt"
## [4] "Huntsman, Jon" "Johnson, Gary Earl" "McCotter, Thaddeus G"
## [7] "Obama, Barack" "Paul, Ron" "Pawlenty, Timothy"
## [10] "Perry, Rick" "Roemer, Charles" "Romney, Mitt"
## [13] "Santorum, Rick" "Stein, Jill"
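Because `cand_nm` is a factor, the shortening is done on its levels rather than on individual values. A minimal sketch on a toy factor (the three-element vector is made up for the example):

```r
# Toy factor standing in for the cand_nm column
cand <- factor(c("Romney, Mitt", "Roemer, Charles E. 'Buddy' III", "Paul, Ron"))

# Replace the long level with the shorter label; every matching value changes at once
levels(cand)[levels(cand) == "Roemer, Charles E. 'Buddy' III"] <- "Roemer, Charles"

levels(cand)
```

Assigning a new string directly to elements of a factor would produce `NA` unless that string is already a level, which is why the relabeling goes through `levels()`.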
Let’s do it again.
# keep only the candidates that received at least one vote
ill_vote_gettersX = subset(ill, votes > 0)
# group by candidate; votes is a per-candidate value, so min() just recovers it
ill_vote_gettersX_grp = ill_vote_gettersX %>%
  dplyr::group_by(cand_nm) %>%
  dplyr::summarise(votesX = min(votes),
                   total_spent = sum(contb_receipt_amt),
                   n = n())
ill_vote_gettersX_grp
## # A tibble: 6 × 4
## cand_nm votesX total_spent n
## <fctr> <dbl> <dbl> <int>
## 1 Gingrich, Newt 73993 324017.7 1440
## 2 Paul, Ron 86605 645497.6 5132
## 3 Perry, Rick 5541 389685.0 265
## 4 Roemer, Charles E. 'Buddy' III 3704 9281.0 171
## 5 Romney, Mitt 433700 23550435.2 55911
## 6 Santorum, Rick 325488 348785.7 1845
ill_vote_gettersX_grp$productivity = ill_vote_gettersX_grp$votesX/ill_vote_gettersX_grp$total_spent
# do the chart again, this time from the regrouped data and its productivity column
ggplot(ill_vote_gettersX_grp, aes(x = cand_nm, y = productivity)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  ggtitle("Votes Per Dollar in the Illinois Republican Primary") +
  ylab("votes received/money raised") +
  xlab("") +
  theme(axis.text.x = element_text(hjust = 0.5, family = "Didot", color = "blue"),
        axis.text.y = element_text(color = "steelblue"))
So on this evidence Romney appears to be a rather ‘inefficient’ candidate, collecting only about 0.02 votes per dollar raised, or roughly $54 per vote. But this compares all of the money Romney raised throughout the entire cycle with the money raised by the other candidates up until the primary election, after which most of them presumably stopped fundraising altogether. We should further subset the data to money raised before the primary election.
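A minimal sketch of that further subsetting, assuming a contribution date column that has already been converted to `Date`. The column name `date` and the four rows are made up for the example; the Illinois primary was held on March 20, 2012.

```r
# Date of the 2012 Illinois presidential primary
primary_date <- as.Date("2012-03-20")

# Toy contributions; the 'date' column name and all values are hypothetical
contribs <- data.frame(
  cand_nm = c("Romney, Mitt", "Romney, Mitt", "Santorum, Rick", "Santorum, Rick"),
  contb_receipt_amt = c(500, 2500, 100, 250),
  date = as.Date(c("2012-02-01", "2012-09-15", "2012-03-01", "2012-03-19")),
  stringsAsFactors = FALSE
)

# Keep only the money raised before the primary
pre <- subset(contribs, date < primary_date)

# Per-candidate totals restricted to the pre-primary period
pre_totals <- aggregate(contb_receipt_amt ~ cand_nm, data = pre, FUN = sum)
pre_totals
```

Dividing each candidate's vote count by these pre-primary totals would put all candidates on the same footing.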
Now I am going to finish my initial round of univariate charts. I am going to look at zip codes. These are going to be made using the choroplethr package along with ggplot.
## [1] "region" "total_population" "percent_white"
## [4] "percent_black" "percent_asian" "percent_hispanic"
## [7] "per_capita_income" "median_rent" "median_age"
## [1] 1255 4
## [1] "zip_short" "total_amt" "mean_amt" "n"
## [1] "character"
## [1] "character"
## region total_population percent_white percent_black percent_asian
## 1 60002 24250 88 2 3
## 2 60004 49957 81 2 8
## 3 60005 30931 76 2 8
## 4 60007 33973 77 1 9
## 5 60008 22302 61 4 8
## 6 60010 44031 84 0 8
## percent_hispanic per_capita_income median_rent median_age total_amt
## 1 5 33622 761 41.2 35559.70
## 2 8 40134 1102 41.8 190235.38
## 3 12 37387 953 39.8 130670.37
## 4 11 33540 921 43.5 105911.80
## 5 26 29141 1030 36.2 41905.75
## 6 6 65973 1173 46.3 1090630.04
## mean_amt n
## 1 85.48005 416
## 2 147.92798 1286
## 3 145.67488 897
## 4 216.58855 489
## 5 119.73071 350
## 6 339.97196 3208
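The merged table above joins zip-level demographics to zip-level contribution summaries. A minimal sketch of that join in base R, using a few of the printed values, might look like this. The raw zip strings and the object names `zip_summary`, `demo` and `merged` are made up for the example.

```r
# Hypothetical raw zip values; ZIP+4 codes are cut back to the first five digits
raw_zip   <- c("600021234", "60004")
zip_short <- substr(raw_zip, 1, 5)

# Contribution summaries by five-digit zip (values copied from the printed table)
zip_summary <- data.frame(
  zip_short = zip_short,
  total_amt = c(35559.70, 190235.38),
  mean_amt  = c(85.48005, 147.92798),
  n         = c(416L, 1286L),
  stringsAsFactors = FALSE
)

# Demographics keyed by 'region'; both keys must be character for the join to match
demo <- data.frame(
  region = c("60002", "60004", "60005"),
  per_capita_income = c(33622, 40134, 37387),
  stringsAsFactors = FALSE
)

# Left join: zips with no contributions keep their demographics and get NA amounts
merged <- merge(demo, zip_summary, by.x = "region", by.y = "zip_short", all.x = TRUE)
merged
```

Here 60005 is deliberately left out of `zip_summary` to show how unmatched regions end up as `NA`.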
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text_repel).
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 62532, 60959, 61232, 62885, 62063, 61852, 61928, 62079, 62539,
## 61460, 61416, 61471, 61057, 62982, 62355, 62897, 62277, 62962, 61811,
## 62942, 60557, 62219, 61417, 62434, 60931, 60961, 61724, 61336, 61077,
## 62601, 62289, 62438, 62927, 61539, 62829, 61425, 62819, 62538, 60519,
## 62998, 60536, 62019, 62313, 62378, 61344, 61358, 62961, 60111, 61751,
## 62070, 60407, 61379, 61374, 60973, 61940, 61042, 61812, 61452, 61454,
## 61432, 61050, 61441, 61062, 61769, 61831, 61468, 62023, 61478, 61479,
## 61845, 61480, 61485, 61312, 61313, 61325, 61330, 61337, 61720, 61722,
## 60949, 60929, 62990, 62617, 62996, 61735, 62639, 62359, 62695, 62649,
## 62218, 62671, 62674, 62295, 62280, 62261, 62268, 62330, 62338, 62809,
## 62319, 62323, 62373, 62833, 62346, 62843, 62850, 62356, 62831, 62366,
## 62367, 62422, 62861, 62425, 62426, 62432, 62878, 62011, 62017, 62475,
## 62887, 62892, 62032, 62478, 62481, 62895, 62514, 62547, 62555, 62045,
## 62543, 62544, 62956, 62963, 61871, 62921, 62282, 62085, 61721, 62065,
## 61777, 62988, 62967, 61749, 62808, 62932, 61855, 61625, 61876, 61778,
## 62835, 61321, 61772, 62015, 62836, 62238, 61773, 62253, 62348, 62325,
## 62926, 62894, 62628, 62876, 61044, 62436, 60437, 61815, 62846, 61426,
## 61543, 62553, 61532, 62421, 62365, 61431, 62001, 62886, 62889, 62419,
## 61474, 61544, 62554, 61435, 60912, 60144, 62091, 62621, 62999, 61436,
## 61440, 62874, 62624, 61816, 61439, 61844, 61251, 61283, 60129, 62084,
## 62992, 62643, 61914, 62867, 61459, 61483, 62880, 62818, 62965, 62969,
## 60974, 62082, 62361, 61363, 61750, 61925, 61941, 61419, 61059, 61424,
## 61433, 61775, 61475, 61848, 61524, 61236, 61258, 61263, 61562, 61564,
## 61317, 60910, 61323, 61328, 61331, 61335, 61346, 62093, 62622, 62997,
## 62030, 62248, 62689, 60933, 62250, 62811, 62316, 62336, 62825, 62352,
## 62879, 62891, 62459, 62519, 62048, 62540, 62541, 62953, 62076, 62078,
## 62081, 62610, 60930, 61079, 60934, 60926, 60113, 62570, 61372, 61329,
## 61043, 62672, 61553, 61563, 62663, 62841, 62852, 61516, 62537, 61541,
## 62083, 61851, 61338, 62098, 62983, 61027, 62273, 61810, 62266, 61552,
## 61955, 61332, 60917, 61519, 62357, 60960, 62883, 61324, 63673
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.