TITLE by Michael Reinhard

setwd("/Users/michaelreinhard/nano/R/finalProject")

I load the data in from the raw csv file, which is assumed to be in the same directory as the Rmd file.

Univariate Plots Section

Preliminary Data Additions

To streamline the presentation I am going to add a few variables constructed from outside information, like the party of the candidates, the total votes they received in the primary, or whether a contribution was made before or after the primary. I will plot them to make sure they are working correctly.

Create a ‘gen’ (for general election) dummy variable for future use.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.8658  1.0000  1.0000

## 
##      0      1 
##  35911 231746

## [1] "numeric"

## [1] "primary" "general"
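A minimal sketch of how a dummy like this could be built, assuming it is derived by comparing each contribution date with the Illinois primary date of 2012-03-20 (the date that corresponds to the day number 15419 stored in `primary_day.n`). The toy data frame `contribs` is hypothetical, and the actual construction used for this report may differ, for example by relying on `election_tp`.

```r
# Toy stand-in for the contributions data (hypothetical values)
contribs <- data.frame(
  date_con = as.Date(c("2011-08-24", "2012-03-19", "2012-03-21", "2012-10-01"))
)

# Assumed cutoff: the Illinois primary on 2012-03-20
primary_day <- as.Date("2012-03-20")

# 0 = on or before the primary, 1 = after it
contribs$gen <- as.numeric(contribs$date_con > primary_day)

# Labelled factor version of the same dummy
contribs$gen.f <- factor(contribs$gen, levels = c(0, 1),
                         labels = c("primary", "general"))

# The same checks whose output appears above
summary(contribs$gen)
table(contribs$gen)
class(contribs$gen)
levels(contribs$gen.f)
```

Keeping both the numeric and the factor version is convenient: the numeric one can go into a correlation, while the factor gives readable labels in plots.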

A variable I want to add right off is the party of the candidate.

## [1] Republican Democrat   Democrat   Democrat   Democrat  
## Levels: Democrat Republican Socialist

## [1] "factor"

## 
##   Democrat Republican  Socialist 
##     201064      66538         55

I am first going to examine some of the variables and their distributions. While doing this I will add some variables to the data set, either through recoding or by adding information from outside sources. The first I will look at is the date the contribution was made. I will convert the date values to a date format recognized by R and add a dummy variable for the date of the primary.

There are six main variables to examine. The primary variable of interest is the amount of contributions. This is the thing we are interested in explaining.

The variables that can contribute to explaining the amounts contributed are the characteristics of the contributors, the location of the contributors and the timing of the contributions. Another set of features in the data set that can explain the contributions are characteristics of the candidates. The names of the candidates are in the data set, and other characteristics such as their party affiliation and the number of votes they received in the primary contests can be added to it.

### Date

Dates are sometimes troublesome in R because they are recorded as characters. You have to convert them to dates. Then, if you want to use them on an axis and add, say, a vertical line, you have to convert the date where you want the line back to a numeric object. So it is an involved and somewhat counterintuitive process: you take them from characters to dates and then, depending on how you want to plot them, to numbers.
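The character-to-date-to-number round trip can be sketched as follows. The raw strings follow the day-MON-yy pattern seen in `contb_receipt_dt`. The sketch uses base graphics and assumes an English locale, since `%b` matches month abbreviations in the current locale; the plotting code behind this report may well have used a different package.

```r
# Raw dates as they appear in contb_receipt_dt (day-MON-yy)
raw_dates <- factor(c("24-AUG-11", "19-SEP-11", "28-SEP-11"))

# factor -> character -> Date (English month abbreviations assumed)
date_con <- as.Date(as.character(raw_dates), format = "%d-%b-%y")

# Date -> numeric: days since 1970-01-01
date_con.n    <- as.numeric(date_con)
primary_day.n <- as.numeric(as.Date("2012-03-20"))

# Dates on the x axis, with a dashed vertical line at the primary.
# The reference line is placed using the numeric version of the date.
amounts <- c(0.4, 0.4, 5)
plot(date_con, amounts,
     xlim = range(c(date_con.n, primary_day.n)),
     xlab = "Contribution date", ylab = "Amount")
abline(v = primary_day.n, lty = 2)
```

The numeric values line up with the `date_con.n` and `primary_day.n` columns printed later in the report, which is a quick way to confirm the conversion went through as intended.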

## [1] "factor"

## [1] "2011-12-06" "2011-09-30" "2011-09-27" "2011-09-30" "2011-08-10"
## [6] "2011-09-27"

Contributor Characteristics

The variables for occupation and employer are factor variables with too many different values and no intrinsic ordering, so if you try to make a bar graph it takes a few minutes to render and doesn’t tell you anything. The best I can do with these variables, absent extensive recoding, is note the most common values. This requires reordering the levels, whether they are names of cities, occupations or employers, by their frequency. Oh yeah, contributors have names, too.
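The frequency-ordering template can be sketched on a toy factor. The values below are made up for illustration; the real columns have thousands of levels, but the steps are the same.

```r
# Toy factor standing in for a high-cardinality column
occupation <- factor(c("RETIRED", "ATTORNEY", "RETIRED", "TEACHER",
                       "ATTORNEY", "RETIRED", "HOMEMAKER"))

# Count each level and sort from most to least common
counts <- sort(table(occupation), decreasing = TRUE)
head(counts, 3)

# Same information as a data frame, ordered by frequency
counts_df <- as.data.frame(table(occupation), stringsAsFactors = FALSE)
counts_df <- counts_df[order(counts_df$Freq, decreasing = TRUE), ]
head(counts_df, 3)

# Re-level the factor so plots follow frequency order, not alphabetical order
occupation_by_freq <- factor(occupation, levels = names(counts))
levels(occupation_by_freq)
```

Working from the sorted table rather than the raw factor keeps things fast, because only the top few rows ever need to be printed or plotted.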

## [1] "integer"

## [1] "factor"

## [1] 267657

## [1] 58897

##     FALLSGRAFF, TOBY    MITCHELL, CAITLIN        HUGHES, ELLEN 
##                  292                  248                  213 
##           MUIR, MARY       LUCAS, WILLIAM SELZ, MARIA CRISTINA 
##                  162                  149                  139 
##       DFHDFH, DFHDFH       TRICE, YOHANCE           RUSH, KYLE 
##                  113                  109                  106 
##     ZMRHAL, TOBY MR.         MILES, LUCAS           MILLER, ED 
##                   97                   91                   91 
##          WEISS, GREG        BROWN, MARTHA         LIBER, DAVID 
##                   91                   89                   87 
##        FINE, PHYLLIS     MCKINLEY, CHERYL        WOODING, JOHN 
##                   86                   86                   85 
##            ZAR, LEON      JACOBSON, LINDA 
##                   85                   84

##  Named int [1:100] 292 248 213 162 149 139 113 109 106 97 ...
##  - attr(*, "names")= chr [1:100] "FALLSGRAFF, TOBY" "MITCHELL, CAITLIN" "HUGHES, ELLEN" "MUIR, MARY" ...

##                    Var1 Freq
## 1   AARNOUDSE, ANTHONIE    4
## 2       AARON, SUSAN S.    2
## 3 AARONSON-KRELL, PAULA    7
## 4     AARONSON, RICHARD    2
## 5    AASEBY, JOEL DAVID    6
## 6         ABA, JAIME R.    2

##                       Var1 Freq
## 14973     FALLSGRAFF, TOBY  292
## 35968    MITCHELL, CAITLIN  248
## 24063        HUGHES, ELLEN  213
## 37015           MUIR, MARY  162
## 31824       LUCAS, WILLIAM  149
## 47516 SELZ, MARIA CRISTINA  139

##                       Var1 Freq
## 14973     FALLSGRAFF, TOBY  292
## 35968    MITCHELL, CAITLIN  248
## 24063        HUGHES, ELLEN  213
## 37015           MUIR, MARY  162
## 31824       LUCAS, WILLIAM  149
## 47516 SELZ, MARIA CRISTINA  139
## 12342       DFHDFH, DFHDFH  113
## 53416       TRICE, YOHANCE  109
## 45404           RUSH, KYLE  106
## 58795     ZMRHAL, TOBY MR.   97
## 35521         MILES, LUCAS   91
## 35587           MILLER, ED   91
## 55928          WEISS, GREG   91
## 6129         BROWN, MARTHA   89
## 30858         LIBER, DAVID   87
## 15564        FINE, PHYLLIS   86
## 34516     MCKINLEY, CHERYL   86
## 57782        WOODING, JOHN   85
## 58536            ZAR, LEON   85
## 24819      JACOBSON, LINDA   84

So this Toby guy made almost 300 contributions. Maybe he is a bundler? Or maybe he contributes to a lot of people? Or maybe he contributes in really small amounts?

##        cmte_id   cand_id       cand_nm        contbr_nm contbr_city
## 21   C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY     CHICAGO
## 1214 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY     CHICAGO
## 1323 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY     CHICAGO
## 1573 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY     CHICAGO
## 1582 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY     CHICAGO
## 1648 C00431445 P80003338 Obama, Barack FALLSGRAFF, TOBY     CHICAGO
##      contbr_st contbr_zip   contbr_employer contbr_occupation
## 21          IL  606472810 OBAMA FOR AMERICA     SENIOR WRITER
## 1214        IL  606472810 OBAMA FOR AMERICA     SENIOR WRITER
## 1323        IL  606472810 OBAMA FOR AMERICA     SENIOR WRITER
## 1573        IL  606472810 OBAMA FOR AMERICA     SENIOR WRITER
## 1582        IL  606472810 OBAMA FOR AMERICA     SENIOR WRITER
## 1648        IL  606472810 OBAMA FOR AMERICA     SENIOR WRITER
##      contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text
## 21                 0.4        24-AUG-11                               
## 1214               0.4        19-SEP-11                               
## 1323               5.0        28-SEP-11                               
## 1573               0.4        21-AUG-11                               
## 1582               5.0        29-SEP-11                               
## 1648               0.4        16-SEP-11                               
##      form_tp file_num   tran_id election_tp   date_con gen   gen.f
## 21     SA17A   756218 C11912258       P2012 2011-08-24   0 primary
## 1214   SA17A   756218 C12144091       P2012 2011-09-19   0 primary
## 1323   SA17A   756218 C12264232       P2012 2011-09-28   0 primary
## 1573   SA17A   756218 C11896062       P2012 2011-08-21   0 primary
## 1582   SA17A   756218 C12310253       P2012 2011-09-29   0 primary
## 1648   SA17A   756218 C12088742       P2012 2011-09-16   0 primary
##         party date_con.n primary_day.n
## 21   Democrat      15210         15419
## 1214 Democrat      15236         15419
## 1323 Democrat      15245         15419
## 1573 Democrat      15207         15419
## 1582 Democrat      15246         15419
## 1648 Democrat      15233         15419

##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"      
## [19] "date_con"          "gen"               "gen.f"            
## [22] "party"             "date_con.n"        "primary_day.n"

##            cand_nm   contbr_employer contbr_occupation contb_receipt_amt
## 21   Obama, Barack OBAMA FOR AMERICA     SENIOR WRITER               0.4
## 1214 Obama, Barack OBAMA FOR AMERICA     SENIOR WRITER               0.4
## 1323 Obama, Barack OBAMA FOR AMERICA     SENIOR WRITER               5.0
## 1573 Obama, Barack OBAMA FOR AMERICA     SENIOR WRITER               0.4
## 1582 Obama, Barack OBAMA FOR AMERICA     SENIOR WRITER               5.0
## 1648 Obama, Barack OBAMA FOR AMERICA     SENIOR WRITER               0.4
## NA            <NA>              <NA>              <NA>                NA
## NA.1          <NA>              <NA>              <NA>                NA
## NA.2          <NA>              <NA>              <NA>                NA
## NA.3          <NA>              <NA>              <NA>                NA

OK, so it looks like Toby made a bunch of contributions that had to be taken back or something, since after the first six rows there is nothing but NAs for the balance of his nearly 300 contributions. (Rows of NAs can also appear in R simply from indexing past the last row of a subset, so this reading is worth double-checking.) For a serious analysis we should probably remove Toby and take a close look at the other top contributors to see whether some of them are in the upper range of contributors for some anomalous reason.

### Occupations

More than 11,000 different occupations can’t really be graphed. They would have to be boiled down into a smaller number of categories to be useful. Employers will have too many categories to plot usefully as well.

## [1] 11216

## [1] "factor"

## 
##                                 .NET PROGRAMMER     (RETIRED SECRETARY) 
##                    2189                      14                       1 
##   *LONG TERM DISABILITY                       ~ 1ST DEPUTY COMMISSIONER 
##                       1                       1                       1

## 'data.frame':    11215 obs. of  2 variables:
##  $ Var1: Factor w/ 11215 levels "",".NET PROGRAMMER",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Freq: int  2189 14 1 1 1 1 2 7 1 1 ...

##                      Var1 Freq
## 1                         2189
## 2         .NET PROGRAMMER   14
## 3     (RETIRED SECRETARY)    1
## 4   *LONG TERM DISABILITY    1
## 5                       ~    1
## 6 1ST DEPUTY COMMISSIONER    1

##  [1] RETIRED                               
##  [2] ATTORNEY                              
##  [3] HOMEMAKER                             
##  [4] INFORMATION REQUESTED PER BEST EFFORTS
##  [5] PHYSICIAN                             
##  [6] TEACHER                               
##  [7] INFORMATION REQUESTED                 
##  [8] PROFESSOR                             
##  [9] CONSULTANT                            
## [10] SALES                                 
## [11] NOT EMPLOYED                          
## [12] ENGINEER                              
## [13] LAWYER                                
## [14] MANAGER                               
## [15]                                       
## [16] NONE                                  
## [17] PRESIDENT                             
## [18] WRITER                                
## [19] OWNER                                 
## [20] SELF-EMPLOYED                         
## 11215 Levels:  .NET PROGRAMMER (RETIRED SECRETARY) ... ZONE MANAGER

##                                    Var1 Freq
## 1                                       2189
## 2                       .NET PROGRAMMER   14
## 3                   (RETIRED SECRETARY)    1
## 4                 *LONG TERM DISABILITY    1
## 5                                     ~    1
## 6               1ST DEPUTY COMMISSIONER    1
## 7                     1ST GRADE TEACHER    2
## 8                   270 POLLING ANALYST    7
## 9           2ND LIEUTENANT, ACTIVE DUTY    1
## 10        2ND. AD - FILM/TV/COMMERCIALS    1
## 11                            3D ARTIST    1
## 12 3RD GENERATION FAMILY BUSINESS OWNER    1
## 13                    3RD GRADE TEACHER    1
## 14                    4TH GRADE TEACHER    2
## 15      5TH GRADE DUAL LANGUAGE TEACHER    2
## 16                    6TH GRADE TEACHER    7
## 17                    7TH GRADE TEACHER    4
## 18       80% MEDICALLY DISABLED VETERAN    1
## 19                       911 DISPATCHER    4
## 20                       911 SUPERVISOR    6

##                                         Var1  Freq
## 8535                                 RETIRED 54133
## 795                                 ATTORNEY 10911
## 4744                               HOMEMAKER  7480
## 5028  INFORMATION REQUESTED PER BEST EFFORTS  6573
## 7237                               PHYSICIAN  5892
## 10184                                TEACHER  5733
## 5026                   INFORMATION REQUESTED  5349
## 7726                               PROFESSOR  4550
## 2167                              CONSULTANT  4349
## 8826                                   SALES  3197
## 6635                            NOT EMPLOYED  3070
## 3525                                ENGINEER  2599
## 5535                                  LAWYER  2542
## 5849                                 MANAGER  2461
## 1                                             2189
## 6611                                    NONE  1931
## 7462                               PRESIDENT  1918
## 11147                                 WRITER  1665
## 6891                                   OWNER  1582
## 9081                           SELF-EMPLOYED  1532

## [1] "data.frame"

Trying to plot all occupations takes forever for R to run and doesn’t tell you anything, because the labels run together and there isn’t enough variation. But I would still like to see all occupations on a log scale, to see if there is a geometric decline in the frequency of contributions across occupations taken as a whole. So I want to plot the frequency of contributions by occupation on a logarithmic scale with the labels suppressed. Strictly speaking this is not the distribution of occupations, since occupation is a factor variable; we are just ordering occupations by their frequencies. Also, the categories come from the contributors themselves, so what we see is really the way people have chosen to classify their occupations. Some people have chosen a very narrow and specific description and some a more general one. The most general description, ‘retired’, is not surprisingly the largest category, since you can be retired from anything.
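A plot of that kind can be sketched with base graphics by ranking the categories and dropping the x-axis labels. The data here are simulated, and the category names and probabilities are made up for illustration only.

```r
# Simulated occupation column (hypothetical proportions)
set.seed(1)
occupation <- sample(c("RETIRED", "ATTORNEY", "HOMEMAKER", "PHYSICIAN", "TEACHER"),
                     size = 500, replace = TRUE,
                     prob = c(0.5, 0.25, 0.12, 0.08, 0.05))

# Frequencies sorted from most to least common
freq <- sort(table(occupation), decreasing = TRUE)

# Rank on the x axis, count on a log y axis, category labels suppressed
plot(seq_along(freq), as.numeric(freq),
     log = "y", type = "h", xaxt = "n",
     xlab = "Occupation (ranked by frequency, labels suppressed)",
     ylab = "Number of contributions (log scale)")
```

If the decline were geometric, the tops of the bars would fall roughly along a straight line on this log scale.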

Now I want to look at the top 20 employers. This is a factor variable so it should follow the same template I have used for other factors.

## [1] 23176

## [1] "factor"

## 
##                                        
##                                   2181 
##               FORSYTHE SOLUTIONS GROUP 
##                                      3 
##  NORTHERN ILLINOIS CONFERENCE UNITED M 
##                                      4 
##                           @ PROPERTIES 
##                                      3 
##                                  @ UIC 
##                                      2 
##                            @PROPERTIES 
##                                     32

## 'data.frame':    23175 obs. of  2 variables:
##  $ Var1: Factor w/ 23175 levels ""," FORSYTHE SOLUTIONS GROUP",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Freq: int  2181 3 4 3 2 32 2 1 2 1 ...

##                                     Var1 Freq
## 1                                        2181
## 2               FORSYTHE SOLUTIONS GROUP    3
## 3  NORTHERN ILLINOIS CONFERENCE UNITED M    4
## 4                           @ PROPERTIES    3
## 5                                  @ UIC    2
## 6                            @PROPERTIES   32

So I took the factor and made it into a dataframe. Now I want to order the employers by frequency and get a chart of the top employers by frequency.

##  [1] RETIRED                               
##  [2] SELF-EMPLOYED                         
##  [3] NOT EMPLOYED                          
##  [4] INFORMATION REQUESTED PER BEST EFFORTS
##  [5] INFORMATION REQUESTED                 
##  [6] HOMEMAKER                             
##  [7] NONE                                  
##  [8] OBAMA FOR AMERICA                     
##  [9]                                       
## [10] UNIVERSITY OF CHICAGO                 
## [11] NORTHWESTERN UNIVERSITY               
## [12] SELF                                  
## [13] UNIVERSITY OF ILLINOIS                
## [14] OFA                                   
## [15] STATE OF ILLINOIS                     
## [16] CHICAGO PUBLIC SCHOOLS                
## [17] CITY OF CHICAGO                       
## [18] UNITED AIRLINES                       
## [19] AT&T                                  
## [20] DEPAUL UNIVERSITY                     
## 23175 Levels:  ... ZYCUS

##                                      Var1 Freq
## 1                                         2181
## 2                FORSYTHE SOLUTIONS GROUP    3
## 3   NORTHERN ILLINOIS CONFERENCE UNITED M    4
## 4                            @ PROPERTIES    3
## 5                                   @ UIC    2
## 6                             @PROPERTIES   32
## 7                                 #2 C.I.    2
## 8                                       ~    1
## 9                                08131952    2
## 10                     1030 HUBBARD PLACE    1
## 11                     10TH MAGNITUDE LLC    1
## 12                      11 COMMUNICATIONS    1
## 13                       1154 LILL STUDIO   11
## 14                         12 INTERACTIVE    1
## 15      16TH JUDICIAL CIRCUIT OF ILLINOIS    2
## 16                         180 PROPERTIES    1
## 17                  19TH JUDICIAL CIRCUIT   14
## 18                          1ST ADVANTAGE   12
## 19                  1ST CIRCUIT PROBATION   48
## 20                       1ST MIDWEST BANK    4

## [1] "data.frame"

Actually, there is a lot of interesting stuff in there. The top 20 employers were very revealing. But I should think there is no way to make a good graph from this without a lot of work.

### Cities

The cities have no intrinsic ordering, just like the other factor variables, but we can impose an order by frequency and plot in that order to get a better idea of their distribution. We can then see whether all cities contribute more or less equal shares of the contributions or whether the distribution is skewed toward a subset of cities.

## [1] "factor"

## 
##                              123 E. CO BLACKTOP       1308 N ASTOR ST C 
##                       1                       2                       1 
## 1308 N ASTOR ST CHICAGO              60154-5908             ABBOTT PARK 
##                       1                       2                       1

## 'data.frame':    1402 obs. of  2 variables:
##  $ Var1: Factor w/ 1402 levels "","123 E. CO BLACKTOP",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Freq: int  1 2 1 1 2 1 3 15 1 245 ...

##                      Var1 Freq
## 1                            1
## 2      123 E. CO BLACKTOP    2
## 3       1308 N ASTOR ST C    1
## 4 1308 N ASTOR ST CHICAGO    1
## 5              60154-5908    2
## 6             ABBOTT PARK    1

So I took the factor and made it into a dataframe. Now I want to order the cities by frequency and get a chart of the top cities by frequency.

##  [1] CHICAGO           EVANSTON          OAK PARK         
##  [4] NAPERVILLE        SPRINGFIELD       CHAMPAIGN        
##  [7] WILMETTE          HIGHLAND PARK     BLOOMINGTON      
## [10] LAKE FOREST       PEORIA            GLENVIEW         
## [13] WINNETKA          WHEATON           NORTHBROOK       
## [16] ROCKFORD          AURORA            ARLINGTON HEIGHTS
## [19] DOWNERS GROVE     HINSDALE         
## 1402 Levels:  123 E. CO BLACKTOP ... ZION

##                       Var1 Freq
## 1                             1
## 2       123 E. CO BLACKTOP    2
## 3        1308 N ASTOR ST C    1
## 4  1308 N ASTOR ST CHICAGO    1
## 5               60154-5908    2
## 6              ABBOTT PARK    1
## 7                 ABINGDON    3
## 8                    ADAIR   15
## 9               ADDIEVILLE    1
## 10                 ADDISON  245
## 11                  ALBANY   51
## 12                  ALBERS   10
## 13                  ALBION   17
## 14                   ALEDO   36
## 15                  ALEXIS   12
## 16               ALGONQUIN  432
## 17               ALLENDALE    3
## 18                    ALMA    5
## 19                   ALPHA    7
## 20                   ALSIP  101

## [1] "data.frame"

### Zip Codes

The long zip codes are not much use to us so I will shorten them to 5 digits. The zip codes are also unordered factors: even though they are numbers, the order of the zip codes has no real intrinsic meaning within a state. We do know that some zip codes in the data fall outside the state of Illinois, whose zip codes lie between 60000 and 63000.
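Shortening the codes can be sketched with `substr()`. The input vector is a small hypothetical mix of 9-digit and 5-digit values, and `zip_short` mirrors the column name that shows up later in the report.

```r
# Zip codes as they might appear in contbr_zip (hypothetical sample)
contbr_zip <- c("606472810", "60618", "605211234", "60422")

# Keep the first five characters, then convert to numbers for range checks
zip_short <- as.numeric(substr(as.character(contbr_zip), 1, 5))

zip_short
summary(zip_short)
```

Going through `as.character()` first also covers the case where the column was read in as a factor or a number.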

## [1] "character"

## [1] 60618 60640 60645 60521 60422 60586

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1267   60190   60600   60670   60650  100000      30

We could go farther and get the top zip codes but I think I will leave the zip codes for the map program.

There is one more thing to check: using visuals to look for outliers and mistakes. There are some outliers, but few compared with the values that are coded correctly, that is, inside the expected range for Illinois zip codes. The summary above shows values both below the correct range (the minimum is 1267) and above it (the maximum is 100000). I will exclude these values from the analyses that use zip codes. It would be a big job to go in and repair them unless I could figure out a way to do that programmatically. I suppose I could get the city and streets and narrow down the possible zip codes that way, but at some point it would have to be done by hand.
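Excluding the out-of-range values could look like the sketch below, which uses the 60000 to 63000 bounds stated earlier. The sample vector is hypothetical, apart from 1267 and 100000, which are the extremes reported in the summary.

```r
# Shortened zip codes, including out-of-range and missing values
zip_short <- c(60618, 60640, 1267, 60521, 100000, NA)

# Flag values inside the expected Illinois range; NA counts as out of range
in_illinois <- !is.na(zip_short) & zip_short >= 60000 & zip_short <= 63000

table(in_illinois)

# Keep only the plausible Illinois zip codes for zip-based analyses
zip_clean <- zip_short[in_illinois]
zip_clean
```

Keeping the logical flag as its own variable, rather than overwriting the data, makes it easy to count how many rows are being set aside.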

Candidates

## [1] Paul, Ron     Obama, Barack Obama, Barack Obama, Barack Obama, Barack
## [6] Obama, Barack
## 14 Levels: Bachmann, Michele Cain, Herman Gingrich, Newt ... Stein, Jill

## [1] "integer"

## [1] "factor"

Massive lead for Barack Obama. No real surprise in his home state. Even combining all the Republican contributions together does not close the gap: the party table shows 201,064 Democratic contributions against 66,538 Republican ones. Still, since Obama did not face a primary, it is a little incongruous.

Contribution amounts

## [1]   25 1000  500  500  200   25

## [1] "double"

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   -25000       25       50      455      112 16390000

So a roughly normal distribution shows up by examining the data on a log scale.
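A log-scale view of this kind can be sketched as follows. Because the summary shows negative amounts (refunds), and logarithms are undefined for zero and negative values, the sketch assumes those rows are dropped first. The amounts below are toy values.

```r
# Toy amounts, including a refund (negative) and a zero
amt <- c(25, 1000, 500, 500, 200, 25, -25000, 0, 50, 112)

# Keep strictly positive amounts before taking logs
amt_pos <- amt[amt > 0]

summary(amt_pos)

# Histogram of log10(amount): each unit on the x axis is a factor of ten
hist(log10(amt_pos),
     main = "Contribution amounts (log10 scale)",
     xlab = "log10(amount in dollars)")
```

On this scale a 25 dollar and a 250 dollar contribution are as far apart as a 250 dollar and a 2500 dollar one, which is what pulls the long right tail into a roughly bell-shaped curve.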

Candidate’s contribution counts

Create Party variable first.

##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"      
## [19] "date_con"          "gen"               "gen.f"            
## [22] "party"             "date_con.n"        "primary_day.n"    
## [25] "zip_short"

## [1] Republican Democrat   Democrat   Democrat   Democrat  
## Levels: Democrat Republican Socialist

## [1] Republican Democrat   Democrat   Democrat   Democrat   Democrat  
## Levels: Democrat Republican Socialist

## 
##   Democrat Republican  Socialist 
##     201064      66538         55

Now use the party variable to evaluate candidate contributions.

Vote totals

Now look at the vote totals.

## $`Paul, Ron`
## [1] 86605
## 
## $`Obama, Barack`
## [1] NA
## 
## $`Obama, Barack`
## [1] NA
## 
## $`Obama, Barack`
## [1] NA
## 
## $`Obama, Barack`
## [1] NA
## 
## $`Obama, Barack`
## [1] NA

## [1] 86605    NA    NA    NA    NA    NA

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0  433700  433700  381800  433700  433700  201119
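Attaching vote totals by candidate name can be sketched with a named vector. Only Ron Paul's figure (86605) is taken from the output above; the `primary_votes` lookup is a hypothetical helper and would need an entry for every candidate who appeared on the primary ballot.

```r
# Toy candidate column
cand_nm <- factor(c("Paul, Ron", "Obama, Barack", "Obama, Barack"))

# Primary vote totals keyed by candidate name (incomplete on purpose)
primary_votes <- c("Paul, Ron" = 86605)

# Candidates missing from the lookup come back as NA
votes <- unname(primary_votes[as.character(cand_nm)])

votes
summary(votes)
```

The NA values for Obama in the output are what this kind of lookup produces when a candidate has no entry, which fits the fact that he did not face a contested primary.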

Univariate Analysis

What is the structure of your dataset?

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Plots Section

First thing we do is the big correlation analysis, throwing everything in and seeing what is related to what. Correlation is best suited to numeric variables and we don’t have a bunch of those. The counts of contributions and the amounts are the most obvious variables to test for correlation, but it is not clear how to do that. We would have to group the two variables by a third for the idea of correlation to become meaningful. To ask the question we would have to somehow bin or group the observations together, say, by candidate, and ask whether candidates that get a lot of contributions also get a lot of money. But that would be a question for later, multivariate analysis.

Taking the variables as they are, we could ask if amounts, party (Republican or Democrat), pre or post primary, date (later or earlier, ending at the general election) or votes (in the primary at least) correlate with one another. Still, these variables aren’t meaningful on the same time scales, so I still don’t think it is appropriate to just throw them all into a big correlation matrix and see what shakes out. Or at least it is not very informative. But just to say that we have done it, here goes.
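The grouping idea can be sketched as follows: collapse the data to one row per candidate, then correlate the number of contributions with the total dollars. The candidates A, B and C and their amounts are made up for illustration.

```r
# Toy contributions for three hypothetical candidates
contribs <- data.frame(
  cand_nm = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  contb_receipt_amt = c(25, 50, 100, 1000, 500, 10, 20, 25, 5)
)

# One value per candidate: number of contributions and total dollars
n_contribs <- tapply(contribs$contb_receipt_amt, contribs$cand_nm, length)
total_amt  <- tapply(contribs$contb_receipt_amt, contribs$cand_nm, sum)

by_cand <- data.frame(cand_nm = names(n_contribs),
                      n       = as.numeric(n_contribs),
                      total   = as.numeric(total_amt))
by_cand

# Correlation across candidates between count and total
cor(by_cand$n, by_cand$total)
```

With only 14 candidates in the real data, any such correlation would rest on very few points, so it is better read as a description than as a test.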

## 'data.frame':    267602 obs. of  5 variables:
##  $ contb_receipt_amt: num  25 1000 500 500 200 25 100 250 100 250 ...
##  $ date_con         : num  15314 15247 15244 15247 15196 ...
##  $ gen              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ party            : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ date_con.n       : num  15314 15247 15244 15247 15196 ...

##                   contb_receipt_amt date_con   gen party date_con.n
## contb_receipt_amt                 1     0.00  0.00  0.00       0.00
## date_con                          0     1.00  0.84 -0.04       1.00
## gen                               0     0.84  1.00 -0.07       0.84
## party                             0    -0.04 -0.07  1.00      -0.04
## date_con.n                        0     1.00  0.84 -0.04       1.00
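A matrix like the one above can be produced along these lines. The first rows reuse values visible in the structure output, while the last row and the 0/1 coding of `party` are assumptions added so that every column varies in this toy example.

```r
# Toy numeric-coded data (last row is made up so that gen is not constant)
df <- data.frame(
  contb_receipt_amt = c(25, 1000, 500, 500, 200, 25),
  date_con.n        = c(15314, 15247, 15244, 15247, 15196, 15430),
  gen               = c(0, 0, 0, 0, 0, 1),
  party             = c(1, 0, 0, 0, 0, 0)  # assumed coding: 1 = Republican
)

# Pairwise Pearson correlations, rounded for readability
m <- round(cor(df, use = "complete.obs"), 2)
m
```

A column with no variation would produce NA correlations and a warning, which is one reason to check `str()` on the data before calling `cor()`.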

There just aren’t enough straightforward numeric variables for a bunch of correlations to make sense. The only strong correlations are mechanical: date_con and date_con.n are the same dates, and gen is closely tied to the date.

First I am going to investigate the relationship between the amount of money donated and the other variables of interest in the data set. I will use the amount of money donated as the dependent variable.

Before and After the Primary

I want to see a bit more in the contribution amount histograms. There is so much space between the amounts that it distorts the central message of the data, which is that there are different tiers of contributions. What I want to show in this chart is the spread of the distribution. The important fact about a contribution is its general size, not the fact that there are a lot of contributions of 5000 dollars and very few of 4200. So I think a couple of histograms with a really small number of bins might be interesting. I didn’t get much out of the pre/post-primary distinction on the first cut, so let’s look at Republicans and Democrats.

### Republicans vs. Democrats

It looks like Republican contributions could be larger on average than Democratic contributions, but Democrats have more outliers. Let’s look more closely at the structure of contribution amounts by comparing the Republican and Democratic histograms in a grid. Also, I am going to make a data set that is limited to individual contributions by subsetting the data to Democrats and Republicans and to contributions above 0 and below 5000.

I am choosing 5000 as the limit instead of 3000 because, although the election laws say that 2700 is the limit for individual contributions, groups can contribute up to 5000. I have seen that there are a lot of contributions over 2700 and I don’t know why.
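The subsetting step described above can be sketched like this. The rows are toy values (apart from the 16390000 maximum reported in the summary), and the name `indiv` for the restricted data set is hypothetical.

```r
# Toy data with a third party, a refund and an extreme value
contribs <- data.frame(
  party = factor(c("Democrat", "Republican", "Socialist",
                   "Democrat", "Republican", "Democrat")),
  contb_receipt_amt = c(25, 2500, 100, -50, 16390000, 4999)
)

# Keep the two major parties and amounts strictly between 0 and 5000
indiv <- subset(contribs,
                party %in% c("Democrat", "Republican") &
                  contb_receipt_amt > 0 & contb_receipt_amt < 5000)

# Drop the now-empty 'Socialist' level so it does not show up in plots
indiv$party <- droplevels(indiv$party)

nrow(indiv)
table(indiv$party)

# Coarse histogram: a handful of wide bins to show the tiers of giving
hist(indiv$contb_receipt_amt, breaks = seq(0, 5000, by = 1000),
     main = "Individual contributions", xlab = "Amount in dollars")
```

Dropping unused factor levels after subsetting avoids empty panels when the histograms are later split by party.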

## [1] 371

## OBAMA VICTORY FUND 2012 - UNITEMIZED          FREIDHEIM, CYRUS F. MR. JR. 
##                                   21                                    6 
##                          LINCOLN PAC                    BRINCAT, JEFF MR. 
##                                    4                                    3 
##                   RAITT, JOHN R. MR.               RAUSCHERT, KARL A. MR. 
##                                    3                                    3 
##                    RICE, YVONNE MRS.                  ABROE, MARY J. MRS. 
##                                    3                                    2 
##                 ADRIDGE, KENNETH MR.                    ALLEN, MARCI MRS. 
##                                    2                                    2 
##                         ALLEN, ROGER                ALLEN, RUSSELL L. MR. 
##                                    2                                    2 
##                  ALLEN, SHELLEY MRS.                   ALLGOOD, JASON MR. 
##                                    2                                    2 
##                ANDERSON, CARL P. MR.                 B&R SUPPLY GROUP LLC 
##                                    2                                    2 
##                     BAXMEYER, K. MR.                       BLAKLEY, JAMES 
##                                    2                                    2 
##                   BLOCK, JOHN G. MR.                       BUFORD, ROBERT 
##                                    2                                    2 
##                     BURTIS, ERIK MR.              BURTIS, MELISSA G. MRS. 
##                                    2                                    2 
##  C.N.A. CITIZENS FOR GOOD GOVERNMENT             CANNING, JOHN A. MR. JR. 
##                                    2                                    2 
##                    CAVANAGH, WILLIAM               CHAMNESS, JOHANNA MRS. 
##                                    2                                    2 
##                CHAMNESS, MICHAEL MR.                       CHANG, JAE MR. 
##                                    2                                    2 
##                       CROFT, BARBARA                        CUBITT, GEOFF 
##                                    2                                    2 
##                     DIXON, SCOTT MR.                 DIXTON, GRANT M. MR. 
##                                    2                                    2 
##                    DRAFT, HOWARD MR.                        DUNCAN, BRUCE 
##                                    2                                    2 
##               FESMIRE, ROBERT H. MR.                FESSLER, CAROL A. MS. 
##                                    2                                    2 
##                 FITCH, DENNIS C. MR.                      FRYE, MEGAN MS. 
##                                    2                                    2 
##                FULTON, THOMAS M. MR.                            GETCO PAC 
##                                    2                                    2 
##                   GREEN, JEFFREY MR.                   GRISHAM, LARRY MR. 
##                                    2                                    2 
##              HAGENBUCH, LEROY G. MR.                 HANSON, TERRY J. MR. 
##                                    2                                    2 
##                   HERRING, KERRY MR.                HINES, CHARLES L. MR. 
##                                    2                                    2 
##                 HOFFMAN, MARY JO MS.               HORNER, JANICE E. MRS. 
##                                    2                                    2 
##                      HUSMAN, MICHAEL           ISRINGHAUSEN, GEOFFREY MR. 
##                                    2                                    2 
##          ISRINGHAUSEN, SUSAN J. MRS.                  JASPER, PAUL T. MR. 
##                                    2                                    2 
##                     KANE, DIANE MRS.               KATSIAVELOS, HARRY MR. 
##                                    2                                    2 
##                    KUGLER, LARRY MR.                      KUNZ, PETER MR. 
##                                    2                                    2 
##                   LEWIS, ALISHA MRS.                    LEWIS, JAMIE MRS. 
##                                    2                                    2 
##            LEWIS, TIMOTHY R. MR. JR.                   LEWIS, WILLIAM MR. 
##                                    2                                    2 
##                     MASON, DAVID MR.               MASSEY, E. DAVISON MR. 
##                                    2                                    2 
##              MAZZETTA, THOMAS J. MR.            MILLER, JOSEPH G. MR. JR. 
##                                    2                                    2 
##                     MILLER, PAUL MR.                 MILLER, PHYLLIS MRS. 
##                                    2                                    2 
##                      MILLIGAN, DAVID                  MORGAN, MICHAEL MR. 
##                                    2                                    2 
##                      MORRIS, JOE MR.                    NORD, CONNIE MRS. 
##                                    2                                    2 
##          PEACOCK, HENRY STAFFORD MR.                    RAKOW, THOMAS MR. 
##                                    2                                    2 
##            RAUNER, BRUCE VINCENT MR.                    RICE, DEBBIE MRS. 
##                                    2                                    2 
##                      RICE, EDDIE MR.                      RICE, TERRY MR. 
##                                    2                                    2 
##                       ROBERTS, SCOTT                      RUZICH, RICHARD 
##                                    2                                    2 
##                  SAIA, ALBERT S. MR.                   SAPIENTE, JOHN MR. 
##                                    2                                    2 
##            SAVAIANO, MAUREEN R. MRS.                 SHAFFER, JOHN E. MR. 
##                                    2                                    2 
##                     SHAPIRO, DAN MR.                  SHAPIRO, NATHAN MR. 
##                                    2                                    2 
##                  SHAPIRO, STEVEN MR.                SILVERMAN, MORRIS MR. 
##                                    2                                    2 
##                SILVERMAN, PEARL MRS.                        SKELLY, LINDA 
##                                    2                                    2 
##                SMITH, MICHAEL G. MR.            SMITHBURG, WILLIAM D. MR. 
##                                    2                                    2 
##               SPECTOR, MARCY L. MRS.                SPRENZEL, MATTHEW MR. 
##                                    2                                    2 
##                     STAFFORD, ROBERT                          STEIN, EVAN 
##                                    2                                    2 
##                    STORTO, KELLY MS.                    STORTO, TRESA MS. 
##                                    2                                    2 
##             THE DUCHOSSOIS GROUP PAC                         THOMAS, MARK 
##                                    2                                    2 
##                      TIMBERLAKE, JIM                              (Other) 
##                                    2                                  144

I don’t know. After a cursory inspection, the list of contributors seems to consist mostly of individual names. A lot of names have 2 contributions. Perhaps they are contributing as husband and wife? There were a few institutions, such as Obama for America and a couple of lobbying groups, but since most of the names were individuals I will keep them in for now.
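The check described above can be sketched as follows. This is a minimal illustration on invented rows, not the code used for the output above: the column name `contbr_nm` is assumed from the FEC file layout, and the exact cutoffs (above 2700, at most 5000) are an assumption about how the filter was defined.

```r
# Toy data standing in for the contribution file (all values are invented).
toy <- data.frame(
  contbr_nm = c("DOE, JANE", "DOE, JANE", "DOE, JOHN", "EXAMPLE PAC", "ROE, RICHARD"),
  contb_receipt_amt = c(5000, 3000, 2500, 5000, 10000),
  stringsAsFactors = FALSE
)

# Keep contributions above the individual limit but within the group limit.
over_limit <- subset(toy, contb_receipt_amt > 2700 & contb_receipt_amt <= 5000)

# Count how many such contributions there are and who made them.
n_over <- nrow(over_limit)
name_counts <- sort(table(over_limit$contbr_nm), decreasing = TRUE)

print(n_over)
print(name_counts)
```

Sorting the table by count puts repeat contributors at the top, which makes it easier to see whether the names are mostly individuals or organizations.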

When we make the y scale logarithmic we see a lot more information in the lower end of the distribution. There are a lot more outliers in the Democratic contribution amounts at both the upper and lower end of the scale of individual contributions. But with the violin plots the distribution of amounts among the Republicans seems to be about as broad as that among the Democrats.

Now I will compare the parties by displaying them side by side. I am going to limit the contributions to those under 5000 to capture individual contributions, and I will exclude the Socialist since they did not play much of a role in the general election. I will call the subset ill_ind, for Illinois Individuals. What this graph makes me want to see is a comparison of the Republican and Democratic totals along with stacked bars showing how much of each party’s money came from small versus large donors. So even though there were about three times as many contributions to the Democratic candidate, the Republicans raised almost as much money. I would like to find a way to combine this information in a single chart.

## [1] 26170642 26034372

## # A tibble: 2 × 5
##        party     mean median    total      n
##       <fctr>    <dbl>  <dbl>    <dbl>  <int>
## 1   Democrat 131.5914     50 26170642 198878
## 2 Republican 398.6826    100 26034372  65301

To do what I want I am going to have to group by party. I want to take the parties, get the sum of their total contributions, graph that as bars, and then fill each bar as a stacked histogram of the amounts, with the fill on some sort of scale, such as the higher the amount the darker the color.
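One way the stacked-bar idea above might be sketched, assuming dplyr and ggplot2 are available. The data are invented, and the bin edges (200, 1000, 5000) are an arbitrary choice for illustration rather than values taken from the analysis.

```r
library(dplyr)
library(ggplot2)

# Invented contributions; party and contb_receipt_amt mimic the real columns.
toy <- data.frame(
  party = c("Democrat", "Democrat", "Democrat", "Republican", "Republican"),
  contb_receipt_amt = c(25, 150, 2500, 100, 2500)
)

# Bin each contribution by size (hypothetical bin edges).
toy$size_bin <- cut(toy$contb_receipt_amt,
                    breaks = c(0, 200, 1000, 5000),
                    labels = c("0-200", "201-1000", "1001-5000"))

# Total dollars per party and size bin.
by_bin <- toy %>%
  group_by(party, size_bin) %>%
  summarise(total = sum(contb_receipt_amt), .groups = "drop")

# One bar per party, stacked by bin; the sequential palette gives
# darker fills to the larger-amount bins.
p <- ggplot(by_bin, aes(x = party, y = total, fill = size_bin)) +
  geom_col() +
  scale_fill_brewer(palette = "Blues") +
  ylab("total contributions (dollars)")
```

Because the bar height is the dollar total, this combines the two pieces of information in one chart: how much each party raised, and how much of it came from small versus large contributions.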

Every contribution is the same height no matter how large or small it is, but the color, hue or alpha of the contribution should change with its size. So a clear pattern emerges where the Republicans get a larger proportion from donors making 1000 or 2500 dollar donations and a smaller proportion from those contributing amounts under $200 or so.

For the last two histograms, I am trying to get a comparison of the counts of contributions at various amounts by party. It is confusing for a couple of reasons. First of all, the Republicans appear to have more contributions at both the higher and lower levels. Moreover, in the last graph there are what appear to be negative values in the histogram.

I am going to look at counts of contributions by time and party (is that bivariate or multivariate?). So this is nice. You see that the Socialist contributions are more evenly spread out and evenly divided between the general and primary election. Since Socialists don’t typically have any real prospect of winning the general election, it may be that there is no particularly greater impulse to contribute during the general election. Comparing the Republicans and Democrats directly, we can see that the Democratic contributions are more ‘spiky’, showing more ups and downs within a given period. I can’t really say why that would be the case.

Candidates and Amounts

First, since I am not looking at party anymore I am going to put the Socialists (actually, the Socialist) back in. Now we look at the count of contributions versus the amount of contributions, or the average amount of contributions. This is really a new variable, namely, the average. We will make the average contribution into a variable in itself and see how it is distributed over various other variables. Of course, the mean doesn’t have any meaning independent of the set of things you are averaging over, so to get means we have to use some sort of grouping function. The most basic way to look at the mean contribution size is to look at it for the whole data set. From there you can look at how the mean changes across the factor variables in the data set. I am going to work through the variables and examine graphically how the distribution of the mean changes when the data is grouped by party, candidate, before and after the primary, months, days, occupation (not a very good variable, since people code it themselves and the variation is driven more by how people choose to describe their occupation than by actual differences in occupations), employers and zip codes. This is going to require a lot of use of the dplyr functions. It will be good practice.

## [1] 454.7985

## # A tibble: 3 × 4
##        party     mean median      n
##       <fctr>    <dbl>  <dbl>  <int>
## 1   Democrat 476.2116     50 201064
## 2 Republican 390.2304    100  66538
## 3  Socialist 288.2951    250     55
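The grouping step behind a table like the one above can be sketched with dplyr's `group_by()` and `summarise()`. The rows below are invented, so the numbers will not match the real output; only the column names `party` and `contb_receipt_amt` follow the data set.

```r
library(dplyr)

# Invented contributions standing in for the full data set.
toy <- data.frame(
  party = factor(c("Democrat", "Democrat", "Republican", "Republican", "Socialist")),
  contb_receipt_amt = c(50, 250, 100, 500, 250)
)

# Mean over the whole data set: no grouping needed.
overall_mean <- mean(toy$contb_receipt_amt)

# Mean, median and count within each party.
by_party <- toy %>%
  group_by(party) %>%
  summarise(mean = mean(contb_receipt_amt),
            median = median(contb_receipt_amt),
            n = n())

print(overall_mean)
print(by_party)
```

Swapping `party` for another factor (candidate, month, zip code) in `group_by()` gives the corresponding table for that variable.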

So we have an interesting set of numbers. First we can do the comparisons with simple bar graphs. This is pretty interesting. I think this would be clearer if we could present these as overlapping density curves. Then we could plot the means and medians as vertical lines. Now I want to add the mean and median as vertical lines going from y=0 to the top of the relevant density curve. Then I would like to reassign the colors, making the Democrats blue and the Republicans red to conform to the current convention. Contributions are the continuous variable and the parties are the categorical variable.
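A rough sketch of the density plot described above, assuming dplyr and ggplot2 and using invented amounts. As a simplification, `geom_vline()` draws the mean (solid) and median (dashed) across the full panel height rather than stopping at the top of each density curve; the log x scale is also an assumption carried over from the earlier plots.

```r
library(dplyr)
library(ggplot2)

# Invented contribution amounts for the two major parties.
toy <- data.frame(
  party = rep(c("Democrat", "Republican"), each = 5),
  contb_receipt_amt = c(10, 25, 50, 100, 500, 50, 100, 250, 1000, 2500)
)

# Per-party mean and median, used to position the vertical lines.
stats <- toy %>%
  group_by(party) %>%
  summarise(mean = mean(contb_receipt_amt),
            median = median(contb_receipt_amt))

# Conventional party colors.
party_colors <- c(Democrat = "blue", Republican = "red")

p <- ggplot(toy, aes(x = contb_receipt_amt, fill = party)) +
  geom_density(alpha = 0.4) +
  geom_vline(data = stats, aes(xintercept = mean, color = party)) +
  geom_vline(data = stats, aes(xintercept = median, color = party),
             linetype = "dashed") +
  scale_x_log10() +
  scale_fill_manual(values = party_colors) +
  scale_color_manual(values = party_colors)
```

Naming the color vector by party keeps the blue and red assignment fixed even if the factor levels are reordered.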

First I make a basic plot of the contribution amounts with the fill differing by party. Now I try it with a density plot. Now I want to compare densities with the Republicans and Democrats data. Now I can do the same thing with boxplots.

Contributions and Votes

Finally, we look at Contributions and Votes. First we see how many votes each candidate got in the primary. That is, we group the data by candidate.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

Here I am going to plot the amounts by count separately for each candidate and map the size of the dot to the candidate’s vote total in the primary. This will only be for the Republicans. I may do a separate one for the general election but that would have to wait. Also, since there are only two candidates in the general election the dot plot format would probably not be very informative. Maybe I could add the date when they dropped out, as a final touch and a way to get in some information about the candidates that didn’t make it to the primary election?

Other multivariate plots

There are a few other relationships I would like to explore. I would like to look at how the amount of contributions varies by party, candidate and votes. Let’s look at the relationship between votes and campaign contributions for the two contestants in the general election, Mitt Romney and Barack Obama. What can we do? We can look at the data with the dates, the pre- and post-primary election periods and, what else? We can’t look at party since there is no way to separate out the effects of party when you have two parties and two candidates. There are no degrees of freedom.

First we have to create a month variable. Because the data cover a two-year cycle, each calendar month occurs twice, so a running month count goes up to 24.

names(ill)

## Source: local data frame [192 x 8]
## Groups: month, year, party [57]
## 
##    month  year      party                        cand_nm  contb avg_contb
##    <chr> <chr>     <fctr>                         <fctr>  <dbl>     <dbl>
## 1     01  2011 Republican                   Cain, Herman    250  250.0000
## 2     02  2011 Republican                   Cain, Herman    250  250.0000
## 3     03  2011 Republican                   Cain, Herman   5000 2500.0000
## 4     03  2011 Republican                 Gingrich, Newt   7200  720.0000
## 5     03  2011 Republican              Pawlenty, Timothy   5750 1437.5000
## 6     03  2011 Republican Roemer, Charles E. 'Buddy' III    255   51.0000
## 7     04  2011   Democrat                  Obama, Barack 903414  755.9950
## 8     04  2011 Republican                   Cain, Herman    500  250.0000
## 9     04  2011 Republican                 Gingrich, Newt   8675  619.6429
## 10    04  2011 Republican                      Paul, Ron    250  250.0000
## # ... with 182 more rows, and 2 more variables: n <int>, y_m <chr>
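A small sketch of how month, year and a combined `y_m` key like those in the table above could be built from a date column. The dates are invented and already in ISO format; the real file would first need `as.Date()` with a matching `format=` argument. The `month_index` column is a hypothetical extra showing a running count from 1 to 24 over 2011 and 2012.

```r
# Invented contributions with dates already converted to Date objects.
toy <- data.frame(
  date = as.Date(c("2011-01-15", "2011-03-03", "2012-03-20")),
  contb_receipt_amt = c(250, 5000, 100)
)

# Zero-padded month and four-digit year as character strings.
toy$month <- format(toy$date, "%m")
toy$year <- format(toy$date, "%Y")

# Combined year_month key, which sorts chronologically as text.
toy$y_m <- paste(toy$year, toy$month, sep = "_")

# Running month index over the two-year cycle (1 = January 2011).
toy$month_index <- (as.integer(toy$year) - 2011L) * 12L + as.integer(toy$month)

print(toy)
```

Keeping the year in the key avoids lumping, say, March 2011 and March 2012 together when grouping by month.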

Now we can see which party outraised which by month for the two year period.

## [1] "month"     "year"      "party"     "cand_nm"   "contb"     "avg_contb"
## [7] "n"         "y_m"

## Source: local data frame [57 x 5]
## Groups: y_m [?]
## 
##        y_m      party    contb      avg     n
##      <chr>     <fctr>    <dbl>    <dbl> <int>
## 1  2011_01 Republican    250.0    250.0     1
## 2  2011_02 Republican    250.0    250.0     1
## 3  2011_03 Republican  18205.0  18205.0     4
## 4  2011_04   Democrat 903414.0 903414.0     1
## 5  2011_04 Republican  79575.0  79575.0     6
## 6  2011_05   Democrat 196260.0 196260.0     1
## 7  2011_05 Republican 523750.3 523750.3     7
## 8  2011_06   Democrat 495531.9 495531.9     1
## 9  2011_06 Republican 622169.3 622169.3    10
## 10 2011_07   Democrat 373218.9 373218.9     1
## # ... with 47 more rows

Well, not entirely sure what is going on here with the row of dots on the bottom. They are colored Republican but I don’t see how I would get two sets of Republican observations here.
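One thing visible in the party-by-month table printed above is that `avg` equals `contb` in every row and that `n` appears to count candidate rows (for example, four Republican candidates in 2011_03) rather than individual contributions. Whether that is related to the odd row of dots is not clear, but a hedged sketch of an aggregation that carries the contribution counts through might look like this, on invented rows that reuse the column names `y_m`, `party`, `contb` and `n`.

```r
library(dplyr)

# Invented candidate-by-month rows; n is the number of contributions.
cand_month <- data.frame(
  y_m = c("2011_03", "2011_03", "2011_04"),
  party = c("Republican", "Republican", "Democrat"),
  contb = c(5000, 7200, 900),
  n = c(2L, 10L, 3L)
)

# Sum dollars and contribution counts within party and month,
# then divide to get the average size of a contribution.
party_month <- cand_month %>%
  group_by(y_m, party) %>%
  summarise(contb = sum(contb),
            n = sum(n),
            .groups = "drop") %>%
  mutate(avg = contb / n)

print(party_month)
```

Dividing the summed dollars by the summed counts weights each candidate by how many contributions they received, which a plain mean of per-candidate averages would not do.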

Contributions and Votes

Now that I have votes per candidate I think it would be interesting to compare the amount collected by the candidates to the number of votes they got. So I am going to subset the data and get the candidates that survived till the primary and divide the number of votes they received by the amount of money they collected.

## [1]  73993  86605   5541   3704 433700 325488

## [1] 0.22836100 0.13416782 0.01421918 0.39909493 0.01841580 0.93320332

Now I want to make a bar chart from this. Buddy Roemer’s name is causing a lot more trouble than it is worth so I am going to shorten it to ‘Roemer, Charles’, to make it fit better with the rest of the data.

## [1] "cand_nm"     "total_spent" "n"           "votes"       "price_vote"

## [1] Gingrich, Newt                 Paul, Ron                     
## [3] Perry, Rick                    Roemer, Charles E. 'Buddy' III
## [5] Romney, Mitt                   Santorum, Rick                
## 14 Levels: Bachmann, Michele Cain, Herman Gingrich, Newt ... Stein, Jill

##  [1] "Bachmann, Michele"    "Cain, Herman"         "Gingrich, Newt"      
##  [4] "Huntsman, Jon"        "Johnson, Gary Earl"   "McCotter, Thaddeus G"
##  [7] "Obama, Barack"        "Paul, Ron"            "Pawlenty, Timothy"   
## [10] "Perry, Rick"          "Roemer, Charles"      "Romney, Mitt"        
## [13] "Santorum, Rick"       "Stein, Jill"
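Shortening a name like this in a factor column comes down to renaming one level. A minimal base R sketch on a three-candidate toy factor (the real `cand_nm` column has 14 levels):

```r
# Toy factor with the long Roemer label among its levels.
cand <- factor(c("Romney, Mitt", "Roemer, Charles E. 'Buddy' III", "Paul, Ron"))

# Replace only the matching level; the underlying codes are untouched.
levels(cand)[levels(cand) == "Roemer, Charles E. 'Buddy' III"] <- "Roemer, Charles"

print(levels(cand))
```

Renaming the level, rather than overwriting values with a new string, avoids creating NA entries for a label that is not yet among the factor's levels.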

Let’s do it again.

#get candidates that had some votes
ill_vote_gettersX = subset(ill, votes > 0)
#group by cand_nm, starting from the subset created on the line above
#(min(votes) presumably just picks the candidate's single vote total)
ill_vote_gettersX_grp = ill_vote_gettersX %>%
   dplyr::group_by(cand_nm) %>%
   dplyr::summarise(votesX = min(votes),
             total_spent = sum(contb_receipt_amt),
             n = n())
ill_vote_gettersX_grp

## # A tibble: 6 × 4
##                          cand_nm votesX total_spent     n
##                           <fctr>  <dbl>       <dbl> <int>
## 1                 Gingrich, Newt  73993    324017.7  1440
## 2                      Paul, Ron  86605    645497.6  5132
## 3                    Perry, Rick   5541    389685.0   265
## 4 Roemer, Charles E. 'Buddy' III   3704      9281.0   171
## 5                   Romney, Mitt 433700  23550435.2 55911
## 6                 Santorum, Rick 325488    348785.7  1845

ill_vote_gettersX_grp$productivity = ill_vote_gettersX_grp$votesX/ill_vote_gettersX_grp$total_spent
#do the chart again, plotting the productivity column computed above
ggplot(ill_vote_gettersX_grp, aes(x=cand_nm, y=productivity)) + geom_bar(stat="identity") + theme_minimal() +
  ggtitle("Votes Per Dollar in the Illinois Republican Primary") + ylab("votes received/money raised") + xlab("") +
  theme(axis.text.x = element_text(hjust = 0.5, family="Didot", color="blue"), axis.text.y=element_text(color="steelblue"))

So on this evidence Romney appears to be one of the least ‘efficient’ candidates: he received about 0.018 votes per dollar raised, or roughly 54 dollars per vote, whereas Santorum received almost one vote per dollar. But this is combining all of the money Romney raised throughout the entire cycle and comparing it to the money raised by other candidates up until the primary elections were held, after which most presumably stopped fundraising altogether. We should further subset the data to money raised before the primary election.
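The pre-primary restriction suggested above could be sketched as follows, in base R on invented rows. It assumes that the `gen` dummy created earlier is 0 for contributions made before the primary and 1 afterwards; the candidate labels and all numbers are hypothetical.

```r
# Invented rows: two candidates, with gen = 0 marking pre-primary money.
toy <- data.frame(
  cand_nm = c("A", "A", "A", "B", "B"),
  contb_receipt_amt = c(100, 300, 1000, 50, 150),
  gen = c(0, 0, 1, 0, 0),
  votes = c(40, 40, 40, 100, 100)
)

# Keep only money raised before the primary.
pre <- subset(toy, gen == 0)

# Dollars raised and primary votes per candidate.
raised <- aggregate(contb_receipt_amt ~ cand_nm, data = pre, FUN = sum)
votes <- aggregate(votes ~ cand_nm, data = pre, FUN = min)
res <- merge(raised, votes, by = "cand_nm")

# Report the ratio both ways so the direction is unambiguous.
res$votes_per_dollar <- res$votes / res$contb_receipt_amt
res$dollars_per_vote <- res$contb_receipt_amt / res$votes

print(res)
```

Showing both votes per dollar and dollars per vote makes it harder to misread which direction counts as efficient.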

Contributions by Zip Code

Now I am going to finish my initial round of univariate charts. I am going to look at zip codes. These are going to be made using the choroplethr package along with ggplot.

## [1] "region"            "total_population"  "percent_white"    
## [4] "percent_black"     "percent_asian"     "percent_hispanic" 
## [7] "per_capita_income" "median_rent"       "median_age"

## [1] 1255    4

## [1] "zip_short" "total_amt" "mean_amt"  "n"

## [1] "character"

## [1] "character"

##   region total_population percent_white percent_black percent_asian
## 1  60002            24250            88             2             3
## 2  60004            49957            81             2             8
## 3  60005            30931            76             2             8
## 4  60007            33973            77             1             9
## 5  60008            22302            61             4             8
## 6  60010            44031            84             0             8
##   percent_hispanic per_capita_income median_rent median_age  total_amt
## 1                5             33622         761       41.2   35559.70
## 2                8             40134        1102       41.8  190235.38
## 3               12             37387         953       39.8  130670.37
## 4               11             33540         921       43.5  105911.80
## 5               26             29141        1030       36.2   41905.75
## 6                6             65973        1173       46.3 1090630.04
##    mean_amt    n
## 1  85.48005  416
## 2 147.92798 1286
## 3 145.67488  897
## 4 216.58855  489
## 5 119.73071  350
## 6 339.97196 3208
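The merged table above combines per-zip contribution summaries with demographic columns keyed by `region`. A base R sketch of that kind of aggregation and join is shown below. The contribution rows and the raw column name `contbr_zip` are assumptions for illustration; the two `per_capita_income` values are copied from the output above, and the plotting step with choroplethr is left out.

```r
# Invented contributions with 9-digit and 5-digit zip codes stored as text.
contrib <- data.frame(
  contbr_zip = c("600021234", "60002", "600045678", "60004", "60004"),
  contb_receipt_amt = c(100, 50, 200, 25, 75),
  stringsAsFactors = FALSE
)

# Truncate every zip code to its first five digits.
contrib$zip_short <- substr(contrib$contbr_zip, 1, 5)

# Total, mean and count of contributions per 5-digit zip.
total_amt <- tapply(contrib$contb_receipt_amt, contrib$zip_short, sum)
by_zip <- data.frame(
  zip_short = names(total_amt),
  total_amt = as.vector(total_amt),
  mean_amt = as.vector(tapply(contrib$contb_receipt_amt, contrib$zip_short, mean)),
  n = as.vector(tapply(contrib$contb_receipt_amt, contrib$zip_short, length)),
  stringsAsFactors = FALSE
)

# Demographics keyed by region; both join keys must be character.
demo <- data.frame(
  region = c("60002", "60004"),
  per_capita_income = c(33622, 40134),
  stringsAsFactors = FALSE
)
merged <- merge(demo, by_zip, by.x = "region", by.y = "zip_short")

print(merged)
```

Keeping zip codes as character on both sides avoids silent mismatches in the join, which is presumably why the class of each key is checked in the output above.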

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Final Plots and Summary

Plot One

Description One

Plot Two

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_text_repel).

Description Two

Plot Three

## Warning in self$bind(): The following regions were missing and are being
## set to NA: 62532, 60959, 61232, 62885, 62063, 61852, 61928, 62079, 62539,
## 61460, 61416, 61471, 61057, 62982, 62355, 62897, 62277, 62962, 61811,
## 62942, 60557, 62219, 61417, 62434, 60931, 60961, 61724, 61336, 61077,
## 62601, 62289, 62438, 62927, 61539, 62829, 61425, 62819, 62538, 60519,
## 62998, 60536, 62019, 62313, 62378, 61344, 61358, 62961, 60111, 61751,
## 62070, 60407, 61379, 61374, 60973, 61940, 61042, 61812, 61452, 61454,
## 61432, 61050, 61441, 61062, 61769, 61831, 61468, 62023, 61478, 61479,
## 61845, 61480, 61485, 61312, 61313, 61325, 61330, 61337, 61720, 61722,
## 60949, 60929, 62990, 62617, 62996, 61735, 62639, 62359, 62695, 62649,
## 62218, 62671, 62674, 62295, 62280, 62261, 62268, 62330, 62338, 62809,
## 62319, 62323, 62373, 62833, 62346, 62843, 62850, 62356, 62831, 62366,
## 62367, 62422, 62861, 62425, 62426, 62432, 62878, 62011, 62017, 62475,
## 62887, 62892, 62032, 62478, 62481, 62895, 62514, 62547, 62555, 62045,
## 62543, 62544, 62956, 62963, 61871, 62921, 62282, 62085, 61721, 62065,
## 61777, 62988, 62967, 61749, 62808, 62932, 61855, 61625, 61876, 61778,
## 62835, 61321, 61772, 62015, 62836, 62238, 61773, 62253, 62348, 62325,
## 62926, 62894, 62628, 62876, 61044, 62436, 60437, 61815, 62846, 61426,
## 61543, 62553, 61532, 62421, 62365, 61431, 62001, 62886, 62889, 62419,
## 61474, 61544, 62554, 61435, 60912, 60144, 62091, 62621, 62999, 61436,
## 61440, 62874, 62624, 61816, 61439, 61844, 61251, 61283, 60129, 62084,
## 62992, 62643, 61914, 62867, 61459, 61483, 62880, 62818, 62965, 62969,
## 60974, 62082, 62361, 61363, 61750, 61925, 61941, 61419, 61059, 61424,
## 61433, 61775, 61475, 61848, 61524, 61236, 61258, 61263, 61562, 61564,
## 61317, 60910, 61323, 61328, 61331, 61335, 61346, 62093, 62622, 62997,
## 62030, 62248, 62689, 60933, 62250, 62811, 62316, 62336, 62825, 62352,
## 62879, 62891, 62459, 62519, 62048, 62540, 62541, 62953, 62076, 62078,
## 62081, 62610, 60930, 61079, 60934, 60926, 60113, 62570, 61372, 61329,
## 61043, 62672, 61553, 61563, 62663, 62841, 62852, 61516, 62537, 61541,
## 62083, 61851, 61338, 62098, 62983, 61027, 62273, 61810, 62266, 61552,
## 61955, 61332, 60917, 61519, 62357, 60960, 62883, 61324, 63673

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.