Analysis of Amash Amendment Votes

Wired ran an article this week claiming that defense campaign contributions are a stronger predictor of voting against defunding the NSA phone record dragnet than political party. Unfortunately, the way they presented this data doesn't really support the claim: they present a bar chart of average defense contributions against vote. While the difference is large enough that it is likely significant, that seems to me an odd way to defend a claim about predictive power, and they didn't present any statistical test results.

Fortunately, the underlying data is available from MapLight. So I copied the Contributions by Legislator table into a spreadsheet, loaded it into R, and poked around.

Setup

library(ggplot2)
library(data.table)

First, I'll read in the raw CSV file:

votes.raw <- read.csv("amash-votes.csv")

Then turn it into a data table, drop the representatives who did not vote, and give the columns saner names. The remainder of this analysis uses only representatives who cast a ‘Yes’ or ‘No’ vote.

votes <- data.table(votes.raw)[Vote != "Not Voting", list(Name, Party, Money = Oppose, 
    Vote)]

Finally, what we really care about is a binary outcome: did the representative vote ‘No’ (let the NSA carry on)?

votes$VotedNo <- votes$Vote == "No"

Plot distributions

First, it's usually good to get a sense of the distribution of our data. How are defense contributions distributed?

qplot(Money, data = votes, geom = "histogram")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

[Figure: histogram of Money]

Eek, that's skewed. Let's try taking the log:

# Add 1 before logging to deal with $0 donations
votes$LogMoney <- log10(votes$Money + 1)
qplot(log10(Money + 1), data = votes, geom = "histogram")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

[Figure: histogram of log10(Money + 1)]

Much saner. The order of magnitude of defense contributions to representatives seems roughly normally distributed, ignoring those who received no such contributions. That should make the subsequent analysis better-behaved.
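To make the transformation concrete, here is a small sketch of what `log10(Money + 1)` does to a few dollar amounts. The amounts are made up for illustration and are not taken from the MapLight table.

```r
# Illustrative dollar amounts (invented for this example, not real data)
money <- c(0, 9, 999, 99999)

# Adding 1 keeps $0 donations finite: log10(0) is -Inf, log10(0 + 1) is 0
log_money <- log10(money + 1)
log_money
# [1] 0 1 3 5
```

Each unit on the log scale is one order of magnitude, so representatives who received nothing sit at 0, well apart from the main hump of the histogram.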

Regression Modelling

Now, we have a binary outcome (voted ‘No’ on defunding phone records collection) that we want to try to predict. That's the response variable; we want to find out how well the explanatory variables, party and defense contribution level, predict a ‘No’ vote. The standard frequentist tool for asking this question is a logistic regression. So let's do one, on party and the log of contributions (‘Oppose’ is the field for donations from organizations opposed to the amendment, which in this data set is the defense-related businesses):

summary(glm(VotedNo ~ Party + LogMoney, data = votes, family = binomial))

## 
## Call:
## glm(formula = VotedNo ~ Party + LogMoney, family = binomial, 
##     data = votes)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.450  -1.158   0.912   1.067   1.791  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.3787     0.4043   -3.41  0.00065 ***
## PartyR        0.5389     0.2022    2.66  0.00770 ** 
## LogMoney      0.2898     0.0982    2.95  0.00315 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 584.67  on 421  degrees of freedom
## Residual deviance: 564.16  on 419  degrees of freedom
## AIC: 570.2
## 
## Number of Fisher Scoring iterations: 4

We learn a few things. First, both party and contributions are statistically significant predictors (\( p < 0.01 \)). Further, after controlling for party, the level of contributions from defense-related businesses is still a statistically significant predictor of a representative's vote. So the claim that defense contributions predict vote holds up under further analysis.
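Significance alone doesn't say how big the effects are. Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. A minimal sketch, using the two estimates copied by hand from the output above (the variable names `coefs` and `odds_ratios` are mine):

```r
# Coefficient estimates copied from the summary() output above (log-odds scale)
coefs <- c(PartyR = 0.5389, LogMoney = 0.2898)

# exp() of a logistic regression coefficient is an odds ratio
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
#   PartyR LogMoney 
#     1.71     1.34 
```

Read this way, a tenfold increase in defense contributions (one unit of LogMoney) multiplies the odds of a ‘No’ vote by about 1.34 with party held fixed, and being a Republican multiplies them by about 1.71 at a fixed contribution level. With a fitted model object in hand, `exp(coef(model))` would give the same thing directly.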

Now, we want to evaluate the claim that contributions more strongly predict vote than party. This seems plausible, given that contributions are more statistically significant (lower p-value), but I don't know that that's a particularly good basis for the claim, at least alone. Let's compare models based only on party and only on contributions:

summary(glm(VotedNo ~ Party, data = votes, family = binomial))

## 
## Call:
## glm(formula = VotedNo ~ Party, family = binomial, data = votes)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -1.33   -1.06    1.03    1.03    1.30  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -0.291      0.145   -2.00   0.0452 * 
## PartyR         0.645      0.198    3.26   0.0011 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 584.67  on 421  degrees of freedom
## Residual deviance: 573.91  on 420  degrees of freedom
## AIC: 577.9
## 
## Number of Fisher Scoring iterations: 4

summary(glm(VotedNo ~ LogMoney, data = votes, family = binomial))

## 
## Call:
## glm(formula = VotedNo ~ LogMoney, family = binomial, data = votes)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.380  -1.215   0.983   1.096   1.732  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2484     0.3987   -3.13  0.00174 ** 
## LogMoney      0.3305     0.0967    3.42  0.00063 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 584.67  on 421  degrees of freedom
## Residual deviance: 571.30  on 420  degrees of freedom
## AIC: 575.3
## 
## Number of Fisher Scoring iterations: 4

We are now getting to the limits of my understanding of logistic regressions and how to interpret them, but I note three things comparing these two models: the coefficient in the donation model is more significant (its \( p \) value is lower), the intercept in the donation model also has a lower \( p \) value (although its standard error is actually higher, 0.3987 vs. 0.145), and the donation model's AIC is lower (indicating a better model fit, although I'm not sure how much better a difference of 577.9 vs. 575.3 is). Therefore, at least based on my understanding of logistic regressions (and lazily depending on whatever R reports by default rather than digging deeper for predictive discrimination measures), it seems that the claim that vote is more strongly predicted by defense contributions than by party is valid.
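The numbers R already printed allow a slightly more careful comparison. The sketch below copies the AIC and residual deviance values by hand from the three summaries above (the vector names are mine). The two single-predictor models are not nested in each other, so they can only be compared by AIC; each one is nested in the two-predictor model, though, so a likelihood-ratio test on the drop in deviance applies there.

```r
# Values copied from the three summary() outputs above
aic       <- c(party = 577.9,  money = 575.3,  both = 570.2)
resid_dev <- c(party = 573.91, money = 571.30, both = 564.16)

# AIC difference of each model from the best (lowest-AIC) model
delta_aic <- aic - min(aic)
delta_aic

# Likelihood-ratio tests against the two-predictor model: the drop in
# residual deviance is approximately chi-squared with 1 degree of
# freedom (420 - 419 residual df)
dev_drop <- resid_dev[c("party", "money")] - resid_dev[["both"]]
p_values <- pchisq(dev_drop, df = 1, lower.tail = FALSE)
round(p_values, 4)
```

Both p-values come out below 0.01, so each predictor adds something the other lacks. As for the 2.6-point AIC gap between the two single-predictor models, a common rule of thumb treats differences of about 2 or less as negligible, so it is modest evidence in favour of the donation model rather than a decisive win.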

We are, however, trying to compare a binary predictor with a continuous one; I'm not sure if there are additional things that we need to consider to accurately evaluate the claim, but it seems to hold up under at least some more thorough probing.
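One measure that doesn't care whether a predictor is binary or continuous is the area under the ROC curve (AUC): the probability that a randomly chosen ‘No’ voter gets a higher predicted probability than a randomly chosen ‘Yes’ voter. I haven't run this on the real data; the sketch below uses simulated data with made-up parameters, and the `auc` helper and all variable names are my own, purely to show the mechanics.

```r
# Simulated toy data, NOT the MapLight data: every number here is made up
set.seed(1)
n <- 400
party_r   <- rbinom(n, 1, 0.5)             # hypothetical binary predictor
log_money <- rnorm(n, mean = 4, sd = 1)    # hypothetical continuous predictor
voted_no  <- rbinom(n, 1, plogis(-1.4 + 0.5 * party_r + 0.3 * log_money))

# AUC via the rank (Mann-Whitney) formula; rank() averages ties, so tied
# scores count as half a "win"
auc <- function(score, outcome) {
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(rank(score)[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

fit_party <- glm(voted_no ~ party_r, family = binomial)
fit_money <- glm(voted_no ~ log_money, family = binomial)

# In-sample AUC for each single-predictor model (0.5 = chance, 1 = perfect)
c(party = auc(fitted(fit_party), voted_no),
  money = auc(fitted(fit_money), voted_no))
```

On the real data the same helper could be applied to the fitted values of the Party-only and LogMoney-only models with `votes$VotedNo` as the outcome. In-sample AUC is optimistic, so a cross-validated version would be a fairer test of predictive power.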

If there's some problem with my analysis, please let me know in the comments.