Introduction:

The question is whether people change their party preference between the pre-poll survey, where they say whom they will vote for, and the post-poll survey, where they report whom they actually voted for.

We know that political parties spend a lot of money and energy on pre-poll campaigns. It would be interesting to know whether those expenses really help in winning over the voters of rival parties. If so, by how much: can that “winning over” be quantified?

In this reproducible research, we shall examine how many people voted for the candidate they said they would vote for. This will give us an idea of whether the pre-poll propaganda, advertisements, meetings etc. have any effect on people. Do people really change their minds based on this exposure? We shall also examine whether party affiliation gives any insight into which party's supporters are most prone to changing their minds.

Data:

  1. Data Collection: The data was collected from the survey results of the American National Election Studies (ANES), which is conducted before and after the American Presidential Elections.

  2. The Cases: Each row in the file http://d396qusza40orc.cloudfront.net/statistics/project/anes.RData represents data about one voter. However, not all voters answered all the questions.

  3. Variables: The two variables we shall analyze here are prevote_inthsbc and postvote_hsvtbc. Both are categorical variables. As there are NAs in both of these variables, we shall first filter out the records in which either of them is NA.

anes1 <- anes[!is.na(anes$prevote_inthsbc) & !is.na(anes$postvote_hsvtbc), c("prevote_inthsbc", "postvote_hsvtbc")]

Then we check which unique values occur in these two variables. As there is a label called “R Vol: Names On Ballot Card Are Not Correct {Vote Recorded On Next Screen}”, we need to remove the records that contain it, as it indicates a procedural problem during the voting, and not a valid choice of the voter.

unique(anes1$prevote_inthsbc)
## [1] (Republican Candidate)                                                    
## [2] (Democratic Candidate)                                                    
## [3] R Vol: Names On Ballot Card Are Not Correct {Vote Recorded On Next Screen}
## [4] Other Candidate {Specify}                                                 
## [5] (Independent Candidate)                                                   
## 5 Levels: (Democratic Candidate) ... Other Candidate {Specify}
unique(anes1$postvote_hsvtbc)
## [1] (Republican Candidate)                                                    
## [2] (Democratic Candidate)                                                    
## [3] Other Candidate {Specify}                                                 
## [4] R Vol: Names On Ballot Card Are Not Correct {Vote Recorded On Next Screen}
## [5] (Independent Candidate)                                                   
## 5 Levels: (Democratic Candidate) ... Other Candidate {Specify}
anes2 <- anes1[(anes1$prevote_inthsbc != "R Vol: Names On Ballot Card Are Not Correct {Vote Recorded On Next Screen}") & (anes1$postvote_hsvtbc != "R Vol: Names On Ballot Card Are Not Correct {Vote Recorded On Next Screen}"), ]
unique(anes2$prevote_inthsbc)
## [1] (Republican Candidate)    (Democratic Candidate)   
## [3] Other Candidate {Specify} (Independent Candidate)  
## 5 Levels: (Democratic Candidate) ... Other Candidate {Specify}
unique(anes2$postvote_hsvtbc)
## [1] (Republican Candidate)    (Democratic Candidate)   
## [3] Other Candidate {Specify} (Independent Candidate)  
## 5 Levels: (Democratic Candidate) ... Other Candidate {Specify}
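The output above still reports 5 levels because subsetting a factor keeps levels that no longer occur in the data. A minimal sketch of how `droplevels()` removes such unused levels, shown on a made-up toy factor (the values are hypothetical, not the ANES data):

```r
# Toy factor standing in for a vote column (hypothetical data, not ANES)
vote <- factor(c("Dem", "Rep", "Invalid", "Dem", "Other"))

# Subsetting removes the rows but keeps the now-unused "Invalid" level
kept <- vote[vote != "Invalid"]
nlevels(kept)              # still counts the unused level

# droplevels() discards levels that have no remaining observations
kept_clean <- droplevels(kept)
levels(kept_clean)
```

Applied to the real data, this would presumably be `anes2 <- droplevels(anes2)`, which would also remove the zero-count row from later `summary()` output.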

Then we create a numeric variable called change, which has a value of 1 when the two variables prevote_inthsbc and postvote_hsvtbc are the same, and 0 when they are different. Note that, despite its name, change = 1 therefore marks a voter who did not change preference. Then, for convenience, we rename prevote_inthsbc to “pre” and postvote_hsvtbc to “post”.

anes2$change <- ifelse(anes2$prevote_inthsbc == anes2$postvote_hsvtbc, 1, 0)
summary(anes2)   
##                                                                    prevote_inthsbc
##  (Democratic Candidate)                                                    :1391  
##  (Republican Candidate)                                                    :1140  
##  (Independent Candidate)                                                   :  71  
##  R Vol: Names On Ballot Card Are Not Correct {Vote Recorded On Next Screen}:   0  
##  Other Candidate {Specify}                                                 :  16  
##                                                                                   
##                                                                    postvote_hsvtbc
##  (Democratic Candidate)                                                    :1400  
##  (Republican Candidate)                                                    :1162  
##  (Independent Candidate)                                                   :  40  
##  R Vol: Names On Ballot Card Are Not Correct {Vote Recorded On Next Screen}:   0  
##  Other Candidate {Specify}                                                 :  16  
##                                                                                   
##      change     
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :0.924  
##  3rd Qu.:1.000  
##  Max.   :1.000
names(anes2) <- c("pre", "post", "change")
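Because change is a 0/1 indicator, its mean is the share of respondents who stayed with their stated choice, and a cross-tabulation of pre against post shows where the switchers went. A small sketch on invented toy data (the column names mirror anes2, the values are hypothetical):

```r
# Hypothetical toy data with the same column layout as anes2
toy <- data.frame(
  pre  = c("Dem", "Dem", "Rep", "Rep", "Ind", "Ind"),
  post = c("Dem", "Dem", "Rep", "Dem", "Rep", "Ind"),
  stringsAsFactors = FALSE
)
toy$change <- ifelse(toy$pre == toy$post, 1, 0)  # 1 = kept the stated choice

# Rows: stated intention; columns: reported vote
switch_table <- table(pre = toy$pre, post = toy$post)
print(switch_table)

# Mean of the indicator = proportion who did not switch
stay_rate <- mean(toy$change)
```

The diagonal of the table counts the respondents who did not switch; the off-diagonal cells show which choice each switcher moved to.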
  4. Study: This study is an observational one, as we do not have any control over the data-collection method or over the people being surveyed.

  5. Scope of inference - generalizability: The population of interest is American voters. As the data was collected across the states, irrespective of social, economic, geographical or gender status, we take the findings of this study to be applicable to the American voting population as a whole, to the extent that the respondents are representative of it.

  6. Scope of inference - causality: No causality between the two variables of interest can be inferred from this study. The variables represent the preferences of the same person before and during the voting; hence, though they are related, the association is not causal.

Exploratory data analysis:

Now, we treat the anes2 dataset as our population data. Using it, we shall construct a bootstrap distribution for the percentage of people who voted for the same candidate they had declared before voting.

Each of our bootstrap samples has 100 cases, and we take 1000 such samples. For each sample, we sum up the “change” variable, which gives us how many of the 100 cases have the same value in both variables. Because each sample has exactly 100 cases, this sum (in the boot_sum variable) is also the percentage of people who voted for the same candidate that they said they would before the voting.

The histogram and summary of this variable are as follows.

n = 100
boot_sum = rep(NA, 1000)   # one slot per bootstrap sample (the loop runs 1000 times)
for(i in 1:1000) {
   boot_sample = sample(anes2$change, n, replace = TRUE)
   boot_sum[i] = sum(boot_sample) }
hist(boot_sum)

[Figure: histogram of boot_sum]

summary(boot_sum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    84.0    91.0    93.0    92.5    94.0    99.0

The actual population percentage of those who voted for the party they said they would is 92.3988%. The population size is 2618.
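The resampling loop above can also be written with `replicate()`, and a percentile interval can be read off the result with `quantile()`. A minimal sketch, assuming a 0/1 vector `stay` rebuilt from the counts reported in this document (2618 cases, 199 of them switchers) as a stand-in for anes2$change; the seed and the 95% level are arbitrary choices for illustration:

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch

# Stand-in for anes2$change: 2419 ones and 199 zeros (92.4% ones)
stay <- c(rep(1, 2419), rep(0, 199))

n      <- 100    # cases per bootstrap sample
n_boot <- 1000   # number of bootstrap samples

# Each replicate draws n cases with replacement and counts the ones
boot_sum <- replicate(n_boot, sum(sample(stay, n, replace = TRUE)))

# Middle 95% of the bootstrap counts (percentile method)
boot_ci <- quantile(boot_sum, c(0.025, 0.975))
```

Since each sample holds only 100 cases, this interval describes the spread of samples of size 100 and is wider than an interval based on all 2618 cases would be.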

Inference:

  1. The Hypotheses are: The null-hypothesis is - Hn : p = 1 (i.e. everybody votes for the party they said they would)
    The alternate-hypothesis is - Ha : p < 1,
    where p is the proportion of people who do not change their mind, i.e. who vote for the party they said they would in the pre-poll survey.

  2. Condition checking and method selection:
  1. Independence: Random: This condition is met. No case depends on any other case, and the people were not influenced to vote, or not to vote, for the party they said they would.

  2. n < 10% : The sampling is done without replacement, and the total count, n = 2618, is well below 10% of the total American voting population.

  3. Sample size/ skew: Because of our null-hypothesis, Hn, this condition is not met. Although the expected number of successes is np = 2618 * 1 = 2618, the expected number of failures is n(1 - p) = 0 (even though there are 199 failures in the actual dataset). Hence, we shall use ‘simulation’ for inference.
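The sample size check above is simple arithmetic. A short sketch, assuming the usual rule of thumb that both expected counts should be at least 10 (the threshold is an assumption here; the text does not state it):

```r
n  <- 2618   # respondents kept in anes2
p0 <- 1      # proportion claimed by the null hypothesis

expected_successes <- n * p0        # 2618
expected_failures  <- n * (1 - p0)  # 0

# Success-failure condition: both expected counts should be at least 10
normal_approx_ok <- expected_successes >= 10 && expected_failures >= 10
```

With zero expected failures the condition fails, which is why a theoretical (normal) method is not used here.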

  3. Perform Analysis:

Now we shall test the hypotheses and find the p-value for the population and infer.

Then we shall find the confidence intervals for the population as a whole, and then for the people who initially said they would vote for the Democratic, Republican, Independent or Other candidates, respectively.

Hypothesis Test for the population:-

source("inference.R")
inference(anes2$change, type = "ht", null = 1, alternative = "less", method = "simulation", est = "mean")
## Warning: package 'openintro' was built under R version 3.1.1
## Warning: package 'BHH2' was built under R version 3.1.1
## Single mean 
## Summary statistics:
## mean = 0.924 ;  sd = 0.2651 ;  n = 2618 
## H0: mu = 1 
## HA: mu < 1

[Figure: plot produced by inference()]

## p-value =  0

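As the reported p-value is zero (0), we reject the null-hypothesis and infer that not all people actually vote for the party they said they would. This is as expected: under p = 1, even a single respondent who switched would contradict the null-hypothesis.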
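To see why a simulation under this null hypothesis returns a p-value of exactly 0, one can simulate sample proportions with p = 1. This is only an illustrative sketch and is not necessarily how `inference()` works internally; the seed and the number of simulations are arbitrary assumptions:

```r
set.seed(42)  # arbitrary seed for this sketch

n        <- 2618    # sample size reported above
observed <- 0.924   # observed proportion reported above
n_sim    <- 1000    # number of simulated samples (an arbitrary choice)

# Under the null p = 1 every simulated respondent is a "success",
# so each simulated sample proportion equals exactly 1
sim_props <- rbinom(n_sim, size = n, prob = 1) / n

# One-sided p-value: share of simulated proportions at or below the observed one
p_value <- mean(sim_props <= observed)
```

No simulated proportion can fall below 1, so none is as low as the observed 0.924 and the p-value is 0.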

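Now, let us calculate confidence intervals for the proportion of people who vote for the party they said they would: first for the population and then for the Democrats, Republicans, Independents and Others. The calls below use a 99% confidence level for the population and for the Democrats, and a 95% level for the other three groups.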

  1. Confidence interval for the population:-
source("inference.R")
inference(anes2$change, type = "ci", method = "simulation", conflevel = 0.99, est = "mean", boot_method = "perc")
## Single mean 
## Summary statistics:
## mean = 0.924 ;  sd = 0.2651 ;  n = 2618

[Figure: plot produced by inference()]

## Bootstrap method: Percentile
## 99 % Bootstrap interval = ( 0.9102 , 0.937 )

Note:

This confidence interval agrees with the result of the hypothesis test: the upper bound of the interval is 93.7% at the 99% confidence level, so 100% (i.e. p = 1) lies well outside it.

  2. Confidence interval for the people who said before the vote that they would vote for the Democratic candidate:-
inference(anes2[anes2$pre == "(Democratic Candidate)","change"], type = "ci", method = "simulation", conflevel = 0.99, est = "mean", boot_method = "perc")
## Single mean 
## Summary statistics:
## mean = 0.9454 ;  sd = 0.2274 ;  n = 1391

[Figure: plot produced by inference()]

## Bootstrap method: Percentile
## 99 % Bootstrap interval = ( 0.9288 , 0.9605 )
  3. Confidence interval for the people who said before the vote that they would vote for the Republican candidate:-
inference(anes2[anes2$pre == "(Republican Candidate)","change"], type = "ci", method = "simulation", conflevel = 0.95, est = "mean", boot_method = "perc")
## Single mean 
## Summary statistics:
## mean = 0.9342 ;  sd = 0.248 ;  n = 1140

[Figure: plot produced by inference()]

## Bootstrap method: Percentile
## 95 % Bootstrap interval = ( 0.9193 , 0.9482 )
  4. Confidence interval for the people who said before the vote that they would vote for the Independent candidate:-
inference(anes2[anes2$pre == "(Independent Candidate)","change"], type = "ci", method = "simulation", conflevel = 0.95, est = "mean", boot_method = "perc")
## Single mean 
## Summary statistics:
## mean = 0.4648 ;  sd = 0.5023 ;  n = 71

[Figure: plot produced by inference()]

## Bootstrap method: Percentile
## 95 % Bootstrap interval = ( 0.3521 , 0.5775 )
  5. Confidence interval for the people who said before the vote that they would vote for an Other candidate:-
inference(anes2[anes2$pre == "Other Candidate {Specify}","change"], type = "ci", method = "simulation", conflevel = 0.95, est = "mean", boot_method = "perc")
## Single mean 
## Summary statistics:
## mean = 0.375 ;  sd = 0.5 ;  n = 16

[Figure: plot produced by inference()]

## Bootstrap method: Percentile
## 95 % Bootstrap interval = ( 0.125 , 0.625 )
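The per-group intervals above can also be computed in one pass with a single confidence level, which makes the groups directly comparable. A minimal sketch of a percentile bootstrap in base R; the helper `boot_ci()`, the seed and the toy data are all hypothetical and are not the `inference()` function or the ANES data:

```r
set.seed(7)  # arbitrary seed for this sketch

# Percentile bootstrap interval for the mean of a 0/1 vector
boot_ci <- function(x, conf = 0.95, n_boot = 2000) {
  boot_means <- replicate(n_boot, mean(sample(x, length(x), replace = TRUE)))
  alpha <- 1 - conf
  quantile(boot_means, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Hypothetical toy data: group label and stay indicator
toy <- data.frame(
  pre    = rep(c("A", "B"), times = c(200, 40)),
  change = c(rep(c(1, 0), times = c(188, 12)),   # group A: 94% stayed
             rep(c(1, 0), times = c(18, 22)))    # group B: 45% stayed
)

# One interval per group, all at the same confidence level
ci_by_group <- sapply(split(toy$change, toy$pre), boot_ci, conf = 0.95)
```

Each column of `ci_by_group` holds the lower and upper bound for one group. The smaller group gets the wider interval, as with the Independent and Other groups above.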

Conclusion:

From the above study we can conclude that:-
1. People do change their preference for political parties while voting, compared to what they said they prefer in the pre-poll survey.
2. About 92.4% of the voting population vote for the candidate they said they would before the poll, i.e., only 7.6% of people changed their preference.
3. In terms of vote-bank loyalty, the Democrats have the most loyal voters (mean 94.5%), followed by the Republicans (mean 93.4%), the Independents (46.5%) and Other Candidates (37.5%).
4. In other words, the switching is concentrated among voters who had previously decided to vote for the Independent or Other candidates, although these two groups are small (n = 71 and n = 16) and their intervals are wide. Only about 6% of those who intended to vote Republican or Democratic changed their preference. Being observational, this study cannot by itself attribute the switching to the campaigns.
5. It would be worth investigating further which events or campaigns prompted the voters to change their party preference. Based on that, we would be able to identify the switch-factors, and the political parties would be able to strategize their future campaigns to maximize their vote-share.
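The switch rates quoted in the conclusions follow directly from the group means reported by `inference()`. A short arithmetic sketch; the vector names are chosen here for illustration, and the numbers are copied from the summaries above:

```r
# Group sizes and stay rates copied from the inference() summaries above
n_group   <- c(Democratic = 1391, Republican = 1140, Independent = 71, Other = 16)
stay_rate <- c(Democratic = 0.9454, Republican = 0.9342, Independent = 0.4648, Other = 0.375)

# Share of each group that switched away from its stated choice
switch_rate <- 1 - stay_rate

# Size-weighted average of the group stay rates; should match the overall 0.924
overall_stay <- sum(n_group * stay_rate) / sum(n_group)
```

The Democratic and Republican switch rates come out at roughly 5.5% and 6.6%, which is the "about 6%" figure, and the weighted average recovers the overall 92.4%.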