Intro

The purpose of this document is to describe the analysis performed on the Ottawa 2014 election data set, as discussed in the recent ‘Open Data Ottawa’ meetup (http://www.meetup.com/Open-Data-Ottawa/).

A few apologetic notes:
- this analysis was done quickly, for fun. Were it part of a more serious effort, the documentation would be better, and proper software engineering procedures would be adhered to.
- the statistical analyses reported are only the first steps of what could be a fuller analysis.

Now that the fine print is done, on to more substantive matters.

There are many ways to analyze the election data set, and many questions that could be asked. But the key thing that jumped out at me this election was the low voter turnout. With the title of this document in mind, the question addressed here is: what factors determine voter turnout?

In addition to the election data, various pieces of demographic information on hand from previous meetups will be used. Many more pieces of demographic information could be used to expand the models, as could data from previous elections. Perhaps even linking to the Ottawa 311 data set could provide insight, as it is possible that wards with a higher rate of phone calls to the city are more likely to vote against the incumbent, but time is limited…

Data

The primary data set is available at http://data.ottawa.ca/dataset/elections-2014-statement-of-votes-cast-poll-by-poll-results

Given the stated purpose of this analysis, the simplest thing seemed to be to focus on the individual Ward tabs, and ignore the races for mayor and school boards. I.e. we focus on the Ward Councillors.

The data was massaged by hand; the massaged file is available from the author and/or will be posted somewhere soon.

Each row of the dataframe represents one precinct.

The final list of variables used is:
-Electoral
–Ward: the ward number
–Precinct: the name of the precinct
–Advance: a 1 indicates that this was one of the advance voting days, 0 otherwise
–Special: a 1 indicates that this was one of the special voting days, 0 otherwise
–TotalVotes: total number of votes cast in that precinct
–WardMargin: difference between the winner and runner up in the precinct’s ward
–RegisteredVoters: total number of registered voters in the precinct
–Candidates: number of candidates running
–Incumbent: a 1 indicates that the incumbent ran, 0 otherwise
–WardRegisteredVoters: sum of all the registered voters for the precincts in the ward
-Demographic
–WardPop: total population of the precinct’s ward
–WardArea: total land area of the precinct’s ward, in km^2
–WardType: suburban, urban or rural
–WardKidsFrac: fraction of the precinct’s ward’s population that is 15 yrs old or less
–WardAdultsFrac: fraction of the precinct’s ward’s population that is 16 yrs old or more but less than 65.
–WardSeniorsFrac: fraction of the precinct’s ward’s population that is 65 yrs old or more
–WardMaleFrac: fraction of the precinct’s ward’s population that is male
–WardFemaleFrac: fraction of the precinct’s ward’s population that is female

Note that the advance and special polls show their number of registered voters as 0. As it is difficult to know which precinct to assign these votes to, for simplicity we do not use the advance or special polls.

library(dplyr)
library(ggplot2)

# Of course, you will have to change the following to match your directory structure
dat<- read.csv("C:/Research/Ottawa Election/TurnoutData.csv", stringsAsFactors=FALSE)

# The next line of code adds three variables:
# the first is the total number of votes in the precinct divided by the number of registered voters in the precinct and so represents the fraction of eligible voters that actually cast a ballot
# the second is the WardMargin divided by the WardRegisteredVoters and so represents how big the winner's margin was relative to the number of voters.
# the third one adds the population density of the ward

dat<-mutate(dat,votedFrac=TotalVotes/RegisteredVoters,margPct=WardMargin/WardRegisteredVoters,density=WardPop/WardArea)

Next, as explained above, we exclude the special and advance votes. We also exclude any precincts with one or fewer registered voters.

dat2<-dat %>%
  filter(Advance==0,Special==0,RegisteredVoters>1) 
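To see why the advance and special polls have to go before anything is plotted, here is a toy example. The numbers are made up, and base R is used instead of dplyr so that it runs on its own. With 0 registered voters the turnout fraction is a division by zero, which R reports as Inf.

```r
# Toy rows (made-up values): one regular poll and one advance poll
toy <- data.frame(
  Precinct         = c("regular poll", "advance poll"),
  Advance          = c(0, 1),
  Special          = c(0, 0),
  TotalVotes       = c(250, 400),
  RegisteredVoters = c(600, 0)
)

# Same turnout definition as votedFrac above
toy$votedFrac <- toy$TotalVotes / toy$RegisteredVoters
print(toy$votedFrac)   # the second value is Inf, from 400 / 0

# Same filter as above, written in base R
kept <- subset(toy, Advance == 0 & Special == 0 & RegisteredVoters > 1)
print(kept)
```

Only the regular poll survives the filter, so no infinite turnout values reach the plots or the models.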

Some Exploratory Analysis

dat2 %>%
  ggplot(aes(RegisteredVoters,votedFrac)) +
  geom_point(size=3.2)


It seems that there are two subgroups here: precincts with more than 500 registered voters, and those with 500 or fewer. So, we split and analyse separately.

Large Precincts

dat3<-dat2 %>%
  filter(RegisteredVoters>500) 

dat3 %>%
  ggplot(aes(votedFrac)) +
  geom_histogram()


dat3 %>%  
  ggplot(aes(RegisteredVoters,votedFrac)) +
  geom_point(size=3.2)


We’d like to understand which factors, electoral and demographic, may help in predicting turnout. We’ll start with a big linear model (for simplicity) and one by one (in order of insignificance) eliminate the variables that do not have a statistically significant effect.

summary(lm(votedFrac~RegisteredVoters+factor(Candidates)+factor(Incumbent)+WardPop+WardArea+factor(WardType)+WardKidsFrac+WardAdultsFrac+WardSeniorsFrac+WardMaleFrac+WardFemaleFrac+margPct+density,data=dat3))
## 
## Call:
## lm(formula = votedFrac ~ RegisteredVoters + factor(Candidates) + 
##     factor(Incumbent) + WardPop + WardArea + factor(WardType) + 
##     WardKidsFrac + WardAdultsFrac + WardSeniorsFrac + WardMaleFrac + 
##     WardFemaleFrac + margPct + density, data = dat3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17026 -0.03076 -0.00078  0.03421  0.16251 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.53e+00   6.81e-01    2.25  0.02536 *  
## RegisteredVoters         -1.59e-05   5.12e-06   -3.11  0.00216 ** 
## factor(Candidates)3       1.69e-02   2.35e-02    0.72  0.47419    
## factor(Candidates)4       4.52e-02   3.33e-02    1.36  0.17658    
## factor(Candidates)5       6.80e-02   2.18e-02    3.11  0.00212 ** 
## factor(Candidates)6      -5.74e-02   4.09e-02   -1.40  0.16194    
## factor(Candidates)7       4.06e-02   5.49e-02    0.74  0.46073    
## factor(Candidates)9      -1.64e-03   4.96e-02   -0.03  0.97364    
## factor(Candidates)10      2.36e-02   5.56e-02    0.42  0.67219    
## factor(Candidates)11     -1.21e-02   2.96e-02   -0.41  0.68265    
## factor(Incumbent)1        2.00e-02   3.47e-02    0.58  0.56578    
## WardPop                  -1.22e-06   1.28e-06   -0.95  0.34224    
## WardArea                 -1.27e-04   1.45e-04   -0.88  0.37979    
## factor(WardType)suburban -6.72e-02   7.18e-02   -0.94  0.35035    
## factor(WardType)urban    -1.58e-01   9.26e-02   -1.70  0.09015 .  
## WardKidsFrac             -7.16e-01   5.00e-01   -1.43  0.15312    
## WardAdultsFrac           -2.11e-01   5.29e-01   -0.40  0.69080    
## WardSeniorsFrac                 NA         NA      NA       NA    
## WardMaleFrac             -1.37e+00   1.50e+00   -0.91  0.36258    
## WardFemaleFrac                  NA         NA      NA       NA    
## margPct                  -3.86e-01   9.97e-02   -3.87  0.00015 ***
## density                  -3.39e-06   8.30e-06   -0.41  0.68333    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.058 on 214 degrees of freedom
## Multiple R-squared:  0.32,   Adjusted R-squared:  0.259 
## F-statistic: 5.29 on 19 and 214 DF,  p-value: 2.01e-10

We see that two variables are ‘NA’. This is because they are determined by the others (e.g. WardFemaleFrac = 1 - WardMaleFrac), so we eliminate those.

summary(lm(votedFrac~RegisteredVoters+factor(Candidates)+factor(Incumbent)+WardPop+WardArea+factor(WardType)+WardKidsFrac+WardAdultsFrac+WardMaleFrac+margPct+density,data=dat3))
## 
## Call:
## lm(formula = votedFrac ~ RegisteredVoters + factor(Candidates) + 
##     factor(Incumbent) + WardPop + WardArea + factor(WardType) + 
##     WardKidsFrac + WardAdultsFrac + WardMaleFrac + margPct + 
##     density, data = dat3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17026 -0.03076 -0.00078  0.03421  0.16251 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.53e+00   6.81e-01    2.25  0.02536 *  
## RegisteredVoters         -1.59e-05   5.12e-06   -3.11  0.00216 ** 
## factor(Candidates)3       1.69e-02   2.35e-02    0.72  0.47419    
## factor(Candidates)4       4.52e-02   3.33e-02    1.36  0.17658    
## factor(Candidates)5       6.80e-02   2.18e-02    3.11  0.00212 ** 
## factor(Candidates)6      -5.74e-02   4.09e-02   -1.40  0.16194    
## factor(Candidates)7       4.06e-02   5.49e-02    0.74  0.46073    
## factor(Candidates)9      -1.64e-03   4.96e-02   -0.03  0.97364    
## factor(Candidates)10      2.36e-02   5.56e-02    0.42  0.67219    
## factor(Candidates)11     -1.21e-02   2.96e-02   -0.41  0.68265    
## factor(Incumbent)1        2.00e-02   3.47e-02    0.58  0.56578    
## WardPop                  -1.22e-06   1.28e-06   -0.95  0.34224    
## WardArea                 -1.27e-04   1.45e-04   -0.88  0.37979    
## factor(WardType)suburban -6.72e-02   7.18e-02   -0.94  0.35035    
## factor(WardType)urban    -1.58e-01   9.26e-02   -1.70  0.09015 .  
## WardKidsFrac             -7.16e-01   5.00e-01   -1.43  0.15312    
## WardAdultsFrac           -2.11e-01   5.29e-01   -0.40  0.69080    
## WardMaleFrac             -1.37e+00   1.50e+00   -0.91  0.36258    
## margPct                  -3.86e-01   9.97e-02   -3.87  0.00015 ***
## density                  -3.39e-06   8.30e-06   -0.41  0.68333    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.058 on 214 degrees of freedom
## Multiple R-squared:  0.32,   Adjusted R-squared:  0.259 
## F-statistic: 5.29 on 19 and 214 DF,  p-value: 2.01e-10
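The fit is identical to the previous one because R had already dropped the two redundant columns on its own. A minimal sketch of the same effect on simulated data follows; the names male, female and y are made up for illustration and have nothing to do with the real file.

```r
set.seed(1)
n      <- 50
male   <- runif(n, 0.45, 0.55)   # simulated fraction, hypothetical
female <- 1 - male               # fully determined by male
y      <- 0.5 - 0.3 * male + rnorm(n, sd = 0.02)

# With an intercept in the model, female = 1 - male carries no new information
fit <- lm(y ~ male + female)
print(coef(fit))    # the coefficient for female comes back as NA

# alias() reports which terms are linear combinations of the others
print(alias(fit))
```

The same thing happens with the three age fractions, since they add up to 1.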

After several iterations, we end up with

summary(lm(votedFrac~RegisteredVoters+factor(Candidates)+factor(WardType)+WardMaleFrac+margPct,data=dat3))
## 
## Call:
## lm(formula = votedFrac ~ RegisteredVoters + factor(Candidates) + 
##     factor(WardType) + WardMaleFrac + margPct, data = dat3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16935 -0.02967 -0.00087  0.03635  0.16307 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.15e+00   3.31e-01    3.48  0.00060 ***
## RegisteredVoters         -1.57e-05   4.96e-06   -3.17  0.00173 ** 
## factor(Candidates)3       1.33e-03   1.62e-02    0.08  0.93483    
## factor(Candidates)4       2.48e-02   1.99e-02    1.24  0.21460    
## factor(Candidates)5       6.03e-02   1.55e-02    3.90  0.00013 ***
## factor(Candidates)6      -6.30e-02   2.22e-02   -2.83  0.00505 ** 
## factor(Candidates)7      -2.53e-02   1.89e-02   -1.34  0.18045    
## factor(Candidates)9      -3.23e-02   1.93e-02   -1.67  0.09626 .  
## factor(Candidates)10     -2.55e-02   2.32e-02   -1.10  0.27392    
## factor(Candidates)11      1.03e-03   1.97e-02    0.05  0.95842    
## factor(WardType)suburban -4.45e-03   1.73e-02   -0.26  0.79718    
## factor(WardType)urban    -5.01e-02   1.60e-02   -3.13  0.00196 ** 
## WardMaleFrac             -1.45e+00   6.62e-01   -2.19  0.02968 *  
## margPct                  -3.84e-01   8.27e-02   -4.65  5.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0579 on 220 degrees of freedom
## Multiple R-squared:  0.303,  Adjusted R-squared:  0.262 
## F-statistic: 7.37 on 13 and 220 DF,  p-value: 5.84e-12
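One way to automate this kind of one-at-a-time elimination is sketched below on simulated data. It uses drop1() with term-wise F-tests, which treats a factor as a single term, so it is not necessarily the same procedure as reading p-values off the summary table. The 0.05 cutoff and the toy variables x1, x2 and x3 are assumptions.

```r
set.seed(42)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 0.8 * toy$x1 + rnorm(n)   # only x1 has a real effect

fit <- lm(y ~ x1 + x2 + x3, data = toy)

repeat {
  # Stop if there is nothing left to drop
  if (length(attr(terms(fit), "term.labels")) == 0) break

  tab <- drop1(fit, test = "F")                 # F-test for removing each term
  p   <- setNames(tab[["Pr(>F)"]][-1],          # skip the "<none>" row
                  rownames(tab)[-1])

  if (max(p) <= 0.05) break                     # everything left is significant

  worst <- names(p)[which.max(p)]               # least significant term
  fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
}

print(formula(fit))
```

Stepwise elimination like this is convenient but it is still fishing, so the p-values of the surviving terms should not be taken at face value.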

Here is one possible interpretation:
-RegisteredVoters: if there are more voters, each potential voter may think their vote counts less
-WardType: turnout is lower in urban wards than in rural wards (the baseline); suburban wards do not differ significantly from rural ones.
-WardMaleFrac: hmm…more males, fewer votes?
-margPct: the bigger the margin of the winner, the less likely someone is to vote, perhaps because the outcome seems already decided.
-Candidates: perhaps nothing of interest here

Note that these interpretations are after the fact, i.e. they are fishing. They do, however, suggest proper hypothesis tests that could be done on other data, etc.
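To put rough sizes on these effects, the coefficients from the final large-precinct model can be multiplied out by hand. The coefficients are copied from the output above; the step sizes (1000 voters, a 10 point margin) are arbitrary choices for illustration.

```r
# Coefficients copied from the final large-precinct model output
b_registered <- -1.57e-05   # per additional registered voter
b_margin     <- -3.84e-01   # per unit of margPct (a fraction between 0 and 1)
b_urban      <- -5.01e-02   # urban relative to the baseline ward type (rural)

# Change in turnout fraction for illustrative step sizes
effect_1000_voters <- b_registered * 1000   # 1000 more registered voters
effect_10pt_margin <- b_margin * 0.10       # margin larger by 0.10 of registered voters

cat(sprintf("1000 more registered voters: %+.4f\n", effect_1000_voters))
cat(sprintf("margin larger by 0.10:       %+.4f\n", effect_10pt_margin))
cat(sprintf("urban vs rural ward:         %+.4f\n", b_urban))
```

So, holding the other variables fixed, 1000 extra registered voters goes with about 1.6 percentage points less turnout, a 10 point larger margin with about 3.8 points less, and an urban ward with about 5 points less than a rural one.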

Let’s see some plots. Note that smoothers were added to plots where they helped interpretation.

dat3 %>%
  ggplot(aes(RegisteredVoters,votedFrac)) +
  geom_point(size=3.2) +
  geom_smooth()


dat3 %>%
  ggplot(aes(factor(Candidates),votedFrac)) +
  geom_boxplot()


dat3 %>%
  ggplot(aes(factor(WardType),votedFrac)) +
  geom_boxplot()  


dat3 %>%
  ggplot(aes(WardMaleFrac,votedFrac)) +
  geom_point(size=3.2) +
  geom_smooth()


dat3 %>%
  ggplot(aes(margPct,votedFrac)) +
  geom_point(size=3.2) +
  geom_smooth()  


Small Precincts

dat4<-dat2 %>%
  filter(RegisteredVoters<=500) 

dat4 %>%
  ggplot(aes(votedFrac)) +
  geom_histogram()


dat4 %>%  
  ggplot(aes(RegisteredVoters,votedFrac)) +
  geom_point(size=3.2) 


Which precincts have a voter turnout of greater than 100%? Note that this is possible and legal.

dat4 %>%
  filter(votedFrac>1) %>%
  select(Ward,Precinct, RegisteredVoters, TotalVotes)
##   Ward                                               Precinct
## 1    8 08-007 - Bytowne Thorncliffe Place A Seniors Residence
## 2   10                               10-007 - Hunt Club Manor
## 3   13     13-004 - New Edinburgh Square Retirement Residence
## 4   13       13-012 - Chartwell Heritage Retirement Residence
## 5   16       16-015 - Windsor Park Manor Retirement Residence
## 6   18            18-005 - Alta Vista Manor Retirement Living
## 7   23     23-011 - Chartwell Stonehaven Retirement Residence
##   RegisteredVoters TotalVotes
## 1               19         21
## 2               29         35
## 3               62         63
## 4               79         87
## 5               38         61
## 6               38         47
## 7               56         66

These numbers seem reasonable, so there is no cause for alarm.
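For a quick look at how far over 100% these precincts are, the table above can be re-entered by hand. The values are copied from the output; the Extra column is a name made up here for ballots cast minus registered voters.

```r
# Values copied from the table above (precinct codes only, names omitted)
over <- data.frame(
  Precinct         = c("08-007", "10-007", "13-004", "13-012",
                       "16-015", "18-005", "23-011"),
  RegisteredVoters = c(19, 29, 62, 79, 38, 38, 56),
  TotalVotes       = c(21, 35, 63, 87, 61, 47, 66)
)

over$votedFrac <- round(over$TotalVotes / over$RegisteredVoters, 3)
over$Extra     <- over$TotalVotes - over$RegisteredVoters   # ballots beyond the list

# Largest overshoot first
print(over[order(-over$votedFrac), ])
```

The overshoot ranges from 1 to 23 ballots per precinct, with 16-015 the largest.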

Next, let’s build a model for the small precincts.

summary(lm(votedFrac~RegisteredVoters+factor(Candidates)+factor(Incumbent)+WardPop+WardArea+factor(WardType)+WardKidsFrac+WardAdultsFrac+WardSeniorsFrac+WardMaleFrac+WardFemaleFrac+margPct+density,data=dat4))
## 
## Call:
## lm(formula = votedFrac ~ RegisteredVoters + factor(Candidates) + 
##     factor(Incumbent) + WardPop + WardArea + factor(WardType) + 
##     WardKidsFrac + WardAdultsFrac + WardSeniorsFrac + WardMaleFrac + 
##     WardFemaleFrac + margPct + density, data = dat4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5802 -0.1784 -0.0253  0.1702  0.7993 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.87e+00   5.88e+00    0.66    0.513    
## RegisteredVoters         -2.32e-03   4.78e-04   -4.85  7.7e-06 ***
## factor(Candidates)3       1.30e-01   2.34e-01    0.55    0.581    
## factor(Candidates)4       9.25e-02   3.29e-01    0.28    0.780    
## factor(Candidates)5       1.13e-01   2.33e-01    0.49    0.629    
## factor(Candidates)6       2.17e-01   3.32e-01    0.65    0.516    
## factor(Candidates)7       1.01e+00   5.57e-01    1.81    0.075 .  
## factor(Candidates)9       1.07e+00   4.64e-01    2.31    0.024 *  
## factor(Candidates)10      1.15e+00   5.46e-01    2.11    0.039 *  
## factor(Candidates)11      3.05e-01   3.44e-01    0.89    0.378    
## factor(Incumbent)1        6.64e-01   3.62e-01    1.83    0.071 .  
## WardPop                  -5.05e-06   1.06e-05   -0.48    0.634    
## WardArea                 -7.95e-04   2.16e-03   -0.37    0.714    
## factor(WardType)suburban -1.05e+00   1.28e+00   -0.82    0.418    
## factor(WardType)urban    -1.22e+00   1.48e+00   -0.82    0.412    
## WardKidsFrac             -4.29e+00   5.15e+00   -0.83    0.408    
## WardAdultsFrac           -7.54e+00   5.05e+00   -1.49    0.140    
## WardSeniorsFrac                 NA         NA      NA       NA    
## WardMaleFrac              5.76e+00   1.15e+01    0.50    0.617    
## WardFemaleFrac                  NA         NA      NA       NA    
## margPct                   8.75e-01   7.18e-01    1.22    0.227    
## density                   1.03e-04   7.58e-05    1.37    0.177    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.274 on 66 degrees of freedom
## Multiple R-squared:  0.41,   Adjusted R-squared:  0.24 
## F-statistic: 2.41 on 19 and 66 DF,  p-value: 0.0044

Again we eliminate the ‘NA’ terms, and get rid of insignificant factors one by one. We end up with

summary(lm(votedFrac~RegisteredVoters,data=dat4))
## 
## Call:
## lm(formula = votedFrac ~ RegisteredVoters, data = dat4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5001 -0.2070 -0.0549  0.1668  0.9408 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.746557   0.050683   14.73  < 2e-16 ***
## RegisteredVoters -0.002159   0.000409   -5.28  9.8e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.274 on 84 degrees of freedom
## Multiple R-squared:  0.249,  Adjusted R-squared:  0.241 
## F-statistic: 27.9 on 1 and 84 DF,  p-value: 9.81e-07

Almost none of the variables that were of use in the large precinct model are used here. Perhaps the small precincts already have all the ‘demographic’ info represented by the fact that they are a small precinct. But again we see that more registered voters leads to a lower turnout percentage.
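To get a feel for the fitted line, the reported intercept and slope can be plugged in by hand. The coefficients are copied from the output above; the registered-voter counts are arbitrary illustration points, not values from the data.

```r
# Coefficients copied from the small-precinct model output
intercept <- 0.746557
slope     <- -0.002159

# Arbitrary precinct sizes within the 0 to 500 range
registered <- c(20, 50, 100, 200, 300, 500)
predicted  <- intercept + slope * registered

print(data.frame(registered, predicted = round(predicted, 3)))

# Precinct size at which the fitted line reaches zero turnout
zero_at <- -intercept / slope
print(zero_at)
```

The fitted line reaches zero at roughly 346 registered voters and is negative beyond that, so a straight line cannot be right across the whole range of small precincts.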

Let’s put the data for the precincts into the model and see what the output is:

Let’s see some plots here.

dat4 %>%
  ggplot(aes(RegisteredVoters,votedFrac)) +
  geom_point(size=3.2) +
  geom_smooth()


Conclusion / Suggestions for Next Steps

The one conclusion common to both models is that the more registered voters there are in a precinct, the lower the turnout for that precinct. This hypothesis would be worth testing on other data sets.

One of the things glossed over in this document is that none of the models fit the data particularly well. Were work to continue on this data set, the results presented could be used as a starting point for which variables to include in a more sophisticated model (lasso, glm, hierarchical Bayesian, etc).
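As one example of the ‘glm’ option, turnout can be treated as a count of votes out of registered voters and fitted with a binomial model, which keeps predictions between 0 and 1. This is only a sketch on simulated data: the column names registered and votes and all the numbers are made up. Real precinct data would likely need more care, for instance a quasibinomial family for overdispersion and separate handling of precincts with more ballots than registered voters.

```r
set.seed(7)
n          <- 120
registered <- round(runif(n, 600, 3000))              # simulated precinct sizes
p_true     <- plogis(-0.2 - 0.0002 * registered)      # turnout falls with size
votes      <- rbinom(n, size = registered, prob = p_true)
toy        <- data.frame(registered, votes)

# Binomial GLM: successes = votes cast, failures = registered voters who stayed home
fit <- glm(cbind(votes, registered - votes) ~ registered,
           family = binomial, data = toy)
print(coef(summary(fit)))

# Predicted turnout fraction for two hypothetical precinct sizes
new  <- data.frame(registered = c(800, 2500))
pred <- predict(fit, newdata = new, type = "response")
print(pred)
```

Unlike the linear fits, this form cannot predict a negative turnout, whatever the precinct size.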

Another thread would be to get data for previous elections, and/or election data from other regions.

Voter apathy sucks.