1 Introduction

Much has been heralded with the advent of machine learning, from identifying new molecular compounds for medicines to the development of self-driving cars and, more notoriously, algorithm-generated personalized advertising. Machine learning usage in elections is said to have started with Barack Obama’s 2008 presidential campaign, where it was used to optimize e-mail campaigns, to identify the best use of Obama’s time during the campaign, and to mine historical precinct-level voting data to identify when and where his campaign needed to invest resources. How data has been used in US elections is covered in much more detail in “The Victory Lab” by Sasha Issenberg.
In Irish election campaigns, there is little evidence that machine learning has been utilized by any party. This may be due in part to the level of funding that political parties have in Ireland compared to other jurisdictions, meaning there are few, if any, full-time political consultants and data science groups working in Irish elections. But this lack of data scientists in elections may also be due to a paucity of data and a voting structure (PR-STV) which is not as easily modeled as the duopolistic system in the US.

The Irish equivalent of precinct-level voting data is the box tallies. The box tallies are collected collaboratively by volunteers from every political party in Ireland; they record the number of 1st preference votes per candidate per ballot box. Each ballot box is linked to a polling district, which in turn contains (or is contained within) one or more electoral divisions.

In this analysis machine learning is utilized to predict the 2020 Sinn Féin result using the 2016 box tallies, 2016 census data and national opinion poll results leading up to the 2020 Irish general election. Box tallies and nationally reported results from the 2020 general election were used to test the accuracy of the machine learning model.

2 The data

2.1 Box tallies

To be able to link geospatial data, such as CSO data, to box tallies, the tallies must first be re-expressed from ballot boxes to electoral divisions. This was done using the polling schemes from all 40 constituencies to identify which electoral divisions were linked to which ballot boxes.

In a case where several ballot boxes were linked to the same electoral division, the tallies in these boxes were summed.
In a case where there were several electoral divisions linked to a single box, then the tallies in that box were divided proportionally to the number of electors from each electoral division linked to that ballot box.
In cases where the number of electors was not available, then the boxes were divided equally across the number of electoral divisions they were linked to.
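
The three allocation rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the analysis: the function names, the input layout and the example figures are all assumptions made for the sketch.

```python
from collections import defaultdict

def allocate_box(tally, electors_by_ed):
    """Split one ballot box tally across its linked electoral divisions (EDs).

    tally: votes recorded in the box for one candidate.
    electors_by_ed: dict mapping a (hypothetical) ED id to the number of
        electors from that ED assigned to the box, or None if unknown.
        Every box is assumed to be linked to at least one ED.
    """
    eds = list(electors_by_ed)
    counts = list(electors_by_ed.values())
    if any(c is None for c in counts) or sum(counts) == 0:
        # Elector numbers unavailable: divide the box equally.
        return {ed: tally / len(eds) for ed in eds}
    total = sum(counts)
    # Divide the box in proportion to the electors from each ED.
    return {ed: tally * electors_by_ed[ed] / total for ed in eds}

def tallies_by_ed(boxes):
    """Sum the allocated shares of every box into per-ED estimates."""
    totals = defaultdict(float)
    for tally, electors_by_ed in boxes:
        for ed, share in allocate_box(tally, electors_by_ed).items():
            totals[ed] += share
    return dict(totals)

# Illustrative boxes only; these are not real tallies.
boxes = [
    (120, {"ED_A": 300, "ED_B": 100}),   # proportional split: 90 and 30
    (50, {"ED_A": None, "ED_C": None}),  # electors unknown: 25 and 25
    (40, {"ED_B": 500}),                 # whole box belongs to ED_B
]
estimates = tallies_by_ed(boxes)
```

Because boxes are split rather than counted directly, the per-division figures produced this way are estimates, not exact counts.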

Due to this method of expressing ballot box tallies to electoral divisions, the tallies for electoral divisions should be considered an estimate of the number of votes in electoral divisions. To normalize the tallies for comparison across all electoral divisions, the proportion of votes per candidate was calculated by dividing the number of votes per candidate by the total number of valid votes per electoral division. Table 1 summarizes the number of electoral divisions per constituency included in this analysis. An interactive map of the underlying data can be viewed on this dashboard.
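
The normalization step amounts to a simple division. A minimal sketch follows, assuming the estimated tallies for one electoral division are held in a dictionary keyed by candidate; the candidate names and vote counts are placeholders.

```python
def vote_shares(votes_by_candidate):
    """Convert estimated votes per candidate into proportions of the valid vote."""
    total_valid = sum(votes_by_candidate.values())
    if total_valid == 0:
        # No valid votes estimated for this division: return zero shares.
        return {candidate: 0.0 for candidate in votes_by_candidate}
    return {candidate: votes / total_valid
            for candidate, votes in votes_by_candidate.items()}

# Placeholder tallies for a single electoral division.
shares = vote_shares({"Candidate A": 115.0, "Candidate B": 230.0, "Candidate C": 115.0})
```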

2.2 Geospatial data

Every variable in the 2016 Irish census was converted to the percentage of people per electoral division and included in the modeling. The number of people of each age within an electoral division was also included, as the proportion of people in each generational group. Population density was calculated from the total number of people per electoral division and the land area of the electoral division, which was derived from shapefiles of the electoral divisions. Table 2 summarizes each variable included in the modeling.
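
As a rough illustration of this feature construction, the sketch below turns raw census counts for one electoral division into percentages and adds a population density. The variable names, the counts and the use of square kilometres for land area are assumptions for the example, not details taken from the census files.

```python
def census_features(variable_counts, population, area_km2):
    """Build model features for one electoral division.

    variable_counts: dict of census variable name -> number of people.
    population: total number of people in the division.
    area_km2: land area of the division (assumed here to be in km^2).
    """
    # Express each census count as a percentage of the division's population.
    features = {name: 100.0 * count / population
                for name, count in variable_counts.items()}
    # People per unit of land area.
    features["population_density"] = population / area_km2
    return features

# Hypothetical division with 1,000 residents on 2.5 km^2.
features = census_features(
    {"born_in_ireland": 800, "age_65_plus": 150},
    population=1000,
    area_km2=2.5,
)
```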

2.3 Opinion poll data

Opinion poll data leading up to the 2020 general election was utilized in this analysis. Specifically, the polling average from the [Irish Polling Indicator](https://www.pollingindicator.com/) in the weeks leading up to the 2020 Irish general election was used. A summary of the national opinion polls between the 2016 and 2020 Irish general elections is shown in figure 2.1.

Figure 2.1: National opinion polls between the 2016 and 2020 Irish general elections

3 The model

There are many things to consider when choosing the model which best fits the data one has.

What is the nature of the data we’re using to predict?
What is the variable we’re trying to predict? etc.

In this analysis the data being used to make the predictions exhibit a lot of collinearity, while the value we’re trying to predict (% of 1st preference) is constrained between 0 and 1. Given these considerations, a non-linear model was preferred, as was one which would also have a mechanism for updating predictions using national opinion polls.

The model chosen was a random forest, where the dependent variable was the % 1st preference vote for Sinn Féin from the 2016 general election in each electoral division in our data set. The predictors were all 446 2016 census variables described in table 2. Random forests are non-linear, and have reasonable robustness to collinearity. A special case of random forest was chosen that assumed a beta distribution of the dependent variable (i.e. it assumed the variable we’re trying to predict, the % 1st preference value for Sinn Féin, follows a distribution bounded between the values 0 and 1) and also allowed for the output of quantile regression predictions.

Quantile regression returns a range of predictions for each electoral division, which can express our confidence in the prediction. For this analysis, the national opinion poll average is used to select the percentile of predictions from the random forest quantile regression model whose mean best matches the national opinion poll for a particular time period.

For example, if the national opinion poll average was 20%, the percentile of predictions which had a national mean prediction across all 3,409 electoral divisions of 20% was chosen as the predictions to utilize. Figure 3.1 shows the mean and standard deviation of the 5th to 95th percentile of quantile regression predictions from the random forest model for all 3,409 electoral divisions. If we use the national poll average for Sinn Féin the week of the 2020 general election (which was 22.5% on the 2nd of February 2020), the percentile of predictions with a mean closest to this value is the 92nd percentile of predictions. This puts the result of the 2020 election for Sinn Féin, near the upper bounds of this model, which only has the 2016 election results to learn from.
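
The percentile selection described above can be sketched as follows. This is an illustration under assumptions: the quantile predictions are taken to be available as a dictionary from percentile to a list of per-division predicted shares, the national figure is taken as the unweighted mean across divisions, and the numbers are invented to keep the example small.

```python
from statistics import mean

def select_percentile(quantile_predictions, poll_average):
    """Return the percentile whose mean prediction is closest to the poll average.

    quantile_predictions: dict of percentile -> list of predicted shares,
        one per electoral division (proportions between 0 and 1).
    poll_average: national poll average as a proportion.
    """
    return min(
        quantile_predictions,
        key=lambda p: abs(mean(quantile_predictions[p]) - poll_average),
    )

# Invented predictions for three divisions at four percentiles.
quantile_predictions = {
    50: [0.10, 0.14, 0.12],
    75: [0.15, 0.19, 0.17],
    92: [0.21, 0.24, 0.225],
    95: [0.24, 0.27, 0.25],
}
chosen = select_percentile(quantile_predictions, poll_average=0.225)
```

With these invented numbers the 92nd percentile is selected, mirroring the result reported for the week of the 2020 election.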

Figure 3.1: Mean and standard deviation of percentile predictions from the random forest model with quantile regression

3.1 Model accuracy

How accurate would this model have been the week of the election?
Figure 3.2 shows the accuracy of the model for predicting the % 1st preference for Sinn Féin across constituencies. The R² for the model (a measure of how well the model predictions fit the actual results) was 0.74, while the root mean squared error (a measure of the typical size of the model’s error) was 0.0946. These numbers loosely translate to the model explaining 74% of the variation between constituencies, with an average error of +/- 9.5 percentage points. The red dotted line in figure 3.2 represents a perfect fit between the predicted and actual results. From this graph we can see that the model under-predicted results at both the high end and the lower end of constituency 1st preference for Sinn Féin.
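
For readers unfamiliar with the two measures, the sketch below shows one common way of computing them. It assumes the coefficient-of-determination definition of R² (one minus the ratio of residual to total sum of squares); the analysis may have used a different variant, such as the squared correlation. The input values are invented.

```python
from math import sqrt

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))

# Invented constituency-level shares (proportions between 0 and 1).
actual = [0.10, 0.20, 0.30, 0.40]
predicted = [0.15, 0.20, 0.30, 0.35]
fit = r_squared(actual, predicted)
error = rmse(actual, predicted)
```

Because the shares are proportions, an RMSE of 0.0946 corresponds to roughly 9.5 percentage points.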

Figure 3.2: Model accuracy for predicting constituency % 1st preference for Sinn Féin

How about the model’s accuracy at predicting results within constituencies?
Figure 3.3 shows the accuracy of predicting the % 1st preference within constituencies. Within-constituency accuracy was much more variable, with many constituencies showing very high accuracy, in excess of 80%, while in others the model had virtually no predictive ability at all. Table 3 summarizes the accuracy measures for all constituencies.

Figure 3.3: Model accuracy for predicting intra-constituency % 1st preference for Sinn Féin

The measures of accuracy shown in the previous figures and tables are concerned with the absolute accuracy of the model. However, in a PR-STV system, one is less concerned with the actual % 1st preference than with whether that number is converted into a seat in that constituency. In the Irish PR-STV system the proportion of the vote required to earn a seat is known as the quota. This quota can be roughly converted to a % by dividing 100 by the number of seats available plus one (for example, 100 / (4 + 1) = 20% in a four-seat constituency). The paper “Mining the ballot” by Dr Kevin Cunningham provided a model to determine the probability of being elected given the proportion of the quota received on the first count. Figure 3.4 represents the model from “Mining the ballot”.

Figure 3.4: Graphical representation of the equation outlined by Cunningham from Mining the ballot 2018

This equation indicated that at 60% of the quota a candidate has a >70% chance of being elected. This 60% threshold is used to determine whether or not a candidate was elected using the predictions from the random forest model. Where the party received more than the quota, 2 candidates are assumed to have been elected at 1.6 times the quota, and 3 at 2.6 times the quota.
Using this method for measuring the accuracy of the model, the accuracy is 90%, with 36 out of 40 constituencies being accurately predicted for Sinn Féin. No constituency had more seats for Sinn Féin than the model predicted, while 4 constituencies had fewer seats than the model predicted. Overall the model predicted 40 seats for Sinn Féin, while Sinn Féin only received 37 seats. This difference between the predicted and actual number of seats may in part be due to Sinn Féin not running enough candidates in the general election. However, in three of the four constituencies which were not accurately predicted (Cork South West, Limerick County and Galway East), Sinn Féin did not have a single candidate elected, while in the fourth (Dublin South Central) 2 seats were predicted for Sinn Féin but only one candidate was elected. Figure 3.5 and table 4 summarize the accuracy of the model when the probability of gaining a seat is considered.
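
The seat conversion rule can be written down compactly. The sketch below is an illustration under assumptions: the text only states thresholds for one, two and three seats (0.6, 1.6 and 2.6 quotas), so extending the same pattern to further seats and capping at the number of seats available are choices made here, and the number of candidates actually run is ignored.

```python
def quota_percent(seats):
    """Approximate quota as a % of the valid vote: 100 / (seats + 1)."""
    return 100 / (seats + 1)

def predicted_seats(first_pref_percent, seats):
    """Convert a party's predicted % 1st preference into a seat count.

    One seat is assumed at 0.6 quotas, two at 1.6, three at 2.6, and so on
    (the pattern beyond three seats is an extrapolation for this sketch).
    """
    quotas = first_pref_percent / quota_percent(seats)
    won = 0
    while won < seats and quotas >= won + 0.6:
        won += 1
    return won

# Hypothetical four-seat constituency, where the quota is 20%.
examples = {share: predicted_seats(share, seats=4) for share in (10, 25, 33, 55)}
```

In this hypothetical four-seat constituency, 10% falls short of the 12% needed for a seat, 25% and 33% clear the one- and two-seat thresholds respectively, and 55% clears the three-seat threshold.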

Figure 3.5: Prediction accuracy of model when considering the probability of winning a seat

3.2 Under the hood

While many machine learning models are black boxes that offer little insight into which variables are important, this is not the case for random forests. Figure 3.6 shows the top twenty most important variables used by the model to make its predictions.

The x-axis in this figure is the magnitude by which the R² of the model changes when that variable is added or removed. The greatest variable importance is for the % of people in an electoral division who were born in Ireland; however, the importance of the variable is very small, with an impact of 0.0005 on the R², or a 0.05% increase in accuracy of the model when this variable is included. This means that overall no one variable had a huge impact on the model, but the combination of multiple variables led to an accurate model for Sinn Féin voter behavior overall.

Figure 3.6: Top twenty most important variables in the model

4 Comparing to a naive prediction approach

The machine learning model appears to have good accuracy in determining the number of seats for Sinn Féin in the 2020 election. But how does this compare to a simpler, more naive approach? How does machine learning compare to simply multiplying the 2016 results by the change in public support since the 2016 general election? In 2016 Sinn Féin received 13.8% of the national vote; the week of the election in 2020 their polling average in national opinion polls was 22.5%. This translates to a 63% increase (22.5 / 13.8 ≈ 1.63) in public support for Sinn Féin in the period between the 2016 election and a week before the 2020 election.
The correlation between the 2016 and 2020 % 1st preference support for Sinn Féin at the electoral division level was 86%. Figure 4.1 shows the relationship between the 2020 and 2016 % 1st preference support for Sinn Féin at the electoral division level.
As this relationship is so strong, we might expect that a naive approach to predicting the results by simply multiplying the 2016 results by a factor of 1.63 may be quite accurate.
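
The naive approach is small enough to state in full. The sketch below uses the two national figures quoted above; the function names are hypothetical, and capping the result at 1.0 is an assumption added so that the output stays a valid proportion.

```python
NATIONAL_SHARE_2016 = 0.138   # Sinn Féin national vote in 2016
POLL_AVERAGE_2020 = 0.225     # polling average the week of the 2020 election

def uplift_factor(share_then=NATIONAL_SHARE_2016, poll_now=POLL_AVERAGE_2020):
    """Ratio of current national support to support at the last election."""
    return poll_now / share_then

def naive_prediction(share_2016):
    """Scale a division's 2016 share by the national uplift, capped at 1.0."""
    return min(share_2016 * uplift_factor(), 1.0)

factor = uplift_factor()              # roughly 1.63
example = naive_prediction(0.20)      # a division at 20% in 2016
```

A division where Sinn Féin took 20% of 1st preferences in 2016 would be predicted at roughly 32.6% in 2020 under this approach.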

Figure 4.1: Relationship between 2020 and 2016 % 1st preference support for Sinn Féin at the electoral division level

The naive approach predicts the constituency results for Sinn Féin in 2020 with strong accuracy. Figure 4.2 shows how the machine learning and naive predictions compare. The R² for the naive approach is higher than that of the machine learning approach, at 0.82 compared to 0.76. Table 5 shows the accuracy measures for the naive and machine learning approaches.

Figure 4.2: Comparison between machine learning and naive approach to predicting the % 1st preference support for Sinn Féin in each constituency

Considering the likelihood of winning a seat, the naive approach performs similarly to the machine learning approach, predicting 41 seats, compared to 40 for the machine learning approach. Both approaches over-predicted the number of seats for Sinn Féin. However, in the constituencies where the naive approach over-predicted, Sinn Féin ran fewer candidates than they could have given the level of support they received on the day of the election.
Table 6 compares the performance of the naive and machine learning approach for predicting where Sinn Féin would win seats.

Even within constituencies the naive approach generally performed better than the machine learning approach, although in some cases the machine learning approach was better at predicting support within constituencies than the naive approach. The difference in performance was not large, and overall the two models performed well at predicting the % 1st preference for Sinn Féin at the electoral division level within constituencies. Figure 4.3 shows the accuracy of both approaches in predicting 2020 Sinn Féin % 1st preference for electoral divisions within constituencies. Table 7 compares the accuracy measures for the naive vs machine learning model within constituencies.

Figure 4.3: Relationship between predicted Sinn Féin 1st preference % within constituencies and actual % 1st preference at the electoral division level using two approaches

5 Why use machine learning if a naive approach performs as well?

If the naive approach, which simply multiplies the last election’s results by the change in public opinion, works better than a complex machine learning approach, then why use machine learning at all?
Firstly, the machine learning approach that was used is useful not only for prediction but also for inference. The variable importance shown in figure 3.6 gives us an indication of the demographics for and against Sinn Féin. This can be key in understanding the motivations for why people voted for Sinn Féin in the 2020 election and could be a supplement to traditional opinion polls.
Secondly, the machine learning approach can be useful in situations where we have no historical data, such as in a constituency where a party didn’t run a candidate in a recent election, or at a smaller geographical scale than electoral divisions, such as small areas.

5.1 Small area predictions

The CSO describes small areas as follows:

Small Areas are areas of population generally comprising between 80 and 120 dwellings created by The National Institute of Regional and Spatial Analysis (NIRSA) on behalf of the Ordnance Survey Ireland (OSi) in consultation with CSO. Small Areas were designed as the lowest level of geography for the compilation of statistics in line with data protection and generally comprise either complete or part of townlands or neighbourhoods. There is a constraint on Small Areas that they must nest within Electoral Division boundaries. Small areas were used as the basis for the Enumeration in Census 2016. Enumerators were assigned a number of adjacent Small Areas constituting around 400 dwellings in which they had to visit every dwelling and deliver and collect a completed census form and record the dwelling status of unoccupied dwellings. The small area boundaries have been amended in line with population data from Census 2016.

As such, small areas have all of the same census variables as electoral divisions. This means that predictions using the machine learning model can be produced down to the small area level. Figure 5.1 shows the predictions of Sinn Féin % 1st preference support at the small area level for Dublin Mid West. This map would be analogous to how precinct-level data in the US has been used to identify where probable voters are located.

Figure 5.1: Small area predictions of Sinn Féin % 1st preference in the constituency of Dublin Mid West

6 Conclusion

The performance of the naive approach may be explained by the strong correlation between the 2016 and 2020 results for Sinn Féin, which indicates that the same geographical areas voted for Sinn Féin in both elections, but in larger numbers than in 2016.
While the machine learning approach to predicting the 2020 election results for Sinn Féin was outperformed by a simpler naive approach, a machine learning approach can still be a useful tool in understanding voter behavior and as a campaign tool.

Can machine learning help us predict Irish elections?

Dr Ian Richardson : e-mail irichard@tcd.ie