Earlier this week I published estimates of support for same-sex marriage in each of Australia’s 150 CEDs based on the ABC’s 2016 Vote Compass (VC) survey. My analysis — drawing strongly on joint work with Dr Shaun Ratcliff and Luke Mansillo — used a variety of post-stratification methods.
With the release of the results of the Australian Marriage Law Postal Survey (AMLPS), we can compare the performance of these post-stratification techniques. My prior expectation is that the post-stratified VC estimates will perform well with respect to the rank-ordering of CEDs by their level of support for SSM in the AMLPS.
We examine the performance of the BART + Bayes estimates (dark grey), the raked estimates (red) and the results of the AMLPS released today by the ABS (blue).
library(scales)
plotData$division <- reorder(plotData$division,plotData$yes_per/100,mean)
library(ggplot2)
ggplot(data=plotData,
aes(y=division,x=p.bayes,xmin=p.bayes.lo,xmax=p.bayes.up,color="BART + Bayes")) +
geom_vline(xintercept = .5,col=gray(.45)) +
geom_errorbarh(height=0,color=gray(.35)) +
geom_point() +
geom_point(inherit.aes = FALSE,aes(x=yes_per/100,y=division,color="AMLPS")) +
geom_point(inherit.aes = FALSE,aes(x=p.raked,y=division,color="Raked")) +
scale_y_discrete("") +
scale_x_continuous("Percentage supporting SSM",
labels = percent,
breaks=seq(.3,.9,by=.1)) +
scale_color_manual(name=" ",
values=c("BART + Bayes"=gray(.45),
"AMLPS"="blue",
"Raked"="red")) +
theme(panel.border = element_blank(),
legend.position = 'top')
Comparison of AMLPS results (blue), BART + Bayes estimates (grey) and raked VC estimates (red).
The following scatterplot matrix shows the relationships between the various post-stratified estimates and the actual results; the correlations reported are Spearman rank-order correlations.
## panel function for pairs(): display the (absolute) Spearman rank correlation
## between the two variables in the upper panels
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use = "pairwise", method = "spearman"))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)  ## computed but cex is fixed at 2 below
  text(0.5, 0.5, txt, cex = 2)
}
pairs(cbind(plotData$p.bayes,
plotData$p.glmer,
plotData$p.bart,
plotData$p.nnet,
plotData$p.glm,
plotData$p.raked,
plotData$naive,
plotData$yes_per/100),
labels=c("BART\n+ Bayes",
"GLMER","BART","nnet","logistic\nregression",
"raking","naive","AMLPS\n(actuall)"),
lower.panel=function(x,y){
abline(0,1)
abline(coef=coef(lm(y~x)),col="red")
points(x,y,pch=1)
},
upper.panel = panel.cor)
The raked estimates perform quite well relative to other more complex methods of statistical adjustment. Indeed, if a rank ordering of CEDs is all we seek, then even the naive estimates based on the raw, unadjusted VC responses have tolerable performance.
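Since the rank-ordering claim rests on the Spearman correlations shown in the matrix above, they can also be pulled out directly; a minimal sketch, assuming plotData holds the columns used in the pairs() call:

## Spearman rank correlation of each set of estimates with the AMLPS "Yes" share
ests <- c("p.bayes", "p.glmer", "p.bart", "p.nnet", "p.glm", "p.raked", "naive")
sapply(plotData[, ests],
       cor, y = plotData$yes_per/100, use = "pairwise", method = "spearman")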
ols <- lm(yes_per~I(p.bayes*100),data=plotData)
b <- coef(ols)
In the next scatterplot I focus on the relationship between the BART + Bayes estimates and the AMLPS results; the dark line is a 45 degree line, and if the BART + Bayes estimates perfectly predicted the AMLPS results, the data would all fall on this line. Points below the 45 degree line are overestimates, where the BART + Bayes prediction overstates the AMLPS result; conversely, points above the 45 degree line are underestimates.
The regression (red line) has an intercept of 0.65 and a slope of 1.08, indicating that the BART + Bayes estimates generally underestimated SSM support (as expected, given that respondents reporting "don't know" on VC were neither dropped nor reallocated), though only by less than a percentage point on average, and that the underestimation is larger among CEDs with higher levels of support for SSM. This general pattern holds up despite the large overestimates of SSM support in a number of Western Sydney CEDs. Indeed, the regression result is a little misleading, masking a general pattern of roughly a 5 percentage point underestimate, save for those large overestimates.
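As a rough check on that reading, and assuming plotData is as constructed above, a summary of the signed errors separates the typical underestimate from the handful of large overestimates:

## signed prediction errors: positive values are underestimates of the AMLPS "Yes" share
err <- plotData$yes_per/100 - plotData$p.bayes
summary(err)   ## the mean reflects both the many small underestimates and the few big overestimates
median(err)    ## a more robust read on the "typical" underestimate
mean(err > 0)  ## proportion of CEDs where support was underestimated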
refData <- data.frame(x=c(0,100),y=c(0,100))
olsData <- data.frame(x=seq(20,90,length=101))
olsData$y <- b[1] + b[2]*olsData$x
library(highcharter)
hc <- highchart(width=800,height=800) %>%
hc_add_theme(hc_theme_ggplot2()) %>%
hc_add_series(data = refData,
id="45 degrees",
name="45 degrees",
marker=list(enabled=FALSE,radius=0),
enableMouseTracking=FALSE,
type="line") %>%
hc_add_series(data = olsData,
##id="Regression",
name="Regression",
marker=list(enabled=FALSE,radius=0),
enableMouseTracking=FALSE,
type="line") %>%
hc_add_series_scatter(x=plotData$p.bayes*100,
y=plotData$yes_per,
marker=list(symbol="square",
fillColor="#ccc",
lineWidth=1,
lineColor="#555"),
division=as.character(plotData$division),
st=plotData$State,
error=round(plotData$yes_per - plotData$p.bayes*100,1)) %>%
hc_tooltip(crosshairs=TRUE,
useHTML = TRUE,
shared = FALSE,
backgroundColor ="rgba(0,0,0,.55)",
style=list("z-index"="9998",
color="white",
fontWeight="normal",
fontFamily="Avenir Next, Roboto, Helvetica, sans-serif"),
valueDecimals = 1,
headerFormat = "<table>",
pointFormat=paste("<tr><th>{point.division} ({point.st})</th></tr>",
"<tr><th>Error: {point.error}</th></tr>"),
footerFormat = "</table>") %>%
hc_chart(zoomType = "xy") %>%
hc_title(text="AMLPS results by electoral division, vs survey-based predictions",
align="left") %>%
hc_xAxis(title=list(text="BART + Bayes prediction",
style=list(color="#111")),
tickPositions=seq(30,80,by=10),
startOnTick=FALSE,
endOnTick=FALSE,
min=25,max=85) %>%
hc_yAxis(title=list(text="AMLPS result",
style=list(color="#111")),
crosshair=TRUE,
tickPositions=seq(30,80,by=10),
startOnTick=FALSE,
endOnTick=FALSE,
min=25,max=85)
hc$x$hc_opts$series[[1]]$enableMouseTracking <- FALSE
hc$x$hc_opts$series[[2]]$enableMouseTracking <- FALSE
##hc$x$hc_opts$series[[1]]$showInLegend <- FALSE
##hc$x$hc_opts$series[[2]]$showInLegend <- FALSE
tlist <- list(hc)
htmltools::tagList(tlist)
library(reshape2)
estimates <- melt(plotData,
id.vars=c('division','yes_per'),
measure.vars=c("p.bayes",
"p.glmer",
"p.bart",
"p.nnet",
"p.glm",
"p.raked",
"naive"))
library(plyr)
library(dplyr)
mse <- estimates %>%
mutate(e = yes_per/100 - value) %>%
group_by(variable) %>%
summarise(mse = mean(e^2,na.rm=TRUE),
rmse = sqrt(mse),
mad = median(abs(e),na.rm=TRUE)) %>%
arrange(rmse)
library(knitr)
library(kableExtra)
kable(mse[,c("variable","rmse","mad")],
digits=4,
format="html",
col.names = c("Estimator","RMSE","MAD")) %>%
kable_styling(full_width=FALSE)
| Estimator | RMSE | MAD |
|---|---|---|
| p.raked | 0.0702 | 0.0627 |
| p.bayes | 0.0840 | 0.0771 |
| p.bart | 0.0843 | 0.0773 |
| p.glmer | 0.0847 | 0.0775 |
| p.glm | 0.0864 | 0.0745 |
| p.nnet | 0.0900 | 0.0761 |
| naive | 0.1150 | 0.0763 |
On both RMSE and MAD, we again see the raked estimates outperforming the various MRP-like approaches. The naive estimates, based on no statistical adjustment at all but simply taking the VC results at face value, have poor RMSE but perform at least as well as most of the adjustment methods on the MAD criterion.
The following dotplot compares errors for various estimators ordered by the magnitude of the error for the BART + Bayes estimator. Generally, we see slightly smaller errors for the raked estimator than the BART + Bayes estimator.
estimates$e <- estimates$yes_per/100 - estimates$value
indx <- order(abs(estimates$e[estimates$variable=="p.bayes"]))
levs <- estimates$division[estimates$variable=='p.bayes'][indx]
estimates$division <- factor(estimates$division,levels=levs,ordered = TRUE)
library(ggplot2)
ggplot(data=subset(estimates,variable %in% c("p.bayes","p.raked")),
aes(y=division,x=e,color=variable)) +
geom_point() +
scale_y_discrete("") +
scale_color_manual(values = c("p.bayes"="blue","p.raked"="red"),
labels=c("BART + Bayes",
"Raking (survey)")) +
scale_x_continuous("Error",labels=percent,breaks=seq(-.2,.2,by=.05)) +
guides(color=guide_legend(title="",ncol = 2)) +
theme(panel.border = element_blank(),
legend.position = "top")
tmp <- dcast(subset(estimates,variable %in% c("p.bayes","p.raked","naive")),
division + yes_per ~ variable)
tmp$p.raked <- 100*tmp$p.raked
tmp$p.bayes <- 100*tmp$p.bayes
tmp$p.naive <- 100*tmp$naive
tmp$e.bayes <- tmp$yes_per - tmp$p.bayes
tmp$e.raked <- tmp$yes_per - tmp$p.raked
tmp$e.naive <- tmp$yes_per - tmp$p.naive
indx <- order(abs(tmp$e.bayes),decreasing = TRUE)
tmp$state <- ddata$State[match(tmp$division,ddata$division)]
Only in 29 seats does the BART + Bayes estimate have a smaller error than the raked estimate.
The misses in some seats are quite large, especially in Western Sydney seats where the "Yes" vote is low and far below the level predicted by the adjusted VC estimates.
The general pattern is one of under-estimation of support for SSM, reflecting the fact that "don't know" responses on VC were neither dropped nor reallocated in my analysis.
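That count is a simple comparison of absolute errors; a minimal check, using the tmp data frame constructed above:

## number of divisions where the BART + Bayes absolute error beats the raked absolute error
sum(abs(tmp$e.bayes) < abs(tmp$e.raked), na.rm = TRUE)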
Divisions with the 20 largest errors for the BART + Bayes estimates are reported below:
library(knitr)
library(kableExtra)
kable(tmp[indx[1:20],
c("division","state",
"yes_per",
"p.bayes","e.bayes",
"p.raked","e.raked",
"p.naive","e.naive")],
format = "html",
col.names = c("Division","State",
"AMLPS Yes",
"Estimate","Error",
"Estimate","Error",
"Estimate","Error"),
digits=1,row.names = FALSE) %>%
kable_styling(full_width=FALSE) %>%
add_header_above(header=c(" "=3,
"BART + Bayes"=2,
"Raked"=2,
"Naive"=2))
| Division | State | AMLPS Yes | BART + Bayes Est. | BART + Bayes Error | Raked Est. | Raked Error | Naive Est. | Naive Error |
|---|---|---|---|---|---|---|---|---|
| Watson | NSW | 30.4 | 53.6 | -23.2 | 44.1 | -13.7 | 66.7 | -36.3 |
| Blaxland | NSW | 26.1 | 49.0 | -22.9 | 45.9 | -19.8 | 59.2 | -33.1 |
| Corangamite | VIC | 71.6 | 56.2 | 15.4 | 59.6 | 12.0 | 75.7 | -4.1 |
| McMahon | NSW | 35.1 | 50.1 | -15.0 | 46.9 | -11.8 | 62.5 | -27.4 |
| Werriwa | NSW | 36.3 | 50.7 | -14.4 | 45.5 | -9.2 | 64.1 | -27.8 |
| Fowler | NSW | 36.3 | 50.0 | -13.7 | 48.8 | -12.5 | 65.5 | -29.2 |
| Fairfax | QLD | 64.3 | 50.7 | 13.6 | 52.1 | 12.2 | 64.7 | -0.4 |
| Cowper | NSW | 60.0 | 46.5 | 13.5 | 48.5 | 11.5 | 68.4 | -8.4 |
| Flinders | VIC | 70.0 | 56.6 | 13.4 | 57.8 | 12.2 | 74.6 | -4.6 |
| Parramatta | NSW | 38.4 | 51.8 | -13.4 | 50.4 | -12.0 | 64.5 | -26.1 |
| Warringah | NSW | 75.0 | 61.8 | 13.2 | 65.4 | 9.6 | 79.3 | -4.3 |
| Boothby | SA | 68.5 | 56.3 | 12.2 | 58.0 | 10.5 | 74.4 | -5.9 |
| Chifley | NSW | 41.3 | 53.3 | -12.0 | 50.6 | -9.3 | 59.7 | -18.4 |
| Mayo | SA | 64.7 | 52.9 | 11.8 | 54.1 | 10.6 | 70.0 | -5.3 |
| Barton | NSW | 43.6 | 55.2 | -11.6 | 53.4 | -9.8 | 76.0 | -32.4 |
| Fisher | QLD | 62.8 | 51.6 | 11.2 | 53.4 | 9.4 | 64.4 | -1.6 |
| Denison | TAS | 73.8 | 62.7 | 11.1 | 64.8 | 9.0 | 79.8 | -6.0 |
| Macquarie | NSW | 63.9 | 52.8 | 11.1 | 54.4 | 9.5 | 71.4 | -7.5 |
| Brisbane | QLD | 79.5 | 68.5 | 11.0 | 69.8 | 9.7 | 81.1 | -1.6 |
| Richmond | NSW | 67.9 | 56.9 | 11.0 | 58.2 | 9.7 | 74.7 | -6.8 |
With a large set of Census variables aggregated to the CED level, we can begin to examine the model mis-specification that, if corrected, might have improved the performance of the model adjustment procedures.
We examine the relationship between the prediction errors and four Census aggregates: (1) the proportion born in East Asia; (2) the proportion born outside Australia, East Asia or an English-speaking country (a good measure of the proportion of a CED born in the Middle East or continental Europe); (3) the proportion reporting a religious affiliation other than a Christian denomination or no religion; and (4) the proportion of the electorate not speaking English.
Birth place figured prominently in earlier versions of our modeling and does play a role in the raked estimates, which generally have better performance here. Religious affiliation did figure in our modeling.
## signed errors: positive values indicate underestimation of AMLPS support
plotData$e <- plotData$yes_per/100 - plotData$p.bayes
plotData$e.raked <- plotData$yes_per/100 - plotData$p.raked
library(reshape2)
edata <- melt(plotData,id.vars = c("division","e","e.raked"),
measure.vars = c("bp.East_Asia",
"Religion_Other_Religion",
"bp.Other",
"SpeakEnglish"))
edata$variable <- factor(edata$variable,
levels=sort(unique(edata$variable)),
labels=c("Born East Asia",
"Religion Other",
"Born Other",
"Speak English"))
ggplot(edata,aes(x=value,y=e)) +
geom_smooth(method="gam",se=FALSE,color="red") +
geom_point() +
facet_wrap(~variable,ncol=2,scales="free_x") +
scale_x_continuous("Prevalence of given attribute, CED",label=percent) +
scale_y_continuous("Error of BART + Bayes",label=percent)
In each case it is clear that these variables contain information which, had our modelling utilised it, or utilised it more effectively, would presumably have improved the performance of our CED estimates based on adjusting the VC responses.
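One way to quantify how much these aggregates could help is to regress the BART + Bayes errors on them; a minimal sketch, assuming plotData contains the four Census aggregates used in the plot above:

## how much of the error variance do the four Census aggregates account for?
aux <- lm(e ~ bp.East_Asia + Religion_Other_Religion + bp.Other + SpeakEnglish,
          data = plotData)
summary(aux)$r.squared  ## a sizeable R-squared points to exploitable mis-specification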
I close with some recommendations for myself (!) and the broader research community:
tailor the predictive models to local context, an instance of what Andy Gelman might call the "secret weapon". That is, there is a time to pool and a time not to pool. More rigorously, we might say that the Western Sydney CEDs are not exchangeable with other places in Australia; the predictors we would use in a predictive model and statistical adjustment procedure there might differ from those used elsewhere. Birth place and religion are possibly more important there than elsewhere, perhaps in ways that are hard to capture even with a hierarchical GLM (see the sketch after this list).
there is an interesting tension in this type of work, yet another manifestation of a tradeoff ubiquitous in applied statistical work. The post-stratification frame cannot be too "deep", in the sense that it cannot have too many variables or dimensions: once we start slicing and dicing the Census data finely, ABS TableBuilder starts introducing error into the cell counts for even reasonably coarse tabulations. This speaks to working with a small number of predictors in the predictive model, perhaps with deep interactions to be discovered via machine learning algorithms.
the AMLPS results make clear the challenges of polling non-English speaking Australia. The non-English speakers taking phone and Internet surveys or participating in Vote Compass are probably unusual with respect to their co-ethnics, even after balancing on demographics such as educational attainment, income and so on. Recruiting respondents in these communities and administering surveys in the language of their choice would seem an important innovation for Australian pollsters (academic, media, commercial) to consider, to the extent this is not happening now. Political polling in the United States acknowledges the importance of offering respondents a Spanish-language option. Australia has compulsory turnout and twice the foreign-born rate of the United States, and while the prevalence of English speakers among Australia's immigrants is probably higher than in the United States, there are probably important gains to be had from expanding non-English language polling in Australia.
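To illustrate the "secret weapon" idea from the first recommendation above, here is a minimal sketch that fits the same simple error model separately within groups of divisions rather than pooling; the hard-coded list of Western Sydney seats (taken from the table of largest errors) is purely illustrative, and plotData and its Census aggregates are assumed to be as defined above.

## illustrative grouping: a handful of Western Sydney divisions vs everywhere else
ws <- c("Watson", "Blaxland", "McMahon", "Werriwa", "Fowler", "Parramatta", "Chifley")
plotData$region <- ifelse(plotData$division %in% ws, "Western Sydney", "Elsewhere")

## fit the same error model separately in each group and compare coefficients
fits <- lapply(split(plotData, plotData$region),
               function(d) lm(e ~ bp.Other + Religion_Other_Religion, data = d))
lapply(fits, coef)  ## markedly different coefficients argue against full pooling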