Introduction

Opinion surveys are conducted so that their findings can be reported to stakeholders. In this tutorial, we will review how to create weighted topline results tables and figures to help disseminate information about the survey results.

Topline reports are used by survey researchers as a way to quickly get as much information about the survey results as possible to stakeholders (which can, and oftentimes does, include the public). Generally, topline reports give stakeholders a look at the weighted results of key survey questions, as well as how different subgroups in the sample answered those questions.

Here, we will use a 2018 public opinion survey of Colorado residents that asked questions about an upcoming election as well as general policy questions, such as support for legalized marijuana and recreational sports gambling. For simplicity, we focus on two questions for our key topline reporting, whereas in reality we would do this for all the key election-related questions (note that you will do something similar in the in-class activity):

  - Gubernatorial Election Choice (gov_choice)
  - Support for Legalized Recreational Marijuana (pot_law)

We first import the survey data, here from GitHub and stored as sample, then review the variables available to us using the names(sample) command.

#Import Data
library(haven) # Provides read_dta() for reading Stata files
url <- "https://github.com/drCES/survey_weighting_dacss695/raw/main/sample.dta"
# Read the Stata file into R
sample <- read_dta(url)
names(sample)
##  [1] "caseid"       "pid_4"        "ideo5"        "gov_choice"   "prop_111"    
##  [6] "prop_112"     "trump_app"    "hick_app"     "gardner_app"  "cong_app"    
## [11] "scotus_app"   "pot_law"      "gambling"     "fracking"     "gun_control" 
## [16] "anger"        "pride"        "hope"         "disgust"      "worry"       
## [21] "trump_app2"   "hick_app2"    "gardner_app2" "cong_app2"    "scotus_app2" 
## [26] "pot_law2"     "gambling2"    "fracking2"    "gun_control2" "weight_org"  
## [31] "pid_x"        "sex"          "race_4"       "speakspanish" "marstat"     
## [36] "child18"      "employ"       "faminc_new"   "casscd"       "religiosity" 
## [41] "age_group"    "educ"

Information About the Survey

Topline reports typically begin with a section that describes the survey methodology in detail. This section should include the sampling approach, the population the sample generalizes to, the probability-based margin of error for the entire sample, the dates of data collection, the survey mode used, and a discussion of any survey weights needed to make the sample match the population. This information should be provided so that other survey methodologists and stakeholders have an informed understanding of how the survey was conducted before reviewing the results.

Once this discussion has taken place, the actual results of the survey are reported starting with the key outcome variables of interest.

Margin of Error

Surveys should report a margin of error (MOE), even non-probability-based ones, so the first thing we do is calculate our full-sample MOE using the function we reviewed in class earlier in the semester. The sample size for this poll is 800. Plugging that value into the function, along with a proportion of .5 and the critical value of 1.96 for a 95% confidence interval, gives a full-sample margin of error of +/- 3.5%.

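moe_fun <- function(p, n, cv) {
  step1 <- p * (1 - p)  # Variance of a proportion
  step2 <- step1 / n    # Divide by the sample size
  se <- (step2)^.5      # Standard error is the square root
  moe <- cv * se        # Scale by the critical value
  print(moe)
}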

moe_fun(.5, 800, 1.96) #Set proportion to .5, sample size to 800 which is the full sample size, and critical value to 1.96 for a 95% CI
## [1] 0.03464823
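
When the MOE feeds into a table footnote, it can be convenient to have a version of the function that returns the value rather than printing it. The sketch below is one way to do that under the same formula; the names moe_value, full_moe, and moe_label are hypothetical and not part of the original code.

```r
# Hypothetical helper: same formula as moe_fun(), but returns the value
moe_value <- function(p, n, cv = 1.96) {
  cv * sqrt(p * (1 - p) / n)
}

# Full-sample MOE with p = .5 and n = 800
full_moe <- moe_value(p = 0.5, n = 800)

# Build footnote text from the computed value instead of typing it by hand
moe_label <- sprintf("Full Sample MOE +/-%.1f%%", 100 * full_moe)
print(moe_label)
# [1] "Full Sample MOE +/-3.5%"
```

A label built this way could be passed to the footnote of a table, so the reported MOE stays in sync if the sample size changes.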

Weighted Survey Results

Because this is a sample survey where the demographic profile does not match the population, we need to include survey weights for any results that are reported. In a previous tutorial, you learned how to create survey weights. Here, you do not need to create weights again; instead, use the supplied weighting variable, weight_org, in the analysis.

Our two key variables of interest for this report are the gubernatorial vote choice measure (gov_choice) and support for or opposition to legal recreational marijuana (pot_law). Because these are our key outcome variables, we summarize them first, at the beginning of the topline report.

As with any analysis, we should first review our variables of interest so that we are confident about what they contain. Here we run frequencies, which are simply unweighted counts, of our key variables to see the values each variable can take on.

freq(sample$gov_choice)

## Governor Choice 
##       Frequency Percent
## 1           413  51.625
## 2           346  43.250
## 3            30   3.750
## 4            11   1.375
## Total       800 100.000
val_labels(sample$gov_choice)
##        Jared Polis - Democrat Walker Stapleton - Republican 
##                             1                             2 
##    Scott Helker - Libertarian          Some other candidate 
##                             3                             4
freq(sample$pot_law) 

## RECODE of Q30 (Q30) 
##       Frequency Percent Valid Percent
## 1           192  24.000        24.584
## 2            75   9.375         9.603
## 3           178  22.250        22.791
## 4           336  42.000        43.022
## NA's         19   2.375              
## Total       800 100.000       100.000
val_labels(sample$pot_law)
##  Strong Oppose         Oppose          Favor Strongly Favor 
##              1              2              3              4

Weighted Topline Frequencies

Now that we feel comfortable with what is included in each of our key outcome variables, we create the weighted topline figures for reporting. To do this, we will use the topline function from the pollster package, which was designed to simplify the reporting of survey results in topline reports. This package (see its package description and tutorials for details) lets you create clean tables, and graphs if desired, that report the weighted frequency of each response option in your variables.
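
To make the idea of a weighted frequency concrete, the following base R sketch computes unweighted and weighted percentages by hand. It uses a small made-up data frame (toy, with hypothetical columns choice and wt), not the Colorado survey, and it is only meant to illustrate the arithmetic, not to reproduce how pollster is implemented.

```r
# Toy data (hypothetical, not the Colorado survey)
toy <- data.frame(
  choice = c("A", "A", "B", "B", "B"),
  wt     = c(2, 1, 1, 0.5, 0.5)
)

# Unweighted percent: share of respondents giving each answer
unweighted_pct <- 100 * prop.table(table(toy$choice))

# Weighted percent: share of the total weight carried by each answer
weighted_n   <- tapply(toy$wt, toy$choice, sum)
weighted_pct <- 100 * weighted_n / sum(weighted_n)

print(round(unweighted_pct))  # A = 40, B = 60
print(round(weighted_pct))    # A = 60, B = 40
```

In this toy example the two respondents who chose A carry more weight than the three who chose B, so the weighted result reverses the unweighted one.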

The code has a variety of options that can be confusing so let’s slowly go through each line of code. To begin, topline(df = sample, variable = gov_choice, weight = weight_org, pct = FALSE, cum_pct=FALSE) %>% tells R to…

  1. Use the topline function from the pollster package
  2. Work from the sample dataframe
  3. Summarize the variable gov_choice
  4. Weight the results using the weighting variable weight_org
  5. Suppress the percentage and cumulative percentage columns in the table for aesthetic purposes

The first part of the code creates the values that we will then report with the second part of the code: kable(digits=0, col.names = c('Candidate', 'N', 'Percent <br> Supporting'), escape=FALSE, align=('lcc')) %>% kable_styling(bootstrap_options = "responsive", full_width = F, position = "float_left") %>%
footnote(alphabet = ("Full Sample MOE +/-3.5%"))

In this long bit of code, we use the kable function from the knitr package, together with kable_styling and footnote from the kableExtra package, to format the values for reporting.

  1. digits = 0 sets the rounding for the percentages
  2. col.names changes the column names that will be reported on the table
  3. escape = FALSE leaves special characters unescaped, so the HTML tag <br> in the column name is rendered as a line break instead of being printed as text
  4. align = ('lcc') sets the alignment of each column: l = left, c = centered, r = right. If you set the alignment manually, you need as many letters as columns. Here we have 3 columns, so 3 letters.
  5. kable_styling controls the general look of the table as well as its placement using "float_left". It also sets the width of the table so that it does NOT take up the entire screen (full_width = F).
  6. footnote adds a note about the overall sample MOE based on the previous calculation
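
The alignment rule in item 4 is easy to get wrong when copying code between tables with different numbers of columns. As a small sketch, a helper like the hypothetical align_matches below (not part of knitr or kableExtra) can check the rule before a table is built.

```r
# Hypothetical helper: TRUE when `align` has one letter per column name
align_matches <- function(align, col_names) {
  nchar(align) == length(col_names)
}

# Three column names, three alignment letters
align_matches("lcc", c("Candidate", "N", "Percent <br> Supporting"))
# [1] TRUE

# Five column names, six alignment letters
align_matches("lccccr", c("Governor Choice", "Democrat", "Independent",
                          "Republican", "Other"))
# [1] FALSE
```
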
topline(df = sample, variable = gov_choice, 
        weight = weight_org, pct = FALSE, cum_pct=FALSE) %>%  
  kable(digits=0, 
        col.names = c('Candidate', 'N', 
                      'Percent <br> Supporting'),#Changes names of the columns to make them more readable for people 
        escape=FALSE, align=('lcc')) %>% #Use align to change the alignment of each column 
 kable_styling(bootstrap_options = "responsive", full_width = F, position = "float_left") %>% #Controls look and width of table 
  kableExtra::footnote(alphabet = ("Full Sample MOE +/-3.5%")) 
Candidate                      N    Percent Supporting
Jared Polis - Democrat         429  54
Walker Stapleton - Republican  334  42
Scott Helker - Libertarian     24   3
Some other candidate           13   2
a Full Sample MOE +/-3.5%

Now, from our weighted survey results, we report that Jared Polis is expected to get roughly 54% of the vote while Walker Stapleton is expected to get roughly 42% of the vote with a MOE of +/- 3.5%. These would be the important figures from the survey reported to the world about who was currently leading the election.

We can do the same type of analysis with a non-election-related question. Here we use support for or opposition to recreational marijuana in Colorado to demonstrate. We follow the same coding approach as above, swapping the variable name from gov_choice to pot_law and updating the first column name to match.

topline(df = sample, variable = pot_law, #Update to correct variable name
        weight = weight_org, pct = FALSE, cum_pct=FALSE) %>%  
  kable(digits=0, 
        col.names = c('Response', 'N', 
                      'Percent <br> Supporting'),#Changes names of the columns to make them more readable for people 
        escape=FALSE, align=('lcc')) %>% #Use align to change the alignment of each column 
 kable_styling(bootstrap_options = "responsive", full_width = F, position = "float_right") %>% #Controls look and width of table 
  kableExtra::footnote(alphabet = ("Full Sample MOE +/-3.5%")) 
Response        N    Percent Supporting
Strong Oppose   170  22
Oppose          63   8
Favor           165  21
Strongly Favor  383  49
(Missing)       19   NA
a Full Sample MOE +/-3.5%

Notice that a fifth row, (Missing), is included in the output. If we want to exclude it, we add the argument remove = c("(Missing)") to the topline() call. We explicitly type "(Missing)" because that is the name of the row in the table above.

topline(df = sample, variable = pot_law, #Update to correct variable name
        weight = weight_org, pct = FALSE, cum_pct=FALSE, remove = c("(Missing)")) %>%  
  kable(digits=0, 
        col.names = c('Response', 'N', 
                      'Percent <br> Supporting'),#Changes names of the columns to make them more readable for people 
        escape=FALSE, align=('lcc')) %>% #Use align to change the alignment of each column 
 kable_styling(bootstrap_options = "responsive", full_width = F, position = "float_left") %>% #Controls look and width of table 
  kableExtra::footnote(alphabet = ("Full Sample MOE +/-3.5%")) 
Response        N    Percent Supporting
Strong Oppose   170  22
Oppose          63   8
Favor           165  21
Strongly Favor  383  49
a Full Sample MOE +/-3.5%

Now we have removed the missing row from the table for easier reading. Results here show that 49% strongly favor marijuana being legal, with another 21% favoring it, indicating strong overall support (roughly 70%) for legalization in Colorado.
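
The percentages in the pot_law tables are computed among respondents who gave an answer. The base R sketch below re-derives them from the weighted Ns printed in the table, and contrasts them with what the percentages would be if the 19 missing cases stayed in the denominator. The Ns are the rounded values shown in the table, so treat this as an approximate check rather than an exact reproduction.

```r
# Weighted Ns copied from the pot_law table (already rounded)
weighted_n <- c("Strong Oppose" = 170, "Oppose" = 63,
                "Favor" = 165, "Strongly Favor" = 383)
missing_n <- 19

# Denominator excludes missing responses
valid_pct <- round(100 * weighted_n / sum(weighted_n))
print(valid_pct)
# Strong Oppose 22, Oppose 8, Favor 21, Strongly Favor 49

# Denominator includes missing responses
overall_pct <- round(100 * weighted_n / (sum(weighted_n) + missing_n))
print(overall_pct)
# Strong Oppose 21, Oppose 8, Favor 21, Strongly Favor 48

# Combined support among those who answered
support_pct <- round(100 * sum(weighted_n[c("Favor", "Strongly Favor")]) /
                       sum(weighted_n))
print(support_pct)
# [1] 70
```

The first set of values matches the table, which is why the percentages do not change when the (Missing) row is removed.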

Including Crosstables

While reporting the high-level frequencies of your key outcome variables is important, oftentimes topline reports also include crosstab analysis, which looks at how different subgroups in your sample answered your key question of interest. For instance, election polling often reports how individuals with different partisan identities (Democratic, Republican, Independent, etc.) support candidates or policies at different rates. Topline reports are a way to provide these types of deeper dives into the data to the public and other stakeholders.

2-Way Crosstables

Here, we will look at a 2-way crosstable using governor choice as our key outcome variable of interest by three different subgroups: Partisanship (pid_4), education (educ) and having a child 18 or younger (child18).
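
Before running the crosstables, it may help to see what a weighted column percentage is. The base R sketch below uses a made-up data frame (toy, with hypothetical columns vote, party, and wt) and is only an illustration of the arithmetic; it does not claim to mirror how pollster computes its tables internally.

```r
# Toy data (hypothetical): outcome, subgroup, and a survey weight
toy <- data.frame(
  vote  = c("Polis", "Polis", "Stapleton", "Polis", "Stapleton", "Stapleton"),
  party = c("Dem",   "Dem",   "Dem",       "Rep",   "Rep",       "Rep"),
  wt    = c(1.5,     1.5,     1.0,         1.0,     1.5,         1.5)
)

# Weighted counts: rows = outcome, columns = subgroup
weighted_tab <- xtabs(wt ~ vote + party, data = toy)

# margin = 2 divides each cell by its column total, so each column sums to 100
col_pct <- 100 * prop.table(weighted_tab, margin = 2)
print(round(col_pct))
# Dem column: Polis 75, Stapleton 25; Rep column: Polis 25, Stapleton 75
```

The crosstab calls that follow request the same kind of column percentages with pct_type = "col", so each subgroup column can be read as "of the people in this group, what share chose each candidate."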

#PID Crosstable
crosstab(df = sample, x = gov_choice, 
         y = pid_4, weight = weight_org, 
         pct_type = "col") %>%
  kable(digits=0,  
        col.names = c('Governor Choice','Democrat', 'Independent', 'Republican', 'Other'), align=('lcccc')) %>%
  kable_styling(bootstrap_options = "responsive", full_width = F, position = "float_left") %>% 
  kableExtra::footnote(alphabet = ("Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate"))
Governor Choice                Democrat  Independent  Republican  Other
Jared Polis - Democrat         98        51           3           62
Walker Stapleton - Republican  1         36           96          24
Scott Helker - Libertarian     1         8            0           13
Some other candidate           0         5            1           1
n                              293       210          256         42
a Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate
#Education Crosstable
crosstab(df = sample, x = gov_choice, 
         y = educ, weight = weight_org, 
         pct_type = "col") %>%
  kable(digits=0,  
        col.names = c('Governor Choice','HS or Less', 'Some College', 'Associates', 'BA/BS', 'Advanced'), align=('lccccr')) %>%
  kable_styling(bootstrap_options = "responsive", full_width = F, position = "left") %>% 
  kableExtra::footnote(alphabet = ("Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate"))
Governor Choice                HS or Less  Some College  Associates  BA/BS  Advanced
Jared Polis - Democrat         49          51            41          59     59
Walker Stapleton - Republican  49          41            52          37     37
Scott Helker - Libertarian     2           5             6           2      2
Some other candidate           0           3             1           2      2
n                              168         181           69          243    138
a Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate
#Have Child Crosstable
crosstab(df = sample, x = gov_choice, 
         y = child18, weight = weight_org, 
         pct_type = "col") %>%
  kable(digits=0,  
        col.names = c('Governor Choice','Has Child', 'No Child'), align=('lcc')) %>%
  kable_styling(bootstrap_options = "responsive", full_width = F, position = "right") %>% 
  kableExtra::footnote(alphabet = ("Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate"))
Governor Choice                Has Child  No Child
Jared Polis - Democrat         53         54
Walker Stapleton - Republican  38         43
Scott Helker - Libertarian     6          2
Some other candidate           3          1
n                              187        613
a Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate

This code chunk created three crosstables showing how each subgroup of a given demographic variable answered the governor preference question. We can see that Democrats overwhelmingly supported Polis, while Republicans did the same for Stapleton. Independents, however, favored Polis by 51% to 36%, which was the main difference in the election. We can also see that people with a BA/BS or advanced degree were more supportive of Polis than those without.
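
The footnotes on these tables report the full-sample MOE, but each subgroup column rests on fewer respondents. As a rough sketch, the same formula used in moe_fun can be applied to the subgroup sizes in the n row of the party ID crosstable. This assumes the weighted n can stand in for the sample size and ignores any design effect from weighting, so the values are approximate.

```r
# Weighted subgroup sizes copied from the "n" row of the party ID crosstable
subgroup_n <- c(Democrat = 293, Independent = 210, Republican = 256, Other = 42)

# Same formula as moe_fun(), at p = .5 with a 95% critical value
subgroup_moe <- 1.96 * sqrt(0.5 * (1 - 0.5) / subgroup_n)
print(round(subgroup_moe, 3))
#    Democrat Independent  Republican       Other
#       0.057       0.068       0.061       0.151
```

Under these assumptions, estimates for the smallest group carry an MOE of roughly 15 points, compared with 3.5 points for the full sample.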

Now, you need to recreate the above 3 crosstables using support for or opposition to recreational marijuana, pot_law, as the outcome variable of interest. That is, take the code from above, replace gov_choice with the pot_law variable, and then interpret what you see in the results.

#Crosstable to be updated with pot_law
crosstab(df = sample, x = gov_choice, 
         y = pid_4, weight = weight_org, 
         pct_type = "col") %>%
  kable(digits=0,  
        col.names = c('Governor Choice','Democrat', 'Independent', 'Republican', 'Other'), align=('lcccc')) %>%
  kable_styling(bootstrap_options = "responsive", full_width = F, position = "float_left") %>% 
  kableExtra::footnote(alphabet = ("Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate"))
Governor Choice                Democrat  Independent  Republican  Other
Jared Polis - Democrat         98        51           3           62
Walker Stapleton - Republican  1         36           96          24
Scott Helker - Libertarian     1         8            0           13
Some other candidate           0         5            1           1
n                              293       210          256         42
a Full Sample MOE +/-3.5%; Values Represent % Supporting that Candidate

3-Way Crosstables

The results of the education crosstable are interesting, but they might leave you wondering: is that just a partisan outcome? A 3-way crosstable lets you add one additional variable to your crosstable to better understand how different demographic variables might interact to influence opinion. Here, we look at how partisanship and education together relate to governor vote choice preferences.

With a 3-way crosstable, you need to worry about the sample size in each cell. Because the education variable takes on 5 values while the PID variable takes on 4, crossing them would create 20 unique subgroups in our crosstable. To reduce the number of subgroups, thus increasing the sample size in each one, we first create a new variable called college that is simply a recode of the educ variable: it equals 'College Degree' when the respondent has a BA/BS or more and 'No College Degree' if not.

sample <- sample %>% 
  mutate(college = if_else(educ<4, 'No College Degree', 'College Degree'))
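
It is worth confirming that a recode did what was intended. The base R sketch below mirrors the recode on a made-up vector and cross-tabulates old against new values. It assumes educ is coded 1 to 5 in the order of the education crosstable columns (HS or Less, Some College, Associates, BA/BS, Advanced); that coding is inferred from the table, not confirmed by the data documentation.

```r
# Toy education codes (hypothetical); assumed coding: 1 = HS or Less,
# 2 = Some College, 3 = Associates, 4 = BA/BS, 5 = Advanced
educ <- c(1, 2, 3, 4, 5, NA)

# Base R equivalent of the if_else() recode above
college <- ifelse(educ < 4, "No College Degree", "College Degree")

# Cross-check old against new values; useNA shows whether missing
# education values stayed missing after the recode
check <- table(educ, college, useNA = "ifany")
print(check)
```

Running the same kind of table on the real data, for example table(sample$educ, sample$college, useNA = "ifany"), would show whether every education level landed in the intended group.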

With this new college variable, we can now run the 3-way crosstable. The coding approach is largely the same as above, with a few changes. Instead of crosstab, we start the code with crosstab_3way to indicate that we are running a slightly more complex crosstable, and we add the college variable as z. Note that here x is pid_4, y is gov_choice, and pct_type is "row", so each row of the table sums to 100%.

crosstab_3way (df = sample, x = pid_4, 
      y = gov_choice, z=college, weight = weight_org, pct_type = "row") %>%
  kable(digits=0, col.names = c('PartyID', 'College', 'Polis(D)', 'Stapleton (R)', 
                                'Helker (L)', 'Other <BR> Candidates', 'N'), align=('llccccr'),
        caption = "Gubernatorial Vote Choice by Partisanship & College", escape=FALSE, position="right") %>%
  kable_styling(bootstrap_options = "responsive", full_width = F, html_font = "Cambria") %>% 
  kableExtra::footnote(alphabet = ("Numbers Represent % Supporting that Candidate"))
Gubernatorial Vote Choice by Partisanship & College
PartyID      College            Polis(D)  Stapleton (R)  Helker (L)  Other Candidates  N
Democrat     College Degree     99        1              0           0                 150
Democrat     No College Degree  97        1              1           1                 143
Independent  College Degree     57        30             7           6                 97
Independent  No College Degree  46        42             8           3                 113
Republican   College Degree     4         95             0           1                 112
Republican   No College Degree  3         96             0           1                 143
Other        College Degree     72        23             5           0                 23
Other        No College Degree  50        25             22          3                 19
a Numbers Represent % Supporting that Candidate

Here, we see a more complex view of how education relates to vote preference. Among Democratic or Republican identifiers, having a college degree made almost no difference, as partisans backed the candidate that matched their partisanship. Among independents, however, college education was associated with candidate support: independents with a college degree supported Polis at a 57% rate, compared to only 46% among independents without a college degree.
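
Because cell sizes shrink in a 3-way crosstable, it can be useful to ask how much uncertainty surrounds a gap between two cells. The sketch below applies the standard formula for the margin of error of a difference between two independent proportions to the two Independent rows. It treats the weighted Ns as sample sizes and ignores design effects from weighting, so it is a rough approximation rather than a formal test.

```r
# Values copied from the Independent rows of the 3-way table
p_college    <- 0.57; n_college    <- 97
p_no_college <- 0.46; n_no_college <- 113

# Standard error of a difference between two independent proportions
se_diff <- sqrt(p_college * (1 - p_college) / n_college +
                p_no_college * (1 - p_no_college) / n_no_college)

diff_est <- p_college - p_no_college  # Observed gap in Polis support
moe_diff <- 1.96 * se_diff            # 95% margin of error for that gap

print(round(c(difference = diff_est, moe = moe_diff), 3))
# difference        moe
#      0.110      0.135
```

Under these simplifying assumptions, the 11-point gap is somewhat smaller than its approximate margin of error, which is a reminder to interpret differences between small cells cautiously.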

Now, in this section, your task is to create the above table using pot_law as the outcome variable of interest. After you do this, change the z variable from college to child18 and see whether having a child in the home is related to support for recreational marijuana.

Make sure to update the column names and the title to reflect support for or opposition to recreational marijuana instead of governor vote preference.

#Crosstable to be updated with pot_law & child under 18 in the home or not
crosstab_3way (df = sample, x = pid_4, 
      y = gov_choice, z=college, weight = weight_org, pct_type = "row") %>%
  kable(digits=0, col.names = c('PartyID', 'College', 'Polis(D)', 'Stapleton (R)', 
                                'Helker (L)', 'Other <BR> Candidates', 'N'), align=('llccccr'),
        caption = "Gubernatorial Vote Choice by Partisanship & College", escape=FALSE, position="left") %>%
  kable_styling(bootstrap_options = "responsive", full_width = F, html_font = "Cambria") %>% 
  kableExtra::footnote(alphabet = ("Numbers Represent % Supporting that Candidate"))
Gubernatorial Vote Choice by Partisanship & College
PartyID      College            Polis(D)  Stapleton (R)  Helker (L)  Other Candidates  N
Democrat     College Degree     99        1              0           0                 150
Democrat     No College Degree  97        1              1           1                 143
Independent  College Degree     57        30             7           6                 97
Independent  No College Degree  46        42             8           3                 113
Republican   College Degree     4         95             0           1                 112
Republican   No College Degree  3         96             0           1                 143
Other        College Degree     72        23             5           0                 23
Other        No College Degree  50        25             22          3                 19
a Numbers Represent % Supporting that Candidate