Introduction

In this tutorial, we will review basic data exploration and analysis techniques to help explore interesting patterns in your survey results. We will cover the following basic techniques:

  • Calculating the margin of error for the full sample
  • Reviewing and graphing weighted univariate distributions
  • Exploring bivariate relationships: comparing means between two or more groups and correlations between ordinal/continuous questions

We start by importing the “cleaned” dataset we worked on Tuesday, which I saved to my GitHub page, using the haven package.
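
Before we import, this tutorial assumes the packages used throughout have already been loaded (for example, in a setup chunk). A minimal sketch is below; the exact package list is an assumption based on the functions called later (for instance, topline() is assumed to come from the pollster package), so load whatever you used previously.

library(haven)      # read_dta() for importing Stata files
library(tidyverse)  # dplyr, tidyr, and ggplot2 for data work and graphing
library(pollster)   # topline() for weighted frequency tables (assumed source of topline)
library(survey)     # svydesign() and svyttest() for weighted significance tests
library(gridExtra)  # grid.arrange() for combining ggplots
library(corrplot)   # corrplot() for graphing correlation matrices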

#Pull data directly from GitHub - internet connection required
url <- "https://github.com/drCES/raw_survey_data_dacss695/raw/main/data_clean.dta"
data <- read_dta(url) # Read the Stata file into R
names(data) 
##   [1] "start_date"                    "end_date"                     
##   [3] "status"                        "ip_address"                   
##   [5] "progress"                      "duration_in_seconds"          
##   [7] "finished"                      "recorded_date"                
##   [9] "response_id"                   "recipient_last_name"          
##  [11] "recipient_first_name"          "recipient_email"              
##  [13] "external_reference"            "location_latitude"            
##  [15] "location_longitude"            "distribution_channel"         
##  [17] "user_language"                 "q_recaptcha_score"            
##  [19] "q_relevant_id_duplicate"       "q_relevant_id_duplicate_score"
##  [21] "q_relevant_id_fraud_score"     "q_relevant_id_last_start_date"
##  [23] "q1"                            "main_branch_1"                
##  [25] "main_branch_2"                 "main_branch_3"                
##  [27] "main_branch_4"                 "main_branch_5"                
##  [29] "main_branch_6"                 "main_branch_7"                
##  [31] "main_branch_8"                 "main_branch_9"                
##  [33] "main_branch_10"                "mohanty_news_1"               
##  [35] "mohanty_news_2"                "mohanty_news_3"               
##  [37] "mohanty_news_4"                "mohanty_news_5"               
##  [39] "mohanty_news_6"                "mohanty_news_6_text"          
##  [41] "dinsmore_smusage"              "dinsmore_comms"               
##  [43] "dinsmore_comms_7_text"         "yifan_livecomm_1"             
##  [45] "yifan_livecomm_2"              "yifan_livecomm_3"             
##  [47] "yifan_livecomm_4"              "shawn_worktype"               
##  [49] "shawn_worktype_5_text"         "bot_check_4"                  
##  [51] "bot_check_5"                   "bot_check_6"                  
##  [53] "bot_check_7"                   "bot_check_8"                  
##  [55] "gutierrez_suarez_ai"           "krishna_cata_1"               
##  [57] "krishna_cata_2"                "krishna_cata_3"               
##  [59] "krishna_cata_4"                "krishna_cata_5"               
##  [61] "krishna_cata_6"                "krishna_cata_7"               
##  [63] "krishna_cata_8"                "krishna_cata_9"               
##  [65] "krishna_cata_10"               "krishna_cata_11"              
##  [67] "krishna_cata_10_text"          "ai_ad_1"                      
##  [69] "ai_ad_2"                       "ai_ad_3"                      
##  [71] "krishna_dc"                    "gutierrez_clear"              
##  [73] "suarez_ad2_1"                  "suarez_ad2_2"                 
##  [75] "dinsmore_mh"                   "balint_stress"                
##  [77] "gomez_dinsmore_ad_1"           "gomez_dinsmore_ad_2"          
##  [79] "gomez_dinsmore_ad_3"           "gomez_dinsmore_ad_4"          
##  [81] "dinsmore_friendship"           "mohanty_emotneeds"            
##  [83] "sc_approval"                   "brake_influence"              
##  [85] "brake_confidence"              "brake_ideology"               
##  [87] "brake_number"                  "mohanty_disagree"             
##  [89] "mohanty_famideo"               "mohanty_famideo_13_text"      
##  [91] "mohanty_commideo"              "mohanty_commideo_6_text"      
##  [93] "amalea_source_1"               "amalea_source_2"              
##  [95] "amalea_source_3"               "amalea_source_4"              
##  [97] "amalea_source_5"               "amalea_source_6"              
##  [99] "amalea_source_7"               "amalea_source_6_text"         
## [101] "amalea_fam"                    "amalea_stigma"                
## [103] "amalea_ad"                     "god_importance_xu"            
## [105] "god_help_xu"                   "god_answer_xu"                
## [107] "gomez_recruit"                 "gomez_success"                
## [109] "balint_struggle"               "balint_cost"                  
## [111] "balint_inc_1"                  "balint_inc_2"                 
## [113] "balint_inc_3"                  "balint_inc_4"                 
## [115] "balint_inc_5"                  "balint_inc_6"                 
## [117] "balint_inc_5_text"             "balint_impact"                
## [119] "mitchell_threat"               "mitchell_ad_1"                
## [121] "mitchell_ad_2"                 "q1zo"                         
## [123] "q2zo"                          "q5zo"                         
## [125] "omar_ad_1"                     "omar_ad_2"                    
## [127] "kopecky_sat"                   "yeumin_rec"                   
## [129] "yuemin_ad_1"                   "yuemin_ad_2"                  
## [131] "yuemin_ad_3"                   "yuemin_ad_4"                  
## [133] "kopecky_goals_1"               "kopecky_goals_2"              
## [135] "kopecky_goals_3"               "kopecky_goals_4"              
## [137] "kopecky_goals_5"               "kopecky_goals_6"              
## [139] "kopecky_goals_7"               "kopecky_goals_7_text"         
## [141] "kopecky_focus"                 "kopecky_focus_6_text"         
## [143] "kopecky_type_1"                "kopecky_type_2"               
## [145] "kopecky_type_3"                "shawn_overall"                
## [147] "shawn_ad"                      "stapleton_so_1"               
## [149] "stapleton_so_2"                "stapleton_so_3"               
## [151] "stapleton_so_4"                "q107"                         
## [153] "yifan_why_12"                  "yifan_why_13"                 
## [155] "yifan_why_14"                  "yifan_why_15"                 
## [157] "yifan_why_16"                  "yifan_why_18"                 
## [159] "yifan_why_19"                  "yifan_why_20"                 
## [161] "yifan_why_21"                  "yifan_why_22"                 
## [163] "yifan_why_22_text"             "yifan_length"                 
## [165] "mitchell_cc"                   "pol_ideo"                     
## [167] "pol_ideo_8_text"               "god_pray_xu"                  
## [169] "rid"                           "age"                          
## [171] "gender"                        "hhi"                          
## [173] "ethnicity"                     "hispanic"                     
## [175] "education"                     "region"                       
## [177] "zip"                           "political_party"              
## [179] "correct"                       "stapleton_so_dh"              
## [181] "stapleton_so_pc"               "stapleton_so_shift"           
## [183] "stapleton_so_bases"            "educ"                         
## [185] "college"                       "educ4"                        
## [187] "stapleton_so_dh_2"

We now have reimported our cleaned data, which includes the data transformations conducted previously. This data has 266 observations, which matches the number we finished with after cleaning.

Because we are interested in understanding our data better in this tutorial, we should be using survey weights. However, because part of your final project is to create survey weights based on the provided demographic profile, this tutorial will not use weighted data. To ensure the code is still usable to you with weighted data, this tutorial first creates a fake weight variable that equals 1 to include in each analysis. When you do this on your own survey data, you will use whatever you name your final survey weight once you have completed that procedure.

data$weight<-1

Calculating Margin of Error

The first thing we want to do is calculate the margin of error for our final sample of valid responses, which has 266 completes. We simply use the margin of error function from earlier in the semester to calculate the MOE for the full sample. Remember, for subgroups and variables with smaller sample sizes, the MOE will be larger.

moe_fun <- function(p, n, cv) {
  step1 <- p * (1 - p)  # Variance of a proportion
  step2 <- step1 / n    # Divide by the sample size
  se <- step2^.5        # Square root gives the standard error
  moe <- cv * se        # Multiply by the critical value
  print(moe)
}

moe_fun(.5, 266, 1.96) #Set proportion to .5, sample size to 266 (the full sample size), and critical value to 1.96 for a 95% CI
## [1] 0.06008766

Running our MOE function with the appropriate values, we see that our MOE for the full sample is +/- 6%. Meaning, we expect the true population opinion on a survey question to be somewhere within +/- 6% of the sample estimate.
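
To see how much the MOE grows for a subgroup, we can rerun the same function with a smaller n. For example, roughly 82 respondents (the MLB fans) answered the rule-change questions, so the MOE for that subgroup is noticeably larger:

moe_fun(.5, 82, 1.96) #Same function with n = 82 (the MLB fan subgroup); returns roughly 0.108, or +/- 10.8%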

Variables Used in Tutorial

This tutorial will primarily use the same 5 variables as in the previous tutorial, all related to Major League Baseball and measuring support for or opposition to four recent rule changes made to the game. We use the cleaned versions of these variables, which have non-substantive response options such as Don’t Know removed, for our data exploration.

The primary variables are:

  • stapleton_so_dh = measure of support for universal DH
  • stapleton_so_pc = measure of support for pitch clock
  • stapleton_so_shift = measure of support for infield shift ban
  • stapleton_so_bases = measure of support for larger bases
  • main_branch_6 = Respondent considers self a fan of MLB (branching question for above)

Ultimately, our goal here is to find interesting patterns in the data that suggest differences in opinion in the population. Data exploration in general public surveys like this is about finding what might be interesting to the public or stakeholders who will see the results. This is different from academic-style survey research, which should focus on testing theories with empirically falsifiable hypotheses. Take other DACSS courses to learn about that approach to survey research.

Univariate Data Review

While the topline code is good for creating public-facing reports, that level of polish is not necessary for data exploration, so here we use a slightly simpler approach. We start by looking at the weighted distribution of each variable to get an understanding of the population estimates.

topline(data, variable = stapleton_so_dh, 
        weight = weight, pct = FALSE, cum_pct=FALSE)
## # A tibble: 5 × 3
##   Response         Frequency `Valid Percent`
##   <fct>                <dbl>           <dbl>
## 1 Strongly support        28           34.1 
## 2 Somewhat support        33           40.2 
## 3 Somewhat oppose         14           17.1 
## 4 Strongly oppose          7            8.54
## 5 (Missing)              184           NA
topline(data, variable = stapleton_so_pc, 
        weight = weight, pct = FALSE, cum_pct=FALSE, remove = c("(Missing)"))
## # A tibble: 4 × 3
##   Response         Frequency `Valid Percent`
##   <fct>                <dbl>           <dbl>
## 1 Strongly support        25            29.8
## 2 Somewhat support        34            40.5
## 3 Somewhat oppose         16            19.0
## 4 Strongly oppose          9            10.7
topline(data, variable = stapleton_so_shift, 
        weight = weight, pct = FALSE, cum_pct=FALSE)
## # A tibble: 5 × 3
##   Response         Frequency `Valid Percent`
##   <fct>                <dbl>           <dbl>
## 1 Strongly support        15            19.5
## 2 Somewhat support        24            31.2
## 3 Somewhat oppose         22            28.6
## 4 Strongly oppose         16            20.8
## 5 (Missing)              189            NA
topline(data, variable = stapleton_so_bases, 
        weight = weight, pct = FALSE, cum_pct=FALSE)
## # A tibble: 5 × 3
##   Response         Frequency `Valid Percent`
##   <fct>                <dbl>           <dbl>
## 1 Strongly support        16            20  
## 2 Somewhat support        28            35  
## 3 Somewhat oppose         22            27.5
## 4 Strongly oppose         14            17.5
## 5 (Missing)              186            NA

Graphing Weighted Univariate Distributions

One variable at a time

To graph the weighted response distribution for one variable, we use a two-step process.

  1. Save the weighted distribution for the desired variable as a dataframe
  2. Use the dataframe created in Step 1 along with ggplot to graph the results

Step 1: Saving Weighted Distribution for Graphing

Let’s say I want to create a graph of the weighted distribution for the Universal DH variable (I also do the same calculation for the other three variables at the same time). The first step is to save the weighted distribution percentages as a dataframe or tibble. We do this using our weight variable along with general tidyverse code.

Working out of our cleaned data, we first save a new dataframe, dh_freq, for graphing purposes. We then filter out any NA responses; from the previous tutorial we know there is a lot of missing data in this variable. The variable we eventually want to graph goes in the group_by(stapleton_so_dh) section of the code. We then use summarise along with sum(weight, na.rm=TRUE) to calculate the weighted number of observations for each response option. We finish by calculating the weighted percentage for each response option and use arrange to order the data from lowest (strongly support) to highest (strongly oppose).

dh_freq <- data %>%
  filter(!is.na(stapleton_so_dh)) %>% # Drop NA responses
  group_by(stapleton_so_dh) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_dh) #Orders Data from low to high

pc_freq <- data %>%
  filter(!is.na(stapleton_so_pc)) %>% # Drop NA responses
  group_by(stapleton_so_pc) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_pc) #Orders Data from low to high

shift_freq <- data %>%
  filter(!is.na(stapleton_so_shift)) %>% # Drop NA responses
  group_by(stapleton_so_shift) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_shift) #Orders Data from low to high

bases_freq <- data %>%
  filter(!is.na(stapleton_so_bases)) %>% # Drop NA responses
  group_by(stapleton_so_bases) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_bases) #Orders Data from low to high
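
As an aside, because the same pipeline is repeated four times above with only the variable name changing, you could also wrap it in a small helper function. This is just a sketch using dplyr's curly-curly ({{ }}) syntax to pass a bare variable name; the four separate blocks above work just as well.

weighted_freq <- function(df, var) {
  df %>%
    filter(!is.na({{ var }})) %>% # Drop NA responses
    group_by({{ var }}) %>%
    summarise(weighted_count = sum(weight, na.rm = TRUE), .groups = "drop") %>%
    mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
    arrange({{ var }}) # Orders data from low to high
}

dh_freq_alt <- weighted_freq(data, stapleton_so_dh) # Same result as dh_freq above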

Step 2: Plotting the Weighted Distribution

With the new dataframe, dh_freq, we can now graph the weighted percentages and use the graph in any number of ways. We will start with a basic ggplot graph before cleaning it up some. Notice that we use as.factor(stapleton_so_dh) for the x and fill options. This allows us to work more easily with survey data.

ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage, 
                    fill = as.factor(stapleton_so_dh))) +
  geom_bar(stat = "identity", width = 0.7)

Above we see a basic ggplot which has our weighted distribution for support/opposition to the Universal DH in MLB. But that is about all that we have. The colors are harsh on the eyes and there are no labels. We need to change that before we use this in any formal way.

ggplot2 is a really powerful graphing package in R; we do not have time to cover all of its functionality here. The new things we add to this plot are:

  1. Actual percentage values on the plot, rounded to 1 decimal
  2. Customized color scheme that doesn’t hurt the eyes
  3. Add value labels to the X-axis
  4. Remove the legend

ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage, 
                    fill = as.factor(stapleton_so_dh))) +
  geom_bar(stat = "identity", width = 0.7) +  # Bar chart
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = -0.5, color = "black", size = 4) +  # Add percentage labels
  scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
                               "2" = "darkolivegreen3", 
                               "3" = "lightcoral", 
                               "4" = "firebrick")) +  
  scale_x_discrete(labels = c("1" = "Strongly\nSupport",   # Replace x-axis tick labels
      "2" = "Somewhat\nSupport", 
      "3" = "Somewhat\nOppose", 
      "4" = "Strongly\nOppose")) +
  labs(title = "Support for Universal DH among MLB Fans",
       subtitle="N=82",
       x = "",
       y = "Percentage") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 12, angle = 45, hjust = 1),  # Rotate and style x-axis labels
    axis.title.y = element_text(size = 14),  # Style y-axis label
    legend.position = "none")  # Remove legend

Cleaned Up Univariate Graph

This is starting to look good, but there appear to be margin issues, at least with how it renders in RStudio. Let’s change the y-axis limit to 50 to ensure nothing runs off the margin. To do this, we simply add ylim(0, 50) to the code.

ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage, 
                    fill = as.factor(stapleton_so_dh))) +
  geom_bar(stat = "identity", width = 0.7) +  # Bar chart
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = -0.5, color = "black", size = 4) +  # Add percentage labels
  scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
                               "2" = "darkolivegreen3", 
                               "3" = "lightcoral", 
                               "4" = "firebrick")) +  
  scale_x_discrete(labels = c("1" = "Strongly\nSupport",   # Replace x-axis tick labels
                              "2" = "Somewhat\nSupport", 
                              "3" = "Somewhat\nOppose", 
                              "4" = "Strongly\nOppose")) +
  labs(title = "Support for Universal DH among MLB Fans",
       subtitle = "N=82",
       x = "",
       y = "Percentage") +
  ylim(0, 50) +  # Set y-axis limit
  theme_minimal() +
  theme(axis.text.x = element_text(size = 12, angle = 45, hjust = 1),  # Rotate and style x-axis labels
        axis.title.y = element_text(size = 14),  # Style y-axis title
    legend.position = "none")  # Remove legend

Combining Multiple GGPlots

You could then easily do something like this for each of your ordinal scale survey questions. You could also combine them into one plot that shows all four individual graphs using gridExtra. You then save each individual plot before combining them. Here, we combine the universal DH response with the support/opposition to the pitch clock response.

dh<- ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage, 
                    fill = as.factor(stapleton_so_dh))) +
  geom_bar(stat = "identity", width = 0.7) +  # Bar chart
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = -0.5, color = "black", size = 4) +  # Add percentage labels
  scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
                               "2" = "darkolivegreen3", 
                               "3" = "lightcoral", 
                               "4" = "firebrick")) +  
  scale_x_discrete(labels = c("1" = "Strongly\nSupport",   # Replace x-axis tick labels
                              "2" = "Somewhat\nSupport", 
                              "3" = "Somewhat\nOppose", 
                              "4" = "Strongly\nOppose")) +
  labs(title = "Support for Universal DH among MLB Fans",
       subtitle = "N=82",
       x = "",
       y = "Percentage") +
  ylim(0, 50) +  # Set y-axis limit
  theme_minimal() +
  theme(axis.text.x = element_text(size = 11, angle = 45, hjust = 1),  # Rotate and style x-axis labels
        axis.title.y = element_text(size = 11),  # Style y-axis title
    legend.position = "none")  # Remove legend


pc<- ggplot(pc_freq, aes(x = as.factor(stapleton_so_pc), y = percentage, 
                    fill = as.factor(stapleton_so_pc))) +
  geom_bar(stat = "identity", width = 0.7) +  # Bar chart
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = -0.5, color = "black", size = 4) +  # Add percentage labels
  scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
                               "2" = "darkolivegreen3", 
                               "3" = "lightcoral", 
                               "4" = "firebrick")) +  
  scale_x_discrete(labels = c("1" = "Strongly\nSupport",   # Replace x-axis tick labels
                              "2" = "Somewhat\nSupport", 
                              "3" = "Somewhat\nOppose", 
                              "4" = "Strongly\nOppose")) +
  labs(title = "Support for Pitch among MLB Fans",
       subtitle = "N=84",
       x = "",
       y = "Percentage") +
  ylim(0, 50) +  # Set y-axis limit
  theme_minimal() +
  theme(axis.text.x = element_text(size = 11, angle = 45, hjust = 1),  # Rotate and style x-axis labels
        axis.title.y = element_text(size = 11),  # Style y-axis title
    legend.position = "none")  # Remove legend

combined_plot <- grid.arrange(dh, pc, ncol = 1, nrow = 2)

combined_plot
## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
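
If you want to save the combined figure for use outside of R, ggsave() from ggplot2 can also write an arranged set of plots to a file. A quick sketch (the file name and dimensions here are just placeholders):

ggsave("dh_pc_combined.png", arrangeGrob(dh, pc, ncol = 1), width = 7, height = 8) # arrangeGrob() builds the combined grob without drawing it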

Stacked Bar Charts

You could also get a bit more complex and combine multiple of your ordinal scale variables, measured on the same scale, onto a stacked bar plot. This is a slightly more complex coding approach, but it can be a useful visualization.

The first thing we will do is save the weighted distributions for all four variables, measured on the same scale, into one dataframe. To do this, we first save a vector, vars, that simply includes the names of the four variables we want to graph in this plot: vars <- c("stapleton_so_dh", "stapleton_so_pc", "stapleton_so_shift", "stapleton_so_bases"). This simplifies the next chunk of code; for your own variables, just replace the current names with your variable names.

We save a new dataframe called weighted_frequencies, which contains the weighted values for each of the four variables for graphing. We use the select function to keep only the variables in the vars vector (plus the weight). We use the pivot_longer function to reshape the data into the right format for graphing; you do not need to change this if doing this on your project. We then filter out any missing data and do the same weighted calculation for each variable that we did above.

We then directly change the names of the variables and the x-axis labels to something appropriate for a graph.

vars <- c("stapleton_so_dh", "stapleton_so_pc", "stapleton_so_shift", "stapleton_so_bases")

weighted_frequencies  <- data %>%
  select(all_of(vars), weight) %>%
  pivot_longer(cols = all_of(vars),
    names_to = "variable",
    values_to = "response") %>%
  filter(!is.na(response)) %>%
  group_by(variable, response) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count)),
    response_label = case_when(
      response == 1 ~ "Strongly Approve",
      response == 2 ~ "Somewhat Approve",
      response == 3 ~ "Somewhat Oppose",
      response == 4 ~ "Strongly Oppose"),
    variable = case_when(
      variable == "stapleton_so_dh" ~ "Universal DH",
      variable == "stapleton_so_pc" ~ "Pitch Clock",
      variable == "stapleton_so_shift" ~ "Shift Ban",
      variable == "stapleton_so_bases" ~ "Larger Bases")) %>%
  ungroup()

This data is now ready for graphing. Before actually plotting the four stacked bar charts, we need to force the response option labels to go in the order we want. Otherwise, R orders them alphabetically, which is usually not what we want. Note that what comes last in the order will show up on the left of a horizontal legend, as you will shortly see. We then use ggplot to graph the percentage of each response option split by the individual variable.

weighted_frequencies$response_label <- factor(weighted_frequencies$response_label, #Force Order
                                   levels = c("Strongly Oppose", 
                                              "Somewhat Oppose", 
                                              "Somewhat Approve", 
                                              "Strongly Approve"))

# Plot
ggplot(weighted_frequencies, aes(x = percentage, y = variable, fill = response_label)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = paste0(round(percentage, 1), "%"), 
                group = response_label),
            position = position_fill(vjust = 0.5), 
            color = "white", size = 3) +
  scale_fill_manual(values = c("Strongly Approve" = "darkgreen",
                               "Somewhat Approve" = "darkolivegreen3",
                               "Somewhat Oppose" = "lightcoral",
                               "Strongly Oppose" = "firebrick"),
    name = "Response",
    breaks = c("Strongly Approve", "Somewhat Approve", "Somewhat Oppose", "Strongly Oppose")) +  # Explicit order) +
  labs(title = "Support for Baseball Rule Changes",
       subtitle = "Response Counts range from 77 to 82",
       x = "Percentage", y = NULL) +
  theme_minimal(base_size=12) +
  theme(legend.position = "top")

Here we see all four variables side-by-side along with the actual weighted percentage that selected each option. This can help people understand whether similar questions were answered in the same or a different fashion.

Graphing Yes-No or Select All That Apply Options

We can also do basically the same thing we just did using yes-no or select all that apply questions. Here, we graph the substantive branching select all that apply options that respondents initially answered and sort them based on how often each option was selected. Once again, we use weighted results. Since all of these questions start with main_branch_, we use that structure to simplify our code. If you have a select all that apply question in the final project, your options will all follow the same naming convention, so you can likely just substitute your naming convention where you see main_branch_. You might see warning messages, but that is okay here; the code still produces the accurate percentages we want for each variable.

sata_frequencies <- data %>%
  select(starts_with("main_branch_"), weight) %>%
  pivot_longer(
    cols = starts_with("main_branch_"),
    names_to = "variable",
    values_to = "response"
  ) %>%
  filter(!is.na(response)) %>%  # Drop NA responses
  group_by(variable, response) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(variable) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(variable, response)
## Warning: `main_branch_1` and `main_branch_2` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_3` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_4` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_5` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_6` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_7` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_8` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_9` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_10` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1

Because yes-no questions only have two options, you do not need to show both. Meaning, you can drop the no’s or the not-selected options since they are literally going to be 100 minus whatever percentage is reported on the graph. This cleans up the visualization, making it easier for the audience to understand the information.

In the next code chunk, we first set the order we want our graph to display the results. We use arrange(percentage) to sort ascending from smallest to largest and extract the variable names in that order (which puts the most-selected option at the top of the horizontal version of the chart). Note, if we wanted to sort the other way, from largest to smallest, we would use arrange(desc(percentage)). We then keep only the rows where respondents selected the option (response == 1) and set the factor order based on the first command.

order <- sata_frequencies %>%
  filter(response == 1) %>%
  arrange(percentage) %>%
  pull(variable)

sata_frequencies <- sata_frequencies %>%
  filter(response == 1) %>%
  mutate(variable = factor(variable, levels = order))

Now, we can simply graph the results using ggplot like before. First, we do it with vertical bars, then with horizontal bars.

ggplot(sata_frequencies, aes(x = variable, y = percentage, fill = as.factor(response))) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%"), group = response), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, color = "black", size = 3) +
  scale_x_discrete(labels = c("main_branch_1" = "Believe\nin God", 
                              "main_branch_2" = "Have Job", 
                              "main_branch_3" = "Victim\nof Fraud", 
                              "main_branch_4" = "Own(ed)\nBusiness", 
                              "main_branch_5" = "Attended College", 
                              "main_branch_6" = "Fan of MLB", 
                              "main_branch_7" = "Enjoy\nHiking", 
                              "main_branch_8" = "Use\nSocial Media",
                              "main_branch_9" = "Labor Union\nMember", 
                              "main_branch_10" = "None of\nthe Above")) +
  scale_fill_manual(values = c("1" = "grey"), 
                    name = "Response", 
                    breaks = c("1"), 
                    labels = c("Yes")) +
  labs(title = "Weighted % Selecting Each Initial SATA Question",
       x = "", y = "% Selecting") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

To make this horizontal, we flip the variables in the x and y options and then use scale_y_discrete instead of scale_x_discrete to label the y-axis instead of the x-axis like above.

ggplot(sata_frequencies, aes(x = percentage, y = variable, fill = as.factor(response))) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%"), group = response), 
            position = position_dodge(width = 0.9), 
            hjust = -0.1, color = "black", size = 3) +
  scale_y_discrete(labels = c("main_branch_1" = "Believe\nin God", 
                              "main_branch_2" = "Have Job", 
                              "main_branch_3" = "Victim\nof Fraud", 
                              "main_branch_4" = "Own(ed)\nBusiness", 
                              "main_branch_5" = "Attended College", 
                              "main_branch_6" = "Fan of MLB", 
                              "main_branch_7" = "Enjoy\nHiking", 
                              "main_branch_8" = "Use\nSocial Media",
                              "main_branch_9" = "Labor Union\nMember", 
                              "main_branch_10" = "None of\nthe Above")) +
  scale_fill_manual(values = c("1" = "darkgreen"), 
                    name = "Response", 
                    breaks = c("1"), 
                    labels = c("Yes")) +
  labs(title = "Weighted % Selecting Each Initial SATA Question",
       x = "% Selecting", y = "") +
  xlim(0,90)+
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Bivariate Data Review

While looking at variables one at a time can be interesting, we typically want to explore our data for patterns and relationships between two or more variables. We will look at three different exploration techniques here:

  1. Comparing Means between Two Groups
  2. Comparing Means between Multiple Groups
  3. Correlations between Multiple Ordinal/Continuous Scale Questions

Comparing Means between Two Groups

Let’s say that I was curious whether attending college - even without graduating - influences how people view the new baseball rules. Using the main_branch_5 question, which is yes if the respondent attended college at all and no if they did not, we can compare how the two groups answered these questions to see if there are differences in responses.

To start, we will look at mean support for the Universal DH variable using the 4-point scale, split by this variable. We use weighted.mean along with our weighting variable.

dhc_freq <- data %>%
  filter(!is.na(stapleton_so_dh)) %>%  # Drop NA responses
  group_by(main_branch_5) %>%
  summarise(
    response_count = n(),  # Add response count
    weighted_mean = weighted.mean(stapleton_so_dh, weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_mean / sum(weighted_mean, na.rm = TRUE)))

dhc_freq
## # A tibble: 2 × 4
##   main_branch_5                          response_count weighted_mean percentage
##   <dbl+lbl>                                       <int>         <dbl>      <dbl>
## 1 0                                                  34          1.82       46.2
## 2 1 [Attended college or university for…             48          2.12       53.8

We see that when main_branch_5 = 1 - i.e., those who attended college - the weighted mean for support/opposition to the universal DH is 2.13, while for those who did not attend college the weighted mean is 1.82. Lower values mean more support, so there is a possibility that those who never attended college support the universal DH at a higher rate than those who did attend.

Differences between groups can occur by random chance alone, so we will always do statistical tests to understand whether the difference we observe likely reflects a real difference in the population or is just random chance. Take one of the other numerous great DACSS courses that teach these techniques.

Here, we look at a t-test, which tests for “significant” differences on some numeric variable between two groups. Typically, if you have a p-value in these tests < .05, the difference is considered significant and something that truly reflects a difference in the population. With a small sample size like we have here, it is okay to set the limit at p < .1.

To run the t-test on our weighted sample, we first create a svydesign object that applies the survey weight to the analysis. This step is required to analyze weighted data with a t-test. Once we save this survey design object, survey_design, we perform the weighted t-test to see if the differences revealed are “significant”.

#Create the survey design object and run the weighted t-test with the survey package
survey_design <- svydesign(ids = ~1, data = data, weights = ~weight)

# Perform a weighted t-test
welch_t_test <- svyttest(stapleton_so_dh ~ main_branch_5, design = survey_design)

# Print the result
print(welch_t_test)
## 
##  Design-based t-test
## 
## data:  stapleton_so_dh ~ main_branch_5
## t = 1.5485, df = 80, p-value = 0.1255
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -0.08597111  0.68891229
## sample estimates:
## difference in mean 
##          0.3014706

There are two main pieces of information to look at here. The difference in means, .301, is simply the difference between the two groups on support for the DH. The other key thing to look at is the p-value, which in this test = .1255. This value is larger than .1, so I would conclude that attending college or not does not meaningfully influence how MLB fans perceive the Universal DH.
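
As a cross-check, once you have the survey_design object created above, the survey package can also compute these weighted group means directly with svyby() and svymean(), which should match the dplyr calculation earlier. A sketch; if your variable is still haven-labelled you may need to wrap it in as.numeric() first:

svyby(~stapleton_so_dh, ~main_branch_5, design = survey_design, svymean, na.rm = TRUE) # Weighted mean DH support by college attendance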

Comparing Means between 3+ Groups

T-tests are very useful but can only be used when the grouping variable has exactly 2 groups. If it has 3 or more, t-tests will not work. In this instance, we turn to Analysis of Variance (ANOVA) for our significance tests. We calculate the weighted means in the exact same way as with the t-test.

Let’s use the educ4 variable we created previously, which, despite the name, has 3 groups for education: HS degree or less, Some College, College Degree. We first calculate the weighted mean by group.

dhe_means <- data %>%
  filter(!is.na(stapleton_so_dh)) %>%  # Drop NA responses
  group_by(educ4) %>%
  summarise(
    response_count = n(),  # Add response count
    weighted_mean = weighted.mean(stapleton_so_dh, weight, na.rm = TRUE),
    .groups = "drop") %>%
  mutate(percentage = 100 * (weighted_mean / sum(weighted_mean)),
    response_label = case_when(
      educ4 == 1 ~ "HS or Less",
      educ4 == 2 ~ "Some College",
      educ4 == 3 ~ "College Degree"))
dhe_means
## # A tibble: 3 × 5
##   educ4                   response_count weighted_mean percentage response_label
##   <dbl+lbl>                        <int>         <dbl>      <dbl> <chr>         
## 1 1 [HS or Less]                      22          1.73       29.2 HS or Less    
## 2 2 [Some College]                    33          2.15       36.4 Some College  
## 3 3 [College Degree or M…             27          2.04       34.4 College Degree

We see that group 1, HS degree or less, has the highest support (the lowest mean value, confusingly enough), while those with some college have the lowest support. Once again, we want to statistically test for significant differences between these groups.

We use ANOVA to do this, along with an additional test called the Tukey Honestly Significant Difference (HSD), which conducts pairwise statistical tests like the t-test. We use the aov function to calculate the ANOVA results. First inside the () is the dependent variable, stapleton_so_dh, always followed by ~ and then the grouping variable, always wrapped in as.factor, like as.factor(educ4). Following this is the data to be used, and you directly specify the weight variable in this line of code using weight=weight.

anova_result <- aov(stapleton_so_dh ~ as.factor(educ4), data = data, weight=weight) #The as.factor code tells R to make the variable inside the () a factor otherwise it will not calculate correctly

# Step 3: Display the ANOVA result
print(summary(anova_result))
##                  Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(educ4)  2   2.43  1.2155   1.421  0.248
## Residuals        79  67.57  0.8553               
## 184 observations deleted due to missingness
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = stapleton_so_dh ~ as.factor(educ4), data = data, weight = weight)
## 
## $`as.factor(educ4)`
##           diff        lwr       upr     p adj
## 2-1  0.4242424 -0.1837951 1.0322799 0.2244212
## 3-1  0.3097643 -0.3247218 0.9442504 0.4767906
## 3-2 -0.1144781 -0.6877414 0.4587851 0.8823316

Graphing Mean Values by Group Membership

Here, none of the pairwise differences reach significance (every adjusted p-value is well above .1). If you do find significant differences, or just want to graph these weighted values by group, we can do that using ggplot once again. We make sure to specify factor(educ4) as the x variable to tell R this is a factor and not a number. Because the education variable has 3 levels, I specify three labels that match the variable as well as three colors to visually differentiate that the bars represent different group membership.

# Create the bar graph with custom colors specified in the ggplot command
ggplot(dhe_means, aes(x = factor(educ4), y = weighted_mean, fill = factor(educ4))) +
  geom_bar(stat = "identity") +  # Use geom_bar for vertical bars
  labs(title = "Opposition to Universal DH by Education Level",
    x = "Education Level",
    y = "Percentage") +
  scale_x_discrete(labels = c("1" = "HS\nor Less", # Update labels for x-axis
      "2" = "Some\nCollege", 
      "3" = "College Degree\nor More")) +
  scale_fill_manual(values = c("#1f77b4", 
      "#ff7f0e", 
      "darkgreen"), 
    name = "Education Level") +
  ylim(0, 2.5)+
  theme_minimal() +  # Apply a minimal theme
  theme(axis.text.x = element_text(size=12, angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    legend.position = "none")

Correlations Between Numerical Variables

If you have a continuous or ordinal variable with the Don’t Knows removed, you can run a correlation analysis. This analysis tells you whether two variables typically move in the same or opposite direction as each other. Meaning, if a person who supports (opposes) the Universal DH almost always supports (opposes) the Pitch Clock, opinions on these two variables would be highly correlated because most people answered them in the same way. Correlations can be positive, indicating that as the response on one variable gets larger (smaller), the response on the other variable tends to get larger (smaller) as well, or negative, indicating that as the response on one variable gets larger, the response on the other variable tends to get smaller. A correlation close to 0 means the two variables are not related, while the closer the correlation is to 1 or -1, the stronger the relationship.

For correlations to run in R, the variables have to be numeric. Unfortunately, our survey variables are of the haven_labelled class, meaning correlation analysis will not work unless we transform our existing variables into numeric versions (remember to save a new variable and keep the existing one intact).

We can do this easily, either one variable at a time or for multiple variables at once, using the following approaches. The new variables will all have _n at the end to signify that this is the numeric version of the variable.

data$stapleton_so_dh_n<-as.numeric(data$stapleton_so_dh) #1 at a time
#Multiple at once
data <- data %>%
  mutate(across(starts_with("stapleton_so_"), as.numeric, .names = "{.col}_n"))
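
As a side note, the haven package also has zap_labels(), which strips the value labels and returns the underlying values; it is an alternative to as.numeric() for getting correlation-ready variables. The object name below is just for illustration and is not used later.

dh_numeric_alt <- zap_labels(data$stapleton_so_dh) # Alternative to as.numeric(); drops the haven value labels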

Now that we have numeric variables, we can begin the process to calculate the correlations between the responses.

We can easily do this one pair at a time using the approach below. We use the cor function and then type in the two variable names, each prefixed with data$. We also need to specify the method - here we use pearson - as well as use="complete.obs". This last bit of code removes missing data; without it, the correlation will not run.

r_select <- cor(data$stapleton_so_dh_n, data$stapleton_so_pc_n, method = "pearson", use="complete.obs")
r_select
## [1] 0.4377297

Here, we see that the correlation between support for the DH and the Pitch Clock is .438. This is a small-to-medium sized correlation, indicating that opinions on these two rules are related, but not all that strongly.
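
If you also want to know whether this correlation is statistically distinguishable from zero, cor.test() in base R reports a p-value and confidence interval for the same Pearson correlation. A quick unweighted check (cor.test drops incomplete pairs on its own, so no use= argument is needed):

cor.test(data$stapleton_so_dh_n, data$stapleton_so_pc_n, method = "pearson") # Significance test for the r = .438 correlation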

If you want to do more than 2 variables at a time, you simply need to select the variables you want to include and then rerun the correlation command as illustrated above. Here I use the select function and stapleton_so_dh_n:stapleton_so_bases_n to tell R to include all the variables from stapleton_so_dh_n through stapleton_so_bases_n. This works because my variables are in order; if they are not in order, you would need to use a different select approach. If nothing else, you can manually type in the variable names you want to include.

new_data <- data %>%
  select(stapleton_so_dh_n:stapleton_so_bases_n)

r_all <-cor(new_data, method="pearson", use="complete.obs")
r_all
##                      stapleton_so_dh_n stapleton_so_1_n stapleton_so_2_n
## stapleton_so_dh_n            1.0000000        1.0000000        0.4347273
## stapleton_so_1_n             1.0000000        1.0000000        0.4347273
## stapleton_so_2_n             0.4347273        0.4347273        1.0000000
## stapleton_so_3_n             0.4018328        0.4018328        0.4377871
## stapleton_so_4_n             0.6536464        0.6536464        0.3917767
## stapleton_so_pc_n            0.4347273        0.4347273        1.0000000
## stapleton_so_shift_n         0.4018328        0.4018328        0.4377871
## stapleton_so_bases_n         0.6536464        0.6536464        0.3917767
##                      stapleton_so_3_n stapleton_so_4_n stapleton_so_pc_n
## stapleton_so_dh_n           0.4018328        0.6536464         0.4347273
## stapleton_so_1_n            0.4018328        0.6536464         0.4347273
## stapleton_so_2_n            0.4377871        0.3917767         1.0000000
## stapleton_so_3_n            1.0000000        0.5043263         0.4377871
## stapleton_so_4_n            0.5043263        1.0000000         0.3917767
## stapleton_so_pc_n           0.4377871        0.3917767         1.0000000
## stapleton_so_shift_n        1.0000000        0.5043263         0.4377871
## stapleton_so_bases_n        0.5043263        1.0000000         0.3917767
##                      stapleton_so_shift_n stapleton_so_bases_n
## stapleton_so_dh_n               0.4018328            0.6536464
## stapleton_so_1_n                0.4018328            0.6536464
## stapleton_so_2_n                0.4377871            0.3917767
## stapleton_so_3_n                1.0000000            0.5043263
## stapleton_so_4_n                0.5043263            1.0000000
## stapleton_so_pc_n               0.4377871            0.3917767
## stapleton_so_shift_n            1.0000000            0.5043263
## stapleton_so_bases_n            0.5043263            1.0000000

Now, we see more results since multiple variables are included. Quickly scan the values looking for those close to 1 or -1. The values that equal 1 are not relevant since that is the correlation between a variable and itself. The largest correlation in the table is between support for the universal DH and the larger bases at .654. This is a reasonably large correlation, indicating a fairly strong relationship between people’s opinions on these two rules.
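
One caveat worth noting: cor() ignores the survey weight entirely, which is harmless here because our fake weight equals 1 for everyone, but it matters once you have real weights. A rough sketch of a weighted correlation for one pair of variables, using cov.wt() from base R (it cannot handle missing values, so we drop incomplete rows first):

wt_pair <- data %>%
  select(stapleton_so_dh_n, stapleton_so_bases_n, weight) %>%
  drop_na() # cov.wt() does not accept missing values

cov.wt(wt_pair[, c("stapleton_so_dh_n", "stapleton_so_bases_n")],
       wt = wt_pair$weight, cor = TRUE)$cor # Weighted Pearson correlation matrix for the pair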

If you want something that is more graphical and fun, you can use the corrplot package to display the results. We simply use the correlation results we saved previously, r_all, and pass them to the corrplot command to produce a graphical display of the correlations. Below shows a variety of methods you can use.

num<-corrplot(r_all, method = 'circle') # Circles sized and shaded by correlation strength

num2<-corrplot(r_all, method = 'circle',  type="lower", diag=FALSE)  #Only show the lower triangle, without the diagonal

num3<-corrplot(r_all, method = 'number',  type="lower", diag=FALSE)  #Display the correlation values as numbers

num4<-corrplot(r_all, method = 'square',  type="lower", diag=FALSE)  #Use squares instead of circles

If I were going to show this graph to some audience, I would change the names to something more informative and visually appealing.