In this tutorial, we will review basic data exploration and analysis techniques to help explore interesting patterns in your survey results. We will cover weighted topline distributions, graphing results with ggplot2 (single, combined, and stacked bar charts), comparing groups with t-tests and ANOVA, and correlation analysis.
We start by importing the “cleaned” dataset we worked on Tuesday using the haven package, which I saved to my GitHub page.
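If you are starting from a fresh R session, load the packages used throughout this tutorial first. A minimal setup might look like the following; I am assuming here that the topline function comes from the pollster package, as in the earlier tutorials.
library(haven)     # read_dta() for Stata files
library(tidyverse) # dplyr, tidyr, and ggplot2
library(pollster)  # topline() - assumed source of the topline function
library(survey)    # svydesign() and svyttest() for weighted tests
library(gridExtra) # grid.arrange() for combining plots
library(corrplot)  # corrplot() for correlation graphics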
# Pull data directly from GitHub - requires an internet connection
url <- "https://github.com/drCES/raw_survey_data_dacss695/raw/main/data_clean.dta"
data <- read_dta(url) # Read the Stata file into R
names(data)
## [1] "start_date" "end_date"
## [3] "status" "ip_address"
## [5] "progress" "duration_in_seconds"
## [7] "finished" "recorded_date"
## [9] "response_id" "recipient_last_name"
## [11] "recipient_first_name" "recipient_email"
## [13] "external_reference" "location_latitude"
## [15] "location_longitude" "distribution_channel"
## [17] "user_language" "q_recaptcha_score"
## [19] "q_relevant_id_duplicate" "q_relevant_id_duplicate_score"
## [21] "q_relevant_id_fraud_score" "q_relevant_id_last_start_date"
## [23] "q1" "main_branch_1"
## [25] "main_branch_2" "main_branch_3"
## [27] "main_branch_4" "main_branch_5"
## [29] "main_branch_6" "main_branch_7"
## [31] "main_branch_8" "main_branch_9"
## [33] "main_branch_10" "mohanty_news_1"
## [35] "mohanty_news_2" "mohanty_news_3"
## [37] "mohanty_news_4" "mohanty_news_5"
## [39] "mohanty_news_6" "mohanty_news_6_text"
## [41] "dinsmore_smusage" "dinsmore_comms"
## [43] "dinsmore_comms_7_text" "yifan_livecomm_1"
## [45] "yifan_livecomm_2" "yifan_livecomm_3"
## [47] "yifan_livecomm_4" "shawn_worktype"
## [49] "shawn_worktype_5_text" "bot_check_4"
## [51] "bot_check_5" "bot_check_6"
## [53] "bot_check_7" "bot_check_8"
## [55] "gutierrez_suarez_ai" "krishna_cata_1"
## [57] "krishna_cata_2" "krishna_cata_3"
## [59] "krishna_cata_4" "krishna_cata_5"
## [61] "krishna_cata_6" "krishna_cata_7"
## [63] "krishna_cata_8" "krishna_cata_9"
## [65] "krishna_cata_10" "krishna_cata_11"
## [67] "krishna_cata_10_text" "ai_ad_1"
## [69] "ai_ad_2" "ai_ad_3"
## [71] "krishna_dc" "gutierrez_clear"
## [73] "suarez_ad2_1" "suarez_ad2_2"
## [75] "dinsmore_mh" "balint_stress"
## [77] "gomez_dinsmore_ad_1" "gomez_dinsmore_ad_2"
## [79] "gomez_dinsmore_ad_3" "gomez_dinsmore_ad_4"
## [81] "dinsmore_friendship" "mohanty_emotneeds"
## [83] "sc_approval" "brake_influence"
## [85] "brake_confidence" "brake_ideology"
## [87] "brake_number" "mohanty_disagree"
## [89] "mohanty_famideo" "mohanty_famideo_13_text"
## [91] "mohanty_commideo" "mohanty_commideo_6_text"
## [93] "amalea_source_1" "amalea_source_2"
## [95] "amalea_source_3" "amalea_source_4"
## [97] "amalea_source_5" "amalea_source_6"
## [99] "amalea_source_7" "amalea_source_6_text"
## [101] "amalea_fam" "amalea_stigma"
## [103] "amalea_ad" "god_importance_xu"
## [105] "god_help_xu" "god_answer_xu"
## [107] "gomez_recruit" "gomez_success"
## [109] "balint_struggle" "balint_cost"
## [111] "balint_inc_1" "balint_inc_2"
## [113] "balint_inc_3" "balint_inc_4"
## [115] "balint_inc_5" "balint_inc_6"
## [117] "balint_inc_5_text" "balint_impact"
## [119] "mitchell_threat" "mitchell_ad_1"
## [121] "mitchell_ad_2" "q1zo"
## [123] "q2zo" "q5zo"
## [125] "omar_ad_1" "omar_ad_2"
## [127] "kopecky_sat" "yeumin_rec"
## [129] "yuemin_ad_1" "yuemin_ad_2"
## [131] "yuemin_ad_3" "yuemin_ad_4"
## [133] "kopecky_goals_1" "kopecky_goals_2"
## [135] "kopecky_goals_3" "kopecky_goals_4"
## [137] "kopecky_goals_5" "kopecky_goals_6"
## [139] "kopecky_goals_7" "kopecky_goals_7_text"
## [141] "kopecky_focus" "kopecky_focus_6_text"
## [143] "kopecky_type_1" "kopecky_type_2"
## [145] "kopecky_type_3" "shawn_overall"
## [147] "shawn_ad" "stapleton_so_1"
## [149] "stapleton_so_2" "stapleton_so_3"
## [151] "stapleton_so_4" "q107"
## [153] "yifan_why_12" "yifan_why_13"
## [155] "yifan_why_14" "yifan_why_15"
## [157] "yifan_why_16" "yifan_why_18"
## [159] "yifan_why_19" "yifan_why_20"
## [161] "yifan_why_21" "yifan_why_22"
## [163] "yifan_why_22_text" "yifan_length"
## [165] "mitchell_cc" "pol_ideo"
## [167] "pol_ideo_8_text" "god_pray_xu"
## [169] "rid" "age"
## [171] "gender" "hhi"
## [173] "ethnicity" "hispanic"
## [175] "education" "region"
## [177] "zip" "political_party"
## [179] "correct" "stapleton_so_dh"
## [181] "stapleton_so_pc" "stapleton_so_shift"
## [183] "stapleton_so_bases" "educ"
## [185] "college" "educ4"
## [187] "stapleton_so_dh_2"
We have now reimported our cleaned data, which includes the data transformations conducted previously. This data has 266 observations, which matches the number we finished with previously after cleaning.
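A quick check confirms the sample size:
nrow(data) # Should return 266, matching the cleaned sample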
Because we are interested in understanding our data better in this tutorial, we should ideally be using survey weights. However, because part of your final project is to create survey weights based on the provided demographic profile, this tutorial will not use weighted data. To ensure the code is still usable with weighted data, we first create a fake weight variable that equals 1 and include it in each analysis. When you do this on your own survey data, you will use whatever you name your final survey weight once you have completed that procedure.
data$weight <- 1
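Once you have built your real weight for the final project, you would substitute it here instead. For example, if your weighting procedure produced a variable called final_weight (a hypothetical name), you could simply copy it over so the rest of the code runs unchanged:
# Hypothetical: replace the constant with your actual survey weight
# data$weight <- data$final_weight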
The first thing we want to do is calculate the margin of error (MOE) for our final sample of valid responses, which has 266 completes. We simply use the margin of error function from earlier in the semester to calculate the MOE for the full sample. Remember, for subgroups and variables with smaller samples, the MOE will be larger.
moe_fun <- function(p, n, cv) {
  step1 <- p * (1 - p)
  step2 <- step1 / n
  se <- step2^.5 # Standard error of the proportion
  moe <- cv * se
  print(moe)
}
moe_fun(.5, 266, 1.96) # Set proportion to .5, sample size to 266 (the full sample size), and critical value to 1.96 for a 95% CI
## [1] 0.06008766
Running our MOE function with the appropriate values, we see that our MOE for the full sample is +/- 6%. Meaning, for any full-sample estimate, we expect the true population opinion on the survey question to be somewhere within +/- 6% of that value.
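The same function works for subgroups. For example, only about 82 respondents answered the MLB rule questions analyzed below, which produces a noticeably wider interval:
moe_fun(.5, 82, 1.96) # Subgroup MOE: returns roughly 0.108, or +/- 10.8%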
This tutorial will primarily use the same five variables as the previous tutorial, all related to Major League Baseball and measuring support for or opposition to four recent rule changes made to the game. We use the cleaned versions of these variables, with the non-substantive response options such as Don't Know removed, for our data exploration.
The primary variables are:
stapleton_so_dh = measure of support for universal DH
stapleton_so_pc = measure of support for pitch clock
stapleton_so_shift = measure of support for infield shift ban
stapleton_so_bases = measure of support for larger bases
main_branch_6 = Respondent considers self a fan of MLB (branching question for above)
Ultimately, our goal here is to find interesting patterns in the data that suggest differences in opinion in the population. Data exploration in general public surveys like this is about finding what might be interesting to the public or stakeholders who will see the results. This is different from academic-style survey research, which should be focused on testing theories with empirically falsifiable hypotheses. Take other DACSS courses to learn about that approach to survey research.
While the topline code is good for creating public-facing reports, it is not necessary for data exploration. Here, we use a slightly different approach. We start by looking at the weighted distribution of each variable to get an understanding of the population estimate.
topline(data, variable = stapleton_so_dh,
weight = weight, pct = FALSE, cum_pct=FALSE)
## # A tibble: 5 × 3
## Response Frequency `Valid Percent`
## <fct> <dbl> <dbl>
## 1 Strongly support 28 34.1
## 2 Somewhat support 33 40.2
## 3 Somewhat oppose 14 17.1
## 4 Strongly oppose 7 8.54
## 5 (Missing) 184 NA
topline(data, variable = stapleton_so_pc,
weight = weight, pct = FALSE, cum_pct=FALSE, remove = c("(Missing)"))
## # A tibble: 4 × 3
## Response Frequency `Valid Percent`
## <fct> <dbl> <dbl>
## 1 Strongly support 25 29.8
## 2 Somewhat support 34 40.5
## 3 Somewhat oppose 16 19.0
## 4 Strongly oppose 9 10.7
topline(data, variable = stapleton_so_shift,
weight = weight, pct = FALSE, cum_pct=FALSE)
## # A tibble: 5 × 3
## Response Frequency `Valid Percent`
## <fct> <dbl> <dbl>
## 1 Strongly support 15 19.5
## 2 Somewhat support 24 31.2
## 3 Somewhat oppose 22 28.6
## 4 Strongly oppose 16 20.8
## 5 (Missing) 189 NA
topline(data, variable = stapleton_so_bases,
weight = weight, pct = FALSE, cum_pct=FALSE)
## # A tibble: 5 × 3
## Response Frequency `Valid Percent`
## <fct> <dbl> <dbl>
## 1 Strongly support 16 20
## 2 Somewhat support 28 35
## 3 Somewhat oppose 22 27.5
## 4 Strongly oppose 14 17.5
## 5 (Missing) 186 NA
To graph the weighted response distribution for one variable, we use a two-step process.
Let’s say I want to create a graph of the weighted distribution for the Universal DH variable (I also do the same calculation on the other three variables at the same time). The first step is to save the weighted distribution percentages as a dataframe or tibble. We do this using general tidyverse code, summing our weight variable within each response option.
Working out of our cleaned data, first we save a new dataframe, dh_freq, for graphing purposes. We then filter out any NA responses; from the previous tutorial we know there is a lot of missing data in this variable. The variable we eventually want to graph goes in the group_by(stapleton_so_dh) section of the code. We then use summarise along with sum(weight, na.rm=TRUE) to calculate the weighted number of observations for each response option. We finish by calculating the weighted percentage for each response option and use arrange to order the data from lowest (strongly support) to highest (strongly oppose).
dh_freq <- data %>%
  filter(!is.na(stapleton_so_dh)) %>% # Drop NA responses
  group_by(stapleton_so_dh) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_dh) # Orders data from low to high

pc_freq <- data %>%
  filter(!is.na(stapleton_so_pc)) %>% # Drop NA responses
  group_by(stapleton_so_pc) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_pc) # Orders data from low to high

shift_freq <- data %>%
  filter(!is.na(stapleton_so_shift)) %>% # Drop NA responses
  group_by(stapleton_so_shift) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_shift) # Orders data from low to high

bases_freq <- data %>%
  filter(!is.na(stapleton_so_bases)) %>% # Drop NA responses
  group_by(stapleton_so_bases) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(stapleton_so_bases) # Orders data from low to high
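Before graphing, it is worth printing one of these dataframes to confirm the structure; the percentage column within each should sum to 100:
dh_freq # Inspect the weighted counts and percentages
sum(dh_freq$percentage) # Should return 100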
With the new dataframe, dh_freq, we can now graph the weighted percentages and use the graph in any number of ways. We will start with a basic ggplot graph before cleaning it up some. Notice that we use as.factor(stapleton_so_dh) for the x and fill options. This allows us to work more easily with survey data specifically.
ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage,
fill = as.factor(stapleton_so_dh))) +
geom_bar(stat = "identity", width = 0.7)
Above we see a basic ggplot showing our weighted distribution of support for and opposition to the Universal DH in MLB. But that is about all we have: the colors are harsh on the eyes and there are no labels. We need to change that before we use this in any formal way.
ggplot2 is a really powerful graphing package in R, and we do not have time to cover all of its functionality. New things we add to this plot are: percentage labels above each bar with geom_text, custom bar colors with scale_fill_manual, readable x-axis tick labels with scale_x_discrete, a title, subtitle, and axis labels with labs, and a minimal theme with the legend removed.
ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage,
fill = as.factor(stapleton_so_dh))) +
geom_bar(stat = "identity", width = 0.7) + # Bar chart
geom_text(aes(label = paste0(round(percentage, 1), "%")),
vjust = -0.5, color = "black", size = 4) + # Add percentage labels
scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
"2" = "darkolivegreen3",
"3" = "lightcoral",
"4" = "firebrick")) +
scale_x_discrete(labels = c("1" = "Strongly\nSupport", # Replace x-axis tick labels
"2" = "Somewhat\nSupport",
"3" = "Somewhat\nOppose",
"4" = "Strongly\nOppose")) +
labs(title = "Support for Universal DH among MLB Fans",
subtitle="N=82",
x = "",
y = "Percentage") +
theme_minimal() +
theme(axis.text.x = element_text(size = 12, angle = 45, hjust = 1), # Rotate and style x-axis labels
axis.title.y = element_text(size = 14), # Style y-axis label
legend.position = "none") # Remove legend
This is starting to look good, but there appear to be margin issues, at least with how it renders in RStudio. Let’s change the y-axis limit to 50 to ensure nothing runs off the margin. To do this, we simply add ylim(0, 50) to the code.
ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage,
fill = as.factor(stapleton_so_dh))) +
geom_bar(stat = "identity", width = 0.7) + # Bar chart
geom_text(aes(label = paste0(round(percentage, 1), "%")),
vjust = -0.5, color = "black", size = 4) + # Add percentage labels
scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
"2" = "darkolivegreen3",
"3" = "lightcoral",
"4" = "firebrick")) +
scale_x_discrete(labels = c("1" = "Strongly\nSupport", # Replace x-axis tick labels
"2" = "Somewhat\nSupport",
"3" = "Somewhat\nOppose",
"4" = "Strongly\nOppose")) +
labs(title = "Support for Universal DH among MLB Fans",
subtitle = "N=82",
x = "",
y = "Percentage") +
ylim(0, 50) + # Set y-axis limit
theme_minimal() +
theme(axis.text.x = element_text(size = 12, angle = 45, hjust = 1), # Rotate and style x-axis labels
axis.title.y = element_text(size = 14), # Style y-axis title
legend.position = "none") # Remove legend
You could then easily do something like this for each of your ordinal scale survey questions. You could also combine them into one plot that shows all four individual graphs using gridExtra. You then save each individual plot before combining them. Here, we combine the universal DH response with the support/opposition to the pitch clock response.
dh <- ggplot(dh_freq, aes(x = as.factor(stapleton_so_dh), y = percentage,
                          fill = as.factor(stapleton_so_dh))) +
geom_bar(stat = "identity", width = 0.7) + # Bar chart
geom_text(aes(label = paste0(round(percentage, 1), "%")),
vjust = -0.5, color = "black", size = 4) + # Add percentage labels
scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
"2" = "darkolivegreen3",
"3" = "lightcoral",
"4" = "firebrick")) +
scale_x_discrete(labels = c("1" = "Strongly\nSupport", # Replace x-axis tick labels
"2" = "Somewhat\nSupport",
"3" = "Somewhat\nOppose",
"4" = "Strongly\nOppose")) +
labs(title = "Support for Universal DH among MLB Fans",
subtitle = "N=82",
x = "",
y = "Percentage") +
ylim(0, 50) + # Set y-axis limit
theme_minimal() +
theme(axis.text.x = element_text(size = 11, angle = 45, hjust = 1), # Rotate and style x-axis labels
axis.title.y = element_text(size = 11), # Style y-axis title
legend.position = "none") # Remove legend
pc <- ggplot(pc_freq, aes(x = as.factor(stapleton_so_pc), y = percentage,
                          fill = as.factor(stapleton_so_pc))) +
geom_bar(stat = "identity", width = 0.7) + # Bar chart
geom_text(aes(label = paste0(round(percentage, 1), "%")),
vjust = -0.5, color = "black", size = 4) + # Add percentage labels
scale_fill_manual(values = c("1" = "darkgreen", # Custom colors
"2" = "darkolivegreen3",
"3" = "lightcoral",
"4" = "firebrick")) +
scale_x_discrete(labels = c("1" = "Strongly\nSupport", # Replace x-axis tick labels
"2" = "Somewhat\nSupport",
"3" = "Somewhat\nOppose",
"4" = "Strongly\nOppose")) +
labs(title = "Support for Pitch Clock among MLB Fans",
subtitle = "N=84",
x = "",
y = "Percentage") +
ylim(0, 50) + # Set y-axis limit
theme_minimal() +
theme(axis.text.x = element_text(size = 11, angle = 45, hjust = 1), # Rotate and style x-axis labels
axis.title.y = element_text(size = 11), # Style y-axis title
legend.position = "none") # Remove legend
combined_plot <- grid.arrange(dh, pc, ncol = 1, nrow = 2)
combined_plot
## TableGrob (2 x 1) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
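If you want to save any of these plots to disk, ggsave should work on the individual plots; for the combined version, the documented gridExtra pattern is to wrap the plots with arrangeGrob, which returns a saveable grob without drawing it. The file names here are illustrative:
ggsave("dh_plot.png", dh, width = 6, height = 4) # Save a single plot
ggsave("combined_plot.png", arrangeGrob(dh, pc, ncol = 1), width = 6, height = 8) # Save the combined grob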
You could also get a bit more complex and combine multiple of your ordinal scale variables, measured on the same scale, onto a stacked bar plot. This is a more complex coding approach but can produce a useful visualization.
The first thing we will do is save the weighted distributions for all four variables, measured on the same scale, into the same dataframe. To do this, we first save a vector, vars, that simply includes the names of the four variables we want to graph in this plot: vars <- c("stapleton_so_dh", "stapleton_so_pc", "stapleton_so_shift", "stapleton_so_bases"). This simplifies the next chunk of code; for your variables, just replace the current names with your variable names.
We save a new dataframe called weighted_frequencies, which contains the weighted values for each of the four variables for graphing. We use the select function to pull from data only the variables in the vars vector. We use the pivot_longer function to put the data in the right format for graphing; you do not need to change this when doing it for your project. We then filter out any missing data and do the same weighted calculation for each variable as above. Finally, we directly change the names of the variables and the response labels to something appropriate for a graph.
vars <- c("stapleton_so_dh", "stapleton_so_pc", "stapleton_so_shift", "stapleton_so_bases")

weighted_frequencies <- data %>%
  select(all_of(vars), weight) %>%
  pivot_longer(cols = all_of(vars),
               names_to = "variable",
               values_to = "response") %>%
  filter(!is.na(response)) %>%
  group_by(variable, response) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count)),
         response_label = case_when(
           response == 1 ~ "Strongly Approve",
           response == 2 ~ "Somewhat Approve",
           response == 3 ~ "Somewhat Oppose",
           response == 4 ~ "Strongly Oppose"),
         variable = case_when(
           variable == "stapleton_so_dh" ~ "Universal DH",
           variable == "stapleton_so_pc" ~ "Pitch Clock",
           variable == "stapleton_so_shift" ~ "Shift Ban",
           variable == "stapleton_so_bases" ~ "Larger Bases")) %>%
  ungroup()
This data is now ready for graphing. Before actually plotting the four stacked bar charts, we need to force the response option labels into the order we want; otherwise R orders them alphabetically, which is usually not correct. Note that whatever comes last in the order will show up on the left of a horizontal legend, as you will shortly see. We then use ggplot to graph the percentage of each response option split by the individual variable.
weighted_frequencies$response_label <- factor(weighted_frequencies$response_label, # Force order
                                              levels = c("Strongly Oppose",
                                                         "Somewhat Oppose",
                                                         "Somewhat Approve",
                                                         "Strongly Approve"))
# Plot
ggplot(weighted_frequencies, aes(x = percentage, y = variable, fill = response_label)) +
geom_bar(stat = "identity", position = "fill") +
geom_text(aes(label = paste0(round(percentage, 1), "%"),
group = response_label),
position = position_fill(vjust = 0.5),
color = "white", size = 3) +
scale_fill_manual(values = c("Strongly Approve" = "darkgreen",
"Somewhat Approve" = "darkolivegreen3",
"Somewhat Oppose" = "lightcoral",
"Strongly Oppose" = "firebrick"),
name = "Response",
breaks = c("Strongly Approve", "Somewhat Approve", "Somewhat Oppose", "Strongly Oppose")) + # Explicit legend order
labs(title = "Support for Baseball Rule Changes",
subtitle = "Response Counts range from 77 to 82",
x = "Percentage", y = NULL) +
theme_minimal(base_size=12) +
theme(legend.position = "top")
Here we see all four variables side by side, along with the actual weighted percentage that selected each option. This can help people see whether similar questions were answered in the same or a different fashion.
We can also do basically the same thing we just did with yes-no or select-all-that-apply questions. Here, we graph the substantive branching select-all-that-apply questions that respondents initially answered and sort them based on which option was selected the most. Once again, we use weighted results. Since all of these questions start with main_branch_, we use that structure to simplify our code. If you have a select-all-that-apply question in the final project, your options will all follow the same naming convention, so you can likely just substitute your naming convention where you see main_branch_. You might see warning messages, but that is okay here; the code still produces the accurate percentages we want for each variable.
sata_frequencies <- data %>%
  select(starts_with("main_branch_"), weight) %>%
  pivot_longer(
    cols = starts_with("main_branch_"),
    names_to = "variable",
    values_to = "response"
  ) %>%
  filter(!is.na(response)) %>% # Drop NA responses
  group_by(variable, response) %>%
  summarise(
    weighted_count = sum(weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(variable) %>%
  mutate(percentage = 100 * (weighted_count / sum(weighted_count))) %>%
  arrange(variable, response)
## Warning: `main_branch_1` and `main_branch_2` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_3` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_4` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_5` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_6` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_7` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_8` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_9` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
## Warning: `main_branch_1` and `main_branch_10` have conflicting value labels.
## ℹ Labels for these values will be taken from `main_branch_1`.
## ✖ Values: 1
Because yes-no questions only have two options, you do not need to show both. Meaning, you should delete the no's or the not-selected options, since they are literally going to be 1 minus whatever percentage is reported on the graph. This cleans up the visualization, making it easier for the audience to understand the information.
In the next code chunk, we first set the order in which we want our graph to display the results. We use arrange(percentage) to sort from smallest to largest and extract the variable names in the order we want to display them. Note, if we wanted to reverse the sort, from largest to smallest, we would use arrange(desc(percentage)). We then delete the respondents who did not select the option and set the order based on the first command.
order <- sata_frequencies %>%
  filter(response == 1) %>%
  arrange(percentage) %>%
  pull(variable)

sata_frequencies <- sata_frequencies %>%
  filter(response == 1) %>%
  mutate(variable = factor(variable, levels = order))
Now, we can simply graph the results using ggplot like previously. First, we do it with vertical bars, then with horizontal bars.
ggplot(sata_frequencies, aes(x = variable, y = percentage, fill = as.factor(response))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%"), group = response),
position = position_dodge(width = 0.9),
vjust = -0.5, color = "black", size = 3) +
scale_x_discrete(labels = c("main_branch_1" = "Believe\nin God",
"main_branch_2" = "Have Job",
"main_branch_3" = "Victim\nof Fraud",
"main_branch_4" = "Own(ed)\nBusiness",
"main_branch_5" = "Attended College",
"main_branch_6" = "Fan of MLB",
"main_branch_7" = "Enjoy\nHiking",
"main_branch_8" = "Use\nSocial Media",
"main_branch_9" = "Labor Union\nMember",
"main_branch_10" = "None of\nthe Above")) +
scale_fill_manual(values = c("1" = "grey"),
name = "Response",
breaks = c("1"),
labels = c("Yes")) +
labs(title = "Weighted % Selecting Each Initial SATA Question",
x = "", y = "% Selecting") +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
To make this horizontal, we flip the variables in the x and y options and then use scale_y_discrete instead of scale_x_discrete to label the y-axis instead of the x-axis like above.
ggplot(sata_frequencies, aes(x = percentage, y = variable, fill = as.factor(response))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%"), group = response),
position = position_dodge(width = 0.9),
hjust = -0.1, color = "black", size = 3) +
scale_y_discrete(labels = c("main_branch_1" = "Believe\nin God",
"main_branch_2" = "Have Job",
"main_branch_3" = "Victim\nof Fraud",
"main_branch_4" = "Own(ed)\nBusiness",
"main_branch_5" = "Attended College",
"main_branch_6" = "Fan of MLB",
"main_branch_7" = "Enjoy\nHiking",
"main_branch_8" = "Use\nSocial Media",
"main_branch_9" = "Labor Union\nMember",
"main_branch_10" = "None of\nthe Above")) +
scale_fill_manual(values = c("1" = "darkgreen"),
name = "Response",
breaks = c("1"),
labels = c("Yes")) +
labs(title = "Weighted % Selecting Each Initial SATA Question",
x = "% Selecting", y = "") +
xlim(0,90)+
theme_minimal(base_size = 12) +
theme(legend.position = "none")
While univariate looks at the data can be interesting, typically we want to explore our data for interesting patterns and relationships between two or more variables. We will look at three different exploration techniques here: comparing two group means with t-tests, comparing three or more group means with ANOVA, and correlation analysis.
Let’s say that I was curious whether attending college - even without graduating - influences how people view the new baseball rules. Using the main_branch_5 question, which is yes if the respondent attended college at all and no if they did not, we can compare how the two groups answered these questions to see if there are differences in responses.
To start, we will look at mean support for the Universal DH variable using the 4-point scale, split by this variable. We use weighted.mean along with our weighting variable.
dhc_freq <- data %>%
  filter(!is.na(stapleton_so_dh)) %>% # Drop NA responses
  group_by(main_branch_5) %>%
  summarise(
    response_count = n(), # Add response count
    weighted_mean = weighted.mean(stapleton_so_dh, weight, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(percentage = 100 * (weighted_mean / sum(weighted_mean, na.rm = TRUE)))
dhc_freq
## # A tibble: 2 × 4
## main_branch_5 response_count weighted_mean percentage
## <dbl+lbl> <int> <dbl> <dbl>
## 1 0 34 1.82 46.2
## 2 1 [Attended college or university for… 48 2.12 53.8
We see that when main_branch_5 = 1 - i.e., those who attended college - the weighted mean for support/opposition to the universal DH is 2.12, while for those who did not attend college the weighted mean is 1.82. Lower values mean more support, so there is a possibility that those who never attended college support the universal DH at a higher rate than those who did attend.
Differences between groups can occur by random chance alone, so we always run statistical tests to understand whether the difference we observe likely reflects a real difference in the population or arises by chance alone. Take one of the other numerous great DACSS courses that teach these techniques.
Here, we use a t-test, which tests for “significant” differences on some numeric variable between two groups. Typically, if the p-value from the test is < .05, the difference is considered significant and something that truly reflects a difference in the population. With a small sample size like we have here, it is okay to relax the threshold to p < .1.
To run the t-test on our weighted sample, we first specify a svydesign object that applies the survey weight to the analysis. This step is required to analyze weighted data with a t-test. Once we save this design object, survey_design, we perform the weighted (design-based) t-test to see whether the differences revealed are “significant”.
# Create the survey design object (survey package) that applies the weights
survey_design <- svydesign(ids = ~1, data = data, weights = ~weight)

# Perform a weighted t-test
welch_t_test <- svyttest(stapleton_so_dh ~ main_branch_5, design = survey_design)

# Print the result
print(welch_t_test)
##
## Design-based t-test
##
## data: stapleton_so_dh ~ main_branch_5
## t = 1.5485, df = 80, p-value = 0.1255
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
## -0.08597111 0.68891229
## sample estimates:
## difference in mean
## 0.3014706
There are two main pieces of information to look at here. The difference in mean, .301, is simply the difference between group 1 and group 2 on support for the DH. The other key thing to look at is the p-value, which in this test = .1255. This value is larger than .1, so I would conclude that attending college or not does not influence how MLB fans perceive the Universal DH.
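As a point of comparison, with unweighted data (or a constant weight of 1, as here) base R's t.test runs a Welch's t-test directly and should give essentially the same answer:
t.test(stapleton_so_dh ~ main_branch_5, data = data) # Unweighted Welch's t-test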
T-tests are very useful but can only be used when the grouping variable has exactly two groups. If it has three or more, t-tests will not work. In that instance, we turn to Analysis of Variance (ANOVA) for our significance tests. We calculate the weighted means in the exact same way as with the t-test.
Let’s use the educ4 variable we created previously, which (despite its name) has three groups for education: HS degree or less, Some College, and College Degree. We first calculate the weighted mean by group.
dhe_means <- data %>%
  filter(!is.na(stapleton_so_dh)) %>% # Drop NA responses
  group_by(educ4) %>%
  summarise(
    response_count = n(), # Add response count
    weighted_mean = weighted.mean(stapleton_so_dh, weight, na.rm = TRUE),
    .groups = "drop") %>%
  mutate(percentage = 100 * (weighted_mean / sum(weighted_mean)),
         response_label = case_when(
           educ4 == 1 ~ "HS or Less",
           educ4 == 2 ~ "Some College",
           educ4 == 3 ~ "College Degree"))
dhe_means
## # A tibble: 3 × 5
## educ4 response_count weighted_mean percentage response_label
## <dbl+lbl> <int> <dbl> <dbl> <chr>
## 1 1 [HS or Less] 22 1.73 29.2 HS or Less
## 2 2 [Some College] 33 2.15 36.4 Some College
## 3 3 [College Degree or M… 27 2.04 34.4 College Degree
We see that group 1, HS degree or less, has the highest approval rating (the lowest value, confusingly enough), while those with some college have the lowest approval rating. Once again, we want to statistically test for significant differences between these groups.
We use ANOVA to do this, along with an additional test called Tukey's Honest Significant Difference, which conducts pairwise statistical tests like the t-test. We use the aov function to calculate the ANOVA results. First inside the parentheses is the dependent variable, stapleton_so_dh, always followed by ~ and then the grouping variable, always wrapped in as.factor, like as.factor(educ4). Following this is the data to be used, and you directly specify the weight variable in this line of code using weight = weight.
anova_result <- aov(stapleton_so_dh ~ as.factor(educ4), data = data, weight = weight) # as.factor() tells R to treat educ4 as a factor; otherwise the ANOVA will not calculate correctly

# Display the ANOVA result
print(summary(anova_result))
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(educ4) 2 2.43 1.2155 1.421 0.248
## Residuals 79 67.57 0.8553
## 184 observations deleted due to missingness
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = stapleton_so_dh ~ as.factor(educ4), data = data, weight = weight)
##
## $`as.factor(educ4)`
## diff lwr upr p adj
## 2-1 0.4242424 -0.1837951 1.0322799 0.2244212
## 3-1 0.3097643 -0.3247218 0.9442504 0.4767906
## 3-2 -0.1144781 -0.6877414 0.4587851 0.8823316
If you have significant differences, or just want to graph these weighted values by group, we can do that using ggplot once again. We make sure to specify factor(educ4) as the x variable to tell R this is a factor and not a number. Because the education variable has three levels, I specify three labels that match the variable, as well as three colors to visually differentiate that the bars represent different group membership.
# Create the bar graph with custom colors specified in the ggplot command
ggplot(dhe_means, aes(x = factor(educ4), y = weighted_mean, fill = factor(educ4))) +
geom_bar(stat = "identity") + # Use geom_bar for vertical bars
labs(title = "Opposition to Universal DH by Education Level",
     x = "Education Level",
     y = "Weighted Mean (1 = Support, 4 = Oppose)") +
scale_x_discrete(labels = c("1" = "HS\nor Less", # Update labels for x-axis
"2" = "Some\nCollege",
"3" = "College Degree\nor More")) +
scale_fill_manual(values = c("#1f77b4",
"#ff7f0e",
"darkgreen"),
name = "Education Level") +
ylim(0, 2.5)+
theme_minimal() + # Apply a minimal theme
theme(axis.text.x = element_text(size=12, angle = 45, hjust = 1), # Rotate x-axis labels for readability
legend.position = "none")
If you have a continuous or ordinal variable with the Don’t Knows removed, you can run a correlation analysis. This analysis tells you whether two variables typically move in the same or opposite direction. For example, if a person who supports (opposes) the Universal DH almost always supports (opposes) the Pitch Clock, opinions on these two variables would be highly correlated because most people answered them the same way. Correlations can be positive, indicating that as the response on one variable gets larger (smaller), the response on the other variable tends to get larger (smaller) as well, or negative, indicating that as the response on one variable gets larger, the response on the other tends to get smaller. A correlation close to 0 means the two variables are not related, while the closer the correlation is to 1 or -1, the stronger the relationship.
For correlations to run in R, the variables have to be numeric. Unfortunately, our survey variables are stored as haven_labelled, meaning correlation analysis will not work unless we transform our existing variables into numeric versions (remember to save a new variable and keep the existing one intact).
We can do this easily, either one variable at a time or multiple at once, using the following approaches. The new variables will all have _n at the end to signify that this is the numeric version of the variable.
data$stapleton_so_dh_n <- as.numeric(data$stapleton_so_dh) # One at a time

# Multiple at once
data <- data %>%
  mutate(across(starts_with("stapleton_so_"), as.numeric, .names = "{.col}_n"))
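A quick check that the conversion worked as intended:
class(data$stapleton_so_dh_n) # Should return "numeric"
table(data$stapleton_so_dh_n, useNA = "ifany") # Values 1-4 plus the missing responses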
Now that we have numeric variables, we can begin the process to calculate the correlations between the responses.
We can easily do this one pair at a time using the approach below. We use the cor function and type in the two variable names, each prefixed with data$. We also need to specify the method (here we use pearson) as well as use = "complete.obs". This last bit of code removes missing data; without it, the correlations will not run.
r_select <- cor(data$stapleton_so_dh_n, data$stapleton_so_pc_n, method = "pearson", use = "complete.obs")
r_select
## [1] 0.4377297
Here, we see that the correlation between support for the DH and the Pitch Clock is .438. This is a small-to-medium correlation, indicating that opinions on these two rules are only modestly related.
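If you also want a significance test for a single correlation, base R's cor.test reports a p-value and confidence interval (note that, like cor, it ignores the survey weights and drops incomplete pairs automatically):
cor.test(data$stapleton_so_dh_n, data$stapleton_so_pc_n, method = "pearson") # Tests whether r differs from 0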
If you want to include more than two variables at a time, you simply select the variables you want and then rerun the correlation command as illustrated above. Here I use the select function and stapleton_so_dh_n:stapleton_so_bases_n to tell R to include all the variables from _dh_n through _bases_n. This works because my four variables are in order; if they were not, you would need a different select approach. If nothing else, you can manually type in the variable names you want to include.
new_data <- data %>%
  select(stapleton_so_dh_n:stapleton_so_bases_n)

r_all <- cor(new_data, method = "pearson", use = "complete.obs")
r_all
## stapleton_so_dh_n stapleton_so_1_n stapleton_so_2_n
## stapleton_so_dh_n 1.0000000 1.0000000 0.4347273
## stapleton_so_1_n 1.0000000 1.0000000 0.4347273
## stapleton_so_2_n 0.4347273 0.4347273 1.0000000
## stapleton_so_3_n 0.4018328 0.4018328 0.4377871
## stapleton_so_4_n 0.6536464 0.6536464 0.3917767
## stapleton_so_pc_n 0.4347273 0.4347273 1.0000000
## stapleton_so_shift_n 0.4018328 0.4018328 0.4377871
## stapleton_so_bases_n 0.6536464 0.6536464 0.3917767
## stapleton_so_3_n stapleton_so_4_n stapleton_so_pc_n
## stapleton_so_dh_n 0.4018328 0.6536464 0.4347273
## stapleton_so_1_n 0.4018328 0.6536464 0.4347273
## stapleton_so_2_n 0.4377871 0.3917767 1.0000000
## stapleton_so_3_n 1.0000000 0.5043263 0.4377871
## stapleton_so_4_n 0.5043263 1.0000000 0.3917767
## stapleton_so_pc_n 0.4377871 0.3917767 1.0000000
## stapleton_so_shift_n 1.0000000 0.5043263 0.4377871
## stapleton_so_bases_n 0.5043263 1.0000000 0.3917767
## stapleton_so_shift_n stapleton_so_bases_n
## stapleton_so_dh_n 0.4018328 0.6536464
## stapleton_so_1_n 0.4018328 0.6536464
## stapleton_so_2_n 0.4377871 0.3917767
## stapleton_so_3_n 1.0000000 0.5043263
## stapleton_so_4_n 0.5043263 1.0000000
## stapleton_so_pc_n 0.4377871 0.3917767
## stapleton_so_shift_n 1.0000000 0.5043263
## stapleton_so_bases_n 0.5043263 1.0000000
Now, we see more results since we have all of the variables. (The select range also picked up the numbered duplicates of our four variables, which is why some off-diagonal values equal exactly 1.) Quickly scan the values looking for those close to 1 or -1. The values that equal 1 on the diagonal are not relevant, since that is the correlation between a variable and itself. The largest correlation in the table is between support for the universal DH and the larger bases, at .654. This is a reasonably large correlation, indicating a fairly strong relationship between people’s opinions on these two rules.
If you want something that is more graphical and fun, you can use the corrplot package to display the results. We simply take the correlation results we saved previously, r_all, and pass them to the corrplot command to produce a graphical display of the correlations. Below shows a variety of methods you can use.
num <- corrplot(r_all, method = 'circle') # Colorful circles
num2 <- corrplot(r_all, method = 'circle', type = "lower", diag = FALSE) # Only show lower triangle
num3 <- corrplot(r_all, method = 'number', type = "lower", diag = FALSE) # Show correlations as numbers
num4 <- corrplot(r_all, method = 'square', type = "lower", diag = FALSE) # Squares instead of circles
If I were going to show this graph to an audience, I would change the variable names to something more informative and visually appealing.
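A minimal sketch of that relabeling, computing the correlation matrix on just the four rule-change variables and overwriting the row and column names before plotting (the display labels are illustrative):
r_four <- cor(select(data, stapleton_so_dh_n, stapleton_so_pc_n,
                     stapleton_so_shift_n, stapleton_so_bases_n),
              method = "pearson", use = "complete.obs")
dimnames(r_four) <- list(c("Universal DH", "Pitch Clock", "Shift Ban", "Larger Bases"),
                         c("Universal DH", "Pitch Clock", "Shift Ban", "Larger Bases"))
corrplot(r_four, method = 'square', type = "lower", diag = FALSE) # Cleaner labels for an audience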