Learning Goals

This problem set has three learning goals that build on much of the material we have worked on this semester.

  1. Continue to explore public opinion data
  2. Understand the full process of data analysis from collection through cleaning, to presentation.
  3. Analyze survey data from cleaning, to analysis, to visualization

To do this, you will each design your own survey, administer it to at least ten people, and then download, clean, and analyze the data you have collected.

We will use Qualtrics for this, although other tools exist, including SurveyCTO and ODK. We use Qualtrics in this course because, although all of these tools charge for access, Harvard has an institutional account with Qualtrics that gives us access to many of its more powerful features, such as the Application Programming Interface (API).

Information & Introduction

Do Now

  1. Create an account with Qualtrics
  • NB: Make sure you browse to Harvard’s institutional Qualtrics account and not the commercial Qualtrics at qualtrics.com. There will be more features available to you at Harvard’s institutional account.
  2. Install the qualtRics package in R

Submitting your problem set

We encourage you to work in teams, but please do submit your own work. There are several points where we ask you to interpret the data on top of just coding. We welcome you discussing the interpretation with your peers, but the write-up of the interpretation should be your own.

You should submit your problem set on Canvas. When you submit your problem set, please submit both the compiled .Rmarkdown file and the uncompiled file so that we can run and verify your work. As much as possible, please show your work through the steps you took in the code.

We want to see how you scaffold your knowledge throughout the problem set. We begin by providing comprehensive examples and at the end ask you to apply what you have seen in class and at the beginning of this problem set to similar questions with fewer pointers. You should be able to answer all the questions in the problem set with just the material we have provided you in class and in this problem set, but you are of course welcome to use resources on the internet, including generative AI like ChatGPT. If you do, please note that in the problem set.

Resources

While completing this problem set, you might get stuck at various points. Don’t stress! Getting stuck, grappling with a problem, figuring out the right question to ask, and then understanding how to solve the problem are all a natural part of learning! And do know that Emmerich and Nicolás are here to help you along your learning – we are excited to see how you approach this and how you tackle the problems. If you get stuck, you can reach us at:

Generative AI Sandbox

We also have a Generative AI sandbox. Unless you pay for ChatGPT or another AI’s premium product, the sandbox provides access to the latest paywalled models, which are supposedly more powerful and accurate than previous models, so we encourage you to use that. You can find the sandbox at this link

Qualtrics Resources

You might find a couple of resources useful when creating, distributing, and analyzing your survey. Both Qualtrics and Harvard IT provide useful basic overviews:

  1. Qualtrics
  2. Harvard IT.

Public Opinion and Education

In this section, we explore public opinion data available through the Afrobarometer series. We want to understand what the public wants from their educational systems. In particular, we want to see how we would test the idea that education is a “loud but noisy” signal. We will work with the sixth round of the Afrobarometer survey in Liberia from 2014-15. You can find the questionnaire for this on Canvas.

  1. What questions are available in this questionnaire on education?
  2. Which questions force respondents to make a tradeoff between education and other policy areas?
  3. If you were to explore whether education was a “loud but noisy,” “loud but clear,” “quiet but noisy,” or “quiet but clear” signal, how would you use the questions in the Afrobarometer survey to answer that question?

Answer Area for Public Opinion and Education

You can write your answers in the area below.

Collecting Your Own Data

For this problem set, there will be three separate parts:

  1. Design a short, simple survey in Qualtrics that you will administer to at least 10 people.
  2. Use the Qualtrics API to query and pull your data locally.
  3. Analyze the data you have collected, including mapping the data.

1. Designing Your Own Survey

We first want you to come up with and design a survey of your own. To do so, you should answer the following questions:

  1. What is your research question? International Students’ Perception of the Academic Quality at Harvard Graduate School of Education
  2. What would you need to know from your respondents to answer your question? They have to be international students
  3. What questions can you design to collect the data required for your research question? I will be using a list experiment

You can answer the above three questions in the space below.

As part of this assignment, you should create a survey of at least five questions that helps you answer the research question. We would like you to include one form of survey experiment we discussed in class on October 31 (list experiments, endorsement experiments, or constrained choice questions). You are welcome to try to program a conjoint experiment, but programming one in Qualtrics is not straightforward, and we provide an exercise below that works with a conjoint experiment I collected in Delhi. Please contact Emmerich if you are interested in programming a conjoint experiment. Please include the questions in the space below:

Survey Questions

Each survey is paired with 4 neutral statements. The “treatment” statements are listed below:

  1. The academic challenge is less rigorous compared to education back home
  2. The curriculum has not been effective in guiding me toward selecting a career direction
  3. The pass/fail system often demotivates me to do better work
  4. Professors and faculty members are quite lenient towards low quality outputs from students
  5. Peer group work often prioritizes cooperation over the pursuit of excellence

You will then want to program these questions into Qualtrics. You can use the resources here for designing a survey within Qualtrics.

Distributing Your Survey

Now that you have designed your survey, you will have to distribute it to at least ten respondents. You can do this in a couple of ways.

  1. Via a unique email link that identifies each respondent.
    1. You can find more information on how to use mail merge to send your email to non-anonymized respondents with a unique link here
  2. Through an anonymous email link that does not identify any respondent.
    1. You can find more information on how to distribute your survey with an anonymized link here.
  3. Through the Qualtrics offline data collection app.
    1. If you choose to use this method, you will have to collect the data yourself. You will have to download the Qualtrics offline data collection tool, available here for iOS and Android.

Human Subjects

We should stop here and recognize that while research for educational purposes such as teaching is exempt from human subjects review, you may be collecting data that can identify individuals and reveal confidential information (email addresses and geographic locations as an example). You should keep in mind that research has a long and complex relationship with the people that serve as subjects (and hopefully beneficiaries) of that research and that institutions have put (often insufficient) safeguards to protect those subjects. One such safeguard is a university’s Institutional Review Board (IRB) that determines the risks and benefits to human subjects of any research conducted in a university. You can find Harvard’s IRB office and materials here. We encourage you to browse the website.

2. Using the Qualtrics API

To use the Qualtrics API, you will need to generate your API token: a unique key that allows you to access your Qualtrics account. You can generate your API token by following these instructions.

Once that is done, you can access your Qualtrics account in R with the following code:

library(qualtRics) # provides qualtrics_api_credentials(), all_surveys(), fetch_survey()

qualtrics_api_credentials(
    api_key = "YOUR_API_KEY", # Replace with your API key; never share or commit a real key
    base_url = "harvard.az1.qualtrics.com" # Keep this as is
)
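Hard-coding an API key in a script makes it easy to share by accident. A minimal sketch of one alternative follows, assuming you have already stored your key in an environment variable named QUALTRICS_API_KEY (for example in your .Renviron file); that is the variable name the qualtRics package itself reads. The guard conditions are illustrative and not part of the assignment.

```r
# Read the key from the environment instead of typing it into the script.
# Assumption: QUALTRICS_API_KEY was set beforehand (e.g. in ~/.Renviron).
api_key <- Sys.getenv("QUALTRICS_API_KEY", unset = "")

if (!nzchar(api_key)) {
  # Nothing is registered if the variable is missing
  message("QUALTRICS_API_KEY is not set; add it to your .Renviron file first.")
} else if (requireNamespace("qualtRics", quietly = TRUE)) {
  # Register the credentials for the current R session
  qualtRics::qualtrics_api_credentials(
    api_key = api_key,
    base_url = "harvard.az1.qualtrics.com"
  )
}
```

With this pattern the script can be shared or submitted without exposing the key itself.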

NB: Please note that throughout this problem set, we have used the RMarkdown option eval = FALSE, which tells RStudio not to evaluate (run) the code we have provided in the code chunks. You will want to remove this option once you have filled in the relevant values and are ready to run the code yourselves.

Next, you can view all the surveys you have created in Qualtrics with a “call” to the API (an instruction asking the API which surveys are available online):

all_surveys()

You should see your survey with an id and name. Once you have the id for your survey, you can download it from Qualtrics by running the following command:

control_survey <- fetch_survey(
    surveyID = "SV_9HU3HTuEcjeGLH0", # Replace with your survey ID
    force_request = TRUE # This will ensure that you download the latest version of your
    # survey. Otherwise, to save calls to the API, which are often limited,
    # qualtRics does not call for a new version of the survey each time.
)
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   StartDate = col_datetime(format = ""),
##   EndDate = col_datetime(format = ""),
##   Status = col_character(),
##   IPAddress = col_character(),
##   Finished = col_logical(),
##   RecordedDate = col_datetime(format = ""),
##   ResponseId = col_character(),
##   RecipientLastName = col_logical(),
##   RecipientFirstName = col_logical(),
##   RecipientEmail = col_logical(),
##   ExternalReference = col_logical(),
##   DistributionChannel = col_character(),
##   UserLanguage = col_character(),
##   Q0 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
treatment_survey <- fetch_survey(
    surveyID = "SV_cLSDeeO8JQ69EN0", # Replace with your survey ID
    force_request = TRUE # This will ensure that you download the latest version of your
    # survey. Otherwise, to save calls to the API, which are often limited,
    # qualtRics does not call for a new version of the survey each time.
)
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   StartDate = col_datetime(format = ""),
##   EndDate = col_datetime(format = ""),
##   Status = col_character(),
##   IPAddress = col_character(),
##   Finished = col_logical(),
##   RecordedDate = col_datetime(format = ""),
##   ResponseId = col_character(),
##   RecipientLastName = col_logical(),
##   RecipientFirstName = col_logical(),
##   RecipientEmail = col_logical(),
##   ExternalReference = col_logical(),
##   DistributionChannel = col_character(),
##   UserLanguage = col_character(),
##   Q0 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

3. Analyzing the Data You Have Collected

This portion of the assignment invites you to analyze the data you collected in the first two parts. You can either do this graphically using ggplot, or through another measure you choose. You will see that the data you collected is not “clean” and you will have to use the tools we have used previously from dplyr to clean your data. We provide space below for you to do this:

control_group <- control_survey %>% select(Q1, Q2, Q3, Q4, Q5)
## select: dropped 18 variables (StartDate, EndDate, Status, IPAddress, Progress, …)
treatment_group <- treatment_survey %>% select(Q1, Q2, Q3, Q4, Q5)
## select: dropped 18 variables (StartDate, EndDate, Status, IPAddress, Progress, …)
control_group <- control_group %>% mutate(Group = "Control")
## mutate: new variable 'Group' (character) with one unique value and 0% NA
treatment_group <- treatment_group %>% mutate(Group = "Treatment")
## mutate: new variable 'Group' (character) with one unique value and 0% NA
combined_data <- bind_rows(control_group, treatment_group)
combined_data %>% 
  group_by(Group) %>% 
  # Pass na.rm through an anonymous function; across(..., mean, na.rm = TRUE)
  # is deprecated as of dplyr 1.1.0
  summarise(across(everything(), \(x) mean(x, na.rm = TRUE)))
## group_by: one grouping variable (Group)
## summarise: now 2 rows and 6 columns, ungrouped
data <- data.frame(
  Group = c("Control", "Treatment"),
  Q1 = c(3.428571, 3.875000),
  Q2 = c(3.428571, 3.750000),
  Q3 = c(3.285714, 3.500000),
  Q4 = c(3.571429, 4.250000),
  Q5 = c(3.571429, 4.500000)
)

long_data <- pivot_longer(data, cols = -Group, names_to = "Question", values_to = "Mean")
## pivot_longer: reorganized (Q1, Q2, Q3, Q4, Q5) into (Question, Mean) [was 2x6, now 10x3]
ggplot(long_data, aes(x = Question, y = Mean, fill = Group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  coord_flip() +  
  labs(title = "Average Responses for Control and Treatment Groups",
       x = "Question",
       y = "Average Mean Response") +
  theme_minimal() +
  theme(legend.position = "bottom")

long_data <- pivot_longer(data, cols = -Group, names_to = "Question", values_to = "Mean")
## pivot_longer: reorganized (Q1, Q2, Q3, Q4, Q5) into (Question, Mean) [was 2x6, now 10x3]
mean_diff <- long_data %>%
  spread(Group, Mean) %>%
  mutate(Difference = Treatment - Control) %>%
  select(Question, Difference)
## spread: reorganized (Group, Mean) into (Control, Treatment) [was 10x3, now 5x3]
## mutate: new variable 'Difference' (double) with 5 unique values and 0% NA
## select: dropped 2 variables (Control, Treatment)
ggplot(mean_diff, aes(x = Question, y = Difference, fill = Difference > 0)) +
  geom_bar(stat = "identity") +
  coord_flip() +  
  labs(title = "Difference in Mean Responses between Control and Treatment Groups",
       x = "Question",
       y = "Difference in Mean Response") +
  theme_minimal() +
  # Name the values and labels so each colour and label stays tied to the
  # right category even when only one of TRUE/FALSE occurs in the data
  scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "blue"), 
                    name = "Difference",
                    labels = c("FALSE" = "Treatment < Control", "TRUE" = "Treatment > Control")) +
  theme(legend.position = "bottom")
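For a list experiment, the usual summary is the difference in the mean number of statements endorsed by the treatment and control groups, which estimates the share of respondents agreeing with the extra treatment statement. The sketch below illustrates that calculation in base R. The item counts are made up for illustration, and it assumes each response records how many statements a respondent agreed with.

```r
# Hypothetical item counts: how many statements each respondent agreed with
control   <- c(2, 3, 1, 2, 2, 3, 1)    # shown 4 neutral statements
treatment <- c(3, 3, 2, 4, 2, 3, 3, 2) # shown 4 neutral + 1 treatment statement

# Difference in means: estimated share agreeing with the treatment statement
estimate <- mean(treatment) - mean(control)

# Welch two-sample t-test gives a 95% confidence interval for that difference
welch <- t.test(treatment, control)
ci <- as.numeric(welch$conf.int)

cat(sprintf("Estimate: %.2f, 95%% CI: [%.2f, %.2f]\n", estimate, ci[1], ci[2]))
```

With roughly ten respondents per group, the interval will typically be wide, so the point estimate should be read cautiously.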

Plot the location of your respondents

A nice feature of Qualtrics is that it collects the location of your respondents. It stores this data in two variables helpfully called LocationLatitude and LocationLongitude. If you conduct the survey using your mobile phone, it will record the location in which you submitted the survey.

To plot the location of your respondents, you can use the following code with some modifications:

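library(sf) # provides st_as_sf(); geom_sf() needs it too
# crs = 4326 declares the coordinates as longitude/latitude in degrees (WGS84, assumed here)
location_df <- st_as_sf(x = control_survey, coords = c("LocationLongitude", "LocationLatitude"), crs = 4326)
ggplot(location_df) + geom_sf()
location_df <- st_as_sf(x = treatment_survey, coords = c("LocationLongitude", "LocationLatitude"), crs = 4326)
ggplot(location_df) + geom_sf()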
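One modification you may need: st_as_sf() stops with an error if any coordinate is missing, and some responses may have no recorded location. The sketch below shows one way to drop those rows first. The toy data frame is hypothetical; only the column names are taken from the Qualtrics export.

```r
# Hypothetical responses, one of which has no recorded latitude
toy <- data.frame(
  ResponseId = c("R_1", "R_2", "R_3"),
  LocationLatitude = c(42.37, NA, 42.36),
  LocationLongitude = c(-71.12, -71.10, -71.06)
)

# Keep only rows where both coordinates are present
has_coords <- complete.cases(toy[, c("LocationLatitude", "LocationLongitude")])
toy_clean <- toy[has_coords, ]

# Convert to a spatial object only if the sf package is installed
if (requireNamespace("sf", quietly = TRUE)) {
  toy_sf <- sf::st_as_sf(
    toy_clean,
    coords = c("LocationLongitude", "LocationLatitude"),
    crs = 4326 # longitude/latitude in degrees (WGS84), an assumption
  )
  print(toy_sf)
}
```

The same filter can be applied to control_survey or treatment_survey before calling st_as_sf().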

If you wish to plot this on a map of your location, you can find administrative maps at the GADM Website at various levels of subdivision.

Cleaning and Analyzing a Conjoint Experiment

For the final question, we ask you to clean and analyze a conjoint experiment. As we discussed in class on October 31, conjoint experiments are a way for researchers to understand what individuals value when there are bundled choices such as in voting. Two candidates may hold the same policy positions, but one may be a woman and the other a man, one may be of the same race or ethnicity as the voter and the other of a different ethnicity, and finally one candidate may be highly educated and the other not. In this situation, which characteristics of the candidate are most important to the voter?

You can find the data for this problem here (.rda)

In a study conducted in Delhi with Ashwini Deshpande of Ashoka University and Parushya of Georgetown University, we were interested in understanding what parents in schools in Delhi found important in their parental representatives in school management committees (SMCs). We administered a conjoint experiment to approximately 2,000 parents across Delhi to understand their preferences for their representatives along the following lines:

  1. Gender
  2. Education
  3. Caste
  4. Religion
  5. State of Origin
  6. Partisan political affiliation
  7. Distance they lived from the school as a means of measuring ability to attend to school matters.

The analysis of a conjoint experiment is rather simple, but we will give you the raw (okay, lightly cleaned) data we collected and ask you to tell us which attributes are the strongest predictors of preferences for representatives. To conduct the final analysis, you will want to run a regression of the following form:

\[ \begin{aligned} \text{Choice}_{i, p} = {} & \text{Gender}_{i, p} + \text{Education}_{i, p} + \text{Caste}_{i, p} + \text{Religion}_{i, p} + \text{State of Origin}_{i, p} \\ & + \text{Partisan Political Affiliation}_{i, p} + \text{Distance from School}_{i, p} + \text{Respondent}_{i} \end{aligned} \]

Where i indexes the respondent and p indexes the profile.

The data we have given you are at the respondent level. You will need to reshape them to the respondent-profile-choice level to run this analysis. We would like to see you do the following:

  1. Reshape the data from respondent-level to respondent-profile choice-level. There are two potential ways you can do this:
    1. Use the pivot_longer function from the tidyr package.
    2. Split the data into two separate dataframes containing the first and the second profile.
  2. Provide a check on the probability of receiving each randomly assigned characteristic. In other words, how likely was it that a respondent saw “Hindu” or “Muslim” or “Male” or “Female”?
  3. Run the analysis in the equation above.
  4. Plot the coefficients from the regression analysis.
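The first reshaping route in the list above can be sketched with pivot_longer() on a toy data set. The attribute suffixes (gender, religion) and the values below are hypothetical; only hhid, profile_choice, and the profile_1_/profile_2_ prefixes are assumed to resemble the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical respondent-level data: one row per respondent, two profiles each
toy <- data.frame(
  hhid = c(1, 2),
  profile_choice = c(1, 2),
  profile_1_gender = c("Male", "Female"),
  profile_2_gender = c("Female", "Female"),
  profile_1_religion = c("Hindu", "Muslim"),
  profile_2_religion = c("Muslim", "Hindu")
)

toy_long <- toy %>%
  pivot_longer(
    cols = matches("^profile_\\d_"),
    # First capture group becomes the profile number; ".value" keeps the
    # second capture group (the attribute name) as its own column
    names_to = c("profile", ".value"),
    names_pattern = "profile_(\\d)_(.*)"
  ) %>%
  mutate(
    profile = as.integer(profile),
    # 1 if this profile is the one the respondent chose, 0 otherwise
    choice = as.integer(profile_choice == profile)
  )

print(toy_long)
```

Each respondent now contributes two rows, one per profile, which is the shape the regression needs.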

Answer

load("/Users/heidinadhira/Downloads/conjoint_a830.rda")
data <- x
remove(x)

profile_1 <- data %>%
  select(hhid, profile_choice, contains("profile_1_")) %>%
  mutate(profile = 1) %>%
  relocate(hhid, profile_choice, profile)

profile_2 <- data %>%
  select(hhid, profile_choice, contains("profile_2_")) %>%
  mutate(profile = 2) %>%
  relocate(hhid, profile_choice, profile)


# Rename columns to have a common naming convention
names(profile_1)[4:10] <- c("gender", "distance", "education", "caste", "home_state", "party", "religion")
names(profile_2)[4:10] <- c("gender", "distance", "education", "caste", "home_state", "party", "religion")

# Combine dataframes
long_data <- rbind(profile_1, profile_2)

# Create a binary variable for choice
long_data$choice <- ifelse(long_data$profile_choice == long_data$profile, 1, 0)
long_data <- long_data %>%
  select(-profile_choice)

# Convert attributes to factors
attributes <- c("gender", "distance", "education", "caste", "home_state", "party", "religion")
long_data[attributes] <- lapply(long_data[attributes], factor)

# Probability Check: Calculate the share of each level within each attribute
prob_check <- long_data %>%
  select(all_of(attributes)) %>%
  gather(key = "attribute", value = "value") %>%
  group_by(attribute, value) %>%
  summarise(count = n(), .groups = 'drop') %>%
  # Normalize within each attribute so the probabilities sum to 1 per attribute
  group_by(attribute) %>%
  mutate(probability = count / sum(count)) %>%
  ungroup()

# View probabilities
print(prob_check)
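As a quick cross-check on a randomization table like this, the shares within each attribute can also be computed in base R with table() and prop.table(). The sketch below uses a small made-up data frame. If levels were assigned with equal probability (an assumption about the design, not something stated in the assignment), each share should be close to one divided by the number of levels.

```r
# Hypothetical attribute assignments for six profiles
toy <- data.frame(
  gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  religion = c("Hindu", "Muslim", "Hindu", "Hindu", "Muslim", "Hindu")
)

# Share of each level within each attribute; each set of shares sums to 1
shares <- lapply(toy, function(column) prop.table(table(column)))
print(shares)
```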
# Regression analysis
model <- lm(choice ~ gender + distance + education + caste + home_state + party + religion, data = long_data)

# Summary of the model
summary(model)

# Plot the coefficients
tidy_model <- tidy(model)

ggplot(tidy_model, aes(x = term, y = estimate)) +
  geom_col() +
  coord_flip() +
  xlab("Attributes") +
  ylab("Coefficient Estimate") +
  ggtitle("Coefficients from Regression Analysis")

# Plot the coefficients with 95% confidence intervals
# conf.int = TRUE adds conf.low and conf.high; estimate +/- std.error alone
# would show one standard error, not a confidence interval
tidy_model <- tidy(model, conf.int = TRUE)

ggplot(tidy_model, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  xlab("Coefficient Estimate") +
  ylab("Attributes") +
  ggtitle("Coefficient Plot with Confidence Intervals")
# Regression analysis with fixed effects using feols from fixest package
model_fe <- feols(choice ~ gender + distance + education + caste + home_state + party + religion | hhid, data = long_data)

# Extract coefficients and standard errors
coef_estimates <- coef(model_fe)
std_errors <- sqrt(diag(vcov(model_fe)))

# Calculate confidence intervals
alpha <- 0.05 # for a 95% confidence interval
z <- qnorm(1 - alpha / 2)
conf_int_lower <- coef_estimates - z * std_errors
conf_int_upper <- coef_estimates + z * std_errors

# Creating a tidy dataframe for plotting
tidy_model_fe <- data.frame(
  term = names(coef_estimates),
  estimate = coef_estimates,
  conf.low = conf_int_lower,
  conf.high = conf_int_upper
)

# Plot the coefficients with confidence intervals
ggplot(tidy_model_fe, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  xlab("Change in probability") +
  ylab("Attributes") +
  ggtitle("Coefficient Plot with Confidence Intervals (Fixed Effects Model)")
f1 <- choice ~ gender + distance + education + caste + home_state + party + religion
plot(mm(long_data, f1, id = ~hhid ), vline = 0.5)

conjoint <- cj(long_data, f1, id = ~hhid)

cj_full_plot_presentation <- plot(conjoint,
                                  xlab = "Change in Probability of Vote (%)",
                                  alpha = 0.1) +
  geom_point(size = 2) +
  geom_line(size = 2) +
  theme_bw() +
  theme(
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank(),
    panel.border = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  xlim(
    c(-0.25, 0.5)
  )
