This problem set has three learning goals that scaffold on much of the material we have worked on this semester.
To meet them, you will each design your own survey, administer it to at least ten people, and then download, clean, and analyze the data you have collected.
We will use Qualtrics for this, although other tools exist, including SurveyCTO and ODK. We use Qualtrics in this course because, although all of these tools charge for access, Harvard has an institutional account with Qualtrics that allows us to use many of its more powerful features, such as the Application Programming Interface (API).
qualtRics package in R
We encourage you to work in teams, but please do submit your own work. There are several points where we ask you to interpret the data on top of just coding. You are welcome to discuss the interpretation with your peers, but the write-up of the interpretation should be your own.
You should submit your problem set on Canvas. When you submit your
problem set, please submit both the compiled .Rmarkdown
file and the uncompiled file so that we can run and verify your
work. As much as possible, please show your work through the steps you
took in the code.
We want to see how you scaffold your knowledge throughout the problem set. We begin by providing comprehensive examples and, at the end, ask you to apply what you have seen in class and at the beginning of this problem set to similar questions with fewer pointers. You should be able to answer all the questions with just the material we have provided in class and in this problem set, but you are of course welcome to use resources on the internet, including generative AI like ChatGPT. If you do, please note that in the problem set.
While completing this problem set, you might get stuck at various points. Don’t stress! Getting stuck, grappling with a problem, figuring out the right question to ask, and then understanding how to solve the problem are all a natural part of learning! And do know that Emmerich and Nicolás are here to help you along the way. We are excited to see how you approach this and how you tackle the problems. If you get stuck, you can reach us at:
We also have a generative AI sandbox. Unless you pay for the premium tier of ChatGPT or another AI product, the sandbox gives you access to the latest paywalled models, which are reportedly more powerful and accurate than earlier ones, so we encourage you to use it. You can find the sandbox at this link.
You might find a couple of resources useful when creating, distributing, and analyzing your survey. Both Qualtrics and Harvard IT provide useful basic overviews that you can find:
In this section, we want to explore public opinion data available through the Afrobarometer series. We want to understand what the public wants from their educational systems. In particular, we want to see how we would test the idea that education is a “loud but noisy” signal. We will work with the sixth round of the Afrobarometer survey in Liberia from 2014-15. You can find the questionnaire for this on Canvas.
You can write your answers in the area below.
For this problem set, there will be three separate parts:
We first want you to come up with and design a survey of your own. To do so, you should answer the following questions:
You can answer the above three questions in the space below.
As part of this assignment, you should create a survey that is at least five questions long and that helps you answer the research question. We would like you to include one form of survey experiment we discussed in class on October 31 (list experiments, endorsement experiments, or constrained choice questions). You are welcome to try to program a conjoint experiment, but programming one in Qualtrics is not straightforward, and we provide an exercise below that works with a conjoint experiment I collected in Delhi. Please contact Emmerich if you are interested in programming a conjoint experiment. Please include the questions in the space below:
Each survey is paired with 4 neutral statements. I list the “treatment” statements below:
You will then want to program these questions into Qualtrics. You can use the resources here for designing a survey within Qualtrics.
Now that you have designed your survey, you will have to distribute it to at least ten respondents. You can do this in a couple of ways.
We should stop here and recognize that while research for educational purposes such as teaching is exempt from human subjects review, you may be collecting data that can identify individuals and reveal confidential information (for example, email addresses and geographic locations). You should keep in mind that research has a long and complex relationship with the people who serve as subjects (and hopefully beneficiaries) of that research, and that institutions have put in place (often insufficient) safeguards to protect those subjects. One such safeguard is a university’s Institutional Review Board (IRB), which assesses the risks and benefits to human subjects of any research conducted at the university. You can find Harvard’s IRB office and materials here. We encourage you to browse the website.
To use the Qualtrics API, you will need to generate your API token: a unique key that allows you to access your Qualtrics account. You can generate your API token by following these instructions.
Once that is done, you can access your Qualtrics account in R with the following code:
library(qualtRics)

qualtrics_api_credentials(
  api_key = "YOUR_API_KEY", # Replace with your API key
  base_url = "harvard.az1.qualtrics.com" # Keep this as is
)
NB: Please note that throughout this problem set, we
have used the RMarkdown option eval = FALSE, which tells
RStudio not to evaluate (run) the code we have provided in the code
chunks. You will want to remove this option once you have filled in the
relevant values and are ready to run the code yourselves.
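Because the problem set file is submitted and shared, you may prefer not to paste your API key into it. A minimal sketch, assuming you have stored the key in an environment variable named `QUALTRICS_API_KEY` (for example in your `.Renviron` file), is shown here. The `credentials_set` flag is a hypothetical helper added only for illustration.

```r
# Sketch: read the API key from an environment variable instead of
# hard-coding it in the problem set file.
api_key <- Sys.getenv("QUALTRICS_API_KEY") # returns "" if the variable is not set

if (nzchar(api_key) && requireNamespace("qualtRics", quietly = TRUE)) {
  # Register the credentials for the current R session
  qualtRics::qualtrics_api_credentials(
    api_key = api_key,
    base_url = "harvard.az1.qualtrics.com"
  )
  credentials_set <- TRUE
} else {
  message("Set QUALTRICS_API_KEY (and install qualtRics) before calling the API.")
  credentials_set <- FALSE
}
```

If you would rather have the package store the key for you, `qualtrics_api_credentials()` also accepts `install = TRUE`, which writes the credentials to your `.Renviron` file.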
Next, you can view all the surveys you have created in Qualtrics with a “call” to the API (that is, an instruction asking the API to tell you which surveys are available online).
all_surveys()
You should see your survey with an id and
name. Once you have the id for your survey,
you can download it from Qualtrics by running the following command:
control_survey <- fetch_survey(
  surveyID = "SV_9HU3HTuEcjeGLH0", # Replace with your survey ID
  force_request = TRUE # This will ensure that you download the latest version of your
  # survey. Otherwise, to save calls to the API, which are often limited,
  # qualtRics does not call for a new version of the survey each time.
)
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## StartDate = col_datetime(format = ""),
## EndDate = col_datetime(format = ""),
## Status = col_character(),
## IPAddress = col_character(),
## Finished = col_logical(),
## RecordedDate = col_datetime(format = ""),
## ResponseId = col_character(),
## RecipientLastName = col_logical(),
## RecipientFirstName = col_logical(),
## RecipientEmail = col_logical(),
## ExternalReference = col_logical(),
## DistributionChannel = col_character(),
## UserLanguage = col_character(),
## Q0 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
treatment_survey <- fetch_survey(
  surveyID = "SV_cLSDeeO8JQ69EN0", # Replace with your survey ID
  force_request = TRUE # This will ensure that you download the latest version of your
  # survey. Otherwise, to save calls to the API, which are often limited,
  # qualtRics does not call for a new version of the survey each time.
)
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## StartDate = col_datetime(format = ""),
## EndDate = col_datetime(format = ""),
## Status = col_character(),
## IPAddress = col_character(),
## Finished = col_logical(),
## RecordedDate = col_datetime(format = ""),
## ResponseId = col_character(),
## RecipientLastName = col_logical(),
## RecipientFirstName = col_logical(),
## RecipientEmail = col_logical(),
## ExternalReference = col_logical(),
## DistributionChannel = col_character(),
## UserLanguage = col_character(),
## Q0 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
This portion of the assignment invites you to analyze the data you
collected in the first two parts. You can either do this graphically
using ggplot, or through another measure you choose. You
will see that the data you collected is not “clean” and you will have to
use the tools we have used previously from dplyr to clean
your data. We provide space below for you to do this:
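A minimal cleaning sketch on made-up data may help as a starting point. The column names `ResponseId`, `Finished`, `Q1`, and `Q2` mirror the column specification printed above; the values are invented for illustration, and the steps you need will depend on your own survey.

```r
library(dplyr)

# Made-up responses mimicking a few columns of a Qualtrics export
raw_survey <- tibble(
  ResponseId = c("R_1", "R_2", "R_3", "R_4"),
  Finished   = c(TRUE, TRUE, FALSE, TRUE),
  Q1         = c(4, 5, NA, 3),
  Q2         = c(3, 4, 2, NA)
)

clean_survey <- raw_survey %>%
  filter(Finished) %>%            # keep completed responses only
  select(ResponseId, Q1, Q2) %>%  # keep the identifier and the questions
  mutate(Group = "Control")       # label the experimental arm

# Mean of each question, ignoring missing answers
question_means <- clean_survey %>%
  summarise(across(c(Q1, Q2), \(x) mean(x, na.rm = TRUE)))

question_means
```

Dropping unfinished responses is one choice among several; you might instead keep partial responses and rely on `na.rm = TRUE` when summarizing.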
control_group <- control_survey %>% select(Q1, Q2, Q3, Q4, Q5)
## select: dropped 18 variables (StartDate, EndDate, Status, IPAddress, Progress, …)
treatment_group <- treatment_survey %>% select(Q1, Q2, Q3, Q4, Q5)
## select: dropped 18 variables (StartDate, EndDate, Status, IPAddress, Progress, …)
control_group <- control_group %>% mutate(Group = "Control")
## mutate: new variable 'Group' (character) with one unique value and 0% NA
treatment_group <- treatment_group %>% mutate(Group = "Treatment")
## mutate: new variable 'Group' (character) with one unique value and 0% NA
combined_data <- bind_rows(control_group, treatment_group)
combined_data %>%
  group_by(Group) %>%
  summarise(across(everything(), \(x) mean(x, na.rm = TRUE)))
## group_by: one grouping variable (Group)
## summarise: now 2 rows and 6 columns, ungrouped
data <- data.frame(
  Group = c("Control", "Treatment"),
  Q1 = c(3.428571, 3.875000),
  Q2 = c(3.428571, 3.750000),
  Q3 = c(3.285714, 3.500000),
  Q4 = c(3.571429, 4.250000),
  Q5 = c(3.571429, 4.500000)
)
long_data <- pivot_longer(data, cols = -Group, names_to = "Question", values_to = "Mean")
## pivot_longer: reorganized (Q1, Q2, Q3, Q4, Q5) into (Question, Mean) [was 2x6, now 10x3]
ggplot(long_data, aes(x = Question, y = Mean, fill = Group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  coord_flip() +
  labs(title = "Average Responses for Control and Treatment Groups",
       x = "Question",
       y = "Mean Response") +
  theme_minimal() +
  theme(legend.position = "bottom")
mean_diff <- long_data %>%
  spread(Group, Mean) %>%
  mutate(Difference = Treatment - Control) %>%
  select(Question, Difference)
## spread: reorganized (Group, Mean) into (Control, Treatment) [was 10x3, now 5x3]
## mutate: new variable 'Difference' (double) with 5 unique values and 0% NA
## select: dropped 2 variables (Control, Treatment)
ggplot(mean_diff, aes(x = Question, y = Difference, fill = Difference > 0)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Difference in Mean Responses between Control and Treatment Groups",
       x = "Question",
       y = "Difference in Mean Response") +
  theme_minimal() +
  # Name the colors and labels so they stay correct even when every
  # difference has the same sign (only one of TRUE/FALSE is present)
  scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "blue"),
                    name = "Difference",
                    labels = c("FALSE" = "Treatment < Control", "TRUE" = "Treatment > Control")) +
  theme(legend.position = "bottom")
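Bar charts of group means show the size of a difference but not how uncertain it is, which matters with around ten respondents. One option, sketched here on made-up responses to a single question, is a Welch two-sample t-test. The values and group sizes are invented for illustration and are not the data collected above.

```r
# Made-up 1-5 responses to one question, 7 control and 8 treatment respondents
responses <- data.frame(
  Group = rep(c("Control", "Treatment"), times = c(7, 8)),
  Q1 = c(3, 4, 2, 5, 3, 4, 2,      # control
         4, 5, 3, 4, 5, 4, 3, 5)   # treatment
)

# t.test() runs the Welch test by default (var.equal = FALSE)
welch <- t.test(Q1 ~ Group, data = responses)

# Treatment mean minus control mean
mean_difference <- unname(welch$estimate[2] - welch$estimate[1])
mean_difference

# 95% confidence interval; note that it is for Control minus Treatment
welch$conf.int
```

With samples this small the interval will usually be wide, so a difference that looks large in a bar chart may still be compatible with no effect.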
A nice feature of Qualtrics is that it collects the location of your
respondents. It stores this data in two variables, helpfully called
LocationLatitude and LocationLongitude. If you
conduct the survey using your mobile phone, it will record the location
in which you submitted the survey.
To plot the location of your respondents, you can use the following code with some modifications:
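library(sf)

# Control survey respondents; crs = 4326 declares the coordinates as longitude/latitude (WGS 84)
location_df <- st_as_sf(x = control_survey, coords = c("LocationLongitude", "LocationLatitude"), crs = 4326)
ggplot(location_df) + geom_sf()
# Treatment survey respondents
location_df <- st_as_sf(x = treatment_survey, coords = c("LocationLongitude", "LocationLatitude"), crs = 4326)
ggplot(location_df) + geom_sf()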
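One modification you may need: `st_as_sf()` stops with an error if any coordinate is missing. A minimal sketch, assuming some responses come back without a recorded location, drops those rows first. The data here are invented, and the coordinate reference system (EPSG 4326, plain longitude and latitude) is an assumption about how the coordinates are recorded.

```r
library(dplyr)
library(sf)

# Made-up responses; the second one has no recorded location
locations <- tibble(
  ResponseId        = c("R_1", "R_2", "R_3"),
  LocationLatitude  = c(42.37, NA, 42.36),
  LocationLongitude = c(-71.12, NA, -71.06)
)

location_sf <- locations %>%
  # Drop rows with a missing coordinate before converting
  filter(!is.na(LocationLatitude), !is.na(LocationLongitude)) %>%
  # Longitude is the x coordinate, latitude is the y coordinate
  st_as_sf(coords = c("LocationLongitude", "LocationLatitude"), crs = 4326)

location_sf
```

The resulting object can be passed to `ggplot() + geom_sf()` in the same way as in the code above.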
If you wish to plot this on a map of your location, you can find administrative maps at the GADM Website at various levels of subdivision.
For the final question, we ask you to clean and analyze a conjoint experiment. As we discussed in class on October 31, conjoint experiments are a way for researchers to understand what individuals value when choices are bundled, as in voting. Two candidates may hold the same policy positions, but one may be a woman and the other a man, one may be of the same race or ethnicity as the voter and the other of a different ethnicity, and one may be highly educated and the other not. In this situation, which characteristics of the candidate are most important to the voter?
You can find the data for this problem here (.rda)
In a study conducted in Delhi with Ashwini Deshpande of Ashoka University and Parushya of Georgetown University, we were interested in understanding what parents in schools in Delhi found important in their parental representatives in school management committees (SMCs). We administered a conjoint experiment to approximately 2,000 parents across Delhi to understand their preferences for their representatives along the following lines:
The analysis of a conjoint experiment is rather simple, but we will give you the raw (okay, lightly cleaned) data we collected and ask you to tell us which attributes are the strongest predictors of preferences for representatives. To conduct the final analysis, you will want to run a regression of the following form:
\[ \begin{aligned} \text{Choice}_{i, p} = {} & \text{Gender}_{i, p} + \text{Education}_{i, p} + \text{Caste}_{i, p} + \text{Religion}_{i, p} + \text{State of Origin}_{i, p} \\ & + \text{Partisan Political Affiliation}_{i, p} + \text{Distance from School}_{i, p} + \text{Respondent}_{i} \end{aligned} \]
Where i indexes the respondent and p indexes the profile.
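To make the regression concrete, here is a minimal sketch on simulated data using base R only. The two attributes, their levels, and the effect sizes are invented for illustration, and the respondent term is entered as a set of respondent dummies. This is a toy version of the specification, not the analysis of the Delhi data.

```r
set.seed(123)
n_respondents <- 500

# Two randomly generated profiles per respondent (one row per respondent-profile)
profiles <- data.frame(
  respondent = rep(seq_len(n_respondents), each = 2),
  profile    = rep(1:2, times = n_respondents),
  gender     = factor(sample(c("Man", "Woman"), 2 * n_respondents, replace = TRUE)),
  education  = factor(
    sample(c("Primary", "Secondary", "College"), 2 * n_respondents, replace = TRUE),
    levels = c("Primary", "Secondary", "College")
  )
)

# Hypothetical preferences: respondents favor women and college graduates
utility <- 0.8 * (profiles$gender == "Woman") +
  0.5 * (profiles$education == "College") +
  rnorm(nrow(profiles))

# Each respondent chooses the profile with the higher utility
profiles$choice <- as.integer(utility == ave(utility, profiles$respondent, FUN = max))

# Linear probability model with respondent dummies
model <- lm(choice ~ gender + education + factor(respondent), data = profiles)
coef(model)[c("genderWoman", "educationSecondary", "educationCollege")]
```

Each coefficient can be read as the change in the probability that a profile is chosen, relative to the omitted baseline level of that attribute (here, "Man" and "Primary").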
The data we have given you are at the respondent level. You will need to reshape them to the respondent-profile level to run this analysis. We would like to see you do the following:
pivot_longer command from the tidyr package.
load("/Users/heidinadhira/Downloads/conjoint_a830.rda")
data <- x
remove(x)
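The reshaping from the respondent level to the respondent-profile level can be sketched with `pivot_longer()` on a toy dataset. The identifiers `hhid` and `profile_choice` and the `profile_1_` / `profile_2_` prefixes follow the names used in this problem set; the attribute names `gender` and `education` and all values are assumptions made for illustration, so check `names(data)` before adapting the pattern.

```r
library(dplyr)
library(tidyr)

# Toy respondent-level data: one row per respondent, one column per profile attribute
wide <- tibble(
  hhid                = c(101, 102),
  profile_choice      = c(1, 2),
  profile_1_gender    = c("Man", "Woman"),
  profile_1_education = c("Primary", "College"),
  profile_2_gender    = c("Woman", "Woman"),
  profile_2_education = c("College", "Secondary")
)

long <- wide %>%
  pivot_longer(
    cols = matches("^profile_[0-9]+_"),       # attribute columns only, not profile_choice
    names_to = c("profile", ".value"),        # profile number, then one column per attribute
    names_pattern = "^profile_([0-9]+)_(.*)$"
  ) %>%
  mutate(
    profile = as.integer(profile),
    choice  = as.integer(profile_choice == profile)  # 1 if this profile was chosen
  )

long
```

The special name `".value"` tells `pivot_longer()` to keep the second part of each column name as the name of an output column, which yields one row per respondent and profile in a single step.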
profile_1 <- data %>%
  select(hhid, profile_choice, contains("profile_1_")) %>%
  mutate(profile = 1) %>%
  relocate(hhid, profile_choice, profile)
profile_2 <- data %>%
  select(hhid, profile_choice, contains("profile_2_")) %>%
  mutate(profile = 2) %>%
  relocate(hhid, profile_choice, profile)
# Rename columns to have a common naming convention
names(profile_1)[4:10] <- c("gender", "distance", "education", "caste", "home_state", "party", "religion")
names(profile_2)[4:10] <- c("gender", "distance", "education", "caste", "home_state", "party", "religion")
# Combine dataframes
long_data <- rbind(profile_1, profile_2)
# Create a binary variable for choice
long_data$choice <- ifelse(long_data$profile_choice == long_data$profile, 1, 0)
long_data <- long_data %>%
select(-profile_choice)
# Convert attributes to factors
attributes <- c("gender", "distance", "education", "caste", "home_state", "party", "religion")
long_data[attributes] <- lapply(long_data[attributes], factor)
# Probability check: share of each level within its attribute
prob_check <- long_data %>%
  select(all_of(attributes)) %>%
  gather(key = "attribute", value = "value") %>%
  group_by(attribute, value) %>%
  # Keep the grouping by attribute so shares sum to 1 within each attribute
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(probability = count / sum(count)) %>%
  ungroup()
# View probabilities
print(prob_check)
# Regression analysis
model <- lm(choice ~ gender + distance + education + caste + home_state + party + religion, data = long_data)
# Summary of the model
summary(model)
# Plot the coefficients
tidy_model <- tidy(model)
ggplot(tidy_model, aes(x = term, y = estimate)) +
  geom_col() +
  coord_flip() +
  xlab("Attributes") +
  ylab("Coefficient Estimate") +
  ggtitle("Coefficients from Regression Analysis")
# Plot the coefficients with 95% confidence intervals
# (conf.int = TRUE adds conf.low and conf.high; estimate +/- one standard error is not a 95% interval)
tidy_model <- tidy(model, conf.int = TRUE)
ggplot(tidy_model, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  xlab("Coefficient Estimate") +
  ylab("Attributes") +
  ggtitle("Coefficient Plot with Confidence Intervals")
# Regression analysis with fixed effects using feols from fixest package
model_fe <- feols(choice ~ gender + distance + education + caste + home_state + party + religion | hhid, data = long_data)
# Extract coefficients and standard errors
coef_estimates <- coef(model_fe)
std_errors <- sqrt(diag(vcov(model_fe)))
# Calculate confidence intervals
alpha <- 0.05 # for a 95% confidence interval
z <- qnorm(1 - alpha / 2)
conf_int_lower <- coef_estimates - z * std_errors
conf_int_upper <- coef_estimates + z * std_errors
# Creating a tidy dataframe for plotting
tidy_model_fe <- data.frame(
  term = names(coef_estimates),
  estimate = coef_estimates,
  conf.low = conf_int_lower,
  conf.high = conf_int_upper
)
# Plot the coefficients with confidence intervals
ggplot(tidy_model_fe, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  xlab("Change in probability") +
  ylab("Attributes") +
  ggtitle("Coefficient Plot with Confidence Intervals (Fixed Effects Model)")
f1 <- choice ~ gender + distance + education + caste + home_state + party + religion
plot(mm(long_data, f1, id = ~hhid), vline = 0.5)
conjoint <- cj(long_data, f1, id = ~hhid)
cj_full_plot_presentation <- plot(conjoint,
                                  xlab = "Change in Probability of Vote (%)",
                                  alpha = 0.1) +
  geom_point(size = 2) +
  geom_line(size = 2) +
  theme_bw() +
  theme(
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank(),
    panel.border = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  xlim(c(-0.25, 0.5))