# Setting up the environment
library(tidyverse)
library(magrittr)
I always find it helpful to see dplyr verbs in action - even if it is
not the most sophisticated example, the real-world relevance really
helps me grasp its potential uses and utility. Recently, my coauthor and
I came across the problem of survey infiltration by bots.
dplyr verbs again came to the rescue in our efforts to
gauge the extent of the problem and prevent it.
# You can download the example directly from my github
bot_dt <- "https://github.com/hyunso-christy-oh/prism2025/raw/refs/heads/main/bot_example.csv" %>%
url %>%
gzcon %>%
read_csv
# Have a look at how the data looks - what stands out to you as something that we may want to address?
bot_dt %>%
head
# How many rows?
bot_dt %>%
nrow
What I did first was to remove the first two rows, which is an artifact of the data being downloaded from qualtrics and do not contain the actual responses.
# Get rid of useless rows
bot_dt1 <- bot_dt %>%
slice(-c(1, 2)) ## removing the first two rows
# Checking what we did
bot_dt1 %>%
head ## now all rows are responses
## # A tibble: 6 × 95
## StartDate EndDate Status IPAddress Progress Duration (in seconds…¹ Finished
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2025-02-27 … 2025-0… 0 120.216.… 100 77 1
## 2 2025-02-27 … 2025-0… 0 183.94.3… 100 121 1
## 3 2025-02-27 … 2025-0… 0 14.30.17… 100 110 1
## 4 2025-02-27 … 2025-0… 0 1.61.18.… 100 172 1
## 5 2025-02-27 … 2025-0… 0 119.60.1… 100 211 1
## 6 2025-02-27 … 2025-0… 0 14.26.23… 100 170 1
## # ℹ abbreviated name: ¹`Duration (in seconds)`
## # ℹ 88 more variables: RecordedDate <chr>, ResponseId <chr>,
## # RecipientLastName <chr>, RecipientFirstName <chr>, RecipientEmail <chr>,
## # ExternalReference <chr>, LocationLatitude <chr>, LocationLongitude <chr>,
## # DistributionChannel <chr>, UserLanguage <chr>, Q_RecaptchaScore <chr>,
## # Q_RelevantIDDuplicate <chr>, Q_RelevantIDDuplicateScore <chr>,
## # Q_RelevantIDFraudScore <chr>, Q_RelevantIDLastStartDate <chr>, …
bot_dt1 %>%
nrow ## indeed, we have two less rows
## [1] 337
What I wanted to next requires providing some context. In our initial
soft launch of this survey, we detected some evidence for bot
infiltration in our sample. Namely, we had cases where
open-ended responses would match exactly across
different respondents (sometimes, responses up to 16 Chinese
characters), and these duplicate responses sometimes didn’t even make
sense (for example, how likely is that three respondents who are
completely unrelated to each other, talk about “dried tofu” in response
to a vignette about the United Nations?). Being alerted to this
possibility, we put in some measures for bot detection, and wanted to
see if they were successful in filtering out suspicious respondents.
Since we had put this measure right before resuming sample collection on
March 14th, we wanted to look only at the responses collected
after we put in the measures. Let’s use
filter on the the variable that tells us when the responses
were recorded.
# Which variable tells us when the response was recorded?
bot_dt1 %>% colnames
StartDate seems promising!
# Let's see what StartDate variable is like
bot_dt1$StartDate %>% class ## it is a character variable
## [1] "character"
bot_dt1 %>%
select(StartDate) %>%
slice_head(n = 5) ## it is in the format of year-month-date hour:min:sec
## # A tibble: 5 × 1
## StartDate
## <chr>
## 1 2025-02-27 16:02:25
## 2 2025-02-27 16:02:17
## 3 2025-02-27 16:02:43
## 4 2025-02-27 16:02:57
## 5 2025-02-27 16:02:57
# We can use mutate to convert the StartDate variable into a format that is useful for us - there is two ways of going about this
## Using base R
?as.POSIXct
bot_dt1 %>%
mutate(
StartDate = StartDate %>%
as.POSIXct(format="%Y-%m-%d %H:%M:%S")
) %>%
select(StartDate) %>%
print(n = 5)
## # A tibble: 337 × 1
## StartDate
## <dttm>
## 1 2025-02-27 16:02:25
## 2 2025-02-27 16:02:17
## 3 2025-02-27 16:02:43
## 4 2025-02-27 16:02:57
## 5 2025-02-27 16:02:57
## # ℹ 332 more rows
## Using lubridate (a core tidyverse package)
bot_dt1 %>%
mutate(
StartDate = StartDate %>%
as_datetime
) %>%
select(StartDate) %>%
print(n = 5)
## # A tibble: 337 × 1
## StartDate
## <dttm>
## 1 2025-02-27 16:02:25
## 2 2025-02-27 16:02:17
## 3 2025-02-27 16:02:43
## 4 2025-02-27 16:02:57
## 5 2025-02-27 16:02:57
## # ℹ 332 more rows
# We will proceed with lubridate, because we love tidyverse!
bot_dt2 <- bot_dt1 %>%
mutate(
StartDate = StartDate %>%
as_datetime
)
A quick introduction to lubridate. Visit the following
link for relevant information and cheatsheet: https://lubridate.tidyverse.org/ I actually realize that
the original StartDate allows you to do the filtering seemingly
similarly, but having it converted to a proper date object will limit
error and ensure that all your operations will perform as expected
(e.g. getting time differences, ordering using arrange,
grouping by group_by)
# Lubridate can be useful
bot_dt2$StartDate %>%
date
bot_dt2$StartDate %>%
hour
bot_dt2$StartDate %>%
wday(label = TRUE)
# if we left it as character, you can run into problems
bot_dt1$StartDate %>% group_by(month(StartDate))
bot_dt1$StartDate %>% arrange
Now we can use the mutated StartDate to limit the
dataset to the timeframe that we are interested in.
# Limiting timeframe
bot_dt3 <- bot_dt2 %>%
filter(
StartDate %>%
is_weakly_greater_than("2025-03-14")
)
Now, we can check how effective each of our measures were in
filtering out suspicious responses. One of our measures was to include
all of the basic bot detection measures that comes with Qualtrics. Since
we had them filtered out before displaying the first demographic
question, I could check how many were filtered out based on how many NAs
there were for the gender question.
# Limiting timeframe
bot_dt3 %>%
filter(
gender %>%
is.na
)
## # A tibble: 2 × 95
## StartDate EndDate Status IPAddress Progress Duration (in seconds…¹
## <dttm> <chr> <chr> <chr> <chr> <chr>
## 1 2025-03-14 11:51:36 2025-03-… 1 <NA> 100 10
## 2 2025-03-14 11:52:26 2025-03-… 1 <NA> 100 21
## # ℹ abbreviated name: ¹`Duration (in seconds)`
## # ℹ 89 more variables: Finished <chr>, RecordedDate <chr>, ResponseId <chr>,
## # RecipientLastName <chr>, RecipientFirstName <chr>, RecipientEmail <chr>,
## # ExternalReference <chr>, LocationLatitude <chr>, LocationLongitude <chr>,
## # DistributionChannel <chr>, UserLanguage <chr>, Q_RecaptchaScore <chr>,
## # Q_RelevantIDDuplicate <chr>, Q_RelevantIDDuplicateScore <chr>,
## # Q_RelevantIDFraudScore <chr>, Q_RelevantIDLastStartDate <chr>, …
Two of you! Gotcha. We also included a rather smart open-ended question with which we were hoping to catch LLM-enhanced bots. We asked respondents to put in purple(“紫色”) if they are real human, and put in green(“绿色”) if they are bots. Ryan gave us this idea, essentially taking advantage of the AI being programmed with what we call “AI alignment principles” or “guardrails”. Namely, they are not supposed to lie.
# Those filtered out by our open-ended question
bot_dt3 %>%
filter(
gender %>%
is.na %>%
not
) %>%
filter(
term %>%
equals("bot_detection") ## this is how we marked those who were filtered out by this question
) %>%
select(attn_open) ## we can see what responses they gave
## # A tibble: 6 × 1
## attn_open
## <chr>
## 1 绿色
## 2 紫色
## 3 purple
## 4 绿色
## 5 绿色
## 6 绿色
We see some greens! We are not sure what exactly happened with the
response purple in smaller font and purple in English. Anyhow,
suspicious all the same, and successfully vetted. Another measure we
incoporated took advantage of the fact that bots perform poorly on
questions that require consistency. For example, if we ask humans to
respond to the questions “What year were you born?” and “Put in your
year of birth”, there should be consistency. However, bots most often
put two different answers. We called this a
birthday filter. We set the term to be recorded as… well,
birthday_filter
# Those filtered out by our birthday filter
bot_dt3 %>%
filter(
gender %>%
is.na %>%
not
) %>%
filter(
term %>%
equals("birthday_filter") ## n = 1
)
## # A tibble: 1 × 95
## StartDate EndDate Status IPAddress Progress Duration (in seconds…¹
## <dttm> <chr> <chr> <chr> <chr> <chr>
## 1 2025-03-14 15:53:15 2025-03-… 0 182.200.… 100 65
## # ℹ abbreviated name: ¹`Duration (in seconds)`
## # ℹ 89 more variables: Finished <chr>, RecordedDate <chr>, ResponseId <chr>,
## # RecipientLastName <chr>, RecipientFirstName <chr>, RecipientEmail <chr>,
## # ExternalReference <chr>, LocationLatitude <chr>, LocationLongitude <chr>,
## # DistributionChannel <chr>, UserLanguage <chr>, Q_RecaptchaScore <chr>,
## # Q_RelevantIDDuplicate <chr>, Q_RelevantIDDuplicateScore <chr>,
## # Q_RelevantIDFraudScore <chr>, Q_RelevantIDLastStartDate <chr>, …
# What did they actually say?
bot_dt3 %>%
filter(
gender %>%
is.na %>%
not
) %>%
filter(
term %>%
equals("birthday_filter") ## n = 1
) %>%
select(age...33, age2...34) ## You are suddenly 5 years more youthful a few questions down, are you?
## # A tibble: 1 × 2
## age...33 age2...34
## <chr> <chr>
## 1 1980 1985
So in total, we can winnow out the quality sample, also excluding those who failed our very simple attention check questions.
# number of quality sample: n = 100
## Again, the usefulness of magrittr approach that allows you to see how many repsonses get filtered out in each step.
bot_dt3 %>%
filter(
gender %>%
is.na %>%
not
) %>%
filter(
term %>%
equals("birthday_filter") %>%
not |
term %>%
is.na
) %>%
filter(
attn %>%
equals(3)
) %>%
filter(
attn_open %>%
equals("紫色")
) %>%
filter(
att2 %>%
equals(1)
) %>%
nrow
## [1] 100
Quality respondents, 100 out of 116!