# Setting up the environment
library(tidyverse)
library(magrittr)

dplyr verbs in action

I always find it helpful to see dplyr verbs in action - even if it is not the most sophisticated example, the real-world relevance really helps me grasp its potential uses and utility. Recently, my coauthor and I came across the problem of survey infiltration by bots. dplyr verbs again came to the rescue in our efforts to gauge the extent of the problem and prevent it.

# You can download the example directly from my github
bot_dt <- "https://github.com/hyunso-christy-oh/prism2025/raw/refs/heads/main/bot_example.csv" %>% 
  url %>% 
  gzcon %>% 
  read_csv

# Have a look at how the data looks - what stands out to you as something that we may want to address? 
bot_dt %>% 
  head

# How many rows? 
bot_dt %>% 
  nrow

What I did first was to remove the first two rows, which is an artifact of the data being downloaded from qualtrics and do not contain the actual responses.

# Get rid of useless rows
bot_dt1 <-  bot_dt %>% 
  slice(-c(1, 2)) ## removing the first two rows

# Checking what we did
bot_dt1 %>% 
  head ## now all rows are responses
## # A tibble: 6 × 95
##   StartDate    EndDate Status IPAddress Progress Duration (in seconds…¹ Finished
##   <chr>        <chr>   <chr>  <chr>     <chr>    <chr>                  <chr>   
## 1 2025-02-27 … 2025-0… 0      120.216.… 100      77                     1       
## 2 2025-02-27 … 2025-0… 0      183.94.3… 100      121                    1       
## 3 2025-02-27 … 2025-0… 0      14.30.17… 100      110                    1       
## 4 2025-02-27 … 2025-0… 0      1.61.18.… 100      172                    1       
## 5 2025-02-27 … 2025-0… 0      119.60.1… 100      211                    1       
## 6 2025-02-27 … 2025-0… 0      14.26.23… 100      170                    1       
## # ℹ abbreviated name: ¹​`Duration (in seconds)`
## # ℹ 88 more variables: RecordedDate <chr>, ResponseId <chr>,
## #   RecipientLastName <chr>, RecipientFirstName <chr>, RecipientEmail <chr>,
## #   ExternalReference <chr>, LocationLatitude <chr>, LocationLongitude <chr>,
## #   DistributionChannel <chr>, UserLanguage <chr>, Q_RecaptchaScore <chr>,
## #   Q_RelevantIDDuplicate <chr>, Q_RelevantIDDuplicateScore <chr>,
## #   Q_RelevantIDFraudScore <chr>, Q_RelevantIDLastStartDate <chr>, …
bot_dt1 %>% 
  nrow ## indeed, we have two less rows
## [1] 337

What I wanted to next requires providing some context. In our initial soft launch of this survey, we detected some evidence for bot infiltration in our sample. Namely, we had cases where open-ended responses would match exactly across different respondents (sometimes, responses up to 16 Chinese characters), and these duplicate responses sometimes didn’t even make sense (for example, how likely is that three respondents who are completely unrelated to each other, talk about “dried tofu” in response to a vignette about the United Nations?). Being alerted to this possibility, we put in some measures for bot detection, and wanted to see if they were successful in filtering out suspicious respondents. Since we had put this measure right before resuming sample collection on March 14th, we wanted to look only at the responses collected after we put in the measures. Let’s use filter on the the variable that tells us when the responses were recorded.

# Which variable tells us when the response was recorded? 
bot_dt1 %>% colnames

StartDate seems promising!

# Let's see what StartDate variable is like
bot_dt1$StartDate %>% class ## it is a character variable
## [1] "character"
bot_dt1 %>% 
  select(StartDate) %>% 
  slice_head(n = 5) ## it is in the format of year-month-date hour:min:sec
## # A tibble: 5 × 1
##   StartDate          
##   <chr>              
## 1 2025-02-27 16:02:25
## 2 2025-02-27 16:02:17
## 3 2025-02-27 16:02:43
## 4 2025-02-27 16:02:57
## 5 2025-02-27 16:02:57
# We can use mutate to convert the StartDate variable into a format that is useful for us - there is two ways of going about this
## Using base R
?as.POSIXct

bot_dt1 %>% 
  mutate(
    StartDate = StartDate %>% 
      as.POSIXct(format="%Y-%m-%d %H:%M:%S")
  ) %>% 
  select(StartDate) %>% 
  print(n = 5)
## # A tibble: 337 × 1
##   StartDate          
##   <dttm>             
## 1 2025-02-27 16:02:25
## 2 2025-02-27 16:02:17
## 3 2025-02-27 16:02:43
## 4 2025-02-27 16:02:57
## 5 2025-02-27 16:02:57
## # ℹ 332 more rows
## Using lubridate (a core tidyverse package)
bot_dt1 %>% 
  mutate(
    StartDate = StartDate %>% 
      as_datetime
  ) %>% 
  select(StartDate) %>% 
  print(n = 5)
## # A tibble: 337 × 1
##   StartDate          
##   <dttm>             
## 1 2025-02-27 16:02:25
## 2 2025-02-27 16:02:17
## 3 2025-02-27 16:02:43
## 4 2025-02-27 16:02:57
## 5 2025-02-27 16:02:57
## # ℹ 332 more rows
# We will proceed with lubridate, because we love tidyverse!

bot_dt2 <- bot_dt1 %>% 
  mutate(
    StartDate = StartDate %>% 
      as_datetime
  )

A quick introduction to lubridate. Visit the following link for relevant information and cheatsheet: https://lubridate.tidyverse.org/ I actually realize that the original StartDate allows you to do the filtering seemingly similarly, but having it converted to a proper date object will limit error and ensure that all your operations will perform as expected (e.g. getting time differences, ordering using arrange, grouping by group_by)

# Lubridate can be useful
bot_dt2$StartDate %>% 
  date 

bot_dt2$StartDate %>% 
  hour 

bot_dt2$StartDate %>% 
  wday(label = TRUE) 
# if we left it as character, you can run into problems
bot_dt1$StartDate %>% group_by(month(StartDate))
bot_dt1$StartDate %>% arrange

Now we can use the mutated StartDate to limit the dataset to the timeframe that we are interested in.

# Limiting timeframe
bot_dt3 <- bot_dt2 %>% 
  filter(
    StartDate %>% 
      is_weakly_greater_than("2025-03-14")
  )

Now, we can check how effective each of our measures were in filtering out suspicious responses. One of our measures was to include all of the basic bot detection measures that comes with Qualtrics. Since we had them filtered out before displaying the first demographic question, I could check how many were filtered out based on how many NAs there were for the gender question.

# Limiting timeframe

bot_dt3 %>% 
  filter(
    gender %>% 
      is.na 
  )
## # A tibble: 2 × 95
##   StartDate           EndDate   Status IPAddress Progress Duration (in seconds…¹
##   <dttm>              <chr>     <chr>  <chr>     <chr>    <chr>                 
## 1 2025-03-14 11:51:36 2025-03-… 1      <NA>      100      10                    
## 2 2025-03-14 11:52:26 2025-03-… 1      <NA>      100      21                    
## # ℹ abbreviated name: ¹​`Duration (in seconds)`
## # ℹ 89 more variables: Finished <chr>, RecordedDate <chr>, ResponseId <chr>,
## #   RecipientLastName <chr>, RecipientFirstName <chr>, RecipientEmail <chr>,
## #   ExternalReference <chr>, LocationLatitude <chr>, LocationLongitude <chr>,
## #   DistributionChannel <chr>, UserLanguage <chr>, Q_RecaptchaScore <chr>,
## #   Q_RelevantIDDuplicate <chr>, Q_RelevantIDDuplicateScore <chr>,
## #   Q_RelevantIDFraudScore <chr>, Q_RelevantIDLastStartDate <chr>, …

Two of you! Gotcha. We also included a rather smart open-ended question with which we were hoping to catch LLM-enhanced bots. We asked respondents to put in purple(“紫色”) if they are real human, and put in green(“绿色”) if they are bots. Ryan gave us this idea, essentially taking advantage of the AI being programmed with what we call “AI alignment principles” or “guardrails”. Namely, they are not supposed to lie.

# Those filtered out by our open-ended question

bot_dt3 %>%
  filter(
    gender %>% 
      is.na %>% 
      not
  ) %>% 
  filter(
    term %>% 
      equals("bot_detection") ## this is how we marked those who were filtered out by this question
  ) %>% 
  select(attn_open) ## we can see what responses they gave
## # A tibble: 6 × 1
##   attn_open
##   <chr>    
## 1 绿色     
## 2 紫色     
## 3 purple   
## 4 绿色     
## 5 绿色     
## 6 绿色

We see some greens! We are not sure what exactly happened with the response purple in smaller font and purple in English. Anyhow, suspicious all the same, and successfully vetted. Another measure we incoporated took advantage of the fact that bots perform poorly on questions that require consistency. For example, if we ask humans to respond to the questions “What year were you born?” and “Put in your year of birth”, there should be consistency. However, bots most often put two different answers. We called this a birthday filter. We set the term to be recorded as… well, birthday_filter

# Those filtered out by our birthday filter
bot_dt3 %>%
  filter(
    gender %>% 
      is.na %>% 
      not
  ) %>%
  filter(
    term %>% 
      equals("birthday_filter") ## n = 1
  ) 
## # A tibble: 1 × 95
##   StartDate           EndDate   Status IPAddress Progress Duration (in seconds…¹
##   <dttm>              <chr>     <chr>  <chr>     <chr>    <chr>                 
## 1 2025-03-14 15:53:15 2025-03-… 0      182.200.… 100      65                    
## # ℹ abbreviated name: ¹​`Duration (in seconds)`
## # ℹ 89 more variables: Finished <chr>, RecordedDate <chr>, ResponseId <chr>,
## #   RecipientLastName <chr>, RecipientFirstName <chr>, RecipientEmail <chr>,
## #   ExternalReference <chr>, LocationLatitude <chr>, LocationLongitude <chr>,
## #   DistributionChannel <chr>, UserLanguage <chr>, Q_RecaptchaScore <chr>,
## #   Q_RelevantIDDuplicate <chr>, Q_RelevantIDDuplicateScore <chr>,
## #   Q_RelevantIDFraudScore <chr>, Q_RelevantIDLastStartDate <chr>, …
# What did they actually say?
bot_dt3 %>%
  filter(
    gender %>% 
      is.na %>% 
      not
  ) %>%
  filter(
    term %>% 
      equals("birthday_filter") ## n = 1
  ) %>% 
  select(age...33, age2...34) ## You are suddenly 5 years more youthful a few questions down, are you?
## # A tibble: 1 × 2
##   age...33 age2...34
##   <chr>    <chr>    
## 1 1980     1985

So in total, we can winnow out the quality sample, also excluding those who failed our very simple attention check questions.

# number of quality sample: n = 100
## Again, the usefulness of magrittr approach that allows you to see how many repsonses get filtered out in each step.
bot_dt3 %>% 
  filter(
    gender %>% 
      is.na %>% 
      not
  ) %>% 
  filter(
    term %>%
      equals("birthday_filter") %>%
      not |
      term %>% 
      is.na
  ) %>%
  filter(
    attn %>% 
      equals(3) 
  ) %>% 
  filter(
    attn_open %>% 
      equals("紫色") 
  ) %>% 
  filter(
    att2 %>% 
      equals(1)
  ) %>% 
  nrow
## [1] 100

Quality respondents, 100 out of 116!