Purpose

  • The purpose of this script is to 1) anonymize the raw Chatfuel dataset and 2) document the data handling process. The raw Chatfuel dataset is downloaded manually from Chatfuel and uses chatfuel user id as the main participant identifier. It does not contain personal information about respondents such as name or profile picture URL. The raw dataset is named Project_Don_t_Get_Duped_....csv and is uploaded to the Sherlock server at ~ssh://sherlock/oak/stanford/groups/athey/fb_misinfo_interventions/data/chatfuel/raw.

  • The anonymization process will take place after the raw dataset is uploaded to Sherlock server. The anonymization process will include the following steps:

    1. Load the raw dataset Project_Don_t_Get_Duped_....csv, drop participants from the GSB Golub Capital Social Lab, and generate a unique analytic_id for each remaining respondent (steps 2 to 4).
    2. Order the data by MisinfoChat_start_time, using signed up and chatfuel user id as tiebreakers (in that order).
    3. Add a new variable rn which is simply the row number for each entry.
    4. Create a new unique id, analytic_id, by splicing together rn, signed up, and MisinfoChat_start_time into one string.
    5. Delete the chatfuel user id variable.
    6. Save the anonymized dataset as fbmisinfo_anon.csv.gz to the same directory as the raw dataset, ~ssh://sherlock/oak/stanford/groups/athey/fb_misinfo_interventions/data/chatfuel/raw
  • For cleaning purposes, the anonymized data is then manually downloaded from the Sherlock server to the local machine and uploaded to the GitHub directory ~fb_misinfo_interventions/data/.

  • Lab and project members are removed from the dataset at this stage.
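
  • Steps 2 to 5 above can be sketched on a toy dataset. This is a minimal illustration in base R, not the production code: the three rows and all their values are made up, and only the column names are taken from the description above.

```r
# Toy stand-in for the raw export; every value here is invented for illustration.
raw <- data.frame(
  `chatfuel user id` = c("1003", "1001", "1002"),
  `signed up` = c("2024-01-02 10:00:00", "2024-01-01 09:00:00", "2024-01-02 10:00:00"),
  MisinfoChat_start_time = c("2024-01-03 12:00:00", "2024-01-02 08:30:00", "2024-01-03 12:00:00"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Step 2: order by start time, then signed up, then chatfuel user id.
ord <- order(raw$MisinfoChat_start_time, raw$`signed up`, raw$`chatfuel user id`)
anon <- raw[ord, ]

# Step 3: row number after ordering.
anon$rn <- seq_len(nrow(anon))

# Step 4: splice rn, signed up, and start time into one string.
anon$analytic_id <- paste(anon$rn, anon$`signed up`, anon$MisinfoChat_start_time, sep = "_")

# Step 5: drop the original identifier.
anon$`chatfuel user id` <- NULL

print(anon$analytic_id)
```

  • In this toy data, two rows share the same signed up and MisinfoChat_start_time values, so their analytic_id values differ only in the rn prefix. The row number is what guarantees uniqueness.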

Library

library(tidyverse)
library(kableExtra)
library(broom)
library(here)
library(data.table)

Anonymization Process

Loading the raw dataset

  • The raw dataset was last pulled on 02/23/2024
data <- read_csv("./data/chatfuel/raw/Project_Don_t_Get_Duped_2024_02_23_06_43_57.csv") #loading the raw Chatfuel dataset

Checking for unique chatfuel user id

  • We check for duplicate chatfuel user id values by comparing the number of unique ids with the number of rows in the dataset. The script stops with an error if the two numbers differ.
a <- as.character(length(unique(data$`chatfuel user id`))) # number of unique ids, stored as a string


if (nrow(data) != length(unique(data$`chatfuel user id`))) {
  stop("The number of rows does not match the number of unique `chatfuel user id` values.")
} else {
  # Continue 
  print("Unique observations check passed.")
}
## [1] "Unique observations check passed."
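
  • If the check above were to fail, a natural next step would be to find which ids are repeated. The following is a minimal sketch in base R on a hypothetical vector of ids. It is not part of the original script, and the ids are invented.

```r
# Hypothetical ids; "1002" appears twice on purpose.
ids <- c("1001", "1002", "1003", "1002")

# duplicated() flags the second and later occurrences of a value.
dup_values <- unique(ids[duplicated(ids)])

# Positions of all rows that share a duplicated id, including the first occurrence.
dup_rows <- which(ids %in% dup_values)

print(dup_values)
print(dup_rows)
```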

Dropping participants from the GSB Golub Capital Social Lab

  1. Ruth Appel: 5668883889871868
  2. Szymon Sacher: 6536224449774331
  3. Susan Athey: 6093789924066999
  4. Kristine Koutout: 6340332249327821
  5. Kiet Le: 6454866794609331
  6. Mike Luca: 6410378582382260
chatfuel_user_id_to_drop <- c("5668883889871868", "6536224449774331", "6093789924066999", "6340332249327821", "6454866794609331", "6410378582382260")
data <- data %>% filter(!(`chatfuel user id` %in% chatfuel_user_id_to_drop))
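
  • A quick sanity check after dropping rows is to confirm that no listed id remains and to see whether any listed id never matched a row. The sketch below uses base R and invented character ids. It assumes the ids are compared as character strings, which may or may not match how the real column was parsed.

```r
# Hypothetical ids for illustration; none of these are real Chatfuel ids.
ids_all <- c("1001", "1002", "1003", "1004")
ids_to_drop <- c("1002", "1004", "9999")  # "9999" does not occur in ids_all

# Keep only ids that are not on the drop list.
kept <- ids_all[!(ids_all %in% ids_to_drop)]

# Ids on the drop list that matched no row; worth a second look in practice.
unmatched <- setdiff(ids_to_drop, ids_all)

# No dropped id should survive the filter.
stopifnot(!any(kept %in% ids_to_drop))

print(kept)
print(unmatched)
```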

Generating analytic_id

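# Sort by start time, breaking ties by `signed up` and then `chatfuel user id`
ordered_data <- data[order(data$MisinfoChat_start_time, data$`signed up`, data$`chatfuel user id`), ]
# Row number after sorting (does not rely on row names being sequential)
ordered_data$rn <- seq_len(nrow(ordered_data))
# Splice rn, signed up, and start time into one id string
ordered_data$analytic_id <- paste(ordered_data$rn, ordered_data$`signed up`, ordered_data$MisinfoChat_start_time, sep = "_")
# Remove the original identifier
ordered_data$`chatfuel user id` <- NULL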

count <- as.character(length(unique(ordered_data$analytic_id)))
  • The anonymized raw dataset contains 160283 unique analytic_id values and 160283 rows, so every row has a distinct analytic_id.
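
  • The properties expected of the anonymized data (no raw id column, unique and non-missing analytic_id) could be bundled into a small check. The helper below is hypothetical: its name, arguments, and error messages are illustrative and do not appear in the original script, and the toy data frame is invented.

```r
# Hypothetical helper; the name and arguments are illustrative only.
check_anonymized <- function(df, raw_id_col = "chatfuel user id") {
  if (raw_id_col %in% names(df)) stop("raw id column is still present")
  if (anyNA(df$analytic_id)) stop("analytic_id has missing values")
  if (anyDuplicated(df$analytic_id) > 0L) stop("analytic_id is not unique")
  invisible(TRUE)
}

# Toy stand-in for the anonymized data.
toy <- data.frame(
  rn = 1:2,
  analytic_id = c("1_a_b", "2_a_b"),
  stringsAsFactors = FALSE
)

check_anonymized(toy)  # passes silently
```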

Exporting the anonymized dataset

  • The anonymized dataset is written as fbmisinfo_anon.csv.gz to ./data/chatfuel/raw/, alongside the raw dataset. The copy used for cleaning is kept in ~fb_misinfo_interventions/data/.
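# Add a sequential anon_id and keep the result as a data frame
df_chat_anon <- ordered_data %>%
  mutate(anon_id = row_number())
# Write the file in a separate step; the .gz extension makes fwrite() gzip-compress the output
fwrite(df_chat_anon, "./data/chatfuel/raw/fbmisinfo_anon.csv.gz")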
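
  • After exporting, it can be reassuring to read the compressed file back and compare it with what was written. The sketch below shows such a round trip with base R and a temporary file. It uses write.csv() and read.csv() rather than fwrite() so that it runs without extra packages, and the toy data is invented.

```r
# Toy stand-in for the anonymized data; values are made up.
toy <- data.frame(
  rn = 1:2,
  analytic_id = c("1_a_b", "2_a_b"),
  stringsAsFactors = FALSE
)

# Write to a temporary gzip-compressed CSV rather than the project path.
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read the file back and compare it with what was written.
back <- read.csv(gzfile(path), stringsAsFactors = FALSE)
stopifnot(
  nrow(back) == nrow(toy),
  identical(back$analytic_id, toy$analytic_id)
)
```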