The purpose of this script is to 1) anonymize the raw Chatfuel dataset and 2) document the data handling process. The raw Chatfuel dataset is manually downloaded from Chatfuel and contains `chatfuel user id` as the main identifier of participants. The data does not contain personal private information about respondents, such as name or profile picture URL. This raw Chatfuel dataset is named `Project_Don_t_Get_Duped_....csv` and is uploaded to the Sherlock server at the path `~ssh://sherlock/oak/stanford/groups/athey/fb_misinfo_interventions/data/chatfuel/raw`.
The anonymization process takes place after the raw dataset is uploaded to the Sherlock server. It includes the following steps:
1. Load `Project_Don_t_Get_Duped_....csv`, drop participants from the GSB Golub Capital Social Lab, and generate a unique `analytic_id` for each respondent.
2. Sort the data by `MisinfoChat_start_time`, using `signed up` and `chatfuel user id` as tiebreakers (in that order).
3. Create `rn`, which is simply the row number for each entry.
4. Create `analytic_id` by splicing together `rn`, `signed up`, and `MisinfoChat_start_time` into one string.
5. Drop the `chatfuel user id` variable.
6. Save `fb_misinfo_anon.csv` to the same directory as the raw dataset, `~ssh://sherlock/oak/stanford/groups/athey/fb_misinfo_interventions/data/chatfuel/raw`.
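The load and save steps can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy data frame, its values, and the temporary file paths are hypothetical stand-ins for the real export and the Sherlock directory, and the lab-member filter is omitted because the rule for identifying lab members is not given here. The one point the sketch is meant to show is that `check.names = FALSE` keeps column names containing spaces, such as `chatfuel user id`, exactly as written.

```r
# Hypothetical toy export standing in for the raw Chatfuel file
raw_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  `chatfuel user id` = c("u1", "u2"),
  `signed up` = c("2021-01-01", "2021-01-02"),
  MisinfoChat_start_time = c("2021-01-03 10:00:00", "2021-01-03 09:00:00"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
write.csv(toy, raw_path, row.names = FALSE)

# check.names = FALSE prevents read.csv from rewriting
# "chatfuel user id" as "chatfuel.user.id"
data <- read.csv(raw_path, check.names = FALSE, stringsAsFactors = FALSE)

# Stand-in for the anonymized result: the identifier column is removed
anon <- data
anon$`chatfuel user id` <- NULL

# Save next to the raw file, without writing row names as an extra column
anon_path <- file.path(dirname(raw_path), "fb_misinfo_anon.csv")
write.csv(anon, anon_path, row.names = FALSE)
```

Writing with `row.names = FALSE` avoids adding an unnamed index column to the saved file.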
For cleaning purposes, the data is then manually downloaded from the Sherlock server to the local machine and uploaded to the GitHub directory `~fb_misinfo_interventions/data/`.
Lab and project members are removed from the dataset at this stage.
Check `chatfuel user id` for duplicates by comparing the number of unique `chatfuel user id` values with the number of rows in the dataset.
# Number of unique ids, stored as a string
a <- as.character(length(unique(data$`chatfuel user id`)))
if (nrow(data) != length(unique(data$`chatfuel user id`))) {
  stop("The number of rows does not match the number of unique `chatfuel user id` values.")
} else {
  # Continue
  print("Unique observations check passed.")
}
## [1] "Unique observations check passed."
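If the check were to fail, it may help to see which ids are repeated before deciding how to handle them. The following is a minimal sketch on hypothetical toy data (the ids and timestamps are made up); it uses base R's `duplicated()` to list every row that shares an id with another row.

```r
# Hypothetical toy data with one repeated id
data <- data.frame(
  `chatfuel user id` = c("u1", "u2", "u2", "u3"),
  MisinfoChat_start_time = c("t1", "t2", "t3", "t4"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

ids <- data$`chatfuel user id`

# duplicated() flags the second and later occurrences of each id
dup_ids <- unique(ids[duplicated(ids)])

# Keep every row whose id appears more than once, including the first occurrence
dup_rows <- data[ids %in% dup_ids, ]
print(dup_rows)
```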
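Generate `analytic_id`
# Sort by start time, breaking ties by sign-up time and then by user id
ordered_data <- data[order(data$MisinfoChat_start_time, data$"signed up", data$"chatfuel user id"), ]
# Note: for a base data.frame, row.names() returns the labels carried over from before sorting; seq_len(nrow(ordered_data)) would give the sorted position instead
ordered_data$rn <- row.names(ordered_data)
# Splice rn, sign-up time, and start time into one identifier
ordered_data$analytic_id <- paste(ordered_data$rn, ordered_data$"signed up", ordered_data$MisinfoChat_start_time, sep = "_")
# Drop the original identifier
ordered_data$"chatfuel user id" <- NULL
# Number of unique analytic ids, stored as a string
count <- as.character(length(unique(ordered_data$analytic_id)))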
The anonymized dataset has `count` unique `analytic_id` values and 160283 rows.
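The sort, number, and splice steps can be traced on a small example, followed by two checks that the result is usable as an anonymized key. This is a sketch under assumptions: the toy values are hypothetical, and `rn` is taken here from the sorted position via `seq_len()`, which is one reading of "row number" and may differ from what the script produces. The variable `original_labels` is included only to show that, for a base data frame, the row labels after sorting still reflect the pre-sort order.

```r
# Hypothetical toy data: rows 1 and 2 tie on start time
data <- data.frame(
  `chatfuel user id` = c("u3", "u1", "u2"),
  `signed up` = c("2021-01-02", "2021-01-01", "2021-01-01"),
  MisinfoChat_start_time = c("2021-01-05", "2021-01-05", "2021-01-04"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Sort by start time, then sign-up time, then user id
ordered_data <- data[order(data$MisinfoChat_start_time,
                           data$`signed up`,
                           data$`chatfuel user id`), ]

# Row labels carried over from before the sort
original_labels <- row.names(ordered_data)

# Assumption: rn is meant to be the position after sorting
ordered_data$rn <- seq_len(nrow(ordered_data))

# Splice rn, sign-up time, and start time into one string
ordered_data$analytic_id <- paste(ordered_data$rn,
                                  ordered_data$`signed up`,
                                  ordered_data$MisinfoChat_start_time,
                                  sep = "_")

# Remove the original identifier
ordered_data$`chatfuel user id` <- NULL

# Checks: ids are unique and the original identifier is gone
stopifnot(anyDuplicated(ordered_data$analytic_id) == 0)
stopifnot(!("chatfuel user id" %in% names(ordered_data)))
```

Because `rn` is unique per row, the spliced id is unique even when two respondents share the same sign-up and start times.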