Procedures to filter valid users 2009-2022

Selecting the cases to include

The goal of this file is to choose the valid cases for the users in the dataset, since two coders worked on coding the collected data. To do so, we need to consolidate and compare the results for raw and verified users.

The universe of unique users collected and coded comprised 9,690 users for the entire period 2009-2022. Below we filter the data in order to estimate the valid universe to be analyzed.

library(dplyr)

library(readxl)

raw <- read_excel("~/Documents/R Studio files/Argentina Twitter Data/Users_2009-2022_final - May242023.xlsx", sheet = "raw")

Procedure 1, taking codes of principal coder in the raw dataset

In this first procedure, we give the criteria of the principal coder (Coder 1) priority over those of the other coder. Coder 1 has more information about the context of the discussions from which the data were extracted and coded the majority of cases. The following steps were followed:

Check whether coder 1's values are not NA. If so, set the final values to those of coder 1 (8,973 users).
If coder 1's values are NA, take the values of coder 2 (717 users coded exclusively by coder 2).
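
The rule described in these two steps can be sketched on toy data. This is only an illustration under assumptions: the `user_id` column and all values are hypothetical, and `coalesce()` applies the rule column by column, whereas the code in this file decides row by row based on whether `c1_organization` is NA.

```r
library(dplyr)

# Toy data: user_id and all values are hypothetical
toy <- data.frame(
  user_id = 1:4,
  c1_organization = c("media", NA, "party", NA),
  c2_organization = c("media", "ngo", "union", NA),
  stringsAsFactors = FALSE
)

# coalesce() returns the first non-missing value,
# so coder 1 prevails whenever coder 1 coded the record
toy_final <- toy %>%
  mutate(f_organization = coalesce(c1_organization, c2_organization))

print(toy_final)
```

For records coded on every field by a single coder, both approaches give the same result; they can differ when coder 1 filled in only some of the fields.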

#create new column names for coder1 columns
lookup_1 <- c(f_organization = "c1_organization",
         f_user_category = "c1_user_category",
         f_ideology = "c1_ideology",
         f_cod_country = "c1_cod_country",
         f_country = "c1_country",
         f_official = "c1_official",
         f_verified = "c1_verified")

#create a dataset where coder 1 set codes
df1 <- raw[!is.na(raw$c1_organization),]  %>%
  rename(all_of(lookup_1)) %>%
  select(-c(c2_organization,
            c2_user_category,
            c2_ideology,
            c2_cod_country,
            c2_country,
            c2_official,
            c2_verified))
  
#create new column names for coder2 columns
lookup_2 <- c(f_organization = "c2_organization",
         f_user_category = "c2_user_category",
         f_ideology = "c2_ideology",
         f_cod_country = "c2_cod_country",
         f_country = "c2_country",
         f_official = "c2_official",
         f_verified = "c2_verified")

#create a dataset where coder 1 did not set categories
df2 <- raw[is.na(raw$c1_organization),]  %>%
  rename(all_of(lookup_2)) %>%
  select(-c(c1_organization,
            c1_user_category,
            c1_ideology,
            c1_cod_country,
            c1_country,
            c1_official,
            c1_verified))

#merge datasets in a new one with predominance of coder 1 opinion
procedure_1 <- merge(df1, df2, all=TRUE)

save(procedure_1, file="~/Documents/R Studio files/Argentina Twitter Data/Users coded/procedure_1.RData")

However, this procedure was discarded because it fails to take advantage of the limited but valuable contrast of opinions between coders, which can help minimize potential bias.

Procedure 2, eliminating records where coders disagree from the raw dataset

In the second procedure, records on which the two coders disagree are discarded, and we keep in the sample only those cases where there was agreement. The figures are as follows:

Records coded by both coders: 1,142
Records with total agreement: 989 (86%)
Records coded only by coder 1: 7,831; we accepted his coding as valid
Records coded only by coder 2: 717; we accepted his coding as valid

As a result of this approach, the total number of valid records was reduced from 9,690 to 9,536 (-1.5%).
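
The four groups listed above can be labeled in a single pass, which makes it easy to tabulate them before filtering. The following is a minimal sketch on toy data; the `user_id` and `status` columns and all values are hypothetical, and only one coded field is compared here for brevity.

```r
library(dplyr)

# Toy data: user_id and all values are hypothetical
toy <- data.frame(
  user_id = 1:5,
  c1_organization = c("media", "party", "ngo", NA, NA),
  c2_organization = c("media", "union", NA, "ngo", NA),
  stringsAsFactors = FALSE
)

# Label each record by coding status; conditions are checked in order
toy_status <- toy %>%
  mutate(status = case_when(
    is.na(c1_organization) & is.na(c2_organization) ~ "uncoded",
    is.na(c2_organization) ~ "coder 1 only",
    is.na(c1_organization) ~ "coder 2 only",
    c1_organization == c2_organization ~ "agreement",
    TRUE ~ "disagreement"
  ))

# Tabulate the groups, then keep everything except disagreements and uncoded records
status_counts <- count(toy_status, status)
kept <- filter(toy_status, status %in% c("agreement", "coder 1 only", "coder 2 only"))

print(status_counts)
```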

#to eliminate records where both coders disagree, we first
#isolate those records identically coded by both coders
#note: == returns NA when either value is missing, and filter() drops those rows
df3 <- raw %>%
  filter(c1_organization == c2_organization & c1_user_category == c2_user_category &
           c1_ideology == c2_ideology & c1_cod_country == c2_cod_country &
           c1_country == c2_country & c1_official == c2_official &
           c1_verified == c2_verified) %>% 
  rename(all_of(lookup_1)) %>%
  select(-c(c2_organization,
            c2_user_category,
            c2_ideology,
            c2_cod_country,
            c2_country,
            c2_official,
            c2_verified))
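
One point worth checking in the agreement filter is how missing values behave. In R, `NA == NA` evaluates to `NA`, and `filter()` drops rows whose condition is `NA`, so a record where both coders left the same field empty is not counted as agreement. The sketch below illustrates this on toy data. The helper `same_or_both_na()` is hypothetical, and whether two missing values should count as agreement is an analytical choice not stated in this file.

```r
library(dplyr)

# Toy data: user_id and all values are hypothetical
toy <- data.frame(
  user_id = 1:3,
  c1_ideology = c("left", NA, "right"),
  c2_ideology = c("left", NA, "left"),
  stringsAsFactors = FALSE
)

# Plain equality: row 2 yields NA, which filter() treats as "drop"
strict <- toy %>% filter(c1_ideology == c2_ideology)

# NA-aware equality (hypothetical helper): two missing values count as agreement
same_or_both_na <- function(a, b) {
  (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
}
lenient <- toy %>% filter(same_or_both_na(c1_ideology, c2_ideology))

print(strict$user_id)
print(lenient$user_id)
```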
  

#select those records coded exclusively by coder 1
df4 <- raw %>%
  filter(!is.na(c1_organization) & is.na(c2_organization))  %>%
  rename(all_of(lookup_1)) %>%
  select(-c(c2_organization,
            c2_user_category,
            c2_ideology,
            c2_cod_country,
            c2_country,
            c2_official,
            c2_verified))

#select those records coded exclusively by coder 2
df5 <- raw %>%
  filter(is.na(c1_organization) & !is.na(c2_organization))  %>%
  rename(all_of(lookup_2)) %>%
  select(-c(c1_organization,
            c1_user_category,
            c1_ideology,
            c1_cod_country,
            c1_country,
            c1_official,
            c1_verified))

#isolate all records coded by both coders (nrow(df6) gives the count)
df6 <- raw %>%
  filter(!is.na(c1_organization) & !is.na(c2_organization))

#merge df3, df4 & df5 to get the final dataset
procedure_2 <- merge(df3, df4, all=TRUE)
procedure_2 <- merge(procedure_2, df5, all = TRUE)
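
Since the three subsets are disjoint and share the same column names after renaming, stacking them is an alternative to merging. The sketch below uses `bind_rows()` on toy data and adds two sanity checks. The data frames, the `user_id` column and the `source` labels are hypothetical, and `bind_rows()` requires compatible column types, which has not been checked against the real data.

```r
library(dplyr)

# Toy subsets standing in for the agreement, coder-1-only and coder-2-only groups
agree   <- data.frame(user_id = 1:2, f_organization = c("media", "party"), stringsAsFactors = FALSE)
only_c1 <- data.frame(user_id = 3L, f_organization = "ngo", stringsAsFactors = FALSE)
only_c2 <- data.frame(user_id = 4L, f_organization = "union", stringsAsFactors = FALSE)

# Stack the subsets; .id records which subset each row came from
combined <- bind_rows(
  agreement = agree,
  coder1_only = only_c1,
  coder2_only = only_c2,
  .id = "source"
)

# Sanity checks: no rows lost or gained, and no user appears twice
stopifnot(nrow(combined) == nrow(agree) + nrow(only_c1) + nrow(only_c2))
stopifnot(!any(duplicated(combined$user_id)))

print(combined)
```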

Procedure 3, taking codes of coder 1 from the verified users subgroup

As in Procedure 1, we analyze the subgroup of verified users, giving priority to the opinion of coder 1 over that of coder 2 when needed. In cases verified only by coder 2, we respect his opinion.

The initial figures are the following:

Total users verified by either coder 1 or coder 2: 2,776
Total users verified by both coders: 317, of which 311 showed 100% agreement
Users verified by coder 1 but not by coder 2: 2,318
Users verified by coder 2 but not by coder 1: 140
Final dataset records: 2,769 users
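
Counts of this kind can be obtained in one step with `summarise()`. The sketch below runs on toy data, with a hypothetical `user_id` column and hypothetical values, and assumes `c1_verified` and `c2_verified` are coded as 1, 0 or NA. It also shows that a record where one coder entered 0 and the other entered 1 is counted under "either" but falls into none of the three subgroups. Whether such records exist in the real data is not known from this file.

```r
library(dplyr)

# Toy data: user_id and all values are hypothetical
toy <- data.frame(
  user_id     = 1:5,
  c1_verified = c(1, 1, NA, 0, NA),
  c2_verified = c(1, NA, 1, 1, NA)
)

# Users marked as verified by at least one coder (rows with an NA condition are dropped)
ver <- toy %>% filter(c1_verified == 1 | c2_verified == 1)

# Count each subgroup; na.rm = TRUE ignores comparisons involving NA
group_counts <- ver %>%
  summarise(
    either  = n(),
    both    = sum(c1_verified == 1 & c2_verified == 1, na.rm = TRUE),
    only_c1 = sum(!is.na(c1_verified) & is.na(c2_verified)),
    only_c2 = sum(is.na(c1_verified) & !is.na(c2_verified))
  )

print(group_counts)
```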

#filtering verified users in raw
ver_or <- raw %>%
  filter(c1_verified == 1 | c2_verified == 1)

#identifying users verified by both coders
df7 <- ver_or %>%
  filter(c1_verified == 1 & c2_verified == 1)

#identifying 100% agreement
df8 <- ver_or %>%
  filter(c1_organization == c2_organization & c1_user_category == c2_user_category &
           c1_ideology == c2_ideology & c1_cod_country == c2_cod_country &
           c1_country == c2_country & c1_official == c2_official &
           c1_verified == c2_verified) %>%
  rename(all_of(lookup_1)) %>%
  select(-c(c2_organization,
            c2_user_category,
            c2_ideology,
            c2_cod_country,
            c2_country,
            c2_official,
            c2_verified))

#identifying users coded by coder 1 only
df9 <- ver_or %>%
  filter(!is.na(c1_verified) & is.na(c2_verified)) %>%
  rename(all_of(lookup_1)) %>%
  select(-c(c2_organization,
            c2_user_category,
            c2_ideology,
            c2_cod_country,
            c2_country,
            c2_official,
            c2_verified))


#identifying users coded by coder 2 only
df10 <- ver_or %>%
  filter(is.na(c1_verified) & !is.na(c2_verified)) %>%
  rename(all_of(lookup_2)) %>%
  select(-c(c1_organization,
            c1_user_category,
            c1_ideology,
            c1_cod_country,
            c1_country,
            c1_official,
            c1_verified))

#merging datasets of verified users with agreement
procedure_3 <- merge(df8, df9, all=TRUE)
procedure_3 <- merge(procedure_3, df10, all = TRUE)

Conclusion

In summary, we were able to validate users and filter them according to the output of the coding process. We judge that eliminating the records where the coders disagreed yields a more accurate result, so we decided to work with the Procedure 2 and 3 datasets.

For the first study about Argentina, we will base the analysis on the verified users group (Procedure 3), while the Procedure 2 dataset will serve as a reference for the total universe of Twitter users involved in the discussion.

#saving valid users from raw
save(procedure_2,file="~/Documents/R Studio files/Argentina Twitter Data/valid_raw_users.RData")

#saving valid users from verified
save(procedure_3,file="~/Documents/R Studio files/Argentina Twitter Data/valid_ver_users.RData")

These two files will be used to create the dataset to be included in the DNA software and to segment the dataset by year. These steps are needed to fully analyze the composition of the users and create the affiliation networks.