merging and converting Socrates data to the wide format

Author

Philipp Chapkovski, UBonn

The task overall

So we have two datasets:

The main data: where there is one row per each participant, with \(X\) number of variables (columns)
Annotation data: where there are more than one row per each participant:
- several coders may annotate the decisions/chats for each participants;
- one coder may annotate several participants.
- the annotation data set have \(M\) columns

Our ultimate goal:

is to create a dataset where each row would be a participant, and each coder will be presented as a set of columns in this participant-row, with a structure of ___.
that means that the resulting dataset will have \(X+M*max(coders)\) columns. That is: if the max number of coders that annotated a single person was 4 then the resulting dataset will have \(X\) columns for the original data, plus \(4M\) columns for each coder. For those participants who were annotated by less than 4 coders, the corresponding columns will be empty.
the variable names will look like: coder_1_field_1, coder_1_field_2,…. coder_4_field_1, coder_4_field_2 etc, where field_1, field_2 etc - are actual variable names from annotation data.

Let’s build a super small minimalistic datasets and check how it works on them before going to the real data:

loading libraries:

library(pacman)
p_load('readxl', 'tidyverse', 'glue', 'kableExtra')

here is our main data:

df<-tibble(participant_id=seq(1,4), name=c('Anna','Boris','Cecilia', 'David'))
df %>% kbl() %>% kable_classic_2

participant_id	name
1	Anna
2	Boris
3	Cecilia
4	David

Here is our annotation data:

coder_df <-
  tibble(
    participant_id = c(1, 1, 1, 2, 3, 3),
    coder_id = seq(1, 6),
    opinion = seq(10, 60, 10),
    another_opinion = seq(100, 600, 100)
  )

coder_df %>% kbl() %>% kable_classic_2

participant_id	coder_id	opinion	another_opinion
1	1	10	100
1	2	20	200
1	3	30	300
2	4	40	400
3	5	50	500
3	6	60	600

coder_df %>% group_by(participant_id) %>% tally %>% kbl() %>% kable_classic_2

participant_id	n
1	3
2	1
3	2

So each participant was coded by at most of three different coders: participant 1 was coded three times, participant 2 was coded once etc. It has also two variables/annotations provided by coders (in our example opinion, and another opinion)

What we need to do now is to first merge the coder_df with main data (df) and then pivot it wider in such a way that internal numbers of each coder (1,2,3…) will become parts of columns, as well as the variable names:

coder_df %>%
  left_join(df) %>%
  group_by(participant_id) %>%
  mutate(id = row_number()) %>%
  ungroup %>%
  pivot_wider(
    id_cols = c(participant_id, name),
    names_from = id,
    values_from = c(opinion, another_opinion),
    names_glue = 'coder_{id}_{.value}',
    names_vary = 'slowest',
    names_sort = T
  ) ->
  result

Joining with `by = join_by(participant_id)`

  result %>% names %>% kbl() %>% kable_classic_2

x
participant_id
name
coder_1_opinion
coder_1_another_opinion
coder_2_opinion
coder_2_another_opinion
coder_3_opinion
coder_3_another_opinion

A lot of things are going on here:

we group by participant id
then we create an internal counter for each participant to calculate how many times it is in there which is basically how many coders annotated this participant (and we get this counter in id variable)
then we pivot it wider using this internal counter as a source of columns, and we take all other variables from original annotation data for values.
we glue names in such a way that they look like coder_<internal_counter>_<field_name>
and we also sort them in such a way that they are grouped by coder/

So the resulting dataset looks like:

result %>% head %>% kbl() %>% kable_classic_2

participant_id	name	coder_1_opinion	coder_1_another_opinion	coder_2_opinion	coder_2_another_opinion	coder_3_opinion	coder_3_another_opinion
1	Anna	10	100	20	200	30	300
2	Boris	40	400	NA	NA	NA	NA
3	Cecilia	50	500	60	600	NA	NA

And finally we add non-coded records:

result %>% 
  bind_rows(df %>% filter(!participant_id %in% result$participant_id)) %>% 
  kbl() %>% 
  kable_classic_2

participant_id	name	coder_1_opinion	coder_1_another_opinion	coder_2_opinion	coder_2_another_opinion	coder_3_opinion	coder_3_another_opinion
1	Anna	10	100	20	200	30	300
2	Boris	40	400	NA	NA	NA	NA
3	Cecilia	50	500	60	600	NA	NA
4	David	NA	NA	NA	NA	NA	NA

Now with the real data

Main data:

df<-read_excel('all_data.xlsx')

Annotated data (we rename for readibility and to ease up the following merging): Since we are only interested in the data that was annotated by at least one coder, we filter out those participants whose codes do not exist in the annotated dataset

cleaned_coder_df <-
  read_excel('transcripts.xlsx') %>%
  rename(coder = `Your first name and last initial.`,
         participant.code = `Participant Code`) %>%
  filter(participant.code %in% df$participant.code)

New names:
• `Acknowledgment?` -> `Acknowledgment?...21`
• `Comments about acknowledgment? (Optional)` -> `Comments about
  acknowledgment? (Optional)...22`
• `Acknowledgment?` -> `Acknowledgment?...40`
• `Comments about acknowledgment? (Optional)` -> `Comments about
  acknowledgment? (Optional)...41`

Now we can merge, and check names:

names_to_wide <- cleaned_coder_df %>% select(-participant.code) %>% names
names_to_keep<-df %>% names 
merged_df<- cleaned_coder_df %>%
  filter(participant.code%in%df$participant.code) %>% 
  left_join(df) %>% 
  group_by(participant.code) %>% 
  mutate(id = row_number()) %>% 
  pivot_wider( 
              id_cols = all_of(names_to_keep) ,
              names_from=id, 
              values_from=all_of(names_to_wide), 
              names_glue='coder_{id}_{.value}',
              names_vary='slowest',
              names_sort=T)

Joining with `by = join_by(participant.code)`

unmatched<-df %>% filter(!participant.code%in%cleaned_coder_df$participant.code)

merged_df<-merged_df %>% bind_rows(unmatched)

New names:
• `coder_1_Acknowledgment?...21` -> `coder_1_Acknowledgment?...170`
• `coder_1_Comments about acknowledgment? (Optional)...22` -> `coder_1_Comments
  about acknowledgment? (Optional)...171`
• `coder_1_Acknowledgment?...40` -> `coder_1_Acknowledgment?...189`
• `coder_1_Comments about acknowledgment? (Optional)...41` -> `coder_1_Comments
  about acknowledgment? (Optional)...190`
• `coder_2_Acknowledgment?...21` -> `coder_2_Acknowledgment?...210`
• `coder_2_Comments about acknowledgment? (Optional)...22` -> `coder_2_Comments
  about acknowledgment? (Optional)...211`
• `coder_2_Acknowledgment?...40` -> `coder_2_Acknowledgment?...229`
• `coder_2_Comments about acknowledgment? (Optional)...41` -> `coder_2_Comments
  about acknowledgment? (Optional)...230`
• `coder_3_Acknowledgment?...21` -> `coder_3_Acknowledgment?...250`
• `coder_3_Comments about acknowledgment? (Optional)...22` -> `coder_3_Comments
  about acknowledgment? (Optional)...251`
• `coder_3_Acknowledgment?...40` -> `coder_3_Acknowledgment?...269`
• `coder_3_Comments about acknowledgment? (Optional)...41` -> `coder_3_Comments
  about acknowledgment? (Optional)...270`
• `coder_4_Acknowledgment?...21` -> `coder_4_Acknowledgment?...290`
• `coder_4_Comments about acknowledgment? (Optional)...22` -> `coder_4_Comments
  about acknowledgment? (Optional)...291`
• `coder_4_Acknowledgment?...40` -> `coder_4_Acknowledgment?...309`
• `coder_4_Comments about acknowledgment? (Optional)...41` -> `coder_4_Comments
  about acknowledgment? (Optional)...310`

merged_df%>%write_csv('merged_data.csv')

merged_df %>% names %>% kbl %>% kable_classic_2

x
participant.id_in_session
participant.code
participant.label
participant._is_bot
participant._index_in_pages
participant._max_page_index
participant._current_app_name
participant._current_page_name
participant.time_started
participant.visited
participant.mturk_worker_id
participant.mturk_assignment_id
participant.payoff
session.code
session.label
session.mturk_HITId
session.mturk_HITGroupId
session.comment
session.is_demo
session.config.blocking_attempts
session.config.seconds_allow_exit
session.config.time_to_proceed
session.config.fee_for_correct
session.config.for_prolific
session.config.prolific_redirect_url
session.config.time_bonus
session.config.vignette
session.config.msg_till_allowed_exit
session.config.seconds_forced_exit
session.config.msg_forced_exit
session.config.time_to_start
session.config.sec_waiting_too_long
session.config.blocking
session.config.chat
session.config.participation_fee
session.config.study_length_min
session.config.real_world_currency_per_point
session.config.param_name
starter.1.player.id_in_group
starter.1.player.role
starter.1.player.payoff
starter.1.player.arrival_time
starter.1.player.time_to_pass
starter.1.player.diff_to_pass
starter.1.player.worker_id
starter.1.player.hit_id
starter.1.player.assignment_id
starter.1.player.on_time
starter.1.player.consent
starter.1.group.id_in_subsession
starter.1.subsession.round_number
starter.1.subsession.time_to_start
starter.1.subsession.time_bonus
starter.1.subsession.time_to_proceed
starter.1.subsession.delta
starter.1.subsession.formatted_delta
starter.1.subsession.study_length
first.1.player.id_in_group
first.1.player.role
first.1.player.payoff
first.1.player.order
first.1.player.answer
first.1.player.confidence
first.1.player.timezone
first.1.player.time_on_first_opinion
first.1.player.cq_err_counter
first.1.player.cq_1_err_counter
first.1.player.cq_2_err_counter
first.1.player.cq_3_err_counter
first.1.player.cq_4_err_counter
first.1.player.cq_5_err_counter
first.1.player.cq_6_err_counter
first.1.group.id_in_subsession
first.1.subsession.round_number
first.1.subsession.body
first.1.subsession.question
first.1.subsession.yes_option
first.1.subsession.no_option
first.1.subsession.correct
second.1.player.id_in_group
second.1.player.role
second.1.player.payoff
second.1.player.order
second.1.player.answer
second.1.player.confidence
second.1.player.wp_entrance_time
second.1.player.wp_exit_time
second.1.player.wp_waiting_time
second.1.player.matched
second.1.player.essay
second.1.player.time_on_comprehension_check
second.1.player.time_on_discussion
second.1.player.time_on_essay
second.1.player.time_on_second_opinion
second.1.player.treatment
second.1.group.id_in_subsession
second.1.group.time_allow_exit
second.1.group.time_forced_exit
second.1.group.chat_status
second.1.subsession.round_number
second.1.subsession.body
second.1.subsession.question
second.1.subsession.yes_option
second.1.subsession.no_option
second.1.subsession.vignette_id
second.1.subsession.correct
second.1.subsession.seconds_allow_exit
second.1.subsession.msg_till_allowed_exit
second.1.subsession.seconds_forced_exit
second.1.subsession.msg_forced_exit
second.1.subsession.fee_for_correct
questionnaire.1.player.id_in_group
questionnaire.1.player.role
questionnaire.1.player.payoff
questionnaire.1.player.age
questionnaire.1.player.sex
questionnaire.1.player.race
questionnaire.1.player.education
questionnaire.1.player.ses
questionnaire.1.player.phil_background
questionnaire.1.player.stats_background
questionnaire.1.player.stem_background
questionnaire.1.player.critical_background
questionnaire.1.player.country_born
questionnaire.1.player.country_life
questionnaire.1.player.survey_experience
questionnaire.1.player.other_comments
questionnaire.1.group.id_in_subsession
questionnaire.1.subsession.round_number
last.1.player.id_in_group
last.1.player.role
last.1.player.payoff
last.1.player.vars_dump
last.1.group.id_in_subsession
last.1.subsession.round_number
prol.1.player.id_in_group
prol.1.player.role
prol.1.player.payoff
prol.1.player.order
prol.1.player.answer
prol.1.player.confidence
prol.1.player.timezone
prol.1.player.time_on_first_opinion
prol.1.group.id_in_subsession
prol.1.subsession.round_number
prol.1.subsession.body
prol.1.subsession.question
prol.1.subsession.yes_option
prol.1.subsession.no_option
prol.1.subsession.correct
coder_1_Timestamp
coder_1_coder
coder_1_Session
coder_1_Group ID
coder_1_Compliance?
coder_1_Scheming?
coder_1_Comment about scheming. (Optional)
coder_1_Ambivalence?
coder_1_Comments about ambivalence. (Optional)
coder_1_Rudeness?
coder_1_Comments about rudeness? (Optional)
coder_1_Reasons for their initial response?
coder_1_Comments about "Reasons for their initial response"? (Optional)
coder_1_Counterexample?
coder_1_Comments about counterexample? (Optional)
coder_1_Reconsideration?
coder_1_Comments about reconsideration? (Optional)
coder_1_Reasons against their own initial response?
coder_1_Comments about "Reasons against their own initial response"? (Optional)
coder_1_Acknowledgment?...170
coder_1_Comments about acknowledgment? (Optional)...171
coder_1_Reasons for the other initial response?
coder_1_Comments about "Reasons the other initial response"? (Optional)
coder_1_Misunderstanding of stimulus or question?
coder_1_Comments about "Misunderstanding of stimulus or question"? (Optional)
coder_1_Did they give up trying to persuade their partner (i.e., agree to disagree)?
coder_1_How difficult was it to understand their reasoning?
coder_1_Comments on difficulty understanding their reasoning. (Optional)
coder_1_Understanding of partner's reasoning?
coder_1_Comments about "Understanding of partner's reasoning?" (Optional)
coder_1_General comments or notes? (Optional)
coder_1_Transactivity
coder_1_Comments about transactivity (Optional)
coder_1_Comments on "Did they give up trying to persuade their partner"? (Optional)
coder_1_Is this a discussion or an (individual) essay?
coder_1_Did they prematurely give up on trying to figure out the best response?
coder_1_Familiarity?
coder_1_Relevant?
coder_1_Acknowledgment?...189
coder_1_Comments about acknowledgment? (Optional)...190
coder_2_Timestamp
coder_2_coder
coder_2_Session
coder_2_Group ID
coder_2_Compliance?
coder_2_Scheming?
coder_2_Comment about scheming. (Optional)
coder_2_Ambivalence?
coder_2_Comments about ambivalence. (Optional)
coder_2_Rudeness?
coder_2_Comments about rudeness? (Optional)
coder_2_Reasons for their initial response?
coder_2_Comments about "Reasons for their initial response"? (Optional)
coder_2_Counterexample?
coder_2_Comments about counterexample? (Optional)
coder_2_Reconsideration?
coder_2_Comments about reconsideration? (Optional)
coder_2_Reasons against their own initial response?
coder_2_Comments about "Reasons against their own initial response"? (Optional)
coder_2_Acknowledgment?...210
coder_2_Comments about acknowledgment? (Optional)...211
coder_2_Reasons for the other initial response?
coder_2_Comments about "Reasons the other initial response"? (Optional)
coder_2_Misunderstanding of stimulus or question?
coder_2_Comments about "Misunderstanding of stimulus or question"? (Optional)
coder_2_Did they give up trying to persuade their partner (i.e., agree to disagree)?
coder_2_How difficult was it to understand their reasoning?
coder_2_Comments on difficulty understanding their reasoning. (Optional)
coder_2_Understanding of partner's reasoning?
coder_2_Comments about "Understanding of partner's reasoning?" (Optional)
coder_2_General comments or notes? (Optional)
coder_2_Transactivity
coder_2_Comments about transactivity (Optional)
coder_2_Comments on "Did they give up trying to persuade their partner"? (Optional)
coder_2_Is this a discussion or an (individual) essay?
coder_2_Did they prematurely give up on trying to figure out the best response?
coder_2_Familiarity?
coder_2_Relevant?
coder_2_Acknowledgment?...229
coder_2_Comments about acknowledgment? (Optional)...230
coder_3_Timestamp
coder_3_coder
coder_3_Session
coder_3_Group ID
coder_3_Compliance?
coder_3_Scheming?
coder_3_Comment about scheming. (Optional)
coder_3_Ambivalence?
coder_3_Comments about ambivalence. (Optional)
coder_3_Rudeness?
coder_3_Comments about rudeness? (Optional)
coder_3_Reasons for their initial response?
coder_3_Comments about "Reasons for their initial response"? (Optional)
coder_3_Counterexample?
coder_3_Comments about counterexample? (Optional)
coder_3_Reconsideration?
coder_3_Comments about reconsideration? (Optional)
coder_3_Reasons against their own initial response?
coder_3_Comments about "Reasons against their own initial response"? (Optional)
coder_3_Acknowledgment?...250
coder_3_Comments about acknowledgment? (Optional)...251
coder_3_Reasons for the other initial response?
coder_3_Comments about "Reasons the other initial response"? (Optional)
coder_3_Misunderstanding of stimulus or question?
coder_3_Comments about "Misunderstanding of stimulus or question"? (Optional)
coder_3_Did they give up trying to persuade their partner (i.e., agree to disagree)?
coder_3_How difficult was it to understand their reasoning?
coder_3_Comments on difficulty understanding their reasoning. (Optional)
coder_3_Understanding of partner's reasoning?
coder_3_Comments about "Understanding of partner's reasoning?" (Optional)
coder_3_General comments or notes? (Optional)
coder_3_Transactivity
coder_3_Comments about transactivity (Optional)
coder_3_Comments on "Did they give up trying to persuade their partner"? (Optional)
coder_3_Is this a discussion or an (individual) essay?
coder_3_Did they prematurely give up on trying to figure out the best response?
coder_3_Familiarity?
coder_3_Relevant?
coder_3_Acknowledgment?...269
coder_3_Comments about acknowledgment? (Optional)...270
coder_4_Timestamp
coder_4_coder
coder_4_Session
coder_4_Group ID
coder_4_Compliance?
coder_4_Scheming?
coder_4_Comment about scheming. (Optional)
coder_4_Ambivalence?
coder_4_Comments about ambivalence. (Optional)
coder_4_Rudeness?
coder_4_Comments about rudeness? (Optional)
coder_4_Reasons for their initial response?
coder_4_Comments about "Reasons for their initial response"? (Optional)
coder_4_Counterexample?
coder_4_Comments about counterexample? (Optional)
coder_4_Reconsideration?
coder_4_Comments about reconsideration? (Optional)
coder_4_Reasons against their own initial response?
coder_4_Comments about "Reasons against their own initial response"? (Optional)
coder_4_Acknowledgment?...290
coder_4_Comments about acknowledgment? (Optional)...291
coder_4_Reasons for the other initial response?
coder_4_Comments about "Reasons the other initial response"? (Optional)
coder_4_Misunderstanding of stimulus or question?
coder_4_Comments about "Misunderstanding of stimulus or question"? (Optional)
coder_4_Did they give up trying to persuade their partner (i.e., agree to disagree)?
coder_4_How difficult was it to understand their reasoning?
coder_4_Comments on difficulty understanding their reasoning. (Optional)
coder_4_Understanding of partner's reasoning?
coder_4_Comments about "Understanding of partner's reasoning?" (Optional)
coder_4_General comments or notes? (Optional)
coder_4_Transactivity
coder_4_Comments about transactivity (Optional)
coder_4_Comments on "Did they give up trying to persuade their partner"? (Optional)
coder_4_Is this a discussion or an (individual) essay?
coder_4_Did they prematurely give up on trying to figure out the best response?
coder_4_Familiarity?
coder_4_Relevant?
coder_4_Acknowledgment?...309
coder_4_Comments about acknowledgment? (Optional)...310