merging and converting Socrates data to the wide format
The task overall
So we have two datasets:
The main data: where there is one row per each participant, with \(X\) number of variables (columns)
Annotation data: where there are more than one row per each participant:
- several coders may annotate the decisions/chats for each participants;
- one coder may annotate several participants.
- the annotation data set have \(M\) columns
Our ultimate goal:
is to create a dataset where each row would be a participant, and each coder will be presented as a set of columns in this participant-row, with a structure of
___ . that means that the resulting dataset will have \(X+M*max(coders)\) columns. That is: if the max number of coders that annotated a single person was 4 then the resulting dataset will have \(X\) columns for the original data, plus \(4M\) columns for each coder. For those participants who were annotated by less than 4 coders, the corresponding columns will be empty.
the variable names will look like: coder_1_field_1, coder_1_field_2,…. coder_4_field_1, coder_4_field_2 etc, where field_1, field_2 etc - are actual variable names from annotation data.
Let’s build a super small minimalistic datasets and check how it works on them before going to the real data:
loading libraries:
here is our main data:
df<-tibble(participant_id=seq(1,4), name=c('Anna','Boris','Cecilia', 'David'))
df %>% kbl() %>% kable_classic_2| participant_id | name |
|---|---|
| 1 | Anna |
| 2 | Boris |
| 3 | Cecilia |
| 4 | David |
Here is our annotation data:
coder_df <-
tibble(
participant_id = c(1, 1, 1, 2, 3, 3),
coder_id = seq(1, 6),
opinion = seq(10, 60, 10),
another_opinion = seq(100, 600, 100)
)
coder_df %>% kbl() %>% kable_classic_2| participant_id | coder_id | opinion | another_opinion |
|---|---|---|---|
| 1 | 1 | 10 | 100 |
| 1 | 2 | 20 | 200 |
| 1 | 3 | 30 | 300 |
| 2 | 4 | 40 | 400 |
| 3 | 5 | 50 | 500 |
| 3 | 6 | 60 | 600 |
| participant_id | n |
|---|---|
| 1 | 3 |
| 2 | 1 |
| 3 | 2 |
So each participant was coded by at most of three different coders: participant 1 was coded three times, participant 2 was coded once etc. It has also two variables/annotations provided by coders (in our example opinion, and another opinion)
What we need to do now is to first merge the coder_df with main data (df) and then pivot it wider in such a way that internal numbers of each coder (1,2,3…) will become parts of columns, as well as the variable names:
coder_df %>%
left_join(df) %>%
group_by(participant_id) %>%
mutate(id = row_number()) %>%
ungroup %>%
pivot_wider(
id_cols = c(participant_id, name),
names_from = id,
values_from = c(opinion, another_opinion),
names_glue = 'coder_{id}_{.value}',
names_vary = 'slowest',
names_sort = T
) ->
resultJoining with `by = join_by(participant_id)`
| x |
|---|
| participant_id |
| name |
| coder_1_opinion |
| coder_1_another_opinion |
| coder_2_opinion |
| coder_2_another_opinion |
| coder_3_opinion |
| coder_3_another_opinion |
A lot of things are going on here:
- we group by participant id
- then we create an internal counter for each participant to calculate how many times it is in there which is basically how many coders annotated this participant (and we get this counter in
idvariable) - then we pivot it wider using this internal counter as a source of columns, and we take all other variables from original annotation data for values.
- we glue names in such a way that they look like
coder_<internal_counter>_<field_name> - and we also sort them in such a way that they are grouped by coder/
So the resulting dataset looks like:
| participant_id | name | coder_1_opinion | coder_1_another_opinion | coder_2_opinion | coder_2_another_opinion | coder_3_opinion | coder_3_another_opinion |
|---|---|---|---|---|---|---|---|
| 1 | Anna | 10 | 100 | 20 | 200 | 30 | 300 |
| 2 | Boris | 40 | 400 | NA | NA | NA | NA |
| 3 | Cecilia | 50 | 500 | 60 | 600 | NA | NA |
And finally we add non-coded records:
result %>%
bind_rows(df %>% filter(!participant_id %in% result$participant_id)) %>%
kbl() %>%
kable_classic_2| participant_id | name | coder_1_opinion | coder_1_another_opinion | coder_2_opinion | coder_2_another_opinion | coder_3_opinion | coder_3_another_opinion |
|---|---|---|---|---|---|---|---|
| 1 | Anna | 10 | 100 | 20 | 200 | 30 | 300 |
| 2 | Boris | 40 | 400 | NA | NA | NA | NA |
| 3 | Cecilia | 50 | 500 | 60 | 600 | NA | NA |
| 4 | David | NA | NA | NA | NA | NA | NA |
Now with the real data
Main data:
Annotated data (we rename for readibility and to ease up the following merging): Since we are only interested in the data that was annotated by at least one coder, we filter out those participants whose codes do not exist in the annotated dataset
cleaned_coder_df <-
read_excel('transcripts.xlsx') %>%
rename(coder = `Your first name and last initial.`,
participant.code = `Participant Code`) %>%
filter(participant.code %in% df$participant.code)New names:
• `Acknowledgment?` -> `Acknowledgment?...21`
• `Comments about acknowledgment? (Optional)` -> `Comments about
acknowledgment? (Optional)...22`
• `Acknowledgment?` -> `Acknowledgment?...40`
• `Comments about acknowledgment? (Optional)` -> `Comments about
acknowledgment? (Optional)...41`
Now we can merge, and check names:
names_to_wide <- cleaned_coder_df %>% select(-participant.code) %>% names
names_to_keep<-df %>% names
merged_df<- cleaned_coder_df %>%
filter(participant.code%in%df$participant.code) %>%
left_join(df) %>%
group_by(participant.code) %>%
mutate(id = row_number()) %>%
pivot_wider(
id_cols = all_of(names_to_keep) ,
names_from=id,
values_from=all_of(names_to_wide),
names_glue='coder_{id}_{.value}',
names_vary='slowest',
names_sort=T) Joining with `by = join_by(participant.code)`
unmatched<-df %>% filter(!participant.code%in%cleaned_coder_df$participant.code)
merged_df<-merged_df %>% bind_rows(unmatched)New names:
• `coder_1_Acknowledgment?...21` -> `coder_1_Acknowledgment?...170`
• `coder_1_Comments about acknowledgment? (Optional)...22` -> `coder_1_Comments
about acknowledgment? (Optional)...171`
• `coder_1_Acknowledgment?...40` -> `coder_1_Acknowledgment?...189`
• `coder_1_Comments about acknowledgment? (Optional)...41` -> `coder_1_Comments
about acknowledgment? (Optional)...190`
• `coder_2_Acknowledgment?...21` -> `coder_2_Acknowledgment?...210`
• `coder_2_Comments about acknowledgment? (Optional)...22` -> `coder_2_Comments
about acknowledgment? (Optional)...211`
• `coder_2_Acknowledgment?...40` -> `coder_2_Acknowledgment?...229`
• `coder_2_Comments about acknowledgment? (Optional)...41` -> `coder_2_Comments
about acknowledgment? (Optional)...230`
• `coder_3_Acknowledgment?...21` -> `coder_3_Acknowledgment?...250`
• `coder_3_Comments about acknowledgment? (Optional)...22` -> `coder_3_Comments
about acknowledgment? (Optional)...251`
• `coder_3_Acknowledgment?...40` -> `coder_3_Acknowledgment?...269`
• `coder_3_Comments about acknowledgment? (Optional)...41` -> `coder_3_Comments
about acknowledgment? (Optional)...270`
• `coder_4_Acknowledgment?...21` -> `coder_4_Acknowledgment?...290`
• `coder_4_Comments about acknowledgment? (Optional)...22` -> `coder_4_Comments
about acknowledgment? (Optional)...291`
• `coder_4_Acknowledgment?...40` -> `coder_4_Acknowledgment?...309`
• `coder_4_Comments about acknowledgment? (Optional)...41` -> `coder_4_Comments
about acknowledgment? (Optional)...310`
| x |
|---|
| participant.id_in_session |
| participant.code |
| participant.label |
| participant._is_bot |
| participant._index_in_pages |
| participant._max_page_index |
| participant._current_app_name |
| participant._current_page_name |
| participant.time_started |
| participant.visited |
| participant.mturk_worker_id |
| participant.mturk_assignment_id |
| participant.payoff |
| session.code |
| session.label |
| session.mturk_HITId |
| session.mturk_HITGroupId |
| session.comment |
| session.is_demo |
| session.config.blocking_attempts |
| session.config.seconds_allow_exit |
| session.config.time_to_proceed |
| session.config.fee_for_correct |
| session.config.for_prolific |
| session.config.prolific_redirect_url |
| session.config.time_bonus |
| session.config.vignette |
| session.config.msg_till_allowed_exit |
| session.config.seconds_forced_exit |
| session.config.msg_forced_exit |
| session.config.time_to_start |
| session.config.sec_waiting_too_long |
| session.config.blocking |
| session.config.chat |
| session.config.participation_fee |
| session.config.study_length_min |
| session.config.real_world_currency_per_point |
| session.config.param_name |
| starter.1.player.id_in_group |
| starter.1.player.role |
| starter.1.player.payoff |
| starter.1.player.arrival_time |
| starter.1.player.time_to_pass |
| starter.1.player.diff_to_pass |
| starter.1.player.worker_id |
| starter.1.player.hit_id |
| starter.1.player.assignment_id |
| starter.1.player.on_time |
| starter.1.player.consent |
| starter.1.group.id_in_subsession |
| starter.1.subsession.round_number |
| starter.1.subsession.time_to_start |
| starter.1.subsession.time_bonus |
| starter.1.subsession.time_to_proceed |
| starter.1.subsession.delta |
| starter.1.subsession.formatted_delta |
| starter.1.subsession.study_length |
| first.1.player.id_in_group |
| first.1.player.role |
| first.1.player.payoff |
| first.1.player.order |
| first.1.player.answer |
| first.1.player.confidence |
| first.1.player.timezone |
| first.1.player.time_on_first_opinion |
| first.1.player.cq_err_counter |
| first.1.player.cq_1_err_counter |
| first.1.player.cq_2_err_counter |
| first.1.player.cq_3_err_counter |
| first.1.player.cq_4_err_counter |
| first.1.player.cq_5_err_counter |
| first.1.player.cq_6_err_counter |
| first.1.group.id_in_subsession |
| first.1.subsession.round_number |
| first.1.subsession.body |
| first.1.subsession.question |
| first.1.subsession.yes_option |
| first.1.subsession.no_option |
| first.1.subsession.correct |
| second.1.player.id_in_group |
| second.1.player.role |
| second.1.player.payoff |
| second.1.player.order |
| second.1.player.answer |
| second.1.player.confidence |
| second.1.player.wp_entrance_time |
| second.1.player.wp_exit_time |
| second.1.player.wp_waiting_time |
| second.1.player.matched |
| second.1.player.essay |
| second.1.player.time_on_comprehension_check |
| second.1.player.time_on_discussion |
| second.1.player.time_on_essay |
| second.1.player.time_on_second_opinion |
| second.1.player.treatment |
| second.1.group.id_in_subsession |
| second.1.group.time_allow_exit |
| second.1.group.time_forced_exit |
| second.1.group.chat_status |
| second.1.subsession.round_number |
| second.1.subsession.body |
| second.1.subsession.question |
| second.1.subsession.yes_option |
| second.1.subsession.no_option |
| second.1.subsession.vignette_id |
| second.1.subsession.correct |
| second.1.subsession.seconds_allow_exit |
| second.1.subsession.msg_till_allowed_exit |
| second.1.subsession.seconds_forced_exit |
| second.1.subsession.msg_forced_exit |
| second.1.subsession.fee_for_correct |
| questionnaire.1.player.id_in_group |
| questionnaire.1.player.role |
| questionnaire.1.player.payoff |
| questionnaire.1.player.age |
| questionnaire.1.player.sex |
| questionnaire.1.player.race |
| questionnaire.1.player.education |
| questionnaire.1.player.ses |
| questionnaire.1.player.phil_background |
| questionnaire.1.player.stats_background |
| questionnaire.1.player.stem_background |
| questionnaire.1.player.critical_background |
| questionnaire.1.player.country_born |
| questionnaire.1.player.country_life |
| questionnaire.1.player.survey_experience |
| questionnaire.1.player.other_comments |
| questionnaire.1.group.id_in_subsession |
| questionnaire.1.subsession.round_number |
| last.1.player.id_in_group |
| last.1.player.role |
| last.1.player.payoff |
| last.1.player.vars_dump |
| last.1.group.id_in_subsession |
| last.1.subsession.round_number |
| prol.1.player.id_in_group |
| prol.1.player.role |
| prol.1.player.payoff |
| prol.1.player.order |
| prol.1.player.answer |
| prol.1.player.confidence |
| prol.1.player.timezone |
| prol.1.player.time_on_first_opinion |
| prol.1.group.id_in_subsession |
| prol.1.subsession.round_number |
| prol.1.subsession.body |
| prol.1.subsession.question |
| prol.1.subsession.yes_option |
| prol.1.subsession.no_option |
| prol.1.subsession.correct |
| coder_1_Timestamp |
| coder_1_coder |
| coder_1_Session |
| coder_1_Group ID |
| coder_1_Compliance? |
| coder_1_Scheming? |
| coder_1_Comment about scheming. (Optional) |
| coder_1_Ambivalence? |
| coder_1_Comments about ambivalence. (Optional) |
| coder_1_Rudeness? |
| coder_1_Comments about rudeness? (Optional) |
| coder_1_Reasons for their initial response? |
| coder_1_Comments about "Reasons for their initial response"? (Optional) |
| coder_1_Counterexample? |
| coder_1_Comments about counterexample? (Optional) |
| coder_1_Reconsideration? |
| coder_1_Comments about reconsideration? (Optional) |
| coder_1_Reasons against their own initial response? |
| coder_1_Comments about "Reasons against their own initial response"? (Optional) |
| coder_1_Acknowledgment?...170 |
| coder_1_Comments about acknowledgment? (Optional)...171 |
| coder_1_Reasons for the other initial response? |
| coder_1_Comments about "Reasons the other initial response"? (Optional) |
| coder_1_Misunderstanding of stimulus or question? |
| coder_1_Comments about "Misunderstanding of stimulus or question"? (Optional) |
| coder_1_Did they give up trying to persuade their partner (i.e., agree to disagree)? |
| coder_1_How difficult was it to understand their reasoning? |
| coder_1_Comments on difficulty understanding their reasoning. (Optional) |
| coder_1_Understanding of partner's reasoning? |
| coder_1_Comments about "Understanding of partner's reasoning?" (Optional) |
| coder_1_General comments or notes? (Optional) |
| coder_1_Transactivity |
| coder_1_Comments about transactivity (Optional) |
| coder_1_Comments on "Did they give up trying to persuade their partner"? (Optional) |
| coder_1_Is this a discussion or an (individual) essay? |
| coder_1_Did they prematurely give up on trying to figure out the best response? |
| coder_1_Familiarity? |
| coder_1_Relevant? |
| coder_1_Acknowledgment?...189 |
| coder_1_Comments about acknowledgment? (Optional)...190 |
| coder_2_Timestamp |
| coder_2_coder |
| coder_2_Session |
| coder_2_Group ID |
| coder_2_Compliance? |
| coder_2_Scheming? |
| coder_2_Comment about scheming. (Optional) |
| coder_2_Ambivalence? |
| coder_2_Comments about ambivalence. (Optional) |
| coder_2_Rudeness? |
| coder_2_Comments about rudeness? (Optional) |
| coder_2_Reasons for their initial response? |
| coder_2_Comments about "Reasons for their initial response"? (Optional) |
| coder_2_Counterexample? |
| coder_2_Comments about counterexample? (Optional) |
| coder_2_Reconsideration? |
| coder_2_Comments about reconsideration? (Optional) |
| coder_2_Reasons against their own initial response? |
| coder_2_Comments about "Reasons against their own initial response"? (Optional) |
| coder_2_Acknowledgment?...210 |
| coder_2_Comments about acknowledgment? (Optional)...211 |
| coder_2_Reasons for the other initial response? |
| coder_2_Comments about "Reasons the other initial response"? (Optional) |
| coder_2_Misunderstanding of stimulus or question? |
| coder_2_Comments about "Misunderstanding of stimulus or question"? (Optional) |
| coder_2_Did they give up trying to persuade their partner (i.e., agree to disagree)? |
| coder_2_How difficult was it to understand their reasoning? |
| coder_2_Comments on difficulty understanding their reasoning. (Optional) |
| coder_2_Understanding of partner's reasoning? |
| coder_2_Comments about "Understanding of partner's reasoning?" (Optional) |
| coder_2_General comments or notes? (Optional) |
| coder_2_Transactivity |
| coder_2_Comments about transactivity (Optional) |
| coder_2_Comments on "Did they give up trying to persuade their partner"? (Optional) |
| coder_2_Is this a discussion or an (individual) essay? |
| coder_2_Did they prematurely give up on trying to figure out the best response? |
| coder_2_Familiarity? |
| coder_2_Relevant? |
| coder_2_Acknowledgment?...229 |
| coder_2_Comments about acknowledgment? (Optional)...230 |
| coder_3_Timestamp |
| coder_3_coder |
| coder_3_Session |
| coder_3_Group ID |
| coder_3_Compliance? |
| coder_3_Scheming? |
| coder_3_Comment about scheming. (Optional) |
| coder_3_Ambivalence? |
| coder_3_Comments about ambivalence. (Optional) |
| coder_3_Rudeness? |
| coder_3_Comments about rudeness? (Optional) |
| coder_3_Reasons for their initial response? |
| coder_3_Comments about "Reasons for their initial response"? (Optional) |
| coder_3_Counterexample? |
| coder_3_Comments about counterexample? (Optional) |
| coder_3_Reconsideration? |
| coder_3_Comments about reconsideration? (Optional) |
| coder_3_Reasons against their own initial response? |
| coder_3_Comments about "Reasons against their own initial response"? (Optional) |
| coder_3_Acknowledgment?...250 |
| coder_3_Comments about acknowledgment? (Optional)...251 |
| coder_3_Reasons for the other initial response? |
| coder_3_Comments about "Reasons the other initial response"? (Optional) |
| coder_3_Misunderstanding of stimulus or question? |
| coder_3_Comments about "Misunderstanding of stimulus or question"? (Optional) |
| coder_3_Did they give up trying to persuade their partner (i.e., agree to disagree)? |
| coder_3_How difficult was it to understand their reasoning? |
| coder_3_Comments on difficulty understanding their reasoning. (Optional) |
| coder_3_Understanding of partner's reasoning? |
| coder_3_Comments about "Understanding of partner's reasoning?" (Optional) |
| coder_3_General comments or notes? (Optional) |
| coder_3_Transactivity |
| coder_3_Comments about transactivity (Optional) |
| coder_3_Comments on "Did they give up trying to persuade their partner"? (Optional) |
| coder_3_Is this a discussion or an (individual) essay? |
| coder_3_Did they prematurely give up on trying to figure out the best response? |
| coder_3_Familiarity? |
| coder_3_Relevant? |
| coder_3_Acknowledgment?...269 |
| coder_3_Comments about acknowledgment? (Optional)...270 |
| coder_4_Timestamp |
| coder_4_coder |
| coder_4_Session |
| coder_4_Group ID |
| coder_4_Compliance? |
| coder_4_Scheming? |
| coder_4_Comment about scheming. (Optional) |
| coder_4_Ambivalence? |
| coder_4_Comments about ambivalence. (Optional) |
| coder_4_Rudeness? |
| coder_4_Comments about rudeness? (Optional) |
| coder_4_Reasons for their initial response? |
| coder_4_Comments about "Reasons for their initial response"? (Optional) |
| coder_4_Counterexample? |
| coder_4_Comments about counterexample? (Optional) |
| coder_4_Reconsideration? |
| coder_4_Comments about reconsideration? (Optional) |
| coder_4_Reasons against their own initial response? |
| coder_4_Comments about "Reasons against their own initial response"? (Optional) |
| coder_4_Acknowledgment?...290 |
| coder_4_Comments about acknowledgment? (Optional)...291 |
| coder_4_Reasons for the other initial response? |
| coder_4_Comments about "Reasons the other initial response"? (Optional) |
| coder_4_Misunderstanding of stimulus or question? |
| coder_4_Comments about "Misunderstanding of stimulus or question"? (Optional) |
| coder_4_Did they give up trying to persuade their partner (i.e., agree to disagree)? |
| coder_4_How difficult was it to understand their reasoning? |
| coder_4_Comments on difficulty understanding their reasoning. (Optional) |
| coder_4_Understanding of partner's reasoning? |
| coder_4_Comments about "Understanding of partner's reasoning?" (Optional) |
| coder_4_General comments or notes? (Optional) |
| coder_4_Transactivity |
| coder_4_Comments about transactivity (Optional) |
| coder_4_Comments on "Did they give up trying to persuade their partner"? (Optional) |
| coder_4_Is this a discussion or an (individual) essay? |
| coder_4_Did they prematurely give up on trying to figure out the best response? |
| coder_4_Familiarity? |
| coder_4_Relevant? |
| coder_4_Acknowledgment?...309 |
| coder_4_Comments about acknowledgment? (Optional)...310 |