library(tidyverse)
library(jsonlite)
library(stringr)
library(scales)
library(knitr)
library(tidytext)

One-on-one tutoring is widely regarded as an effective pedagogical intervention for its ability to provide highly personalized and adaptive support (Price et al., 2007; VanLehn, 2011). Tutors may ask questions, provide hints, reveal information, correct misunderstandings, confirm student progress, or combine several forms of support in the same response (Zhang et al., 2023). For researchers and designers of online tutoring systems, understanding the annotated instructional actions and the language patterns of tutor responses can provide useful insight into how tutoring support is structured.
This project uses the Corpus of Instructional Management Actions (CIMA), an open-access dataset of simulated tutoring dialogues introduced by Stasaski et al. (2020). The CIMA dataset was collected through a crowdsourcing method in which workers role-played as tutors and students in beginner-level Italian language-learning exercises. The dataset includes prior dialogue context, student action annotations, candidate tutor responses, and tutor action annotations. Tutor actions are coded as Question, Hint/Information Reveal, Correction, Confirmation, and Other; student actions are coded as Guess, Question, Affirmation, and Other.
This project uses descriptive analyses to examine how annotated tutor strategies and response language are distributed across simulated student dialogue contexts. The leading research question is:

How do tutor response strategies and language patterns vary across student action contexts in the CIMA simulated tutoring dataset?

To answer this question, I use four descriptive text-mining and learning-analytics analyses: (1) the overall distribution of tutor action labels, (2) a comparison of tutor actions across student action contexts, (3) a TF-IDF analysis of distinctive words in tutor responses, and (4) a bigram analysis of common phrases, supplemented by a comparison of response length across tutor actions.
Although sentiment analysis and LDA topic modeling are common text-mining techniques, they are not used as primary analyses in this project. Sentiment analysis is less aligned with the research question because tutoring feedback often mixes encouragement and correction, and general-purpose sentiment lexicons may misread pedagogically useful correction as negative tone. LDA topic modeling is also not prioritized because the tutor responses are short and the dataset already includes meaningful action annotations. Instead, this project uses tokenization, TF-IDF, bigram analysis, and action-label comparison because these techniques more directly support the goal of describing tutor response strategies and feedback language.
Here I loaded the original dataset, which is a JSON file. I used the following code to read the JSON structure and convert it into an R object, with simplifyVector = FALSE to keep the nested JSON structure as lists.
cima_raw <- fromJSON("dataset.json",
simplifyVector = FALSE)
names(cima_raw)
[1] "prepDataset"  "shapeDataset"
Previewing the dataset shows that the full object contains two sections; prepDataset is the section that contains the tutoring dialogue records used in this project. Each item in prepDataset represents one tutoring dialogue context, so I extracted that section:
prep_raw <- cima_raw$prepDataset
length(prep_raw) # count the tutoring dialogue records
[1] 1135
I used the following code to select the first tutoring record and to show the fields it includes.
# Inspect the structure of the first record.
prep_raw[[1]] |> names()
 [1] "past_convo" "img" "prep" "engPrep"
[5] "obj" "engObj" "color" "engColor"
[9] "grammarRules" "studentActions" "tutorResponses" "tutorActions"
[13] "tutorKeys"
The original file is nested. Each dialogue record contains a conversation history, language-learning target variables, student action annotations, candidate tutor responses, and tutor action annotations. I first create a dialogue-level table where each row represents one dialogue context.
dialogues <- tibble(
dialogue_id = names(prep_raw),
record = prep_raw
) |>
mutate(
past_convo = map(record, "past_convo"),
prep = map_chr(record, "prep"),
engPrep = map_chr(record, "engPrep"),
obj = map_chr(record, "obj"),
engObj = map_chr(record, "engObj"),
color = map_chr(record, "color"),
engColor = map_chr(record, "engColor"),
grammarRules = map_chr(record, "grammarRules"),
studentActions = map(record, "studentActions"),
tutorResponses = map(record, "tutorResponses"),
tutorActions = map(record, "tutorActions")
) |>
select(-record)
glimpse(dialogues)
Rows: 1,135
Columns: 12
$ dialogue_id <chr> "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",…
$ past_convo <named list> ["\"Pink\" is \"rosa\". Please try to fill in th…
$ prep <chr> "e dietro", "e accanto al", "e vicino", "e accanto al",…
$ engPrep <chr> "is behind the", "is next to the", "is next to the", "i…
$ obj <chr> "l'albero", "letto", "all'albero", "coniglio", "letto",…
$ engObj <chr> "tree", "bed", "tree", "bunny", "bed", "table", "bag", …
$ color <chr> "rosa", "rosa", "blu", "giallo", "giallo", "rosa", "gia…
$ engColor <chr> "pink", "pink", "blue", "yellow", "yellow", "pink", "ye…
$ grammarRules <chr> "[[\"l' (\\\"the\\\") is prepended to the following wor…
$ studentActions <named list> ["False", "True", "False", "False"], ["True", "F…
$ tutorResponses <named list> ["Look at your order of words again. Adjectives …
$ tutorActions <named list> [[FALSE, FALSE, FALSE, FALSE, TRUE], [TRUE, FALS…
Because each dialogue context includes several tutor responses, I reshape the data so that each row represents one tutor response. This tutor-response-level dataset is the main analytic dataset for this project.
I created a small helper function called safe_action() to extract tutor action labels from nested action vectors. This was necessary because in the CIMA data, tutorActions is stored as a list of TRUE/FALSE values, where each position in the vector represents one tutor action category: 1 = Question, 2 = Hint/Information Reveal, 3 = Correction, 4 = Confirmation, and 5 = Other. The function also prevents indexing errors if an action vector is shorter than expected.
# Helper function for safely extracting logical values from action vectors.
safe_action <- function(x, i) {
if (length(x) >= i) {
return(as.logical(x[[i]]))
} else {
return(NA)
}
}

The following code reshapes the nested CIMA data into a tutor-response-level dataset. Because each dialogue context includes several candidate tutor responses, I use unnest_longer() to place each tutor response in its own row. I also convert the student and tutor action vectors into readable Boolean variables. This transformation is necessary for the later analyses.
tutor_response_level <- tibble(
dialogue_id = names(prep_raw),
record = prep_raw
) |>
mutate(
past_convo = map(record, "past_convo"),
studentActions = map(record, "studentActions"),
tutorResponses = map(record, "tutorResponses"),
tutorActions = map(record, "tutorActions")
) |>
select(dialogue_id, past_convo, studentActions, tutorResponses, tutorActions) |>
mutate(
student_guess = map_lgl(studentActions, ~ .x[[1]] == "True"),
student_question = map_lgl(studentActions, ~ .x[[2]] == "True"),
student_affirmation = map_lgl(studentActions, ~ .x[[3]] == "True"),
student_other = map_lgl(studentActions, ~ .x[[4]] == "True")
) |>
mutate(
student_context = case_when(
student_guess ~ "Guess",
student_question ~ "Question",
student_affirmation ~ "Affirmation",
student_other ~ "Other",
TRUE ~ "Unlabeled"
)
) |>
unnest_longer(
tutorResponses,
indices_to = "response_id",
values_to = "tutor_response"
) |>
mutate(
tutor_action_vector = map2(tutorActions, response_id, ~ .x[[.y]])
) |>
mutate(
tutor_question = map_lgl(tutor_action_vector, ~ safe_action(.x, 1)),
tutor_hint_info_reveal = map_lgl(tutor_action_vector, ~ safe_action(.x, 2)),
tutor_correction = map_lgl(tutor_action_vector, ~ safe_action(.x, 3)),
tutor_confirmation = map_lgl(tutor_action_vector, ~ safe_action(.x, 4)),
tutor_other = map_lgl(tutor_action_vector, ~ safe_action(.x, 5))
) |>
select(
dialogue_id,
response_id,
tutor_response,
student_context,
student_guess:student_other,
tutor_question:tutor_other
)
glimpse(tutor_response_level)
Rows: 3,315
Columns: 13
$ dialogue_id <chr> "0", "0", "0", "1", "1", "1", "2", "2", "2", "3…
$ response_id <int> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,…
$ tutor_response <chr> "Look at your order of words again. Adjectives …
$ student_context <chr> "Question", "Question", "Question", "Guess", "G…
$ student_guess <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, F…
$ student_question <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, F…
$ student_affirmation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ student_other <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ tutor_question <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ tutor_hint_info_reveal <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
$ tutor_correction <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, T…
$ tutor_confirmation <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TR…
$ tutor_other <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
After creating the tutor-response-level dataset, I first checked its overall structure. This summary reports both the number of unique dialogue contexts and the number of tutor responses. I also calculated the average number of candidate responses per context to better understand the structure of the reshaped data.
tutor_response_level |>
summarise(
n_dialogue_contexts = n_distinct(dialogue_id),
n_candidate_tutor_responses = n(),
avg_responses_per_context = n_candidate_tutor_responses / n_dialogue_contexts
) |>
kable(digits = 2)

| n_dialogue_contexts | n_candidate_tutor_responses | avg_responses_per_context |
|---|---|---|
| 1135 | 3315 | 2.92 |
I then checked the distribution of student action contexts. This matters for later analyses: if one context appears much more frequently than others, comparisons should be interpreted with that imbalance in mind. Here, each count represents the number of candidate tutor responses related to a given student context.
tutor_response_level |>
count(student_context, sort = TRUE) |>
kable()

| student_context | n |
|---|---|
| Question | 1571 |
| Guess | 1496 |
| Affirmation | 242 |
| Other | 6 |
For frequency counts and visualization, I reshape the tutor action columns into long format.
tutor_actions_long <- tutor_response_level |>
pivot_longer(
cols = tutor_question:tutor_other,
names_to = "tutor_action",
values_to = "action_present"
) |>
mutate(
tutor_action = recode(
tutor_action,
tutor_question = "Question",
tutor_hint_info_reveal = "Hint / Information Reveal",
tutor_correction = "Correction",
tutor_confirmation = "Confirmation",
tutor_other = "Other"
)
) |>
filter(action_present == TRUE)
tutor_actions_long |>
count(tutor_action, sort = TRUE) |>
kable()

| tutor_action | n |
|---|---|
| Hint / Information Reveal | 1986 |
| Correction | 957 |
| Question | 943 |
| Confirmation | 483 |
| Other | 62 |
This first analysis provides a descriptive baseline. Before comparing tutor strategies across student contexts, it is useful to understand which tutor actions appear most frequently in the dataset overall.
tutor_action_counts <- tutor_actions_long |>
count(tutor_action, sort = TRUE) |>
mutate(
proportion = n / sum(n)
)
tutor_action_counts |>
mutate(proportion = percent(proportion, accuracy = 0.1)) |>
kable()

| tutor_action | n | proportion |
|---|---|---|
| Hint / Information Reveal | 1986 | 44.8% |
| Correction | 957 | 21.6% |
| Question | 943 | 21.3% |
| Confirmation | 483 | 10.9% |
| Other | 62 | 1.4% |
ggplot(tutor_action_counts, aes(x = reorder(tutor_action, n), y = n)) +
  geom_col(fill = "salmon") +
  coord_flip() +
  labs(
    title = "Overall Distribution of Tutor Actions",
    x = "Tutor action",
    y = "Number of action labels"
  ) +
  theme_minimal(base_size = 13)

From the above analysis, the most common tutor action in the dataset was Hint/Information Reveal, which appeared 1,986 times and accounted for 44.8% of all tutor action labels. This suggests that simulated tutors most often responded by providing information, reminders, or partial guidance rather than only evaluating answers. Correction and Question appeared at similar levels, each making up around 21% of the action labels, indicating that tutors also frequently corrected student responses and prompted students to think further. Confirmation and Other were less common.
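Because a single tutor response can carry more than one action label, it can also help to check how often tutors combine strategies within one response. The following is a small supplementary sketch using the tutor_response_level table built earlier; this multiplicity count is my own addition, not part of the original coding scheme.

# Count how many action labels each tutor response carries.
tutor_response_level |>
  mutate(
    n_action_labels = rowSums(across(tutor_question:tutor_other), na.rm = TRUE)
  ) |>
  count(n_action_labels) |>
  kable()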
My second analysis compares tutor action patterns across student action contexts. This helps show whether tutor responses differ when the prior student action is coded as a guess, question, affirmation, or other.
actions_by_context <- tutor_actions_long |>
count(student_context, tutor_action) |>
group_by(student_context) |>
mutate(
context_total = sum(n),
proportion = n / context_total
) |>
ungroup()
actions_by_context |>
mutate(proportion = percent(proportion, accuracy = 0.1)) |>
arrange(student_context, desc(n)) |>
kable()

| student_context | tutor_action | n | context_total | proportion |
|---|---|---|---|---|
| Affirmation | Question | 113 | 314 | 36.0% |
| Affirmation | Hint / Information Reveal | 71 | 314 | 22.6% |
| Affirmation | Correction | 63 | 314 | 20.1% |
| Affirmation | Confirmation | 61 | 314 | 19.4% |
| Affirmation | Other | 6 | 314 | 1.9% |
| Guess | Correction | 813 | 2214 | 36.7% |
| Guess | Hint / Information Reveal | 604 | 2214 | 27.3% |
| Guess | Question | 450 | 2214 | 20.3% |
| Guess | Confirmation | 325 | 2214 | 14.7% |
| Guess | Other | 22 | 2214 | 1.0% |
| Other | Question | 5 | 7 | 71.4% |
| Other | Hint / Information Reveal | 2 | 7 | 28.6% |
| Question | Hint / Information Reveal | 1309 | 1896 | 69.0% |
| Question | Question | 375 | 1896 | 19.8% |
| Question | Confirmation | 97 | 1896 | 5.1% |
| Question | Correction | 81 | 1896 | 4.3% |
| Question | Other | 34 | 1896 | 1.8% |
ggplot(actions_by_context, aes(x = reorder(tutor_action, proportion), y = proportion)) +
  geom_col(fill = "skyblue2") +
  coord_flip() +
  facet_wrap(~ student_context) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Tutor Actions by Student Action Context",
    subtitle = "Proportions are calculated within each student context",
    x = "Tutor action",
    y = "Proportion of tutor action labels"
  ) +
  theme_minimal(base_size = 13)

The above results show that tutor action patterns varied noticeably across student action contexts. When students asked questions, tutor responses were dominated by Hint/Information Reveal actions, which made up 69.0% of action labels in that context; tutors usually responded to student questions by directly providing information or guidance. When students made guesses, Correction was the most common tutor action at 36.7%. In affirmation contexts, tutor actions were more evenly distributed across questions, hints, corrections, and confirmations, suggesting that tutors used a wider range of strategies when students acknowledged or confirmed something.
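As a compact summary of the table above, the modal tutor action within each student context can be pulled directly from actions_by_context; a minimal sketch:

# Extract the most frequent tutor action within each student context.
actions_by_context |>
  group_by(student_context) |>
  slice_max(proportion, n = 1, with_ties = FALSE) |>
  ungroup() |>
  mutate(proportion = percent(proportion, accuracy = 0.1)) |>
  select(student_context, tutor_action, proportion) |>
  kable()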
My third analysis uses TF-IDF to identify words that are especially distinctive in tutor responses across different student action contexts. Unlike simple word frequency, TF-IDF highlights words that are more characteristic of one context compared with others.
tutor_words <- tutor_response_level |>
unnest_tokens(word, tutor_response) |>
anti_join(stop_words, by = "word") |>
filter(!str_detect(word, "^[0-9]+$")) |>
filter(str_length(word) > 1)
tfidf_by_context <- tutor_words |>
count(student_context, word, sort = TRUE) |>
bind_tf_idf(word, student_context, n) |>
arrange(desc(tf_idf))
tfidf_by_context |>
group_by(student_context) |>
slice_max(tf_idf, n = 10, with_ties = FALSE) |>
ungroup() |>
arrange(student_context, desc(tf_idf)) |>
kable(digits = 4)

| student_context | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| Affirmation | remember | 30 | 0.0360 | 0.2877 | 0.0104 |
| Affirmation | correct | 28 | 0.0336 | 0.2877 | 0.0097 |
| Affirmation | sentence | 26 | 0.0312 | 0.2877 | 0.0090 |
| Affirmation | noun | 24 | 0.0288 | 0.2877 | 0.0083 |
| Affirmation | box | 20 | 0.0240 | 0.2877 | 0.0069 |
| Affirmation | dog | 18 | 0.0216 | 0.2877 | 0.0062 |
| Affirmation | fill | 18 | 0.0216 | 0.2877 | 0.0062 |
| Affirmation | blank | 17 | 0.0204 | 0.2877 | 0.0059 |
| Affirmation | cane | 16 | 0.0192 | 0.2877 | 0.0055 |
| Affirmation | bunny | 14 | 0.0168 | 0.2877 | 0.0048 |
| Guess | correct | 258 | 0.0401 | 0.2877 | 0.0115 |
| Guess | remember | 256 | 0.0398 | 0.2877 | 0.0115 |
| Guess | noun | 186 | 0.0289 | 0.2877 | 0.0083 |
| Guess | close | 173 | 0.0269 | 0.2877 | 0.0077 |
| Guess | il | 153 | 0.0238 | 0.2877 | 0.0068 |
| Guess | box | 140 | 0.0218 | 0.2877 | 0.0063 |
| Guess | al | 129 | 0.0201 | 0.2877 | 0.0058 |
| Guess | la | 126 | 0.0196 | 0.2877 | 0.0056 |
| Guess | scatola | 126 | 0.0196 | 0.2877 | 0.0056 |
| Guess | yellow | 98 | 0.0152 | 0.2877 | 0.0044 |
| Other | pianta | 1 | 0.0476 | 0.2877 | 0.0137 |
| Other | translate | 1 | 0.0476 | 0.2877 | 0.0137 |
| Other | phrase | 3 | 0.1429 | 0.0000 | 0.0000 |
| Other | plant | 3 | 0.1429 | 0.0000 | 0.0000 |
| Other | green | 2 | 0.0952 | 0.0000 | 0.0000 |
| Other | tree | 2 | 0.0952 | 0.0000 | 0.0000 |
| Other | words | 2 | 0.0952 | 0.0000 | 0.0000 |
| Other | color | 1 | 0.0476 | 0.0000 | 0.0000 |
| Other | dietro | 1 | 0.0476 | 0.0000 | 0.0000 |
| Other | hint | 1 | 0.0476 | 0.0000 | 0.0000 |
| Question | al | 174 | 0.0349 | 0.2877 | 0.0100 |
| Question | fronte | 119 | 0.0239 | 0.2877 | 0.0069 |
| Question | la | 119 | 0.0239 | 0.2877 | 0.0069 |
| Question | di | 114 | 0.0229 | 0.2877 | 0.0066 |
| Question | front | 113 | 0.0227 | 0.2877 | 0.0065 |
| Question | remember | 109 | 0.0219 | 0.2877 | 0.0063 |
| Question | scatola | 107 | 0.0215 | 0.2877 | 0.0062 |
| Question | box | 105 | 0.0211 | 0.2877 | 0.0061 |
| Question | blue | 103 | 0.0207 | 0.2877 | 0.0059 |
| Question | blu | 92 | 0.0185 | 0.2877 | 0.0053 |
tfidf_by_context |>
  group_by(student_context) |>
  slice_max(tf_idf, n = 10, with_ties = FALSE) |>
  ungroup() |>
  ggplot(aes(x = reorder_within(word, tf_idf, student_context), y = tf_idf)) +
  geom_col(fill = "lightpink3") +
  coord_flip() +
  facet_wrap(~ student_context, scales = "free") +
  scale_x_reordered() +
  labs(
    title = "Distinctive Tutor Response Words by Student Context",
    x = "Word",
    y = "TF-IDF"
  ) +
  theme_minimal(base_size = 13)

The TF-IDF results show that tutor responses used somewhat different language depending on the student action context. In Guess contexts, distinctive words such as "correct," "remember," "noun," and "close" suggest that tutors often responded to student attempts by evaluating the answer and guiding revision. In Question contexts, words such as "al," "fronte," "di," "box," "blue," and "blu" reflect vocabulary- and phrase-level information reveal. In Affirmation contexts, words such as "remember," "correct," "sentence," and "noun" suggest that tutors may have moved from student acknowledgment toward reinforcing grammar rules or prompting students to complete the sentence.
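Because several of the distinctive Question-context words are Italian function words ("al," "la," "di"), one optional robustness check would be to extend the stop-word filtering with a small custom list before recomputing TF-IDF. The word list below is a hypothetical illustration based on the output above, not an established Italian stop-word lexicon:

# Hypothetical custom list of Italian function words seen in the TF-IDF output.
italian_function_words <- tibble(word = c("al", "alla", "la", "il", "di"))

tutor_words |>
  anti_join(italian_function_words, by = "word") |>
  count(student_context, word, sort = TRUE) |>
  bind_tf_idf(word, student_context, n) |>
  arrange(desc(tf_idf))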
Single words can be useful, but tutoring language often appears in short phrases, so this analysis identifies common two-word phrases in tutor responses.
tutor_bigrams <- tutor_response_level |>
unnest_tokens(bigram, tutor_response, token = "ngrams", n = 2) |>
separate(bigram, into = c("word1", "word2"), sep = " ") |>
filter(
!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!str_detect(word1, "^[0-9]+$"),
!str_detect(word2, "^[0-9]+$")
) |>
unite(bigram, word1, word2, sep = " ")
top_bigrams <- tutor_bigrams |>
count(bigram, sort = TRUE) |>
slice_max(n, n = 20, with_ties = FALSE)
top_bigrams |>
kable()

| bigram | n |
|---|---|
| di fronte | 194 |
| dentro la | 84 |
| dietro la | 79 |
| accanto al | 78 |
| fronte al | 67 |
| fronte alla | 66 |
| cima al | 61 |
| italian word | 54 |
| color words | 50 |
| words follow | 47 |
| il gatto | 43 |
| adjectives follow | 41 |
| sotto il | 40 |
| il coniglio | 37 |
| la scatola | 37 |
| il cane | 34 |
| dietro il | 31 |
| correct word | 24 |
| vicino al | 24 |
| vicino alla | 19 |
ggplot(top_bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "#69b3a2") +
  coord_flip() +
  labs(
    title = "Most Common Bigrams in Tutor Responses",
    x = "Bigram",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 13)

Before running the bigram analysis, I expected that many of the common two-word phrases would include general tutoring expressions such as "try again," "do you," "word for," or "remember that." However, the results show that many of the most common two-word phrases in tutor responses related to the Italian language-learning content, such as "di fronte," "dentro la," "dietro la," and "accanto al." Other common bigrams, such as "color words," "words follow," and "adjectives follow," point to recurring grammar explanations about Italian word order. This pattern likely reflects the specific nature of the CIMA dataset: because the tutoring tasks focus on beginner-level Italian phrase completion, tutor responses often repeat or reveal the target Italian vocabulary and grammar structures. The bigram analysis thus shows that tutor feedback was highly content-focused, combining vocabulary support with brief grammar reminders.
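One caveat is that the stop-word filter above removes phrases such as "try again" and "do you" before they can be counted, because words like "again," "do," and "you" appear in the tidytext stop_words lexicons. A minimal sketch that counts a few of the expected phrases in the unfiltered bigrams, as a check on this explanation:

# Count selected tutoring phrases in the raw (unfiltered) bigrams.
expected_phrases <- c("try again", "do you", "word for", "remember that")

tutor_response_level |>
  unnest_tokens(bigram, tutor_response, token = "ngrams", n = 2) |>
  filter(bigram %in% expected_phrases) |>
  count(bigram, sort = TRUE) |>
  kable()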
As a simple measure of feedback elaboration, I further compare the number of words in tutor responses across tutor action labels. Some actions, such as Correction or Hint/Information Reveal, may require more elaboration than a brief Confirmation.
response_length_data <- tutor_response_level |>
mutate(
response_word_count = str_count(tutor_response, "\\S+")
) |>
pivot_longer(
cols = tutor_question:tutor_other,
names_to = "tutor_action",
values_to = "action_present"
) |>
filter(action_present == TRUE) |>
mutate(
tutor_action = recode(
tutor_action,
tutor_question = "Question",
tutor_hint_info_reveal = "Hint / Information Reveal",
tutor_correction = "Correction",
tutor_confirmation = "Confirmation",
tutor_other = "Other"
)
)
response_length_data |>
group_by(tutor_action) |>
summarise(
n = n(),
mean_words = mean(response_word_count, na.rm = TRUE),
median_words = median(response_word_count, na.rm = TRUE)
) |>
arrange(desc(mean_words)) |>
kable(digits = 2)

| tutor_action | n | mean_words | median_words |
|---|---|---|---|
| Other | 62 | 22.19 | 20 |
| Correction | 957 | 13.04 | 12 |
| Question | 943 | 12.28 | 11 |
| Hint / Information Reveal | 1986 | 9.56 | 8 |
| Confirmation | 483 | 9.39 | 8 |
ggplot(response_length_data, aes(x = tutor_action, y = response_word_count)) +
geom_boxplot() +
coord_flip() +
labs(
title = "Tutor Response Length by Tutor Action",
x = "Tutor action",
y = "Number of words in tutor response"
) +
theme_minimal(base_size = 13)

From the above analysis, tutor response length varied across tutor action types. Responses labeled Other had the highest average and median word counts, but this category was much smaller than the others. Among the more common action types, Correction and Question responses tended to be slightly longer than Hint/Information Reveal and Confirmation responses. The boxplot also shows several long-response outliers, especially for Correction and Hint/Information Reveal, indicating that some tutor responses provided far more detailed explanations than the typical short feedback.
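To see what those outliers contain, the longest responses per action label can be inspected directly; a small follow-up sketch:

# Inspect the three longest tutor responses for each action label.
response_length_data |>
  group_by(tutor_action) |>
  slice_max(response_word_count, n = 3, with_ties = FALSE) |>
  ungroup() |>
  select(tutor_action, response_word_count, tutor_response)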
In the CIMA simulated tutoring dataset, the analyses show that tutor responses varied in three main ways. First, tutors most often used Hint/Information Reveal, suggesting that simulated tutors usually supported students by giving vocabulary, grammar reminders, or partial guidance rather than only confirming or correcting answers. Second, student action context shaped tutor strategy. When students asked questions, tutors mostly responded with Hint/Information Reveal; when students made guesses, tutors more often used Correction, often combined with guidance. This suggests that tutors responded differently depending on whether students were requesting help or attempting an answer. Third, tutor language reflected both student context and task content. The TF-IDF results showed that responses to guesses included words such as "correct," "remember," "close," and "noun," while responses to questions included more vocabulary- and phrase-related words such as "al," "fronte," "box," "blue," and "blu." The bigram results also surfaced many Italian phrases, such as "di fronte," "dentro la," and "accanto al." Overall, the findings suggest that tutor responses in CIMA were highly contextualized: tutors tended to provide information when students asked questions, to correct and guide when students made guesses, and to rely heavily on content-specific language to support the Italian learning task.
Based on the findings, this data product can help educational researchers, instructional designers, and AI tutoring developers better understand how tutoring support is organized in simulated online tutoring dialogue. For instructional designers, one potential action is to design feedback templates that are sensitive to student action context. For example, when students ask a question, the tutor system may need to prioritize hint/information reveal; when students make a guess, the feedback may need to combine correction with supportive language. Feedback design should therefore avoid using one generic response style for all student inputs. For AI tutoring developers, the findings suggest that automated tutoring systems could benefit from a two-step response design: first, the system identifies the student's action type, such as whether the student is asking a question or affirming understanding; second, the system selects a response strategy that fits that context, such as revealing information, asking a follow-up question, giving a correction, or confirming progress.
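As an illustration only, the two-step design could start from a simple rule that maps a detected student action to the modal tutor strategy observed in this dataset. The function below is a hypothetical sketch derived from the proportions reported earlier, not a recommended production policy:

# Illustrative rule: map a student action to the modal tutor strategy in CIMA.
select_tutor_strategy <- function(student_action) {
  case_when(
    student_action == "Question"    ~ "Hint / Information Reveal",
    student_action == "Guess"       ~ "Correction",
    student_action == "Affirmation" ~ "Question",
    TRUE                            ~ "Hint / Information Reveal"
  )
}

select_tutor_strategy("Guess")  # returns "Correction"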
For educational researchers, one useful next step is to compare these simulated dialogue patterns with real tutoring interactions. The CIMA dataset provides a structured starting point for studying tutoring strategies, but future research could examine whether similar patterns appear in classroom help seeking or AI tutoring logs. Researchers could also further investigate whether certain combinations of tutor actions, such as correction plus explanation or hint plus question, are related to stronger student engagement or learning outcomes in datasets that include post-response student performance.
This project has several limitations. First, CIMA contains simulated tutoring dialogues produced by crowdworkers, which means the dialogues are not naturally occurring conversations between real tutors and students. The findings should therefore be interpreted as patterns in role-played pedagogical dialogue, not direct evidence of real student learning. Second, the dataset focuses on beginner-level Italian language-learning exercises; patterns in this dataset may not generalize to other subjects, age groups, or learning environments. Additionally, this project is purely descriptive and does not make causal claims about the effects of tutor strategies on student outcomes. Finally, for ethical considerations, this project uses an open-access dataset and reports findings only in aggregate form. While the dataset is simulated and anonymized, it is worth noting that the purpose of this project is not to evaluate individual crowdworkers, tutors, or students, but to understand broader patterns in annotated tutoring responses.
Price, L., Richardson, J. T., & Jelfs, A. (2007). Face-to-face versus online tutoring support in distance education. Studies in Higher Education, 32(1), 1–20. https://doi.org/10.1080/03075070601004366
Stasaski, K., Kao, K., & Hearst, M. A. (2020, July). CIMA: A large open access dialogue dataset for tutoring. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 52–64). https://doi.org/10.18653/v1/2020.bea-1.5
VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221. https://doi.org/10.1080/00461520.2011.611369
Zhang, L., Pan, M., Yu, S., Chen, L., & Zhang, J. (2023). Evaluation of a student-centered online one-to-one tutoring system. Interactive Learning Environments, 31(7), 4251–4269. https://doi.org/10.1080/10494820.2021.1958234