Tutor Response Strategies and Language Patterns in Simulated Online Tutoring Dialogues

Author

Siyu Long

Introduction

One-on-one tutoring is widely regarded as an effective pedagogical intervention for its ability to provide highly personalized and adaptive support (Price et al., 2007; VanLehn, 2011). Tutors may ask questions, provide hints, reveal information, correct misunderstandings, confirm student progress, or combine several forms of support in the same response (Zhang et al., 2023). For researchers and designers of online tutoring systems, understanding the annotated instructional actions and language patterns in tutor responses can provide useful insight into how tutoring support is structured.

This project uses the Corpus of Instructional Management Actions (CIMA), an open-access dataset of simulated tutoring dialogues introduced by Stasaski et al. (2020). The CIMA dataset was collected through crowdsourcing: workers role-played as tutors and students in beginner-level Italian language-learning exercises. The dataset includes prior dialogue context, student action annotations, candidate tutor responses, and tutor action annotations. Tutor actions are coded as Question, Hint/Information Reveal, Correction, Confirmation, and Other; student actions are coded as Guess, Question, Affirmation, and Other.

This project takes a descriptive approach, examining how annotated tutor strategies and response language are distributed across simulated student dialogue contexts and how both vary with the student action that precedes a tutor response. The leading research question of this project is:

How do tutor response strategies and language patterns vary across student action contexts in the CIMA simulated tutoring dataset?

To answer this question, I use four descriptive text-mining and learning-analytics analyses:

  1. Overall frequency of tutor actions.
  2. Tutor action patterns by student action context.
  3. TF-IDF analysis of tutor response words by student context.
  4. Bigram and response-length analysis of tutor feedback language.

Although sentiment analysis and LDA topic modeling are common text-mining techniques, they are not used as primary analyses in this project. Sentiment analysis is less aligned with the research question because tutoring feedback often mixes encouragement and correction, and general-purpose sentiment lexicons may misread pedagogically useful correction as negative tone. LDA topic modeling is also not prioritized because the tutor responses are short and the dataset already includes meaningful action annotations. Instead, this project uses tokenization, TF-IDF, bigram analysis, and action-label comparison because these techniques more directly support the goal of describing tutor response strategies and feedback language.

Data Wrangling

Loading Packages and Data

Code
library(tidyverse)
library(jsonlite)
library(stringr)
library(scales)
library(knitr)
library(tidytext)

Here I loaded the original dataset, which is a JSON file. I used the following code to read the JSON structure and convert it into an R object, setting simplifyVector = FALSE to keep the nested JSON structure as lists.

Code
cima_raw <- fromJSON("dataset.json",
                     simplifyVector = FALSE) 

names(cima_raw)
[1] "prepDataset"  "shapeDataset"

Previewing the dataset shows that the full object contains two sections; prepDataset is the section that contains the tutoring dialogue records used in this project. Each item in prepDataset represents one tutoring dialogue context, so I extracted the prepDataset:

Code
prep_raw <- cima_raw$prepDataset

length(prep_raw) # count the tutoring dialogue records
[1] 1135

I used the following code to select the first tutoring record and to show the fields included in that record.

Code
# Inspect the structure of the first record.
prep_raw[[1]] |> names()
 [1] "past_convo"     "img"            "prep"           "engPrep"       
 [5] "obj"            "engObj"         "color"          "engColor"      
 [9] "grammarRules"   "studentActions" "tutorResponses" "tutorActions"  
[13] "tutorKeys"     

Creating a Dialogue-Level Dataset

The original file is nested. Each dialogue record contains a conversation history, language-learning target variables, student action annotations, candidate tutor responses, and tutor action annotations. I first create a dialogue-level table where each row represents one dialogue context.

Code
dialogues <- tibble(
  dialogue_id = names(prep_raw),
  record = prep_raw
) |>
  mutate(
    past_convo = map(record, "past_convo"),
    prep = map_chr(record, "prep"),
    engPrep = map_chr(record, "engPrep"),
    obj = map_chr(record, "obj"),
    engObj = map_chr(record, "engObj"),
    color = map_chr(record, "color"),
    engColor = map_chr(record, "engColor"),
    grammarRules = map_chr(record, "grammarRules"),
    studentActions = map(record, "studentActions"),
    tutorResponses = map(record, "tutorResponses"),
    tutorActions = map(record, "tutorActions")
  ) |>
  select(-record)

glimpse(dialogues)
Rows: 1,135
Columns: 12
$ dialogue_id    <chr> "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",…
$ past_convo     <named list> ["\"Pink\" is \"rosa\". Please try to fill in th…
$ prep           <chr> "e dietro", "e accanto al", "e vicino", "e accanto al",…
$ engPrep        <chr> "is behind the", "is next to the", "is next to the", "i…
$ obj            <chr> "l'albero", "letto", "all'albero", "coniglio", "letto",…
$ engObj         <chr> "tree", "bed", "tree", "bunny", "bed", "table", "bag", …
$ color          <chr> "rosa", "rosa", "blu", "giallo", "giallo", "rosa", "gia…
$ engColor       <chr> "pink", "pink", "blue", "yellow", "yellow", "pink", "ye…
$ grammarRules   <chr> "[[\"l' (\\\"the\\\") is prepended to the following wor…
$ studentActions <named list> ["False", "True", "False", "False"], ["True", "F…
$ tutorResponses <named list> ["Look at your order of words again. Adjectives …
$ tutorActions   <named list> [[FALSE, FALSE, FALSE, FALSE, TRUE], [TRUE, FALS…

Creating a Tutor-Response-Level Dataset

Because each dialogue context includes several tutor responses, I reshape the data so that each row represents one tutor response. This tutor-response-level dataset is the main analytic dataset for this project.

I created a small helper function called safe_action() to extract tutor action labels from nested action vectors. This is necessary because in the CIMA data, tutorActions is stored as a list of TRUE/FALSE values, where each position in the vector represents one tutor action category: 1 = Question, 2 = Hint/Information Reveal, 3 = Correction, 4 = Confirmation, and 5 = Other. The function also prevents indexing errors if an action vector is shorter than expected.

Code
# Helper function for safely extracting logical values from action vectors.
safe_action <- function(x, i) {
  if (length(x) >= i) {
    return(as.logical(x[[i]]))
  } else {
    return(NA)
  }
}
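
As a quick illustration (not part of the pipeline), the helper returns the value at the requested position when it exists and NA when the vector is too short:

Code
safe_action(list(TRUE, FALSE, FALSE, FALSE, TRUE), 2)
# [1] FALSE

safe_action(list(TRUE, FALSE), 5)
# [1] NA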

The following code reshapes the nested CIMA data into a tutor-response-level dataset. Because each dialogue context includes multiple candidate tutor responses, I use unnest_longer() to place each tutor response in its own row. I also convert the student and tutor action vectors into readable Boolean variables. This transformation is necessary for the later analyses.

Code
tutor_response_level <- tibble(
  dialogue_id = names(prep_raw),
  record = prep_raw
) |>
  mutate(
    past_convo = map(record, "past_convo"),
    studentActions = map(record, "studentActions"),
    tutorResponses = map(record, "tutorResponses"),
    tutorActions = map(record, "tutorActions")
  ) |>
  select(dialogue_id, past_convo, studentActions, tutorResponses, tutorActions) |>
  mutate(
    student_guess = map_lgl(studentActions, ~ .x[[1]] == "True"),
    student_question = map_lgl(studentActions, ~ .x[[2]] == "True"),
    student_affirmation = map_lgl(studentActions, ~ .x[[3]] == "True"),
    student_other = map_lgl(studentActions, ~ .x[[4]] == "True")
  ) |>
  mutate(
    student_context = case_when(
      student_guess ~ "Guess",
      student_question ~ "Question",
      student_affirmation ~ "Affirmation",
      student_other ~ "Other",
      TRUE ~ "Unlabeled"
    )
  ) |>
  unnest_longer(
    tutorResponses,
    indices_to = "response_id",
    values_to = "tutor_response"
  ) |>
  mutate(
    tutor_action_vector = map2(tutorActions, response_id, ~ .x[[.y]])
  ) |>
  mutate(
    tutor_question = map_lgl(tutor_action_vector, ~ safe_action(.x, 1)),
    tutor_hint_info_reveal = map_lgl(tutor_action_vector, ~ safe_action(.x, 2)),
    tutor_correction = map_lgl(tutor_action_vector, ~ safe_action(.x, 3)),
    tutor_confirmation = map_lgl(tutor_action_vector, ~ safe_action(.x, 4)),
    tutor_other = map_lgl(tutor_action_vector, ~ safe_action(.x, 5))
  ) |>
  select(
    dialogue_id,
    response_id,
    tutor_response,
    student_context,
    student_guess:student_other,
    tutor_question:tutor_other
  )

glimpse(tutor_response_level)
Rows: 3,315
Columns: 13
$ dialogue_id            <chr> "0", "0", "0", "1", "1", "1", "2", "2", "2", "3…
$ response_id            <int> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,…
$ tutor_response         <chr> "Look at your order of words again. Adjectives …
$ student_context        <chr> "Question", "Question", "Question", "Guess", "G…
$ student_guess          <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, F…
$ student_question       <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, F…
$ student_affirmation    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ student_other          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ tutor_question         <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ tutor_hint_info_reveal <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
$ tutor_correction       <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, T…
$ tutor_confirmation     <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TR…
$ tutor_other            <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…

Checking the Reshaped Data

After creating the tutor-response-level dataset, I first checked its overall structure. This summary reports both the number of unique dialogue contexts and the number of tutor responses. I also calculated the average number of candidate responses per context to better understand the structure of the reshaped data.

Code
tutor_response_level |>
  summarise(
    n_dialogue_contexts = n_distinct(dialogue_id),
    n_candidate_tutor_responses = n(),
    avg_responses_per_context = n_candidate_tutor_responses / n_dialogue_contexts
  ) |>
  kable(digits = 2)
n_dialogue_contexts n_candidate_tutor_responses avg_responses_per_context
1135 3315 2.92

I then checked the distribution of student action contexts. If one context appears much more frequently than others, later comparisons should be interpreted with this imbalance in mind. Here, each count represents the number of candidate tutor responses related to a given student context.

Code
tutor_response_level |>
  count(student_context, sort = TRUE) |>
  kable()
student_context n
Question 1571
Guess 1496
Affirmation 242
Other 6

Creating Long-Format Tutor Action Data

For frequency and visualization, I reshape the tutor action columns into a long format.

Code
tutor_actions_long <- tutor_response_level |>
  pivot_longer(
    cols = tutor_question:tutor_other,
    names_to = "tutor_action",
    values_to = "action_present"
  ) |>
  mutate(
    tutor_action = recode(
      tutor_action,
      tutor_question = "Question",
      tutor_hint_info_reveal = "Hint / Information Reveal",
      tutor_correction = "Correction",
      tutor_confirmation = "Confirmation",
      tutor_other = "Other"
    )
  ) |>
  filter(action_present == TRUE)

tutor_actions_long |>
  count(tutor_action, sort = TRUE) |>
  kable()
tutor_action n
Hint / Information Reveal 1986
Correction 957
Question 943
Confirmation 483
Other 62

Analysis

Analysis 1: Overall Distribution of Tutor Actions

This first analysis provides a descriptive baseline. Before comparing tutor strategies across student contexts, it is useful to understand which tutor actions appear most frequently in the dataset overall.

Code
tutor_action_counts <- tutor_actions_long |>
  count(tutor_action, sort = TRUE) |>
  mutate(
    proportion = n / sum(n)
  )

tutor_action_counts |>
  mutate(proportion = percent(proportion, accuracy = 0.1)) |>
  kable()
tutor_action n proportion
Hint / Information Reveal 1986 44.8%
Correction 957 21.6%
Question 943 21.3%
Confirmation 483 10.9%
Other 62 1.4%
Code
ggplot(tutor_action_counts, aes(x = reorder(tutor_action, n), y = n)) +
  geom_col(fill = "salmon") +
  coord_flip() +
  labs(
    title = "Overall Distribution of Tutor Actions",
    x = "Tutor action",
    y = "Number of action labels"
  ) +
  theme_minimal(base_size = 13)

From the above analysis, the most common tutor action in the dataset was Hint/Information Reveal, which appeared 1,986 times and accounted for 44.8% of all tutor action labels. This suggests that simulated tutors most often responded by providing students with information, reminders, or partial guidance rather than only evaluating their answers. Correction and Question appeared at similar levels, each making up around 21% of the action labels. This indicates that tutors also frequently corrected student responses and prompted students to think further. However, Confirmation and Other were less common.

Analysis 2: Tutor Actions by Student Context

My second analysis compares tutor action patterns across student action contexts. This helps show whether tutor responses differ when the prior student action is coded as a guess, question, affirmation, or other.

Code
actions_by_context <- tutor_actions_long |>
  count(student_context, tutor_action) |>
  group_by(student_context) |>
  mutate(
    context_total = sum(n),
    proportion = n / context_total
  ) |>
  ungroup()

actions_by_context |>
  mutate(proportion = percent(proportion, accuracy = 0.1)) |>
  arrange(student_context, desc(n)) |>
  kable()
student_context tutor_action n context_total proportion
Affirmation Question 113 314 36.0%
Affirmation Hint / Information Reveal 71 314 22.6%
Affirmation Correction 63 314 20.1%
Affirmation Confirmation 61 314 19.4%
Affirmation Other 6 314 1.9%
Guess Correction 813 2214 36.7%
Guess Hint / Information Reveal 604 2214 27.3%
Guess Question 450 2214 20.3%
Guess Confirmation 325 2214 14.7%
Guess Other 22 2214 1.0%
Other Question 5 7 71.4%
Other Hint / Information Reveal 2 7 28.6%
Question Hint / Information Reveal 1309 1896 69.0%
Question Question 375 1896 19.8%
Question Confirmation 97 1896 5.1%
Question Correction 81 1896 4.3%
Question Other 34 1896 1.8%
Code
ggplot(actions_by_context, aes(x = reorder(tutor_action, proportion), y = proportion)) +
  geom_col(fill = "skyblue2") +
  coord_flip() +
  facet_wrap(~ student_context) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Tutor Actions by Student Action Context",
    subtitle = "Proportions are calculated within each student context",
    x = "Tutor action",
    y = "Proportion of tutor action labels"
  ) +
  theme_minimal(base_size = 13)

The above results show that tutor action patterns varied noticeably across student action contexts. When students asked questions, tutor responses were dominated by Hint/Information Reveal actions, which made up 69.0% of action labels in that context. In this case, tutors usually responded to student questions by directly providing information or guidance. When students made guesses, correction was the most common tutor action at 36.7%. In affirmation contexts, tutor actions were more evenly distributed across questions, hints, corrections, and confirmations. In this case, tutors used a wider range of strategies when students acknowledged or confirmed something.

Analysis 3: TF-IDF of Tutor Response Words by Student Context

My third analysis uses TF-IDF (term frequency-inverse document frequency) to identify words that are especially distinctive in tutor responses to different student action contexts. Unlike simple word frequency, TF-IDF highlights words that are more characteristic of one context compared with the others.
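
Before applying the metric to the data, a minimal toy sketch (with invented counts, not drawn from CIMA) may help show how bind_tf_idf() behaves:

Code
# Toy illustration (hypothetical counts): a word used in every context
# gets idf = 0, while a context-specific word gets a positive score.
toy_counts <- tribble(
  ~context,   ~word,      ~n,
  "Guess",    "remember",  3,
  "Guess",    "box",       2,
  "Question", "box",       2,
  "Question", "fronte",    4
)

toy_counts |> bind_tf_idf(word, context, n)
# "box" appears in both contexts, so idf = ln(2/2) = 0 and tf_idf = 0;
# "remember" and "fronte" each appear in one context, so idf = ln(2/1) ≈ 0.69.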

Code
tutor_words <- tutor_response_level |>
  unnest_tokens(word, tutor_response) |>
  anti_join(stop_words, by = "word") |>
  filter(!str_detect(word, "^[0-9]+$")) |>
  filter(str_length(word) > 1)

tfidf_by_context <- tutor_words |>
  count(student_context, word, sort = TRUE) |>
  bind_tf_idf(word, student_context, n) |>
  arrange(desc(tf_idf))

tfidf_by_context |>
  group_by(student_context) |>
  slice_max(tf_idf, n = 10, with_ties = FALSE) |>
  ungroup() |>
  arrange(student_context, desc(tf_idf)) |>
  kable(digits = 4)
student_context word n tf idf tf_idf
Affirmation remember 30 0.0360 0.2877 0.0104
Affirmation correct 28 0.0336 0.2877 0.0097
Affirmation sentence 26 0.0312 0.2877 0.0090
Affirmation noun 24 0.0288 0.2877 0.0083
Affirmation box 20 0.0240 0.2877 0.0069
Affirmation dog 18 0.0216 0.2877 0.0062
Affirmation fill 18 0.0216 0.2877 0.0062
Affirmation blank 17 0.0204 0.2877 0.0059
Affirmation cane 16 0.0192 0.2877 0.0055
Affirmation bunny 14 0.0168 0.2877 0.0048
Guess correct 258 0.0401 0.2877 0.0115
Guess remember 256 0.0398 0.2877 0.0115
Guess noun 186 0.0289 0.2877 0.0083
Guess close 173 0.0269 0.2877 0.0077
Guess il 153 0.0238 0.2877 0.0068
Guess box 140 0.0218 0.2877 0.0063
Guess al 129 0.0201 0.2877 0.0058
Guess la 126 0.0196 0.2877 0.0056
Guess scatola 126 0.0196 0.2877 0.0056
Guess yellow 98 0.0152 0.2877 0.0044
Other pianta 1 0.0476 0.2877 0.0137
Other translate 1 0.0476 0.2877 0.0137
Other phrase 3 0.1429 0.0000 0.0000
Other plant 3 0.1429 0.0000 0.0000
Other green 2 0.0952 0.0000 0.0000
Other tree 2 0.0952 0.0000 0.0000
Other words 2 0.0952 0.0000 0.0000
Other color 1 0.0476 0.0000 0.0000
Other dietro 1 0.0476 0.0000 0.0000
Other hint 1 0.0476 0.0000 0.0000
Question al 174 0.0349 0.2877 0.0100
Question fronte 119 0.0239 0.2877 0.0069
Question la 119 0.0239 0.2877 0.0069
Question di 114 0.0229 0.2877 0.0066
Question front 113 0.0227 0.2877 0.0065
Question remember 109 0.0219 0.2877 0.0063
Question scatola 107 0.0215 0.2877 0.0062
Question box 105 0.0211 0.2877 0.0061
Question blue 103 0.0207 0.2877 0.0059
Question blu 92 0.0185 0.2877 0.0053
Code
tfidf_by_context |>
  group_by(student_context) |>
  slice_max(tf_idf, n = 10, with_ties = FALSE) |>
  ungroup() |>
  ggplot(aes(x = reorder_within(word, tf_idf, student_context), y = tf_idf)) +
  geom_col(fill = "lightpink3") +
  coord_flip() +
  facet_wrap(~ student_context, scales = "free") +
  scale_x_reordered() +
  labs(
    title = "Distinctive Tutor Response Words by Student Context",
    x = "Word",
    y = "TF-IDF"
  ) +
  theme_minimal(base_size = 13)

The TF-IDF results show that tutor responses used somewhat different language depending on the student action context. (Because there are only four student contexts, many words share the idf value ln(4/3) ≈ 0.2877, meaning they appear in three of the four contexts; words appearing in all four contexts receive an idf of 0.) In Guess contexts, distinctive words such as “correct,” “remember,” “noun,” and “close” suggest that tutors often responded to student attempts by evaluating the answer and guiding revision. In Question contexts, words such as “al,” “fronte,” “di,” “box,” “blue,” and “blu” reflect vocabulary and phrase-level information reveal. In Affirmation contexts, words such as “remember,” “correct,” “sentence,” and “noun” suggest that tutors often moved from student acknowledgment toward reinforcing grammar rules or prompting students to complete the sentence.

Analysis 4: Common Bigrams in Tutor Responses

Single words are informative, but tutoring language often appears in short phrases. This analysis therefore identifies the most common two-word phrases (bigrams) in tutor responses.

Code
tutor_bigrams <- tutor_response_level |>
  unnest_tokens(bigram, tutor_response, token = "ngrams", n = 2) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(
    !word1 %in% stop_words$word,
    !word2 %in% stop_words$word,
    !str_detect(word1, "^[0-9]+$"),
    !str_detect(word2, "^[0-9]+$")
  ) |>
  unite(bigram, word1, word2, sep = " ")

top_bigrams <- tutor_bigrams |>
  count(bigram, sort = TRUE) |>
  slice_max(n, n = 20, with_ties = FALSE)

top_bigrams |>
  kable()
bigram n
di fronte 194
dentro la 84
dietro la 79
accanto al 78
fronte al 67
fronte alla 66
cima al 61
italian word 54
color words 50
words follow 47
il gatto 43
adjectives follow 41
sotto il 40
il coniglio 37
la scatola 37
il cane 34
dietro il 31
correct word 24
vicino al 24
vicino alla 19
Code
ggplot(top_bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "#69b3a2") +
  coord_flip() +
  labs(
    title = "Most Common Bigrams in Tutor Responses",
    x = "Bigram",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 13)

Before running the bigram analysis, I expected that many of the common two-word phrases would include general tutoring expressions such as “try again,” “do you,” “word for,” or “remember that.” However, the results show that many of the most common two-word phrases in tutor responses were related to the Italian language-learning content, such as “di fronte,” “dentro la,” “dietro la,” and “accanto al.” Other common bigrams, such as “color words,” “words follow,” and “adjectives follow,” point to recurring grammar explanations about Italian word order. This pattern likely reflects the specific nature of the CIMA dataset. Because the tutoring tasks focus on beginner-level Italian phrase completion, tutor responses often repeat or reveal the target Italian vocabulary and grammar structures. In this case, the bigram analysis shows that tutor feedback was highly content-focused with vocabulary support and brief grammar reminders.

Analysis 5: Tutor Response Length by Tutor Action

As a simple measure of feedback elaboration, I compare the number of words in tutor responses across tutor action labels. Some actions, such as correction or hint/information reveal, may require more elaboration than a brief confirmation.

Code
response_length_data <- tutor_response_level |>
  mutate(
    response_word_count = str_count(tutor_response, "\\S+")
  ) |>
  pivot_longer(
    cols = tutor_question:tutor_other,
    names_to = "tutor_action",
    values_to = "action_present"
  ) |>
  filter(action_present == TRUE) |>
  mutate(
    tutor_action = recode(
      tutor_action,
      tutor_question = "Question",
      tutor_hint_info_reveal = "Hint / Information Reveal",
      tutor_correction = "Correction",
      tutor_confirmation = "Confirmation",
      tutor_other = "Other"
    )
  )

response_length_data |>
  group_by(tutor_action) |>
  summarise(
    n = n(),
    mean_words = mean(response_word_count, na.rm = TRUE),
    median_words = median(response_word_count, na.rm = TRUE)
  ) |>
  arrange(desc(mean_words)) |>
  kable(digits = 2)
tutor_action n mean_words median_words
Other 62 22.19 20
Correction 957 13.04 12
Question 943 12.28 11
Hint / Information Reveal 1986 9.56 8
Confirmation 483 9.39 8
Code
ggplot(response_length_data, aes(x = tutor_action, y = response_word_count)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Tutor Response Length by Tutor Action",
    x = "Tutor action",
    y = "Number of words in tutor response"
  ) +
  theme_minimal(base_size = 13)

From the above analysis, tutor response length varied across tutor action types. Responses labeled as Other had the highest average and median word counts, but this category was much smaller than the others. Among the more common action types, correction and question tended to be slightly longer than Hint/Information Reveal and Confirmation. The boxplot also shows several long-response outliers, especially for correction and hint/information reveal. In this case, some tutor responses provided more detailed explanations than the typical short feedback response.

Findings

In the CIMA simulated tutoring dataset, the analyses show that tutor responses varied in three main ways. First, tutors most often used Hint/Information Reveal. This suggests that simulated tutors usually supported students by giving vocabulary, grammar reminders, or partial guidance rather than only confirming or correcting answers. Second, student action context shaped tutor strategy. When students asked questions, tutors mostly responded with Hint/Information Reveal. When students made guesses, tutors most often used Correction, frequently combined with guidance. This suggests that tutors responded differently depending on whether students were requesting help or attempting an answer. Third, tutor language reflected both student context and task content. The TF-IDF results showed that responses to guesses included words such as “correct,” “remember,” “close,” and “noun,” while responses to questions included more vocabulary-related and phrase-related words such as “al,” “fronte,” “box,” “blue,” and “blu.” The bigram results also showed many Italian phrases, such as “di fronte,” “dentro la,” and “accanto al.” Overall, the findings suggest that tutor responses in CIMA were highly contextualized. Tutors tended to provide information when students asked questions, correct and guide when students made guesses, and rely heavily on content-specific language to support the Italian learning task.

Based on the findings, this data product can help educational researchers, instructional designers, and AI tutoring developers better understand how tutoring support is organized in simulated online tutoring dialogues. For instructional designers, one potential action is to design feedback templates that are sensitive to student action context. For example, when students ask a question, the tutor system may need to prioritize hint/information reveal; when students make a guess, the feedback may need to combine correction with supportive language. This suggests that feedback design should avoid using one generic response style for all student inputs. For AI tutoring developers, the findings suggest that automated tutoring systems could benefit from a two-step response design: first, identify the student's action type, such as whether the student is asking a question or affirming understanding; second, select a response strategy that fits that context, such as revealing information, asking a follow-up question, giving a correction, or confirming progress. A minimal sketch of this idea appears below.
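
The helper below is hypothetical: suggest_strategy() and its mapping are invented here to illustrate the two-step design, with defaults mirroring the dominant patterns from Analysis 2 rather than a validated policy.

Code
# Hypothetical helper: map a detected student action type to a default
# tutor response strategy (rules mirror Analysis 2, for illustration only).
suggest_strategy <- function(student_context) {
  case_when(
    student_context == "Question"    ~ "Hint / Information Reveal",
    student_context == "Guess"       ~ "Correction combined with guidance",
    student_context == "Affirmation" ~ "Mixed: question, hint, or confirmation",
    TRUE                             ~ "Fallback: ask the student to clarify"
  )
}

suggest_strategy("Guess")
# [1] "Correction combined with guidance"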

For educational researchers, one useful next step is to compare these simulated dialogue patterns with real tutoring interactions. The CIMA dataset provides a structured starting point for studying tutoring strategies, but future research could examine whether similar patterns appear in classroom help-seeking or AI tutoring logs. Researchers could also investigate whether certain combinations of tutor actions, such as correction plus explanation or hint plus question, are related to stronger student engagement or learning outcomes in datasets that include post-response student performance; a small sketch of such a co-occurrence count follows.
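
With the current data, the Boolean action columns in tutor_response_level already make such combinations easy to count; a descriptive starting sketch (not an outcome analysis) might look like this:

Code
# Count responses in which pairs of tutor action labels co-occur.
tutor_response_level |>
  summarise(
    correction_and_hint     = sum(tutor_correction & tutor_hint_info_reveal, na.rm = TRUE),
    hint_and_question       = sum(tutor_hint_info_reveal & tutor_question, na.rm = TRUE),
    correction_and_question = sum(tutor_correction & tutor_question, na.rm = TRUE)
  ) |>
  kable()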

Limitations and Ethical Considerations

This project has several limitations. First, CIMA contains simulated tutoring dialogues produced by crowdworkers, which means the dialogues are not naturally occurring conversations between real tutors and students. The findings should therefore be interpreted as patterns in role-played pedagogical dialogue, not direct evidence of real student learning. Second, the dataset focuses on beginner-level Italian language-learning exercises, so the patterns may not generalize to other subjects, age groups, or learning environments. Additionally, this project is purely descriptive; it does not make causal claims about the effects of tutor strategies on student outcomes. Finally, regarding ethical considerations, this project uses an open-access dataset and reports findings only in aggregate form. While the dataset is simulated and anonymized, it is worth noting that the purpose of this project is not to evaluate individual crowdworkers, tutors, or students, but to understand broader patterns in annotated tutoring responses.

References

Price, L., Richardson, J. T., & Jelfs, A. (2007). Face-to-face versus online tutoring support in distance education. Studies in Higher Education, 32(1), 1-20. https://doi.org/10.1080/03075070601004366

Stasaski, K., Kao, K., & Hearst, M. A. (2020, July). CIMA: A large open access dialogue dataset for tutoring. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 52-64). https://doi.org/10.18653/v1/2020.bea-1.5

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197-221. https://doi.org/10.1080/00461520.2011.611369

Zhang, L., Pan, M., Yu, S., Chen, L., & Zhang, J. (2023). Evaluation of a student-centered online one-to-one tutoring system. Interactive Learning Environments, 31(7), 4251-4269. https://doi.org/10.1080/10494820.2021.1958234