Tutor Response Strategies and Language Patterns in Simulated Online Tutoring Dialogues

Author

Siyu Long

Introduction

One-on-one tutoring is widely regarded as an effective pedagogical intervention for its ability to provide highly personalized and adaptive support (Price et al., 2007; VanLehn, 2011). Tutors may ask questions, provide hints, reveal information, correct misunderstandings, confirm student progress, or combine several forms of support in the same response (Zhang et al., 2023). For researchers and designers of online tutoring systems, understanding the annotated instructional actions and language patterns in tutor responses can provide useful insight into how tutoring support is structured.

This project uses the Corpus of Instructional Management Actions (CIMA), an open-access dataset of simulated tutoring dialogues introduced by Stasaski et al. (2020). The CIMA dataset was collected through crowdsourcing: workers role-played as tutors and students in beginner-level Italian language-learning exercises. The dataset includes prior dialogue context, student action annotations, candidate tutor responses, and tutor action annotations. Tutor actions are coded as Question, Hint/Information Reveal, Correction, Confirmation, and Other; student actions are coded as Guess, Question, Affirmation, and Other.

This project takes a descriptive approach, examining how annotated tutor strategies and response language are distributed across simulated student dialogue contexts and how both vary with the student action that precedes a tutor response. The leading research question of this project is:

How do tutor response strategies and language patterns vary across student action contexts in the CIMA simulated tutoring dataset?

To answer this question, I use four descriptive text-mining and learning-analytics analyses:

  1. Overall frequency of tutor actions.
  2. Tutor action patterns by student action context.
  3. TF-IDF analysis of tutor response words by student context.
  4. Bigram and response-length analysis of tutor feedback language.

Although sentiment analysis and LDA topic modeling are common text-mining techniques, they are not used as primary analyses in this project. Sentiment analysis is less aligned with the research question because tutoring feedback often mixes encouragement and correction, and general-purpose sentiment lexicons may misread pedagogically useful correction as negative tone. LDA topic modeling is also not prioritized because the tutor responses are short and the dataset already includes meaningful action annotations. Instead, this project uses tokenization, TF-IDF, bigram analysis, and action-label comparison because these techniques more directly support the goal of describing tutor response strategies and feedback language.

Data Wrangling

Loading Packages and Data

Code
library(tidyverse)
library(jsonlite)
library(stringr)
library(scales)
library(knitr)
library(tidytext)

Here I loaded the original dataset, which is a JSON file. I used the following code to read the JSON structure and convert it into an R object, setting simplifyVector = FALSE to keep the nested JSON structure as lists.

Code
cima_raw <- fromJSON("dataset.json",
                     simplifyVector = FALSE) 

names(cima_raw)
[1] "prepDataset"  "shapeDataset"

Previewing the dataset shows that the full object contains two sections; prepDataset is the section that contains the tutoring dialogue records used in this project. Each item in prepDataset represents one tutoring dialogue context, so I extracted the prepDataset:

Code
prep_raw <- cima_raw$prepDataset

length(prep_raw) # count the tutoring dialogue records
[1] 1135

I used the following code to select the first tutoring record and to show the fields included in that record.

Code
# Inspect the structure of the first record.
prep_raw[[1]] |> names()
 [1] "past_convo"     "img"            "prep"           "engPrep"       
 [5] "obj"            "engObj"         "color"          "engColor"      
 [9] "grammarRules"   "studentActions" "tutorResponses" "tutorActions"  
[13] "tutorKeys"     

Creating a Dialogue-Level Dataset

The original file is nested. Each dialogue record contains a conversation history, language-learning target variables, student action annotations, candidate tutor responses, and tutor action annotations. I first create a dialogue-level table where each row represents one dialogue context.

Code
dialogues <- tibble(
  dialogue_id = names(prep_raw),
  record = prep_raw
) |>
  mutate(
    past_convo = map(record, "past_convo"),
    prep = map_chr(record, "prep"),
    engPrep = map_chr(record, "engPrep"),
    obj = map_chr(record, "obj"),
    engObj = map_chr(record, "engObj"),
    color = map_chr(record, "color"),
    engColor = map_chr(record, "engColor"),
    grammarRules = map_chr(record, "grammarRules"),
    studentActions = map(record, "studentActions"),
    tutorResponses = map(record, "tutorResponses"),
    tutorActions = map(record, "tutorActions")
  ) |>
  select(-record)

glimpse(dialogues)
Rows: 1,135
Columns: 12
$ dialogue_id    <chr> "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",…
$ past_convo     <named list> ["\"Pink\" is \"rosa\". Please try to fill in th…
$ prep           <chr> "e dietro", "e accanto al", "e vicino", "e accanto al",…
$ engPrep        <chr> "is behind the", "is next to the", "is next to the", "i…
$ obj            <chr> "l'albero", "letto", "all'albero", "coniglio", "letto",…
$ engObj         <chr> "tree", "bed", "tree", "bunny", "bed", "table", "bag", …
$ color          <chr> "rosa", "rosa", "blu", "giallo", "giallo", "rosa", "gia…
$ engColor       <chr> "pink", "pink", "blue", "yellow", "yellow", "pink", "ye…
$ grammarRules   <chr> "[[\"l' (\\\"the\\\") is prepended to the following wor…
$ studentActions <named list> ["False", "True", "False", "False"], ["True", "F…
$ tutorResponses <named list> ["Look at your order of words again. Adjectives …
$ tutorActions   <named list> [[FALSE, FALSE, FALSE, FALSE, TRUE], [TRUE, FALS…

Creating a Tutor-Response-Level Dataset

Because each dialogue context includes several tutor responses, I reshape the data so that each row represents one tutor response. This tutor-response-level dataset is the main analytic dataset for this project.

I created a small helper function called safe_action() to extract tutor action labels from nested action vectors. This is necessary because in the CIMA data, tutorActions is stored as a list of TRUE/FALSE values, where each position in the vector represents one tutor action category: 1 = Question, 2 = Hint/Information Reveal, 3 = Correction, 4 = Confirmation, and 5 = Other. The function also prevents indexing errors if an action vector is shorter than expected.

Code
# Helper function for safely extracting logical values from action vectors.
safe_action <- function(x, i) {
  if (length(x) >= i) {
    return(as.logical(x[[i]]))
  } else {
    return(NA)
  }
}
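
As a quick illustration (not part of the pipeline), the helper returns the value at the requested position when it exists and NA when the vector is too short:

Code
safe_action(list(TRUE, FALSE, FALSE, FALSE, TRUE), 2)
# [1] FALSE

safe_action(list(TRUE, FALSE), 5)
# [1] NA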

The following code reshapes the nested CIMA data into a tutor-response-level dataset. Because each dialogue context includes multiple candidate tutor responses, I use unnest_longer() to place each tutor response in its own row. I also convert the student and tutor action vectors into readable Boolean variables. This transformation is necessary for the later analyses.

Code
tutor_response_level <- tibble(
  dialogue_id = names(prep_raw),
  record = prep_raw
) |>
  mutate(
    past_convo = map(record, "past_convo"),
    studentActions = map(record, "studentActions"),
    tutorResponses = map(record, "tutorResponses"),
    tutorActions = map(record, "tutorActions")
  ) |>
  select(dialogue_id, past_convo, studentActions, tutorResponses, tutorActions) |>
  mutate(
    student_guess = map_lgl(studentActions, ~ .x[[1]] == "True"),
    student_question = map_lgl(studentActions, ~ .x[[2]] == "True"),
    student_affirmation = map_lgl(studentActions, ~ .x[[3]] == "True"),
    student_other = map_lgl(studentActions, ~ .x[[4]] == "True")
  ) |>
  mutate(
    student_context = case_when(
      student_guess ~ "Guess",
      student_question ~ "Question",
      student_affirmation ~ "Affirmation",
      student_other ~ "Other",
      TRUE ~ "Unlabeled"
    )
  ) |>
  unnest_longer(
    tutorResponses,
    indices_to = "response_id",
    values_to = "tutor_response"
  ) |>
  mutate(
    tutor_action_vector = map2(tutorActions, response_id, ~ .x[[.y]])
  ) |>
  mutate(
    tutor_question = map_lgl(tutor_action_vector, ~ safe_action(.x, 1)),
    tutor_hint_info_reveal = map_lgl(tutor_action_vector, ~ safe_action(.x, 2)),
    tutor_correction = map_lgl(tutor_action_vector, ~ safe_action(.x, 3)),
    tutor_confirmation = map_lgl(tutor_action_vector, ~ safe_action(.x, 4)),
    tutor_other = map_lgl(tutor_action_vector, ~ safe_action(.x, 5))
  ) |>
  select(
    dialogue_id,
    response_id,
    tutor_response,
    student_context,
    student_guess:student_other,
    tutor_question:tutor_other
  )

glimpse(tutor_response_level)
Rows: 3,315
Columns: 13
$ dialogue_id            <chr> "0", "0", "0", "1", "1", "1", "2", "2", "2", "3…
$ response_id            <int> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,…
$ tutor_response         <chr> "Look at your order of words again. Adjectives …
$ student_context        <chr> "Question", "Question", "Question", "Guess", "G…
$ student_guess          <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, F…
$ student_question       <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, F…
$ student_affirmation    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ student_other          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ tutor_question         <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ tutor_hint_info_reveal <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
$ tutor_correction       <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, T…
$ tutor_confirmation     <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TR…
$ tutor_other            <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…

Checking the Reshaped Data

After creating the tutor-response-level dataset, I first checked its overall structure. This summary reports both the number of unique dialogue contexts and the number of tutor responses. I also calculated the average number of candidate responses per context to better understand the structure of the reshaped data.

Code
tutor_response_level |>
  summarise(
    n_dialogue_contexts = n_distinct(dialogue_id),
    n_candidate_tutor_responses = n(),
    avg_responses_per_context = n_candidate_tutor_responses / n_dialogue_contexts
  ) |>
  kable(digits = 2)
n_dialogue_contexts n_candidate_tutor_responses avg_responses_per_context
1135 3315 2.92

I then checked the distribution of student action contexts. If one context appears much more frequently than others, later comparisons should be interpreted with this imbalance in mind. Here, each count represents the number of candidate tutor responses related to a given student context.

Code
tutor_response_level |>
  count(student_context, sort = TRUE) |>
  kable()
student_context n
Question 1571
Guess 1496
Affirmation 242
Other 6

Creating Long-Format Tutor Action Data

For frequency and visualization, I reshape the tutor action columns into a long format.

Code
tutor_actions_long <- tutor_response_level |>
  pivot_longer(
    cols = tutor_question:tutor_other,
    names_to = "tutor_action",
    values_to = "action_present"
  ) |>
  mutate(
    tutor_action = recode(
      tutor_action,
      tutor_question = "Question",
      tutor_hint_info_reveal = "Hint / Information Reveal",
      tutor_correction = "Correction",
      tutor_confirmation = "Confirmation",
      tutor_other = "Other"
    )
  ) |>
  filter(action_present == TRUE)

tutor_actions_long |>
  count(tutor_action, sort = TRUE) |>
  kable()
tutor_action n
Hint / Information Reveal 1986
Correction 957
Question 943
Confirmation 483
Other 62

Analysis

Analysis 1: Overall Distribution of Tutor Actions

This first analysis provides a descriptive baseline. Before comparing tutor strategies across student contexts, it is useful to understand which tutor actions appear most frequently in the dataset overall.

Code
tutor_action_counts <- tutor_actions_long |>
  count(tutor_action, sort = TRUE) |>
  mutate(
    proportion = n / sum(n)
  )

tutor_action_counts |>
  mutate(proportion = percent(proportion, accuracy = 0.1)) |>
  kable()
tutor_action n proportion
Hint / Information Reveal 1986 44.8%
Correction 957 21.6%
Question 943 21.3%
Confirmation 483 10.9%
Other 62 1.4%
Code
ggplot(tutor_action_counts, aes(x = reorder(tutor_action, n), y = n)) +
  geom_col(fill = "salmon") +
  coord_flip() +
  labs(
    title = "Overall Distribution of Tutor Actions",
    x = "Tutor action",
    y = "Number of action labels"
  ) +
  theme_minimal(base_size = 13)

From the above analysis, the most common tutor action in the dataset was Hint/Information Reveal, which appeared 1,986 times and accounted for 44.8% of all tutor action labels. This suggests that simulated tutors most often responded by providing students with information, reminders, or partial guidance rather than only evaluating their answers. Correction and Question appeared at similar levels, each making up around 21% of the action labels. This indicates that tutors also frequently corrected student responses and prompted students to think further. However, Confirmation and Other were less common.

Analysis 2: Tutor Actions by Student Context

My second analysis compares tutor action patterns across student action contexts. This helps show whether tutor responses differ when the prior student action is coded as a guess, question, affirmation, or other.

Code
actions_by_context <- tutor_actions_long |>
  count(student_context, tutor_action) |>
  group_by(student_context) |>
  mutate(
    context_total = sum(n),
    proportion = n / context_total
  ) |>
  ungroup()

actions_by_context |>
  mutate(proportion = percent(proportion, accuracy = 0.1)) |>
  arrange(student_context, desc(n)) |>
  kable()
student_context tutor_action n context_total proportion
Affirmation Question 113 314 36.0%
Affirmation Hint / Information Reveal 71 314 22.6%
Affirmation Correction 63 314 20.1%
Affirmation Confirmation 61 314 19.4%
Affirmation Other 6 314 1.9%
Guess Correction 813 2214 36.7%
Guess Hint / Information Reveal 604 2214 27.3%
Guess Question 450 2214 20.3%
Guess Confirmation 325 2214 14.7%
Guess Other 22 2214 1.0%
Other Question 5 7 71.4%
Other Hint / Information Reveal 2 7 28.6%
Question Hint / Information Reveal 1309 1896 69.0%
Question Question 375 1896 19.8%
Question Confirmation 97 1896 5.1%
Question Correction 81 1896 4.3%
Question Other 34 1896 1.8%
Code
ggplot(actions_by_context, aes(x = reorder(tutor_action, proportion), y = proportion)) +
  geom_col(fill = "skyblue2") +
  coord_flip() +
  facet_wrap(~ student_context) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Tutor Actions by Student Action Context",
    subtitle = "Proportions are calculated within each student context",
    x = "Tutor action",
    y = "Proportion of tutor action labels"
  ) +
  theme_minimal(base_size = 13)

The above results show that tutor action patterns varied noticeably across student action contexts. When students asked questions, tutor responses were dominated by Hint/Information Reveal actions, which made up 69.0% of action labels in that context. In this case, tutors usually responded to student questions by directly providing information or guidance. When students made guesses, correction was the most common tutor action at 36.7%. In affirmation contexts, tutor actions were more evenly distributed across questions, hints, corrections, and confirmations. In this case, tutors used a wider range of strategies when students acknowledged or confirmed something.

Analysis 3: TF-IDF of Tutor Response Words by Student Context

My third analysis uses TF-IDF (term frequency-inverse document frequency) to identify words that are especially distinctive in tutor responses to different student action contexts. Unlike simple word frequency, TF-IDF highlights words that are more characteristic of one context compared with the others.
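
Before applying the metric to the data, a minimal toy sketch (with invented counts, not drawn from CIMA) may help show how bind_tf_idf() behaves:

Code
# Toy illustration (hypothetical counts): a word used in every context
# gets idf = 0, while a context-specific word gets a positive score.
toy_counts <- tribble(
  ~context,   ~word,      ~n,
  "Guess",    "remember",  3,
  "Guess",    "box",       2,
  "Question", "box",       2,
  "Question", "fronte",    4
)

toy_counts |> bind_tf_idf(word, context, n)
# "box" appears in both contexts, so idf = ln(2/2) = 0 and tf_idf = 0;
# "remember" and "fronte" each appear in one context, so idf = ln(2/1) ≈ 0.69.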

Code
tutor_words <- tutor_response_level |>
  unnest_tokens(word, tutor_response) |>
  anti_join(stop_words, by = "word") |>
  filter(!str_detect(word, "^[0-9]+$")) |>
  filter(str_length(word) > 1)

tfidf_by_context <- tutor_words |>
  count(student_context, word, sort = TRUE) |>
  bind_tf_idf(word, student_context, n) |>
  arrange(desc(tf_idf))

tfidf_by_context |>
  group_by(student_context) |>
  slice_max(tf_idf, n = 10, with_ties = FALSE) |>
  ungroup() |>
  arrange(student_context, desc(tf_idf)) |>
  kable(digits = 4)
student_context word n tf idf tf_idf
Affirmation remember 30 0.0360 0.2877 0.0104
Affirmation correct 28 0.0336 0.2877 0.0097
Affirmation sentence 26 0.0312 0.2877 0.0090
Affirmation noun 24 0.0288 0.2877 0.0083
Affirmation box 20 0.0240 0.2877 0.0069
Affirmation dog 18 0.0216 0.2877 0.0062
Affirmation fill 18 0.0216 0.2877 0.0062
Affirmation blank 17 0.0204 0.2877 0.0059
Affirmation cane 16 0.0192 0.2877 0.0055
Affirmation bunny 14 0.0168 0.2877 0.0048
Guess correct 258 0.0401 0.2877 0.0115
Guess remember 256 0.0398 0.2877 0.0115
Guess noun 186 0.0289 0.2877 0.0083
Guess close 173 0.0269 0.2877 0.0077
Guess il 153 0.0238 0.2877 0.0068
Guess box 140 0.0218 0.2877 0.0063
Guess al 129 0.0201 0.2877 0.0058
Guess la 126 0.0196 0.2877 0.0056
Guess scatola 126 0.0196 0.2877 0.0056
Guess yellow 98 0.0152 0.2877 0.0044
Other pianta 1 0.0476 0.2877 0.0137
Other translate 1 0.0476 0.2877 0.0137
Other phrase 3 0.1429 0.0000 0.0000
Other plant 3 0.1429 0.0000 0.0000
Other green 2 0.0952 0.0000 0.0000
Other tree 2 0.0952 0.0000 0.0000
Other words 2 0.0952 0.0000 0.0000
Other color 1 0.0476 0.0000 0.0000
Other dietro 1 0.0476 0.0000 0.0000
Other hint 1 0.0476 0.0000 0.0000
Question al 174 0.0349 0.2877 0.0100
Question fronte 119 0.0239 0.2877 0.0069
Question la 119 0.0239 0.2877 0.0069
Question di 114 0.0229 0.2877 0.0066
Question front 113 0.0227 0.2877 0.0065
Question remember 109 0.0219 0.2877 0.0063
Question scatola 107 0.0215 0.2877 0.0062
Question box 105 0.0211 0.2877 0.0061
Question blue 103 0.0207 0.2877 0.0059
Question blu 92 0.0185 0.2877 0.0053
Code
tfidf_by_context |>
  group_by(student_context) |>
  slice_max(tf_idf, n = 10, with_ties = FALSE) |>
  ungroup() |>
  ggplot(aes(x = reorder_within(word, tf_idf, student_context), y = tf_idf)) +
  geom_col(fill = "lightpink3") +
  coord_flip() +
  facet_wrap(~ student_context, scales = "free") +
  scale_x_reordered() +
  labs(
    title = "Distinctive Tutor Response Words by Student Context",
    x = "Word",
    y = "TF-IDF"
  ) +
  theme_minimal(base_size = 13)

The TF-IDF results show that tutor responses used somewhat different language depending on the student action context. (Because there are only four student contexts, many words share the idf value ln(4/3) ≈ 0.2877, meaning they appear in three of the four contexts; words appearing in all four contexts receive an idf of 0.) In Guess contexts, distinctive words such as “correct,” “remember,” “noun,” and “close” suggest that tutors often responded to student attempts by evaluating the answer and guiding revision. In Question contexts, words such as “al,” “fronte,” “di,” “box,” “blue,” and “blu” reflect vocabulary and phrase-level information reveal. In Affirmation contexts, words such as “remember,” “correct,” “sentence,” and “noun” suggest that tutors often moved from student acknowledgment toward reinforcing grammar rules or prompting students to complete the sentence.

Analysis 4: Common Bigrams in Tutor Responses

Single words are informative, but tutoring language often appears in short phrases. This analysis therefore identifies the most common two-word phrases (bigrams) in tutor responses.

Code
tutor_bigrams <- tutor_response_level |>
  unnest_tokens(bigram, tutor_response, token = "ngrams", n = 2) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(
    !word1 %in% stop_words$word,
    !word2 %in% stop_words$word,
    !str_detect(word1, "^[0-9]+$"),
    !str_detect(word2, "^[0-9]+$")
  ) |>
  unite(bigram, word1, word2, sep = " ")

top_bigrams <- tutor_bigrams |>
  count(bigram, sort = TRUE) |>
  slice_max(n, n = 20, with_ties = FALSE)

top_bigrams |>
  kable()
bigram n
di fronte 194
dentro la 84
dietro la 79
accanto al 78
fronte al 67
fronte alla 66
cima al 61
italian word 54
color words 50
words follow 47
il gatto 43
adjectives follow 41
sotto il 40
il coniglio 37
la scatola 37
il cane 34
dietro il 31
correct word 24
vicino al 24
vicino alla 19
Code
ggplot(top_bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "#69b3a2") +
  coord_flip() +
  labs(
    title = "Most Common Bigrams in Tutor Responses",
    x = "Bigram",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 13)

Before running the bigram analysis, I expected that many of the common two-word phrases would include general tutoring expressions such as “try again,” “do you,” “word for,” or “remember that.” However, the results show that many of the most common two-word phrases in tutor responses were related to the Italian language-learning content, such as “di fronte,” “dentro la,” “dietro la,” and “accanto al.” Other common bigrams, such as “color words,” “words follow,” and “adjectives follow,” point to recurring grammar explanations about Italian word order. This pattern likely reflects the specific nature of the CIMA dataset. Because the tutoring tasks focus on beginner-level Italian phrase completion, tutor responses often repeat or reveal the target Italian vocabulary and grammar structures. In this case, the bigram analysis shows that tutor feedback was highly content-focused with vocabulary support and brief grammar reminders.

Analysis 5: Tutor Response Length by Tutor Action

As a simple measure of feedback elaboration, I compare the number of words in tutor responses across tutor action labels. Some actions, such as correction or hint/information reveal, may require more elaboration than a brief confirmation.

Code
response_length_data <- tutor_response_level |>
  mutate(
    response_word_count = str_count(tutor_response, "\\S+")
  ) |>
  pivot_longer(
    cols = tutor_question:tutor_other,
    names_to = "tutor_action",
    values_to = "action_present"
  ) |>
  filter(action_present == TRUE) |>
  mutate(
    tutor_action = recode(
      tutor_action,
      tutor_question = "Question",
      tutor_hint_info_reveal = "Hint / Information Reveal",
      tutor_correction = "Correction",
      tutor_confirmation = "Confirmation",
      tutor_other = "Other"
    )
  )

response_length_data |>
  group_by(tutor_action) |>
  summarise(
    n = n(),
    mean_words = mean(response_word_count, na.rm = TRUE),
    median_words = median(response_word_count, na.rm = TRUE)
  ) |>
  arrange(desc(mean_words)) |>
  kable(digits = 2)
tutor_action n mean_words median_words
Other 62 22.19 20
Correction 957 13.04 12
Question 943 12.28 11
Hint / Information Reveal 1986 9.56 8
Confirmation 483 9.39 8
Code
ggplot(response_length_data, aes(x = tutor_action, y = response_word_count)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Tutor Response Length by Tutor Action",
    x = "Tutor action",
    y = "Number of words in tutor response"
  ) +
  theme_minimal(base_size = 13)

From the above analysis, tutor response length varied across tutor action types. Responses labeled as Other had the highest average and median word counts, but this category was much smaller than the others. Among the more common action types, correction and question tended to be slightly longer than Hint/Information Reveal and Confirmation. The boxplot also shows several long-response outliers, especially for correction and hint/information reveal. In this case, some tutor responses provided more detailed explanations than the typical short feedback response.

Findings

In the CIMA simulated tutoring dataset, the analyses show that tutor responses varied in three main ways. First, tutors most often used Hint/Information Reveal. This suggests that simulated tutors usually supported students by giving vocabulary, grammar reminders, or partial guidance rather than only confirming or correcting answers. Second, student action context shaped tutor strategy. When students asked questions, tutors mostly responded with Hint/Information Reveal. When students made guesses, tutors most often used Correction, frequently combined with guidance. This suggests that tutors responded differently depending on whether students were requesting help or attempting an answer. Third, tutor language reflected both student context and task content. The TF-IDF results showed that responses to guesses included words such as “correct,” “remember,” “close,” and “noun,” while responses to questions included more vocabulary-related and phrase-related words such as “al,” “fronte,” “box,” “blue,” and “blu.” The bigram results also showed many Italian phrases, such as “di fronte,” “dentro la,” and “accanto al.” Overall, the findings suggest that tutor responses in CIMA were highly contextualized. Tutors tended to provide information when students asked questions, correct and guide when students made guesses, and rely heavily on content-specific language to support the Italian learning task.

Based on the findings, this data product can help educational researchers, instructional designers, and AI tutoring developers better understand how tutoring support is organized in simulated online tutoring dialogues. For instructional designers, one potential action is to design feedback templates that are sensitive to student action context. For example, when students ask a question, the tutor system may need to prioritize hint/information reveal; when students make a guess, the feedback may need to combine correction with supportive language. This suggests that feedback design should avoid using one generic response style for all student inputs. For AI tutoring developers, the findings suggest that automated tutoring systems could benefit from a two-step response design: first, identify the student's action type, such as whether the student is asking a question or affirming understanding; second, select a response strategy that fits that context, such as revealing information, asking a follow-up question, giving a correction, or confirming progress. A minimal sketch of this idea appears below.
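
The helper below is hypothetical: suggest_strategy() and its mapping are invented here to illustrate the two-step design, with defaults mirroring the dominant patterns from Analysis 2 rather than a validated policy.

Code
# Hypothetical helper: map a detected student action type to a default
# tutor response strategy (rules mirror Analysis 2, for illustration only).
suggest_strategy <- function(student_context) {
  case_when(
    student_context == "Question"    ~ "Hint / Information Reveal",
    student_context == "Guess"       ~ "Correction combined with guidance",
    student_context == "Affirmation" ~ "Mixed: question, hint, or confirmation",
    TRUE                             ~ "Fallback: ask the student to clarify"
  )
}

suggest_strategy("Guess")
# [1] "Correction combined with guidance"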

For educational researchers, one useful next step is to compare these simulated dialogue patterns with real tutoring interactions. The CIMA dataset provides a structured starting point for studying tutoring strategies, but future research could examine whether similar patterns appear in classroom help-seeking or AI tutoring logs. Researchers could also investigate whether certain combinations of tutor actions, such as correction plus explanation or hint plus question, are related to stronger student engagement or learning outcomes in datasets that include post-response student performance; a small sketch of such a co-occurrence count follows.
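
With the current data, the Boolean action columns in tutor_response_level already make such combinations easy to count; a descriptive starting sketch (not an outcome analysis) might look like this:

Code
# Count responses in which pairs of tutor action labels co-occur.
tutor_response_level |>
  summarise(
    correction_and_hint     = sum(tutor_correction & tutor_hint_info_reveal, na.rm = TRUE),
    hint_and_question       = sum(tutor_hint_info_reveal & tutor_question, na.rm = TRUE),
    correction_and_question = sum(tutor_correction & tutor_question, na.rm = TRUE)
  ) |>
  kable()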

Limitations and Ethical Considerations

This project has several limitations. First, CIMA contains simulated tutoring dialogues produced by crowdworkers, which means the dialogues are not naturally occurring conversations between real tutors and students. The findings should therefore be interpreted as patterns in role-played pedagogical dialogue, not direct evidence of real student learning. Second, the dataset focuses on beginner-level Italian language-learning exercises, so the patterns may not generalize to other subjects, age groups, or learning environments. Additionally, this project is purely descriptive; it does not make causal claims about the effects of tutor strategies on student outcomes. Finally, regarding ethical considerations, this project uses an open-access dataset and reports findings only in aggregate form. While the dataset is simulated and anonymized, it is worth noting that the purpose of this project is not to evaluate individual crowdworkers, tutors, or students, but to understand broader patterns in annotated tutoring responses.

References

Price, L., Richardson, J. T., & Jelfs, A. (2007). Face-to-face versus online tutoring support in distance education. Studies in Higher Education, 32(1), 1-20. https://doi.org/10.1080/03075070601004366

Stasaski, K., Kao, K., & Hearst, M. A. (2020, July). CIMA: A large open access dialogue dataset for tutoring. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 52-64). https://doi.org/10.18653/v1/2020.bea-1.5

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197-221. https://doi.org/10.1080/00461520.2011.611369

Zhang, L., Pan, M., Yu, S., Chen, L., & Zhang, J. (2023). Evaluation of a student-centered online one-to-one tutoring system. Interactive Learning Environments, 31(7), 4251-4269. https://doi.org/10.1080/10494820.2021.1958234