Sofia Andersen Garreffa

Introduction

Duolingo is an interactive app that helps people learn languages by using technology to give them access to a one-on-one experience. The app uses a science-backed approach intended to deliver measurable results to the user. It is accessible from anywhere in the world for free, since the company's idea is that everyone should have access to a platform for learning a language. In this project I have decided to focus on users learning Italian.

Research Questions

1. Which Italian lexical items show a high rate of occurrence in the Duolingo data set?

2. Does repetition have an effect on the number of correct and incorrect responses?

Data

This large data set contains more than 12 million rows of learning traces from Duolingo. Each row shows how a user interacts with a particular language and vocabulary item. The data set includes variables such as:

  1. learning_language -> the language that the user is learning
  2. lexeme_string -> the vocabulary item (lexeme tag) that the user is practicing
  3. history_correct -> the total number of times the learner answered this item correctly before the current session
  4. session_seen -> the number of times the learner saw the item during the current session
  5. session_correct -> the number of times the learner answered the item correctly during the current session
  6. timestamp
  7. user_id

Since the data set was pre-processed, personal identifying details were removed while the learning patterns remain untouched. Data set link: https://www.kaggle.com/datasets/charitygithogora/cleaned-duolingo-learning-data/data
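To make the relationship between the count variables concrete, here is a minimal sketch on a tiny invented data frame. The rows and item names are hypothetical and are not taken from the real data set. It only assumes that the session and history columns are counts, as described above.

```r
library(dplyr)

# Hypothetical toy rows, invented for illustration only
toy <- data.frame(
  lexeme_string   = c("item_a", "item_b"),
  history_seen    = c(10, 4),
  history_correct = c(8, 4),
  session_seen    = c(3, 2),
  session_correct = c(2, 2),
  stringsAsFactors = FALSE
)

toy_checked <- toy %>%
  mutate(
    # Responses in the current session that were not correct
    session_incorrect = session_seen - session_correct,
    # Share of correct responses within the current session
    session_accuracy  = session_correct / session_seen,
    # Share of correct responses before the current session
    history_accuracy  = history_correct / history_seen
  )

toy_checked
```

Deriving an incorrect count and an accuracy share in this way keeps the amount of practice (how often an item was seen) separate from how well it was answered.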

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cleaned_duolingo <- read.csv("/Users/Sofia/Desktop/cleaned_duolingo.csv")
names(cleaned_duolingo)
##  [1] "p_recall"          "timestamp"         "delta"            
##  [4] "user_id"           "learning_language" "ui_language"      
##  [7] "lexeme_id"         "lexeme_string"     "history_seen"     
## [10] "history_correct"   "session_seen"      "session_correct"
# Count how often each session_correct value occurs and plot the distribution
cleaned_duolingo %>%
  count(session_correct) %>%
  ggplot(aes(x = factor(session_correct), y = n)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Number of Correct Answers during a Duolingo Session",
    x = "Number of correct answers per session",
    y = "Count"
  )

This first visualization shows that the distribution of correct answers is right-skewed: most records contain only one to three correct responses for the session. This suggests that Duolingo's learning platform relies on short learning activities. The x-axis shows the number of correct answers per session and the y-axis shows how many records have that value.
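The visual impression that most records fall between one and three correct responses can also be checked numerically. The sketch below uses invented values for session_correct, so the resulting share is illustrative only. The same steps could be applied to cleaned_duolingo.

```r
library(dplyr)

# Hypothetical session_correct values, invented for illustration only
toy <- data.frame(session_correct = c(1, 1, 1, 2, 2, 3, 1, 5, 8, 2))

# Relative frequency of each session_correct value
dist <- toy %>%
  count(session_correct) %>%
  mutate(share = n / sum(n))

# Combined share of records with one to three correct responses
low_share <- dist %>%
  filter(session_correct %in% 1:3) %>%
  summarise(share_1_to_3 = sum(share)) %>%
  pull(share_1_to_3)

low_share
```

Reporting this share next to the bar plot turns "most" into a concrete proportion.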

# Keep only the rows where the language being learned is Italian
italian_only <- cleaned_duolingo %>%
  filter(learning_language == "it")
italian_only %>%
  count(session_correct) %>%
  ggplot(aes(x = session_correct, y = n)) +
  geom_col() +
  labs(
    title = "Number of Correct Answers: Italian Learners",
    x = "session_correct value",
    y = "Count"
  )

Since the data set includes learners of several different languages, only Italian was selected for the purpose of this research. This bar plot shows how often each number of correct answers per session occurs among learners of Italian.

# Italian rows only, with the number of incorrect responses and a verb flag
it_only <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  mutate(
    incorrect = session_seen - session_correct,
    is_verb   = str_detect(lexeme_string, "<vblex>")
  )
# Inspect the lexeme tags
it_only %>%
  select(lexeme_string)
# Verb rows with the columns of interest
verb <- it_only %>%
  filter(is_verb) %>%
  select(lexeme_string, history_seen, session_correct, incorrect)
# Per-verb totals and error rate, sorted from highest to lowest error rate
verb_errors <- it_only %>%
  filter(is_verb) %>%
  group_by(lexeme_string) %>%
  summarise(
    total_seen      = sum(session_seen, na.rm = TRUE),
    total_correct   = sum(session_correct, na.rm = TRUE),
    total_incorrect = sum(incorrect, na.rm = TRUE),
    error_rate      = total_incorrect / total_seen
  ) %>%
  arrange(desc(error_rate))
head(verb_errors, 20)

The data contained a very large pool of words (lexeme_string) that learners practiced on the platform. This table shows only the verbs (vblex) that were practiced, across a variety of users.
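The verb filter relies on part-of-speech tags embedded in lexeme_string. As a minimal sketch of how that matching behaves, the strings below are invented to imitate a word followed by angle-bracket tags. Their exact form is an assumption and should be compared with the real values in the data.

```r
library(stringr)

# Hypothetical lexeme strings, invented to imitate the tag pattern
lexemes <- c(
  "mangio/mangiare<vblex><pri><p1><sg>",
  "ragazzo/ragazzo<n><m><sg>",
  "il/il<det><def><m><sg>"
)

# fixed() matches the tag literally instead of as a regular expression
is_verb <- str_detect(lexemes, fixed("<vblex>"))
is_noun <- str_detect(lexemes, fixed("<n>"))

# Everything before the first tag, useful as a readable label
label <- str_extract(lexemes, "^[^<]+")

data.frame(label, is_verb, is_noun)
```

Because the match is literal, only rows containing exactly `<vblex>` are flagged as verbs. If the data set uses other tags for some verbs, for example auxiliaries or the copula (an assumption worth checking by listing the distinct tags), those rows would not be counted as verbs here.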

ggplot(verb_errors, aes(x=total_seen, y=total_correct))+
  geom_line(linewidth = 0.5)

ggplot(verb_errors, aes(x=total_seen, y=total_incorrect))+
  geom_line(linewidth = 0.5)

These two plots examine the relationship between the number of times Italian verbs were seen by learners and the number of correct and incorrect responses associated with those verbs. Verbs that are seen more frequently tend to show higher numbers of both correct and incorrect responses. In other words, more exposure gives the learner more opportunities for both correct and incorrect responses, so the raw counts reflect the amount of practice and not only accuracy.
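Because raw counts grow with exposure, one way to separate practice from accuracy is to look at the proportion of correct responses per verb. The sketch below uses invented per-verb totals (the verb names and numbers are hypothetical) and a Spearman rank correlation. This is one possible choice of measure, not the method used in this analysis.

```r
library(dplyr)

# Hypothetical per-verb totals, invented for illustration only
toy_verbs <- data.frame(
  lexeme_string = c("verb_a", "verb_b", "verb_c", "verb_d"),
  total_seen    = c(10, 100, 1000, 5000),
  total_correct = c(8, 85, 900, 4400),
  stringsAsFactors = FALSE
)

toy_acc <- toy_verbs %>%
  mutate(
    total_incorrect = total_seen - total_correct,
    accuracy        = total_correct / total_seen
  )

# Raw incorrect counts rise with exposure almost by construction
rho_counts <- cor(toy_acc$total_seen, toy_acc$total_incorrect, method = "spearman")

# Accuracy is a proportion, so it is not tied to exposure in the same way
rho_accuracy <- cor(toy_acc$total_seen, toy_acc$accuracy, method = "spearman")

c(counts = rho_counts, accuracy = rho_accuracy)
```

In this toy example the counts are perfectly rank-correlated with exposure, while accuracy is related to exposure less strongly. Applied to verb_errors, the accuracy column would address the second research question more directly than the raw counts do.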

ggplot(verb_errors, aes(x = total_seen)) +
  geom_line(aes(y = total_correct, linetype = "Correct"), linewidth = 0.5) +
  geom_line(aes(y = total_incorrect, linetype = "Incorrect"), linewidth = 0.5) +
  labs(
    x = "Total times a verb was seen",
    y = "Total number of responses",
    linetype = "Response"
  )

To compare the two visualizations, I have chosen to overlay correct and incorrect responses in a single plot that shows the relationship between exposure and learner responses for Italian verbs. Overlaying the two lines makes it easier to compare how both outcomes change as exposure increases. It also shows that correct responses rise more than incorrect responses. Frequent exposure is therefore associated with more successful answers, even though mistakes still occur.

# Italian rows only, with the number of incorrect responses and a noun flag
it_only <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  mutate(
    incorrect = session_seen - session_correct,
    is_n      = str_detect(lexeme_string, "<n>")
  )
# Noun rows with the columns of interest
noun <- it_only %>%
  filter(is_n) %>%
  select(lexeme_string, history_seen, session_correct, incorrect)
# Per-noun totals and error rate, sorted from highest to lowest error rate
noun_errors <- it_only %>%
  filter(is_n) %>%
  group_by(lexeme_string) %>%
  summarise(
    total_seen      = sum(session_seen, na.rm = TRUE),
    total_correct   = sum(session_correct, na.rm = TRUE),
    total_incorrect = sum(incorrect, na.rm = TRUE),
    error_rate      = total_incorrect / total_seen
  ) %>%
  arrange(desc(error_rate))

ggplot(noun_errors, aes(x = total_seen, y = total_correct)) +
  geom_line(linewidth = 0.5)

ggplot(noun_errors, aes(x = total_seen, y = total_incorrect)) +
  geom_line(linewidth = 0.5)

Like the previous visualizations, these plots show the relationship between the number of times Italian noun lexemes were seen by learners and the number of correct and incorrect answers given. As with the verbs, nouns that appear more frequently show higher numbers of both correct and incorrect responses. This shows that repetition increases the learner's opportunities to respond and does not by itself indicate accuracy.
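Since the verb and noun summaries follow the same steps, they could be produced by one small helper. The function below, summarise_by_tag, is a hypothetical name introduced only for this sketch, and the toy rows are invented. It assumes the columns lexeme_string, session_seen and session_correct.

```r
library(dplyr)
library(stringr)

# Hypothetical helper, not part of the original analysis:
# totals and error rate per lexeme for rows containing a given tag
summarise_by_tag <- function(data, tag) {
  data %>%
    filter(str_detect(lexeme_string, fixed(tag))) %>%
    mutate(incorrect = session_seen - session_correct) %>%
    group_by(lexeme_string) %>%
    summarise(
      total_seen      = sum(session_seen, na.rm = TRUE),
      total_correct   = sum(session_correct, na.rm = TRUE),
      total_incorrect = sum(incorrect, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(error_rate = total_incorrect / total_seen) %>%
    arrange(desc(error_rate))
}

# Invented toy rows for illustration only
toy <- data.frame(
  lexeme_string   = c("parlo/parlare<vblex>", "parlo/parlare<vblex>",
                      "cane/cane<n>", "gatto/gatto<n>"),
  session_seen    = c(2, 3, 4, 2),
  session_correct = c(1, 3, 4, 1),
  stringsAsFactors = FALSE
)

verb_toy <- summarise_by_tag(toy, "<vblex>")
noun_toy <- summarise_by_tag(toy, "<n>")
```

With the real data, calls such as summarise_by_tag(it_only, "<vblex>") and summarise_by_tag(it_only, "<n>") would be expected to give the same kind of tables as the separate verb and noun blocks, while keeping the two summaries guaranteed to use identical steps.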

it_freq <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  group_by(lexeme_string) %>%
  summarise(
    total_seen = sum(session_seen, na.rm = TRUE)
  ) %>%
  arrange(desc(total_seen))
top_words <- it_freq %>%
  slice_max(total_seen, n = 20)

# Horizontal bar chart of the 20 most frequently seen Italian lexemes
ggplot(top_words, aes(x = reorder(lexeme_string, total_seen),
                      y = total_seen)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Most Frequent Italian Words in the Duolingo Data",
    x = "Italian lexeme",
    y = "Total times a word is seen"
  ) +
  theme_minimal()

Here it is possible to see which words appear most frequently in the Italian portion of the data set. The most frequent items are function words such as pronouns and determiners, forms of the verb “essere”, and nouns. This pattern gives insight into how the Duolingo platform structures learning, not into the differences between the words shown.

# 1) Italian only + POS tag
it_pos <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  mutate(
    pos = case_when(
      str_detect(lexeme_string, "<vblex>") ~ "Verb",
      str_detect(lexeme_string, "<n>")     ~ "Noun",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(pos))

# 2) Frequency per lexeme within each POS
pos_freq <- it_pos %>%
  group_by(pos, lexeme_string) %>%
  summarise(total_seen = sum(session_seen, na.rm = TRUE), .groups = "drop")

# 3) Top N from each POS
top_n <- 20
top_pos <- pos_freq %>%
  group_by(pos) %>%
  slice_max(total_seen, n = top_n) %>%
  ungroup() %>%
  mutate(label_clean = str_extract(lexeme_string, "^[^<]+"))  # optional cleaning

# 4) Plot (side-by-side)
ggplot(top_pos, aes(x = total_seen, y = reorder(label_clean, total_seen), fill = pos)) +
  geom_col(position = "dodge") +
  geom_text(
    aes(label = total_seen),
    position = position_dodge(width = 0.9),
    hjust = -0.1,
    size = 3
  ) +
  labs(
    title = "Top 20 Most Frequent Italian Verbs vs Nouns",
    x = "Total times seen (sum of session_seen)",
    y = "Italian lexeme",
    fill = "Part of Speech"
  )

Conclusion

By comparing the most frequent Italian verbs and nouns, this analysis shows that both lexical categories follow a similar pattern: high-frequency words are the ones shown most often to the learner. Because these are common words that a learner would encounter when learning a new language, this reflects Duolingo's focus. The findings suggest that repetition promotes increased practice; they do not show whether the learner is successful at learning the language.