Duolingo is an interactive app that helps people learn languages by giving them access to a one-on-one experience through technology. The app uses a science-backed approach that can deliver measurable results for the user. It is accessible for free from anywhere in the world, because the company’s idea is that everyone should be able to use the platform to learn a language. In this project I have decided to focus on users learning Italian.
1. Which Italian lexical items show a high frequency in the Duolingo data set?
2. Does repetition have an effect on the number of correct and incorrect responses?
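This large data set contains more than 12 million rows of learning traces from Duolingo. Each row shows how a user interacts with a specific language and vocabulary item. The data set includes variables such as the language being learned (learning_language), the vocabulary item (lexeme_string), and the number of times the item was seen and answered correctly in a session (session_seen, session_correct).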
Since the data set was pre-processed, personally identifying details have been removed while the learning patterns remain untouched. Data set link: https://www.kaggle.com/datasets/charitygithogora/cleaned-duolingo-learning-data/data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 4.0.0 ✔ stringr 1.5.2
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cleaned_duolingo <- read.csv("/Users/Sofia/Desktop/cleaned_duolingo.csv")
names(cleaned_duolingo)
## [1] "p_recall" "timestamp" "delta"
## [4] "user_id" "learning_language" "ui_language"
## [7] "lexeme_id" "lexeme_string" "history_seen"
## [10] "history_correct" "session_seen" "session_correct"
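Before plotting, a couple of quick sanity checks on these columns can be useful. The following is a minimal sketch on a small invented data frame (`toy` and all of its values, including the tag names inside `lexeme_string`, are assumptions for illustration and not rows from the real file). It counts rows per course language and checks that `session_correct` never exceeds `session_seen`, which the later `incorrect = session_seen - session_correct` calculation relies on.

```r
library(dplyr)

# Toy stand-in for cleaned_duolingo (all values invented for illustration)
toy <- data.frame(
  learning_language = c("it", "it", "es", "it"),
  lexeme_string     = c("sono/essere<vbser><pri><p1><sg>",
                        "casa/casa<n><f><sg>",
                        "come/comer<vblex><pri><p3><sg>",
                        "mangio/mangiare<vblex><pri><p1><sg>"),
  session_seen      = c(3, 2, 1, 4),
  session_correct   = c(3, 1, 1, 2)
)

# Number of rows per course language, largest first
lang_counts <- toy %>%
  count(learning_language, sort = TRUE)

# Rows where more correct answers than exposures were recorded (expected: none)
n_bad <- toy %>%
  filter(session_correct > session_seen) %>%
  nrow()

print(lang_counts)
print(n_bad)
```

The same two checks could be run on `cleaned_duolingo` itself by swapping it in for `toy`.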
cleaned_duolingo %>%
  count(session_correct) %>%
  ggplot(aes(x = factor(session_correct), y = n)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Number of Correct Answers during a Duolingo Session",
    x = "Number of correct answers per session",
    y = "Count"
  )
This first visualization shows that the distribution of correct answers
is right-skewed: most rows record only one to three correct responses
in a session. This suggests that Duolingo’s learning platform relies on
short learning activities. The x axis shows the number of correct
answers per session and the y axis shows how many rows in the data set
have that value.
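The right-skew described above can also be checked numerically. This is a minimal sketch on invented values (`toy` is an assumption, not the real data): in a right-skewed distribution the mean sits above the median, and the share of rows with one to three correct answers shows how concentrated the distribution is.

```r
library(dplyr)

# Invented session_correct values with a long right tail
toy <- data.frame(session_correct = c(1, 1, 1, 2, 2, 3, 1, 2, 8, 12))

skew_summary <- toy %>%
  summarise(
    mean_correct   = mean(session_correct),
    median_correct = median(session_correct),
    # Proportion of rows with one to three correct answers
    share_1_to_3   = mean(session_correct >= 1 & session_correct <= 3)
  )

print(skew_summary)
```

Replacing `toy` with `cleaned_duolingo` would give the corresponding figures for the full data set.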
italian_only <- cleaned_duolingo %>%
  filter(learning_language == "it")
italian_only %>%
  count(session_correct) %>%
  ggplot(aes(x = session_correct, y = n)) +
  geom_col() +
  labs(
    title = "Number of Correct Answers – Italian Learners",
    x = "Number of correct answers per session",
    y = "Count"
  )
Since the data set covers learners of several different languages,
Italian was chosen for the purpose of this research. This bar plot
shows, for Italian learners, how often each number of correct answers
per session occurs.
# Packages and data are already loaded above; keep only the Italian rows
it_only <- cleaned_duolingo %>%
  filter(learning_language == "it")
# Inspect the lexeme strings practiced by Italian learners
it_only %>%
  select(lexeme_string)
# Italian rows with the number of incorrect answers and a verb flag
it_only <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  mutate(
    incorrect = session_seen - session_correct,
    is_verb = str_detect(lexeme_string, "<vblex>")
  )
# Columns of interest for all Italian rows (not used further below)
verb <- it_only %>%
  select(lexeme_string, history_seen, session_correct, incorrect)
# Totals and error rate per verb lexeme, highest error rate first
verb_errors <- it_only %>%
  filter(is_verb) %>%
  group_by(lexeme_string) %>%
  summarise(
    total_seen = sum(session_seen, na.rm = TRUE),
    total_correct = sum(session_correct, na.rm = TRUE),
    total_incorrect = sum(incorrect, na.rm = TRUE),
    error_rate = total_incorrect / total_seen
  ) %>%
  arrange(desc(error_rate))
head(verb_errors, 20)
The data contained a very large pool of words (lexeme_string) that learners were practicing on the platform. This table shows only the verbs (tagged vblex) practiced by a variety of users, ordered from the highest error rate down.
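To make the tag-based filtering concrete, here is a minimal sketch of how a `lexeme_string` can be split into its parts. The example strings are invented, and the layout `surface/lemma<tag>...` as well as the `<vbser>` tag on the form of “essere” are assumptions made for illustration. If forms of “essere” do carry a different tag from `<vblex>` in the real data, a `<vblex>` filter would not include them.

```r
library(stringr)

# Invented lexeme strings in an assumed surface/lemma<tag>... layout
lexemes <- c(
  "mangio/mangiare<vblex><pri><p1><sg>",
  "casa/casa<n><f><sg>",
  "sono/essere<vbser><pri><p1><sg>"
)

# TRUE only where the literal tag <vblex> is present
is_verb <- str_detect(lexemes, fixed("<vblex>"))

# Surface form: everything before the first "/" or "<"
surface <- str_extract(lexemes, "^[^/<]+")

# Lemma: text between "/" and the first "<"
lemma <- str_match(lexemes, "/([^<]+)<")[, 2]

# First tag inside angle brackets, used here as the part of speech
pos_tag <- str_match(lexemes, "<([^>]+)>")[, 2]

print(data.frame(surface, lemma, pos_tag, is_verb))
```

Tabulating `pos_tag` on the real data would show which tags actually occur before deciding which ones to count as verbs.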
ggplot(verb_errors, aes(x = total_seen, y = total_correct)) +
  geom_line(linewidth = 0.5)
ggplot(verb_errors, aes(x = total_seen, y = total_incorrect)) +
  geom_line(linewidth = 0.5)
These two plots explore the relationship between the number of times
Italian verbs were seen by learners and the number of correct and
incorrect responses associated with those verbs. Verbs that are seen
more frequently tend to show higher numbers of both correct and
incorrect responses. More exposure therefore gives the learner more
opportunities for both correct and incorrect responses, so these totals
reflect the amount of practice rather than accuracy alone.
ggplot(verb_errors, aes(x = total_seen)) +
  geom_line(aes(y = total_correct, linetype = "Correct"), linewidth = 0.5) +
  geom_line(aes(y = total_incorrect, linetype = "Incorrect"), linewidth = 0.5) +
  labs(
    x = "Total times a verb was seen",
    y = "Total number of responses",
    linetype = "Response"
  )
To compare the two visualizations, I have chosen to show the
relationship between exposure and learner responses for Italian verbs
in a single plot. Overlaying the two lines makes it easier to compare
how correct and incorrect responses grow as exposure increases. It also
shows that correct responses rise more than incorrect responses. In
other words, frequent exposure can lead to more successful answers even
if mistakes occur.
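Because raw totals grow with exposure almost by construction, one way to separate repetition from accuracy is to look at the proportion of correct responses per lexeme. The sketch below uses an invented table (`toy_verbs` and its numbers are assumptions, shaped like `verb_errors`) and compares two correlations: exposure against raw correct counts, and exposure against the accuracy rate.

```r
library(dplyr)

# Invented per-lexeme totals in the same shape as verb_errors
toy_verbs <- data.frame(
  lexeme_string = c("a", "b", "c", "d", "e"),
  total_seen    = c(10, 50, 200, 1000, 5000),
  total_correct = c(8, 45, 170, 880, 4300)
)

verb_accuracy <- toy_verbs %>%
  mutate(
    total_incorrect = total_seen - total_correct,
    accuracy        = total_correct / total_seen  # share of correct responses
  )

# Raw counts rise with exposure almost automatically
raw_cor <- cor(verb_accuracy$total_seen, verb_accuracy$total_correct)

# Rank correlation between exposure and accuracy rate
rate_cor <- cor(verb_accuracy$total_seen, verb_accuracy$accuracy,
                method = "spearman")

print(c(raw_cor = raw_cor, rate_cor = rate_cor))
```

In this toy example the raw correlation is close to 1 while the rank correlation with accuracy is much weaker, which illustrates why totals alone cannot show whether repetition improves accuracy. The values say nothing about the real data.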
# Italian rows with the number of incorrect answers and a noun flag
it_only <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  mutate(
    incorrect = session_seen - session_correct,
    is_n = str_detect(lexeme_string, "<n>")
  )
# Columns of interest for all Italian rows (not used further below)
noun <- it_only %>%
  select(lexeme_string, history_seen, session_correct, incorrect)
# Totals and error rate per noun lexeme, highest error rate first
noun_errors <- it_only %>%
  filter(is_n) %>%
  group_by(lexeme_string) %>%
  summarise(
    total_seen = sum(session_seen, na.rm = TRUE),
    total_correct = sum(session_correct, na.rm = TRUE),
    total_incorrect = sum(incorrect, na.rm = TRUE),
    error_rate = total_incorrect / total_seen
  ) %>%
  arrange(desc(error_rate))
ggplot(noun_errors, aes(x = total_seen, y = total_correct)) +
  geom_line(linewidth = 0.5)
ggplot(noun_errors, aes(x = total_seen, y = total_incorrect)) +
  geom_line(linewidth = 0.5)
Like the previous visualizations, these plots show the relationship
between the number of times Italian noun lexemes were seen by learners
and the number of correct and incorrect answers given. As with the
verbs, nouns that appear more frequently show higher numbers of both
correct and incorrect responses. This shows how repetition increases
the learner’s opportunities to respond rather than demonstrating
accuracy.
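The verb and noun summaries follow the same steps, so they could be wrapped in one helper. The following is a minimal sketch, not the method used above: `summarise_by_tag()` is a hypothetical function name and `toy` is invented data.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: per-lexeme totals and error rate for one tag
summarise_by_tag <- function(data, tag) {
  data %>%
    filter(str_detect(lexeme_string, fixed(tag))) %>%
    mutate(incorrect = session_seen - session_correct) %>%
    group_by(lexeme_string) %>%
    summarise(
      total_seen      = sum(session_seen, na.rm = TRUE),
      total_correct   = sum(session_correct, na.rm = TRUE),
      total_incorrect = sum(incorrect, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(error_rate = total_incorrect / total_seen) %>%
    arrange(desc(error_rate))
}

# Invented rows for illustration
toy <- data.frame(
  lexeme_string   = c("casa/casa<n><f><sg>", "casa/casa<n><f><sg>",
                      "mangio/mangiare<vblex><pri><p1><sg>",
                      "libro/libro<n><m><sg>"),
  session_seen    = c(3, 2, 4, 2),
  session_correct = c(2, 2, 2, 1)
)

toy_nouns <- summarise_by_tag(toy, "<n>")
toy_verbs <- summarise_by_tag(toy, "<vblex>")
print(toy_nouns)
```

Using `fixed(tag)` treats the tag as a literal string, so characters such as `<` and `>` are never interpreted as regular-expression syntax.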
# Total exposure per Italian lexeme, most frequent first
it_freq <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  group_by(lexeme_string) %>%
  summarise(
    total_seen = sum(session_seen, na.rm = TRUE)
  ) %>%
  arrange(desc(total_seen))
top_words <- it_freq %>%
  slice_max(total_seen, n = 20)
ggplot(top_words, aes(x = reorder(lexeme_string, total_seen),
                      y = total_seen)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Most Frequent Italian Words in the Duolingo Data",
    x = "Italian lexeme",
    y = "Total times a word is seen"
  ) +
  theme_minimal()
Here it is possible to see which words appear most frequently in the
data set for learners of Italian. The most frequent items are function
words such as pronouns and determiners, forms of the verb “essere”, and
nouns. This pattern gives us insight into how Duolingo’s platform
structures learning rather than into the differences between the words
shown.
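The observation about function words can be checked by grouping the frequency table by the first tag of each lexeme. This is a minimal sketch on invented rows: `toy_freq`, its counts, and the tag names `<det>`, `<prn>` and `<vbser>` are assumptions, and the real data may use different tags.

```r
library(dplyr)
library(stringr)

# Invented frequency table in the same shape as it_freq
toy_freq <- data.frame(
  lexeme_string = c("il/il<det><def><m><sg>",
                    "sono/essere<vbser><pri><p1><sg>",
                    "io/io<prn><tn><p1><mf><sg>",
                    "casa/casa<n><f><sg>",
                    "la/il<det><def><f><sg>"),
  total_seen    = c(900, 700, 650, 300, 850)
)

# Total exposure per first tag, largest first
tag_totals <- toy_freq %>%
  mutate(first_tag = str_match(lexeme_string, "<([^>]+)>")[, 2]) %>%
  group_by(first_tag) %>%
  summarise(
    n_lexemes  = n(),
    total_seen = sum(total_seen),
    .groups = "drop"
  ) %>%
  arrange(desc(total_seen))

print(tag_totals)
```

Running the same grouping on `it_freq` would show how much of the total exposure each tag accounts for.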
# 1) Italian only + POS tag
it_pos <- cleaned_duolingo %>%
  filter(learning_language == "it") %>%
  mutate(
    pos = case_when(
      str_detect(lexeme_string, "<vblex>") ~ "Verb",
      str_detect(lexeme_string, "<n>") ~ "Noun",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(pos))
# 2) Frequency per lexeme within each POS
pos_freq <- it_pos %>%
  group_by(pos, lexeme_string) %>%
  summarise(total_seen = sum(session_seen, na.rm = TRUE), .groups = "drop")
# 3) Top N from each POS
top_n <- 20
top_pos <- pos_freq %>%
  group_by(pos) %>%
  slice_max(total_seen, n = top_n) %>%
  ungroup() %>%
  mutate(label_clean = str_extract(lexeme_string, "^[^<]+")) # optional cleaning
# 4) Plot (side-by-side)
ggplot(top_pos, aes(x = total_seen, y = reorder(label_clean, total_seen), fill = pos)) +
  geom_col(position = "dodge") +
  geom_text(
    aes(label = total_seen),
    position = position_dodge(width = 0.9),
    hjust = -0.1,
    size = 3
  ) +
  labs(
    title = "Top 20 Most Frequent Italian Verbs vs Nouns",
    x = "Total times seen (sum of session_seen)",
    y = "Italian lexeme",
    fill = "Part of Speech"
  )
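One detail worth knowing about the "Top N" step: `slice_max()` keeps ties by default, so a group can return more than N rows when several lexemes share the cut-off value. The sketch below shows the difference on invented data (`toy_pos_freq` is an assumption). Setting `with_ties = FALSE` returns exactly N rows per group.

```r
library(dplyr)

# Invented frequencies with a tie at the cut-off in each group
toy_pos_freq <- data.frame(
  pos           = c("Verb", "Verb", "Verb", "Noun", "Noun", "Noun"),
  lexeme_string = c("v1", "v2", "v3", "n1", "n2", "n3"),
  total_seen    = c(10, 7, 7, 9, 4, 4)
)

# Default behaviour: tied rows at the cut-off are all kept
with_ties_kept <- toy_pos_freq %>%
  group_by(pos) %>%
  slice_max(total_seen, n = 2) %>%
  ungroup()

# Exactly two rows per group, ties broken by row order
ties_dropped <- toy_pos_freq %>%
  group_by(pos) %>%
  slice_max(total_seen, n = 2, with_ties = FALSE) %>%
  ungroup()

print(nrow(with_ties_kept))
print(nrow(ties_dropped))
```

Whether ties should be kept is a judgment call; dropping them keeps the plot at a fixed size, at the cost of an arbitrary choice among tied lexemes.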
# Conclusion
Comparing the most frequent Italian verbs and nouns, this analysis shows that both lexical categories follow a similar pattern: high-frequency words are shown to the learner most often. As these are common words that a learner would encounter when learning a new language, this reflects Duolingo’s focus. The findings suggest that repetition promotes increased practice rather than showing whether or not the learner is successful at learning the language.