The goal of this task is to develop a basic predictive model using n-grams to predict the next word based on user input. This is the first step in creating a predictive text application that will later be implemented in a Shiny app.
library(dplyr)
library(stringr)
library(tidytext)
library(data.table)
library(ggplot2)
library(readr)
# Read the three English corpora (UTF-8, skipping embedded NULs)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Line and word counts per source
summary_stats <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines  = c(length(blogs), length(news), length(twitter)),
  words  = c(sum(str_count(blogs, "\\w+")),
             sum(str_count(news, "\\w+")),
             sum(str_count(twitter, "\\w+")))
)
knitr::kable(summary_stats)
source | lines | words |
---|---|---|
blogs | 899288 | 38309620 |
news | 1010206 | 35622913 |
twitter | 2360148 | 31003544 |
# Sample 5,000 lines from each source to keep the corpus manageable
set.seed(123)
sample_data <- c(sample(blogs, 5000),
                 sample(news, 5000),
                 sample(twitter, 5000))
# Lowercase, replace non-letters with spaces, and normalize whitespace
clean_corpus <- tolower(sample_data)
clean_corpus <- str_replace_all(clean_corpus, "[^a-z\\s]", " ")
clean_corpus <- str_replace_all(clean_corpus, "\\s+", " ")
clean_corpus <- str_trim(clean_corpus)
corpus_df <- tibble(text = clean_corpus)
# Unigram frequency table
unigrams <- corpus_df %>%
  unnest_tokens(output = word, input = text, token = "words") %>%
  count(word, sort = TRUE)
top_unigrams <- unigrams %>% slice_max(n, n = 10)
ggplot(top_unigrams, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Unigrams", x = "Word", y = "Frequency")
# Bigram frequency table (drop the NA rows unnest_tokens emits for one-word lines)
bigrams <- corpus_df %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)
top_bigrams <- bigrams %>% slice_max(n, n = 10)
ggplot(top_bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Bigrams", x = "Bigram", y = "Frequency")
# Trigram frequency table (again dropping NA rows for short lines)
trigrams <- corpus_df %>%
  unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)
top_trigrams <- trigrams %>% slice_max(n, n = 10)
ggplot(top_trigrams, aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "purple") +
  coord_flip() +
  labs(title = "Top 10 Trigrams", x = "Trigram", y = "Frequency")
predict_next_word <- function(input, unigrams, bigrams, trigrams) {
  # Normalize the input the same way the training corpus was cleaned
  input <- tolower(input)
  input <- str_replace_all(input, "[^a-z\\s]", " ")
  input <- str_squish(input)
  words <- str_split(input, " ")[[1]]
  n <- length(words)
  # Try the trigram table first: match the last two words as a full prefix.
  # The trailing space prevents partial-word matches ("the cat" vs. "the catalog").
  if (n >= 2) {
    last2 <- paste(words[(n - 1):n], collapse = " ")
    match <- trigrams %>% filter(str_starts(trigram, paste0(last2, " "))) %>% slice(1)
    if (nrow(match) > 0) return(str_split(match$trigram[1], " ")[[1]][3])
  }
  # Back off to the bigram table using only the last word
  if (n >= 1) {
    last1 <- words[n]
    match <- bigrams %>% filter(str_starts(bigram, paste0(last1, " "))) %>% slice(1)
    if (nrow(match) > 0) return(str_split(match$bigram[1], " ")[[1]][2])
  }
  # Last resort: the single most frequent unigram
  return(unigrams$word[1])
}
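A quick sanity check of the function (the phrase is arbitrary; the exact prediction depends on the sampled corpus):
predict_next_word("thanks for the", unigrams, bigrams, trigrams)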
# Add-one (Laplace) smoothing: each bigram count is incremented by one so
# low-count entries are not assigned negligible probability
bigrams <- bigrams %>%
  mutate(prob = (n + 1) / (sum(n) + n_distinct(bigram)))
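To inspect the smoothed table, the most probable continuations of a given word can be ranked; the prefix "in" below is just an illustrative choice:
bigrams %>%
  filter(str_starts(bigram, "in ")) %>%
  arrange(desc(prob)) %>%
  head(5)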
# Memory footprint of each n-gram table
object.size(unigrams)
## 2278976 bytes
object.size(bigrams)
## 19010624 bytes
object.size(trigrams)
## 30314744 bytes
# Prune singletons (n == 1) to shrink the tables for deployment
unigrams <- unigrams %>% filter(n > 1)
bigrams <- bigrams %>% filter(n > 1)
trigrams <- trigrams %>% filter(n > 1)
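Since data.table is already loaded, one natural next optimization (a sketch, not yet wired into predict_next_word; the names bigram_dt, w1, and w2 are illustrative) is to split each bigram into a prefix and a predicted word and index the table by prefix for fast lookups:
bigram_dt <- as.data.table(bigrams)
bigram_dt[, c("w1", "w2") := tstrsplit(bigram, " ", fixed = TRUE)]
setkey(bigram_dt, w1)  # binary-search index on the first word
# Most frequent continuation of "of"; nomatch = NULL drops unseen prefixes
bigram_dt["of", nomatch = NULL][which.max(n), w2]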
This report presents a basic exploratory analysis of text data collected from blogs, news articles, and Twitter. It summarizes the main characteristics of the datasets, such as line and word counts, and presents visualizations of the most frequent words and word combinations.
Based on this analysis, a simple predictive model was developed using n-grams to anticipate the next word a user might type, considering up to three previous words. This model serves as the foundation for a future application that will provide fast and efficient predictions.
The next steps will include evaluating the model with new data, implementing techniques to handle unseen word combinations (smoothing and backoff), and optimizing performance to ensure efficient operation in resource-limited environments such as mobile apps or web servers.
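As one possible direction for the backoff step, a minimal "stupid backoff" scorer (Brants et al., 2007) could be layered on the existing tables. The function below is a hypothetical sketch using the conventional 0.4 back-off penalty, not part of the current model:
# Sketch: stupid-backoff score of a candidate next word given the context.
# Uses the trigram relative frequency when available, otherwise backs off
# to the bigram and unigram estimates with a fixed 0.4 penalty.
stupid_backoff <- function(context, candidate, unigrams, bigrams, trigrams) {
  w <- str_split(str_squish(tolower(context)), " ")[[1]]
  n <- length(w)
  if (n >= 2) {
    tri_n <- trigrams$n[trigrams$trigram == paste(w[n - 1], w[n], candidate)]
    big_n <- bigrams$n[bigrams$bigram == paste(w[n - 1], w[n])]
    if (length(tri_n) == 1 && length(big_n) == 1) return(tri_n / big_n)
  }
  if (n >= 1) {
    big_n <- bigrams$n[bigrams$bigram == paste(w[n], candidate)]
    uni_n <- unigrams$n[unigrams$word == w[n]]
    if (length(big_n) == 1 && length(uni_n) == 1) return(0.4 * big_n / uni_n)
  }
  cand_n <- unigrams$n[unigrams$word == candidate]
  if (length(cand_n) == 1) return(0.4^2 * cand_n / sum(unigrams$n))
  0
}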
This work demonstrates solid progress toward building a predictive text system that can enhance the writing experience in interactive applications.