1. Introducción

Este informe describe el avance del proyecto de predicción de texto (tipo siguiente palabra) del Capstone de la especialización de Ciencia de Datos (Coursera, JHU).
Nos enfocamos en el idioma inglés (en_US) para cumplir eficientemente los hitos: carga/muestreo de datos, estadísticas básicas, análisis exploratorio y plan para el algoritmo y una app Shiny.

2. Datos y muestreo

Trabajamos con tres archivos de texto en inglés:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

Rutas locales:

blogs   <- "C:/Users/AAmaya/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
news    <- "C:/Users/AAmaya/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt"
twitter <- "C:/Users/AAmaya/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt"

files <- c(blogs = blogs, news = news, twitter = twitter)
files

##                                                                       blogs 
##   "C:/Users/AAmaya/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt" 
##                                                                        news 
##    "C:/Users/AAmaya/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt" 
##                                                                     twitter 
## "C:/Users/AAmaya/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt"

Como los datos son grandes, utilizamos muestreo aleatorio (por defecto 1% por archivo). Guardamos la muestra para reutilizarla.

# Instalar/ cargar paquetes necesarios
req <- c("readr","dplyr","stringi","ggplot2","tidyr","tibble","purrr","tidytext")
new <- req[!req %in% installed.packages()[,"Package"]]
if(length(new)) install.packages(new, dependencies = TRUE)
lapply(req, library, character.only = TRUE)

## [[1]]
## [1] "readr"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
## [1] "dplyr"     "readr"     "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "stringi"   "dplyr"     "readr"     "stats"     "graphics"  "grDevices"
##  [7] "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "ggplot2"   "stringi"   "dplyr"     "readr"     "stats"     "graphics" 
##  [7] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[5]]
##  [1] "tidyr"     "ggplot2"   "stringi"   "dplyr"     "readr"     "stats"    
##  [7] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[6]]
##  [1] "tibble"    "tidyr"     "ggplot2"   "stringi"   "dplyr"     "readr"    
##  [7] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [13] "base"     
## 
## [[7]]
##  [1] "purrr"     "tibble"    "tidyr"     "ggplot2"   "stringi"   "dplyr"    
##  [7] "readr"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [13] "methods"   "base"     
## 
## [[8]]
##  [1] "tidytext"  "purrr"     "tibble"    "tidyr"     "ggplot2"   "stringi"  
##  [7] "dplyr"     "readr"     "stats"     "graphics"  "grDevices" "utils"    
## [13] "datasets"  "methods"   "base"

set.seed(123)
sample_frac <- 0.01     # 1% de muestreo
chunk_size  <- 100000   # líneas por chunk (streaming)
sample_out  <- "sample_enUS.txt"

stream_sample <- function(infile, outfile, p = 0.01, chunk = 100000L) {
  if (file.exists(outfile)) file.remove(outfile)
  con <- file(infile, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  repeat {
    lines <- readLines(con, n = chunk, warn = FALSE, skipNul = TRUE)
    if (length(lines) == 0) break
    keep <- rbinom(length(lines), 1, p) == 1
    if (any(keep)) readr::write_lines(lines[keep], outfile, append = file.exists(outfile))
  }
  invisible(outfile)
}

# Crear muestra combinada (blogs + news + twitter)
purrr::walk(files, ~ stream_sample(.x, sample_out, p = sample_frac, chunk = chunk_size))

# Cargar muestra
sample_data <- readr::read_lines(sample_out, skip_empty_rows = TRUE)
length(sample_data)

## [1] 23488

Nota: Este enfoque evita cargar todo el archivo en memoria; la muestra queda persistida en sample_enUS.txt.

3. Estadísticas básicas (streaming)

Calculamos líneas, palabras y caracteres por archivo sin leer todo en memoria (aproximamos palabras con stringi::stri_count_words).

stream_stats <- function(infile, chunk = 100000L) {
  con <- file(infile, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  total_lines <- 0L; total_chars <- 0L; total_words <- 0L
  repeat {
    lines <- readLines(con, n = chunk, warn = FALSE, skipNul = TRUE)
    n <- length(lines)
    if (n == 0) break
    total_lines <- total_lines + n
    total_chars <- total_chars + sum(nchar(lines, type = "chars", allowNA = TRUE))
    total_words <- total_words + sum(stringi::stri_count_words(lines), na.rm = TRUE)
  }
  tibble::tibble(lines = total_lines, words = total_words, chars = total_chars)
}

stats_df <- purrr::imap_dfr(files, ~ cbind.data.frame(file = .y, stream_stats(.x)), .id = NULL) %>%
  dplyr::mutate(size_MB = file.info(unlist(files))[.$file, "size"] / 1024^2)

stats_df

##      file   lines    words     chars size_MB
## 1   blogs  899288 37546806 206824509      NA
## 2    news   77259  2674561  15639408      NA
## 3 twitter 2360148 30096690 162122861      NA

Tabla resumen:

stats_df %>%
  dplyr::select(file, size_MB, lines, words, chars) %>%
  dplyr::mutate(size_MB = round(size_MB, 1)) %>%
  knitr::kable(caption = "Estadísticas básicas por archivo (streaming)", format = "html")

Estadísticas básicas por archivo (streaming)
file	size_MB	lines	words	chars
blogs	NA	899288	37546806	206824509
news	NA	77259	2674561	15639408
twitter	NA	2360148	30096690	162122861

4. Análisis exploratorio (muestra)

A continuación usamos la muestra combinada para un análisis ligero pero ilustrativo.

4.1 Limpieza mínima y longitud de línea

sample_tbl <- tibble::tibble(text = sample_data) %>%
  dplyr::mutate(
    n_chars = nchar(text),
    n_words = stringi::stri_count_words(text)
  )

summary(dplyr::select(sample_tbl, n_chars, n_words))

##     n_chars          n_words     
##  Min.   :  5.00   Min.   : 1.00  
##  1st Qu.: 36.00   1st Qu.: 7.00  
##  Median : 64.00   Median :12.00  
##  Mean   : 68.35   Mean   :12.69  
##  3rd Qu.: 99.00   3rd Qu.:18.00  
##  Max.   :140.00   Max.   :35.00

Distribución de longitud (caracteres y palabras por línea):

ggplot2::ggplot(sample_tbl, ggplot2::aes(n_chars)) + 
  ggplot2::geom_histogram(bins = 50) + 
  ggplot2::labs(title = "Distribución de longitud (caracteres por línea)", x = "Caracteres", y = "Frecuencia")

ggplot2::ggplot(sample_tbl, ggplot2::aes(n_words)) + 
  ggplot2::geom_histogram(bins = 50) + 
  ggplot2::labs(title = "Distribución de longitud (palabras por línea)", x = "Palabras", y = "Frecuencia")

4.2 Frecuencia de palabras (unigramas)

# Limpieza ligera y tokenización
tokens <- sample_tbl %>%
  dplyr::select(text) %>%
  tidytext::unnest_tokens(word, text) %>%
  dplyr::filter(!word %in% tidytext::stop_words$word,
         !stringi::stri_detect_regex(word, "^[0-9]+$"))

top_uni <- tokens %>%
  dplyr::count(word, sort = TRUE) %>%
  dplyr::slice_head(n = 20)

knitr::kable(top_uni, caption = "Top 20 palabras (unigramas) – muestra")

Top 20 palabras (unigramas) – muestra
word	n
love	1077
day	896
rt	866
time	754
lol	696
people	536
follow	527
happy	482
tonight	412
night	381
life	348
hope	329
twitter	324
game	317
im	315
week	290
wait	269
awesome	260
tomorrow	257
haha	248

ggplot2::ggplot(top_uni, ggplot2::aes(stats::reorder(word, n), n)) +
  ggplot2::geom_col() +
  ggplot2::coord_flip() +
  ggplot2::labs(title = "Top 20 palabras (unigramas) – muestra", x = NULL, y = "Frecuencia")

4.3 N-gramas (bi/tri-gramas)

bigrams <- sample_tbl %>%
  tidytext::unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  tidyr::separate(bigram, into = c("w1","w2"), sep = " ") %>%
  dplyr::filter(!w1 %in% tidytext::stop_words$word,
         !w2 %in% tidytext::stop_words$word) %>%
  tidyr::unite(bigram, w1, w2, sep = " ") %>%
  dplyr::count(bigram, sort = TRUE)

trigrams <- sample_tbl %>%
  tidytext::unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  tidyr::separate(trigram, into = c("w1","w2","w3"), sep = " ") %>%
  dplyr::filter(!w1 %in% tidytext::stop_words$word,
         !w2 %in% tidytext::stop_words$word,
         !w3 %in% tidytext::stop_words$word) %>%
  tidyr::unite(trigram, w1, w2, w3, sep = " ") %>%
  dplyr::count(trigram, sort = TRUE)

knitr::kable(head(bigrams, 15), caption = "Top 15 bigramas – muestra")

Top 15 bigramas – muestra
bigram	n
happy birthday	72
social media	49
san diego	32
mother’s day	27
mothers day	24
awkward moment	23
rt rt	23
beautiful day	20
ha ha	20
stay tuned	20
ice cream	17
god bless	16
hell yeah	15
san francisco	15
happy mother’s	14

ggplot2::ggplot(head(bigrams, 15), ggplot2::aes(stats::reorder(bigram, n), n)) +
  ggplot2::geom_col() + ggplot2::coord_flip() +
  ggplot2::labs(title = "Top 15 bigramas – muestra", x = NULL, y = "Frecuencia")

ggplot2::ggplot(head(trigrams, 15), ggplot2::aes(stats::reorder(trigram, n), n)) +
  ggplot2::geom_col() + ggplot2::coord_flip() +
  ggplot2::labs(title = "Top 15 trigramas – muestra", x = NULL, y = "Frecuencia")

5. Hallazgos iniciales

Heterogeneidad por fuente: Twitter contiene líneas más cortas y mayor ruido (emojis, hashtags); blogs y news son más largos y formales.
Distribuciones sesgadas: La longitud de líneas y la frecuencia de palabras muestran colas largas (pocas palabras muy frecuentes).
Estructuras frecuentes: Los bi/tri-gramas capturan combinaciones comunes útiles para predicción basada en back-off.

6. Plan del algoritmo de predicción

Objetivo: sugerir la siguiente palabra dada una frase de entrada.

Enfoque (lenguaje claro): 1. Preparación
- Muestreo estratificado por fuente (blogs/news/twitter).
- Limpieza: normalización, minúsculas, manejo de contracciones, emojis/URLs.
2. Modelo n-gramas con back-off
- Construcción de tablas de unigramas, bigramas y trigramas con suavizado (Kneser–Ney o Laplace).
- Estrategia back-off: si no existe el trigram, usar el bigram; si no, el unigram.
3. Evaluación
- Separar conjunto de validación.
- Métricas: Perplejidad y accuracy@k (si la palabra correcta está en el top-k sugerencias).
4. Optimización
- Ajustar tamaño de vocabulario y umbrales de frecuencia para equilibrar precisión y velocidad.
5. Despliegue en Shiny
- Caja de texto para que el usuario escriba.
- Botones/etiquetas con las 3–5 palabras sugeridas en tiempo real.
- Panel con explicaciones simples y una sección de feedback del usuario.

7. Borrador de UI (Shiny)

Panel izquierdo: entrada de texto y sugerencias (botones).
Panel derecho: últimas sugerencias, tiempo de respuesta y opciones (tamaño de n-grama, top-k).
Extra: descarga de un reporte con las sugerencias generadas.

8. Conclusión

Se descargaron y cargaron exitosamente los datos en inglés.
Se generó una muestra reproducible y se calcularon estadísticas básicas mediante streaming.
El análisis exploratorio muestra patrones útiles para un modelo de n-gramas.
Existe un plan claro para construir el algoritmo y presentar una app Shiny usable por gestores no-técnicos.

9. Apéndice: reproducibilidad y utilidades

Guardar la muestra y tablas de n-gramas para reuso:

readr::write_lines(sample_data, "sample_enUS.txt")

# Guardar tablas
readr::write_csv(top_uni,   "unigrams_top20.csv")
readr::write_csv(head(bigrams, 5000), "bigrams_top5000.csv")
readr::write_csv(head(trigrams, 5000), "trigrams_top5000.csv")

Informe de Hitos – Capstone SwiftKey

Alejandra A.

2025-08-17