Public opinion among Americans about Donald Trump and his policy decisions.

  • Donald Trump, the current US president, is a political figure who generates highly polarized opinions and debate. As a result, discussions on social media, especially Reddit, are rich in emotional, evaluative, and argumentative language, which makes them well suited to practicing natural language processing techniques. In addition, the volume of data available on the platform makes it possible to obtain a sample large enough for the analysis.

  • With this analysis we aim to identify the most frequent words and topics in discussions of approval or disapproval of the president, and whether there are significant linguistic differences between subreddits with different political leanings.

# Load package
library(RedditExtractoR)
## Warning: package 'RedditExtractoR' was built under R version 4.4.3

1. Data download

Search terms

To ensure that the data obtained are relevant to the research question, this section defines the specific search terms used to filter the subsequent download of Reddit posts.

We use specific phrases such as “Trump approval poll” rather than broad terms such as “Trump”, following the recommendation to avoid overly broad searches. This makes it more likely that the captured posts explicitly contain opinions about his performance in office.

# Terms that capture public opinion about Trump
terminos_opinion <- c(
  "Trump approval rating",
  "Americans think about Trump",
  "Trump public opinion", 
  "Trump policies survey",
  "Do Americans support Trump",
  "Trump approval poll",
  "Trump disapproval reasons",
  "Americans view Trump decision"
)

# Subreddits where public opinion is debated (all active)
subreddits_opinion <- c(
  "politics",           # General political debate
  "AskAnAmerican",      # Questions for Americans
  "PoliticalDiscussion",# Serious discussion
  "conservatives",      # Conservative viewpoint
  "democrats",          # Liberal viewpoint
  "neutralnews",        # Attempt at neutrality
  "fivethirtyeight"     # Data and polls
)

Identifying subgroups

This section implements a function that searches Reddit for discussion threads on a given topic, optionally restricted to a single subreddit.

buscar_threads_documentado <- function(termino, subreddit = NA) {

  cat("\n🔍 Search:", termino)
  # sep = "" avoids the stray space cat() would otherwise print after "r/"
  if (!is.na(subreddit)) cat(" | subreddit: r/", subreddit, sep = "")
  cat("\n")

  # Pause to respect Reddit's API (methodological decision)
  Sys.sleep(1)

  resultado <- tryCatch({
    find_thread_urls(
      keywords = termino,
      subreddit = subreddit,
      sort_by = "relevance",  # Better suited than "top" for opinions
      period = "month"         # Last month (recent data)
    )
  }, error = function(e) {
    cat("   ❌ Error:", conditionMessage(e), "\n")
    return(NULL)
  })

  # Keep only non-empty data frames; anything else counts as "no results"
  if (is.data.frame(resultado) && nrow(resultado) > 0) {
    # Add search metadata
    resultado$termino_usado <- termino
    resultado$subreddit_buscado <- ifelse(is.na(subreddit), "all", subreddit)
    resultado$fecha_busqueda <- Sys.Date()

    cat("   ✅", nrow(resultado), "threads found\n")
    return(resultado)
  } else {
    cat("   ⚠️ 0 results\n")
    return(NULL)
  }
}
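
The function above handles a single query. A minimal sketch of how it might be applied to every combination of term and subreddit follows. The helper `ejecutar_busquedas` and its `buscar` argument are hypothetical (they do not appear in the original script), and `buscar_stub` only stands in for Reddit so that the example runs offline.

```r
# Hypothetical helper: runs a search function over every
# term x subreddit pair and stacks the non-empty results.
ejecutar_busquedas <- function(terminos, subreddits, buscar) {
  combinaciones <- expand.grid(
    termino = terminos,
    subreddit = subreddits,
    stringsAsFactors = FALSE
  )

  resultados <- unname(
    Map(buscar, combinaciones$termino, combinaciones$subreddit)
  )

  # Drop NULLs (failed or empty searches) before binding
  resultados <- Filter(Negate(is.null), resultados)
  if (length(resultados) == 0) return(NULL)

  todos <- do.call(rbind, resultados)
  rownames(todos) <- NULL

  # The same thread can match several terms; keep one row per URL
  todos[!duplicated(todos$url), , drop = FALSE]
}

# Offline stub standing in for buscar_threads_documentado()
buscar_stub <- function(termino, subreddit) {
  if (subreddit == "neutralnews") return(NULL)  # simulate "0 results"
  data.frame(
    url = paste0("https://example.org/", subreddit, "/1"),
    termino_usado = termino,
    subreddit_buscado = subreddit,
    stringsAsFactors = FALSE
  )
}

threads_demo <- ejecutar_busquedas(
  terminos = c("Trump approval rating", "Trump approval poll"),
  subreddits = c("politics", "neutralnews"),
  buscar = buscar_stub
)
```

With a working connection, the same helper could presumably be called as `ejecutar_busquedas(terminos_opinion, subreddits_opinion, buscar_threads_documentado)`.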

Data download

With the search terms defined, we proceed to extract posts and comments from Reddit.

## 
##  = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## START OF DATA DOWNLOAD
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## Minimum target: 80 posts
## 
##  = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## DOWNLOAD RESULTS (EXAMPLE DATA)
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## 
## STATISTICS:
##    - Threads downloaded: 100
##    - Target of 80 posts: REACHED
##    - Subreddits represented: 4
##    - Data period: 2026-03-31 to 2026-05-26
## 
## DISTRIBUTION BY SUBREDDIT:
## 
##       AskAnAmerican       conservatives PoliticalDiscussion            politics 
##                  25                  25                  25                  25
## 
## SAMPLE OF THE DATASET (first 5 posts):
##                                                                title
## 1                  Trump favorability hits 51% in latest Gallup poll
## 2                  Trump favorability hits 51% in latest Gallup poll
## 3   Polarized nation: Trump's tax plan splits voters down the middle
## 4 Poll shows 47% of Americans approve of Trump's handling of economy
## 5                Trump's trade policies unpopular with 63% of voters
##             subreddit comments score
## 1            politics      241  2872
## 2       conservatives      375  1976
## 3 PoliticalDiscussion      252  2283
## 4       AskAnAmerican      359   507
## 5            politics      344   880
## 
## SAMPLE COMMENT:
## I voted for him twice but I am not sure about 2026.
## 
## Data saved to 'datos_trump_opinion.rds'

Important:

When we attempted to download real data from Reddit, the request failed because Reddit's API currently requires authentication. To complete the exercise, we generated a synthetic dataset that replicates the structure of the real data. All results reported below therefore describe this synthetic dataset, not actual Reddit discussions.
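
The generator used for the synthetic dataset is not shown in this report. The sketch below is one possible way to build a table with the columns that the later steps rely on (`title`, `comment`, `subreddit`, `url`, plus `comments` and `score`). The function name `crear_base_sintetica`, the URL pattern, and the sampling ranges are assumptions made for illustration; the titles and the comment are taken from the sample output above.

```r
# Hypothetical generator; NOT the code that produced base_final.
crear_base_sintetica <- function(n_por_subreddit = 25, semilla = 123) {
  set.seed(semilla)  # reproducible draws

  subreddits <- c("AskAnAmerican", "conservatives",
                  "PoliticalDiscussion", "politics")
  titulos <- c(
    "Trump favorability hits 51% in latest Gallup poll",
    "Polarized nation: Trump's tax plan splits voters down the middle",
    "Trump's trade policies unpopular with 63% of voters"
  )
  comentario <- "I voted for him twice but I am not sure about 2026."

  n_total <- n_por_subreddit * length(subreddits)

  data.frame(
    title = sample(titulos, n_total, replace = TRUE),
    comment = rep(comentario, n_total),
    subreddit = rep(subreddits, each = n_por_subreddit),
    comments = sample(50:400, n_total, replace = TRUE),   # assumed range
    score = sample(100:3000, n_total, replace = TRUE),    # assumed range
    # Placeholder URLs: unique per row, not real Reddit links
    url = sprintf("https://example.org/thread/%03d", seq_len(n_total)),
    stringsAsFactors = FALSE
  )
}

base_demo <- crear_base_sintetica()
table(base_demo$subreddit)
```

The result is stored in `base_demo` rather than `base_final` so that it cannot be mistaken for the dataset analyzed in the rest of the report.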

2. Text cleaning

## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'tidytext' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## 
##  = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## TEXT CLEANING
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## 
## Text to analyze:
##    - Total documents: 100
##    - Average text length: 118.03 characters
## 
## === TOKENIZATION ===
## Total tokens generated: 1915
## Unique words (types): 140
## 
## Example tokens (first 30):
##  [1] "trump"        "favorability" "hits"         "51"           "in"          
##  [6] "latest"       "gallup"       "poll"         "i"            "voted"       
## [11] "for"          "him"          "twice"        "but"          "i"           
## [16] "am"           "not"          "sure"         "about"        "2026"        
## [21] "trump"        "favorability" "hits"         "51"           "in"          
## [26] "latest"       "gallup"       "poll"         "the"          "media"
## 
## === STOPWORD REMOVAL ===
## Stopwords removed (first 20):
##  [1] "a"           "a's"         "able"        "about"       "above"      
##  [6] "according"   "accordingly" "across"      "actually"    "after"      
## [11] "afterwards"  "again"       "against"     "ain't"       "all"        
## [16] "allow"       "allows"      "almost"      "alone"       "along"
## 
## Results:
##    - Tokens before: 1915
##    - Tokens after: 1056
##    - Tokens removed: 859
##    - Percentage removed: 44.9 %
## 
## Possible residual problems (technical words):
##    No problematic technical words found
## 
## After removing technical words:
##    - Tokens remaining: 1056
## 
## === MOST FREQUENT WORDS ===
## Top 15 most frequent words (overall):
##        palabra  n
## 1      trump's 81
## 2    americans 43
## 3       voters 43
## 4        trump 41
## 5     approval 26
## 6   disapprove 25
## 7      economy 25
## 8        trade 23
## 9      approve 22
## 10    policies 21
## 11 immigration 19
## 12    majority 19
## 13        poll 18
## 14          45 15
## 15          48 15
## 
## === COMPARISON BY SUBREDDIT ===
## Top 5 words per subreddit:
## # A tibble: 20 × 3
## # Groups:   subreddit [4]
##    subreddit           palabra        n
##    <chr>               <chr>      <int>
##  1 AskAnAmerican       trump's       19
##  2 AskAnAmerican       americans     12
##  3 AskAnAmerican       trump         11
##  4 AskAnAmerican       voters        10
##  5 AskAnAmerican       approve        8
##  6 PoliticalDiscussion trump's       22
##  7 PoliticalDiscussion americans     12
##  8 PoliticalDiscussion voters        12
##  9 PoliticalDiscussion trump         10
## 10 PoliticalDiscussion approval       8
## 11 conservatives       trump's       21
## 12 conservatives       trump         11
## 13 conservatives       americans      9
## 14 conservatives       voters         9
## 15 conservatives       approval       8
## 16 politics            trump's       19
## 17 politics            voters        12
## 18 politics            americans     10
## 19 politics            trump          9
## 20 politics            disapprove     7
## 
## === METHODOLOGICAL EXPLANATION ===
## 
## DECISIONS MADE:
## 
## 1. Tokenization:
##    - unnest_tokens() was used, which converts everything to lowercase
##    - It strips punctuation automatically but keeps numbers (e.g. "51", "2026")
##    - It splits on word boundaries, so possessives such as "trump's" stay as one token
## 
## 2. Stopwords:
##    - The 'stop_words' dictionary from tidytext was used
##    - It contains English function words (the, and, a, to, etc.)
##    - These words add little meaning to the analysis
## 
## 3. Additional cleaning:
##    - Technical words such as 'amp', 'http', 'www' were filtered out (none were found in this dataset)
##    - These appear because of Reddit's HTML structure
## 
## 
## RECOMMENDATIONS FOR FUTURE ANALYSES:
## 
## - Apply stemming (reduce words to their root)
## - Create n-grams (phrases of 2-3 words)
## - Carry out a dedicated sentiment analysis
3. Visualization methods

## 
##  = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## VISUALIZATIONS
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

## Warning: package 'wordcloud' was built under R version 4.4.3
## Loading required package: RColorBrewer

# ============================================================
# 4. TF-IDF · PUBLIC OPINION ABOUT DONALD TRUMP ON REDDIT
# ============================================================

# ------------------------------------------------------------
# WHAT WILL WE DO?
# ------------------------------------------------------------
#
# We will:
#
# 1. Use the previously created dataset: base_final
# 2. Combine title + comment
# 3. Tokenize the text
# 4. Remove stopwords and technical words
# 5. Compute TF-IDF per subreddit
# 6. Identify the distinctive words of each community
#
# CORE IDEA:
#
# TF-IDF highlights words that are both:
#
# ✔ frequent within a group
# ✔ present in few of the other groups
#
# Here, the group is the subreddit.
# This makes it possible to compare the language used in communities
# such as politics, conservatives, PoliticalDiscussion, and AskAnAmerican.
# ------------------------------------------------------------


# ============================================================
# 1. LOAD LIBRARIES
# ============================================================

library(dplyr)
library(tidytext)
library(stringr)
library(tidyr)
library(ggplot2)


# ============================================================
# 2. CREATE FULL-TEXT VARIABLE
# ============================================================

# ------------------------------------------------------------
# The post title is combined with the comment.
# This makes it possible to analyze all the available text content.
# ------------------------------------------------------------

base_tfidf <- base_final %>%
  mutate(
    texto_completo = paste(title, comment, sep = " "),
    texto_completo = str_to_lower(texto_completo)
  )


# ============================================================
# 3. REVIEW DOCUMENTS PER SUBREDDIT
# ============================================================

base_tfidf %>%
  count(subreddit, sort = TRUE)
##             subreddit  n
## 1       AskAnAmerican 25
## 2 PoliticalDiscussion 25
## 3       conservatives 25
## 4            politics 25
# ============================================================
# 4. TOKENIZE TEXT
# ============================================================

# ------------------------------------------------------------
# unnest_tokens():
#
# Converts each text into individual words.
# Each word becomes one row.
# ------------------------------------------------------------

tokens_tfidf <- base_tfidf %>%
  select(url, subreddit, texto_completo) %>%
  unnest_tokens(
    palabra,
    texto_completo
  )


# ============================================================
# 5. TOKEN CLEANING
# ============================================================

# ------------------------------------------------------------
# We remove:
#
# ✔ English stopwords
# ✔ Reddit technical words
# ✔ numbers
# ✔ very short words
# ✔ overly general topic terms
# ------------------------------------------------------------

data("stop_words")

palabras_tecnicas <- c(
  "amp", "http", "https", "com", "www", "reddit",
  "subreddit", "post", "posts", "comment", "comments"
)

# Topic words that can appear in every group
# and therefore do little to differentiate subreddits.
palabras_generales_tema <- c(
  "trump", "donald", "president", "america", "american",
  "americans", "usa", "us"
)

tokens_tfidf_limpios <- tokens_tfidf %>%
  anti_join(
    stop_words,
    by = c("palabra" = "word")
  ) %>%
  filter(
    !palabra %in% palabras_tecnicas
  ) %>%
  filter(
    !palabra %in% palabras_generales_tema
  ) %>%
  filter(
    str_detect(palabra, "^[a-z]+$")
  ) %>%
  filter(
    nchar(palabra) > 2
  )


# ============================================================
# 6. COUNT WORDS PER SUBREDDIT
# ============================================================

conteos_tfidf <- tokens_tfidf_limpios %>%
  count(
    subreddit,
    palabra,
    sort = TRUE
  )

conteos_tfidf %>%
  head(20)
##              subreddit     palabra  n
## 1  PoliticalDiscussion      voters 12
## 2             politics      voters 12
## 3        AskAnAmerican      voters 10
## 4        conservatives      voters  9
## 5        AskAnAmerican     approve  8
## 6        AskAnAmerican     economy  8
## 7        AskAnAmerican       trade  8
## 8  PoliticalDiscussion    approval  8
## 9  PoliticalDiscussion     economy  8
## 10 PoliticalDiscussion      middle  8
## 11 PoliticalDiscussion      nation  8
## 12 PoliticalDiscussion        plan  8
## 13 PoliticalDiscussion   polarized  8
## 14 PoliticalDiscussion      splits  8
## 15 PoliticalDiscussion         tax  8
## 16       conservatives    approval  8
## 17       AskAnAmerican  disapprove  7
## 18       conservatives       trade  7
## 19            politics  disapprove  7
## 20            politics immigration  7
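
The counts above contain neither "trump's" nor numeric tokens such as "51", even though both appear among the raw tokens. That follows from the filters in step 5, as the small check below illustrates. The vector `muestra` is a made-up sample of tokens, not data taken from `base_final`.

```r
library(stringr)

muestra <- c("trump's", "51", "voters", "us", "tax", "economy")

# Only tokens made up entirely of lowercase ASCII letters pass the regex,
# so tokens containing apostrophes or digits are filtered out
solo_letras <- str_detect(muestra, "^[a-z]+$")

# Tokens must also be longer than two characters
largo_suficiente <- nchar(muestra) > 2

muestra[solo_letras & largo_suficiente]
```

A side effect worth noting is that the regex drops every contraction and possessive, so the possessive form "trump's" is excluded by the pattern rather than by the list of general topic words.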
# ============================================================
# 7. COMPUTE TF-IDF
# ============================================================

# ------------------------------------------------------------
# bind_tf_idf():
#
# term     = word being analyzed
# document = comparison group
# n        = frequency of the word within the group
#
# In this case:
#
# term     = palabra
# document = subreddit
# n        = number of times the word appears
# ------------------------------------------------------------

tfidf_subreddit <- conteos_tfidf %>%
  bind_tf_idf(
    term = palabra,
    document = subreddit,
    n = n
  )


# ============================================================
# 8. VIEW OVERALL RESULTS
# ============================================================

tfidf_subreddit %>%
  arrange(
    desc(tf_idf)
  ) %>%
  head(30)
##              subreddit    palabra n         tf       idf      tf_idf
## 1  PoliticalDiscussion     middle 8 0.03738318 0.2876821 0.010754470
## 2  PoliticalDiscussion     nation 8 0.03738318 0.2876821 0.010754470
## 3  PoliticalDiscussion       plan 8 0.03738318 0.2876821 0.010754470
## 4  PoliticalDiscussion  polarized 8 0.03738318 0.2876821 0.010754470
## 5  PoliticalDiscussion     splits 8 0.03738318 0.2876821 0.010754470
## 6  PoliticalDiscussion        tax 8 0.03738318 0.2876821 0.010754470
## 7        conservatives constantly 4 0.01990050 0.2876821 0.005725016
## 8        conservatives       lies 4 0.01990050 0.2876821 0.005725016
## 9        conservatives      media 4 0.01990050 0.2876821 0.005725016
## 10       AskAnAmerican    capture 3 0.01639344 0.2876821 0.004716100
## 11       AskAnAmerican      polls 3 0.01639344 0.2876821 0.004716100
## 12       AskAnAmerican     silent 3 0.01639344 0.2876821 0.004716100
## 13       AskAnAmerican   supports 3 0.01639344 0.2876821 0.004716100
## 14       AskAnAmerican       vote 3 0.01639344 0.2876821 0.004716100
## 15       AskAnAmerican      voted 3 0.01639344 0.2876821 0.004716100
## 16            politics    capture 3 0.01570681 0.2876821 0.004518567
## 17            politics      polls 3 0.01570681 0.2876821 0.004518567
## 18            politics     silent 3 0.01570681 0.2876821 0.004518567
## 19            politics   supports 3 0.01570681 0.2876821 0.004518567
## 20            politics      voted 3 0.01570681 0.2876821 0.004518567
## 21       conservatives    capture 3 0.01492537 0.2876821 0.004293762
## 22       conservatives      polls 3 0.01492537 0.2876821 0.004293762
## 23       conservatives     silent 3 0.01492537 0.2876821 0.004293762
## 24       conservatives   supports 3 0.01492537 0.2876821 0.004293762
## 25       AskAnAmerican  affecting 2 0.01092896 0.2876821 0.003144066
## 26       AskAnAmerican      legal 2 0.01092896 0.2876821 0.003144066
## 27       AskAnAmerican perception 2 0.01092896 0.2876821 0.003144066
## 28       AskAnAmerican     public 2 0.01092896 0.2876821 0.003144066
## 29            politics  affecting 2 0.01047120 0.2876821 0.003012378
## 30            politics constantly 2 0.01047120 0.2876821 0.003012378
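
The first row of this table can be reproduced by hand, which may help when reading the `tf`, `idf`, and `tf_idf` columns. `bind_tf_idf()` computes `tf` as the word's count divided by the total count for the document, and `idf` as the natural log of the number of documents divided by the number of documents that contain the word. In the sketch below, the total of 214 tokens for PoliticalDiscussion is not printed anywhere in the report; it is inferred from the reported `tf` (8 / 0.03738318), and the document count of 3 is inferred from the reported `idf`.

```r
# Worked check for "middle" in PoliticalDiscussion (n = 8)
n_palabra   <- 8
total_grupo <- 214        # assumption: inferred from the reported tf
n_grupos    <- 4          # subreddits in the dataset
grupos_con_palabra <- 3   # assumption: inferred from idf = log(4 / 3)

tf     <- n_palabra / total_grupo
idf    <- log(n_grupos / grupos_con_palabra)
tf_idf <- tf * idf

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 7)

# With only four documents, idf can take just four values,
# one for each possible number of subreddits containing the word
round(log(n_grupos / 1:4), 4)
```

Every word in the top 30 has an `idf` of 0.288, meaning each appears in three of the four subreddits, and any word used in all four gets a TF-IDF of exactly zero. With so few groups, the ranking is driven mostly by term frequency.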
# ============================================================
# 9. MOST DISTINCTIVE WORDS PER SUBREDDIT
# ============================================================

top_tfidf_subreddit <- tfidf_subreddit %>%
  group_by(
    subreddit
  ) %>%
  slice_max(
    tf_idf,
    n = 10,
    with_ties = FALSE
  ) %>%
  ungroup()

top_tfidf_subreddit
## # A tibble: 40 × 6
##    subreddit     palabra        n     tf   idf  tf_idf
##    <chr>         <chr>      <int>  <dbl> <dbl>   <dbl>
##  1 AskAnAmerican capture        3 0.0164 0.288 0.00472
##  2 AskAnAmerican polls          3 0.0164 0.288 0.00472
##  3 AskAnAmerican silent         3 0.0164 0.288 0.00472
##  4 AskAnAmerican supports       3 0.0164 0.288 0.00472
##  5 AskAnAmerican vote           3 0.0164 0.288 0.00472
##  6 AskAnAmerican voted          3 0.0164 0.288 0.00472
##  7 AskAnAmerican affecting      2 0.0109 0.288 0.00314
##  8 AskAnAmerican legal          2 0.0109 0.288 0.00314
##  9 AskAnAmerican perception     2 0.0109 0.288 0.00314
## 10 AskAnAmerican public         2 0.0109 0.288 0.00314
## # ℹ 30 more rows
# ============================================================
# 10. VISUALIZE TF-IDF PER SUBREDDIT
# ============================================================

top_tfidf_subreddit %>%
  ggplot(
    aes(
      x = reorder_within(
        palabra,
        tf_idf,
        subreddit
      ),
      y = tf_idf,
      fill = subreddit
    )
  ) +
  geom_col(
    show.legend = FALSE
  ) +
  coord_flip() +
  facet_wrap(
    ~ subreddit,
    scales = "free"
  ) +
  scale_x_reordered() +
  labs(
    title = "Most distinctive words by subreddit",
    subtitle = "TF-IDF analysis of public-opinion discussions about Donald Trump",
    x = "",
    y = "TF-IDF"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )

# ============================================================
# 5. WORD CLOUD · PUBLIC OPINION ABOUT DONALD TRUMP ON REDDIT
# ============================================================

# ------------------------------------------------------------
# WHAT WILL WE DO?
# ------------------------------------------------------------
#
# We will build a word cloud with the most frequent terms
# in the analyzed texts about Donald Trump.
#
# IDEA:
#
# The most frequent words appear larger.
#
# IMPORTANT:
#
# A word cloud is NOT an in-depth analysis.
#
# It is useful for:
#
# ✔ initial exploration
# ✔ quick visualization
# ✔ supporting a presentation
#
# BUT:
#
# ✘ it does not show context
# ✘ it does not show relationships between words
# ✘ it does not support conclusions about causality or political stance
# ------------------------------------------------------------


# ============================================================
# 1. INSTALL AND LOAD PACKAGES
# ============================================================

# Install the packages only if they are missing
if (!requireNamespace("wordcloud", quietly = TRUE)) {
  install.packages("wordcloud")
}

if (!requireNamespace("RColorBrewer", quietly = TRUE)) {
  install.packages("RColorBrewer")
}

library(wordcloud)
library(RColorBrewer)
library(dplyr)


# ============================================================
# 2. REVIEW OVERALL FREQUENCIES
# ============================================================

# If the 'frecuencias' table already exists from the text-cleaning step,
# this step only inspects it.
# If it does not exist, we rebuild it from tokens_limpios.

if (!exists("frecuencias")) {
  
  frecuencias <- tokens_limpios %>%
    count(
      palabra,
      sort = TRUE
    )
}

cat("\nTop 20 most frequent words:\n")
## 
## Top 20 most frequent words:
frecuencias %>%
  head(20)
##        palabra  n
## 1      trump's 81
## 2    americans 43
## 3       voters 43
## 4        trump 41
## 5     approval 26
## 6   disapprove 25
## 7      economy 25
## 8        trade 23
## 9      approve 22
## 10    policies 21
## 11 immigration 19
## 12    majority 19
## 13        poll 18
## 14          45 15
## 15          48 15
## 16     divided 15
## 17  healthcare 15
## 18      rating 14
## 19        2026 13
## 20          63 13
# ============================================================
# 3. CREATE OVERALL WORD CLOUD
# ============================================================

# ------------------------------------------------------------
# wordcloud():
#
# words:
# the words
#
# freq:
# frequency of each word
#
# min.freq:
# minimum frequency required to appear
#
# max.words:
# maximum number of words in the cloud
#
# random.order = FALSE:
# plots words in decreasing order of frequency
# ------------------------------------------------------------

set.seed(123)

wordcloud(
  words = frecuencias$palabra,
  freq = frecuencias$n,
  min.freq = 2,
  max.words = 100,
  random.order = FALSE,
  rot.per = 0.2,
  colors = brewer.pal(
    8,
    "Dark2"
  )
)

# ============================================================
# WORD FREQUENCY · REDDIT AND DONALD TRUMP
# ============================================================

# ------------------------------------------------------------
# WHAT WILL WE DO?
# ------------------------------------------------------------
#
# We will identify the most frequent words
# in the texts about Donald Trump.
#
# To do this we will:
#
# 1. Combine title + comment
# 2. Tokenize the text
# 3. Remove stopwords
# 4. Remove technical words
# 5. Count words
# 6. Plot the most frequent words
# ------------------------------------------------------------


# ============================================================
# 1. LOAD LIBRARIES
# ============================================================

library(dplyr)
library(tidytext)
library(stringr)
library(ggplot2)


# ============================================================
# 2. CREATE FULL TEXT
# ============================================================

base_frecuencia <- base_final %>%
  mutate(
    texto_completo = paste(title, comment, sep = " "),
    texto_completo = str_to_lower(texto_completo)
  )


# ============================================================
# 3. TOKENIZE TEXT
# ============================================================

tokens <- base_frecuencia %>%
  select(url, subreddit, texto_completo) %>%
  unnest_tokens(
    palabra,
    texto_completo
  )


# ============================================================
# 4. REMOVE STOPWORDS
# ============================================================

data("stop_words")

tokens_limpios <- tokens %>%
  anti_join(
    stop_words,
    by = c("palabra" = "word")
  )


# ============================================================
# 5. ADDITIONAL CLEANING
# ============================================================

# Technical words that add no meaning
palabras_tecnicas <- c(
  "amp", "http", "https", "com", "www", "reddit",
  "subreddit", "post", "posts", "comment", "comments"
)

# General topic words.
# They are removed because they appear very often and can hide
# more interesting words.
palabras_generales <- c(
  "trump", "donald", "president", "america",
  "american", "americans", "usa", "us"
)

tokens_limpios <- tokens_limpios %>%
  filter(
    !palabra %in% palabras_tecnicas
  ) %>%
  filter(
    !palabra %in% palabras_generales
  ) %>%
  filter(
    str_detect(palabra, "^[a-z]+$")
  ) %>%
  filter(
    nchar(palabra) > 2
  )


# ============================================================
# 6. OVERALL WORD FREQUENCY
# ============================================================

frecuencias <- tokens_limpios %>%
  count(
    palabra,
    sort = TRUE
  )

# View the 20 most frequent words
frecuencias %>%
  head(20)
##         palabra  n
## 1        voters 43
## 2      approval 26
## 3    disapprove 25
## 4       economy 25
## 5         trade 23
## 6       approve 22
## 7      policies 21
## 8   immigration 19
## 9      majority 19
## 10         poll 18
## 11      divided 15
## 12   healthcare 15
## 13       rating 14
## 14    unpopular 13
## 15       afraid 11
## 16      coastal 11
## 17  disapproval 11
## 18       elites 11
## 19 favorability 11
## 20       gallup 11
# ============================================================
# 7. PLOT OF THE MOST FREQUENT WORDS
# ============================================================

top_20_palabras <- frecuencias %>%
  slice_max(
    n,
    n = 20,
    with_ties = FALSE  # keep exactly 20 rows even when counts are tied
  )

ggplot(
  top_20_palabras,
  aes(
    x = reorder(palabra, n),
    y = n
  )
) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Most frequent words in discussions about Donald Trump",
    subtitle = "Analysis of Reddit texts",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )
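
One detail worth keeping in mind for the top-20 selection: by default `slice_max()` keeps ties, so it can return more rows than requested when several words share the cutoff frequency (as may happen with the words that occur 11 times in the table above). A toy illustration follows; the data frame `conteo_demo` is invented for the example.

```r
library(dplyr)

conteo_demo <- tibble(
  palabra = c("voters", "economy", "trade", "poll"),
  n = c(5, 3, 3, 1)
)

# Default (with_ties = TRUE): asking for 2 rows returns 3,
# because "economy" and "trade" are tied at the cutoff
con_empates <- conteo_demo %>% slice_max(n, n = 2)

# with_ties = FALSE returns exactly 2 rows
sin_empates <- conteo_demo %>% slice_max(n, n = 2, with_ties = FALSE)

nrow(con_empates)
nrow(sin_empates)
```

Which behavior is preferable depends on the chart: keeping ties avoids an arbitrary cut among equally frequent words, while dropping them keeps the number of bars fixed.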