Donald Trump, the current president of the United States, is a political figure who generates highly polarized opinions and debate. As a result, discussions on social media, and on Reddit in particular, are rich in emotional, evaluative, and argumentative language, which makes them well suited to practicing natural language processing techniques. In addition, the volume of data available on the platform makes it possible to obtain a sample large enough for the analysis.
With this analysis we aim to identify the most frequent words and topics in discussions of approval or disapproval of the president, and to check whether there are meaningful linguistic differences between subreddits with different political leanings.
# Load the package (install it first with install.packages("RedditExtractoR") if it is missing)
library(RedditExtractoR)
In this section, to make sure the data we collect are relevant to the research question, we define the specific search terms used to filter the later download of posts from Reddit.
We use specific phrases such as “Trump approval poll” rather than broad terms such as “Trump”, following the recommendation to avoid overly broad searches. This makes it more likely that the captured posts explicitly contain opinions about his performance in office.
# Search terms that capture public opinion about Trump
terminos_opinion <- c(
  "Trump approval rating",
  "Americans think about Trump",
  "Trump public opinion",
  "Trump policies survey",
  "Do Americans support Trump",
  "Trump approval poll",
  "Trump disapproval reasons",
  "Americans view Trump decision"
)
# Subreddits where public opinion is debated (all active)
subreddits_opinion <- c(
  "politics",            # General political debate
  "AskAnAmerican",       # Questions for Americans
  "PoliticalDiscussion", # Serious discussion
  "conservatives",       # Conservative view
  "democrats",           # Liberal view
  "neutralnews",         # Attempt at neutrality
  "fivethirtyeight"      # Data and polls
)
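The two vectors above can be crossed into a search plan with one row per term and subreddit pair. The following is a minimal sketch in base R; the name `grid_busqueda` is an assumption introduced for illustration and does not appear in the original analysis.

```r
# Same vectors as in the section above, repeated so the sketch runs on its own
terminos_opinion <- c(
  "Trump approval rating", "Americans think about Trump",
  "Trump public opinion", "Trump policies survey",
  "Do Americans support Trump", "Trump approval poll",
  "Trump disapproval reasons", "Americans view Trump decision"
)
subreddits_opinion <- c(
  "politics", "AskAnAmerican", "PoliticalDiscussion", "conservatives",
  "democrats", "neutralnews", "fivethirtyeight"
)

# Hypothetical search plan: one row per (term, subreddit) combination
grid_busqueda <- expand.grid(
  termino = terminos_opinion,
  subreddit = subreddits_opinion,
  stringsAsFactors = FALSE
)

nrow(grid_busqueda)  # 8 terms x 7 subreddits = 56 searches
```

With a one-second pause per search, a plan of this size would take roughly a minute of waiting time, which is worth knowing before launching the download.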
The goal of this section is to implement a function that searches Reddit for discussion threads on a specific topic.
buscar_threads_documentado <- function(termino, subreddit = NA) {
  cat("\n🔍 Search:", termino)
  # sep = "" avoids the space cat() would otherwise insert after "r/"
  if (!is.na(subreddit)) cat(" | subreddit: r/", subreddit, sep = "")
  cat("\n")
  # Pause to respect Reddit's API (methodological decision)
  Sys.sleep(1)
  resultado <- tryCatch({
    find_thread_urls(
      keywords = termino,
      subreddit = subreddit,
      sort_by = "relevance", # More relevant than "top" for opinions
      period = "month"       # Last month (fresh data)
    )
  }, error = function(e) {
    cat(" ❌ Error:", conditionMessage(e), "\n")
    return(NULL)
  })
  # is.data.frame() also covers NULL or any other non-data-frame return value
  if (is.data.frame(resultado) && nrow(resultado) > 0) {
    # Add search metadata
    resultado$termino_usado <- termino
    resultado$subreddit_buscado <- ifelse(is.na(subreddit), "todos", subreddit)
    resultado$fecha_busqueda <- Sys.Date()
    cat(" ✅", nrow(resultado), "threads found\n")
    return(resultado)
  } else {
    cat(" ⚠️ 0 results\n")
    return(NULL)
  }
}
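A function like the one above is typically called once per term and subreddit pair, and the partial results are then stacked. The sketch below shows one way to do that; `recolectar_threads` and `buscar_falso` are hypothetical names introduced here, and the stand-in search function exists only so the example runs without contacting Reddit. It also assumes each result carries a `url` column, as the later sections of this analysis do.

```r
# Hypothetical helper (not in the original): run one search per row of a
# term x subreddit grid and stack the non-NULL results
recolectar_threads <- function(grid, buscar) {
  resultados <- Map(buscar, grid$termino, grid$subreddit)
  resultados <- Filter(Negate(is.null), resultados)
  if (length(resultados) == 0) return(NULL)
  combinado <- do.call(rbind, unname(resultados))
  # The same thread can match several terms; keep one row per URL
  combinado[!duplicated(combinado$url), , drop = FALSE]
}

# Offline stand-in for buscar_threads_documentado(), used only to exercise
# the helper without calling Reddit
buscar_falso <- function(termino, subreddit = NA) {
  if (subreddit == "neutralnews") return(NULL)  # simulate an empty search
  data.frame(
    url = paste0("https://example.com/", subreddit, "/1"),  # placeholder URL
    title = paste("Post about", termino),
    subreddit = subreddit,
    stringsAsFactors = FALSE
  )
}

grid <- expand.grid(
  termino = c("Trump approval rating", "Trump approval poll"),
  subreddit = c("politics", "neutralnews"),
  stringsAsFactors = FALSE
)

threads <- recolectar_threads(grid, buscar_falso)
nrow(threads)
```

In a real run, `buscar_falso` would be replaced by `buscar_threads_documentado`. Dropping duplicate URLs matters because overlapping search terms can return the same thread more than once.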
With the search filters defined, we extract posts and comments from Reddit.
##
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## START OF DATA DOWNLOAD
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## Minimum target: 80 posts
##
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## DOWNLOAD RESULTS (EXAMPLE DATA)
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
##
## STATISTICS:
## - Threads downloaded: 100
## - Target of 80 posts: REACHED
## - Subreddits represented: 4
## - Data period: 2026-03-31 to 2026-05-26
##
## DISTRIBUTION BY SUBREDDIT:
##
## AskAnAmerican conservatives PoliticalDiscussion politics
## 25 25 25 25
##
## EXAMPLE OF THE DATASET (first 5 posts):
## title
## 1 Trump favorability hits 51% in latest Gallup poll
## 2 Trump favorability hits 51% in latest Gallup poll
## 3 Polarized nation: Trump's tax plan splits voters down the middle
## 4 Poll shows 47% of Americans approve of Trump's handling of economy
## 5 Trump's trade policies unpopular with 63% of voters
## subreddit comments score
## 1 politics 241 2872
## 2 conservatives 375 1976
## 3 PoliticalDiscussion 252 2283
## 4 AskAnAmerican 359 507
## 5 politics 344 880
##
## EXAMPLE COMMENT:
## I voted for him twice but I am not sure about 2026.
##
## Data saved to 'datos_trump_opinion.rds'
Important:
When we tried to download real data from Reddit, the request failed because Reddit's API currently requires authentication. To complete the exercise, we generated a synthetic dataset that replicates the structure of the real data; the results reported here are computed on that synthetic dataset.
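The code that generated the synthetic dataset is not shown in this report. The sketch below is a hypothetical reconstruction of how a dataset with the same shape (100 posts, 25 per subreddit, with title, comment, comment count and score) could be built in base R. The object name `base_sintetica`, the seed, the placeholder URLs and the numeric ranges are all assumptions; the titles and the comment are taken from the example output above.

```r
# Hypothetical reconstruction: the original generator is not shown
set.seed(2026)  # arbitrary seed so the sketch is reproducible

subreddits <- c("AskAnAmerican", "conservatives", "PoliticalDiscussion", "politics")
titulos <- c(
  "Trump favorability hits 51% in latest Gallup poll",
  "Polarized nation: Trump's tax plan splits voters down the middle",
  "Poll shows 47% of Americans approve of Trump's handling of economy",
  "Trump's trade policies unpopular with 63% of voters"
)
comentarios <- c("I voted for him twice but I am not sure about 2026.")

n_por_subreddit <- 25
n_total <- length(subreddits) * n_por_subreddit

base_sintetica <- data.frame(
  url = sprintf("https://example.com/thread/%03d", seq_len(n_total)),  # placeholder URLs
  title = sample(titulos, n_total, replace = TRUE),
  comment = rep(comentarios, length.out = n_total),
  subreddit = rep(subreddits, each = n_por_subreddit),
  comments = sample(50:400, n_total, replace = TRUE),  # made-up comment counts
  score = sample(100:3000, n_total, replace = TRUE),   # made-up scores
  stringsAsFactors = FALSE
)

table(base_sintetica$subreddit)
```

Because titles are drawn from a small pool, the same title can appear in several subreddits, which is also visible in the first two rows of the example output above.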
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## TEXT CLEANING
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
##
## Text to analyze:
## - Total documents: 100
## - Average text length: 118.03 characters
##
## === TOKENIZATION ===
## Total tokens generated: 1915
## Unique words (types): 140
##
## Example tokens (first 30):
## [1] "trump" "favorability" "hits" "51" "in"
## [6] "latest" "gallup" "poll" "i" "voted"
## [11] "for" "him" "twice" "but" "i"
## [16] "am" "not" "sure" "about" "2026"
## [21] "trump" "favorability" "hits" "51" "in"
## [26] "latest" "gallup" "poll" "the" "media"
##
## === STOPWORD REMOVAL ===
## Stopwords removed (first 20):
## [1] "a" "a's" "able" "about" "above"
## [6] "according" "accordingly" "across" "actually" "after"
## [11] "afterwards" "again" "against" "ain't" "all"
## [16] "allow" "allows" "almost" "alone" "along"
##
## Results:
## - Tokens before: 1915
## - Tokens after: 1056
## - Tokens removed: 859
## - Percentage removed: 44.9 %
##
## Possible residual problems (technical words):
## No problematic technical words found
##
## After removing technical words:
## - Tokens remaining: 1056
##
## === MOST FREQUENT WORDS ===
## Top 15 most frequent words (overall):
## palabra n
## 1 trump's 81
## 2 americans 43
## 3 voters 43
## 4 trump 41
## 5 approval 26
## 6 disapprove 25
## 7 economy 25
## 8 trade 23
## 9 approve 22
## 10 policies 21
## 11 immigration 19
## 12 majority 19
## 13 poll 18
## 14 45 15
## 15 48 15
##
## === COMPARISON BY SUBREDDIT ===
## Top 5 words per subreddit:
## # A tibble: 20 × 3
## # Groups: subreddit [4]
## subreddit palabra n
## <chr> <chr> <int>
## 1 AskAnAmerican trump's 19
## 2 AskAnAmerican americans 12
## 3 AskAnAmerican trump 11
## 4 AskAnAmerican voters 10
## 5 AskAnAmerican approve 8
## 6 PoliticalDiscussion trump's 22
## 7 PoliticalDiscussion americans 12
## 8 PoliticalDiscussion voters 12
## 9 PoliticalDiscussion trump 10
## 10 PoliticalDiscussion approval 8
## 11 conservatives trump's 21
## 12 conservatives trump 11
## 13 conservatives americans 9
## 14 conservatives voters 9
## 15 conservatives approval 8
## 16 politics trump's 19
## 17 politics voters 12
## 18 politics americans 10
## 19 politics trump 9
## 20 politics disapprove 7
##
## === METHODOLOGICAL EXPLANATION ===
##
## DECISIONS MADE:
##
## 1. Tokenization:
## - unnest_tokens() was used, which converts everything to lowercase
## - It strips punctuation automatically; numbers are kept (e.g. "51", "2026" above)
## - It splits the text on word boundaries
##
## 2. Stopwords:
## - The 'stop_words' dictionary from tidytext was used
## - It contains English function words (the, and, a, to, etc.)
## - These words add little meaning to the analysis
##
## 3. Additional cleaning:
## - A filter for technical words such as 'amp', 'http', 'www' was applied
## - These appear because of Reddit's HTML structure
##
##
## RECOMMENDATIONS FOR FUTURE ANALYSES:
##
## - Apply stemming (reduce words to their root)
## - Build n-grams (phrases of 2-3 words)
## - Run a dedicated sentiment analysis
##
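The tokenization behavior described above, and the n-gram recommendation, can be checked on a single title from the example data. This is a minimal sketch that assumes `dplyr` and `tidytext` are installed; the object names `ejemplo`, `palabras` and `bigramas` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(tidytext)

ejemplo <- tibble(
  id = 1,
  texto = "Trump favorability hits 51% in latest Gallup poll"
)

# Default word tokenization: lowercases, drops the "%" sign, keeps the number
palabras <- ejemplo %>%
  unnest_tokens(palabra, texto) %>%
  pull(palabra)
palabras

# Bigrams (pairs of consecutive words), one of the recommended follow-ups
bigramas <- ejemplo %>%
  unnest_tokens(bigrama, texto, token = "ngrams", n = 2)
bigramas$bigrama
```

The word tokens match the first eight tokens listed in the tokenization output above, including the number "51". Numbers therefore have to be filtered explicitly if they are not wanted in the frequency tables.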
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
## VISUALIZATIONS
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
# ============================================================
# 4. TF-IDF · PUBLIC OPINION ABOUT DONALD TRUMP ON REDDIT
# ============================================================
# ------------------------------------------------------------
# WHAT WILL WE DO?
# ------------------------------------------------------------
#
# We will:
#
# 1. Use the dataset created earlier: base_final
# 2. Combine title + comment
# 3. Tokenize the text
# 4. Remove stopwords and technical words
# 5. Compute TF-IDF by subreddit
# 6. Identify the distinctive words of each community
#
# CENTRAL IDEA:
#
# TF-IDF highlights words that are:
#
# ✔ frequent within one group
# ✔ infrequent in the other groups
#
# Here, the group is the subreddit.
# This lets us compare the language used in communities
# such as politics, conservatives, PoliticalDiscussion and AskAnAmerican.
# ------------------------------------------------------------
# ============================================================
# 1. LOAD LIBRARIES
# ============================================================
library(dplyr)
library(tidytext)
library(stringr)
library(tidyr)
library(ggplot2)
# ============================================================
# 2. CREATE THE FULL-TEXT VARIABLE
# ============================================================
# ------------------------------------------------------------
# The post title is combined with the comment.
# This lets us analyze all the available text.
# ------------------------------------------------------------
base_tfidf <- base_final %>%
  mutate(
    texto_completo = paste(title, comment, sep = " "),
    texto_completo = str_to_lower(texto_completo)
  )
# ============================================================
# 3. CHECK DOCUMENTS PER SUBREDDIT
# ============================================================
base_tfidf %>%
  count(subreddit, sort = TRUE)
## subreddit n
## 1 AskAnAmerican 25
## 2 PoliticalDiscussion 25
## 3 conservatives 25
## 4 politics 25
# ============================================================
# 4. TOKENIZE THE TEXT
# ============================================================
# ------------------------------------------------------------
# unnest_tokens():
#
# Splits each text into individual words.
# Each word becomes one row.
# ------------------------------------------------------------
tokens_tfidf <- base_tfidf %>%
  select(url, subreddit, texto_completo) %>%
  unnest_tokens(
    palabra,
    texto_completo
  )
# ============================================================
# 5. TOKEN CLEANING
# ============================================================
# ------------------------------------------------------------
# We remove:
#
# ✔ English stopwords
# ✔ Reddit technical words
# ✔ numbers
# ✔ very short words
# ✔ terms that are too general for the topic
# ------------------------------------------------------------
data("stop_words")
palabras_tecnicas <- c(
  "amp", "http", "https", "com", "www", "reddit",
  "subreddit", "post", "posts", "comment", "comments"
)
# Topic words that can appear in every group
# and therefore do little to tell subreddits apart.
palabras_generales_tema <- c(
  "trump", "donald", "president", "america", "american",
  "americans", "usa", "us"
)
tokens_tfidf_limpios <- tokens_tfidf %>%
  anti_join(
    stop_words,
    by = c("palabra" = "word")
  ) %>%
  filter(
    !palabra %in% palabras_tecnicas
  ) %>%
  filter(
    !palabra %in% palabras_generales_tema
  ) %>%
  # Keep tokens made only of letters a-z (drops numbers and forms like "trump's")
  filter(
    str_detect(palabra, "^[a-z]+$")
  ) %>%
  # Keep tokens with at least 3 characters
  filter(
    nchar(palabra) > 2
  )
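The last two filters in the cleaning step are easy to misread, so here is a small base-R sketch of what they keep and drop. The vector `tokens_ejemplo` is a hypothetical sample, and `grepl()` is used as a base-R stand-in for `str_detect()` with the same pattern.

```r
# Hypothetical sample of tokens; "trump's" and "51" appear in the earlier
# frequency output, "us" is on the topic-word list
tokens_ejemplo <- c("trump's", "51", "voters", "gallup", "us")

solo_letras <- grepl("^[a-z]+$", tokens_ejemplo)  # letters a-z only
largo_suficiente <- nchar(tokens_ejemplo) > 2     # at least 3 characters

tokens_filtrados <- tokens_ejemplo[solo_letras & largo_suficiente]
tokens_filtrados
# → "voters" "gallup"
```

This is why "trump's" and numbers such as "45" and "48" top the frequency table produced by the text cleaning step but are absent from the counts built after these filters.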
# ============================================================
# 6. COUNT WORDS BY SUBREDDIT
# ============================================================
conteos_tfidf <- tokens_tfidf_limpios %>%
  count(
    subreddit,
    palabra,
    sort = TRUE
  )
conteos_tfidf %>%
  head(20)
## subreddit palabra n
## 1 PoliticalDiscussion voters 12
## 2 politics voters 12
## 3 AskAnAmerican voters 10
## 4 conservatives voters 9
## 5 AskAnAmerican approve 8
## 6 AskAnAmerican economy 8
## 7 AskAnAmerican trade 8
## 8 PoliticalDiscussion approval 8
## 9 PoliticalDiscussion economy 8
## 10 PoliticalDiscussion middle 8
## 11 PoliticalDiscussion nation 8
## 12 PoliticalDiscussion plan 8
## 13 PoliticalDiscussion polarized 8
## 14 PoliticalDiscussion splits 8
## 15 PoliticalDiscussion tax 8
## 16 conservatives approval 8
## 17 AskAnAmerican disapprove 7
## 18 conservatives trade 7
## 19 politics disapprove 7
## 20 politics immigration 7
# ============================================================
# 7. COMPUTE TF-IDF
# ============================================================
# ------------------------------------------------------------
# bind_tf_idf():
#
# term = the word being analyzed
# document = the comparison group
# n = frequency of the word within the group
#
# In this case:
#
# term = palabra
# document = subreddit
# n = number of times the word appears
# ------------------------------------------------------------
tfidf_subreddit <- conteos_tfidf %>%
  bind_tf_idf(
    term = palabra,
    document = subreddit,
    n = n
  )
# ============================================================
# 8. VIEW OVERALL RESULTS
# ============================================================
tfidf_subreddit %>%
  arrange(
    desc(tf_idf)
  ) %>%
  head(30)
## subreddit palabra n tf idf tf_idf
## 1 PoliticalDiscussion middle 8 0.03738318 0.2876821 0.010754470
## 2 PoliticalDiscussion nation 8 0.03738318 0.2876821 0.010754470
## 3 PoliticalDiscussion plan 8 0.03738318 0.2876821 0.010754470
## 4 PoliticalDiscussion polarized 8 0.03738318 0.2876821 0.010754470
## 5 PoliticalDiscussion splits 8 0.03738318 0.2876821 0.010754470
## 6 PoliticalDiscussion tax 8 0.03738318 0.2876821 0.010754470
## 7 conservatives constantly 4 0.01990050 0.2876821 0.005725016
## 8 conservatives lies 4 0.01990050 0.2876821 0.005725016
## 9 conservatives media 4 0.01990050 0.2876821 0.005725016
## 10 AskAnAmerican capture 3 0.01639344 0.2876821 0.004716100
## 11 AskAnAmerican polls 3 0.01639344 0.2876821 0.004716100
## 12 AskAnAmerican silent 3 0.01639344 0.2876821 0.004716100
## 13 AskAnAmerican supports 3 0.01639344 0.2876821 0.004716100
## 14 AskAnAmerican vote 3 0.01639344 0.2876821 0.004716100
## 15 AskAnAmerican voted 3 0.01639344 0.2876821 0.004716100
## 16 politics capture 3 0.01570681 0.2876821 0.004518567
## 17 politics polls 3 0.01570681 0.2876821 0.004518567
## 18 politics silent 3 0.01570681 0.2876821 0.004518567
## 19 politics supports 3 0.01570681 0.2876821 0.004518567
## 20 politics voted 3 0.01570681 0.2876821 0.004518567
## 21 conservatives capture 3 0.01492537 0.2876821 0.004293762
## 22 conservatives polls 3 0.01492537 0.2876821 0.004293762
## 23 conservatives silent 3 0.01492537 0.2876821 0.004293762
## 24 conservatives supports 3 0.01492537 0.2876821 0.004293762
## 25 AskAnAmerican affecting 2 0.01092896 0.2876821 0.003144066
## 26 AskAnAmerican legal 2 0.01092896 0.2876821 0.003144066
## 27 AskAnAmerican perception 2 0.01092896 0.2876821 0.003144066
## 28 AskAnAmerican public 2 0.01092896 0.2876821 0.003144066
## 29 politics affecting 2 0.01047120 0.2876821 0.003012378
## 30 politics constantly 2 0.01047120 0.2876821 0.003012378
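The first row of the table above can be reproduced by hand, which helps in reading the `tf`, `idf` and `tf_idf` columns. The sketch below uses the standard definitions, tf = n / (total tokens in the document) and idf = ln(number of documents / number of documents containing the term). The token total and the number of subreddits containing the word are not printed in the report; they are inferred here from the reported `tf` and `idf` values and should be read as assumptions.

```r
n_palabra <- 8          # count of "middle" in PoliticalDiscussion (from the table)
tf_tabla <- 0.03738318  # tf reported for that row
total_tokens <- round(n_palabra / tf_tabla)  # implied cleaned-token total: 214

n_documentos <- 4   # subreddits being compared
n_con_palabra <- 3  # implied by idf = log(4 / 3); not stated directly in the report

tf <- n_palabra / total_tokens
idf <- log(n_documentos / n_con_palabra)  # natural logarithm
tf_idf <- tf * idf

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 7)
```

The values agree with the table (tf 0.0373832, idf 0.2876821, tf_idf 0.0107545). Every row in the top 30 shows the same idf, ln(4/3), which implies each of those words occurs in three of the four subreddits, so the ranking is driven by tf alone. A word present in all four subreddits would get idf = ln(1) = 0 and a TF-IDF of zero.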
# ============================================================
# 9. MOST DISTINCTIVE WORDS BY SUBREDDIT
# ============================================================
top_tfidf_subreddit <- tfidf_subreddit %>%
  group_by(
    subreddit
  ) %>%
  slice_max(
    tf_idf,
    n = 10,
    with_ties = FALSE
  ) %>%
  ungroup()
top_tfidf_subreddit
## # A tibble: 40 × 6
## subreddit palabra n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 AskAnAmerican capture 3 0.0164 0.288 0.00472
## 2 AskAnAmerican polls 3 0.0164 0.288 0.00472
## 3 AskAnAmerican silent 3 0.0164 0.288 0.00472
## 4 AskAnAmerican supports 3 0.0164 0.288 0.00472
## 5 AskAnAmerican vote 3 0.0164 0.288 0.00472
## 6 AskAnAmerican voted 3 0.0164 0.288 0.00472
## 7 AskAnAmerican affecting 2 0.0109 0.288 0.00314
## 8 AskAnAmerican legal 2 0.0109 0.288 0.00314
## 9 AskAnAmerican perception 2 0.0109 0.288 0.00314
## 10 AskAnAmerican public 2 0.0109 0.288 0.00314
## # ℹ 30 more rows
# ============================================================
# 10. VISUALIZE TF-IDF BY SUBREDDIT
# ============================================================
top_tfidf_subreddit %>%
  ggplot(
    aes(
      x = reorder_within(
        palabra,
        tf_idf,
        subreddit
      ),
      y = tf_idf,
      fill = subreddit
    )
  ) +
  geom_col(
    show.legend = FALSE
  ) +
  coord_flip() +
  facet_wrap(
    ~ subreddit,
    scales = "free"
  ) +
  scale_x_reordered() +
  labs(
    title = "Most distinctive words by subreddit",
    subtitle = "TF-IDF analysis of public-opinion discussions about Donald Trump",
    x = "",
    y = "TF-IDF"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )
# ============================================================
# 4. WORD CLOUD · PUBLIC OPINION ABOUT DONALD TRUMP ON REDDIT
# ============================================================
# ------------------------------------------------------------
# WHAT WILL WE DO?
# ------------------------------------------------------------
#
# We will build a word cloud from the most frequent terms
# in the analyzed texts about Donald Trump.
#
# IDEA:
#
# The most frequent words are drawn larger.
#
# IMPORTANT:
#
# A word cloud is NOT an in-depth analysis.
#
# It is useful for:
#
# ✔ initial exploration
# ✔ quick visualization
# ✔ supporting a presentation
#
# BUT:
#
# ✘ it shows no context
# ✘ it shows no relationships between words
# ✘ it does not support conclusions about causality or political stance
# ------------------------------------------------------------
# ============================================================
# 1. INSTALL AND LOAD PACKAGES
# ============================================================
# Install only if the packages are missing
if (!requireNamespace("wordcloud", quietly = TRUE)) {
  install.packages("wordcloud")
}
if (!requireNamespace("RColorBrewer", quietly = TRUE)) {
  install.packages("RColorBrewer")
}
library(wordcloud)
library(RColorBrewer)
library(dplyr)
# ============================================================
# 2. CHECK OVERALL FREQUENCIES
# ============================================================
# If the 'frecuencias' table already exists from the text cleaning,
# this step only inspects it.
# If it does not exist, we rebuild it from tokens_limpios.
if (!exists("frecuencias")) {
  frecuencias <- tokens_limpios %>%
    count(
      palabra,
      sort = TRUE
    )
}
cat("\nTop 20 most frequent words:\n")
##
## Top 20 most frequent words:
frecuencias %>%
  head(20)
## palabra n
## 1 trump's 81
## 2 americans 43
## 3 voters 43
## 4 trump 41
## 5 approval 26
## 6 disapprove 25
## 7 economy 25
## 8 trade 23
## 9 approve 22
## 10 policies 21
## 11 immigration 19
## 12 majority 19
## 13 poll 18
## 14 45 15
## 15 48 15
## 16 divided 15
## 17 healthcare 15
## 18 rating 14
## 19 2026 13
## 20 63 13
# ============================================================
# 3. BUILD THE OVERALL WORD CLOUD
# ============================================================
# ------------------------------------------------------------
# wordcloud():
#
# words:
# the words
#
# freq:
# frequency of each word
#
# min.freq:
# minimum frequency required to appear
#
# max.words:
# maximum number of words in the cloud
#
# random.order = FALSE:
# plots words in decreasing order of frequency
# ------------------------------------------------------------
set.seed(123)
wordcloud(
  words = frecuencias$palabra,
  freq = frecuencias$n,
  min.freq = 2,
  max.words = 100,
  random.order = FALSE,
  rot.per = 0.2,
  colors = brewer.pal(
    8,
    "Dark2"
  )
)
# ============================================================
# WORD FREQUENCY · REDDIT AND DONALD TRUMP
# ============================================================
# ------------------------------------------------------------
# WHAT WILL WE DO?
# ------------------------------------------------------------
#
# We will identify the most frequent words
# in the texts about Donald Trump.
#
# To do so, we will:
#
# 1. Combine title + comment
# 2. Tokenize the text
# 3. Remove stopwords
# 4. Remove technical words
# 5. Count words
# 6. Plot the most frequent words
# ------------------------------------------------------------
# ============================================================
# 1. LOAD LIBRARIES
# ============================================================
library(dplyr)
library(tidytext)
library(stringr)
library(ggplot2)
# ============================================================
# 2. CREATE THE FULL TEXT
# ============================================================
base_frecuencia <- base_final %>%
  mutate(
    texto_completo = paste(title, comment, sep = " "),
    texto_completo = str_to_lower(texto_completo)
  )
# ============================================================
# 3. TOKENIZE THE TEXT
# ============================================================
tokens <- base_frecuencia %>%
  select(url, subreddit, texto_completo) %>%
  unnest_tokens(
    palabra,
    texto_completo
  )
# ============================================================
# 4. REMOVE STOPWORDS
# ============================================================
data("stop_words")
tokens_limpios <- tokens %>%
  anti_join(
    stop_words,
    by = c("palabra" = "word")
  )
# ============================================================
# 5. ADDITIONAL CLEANING
# ============================================================
# Technical words that add no meaning
palabras_tecnicas <- c(
  "amp", "http", "https", "com", "www", "reddit",
  "subreddit", "post", "posts", "comment", "comments"
)
# General topic words.
# They are removed because they appear very often and can hide
# more interesting words.
palabras_generales <- c(
  "trump", "donald", "president", "america",
  "american", "americans", "usa", "us"
)
tokens_limpios <- tokens_limpios %>%
  filter(
    !palabra %in% palabras_tecnicas
  ) %>%
  filter(
    !palabra %in% palabras_generales
  ) %>%
  # Keep tokens made only of letters a-z (drops numbers and forms like "trump's")
  filter(
    str_detect(palabra, "^[a-z]+$")
  ) %>%
  # Keep tokens with at least 3 characters
  filter(
    nchar(palabra) > 2
  )
# ============================================================
# 6. OVERALL WORD FREQUENCY
# ============================================================
frecuencias <- tokens_limpios %>%
  count(
    palabra,
    sort = TRUE
  )
# View the 20 most frequent words
frecuencias %>%
  head(20)
## palabra n
## 1 voters 43
## 2 approval 26
## 3 disapprove 25
## 4 economy 25
## 5 trade 23
## 6 approve 22
## 7 policies 21
## 8 immigration 19
## 9 majority 19
## 10 poll 18
## 11 divided 15
## 12 healthcare 15
## 13 rating 14
## 14 unpopular 13
## 15 afraid 11
## 16 coastal 11
## 17 disapproval 11
## 18 elites 11
## 19 favorability 11
## 20 gallup 11
# ============================================================
# 7. PLOT OF THE MOST FREQUENT WORDS
# ============================================================
top_20_palabras <- frecuencias %>%
  slice_max(
    n,
    n = 20,
    with_ties = FALSE # without this, words tied at the cutoff would yield more than 20 rows
  )
ggplot(
  top_20_palabras,
  aes(
    x = reorder(palabra, n),
    y = n
  )
) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Most frequent words in discussions about Donald Trump",
    subtitle = "Analysis of Reddit texts",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )