INTRODUÇÃO

Declaração do Problema

A desinformação e a circulação de fake news se tornaram um dos principais problemas sociais da era digital. Com o crescimento de redes sociais e portais de notícias, entender as diferenças linguísticas, estruturais e temáticas entre notícias reais e notícias falsas é fundamental para pesquisadores, jornalistas e plataformas de mídia.

Como o problema será abordado:

limpeza e pré-processamento textual;
criação de novas variáveis;
análise exploratória;
visualizações comparativas entre notícias reais e falsas.

Técnica e abordagem

técnicas de NLP (Natural Language Processing) com o pacote tidytext;
contagem e comparação de tokens;
análise de sentimentos;
extração de padrões linguísticos;
visualizações com ggplot2.

Quem se beneficia?

jornalistas e verificadores de notícias;
desenvolvedores de ferramentas contra desinformação;
pesquisadores em comunicação digital;
formuladores de políticas públicas.

Pacotes Utilizados

tidyverse – leitura, limpeza e manipulação de dados
tidytext – tokenização, remoção de stopwords, análise textual
stringr – limpeza de strings
ggplot2 – visualização de dados
readr – importação dos arquivos CSV
textclean – remoção de caracteres especiais e limpeza profunda
wordcloud / wordcloud2 – nuvens de palavras
*lubridate – manipulação de datas
*tm – pré-processamento de texto
*text2vec – vetorização eficiente

#install.packages(c("tidyverse","tidytext","textclean","wordcloud","wordcloud2","text2vec","lubridate","ggridges","scales","knitr","kableExtra"))

library(tidyverse)

## Warning: pacote 'tidyverse' foi compilado no R versão 4.5.2

## Warning: pacote 'ggplot2' foi compilado no R versão 4.5.2

## Warning: pacote 'tidyr' foi compilado no R versão 4.5.2

## Warning: pacote 'lubridate' foi compilado no R versão 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)

## Warning: pacote 'tidytext' foi compilado no R versão 4.5.2

library(textclean)

## Warning: pacote 'textclean' foi compilado no R versão 4.5.2

library(stringr)
library(readr)
library(wordcloud)

## Warning: pacote 'wordcloud' foi compilado no R versão 4.5.2

## Carregando pacotes exigidos: RColorBrewer

library(wordcloud2)

## Warning: pacote 'wordcloud2' foi compilado no R versão 4.5.2

library(text2vec)

## Warning: pacote 'text2vec' foi compilado no R versão 4.5.2

library(lubridate)
library(ggridges)

## Warning: pacote 'ggridges' foi compilado no R versão 4.5.2

library(scales)

## Warning: pacote 'scales' foi compilado no R versão 4.5.2

## 
## Anexando pacote: 'scales'
## 
## O seguinte objeto é mascarado por 'package:purrr':
## 
##     discard
## 
## O seguinte objeto é mascarado por 'package:readr':
## 
##     col_factor

library(knitr)

## Warning: pacote 'knitr' foi compilado no R versão 4.5.2

library(kableExtra)

## Warning: pacote 'kableExtra' foi compilado no R versão 4.5.2

## 
## Anexando pacote: 'kableExtra'
## 
## O seguinte objeto é mascarado por 'package:dplyr':
## 
##     group_rows

PREPARAÇÃO DOS DADOS

Fonte

fake-and-real-news-dataset: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

Estrutura

A base contém dois arquivos:

Fake.csv
True.csv

Cada um deles contém:

título da notícia
texto completo
data (nem sempre presente)
autor (às vezes ausente)

Descrição

~44 mil registros no total
Texto não estruturado, com grande variação de tamanho
Diversas inconsistências:
campos vazios
datas em formatos variados
autores ausentes
caixa alta/baixa inconsistente
ruído textual (aspas, hífens, caracteres especiais, URLs)

Carregar os dados

fake_raw <- read_csv("Fake.csv")

## Rows: 23481 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, text, subject, date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

true_raw <- read_csv("True.csv")

## Rows: 21417 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, text, subject, date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fake_sample <- fake_raw %>% 
  sample_n(2500) %>%
  mutate(label = "Fake")

real_sample <- true_raw %>% 
  sample_n(2500) %>%
  mutate(label = "Real")

news <- bind_rows(fake_sample, real_sample) %>%
  mutate(id = row_number()) %>%
  select(id, everything())

glimpse(fake_raw)

## Rows: 23,481
## Columns: 4
## $ title   <chr> "Donald Trump Sends Out Embarrassing New Year’s Eve Message; T…
## $ text    <chr> "Donald Trump just couldn t wish all Americans a Happy New Yea…
## $ subject <chr> "News", "News", "News", "News", "News", "News", "News", "News"…
## $ date    <chr> "December 31, 2017", "December 31, 2017", "December 30, 2017",…

glimpse(true_raw)

## Rows: 21,417
## Columns: 4
## $ title   <chr> "As U.S. budget fight looms, Republicans flip their fiscal scr…
## $ text    <chr> "WASHINGTON (Reuters) - The head of a conservative Republican …
## $ subject <chr> "politicsNews", "politicsNews", "politicsNews", "politicsNews"…
## $ date    <chr> "December 31, 2017", "December 29, 2017", "December 31, 2017",…

text_col <- if("text" %in% colnames(news)) "text" else if("content" %in% colnames(news)) "content" else NA
text_col

## [1] "text"

Etapas de limpeza

conversão para minúsculas;

remoção de pontuação;

remoção de números;

remoção de URLs;

remoção de múltiplos espaços;

normalização de acentuação (quando necessário);

tokenização em palavras;

remoção de stopwords em inglês;

tratamento de palavras muito raras.

news <- news %>%
mutate(
text_full = case_when(
!is.na(text_col) & !is.na(.data[[text_col]]) ~ .data[[text_col]],
"title" %in% colnames(news) & !is.na(title) ~ paste(title, collapse = " "),
TRUE ~ ""
)
)

if (is.na(text_col)) {
if ("title" %in% colnames(news) & "text" %in% colnames(news)) {
news <- news %>% mutate(text_full = paste(title, text, sep = " - "))
} else {
# fallback: combine all character columns
char_cols <- news %>% select(where(is.character)) %>% colnames()
news <- news %>% mutate(text_full = apply(select(., all_of(char_cols)), 1, paste, collapse = " "))
}
}

clean_text <- function(x) {
x <- replace_url(x, replacement = " ")
x <- replace_html(x)
x <- str_to_lower(x)
x <- str_replace_all(x, "[^[:alnum:]\\s]", " ") # sem pontuação
x <- str_replace_all(x, "\\d+", " ") # sem números
x <- str_squish(x)
x <- replace_contraction(x) # expande contrações em english
x
}

news <- news %>%
mutate(text_clean = if_else(is.na(text_full), "", text_full),
text_clean = map_chr(text_clean, clean_text))

news %>% select(id, label) %>% slice_head(n = 5) %>% knitr::kable()

id	label
1	Fake
2	Fake
3	Fake
4	Fake
5	Fake

news <- news %>%
mutate(
n_chars = nchar(text_clean),
n_words = str_count(text_clean, "\\S+"),
n_unique_words = map_int(str_split(text_clean, "\\s+"), ~ n_distinct(.x)),
avg_word_len = if_else(n_words > 0, n_chars / n_words, NA_real_)
)

Resumo

news %>% group_by(label) %>%
summarise(n = n(),
mean_words = mean(n_words, na.rm = TRUE),
median_words = median(n_words, na.rm = TRUE),
sd_words = sd(n_words, na.rm = TRUE)) %>%
knitr::kable() %>% kable_styling(full_width = FALSE)

label	n	mean_words	median_words	sd_words
Fake	2500	2475.7960	384	11290.8326
Real	2500	387.4096	363	264.4102

Remoção de Stop Words e Tokenização

data("stop_words")

tokens <- news %>%
select(id, label, text_clean) %>%
unnest_tokens(word, text_clean) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]")) # keep alpha tokens only

Rankeamento

tokens %>% count(word, sort = TRUE) %>% slice_head(n = 20) %>% knitr::kable()

word	n
trump	147628
video	74760
obama	31232
hillary	22043
president	20485
watch	20070
house	19903
clinton	17443
white	15370
republican	12240
russia	12178
news	11963
north	11737
bill	11220
senate	11086
people	11030
election	10742
court	10508
media	10434
donald	10268

top_words_label <- tokens %>%
count(label, word, sort = TRUE) %>%
group_by(label) %>%
slice_max(n, n = 20) %>%
ungroup()

ggplot(top_words_label, aes(x = reorder_within(word, n, label), y = n, fill = label)) +
geom_col(show.legend = FALSE) +
facet_wrap(~label, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(title = "Top 20 palavras por categoria", x = "Palavra", y = "Frequencia")

Nuvem de palavras

freq_label <- tokens %>% count(label, word, sort = TRUE)

Fake

freq_label %>% filter(label == "Fake") %>% slice_max(n, n = 200) %>%
with(wordcloud(words = word, freq = n, max.words = 150, scale = c(5,0.6)))

Real

freq_label %>% filter(label == "Real") %>% slice_max(n, n = 200) %>%
with(wordcloud(words = word, freq = n, max.words = 150, scale = c(4,0.4)))

## palavras mais representativas por classe

tf_idf_tbl <- tokens %>%
count(label, word, sort = TRUE) %>%
bind_tf_idf(word, label, n) %>%
arrange(desc(tf_idf))

tf_idf_top <- tf_idf_tbl %>%
group_by(label) %>%
slice_max(tf_idf, n = 15) %>%
ungroup()

ggplot(tf_idf_top, aes(x = reorder_within(word, tf_idf, label), y = tf_idf, fill = label)) +
geom_col(show.legend = FALSE) +
facet_wrap(~label, scales = "free") +
coord_flip() +
scale_x_reordered() +
labs(title = "Top palavras por TF-IDF (por categoria)", x = "Palavra", y = "TF-IDF")

Analise de sentimento

#install.packages("textdata")
library(textdata)

## Warning: pacote 'textdata' foi compilado no R versão 4.5.2

bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")
nrc <- get_sentiments("nrc") # has emotions

sent_bing <- tokens %>%
inner_join(bing, by = "word") %>%
count(id, label, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment_score = (positive - negative))

sent_bing_summary <- sent_bing %>%
group_by(label) %>%
summarise(mean_sentiment = mean(sentiment_score, na.rm = TRUE),
median_sentiment = median(sentiment_score, na.rm = TRUE),
sd_sentiment = sd(sentiment_score, na.rm = TRUE),
n = n())

sent_bing_summary %>% knitr::kable()

label	mean_sentiment	median_sentiment	sd_sentiment	n
Fake	-17.286006	-3	74.013850	2451
Real	-1.609065	-1	8.618555	2471

sent_afinn <- tokens %>%
inner_join(afinn, by = "word") %>%
group_by(id, label) %>%
summarise(afinn_score = sum(value, na.rm = TRUE)) %>%
ungroup()

## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.

ggplot(sent_afinn, aes(x = afinn_score, fill = label)) +
geom_density(alpha = 0.5) +
labs(title = "Distribuicao de pontuacao AFINN por categoria", x = "Pontuacao AFINN")

2VA - Thiago Diniz

Thiago Diniz

2025-11-16