library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(stringr)
library(tidytext)
library(ggplot2)
library(widyr)
library(igraph)

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(ggraph)
library(wordcloud2)
library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(topicmodels)
library(htmlwidgets)
library(webshot)
# Install PhantomJS
webshot::install_phantomjs()

## It seems that the version of `phantomjs` installed is greater than or equal to the requested version.To install the requested version or downgrade to another version, use `force = TRUE`.

Executive summary

Fake news is loosely defined as “lies and propaganda that are falsely presented as news, such as cruel headlines or deliberately manipulated photos and videos, for hateful purposes.” The problem of fake news has been around for a long time. Whatever the reason for its creation, fake news not only misleads the public but can also harm individuals, for example through defamation.

For anyone who consumes or distributes news, it is important to be able to distinguish fake news from real news so that only real news reaches readers. However, manually analysing and fact-checking news texts one by one demands a great deal of time and labour. In practice, therefore, computational techniques such as AI and automated text analysis are used to identify fake news.

The fact that a computer reads and judges the text suggests that the difference between fake news and real news is in the text itself.

Through this project, we will analyse the data of fake news and real news and visualise it in word clouds, pie charts, etc. to find out whether there are any characteristics of the text itself.

Data background

For the data, I use the ‘Misinformation & Fake News text dataset 79k’ dataset from Kaggle: https://www.kaggle.com/datasets/stevenpeutz/misinformation-fake-news-text-dataset-79k The data is collected from the United States and contains articles that are considered fake news and articles that are considered real news.

True-article

The ‘true’ articles come from a variety of sources, such as Reuters, the New York Times, the Washington Post and more.

fake-article

The ‘fake’ articles are sourced from:

1. American right wing extremist websites (such as Redflag Newsdesk, Breitbart, Truth Broadcast Network).
2. A previously published dataset described in the following article: Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques.” In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).
3. Disinformation and propaganda cases collected by the EUvsDisinfo project, a project started in 2015 that identifies and fact-checks disinformation cases originating from pro-Kremlin media that are spread across the EU.

Data loading, cleaning and preprocessing

Upon loading the original data, it becomes evident that it is not tokenised. Consequently, in order to utilise this data, it is necessary to apply the process of tokenisation and to remove the stop words.
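To make the two preprocessing steps concrete, here is a minimal base-R sketch on a single made-up sentence. The sentence and the short `toy_stop_words` vector are assumptions for illustration only; the report itself relies on `unnest_tokens()` and tidytext's much longer `stop_words` table, which handle many more edge cases than this regular expression does.

```r
# Toy input: one hypothetical sentence, not taken from the dataset
text <- "The President said the report was FAKE, but the report was real."

# A tiny illustrative stop-word list (an assumption; the real list is far longer)
toy_stop_words <- c("the", "was", "but", "said")

# Tokenise: lower-case, then split on anything that is not a letter, digit or apostrophe
tokens <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
tokens <- tokens[tokens != ""]

# Remove stop words (the role played by anti_join(stop_words) in the report)
kept <- tokens[!tokens %in% toy_stop_words]

# Frequency of the remaining words, most frequent first
sort(table(kept), decreasing = TRUE)
```

After these two steps only content-bearing words remain, which is the form the later frequency, sentiment and TF-IDF steps expect.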

fake_data <- read.csv("DataSet_Misinfo_FAKE.csv")
true_data <- read.csv("DataSet_Misinfo_TRUE.csv")

df_fake <- fake_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining with `by = join_by(word)`

df_true <- true_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining with `by = join_by(word)`

Text data analysis

Individual analysis and figures

Analysis and Figure 1: Wordcloud

Firstly, word clouds were created to compare surface-level word frequencies. A word cloud gives an intuitive overview of which words occur most often, allowing a rapid assessment of the data. The clouds for fake news and real news showed minimal surface-level differences: the terms “Donald Trump,” “Clinton,” and “people” appeared in both with similar frequency. On its own, the comparison yielded no significant result. It does, however, suggest that fake news and real news are hard to tell apart on the basis of surface vocabulary alone.
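The visual impression that the two clouds share most of their vocabulary could also be put into a number. The sketch below is one possible way to do that, not something the report computed: it measures the overlap between two top-word lists with the Jaccard index. The vectors `top_fake` and `top_true` are hypothetical stand-ins for the `word` columns of `fake_word_freq` and `true_word_freq`.

```r
# Hypothetical top-word lists (placeholders, not actual results)
top_fake <- c("trump", "people", "clinton", "president", "obama")
top_true <- c("trump", "president", "people", "government", "clinton")

# Words that appear in both lists
shared <- intersect(top_fake, top_true)

# Jaccard index: size of the intersection divided by size of the union
jaccard <- length(shared) / length(union(top_fake, top_true))

shared
jaccard
```

A value close to 1 would support the reading that the most frequent words are largely the same in both classes.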

# Count the frequency of each word in df_fake
fake_word_freq <- df_fake %>%
  count(word, sort = TRUE) %>% 
  head(200)

# Count the frequency of each word in df_true
true_word_freq <- df_true %>%
  count(word, sort = TRUE) %>% 
  head(200)

# Create a word cloud for fake news
# (the stray unnamed `title` argument was removed; wordcloud2() has no title parameter)
wc_f <- wordcloud2(fake_word_freq, size = 1, color = "pink", backgroundColor = "white", shape = 'circle')
print(wc_f)
# Create a word cloud for true news
wc_t <- wordcloud2(true_word_freq, size = 1, color = "skyblue", backgroundColor = "white", shape = 'circle')
print(wc_t)

saveWidget(wc_f, "wc_f.html", selfcontained = FALSE)
saveWidget(wc_t, "wc_t.html", selfcontained = FALSE)

webshot("wc_f.html", file = "images/wc_f.png", delay = 5)

webshot("wc_t.html", file = "images/wc_t.png", delay = 5)

Analysis and Figure 2: Sentiment Distribution Pie-chart

Secondly, pie charts were constructed to illustrate the distribution of sentiment, using the Bing lexicon. The Bing lexicon labels each word as either negative or positive, so the charts show those two categories, drawn in pastel pink and pastel blue respectively. The “Neutral” branch in the code is never reached, and words that are absent from the lexicon are dropped by the inner join. In both charts, the share of negative words is greater than the share of positive words. A comparison of the charts shows that fake news is slightly more negative. However, the difference is too small to support the claim that fake news uses a greater proportion of negative vocabulary than real news.
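The proportions behind such a chart come from joining the tokens to a lexicon and counting the labels. Here is a minimal base-R sketch of that idea, using made-up tokens and a five-word `toy_lexicon` (both assumptions for illustration; the real Bing lexicon contains thousands of entries).

```r
# Hypothetical tokens and a tiny made-up lexicon
tokens <- c("attack", "great", "crisis", "win", "police", "fraud", "report")
toy_lexicon <- data.frame(
  word      = c("attack", "crisis", "fraud", "great", "win"),
  sentiment = c("negative", "negative", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Inner join: tokens missing from the lexicon ("police", "report") are dropped
matched <- merge(data.frame(word = tokens, stringsAsFactors = FALSE),
                 toy_lexicon, by = "word")

# Share of each sentiment among the matched tokens only
shares <- prop.table(table(matched$sentiment))
shares
```

Note that the shares are relative to the matched words, not to all words in the text, so they say nothing about how much of the text carries sentiment at all.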

bing <- get_sentiments("bing")

fake_sentiments <- fake_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  inner_join(bing) %>%
  count(sentiment) %>%
  mutate(sentiment = ifelse(sentiment == "negative", "Negative", 
                            ifelse(sentiment == "positive", "Positive", "Neutral")),
         n = n / sum(n))

## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`

## Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 719764 of `x` matches multiple rows in `y`.
## ℹ Row 1997 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

true_sentiments <- true_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  inner_join(bing) %>%
  count(sentiment) %>%
  mutate(sentiment = ifelse(sentiment == "negative", "Negative", 
                            ifelse(sentiment == "positive", "Positive", "Neutral")),
         n = n / sum(n))

## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`

## Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1984363 of `x` matches multiple rows in `y`.
## ℹ Row 1008 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

pastel_colors <- c("#FFB3BA", "#BAE1FF", "#BFFCC6") 


# pie() draws the chart as a side effect and returns NULL, so the result is not assigned
pie(fake_sentiments$n, labels = fake_sentiments$sentiment,
    main = "Fake News Sentiment Distribution", col = pastel_colors,
    clockwise = TRUE)

pie(true_sentiments$n, labels = true_sentiments$sentiment,
    main = "True News Sentiment Distribution", col = pastel_colors,
    clockwise = TRUE)

jpeg(filename = "images/true_news_sentiment.jpeg")
pie(true_sentiments$n, labels = true_sentiments$sentiment,
    main = "True News Sentiment Distribution", col = pastel_colors,
    clockwise = TRUE)
dev.off()

## quartz_off_screen 
##                 2

Analysis and Figure 3: TF-IDF Bar-chart

Thirdly, we compare the ten words with the highest TF-IDF values in fake news and in real news. TF-IDF is a statistical measure of how important a word is to one document relative to the other documents in a collection. Here, each label (fake or true) is treated as a single document.
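To show how the measure behaves when there are only two "documents", the following base-R sketch computes TF-IDF by hand on made-up counts. It assumes the usual definition (term frequency times the natural log of the number of documents divided by the number of documents containing the word), which is, to my understanding, what `bind_tf_idf()` computes. The counts are invented for illustration.

```r
# Hypothetical word counts per label (not from the dataset)
counts <- data.frame(
  label = c("fake", "fake", "fake", "true", "true", "true"),
  word  = c("trump", "21wire", "people", "trump", "rakhine", "people"),
  n     = c(5, 3, 2, 4, 4, 2),
  stringsAsFactors = FALSE
)

# Term frequency: share of each word within its own label
counts$tf <- counts$n / ave(counts$n, counts$label, FUN = sum)

# Inverse document frequency: ln(number of labels / labels containing the word)
n_docs   <- length(unique(counts$label))
doc_freq <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

With two labels, any word that occurs in both gets an IDF of ln(2/2) = 0, so only words exclusive to one label can score above zero. This is worth keeping in mind when reading the chart: it effectively ranks the most frequent words that never appear in the other class.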

Fake news: The majority of the words are Russian, including common words such as “что”, “не”, “с”, “по”, “это”, and “как”. These are everyday function words that presumably survive because the stop-word list used here covers English only. The list also contains words tied to a specific website, such as “21wire” and “21wire.tv”, and unusual tokens such as “quot” and “â”, which look like leftover HTML entities and character-encoding debris rather than real words.

True news: The list contains numerous proper nouns or names of specific individuals, including “rakhine”, “mnangagwa”, and “pamkeynen”. Words such as “marawi”, “barnier”, “tillerson’s”, “durst”, “panel’s”, and “tmsnrt.rs” are likely tied to a specific event, person, or source.

Among the top TF-IDF terms, fake news is characterised by generic words and words tied to specific websites, whereas real news is characterised by proper nouns linked to specific events or individuals. Furthermore, the top fake-news terms are largely Russian, whereas the top real-news terms are English. The TF-IDF values for certain words in real news are observed to be higher than those for the same words in fake news. This may be because real news is more specific and concentrated on a particular subject matter.

re_df_fake <- df_fake %>% mutate(label = "fake")
re_df_true <- df_true %>% mutate(label = "true")

combined_data <- bind_rows(re_df_fake, re_df_true)

# Calculate term frequency
tf <- combined_data %>%
  count(label, word, sort = TRUE) %>%
  group_by(label) %>%
  mutate(total = sum(n)) %>%
  ungroup()

# Calculate TF-IDF
tf_idf <- tf %>%
  bind_tf_idf(word, label, n) %>%
  arrange(desc(tf_idf))

# Visualization
tf_idf_plot <- tf_idf %>%
  group_by(label) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, label)) %>%
  ggplot(aes(word, tf_idf, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~label, scales = "free") +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "Top 10 TF-IDF Words in Fake and True News",
       x = "Words",
       y = "TF-IDF")

print(tf_idf_plot)

ggsave(filename = "images/tf_idf_plot.jpeg", plot = tf_idf_plot, device = "jpeg")

## Saving 7 x 5 in image

Analysis and Figure 4: Topic Modeling

For further thematic analysis, we used topic modelling. The Latent Dirichlet Allocation (LDA) method was used to identify the latent topics within the documents, and a bar plot was then constructed to show the words associated with each topic.

True News Topics

- Topic 1: Key words: people, police, government, city, killed, told, islamic, country, officials, attack. These words are mainly related to government, and appear to be about violent events or terrorism.
- Topic 2: Key words: trump, clinton, people, campaign, it’s, time, republican, news, ms, president. Political topics, mostly related to Trump and Clinton, election campaigns, and the Republican Party.
- Topic 3: Key words: trump, house, u.s, president, republican, senate, court, law, federal, tax. Topics related to the US government, legislative branch, judicial branch, and taxation. Many are related to President Trump.
- Topic 4: Key words: u.s., trump, president, united, russia, china, foreign, north, government, russian. Topics related to international politics and diplomacy, including the United States, Russia, and China. Many are related to President Trump.

Fake News Topics

- Topic 1: Key words: russia, war, russian, u.s, military, government, syria, president, media, security. Topics related to Russia, war, military, and international conflicts such as Syria.
- Topic 2: Key words: people, government, obama, world, american, de, america, money, country, time. A general social topic, about people, government, President Obama, the world, and the economy.
- Topic 3: Key words: trump, clinton, hillary, president, donald, people, election, campaign, obama, news. Political topics, related to Trump, Clinton, election campaigns, and President Obama.
- Topic 4: Key words: police, people, black, time, water, children, life, school, told, day. A topic related to social issues, with content related to police, racial issues, water, education, and children.

In both datasets, there are many topics related to politics, especially those related to President Trump. In True News, topics related to international politics and diplomacy (Russia, China, etc.) are dominant, while in Fake News, topics related to international conflicts and social issues (Russia, Syria, racial issues, etc.) are dominant. While True News topics are primarily political and government-related, Fake News includes a wide range of social issues.
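LDA does not read raw text. It takes a document-term matrix, in which each row is a document, each column is a term, and each cell is a count. The base-R sketch below builds such a matrix from made-up tokens so the input format is visible. The data frame `toy_tokens` is hypothetical, and the report itself uses `cast_dtm()`, which produces a sparse version of the same structure.

```r
# Hypothetical tokenised data: one row per (document, word) occurrence
toy_tokens <- data.frame(
  doc  = c(1, 1, 1, 2, 2, 3, 3, 3),
  word = c("trump", "election", "trump", "russia", "war",
           "police", "school", "police"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are documents, columns are terms, cells are counts
dtm <- table(doc = toy_tokens$doc, term = toy_tokens$word)
dtm

# Counts for document 1 only
dtm["1", ]
```

LDA then tries to explain these counts as mixtures of k topics, where each topic is a probability distribution over the columns. The beta values plotted in this section are those per-topic word probabilities.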

fake_dtm <- df_fake %>%
  count(X, word, sort = TRUE) %>%
  cast_dtm(X, word, n)

true_dtm <- df_true %>%
  count(X, word, sort = TRUE) %>%
  cast_dtm(X, word, n)

fake_lda <- LDA(fake_dtm, k = 4, control = list(seed = 1234))
true_lda <- LDA(true_dtm, k = 4, control = list(seed = 1234))


fake_topics <- tidy(fake_lda, matrix = "beta")
true_topics <- tidy(true_lda, matrix = "beta")

fake_top_terms <- fake_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

true_top_terms <- true_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Plot for fake news
fake_LDA <- fake_top_terms %>%
  ggplot(aes(x = reorder_within(term, beta, topic), y = beta, fill = as.factor(topic))) +
  geom_col() +
  coord_flip() +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Set3") +
  facet_wrap(~ topic, scales = "free") +
  labs(title = "Top Terms in Fake News Topics", x = "Terms", y = "Beta") +
  theme_minimal()

# Plot for true news
true_LDA <- true_top_terms %>%
  ggplot(aes(x = reorder_within(term, beta, topic), y = beta, fill = as.factor(topic))) +
  geom_col() +
  coord_flip() +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Set3") +
  facet_wrap(~ topic, scales = "free") +
  labs(title = "Top Terms in True News Topics", x = "Terms", y = "Beta") +
  theme_minimal()

# Display plots
print(fake_LDA)

print(true_LDA)

ggsave(filename = "images/fake_LDA_plot.jpeg", plot = fake_LDA, device = "jpeg")

## Saving 7 x 5 in image

ggsave(filename = "images/true_LDA_plot.jpeg", plot = true_LDA, device = "jpeg")

## Saving 7 x 5 in image

Conclusion

No significant differences were observed in the frequency or sentiment of words. However, the TF-IDF analysis and the thematic analysis did reveal differences between the two. The data indicates that the content of fake news is more thematically provocative and that its vocabulary is more cohesive. Before starting this research, I anticipated that surface-level textual features would be more pronounced. However, I found fewer superficial differences between fake news and real news than I had expected. A further objective is to build a more detailed model for Korean articles, if possible.

Comparing textual data from fake and real news

Lee-Sungbae

2024-06-18